Introduction to DeepEval
DeepEval is an open-source evaluation framework for LLM applications. It makes it easy to build and iterate on LLM applications, and was built to let you:
- Unit test LLM outputs with Pytest-style assertions.
- Use 50+ ready-to-use metrics, including LLM-as-a-judge, agent, tool-use, conversational, safety, RAG, and multimodal metrics.
- Evaluate AI agents, conversational agents (chatbots), RAG pipelines, MCP systems, and other custom workflows.
- Run both end-to-end evals and component-level evals with tracing.
- Generate synthetic datasets for edge cases that are hard to collect manually.
- Customize metrics, prompts, models, and evaluation templates when built-in behavior is not enough.
DeepEval is local-first: your evaluations run in your own environment. When your team needs shared dashboards, regression tracking, observability, or production monitoring, DeepEval integrates natively with Confident AI.
Who is DeepEval For?
DeepEval was designed for a technical audience. The main personas it serves well are:
- AI engineers who need to evaluate agents, RAG pipelines, tool calls, and production LLM workflows, write unit tests for AI behavior, and use evals in agentic coding tools like Claude Code and Codex.
- Data scientists who want repeatable experiments for comparing prompts, models, datasets, and metric scores.
- QA engineers who need reliable regression tests for AI behavior before changes reach users.
- Tech-savvy PMs who want to define quality criteria, inspect failures, and track whether product changes improve AI outputs.
Using DeepEval for Coding Agents
Beyond building evaluation suites and pipelines, DeepEval's CLI evaluation capabilities make it well suited as an eval harness for vibe-coding agents such as Claude Code, Codex, and Cursor.
The diagram below explains how DeepEval can take part in your iteration cycles, not just as a final validation check.
[Diagram: Coding Agent (Cursor · Claude Code · Codex) → Your AI App (Agent · RAG · Chatbot) → deepeval test run (50+ metrics, one CLI) → Scored Trace (span-level scores + reasons)]
Choose Your Path
We highly recommend starting with either of these two quickstarts:
5-min Human Quickstart
Install DeepEval, create your first test case, run it with deepeval test run, and inspect the results — by hand.
5-min Vibe Coder Quickstart
Install the Skill in Cursor / Claude Code / Codex and have your coding agent build the test suite, run evals, and iterate for you.
Start with a Use Case in Mind
Alternatively, if you already have a concrete use case, try one of our use-case-specific quickstarts:
AI Agents
Set up tracing, evaluate end-to-end task completion, and score individual agent components.
Chatbots
Evaluate multi-turn conversations, individual turns, and simulated user interactions.
RAG
Evaluate RAG quality end-to-end, then test retrieval and generation separately.
More Resources
The Core Building Blocks
These concepts show up throughout DeepEval, so learning these fundamentals is essential:
Test Cases
A single behavior you want to evaluate: task input, agent output, expected behavior, tools, context, and metadata.
Datasets
Collections of goldens that make evals repeatable across prompts, models, and releases.
Metrics
The scoring logic that determines whether an agent response, trace, span, or output satisfies your criteria.
Traces
Runtime records of your agent's steps, spans, inputs, outputs, tool calls, and component behavior.
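To make the relationship between goldens, test cases, and metrics concrete, here is a minimal plain-Python sketch. It deliberately does not use DeepEval's API: the names Golden, EvalCase, KeywordOverlapMetric, fake_app, and run_dataset are hypothetical, and the word-overlap score is a toy stand-in for a real metric such as an LLM-as-a-judge. It only illustrates the flow: a dataset of goldens is turned into test cases by running your app, and a metric scores each case against a threshold.

```python
from dataclasses import dataclass


@dataclass
class Golden:
    """Dataset entry (hypothetical shape): an input plus the expected behavior."""
    input: str
    expected_output: str


@dataclass
class EvalCase:
    """One evaluated behavior: a golden's input plus what the app actually produced."""
    input: str
    actual_output: str
    expected_output: str


class KeywordOverlapMetric:
    """Toy metric: fraction of expected words that appear in the actual output."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, case: EvalCase) -> float:
        expected = set(case.expected_output.lower().split())
        actual = set(case.actual_output.lower().split())
        if not expected:
            return 1.0
        return len(expected & actual) / len(expected)

    def is_successful(self, case: EvalCase) -> bool:
        # A case passes when its score reaches the metric's threshold.
        return self.measure(case) >= self.threshold


def fake_app(prompt: str) -> str:
    """Stand-in for your LLM application (assumption: returns canned answers)."""
    answers = {"What is the capital of France?": "The capital of France is Paris"}
    return answers.get(prompt, "I do not know")


def run_dataset(goldens, metric):
    """Build a test case per golden at evaluation time and score it."""
    results = []
    for golden in goldens:
        case = EvalCase(golden.input, fake_app(golden.input), golden.expected_output)
        results.append((case.input, metric.measure(case), metric.is_successful(case)))
    return results


dataset = [
    Golden("What is the capital of France?", "Paris"),
    Golden("What is the capital of Japan?", "Tokyo"),
]
results = run_dataset(dataset, KeywordOverlapMetric(threshold=0.5))
for question, score, passed in results:
    print(f"{question!r}: score={score:.2f} passed={passed}")
```

Because the goldens are stored separately from the app's outputs, the same dataset can be re-run after a prompt or model change, which is what makes the evaluation repeatable.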
Two Modes of Evals
DeepEval supports two complementary ways to evaluate your application. It's important to know which one(s) suit you:
End-to-End LLM Evals
Best for raw LLM APIs, simple apps, chatbots, and black-box quality checks.
Treat your LLM app as a black box. Provide inputs, outputs, expected behavior, and metrics, then use DeepEval to detect quality regressions.
Component-Level LLM Evals
Best for agents, tool-using workflows, MCP systems, and complex multi-step applications.
Trace your app and evaluate individual spans, tools, planners, retrievers, generators, or other internal components.
You can use either mode independently, or combine them: score the whole trace for overall task quality, then score individual spans to find where failures happen.
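The combination of the two modes can be sketched in plain Python. This is an illustration under stated assumptions, not DeepEval's tracing API: Span, Trace, end_to_end_passed, and failing_spans are hypothetical names, and the per-span scores are hard-coded placeholders for values a real metric would compute. The end-to-end check treats the app as a black box, while the span-level check points at the component responsible for the failure.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One internal step of the app (hypothetical shape)."""
    name: str     # e.g. "retriever" or "generator"
    output: str
    score: float  # placeholder component score in [0, 1]


@dataclass
class Trace:
    """Runtime record of one request: input, final output, and internal spans."""
    input: str
    final_output: str
    spans: list = field(default_factory=list)


def end_to_end_passed(trace: Trace, expected_substring: str) -> bool:
    # Black-box check: only the final output is inspected.
    return expected_substring.lower() in trace.final_output.lower()


def failing_spans(trace: Trace, threshold: float = 0.5) -> list:
    # Component-level check: list the spans whose score falls below the threshold.
    return [span.name for span in trace.spans if span.score < threshold]


trace = Trace(
    input="When was the Eiffel Tower completed?",
    final_output="It was completed in 1901.",
    spans=[
        Span("retriever", "document about the Louvre", score=0.2),
        Span("generator", "It was completed in 1901.", score=0.8),
    ],
)

overall_ok = end_to_end_passed(trace, "1889")
culprits = failing_spans(trace)
print(f"end-to-end passed: {overall_ok}; failing spans: {culprits}")
```

In this toy trace the end-to-end check only reports that the answer is wrong, while the span-level scores suggest the retriever fetched an irrelevant document and is the place to start debugging.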
DeepEval Ecosystem
DeepEval can run by itself, but it also connects to adjacent tools when your workflow needs collaboration, monitoring, or security testing.
Confident AI
An AI quality platform for shared eval dashboards, regression analysis, observability, and monitoring.
DeepTeam
A safety and security testing framework for red-teaming LLM applications against vulnerabilities.
Quick Shoutout To Our Community
DeepEval is shaped by the people who report bugs, propose ideas, review changes, improve docs, and ship code with us. Thank you for building this project with us.