Introduction to DeepEval
DeepEval is an open-source evaluation framework for LLM applications. It makes it easy to build and iterate on LLM apps, and was built with the following principles in mind:
- Unit test LLM outputs with Pytest-style assertions.
- Use 50+ ready-to-use metrics, including LLM-as-a-judge, agent, tool-use, conversational, safety, RAG, and multimodal metrics.
- Evaluate AI agents, conversational agents (chatbots), RAG pipelines, MCP systems, and other custom workflows.
- Run both end-to-end evals and component-level evals with tracing.
- Generate synthetic datasets for edge cases that are hard to collect manually.
- Customize metrics, prompts, models, and evaluation templates when built-in behavior is not enough.
DeepEval is local-first: your evaluations run in your own environment. When your team needs shared dashboards, regression tracking, observability, or production monitoring, DeepEval integrates natively with Confident AI.
Why DeepEval Exists
LLM applications are hard to test with traditional assertions. The same input can produce multiple acceptable outputs, failures are often semantic, and quality can depend on tool calls, multi-step reasoning, state, retrieved context, or conversation history.
DeepEval exists to make those behaviors testable without forcing every team into one workflow. You can evaluate a saved test case, a dataset of expected outputs, an entire trace, or an individual span inside a trace.
Choose Your Path
If you already know what you're building, start with a system-specific quickstart:
5 min Quickstart
Install DeepEval, create your first test case, run it with `deepeval test run`, and inspect the results.
AI Agents
Set up tracing, evaluate end-to-end task completion, and score individual agent components.
Chatbots
Evaluate multi-turn conversations, turns, and simulated user interactions.
RAG
Evaluate RAG quality end-to-end, then test retrieval and generation separately.
*All quickstarts end with a guide on bringing your evals to production.
The Core Building Blocks
These concepts show up throughout DeepEval:
Test Cases
A single behavior you want to evaluate: task input, agent output, expected behavior, tools, context, and metadata.
Datasets
Collections of goldens that make evals repeatable across prompts, models, and releases.
Metrics
The scoring logic that determines whether an agent response, trace, span, or output satisfies your criteria.
Traces
Runtime records of your agent's steps, spans, inputs, outputs, tool calls, and component behavior.
Two Modes of Evals
DeepEval supports two complementary ways to evaluate your application.
End-to-End LLM Evals
Best for raw LLM APIs, simple apps, chatbots, and black-box quality checks.
Treat your LLM app as a black box. Provide inputs, outputs, expected behavior, and metrics, then use DeepEval to detect quality regressions.
Component-Level LLM Evals
Best for agents, tool-using workflows, MCP systems, and complex multi-step applications.
Trace your app and evaluate individual spans, tools, planners, retrievers, generators, or other internal components.
You can use either mode independently, or combine them: score the whole trace for overall task quality, then score individual spans to find where failures happen.
DeepEval Ecosystem
DeepEval can run by itself, but it also connects to adjacent tools when your workflow needs collaboration, monitoring, or security testing.
Confident AI
An AI quality platform for shared eval dashboards, regression analysis, observability, and monitoring.
DeepTeam
A safety and security testing framework for red-teaming LLM applications against vulnerabilities.