Introduction to DeepEval
DeepEval is an open-source evaluation framework for LLM applications. It makes it easy to build and iterate on LLM apps, and was built with the following principles in mind:
- Unit test LLM outputs with Pytest-style assertions.
- Use 50+ ready-to-use metrics, including LLM-as-a-judge, agent, tool-use, conversational, safety, RAG, and multimodal metrics.
- Evaluate AI agents, conversational agents (chatbots), RAG pipelines, MCP systems, and other custom workflows.
- Run both end-to-end evals and component-level evals with tracing.
- Generate synthetic datasets for edge cases that are hard to collect manually.
- Customize metrics, prompts, models, and evaluation templates when built-in behavior is not enough.
DeepEval is local-first: your evaluations run in your own environment. When your team needs shared dashboards, regression tracking, observability, or production monitoring, DeepEval integrates natively with Confident AI.
Why DeepEval Exists
LLM applications are hard to test with traditional assertions. The same input can produce multiple acceptable outputs, failures are often semantic, and quality can depend on tool calls, multi-step reasoning, state, retrieved context, or conversation history.
DeepEval exists to make those behaviors testable without forcing every team into one workflow. You can evaluate a saved test case, a dataset of expected outputs, an entire trace, or an individual span inside a trace.
Choose Your Path
If you already know what you're building, start with a system-specific quickstart:
5 min Quickstart
Install DeepEval, create your first test case, run it with `deepeval test run`, and inspect the results.
AI Agents
Set up tracing, evaluate end-to-end task completion, and score individual agent components.
Chatbots
Evaluate multi-turn conversations, turns, and simulated user interactions.
RAG
Evaluate RAG quality end-to-end, then test retrieval and generation separately.
*All quickstarts end with a guide on bringing your evals to production.
The Core Building Blocks
These concepts show up throughout DeepEval:
Test Cases
A single behavior you want to evaluate: task input, agent output, expected behavior, tools, context, and metadata.
Datasets
Collections of goldens that make evals repeatable across prompts, models, and releases.
Metrics
The scoring logic that determines whether an agent response, trace, span, or output satisfies your criteria.
Traces
Runtime records of your agent's steps, spans, inputs, outputs, tool calls, and component behavior.
Two Modes of Evals
DeepEval supports two complementary ways to evaluate your application.
End-to-End LLM Evals
Best for raw LLM APIs, simple apps, chatbots, and black-box quality checks.
Treat your LLM app as a black box. Provide inputs, outputs, expected behavior, and metrics, then use DeepEval to detect quality regressions.
Component-Level LLM Evals
Best for agents, tool-using workflows, MCP systems, and complex multi-step applications.
Trace your app and evaluate individual spans, tools, planners, retrievers, generators, or other internal components.
You can use either mode independently, or combine them: score the whole trace for overall task quality, then score individual spans to find where failures happen.
DeepEval Ecosystem
DeepEval can run by itself, but it also connects to adjacent tools when your workflow needs collaboration, monitoring, or security testing.
Confident AI
An AI quality platform for shared eval dashboards, regression analysis, observability, and monitoring.
DeepTeam
A safety and security testing framework for red-teaming LLM applications against vulnerabilities.