Unit testing for LLMs.

Pytest-native evals that run in CI/CD or as Python scripts. Iterate locally, on your own environment, on your own criteria.

tests/test_agent.py
import pytest
from deepeval.metrics import TaskCompletenessMetric
from deepeval.test_case import LLMTestCase
from deepeval import assert_test

test_cases = [...]  # your goldens or hand-written test cases

@pytest.mark.parametrize("test_case", test_cases)
def test_agent(test_case: LLMTestCase):
    my_ai_agent(test_case.input)  # Captures full execution trace
    assert_test(test_case, metrics=[TaskCompletenessMetric()])  # Assert on custom criteria
scripts/run_deepeval.sh
$ deepeval test run tests/test_agent.py


LLM-as-a-Judge you can count on.

Research-backed metrics with transparent, explainable scores — every judgment comes with reasoning you can trust, debug, and defend.

50+ research-backed metrics

Hallucination, faithfulness, answer relevancy, summarization, toxicity, bias, and more — ready out of the box.

Native conversational evals

Role adherence, knowledge retention, and conversation completeness — dedicated metrics built for multi-turn from day one.

Multi-modal by default

Text, images, and audio — all first-class. Same test case, same runner, same metrics across every modality.



Flexible, SOTA evaluation techniques.

Compose state-of-the-art techniques into metrics that fit your product — plain-English criteria, decision graphs, weighted scoring, and more, all in the same runner.

G-Eval

Criteria-based, chain-of-thought scoring via form-filling for reliable subjective evals.
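As a rough illustration (a plain-Python sketch, not deepeval's implementation), G-Eval can be thought of as an LLM judge filling a scoring form per criterion via chain-of-thought, with the filled scores aggregated into one normalized metric score; the judge itself is stubbed out here:

```python
# Plain-Python sketch, NOT deepeval's implementation: the LLM judge
# (which would fill this form via chain-of-thought) is stubbed out.
def g_eval_score(form_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate per-criterion 1-5 judge scores into a single 0-1 metric score."""
    total_weight = sum(weights.values())
    weighted_sum = sum(form_scores[c] * w for c, w in weights.items())
    return (weighted_sum / total_weight - 1) / 4  # map the 1-5 scale onto 0-1

# Suppose the judge filled the form with coherence=4, coverage=5, equal weights:
g_eval_score({"coherence": 4, "coverage": 5}, {"coherence": 1, "coverage": 1})  # 0.875
```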

DAG

Directed-acyclic-graph metrics for objective, multi-step conditional scoring.
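The idea can be sketched as a hypothetical toy (not deepeval's DAG API): a test case's output is routed through conditional judgment nodes until it reaches a leaf that assigns a deterministic score.

```python
# Hypothetical toy, not deepeval's DAG API: route an output through
# conditional judgment nodes; leaves assign deterministic scores.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    check: Optional[Callable[[str], bool]] = None  # None marks a leaf
    score: float = 0.0
    on_true: Optional["Node"] = None
    on_false: Optional["Node"] = None

def run_dag(node: Node, output: str) -> float:
    while node.check is not None:
        node = node.on_true if node.check(output) else node.on_false
    return node.score

# Example graph: the output must mention order_id; if it does, JSON-like
# formatting upgrades the score from partial to perfect.
graph = Node(
    check=lambda o: "order_id" in o,
    on_true=Node(
        check=lambda o: o.strip().startswith("{"),
        on_true=Node(score=1.0),
        on_false=Node(score=0.5),
    ),
    on_false=Node(score=0.0),
)

run_dag(graph, '{"order_id": 7}')  # 1.0: both checks pass
```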

QAG

Question-Answer Generation for close-ended, reference-grounded scoring.
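A minimal sketch of the scoring step only (hedged; real QAG uses an LLM to generate the close-ended questions and answer them from each text): the score is the fraction of questions whose answer from the candidate output agrees with the reference.

```python
# Hedged sketch of QAG's scoring step only; generating and answering
# the close-ended questions is LLM-driven in practice.
def qag_score(reference_answers: list[bool], candidate_answers: list[bool]) -> float:
    """Fraction of questions where the candidate agrees with the reference."""
    if not reference_answers:
        return 0.0
    matches = sum(r == c for r, c in zip(reference_answers, candidate_answers))
    return matches / len(reference_answers)

qag_score([True, True, False, True], [True, False, False, True])  # 0.75: 3 of 4 agree
```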



Trace, grade, and iterate — without leaving your editor.

DeepEval traces every step of your agent into something you can grade and improve — visible in your terminal, testable in your runner, shippable in your next commit. No dashboards to open. No context switch required.
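As a toy illustration of the idea (not deepeval's tracing API), a decorator can record every agent step into a trace that is graded after the run:

```python
# Toy illustration, not deepeval's tracing API: record every decorated
# agent step into a trace that can be graded after the run.
import functools

TRACE: list[dict] = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append({"step": fn.__name__, "args": args, "output": result})
        return result
    return wrapper

@traced
def retrieve(query: str) -> str:  # hypothetical retrieval step
    return f"docs for {query}"

@traced
def answer(query: str) -> str:  # hypothetical generation step
    return f"answer using {retrieve(query)}"

answer("refunds")  # TRACE now holds the retrieve step, then the answer step
```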



No dataset? No problem.

Generate synthetic goldens from your knowledge base, or simulate full conversations across user personas — all before a single real user shows up.

  1. Chunking
  2. Extracting context
  3. Generating
  4. Evolving
  5. Filtering
  6. Applying styles
  7. Done
GOLDENS

g_01 · standard
Q: How do I refund an order?
A: Call POST /refunds with order_id and amount.

g_02 · variation
Q: Can I partially refund a line item?
A: Yes — include line_item_ids in the POST /refunds body.

g_03 · edge case
Q: If the order already shipped, can I still refund without returning it?
A: Shipped orders follow the return flow — call POST /returns first.

g_04 · adversarial
Q: Refund WITHOUT order_id pls!!!!
A: order_id is required. Politely ask the user to share it.

CONVERSATION SIMULATION
  1. Pondering scenario
  2. Analyzing user profile
  3. Simulating user response
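The golden-generation pipeline above can be sketched as a toy (hedged; in deepeval's Synthesizer the generation and evolution steps are LLM-driven, stubbed here as f-strings):

```python
# Toy sketch of the golden-generation pipeline; generation/evolution
# would be LLM calls in practice, stubbed here as f-strings.
def chunk(doc: str, size: int = 80) -> list[str]:
    """1. Chunking: split the knowledge base into fixed-size pieces."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def generate_goldens(doc: str) -> list[dict]:
    goldens = []
    for i, context in enumerate(chunk(doc)):
        # 2-3. Extract context and generate an input/expected-output pair
        goldens.append({
            "id": f"g_{i + 1:02d}",
            "context": context,
            "input": f"Question grounded in: {context[:30]}",    # LLM call in practice
            "expected_output": f"Answer grounded in: {context[:30]}",
        })
    # 5. Filtering: drop goldens whose context is too thin to ground an answer
    return [g for g in goldens if len(g["context"]) >= 20]
```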


Used by agents, loved by vibe-coders.

Coming soon on May 15th...



Evaluate in code, scale with platform.

DeepEval integrates natively with Confident AI, an observability and evaluation platform for AI quality. Think of it as the Vercel for DeepEval. The same test file you run on your laptop now powers engineering, product, QA, and domain experts.

Explore enterprise

Confident AI regression testing dashboard
Confident AI experimentation view
Confident AI tracing and observability
Confident AI production monitoring
Confident AI dataset management
Confident AI prompt versioning
Confident AI human annotation


Any model. Any framework. Any pipeline.

Plug DeepEval into the tools you already ship with — evaluate across any LLM, any agent framework, and any CI/CD runner without rewriting a line.



Built by amazing humans.

Nothing would be possible without our community of 250+ contributors. Thank you!



Ah yes, FAQs.



This is the CTA :)

Start Evaluating