Unit testing for LLMs.

Pytest-native evals that run in CI/CD or as Python scripts. Iterate locally, on your own environment, on your own criteria.

tests/test_agent.py
import pytest
from deepeval.metrics import TaskCompletenessMetric
from deepeval.test_case import LLMTestCase
from deepeval import assert_test

test_cases = [...]  # your goldens or hand-written test cases

@pytest.mark.parametrize("test_case", test_cases)
def test_agent(test_case: LLMTestCase):
    my_ai_agent(test_case.input)  # Captures full execution trace
    assert_test(test_case, metrics=[TaskCompletenessMetric()])  # Assert on custom criteria
scripts/run_deepeval.sh
$ deepeval test run tests/test_agent.py


LLM-as-a-Judge you can count on.

Research-backed metrics with transparent, explainable scores — every judgment comes with reasoning you can trust, debug, and defend.

50+ research-backed metrics

Hallucination, faithfulness, answer relevancy, summarization, toxicity, bias, and more — ready out of the box.

Native conversational evals

Role adherence, knowledge retention, and conversation completeness — dedicated metrics built for multi-turn from day one.

Multi-modal by default

Text, images, and audio — all first-class. Same test case, same runner, same metrics across every modality.



Flexible, SOTA evaluation techniques.

Compose state-of-the-art techniques into metrics that fit your product — plain-English criteria, decision graphs, weighted scoring, and more, all in the same runner.

G-Eval

Criteria-based, chain-of-thought scoring via form-filling for reliable subjective evals.
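As a rough illustration (a plain-Python sketch, not deepeval's implementation), G-Eval can be thought of as an LLM judge filling a scoring form per criterion via chain-of-thought, with the filled scores aggregated into one normalized metric score; the judge itself is stubbed out here:

```python
# Plain-Python sketch, NOT deepeval's implementation: the LLM judge
# (which would fill this form via chain-of-thought) is stubbed out.
def g_eval_score(form_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate per-criterion 1-5 judge scores into a single 0-1 metric score."""
    total_weight = sum(weights.values())
    weighted_sum = sum(form_scores[c] * w for c, w in weights.items())
    return (weighted_sum / total_weight - 1) / 4  # map the 1-5 scale onto 0-1

# Suppose the judge filled the form with coherence=4, coverage=5, equal weights:
g_eval_score({"coherence": 4, "coverage": 5}, {"coherence": 1, "coverage": 1})  # 0.875
```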

DAG

Directed-acyclic-graph metrics for objective, multi-step conditional scoring.
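The idea can be sketched as a hypothetical toy (not deepeval's DAG API): a test case's output is routed through conditional judgment nodes until it reaches a leaf that assigns a deterministic score.

```python
# Hypothetical toy, not deepeval's DAG API: route an output through
# conditional judgment nodes; leaves assign deterministic scores.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    check: Optional[Callable[[str], bool]] = None  # None marks a leaf
    score: float = 0.0
    on_true: Optional["Node"] = None
    on_false: Optional["Node"] = None

def run_dag(node: Node, output: str) -> float:
    while node.check is not None:
        node = node.on_true if node.check(output) else node.on_false
    return node.score

# Example graph: the output must mention order_id; if it does, JSON-like
# formatting upgrades the score from partial to perfect.
graph = Node(
    check=lambda o: "order_id" in o,
    on_true=Node(
        check=lambda o: o.strip().startswith("{"),
        on_true=Node(score=1.0),
        on_false=Node(score=0.5),
    ),
    on_false=Node(score=0.0),
)

run_dag(graph, '{"order_id": 7}')  # 1.0: both checks pass
```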

QAG

Question-Answer Generation for close-ended, reference-grounded scoring.
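A minimal sketch of the scoring step only (hedged; real QAG uses an LLM to generate the close-ended questions and answer them from each text): the score is the fraction of questions whose answer from the candidate output agrees with the reference.

```python
# Hedged sketch of QAG's scoring step only; generating and answering
# the close-ended questions is LLM-driven in practice.
def qag_score(reference_answers: list[bool], candidate_answers: list[bool]) -> float:
    """Fraction of questions where the candidate agrees with the reference."""
    if not reference_answers:
        return 0.0
    matches = sum(r == c for r, c in zip(reference_answers, candidate_answers))
    return matches / len(reference_answers)

qag_score([True, True, False, True], [True, False, False, True])  # 0.75: 3 of 4 agree
```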



Trace, grade, and iterate — without leaving your editor.

DeepEval traces every step of your agent into something you can grade and improve — visible in your terminal, testable in your runner, shippable in your next commit. No dashboards to open. No context switch required.
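As a toy illustration of the idea (not deepeval's tracing API), a decorator can record every agent step into a trace that is graded after the run:

```python
# Toy illustration, not deepeval's tracing API: record every decorated
# agent step into a trace that can be graded after the run.
import functools

TRACE: list[dict] = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append({"step": fn.__name__, "args": args, "output": result})
        return result
    return wrapper

@traced
def retrieve(query: str) -> str:  # hypothetical retrieval step
    return f"docs for {query}"

@traced
def answer(query: str) -> str:  # hypothetical generation step
    return f"answer using {retrieve(query)}"

answer("refunds")  # TRACE now holds the retrieve step, then the answer step
```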



No dataset? No problem.

Generate synthetic goldens from your knowledge base, or simulate full conversations across user personas — all before a single real user shows up.

  1. Chunking
  2. Extracting context
  3. Generating
  4. Evolving
  5. Filtering
  6. Applying styles
  7. Done
GOLDENS

g_01 · standard
Q: How do I refund an order?
A: Call POST /refunds with order_id and amount.

g_02 · variation
Q: Can I partially refund a line item?
A: Yes — include line_item_ids in the POST /refunds body.

g_03 · edge case
Q: If the order already shipped, can I still refund without returning it?
A: Shipped orders follow the return flow — call POST /returns first.

g_04 · adversarial
Q: Refund WITHOUT order_id pls!!!!
A: order_id is required. Politely ask the user to share it.

CONVERSATION SIMULATION
  1. Pondering scenario
  2. Analyzing user profile
  3. Simulating user response
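The golden-generation pipeline above can be sketched as a toy (hedged; in deepeval's Synthesizer the generation and evolution steps are LLM-driven, stubbed here as f-strings):

```python
# Toy sketch of the golden-generation pipeline; generation/evolution
# would be LLM calls in practice, stubbed here as f-strings.
def chunk(doc: str, size: int = 80) -> list[str]:
    """1. Chunking: split the knowledge base into fixed-size pieces."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def generate_goldens(doc: str) -> list[dict]:
    goldens = []
    for i, context in enumerate(chunk(doc)):
        # 2-3. Extract context and generate an input/expected-output pair
        goldens.append({
            "id": f"g_{i + 1:02d}",
            "context": context,
            "input": f"Question grounded in: {context[:30]}",    # LLM call in practice
            "expected_output": f"Answer grounded in: {context[:30]}",
        })
    # 5. Filtering: drop goldens whose context is too thin to ground an answer
    return [g for g in goldens if len(g["context"]) >= 20]
```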


Used by agents, loved by vibe-coders.

Coming soon on May 15th...



Evaluate in code, scale with platform.

DeepEval integrates natively with Confident AI, an observability and evaluation platform for AI quality. Think of it as the Vercel for DeepEval. The same test file you run on your laptop now powers engineering, product, QA, and domain experts.

Explore enterprise

Confident AI regression testing dashboard
Confident AI experimentation view
Confident AI tracing and observability
Confident AI production monitoring
Confident AI dataset management
Confident AI prompt versioning
Confident AI human annotation


Any model. Any framework. Any pipeline.

Plug DeepEval into the tools you already ship with — evaluate across any LLM, any agent framework, and any CI/CD runner without rewriting a line.



Built by amazing humans.

Nothing would be possible without our community of 250+ contributors. Thank you!



Ah yes, FAQs.



This is the CTA :)

Start Evaluating