Introduction to DeepEval

DeepEval is an open-source framework for evaluating LLM applications. It makes it extremely easy to build and iterate on LLM evals and was built with the following principles in mind:

  • Unit test LLM outputs with Pytest-style assertions.
  • Use 50+ ready-to-use metrics, including LLM-as-a-judge, agent, tool-use, conversational, safety, RAG, and multimodal metrics.
  • Evaluate AI agents, conversational agents (chatbots), RAG pipelines, MCP systems, and other custom workflows.
  • Run both end-to-end evals and component-level evals with tracing.
  • Generate synthetic datasets for edge cases that are hard to collect manually.
  • Customize metrics, prompts, models, and evaluation templates when built-in behavior is not enough.
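The Pytest-style idea above can be sketched in plain Python. The scoring function below is a hypothetical keyword-overlap stand-in, not DeepEval's actual LLM-as-a-judge metrics; only the shape of the workflow (score a test case, assert against a threshold) matches the framework:

```python
# Sketch of a Pytest-style LLM assertion. The scoring function is a toy
# keyword-overlap metric standing in for DeepEval's real LLM-as-a-judge
# metrics; only the shape of the workflow matches the framework.

def relevancy_score(question: str, answer: str) -> float:
    """Toy metric: fraction of question keywords echoed in the answer."""
    keywords = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    answer_words = {w.lower().strip("?.,") for w in answer.split()}
    return len(keywords & answer_words) / len(keywords) if keywords else 0.0

def assert_llm_output(question: str, answer: str, threshold: float = 0.5) -> None:
    """Fail the test (like a Pytest assertion) when the score is too low."""
    score = relevancy_score(question, answer)
    assert score >= threshold, f"score {score:.2f} below threshold {threshold}"

# A passing case: the answer covers the question's key terms.
assert_llm_output(
    "What payment methods does the store accept?",
    "The store accepts all major payment methods, including cards.",
)
```

In DeepEval itself the threshold lives on the metric object and the assertion runs under Pytest, but the pass/fail contract is the same: a score below threshold fails the test.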

DeepEval is local-first: your evaluations run in your own environment. When your team needs shared dashboards, regression tracking, observability, or production monitoring, DeepEval integrates natively with Confident AI.

Why DeepEval Exists

LLM applications are hard to test with traditional assertions. The same input can produce multiple acceptable outputs, failures are often semantic, and quality can depend on tool calls, multi-step reasoning, state, retrieved context, or conversation history.
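To make the "multiple acceptable outputs" point concrete: an exact-match assertion rejects a perfectly good paraphrase, while a tolerance-based check accepts both. The word-overlap function below is a deliberately crude stand-in for real semantic scoring:

```python
# Two different phrasings of the same correct answer. Exact-match testing
# treats one as a failure; a tolerance-based check (here, crude word
# overlap standing in for semantic similarity) accepts both.

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two strings."""
    wa = {w.lower().strip(".,") for w in a.split()}
    wb = {w.lower().strip(".,") for w in b.split()}
    return len(wa & wb) / len(wa | wb)

expected = "You can return items within 30 days of purchase."
output_a = "You can return items within 30 days of purchase."
output_b = "Items can be returned within 30 days of when you purchased them."

# Exact match: only the verbatim answer passes.
assert output_a == expected
assert output_b != expected

# Overlap-based check: both acceptable answers clear the same threshold.
assert word_overlap(output_a, expected) == 1.0
assert word_overlap(output_b, expected) > 0.4
```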

DeepEval exists to make those behaviors testable without forcing every team into one workflow. You can evaluate a saved test case, a dataset of expected outputs, an entire trace, or an individual span inside a trace.

Choose Your Path

If you already know what you're building, start with a system-specific quickstart:

*Each quickstart ends with a guide on bringing your evals to production.

The Core Building Blocks

These concepts show up throughout DeepEval:

Two Modes of Evals

DeepEval supports two complementary modes of evaluation: end-to-end evals, which score your application's final output, and component-level evals, which use tracing to score the individual steps (spans) inside it.

You can use either mode independently, or combine them: score the whole trace for overall task quality, then score individual spans to find where failures happen.
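A minimal sketch of combining the two modes, with a trace modeled as a plain list of span dicts carrying precomputed toy scores (DeepEval's real tracing API differs):

```python
# Sketch: score a whole trace end-to-end, then score each span to locate
# the failing component. The trace is modeled as a plain list of dicts
# with precomputed toy scores; DeepEval's actual tracing API differs.

THRESHOLD = 0.7

trace = [
    {"span": "retriever", "score": 0.9},   # retrieved context looked relevant
    {"span": "reranker",  "score": 0.35},  # reranker demoted the right chunk
    {"span": "generator", "score": 0.6},   # answer suffered downstream
]

def end_to_end_score(trace: list[dict]) -> float:
    """Toy overall score: the final span's score stands in for task quality."""
    return trace[-1]["score"]

def failing_spans(trace: list[dict], threshold: float) -> list[str]:
    """Component-level pass: name every span that falls below threshold."""
    return [s["span"] for s in trace if s["score"] < threshold]

# End-to-end eval says the overall task failed...
assert end_to_end_score(trace) < THRESHOLD
# ...and component-level evals point at where.
print(failing_spans(trace, THRESHOLD))  # ['reranker', 'generator']
```

The design point carries over directly: the trace-level score tells you *whether* the task succeeded, and the span-level scores tell you *which* component to fix.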

DeepEval Ecosystem

DeepEval can run by itself, but it also connects to adjacent tools when your workflow needs collaboration, monitoring, or security testing.

FAQs