Eval harness: What it is, how to use it, and why you should care

An eval harness is the infrastructure that runs your LLM evaluations end to end. Here's what it is, why it matters, and what a good one looks like.

First author: Jeffrey Ip
Community

To you and me, an agent harness is everything in 2026. For AI agents, the agent harness is everything in an agent that isn't the model: memory, tool calling, API functions, and most importantly, evals. In other words, it is the infrastructure around the model that makes an AI agent work.

By simple deduction, you might think that an eval harness is just the evaluation layer of an agent, correct? Well, not quite. That is what I am here to talk about today.

What is an Eval Harness?

An evaluation harness is the validation layer for AI agents. Notice that it is for AI agents, not just the model. This might sound contradictory: didn't we just say that an eval harness is part of the agent harness, and that an agent harness wraps around the model?

To clarify: the agent harness isn't a single wrapper around the model. It's multiple layers handling tool orchestration, state persistence, error recovery, validation loops, and safety enforcement across an agent's lifecycle.

These layers fall into three tiers:

  • Runtime: the core loop that keeps the model running, including prompt construction, output parsing, and error handling.
  • Capabilities: what the agent can actually do, such as tools, memory, state, and context management.
  • Assurance: the outermost guardrails, including subagent orchestration and validation loops.

Validation loops, in the assurance layer, are where the eval harness lives.

The Eval Harness is not what it seems

Before we continue, there is one thing I absolutely must clarify. An eval (same dataset, same metrics) can appear in two places:

  • Offline / dev-time: runs against a fixed dataset, not on live traffic. This is the DeepEval loop and the CI gate. It's not in the runtime path of the agent serving a user. Nothing a real user sends touches it. This is genuinely "eval" in the pure sense: measuring behavior against ground truth, out of band.
  • Online / runtime: runs on the actual request/response as the agent is serving it, and can act on the result: block the output, retry, escalate, or fall back. The moment it's in the live path and gates the response, that's a guardrail, not an eval. The scoring mechanism might be identical (an LLM judge, a faithfulness check), but its role has changed. It's no longer measuring for your benefit; it's intercepting for the user's safety.

In practice the runtime version (the guardrail) is becoming a commodity: intercepting a bad output and retrying or falling back is increasingly off-the-shelf, and it really belongs to the error-handling part of the runtime layer, not here.

So now that we've determined that an eval harness is strictly for offline/dev-time validation, let's talk about how it works.
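To make the distinction concrete, here is a minimal sketch of one scorer playing both roles. Everything in it is hypothetical: `toy_scorer` is a trivial stand-in for an LLM judge, and `offline_eval` and `guarded_response` are illustrative names, not part of any library.

```python
from typing import Callable, List

# Hypothetical type: a scorer maps an output to a score in [0, 1].
Scorer = Callable[[str], float]


def toy_scorer(output: str) -> float:
    """Stand-in for an LLM judge: penalizes outputs that admit to guessing."""
    return 0.0 if "i guess" in output.lower() else 1.0


def offline_eval(outputs: List[str], scorer: Scorer) -> float:
    """Eval: measure out of band and report a number. Nothing is blocked."""
    return sum(scorer(o) for o in outputs) / len(outputs)


def guarded_response(
    output: str,
    scorer: Scorer,
    threshold: float = 0.5,
    fallback: str = "Sorry, I can't answer that reliably.",
) -> str:
    """Guardrail: sits in the live path and decides what the user sees."""
    return output if scorer(output) >= threshold else fallback
```

The scorer is identical in both functions. What differs is what happens with the score: `offline_eval` hands you a number to reason about, while `guarded_response` changes the response itself.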

The Eval Harness is made up of metrics and datasets

At the center of the eval harness is the idea of a benchmark. Traditionally, academic benchmarks like MMLU and BIG-Bench Hard target foundation models. They were built so academic researchers could see how well their models perform, through popular projects such as EleutherAI's LM Evaluation Harness and Stanford HELM, not for AI engineers iterating on their AI agents.

But that doesn't mean the idea of benchmarking can't be applied to AI agents. When benchmarking an AI agent using an eval harness, we require two things:

  1. Metrics
  2. Datasets

Both are custom to your agentic use case. A single verification loop within an eval harness typically involves:

  1. Looping through your dataset of goldens
  2. For each golden within your dataset:
    • Invoke your agent
    • Collect the response, and execution trace if available
  3. Run your metric suite on all collected responses and traces

Once this completes, you have an initial set of scores for how your agent performs on a particular dataset. Run it across multiple agent versions, and you get a clear picture of which version performs best.
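The loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DeepEval's implementation: `SimpleGolden`, `contains_expected`, `run_eval`, and the two toy agents are all hypothetical names, and the metric is a simple substring check standing in for an LLM-as-a-judge metric.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SimpleGolden:
    """Hypothetical stand-in for a golden: an input plus its expected output."""
    input: str
    expected_output: str


Agent = Callable[[str], str]
Metric = Callable[[SimpleGolden, str], float]


def contains_expected(golden: SimpleGolden, output: str) -> float:
    """Toy metric: 1.0 if the expected answer appears in the output, else 0.0."""
    return 1.0 if golden.expected_output in output else 0.0


def run_eval(agent: Agent, goldens: List[SimpleGolden], metrics: Dict[str, Metric]) -> Dict[str, float]:
    # Steps 1 and 2: loop through the goldens, invoke the agent, collect responses.
    collected = [(golden, agent(golden.input)) for golden in goldens]
    # Step 3: run every metric over everything collected and average the scores.
    return {
        name: sum(metric(g, out) for g, out in collected) / len(collected)
        for name, metric in metrics.items()
    }


def agent_v1(prompt: str) -> str:
    """Toy agent that always gives the same answer."""
    return "The answer is 48."


def agent_v2(prompt: str) -> str:
    """Toy agent that actually multiplies the two numbers in the prompt."""
    a, b = (int(n) for n in re.findall(r"\d+", prompt))
    return f"The answer is {a * b}."


goldens = [
    SimpleGolden("What is 8 multiplied by 6?", "48"),
    SimpleGolden("What is 7 multiplied by 9?", "63"),
]
metrics = {"contains_expected": contains_expected}

scores_v1 = run_eval(agent_v1, goldens, metrics)
scores_v2 = run_eval(agent_v2, goldens, metrics)
```

Comparing `scores_v1` with `scores_v2` is the "run this across multiple agent versions" step: same dataset, same metrics, different agent.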

Setting up an eval harness

DeepEval is the eval harness for AI agents, so naturally it lets you generate datasets from knowledge bases or load existing ones, and provides 50+ ready-to-use LLM-as-a-judge metrics.

You can include it as a test file using DeepEval's native Pytest integration, which runs entirely in CI and blocks a release if tests start failing. The example below is an eval harness for a LangChain agent:

test_langchain_agent.py
import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it from the command line:

deepeval test run test_langchain_agent.py

You can jump ahead and get started here, but do first finish the article because it's about to get interesting.

The Eval Harness behaves differently based on who uses it

No, I don't literally mean who runs your eval harness. Not the engineering intern, your engineering manager, your manager's manager, or you.

I mean coding agents like Claude Code.

Understanding the Claude Code harness

Quick detour, because the next part borrows Claude Code's vocabulary and it's worth getting straight first. Anthropic has a good write-up on this if you want the full version.

Remember how we said an agent harness is everything around the model that makes it work? Claude Code has its own version of exactly that: the scaffolding Anthropic wraps around the model so it can navigate a codebase and write good code. Their framing is blunt about it: the harness matters as much as the model. A weaker model in a well-built harness beats a stronger model running naked.

That harness is built from a handful of named extension points, each loading at a different time:

| Component | What it is | When it loads |
| --- | --- | --- |
| CLAUDE.md | Context files Claude reads automatically: project conventions, gotchas | Every session |
| Hooks | Scripts that fire at set moments; enforce rules deterministically | Triggered by an event |
| Skills | Packaged instructions for a specific task type, loaded only when relevant | On demand |
| Plugins | Skills + hooks + MCP configs bundled into one installable package | Once installed, always available |
| MCP servers | Connections to external tools and data the model can't otherwise reach | Once configured, always available |

Here's a real example, from a single Claude Code session:

  • Session starts → CLAUDE.md loads. Claude automatically reads the root file for the big picture and walks down subdirectory files for local conventions, before you've even typed your task.
  • Start hook fires (optional). A start hook can inject team- or module-specific context dynamically, so the session is set up for the right part of the codebase without manual config.
  • You prompt → skills load on demand. Claude matches the task to a relevant skill and pulls in that specialized workflow only when needed: a security-review skill for a vuln check, a deploy skill scoped to the payments directory, etc.
  • Claude works → LSP + MCP servers do the reaching. As it navigates, LSP gives symbol-level precision ("go to definition," "find references") instead of text guessing, and MCP servers let it pull from internal tools, docs, or search it couldn't otherwise touch. Heavy exploration can be handed to a subagent that returns just the findings.
  • Action happens → hooks enforce. On events like a file write or commit, hooks run deterministic checks (lint, format, tests): rules that run whether or not Claude remembers them.
  • Session ends → stop hook reflects. A stop hook can review what happened and propose CLAUDE.md updates while context is fresh, making the setup self-improving for next time.

Notice what's NOT in that table: anything resembling an eval. The Claude Code harness is built to get the right context into the model and shape what it does. It has no native concept of measuring the output against ground truth.

This is a big deal because, although Claude Code's own harness is built on concepts like Skills and Hooks, there is no eval harness for the case where Claude Code is building AI agents rather than deterministic software.

For those vibe-coding AI agents, this should be moderately concerning.

Using DeepEval as Claude Code's (or any coding agent's) Eval Harness

For most of what Claude Code writes, that's fine: the output is deterministic. A test is green or red; a hook can check for it. But building an AI agent breaks that.

There's no assertEqual for "was this response faithful?" The output is non-deterministic, the one thing hooks and lint can't catch. Claude Code's harness can validate the software it writes, but not the agent it writes.
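Here is a small sketch of why equality checks break down. It swaps in a deliberately crude scorer, token overlap, as a stand-in for an LLM judge; `token_overlap`, `passes`, and the 0.8 threshold are all hypothetical choices for illustration, and a real faithfulness metric would be far more discerning.

```python
import re


def token_overlap(a: str, b: str) -> float:
    """Crude stand-in for a judge: Jaccard overlap between the two token sets."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def passes(output: str, reference: str, threshold: float = 0.8) -> bool:
    """Scored check: pass when the score clears a threshold, not on equality."""
    return token_overlap(output, reference) >= threshold


reference = "The capital of France is Paris."
paraphrase = "Paris is the capital of France."
wrong = "The capital of France is Lyon."

# A strict equality check rejects a perfectly good paraphrase...
equal_check = paraphrase == reference
# ...while a scored check accepts it and still rejects the wrong answer.
scored_paraphrase = passes(paraphrase, reference)
scored_wrong = passes(wrong, reference)
```

The point is the shape of the check, not the scorer: once outputs vary from run to run, you grade against a threshold instead of asserting an exact match.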

Let's take a look at how the regular eval harness DeepEval provides, described in the sections above, maps onto Claude Code's harness:

| Claude Code component | Its job in the CC harness | What DeepEval puts here | Eval or guardrail? |
| --- | --- | --- | --- |
| Skill | Loads specialized expertise on demand when the task calls for it | The DeepEval skill: templates, the 50+ metric catalog, and the iteration-loop guardrails the agent follows when you say "add evals and fix the failures" | Eval (offline, agent-driven) |
| Hook | Fires deterministically on an event, whether the agent remembers or not | deepeval test run wired to a stop or pre-commit hook, so green metrics become a gate rather than a suggestion | Eval (offline, enforced) |
| Plugin | Bundles skills + hooks + MCP configs to distribute a setup across the team | The skill + eval hook + dataset packaged once, installed org-wide so the loop isn't tribal | Eval (distributed) |

If this sounds complicated, it isn't. Skills load on demand, so all you really need to do to set up an eval harness is install the DeepEval skill:

npx skills add confident-ai/deepeval --skill "deepeval"

Under the hood, DeepEval chains together a series of CLI commands that make up your eval harness, which Claude Code's harness can take advantage of. These include deepeval generate, to generate datasets if you don't already have any, and deepeval test run, to run the eval suite with your preferred metrics.

Traces are also captured automatically on your machine in a local .json file. Instead of offloading your agent's execution traces somewhere else, Claude Code can run deepeval inspect to view traces directly, which helps it avoid overfitting to metrics.
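If you want the enforced version (deepeval test run wired to a hook), the script behind the hook could look roughly like the sketch below. Treat it as a sketch under assumptions: `eval_gate` and `run_command` are hypothetical names, the test file name is borrowed from the earlier example, and it assumes both that deepeval test run exits non-zero when a metric fails and that the hook treats exit code 2 as a blocking failure. Check the Claude Code hooks documentation before relying on either.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Command taken from the article; the test file name comes from the earlier example.
EVAL_COMMAND = ["deepeval", "test", "run", "test_langchain_agent.py"]

# Assumption: the hook runner treats exit code 2 as "block and report stderr".
BLOCKING_EXIT_CODE = 2


def run_command(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def eval_gate(runner: Callable[[Sequence[str]], int] = run_command) -> int:
    """Return 0 when the eval suite passes, otherwise the blocking exit code."""
    if runner(EVAL_COMMAND) == 0:
        return 0
    print("Eval suite failed; fix the failing metrics before finishing.", file=sys.stderr)
    return BLOCKING_EXIT_CODE


# In the actual hook script you would end with: sys.exit(eval_gate())
```

The runner is injectable so the gate logic can be tested without invoking the CLI. That is a design choice for this sketch, not something DeepEval requires.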

[Figure: deepeval inspect TUI showing a trace tree with per-span scores and metric reasons]

Conclusion: The harness that builds the eval harness

There's something a little odd about where we ended up. We started with the agent harness being everything around the model (memory, tools, evals) and then called the eval harness the validation layer for the agent that harness produces. When Claude Code reaches for DeepEval to build an agent, those two definitions collapse into each other: one coding agent is using a harness to check the agent it's writing.

It's worth noticing how that's different from everything else Claude Code does. The rest of its harness is about getting the right context in and good code out. None of it has an opinion on whether the agent that came out actually behaves. That's the gap the eval harness fills, and it happens to be the one place a passing test tells you nothing, because "was this response faithful?" isn't something lint or a unit test can answer.

I think this matters more as the models converge. When everyone has access to roughly the same frontier model, the harness around it is what's left to differentiate on, and evals are the part of that harness that decides what you actually ship. You're not shipping whatever the model wrote on the first pass. You're shipping the version that survived the evals, and that's a deliberate choice you make, not a side effect of the model being good.

DeepEval is free and 100% open-source on GitHub.