CrewAI

CrewAI is a Python framework for orchestrating role-playing autonomous agents that collaborate on multi-step tasks.

The deepeval integration registers a CrewAI event listener and ships drop-in Crew, Agent, LLM, and tool shims that accept metrics. Every crew.kickoff(...), agent execution, LLM call, and tool call becomes a span you can inspect, without rewriting your crew.

deepeval's CrewAI integration enables you to:

  • Trace every crew.kickoff(...): each kickoff produces a trace, and each agent execution, LLM call, and tool call becomes a component span.
  • Attach metrics directly to Crew, Agent, LLM, and @tool through deepeval-aware shims.
  • Run evals from scripts or CI/CD: same crew, different surfaces.
  • Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more crew kickoffs.

Getting Started

Installation

pip install -U deepeval crewai

The integration calls instrument_crewai() once to register the event listener. After that, the deepeval-aware Crew, Agent, LLM, and tool shims accept metrics directly.

Instrument and evaluate

Call instrument_crewai() at startup, then build the crew with deepeval.integrations.crewai.Crew/Agent and the @tool decorator. Pass metrics on the Agent (or Crew) you want to evaluate.

crewai_agent.py
from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent, tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"

reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    metrics=[TaskCompletionMetric()],
)

task = Task(
    description="Get the current weather for {city} and summarize it.",
    expected_output="A clear weather report for the requested city.",
    agent=reporter,
)

crew = Crew(agents=[reporter], tasks=[task])

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="Paris")])

for golden in dataset.evals_iterator():
    crew.kickoff({"city": golden.input})

Done. You've run your first eval with full traceability into CrewAI via deepeval.
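
The dict passed to crew.kickoff(...) is what fills the {city} placeholder in the task description. CrewAI performs that interpolation itself; the sketch below only approximates it with Python's str.format to show what each golden input turns into. The render_task helper and the TASK_TEMPLATE constant are hypothetical names used for illustration and are not part of CrewAI or deepeval.

```python
# Approximation only: CrewAI does its own placeholder interpolation.
# `render_task` is a hypothetical helper, not a CrewAI or deepeval API.
TASK_TEMPLATE = "Get the current weather for {city} and summarize it."


def render_task(template: str, inputs: dict) -> str:
    """Fill named placeholders in `template` from the `inputs` dict."""
    return template.format(**inputs)


# One rendered description per golden input, mirroring one kickoff per golden.
golden_inputs = ["Paris", "London"]
rendered = [render_task(TASK_TEMPLATE, {"city": city}) for city in golden_inputs]

for line in rendered:
    print(line)
```

Keeping the golden's input as the raw value ("Paris") and building the dict at the call site means the same dataset can drive crews whose tasks use different placeholder names.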

What gets traced

Each crew.kickoff(...) call produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for every step the crew took:

  • Agent spans: one per Agent execution within the crew.
  • LLM spans: model calls dispatched by agents.
  • Tool spans: tool invocations, including knowledge retrieval.

Trace                          ← what the user observes
└── Agent: weather_reporter    ← one crew.kickoff(...) execution
    ├── LLM: gpt-4o            ← component span: model decides
    ├── Tool: get_weather      ← component span: tool input + output
    └── LLM: gpt-4o            ← component span: final summary

The trace and its component spans are independently evaluable.
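
To make "independently evaluable" concrete, here is a toy model of the tree above built from standard-library dataclasses. It is a sketch for intuition only: the Span and Trace classes are hypothetical and do not mirror deepeval's internal data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """Hypothetical span: one step inside a trace (not a deepeval class)."""
    kind: str  # "agent", "llm", or "tool"
    name: str
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span, then every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """Hypothetical trace: the end-to-end unit wrapping a root span."""
    root: Span

    def spans(self, kind: Optional[str] = None) -> List[Span]:
        """Return all spans, or only those of the given kind."""
        return [s for s in self.root.walk() if kind is None or s.kind == kind]


example_trace = Trace(
    root=Span(
        "agent",
        "weather_reporter",
        [Span("llm", "gpt-4o"), Span("tool", "get_weather"), Span("llm", "gpt-4o")],
    )
)

# A component metric would look at one subset; a trace metric at the whole run.
tool_names = [s.name for s in example_trace.spans("tool")]
llm_count = len(example_trace.spans("llm"))
```

Selecting spans by kind is the idea behind attaching a metric to one component: the metric sees that span's input and output, while a trace-level metric sees only what entered and left the whole run.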

Running evals

There are two surfaces for running evals against a CrewAI crew. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one crew.kickoff(...); failing metrics fail the test, which fails the build.

test_crewai_agent.py
import pytest
from crewai import Task
from deepeval import assert_test
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent, tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"

reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
)
task = Task(
    description="Get the current weather for {city} and summarize it.",
    expected_output="A clear weather report for the requested city.",
    agent=reporter,
)
crew = Crew(agents=[reporter], tasks=[task])

dataset = EvaluationDataset(goldens=[Golden(input="Paris"), Golden(input="London")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_crewai_agent(golden: Golden):
    crew.kickoff({"city": golden.input})
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:

deepeval test run test_crewai_agent.py
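
The pass/fail gate can be pictured as a threshold check that raises AssertionError, which pytest reports as a failed test and CI reports as a failed build. The sketch below assumes each metric yields a score between 0 and 1 that is compared with a threshold; MetricResult and gate are hypothetical names, not deepeval APIs, and the 0.5 default is an illustrative value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    """Hypothetical record of one metric's outcome (not a deepeval class)."""
    name: str
    score: float
    threshold: float = 0.5  # illustrative default

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def gate(results: List[MetricResult]) -> None:
    """Raise AssertionError naming every metric that scored below its threshold."""
    failures = [r for r in results if not r.passed]
    if failures:
        details = ", ".join(
            f"{r.name}={r.score:.2f}<{r.threshold:.2f}" for r in failures
        )
        raise AssertionError(f"Metrics failed: {details}")


# A passing result returns silently.
gate([MetricResult("Task Completion", 0.9)])

# A failing result raises; under pytest this would fail the test.
try:
    gate([MetricResult("Task Completion", 0.3)])
except AssertionError as exc:
    failure_message = str(exc)
```

Collecting every failure before raising, rather than stopping at the first, keeps the CI log useful when several metrics regress in the same run.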

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one kickoff; metrics score the resulting trace.

crewai_agent.py
import asyncio

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...

dataset = EvaluationDataset(goldens=[Golden(input="Paris"), Golden(input="London")])

async def run_crew(city: str):
    return await crew.kickoff_async({"city": city})

for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[TaskCompletionMetric()],
):
    # Named kickoff_task so it does not shadow the CrewAI Task named `task`.
    kickoff_task = asyncio.create_task(run_crew(golden.input))
    dataset.evaluate(kickoff_task)

Sync (crew.kickoff) and async (crew.kickoff_async) execution both work; pick whichever matches your code.
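
The difference between the two entry points is easiest to see without any model calls. The sketch below uses a hypothetical StubCrew that only imitates the kickoff and kickoff_async method names from the examples above; it is not a CrewAI class, and its canned reply stands in for a real crew result.

```python
import asyncio
from typing import Dict, List


class StubCrew:
    """Hypothetical stand-in exposing the two entry points used above."""

    def kickoff(self, inputs: Dict[str, str]) -> str:
        return f"It's always sunny in {inputs['city']}!"

    async def kickoff_async(self, inputs: Dict[str, str]) -> str:
        # Yield to the event loop once, as a real network call would.
        await asyncio.sleep(0)
        return self.kickoff(inputs)


async def run_all(crew: StubCrew, cities: List[str]) -> List[str]:
    """Start one kickoff per city and wait for all of them together."""
    return await asyncio.gather(
        *(crew.kickoff_async({"city": city}) for city in cities)
    )


stub = StubCrew()
cities = ["Paris", "London"]

# Sync: kickoffs run one after another.
sync_results = [stub.kickoff({"city": city}) for city in cities]

# Async: kickoffs are scheduled together; gather preserves input order.
async_results = asyncio.run(run_all(stub, cities))
```

With a real crew the async path mainly helps when the dataset is large and each kickoff spends most of its time waiting on model or tool responses.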

Applying metrics to components

The metrics=[...] you pass to evals_iterator evaluate the trace. To evaluate a component (a specific agent, LLM call, or tool), attach metrics directly where the component is defined.

Agent spans

Pass metrics=[...] to deepeval.integrations.crewai.Agent. The metric is applied to that agent's span on every execution.

crewai_agent.py
from deepeval.integrations.crewai import Agent
from deepeval.metrics import TaskCompletionMetric
...

reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    metrics=[TaskCompletionMetric()],
)

LLM calls

Pass metrics=[...] to deepeval.integrations.crewai.LLM. The metric is applied to LLM spans produced by that model.

crewai_agent.py
from deepeval.integrations.crewai import LLM, Agent
from deepeval.metrics import AnswerRelevancyMetric
...

llm = LLM(model="gpt-4o", metrics=[AnswerRelevancyMetric()])
reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    llm=llm,
)

Tool calls

Pass metric=[...] to the deepeval-aware @tool decorator. The metric is applied to that tool's span on every call.

crewai_agent.py
from deepeval.integrations.crewai import tool
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

@tool(metric=[GEval(
    name="Helpful Weather Lookup",
    criteria="The output must be a clear weather summary for the requested city.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)])
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.

Customizing trace and span data

The integration captures inputs, outputs, model names, and tool calls automatically. For anything dynamic, the right API depends on where your code runs.

  • Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id, metrics).
  • Use shim kwargs (Agent(metrics=...), LLM(metrics=...), @tool(metric=...)) for component-level defaults.
  • Use update_current_trace(...) and update_current_span(...) from inside a tool body to mutate fields the framework can't see.

crewai_agent.py
from deepeval.integrations.crewai import tool
from deepeval.tracing import update_current_trace, update_current_span

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    update_current_trace(metadata={"city": city})
    update_current_span(metadata={"source": "static-table"})
    return f"It's always sunny in {city}!"

Advanced patterns

The primitives above (instrument_crewai, Crew, Agent, LLM, @tool, with trace(...)) compose around one boundary: CrewAI owns the kickoff lifecycle, and your code attaches metrics where they make sense.

Trace-level metrics with with trace(...)

When you want a metric on the whole crew run rather than a specific component, wrap the kickoff in with trace(metrics=[...]). The metric scores the trace's overall input/output.

crewai_agent.py
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
...

for golden in dataset.evals_iterator():
    with trace(metrics=[AnswerRelevancyMetric()]):
        crew.kickoff({"city": golden.input})

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary when component metrics are already attached to the agent, LLM, or tool; in that case, CI/CD and scripts only need to run the crew.

This is how you'd run it:

test_crewai_agent.py
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    crew.kickoff({"city": golden.input})
    assert_test(golden=golden)

deepeval test run test_crewai_agent.py

crewai_agent.py
...

for golden in dataset.evals_iterator():
    crew.kickoff({"city": golden.input})

Wrap a kickoff in @observe

When the crew run is part of a larger operation, decorate the outer function with @observe. CrewAI spans nest under your observed span automatically.

crewai_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
def respond_to_user(city: str) -> str:
    result = crew.kickoff({"city": city})
    return str(result)

API reference

The deepeval-aware shims accept the framework's standard kwargs plus the following:

Shim          Kwarg     Description
Crew(...)     metrics   Metrics applied to the crew's top-level span on every kickoff.
Agent(...)    metrics   Metrics applied to this agent's span on every execution.
LLM(...)      metrics   Metrics applied to LLM spans produced by this model.
@tool(...)    metric    Metrics applied to this tool's span on every call.

For runtime helpers (update_current_trace, update_current_span) and the tracing and test surface (@observe, with trace(...), assert_test), see the tracing reference.
