CrewAI
CrewAI is a Python framework for orchestrating role-playing autonomous agents that collaborate on multi-step tasks.
The deepeval integration registers a CrewAI event listener and ships drop-in Crew, Agent, LLM, and tool shims that accept metrics. Every crew.kickoff(...), agent execution, LLM call, and tool call becomes a span you can inspect, without rewriting your crew.
deepeval's CrewAI integration enables you to:
- Trace every crew.kickoff(...): each kickoff produces a trace, and each agent execution, LLM call, and tool call becomes a component span.
- Attach metrics directly to Crew, Agent, LLM, and @tool through deepeval-aware shims.
- Run evals from scripts or CI/CD: same crew, different surfaces.
- Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more crew kickoffs.
Getting Started
Installation
pip install -U deepeval crewai
The integration calls instrument_crewai() once to register the event listener. After that, the deepeval-aware Crew, Agent, LLM, and tool shims accept metrics directly.
Instrument and evaluate
Call instrument_crewai() at startup, then build the crew with deepeval.integrations.crewai.Crew/Agent and the @tool decorator. Pass metrics on the Agent (or Crew) you want to evaluate.
from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent, tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
instrument_crewai()
@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"
reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    metrics=[TaskCompletionMetric()],
)
task = Task(
    description="Get the current weather for {city} and summarize it.",
    expected_output="A clear weather report for the requested city.",
    agent=reporter,
)
crew = Crew(agents=[reporter], tasks=[task])
# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="Paris")])
for golden in dataset.evals_iterator():
    crew.kickoff({"city": golden.input})
Done. You've run your first eval with full traceability into CrewAI via deepeval.
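The {city} placeholder in the task description is filled from the dictionary passed to crew.kickoff(...), so each golden's input yields one concrete task. The sketch below only mimics that substitution with plain str.format; CrewAI performs its own interpolation internally, and render_description is a hypothetical helper that exists in neither library.

```python
# Illustrative only: mimics placeholder substitution with str.format.
# render_description is a hypothetical helper, not part of CrewAI or deepeval.
TEMPLATE = "Get the current weather for {city} and summarize it."


def render_description(template: str, inputs: dict) -> str:
    """Fill the named placeholders in the template from the inputs mapping."""
    return template.format(**inputs)


golden_inputs = ["Paris", "London"]
rendered = [render_description(TEMPLATE, {"city": city}) for city in golden_inputs]

for line in rendered:
    print(line)
```

A key missing from the inputs dictionary raises KeyError in this sketch, which is a useful reminder to keep golden inputs and task placeholders in sync.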
What gets traced
Each crew.kickoff(...) call produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for every step the crew took:
- Agent spans: one per Agent execution within the crew.
- LLM spans: model calls dispatched by agents.
- Tool spans: tool invocations, including knowledge retrieval.
Trace                          ← what the user observes
└── Agent: weather_reporter    ← one crew.kickoff(...) execution
    ├── LLM: gpt-4o            ← component span: model decides
    ├── Tool: get_weather      ← component span: tool input + output
    └── LLM: gpt-4o            ← component span: final summary
The trace and its component spans are independently evaluable.
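To make the trace/span relationship concrete, here is a small standard-library model of the tree for the weather crew. The Span and Trace classes are hypothetical illustrations, not deepeval's internal types; the point is only that a trace is a root plus nested component spans, and that spans can be selected by kind for separate evaluation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    """Hypothetical component span: an agent execution, LLM call, or tool call."""
    kind: str  # "Agent", "LLM", or "Tool"
    name: str
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """Hypothetical trace: one crew.kickoff(...) run with a single root span."""
    root: Span

    def spans(self, kind: str) -> List[Span]:
        """Return every span of the given kind, in depth-first order."""
        return [s for s in self.root.walk() if s.kind == kind]


weather_trace = Trace(
    root=Span("Agent", "weather_reporter", children=[
        Span("LLM", "gpt-4o"),
        Span("Tool", "get_weather"),
        Span("LLM", "gpt-4o"),
    ])
)

llm_spans = weather_trace.spans("LLM")
tool_spans = weather_trace.spans("Tool")
print(len(llm_spans), len(tool_spans))  # prints "2 1"
```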
Running evals
There are two surfaces for running evals against a CrewAI crew. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one crew.kickoff(...); failing metrics fail the test, which fails the build.
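The gating idea can be pictured without any framework: a metric passes when its score meets its threshold, and any failure raises an AssertionError, which pytest reports as a failed test. The MetricResult class and gate function below are hypothetical stand-ins written for this illustration; they are not deepeval classes, and the scores are made-up values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    """Hypothetical record of one scored metric (not a deepeval class)."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def gate(results: List[MetricResult]) -> None:
    """Raise AssertionError naming every metric that scored below its threshold."""
    failed = [r.name for r in results if not r.passed]
    if failed:
        raise AssertionError("Metrics failed: " + ", ".join(failed))


# A passing result returns silently.
gate([MetricResult("Task Completion", 0.9, 0.5)])

# A failing result raises, which a test runner would report as a failure.
try:
    gate([MetricResult("Task Completion", 0.3, 0.5)])
    gate_error = None
except AssertionError as exc:
    gate_error = str(exc)

print(gate_error)  # prints "Metrics failed: Task Completion"
```

The pytest file for this page's crew follows.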
import pytest
from crewai import Task
from deepeval import assert_test
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent, tool
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
instrument_crewai()
@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"
reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
)
task = Task(
    description="Get the current weather for {city} and summarize it.",
    expected_output="A clear weather report for the requested city.",
    agent=reporter,
)
crew = Crew(agents=[reporter], tasks=[task])
dataset = EvaluationDataset(goldens=[Golden(input="Paris"), Golden(input="London")])
@pytest.mark.parametrize("golden", dataset.goldens)
def test_crewai_agent(golden: Golden):
    crew.kickoff({"city": golden.input})
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])
Run it with:
deepeval test run test_crewai_agent.py
In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one kickoff; metrics score the resulting trace.
import asyncio
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...
dataset = EvaluationDataset(goldens=[Golden(input="Paris"), Golden(input="London")])
async def run_crew(city: str):
    return await crew.kickoff_async({"city": city})
for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[TaskCompletionMetric()],
):
    task = asyncio.create_task(run_crew(golden.input))
    dataset.evaluate(task)
Sync (crew.kickoff) and async (crew.kickoff_async) execution both work; pick whichever matches your code.
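The async variant creates one asyncio task per golden so that kickoffs can overlap instead of running back to back. The standard-library sketch below shows only that task-per-input pattern: fake_kickoff is a hypothetical stand-in for crew.kickoff_async that sleeps instead of calling a model, and the deepeval side (evals_iterator, dataset.evaluate) is not modeled.

```python
import asyncio
from typing import List


async def fake_kickoff(city: str) -> str:
    """Hypothetical stand-in for crew.kickoff_async; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return f"Weather report for {city}"


async def run_all(cities: List[str]) -> List[str]:
    # One task per input; the tasks run concurrently on the same event loop.
    tasks = [asyncio.create_task(fake_kickoff(city)) for city in cities]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


reports = asyncio.run(run_all(["Paris", "London"]))
print(reports)
```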
Applying metrics to components
The metrics=[...] you pass to evals_iterator evaluates the trace. To evaluate a component (a specific agent, LLM call, or tool), attach metrics directly where the component is defined.
Agent spans
Pass metrics=[...] to deepeval.integrations.crewai.Agent. The metric is applied to that agent's span on every execution.
from deepeval.integrations.crewai import Agent
from deepeval.metrics import TaskCompletionMetric
...
reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    metrics=[TaskCompletionMetric()],
)
LLM calls
Pass metrics=[...] to deepeval.integrations.crewai.LLM. The metric is applied to LLM spans produced by that model.
from deepeval.integrations.crewai import LLM, Agent
from deepeval.metrics import AnswerRelevancyMetric
...
llm = LLM(model="gpt-4o", metrics=[AnswerRelevancyMetric()])
reporter = Agent(
    role="Weather Reporter",
    goal="Provide accurate weather information.",
    backstory="An experienced meteorologist.",
    tools=[get_weather],
    llm=llm,
)
Tool calls
Pass metric=[...] to the deepeval-aware @tool decorator. The metric is applied to that tool's span on every call.
from deepeval.integrations.crewai import tool
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
@tool(metric=[GEval(
    name="Helpful Weather Lookup",
    criteria="The output must be a clear weather summary for the requested city.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)])
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    return f"It's always sunny in {city}!"
For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.
Customizing trace and span data
The integration captures inputs, outputs, model names, and tool calls automatically. For anything dynamic, the right API depends on where your code runs.
- Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id, metrics).
- Use shim kwargs (Agent(metrics=...), LLM(metrics=...), @tool(metric=...)) for component-level defaults.
- Use update_current_trace(...) and update_current_span(...) from inside a tool body to mutate fields the framework can't see.
from deepeval.integrations.crewai import tool
from deepeval.tracing import update_current_trace, update_current_span
@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city."""
    update_current_trace(metadata={"city": city})
    update_current_span(metadata={"source": "static-table"})
    return f"It's always sunny in {city}!"
Advanced patterns
The primitives above (instrument_crewai, Crew, Agent, LLM, @tool, with trace(...)) compose around one boundary: CrewAI owns the kickoff lifecycle, and your code attaches metrics where they make sense.
Trace-level metrics with with trace(...)
When you want a metric on the whole crew run rather than a specific component, wrap the kickoff in with trace(metrics=[...]). The metric scores the trace's overall input/output.
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
...
for golden in dataset.evals_iterator():
    with trace(metrics=[AnswerRelevancyMetric()]):
        crew.kickoff({"city": golden.input})
No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary when component metrics are already attached to the agent, LLM, or tool. In that case, CI/CD and scripts only need to run the crew.
This is how you'd run it in CI/CD (pytest):
import pytest
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    crew.kickoff({"city": golden.input})
    assert_test(golden=golden)
Run it with:
deepeval test run test_crewai_agent.py
Or in a script:
...
for golden in dataset.evals_iterator():
    crew.kickoff({"city": golden.input})
Wrap a kickoff in @observe
When the crew run is part of a larger operation, decorate the outer function with @observe. CrewAI spans nest under your observed span automatically.
from deepeval.tracing import observe
...
@observe(name="respond_to_user")
def respond_to_user(city: str) -> str:
    result = crew.kickoff({"city": city})
    return str(result)
API reference
The deepeval-aware shims accept the framework's standard kwargs plus the following:
| Shim | Kwarg | Description |
|---|---|---|
| Crew(...) | metrics | Metrics applied to the crew's top-level span on every kickoff. |
| Agent(...) | metrics | Metrics applied to this agent's span on every execution. |
| LLM(...) | metrics | Metrics applied to LLM spans produced by this model. |
| @tool(...) | metric | Metrics applied to this tool's span on every call. |
For runtime helpers (update_current_trace, update_current_span) and the tracing and test surface (@observe, assert_test, with trace(...)), see the tracing reference.