CrewAI
CrewAI is a lean, independent Python framework designed for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precision control for complex automation tasks.
We recommend logging in to Confident AI to view your CrewAI evaluation traces.
deepeval login
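If you would rather not use the CLI (for example in CI), a minimal sketch, assuming deepeval reads your key from the CONFIDENT_API_KEY environment variable:

import os

# Assumption: deepeval picks up your Confident AI key from this environment variable
os.environ["CONFIDENT_API_KEY"] = "<your-confident-api-key>"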
End-to-End Evals
deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.
Configure CrewAI
Create a Crew and use instrument_crewai to instrument your CrewAI application.
import random

from crewai import Task, Crew, Agent
from crewai.tools import tool

from deepeval.integrations.crewai import instrument_crewai

instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city. Returns temperature and conditions."""
    weather_data = {
        "New York": "Partly Cloudy",
        "London": "Rainy",
        "Tokyo": "Sunny",
        "Paris": "Cloudy",
        "Sydney": "Clear",
    }
    condition = weather_data.get(city, "Clear")
    temperature = f"{random.randint(45, 95)}°F"
    humidity = f"{random.randint(30, 90)}%"
    return f"Weather in {city}: {temperature}, {condition}, Humidity: {humidity}"

agent = Agent(
    role="Weather Reporter",
    goal="Provide accurate and helpful weather information to users.",
    backstory="An experienced meteorologist who loves helping people plan their day with accurate weather reports.",
    tools=[get_weather],
    verbose=True,
)

task = Task(
    description="Get the current weather for {city} and provide a helpful summary.",
    expected_output="A clear weather report including temperature, conditions, and humidity.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)
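Before running evaluations, you can optionally sanity-check the instrumented crew with a single kickoff (the city below is just an example value for the {city} placeholder in the task):

# Optional: quick check that the crew, tool, and instrumentation are wired up
result = crew.kickoff({"city": "Tokyo"})
print(result)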
Evaluations are supported for the CrewAI Agent. Only metrics with the parameters input, output, expected_output, and tools_called are eligible for evaluation.
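For example, a custom GEval metric scoped to the input and actual output is eligible, since it only relies on parameters the CrewAI trace can supply (the metric name and criteria below are illustrative):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative custom metric that only uses eligible parameters
report_quality = GEval(
    name="Weather Report Quality",
    criteria="Check whether the output is a clear, complete weather report for the requested city.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)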
Run evaluations
Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations. Pass the metrics to the trace context manager.
- Synchronous
- Asynchronous
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

for golden in dataset.evals_iterator():
    with trace(trace_metrics=[answer_relevancy_metric]):
        crew.kickoff({"city": golden.input})
import asyncio

from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

async def run_crewai_e2e_async(input: str):
    with trace(trace_metrics=[answer_relevancy_metric]):
        await crew.kickoff_async({"city": input})

for golden in dataset.evals_iterator():
    task = asyncio.create_task(run_crewai_e2e_async(golden.input))
    dataset.evaluate(task)
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
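Since expected_output is an eligible parameter, your goldens can also carry reference answers; a sketch, assuming the golden's expected_output is attached to the test case generated for each trace:

# Goldens with reference answers for expected_output-based metrics (illustrative)
dataset = EvaluationDataset(
    goldens=[
        Golden(
            input="London",
            expected_output="A weather report for London with temperature, conditions, and humidity.",
        ),
        Golden(
            input="Paris",
            expected_output="A weather report for Paris with temperature, conditions, and humidity.",
        ),
    ]
)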
View on Confident AI (optional)
If you need to evaluate individual components of your CrewAI application, set up tracing instead.
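As a rough sketch of what component-level evaluation looks like with deepeval's @observe decorator (the report_weather helper below is illustrative and not part of the CrewAI integration):

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Illustrative: attach a metric to a single component of your application
@observe(metrics=[AnswerRelevancyMetric()])
def report_weather(city: str) -> str:
    output = str(crew.kickoff({"city": city}))
    # Provide the test case for this span so its metric can be evaluated
    update_current_span(test_case=LLMTestCase(input=city, actual_output=output))
    return output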
Evals in Production
To run online evaluations in production, replace metrics with a metric collection string from Confident AI, and push your CrewAI agent to production.
...

with trace(trace_metric_collection="test_collection_1"):
    result = crew.kickoff({"city": "London"})