CrewAI

CrewAI is a lean, independent Python framework designed for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precision control for complex automation tasks.

tip

We recommend logging in to Confident AI to view your CrewAI evaluation traces.

deepeval login

End-to-End Evals

deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.

Configure CrewAI

Create a Crew and use instrument_crewai to instrument your CrewAI application.

main.py
import random

from crewai import Task, Crew, Agent
from crewai.tools import tool

from deepeval.integrations.crewai import instrument_crewai

instrument_crewai()

@tool
def get_weather(city: str) -> str:
    """Fetch weather data for a given city. Returns temperature and conditions."""
    weather_data = {
        "New York": "Partly Cloudy",
        "London": "Rainy",
        "Tokyo": "Sunny",
        "Paris": "Cloudy",
        "Sydney": "Clear",
    }

    condition = weather_data.get(city, "Clear")
    temperature = f"{random.randint(45, 95)}°F"
    humidity = f"{random.randint(30, 90)}%"

    return f"Weather in {city}: {temperature}, {condition}, Humidity: {humidity}"


agent = Agent(
    role="Weather Reporter",
    goal="Provide accurate and helpful weather information to users.",
    backstory="An experienced meteorologist who loves helping people plan their day with accurate weather reports.",
    tools=[get_weather],
    verbose=True,
)

task = Task(
    description="Get the current weather for {city} and provide a helpful summary.",
    expected_output="A clear weather report including temperature, conditions, and humidity.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)
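Because get_weather draws its temperature and humidity at random, its output changes on every call, which makes the tool awkward to unit-test. A minimal sketch of one way around this, assuming you are free to move the lookup into a plain function: format_weather and WEATHER_DATA are hypothetical names, not part of CrewAI or deepeval, and the function takes its random generator as an argument so a test can seed it.

```python
import random

# Same lookup table as the get_weather tool above.
WEATHER_DATA = {
    "New York": "Partly Cloudy",
    "London": "Rainy",
    "Tokyo": "Sunny",
    "Paris": "Cloudy",
    "Sydney": "Clear",
}


def format_weather(city: str, rng: random.Random) -> str:
    """Hypothetical plain-function variant of get_weather with an injectable RNG."""
    condition = WEATHER_DATA.get(city, "Clear")  # unknown cities fall back to "Clear"
    temperature = f"{rng.randint(45, 95)}°F"
    humidity = f"{rng.randint(30, 90)}%"
    return f"Weather in {city}: {temperature}, {condition}, Humidity: {humidity}"


# Two generators created with the same seed yield the same report.
report_a = format_weather("London", random.Random(0))
report_b = format_weather("London", random.Random(0))
```

The decorated tool could then delegate to this function with a shared random.Random() instance, leaving the agent-facing behavior unchanged.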

info

Evaluations are supported for the CrewAI Agent. Only metrics that rely on the input, output, expected_output, and tools_called parameters are eligible for evaluation.

Run evaluations

Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations. Pass the metrics to the trace context manager.

Synchronous

main.py
from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

# Each iteration runs the crew once; the trace context attaches the metric to that run.
for golden in dataset.evals_iterator():
    with trace(trace_metrics=[answer_relevancy_metric]):
        crew.kickoff({"city": golden.input})

Asynchronous

main.py
import asyncio

from deepeval.tracing import trace
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden

answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="London"),
        Golden(input="Paris"),
    ]
)

async def run_crewai_e2e_async(city: str):
    # The trace context attaches the metric to this single crew run.
    with trace(trace_metrics=[answer_relevancy_metric]):
        await crew.kickoff_async({"city": city})

for golden in dataset.evals_iterator():
    # Schedule one crew run per golden and hand the task to the dataset.
    task = asyncio.create_task(run_crewai_e2e_async(golden.input))
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
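Each golden's input reaches the crew as the value of city, which fills the {city} placeholder in the task description. The sketch below previews that substitution with plain str.format. It only approximates the idea and is not CrewAI's own interpolation code; build_inputs and preview_description are hypothetical helper names.

```python
# Task description copied from the Task defined earlier.
TASK_DESCRIPTION = "Get the current weather for {city} and provide a helpful summary."


def build_inputs(golden_input: str) -> dict:
    """Hypothetical helper: map a golden's input onto the task's placeholder name."""
    return {"city": golden_input}


def preview_description(template: str, inputs: dict) -> str:
    """Approximate the placeholder substitution with str.format."""
    return template.format(**inputs)


# One preview per golden input used in the dataset above.
previews = [
    preview_description(TASK_DESCRIPTION, build_inputs(city))
    for city in ["London", "Paris"]
]
```

Previewing the rendered description this way can catch a mismatch between the placeholder name and the key passed to kickoff before any model calls are made.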

View on Confident AI (optional)

note

If you need to evaluate individual components of your CrewAI application, set up tracing instead.

Evals in Production

To run online evaluations in production, replace the trace_metrics argument with trace_metric_collection, set to the name of a metric collection on Confident AI, then deploy your CrewAI agent.

...
with trace(trace_metric_collection="test_collection_1"):
    result = crew.kickoff({"city": "London"})
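If the same script serves both local evaluation and production, the choice between the two keyword arguments can be kept in one place. This is a minimal sketch under that assumption: trace_kwargs is a hypothetical helper, and only the keyword names trace_metrics and trace_metric_collection come from the examples on this page.

```python
def trace_kwargs(production: bool, metrics: list, collection: str) -> dict:
    """Hypothetical helper: pick the keyword argument to pass to trace()."""
    if production:
        # Online evals: refer to a metric collection by name.
        return {"trace_metric_collection": collection}
    # Local end-to-end evals: pass metric objects directly.
    return {"trace_metrics": metrics}


dev_kwargs = trace_kwargs(False, ["<metric objects go here>"], "test_collection_1")
prod_kwargs = trace_kwargs(True, [], "test_collection_1")

# Usage (requires deepeval and the crew defined earlier):
#   with trace(**prod_kwargs):
#       crew.kickoff({"city": "London"})
```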
