CrewAI

CrewAI is a lean, independent Python framework designed for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precision control for complex automation tasks.

tip

We recommend logging in to Confident AI to view your CrewAI evaluation traces.

deepeval login

End-to-End Evals

deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.

Configure CrewAI

Create a Crew and pass metrics to deepeval's Agent wrapper.

main.py
from crewai import Task, Crew

from deepeval.integrations.crewai import Agent, instrument_crewai
from deepeval.metrics import AnswerRelevancyMetric

# Instrument CrewAI so deepeval can trace and evaluate agent executions
instrument_crewai()

answer_relevancy_metric = AnswerRelevancyMetric()

agent = Agent(
    role="Consultant",
    goal="Write clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
    metrics=[answer_relevancy_metric],
)

task = Task(
    description="Explain the given topic",
    expected_output="A clear and concise explanation.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)

# result = crew.kickoff(inputs={"input": "What are LLMs?"})
# print(result)
info

Evaluations are supported for the CrewAI Agent. Only metrics that use the input, output, expected_output, and tools_called parameters are eligible for evaluation.
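
For reference, an eligible metric such as AnswerRelevancyMetric can also be measured on its own against a test case built from these fields (roughly, output corresponds to actual_output on deepeval's LLMTestCase). Below is a minimal standalone sketch with a hypothetical test case; the integration builds its own test cases from the agent's execution.

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case using only the eligible fields; the integration
# fills these in from the agent's actual run.
test_case = LLMTestCase(
    input="What are LLMs?",
    actual_output="LLMs are large language models trained to understand and generate text.",
)

metric = AnswerRelevancyMetric()
metric.measure(test_case)
print(metric.score, metric.reason)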

Run evaluations

Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What are Transformers in AI?"),
    Golden(input="What is the biggest open source database?"),
    Golden(input="What are LLMs?"),
])

# Each iteration evaluates one golden as its own trace in the test run
for golden in dataset.evals_iterator():
    result = crew.kickoff(inputs={"input": golden.input})

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

View on Confident AI (optional)

note

If you need to evaluate individual components of your CrewAI application, set up tracing instead.
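
As a rough illustration of component-level evaluation, and assuming deepeval's observe decorator and update_current_span helper from deepeval.tracing, a hypothetical helper function can be traced as its own span with a metric attached. Consult the tracing guide for the CrewAI-specific setup.

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical helper traced as its own span; the metric attached here is
# evaluated against the test case set on that span.
@observe(metrics=[AnswerRelevancyMetric()])
def explain(topic: str) -> str:
    explanation = f"A short explanation of {topic}."  # placeholder logic
    update_current_span(
        test_case=LLMTestCase(input=topic, actual_output=explanation)
    )
    return explanation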

Evals in Production

To run online evaluations in production, replace metrics with a metric collection string from Confident AI, and push your CrewAI agent to production.

...
agent = Agent(
    role="Consultant",
    goal="Write clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
    # metrics=[answer_relevancy_metric],
    metric_collection="test_collection_1",
)

result = crew.kickoff(
    inputs={"input": "What are LLMs?"}
)