CrewAI
CrewAI is a lean, independent Python framework designed for creating and orchestrating autonomous multi-agent AI systems, offering high flexibility, speed, and precision control for complex automation tasks.
We recommend logging in to Confident AI to view your CrewAI evaluation traces.
deepeval login
End-to-End Evals
deepeval allows you to evaluate CrewAI applications end-to-end in under a minute.
Configure CrewAI
Create a Crew and pass metrics to deepeval's Agent wrapper.
from crewai import Task, Crew
from deepeval.integrations.crewai import Agent, instrument_crewai
from deepeval.metrics import AnswerRelevancyMetric

# Patch CrewAI so deepeval can trace and evaluate agent executions
instrument_crewai()

answer_relevancy_metric = AnswerRelevancyMetric()

agent = Agent(
    role="Consultant",
    goal="Write clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
    metrics=[answer_relevancy_metric],
)

task = Task(
    description="Explain the given topic",
    expected_output="A clear and concise explanation.",
    agent=agent,
)

crew = Crew(
    agents=[agent],
    tasks=[task],
)

# result = crew.kickoff(inputs={"input": "What are LLMs?"})
# print(result)
Evaluations are supported for the CrewAI Agent wrapper. Only metrics that use the input, output, expected_output, and tools_called parameters are eligible for evaluation.
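For instance, a custom GEval metric restricted to these parameters could also be passed to the Agent wrapper; the metric name and criteria below are illustrative assumptions, not part of the integration itself.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative custom metric that only references eligible parameters
concision_metric = GEval(
    name="Concision",
    criteria="Assess whether the output answers the input clearly and concisely.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# agent = Agent(..., metrics=[concision_metric])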
Run evaluations
Create an EvaluationDataset and invoke your CrewAI application for each golden within the evals_iterator() loop to run end-to-end evaluations.
- Synchronous

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What are Transformers in AI?"),
    Golden(input="What is the biggest open source database?"),
    Golden(input="What are LLMs?"),
])

# Each iteration yields a golden; kick off the crew with its input
for golden in dataset.evals_iterator():
    result = crew.kickoff(inputs={"input": golden.input})
- Asynchronous

import asyncio
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What are Transformers in AI?"),
    Golden(input="What is the biggest open source database?"),
    Golden(input="What are LLMs?"),
])

# Schedule each kickoff as an asyncio task; evals_iterator awaits them at the end
for golden in dataset.evals_iterator():
    task = asyncio.create_task(crew.kickoff_async(inputs={"input": golden.input}))
    dataset.evaluate(task)
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
If you need to evaluate individual components of your CrewAI application, set up tracing instead.
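As a minimal sketch, assuming deepeval's @observe tracing decorator and update_current_span helper, a traced component with its own metric could look like this (the summarize function below is hypothetical):

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical sub-component of a CrewAI app, traced as its own span
@observe(metrics=[AnswerRelevancyMetric()])
def summarize(topic: str) -> str:
    summary = f"A short explanation of {topic}."  # placeholder for real logic
    update_current_span(test_case=LLMTestCase(input=topic, actual_output=summary))
    return summary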
Evals in Production
To run online evaluations in production, replace metrics with a metric collection string from Confident AI, and push your CrewAI agent to production.
...

agent = Agent(
    role="Consultant",
    goal="Write clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
    # metrics=[answer_relevancy_metric],
    metric_collection="test_collection_1",
)

result = crew.kickoff(
    inputs={"input": "What are LLMs?"}
)