LangGraph
LangGraph is an open-source framework from the LangChain ecosystem for building stateful, agentic applications powered by large language models. It models your application as a graph of nodes and edges, letting you combine LLM calls, tools, and external data sources into expressive workflows for advanced generative AI solutions.
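If you are starting from a fresh environment, you will likely need deepeval along with the LangGraph and OpenAI provider packages used in the snippets below (package names assumed for a typical setup):
pip install -U deepeval langgraph langchain-openai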
We recommend logging in to Confident AI to view your LangGraph evaluation traces.
deepeval login
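If you prefer not to log in interactively (for example in CI), deepeval can also pick up your Confident AI API key from an environment variable. The CONFIDENT_API_KEY name below is an assumption based on deepeval's usual configuration, so double-check it against your Confident AI settings:
export CONFIDENT_API_KEY="your-confident-api-key"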
End-to-End Evals
deepeval allows you to evaluate LangGraph applications end-to-end in under a minute.
Configure LangGraph
Create a CallbackHandler with a list of task completion metrics you wish to use, and pass it to your LangGraph application's invoke method.
from langgraph.prebuilt import create_react_agent

from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric

task_completion_metric = TaskCompletionMetric()

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4o-mini",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

# Pass the CallbackHandler (with your metrics) through the `callbacks` config
result = agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
)
print(result)
Only Task Completion is supported for the LangGraph integration. To use other metrics, manually set up tracing instead.
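If you do go the manual tracing route, the shape of that setup is roughly the following. This is a minimal sketch assuming deepeval's @observe / update_current_span tracing API; the llm_app wrapper and the AnswerRelevancyMetric choice are illustrative, not part of the LangGraph integration.
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

@observe(metrics=[AnswerRelevancyMetric()])
def llm_app(query: str) -> str:
    # Call your LangGraph agent (or any other component) as you normally would.
    result = agent.invoke(
        input={"messages": [{"role": "user", "content": query}]}
    )
    answer = result["messages"][-1].content
    # Attach a test case to the current span so the metric can evaluate this component.
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer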
Run Evaluations
Create an EvaluationDataset and invoke your LangGraph application for each golden within the evals_iterator() loop to run end-to-end evaluations.
- Synchronous
- Asynchronous
from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]

dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    agent.invoke(
        input={"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
    )
import asyncio

from deepeval.dataset import Golden, EvaluationDataset

dataset = EvaluationDataset(goldens=[
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
])

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        agent.ainvoke(
            input={"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
        )
    )
    dataset.evaluate(task)
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
If you need to evaluate individual components of your LangGraph application, set up tracing instead.
Evals in Production
To run online evaluations in production, simply replace metrics in CallbackHandler with a metric collection string from Confident AI, and push your LangGraph agent to production.
This will automatically evaluate all incoming traces in production with the task completion metrics defined in your metric collection.
result = agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler(metric_collection="<metric-collection-name-with-task-completion>")]},
)