LangChain
DeepEval makes it easy to evaluate LangChain applications in both development and production environments.
We recommend logging in to Confident AI to view your LangChain evaluation traces.
deepeval login
End-to-End Evals
DeepEval allows you to evaluate LangChain applications end-to-end in under a minute.
Configure LangChain
Create a CallbackHandler with a list of task completion metrics you wish to use, and pass it to your LangChain application's invoke method.
from langchain.chat_models import init_chat_model

from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric


# Tool your agent can call
def multiply(a: int, b: int) -> int:
    """Returns the product of two numbers"""
    return a * b


llm = init_chat_model("gpt-4o-mini", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

# Invoke your agent with the CallbackHandler to evaluate task completion
llm_with_tools.invoke(
    "What is 3 * 12?",
    config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric(task="multiplication")])]},
)
Only Task Completion is supported for the LangChain integration. To use other metrics, manually set up tracing instead.
Run evaluations
Create an EvaluationDataset and invoke your LangChain application for each golden within the evals_iterator() loop to run end-to-end evaluations.
- Synchronous
- Asynchronous
from deepeval.dataset import EvaluationDataset, Golden

...

dataset = EvaluationDataset(goldens=[Golden(input="What is 3 * 12?")])

for golden in dataset.evals_iterator():
    llm_with_tools.invoke(
        golden.input,
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )
from deepeval.dataset import EvaluationDataset, Golden
import asyncio

...

dataset = EvaluationDataset(goldens=[Golden(input="What is 3 * 12?")])

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        llm_with_tools.ainvoke(
            golden.input,
            config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
        )
    )
    dataset.evaluate(task)
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
If you need to evaluate individual components of your LangChain application, set up tracing instead.
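For component-level evaluations, the sketch below shows one way to wrap a component with DeepEval's observe decorator and attach a test case to its span. The generate_answer component, the query it receives, and the AnswerRelevancyMetric are illustrative assumptions, not part of the LangChain integration above; adapt the spans and metrics to your own application.

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical component traced as its own span with a component-level metric
@observe(metrics=[AnswerRelevancyMetric()])
def generate_answer(query: str) -> str:
    answer = llm_with_tools.invoke(query).content
    # Attach a test case to this span so the metric can evaluate it
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer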
Evals in Production
To run online evaluations in production, simply replace metrics in CallbackHandler with a metric collection string from Confident AI, and push your LangChain agent to production.
This will automatically evaluate all incoming traces in production with the task completion metrics defined in your metric collection.
from deepeval.integrations.langchain import CallbackHandler

...

# Invoke your agent with the metric collection name
llm_with_tools.invoke(
    "What is 3 * 12?",
    config={"callbacks": [
        CallbackHandler(metric_collection="<metric-collection-name-with-task-completion>")
    ]},
)