AWS AgentCore
Amazon Bedrock AgentCore is AWS's managed runtime for deploying and scaling AI agents.
We recommend logging in to Confident AI to view your AgentCore evaluations:

```bash
deepeval login
```
If you're in the EU region, set your OTEL endpoint in the environment as follows:

```bash
export CONFIDENT_OTEL_URL="https://eu.otel.confident-ai.com"
```

If you're in the AU region, set:

```bash
export CONFIDENT_OTEL_URL="https://au.otel.confident-ai.com"
```
End-to-End Evals
deepeval lets you evaluate Strands agents running on AgentCore in under a minute.
Configure AgentCore
Pass agent_metrics to the instrument_agentcore method.
```python
import nest_asyncio

nest_asyncio.apply()

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent

from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.metrics import AnswerRelevancyMetric

instrument_agentcore(
    name="AgentCore Tracing",
    environment="development",
    agent_metrics=[AnswerRelevancyMetric()],
)

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt")
    result = agent(user_message)
    return {"result": result.message}

response = invoke({"prompt": "Make a funny joke"})
```
Evaluations are supported for Strands agents. Only metrics whose required parameters are input, output, and tools_called are eligible for evaluation; for example, AnswerRelevancyMetric (which needs only input and output) qualifies, while metrics that also require retrieval_context do not.
Run evaluations
Create an EvaluationDataset and invoke your AgentCore application for each golden inside the evals_iterator() loop to run end-to-end evaluations.
```python
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=False)):
    response = invoke({"prompt": golden.input})
```
✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
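The golden-to-payload mapping above can be exercised in isolation. Below is a plain-Python stand-in with a mocked agent reply (no AWS or deepeval required) that shows how each golden's input becomes the "prompt" field of the entrypoint payload:

```python
# Stand-in for the entrypoint above: the Strands agent call is mocked so the
# golden -> payload -> response flow can be checked locally.
def invoke(payload):
    user_message = payload.get("prompt")
    return {"result": f"(agent reply to: {user_message})"}

golden_inputs = ["What's the weather in Paris?", "What's the weather in London?"]
responses = [invoke({"prompt": q}) for q in golden_inputs]
print(responses[1]["result"])  # (agent reply to: What's the weather in London?)
```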
Evals in Production
To run online evaluations in production, replace your metrics with metric collection strings from Confident AI, and run your Strands agent on AgentCore as usual:
```python
from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent

from deepeval.integrations.agentcore import instrument_agentcore

instrument_agentcore(
    name="AgentCore Tracing",
    environment="production",
    trace_metric_collection="my-trace-collection",
    agent_metric_collection="my-agent-collection",
    llm_metric_collection="my-llm-collection",
    tool_metric_collection_map={
        "get_weather": "my-tool-collection",
    },
)

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt")
    result = agent(user_message)
    return {"result": result.message}

response = invoke({"prompt": "Make a funny joke"})
```
deepeval lets you run component-level evals at different levels: Trace, Agent, LLM, and Tool spans. You can pass a metric collection for any of these spans using the instrument_agentcore method.
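As a mental model for how the arguments above relate to spans, the following plain-Python sketch (names hypothetical, not deepeval internals) resolves a metric collection per span type, with tool spans looked up by tool name:

```python
# Hypothetical illustration mirroring the instrument_agentcore arguments above:
# trace/agent/llm spans each map to one collection; tool spans map per tool name.
collections = {
    "trace": "my-trace-collection",
    "agent": "my-agent-collection",
    "llm": "my-llm-collection",
    "tool": {"get_weather": "my-tool-collection"},
}

def collection_for(span_type, tool_name=None):
    # Tool spans are resolved through the per-tool map; others by span type.
    if span_type == "tool":
        return collections["tool"].get(tool_name)
    return collections.get(span_type)

print(collection_for("tool", "get_weather"))  # my-tool-collection
```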