
AWS AgentCore

Amazon Bedrock AgentCore is AWS's managed runtime for deploying and scaling AI agents.

tip

We recommend logging in to Confident AI to view your AgentCore evaluations.

deepeval login

If you're in the EU region, set the OTEL endpoint environment variable as follows:

export CONFIDENT_OTEL_URL="https://eu.otel.confident-ai.com"

If you're in the AU region, set it instead to:

export CONFIDENT_OTEL_URL="https://au.otel.confident-ai.com"
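If you want to confirm which endpoint is in effect from Python, the override can be pictured as a simple environment lookup. This is a sketch for illustration only; the default URL and the `resolve_otel_endpoint` helper are assumptions, not deepeval's actual internals.

```python
import os

# Hypothetical default; the real SDK may use a different US endpoint.
DEFAULT_OTEL_URL = "https://otel.confident-ai.com"

def resolve_otel_endpoint() -> str:
    """Return the regional OTEL endpoint if configured, else the default."""
    return os.environ.get("CONFIDENT_OTEL_URL", DEFAULT_OTEL_URL)

# Simulate the EU export shown above.
os.environ["CONFIDENT_OTEL_URL"] = "https://eu.otel.confident-ai.com"
print(resolve_otel_endpoint())  # → https://eu.otel.confident-ai.com
```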

End-to-End Evals

deepeval lets you evaluate Strands agents running on AgentCore in under a minute.

Configure AgentCore

Pass agent_metrics to the instrument_agentcore method.

main.py
import nest_asyncio
nest_asyncio.apply()

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.metrics import AnswerRelevancyMetric

instrument_agentcore(
    name="AgentCore Tracing",
    environment="development",
    agent_metrics=[AnswerRelevancyMetric()],
)

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt")
    result = agent(user_message)
    return {"result": result.message}

response = invoke({"prompt": "Make a funny joke"})
info

Evaluations are supported for Strands agents. Only metrics whose required parameters are limited to input, output, and tools_called are eligible for evaluation.
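The eligibility rule above can be pictured as a plain-Python check: a metric qualifies only when every parameter it requires is one the integration can populate. `SUPPORTED_PARAMS` and `is_eligible` are illustrative names for this sketch, not part of deepeval.

```python
# Parameters the AgentCore integration can populate on a Strands agent span.
SUPPORTED_PARAMS = {"input", "output", "tools_called"}

def is_eligible(required_params: set[str]) -> bool:
    """A metric qualifies only if every parameter it needs is supported."""
    return required_params <= SUPPORTED_PARAMS

print(is_eligible({"input", "output"}))             # → True
print(is_eligible({"input", "retrieval_context"}))  # → False (needs context)
```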

Run evaluations

Create an EvaluationDataset and invoke your AgentCore application for each golden inside the evals_iterator() loop to run end-to-end evaluations.

main.py
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=False)):
    response = invoke({"prompt": golden.input})

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

Evals in Production

To run online evaluations in production, replace the metrics with metric collection strings from Confident AI, and run your Strands agent on AgentCore as usual:

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore

instrument_agentcore(
    name="AgentCore Tracing",
    environment="production",
    trace_metric_collection="my-trace-collection",
    agent_metric_collection="my-agent-collection",
    llm_metric_collection="my-llm-collection",
    tool_metric_collection_map={
        "get_weather": "my-tool-collection",
    },
)

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt")
    result = agent(user_message)
    return {"result": result.message}

response = invoke({"prompt": "Make a funny joke"})

deepeval lets you run component-level evals at different span levels: Trace, Agent, LLM, and Tool. You can pass a metric collection for any of these spans through the instrument_agentcore method.
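Conceptually, tool_metric_collection_map routes each tool span to a collection by tool name; tools without an entry simply aren't evaluated. The helper below is a sketch of that routing, not deepeval's internals.

```python
# Mirrors the tool_metric_collection_map argument shown above.
tool_metric_collection_map = {"get_weather": "my-tool-collection"}

def collection_for_tool(tool_name: str):
    """Return the metric collection for a tool span, or None if unconfigured."""
    return tool_metric_collection_map.get(tool_name)

print(collection_for_tool("get_weather"))  # → my-tool-collection
print(collection_for_tool("get_news"))     # → None (no collection configured)
```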
