
AWS AgentCore


Amazon Bedrock AgentCore is AWS's managed runtime for deploying and scaling AI agents.

The deepeval integration auto-instruments AgentCore apps through OpenTelemetry. Every agent invocation, model call, and tool call becomes a span you can inspect, without wiring trace structure by hand.

deepeval's AgentCore integration enables you to:

  • Auto-instrument every AgentCore invocation: each app entrypoint call produces a trace, and each agent, LLM, and tool call becomes a component span.
  • Evaluate traces or model / agent components with any deepeval metric.
  • Run evals from scripts or CI/CD: same metrics, different surfaces.
  • Customize trace and span data at runtime from tool bodies, wrappers, or staged span config.

Getting Started

Installation

pip install -U deepeval bedrock-agentcore strands-agents opentelemetry-sdk opentelemetry-exporter-otlp-proto-http

Under the hood, the integration registers an OpenTelemetry span processor that translates AgentCore spans into deepeval traces.
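To make the span-processor idea concrete, here is a minimal standard-library sketch of the pattern. It uses neither the OpenTelemetry SDK nor deepeval's actual classes: `ToySpan`, `CollectingProcessor`, and its `on_end` hook are hypothetical names chosen for illustration. The sketch only shows a processor receiving finished spans and grouping them by trace ID, which is roughly the point at which a real processor would translate them.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToySpan:
    """Hypothetical stand-in for a finished OpenTelemetry span."""
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)


class CollectingProcessor:
    """Toy processor: receives each finished span and files it under its trace."""

    def __init__(self):
        self.traces = defaultdict(list)

    def on_end(self, span: ToySpan) -> None:
        # A real processor would translate the span into the evaluation
        # tool's own trace format here; this sketch only groups by trace ID.
        self.traces[span.trace_id].append(span.name)


processor = CollectingProcessor()
for span in [
    ToySpan("t1", "llm_call"),
    ToySpan("t1", "tool_call"),
    ToySpan("t2", "llm_call"),
]:
    processor.on_end(span)

print(dict(processor.traces))
```

Grouping by trace ID is what lets a flat stream of spans be reassembled into one trace per app invocation.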

Instrument and evaluate

Call instrument_agentcore(...) before creating or invoking your AgentCore app. From that point on, AgentCore spans are available to deepeval.

agentcore_agent.py
from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_agentcore()

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    result = agent(payload["prompt"])
    return {"result": result.message}

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="Help me return my order.")])

# `evals_iterator` loops through goldens and applies metrics.
for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    invoke({"prompt": golden.input}) # Produces trace for evaluation

Done ✅. You've run your first eval with full traceability into AgentCore via deepeval.

What gets traced

Each AgentCore app invocation produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for each step the agent took:

  • Agent spans: Strands agent invocations and agent workflow steps.
  • LLM spans: model calls emitted through AgentCore / Strands.
  • Tool spans: tool calls and function executions.

Trace                                    ← what the user observes
└── Agent: refund_assistant              ← one AgentCore app invocation
    ├── LLM: amazon.nova-lite-v1:0       ← component span: model plans
    ├── Tool: lookup_order               ← component span: tool input + output
    └── LLM: amazon.nova-lite-v1:0       ← component span: final answer

The trace and its component spans are independently evaluable.
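As a rough mental model of that structure, the sketch below rebuilds a trace tree from flat span records linked by parent IDs. `SpanRecord`, `build_tree`, and `render` are hypothetical helpers, not part of deepeval or AgentCore; the span names mirror the diagram above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical flat span record: an id, an optional parent, a kind, a name."""
    span_id: str
    parent_id: Optional[str]
    kind: str  # "agent", "llm", or "tool"
    name: str
    children: list = field(default_factory=list)


def build_tree(records):
    """Link each record to its parent and return the root spans."""
    by_id = {r.span_id: r for r in records}
    roots = []
    for r in records:
        parent = by_id.get(r.parent_id) if r.parent_id else None
        if parent is None:
            roots.append(r)
        else:
            parent.children.append(r)
    return roots


def render(span, depth=0):
    """Return indented lines, one per span, in depth-first order."""
    lines = ["  " * depth + f"{span.kind}: {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


records = [
    SpanRecord("a", None, "agent", "refund_assistant"),
    SpanRecord("b", "a", "llm", "amazon.nova-lite-v1:0"),
    SpanRecord("c", "a", "tool", "lookup_order"),
    SpanRecord("d", "a", "llm", "amazon.nova-lite-v1:0"),
]
roots = build_tree(records)
tree_lines = render(roots[0])
print("\n".join(tree_lines))
```

Because every node is an ordinary record, a metric can be pointed at the root (the whole invocation) or at any single child.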

Running evals

There are two surfaces for running evals against an AgentCore app. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one AgentCore app invocation; failing metrics fail the test, which fails the build.

test_agentcore_agent.py
import pytest

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.metrics import TaskCompletionMetric

instrument_agentcore()

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload):
    result = agent(payload["prompt"])
    return {"result": result.message}

dataset = EvaluationDataset(goldens=[
    Golden(input="Help me return my order."),
    Golden(input="Explain my refund options."),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agentcore_agent(golden: Golden):
    invoke({"prompt": golden.input})
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:

deepeval test run test_agentcore_agent.py
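The pass/fail gate comes down to comparing each metric score with a threshold and raising when one falls short, which pytest then reports as a failed test. The sketch below is a stand-in for that idea only: `ToyMetric` and `gate` are hypothetical, and deepeval's own assert_test does considerably more, including actually running the metric.

```python
from dataclasses import dataclass


@dataclass
class ToyMetric:
    """Hypothetical metric result: a name, a score in [0, 1], and a threshold."""
    name: str
    score: float
    threshold: float = 0.5

    def passed(self) -> bool:
        return self.score >= self.threshold


def gate(metrics):
    """Raise AssertionError listing every metric below its threshold."""
    failures = [m for m in metrics if not m.passed()]
    if failures:
        details = ", ".join(
            f"{m.name} ({m.score:.2f} < {m.threshold:.2f})" for m in failures
        )
        raise AssertionError(f"Metrics failed: {details}")


gate([ToyMetric("task_completion", 0.8)])  # passes silently

try:
    gate([ToyMetric("task_completion", 0.3)])
except AssertionError as exc:
    failure_message = str(exc)
    print(failure_message)
```

Raising AssertionError is the design choice that matters: it is the signal pytest already treats as a failure, so no extra CI wiring is needed.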

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one app invocation; metrics score the resulting trace.

agentcore_agent.py
dataset = EvaluationDataset(goldens=[
    Golden(input="Help me return my order."),
    Golden(input="Explain my refund options."),
])

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    invoke({"prompt": golden.input})

Applying metrics to components

The metrics=[...] you passed to evals_iterator evaluate the trace. To evaluate a component instead (a specific LLM call or agent span), stage the metric with the appropriate next_*_span(...) wrapper before invoking the app.

Agent spans

agentcore_agent.py
from deepeval.tracing import next_agent_span
...

def run_agentcore(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return invoke({"prompt": prompt})

LLM calls

agentcore_agent.py
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...

def run_agentcore(prompt: str):
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        return invoke({"prompt": prompt})

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.

Customizing trace and span data at runtime

Trace-level fields you pass to instrument_agentcore(...) are defaults. For anything dynamic, the right API depends on where your code runs.

AgentCore creates most of the trace structure for you, which means the agent, LLM, and tool spans are mostly hidden behind the app invocation. Calls like update_current_trace(...) and update_current_span(...) only work while there is an active deepeval trace/span in context. In practice, tool bodies are the clearest mutation point, because AgentCore has already opened the trace and tool span before your function runs.

If you need to customize from outside a tool, use instrument_agentcore(...) for static defaults, next_*_span(...) to stage config for the next AgentCore-created span, or @observe / with trace(...) when you own the outer operation.

Trace-level fields from inside a tool

agentcore_agent.py
from deepeval.tracing import update_current_trace
...

def lookup_order(order_id: str) -> dict:
    order = orders_db.get(order_id)
    update_current_trace(user_id=order["user_id"], metadata={"order_id": order_id})
    return order

Span-level fields from inside a tool

agentcore_agent.py
from deepeval.tracing import update_current_span
...

def lookup_order(order_id: str) -> dict:
    order = orders_db.get(order_id)
    update_current_span(metadata={"order_id": order_id}, output=order)
    return order

Advanced patterns

The primitives above (instrument_agentcore(...), @observe, with trace(...), next_*_span(...), update_current_*(...)) compose around one boundary: AgentCore owns the auto-instrumented spans, and your code customizes them from the places it can actually see.

Evaluate subagents with next_*_span

next_*_span(metrics=[...]) stages a metric for the next matching AgentCore component span. Use this when you want to evaluate a subagent or model step instead of the full trace. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).

agentcore_agent.py
from deepeval.tracing import next_agent_span
...

def run_agent(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return invoke({"prompt": prompt})

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the TaskCompletionMetric is attached to the next agent span, so CI/CD and scripts only need to run the subagent.

This is how you'd run it:

test_agentcore_agent.py
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    run_agent(golden.input)
    assert_test(golden=golden)

Then finally:

deepeval test run test_agentcore_agent.py

Or, from a script, iterate over the dataset without trace-level metrics:

agentcore_agent.py
...

for golden in dataset.evals_iterator():
    run_agent(golden.input)

Wrap an AgentCore invocation in @observe

When the AgentCore app is part of a larger operation, decorate the outer function with @observe. AgentCore spans nest under your observed span automatically.

agentcore_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    response = invoke({"prompt": prompt})
    return response["result"]
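The nesting behaviour can be modelled with a decorator that records which span was active when it started. This is a toy illustration, not how @observe is implemented: `observed` is a hypothetical decorator, and the inner `invoke` here merely stands in for the auto-instrumented AgentCore entrypoint.

```python
import contextvars
import functools

_active = contextvars.ContextVar("active", default=None)
finished = []  # (span name, parent name) pairs, in completion order


def observed(name):
    """Hypothetical decorator: run the function inside a span called `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _active.get()
            token = _active.set(name)
            try:
                return func(*args, **kwargs)
            finally:
                _active.reset(token)
                finished.append((name, parent))
        return wrapper
    return decorator


@observed("agent_invocation")  # stands in for the auto-instrumented app span
def invoke(payload):
    return {"result": payload["prompt"].upper()}


@observed("respond_to_user")  # the outer operation you own
def respond_to_user(prompt):
    return invoke({"prompt": prompt})["result"]


answer = respond_to_user("hello")
print(answer, finished)
```

The inner span finishes first and reports the outer span as its parent, while the outer span has no parent and therefore becomes the root of the trace.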

API reference

instrument_agentcore(...) accepts the following trace-level kwargs. Each one is a default; runtime calls always win.

Kwarg              Type       Description
name               str        Default trace name. Override at runtime via update_current_trace.
thread_id          str        Default thread identifier. Useful for grouping conversational turns.
user_id            str        Default actor identifier. Override per-request via update_current_trace.
metadata           dict       Default trace metadata. Merged with runtime overrides; runtime wins.
tags               list[str]  Default tags applied to every trace produced by this app.
environment        str        One of "development", "staging", "production", "testing".
metric_collection  str        Default metric collection applied at the trace level.
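The precedence rule (defaults first, runtime wins, metadata merged) can be sketched as a small pure function. `resolve_trace_fields` is a hypothetical helper written for illustration; it is not part of deepeval, and the exact merge semantics inside the library may differ in detail.

```python
def resolve_trace_fields(defaults, runtime):
    """Combine default trace fields with runtime overrides; runtime wins.

    `metadata` is merged key by key; every other field is replaced outright.
    """
    resolved = dict(defaults)
    for key, value in runtime.items():
        if key == "metadata":
            resolved["metadata"] = {**defaults.get("metadata", {}), **value}
        else:
            resolved[key] = value
    return resolved


defaults = {
    "name": "refund_assistant",
    "environment": "development",
    "metadata": {"team": "support", "region": "us-east-1"},
}
runtime = {
    "user_id": "u-42",
    "metadata": {"region": "eu-west-1", "order_id": "A1"},
}

resolved = resolve_trace_fields(defaults, runtime)
print(resolved)
```

Here the runtime `region` replaces the default, the default `team` survives, and `user_id` is added because no default was set for it.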

For runtime helpers (update_current_trace, update_current_span, next_agent_span, next_llm_span) and the decorator, assertion, and context-manager surface (@observe, assert_test, with trace(...)), see the tracing reference.
