Strands Agents
The Strands Agents SDK is a Python framework for building agents with tools, streaming, and multi-agent patterns.
The deepeval integration auto-instruments Strands apps through OpenTelemetry. Every agent invocation, model call, and tool call becomes a span you can inspect, without wiring trace structure by hand.
deepeval's Strands integration enables you to:
- Auto-instrument every Strands Agent invocation: each agent call produces a trace, and each agent, LLM, and tool call becomes a component span.
- Evaluate traces or model/agent components with any deepeval metric.
- Run evals from scripts or CI/CD: same metrics, different surfaces.
- Customize trace and span data at runtime from tool bodies, wrappers, or staged span config.
Getting Started
Installation
```bash
pip install -U deepeval strands-agents
```

Under the hood, the integration registers an OpenTelemetry span processor that translates Strands spans into deepeval traces.
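If you are curious what that registration looks like, here is a minimal conceptual sketch using the OpenTelemetry SDK. The DeepEvalSpanProcessor name and its internals are illustrative assumptions, not the integration's actual code:

```python
# Conceptual sketch only: `DeepEvalSpanProcessor` is an illustrative
# name, not deepeval's real internals.
from opentelemetry import trace
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor, TracerProvider

class DeepEvalSpanProcessor(SpanProcessor):  # hypothetical
    def on_end(self, span: ReadableSpan) -> None:
        # Translate the finished Strands span (agent, LLM, or tool)
        # into a deepeval trace component.
        ...

provider = TracerProvider()
provider.add_span_processor(DeepEvalSpanProcessor())
trace.set_tracer_provider(provider)
```

The practical upshot: calling instrument_strands() once at startup is all the wiring you do yourself.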
Instrument and evaluate
Call instrument_strands(...) before creating or invoking your Strands agent. From that point on, Strands spans are available to deepeval.
```python
import os

from strands import Agent
from strands.models.openai import OpenAIModel

from deepeval.integrations.strands import instrument_strands
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_strands()

model = OpenAIModel(
    client_args={"api_key": os.environ["OPENAI_API_KEY"]},
    model_id="gpt-4o-mini",
)
agent = Agent(model=model, system_prompt="You are a helpful assistant.")

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="Help me return my order.")])

# `evals_iterator` loops through goldens and applies metrics.
for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    agent(golden.input)  # produces a trace for evaluation
```

Done! You've run your first eval with full traceability into Strands via deepeval.
What gets traced
Each Strands agent invocation produces a trace: the end-to-end unit your user observes. Inside that trace are component spans for each step the agent took:
- Agent spans: Strands agent invocations and agent workflow steps.
- LLM spans: model calls emitted through Strands.
- Tool spans: tool calls and function executions.
```text
Trace (what the user observes)
└── Agent: support_agent        one Strands agent invocation
    ├── LLM: gpt-4o-mini        component span: model plans
    ├── Tool: lookup_order      component span: tool input + output
    └── LLM: gpt-4o-mini        component span: final answer
```

The trace and its component spans are independently evaluable.
Running evals
There are two surfaces for running evals against a Strands agent. Pick based on where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one agent run; failing metrics fail the test, which fails the build.
```python
import os

import pytest

from strands import Agent
from strands.models.openai import OpenAIModel

from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.strands import instrument_strands
from deepeval.metrics import TaskCompletionMetric

instrument_strands()

model = OpenAIModel(
    client_args={"api_key": os.environ["OPENAI_API_KEY"]},
    model_id="gpt-4o-mini",
)
agent = Agent(model=model)

dataset = EvaluationDataset(goldens=[
    Golden(input="Help me return my order."),
    Golden(input="Explain my refund options."),
])


@pytest.mark.parametrize("golden", dataset.goldens)
def test_strands_agent(golden: Golden):
    agent(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])
```

Run it with:
```bash
deepeval test run test_strands_agent.py
```

In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent invocation; metrics score the resulting trace.
```python
dataset = EvaluationDataset(goldens=[
    Golden(input="Help me return my order."),
    Golden(input="Explain my refund options."),
])

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    agent(golden.input)
```

Applying metrics to components
The metrics=[...] you pass to evals_iterator evaluate the trace as a whole. To evaluate a component instead, such as a specific LLM call or agent span, stage the metric with the appropriate next_*_span(...) wrapper before calling the agent.
Agent spans
```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span

...

def run_strands(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return agent(prompt)
```

LLM calls
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span

...

def run_strands(prompt: str):
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        return agent(prompt)
```

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.
Customizing trace and span data at runtime
Trace-level fields you pass to instrument_strands(...) are defaults. For anything dynamic, the right API depends on where your code runs.
Strands creates most of the trace structure for you, which means the agent, LLM, and tool spans are mostly hidden behind the app invocation. Calls like update_current_trace(...) and update_current_span(...) only work while there is an active deepeval trace/span in context. In practice, tool bodies are the clearest mutation point, because Strands has already opened the trace and tool span before your function runs.
If you need to customize from outside a tool, use instrument_strands(...) for static defaults, next_*_span(...) to stage config for the next Strands-created span, or @observe / with trace(...) when you own the outer operation.
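For instance, here is a hedged sketch of owning the outer operation with with trace(...). The exact keyword arguments trace(...) accepts are an assumption here, mirrored from the instrument_strands(...) kwargs documented in the API reference below:

```python
from deepeval.tracing import trace

# Assumption: `trace(...)` accepts trace-level fields such as `name`
# and `tags`, mirroring the `instrument_strands(...)` defaults.
with trace(name="returns_flow", tags=["returns"]):
    agent("Help me return my order.")  # Strands spans nest under this trace
```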
Trace-level fields from inside a tool
```python
from deepeval.tracing import update_current_trace

...

def lookup_order(order_id: str) -> dict:
    order = orders_db.get(order_id)
    update_current_trace(user_id=order["user_id"], metadata={"order_id": order_id})
    return order
```

Span-level fields from inside a tool
```python
from deepeval.tracing import update_current_span

...

def lookup_order(order_id: str) -> dict:
    order = orders_db.get(order_id)
    update_current_span(metadata={"order_id": order_id}, output=order)
    return order
```

Advanced patterns
The primitives above (instrument_strands(...), @observe, with trace(...), next_*_span(...), update_current_*(...)) compose around one boundary: Strands owns the auto-instrumented spans, and your code customizes them from the places it can actually see.
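As an illustration, here is a sketch composing two of these primitives; the handle_ticket function is a hypothetical outer operation, not part of either API:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span, observe

# `handle_ticket` is hypothetical: @observe owns the outer span, while
# next_llm_span stages a metric on the LLM span Strands will create
# inside the agent call.
@observe(name="handle_ticket")
def handle_ticket(prompt: str):
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        return agent(prompt)
```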
Evaluate subagents with next_*_span
next_*_span(metrics=[...]) stages a metric for the next matching Strands component span. Use this when you want to evaluate a subagent or model step instead of the full trace. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).
```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import next_agent_span

...

def run_agent(prompt: str):
    with next_agent_span(metrics=[TaskCompletionMetric()]):
        return agent(prompt)
```

No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the TaskCompletionMetric is attached to the next agent span, so CI/CD and scripts only need to run the subagent.
This is how you'd run it:
```python
import pytest

from deepeval import assert_test

...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    run_agent(golden.input)
    assert_test(golden=golden)
```

Then finally:

```bash
deepeval test run test_strands_agent.py
```

Or from a script:

```python
...

for golden in dataset.evals_iterator():
    run_agent(golden.input)
```

Wrap a Strands invocation in @observe
When the agent is part of a larger operation, decorate the outer function with @observe. Strands spans nest under your observed span automatically.
```python
from deepeval.tracing import observe

...

@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    result = agent(prompt)
    return result.message.get("content", [{}])[0].get("text", "")
```

API reference
instrument_strands(...) accepts the following trace-level kwargs. Each one is a default; runtime calls always win.
| Kwarg | Type | Description |
|---|---|---|
| name | str | Default trace name. Override at runtime via update_current_trace. |
| thread_id | str | Default thread identifier. Useful for grouping conversational turns. |
| user_id | str | Default actor identifier. Override per-request via update_current_trace. |
| metadata | dict | Default trace metadata. Merged with runtime overrides; runtime wins. |
| tags | list[str] | Default tags applied to every trace produced by this app. |
| environment | str | One of "development", "staging", "production", "testing". |
| metric_collection | str | Default metric collection applied at the trace level. |
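Putting the table together, a sketch of setting these defaults at startup (the specific values are illustrative):

```python
# All values here are illustrative; every kwarg is a default that
# runtime calls like update_current_trace(...) override.
instrument_strands(
    name="support-bot",
    environment="staging",
    tags=["strands", "support"],
    metadata={"service": "returns"},
)
```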
For runtime helpers (update_current_trace, update_current_span, next_agent_span, next_llm_span) and the decorator and context-manager surface (@observe, with trace(...), assert_test), see the tracing reference.