🐴 2026

So far in 2026, DeepEval has focused on making evaluation more reliable, observable, and easier to run across real-world LLM systems:

Tracing & observability improved with richer trace fields, better OTel exports, and deeper integration coverage
Model support expanded with new frontier and provider model entries, more accurate pricing, and safer capability handling
Component-level evals got cleaner with active-trace assertions, structured result exports, and less duplicate logging
Conversation simulation became more flexible with controller APIs, custom templates, and stronger test coverage
Docs & release tooling moved forward with the new docs site, changelog automation, and clearer tracing guides

Thank you to our contributors

First things first, DeepEval exists because of everyone who opened issues, reviewed changes, wrote docs, and merged code this year. Thank you for shaping every release with us.

April focused on simplifying core APIs while expanding model, tracing, and simulator capabilities. Testing and golden assertions were streamlined by removing legacy hooks, adding configurable structured run outputs, deprecating per-result logs, and tightening error handling so misconfigured eval runs fail loudly. The release added support for new OpenAI and Anthropic models with improved multimodal/structured output handling, more accurate token/cost reporting, and safer behavior when logprob-dependent metrics aren’t supported. Observability and workflow got a major boost with richer trace correlation fields like turn_id and test_case_id, optional internal span instrumentation, a more in

Backward Incompatible Change

v3.9.9

Remove the legacy API_KEY alias and require CONFIDENT_API_KEY for Confident uploads. Update dataset loading to use metadata instead of additional_metadata, and refresh docs/examples to use SingleTurnParams for GEval evaluation parameters. (#2635) (Jeffrey Ip)

v3.9.8

Remove the observed_callback hook from assert_test and rely on the active trace when asserting against a golden. Add results_folder/results_subfolder options to control where full structured test-run JSON is saved, and deprecate per-result .log output. (#2622) (Jeffrey Ip)
Remove the metric logging manager and related configuration options, simplifying debug settings and API endpoints. Update CI to run the simulator test suite and reorganize conversation simulator tests for the new layout. (#2629) (Jeffrey Ip)

New Feature

v3.9.9

Add metadata and tags support to both SingleTurnParams and MultiTurnParams, making it easier to pass custom context through single-turn and conversational evaluation workflows. (#2635) (Jeffrey Ip)
Add a generate CLI command to create synthetic goldens from documents, contexts, scratch prompts, or existing goldens, with configurable output format, concurrency, and styling options. (#2633) (Jeffrey Ip)
Add a Cursor/skills-compatible deepeval skill with templates and guidance for generating datasets, creating pytest eval suites, enabling tracing, and iterating on evaluation failures. (#2634) (Jeffrey Ip)

v3.9.8

Add support for the claude-opus-4-7 model, including multimodal inputs, structured outputs, and JSON mode, with updated pricing metadata. (#2617) (Tanay)
Add a conversation simulator controller API with proceed()/end() decisions, plus a public SimulationTemplate. Update and expand simulator tests and CI coverage, including safer defaults when controllers return None or unexpected values. (#2628) (Jeffrey Ip)

v3.9.6

Add support for the gpt-5.4-mini model. Metrics that rely on log probabilities now detect when the model doesn’t support them and avoid failing with unexpected errors. (#2603) (Tanay)

v3.9.5

Add support for extracting confident.trace.test_case_id in ConfidentSpanExporter so OTel-exported traces can propagate testCaseId and be linked to the right test case instead of always being null. (#2570) (Alex Maggioni)
Add prompt branch support, including pushing to a specific branch and listing, creating, renaming, and deleting branches. Cache and commit lookups can now be scoped by branch to keep versions organized. (#2583) (Vamshi Adimalla)

Improvement

v3.9.9

Improve OpenAI defaults by switching the default GPT model to gpt-5.4 when no model is configured. Add model metadata for gpt-5.4 (and its snapshot alias) and update JSON output support flags for relevant models. (#2630) (Jeffrey Ip)
Improve changelog and docs parsing by supporting React-style comment markers for release note markers, PR tags, and changelog-ignore blocks, while remaining compatible with the legacy HTML comment format. (#2631) (Jeffrey Ip)

v3.9.8

Improve documentation site by migrating to a new Next.js-based setup with updated layouts and built-in search, along with refreshed docs tooling and ignores for generated build artifacts. (#2624) (Jeffrey Ip)
Improve docs site link previews and layout by adding a default Open Graph image, tightening homepage spacing, and fixing overflow/scrolling behavior in code and terminal demo blocks. (#2627) (Jeffrey Ip)

v3.9.7

Improve telemetry dependency compatibility by using PostHog 7.x on Python 3.10+ while keeping PostHog 5–6 on Python 3.9 via environment markers. (#2605) (Manoj Kumar Nagabandi)

v3.9.6

Add new tracing guides for multi-turn chatbots, RAG flows, and AI agents, including examples for grouping turns with thread_id and instrumenting spans for better end-to-end observability. (#2581) (Vamshi Adimalla)
Add a manual GitHub Actions workflow to generate changelog updates for a given year or tag range and open an update pull request automatically. (#2588) (Vamshi Adimalla)
Add optional internal tracing for metric and model methods called inside @observe spans, controlled by CONFIDENT_TRACE_INTERNAL. When enabled, key LLM generation methods and metric execution paths are captured with more detailed nested spans; when disabled, this extra instrumentation is skipped to reduce overhead. (#2589) (Kritin Vongthongsri)

v3.9.5

Add optional turn_id and test_case_id fields to tracing across supported integrations, and include them in exported trace payloads for easier correlation of multi-turn runs. (#2576) (Kritin Vongthongsri)

Bug Fix

v3.9.8

Fix multi-turn Pydantic trace input to use the most recent user message instead of the first. This prevents follow-up questions from incorrectly showing the initial question as the trace input. (#2614) (Brian Romain)
Fix a KeyError in MLLMImage.parse_multimodal_string when parsing [DEEPEVAL:IMG:<id>] markers for images that aren’t already registered. Newly created images are now kept alive for the caller so registry lookups don’t fail. (#2615) (Tanay)
Fix Anthropic Opus 4.7 requests by omitting temperature when the model does not support it, preventing API errors in both sync and async generation. (#2618) (Tanay)
Prevent evals_iterator runs from silently doing nothing by raising a clear error when no metrics are declared at any level. This avoids misleading end-of-run messages and makes missing metric configuration easier to diagnose. (#2621) (Jeffrey Ip)

v3.9.7

Fix loading single-turn golden tool calls from CSV by parsing tools_called and expected_tools as JSON objects instead of splitting by a delimiter, matching the format produced by save_as. This prevents errors and ensures tool call goldens round-trip correctly through CSV. (#2565) (Sean Kelley)
Add missing Anthropic model entries for Claude Opus 4.6 and Sonnet 4.6, including dated IDs and short aliases. Fix Opus 4.5 pricing so cost reports are no longer inflated. Restore cost tracking for default *-latest models by registering their IDs so require_costs() no longer falls back to None. (#2584) (Ajay Sai Reddy Desireddy)
Fix conversion of conversational goldens to preserve expected_outcome, preventing metrics that rely on it from failing validation or skipping evaluation after conversion. (#2598) (aerosta)
Fix OpenAI tracing spans for newer gpt-5.x/Responses API models to correctly record input/output token counts and populate per-token cost data. This prevents None values when instrumenting an OpenAI client via patch_openai_client. (#2601) (tiffanychum)
Fix PydanticAI tracing integrations by correctly classifying agent vs LLM spans and preventing mislabeling when agent attributes are present. Improve message normalization across instrumentation versions and ensure trace context is properly reset after a trace ends. (#2606) (Vamshi Adimalla)
Fix schema construction for structured outputs by correctly unwrapping Optional[...] types and detecting nested Pydantic models through full inheritance. This prevents Optional[List[int]] from being misclassified as STRING and ensures derived BaseModel types are recognized as OBJECT. (#2611) (SamSi0322)
Fix _mcp_interaction detection so MCP usage is correctly recognized under Pydantic v2. This prevents MCP-related metrics from returning near-zero scores when tools, resources, or prompts were actually called. (#2612) (SamSi0322)

v3.9.5

Fix type annotations for model and using_native_model in base metric classes by making them proper optional fields with None defaults, improving static type checking and reducing annotation-related errors. (#2574) (Tommy Beadle)
Fix docs structured data by removing the Product schema for metric pages and generating only Article schema. This avoids incorrect organization and product metadata in the rendered schema output. (#2577) (Vamshi Adimalla)

March

March focused on making LLM and agent evaluations more observable and configurable, with new AgentCore and OpenInference integrations that capture richer OpenTelemetry traces and export them via OTLP with improved metadata, tagging, and metric collection controls. We also expanded provider flexibility and reliability, from environment-driven backend selection and better Bedrock and Azure auth handling to more accurate usage extraction across LangChain/LangGraph versions. Across the toolchain, numerous fixes improved correctness and stability in concurrent and streaming evaluations, CLI aggregation, caching behavior, and sandboxed HumanEval execution, while memory and Windows cleanup issues в

New Feature

v3.9.1

Add AgentCore integration with OpenTelemetry instrumentation, including span classification and message extraction for agent, tool, and LLM traces. Support exporting telemetry via OTLP with configurable metadata, tags, and metric collection, and provide a test mode exporter for local validation. (#2534) (Vamshi Adimalla)
Add OpenInference integration to intercept OpenTelemetry spans, extract LLM/agent/tool inputs and outputs, and export traces via OTLP. Provides configurable metadata, tags, and metric collection, and surfaces clear errors when required OpenTelemetry deps or CONFIDENT_API_KEY are missing. (#2555) (Vamshi Adimalla)

v3.8.9

Add custom_column_key_values to LLMTestCase to store custom metadata as a Dict[str, str]. Accept both custom_column_key_values and customColumnKeyValues on input and serialize as customColumnKeyValues, with type validation for safer usage. (#2530) (Brian Romain)

Improvement

v3.9.3

Improve code formatting consistency by reformatting the codebase with the latest Black rules, reducing lint noise and keeping style checks stable across environments. (#2567) (Vamshi Adimalla)
Add documentation describing the updated Amazon Bedrock integration behavior, helping users configure and use Bedrock correctly after recent changes. (#2571) (Vamshi Adimalla)

v3.9.1

Add AWS AgentCore integration documentation and CI coverage, and improve span extraction and test-mode handling. Also allow tuning OTLP batch exporter settings to better control export timing and batch size. (#2544) (Vamshi Adimalla)
Fix AgentCore tracing to capture agent and trace input/output reliably, avoid duplicate traces during evaluation, and recognize additional GenAI span attributes. Also simplify instrument_agentcore by removing OTEL exporter tuning options and update docs to use evals_iterator() for end-to-end eval runs. (#2545) (Vamshi Adimalla)
Support selecting the evaluation provider via environment variables and passing a model name as a string when initializing metrics. This makes it easier to switch between OpenAI, Anthropic, Gemini, Azure, and local backends without changing code. (#2550) (Vamshi Adimalla)

v3.8.9

Support setting metric_collection on the active trace and span via the update helpers. This makes metric collection configuration consistent when updating an in-progress trace rather than only at trace creation. (#2532) (Vamshi Adimalla)

Bug Fix

v3.9.3

Fix chunk-size validation to treat collection.count() as chunk count rather than token count, preventing incorrect errors when generating contexts. Improve the guidance in the exception message with clearer suggestions to reduce chunk_size and chunk_overlap. (#2468) (Xuan-Phung Pham)
Fix initialize_model() to recognize Amazon Bedrock configuration so USE_AWS_BEDROCK_MODEL=YES no longer falls back to the default GPT model. This prevents silently using the wrong provider when Bedrock is intended. (#2537) (Parafee41)
Fix LangChain/LangGraph token usage extraction by reading usage_metadata with a fallback to legacy response_metadata, improving callback accuracy across versions. Improve test stability by retrying flaky Confident and integration tests and updating integration dependencies and fixtures. (#2557) (Vamshi Adimalla)
Fix async trace evaluation so per-trace metrics aren’t shared across concurrent tasks. This prevents concurrent runs from overwriting score, reason, and success, eliminating timing-dependent and inconsistent results. (#2559) (aerosta)
Fix pydantic-ai tracing so thread_id, name, and metadata set via the current trace context are exported on span start. Falls back to settings for compatibility and merges settings metadata with per-request metadata. (#2563) (Oluwanifemi Adeyemi)

v3.9.1

Fix Azure OpenAI keyless authentication by deferring credential checks to the OpenAI SDK. Only fail fast when an explicit credential is provided but empty, while preserving key-based auth and handling both SecretStr and string credentials consistently. (#2464) (ppon1086)
Fix a Pydantic ValidationError in KnowledgeRetentionMetric._extract_knowledges by correctly unpacking LLM response dicts when creating Knowledge objects, preventing double-wrapping and improving validation reliability. (#2513) (Diego Gómez Moreno)
Fix evaluate() in CLI runs to stop resetting the test run manager so results from multiple files are accumulated and reported together. Add a skip_reset option for manual control outside CLI mode. Ensure test case order values are always unique to prevent earlier results being overwritten or shown as skipped. (#2529) (Alex Maggioni)
Fix CrewAI tool tracing when events arrive out of order from the thread pool. Tool spans are now created and closed reliably using finished-event data, with corrected timestamps and consistent propagation of called tools to the parent span. (#2547) (Vamshi Adimalla)
Fix ConversationalGEval to make top_logprobs configurable. Add a top_logprobs parameter to the initializer (default 20) and use it in both sync and async execution paths instead of a hardcoded value. (#2549) (Szymon Cogiel)
Fix a memory leak when processing many multimodal test cases by storing _MLLM_IMAGE_REGISTRY in a weakref.WeakValueDictionary. Unreferenced MLLMImage instances are now garbage-collected automatically, preventing unbounded memory growth in large batch runs. (#2551) (eason)
Fix metric cache loading to ignore incomplete cached entries and fall back to recomputing when no score is available. Progress reporting now updates correctly when cached results are used. (#2552) (Konstantin)

v3.8.9

Fix generator tracing so observers record the final yielded item when a generator finishes without returning a value, improving captured outputs for streaming workflows. (#2514) (Vamshi Adimalla)
Fix FilterTemplate.evaluate_context examples by removing duplicate contexts and replacing them with distinct ones. Each example now has a unique input/output pairing, avoiding repeated contexts with different scores. (#2518) (Fiza Mukhtar)
Fix ContextConstructionConfig.critic_model defaulting to a new model when unset. It now falls back to the model passed to Synthesizer, so you only need to specify a custom model once when generating goldens from docs. (#2520) (Br1an)
Fix HumanEval evaluation so test assertions run against the generated function by executing the function and tests in the same sandboxed context. Also treat runtime exceptions as failures and expand allowed builtins needed by common HumanEval test cases. (#2521) (Br1an)
Fix temp ChromaDB directory cleanup on Windows by stopping the client system before calling shutil.rmtree. This releases open SQLite file handles and prevents PermissionError: [WinError 32] during teardown, with retries kept as a fallback. (#2522) (Br1an)
Fix calculate_weighted_summed_score to avoid ZeroDivisionError when all token logprobs are filtered out and the probability sum is 0. When no tokens survive filtering, it now falls back to the raw score instead of failing. (#2524) (VENKATA PRANAY BATHINI)
Fix the Goal Accuracy Score equation to remove a circular dependency. The formula now matches the implementation by averaging Goal Evaluation Score and Plan Evaluation Score as two distinct components. (#2526) (JevDev2304)

February

February focused on making prompts, tools, and tracing more consistent after the API migration, with prompt identity and caching now keyed by commit hashes and richer prompt metadata recorded as first-class span fields. Tool calling got a major upgrade across pull/cache and push/update flows, including consistent JSON Schema input_schema generation and better handling of empty schemas. Reliability improvements landed throughout integrations and evaluations, including fixes for Azure reasoning models, Bedrock async credential handling, offline evaluation endpoints, and more robust generator observation. The release also streamlined installs and portability by trimming default OpenTelemetry/

New Feature

v3.8.4

Add support for overriding the default cache directory via an environment variable, allowing you to relocate cached files without changing code. (#2455) (vection)
Add tool support to pulled and cached prompts, exposing any returned tools and converting their structured fields into a JSON Schema input_schema for easier function/tool calling. (#2466) (Vamshi Adimalla)
Add tool support to prompt push and update, including optional tools payloads. Improve tool schema handling by reusing output schema conversion and generating JSON Schema input parameters consistently, even for empty schemas. (#2474) (Vamshi Adimalla)

Improvement

v3.8.7

Fix minor typos in the Tool Correctness metric documentation, including list formatting and a missing final newline for cleaner rendering. (#2504) (nikkie)

v3.8.5

Improve prompt handling after the API migration by switching prompt identity and caching to use commit hashes and adding support for prompt commits endpoints. This helps prompts and logged hyperparameters stay consistent when versions change, and preserves tools and schema data when pulling from cache. (#2475) (Vamshi Adimalla)
Remove the opentelemetry-exporter-otlp-proto-grpc dependency from the default install to reduce required packages and keep OpenTelemetry exporter tooling out of core installs. (#2477) (Tommy Beadle)
Improve dependency compatibility by allowing Click 8.3.x (click < 8.4.0). Also adjust Linux-only pysqlite3-binary handling to prevent install failures on other platforms. (#2486) (Muhammad Faizan)
Improve prompt logging in traces by recording prompt alias, commit hash, label, and version as first-class span fields across integrations and the OTEL exporter. (#2487) (Vamshi Adimalla)
Add support for the AU data region for Confident AI requests. The CLI can now set region to AU, and API routing will automatically use AU endpoints when your API key starts with confident_au_. (#2494) (Vamshi Adimalla)

Bug Fix

v3.8.7

Fix Azure OpenAI requests to omit temperature for reasoning models that don’t support it, preventing Azure API errors. Validation now allows temperature=None, and defaults remain unchanged for standard models. (#2491) (aerosta)
Fix tests by updating the hardcoded valid trace UUID used in annotation test fixtures. (#2505) (Vamshi Adimalla)

v3.8.8

Fix the @observe decorator to capture the final return value from synchronous generator functions, not just yielded items. Also ensure the observer is closed reliably on normal completion, GeneratorExit, and errors. (#2509) (Vamshi Adimalla)

v3.8.6

Fix offline trace/span evaluation requests by sending the correct endpoints and parameters, and add overwrite_metrics support. Also allow passing chatbot_role when evaluating a thread, and avoid printing tool calls when none are present. (#2498) (Vamshi Adimalla)

v3.8.5

Fix the TaskCompletionMetric docs example to use goldens instead of golden, preventing a NameError when iterating over dataset.evals_iterator. (#2454) (Himanshu Kumar Singh)
Fix a_generate_with_schema_and_extract to handle models that return (result, cost) tuples. It now accrues cost when supported and extracts the actual result so downstream processing works without tuple parsing errors. (#2470) (Angelen)
Fix AmazonBedrockModel raising AccessDeniedException during async evaluations when AWS credentials are valid. Improves async-safe aiobotocore session and credential handling to prevent loss under concurrency while keeping sync behavior unchanged. (#2471) (Fiza Mukhtar)
Fix the conversation completeness prompt to better extract user intentions by separating multiple tasks per turn instead of summarizing them into one intention. (#2478) (Vamshi Adimalla)
Fix the knowledge retention extraction template to wrap extracted fields under a top-level data object, matching the expected JSON output format and improving parsing reliability. (#2479) (Vamshi Adimalla)
Fix OpenTelemetry span attribute setting to avoid sending None values for prompt metadata, reducing invalid or noisy telemetry attributes. (#2488) (Vamshi Adimalla)
Fix tracing prompt metadata conversion so prompt_alias, commit hash, label, and version are set consistently and don’t get overwritten by empty prompt objects. Improve prompt tests to avoid cache and alias collisions, making pull and cache behavior more reliable. (#2489) (Vamshi Adimalla)
Fix CrewAI tracing to capture prompt metadata and expected outputs on LLM spans, improve tool-span detection, and make tool completion more reliable when duplicate events or key mismatches occur. Also broaden metric lookup to support both underscored and public attribute names. (#2490) (Vamshi Adimalla)
Fix synthesizer crashes when include_expected_output=False and when max_quality_retries=0, preventing AttributeError and UnboundLocalError. Correct goldens generation so evolutions_used metadata no longer leaks across iterations. Add tests covering these scenarios. (#2493) (aerosta)
Fix a crash in deepeval view after login when telemetry doesn’t create a span. upload_and_open_link() now treats the span as optional and only sets attributes when it exists, so the command completes instead of raising an AttributeError. (#2496) (Jeremy Johnson)

Security

v3.8.5

Improve cleanup on Windows by updating safe_rmtree to use subprocess.run with argument lists instead of os.system. This handles paths with spaces more reliably and reduces the risk of command injection, making directory removal more robust across environments. (#2484) (Rin)

January

January focused on widening model-provider support and making integrations more configurable, including OpenRouter via an OpenAI-compatible API, Azure OpenAI auth via azure_ad_token, and clearer control over Gemini via use_vertexai alongside refreshed default model IDs. Tracing and telemetry saw major stabilization across LangChain, LangGraph, PydanticAI, CrewAI, and OpenTelemetry, with improved context propagation, safer progress handling, regional routing based on API key prefixes, and removal of the New Relic exporter while adding CONFIDENT_OTEL_URL for endpoint control. Evaluation workflows improved with upload() for GEval/ConversationalGEval, richer contextual recall verdict

New Feature

v3.8.1

Add support for OpenRouter with an OpenAI-compatible model API and dynamic model names. Support structured outputs, configurable retries, and custom headers like HTTP-Referer and X-Title. Allow user-provided pricing with fallback to provider-reported pricing. (#2314) (Wang Junwei)
Add an automated changelog generator that builds ClickHouse-style release notes from git tags, with optional GitHub and AI enrichment. Backfill the docs changelog for 2025 to match the new year/month/category layout. (#2403) (Trevor Wilson)
Add upload() support for GEval and ConversationalGEval to send metric definitions (criteria/steps, required parameters, rubric, multi-turn) to Confident AI and store the returned metric id. (#2419) (Vamshi Adimalla)
Add a use_vertexai option to explicitly choose between Vertex AI and Gemini API-key clients when creating GeminiModel. This overrides the GOOGLE_GENAI_USE_VERTEXAI setting, including forcing False to avoid Vertex AI even if project/location are set. (#2436) (Trevor Wilson)
Add support for authenticating Azure OpenAI models using azure_ad_token or an azure_ad_token_provider, so you can use Azure AD credentials instead of an API key when desired. (#2448) (Vamshi Adimalla)

Improvement

v3.8.3

Add support for setting test_case_id in update_current_trace, and include it when serializing traces. This makes it easier to associate a trace with a specific test case in downstream processing. (#2463) (Kritin Vongthongsri)

v3.8.2

Remove the New Relic OpenTelemetry tracing exporter from telemetry. This reduces external tracing overhead and avoids requiring New Relic-related tracing setup during event capture. (#2364) (Kritin Vongthongsri)
Improve LangChain and LangGraph integrations by stabilizing tracing and metric collection. Fix schema and integration test behavior to be deterministic, representative of supported usage, and aligned with the documentation. (#2457) (Trevor Wilson)

v3.8.1

Fix grammar in the README to improve clarity when describing locally run evaluation models and metrics. (#2423) (yuri)
Add a 2024 changelog page to the documentation and link it from the changelog index and sidebar for easier navigation. Update the changelog generator default output directory to match the new docs path. (#2429) (Trevor Wilson)
Improve contextual recall verdicts by attaching the expected_output to each verdict, making results easier to interpret and debug in both sync and async runs. (#2449) (Vamshi Adimalla)
Add support for passing a confident_api_key to dataset and prompt objects, and use it automatically for push/pull/update/queue operations. This makes it easier to work with multiple API keys without relying on a single global setting. (#2453) (Vamshi Adimalla)

v3.7.9

Improve environment variable docs by clarifying how boolean flags are parsed, including accepted truthy/falsy tokens and how unset or unrecognized values fall back to defaults. Update env var tables to show 1/0/unset for boolean settings. (#2399) (Trevor Wilson)
Fix a typo in the getting started docs to improve readability and clarity. (#2404) (Neelay Shah)
Update GEval documentation to reference evaluation_steps (instead of evaluation_params) when describing which parameters should be included for accurate results. (#2406) (Vishnu Sai Teja)
Fix Gemini defaults and documentation to avoid retired model IDs. Update the default model to gemini-2.5-pro and refresh Gemini/Vertex AI docs to use current stable/preview models and *-latest aliases. Update custom LLM guide examples to gemini-2.5-flash. (#2414) (Trevor Wilson)

v3.7.8

Improve OpenTelemetry export configuration by introducing CONFIDENT_OTEL_URL (defaulting to the hosted endpoint) and using it across integrations. This makes it easier to point tracing to regional endpoints such as the EU collector via an environment variable. (#2400) (Vamshi Adimalla)

Bug Fix

v3.8.3

Fix the CrewAI integration tests by improving event loop handling in sync wrappers and correctly comparing tool usage when traces are returned as lists. This prevents failures caused by missing loops and mismatched tools-used invariants. (#2460) (Vamshi Adimalla)

v3.8.2

Fix answer relevancy scoring by raising an error when actual_output is empty or whitespace-only, preventing blank outputs from being treated as fully relevant. (#2451) (Trevor Wilson)
Fix Confident API request routing by inferring the region from the API key prefix when no region is set. EU keys (confident_eu_...) now automatically use the EU endpoint, preventing invalid API key errors. Defaults to the US endpoint when the region cannot be inferred. (#2456) (Trevor Wilson)
Fix Pydantic AI OpenTelemetry instrumentation by setting the global tracer provider when possible and warning if it’s already configured. Improve agent span detection by reading agent names from gen_ai.agent.name or pydantic_ai.agent.name and applying agent attributes consistently at span start/end. (#2459) (Vamshi Adimalla)
Fix PydanticAI tracing environment selection so it prefers the trace manager setting, then CONFIDENT_TRACE_ENVIRONMENT, and defaults to development only when neither is set. (#2462) (Vamshi Adimalla)

v3.8.1

Fix LangChain/LangGraph callback tracing to reuse active traces, restore trace/span context across async tasks, and keep correct parent-child span hierarchy. Also avoid overwriting trace metadata when values are not provided. (#2434) (Jeffrey Ip)
Fix Amazon Bedrock Converse response parsing by extracting all text content blocks and ignoring reasoningContent. Improve error messages when no text is returned, and return None for cost when pricing data is unavailable. (#2437) (Trevor Wilson)
Fix incorrect return type annotations for send_annotation() and a_send_annotation() by changing them from str to None, matching their actual no-return behavior. (#2441) (yuri)
Fix multimodal image metrics to fail fast with a clear ValueError when actual_output contains no images. Also validate expected_output when detecting multimodal test cases and improve error messaging for mismatched output image counts. (#2447) (Vamshi Adimalla)

v3.7.9

Fix batch scoring on the DROP benchmark to use quasi_contains_score, matching single prediction behavior. This prevents partial matches like '2' being incorrectly marked wrong when the gold answer includes variants such as '2, 2-yards'. (#2402) (Aadam Haq)
Fix progress updates to ignore missing tasks during teardown, preventing StopIteration when async callbacks run after a progress task is removed. Progress updates now safely become a no-op in this race condition. (#2405) (Trevor Wilson)
Fix MMLU batch_predict to require batch_generate to return a list of MultipleChoiceSchema and raise a clear TypeError otherwise, preventing inconsistent response handling. (#2408) (Aadam Haq)
Fix Gemini Vertex AI authentication to fall back to Application Default Credentials when no service account key is provided, instead of requiring GOOGLE_SERVICE_ACCOUNT_KEY. Only parse and validate the key when present to avoid unnecessary OAuth imports. (#2412) (Trevor Wilson)
Fix Synthesizer export and save for conversational goldens: to_pandas and save_as now handle both QA and conversational outputs, include the right fields, and raise an error only when neither type is present. (#2415) (Vamshi Adimalla)

On this page