π΄ 2026
So far in 2026, DeepEval has focused on making evaluation more reliable, observable, and easier to run across real-world LLM systems:
- Tracing & observability improved with richer trace fields, better OTel exports, and deeper integration coverage
- Model support expanded with new frontier and provider model entries, more accurate pricing, and safer capability handling
- Component-level evals got cleaner with active-trace assertions, structured result exports, and less duplicate logging
- Conversation simulation became more flexible with controller APIs, custom templates, and stronger test coverage
- Docs & release tooling moved forward with the new docs site, changelog automation, and clearer tracing guides
Thank you to our contributors
First things first, DeepEval exists because of everyone who opened issues, reviewed changes, wrote docs, and merged code this year. Thank you for shaping every release with us.
April
April focused on simplifying core APIs while expanding model, tracing, and simulator capabilities. Testing and golden assertions were streamlined by removing legacy hooks, adding configurable structured run outputs, deprecating per-result logs, and tightening error handling so misconfigured eval runs fail loudly. The release added support for new OpenAI and Anthropic models with improved multimodal/structured output handling, more accurate token/cost reporting, and safer behavior when logprob-dependent metrics arenβt supported. Observability and workflow got a major boost with richer trace correlation fields like turn_id and test_case_id, optional internal span instrumentation, a more in
Backward Incompatible Change
v3.9.9
- Remove the legacy
API_KEYalias and requireCONFIDENT_API_KEYfor Confident uploads. Update dataset loading to usemetadatainstead ofadditional_metadata, and refresh docs/examples to useSingleTurnParamsforGEvalevaluation parameters. (#2635) (Jeffrey Ip)
v3.9.8
- Remove the
observed_callbackhook fromassert_testand rely on the active trace when asserting against agolden. Addresults_folder/results_subfolderoptions to control where full structured test-run JSON is saved, and deprecate per-result.logoutput. (#2622) (Jeffrey Ip) - Remove the metric logging manager and related configuration options, simplifying debug settings and API endpoints. Update CI to run the simulator test suite and reorganize conversation simulator tests for the new layout. (#2629) (Jeffrey Ip)
New Feature
v3.9.9
- Add
metadataandtagssupport to bothSingleTurnParamsandMultiTurnParams, making it easier to pass custom context through single-turn and conversational evaluation workflows. (#2635) (Jeffrey Ip) - Add a
generateCLI command to create synthetic goldens from documents, contexts, scratch prompts, or existing goldens, with configurable output format, concurrency, and styling options. (#2633) (Jeffrey Ip) - Add a Cursor/skills-compatible
deepevalskill with templates and guidance for generating datasets, creating pytest eval suites, enabling tracing, and iterating on evaluation failures. (#2634) (Jeffrey Ip)
v3.9.8
- Add support for the
claude-opus-4-7model, including multimodal inputs, structured outputs, and JSON mode, with updated pricing metadata. (#2617) (Tanay) - Add a conversation simulator controller API with
proceed()/end()decisions, plus a publicSimulationTemplate. Update and expand simulator tests and CI coverage, including safer defaults when controllers returnNoneor unexpected values. (#2628) (Jeffrey Ip)
v3.9.6
- Add support for the
gpt-5.4-minimodel. Metrics that rely on log probabilities now detect when the model doesnβt support them and avoid failing with unexpected errors. (#2603) (Tanay)
v3.9.5
- Add support for extracting
confident.trace.test_case_idinConfidentSpanExporterso OTel-exported traces can propagatetestCaseIdand be linked to the right test case instead of always being null. (#2570) (Alex Maggioni) - Add prompt branch support, including pushing to a specific branch and listing, creating, renaming, and deleting branches. Cache and commit lookups can now be scoped by branch to keep versions organized. (#2583) (Vamshi Adimalla)
Improvement
v3.9.9
- Improve OpenAI defaults by switching the default GPT model to
gpt-5.4when no model is configured. Add model metadata forgpt-5.4(and its snapshot alias) and update JSON output support flags for relevant models. (#2630) (Jeffrey Ip) - Improve changelog and docs parsing by supporting React-style comment markers for release note markers, PR tags, and
changelog-ignoreblocks, while remaining compatible with the legacy HTML comment format. (#2631) (Jeffrey Ip)
v3.9.8
- Improve documentation site by migrating to a new Next.js-based setup with updated layouts and built-in search, along with refreshed docs tooling and ignores for generated build artifacts. (#2624) (Jeffrey Ip)
- Improve docs site link previews and layout by adding a default Open Graph image, tightening homepage spacing, and fixing overflow/scrolling behavior in code and terminal demo blocks. (#2627) (Jeffrey Ip)
v3.9.7
- Improve telemetry dependency compatibility by using PostHog 7.x on Python 3.10+ while keeping PostHog 5β6 on Python 3.9 via environment markers. (#2605) (Manoj Kumar Nagabandi)
v3.9.6
- Add new tracing guides for multi-turn chatbots, RAG flows, and AI agents, including examples for grouping turns with
thread_idand instrumenting spans for better end-to-end observability. (#2581) (Vamshi Adimalla) - Add a manual GitHub Actions workflow to generate changelog updates for a given year or tag range and open an update pull request automatically. (#2588) (Vamshi Adimalla)
- Add optional internal tracing for metric and model methods called inside
@observespans, controlled byCONFIDENT_TRACE_INTERNAL. When enabled, key LLM generation methods and metric execution paths are captured with more detailed nested spans; when disabled, this extra instrumentation is skipped to reduce overhead. (#2589) (Kritin Vongthongsri)
v3.9.5
- Add optional
turn_idandtest_case_idfields to tracing across supported integrations, and include them in exported trace payloads for easier correlation of multi-turn runs. (#2576) (Kritin Vongthongsri)
Bug Fix
v3.9.8
- Fix multi-turn Pydantic trace input to use the most recent user message instead of the first. This prevents follow-up questions from incorrectly showing the initial question as the trace input. (#2614) (Brian Romain)
- Fix a
KeyErrorinMLLMImage.parse_multimodal_stringwhen parsing[DEEPEVAL:IMG:<id>]markers for images that arenβt already registered. Newly created images are now kept alive for the caller so registry lookups donβt fail. (#2615) (Tanay) - Fix Anthropic Opus 4.7 requests by omitting
temperaturewhen the model does not support it, preventing API errors in both sync and async generation. (#2618) (Tanay) - Prevent
evals_iteratorruns from silently doing nothing by raising a clear error when no metrics are declared at any level. This avoids misleading end-of-run messages and makes missing metric configuration easier to diagnose. (#2621) (Jeffrey Ip)
v3.9.7
- Fix loading single-turn golden tool calls from CSV by parsing
tools_calledandexpected_toolsas JSON objects instead of splitting by a delimiter, matching the format produced bysave_as. This prevents errors and ensures tool call goldens round-trip correctly through CSV. (#2565) (Sean Kelley) - Add missing Anthropic model entries for Claude Opus 4.6 and Sonnet 4.6, including dated IDs and short aliases. Fix Opus 4.5 pricing so cost reports are no longer inflated. Restore cost tracking for default
*-latestmodels by registering their IDs sorequire_costs()no longer falls back to None. (#2584) (Ajay Sai Reddy Desireddy) - Fix conversion of conversational goldens to preserve
expected_outcome, preventing metrics that rely on it from failing validation or skipping evaluation after conversion. (#2598) (aerosta) - Fix OpenAI tracing spans for newer
gpt-5.x/Responses API models to correctly record input/output token counts and populate per-token cost data. This preventsNonevalues when instrumenting an OpenAI client viapatch_openai_client. (#2601) (tiffanychum) - Fix PydanticAI tracing integrations by correctly classifying agent vs LLM spans and preventing mislabeling when agent attributes are present. Improve message normalization across instrumentation versions and ensure trace context is properly reset after a trace ends. (#2606) (Vamshi Adimalla)
- Fix schema construction for structured outputs by correctly unwrapping
Optional[...]types and detecting nested Pydantic models through full inheritance. This preventsOptional[List[int]]from being misclassified as STRING and ensures derivedBaseModeltypes are recognized as OBJECT. (#2611) (SamSi0322) - Fix
_mcp_interactiondetection so MCP usage is correctly recognized under Pydantic v2. This prevents MCP-related metrics from returning near-zero scores when tools, resources, or prompts were actually called. (#2612) (SamSi0322)
v3.9.5
- Fix type annotations for
modelandusing_native_modelin base metric classes by making them proper optional fields withNonedefaults, improving static type checking and reducing annotation-related errors. (#2574) (Tommy Beadle) - Fix docs structured data by removing the
Productschema for metric pages and generating onlyArticleschema. This avoids incorrect organization and product metadata in the rendered schema output. (#2577) (Vamshi Adimalla)
March
March focused on making LLM and agent evaluations more observable and configurable, with new AgentCore and OpenInference integrations that capture richer OpenTelemetry traces and export them via OTLP with improved metadata, tagging, and metric collection controls. We also expanded provider flexibility and reliability, from environment-driven backend selection and better Bedrock and Azure auth handling to more accurate usage extraction across LangChain/LangGraph versions. Across the toolchain, numerous fixes improved correctness and stability in concurrent and streaming evaluations, CLI aggregation, caching behavior, and sandboxed HumanEval execution, while memory and Windows cleanup issues Π²
New Feature
v3.9.1
- Add AgentCore integration with OpenTelemetry instrumentation, including span classification and message extraction for agent, tool, and LLM traces. Support exporting telemetry via OTLP with configurable metadata, tags, and metric collection, and provide a test mode exporter for local validation. (#2534) (Vamshi Adimalla)
- Add OpenInference integration to intercept OpenTelemetry spans, extract LLM/agent/tool inputs and outputs, and export traces via OTLP. Provides configurable metadata, tags, and metric collection, and surfaces clear errors when required OpenTelemetry deps or
CONFIDENT_API_KEYare missing. (#2555) (Vamshi Adimalla)
v3.8.9
- Add
custom_column_key_valuestoLLMTestCaseto store custom metadata as aDict[str, str]. Accept bothcustom_column_key_valuesandcustomColumnKeyValueson input and serialize ascustomColumnKeyValues, with type validation for safer usage. (#2530) (Brian Romain)
Improvement
v3.9.3
- Improve code formatting consistency by reformatting the codebase with the latest Black rules, reducing lint noise and keeping style checks stable across environments. (#2567) (Vamshi Adimalla)
- Add documentation describing the updated Amazon Bedrock integration behavior, helping users configure and use Bedrock correctly after recent changes. (#2571) (Vamshi Adimalla)
v3.9.1
- Add AWS AgentCore integration documentation and CI coverage, and improve span extraction and test-mode handling. Also allow tuning OTLP batch exporter settings to better control export timing and batch size. (#2544) (Vamshi Adimalla)
- Fix AgentCore tracing to capture agent and trace
input/outputreliably, avoid duplicate traces during evaluation, and recognize additional GenAI span attributes. Also simplifyinstrument_agentcoreby removing OTEL exporter tuning options and update docs to useevals_iterator()for end-to-end eval runs. (#2545) (Vamshi Adimalla) - Support selecting the evaluation provider via environment variables and passing a model name as a string when initializing metrics. This makes it easier to switch between OpenAI, Anthropic, Gemini, Azure, and local backends without changing code. (#2550) (Vamshi Adimalla)
v3.8.9
- Support setting
metric_collectionon the active trace and span via the update helpers. This makes metric collection configuration consistent when updating an in-progress trace rather than only at trace creation. (#2532) (Vamshi Adimalla)
Bug Fix
v3.9.3
- Fix chunk-size validation to treat
collection.count()as chunk count rather than token count, preventing incorrect errors when generating contexts. Improve the guidance in the exception message with clearer suggestions to reducechunk_sizeandchunk_overlap. (#2468) (Xuan-Phung Pham) - Fix
initialize_model()to recognize Amazon Bedrock configuration soUSE_AWS_BEDROCK_MODEL=YESno longer falls back to the default GPT model. This prevents silently using the wrong provider when Bedrock is intended. (#2537) (Parafee41) - Fix LangChain/LangGraph token usage extraction by reading
usage_metadatawith a fallback to legacyresponse_metadata, improving callback accuracy across versions. Improve test stability by retrying flaky Confident and integration tests and updating integration dependencies and fixtures. (#2557) (Vamshi Adimalla) - Fix async trace evaluation so per-trace metrics arenβt shared across concurrent tasks. This prevents concurrent runs from overwriting
score,reason, andsuccess, eliminating timing-dependent and inconsistent results. (#2559) (aerosta) - Fix pydantic-ai tracing so
thread_id,name, andmetadataset via the current trace context are exported on span start. Falls back to settings for compatibility and merges settings metadata with per-request metadata. (#2563) (Oluwanifemi Adeyemi)
v3.9.1
- Fix Azure OpenAI keyless authentication by deferring credential checks to the OpenAI SDK. Only fail fast when an explicit credential is provided but empty, while preserving key-based auth and handling both
SecretStrand string credentials consistently. (#2464) (ppon1086) - Fix a Pydantic
ValidationErrorinKnowledgeRetentionMetric._extract_knowledgesby correctly unpacking LLM response dicts when creatingKnowledgeobjects, preventing double-wrapping and improving validation reliability. (#2513) (Diego GΓ³mez Moreno) - Fix
evaluate()in CLI runs to stop resetting the test run manager so results from multiple files are accumulated and reported together. Add askip_resetoption for manual control outside CLI mode. Ensure test caseordervalues are always unique to prevent earlier results being overwritten or shown as skipped. (#2529) (Alex Maggioni) - Fix CrewAI tool tracing when events arrive out of order from the thread pool. Tool spans are now created and closed reliably using finished-event data, with corrected timestamps and consistent propagation of called tools to the parent span. (#2547) (Vamshi Adimalla)
- Fix
ConversationalGEvalto maketop_logprobsconfigurable. Add atop_logprobsparameter to the initializer (default 20) and use it in both sync and async execution paths instead of a hardcoded value. (#2549) (Szymon Cogiel) - Fix a memory leak when processing many multimodal test cases by storing
_MLLM_IMAGE_REGISTRYin aweakref.WeakValueDictionary. UnreferencedMLLMImageinstances are now garbage-collected automatically, preventing unbounded memory growth in large batch runs. (#2551) (eason) - Fix metric cache loading to ignore incomplete cached entries and fall back to recomputing when no score is available. Progress reporting now updates correctly when cached results are used. (#2552) (Konstantin)
v3.8.9
- Fix generator tracing so observers record the final yielded item when a generator finishes without returning a value, improving captured outputs for streaming workflows. (#2514) (Vamshi Adimalla)
- Fix
FilterTemplate.evaluate_contextexamples by removing duplicate contexts and replacing them with distinct ones. Each example now has a unique input/output pairing, avoiding repeated contexts with different scores. (#2518) (Fiza Mukhtar) - Fix
ContextConstructionConfig.critic_modeldefaulting to a new model when unset. It now falls back to the model passed toSynthesizer, so you only need to specify a custom model once when generating goldens from docs. (#2520) (Br1an) - Fix HumanEval evaluation so test assertions run against the generated function by executing the function and tests in the same sandboxed context. Also treat runtime exceptions as failures and expand allowed builtins needed by common HumanEval test cases. (#2521) (Br1an)
- Fix temp ChromaDB directory cleanup on Windows by stopping the client system before calling
shutil.rmtree. This releases open SQLite file handles and preventsPermissionError: [WinError 32]during teardown, with retries kept as a fallback. (#2522) (Br1an) - Fix
calculate_weighted_summed_scoreto avoid ZeroDivisionError when all token logprobs are filtered out and the probability sum is 0. When no tokens survive filtering, it now falls back to the raw score instead of failing. (#2524) (VENKATA PRANAY BATHINI) - Fix the Goal Accuracy Score equation to remove a circular dependency. The formula now matches the implementation by averaging
Goal Evaluation ScoreandPlan Evaluation Scoreas two distinct components. (#2526) (JevDev2304)
February
February focused on making prompts, tools, and tracing more consistent after the API migration, with prompt identity and caching now keyed by commit hashes and richer prompt metadata recorded as first-class span fields. Tool calling got a major upgrade across pull/cache and push/update flows, including consistent JSON Schema input_schema generation and better handling of empty schemas. Reliability improvements landed throughout integrations and evaluations, including fixes for Azure reasoning models, Bedrock async credential handling, offline evaluation endpoints, and more robust generator observation. The release also streamlined installs and portability by trimming default OpenTelemetry/
New Feature
v3.8.4
- Add support for overriding the default cache directory via an environment variable, allowing you to relocate cached files without changing code. (#2455) (vection)
- Add tool support to pulled and cached prompts, exposing any returned tools and converting their structured fields into a JSON Schema
input_schemafor easier function/tool calling. (#2466) (Vamshi Adimalla) - Add tool support to prompt push and update, including optional
toolspayloads. Improve tool schema handling by reusing output schema conversion and generating JSON Schema input parameters consistently, even for empty schemas. (#2474) (Vamshi Adimalla)
Improvement
v3.8.7
- Fix minor typos in the Tool Correctness metric documentation, including list formatting and a missing final newline for cleaner rendering. (#2504) (nikkie)
v3.8.5
- Improve prompt handling after the API migration by switching prompt identity and caching to use commit hashes and adding support for prompt commits endpoints. This helps prompts and logged hyperparameters stay consistent when versions change, and preserves tools and schema data when pulling from cache. (#2475) (Vamshi Adimalla)
- Remove the
opentelemetry-exporter-otlp-proto-grpcdependency from the default install to reduce required packages and keep OpenTelemetry exporter tooling out of core installs. (#2477) (Tommy Beadle) - Improve dependency compatibility by allowing Click 8.3.x (
click< 8.4.0). Also adjust Linux-onlypysqlite3-binaryhandling to prevent install failures on other platforms. (#2486) (Muhammad Faizan) - Improve prompt logging in traces by recording prompt alias, commit hash, label, and version as first-class span fields across integrations and the OTEL exporter. (#2487) (Vamshi Adimalla)
- Add support for the AU data region for Confident AI requests. The CLI can now set region to AU, and API routing will automatically use AU endpoints when your API key starts with
confident_au_. (#2494) (Vamshi Adimalla)
Bug Fix
v3.8.7
- Fix Azure OpenAI requests to omit
temperaturefor reasoning models that donβt support it, preventing Azure API errors. Validation now allowstemperature=None, and defaults remain unchanged for standard models. (#2491) (aerosta) - Fix tests by updating the hardcoded valid trace UUID used in annotation test fixtures. (#2505) (Vamshi Adimalla)
v3.8.8
- Fix the
@observedecorator to capture the final return value from synchronous generator functions, not just yielded items. Also ensure the observer is closed reliably on normal completion,GeneratorExit, and errors. (#2509) (Vamshi Adimalla)
v3.8.6
- Fix offline trace/span evaluation requests by sending the correct endpoints and parameters, and add
overwrite_metricssupport. Also allow passingchatbot_rolewhen evaluating a thread, and avoid printing tool calls when none are present. (#2498) (Vamshi Adimalla)
v3.8.5
- Fix the TaskCompletionMetric docs example to use
goldensinstead ofgolden, preventing a NameError when iterating overdataset.evals_iterator. (#2454) (Himanshu Kumar Singh) - Fix
a_generate_with_schema_and_extractto handle models that return(result, cost)tuples. It now accrues cost when supported and extracts the actual result so downstream processing works without tuple parsing errors. (#2470) (Angelen) - Fix
AmazonBedrockModelraisingAccessDeniedExceptionduring async evaluations when AWS credentials are valid. Improves async-safeaiobotocoresession and credential handling to prevent loss under concurrency while keeping sync behavior unchanged. (#2471) (Fiza Mukhtar) - Fix the conversation completeness prompt to better extract user intentions by separating multiple tasks per turn instead of summarizing them into one intention. (#2478) (Vamshi Adimalla)
- Fix the knowledge retention extraction template to wrap extracted fields under a top-level
dataobject, matching the expected JSON output format and improving parsing reliability. (#2479) (Vamshi Adimalla) - Fix OpenTelemetry span attribute setting to avoid sending
Nonevalues for prompt metadata, reducing invalid or noisy telemetry attributes. (#2488) (Vamshi Adimalla) - Fix tracing prompt metadata conversion so
prompt_alias, commit hash, label, and version are set consistently and donβt get overwritten by empty prompt objects. Improve prompt tests to avoid cache and alias collisions, making pull and cache behavior more reliable. (#2489) (Vamshi Adimalla) - Fix CrewAI tracing to capture prompt metadata and expected outputs on LLM spans, improve tool-span detection, and make tool completion more reliable when duplicate events or key mismatches occur. Also broaden metric lookup to support both underscored and public attribute names. (#2490) (Vamshi Adimalla)
- Fix synthesizer crashes when
include_expected_output=Falseand whenmax_quality_retries=0, preventingAttributeErrorandUnboundLocalError. Correct goldens generation soevolutions_usedmetadata no longer leaks across iterations. Add tests covering these scenarios. (#2493) (aerosta) - Fix a crash in
deepeval viewafter login when telemetry doesnβt create a span.upload_and_open_link()now treats the span as optional and only sets attributes when it exists, so the command completes instead of raising an AttributeError. (#2496) (Jeremy Johnson)
Security
v3.8.5
- Improve cleanup on Windows by updating
safe_rmtreeto usesubprocess.runwith argument lists instead ofos.system. This handles paths with spaces more reliably and reduces the risk of command injection, making directory removal more robust across environments. (#2484) (Rin)
January
January focused on widening model-provider support and making integrations more configurable, including OpenRouter via an OpenAI-compatible API, Azure OpenAI auth via azure_ad_token, and clearer control over Gemini via use_vertexai alongside refreshed default model IDs. Tracing and telemetry saw major stabilization across LangChain, LangGraph, PydanticAI, CrewAI, and OpenTelemetry, with improved context propagation, safer progress handling, regional routing based on API key prefixes, and removal of the New Relic exporter while adding CONFIDENT_OTEL_URL for endpoint control. Evaluation workflows improved with upload() for GEval/ConversationalGEval, richer contextual recall verdict
New Feature
v3.8.1
- Add support for OpenRouter with an OpenAI-compatible model API and dynamic model names. Support structured outputs, configurable retries, and custom headers like
HTTP-RefererandX-Title. Allow user-provided pricing with fallback to provider-reported pricing. (#2314) (Wang Junwei) - Add an automated changelog generator that builds ClickHouse-style release notes from git tags, with optional GitHub and AI enrichment. Backfill the docs changelog for 2025 to match the new year/month/category layout. (#2403) (Trevor Wilson)
- Add
upload()support forGEvalandConversationalGEvalto send metric definitions (criteria/steps, required parameters, rubric, multi-turn) to Confident AI and store the returned metric id. (#2419) (Vamshi Adimalla) - Add a
use_vertexaioption to explicitly choose between Vertex AI and Gemini API-key clients when creatingGeminiModel. This overrides theGOOGLE_GENAI_USE_VERTEXAIsetting, including forcingFalseto avoid Vertex AI even if project/location are set. (#2436) (Trevor Wilson) - Add support for authenticating Azure OpenAI models using
azure_ad_tokenor anazure_ad_token_provider, so you can use Azure AD credentials instead of an API key when desired. (#2448) (Vamshi Adimalla)
Improvement
v3.8.3
- Add support for setting
test_case_idinupdate_current_trace, and include it when serializing traces. This makes it easier to associate a trace with a specific test case in downstream processing. (#2463) (Kritin Vongthongsri)
v3.8.2
- Remove the New Relic OpenTelemetry tracing exporter from telemetry. This reduces external tracing overhead and avoids requiring New Relic-related tracing setup during event capture. (#2364) (Kritin Vongthongsri)
- Improve LangChain and LangGraph integrations by stabilizing tracing and metric collection. Fix schema and integration test behavior to be deterministic, representative of supported usage, and aligned with the documentation. (#2457) (Trevor Wilson)
v3.8.1
- Fix grammar in the README to improve clarity when describing locally run evaluation models and metrics. (#2423) (yuri)
- Add a 2024 changelog page to the documentation and link it from the changelog index and sidebar for easier navigation. Update the changelog generator default output directory to match the new docs path. (#2429) (Trevor Wilson)
- Improve contextual recall verdicts by attaching the
expected_outputto each verdict, making results easier to interpret and debug in both sync and async runs. (#2449) (Vamshi Adimalla) - Add support for passing a
confident_api_keyto dataset and prompt objects, and use it automatically for push/pull/update/queue operations. This makes it easier to work with multiple API keys without relying on a single global setting. (#2453) (Vamshi Adimalla)
v3.7.9
- Improve environment variable docs by clarifying how boolean flags are parsed, including accepted truthy/falsy tokens and how unset or unrecognized values fall back to defaults. Update env var tables to show
1/0/unsetfor boolean settings. (#2399) (Trevor Wilson) - Fix a typo in the getting started docs to improve readability and clarity. (#2404) (Neelay Shah)
- Update GEval documentation to reference
evaluation_steps(instead ofevaluation_params) when describing which parameters should be included for accurate results. (#2406) (Vishnu Sai Teja) - Fix Gemini defaults and documentation to avoid retired model IDs. Update the default model to
gemini-2.5-proand refresh Gemini/Vertex AI docs to use current stable/preview models and*-latestaliases. Update custom LLM guide examples togemini-2.5-flash. (#2414) (Trevor Wilson)
v3.7.8
- Improve OpenTelemetry export configuration by introducing
CONFIDENT_OTEL_URL(defaulting to the hosted endpoint) and using it across integrations. This makes it easier to point tracing to regional endpoints such as the EU collector via an environment variable. (#2400) (Vamshi Adimalla)
Bug Fix
v3.8.3
- Fix the CrewAI integration tests by improving event loop handling in sync wrappers and correctly comparing tool usage when traces are returned as lists. This prevents failures caused by missing loops and mismatched tools-used invariants. (#2460) (Vamshi Adimalla)
v3.8.2
- Fix answer relevancy scoring by raising an error when
actual_outputis empty or whitespace-only, preventing blank outputs from being treated as fully relevant. (#2451) (Trevor Wilson) - Fix Confident API request routing by inferring the region from the API key prefix when no region is set. EU keys (
confident_eu_...) now automatically use the EU endpoint, preventing invalid API key errors. Defaults to the US endpoint when the region cannot be inferred. (#2456) (Trevor Wilson) - Fix Pydantic AI OpenTelemetry instrumentation by setting the global tracer provider when possible and warning if itβs already configured. Improve agent span detection by reading agent names from
gen_ai.agent.nameorpydantic_ai.agent.nameand applying agent attributes consistently at span start/end. (#2459) (Vamshi Adimalla) - Fix PydanticAI tracing environment selection so it prefers the trace manager setting, then
CONFIDENT_TRACE_ENVIRONMENT, and defaults todevelopmentonly when neither is set. (#2462) (Vamshi Adimalla)
v3.8.1
- Fix LangChain/LangGraph callback tracing to reuse active traces, restore trace/span context across async tasks, and keep correct parent-child span hierarchy. Also avoid overwriting trace metadata when values are not provided. (#2434) (Jeffrey Ip)
- Fix Amazon Bedrock Converse response parsing by extracting all
textcontent blocks and ignoringreasoningContent. Improve error messages when no text is returned, and returnNonefor cost when pricing data is unavailable. (#2437) (Trevor Wilson) - Fix incorrect return type annotations for
send_annotation()anda_send_annotation()by changing them fromstrtoNone, matching their actual no-return behavior. (#2441) (yuri) - Fix multimodal image metrics to fail fast with a clear ValueError when
actual_outputcontains no images. Also validateexpected_outputwhen detecting multimodal test cases and improve error messaging for mismatched output image counts. (#2447) (Vamshi Adimalla)
v3.7.9
- Fix batch scoring on the DROP benchmark to use
quasi_contains_score, matching single prediction behavior. This prevents partial matches like '2' being incorrectly marked wrong when the gold answer includes variants such as '2, 2-yards'. (#2402) (Aadam Haq) - Fix progress updates to ignore missing tasks during teardown, preventing
StopIterationwhen async callbacks run after a progress task is removed. Progress updates now safely become a no-op in this race condition. (#2405) (Trevor Wilson) - Fix MMLU
batch_predictto requirebatch_generateto return a list ofMultipleChoiceSchemaand raise a clearTypeErrorotherwise, preventing inconsistent response handling. (#2408) (Aadam Haq) - Fix Gemini Vertex AI authentication to fall back to Application Default Credentials when no service account key is provided, instead of requiring
GOOGLE_SERVICE_ACCOUNT_KEY. Only parse and validate the key when present to avoid unnecessary OAuth imports. (#2412) (Trevor Wilson) - Fix Synthesizer export and save for conversational goldens:
to_pandasandsave_asnow handle both QA and conversational outputs, include the right fields, and raise an error only when neither type is present. (#2415) (Vamshi Adimalla)