🐍 2025

2025 was all about making LLM evaluation production-ready:

  • Tracing & observability matured with deep integrations across LangChain, LlamaIndex, CrewAI, PydanticAI, and OpenAI Agents—plus first-class OpenTelemetry support
  • Agent evaluation took center stage with new metrics for task completion, tool correctness, and MCP interactions
  • Multimodal capabilities expanded across test cases and metrics
  • Provider support broadened to include Anthropic, Gemini, Amazon Bedrock, and improved Ollama/Azure setups
  • Safety coverage grew with guardrails, red-teaming, and compliance metrics
  • Reliability improved with better async handling, timeouts, and retries
  • Documentation expanded with comprehensive tutorials to help teams ship confidently

December

December strengthened evaluation, multimodal support, and prompt optimization. Multimodal test cases now flow through standard evaluation paths with better placeholder detection, Azure OpenAI support, and clearer validation errors. Prompt optimization expanded with GEPA and additional algorithms (MIPROv2, COPRO, and SIMBA), alongside more consistent schema-based outputs and broader provider configuration via typed Settings.

New Feature

v3.7.6

  • Add support for multimodal conversational test cases and goldens by automatically detecting [DEEPEVAL:IMG:...] placeholders across fields and attaching an imagesMapping so referenced images can be resolved during dataset loading. (#2373) (Vamshi Adimalla)
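
A hypothetical golden record illustrating the placeholder convention above. Only the [DEEPEVAL:IMG:...] marker and the imagesMapping key come from the release note; the placeholder id ("chart") and the surrounding field names are assumptions about the dataset format.

    # Hypothetical golden record; only the placeholder syntax and the
    # imagesMapping key are taken from the release note above.
    golden_record = {
        "scenario": "User asks the assistant to describe a chart",
        "turns": [
            {"role": "user", "content": "What trend does [DEEPEVAL:IMG:chart] show?"},
            {"role": "assistant", "content": "Revenue grows steadily across all regions."},
        ],
        # Maps each placeholder id to the image it should resolve to during dataset loading.
        "imagesMapping": {"chart": "https://example.com/revenue-chart.png"},
    }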

v3.7.5

  • Add an example script showing how to run prompt optimization with a model callback, a small golden dataset, and relevancy metrics to print original vs optimized prompts. (#2347) (Jeffrey Ip)

v3.7.4

  • Add GEPA (Genetic-Pareto) prompt optimization to automatically improve prompt templates against goldens and metrics. Provide GEPARunner.optimize(...) with reusable runner state, sync/async execution, configurable tie-breaking, and an OptimizationReport attached to the returned prompt; see the sketch after this list. (#2293) (Trevor Wilson)
  • Add MIPROv2, COPRO, and SIMBA prompt-optimization algorithms with new configuration options and runner support, enabling additional search strategies and cooperative candidate proposals during optimization. (#2341) (Trevor Wilson)
  • Add support for a Portkey-backed model configured via settings. Introduce Portkey-specific options (API key, model name, base URL, provider) and validate required values early to reduce misconfiguration errors. (#2342) (Trevor Wilson)
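
A rough sketch of the GEPA flow described in the first entry above. GEPARunner.optimize(...) and the OptimizationReport are named in the release note, but the import paths, constructor arguments, and attribute names below are assumptions rather than the library's confirmed interface.

    # Sketch only: every import path and argument below is an assumption.
    from deepeval.dataset import Golden
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.optimization import GEPARunner  # location assumed
    from deepeval.prompt import Prompt            # location assumed

    runner = GEPARunner(metrics=[AnswerRelevancyMetric()])  # assumed constructor
    optimized = runner.optimize(                            # sync path; an async variant is described above
        prompt=Prompt(template="Answer the user's question: {question}"),
        goldens=[Golden(input="What does DeepEval do?")],
    )
    # The release note says an OptimizationReport is attached to the returned
    # prompt; the attribute name used here is a guess.
    print(optimized.optimization_report)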

v3.7.3

  • Add Azure OpenAI support for multimodal models, including image+text prompts and optional structured/JSON outputs. Multimodal model initialization can now select Azure based on configuration, using your deployment settings and tracking token-based cost. (#2319) (dhinkris)

Experimental Feature

v3.7.5

  • Add a proof-of-concept multimodal path by auto-detecting image placeholders in dataset inputs/turns and routing supported RAG-style metrics accordingly, without requiring a separate test case type. (#2346) (Vamshi Adimalla)

Improvement

v3.7.6

  • Refactor evaluation to treat multimodal LLM test cases like standard LLM cases, simplifying metric execution and removing special multimodal-only handling paths. (#2369) (Vamshi Adimalla)
  • Add a dedicated CI workflow and pytest coverage for metrics, including multimodal conversational cases. Improve multimodal detection and propagate the multimodal flag through evaluation step generation and scoring. Prevent invalid model usage for multimodal metrics by raising an error. (#2375) (Vamshi Adimalla)
  • Improve LLM metric output consistency by standardizing schema-based generation and fallback parsing. Add configuration options for more model providers (including token pricing and Bedrock settings) and align defaults for Ollama and OpenAI model selection. (#2378) (Trevor Wilson)

v3.7.5

  • Make the Ollama, Anthropic, and Gemini integrations optional at runtime. If an integration isn’t installed, raise a clear error explaining the missing dependency and how to install it. (#2345) (Trevor Wilson)
  • Improve CI reliability by including optional model provider dependencies (ollama, anthropic, google-genai) in the development dependency set, reducing failures when running tests that require these integrations. (#2357) (Trevor Wilson)
  • Prevent the multimodal field from being serialized in golden records by excluding it from model output. This reduces noisy fields in exported datasets and API payloads. (#2368) (Vamshi Adimalla)

v3.7.4

  • Improve API key management across LLM providers by standardizing on typed Settings for model name, endpoint/base URL, and secrets. Constructor arguments still take precedence, and secret values are only unwrapped when building the client. (#2330) (Trevor Wilson)
  • Improve the staleness policy docs by pointing reopen requests to a new MAINTAINERS.md file. This clarifies who to mention when reviving inactive issues and what details to include. (#2331) (Trevor Wilson)

v3.7.3

  • Rename the pytest plugin entry point from plugins to deepeval so the plugin is registered under a clearer name. (#2308) (Gavin Morgan)
  • Improve agentic metric docs with corrected code samples and clearer guidance that PlanAdherence, PlanQuality, and StepEfficiency are trace-only metrics that must run via evals_iterator or the observe decorator. (#2316) (Vamshi Adimalla)
  • Improve dataset conversions to carry additional_metadata from test cases into generated goldens, preserving metadata through CSV/JSON imports. Also prevent mixing single-turn and multi-turn items in the same dataset with clearer type errors. (#2336) (Vamshi Adimalla)
  • Support per-trace API keys when sending and flushing traces, so background flush uses the correct credentials. This prevents traces from being uploaded with the wrong API key when multiple keys are used in the same process. (#2337) (Kritin Vongthongsri)

Bug Fix

v3.7.6

  • Fix arena test case parameter validation by passing the correct arguments when checking each case, preventing incorrect validation failures for arena-based evaluations. (#2372) (Vamshi Adimalla)
  • Fix multi-turn Arena G-Eval comparisons when some turns have no retrieval context, and correctly apply multimodal evaluation rules when images are present. (#2376) (Vamshi Adimalla)
  • Fix MCP metrics to generate a single unified reason from all interaction reasons, with consistent sync/async behavior and correct cost tracking for native models. Also relax PlanAdherenceMetric required inputs and update tests to use a valid default model name. (#2381) (Vamshi Adimalla)
  • Fix multimodal model validation by resolving callable model metadata factories and improving prompt concatenation for image inputs, preventing errors when checking supported multimodal models. (#2382) (Trevor Wilson)

v3.7.5

  • Fix pydantic_ai integration imports so the package no longer crashes when optional pydantic-ai and OpenTelemetry dependencies are missing, using safe fallbacks and clearer optional-dependency errors. (#2354) (trevor-cai)
  • Fix dependency lockfile to match pyproject.toml, preventing CI failures and inconsistent installs caused by mismatched dependency groups and markers. (#2358) (Trevor Wilson)
  • Fix CLI test runs to avoid finalizing the same test run twice. This prevents duplicate uploads or local saves and reduces temp file race issues when deepeval test run hands off finalization to the CLI. (#2360) (Trevor Wilson)
  • Fix binary verdict JSON examples to use lowercase booleans (true/false) instead of Python-style True/False, reducing invalid JSON output from metric templates. (#2365) (Trevor Wilson)

v3.7.4

  • Fix Anthropic client initialization to unwrap SecretStr API keys and consistently prefer an explicit constructor key over settings. Raise a clear error when the key is missing or empty, and add tests to prevent regressions. (#2329) (Trevor Wilson)
  • Fix execute to avoid raising on async gather timeouts when errors are configured to be ignored, allowing timed-out metrics to be marked and execution to continue. (#2335) (Trevor Wilson)
  • Fix JSON corruption on NFS by flushing and fsyncing lock-protected writes for test runs and the prompt cache. This prevents truncated or partially written files during parallel runs on network storage, with added tests to verify the behavior. (#2338) (Trevor Wilson)
  • Fix parsing of provider-prefixed model names so inputs like provider/model correctly resolve to the underlying model name. (#2343) (Trevor Wilson)
  • Fix URL and endpoint fallback resolution for local, Ollama, and Azure models so configured settings are used correctly instead of boolean values, preventing invalid base URLs during initialization. (#2344) (Trevor Wilson)
  • Fix CLI test runs by loading the correct pytest plugin. Update the plugin argument to deepeval so the updated entry point is used and tests run with the intended plugin enabled. (#2348) (Trevor Wilson)
  • Fix test discovery by adding a missing __init__.py, ensuring the test suite is treated as a module and runs reliably across environments. (#2349) (Trevor Wilson)

v3.7.3

  • Fix HumanEval so verbose_mode is respected and not always treated as enabled. Also fix predictions DataFrame creation by aligning the collected row fields with the DataFrame columns, preventing a column mismatch ValueError during evaluation. (#2323) (Levent K. (M.Sc.))

November

November improved observability and evaluation workflows. Tracing expanded with Anthropic messages.create capture, richer tool-call visibility for LangChain and LlamaIndex, and clearer CrewAI spans. Evaluation grew with experiment support for compare() runs, new ExactMatchMetric and PatternMatchMetric, and a conversational golden synthesizer plus updated agent evaluation docs.

New Feature

v3.7.1

  • Add support for sending compare() runs as experiments, including test run summaries, hyperparameters, and run duration, and optionally opening the results in a browser. (#2287) (Kritin Vongthongsri)
  • Add support for passing a Google service account key when using Gemini via Vertex AI, including a new CLI option to save it in config. This enables authenticated Vertex AI access without relying on default credentials. (#2291) (Kritin Vongthongsri)
  • Add support for overriding the Confident API base URL via CONFIDENT_BASE_URL, allowing use of custom or self-hosted endpoints. Also align the API key header name to CONFIDENT-API-KEY for better compatibility. (#2305) (Tanay)
  • Support creating MLLMImage from Base64 data by providing dataBase64 and mimeType, and prevent invalid combinations like setting both url and dataBase64. Add as_data_uri() to return a data URI when Base64 data is available. (#2306) (Vamshi Adimalla)
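
A short sketch of the Base64 option above. MLLMImage, dataBase64, mimeType, and as_data_uri() are named in the release note; the import path and exact keyword spellings are assumptions.

    import base64

    from deepeval.test_case import MLLMImage  # import path assumed

    with open("chart.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    # Providing both url and dataBase64 would be rejected as an invalid combination.
    image = MLLMImage(dataBase64=encoded, mimeType="image/png")
    print(image.as_data_uri())  # e.g. "data:image/png;base64,iVBOR..."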

v3.7.2

  • Add a conversational golden synthesizer to generate multi-turn scenarios from docs, contexts, or from scratch, with sync/async APIs and optional expected outcomes. Include new conversational styling options to control scenario context, roles, and task. (#2310) (Vamshi Adimalla)

v3.7.0

  • Add Anthropic integration that automatically captures messages.create (sync and async) calls for tracing, including model, inputs/outputs, token usage, and tool calls when available. (#2224) (Tanay)
  • Add tracing for CrewAI knowledge retrieval events, recording the query as span input and the retrieved knowledge as span output for clearer observability. (#2261) (Mayank)
  • Add non-LLM metrics for exact equality and regex full matching. Use ExactMatchMetric to compare actual_output vs expected_output, and PatternMatchMetric to validate actual_output against a pattern with optional case-insensitive matching and verbose logs. (#2274) (Vamshi Adimalla)
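
A minimal sketch of the two non-LLM metrics above. The metric names and the fields they compare come from the release note; the import path and constructor keywords (such as the case-insensitivity flag) are assumptions.

    from deepeval.metrics import ExactMatchMetric, PatternMatchMetric  # import path assumed
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
        expected_output="Paris",
    )

    exact = ExactMatchMetric()  # compares actual_output against expected_output
    pattern = PatternMatchMetric(pattern=r"^paris$", ignore_case=True)  # keyword names assumed

    exact.measure(test_case)
    pattern.measure(test_case)
    print(exact.score, pattern.score)  # no judge model is called for either metric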

Improvement

v3.7.1

  • Relax the dependency pin for pytest-rerunfailures to allow newer versions, improving compatibility with modern pytest releases and reducing dependency conflicts during installation. (#2304) (Konstantin Kutsy)
  • Remove unused temporary scripts from the repository to keep the codebase cleaner and reduce clutter. (#2309) (Bowen Liang)

v3.7.2

  • Fix README code block formatting so the .env.local setup snippet renders correctly and is easier to copy and follow. (#2312) (Bhuvnesh)

v3.7.0

  • Add tools_called tracking for LangChain and LlamaIndex traces, capturing tool name, inputs, and outputs on both the parent span and trace. This makes tool usage visible in recorded runs and improves debugging of agent workflows. (#2251) (Mayank)
  • Add a documented issue lifecycle policy: inactive issues may be closed after 12 months, with guidance on how to request reopening and which issues are excluded. (#2273) (Trevor Wilson)
  • Add documentation for running end-to-end evaluations with OpenAI Agents using evals_iterator(), including synchronous and asynchronous examples and automatic trace generation per golden. (#2275) (Mayank)
  • Improve non-LLM metric documentation with clearer wording, corrected references, and more consistent parameter and calculation descriptions for ExactMatchMetric and PatternMatchMetric. (#2276) (Kritin Vongthongsri)
  • Add telemetry logging around OpenAI and Anthropic integrations to capture tracing when their client classes are patched. This improves observability of provider integration behavior during runtime. (#2279) (Tanay)

Bug Fix

v3.7.1

  • Fix tracing masking to return the value from a custom mask function in TaskManager.mask, so masked data is actually propagated instead of being discarded. (#2289) (Trevor Wilson)
  • Fix runtime crashes in the OpenAI Agents callback handler by adding missing explicit imports and replacing wildcard imports. This prevents NameError issues and cleans up linting problems around undefined names. (#2290) (Trevor Wilson)
  • Fix prompt template handling by catching JSONDecodeError and TypeError during parsing, and prevent crashes by wrapping os.makedirs in a try/except. Remove stray debug output and avoid overly broad exception handling for clearer failures. (#2295) (Trevor Wilson)
  • Fix cache reads by creating a fresh temp cache when the existing cache file can’t be parsed or loaded. This prevents failures and keeps test runs moving forward even if the cache is corrupted. (#2296) (Trevor Wilson)
  • Fix prompt and test-run workflows on read-only filesystems by gating disk I/O and optional portalocker usage. Skip local caching when the environment is read-only while continuing to upload results. (#2297) (Trevor Wilson)
  • Fix the simulator’s example JSON output to use valid JSON booleans (false instead of False), preventing JSON parse errors. Add an AlwaysJsonModel stub and a regression test to ensure JSON mode output stays parseable. (#2301) (Trevor Wilson)

v3.7.0

  • Fix Anthropic and OpenAI integration tests to use LlmSpanContext for prompt and metric collection, with thread_id passed separately. This aligns tracing usage with the current API and prevents test failures. (#2256) (Tanay)
  • Fix Anthropic async integration tests by switching to the tool’s Anthropic client, updating prompt version handling, and adding a new trace fixture for messages.create. (#2258) (Tanay)
  • Fix Anthropic integration tests to use the official anthropic client and updated tracing expectations, keeping async/sync trace fixtures in sync with current outputs. (#2259) (Tanay)
  • Fix TaskCompletionMetric task handling so extracted tasks only replace task when it wasn’t provided at initialization. Prevents a provided task from being overwritten during repeated measure/a_measure calls. (#2260) (Mayank)
  • Fix OpenTelemetry token counting by falling back to gen_ai.usage.input_tokens and gen_ai.usage.output_tokens when provider-specific attributes are missing, ensuring input/output token counts are captured consistently. (#2263) (Mayank)
  • Fix Python 3.9 compatibility by replacing bool | None type hints with Optional[bool], preventing syntax errors when using the package on py39. (#2264) (OwenKephart)
  • Fix settings and dotenv test behavior by restoring auto-refresh when environment variables change and using the correct telemetry opt-out variable (DEEPEVAL_TELEMETRY_OPT_OUT). Add an enable_dotenv test marker and environment sandboxing, and improve boolean coercion coverage. (#2266) (Trevor Wilson)
  • Fix TestRun loading and updates to preserve the in-memory state when disk reads or writes fail. Only replace the current data on a successful load, warn on errors, and fall back to in-memory updates. Ensure the parent directory exists before saving. (#2267) (Trevor Wilson)
  • Fix integration tests by centralizing URL/JSON formatting helpers and ensuring OpenAI tracing updates span and trace attributes consistently. (#2269) (Mayank)
  • Fix Pydantic v2 deprecation warnings by migrating all models from class-based Config to ConfigDict. Imports and common workflows no longer emit DeprecationWarnings. (#2272) (Andres Soto)
  • Fix DROP batching by requiring schema-aware batch_generate(prompts, schemas) and failing fast with clearer errors when unsupported. Remove the obsolete type= argument from batch_predict() to match predict(), and make the base batch_generate raise NotImplementedError for clearer behavior. (#2278) (Trevor Wilson)
  • Fix LangChain integration tests by importing create_tool_calling_agent from a stable module path, reducing breakage across LangChain versions. (#2281) (Trevor Wilson)
  • Fix PostHog dependency constraints to allow versions from 5.4.0 up to (but not including) 7.0.0, improving compatibility with supported PostHog releases. (#2283) (Trevor Wilson)

October

October made tracing and evaluation more robust with gen_ai.*.messages normalization, structured message types, JSON-safe metadata, and better agent output capture across OpenAI, PydanticAI, and CrewAI. Async reliability improved with per-task timeouts and cooperative timeout budgeting so stalled work fails fast while runs finalize. Metrics gained async-by-default Hallucination evaluation, new agent-focused metrics, and configurable logging.

Backward Incompatible Change

v3.6.9

  • Add cooperative timeout budgeting across retries and tasks, and always persist test cases and metrics when runs are cancelled or time out. Introduce *_OVERRIDE env settings for per-attempt and per-task timeouts, gather buffer, and stack-trace logging, and default the OpenAI client timeout from settings. (#2247) (Trevor Wilson)
  • Revert settings auto-refresh based on environment changes, restoring the previous cached Settings behavior. Telemetry and error reporting now read DEEPEVAL_TELEMETRY_OPT_OUT and ERROR_REPORTING directly from environment variables again. (#2253) (Jeffrey Ip)

v3.6.8

  • Remove patched LlamaIndex agent wrappers and attach metrics/metric collections via tracing context instead. This simplifies the integration and keeps LlamaIndex agents unmodified while still enriching agent and LLM spans with the expected metadata. (#2233) (Mayank)

v3.6.6

  • Update the CrewAI integration to use the latest event APIs and simplify setup. Remove the custom Agent wrapper so you can use CrewAI’s built-in Agent directly while still enabling tracing via instrument_crewai(). (#2152) (Mayank)

New Feature

v3.6.8

  • Add per-task timeouts to semaphore-guarded async evaluation work, so individual stalled tasks fail fast instead of hanging the whole run. When exceeded, the task raises asyncio.TimeoutError. (#2134) (Harsh S)
  • Add a tool decorator for the CrewAI integration that propagates metric and metric_collection onto tool spans while staying compatible with existing CrewAI decorator usage patterns. (#2206) (Mayank)
  • Add new agent evaluation metrics (Goal Accuracy, Topic Adherence, Plan Adherence, Plan Quality, Tool Use, and Step Efficiency), and improve trace handling by relying on a metric’s requires_trace flag. Also prevent duplicate trace results from being reported in test output. (#2238) (Vamshi Adimalla)
  • Add async-friendly eval iteration for the PydanticAI integration so evals_iterator() can collect and await tasks while finalizing and serializing traces, with optional agent-level metrics during runs. (#2241) (trevor-cai)
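
The evals_iterator() pattern referenced above (and in several other entries), shown in its simplest synchronous form. EvaluationDataset and Golden are existing deepeval classes; llm_app is a hypothetical stand-in for a PydanticAI (or any other) agent.

    from deepeval.dataset import EvaluationDataset, Golden

    dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])

    def llm_app(query: str) -> str:
        # Hypothetical stand-in for your traced agent or LLM app.
        return "It is sunny in Paris."

    # Each iteration yields one golden; calling the app inside the loop lets
    # deepeval associate the resulting trace (and any attached metrics) with it.
    for golden in dataset.evals_iterator():
        llm_app(golden.input)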

v3.6.7

  • Add OpenAI integration support with clearer dependency errors, and update evaluation flow to avoid relying on OpenAI-specific test case queues. CI now runs integration tests when API keys are available and safely skips them otherwise. (#2173) (Mayank)
  • Add CrewAI wrappers Crew, Agent, and LLM that accept metrics and metric_collection and pass them into tracing spans. This lets you capture per-run metrics automatically when using with trace(metrics=...). (#2189) (Mayank)

v3.6.6

  • Add display of conversational turns in multi-turn evaluations, showing role, truncated content, and any tools used. Turns are now included in test results and appear in CLI output and log/file reports. (#2113) (Trevor Wilson)
  • Add saving of the trace ID in the Pydantic AI instrumentator so it can be accessed later from the same run context. This makes it possible to reference past traces for follow-up actions like annotation. (#2140) (Mayank)
  • Add test_run_id to the EvaluationResult returned by evaluate, so you can reference the created test run programmatically. The existing confident_link is still returned when available. (#2156) (Vamshi Adimalla)
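
A small sketch of reading the new field. evaluate(), LLMTestCase, and AnswerRelevancyMetric are existing deepeval APIs; the exact keyword arguments shown are assumptions about the current signature.

    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    result = evaluate(
        test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
        metrics=[AnswerRelevancyMetric()],
    )

    # test_run_id is new in this release; confident_link is still returned when available.
    print(result.test_run_id, result.confident_link)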

v3.6.3

  • Add support for pulling prompts by label and caching them separately from version-based pulls. Improve prompt cache reliability by using file locking and falling back to the API when the cache is missing, locked, or unreadable. (#2154) (Jeffrey Ip)

Improvement

v3.6.9

  • Add automatic settings refresh when environment variables change and expand dotenv-related tests using the enable_dotenv marker to validate boolean coercion. Update telemetry env handling to use DEEPEVAL_TELEMETRY_OPT_OUT for clearer opt-out behavior. (#2249) (Trevor Wilson)

v3.6.8

  • Add timeouts around async task orchestration to prevent asyncio.gather from hanging indefinitely. On timeout, pending tasks are cancelled and drained before the error is raised, improving reliability of async evaluations. (#2136) (S3lc0uth)
  • Improve test run metrics aggregation and results table output by refactoring into clearer helper functions. The results table formatting is now more consistent, easier to extend, and handles separators and empty rows more cleanly. (#2153) (Ayesha Shafique)
  • Add support for passing arguments to embedding models and for customizing ConversationalGEval prompts via an evaluation_template. Fix MCP scoring to avoid division-by-zero when no scores are produced, and expand quickstart/docs with a template customization example. (#2203) (Vamshi Adimalla)
  • Improve error surfacing during evaluation and tracing with a clearer error taxonomy and typed messages. When required inputs are missing or async tasks fail, affected spans are marked ERRORED while evaluation continues. Skip metric collection for failed nodes and keep progress reporting accurate when work is skipped. (#2207) (Trevor Wilson)
  • Add model request parameters (like temperature and max_tokens) to the traced LLM input messages when available, making it easier to see the exact settings used for a call. (#2210) (Mayank)
  • Improve OpenAI integration tracing to better handle legacy and Responses API calls. Input/output extraction is now guarded to prevent crashes, messages are rendered consistently, and tool-only outputs are captured so traces still show what happened. (#2211) (Mayank)
  • Improve the Hallucination metric by moving the required parameter list from module scope to a class-level attribute for consistency with other metrics. This makes required inputs easier to inspect and validate when integrating with custom observability tooling. (#2215) (Anurag Gowda)
  • Add an OpenAI integration cookbook with a ready-to-run Colab notebook showing how to trace OpenAI SDK calls and run evaluations for standalone requests and full LLM apps. (#2237) (Mayank)

v3.6.7

  • Add structured prompt metadata and improved Prompt.load() parsing, including safer fallbacks when JSON is invalid or malformed. Test runs now capture and persist prompts seen during LLM spans for easier tracking and reproducibility. (#2102) (Kritin Vongthongsri)
  • Add structured message types for LLM spans, including text, tool call, and tool output payloads. This improves typing and serialization for input and output when tracing multi-part model interactions. (#2116) (Mayank)
  • Improve code formatting and lint compliance in OpenAI integration and trace test helpers, reducing lint noise and keeping patching logic easier to maintain. (#2166) (Trevor Wilson)
  • Add configurable metric logging controls, including enable/disable, verbosity, flush, and sampling rate, separate from trace sampling. This also renames CONFIDENT_SAMPLE_RATE to CONFIDENT_TRACE_SAMPLE_RATE for clarity. (#2174) (Jeffrey Ip)
  • Improve tracing so parent spans automatically include tools_called when tool spans run underneath them, even if the parent didn’t record tool calls directly. (#2175) (Mayank)
  • Improve LangChain and LangGraph integration docs with clearer metric usage examples and new guidance for component-level evals. Update snippets to pass metrics inline and document how to attach metrics to LLMs and tools. Hide the PydanticAI integration page from the sidebar. (#2177) (Mayank)
  • Improve dataset turn serialization by using json.dumps(..., ensure_ascii=False) so non-ASCII characters are preserved instead of being escaped in the output JSON. (#2186) (danerlt)
  • Improve multimodal metric evaluation by adding a _log_metric_to_confident flag and propagating it through sync and async measure calls, making it easier to control metric logging behavior in different execution modes. (#2191) (Jeffrey Ip)
  • Improve docs by adding tabbed examples for model integrations (OpenAI, Anthropic, Gemini, Ollama, Grok, Azure OpenAI, Amazon Bedrock, Vertex AI), making it easier to copy the right setup for each provider. (#2196) (Kritin Vongthongsri)
  • Fix typos and wording in the metrics DAG documentation to improve clarity and readability. (#2198) (Simone Busoli)

v3.6.6

  • Add a test mode for tracing integrations so spans can be captured in-memory instead of exported over OTLP. This makes integration CI tests more reliable by avoiding network calls and letting tests assert on collected trace data. (#2131) (Mayank)
  • Improve optional CrewAI integration imports by handling missing dependencies cleanly and logging details in verbose mode, while also applying consistent formatting and lint fixes to keep CI passing. (#2158) (Trevor Wilson)
  • Improve verbose logging for missing optional dependencies by emitting warnings instead of errors. Logs now show the missing module name when available and avoid tracebacks while pointing to the caller for easier debugging. Messages are only shown when DEEPEVAL_VERBOSE_MODE is enabled. (#2159) (Trevor Wilson)
  • Improve PydanticAI tracing by including gen_ai.system_instructions in the captured input and flattening agent outputs to the final non-thinking text when final_result is missing. (#2160) (Mayank)
  • Prevent sync HTTP calls from hanging indefinitely by enforcing per-attempt timeouts and retrying failures with a configurable Tenacity backoff policy. (#2162) (Trevor Wilson)

v3.6.3

  • Improve Amazon Bedrock request building by passing generation_kwargs through as-is, removing automatic snake_case-to-camelCase parameter translation. This makes parameter names consistent with what Bedrock expects and avoids unexpected remapping. (#2106) (Vamshi Adimalla)
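
A hedged sketch of what the pass-through means in practice. AmazonBedrockModel is referenced elsewhere in these notes; its import path and constructor arguments here are assumptions. The point is simply that keys inside generation_kwargs now reach Bedrock exactly as written.

    from deepeval.models import AmazonBedrockModel  # import path and arguments assumed

    model = AmazonBedrockModel(
        model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example Bedrock model id
        # Keys are forwarded as-is, so use Bedrock's own camelCase parameter names;
        # no snake_case-to-camelCase translation is applied anymore.
        generation_kwargs={"maxTokens": 512, "temperature": 0.2},
    )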

v3.6.2

  • Improve OpenTelemetry tracing by normalizing gen_ai.*.messages that use parts into plain role/content messages and by forcing trace/span metadata into JSON-safe strings, including circular-reference handling, to prevent export/serialization failures. (#2114) (Mayank)
  • Improve trace and agent input/output flattening by normalizing message parts and making non-text content JSON-serializable. This reduces errors when traces include structured or non-text payloads. (#2115) (Mayank)
  • Improve the Hallucination metric by enabling async_mode=True by default, so evaluations run asynchronously unless you opt out. This can reduce blocking during metric execution in async-capable workflows. (#2117) (Sai-Suraj-27)
  • Improve code formatting and lint compliance by cleaning up imports and exception handling in tracing utilities, reducing ruff/black warnings without changing behavior. (#2119) (Trevor Wilson)
  • Improve readability of cards and expandable sections in dark mode by refining background, borders, and text contrast. Adjust hover and focus states to keep interactive elements clear and accessible. (#2122) (Debangshu)
  • Add per-task timeouts for async observed_callback execution so slow callbacks don’t block evaluation indefinitely, raising asyncio.TimeoutError after the configured limit. Synchronous callbacks are unaffected. (#2127) (Tharun K)

Bug Fix

v3.6.9

  • Fix EvaluationDataset.save_as serialization so critical fields (like tools_called, expected_tools, metadata, and custom columns) are preserved across JSON, JSONL, and CSV. Multi-turn datasets now save turns as structured objects in JSON/JSONL, and CSV embeds full turn data as a JSON string while extending headers accordingly. (#2227) (Wang Junwei)
  • Fix unclosed aiohttp client sessions when using AmazonBedrockModel with aiobotocore, preventing post-evaluation warnings about unclosed sessions and connectors. (#2250) (m.tsukada)

v3.6.8

  • Fix embedding model initialization so generation_kwargs is passed as a dict and client options are provided via **client_kwargs. Also add explicit parameters for required connection settings (like API keys, endpoints, and host) to reduce confusion when configuring clients. (#2209) (Vamshi Adimalla)
  • Fix the CrewAI example notebook by adding tracing around crew.kickoff() and reusing the answer relevancy metric, so execution traces and metric reporting work more reliably in the walkthrough. (#2212) (Mayank)
  • Fix a_generate_goldens_from_contexts so generated goldens use the correct source_file for each context instead of mismatching indices, and keep progress/scores aligned with the right input. (#2213) (Vamshi Adimalla)
  • Fix span result extraction to treat TraceSpanApiStatus.SUCCESS as a successful span status, so enum-based statuses are handled correctly. Adds a regression test to prevent status comparisons from incorrectly marking spans as failed. (#2214) (Trevor Wilson)
  • Fix ToolCall.__repr__ to serialize input_parameters and dict output with ensure_ascii=False, so non-ASCII characters are shown correctly instead of being escaped in the printed representation. (#2230) (danerlt)
  • Fix Contextual Precision verdict payloads to use a singular reason field instead of reasons, improving compatibility with schema-based generation and JSON parsing. (#2234) (Trevor Wilson)
  • Fix multimodal contextual precision verdict parsing by using the singular reason field to match the expected template and schema. Prevents missing reasons and related TypeErrors when generating or reading verdicts. (#2235) (Trevor Wilson)

v3.6.7

  • Prevent core tests from unintentionally calling the Confident backend by clearing Confident API keys from the environment and in-memory settings, and disabling dotenv autoload for these tests. This keeps tests/test_core isolated and avoids accidental external network use. (#2165) (Trevor Wilson)
  • Fix test isolation by sandboxing os.environ per test and resetting settings before and after each run. This prevents settings.edit(persist=False) from leaking environment changes across tests and altering timeouts, retry policies, and other settings. (#2168) (Trevor Wilson)
  • Fix multimodal metric parameter validation by using check_mllm_test_case_params instead of the LLM-only checker. This ensures multimodal test cases are validated with the correct rules and avoids incorrect parameter errors. (#2170) (Ayesha Shafique)
  • Fix synthesizer generation so all evolved prompts are saved as Goldens instead of only the last one. Improve JSON turn serialization to preserve non-ASCII characters. Update docs to clarify when expected_output is produced and how to use a custom embedder for context construction. (#2171) (Vamshi Adimalla)
  • Fix trace evaluation to always run even when there are no leftover tasks, and handle _snapshot_tasks() failures by treating them as empty. Trace evaluation is only skipped when the event loop is closed. (#2178) (Trevor Wilson)
  • Fix G-Eval metric evaluations failing with OpenAI o4-mini by treating it as a model without logprobs support. The evaluator now automatically falls back to standard scoring when o4-mini (including o4-mini-2025-04-16) is used, avoiding 403 errors and completing with valid results. (#2184) (Niyas Hameed)
  • Fix is_successful to correctly set and return success on the happy path based on the score threshold, avoiding false results when checking metric outcomes. (#2188) (Trevor Wilson)
  • Fix evaluation tracing by mapping traces to goldens and skipping any that can’t be mapped. Prevent DFS from failing agentic test execution by finalizing runs even when spans are missing. Add async regression coverage and reset per-test state to avoid cross-test leakage. (#2190) (Trevor Wilson)
  • Fix assert_test validation by rejecting mismatched metric types for LLM, conversational, and multimodal test cases. Update MultimodalToolCorrectnessMetric to use BaseMultimodalMetric and report the correct metric name. (#2193) (Vamshi Adimalla)
  • Fix OpenAI multimodal user messages by stringifying mixed content to avoid Pydantic validation errors. Preserve the original list payload in messages for Responses, and add tests to prevent import-time side effects from SDK patching. (#2199) (Trevor Wilson)

v3.6.6

  • Fix broken tracing integration tests by moving the trace test manager into the package and updating imports so tests no longer depend on a tests.* module path. (#2167) (Mayank)

v3.6.3

  • Fix gpt-5-chat-latest being treated as a reasoning model that forces temperature=1. This restores support for temperature=0.0 and lets users control output determinism as expected. (#2121) (himanushi)
  • Fix Google Colab buttons in the framework integration docs by pointing them to the correct example notebook paths, so the notebooks open properly from the documentation. (#2130) (Mayank)
  • Revert the previous handling for empty expected_tools in the tool correctness metric, restoring the earlier scoring behavior when no expected tools are provided. (#2139) (Trevor Wilson)
  • Fix G-Eval score normalization when the score range does not start at 0. Scores now subtract the lower bound before dividing by the range span, so values like 1–5 correctly map to 0.0–1.0 (see the sketch after this list). Adds test coverage for the corrected behavior. (#2142) (Priyank Bansal)
  • Fix PydanticAI agent tracing to capture input and output messages more reliably. If final_result is missing, the output now falls back to the last recorded message, improving completeness of recorded spans. (#2149) (Mayank)
  • Fix Amazon Bedrock requests to stop forcing a default temperature value. temperature is now only sent when provided via generation_kwargs, letting Bedrock apply its own defaults. (#2151) (Vamshi Adimalla)
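
The corrected normalization from the G-Eval fix above, written out as a small sketch (plain arithmetic, not the library's internal code).

    def normalize_score(score: float, low: float, high: float) -> float:
        # Subtract the lower bound before dividing by the range span.
        return (score - low) / (high - low)

    normalize_score(1, 1, 5)  # 0.0
    normalize_score(3, 1, 5)  # 0.5
    normalize_score(5, 1, 5)  # 1.0 (a 1-5 rubric maps onto 0.0-1.0)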

v3.6.2

  • Fix OpenAI Agents span handling so LLM span properties update only for spans marked as llm. This prevents spans from being skipped due to an incorrect early return and restores expected agent behavior. (#2123) (Mayank)
  • Fix documentation code examples to correctly iterate over datasets, preventing TypeError: 'EvaluationDataset' object is not iterable when following the testing snippets. (#2132) (Denis)
  • Fix ToolCorrectnessMetric crashing with ZeroDivisionError when expected_tools is empty. It now returns 1.0 when both tools_called and expected_tools are empty, and 0.0 when tools are called but none are expected. Added tests for these edge cases. (#2135) (Priyank Bansal)

September

September made agent evaluation and tracing easier to adopt with expanded quickstarts and guides across LangChain, LangGraph, CrewAI, PydanticAI, and OpenAI Agents. Tracing improved with better input/output capture, OpenTelemetry/OTLP export behavior, and new APIs like update_current_span and update_current_trace(). Evaluation added G-Eval templating updates, MCP and conversational/DAG capabilities, and better dataset round-tripping.

Backward Incompatible Change

v2.4.8

  • Remove span feedback from the OpenTelemetry exporter so traces no longer parse or emit the confident.span.feedback attribute, reducing exporter dependencies and payload. (#1942) (Mayank)
  • Change benchmark evaluate results to return strongly typed Pydantic models instead of untyped dicts or floats, with a consistent overall_accuracy interface and optional benchmark-specific fields. This is a breaking change for code expecting raw primitives. Also pin datasets to <4.0.0 to avoid failures from deprecated loader scripts. (#1975) (trevor-inflection)

New Feature

v3.5.9

  • Add evaluation_template support to MultimodalGEval so you can customize how evaluation steps and results are generated, including strict results. Also tighten exception handling and imports to satisfy lint rules. (#2090) (Trevor Wilson)
  • Add Jinja template interpolation for prompt rendering, with template and messages_template now validated to be mutually exclusive to prevent ambiguous prompt types. (#2100) (Jeffrey Ip)

v3.5.5

  • Add a PydanticAI Agent wrapper that automatically captures traces and metrics and patches the underlying model. Also export an OpenTelemetry instrumentation helper so you can instrument PydanticAI more easily without manual setup each run. (#2071) (Mayank)

v3.5.6

  • Add set-debug and unset-debug CLI commands to configure verbose logging, tracing, gRPC verbosity, and error reporting. Settings can be applied immediately and optionally persisted to a dotenv file, with a no-op guard to avoid output when nothing changes. (#2082) (Trevor Wilson)
  • Add support for capturing OpenAI Agents trace context into tool tracing, including workflow name, group/thread id, and metadata. Improve input/output handling so traced runs keep the initial input and select the correct output when running inside a trace. (#2087) (Mayank)

v3.5.3

  • Add a unified, configurable retry policy across all supported model providers. Improve transient error detection and provider-specific handling, with opt-in delegation to provider SDK retries. Allow runtime-tunable retry logging levels and env-driven backoff settings. (#2047) (Trevor Wilson)
  • Add tracing support for sync and async generator functions, ensuring observer spans stay open while items are yielded and close cleanly on completion or errors. (#2074) (Kritin Vongthongsri)
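
A small sketch of the generator tracing described above. The observe decorator is deepeval's tracing entry point; treat the bare @observe() call and the streaming function as illustrative assumptions about defaults.

    from deepeval.tracing import observe

    @observe()
    def stream_answer(prompt: str):
        # The observer span stays open while values are yielded and closes cleanly
        # when the generator completes (or raises), per the behavior described above.
        for token in ["Paris", " is", " the", " capital", " of", " France."]:
            yield token

    print("".join(stream_answer("What is the capital of France?")))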

v3.5.0

  • Add optional OpenTelemetry (OTLP) tracing for dataset evaluation runs via run_otel, generating a per-run ID and emitting start/stop spans plus per-item dummy spans. This enables exporting evaluation traces to an OTLP endpoint for run-level observability. (#2008) (Mayank)

v3.5.1

  • Add token-level streaming timestamps to LLM tracing spans, recording each emitted token with a precise ISO time to help analyze generation latency and pacing. (#2048) (Kritin Vongthongsri)
  • Add prompt version listing and update prompt pulling to use version IDs, with optional background refresh that keeps the local cache up to date. (#2057) (Kritin Vongthongsri)

v2.4.8

  • Add a PydanticAI integration that instruments Agent.run with OpenTelemetry spans and exports agent input/output and optional custom trace attributes. Provide setup_instrumentation() to patch the agent safely and configure span exporting when the OpenTelemetry SDK is available. (#1851) (Mayank)
  • Add MCP metrics for conversational evaluations, including args correctness, task completion, and tool correctness. These metrics support async execution, strict scoring, and verbose reasoning to help debug tool-using interactions. (#1894) (Vamshi Adimalla)
  • Add support for setting trace name, tags, metadata, thread ID, and user ID via confident.trace.* span attributes. Existing confident.trace.attributes is still read for compatibility but is planned for deprecation. (#1897) (Mayank)
  • Add a configurable language parameter to ConversationSimulator so prompts can be generated in any language. Default behavior remains English, so existing usage continues to work without changes. (#1899) (Johan Cifuentes)
  • Add MCP evaluation support for single-turn test cases with the new MCPUseMetric, and introduce MultiTurnMCPUseMetric for multi-turn conversations. This updates the MCP metrics set to better score whether the right MCP primitives and arguments are used for a task. (#1908) (Vamshi Adimalla)
  • Add a new tracing update interface that sets span data directly and introduces update_llm_span for token counts. This simplifies instrumenting LLM and retriever steps and makes metric evaluation work from span inputs/outputs without requiring a prebuilt test case. (#1909) (Kritin Vongthongsri)
  • Add support for passing trace environment, metric_collection, and an optional LLM test case through OpenTelemetry attributes, so these fields are attached to exported traces and can override the default environment when provided. (#1919) (Mayank)
  • Add automatic loading of .env.local then .env at import time so configuration works out of the box, while keeping existing process env vars highest priority. Allow opting out via DEEPEVAL_DISABLE_DOTENV=1. Include a .env.example and expand docs on environment setup and provider keys. (#1938) (Trevor Wilson)
  • Add support for trace-level metrics in end-to-end evaluations, so you can attach metrics to a whole trace via update_current_trace() and have them run and reported alongside span-level metrics; a sketch appears after this list. (#1949) (Kritin Vongthongsri)
  • Add an option to run conversation simulation remotely via the API with run_remote=True. This allows generating user turns without a local simulator model, and raises a clear error when the API key is missing. (#1959) (Kritin Vongthongsri)
  • Add support for GPT-5 completion parameters such as reasoning_effort. You can now pass new model-specific options via a dedicated params dict, avoiding code changes when new parameters are introduced. (#1965) (John Lemmon)
  • Add --save=dotenv[:path] to provider set/unset so credentials can be stored in a .env file instead of the JSON store, reducing the chance of leaking secrets. Expand set/unset tests across providers and prepare for future secure storage backends. (#1967) (Trevor Wilson)
  • Add MCP evaluation examples for single-turn and multi-turn conversations, showing how to connect to MCP servers, invoke tools, and build test cases from tool calls and model outputs. (#1979) (Vamshi Adimalla)
  • Add support for customizing GEval prompts via an injectable evaluation_template, and export GEvalTemplate for easier reuse. Improve evaluation docs with expanded component-level guidance, unit testing in CI/CD coverage, and updated custom embedding model configuration examples. (#1986) (Vamshi Adimalla)
  • Add save_as support for conversational goldens so multi-turn datasets can be exported to JSON or CSV. Turns are serialized into a single field for portable round-tripping, and save_as now errors clearly when called on an empty dataset. (#1991) (Vamshi Adimalla)
  • Add a public option when pulling datasets so you can fetch publicly shared cookbook datasets without requiring private access. (#1995) (Mayank)
  • Add component-level evals for LangGraph by propagating metrics and metric_collection metadata through LLM and tool spans. Include a patched tool decorator so tools can carry metric settings without custom wiring. (#2000) (Mayank)
  • Add prompt metadata to LLM tracing spans, including alias and version. This lets traces record which prompt was used alongside model and token/cost details. (#2001) (Kritin Vongthongsri)
  • Add ConversationalDAGMetric and conversational DAG node types to evaluate multi-turn conversations using a DAG workflow. Supports async and sync execution with threshold/strict modes, cycle detection, and optional verbose logs and reasons. (#2002) (Vamshi Adimalla)
  • Add component-level evaluation support for PydanticAI tools by allowing metric_collection or metrics on the @agent.tool decorator and recording tool outputs as tracing span attributes. (#2003) (Mayank)
  • Add an OpenAI Agents Runner wrapper that collects metrics during run/run_sync and attaches inputs/results to traces. Export Runner from the openai_agents package for easier use in agent eval workflows. (#2005) (Mayank)
  • Add a function_tool wrapper for OpenAI Agents that automatically traces tool calls with observe and supports passing metrics or a metric collection. Tool spans are skipped in the tracing processor to reduce noise during component evaluation. (#2010) (Mayank)
  • Add Markdown document support (.md, .markdown, .mdx) in the synthesizer loaders. Improve lazy imports and type hints so heavy optional deps like LangChain and Chroma are only required when used, with clearer errors and updated docs on required packages. (#2018) (Trevor Wilson)
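
Relating to the trace-level metrics entry earlier in this list: a hedged sketch of attaching metrics to a whole trace. update_current_trace() is named in the release note; the import path and the metrics keyword are assumptions.

    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.tracing import observe, update_current_trace  # import path assumed

    @observe()
    def llm_app(query: str) -> str:
        answer = "DeepEval is an open-source LLM evaluation framework."  # stand-in for a real model call
        # Attach a metric to the whole trace (keyword name assumed); during an
        # end-to-end evaluation it runs and is reported alongside span-level metrics.
        update_current_trace(metrics=[AnswerRelevancyMetric()])
        return answer

    llm_app("What is DeepEval?")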

Improvement

v3.6.0

  • Add a documented, explicit way to access the active dataset golden and pass its expected_output during component-level evaluation. The executor now sets and resets the current golden around user code, and tests ensure expected_output is preserved across spans and traces with sensible override and None handling. (#2096) (Trevor Wilson)
  • Add a new CLI guide covering install, secrets, provider switching, debug flags, retries, examples, and troubleshooting. Improve Multimodal G-Eval docs by documenting evaluation_template behavior, expected JSON return shapes, and a minimal customization example. Fix multiple broken links across metrics, guides, integrations, and tutorials. (#2109) (Trevor Wilson)
  • Improve the OpenAI Agents integration by simplifying agent/model processing and exposing only the supported public API (DeepEvalTracingProcessor, Agent, and function_tool). This reduces unused imports and avoids exporting Runner from the package namespace. (#2110) (Mayank)

v3.5.9

  • Add support for name and comments fields when loading goldens from CSV/JSON and when exporting datasets via save_as, preserving this metadata across round-trips. (#2066) (Vamshi Adimalla)
  • Fix a typo in the agents getting-started guide so the end-to-end evaluation instructions read correctly. (#2095) (Raj Ravi)
  • Improve PydanticAI OpenTelemetry instrumentation by reviving and consolidating it under ConfidentInstrumentationSettings. Agent-level tracing and metric wiring is now configured via the instrument setting, and the old instrument_pydantic_ai path is deprecated. (#2098) (Mayank)

v3.5.5

  • Improve OpenAI Agents tracing and metrics by using typed BaseMetric lists and recording a Prompt on LLM spans. Also serialize streamed and non-streamed outputs for more reliable observability and downstream processing. (#2084) (Mayank)

v3.5.3

  • Improve prompt tests by asserting the pulled prompt version starts at 0, ensuring versioning behavior is validated alongside template and message content. (#2064) (Kritin Vongthongsri)
  • Fix a typo in the metrics introduction docs by changing “read-to-use” to “ready-to-use” for clearer wording. (#2065) (Jason Smith)
  • Add a maintainer-only GitHub Actions workflow to manually run the full test suite against a PR’s head or merge ref, with concurrency control and optional secret-based tests. (#2069) (trevor-cai)

v3.5.2

  • Improve LangChain/LangGraph tracing by using context variables to keep the active trace consistent across tool calls and nested runs. Also expose the tool decorator from the integration so you can attach metric_collection metadata and keep span attributes in the correct trace. (#2052) (Mayank)
  • Improve the PydanticAI integration by adding safer one-time instrumentation, tracing for run_sync, and consistent trace argument names (e.g., name, tags, metadata). This also sanitizes run context data to avoid noisy or circular payloads in captured traces. (#2060) (Mayank)

v3.5.0

  • Add a provider-agnostic retry policy with env-tunable defaults and clearer transient vs non-retryable classification. OpenAI requests now use the shared policy, disable SDK internal retries to avoid double backoff, and log retries more consistently. Quota-exhausted 429s are treated as non-retryable while timeouts and 5xx errors still retry. (#1941) (Trevor Wilson)
  • Add a trace JSON validation flow for integration tests. Provide commands to generate trace test data and then validate the generated JSON to catch regressions earlier. (#2019) (Mayank)
  • Add a centralized, validated Settings system and refactor CLI config commands to use it for consistent env and persistence behavior. Prevent secrets from being written to the legacy JSON store, and allow safe persistence to dotenv files when --save (or the default save setting) is enabled. (#2026) (Trevor Wilson)
  • Improve example notebook formatting to satisfy black and fix lint errors, making the Conversational DAG example easier to run and review. (#2028) (Trevor Wilson)
  • Improve OpenTelemetry handling by importing the OTLP exporter lazily and raising a clear error when the dependency is missing. This prevents import-time failures and guides you to install opentelemetry-exporter-otlp-proto-http when tracing is enabled. (#2032) (Mayank)
  • Improve test setup reliability by reusing shared helpers to reset settings environment and tear down the settings singleton. Ensure the hidden store directory is created consistently and make config tests importable via a package __init__.py. (#2033) (Trevor Wilson)
  • Add __init__.py files to nested test directories to prevent Python import/module name collisions during test runs. (#2037) (Trevor Wilson)
  • Add pre-commit hooks and Ruff to provide consistent linting and formatting on changed files. Update the lockfile to include the new development dependencies. (#2038) (Trevor Wilson)
  • Temporarily skip CLI and config tests that rely on environment/settings persistence while the persistence layer is being refactored. (#2041) (Trevor Wilson)
  • Add a simplified PydanticAI integration API by exposing instrument_pydantic_ai and removing the custom Agent wrapper, with updated CLI trace flag names and tests to ensure trace output is generated as expected. (#2042) (Mayank)

v2.4.8

  • Add new documentation quickstarts for AI agent evaluation, including setup for LLM tracing and both end-to-end and component-level evals across popular frameworks. Improve clarity in existing evaluation docs with updated titles and expanded dataset terminology. (#1818) (Kritin Vongthongsri)
  • Improve documentation site styling for collapsible sections, sidebar menu, and code blocks for a more consistent reading experience. (#1879) (Jeffrey Ip)
  • Improve tutorials by reorganizing evaluation sections, renaming pages to simpler routes, and adding a dedicated RAG QA evaluation guide with setup and synthetic data generation examples. (#1885) (Vamshi Adimalla)
  • Add support for exporting trace-level input and output fields from span attributes, so traces capture the overall request and response alongside existing trace attributes. (#1887) (Mayank)
  • Improve telemetry tracing integration event names by standardizing them under a deepeval.integrations.* namespace for more consistent reporting across supported frameworks. (#1888) (Mayank)
  • Add support for setting a span’s input and output via update_current_span, so custom values are preserved and masked correctly during trace updates. (#1893) (Kritin Vongthongsri)
  • Improve the LLM Arena quickstart with a full walkthrough for creating ArenaTestCases, defining an arena metric, and running compare() to pick a winner. Also fix a typo in the arena criteria example and add the page back to the docs sidebar for easier discovery. (#1896) (Vamshi Adimalla)
  • Add LangChain integration docs with end-to-end and production evaluation examples using a CallbackHandler, including synchronous and asynchronous workflows and guidance on supported metrics. (#1900) (Kritin Vongthongsri)
  • Improve CrewAI tracing by capturing agent roles, available tools, tool inputs/outputs, and completed LLM call details, and by tracing contextual memory retrieval. This makes traces more informative across agent, tool, LLM, and retriever spans. (#1902) (Mayank)
  • Improve DeepSeek integration docs by updating the initialization example to use model instead of model_name, matching the current constructor and reducing setup confusion. (#1906) (Lukman Arif Sanjani)
  • Improve tracing for CrewAI, LangChain, LlamaIndex, and PydanticAI integrations by scoping instrumentation with a context manager. This makes span capture more reliable during initialization and setup. (#1911) (Jeffrey Ip)
  • Improve G-Eval prompting to generate reasoning before the final score. This encourages more complete evaluations and can lead to more accurate, consistent scoring across judge use cases. (#1912) (Bofeng Huang)
  • Add generation_kwargs to supported LLM model wrappers so you can pass provider-specific generation options like top_p and max_tokens, with updated docs and a new MCP quickstart page in the sidebar. (#1921) (Vamshi Adimalla)
  • Improve the OpenAI integration docs by adding gpt-5, gpt-5-mini, and gpt-5-nano to the list of commonly used models. (#1924) (fangshengren)
  • Add and refresh end-to-end evaluation documentation for multiple frameworks, including new guides for CrewAI and Pydantic AI plus updated LangChain examples. Include clearer setup, dataset iteration, and optional trace viewing steps to help you run evals quickly. (#1926) (Mayank)
  • Improve documentation examples for LLM tracing and agent evaluation by fixing imports, metric names, and tracing helpers. Update the walkthrough to use EvaluationDataset.evals_iterator() and update_current_span so the sample code matches current APIs. (#1927) (Kritin Vongthongsri)
  • Add support for newer GPT-5 and o4-mini model variants, including updated pricing metadata. Automatically set temperature=1 for models that require it to prevent invalid configuration errors. (#1930) (John Lemmon)
  • Improve modes imports by defining __all__, making ARCMode and TruthfulQAMode the explicitly exported public API for star-imports and tooling. (#1932) (trevor-inflection)
  • Improve the Confident API client by standardizing responses and surfacing clearer errors and deprecation warnings. Update endpoints and return (data, link) so CLI, prompts, datasets, and tracing can consume links consistently. (#1933) (Jeffrey Ip)
  • Upgrade the PostHog client dependency to a newer version to avoid telemetry conflicts with projects that also use PostHog. This improves compatibility when both tools are installed in the same environment. (#1935) (Lucas Castelo)
  • Improve PydanticAI tracing by exporting spans via an OTLP HTTP endpoint and requiring a configured API key. This makes instrumentation fail fast when credentials are missing and aligns traces with standard OpenTelemetry exporters. (#1940) (Mayank)
  • Improve benchmark evaluate polymorphism by standardizing interfaces and accepting extra **kwargs. This lets you call different benchmarks with shared arguments like batch_size without crashing when a benchmark does not use them. (#1955) (trevor-inflection)
  • Improve trace API payloads by populating input/output, expected output, context, retrieval context, tool calls, and metadata. This makes exported traces and generated test cases more complete and easier to debug. (#1961) (Kritin Vongthongsri)
  • Improve the PydanticAI integration with a new Agent interface that supports passing metric_collection, metrics, and trace fields directly to run/run_sync. Add validation for trace and metric inputs and require OpenTelemetry to enable tracing. (#1978) (Mayank)
  • Add an overwrite_metrics option to thread offline evaluations so you can replace existing metric results when re-running evaluations. (#1980) (Kritin Vongthongsri)
  • Add new LangGraph, Pydantic AI, and CrewAI cookbooks with “Open in Colab” buttons in the docs, making it easier to run the example notebooks from the integration pages. (#1987) (Mayank)
  • Improve OpenTelemetry export by capturing span error status and description from the official status fields instead of custom attributes. Also handle trace metadata as a dict to avoid unnecessary JSON parsing and make metadata export more reliable. (#1990) (Mayank)
  • Improve example notebooks by adding black[jupyter] to dev dependencies and reformatting notebook code for more consistent, readable cells. (#2011) (Trevor Wilson)
  • Add an Agent wrapper for openai-agents that automatically traces model calls with metrics and an optional Prompt. Improve tracing so span and trace inputs/outputs are captured correctly, and LLM spans record the prompt when provided. (#2012) (Mayank)
  • Fix async execution in Conversational DAG nodes by awaiting model generation and metric evaluation calls, preventing missed results during traversal. Add detailed Conversational-DAG documentation with end-to-end examples for building and running multi-turn decision-tree evaluations. (#2014) (Vamshi Adimalla)
  • Improve code formatting to satisfy linting and keep tests and DAG modules consistent with Black style. (#2016) (Trevor Wilson)
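
As a quick reference for the walkthrough change noted above (#1927), here is a minimal sketch of the evals_iterator() and update_current_span pattern. The import paths and the llm_app stub are assumptions; exact signatures may differ between releases.

```python
# Hedged sketch of the documented pattern; import paths and the llm_app
# stub are assumptions, not the exact walkthrough code.
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span


@observe(metrics=[AnswerRelevancyMetric()])
def llm_app(user_input: str) -> str:
    answer = "DeepEval is an open-source LLM evaluation framework."  # stand-in for a real model call
    # Attach a test case to the current span so the metric can score this component.
    update_current_span(test_case=LLMTestCase(input=user_input, actual_output=answer))
    return answer


dataset = EvaluationDataset(goldens=[Golden(input="What is DeepEval?")])
for golden in dataset.evals_iterator():
    llm_app(golden.input)
```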

Bug Fix

v3.6.0

  • Fix Info and Caution callouts not rendering correctly in the documentation when using dark mode, improving readability and visual consistency. (#2111) (Sai-Suraj-27)

v3.5.9

  • Fix streaming completion handling so the final result is captured reliably and the streamed LLM output is JSON-serializable, preventing errors when consuming streamed responses. (#2097) (Mayank)

v3.5.5

  • Fix async evaluations by tracking and gathering only tasks created on the active event loop, preventing coroutine re-await and cross-loop errors. Normalize awaitables via coerce_to_task(), cancel pending tasks when clearing, and properly shut down async generators. Replace blocking sleeps in async tests and stabilize CI workflows. (#2068) (Trevor Wilson)
  • Fix NonAdvice metric scoring in strict_mode: enforce a threshold of 1 and return 0 when the computed score falls below that threshold. (#2070) (Sai-Suraj-27)
  • Fix mcp_use_metric when multiple MCP servers are configured by correctly including primitives from all servers in the interaction text. (#2076) (Diego Rani Mazine)
  • Fix sidebar heading contrast in dark mode so section titles are clearly visible and easier to scan. (#2077) (Sai-Suraj-27)
  • Fix deepeval login failing on Python 3.9 by avoiding the unsupported str | ProviderSlug type union syntax, restoring compatibility for supported Python versions. (#2079) (Sai-Suraj-27)
  • Fix incorrect argument name when configuring local models by passing model_format to set_local_model_env, preventing misconfiguration in LM Studio and vLLM setup. (#2083) (Sai-Suraj-27)

v3.5.6

  • Fix async eval execution to use the current trace when building LLMTestCase, so outputs, expected output, context, and tool expectations are recorded correctly. (#2088) (Kritin Vongthongsri)
  • Fix incorrect model imports so faithfulness and answer relevancy scoring load SummaCModels and answer relevancy models from the correct modules instead of failing at runtime. (#2089) (Sai-Suraj-27)

v3.5.3

  • Fix pii_leakage metric scoring in strict_mode by enforcing a threshold of 1 and returning 0 when the computed score falls below that threshold. (#2067) (Sai-Suraj-27)
  • Fix the getting-started example to use strict_mode instead of strict when creating metrics, preventing confusion and failures with the current API. (#2073) (Sai-Suraj-27)

v3.5.2

  • Fix a typo in the getting-started chatbots guide so the “metrics” link text is spelled correctly. (#2058) (grant-sobkowski)
  • Fix passing test_case_content when generating conversational evaluation prompts so evaluations run correctly instead of failing due to a missing argument. (#2059) (Sai-Suraj-27)
  • Fix LocalEmbeddingModel async embedding methods to properly await embedding requests, preventing missed awaits and ensuring async calls return embeddings reliably. (#2061) (Trevor Wilson)
  • Fix async prompt polling to work reliably with already-running event loops by reusing a general event loop and scheduling tasks instead of always blocking on run_until_complete. This prevents errors in async environments and keeps polling running in the background. (#2062) (Kritin Vongthongsri)
  • Fix duplicate arguments being passed to update_current_trace, preventing conflicting trace updates in online metrics tests. (#2063) (Sai-Suraj-27)

v3.5.0

  • Fix AWS Bedrock Converse requests by translating generation_kwargs from snake_case to the required camelCase. Prevents ParamValidationError when using parameters like max_tokens, top_p, top_k, and stop_sequences. (#2017) (Active FigureX)
  • Fix tool correctness scoring when no tools are expected. If both expected and called tools lists are empty, the score is now 1.0 instead of 0.0, avoiding false failures in tool-free runs. (#2027) (Kema Uday Kiran)
  • Fix a documentation import typo for DeepAcyclicGraph so the Conversational DAG example uses the correct module path. (#2029) (Vamshi Adimalla)
  • Fix telemetry tests to reliably start from a clean state by removing any existing .deepeval directory in the temp workspace before assertions, preventing flaky failures when the hidden store already exists. (#2035) (Trevor Wilson)
  • Fix tracing JSON serialization by stripping embedded NUL bytes from strings before writing to Postgres. This prevents 22P05 errors when storing text/jsonb payloads that contain \x00. (#2036) (Trevor Wilson)
  • Fix Grok-3 Fast output token pricing by using the correct per-million-token divisor, preventing inflated cost calculations for responses. (#2046) (Trevor Wilson)
  • Fix Kimi kimi-k2-0711-preview output cost divisor so output usage is calculated with the correct scale. (#2054) (Trevor Wilson)

v3.5.1

  • Fix generate_goldens_from_contexts when using source_files so generated goldens map to the correct source file. This prevents a possible IndexError when max_goldens_per_context exceeds the number of source files. (#2053) (Evan Livelo)

v2.4.8

  • Fix trace posting to allow a dynamic API key set on each trace, instead of always relying on a global configured key. This prevents traces from being skipped when the per-trace key is provided at runtime. (#1889) (Mayank)
  • Fix Conversation Simulator generating the first user turn twice, which could duplicate user messages. First-turn prompts are now only created when starting a new conversation or after an opening message. (#1891) (Kritin Vongthongsri)
  • Fix Ollama integration docs to use the correct model parameter when initializing OllamaModel, avoiding confusion and incorrect example code. (#1892) (Phil Nash)
  • Fix CLI identifier handling so runs correctly propagate the identifier into evaluation and assertion flows. (#1903) (Kritin Vongthongsri)
  • Fix pydantic-ai agent tracing to avoid warnings and span attribute errors by safely handling missing names and non-string inputs/outputs when recording LLM test case data. (#1904) (Mayank)
  • Fix OpenTelemetry span metadata handling by reading confident.span.metadata and attaching it to exported spans, instead of dumping the full span JSON. Also reduce noisy console output by swallowing conversion/validation errors during export. (#1910) (Mayank)
  • Fix G-Eval score normalization in non-strict mode by scaling to the rubric’s actual score range instead of always dividing by 10. This also aligns normalization behavior between measure and a_measure for consistent results across different rubrics. (#1915) (Bofeng Huang)
  • Fix dataset iterator integration tests to use EvaluationDataset.evals_iterator() and load API keys from environment variables, improving reliability and avoiding hardcoded credentials. (#1920) (Mayank)
  • Fix OpenTelemetry and PydanticAI instrumentation by setting standard trace attributes (name, tags, thread_id, user_id, metadata, environment) and ensuring tool/expected tool attributes are parsed reliably. This improves span export compatibility and corrects retriever attribute keys. (#1934) (Mayank)
  • Fix type checker errors when overriding methods on base model classes by adding the missing return type annotations. This prevents methods from being inferred as returning None and incorrectly triggering type errors in subclasses. (#1936) (trevor-inflection)
  • Fix model list definitions to prevent accidental string concatenation that merged entries and broke capability checks for certain model names. This corrects which models are treated as supporting structured outputs or requiring temperature=1. (#1939) (Trevor Wilson)
  • Fix conversation simulation to respect max_user_simulations and stop generating extra user turns. Preserve any pre-seeded turns without inserting the opening message, and validate invalid limits with a clear error. (#1943) (Kritin Vongthongsri)
  • Fix trace export to handle trace_metadata provided as a dict or JSON string, ensuring metadata is captured correctly. Also update async trace posting to use the API’s returned link field when reporting success. (#1944) (Mayank)
  • Fix task completion evaluation for LangChain and LangGraph traces by correctly preparing the metric test case from the root span. This prevents missing or incorrect task extraction and avoids unexpected evaluation cost being recorded. (#1946) (Mayank)
  • Fix ToolCorrectnessMetric to avoid division-by-zero when no expected tools are provided. Return 1.0 when both expected and called tools are empty, and 0.0 when only expected tools are empty (a short sketch follows this list). (#1947) (Vamshi Adimalla)
  • Fix duplicate items when generating synthetic datasets with synthesizer.generate_goldens_from_docs(). Goldens are now added only once in the generation call chain, so each generated item appears exactly once. (#1951) (Jaya)
  • Fix set-openai CLI writing cost_per_input_token and cost_per_output_token to the wrong environment keys. This prevents inverted token cost accounting and keeps any downstream cost calculations accurate. (#1952) (Trevor Wilson)
  • Fix set-openai so --cost_per_input_token and --cost_per_output_token are optional for known OpenAI models, matching runtime behavior. Improve help text to clarify that costs are only required for custom or unsupported models, reducing redundant flags and misleading errors. (#1953) (Trevor Wilson)
  • Fix the Multi-Turn Getting Started code example by importing ConversationalGEval instead of an unused GEval, so the snippet runs correctly as written. (#1954) (Connor Brinton)
  • Fix Arena docs example to print results from the correct variable (arena_geval), preventing a NameError and making the snippet runnable as written. (#1960) (Julius Berger)
  • Fix duplicated aggregate metric results by computing pass-rate summaries once per evaluation run, and handle empty result sets safely. (#1962) (John Lemmon)
  • Fix LangChain callback on_llm_end handling to avoid missing-span and bad metadata issues. Model names and token usage are now extracted safely, and token counts are left unset when unavailable. (#1963) (Mayank)
  • Fix Azure OpenAI model calls to forward constructor kwargs (like max_tokens) in both sync and async generation. This ensures the API receives the expected parameters and prevents LengthFinishReasonError. (#1969) (Active FigureX)
  • Prevent endless retries in LiteLLMModel by adding a maximum retry limit (default 6) so failures stop instead of looping indefinitely. Add support for LiteLLM proxy environment variables. Move retry settings to class-level variables to simplify future configuration changes. (#1972) (Radosław Hęś)
  • Fix ContextualRelevancy evaluation when a retrieval_context item contains no meaningful statements. The metric now handles empty or non-informative context so LLM output can be parsed reliably instead of failing when no JSON is returned. (#1973) (Radosław Hęś)
  • Fix progress bar updates during conversation simulator runs, ensuring tasks advance correctly and are removed when finished. Also ensure evaluation state is always cleaned up in a finally block even if an error occurs. (#1974) (Kritin Vongthongsri)
  • Fix telemetry to fully respect opt-out by skipping all writes when DEEPEVAL_TELEMETRY_OPT_OUT=YES and returning a telemetry-opted-out sentinel ID. Also ensure the .deepeval directory exists before writing telemetry data, with tests covering directory creation and file writes. (#1976) (Trevor Wilson)
  • Fix benchmarks to work with datasets 4.0.0 by removing unsupported trust_remote_code from load_dataset calls. Update MMLU and MathQA to use current Parquet datasets with the required logic adjustments. (#1977) (Vincent Lannurien)
  • Fix incorrect imports in the getting-started LLM arena docs example so the sample code runs without import errors. (#1981) (raphaeluzan)
  • Fix Synthesizer state tracking by clearing synthetic_goldens on reset and appending newly generated goldens during doc and scratch generation, so results reflect the latest run. Update the introduction docs with required dependencies and a working end-to-end example. (#1984) (Mayank)
  • Fix notebook evaluation runs by clearing trace_manager.integration_traces_to_evaluate at the start of each dataset evaluation. This prevents traces from a previous run from leaking into a new run and affecting results. (#1985) (Mayank)
  • Fix OpenTelemetry trace status so the overall trace is marked as errored when the root span fails, improving error visibility in exported traces. (#1993) (Mayank)
  • Fix trace status reporting so traces are marked as errored when any span fails, and include a status field in the trace API payload for more accurate error visibility. (#1999) (Mayank)
  • Fix --confident-api-key so it works again, and make login save the key to .env.local by default unless --save is set. Logout now also removes the saved key from both the JSON keystore and dotenv, and commands no longer write "None" values for optional model settings. (#2015) (Trevor Wilson)
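
To make the tool-free scoring change above concrete (#1947, #2027), here is a minimal sketch assuming the standard metric and test case imports; constructor defaults may differ between versions.

```python
# Minimal sketch of the empty-tools behavior described in this list.
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What's on my calendar today?",
    actual_output="You have no meetings today.",
    tools_called=[],     # no tools were invoked
    expected_tools=[],   # and none were expected
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)  # 1.0 for tool-free runs, rather than a false failure
```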

August

August made evaluation and tracing more production-ready with refreshed docs covering component-level evaluation, tracing, and deployment patterns. Tracing gained richer LLM outputs, a v1 OpenTelemetry exporter, better span ordering, and deeper LangChain/LlamaIndex/CrewAI integrations with metrics and metric_collection support. New tutorials included the Medical Chatbot series and improved RAG guides.

New Feature

v3.3.5

  • Add a new Medical Chatbot tutorial series to the docs, covering development, evaluation, improvement, and deployment of a multi-turn chatbot. Improve and correct several evaluation docs examples and parameter descriptions for multi-turn test cases and datasets. (#1802) (Vamshi Adimalla)
  • Add CLI support to configure Grok, Moonshot, and DeepSeek as the LLM provider for evaluations, including setting the model name, API key, and temperature. You can switch back to the default OpenAI setup with corresponding unset-* commands. (#1807) (Kritin Vongthongsri)
  • Add a Medical Chatbot tutorial to the docs and navigation, with updated walkthrough content and links for building, configuring, and evaluating the example app. (#1814) (Vamshi Adimalla)
  • Add support for evaluating LangGraph/LangChain traces with metrics via the callback handler. Root spans can now carry metrics and an optional metric_collection, and captured traces can be queued for evaluation instead of being posted immediately. (#1829) (Mayank)
  • Add a CrewAI Agent wrapper that registers agents with an optional metric_collection and per-agent metrics, enabling easier evaluation and online tracing during crew runs. (#1833) (Mayank)
  • Add a v1 OpenTelemetry span exporter that supports API key setup and trace configuration via env vars or OTel resource attributes. Improve trace handling by preserving provided trace IDs, applying trace metadata, and safely ending and clearing active traces after export. (#1838) (Mayank)
  • Add MCP support to conversational test cases by allowing turns to record MCP tool/prompt/resource calls and optional server metadata, with validation of MCP types to catch invalid inputs early. (#1839) (Vamshi Adimalla)
  • Add support for setting trace attributes in the LangChain callback handler. You can now pass name, tags, metadata, thread_id, and user_id when creating the callback to populate these fields on the completed trace. (#1862) (Mayank)
  • Add an ArgumentCorrectnessMetric to score whether tool call arguments match the user input, with optional reasons and async support. Returns a perfect score when no tool calls are provided (see the sketch after this list). (#1866) (Kritin Vongthongsri)
  • Add a revamped conversation simulator that generates conversational test cases from ConversationalGolden inputs using a provided model callback, with configurable opening message, concurrency, and async or sync execution. (#1876) (Kritin Vongthongsri)
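
A rough usage sketch for the ArgumentCorrectnessMetric mentioned above; the import path, ToolCall field names, and constructor defaults are assumptions inferred from the entry rather than the exact released API.

```python
# Speculative sketch of scoring tool call arguments against the user input.
from deepeval.metrics import ArgumentCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book a table for two at 7pm tonight.",
    actual_output="Your table for two at 7pm is booked.",
    tools_called=[
        ToolCall(name="book_table", input_parameters={"party_size": 2, "time": "19:00"})
    ],
)

metric = ArgumentCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```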

Improvement

v3.3.5

  • Improve component-level evaluation docs with clearer guidance on when to use it, what tracing means, and how to log in to view traces. Reorganize sections and examples for easier navigation and fewer confusing callouts. (#1782) (Kritin Vongthongsri)
  • Improve the Meeting Summarizer tutorial with a new Deployment section covering CI/CD-style continuous evaluation, dataset reuse, and optional tracing setup. Also update tutorial navigation and fix a broken docs anchor link. (#1783) (Vamshi Adimalla)
  • Bump the package release metadata and version number for a new release. (#1784) (Jeffrey Ip)
  • Improve LLM trace output to match the updated UI by capturing structured AI responses, including role, content, and tool call details instead of only a concatenated string. (#1786) (Mayank)
  • Improve the meeting summarizer tutorial with updated walkthrough content, refreshed screenshots, and clearer examples for generating summaries and action items using different models. (#1788) (Vamshi Adimalla)
  • Fix typos and formatting across tracing integrations, tests, and documentation for clearer examples and cleaner files. (#1789) (Vamshi Adimalla)
  • Improve the RAG QA Agent tutorial and navigation by adding a new tutorial section, updating sidebar links and icons, and refreshing examples to use deepeval test run instead of running pytest directly. (#1793) (Vamshi Adimalla)
  • Improve docs and tutorials by switching embedded images to hosted URLs and removing bundled image assets, keeping guides lighter and images consistently available. (#1794) (Vamshi Adimalla)
  • Improve SummarizationMetric schema naming and usage to reduce ambiguity and make results clearer. This refactor replaces a generic Verdicts schema with more descriptive Pydantic schemas, improving readability and maintainability. (#1804) (Shabareesh Shetty)
  • Improve tutorial introductions by adding Tech Stack cards that show the key tools used in each guide, making it easier to understand the setup at a glance. (#1808) (Vamshi Adimalla)
  • Improve tutorials and docs with updated examples and configuration names, plus refreshed navigation and UI tweaks for easier browsing. (#1825) (Vamshi Adimalla)
  • Support passing extra **kwargs to underlying LLM clients across providers. This lets you customize client setup (for example timeouts, proxies, or transport settings) without modifying the model wrappers. (#1827) (Kritin Vongthongsri)
  • Improve contributor setup instructions by updating the dependency installation command from make install to poetry install. (#1828) (Vamshi Adimalla)
  • Add patched LlamaIndex agents that accept metrics and metric_collection, and rework LlamaIndex tracing to start and link traces correctly for workflow/agent runs. (#1836) (Mayank)
  • Fix docs metadata and improve tutorial link cards by adding singleTurn tags to several metric pages and updating card layout with icons and objectives for clearer navigation. (#1837) (Jeffrey Ip)
  • Improve model CLI config handling by separating stored keys for evaluation LLMs vs embeddings, reducing key collisions when switching providers or running unset-* commands. (#1855) (Kritin Vongthongsri)
  • Improve tutorials with clearer section titles, updated wording, and expanded guidance for building and evaluating RAG QA and summarization agents, including a better focus on production eval setup. (#1860) (Vamshi Adimalla)

Bug Fix

v3.3.5

  • Fix LLM span cost calculation by honoring cost_per_input_token and cost_per_output_token passed to observe, ensuring traced runs report the correct token costs. (#1787) (Kritin Vongthongsri)
  • Fix async OpenAI integration by restoring asyncio.create_task safely after evaluation, preventing leaked monkeypatching across runs and improving stability when running concurrent test cases. (#1790) (Kritin Vongthongsri)
  • Fix g_eval to prevent a crash when accumulating evaluation cost if the initial cost is None. This avoids a TypeError during async evaluation and allows scoring to complete normally. (#1796) (高汝貞)
  • Fix the docs snippet for ConversationalGEval by renaming the example variable to metric, making it consistent and easier to copy and run. (#1799) (Nimish Bongale)
  • Fix the few-shot example used in the Synthesizer constrained evolution template so the sample rewritten input correctly matches the solar power prompt and produces more consistent guidance. (#1800) (Simon M.)
  • Prevent mixing single-turn and multi-turn goldens in a dataset by enforcing the dataset mode and raising clear TypeErrors for invalid items. Add add_golden to append goldens after initialization. (#1810) (Vamshi Adimalla)
  • Fix conversation eval serialization by using the correct API field aliases for retrievalContext, toolsCalled, and additionalMetadata, and by typing tool calls as ToolCall objects. (#1811) (Kritin Vongthongsri)
  • Fix tutorial command examples to run evaluation tests with deepeval test run instead of pytest, and improve YAML snippet formatting for the deployment guide. (#1830) (Vamshi Adimalla)
  • Fix AzureOpenAIModel initialization to use the correct model_name argument instead of model, restoring compatibility with Azure OpenAI deployments. This prevents setup failures that made Azure-backed usage unusable in recent releases. (#1832) (StefanMojsilovic)
  • Fix LiteLLMModel generate/a_generate to always return (result, cost) when a schema is provided. Prevents unpacking errors in schema-based metrics and restores consistent cost reporting. (#1841) (Dylan Li)
  • Fix a type hint in login_with_confident_api_key by using str for the API key parameter, improving type checking and editor autocomplete. (#1847) (John Lemmon)
  • Fix LangChain/LangGraph prompt parsing so multi-line messages and recognized roles are grouped correctly, instead of being split line-by-line or misclassified as Human messages. (#1848) (Kritin Vongthongsri)
  • Fix LLM tracing to accept and safely serialize non-standard output objects so responses aren’t dropped when capturing spans. (#1849) (Kritin Vongthongsri)
  • Fix CLI model configuration to clear previously saved evaluation or embedding settings when switching providers, preventing stale keys from overriding the newly selected model. (#1852) (Kritin Vongthongsri)
  • Fix code execution in the HumanEval benchmark by calling exec on compiled code instead of recursively invoking the secure executor, preventing infinite recursion and allowing snippets to run correctly. (#1856) (Vamshi Adimalla)
  • Fix missing temperature handling in GptModel generate/a_generate when no schema is provided, so output randomness is consistently user-controlled instead of falling back to the provider default (often 1). (#1857) (Daniel Yakubov)
  • Fix crashes in synthesizer workflows by guarding progress updates and handling fewer than 10 goldens when sampling examples. Improve test reliability by adding a pytest.ini config and expanding the test suite so CI runs pytest directly. (#1858) (Kritin Vongthongsri)
  • Fix OpenTelemetry trace exporting by ordering spans into parent-child trees and treating missing parents as root spans, preventing failures on incomplete span batches. Update LLM span attribute keys to the confident.llm.* namespace so model, token, and prompt fields are captured correctly. (#1859) (Mayank)
  • Fix misuse metric failures by passing the correct misuse_violations parameter to generate_reason in MisuseTemplate. This prevents errors when running measure. (#1863) (Rohit ojha)
  • Prevent generating more synthetic inputs than requested by enforcing max_goldens_per_context and truncating any extra results. This keeps dataset sizes predictable and avoids overshooting configured limits. (#1867) (Noah Gil)
  • Fix structured output requests in the LiteLLM model by passing the Pydantic schema directly via response_format instead of an unsupported json_schema argument. Prevents TypeError failures when requesting JSON-formatted responses. (#1871) (Rohit ojha)
  • Fix conversation relevancy windowing by grouping turns into valid user→assistant interactions and flattening them before verdict generation, preventing invalid or partial turns from skewing results. (#1873) (Vamshi Adimalla)
  • Fix an ImportError caused by a circular import between the scorer module and the IFEval benchmark. The Scorer import is now deferred to IFEval initialization so modules load cleanly and IFEval can be imported reliably. (#1875) (Rohit ojha)
  • Fix Conversation Simulator turn generation and progress tracking: max_turns is now validated, opening messages count toward the limit, and async vs sync callbacks are handled automatically without raising type errors. Simulated test cases now carry over scenario and metadata fields from the golden inputs. (#1878) (Kritin Vongthongsri)

July

July improved tracing and evaluation across agent frameworks with major upgrades to LangChain/LangGraph, CrewAI, LlamaIndex, and OpenTelemetry span handling. Safety coverage expanded with new metrics for PII leakage, role violations, non-advice, and misuse, plus IFEval benchmark support and better task-completion evaluation. The default model moved from gpt-4o to gpt-4.1 with updated costs and docs.

New Feature

v3.2.6

  • Add a LangChain/LangGraph callback handler that captures chain, tool, LLM, and retriever events into tracing spans, and automatically starts and ends a trace for top-level runs. (#1722) (Mayank)
  • Add a CrewAI integration to instrument crewai.LLM.call and capture LLM input/output in traces. Raises a clear error if CrewAI is not installed and supports optional API key login before patching. (#1723) (Mayank)
  • Add a revised CrewAI tracing integration with an instrumentator() helper that listens to CrewAI events and captures agent and LLM calls as trace spans. Also emit integration telemetry to New Relic in addition to existing PostHog tracking. (#1724) (Mayank)
  • Add support for the IFEval benchmark to evaluate instruction-following and format compliance. Includes rule-based verification and more detailed per-instruction reporting in verbose mode. (#1729) (Abhishek Ranjan)
  • Add a new dataset() test-run interface that lets you iterate over goldens from a local list or a pulled dataset alias and track the run via test_run tasks, with async execution support. (#1737) (Kritin Vongthongsri)
  • Add 10 new safety metrics to detect PII leakage, harmful or illegal instructions, misinformation, graphic content, prompt extraction, role boundary violations, IP issues, manipulation, and risky command execution. Improve template consistency, align parameter names, and add full test coverage for these checks. (#1747) (sid-murali)
  • Add new safety metrics: PIILeakageMetric to detect SSNs/emails/addresses, RoleViolationMetric to flag role-breaking output, and NonAdviceMetric to catch financial or medical advice. Require explicit parameters like role and advice types, and switch role violations to a clear yes/no result (a short usage sketch follows this list). (#1749) (sid-murali)
  • Add CLI support to set/unset the default OpenAI model and per-token pricing used by metrics. GPTModel can now read model name and pricing from saved settings, and will prompt for pricing when using an unknown model. (#1766) (Kritin Vongthongsri)
  • Add the Misuse metric to detect when an LLM uses a specialized domain chatbot inappropriately (for example, asking a finance bot to write poetry). This helps keep outputs aligned with domain expertise and prevents scope creep in specialized AI use cases. (#1773) (sid-murali)
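
For the new safety metrics above, a short usage sketch with PIILeakageMetric; the import path and default constructor are assumptions, and other metrics in the set (for example NonAdviceMetric) require explicit parameters as noted.

```python
# Hedged sketch of running one of the new safety metrics.
from deepeval.metrics import PIILeakageMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the customer's last support ticket.",
    actual_output="John Doe (SSN 123-45-6789) reported a billing issue.",
)

metric = PIILeakageMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```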

Improvement

v3.2.6

  • Prepare a new release by updating package metadata and internal version information. (#1721) (Jeffrey Ip)
  • Add telemetry events that record when tracing integrations are initialized (LangChain, LlamaIndex, and OpenTelemetry exporter), respecting telemetry opt-out settings. (#1725) (Mayank)
  • Update the default OpenAI and multimodal GPT model from gpt-4o to gpt-4.1. Cost calculations and documentation examples now also default to gpt-4.1 when a model name is not specified. (#1727) (Kritin Vongthongsri)
  • Add an X (Twitter) follow icon to the README and documentation site header for quicker access to the project’s social profile. (#1731) (Kritin Vongthongsri)
  • Improve documentation and examples for multi-turn chatbot evaluation, clarifying conversation simulation, CI setup, and metric usage. Fix small wording issues in docs and ensure files end with a trailing newline. (#1732) (Vamshi Adimalla)
  • Improve task completion evaluations by supporting span-based tracing. TaskCompletionMetric can now run without an LLMTestCase when it’s the only metric, and it attaches the trace to produce suggested fixes while giving a clearer error for other metrics missing update_current_span(). (#1734) (Kritin Vongthongsri)
  • Improve CrewAI tracing by capturing tool usage and memory search as dedicated spans, with inputs/outputs recorded for easier debugging. LLM spans no longer fail when a parent span can’t be found. (#1740) (Mayank)
  • Improve LlamaIndex instrumentation by unifying event and span handling, generating stable span UUIDs, and properly starting/ending traces when spans are dropped or completed. This makes LLM and tool spans more consistent and avoids lingering spans in trace output. (#1745) (Mayank)
  • Improve OpenAI integration by evaluating captured OpenAI test case/metric pairs when no traces are available, and by recording the latest OpenAI hyperparameters in the test run. Also clear stored OpenAI pairs after a run to avoid leaking state between evaluations. (#1746) (Kritin Vongthongsri)
  • Improve LangChain and LangGraph integration with clearer message roles, better tool call/result handling, and cleaner inputs. Fix span naming plus fallback/metadata behavior and make outputs visible in LangChain. Update docs with function descriptions; token usage and cost reporting is still pending. (#1752) (Mayank)
  • Fix a typo in the README explanation of expected_output and GEval to make the quickstart guidance clearer. (#1754) (Chetan Shinde)
  • Add comprehensive docs for NonAdviceMetric, PIILeakageMetric, and RoleViolationMetric, including usage examples, parameters, and scoring rubrics. Improve consistency by standardizing metric names, schema fields, and clarifying parameter naming for these metrics. (#1755) (sid-murali)
  • Improve the tutorials onboarding experience by grouping Getting Started pages in the sidebar and refreshing the Introduction with clearer guidance and a first evaluation walkthrough. (#1759) (Vamshi Adimalla)
  • Improve compatibility by loosening the click version restriction so newer click releases can be used, reducing dependency conflicts and avoiding the need to pin an outdated version. (#1760) (lwarsaame)
  • Improve the tutorial introduction and setup docs with a clearer getting-started flow, curated tutorial cards, and tightened wording. Add a concrete OPENAI_API_KEY export example and clarify the required test_ filename prefix. (#1761) (Vamshi Adimalla)
  • Add a blog sidebar that lists all posts and expand the tutorials sidebar with a new Meeting Summarizer section. Improve tutorials navigation by renaming the tutorial card component to LinkCards and enabling sidebar icons on tutorial routes. (#1767) (Vamshi Adimalla)
  • Support passing extra client options to Azure OpenAI model initialization via kwargs. This lets you customize the underlying Azure OpenAI client without modifying the tool’s source code. (#1772) (Aaryan Verma)
  • Improve tutorials and docs navigation with refreshed summarization content, clearer headings, and new example visuals. Add optional numbered tutorial link cards and temporarily hide the Meeting Summarizer section from the sidebar. (#1775) (Vamshi Adimalla)
  • Improve dependency compatibility by loosening the tenacity version constraint to allow newer releases while keeping a safe supported range. (#1776) (Andy Freeland)
  • Improve dataset handling by aligning dataset endpoints, making golden lists optional, and supporting extra conversational metadata like scenario, userDescription, and comments when sending test runs. (#1777) (Jeffrey Ip)
  • Improve the TaskCompletionMetric docs with a clearer tracing example, including the correct Golden input format and updated imports for evaluate and ToolCall. This makes it easier to run the sample code without adjustments. (#1779) (Mayank)

Bug Fix

v3.2.6

  • Fix the quickstart link shown after CLI login so it points to the correct setup page. (#1726) (Kritin Vongthongsri)
  • Fix OpenAI Completions examples in the docs to use the current OpenAI() client and chat.completions.create, preventing runtime errors and incorrect response parsing in sample code. (#1728) (Kritin Vongthongsri)
  • Fix AnthropicModel.calculate_cost indentation so cost calculation and fallback pricing warning run correctly when pricing is missing. (#1739) (nsking02)
  • Fix component-level evaluation serialization by converting test run payloads into JSON-safe data before sending them, preventing failures when metrics or complex objects are included. (#1744) (Kritin Vongthongsri)
  • Fix synthetic golden sample generation when context_size is 1 by making the context generator always return a consistent list-of-lists shape. This prevents type mismatches in Golden creation when a document has only one chunk. (#1748) (Nicolas Torres)
  • Improve JSON tool-call reliability when using instructor TOOLS mode with custom LLMs by renaming internal Reason schemas so models don’t skip tool calls and fall back to returning plain content. This prevents exceptions and keeps structured outputs coming from tool_calls as expected. (#1753) (Radosław Hęś)
  • Fix EvaluationDataset.evaluate type hints to accept all supported metric base types and explicitly annotate the EvaluationResult return type, avoiding circular import issues. (#1756) (AI)
  • Fix an error when calculating OpenAI costs by handling a missing model value and falling back to the default model when none is provided. (#1768) (Kritin Vongthongsri)
  • Fix component-level metric data not showing up in test results by extracting and appending trace and span-level metric outputs to the reported results. (#1769) (Mayank)
  • Fix syntax errors in the evaluation test case documentation examples so ToolCall snippets parse correctly and can be copied into Python without edits. (#1770) (Dhanesh Gujrathi)
  • Fix the Task Completion metric documentation example by using valid sample inputs for destination and days, preventing the snippet from failing when copied and run. (#1778) (Kritin Vongthongsri)
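
For context, the corrected Task Completion example is roughly of the following shape; this sketch assumes the standard metric and ToolCall imports and is not the docs snippet verbatim.

```python
# Approximate shape of the Task Completion docs example; field values and
# constructor defaults are illustrative assumptions.
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
    actual_output="Day 1: Louvre and dinner in Le Marais. Day 2: ... Day 3: ...",
    tools_called=[
        ToolCall(
            name="Itinerary Generator",
            description="Creates itineraries based on destination and days.",
            input_parameters={"destination": "Paris", "days": 3},
            output=["Day 1: ...", "Day 2: ...", "Day 3: ..."],
        )
    ],
)

metric = TaskCompletionMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```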

June

June made evaluations and tracing more robust across providers and async workloads with fixes to prevent crashes and broken serialization. Tracing matured with improved OpenAI/OTEL integrations and new hooks for OpenAI Agents and LlamaIndex via trace_manager.configure. Evaluation added native LiteLLM support, MultimodalGEval, arena-style GEval, and jsonl dataset saving.

Backward Incompatible Change

v3.1.5

  • Remove the client parameter from observe() and rely on trace_manager.configure(openai_client=...) for LLM spans. LLM tracing now requires either a model in observe or a configured openai_client, otherwise a clear error is raised. (#1667) (Mayank)
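
A migration sketch for this change; the deepeval.tracing import path is an assumption, while the configure and observe names come from the entry itself.

```python
# Before: @observe(type="llm", client=openai_client)
# After: configure the client once, or pass model=... to observe directly.
from openai import OpenAI
from deepeval.tracing import observe, trace_manager

trace_manager.configure(openai_client=OpenAI())


@observe(type="llm")  # without a configured client, pass model="..." here instead
def generate(prompt: str) -> str:
    ...
```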

v3.0.8

  • Improve the package's public API by removing the monitor helpers from top-level imports, leaving only send_feedback and a_send_feedback available via deepeval. (#1673) (Jeffrey Ip)

New Feature

v3.1.9

  • Add a LlamaIndex integration entry point via instrument_llama_index to hook into LlamaIndex instrumentation and capture agent runs for monitoring. (#1714) (Mayank)
  • Add expanded OpenAI multimodal model support, including newer GPT-4.1 and o-series options. Improve structured output handling by using native parsing when available and falling back to JSON parsing when needed, while tracking log-prob limitations for unsupported models. (#1716) (Kritin Vongthongsri)
  • Add arena-style evaluation to GEval by allowing a list of test cases and selecting the best output. Validate that all candidates share the same input and expose best_test_case and best_test_case_index for easier comparisons. (#1717) (Kritin Vongthongsri)

v3.1.5

  • Add MultimodalGEval, a GEval-based metric to score multimodal test cases using configurable criteria, rubrics, and evaluation steps. Supports async evaluation and can incorporate inputs like context, retrieval context, and tool calls. Also improve image encoding by converting non-RGB images before JPEG serialization. (#1684) (Kritin Vongthongsri)
  • Add OpenAI Agents tracing integration via DeepEvalTracingProcessor, capturing agent, tool, and LLM spans and mapping key metadata like prompts, responses, and token usage into the tracing system. (#1699) (Kritin Vongthongsri)
  • Add broader multimodal test case support in the platform API by sending expected output, context, and retrieval context fields. Improve handling of local image inputs by detecting file:// paths, capturing filenames and MIME types, and embedding file data as Base64. (#1704) (Kritin Vongthongsri)

v3.0.8

  • Add native LiteLLM model support so you can run evaluations with any LiteLLM-supported provider. Includes sync/async text generation, schema validation, cost tracking, and improved error handling, plus tests and updated docs. (#1670) (Prahlad Sahu)
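
A hedged sketch of wiring a LiteLLM-backed judge into a metric; the LiteLLMModel import path and constructor argument are assumptions, and the provider/model string follows LiteLLM's usual convention.

```python
# Sketch only: use any LiteLLM-supported provider as the evaluation model.
from deepeval.models import LiteLLMModel
from deepeval.metrics import AnswerRelevancyMetric

judge = LiteLLMModel(model="openai/gpt-4.1-mini")
metric = AnswerRelevancyMetric(model=judge)
```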

v3.0.6

  • Add support for saving datasets in jsonl format, making it easier to write large datasets without loading everything into memory. This is especially useful for generating and exporting datasets with more than 10k rows. (#1652) (Yudhiesh Ravindranath)
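
A small sketch of the new export path; save_as and its keyword arguments are assumptions based on the existing dataset-saving API, with jsonl as the newly supported format.

```python
# Sketch: write goldens out in the newly supported jsonl format.
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="What is retrieval-augmented generation?")])
dataset.save_as(file_type="jsonl", directory="./synthetic_data")
```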

Improvement

v3.1.9

  • Bump package version metadata for a new release, updating the published version string and release date. (#1710) (Jeffrey Ip)
  • Improve the RoleAdherenceMetric documentation by fixing wording, removing a duplicate argument entry, and clarifying how assistant turns are evaluated against chatbot_role using prior context. (#1711) (Vamshi Adimalla)
  • Add pricing support for claude-opus-4 and claude-sonnet-4. Raise a clear ValueError when cost pricing is missing for an unknown Anthropic model, preventing silent fallbacks and TypeError crashes. (#1715) (Abhishek Ranjan)
  • Add a new blog guide on building and evaluating multi-turn chatbots, covering conversation simulation, metrics for memory and tone, and CI-friendly regression testing. (#1718) (Vamshi Adimalla)

v3.1.5

  • Bump the package version metadata for a new release. (#1676) (Jeffrey Ip)
  • Improve telemetry for traceable evaluate() runs by tracking them as a separate component evaluation feature. This records the correct feature status and updates the last-used feature accordingly. (#1678) (Kritin Vongthongsri)
  • Add a new blog post covering an evaluation-first approach to building and testing RAG apps, including automated test data generation, retriever/generator metrics, and CI test integration. Add a new blog author profile and related images. (#1686) (Vamshi Adimalla)
  • Add links in the README to translated versions in multiple languages, making it easier for non-English readers to find localized documentation. (#1687) (neo)
  • Improve the RAG evaluation blog guide with updated wording, clearer code examples, and revised diagrams. Rename the article file and slug to better reflect its focus, and simplify CI/CD integration examples for easier copy-paste. (#1694) (Vamshi Adimalla)

v3.0.8

  • Prepare a new release by updating the package version metadata and reported __version__. (#1668) (Jeffrey Ip)

v3.0.6

  • Prepare the 3.0.0 release by updating package version metadata and release date. (#1631) (Jeffrey Ip)
  • Improve multimodal metrics docs by fixing the Answer Relevancy example to use MultimodalAnswerRelevancyMetric, and by aligning output and bulk-evaluation snippets to print score and reason consistently. (#1635) (Jeffrey Ip)
  • Improve the faithfulness verdict prompt wording by fixing grammar and removing threatening language, making instructions clearer and more professional for LLM evaluations. (#1636) (Vamshi Adimalla)
  • Improve AnswerRelevancy prompt templates to produce valid, parseable JSON more reliably. Clarify when ambiguous fragments count as statements and add clearer examples and end markers to reduce malformed outputs. (#1642) (Aaron McClintock)
  • Improve conversation simulation progress output by switching to Rich traceable progress bars and showing per-conversation and per-step progress during scenario setup and turn simulation, in both sync and async modes. (#1649) (Kritin Vongthongsri)
  • Improve tracing internals by moving current span/trace state to context variables and reorganizing attribute and type definitions. This makes trace updates more consistent across sync and async execution, and enables centralized OpenAI client patching via the trace manager. (#1651) (Jeffrey Ip)

Bug Fix

v3.1.9

  • Fix JSON serialization failures when a dictionary contains non-string keys by converting keys to strings during tracing serialization. (#1712) (Kritin Vongthongsri)

v3.1.5

  • Fix import failures on read-only file systems by skipping telemetry-related filesystem setup when DEEPEVAL_TELEMETRY_OPT_OUT is set. This prevents evaluations from failing in restricted environments like serverless runtimes. (#1654) (Leo Kacenjar)
  • Fix OpenAI model initialization to pass base_url, enabling proxy or custom endpoint configurations in both sync and async clients. (#1703) (jnchen)
  • Fix evaluate so it no longer raises TypeError when a single TestResult is passed. The metric pass rate aggregation now wraps non-list results into a list before processing. (#1705) (Aditya Bharadwaj)
  • Fix an IndexError in Synthesizer.generate_goldens_from_docs() by safely handling missing or shorter source_files, preventing crashes when generating goldens from documentation inputs. (#1706) (Aditya Bharadwaj)

v3.0.6

  • Fix GSM8K benchmark crashes when a model returns a tuple or other non-standard response. Prediction extraction now handles NumberSchema, tuples, strings, dicts, and .text/.content objects, and avoids unsafe .values() unpacking to prevent AttributeError/TypeError. (#1628) (Muhammad Hussain)
  • Fix traceable span evaluation traversal so child spans are always processed and recorded, even when a parent span has no metrics or test case. This prevents missing spans in trace output and avoids incomplete evaluations. (#1632) (Kritin Vongthongsri)
  • Fix TruthfulQA evaluation with AnthropicModel by handling JSON parsing failures and falling back to text-based prompting when structured output isn’t supported. This prevents crashes from uncaught errors and improves robustness across models. (#1638) (Pradyun Magal)
  • Fix the OpenAI tracing integration so LLM span attributes are applied correctly and tracing data is recorded as expected. (#1639) (Kritin Vongthongsri)
  • Fix async golden generation to call a_embed_text instead of the blocking embed_text when building contexts. This prevents event-loop blocking, improves parallel performance, and avoids runtime errors like asyncio.run() being called from a running loop. (#1641) (Andreas Gabrielsson)
  • Fix OTEL exporter crashes when span or event attributes are missing by handling None values and returning empty objects or None instead of raising type conversion errors. (#1646) (Mayank)
  • Fix expected_output serialization for span test cases by correcting the expectedOutput field alias so optional expected outputs are sent and parsed correctly. (#1650) (Kritin Vongthongsri)
  • Fix the traceable evaluation progress bar so it updates correctly during runs, including async execution, by using the proper progress bar ID. (#1655) (Kritin Vongthongsri)
  • Fix trace posting when a Confident AI API key is provided directly, so traces are no longer skipped due to the environment not being detected as Confident. (#1656) (Kritin Vongthongsri)
  • Fix a typo in the conversation simulator docs so the user_intentions example is valid Python and can be copied and run without errors. (#1664) (Eduardo Arndt)
  • Fix a circular import in the tracing API by importing current_trace_context from the context module, preventing import-time errors when using tracing. (#1665) (Mayank)

May

May made evaluations and tracing more robust and configurable. LLM wrappers gained configurable temperature, new providers including Amazon Bedrock, and PEP 561 support for static analysis. Tracing improved with cleaner defaults, richer metadata, optional sampling/masking, and better OpenTelemetry interoperability while respecting opt-out more consistently.

Backward Incompatible Change

v2.8.5

  • Rename the tracing callback parameter from traceable_callback to observed_callback in evaluate() and assert_test() when running agentic golden tests, improving naming consistency for traced runs. (#1561) (Jeffrey Ip)
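
A before/after sketch of the rename; the parameter names come from the entry, while the Golden import path and the llm_app callback are placeholders.

```python
from deepeval import assert_test
from deepeval.dataset import Golden


def llm_app(user_input: str) -> str:
    return "..."  # your traced application entry point


golden = Golden(input="How do I reset my password?")

# Before: assert_test(golden=golden, traceable_callback=llm_app)
assert_test(golden=golden, observed_callback=llm_app)
```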

v2.8.4

  • Remove the LangChain dependency so installs are lighter and avoid importing LangChain modules. Update conversational GEval to use OpenAI ChatCompletion responses directly when parsing content and logprobs. (#1544) (Kritin Vongthongsri)

New Feature

v3.0

  • Add utility functions to write evaluation logs to a file, making it easier to track results when running large batches without a web app. This also helps spot missing results caused by connection errors. (#1601) (Daehui Kim)
  • Add an OpenTelemetry span exporter that detects gen_ai operations and converts spans into LLM, tool, agent, and retriever traces with inputs, outputs, token usage, and cost metadata for export. (#1603) (Mayank)
  • Add optional thread_id to traces and support sending it as threadId in the tracing API. This lets you associate a trace with a specific conversation thread when updating the current trace. (#1604) (Kritin Vongthongsri)
  • Add support for setting a trace userId so you can associate traces with a specific end user when updating and exporting trace data. (#1605) (Kritin Vongthongsri)
  • Add input and output fields to trace data so you can record request payloads and final results at the trace level, including via update_current_trace. (#1606) (Kritin Vongthongsri)
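
Taken together, the three trace-level additions above can be sketched roughly as follows; the keyword names mirror the entries (thread_id, user_id, input, output), but the exact update_current_trace signature is an assumption.

```python
# Hedged sketch of attaching conversation, user, and I/O context to a trace.
from deepeval.tracing import observe, update_current_trace


@observe(type="agent")
def chatbot(user_input: str) -> str:
    reply = "Here is what I found..."  # stand-in for the real response
    update_current_trace(
        thread_id="conversation-42",  # sent as threadId to the tracing API
        user_id="user-123",
        input=user_input,
        output=reply,
    )
    return reply
```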

v2.9.0

  • Add support for tracing LlmAttributes on the OpenAI client by patching it into Observer, so @observe(type="llm", client=...) captures LLM call attributes automatically. (#1560) (Mayank)
  • Add AmazonBedrockModel to run LLM-based evaluations using Amazon Bedrock, with async and sync generation plus optional Pydantic schema parsing. Includes usage docs and recognizes Bedrock models as native for metric execution. (#1570) (Kritin Vongthongsri)
  • Add support for setting per-span metadata via update_current_span, and include it when exporting spans to the tracing API. (#1575) (Kritin Vongthongsri)
  • Add trace-level tags and metadata, plus an optional environment label for better trace filtering and context. Support masking trace inputs/outputs via a configurable mask function. Allow sampling with CONFIDENT_SAMPLE_RATE to skip posting a portion of traces. (#1578) (Kritin Vongthongsri)

v2.9.1

  • Add a more flexible conversation simulator: generate a configurable number of conversations per intent, accept either user_profile_items or predefined user_profiles, and optionally stop early using a stopping_criteria. Progress tracking now reflects the total conversations generated across intents. (#1584) (Kritin Vongthongsri)

v2.8.5

  • Add get_actual_model_name() helper to extract the underlying model ID from provider-prefixed strings like openai/gpt-4.1-mini, as used by proxies such as LiteLLM. This makes it easier to work with provider/model formats consistently. (#1555) (Serghei Iakovlev)

v2.8.4

  • Add support for gpt-4.1 in structured output mode by including it in the list of supported models. This lets you use gpt-4.1 where structured outputs are required without extra configuration. (#1547) (Serghei Iakovlev)

Improvement

v3.0

  • Support passing through unknown command-line options from deepeval test run to pytest, so third-party and custom pytest plugins can receive their flags without the CLI rejecting them. (#1589) (Matt Barr)
  • Improve telemetry and tracing reliability by propagating an internal _in_component flag through metric evaluation and wrapping trace flush sends with capture logic, reducing noisy progress output and ensuring in-flight tasks are cleaned up more safely. (#1596) (Kritin Vongthongsri)
  • Bump package version to 2.9.1 for the latest release. (#1600) (Jeffrey Ip)
  • Add support for saving expected_output when exporting datasets, so expected results are preserved alongside inputs and other golden fields. (#1602) (Nail Khusainov)
  • Add default trace input/output capture when they are not explicitly set, using the observed function’s kwargs and result. This ensures traces include basic I/O data without requiring manual update_current_trace calls. (#1620) (Kritin Vongthongsri)
  • Remove the SIGINT/SIGTERM signal handler from tracing so the tool no longer overrides your process signal handling during shutdown. (#1621) (Mayank)
  • Improve assert_test AssertionError messages by including the failure reason in the thrown metrics string. This makes it easier to understand failures when logging exceptions, abstracting tests, or running under pytest. (#1623) (Orel Lazri)

v2.9.0

  • Update package metadata and internal version to 2.8.5 for the new release. (#1567) (Jeffrey Ip)
  • Improve tracing span updates by consolidating update_current_span_test_case and update_current_span_attributes into a single update_current_span API. This makes it easier to attach both span attributes and an LLMTestCase, and updates docs and error messages to match the new call pattern. (#1574) (Kritin Vongthongsri)
  • Add the PEP 561 py.typed marker so type checkers like mypy can analyze installed package imports without reporting missing stubs or import-untyped errors. (#1592) (Sigurd Spieckermann)

v2.9.1

  • Bump the package release to 2.9.0 and update version metadata across the project. (#1597) (Jeffrey Ip)

v2.8.5

  • Update package metadata and internal __version__ to reflect the latest release. (#1558) (Jeffrey Ip)
  • Prevent trace status logs from printing during evaluations unless CONFIDENT_TRACE_VERBOSE explicitly enables them, reducing noisy console output while running eval traces. (#1565) (Kritin Vongthongsri)

v2.8.4

  • Improve type safety and simplify golden/context generation by removing legacy _nodes paths. Add a ChromaDB availability check and clearer error messages to fail fast when optional dependencies are missing. (#1534) (Rami Pellumbi)
  • Add configurable temperature to supported LLM model wrappers (including Anthropic, Azure OpenAI, and Gemini) and pass it through on generation calls. Prevent invalid settings by rejecting negative temperatures with a clear error. (#1541) (Kritin Vongthongsri)
  • Improve type hints in the MMLU benchmark by making tasks optional and simplifying prompt variable typing for better static analysis and editor support. (#1550) (Serghei Iakovlev)
  • Fix typos across benchmark prompts, comments, and tests to improve wording clarity and reduce confusion when reading task names and evaluation steps. (#1552) (João Matias)
  • Move telemetry, cache, temp test-run data, and key storage into a .deepeval/ folder to reduce clutter in the project root. Automatically migrates legacy files to the new location when found. (#1556) (Kritin Vongthongsri)
  • Improve tracing logs with clearer success/failure messages, a queue-size status, and an exit warning when traces are still pending. Add optional flushing on shutdown via CONFIDENT_TRACE_FLUSH, and control log verbosity with CONFIDENT_TRACE_VERBOSE. (#1557) (Kritin Vongthongsri)

Bug Fix

v3.0

  • Fix TaskNodeOutput response format types so list and dict outputs are fully specified and accepted by OpenAI. This prevents confusing bad request errors that only appeared when the model tried to emit those previously invalid shapes. (#1599) (Matt Barr)
  • Restrict typer and click dependency versions to improve compatibility and prevent install issues with newer releases. (#1607) (Vamshi Adimalla)
  • Fix ToolCorrectnessMetric input parameter comparison so identical dictionaries are treated as a full match, improving scoring consistency when tool inputs are the same. (#1608) (Nathan-Kr)
  • Fix temp directory cleanup on Windows by adding a safer rmtree with retries and forced garbage collection to reduce failures from locked files. Also register an exit cleanup hook to help release resources before deletion. (#1609) (Propet40)
  • Fix telemetry opt-out so no analytics events or traces are captured when opt-out is enabled across evaluation, metrics, dataset pulls, and trace sending. (#1614) (Kritin Vongthongsri)
  • Fix a ZeroDivisionError when running the HellaSwag benchmark with no predictions for a task by returning an accuracy of 0 instead of dividing by zero. (#1616) (Mikhail Salnikov)
  • Fix a ValueError when running the TruthfulQA benchmark by including the expected output in each recorded prediction row, keeping result data aligned during evaluation. (#1619) (Mikhail Salnikov)
  • Fix ToolCall.__hash__ to support unhashable input/output values like lists and nested dicts. Hashing now converts complex nested structures into stable hashable forms, preventing TypeError during comparisons and test runs. (#1625) (Muhammad Hussain)
  • Fix a FileNotFoundError in telemetry by using a consistent temp run data filename when moving it into the .deepeval directory. This prevents failures caused by a mismatch between dotted and non-dotted filenames. (#1630) (Jakub Koněrza)

v2.9.0

  • Fix Azure OpenAI initialization to use the correct deployment_name when setting azure_deployment, preventing misconfigured clients and failed requests. (#1571) (Kritin Vongthongsri)
  • Fix Amazon Bedrock model imports to avoid unnecessary dependencies being loaded when using the Bedrock LLM integration. (#1573) (Kritin Vongthongsri)
  • Fix a typo in the MMLU benchmark that could cause an assertion failure when validating the example dataset, so load_benchmark and prediction run as expected. (#1580) (Tri Dao)
  • Fix broken integration documentation links for LlamaIndex and Hugging Face so the README points to the correct pages. (#1582) (Wey Gu)
  • Fix client patching during tracing context setup by skipping type checks when the client is None, preventing errors when no client is configured. (#1585) (Mayank)
  • Fix a syntax error in the synthesizer generate-from-scratch documentation example by adding a missing trailing comma in StylingConfig, making the snippet copy-pasteable. (#1587) (Shun Liang)
  • Fix OllamaModel.a_generate() to use the model name set in the constructor. This keeps async generation consistent with OllamaModel.generate() and prevents using the wrong Ollama model. (#1594) (Sigurd Spieckermann)

v2.8.5

  • Fix trace queue handling so queued and in-flight traces are posted more reliably on exit or interruption. Add SIGINT/SIGTERM handling and improve warnings to report remaining traces and support optional flushing via CONFIDENT_TRACE_FLUSH. (#1559) (Kritin Vongthongsri)
  • Fix the exit warning to only appear when there are pending traces to post. This prevents misleading warnings when the trace queue and in-flight tasks are empty. (#1566) (Kritin Vongthongsri)

v2.8.4

  • Fix MMLU evaluation when model.generate() returns a tuple or list by extracting the first result before reading .answer. This prevents AttributeError/TypeError and improves compatibility across different model implementations. (#1546) (krishna0125)

April

April made evaluations more traceable and easier to configure. Native model support expanded with Gemini and Anthropic, plus improved Azure OpenAI and Ollama setup. New metadata fields (token_cost, completion_time, additional_metadata) and tracing upgrades made multi-turn test generation and debugging smoother, while robustness fixes reduced import failures and crashes.

Backward Incompatible Change

v2.7.6

  • Remove async from get_model_name on the base embedding model interface, making model name retrieval a synchronous call for simpler implementations and call sites. (#1516) (Rami Pellumbi)

v2.7.3

  • Remove the auto_evaluate helper from the public API to streamline the tracing-focused surface area and reduce unused functionality. (#1513) (Jeffrey Ip)

New Feature

v2.7.7

  • Add traceable eval runs so agent/tool/LLM steps can be captured and attached to each test case during evaluation. This improves debugging and makes it easier to understand how outputs were produced, including when running evals over pulled datasets. (#1523) (Kritin Vongthongsri)
  • Add support for named goldens and allow assert_test to run traceable evals using a Golden plus callback, in both sync and async modes. Improve input validation for assert_test to prevent invalid argument combinations. (#1532) (Kritin Vongthongsri)

v2.7.6

  • Add min_context_length and min_contexts_per_document to Synthesizer document context generation, so you can enforce a minimum context size and minimum number of contexts per document while still capping with the existing max settings. (#1508) (Kritin Vongthongsri)
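
A rough sketch of how the new minimum-context settings above might be combined with the existing maximums. The ContextConstructionConfig import path and the keyword for passing the config are assumptions; confirm against your installed version.

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ContextConstructionConfig  # assumed path

# Enforce both a floor and a ceiling on context construction per document.
config = ContextConstructionConfig(
    min_context_length=2,
    max_context_length=4,
    min_contexts_per_document=1,
    max_contexts_per_document=3,
)

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/handbook.txt"],
    context_construction_config=config,  # assumed keyword for the config object
)
```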

v2.7.3

  • Add generate_goldens_from_goldens to expand an existing set of Goldens into new ones, reusing available contexts for grounded generation or falling back to scratch generation when context is missing. Optionally generates expected outputs and can infer prompt styling from the provided examples. (#1506) (Kritin Vongthongsri)
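
A minimal usage sketch for the entry above, assuming generate_goldens_from_goldens lives on Synthesizer and accepts a list of seed goldens (keyword names may differ in your version):

```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden

# A small seed set; contexts are reused when present, otherwise generation
# falls back to scratch-style synthesis as described above.
seed_goldens = [
    Golden(input="How do I reset my password?"),
    Golden(input="What does the premium plan include?"),
]

synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(goldens=seed_goldens)
```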

v2.6.8

  • Add native Gemini model support, including multimodal judging and structured outputs. Configure it via set-gemini using either a Google API key or Vertex AI project/location, and disable it with unset-gemini to revert to the default provider. (#1493) (Kritin Vongthongsri)
  • Add support for running evaluations with Anthropic Claude models via a new AnthropicModel, including sync/async generation and token cost tracking. (#1495) (Kritin Vongthongsri)
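
A minimal sketch of plugging the new Claude support into a metric. The constructor keyword and model ID below are assumptions for illustration; any metric that accepts a custom judge model can take the wrapper.

```python
from deepeval.models import AnthropicModel
from deepeval.metrics import AnswerRelevancyMetric

# Model ID is illustrative; pick any Claude model available to your API key.
claude = AnthropicModel(model="claude-3-5-sonnet-20240620")
metric = AnswerRelevancyMetric(model=claude)
```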

v2.6.6

  • Add a conversation simulator that generates multi-turn conversational test cases from user profile items and intentions, with optional opening messages. Supports async concurrency and tracks simulation cost when using native models. (#1481) (Jeffrey Ip)

Improvement

v2.7.9

  • Improve documentation by clarifying CLI usage (deepeval test run), updating command examples to bash, and fixing links to the correct evaluation guide sections. (#1537) (Jeffrey Ip)
  • Prepare a new package release by bumping the project version metadata. (#1539) (Jeffrey Ip)
  • Bump the package version to 2.7.8 for the latest release metadata. (#1540) (Jeffrey Ip)

v2.7.7

  • Prepare a new release by updating the package version metadata. (#1525) (Jeffrey Ip)
  • Allow LLM and retriever spans to be recorded without calling update_current_span_attributes. Missing attributes no longer raise errors, and span conversion skips optional fields when they aren’t provided. Improve error handling for non-JSON API responses. (#1530) (Kritin Vongthongsri)
  • Improve how LLMTestCase is converted to a string for G-Eval prompts by centralizing the formatting and ensuring tool-call values are rendered consistently via repr(). (#1531) (João Matias)

v2.7.6

  • Add a new documentation article showcasing popular G-Eval metric examples, with sample code and guidance for defining custom LLM-judge criteria and RAG-focused evaluations. (#1517) (Kritin Vongthongsri)
  • Improve the G-Eval documentation with research context, clearer RAG evaluation criteria, and a new advanced section explaining limitations and when to use DAG-based metrics, including an end-to-end example. (#1519) (Kritin Vongthongsri)
  • Fix typos and improve wording in synthesizer prompt templates to make instructions clearer and reduce confusion in generated outputs. (#1521) (Song Luar)
  • Improve import-time dependency resolution by deferring optional integration imports, reducing startup failures when LangChain or LlamaIndex aren’t installed. Change update checks to be opt-in via DEEPEVAL_UPDATE_WARNING_OPT_IN. (#1524) (Jeffrey Ip)

v2.7.3

  • Fix a typo in the QA agent metrics tutorial by correcting “weather” to “whether” in the Faithfulness description, improving documentation clarity. (#1505) (Justin Nauman)
  • Fix typos in the benchmarks introduction docs to use the correct prompts variable name and improve wording for clarity. (#1511) (Russell-Day)

v2.6.8

  • Add retention analytics by sending PostHog events for evaluation runs and synthesizer invocations when telemetry is enabled, improving visibility into feature usage over time. (#1486) (Kritin Vongthongsri)
  • Add log-probability support for Azure OpenAI in GEval, including Azure models in log-probability compatibility checks and enabling raw response generation with cost tracking via the LangChain client. (#1492) (Kritin Vongthongsri)
  • Add google-genai and posthog as dependencies and refresh the lockfile to pull in required transitive packages. (#1499) (Kritin Vongthongsri)

v2.6.6

  • Add a new comparison blog post and author profile to the documentation, expanding the site’s blog content and attribution. (#1471) (Kritin Vongthongsri)
  • Improve Ollama embedding configuration by using the same underlying ollama module as the chat model. This aligns base_url handling so embeddings and chat can share the same Ollama host without requiring different /v1 URL variants, reducing setup confusion. (#1474) (Paul Lewis)
  • Add a new documentation blog post comparing the tool with Langfuse, and update existing comparison content for clearer messaging about provider integration and metric support. (#1475) (Kritin Vongthongsri)
  • Add token_cost and completion_time fields to LLM and multimodal test cases, and include them in the API test case payload as tokenCost and completionTime (see the sketch after this list). (#1476) (Kritin Vongthongsri)
  • Add additional_metadata to test results so extra per-test details are preserved and returned for conversational, multimodal, and standard evaluations. (#1477) (Mayank)
  • Improve the conversation simulator API by moving model_callback, turn limits, and conversation count into simulate() and adding clearer progress reporting during generation for both sync and async runs. (#1491) (Jeffrey Ip)
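
The token_cost, completion_time, and additional_metadata entries above map onto plain test case fields; a minimal sketch with illustrative values:

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the refund policy.",
    actual_output="Refunds are issued within 14 days of purchase.",
    token_cost=0.00042,       # sent in the API payload as tokenCost
    completion_time=1.8,      # seconds; sent as completionTime
    additional_metadata={"run_id": "demo-123"},  # preserved in test results
)
```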

Bug Fix

v2.7.7

  • Fix invalid enum errors in tracing by aligning span status values to use ERRORED instead of ERROR, so failed spans serialize and report correctly. (#1536) (Mayank)
  • Fix agentic assert_test runs so they no longer always disable saving results. Test runs now respect the save_to_disk setting and correctly reuse or create the current test run by identifier. (#1538) (Kritin Vongthongsri)

v2.7.6

  • Fix FiltrationConfig.synthetic_input_quality_threshold to use a float instead of an int, matching its default value and preventing type-related configuration errors. (#1515) (Rami Pellumbi)
  • Fix the Bias metric docs example to import evaluate from deepeval, so the sample code runs as written. (#1520) (snsk)

v2.7.3

  • Fix Gemini model wrappers to stop hardcoding an allowlist of model names. You can now pass newer or custom Gemini model IDs without getting an unnecessary "Invalid model" error. (#1503) (Mete Atamel)
  • Fix Anthropic model initialization and async generation by treating AnthropicModel as a native provider and loading the client in async mode, preventing failures when calling a_generate. (#1504) (Kritin Vongthongsri)

v2.6.8

  • Fix synthetic dataset generation from documents failing with UnicodeDecodeError on non-UTF-8 text. Default to auto-detecting file encoding instead of Windows defaults, and allow manually setting an encoding for edge cases. (#1485) (Aahil Shaikh)
  • Fix type hints for context_quality_threshold and context_similarity_threshold to use float, matching their default values and preventing misleading type checking. (#1490) (Jakub Koněrza)

v2.6.6

  • Fix Azure OpenAI setup by separating openai_model_name from the deployment name and using the deployment name when creating the client. The CLI now prompts for --openai-model-name and stores/clears it alongside other Azure settings. (#1480) (Kritin Vongthongsri)
  • Fix the QA agent evaluation tutorial to import EvaluationDataset from deepeval.dataset, matching the current package structure and preventing import errors when following the docs. (#1483) (Anton)
  • Fix ToolCorrectness metric crashing with an unhashable type error when a tool call output is a list and expected tools are provided without a guaranteed order. This lets tool-correctness evaluation run reliably for list outputs. (#1487) (Sai Pavan Kumar)

March

March made evaluations and synthesis more reliable. Defaults improved for Ollama and Azure OpenAI, broader model support landed (including gpt-4.5-preview), and structured outputs became more consistent. Large runs gained resilience with expanded retry handling for transient failures, plus fixes for async scoring, G-Eval strict mode, and benchmark parsing.

New Feature

v2.6.5

  • Add support for the gpt-4.5-preview-2025-02-27 model, including pricing metadata and compatibility flags for features like structured outputs and JSON mode. (#1453) (John Lemmon)
  • Add file_name and quiet options to Synthesizer.save_as() so you can control the output filename and suppress console output. Improve validation for file types and synthetic goldens, with updated docs and tests. (#1455) (Serghei Iakovlev)
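
A short sketch of the new save_as() options from the entry above. The file_type and directory keywords follow the existing API; only file_name and quiet are new, and defaults may differ by version.

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# ... run one of the generate_goldens_* methods first so there is data to save ...
synthesizer.save_as(
    file_type="json",
    directory="./synthetic_data",
    file_name="support_bot_goldens",  # control the output filename
    quiet=True,                       # suppress console output
)
```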

v2.5.9

  • Support additional native model providers when initializing metrics and evaluators, including Azure OpenAI, Ollama, and local models. Model selection can now be driven by configuration without changing code. (#1441) (Kritin Vongthongsri)
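
Configuration-driven model selection happens through the CLI rather than code; a hedged sketch is below. The flag names are assembled from entries elsewhere in this changelog and may differ by version, so confirm with deepeval --help.

```bash
# Point metrics at an Azure OpenAI deployment...
deepeval set-azure-openai \
  --openai-api-key "<key>" \
  --openai-endpoint "https://<resource>.openai.azure.com/" \
  --openai-api-version "2024-02-01" \
  --openai-model-name "gpt-4o" \
  --deployment-name "my-gpt4o-deployment"

# ...or at a local Ollama model instead, without changing evaluation code.
deepeval set-ollama llama3.1 --base-url "http://localhost:11434"
```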

v2.5.8

  • Add optional cost_tracking to Synthesizer to enable full API cost tracking, disabled by default. When enabled, generation runs report detailed cost information alongside the output. (#1406) (Chuqing Gao)
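
The cost-tracking toggle above is a constructor flag; a one-line sketch:

```python
from deepeval.synthesizer import Synthesizer

# Disabled by default; when enabled, generation runs report detailed API costs.
synthesizer = Synthesizer(cost_tracking=True)
```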

Improvement

v2.6.5

  • Update package metadata for a new release, including the published version and release date. (#1446) (Jeffrey Ip)
  • Improve resilience of large runs by retrying on additional OpenAI connection-related exceptions, not just rate limits. This reduces failures from transient network issues during long parallel evaluations. (#1450) (John Lemmon)
  • Improve reliability of uploads to Confident AI by adding retries on transient HTTPS/SSL failures, especially for large batch test runs, so evaluations are more likely to complete successfully. (#1452) (John Lemmon)

v2.5.9

  • Update package metadata to the latest release version for more accurate reporting in builds and tooling. (#1445) (Jeffrey Ip)

v2.5.8

  • Bump package metadata to the latest release version. (#1399) (Jeffrey Ip)
  • Improve Ollama model configuration by defaulting the base URL to http://localhost:11434 and removing the response format option from set-ollama. This reduces mismatches with Ollama endpoints and keeps CLI setup focused on LLM configuration. (#1401) (Kritin Vongthongsri)
  • Improve documentation for JSON correctness metrics by showing how to validate actual_output that is a list of JSON objects using a Pydantic RootModel list schema. (#1403) (Kritin Vongthongsri)
  • Update the Task Completion metric docs to use gpt-4o instead of gpt-4 in the example configuration. (#1415) (Obada Khalili)
  • Fix a typo in the RAG evaluation guide example input, changing “gow” to “how” for clearer documentation. (#1431) (Vamshi Adimalla)
  • Improve prettify_list() JSON formatting by enabling ensure_ascii, making output consistently ASCII-escaped for non-ASCII characters and easier to paste into logs and terminals. (#1437) (Vamshi Adimalla)
  • Improve benchmark imports by loading datasets only when needed, reducing import-time failures for users who don’t use those benchmarks. Update packaging metadata to broaden the supported Python range and remove the legacy setup.py. (#1440) (Jeffrey Ip)

Bug Fix

v2.6.5

  • Fix infinite verbose output in notebooks by only constructing verbose logs when verbose mode is enabled, and by writing logs via sys.stdout with an explicit flush. (#1444) (fetz236)
  • Fix a typo in the tracing example prompt so the sample question reads correctly when you run the demo. (#1448) (Mert Doğruca)
  • Fix Azure OpenAI initialization to always use the configured deployment name from settings, ensuring the correct azure_deployment is passed to sync and async clients. Improve the docs for set-azure-openai with clearer endpoint examples and a minimum required API version note. (#1451) (Kritin Vongthongsri)
  • Fix incorrect metadata propagation in conversational test cases so each turn keeps its own additional_metadata and comments instead of inheriting the parent test case values. (#1456) (Xiaopei)
  • Fix synthesizer compatibility with Azure OpenAI by handling generate() responses that return plain strings or (result, cost) tuples, preventing tuple attribute errors when extracting synthetic data. (#1459) (Nicolas Torres)
  • Fix set-ollama --base-url so Ollama requests use the configured base URL from .deepeval instead of falling back to the default localhost setting. (#1460) (Paul Lewis)
  • Fix native model handling in the synthesizer and multimodal metrics by using structured outputs when a schema is provided, returning typed results instead of parsing JSON strings. Add CLI commands to set and unset Ollama embeddings, and use the configured embedding initializer instead of a hardcoded OpenAI embedder. (#1461) (Kritin Vongthongsri)
  • Fix the red-teaming guide example so the chat.completions.create call uses the correct messages argument and returns the message content, making the snippet runnable as written. (#1463) (Karthick Nagarajan)
  • Fix async measure to return self.score when async_mode=True, instead of returning None. Async and sync metric execution now produce a consistent, non-empty score value. (#1464) (Roman Makeev)

v2.5.8

  • Fix Ragas metrics failing with an “async_mode is missing” error by explicitly running metric tracking in non-async mode during evaluation. (#1402) (Tanay Agrawal)
  • Fix the import path for LLMTestCaseParams in the metrics selection tutorial so the example code runs without import errors. (#1407) (Obada Khalili)
  • Fix a typo in the synthetic input generation template to clarify instructions about avoiding repetitive input. (#1408) (John D. McDonald)
  • Fix tool correctness reason messages so the expected and called tool names are reported in the right order when using exact match checks. (#1409) (Casey Lewiston)
  • Fix the dataset synthesis tutorial to use the correct StylingConfig keyword argument, replacing expected_output with expected_output_format so the example code runs as intended. (#1411) (Obada Khalili)
  • Fix a typo in __all__ by restoring a missing comma so auto_evaluate and assert_test are exported correctly from the package. (#1412) (88roy88)
  • Fix benchmark prediction generation to fall back more reliably by also handling AttributeError when extracting the model answer. (#1414) (Stan Kirdey)
  • Fix G-Eval strict mode to use a dedicated prompt and return a binary score (0/1) with an explicit reason, instead of scaling scores and post-adjusting them against the threshold. (#1416) (Kritin Vongthongsri)
  • Fix SQuAD benchmark answer parsing by using StringSchema for enforced model generation instead of a multiple-choice schema, improving compatibility with model outputs. (#1423) (Diogo Carvalho)
  • Fix the documented Azure OpenAI embedding setup command by correcting the flag name to --embedding-deployment-name, so the example works as shown. (#1424) (Amali Matharaarachchi)
  • Prevent G-Eval from requesting log probabilities on unsupported GPT models (such as o1 and o3-mini). This avoids errors when generating raw responses and lets evaluations run normally by falling back when logprobs aren’t available. (#1425) (Kritin Vongthongsri)
  • Fix login_with_confident_api_key() to reject missing API keys by raising a clear ValueError, preventing confusing behavior when the key is empty or not provided. (#1427) (Vamshi Adimalla)
  • Fix the LLM monitoring docs example to use the correct variable name for the monitored response, so the async a_monitor call matches the returned output. (#1432) (Lucas Le Ray)
  • Fix document-based golden generation to rebuild the vector index each run instead of reusing cached state, avoiding stale chunks in repeated notebook executions. Add validation to prevent chunk_overlap from exceeding chunk_size - 1, and relax the chromadb install requirement to any compatible version. (#1433) (Kritin Vongthongsri)
  • Fix the DAG non-binary verdict prompt to require a consistent JSON response with verdict and reason, including an example format. This reduces malformed outputs and makes results easier to parse reliably. (#1434) (Hani Cierlak)
  • Fix synthesizer chunking with ChromaDB by handling missing collections more robustly, avoiding failures when the collection error type differs across versions. (#1442) (Kritin Vongthongsri)

February

February improved evaluation reliability and expanded customization. Fixes landed for batching detection, async auto_evaluate, custom LLM validation, and concurrent evaluation stability. Metrics gained injectable templates (including a custom FaithfulnessTemplate), include_reason support with clearer reasoning for DAG-based metrics, and a new MultimodalToolCorrectnessMetric, plus additional_metadata on conversational test cases and Prompt objects as hyperparameters.

New Feature

v2.4.6

  • Add MultimodalToolCorrectnessMetric to score whether an MLLM called the expected tools correctly. Evaluation can check tool name, input parameters, and outputs, with optional exact-match and ordering rules. Results now include expected and called tool data in API test cases. (#1386) (Umut Hope YILDIRIM)
  • Support passing Prompt objects as hyperparameters in test runs and monitoring, preserving prompt version metadata when available. Improve prompt pulling and validation so prompts can be created from an alias or a manually provided template. (#1387) (Jeffrey Ip)

v2.3.9

  • Add deepeval recommend metrics, an interactive CLI flow that asks a few yes/no questions and returns recommended evaluation metrics for your use case. (#1342) (Kritin Vongthongsri)
  • Add support for passing additional_metadata on conversational test cases, and include it in the generated API payload as additionalMetadata. This preserves extra context when creating and evaluating test runs. (#1352) (Kritin Vongthongsri)
  • Add CLI support for running LLM-based evaluations with local Ollama models via set-ollama and unset-ollama, including configurable base URL and response format. Documentation was updated with setup and usage guidance. (#1360) (Kritin Vongthongsri)
  • Add support for injecting a custom FaithfulnessTemplate into FaithfulnessMetric for dynamic prompt generation. This lets you plug in domain-specific or few-shot templates without overriding claim generation methods. (#1367) (Lei WANG)
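
A hedged sketch of the template-injection entry above. Both the import path for FaithfulnessTemplate and the evaluation_template keyword are assumptions; the point is that you subclass the template rather than override the metric's claim-generation methods.

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics.faithfulness.template import FaithfulnessTemplate  # assumed path

class DomainFaithfulnessTemplate(FaithfulnessTemplate):
    # Override the template's prompt-building methods here (for example the
    # claim-extraction prompt) to add domain-specific or few-shot guidance.
    pass

metric = FaithfulnessMetric(evaluation_template=DomainFaithfulnessTemplate)  # assumed keyword
```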

v2.3.1

  • Add support for the o3-mini and o3-mini-2025-01-31 models, including pricing metadata and enabling use in structured outputs and JSON mode where supported. (#1331) (Song Luar)

Improvement

v2.4.7

  • Update package metadata and internal __version__ to match the latest release. (#1392) (Jeffrey Ip)
  • Add support for injecting custom evaluation templates into metrics, making it easier to customize the prompts used to generate statements, verdicts, and reasons. (#1393) (Jeffrey Ip)
  • Fix a typo in the getting started guide so the GEval description correctly refers to evaluating outputs on any custom metric. (#1394) (Christian Bernhard)
  • Fix a typo in the getting started guide to improve clarity when describing GEval and recommending DAGMetric for deterministic scoring. (#1395) (Christian Bernhard)
  • Fix a typo in the getting-started guide by correcting “somewhre” to “somewhere” for clearer documentation. (#1396) (Christian Bernhard)

v2.4.6

  • Improve dependency compatibility by relaxing the grpcio pin to allow newer 1.x releases while staying below 2.0. This reduces install and resolver conflicts across environments. (#1383) (Jeffrey Ip)
  • Bump the package release metadata to 2.4.3 so the published version and citation information reflect the latest release. (#1385) (Jeffrey Ip)
  • Update package metadata and internal version to 2.4.4 for the new release. (#1388) (Jeffrey Ip)
  • Improve metric parameter validation by moving each metric’s required test-case fields into the metric class, ensuring consistent checks in both sync and async evaluation. (#1389) (Jeffrey Ip)

v2.4.3

  • Add telemetry for dataset pulls, capturing login method, environment, and basic user identifiers to help monitor usage and diagnose issues. (#1377) (Kritin Vongthongsri)

v2.3.9

  • Update package metadata for a new release, including the version and release date. (#1334) (Jeffrey Ip)
  • Improve CLI login by opening a paired browser flow and recording the login provider for telemetry. Evaluation and run events now include a logged_in_with attribute to help diagnose usage patterns. (#1341) (Kritin Vongthongsri)
  • Fix typos and small wording issues in the contextual precision and contextual recall metric templates to make the generated prompts clearer and more consistent. (#1344) (Filippo Paganelli)
  • Add telemetry for the recommend metrics CLI flow to capture usage context when telemetry is enabled. Mark runs as incomplete when the command errors out. (#1346) (Kritin Vongthongsri)
  • Add include_reason support to DAG-based metrics and generate clearer, path-based reasons from the DAG traversal. Improve verbose output by recording per-node execution steps, and normalize static node scores to a 0–1 range. (#1348) (Jeffrey Ip)
  • Improve documentation navigation and onboarding by reorganizing the Guides sidebar and adding an early deepeval login step in the tutorial introduction to help users set up their API key before starting. (#1353) (Kritin Vongthongsri)
  • Add documentation for integrating Elasticsearch as a vector database, including setup steps and examples for evaluating and tuning retrieval with contextual metrics. (#1354) (Kritin Vongthongsri)
  • Improve Elasticsearch integration documentation with clearer setup steps and an expanded walkthrough for preparing LLMTestCases and running contextual retrieval metrics to evaluate and tune retriever performance. (#1355) (Kritin Vongthongsri)
  • Add integration docs for Chroma, including setup and examples for evaluating retrieval quality with contextual metrics and tuning retriever hyperparameters. (#1357) (Kritin Vongthongsri)
  • Improve the Chroma integration docs with clearer setup and retrieval evaluation examples, including persistent client usage and n_results (top-K) tuning guidance. (#1361) (Kritin Vongthongsri)
  • Improve metric docs with a clearer example of using evaluate() to generate reports or run multiple metrics on a test case, plus an explicit alternative showing how to call metric.measure() directly. (#1364) (Kritin Vongthongsri)
  • Add telemetry for metrics run mode by recording whether a metric is executed in async mode. This improves observability when diagnosing performance and runtime behavior across different execution paths. (#1365) (Kritin Vongthongsri)
  • Improve the PGVector integration guide with clearer setup and retrieval steps, expanded evaluation guidance, and updated examples for embedding models and tuning LIMIT/top-k. Reorganize content to better explain how PGVector fits into a RAG pipeline. (#1366) (Kritin Vongthongsri)
  • Fix a typo in the tutorial introduction so the guidance on choosing evaluation criteria reads correctly. (#1370) (JonasHildershavnUke)

v2.3.1

  • Prepare a new release by updating package metadata and reported version. (#1328) (Jeffrey Ip)

Bug Fix

v2.4.7

  • Fix a typo in the Faithfulness metric docs by correcting a sentence in the truths_extraction_limit parameter description. (#1391) (Christian Bernhard)

v2.4.6

  • Fix cleanup of test case instance IDs so concurrent evaluate calls with multiple non-conversational metrics no longer crash in the same process. (#1384) (cancelself)

v2.4.3

  • Fix the faithfulness prompt example to use the correct truths JSON key instead of claims. (#1373) (Jeffrey Ip)
  • Fix initialization of the faithfulness metric by ensuring the prompt template is created during construction. This prevents missing template errors and makes metric setup more reliable. (#1374) (Jaime Enríquez)
  • Fix ValidationErrors when evaluating with a custom LLM after the verdict-based schema change, ensuring custom models validate correctly and evaluation runs without failing. (#1375) (Tyler Ball)
  • Relax the grpcio dependency to ^1.67.1 instead of pinning 1.67.1. This reduces pip upgrade conflicts in projects that already require a newer grpcio (for example via grpcio-status). (#1379) (Dmitriy Vasilyuk)
  • Fix the first README example by adding missing imports and providing expected_output in LLMTestCase, so the snippet runs without NameError and matches the documented setup. (#1382) (dokato)

v2.3.9

  • Fix the broken link to the G-Eval paper in the ConversationalGEval documentation so readers can access the referenced source directly. (#1336) (Jonathan du Mesnil)
  • Fix auto_evaluate async execution by passing the correct async_mode flag, and export auto_evaluate at the package top level so it can be imported directly from the main module. (#1338) (Kritin Vongthongsri)
  • Fix CLI login pairing flow by starting the local server on an available port and opening a direct pairing URL. Show which provider you logged in with after login (and on failure) to make troubleshooting easier. (#1345) (Kritin Vongthongsri)
  • Fix DAG template examples to use valid JSON booleans (true/false) so generated verdict outputs are JSON-compliant and easier to parse. (#1349) (Aaron McClintock)
  • Fix red_teamer.scan documentation by adding the missing comma in the example call, so the code block parses correctly and can be copied without syntax errors. (#1351) (Akshay Rahatwal)
  • Fix prompt wording so verdict is only set to 'yes' when the instruction is completely followed, reducing ambiguous interpretations in generated results. (#1369) (Daniel Abraján)
  • Fix the CybersecurityGuard API by renaming CyberattackType to CyberattackCategory and switching configuration from vulnerabilities to categories. Remove stray debug prints and make input/output guard type selection consistent. (#1372) (Jeffrey Ip)

v2.3.1

  • Fix should_use_batch detection by checking for a batch_generate method instead of calling it and swallowing errors. This prevents false negatives when batch_generate requires extra arguments (for example schemas) and ensures batching is enabled when supported. (#1327) (Ruiqi(Ricky) Zhu)
  • Fix typos in generated telemetry output to improve accuracy and readability of telemetry files. (#1329) (Paul-Louis NECH)
  • Fix passing document paths to the context generator when building embeddings, preventing incorrect argument mapping during golden generation from docs. (#1330) (Kritin Vongthongsri)

January

January made evaluations and red-teaming easier to adopt with documentation cleanups, new tutorials, and clearer configuration patterns like target_model_callback and ignore_errors. Observability improved with expanded telemetry, run identifiers, and synthesis_cost tracking. Features advanced with a new ARC benchmark runner, structured ToolCall support, a new TaskCompletionMetric, and a revamped Guardrails API.

New Feature

v2.2.7

  • Add auto_evaluate to automatically generate evaluation datasets from captured LangChain or LlamaIndex context, run a target model, and score results with selected metrics. Supports async execution and optional dataset/result caching. (#1283) (Kritin Vongthongsri)
  • Add TaskCompletionMetric to score whether an agent completed the user’s goal based on the actual outcome and tools called, with optional reasons and async support (see the sketch after this list). (#1295) (Kritin Vongthongsri)
  • Add a new Legal Document Summarizer tutorial series, covering how to define summarization criteria, pick metrics, run evaluations, iterate on hyperparameters, and catch regressions by comparing test runs. (#1323) (Kritin Vongthongsri)
  • Add a new RAG QA Agent tutorial in the docs, including guidance on choosing metrics, running evaluations, and improving hyperparameters. The tutorials sidebar now includes this section and surfaces it by default. (#1326) (Kritin Vongthongsri)
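
A minimal sketch of the TaskCompletionMetric entry above, scoring an agent turn from its input, the tools it called, and its final output (values and tool fields are illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = TaskCompletionMetric(threshold=0.7, include_reason=True)

test_case = LLMTestCase(
    input="Book me a table for two in Rome tonight.",
    actual_output="Done: booked a table for two at Trattoria Da Enzo for 8pm.",
    tools_called=[
        ToolCall(
            name="book_restaurant",
            input_parameters={"city": "Rome", "party_size": 2},
            output={"status": "confirmed", "time": "20:00"},
        )
    ],
)

evaluate(test_cases=[test_case], metrics=[metric])
```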

v2.2.2

  • Add three new multimodal evaluation metrics: ImageCoherenceMetric, ImageHelpfulnessMetric, and ImageReferenceMetric for scoring how well images align with surrounding context, user intent, and provided references. (#1230) (Kritin Vongthongsri)
  • Add an optional identifier to tag and persist test runs, available via the CLI flag --identifier and the pytest plugin option. This helps you distinguish and group results across multiple runs more easily. (#1237) (Jeffrey Ip)
  • Add an ARC benchmark runner with ARC-Easy and ARC-Challenge modes, configurable n_shots and problem count, and built-in accuracy reporting with per-example predictions. Expand the docs to include new benchmark pages and navigation entries for additional benchmark suites. (#1239) (Kritin Vongthongsri)
  • Add multimodal RAG evaluation support, including test cases with image inputs and retrieval context plus new multimodal metrics for recall, relevancy, precision, answer relevancy, and faithfulness. (#1241) (Kritin Vongthongsri)
  • Add a revamped guardrails API with built-in guard classes (e.g., privacy, prompt-injection, jailbreaking, topical, cybersecurity) and support for running multiple guards in one call, returning per-guard scores and breakdowns. (#1247) (Kritin Vongthongsri)
  • Add max_context_length to control how many chunks are grouped into each generated context during document-based synthesis, letting you tune context size for generation. Also adjust context grouping defaults and de-duplication to produce more consistent context groups. (#1289) (Kritin Vongthongsri)
  • Add ToolCall support for tool evaluation data. Datasets can now load tools_called and expected_tools from JSON/CSV into structured ToolCall objects, with more robust JSON parsing. Metrics like ToolCorrectness and GEval now handle ToolCall values when evaluating and formatting outputs. (#1290) (Kritin Vongthongsri)
  • Add configurable tool correctness scoring to validate tool names, input parameters, or outputs. Improve verbose logs by showing expected vs called values and the final score and reason, making tool-call mismatches easier to diagnose. (#1293) (Kritin Vongthongsri)
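
Pulling the ToolCall and tool-correctness entries above together, a short sketch of scoring whether the expected tools were called. By default the check is on tool names; stricter parameter/output and ordering checks are configurable per the entries above.

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Paris?",
    actual_output="It is 18°C and sunny in Paris right now.",
    tools_called=[ToolCall(name="get_weather", input_parameters={"city": "Paris"})],
    expected_tools=[ToolCall(name="get_weather")],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```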

Improvement

v2.2.7

  • Bump package version metadata to 2.2.2 for the latest release. (#1302) (Jeffrey Ip)
  • Improve the G-Eval documentation by adding guidance for running evaluations on Confident AI, including the deepeval login step to get started. (#1303) (Kritin Vongthongsri)
  • Fix a typo in the dataset push success message and docs, correcting “Confidnet” to “Confident” for clearer branding and guidance. (#1307) (Rahul Shah)
  • Add an ignore_errors option to red teaming scans so attack generation and evaluation can surface failures without aborting the run. Also rename the async concurrency setting to max_concurrent for clearer configuration. (#1309) (Jeffrey Ip)
  • Improve the Task Completion metric documentation by clarifying that it evaluates tool-calling agents using input, tools_called, and actual_output. Expand the calculation section to explain task/outcome extraction and alignment scoring, with additional examples for context. (#1310) (Kritin Vongthongsri)
  • Improve Jailbreaking Crescendo JSON schema generation by adding stricter system prompts to confine outputs to the expected keys and moving the description field to the eval schema. Also ensure remote attack generation initializes the API client with an explicit API key value. (#1311) (Kritin Vongthongsri)
  • Fix the MMLU benchmark docs by updating the example to use MMLUTask, helping users get started with the correct setup. This addresses an issue in the MMLU introduction, though some guidance gaps remain around long outputs and batching with varying prompt lengths. (#1313) (Matthew Khoriaty)
  • Improve tool correctness evaluation by supporting multiple ToolCallParams at once and generating clearer scoring and verbose logs for exact-match and ordering checks. (#1317) (Kritin Vongthongsri)
  • Improve synthesizer docs by clarifying that for RAG evaluation only certain evolution types reliably stick to the provided context, and annotate the examples accordingly. (#1319) (Sebastian)
  • Add a new RAG QA Agent tutorial series covering synthetic dataset generation, evaluation criteria, and metric selection, and reorganize the tutorials sidebar to keep other sections collapsed by default. (#1325) (Kritin Vongthongsri)

v2.2.2

  • Improve red-teaming 2.0 documentation with clearer setup and scan examples, including how to define vulnerabilities and a target model callback. Reorganize the docs sidebar to add OWASP guidance and a dedicated vulnerabilities section for easier navigation. (#1209) (Kritin Vongthongsri)
  • Bump package version to 2.0.5. (#1217) (Jeffrey Ip)
  • Add tracking of synthesis_cost when synthesizing goldens by accumulating model call costs, so you can see the estimated spend for synthesis runs. (#1218) (Vytenis Šliogeris)
  • Improve dependency compatibility by updating the tenacity requirement to allow up to version 9.0.0, reducing install conflicts with newer environments. (#1226) (Anindyadeep)
  • Fix a grammar issue in the RAG evaluation guide to clarify that prompts are constructed from both the initial input and the retrieved context. (#1233) (Nishant Mahesh)
  • Improve benchmark docs with clearer descriptions, supported modes/tasks, and copy-paste examples for ARC, BBQ, and Winogrande. Also tidy benchmark exports and naming to make imports and evaluation parameters more consistent. (#1240) (Kritin Vongthongsri)
  • Prepare a new release by bumping the package version to 2.1.0. (#1245) (Jeffrey Ip)
  • Improve benchmark runs by adding more built-in benchmark imports, optional verbose per-problem logging, and configurable answer-format confinement instructions to reduce parsing errors and make results easier to inspect. (#1246) (Kritin Vongthongsri)
  • Improve red-teaming documentation by renaming the target model function parameter to target_model_callback and updating sync/async examples to match, reducing confusion when wiring up scans. (#1250) (Kritin Vongthongsri)
  • Change the default Guardrails API base URL to https://deepeval.confident-ai.com/ instead of http://localhost:8000, so it connects to the hosted service by default. (#1252) (Kritin Vongthongsri)
  • Update package metadata by bumping the release version and refreshing the project description. (#1254) (Jeffrey Ip)
  • Improve Guardrails API configuration by using the shared BASE_URL from the guardrails API module instead of a hardcoded localhost URL. (#1255) (Kritin Vongthongsri)
  • Add an IS_CONFIDENT environment toggle to switch the API base URL to a local server (using PORT) instead of the default hosted endpoint. (#1258) (Kritin Vongthongsri)
  • Improve guardrails base classes and typing by introducing BaseGuard/BaseDecorativeGuard and a shared GuardType enum. This makes guard metadata and guardrail configuration more consistent across built-in guards. (#1259) (Jeffrey Ip)
  • Add a configurable top_logprobs setting to better support OpenAI and Azure OpenAI deployments where logprobs limits vary by model/version. This helps avoid failures or unexpected clamping when a service only supports smaller values (for example, 5 instead of 20). (#1261) (Dave Erickson)
  • Add PostHog analytics tracking to the documentation site, with tracking disabled in development to avoid collecting local activity. (#1268) (Kritin Vongthongsri)
  • Update package metadata for a new release. (#1270) (Jeffrey Ip)
  • Fix typos in the README by correcting “continous” to “continuous” in multiple places. (#1273) (Ikko Eltociear Ashimine)
  • Improve telemetry spans for evaluations, synthesizer, red teaming, guardrails, and benchmarks by capturing more run details and consistently tagging an anonymous unique_id (and public IP when available). This makes usage and performance monitoring more consistent across features. (#1276) (Kritin Vongthongsri)
  • Add support for additional OpenAI GPT model IDs, including versioned gpt-4o, gpt-4o-mini, gpt-4-turbo, and gpt-3.5-turbo-instruct variants, so model validation accepts more current options out of the box. (#1277) (Song Luar)
  • Add an opt-out for automatic update warnings via the DEEPEVAL_UPDATE_WARNING_OPT_OUT=YES environment variable, so you can suppress update checks in non-interactive or CI environments. Documentation was added for this setting. (#1278) (Song Luar)
  • Bump the package version for a new release. (#1279) (Jeffrey Ip)
  • Improve telemetry by tagging spans with the runtime environment (Jupyter notebook vs other) to better understand where evaluations and tools are run. (#1280) (Kritin Vongthongsri)
  • Improve OpenAI-native model calls by using structured outputs with explicit schemas, returning typed fields directly instead of parsing JSON strings. This makes metric verdicts/reasons/statements more reliable and reduces parsing failures. (#1285) (Kritin Vongthongsri)
  • Update OpenAI model lists so gpt_model and gpt_model_schematic stay in sync, including refreshed multimodal model support. Adjust validation and pricing data to match the latest available models and costs. (#1287) (Song Luar)
  • Update the default API base URL used by the red teaming attack synthesizer to point to the hosted service instead of localhost. (#1288) (Kritin Vongthongsri)
  • Improve documentation with a new Cognee integration guide and corrected guardrails example usage, plus small styling and copy updates across the site. (#1291) (Jeffrey Ip)
  • Fix typos in the custom LLMs guide to clarify the exception note and correct the instantiate instruction. (#1294) (Christian Bernhard)
  • Add telemetry attributes to record whether each feature run is considered new or old, and persist that status after a feature is used. This improves feature-usage reporting across evaluation, synthesizer, red teaming, guardrails, and benchmarks. (#1296) (Kritin Vongthongsri)
  • Add validation and pricing metadata for OpenAI o1 models (o1, o1-preview, o1-2024-12-17) so they can be used with JSON mode and structured outputs where supported. (#1299) (Song Luar)
  • Add a --display option to control which test cases are shown in the final results output, so you can view all, only failing, or only passing cases in CLI runs and evaluate() printing. (#1301) (Jeffrey Ip)
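
A short CLI sketch combining the run identifier, the new --display filter, and the update-warning opt-out mentioned above. The literal display values ("all" / "failing" / "passing") are assumptions based on the entry.

```bash
# Suppress update checks in CI (per the opt-out entry above).
export DEEPEVAL_UPDATE_WARNING_OPT_OUT=YES

# Tag the run and only print failing test cases in the final results output.
deepeval test run test_chatbot.py --identifier "release-2.2" --display "failing"
```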

Bug Fix

v2.2.7

  • Fix structured (schema) responses when using non-OpenAI models (including Azure/local) by correctly invoking the loaded model and returning the parsed JSON along with the tracked cost. (#1304) (Kritin Vongthongsri)
  • Fix circular imports involving Scorer by deferring its import in benchmark modules, preventing import-time crashes when loading benchmarks. (#1315) (Song Luar)
  • Fix async tracing in the LangChain callback by making trace state thread-safe and correctly linking parent/child spans. This prevents missing or mis-associated traces when runs execute concurrently. (#1318) (Kritin Vongthongsri)
  • Fix leftover .vector_db collections when chunking fails by cleaning up the generated collection folders before raising an error. Also handle invalid Chroma collections explicitly so document loading can recover more reliably. (#1320) (Kritin Vongthongsri)
  • Fix context generation from docs by passing document_paths explicitly, preventing incorrect argument binding. Also skip the MULTICONTEXT evolution when transforming evolution distributions to avoid generating unsupported prompt evolutions. (#1321) (Kritin Vongthongsri)
  • Fix local Ollama embedding requests by routing through the OpenAI client when the base URL points to localhost. This restores embedding support for both single text and batch inputs without changing cloud OpenAI behavior. (#1322) (Kritin Vongthongsri)

v2.2.2

  • Prevent endless verdict generation in ContextualPrecision by including the explicit document count in the prompt, helping LLMs stay aligned on long or complex context lists. (#1222) (enrico-stauss)
  • Fix MMLUTemplate.format_subject to be a static method, allowing it to be called without an instance and preventing incorrect usage in MMLU prompt formatting. (#1229) (Terrasse)
  • Prevent OpenTelemetry from loading on import when telemetry is opted out. This avoids importing protobuf dependencies unnecessarily and reduces conflicts with other libraries. (#1231) (Mykhailo Chalyi (Mike Chaliy))
  • Fix red teaming risk-category mapping to use the updated *Type vulnerability enums, keeping vulnerability classification consistent after recent naming changes. (#1236) (Kritin Vongthongsri)
  • Fix synthetic data generation when ChromaDB raises InvalidCollectionException by catching the correct exception type in a_chunk_doc, ensuring fallback handling runs instead of stopping early. (#1242) (Mizuki Nakano)
  • Fix text-to-image metric semantic consistency evaluation to use the generated output image instead of an input image, improving scoring accuracy for text-only prompts. (#1253) (Kritin Vongthongsri)
  • Fix docs to use the correct import paths for sensitive information disclosure attack types (PIILeakageType, PromptLeakageType, IntellectualPropertyType), preventing import errors when following the example code. (#1256) (Mohammad-Reza Azizi)
  • Fix guardrails API calls to use the updated /guardrails endpoint instead of the old multiple-guard path. (#1257) (Jeffrey Ip)
  • Fix guardrails API schema so input and response are defined at the request level instead of per-guard, preventing invalid payloads when multiple guards are used. (#1260) (Jeffrey Ip)
  • Fix MMLU task reloading so the benchmark dataset is fetched fresh for the selected task instead of reusing a previously cached dataset. This prevents running evaluations against the wrong task data when switching tasks. (#1267) (Yuyao Huang)
  • Fix synthesizer cost tracking to handle unset synthesis_cost. This prevents errors when generating data if cost accounting is disabled or not initialized. (#1271) (Jeffrey Ip)
  • Fix batched evaluate() results so prediction rows include the expected output alongside the input, prediction, and score, keeping benchmark output consistent and easier to inspect. (#1274) (BjarniH)
  • Fix documentation “Edit this page” links to point to the correct docs/ directory so edits open in the right place on GitHub. (#1292) (Jeffrey Ip)
  • Prevent installing the tests folder into site-packages by excluding it from the package install. This avoids name conflicts when your project also includes a tests directory. (#1300) (冯键)