🐍 2025
2025 was all about making LLM evaluation production-ready:
- Tracing & observability matured with deep integrations across LangChain, LlamaIndex, CrewAI, PydanticAI, and OpenAI Agents—plus first-class OpenTelemetry support
- Agent evaluation took center stage with new metrics for task completion, tool correctness, and MCP interactions
- Multimodal capabilities expanded across test cases and metrics
- Provider support broadened to include Anthropic, Gemini, Amazon Bedrock, and improved Ollama/Azure setups
- Safety coverage grew with guardrails, red-teaming, and compliance metrics
- Reliability improved with better async handling, timeouts, and retries
- Documentation expanded with comprehensive tutorials to help teams ship confidently
December
December strengthened evaluation, multimodal support, and prompt optimization. Multimodal test cases now flow through standard evaluation paths with better placeholder detection, Azure OpenAI support, and clearer validation errors. Prompt optimization expanded with GEPA plus new algorithms, alongside more consistent schema-based outputs and broader provider configuration via typed Settings.
New Feature
v3.7.6
- Add support for multimodal conversational test cases and goldens by automatically detecting [DEEPEVAL:IMG:...] placeholders across fields and attaching an imagesMapping so referenced images can be resolved during dataset loading. (#2373) (Vamshi Adimalla)
v3.7.5
- Add an example script showing how to run prompt optimization with a model callback, a small golden dataset, and relevancy metrics to print original vs optimized prompts. (#2347) (Jeffrey Ip)
v3.7.4
- Add GEPA (Genetic-Pareto) prompt optimization to automatically improve prompt templates against goldens and metrics. Provide GEPARunner.optimize(...) with reusable runner state, sync/async execution, configurable tie-breaking, and an OptimizationReport attached to the returned prompt. (#2293) (Trevor Wilson)
- Add MIPROv2, COPRO, and SIMBA prompt-optimization algorithms with new configuration options and runner support, enabling additional search strategies and cooperative candidate proposals during optimization. (#2341) (Trevor Wilson)
- Add support for a Portkey-backed model configured via settings. Introduce Portkey-specific options (API key, model name, base URL, provider) and validate required values early to reduce misconfiguration errors. (#2342) (Trevor Wilson)
v3.7.3
- Add Azure OpenAI support for multimodal models, including image+text prompts and optional structured/JSON outputs. Multimodal model initialization can now select Azure based on configuration, using your deployment settings and tracking token-based cost. (#2319) (dhinkris)
Experimental Feature
v3.7.5
- Add a proof-of-concept multimodal path by auto-detecting image placeholders in dataset inputs/turns and routing supported RAG-style metrics accordingly, without requiring a separate test case type. (#2346) (Vamshi Adimalla)
Improvement
v3.7.6
- Refactor evaluation to treat multimodal LLM test cases like standard LLM cases, simplifying metric execution and removing special multimodal-only handling paths. (#2369) (Vamshi Adimalla)
- Add a dedicated CI workflow and pytest coverage for metrics, including multimodal conversational cases. Improve multimodal detection and propagate the multimodal flag through evaluation step generation and scoring. Prevent invalid model usage for multimodal metrics by raising an error. (#2375) (Vamshi Adimalla)
- Improve LLM metric output consistency by standardizing schema-based generation and fallback parsing. Add configuration options for more model providers (including token pricing and Bedrock settings) and align defaults for Ollama and OpenAI model selection. (#2378) (Trevor Wilson)
v3.7.5
- Make the Ollama, Anthropic, and Gemini integrations optional at runtime. If an integration isn’t installed, raise a clear error explaining the missing dependency and how to install it. (#2345) (Trevor Wilson)
- Improve CI reliability by including optional model provider dependencies (ollama, anthropic, google-genai) in the development dependency set, reducing failures when running tests that require these integrations. (#2357) (Trevor Wilson)
- Prevent multimodal from being serialized in golden records by excluding it from model output. This reduces noisy fields in exported datasets and API payloads. (#2368) (Vamshi Adimalla)
v3.7.4
- Improve API key management across LLM providers by standardizing on typed Settings for model name, endpoint/base URL, and secrets. Constructor arguments still take precedence, and secret values are only unwrapped when building the client. (#2330) (Trevor Wilson)
- Improve the staleness policy docs by pointing reopen requests to a new MAINTAINERS.md file. This clarifies who to mention when reviving inactive issues and what details to include. (#2331) (Trevor Wilson)
v3.7.3
- Rename the pytest plugin entry point from plugins to deepeval so the plugin is registered under a clearer name. (#2308) (Gavin Morgan)
- Improve agentic metric docs with corrected code samples and clearer guidance that PlanAdherence, PlanQuality, and StepEfficiency are trace-only metrics that must run via evals_iterator or the observe decorator (see the sketch after this list). (#2316) (Vamshi Adimalla)
- Improve dataset conversions to carry additional_metadata from test cases into generated goldens, preserving metadata through CSV/JSON imports. Also prevent mixing single-turn and multi-turn items in the same dataset with clearer type errors. (#2336) (Vamshi Adimalla)
- Support per-trace API keys when sending and flushing traces, so background flush uses the correct credentials. This prevents traces from being uploaded with the wrong API key when multiple keys are used in the same process. (#2337) (Kritin Vongthongsri)
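To illustrate the trace-only note above, here is a minimal sketch of the evals_iterator / observe pattern. The PlanAdherenceMetric import location and the stubbed agent body are assumptions for illustration, not the library's documented example.

```python
# Sketch: running a trace-only agentic metric through evals_iterator() and @observe.
# PlanAdherenceMetric's import path is assumed; adjust to your installed deepeval version.
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import PlanAdherenceMetric  # assumed location
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=[PlanAdherenceMetric()])  # trace-only metrics attach to an observed component
def agent(query: str) -> str:
    answer = f"1. Gather requirements 2. Draft plan 3. Review ({query})"  # stub agent output
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer

dataset = EvaluationDataset(goldens=[Golden(input="Plan a three-step onboarding flow.")])
for golden in dataset.evals_iterator():  # each golden produces a trace the metric can score
    agent(golden.input)
```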
Bug Fix
v3.7.6
- Fix arena test case parameter validation by passing the correct arguments when checking each case, preventing incorrect validation failures for arena-based evaluations. (#2372) (Vamshi Adimalla)
- Fix multi-turn Arena G-Eval comparisons when some turns have no retrieval context, and correctly apply multimodal evaluation rules when images are present. (#2376) (Vamshi Adimalla)
- Fix MCP metrics to generate a single unified reason from all interaction reasons, with consistent sync/async behavior and correct cost tracking for native models. Also relax PlanAdherenceMetric required inputs and update tests to use a valid default model name. (#2381) (Vamshi Adimalla)
- Fix multimodal model validation by resolving callable model metadata factories and improving prompt concatenation for image inputs, preventing errors when checking supported multimodal models. (#2382) (Trevor Wilson)
v3.7.5
- Fix pydantic_ai integration imports so the package no longer crashes when optional pydantic-ai and OpenTelemetry dependencies are missing, using safe fallbacks and clearer optional-dependency errors. (#2354) (trevor-cai)
- Fix dependency lockfile to match pyproject.toml, preventing CI failures and inconsistent installs caused by mismatched dependency groups and markers. (#2358) (Trevor Wilson)
- Fix CLI test runs to avoid finalizing the same test run twice. This prevents duplicate uploads or local saves and reduces temp file race issues when deepeval test run hands off finalization to the CLI (see the sketch after this list). (#2360) (Trevor Wilson)
- Fix binary verdict JSON examples to use lowercase booleans (true/false) instead of Python-style True/False, reducing invalid JSON output from metric templates. (#2365) (Trevor Wilson)
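For context on the deepeval test run entry above, a minimal pytest-style sketch of the flow the CLI finalizes; the file name and metric choice here are illustrative.

```python
# test_app.py: a sketch of what `deepeval test run test_app.py` executes and then finalizes.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_password_reset_answer():
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output="Open Settings > Security and click 'Reset password'.",
    )
    # assert_test fails the pytest test when the metric score falls below its threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```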
v3.7.4
- Fix Anthropic client initialization to unwrap SecretStr API keys and consistently prefer an explicit constructor key over settings. Raise a clear error when the key is missing or empty, and add tests to prevent regressions. (#2329) (Trevor Wilson)
- Fix execute to avoid raising on async gather timeouts when errors are configured to be ignored, allowing timed-out metrics to be marked and execution to continue. (#2335) (Trevor Wilson)
- Fix JSON corruption on NFS by flushing and fsyncing lock-protected writes for test runs and the prompt cache. This prevents truncated or partially written files during parallel runs on network storage, with added tests to verify the behavior. (#2338) (Trevor Wilson)
- Fix parsing of provider-prefixed model names so inputs like provider/model correctly resolve to the underlying model name. (#2343) (Trevor Wilson)
- Fix URL and endpoint fallback resolution for local, Ollama, and Azure models so configured settings are used correctly instead of boolean values, preventing invalid base URLs during initialization. (#2344) (Trevor Wilson)
- Fix CLI test runs by loading the correct pytest plugin. Update the plugin argument to deepeval so the updated entry point is used and tests run with the intended plugin enabled. (#2348) (Trevor Wilson)
- Fix test discovery by adding a missing __init__.py, ensuring the test suite is treated as a module and runs reliably across environments. (#2349) (Trevor Wilson)
v3.7.3
- Fix HumanEval so verbose_mode is respected and not always treated as enabled. Also fix predictions DataFrame creation by aligning the collected row fields with the DataFrame columns, preventing a column mismatch ValueError during evaluation. (#2323) (Levent K. (M.Sc.))
November
November improved observability and evaluation workflows. Tracing expanded with Anthropic messages.create capture, richer tool-call visibility for LangChain and LlamaIndex, and clearer CrewAI spans. Evaluation grew with experiment support for compare() runs, new ExactMatchMetric and PatternMatchMetric, and a conversational golden synthesizer plus updated agent evaluation docs.
New Feature
v3.7.1
- Add support for sending compare() runs as experiments, including test run summaries, hyperparameters, and run duration, and optionally opening the results in a browser. (#2287) (Kritin Vongthongsri)
- Add support for passing a Google service account key when using Gemini via Vertex AI, including a new CLI option to save it in config. This enables authenticated Vertex AI access without relying on default credentials. (#2291) (Kritin Vongthongsri)
- Add support for overriding the Confident API base URL via CONFIDENT_BASE_URL, allowing use of custom or self-hosted endpoints. Also align the API key header name to CONFIDENT-API-KEY for better compatibility. (#2305) (Tanay)
- Support creating MLLMImage from Base64 data by providing dataBase64 and mimeType, and prevent invalid combinations like setting both url and dataBase64. Add as_data_uri() to return a data URI when Base64 data is available (see the sketch after this list). (#2306) (Vamshi Adimalla)
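A small sketch of the Base64 flow from the MLLMImage entry above. The keyword names dataBase64 and mimeType and the as_data_uri() helper are taken directly from that entry; treat them as assumptions if your installed version spells them differently.

```python
# Sketch: building an MLLMImage from Base64 data (kwarg names follow the entry above).
import base64

from deepeval.test_case import MLLMImage

with open("chart.png", "rb") as f:  # hypothetical local image
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Provide either url or Base64 data, not both; mixing them is rejected per the entry.
image = MLLMImage(dataBase64=encoded, mimeType="image/png")
print(image.as_data_uri())  # "data:image/png;base64,..." when Base64 data is available
```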
v3.7.2
- Add a conversational golden synthesizer to generate multi-turn scenarios from docs, contexts, or from scratch, with sync/async APIs and optional expected outcomes. Include new conversational styling options to control scenario context, roles, and task. (#2310) (Vamshi Adimalla)
v3.7.0
- Add Anthropic integration that automatically captures messages.create (sync and async) calls for tracing, including model, inputs/outputs, token usage, and tool calls when available. (#2224) (Tanay)
- Add tracing for CrewAI knowledge retrieval events, recording the query as span input and the retrieved knowledge as span output for clearer observability. (#2261) (Mayank)
- Add non-LLM metrics for exact equality and regex full matching. Use ExactMatchMetric to compare actual_output vs expected_output, and PatternMatchMetric to validate actual_output against a pattern with optional case-insensitive matching and verbose logs (see the sketch after this list). (#2274) (Vamshi Adimalla)
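A short sketch of the two non-LLM metrics from the last entry. The constructor keywords shown (pattern, ignore_case) are assumptions based on the description rather than confirmed signatures.

```python
# Sketch: ExactMatchMetric and PatternMatchMetric on a simple single-turn test case.
from deepeval.metrics import ExactMatchMetric, PatternMatchMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris",
    expected_output="Paris",
)

exact = ExactMatchMetric()
exact.measure(test_case)           # compares actual_output against expected_output
print(exact.score, exact.success)  # 1.0 / True when the strings match exactly

pattern = PatternMatchMetric(pattern=r"paris", ignore_case=True)  # kwarg names assumed
pattern.measure(test_case)         # full-match of actual_output against the regex
print(pattern.score)
```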
Improvement
v3.7.1
- Relax the dependency pin for pytest-rerunfailures to allow newer versions, improving compatibility with modern pytest releases and reducing dependency conflicts during installation. (#2304) (Konstantin Kutsy)
- Remove unused temporary scripts from the repository to keep the codebase cleaner and reduce clutter. (#2309) (Bowen Liang)
v3.7.2
- Fix README code block formatting so the .env.local setup snippet renders correctly and is easier to copy and follow. (#2312) (Bhuvnesh)
v3.7.0
- Add tools_called tracking for LangChain and LlamaIndex traces, capturing tool name, inputs, and outputs on both the parent span and trace. This makes tool usage visible in recorded runs and improves debugging of agent workflows. (#2251) (Mayank)
- Add a documented issue lifecycle policy: inactive issues may be closed after 12 months, with guidance on how to request reopening and which issues are excluded. (#2273) (Trevor Wilson)
- Add documentation for running end-to-end evaluations with OpenAI Agents using evals_iterator(), including synchronous and asynchronous examples and automatic trace generation per golden. (#2275) (Mayank)
- Improve non-LLM metric documentation with clearer wording, corrected references, and more consistent parameter and calculation descriptions for ExactMatchMetric and PatternMatchMetric. (#2276) (Kritin Vongthongsri)
- Add telemetry logging around OpenAI and Anthropic integrations to capture tracing when their client classes are patched. This improves observability of provider integration behavior during runtime. (#2279) (Tanay)
Bug Fix
v3.7.1
- Fix tracing masking to return the value from a custom mask function in TaskManager.mask, so masked data is actually propagated instead of being discarded. (#2289) (Trevor Wilson)
- Fix runtime crashes in the OpenAI Agents callback handler by adding missing explicit imports and replacing wildcard imports. This prevents NameError issues and cleans up linting problems around undefined names. (#2290) (Trevor Wilson)
- Fix prompt template handling by catching JSONDecodeError and TypeError during parsing, and prevent crashes by wrapping os.makedirs in a try/except. Remove stray debug output and avoid overly broad exception handling for clearer failures. (#2295) (Trevor Wilson)
- Fix cache reads by creating a fresh temp cache when the existing cache file can’t be parsed or loaded. This prevents failures and keeps test runs moving forward even if the cache is corrupted. (#2296) (Trevor Wilson)
- Fix prompt and test-run workflows on read-only filesystems by gating disk I/O and optional portalocker usage. Skip local caching when the environment is read-only while continuing to upload results. (#2297) (Trevor Wilson)
- Fix the simulator’s example JSON output to use valid JSON booleans (false instead of False), preventing JSON parse errors. Add an AlwaysJsonModel stub and a regression test to ensure JSON mode output stays parseable. (#2301) (Trevor Wilson)
v3.7.0
- Fix Anthropic and OpenAI integration tests to use LlmSpanContext for prompt and metric collection, with thread_id passed separately. This aligns tracing usage with the current API and prevents test failures. (#2256) (Tanay)
- Fix Anthropic async integration tests by switching to the tool’s Anthropic client, updating prompt version handling, and adding a new trace fixture for messages.create. (#2258) (Tanay)
- Fix Anthropic integration tests to use the official anthropic client and updated tracing expectations, keeping async/sync trace fixtures in sync with current outputs. (#2259) (Tanay)
- Fix TaskCompletionMetric task handling so extracted tasks only replace task when it wasn’t provided at initialization. Prevents a provided task from being overwritten during repeated measure/a_measure calls. (#2260) (Mayank)
- Fix OpenTelemetry token counting by falling back to gen_ai.usage.input_tokens and gen_ai.usage.output_tokens when provider-specific attributes are missing, ensuring input/output token counts are captured consistently. (#2263) (Mayank)
- Fix Python 3.9 compatibility by replacing bool | None type hints with Optional[bool], preventing syntax errors when using the package on py39. (#2264) (OwenKephart)
- Fix settings and dotenv test behavior by restoring auto-refresh when environment variables change and using the correct telemetry opt-out variable (DEEPEVAL_TELEMETRY_OPT_OUT). Add an enable_dotenv test marker and environment sandboxing, and improve boolean coercion coverage. (#2266) (Trevor Wilson)
- Fix TestRun loading and updates to preserve the in-memory state when disk reads or writes fail. Only replace the current data on a successful load, warn on errors, and fall back to in-memory updates. Ensure the parent directory exists before saving. (#2267) (Trevor Wilson)
- Fix integration tests by centralizing URL/JSON formatting helpers and ensuring OpenAI tracing updates span and trace attributes consistently. (#2269) (Mayank)
- Fix Pydantic v2 deprecation warnings by migrating all models from class-based Config to ConfigDict. Imports and common workflows no longer emit DeprecationWarnings. (#2272) (Andres Soto)
- Fix DROP batching by requiring schema-aware batch_generate(prompts, schemas) and failing fast with clearer errors when unsupported. Remove the obsolete type= argument from batch_predict() to match predict(), and make the base batch_generate raise NotImplementedError for clearer behavior. (#2278) (Trevor Wilson)
- Fix LangChain integration tests by importing create_tool_calling_agent from a stable module path, reducing breakage across LangChain versions. (#2281) (Trevor Wilson)
- Fix PostHog dependency constraints to allow versions from 5.4.0 up to (but not including) 7.0.0, improving compatibility with supported PostHog releases. (#2283) (Trevor Wilson)
October
October made tracing and evaluation more robust with gen_ai.*.messages normalization, structured message types, JSON-safe metadata, and better agent output capture across OpenAI, PydanticAI, and CrewAI. Async reliability improved with per-task timeouts and cooperative timeout budgeting so stalled work fails fast while runs finalize. Metrics gained async-by-default Hallucination evaluation, new agent-focused metrics, and configurable logging.
Backward Incompatible Change
v3.6.9
- Add cooperative timeout budgeting across retries and tasks, and always persist test cases and metrics when runs are cancelled or time out. Introduce *_OVERRIDE env settings for per-attempt and per-task timeouts, gather buffer, and stack-trace logging, and default the OpenAI client timeout from settings. (#2247) (Trevor Wilson)
- Revert settings auto-refresh based on environment changes, restoring the previous cached Settings behavior. Telemetry and error reporting now read DEEPEVAL_TELEMETRY_OPT_OUT and ERROR_REPORTING directly from environment variables again. (#2253) (Jeffrey Ip)
v3.6.8
- Remove patched LlamaIndex agent wrappers and attach metrics/metric collections via tracing context instead. This simplifies the integration and keeps LlamaIndex agents unmodified while still enriching agent and LLM spans with the expected metadata. (#2233) (Mayank)
v3.6.6
- Update the CrewAI integration to use the latest event APIs and simplify setup. Remove the custom Agent wrapper so you can use CrewAI’s built-in Agent directly while still enabling tracing via instrument_crewai() (see the sketch below). (#2152) (Mayank)
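A minimal sketch of the simplified setup described above, assuming CrewAI is installed and that the helper is importable from deepeval's CrewAI integration module.

```python
# Sketch: tracing CrewAI's built-in Agent after calling instrument_crewai() (import path assumed).
from crewai import Agent, Crew, Task
from deepeval.integrations.crewai import instrument_crewai  # assumed module path

instrument_crewai()  # patch CrewAI's event hooks so runs are traced; no custom Agent wrapper needed

researcher = Agent(role="Researcher", goal="Summarize a topic", backstory="A concise analyst.")
task = Task(
    description="Summarize the benefits of unit tests.",
    expected_output="A short summary.",
    agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()  # agent, tool, and LLM activity is captured as spans via the instrumentation
```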
New Feature
v3.6.8
- Add per-task timeouts to semaphore-guarded async evaluation work, so individual stalled tasks fail fast instead of hanging the whole run. When exceeded, the task raises asyncio.TimeoutError. (#2134) (Harsh S)
- Add a tool decorator for the CrewAI integration that propagates metric and metric_collection onto tool spans while staying compatible with existing CrewAI decorator usage patterns. (#2206) (Mayank)
- Add new agent evaluation metrics (Goal Accuracy, Topic Adherence, Plan Adherence, Plan Quality, Tool Use, and Step Efficiency), and improve trace handling by relying on a metric’s requires_trace flag. Also prevent duplicate trace results from being reported in test output. (#2238) (Vamshi Adimalla)
- Add async-friendly eval iteration for the PydanticAI integration so evals_iterator() can collect and await tasks while finalizing and serializing traces, with optional agent-level metrics during runs. (#2241) (trevor-cai)
v3.6.7
- Add OpenAI integration support with clearer dependency errors, and update evaluation flow to avoid relying on OpenAI-specific test case queues. CI now runs integration tests when API keys are available and safely skips them otherwise. (#2173) (Mayank)
- Add CrewAI wrappers Crew, Agent, and LLM that accept metrics and metric_collection and pass them into tracing spans. This lets you capture per-run metrics automatically when using with trace(metrics=...). (#2189) (Mayank)
v3.6.6
- Add display of conversational turns in multi-turn evaluations, showing role, truncated content, and any tools used. Turns are now included in test results and appear in CLI output and log/file reports. (#2113) (Trevor Wilson)
- Add saving of the trace ID in the Pydantic AI instrumentator so it can be accessed later from the same run context. This makes it possible to reference past traces for follow-up actions like annotation. (#2140) (Mayank)
- Add test_run_id to the EvaluationResult returned by evaluate, so you can reference the created test run programmatically. The existing confident_link is still returned when available. (#2156) (Vamshi Adimalla)
v3.6.3
- Add support for pulling prompts by label and caching them separately from version-based pulls. Improve prompt cache reliability by using file locking and falling back to the API when the cache is missing, locked, or unreadable. (#2154) (Jeffrey Ip)
Improvement
v3.6.9
- Add automatic settings refresh when environment variables change and expand dotenv-related tests using the enable_dotenv marker to validate boolean coercion. Update telemetry env handling to use DEEPEVAL_TELEMETRY_OPT_OUT for clearer opt-out behavior. (#2249) (Trevor Wilson)
v3.6.8
- Add timeouts around async task orchestration to prevent asyncio.gather from hanging indefinitely. On timeout, pending tasks are cancelled and drained before the error is raised, improving reliability of async evaluations. (#2136) (S3lc0uth)
- Improve test run metrics aggregation and results table output by refactoring into clearer helper functions. The results table formatting is now more consistent, easier to extend, and handles separators and empty rows more cleanly. (#2153) (Ayesha Shafique)
- Add support for passing arguments to embedding models and for customizing ConversationalGEval prompts via an evaluation_template. Fix MCP scoring to avoid division-by-zero when no scores are produced, and expand quickstart/docs with a template customization example. (#2203) (Vamshi Adimalla)
- Improve error surfacing during evaluation and tracing with a clearer error taxonomy and typed messages. When required inputs are missing or async tasks fail, affected spans are marked ERRORED while evaluation continues. Skip metric collection for failed nodes and keep progress reporting accurate when work is skipped. (#2207) (Trevor Wilson)
- Add model request parameters (like temperature and max_tokens) to the traced LLM input messages when available, making it easier to see the exact settings used for a call. (#2210) (Mayank)
- Improve OpenAI integration tracing to better handle legacy and Responses API calls. Input/output extraction is now guarded to prevent crashes, messages are rendered consistently, and tool-only outputs are captured so traces still show what happened. (#2211) (Mayank)
- Improve the Hallucination metric by moving the required parameter list from module scope to a class-level attribute for consistency with other metrics. This makes required inputs easier to inspect and validate when integrating with custom observability tooling. (#2215) (Anurag Gowda)
- Add an OpenAI integration cookbook with a ready-to-run Colab notebook showing how to trace OpenAI SDK calls and run evaluations for standalone requests and full LLM apps. (#2237) (Mayank)
v3.6.7
- Add structured prompt metadata and improved Prompt.load() parsing, including safer fallbacks when JSON is invalid or malformed. Test runs now capture and persist prompts seen during LLM spans for easier tracking and reproducibility. (#2102) (Kritin Vongthongsri)
- Add structured message types for LLM spans, including text, tool call, and tool output payloads. This improves typing and serialization for input and output when tracing multi-part model interactions. (#2116) (Mayank)
- Improve code formatting and lint compliance in OpenAI integration and trace test helpers, reducing lint noise and keeping patching logic easier to maintain. (#2166) (Trevor Wilson)
- Add configurable metric logging controls, including enable/disable, verbosity, flush, and sampling rate, separate from trace sampling. This also renames CONFIDENT_SAMPLE_RATE to CONFIDENT_TRACE_SAMPLE_RATE for clarity. (#2174) (Jeffrey Ip)
- Improve tracing so parent spans automatically include tools_called when tool spans run underneath them, even if the parent didn’t record tool calls directly. (#2175) (Mayank)
- Improve LangChain and LangGraph integration docs with clearer metric usage examples and new guidance for component-level evals. Update snippets to pass metrics inline and document how to attach metrics to LLMs and tools. Hide the PydanticAI integration page from the sidebar. (#2177) (Mayank)
- Improve dataset turn serialization by using json.dumps(..., ensure_ascii=False) so non-ASCII characters are preserved instead of being escaped in the output JSON. (#2186) (danerlt)
- Improve multimodal metric evaluation by adding a _log_metric_to_confident flag and propagating it through sync and async measure calls, making it easier to control metric logging behavior in different execution modes. (#2191) (Jeffrey Ip)
- Improve docs by adding tabbed examples for model integrations (OpenAI, Anthropic, Gemini, Ollama, Grok, Azure OpenAI, Amazon Bedrock, Vertex AI), making it easier to copy the right setup for each provider. (#2196) (Kritin Vongthongsri)
- Fix typos and wording in the metrics DAG documentation to improve clarity and readability. (#2198) (Simone Busoli)
v3.6.6
- Add a test mode for tracing integrations so spans can be captured in-memory instead of exported over OTLP. This makes integration CI tests more reliable by avoiding network calls and letting tests assert on collected trace data. (#2131) (Mayank)
- Improve optional CrewAI integration imports by handling missing dependencies cleanly and logging details in verbose mode, while also applying consistent formatting and lint fixes to keep CI passing. (#2158) (Trevor Wilson)
- Improve verbose logging for missing optional dependencies by emitting warnings instead of errors. Logs now show the missing module name when available and avoid tracebacks while pointing to the caller for easier debugging. Messages are only shown when DEEPEVAL_VERBOSE_MODE is enabled. (#2159) (Trevor Wilson)
- Improve PydanticAI tracing by including gen_ai.system_instructions in the captured input and flattening agent outputs to the final non-thinking text when final_result is missing. (#2160) (Mayank)
- Prevent sync HTTP calls from hanging indefinitely by enforcing per-attempt timeouts and retrying failures with a configurable Tenacity backoff policy. (#2162) (Trevor Wilson)
v3.6.3
- Improve Amazon Bedrock request building by passing generation_kwargs through as-is, removing automatic snake_case-to-camelCase parameter translation. This makes parameter names consistent with what Bedrock expects and avoids unexpected remapping. (#2106) (Vamshi Adimalla)
v3.6.2
- Improve OpenTelemetry tracing by normalizing gen_ai.*.messages that use parts into plain role/content messages and by forcing trace/span metadata into JSON-safe strings, including circular-reference handling, to prevent export/serialization failures. (#2114) (Mayank)
- Improve trace and agent input/output flattening by normalizing message parts and making non-text content JSON-serializable. This reduces errors when traces include structured or non-text payloads. (#2115) (Mayank)
- Improve the Hallucination metric by enabling async_mode=True by default, so evaluations run asynchronously unless you opt out. This can reduce blocking during metric execution in async-capable workflows (see the sketch after this list). (#2117) (Sai-Suraj-27)
- Improve code formatting and lint compliance by cleaning up imports and exception handling in tracing utilities, reducing ruff/black warnings without changing behavior. (#2119) (Trevor Wilson)
- Improve readability of cards and expandable sections in dark mode by refining background, borders, and text contrast. Adjust hover and focus states to keep interactive elements clear and accessible. (#2122) (Debangshu)
- Add per-task timeouts for async observed_callback execution so slow callbacks don’t block evaluation indefinitely, raising asyncio.TimeoutError after the configured limit. Synchronous callbacks are unaffected. (#2127) (Tharun K)
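To make the async-by-default change above concrete, a small sketch with the Hallucination metric; opting back out is just async_mode=False.

```python
# Sketch: HallucinationMetric now runs with async_mode=True by default; pass False to opt out.
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the refund policy.",
    actual_output="Refunds are issued within 30 days of purchase.",
    context=["Refunds are issued within 30 days of purchase with a valid receipt."],
)

metric = HallucinationMetric(threshold=0.5)  # async_mode defaults to True per the entry above
metric.measure(test_case)                    # judge calls run asynchronously under the hood
print(metric.score, metric.reason)

blocking = HallucinationMetric(threshold=0.5, async_mode=False)  # opt out for fully sync runs
```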
Bug Fix
v3.6.9
- Fix EvaluationDataset.save_as serialization so critical fields (like tools_called, expected_tools, metadata, and custom columns) are preserved across JSON, JSONL, and CSV. Multi-turn datasets now save turns as structured objects in JSON/JSONL, and CSV embeds full turn data as a JSON string while extending headers accordingly. (#2227) (Wang Junwei)
- Fix unclosed aiohttp client sessions when using AmazonBedrockModel with aiobotocore, preventing post-evaluation warnings about unclosed sessions and connectors. (#2250) (m.tsukada)
v3.6.8
- Fix embedding model initialization so generation_kwargs is passed as a dict and client options are provided via **client_kwargs. Also add explicit parameters for required connection settings (like API keys, endpoints, and host) to reduce confusion when configuring clients. (#2209) (Vamshi Adimalla)
- Fix the CrewAI example notebook by adding tracing around crew.kickoff() and reusing the answer relevancy metric, so execution traces and metric reporting work more reliably in the walkthrough. (#2212) (Mayank)
- Fix a_generate_goldens_from_contexts so generated goldens use the correct source_file for each context instead of mismatching indices, and keep progress/scores aligned with the right input. (#2213) (Vamshi Adimalla)
- Fix span result extraction to treat TraceSpanApiStatus.SUCCESS as a successful span status, so enum-based statuses are handled correctly. Adds a regression test to prevent status comparisons from incorrectly marking spans as failed. (#2214) (Trevor Wilson)
- Fix ToolCall.__repr__ to serialize input_parameters and dict output with ensure_ascii=False, so non-ASCII characters are shown correctly instead of being escaped in the printed representation. (#2230) (danerlt)
- Fix Contextual Precision verdict payloads to use a singular reason field instead of reasons, improving compatibility with schema-based generation and JSON parsing. (#2234) (Trevor Wilson)
- Fix multimodal contextual precision verdict parsing by using the singular reason field to match the expected template and schema. Prevents missing reasons and related TypeErrors when generating or reading verdicts. (#2235) (Trevor Wilson)
v3.6.7
- Prevent core tests from unintentionally calling the Confident backend by clearing Confident API keys from the environment and in-memory settings, and disabling dotenv autoload for these tests. This keeps tests/test_core isolated and avoids accidental external network use. (#2165) (Trevor Wilson)
- Fix test isolation by sandboxing os.environ per test and resetting settings before and after each run. This prevents settings.edit(persist=False) from leaking environment changes across tests and altering timeouts, retry policies, and other settings. (#2168) (Trevor Wilson)
- Fix multimodal metric parameter validation by using check_mllm_test_case_params instead of the LLM-only checker. This ensures multimodal test cases are validated with the correct rules and avoids incorrect parameter errors. (#2170) (Ayesha Shafique)
- Fix synthesizer generation so all evolved prompts are saved as Goldens instead of only the last one. Improve JSON turn serialization to preserve non-ASCII characters. Update docs to clarify when expected_output is produced and how to use a custom embedder for context construction. (#2171) (Vamshi Adimalla)
- Fix trace evaluation to always run even when there are no leftover tasks, and handle _snapshot_tasks() failures by treating them as empty. Trace evaluation is only skipped when the event loop is closed. (#2178) (Trevor Wilson)
- Fix G-Eval metric evaluations failing with OpenAI o4-mini by treating it as a model without logprobs support. The evaluator now automatically falls back to standard scoring when o4-mini (including o4-mini-2025-04-16) is used, avoiding 403 errors and completing with valid results. (#2184) (Niyas Hameed)
- Fix is_successful to correctly set and return success on the happy path based on the score threshold, avoiding false results when checking metric outcomes. (#2188) (Trevor Wilson)
- Fix evaluation tracing by mapping traces to goldens and skipping any that can’t be mapped. Prevent DFS from failing agentic test execution by finalizing runs even when spans are missing. Add async regression coverage and reset per-test state to avoid cross-test leakage. (#2190) (Trevor Wilson)
- Fix assert_test validation by rejecting mismatched metric types for LLM, conversational, and multimodal test cases. Update MultimodalToolCorrectnessMetric to use BaseMultimodalMetric and report the correct metric name. (#2193) (Vamshi Adimalla)
- Fix OpenAI multimodal user messages by stringifying mixed content to avoid Pydantic validation errors. Preserve the original list payload in messages for Responses, and add tests to prevent import-time side effects from SDK patching. (#2199) (Trevor Wilson)
v3.6.6
- Fix broken tracing integration tests by moving the trace test manager into the package and updating imports so tests no longer depend on a tests.* module path. (#2167) (Mayank)
v3.6.3
- Fix gpt-5-chat-latest being treated as a reasoning model that forces temperature=1. This restores support for temperature=0.0 and lets users control output determinism as expected. (#2121) (himanushi)
- Fix Google Colab buttons in the framework integration docs by pointing them to the correct example notebook paths, so the notebooks open properly from the documentation. (#2130) (Mayank)
- Revert the previous handling for empty expected_tools in the tool correctness metric, restoring the earlier scoring behavior when no expected tools are provided. (#2139) (Trevor Wilson)
- Fix G-Eval score normalization when the score range does not start at 0. Scores now subtract the lower bound before dividing by the range span, so values like 1–5 correctly map to 0.0–1.0. Adds test coverage for the corrected behavior. (#2142) (Priyank Bansal)
- Fix PydanticAI agent tracing to capture input and output messages more reliably. If final_result is missing, the output now falls back to the last recorded message, improving completeness of recorded spans. (#2149) (Mayank)
- Fix Amazon Bedrock requests to stop forcing a default temperature value. temperature is now only sent when provided via generation_kwargs, letting Bedrock apply its own defaults. (#2151) (Vamshi Adimalla)
v3.6.2
- Fix OpenAI Agents span handling so LLM span properties update only for spans marked as llm. This prevents spans from being skipped due to an incorrect early return and restores expected agent behavior. (#2123) (Mayank)
- Fix documentation code examples to correctly iterate over datasets, preventing TypeError: 'EvaluationDataset' object is not iterable when following the testing snippets. (#2132) (Denis)
- Fix ToolCorrectnessMetric crashing with ZeroDivisionError when expected_tools is empty. It now returns 1.0 when both tools_called and expected_tools are empty, and 0.0 when tools are called but none are expected. Added tests for these edge cases (see the sketch after this list). (#2135) (Priyank Bansal)
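A small sketch of the ToolCorrectnessMetric edge cases fixed above, using deepeval's ToolCall objects; the scores in the comments follow the entry's description.

```python
# Sketch: ToolCorrectnessMetric edge cases after the ZeroDivisionError fix.
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = ToolCorrectnessMetric()

# No tools expected and none called: score is 1.0 instead of crashing.
no_tools = LLMTestCase(input="Hi", actual_output="Hello!", tools_called=[], expected_tools=[])
metric.measure(no_tools)
print(metric.score)  # 1.0

# Tools called although none were expected: score is 0.0.
unexpected = LLMTestCase(
    input="Hi",
    actual_output="Hello!",
    tools_called=[ToolCall(name="web_search")],
    expected_tools=[],
)
metric.measure(unexpected)
print(metric.score)  # 0.0
```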
September
September made agent evaluation and tracing easier to adopt with expanded quickstarts and guides across LangChain, LangGraph, CrewAI, PydanticAI, and OpenAI Agents. Tracing improved with better input/output capture, OpenTelemetry/OTLP export behavior, and new APIs like update_current_span and update_current_trace(). Evaluation added G-Eval templating updates, MCP and conversational/DAG capabilities, and better dataset round-tripping.
Backward Incompatible Change
v2.4.8
- Remove span feedback from the OpenTelemetry exporter so traces no longer parse or emit the confident.span.feedback attribute, reducing exporter dependencies and payload. (#1942) (Mayank)
- Change benchmark evaluate results to return strongly typed Pydantic models instead of untyped dicts or floats, with a consistent overall_accuracy interface and optional benchmark-specific fields. This is a breaking change for code expecting raw primitives. Also pin datasets to <4.0.0 to avoid failures from deprecated loader scripts. (#1975) (trevor-inflection)
New Feature
v3.5.9
- Add evaluation_template support to MultimodalGEval so you can customize how evaluation steps and results are generated, including strict results. Also tighten exception handling and imports to satisfy lint rules. (#2090) (Trevor Wilson)
- Add Jinja template interpolation for prompt rendering, with template and messages_template now validated to be mutually exclusive to prevent ambiguous prompt types. (#2100) (Jeffrey Ip)
v3.5.5
- Add a PydanticAI Agent wrapper that automatically captures traces and metrics and patches the underlying model. Also export an OpenTelemetry instrumentation helper so you can instrument PydanticAI more easily without manual setup each run. (#2071) (Mayank)
v3.5.6
- Add set-debug and unset-debug CLI commands to configure verbose logging, tracing, gRPC verbosity, and error reporting. Settings can be applied immediately and optionally persisted to a dotenv file, with a no-op guard to avoid output when nothing changes. (#2082) (Trevor Wilson)
- Add support for capturing OpenAI Agents trace context into tool tracing, including workflow name, group/thread id, and metadata. Improve input/output handling so traced runs keep the initial input and select the correct output when running inside a trace. (#2087) (Mayank)
v3.5.3
- Add a unified, configurable retry policy across all supported model providers. Improve transient error detection and provider-specific handling, with opt-in delegation to provider SDK retries. Allow runtime-tunable retry logging levels and env-driven backoff settings. (#2047) (Trevor Wilson)
- Add tracing support for sync and async generator functions, ensuring observer spans stay open while items are yielded and close cleanly on completion or errors. (#2074) (Kritin Vongthongsri)
v3.5.0
- Add optional OpenTelemetry (OTLP) tracing for dataset evaluation runs via run_otel, generating a per-run ID and emitting start/stop spans plus per-item dummy spans. This enables exporting evaluation traces to an OTLP endpoint for run-level observability. (#2008) (Mayank)
v3.5.1
- Add token-level streaming timestamps to LLM tracing spans, recording each emitted token with a precise ISO time to help analyze generation latency and pacing. (#2048) (Kritin Vongthongsri)
- Add prompt version listing and update prompt pulling to use version IDs, with optional background refresh that keeps the local cache up to date. (#2057) (Kritin Vongthongsri)
v2.4.8
- Add a PydanticAI integration that instruments Agent.run with OpenTelemetry spans and exports agent input/output and optional custom trace attributes. Provide setup_instrumentation() to patch the agent safely and configure span exporting when the OpenTelemetry SDK is available. (#1851) (Mayank)
- Add MCP metrics for conversational evaluations, including args correctness, task completion, and tool correctness. These metrics support async execution, strict scoring, and verbose reasoning to help debug tool-using interactions. (#1894) (Vamshi Adimalla)
- Add support for setting trace name, tags, metadata, thread ID, and user ID via confident.trace.* span attributes. Existing confident.trace.attributes is still read for compatibility but is planned for deprecation. (#1897) (Mayank)
- Add a configurable language parameter to ConversationSimulator so prompts can be generated in any language. Default behavior remains English, so existing usage continues to work without changes. (#1899) (Johan Cifuentes)
- Add MCP evaluation support for single-turn test cases with the new MCPUseMetric, and introduce MultiTurnMCPUseMetric for multi-turn conversations. This updates the MCP metrics set to better score whether the right MCP primitives and arguments are used for a task. (#1908) (Vamshi Adimalla)
- Add a new tracing update interface that sets span data directly and introduces update_llm_span for token counts. This simplifies instrumenting LLM and retriever steps and makes metric evaluation work from span inputs/outputs without requiring a prebuilt test case. (#1909) (Kritin Vongthongsri)
- Add support for passing trace environment, metric_collection, and an optional LLM test case through OpenTelemetry attributes, so these fields are attached to exported traces and can override the default environment when provided. (#1919) (Mayank)
- Add automatic loading of .env.local then .env at import time so configuration works out of the box, while keeping existing process env vars highest priority. Allow opting out via DEEPEVAL_DISABLE_DOTENV=1. Include a .env.example and expand docs on environment setup and provider keys. (#1938) (Trevor Wilson)
- Add support for trace-level metrics in end-to-end evaluations, so you can attach metrics to a whole trace via update_current_trace() and have them run and reported alongside span-level metrics. (#1949) (Kritin Vongthongsri)
- Add an option to run conversation simulation remotely via the API with run_remote=True. This allows generating user turns without a local simulator model, and raises a clear error when the API key is missing. (#1959) (Kritin Vongthongsri)
- Add support for GPT-5 completion parameters such as reasoning_effort. You can now pass new model-specific options via a dedicated params dict, avoiding code changes when new parameters are introduced. (#1965) (John Lemmon)
- Add --save=dotenv[:path] to provider set/unset so credentials can be stored in a .env file instead of the JSON store, reducing the chance of leaking secrets. Expand set/unset tests across providers and prepare for future secure storage backends. (#1967) (Trevor Wilson)
- Add MCP evaluation examples for single-turn and multi-turn conversations, showing how to connect to MCP servers, invoke tools, and build test cases from tool calls and model outputs. (#1979) (Vamshi Adimalla)
- Add support for customizing GEval prompts via an injectable evaluation_template, and export GEvalTemplate for easier reuse. Improve evaluation docs with expanded component-level guidance, unit testing in CI/CD coverage, and updated custom embedding model configuration examples. (#1986) (Vamshi Adimalla)
- Add save_as support for conversational goldens so multi-turn datasets can be exported to JSON or CSV. Turns are serialized into a single field for portable round-tripping, and save_as now errors clearly when called on an empty dataset. (#1991) (Vamshi Adimalla)
- Add a public option when pulling datasets so you can fetch publicly shared cookbook datasets without requiring private access. (#1995) (Mayank)
- Add component-level evals for LangGraph by propagating metrics and metric_collection metadata through LLM and tool spans. Include a patched tool decorator so tools can carry metric settings without custom wiring. (#2000) (Mayank)
- Add prompt metadata to LLM tracing spans, including alias and version. This lets traces record which prompt was used alongside model and token/cost details. (#2001) (Kritin Vongthongsri)
- Add ConversationalDAGMetric and conversational DAG node types to evaluate multi-turn conversations using a DAG workflow. Supports async and sync execution with threshold/strict modes, cycle detection, and optional verbose logs and reasons. (#2002) (Vamshi Adimalla)
- Add component-level evaluation support for PydanticAI tools by allowing metric_collection or metrics on the @agent.tool decorator and recording tool outputs as tracing span attributes. (#2003) (Mayank)
- Add an OpenAI Agents Runner wrapper that collects metrics during run/run_sync and attaches inputs/results to traces. Export Runner from the openai_agents package for easier use in agent eval workflows. (#2005) (Mayank)
- Add a function_tool wrapper for OpenAI Agents that automatically traces tool calls with observe and supports passing metrics or a metric collection. Tool spans are skipped in the tracing processor to reduce noise during component evaluation. (#2010) (Mayank)
- Add Markdown document support (.md, .markdown, .mdx) in the synthesizer loaders. Improve lazy imports and type hints so heavy optional deps like LangChain and Chroma are only required when used, with clearer errors and updated docs on required packages (see the sketch after this list). (#2018) (Trevor Wilson)
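A minimal sketch of the Markdown loader support from the last entry, assuming a local docs/guide.md file and that the optional document-loading dependencies are installed.

```python
# Sketch: generating goldens from a Markdown document now that .md/.markdown/.mdx are supported.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=["docs/guide.md"])  # hypothetical path

for golden in synthesizer.synthetic_goldens:  # goldens accumulated by the synthesizer
    print(golden.input)
```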
Improvement
v3.6.0
- Add a documented, explicit way to access the active dataset golden and pass its expected_output during component-level evaluation. The executor now sets and resets the current golden around user code, and tests ensure expected_output is preserved across spans and traces with sensible override and None handling. (#2096) (Trevor Wilson)
- Add a new CLI guide covering install, secrets, provider switching, debug flags, retries, examples, and troubleshooting. Improve Multimodal G-Eval docs by documenting evaluation_template behavior, expected JSON return shapes, and a minimal customization example. Fix multiple broken links across metrics, guides, integrations, and tutorials. (#2109) (Trevor Wilson)
- Improve the OpenAI Agents integration by simplifying agent/model processing and exposing only the supported public API (DeepEvalTracingProcessor, Agent, and function_tool). This reduces unused imports and avoids exporting Runner from the package namespace. (#2110) (Mayank)
v3.5.9
- Add support for name and comments fields when loading goldens from CSV/JSON and when exporting datasets via save_as, preserving this metadata across round-trips. (#2066) (Vamshi Adimalla)
- Fix a typo in the agents getting-started guide so the end-to-end evaluation instructions read correctly. (#2095) (Raj Ravi)
- Improve PydanticAI OpenTelemetry instrumentation by reviving and consolidating it under ConfidentInstrumentationSettings. Agent-level tracing and metric wiring is now configured via the instrument setting, and the old instrument_pydantic_ai path is deprecated. (#2098) (Mayank)
v3.5.5
- Improve OpenAI Agents tracing and metrics by using typed BaseMetric lists and recording a Prompt on LLM spans. Also serialize streamed and non-streamed outputs for more reliable observability and downstream processing. (#2084) (Mayank)
v3.5.3
- Improve prompt tests by asserting the pulled prompt version starts at 0, ensuring versioning behavior is validated alongside template and message content. (#2064) (Kritin Vongthongsri)
- Fix a typo in the metrics introduction docs by changing “read-to-use” to “ready-to-use” for clearer wording. (#2065) (Jason Smith)
- Add a maintainer-only GitHub Actions workflow to manually run the full test suite against a PR’s head or merge ref, with concurrency control and optional secret-based tests. (#2069) (trevor-cai)
v3.5.2
- Improve LangChain/LangGraph tracing by using context variables to keep the active trace consistent across tool calls and nested runs. Also expose the tool decorator from the integration so you can attach metric_collection metadata and keep span attributes in the correct trace. (#2052) (Mayank)
- Improve the PydanticAI integration by adding safer one-time instrumentation, tracing for run_sync, and consistent trace argument names (e.g., name, tags, metadata). This also sanitizes run context data to avoid noisy or circular payloads in captured traces. (#2060) (Mayank)
v3.5.0
- Add a provider-agnostic retry policy with env-tunable defaults and clearer transient vs non-retryable classification. OpenAI requests now use the shared policy, disable SDK internal retries to avoid double backoff, and log retries more consistently. Quota-exhausted 429s are treated as non-retryable while timeouts and 5xx errors still retry. (#1941) (Trevor Wilson)
- Add a trace JSON validation flow for integration tests. Provide commands to generate trace test data and then validate the generated JSON to catch regressions earlier. (#2019) (Mayank)
- Add a centralized, validated Settings system and refactor CLI config commands to use it for consistent env and persistence behavior. Prevent secrets from being written to the legacy JSON store, and allow safe persistence to dotenv files when --save (or the default save setting) is enabled. (#2026) (Trevor Wilson)
- Improve example notebook formatting to satisfy black and fix lint errors, making the Conversational DAG example easier to run and review. (#2028) (Trevor Wilson)
- Improve OpenTelemetry handling by importing the OTLP exporter lazily and raising a clear error when the dependency is missing. This prevents import-time failures and guides you to install opentelemetry-exporter-otlp-proto-http when tracing is enabled. (#2032) (Mayank)
- Improve test setup reliability by reusing shared helpers to reset settings environment and tear down the settings singleton. Ensure the hidden store directory is created consistently and make config tests importable via a package __init__.py. (#2033) (Trevor Wilson)
- Add __init__.py files to nested test directories to prevent Python import/module name collisions during test runs. (#2037) (Trevor Wilson)
- Add pre-commit hooks and Ruff to provide consistent linting and formatting on changed files. Update the lockfile to include the new development dependencies. (#2038) (Trevor Wilson)
- Temporarily skip CLI and config tests that rely on environment/settings persistence while the persistence layer is being refactored. (#2041) (Trevor Wilson)
- Add a simplified PydanticAI integration API by exposing instrument_pydantic_ai and removing the custom Agent wrapper, with updated CLI trace flag names and tests to ensure trace output is generated as expected. (#2042) (Mayank)
v2.4.8
- Add new documentation quickstarts for AI agent evaluation, including setup for LLM tracing and both end-to-end and component-level evals across popular frameworks. Improve clarity in existing evaluation docs with updated titles and expanded dataset terminology. (#1818) (Kritin Vongthongsri)
- Improve documentation site styling for collapsible sections, sidebar menu, and code blocks for a more consistent reading experience. (#1879) (Jeffrey Ip)
- Improve tutorials by reorganizing evaluation sections, renaming pages to simpler routes, and adding a dedicated RAG QA evaluation guide with setup and synthetic data generation examples. (#1885) (Vamshi Adimalla)
- Add support for exporting trace-level input and output fields from span attributes, so traces capture the overall request and response alongside existing trace attributes. (#1887) (Mayank)
- Improve telemetry tracing integration event names by standardizing them under a deepeval.integrations.* namespace for more consistent reporting across supported frameworks. (#1888) (Mayank)
- Add support for setting a span’s input and output via update_current_span, so custom values are preserved and masked correctly during trace updates (see the sketch after this list). (#1893) (Kritin Vongthongsri)
- Improve the LLM Arena quickstart with a full walkthrough for creating ArenaTestCases, defining an arena metric, and running compare() to pick a winner. Also fix a typo in the arena criteria example and add the page back to the docs sidebar for easier discovery. (#1896) (Vamshi Adimalla)
- Add LangChain integration docs with end-to-end and production evaluation examples using a CallbackHandler, including synchronous and asynchronous workflows and guidance on supported metrics. (#1900) (Kritin Vongthongsri)
- Improve CrewAI tracing by capturing agent roles, available tools, tool inputs/outputs, and completed LLM call details, and by tracing contextual memory retrieval. This makes traces more informative across agent, tool, LLM, and retriever spans. (#1902) (Mayank)
- Improve DeepSeek integration docs by updating the initialization example to use model instead of model_name, matching the current constructor and reducing setup confusion. (#1906) (Lukman Arif Sanjani)
- Improve tracing for CrewAI, LangChain, LlamaIndex, and PydanticAI integrations by scoping instrumentation with a context manager. This makes span capture more reliable during initialization and setup. (#1911) (Jeffrey Ip)
- Improve G-Eval prompting to generate reasoning before the final score. This encourages more complete evaluations and can lead to more accurate, consistent scoring across judge use cases. (#1912) (Bofeng Huang)
- Add generation_kwargs to supported LLM model wrappers so you can pass provider-specific generation options like top_p and max_tokens, with updated docs and a new MCP quickstart page in the sidebar. (#1921) (Vamshi Adimalla)
- Improve the OpenAI integration docs by adding gpt-5, gpt-5-mini, and gpt-5-nano to the list of commonly used models. (#1924) (fangshengren)
- Add and refresh end-to-end evaluation documentation for multiple frameworks, including new guides for CrewAI and Pydantic AI plus updated LangChain examples. Include clearer setup, dataset iteration, and optional trace viewing steps to help you run evals quickly. (#1926) (Mayank)
- Improve documentation examples for LLM tracing and agent evaluation by fixing imports, metric names, and tracing helpers. Update the walkthrough to use EvaluationDataset.evals_iterator() and update_current_span so the sample code matches current APIs. (#1927) (Kritin Vongthongsri)
- Add support for newer GPT-5 and o4-mini model variants, including updated pricing metadata. Automatically set temperature=1 for models that require it to prevent invalid configuration errors. (#1930) (John Lemmon)
- Improve modes imports by defining __all__, making ARCMode and TruthfulQAMode the explicitly exported public API for star-imports and tooling. (#1932) (trevor-inflection)
- Improve the Confident API client by standardizing responses and surfacing clearer errors and deprecation warnings. Update endpoints and return (data, link) so CLI, prompts, datasets, and tracing can consume links consistently. (#1933) (Jeffrey Ip)
- Upgrade the PostHog client dependency to a newer version to avoid telemetry conflicts with projects that also use PostHog. This improves compatibility when both tools are installed in the same environment. (#1935) (Lucas Castelo)
- Improve PydanticAI tracing by exporting spans via an OTLP HTTP endpoint and requiring a configured API key. This makes instrumentation fail fast when credentials are missing and aligns traces with standard OpenTelemetry exporters. (#1940) (Mayank)
- Improve benchmark evaluate polymorphism by standardizing interfaces and accepting extra **kwargs. This lets you call different benchmarks with shared arguments like batch_size without crashing when a benchmark does not use them. (#1955) (trevor-inflection)
- Improve trace API payloads by populating input/output, expected output, context, retrieval context, tool calls, and metadata. This makes exported traces and generated test cases more complete and easier to debug. (#1961) (Kritin Vongthongsri)
- Improve the PydanticAI integration with a new Agent interface that supports passing metric_collection, metrics, and trace fields directly to run/run_sync. Add validation for trace and metric inputs and require OpenTelemetry to enable tracing. (#1978) (Mayank)
- Add an overwrite_metrics option to thread offline evaluations so you can replace existing metric results when re-running evaluations. (#1980) (Kritin Vongthongsri)
- Add new LangGraph, Pydantic AI, and CrewAI cookbooks with “Open in Colab” buttons in the docs, making it easier to run the example notebooks from the integration pages. (#1987) (Mayank)
- Improve OpenTelemetry export by capturing span error status and description from the official status fields instead of custom attributes. Also handle trace metadata as a dict to avoid unnecessary JSON parsing and make metadata export more reliable. (#1990) (Mayank)
- Improve example notebooks by adding black[jupyter] to dev dependencies and reformatting notebook code for more consistent, readable cells. (#2011) (Trevor Wilson)
- Add an Agent wrapper for openai-agents that automatically traces model calls with metrics and an optional Prompt. Improve tracing so span and trace inputs/outputs are captured correctly, and LLM spans record the prompt when provided. (#2012) (Mayank)
- Fix async execution in Conversational DAG nodes by awaiting model generation and metric evaluation calls, preventing missed results during traversal. Add detailed Conversational-DAG documentation with end-to-end examples for building and running multi-turn decision-tree evaluations. (#2014) (Vamshi Adimalla)
- Improve code formatting to satisfy linting and keep tests and DAG modules consistent with Black style. (#2016) (Trevor Wilson)
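A compact sketch of the update_current_span usage referenced earlier in this list; the input/output keyword arguments follow that entry and should be treated as assumptions if your version differs.

```python
# Sketch: explicitly setting a span's input and output inside an @observe-d component.
from deepeval.tracing import observe, update_current_span

@observe()
def retrieve(query: str) -> list[str]:
    chunks = ["Refunds are issued within 30 days of purchase."]  # stub retrieval result
    # Custom values set here are preserved and masked correctly during trace updates.
    update_current_span(input=query, output=chunks)
    return chunks

retrieve("What is the refund policy?")
```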
Bug Fix
v3.6.0
- Fix Info and Caution callouts not rendering correctly in the documentation when using dark mode, improving readability and visual consistency. (#2111) (Sai-Suraj-27)
v3.5.9
- Fix streaming completion handling so the final result is captured reliably and the streamed LLM output is JSON-serializable, preventing errors when consuming streamed responses. (#2097) (Mayank)
v3.5.5
- Fix async evaluations by tracking and gathering only tasks created on the active event loop, preventing coroutine re-await and cross-loop errors. Normalize awaitables via coerce_to_task(), cancel pending tasks when clearing, and properly shut down async generators. Replace blocking sleeps in async tests and stabilize CI workflows. (#2068) (Trevor Wilson)
- Fix NonAdvice metric scoring in strict_mode: enforce a threshold of 1 and return 0 when the computed score falls below that threshold. (#2070) (Sai-Suraj-27)
- Fix mcp_use_metric when multiple MCP servers are configured by correctly including primitives from all servers in the interaction text. (#2076) (Diego Rani Mazine)
- Fix sidebar heading contrast in dark mode so section titles are clearly visible and easier to scan. (#2077) (Sai-Suraj-27)
- Fix deepeval login failing on Python 3.9 by avoiding the unsupported str | ProviderSlug type union syntax, restoring compatibility for supported Python versions. (#2079) (Sai-Suraj-27)
- Fix incorrect argument name when configuring local models by passing model_format to set_local_model_env, preventing misconfiguration in LM Studio and vLLM setup. (#2083) (Sai-Suraj-27)
v3.5.6
- Fix async eval execution to use the current trace when building LLMTestCase, so outputs, expected output, context, and tool expectations are recorded correctly. (#2088) (Kritin Vongthongsri)
- Fix incorrect model imports so faithfulness and answer relevancy scoring load SummaCModels and answer relevancy models from the correct modules instead of failing at runtime. (#2089) (Sai-Suraj-27)
v3.5.3
- Fix pii_leakage metric scoring in strict_mode by enforcing a threshold of 1 and returning 0 when the computed score falls below that threshold. (#2067) (Sai-Suraj-27)
- Fix the getting-started example to use strict_mode instead of strict when creating metrics, preventing confusion and failures with the current API (see the sketch after this list). (#2073) (Sai-Suraj-27)
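A short, hedged illustration of the strict_mode behavior these fixes touch; PIILeakageMetric and the example test case are placeholders, but the keyword is strict_mode rather than strict, and in strict mode any score below the enforced threshold of 1 collapses to 0.

```python
# Hedged sketch: strict_mode turns a continuous score into a binary pass/fail.
from deepeval.metrics import PIILeakageMetric
from deepeval.test_case import LLMTestCase

metric = PIILeakageMetric(strict_mode=True)  # note: strict_mode, not strict
test_case = LLMTestCase(
    input="Summarize the support ticket.",
    actual_output="The customer reported a billing issue last week.",
)
metric.measure(test_case)
print(metric.score)  # 1 if the check passes, otherwise 0
```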
v3.5.2
- Fix a typo in the getting-started chatbots guide so the “metrics” link text is spelled correctly. (#2058) (grant-sobkowski)
- Fix passing test_case_content when generating conversational evaluation prompts so evaluations run correctly instead of failing due to a missing argument. (#2059) (Sai-Suraj-27)
- Fix LocalEmbeddingModel async embedding methods to properly await embedding requests, preventing missed awaits and ensuring async calls return embeddings reliably. (#2061) (Trevor Wilson)
- Fix async prompt polling to work reliably with already-running event loops by reusing a general event loop and scheduling tasks instead of always blocking on run_until_complete. This prevents errors in async environments and keeps polling running in the background. (#2062) (Kritin Vongthongsri)
- Fix duplicate arguments being passed to update_current_trace, preventing conflicting trace updates in online metrics tests. (#2063) (Sai-Suraj-27)
v3.5.0
- Fix AWS Bedrock Converse requests by translating generation_kwargs from snake_case to the required camelCase. Prevents ParamValidationError when using parameters like max_tokens, top_p, top_k, and stop_sequences. (#2017) (Active FigureX)
- Fix tool correctness scoring when no tools are expected. If both expected and called tools lists are empty, the score is now 1.0 instead of 0.0, avoiding false failures in tool-free runs. (#2027) (Kema Uday Kiran)
- Fix a documentation import typo for DeepAcyclicGraph so the Conversational DAG example uses the correct module path. (#2029) (Vamshi Adimalla)
- Fix telemetry tests to reliably start from a clean state by removing any existing .deepeval directory in the temp workspace before assertions, preventing flaky failures when the hidden store already exists. (#2035) (Trevor Wilson)
- Fix tracing JSON serialization by stripping embedded NUL bytes from strings before writing to Postgres. This prevents 22P05 errors when storing text/jsonb payloads that contain \x00. (#2036) (Trevor Wilson)
- Fix Grok-3 Fast output token pricing by using the correct per-1e6 divisor, preventing inflated cost calculations for responses. (#2046) (Trevor Wilson)
- Fix Kimi kimi-k2-0711-preview output cost divisor so output usage is calculated with the correct scale. (#2054) (Trevor Wilson)
v3.5.1
- Fix generate_goldens_from_contexts when using source_files so generated goldens map to the correct source file. This prevents a possible IndexError when max_goldens_per_context exceeds the number of source files (a minimal sketch follows). (#2053) (Evan Livelo)
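A hedged sketch of the call this fix affects; the context strings and file names are placeholders, and only the parameters named in the entry above are shown.

```python
# Hedged sketch: map each generated golden back to its source file.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Solar panels convert sunlight into electricity."],
        ["Wind turbines generate power from moving air."],
    ],
    source_files=["solar.md", "wind.md"],  # one entry per context
    max_goldens_per_context=2,  # exceeding len(source_files) used to raise IndexError
)
print(len(goldens))
```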
v2.4.8
- Fix trace posting to allow a dynamic API key set on each trace, instead of always relying on a global configured key. This prevents traces from being skipped when the per-trace key is provided at runtime. (#1889) (Mayank)
- Fix Conversation Simulator generating the first user turn twice, which could duplicate user messages. First-turn prompts are now only created when starting a new conversation or after an opening message. (#1891) (Kritin Vongthongsri)
- Fix Ollama integration docs to use the correct model parameter when initializing OllamaModel, avoiding confusion and incorrect example code. (#1892) (Phil Nash)
- Fix CLI identifier handling so runs correctly propagate the identifier into evaluation and assertion flows. (#1903) (Kritin Vongthongsri)
- Fix pydantic-ai agent tracing to avoid warnings and span attribute errors by safely handling missing names and non-string inputs/outputs when recording LLM test case data. (#1904) (Mayank)
- Fix OpenTelemetry span metadata handling by reading confident.span.metadata and attaching it to exported spans, instead of dumping the full span JSON. Also reduce noisy console output by swallowing conversion/validation errors during export. (#1910) (Mayank)
- Fix G-Eval score normalization in non-strict mode by scaling to the rubric’s actual score range instead of always dividing by 10. This also aligns normalization behavior between measure and a_measure for consistent results across different rubrics. (#1915) (Bofeng Huang)
- Fix dataset iterator integration tests to use EvaluationDataset.evals_iterator() and load API keys from environment variables, improving reliability and avoiding hardcoded credentials. (#1920) (Mayank)
- Fix OpenTelemetry and PydanticAI instrumentation by setting standard trace attributes (name, tags, thread_id, user_id, metadata, environment) and ensuring tool/expected tool attributes are parsed reliably. This improves span export compatibility and corrects retriever attribute keys. (#1934) (Mayank)
- Fix type checker errors when overriding methods on base model classes by adding the missing return type annotations. This prevents methods from being inferred as returning None and incorrectly triggering type errors in subclasses. (#1936) (trevor-inflection)
- Fix model list definitions to prevent accidental string concatenation that merged entries and broke capability checks for certain model names. This corrects which models are treated as supporting structured outputs or requiring temperature=1. (#1939) (Trevor Wilson)
- Fix conversation simulation to respect max_user_simulations and stop generating extra user turns. Preserve any pre-seeded turns without inserting the opening message, and validate invalid limits with a clear error. (#1943) (Kritin Vongthongsri)
- Fix trace export to handle trace_metadata provided as a dict or JSON string, ensuring metadata is captured correctly. Also update async trace posting to use the API’s returned link field when reporting success. (#1944) (Mayank)
- Fix task completion evaluation for LangChain and LangGraph traces by correctly preparing the metric test case from the root span. This prevents missing or incorrect task extraction and avoids unexpected evaluation cost being recorded. (#1946) (Mayank)
- Fix ToolCorrectnessMetric to avoid division-by-zero when no expected tools are provided. Return 1.0 when both expected and called tools are empty, and 0.0 when only expected tools are empty. (#1947) (Vamshi Adimalla)
- Fix duplicate items when generating synthetic datasets with synthesizer.generate_goldens_from_docs(). Goldens are now added only once in the generation call chain, so each generated item appears exactly once. (#1951) (Jaya)
- Fix set-openai CLI writing cost_per_input_token and cost_per_output_token to the wrong environment keys. This prevents inverted token cost accounting and keeps any downstream cost calculations accurate. (#1952) (Trevor Wilson)
- Fix set-openai so --cost_per_input_token and --cost_per_output_token are optional for known OpenAI models, matching runtime behavior. Improve help text to clarify that costs are only required for custom or unsupported models, reducing redundant flags and misleading errors. (#1953) (Trevor Wilson)
- Fix the Multi-Turn Getting Started code example by importing ConversationalGEval instead of an unused GEval, so the snippet runs correctly as written. (#1954) (Connor Brinton)
- Fix Arena docs example to print results from the correct variable (arena_geval), preventing a NameError and making the snippet runnable as written. (#1960) (Julius Berger)
- Fix duplicated aggregate metric results by computing pass-rate summaries once per evaluation run, and handle empty result sets safely. (#1962) (John Lemmon)
- Fix LangChain callback on_llm_end handling to avoid missing-span and bad metadata issues. Model names and token usage are now extracted safely, and token counts are left unset when unavailable. (#1963) (Mayank)
- Fix Azure OpenAI model calls to forward constructor kwargs (like max_tokens) in both sync and async generation. This ensures the API receives the expected parameters and prevents LengthFinishReasonError. (#1969) (Active FigureX)
- Prevent endless retries in LiteLLMModel by adding a maximum retry limit (default 6) so failures stop instead of looping indefinitely. Add support for LiteLLM proxy environment variables. Move retry settings to class-level variables to simplify future configuration changes. (#1972) (Radosław Hęś)
- Fix ContextualRelevancy evaluation when a retrieval_context item contains no meaningful statements. The metric now handles empty or non-informative context so LLM output can be parsed reliably instead of failing when no JSON is returned. (#1973) (Radosław Hęś)
- Fix progress bar updates during conversation simulator runs, ensuring tasks advance correctly and are removed when finished. Also ensure evaluation state is always cleaned up in a finally block even if an error occurs. (#1974) (Kritin Vongthongsri)
- Fix telemetry to fully respect opt-out by skipping all writes when DEEPEVAL_TELEMETRY_OPT_OUT=YES and returning a telemetry-opted-out sentinel ID. Also ensure the .deepeval directory exists before writing telemetry data, with tests covering directory creation and file writes. (#1976) (Trevor Wilson)
- Fix benchmarks to work with datasets 4.0.0 by removing unsupported trust_remote_code from load_dataset calls. Update MMLU and MathQA to use current Parquet datasets with the required logic adjustments. (#1977) (Vincent Lannurien)
- Fix incorrect imports in the getting-started LLM arena docs example so the sample code runs without import errors. (#1981) (raphaeluzan)
- Fix Synthesizer state tracking by clearing synthetic_goldens on reset and appending newly generated goldens during doc and scratch generation, so results reflect the latest run. Update the introduction docs with required dependencies and a working end-to-end example. (#1984) (Mayank)
- Fix notebook evaluation runs by clearing trace_manager.integration_traces_to_evaluate at the start of each dataset evaluation. This prevents traces from a previous run from leaking into a new run and affecting results. (#1985) (Mayank)
- Fix OpenTelemetry trace status so the overall trace is marked as errored when the root span fails, improving error visibility in exported traces. (#1993) (Mayank)
- Fix trace status reporting so traces are marked as errored when any span fails, and include a status field in the trace API payload for more accurate error visibility. (#1999) (Mayank)
- Fix --confident-api-key so it works again, and make login save the key to .env.local by default unless --save is set. Logout now also removes the saved key from both the JSON keystore and dotenv, and commands no longer write "None" values for optional model settings. (#2015) (Trevor Wilson)
August
August made evaluation and tracing more production-ready with refreshed docs covering component-level evaluation, tracing, and deployment patterns. Tracing gained richer LLM outputs, a v1 OpenTelemetry exporter, better span ordering, and deeper LangChain/LlamaIndex/CrewAI integrations with metrics and metric_collection support. New tutorials included the Medical Chatbot series and improved RAG guides.
New Feature
v3.3.5
- Add a new Medical Chatbot tutorial series to the docs, covering development, evaluation, improvement, and deployment of a multi-turn chatbot. Improve and correct several evaluation docs examples and parameter descriptions for multi-turn test cases and datasets. (#1802) (Vamshi Adimalla)
- Add CLI support to configure Grok, Moonshot, and DeepSeek as the LLM provider for evaluations, including setting the model name, API key, and temperature. You can switch back to the default OpenAI setup with corresponding unset-* commands. (#1807) (Kritin Vongthongsri)
- Add a Medical Chatbot tutorial to the docs and navigation, with updated walkthrough content and links for building, configuring, and evaluating the example app. (#1814) (Vamshi Adimalla)
- Add support for evaluating LangGraph/LangChain traces with metrics via the callback handler. Root spans can now carry metrics and an optional metric_collection, and captured traces can be queued for evaluation instead of being posted immediately. (#1829) (Mayank)
- Add a CrewAI Agent wrapper that registers agents with an optional metric_collection and per-agent metrics, enabling easier evaluation and online tracing during crew runs. (#1833) (Mayank)
- Add a v1 OpenTelemetry span exporter that supports API key setup and trace configuration via env vars or OTel resource attributes. Improve trace handling by preserving provided trace IDs, applying trace metadata, and safely ending and clearing active traces after export. (#1838) (Mayank)
- Add MCP support to conversational test cases by allowing turns to record MCP tool/prompt/resource calls and optional server metadata, with validation of MCP types to catch invalid inputs early. (#1839) (Vamshi Adimalla)
- Add support for setting trace attributes in the LangChain callback handler. You can now pass name, tags, metadata, thread_id, and user_id when creating the callback to populate these fields on the completed trace. (#1862) (Mayank)
- Add an ArgumentCorrectnessMetric to score whether tool call arguments match the user input, with optional reasons and async support. Returns a perfect score when no tool calls are provided (a minimal usage sketch follows this list). (#1866) (Kritin Vongthongsri)
- Add a revamped conversation simulator that generates conversational test cases from ConversationalGolden inputs using a provided model callback, with configurable opening message, concurrency, and async or sync execution. (#1876) (Kritin Vongthongsri)
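A minimal, hedged sketch of the new ArgumentCorrectnessMetric in use; the ToolCall field names shown (name, input_parameters) follow deepeval's usual conventions and may differ slightly by release.

```python
# Hedged sketch: score whether the arguments passed to a tool match the request.
from deepeval.metrics import ArgumentCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Paris tomorrow?",
    actual_output="Expect around 18°C with light clouds.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"city": "Paris", "day": "tomorrow"},
        )
    ],
)

metric = ArgumentCorrectnessMetric(threshold=0.7, include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
```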
Improvement
v3.3.5
- Improve component-level evaluation docs with clearer guidance on when to use it, what tracing means, and how to log in to view traces. Reorganize sections and examples for easier navigation and fewer confusing callouts. (#1782) (Kritin Vongthongsri)
- Improve the Meeting Summarizer tutorial with a new Deployment section covering CI/CD-style continuous evaluation, dataset reuse, and optional tracing setup. Also update tutorial navigation and fix a broken docs anchor link. (#1783) (Vamshi Adimalla)
- Bump the package release metadata and version number for a new release. (#1784) (Jeffrey Ip)
- Improve LLM trace output to match the updated UI by capturing structured AI responses, including role, content, and tool call details instead of only a concatenated string. (#1786) (Mayank)
- Improve the meeting summarizer tutorial with updated walkthrough content, refreshed screenshots, and clearer examples for generating summaries and action items using different models. (#1788) (Vamshi Adimalla)
- Fix typos and formatting across tracing integrations, tests, and documentation for clearer examples and cleaner files. (#1789) (Vamshi Adimalla)
- Improve the RAG QA Agent tutorial and navigation by adding a new tutorial section, updating sidebar links and icons, and refreshing examples to use deepeval test run instead of running pytest directly. (#1793) (Vamshi Adimalla)
- Improve docs and tutorials by switching embedded images to hosted URLs and removing bundled image assets, keeping guides lighter and images consistently available. (#1794) (Vamshi Adimalla)
- Improve SummarizationMetric schema naming and usage to reduce ambiguity and make results clearer. This refactor replaces a generic Verdicts schema with more descriptive Pydantic schemas, improving readability and maintainability. (#1804) (Shabareesh Shetty)
- Improve tutorial introductions by adding Tech Stack cards that show the key tools used in each guide, making it easier to understand the setup at a glance. (#1808) (Vamshi Adimalla)
- Improve tutorials and docs with updated examples and configuration names, plus refreshed navigation and UI tweaks for easier browsing. (#1825) (Vamshi Adimalla)
- Support passing extra **kwargs to underlying LLM clients across providers. This lets you customize client setup (for example timeouts, proxies, or transport settings) without modifying the model wrappers (see the sketch after this list). (#1827) (Kritin Vongthongsri)
- Improve contributor setup instructions by updating the dependency installation command from make install to poetry install. (#1828) (Vamshi Adimalla)
- Add patched LlamaIndex agents that accept metrics and metric_collection, and rework LlamaIndex tracing to start and link traces correctly for workflow/agent runs. (#1836) (Mayank)
- Fix docs metadata and improve tutorial link cards by adding singleTurn tags to several metric pages and updating card layout with icons and objectives for clearer navigation. (#1837) (Jeffrey Ip)
- Improve model CLI config handling by separating stored keys for evaluation LLMs vs embeddings, reducing key collisions when switching providers or running unset-* commands. (#1855) (Kritin Vongthongsri)
- Improve tutorials with clearer section titles, updated wording, and expanded guidance for building and evaluating RAG QA and summarization agents, including a better focus on production eval setup. (#1860) (Vamshi Adimalla)
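As a hedged illustration of the **kwargs pass-through mentioned above, extra keyword arguments given to a model wrapper are forwarded to the underlying provider client. GPTModel and the timeout/max_retries options are examples, not a guaranteed signature.

```python
# Hedged sketch: extra client options are forwarded to the provider client.
from deepeval.models import GPTModel
from deepeval.metrics import AnswerRelevancyMetric

model = GPTModel(
    model="gpt-4.1",
    temperature=0,
    timeout=30,      # example kwargs forwarded to the underlying OpenAI client
    max_retries=2,
)
metric = AnswerRelevancyMetric(model=model)
```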
Bug Fix
v3.3.5
- Fix LLM span cost calculation by honoring cost_per_input_token and cost_per_output_token passed to observe, ensuring traced runs report the correct token costs. (#1787) (Kritin Vongthongsri)
- Fix async OpenAI integration by restoring asyncio.create_task safely after evaluation, preventing leaked monkeypatching across runs and improving stability when running concurrent test cases. (#1790) (Kritin Vongthongsri)
- Fix g_eval to prevent a crash when accumulating evaluation cost if the initial cost is None. This avoids a TypeError during async evaluation and allows scoring to complete normally. (#1796) (高汝貞)
- Fix the docs snippet for ConversationalGEval by renaming the example variable to metric, making it consistent and easier to copy and run. (#1799) (Nimish Bongale)
- Fix the few-shot example used in the Synthesizer constrained evolution template so the sample rewritten input correctly matches the solar power prompt and produces more consistent guidance. (#1800) (Simon M.)
- Prevent mixing single-turn and multi-turn goldens in a dataset by enforcing the dataset mode and raising clear TypeErrors for invalid items. Add add_golden to append goldens after initialization. (#1810) (Vamshi Adimalla)
- Fix conversation eval serialization by using the correct API field aliases for retrievalContext, toolsCalled, and additionalMetadata, and by typing tool calls as ToolCall objects. (#1811) (Kritin Vongthongsri)
- Fix tutorial command examples to run evaluation tests with deepeval test run instead of pytest, and improve YAML snippet formatting for the deployment guide. (#1830) (Vamshi Adimalla)
- Fix AzureOpenAIModel initialization to use the correct model_name argument instead of model, restoring compatibility with Azure OpenAI deployments. This prevents setup failures that made Azure-backed usage unusable in recent releases. (#1832) (StefanMojsilovic)
- Fix LiteLLMModel generate/a_generate to always return (result, cost) when a schema is provided. Prevents unpacking errors in schema-based metrics and restores consistent cost reporting. (#1841) (Dylan Li)
- Fix a type hint in login_with_confident_api_key by using str for the API key parameter, improving type checking and editor autocomplete. (#1847) (John Lemmon)
- Fix LangChain/LangGraph prompt parsing so multi-line messages and recognized roles are grouped correctly, instead of being split line-by-line or misclassified as Human messages. (#1848) (Kritin Vongthongsri)
- Fix LLM tracing to accept and safely serialize non-standard output objects so responses aren’t dropped when capturing spans. (#1849) (Kritin Vongthongsri)
- Fix CLI model configuration to clear previously saved evaluation or embedding settings when switching providers, preventing stale keys from overriding the newly selected model. (#1852) (Kritin Vongthongsri)
- Fix code execution in the HumanEval benchmark by calling exec on compiled code instead of recursively invoking the secure executor, preventing infinite recursion and allowing snippets to run correctly. (#1856) (Vamshi Adimalla)
- Fix missing temperature handling in GptModel generate/a_generate when no schema is provided, so output randomness is consistently user-controlled instead of falling back to the provider default (often 1). (#1857) (Daniel Yakubov)
- Fix crashes in synthesizer workflows by guarding progress updates and handling fewer than 10 goldens when sampling examples. Improve test reliability by adding a pytest.ini config and expanding the test suite so CI runs pytest directly. (#1858) (Kritin Vongthongsri)
- Fix OpenTelemetry trace exporting by ordering spans into parent-child trees and treating missing parents as root spans, preventing failures on incomplete span batches. Update LLM span attribute keys to the confident.llm.* namespace so model, token, and prompt fields are captured correctly. (#1859) (Mayank)
- Fix misuse metric failures by passing the correct misuse_violations parameter to generate_reason in MisuseTemplate. This prevents errors when running measure. (#1863) (Rohit ojha)
- Prevent generating more synthetic inputs than requested by enforcing max_goldens_per_context and truncating any extra results. This keeps dataset sizes predictable and avoids overshooting configured limits. (#1867) (Noah Gil)
- Fix structured output requests in the LiteLLM model by passing the Pydantic schema directly via response_format instead of an unsupported json_schema argument. Prevents TypeError failures when requesting JSON-formatted responses. (#1871) (Rohit ojha)
- Fix conversation relevancy windowing by grouping turns into valid user→assistant interactions and flattening them before verdict generation, preventing invalid or partial turns from skewing results. (#1873) (Vamshi Adimalla)
- Fix an ImportError caused by a circular import between the scorer module and the IFEval benchmark. The Scorer import is now deferred to IFEval initialization so modules load cleanly and IFEval can be imported reliably. (#1875) (Rohit ojha)
- Fix Conversation Simulator turn generation and progress tracking: max_turns is now validated, opening messages count toward the limit, and async vs sync callbacks are handled automatically without raising type errors. Simulated test cases now carry over scenario and metadata fields from the golden inputs. (#1878) (Kritin Vongthongsri)
July
July improved tracing and evaluation across agent frameworks with major upgrades to LangChain/LangGraph, CrewAI, LlamaIndex, and OpenTelemetry span handling. Safety coverage expanded with new metrics for PII leakage, role violations, non-advice, and misuse, plus IFEval benchmark support and better task-completion evaluation. The default model moved from gpt-4o to gpt-4.1 with updated costs and docs.
New Feature
v3.2.6
- Add a LangChain/LangGraph callback handler that captures chain, tool, LLM, and retriever events into tracing spans, and automatically starts and ends a trace for top-level runs. (#1722) (Mayank)
- Add a CrewAI integration to instrument crewai.LLM.call and capture LLM input/output in traces. Raises a clear error if CrewAI is not installed and supports optional API key login before patching. (#1723) (Mayank)
- Add a revised CrewAI tracing integration with an instrumentator() helper that listens to CrewAI events and captures agent and LLM calls as trace spans. Also emit integration telemetry to New Relic in addition to existing PostHog tracking. (#1724) (Mayank)
- Add support for the IFEval benchmark to evaluate instruction-following and format compliance. Includes rule-based verification and more detailed per-instruction reporting in verbose mode. (#1729) (Abhishek Ranjan)
- Add a new dataset() test-run interface that lets you iterate over goldens from a local list or a pulled dataset alias and track the run via test_run tasks, with async execution support. (#1737) (Kritin Vongthongsri)
- Add 10 new safety metrics to detect PII leakage, harmful or illegal instructions, misinformation, graphic content, prompt extraction, role boundary violations, IP issues, manipulation, and risky command execution. Improve template consistency, align parameter names, and add full test coverage for these checks. (#1747) (sid-murali)
- Add new safety metrics: PIILeakageMetric to detect SSNs/emails/addresses, RoleViolationMetric to flag role-breaking output, and NonAdviceMetric to catch financial or medical advice. Require explicit parameters like role and advice types, and switch role violations to a clear yes/no result (a minimal usage sketch follows this list). (#1749) (sid-murali)
- Add CLI support to set/unset the default OpenAI model and per-token pricing used by metrics. GPTModel can now read model name and pricing from saved settings, and will prompt for pricing when using an unknown model. (#1766) (Kritin Vongthongsri)
- Add the Misuse metric to detect when an LLM uses a specialized domain chatbot inappropriately (for example, asking a finance bot to write poetry). This helps keep outputs aligned with domain expertise and prevents scope creep in specialized AI use cases. (#1773) (sid-murali)
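A hedged sketch of the new safety metrics in use; the constructor arguments shown (advice_types, role) mirror the "explicit parameters" described above, but their exact names may differ by release.

```python
# Hedged sketch: run the new safety metrics against a single test case.
from deepeval.metrics import NonAdviceMetric, PIILeakageMetric, RoleViolationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Should I put my savings into crypto?",
    actual_output="I can't give financial advice, but here are some neutral resources.",
)

metrics = [
    PIILeakageMetric(),
    NonAdviceMetric(advice_types=["financial", "medical"]),  # assumed parameter name
    RoleViolationMetric(role="customer support agent"),      # assumed parameter name
]
for metric in metrics:
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)
```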
Improvement
v3.2.6
- Prepare a new release by updating package metadata and internal version information. (#1721) (Jeffrey Ip)
- Add telemetry events that record when tracing integrations are initialized (LangChain, LlamaIndex, and OpenTelemetry exporter), respecting telemetry opt-out settings. (#1725) (Mayank)
- Update the default OpenAI and multimodal GPT model from gpt-4o to gpt-4.1. Cost calculations and documentation examples now also default to gpt-4.1 when a model name is not specified. (#1727) (Kritin Vongthongsri)
- Add an X (Twitter) follow icon to the README and documentation site header for quicker access to the project’s social profile. (#1731) (Kritin Vongthongsri)
- Improve documentation and examples for multi-turn chatbot evaluation, clarifying conversation simulation, CI setup, and metric usage. Fix small wording issues in docs and ensure files end with a trailing newline. (#1732) (Vamshi Adimalla)
- Improve task completion evaluations by supporting span-based tracing. TaskCompletionMetric can now run without an LLMTestCase when it’s the only metric, and it attaches the trace to produce suggested fixes while giving a clearer error for other metrics missing update_current_span() (see the sketch after this list). (#1734) (Kritin Vongthongsri)
- Improve CrewAI tracing by capturing tool usage and memory search as dedicated spans, with inputs/outputs recorded for easier debugging. LLM spans no longer fail when a parent span can’t be found. (#1740) (Mayank)
- Improve LlamaIndex instrumentation by unifying event and span handling, generating stable span UUIDs, and properly starting/ending traces when spans are dropped or completed. This makes LLM and tool spans more consistent and avoids lingering spans in trace output. (#1745) (Mayank)
- Improve OpenAI integration by evaluating captured OpenAI test case/metric pairs when no traces are available, and by recording the latest OpenAI hyperparameters in the test run. Also clear stored OpenAI pairs after a run to avoid leaking state between evaluations. (#1746) (Kritin Vongthongsri)
- Improve LangChain and LangGraph integration with clearer message roles, better tool call/result handling, and cleaner inputs. Fix span naming plus fallback/metadata behavior and make outputs visible in LangChain. Update docs with function descriptions; token usage and cost reporting is still pending. (#1752) (Mayank)
- Fix a typo in the README explanation of expected_output and GEval to make the quickstart guidance clearer. (#1754) (Chetan Shinde)
- Add comprehensive docs for NonAdviceMetric, PIILeakageMetric, and RoleViolationMetric, including usage examples, parameters, and scoring rubrics. Improve consistency by standardizing metric names, schema fields, and clarifying parameter naming for these metrics. (#1755) (sid-murali)
- Improve the tutorials onboarding experience by grouping Getting Started pages in the sidebar and refreshing the Introduction with clearer guidance and a first evaluation walkthrough. (#1759) (Vamshi Adimalla)
- Improve compatibility by loosening the click version restriction so newer click releases can be used, reducing dependency conflicts and avoiding the need to pin an outdated version. (#1760) (lwarsaame)
- Improve the tutorial introduction and setup docs with a clearer getting-started flow, curated tutorial cards, and tightened wording. Add a concrete OPENAI_API_KEY export example and clarify the required test_ filename prefix. (#1761) (Vamshi Adimalla)
- Add a blog sidebar that lists all posts and expand the tutorials sidebar with a new Meeting Summarizer section. Improve tutorials navigation by renaming the tutorial card component to LinkCards and enabling sidebar icons on tutorial routes. (#1767) (Vamshi Adimalla)
- Support passing extra client options to Azure OpenAI model initialization via kwargs. This lets you customize the underlying Azure OpenAI client without modifying the tool’s source code. (#1772) (Aaryan Verma)
- Improve tutorials and docs navigation with refreshed summarization content, clearer headings, and new example visuals. Add optional numbered tutorial link cards and temporarily hide the Meeting Summarizer section from the sidebar. (#1775) (Vamshi Adimalla)
- Improve dependency compatibility by loosening the tenacity version constraint to allow newer releases while keeping a safe supported range. (#1776) (Andy Freeland)
- Improve dataset handling by aligning dataset endpoints, making golden lists optional, and supporting extra conversational metadata like scenario, userDescription, and comments when sending test runs. (#1777) (Jeffrey Ip)
- Improve the TaskCompletionMetric docs with a clearer tracing example, including the correct Golden input format and updated imports for evaluate and ToolCall. This makes it easier to run the sample code without adjustments. (#1779) (Mayank)
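To ground the span-based TaskCompletionMetric changes above, here is a hedged sketch of attaching the metric to an observed component so it can evaluate the captured trace without a hand-built LLMTestCase; the agent body is a placeholder, and scoring normally happens inside an evaluation run rather than on a bare function call.

```python
# Hedged sketch: TaskCompletionMetric attached to an observed agent evaluates
# the trace of the run itself, so no explicit LLMTestCase is required here.
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import observe


@observe(metrics=[TaskCompletionMetric()])
def trip_planner(query: str) -> str:
    # stand-in for tool calls and LLM steps captured as child spans
    return "Here is a 3-day itinerary for Paris."


# typically invoked from an evaluation run (e.g. iterating dataset goldens)
trip_planner("Plan a 3-day trip to Paris")
```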
Bug Fix
v3.2.6
- Fix the quickstart link shown after CLI login so it points to the correct setup page. (#1726) (Kritin Vongthongsri)
- Fix OpenAI Completions examples in the docs to use the current OpenAI() client and chat.completions.create, preventing runtime errors and incorrect response parsing in sample code. (#1728) (Kritin Vongthongsri)
- Fix AnthropicModel.calculate_cost indentation so cost calculation and fallback pricing warning run correctly when pricing is missing. (#1739) (nsking02)
- Fix component-level evaluation serialization by converting test run payloads into JSON-safe data before sending them, preventing failures when metrics or complex objects are included. (#1744) (Kritin Vongthongsri)
- Fix synthetic golden sample generation when context_size is 1 by making the context generator always return a consistent list-of-lists shape. This prevents type mismatches in Golden creation when a document has only one chunk. (#1748) (Nicolas Torres)
- Improve JSON tool-call reliability when using instructor TOOLS mode with custom LLMs by renaming internal Reason schemas so models don’t skip tool calls and return plain content. This prevents exceptions and keeps structured outputs coming from tool_calls as expected. (#1753) (Radosław Hęś)
- Fix EvaluationDataset.evaluate type hints to accept all supported metric base types and explicitly annotate the EvaluationResult return type, avoiding circular import issues. (#1756) (AI)
- Fix an error when calculating OpenAI costs by handling a missing model value and falling back to the default model when none is provided. (#1768) (Kritin Vongthongsri)
- Fix component-level metric data not showing up in test results by extracting and appending trace and span-level metric outputs to the reported results. (#1769) (Mayank)
- Fix syntax errors in the evaluation test case documentation examples so ToolCall snippets parse correctly and can be copied into Python without edits. (#1770) (Dhanesh Gujrathi)
- Fix the Task Completion metric documentation example by using valid sample inputs for destination and days, preventing the snippet from failing when copied and run. (#1778) (Kritin Vongthongsri)
June
June made evaluations and tracing more robust across providers and async workloads with fixes to prevent crashes and broken serialization. Tracing matured with improved OpenAI/OTEL integrations and new hooks for OpenAI Agents and LlamaIndex via trace_manager.configure. Evaluation added native LiteLLM support, MultimodalGEval, arena-style GEval, and jsonl dataset saving.
Backward Incompatible Change
v3.1.5
- Remove the client parameter from observe() and rely on trace_manager.configure(openai_client=...) for LLM spans. LLM tracing now requires either a model in observe or a configured openai_client, otherwise a clear error is raised. (#1667) (Mayank)
v3.0.8
- Improve the packaged API by removing the monitor helpers from top-level imports, leaving only send_feedback and a_send_feedback available via deepeval. (#1673) (Jeffrey Ip)
New Feature
v3.1.9
- Add a LlamaIndex integration entry point via instrument_llama_index to hook into LlamaIndex instrumentation and capture agent runs for monitoring. (#1714) (Mayank)
- Add expanded OpenAI multimodal model support, including newer GPT-4.1 and o-series options. Improve structured output handling by using native parsing when available and falling back to JSON parsing when needed, while tracking log-prob limitations for unsupported models. (#1716) (Kritin Vongthongsri)
- Add arena-style evaluation to GEval by allowing a list of test cases and selecting the best output. Validate that all candidates share the same input and expose best_test_case and best_test_case_index for easier comparisons. (#1717) (Kritin Vongthongsri)
v3.1.5
- Add MultimodalGEval, a GEval-based metric to score multimodal test cases using configurable criteria, rubrics, and evaluation steps. Supports async evaluation and can incorporate inputs like context, retrieval context, and tool calls. Also improve image encoding by converting non-RGB images before JPEG serialization. (#1684) (Kritin Vongthongsri)
- Add OpenAI Agents tracing integration via DeepEvalTracingProcessor, capturing agent, tool, and LLM spans and mapping key metadata like prompts, responses, and token usage into the tracing system. (#1699) (Kritin Vongthongsri)
- Add broader multimodal test case support in the platform API by sending expected output, context, and retrieval context fields. Improve handling of local image inputs by detecting file:// paths, capturing filenames and MIME types, and embedding file data as Base64. (#1704) (Kritin Vongthongsri)
v3.0.8
- Add native LiteLLM model support so you can run evaluations with any LiteLLM-supported provider. Includes sync/async text generation, schema validation, cost tracking, and improved error handling, plus tests and updated docs. (#1670) (Prahlad Sahu)
v3.0.6
- Add support for saving datasets in jsonl format, making it easier to write large datasets without loading everything into memory. This is especially useful for generating and exporting datasets with more than 10k rows (a minimal sketch follows). (#1652) (Yudhiesh Ravindranath)
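A hedged sketch of saving a large dataset in the new jsonl format; save_as with file_type and directory follows deepeval's dataset API, though defaults and accepted values may vary by release.

```python
# Hedged sketch: write a large dataset to disk as JSON Lines.
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(
    goldens=[Golden(input=f"question {i}") for i in range(10_000)]
)
dataset.save_as(file_type="jsonl", directory="./datasets")
```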
Improvement
v3.1.9
- Bump package version metadata for a new release, updating the published version string and release date. (#1710) (Jeffrey Ip)
- Improve the RoleAdherenceMetric documentation by fixing wording, removing a duplicate argument entry, and clarifying how assistant turns are evaluated against chatbot_role using prior context. (#1711) (Vamshi Adimalla)
- Add pricing support for claude-opus-4 and claude-sonnet-4. Raise a clear ValueError when cost pricing is missing for an unknown Anthropic model, preventing silent fallbacks and TypeError crashes. (#1715) (Abhishek Ranjan)
- Add a new blog guide on building and evaluating multi-turn chatbots, covering conversation simulation, metrics for memory and tone, and CI-friendly regression testing. (#1718) (Vamshi Adimalla)
v3.1.5
- Bump the package version metadata for a new release. (#1676) (Jeffrey Ip)
- Improve telemetry for traceable evaluate() runs by tracking them as a separate component evaluation feature. This records the correct feature status and updates the last-used feature accordingly. (#1678) (Kritin Vongthongsri)
- Add a new blog post covering an evaluation-first approach to building and testing RAG apps, including automated test data generation, retriever/generator metrics, and CI test integration. Add a new blog author profile and related images. (#1686) (Vamshi Adimalla)
- Add links in the README to translated versions in multiple languages, making it easier for non-English readers to find localized documentation. (#1687) (neo)
- Improve the RAG evaluation blog guide with updated wording, clearer code examples, and revised diagrams. Rename the article file and slug to better reflect its focus, and simplify CI/CD integration examples for easier copy-paste. (#1694) (Vamshi Adimalla)
v3.0.8
- Prepare a new release by updating the package version metadata and reported __version__. (#1668) (Jeffrey Ip)
v3.0.6
- Prepare the 3.0.0 release by updating package version metadata and release date. (#1631) (Jeffrey Ip)
- Improve multimodal metrics docs by fixing the Answer Relevancy example to use MultimodalAnswerRelevancyMetric, and by aligning output and bulk-evaluation snippets to print score and reason consistently. (#1635) (Jeffrey Ip)
- Improve the faithfulness verdict prompt wording by fixing grammar and removing threatening language, making instructions clearer and more professional for LLM evaluations. (#1636) (Vamshi Adimalla)
- Improve AnswerRelevancy prompt templates to produce valid, parseable JSON more reliably. Clarify when ambiguous fragments count as statements and add clearer examples and end markers to reduce malformed outputs. (#1642) (Aaron McClintock)
- Improve conversation simulation progress output by switching to Rich traceable progress bars and showing per-conversation and per-step progress during scenario setup and turn simulation, in both sync and async modes. (#1649) (Kritin Vongthongsri)
- Improve tracing internals by moving current span/trace state to context variables and reorganizing attribute and type definitions. This makes trace updates more consistent across sync and async execution, and enables centralized OpenAI client patching via the trace manager. (#1651) (Jeffrey Ip)
Bug Fix
v3.1.9
- Fix JSON serialization failures when a dictionary contains non-string keys by converting keys to strings during tracing serialization. (#1712) (Kritin Vongthongsri)
v3.1.5
- Fix import failures on read-only file systems by skipping telemetry-related filesystem setup when DEEPEVAL_TELEMETRY_OPT_OUT is set. This prevents evaluations from failing in restricted environments like serverless runtimes. (#1654) (Leo Kacenjar)
- Fix OpenAI model initialization to pass base_url, enabling proxy or custom endpoint configurations in both sync and async clients. (#1703) (jnchen)
- Fix evaluate so it no longer raises TypeError when a single TestResult is passed. The metric pass rate aggregation now wraps non-list results into a list before processing. (#1705) (Aditya Bharadwaj)
- Fix an IndexError in Synthesizer.generate_goldens_from_docs() by safely handling missing or shorter source_files, preventing crashes when generating goldens from documentation inputs. (#1706) (Aditya Bharadwaj)
v3.0.6
- Fix GSM8K benchmark crashes when a model returns a tuple or other non-standard response. Prediction extraction now handles NumberSchema, tuples, strings, dicts, and .text/.content objects, and avoids unsafe .values() unpacking to prevent AttributeError/TypeError. (#1628) (Muhammad Hussain)
- Fix traceable span evaluation traversal so child spans are always processed and recorded, even when a parent span has no metrics or test case. This prevents missing spans in trace output and avoids incomplete evaluations. (#1632) (Kritin Vongthongsri)
- Fix TruthfulQA evaluation with AnthropicModel by handling JSON parsing failures and falling back to text-based prompting when structured output isn’t supported. This prevents crashes from uncaught errors and improves robustness across models. (#1638) (Pradyun Magal)
- Fix the OpenAI tracing integration so LLM span attributes are applied correctly and tracing data is recorded as expected. (#1639) (Kritin Vongthongsri)
- Fix async golden generation to call a_embed_text instead of the blocking embed_text when building contexts. This prevents event-loop blocking, improves parallel performance, and avoids runtime errors like asyncio.run() being called from a running loop. (#1641) (Andreas Gabrielsson)
- Fix OTEL exporter crashes when span or event attributes are missing by handling None values and returning empty objects or None instead of raising type conversion errors. (#1646) (Mayank)
- Fix expected_output serialization for span test cases by correcting the expectedOutput field alias so optional expected outputs are sent and parsed correctly. (#1650) (Kritin Vongthongsri)
- Fix the traceable evaluation progress bar so it updates correctly during runs, including async execution, by using the proper progress bar ID. (#1655) (Kritin Vongthongsri)
- Fix trace posting when a Confident AI API key is provided directly, so traces are no longer skipped due to the environment not being detected as Confident. (#1656) (Kritin Vongthongsri)
- Fix a typo in the conversation simulator docs so the user_intentions example is valid Python and can be copied and run without errors. (#1664) (Eduardo Arndt)
- Fix a circular import in the tracing API by importing current_trace_context from the context module, preventing import-time errors when using tracing. (#1665) (Mayank)
May
May made evaluations and tracing more robust and configurable. LLM wrappers gained configurable temperature, new providers including Amazon Bedrock, and PEP 561 support for static analysis. Tracing improved with cleaner defaults, richer metadata, optional sampling/masking, and better OpenTelemetry interoperability while respecting opt-out more consistently.
Backward Incompatible Change
v2.8.5
- Rename the tracing callback parameter from traceable_callback to observed_callback in evaluate() and assert_test() when running agentic golden tests, improving naming consistency for traced runs. (#1561) (Jeffrey Ip)
v2.8.4
- Remove the LangChain dependency so installs are lighter and avoid importing LangChain modules. Update conversational GEval to use OpenAI ChatCompletion responses directly when parsing content and logprobs. (#1544) (Kritin Vongthongsri)
New Feature
v3.0
- Add utility functions to write evaluation logs to a file, making it easier to track results when running large batches without a web app. This also helps spot missing results caused by connection errors. (#1601) (Daehui Kim)
- Add an OpenTelemetry span exporter that detects gen_ai operations and converts spans into LLM, tool, agent, and retriever traces with inputs, outputs, token usage, and cost metadata for export. (#1603) (Mayank)
- Add optional thread_id to traces and support sending it as threadId in the tracing API. This lets you associate a trace with a specific conversation thread when updating the current trace. (#1604) (Kritin Vongthongsri)
- Add support for setting a trace userId so you can associate traces with a specific end user when updating and exporting trace data. (#1605) (Kritin Vongthongsri)
- Add input and output fields to trace data so you can record request payloads and final results at the trace level, including via update_current_trace (see the sketch after this list). (#1606) (Kritin Vongthongsri)
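The trace-level fields above can be set from inside an observed function; a hedged sketch follows, with the thread and user IDs as placeholders and the snake_case keyword names assumed from the tracing API's usual style.

```python
# Hedged sketch: record trace-level input/output plus thread and user IDs.
from deepeval.tracing import observe, update_current_trace


@observe()
def chat(user_input: str) -> str:
    answer = "Sure, here's what I found."  # stand-in for your model call
    update_current_trace(
        input=user_input,
        output=answer,
        thread_id="conversation-123",
        user_id="user-42",
    )
    return answer
```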
v2.9.0
- Add support for tracing LlmAttributes on the OpenAI client by patching it into Observer, so @observe(type="llm", client=...) captures LLM call attributes automatically. (#1560) (Mayank)
- Add AmazonBedrockModel to run LLM-based evaluations using Amazon Bedrock, with async and sync generation plus optional Pydantic schema parsing. Includes usage docs and recognizes Bedrock models as native for metric execution. (#1570) (Kritin Vongthongsri)
- Add support for setting per-span metadata via update_current_span, and include it when exporting spans to the tracing API. (#1575) (Kritin Vongthongsri)
- Add trace-level tags and metadata, plus an optional environment label for better trace filtering and context. Support masking trace inputs/outputs via a configurable mask function. Allow sampling with CONFIDENT_SAMPLE_RATE to skip posting a portion of traces. (#1578) (Kritin Vongthongsri)
v2.9.1
- Add a more flexible conversation simulator: generate a configurable number of conversations per intent, accept either user_profile_items or predefined user_profiles, and optionally stop early using a stopping_criteria. Progress tracking now reflects the total conversations generated across intents. (#1584) (Kritin Vongthongsri)
v2.8.5
- Add a get_actual_model_name() helper to extract the underlying model ID from provider-prefixed strings like openai/gpt-4.1-mini, as used by proxies such as LiteLLM. This makes it easier to work with provider/model formats consistently. (#1555) (Serghei Iakovlev)
v2.8.4
- Add support for gpt-4.1 in structured output mode by including it in the list of supported models. This lets you use gpt-4.1 where structured outputs are required without extra configuration. (#1547) (Serghei Iakovlev)
Improvement
v3.0
- Support passing through unknown command-line options from deepeval test run to pytest, so third-party and custom pytest plugins can receive their flags without the CLI rejecting them. (#1589) (Matt Barr)
- Improve telemetry and tracing reliability by propagating an internal _in_component flag through metric evaluation and wrapping trace flush sends with capture logic, reducing noisy progress output and ensuring in-flight tasks are cleaned up more safely. (#1596) (Kritin Vongthongsri)
- Bump package version to 2.9.1 for the latest release. (#1600) (Jeffrey Ip)
- Add support for saving expected_output when exporting datasets, so expected results are preserved alongside inputs and other golden fields. (#1602) (Nail Khusainov)
- Add default trace input/output capture when they are not explicitly set, using the observed function’s kwargs and result. This ensures traces include basic I/O data without requiring manual update_current_trace calls. (#1620) (Kritin Vongthongsri)
- Remove the SIGINT/SIGTERM signal handler from tracing so the tool no longer overrides your process signal handling during shutdown. (#1621) (Mayank)
- Improve assert_test AssertionError messages by including the failure reason in the thrown metrics string. This makes it easier to understand failures when logging exceptions, abstracting tests, or running under pytest. (#1623) (Orel Lazri)
v2.9.0
- Update package metadata and internal version to 2.8.5 for the new release. (#1567) (Jeffrey Ip)
- Improve tracing span updates by consolidating update_current_span_test_case and update_current_span_attributes into a single update_current_span API. This makes it easier to attach both span attributes and an LLMTestCase, and updates docs and error messages to match the new call pattern. (#1574) (Kritin Vongthongsri)
- Add the PEP 561 py.typed marker so type checkers like mypy can analyze installed package imports without reporting missing stubs or import-untyped errors. (#1592) (Sigurd Spieckermann)
v2.9.1
- Bump the package release to 2.9.0 and update version metadata across the project. (#1597) (Jeffrey Ip)
v2.8.5
- Update package metadata and internal __version__ to reflect the latest release. (#1558) (Jeffrey Ip)
- Prevent trace status logs from printing during evaluations unless CONFIDENT_TRACE_VERBOSE explicitly enables them, reducing noisy console output while running eval traces. (#1565) (Kritin Vongthongsri)
v2.8.4
- Improve type safety and simplify golden/context generation by removing legacy _nodes paths. Add a ChromaDB availability check and clearer error messages to fail fast when optional dependencies are missing. (#1534) (Rami Pellumbi)
- Add configurable temperature to supported LLM model wrappers (including Anthropic, Azure OpenAI, and Gemini) and pass it through on generation calls. Prevent invalid settings by rejecting negative temperatures with a clear error. (#1541) (Kritin Vongthongsri)
- Improve type hints in the MMLU benchmark by making tasks optional and simplifying prompt variable typing for better static analysis and editor support. (#1550) (Serghei Iakovlev)
- Fix typos across benchmark prompts, comments, and tests to improve wording clarity and reduce confusion when reading task names and evaluation steps. (#1552) (João Matias)
- Move telemetry, cache, temp test-run data, and key storage into a .deepeval/ folder to reduce clutter in the project root. Automatically migrates legacy files to the new location when found. (#1556) (Kritin Vongthongsri)
- Improve tracing logs with clearer success/failure messages, a queue-size status, and an exit warning when traces are still pending. Add optional flushing on shutdown via CONFIDENT_TRACE_FLUSH, and control log verbosity with CONFIDENT_TRACE_VERBOSE. (#1557) (Kritin Vongthongsri)
Bug Fix
v3.0
- Fix TaskNodeOutput response format types so list and dict outputs are fully specified and accepted by OpenAI. This prevents confusing bad request errors that only appeared when the model tried to emit those previously invalid shapes. (#1599) (Matt Barr)
- Restrict typer and click dependency versions to improve compatibility and prevent install issues with newer releases. (#1607) (Vamshi Adimalla)
- Fix ToolCorrectnessMetric input parameter comparison so identical dictionaries are treated as a full match, improving scoring consistency when tool inputs are the same. (#1608) (Nathan-Kr)
- Fix temp directory cleanup on Windows by adding a safer rmtree with retries and forced garbage collection to reduce failures from locked files. Also register an exit cleanup hook to help release resources before deletion. (#1609) (Propet40)
- Fix telemetry opt-out so no analytics events or traces are captured when opt-out is enabled across evaluation, metrics, dataset pulls, and trace sending. (#1614) (Kritin Vongthongsri)
- Fix a ZeroDivisionError when running the HellaSwag benchmark with no predictions for a task by returning an accuracy of 0 instead of dividing by zero. (#1616) (Mikhail Salnikov)
- Fix a ValueError when running the TruthfulQA benchmark by including the expected output in each recorded prediction row, keeping result data aligned during evaluation. (#1619) (Mikhail Salnikov)
- Fix ToolCall.__hash__ to support unhashable input/output values like lists and nested dicts. Hashing now converts complex nested structures into stable hashable forms, preventing TypeError during comparisons and test runs. (#1625) (Muhammad Hussain)
- Fix a FileNotFoundError in telemetry by using a consistent temp run data filename when moving it into the .deepeval directory. This prevents failures caused by a mismatch between dotted and non-dotted filenames. (#1630) (Jakub Koněrza)
v2.9.0
- Fix Azure OpenAI initialization to use the correct deployment_name when setting azure_deployment, preventing misconfigured clients and failed requests. (#1571) (Kritin Vongthongsri)
- Fix Amazon Bedrock model imports to avoid unnecessary dependencies being loaded when using the Bedrock LLM integration. (#1573) (Kritin Vongthongsri)
- Fix a typo in the MMLU benchmark that could cause an assertion failure when validating the example dataset, so load_benchmark and prediction run as expected. (#1580) (Tri Dao)
- Fix broken integration documentation links for LlamaIndex and Hugging Face so the README points to the correct pages. (#1582) (Wey Gu)
- Fix client patching during tracing context setup by skipping type checks when the client is None, preventing errors when no client is configured. (#1585) (Mayank)
- Fix a syntax error in the synthesizer generate-from-scratch documentation example by adding a missing trailing comma in StylingConfig, making the snippet copy-pasteable. (#1587) (Shun Liang)
- Fix OllamaModel.a_generate() to use the model name set in the constructor. This keeps async generation consistent with OllamaModel.generate() and prevents using the wrong Ollama model. (#1594) (Sigurd Spieckermann)
v2.8.5
- Fix trace queue handling so queued and in-flight traces are posted more reliably on exit or interruption. Add SIGINT/SIGTERM handling and improve warnings to report remaining traces and support optional flushing via CONFIDENT_TRACE_FLUSH. (#1559) (Kritin Vongthongsri)
- Fix the exit warning to only appear when there are pending traces to post. This prevents misleading warnings when the trace queue and in-flight tasks are empty. (#1566) (Kritin Vongthongsri)
v2.8.4
- Fix MMLU evaluation when model.generate() returns a tuple or list by extracting the first result before reading .answer. This prevents AttributeError/TypeError and improves compatibility across different model implementations. (#1546) (krishna0125)
April
April made evaluations more traceable and easier to configure. Native model support expanded with Gemini and Anthropic, plus improved Azure OpenAI and Ollama setup. New metadata fields (token_cost, completion_time, additional_metadata) and tracing upgrades made multi-turn test generation and debugging smoother, while robustness fixes reduced import failures and crashes.
Backward Incompatible Change
v2.7.6
- Remove async from get_model_name on the base embedding model interface, making model name retrieval a synchronous call for simpler implementations and call sites. (#1516) (Rami Pellumbi)
v2.7.3
- Remove the auto_evaluate helper from the public API to streamline the tracing-focused surface area and reduce unused functionality. (#1513) (Jeffrey Ip)
New Feature
v2.7.7
- Add traceable eval runs so agent/tool/LLM steps can be captured and attached to each test case during evaluation. This improves debugging and makes it easier to understand how outputs were produced, including when running evals over pulled datasets. (#1523) (Kritin Vongthongsri)
- Add support for named goldens and allow assert_test to run traceable evals using a Golden plus callback, in both sync and async modes. Improve input validation for assert_test to prevent invalid argument combinations. (#1532) (Kritin Vongthongsri)
v2.7.6
- Add min_context_length and min_contexts_per_document to Synthesizer document context generation, so you can enforce a minimum context size and minimum number of contexts per document while still capping with the existing max settings. (#1508) (Kritin Vongthongsri)
v2.7.3
- Add generate_goldens_from_goldens to expand an existing set of Goldens into new ones, reusing available contexts for grounded generation or falling back to scratch generation when context is missing. Optionally generates expected outputs and can infer prompt styling from the provided examples (a minimal sketch follows). (#1506) (Kritin Vongthongsri)
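A hedged sketch of expanding an existing golden set with generate_goldens_from_goldens; beyond the method name given above, the keyword arguments shown are assumptions and may differ in your version.

```python
# Hedged sketch: expand a small seed set of goldens into a larger one.
from deepeval.dataset import Golden
from deepeval.synthesizer import Synthesizer

seed_goldens = [
    Golden(input="How do I reset my password?"),
    Golden(input="What payment methods do you accept?"),
]

synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=seed_goldens,         # assumed keyword
    max_goldens_per_golden=2,     # assumed keyword
)
print(len(new_goldens))
```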
v2.6.8
- Add native Gemini model support, including multimodal judging and structured outputs. Configure it via set-gemini using either a Google API key or Vertex AI project/location, and disable it with unset-gemini to revert to the default provider. (#1493) (Kritin Vongthongsri)
- Add support for running evaluations with Anthropic Claude models via a new AnthropicModel, including sync/async generation and token cost tracking. (#1495) (Kritin Vongthongsri)
v2.6.6
- Add a conversation simulator that generates multi-turn conversational test cases from user profile items and intentions, with optional opening messages. Supports async concurrency and tracks simulation cost when using native models. (#1481) (Jeffrey Ip)
Improvement
v2.7.7
- Prepare a new release by updating the package version metadata. (#1525) (Jeffrey Ip)
- Allow LLM and retriever spans to be recorded without calling `update_current_span_attributes`. Missing attributes no longer raise errors, and span conversion skips optional fields when they aren’t provided. Improve error handling for non-JSON API responses. (#1530) (Kritin Vongthongsri)
- Improve how `LLMTestCase` is converted to a string for G-Eval prompts by centralizing the formatting and ensuring tool-call values are rendered consistently via `repr()`. (#1531) (João Matias)
v2.7.9
- Improve documentation by clarifying CLI usage (`deepeval test run`), updating command examples to `bash`, and fixing links to the correct evaluation guide sections. (#1537) (Jeffrey Ip)
- Prepare a new package release by bumping the project version metadata. (#1539) (Jeffrey Ip)
- Bump the package version to 2.7.8 for the latest release metadata. (#1540) (Jeffrey Ip)
v2.7.6
- Add a new documentation article showcasing popular G-Eval metric examples, with sample code and guidance for defining custom LLM-judge criteria and RAG-focused evaluations. (#1517) (Kritin Vongthongsri)
- Improve the G-Eval documentation with research context, clearer RAG evaluation criteria, and a new advanced section explaining limitations and when to use DAG-based metrics, including an end-to-end example. (#1519) (Kritin Vongthongsri)
- Fix typos and improve wording in synthesizer prompt templates to make instructions clearer and reduce confusion in generated outputs. (#1521) (Song Luar)
- Improve import-time dependency resolution by deferring optional integration imports, reducing startup failures when LangChain or LlamaIndex aren’t installed. Change update checks to be opt-in via `DEEPEVAL_UPDATE_WARNING_OPT_IN`. (#1524) (Jeffrey Ip)
v2.7.3
- Fix a typo in the QA agent metrics tutorial by correcting “weather” to “whether” in the `Faithfulness` description, improving documentation clarity. (#1505) (Justin Nauman)
- Fix typos in the benchmarks introduction docs to use the correct `prompts` variable name and improve wording for clarity. (#1511) (Russell-Day)
v2.6.8
- Add retention analytics by sending PostHog events for evaluation runs and synthesizer invocations when telemetry is enabled, improving visibility into feature usage over time. (#1486) (Kritin Vongthongsri)
- Add log-probability support for Azure OpenAI in `GEval`, including Azure models in log-probability compatibility checks and enabling raw response generation with cost tracking via the LangChain client. (#1492) (Kritin Vongthongsri)
- Add `google-genai` and `posthog` as dependencies and refresh the lockfile to pull in required transitive packages. (#1499) (Kritin Vongthongsri)
v2.6.6
- Add a new comparison blog post and author profile to the documentation, expanding the site’s blog content and attribution. (#1471) (Kritin Vongthongsri)
- Improve Ollama embedding configuration by using the same underlying `ollama` module as the chat model. This aligns `base_url` handling so embeddings and chat can share the same Ollama host without requiring different `/v1` URL variants, reducing setup confusion. (#1474) (Paul Lewis)
- Add a new documentation blog post comparing the tool with Langfuse, and update existing comparison content for clearer messaging about provider integration and metric support. (#1475) (Kritin Vongthongsri)
- Add `token_cost` and `completion_time` fields to LLM and multimodal test cases, and include them in the API test case payload as `tokenCost` and `completionTime`. (#1476) (Kritin Vongthongsri)
- Add `additional_metadata` to test results so extra per-test details are preserved and returned for conversational, multimodal, and standard evaluations. (#1477) (Mayank)
- Improve the conversation simulator API by moving `model_callback`, turn limits, and conversation count into `simulate()` and adding clearer progress reporting during generation for both sync and async runs (see the sketch below). (#1491) (Jeffrey Ip)
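A rough sketch of the reworked simulator call shape described above; the module path, constructor arguments, callback signature, and keyword names are all assumptions based on these entries:

```python
from deepeval.conversation_simulator import ConversationSimulator


def model_callback(input: str, conversation_history) -> str:
    # Replace with a call into your chatbot; conversation_history carries prior turns.
    return f"(stub reply to: {input})"


# Hypothetical constructor: user intentions drive the simulated conversations.
simulator = ConversationSimulator(
    user_intentions={"open a new savings account": 1},
)

# Per the entry above, the callback, turn limit, and conversation count now live on simulate().
conversational_test_cases = simulator.simulate(
    model_callback=model_callback,
    max_turns=5,
)
```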
Bug Fix
v2.7.7
- Fix invalid enum errors in tracing by aligning span status values to use `ERRORED` instead of `ERROR`, so failed spans serialize and report correctly. (#1536) (Mayank)
- Fix agentic `assert_test` runs so they no longer always disable saving results. Test runs now respect the `save_to_disk` setting and correctly reuse or create the current test run by identifier. (#1538) (Kritin Vongthongsri)
v2.7.6
- Fix `FiltrationConfig.synthetic_input_quality_threshold` to use a `float` instead of an `int`, matching its default value and preventing type-related configuration errors. (#1515) (Rami Pellumbi)
- Fix the Bias metric docs example to import `evaluate` from `deepeval`, so the sample code runs as written. (#1520) (snsk)
v2.7.3
- Fix Gemini model wrappers to stop hardcoding an allowlist of model names. You can now pass newer or custom Gemini model IDs without getting an unnecessary "Invalid model" error. (#1503) (Mete Atamel)
- Fix Anthropic model initialization and async generation by treating `AnthropicModel` as a native provider and loading the client in async mode, preventing failures when calling `a_generate`. (#1504) (Kritin Vongthongsri)
v2.6.8
- Fix synthetic dataset generation from documents failing with `UnicodeDecodeError` on non-UTF-8 text. Default to auto-detecting file encoding instead of Windows defaults, and allow manually setting an encoding for edge cases. (#1485) (Aahil Shaikh)
- Fix type hints for `context_quality_threshold` and `context_similarity_threshold` to use `float`, matching their default values and preventing misleading type checking. (#1490) (Jakub Koněrza)
v2.6.6
- Fix Azure OpenAI setup by separating `openai_model_name` from the deployment name and using the deployment name when creating the client. The CLI now prompts for `--openai-model-name` and stores/clears it alongside other Azure settings. (#1480) (Kritin Vongthongsri)
- Fix the QA agent evaluation tutorial to import `EvaluationDataset` from `deepeval.dataset`, matching the current package structure and preventing import errors when following the docs. (#1483) (Anton)
- Fix the ToolCorrectness metric crashing with an unhashable-type error when a tool call output is a list and expected tools are provided without a guaranteed order. This lets tool-correctness evaluation run reliably for list outputs. (#1487) (Sai Pavan Kumar)
March
March made evaluations and synthesis more reliable. Defaults improved for Ollama and Azure OpenAI, broader model support landed (including gpt-4.5-preview), and structured outputs became more consistent. Large runs gained resilience with expanded retry handling for transient failures, plus fixes for async scoring, G-Eval strict mode, and benchmark parsing.
New Feature
v2.6.5
- Add support for the `gpt-4.5-preview-2025-02-27` model, including pricing metadata and compatibility flags for features like structured outputs and JSON mode. (#1453) (John Lemmon)
- Add `file_name` and `quiet` options to `Synthesizer.save_as()` so you can control the output filename and suppress console output (see the sketch below). Improve validation for file types and synthetic goldens, with updated docs and tests. (#1455) (Serghei Iakovlev)
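A small sketch of saving synthetic goldens with the new options; `file_name` and `quiet` come from the entry above, while the generation call and the `file_type`/`directory` arguments are assumptions about the existing API:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Hypothetical generation call to have something to save.
synthesizer.generate_goldens_from_scratch(num_goldens=5)

synthesizer.save_as(
    file_type="json",            # assumed existing parameter
    directory="./synthetic_data",  # assumed existing parameter
    file_name="goldens",         # new: controls the output filename
    quiet=True,                  # new: suppresses console output
)
```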
v2.5.9
- Support additional native model providers when initializing metrics and evaluators, including Azure OpenAI, Ollama, and local models. Model selection can now be driven by configuration without changing code. (#1441) (Kritin Vongthongsri)
v2.5.8
- Add optional `cost_tracking` to Synthesizer to enable full API cost tracking, disabled by default. When enabled, generation runs report detailed cost information alongside the output. (#1406) (Chuqing Gao)
Improvement
v2.6.5
- Update package metadata for a new release, including the published version and release date. (#1446) (Jeffrey Ip)
- Improve resilience of large runs by retrying on additional OpenAI connection-related exceptions, not just rate limits. This reduces failures from transient network issues during long parallel evaluations. (#1450) (John Lemmon)
- Improve reliability of uploads to Confident AI by adding retries on transient HTTPS/SSL failures, especially for large batch test runs, so evaluations are more likely to complete successfully. (#1452) (John Lemmon)
v2.5.9
- Update package metadata to the latest release version for more accurate reporting in builds and tooling. (#1445) (Jeffrey Ip)
v2.5.8
- Bump package metadata to the latest release version. (#1399) (Jeffrey Ip)
- Improve Ollama model configuration by defaulting the base URL to `http://localhost:11434` and removing the response format option from `set-ollama`. This reduces mismatches with Ollama endpoints and keeps CLI setup focused on LLM configuration. (#1401) (Kritin Vongthongsri)
- Improve documentation for JSON correctness metrics by showing how to validate `actual_output` that is a list of JSON objects using a Pydantic `RootModel` list schema. (#1403) (Kritin Vongthongsri)
- Update the Task Completion metric docs to use `gpt-4o` instead of `gpt-4` in the example configuration. (#1415) (Obada Khalili)
- Fix a typo in the RAG evaluation guide example input, changing “gow” to “how” for clearer documentation. (#1431) (Vamshi Adimalla)
- Improve `prettify_list()` JSON formatting by enabling `ensure_ascii`, making output consistently ASCII-escaped for non-ASCII characters and easier to paste into logs and terminals. (#1437) (Vamshi Adimalla)
- Improve benchmark imports by loading `datasets` only when needed, reducing import-time failures for users who don’t use those benchmarks. Update packaging metadata to broaden the supported Python range and remove the legacy `setup.py`. (#1440) (Jeffrey Ip)
Bug Fix
v2.6.5
- Fix infinite verbose output in notebooks by only constructing verbose logs when verbose mode is enabled, and by writing logs via `sys.stdout` with an explicit flush. (#1444) (fetz236)
- Fix a typo in the tracing example prompt so the sample question reads correctly when you run the demo. (#1448) (Mert Doğruca)
- Fix Azure OpenAI initialization to always use the configured deployment name from settings, ensuring the correct `azure_deployment` is passed to sync and async clients. Improve the docs for `set-azure-openai` with clearer endpoint examples and a minimum required API version note. (#1451) (Kritin Vongthongsri)
- Fix incorrect metadata propagation in conversational test cases so each turn keeps its own `additional_metadata` and `comments` instead of inheriting the parent test case values. (#1456) (Xiaopei)
- Fix synthesizer compatibility with Azure OpenAI by handling `generate()` responses that return plain strings or `(result, cost)` tuples, preventing tuple attribute errors when extracting synthetic data. (#1459) (Nicolas Torres)
- Fix `set-ollama --base-url` so Ollama requests use the configured base URL from `.deepeval` instead of falling back to the default localhost setting. (#1460) (Paul Lewis)
- Fix native model handling in the synthesizer and multimodal metrics by using structured outputs when a schema is provided, returning typed results instead of parsing JSON strings. Add CLI commands to set and unset Ollama embeddings, and use the configured embedding initializer instead of a hardcoded OpenAI embedder. (#1461) (Kritin Vongthongsri)
- Fix the red-teaming guide example so the `chat.completions.create` call uses the correct `messages` argument and returns the message content, making the snippet runnable as written. (#1463) (Karthick Nagarajan)
- Fix async `measure` to return `self.score` when `async_mode=True`, instead of returning `None`. Async and sync metric execution now produce a consistent, non-empty score value. (#1464) (Roman Makeev)
v2.5.8
- Fix Ragas metrics failing with an “async_mode is missing” error by explicitly running metric tracking in non-async mode during evaluation. (#1402) (Tanay Agrawal)
- Fix the import path for `LLMTestCaseParams` in the metrics selection tutorial so the example code runs without import errors. (#1407) (Obada Khalili)
- Fix a typo in the synthetic input generation template to clarify instructions about avoiding repetitive `input`. (#1408) (John D. McDonald)
- Fix tool correctness reason messages so the `expected` and `called` tool names are reported in the right order when using exact match checks. (#1409) (Casey Lewiston)
- Fix the dataset synthesis tutorial to use the correct `StylingConfig` keyword argument, replacing `expected_output` with `expected_output_format` so the example code runs as intended. (#1411) (Obada Khalili)
- Fix a typo in `__all__` by restoring a missing comma so `auto_evaluate` and `assert_test` are exported correctly from the package. (#1412) (88roy88)
- Fix benchmark prediction generation to fall back more reliably by also handling `AttributeError` when extracting the model answer. (#1414) (Stan Kirdey)
- Fix G-Eval strict mode to use a dedicated prompt and return a binary score (0/1) with an explicit reason, instead of scaling scores and post-adjusting them against the threshold (see the sketch after this list). (#1416) (Kritin Vongthongsri)
- Fix SQuAD benchmark answer parsing by using `StringSchema` for enforced model generation instead of a multiple-choice schema, improving compatibility with model outputs. (#1423) (Diogo Carvalho)
- Fix the documented Azure OpenAI embedding setup command by correcting the flag name to `--embedding-deployment-name`, so the example works as shown. (#1424) (Amali Matharaarachchi)
- Prevent G-Eval from requesting log probabilities on unsupported GPT models (such as `o1` and `o3-mini`). This avoids errors when generating raw responses and lets evaluations run normally by falling back when logprobs aren’t available. (#1425) (Kritin Vongthongsri)
- Fix `login_with_confident_api_key()` to reject missing API keys by raising a clear `ValueError`, preventing confusing behavior when the key is empty or not provided. (#1427) (Vamshi Adimalla)
- Fix the LLM monitoring docs example to use the correct variable name for the monitored response, so the async `a_monitor` call matches the returned output. (#1432) (Lucas Le Ray)
- Fix document-based golden generation to rebuild the vector index each run instead of reusing cached state, avoiding stale chunks in repeated notebook executions. Add validation to prevent `chunk_overlap` from exceeding `chunk_size - 1`, and relax the `chromadb` install requirement to any compatible version. (#1433) (Kritin Vongthongsri)
- Fix the DAG non-binary verdict prompt to require a consistent JSON response with `verdict` and `reason`, including an example format. This reduces malformed outputs and makes results easier to parse reliably. (#1434) (Hani Cierlak)
- Fix synthesizer chunking with ChromaDB by handling missing collections more robustly, avoiding failures when the collection error type differs across versions. (#1442) (Kritin Vongthongsri)
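To see the strict-mode behavior described above, here is a small sketch of a binary G-Eval metric; the criteria text and test data are illustrative only:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    strict_mode=True,  # per the fix above: returns a binary 0/1 score with an explicit reason
)

test_case = LLMTestCase(
    input="Who wrote 'Dune'?",
    actual_output="Frank Herbert wrote 'Dune'.",
    expected_output="Frank Herbert",
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```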
February
February improved evaluation reliability and expanded customization. Fixes landed for batching detection, async auto_evaluate, custom LLM validation, and concurrent evaluation stability. Metrics gained injectable templates including FaithfulnessTemplate, improved DAG reasoning with include_reason, and MultimodalToolCorrectnessMetric, plus conversational metadata and Prompt hyperparameters.
New Feature
v2.4.6
- Add `MultimodalToolCorrectnessMetric` to score whether an MLLM called the expected tools correctly. Evaluation can check tool name, input parameters, and outputs, with optional exact-match and ordering rules. Results now include expected and called tool data in API test cases. (#1386) (Umut Hope YILDIRIM)
- Support passing `Prompt` objects as hyperparameters in test runs and monitoring, preserving prompt version metadata when available. Improve prompt pulling and validation so prompts can be created from an alias or a manually provided template. (#1387) (Jeffrey Ip)
v2.3.9
- Add `deepeval recommend metrics`, an interactive CLI flow that asks a few yes/no questions and returns recommended evaluation metrics for your use case. (#1342) (Kritin Vongthongsri)
- Add support for passing `additional_metadata` on conversational test cases, and include it in the generated API payload as `additionalMetadata`. This preserves extra context when creating and evaluating test runs. (#1352) (Kritin Vongthongsri)
- Add CLI support for running LLM-based evaluations with local Ollama models via `set-ollama` and `unset-ollama`, including configurable base URL and response format. Documentation was updated with setup and usage guidance. (#1360) (Kritin Vongthongsri)
- Add support for injecting a custom `FaithfulnessTemplate` into `FaithfulnessMetric` for dynamic prompt generation. This lets you plug in domain-specific or few-shot templates without overriding claim generation methods (see the sketch below). (#1367) (Lei WANG)
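A rough sketch of injecting a domain-specific template; the `FaithfulnessTemplate` import path, the `generate_claims` method signature, and the `evaluation_template` keyword are assumptions and may differ from the released API:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics.faithfulness import FaithfulnessTemplate  # assumed import path


class DomainFaithfulnessTemplate(FaithfulnessTemplate):
    @staticmethod
    def generate_claims(actual_output: str) -> str:
        # Domain-specific prompt sent to the judge model when extracting claims.
        return (
            "You are reviewing clinical-trial summaries. List every factual claim "
            f"made in the following text as a JSON list of strings:\n\n{actual_output}"
        )


# Hypothetical keyword name for the injection point.
metric = FaithfulnessMetric(evaluation_template=DomainFaithfulnessTemplate)
```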
v2.3.1
- Add support for the `o3-mini` and `o3-mini-2025-01-31` models, including pricing metadata and enabling use in structured outputs and JSON mode where supported. (#1331) (Song Luar)
Improvement
v2.4.7
- Update package metadata and internal `__version__` to match the latest release. (#1392) (Jeffrey Ip)
- Add support for injecting custom evaluation templates into metrics, making it easier to customize the prompts used to generate statements, verdicts, and reasons. (#1393) (Jeffrey Ip)
- Fix a typo in the getting started guide so the `GEval` description correctly refers to evaluating outputs on any custom metric. (#1394) (Christian Bernhard)
- Fix a typo in the getting started guide to improve clarity when describing `GEval` and recommending `DAGMetric` for deterministic scoring. (#1395) (Christian Bernhard)
- Fix a typo in the getting-started guide by correcting “somewhre” to “somewhere” for clearer documentation. (#1396) (Christian Bernhard)
v2.4.6
- Improve dependency compatibility by relaxing the `grpcio` pin to allow newer 1.x releases while staying below 2.0. This reduces install and resolver conflicts across environments. (#1383) (Jeffrey Ip)
- Bump the package release metadata to 2.4.3 so the published version and citation information reflect the latest release. (#1385) (Jeffrey Ip)
- Update package metadata and internal version to 2.4.4 for the new release. (#1388) (Jeffrey Ip)
- Improve metric parameter validation by moving each metric’s required test-case fields into the metric class, ensuring consistent checks in both sync and async evaluation. (#1389) (Jeffrey Ip)
v2.4.3
- Add telemetry for dataset pulls, capturing login method, environment, and basic user identifiers to help monitor usage and diagnose issues. (#1377) (Kritin Vongthongsri)
v2.3.9
- Update package metadata for a new release, including the version and release date. (#1334) (Jeffrey Ip)
- Improve CLI login by opening a paired browser flow and recording the login provider for telemetry. Evaluation and run events now include a `logged_in_with` attribute to help diagnose usage patterns. (#1341) (Kritin Vongthongsri)
- Fix typos and small wording issues in the contextual precision and contextual recall metric templates to make the generated prompts clearer and more consistent. (#1344) (Filippo Paganelli)
- Add telemetry for the `recommend metrics` CLI flow to capture usage context when telemetry is enabled. Mark runs as incomplete when the command errors out. (#1346) (Kritin Vongthongsri)
- Add `include_reason` support to DAG-based metrics and generate clearer, path-based reasons from the DAG traversal. Improve verbose output by recording per-node execution steps, and normalize static node scores to a 0–1 range. (#1348) (Jeffrey Ip)
- Improve documentation navigation and onboarding by reorganizing the Guides sidebar and adding an early `deepeval login` step in the tutorial introduction to help users set up their API key before starting. (#1353) (Kritin Vongthongsri)
- Add documentation for integrating Elasticsearch as a vector database, including setup steps and examples for evaluating and tuning retrieval with contextual metrics. (#1354) (Kritin Vongthongsri)
- Improve Elasticsearch integration documentation with clearer setup steps and an expanded walkthrough for preparing `LLMTestCase`s and running contextual retrieval metrics to evaluate and tune retriever performance. (#1355) (Kritin Vongthongsri)
- Add integration docs for Chroma, including setup and examples for evaluating retrieval quality with contextual metrics and tuning retriever hyperparameters. (#1357) (Kritin Vongthongsri)
- Improve the Chroma integration docs with clearer setup and retrieval evaluation examples, including persistent client usage and `n_results` (top-K) tuning guidance. (#1361) (Kritin Vongthongsri)
- Improve metric docs with a clearer example of using `evaluate()` to generate reports or run multiple metrics on a test case, plus an explicit alternative showing how to call `metric.measure()` directly (see the sketch after this list). (#1364) (Kritin Vongthongsri)
- Add telemetry for metrics run mode by recording whether a metric is executed in async mode. This improves observability when diagnosing performance and runtime behavior across different execution paths. (#1365) (Kritin Vongthongsri)
- Improve the PGVector integration guide with clearer setup and retrieval steps, expanded evaluation guidance, and updated examples for embedding models and tuning `LIMIT`/top-k. Reorganize content to better explain how PGVector fits into a RAG pipeline. (#1366) (Kritin Vongthongsri)
- Fix a typo in the tutorial introduction so the guidance on choosing evaluation criteria reads correctly. (#1370) (JonasHildershavnUke)
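A short sketch contrasting the two usage patterns mentioned above (the contextual metrics and test data are illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
    retrieval_context=["Paris is the capital and largest city of France."],
)

# Option 1: run several metrics at once and get an aggregated report.
evaluate(
    test_cases=[test_case],
    metrics=[ContextualPrecisionMetric(), ContextualRecallMetric()],
)

# Option 2: call a single metric directly for a quick, standalone score.
metric = ContextualPrecisionMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```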
v2.3.1
- Prepare a new release by updating package metadata and reported version. (#1328) (Jeffrey Ip)
Bug Fix
v2.4.7
- Fix a typo in the Faithfulness metric docs by correcting a sentence in the `truths_extraction_limit` parameter description. (#1391) (Christian Bernhard)
v2.4.6
- Fix cleanup of test case instance IDs so concurrent `evaluate` calls with multiple non-conversational metrics no longer crash in the same process. (#1384) (cancelself)
v2.4.3
- Fix the faithfulness prompt example to use the correct `truths` JSON key instead of `claims`. (#1373) (Jeffrey Ip)
- Fix initialization of the faithfulness metric by ensuring the prompt template is created during construction. This prevents missing template errors and makes metric setup more reliable. (#1374) (Jaime Enríquez)
- Fix ValidationErrors when evaluating with a custom LLM after the verdict-based schema change, ensuring custom models validate correctly and evaluation runs without failing. (#1375) (Tyler Ball)
- Relax the `grpcio` dependency to `^1.67.1` instead of pinning `1.67.1`. This reduces pip upgrade conflicts in projects that already require a newer `grpcio` (for example via `grpcio-status`). (#1379) (Dmitriy Vasilyuk)
- Fix the first README example by adding missing imports and providing `expected_output` in `LLMTestCase`, so the snippet runs without `NameError` and matches the documented setup. (#1382) (dokato)
v2.3.9
- Fix the broken link to the G-Eval paper in the `ConversationalGEval` documentation so readers can access the referenced source directly. (#1336) (Jonathan du Mesnil)
- Fix `auto_evaluate` async execution by passing the correct `async_mode` flag, and export `auto_evaluate` at the package top level so it can be imported directly from the main module. (#1338) (Kritin Vongthongsri)
- Fix CLI login pairing flow by starting the local server on an available port and opening a direct pairing URL. Show which provider you logged in with after login (and on failure) to make troubleshooting easier. (#1345) (Kritin Vongthongsri)
- Fix DAG template examples to use valid JSON booleans (`true`/`false`) so generated verdict outputs are JSON-compliant and easier to parse. (#1349) (Aaron McClintock)
- Fix the `red_teamer.scan` documentation by adding the missing comma in the example call, so the code block parses correctly and can be copied without syntax errors. (#1351) (Akshay Rahatwal)
- Fix prompt wording so `verdict` is only set to 'yes' when the instruction is completely followed, reducing ambiguous interpretations in generated results. (#1369) (Daniel Abraján)
- Fix the CybersecurityGuard API by renaming `CyberattackType` to `CyberattackCategory` and switching configuration from `vulnerabilities` to `categories`. Remove stray debug prints and make input/output guard type selection consistent. (#1372) (Jeffrey Ip)
v2.3.1
- Fix `should_use_batch` detection by checking for a `batch_generate` method instead of calling it and swallowing errors. This prevents false negatives when `batch_generate` requires extra arguments (for example `schemas`) and ensures batching is enabled when supported. (#1327) (Ruiqi(Ricky) Zhu)
- Fix typos in generated telemetry output to improve accuracy and readability of telemetry files. (#1329) (Paul-Louis NECH)
- Fix passing document paths to the context generator when building embeddings, preventing incorrect argument mapping during golden generation from docs. (#1330) (Kritin Vongthongsri)
January
January made evaluations and red-teaming easier to adopt with documentation cleanups, new tutorials, and clearer configuration patterns like target_model_callback and ignore_errors. Observability improved with expanded telemetry, run identifiers, and synthesis_cost tracking. Features advanced with new ARC benchmark runners, structured ToolCall support, an upgraded TaskCompletionMetric, and a revamped Guardrails API.
New Feature
v2.2.7
- Add `auto_evaluate` to automatically generate evaluation datasets from captured LangChain or LlamaIndex context, run a target model, and score results with selected metrics. Supports async execution and optional dataset/result caching. (#1283) (Kritin Vongthongsri)
- Add `TaskCompletionMetric` to score whether an agent completed the user’s goal based on the actual outcome and tools called, with optional reasons and async support (see the sketch after this list). (#1295) (Kritin Vongthongsri)
- Add a new Legal Document Summarizer tutorial series, covering how to define summarization criteria, pick metrics, run evaluations, iterate on hyperparameters, and catch regressions by comparing test runs. (#1323) (Kritin Vongthongsri)
- Add a new RAG QA Agent tutorial in the docs, including guidance on choosing metrics, running evaluations, and improving hyperparameters. The tutorials sidebar now includes this section and surfaces it by default. (#1326) (Kritin Vongthongsri)
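A small sketch of scoring an agent with `TaskCompletionMetric`; the constructor options and `ToolCall` field names are assumed to match this release and may differ:

```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book me a table for two at an Italian restaurant tonight.",
    actual_output="Done! I reserved a table for two at Trattoria Roma at 7pm.",
    tools_called=[
        ToolCall(name="search_restaurants", input_parameters={"cuisine": "Italian"}),
        ToolCall(name="book_table", input_parameters={"party_size": 2, "time": "19:00"}),
    ],
)

# Assumed constructor options; the metric judges goal completion from the
# input, the actual outcome, and the tools that were called.
metric = TaskCompletionMetric(threshold=0.7, include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
```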
v2.2.2
- Add three new multimodal evaluation metrics: `ImageCoherenceMetric`, `ImageHelpfulnessMetric`, and `ImageReferenceMetric` for scoring how well images align with surrounding context, user intent, and provided references. (#1230) (Kritin Vongthongsri)
- Add an optional `identifier` to tag and persist test runs, available via the CLI flag `--identifier` and the pytest plugin option. This helps you distinguish and group results across multiple runs more easily. (#1237) (Jeffrey Ip)
- Add an ARC benchmark runner with ARC-Easy and ARC-Challenge modes, configurable `n_shots` and problem count, and built-in accuracy reporting with per-example predictions. Expand the docs to include new benchmark pages and navigation entries for additional benchmark suites. (#1239) (Kritin Vongthongsri)
- Add multimodal RAG evaluation support, including test cases with image inputs and retrieval context plus new multimodal metrics for recall, relevancy, precision, answer relevancy, and faithfulness. (#1241) (Kritin Vongthongsri)
- Add a revamped guardrails API with built-in guard classes (e.g., privacy, prompt-injection, jailbreaking, topical, cybersecurity) and support for running multiple guards in one call, returning per-guard scores and breakdowns. (#1247) (Kritin Vongthongsri)
- Add `max_context_length` to control how many chunks are grouped into each generated context during document-based synthesis, letting you tune context size for generation. Also adjust context grouping defaults and de-duplication to produce more consistent context groups. (#1289) (Kritin Vongthongsri)
- Add ToolCall support for tool evaluation data. Datasets can now load `tools_called` and `expected_tools` from JSON/CSV into structured ToolCall objects, with more robust JSON parsing. Metrics like ToolCorrectness and GEval now handle ToolCall values when evaluating and formatting outputs (see the sketch after this list). (#1290) (Kritin Vongthongsri)
- Add configurable tool correctness scoring to validate tool names, input parameters, or outputs. Improve verbose logs by showing expected vs called values and the final score and reason, making tool-call mismatches easier to diagnose. (#1293) (Kritin Vongthongsri)
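A small sketch of structured ToolCall data flowing into tool-correctness scoring; the `ToolCall` field names and default comparison behavior are assumptions about this release:

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Paris tomorrow?",
    actual_output="Tomorrow in Paris it will be sunny with a high of 22°C.",
    tools_called=[ToolCall(name="get_weather", input_parameters={"city": "Paris"})],
    expected_tools=[ToolCall(name="get_weather")],
)

# Assumed default: only tool names are compared; stricter checks on input
# parameters/outputs and ordering are configurable per the entries above.
metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
```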
Improvement
v2.2.7
- Bump package version metadata to 2.2.2 for the latest release. (#1302) (Jeffrey Ip)
- Improve the G-Eval documentation by adding guidance for running evaluations on Confident AI, including the `deepeval login` step to get started. (#1303) (Kritin Vongthongsri)
- Fix a typo in the dataset push success message and docs, correcting “Confidnet” to “Confident” for clearer branding and guidance. (#1307) (Rahul Shah)
- Add an `ignore_errors` option to red teaming scans so attack generation and evaluation can surface failures without aborting the run. Also rename the async concurrency setting to `max_concurrent` for clearer configuration. (#1309) (Jeffrey Ip)
- Improve the Task Completion metric documentation by clarifying that it evaluates tool-calling agents using `input`, `tools_called`, and `actual_output`. Expand the calculation section to explain task/outcome extraction and alignment scoring, with additional examples for context. (#1310) (Kritin Vongthongsri)
- Improve Jailbreaking Crescendo JSON schema generation by adding stricter system prompts to confine outputs to the expected keys and moving the `description` field to the eval schema. Also ensure remote attack generation initializes the API client with an explicit API key value. (#1311) (Kritin Vongthongsri)
- Fix the MMLU benchmark docs by updating the example to use `MMLUTask`, helping users get started with the correct setup. This addresses an issue in the MMLU introduction, though some guidance gaps remain around long outputs and batching with varying prompt lengths. (#1313) (Matthew Khoriaty)
- Improve tool correctness evaluation by supporting multiple `ToolCallParams` at once and generating clearer scoring and verbose logs for exact-match and ordering checks. (#1317) (Kritin Vongthongsri)
- Improve synthesizer docs by clarifying that for RAG evaluation only certain evolution types reliably stick to the provided context, and annotate the examples accordingly. (#1319) (Sebastian)
- Add a new RAG QA Agent tutorial series covering synthetic dataset generation, evaluation criteria, and metric selection, and reorganize the tutorials sidebar to keep other sections collapsed by default. (#1325) (Kritin Vongthongsri)
v2.2.2
- Improve red-teaming 2.0 documentation with clearer setup and scan examples, including how to define vulnerabilities and a target model callback. Reorganize the docs sidebar to add OWASP guidance and a dedicated vulnerabilities section for easier navigation. (#1209) (Kritin Vongthongsri)
- Bump package version to 2.0.5. (#1217) (Jeffrey Ip)
- Add tracking of `synthesis_cost` when synthesizing goldens by accumulating model call costs, so you can see the estimated spend for synthesis runs. (#1218) (Vytenis Šliogeris)
- Improve dependency compatibility by updating the `tenacity` requirement to allow up to version 9.0.0, reducing install conflicts with newer environments. (#1226) (Anindyadeep)
- Fix a grammar issue in the RAG evaluation guide to clarify that prompts are constructed from both the initial input and the retrieved context. (#1233) (Nishant Mahesh)
- Improve benchmark docs with clearer descriptions, supported modes/tasks, and copy-paste examples for ARC, BBQ, and Winogrande. Also tidy benchmark exports and naming to make imports and evaluation parameters more consistent. (#1240) (Kritin Vongthongsri)
- Prepare a new release by bumping the package version to 2.1.0. (#1245) (Jeffrey Ip)
- Improve benchmark runs by adding more built-in benchmark imports, optional verbose per-problem logging, and configurable answer-format confinement instructions to reduce parsing errors and make results easier to inspect. (#1246) (Kritin Vongthongsri)
- Improve red-teaming documentation by renaming the target model function parameter to `target_model_callback` and updating sync/async examples to match, reducing confusion when wiring up scans. (#1250) (Kritin Vongthongsri)
- Change the default Guardrails API base URL to `https://deepeval.confident-ai.com/` instead of `http://localhost:8000`, so it connects to the hosted service by default. (#1252) (Kritin Vongthongsri)
- Update package metadata by bumping the release version and refreshing the project description. (#1254) (Jeffrey Ip)
- Improve Guardrails API configuration by using the shared `BASE_URL` from the guardrails API module instead of a hardcoded localhost URL. (#1255) (Kritin Vongthongsri)
- Add an `IS_CONFIDENT` environment toggle to switch the API base URL to a local server (using `PORT`) instead of the default hosted endpoint. (#1258) (Kritin Vongthongsri)
- Improve guardrails base classes and typing by introducing `BaseGuard`/`BaseDecorativeGuard` and a shared `GuardType` enum. This makes guard metadata and guardrail configuration more consistent across built-in guards. (#1259) (Jeffrey Ip)
- Add a configurable `top_logprobs` setting to better support OpenAI and Azure OpenAI deployments where logprobs limits vary by model/version. This helps avoid failures or unexpected clamping when a service only supports smaller values (for example, 5 instead of 20). (#1261) (Dave Erickson)
- Add PostHog analytics tracking to the documentation site, with tracking disabled in development to avoid collecting local activity. (#1268) (Kritin Vongthongsri)
- Update package metadata for a new release. (#1270) (Jeffrey Ip)
- Fix typos in the README by correcting “continous” to “continuous” in multiple places. (#1273) (Ikko Eltociear Ashimine)
- Improve telemetry spans for evaluations, synthesizer, red teaming, guardrails, and benchmarks by capturing more run details and consistently tagging an anonymous `unique_id` (and public IP when available). This makes usage and performance monitoring more consistent across features. (#1276) (Kritin Vongthongsri)
- Add support for additional OpenAI GPT model IDs, including versioned `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, and `gpt-3.5-turbo-instruct` variants, so model validation accepts more current options out of the box. (#1277) (Song Luar)
- Add an opt-out for automatic update warnings via the `DEEPEVAL_UPDATE_WARNING_OPT_OUT=YES` environment variable, so you can suppress update checks in non-interactive or CI environments. Documentation was added for this setting. (#1278) (Song Luar)
- Bump the package version for a new release. (#1279) (Jeffrey Ip)
- Improve telemetry by tagging spans with the runtime environment (Jupyter notebook vs other) to better understand where evaluations and tools are run. (#1280) (Kritin Vongthongsri)
- Improve OpenAI-native model calls by using structured outputs with explicit schemas, returning typed fields directly instead of parsing JSON strings. This makes metric verdicts/reasons/statements more reliable and reduces parsing failures. (#1285) (Kritin Vongthongsri)
- Update OpenAI model lists so `gpt_model` and `gpt_model_schematic` stay in sync, including refreshed multimodal model support. Adjust validation and pricing data to match the latest available models and costs. (#1287) (Song Luar)
- Update the default API base URL used by the red teaming attack synthesizer to point to the hosted service instead of localhost. (#1288) (Kritin Vongthongsri)
- Improve documentation with a new Cognee integration guide and corrected guardrails example usage, plus small styling and copy updates across the site. (#1291) (Jeffrey Ip)
- Fix typos in the custom LLMs guide to clarify the exception note and correct the `instantiate` instruction. (#1294) (Christian Bernhard)
- Add telemetry attributes to record whether each feature run is considered `new` or `old`, and persist that status after a feature is used. This improves feature-usage reporting across evaluation, synthesizer, red teaming, guardrails, and benchmarks. (#1296) (Kritin Vongthongsri)
- Add validation and pricing metadata for OpenAI `o1` models (`o1`, `o1-preview`, `o1-2024-12-17`) so they can be used with JSON mode and structured outputs where supported. (#1299) (Song Luar)
- Add a `--display` option to control which test cases are shown in the final results output, so you can view all, only failing, or only passing cases in CLI runs and `evaluate()` printing (see the sketch below). (#1301) (Jeffrey Ip)
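A rough sketch of filtering the printed results from `evaluate()`; the `display` keyword name and its accepted values are assumptions, as is the equivalent CLI invocation shown in the comment:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(input="Hi", actual_output="Hello! How can I help?"),
    LLMTestCase(input="What's your refund policy?", actual_output="I prefer not to say."),
]

# Hypothetical keyword and value; the CLI equivalent would be something like
# `deepeval test run test_file.py --display failing`.
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],
    display="failing",  # show only failing test cases in the printed results
)
```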
Bug Fix
v2.2.7
- Fix structured (`schema`) responses when using non-OpenAI models (including Azure/local) by correctly invoking the loaded model and returning the parsed JSON along with the tracked cost. (#1304) (Kritin Vongthongsri)
- Fix circular imports involving `Scorer` by deferring its import in benchmark modules, preventing import-time crashes when loading benchmarks. (#1315) (Song Luar)
- Fix async tracing in the LangChain callback by making trace state thread-safe and correctly linking parent/child spans. This prevents missing or mis-associated traces when runs execute concurrently. (#1318) (Kritin Vongthongsri)
- Fix leftover `.vector_db` collections when chunking fails by cleaning up the generated collection folders before raising an error. Also handle invalid Chroma collections explicitly so document loading can recover more reliably. (#1320) (Kritin Vongthongsri)
- Fix context generation from docs by passing `document_paths` explicitly, preventing incorrect argument binding. Also skip the MULTICONTEXT evolution when transforming evolution distributions to avoid generating unsupported prompt evolutions. (#1321) (Kritin Vongthongsri)
- Fix local Ollama embedding requests by routing through the OpenAI client when the base URL points to localhost. This restores embedding support for both single text and batch inputs without changing cloud OpenAI behavior. (#1322) (Kritin Vongthongsri)
v2.2.2
- Prevent endless verdict generation in ContextualPrecision by including the explicit document count in the prompt, helping LLMs stay aligned on long or complex context lists. (#1222) (enrico-stauss)
- Fix `MMLUTemplate.format_subject` to be a static method, allowing it to be called without an instance and preventing incorrect usage in MMLU prompt formatting. (#1229) (Terrasse)
- Prevent OpenTelemetry from loading on import when telemetry is opted out. This avoids importing protobuf dependencies unnecessarily and reduces conflicts with other libraries. (#1231) (Mykhailo Chalyi (Mike Chaliy))
- Fix red teaming risk-category mapping to use the updated `*Type` vulnerability enums, keeping vulnerability classification consistent after recent naming changes. (#1236) (Kritin Vongthongsri)
- Fix synthetic data generation when ChromaDB raises `InvalidCollectionException` by catching the correct exception type in `a_chunk_doc`, ensuring fallback handling runs instead of stopping early. (#1242) (Mizuki Nakano)
- Fix text-to-image metric semantic consistency evaluation to use the generated output image instead of an input image, improving scoring accuracy for text-only prompts. (#1253) (Kritin Vongthongsri)
- Fix docs to use the correct import paths for sensitive information disclosure attack types (`PIILeakageType`, `PromptLeakageType`, `IntellectualPropertyType`), preventing import errors when following the example code. (#1256) (Mohammad-Reza Azizi)
- Fix guardrails API calls to use the updated `/guardrails` endpoint instead of the old multiple-guard path. (#1257) (Jeffrey Ip)
- Fix guardrails API schema so `input` and `response` are defined at the request level instead of per-guard, preventing invalid payloads when multiple guards are used. (#1260) (Jeffrey Ip)
- Fix MMLU task reloading so the benchmark dataset is fetched fresh for the selected task instead of reusing a previously cached dataset. This prevents running evaluations against the wrong task data when switching tasks. (#1267) (Yuyao Huang)
- Fix synthesizer cost tracking to handle unset `synthesis_cost`. This prevents errors when generating data if cost accounting is disabled or not initialized. (#1271) (Jeffrey Ip)
- Fix batched `evaluate()` results so prediction rows include the expected output alongside the input, prediction, and score, keeping benchmark output consistent and easier to inspect. (#1274) (BjarniH)
- Fix documentation “Edit this page” links to point to the correct `docs/` directory so edits open in the right place on GitHub. (#1292) (Jeffrey Ip)
- Prevent installing the `tests` folder into site-packages by excluding it from the package install. This avoids name conflicts when your project also includes a `tests` directory. (#1300) (冯键)