2024
December
December delivered the 2.0 major release with refreshed packaging, updated dependency pins, langchain-community added, broader Python support up to <3.13, and smoother installs including automatic nest_asyncio. Documentation saw a significant polish pass, with expanded dataset tutorials and clearer navigation across dataset synthesis, LLM app, metrics, guardrails, and getting started guidance including Windows notes for DEEPEVAL_RESULTS_FOLDER. Red Teaming 2.0 landed with broader vulnerability coverage, improved evaluation prompts, and new IP and competitor checks while retiring older politics and religion graders and updating baseline attack generation. The month also improved extensibility with support for custom OpenAI endpoints via base_url.
Backward Incompatible Change
v2.0.1
- Bump the package version to 2.0 for the new major release. (#1191) (Jeffrey Ip)
Improvement
v2.0.5
- Add Red Teaming 2.0 updates with expanded vulnerability coverage and improved evaluation prompts, including new intellectual property and competitor checks. Remove older politics and religion graders and refresh baseline attack generation support. (#1206) (Kritin Vongthongsri)
- Support custom OpenAI endpoints by passing `base_url` through when creating the ChatOpenAI client. This lets you point the model at non-default API hosts without extra configuration. (#1214) (cmorris108)
- Update package version metadata for the latest release. (#1215) (Jeffrey Ip)
v2.0.2
- Improve the getting started guide with Windows-specific instructions for setting `DEEPEVAL_RESULTS_FOLDER`, alongside the existing Linux example. (#1198) (Bernhard Merkle)
- Improve packaging for the new release by updating dependency pins, adding `langchain-community`, and expanding supported Python versions to <3.13. (#1204) (Jeffrey Ip)
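The results-folder guidance above comes down to one environment variable. As a minimal sketch of how such a setting is typically read (the helper and fallback folder name are assumptions for illustration, not deepeval's actual internals):

```python
import os
from pathlib import Path

def results_folder() -> Path:
    # DEEPEVAL_RESULTS_FOLDER overrides where run results are written;
    # the fallback name here is an assumption, not the library's default.
    return Path(os.environ.get("DEEPEVAL_RESULTS_FOLDER", ".deepeval-results"))

# Linux/macOS shells set this with `export`, Windows PowerShell with `$env:`;
# from Python the effect is the same:
os.environ["DEEPEVAL_RESULTS_FOLDER"] = "./results"
print(results_folder())
```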
v2.0.1
- Improve dataset tutorials by expanding guidance on pulling datasets, converting goldens into test cases, and running evaluations, and make the dataset pages visible in the docs sidebar. (#1192) (Kritin Vongthongsri)
- Improve tutorial docs by cleaning up section headings and numbering for clearer navigation across dataset synthesis, LLM app, and metrics guides. (#1193) (Kritin Vongthongsri)
- Update guardrails documentation to reflect the current set of available guards and vulnerability coverage. Refresh the example configuration and simplify the list of guards that work with only input and output. (#1197) (Kritin Vongthongsri)
Bug Fix
v2.0.2
- Fix `copy_metrics` to preserve metric configuration inherited from base classes, ensuring copied metrics keep the same parameters (including model settings). Adds a regression test to prevent future copy issues. (#1202) (Vytenis Šliogeris)
- Fix missing dependency installation so `nest_asyncio` is included automatically, preventing `ModuleNotFoundError: No module named 'nest_asyncio'` after install. (#1208) (Kars Barendrecht)
v2.0.1
- Fix `enhance_attack` to return the original attack object on enhancement errors instead of returning nothing, improving error handling and preventing downstream crashes. (#1195) (Chris W)
- Fix `save_as()` to use the correct file encoding by mirroring the synthesizer implementation, aligning it with other UTF-8 defaults and preventing encoding-related save failures. (#1196) (Chris W)
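The `save_as()` encoding fix reflects a pattern worth copying in any export code: pass an explicit encoding rather than relying on the platform default (which on Windows is often not UTF-8). A minimal sketch, with a hypothetical helper name rather than the library's API:

```python
import json
import os
import tempfile

def save_goldens(goldens: list, path: str) -> None:
    # An explicit UTF-8 encoding keeps non-ASCII content intact regardless
    # of the platform's default encoding.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(goldens, f, ensure_ascii=False, indent=2)

path = os.path.join(tempfile.gettempdir(), "goldens.json")
save_goldens([{"input": "Qu'est-ce que l'éval ?"}], path)
```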
November
November focused on polish, reliability, and a major expansion of learning resources, alongside several version bumps through 1.6.0. Documentation grew substantially with reorganized Synthesizer guidance, new observability and red-teaming tutorials, and step-by-step walkthroughs for synthetic dataset generation, evaluation workflows, and an agentic RAG medical chatbot. Core functionality improved with safer async generation via max_concurrent, richer evaluation outputs by including the test case name in TestResult, enhanced tracing/monitoring behavior and payload sanitization, and more consistent guardrails scoring and configuration. The release also introduced new safety and quality metrics, including JSON correctness, prompt alignment, and configurable content safety guards.
New Feature
v2.0
- Add `JsonCorrectnessMetric` to validate that an LLM's output conforms to a provided Pydantic JSON schema. Returns a 1/0 score and can include an actionable reason when the output fails validation. (#1155) (Jeffrey Ip)
- Add `PromptAlignmentMetric` to score how well a model output follows a set of prompt instructions, with optional per-instruction verdicts and a generated reason in async or sync mode. (#1190) (Jeffrey Ip)
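The 1/0 scoring idea behind `JsonCorrectnessMetric` can be sketched with a toy stdlib check; the real metric validates against a full Pydantic schema, whereas this illustration only verifies that the output parses and contains a required key set:

```python
import json

def json_correctness_score(actual_output: str, required_keys: set) -> int:
    # Returns 1 when the output is valid JSON containing the required keys,
    # 0 otherwise (a stand-in for schema validation).
    try:
        data = json.loads(actual_output)
    except json.JSONDecodeError:
        return 0
    return 1 if isinstance(data, dict) and required_keys <= data.keys() else 0

print(json_correctness_score('{"name": "Ada", "age": 36}', {"name", "age"}))  # 1
print(json_correctness_score('not json at all', {"name"}))  # 0
```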
v1.6.0
- Add prompt versioning support by letting you pull a prompt template by alias (and optional version) from Confident AI, then interpolate it locally with variables using `Prompt.interpolate`. The pulled prompt version is stored on the `Prompt` instance for traceability. (#1176) (Jeffrey Ip)
v1.5.1
- Add `guard()` and the `Guard` enum to run configurable content safety checks on an input/response pair, with optional purpose, allowed entities, and detailed reasons. Validates required parameters for selected guards and errors early when context is missing. (#1144) (Kritin Vongthongsri)
Improvement
v2.0
- Bump the package version to 1.6.0. (#1186) (Jeffrey Ip)
- Add and reorganize tutorial documentation for dataset review and running evaluations, including updated guidance on synthetic dataset generation and metric selection. (#1187) (Kritin Vongthongsri)
- Update available guard types by disabling several unused guard options and tidying guard list formatting, reducing confusion when selecting guards. (#1188) (Kritin Vongthongsri)
v1.6.0
- Add a step-by-step tutorial for building an agentic RAG medical chatbot, covering knowledge-base loading, embedding and vector storage, tool setup, and an interactive end-to-end code example. (#1162) (Kritin Vongthongsri)
- Update package version metadata to 1.5.7 for the latest release. (#1170) (Jeffrey Ip)
- Add new tutorial docs covering synthetic dataset generation and preparing conversational evaluation datasets, and update the tutorial sidebar to include them. Also improve LlamaIndex callback tracing to handle OpenAI `ChatCompletion` responses when extracting messages and token usage. (#1175) (Kritin Vongthongsri)
v1.5.7
- Bump the package version to 1.5.2 for the latest release. (#1157) (Jeffrey Ip)
- Add a `max_concurrent` option to cap async generation concurrency in the synthesizer, preventing too many tasks from running at once and helping avoid rate limits or resource spikes. Default is 100 concurrent tasks. (#1159) (Kritin Vongthongsri)
- Replace `context` with `retrieval_context` in HallucinationMetric LLM test case params to match other evaluators. This makes it possible to run multiple evaluators in a loop against the same `TestCase` without special handling. (#1161) (Louis Brulé Naudet)
- Improve guardrails harm scoring by tying the `score` to the specified harm category and reducing false positives from unrelated harmful content. Update guardrail test output formatting to print results as pretty-printed JSON. (#1168) (Kritin Vongthongsri)
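The `max_concurrent` cap described above can be pictured as a semaphore wrapped around each generation task. A minimal asyncio sketch (function names are illustrative, not the synthesizer's internals):

```python
import asyncio

async def generate(i: int, sem: asyncio.Semaphore) -> int:
    # The semaphore bounds how many generations are in flight at once;
    # the synthesizer's default cap is 100.
    async with sem:
        await asyncio.sleep(0)  # stand-in for an LLM call
        return i

async def generate_all(n: int, max_concurrent: int = 100) -> list:
    sem = asyncio.Semaphore(max_concurrent)
    # gather preserves submission order even though tasks overlap
    return await asyncio.gather(*(generate(i, sem) for i in range(n)))

print(asyncio.run(generate_all(5, max_concurrent=2)))  # [0, 1, 2, 3, 4]
```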
v1.5.2
- Bump the package version to 1.5.1 for this release. (#1154) (Jeffrey Ip)
- Improve tracing by renaming internal `track_params` to `monitor_params` and passing `run_async` through to monitoring so events can be recorded asynchronously when enabled. (#1156) (Kritin Vongthongsri)
v1.5.1
- Bump package version metadata to 1.4.9. (#1143) (Jeffrey Ip)
- Add a red-teaming tutorial guide that walks through setting up a target LLM, running scans with `RedTeamer`, interpreting vulnerability results, and iterating on fixes to improve LLM safety and reliability. (#1148) (Kritin Vongthongsri)
- Add the test case name to `TestResult` so evaluation outputs include which test produced each result. (#1152) (AugmentedMo)
v1.4.9
- Prepare a new package release by updating the project version metadata. (#1138) (Jeffrey Ip)
v1.4.8
- Improve formatting and bump the package version metadata to 1.4.7. (#1133) (Jeffrey Ip)
- Improve Synthesizer documentation by splitting the previous single page into a clearer sectioned guide covering generation from documents, contexts, scratch, and datasets, and updating the docs sidebar navigation accordingly. (#1135) (Kritin Vongthongsri)
- Add a new guide on LLM observability and monitoring, covering why it matters and key components like response monitoring, automated evaluations, filtering, tracing, and human feedback. (#1136) (Kritin Vongthongsri)
Bug Fix
v1.6.0
- Fix context generation so chunk counts reset per run, preventing incorrect `total_chunks` reporting after loading documents multiple times. (#1177) (Kritin Vongthongsri)
- Fix dataset loading from CSV/JSON by converting missing values to `None`, adding configurable file encoding for JSON reads, and allowing `source_file` to be loaded from an explicit column/key instead of defaulting to the input path. (#1178) (Kritin Vongthongsri)
- Fix unaligned attack category codes to match Promptfoo labels (for example `harmful:violent-crime`), improving consistency when mapping vulnerabilities to API codes. (#1180) (nabeel-chhatri)
- Fix the G-Eval documentation for the `Correctness` metric so `expected_output` is included in `evaluation_params`, ensuring evaluations compare against the expected output as intended. (#1182) (Zane Lim)
- Fix the red teaming guide example to use the correct `load_model()` return name (`client`) so the sample code matches the API and avoids confusion when calling chat completions. (#1184) (Manish-Luci)
v1.5.7
- Improve tracer monitoring by no longer passing the `run_async` option when a trace is closed, reducing unexpected async behavior during report submission. (#1158) (Jeffrey Ip)
- Fix jailbreak linear and jailbreak tree evaluations by aligning `on_topic` and rating prompt outputs with the expected schema fields, so these methods work correctly again. (#1160) (nabeel-chhatri)
- Fix Guardrails API calls by updating the base endpoint URL. Update the Guardrails docs example to use the `response` parameter and correct syntax, and group Guardrails under a dedicated docs section for easier navigation. (#1164) (Kritin Vongthongsri)
- Fix chunk indexing for large documents by adding embeddings to the vector store in batches, avoiding oversized `add` calls. Also handle missing collections more explicitly by catching the collection-not-found error before creating and populating a new collection. (#1165) (Kritin Vongthongsri)
- Fix tracing for agent steps by correctly populating `agentAttributes`, preventing missing or misnamed trace fields during LlamaIndex callback handling. (#1166) (Kritin Vongthongsri)
- Fix Hallucination metric to use `context` again instead of `retrieval_context` when reading required inputs. This restores expected `LLMTestCase` parameter naming in the metric and related examples/tests. (#1167) (Jeffrey Ip)
v1.5.1
- Fix a broken link in the getting started guide so the "using a custom LLM" reference points to the correct documentation page. (#1141) (Nim Jayawardena)
- Fix tracing callbacks to send events via `monitor` and sanitize payloads by stripping null bytes from nested data. Prevent errors when node scores are missing during LlamaIndex trace aggregation. (#1151) (Kritin Vongthongsri)
- Fix configuration defaults to avoid creating models/config objects at import time, preventing import-time side effects and shared mutable defaults. Defaults are now set in `__post_init__` or during initialization when values are omitted. (#1153) (Jeffrey Ip)
v1.4.9
- Fix creating an empty `EvaluationDataset` so it no longer prompts for `OPENAI_API_KEY` unnecessarily. (#1142) (Stefano Michieletto)
v1.4.8
- Fix noisy console output by removing an unintended print of the truths extraction limit during faithfulness truth generation. (#1134) (Jeffrey Ip)
October
October focused on making evaluations more reliable and scalable, with stronger concurrency controls for async LLM calls and new limits like limit_count and truths_extraction_limit to curb token runaway and improve faithfulness/summarization stability on large RAG inputs. The evaluation surface was refined with cleaner defaults, a new EvaluationResult return type, more consistent tool-calling fields, and end-to-end improvements to KnowledgeRetentionMetric, plus broader metric coverage including role adherence and dedicated multimodal image metrics. RAG and synthesizer workflows saw notable expansion through improved golden generation APIs, higher-quality context selection, and richer RAG metrics such as context recall and context entity recall.
Backward Incompatible Change
v1.4.2
- Fix red-teaming vulnerability handling by mapping vulnerabilities to stable API codes and updating renamed vulnerability enums. This prevents incorrect attack generation for unaligned/remote categories and keeps grading and reporting consistent across the full vulnerability set. (#1101) (Kritin Vongthongsri)
v1.3.5
- Add explicit telemetry opt-in via `DEEPEVAL_ENABLE_TELEMETRY=YES`, with telemetry disabled by default when the variable is unset or not set to YES. (#1047) (Pritam Soni)
- Restore telemetry opt-out behavior and switch the controlling env var to `DEEPEVAL_TELEMETRY_OPT_OUT`. Telemetry is now enabled by default unless you explicitly opt out. (#1049) (Jeffrey Ip)
New Feature
v1.4.5
- Add dedicated image metrics for multimodal evaluation: `TextToImageMetric` for text-to-image generation and `ImageEditingMetric` for image editing test cases, replacing the previous combined VIEScore workflow. (#1123) (Kritin Vongthongsri)
v1.4.2
- Add new red-teaming vulnerability graders, including BFLA, BOLA, SSRF, prompt extraction, competitors, religion, hijacking, and overreliance checks. This expands the set of security behaviors you can evaluate during vulnerability scans. (#1099) (Kritin Vongthongsri)
v1.4.0
- Add new attack enhancements for red-teaming, including `MathProblem`, `Multilingual`, and `JailbreakingCrescendo`. Improve gray-box enhancements by retrying more and verifying the rewritten prompt is both compliant and actually a gray-box attack before returning it. (#1093) (Kritin Vongthongsri)
v1.3.8
- Add optional `scenario`, `task`, `input_format`, and `expected_output_format` controls when generating goldens from docs, for both sync and async APIs. This lets you steer how inputs are rewritten during evolution and how expected outputs are formatted. (#1080) (Kritin Vongthongsri)
v1.3.5
- Add `RoleAdherenceMetric` to score how well a chatbot stays in character across conversational turns, with optional reasons, strict scoring, async evaluation, and verbose logs. (#1054) (Jeffrey Ip)
- Add support for function-calling fields on Golden records via `tools_called` and `expected_tools`, including JSON serialization as `toolsCalled` and `expectedTools`. (#1057) (Andy)
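The `toolsCalled`/`expectedTools` serialization mentioned above is a simple snake_case-to-camelCase field mapping. A sketch with a hypothetical helper, mapping only the two fields named in the changelog:

```python
# Only the fields the changelog names are mapped; anything else passes
# through unchanged.
FIELD_MAP = {"tools_called": "toolsCalled", "expected_tools": "expectedTools"}

def golden_to_payload(golden: dict) -> dict:
    # Rename snake_case keys to their camelCase API equivalents.
    return {FIELD_MAP.get(k, k): v for k, v in golden.items()}

print(golden_to_payload({"input": "hi", "tools_called": ["search"]}))
# {'input': 'hi', 'toolsCalled': ['search']}
```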
Improvement
v1.4.7
- Bump the package version to 1.4.6 for the latest release. (#1127) (Jeffrey Ip)
v1.4.6
- Bump the package version to 1.4.5 for this release. (#1125) (Jeffrey Ip)
v1.4.5
- Improve dataset golden generation APIs by adding `generate_goldens_from_scratch`, expanding doc-based generation options (chunking and context limits), and letting you weight evolutions with a dict. Also add optional scenario/task and input/expected output format fields, and default to generating expected outputs. (#1110) (Kritin Vongthongsri)
- Improve Ragas-based RAG evaluation metrics by adding context recall and context entity recall, and by returning per-test-case scores consistently. This also updates async `a_measure` signatures and fixes score indexing to avoid dataset-level results leaking into single-case runs. (#1113) (Kritin Vongthongsri)
- Bump the package version metadata for a new release. (#1117) (Jeffrey Ip)
- Improve telemetry for benchmark, synthesizer, and red teaming runs by capturing clearer span names and richer attributes like methods, generation limits, tasks, vulnerabilities, and enhancements. Add benchmark and login event capture to better track feature usage when telemetry is enabled. (#1118) (Kritin Vongthongsri)
- Improve the docs site header by fixing the logo asset name, adding a Confident link icon, and enabling Plausible analytics tracking. (#1124) (Jeffrey Ip)
v1.4.3
- Bump the package version to 1.4.2 for the latest release. (#1103) (Jeffrey Ip)
- Improve red-teaming documentation by splitting it into separate pages for introduction, vulnerabilities, and attack enhancements, and reorganizing the docs sidebar for easier navigation. (#1107) (Kritin Vongthongsri)
- Improve synthetic dataset documentation visuals by centering diagrams, adjusting spacing, and switching images to SVG for clearer rendering. (#1109) (Kritin Vongthongsri)
v1.4.4
- Improve synthesizer prompt construction when rewriting evolved inputs, and update the package release metadata. (#1111) (Jeffrey Ip)
v1.4.2
- Bump package version to 1.4.1 for the latest release. (#1098) (Jeffrey Ip)
v1.4.0
- Prepare a new package release by bumping the project version. (#1084) (Jeffrey Ip)
- Improve the Synthesizer documentation with an overview of generation methods (from documents, contexts, or scratch) and clearer parameter guidance, including async generation and model configuration. (#1088) (Kritin Vongthongsri)
v1.4.1
- Prepare a new release by updating the package version metadata. (#1096) (Jeffrey Ip)
v1.3.9
- Bump the package version to 1.3.8. (#1081) (Jeffrey Ip)
v1.3.7
- Bump the package version for a new release. (#1076) (Jeffrey Ip)
v1.3.8
- Bump the package version to 1.3.7. (#1078) (Jeffrey Ip)
v1.3.6
- Bump package version to 1.3.5 for the latest release. (#1066) (Jeffrey Ip)
- Add a RAG evaluation example that indexes docs in Qdrant, queries with retrieved context, and runs relevancy/faithfulness and contextual metrics to help validate end-to-end retrieval quality. (#1067) (Anush)
- Improve context generation quality control by adding configurable retry and scoring thresholds, and by tracking similarity scores during context selection. This makes context cleanup more consistent and reduces low-quality contexts in generated outputs. (#1070) (Kritin Vongthongsri)
- Improve `evaluate()` output by returning an `EvaluationResult` object with both `test_results` and an optional `confident_link` for viewing saved runs. (#1075) (Jeffrey Ip)
v1.3.5
- Bump the package version to 1.3.2 for the latest release. (#1040) (Jeffrey Ip)
- Fix a typo in the getting started docs describing `Golden` test cases and output generation at evaluation time. (#1041) (fabio fumarola)
- Add a configurable semaphore to limit concurrent async LLM calls during test execution (default 10). This reduces simultaneous API requests, helps stay within rate limits, and prevents "too many requests" errors for more predictable runs. (#1043) (Waldemar Kołodziejczyk)
- Add a `limit_count` parameter to faithfulness and summarization to cap the number of generated claims and truths, reducing runaway token usage and incomplete JSON outputs on large RAG inputs. Fix a typo in the contextual relevancy prompt example. (#1045) (Jan F.)
- Improve docs for the Faithfulness and Summarization metrics by documenting the new `truths_extraction_limit` option and explaining when to use it to evaluate only the most important truths. (#1051) (Jeffrey Ip)
- Bump the package version to 1.3.3 for the latest release. (#1055) (Jeffrey Ip)
- Support passing `*args` and `**kwargs` to `load_benchmark_dataset`, allowing benchmarks to load datasets with optional parameters without changing the base interface. (#1056) (Andy)
- Improve the evaluation API by simplifying defaults and removing `traceStack` from API test case payloads. Also expose `tools_called` and `expected_tools` consistently in API test cases for clearer tool-related evaluations. (#1059) (Jeffrey Ip)
- Improve `KnowledgeRetentionMetric` to work end-to-end: validate required conversational turn fields, support async evaluation, and calculate scores more reliably. Add clearer verbose logs and allow optional verdict indices and reasons. (#1060) (Jeffrey Ip)
- Bump the package release metadata to reflect the latest published version. (#1061) (Jeffrey Ip)
Bug Fix
v1.4.7
- Fix GEval documentation to use `strict_mode` instead of `strict`, matching the current API and avoiding confusion when copying examples. (#1129) (Chad Kimes)
- Fix JSON and CSV exports to consistently use UTF-8 encoding. This preserves non-ASCII characters and avoids garbled text when saving files. (#1131) (Kinga Marszałkowska)
v1.4.6
- Fix non-async reason generation to include `relevant_statements`, ensuring contextual relevancy explanations reflect both relevant and irrelevant statements. (#1126) (dreiii)
v1.4.3
- Fix the BBH multiple-choice schema key for the multistep arithmetic task so the correct prompt instructions are applied during evaluation. (#1104) (Nikita Parfenov)
- Fix synthesizer input handling so generated goldens consistently use the evolved input. Also rewrite the evolved input using the provided `input_format`, `scenario`, or `task` before generating expected output when those options are set. (#1108) (Kritin Vongthongsri)
v1.4.2
- Fix MMLU benchmark task loading so switching tasks always loads the correct dataset instead of reusing a previously cached one. (#1097) (Thomas Hagen)
v1.4.0
- Fix synthesizer goldens generation to fall back to the original evolved input when a rewritten input is empty, preventing missing or blank `input` values in created goldens. (#1091) (Kritin Vongthongsri)
- Fix the BBH schema key for the Dyck Languages task so the expected `dyck_languages` name is used, preventing mismatches when looking up task instructions. (#1092) (Nikita Parfenov)
- Add error catching during red-team attack synthesis so failed generations are recorded with an `error` field and don't crash the run, in both sync and async modes. (#1095) (Jeffrey Ip)
v1.3.9
- Fix contextual relevancy scoring to evaluate each retrieval context separately, then compute the final score across all verdicts. Improve the generated reason by including both irrelevant reasons and relevant statements, and update verdict parsing to match the new schema. (#1083) (Jeffrey Ip)
v1.3.6
- Fix async metric evaluation to also catch `AttributeError`, preventing crashes when a custom LLM returns unexpected types (for example, strings) during scoring. (#1058) (Robert Otting)
- Fix `generate_goldens_from_docs` to use the documented parameters for golden generation by splitting the limit into `max_goldens_per_context` and `max_contexts_per_document` with updated defaults. (#1073) (Dominik Chodounský)
v1.3.5
- Fix concurrency limiting by correctly passing `max_concurrent` into async evaluation, ensuring the semaphore is applied consistently during test execution. (#1048) (Jeffrey Ip)
- Fix `FaithfulnessMetric` truth extraction so you can optionally cap extracted truths via `truths_extraction_limit` (clamped to 0+), and show the configured limit in verbose logs for easier debugging. (#1050) (Jeffrey Ip)
- Fix red-teaming evaluation flow by updating bias grading to use the new purpose-based API and simplified success criteria, and by aligning red-team tests with the renamed `Vulnerability` and `AttackEnhancement` enums. (#1063) (Kritin Vongthongsri)
- Fix a TypeError when calling `evaluate(show_indicator=False)` by passing the missing `skip_on_missing_params` argument to `a_execute_llm_test_cases()`. (#1065) (AdrienDuff)
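The `truths_extraction_limit` clamp from #1050 amounts to treating negative limits as zero and no limit as uncapped. Sketched with an illustrative helper, not deepeval's internal function:

```python
def cap_truths(truths: list, truths_extraction_limit=None) -> list:
    # None means no cap; negative values are clamped to 0 before slicing.
    if truths_extraction_limit is None:
        return truths
    return truths[: max(0, truths_extraction_limit)]

print(cap_truths(["a", "b", "c"], 2))   # ['a', 'b']
print(cap_truths(["a", "b", "c"], -5))  # []
```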
September
September focused on smoother installs, richer telemetry, and big Synthesizer quality-of-life upgrades. Dependency constraints were relaxed (notably around opentelemetry, grpcio, and opentelemetry-sdk) alongside several version bumps, improving compatibility when used as a downstream dependency. Red-teaming and evaluation gained deeper observability and robustness, with span-based tracking, packaging/import cleanups, improved result handling, and new multimodal support via MLLMTestCase and VIEScore. The Synthesizer saw faster document-based context generation with async chunking and caching, better progress visibility with optional tqdm, and quality scoring and filtering via `critic_model`.
Backward Incompatible Change
v1.3.2
- Add a `critic_model` option to the Synthesizer for quality filtering, and update generation to handle LLMs that return a single value. Document a required chromadb 0.5.3 install for faster chunk indexing and retrieval when generating from documents. (#1039) (Kritin Vongthongsri)
v1.2.7
- Change `generate_goldens_from_docs` to always initialize the embedder before running, and to route async execution consistently through the async implementation when `async_mode` is enabled. This can affect control flow and timing for async callers. (#1025) (Kritin Vongthongsri)
New Feature
v1.3.0
- Add `skip_on_missing_params` to skip metric execution for test cases missing required fields, with a matching `--skip-on-missing-params` CLI flag. When enabled, missing-parameter errors are treated as skips instead of failing the run. (#1030) (Jeffrey Ip)
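The `skip_on_missing_params` behavior can be sketched as a pre-check that turns missing-field errors into skips; the names below are illustrative, not deepeval's internals:

```python
def run_metric(test_case: dict, required: set, skip_on_missing_params: bool) -> str:
    # Fields that are absent or None count as missing.
    missing = {k for k in required if test_case.get(k) is None}
    if missing:
        if skip_on_missing_params:
            return "skipped"  # treated as a skip instead of a failure
        raise ValueError(f"missing params: {sorted(missing)}")
    return "evaluated"

print(run_metric({"input": "hi"}, {"input", "expected_output"}, True))  # skipped
```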
v1.2.0
- Add multimodal evaluation support with `MLLMTestCase`, allowing datasets and `evaluate()` to run image-and-text test cases alongside existing LLM and conversational tests. Include a new VIEScore metric for text-to-image generation and editing quality checks. (#998) (Kritin Vongthongsri)
v1.1.7
- Add support for local LLMs and embeddings via OpenAI-compatible providers like Ollama and LM Studio using `base_url`. Add CLI setup similar to Azure OpenAI and docs for configuring local endpoints. Improve reliability by supporting `format=json` and forcing temperature to 0 for more consistent outputs. (#996) (César García)
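Pointing at an OpenAI-compatible local server comes down to a handful of settings. In this sketch, the endpoint URL (Ollama's default port) and model name are illustrative assumptions, while the JSON format and zero temperature reflect the changelog's reliability notes:

```python
def local_llm_config(base_url="http://localhost:11434/v1", model="llama3") -> dict:
    return {
        "base_url": base_url,  # OpenAI-compatible server, e.g. Ollama or LM Studio
        "model": model,
        "temperature": 0,      # forced to 0 for more consistent outputs
        "format": "json",      # request JSON-formatted responses
    }

print(local_llm_config()["base_url"])  # http://localhost:11434/v1
```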
Improvement
v1.3.2
- Prepare a new release by updating the package version metadata. (#1036) (Jeffrey Ip)
- Improve the Synthesizer documentation with a new guide covering document chunking, evolutions, and quality scoring, and clarify how context limits and quality metrics are reported. (#1037) (Kritin Vongthongsri)
v1.3.1
- Bump package version to 1.3.0 for the new release. (#1031) (Jeffrey Ip)
- Add optional async progress bars and return context quality scores when generating contexts, enabling filtering and better visibility during synthesizer runs. (#1033) (Kritin Vongthongsri)
v1.3.0
- Prepare a new release by bumping the package version to 1.2.8. (#1028) (Jeffrey Ip)
v1.2.7
- Improve synthesizer dataset publishing by prompting to overwrite or change an alias on conflicts. Add `use_case` support and disable automatic data sending when generating goldens from datasets or docs. Speed up document-based context generation with async chunking and caching. (#1016) (Kritin Vongthongsri)
- Bump the package version to 1.2.4 for this release. (#1022) (Jeffrey Ip)
- Bump the package version to 1.2.5 for the latest release. (#1024) (Jeffrey Ip)
v1.2.8
- Bump the package version to 1.2.7 for the latest release. (#1026) (Jeffrey Ip)
v1.2.3
- Prepare a new package release by bumping the project version to 1.2.1. (#1013) (Jeffrey Ip)
v1.2.4
- Prepare a new package release by updating the project version metadata. (#1020) (Jeffrey Ip)
v1.2.1
- Improve Synthesizer progress and context generation. Show a `tqdm` progress bar that can be passed through the generation loop, and include the selected method in telemetry and status text. Add clearer validation for chunk sizing and show per-file processing progress to prevent invalid context requests. (#1008) (Kritin Vongthongsri)
- Bump the package release to 1.2.0. (#1012) (Jeffrey Ip)
v1.2.0
- Bump the package version for the latest release. (#1006) (Jeffrey Ip)
v1.1.8
- Bump package version metadata for a new release. (#1000) (Jeffrey Ip)
v1.1.9
- Bump the package version to 1.1.8 for this release. (#1004) (Jeffrey Ip)
v1.1.7
- Bump the package version for a new release. (#992) (Jeffrey Ip)
- Add telemetry-based usage tracking for RedTeamer runs, capturing spans for `scan` and red-teaming golden generation in both sync and async workflows. (#999) (Kritin Vongthongsri)
v1.1.5
- Relax dependency constraints to reduce version conflicts when using the tool as a dependency in other projects, including more flexible requirements for `opentelemetry` and `grpcio`. (#939) (Martino Mensio)
- Update package metadata for a new release. (#990) (Jeffrey Ip)
v1.1.6
- Update the package version and refresh dependencies, including relaxing the `opentelemetry-sdk` pin to `~=1.24.0` to improve install compatibility. (#991) (Jeffrey Ip)
Bug Fix
v1.3.1
- Fix `ConversationalTestCase` so `copied_turns` includes every turn in multi-turn conversations instead of only the last one. (#1035) (Jaime Céspedes Sisniega)
v1.2.8
- Fix ChromaDB collection initialization by falling back to `create_collection` when getting an existing collection fails, preventing errors during document chunking. (#1027) (Kritin Vongthongsri)
v1.2.3
- Fix `evaluate` results by making `multimodal` optional with a default of `None`, preventing errors when the flag is not provided. (#1014) (Jeffrey Ip)
- Fix `generate_goldens_from_docs` so it still generates goldens when a custom model is provided. The method now only sets a default model when none is specified, preventing silent no-op runs and ensuring output is produced from the given docs. (#1017) (Dominik Chodounský)
- Fix a typo in the MMLU documentation so the import statement uses `from deepeval.benchmarks import MMLU`, matching the supported API. (#1018) (John Alling)
v1.2.4
- Fix metric data handling during evaluation by validating test case list types, caching API test case creation correctly, and skipping missing metrics data in result tables. This prevents mixed test case lists and avoids crashes or incorrect aggregation when metrics data or evaluation costs are missing. (#1021) (Jeffrey Ip)
v1.2.0
- Fix a HellaSwag task label typo by updating `POLISHING_FURNITURE` to match the expected dataset string, preventing mismatches when selecting or running that task. (#1009) (Kritin Vongthongsri)
- Fix multimodal evaluation results to return a single `TestResult` that supports text and `MLLMImage` inputs/outputs, and update examples/tests to use `MLLMImage` instead of the older image type. (#1010) (Jeffrey Ip)
- Fix MLLM evaluation stability by only recording run duration when MLLM metrics are used, and correct async result unpacking in `VIEScore` to prevent runtime errors. Add an optional `name` field to `MLLMTestCase` for better test case identification. (#1011) (Kritin Vongthongsri)
v1.1.8
- Fix red-teaming module packaging and imports by consolidating `RedTeamer` under `deepeval.red_teaming` and aligning vulnerability/metric mappings, reducing import errors and inconsistencies. (#1003) (Jeffrey Ip)
v1.1.9
- Fix incorrect success reporting for conversational test runs when an individual test case fails. Also prevent errors when metrics data is missing by handling `metrics_data=None` during result printing and aggregation. (#1005) (Jeffrey Ip)
v1.1.7
- Fix JSON output truncation by using an explicit verdict count instead of emitting an unbounded list. This prevents JSON parsing errors in some test cases, such as when only a single context is present. (#994) (John Lemmon)
- Fix sample code to include the missing `retrieval_context` variable so the "Let's breakdown what happened" section runs as written and matches the surrounding explanation. (#995) (César García)
August
August focused on a major stabilization and API-polish push, culminating in the 1.0.0 release and subsequent rapid version updates. Observability and feedback workflows were streamlined with monitor() as the primary logging API (standardizing on response_id) and clearer guides for monitoring, tracing, and reviewer/user feedback via send_feedback(). Evaluation gained stronger multi-turn support and richer metrics, including new conversational messages, ConversationCompletenessMetric, improved tool correctness (exact and ordered matching), standardized metrics_data reporting, and parameter naming cleanups like tools_used/tools_called. Reliability and schema enforcement also saw a broad round of improvements.
Backward Incompatible Change
v1.1.4
- Rename `tools_used` to `tools_called` across LLM test cases and the tool correctness metric, aligning parameter names in evaluation, API payloads, and documentation. (#989) (Jeffrey Ip)
v1.0.5
- Update the public API to use `*Metric` class names (for example `ConversationCompletenessMetric` and `ConversationRelevancyMetric`) and refresh related examples/tests to match. (#960) (Jeffrey Ip)
v1.0.2
- Bump the package version to 1.0.0 for the new major release. (#951) (Jeffrey Ip)
New Featureβ
v1.1.2β
- Add concurrent evaluation with run_async=True to execute metrics across test cases in parallel, with optional progress output. Improve reliability with ignore_errors and better metric copying so runs don't interfere with each other. (#985) (Jeffrey Ip)
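The run_async=True behavior described above (#985) follows a standard asyncio fan-out pattern. The sketch below is generic, not deepeval's implementation; the metric function and result shapes are assumed:

```python
import asyncio

async def run_metric(metric_fn, test_case, ignore_errors):
    # One task per test case; with ignore_errors set, a failing metric
    # becomes a failed result instead of aborting the whole run.
    try:
        return await metric_fn(test_case)
    except Exception as exc:
        if ignore_errors:
            return {"success": False, "error": str(exc)}
        raise

async def evaluate_concurrently(metric_fn, test_cases, ignore_errors=True):
    tasks = [run_metric(metric_fn, tc, ignore_errors) for tc in test_cases]
    return await asyncio.gather(*tasks)

async def relevancy(test_case):
    if not test_case:
        raise ValueError("empty test case")
    return {"success": True, "score": 1.0}

results = asyncio.run(evaluate_concurrently(relevancy, ["q1", "", "q3"]))
```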
v1.0.7β
- Add a red team scanner with built-in graders to test LLM outputs for common safety and security issues (for example bias, hallucination, PII, and injection risks), with optional async execution and detailed reasons. (#938) (Kritin Vongthongsri)
v1.0.2β
- Add support for supplying a custom TestRunManager when running evaluations, while keeping a global default. This makes it easier to isolate test-run state and caching across multiple runs or integrations. (#955) (Jeffrey Ip)
v0.21.75β
- Add conversational messages to better model multi-turn evaluations, letting you mark which turns should be evaluated and enabling conversation-level relevancy metrics. (#935) (Jeffrey Ip)
- Add a ConversationCompleteness conversational metric to score whether a multi-turn chat fully addresses the user's intentions, with configurable threshold, strict mode, async evaluation, and verbose logs. (#941) (Jeffrey Ip)
Improvementβ
v1.1.4β
- Bump the package version metadata to 1.1.3 for this release. (#988) (Jeffrey Ip)
v1.1.3β
- Update package metadata for a new release by bumping the version number. (#986) (Jeffrey Ip)
v1.1.2β
- Bump package version to 1.1.1 for a new release. (#978) (Jeffrey Ip)
v1.1.1β
- Bump the package version to 1.1.0 for the latest release. (#970) (Jeffrey Ip)
- Improve LangChain tracing docs by clarifying how to return sources with RunnableParallel, including an example that assigns the RAG chain output using the output key. (#974) (Kritin Vongthongsri)
v1.1.0β
- Bump the package version to 1.0.9 for the latest release. (#968) (Jeffrey Ip)
v1.0.7β
- Bump the package version metadata to 1.0.6 for this release. (#962) (Jeffrey Ip)
v1.0.8β
- Update package metadata for a new release. (#966) (Jeffrey Ip)
v1.0.9β
- Bump package version metadata to 1.0.8 for this release. (#967) (Jeffrey Ip)
v1.0.6β
- Bump the package release version to 1.0.5. (#961) (Jeffrey Ip)
v1.0.5β
- Bump package version to 1.0.4. (#958) (Jeffrey Ip)
v1.0.4β
- Bump the package version to 1.0.3 for the new release. (#957) (Jeffrey Ip)
v1.0.3β
- Bump package version to 1.0.2 for the latest release. (#956) (Jeffrey Ip)
v1.0.0β
- Update package metadata for a new release. (#950) (Jeffrey Ip)
v0.21.78β
- Prepare a new release by updating the package version metadata. (#948) (Jeffrey Ip)
- Improve evaluation results reporting by standardizing metric output as metrics_data with a consistent name field, so tables and API payloads display metric status, scores, reasons, and errors more reliably. (#949) (Jeffrey Ip)
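The standardized metrics_data records described above (#949) can be pictured as one flat record per metric. Only the field names mentioned in the entry (name, scores, reasons, errors, status) come from the source; the exact shape below is assumed for illustration:

```python
def to_metrics_data(name, score, threshold, reason=None, error=None):
    # One standardized record per metric, so result tables and API
    # payloads can rely on the same keys for every metric type.
    return {
        "name": name,
        "score": score,
        "threshold": threshold,
        "success": error is None and score is not None and score >= threshold,
        "reason": reason,
        "error": error,
    }

row = to_metrics_data("Answer Relevancy", 0.9, 0.7, reason="on-topic answer")
```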
v0.21.75β
- Bump the package version for a new release. (#922) (Jeffrey Ip)
- Improve evaluation test case documentation by adding optional tools_used and expected_tools fields. Clarifies how these parameters are used in agent evaluation metrics and updates examples accordingly. (#923) (Kritin Vongthongsri)
- Improve documentation for human feedback by adding dedicated guides for sending user feedback via send_feedback() and managing reviewer feedback in the UI, with updated navigation in the docs sidebar. (#925) (Kritin Vongthongsri)
- Improve documentation for LLM monitoring to make setup and usage clearer. (#926) (Kritin Vongthongsri)
- Add monitor() as the primary API for logging model outputs and rename returned IDs to response_id. Keep track() as a compatibility wrapper that forwards to monitor() and prints a deprecation notice. Update send_feedback to use response_id. (#927) (Jeffrey Ip)
- Improve tracing documentation with embedded videos and framework icons for LangChain and LlamaIndex, making it easier to recognize trace types and understand setup at a glance. (#928) (Kritin Vongthongsri)
- Improve benchmark output confinement by enforcing JSON/schema-based answers for BigBenchHard and DROP, with a fallback to prompt-based constraints when schema generation is unsupported. (#930) (Kritin Vongthongsri)
- Improve benchmark output typing by renaming enforced generation classes from models to schema, and updating imports across built-in benchmarks to match the new names. (#934) (Jeffrey Ip)
- Add documentation for the ConversationCompletenessMetric, including required arguments, examples, and how the score is calculated. Also fix the conversation relevancy docs to correctly state the number of optional parameters. (#942) (Jeffrey Ip)
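The monitor()/track() transition described above (#927) is a standard deprecation-forwarding pattern. A generic sketch, not deepeval's implementation; the backend call is elided and the ID format is assumed:

```python
import uuid
import warnings

def monitor(event_name, response, **kwargs):
    """Primary logging entry point (sketch): returns a response_id
    that later feedback calls can reference."""
    response_id = str(uuid.uuid4())
    # ... send {event_name, response, **kwargs} to the backend here ...
    return response_id

def track(*args, **kwargs):
    # Compatibility wrapper: forward to monitor() and flag the rename,
    # so existing callers keep working while seeing the deprecation.
    warnings.warn("track() is deprecated; use monitor() instead",
                  DeprecationWarning)
    return monitor(*args, **kwargs)
```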
v0.21.76β
- Update package metadata for a new release, including the recorded version. (#943) (Jeffrey Ip)
- Improve the tool correctness metric by supporting exact matching and optional ordering checks, with clearer verbose logs and reasons. This makes scores more accurate when tool call sequence matters or must match exactly. (#945) (Jeffrey Ip)
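The exact and ordered matching modes described above (#945) can be sketched with a simplified scorer. The signature and default scoring rule here are illustrative, not deepeval's exact API:

```python
def tool_correctness(tools_called, expected_tools,
                     exact_match=False, consider_ordering=False):
    """Score tool usage under the three matching modes (simplified)."""
    if exact_match:
        # The call sequence must match the expectation exactly.
        return 1.0 if tools_called == expected_tools else 0.0
    if consider_ordering:
        # Expected tools must appear as an in-order subsequence of the calls.
        calls = iter(tools_called)
        return 1.0 if all(tool in calls for tool in expected_tools) else 0.0
    # Default: fraction of expected tools that were called at all.
    hits = sum(1 for tool in set(expected_tools) if tool in tools_called)
    return hits / len(expected_tools)
```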
v0.21.77β
- Update package metadata for a new release. (#946) (Jeffrey Ip)
Bug Fixβ
v1.1.1β
- Fix metric Pydantic schemas to prevent ValidationErrors when using custom LLM judges: allow Verdicts.reason to be optional and correct GEval Steps.steps to List[str]. Add tests to cover these schema validations. (#963) (harriet-wood)
- Fix multiple schema mismatches by making verdict reason a required string and correcting the BBH boolean task key. This improves consistency when generating structured outputs and avoids failures caused by missing or null reason fields. (#971) (Jeffrey Ip)
- Improve output formatting and compatibility when printing Pydantic models by supporting both model_dump() (v2) and dict() (v1) during pretty-printing. (#977) (Jeffrey Ip)
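The Pydantic v1/v2 pretty-printing fix above (#977) boils down to a small compatibility shim; the dummy classes below stand in for real Pydantic models:

```python
def to_dict(model):
    # Pydantic v2 exposes model_dump(); v1 only has dict().
    # Checking for the v2 method first works for both versions.
    if hasattr(model, "model_dump"):
        return model.model_dump()
    return model.dict()

class V2Style:
    def model_dump(self):
        return {"score": 0.9}

class V1Style:
    def dict(self):
        return {"score": 0.9}
```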
v1.0.5β
- Fix dependency conflicts by updating OpenTelemetry to a newer release. This prevents ModuleNotFoundError: No module named 'opentelemetry.semconv.attributes' when using libraries that rely on the new semantic-convention structure, such as Arize/Phoenix. (#952) (Federico Sierra)
- Fix check_llm_test_case_params to set metric.error before raising ValueError when a non-LLMTestCase is provided, ensuring the error message is preserved for callers. (#959) (G. Caglia)
v1.0.0β
- Fix ContextGenerator.generate_contexts() to reliably generate the requested number of contexts, especially for small documents where num_chunks is lower than num_contexts. Improve test reliability by adding missing test dependencies and updating several tests to avoid import-time execution issues. (#932) (fschuh)
Julyβ
July focused on more reliable evaluation and tracing across LangChain and LlamaIndex, with new one-line integration helpers and more consistent, structured input/output capture to reduce missing fields. Synthetic data and red-teaming workflows saw a major usability pass, including new dataset helpers, async generation options, schema-enforced outputs via schema=, and clearer docs and renamed APIs around attacks, vulnerabilities, and evolution settings. Metrics and tooling improved with Pydantic-backed JSON outputs, better verbose logging via verboseLogs, the new ToolCorrectnessMetric, and prompt refinements for benchmarks like GSM8K and HumanEval. The release also included a steady set
Backward Incompatible Changeβ
v0.21.66β
- Simplify feedback submission by removing the provider argument and returning less data from send_feedback, while still sending the same feedback payload. (#879) (Jeffrey Ip)
v0.21.63β
- Remove deployment config support from the test runner and pytest plugin, including the --deployment option. Test runs now only capture the test file name and avoid opening result links when running in CI environments. (#860) (Jeffrey Ip)
v0.21.64β
- Rename red-teaming enums and parameters for clearer intent: RedTeamEvolution/Response become RTAdversarialAttack/RTVulnerability, and generate_red_teaming_goldens now uses attacks and vulnerabilities (with updated defaults). (#863) (Jeffrey Ip)
New Featureβ
v0.21.74β
- Add ToolCorrectnessMetric to score whether a test case used the expected tools, with optional strict and verbose modes. Test cases and API payloads now accept tools_used and expected_tools so tool-usage expectations can be evaluated and reported. (#920) (Kritin Vongthongsri)
v0.21.69β
- Add an optional additional_metadata parameter to add_test_cases_from_csv_file() so you can attach extra metadata when importing LLM test cases from a CSV. Updated type hints and docs to reflect the new argument. (#902) (Ladislas Walewski)
v0.21.68β
- Add support for the gpt-4o-mini model option when selecting valid GPT models. (#898) (João Felipe Pizzolotto Bini)
v0.21.67β
- Add async_mode for synthetic data generation so document loading and chunking can run concurrently via asyncio, improving throughput when processing many source files. Also remove a stray debug print from the synthesizer progress output. (#892) (Kritin Vongthongsri)
v0.21.66β
- Add an Integrations helper to enable one-line tracing setup for LangChain and LlamaIndex apps via Integrations.trace_langchain() and Integrations.trace_llama_index(). This centralizes integration setup and updates docs and examples to use the new API. (#880) (Kritin Vongthongsri)
- Add --verbose/-v to enable verbose metric output in test run, and support a verbose_mode override in evaluate() to print intermediate metric steps when debugging. (#884) (Jeffrey Ip)
- Add automatic tracing for LangChain and LlamaIndex runs, including model, token usage, retrieval context, and inputs/outputs. Tracing now triggers track() automatically when LangChain is the outermost provider, reducing the need for manual instrumentation. (#890) (Kritin Vongthongsri)
v0.21.65β
- Add LangChain integration that hooks into LangChain callbacks to automatically capture chain, tool, retriever, and LLM traces, including inputs/outputs, metadata, and timing. Also improve error status handling for LlamaIndex traces. (#859) (Kritin Vongthongsri)
- Add generate_goldens_from_scratch to create synthetic Goldens from only a subject, task, and output format, with optional prompt evolutions to increase diversity. Includes documentation and a basic test example. (#868) (Kritin Vongthongsri)
- Add support for logging a list of Link values in additional_data when tracking events. This lets you attach multiple links under one key, with stricter validation to reject mixed or unsupported list items. (#877) (Jeffrey Ip)
v0.21.63β
- Add dataset helpers to synthesize goldens from scratch, prompts, documents, and red-team scenarios, with configurable evolution types and optional expected outputs. This makes it easier to generate both standard and adversarial test data directly from an EvaluationDataset. (#857) (Kritin Vongthongsri)
Improvementβ
v0.21.74β
- Improve tracing payload capture for LangChain and LlamaIndex runs by recording structured input/output payloads on each trace and deriving readable input/output values when keys vary. This makes trace data more consistent and easier to inspect. (#894) (Kritin Vongthongsri)
- Prepare a new release by updating the package version metadata. (#913) (Jeffrey Ip)
- Remove the redundant generation prompt so multiple-choice outputs start directly with Answer: instead of extra instructions. (#918) (Wenjie Fu)
- Improve synthetic data generation by adding a shared schema and supporting enforced model outputs via schema=. Falls back to JSON parsing when schema enforcement is not available, improving compatibility across LLM backends. (#919) (Kritin Vongthongsri)
- Add documentation for the Tool Correctness metric, including required arguments, scoring behavior, and an example. Improve synthetic data docs with a clarification and a tip for troubleshooting invalid JSON when using custom models. (#921) (Kritin Vongthongsri)
v0.21.72β
- Update package metadata for a new release. (#908) (Jeffrey Ip)
v0.21.73β
- Improve packaging metadata and minor formatting to support the latest release. (#911) (Jeffrey Ip)
v0.21.69β
- Bump the package version for the latest release. (#899) (Jeffrey Ip)
- Improve synthetic data docs by replacing the enable_breadth_evolve flag with the IN_BREADTH evolution option and updating the listed available evolutions. This clarifies how to configure breadth-style evolutions when generating synthetic datasets. (#900) (Kritin Vongthongsri)
- Improve tracing documentation with new LangChain and LlamaIndex integration guides, including one-line setup examples and embedded walkthrough videos for faster onboarding. (#901) (Kritin Vongthongsri)
- Support passing custom args and kwargs when creating the OpenAI embedding client, so you can forward extra provider settings without modifying the tool. (#903) (Jeffrey Ip)
v0.21.70β
- Update the package metadata for a new release. (#904) (Jeffrey Ip)
v0.21.71β
- Update package version metadata for the new release. (#905) (Jeffrey Ip)
- Add async document embedding support when generating contexts from docs, using a_embed_texts for non-blocking chunk processing. Improve validation by raising a clear error if contexts are requested before documents are loaded. (#907) (Jeffrey Ip)
v0.21.68β
- Update package metadata for a new release. (#896) (Jeffrey Ip)
v0.21.67β
- Prepare a new release by updating the package version metadata. (#891) (Jeffrey Ip)
- Improve custom LLM guide examples by consistently using a schema parameter for JSON generation and schema parsing. This reduces confusion when instantiating and validating structured outputs from models. (#893) (Kritin Vongthongsri)
- Improve verbose output by capturing metric intermediate steps into verboseLogs metadata instead of only printing them. This makes verbose details easier to collect and inspect after a run while still printing when verbose_mode is enabled. (#895) (Jeffrey Ip)
v0.21.66β
- Add Pydantic schema support for JSON-based metric outputs, allowing models to return typed Reason, Verdicts, and Statements objects with a safe fallback to JSON parsing when schema generation isn't supported. (#874) (Kritin Vongthongsri)
- Add a JSON Enforcement guide showing how to use Pydantic schemas to validate custom evaluation LLM outputs and prevent invalid JSON errors. Includes practical tutorials for common libraries and providers so evaluations continue instead of failing on malformed responses. (#875) (Kritin Vongthongsri)
- Prepare a new package release by updating the project version metadata. (#878) (Jeffrey Ip)
- Fix spelling and grammar issues across several documentation pages to improve clarity and reduce confusion when following evaluation and RAG guidance. (#885) (Philip Nuzhnyi)
- Improve documentation clarity by fixing spelling and grammar issues in the metrics introduction, including wording around default metrics and async execution behavior. (#886) (Philip Nuzhnyi)
- Improve metric module organization by renaming internal models modules to schema across several metrics, aligning imports and naming for clarity and consistency. (#888) (Jeffrey Ip)
- Improve docs for rouge_score by noting that the rouge-score package must be installed separately, preventing missing-dependency errors when starting a new project. (#889) (oftenfrequent)
v0.21.65β
- Bump the package version for this release. (#864) (Jeffrey Ip)
- Improve GSM8K prompting to handle 0-shot and enable_cot runs by adding step-by-step instructions only when requested and keeping non-CoT prompts concise with a numerical final answer. (#866) (Alejandro Companioni)
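The GSM8K prompting change above (#866) amounts to conditional prompt assembly. A sketch with illustrative wording; the real templates differ:

```python
def build_prompt(question, enable_cot, shots=()):
    """Assemble a GSM8K-style prompt: chain-of-thought instructions only
    when requested, a concise numeric-answer instruction otherwise."""
    parts = list(shots)  # empty shots == a 0-shot prompt
    if enable_cot:
        parts.append("Think step by step, then state the final numerical answer.")
    else:
        parts.append("Answer with a single number only.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```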
v0.21.63β
- Prepare a new package release by updating the tool's internal version metadata. (#851) (Jeffrey Ip)
- Improve tracing stability for the LlamaIndex integration by unifying trace data and updating attribute handling (for LLM, embedding, reranking, and agent events). This reduces missing or inconsistent fields when capturing inputs/outputs during runs. (#852) (Kritin Vongthongsri)
- Improve synthetic dataset docs by replacing prompt- and scratch-based generation guidance with a dedicated red-teaming workflow using generate_red_team_goldens, including contexts, evolution types, and response targets. This clarifies how to synthesize vulnerability-focused test cases with or without retrieval context. (#858) (Kritin Vongthongsri)
- Improve dataset and synthesizer APIs by renaming red-teaming generation and evolution parameters for consistency (generate_red_teaming_goldens, evolutions). Also rename the synthesizer types module import path to deepeval.synthesizer.types. (#861) (Jeffrey Ip)
v0.21.64β
- Prepare a new package release by updating the project version metadata. (#862) (Jeffrey Ip)
Bug Fixβ
v0.21.74β
- Fix evaluation so a metric error from one test case doesn't carry over to later test cases. The metric error state is reset for each test case, preventing unrelated failures in subsequent results. (#915) (wanghuanjing)
v0.21.73β
- Fix dependency conflicts by updating tenacity and pinning grpcio and OpenTelemetry gRPC packages to compatible versions, improving install reliability. (#912) (Jeffrey Ip)
v0.21.66β
- Fix get_model_name to be a synchronous method instead of async, simplifying model implementations and avoiding unnecessary awaits. (#871) (Andrés)
- Fix --login command failure caused by incorrect use of Annotations. This restores login functionality in Docker/Ubuntu without regressing macOS behavior. (#883) (Jerry D Boonstra)
v0.21.65β
- Fix Pyright false-positive errors when creating Golden models with minimal arguments by making optional Pydantic Field defaults explicit (e.g., default=None). This prevents the type checker from treating optional fields as required. (#867) (Sebastian Kucharzyk)
- Fix HumanEval prompt text by removing the hardcoded temperature instruction, so generated prompts no longer force a specific temperature value. (#869) (Kritin Vongthongsri)
v0.21.63β
- Fix weighted_summed_score in GEval metrics by correctly accumulating repeated token probabilities before normalization. This prevents normalization errors when the same token appears multiple times in score_logprobs. (#854) (Song Tingyu)
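The weighted_summed_score fix above (#854) hinges on summing the probabilities of repeated token scores before normalizing once. A self-contained sketch of that arithmetic; the (score, probability) input shape is assumed:

```python
from collections import defaultdict

def weighted_summed_score(score_logprobs):
    """Weighted average of token scores, accumulating probabilities for
    repeated scores first so duplicates don't break normalization."""
    weights = defaultdict(float)
    for score, prob in score_logprobs:
        weights[score] += prob          # accumulate repeats before normalizing
    total = sum(weights.values())
    return sum(score * prob for score, prob in weights.items()) / total
```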
Juneβ
June focused on making evaluations and synthetic data generation more robust, configurable, and easier to diagnose. Tracing and metrics got clearer typing/documentation, improved parsing and JSON-only reason handling, stronger error and retry visibility, and multiple fixes around metric state isolation and async reliability, including a later revert to restore instance-based state behavior. The Synthesizer advanced with new evolution capabilities via evolve(), broader guidance and options like evolution_types, a new Text-to-SQL use case, and support for custom embedding models through the embedder interface. Benchmarks gained an optional dataset hook for local/custom runs, and API/
New Featureβ
v0.21.58β
- Add extra Synthesizer support for evolving prompts and contexts, including configurable evolution types and breadth evolution. This makes it easier to generate more varied synthetic inputs from either raw prompts or source contexts. (#828) (Kritin Vongthongsri)
- Add a Text-to-SQL synthesizer use case that generates schema-aware inputs and can optionally produce expected SQL outputs, alongside the existing QA flow. (#837) (Kritin Vongthongsri)
v0.21.52β
- Add support for passing a custom embedding model to the synthesizer and context generator. When not provided, the default OpenAI embedder is still used. (#815) (Jonas)
- Add support for custom embedding models via the embedder parameter, including an OpenAI-based embedding model implementation. Update the embedding model interface to use embed_text/embed_texts (plus async variants) and require get_model_name() for consistent model identification. (#822) (Jeffrey Ip)
v0.21.51β
- Add support for pushing conversational datasets alongside standard goldens, and allow push() to optionally control overwrite behavior when uploading a dataset. (#817) (Jeffrey Ip)
v0.21.49β
- Add evolve() to generate more complex query variants by applying multiple evolution templates over several rounds, with optional breadth evolution for added diversity. (#802) (Kritin Vongthongsri)
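The evolve() behavior above (#802) can be sketched as repeated template application over several rounds. The template functions below are stand-ins for the real evolution prompts:

```python
def evolve(query, evolution_templates, num_evolutions=2):
    """Apply one evolution template per round, growing query complexity."""
    for i in range(num_evolutions):
        template = evolution_templates[i % len(evolution_templates)]
        query = template(query)
    return query

def deepen(q):
    # Stand-in for a depth-style evolution prompt.
    return f"{q} Explain the reasoning behind your answer."

def in_breadth(q):
    # Stand-in for a breadth-style evolution prompt.
    return f"{q} Also compare at least two alternatives."

evolved = evolve("What is caching?", [deepen, in_breadth])
```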
Improvementβ
v0.21.61β
- Prepare a new release by updating the package version metadata. (#846) (Jeffrey Ip)
v0.21.62β
- Bump the package version for a new release. (#849) (Jeffrey Ip)
v0.21.60β
- Prepare a new release by updating the package version metadata. (#842) (Jeffrey Ip)
v0.21.58β
- Improve Synthesizer docs by expanding synthetic dataset generation guidance to four approaches, including generating from prompts and from scratch. Document the new evolution_types option across generation methods and clarify what each method populates. (#831) (Kritin Vongthongsri)
- Update the package metadata for a new release. (#835) (Jeffrey Ip)
- Improve the Synthesizer by exposing UseCase in the public API and showing the selected use case in the generation progress output. Also remove stray local-path and demo __main__ code to keep the module clean. (#839) (Jeffrey Ip)
v0.21.59β
- Prepare a new release by updating package metadata and the reported version. (#840) (Jeffrey Ip)
v0.21.56β
- Add stateless execution support for most metrics by tracking required context and updating measure/a_measure, including async handling to avoid lost context. Indicators were also updated to work with a_measure. RAGAS and knowledge-retention metrics are not yet covered. (#806) (Kritin Vongthongsri)
- Bump the package version for a new release. (#827) (Jeffrey Ip)
- Improve metric statelessness by storing intermediate results in per-instance context variables and adding verbose_mode output for Answer Relevancy. This reduces cross-test contamination when running evaluations concurrently and makes debugging intermediate steps easier. (#830) (Jeffrey Ip)
v0.21.57β
- Prepare a new package release by updating the tool's version metadata. (#833) (Jeffrey Ip)
v0.21.54β
- Bump package version for a new release. (#825) (Jeffrey Ip)
v0.21.55β
- Bump the package version for a new release. (#826) (Jeffrey Ip)
v0.21.52β
- Update package metadata for a new release. (#818) (Jeffrey Ip)
- Add an optional dataset argument to benchmarks so you can run them on locally loaded or custom datasets without depending on HuggingFace access. (#820) (Alberto Romero)
v0.21.53β
- Prepare a new package release by updating the project version metadata. (#823) (Jeffrey Ip)
v0.21.51β
- Bump the package version to 0.21.50 for this release. (#813) (Jeffrey Ip)
- Improve metrics JSON parsing by recovering from missing closing brackets when the end of the JSON isn't found. This makes evaluations more resilient to slightly malformed model outputs, especially from custom LLMs. (#816) (Jonas)
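The bracket-recovery behavior above (#816) can be approximated with a small repair step before json.loads. This is a sketch: it only restores missing closing braces and cannot repair output truncated mid-string:

```python
import json

def trim_and_load(raw: str):
    """Best-effort recovery for slightly malformed model JSON: slice from
    the first '{' and append any missing closing braces."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    text = raw[start:]
    missing = text.count("{") - text.count("}")
    if missing > 0:
        text += "}" * missing       # repair a truncated tail
    return json.loads(text)

# Leading chatter and a missing final brace are both recovered.
repaired = trim_and_load('Sure! {"verdict": "yes", "reason": "grounded"')
```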
v0.21.49β
- Prepare a new release by updating the package version metadata. (#799) (Jeffrey Ip)
- Improve tracer type hints by adding clearer comments for expected output shapes across LLM, embedding, retriever, and reranking traces. (#801) (Kritin Vongthongsri)
- Add a new guide for the Answer Correctness metric, including how to build a custom correctness evaluator with GEval, choose evaluation parameters and steps, and set a practical scoring threshold. (#803) (Kritin Vongthongsri)
- Update the default API base URL to https://api.confident-ai.com and adjust request URL construction to avoid double slashes. This helps API calls route to the correct endpoint more reliably. (#807) (Jeffrey Ip)
v0.21.50β
- Bump the package release version metadata. (#808) (Jeffrey Ip)
- Improve visibility into OpenAI rate-limit retries by logging an error after each retry attempt. Logs include the current attempt count to help diagnose throttling and backoff behavior. (#812) (Jeffrey Ip)
Bug Fixβ
v0.21.61β
- Fix superclass initialization in ragas.py by switching from super.__init__() to super().__init__(). This prevents TypeError during metric construction and ensures base class setup runs before class-specific attributes. (#848) (Rishi)
v0.21.62β
- Revert recent stateless metric behavior changes so metric state is stored on the metric instance again. This restores the previous async execution flow and defaults verbose output back to enabled. (#850) (Jeffrey Ip)
v0.21.60β
- Fix dataset and benchmark parsing by consistently using expected_output and converting API response keys to snake_case, improving compatibility with camelCase payloads. (#845) (Jeffrey Ip)
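The camelCase-to-snake_case conversion above (#845) is a one-line regex in Python; the payload shape below is illustrative:

```python
import re

def to_snake_case(key: str) -> str:
    # expectedOutput -> expected_output, actualOutput -> actual_output
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()

def snakeify(payload: dict) -> dict:
    """Normalize a camelCase API response dict to snake_case keys."""
    return {to_snake_case(k): v for k, v in payload.items()}
```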
v0.21.59β
- Fix metric state initialization by moving ContextVar fields to BaseMetric.__init__ and calling super().__init__() in metric constructors. This prevents state from being shared across metric classes and improves isolation when running multiple metrics. (#841) (Jeffrey Ip)
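The state-initialization fixes here (#841, and the related super().__init__() fix in #848) come down to a familiar Python pattern: base-class state must be set up via super().__init__() so every metric instance gets its own fields. A generic sketch, not deepeval's actual classes:

```python
class BaseMetric:
    def __init__(self):
        # Per-instance state lives on the base class; any subclass that
        # calls super().__init__() gets fresh score/reason/error fields
        # instead of sharing them across instances or classes.
        self.score = None
        self.reason = None
        self.error = None

class AnswerRelevancyMetric(BaseMetric):
    def __init__(self, threshold=0.5):
        # Calling super().__init__() (not super.__init__(), which raises
        # TypeError) ensures base state is set before subclass attributes.
        super().__init__()
        self.threshold = threshold

m1, m2 = AnswerRelevancyMetric(), AnswerRelevancyMetric()
m1.score = 0.9  # does not leak into m2
```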
v0.21.56β
- Fix the TestResult field name to use metrics_metadata consistently, improving compatibility for users accessing metric results programmatically. (#832) (Jeffrey Ip)
v0.21.57β
- Fix BaseMetric state isolation by assigning new ContextVar instances per metric class, preventing score, reason, and error values from leaking across metrics in concurrent or multi-metric runs. (#834) (Jeffrey Ip)
v0.21.53β
- Fix metric reason output to return a JSON reason value instead of raw model text. Prompts now request JSON-only responses and reason parsing trims/loads the JSON for more reliable include_reason results. (#824) (Jeffrey Ip)
v0.21.49β
- Fix a typo in the Answer Correctness Metric guide by removing stray markup around the G-Eval reference. (#804) (Kritin Vongthongsri)
v0.21.50β
- Fix bias and toxicity metric prompt templates by formatting rubrics as JSON for more consistent model parsing. Improve metric runner error handling so ignore_errors reliably marks failing metrics as unsuccessful instead of crashing async runs. (#811) (Jeffrey Ip)
Mayβ
May focused on making evaluations more observable, faster, and easier to analyze, with major work around tracing, richer event metadata, and clearer reporting across datasets. The release added OpenTelemetry-style tracing for evaluation runs, improved metadata serialization and retrieval/reranking trace details, and introduced conveniences like aggregated pass-rate summaries, optional batch scoring via batch_size, and hyperparameters logging for reproducible runs. Dataset and CLI usability improved as well, including better golden generation with include_expected_output, saving paths from EvaluationDataset.save_as, Azure embedding deployment configuration, and more reliable large run
Backward Incompatible Changeβ
v0.21.38β
- Constrain send_feedback ratings to the 0–5 range and raise a clear error for out-of-range values. Documentation now reflects the updated rating scale. (#752) (Jeffrey Ip)
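The rating constraint above (#752) is a simple guard; the function name and message wording are illustrative:

```python
def validate_rating(rating):
    """Reject out-of-range feedback ratings early with a clear message,
    matching the documented 0-5 scale."""
    if not (0 <= rating <= 5):
        raise ValueError(f"rating must be between 0 and 5, got {rating}")
    return rating
```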
New Featureβ
v0.21.46β
- Add new tracing types and metadata for retrieval and reranking, and include conversational test cases when uploading large test runs in batches. This improves observability and makes large mixed test runs more reliable to send. (#791) (Jeffrey Ip)
- Add new trace types for retriever and reranking events, with richer metadata such as topK, reranker model, and average chunk size. Improve LLM and embedding metadata serialization by using stable field aliases like tokenCount and vectorLength for compatibility across integrations. (#795) (Jeffrey Ip)
v0.21.45β
- Add optional hyperparameters logging to evaluate() so test runs can record the model and prompt template used. Raises a clear error if required keys are missing. (#785) (Jeffrey Ip)
v0.21.43β
- Add optional batch generation to benchmark evaluation via batch_size to speed up scoring when the model supports batch_generate, with a safe fallback to per-sample generation. (#774) (Jeffrey Ip)
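The batch_size fallback above (#774) can be sketched as follows; the model method names mirror the changelog entry, everything else is assumed:

```python
def generate_all(model, prompts, batch_size=None):
    """Score with batch_generate in slices of batch_size when the model
    supports it; otherwise fall back to one generate() call per sample."""
    if batch_size and hasattr(model, "batch_generate"):
        outputs = []
        for i in range(0, len(prompts), batch_size):
            outputs.extend(model.batch_generate(prompts[i:i + batch_size]))
        return outputs
    return [model.generate(p) for p in prompts]

class EchoModel:
    # Toy model for illustration only.
    def generate(self, prompt):
        return prompt.upper()
    def batch_generate(self, prompts):
        return [p.upper() for p in prompts]
```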
v0.21.40β
- Add typed custom properties for event tracking so additional_data can include text, JSON dicts, or Link values. This replaces the previous string-only validation and sends the data as customProperties. (#761) (Jeffrey Ip)
v0.21.41β
- Add CLI support to set a dedicated Azure OpenAI embedding deployment name, and use it when initializing Azure embeddings. Unsetting Azure OpenAI now also clears the embedding deployment setting. (#764) (Jeffrey Ip)
v0.21.38β
- Add optional expected output generation for synthetic goldens via include_expected_output, and make dataset golden generation work without explicitly passing a synthesizer. (#753) (Jeffrey Ip)
v0.21.37β
- Add tracing integration to capture and pass trace context during evaluations, including LlamaIndex callback events. This improves visibility into LLM, embedding, and tool execution steps and helps surface errors with clearer trace outputs. (#725) (Kritin Vongthongsri)
- Add OpenTelemetry-based tracing for evaluation runs, including CLI test runs and per-test-case execution, to improve observability of evaluation performance and behavior. (#746) (Jeffrey Ip)
- Add a helper to show pass rates aggregated across all TestResult items, making it easier to understand how each metric performs over an entire evaluation dataset instead of only per test case. (#749) (Yudhiesh Ravindranath)
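The pass-rate helper above (#749) aggregates per metric across every result in a run. A sketch with an assumed result shape, not deepeval's TestResult type:

```python
from collections import defaultdict

def aggregate_pass_rates(test_results):
    """Compute one pass rate per metric name over all results."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for result in test_results:
        for metric in result["metrics"]:
            total[metric["name"]] += 1
            passed[metric["name"]] += int(metric["success"])
    return {name: passed[name] / total[name] for name in total}

rates = aggregate_pass_rates([
    {"metrics": [{"name": "Answer Relevancy", "success": True}]},
    {"metrics": [{"name": "Answer Relevancy", "success": False}]},
])
```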
Improvementβ
v0.21.47β
- Prepare a new release by updating package version metadata. (#796) (Jeffrey Ip)
v0.21.48β
- Update package metadata for a new release. (#797) (Jeffrey Ip)
v0.21.46β
- Prepare a new release by bumping the package version. (#788) (Jeffrey Ip)
- Add pagination when posting large test runs with conversational test cases, sending both regular and conversational cases in batches to avoid oversized requests. Also fix a few broken documentation links. (#789) (Jeffrey Ip)
v0.21.44β
- Update package metadata for a new release. (#777) (Jeffrey Ip)
- Fix a typo in the RAG evaluation guide by correcting secrch to search in the description of vector search. (#780) (Jeroen Overschie)
v0.21.45β
- Bump the package version to reflect a new release. (#784) (Jeffrey Ip)
- Fix a spelling error in the getting started docs by replacing environement with environment in headings and setup instructions. (#786) (Jeroen Overschie)
- Improve documentation for evaluate() and test cases by linking to accepted arguments and adding examples for logging hyperparameters. Also clarify imports and show how to log in and track hyperparameters for Confident AI runs. (#787) (Jeffrey Ip)
v0.21.43β
- Add optional trace_stack and trace_provider fields to event tracking so integrations can attach structured trace context to tracked events. (#758) (Kritin Vongthongsri)
- Bump package version metadata for a new release. (#766) (Jeffrey Ip)
v0.21.42β
- Prepare a new release by updating the package version metadata. (#765) (Jeffrey Ip)
v0.21.40β
- Bump the package version for a new release. (#756) (Jeffrey Ip)
- Improve the custom metrics guide by fixing the ROUGE scoring example and noting that rouge-score must be installed before use. (#760) (oftenfrequent)
v0.21.41β
- Update the package release metadata to a new version. (#763) (Jeffrey Ip)
v0.21.38β
- Bump package version for a new release. (#750) (Jeffrey Ip)
- Improve EvaluationDataset.save_as by returning the full saved file path, making it easier to reuse the output location programmatically. (#751) (jakelucasnyc)
- Add trace stack capture to API test cases so evaluations can include a final, structured execution trace and richer LLM metadata when available. (#754) (Kritin Vongthongsri)
v0.21.39β
- Update package metadata for a new release. (#755) (Jeffrey Ip)
v0.21.37β
- Bump the package version for a new release. (#727) (Jeffrey Ip)
- Improve benchmark package initialization by exporting additional benchmarks and tasks (DROP, TruthfulQA, GSM8K, HumanEval) from the __init__ modules, making them easier to import from the top-level benchmarks namespace. (#728) (Kritin Vongthongsri)
- Improve LlamaIndex tracing by capturing richer event payloads, including prompt templates, tool calls, and model metadata, and recording exceptions as error traces. This makes trace output more complete and easier to debug across LLM, embedding, and retrieval steps. (#745) (Kritin Vongthongsri)
- Add documentation showing how to use a Google Vertex AI Gemini model for evaluations by wrapping LangChain ChatVertexAI in a custom LLM class, including safety settings and metric usage examples. (#747) (Aditya)
Bug Fixβ
v0.21.44β
- Fix document chunking when generating contexts from multiple files so chunks stay grouped by source and source_file metadata is preserved when exporting datasets to CSV/JSON. (#783) (Jeffrey Ip)
v0.21.43β
- Fix the `test` CLI to return a failing process exit status when tests fail, so CI and scripts can reliably detect failures. (#773) (Jeffrey Ip)
- Fix custom metric docs for `LatencyMetric` by reading latency from `additional_metadata` and updating the `LLMTestCase` example. Add an async `a_measure` method to match required metric interfaces and prevent example code from erroring. (#776) (Giannis Manousaridis)
v0.21.37β
- Fix relevancy chat template to request `reason` instead of `sentence`, avoiding conflicting instructions when using structured JSON output across precision, recall, and relevancy metrics. (#729) (Ulises M)
- Fix `KnowledgeRetentionMetric` documentation to reflect the correct scoring behavior in `strict_mode` and the correct formula, clarifying that higher scores represent better retention and messages without knowledge attrition contribute positively. (#738) (Ananya Raval)
- Remove the tracing integration and stop attaching trace stack data to generated API test cases. This reverts recent tracing-related behavior to reduce unexpected side effects during evaluation and LlamaIndex callback handling. (#742) (Jeffrey Ip)
- Fix G-Eval reasoning output by including the configured evaluation parameters in the results prompt. The generated `reason` now references the specific inputs being evaluated, making explanations more relevant and consistent. (#744) (Jeffrey Ip)
Aprilβ
April focused on making evaluations more resilient, reproducible, and easier to understand, with richer metadata and clearer results output. Reliability improved through Tenacity-based retries for rate limits, --ignore-errors to keep runs going when a metric fails, stable dataset ordering, and better conversational test case support across evaluation, datasets, and API posting. The tool also expanded and refined benchmark capabilities and docs around GSM8K, HumanEval, and DROP, while adding cost tracking with total USD display and more configurable model initialization via GPTModel. The month included multiple version bumps, dependency compatibility tweaks, and documentation cleanups.
Backward Incompatible Changeβ
v0.21.31β
- Remove the `LatencyMetric`, `CostMetric`, and `JudgementalGPT` metrics and their documentation to reduce unused surface area. Imports from `deepeval.metrics` no longer include these metrics. (#706) (Jeffrey Ip)
v0.21.18β
- Remove the TruthfulQA benchmark dataset and related benchmark code from the package. (#657) (Jeffrey Ip)
- Remove the `PII_score` helper that depended on `presidio-analyzer`, reverting the previous PII scoring implementation. (#658) (Jeffrey Ip)
New Featureβ
v0.21.33β
- Add `send_feedback` to submit ratings and optional expected responses/explanations for tracked events. Also refine `track` error handling so you can choose silent failure, printing errors, or raising exceptions. (#714) (Jeffrey Ip)
v0.21.34β
- Add a `--mark`/`-m` option to `test run` so you can select tests by pytest mark. Tests can now be excluded by default via pytest config and overridden at runtime when needed. (#689) (Simon Podhajsky)
v0.21.30β
- Add a DROP benchmark runner that loads the `ucinlp/drop` dataset, supports task selection and up to 5-shot prompting, and reports per-task and overall exact-match accuracy. (#696) (Kritin Vongthongsri)
v0.21.28β
- Add a HumanEval benchmark that measures functional correctness using `pass@k`. Support generating multiple samples for the same prompt so users can run the benchmark reliably. (#674) (Kritin Vongthongsri)
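The `pass@k` metric mentioned above is conventionally computed with the unbiased estimator popularized alongside HumanEval; whether deepeval uses this exact formula is an assumption, but a minimal sketch looks like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n total (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# with 3 of 10 samples correct, the pass@1 estimate is ~0.3
estimate = pass_at_k(10, 3, 1)
```

Generating multiple samples per prompt (larger `n`) tightens this estimate, which is why the benchmark supports repeated sampling.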
v0.21.26β
- Add support for conversational goldens in datasets, including `conversationalGoldens` in API responses and a new `ConversationalGolden` model to represent multi-turn examples with optional retrieval context and metadata. (#680) (Jeffrey Ip)
- Add initial support for conversational datasets and test cases, including parsing `conversationalGoldens` into `conversational_goldens` and treating conversation messages as test-case inputs for evaluation results. (#681) (Jeffrey Ip)
v0.21.25β
- Add a GSM8K benchmark to evaluate grade-school math word problems with configurable few-shot prompting and optional chain-of-thought. Reports exact-match accuracy and stores per-question predictions for review. (#675) (Kritin Vongthongsri)
- Add a `write_cache` option to control whether evaluation results are written to disk. When disabled, cache files are cleaned up to avoid leaving artifacts on the filesystem. (#677) (Jeffrey Ip)
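Exact-match scoring, as reported by the GSM8K runner above, reduces to comparing normalized predictions against targets; the trimming step here is an illustrative simplification, not deepeval's exact normalization:

```python
def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that exactly match the target after trimming."""
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return matches / len(targets)

# two of three answers match exactly
score = exact_match_accuracy(["72", " 10", "5"], ["72", "10", "6"])
```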
v0.21.24β
- Add support for Cohere as an LLM provider via a new `CohereModel` implementation. Include a dedicated test and ensure the `cohere` dependency is installed during setup. (#661) (Fabian Greavu)
v0.21.17β
- Add TruthfulQA benchmarking support with selectable tasks and MC1/MC2 scoring modes, plus a new `truth_identification_score` metric for evaluating identified true answers. (#651) (Kritin Vongthongsri)
v0.21.18β
- Add a `PII_score` helper to analyze text for PII using Presidio and return an average score plus per-entity scores. Raises a clear error if `presidio-analyzer` is not installed. (#338) (Arinjay Wyawhare)
- Add initial TruthfulQA benchmark support, including dataset loading and task definitions for generation and multiple-choice evaluation. (#549) (Rohinish)
Improvementβ
v0.21.36β
- Prepare a new package release by updating the project version metadata. (#723) (Jeffrey Ip)
- Fix a typo in the README section title for bulk evaluation, changing "Evaluting" to "Evaluating" for clearer documentation. (#724) (Vinicius Mesel)
v0.21.35β
- Bump the package version for the latest release. (#719) (Jeffrey Ip)
- Relax the `importlib-metadata` dependency to allow versions >=6.0.2, improving compatibility with a wider range of environments and dependency sets. (#721) (Philip Chung)
v0.21.33β
- Prepare a new package release by bumping the tool version to 0.21.32. (#711) (Jeffrey Ip)
- Improve dataset `pull` feedback by showing a spinner and completion time while downloading from Confident AI, making long pulls easier to track. (#713) (Jeffrey Ip)
v0.21.34β
- Prepare a new package release by updating the published version metadata. (#716) (Jeffrey Ip)
v0.21.31β
- Add support for passing custom arguments to `GPTModel` (for example `temperature` and `seed`) to make evaluations more deterministic and reproducible. Improve native model detection so any `GPTModel` is treated as native, preserving features like cost reporting and logprob-based scoring. (#699) (lplcor)
- Add `comments` and `additional_metadata` fields to LLM and conversational test cases, and preserve them when converting goldens and sending API test cases. Also fix empty conversation validation to use `==` for correct message length checks. (#703) (Jeffrey Ip)
- Add `--use-existing` to `deepeval login` to reuse an existing API key file. When provided, the command checks for an existing key and skips the prompt for a new one, making repeat logins faster and smoother. (#704) (Simon Podhajsky)
- Improve the GEval prompt template by clarifying the scoring criteria and adding a concrete JSON example output. This helps ensure evaluators return valid `score` and `reason` fields in the expected format. (#705) (repetitioestmaterstudiorum)
v0.21.32β
- Bump package version metadata for the latest release. (#708) (Jeffrey Ip)
- Fix typos in the dataset evaluation documentation to improve clarity and reduce confusion when following the examples. (#709) (Kritin Vongthongsri)
v0.21.30β
- Prepare a new release by updating the package version metadata. (#694) (Jeffrey Ip)
- Add documentation for the `DROP` benchmark, including available tasks, `n_shots`/`tasks` arguments, and a usage example for evaluating a model and interpreting the exact-match score. (#697) (Kritin Vongthongsri)
- Remove inline benchmark example code from benchmark modules to avoid executing demo logic on import and keep the library API focused on evaluation. (#698) (Kritin Vongthongsri)
- Add deterministic ordering for dataset test cases by tracking a stable rank and sorting test runs consistently, so results appear in a predictable order across runs and pulls. (#700) (Jeffrey Ip)
v0.21.29β
- Improve OpenAI call reliability by adding Tenacity-based retries with exponential backoff and jitter for rate-limit failures in GPT model requests. (#648) (pedroallenrevez)
- Update package metadata for a new release. (#688) (Jeffrey Ip)
- Add GSM8K benchmark documentation, including available arguments (`n_problems`, `n_shots`, `enable_cot`), an evaluation example, and details on exact-match scoring. Include the new page in the benchmarks sidebar for easier discovery. (#690) (Kritin Vongthongsri)
- Add HumanEval benchmark documentation with usage examples, a `pass@k` explanation, and a full list of `HumanEvalTask` options. Also export `HumanEvalTask` from `deepeval.benchmarks.tasks` for easier importing. (#691) (Kritin Vongthongsri)
- Add automatic conversion of conversational goldens into conversational test cases when pulling a dataset, so both standard and conversation examples load as runnable tests. (#693) (Jeffrey Ip)
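The Tenacity-based retry behavior described for rate limits above follows the usual exponential-backoff-with-jitter pattern; a library-agnostic sketch, with a hypothetical `RateLimitError` standing in for the provider's exception:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider rate-limit exception."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn on rate limits with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # double the window each attempt, capped, then pick a random point
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Jitter spreads retries from concurrent workers apart, which is why it helps specifically under throttling.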
v0.21.27β
- Support passing `ConversationalTestCase` to `evaluate()` alongside `LLMTestCase` for more flexible evaluation workflows. (#682) (Jeffrey Ip)
- Support conversational test cases in the results table and API posting flow, so conversation evaluations are no longer dropped. Also fix naming of message-based test cases to use the correct indexed `test_case_{index}` format. (#684) (Jeffrey Ip)
v0.21.26β
- Bump the package version for the latest release. (#679) (Jeffrey Ip)
v0.21.25β
- Bump the package release to 0.21.24. (#673) (Jeffrey Ip)
v0.21.24β
- Bump the package release metadata to 0.21.23. (#670) (Jeffrey Ip)
v0.21.22β
- Bump the package version to 0.21.20 for this release. (#665) (Jeffrey Ip)
- Bump package version metadata for the latest release. (#666) (Jeffrey Ip)
- Add evaluation cost tracking to metric metadata and test runs, and aggregate per-test costs into the total run cost. Cached metric results now store `evaluationCost` as 0 to avoid inflating totals when reusing cached evaluations. (#667) (Jeffrey Ip)
v0.21.23β
- Update package version metadata for a new release. (#668) (Jeffrey Ip)
- Add display of the total evaluation token cost (USD) when showing test run results, making it easier to understand evaluation spend at a glance. (#669) (Jeffrey Ip)
v0.21.19β
- Add an `--ignore-errors` option to continue running tests when a metric raises an exception, recording the error on the metric result instead of stopping the run. Metrics that error are excluded from caching to avoid persisting invalid results. (#662) (Jeffrey Ip)
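The `--ignore-errors` semantics amount to catching per-metric failures and recording them on the result; an illustrative loop, not deepeval's actual runner:

```python
def run_metrics(metrics, test_case, ignore_errors=False):
    """Run each metric; on error either raise or record it and continue."""
    results = []
    for name, metric in metrics.items():
        try:
            results.append({"metric": name, "score": metric(test_case), "error": None})
        except Exception as exc:
            if not ignore_errors:
                raise
            # record the failure on the result instead of stopping the run
            results.append({"metric": name, "score": None, "error": str(exc)})
    return results
```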
v0.21.20β
- Bump the package version for a new release. (#664) (Jeffrey Ip)
v0.21.17β
- Update package version metadata for the next release. (#649) (Jeffrey Ip)
- Add documentation for the TruthfulQA benchmark, including supported MC1/MC2 modes, available task enums, and a code example for running evaluations and interpreting `overall_score`. (#652) (Kritin Vongthongsri)
- Add support for passing an OpenAI API key directly to `GPTModel` via a hidden `_openai_api_key` parameter, and use it when creating the underlying `ChatOpenAI` client. (#654) (Jeffrey Ip)
v0.21.18β
- Bump the package version for a new release. (#655) (Jeffrey Ip)
- Improve TruthfulQA benchmark code formatting and lint compliance, including consistent quoting, spacing, and line wrapping. This should reduce style-related CI noise without changing runtime behavior. (#659) (Jeffrey Ip)
v0.21.16β
- Bump the package version for a new release. (#647) (Jeffrey Ip)
v0.21.15β
- Prepare a new release by updating the package version metadata. (#646) (Jeffrey Ip)
Bug Fixβ
v0.21.31β
- Fix Dataset string representation so printing it shows its key fields (test cases, goldens, and identifiers) instead of a default object display. (#707) (Jeffrey Ip)
v0.21.32β
- Fix hyperparameter logging so model and prompt template are recorded consistently as part of the hyperparameters. This also simplifies test run caching by keying cached results only on the test case inputs and hyperparameters. (#710) (Jeffrey Ip)
v0.21.30β
- Fix Tenacity retry configuration so OpenAI rate limit errors are retried correctly, preventing failures when generating responses under throttling. (#695) (Jeffrey Ip)
- Fix dataset test case handling by validating that `test_cases` is a list and correctly appending new test cases. Prevents type errors and avoids corrupting internal test case storage when adding cases. (#701) (Jeffrey Ip)
v0.21.29β
- Fix benchmark output and docs: correct GSM8K and HumanEval accuracy labels, update the GSM8K `n_shots` limit to 15, and repair broken in-page links in benchmark documentation. (#692) (Kritin Vongthongsri)
v0.21.28β
- Fix `test_everything` to validate a `ConversationalTestCase` instead of a single test case. (#685) (Jeffrey Ip)
- Fix pulling conversational datasets so conversational goldens are parsed correctly and messages load from the `goldens` field. (#686) (Jeffrey Ip)
- Fix metrics to accept `ConversationalTestCase` by validating messages and converting to an `LLMTestCase` before evaluation. Prevents failures when running answer relevancy, bias, and contextual metrics on conversational inputs. (#687) (Jeffrey Ip)
v0.21.25β
- Fix Azure OpenAI usage by preventing `generate_raw_response` calls that aren't supported, avoiding confusing runtime failures. Update the default GPT model to `gpt-4-turbo` and clarify the output message as an estimated token cost. (#678) (Jeffrey Ip)
v0.21.24β
- Fix the Knowledge Retention metric when using the built-in model wrapper by handling `generate()` return values correctly. This prevents crashes or invalid parsing when generating verdicts, knowledges, and reasons. (#672) (Jeffrey Ip)
v0.21.18β
- Fix logprob-based G-Eval scoring by converting tokens to numeric scores more safely and correctly. Remove the now-unneeded `return_raw_response` parameter in favor of `generate_raw_response`. Reduce overhead by avoiding repeated computation inside the scoring loop. (#650) (lplcor)
Marchβ
March focused on making evaluations faster, clearer, and easier to automate, with major work on async execution, event-loop compatibility in notebooks, and more reliable concurrency controls via the run_async flag. Evaluation UX improved with a new progress indicator (and better toggles), richer and more consistent score metadata, and caching that reuses prior results safely without trampling metric configuration. The synthesizer and dataset tooling expanded significantly with new APIs for generating and exporting synthetic Golden test cases from contexts and documents, plus prompt evolution for more diverse inputs and improved reproducibility through saved prompt templates and hyperparameters.
Backward Incompatible Changeβ
v0.20.79β
- Rename the hyperparameter decorator from `set_hyperparameters` to `log_hyperparameters` and update public exports and docs accordingly. (#557) (Jeffrey Ip)
New Featureβ
v0.21.14β
- Add optional logprob-based G-Eval scoring. If logprobs are unavailable or fail, it automatically falls back to the standard one-shot score. Relax Python version requirements to better support older runtimes. (#619) (lplcor)
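Logprob-based G-Eval scoring weights each candidate score token by its probability instead of taking a single sampled value; a sketch of the idea (the exact token set and parsing are assumptions, not deepeval's code):

```python
import math

def weighted_geval_score(score_token_logprobs: dict) -> float:
    """Probability-weighted average over numeric score tokens.

    Input maps candidate score tokens (e.g. "1".."10") to logprobs;
    non-numeric tokens are ignored.
    """
    probs = {
        int(tok): math.exp(lp)
        for tok, lp in score_token_logprobs.items()
        if tok.strip().isdigit()
    }
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total
```

If logprobs are unavailable, the change above falls back to the standard one-shot score rather than failing.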
v0.21.13β
- Add support for generating dataset goldens from document files via `generate_goldens_from_docs`, and expose new controls like `num_evolutions` and `enable_breadth_evolve` when generating goldens. Update the docs with a dedicated Synthetic Datasets guide and refreshed dataset generation examples. (#635) (Kritin Vongthongsri)
v0.20.99β
- Add a `--repeat`/`-r` option to rerun each test case a specified number of times when running tests from the CLI. (#616) (Jeffrey Ip)
- Add support for loading `retrieval_context` when creating evaluation datasets from CSV and JSON files, with configurable column/key names and delimiters. This lets test cases carry retrieval context data alongside input, outputs, and context. (#617) (Jeffrey Ip)
v0.20.93β
- Add a BIG-Bench Hard benchmark runner with configurable few-shot and optional chain-of-thought prompting, plus per-task and overall accuracy reporting. Results are also stored for inspection in `predictions`, `task_scores`, and `overall_score`. (#574) (Kritin Vongthongsri)
v0.20.91β
- Add `metricsScores` to test run output to capture the full list of scores per metric across test cases, alongside the existing averaged `metricScores`. This makes it easier to inspect score distributions instead of only summary values. (#601) (Jeffrey Ip)
v0.20.82β
- Add `strict_mode` to evaluation metrics to enforce stricter pass/fail scoring. When enabled, thresholds become all-or-nothing (e.g., return 0 for partial relevancy and 1 for any detected bias), making results less forgiving. (#566) (Jeffrey Ip)
- Add optional async execution for `evaluate()` and `assert_test()`, running metric evaluations concurrently with asyncio to speed up runs. You can disable it with `asynchronous=False` for fully synchronous behavior. (#569) (Jeffrey Ip)
- Add async support to `GEval`, with an `asynchronous` option to run evaluations via an event loop or synchronously. Improve validation for missing test case fields and update prompt generation for clearer parameter formatting. (#571) (Jeffrey Ip)
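Concurrent metric evaluation with asyncio boils down to gathering the per-metric coroutines instead of awaiting them one by one; a self-contained sketch (metric names and the sleep stand-in are made up):

```python
import asyncio

async def a_measure(name: str, delay: float) -> tuple:
    await asyncio.sleep(delay)  # stand-in for an async LLM call
    return name, 1.0

async def evaluate_concurrently(names):
    # run all metric coroutines at once instead of one after another
    pairs = await asyncio.gather(*(a_measure(n, 0.01) for n in names))
    return dict(pairs)

scores = asyncio.run(evaluate_concurrently(["relevancy", "bias"]))
```

With N metrics of similar latency, wall-clock time approaches the slowest single call rather than the sum, which is the speedup the async option targets.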
v0.20.80β
- Add `login_with_confident_api_key` to let users save an API key programmatically and get a success message after login. (#560) (Jeffrey Ip)
- Add input augmentation when generating synthetic goldens by evolving each generated prompt into multiple rewritten variants, producing more diverse test inputs. Synthetic data generation no longer requires an `expected_output` field. (#561) (Jeffrey Ip)
- Add `save_as` to export evaluation datasets to JSON or CSV, creating the output directory and timestamped files automatically. Prevent saving when no goldens are present and include `actual_output` in both JSON and CSV exports. (#562) (Jeffrey Ip)
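The `save_as` behavior described above (create the directory, write a timestamped file, refuse to save an empty dataset) can be sketched as follows; the function name and body are illustrative, not the library's implementation:

```python
import json
import os
from datetime import datetime

def save_goldens_as_json(goldens: list, directory: str) -> str:
    """Write goldens to a timestamped JSON file, creating the directory."""
    if not goldens:
        raise ValueError("no goldens found, nothing to save")
    os.makedirs(directory, exist_ok=True)
    filename = datetime.now().strftime("%Y%m%d_%H%M%S") + ".json"
    path = os.path.join(directory, filename)
    with open(path, "w") as f:
        json.dump(goldens, f, indent=2)
    return path  # return the full path so callers can reuse it
```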
v0.20.79β
- Add a new Synthesizer that generates synthetic `Golden` test cases from a list of context strings using an LLM prompt and JSON parsing, with support for pluggable embedding models via `DeepEvalBaseEmbeddingModel`. (#533) (Jeffrey Ip)
- Add a revamped synthesizer API to generate `Golden` examples from multiple contexts with optional multithreading and a `max_goldens_per_context` limit. Generated goldens can now be saved to CSV or JSON files for easier reuse and sharing. (#553) (Jeffrey Ip)
- Add `Dataset.generate_goldens()` to generate and append synthetic goldens from a synthesizer. Improve synthesizer UX by showing a progress spinner during generation and routing progress output to stderr. (#554) (Jeffrey Ip)
v0.20.78β
- Add initial Big Bench Hard benchmark support with task selection, dataset loading from Hugging Face, and exact-match scoring for model predictions. (#548) (Jeffrey Ip)
- Add support for capturing and exporting the user prompt template alongside the model and hyperparameters in test run metadata, enabling easier reproduction and debugging of evaluation runs. (#551) (Jeffrey Ip)
Experimental Featureβ
v0.21.01β
- Add early support for generating synthetic data from documents by chunking PDFs, embedding chunks, and selecting related contexts via cosine similarity. Integrate this flow into the synchronous `generate_goldens_from_docs` path. (#604) (Kritin Vongthongsri)
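Selecting related contexts via cosine similarity over chunk embeddings looks roughly like this (toy vectors below; the real flow embeds PDF chunks first):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_related_chunks(query_vec, chunk_vecs, k=2):
    """Indices of the k chunks most similar to the query embedding."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```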
Improvementβ
v0.21.14β
- Prepare a new release by bumping the package version to 0.21.13. (#640) (Jeffrey Ip)
v0.21.13β
- Update package metadata for a new release. (#634) (Jeffrey Ip)
v0.21.01β
- Add caching for test runs to reuse previous results during evaluation, reducing repeated computation. Update the progress indicator to show when cached results are used. (#593) (Kritin Vongthongsri)
- Bump the package version to 0.21.00 for a new release. (#622) (Jeffrey Ip)
- Improve Synthesizer usability and test coverage by allowing the progress indicator to be disabled and by making context generation gracefully handle requests larger than the available chunks instead of erroring. Also includes small formatting and test-data cleanups. (#623) (Jeffrey Ip)
- Fix a typo in the getting started guide so the Custom Metrics section reads correctly. (#624) (Pierre Marais)
- Improve evaluation caching so metric configs are no longer overwritten from cached metadata, and only write cache data when saving results to disk. (#627) (Jeffrey Ip)
- Improve test-run caching by comparing full metric configuration fields (including `threshold`, `evaluation_model`, and `strict_mode`) when reusing cached results. Add a regression test to ensure cached metrics are matched consistently. (#629) (Jeffrey Ip)
v0.21.11β
- Improve packaging for the latest release by removing a duplicate `pytest` requirement and adding `docx2txt` and `importlib-metadata` dependencies. (#631) (Jeffrey Ip)
v0.21.12β
- Update package version metadata for a new release. (#632) (Jeffrey Ip)
v0.20.99β
- Bump package version for the latest release. (#615) (Jeffrey Ip)
v0.21.00β
- Improve packaging by adding `importlib-metadata` as a dependency to ensure Python package metadata is available at runtime. (#618) (Jeffrey Ip)
v0.20.98β
- Prepare a new package release by updating the project version metadata. (#611) (Jeffrey Ip)
- Fix typos and wording in several prompt templates to improve clarity and consistency in the generated instructions and examples. (#613) (Harumi Yamashita)
v0.20.93β
- Improve benchmark module exports so `BigBenchHard`, `MMLU`, and `HellaSwag` (and their task variants) can be imported directly from the benchmarks packages. (#606) (Jeffrey Ip)
v0.20.94β
- Update the package release metadata. (#607) (Jeffrey Ip)
v0.20.95β
- Bump the package version for the latest release. (#608) (Jeffrey Ip)
v0.20.96β
- Prepare a new release by updating the package version metadata. (#609) (Jeffrey Ip)
v0.20.97β
- Bump the package version for the latest release. (#610) (Jeffrey Ip)
v0.20.91β
- Bump package version to 0.20.90. (#598) (Jeffrey Ip)
v0.20.92β
- Bump the package version for a new release. (#602) (Jeffrey Ip)
v0.20.90β
- Bump the package release version metadata. (#591) (Jeffrey Ip)
- Improve type hint compatibility by switching from built-in generics like `list` and `dict` to `typing.List` and `typing.Dict` in public annotations. (#596) (Navkar)
v0.20.88β
- Bump package version metadata for a new release. (#586) (Jeffrey Ip)
- Improve hyperparameter logging by validating inputs and storing them as `hyperparameters` instead of `configurations`. Ignore `None` values and enforce string keys with scalar values, converting values to strings for consistent output. (#587) (Jeffrey Ip)
- Improve retry error reporting by switching from `print` to standard logging, emitting warnings instead of writing directly to stdout. (#588) (Jeffrey Ip)
v0.20.89β
- Bump the package version to 0.20.88 for the latest release. (#589) (Jeffrey Ip)
v0.20.86β
- Prepare a new package release with updated version metadata. (#583) (Jeffrey Ip)
v0.20.87β
- Bump the package version for a new release. (#584) (Jeffrey Ip)
v0.20.82β
- Prepare a new release by bumping the package version. (#564) (Jeffrey Ip)
- Add a new progress indicator for metric evaluation and allow disabling it via `show_indicator` in `evaluate()`. Update output messaging during evaluation. Remove the deprecated `run_test` helper from the public API. (#573) (Jeffrey Ip)
v0.20.83β
- Bump package version and skip the `test_everything` test by default to avoid running expensive OpenAI-dependent checks during test runs. (#576) (Jeffrey Ip)
v0.20.84β
- Prepare a new package release by updating the project version metadata. (#578) (Jeffrey Ip)
- Improve async execution in environments with an active event loop by applying `nest_asyncio` when a loop is already running, reducing failures when running async code from notebooks or nested contexts. (#579) (Jeffrey Ip)
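The `nest_asyncio` pattern amounts to detecting an already-running loop (as in a notebook) and patching it so nested `run_until_complete` calls work; a sketch of the dispatch logic (the wrapper name is made up):

```python
import asyncio

def run_sync(coro):
    """Run a coroutine from sync code, tolerating an active event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # no loop is running (plain scripts): asyncio.run is safe
        return asyncio.run(coro)
    # a loop is already running (e.g. Jupyter): patch it to allow nesting
    import nest_asyncio
    nest_asyncio.apply()
    return asyncio.get_event_loop().run_until_complete(coro)

async def hello():
    return "hi"

result = run_sync(hello())
```

In a plain script the `asyncio.run` branch is taken; `nest_asyncio` is only imported when a loop is actually active.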
v0.20.85β
- Prepare a new package release by updating the project version metadata. (#581) (Jeffrey Ip)
v0.20.80β
- Prepare a new package release by updating the tool version metadata. (#558) (Jeffrey Ip)
- Improve docs wording by clarifying that `AnswerRelevancyMetric` needs `OPENAI_API_KEY` and linking directly to instructions for using a custom LLM. Update the landing page headline to describe the tool as an open-source LLM evaluation framework. (#559) (Jeffrey Ip)
v0.20.81β
- Bump the package version for a new release. (#563) (Jeffrey Ip)
v0.20.79β
- Bump the package version for the latest release. (#552) (Jeffrey Ip)
- Refactor conversational test case internals to simplify structure and remove unused typing/imports, improving maintainability without changing expected behavior. (#556) (Jeffrey Ip)
v0.20.78β
- Bump the package version for a new release. (#547) (Jeffrey Ip)
Bug Fixβ
v0.21.14β
- Improve G-Eval scoring by safely handling logprob-based responses and falling back to standard generation when logprobs are unavailable or parsing fails. This reduces evaluation failures across models that don't support logprobs. (#644) (Jeffrey Ip)
v0.21.13β
- Fix a typo in `generate_goldens_from_docs` by renaming the `docuemnt_paths` argument to `document_paths` for clearer and consistent usage. (#639) (eLafo)
v0.21.01β
- Fix RAGAS metrics to accept either a model name string or a prebuilt chat model instance. This prevents incorrect model wrapping and ensures the provided model is used when running evaluations, including in async measurement paths. (#630) (Jeffrey Ip)
v0.21.12β
- Fix multiprocessing issues when using cached test runs by ensuring the current test run is loaded before appending results and by disabling cache writes when not running under the tool. This prevents missing or corrupted run data in parallel executions. (#633) (Jeffrey Ip)
v0.21.00β
- Fix errors when sending large test runs by batching test case uploads and reporting incomplete uploads with a clearer message. Also record total passed/failed counts for the run so results are summarized reliably. (#621) (Jeffrey Ip)
v0.20.98β
- Fix a typo in the G-Eval results prompt so it now reads "the evaluation steps" instead of "th evaluation steps". (#612) (lplcor)
- Fix metric score output to use consistent metric names and a single `metricsScores` structure, removing the legacy `metricScores` field. This prevents mismatched keys and simplifies downstream parsing of test run results. (#614) (Jeffrey Ip)
v0.20.93β
- Fix noisy console output during test run wrap-up by removing an unintended print of metrics scores. (#603) (Jeffrey Ip)
v0.20.91β
- Fix JSON serialization for older Pydantic versions by falling back to `dict()` when `model_dump()` is unavailable, preventing errors when pushing datasets or saving test runs. (#600) (Vaibhav Kubre)
v0.20.89β
- Fix G-Eval to reuse provided `evaluation_steps` instead of regenerating them. Improve evaluation prompt instructions to avoid quoting the score in the reason. Also clarify the init error message when neither `criteria` nor `evaluation_steps` is provided. (#590) (Jeffrey Ip)
v0.20.87β
- Fix synthesizer model calls to use `model.generate()` so text evolution and synthetic data generation work with models that don't support direct invocation. (#585) (Jeffrey Ip)
v0.20.82β
- Fix `strict_mode` behavior for the hallucination metric so it uses a zero threshold for stricter evaluation, instead of incorrectly forcing a threshold of 1. (#567) (Jeffrey Ip)
- Fix async execution controls by renaming the `asynchronous` flag to `run_async` across evaluation and metrics, ensuring metrics run with the intended sync/async behavior and clearer error messages when async isn't supported. (#572) (Jeffrey Ip)
- Fix LlamaIndex async evaluators to await metric execution by using `a_measure`, preventing missed async work and making evaluation results more reliable. (#575) (Jeffrey Ip)
v0.20.83β
- Fix async evaluation and metric `async_mode` execution by reusing or creating an event loop instead of calling `asyncio.run`, preventing failures when a loop is already running or closed. (#577) (Jeffrey Ip)
v0.20.85β
- Fix indicator toggle behavior by setting `DISABLE_DEEPEVAL_INDICATOR` consistently based on `show_indicator`, so the indicator can be re-enabled after being disabled. (#582) (Jeffrey Ip)
v0.20.79β
- Fix knowledge retention evaluation to use the current message fields (`input` and `actual_output`) when generating verdicts and extracting knowledge, preventing mismatched or empty prompts in conversational test cases. (#555) (Jeffrey Ip)
v0.20.78β
- Fix summarization coverage scoring so the score is calculated only from questions where the original verdict is `yes`. This prevents incorrect results when non-applicable questions were previously included in the denominator. (#550) (Jeffrey Ip)
Februaryβ
February focused on making evaluations more reliable, faster, and easier to integrate, as the metrics and template layout was reorganized into clearer per-metric modules while preserving key imports like HallucinationMetric. Multiple core metrics saw meaningful upgrades, including improved faithfulness, answer relevancy, hallucination, summarization, and knowledge retention with better prompt parsing, clearer verdict rules, optional multithreading, and more consistent reasoning outputs. Integrations and tooling were refined with safer defaults and compatibility updates for Hugging Face, LlamaIndex, and RAGAS, alongside stricter type validation and improved JSON error messages.
Backward Incompatible Changeβ
v0.20.65β
- Improve custom LLM support in metrics by switching the expected base type from `DeepEvalBaseModel` to `DeepEvalBaseLLM`, and update docs accordingly. (#478) (Jeffrey Ip)
v0.20.63β
- Remove support for passing LangChain `BaseChatModel` instances into metric `model` parameters. Metrics now accept only a model name string or a `DeepEvalBaseModel`, reducing LangChain coupling. (#468) (Jeffrey Ip)
New Featureβ
v0.20.75β
- Add an initial synthesizer module with a `BaseSynthesizer` interface and scaffolding for generating `LLMTestCase` objects from text, including evolution prompt templates for instruction rewriting. (#531) (Jeffrey Ip)
- Add conversational test case support with a new `KnowledgeRetentionMetric` for scoring how well a model retains facts across multi-turn chats. (#534) (Jeffrey Ip)
v0.20.71β
- Add support for pushing existing goldens when publishing a dataset, including goldens converted from test cases in the same push. (#514) (Jeffrey Ip)
- Add automatic generation of summarization assessment questions when none are provided, with a new `n` option to control how many are created. (#517) (Jeffrey Ip)
- Add support for passing custom LangChain `Embeddings` to RAGAS metrics so answer relevancy can use your chosen embedding model for cosine-similarity scoring. (#518) (Jeffrey Ip)
v0.20.69β
- Add a new `ToxicityMetric` that scores model outputs for toxic language using an LLM-based rubric and can return a brief explanation. Support selecting a GPT model or providing a custom LLM, and configure a pass/fail threshold and whether to include reasons. (#498) (Jeffrey Ip)
v0.20.66β
- Add a revamped bias metric that uses an LLM to extract opinions, judge each one for bias, and compute a bias score. You can configure the evaluation model and optionally include a generated explanation of the result. (#486) (Jeffrey Ip)
Improvementβ
v0.20.75β
- Bump package version to 0.20.74 for the latest release. (#528) (Jeffrey Ip)
- Improve answer relevancy prompt templates by fixing typos and clarifying instructions, including tighter JSON key wording and clearer verdict guidance. (#532) (moruga123)
v0.20.76β
- Prepare a new package release by updating the project version metadata. (#536) (Jeffrey Ip)
- Improve the knowledge retention metric by restoring progress reporting and metric type capture, and refining verdict/data extraction prompts to better handle clarifications and keep outputs consistently JSON. (#537) (Jeffrey Ip)
- Improve detection of when the tool is running by storing the state in the `DEEPEVAL` environment variable instead of a process-global flag, making it more reliable across processes. (#540) (Jeffrey Ip)
v0.20.77β
- Prepare a new release by updating the package version metadata. (#541) (Jeffrey Ip)
- Improve test case organization by moving LLM and conversational test cases into a dedicated `test_case` package, with clearer imports and stricter validation for `retrieval_context`. (#544) (Jeffrey Ip)
- Add stricter type validation for `test_cases` and `metrics` in dataset creation and evaluation helpers, raising clear `TypeError`s when inputs are not `LLMTestCase` or `BaseMetric`. This prevents confusing failures later in the run. (#545) (Jeffrey Ip)
- Improve multithreaded verdict generation in the contextual relevancy and hallucination metrics by switching to `ThreadPoolExecutor`, so exceptions propagate reliably and results are collected more consistently. (#546) (Jeffrey Ip)
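The reason `ThreadPoolExecutor` helps here is that `Future.result()` re-raises any exception thrown in the worker, whereas an exception in a bare `threading.Thread` is simply lost. A minimal sketch of the pattern (the `generate_verdict` helper is a stand-in, not the library's actual function):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_verdict(sentence: str) -> str:
    # Stand-in for the per-sentence LLM judgment call.
    return "yes" if sentence else "idk"

sentences = ["The sky is blue.", "Water is wet."]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(generate_verdict, s) for s in sentences]
    # .result() re-raises any exception from the worker thread,
    # so failures surface in the caller instead of dying silently.
    verdicts = [f.result() for f in futures]
```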
v0.20.72
- Update package metadata for the latest release by bumping the tool version. (#519) (Jeffrey Ip)
- Add support for the `gpt-4-turbo-preview` and `gpt-4-0125-preview` OpenAI models, and switch the default GPT model to `gpt-4-0125-preview`. Documentation now reflects the new default in integrations and metric examples. (#521) (Jeffrey Ip)
v0.20.73
- Bump the package version for a new release. (#524) (Jeffrey Ip)
v0.20.74
- Update package metadata for a new release. (#526) (Jeffrey Ip)
- Allow running the test suite with `pytest` by making `assert_test` execute even outside the dedicated test runner, while adjusting behavior based on whether the tool is active. (#527) (Jeffrey Ip)
v0.20.71
- Prepare a new package release by bumping the tool version. (#511) (Jeffrey Ip)
- Add a 5-second timeout to the package update check so startup isn't blocked by slow or unresponsive network requests. (#515) (Jeffrey Ip)
- Reformat the update check request call for improved readability without changing behavior. (#516) (Jeffrey Ip)
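A best-effort update check like this typically swallows every network failure and simply skips the check. A sketch of the idea using the standard library (the URL and function name are illustrative; the actual HTTP client deepeval uses is not specified here):

```python
import urllib.request

def check_for_update(url: str, timeout: float = 5.0):
    """Best-effort version check: never let a slow or blocked network
    delay startup for more than `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except Exception:
        # Timeouts, DNS failures, firewalls: silently skip the check.
        return None

# The .invalid TLD is guaranteed not to resolve, so this returns None.
latest = check_for_update("http://invalid.invalid/version", timeout=0.5)
```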
v0.20.70
- Bump the package version for a new release. (#505) (Jeffrey Ip)
v0.20.69
- Update the package release metadata by bumping the version number. (#494) (Jeffrey Ip)
- Reorganize metric modules into per-metric packages and move prompt templates alongside each metric for clearer structure and imports. (#497) (Jeffrey Ip)
- Improve LlamaIndex integration compatibility with the newer `llama_index.core` API. Add `model` and `include_reason` options to the LlamaIndex bias, toxicity, and summarization evaluators so you can control the underlying LLM and whether explanations are returned. (#501) (Jeffrey Ip)
v0.20.67
- Update package metadata for a new release. (#487) (Jeffrey Ip)
v0.20.68
- Update package metadata for a new release. (#491) (Jeffrey Ip)
- Improve JSON parsing for evaluation outputs by loading trimmed JSON directly and raising a clearer error when the model returns invalid JSON, guiding you to use a more reliable evaluation model. (#492) (Jeffrey Ip)
- Reduce install size by making ROUGE, BLEU, and BERTScore dependencies optional and importing them only when used, with clearer messages when modules are missing. (#493) (Jeffrey Ip)
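Deferring heavy imports until first use is the standard way to make dependencies optional while keeping errors actionable. A generic sketch of the technique (the `require` helper is hypothetical, not deepeval's API):

```python
import importlib

def require(module_name: str, hint: str):
    """Import an optional dependency lazily, raising an actionable
    error only when the feature that needs it is actually used."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"'{module_name}' is not installed. {hint}") from exc

# A present module loads normally...
json_mod = require("json", "pip install <package>")
# ...while a missing one fails with install guidance instead of a bare ImportError.
try:
    require("not_a_real_module_xyz", "pip install not-a-real-module")
    missing_error = None
except ImportError as exc:
    missing_error = str(exc)
```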
v0.20.66
- Prepare a new release by updating the package version metadata. (#479) (Jeffrey Ip)
- Add a `DEEPEVAL_TELEMETRY_OPT_OUT` environment variable to disable Sentry telemetry. When set, evaluation and metric tracking messages are not sent and telemetry is not initialized. (#480) (Brian DeRenzi)
- Add model logging to test run outputs by letting `set_hyperparameters` capture a model name and saving it alongside configurations. (#481) (Jeffrey Ip)
- Add a new deployment-focused test that pulls an evaluation dataset and runs parameterized checks with a sample metric. Update CI to run this deployment test in the dedicated results workflow and skip it in the default pytest suite. (#485) (Jeffrey Ip)
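An opt-out variable like this is normally checked before the telemetry SDK is ever initialized. A sketch of the gating pattern, assuming any non-empty value opts out (the helper name is illustrative; only the `DEEPEVAL_TELEMETRY_OPT_OUT` variable comes from the release note):

```python
import os

def telemetry_enabled() -> bool:
    # Any non-empty value of DEEPEVAL_TELEMETRY_OPT_OUT disables telemetry;
    # check this before calling e.g. sentry_sdk.init().
    return not os.getenv("DEEPEVAL_TELEMETRY_OPT_OUT")

os.environ.pop("DEEPEVAL_TELEMETRY_OPT_OUT", None)
default_state = telemetry_enabled()      # on by default
os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"
opted_out_state = telemetry_enabled()    # now off
```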
v0.20.65
- Update package metadata for a new release. (#476) (Jeffrey Ip)
v0.20.64
- Prepare a new package release by updating the project version metadata. (#470) (Jeffrey Ip)
- Fix typos and improve grammar in the README to make setup and usage instructions clearer. (#472) (Michael Leung)
- Improve the answer relevancy metric by scoring per-statement against the input and retrieval context, and by generating clearer reasons for irrelevant content. Also fix the project repository URL metadata. (#475) (Jeffrey Ip)
v0.20.63
- Bump the package version for a new release. (#467) (Jeffrey Ip)
- Improve the summarization metric with clearer Alignment/Inclusion scoring, optional explanatory reasons, and configurable multithreading. This also refines verdict parsing so contradictions and redundancies are reported more consistently. (#469) (Jeffrey Ip)
v0.20.59
- Bump the package version for the latest release. (#459) (Jeffrey Ip)
- Add telemetry logging for metric usage by reporting each metric type when `measure()` runs, improving visibility into which metrics are being used during evaluations. (#460) (Jeffrey Ip)
v0.20.60
- Bump the package version for a new release. (#462) (Jeffrey Ip)
v0.20.61
- Prepare a new package release by updating the project version metadata. (#464) (Jeffrey Ip)
- Improve dependency and tooling compatibility by updating Poetry lockfiles and related formatting, and adjust the RAGAS metrics integration to pass the LLM via `evaluate(...)` with a safer default model. (#465) (Jeffrey Ip)
v0.20.62
- Prepare a new package release by bumping the library version. (#466) (Jeffrey Ip)
v0.20.58
- Prepare a new package release by updating the project version metadata. (#456) (Jeffrey Ip)
- Prevent accidental commits of macOS `.DS_Store` files by removing the existing file from the repository and updating `.gitignore` to ignore it going forward. (#457) (Aldin Kiselica)
- Improve the faithfulness metric by generating claims and retrieval truths in parallel and tightening verdict rules to return `no` only on direct contradictions (otherwise `idk`). This makes scoring more consistent and speeds up evaluation, with an option to disable multithreading. (#458) (Jeffrey Ip)
v0.20.57
- Update package version metadata to 0.20.56. (#452) (Jeffrey Ip)
- Improve dataset and integration imports by centralizing `Golden` in a dedicated module and updating Hugging Face callback behavior to always refresh evaluation metrics and tables during training. (#454) (Jeffrey Ip)
- Improve the Hallucination metric implementation and template imports, and reorganize it under `deepeval.metrics.hallucination` while keeping `HallucinationMetric` available from `deepeval.metrics`. (#455) (Jeffrey Ip)
Bug Fix
v0.20.75
- Fix test status reporting so a metric without an explicit failure no longer marks the whole test run as failed. (#535) (Jeffrey Ip)
v0.20.77
- Fix threaded metric evaluation to capture and re-raise exceptions from worker threads instead of failing silently. Add a `multithreading` option to run verdict generation sequentially when needed. (#542) (Andrés)
- Fix the knowledge retention metric to evaluate contradictions and extract facts using the correct conversation fields (`user_input` and `llm_response`), improving verdict accuracy and knowledge tracking across messages. (#543) (Jeffrey Ip)
v0.20.72
- Fix SummarizationMetric to treat an empty `assessment_questions` list as unset, preventing unexpected behavior. Improve metric docs by clarifying parameters and adding calculation details for Bias and Toxicity, and reorganize the metrics sidebar (including removing the Cost metric page). (#520) (Jeffrey Ip)
- Fix test run recording by aggregating metric results into a single saved test case per input, with correct duration and success status. This prevents duplicate or partial entries and ensures trace and metadata are captured consistently. (#522) (Jeffrey Ip)
- Fix RAGAS metric evaluation by sending the expected output in the correct `ground_truth` field, preventing dataset schema mismatches and incorrect scoring. (#523) (Jeffrey Ip)
v0.20.73
- Prevent `assert_test` and pytest plugin session setup from running when tests are executed outside the CLI, avoiding unintended assertions and test-run side effects. (#525) (Jeffrey Ip)
v0.20.70
- Fix metrics module imports by adding missing `__init__.py` files and removing a duplicate import, improving package discovery and preventing import errors. (#510) (Jeffrey Ip)
v0.20.69
- Fix contextual precision scoring and reasoning output when no contexts are available by returning a score of 0 instead of failing. Simplify verdict details by removing the per-node field from the reported verdicts. (#504) (Jeffrey Ip)
v0.20.67
- Fix summarization metric output by removing a stray prompt print and ensuring missing-question text is interpolated correctly. Refresh development dependencies via an updated Poetry lockfile. (#490) (Jeffrey Ip)
v0.20.65
- Fix the Hugging Face integration guide by adding missing imports, correcting variable names, and showing how to pass `trainer` and register the callback so the example runs as written. (#477) (Michael Leung)
v0.20.59
- Fix the faithfulness prompt parsing to generate and read `claims` instead of `truths`, preventing missing-key errors and improving consistency in faithfulness evaluation results. (#461) (Jeffrey Ip)
v0.20.60
- Fix retry error handling by removing the hard dependency on OpenAI exceptions and retrying on any exception. This prevents unexpected crashes when OpenAI is not installed or when other transient errors occur. (#463) (Jeffrey Ip)
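Catching `Exception` broadly is what lets the retry wrapper work even when the OpenAI SDK (and its exception classes) isn't importable. A minimal sketch of that design, not deepeval's actual implementation:

```python
import time

def retry(fn, attempts: int = 3, delay: float = 0.0):
    """Retry on any Exception rather than a provider-specific error type,
    so no optional SDK needs to be installed just to classify failures."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # deliberately broad: any transient failure
            last_exc = exc
            time.sleep(delay)
    raise last_exc

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds: a stand-in for a transient network error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = retry(flaky)
```

The trade-off is that genuinely unrecoverable errors are also retried; the release note accepts that cost to avoid a hard dependency on OpenAI's exception hierarchy.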
January
January focused on making evaluations faster, clearer, and easier to integrate across common LLM stacks. Event tracking now runs in the background by default with a synchronous option when needed, while telemetry and CLI output were refined with safer Sentry setup and transient spinner-based progress on stderr. Metrics and results reporting saw major consistency upgrades, including dynamic per-metric thresholds, explicit success flags, evaluation-model metadata in outputs, and new performance assertions via LatencyMetric and CostMetric. Integrations and APIs matured with improved LangChain and Azure OpenAI compatibility, expanded LlamaIndex tracing and evaluator wrappers, and Hugging Face Trainer callbacks collected into a dedicated integrations package.
Backward Incompatible Change
v0.20.50
- Rename bias and toxicity metrics to `BiasMetric` and `ToxicityMetric`, and simplify their usage to score `actual_output` directly with a maximum threshold. Update imports and examples to match the new metric names. (#423) (Jeffrey Ip)
v0.20.49
- Add `LatencyMetric` and `CostMetric` so you can assert performance and spend thresholds in evaluations. Rename `LLMTestCase.execution_time` to `latency` and update docs and tests accordingly. (#414) (Jeffrey Ip)
- Rename `LLMEvalMetric` to `GEval` and update imports and tests accordingly. Test output now includes the evaluation model used, making it easier to trace which model produced a score. (#415) (Jeffrey Ip)
- Separate Ragas metrics into `deepeval.metrics.ragas` and stop exporting them from `deepeval.metrics`. Also rename metric score details to `score_breakdown` for clearer per-component reporting. (#417) (Jeffrey Ip)
New Feature
v0.20.54
- Add support for passing a custom evaluation model to LLM-based metrics by accepting `DeepEvalBaseModel` instances via the `model` argument. This lets you plug in non-default LLM backends (including LangChain chat models) without wrapping them in the built-in GPT model. (#445) (Jeffrey Ip)
v0.20.53
- Add a dedicated `integrations` package for Hugging Face, LlamaIndex, and Harness, including new LlamaIndex evaluator wrappers. Rename the Hugging Face trainer callback to `DeepEvalHuggingFaceCallback` and adjust tests to match. (#435) (Jeffrey Ip)
v0.20.52
- Add `DeepEvalCallback` support for Hugging Face Trainer, with improved output via a new Rich-based display manager. Extend evaluation data handling by supporting retrieval context in `Golden` and allowing `EvaluationDataset` to accept an optional list of `Golden` examples. (#368) (Pratyush K. Patnaik)
- Add a `--deployment`/`-d` option to the test CLI to enable deployment mode and pass the flag through to the pytest plugin and test run metadata. (#429) (Jeffrey Ip)
v0.20.48
- Support passing a LangChain `BaseChatModel` instance (in addition to a model name) to RAGAS metrics, making it easier to run evaluations with custom chat model backends. (#410) (Jeffrey Ip)
v0.20.44
- Add LlamaIndex integration for tracing via a `LlamaIndexCallbackHandler`, capturing nested LLM, retriever, and embedding events into the trace stack. (#392) (Jeffrey Ip)
Improvement
v0.20.55
- Bump package version to 0.20.54 for the latest release. (#446) (Jeffrey Ip)
v0.20.56
- Update the package metadata for a new release. (#448) (Jeffrey Ip)
- Add optional `cost` and `latency` fields to test run API payloads so performance and spend can be logged alongside run duration. (#449) (Jeffrey Ip)
- Add `alias` support to evaluation datasets and propagate it to created and pulled test cases via `dataset_alias`. Prevent evaluating an empty dataset by raising a clear error when no test cases are present. (#450) (Jeffrey Ip)
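Raising early on an empty dataset turns a silent no-op into an immediate, explainable failure. A sketch of the guard (the class and function names here are hypothetical, not deepeval's API):

```python
class EmptyDatasetError(ValueError):
    """Raised when an evaluation is requested on a dataset with no test cases."""

def run_evaluation(test_cases):
    # Fail fast with a clear message instead of silently evaluating nothing.
    if not test_cases:
        raise EmptyDatasetError(
            "Dataset has no test cases; add or pull test cases before evaluating."
        )
    return [{"input": tc, "passed": True} for tc in test_cases]

try:
    run_evaluation([])
    error_message = None
except EmptyDatasetError as exc:
    error_message = str(exc)
```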
v0.20.54
- Update package metadata for a new release. (#437) (Jeffrey Ip)
- Improve `--deployment` handling by allowing an optional string value and auto-detecting common CI environments to populate it. This helps ensure deployment mode is enabled consistently when running tests in CI. (#439) (Jeffrey Ip)
- Add support for `retrievalContext` when parsing dataset goldens, ensuring retrieval context is correctly read from API responses. (#440) (Jeffrey Ip)
- Add support for passing deployment metadata from GitHub Actions into test runs. Deployment runs now send structured configs and can skip posting results for pull requests, and they no longer auto-open the results page in CI. (#442) (Jeffrey Ip)
- Add docs for the Hugging Face `transformers` Trainer callback, including setup examples and reference for options like `show_table` and `show_table_every` during training evaluation. (#444) (Pratyush K. Patnaik)
v0.20.53
- Prepare a new release by updating the package version metadata. (#432) (Jeffrey Ip)
- Remove a redundant `Toxicity` entry from the README to avoid confusion in the metrics list. (#434) (nicholasburka)
- Improve the LlamaIndex integration with clearer evaluator names and expanded documentation. Add end-to-end examples for evaluating RAG responses, extracting retrieval context, and using LlamaIndex evaluators for common metrics like relevancy, faithfulness, summarization, bias, and toxicity. (#436) (Jeffrey Ip)
v0.20.52
- Bump the package release version to 0.20.51. (#427) (Jeffrey Ip)
- Add empty-list defaults for `goldens` and `test_cases` when creating an evaluation dataset, so you can initialize it without passing either argument. (#428) (Jeffrey Ip)
v0.20.51
- Prepare a new release by updating the package version metadata. (#424) (Jeffrey Ip)
v0.20.50
- Bump package version to keep metadata in sync for the latest release. (#420) (Jeffrey Ip)
- Improve quick start docs and examples by clarifying evaluation wording and updating the sample test to use `AnswerRelevancyMetric` with `retrieval_context`, matching current APIs. (#421) (Jeffrey Ip)
v0.20.49
- Bump package version to 0.20.48 for the latest release. (#411) (Jeffrey Ip)
- Fix the `ContextualPrecisionMetric` docs to reference `expected_output` instead of `actual_output`. Improve `measure()` by removing unnecessary type checking for cleaner, more predictable behavior. (#412) (Sehun Heo)
- Add evaluation model information to metric metadata in the test run API, and show it in the results table output. When unavailable, the evaluation model is displayed as n/a. (#418) (Jeffrey Ip)
v0.20.47
- Bump the package version for the latest release. (#405) (Jeffrey Ip)
- Support passing either a model name or a LangChain `BaseChatModel` to LLM-based metrics, improving compatibility with more model backends during evaluation. (#408) (Jeffrey Ip)
v0.20.48
- Update package metadata for a new release, including the internal version string and project version. (#409) (Jeffrey Ip)
v0.20.45
- Improve metric evaluation output by showing a spinner-based progress indicator instead of printing a one-off message. Progress is written to stderr and is transient by default for cleaner CLI logs. (#396) (Jeffrey Ip)
- Prepare a new release by updating the package version metadata. (#398) (Jeffrey Ip)
- Improve metric configuration by renaming `minimum_score` to `threshold` and updating test output to report the new field. Add `RAGASAnswerRelevancyMetric` to the public metrics exports and refresh RAGAS test imports to match. (#400) (Jeffrey Ip)
- Add a `success` flag to metric metadata so test run results clearly indicate whether each metric met its threshold. (#402) (Jeffrey Ip)
v0.20.46
- Bump package release metadata to the latest version for publishing and distribution. (#403) (Jeffrey Ip)
v0.20.44
- Update package metadata for a new release. (#390) (Jeffrey Ip)
- Improve `track` so it can send events on a background thread by default, reducing blocking in the calling code. Add an option to run the request synchronously when needed. (#391) (Jeffrey Ip)
- Add a Sentry telemetry counter that records when an evaluation run completes, including CLI runs. Keep exception reporting behind `ERROR_REPORTING=YES` and skip setup when outbound traffic is blocked by a firewall. (#394) (Jeffrey Ip)
- Make the per-metric pass threshold dynamic by using each metric's `minimum_score` instead of a fixed 0.5. (#395) (Jeffrey Ip)
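The background-by-default tracking described above is commonly built as a queue drained by a daemon worker thread, with a synchronous escape hatch. A self-contained sketch of that design (class and method names are illustrative, not deepeval's API; the list append stands in for the HTTP request):

```python
import queue
import threading

class EventTracker:
    """Send events from a worker thread so callers are not blocked."""

    def __init__(self):
        self.sent = []
        self._q = queue.Queue()
        # Daemon thread: it won't keep the process alive on exit.
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self._q.get()
            self.sent.append(event)  # stand-in for the network request
            self._q.task_done()

    def track(self, event, synchronous: bool = False):
        if synchronous:
            self.sent.append(event)  # send inline when ordering matters
        else:
            self._q.put(event)       # default: enqueue and return immediately

    def flush(self):
        self._q.join()               # wait for queued events to drain

tracker = EventTracker()
tracker.track({"name": "evaluation_run"})
tracker.track({"name": "metric_used"}, synchronous=True)
tracker.flush()
```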
Bug Fix
v0.20.55
- Fix package setup so the `integrations` module is included in source and wheel distributions. This prevents missing `deepeval.integrations` files after installing from PyPI. (#447) (Yves Junqueira)
v0.20.56
- Fix `CostMetric` and `LatencyMetric` to use clearer `max_cost` and `max_latency` constructor arguments instead of `threshold`, and update docs and tests to match. This makes performance limits easier to configure consistently. (#451) (Jeffrey Ip)
v0.20.54
- Improve optional dependency handling by conditionally importing `transformers` and `sentence_transformers` integrations. This prevents import-time failures when those libraries aren't installed and surfaces a clear error only when the related callbacks or metrics are used. (#438) (Jeffrey Ip)
v0.20.52
- Fix `EvaluationDataset` using shared mutable default lists for `goldens` and `test_cases`, which could leak entries across instances. New datasets now start with fresh empty lists when not provided. (#431) (jeffometer)
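This is the classic Python shared-mutable-default pitfall: a default list is created once and shared by every instance. The fix is a per-instance factory, as in this sketch (the class here is a simplified stand-in for `EvaluationDataset`):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetSketch:
    # field(default_factory=list) builds a fresh list per instance.
    # A bare `goldens: List[str] = []` style default would be a single
    # shared object, so appends on one dataset would leak into others.
    goldens: List[str] = field(default_factory=list)
    test_cases: List[str] = field(default_factory=list)

a = DatasetSketch()
b = DatasetSketch()
a.goldens.append("golden-1")  # must not appear in b.goldens
```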
v0.20.51
- Fix input validation for bias and toxicity metrics to only raise when `actual_output` is None, preventing false failures when the output is an empty string. (#426) (Jeffrey Ip)
v0.20.50
- Fix API key detection by checking stored credentials instead of relying on a local `.deepeval` file, preventing push/pull and test-run uploads from failing when the file is missing. (#422) (Jeffrey Ip)
v0.20.49
- Fix `ContextualPrecisionMetric` validation to reject missing `actual_output`, and clarify the error message and docs to list `actual_output` as a required `LLMTestCase` field. (#413) (Jeffrey Ip)
- Fix event tracking by removing stray debug prints and improving handling of non-JSON API responses to avoid unexpected errors during requests. (#416) (Jeffrey Ip)
- Fix division-by-zero errors in several evaluation metrics by returning a score of 0 when there are no verdicts, no relevant nodes, or no context sentences. (#419) (Jeffrey Ip)
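The guard is simply an early return of 0 when the denominator would be empty. A simplified stand-in (a plain ratio of relevant verdicts, not the library's exact precision formula) shows the shape of the fix:

```python
def score_from_verdicts(verdicts):
    """Return 0.0 when there is nothing to score instead of dividing by zero."""
    if not verdicts:
        return 0.0  # no verdicts / nodes / sentences: defined, not a crash
    relevant = sum(1 for v in verdicts if v == "yes")
    return relevant / len(verdicts)

empty_score = score_from_verdicts([])
normal_score = score_from_verdicts(["yes", "no", "yes", "yes"])
```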
v0.20.45
- Fix Azure OpenAI support in the LangChain integration by switching to `langchain_openai` and passing `model_version` directly (defaulting to an empty string when unset). This prevents Azure model initialization failures due to outdated imports or missing version handling. (#401) (Jeffrey Ip)
v0.20.46
- Fix results table pass/fail display by using each metric's `success` flag instead of comparing score to threshold, so custom metrics report accurately. (#404) (Jeffrey Ip)