Flags and Configs

Sometimes you might want to customize the behavior of evaluate() and assert_test(). You can do this using "configs" (short for configurations) and "flags".

Configs for `evaluate()`

Async Configs

The AsyncConfig controls how concurrently metrics, observed_callback, and test_cases will be evaluated during evaluate().

from deepeval.evaluate import AsyncConfig
from deepeval import evaluate

evaluate(async_config=AsyncConfig(), ...)

There are THREE optional parameters when creating an AsyncConfig:

[Optional] run_async: a boolean which when set to True, enables concurrent evaluation of test cases AND metrics. Defaulted to True.
[Optional] throttle_value: an integer that determines how long (in seconds) to throttle the evaluation of each test case. You can increase this value if your evaluation model is running into rate limit errors. Defaulted to 0.
[Optional] max_concurrent: an integer that determines the maximum number of test cases that can be run in parallel at any point in time. You can decrease this value if your evaluation model is running into rate limit errors. Defaulted to 20.

The throttle_value and max_concurrent parameters are only used when run_async is set to True. Combining throttle_value with max_concurrent is the best way to handle rate limiting errors, either in your LLM judge or in your LLM application, when running evaluations.
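To make the interaction between the two parameters concrete, here is a minimal standard-library sketch of the general pattern: a semaphore caps how many jobs run at once, and a sleep spaces out launches. This is not deepeval's implementation; run_throttled and fake_evaluation are hypothetical names, and treating throttle_value as a delay between launches is an assumption made for illustration.

```python
import asyncio


async def run_throttled(jobs, max_concurrent=20, throttle_value=0):
    """Run coroutine factories with a concurrency cap and a delay between launches."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # At most `max_concurrent` jobs hold the semaphore at any moment.
        async with semaphore:
            return await job()

    tasks = []
    for index, job in enumerate(jobs):
        if index > 0 and throttle_value > 0:
            # Assumption: throttling means pausing before each new launch.
            await asyncio.sleep(throttle_value)
        tasks.append(asyncio.create_task(guarded(job)))
    return await asyncio.gather(*tasks)


async def demo():
    active = 0
    peak = 0

    async def fake_evaluation():
        # Hypothetical stand-in for one test case hitting an LLM judge.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.05)
        active -= 1
        return "done"

    results = await run_throttled([fake_evaluation] * 6, max_concurrent=2)
    return results, peak


results, peak = asyncio.run(demo())
```

Lowering max_concurrent limits the burst size, while raising throttle_value lowers the sustained request rate, which is why the two are usually tuned together.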

Display Configs

The DisplayConfig controls how results and intermediate execution steps are displayed during evaluate().

from deepeval.evaluate import DisplayConfig
from deepeval import evaluate

evaluate(display_config=DisplayConfig(), ...)

There are TEN optional parameters when creating a DisplayConfig:

[Optional] verbose_mode: an optional boolean which, when not None, overrides each metric's verbose_mode value. Defaulted to None.
[Optional] display: a str of either "all", "failing" or "passing", which allows you to selectively decide which type of test cases to display as the final result. Defaulted to "all".
[Optional] show_indicator: a boolean which when set to True, shows the evaluation progress indicator for each individual metric. Defaulted to True.
[Optional] print_results: a boolean which when set to True, prints the result of each evaluation. Defaulted to True.
[Optional] results_folder: a string path to a directory where each call to evaluate() (or evals_iterator()) will be persisted as a test_run_<YYYYMMDD_HHMMSS>.json file. Defaulted to None (no local save). See Saving test runs locally below.
[Optional] results_subfolder: an optional string that, when set together with results_folder, nests the test_run_*.json files under results_folder/results_subfolder/. Defaulted to None (flat layout).
[Optional] truncate_passing_cases: a boolean which when set to True, truncates the terminal output of passing test cases. Defaulted to True.
[Optional] inspect_after_run: a boolean which when set to True, prompts you at the end of an evals_iterator() run to open the captured traces in the deepeval inspect TUI. Only fires in interactive terminals when at least one test case has a trace. Set to False to disable per call, or export DEEPEVAL_NO_INSPECT_PROMPT=1 to disable globally (e.g. in CI). Defaulted to True.
[Optional] file_type: a string of either "html" or "md", which allows you to export the evaluation dashboard to a file. Defaulted to None.
[Optional] file_output_dir: a string which when set, writes the evaluation dashboard to the specified directory using the format specified in file_type. Defaulted to None.
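Two of the rules in this list are easy to misread, so here is a small illustrative sketch of them in plain Python: verbose_mode overrides a metric's own setting only when it is not None, and display selects which test cases are shown. The function names and the (name, passed) result shape are hypothetical and are not part of deepeval.

```python
from typing import List, Optional, Tuple


def effective_verbose(metric_verbose: bool, config_verbose: Optional[bool]) -> bool:
    # None means "no override": the metric keeps its own verbose_mode.
    return metric_verbose if config_verbose is None else config_verbose


def select_for_display(
    results: List[Tuple[str, bool]], display: str = "all"
) -> List[Tuple[str, bool]]:
    # Each result is a hypothetical (test case name, passed) pair.
    if display == "all":
        return list(results)
    if display == "passing":
        return [r for r in results if r[1]]
    if display == "failing":
        return [r for r in results if not r[1]]
    raise ValueError('display must be "all", "failing" or "passing"')


sample = [("case_a", True), ("case_b", False), ("case_c", True)]
failing_only = select_for_display(sample, "failing")
```

Note that passing False for verbose_mode is different from leaving it as None: False switches verbose output off even for metrics that were created with it on.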

Saving test runs locally

Set results_folder to persist each evaluate() call to disk as a structured TestRun JSON. Hyperparameters, per-test-case scores, and metric reasons are all serialized into each file via the same schema that Confident AI uses — no extra setup required.

from deepeval import evaluate
from deepeval.evaluate import DisplayConfig

for temp in [0.0, 0.4, 0.8]:
    evaluate(
        test_cases=test_cases,
        metrics=metrics,
        hyperparameters={"model": "gpt-4o-mini", "temperature": temp},
        display_config=DisplayConfig(results_folder="./evals/prompt-v3"),
    )

After the loop, the folder is flat — just the raw test runs:

./evals/prompt-v3/
  test_run_20260421_140114.json
  test_run_20260421_140132.json
  test_run_20260421_140151.json

The timestamp prefix makes ls order match chronological order, so an AI agent (Cursor, Claude Code) can iterate over the folder in the order runs happened. If two runs finish within the same second, the writer appends _2, _3, … to the filename so nothing is ever overwritten.
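The naming scheme described above can be sketched with the standard library as follows. This is an illustration only, not deepeval's writer: unique_run_path and runs_in_order are hypothetical helpers, and the JSON written here is a placeholder rather than the real TestRun schema.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import List


def unique_run_path(folder: Path, now: datetime) -> Path:
    """Return a test_run_<YYYYMMDD_HHMMSS>.json path that does not exist yet."""
    stem = "test_run_" + now.strftime("%Y%m%d_%H%M%S")
    candidate = folder / f"{stem}.json"
    suffix = 2
    while candidate.exists():
        # Same-second collision: try _2, _3, ... until a free name is found.
        candidate = folder / f"{stem}_{suffix}.json"
        suffix += 1
    return candidate


def runs_in_order(folder: Path) -> List[Path]:
    """List saved runs by name; the timestamp makes name order chronological."""
    return sorted(folder.glob("test_run_*.json"))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    stamps = [
        datetime(2026, 4, 21, 14, 1, 32),
        datetime(2026, 4, 21, 14, 1, 14),
        datetime(2026, 4, 21, 14, 1, 14),  # same second as the previous run
    ]
    for stamp in stamps:
        path = unique_run_path(folder, stamp)
        path.write_text(json.dumps({"placeholder": True}))
    names = [p.name for p in runs_in_order(folder)]
```

A script or agent that wants to compare runs can iterate over runs_in_order(folder) and call json.load on each file without tracking creation times separately.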

Set results_subfolder to nest the runs under an extra directory — useful when the parent folder already holds other artifacts:

DisplayConfig(results_folder="./evals/prompt-v3", results_subfolder="test_runs")

./evals/prompt-v3/
  test_runs/
    test_run_20260421_140114.json
    test_run_20260421_140132.json

If results_folder is unset but the DEEPEVAL_RESULTS_FOLDER environment variable is present, deepeval falls back to that path for backwards compatibility.
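The lookup order can be summarized in a short sketch. resolve_results_dir is a hypothetical helper, not a deepeval function. It applies results_subfolder only alongside an explicit results_folder, as the parameter description states; how a subfolder combines with the environment-variable fallback is not specified here, so the sketch leaves that case flat.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_results_dir(
    results_folder: Optional[str],
    results_subfolder: Optional[str] = None,
) -> Optional[Path]:
    if results_folder:
        base = Path(results_folder)
        # Nest the runs only when a subfolder was requested.
        return base / results_subfolder if results_subfolder else base
    # No explicit folder: fall back to the environment variable, if present.
    fallback = os.environ.get("DEEPEVAL_RESULTS_FOLDER")
    if fallback:
        return Path(fallback)
    return None  # nothing configured, so no local save


os.environ["DEEPEVAL_RESULTS_FOLDER"] = "./legacy-results"
explicit = resolve_results_dir("./evals/prompt-v3", "test_runs")
from_env = resolve_results_dir(None)
os.environ.pop("DEEPEVAL_RESULTS_FOLDER")
unset = resolve_results_dir(None)
```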

Error Configs

The ErrorConfig controls how errors are handled in evaluate().

from deepeval.evaluate import ErrorConfig
from deepeval import evaluate

evaluate(error_config=ErrorConfig(), ...)

There are TWO optional parameters when creating an ErrorConfig:

[Optional] skip_on_missing_params: a boolean which when set to True, skips all metric executions for test cases with missing parameters. Defaulted to False.
[Optional] ignore_errors: a boolean which when set to True, ignores all exceptions raised during metric execution for each test case. Defaulted to False.

If both skip_on_missing_params and ignore_errors are set to True, skip_on_missing_params takes precedence. This means that if a metric is missing required test case parameters, it will be skipped (and the result will be missing) rather than appearing as an ignored error in the final test run.
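The precedence rule can be written as a small decision function. This is an illustrative sketch rather than deepeval code: metric_outcome and its return labels are hypothetical, and it assumes that a metric with missing parameters would otherwise raise an error like any other failure.

```python
def metric_outcome(
    missing_params: bool,
    raised_error: bool,
    skip_on_missing_params: bool = False,
    ignore_errors: bool = False,
) -> str:
    if missing_params:
        if skip_on_missing_params:
            # Takes precedence over ignore_errors: no result is recorded.
            return "skipped"
        # Assumption: missing parameters surface as an ordinary error.
        return "error ignored" if ignore_errors else "error raised"
    if raised_error:
        return "error ignored" if ignore_errors else "error raised"
    return "evaluated"


both_flags = metric_outcome(
    missing_params=True,
    raised_error=False,
    skip_on_missing_params=True,
    ignore_errors=True,
)
```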

Cache Configs

The CacheConfig controls the caching behavior of evaluate().

from deepeval.evaluate import CacheConfig
from deepeval import evaluate

evaluate(cache_config=CacheConfig(), ...)

There are TWO optional parameters when creating a CacheConfig:

[Optional] use_cache: a boolean which when set to True, uses cached test run results instead of re-evaluating. Defaulted to False.
[Optional] write_cache: a boolean which when set to True, writes test run results to disk. Defaulted to True.

Because write_cache writes to disk, you should disable it if disk writes cause errors in your environment.

Flags for `deepeval test run`

Parallelization

Evaluate each test case in parallel by providing a number to the -n flag to specify how many processes to use.

deepeval test run test_example.py -n 4

Cache

Provide the -c flag (with no arguments) to read from the local deepeval cache instead of re-evaluating test cases on the same metrics.

deepeval test run test_example.py -c

Ignore Errors

The -i flag (with no arguments) allows you to ignore errors in metric executions during a test run. An example of where this is helpful is if you're using a custom LLM and often find it generating invalid JSON that would otherwise stop the execution of the entire test run.

deepeval test run test_example.py -i

Verbose Mode

The -v flag (with no arguments) allows you to turn on verbose_mode for all metrics run using deepeval test run. Not supplying the -v flag will default each metric's verbose_mode to its value at instantiation.

deepeval test run test_example.py -v

Skip Test Cases

The -s flag (with no arguments) allows you to skip metric executions where the test case has missing or insufficient parameters (such as retrieval_context) that are required for evaluation. An example of where this is helpful is if you're using a metric such as the ContextualPrecisionMetric but don't want to apply it when the retrieval_context is None.

deepeval test run test_example.py -s

Identifier

The -id flag followed by a string allows you to name test runs and better identify them on Confident AI. An example of where this is helpful is if you're running automated deployment pipelines, have deployment IDs, or just want a way to identify which test run is which for comparison purposes.

deepeval test run test_example.py -id "My Latest Test Run"

Display Mode

The -d flag followed by a string of "all", "passing", or "failing" allows you to display only certain test cases in the terminal. For example, you can display "failing" only if you only care about the failing test cases.

deepeval test run test_example.py -d "failing"

Repeats

Repeat each test case by providing a number to the -r flag to specify how many times to rerun each test case.

deepeval test run test_example.py -r 2
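When a test run is launched from a script, for example in a CI job, it can help to assemble the command from options instead of concatenating strings. The helper below is a hypothetical sketch that uses only the flags documented above and assumes they can be combined in a single invocation; it builds an argument list suitable for subprocess.run but does not execute anything.

```python
import shlex
from typing import List, Optional

DISPLAY_MODES = ("all", "passing", "failing")


def build_test_run_command(
    test_file: str,
    processes: Optional[int] = None,
    use_cache: bool = False,
    ignore_errors: bool = False,
    verbose: bool = False,
    skip_missing: bool = False,
    identifier: Optional[str] = None,
    display: Optional[str] = None,
    repeats: Optional[int] = None,
) -> List[str]:
    command = ["deepeval", "test", "run", test_file]
    if processes is not None:
        command += ["-n", str(processes)]  # number of processes
    if use_cache:
        command.append("-c")  # read from the local cache
    if ignore_errors:
        command.append("-i")  # ignore metric execution errors
    if verbose:
        command.append("-v")  # verbose_mode for all metrics
    if skip_missing:
        command.append("-s")  # skip metrics with missing parameters
    if identifier is not None:
        command += ["-id", identifier]  # name the test run
    if display is not None:
        if display not in DISPLAY_MODES:
            raise ValueError(f"display must be one of {DISPLAY_MODES}")
        command += ["-d", display]
    if repeats is not None:
        command += ["-r", str(repeats)]  # reruns per test case
    return command


command = build_test_run_command(
    "test_example.py", processes=4, ignore_errors=True, display="failing"
)
printable = shlex.join(command)  # shell-quoted form for logging
```

Passing the list form to subprocess.run avoids quoting problems with identifiers that contain spaces, such as "My Latest Test Run".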

Hooks

deepeval's Pytest integration allows you to run custom code at the end of each evaluation via the @deepeval.on_test_run_end decorator:

test_example.py

...

@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")
