Datasets

In deepeval, an evaluation dataset (or simply dataset) is a collection of goldens. At evaluation time, you first convert the goldens in your dataset to test cases before running evals on those test cases.

What Are Goldens?

Goldens are a more flexible alternative to test cases in deepeval, and are the preferred way to initialize a dataset. Unlike test cases, goldens:

  • Only require input/scenario to initialize
  • Store expected results like expected_output/expected_outcome
  • Serve as templates before becoming fully-formed test cases

Goldens excel in development workflows where you need to:

  • Evaluate changes across different iterations of your LLM application
  • Compare performance between model versions
  • Test with inputs that haven't yet been processed by your LLM

Think of goldens as "pending test cases" - they contain all the input data and expected results, but are missing the dynamic elements (actual_output, retrieval_context, tools_called) that will be generated when your LLM processes them.

You can scroll to the bottom to see the data model of a golden.

Quick Summary

There are two approaches to running evals using datasets in deepeval:

  1. Using deepeval test run
  2. Using evaluate

Depending on the type of goldens you supply, a dataset is either single-turn or multi-turn. Evaluating a dataset is equivalent to evaluating your LLM system, because by definition a dataset contains all the information produced by your LLM that is needed for evaluation.
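
For example, here is a minimal sketch of the second approach using evaluate. The test case here is a hand-written placeholder; in practice you would build test cases from your dataset's goldens as shown later on this page:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# A test case you would normally build from a golden at evaluation time
test_case = LLMTestCase(
    input="What is your name?",
    actual_output="I'm an AI assistant, so I don't have a name.",  # replace with your LLM app's output
)

# Approach 2: run evals programmatically instead of via `deepeval test run`
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric()])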

What are the best practices for curating an evaluation dataset?
  • Ensure telling test coverage: Include diverse real-world inputs, varying complexity levels, and edge cases to properly challenge the LLM.
  • Focused, quantitative test cases: Design with clear scope that enables meaningful performance metrics without being too broad or narrow.
  • Define clear objectives: Align datasets with specific evaluation goals while avoiding unnecessary fragmentation.
info

If you don't already have an EvaluationDataset, a great starting point is to simply write down the prompts you're currently using to manually eyeball your LLM outputs. You can also do this on Confident AI, which integrates 100% with deepeval:

Learn Dataset Annotation on Confident AI

Full documentation for datasets on Confident AI here.

Create A Dataset

An EvaluationDataset in deepeval is simply a collection of goldens. You can initialize an empty dataset to start with:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

A dataset can either be single-turn or multi-turn (but not both). Initializing your dataset with a list of Goldens will make it single-turn, whereas supplying it with ConversationalGoldens will make it multi-turn:

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="What is your name?")])
print(dataset._multi_turn) # prints False
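
And here is a minimal multi-turn counterpart, assuming ConversationalGolden is exported from deepeval.dataset alongside Golden (only scenario is supplied in this sketch):

from deepeval.dataset import EvaluationDataset, ConversationalGolden

dataset = EvaluationDataset(
    goldens=[ConversationalGolden(scenario="User wants to open a bank account.")]
)
print(dataset._multi_turn) # prints True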

To enforce best practices, datasets in deepeval are stateful and opinionated. This means you cannot change the value of _multi_turn once it has been set. However, you can always add new goldens after initialization using the add_golden method:

...

dataset.add_golden(Golden(input="Nice."))

Run Evals On Dataset

You run evals on the test cases in a dataset, which you create at evaluation time from the goldens in that same dataset.


The first step is to load the goldens into your dataset. This example loads a dataset from Confident AI, but you can also explore other options below.

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Dataset") # replace with your alias
print(dataset.goldens) # print to sanity check yourself
tip

Your dataset becomes either single-turn or multi-turn the moment you pull it.

Once you have your dataset and can see a non-empty list of goldens, you can start generating outputs and adding them back to your dataset as test cases via the add_test_case() method:

main.py
from deepeval.test_case import LLMTestCase
...

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)  # replace with your LLM app
    )
    dataset.add_test_case(test_case)

print(dataset.test_cases)  # print to sanity check yourself

Lastly, you can run evaluations on the list of test cases in your dataset:

test_llm_app.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
...

@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

And execute the test file:

deepeval test run test_llm_app.py

You can learn more about assert_test in this section.

Manage Your Dataset

Dataset management is an essential part of your evaluation lifecycle. We recommend Confident AI for your dataset management workflow as it comes with dozens of collaboration features out of the box, but you can also manage datasets locally.
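
For example, here is a minimal sketch of keeping goldens in a local JSON file, assuming Golden is a pydantic v2 model as shown in the data model at the bottom of this page (on pydantic v1 you would use .dict() instead of .model_dump()):

import json

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="What is your name?")])

# Serialize goldens to a local JSON file
with open("goldens.json", "w") as f:
    json.dump([golden.model_dump() for golden in dataset.goldens], f, indent=2)

# Later, load them back into a fresh dataset
with open("goldens.json") as f:
    dataset = EvaluationDataset(goldens=[Golden(**g) for g in json.load(f)])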

Save Dataset

You can save your dataset on the cloud by using the push method:

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="First golden")])
dataset.push(alias="My dataset")

This pushes all goldens in your evaluation dataset to Confident AI. If you don't already have a dataset, the push method will automatically create one. If you already have a dataset with the same alias, you can also choose to optionally overwrite it by setting overwrite to True:

...
dataset.push(alias="My dataset", overwrite=True) # Very dangerous, this will delete all existing goldens in your dataset

Lastly, if you're unsure whether a golden is ready for evaluation, you should queue it instead:

...

dataset.queue(alias="My dataset")

The queue method will similarly push goldens but mark them as unfinalized on Confident AI. This means they won't be pulled until you've manually marked them as finalized on the platform. You can learn more on Confident AI's docs here.

tip

You can also push multi-turn datasets.

Load Dataset

deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.
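
As a minimal local sketch, you can also build goldens from a CSV file yourself with Python's csv module (the file name and column names below are placeholder assumptions about your data):

import csv

from deepeval.dataset import EvaluationDataset, Golden

goldens = []
with open("example.csv", newline="") as f:
    for row in csv.DictReader(f):
        goldens.append(
            Golden(
                input=row["input"],  # assumed column name
                expected_output=row.get("expected_output"),  # assumed column name
            )
        )

dataset = EvaluationDataset(goldens=goldens)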

You can load entire datasets on Confident AI's cloud in one line of code.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

Non-technical domain experts can create, annotate, and comment on datasets on Confident AI. You can also upload datasets in CSV format, or push synthetic datasets created in deepeval to Confident AI in one line of code.

For more information, visit the Confident AI datasets section.

Generate A Dataset

Sometimes, you might not have datasets ready to use, and that's ok. deepeval provides two options for both single-turn and multi-turn use cases:

  • Synthesizer for generating single-turn goldens
  • ConversationSimulator for generating turns in a ConversationalTestCase

Synthesizer

deepeval offers anyone the ability to easily generate synthetic datasets from documents locally on your machine. This is especially helpful if you don't have an evaluation dataset prepared beforehand.

from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer

goldens = Synthesizer().generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf']
)

dataset = EvaluationDataset(goldens=goldens)

In this example, we've used the generate_goldens_from_docs method, which is one of the four generation methods offered by deepeval's Synthesizer. The four methods include:

  • generate_goldens_from_docs
  • generate_goldens_from_contexts
  • generate_goldens_from_scratch
  • generate_goldens_from_goldens

deepeval's Synthesizer uses a series of evolution techniques to complicate generated goldens and make them more closely resemble human-prepared data.

info

For more information on how deepeval's Synthesizer works, visit the synthesizer section.

Conversation Simulator

While a Synthesizer generates goldens, the ConversationSimulator works slightly differently, as it generates turns in a ConversationalTestCase instead:

from typing import Dict, List

from deepeval.conversation_simulator import ConversationSimulator

# Define simulator
simulator = ConversationSimulator(
    user_intentions={"Opening a bank account": 1},
    user_profile_items=[
        "full name",
        "current address",
        "bank account number",
        "date of birth",
        "mother's maiden name",
        "phone number",
        "country code",
    ],
)

# Define model callback
async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
    return f"I don't know how to answer this: {input}"

# Start simulation
convo_test_cases = simulator.simulate(
    model_callback=model_callback,
    stopping_criteria="Stop when the user's banking request has been fully resolved.",
)
print(convo_test_cases)

You can learn more in the conversation simulator page.

Goldens Data Model

The golden data models are nearly identical to their single-turn and multi-turn test case counterparts (i.e., LLMTestCase and ConversationalTestCase).

For single-turn Goldens:

from typing import Dict, List, Optional

from pydantic import BaseModel
from deepeval.test_case import ToolCall

class Golden(BaseModel):
    input: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    expected_tools: Optional[List[ToolCall]] = None

    # Useful metadata for generating test cases
    additional_metadata: Optional[Dict] = None
    comments: Optional[str] = None
    custom_column_key_values: Optional[Dict[str, str]] = None

    # Fields that you should ideally not populate
    actual_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None
    tools_called: Optional[List[ToolCall]] = None
info

The actual_output, retrieval_context, and tools_called fields are meant to be populated dynamically at evaluation time instead of passed directly from a golden to a test case.
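
For example, here is a minimal sketch of populating these fields dynamically while converting goldens into test cases, where your_retriever and your_rag_app are hypothetical stand-ins for your own application:

from deepeval.test_case import LLMTestCase
...

for golden in dataset.goldens:
    # Hypothetical calls into your own application
    retrieval_context = your_retriever(golden.input)
    actual_output = your_rag_app(golden.input, retrieval_context)

    test_case = LLMTestCase(
        input=golden.input,
        expected_output=golden.expected_output,
        actual_output=actual_output,          # generated at evaluation time
        retrieval_context=retrieval_context,  # generated at evaluation time
    )
    dataset.add_test_case(test_case)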

For multi-turn ConversationalGoldens:

from typing import Dict, Optional

from pydantic import BaseModel

class ConversationalGolden(BaseModel):
    scenario: str
    expected_outcome: Optional[str] = None
    user_description: Optional[str] = None

    # Useful metadata for generating test cases
    additional_metadata: Optional[Dict] = None
    comments: Optional[str] = None
    custom_column_key_values: Optional[Dict[str, str]] = None

You can easily add and edit custom columns on Confident AI.