
TruthfulQA

TruthfulQA assesses how truthfully language models answer questions. It includes 817 questions across 38 topics such as health, law, finance, and politics. The questions target common misconceptions, which some humans would answer falsely because of a false belief. For more information, visit the TruthfulQA GitHub page.

Arguments

There are TWO optional arguments when using the TruthfulQA benchmark:

  • [Optional] tasks: a list of tasks (TruthfulQATask enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The complete list of TruthfulQATask enums is given in the TruthfulQA Tasks section below.
  • [Optional] mode: a TruthfulQAMode enum that selects the evaluation mode. This is set to TruthfulQAMode.MC1 by default. deepeval currently supports 2 modes: MC1 and MC2.

Usage

The code below evaluates a custom mistral_7b model (see the guide on using any custom LLM) on the Advertising and Fiction tasks in TruthfulQA using MC2 mode.

from deepeval.benchmarks import TruthfulQA
from deepeval.benchmarks.tasks import TruthfulQATask
from deepeval.benchmarks.modes import TruthfulQAMode

# Define benchmark with specific tasks and evaluation mode
benchmark = TruthfulQA(
    tasks=[TruthfulQATask.ADVERTISING, TruthfulQATask.FICTION],
    mode=TruthfulQAMode.MC2
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score ranges from 0 to 1 and represents the fraction of accurate predictions across tasks. MC1 mode is scored with an exact match scorer: a prediction counts as correct only if the single answer the model selects exactly matches the correct option.

Conversely, MC2 mode uses a truth identification scorer, which measures how many of the truthful answers the model correctly identifies. It compares sorted lists of predicted and target truthful answer IDs to determine the percentage of accurately identified truths.
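To make the two scoring styles concrete, here is a minimal pure-Python sketch. It is not deepeval's implementation: the function names (exact_match_score, truth_identification_score, overall_score) and the input shapes are assumptions chosen for illustration, and the MC2 function simply returns the share of target truthful answer IDs that appear in the prediction.

```python
from typing import Iterable, List


def exact_match_score(prediction: str, target: str) -> int:
    """MC1-style scoring (illustrative): 1 if the single predicted option
    equals the target option, otherwise 0."""
    return int(prediction.strip() == target.strip())


def truth_identification_score(
    predicted_ids: Iterable[int], target_ids: Iterable[int]
) -> float:
    """MC2-style scoring (illustrative): fraction of the target truthful
    answer IDs that also appear among the predicted IDs."""
    target = sorted(set(target_ids))
    predicted = sorted(set(predicted_ids))
    if not target:
        return 0.0
    hits = sum(1 for answer_id in predicted if answer_id in target)
    return hits / len(target)


def overall_score(per_question_scores: List[float]) -> float:
    """Average per-question scores into one value between 0 and 1."""
    if not per_question_scores:
        return 0.0
    return sum(per_question_scores) / len(per_question_scores)


# Hypothetical model outputs for two questions in each mode.
mc1_scores = [exact_match_score("2", "2"), exact_match_score("1", "3")]
mc2_scores = [
    truth_identification_score([1, 3], [1, 3, 4]),
    truth_identification_score([2], [2]),
]

print(overall_score(mc1_scores))  # 0.5
print(overall_score(mc2_scores))
```

Note that this sketch does not penalize extra predicted IDs that are not truthful. Whether the real scorer does so is not stated here, so treat that detail as an open assumption.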

TruthfulQA Tasks

The TruthfulQATask enum classifies the diverse range of tasks covered in the TruthfulQA benchmark.

from deepeval.benchmarks.tasks import TruthfulQATask

truthful_tasks = [TruthfulQATask.ADVERTISING]

Below is the comprehensive list of available tasks:

  • LANGUAGE
  • MISQUOTATIONS
  • NUTRITION
  • FICTION
  • SCIENCE
  • PROVERBS
  • MANDELA_EFFECT
  • INDEXICAL_ERROR_IDENTITY
  • CONFUSION_PLACES
  • ECONOMICS
  • PSYCHOLOGY
  • CONFUSION_PEOPLE
  • EDUCATION
  • CONSPIRACIES
  • SUBJECTIVE
  • MISCONCEPTIONS
  • INDEXICAL_ERROR_OTHER
  • MYTHS_AND_FAIRYTALES
  • INDEXICAL_ERROR_TIME
  • MISCONCEPTIONS_TOPICAL
  • POLITICS
  • FINANCE
  • INDEXICAL_ERROR_LOCATION
  • CONFUSION_OTHER
  • LAW
  • DISTRACTION
  • HISTORY
  • WEATHER
  • STATISTICS
  • MISINFORMATION
  • SUPERSTITIONS
  • LOGICAL_FALSEHOOD
  • HEALTH
  • STEREOTYPES
  • RELIGION
  • ADVERTISING
  • SOCIOLOGY
  • PARANORMAL
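
Because the tasks are ordinary enum members, related tasks can be selected by name instead of being listed one by one. The sketch below collects every task whose name starts with INDEXICAL_ERROR. The helper tasks_with_prefix is hypothetical, and the fallback enum (including its string values) is only a stand-in so the snippet runs where deepeval is not installed.

```python
from enum import Enum

try:
    from deepeval.benchmarks.tasks import TruthfulQATask
except ImportError:
    # Stand-in for illustration only: a small subset of the task names
    # listed above, with assumed string values.
    class TruthfulQATask(Enum):
        ADVERTISING = "Advertising"
        FICTION = "Fiction"
        INDEXICAL_ERROR_IDENTITY = "Indexical Error: Identity"
        INDEXICAL_ERROR_OTHER = "Indexical Error: Other"
        INDEXICAL_ERROR_TIME = "Indexical Error: Time"
        INDEXICAL_ERROR_LOCATION = "Indexical Error: Location"


def tasks_with_prefix(prefix: str) -> list:
    """Return every task whose enum member name starts with `prefix`."""
    return [task for task in TruthfulQATask if task.name.startswith(prefix)]


# Gather the indexical-error tasks; this list can be passed as `tasks=`.
indexical_tasks = tasks_with_prefix("INDEXICAL_ERROR")
print([task.name for task in indexical_tasks])
```

The resulting list has the same shape as the hand-written truthful_tasks list above.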

FAQs

What does the TruthfulQA benchmark measure?
TruthfulQA assesses how truthfully an LLM answers questions. It includes 817 questions across 38 topics such as health, law, finance, and politics, targeting common misconceptions that some humans would answer falsely because of a false belief.

Which tasks can I run with TruthfulQA?
You can pass a list of TruthfulQATask enums to the tasks argument, for example ADVERTISING or FICTION. By default, all available tasks are evaluated.

What are the MC1 and MC2 modes?
The mode argument takes a TruthfulQAMode enum and defaults to TruthfulQAMode.MC1. MC1 involves selecting one correct answer from 4-5 options, while MC2 (Multi-true) requires identifying multiple correct answers from a set. Both are multiple-choice evaluations.

How is TruthfulQA scored?
The overall_score ranges from 0 to 1. MC1 uses an exact match scorer that checks whether the single selected answer matches the correct option, while MC2 uses a truth identification scorer that compares sorted lists of predicted and target truthful answer IDs. See the benchmarks introduction for more on scoring.

Should I use MC1 or MC2?
Use MC1 as a benchmark for pinpoint accuracy and MC2 for depth of understanding. Both modes use the same set of questions, so you can compare them directly.

How do I run TruthfulQA on a custom LLM?
Wrap your model in DeepEvalBaseLLM, create a benchmark with TruthfulQA(tasks=[...], mode=TruthfulQAMode.MC2), and call benchmark.evaluate(model=mistral_7b), where mistral_7b is your wrapped model, before reading benchmark.overall_score. See the benchmarking guide for using any custom LLM.
