IFEval

IFEval (Instruction-Following Evaluation for Large Language Models) is a benchmark for evaluating the instruction-following capabilities of language models. It tests several aspects of instruction following, including format compliance, constraint adherence, output structure requirements, and specific instruction types.
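To make "format compliance" and "constraint adherence" concrete, here is a minimal sketch of the kind of programmatic checks such a benchmark can apply to a model response. The checker names and rules (`min_words`, `all_lowercase`, `bullet_count`) are hypothetical illustrations, not deepeval's API or its actual IFEval implementation.

```python
import re
from typing import Callable, Dict

# Hypothetical checkers for illustration only; not part of deepeval.

def min_words(n: int) -> Callable[[str], bool]:
    """Constraint: the response must contain at least n whitespace-separated words."""
    return lambda response: len(response.split()) >= n

def all_lowercase(response: str) -> bool:
    """Format rule: the response must contain no uppercase letters."""
    return response == response.lower()

def bullet_count(n: int) -> Callable[[str], bool]:
    """Structure rule: the response must contain exactly n '-' or '*' bullet lines."""
    def check(response: str) -> bool:
        bullets = [line for line in response.splitlines()
                   if re.match(r"^\s*[-*] ", line)]
        return len(bullets) == n
    return check

def check_response(response: str,
                   checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Run every check against the response and report pass/fail per instruction."""
    return {name: check(response) for name, check in checks.items()}

checks = {
    "at_least_5_words": min_words(5),
    "all_lowercase": all_lowercase,
    "exactly_2_bullets": bullet_count(2),
}

response = "- first point here\n- second point here"
results = check_response(response, checks)
print(results)
```

Each instruction is verified by code rather than by a judge model, which is why this style of benchmark focuses on instructions that can be checked deterministically.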

Arguments

There is ONE optional argument when using the IFEval benchmark:

  • [Optional] n_problems: limits the number of test cases the benchmark will evaluate. Defaults to None.

Usage

The code below evaluates a custom mistral_7b model (click here to learn how to use ANY custom LLM) on IFEval, limited to 5 test cases through n_problems, and prints the overall score.

from deepeval.benchmarks import IFEval

# Define benchmark with 'n_problems'
benchmark = IFEval(n_problems=5)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
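The score printed above is a single aggregate number. For context, the original IFEval paper reports accuracy at two granularities: prompt-level (a prompt counts only if every instruction in it is followed) and instruction-level (each instruction counts on its own). The sketch below illustrates the difference on made-up results; the function names and data are hypothetical, and this page does not state which aggregation deepeval's overall_score uses.

```python
from typing import List

def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts where every instruction was followed."""
    return sum(all(per_prompt) for per_prompt in results) / len(results)

def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled across prompts."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    return sum(flat) / len(flat)

# Made-up data: one inner list per prompt, one bool per instruction in it.
results = [
    [True, True],   # both instructions followed
    [True, False],  # one of two followed
    [True],         # single instruction followed
]

prompt_acc = prompt_level_accuracy(results)
instruction_acc = instruction_level_accuracy(results)
print(prompt_acc, instruction_acc)
```

Prompt-level accuracy is the stricter of the two, since a single missed instruction fails the whole prompt.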

FAQs

What does the IFEval benchmark measure?
IFEval evaluates the instruction-following capabilities of language models. It tests various aspects including format compliance, constraint adherence, output structure requirements, and specific instruction types. See the benchmarks introduction for how scoring works across benchmarks.
What arguments does the IFEval benchmark accept?
There is one optional argument, n_problems, which limits the number of test cases the benchmark will evaluate. It defaults to None.
What does the n_problems argument do?
n_problems limits the number of test cases the benchmark will evaluate. By default it is None, so set it to an integer (for example 5) to evaluate a smaller subset.
What research is deepeval's IFEval implementation based on?
deepeval's IFEval implementation is based on the original research paper by Google. You can read more in the benchmarks introduction.
Does IFEval support tasks or n_shots arguments?
No. IFEval exposes a single optional argument, n_problems. It does not take a tasks list or an n_shots argument.
How do I read the result of an IFEval run?
After calling evaluate(), read the overall_score attribute to see how the model performed on the evaluated instruction-following test cases.
How do I run IFEval on a custom LLM?
Define the benchmark, then call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
