IFEval
IFEval (Instruction-Following Evaluation for Large Language Models) is a benchmark for evaluating the instruction-following capabilities of language models. Each prompt contains verifiable instructions, such as "write in more than 400 words" or "mention the keyword of AI at least 3 times", so the benchmark tests aspects of instruction following including format compliance, constraint adherence, output structure requirements, and specific instruction types.
tip
deepeval's IFEval implementation is based on the original IFEval research paper by Google.
Arguments
There is ONE optional argument when using the IFEval benchmark:
- [Optional] n_problems: limits the number of test cases the benchmark will evaluate. Defaulted to None. (See the sketch below.)
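For example, a quick sketch of both configurations; with the default of None, no limit is applied and every IFEval prompt is evaluated:

from deepeval.benchmarks import IFEval

# Evaluate every IFEval prompt (n_problems defaults to None)
benchmark_full = IFEval()

# Limit evaluation to 5 prompts, e.g. for a quick smoke test
benchmark_small = IFEval(n_problems=5)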
Usage
The code below evaluates a custom mistral_7b model (click here to learn how to use ANY custom LLM) and assesses its performance on 5 IFEval problems.
from deepeval.benchmarks import IFEval
# Define benchmark with 'n_problems'
benchmark = IFEval(n_problems=5)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
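If you haven't wrapped your model yet, below is a minimal sketch of what mistral_7b could look like, assuming deepeval's DeepEvalBaseLLM interface (load_model, generate, a_generate, get_model_name) and a Hugging Face Mistral checkpoint; the checkpoint name and generation settings are illustrative, so adapt them to your own setup.

from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Simple single-prompt generation; tune max_new_tokens for your use case
        model = self.load_model()
        inputs = self.tokenizer([prompt], return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=512)
        return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B"

# Illustrative checkpoint; swap in the model you actually use
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)

Once the wrapper is defined, it can be passed directly to benchmark.evaluate() as shown in the usage example above.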