
Optimizing Hyperparameters for LLM Applications

Apart from catching regressions and sanity checking your LLM applications, LLM evaluation and testing play a pivotal role in picking the best hyperparameters for your LLM application.

Which Hyperparameters Should I Iterate On?

These are the hyperparameters you would typically iterate on:

  • model: the LLM to use for generation.
  • prompt template: the variation of prompt templates to use for generation.
  • temperature: the temperature value to use for generation.
  • max tokens: the max token limit to set for your LLM generation.
  • top-K: the number of retrieved nodes in your retrieval_context in a RAG pipeline.
  • chunk size: the size of the retrieved nodes in your retrieval_context in a RAG pipeline.
  • reranking model: the model used to rerank the retrieved nodes in your retrieval_context in a RAG pipeline.
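As a minimal sketch of how such a list can be turned into something you can loop over, the hyperparameters can be written down as a search space and expanded into every combination. The names `search_space` and `iter_combinations`, and the placeholder values, are assumptions made for illustration and are not part of DeepEval.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Hypothetical search space: keys and values are placeholders for illustration.
search_space: Dict[str, List[Any]] = {
    "model": ["model-a", "model-b"],
    "temperature": [0.0, 0.7],
    "top_k": [3, 5],
}


def iter_combinations(space: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per hyperparameter combination (Cartesian product)."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


combinations = list(iter_combinations(search_space))
print(len(combinations))  # number of evaluation runs this space implies
```

The number of combinations grows multiplicatively with each hyperparameter you add, so it is usually worth starting with a small set of values per hyperparameter.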

Finding The Best Hyperparameter Combination

To find the best hyperparameter combination:

  • choose one or more LLM evaluation metrics that fit your evaluation criteria
  • execute evaluations in a nested for-loop, while generating actual_outputs at evaluation time based on the current hyperparameter combination

Let's walk through a quick hypothetical example showing how to find the best model and prompt template combination, using the AnswerRelevancyMetric as the measurement. First, define a function that generates actual_outputs for LLMTestCases based on a given hyperparameter combination:

from typing import List
from deepeval.test_case import LLMTestCase

# Hypothetical helper function to construct LLMTestCases
def construct_test_cases(model: str, prompt_template: str) -> List[LLMTestCase]:
    # Hypothetical functions for you to implement
    prompt = format_prompt_template(prompt_template)
    llm = get_llm(model)

    test_cases: List[LLMTestCase] = []
    for input in list_of_inputs:
        test_case = LLMTestCase(
            input=input,
            # Hypothetical function to generate actual outputs
            # at evaluation time based on your hyperparameters!
            actual_output=generate_actual_output(llm, prompt)
        )
        test_cases.append(test_case)

    return test_cases

Then, define the AnswerRelevancyMetric and use this helper function to construct LLMTestCases:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
...

# Define metric(s)
metric = AnswerRelevancyMetric()

# Start the nested for-loop
for model in models:
    for prompt_template in prompt_templates:
        evaluate(
            test_cases=construct_test_cases(model, prompt_template),
            metrics=[metric],
            # log hyperparameters associated with this batch of test cases
            hyperparameters={
                "model": model,
                "prompt template": prompt_template
            }
        )
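The loop above runs one evaluation per combination, but it does not by itself say which combination won. A rough sketch of the selection step follows, assuming you can obtain a list of metric scores for each combination. The names `pick_best`, `score_fn`, and `fake_score_fn` are hypothetical, and the scores are made up for illustration; how you read scores back from an evaluation run depends on your DeepEval version and is not shown here.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Combination = Tuple[str, str]  # (model, prompt_template)


def pick_best(
    models: List[str],
    prompt_templates: List[str],
    score_fn: Callable[[str, str], List[float]],
) -> Tuple[Combination, float]:
    """Return the combination with the highest mean metric score."""
    mean_scores: Dict[Combination, float] = {}
    for model in models:
        for prompt_template in prompt_templates:
            # score_fn is a hypothetical hook: it should return one
            # metric score per test case for this combination.
            scores = score_fn(model, prompt_template)
            mean_scores[(model, prompt_template)] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores[best]


# Stand-in scorer with made-up numbers, used only to exercise pick_best.
def fake_score_fn(model: str, prompt_template: str) -> List[float]:
    table = {
        ("model-a", "template-1"): [0.6, 0.7],
        ("model-a", "template-2"): [0.8, 0.5],
        ("model-b", "template-1"): [0.9, 0.8],
        ("model-b", "template-2"): [0.4, 0.9],
    }
    return table[(model, prompt_template)]


best_combination, best_mean = pick_best(
    ["model-a", "model-b"], ["template-1", "template-2"], fake_score_fn
)
print(best_combination, best_mean)
```

Comparing mean scores is only one possible choice. If failures matter more than averages, ranking by pass rate against the metric threshold may be a better fit.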

Keeping Track of Hyperparameters in CI/CD

You can also keep track of hyperparameters used during testing in your CI/CD pipelines. This is helpful since you will be able to pinpoint the hyperparameter combination associated with failing test runs.

To begin, log in to Confident AI:

deepeval login

Then define your test function and log hyperparameters in your test file:

test_file.py
import pytest
import deepeval

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_cases = [...]

# Loop through test cases using Pytest
@pytest.mark.parametrize(
    "test_case",
    test_cases,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])


# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    # Return a dict to log additional hyperparameters.
    # You can also return an empty dict {} if there are no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }

Lastly, run deepeval test run:

deepeval test run test_file.py
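The comment in the test file suggests making the logged values dynamic. One possible approach, sketched here under the assumption that your CI/CD pipeline passes settings as environment variables, is to read them in a small helper and return the result from hyperparameters(). The helper name `load_hyperparameters` and the variable names `LLM_MODEL`, `LLM_TEMPERATURE`, and `RAG_CHUNK_SIZE` are hypothetical.

```python
import os
from typing import Dict, Mapping, Union


def load_hyperparameters(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[str, int, float]]:
    """Build the hyperparameter dict from environment variables.

    The variable names are hypothetical; the fallbacks mirror the
    hard-coded values in the test file.
    """
    return {
        "model": env.get("LLM_MODEL", "gpt-4"),
        "temperature": float(env.get("LLM_TEMPERATURE", "1")),
        "chunk size": int(env.get("RAG_CHUNK_SIZE", "500")),
    }


# With nothing set, the defaults are returned.
print(load_hyperparameters({}))
```

Accepting the mapping as an argument keeps the helper easy to test without touching the real environment.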

In the next guide, we'll show you how to build your own custom LLM evaluation metrics in case you want more control over evaluation when picking hyperparameters.
