Improving Your Summarizer

In this section, we'll explore multiple strategies to improve your summarization agent using deepeval. We'll create a full evaluation suite that allows us to iterate on our summarization agent to find the best hyperparameters that help improve it.

Like most LLM applications, our summarizer has tunable hyperparameters that can significantly influence its performance. For the MeetingSummarizer, the key hyperparameters are:

  • Prompt template
  • Generation model

These hyperparameters are common to almost any LLM application; you can also add hyperparameters that are specific to your own use case.
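
For reference, here's a minimal sketch of how these hyperparameters might be exposed on the MeetingSummarizer constructor. The parameter names mirror the configuration used later in this section; adapt them to your own implementation:

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4o",
        summary_system_prompt: str = "...",      # prompt template for summary generation
        action_item_system_prompt: str = "...",  # prompt template for action item extraction
    ):
        self.model = model
        self.summary_system_prompt = summary_system_prompt
        self.action_item_system_prompt = action_item_system_prompt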

Pulling Datasets

In the previous section, we've seen how to create datasets and store them in the cloud. We can now pull that dataset and use it as many times as we need to generate test cases and evaluate our summarization agent.

Here's how we can pull datasets from the cloud:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")

The pulled dataset contains goldens, which can be used to create test cases at runtime and run evals. Here's how to create test cases from the dataset:

from deepeval.test_case import LLMTestCase
from meeting_summarizer import MeetingSummarizer # import your summarizer here

summarizer = MeetingSummarizer() # Initialize with your best config
summary_test_cases = []
action_item_test_cases = []
for golden in dataset.goldens:
    summary, action_items = summarizer.summarize(golden.input)
    summary_test_case = LLMTestCase(
        input=golden.input,
        actual_output=summary
    )
    action_item_test_case = LLMTestCase(
        input=golden.input,
        actual_output=str(action_items)
    )
    summary_test_cases.append(summary_test_case)
    action_item_test_cases.append(action_item_test_case)

print(len(summary_test_cases))
print(len(action_item_test_cases))

You can use these test cases to evaluate your summarizer anywhere, anytime. Make sure you've already created the dataset on Confident AI for this to work; see the datasets documentation to learn more.
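
For example, once the test cases are built, you can score them with the same metrics used earlier. Here, summary_concision and action_item_check are assumed to be the GEval metrics defined in the previous sections:

from deepeval import evaluate

# summary_concision and action_item_check are the GEval metrics
# defined in the earlier sections of this tutorial
evaluate(test_cases=summary_test_cases, metrics=[summary_concision])
evaluate(test_cases=action_item_test_cases, metrics=[action_item_check])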

Iterating On Hyperparameters

Now that we have our dataset, we can use it to generate test cases from our summarization agent under different configurations and evaluate them to find the hyperparameters that work best for our use case. Here's how we can run iterative evals on our summarization agent.

In the previous stages, we evaluated our summarization agent separately for summary conciseness and action item correctness. We'll take the same approach here and run our evaluations separately for summaries and action items, using the metrics sketched below.
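
As a reminder, the two metrics can be defined along these lines. The criteria strings here are illustrative; reuse the exact definitions from the earlier sections:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative metric definitions; reuse the exact criteria from earlier sections
summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and covers only the main discussion points of the transcript.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
action_item_check = GEval(
    name="Action Item Accuracy",
    criteria="Assess whether the extracted action items are explicitly grounded in the transcript and not inferred.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)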

These are the system prompts we've previously used:

For summary generation:

You are an AI assistant summarizing meeting transcripts. Provide a clear and
concise summary of the following conversation, avoiding interpretation and
unnecessary details. Focus on the main discussion points only. Do not include
any action items. Respond with only the summary as plain text — no headings,
formatting, or explanations.

For action items generation:

Extract all action items from the following meeting transcript. Identify individual
and team-wide action items in the following format:

{
    "individual_actions": {
        "Alice": ["Task 1", "Task 2"],
        "Bob": ["Task 1"]
    },
    "team_actions": ["Task 1", "Task 2"],
    "entities": ["Alice", "Bob"]
}

Only include what is explicitly mentioned. Do not infer. You must respond strictly in
valid JSON format — no extra text or commentary.

We will now use the following updated system prompts:

For summary generation:

You are an expert meeting summarization assistant. Generate a tightly written,
executive-style summary of the meeting transcript, focusing only on high-value
information: key technical insights, decisions made, problems discussed, model/tool
comparisons, and rationale behind proposals. Exclude all action items and any
content that is not core to the purpose of the discussion. Prioritize clarity,
brevity, and factual precision. The final summary should read like a high-quality
meeting brief that allows a stakeholder to fully grasp the discussion in under 60
seconds.

For action items generation:

Parse the following meeting transcript and extract only the action items that are explicitly
stated. Organize the output into individual responsibilities, team-wide tasks, and named entities.
You must respond with a valid JSON object that follows this exact format:

{
    "individual_actions": {
        "Alice": ["Task 1", "Task 2"],
        "Bob": ["Task 1"]
    },
    "team_actions": ["Task 1", "Task 2"],
    "entities": ["Alice", "Bob"]
}

Do not invent or infer any tasks. Only include tasks that are clearly and explicitly assigned
or discussed. Do not output anything except valid JSON in the structure above. No natural
language, notes, or extra formatting allowed.

These updated system prompts are more detailed and explicit, built by refining the original prompts above.

Running Iterations

We can pull the dataset and iterate over our hyperparameters, initializing the summarization agent with a different configuration on each pass to produce different sets of test cases. Here's how we can do that:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer # import your summarizer here

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")

summary_system_prompt = "..." # Use your new summary system prompt here

action_item_system_prompt = "..." # Use your new action item system prompt here

models = ["gpt-3.5-turbo", "gpt-4o", "gpt-4-turbo"]

# Use the same metrics used before
summary_concision = GEval(...)
action_item_check = GEval(...)

for model in models:
    summarizer = MeetingSummarizer(
        model=model,
        summary_system_prompt=summary_system_prompt,
        action_item_system_prompt=action_item_system_prompt,
    )

    summary_test_cases = []
    action_item_test_cases = []
    for golden in dataset.goldens:
        summary, action_items = summarizer.summarize(golden.input)

        summary_test_case = LLMTestCase(input=golden.input, actual_output=summary)
        action_item_test_case = LLMTestCase(
            input=golden.input, actual_output=str(action_items)
        )

        summary_test_cases.append(summary_test_case)
        action_item_test_cases.append(action_item_test_case)

    evaluate(
        test_cases=summary_test_cases,
        metrics=[summary_concision],
        hyperparameters={"model": model},
    )
    evaluate(
        test_cases=action_item_test_cases,
        metrics=[action_item_check],
        hyperparameters={"model": model},
    )

Tip: By logging hyperparameters in the evaluate function, you can easily compare performance across runs in Confident AI and trace score changes back to specific hyperparameter adjustments. Learn more about the evaluate function in the deepeval documentation.
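
For instance, you can log the prompt template alongside the model so each run's report captures the full configuration. The dictionary keys are arbitrary labels of your choosing:

evaluate(
    test_cases=summary_test_cases,
    metrics=[summary_concision],
    hyperparameters={
        "model": model,
        "summary system prompt": summary_system_prompt,
    },
)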

You can set up Confident AI to view the results as a report that also details the hyperparameters used for each test run. To get started, run the following command:

deepeval login

The average results of the evaluation iterations are shown below:

Model            Summary Concision    Action Item Accuracy
gpt-3.5-turbo    0.7                  0.6
gpt-4o           0.9                  0.7
gpt-4-turbo      0.8                  0.9

Improving From Eval Results

From these results, we can see that gpt-4o and gpt-4-turbo perform well but for different tasks.

  • gpt-4o performed better for summary generation.
  • gpt-4-turbo performed best for action item generation.

This raises the question of which model to choose, since each excels at a different task.

In this situation, you can either run evaluations on more test cases to gather more data, or use deepeval's ArenaGEval to determine which model is better by evaluating arena test cases, as sketched below. You can learn more about ArenaGEval in the deepeval documentation.
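
Here's a rough sketch of a head-to-head comparison with ArenaGEval, using the action item outputs of the two models on a single golden. Treat the exact class and argument names as assumptions and check the deepeval documentation for the current interface:

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

golden = dataset.goldens[0]
_, action_items_4o = MeetingSummarizer(model="gpt-4o").summarize(golden.input)
_, action_items_4t = MeetingSummarizer(model="gpt-4-turbo").summarize(golden.input)

arena_test_case = ArenaTestCase(
    contestants={
        "gpt-4o": LLMTestCase(input=golden.input, actual_output=str(action_items_4o)),
        "gpt-4-turbo": LLMTestCase(input=golden.input, actual_output=str(action_items_4t)),
    },
)

action_item_arena = ArenaGEval(
    name="Action Item Quality",
    criteria="Pick the contestant whose action items are more complete and more faithful to the transcript.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

action_item_arena.measure(arena_test_case)
print(action_item_arena.winner)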

Alternatively, you can update your MeetingSummarizer to use two different models for the two tasks. Here's how you can do that:

from deepeval.tracing import observe

class MeetingSummarizer:
    ...
    @observe()
    def summarize(
        self,
        transcript: str,
        summary_model: str = "gpt-4o",
        action_item_model: str = "gpt-4-turbo",
    ) -> tuple[str, dict]:
        summary = self.get_summary(transcript, summary_model)
        action_items = self.get_action_items(transcript, action_item_model)

        return summary, action_items

    @observe()
    def get_summary(self, transcript: str, model: str = None) -> str:
        ...
        response = self.client.chat.completions.create(
            model=model or self.model,
            messages=[
                {"role": "system", "content": self.summary_system_prompt},
                {"role": "user", "content": transcript}
            ]
        )
        ...

    @observe()
    def get_action_items(self, transcript: str, model: str = None) -> dict:
        ...
        response = self.client.chat.completions.create(
            model=model or self.model,
            messages=[
                {"role": "system", "content": self.action_item_system_prompt},
                {"role": "user", "content": transcript}
            ]
        )
        ...

This setup allows you to change the model for each task whenever you want. You now have a robust summarization agent for generating summaries and action items.
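
For example, assuming transcript holds a meeting transcript string, you can keep the defaults or override the model for either task at call time:

summarizer = MeetingSummarizer() # Initialize with your best config

# Per-call overrides; the defaults fall back to gpt-4o / gpt-4-turbo
summary, action_items = summarizer.summarize(
    transcript,
    summary_model="gpt-4o",
    action_item_model="gpt-4-turbo",
)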

In the next section, we'll see how to prepare your summarization agent for deployment.