
Evaluation

In the previous section, we built a meeting summarization agent and reviewed the output it generated from a sample conversation transcript. But how do we assess the quality of that output? Many developers tend to eyeball the results of their LLM applications. This is a common and significant issue in LLM application development.

In this section we are going to see how to evaluate our MeetingSummarizer using DeepEval, a powerful open-source LLM evaluation framework.

Defining Evaluation Criteria

Defining evaluation criteria is arguably the most important part of assessing an LLM application's performance. LLM applications are always made with a clear goal in mind, and the evaluation criteria must be defined by taking this goal into consideration.

The summarization agent we've created processes meeting transcripts and generates a concise summary of the meeting and a list of action items. Our evaluation criteria follow directly from these two outputs:

  • The summaries generated must be concise and contain all important points
  • The action items generated must be correct and cover all the key actions

In the previous section, we updated our summarizer to use separate helper functions, one for each task. This makes it easier to evaluate each task on its own.
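
For reference, here is a minimal, hypothetical sketch of the interface the tests below assume. The helper names and the model argument are illustrative assumptions, not the exact implementation; what matters is that summarize() returns the summary and the action items separately:

# Hypothetical sketch of the MeetingSummarizer interface assumed by the tests below.
# Helper names and constructor arguments are illustrative, not the exact implementation.
class MeetingSummarizer:
    def __init__(self, model: str = "gpt-4o"):  # model choice is an assumption
        self.model = model

    def get_summary(self, transcript: str) -> str:
        """Helper responsible only for the concise meeting summary."""
        ...

    def get_action_items(self, transcript: str) -> list[str]:
        """Helper responsible only for extracting action items."""
        ...

    def summarize(self, transcript: str) -> tuple[str, list[str]]:
        # Delegating to the two helpers lets each task be evaluated on its own
        return self.get_summary(transcript), self.get_action_items(transcript)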

Summary Evaluation

We will now write a test suite to evaluate just the summarization part of our summarizer. As mentioned above, the criterion for our summary is:

  • The summary generated must be concise and contain all important points

This is a use-case-specific criterion, meaning it's custom-tailored for our meeting summarizer. For this we can use deepeval's GEval metric. GEval is a metric that uses LLM-as-a-judge to evaluate LLM outputs based on ANY custom criteria. The GEval metric is the most versatile type of metric deepeval has to offer, and is capable of evaluating almost any use case.

Here's how you can use GEval to evaluate the summary generated by our summarization agent:

test_summary.py
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer # import your summarizer here

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer(...)
summary, action_items = summarizer.summarize(transcript)

summary_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=summary # provide the summary generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    # Write your criteria here
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[summary_test_case], metrics=[summary_concision])

You can use the following command to run the evaluation:

deepeval test run test_summary.py
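
deepeval test run is built on top of pytest, so if you prefer the pytest style you can also express the same check as a test function using assert_test. A minimal sketch, reusing the test case and metric defined above:

# Optional pytest-style variant of the same check
from deepeval import assert_test

def test_summary_concision():
    # Fails the test (via pytest) if the metric score falls below its threshold
    assert_test(summary_test_case, [summary_concision])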

Action Item Evaluation

We will now evaluate the action item generation of our summarization agent. As mentioned above, the criterion for our action item evaluation is:

  • The action items generated must be correct and cover all the key actions

We will be using the GEval metric again here because this is also a custom, use-case-specific criterion. Here's how you can evaluate the action items generated:

test_action_items.py
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer # import your summarizer here

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer(...)
summary, action_items = summarizer.summarize(transcript)

action_item_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=str(action_items) # provide the action items generated by your summarizer here
)

action_item_check = GEval(
    name="Action Item Accuracy",
    # Write your criteria here
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[action_item_test_case], metrics=[action_item_check])

You can use the following command to run the evaluation:

deepeval test run test_action_items.py

Running Your First Eval

Now that we know our criteria and the metrics we'll be using, we can run our first eval on a single test case to make sure everything is working smoothly.

Here's how you can test your summarization agent:

test_summarizer.py
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer # import your summarizer here

summarizer = MeetingSummarizer()
with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summary, action_items = summarizer.summarize(transcript)

summary_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=summary # provide the summary generated by your summarizer here
)

action_item_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=str(action_items) # provide the action items generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

action_item_check = GEval(
    name="Action Item Accuracy",
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[summary_test_case], metrics=[summary_concision])
evaluate(test_cases=[action_item_test_case], metrics=[action_item_check])

You can save the above code in a test file named test_summarizer.py and run the following command in your terminal to evaluate your summarizer:

deepeval test run test_summarizer.py

You'll see your evaluation results, including scores and reasoning, printed in the console.

tip

It is highly recommended that you use Confident AI, deepeval's cloud platform, which lets you view your test results in a much more intuitive way. You can follow the Confident AI setup guide, or simply run the following command in your terminal to get started:

deepeval login

It's free to get started! (No credit card required.)

That's how you can run your first eval using deepeval.

Evaluation Results

Here are the results I got after running the evaluation code for a single test case:

Metric                 Score   Result
Summary Concision      0.7     Fail
Action Item Accuracy   0.8     Fail

DeepEval's metrics provide a reason alongside each score, which makes it easy to see why a test case passed or failed and to debug your LLM application accordingly. Below are the reasons provided by deepeval's GEval for the above evaluation results:

For summary:

The Actual Output effectively identifies the key points of the meeting, covering the issues with the assistant's performance, the comparison between GPT-4o and Claude 3, the proposed hybrid approach, and the discussion around confidence metrics and tone. It omits extraneous details and is significantly shorter than the Input transcript. There's minimal repetition. However, while concise, it could be slightly more reduced; some phrasing feels unnecessarily verbose for a summary (e.g., 'Ethan and Maya discussed... focusing on concerns').

For action items:

The Actual Output captures some key action items discussed in the Input, specifically Maya building the similarity metric and setting up the hybrid model test, and Ethan syncing with design. However, it misses several follow-ups, such as exploring 8-bit embedding quantization and addressing the robotic tone of the assistant via prompt tuning. While the listed actions are clear and accurate, the completeness is lacking. The action items directly correspond to tasks mentioned, but not all tasks are represented.
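
If you'd rather inspect scores and reasons programmatically instead of reading them off the console, you can also run a metric on its own against a single test case. A minimal sketch, reusing the test case and metric defined earlier:

# Run a single metric standalone and inspect its score and reason directly
summary_concision.measure(summary_test_case)
print(summary_concision.score)            # a value between 0 and 1
print(summary_concision.reason)           # the judge model's explanation
print(summary_concision.is_successful())  # True only if score >= threshold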

Understanding Eval Results

Let's understand how the evaluation happened and why our test cases failed. GEval by deepeval is a custom metric that can be used to evaluate ANY custom criteria of a use case. This metric uses LLM-as-a-judge to evaluate the results of the test case.

If we look at the test cases and metrics we've provided for evaluation:

summary_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=summary # provide the summary generated by your summarizer here
)

action_item_test_case = LLMTestCase(
    input=transcript, # your full meeting transcript as a string
    actual_output=str(action_items) # provide the action items generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

action_item_check = GEval(
    name="Action Item Accuracy",
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

We can see that we are supplying our transcript as the input and the generated summary or action items as the actual_output in our test cases. GEval then uses the corresponding criteria we supplied as guidelines when evaluating each test case.

This means a separate LLM (our evaluation model) checks the input and actual_output (the evaluation parameters supplied in the metric initialization) against the stated criteria. This is similar to how you would evaluate the responses yourself, but performed by an LLM.

The evaluation model assesses the quality of the actual_output against the input according to the criteria. After assessing the test case, it provides a score and a reason explaining the result. If the score is below the threshold supplied in the metric, the test case is marked as failed; otherwise it passes. (The default threshold for deepeval metrics is 0.5.)
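
In other words, the final verdict is just a comparison of the judge's score against your threshold. An illustrative sketch (not deepeval's internal code), using the scores and the 0.9 threshold from the run above:

# Illustrative only: how a score and threshold become a pass/fail verdict
def verdict(score: float, threshold: float = 0.5) -> str:
    return "PASS" if score >= threshold else "FAIL"

print(verdict(0.7, threshold=0.9))  # FAIL -> Summary Concision above
print(verdict(0.8, threshold=0.9))  # FAIL -> Action Item Accuracy above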

These reasons explain why the test cases failed and help us pinpoint exactly what needs to be fixed. You can learn more about how GEval works in deepeval's documentation.

info

It is advised to use a capable evaluation model for more reliable scores and reasons. Your evaluation model should be well suited to the task it's evaluating; models such as gpt-4, gpt-4o, gpt-3.5-turbo, and claude-3-opus are commonly used for summarization evaluations.
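
GEval, like most deepeval metrics, accepts a model parameter for exactly this purpose; a sketch with gpt-4o as an example choice (this assumes an OpenAI API key is configured):

# Pinning the evaluation (judge) model explicitly; "gpt-4o" is an example choice
summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # the LLM used as the judge
)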

In the next section, we are going to see how to create a more robust evaluation suite to test and improve your summarization agent.