Evaluation
In the previous section, we built a meeting summarization agent and reviewed the output it generated from a sample conversation transcript. But how do we assess the quality of that output? Many developers tend to eyeball the results of their LLM applications. This is a common and significant issue in LLM application development.
In this section, we are going to see how to evaluate our `MeetingSummarizer` using DeepEval, a powerful open-source LLM evaluation framework.
Defining Evaluation Criteria
Defining evaluation criteria is arguably the most important part of assessing an LLM application's performance. LLM applications are always made with a clear goal in mind, and the evaluation criteria must be defined by taking this goal into consideration.
The summarization agent we've created processes meeting transcripts and generates a concise summary of the meeting along with a list of action items. Our evaluation criteria depend directly on these two outputs:
- The summaries generated must be concise and contain all important points
- The action items generated must be correct and cover all the key actions
In the previous section, we split our summarizer into separate helper functions, one for each task. This makes it easier to evaluate each task on its own.
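As a point of reference, here is a minimal sketch of the interface the test code in this section assumes. The helper method names and the constructor argument are illustrative assumptions; only the `summarize()` call returning a `(summary, action_items)` pair is taken from the usage below.

```python
# Hypothetical sketch of the MeetingSummarizer interface assumed in this section.
# The actual class from the previous section may differ in its details.
from typing import List, Tuple

class MeetingSummarizer:
    def __init__(self, model: str = "gpt-4o"):  # assumed constructor argument
        self.model = model

    def get_summary(self, transcript: str) -> str:
        """Helper dedicated to generating a concise meeting summary."""
        ...

    def get_action_items(self, transcript: str) -> List[str]:
        """Helper dedicated to extracting the action items."""
        ...

    def summarize(self, transcript: str) -> Tuple[str, List[str]]:
        """Run both helpers and return (summary, action_items)."""
        return self.get_summary(transcript), self.get_action_items(transcript)
```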
Summary Evaluation
We will now write a test suite to evaluate just the summarization part of our summarizer. As mentioned above, the criterion for our summary is:
- The summary generated must be concise and contain all important points
This is a use-case-specific criterion, meaning it's custom-tailored for our meeting summarizer. For this we can use `deepeval`'s `GEval` metric.
`GEval` is a metric that uses LLM-as-a-judge to evaluate LLM outputs based on ANY custom criteria. It is the most versatile metric `deepeval` has to offer and is capable of evaluating almost any use case.
Here's how you can use `GEval` to evaluate the summary generated by our summarization agent:
```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer  # import your summarizer here

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer(...)
summary, action_items = summarizer.summarize(transcript)

summary_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=summary  # provide the summary generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    # Write your criteria here
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[summary_test_case], metrics=[summary_concision])
```
Save the code above in a test file named `test_summary.py`, then run the evaluation with the following command:

```bash
deepeval test run test_summary.py
```
Action Item Evaluation
We will now evaluate the action item generation of our summarization agent. As mentioned above, the criterion for our action item evaluation is:
- The action items generated must be correct and cover all the key actions
We will use the `GEval` metric again here because this is also a custom criterion. Here's how you evaluate the generated action items:
```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer  # import your summarizer here

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer(...)
summary, action_items = summarizer.summarize(transcript)

action_item_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=str(action_items)  # provide the action items generated by your summarizer here
)

action_item_check = GEval(
    name="Action Item Accuracy",
    # Write your criteria here
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[action_item_test_case], metrics=[action_item_check])
```
Save the code above in a test file named `test_action_items.py`, then run the evaluation with the following command:

```bash
deepeval test run test_action_items.py
```
Running Your First Eval
Now that we know our criteria and the metrics we'll be using, we can run our first eval on a single test case to make sure everything's going smoothly.
Here's how you can test your summarization agent:
```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import evaluate
from meeting_summarizer import MeetingSummarizer  # import your summarizer here

summarizer = MeetingSummarizer()

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summary, action_items = summarizer.summarize(transcript)

summary_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=summary  # provide the summary generated by your summarizer here
)

action_item_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=str(action_items)  # provide the action items generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

action_item_check = GEval(
    name="Action Item Accuracy",
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[summary_test_case], metrics=[summary_concision])
evaluate(test_cases=[action_item_test_case], metrics=[action_item_check])
```
You can save the above code in a test file named `test_summarizer.py` and run the following command in your terminal to evaluate your summarizer:

```bash
deepeval test run test_summarizer.py
```
You'll see your evaluation results, including scores and reasoning, printed in the console.
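If you'd rather not use the deepeval test runner, the script should also work when executed directly, since `evaluate()` is called at module level (assuming your evaluation model's API key is configured in your environment):

```bash
python test_summarizer.py
```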
It is highly recommended that you use Confident AI, `deepeval`'s cloud platform, which lets you view your test results in a much more intuitive way. Here's how you can set up Confident AI. Or you can simply run the following command in your terminal to set it up yourself:

```bash
deepeval login
```
It's free to get started! (No credit card required.)
That's how you can run your first eval using `deepeval`.
Evaluation Results
Here are the results I got after running the evaluation code for a single test case:
| Metric | Score | Result |
|---|---|---|
| Summary Concision | 0.7 | Fail |
| Action Item Accuracy | 0.8 | Fail |
DeepEval's metrics provide a reason for their evaluation of each test case, which makes it easy to debug why certain test cases pass or fail. Below are the reasons provided by `deepeval`'s `GEval` for the above evaluation results:
For summary:
The Actual Output effectively identifies the key points of the meeting, covering the issues with the assistant's performance, the comparison between GPT-4o and Claude 3, the proposed hybrid approach, and the discussion around confidence metrics and tone. It omits extraneous details and is significantly shorter than the Input transcript. There's minimal repetition. However, while concise, it could be slightly more reduced; some phrasing feels unnecessarily verbose for a summary (e.g., 'Ethan and Maya discussed... focusing on concerns').
For action items:
The Actual Output captures some key action items discussed in the Input, specifically Maya building the similarity metric and setting up the hybrid model test, and Ethan syncing with design. However, it misses several follow-ups, such as exploring 8-bit embedding quantization and addressing the robotic tone of the assistant via prompt tuning. While the listed actions are clear and accurate, the completeness is lacking. The action items directly correspond to tasks mentioned, but not all tasks are represented.
Understanding Eval Results
Let's understand how the evaluation happened and why our test cases failed. `GEval` by `deepeval` is a custom metric that can be used to evaluate ANY custom criteria for a use case. The metric uses LLM-as-a-judge to evaluate the results of a test case.
If we look at the test cases and metrics we've provided for evaluation:
```python
summary_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=summary  # provide the summary generated by your summarizer here
)

action_item_test_case = LLMTestCase(
    input=transcript,  # your full meeting transcript as a string
    actual_output=str(action_items)  # provide the action items generated by your summarizer here
)

summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

action_item_check = GEval(
    name="Action Item Accuracy",
    criteria="Are the action items accurate, complete, and clearly reflect the key tasks or follow-ups mentioned in the meeting?",
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
```
We can see that we are supplying our transcript as the `input` and the generated summary or action items as the `actual_output` in our test cases. `GEval` then uses its corresponding `criteria`, which we've supplied, as guidelines to follow while evaluating the test cases.
This means a separate LLM (our evaluation model) checks the `input` and `actual_output` (the evaluation parameters supplied when initializing the metric) against the stated `criteria`. This is similar to how you would evaluate the responses yourself, only done by an LLM.

The evaluation model assesses the quality of the `actual_output` against the `input` according to the `criteria`. After assessing the test case, it provides a `score` and a `reason` explaining the evaluation results. If the score is below the `threshold` supplied to the metric, the test case fails; otherwise it passes. (The default threshold is 0.5 for most metrics in `deepeval`.)
These reasons explain why the test cases failed and help us identify exactly what needs to be fixed. Click here to learn more about how `GEval` works.
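If you want to inspect the score and reason programmatically instead of reading them from `evaluate()`'s console output, you can also run a metric on a single test case yourself. Here's a minimal sketch that assumes the `summary_test_case` and `summary_concision` objects defined earlier:

```python
# Run the metric on a single test case and inspect the result directly.
summary_concision.measure(summary_test_case)

print(summary_concision.score)            # float between 0 and 1
print(summary_concision.reason)           # the judge model's explanation
print(summary_concision.is_successful())  # True if score >= threshold
```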
It is advised to use a capable evaluation model to get better results and reasons, and your evaluation model should be well suited for the task it's evaluating. Models like `gpt-4`, `gpt-4o`, `gpt-3.5-turbo`, and `claude-3-opus` are well suited for summarization evaluations.
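In `deepeval`, the judge model is configurable per metric: you can pass a `model` argument when constructing `GEval`. The sketch below uses `gpt-4o` as an assumed choice; the other arguments mirror the metric defined earlier:

```python
summary_concision = GEval(
    name="Summary Concision",
    criteria="Assess whether the summary is concise and focused only on the essential points of the meeting? It should avoid repetition, irrelevant details, and unnecessary elaboration.",
    model="gpt-4o",  # the LLM used as the judge for this metric (assumed choice)
    threshold=0.9,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
```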
In the next section, we are going to see how to create a more robust evaluation suite to test and improve your summarization agent.