Development

In this section, we're going to build our Meeting Summarization Agent using the OpenAI API. We'll define a MeetingSummarizer class that we can instantiate wherever we need to generate summaries and action items.

tip

Defining your agents as reusable classes and methods is generally considered best practice because it lets you create multiple instances of the agent in various places for evaluation and iteration.

Creating Meeting Summarizer

An LLM application's output is only as good as the prompt that guides it. It is important to define a good system prompt that we can use to generate our summaries and action items. We are going to use the following system prompt in the initial phase of our meeting summarizer:

You are an AI assistant tasked with summarizing meeting transcripts clearly and accurately. 
Given the following conversation, generate a concise summary that captures the key points
discussed, along with a set of action items reflecting the concrete next steps mentioned.
Keep the tone neutral and factual, avoid unnecessary detail, and do not add interpretation
beyond the content of the conversation.

Using OpenAI API

We are now going to create a MeetingSummarizer class that uses OpenAI's chat completions API, together with the system prompt above, to generate summaries and action items for any given transcript.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.system_prompt = system_prompt or (
            "You are an AI assistant tasked with summarizing meeting transcripts clearly "
            "and accurately. Given the following conversation, generate a concise summary "
            "that captures the key points discussed, along with a set of action items "
            "reflecting the concrete next steps mentioned. Keep the tone neutral and "
            "factual, avoid unnecessary detail, and do not add interpretation beyond "
            "the content of the conversation."
        )

    def summarize(self, transcript: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": transcript}
            ]
        )

        content = response.choices[0].message.content.strip()
        return content

note

You need to set your environment variable OPENAI_API_KEY in your .env file.
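
For reference, a minimal .env file only needs the key itself (the value below is a placeholder, not a real key):

OPENAI_API_KEY=sk-your-api-key-here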

Generating summaries

Now that we've defined our summarization agent, we can use the following code to generate a summary:

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer()
summary = summarizer.summarize(transcript)
print(summary)

note

I have saved a mock transcript in a file named meeting_transcript.txt, which is passed to the summarizer as shown above. You can provide your own transcript here or use the mock transcript I've used:

Click here to see the contents of meeting_transcript.txt
meeting_transcript.txt
[2:01:03 PM]  
Ethan:
Hey Maya, thanks for hopping on. So, I've been looking at some of the recent
logs from the customer support assistant. There's definitely some mixed feedback
coming through — especially around response speed and how useful the answers
actually are. Did you get a chance to dig into those logs in detail yet?

[2:01:20 PM]
Maya:
Yeah, I took a look earlier today. Honestly, it's not completely broken or
anything, but I get why folks are concerned. I noticed the assistant sometimes
gives answers that are kind of vague or, worse, confidently wrong. Like, it acts
super sure about something that's just not right, which can be really frustrating
for users.

[2:01:40 PM]
Ethan:
Exactly! I heard one of the PMs mention that the assistant suggested escalating a
basic password reset issue to Tier 2 support. That's something that should be
handled automatically or at least on Tier 1, right? It feels like a pretty obvious
miss.

[2:01:55 PM]
Maya:
Yeah, that kind of mistake usually happens when the assistant tries to compress
or summarize a long conversation thread before answering. If the summary it creates
is off — even just a little bit — everything else kind of falls apart after that.
The answer built on a shaky summary is going to be shaky too.

[2:02:14 PM]
Ethan:
Makes sense. So, when you look at it, do you think these issues are more about the
way we're engineering the prompts or is it more a problem of the model itself? Like,
should we be trying a different LLM, or just tweaking how we ask questions?

[2:02:31 PM]
Maya:
Honestly, it's a bit of both. We've been using GPT-4o for the most part, which is
pretty solid and fast. But last week I ran a test using Claude 3 on the exact same
dataset, and Claude seemed more grounded in its responses, less prone to making
stuff up. The trade-off is that Claude was noticeably slower.

[2:02:54 PM]
Ethan:
How much slower are we talking?

[2:02:56 PM]
Maya:
On average, about one and a half times slower. So if GPT-4o takes around 5 seconds to
respond, Claude's coming in at about 7 to 8 seconds. That delay might not sound huge in
isolation, but in the context of a real-time chat with customers, it's pretty noticeable.

[2:03:14 PM]
Ethan:
Yeah, that latency definitely matters. From the UX perspective, once you hit that
6-second mark, users start to lose patience. I've seen analytics where retries and
page refreshes spike sharply after that threshold.

[2:03:28 PM]
Maya:
Exactly. And those retries add load on the system, which kind of compounds the
problem. So it's not just user frustration but also a backend scaling concern.

[2:03:37 PM]
Ethan:
So, what's your gut? Do we stick with GPT-4o and accept some of these errors because
it's faster? Or do we switch to Claude to get better quality at the expense of speed?

[2:03:49 PM]
Maya:
I'm leaning towards keeping GPT-4o as the main model for now, mainly because speed is
critical. But we can implement Claude as a fallback option — like a second pass when
the assistant's confidence is low or if it detects uncertainty.

[2:04:06 PM]
Ethan:
Kind of like a two-step verification for answers?

[2:04:09 PM]
Maya:
Yeah, exactly. The idea is that the first pass gives you a quick answer, and only when
something smells off do you invoke the slower but more reliable model. Of course, we'll
need a solid way to detect when the assistant isn't confident.

[2:04:24 PM]
Ethan:
Right now, what kind of signals do we have to measure confidence?

[2:04:28 PM]
Maya:
Not much, unfortunately. We mostly log latency and token usage for cost monitoring, but
we don't have anything baked in that measures the quality or confidence of responses.

[2:04:40 PM]
Ethan:
Could we use something like embedding similarity? Like, compare the semantic similarity
between the original support ticket and the assistant's summary or answer to see if they align?

[2:04:51 PM]
Maya:
That's a great idea. If the embeddings show a big drift between the question and the
summary, that could definitely flag a problematic response. The trick is embeddings
themselves aren't free, cost-wise.

[2:05:05 PM]
Ethan:
Finance is already watching our token and API spend like hawks, so we need to be careful.

[2:05:11 PM]
Maya:
Yeah, but there are tricks like quantizing embeddings down to 8-bit precision, which can
reduce storage and compute cost by a lot. It's not perfect, but it might be enough to keep
costs manageable while adding that confidence signal.

[2:05:27 PM]
Ethan:
Okay, that sounds promising. Let's explore that.

[2:05:30 PM]
Ethan:
Another thing from UX feedback — some users say the assistant sounds really robotic, even
when it gives a correct answer. It lacks that human touch or empathy you'd expect from a
real support agent.

[2:05:44 PM]
Maya:
Yeah, that doesn't surprise me. Our system prompt is pretty barebones — polite but definitely
generic. No personality, no empathy cues, nothing to make it sound warm or relatable.

[2:05:57 PM]
Ethan:
What about fine-tuning the model on actual support transcripts? Would that help?

[2:06:02 PM]
Maya:
I'm cautious about full fine-tuning right now. It's costly, time-consuming, and the results
can be unpredictable. Instead, I'd recommend focusing on prompt tuning — like few-shot learning
where we include a few anonymized example replies in the prompt. That can help steer tone
without the overhead of full model retraining.

[2:06:22 PM]
Ethan:
So basically, you put a couple of well-written, human-sounding responses in the prompt to
guide the model's style?

[2:06:26 PM]
Maya:
Exactly. It's a lot lighter weight and faster to iterate on. And if it works, we could
eventually create domain-specific prompts too — like one set for billing questions,
another for technical support — but start simple.

[2:06:41 PM]
Ethan:
Makes sense. One last thing I was thinking about — how should the UI handle cases when
the assistant's confidence is low? Like, do we just let it answer anyway or should we add
some fallback messaging?

[2:06:54 PM]
Maya:
I'd strongly advocate for a fallback banner or prompt, something like “Not sure about
this? Contact a human agent.” Better to admit uncertainty than provide bad info that
could confuse or frustrate customers.

[2:07:06 PM]
Ethan:
Yeah, I totally agree. But I guess the challenge will be tuning how often that shows
up so it's helpful but not annoying.

[2:07:11 PM]
Maya:
Definitely. We want it to trigger only on real low-confidence cases, not on every
little uncertainty.

[2:07:16 PM]
Ethan:
Alright, sounds like we have a good plan. I'll sync with design on the fallback UX messaging,
and you can start working on the similarity scoring and the two-pass system with GPT-4o and
Claude?

[2:07:28 PM]
Maya:
Yeah, I'll prioritize building that similarity metric and set up a test run for the hybrid
model approach over the next few days.

[2:07:34 PM]
Ethan:
Perfect. Let's regroup next week and see how things look.

[2:07:37 PM]
Maya:
Sounds good. One step at a time, right?

After running the summarizer, the generated summary was a markdown-formatted string (which is how most LLMs respond by default). This is not ideal, because we need to parse the LLM's response and build an appealing UI/UX on top of it. The best we can do with this output for now is shown below, along with the raw output generated:

UI Image

Updating Meeting Summarizer

Since the LLM returns markdown-formatted strings by default, these responses are difficult to parse and structure for a UI. Without predictable formats like JSON or plain text, developers must write brittle regexes or manual parsers, which break easily. We will now update our MeetingSummarizer to generate responses in our desired format, making them much easier to work with.

We can do that by separating our summarizer's tasks into two independent helper functions:

  • A function to generate the summary
  • A function to generate the action items

This allows us to create two task-specific system prompts, giving us more control over what the agent generates and how it generates it. It also gives us more flexibility when evaluating our summarizer in later stages; a sketch of the updated constructor follows this list. For example:

  • You can use only get_summary() to evaluate the summary generated.
  • You can use only get_action_items() to evaluate the action items generated.
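
To support this, the constructor needs to hold both prompts. Here's a minimal sketch of how the updated __init__ could look; the attribute names summary_system_prompt and action_item_system_prompt match those used by the helpers below, and the prompt strings are abbreviated here (the full prompts are shown in the following sections):

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        summary_system_prompt: str = "",
        action_item_system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        # Fall back to the default prompts if none are provided
        self.summary_system_prompt = summary_system_prompt or (
            "You are an AI assistant summarizing meeting transcripts. ..."  # full prompt below
        )
        self.action_item_system_prompt = action_item_system_prompt or (
            "Extract all action items from the following meeting transcript. ..."  # full prompt below
        )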

Generating summaries

We will now create a helper function that generates only the summary of the provided transcript. This gives us more control over how the summary is generated, and by keeping it as a separate function we can run component-level evaluations on our summarizer in the future.

System prompt for generating summaries:

You are an AI assistant summarizing meeting transcripts. Provide a clear and 
concise summary of the following conversation, avoiding interpretation and
unnecessary details. Focus on the main discussion points only. Do not include
any action items. Respond with only the summary as plain text — no headings,
formatting, or explanations.
tip

It is also important to note that some models perform better at specific tasks. When dealing with different use cases, it is worth testing different models to see which one performs best for each task. By creating individual functions, we can find the best model and hyperparameters for our use case.
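
For example, since the model is a constructor parameter, you can quickly compare outputs from different models on the same transcript. The model names below are illustrative, and the snippet uses the get_summary() helper defined just below:

# Compare summaries from two different models on the same transcript
for model_name in ["gpt-4", "gpt-4o"]:
    summarizer = MeetingSummarizer(model=model_name)
    print(f"--- {model_name} ---")
    print(summarizer.get_summary(transcript))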

Here's how we'll define our helper function to generate summaries:

class MeetingSummarizer:
    ...
    def get_summary(self, transcript: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.summary_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            summary = response.choices[0].message.content.strip()
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return f"Error: Could not generate summary due to API issue: {e}"

Generating action items

We will now create a helper function to generate only the action items from the provided transcript. The action items must be generated in JSON format, which allows us to easily parse and render them in different representations.

System prompt for generating action items:

Extract all action items from the following meeting transcript. Identify individual
and team-wide action items in the following format:

{
    "individual_actions": {
        "Alice": ["Task 1", "Task 2"],
        "Bob": ["Task 1"]
    },
    "team_actions": ["Task 1", "Task 2"],
    "entities": ["Alice", "Bob"]
}

Only include what is explicitly mentioned. Do not infer. You must respond strictly in
valid JSON format — no extra text or commentary.

Here's how we'll define our helper function to generate action items:

import json

class MeetingSummarizer:
    ...
    def get_action_items(self, transcript: str) -> dict:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.action_item_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            action_items = response.choices[0].message.content.strip()
            try:
                return json.loads(action_items)
            except json.JSONDecodeError:
                return {"error": "Invalid JSON returned from model", "raw_output": action_items}
        except Exception as e:
            print(f"Error generating action items: {e}")
            return {"error": f"API call failed: {e}", "raw_output": ""}

We can now call these helper functions in our summarize() function and return their respective responses. Here's how we can do that:

class MeetingSummarizer:
    ...
    def summarize(self, transcript: str) -> tuple[str, dict]:
        summary = self.get_summary(transcript)
        action_items = self.get_action_items(transcript)

        return summary, action_items

You can run the new MeetingSummarizer as follows:

import json

summarizer = MeetingSummarizer()

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summary, action_items = summarizer.summarize(transcript)
print(summary)
print("JSON:")
print(json.dumps(action_items, indent=2))

With this updated MeetingSummarizer, the summary output is a plain string of text and the action items output is a JSON object, which we can parse and manipulate in any way we want (see the rendering sketch below).
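
For instance, here's a small sketch of how the parsed action items could be rendered for a UI, assuming the model returns the individual_actions and team_actions keys requested in the prompt (real responses can occasionally deviate, which is why the error fallback above is kept):

def render_action_items(action_items: dict) -> str:
    # Build a simple plain-text view from the parsed JSON
    if "error" in action_items:
        return f"Could not parse action items: {action_items['error']}"

    lines = []
    for person, tasks in action_items.get("individual_actions", {}).items():
        lines.append(f"{person}:")
        lines.extend(f"  - {task}" for task in tasks)
    if action_items.get("team_actions"):
        lines.append("Team:")
        lines.extend(f"  - {task}" for task in action_items["team_actions"])
    return "\n".join(lines)

print(render_action_items(action_items))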

Running the new summarizer gave me output in the desired format, which can be used to build a webpage. Here are both the UI and the raw format of the outputs generated:

UI Image

We now have a summarization agent that generates responses in our desired format. Now it's time to evaluate how well this agent works. Many developers stop at a quick glance at the output and assume it's good enough. But LLMs are probabilistic and prone to inconsistency: eyeballing results won't catch subtle regressions, logical errors, or hallucinated action items. That's why rigorous evaluation is essential.

In the next section we are going to see how to evaluate your summarization agent using deepeval.