Development

In this section, we're going to build our Meeting Summarization Agent using the OpenAI API. We'll define a MeetingSummarizer class that we can instantiate wherever we need to generate summaries and action items.

tip

Defining your agents as reusable classes and methods is generally considered best practice because it lets you create multiple instances of the agent in various places for evaluation and iteration.

Creating Meeting Summarizer

An LLM application's output is only as good as the prompt that guides it. It is important to define a good system prompt that we can use to generate our summaries and action items. We are going to use the following system prompt in the initial phase of our meeting summarizer:

You are an AI assistant tasked with summarizing meeting transcripts clearly and accurately. 
Given the following conversation, generate a concise summary that captures the key points
discussed, along with a set of action items reflecting the concrete next steps mentioned.
Keep the tone neutral and factual, avoid unnecessary detail, and do not add interpretation
beyond the content of the conversation.

Using OpenAI API

We are now going to create a MeetingSummarizer class that uses OpenAI's chat completions API, together with the system prompt above, to generate summaries and action items for any given transcript.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.system_prompt = system_prompt or (
            "You are an AI assistant tasked with summarizing meeting transcripts clearly "
            "and accurately. Given the following conversation, generate a concise summary "
            "that captures the key points discussed, along with a set of action items "
            "reflecting the concrete next steps mentioned. Keep the tone neutral and "
            "factual, avoid unnecessary detail, and do not add interpretation beyond "
            "the content of the conversation."
        )

    def summarize(self, transcript: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": transcript}
            ]
        )

        content = response.choices[0].message.content.strip()
        return content

note

You need to set your environment variable OPENAI_API_KEY in your .env file.
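
For reference, a minimal .env file only needs the key itself (the value below is a placeholder, not a real key):

OPENAI_API_KEY=sk-your-api-key-here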

Generating summaries

Now that we've defined our summarization agent, we can use the following code to generate a summary:

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer()
summary = summarizer.summarize(transcript)
print(summary)

note

I have saved a mock transcript in a file named meeting_transcript.txt, which is passed to the summarizer as shown above. You can provide your own transcript here or use the mock transcript I've used:

Click here to see the contents of meeting_transcript.txt
meeting_transcript.txt
[2:01:03 PM]  
Ethan:
Hey Maya, thanks for hopping on. So, I've been looking at some of the recent
logs from the customer support assistant. There's definitely some mixed feedback
coming through — especially around response speed and how useful the answers
actually are. Did you get a chance to dig into those logs in detail yet?

[2:01:20 PM]
Maya:
Yeah, I took a look earlier today. Honestly, it's not completely broken or
anything, but I get why folks are concerned. I noticed the assistant sometimes
gives answers that are kind of vague or, worse, confidently wrong. Like, it acts
super sure about something that's just not right, which can be really frustrating
for users.

[2:01:40 PM]
Ethan:
Exactly! I heard one of the PMs mention that the assistant suggested escalating a
basic password reset issue to Tier 2 support. That's something that should be
handled automatically or at least on Tier 1, right? It feels like a pretty obvious
miss.

[2:01:55 PM]
Maya:
Yeah, that kind of mistake usually happens when the assistant tries to compress
or summarize a long conversation thread before answering. If the summary it creates
is off — even just a little bit — everything else kind of falls apart after that.
The answer built on a shaky summary is going to be shaky too.

[2:02:14 PM]
Ethan:
Makes sense. So, when you look at it, do you think these issues are more about the
way we're engineering the prompts or is it more a problem of the model itself? Like,
should we be trying a different LLM, or just tweaking how we ask questions?

[2:02:31 PM]
Maya:
Honestly, it's a bit of both. We've been using GPT-4o for the most part, which is
pretty solid and fast. But last week I ran a test using Claude 3 on the exact same
dataset, and Claude seemed more grounded in its responses, less prone to making
stuff up. The trade-off is that Claude was noticeably slower.

[2:02:54 PM]
Ethan:
How much slower are we talking?

[2:02:56 PM]
Maya:
On average, about one and a half times slower. So if GPT-4o takes around 5 seconds to
respond, Claude's coming in at about 7 to 8 seconds. That delay might not sound huge in
isolation, but in the context of a real-time chat with customers, it's pretty noticeable.

[2:03:14 PM]
Ethan:
Yeah, that latency definitely matters. From the UX perspective, once you hit that
6-second mark, users start to lose patience. I've seen analytics where retries and
page refreshes spike sharply after that threshold.

[2:03:28 PM]
Maya:
Exactly. And those retries add load on the system, which kind of compounds the
problem. So it's not just user frustration but also a backend scaling concern.

[2:03:37 PM]
Ethan:
So, what's your gut? Do we stick with GPT-4o and accept some of these errors because
it's faster? Or do we switch to Claude to get better quality at the expense of speed?

[2:03:49 PM]
Maya:
I'm leaning towards keeping GPT-4o as the main model for now, mainly because speed is
critical. But we can implement Claude as a fallback option — like a second pass when
the assistant's confidence is low or if it detects uncertainty.

[2:04:06 PM]
Ethan:
Kind of like a two-step verification for answers?

[2:04:09 PM]
Maya:
Yeah, exactly. The idea is that the first pass gives you a quick answer, and only when
something smells off do you invoke the slower but more reliable model. Of course, we'll
need a solid way to detect when the assistant isn't confident.

[2:04:24 PM]
Ethan:
Right now, what kind of signals do we have to measure confidence?

[2:04:28 PM]
Maya:
Not much, unfortunately. We mostly log latency and token usage for cost monitoring, but
we don't have anything baked in that measures the quality or confidence of responses.

[2:04:40 PM]
Ethan:
Could we use something like embedding similarity? Like, compare the semantic similarity
between the original support ticket and the assistant's summary or answer to see if they align?

[2:04:51 PM]
Maya:
That's a great idea. If the embeddings show a big drift between the question and the
summary, that could definitely flag a problematic response. The trick is embeddings
themselves aren't free, cost-wise.

[2:05:05 PM]
Ethan:
Finance is already watching our token and API spend like hawks, so we need to be careful.

[2:05:11 PM]
Maya:
Yeah, but there are tricks like quantizing embeddings down to 8-bit precision, which can
reduce storage and compute cost by a lot. It's not perfect, but it might be enough to keep
costs manageable while adding that confidence signal.

[2:05:27 PM]
Ethan:
Okay, that sounds promising. Let's explore that.

[2:05:30 PM]
Ethan:
Another thing from UX feedback — some users say the assistant sounds really robotic, even
when it gives a correct answer. It lacks that human touch or empathy you'd expect from a
real support agent.

[2:05:44 PM]
Maya:
Yeah, that doesn't surprise me. Our system prompt is pretty barebones — polite but definitely
generic. No personality, no empathy cues, nothing to make it sound warm or relatable.

[2:05:57 PM]
Ethan:
What about fine-tuning the model on actual support transcripts? Would that help?

[2:06:02 PM]
Maya:
I'm cautious about full fine-tuning right now. It's costly, time-consuming, and the results
can be unpredictable. Instead, I'd recommend focusing on prompt tuning — like few-shot learning
where we include a few anonymized example replies in the prompt. That can help steer tone
without the overhead of full model retraining.

[2:06:22 PM]
Ethan:
So basically, you put a couple of well-written, human-sounding responses in the prompt to
guide the model's style?

[2:06:26 PM]
Maya:
Exactly. It's a lot lighter weight and faster to iterate on. And if it works, we could
eventually create domain-specific prompts too — like one set for billing questions,
another for technical support — but start simple.

[2:06:41 PM]
Ethan:
Makes sense. One last thing I was thinking about — how should the UI handle cases when
the assistant's confidence is low? Like, do we just let it answer anyway or should we add
some fallback messaging?

[2:06:54 PM]
Maya:
I'd strongly advocate for a fallback banner or prompt, something like “Not sure about
this? Contact a human agent.” Better to admit uncertainty than provide bad info that
could confuse or frustrate customers.

[2:07:06 PM]
Ethan:
Yeah, I totally agree. But I guess the challenge will be tuning how often that shows
up so it's helpful but not annoying.

[2:07:11 PM]
Maya:
Definitely. We want it to trigger only on real low-confidence cases, not on every
little uncertainty.

[2:07:16 PM]
Ethan:
Alright, sounds like we have a good plan. I'll sync with design on the fallback UX messaging,
and you can start working on the similarity scoring and the two-pass system with GPT-4o and
Claude?

[2:07:28 PM]
Maya:
Yeah, I'll prioritize building that similarity metric and set up a test run for the hybrid
model approach over the next few days.

[2:07:34 PM]
Ethan:
Perfect. Let's regroup next week and see how things look.

[2:07:37 PM]
Maya:
Sounds good. One step at a time, right?

After running the summarizer, the generated summary was a markdown-formatted string (which is how most LLMs respond by default). This is not ideal, because we need to parse the LLM's response and build an appealing UI/UX on top of it. The best we can do with this output for now is shown below, along with the raw output generated:

UI Image

Updating Meeting Summarizer

Since the LLM returns markdown-formatted strings by default, these responses are difficult to parse and structure for a UI. Without predictable formats like JSON or plain text, developers must write brittle regexes or manual parsers, which break easily. We will now update our MeetingSummarizer to generate responses in our desired format, making them much easier to work with.

We can do that by separating our summarizer's tasks into two independent helper functions:

  • A function to generate the summary
  • A function to generate the action items

This allows us to create two task-specific system prompts, giving us more control over what the agent generates and how it generates it. It also gives us more flexibility when evaluating our summarizer in later stages; a sketch of the updated constructor follows this list. For example:

  • You can use only get_summary() to evaluate the summary generated.
  • You can use only get_action_items() to evaluate the action items generated.
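
To support this, the constructor needs to hold both prompts. Here's a minimal sketch of how the updated __init__ could look; the attribute names summary_system_prompt and action_item_system_prompt match those used by the helpers below, and the prompt strings are abbreviated here (the full prompts are shown in the following sections):

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        summary_system_prompt: str = "",
        action_item_system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        # Fall back to the default prompts if none are provided
        self.summary_system_prompt = summary_system_prompt or (
            "You are an AI assistant summarizing meeting transcripts. ..."  # full prompt below
        )
        self.action_item_system_prompt = action_item_system_prompt or (
            "Extract all action items from the following meeting transcript. ..."  # full prompt below
        )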

Generating summaries

We will now create a helper function that generates only the summary of the provided transcript. This gives us more control over how the summary is generated, and by keeping it as a separate function we can run component-level evaluations on our summarizer in the future.

System prompt for generating summaries:

You are an AI assistant summarizing meeting transcripts. Provide a clear and 
concise summary of the following conversation, avoiding interpretation and
unnecessary details. Focus on the main discussion points only. Do not include
any action items. Respond with only the summary as plain text — no headings,
formatting, or explanations.
tip

It is also important to note that some models perform better at specific tasks. When dealing with different use cases, it is worth testing different models to see which one performs best for each task. By creating individual functions, we can find the best model and hyperparameters for our use case.
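
For example, since the model is a constructor parameter, you can quickly compare outputs from different models on the same transcript. The model names below are illustrative, and the snippet uses the get_summary() helper defined just below:

# Compare summaries from two different models on the same transcript
for model_name in ["gpt-4", "gpt-4o"]:
    summarizer = MeetingSummarizer(model=model_name)
    print(f"--- {model_name} ---")
    print(summarizer.get_summary(transcript))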

Here's how we'll define our helper function to generate summaries:

class MeetingSummarizer:
    ...
    def get_summary(self, transcript: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.summary_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            summary = response.choices[0].message.content.strip()
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return f"Error: Could not generate summary due to API issue: {e}"

Generating action items

We will now create a helper function to generate only the action items from the provided transcript. The action items must be generated in JSON format, which allows us to easily parse and render them in different representations.

System prompt for generating action items:

Extract all action items from the following meeting transcript. Identify individual
and team-wide action items in the following format:

{
    "individual_actions": {
        "Alice": ["Task 1", "Task 2"],
        "Bob": ["Task 1"]
    },
    "team_actions": ["Task 1", "Task 2"],
    "entities": ["Alice", "Bob"]
}

Only include what is explicitly mentioned. Do not infer. You must respond strictly in
valid JSON format — no extra text or commentary.

Here's how we'll define our helper function to generate action items:

import json

class MeetingSummarizer:
    ...
    def get_action_items(self, transcript: str) -> dict:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.action_item_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            action_items = response.choices[0].message.content.strip()
            try:
                return json.loads(action_items)
            except json.JSONDecodeError:
                return {"error": "Invalid JSON returned from model", "raw_output": action_items}
        except Exception as e:
            print(f"Error generating action items: {e}")
            return {"error": f"API call failed: {e}", "raw_output": ""}

We can now call these helper functions in our summarize() function and return their respective responses. Here's how we can do that:

class MeetingSummarizer:
    ...
    def summarize(self, transcript: str) -> tuple[str, dict]:
        summary = self.get_summary(transcript)
        action_items = self.get_action_items(transcript)

        return summary, action_items

You can run the new MeetingSummarizer as follows:

import json

summarizer = MeetingSummarizer()

with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summary, action_items = summarizer.summarize(transcript)
print(summary)
print("JSON:")
print(json.dumps(action_items, indent=2))

With this updated MeetingSummarizer, the summary output is a plain string of text and the action items output is a JSON object, which we can parse and manipulate in any way we want (see the rendering sketch below).
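
For instance, here's a small sketch of how the parsed action items could be rendered for a UI, assuming the model returns the individual_actions and team_actions keys requested in the prompt (real responses can occasionally deviate, which is why the error fallback above is kept):

def render_action_items(action_items: dict) -> str:
    # Build a simple plain-text view from the parsed JSON
    if "error" in action_items:
        return f"Could not parse action items: {action_items['error']}"

    lines = []
    for person, tasks in action_items.get("individual_actions", {}).items():
        lines.append(f"{person}:")
        lines.extend(f"  - {task}" for task in tasks)
    if action_items.get("team_actions"):
        lines.append("Team:")
        lines.extend(f"  - {task}" for task in action_items["team_actions"])
    return "\n".join(lines)

print(render_action_items(action_items))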

Running the new summarizer gave me output in the desired format, which can be used to build a webpage. Here are both the UI and the raw format of the outputs generated:

UI Image

We now have a summarization agent that generates responses in our desired format. Now it's time to evaluate how well this agent works. Many developers stop at a quick glance at the output and assume it's good enough. But LLMs are probabilistic and prone to inconsistency: eyeballing results won't catch subtle regressions, logical errors, or hallucinated action items. That's why rigorous evaluation is essential.

In the next section we are going to see how to evaluate your summarization agent using deepeval.