Building Your Summarizer
In this section, we're going to create our meeting summarization agent using the OpenAI API. The agent should take an entire meeting transcript as input and return:
- A concise summary of the entire meeting
- A list of action items mentioned in the meeting
We will implement the summarizer as a MeetingSummarizer class that takes the model and the system prompt as parameters. Keeping these configurable will make it easier to evaluate and iterate on the summarizer later.
Creating Meeting Summarizer
An LLM application's output is only as good as the prompt that guides it. It is important to define a good system prompt that we can use to generate our summaries and action items. We are going to use the following system prompt in the initial phase of our meeting summarizer:
You are an AI assistant tasked with summarizing meeting transcripts clearly and accurately.
Given the following conversation, generate a concise summary that captures the key points
discussed, along with a set of action items reflecting the concrete next steps mentioned.
Keep the tone neutral and factual, avoid unnecessary detail, and do not add interpretation
beyond the content of the conversation.
Using OpenAI API
We are now going to create a MeetingSummarizer class that uses OpenAI's chat completions API to generate summaries and action items using the system prompt mentioned above for any given transcript.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()


class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.system_prompt = system_prompt or (
            "..."  # Use the above system prompt here
        )

    def summarize(self, transcript: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": transcript}
            ]
        )
        content = response.choices[0].message.content.strip()
        return content
Generating summaries
Now that we've defined our summarization agent, we can use the following code to generate a summary:
with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summarizer = MeetingSummarizer()
summary = summarizer.summarize(transcript)
print(summary)
Running the summarizer returns a string of markdown, which is how most LLMs respond by default. That is not ideal here, because we need to parse the response and render it in a user-friendly interface. The best we can do with this output for now is shown below, along with the raw output generated:

**Meeting Summary:**
Ethan and Maya discussed performance concerns with the current customer support assistant, particularly issues with inaccurate or vague responses and slow performance trade-offs when using different language models. Maya noted that while GPT-4o offers faster responses, Claude 3 provides more grounded and reliable answers but with higher latency. They agreed to continue using GPT-4o as the primary model and implement Claude as a fallback for low-confidence cases.
To address quality issues, they explored confidence detection via embedding similarity between the input and the assistant's summary. Maya suggested using 8-bit quantized embeddings to manage cost. They also discussed improving the assistant's tone and empathy using prompt tuning instead of full model fine-tuning.
On the UX side, they agreed to implement fallback messaging for low-confidence responses, ensuring it's helpful without being intrusive.
---
**Action Items:**
1. **Maya** to develop a similarity scoring method using embeddings to detect low-confidence responses.
2. **Maya** to test and prototype a hybrid response system using GPT-4o as the default and Claude 3 as a fallback.
3. **Maya** to explore prompt tuning with few-shot examples to improve the assistant's tone and empathy.
4. **Ethan** to coordinate with the design team on fallback UI messaging for low-confidence responses.
5. **Team** to regroup next week to review progress on the hybrid model and confidence detection efforts.
Updating Meeting Summarizer
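To see why the markdown response is awkward to consume, consider what parsing it would take. The sketch below is illustrative only: the function name `parse_markdown_summary` is hypothetical, and it assumes the model always emits the exact `**Meeting Summary:**` and `**Action Items:**` labels and the numbered items seen in the sample output above, which nothing guarantees.

```python
import re


def parse_markdown_summary(markdown: str) -> tuple[str, list[str]]:
    """Split a markdown response into (summary, action_items).

    Assumes the exact bold labels from the sample output; any change in
    wording or formatting by the model makes this raise ValueError.
    """
    match = re.search(
        r"\*\*Meeting Summary:\*\*(.*?)\*\*Action Items:\*\*(.*)",
        markdown,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("Response does not follow the expected markdown layout")

    # Drop the horizontal rule ("---") that separates the two sections.
    summary = match.group(1).replace("---", "").strip()

    # Collect numbered list entries such as "1. **Maya** to ...".
    items = re.findall(r"^\s*\d+\.\s+(.*)$", match.group(2), flags=re.MULTILINE)
    # Remove bold markers so the items are plain text.
    items = [item.replace("**", "").strip() for item in items]
    return summary, items


sample = """**Meeting Summary:**
Ethan and Maya discussed performance concerns.
---
**Action Items:**
1. **Maya** to develop a similarity scoring method.
2. **Ethan** to coordinate with the design team.
"""

summary, items = parse_markdown_summary(sample)
print(summary)
print(items)
```

A parser like this breaks as soon as the model renames a heading or switches to bullet points, which is the motivation for asking for structured output instead.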
To improve response parsing and structure, we'll split our MeetingSummarizer into two helper functions:
- get_summary(): Generates the meeting summary
- get_action_items(): Extracts action items
This approach lets us use a tailored system prompt for each task, which makes the output format (e.g., JSON or plain text) more predictable. It also makes evaluation more flexible, since each function can be tested independently.
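As a preview of what independent testing can look like, the sketch below calls a summary helper against a stub client, so no API request is made. Everything here is an assumption for illustration: `StubClient`, the injected `client` parameter, and the standalone `get_summary` function are hypothetical stand-ins, and the stub only mimics the `chat.completions.create(...)` call and the `choices[0].message.content` shape used in this tutorial.

```python
from types import SimpleNamespace


class StubClient:
    """Mimics the small part of the OpenAI client that the summarizer uses."""

    def __init__(self, canned_reply: str):
        self.calls = []  # records every request for later inspection
        self._canned_reply = canned_reply
        completions = SimpleNamespace(create=self._create)
        self.chat = SimpleNamespace(completions=completions)

    def _create(self, model, messages):
        self.calls.append({"model": model, "messages": messages})
        message = SimpleNamespace(content=self._canned_reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def get_summary(client, model: str, system_prompt: str, transcript: str) -> str:
    """Hypothetical standalone summary helper with the client injected."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.strip()


stub = StubClient("  Ethan and Maya discussed latency.  ")
result = get_summary(stub, "gpt-4", "Summarize the meeting.", "Ethan: ... Maya: ...")
print(result)
```

Because the stub records each request in `stub.calls`, a test can also check that the right system prompt and transcript were sent, without spending API credits.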
Generating summaries
We will now create a helper function to generate only the summary from the transcript. This gives us more control over how summaries are produced and enables component-level evaluation in future stages. Here's the system prompt we'll be using to generate summaries:
System prompt for generating summaries:
You are an AI assistant summarizing meeting transcripts. Provide a clear and
concise summary of the following conversation, avoiding interpretation and
unnecessary details. Focus on the main discussion points only. Do not include
any action items. Respond with only the summary as plain text — no headings,
formatting, or explanations.
Here's how we'll define our helper function to generate summaries:
...
class MeetingSummarizer:
    ...
    def get_summary(self, transcript: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.summary_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )
            summary = response.choices[0].message.content.strip()
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return f"Error: Could not generate summary due to API issue: {e}"
Generating action items
Next, we'll create a helper function that generates only the action items from the transcript. The action items must be returned as JSON so that we can easily parse them and render them in different representations.
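Since everything downstream depends on the response parsing as JSON, it can help to tolerate one common deviation: a model wrapping its JSON in a markdown code fence. The sketch below assumes that is the only deviation worth handling; `parse_json_response` is a hypothetical helper and not part of the summarizer shown in this tutorial.

```python
import json

FENCE = "`" * 3  # built at runtime to avoid a literal code fence in this example


def parse_json_response(raw: str) -> dict:
    """Parse a model response as JSON, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag)
        # and the closing fence.
        first_newline = text.find("\n")
        text = text[first_newline + 1 : -len(FENCE)].strip()
    # Anything that is still not valid JSON raises json.JSONDecodeError.
    return json.loads(text)


fenced = FENCE + 'json\n{"team_actions": ["Regroup next week"]}\n' + FENCE
print(parse_json_response(fenced))
print(parse_json_response('{"team_actions": []}'))
```

The function deliberately does not try to repair malformed JSON; the caller still has to handle `json.JSONDecodeError`.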
System prompt for generating action items:
Extract all action items from the following meeting transcript. Identify individual
and team-wide action items in the following format:
{
    "individual_actions": {
        "Alice": ["Task 1", "Task 2"],
        "Bob": ["Task 1"]
    },
    "team_actions": ["Task 1", "Task 2"],
    "entities": ["Alice", "Bob"]
}
Only include what is explicitly mentioned. Do not infer. You must respond strictly in
valid JSON format, with no extra text or commentary.
Here's how we'll define our helper function to generate action items:
import json

class MeetingSummarizer:
    ...
    def get_action_items(self, transcript: str) -> dict:
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.action_item_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )
            action_items = response.choices[0].message.content.strip()
            try:
                return json.loads(action_items)
            except json.JSONDecodeError:
                # Keep the raw text so the caller can inspect what the model produced.
                return {"error": "Invalid JSON returned from model", "raw_output": action_items}
        except Exception as e:
            print(f"Error generating action items: {e}")
            return {"error": f"API call failed: {e}", "raw_output": ""}
We can now call these helper functions in our summarize() function and return their respective responses. Here's how we can do that:
class MeetingSummarizer:
    ...
    def summarize(self, transcript: str) -> tuple[str, dict]:
        summary = self.get_summary(transcript)
        action_items = self.get_action_items(transcript)
        return summary, action_items
You can run the new MeetingSummarizer as follows:
import json  # needed for json.dumps below

summarizer = MeetingSummarizer()
with open("meeting_transcript.txt", "r") as file:
    transcript = file.read().strip()

summary, action_items = summarizer.summarize(transcript)
print(summary)
print("JSON:")
print(json.dumps(action_items, indent=2))
✅ Congratulations! 🎉 You've just built a summarization agent that returns the summary as plain text and the action items as a JSON object, which we can parse and manipulate however we want.
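Because the action items arrive as a dictionary, rendering them is plain data manipulation. The sketch below turns them into a text checklist, assuming the `individual_actions` and `team_actions` keys from the action items prompt above; `render_action_items` is a hypothetical helper, and a real UI would use its own templating.

```python
def render_action_items(action_items: dict) -> str:
    """Render parsed action items as a plain-text checklist."""
    lines = []
    # One block per person, in the order the model returned them.
    for person, tasks in action_items.get("individual_actions", {}).items():
        lines.append(f"{person}:")
        lines.extend(f"  [ ] {task}" for task in tasks)
    # Team-wide tasks get their own block, skipped when the list is empty.
    team_tasks = action_items.get("team_actions", [])
    if team_tasks:
        lines.append("Team:")
        lines.extend(f"  [ ] {task}" for task in team_tasks)
    return "\n".join(lines)


example = {
    "individual_actions": {
        "Ethan": ["Sync with design on the fallback UX messaging"],
        "Maya": ["Build the similarity metric"],
    },
    "team_actions": [],
    "entities": ["Ethan", "Maya"],
}
print(render_action_items(example))
```

If `get_action_items()` returned its error dictionary instead, this helper yields an empty string, so a caller may want to check for the `error` key first.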
Here is an example of a nice-looking UI that shows how we can present our new responses.

Ethan and Maya discussed recent feedback on the customer support assistant, focusing on concerns around response speed and answer quality. Key issues included vague or incorrect answers and misclassification of simple issues, which may stem from inaccurate internal summarization.
They debated whether the problems are due to prompt engineering or the model itself. Maya shared results comparing GPT-4o and Claude 3, noting that Claude gave more reliable responses but was slower. Ethan emphasized the importance of latency for user experience.
They considered a hybrid approach using GPT-4o for speed and Claude as a fallback when confidence is low. However, current systems lack effective confidence metrics. They explored using embedding similarity as a potential signal, while being mindful of associated costs.
The conversation also touched on user feedback about the assistant's robotic tone. Maya recommended prompt tuning with example replies instead of full model fine-tuning to improve tone and empathy.
Finally, they discussed UI strategies for low-confidence responses, agreeing that a fallback prompt suggesting human assistance would improve user trust, provided it's used judiciously.
{
    "individual_actions": {
        "Ethan": ["Sync with design on the fallback UX messaging"],
        "Maya": [
            "Build the similarity metric",
            "Set up a test run for the hybrid model approach using GPT-4o and Claude"
        ]
    },
    "team_actions": [],
    "entities": ["Ethan", "Maya"]
}
We now have a summarization agent that generates responses in our desired format. Now it's time to evaluate how well this agent works. Many developers stop at a quick glance at the output and assume it's good enough. But LLMs are probabilistic and prone to inconsistency, so eyeballing results won't catch subtle regressions, logical errors, or hallucinated action items. That's why rigorous evaluation is essential.
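Some checks can be automated cheaply even before a full evaluation. The sketch below validates only the shape of the action items dictionary against the format requested in the action items prompt; it says nothing about whether the items are correct or hallucinated. `validate_action_items` is a hypothetical helper, not part of deepeval, and the rule that every person with tasks must appear in `entities` is an assumption drawn from the example in the prompt.

```python
def validate_action_items(action_items: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    if "error" in action_items:
        # get_action_items() returns an "error" key when the call or parsing failed.
        return [f"generation failed: {action_items['error']}"]

    problems = []
    individual = action_items.get("individual_actions")
    team = action_items.get("team_actions")
    entities = action_items.get("entities")

    if not isinstance(individual, dict):
        problems.append("individual_actions must be an object")
    if not isinstance(team, list):
        problems.append("team_actions must be a list")
    if not isinstance(entities, list):
        problems.append("entities must be a list")

    if isinstance(individual, dict) and isinstance(entities, list):
        # Assumption: everyone who has tasks should also be listed as an entity.
        for person in individual:
            if person not in entities:
                problems.append(f"{person} has tasks but is missing from entities")
    return problems


sample = {
    "individual_actions": {"Maya": ["Build the similarity metric"]},
    "team_actions": [],
    "entities": ["Ethan"],
}
print(validate_action_items(sample))
```

A check like this catches malformed responses early, but judging the quality of the content still needs the kind of evaluation covered next.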
In the next section we are going to see how to evaluate your summarization agent using deepeval.