Improving Prompts and Models
In this section, we'll explore different configurations of our medical chatbot by iterating over its hyperparameters and evaluating each configuration with deepeval.
Comparing the evaluation results across configurations shows us which changes actually improve the chatbot's performance. These are the hyperparameters we'll iterate over:
- System prompt: This is the prompt that defines the overall behavior of our chatbot across all interactions.
- Model: This is the model we'll use to generate responses.
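As a rough sketch of what iterating over these two hyperparameters looks like, the snippet below enumerates every combination of system prompt and model. The `build_configurations` helper and the placeholder prompt strings are made up for illustration; they are not part of deepeval or of our chatbot.

```python
from itertools import product
from typing import Dict, List


def build_configurations(system_prompts: List[str], models: List[str]) -> List[Dict[str, str]]:
    """Return one configuration per (system prompt, model) pair."""
    return [
        {"system_prompt": prompt, "model": model}
        for prompt, model in product(system_prompts, models)
    ]


# Placeholder values, used only to show the shape of the grid.
configurations = build_configurations(
    system_prompts=["<original system prompt>", "<improved system prompt>"],
    models=["gpt-4", "gpt-4o-mini", "gpt-3.5-turbo"],
)

for config in configurations:
    print(config["model"], "|", config["system_prompt"])
```

Each dictionary describes one chatbot configuration that can be evaluated against the same dataset.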
Pulling Datasets
In the previous section, we saw how to create datasets and store them in the cloud. We can now pull that dataset and reuse it as many times as we need to generate test cases and evaluate our medical chatbot.
Here's how we can pull datasets from the cloud:
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="Medical Chatbot Dataset")
The pulled dataset contains goldens, which can be used to create test cases at run time and run evals. This is how we can use our ConversationalGoldens and the ConversationSimulator to generate ConversationalTestCases:
from deepeval.simulator import ConversationSimulator
from typing import List, Dict
from medical_chatbot import MedicalChatbot # Import your chatbot here
import asyncio

medical_chatbot = MedicalChatbot()

async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
    # Run the blocking agent call in a worker thread so the event loop stays free
    loop = asyncio.get_running_loop()
    res = await loop.run_in_executor(None, medical_chatbot.agent_executer.invoke, {
        "input": input,
        "chat_history": conversation_history
    })
    return res["output"]

for golden in dataset.goldens:
    # Each golden carries the user intentions and profiles to simulate
    simulator = ConversationSimulator(
        user_intentions=golden.additional_metadata["user_intentions"],
        user_profiles=golden.additional_metadata["user_profiles"]
    )
    convo_test_cases = simulator.simulate(
        model_callback=model_callback,
        stopping_criteria="Stop when the user's medical concern is addressed with actionable advice.",
    )
    # Copy the golden's context onto every simulated conversation
    for test_case in convo_test_cases:
        test_case.scenario = golden.scenario
        test_case.expected_outcome = golden.expected_outcome
        test_case.chatbot_role = "a professional, empathetic medical assistant"
    print(f"\nGenerated {len(convo_test_cases)} conversational test cases.")
We can now use these test cases to evaluate our chatbot.
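The callback used during simulation wraps a synchronous chatbot call so that it can be awaited. Below is a minimal, self-contained sketch of that pattern. `blocking_invoke` is a hypothetical stand-in for the chatbot's synchronous call, so the example runs without any API keys.

```python
import asyncio
import time
from typing import Dict, List


def blocking_invoke(payload: Dict[str, object]) -> Dict[str, str]:
    """Hypothetical stand-in for a synchronous agent call."""
    time.sleep(0.01)  # simulate a slow, blocking request
    return {"output": f"echo: {payload['input']}"}


async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free.
    res = await loop.run_in_executor(
        None,
        blocking_invoke,
        {"input": input, "chat_history": conversation_history},
    )
    return res["output"]


reply = asyncio.run(model_callback("I have a headache", []))
print(reply)
```

Offloading the call to a thread matters when several conversations are simulated concurrently, since one slow response would otherwise stall all the others.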
Iterating on Hyperparameters
Now that we can pull our ConversationalGoldens, we will use these goldens and the ConversationSimulator to generate test cases for different configurations of our chatbot by iterating on hyperparameters.
We will now iterate on different models and use a better system prompt to see which configuration performs the best.
This is the new system prompt we'll be using:
You are BayMax, a friendly and professional healthcare chatbot. You assist users by retrieving accurate information from the Gale Encyclopedia of Medicine and helping them book medical appointments.
Your key responsibilities:
- Provide clear, fact-based health information from trusted sources only.
- Retrieve and summarize relevant entries from the Gale Encyclopedia when asked.
- Help users schedule or manage healthcare appointments as needed.
- Maintain a warm, empathetic, and calm tone.
- Always recommend consulting a licensed healthcare provider for diagnoses or treatment.
Do not:
- Offer medical diagnoses or personal treatment plans.
- Speculate or give advice beyond verified sources.
- Ask for sensitive personal information unless necessary for booking.
Use phrases like:
- "According to the Gale Encyclopedia of Medicine..."
- "This is general information. Please consult a healthcare provider for advice."
Your goal is to support users with reliable, respectful healthcare guidance.
We will now iterate over different models to see which one performs best for our chatbot.
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationalGEval,
)
from deepeval.dataset import EvaluationDataset, ConversationalGolden
from deepeval.simulator import ConversationSimulator
from typing import List, Dict
from deepeval import evaluate
from medical_chatbot import MedicalChatbot # Import your chatbot here

dataset = EvaluationDataset()
dataset.pull(alias="Medical Chatbot Dataset")

# Reuse the metric instances defined in the previous section
metrics = [knowledge_retention, role_adherence, safety_check] # Use the same metrics
models = ["gpt-4", "gpt-4o-mini", "gpt-3.5-turbo"]
system_prompt = "..." # Use your new system prompt here

def create_model_callback(chatbot_instance):
    # Returns a callback bound to the given chatbot instance
    async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
        ...
    return model_callback

for model in models:
    for golden in dataset.goldens:
        simulator = ConversationSimulator(
            user_intentions=golden.additional_metadata["user_intentions"],
            user_profiles=golden.additional_metadata["user_profiles"]
        )
        # Build a chatbot for the model under test and apply the new system prompt
        chatbot = MedicalChatbot("gale_encyclopedia.txt", model)
        chatbot.setup_agent(system_prompt)
        convo_test_cases = simulator.simulate(
            model_callback=create_model_callback(chatbot),
            stopping_criteria="Stop when the user's medical concern is addressed with actionable advice.",
        )
        for test_case in convo_test_cases:
            test_case.scenario = golden.scenario
            test_case.expected_outcome = golden.expected_outcome
            test_case.chatbot_role = "a professional, empathetic medical assistant"
        # Evaluate this golden's simulated conversations for the current model
        evaluate(convo_test_cases, metrics)
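The callback factory exists so that each simulated conversation talks to the chatbot built for the model under test. The sketch below illustrates that closure pattern in isolation. `StubChatbot` and its `respond` method are hypothetical stand-ins, not the real MedicalChatbot interface.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class StubChatbot:
    """Hypothetical stand-in for MedicalChatbot; it only records its model name."""

    def __init__(self, model: str):
        self.model = model

    def respond(self, message: str) -> str:
        return f"[{self.model}] {message}"


def create_model_callback(
    chatbot_instance: StubChatbot,
) -> Callable[[str, List[Dict[str, str]]], Awaitable[str]]:
    # The inner coroutine closes over chatbot_instance, so each callback
    # keeps talking to the chatbot it was created for.
    async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
        return chatbot_instance.respond(input)

    return model_callback


callbacks = {name: create_model_callback(StubChatbot(name)) for name in ["gpt-4", "gpt-4o-mini"]}
replies = {name: asyncio.run(cb("hello", [])) for name, cb in callbacks.items()}
print(replies)
```

Without the factory, a single module-level callback would always reference the same chatbot, and every model in the loop would be evaluated against identical responses.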
After running these iterations, I observed that gpt-4 performs best on all three metrics. Here are the average scores it achieved:
| Metric | Score |
|---|---|
| Knowledge Retention | 0.8 |
| Role Adherence | 0.7 |
| Safety Check | 0.9 |
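Averages like these can be computed by collecting each metric's score per test case and taking the mean. The following is a minimal sketch of that bookkeeping. The model names and scores are made up for illustration and are not the results reported above, and `average_scores` is a hypothetical helper rather than a deepeval function.

```python
from statistics import mean
from typing import Dict, List

# Made-up per-test-case scores, for illustration only; these are not
# results from the runs described above.
scores_by_model: Dict[str, Dict[str, List[float]]] = {
    "model-a": {"Knowledge Retention": [0.5, 1.0], "Role Adherence": [0.5, 0.5]},
    "model-b": {"Knowledge Retention": [1.0, 1.0], "Role Adherence": [0.5, 1.0]},
}


def average_scores(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each metric's scores across test cases."""
    return {metric: mean(values) for metric, values in scores.items()}


averages = {model: average_scores(scores) for model, scores in scores_by_model.items()}

# Rank models by the mean of their per-metric averages.
best_model = max(averages, key=lambda model: mean(list(averages[model].values())))
print(best_model, averages[best_model])
```

Ranking by the mean of all metrics is only one possible choice; if one metric such as safety matters more, it could be weighted more heavily or used as a hard threshold.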
We'll now see how to update our chatbot to support more hyperparameters.
Updating Chatbot
We have seen how to change our hyperparameters. Now we'll update the code of our chatbot so that these settings are easier to change. Here's the new chatbot code:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool # Needed for the @tool decorators below
from deepeval.tracing import observe

class MedicalChatbot:
    def __init__(
        self,
        document_path,
        model="gpt-4",
        encoder="all-MiniLM-L6-v2",
        memory=":memory:",
        system_prompt=""
    ):
        self.model = ChatOpenAI(model=model)
        self.appointments = {}
        self.encoder = SentenceTransformer(encoder)
        self.client = QdrantClient(memory)
        self.store_data(document_path)
        # Fall back to the default system prompt when none is provided
        self.system_prompt = system_prompt or (
            "You are a virtual health assistant designed to support users with symptom understanding and appointment management. Start every conversation by actively listening to the user's concerns. Ask clear follow-up questions to gather information like symptom duration, intensity, and relevant health history. Use available tools to fetch diagnostic information or manage medical appointments. Never assume a diagnosis unless there's enough detail, and always recommend professional medical consultation when appropriate."
        )
        self.setup_agent(self.system_prompt)

    def store_data(self, document_path):
        ...

    @tool
    @observe()
    def query_engine(self, query: str) -> str:
        ...

    @tool
    def create_appointment(self, appointment_id: str) -> str:
        ...

    def setup_tools(self):
        ...

    @observe()
    def setup_agent(self, system_prompt: str):
        ...

    @observe()
    def interactive_session(self, session_id):
        ...
These are the updates made to our medical chatbot. You can now change the following settings when initializing the chatbot:
- generation model
- embedding model
- memory management
- system prompt
from medical_chatbot import MedicalChatbot

chatbot = MedicalChatbot(
    document_path="gale_encyclopedia.txt", # Required by __init__
    model="gpt-4",
    encoder="all-MiniLM-L6-v2",
    memory=":memory:",
    system_prompt="..."
)
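Since all of these settings are now constructor arguments, one convenient option is to keep them together in a small configuration object and vary one field at a time. The sketch below assumes a hypothetical `ChatbotConfig` dataclass whose fields mirror the constructor arguments shown above; it is not part of the chatbot code.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ChatbotConfig:
    """Hypothetical container mirroring the MedicalChatbot constructor arguments."""

    document_path: str
    model: str = "gpt-4"
    encoder: str = "all-MiniLM-L6-v2"
    memory: str = ":memory:"
    system_prompt: str = ""


base = ChatbotConfig(document_path="gale_encyclopedia.txt")
# Vary one hyperparameter at a time while keeping the rest fixed.
variant = replace(base, model="gpt-4o-mini")

kwargs = asdict(variant)
print(kwargs)  # these could be passed on as MedicalChatbot(**kwargs)
```

Keeping the configuration immutable makes it easy to log exactly which settings produced a given set of evaluation results.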
The updated chatbot now performs as we intended and can be used to build a UI. This is what a UI-integrated chatbot looks like:
[Image: the medical chatbot integrated into a UI]
In the next section, we'll go over how to set up tracing for our chatbot to observe it at the component level and prepare it for deployment.