
Improving Prompts and Models

In this section, we'll explore different configurations of our medical chatbot by iterating over its hyperparameters and evaluating each configuration with deepeval.

By comparing the evaluation results across these configurations, we can significantly improve our chatbot's performance. These are the hyperparameters we'll iterate over (see the sketch after the list):

  • System prompt: This is the prompt that defines the overall behavior of our chatbot across all interactions.
  • Model: This is the model we'll use to generate responses.
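Concretely, every unique combination of these settings is one configuration to evaluate. Here's a minimal sketch of such a grid; the prompt variables are hypothetical placeholders, and the actual values we'll use appear later in this section:

models = ["gpt-4", "gpt-4o-mini", "gpt-3.5-turbo"]
system_prompts = [baseline_prompt, improved_prompt]  # hypothetical placeholders

# Every (model, system prompt) pair is one chatbot configuration to evaluate
configs = [(model, prompt) for model in models for prompt in system_prompts]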

Pulling Datasets

In the previous section, we saw how to create datasets and store them in the cloud. We can now pull that dataset and reuse it as many times as we need to generate test cases and evaluate our medical chatbot.

Here's how we can pull datasets from the cloud:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="Medical Chatbot Dataset")
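To confirm the pull worked, you can check how many goldens came down:

# Quick sanity check that the pull succeeded
print(f"Pulled {len(dataset.goldens)} goldens")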

The pulled dataset contains goldens, which can be used to create test cases at runtime and run evals. Here's how we can use our ConversationalGoldens and the ConversationSimulator to generate ConversationalTestCases:

from deepeval.simulator import ConversationSimulator
from typing import List, Dict
from medical_chatbot import MedicalChatbot # Import your chatbot here
import asyncio

medical_chatbot = MedicalChatbot()

async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
    # Run the synchronous agent call in a thread so it doesn't block the event loop
    loop = asyncio.get_event_loop()
    res = await loop.run_in_executor(None, medical_chatbot.agent_executer.invoke, {
        "input": input,
        "chat_history": conversation_history
    })
    return res["output"]

for golden in dataset.goldens:
    simulator = ConversationSimulator(
        user_intentions=golden.additional_metadata["user_intentions"],
        user_profiles=golden.additional_metadata["user_profiles"]
    )

    convo_test_cases = simulator.simulate(
        model_callback=model_callback,
        stopping_criteria="Stop when the user's medical concern is addressed with actionable advice.",
    )

    for test_case in convo_test_cases:
        test_case.scenario = golden.scenario
        test_case.expected_outcome = golden.expected_outcome
        test_case.chatbot_role = "a professional, empathetic medical assistant"

    print(f"\nGenerated {len(convo_test_cases)} conversational test cases.")

We can now use these test cases to evaluate our chatbot.
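For example, assuming the same metrics we defined in the previous section (knowledge_retention, role_adherence, and safety_check), running evals is a single call:

from deepeval import evaluate

# Score the simulated conversations with our conversational metrics
evaluate(test_cases=convo_test_cases, metrics=[knowledge_retention, role_adherence, safety_check])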

Iterating on Hyperparameters

Now that we can pull our ConversationalGoldens, we will use these goldens and the ConversationSimulator to generate test cases for different configurations of our chatbot by iterating on hyperparameters.

We will now iterate over different models together with an improved system prompt to see which configuration performs best.

This is the new system prompt we'll be using:

You are BayMax, a friendly and professional healthcare chatbot. You assist users by retrieving accurate information from the Gale Encyclopedia of Medicine and helping them book medical appointments.

Your key responsibilities:
- Provide clear, fact-based health information from trusted sources only.
- Retrieve and summarize relevant entries from the Gale Encyclopedia when asked.
- Help users schedule or manage healthcare appointments as needed.
- Maintain a warm, empathetic, and calm tone.
- Always recommend consulting a licensed healthcare provider for diagnoses or treatment.

Do not:
- Offer medical diagnoses or personal treatment plans.
- Speculate or give advice beyond verified sources.
- Ask for sensitive personal information unless necessary for booking.

Use phrases like:
- "According to the Gale Encyclopedia of Medicine..."
- "This is general information. Please consult a healthcare provider for advice."

Your goal is to support users with reliable, respectful healthcare guidance.

We will now iterate over the different models to see which one performs best for our chatbot.

from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationalGEval,
)
from deepeval.dataset import EvaluationDataset, ConversationalGolden
from deepeval.simulator import ConversationSimulator
from typing import List, Dict
from deepeval import evaluate
from medical_chatbot import MedicalChatbot # Import your chatbot here
import asyncio

dataset = EvaluationDataset()
dataset.pull(alias="Medical Chatbot Dataset")

metrics = [knowledge_retention, role_adherence, safety_check] # The same metrics defined in the previous section

models = ["gpt-4", "gpt-4o-mini", "gpt-3.5-turbo"]
system_prompt = "..." # Use your new system prompt here

def create_model_callback(chatbot_instance):
    # Bind the callback to a specific chatbot instance so each
    # configuration is simulated against its own chatbot
    async def model_callback(input: str, conversation_history: List[Dict[str, str]]) -> str:
        loop = asyncio.get_event_loop()
        res = await loop.run_in_executor(None, chatbot_instance.agent_executer.invoke, {
            "input": input,
            "chat_history": conversation_history
        })
        return res["output"]
    return model_callback

for model in models:
    for golden in dataset.goldens:
        simulator = ConversationSimulator(
            user_intentions=golden.additional_metadata["user_intentions"],
            user_profiles=golden.additional_metadata["user_profiles"]
        )

        chatbot = MedicalChatbot("gale_encyclopedia.txt", model)
        chatbot.setup_agent(system_prompt)

        convo_test_cases = simulator.simulate(
            model_callback=create_model_callback(chatbot),
            stopping_criteria="Stop when the user's medical concern is addressed with actionable advice.",
        )

        for test_case in convo_test_cases:
            test_case.scenario = golden.scenario
            test_case.expected_outcome = golden.expected_outcome
            test_case.chatbot_role = "a professional, empathetic medical assistant"

        evaluate(convo_test_cases, metrics)
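When comparing runs like this, it also helps to record which configuration produced which scores. deepeval's evaluate accepts a hyperparameters dictionary for exactly this; a minimal sketch (the key names are free-form labels of our choosing):

# Log the configuration alongside the evaluation results
evaluate(
    convo_test_cases,
    metrics,
    hyperparameters={"model": model, "system prompt": system_prompt},
)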

After running these iterations, we observed that gpt-4 performs best across all three metrics. Here are the average scores it achieved:

Metric               Score
Knowledge Retention  0.8
Role Adherence       0.7
Safety Check         0.9

We'll now see how to update our chatbot to support more hyperparameters.

Updating the Chatbot

We have seen how to change these parameters; now we'll update our chatbot's code to make it easier to configure and improve. Here's the new chatbot code:

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool # provides the @tool decorator used below
from deepeval.tracing import observe

class MedicalChatbot:
    def __init__(
        self,
        document_path,
        model="gpt-4",
        encoder="all-MiniLM-L6-v2",
        memory=":memory:",
        system_prompt=""
    ):
        self.model = ChatOpenAI(model=model)
        self.appointments = {}
        self.encoder = SentenceTransformer(encoder)
        self.client = QdrantClient(memory)
        self.store_data(document_path)
        self.system_prompt = system_prompt or (
            "You are a virtual health assistant designed to support users with symptom understanding and appointment management. Start every conversation by actively listening to the user's concerns. Ask clear follow-up questions to gather information like symptom duration, intensity, and relevant health history. Use available tools to fetch diagnostic information or manage medical appointments. Never assume a diagnosis unless there's enough detail, and always recommend professional medical consultation when appropriate."
        )
        self.setup_agent(self.system_prompt)

    def store_data(self, document_path):
        ...

    @tool
    @observe()
    def query_engine(self, query: str) -> str:
        ...

    @tool
    def create_appointment(self, appointment_id: str) -> str:
        ...

    def setup_tools(self):
        ...

    @observe()
    def setup_agent(self, system_prompt: str):
        ...

    @observe()
    def interactive_session(self, session_id):
        ...

These are the updates made to our medical chatbot. You can now change the following configurations directly in the initialization:

  • generation model
  • embedding model
  • memory management
  • system prompt

from medical_chatbot import MedicalChatbot

chatbot = MedicalChatbot(
    document_path="gale_encyclopedia.txt",
    model="gpt-4",
    encoder="all-MiniLM-L6-v2",
    memory=":memory:",
    system_prompt="..."
)
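From here, starting a chat is one more call, using the interactive_session method shown above (the session id is an arbitrary label of our choosing):

# Start an interactive chat session with the configured chatbot
chatbot.interactive_session("session_1")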

This updated chatbot now performs as intended and is ready to be hooked up to a UI. Here's what a UI-integrated chatbot looks like:

[Image: Chatbot UI Overview]

In the next section, we'll go over how to set up tracing for our chatbot so we can observe it at the component level and prepare it for deployment.