Introduction to Chatbot Evaluation
Learn how to build and evaluate a reliable LLM-powered medical chatbot using OpenAI, LangChain, Qdrant, and DeepEval—from development to deployment.

Get Started
Jump ahead to any of the sections in the tutorial, or keep reading to go with the flow.
1. Building your chatbot
- Build with OpenAI
- Use Qdrant as the knowledge base
- Use LangChain for orchestration
2. Evaluate multi-turn conversations
- Learn how to use multi-turn test cases
- Select and create multi-turn metrics
- Use datasets to set up an LLM evals pipeline
- Identify weaknesses in your medical chatbot
3. Improving prompts, models, etc.
- Use metric scores to improve the existing system prompt
- Experiment with different models using the new prompt
- Run regression tests to check whether you've iterated in the right direction
4. Set up evals in prod
- Trace your first LLM completion call, then group calls into a conversation
- Decide which metrics you wish to bring to prod, and define them in code
- Get alerted on an ad hoc basis to any high-risk completions in prod
What Will You Be Evaluating?
In this tutorial, you'll learn to evaluate and test a medical chatbot using DeepEval on its ability to:
- Diagnose symptoms, and
- Book appointments
It's a multi-turn conversational agent, meaning it can remember previous messages, handle follow-up questions, and take action based on the full conversation. Here's a nice-looking UI to give you a better idea of what your chatbot could look like in the real world:

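To make "multi-turn" a bit more concrete, here is a minimal, library-free sketch of how such a conversation might be represented and checked. The names `ChatTurn`, `Conversation`, and `mentions_earlier_detail` are hypothetical and exist only for illustration; they are not DeepEval, LangChain, or OpenAI APIs, and the keyword check is a toy stand-in for the kind of multi-turn metric covered later in the tutorial.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChatTurn:
    """One message in a conversation (hypothetical helper, not a DeepEval class)."""
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Conversation:
    """Ordered list of turns, so later replies can depend on earlier ones."""
    turns: List[ChatTurn] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append(ChatTurn(role, content))

    def as_messages(self) -> List[Dict[str, str]]:
        # Role/content dicts, a common shape for chat-style message histories.
        return [{"role": t.role, "content": t.content} for t in self.turns]


def mentions_earlier_detail(conversation: Conversation, detail: str) -> bool:
    """Toy check: does an assistant turn repeat `detail` after the user stated it?"""
    needle = detail.lower()
    user_said_it = False
    for turn in conversation.turns:
        text = turn.content.lower()
        if turn.role == "user" and needle in text:
            user_said_it = True
        elif turn.role == "assistant" and user_said_it and needle in text:
            return True
    return False


# Illustrative conversation covering both abilities: symptoms, then booking.
conversation = Conversation()
conversation.add("user", "I've had a headache for three days.")
conversation.add("assistant", "Sorry to hear that. Any fever or nausea?")
conversation.add("user", "No fever. Can I book an appointment?")
conversation.add("assistant", "Yes. I'll note the headache as the reason for your visit.")

remembered = mentions_earlier_detail(conversation, "headache")
```

The point of the sketch is that a multi-turn evaluation looks at the whole ordered history rather than a single prompt and response: the final assistant reply is only judged "good" here because it refers back to something said three turns earlier.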
In the next section, we'll begin by going through the chatbot implementation, built with OpenAI, Qdrant, and LangChain.