Introduction to Chatbot Evaluation
Learn how to build and evaluate a reliable LLM-powered medical chatbot using OpenAI, LangChain, Qdrant, and DeepEval—from development to deployment.

If you are working with multi-turn chatbots, this tutorial is for you. We will walk through the entire process of building a reliable multi-turn chatbot and evaluating it using deepeval.
Get Started
Jump ahead to any of the sections in the tutorial, or keep reading to go with the flow.
1. Building your chatbot
- Build with OpenAI
- Use Qdrant as the knowledge base
- Use LangChain for orchestration (minimal sketch below)
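To give you a feel for this stack before we dive in, here's a minimal sketch of wiring OpenAI, Qdrant, and LangChain together. The collection name, model names, and prompt are placeholders, it assumes the Qdrant collection already exists and is populated, and exact import paths may differ slightly across langchain-openai, langchain-qdrant, and qdrant-client versions.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Qdrant acts as the knowledge base (e.g. medical reference documents).
# "medical_docs" is a placeholder collection name assumed to already exist.
client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="medical_docs",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# OpenAI powers the responses; LangChain orchestrates retrieval + generation.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def answer(question: str) -> str:
    # Retrieve relevant context from Qdrant, then generate a grounded reply.
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    messages = [
        ("system", f"You are a medical assistant. Answer using this context:\n{context}"),
        ("human", question),
    ]
    return llm.invoke(messages).content
```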
2. Evaluating multi-turn conversations
- Learn how to use multi-turn test cases
- Select and create multi-turn metrics
- Use datasets to set up an LLM evals pipeline
- Identify weaknesses in your medical chatbot (see the sketch after this list)
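As a preview of what a multi-turn test case looks like in deepeval, here's a minimal sketch. The conversation content is made up, the metric choice is just an example, and import paths or constructor arguments may vary slightly between deepeval versions.

```python
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import KnowledgeRetentionMetric

# A multi-turn test case is an ordered list of user/assistant turns.
test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I've had a headache and a mild fever since yesterday."),
        Turn(role="assistant", content="That could be a viral infection. Any other symptoms?"),
        Turn(role="user", content="No. Can you book me an appointment for tomorrow morning?"),
        Turn(role="assistant", content="Sure, I've booked you in for 9am tomorrow."),
    ]
)

# Conversational metrics score the conversation as a whole rather than a single reply.
evaluate(test_cases=[test_case], metrics=[KnowledgeRetentionMetric(threshold=0.7)])
```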
3. Improving prompts, models, and more
- Use metric scores to improve your existing system prompt
- Experiment with different models using the new prompt
- Run regression tests to figure out whether you've iterated in the right direction (sketch below)
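The iteration loop in step 3 boils down to re-running the same scripted conversations against each new prompt or model and comparing metric scores. The sketch below illustrates the idea; `run_scripted_conversation` is a hypothetical helper standing in for whatever replay logic you build around your chatbot, and the prompts, model names, and metric are only examples.

```python
from deepeval import evaluate
from deepeval.metrics import ConversationCompletenessMetric
from deepeval.test_case import ConversationalTestCase

OLD_PROMPT = "You are a helpful medical assistant."
NEW_PROMPT = (
    "You are a careful medical assistant. Ask clarifying questions before "
    "suggesting a diagnosis, and confirm details before booking appointments."
)

# Hypothetical helper: replays a fixed set of scripted user messages against
# your chatbot (from step 1) and records the exchange as a ConversationalTestCase.
def run_scripted_conversation(system_prompt: str, model: str) -> ConversationalTestCase:
    ...

metric = ConversationCompletenessMetric(threshold=0.7)

# Re-run the same scripted conversation for each candidate configuration and
# compare scores to check whether you've iterated in the right direction.
for prompt, model in [
    (OLD_PROMPT, "gpt-4o-mini"),
    (NEW_PROMPT, "gpt-4o-mini"),
    (NEW_PROMPT, "gpt-4o"),
]:
    test_case = run_scripted_conversation(prompt, model)
    evaluate(test_cases=[test_case], metrics=[metric])
```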
4. Setting up evals in prod
- Trace your first LLM completion call and group completions into a conversation
- Decide which metrics you want to bring to prod, and define them in code
- Get alerted to any high-risk completions in prod in an ad-hoc fashion (sketch below)
What Will You Be Evaluating?
In this tutorial, you'll use DeepEval to evaluate and test a medical chatbot on its ability to:
- Diagnose symptoms, and
- Book appointments
It's a multi-turn conversational agent, meaning it can remember previous messages, handle follow-up questions, and take action based on the full conversation. Here's a nice-looking UI to give you a better idea of what your chatbot could look like in the real world:
In the next section, we'll begin by going through the chatbot implementation, built with OpenAI, Qdrant, and LangChain.
You can also skip straight to the Evaluation section.