Introduction to Chatbot Evaluation
Learn how to build and evaluate a reliable LLM-powered medical chatbot using OpenAI, LangChain, Qdrant, and DeepEval—from development to deployment.

If you are working with multi-turn chatbots, this tutorial is for you. We will walk through the entire process of building a reliable multi-turn chatbot and evaluating it using deepeval.
Get Started
Jump ahead to any of the sections in the tutorial, or keep reading to go with the flow.
1. Building your chatbot
- Build with OpenAI
- Use Qdrant as the knowledge base
- Use LangChain for orchestration (minimal sketch below)
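To give you a feel for this stack before we dive in, here's a minimal sketch of wiring OpenAI, Qdrant, and LangChain together. The collection name, model names, and prompt are placeholders, it assumes the Qdrant collection already exists and is populated, and exact import paths may differ slightly across langchain-openai, langchain-qdrant, and qdrant-client versions.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Qdrant acts as the knowledge base (e.g. medical reference documents).
# "medical_docs" is a placeholder collection name assumed to already exist.
client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="medical_docs",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# OpenAI powers the responses; LangChain orchestrates retrieval + generation.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def answer(question: str) -> str:
    # Retrieve relevant context from Qdrant, then generate a grounded reply.
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    messages = [
        ("system", f"You are a medical assistant. Answer using this context:\n{context}"),
        ("human", question),
    ]
    return llm.invoke(messages).content
```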
2. Evaluating multi-turn conversations
- Learn how to use multi-turn test cases
- Select and create multi-turn metrics
- Use datasets to set up an LLM evals pipeline
- Identify weaknesses in your medical chatbot (see the sketch after this list)
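As a preview of what a multi-turn test case looks like in deepeval, here's a minimal sketch. The conversation content is made up, the metric choice is just an example, and import paths or constructor arguments may vary slightly between deepeval versions.

```python
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import KnowledgeRetentionMetric

# A multi-turn test case is an ordered list of user/assistant turns.
test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I've had a headache and a mild fever since yesterday."),
        Turn(role="assistant", content="That could be a viral infection. Any other symptoms?"),
        Turn(role="user", content="No. Can you book me an appointment for tomorrow morning?"),
        Turn(role="assistant", content="Sure, I've booked you in for 9am tomorrow."),
    ]
)

# Conversational metrics score the conversation as a whole rather than a single reply.
evaluate(test_cases=[test_case], metrics=[KnowledgeRetentionMetric(threshold=0.7)])
```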
3. Improving prompts, models, and more
- Use metric scores to improve your existing system prompt
- Experiment with different models using the new prompt
- Run regression tests to figure out whether you've iterated in the right direction (sketch below)
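The iteration loop in step 3 boils down to re-running the same scripted conversations against each new prompt or model and comparing metric scores. The sketch below illustrates the idea; `run_scripted_conversation` is a hypothetical helper standing in for whatever replay logic you build around your chatbot, and the prompts, model names, and metric are only examples.

```python
from deepeval import evaluate
from deepeval.metrics import ConversationCompletenessMetric
from deepeval.test_case import ConversationalTestCase

OLD_PROMPT = "You are a helpful medical assistant."
NEW_PROMPT = (
    "You are a careful medical assistant. Ask clarifying questions before "
    "suggesting a diagnosis, and confirm details before booking appointments."
)

# Hypothetical helper: replays a fixed set of scripted user messages against
# your chatbot (from step 1) and records the exchange as a ConversationalTestCase.
def run_scripted_conversation(system_prompt: str, model: str) -> ConversationalTestCase:
    ...

metric = ConversationCompletenessMetric(threshold=0.7)

# Re-run the same scripted conversation for each candidate configuration and
# compare scores to check whether you've iterated in the right direction.
for prompt, model in [
    (OLD_PROMPT, "gpt-4o-mini"),
    (NEW_PROMPT, "gpt-4o-mini"),
    (NEW_PROMPT, "gpt-4o"),
]:
    test_case = run_scripted_conversation(prompt, model)
    evaluate(test_cases=[test_case], metrics=[metric])
```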
4. Setting up evals in prod
- Trace your first LLM completion call and group completions into a conversation
- Decide which metrics you want to bring to prod, and define them in code
- Get alerted to any high-risk completions in prod in an ad-hoc fashion (sketch below)
What Will You Be Evaluating?
In this tutorial, you'll use DeepEval to evaluate and test a medical chatbot on its ability to:
- Diagnose symptoms, and
- Book appointments
It's a multi-turn conversational agent, meaning it can remember previous messages, handle follow-up questions, and take action based on the full conversation. Here's a nice-looking UI to give you a better idea of what your chatbot could look like in the real world:
In the next section, we'll begin by going through the chatbot implementation, built with OpenAI, Qdrant, and LangChain.
You can also skip straight to the Evaluation section.