Introduction
DeepEval is a powerful open-source LLM evaluation framework. These tutorials show you how to use DeepEval to improve your LLM application one step at a time, walking you through evaluating and testing your LLM applications from initial development to post-production.
Below is a curated set of tutorials — each focused on real-world tasks, metrics, and best practices for reliable LLM evaluation. Start with the basics, or jump straight to your use case.
Tutorials
Start Here: Install & Run Your First Evaluation
Not sure where to begin? Click here to get started and run your first evaluation with DeepEval.
Meeting Summarizer
Learn how to develop and evaluate a summarization agent using DeepEval.
RAG QA Agent
Evaluate your RAG pipeline for accuracy, relevance, and completeness.
Medical Chatbot
Test a healthcare-focused LLM chatbot for hallucinations and safety.
What You'll Learn
DeepEval tutorials cover best practices for evaluating LLM applications across both development and production.
Development Evals
You'll learn how to:
- Select evaluation metrics that align with your task
- Use deepeval to measure and track LLM performance (a minimal example follows this list)
- Interpret results to tune prompts, models, and other system hyperparameters
- Scale evaluations to cover diverse inputs and edge cases
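For instance, here's a minimal sketch of what measuring performance with deepeval can look like. The hard-coded output, metric choice, and threshold are illustrative assumptions, and running an LLM-judged metric like this assumes an evaluation model (by default, an OpenAI API key) is configured:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A test case pairs an input with the actual output produced by your LLM app.
# The output below is a placeholder standing in for your model's real response.
test_case = LLMTestCase(
    input="What are your shipping options?",
    actual_output="We offer standard (5-7 days) and express (1-2 days) shipping.",
)

# AnswerRelevancyMetric uses an evaluation model to judge how relevant
# the output is to the input; the 0.7 threshold is an arbitrary choice.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric and prints a pass/fail report with scores and reasons.
evaluate(test_cases=[test_case], metrics=[metric])
```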
Production Evals
You'll also see how to:
- Continuously evaluate your LLM's performance in production
- Run A/B tests on different models or configurations using real data (a sketch follows this list)
- Feed production insights back into your development workflow to improve future releases
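As a rough illustration of that A/B idea, the sketch below scores outputs from two hypothetical configurations on the same production-style prompts and compares the average metric score. The prompts, outputs, and metric choice are all assumptions made up for this example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical production inputs paired with outputs from two configurations
# (e.g. two system prompts or two models). Replace with your real data.
samples = [
    ("How do I reset my password?",
     "Go to Settings > Security and click 'Reset password'.",    # config A
     "You can reset it from the security settings page."),       # config B
    ("Do you support refunds?",
     "Yes, refunds are available within 30 days of purchase.",   # config A
     "Refunds? Please contact support for anything like that."), # config B
]

def average_score(output_index: int) -> float:
    metric = AnswerRelevancyMetric()
    scores = []
    for row in samples:
        test_case = LLMTestCase(input=row[0], actual_output=row[output_index])
        metric.measure(test_case)  # scores a single test case
        scores.append(metric.score)
    return sum(scores) / len(scores)

print("Config A:", average_score(1))
print("Config B:", average_score(2))
```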
LLM evaluation isn't a one-time step; it's a continuous loop. Production data sharpens development, and development precision strengthens production. That's why it's crucial to do both, and DeepEval helps you do just that.
Here are a few key terms to keep in mind for LLM evaluation:
- Hyperparameters: The configuration values that shape your LLM application. This includes system prompts, user prompts, model choice, temperature, chunk size (for RAG), and more.
- System Prompt: A prompt that defines the overall behavior of your LLM across all interactions.
- Generation Model: The model used to generate responses — this is the LLM you're evaluating. Throughout the tutorials, we'll simply call it the model.
- Evaluation Model: A separate LLM used to score, critique, or assess the outputs of your generation model. This is not the model being evaluated (see the sketch after this list).
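To make the last two terms concrete, here's a small sketch. The generation model produces the actual_output (simulated below with a placeholder string), while the metric's model argument selects the evaluation model that does the judging; the specific model name is an illustrative assumption:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Output from your generation model (the LLM being evaluated).
# In a real app this string would come from your own LLM call.
generated_answer = "Our meeting summarizer supports Zoom, Meet, and Teams."

test_case = LLMTestCase(
    input="Which platforms does the summarizer support?",
    actual_output=generated_answer,
)

# The evaluation model is a separate LLM that scores the output.
# "gpt-4o" is an illustrative choice passed via the metric's model argument.
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7)
metric.measure(test_case)

print(metric.score)   # numeric score from the evaluation model
print(metric.reason)  # the evaluation model's explanation
```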
What DeepEval Offers
DeepEval supports a wide range of LLM evaluation metrics tailored to different use cases, including:
- RAG applications (Retrieval-Augmented Generation)
- Conversational applications
- Agentic applications
Click here to explore all the metrics deepeval offers.
Throughout these tutorials, we'll walk through how to evaluate a variety of use cases with deepeval using real-world best practices. Your specific use case may differ, and that's expected. The evaluation approach remains the same: define your criteria, choose the right metrics, and iterate based on the results.
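When no built-in metric matches your criteria, deepeval's GEval lets you describe them in plain language and have the evaluation model apply them. The criteria text, test case, and threshold below are hypothetical examples:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define your own criteria in natural language; GEval turns them into an
# LLM-judged metric over the selected test case fields.
conciseness = GEval(
    name="Conciseness",
    criteria="The output should answer the input directly without unnecessary filler.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,  # arbitrary example threshold
)

test_case = LLMTestCase(
    input="When was the order shipped?",
    actual_output="Your order shipped on Tuesday.",
)

conciseness.measure(test_case)
print(conciseness.score, conciseness.reason)
```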
Who This Is For
Whether you're building chatbots, summarizers, or agent systems powered by LLMs, these tutorials are designed for:
- Developers shipping LLM features in real products
- Researchers testing prompts or model variations
- Teams optimizing LLM outputs at scale
From early experimentation to managing LLMs in production, these tutorials will help you test reliably, iterate faster, and ship with more confidence.
Want to get started right away? Click here to look at the list of available tutorials.