Introduction
DeepEval is a powerful open-source LLM evaluation framework. These tutorials show you how to use DeepEval to improve your LLM application one step at a time, walking you through evaluating and testing your LLM applications from initial development to post-production.
Below is a curated set of tutorials — each focused on real-world tasks, metrics, and best practices for reliable LLM evaluation. Start with the basics, or jump straight to your use case.
Tutorials
Start Here: Install & Run Your First Evaluation
Not sure where to begin? Click here to get started and run your first evaluation with DeepEval.
Meeting Summarizer
Learn how to develop and evaluate a summarization agent using DeepEval.
RAG QA Agent
Evaluate your RAG pipeline for accuracy, relevance, and completeness.
Medical Chatbot
Test a healthcare-focused LLM chatbot for hallucinations and safety.
What You'll Learn
DeepEval tutorials cover best practices for evaluating LLM applications across both development and production.
Development Evals
You'll learn how to:
- Select evaluation metrics that align with your task
- Use deepeval to measure and track LLM performance (a minimal example follows this list)
- Interpret results to tune prompts, models, and other system hyperparameters
- Scale evaluations to cover diverse inputs and edge cases
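For instance, here's a minimal sketch of what measuring performance with deepeval can look like. The hard-coded output, metric choice, and threshold are illustrative assumptions, and running an LLM-judged metric like this assumes an evaluation model (by default, an OpenAI API key) is configured:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A test case pairs an input with the actual output produced by your LLM app.
# The output below is a placeholder standing in for your model's real response.
test_case = LLMTestCase(
    input="What are your shipping options?",
    actual_output="We offer standard (5-7 days) and express (1-2 days) shipping.",
)

# AnswerRelevancyMetric uses an evaluation model to judge how relevant
# the output is to the input; the 0.7 threshold is an arbitrary choice.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric and prints a pass/fail report with scores and reasons.
evaluate(test_cases=[test_case], metrics=[metric])
```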
Production Evals
You'll also see how to:
- Continuously evaluate your LLM's performance in production
- Run A/B tests on different models or configurations using real data (a sketch follows this list)
- Feed production insights back into your development workflow to improve future releases
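As a rough illustration of that A/B idea, the sketch below scores outputs from two hypothetical configurations on the same production-style prompts and compares the average metric score. The prompts, outputs, and metric choice are all assumptions made up for this example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical production inputs paired with outputs from two configurations
# (e.g. two system prompts or two models). Replace with your real data.
samples = [
    ("How do I reset my password?",
     "Go to Settings > Security and click 'Reset password'.",    # config A
     "You can reset it from the security settings page."),       # config B
    ("Do you support refunds?",
     "Yes, refunds are available within 30 days of purchase.",   # config A
     "Refunds? Please contact support for anything like that."), # config B
]

def average_score(output_index: int) -> float:
    metric = AnswerRelevancyMetric()
    scores = []
    for row in samples:
        test_case = LLMTestCase(input=row[0], actual_output=row[output_index])
        metric.measure(test_case)  # scores a single test case
        scores.append(metric.score)
    return sum(scores) / len(scores)

print("Config A:", average_score(1))
print("Config B:", average_score(2))
```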
LLM evaluation isn't a one-time step; it's a continuous loop. Production data sharpens development, and development precision strengthens production. That's why it's crucial to do both, and DeepEval helps you do just that.
Here are a few key terms to keep in mind for LLM evaluation:
- Hyperparameters: The configuration values that shape your LLM application. This includes system prompts, user prompts, model choice, temperature, chunk size (for RAG), and more.
- System Prompt: A prompt that defines the overall behavior of your LLM across all interactions.
- Generation Model: The model used to generate responses — this is the LLM you're evaluating. Throughout the tutorials, we'll simply call it the model.
- Evaluation Model: A separate LLM used to score, critique, or assess the outputs of your generation model. This is not the model being evaluated (see the sketch after this list).
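To make the last two terms concrete, here's a small sketch. The generation model produces the actual_output (simulated below with a placeholder string), while the metric's model argument selects the evaluation model that does the judging; the specific model name is an illustrative assumption:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Output from your generation model (the LLM being evaluated).
# In a real app this string would come from your own LLM call.
generated_answer = "Our meeting summarizer supports Zoom, Meet, and Teams."

test_case = LLMTestCase(
    input="Which platforms does the summarizer support?",
    actual_output=generated_answer,
)

# The evaluation model is a separate LLM that scores the output.
# "gpt-4o" is an illustrative choice passed via the metric's model argument.
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7)
metric.measure(test_case)

print(metric.score)   # numeric score from the evaluation model
print(metric.reason)  # the evaluation model's explanation
```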
What DeepEval Offers
DeepEval supports a wide range of LLM evaluation metrics tailored to different use cases, including:
- RAG applications (Retrieval-Augmented Generation)
- Conversational applications
- Agentic applications
Click here to explore all the metrics deepeval offers.
Throughout these tutorials, we'll walk through how to evaluate a variety of use cases with deepeval using real-world best practices. Your specific use case may differ, and that's expected. The evaluation approach remains the same: define your criteria, choose the right metrics, and iterate based on the results.
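When no built-in metric matches your criteria, deepeval's GEval lets you describe them in plain language and have the evaluation model apply them. The criteria text, test case, and threshold below are hypothetical examples:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define your own criteria in natural language; GEval turns them into an
# LLM-judged metric over the selected test case fields.
conciseness = GEval(
    name="Conciseness",
    criteria="The output should answer the input directly without unnecessary filler.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,  # arbitrary example threshold
)

test_case = LLMTestCase(
    input="When was the order shipped?",
    actual_output="Your order shipped on Tuesday.",
)

conciseness.measure(test_case)
print(conciseness.score, conciseness.reason)
```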
Who This Is For
Whether you're building chatbots, summarizers, or agent systems powered by LLMs, these tutorials are designed for:
- Developers shipping LLM features in real products
- Researchers testing prompts or model variations
- Teams optimizing LLM outputs at scale
From early experimentation to managing LLMs in production, these tutorials will help you test reliably, iterate faster, and ship with more confidence.
Want to get started right away? Click here to look at the list of available tutorials.