
Introduction

DeepEval is a powerful open-source LLM evaluation framework. These tutorials show you how to use DeepEval to improve your LLM application one step at a time, walking you through evaluating and testing your LLM applications from initial development to post-production.

Below is a curated set of tutorials — each focused on real-world tasks, metrics, and best practices for reliable LLM evaluation. Start with the basics, or jump straight to your use case.


What You'll Learn

DeepEval tutorials cover the best practices for evaluating LLM applications across both development and production.

Development Evals

You'll learn how to:

  • Select evaluation metrics that align with your task
  • Use deepeval to measure and track LLM performance (see the sketch after this list)
  • Interpret results to tune prompts, models, and other system hyperparameters
  • Scale evaluations to cover diverse inputs and edge cases
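
To make the first two points concrete, here is a minimal sketch of a single deepeval check. It assumes deepeval is installed and an OPENAI_API_KEY is available for the default evaluation model; the input, output, and metric choice below are placeholders, not recommendations:

```python
# A minimal sketch (not taken from the tutorials): scoring one LLM response
# with deepeval. Assumes `pip install deepeval` and an OPENAI_API_KEY for
# the default evaluation model; the strings below are made-up placeholders.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Wrap one interaction with the LLM application you're evaluating.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="Domestic orders ship within 3 to 5 business days.",
)

# An LLM-as-a-judge metric: scores how relevant the output is to the input.
# `threshold` is the pass/fail cutoff.
metric = AnswerRelevancyMetric(threshold=0.7)

# Run the metric against the test case and print a results summary.
evaluate(test_cases=[test_case], metrics=[metric])
```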

Production Evals

You'll also see how to:

  • Continuously evaluate your LLM's performance in production
  • Run A/B tests on different models or configurations using real data
  • Feed production insights back into your development workflow to improve future releases

Tip: LLM evaluation isn't a one-time step; it's a continuous loop. Production data sharpens development, and development precision strengthens production. That's why it's crucial to do both, and DeepEval helps you do just that.

Here are a few key terms to keep in mind for LLM evaluations:
  • Hyperparameters: The configuration values that shape your LLM application. This includes system prompts, user prompts, model choice, temperature, chunk size (for RAG), and more.
  • System Prompt: A prompt that defines the overall behavior of your LLM across all interactions.
  • Generation Model: The model used to generate responses — this is the LLM you're evaluating. Throughout the tutorials, we'll simply call it the model.
  • Evaluation Model: A separate LLM used to score, critique, or assess the outputs of your generation model. This is not the model being evaluated (see the sketch after this list).
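
To ground the last two terms, here is a hedged sketch: the actual_output is whatever your generation model produced, while the model argument on the metric selects the evaluation model that judges it. The "gpt-4o" name and the correctness criteria below are illustrative placeholders:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# `actual_output` is produced by the generation model -- the LLM under test.
test_case = LLMTestCase(
    input="Summarize our refund policy.",
    actual_output="Refunds are issued within 14 days of purchase.",
)

# The `model` argument selects the evaluation model (the LLM judge).
# It scores the generation model's output but is never evaluated itself.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the input accurately.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # placeholder evaluation model
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```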

What DeepEval Offers

DeepEval supports a wide range of LLM evaluation metrics tailored to different use cases, including:

  • RAG applications (Retrieval-Augmented Generation)
  • Conversational applications
  • Agentic applications

Click here to explore all the metrics deepeval offers.
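
As a rough illustration of how those categories map onto metrics, the sketch below imports one or two metrics per category. The specific class names reflect deepeval's catalog as we understand it and may differ between versions, so treat the metrics page linked above as the source of truth:

```python
from deepeval.metrics import (
    FaithfulnessMetric,          # RAG: is the output grounded in the retrieved context?
    ContextualRelevancyMetric,   # RAG: is the retrieved context relevant to the input?
    KnowledgeRetentionMetric,    # conversational: are earlier facts retained across turns?
    ToolCorrectnessMetric,       # agentic: did the agent call the expected tools?
)

# Instantiating metrics is cheap; scoring happens later, when each metric is
# run against a test case (which needs an evaluation model / API key).
rag_metrics = [FaithfulnessMetric(threshold=0.7), ContextualRelevancyMetric(threshold=0.7)]
conversation_metrics = [KnowledgeRetentionMetric(threshold=0.7)]
agent_metrics = [ToolCorrectnessMetric(threshold=0.7)]
```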

Throughout these tutorials, we'll walk through how to evaluate a variety of use cases with deepeval using real-world best practices. Your specific use case may differ — and that's expected. The evaluation approach remains the same: define your criteria, choose the right metrics, and iterate based on the results.

Who This Is For

Whether you're building chatbots, summarizers, or agent systems powered by LLMs, these tutorials are designed for:

  • Developers shipping LLM features in real products
  • Researchers testing prompts or model variations
  • Teams optimizing LLM outputs at scale

From early experimentation to managing LLMs in production, these tutorials will help you test reliably, iterate faster, and ship with more confidence.

Want to get started right away? Click here to look at the list of available tutorials.