# DeepEval

> DeepEval is an open-source LLM evaluation framework designed to unit-test LLM-powered applications such as agents, chatbots, and RAG pipelines. DeepEval incorporates the latest research to evaluate LLM outputs with metrics such as G-Eval, hallucination, answer relevancy, and fluency, which use LLMs and various other NLP models that run locally on your machine. DeepEval also offers a free cloud platform, Confident AI, for teams to incorporate LLM observability, tracing, and organization-wide collaboration into their LLM evals.

- [DeepEval LLM Evaluation](https://deepeval.com/): Open-source framework for evaluating large language models effectively.
- [DeepEval Framework Quickstart](https://deepeval.com/docs/getting-started): DeepEval is an open-source framework for evaluating LLM applications.
- [DeepEval LLM Evaluation](https://deepeval.com/docs/evaluation-introduction): Learn how to evaluate LLM applications using DeepEval.
- [DeepEval Metrics Overview](https://deepeval.com/docs/metrics-introduction): DeepEval provides 40+ metrics for evaluating LLM performance effectively.
- [G-Eval Framework](https://deepeval.com/docs/metrics-llm-evals): G-Eval framework for evaluating LLM outputs with custom metrics.
- [DAG Metric Overview](https://deepeval.com/docs/metrics-dag): Explore the versatile DAG metric for LLM evaluations.
- [Top G-Eval Use Cases](https://deepeval.com/blog/top-5-geval-use-cases): Explore top G-Eval use cases for custom LLM metrics.
- [Answer Relevancy Metrics](https://deepeval.com/docs/metrics-answer-relevancy): Evaluate answer relevancy using LLM metrics for RAG.
- [Faithfulness Metric Overview](https://deepeval.com/docs/metrics-faithfulness): Evaluate RAG pipeline quality using faithfulness metrics.
- [Contextual Relevancy Metric](https://deepeval.com/docs/metrics-contextual-relevancy): Explore the Contextual Relevancy Metric for evaluating RAG pipelines.
- [Contextual Precision Metric](https://deepeval.com/docs/metrics-contextual-precision): Evaluate a RAG pipeline's retriever using the contextual precision metric.
- [Contextual Recall Metric](https://deepeval.com/docs/metrics-contextual-recall): Explore the Contextual Recall Metric for evaluating RAG pipelines.
- [Bias Metric Evaluation](https://deepeval.com/docs/metrics-bias): Evaluate LLM outputs for gender, racial, and political bias.
- [Toxicity Metric Overview](https://deepeval.com/docs/metrics-toxicity): Evaluate toxicity in LLM outputs using referenceless metrics.
- [LLM Hallucination Metric](https://deepeval.com/docs/metrics-hallucination): Evaluate LLM hallucination using context comparison metrics.
- [LLM Summarization Metrics](https://deepeval.com/docs/metrics-summarization): Learn how to evaluate LLM summarization metrics effectively.
- [Task Completion Metrics](https://deepeval.com/docs/metrics-task-completion): Evaluate task completion using LLM metrics and arguments.
- [Tool Correctness Metric](https://deepeval.com/docs/metrics-tool-correctness): Assess an LLM agent's tool calling accuracy with metrics.
- [JSON Correctness Metric](https://deepeval.com/docs/metrics-json-correctness): Learn how to measure JSON correctness in LLM applications.
- [Prompt Alignment Metric](https://deepeval.com/docs/metrics-prompt-alignment): Evaluate LLM output alignment with prompt instructions effectively.
- [Image Coherence Metric](https://deepeval.com/docs/multimodal-metrics-image-coherence): Evaluate image coherence with accompanying text for MLLMs.
- [Knowledge Retention Metric](https://deepeval.com/docs/metrics-knowledge-retention): Learn how to measure knowledge retention in LLM chatbots.
- [Conversation Completeness Metric](https://deepeval.com/docs/metrics-conversation-completeness): Evaluate conversation completeness for LLM chatbots effectively.
- [Conversation Relevancy Metric](https://deepeval.com/docs/metrics-turn-relevancy): Evaluate conversation relevancy for LLM chatbot conversations.
- [RAGAS Metrics Overview](https://deepeval.com/docs/metrics-ragas): Evaluate RAG pipelines using RAGAS metrics.
- [DeepEval Update Warnings](https://deepeval.com/docs/miscellaneous): Opt in to update warnings in the DeepEval documentation.
- [Gemini Model Integration](https://deepeval.com/integrations/models/gemini): Integrate Gemini models with DeepEval using the CLI or Python.
- [Anthropic Model Integration](https://deepeval.com/integrations/models/anthropic): Integrate Anthropic models for evaluation metrics easily.
- [LM Studio Integration](https://deepeval.com/integrations/models/lmstudio): Evaluate local LLMs with the LM Studio integration guide.
- [OpenAI Integration Guide](https://deepeval.com/integrations/models/openai): Set up an OpenAI API key and explore available models.
- [Azure OpenAI Integration](https://deepeval.com/integrations/models/azure-openai): Integrate Azure OpenAI models with DeepEval for metrics.
- [vLLM Inference Integration](https://deepeval.com/integrations/models/vllm): High-performance inference engine for LLMs with OpenAI support.
- [GSM8K Benchmark Overview](https://deepeval.com/docs/benchmarks-gsm8k): GSM8K benchmark for evaluating multi-step math reasoning.
- [Custom LLM Metrics Guide](https://deepeval.com/docs/metrics-custom): Learn to create custom LLM evaluation metrics easily.
- [DROP Benchmark Overview](https://deepeval.com/docs/benchmarks-drop): Evaluate language models with complex reasoning tasks using DROP.
- [Data Privacy Assurance](https://deepeval.com/docs/data-privacy): DeepEval ensures data privacy and security for users.
- [Bias Benchmark Evaluation](https://deepeval.com/docs/benchmarks-bbq): Evaluate LLMs for bias across various social categories.
- [MMLU Benchmark Overview](https://deepeval.com/docs/benchmarks-mmlu): Evaluate LLMs using the MMLU benchmark across various subjects.
- [LLM Evaluation Tutorial](https://deepeval.com/tutorials/tutorial-introduction): Comprehensive guide to evaluating and improving LLM applications.
- [HellaSwag Benchmark](https://deepeval.com/docs/benchmarks-hellaswag): Evaluate language models' commonsense reasoning with the HellaSwag benchmark.
- [DeepEval Setup Guide](https://deepeval.com/tutorials/tutorial-setup): Guide to install DeepEval and set up Confident AI.
- [DeepEval vs TruLens](https://deepeval.com/blog/deepeval-vs-trulens): DeepEval outperforms TruLens in LLM evaluation features.
- [Chatbot Role Adherence](https://deepeval.com/docs/metrics-role-adherence): Learn how to measure chatbot role adherence effectively.
- [DeepEval vs Arize Comparison](https://deepeval.com/blog/deepeval-vs-arize): DeepEval excels in LLM evaluation, surpassing Arize's observability.
- [Metrics Selection Guide](https://deepeval.com/tutorials/tutorial-metrics-selection): Learn to select and define evaluation metrics for LLMs.
- [DeepEval vs Ragas Comparison](https://deepeval.com/blog/deepeval-vs-ragas): DeepEval offers a comprehensive evaluation ecosystem for LLMs.
- [DeepEval vs Langfuse](https://deepeval.com/blog/deepeval-vs-langfuse): DeepEval offers advanced evaluation features compared to Langfuse.
- [Synthetic Dataset Generation](https://deepeval.com/tutorials/tutorial-dataset-synthesis): Learn to generate synthetic datasets for medical chatbots.
- [DeepEval Alternatives Overview](https://deepeval.com/blog/deepeval-alternatives-compared): Explore various alternatives to DeepEval for LLM evaluation.
- [RAG QA Agent Setup](https://deepeval.com/tutorials/qa-agent-introduction): Learn to set up a RAG QA Agent evaluation pipeline quickly.
- [Legal Document Summarization](https://deepeval.com/tutorials/doc-summarization-introduction): Learn to evaluate legal document summarizers effectively and accurately.
- [RAG Triad Evaluation Guide](https://deepeval.com/guides/guides-rag-triad): Learn about the RAG triad for evaluating LLMs effectively.
- [QA Agent Evaluations](https://deepeval.com/tutorials/qa-agent-running-evaluations): Learn to run evaluations on a QA Agent effectively.
- [Medical Chatbot Tutorial](https://deepeval.com/tutorials/tutorial-llm-application-example): Learn to build a medical chatbot for diagnosis and appointments.
- [Toxicity Vulnerability Evaluation](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-toxicity): Evaluate an LLM's resistance to generating harmful or toxic content.
- [Testing LLM Robustness](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-robustness): Learn how to test LLM robustness against malicious inputs.
- [Generate Synthetic Goldens](https://deepeval.com/docs/synthesizer-generate-from-goldens): Generate synthetic Goldens from existing Goldens easily.
- [Improving QA Agent](https://deepeval.com/tutorials/qa-agent-improving-hyperparameters): Learn to enhance QA agent performance through hyperparameter tuning.
- [Red Teaming Bias](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-bias): Test LLMs for bias in responses across various categories.
- [Confident AI Documentation](https://www.confident-ai.com/docs/): Cloud platform for evaluating LLM applications with DeepEval.
- [BIG-Bench Hard Evaluation](https://deepeval.com/docs/benchmarks-big-bench-hard): Evaluate language models with challenging BIG-Bench Hard tasks.
- [LLM Tracing Guide](https://deepeval.com/docs/evaluation-llm-tracing): Learn to evaluate LLM interactions with tracing metrics.
- [Document Summarization Datasets](https://deepeval.com/tutorials/doc-summarization-annotating-datasets): Learn to create and maintain datasets for document summarization.
- [DeepEval LLM Comparisons](https://deepeval.com/blog/tags/comparisons): DeepEval provides comprehensive LLM evaluation comparisons and insights.
- [Winogrande Benchmark](https://deepeval.com/docs/benchmarks-winogrande): Winogrande dataset for commonsense reasoning evaluation and usage.
- [Synthetic Data Generation](https://deepeval.com/docs/synthesizer-introduction): DeepEval's Synthesizer generates high-quality synthetic evaluation data.
- [LLM Benchmarking Guide](https://deepeval.com/docs/benchmarks-introduction): Standardized benchmarks for evaluating LLM performance effectively.
- [Conversation Simulator](https://deepeval.com/docs/conversation-simulator): Generate conversational test cases for chatbot evaluation.
- [Evaluation Datasets Overview](https://deepeval.com/docs/evaluation-datasets): Explore evaluation datasets, goldens, and dataset creation methods.
- [Evaluation Flags and Configs](https://deepeval.com/docs/evaluation-flags-and-configs): Customize evaluation settings with flags and configurations.
- [Running Evaluations with DeepEval](https://deepeval.com/tutorials/tutorial-evaluations-running-an-evaluation): Learn how to run evaluations using DeepEval metrics.
- [Excessive Agency Vulnerability](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-excessive-agency): Learn to test LLMs against excessive agency vulnerabilities.
- [Red Teaming Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-competition): Test LLMs for competitive information disclosure and market influence.
- [Misinformation Vulnerabilities in LLMs](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-misinformation): Explore how LLMs handle misinformation vulnerabilities effectively.
- [Document Summarization Evaluation](https://deepeval.com/tutorials/doc-summarization-running-an-evaluation): Learn to evaluate document summarization using DeepEval metrics.
- [PII Leakage Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-pii-leakage): Evaluate PII leakage vulnerabilities in LLM systems effectively.
- [Graphic Content Vulnerability](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-graphic-content): Testing LLMs for graphic content vulnerability responses.
- [DeepEval Use Cases](https://deepeval.com/tutorials/use-cases): Explore various use cases for DeepEval's capabilities.
- [Synthetic Dataset Generation](https://deepeval.com/tutorials/qa-agent-generating-a-synthetic-dataset): Learn to generate diverse synthetic datasets for QA agents.
- [Component-Level LLM Evaluation](https://deepeval.com/docs/evaluation-component-level-llm-evals): Evaluate individual LLM components with tailored metrics and tests.
- [Prompt Leakage Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-prompt-leakage): Learn about prompt leakage vulnerabilities in LLMs and testing.
- [Personal Safety Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-personal-safety): Learn to test LLMs for personal safety vulnerabilities.
- [Intellectual Property Testing](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-intellectual-property): Learn how to test LLMs for intellectual property vulnerabilities.
- [Unauthorized Access Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-unauthorized-access): Explore unauthorized access vulnerabilities in LLMs and testing methods.
- [Qdrant Vector Database](https://deepeval.com/integrations/vector-databases/qdrant): Explore Qdrant for efficient vector database retrieval and evaluation.
- [Red Teaming Overview](https://www.trydeepteam.com/docs/red-teaming-introduction): DeepTeam simplifies red teaming for LLM applications, ensuring safety.
- [HumanEval Benchmark](https://deepeval.com/docs/benchmarks-human-eval): Evaluate LLM code generation with HumanEval benchmark tasks.
- [TruthfulQA Benchmark](https://deepeval.com/docs/benchmarks-truthful-qa): Evaluate language models' truthfulness across various topics.
- [Cognee](https://deepeval.com/integrations/vector-databases/cognee): Cognee framework enhances LLM applications with semantic graph retrieval.
- [Optimize LLM Hyperparameters](https://deepeval.com/guides/guides-optimizing-hyperparameters): Guide to optimize hyperparameters for LLM applications effectively.
- [DeepEval Test Cases](https://deepeval.com/docs/evaluation-test-cases): DeepEval provides test cases for evaluating LLM outputs effectively.
- [RAG Evaluation Guide](https://deepeval.com/guides/guides-rag-evaluation): Learn how to evaluate RAG pipelines effectively.
- [DeepEval Synthesizer Guide](https://deepeval.com/guides/guides-using-synthesizer): Quickly generate high-quality synthetic goldens with DeepEval.
- [LLM Observability Guide](https://deepeval.com/guides/guides-llm-observability): Explore LLM observability for monitoring and improving AI models.
- [Red Teaming Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities): Explore vulnerabilities in LLMs for effective red teaming.
- [End-to-End LLM Evaluation](https://deepeval.com/docs/evaluation-end-to-end-llm-evals): Comprehensive guide for end-to-end evaluation of LLM applications.
- [Multimodal Tool Correctness](https://deepeval.com/docs/multimodal-metrics-tool-correctness): Assess multimodal LLM tool calling accuracy and correctness.
- [Generate Goldens](https://deepeval.com/docs/synthesizer-generate-from-contexts): Generate synthetic Goldens from provided contexts easily.
- [Synthetic Goldens Generation](https://deepeval.com/docs/synthesizer-generate-from-scratch): Generate synthetic Goldens from scratch for LLM applications.