# DeepEval

> DeepEval is an open-source LLM evaluation framework designed to unit-test LLM-powered applications such as agents, chatbots, and RAG pipelines. DeepEval incorporates the latest research to evaluate LLM outputs with metrics such as G-Eval, hallucination, answer relevancy, and fluency, which use LLMs and various other NLP models that run locally on your machine. DeepEval also offers a free cloud platform, Confident AI, for teams to incorporate LLM observability, tracing, and organization-wide collaboration into their LLM evals.

- [DeepEval LLM Evaluation](https://deepeval.com/): Open-source framework for evaluating large language models effectively.
- [DeepEval Framework Quickstart](https://deepeval.com/docs/getting-started): DeepEval is an open-source framework for evaluating LLM applications.
- [DeepEval LLM Evaluation](https://deepeval.com/docs/evaluation-introduction): Learn how to evaluate LLM applications using DeepEval.
- [DeepEval Metrics Overview](https://deepeval.com/docs/metrics-introduction): DeepEval provides 40+ metrics for evaluating LLM performance effectively.
- [G-Eval Framework](https://deepeval.com/docs/metrics-llm-evals): G-Eval framework for evaluating LLM outputs with custom metrics.
- [DAG Metric Overview](https://deepeval.com/docs/metrics-dag): Explore the versatile DAG metric for LLM evaluations.
- [Top G-Eval Use Cases](https://deepeval.com/blog/top-5-geval-use-cases): Explore top G-Eval use cases for custom LLM metrics.
- [Answer Relevancy Metrics](https://deepeval.com/docs/metrics-answer-relevancy): Evaluate answer relevancy using LLM metrics for RAG.
- [Faithfulness Metric Overview](https://deepeval.com/docs/metrics-faithfulness): Evaluate RAG pipeline quality using faithfulness metrics.
- [Contextual Relevancy Metric](https://deepeval.com/docs/metrics-contextual-relevancy): Explore the Contextual Relevancy Metric for evaluating RAG pipelines.
- [Contextual Precision Metric](https://deepeval.com/docs/metrics-contextual-precision): Evaluate a RAG pipeline's retriever using the contextual precision metric.
- [Contextual Recall Metric](https://deepeval.com/docs/metrics-contextual-recall): Explore the Contextual Recall Metric for evaluating RAG pipelines.
- [Bias Metric Evaluation](https://deepeval.com/docs/metrics-bias): Evaluate LLM outputs for gender, racial, and political bias.
- [Toxicity Metric Overview](https://deepeval.com/docs/metrics-toxicity): Evaluate toxicity in LLM outputs using referenceless metrics.
- [LLM Hallucination Metric](https://deepeval.com/docs/metrics-hallucination): Evaluate LLM hallucination using context comparison metrics.
- [LLM Summarization Metrics](https://deepeval.com/docs/metrics-summarization): Learn how to evaluate LLM summarization metrics effectively.
- [Task Completion Metrics](https://deepeval.com/docs/metrics-task-completion): Evaluate task completion using LLM metrics and arguments.
- [Tool Correctness Metric](https://deepeval.com/docs/metrics-tool-correctness): Assess an LLM agent's tool calling accuracy with metrics.
- [JSON Correctness Metric](https://deepeval.com/docs/metrics-json-correctness): Learn how to measure JSON correctness in LLM applications.
- [Prompt Alignment Metric](https://deepeval.com/docs/metrics-prompt-alignment): Evaluate LLM output alignment with prompt instructions effectively.
- [Image Coherence Metric](https://deepeval.com/docs/multimodal-metrics-image-coherence): Evaluate image coherence with accompanying text for MLLMs.
- [Knowledge Retention Metric](https://deepeval.com/docs/metrics-knowledge-retention): Learn how to measure knowledge retention in LLM chatbots.
- [Conversation Completeness Metric](https://deepeval.com/docs/metrics-conversation-completeness): Evaluate conversation completeness for LLM chatbots effectively.
- [Conversation Relevancy Metric](https://deepeval.com/docs/metrics-turn-relevancy): Evaluate conversation relevancy for LLM chatbot conversations.
- [RAGAS Metrics Overview](https://deepeval.com/docs/metrics-ragas): Evaluate RAG pipelines using RAGAS metrics.
- [DeepEval Update Warnings](https://deepeval.com/docs/miscellaneous): Opt in to update warnings in the DeepEval documentation.
- [Gemini Model Integration](https://deepeval.com/integrations/models/gemini): Integrate Gemini models with DeepEval using the CLI or Python.
- [Anthropic Model Integration](https://deepeval.com/integrations/models/anthropic): Integrate Anthropic models for evaluation metrics easily.
- [LM Studio Integration](https://deepeval.com/integrations/models/lmstudio): Evaluate local LLMs with the LM Studio integration guide.
- [OpenAI Integration Guide](https://deepeval.com/integrations/models/openai): Set up an OpenAI API key and explore available models.
- [Azure OpenAI Integration](https://deepeval.com/integrations/models/azure-openai): Integrate Azure OpenAI models with DeepEval for metrics.
- [vLLM Inference Integration](https://deepeval.com/integrations/models/vllm): High-performance inference engine for LLMs with OpenAI support.
- [GSM8K Benchmark Overview](https://deepeval.com/docs/benchmarks-gsm8k): GSM8K benchmark for evaluating multi-step math reasoning.
- [Custom LLM Metrics Guide](https://deepeval.com/docs/metrics-custom): Learn to create custom LLM evaluation metrics easily.
- [DROP Benchmark Overview](https://deepeval.com/docs/benchmarks-drop): Evaluate language models with complex reasoning tasks using DROP.
- [Data Privacy Assurance](https://deepeval.com/docs/data-privacy): DeepEval ensures data privacy and security for users.
- [Bias Benchmark Evaluation](https://deepeval.com/docs/benchmarks-bbq): Evaluate LLMs for bias across various social categories.
- [MMLU Benchmark Overview](https://deepeval.com/docs/benchmarks-mmlu): Evaluate LLMs using the MMLU benchmark across various subjects.
- [LLM Evaluation Tutorial](https://deepeval.com/tutorials/tutorial-introduction): Comprehensive guide to evaluating and improving LLM applications.
- [HellaSwag Benchmark](https://deepeval.com/docs/benchmarks-hellaswag): Evaluate language models' commonsense reasoning with the HellaSwag benchmark.
- [DeepEval Setup Guide](https://deepeval.com/tutorials/tutorial-setup): Guide to install DeepEval and set up Confident AI.
- [DeepEval vs TruLens](https://deepeval.com/blog/deepeval-vs-trulens): DeepEval outperforms TruLens in LLM evaluation features.
- [Chatbot Role Adherence](https://deepeval.com/docs/metrics-role-adherence): Learn how to measure chatbot role adherence effectively.
- [DeepEval vs Arize Comparison](https://deepeval.com/blog/deepeval-vs-arize): DeepEval excels in LLM evaluation, surpassing Arize's observability.
- [Metrics Selection Guide](https://deepeval.com/tutorials/tutorial-metrics-selection): Learn to select and define evaluation metrics for LLMs.
- [DeepEval vs Ragas Comparison](https://deepeval.com/blog/deepeval-vs-ragas): DeepEval offers a comprehensive evaluation ecosystem for LLMs.
- [DeepEval vs Langfuse](https://deepeval.com/blog/deepeval-vs-langfuse): DeepEval offers advanced evaluation features compared to Langfuse.
- [Synthetic Dataset Generation](https://deepeval.com/tutorials/tutorial-dataset-synthesis): Learn to generate synthetic datasets for medical chatbots.
- [DeepEval Alternatives Overview](https://deepeval.com/blog/deepeval-alternatives-compared): Explore various alternatives to DeepEval for LLM evaluation.
- [RAG QA Agent Setup](https://deepeval.com/tutorials/qa-agent-introduction): Learn to set up a RAG QA Agent evaluation pipeline quickly.
- [Legal Document Summarization](https://deepeval.com/tutorials/doc-summarization-introduction): Learn to evaluate legal document summarizers effectively and accurately.
- [RAG Triad Evaluation Guide](https://deepeval.com/guides/guides-rag-triad): Learn about the RAG triad for evaluating LLMs effectively.
- [QA Agent Evaluations](https://deepeval.com/tutorials/qa-agent-running-evaluations): Learn to run evaluations on a QA Agent effectively.
- [Medical Chatbot Tutorial](https://deepeval.com/tutorials/tutorial-llm-application-example): Learn to build a medical chatbot for diagnosis and appointments.
- [Toxicity Vulnerability Evaluation](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-toxicity): Evaluate an LLM's resistance to generating harmful or toxic content.
- [Testing LLM Robustness](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-robustness): Learn how to test LLM robustness against malicious inputs.
- [Generate Synthetic Goldens](https://deepeval.com/docs/synthesizer-generate-from-goldens): Generate synthetic Goldens from existing Goldens easily.
- [Improving QA Agent](https://deepeval.com/tutorials/qa-agent-improving-hyperparameters): Learn to enhance QA agent performance through hyperparameter tuning.
- [Red Teaming Bias](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-bias): Test LLMs for bias in responses across various categories.
- [Confident AI Documentation](https://www.confident-ai.com/docs/): Cloud platform for evaluating LLM applications with DeepEval.
- [BIG-Bench Hard Evaluation](https://deepeval.com/docs/benchmarks-big-bench-hard): Evaluate language models with challenging BIG-Bench Hard tasks.
- [LLM Tracing Guide](https://deepeval.com/docs/evaluation-llm-tracing): Learn to evaluate LLM interactions with tracing metrics.
- [Document Summarization Datasets](https://deepeval.com/tutorials/doc-summarization-annotating-datasets): Learn to create and maintain datasets for document summarization.
- [DeepEval LLM Comparisons](https://deepeval.com/blog/tags/comparisons): DeepEval provides comprehensive LLM evaluation comparisons and insights.
- [Winogrande Benchmark](https://deepeval.com/docs/benchmarks-winogrande): Winogrande dataset for commonsense reasoning evaluation and usage.
- [Synthetic Data Generation](https://deepeval.com/docs/synthesizer-introduction): DeepEval's Synthesizer generates high-quality synthetic evaluation data.
- [LLM Benchmarking Guide](https://deepeval.com/docs/benchmarks-introduction): Standardized benchmarks for evaluating LLM performance effectively.
- [Conversation Simulator](https://deepeval.com/docs/conversation-simulator): Generate conversational test cases for chatbot evaluation.
- [Evaluation Datasets Overview](https://deepeval.com/docs/evaluation-datasets): Explore evaluation datasets, goldens, and dataset creation methods.
- [Evaluation Flags and Configs](https://deepeval.com/docs/evaluation-flags-and-configs): Customize evaluation settings with flags and configurations.
- [Running Evaluations with DeepEval](https://deepeval.com/tutorials/tutorial-evaluations-running-an-evaluation): Learn how to run evaluations using DeepEval metrics.
- [Excessive Agency Vulnerability](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-excessive-agency): Learn to test LLMs against excessive agency vulnerabilities.
- [Red Teaming Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-competition): Test LLMs for competitive information disclosure and market influence.
- [Misinformation Vulnerabilities in LLMs](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-misinformation): Explore how LLMs handle misinformation vulnerabilities effectively.
- [Document Summarization Evaluation](https://deepeval.com/tutorials/doc-summarization-running-an-evaluation): Learn to evaluate document summarization using DeepEval metrics.
- [PII Leakage Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-pii-leakage): Evaluate PII leakage vulnerabilities in LLM systems effectively.
- [Graphic Content Vulnerability](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-graphic-content): Testing LLMs for graphic content vulnerability responses.
- [DeepEval Use Cases](https://deepeval.com/tutorials/use-cases): Explore various use cases for DeepEval's capabilities.
- [Synthetic Dataset Generation](https://deepeval.com/tutorials/qa-agent-generating-a-synthetic-dataset): Learn to generate diverse synthetic datasets for QA agents.
- [Component-Level LLM Evaluation](https://deepeval.com/docs/evaluation-component-level-llm-evals): Evaluate individual LLM components with tailored metrics and tests.
- [Prompt Leakage Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-prompt-leakage): Learn about prompt leakage vulnerabilities in LLMs and testing.
- [Personal Safety Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-personal-safety): Learn to test LLMs for personal safety vulnerabilities.
- [Intellectual Property Testing](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-intellectual-property): Learn how to test LLMs for intellectual property vulnerabilities.
- [Unauthorized Access Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-unauthorized-access): Explore unauthorized access vulnerabilities in LLMs and testing methods.
- [Qdrant Vector Database](https://deepeval.com/integrations/vector-databases/qdrant): Explore Qdrant for efficient vector database retrieval and evaluation.
- [Red Teaming Overview](https://www.trydeepteam.com/docs/red-teaming-introduction): DeepTeam simplifies red teaming for LLM applications, ensuring safety.
- [HumanEval Benchmark](https://deepeval.com/docs/benchmarks-human-eval): Evaluate LLM code generation with HumanEval benchmark tasks.
- [TruthfulQA Benchmark](https://deepeval.com/docs/benchmarks-truthful-qa): Evaluate language models' truthfulness across various topics.
- [Cognee](https://deepeval.com/integrations/vector-databases/cognee): Cognee framework enhances LLM applications with semantic graph retrieval.
- [Optimize LLM Hyperparameters](https://deepeval.com/guides/guides-optimizing-hyperparameters): Guide to optimize hyperparameters for LLM applications effectively.
- [DeepEval Test Cases](https://deepeval.com/docs/evaluation-test-cases): DeepEval provides test cases for evaluating LLM outputs effectively.
- [RAG Evaluation Guide](https://deepeval.com/guides/guides-rag-evaluation): Learn how to evaluate RAG pipelines effectively.
- [DeepEval Synthesizer Guide](https://deepeval.com/guides/guides-using-synthesizer): Quickly generate high-quality synthetic goldens with DeepEval.
- [LLM Observability Guide](https://deepeval.com/guides/guides-llm-observability): Explore LLM observability for monitoring and improving AI models.
- [Red Teaming Vulnerabilities](https://www.trydeepteam.com/docs/red-teaming-vulnerabilities): Explore vulnerabilities in LLMs for effective red teaming.
- [End-to-End LLM Evaluation](https://deepeval.com/docs/evaluation-end-to-end-llm-evals): Comprehensive guide for end-to-end evaluation of LLM applications.
- [Multimodal Tool Correctness](https://deepeval.com/docs/multimodal-metrics-tool-correctness): Assess multimodal LLM tool calling accuracy and correctness.
- [Generate Goldens](https://deepeval.com/docs/synthesizer-generate-from-contexts): Generate synthetic Goldens from provided contexts easily.
- [Synthetic Goldens Generation](https://deepeval.com/docs/synthesizer-generate-from-scratch): Generate synthetic Goldens from scratch for LLM applications.