
What is the project about?

DeepEval is an open-source framework designed for evaluating and testing the performance of Large Language Model (LLM) systems. It's analogous to Pytest, but specifically tailored for unit testing the outputs of LLMs.
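To make the Pytest analogy concrete, here is a minimal sketch of a DeepEval unit test in the pytest style, modeled on the project's documented quickstart; the metric, test-case fields, and `deepeval test run` command come from its public API, though exact names and defaults may differ between versions.

```python
# test_chatbot.py -- run with: deepeval test run test_chatbot.py
# (the default judge is an OpenAI model, so OPENAI_API_KEY must be set)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_policy_answer():
    # Judge how relevant the chatbot's answer is to the user's question.
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # In a real test this would be the output of your LLM application.
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost."
        ],
    )
    # Fails the test if the metric score falls below the threshold.
    assert_test(test_case, [metric])
```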

What problem does it solve?

DeepEval addresses the need for robust, reliable evaluation of LLM applications. It helps developers ensure the quality, accuracy, and safety of LLM-powered systems, whether they are built with Retrieval-Augmented Generation (RAG), fine-tuning, or other techniques. It surfaces issues such as hallucination, bias, and toxicity, provides tools for optimizing performance, and helps prevent prompt drift (unintended degradation in output quality as prompt templates change).

What are the features of the project?

  • Wide Range of Metrics: Offers a comprehensive suite of pre-built evaluation metrics, including:
    • General Metrics: G-Eval, Hallucination, Summarization, Bias, Toxicity.
    • RAG Metrics: Answer Relevancy, Faithfulness, Contextual Recall, Contextual Precision, Contextual Relevancy, RAGAS.
    • Agentic Metrics: Task Completion, Tool Correctness.
    • Conversational Metrics: Knowledge Retention, Conversation Completeness, Conversation Relevancy, Role Adherence.
  • Custom Metric Support: Allows users to create and integrate their own custom evaluation metrics.
  • Synthetic Dataset Generation: Provides tools to generate synthetic datasets for evaluation purposes.
  • CI/CD Integration: Seamlessly integrates with any CI/CD environment for automated testing.
  • Red Teaming: Enables red teaming of LLM applications to identify vulnerabilities (toxicity, bias, SQL injection, etc.) using various attack strategies.
  • LLM Benchmarking: Facilitates benchmarking of LLMs against popular benchmarks (MMLU, HellaSwag, DROP, etc.) with minimal code.
  • Confident AI Integration: Fully integrates with the Confident AI platform for a complete evaluation lifecycle, including dataset curation, benchmarking, metric fine-tuning, debugging, and monitoring.
  • Standalone Metrics: Metrics can be used independently of the testing framework.
  • Bulk Evaluation: Supports evaluating datasets (collections of test cases) in bulk; a combined sketch of custom, standalone, and bulk metric usage follows this list.
  • Pytest and Non-Pytest Modes: Can be used with or without Pytest integration.
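The custom, standalone, and bulk styles referenced above can be sketched roughly as follows; this is based on DeepEval's documented GEval, EvaluationDataset, and evaluate APIs, and exact signatures may vary by version.

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom LLM-as-a-judge metric defined via G-Eval.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct "
             "based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Who wrote 'Pride and Prejudice'?",
    actual_output="'Pride and Prejudice' was written by Jane Austen.",
    expected_output="Jane Austen",
)

# Standalone usage: score a single test case without any test runner.
correctness.measure(test_case)
print(correctness.score, correctness.reason)

# Bulk usage: evaluate a whole dataset of test cases at once.
dataset = EvaluationDataset(test_cases=[test_case])
evaluate(test_cases=dataset.test_cases, metrics=[correctness])
```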

What are the technologies used in the project?

  • Python: The core language of the framework.
  • Pytest (optional): Used for test case management and execution.
  • LLMs (various): Uses LLMs as judges for many metrics, alongside statistical methods and NLP models. The OpenAI API is the default judge, but custom, user-specified models are supported (see the sketch after this list).
  • NLP Models: Uses various NLP models for specific evaluation tasks.
  • Integrations:
    • LlamaIndex: For testing RAG applications.
    • Hugging Face: For real-time evaluations during LLM fine-tuning.
  • Confident AI Platform: Companion cloud platform for dataset curation, benchmarking, metric fine-tuning, debugging, and monitoring.
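Because custom judge models are mentioned above, here is a hedged sketch of plugging one in by subclassing DeepEvalBaseLLM, following the pattern shown in DeepEval's documentation; the EchoModel stub stands in for a real model client and is not part of the library, and the base-class interface may differ across versions.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import DeepEvalBaseLLM


class EchoModel:
    """Stand-in for your own model client (an assumption, not part of DeepEval)."""

    def invoke(self, prompt: str) -> str:
        return "stub response"


class MyCustomJudge(DeepEvalBaseLLM):
    """Wraps an arbitrary model so DeepEval can use it as the evaluation LLM."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-custom-judge"


# Pass the wrapper wherever a metric accepts a `model` argument,
# replacing the default OpenAI judge.
metric = AnswerRelevancyMetric(threshold=0.7, model=MyCustomJudge(EchoModel()))
```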

What are the benefits of the project?

  • Improved LLM Quality: Helps developers build more reliable, accurate, and safe LLM applications.
  • Simplified Evaluation: Provides a user-friendly framework for complex evaluation tasks.
  • Automated Testing: Enables automated testing and integration with CI/CD pipelines.
  • Comprehensive Metrics: Offers a wide range of metrics to cover various aspects of LLM performance.
  • Customizability: Allows users to tailor the evaluation process to their specific needs.
  • Full Evaluation Lifecycle Support: Through Confident AI integration, supports the entire evaluation process from dataset creation to monitoring.
  • Open Source: Freely available and open for contributions.

What are the use cases of the project?

  • Unit Testing LLM Outputs: Testing the quality and correctness of LLM-generated text.
  • RAG System Evaluation: Assessing the performance of Retrieval-Augmented Generation pipelines.
  • Fine-tuning Evaluation: Evaluating the effectiveness of LLM fine-tuning.
  • Prompt Engineering: Optimizing prompts to improve LLM performance.
  • Model Comparison: Benchmarking different LLMs against each other.
  • Safety and Bias Detection: Identifying and mitigating potential risks related to bias, toxicity, and other harmful outputs.
  • Continuous Monitoring: Tracking LLM performance over time and identifying regressions.
  • Customer Support Chatbot Evaluation: Evaluating chatbot responses, as shown in the project's quickstart.
  • Any LLM Application: Generally applicable to any application that utilizes LLMs.