Deployment

In this section, we'll set up CI/CD workflows for your summarization agent, and learn how to add metrics and create spans with test cases in your application for better tracing experience.

Setup Tracing

deepeval offers an @observe decorator for you to apply metrics at any point in your LLM app to evaluate any LLM interaction, this provides full visibility for debugging internal components of your LLM application. We have added these decorators during development of our agent, we will now add metrics and spans for running online evals. Learn more about tracing here.

Here's how we can add metrics and create spans for our @observe decorators in the MeetingSummarizer class:

import os
import json

from openai import OpenAI
from dotenv import load_dotenv
from deepeval.metrics import GEval
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, SingleTurnParams

load_dotenv()

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        summary_system_prompt: str = "",
        action_item_system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.summary_system_prompt = summary_system_prompt or (
            "..." # Use the summary_system_prompt mentioned above
        )
        self.action_item_system_prompt = action_item_system_prompt or (
            "..." # Use the action_item_system_prompt mentioned above
        )

    @observe(type="agent")
    def summarize(
        self,
        transcript: str,
        summary_model: str = "gpt-4o",
        action_item_model: str = "gpt-4-turbo"
    ) -> tuple[str, dict]:
        summary = self.get_summary(transcript, summary_model)
        action_items = self.get_action_items(transcript, action_item_model)

        return summary, action_items

    @observe(metrics=[GEval(...)], name="Summary") # Use the summary_concision metric here
    def get_summary(self, transcript: str, model: str = None) -> str:
        try:
            response = self.client.chat.completions.create(
                model=model or self.model,
                messages=[
                    {"role": "system", "content": self.summary_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            summary = response.choices[0].message.content.strip()
            update_current_span(
                input=transcript, output=summary
            )
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return f"Error: Could not generate summary due to API issue: {e}"

    @observe(metrics=[GEval(...)], name="Action Items") # Use the action_item_check metric here
    def get_action_items(self, transcript: str, model: str = None) -> dict:
        try:
            response = self.client.chat.completions.create(
                model=model or self.model,
                messages=[
                    {"role": "system", "content": self.action_item_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            action_items = response.choices[0].message.content.strip()
            try:
                action_items = json.loads(action_items)
                update_current_span(
                    input=transcript, actual_output=str(action_items)
                )
                return action_items
            except json.JSONDecodeError:
                return {"error": "Invalid JSON returned from model", "raw_output": action_items}
        except Exception as e:
            print(f"Error generating action items: {e}")
            return {"error": f"API call failed: {e}", "raw_output": ""}

Why Continuous Evaluation

Most summarization agents are built to summarize documents and transcripts, often to improve productivity. This means that the documents to be summarized are ever-growing, and your summarizer needs to be able to keep up with that. That's why continuous testing is essential — your summarizer must remain reliable, even as new types of documents are introduced.

DeepEval's datasets are very useful for continuous evaluations. You can populate datasets with goldens, which contain just the inputs. During evaluation, test cases are generated on-the-fly by calling your LLM application to produce outputs.

In the previous section, we created a deepeval dataset. You can now reuse this dataset to continuously evaluate your summarization agent.

Using Datasets

Here's how you can pull datasets and reuse them to generate test cases:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")

Integrating CI/CD

You can use pytest with assert_test during your CI/CD to trace and evaluate your summarization agent, here's how you can write the test file to do that:

test_meeting_summarizer_quality.py

import pytest

from deepeval.dataset import EvaluationDataset
from meeting_summarizer import MeetingSummarizer # import your summarizer here
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")

summarizer = MeetingSummarizer()

@pytest.mark.parametrize("golden", dataset.goldens)
def test_meeting_summarizer_components(golden):
    summarizer.summarize(golden.input)  # captures trace
    assert_test(golden=golden)  # evaluates spans

poetry run deepeval test run test_meeting_summarizer_quality.py

Finally, let's integrate this test into GitHub Actions to enable automated quality checks on every push.

name: Meeting Summarizer DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # Add your OPENAI_API_KEY
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }} # Add your CONFIDENT_API_KEY
        run: poetry run deepeval test run test_meeting_summarizer_quality.py

And that's it! You now have a robust, production-ready summarization agent with automated evaluation integrated into your development workflow.

Setup Tracing

Why Continuous Evaluation

Using Datasets

Integrating CI/CD

On this page