
COPRO

COPRO (Co-operative Prompt Optimizer) is a prompt optimization algorithm within deepeval adapted from the DSPy optimizer of the same name. It uses coordinate ascent to iteratively improve a prompt: at each depth step it evaluates a batch of candidates, commits the best performer as the new baseline, and uses the scored history plus metric feedback to generate an increasingly targeted next batch.

The core insight is that prompt optimization is most efficient when each new generation of candidates is informed by what failed before and why it failed. Rather than generating variations blindly, COPRO feeds the optimizer LLM a full diagnostic history (every past prompt attempt, its score, and the specific metric feedback explaining where points were lost) so that each subsequent batch of candidates directly addresses known weaknesses.

Optimize Prompts With COPRO

To optimize a prompt using COPRO, provide a COPRO algorithm instance to the optimize() method:

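from deepeval.dataset import Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.prompt import Prompt
from deepeval.optimizer import PromptOptimizer
from deepeval.optimizer.algorithms import COPRO

# The prompt to optimize; {input} is filled in for each golden.
prompt = Prompt(text_template="You are a helpful assistant - now answer this. {input}")

# Illustrative goldens; replace with your own dataset.
goldens = [Golden(input="What is the capital of France?", expected_output="Paris")]

def model_callback(prompt: Prompt, golden) -> str:
    # Fill the template with the golden's input, then call your model.
    prompt_to_llm = prompt.interpolate(input=golden.input)
    return your_llm(prompt_to_llm)  # your_llm is a placeholder for your own LLM call

optimizer = PromptOptimizer(
    algorithm=COPRO(),
    model_callback=model_callback
)

optimized_prompt = optimizer.optimize(prompt=prompt, goldens=goldens, metrics=[AnswerRelevancyMetric()])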

Done ✅. You just used COPRO to run a prompt optimization.

Customize COPRO

You can customize COPRO's behavior by passing parameters directly to the COPRO constructor:

from deepeval.optimizer.algorithms import COPRO

copro = COPRO(
    depth=4,
    breadth=7,
    minibatch_size=25,
    random_state=42,
)

There are FOUR optional parameters when creating a COPRO instance:

  • [Optional] depth: number of coordinate ascent steps to run. At each step, a new batch of candidates is evaluated and the best is committed as the baseline for the next step. Defaulted to 4.
  • [Optional] breadth: number of prompt candidates generated and evaluated at each depth step. A higher breadth explores more of the prompt space per step but costs more. Defaulted to 7.
  • [Optional] minibatch_size: number of goldens sampled per depth step for candidate evaluation. Larger batches give more reliable scores. Full-dataset validation is always run on the best candidate of each step. Defaulted to 25.
  • [Optional] random_state: reproducibility control. You can pass either an int seed or a random.Random instance. This affects minibatch sampling and candidate deduplication. Defaulted to a random value.
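To get a feel for how these parameters trade off against cost, the sketch below estimates how many times `model_callback` might be invoked in one run. It is a back-of-the-envelope calculation based only on the behaviour described on this page (each step scores its candidates on a minibatch, then validates one winner on the full dataset). `estimate_callback_calls` is a hypothetical helper, not part of deepeval, and the real count may differ because of caching, deduplication, or other internals.

```python
def estimate_callback_calls(depth: int, breadth: int, minibatch_size: int, num_goldens: int) -> int:
    """Rough count of model_callback calls for one COPRO run (illustrative only)."""
    # Assumption: a minibatch cannot be larger than the dataset itself.
    batch = min(minibatch_size, num_goldens)
    # Step 1 evaluates the `breadth` bootstrapped candidates plus the original prompt.
    first_step = (breadth + 1) * batch
    # Each later step evaluates `breadth` freshly proposed candidates.
    later_steps = (depth - 1) * breadth * batch
    # One full-dataset validation of the committed winner per step.
    validation = depth * num_goldens
    return first_step + later_steps + validation


# Default-like settings on a dataset of 100 goldens.
calls = estimate_callback_calls(depth=4, breadth=7, minibatch_size=25, num_goldens=100)
print(calls)
```

Under these assumptions, raising breadth or minibatch_size scales the minibatch term, while the validation term grows with depth and dataset size.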

How Does COPRO Work?

COPRO runs for depth steps. Each step evaluates a batch of breadth candidates, selects the best, validates it on the full dataset, then uses the scored history to propose the next batch. The high-level flow is:

  1. Bootstrap: Generate the initial breadth candidates from the original prompt using zero-shot variation
  2. Evaluate: Score all candidates on a stochastic minibatch and extract metric feedback per candidate
  3. Commit: Pick the best minibatch candidate and run full-dataset validation on it
  4. Propose: Feed the scored history back to the LLM to generate the next targeted batch
  5. Repeat: Steps 2–4 run for each of the depth steps
  6. Final Selection: Return the prompt with the highest average true validation score across all steps
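The flow above can be sketched as a small, self-contained loop. This is a toy illustration, not deepeval's implementation: `copro_sketch`, `propose`, and `score` are hypothetical names, the proposer and scorer are simple stand-ins for the optimizer LLM and the metrics, and metric feedback is omitted for brevity.

```python
import random
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, float]]  # (prompt, minibatch score)


def copro_sketch(
    original: str,
    goldens: List[str],
    propose: Callable[[str, History, int], List[str]],
    score: Callable[[str, List[str]], float],
    depth: int = 4,
    breadth: int = 7,
    minibatch_size: int = 25,
    seed: int = 42,
) -> str:
    rng = random.Random(seed)
    history: History = []
    validated: Dict[str, float] = {}  # prompt -> full-dataset score

    # 1. Bootstrap: zero-shot variations, with the original kept as candidate 0.
    candidates = [original] + propose(original, history, breadth)

    for step in range(depth):
        # 2. Evaluate every candidate on a random minibatch.
        batch = rng.sample(goldens, min(minibatch_size, len(goldens)))
        scored = [(c, score(c, batch)) for c in candidates]
        history.extend(scored)

        # 3. Commit: validate the minibatch winner on the full dataset.
        winner, _ = max(scored, key=lambda pair: pair[1])
        validated[winner] = score(winner, goldens)

        # 4. Propose the next batch from the scored history (skipped on the last step).
        if step < depth - 1:
            history = sorted(history, key=lambda pair: pair[1], reverse=True)[:breadth]
            candidates = propose(winner, history, breadth)

    # 6. Final selection: best full-dataset score across all committed winners.
    return max(validated, key=validated.get)


KEYWORDS = ["concise", "cite", "json"]


def toy_score(prompt: str, batch: List[str]) -> float:
    # Stand-in for metric evaluation: fraction of keywords the prompt mentions.
    return sum(k in prompt for k in KEYWORDS) / len(KEYWORDS)


def toy_propose(base: str, history: History, n: int) -> List[str]:
    # Stand-in for the optimizer LLM: append one keyword per candidate.
    return [f"{base} {KEYWORDS[i % len(KEYWORDS)]}" for i in range(n)]


best = copro_sketch("Answer the question.", ["q1", "q2", "q3"], toy_propose, toy_score, depth=3, breadth=3)
print(best)
```

With the toy scorer, each committed winner gains one keyword per step, which mirrors how each depth step builds on the previously committed baseline.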

Phase 1: Bootstrap

Before the coordinate ascent loop begins, COPRO generates an initial set of breadth candidate prompts from the original prompt using zero-shot variation. This is done by the COPROProposer in two passes:

Pass 1: Guideline Generation. The proposer asks the optimizer LLM to brainstorm breadth distinct "variation guidelines", which are high-level strategies for how to meaningfully alter the prompt. Examples:

| Guideline Example | Effect |
| --- | --- |
| "Reframe the prompt to require step-by-step reasoning before the final answer" | Generates an instruction that enforces chain-of-thought |
| "Condense instructions into a highly direct, concise format" | Produces a shorter, more aggressive instruction style |
| "Add strict output formatting constraints" | Makes the instruction prescriptive about output structure |
| "Explicitly call out common mistakes to avoid" | Generates a defensive, error-aware instruction |

Pass 2: Candidate Generation. For each guideline, the proposer makes a separate LLM call to produce the actual rewritten prompt. These calls run concurrently in the async path, making the bootstrap phase significantly faster than sequential generation.

The original prompt is always inserted as candidate 0 before evaluation begins. This guarantees a baseline that the optimizer can always fall back to, and ensures that the first depth step has a fair reference point.
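A minimal sketch of the two-pass bootstrap, assuming an async text-in, text-out LLM client. The `bootstrap` function, the `LLM` type alias, the request wording, and `fake_llm` are all hypothetical and stand in for the COPROProposer, whose actual prompts and interfaces are not shown on this page.

```python
import asyncio
from typing import Awaitable, Callable, List

LLM = Callable[[str], Awaitable[str]]  # hypothetical async client: request text in, reply text out


async def bootstrap(original: str, breadth: int, llm: LLM) -> List[str]:
    # Pass 1: a single call that returns up to `breadth` guidelines, one per line.
    raw = await llm(f"List {breadth} distinct strategies for rewriting this prompt:\n{original}")
    guidelines = [line.strip() for line in raw.splitlines() if line.strip()][:breadth]

    # Pass 2: one rewrite call per guideline, issued concurrently.
    rewrites = await asyncio.gather(
        *(llm(f"Rewrite the prompt following this guideline: {g}\n\nPrompt:\n{original}") for g in guidelines)
    )

    # The original prompt is always candidate 0.
    return [original, *rewrites]


async def fake_llm(request: str) -> str:
    # Stand-in for a real model so that the sketch runs offline.
    if request.startswith("List"):
        return "Require step-by-step reasoning\nBe concise\nAdd output format constraints"
    guideline = request.splitlines()[0].split(": ", 1)[1]
    return f"[rewritten: {guideline}]"


candidates = asyncio.run(bootstrap("You are a helpful assistant. {input}", breadth=3, llm=fake_llm))
print(candidates)
```

Because `asyncio.gather` preserves argument order, each rewrite stays aligned with the guideline that produced it.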

Phase 2: Coordinate Ascent Loop

The loop runs for depth steps. Each step has three sub-stages: evaluate, commit, and propose.

Step 2a: Evaluate

At the start of each depth step, COPRO draws a random minibatch from your goldens and evaluates every candidate in the current batch against it. For each candidate, two things are captured:

  1. Score: the average metric score across all goldens in the minibatch
  2. Metric feedback: a diagnostic string describing exactly why points were lost, built from per-metric reasons on the failing examples

The metric feedback is a key enhancement over simpler optimizers. Rather than just recording a score, COPRO captures explanations like:

[Input]: Translate "Good morning" to French
[Expected]: Bonjour
[Actual Model Output]: Good morning in French is "Bonjour." Have a nice day!
[Evaluation Reasons]:
- AnswerRelevancyMetric (Score: 0.4): Response contains unnecessary filler beyond the requested translation.

This feedback is carried forward into the proposal step so the next generation of candidates is explicitly targeted at the failure modes identified here.
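The sketch below shows one way a score and a feedback string in the format above could be assembled from per-example metric results. `MetricResult`, `Example`, `summarize`, and the 0.5 failure threshold are assumptions made for illustration; they are not deepeval classes or defaults.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class MetricResult:  # hypothetical container, not a deepeval class
    name: str
    score: float
    reason: str


@dataclass
class Example:  # hypothetical record of one evaluated golden
    input: str
    expected: str
    actual: str
    results: List[MetricResult]


def summarize(examples: List[Example], threshold: float = 0.5) -> Tuple[float, str]:
    """Return (average score, feedback text built from examples scoring below the threshold)."""
    per_example = [mean(r.score for r in ex.results) for ex in examples]
    blocks = []
    for ex, ex_score in zip(examples, per_example):
        if ex_score >= threshold:
            continue  # only failing examples contribute feedback
        reasons = "\n".join(f"- {r.name} (Score: {r.score}): {r.reason}" for r in ex.results)
        blocks.append(
            f"[Input]: {ex.input}\n[Expected]: {ex.expected}\n"
            f"[Actual Model Output]: {ex.actual}\n[Evaluation Reasons]:\n{reasons}"
        )
    return mean(per_example), "\n\n".join(blocks)


examples = [
    Example(
        input='Translate "Good morning" to French',
        expected="Bonjour",
        actual='Good morning in French is "Bonjour." Have a nice day!',
        results=[MetricResult("AnswerRelevancyMetric", 0.4,
                              "Response contains unnecessary filler beyond the requested translation.")],
    ),
    Example(
        input='Translate "Thank you" to French',
        expected="Merci",
        actual="Merci",
        results=[MetricResult("AnswerRelevancyMetric", 1.0, "Response is fully relevant.")],
    ),
]

avg_score, feedback = summarize(examples)
print(avg_score)
print(feedback)
```

Only the first example appears in the feedback, since the second one passed and carries no diagnostic value for the next proposal round.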

Step 2b: Commit

After scoring, candidates are ranked by minibatch score. The top-scoring candidate is selected, then evaluated on the full golden dataset using score_pareto. This full-dataset score is stored in the validation archive.

If the full-dataset average beats the current global_best_score, the candidate is accepted as the new best. All depth steps record full-dataset scores, so the final selection can compare every step's committed winner on equal footing.
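As a rough illustration of the commit logic, the function below picks the minibatch winner, validates it, records the result, and reports whether it beats the running best. `commit_step` and `full_eval` are hypothetical names; the page attributes full-dataset scoring to score_pareto, whose signature is not shown here, so a plain callable is used in its place.

```python
from typing import Callable, Dict, List, Tuple


def commit_step(
    scored: List[Tuple[str, float]],    # (prompt, minibatch score) for the current batch
    full_eval: Callable[[str], float],  # hypothetical: average score over all goldens
    archive: Dict[str, float],          # prompt -> full-dataset score
    global_best_score: float,
) -> Tuple[str, float, bool]:
    """Validate the minibatch winner and report whether it becomes the new best."""
    winner, _ = max(scored, key=lambda pair: pair[1])
    full_score = full_eval(winner)
    archive[winner] = full_score  # recorded whether or not it is accepted
    accepted = full_score > global_best_score
    return winner, max(full_score, global_best_score), accepted


archive: Dict[str, float] = {}
full_scores = {"prompt A": 0.73, "prompt B": 0.70}  # made-up full-dataset scores

winner, best_score, accepted = commit_step(
    scored=[("prompt A", 0.77), ("prompt B", 0.72)],
    full_eval=full_scores.__getitem__,
    archive=archive,
    global_best_score=0.76,
)
print(winner, best_score, accepted)
```

Here the winner looks strong on the minibatch but falls short of the running best on the full dataset, so it is archived without being accepted.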

Step 2c: Propose

Unless this is the final depth step, COPRO generates the next batch of breadth candidates. This uses the same two-pass proposer as bootstrap, but now passes the full history_log, a bounded, sorted record of the top breadth (prompt, score, metric_feedback) triples seen across all prior steps.

Example: What the history log looks like at depth step 3

| Attempt | Score | Metric Feedback Summary |
| --- | --- | --- |
| P₃ᵦ | 0.81 | Minor formatting issues on 1/25 examples |
| P₂ₐ | 0.74 | Consistently missed JSON schema on structured outputs |
| P₁ᵦ | 0.71 | Verbose responses triggered conciseness metric failures |
| P₂ᵦ | 0.68 | Lacked step-by-step reasoning on multi-hop questions |
| ... | ... | ... |

The proposer sees this ranked history and generates guidelines that explicitly fix the failure patterns (e.g., "previous attempts failed the JSON schema metric; add a strict output format constraint") while preserving the successful traits of the highest-scoring attempts. The resulting candidates at each subsequent depth step are therefore more targeted and diagnostic than the zero-shot bootstrap.
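A bounded, sorted history of this kind can be sketched in a few lines. `Attempt`, `update_history`, and `render_history` are hypothetical names, and the rendered text layout is an assumption; only the (prompt, score, metric_feedback) triple and the top-breadth bound come from the description above.

```python
from typing import List, NamedTuple


class Attempt(NamedTuple):
    prompt: str
    score: float
    metric_feedback: str


def update_history(history: List[Attempt], new: List[Attempt], breadth: int) -> List[Attempt]:
    """Merge new attempts and keep the `breadth` highest-scoring ones, best first."""
    merged = sorted(history + new, key=lambda a: a.score, reverse=True)
    return merged[:breadth]


def render_history(history: List[Attempt]) -> str:
    """Format the log as text that could be placed in the proposer's request."""
    return "\n\n".join(
        f"Attempt {i} (score {a.score:.2f})\nPrompt: {a.prompt}\nFeedback: {a.metric_feedback}"
        for i, a in enumerate(history, start=1)
    )


history = update_history(
    history=[
        Attempt("prompt 1b", 0.71, "Verbose responses triggered conciseness metric failures"),
        Attempt("prompt 2a", 0.74, "Consistently missed JSON schema on structured outputs"),
        Attempt("prompt 2b", 0.68, "Lacked step-by-step reasoning on multi-hop questions"),
    ],
    new=[Attempt("prompt 3b", 0.81, "Minor formatting issues on 1/25 examples")],
    breadth=3,
)
print(render_history(history))
```

With breadth set to 3, the lowest-scoring attempt (0.68) drops out of the log, which keeps the proposer's context bounded as the run progresses.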

Step 3: Final Selection

After all depth steps, COPRO performs a final sweep over the full validation archive. It picks the configuration with the highest average full-dataset score across all committed depth-step winners. This is the _extract_optimized_set step: it ensures that even if a later depth step produced a worse result than an earlier one (possible with minibatch noise), the globally best validated prompt is always returned.

Example: Coordinate ascent progression over 4 depth steps

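| Depth | Candidates Evaluated | Best Minibatch Score | Full Dataset Score | Accepted? |
| --- | --- | --- | --- | --- |
| 1 | 8 (7 + original) | 0.68 | 0.65 | ✅ (root) |
| 2 | 7 | 0.74 | 0.71 | ✅ |
| 3 | 7 | 0.79 | 0.76 | ✅ |
| 4 | 7 | 0.77 | 0.73 | ❌ |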

In this example, depth step 4 produces a candidate that looks promising on the minibatch (0.77) but underperforms on the full dataset (0.73) compared to depth step 3's committed baseline (0.76). The final sweep correctly selects the depth step 3 result as the optimized prompt.
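The final sweep in this example reduces to taking the maximum over the validation archive. The snippet below replays the full-dataset scores from the example; the `archive` dictionary keyed by depth is a simplification chosen for illustration, not the structure used by _extract_optimized_set.

```python
# Full-dataset score of each depth step's committed winner, from the example above.
archive = {1: 0.65, 2: 0.71, 3: 0.76, 4: 0.73}

# Final selection: the step whose winner has the highest full-dataset score.
best_depth = max(archive, key=archive.get)
best_score = archive[best_depth]
print(best_depth, best_score)
```

The last step is not automatically the winner: selection depends only on validated full-dataset scores.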

When to Use COPRO

COPRO is particularly effective when:

| Scenario | Why COPRO Helps |
| --- | --- |
| Instruction quality is the main lever | COPRO focuses entirely on refining the instruction text |
| You have clear metric feedback | Diagnostic feedback per candidate makes each generation more targeted |
| You want predictable, monotonic improvement | Coordinate ascent commits each improvement before building on it |
| Smaller datasets | Full-dataset validation at every step works well when goldens are not too numerous |
| You need fast convergence | Depth steps are shallow and focused; typically 3-5 steps is enough |

COPRO vs. Other Algorithms

| Aspect | COPRO | SIMBA | GEPA | MIPROv2 |
| --- | --- | --- | --- | --- |
| Search strategy | Informed coordinate ascent | Variance-driven introspective ascent | Pareto-based evolutionary | Bayesian Optimization (TPE) |
| Feedback signal | Score + metric feedback per candidate | Score variance across trajectories | LLM diagnosis of failures/successes | Minibatch score per trial |
| Optimizes instructions? | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Optimizes demos? | ❌ No | ✅ Yes | ❌ No | ✅ Yes |
| Candidate generation | Two-pass guideline + rewrite | Per-iteration from hard examples | Per-iteration via reflective mutation | All upfront (proposal phase) |
| Full eval frequency | Every depth step | Every N iterations | Per accepted candidate | Every N trials |
| Best for | Fast, instruction-focused optimization | Inconsistent model behavior, complex tasks | Diverse problem types, multi-objective | Large search spaces, few-shot-heavy tasks |

Choose COPRO when you want fast, targeted instruction improvement with clear diagnostic feedback guiding each generation, especially when you don't need few-shot demonstrations and want reliable convergence in a small number of steps.

Choose SIMBA when your model is inconsistent across runs and you want the optimizer to learn from that inconsistency, or when the task benefits from both instruction improvements and injected demonstrations.

Choose GEPA when your task spans diverse problem types and you need to maintain a diverse pool of prompt strategies without converging prematurely on a single approach.

Choose MIPROv2 when the joint combination of instruction and few-shot demonstrations is the main lever and you want systematic Bayesian search over that space.
