COPRO
COPRO (Co-operative Prompt Optimizer) is a prompt optimization algorithm within deepeval, adapted from the DSPy optimizer of the same name. It uses coordinate ascent to iteratively improve a prompt: at each depth step it evaluates a batch of candidates, commits the best performer as the new baseline, and uses the scored history plus metric feedback to generate an increasingly targeted next batch.
The core insight is that prompt optimization is most efficient when each new generation of candidates is informed by what failed before and why it failed. Rather than generating variations blindly, COPRO feeds the optimizer LLM a full diagnostic history (every past prompt attempt, its score, and the specific metric feedback explaining where points were lost) so that each subsequent batch of candidates directly addresses known weaknesses.
Optimize Prompts With COPRO
To optimize a prompt using COPRO, provide a COPRO algorithm instance to the optimize() method:
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.prompt import Prompt
from deepeval.optimizer import PromptOptimizer
from deepeval.optimizer.algorithms import COPRO

prompt = Prompt(text_template="You are a helpful assistant - now answer this. {input}")

def model_callback(prompt: Prompt, golden) -> str:
    # Fill the template with the golden's input, then call your own LLM app
    prompt_to_llm = prompt.interpolate(input=golden.input)
    return your_llm(prompt_to_llm)  # your_llm is a placeholder for your application

optimizer = PromptOptimizer(
    algorithm=COPRO(),
    model_callback=model_callback
)

# goldens is the list of goldens you want to optimize against
optimized_prompt = optimizer.optimize(prompt=prompt, goldens=goldens, metrics=[AnswerRelevancyMetric()])
```

Done ✅. You just used COPRO to run a prompt optimization.
Customize COPRO
You can customize COPRO's behavior by passing parameters directly to the COPRO constructor:
```python
from deepeval.optimizer.algorithms import COPRO

copro = COPRO(
    depth=4,
    breadth=7,
    minibatch_size=25,
    random_state=42,
)
```

There are FOUR optional parameters when creating a COPRO instance:
- [Optional] depth: number of coordinate ascent steps to run. At each step, a new batch of candidates is evaluated and the best is committed as the baseline for the next step. Defaulted to 4.
- [Optional] breadth: number of prompt candidates generated and evaluated at each depth step. A higher breadth explores more of the prompt space per step but costs more. Defaulted to 7.
- [Optional] minibatch_size: number of goldens sampled per depth step for candidate evaluation. Larger batches give more reliable scores. Full-dataset validation is always run on the best candidate of each step. Defaulted to 25.
- [Optional] random_state: reproducibility control. You can pass either an int seed or a random.Random instance. This affects minibatch sampling and candidate deduplication. Defaulted to a random value.
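The effect of random_state on minibatch sampling can be illustrated with a small standalone sketch. The helper names `resolve_rng` and `sample_minibatch` are hypothetical and are not part of deepeval; the sketch only assumes that the seed ends up driving a `random.Random` generator, which is what makes sampling repeatable.

```python
import random
from typing import List, Optional, Union


def resolve_rng(random_state: Optional[Union[int, random.Random]] = None) -> random.Random:
    """Hypothetical helper: accept an int seed, a Random instance, or None."""
    if isinstance(random_state, random.Random):
        return random_state  # reuse the caller's generator unchanged
    # An int gives a reproducible generator; None gives a nondeterministic one.
    return random.Random(random_state)


def sample_minibatch(goldens: List[str], minibatch_size: int, rng: random.Random) -> List[str]:
    """Sample without replacement, capped at the dataset size."""
    k = min(minibatch_size, len(goldens))
    return rng.sample(goldens, k)


goldens = [f"golden-{i}" for i in range(100)]
batch_a = sample_minibatch(goldens, 25, resolve_rng(42))
batch_b = sample_minibatch(goldens, 25, resolve_rng(42))
print(batch_a == batch_b)  # True: the same seed yields the same minibatch
```

Passing a shared `random.Random` instance instead of an int lets several components draw from one generator, at the cost of results depending on call order.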
How Does COPRO Work?
COPRO runs for depth steps. Each step evaluates a batch of breadth candidates, selects the best, validates it on the full dataset, then uses the scored history to propose the next batch. The high-level flow is:
1. Bootstrap: Generate the initial breadth candidates from the original prompt using zero-shot variation
2. Evaluate: Score all candidates on a stochastic minibatch and extract metric feedback per candidate
3. Commit: Pick the best minibatch candidate and run full-dataset validation on it
4. Propose: Feed the scored history back to the LLM to generate the next targeted batch
5. Repeat: Steps 2-4 run for each of the depth steps
6. Final Selection: Return the prompt with the highest average true validation score across all steps
Phase 1: Bootstrap
Before the coordinate ascent loop begins, COPRO generates an initial set of breadth candidate prompts from the original prompt using zero-shot variation. This is done by the COPROProposer in two passes:
Pass 1 (Guideline Generation): The proposer asks the optimizer LLM to brainstorm breadth distinct "variation guidelines", which are high-level strategies for how to meaningfully alter the prompt. Examples:
| Guideline Example | Effect |
|---|---|
| "Reframe the prompt to require step-by-step reasoning before the final answer" | Generates an instruction that enforces chain-of-thought |
| "Condense instructions into a highly direct, concise format" | Produces a shorter, more aggressive instruction style |
| "Add strict output formatting constraints" | Makes the instruction prescriptive about output structure |
| "Explicitly call out common mistakes to avoid" | Generates a defensive, error-aware instruction |
Pass 2 (Candidate Generation): For each guideline, the proposer makes a separate LLM call to produce the actual rewritten prompt. These calls run concurrently in the async path, making the bootstrap phase significantly faster than sequential generation.
The original prompt is always inserted as candidate 0 before evaluation begins. This guarantees a baseline that the optimizer can always fall back to, and ensures that the first depth step has a fair reference point.
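A minimal sketch of the two-pass bootstrap is shown below, assuming an async LLM callable. The function `bootstrap_candidates`, the request wording, and the `fake_llm` stub are all hypothetical; the real COPROProposer's prompts and internals are not shown in this document. The sketch only demonstrates the shape of the technique: one call for guidelines, one concurrent rewrite call per guideline, and the original prompt placed first.

```python
import asyncio
from typing import Awaitable, Callable, List

LLM = Callable[[str], Awaitable[str]]  # hypothetical async text-in, text-out interface


async def bootstrap_candidates(original: str, breadth: int, llm: LLM) -> List[str]:
    # Pass 1: a single call that asks for `breadth` guidelines, one per line.
    raw = await llm(f"List {breadth} distinct ways to vary this prompt, one per line:\n{original}")
    guidelines = [line.strip() for line in raw.splitlines() if line.strip()][:breadth]

    # Pass 2: one rewrite call per guideline, issued concurrently.
    rewrites = await asyncio.gather(
        *(llm(f"Rewrite the prompt following this guideline: {g}\nPrompt: {original}")
          for g in guidelines)
    )
    # The original prompt goes first so that it is always candidate 0.
    return [original, *rewrites]


async def fake_llm(request: str) -> str:
    """Stand-in for a real optimizer LLM; returns canned text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if request.startswith("List"):
        return "Require step-by-step reasoning\nBe concise\nAdd output format constraints"
    guideline = request.splitlines()[0].split(": ", 1)[1]
    return f"[{guideline}] Answer the question. {{input}}"


candidates = asyncio.run(bootstrap_candidates("Answer the question. {input}", 3, fake_llm))
for index, candidate in enumerate(candidates):
    print(index, candidate)
```

Because `asyncio.gather` preserves argument order, each rewrite stays aligned with the guideline that produced it.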
Phase 2: Coordinate Ascent Loop
The loop runs for depth steps. Each step has three sub-stages: evaluate, commit, and propose.
Step 2a: Evaluate
At the start of each depth step, COPRO draws a random minibatch from your goldens and evaluates every candidate in the current batch against it. For each candidate, two things are captured:
- Score: the average metric score across all goldens in the minibatch
- Metric feedback: a diagnostic string describing exactly why points were lost, built from per-metric reasons on the failing examples
The metric feedback is a key enhancement over simpler optimizers. Rather than just recording a score, COPRO captures explanations like:
```
[Input]: Translate "Good morning" to French
[Expected]: Bonjour
[Actual Model Output]: Good morning in French is "Bonjour." Have a nice day!
[Evaluation Reasons]:
- AnswerRelevancyMetric (Score: 0.4): Response contains unnecessary filler beyond the requested translation.
```

This feedback is carried forward into the proposal step so that the next generation of candidates explicitly targets the failure modes identified here.
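The two captured values can be sketched as follows. The `MetricResult` record, the `summarize` function, and the 0.5 failure threshold are assumptions made for illustration; deepeval's actual result objects and the rule it uses to pick "failing" examples may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class MetricResult:
    """Hypothetical per-golden evaluation record."""
    input: str
    expected: str
    actual: str
    metric_name: str
    score: float
    reason: str


def summarize(results: List[MetricResult], threshold: float = 0.5) -> Tuple[float, str]:
    """Return (average score, feedback built only from examples below the threshold)."""
    score = mean(r.score for r in results)
    blocks = [
        f"[Input]: {r.input}\n"
        f"[Expected]: {r.expected}\n"
        f"[Actual Model Output]: {r.actual}\n"
        f"[Evaluation Reasons]:\n"
        f"- {r.metric_name} (Score: {r.score}): {r.reason}"
        for r in results
        if r.score < threshold  # assumed definition of a failing example
    ]
    return score, "\n\n".join(blocks)


results = [
    MetricResult('Translate "Good morning" to French', "Bonjour",
                 'Good morning in French is "Bonjour." Have a nice day!',
                 "AnswerRelevancyMetric", 0.4,
                 "Response contains unnecessary filler beyond the requested translation."),
    MetricResult('Translate "Hello" to Spanish', "Hola", "Hola",
                 "AnswerRelevancyMetric", 1.0, "Fully relevant."),
]
score, feedback = summarize(results)
print(round(score, 2))
print(feedback)
```

Only the failing example appears in the feedback string, which keeps the history passed to the proposer focused on where points were lost.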
Step 2b: Commit
After scoring, candidates are ranked by minibatch score. The top-scoring candidate is selected, then evaluated on the full golden dataset using score_pareto. This full-dataset score is stored in the validation archive.
If the full-dataset average beats the current global_best_score, the candidate is accepted as the new best. All depth steps record full-dataset scores, so the final selection can compare every step's committed winner on equal footing.
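The commit logic can be sketched in a few lines. Here `commit_step` and `full_eval` are hypothetical names: `full_eval` stands in for the full-dataset evaluation the text attributes to score_pareto, and the scores are illustrative.

```python
from typing import Callable, Dict, Tuple


def commit_step(minibatch_scores: Dict[str, float],
                full_eval: Callable[[str], float],
                archive: Dict[str, float],
                global_best_score: float) -> Tuple[str, float, bool]:
    """Pick the minibatch winner, validate it on the full dataset, record the result."""
    winner = max(minibatch_scores, key=minibatch_scores.get)
    full_score = full_eval(winner)
    archive[winner] = full_score  # recorded whether or not the candidate is accepted
    accepted = full_score > global_best_score
    return winner, full_score, accepted


# Illustrative full-dataset scores; a real run would evaluate every golden.
full_scores = {"candidate A": 0.73, "candidate B": 0.70}
archive: Dict[str, float] = {}

winner, full_score, accepted = commit_step(
    minibatch_scores={"candidate A": 0.77, "candidate B": 0.70},
    full_eval=full_scores.__getitem__,
    archive=archive,
    global_best_score=0.76,
)
print(winner, full_score, accepted)  # candidate A 0.73 False
```

The winner looks strong on the minibatch but is not accepted, because its full-dataset score does not beat the current best; it is still stored in the archive for the final comparison.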
Step 2c: Propose
Unless this is the final depth step, COPRO generates the next batch of breadth candidates. This uses the same two-pass proposer as bootstrap, but now passes the full history_log, a bounded, sorted record of the top breadth (prompt, score, metric_feedback) triples seen across all prior steps.
Example: What the history log looks like at depth step 3
| Attempt | Score | Metric Feedback Summary |
|---|---|---|
| Prompt A | 0.81 | Minor formatting issues on 1/25 examples |
| Prompt B | 0.74 | Consistently missed JSON schema on structured outputs |
| Prompt C | 0.71 | Verbose responses triggered conciseness metric failures |
| Prompt D | 0.68 | Lacked step-by-step reasoning on multi-hop questions |
| ... | ... | ... |
The proposer sees this ranked history and generates guidelines that explicitly fix the failure patterns (e.g., "previous attempts failed the JSON schema metric: add a strict output format constraint") while preserving the successful traits of the highest-scoring attempts. The resulting candidates at each subsequent depth step are therefore more targeted and diagnostic than the zero-shot bootstrap.
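A bounded, sorted history of this kind can be sketched as below. The `Attempt` tuple, `update_history`, and `render_history` are hypothetical names, and the rendered layout is an assumption; the sketch only shows how keeping the top breadth triples works and how they might be serialized for the proposer.

```python
from typing import List, NamedTuple


class Attempt(NamedTuple):
    prompt: str
    score: float
    metric_feedback: str


def update_history(history: List[Attempt], new_attempts: List[Attempt], breadth: int) -> List[Attempt]:
    """Merge new attempts, sort by score (best first), and keep only the top `breadth`."""
    merged = sorted(history + new_attempts, key=lambda a: a.score, reverse=True)
    return merged[:breadth]


def render_history(history: List[Attempt]) -> str:
    """Serialize the ranked history into text for the proposer's context."""
    return "\n".join(
        f"{rank}. score={a.score:.2f} | feedback: {a.metric_feedback} | prompt: {a.prompt}"
        for rank, a in enumerate(history, start=1)
    )


history: List[Attempt] = []
history = update_history(history, [
    Attempt("prompt v1", 0.68, "Lacked step-by-step reasoning on multi-hop questions"),
    Attempt("prompt v2", 0.74, "Consistently missed JSON schema on structured outputs"),
], breadth=3)
history = update_history(history, [
    Attempt("prompt v3", 0.81, "Minor formatting issues on 1/25 examples"),
    Attempt("prompt v4", 0.71, "Verbose responses triggered conciseness metric failures"),
], breadth=3)
print(render_history(history))
```

With breadth set to 3, the lowest-scoring attempt (0.68) is dropped once four attempts have been seen, so the context given to the proposer stays a fixed size.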
Phase 3: Final Selection
After all depth steps, COPRO performs a final sweep over the full validation archive. Among all committed depth-step winners, it picks the prompt with the highest average full-dataset score. This is the _extract_optimized_set step; it ensures that even if a later depth step produced a worse result than an earlier one (possible with minibatch noise), the globally best validated prompt is always returned.
Example: Coordinate ascent progression over 4 depth steps
| Depth | Candidates Evaluated | Best Minibatch Score | Full Dataset Score | Accepted? |
|---|---|---|---|---|
| 1 | 8 (7 + original) | 0.68 | 0.65 | ✅ (root) |
| 2 | 7 | 0.74 | 0.71 | ✅ |
| 3 | 7 | 0.79 | 0.76 | ✅ |
| 4 | 7 | 0.77 | 0.73 | ❌ |
In this example, depth step 4 produces a candidate that looks promising on the minibatch (0.77) but underperforms on the full dataset (0.73) compared to depth step 3's committed baseline (0.76). The final sweep correctly selects the depth step 3 result as the optimized prompt.
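The final sweep for this example reduces to taking the maximum over the validation archive. The snippet below replays the full-dataset scores from the table; `select_final` is a hypothetical stand-in for the selection step, keyed by depth step for readability.

```python
from typing import Dict

# Full-dataset score of each depth step's committed winner (values from the table above).
validation_archive: Dict[int, float] = {1: 0.65, 2: 0.71, 3: 0.76, 4: 0.73}


def select_final(archive: Dict[int, float]) -> int:
    """Return the depth step whose committed winner has the best full-dataset score."""
    return max(archive, key=archive.get)


best_depth = select_final(validation_archive)
last_depth = max(validation_archive)  # the most recent step, which is not the best one here

print(best_depth, validation_archive[best_depth])  # 3 0.76
print(last_depth, validation_archive[last_depth])  # 4 0.73
```

Returning the last committed prompt would have given 0.73; selecting over the whole archive recovers the 0.76 result from depth step 3.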
When to Use COPRO
COPRO is particularly effective when:
| Scenario | Why COPRO Helps |
|---|---|
| Instruction quality is the main lever | COPRO focuses entirely on refining the instruction text |
| You have clear metric feedback | Diagnostic feedback per candidate makes each generation more targeted |
| You want predictable, monotonic improvement | Coordinate ascent commits each improvement before building on it |
| Smaller datasets | Full-dataset validation at every step works well when goldens are not too numerous |
| You need fast convergence | Depth steps are shallow and focused; typically 3-5 steps is enough |
COPRO vs. Other Algorithms
| Aspect | COPRO | SIMBA | GEPA | MIPROv2 |
|---|---|---|---|---|
| Search strategy | Informed coordinate ascent | Variance-driven introspective ascent | Pareto-based evolutionary | Bayesian Optimization (TPE) |
| Feedback signal | Score + metric feedback per candidate | Score variance across trajectories | LLM diagnosis of failures/successes | Minibatch score per trial |
| Optimizes instructions? | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Optimizes demos? | ❌ No | ✅ Yes | ❌ No | ✅ Yes |
| Candidate generation | Two-pass guideline + rewrite | Per-iteration from hard examples | Per-iteration via reflective mutation | All upfront (proposal phase) |
| Full eval frequency | Every depth step | Every N iterations | Per accepted candidate | Every N trials |
| Best for | Fast, instruction-focused optimization | Inconsistent model behavior, complex tasks | Diverse problem types, multi-objective | Large search spaces, few-shot-heavy tasks |
Choose COPRO when you want fast, targeted instruction improvement with clear diagnostic feedback guiding each generation, especially when you don't need few-shot demonstrations and want reliable convergence in a small number of steps.
Choose SIMBA when your model is inconsistent across runs and you want the optimizer to learn from that inconsistency, or when the task benefits from both instruction improvements and injected demonstrations.
Choose GEPA when your task spans diverse problem types and you need to maintain a diverse pool of prompt strategies without converging prematurely on a single approach.
Choose MIPROv2 when the joint combination of instruction and few-shot demonstrations is the main lever and you want systematic Bayesian search over that space.