
SIMBA

SIMBA (Stochastic Introspective Mini-Batch Ascent) is a prompt optimization algorithm within deepeval adapted from the DSPy optimizer of the same name. It improves prompts by hunting for high-variance examples (cases where the model sometimes succeeds and sometimes fails on the exact same input) and using that contrast to either rewrite the prompt's instructions or inject a verified few-shot demonstration.

The core insight is that uncertainty reveals the most about what a prompt is doing wrong. When a model consistently passes or consistently fails an input, there is little diagnostic signal. But when outcomes vary run-to-run on the same input, the delta between the good and bad execution traces pinpoints exactly what the prompt needs to say differently.

Optimize Prompts With SIMBA

To optimize a prompt using SIMBA, provide a SIMBA algorithm instance to the PromptOptimizer and call its optimize() method:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.prompt import Prompt
from deepeval.optimizer import PromptOptimizer
from deepeval.optimizer.algorithms import SIMBA

prompt = Prompt(text_template="You are a helpful assistant - now answer this. {input}")

def model_callback(prompt: Prompt, golden) -> str:
    # Interpolate the golden's input into the prompt template
    prompt_to_llm = prompt.interpolate(input=golden.input)
    # your_llm is a placeholder for however you call your model
    return your_llm(prompt_to_llm)

optimizer = PromptOptimizer(
    algorithm=SIMBA(),
    model_callback=model_callback
)

optimized_prompt = optimizer.optimize(
    prompt=prompt,
    goldens=goldens,
    metrics=[AnswerRelevancyMetric()],
)
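Here, your_llm is a placeholder for however you call your model, and goldens is the list of evaluation examples to optimize against. A minimal sketch of constructing goldens with deepeval's Golden class (the example inputs are made up):

from deepeval.dataset import Golden

# Each Golden needs an input; expected_output is optional, but providing it
# enables SIMBA's deterministic fallback for consistently failing examples.
goldens = [
    Golden(input="What is the capital of France?", expected_output="Paris"),
    Golden(input="Summarize the water cycle in one sentence."),
]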

Done ✅. You just used SIMBA to run a prompt optimization.

Customize SIMBA

You can customize SIMBA's behavior by passing parameters directly to the SIMBA constructor:

from deepeval.optimizer.algorithms import SIMBA

simba = SIMBA(
    iterations=8,
    minibatch_size=15,
    num_candidates=4,
    num_samples=3,
    minibatch_full_eval_steps=4,
    random_state=42,
)

There are SIX optional parameters when creating a SIMBA instance:

  • [Optional] iterations: total number of optimization steps to run. Each step samples a new minibatch, generates candidates, and evaluates them. Defaulted to 8.
  • [Optional] minibatch_size: number of goldens sampled per iteration. Larger batches capture more variance signal but cost more (see the rough cost sketch after this list). Defaulted to 15.
  • [Optional] num_candidates: number of hard examples (top-variance buckets) to introspect and generate a candidate from per iteration. Defaulted to 4.
  • [Optional] num_samples: number of independent trajectories to run per golden when measuring variance. More samples = more reliable variance estimates but higher cost. Defaulted to 3.
  • [Optional] minibatch_full_eval_steps: run a full-dataset validation every N iterations, and always on the final iteration. Defaulted to 4.
  • [Optional] random_state: reproducibility control. You can pass either an int seed or a random.Random instance. This affects minibatch sampling, strategy selection, and candidate ordering.
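
These knobs trade optimization quality against model-call volume. As a rough back-of-the-envelope sketch (hypothetical accounting that ignores proposer calls and full-dataset validations, so real usage will be higher):

def estimate_calls(iterations=8, minibatch_size=15, num_samples=3, num_candidates=4):
    # Trajectory sampling: num_samples independent runs per golden per iteration
    sampling = iterations * minibatch_size * num_samples            # 8 * 15 * 3 = 360
    # Minibatch evaluation: each candidate scored once per golden in the batch
    candidate_evals = iterations * num_candidates * minibatch_size  # 8 * 4 * 15 = 480
    return sampling + candidate_evals

print(estimate_calls())  # 840 model calls at the defaults, before full validations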

How Does SIMBA Work?

SIMBA runs for a configurable number of iterations. Each iteration targets the examples where the model is most uncertain, generates new candidate prompts from that uncertainty, and accepts the best one if it outperforms the current best on the full dataset. Here is the high-level flow:

  1. Trajectory Sampling: run multiple independent traces per golden and measure score variance
  2. Bucket Sorting: rank examples by variability; the most uncertain examples come first
  3. Introspection & Candidate Generation: for each top-variance example, apply a strategy (rewrite or demo) to produce a new candidate prompt
  4. Minibatch Evaluation: score all candidates on the same minibatch and pick the best
  5. Periodic Full Validation: every N iterations, validate the best minibatch candidate on the full dataset and accept if it improves
  6. Final Selection: return the prompt with the highest average true validation score

Step 1: Trajectory Sampling

At the start of each iteration, SIMBA draws a random minibatch from your goldens, then runs num_samples independent executions of the current best prompt on every example in the batch.

Each execution captures:

  • The model's actual output
  • The composite metric score (averaged across your provided metrics)
  • Per-metric reasons explaining why points were lost

These num_samples runs form a bucket per golden. For each bucket, SIMBA computes:

| Statistic | Description |
|---|---|
| max_score | The best score across all trajectories for this golden |
| min_score | The worst score across all trajectories |
| avg_score | The mean score across all trajectories |
| max_to_avg_gap | max_score - avg_score, the primary variance signal |
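
In code, the per-bucket statistics reduce to a few lines. A minimal sketch (a hypothetical helper, not deepeval's internal implementation):

def bucket_stats(scores: list[float]) -> dict:
    # scores: one composite metric score per independent trajectory of a golden
    max_score, avg_score = max(scores), sum(scores) / len(scores)
    return {
        "max_score": max_score,
        "min_score": min(scores),
        "avg_score": avg_score,
        # Primary variance signal: how far the best run exceeds the average
        "max_to_avg_gap": max_score - avg_score,
    }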

Step 2: Bucket Sorting

Buckets are sorted in descending order of max_to_avg_gap. This surfaces the examples where the model is most inconsistent: sometimes producing a good answer, sometimes a bad one.

Example: Bucket ranking with num_samples=3 and minibatch_size=4

| Golden | Trajectory Scores | max | avg | max_to_avg_gap | Priority |
|---|---|---|---|---|---|
| G₁ | [1.0, 0.5, 0.5] | 1.0 | 0.67 | 0.33 | 🥇 1st |
| G₂ | [0.8, 0.7, 0.75] | 0.8 | 0.75 | 0.05 | 🥉 3rd |
| G₃ | [0.9, 0.3, 0.6] | 0.9 | 0.6 | 0.30 | 🥈 2nd |
| G₄ | [0.2, 0.2, 0.2] | 0.2 | 0.2 | 0.00 | 4th |

In this example (reproduced in the code sketch after this list):

  • G₁ is top priority: the model occasionally gets it fully right (1.0) but usually doesn't (0.5). The prompt is almost there for this input; fixing it would be high value.
  • G₃ comes second: high variance between 0.9 and 0.3 shows real inconsistency.
  • G₂ is low priority: the model is consistently good (scores clustered around 0.75). Not much room to learn here.
  • G₄ is lowest priority: the model consistently fails. This is useful long-term, but with no successful trace to learn from, it can only feed the deterministic fallback path (see below).
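
Sorting by the gap reproduces this priority order. A self-contained sketch using the scores from the table:

buckets = {
    "G1": [1.0, 0.5, 0.5],
    "G2": [0.8, 0.7, 0.75],
    "G3": [0.9, 0.3, 0.6],
    "G4": [0.2, 0.2, 0.2],
}
# gap(g) = max(scores) - mean(scores), as defined in Step 1
ranked = sorted(
    buckets,
    key=lambda g: max(buckets[g]) - sum(buckets[g]) / len(buckets[g]),
    reverse=True,  # most uncertain (highest-gap) examples come first
)
print(ranked)  # ['G1', 'G3', 'G2', 'G4']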

Deterministic Fallback

When max_to_avg_gap == 0 (all trajectories scored identically), SIMBA checks whether the model was already perfect (max_score >= 0.99). If so, it skips the bucket. If not, it falls back to using expected_output or expected_outcome from the golden as a synthetic "perfect" trace to contrast against the model's actual (failing) output. If no ground truth is available, the bucket is skipped entirely.
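
A sketch of that decision, using hypothetical names (the real internals may differ):

def resolve_zero_variance(scores, failing_output, golden):
    # Called when max_to_avg_gap == 0: every trajectory scored identically
    if max(scores) >= 0.99:
        return None  # already (near-)perfect: skip this bucket
    ground_truth = golden.expected_output  # or expected_outcome, if your goldens use it
    if ground_truth is None:
        return None  # nothing to contrast against: skip entirely
    # Treat ground truth as a synthetic "perfect" trace, contrasted
    # against the model's consistently failing output
    return {"better_output": ground_truth, "worse_output": failing_output}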

Step 3: Introspection & Candidate Generation

For each of the top num_candidates buckets, SIMBA randomly picks one of two improvement strategies and applies it to the current best prompt:

Strategy 1: Rule (Prompt Rewrite)

SIMBA passes the worse trace and better trace from the bucket to the SIMBAProposer, which calls an LLM to perform a deep introspective rewrite of the entire prompt.

The LLM is shown:

  • The original prompt instructions
  • The failing trajectory: inputs → bad output → score → metric feedback
  • The succeeding trajectory: inputs → good output → score → metric feedback

It produces a discussion field that diagnoses the root cause, identifying the exact delta in logic, formatting, or constraint enforcement that separated the two outcomes, and then a revised_prompt that rewrites the prompt from scratch to structurally prevent the failure.
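
To make this concrete, here is a hypothetical reconstruction of the kind of introspection prompt the proposer assembles (the exact wording SIMBAProposer uses is internal to deepeval):

INTROSPECTION_PROMPT = """You are diagnosing a prompt failure.

Current prompt instructions:
{instructions}

Failing trajectory (score: {worse_score}):
Input: {input}
Output: {worse_output}
Metric feedback: {worse_reasons}

Succeeding trajectory (score: {better_score}):
Input: {input}
Output: {better_output}
Metric feedback: {better_reasons}

Write a `discussion` identifying the exact delta that separated the two
outcomes, then a `revised_prompt` that rewrites the instructions from
scratch to structurally prevent the failure."""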

Strategy 2: Demo (Few-Shot Injection)

SIMBA takes the best-scoring trajectory from the bucket and injects it as a formatted few-shot example directly into the prompt:

[Example]
Input: <the golden's input>
Output: <the best trajectory's output>

This is appended to the system message (for list-format prompts) or to the end of the text template (for text prompts). The injected demo is verified: it comes from a real run that scored highly on your metrics, not from expected_output.
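
A sketch of the injection for text-template prompts (a hypothetical helper; list-format prompts would append the same block to the system message instead):

def inject_demo(text_template: str, golden_input: str, best_output: str) -> str:
    # Append a verified few-shot example taken from the bucket's
    # best-scoring real trajectory, not from expected_output
    return (
        f"{text_template}\n\n"
        f"[Example]\n"
        f"Input: {golden_input}\n"
        f"Output: {best_output}"
    )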

Strategy Selection

The strategy is chosen randomly with equal probability at each bucket. This stochasticity is intentional: it prevents the optimizer from overfitting to one improvement mechanism and ensures both instruction quality and demonstration quality are explored across iterations.

Step 4: Minibatch Evaluation

After generating up to num_candidates new prompt configurations (one per top bucket), SIMBA evaluates all of them on the same minibatch that was used for trajectory sampling. Each candidate's average metric score across the minibatch determines the winner of this iteration.

Only the single best-scoring candidate from this step proceeds to full validation.
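
Conceptually, the selection is a single argmax over average minibatch score. A sketch, where score_fn and the candidate type are assumptions:

def pick_winner(candidates, minibatch, score_fn):
    # score_fn(prompt, golden) -> composite metric score for one golden
    def avg_on_minibatch(prompt):
        return sum(score_fn(prompt, g) for g in minibatch) / len(minibatch)
    # Every candidate is scored on the *same* minibatch, so averages are comparable
    return max(candidates, key=avg_on_minibatch)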

Step 5: Periodic Full Validation

Every minibatch_full_eval_steps iterations (and always on the final iteration), SIMBA validates the best minibatch candidate against the full golden dataset. This true score is stored in the validation archive.

If the full-dataset average beats the current global_best_score, the candidate is accepted: it becomes the new current_best that all future trajectories are sampled from. Otherwise it is rejected.
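
A sketch of the acceptance decision for the winner from Step 4, with hypothetical names for the bookkeeping:

def accept_or_buffer(step, iterations, winner, evaluate_full, state,
                     minibatch_full_eval_steps=4):
    # state holds global_best_score, current_best, and the validation archive
    if step % minibatch_full_eval_steps == 0 or step == iterations:
        full_score = evaluate_full(winner)  # average over ALL goldens
        state["archive"].setdefault(winner, []).append(full_score)
        if full_score > state["global_best_score"]:
            state["global_best_score"] = full_score
            state["current_best"] = winner  # future trajectories sample from this
            return "accepted"
        return "rejected"
    return "buffered"  # no full evaluation, no acceptance decision this step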

Example: Acceptance decisions over 8 iterations with minibatch_full_eval_steps=4

| Iteration | Full Eval? | Full Score | Global Best | Outcome |
|---|---|---|---|---|
| 1 | No | - | - | Buffered |
| 2 | No | - | - | Buffered |
| 3 | No | - | - | Buffered |
| 4 | ✅ Yes | 0.71 | 0.0 (root) | ✅ Accepted |
| 5 | No | - | 0.71 | Buffered |
| 6 | No | - | 0.71 | Buffered |
| 7 | No | - | 0.71 | Buffered |
| 8 (final) | ✅ Yes | 0.68 | 0.71 | ❌ Rejected |

In this example, the iteration 4 candidate is accepted since it beats the root. The iteration 8 candidate is rejected despite a reasonable score because it doesn't improve on the already-accepted result from iteration 4.

Step 6: Final Selection

After all iterations, SIMBA performs a final sweep over the full validation archive (pareto_score_table). It picks the configuration with the highest average full-dataset score and returns it as the optimized prompt. If no full evaluation ever ran (e.g., all iterations were skipped), it falls back to the last current_best configuration.
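
The final sweep is again an argmax, this time over the archive of full-dataset scores. A sketch under assumed shapes (the real pareto_score_table may be structured differently):

def final_selection(pareto_score_table, current_best):
    # pareto_score_table: {prompt_config: [full-dataset scores from validations]}
    if not pareto_score_table:
        return current_best  # no full eval ever ran: fall back to last current_best
    return max(
        pareto_score_table,
        key=lambda cfg: sum(pareto_score_table[cfg]) / len(pareto_score_table[cfg]),
    )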

When to Use SIMBA

SIMBA is particularly effective when:

| Scenario | Why SIMBA Helps |
|---|---|
| Model is inconsistent on certain inputs | Variance-hunting directly targets the examples causing inconsistency |
| Task needs both instruction improvements and few-shot examples | SIMBA optimizes both simultaneously |
| You have complex multi-step tasks | Introspective rewrites restructure reasoning paths holistically |
| You want fast iteration | Minibatch-based evaluation keeps per-iteration cost low |
| Ground truth labels are available | Enables the deterministic fallback for zero-variance failing examples |

SIMBA vs. Other Algorithms

| Aspect | SIMBA | GEPA | MIPROv2 |
|---|---|---|---|
| Search strategy | Variance-driven introspective ascent | Pareto-based evolutionary | Bayesian Optimization (TPE) |
| Feedback signal | Score variance across trajectories | LLM diagnosis of failures/successes | Minibatch score per (instruction, demo) trial |
| Optimizes demos? | ✅ Yes (demo injection strategy) | ❌ No | ✅ Yes (bootstrapped demo sets) |
| Optimizes instructions? | ✅ Yes (rule/rewrite strategy) | ✅ Yes (reflective mutation) | ✅ Yes (proposal phase) |
| Candidate generation | Per-iteration from hard examples | Per-iteration via reflective rewrite | All upfront (proposal phase) |
| Best for | Inconsistent model behavior, complex tasks | Diverse problem types, multi-objective | Large search spaces, few-shot-heavy tasks |

Choose SIMBA when your model is inconsistent across runs and you want the optimizer to learn from that inconsistency directly.

Choose GEPA when your task spans diverse problem types and you need the optimizer to maintain a diverse pool of prompt strategies rather than converging on one.

Choose MIPROv2 when the combination of instruction and few-shot demonstrations is the main lever, and you want systematic Bayesian search over that joint space.
