GEPA: Optimize Anything
Declarative LLM-Driven Evolutionary Optimization for Text Artifacts
Authors: Lakshya A Agrawal, Donghyun Lee et al.
Affiliations: UC Berkeley, Stanford
Date: February 2026
Install: pip install gepa
License: Open Source
**Key Insight:** GEPA provides a declarative API for optimizing ANY text artifact — code, prompts, agent configurations, mathematical proofs — via LLM-driven evolutionary search with structured diagnostic feedback (Actionable Side Information). It achieves state-of-the-art results across coding, mathematics, and optimization benchmarks.
Table of Contents
- Overview & Motivation
- Installation & Quick Start
- System Architecture
- Optimization Modes
- Single-Task Optimization
- Multi-Task Optimization (Cross-Transfer)
- Generalization Mode (Train+Val)
- Actionable Side Information (ASI)
- Pareto-Efficient Search
- Reflection-Driven Mutation
- Seedless Mode
- Stopping Criteria
- Configuration System
- Evaluation Pipeline
- API Reference & Code Examples
- Results & Benchmarks
- Case Studies
- Comparison with Other Systems
- Limitations & Future Work
1. Overview & Motivation
GEPA (Genetic-Pareto) addresses a fundamental limitation in LLM-driven optimization: existing systems are tightly coupled to specific artifact types (code, prompts, etc.) and lack a unified framework for handling diagnostic feedback. GEPA introduces a declarative API where users specify what to optimize and how to measure quality, while the system handles the evolutionary search process automatically.
Core Design Principles
- Artifact Agnosticism: Any text-representable artifact can be optimized — Python functions, prompt templates, YAML configurations, agent policies, mathematical expressions, or even entire programs.
- Declarative Over Imperative: Users define objectives and constraints, not search procedures. The engine selects appropriate mutation strategies, population management, and stopping criteria.
- Diagnostic-First: Actionable Side Information (ASI) is a first-class concept. Every evaluation produces structured feedback (traces, errors, images, metrics) that directly informs the next mutation cycle.
- Pareto Optimality: Multi-objective optimization is native. The population maintains a Pareto frontier of non-dominated solutions rather than a single "best" solution.
- Reproducibility: Content-hash-based caching ensures that re-evaluating an identical artifact returns the cached result, guaranteeing consistency while avoiding redundant computation.
What GEPA Optimizes
Code
Python functions, algorithms, data structures, entire modules. Evaluated via test suites, benchmarks, or custom metrics.
Prompts
System prompts, few-shot examples, chain-of-thought templates. Evaluated against ground-truth datasets or human preferences.
Agent Configs
Multi-agent orchestration YAML, tool selection policies, routing rules. Evaluated via task completion rates.
Mathematical Expressions
Loss functions, optimization objectives, heuristic formulas. Evaluated via numerical accuracy or convergence speed.
Data Pipelines
ETL configurations, feature engineering scripts, preprocessing chains. Evaluated via downstream model performance.
Hybrid Artifacts
Combinations of code + prompts + configs. Co-evolved with inter-dependency tracking.
2. Installation & Quick Start
Installation
# Basic installation
pip install gepa
# With all optional dependencies
pip install gepa[all]
# With specific LLM backends
pip install gepa[openai,anthropic,google]
# Development installation
git clone https://github.com/gepa-ai/gepa.git
cd gepa
pip install -e ".[dev]"
Minimal Example: Optimize a Sorting Algorithm
from gepa import optimize, Artifact, Metric, EvalResult
# Define the artifact to optimize
artifact = Artifact(
name="sort_function",
template="""
def sort(arr: list[int]) -> list[int]:
# Your sorting implementation here
{{IMPLEMENTATION}}
""",
language="python"
)
# Define the evaluation metric
def evaluate_sort(candidate: str, task: dict) -> EvalResult:
    """Evaluate sorting correctness and performance."""
    import time
    exec_globals = {}
    exec(candidate, exec_globals)
    sort_fn = exec_globals['sort']
    test_cases = task['test_cases']
    correct = 0
    total_time = 0.0
    results = []
    for tc in test_cases:
        start = time.perf_counter()
        result = sort_fn(tc['input'].copy())
        elapsed = time.perf_counter() - start
        total_time += elapsed
        results.append(result)
        if result == tc['expected']:
            correct += 1
    accuracy = correct / len(test_cases)
    avg_time = total_time / len(test_cases)
    return EvalResult(
        scores={"accuracy": accuracy, "speed": 1.0 / (avg_time + 1e-9)},
        side_info={
            "failed_cases": [tc for tc, r in zip(test_cases, results) if r != tc['expected']],
            "avg_time_ms": avg_time * 1000,
        }
    )
# Run optimization
result = optimize(
artifact=artifact,
evaluate=evaluate_sort,
metrics=[
Metric("accuracy", direction="maximize", weight=0.8),
Metric("speed", direction="maximize", weight=0.2),
],
max_iterations=50,
llm="claude-sonnet-4-20250514",
)
print(f"Best solution: {result.best.score}")
print(result.best.code)
**Note:** The `optimize()` function is the primary entry point. It returns an `OptimizationResult` containing the Pareto frontier, best solution per metric, full history, and timing statistics.
3. System Architecture
+-----------------------------------------------------------------------------------+
| GEPA System Architecture |
+-----------------------------------------------------------------------------------+
| |
| +------------------+ +-------------------+ +-------------------------+ |
| | User API | | GEPAConfig | | Artifact Registry | |
| | |---->| |---->| | |
| | optimize() | | EngineConfig | | Templates, Validators | |
| | Artifact() | | ReflectionConfig | | Language Specs | |
| | Metric() | | MergeConfig | | Seed Solutions | |
| +------------------+ | RefinerConfig | +-------------------------+ |
| +-------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Evolution Engine | |
| | | |
| | +-------------------+ +--------------------+ +----------------+ | |
| | | Population Manager| | Pareto Frontier | | History Store | | |
| | | | | | | | | |
| | | - Candidates[] | | - Non-dominated | | - All evals | | |
| | | - Islands[] | | solutions | | - ASI records | | |
| | | - Selection | | - Dominance checks | | - Genealogy | | |
| | +-------------------+ +--------------------+ +----------------+ | |
| | | | | | |
| | v v v | |
| | +------------------------------------------------------------------+ | |
| | | Reflection Engine | | |
| | | | | |
| | | 1. Select parent from Pareto frontier | | |
| | | 2. Sample minibatch of 2-3 evaluation examples | | |
| | | 3. Attach ASI (traces, errors, images) to reflection prompt | | |
| | | 4. LLM proposes targeted mutation | | |
| | | 5. Validate & insert into population | | |
| | +------------------------------------------------------------------+ | |
| | | | |
| | v | |
| | +------------------------------------------------------------------+ | |
| | | Evaluation Pipeline | | |
| | | | | |
| | | [Cache Check] --> [Sandbox Exec] --> [Metric Compute] --> [ASI] | | |
| | | | | | | | | |
| | | Content Hash Docker/Process Multi-metric Traces | | |
| | | Dedup Isolation Aggregation Errors | | |
| | | max_workers Pareto update Images | | |
| | +------------------------------------------------------------------+ | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Stopping Controller | |
| | | |
| | MaxMetricCalls | Timeout | NoImprovement | ScoreThreshold | Composite | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | OptimizationResult | |
| | | |
| | .best - Best solution per primary metric | |
| | .pareto_front - All non-dominated solutions | |
| | .history - Full evaluation history with ASI | |
| | .stats - Timing, cost, convergence metrics | |
| +------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
Component Interactions
The architecture follows a clean separation of concerns:
- User API Layer: Declarative interface where users define artifacts, metrics, and evaluation functions. No knowledge of evolutionary internals required.
- Configuration Layer: GEPAConfig with sub-configs for engine, reflection, merge, and refinement. Sensible defaults with full override capability.
- Evolution Engine: Core loop managing population, Pareto frontier, and genealogy tracking. Orchestrates reflection and evaluation cycles.
- Reflection Engine: LLM-powered mutation generator that reads parent candidates, ASI feedback, and optimization history to propose targeted improvements.
- Evaluation Pipeline: Parallel execution with content-hash caching, sandbox isolation, multi-metric aggregation, and ASI extraction.
- Stopping Controller: Composable stopping criteria with AND/OR logic for flexible termination.
4. Optimization Modes
GEPA supports three distinct optimization modes, each addressing different real-world scenarios.
4.1 Single-Task Optimization
The simplest mode: optimize an artifact against a single task definition. This is the standard evolutionary optimization scenario.
from gepa import optimize, Artifact, Metric, SingleTaskConfig
result = optimize(
artifact=Artifact(
name="heuristic",
template="def heuristic(state: dict) -> float:\n {{BODY}}",
),
evaluate=my_eval_fn,
metrics=[Metric("score", direction="maximize")],
task={"problem_instance": load_instance("tsp_100")},
config=SingleTaskConfig(
population_size=20,
max_iterations=100,
reflection_minibatch_size=3,
),
)
**Use Case:** Optimizing a heuristic for a specific TSP instance, tuning a prompt for a fixed evaluation set, or improving an algorithm for a particular benchmark.
4.2 Multi-Task Optimization (Cross-Transfer)
Multi-task mode optimizes a single artifact across multiple tasks simultaneously, enabling cross-transfer learning. Solutions that perform well on one task can transfer insights to others.
from gepa import optimize, Artifact, Metric, MultiTaskConfig
tasks = [
{"name": "tsp_50", "instance": load_instance("tsp_50")},
{"name": "tsp_100", "instance": load_instance("tsp_100")},
{"name": "tsp_200", "instance": load_instance("tsp_200")},
]
result = optimize(
artifact=Artifact(name="tsp_heuristic", template="..."),
evaluate=eval_tsp,
metrics=[Metric("tour_length", direction="minimize")],
tasks=tasks,
config=MultiTaskConfig(
population_size=30,
cross_transfer=True, # Enable cross-task solution transfer
transfer_frequency=5, # Transfer every 5 iterations
min_improvement_for_transfer=0.01,
),
)
# Access per-task results
for task_name, task_result in result.per_task.items():
print(f"{task_name}: best={task_result.best.score:.4f}")
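Conceptually, each transfer round re-scores every task's best candidate on the other tasks and adopts it wherever it beats the incumbent by at least the configured improvement margin. A minimal sketch under those assumptions (higher-is-better scores; the function name and record shapes are illustrative, not the library's internals):

```python
def cross_transfer(per_task_best, evaluate, tasks, min_improvement=0.01):
    """Try each task's best candidate on every other task.

    A candidate is proposed for another task's population when it beats
    that task's current best score by at least `min_improvement`.
    Scores are assumed higher-is-better here.
    """
    transfers = []
    for src, candidate in per_task_best.items():
        for dst, task in tasks.items():
            if dst == src:
                continue  # never transfer a candidate back to its own task
            score = evaluate(candidate["code"], task)
            if score > per_task_best[dst]["score"] + min_improvement:
                transfers.append(
                    {"from": src, "to": dst, "code": candidate["code"], "score": score}
                )
    return transfers
```

The engine would run a round like this every `transfer_frequency` iterations and insert the returned candidates into the destination tasks' populations.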
4.3 Generalization Mode (Train+Val)
Generalization mode splits tasks into training and validation sets, optimizing for generalization rather than overfitting to specific instances.
from gepa import optimize, Artifact, Metric, GeneralizationConfig
all_tasks = load_benchmark_tasks("arc-agi-v1")
train_tasks = all_tasks[:80]
val_tasks = all_tasks[80:]
result = optimize(
artifact=Artifact(name="arc_solver", template="..."),
evaluate=eval_arc,
metrics=[Metric("accuracy", direction="maximize")],
train_tasks=train_tasks,
val_tasks=val_tasks,
config=GeneralizationConfig(
population_size=50,
val_frequency=10, # Validate every 10 iterations
early_stopping_patience=20, # Stop if val doesn't improve for 20 iters
overfitting_threshold=0.15, # Alert if train-val gap exceeds 15%
),
)
print(f"Train accuracy: {result.train_score:.3f}")
print(f"Val accuracy: {result.val_score:.3f}")
print(f"Generalization gap: {result.generalization_gap:.3f}")
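The early-stopping and overfitting checks configured above can be sketched as a small gate function (illustrative only, assuming a recorded history of validation scores; not the library's internal logic):

```python
def generalization_gate(train_score, val_history, patience=20, gap_threshold=0.15):
    """Early-stopping and overfitting checks for train/val optimization.

    `val_history` holds one validation score per validation round.
    Returns (stop, gap_alert): stop when the best validation score is at
    least `patience` rounds old; alert when the train/val gap is too wide.
    """
    if not val_history:
        return False, False
    # Index of the (first) best validation round seen so far
    best_idx = max(range(len(val_history)), key=lambda i: val_history[i])
    rounds_since_best = len(val_history) - 1 - best_idx
    stop = rounds_since_best >= patience
    gap_alert = (train_score - val_history[-1]) > gap_threshold
    return stop, gap_alert
```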
5. Actionable Side Information (ASI)
ASI is GEPA's most distinctive innovation. Instead of treating evaluation as a black-box score, ASI captures structured diagnostic feedback that the reflection engine uses to propose targeted mutations.
ASI Types
| ASI Type | Description | Example | Use in Reflection |
|---|---|---|---|
| `TraceASI` | Execution traces showing step-by-step behavior | Function call logs, variable states at each step | LLM identifies where execution diverges from expected behavior |
| `ErrorASI` | Exception details with full stack traces | TypeError, IndexError with line numbers | LLM directly fixes the error-causing code |
| `ImageASI` | Visual outputs (plots, renders, diagrams) | Circle packing visualization, loss curve | Multimodal LLM analyzes visual quality |
| `MetricASI` | Detailed metric breakdowns beyond the primary score | Per-test-case scores, timing breakdowns | LLM focuses on worst-performing sub-metrics |
| `ComparisonASI` | Diff between candidate and reference solution | Output mismatches, behavioral differences | LLM targets specific discrepancies |
| `TextASI` | Free-form text feedback | Human annotations, LLM-judge feedback | LLM incorporates qualitative feedback into mutation |
Implementing Custom ASI
from gepa import EvalResult, TraceASI, ErrorASI, MetricASI, ComparisonASI
def evaluate_with_asi(candidate: str, task: dict) -> EvalResult:
"""Evaluation function that produces rich ASI feedback."""
try:
# Execute candidate
exec_globals = {}
exec(candidate, exec_globals)
solve_fn = exec_globals['solve']
# Collect execution trace
trace = []
original_print = print
def traced_print(*args, **kwargs):
trace.append(" ".join(str(a) for a in args))
original_print(*args, **kwargs)
        import builtins
        builtins.print = traced_print
        try:
            result = solve_fn(task['input'])
        finally:
            builtins.print = original_print  # restore even if the candidate raises
# Compute metrics
accuracy = compute_accuracy(result, task['expected'])
efficiency = compute_efficiency(result)
# Build ASI
side_info = [
TraceASI(trace=trace, label="execution_trace"),
MetricASI(
metrics={
"accuracy": accuracy,
"efficiency": efficiency,
"output_length": len(str(result)),
},
label="detailed_metrics"
),
]
if accuracy < 1.0:
mismatches = find_mismatches(result, task['expected'])
side_info.append(
ComparisonASI(
expected=str(task['expected']),
actual=str(result),
diff=mismatches,
label="output_comparison"
)
)
return EvalResult(
scores={"accuracy": accuracy, "efficiency": efficiency},
side_info=side_info,
)
except Exception as e:
import traceback
return EvalResult(
scores={"accuracy": 0.0, "efficiency": 0.0},
side_info=[
ErrorASI(
error_type=type(e).__name__,
message=str(e),
traceback=traceback.format_exc(),
label="runtime_error",
)
],
)
ASI in the Reflection Prompt
When the reflection engine generates a mutation, ASI is injected into the LLM prompt as structured context:
# Internal reflection prompt template (simplified)
REFLECTION_PROMPT = """
You are optimizing a {artifact_type} to maximize {objectives}.
## Current Candidate (Score: {score})
{candidate_code}
## Diagnostic Feedback (ASI)
{formatted_asi}
## Optimization History
- Iteration {iter}: {score_history}
- Recent improvements: {recent_improvements}
- Stagnation detector: {stagnation_status}
## Reflection Minibatch ({minibatch_size} examples)
{minibatch_details}
## Task
Analyze the diagnostic feedback and propose a TARGETED improvement.
Focus on the specific failure modes revealed by the ASI.
Output the complete improved candidate.
"""
6. Pareto-Efficient Search
GEPA maintains a Pareto frontier of non-dominated solutions, enabling multi-objective optimization without manual weight tuning.
Pareto Dominance
A solution $x$ dominates solution $y$ (written $x \succ y$) if and only if:
$$ x \succ y \;\Longleftrightarrow\; \forall i:\ f_i(x) \ge f_i(y) \ \land\ \exists j:\ f_j(x) > f_j(y) $$
where $f_i$ denotes the $i$-th objective function, with direction-normalized values so that higher is always better.
Pareto Frontier Maintenance
$$ PF = \{\, x \in P : \nexists\, y \in P \ \text{such that}\ y \succ x \,\} $$
The Pareto frontier $PF$ is the set of all non-dominated solutions in the population $P$. When a new candidate is evaluated:
- Check if any existing frontier member dominates the new candidate → if yes, discard.
- Check if the new candidate dominates any existing frontier members → if yes, remove dominated members.
- If neither dominates, add the new candidate to the frontier (it represents a new trade-off).
Hypervolume Indicator
GEPA uses the hypervolume indicator to measure the quality of the Pareto frontier:
$$ HV(PF, r) = \Lambda\!\left(\{\, q \in \mathbb{R}^m : \exists\, p \in PF \ \text{such that}\ p \succeq q \succeq r \,\}\right) $$
where $\Lambda$ denotes the Lebesgue measure (volume), $r$ is the reference point, and $m$ is the number of objectives. The hypervolume captures both convergence to the true Pareto front and diversity along it.
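For two maximization objectives, the hypervolume reduces to a sum of rectangular slices swept left to right. A minimal sketch under that assumption (a non-dominated front of `(f1, f2)` tuples; `hypervolume_2d` is a hypothetical helper, not part of the gepa API):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective (maximize, maximize) Pareto front.

    Assumes `front` is non-dominated, so sorting ascending by f1 gives
    descending f2; each point then contributes one rectangular slice
    between its predecessor's f1 and its own, down to the reference point.
    """
    # Keep only points that strictly dominate the reference point
    pts = sorted(p for p in front if p[0] > ref[0] and p[1] > ref[1])
    hv = 0.0
    prev_x = ref[0]
    for x, y in pts:
        hv += (x - prev_x) * (y - ref[1])  # slice width times height above ref
        prev_x = x
    return hv
```

For example, the front `[(1, 3), (2, 2), (3, 1)]` with reference `(0, 0)` covers three slices of area 3, 2, and 1.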
Selection Strategy
import random

import numpy as np

# Simplified view of the frontier logic in gepa.engine
class ParetoFrontier:
    def __init__(self, objectives: list[str], directions: list[str]):
        self.objectives = objectives
        self.directions = directions  # "maximize" or "minimize"
        self.frontier: list[Solution] = []

    def dominates(self, a: Solution, b: Solution) -> bool:
        """Check if solution a dominates solution b."""
        at_least_one_better = False
        for obj, direction in zip(self.objectives, self.directions):
            a_val = a.scores[obj]
            b_val = b.scores[obj]
            if direction == "minimize":
                a_val, b_val = -a_val, -b_val
            if a_val < b_val:
                return False
            if a_val > b_val:
                at_least_one_better = True
        return at_least_one_better

    def update(self, candidate: Solution) -> bool:
        """Add candidate to frontier if non-dominated. Returns True if added."""
        # Check if any frontier member dominates the candidate
        for member in self.frontier:
            if self.dominates(member, candidate):
                return False  # Dominated, discard
        # Remove frontier members dominated by the candidate
        self.frontier = [
            m for m in self.frontier
            if not self.dominates(candidate, m)
        ]
        self.frontier.append(candidate)
        return True

    def select_parent(self, strategy: str = "crowding") -> Solution:
        """Select a parent from the frontier for mutation."""
        if strategy == "crowding":
            # Prefer solutions in sparse regions of the frontier
            distances = np.asarray(self._crowding_distances())
            probs = distances / distances.sum()
            idx = np.random.choice(len(self.frontier), p=probs)
            return self.frontier[idx]
        elif strategy == "random":
            return random.choice(self.frontier)
        elif strategy == "tournament":
            a, b = random.sample(self.frontier, 2)
            if self._hypervolume_contribution(a) > self._hypervolume_contribution(b):
                return a
            return b
        raise ValueError(f"Unknown selection strategy: {strategy}")
Crowding Distance
To maintain diversity on the Pareto frontier, GEPA computes crowding distances and preferentially selects parents from sparse regions:
$$ CD(i) = \sum_{m} \frac{f_m(i+1) - f_m(i-1)}{f_m^{\max} - f_m^{\min}} $$
where solutions are sorted by each objective $m$ and boundary solutions receive infinite crowding distance.
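The per-solution crowding distance can be computed as follows — a self-contained sketch assuming a list of direction-normalized objective vectors (`crowding_distances` is an illustrative name, not the library's internal `_crowding_distances`):

```python
import math

def crowding_distances(scores):
    """Crowding distance per solution, given a list of objective vectors.

    For each objective, solutions are sorted and each interior solution
    accumulates the normalized gap between its two neighbors; boundary
    solutions get infinite distance so they are always preserved.
    """
    n = len(scores)
    if n <= 2:
        return [math.inf] * n
    m = len(scores[0])
    dist = [0.0] * n
    for obj in range(m):
        order = sorted(range(n), key=lambda i: scores[i][obj])
        lo, hi = scores[order[0]][obj], scores[order[-1]][obj]
        dist[order[0]] = dist[order[-1]] = math.inf  # boundaries always kept
        span = hi - lo
        if span == 0:
            continue  # all solutions tie on this objective
        for rank in range(1, n - 1):
            gap = scores[order[rank + 1]][obj] - scores[order[rank - 1]][obj]
            dist[order[rank]] += gap / span
    return dist
```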
7. Reflection-Driven Mutation
GEPA's reflection engine is the primary mechanism for generating improved candidates. Unlike random mutation or simple LLM rewriting, reflection uses structured diagnostic feedback to propose targeted improvements.
Reflection Pipeline
- Parent Selection: Choose a parent from the Pareto frontier using crowding-distance-weighted selection.
- Minibatch Sampling: Sample 2-3 evaluation examples where the parent performs poorly (failure-biased sampling).
- ASI Assembly: Gather all ASI records for the selected examples — traces, errors, metrics, comparisons.
- History Context: Include recent mutation history (what was tried, what worked, what didn't).
- LLM Reflection: Send the complete context to the LLM with instructions to analyze failures and propose a targeted fix.
- Validation: Parse the LLM output, validate syntax/structure, and insert into the population for evaluation.
Minibatch Strategy
**Why 2-3 examples?** Reflection minibatches of 2-3 examples provide the optimal trade-off between context richness and LLM attention capacity. Too few examples (1) risk overfitting the mutation to a single case. Too many examples (5+) dilute the LLM's attention and produce generic rather than targeted improvements.
from gepa.reflection import ReflectionEngine, ReflectionConfig
config = ReflectionConfig(
minibatch_size=3, # Sample 2-3 examples per reflection
failure_bias=0.7, # 70% chance to sample failure cases
include_trace=True, # Include execution traces in prompt
include_error=True, # Include error details in prompt
include_comparison=True, # Include output comparisons
max_history_length=10, # Include last 10 mutations in context
temperature=0.8, # LLM temperature for diversity
reflection_prompt_version="v3", # Latest reflection prompt template
)
engine = ReflectionEngine(
config=config,
llm=LLMClient("claude-sonnet-4-20250514"),
artifact_spec=artifact,
)
# Generate a mutation
mutation = await engine.reflect(
parent=best_candidate,
eval_results=parent_eval_results,
history=optimization_history,
)
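The failure-biased sampling behind `failure_bias` can be sketched as follows — an illustrative helper assuming per-example records with a `score` field, not the engine's internal code:

```python
import random

def sample_minibatch(eval_records, size=3, failure_bias=0.7, pass_score=1.0, rng=None):
    """Sample a reflection minibatch, biased toward failing examples.

    Each slot draws from the failure pool with probability `failure_bias`,
    falling back to the other pool when one side is exhausted.
    """
    rng = rng or random.Random()
    failures = [r for r in eval_records if r["score"] < pass_score]
    successes = [r for r in eval_records if r["score"] >= pass_score]
    batch = []
    for _ in range(min(size, len(eval_records))):
        take_failure = failures and (not successes or rng.random() < failure_bias)
        pool = failures if take_failure else successes
        pick = rng.choice(pool)
        pool.remove(pick)  # sample without replacement
        batch.append(pick)
    return batch
```

With the default `failure_bias=0.7`, roughly two of every three minibatch slots come from failing examples, which is what keeps reflection focused on fixable defects.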
Reflection Prompt Engineering
The reflection prompt is carefully structured to maximize the quality of mutations:
# Reflection prompt structure
"""
[ROLE] You are an expert optimizer improving a {artifact_type}.
[OBJECTIVE] Maximize: {objectives_description}
[CURRENT SOLUTION] (Score: {score_summary})
{candidate_code}
[DIAGNOSTIC ANALYSIS]
--- Example 1/{minibatch_size} ---
Input: {example_input}
Expected: {expected_output}
Actual: {actual_output}
Trace: {execution_trace}
Error: {error_if_any}
Score: {per_example_score}
--- Example 2/{minibatch_size} ---
...
[MUTATION HISTORY]
Iteration {n-2}: Changed X to Y -> score improved by +0.05
Iteration {n-1}: Changed A to B -> score decreased by -0.02 (reverted)
[INSTRUCTIONS]
1. Analyze the diagnostic feedback above
2. Identify the ROOT CAUSE of failures
3. Propose a SPECIFIC, TARGETED fix (not a rewrite)
4. Explain your reasoning before the code
5. Output the complete improved solution
[OUTPUT FORMAT]
## Analysis
(your analysis here)
## Improved Solution
(complete code here)
"""
8. Seedless Mode
GEPA can bootstrap optimization from a natural language objective alone, without any seed solution. The system generates an initial population by prompting the LLM to create diverse starting points.
from gepa import optimize, Artifact, Metric, SeedlessConfig
# No seed solution provided - GEPA generates initial candidates
result = optimize(
artifact=Artifact(
name="bin_packing",
description="A function that packs items into bins to minimize wasted space.",
signature="def pack(items: list[tuple[float, float]], bin_capacity: float) -> list[list[int]]",
language="python",
# No template or seed - GEPA bootstraps from the description + signature
),
evaluate=eval_bin_packing,
metrics=[Metric("utilization", direction="maximize")],
config=SeedlessConfig(
initial_population_size=10, # Generate 10 diverse starting points
diversity_prompt=True, # Instruct LLM to generate diverse solutions
bootstrap_strategies=[
"greedy_first_fit",
"best_fit_decreasing",
"random_search",
"dynamic_programming",
"genetic_algorithm",
], # Hint different algorithmic approaches
),
)
Bootstrap Pipeline
- Strategy Enumeration: If strategies are provided, generate one candidate per strategy. Otherwise, ask the LLM to enumerate diverse approaches.
- Parallel Generation: Concurrently prompt the LLM to implement each strategy as a concrete solution.
- Validation: Each generated candidate is validated (syntax, type checking, basic execution) before insertion.
- Initial Evaluation: All valid candidates are evaluated to establish the initial Pareto frontier.
- Normal Evolution: Proceed with reflection-driven mutation from the initial population.
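The first three steps can be sketched as below; `llm_generate` stands in for whatever LLM client is configured, and the prompt wording and record shapes are illustrative, not the gepa API:

```python
import ast

def bootstrap_population(description, signature, strategies, llm_generate):
    """Generate and validate an initial seedless population.

    One candidate is requested per strategy hint; candidates that fail a
    basic syntax check are dropped before the initial evaluation round.
    """
    population = []
    for strategy in strategies:
        prompt = (
            f"Implement: {description}\n"
            f"Signature: {signature}\n"
            f"Use this approach: {strategy}\n"
            "Output only the complete Python function."
        )
        candidate = llm_generate(prompt)
        try:
            ast.parse(candidate)  # basic syntax validation before insertion
        except SyntaxError:
            continue
        population.append({"strategy": strategy, "code": candidate})
    return population
```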
9. Stopping Criteria
GEPA provides composable stopping criteria that can be combined with AND/OR logic.
| Criterion | Description | Parameters | Example |
|---|---|---|---|
| `MaxMetricCalls` | Stop after N evaluation calls | `max_calls: int` | `MaxMetricCalls(100)` |
| `Timeout` | Stop after elapsed time | `seconds: float` | `Timeout(3600)` (1 hour) |
| `NoImprovement` | Stop if no improvement for N iterations | `patience: int`, `min_delta: float` | `NoImprovement(20, 0.001)` |
| `ScoreThreshold` | Stop when target score is reached | `metric: str`, `threshold: float` | `ScoreThreshold("accuracy", 0.99)` |
| `Composite` | Combine criteria with AND/OR logic | `criteria: list`, `mode: str` | See example below |
from gepa.stopping import MaxMetricCalls, Timeout, NoImprovement, ScoreThreshold, Composite
# Stop when ANY of these conditions is met (OR logic)
stopping = Composite(
criteria=[
MaxMetricCalls(200), # Hard limit: 200 evaluations
Timeout(7200), # Hard limit: 2 hours
ScoreThreshold("accuracy", 1.0), # Perfect score reached
Composite( # AND: both must be true
criteria=[
NoImprovement(patience=30, min_delta=0.001),
MaxMetricCalls(50), # At least 50 evals before early stop
],
mode="AND",
),
],
mode="OR",
)
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
stopping=stopping,
)
print(f"Stopped because: {result.stop_reason}")
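Under the hood, composition like this reduces to `any`/`all` over the child criteria. A rough sketch of how such criteria might be implemented (illustrative class shapes and a simplified `state` dict, not the actual `gepa.stopping` internals):

```python
class Composite:
    """Combine child stopping criteria with AND/OR logic.

    Each child is any object with a `should_stop(state) -> bool` method;
    `state` is whatever snapshot of the run the engine passes in.
    """
    def __init__(self, criteria, mode="OR"):
        self.criteria = criteria
        self.mode = mode

    def should_stop(self, state):
        checks = (c.should_stop(state) for c in self.criteria)
        return all(checks) if self.mode == "AND" else any(checks)


class MaxMetricCalls:
    """Stop once the evaluation budget is spent."""
    def __init__(self, max_calls):
        self.max_calls = max_calls

    def should_stop(self, state):
        return state["metric_calls"] >= self.max_calls
```

Because `Composite` only requires the `should_stop` protocol, composites nest freely, which is what makes the AND-inside-OR pattern above possible.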
10. Configuration System
GEPA uses a hierarchical configuration system with sensible defaults at every level.
GEPAConfig Structure
from gepa.config import (
GEPAConfig, EngineConfig, ReflectionConfig,
MergeConfig, RefinerConfig
)
config = GEPAConfig(
# Engine configuration
engine=EngineConfig(
population_size=30, # Max population size
elite_size=5, # Protected elite solutions
tournament_size=3, # Tournament selection size
mutation_rate=1.0, # Always mutate (vs crossover)
crossover_rate=0.0, # No crossover by default
island_count=1, # Number of islands (for island model)
migration_rate=0.1, # Migration rate between islands
migration_interval=10, # Migrate every N iterations
),
# Reflection configuration
reflection=ReflectionConfig(
minibatch_size=3, # Examples per reflection
failure_bias=0.7, # Bias toward failure cases
temperature=0.8, # LLM temperature
max_tokens=4096, # Max response tokens
include_trace=True, # Include execution traces
include_error=True, # Include error details
include_comparison=True, # Include output comparisons
include_image=False, # Include image ASI (multimodal)
max_history_length=10, # Mutation history context
reflection_model="claude-sonnet-4-20250514", # Model for reflection
),
# Merge configuration (for crossover-like operations)
merge=MergeConfig(
strategy="llm_guided", # LLM picks best parts from 2 parents
merge_model="claude-sonnet-4-20250514",
max_parents=2, # Number of parents to merge
),
# Refiner configuration (post-mutation polish)
refiner=RefinerConfig(
enabled=True, # Enable post-mutation refinement
refiner_model="claude-sonnet-4-20250514",
max_refinement_steps=2, # Max refinement iterations
refinement_threshold=0.9, # Only refine if score > threshold
),
)
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
config=config,
)
Configuration via YAML
# gepa_config.yaml
engine:
population_size: 30
elite_size: 5
tournament_size: 3
reflection:
minibatch_size: 3
failure_bias: 0.7
temperature: 0.8
include_trace: true
include_error: true
merge:
strategy: llm_guided
max_parents: 2
refiner:
enabled: true
max_refinement_steps: 2
stopping:
type: composite
mode: OR
criteria:
- type: max_metric_calls
max_calls: 200
- type: timeout
seconds: 7200
- type: score_threshold
metric: accuracy
threshold: 1.0
from gepa import optimize_from_config
result = optimize_from_config("gepa_config.yaml")
11. Evaluation Pipeline
Parallel Evaluation
GEPA evaluates candidates in parallel using a configurable worker pool:
from gepa.evaluation import EvaluationPipeline, EvalConfig
pipeline = EvaluationPipeline(
eval_fn=evaluate,
config=EvalConfig(
max_workers=8, # Parallel evaluation workers
timeout_per_eval=60, # Timeout per evaluation (seconds)
retry_on_error=True, # Retry failed evaluations once
cache_enabled=True, # Enable content-hash caching
cache_backend="sqlite", # Cache backend: sqlite, redis, memory
sandbox="docker", # Sandbox: docker, subprocess, none
),
)
# Evaluate a batch of candidates
results = await pipeline.evaluate_batch(candidates, task)
Content-Hash Caching
Evaluations are cached by content hash, so identical candidates are never re-evaluated:
import hashlib
import json
def content_hash(candidate: str, task: dict) -> str:
"""Compute a deterministic hash for a (candidate, task) pair."""
content = candidate + json.dumps(task, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
# Cache lookup before evaluation
cache_key = content_hash(candidate_code, task)
if cache_key in cache:
return cache[cache_key] # Skip evaluation
# Evaluate and cache
result = evaluate(candidate_code, task)
cache[cache_key] = result
Sandbox Execution
All candidate evaluations run in isolated sandboxes to prevent side effects:
- Docker sandbox: Full container isolation with resource limits (CPU, memory, network, filesystem). Safest option for untrusted code.
- Subprocess sandbox: Process-level isolation with timeouts. Good balance of safety and speed.
- None: Direct execution in the main process. Fastest but no isolation. Only for trusted code.
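A minimal version of the subprocess option can be sketched with the standard library alone (illustrative only; the real pipeline would add memory/CPU limits and richer result marshalling):

```python
import subprocess
import sys

def run_in_subprocess(candidate_code, timeout=60):
    """Execute candidate code in a separate Python process with a timeout.

    Returns (ok, stdout, stderr); a hung candidate is killed when the
    timeout expires instead of blocking the evaluation pipeline.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout}s"
    return proc.returncode == 0, proc.stdout, proc.stderr
```

Process isolation means a crashing or infinitely looping candidate takes down only its own interpreter, never the evaluation workers.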
12. API Reference & Code Examples
Core Classes
[!info]- Artifact - Define what to optimize
```
class Artifact:
    """Defines a text artifact to be optimized."""

    def __init__(
        self,
        name: str,                              # Unique artifact identifier
        template: str | None = None,            # Template with {{PLACEHOLDERS}}
        seed: str | None = None,                # Initial solution (optional)
        description: str | None = None,         # Natural language description
        signature: str | None = None,           # Function signature (for code)
        language: str = "python",               # Language: python, yaml, text, etc.
        constraints: list[str] | None = None,   # Hard constraints for validation
        metadata: dict | None = None,           # Additional context
    ): ...

    def validate(self, candidate: str) -> ValidationResult:
        """Validate a candidate against artifact constraints."""
        ...

    def render(self, **kwargs) -> str:
        """Render template with placeholder values."""
        ...
```
[!info]- Metric - Define optimization objectives
```
class Metric:
    """Defines an optimization objective."""

    def __init__(
        self,
        name: str,                                  # Metric name (must match EvalResult keys)
        direction: str = "maximize",                # "maximize" or "minimize"
        weight: float = 1.0,                        # Weight for weighted-sum aggregation
        bounds: tuple[float, float] | None = None,  # Expected (min, max) range
        primary: bool = True,                       # Is this a primary objective?
    ): ...
```
[!info]- EvalResult - Evaluation output
```
class EvalResult:
    """Result of evaluating a candidate."""

    def __init__(
        self,
        scores: dict[str, float],                   # Metric name -> value
        side_info: list[ASI] | dict | None = None,  # Actionable Side Information
        metadata: dict | None = None,               # Additional evaluation metadata
        valid: bool = True,                         # Whether the candidate is valid
        error: str | None = None,                   # Error message if invalid
    ): ...
```
[!info]- OptimizationResult - Final output
```
class OptimizationResult:
    """Complete optimization result."""

    best: Solution                        # Best solution (primary metric)
    pareto_front: list[Solution]          # All non-dominated solutions
    history: list[EvaluationRecord]       # Full evaluation history
    stats: OptimizationStats              # Timing, cost, convergence stats
    stop_reason: str                      # Why optimization stopped
    config: GEPAConfig                    # Configuration used

    # Multi-task specific
    per_task: dict[str, TaskResult] | None

    # Generalization specific
    train_score: float | None
    val_score: float | None
    generalization_gap: float | None

    def plot_convergence(self) -> Figure:
        """Plot convergence curve."""
        ...

    def plot_pareto(self) -> Figure:
        """Plot Pareto frontier (2D or 3D)."""
        ...

    def export(self, path: str, format: str = "json"):
        """Export results to file."""
        ...
```
Advanced: Custom LLM Integration
from gepa.llm import LLMProvider, LLMResponse
class MyCustomLLM(LLMProvider):
"""Custom LLM provider for GEPA."""
def __init__(self, model_name: str, api_key: str):
self.model_name = model_name
self.api_key = api_key
async def generate(
self,
prompt: str,
temperature: float = 0.8,
max_tokens: int = 4096,
stop: list[str] | None = None,
) -> LLMResponse:
# Call your LLM API here
response = await my_api_call(
model=self.model_name,
prompt=prompt,
temperature=temperature,
max_tokens=max_tokens,
)
return LLMResponse(
text=response.text,
usage={"input_tokens": response.input_tokens, "output_tokens": response.output_tokens},
model=self.model_name,
)
# Use custom LLM with GEPA
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
llm=MyCustomLLM("my-model", "my-api-key"),
)
Advanced: Multi-Artifact Co-Evolution
```
from gepa import co_optimize, Artifact, Metric

# Co-evolve a prompt and a post-processor together
prompt_artifact = Artifact(
    name="system_prompt",
    template="You are a helpful assistant. {{INSTRUCTIONS}}",
    language="text",
)
processor_artifact = Artifact(
    name="output_processor",
    template="def process(raw_output: str) -> str:\n    {{BODY}}",
    language="python",
)

result = co_optimize(
    artifacts=[prompt_artifact, processor_artifact],
    evaluate=eval_pipeline,  # Evaluation uses both artifacts
    metrics=[Metric("quality", direction="maximize")],
    dependencies={
        "output_processor": ["system_prompt"],  # Processor depends on prompt
    },
    max_iterations=50,
)

print(f"Best prompt: {result.artifacts['system_prompt'].best.code}")
print(f"Best processor: {result.artifacts['output_processor'].best.code}")
```
13. Results & Benchmarks
**Headline Results:** GEPA achieves state-of-the-art performance across multiple domains, often surpassing specialized systems designed for specific tasks.
Benchmark Results Summary
| Benchmark | Baseline | GEPA Result | Improvement | Notes |
|---|---|---|---|---|
| Claude Code Bleve | 79.3% | 100% | +20.7pp | Perfect score on all test cases |
| ARC-AGI v1 | 32.5% | 89.5% | +57.0pp | Generalization mode with train/val split |
| AIME 2025 | 46.67% | 60% | +13.33pp | Mathematical reasoning benchmark |
| Circle Packing n=26 | Previous SOTA | New Record | Beat AlphaEvolve | Score 2.63594 (matches LLM4AD record) |
| CloudCast Routing | Baseline routing | 40.2% savings | -40.2% cost | Cloud infrastructure routing optimization |
Detailed Results: ARC-AGI v1
| Method | Train Accuracy | Val Accuracy | Gap | Eval Calls |
|---|---|---|---|---|
| GEPA (Generalization) | 94.2% | 89.5% | 4.7% | 1,200 |
| GEPA (Single-Task) | 97.1% | 82.3% | 14.8% | 800 |
| AlphaEvolve | 91.0% | 85.0% | 6.0% | 2,500 |
| OpenEvolve | 88.5% | 80.2% | 8.3% | 1,800 |
| FunSearch | 78.0% | 72.5% | 5.5% | 5,000+ |
Detailed Results: Circle Packing
| n | Previous Best | GEPA Result | Method |
|---|---|---|---|
| 20 | 2.52040 | 2.52040 | Matched known optimal |
| 22 | 2.56287 | 2.56290 | Slight improvement |
| 24 | 2.60240 | 2.60245 | Slight improvement |
| 26 | 2.63590 (AlphaEvolve) | 2.63594 | New record |
Cost Efficiency
| Benchmark | Total LLM Cost | Eval Calls | Wall Time | Cost per % Improvement |
|---|---|---|---|---|
| Claude Code Bleve | $12.50 | 85 | 45 min | $0.60/pp |
| ARC-AGI v1 | $180.00 | 1,200 | 8 hours | $3.16/pp |
| AIME 2025 | $95.00 | 400 | 3 hours | $7.13/pp |
| Circle Packing n=26 | $45.00 | 300 | 2 hours | N/A |
| CloudCast Routing | $250.00 | 800 | 12 hours | $6.22/pp |
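The last column is simply total LLM cost divided by the percentage-point improvement from the headline table. A quick standalone check, with the values copied from the tables above:

```python
# Cost per percentage point of improvement: total_cost / improvement_pp.
# Costs and scores are taken from the benchmark tables above.
runs = {
    "Claude Code Bleve": (12.50, 100.0 - 79.3),   # $12.50, +20.7pp
    "ARC-AGI v1":        (180.00, 89.5 - 32.5),   # $180.00, +57.0pp
    "AIME 2025":         (95.00, 60.0 - 46.67),   # $95.00, +13.33pp
}

for name, (cost, pp) in runs.items():
    print(f"{name}: ${cost / pp:.2f}/pp")
```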
14. Case Studies
Case Study 1: Claude Code Bleve (79.3% → 100%)
GEPA optimized the Bleve search engine integration code for Claude Code, achieving a perfect score on all test cases.
**Setup:** Single-task mode, `claude-sonnet-4-20250514` for reflection, 85 evaluation calls over 45 minutes.
Optimization trajectory:
- Iteration 0: Seed solution from Claude Code baseline: 79.3%
- Iteration 12: ASI identified edge case in Unicode handling → fix improved to 88.1%
- Iteration 28: ASI identified timeout in large document indexing → batch processing fix improved to 94.5%
- Iteration 41: ASI identified race condition in concurrent search → mutex fix improved to 97.8%
- Iteration 58: ASI identified rounding error in relevance scoring → float64 fix achieved 100%
Case Study 2: ARC-AGI v1 (32.5% → 89.5%)
GEPA used generalization mode with an 80/20 train/val split to optimize an ARC-AGI solver.
Key strategies employed by the reflection engine:
- Pattern decomposition: The LLM learned to decompose ARC patterns into geometric primitives
- Transformation library: Built up a library of reusable transformations over iterations
- Cross-task transfer: Insights from color mapping tasks transferred to spatial reasoning tasks
- ASI-driven debugging: Execution traces showed where the solver misidentified patterns
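The 80/20 split and the generalization gap reported in the results table are straightforward to reproduce in isolation; a standalone sketch (the task IDs are placeholders):

```python
import random

tasks = [f"arc_task_{i:03d}" for i in range(100)]  # placeholder task IDs
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(tasks)

split = int(0.8 * len(tasks))
train, val = tasks[:split], tasks[split:]
assert len(train) == 80 and len(val) == 20

# Generalization gap = train score - val score; generalization mode selects
# candidates by val score to keep this gap small.
train_score, val_score = 94.2, 89.5  # from the ARC-AGI results table
gap = round(train_score - val_score, 1)
print(f"gap = {gap}pp")  # -> gap = 4.7pp
```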
Case Study 3: CloudCast Routing (40.2% Cost Savings)
GEPA optimized cloud infrastructure routing rules, treating the routing configuration as a YAML artifact.
```
# Artifact: YAML routing configuration
artifact = Artifact(
    name="routing_config",
    language="yaml",
    template="""
routing:
  default_region: {{DEFAULT_REGION}}
  rules:
{{ROUTING_RULES}}
  fallback:
    strategy: {{FALLBACK_STRATEGY}}
    timeout_ms: {{TIMEOUT}}
""",
)

# Multi-objective optimization: minimize cost, maximize latency SLA compliance
result = optimize(
    artifact=artifact,
    evaluate=eval_routing,
    metrics=[
        Metric("cost", direction="minimize", weight=0.6),
        Metric("sla_compliance", direction="maximize", weight=0.4),
    ],
    config=GEPAConfig(
        engine=EngineConfig(population_size=40),
        reflection=ReflectionConfig(minibatch_size=2),
    ),
)
```
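`eval_routing` is user-supplied; a minimal standalone sketch of its shape, with the traffic simulator stubbed out by a toy model and the result modeled as a plain dict (all names and the scoring logic are hypothetical):

```python
def simulate_traffic(config_yaml: str) -> tuple[float, float]:
    # Hypothetical simulator: returns (monthly_cost_usd, sla_compliance_frac).
    # A real implementation would replay traffic against the routing rules.
    cost = 1000.0 - 2.0 * config_yaml.count("spot")  # toy cost model
    sla = 0.99 if "fallback" in config_yaml else 0.90
    return cost, sla

def eval_routing(config_yaml: str) -> dict:
    cost, sla = simulate_traffic(config_yaml)
    side_info = []
    if sla < 0.95:
        side_info.append("SLA below 95%: no fallback strategy configured")
    # Two metrics, matching the Metric("cost", minimize) and
    # Metric("sla_compliance", maximize) declarations above.
    return {"scores": {"cost": cost, "sla_compliance": sla},
            "side_info": side_info}

print(eval_routing("routing:\n  rules: [spot]\n  fallback: {strategy: retry}"))
```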
15. Comparison with Other Systems
| Feature | GEPA | AlphaEvolve | OpenEvolve | FunSearch | ShinkaEvolve |
|---|---|---|---|---|---|
| Artifact Types | Any text | Code | Code | Code | Code |
| Declarative API | Yes | No | Partial | No | Partial |
| ASI Feedback | First-class | Basic errors | Basic errors | Score only | Errors + traces |
| Pareto Multi-Objective | Native | Weighted sum | Weighted sum | Single | Single |
| Multi-Task Mode | Yes | No | No | No | No |
| Generalization Mode | Yes | No | No | No | No |
| Seedless Bootstrap | Yes | Requires seed | Requires seed | Requires seed | Optional seed |
| Composable Stopping | Yes | Fixed | Fixed | Fixed | Configurable |
| Content-Hash Cache | Yes | Yes | Yes | Yes | Yes |
| Open Source | Yes | No | Yes | No | Yes |
16. Limitations & Future Work
Current Limitations
- LLM Dependency: Quality of mutations is bounded by the capability of the underlying LLM. Weaker models produce weaker mutations.
- Cost: Multi-task and generalization modes can be expensive due to the number of evaluation calls required.
- ASI Design Burden: Effective ASI requires domain-specific instrumentation of the evaluation function. Users must think carefully about what diagnostic information to expose.
- Pareto Scalability: With more than 4-5 objectives, the Pareto frontier grows exponentially and most solutions become non-dominated (the "curse of dimensionality" for multi-objective optimization).
- Reflection Context Window: Very large candidates or extensive ASI can exceed LLM context limits, requiring truncation that may lose important information.
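The Pareto scalability point is easy to see empirically: with uniformly random points, the fraction that is non-dominated rises quickly as the number of objectives grows. A small standalone demo:

```python
import random

def nondominated_fraction(n_points: int, n_objectives: int, seed: int = 0) -> float:
    """Fraction of random points not dominated by any other point
    (maximizing all objectives)."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(n_objectives))
           for _ in range(n_points)]

    def dominated(p):
        # q dominates p if q is at least as good in every objective
        # (ties are measure-zero for continuous random values).
        return any(all(q[i] >= p[i] for i in range(n_objectives)) and q != p
                   for q in pts)

    return sum(not dominated(p) for p in pts) / n_points

for m in (2, 3, 5, 8):
    print(f"{m} objectives: {nondominated_fraction(200, m):.0%} non-dominated")
```

With only 2 objectives a handful of points form the front; by 8 objectives most of the population is non-dominated, which is exactly why selection pressure collapses.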
Future Directions
- Auto-ASI: Automatically instrument evaluation functions to extract useful ASI without manual effort.
- Hierarchical Evolution: Evolve meta-strategies (mutation operators, selection criteria) alongside the artifact itself.
- Distributed Evolution: Distribute the evolutionary search across multiple machines for large-scale problems.
- Interactive Mode: Human-in-the-loop evolution where users can guide the search direction mid-run.
- Transfer Learning: Warm-start optimization from similar previously-solved problems.
- Formal Verification: Integrate formal verification tools as ASI producers for safety-critical applications.
**Summary:** GEPA represents a significant step toward general-purpose LLM-driven optimization. Its declarative API, first-class ASI support, native Pareto optimization, and composable configuration make it the most flexible evolutionary optimization framework available. The benchmark results demonstrate that a well-designed general system can match or exceed specialized tools across diverse domains.