
GEPA: Optimize Anything

Declarative LLM-Driven Evolutionary Optimization for Text Artifacts
Authors: Lakshya A Agrawal, Donghyun Lee et al.
Affiliations: UC Berkeley, Stanford
Date: February 2026
Install: pip install gepa
License: Open Source

**Key Insight:** GEPA provides a declarative API for optimizing ANY text artifact — code, prompts, agent configurations, mathematical proofs — via LLM-driven evolutionary search with structured diagnostic feedback (Actionable Side Information). It achieves state-of-the-art results across coding, mathematics, and optimization benchmarks.

Table of Contents

  1. Overview & Motivation
  2. Installation & Quick Start
  3. System Architecture
  4. Optimization Modes
  5. Actionable Side Information (ASI)
  6. Pareto Frontier Management
  7. Reflection-Driven Mutation
  8. Seedless Mode
  9. Stopping Criteria
  10. Configuration System
  11. Evaluation Pipeline
  12. API Reference & Code Examples

1. Overview & Motivation

GEPA (General-purpose Evolutionary Program Architecture) addresses a fundamental limitation in LLM-driven optimization: existing systems are tightly coupled to specific artifact types (code, prompts, etc.) and lack a unified framework for handling diagnostic feedback. GEPA introduces a declarative API where users specify what to optimize and how to measure quality, while the system handles the evolutionary search process automatically.

Core Design Principles

  • Artifact Agnosticism: Any text-representable artifact can be optimized — Python functions, prompt templates, YAML configurations, agent policies, mathematical expressions, or even entire programs.
  • Declarative Over Imperative: Users define objectives and constraints, not search procedures. The engine selects appropriate mutation strategies, population management, and stopping criteria.
  • Diagnostic-First: Actionable Side Information (ASI) is a first-class concept. Every evaluation produces structured feedback (traces, errors, images, metrics) that directly informs the next mutation cycle.
  • Pareto Optimality: Multi-objective optimization is native. The population maintains a Pareto frontier of non-dominated solutions rather than a single "best" solution.
  • Reproducibility: Content-hash-based caching ensures that re-evaluating the same artifact produces identical results without redundant computation.

What GEPA Optimizes

Code

Python functions, algorithms, data structures, entire modules. Evaluated via test suites, benchmarks, or custom metrics.


Prompts

System prompts, few-shot examples, chain-of-thought templates. Evaluated against ground-truth datasets or human preferences.


Agent Configs

Multi-agent orchestration YAML, tool selection policies, routing rules. Evaluated via task completion rates.


Mathematical Expressions

Loss functions, optimization objectives, heuristic formulas. Evaluated via numerical accuracy or convergence speed.


Data Pipelines

ETL configurations, feature engineering scripts, preprocessing chains. Evaluated via downstream model performance.


Hybrid Artifacts

Combinations of code + prompts + configs. Co-evolved with inter-dependency tracking.


2. Installation & Quick Start

Installation

# Basic installation
pip install gepa

# With all optional dependencies
pip install gepa[all]

# With specific LLM backends
pip install gepa[openai,anthropic,google]

# Development installation
git clone https://github.com/gepa-ai/gepa.git
cd gepa
pip install -e ".[dev]"

Minimal Example: Optimize a Sorting Algorithm

from gepa import optimize, Artifact, Metric, EvalResult

# Define the artifact to optimize
artifact = Artifact(
    name="sort_function",
    template="""
def sort(arr: list[int]) -> list[int]:
    # Your sorting implementation here
    {{IMPLEMENTATION}}
""",
    language="python"
)

# Define the evaluation metric
def evaluate_sort(candidate: str, task: dict) -> EvalResult:
    """Evaluate sorting correctness and performance."""
    import time
    exec_globals = {}
    exec(candidate, exec_globals)
    sort_fn = exec_globals['sort']

    test_cases = task['test_cases']
    correct = 0
    total_time = 0.0
    results = []

    for tc in test_cases:
        start = time.perf_counter()
        result = sort_fn(tc['input'].copy())
        elapsed = time.perf_counter() - start
        total_time += elapsed
        results.append(result)
        if result == tc['expected']:
            correct += 1

    accuracy = correct / len(test_cases)
    avg_time = total_time / len(test_cases)

    return EvalResult(
        scores={"accuracy": accuracy, "speed": 1.0 / (avg_time + 1e-9)},
        side_info={
            "failed_cases": [tc for tc, r in zip(test_cases, results) if r != tc['expected']],
            "avg_time_ms": avg_time * 1000,
        }
    )

# Run optimization
result = optimize(
    artifact=artifact,
    evaluate=evaluate_sort,
    metrics=[
        Metric("accuracy", direction="maximize", weight=0.8),
        Metric("speed", direction="maximize", weight=0.2),
    ],
    max_iterations=50,
    llm="claude-sonnet-4-20250514",
)

print(f"Best solution: {result.best.score}")
print(result.best.code)

**Note:** The `optimize()` function is the primary entry point. It returns an `OptimizationResult` containing the Pareto frontier, best solution per metric, full history, and timing statistics.

3. System Architecture

+-----------------------------------------------------------------------------------+
|                              GEPA System Architecture                              |
+-----------------------------------------------------------------------------------+
|                                                                                    |
|  +------------------+     +-------------------+     +-------------------------+    |
|  |   User API       |     |   GEPAConfig      |     |   Artifact Registry     |    |
|  |                  |---->|                   |---->|                         |    |
|  |  optimize()      |     |  EngineConfig     |     |  Templates, Validators  |    |
|  |  Artifact()      |     |  ReflectionConfig |     |  Language Specs         |    |
|  |  Metric()        |     |  MergeConfig      |     |  Seed Solutions         |    |
|  +------------------+     |  RefinerConfig    |     +-------------------------+    |
|                           +-------------------+                                    |
|                                    |                                               |
|                                    v                                               |
|  +------------------------------------------------------------------------+        |
|  |                        Evolution Engine                                 |        |
|  |                                                                         |        |
|  |  +-------------------+    +--------------------+    +----------------+  |        |
|  |  | Population Manager|    | Pareto Frontier    |    | History Store  |  |        |
|  |  |                   |    |                    |    |                |  |        |
|  |  | - Candidates[]    |    | - Non-dominated    |    | - All evals   |  |        |
|  |  | - Islands[]       |    |   solutions        |    | - ASI records |  |        |
|  |  | - Selection       |    | - Dominance checks |    | - Genealogy   |  |        |
|  |  +-------------------+    +--------------------+    +----------------+  |        |
|  |           |                        |                       |            |        |
|  |           v                        v                       v            |        |
|  |  +------------------------------------------------------------------+  |        |
|  |  |                    Reflection Engine                               |  |        |
|  |  |                                                                    |  |        |
|  |  |  1. Select parent from Pareto frontier                            |  |        |
|  |  |  2. Sample minibatch of 2-3 evaluation examples                   |  |        |
|  |  |  3. Attach ASI (traces, errors, images) to reflection prompt      |  |        |
|  |  |  4. LLM proposes targeted mutation                                |  |        |
|  |  |  5. Validate & insert into population                            |  |        |
|  |  +------------------------------------------------------------------+  |        |
|  |           |                                                             |        |
|  |           v                                                             |        |
|  |  +------------------------------------------------------------------+  |        |
|  |  |                    Evaluation Pipeline                             |  |        |
|  |  |                                                                    |  |        |
|  |  |  [Cache Check] --> [Sandbox Exec] --> [Metric Compute] --> [ASI]  |  |        |
|  |  |       |                  |                   |               |     |  |        |
|  |  |  Content Hash      Docker/Process        Multi-metric     Traces  |  |        |
|  |  |  Dedup              Isolation             Aggregation      Errors  |  |        |
|  |  |                     max_workers           Pareto update    Images  |  |        |
|  |  +------------------------------------------------------------------+  |        |
|  +------------------------------------------------------------------------+        |
|                                    |                                               |
|                                    v                                               |
|  +------------------------------------------------------------------------+        |
|  |                     Stopping Controller                                 |        |
|  |                                                                         |        |
|  |  MaxMetricCalls | Timeout | NoImprovement | ScoreThreshold | Composite  |        |
|  +------------------------------------------------------------------------+        |
|                                    |                                               |
|                                    v                                               |
|  +------------------------------------------------------------------------+        |
|  |                     OptimizationResult                                  |        |
|  |                                                                         |        |
|  |  .best         - Best solution per primary metric                       |        |
|  |  .pareto_front - All non-dominated solutions                            |        |
|  |  .history      - Full evaluation history with ASI                       |        |
|  |  .stats        - Timing, cost, convergence metrics                      |        |
|  +------------------------------------------------------------------------+        |
+-----------------------------------------------------------------------------------+

Component Interactions

The architecture follows a clean separation of concerns:

  1. User API Layer: Declarative interface where users define artifacts, metrics, and evaluation functions. No knowledge of evolutionary internals required.
  2. Configuration Layer: GEPAConfig with sub-configs for engine, reflection, merge, and refinement. Sensible defaults with full override capability.
  3. Evolution Engine: Core loop managing population, Pareto frontier, and genealogy tracking. Orchestrates reflection and evaluation cycles.
  4. Reflection Engine: LLM-powered mutation generator that reads parent candidates, ASI feedback, and optimization history to propose targeted improvements.
  5. Evaluation Pipeline: Parallel execution with content-hash caching, sandbox isolation, multi-metric aggregation, and ASI extraction.
  6. Stopping Controller: Composable stopping criteria with AND/OR logic for flexible termination.
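The core loop run by the Evolution Engine can be captured in a few lines. This is a deliberately simplified, single-objective sketch — the `evolve`, `mutate`, `evaluate`, and `stop` names are illustrative stand-ins for GEPA's reflection engine, evaluation pipeline, and stopping controller, not its actual API:

```python
def evolve(seed, evaluate, mutate, stop):
    """Minimal single-objective evolution loop: select, mutate, evaluate, update."""
    best = (seed, evaluate(seed))            # (candidate, score)
    history = []
    while not stop(history):
        child = mutate(best[0])              # reflection-driven mutation in GEPA
        score = evaluate(child)
        history.append(score)
        if score > best[1]:                  # keep the improved candidate
            best = (child, score)
    return best
```

With multiple metrics, the single `best` generalizes to a Pareto frontier of non-dominated solutions, and parent selection draws from that frontier rather than a single champion.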

4. Optimization Modes

GEPA supports three distinct optimization modes, each addressing different real-world scenarios.

4.1 Single-Task Optimization

The simplest mode: optimize an artifact against a single task definition. This is the standard evolutionary optimization scenario.

from gepa import optimize, Artifact, Metric, SingleTaskConfig

result = optimize(
    artifact=Artifact(
        name="heuristic",
        template="def heuristic(state: dict) -> float:\n    {{BODY}}",
    ),
    evaluate=my_eval_fn,
    metrics=[Metric("score", direction="maximize")],
    task={"problem_instance": load_instance("tsp_100")},
    config=SingleTaskConfig(
        population_size=20,
        max_iterations=100,
        reflection_minibatch_size=3,
    ),
)

**Use Case:** Optimizing a heuristic for a specific TSP instance, tuning a prompt for a fixed evaluation set, or improving an algorithm for a particular benchmark.

4.2 Multi-Task Optimization (Cross-Transfer)

Multi-task mode optimizes a single artifact across multiple tasks simultaneously, enabling cross-transfer learning. Solutions that perform well on one task can transfer insights to others.

from gepa import optimize, Artifact, Metric, MultiTaskConfig

tasks = [
    {"name": "tsp_50", "instance": load_instance("tsp_50")},
    {"name": "tsp_100", "instance": load_instance("tsp_100")},
    {"name": "tsp_200", "instance": load_instance("tsp_200")},
]

result = optimize(
    artifact=Artifact(name="tsp_heuristic", template="..."),
    evaluate=eval_tsp,
    metrics=[Metric("tour_length", direction="minimize")],
    tasks=tasks,
    config=MultiTaskConfig(
        population_size=30,
        cross_transfer=True,          # Enable cross-task solution transfer
        transfer_frequency=5,          # Transfer every 5 iterations
        min_improvement_for_transfer=0.01,
    ),
)

# Access per-task results
for task_name, task_result in result.per_task.items():
    print(f"{task_name}: best={task_result.best.score:.4f}")

4.3 Generalization Mode (Train+Val)

Generalization mode splits tasks into training and validation sets, optimizing for generalization rather than overfitting to specific instances.

from gepa import optimize, Artifact, Metric, GeneralizationConfig

all_tasks = load_benchmark_tasks("arc-agi-v1")
train_tasks = all_tasks[:80]
val_tasks = all_tasks[80:]

result = optimize(
    artifact=Artifact(name="arc_solver", template="..."),
    evaluate=eval_arc,
    metrics=[Metric("accuracy", direction="maximize")],
    train_tasks=train_tasks,
    val_tasks=val_tasks,
    config=GeneralizationConfig(
        population_size=50,
        val_frequency=10,            # Validate every 10 iterations
        early_stopping_patience=20,  # Stop if val doesn't improve for 20 iters
        overfitting_threshold=0.15,  # Alert if train-val gap exceeds 15%
    ),
)

print(f"Train accuracy: {result.train_score:.3f}")
print(f"Val accuracy:   {result.val_score:.3f}")
print(f"Generalization gap: {result.generalization_gap:.3f}")

5. Actionable Side Information (ASI)

ASI is GEPA's most distinctive innovation. Instead of treating evaluation as a black-box score, ASI captures structured diagnostic feedback that the reflection engine uses to propose targeted mutations.

ASI Types

| ASI Type | Description | Example | Use in Reflection |
|---|---|---|---|
| TraceASI | Execution traces showing step-by-step behavior | Function call logs, variable states at each step | LLM identifies where execution diverges from expected behavior |
| ErrorASI | Exception details with full stack traces | TypeError, IndexError with line numbers | LLM directly fixes the error-causing code |
| ImageASI | Visual outputs (plots, renders, diagrams) | Circle packing visualization, loss curve | Multimodal LLM analyzes visual quality |
| MetricASI | Detailed metric breakdowns beyond the primary score | Per-test-case scores, timing breakdowns | LLM focuses on worst-performing sub-metrics |
| ComparisonASI | Diff between candidate and reference solution | Output mismatches, behavioral differences | LLM targets specific discrepancies |
| TextASI | Free-form text feedback | Human annotations, LLM-judge feedback | LLM incorporates qualitative feedback into mutation |

Implementing Custom ASI

from gepa import EvalResult, TraceASI, ErrorASI, MetricASI, ComparisonASI

def evaluate_with_asi(candidate: str, task: dict) -> EvalResult:
    """Evaluation function that produces rich ASI feedback."""
    try:
        # Execute candidate
        exec_globals = {}
        exec(candidate, exec_globals)
        solve_fn = exec_globals['solve']

        # Collect execution trace
        trace = []
        original_print = print

        def traced_print(*args, **kwargs):
            trace.append(" ".join(str(a) for a in args))
            original_print(*args, **kwargs)

        import builtins
        builtins.print = traced_print
        try:
            result = solve_fn(task['input'])
        finally:
            builtins.print = original_print  # restore even if solve_fn raises

        # Compute metrics
        accuracy = compute_accuracy(result, task['expected'])
        efficiency = compute_efficiency(result)

        # Build ASI
        side_info = [
            TraceASI(trace=trace, label="execution_trace"),
            MetricASI(
                metrics={
                    "accuracy": accuracy,
                    "efficiency": efficiency,
                    "output_length": len(str(result)),
                },
                label="detailed_metrics"
            ),
        ]

        if accuracy < 1.0:
            mismatches = find_mismatches(result, task['expected'])
            side_info.append(
                ComparisonASI(
                    expected=str(task['expected']),
                    actual=str(result),
                    diff=mismatches,
                    label="output_comparison"
                )
            )

        return EvalResult(
            scores={"accuracy": accuracy, "efficiency": efficiency},
            side_info=side_info,
        )

    except Exception as e:
        import traceback
        return EvalResult(
            scores={"accuracy": 0.0, "efficiency": 0.0},
            side_info=[
                ErrorASI(
                    error_type=type(e).__name__,
                    message=str(e),
                    traceback=traceback.format_exc(),
                    label="runtime_error",
                )
            ],
        )

ASI in the Reflection Prompt

When the reflection engine generates a mutation, ASI is injected into the LLM prompt as structured context:

# Internal reflection prompt template (simplified)
REFLECTION_PROMPT = """
You are optimizing a {artifact_type} to maximize {objectives}.

## Current Candidate (Score: {score})

{candidate_code}

## Diagnostic Feedback (ASI)
{formatted_asi}

## Optimization History
- Iteration {iter}: {score_history}
- Recent improvements: {recent_improvements}
- Stagnation detector: {stagnation_status}

## Reflection Minibatch ({minibatch_size} examples)
{minibatch_details}

## Task
Analyze the diagnostic feedback and propose a TARGETED improvement.
Focus on the specific failure modes revealed by the ASI.
Output the complete improved candidate.
"""

6. Pareto Frontier Management

GEPA maintains a Pareto frontier of non-dominated solutions, enabling multi-objective optimization without manual weight tuning.

Pareto Dominance

A solution x dominates solution y (written x ≻ y) if and only if:

$$ x \succ y \iff \forall i:\ f_i(x) \ge f_i(y) \ \land\ \exists j:\ f_j(x) > f_j(y) $$

where $f_i$ is the i-th objective function (with direction-normalized values so that higher is always better).

Pareto Frontier Maintenance

$$ PF = \{\, x \in P : \nexists\, y \in P \text{ such that } y \succ x \,\} $$

The Pareto frontier PF is the set of all non-dominated solutions in the population P. When a new candidate is evaluated:

  1. Check if any existing frontier member dominates the new candidate → if yes, discard.
  2. Check if the new candidate dominates any existing frontier members → if yes, remove dominated members.
  3. If neither dominates, add the new candidate to the frontier (it represents a new trade-off).

Hypervolume Indicator

GEPA uses the hypervolume indicator to measure the quality of the Pareto frontier:

$$ HV(PF, r) = \Lambda\big(\{\, q \in \mathbb{R}^m : \exists\, p \in PF \text{ such that } p \ge q \ge r \,\}\big) $$

where $\Lambda$ denotes the Lebesgue measure (volume), r is the reference point, and m is the number of objectives. The hypervolume captures both convergence to the true Pareto front and diversity along it.
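For two objectives, the hypervolume reduces to a sum of rectangle areas. A minimal sketch for the maximization case, assuming `front` is already non-dominated (the function name is illustrative, not GEPA API):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a non-dominated 2-D maximization front w.r.t. point ref."""
    hv, prev_x = 0.0, ref[0]
    # Sorting by the first objective makes the second strictly decreasing
    # on a non-dominated front, so each point contributes one rectangle.
    for x, y in sorted(front):
        hv += (x - prev_x) * (y - ref[1])
        prev_x = x
    return hv
```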

Selection Strategy

# gepa/engine.py (simplified)
import random

import numpy as np

class ParetoFrontier:
    def __init__(self, objectives: list[str], directions: list[str]):
        self.objectives = objectives
        self.directions = directions  # "maximize" or "minimize"
        self.frontier: list[Solution] = []

    def dominates(self, a: Solution, b: Solution) -> bool:
        """Check if solution a dominates solution b."""
        at_least_one_better = False
        for obj, direction in zip(self.objectives, self.directions):
            a_val = a.scores[obj]
            b_val = b.scores[obj]
            if direction == "minimize":
                a_val, b_val = -a_val, -b_val
            if a_val < b_val:
                return False
            if a_val > b_val:
                at_least_one_better = True
        return at_least_one_better

    def update(self, candidate: Solution) -> bool:
        """Add candidate to frontier if non-dominated. Returns True if added."""
        # Check if any frontier member dominates candidate
        for member in self.frontier:
            if self.dominates(member, candidate):
                return False  # Dominated, discard

        # Remove frontier members dominated by candidate
        self.frontier = [
            m for m in self.frontier
            if not self.dominates(candidate, m)
        ]
        self.frontier.append(candidate)
        return True

    def select_parent(self, strategy: str = "crowding") -> Solution:
        """Select a parent from the frontier for mutation."""
        if strategy == "crowding":
            # Prefer solutions in sparse regions of the frontier
            distances = self._crowding_distances()
            probs = distances / distances.sum()
            return np.random.choice(self.frontier, p=probs)
        elif strategy == "random":
            return random.choice(self.frontier)
        elif strategy == "tournament":
            a, b = random.sample(self.frontier, 2)
            return a if self._hypervolume_contribution(a) > self._hypervolume_contribution(b) else b

Crowding Distance

To maintain diversity on the Pareto frontier, GEPA computes crowding distances and preferentially selects parents from sparse regions:

$$ CD(i) = \sum_{m} \frac{f_m(i+1) - f_m(i-1)}{f_m^{\max} - f_m^{\min}} $$

Where solutions are sorted by each objective and boundary solutions receive infinite crowding distance.
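A direct implementation of this formula, operating on raw per-solution score tuples rather than GEPA's internal `Solution` objects:

```python
def crowding_distances(scores):
    """Crowding distance per solution; scores is a list of objective tuples."""
    n, m = len(scores), len(scores[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: scores[i][k])
        lo, hi = scores[order[0]][k], scores[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")  # boundary solutions
        if hi == lo:
            continue  # constant objective contributes no distance
        for pos in range(1, n - 1):
            gap = scores[order[pos + 1]][k] - scores[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist
```

Interior solutions in sparse regions accumulate larger distances and are therefore sampled more often as parents.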

7. Reflection-Driven Mutation

GEPA's reflection engine is the primary mechanism for generating improved candidates. Unlike random mutation or simple LLM rewriting, reflection uses structured diagnostic feedback to propose targeted improvements.

Reflection Pipeline

  1. Parent Selection: Choose a parent from the Pareto frontier using crowding-distance-weighted selection.
  2. Minibatch Sampling: Sample 2-3 evaluation examples where the parent performs poorly (failure-biased sampling).
  3. ASI Assembly: Gather all ASI records for the selected examples — traces, errors, metrics, comparisons.
  4. History Context: Include recent mutation history (what was tried, what worked, what didn't).
  5. LLM Reflection: Send the complete context to the LLM with instructions to analyze failures and propose a targeted fix.
  6. Validation: Parse the LLM output, validate syntax/structure, and insert into the population for evaluation.
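Step 2's failure-biased sampling can be sketched as follows. The `failure_bias` parameter mirrors the `ReflectionConfig` field; the `pass_score` threshold separating failures from passes is an assumption for illustration:

```python
import random

def sample_minibatch(examples, scores, k=3, failure_bias=0.7, pass_score=1.0):
    """Sample k examples, preferring failing ones with probability failure_bias."""
    failures = [e for e, s in zip(examples, scores) if s < pass_score]
    passes = [e for e, s in zip(examples, scores) if s >= pass_score]
    batch = []
    while len(batch) < k and (failures or passes):
        # Draw from failures with probability failure_bias when both pools remain.
        take_failure = failures and (not passes or random.random() < failure_bias)
        pool = failures if take_failure else passes
        batch.append(pool.pop(random.randrange(len(pool))))
    return batch
```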

Minibatch Strategy

**Why 2-3 examples?** Reflection minibatches of 2-3 examples provide the optimal trade-off between context richness and LLM attention capacity. Too few examples (1) risk overfitting the mutation to a single case. Too many examples (5+) dilute the LLM's attention and produce generic rather than targeted improvements.

from gepa.reflection import ReflectionEngine, ReflectionConfig

config = ReflectionConfig(
    minibatch_size=3,               # Sample 2-3 examples per reflection
    failure_bias=0.7,               # 70% chance to sample failure cases
    include_trace=True,             # Include execution traces in prompt
    include_error=True,             # Include error details in prompt
    include_comparison=True,        # Include output comparisons
    max_history_length=10,          # Include last 10 mutations in context
    temperature=0.8,                # LLM temperature for diversity
    reflection_prompt_version="v3", # Latest reflection prompt template
)

engine = ReflectionEngine(
    config=config,
    llm=LLMClient("claude-sonnet-4-20250514"),
    artifact_spec=artifact,
)

# Generate a mutation
mutation = await engine.reflect(
    parent=best_candidate,
    eval_results=parent_eval_results,
    history=optimization_history,
)

Reflection Prompt Engineering

The reflection prompt is carefully structured to maximize the quality of mutations:

# Reflection prompt structure
"""
[ROLE] You are an expert optimizer improving a {artifact_type}.

[OBJECTIVE] Maximize: {objectives_description}

[CURRENT SOLUTION] (Score: {score_summary})
{candidate_code}

[DIAGNOSTIC ANALYSIS]
--- Example 1/{minibatch_size} ---
Input: {example_input}
Expected: {expected_output}
Actual: {actual_output}
Trace: {execution_trace}
Error: {error_if_any}
Score: {per_example_score}

--- Example 2/{minibatch_size} ---
...

[MUTATION HISTORY]
Iteration {n-2}: Changed X to Y -> score improved by +0.05
Iteration {n-1}: Changed A to B -> score decreased by -0.02 (reverted)

[INSTRUCTIONS]
1. Analyze the diagnostic feedback above
2. Identify the ROOT CAUSE of failures
3. Propose a SPECIFIC, TARGETED fix (not a rewrite)
4. Explain your reasoning before the code
5. Output the complete improved solution

[OUTPUT FORMAT]
## Analysis
(your analysis here)

## Improved Solution

(complete code here)

"""

8. Seedless Mode

GEPA can bootstrap optimization from a natural language objective alone, without any seed solution. The system generates an initial population by prompting the LLM to create diverse starting points.

from gepa import optimize, Artifact, Metric, SeedlessConfig

# No seed solution provided - GEPA generates initial candidates
result = optimize(
    artifact=Artifact(
        name="bin_packing",
        description="A function that packs items into bins to minimize wasted space.",
        signature="def pack(items: list[tuple[float, float]], bin_capacity: float) -> list[list[int]]",
        language="python",
        # No template or seed - GEPA bootstraps from the description + signature
    ),
    evaluate=eval_bin_packing,
    metrics=[Metric("utilization", direction="maximize")],
    config=SeedlessConfig(
        initial_population_size=10,     # Generate 10 diverse starting points
        diversity_prompt=True,          # Instruct LLM to generate diverse solutions
        bootstrap_strategies=[
            "greedy_first_fit",
            "best_fit_decreasing",
            "random_search",
            "dynamic_programming",
            "genetic_algorithm",
        ],  # Hint different algorithmic approaches
    ),
)

Bootstrap Pipeline

  1. Strategy Enumeration: If strategies are provided, generate one candidate per strategy. Otherwise, ask the LLM to enumerate diverse approaches.
  2. Parallel Generation: Concurrently prompt the LLM to implement each strategy as a concrete solution.
  3. Validation: Each generated candidate is validated (syntax, type checking, basic execution) before insertion.
  4. Initial Evaluation: All valid candidates are evaluated to establish the initial Pareto frontier.
  5. Normal Evolution: Proceed with reflection-driven mutation from the initial population.
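Steps 1-3 of the bootstrap pipeline can be sketched as follows, where `propose` stands in for the LLM call that turns a strategy hint into candidate code:

```python
import ast
from concurrent.futures import ThreadPoolExecutor

def bootstrap_population(strategies, propose):
    """Generate one candidate per strategy in parallel; keep those that parse."""
    with ThreadPoolExecutor() as pool:
        candidates = pool.map(propose, strategies)  # parallel generation
    valid = []
    for code in candidates:
        try:
            ast.parse(code)  # syntax validation before insertion
            valid.append(code)
        except SyntaxError:
            pass
    return valid
```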

9. Stopping Criteria

GEPA provides composable stopping criteria that can be combined with AND/OR logic.

| Criterion | Description | Parameters | Example |
|---|---|---|---|
| MaxMetricCalls | Stop after N evaluation calls | max_calls: int | MaxMetricCalls(100) |
| Timeout | Stop after elapsed time | seconds: float | Timeout(3600) (1 hour) |
| NoImprovement | Stop if no improvement for N iterations | patience: int, min_delta: float | NoImprovement(20, 0.001) |
| ScoreThreshold | Stop when target score is reached | metric: str, threshold: float | ScoreThreshold("accuracy", 0.99) |
| Composite | Combine criteria with AND/OR logic | criteria: list, mode: str | See example below |

from gepa.stopping import MaxMetricCalls, Timeout, NoImprovement, ScoreThreshold, Composite

# Stop when ANY of these conditions is met (OR logic)
stopping = Composite(
    criteria=[
        MaxMetricCalls(200),                     # Hard limit: 200 evaluations
        Timeout(7200),                           # Hard limit: 2 hours
        ScoreThreshold("accuracy", 1.0),         # Perfect score reached
        Composite(                               # AND: both must be true
            criteria=[
                NoImprovement(patience=30, min_delta=0.001),
                MaxMetricCalls(50),               # At least 50 evals before early stop
            ],
            mode="AND",
        ),
    ],
    mode="OR",
)

result = optimize(
    artifact=artifact,
    evaluate=evaluate_fn,
    metrics=metrics,
    stopping=stopping,
)

print(f"Stopped because: {result.stop_reason}")
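The AND/OR composition itself is only a few lines. A minimal sketch assuming each criterion exposes a `should_stop(state)` method — the actual GEPA classes may differ:

```python
class MaxMetricCalls:
    """Stop once the evaluation budget is spent."""
    def __init__(self, max_calls):
        self.max_calls = max_calls

    def should_stop(self, state):
        return state["calls"] >= self.max_calls

class Composite:
    """Combine stopping criteria with AND/OR logic."""
    def __init__(self, criteria, mode="OR"):
        self.criteria, self.mode = criteria, mode

    def should_stop(self, state):
        results = [c.should_stop(state) for c in self.criteria]
        return all(results) if self.mode == "AND" else any(results)
```

Because `Composite` satisfies the same protocol as a leaf criterion, composites nest arbitrarily, as in the OR-of-AND example above.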

10. Configuration System

GEPA uses a hierarchical configuration system with sensible defaults at every level.

GEPAConfig Structure

from gepa.config import (
    GEPAConfig, EngineConfig, ReflectionConfig,
    MergeConfig, RefinerConfig
)

config = GEPAConfig(
    # Engine configuration
    engine=EngineConfig(
        population_size=30,           # Max population size
        elite_size=5,                 # Protected elite solutions
        tournament_size=3,            # Tournament selection size
        mutation_rate=1.0,            # Always mutate (vs crossover)
        crossover_rate=0.0,           # No crossover by default
        island_count=1,               # Number of islands (for island model)
        migration_rate=0.1,           # Migration rate between islands
        migration_interval=10,        # Migrate every N iterations
    ),

    # Reflection configuration
    reflection=ReflectionConfig(
        minibatch_size=3,             # Examples per reflection
        failure_bias=0.7,             # Bias toward failure cases
        temperature=0.8,              # LLM temperature
        max_tokens=4096,              # Max response tokens
        include_trace=True,           # Include execution traces
        include_error=True,           # Include error details
        include_comparison=True,      # Include output comparisons
        include_image=False,          # Include image ASI (multimodal)
        max_history_length=10,        # Mutation history context
        reflection_model="claude-sonnet-4-20250514",  # Model for reflection
    ),

    # Merge configuration (for crossover-like operations)
    merge=MergeConfig(
        strategy="llm_guided",        # LLM picks best parts from 2 parents
        merge_model="claude-sonnet-4-20250514",
        max_parents=2,                # Number of parents to merge
    ),

    # Refiner configuration (post-mutation polish)
    refiner=RefinerConfig(
        enabled=True,                 # Enable post-mutation refinement
        refiner_model="claude-sonnet-4-20250514",
        max_refinement_steps=2,       # Max refinement iterations
        refinement_threshold=0.9,     # Only refine if score > threshold
    ),
)

result = optimize(
    artifact=artifact,
    evaluate=evaluate_fn,
    metrics=metrics,
    config=config,
)

Configuration via YAML

# gepa_config.yaml
engine:
  population_size: 30
  elite_size: 5
  tournament_size: 3

reflection:
  minibatch_size: 3
  failure_bias: 0.7
  temperature: 0.8
  include_trace: true
  include_error: true

merge:
  strategy: llm_guided
  max_parents: 2

refiner:
  enabled: true
  max_refinement_steps: 2

stopping:
  type: composite
  mode: OR
  criteria:
    - type: max_metric_calls
      max_calls: 200
    - type: timeout
      seconds: 7200
    - type: score_threshold
      metric: accuracy
      threshold: 1.0

from gepa import optimize_from_config

result = optimize_from_config("gepa_config.yaml")

11. Evaluation Pipeline

Parallel Evaluation

GEPA evaluates candidates in parallel using a configurable worker pool:

from gepa.evaluation import EvaluationPipeline, EvalConfig

pipeline = EvaluationPipeline(
    eval_fn=evaluate,
    config=EvalConfig(
        max_workers=8,              # Parallel evaluation workers
        timeout_per_eval=60,        # Timeout per evaluation (seconds)
        retry_on_error=True,        # Retry failed evaluations once
        cache_enabled=True,         # Enable content-hash caching
        cache_backend="sqlite",     # Cache backend: sqlite, redis, memory
        sandbox="docker",           # Sandbox: docker, subprocess, none
    ),
)

# Evaluate a batch of candidates
results = await pipeline.evaluate_batch(candidates, task)

Content-Hash Caching

Evaluations are cached by content hash, so identical candidates are never re-evaluated:

import hashlib
import json

def content_hash(candidate: str, task: dict) -> str:
    """Compute a deterministic hash for a (candidate, task) pair."""
    content = candidate + json.dumps(task, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

# Cache lookup before evaluation
cache_key = content_hash(candidate_code, task)
if cache_key in cache:
    return cache[cache_key]  # Skip evaluation

# Evaluate and cache
result = evaluate(candidate_code, task)
cache[cache_key] = result

Sandbox Execution

Candidate evaluations can run in isolated sandboxes to prevent side effects. Three isolation levels are available:

  • Docker sandbox: Full container isolation with resource limits (CPU, memory, network, filesystem). Safest option for untrusted code.
  • Subprocess sandbox: Process-level isolation with timeouts. Good balance of safety and speed.
  • None: Direct execution in the main process. Fastest but no isolation. Only for trusted code.
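
A rough sketch of the subprocess option: run the candidate in a child interpreter with a wall-clock timeout, so a hang or crash cannot take down the main process. This is illustrative only; GEPA's actual sandbox also applies resource limits and serializes structured results:

```python
import subprocess
import sys

def run_sandboxed(candidate_code: str, timeout_s: float = 60.0) -> dict:
    """Execute candidate code in a separate Python process with a timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate_code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # A hung candidate is killed; report the timeout as the error
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}

result = run_sandboxed("print(2 + 2)")
# result["ok"] is True
```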

12. API Reference & Code Examples

Core Classes

[!info]- Artifact - Define what to optimize
```
class Artifact:
    """Defines a text artifact to be optimized."""

    def __init__(
        self,
        name: str,                              # Unique artifact identifier
        template: str | None = None,            # Template with {{PLACEHOLDERS}}
        seed: str | None = None,                # Initial solution (optional)
        description: str | None = None,         # Natural language description
        signature: str | None = None,           # Function signature (for code)
        language: str = "python",               # Language: python, yaml, text, etc.
        constraints: list[str] | None = None,   # Hard constraints for validation
        metadata: dict | None = None,           # Additional context
    ):
        ...

    def validate(self, candidate: str) -> ValidationResult:
        """Validate a candidate against artifact constraints."""
        ...

    def render(self, **kwargs) -> str:
        """Render template with placeholder values."""
        ...

```
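
To make the `template`/`render` contract concrete, here is a minimal stand-in for the `{{PLACEHOLDER}}` substitution (the syntax matches the examples in this document; the real `Artifact.render` would additionally validate against `constraints`):

```python
import re

def render(template: str, **values: str) -> str:
    """Substitute {{PLACEHOLDER}} markers with supplied values."""
    def sub(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing value for placeholder {key!r}")
        return values[key]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

template = "def process(raw_output: str) -> str:\n    {{BODY}}"
print(render(template, BODY="return raw_output.strip()"))
```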

[!info]- Metric - Define optimization objectives
```
class Metric:
    """Defines an optimization objective."""

    def __init__(
        self,
        name: str,                              # Metric name (must match EvalResult keys)
        direction: str = "maximize",            # "maximize" or "minimize"
        weight: float = 1.0,                    # Weight for weighted-sum aggregation
        bounds: tuple[float, float] | None = None,  # Expected (min, max) range
        primary: bool = True,                   # Is this a primary objective?
    ):
        ...

```

[!info]- EvalResult - Evaluation output
```
class EvalResult:
    """Result of evaluating a candidate."""

    def __init__(
        self,
        scores: dict[str, float],               # Metric name -> value
        side_info: list[ASI] | dict | None = None,  # Actionable Side Information
        metadata: dict | None = None,           # Additional evaluation metadata
        valid: bool = True,                     # Whether the candidate is valid
        error: str | None = None,               # Error message if invalid
    ):
        ...

```
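
For example, an evaluation function for a code artifact might score correctness and attach failing-case details as side information, so the reflection engine sees exactly which inputs broke. This sketch uses a plain dict as a stand-in for `EvalResult` and `ASI`:

```python
def evaluate(candidate_fn, test_cases):
    """Score a candidate against test cases; failures become side info (ASI)."""
    failures = []
    for inputs, expected in test_cases:
        try:
            got = candidate_fn(*inputs)
        except Exception as exc:  # runtime errors are diagnostic gold
            failures.append({"inputs": inputs, "error": repr(exc)})
            continue
        if got != expected:
            failures.append({"inputs": inputs, "expected": expected, "got": got})
    passed = len(test_cases) - len(failures)
    return {
        "scores": {"accuracy": passed / len(test_cases)},
        "side_info": failures,   # fed back to the reflection engine
        "valid": True,
    }

result = evaluate(lambda x: x * 2, [((2,), 4), ((3,), 7)])
# result["scores"]["accuracy"] == 0.5; the failing case is recorded in side_info
```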

[!info]- OptimizationResult - Final output
```
class OptimizationResult:
    """Complete optimization result."""

    best: Solution                     # Best solution (primary metric)
    pareto_front: list[Solution]       # All non-dominated solutions
    history: list[EvaluationRecord]    # Full evaluation history
    stats: OptimizationStats           # Timing, cost, convergence stats
    stop_reason: str                   # Why optimization stopped
    config: GEPAConfig                 # Configuration used

    # Multi-task specific
    per_task: dict[str, TaskResult] | None
    # Generalization specific
    train_score: float | None
    val_score: float | None
    generalization_gap: float | None

    def plot_convergence(self) -> Figure:
        """Plot convergence curve."""
        ...

    def plot_pareto(self) -> Figure:
        """Plot Pareto frontier (2D or 3D)."""
        ...

    def export(self, path: str, format: str = "json"):
        """Export results to file."""
        ...

```

Advanced: Custom LLM Integration

from gepa.llm import LLMProvider, LLMResponse

class MyCustomLLM(LLMProvider):
    """Custom LLM provider for GEPA."""

    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.api_key = api_key

    async def generate(
        self,
        prompt: str,
        temperature: float = 0.8,
        max_tokens: int = 4096,
        stop: list[str] | None = None,
    ) -> LLMResponse:
        # Call your LLM API here
        response = await my_api_call(
            model=self.model_name,
            prompt=prompt,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return LLMResponse(
            text=response.text,
            usage={"input_tokens": response.input_tokens, "output_tokens": response.output_tokens},
            model=self.model_name,
        )

# Use custom LLM with GEPA
result = optimize(
    artifact=artifact,
    evaluate=evaluate_fn,
    metrics=metrics,
    llm=MyCustomLLM("my-model", "my-api-key"),
)

Advanced: Multi-Artifact Co-Evolution

from gepa import co_optimize, Artifact, Metric

# Co-evolve a prompt and a post-processor together
prompt_artifact = Artifact(
    name="system_prompt",
    template="You are a helpful assistant. {{INSTRUCTIONS}}",
    language="text",
)

processor_artifact = Artifact(
    name="output_processor",
    template="def process(raw_output: str) -> str:\n    {{BODY}}",
    language="python",
)

result = co_optimize(
    artifacts=[prompt_artifact, processor_artifact],
    evaluate=eval_pipeline,        # Evaluation uses both artifacts
    metrics=[Metric("quality", direction="maximize")],
    dependencies={
        "output_processor": ["system_prompt"],  # Processor depends on prompt
    },
    max_iterations=50,
)

print(f"Best prompt: {result.artifacts['system_prompt'].best.code}")
print(f"Best processor: {result.artifacts['output_processor'].best.code}")

13. Results & Benchmarks

**Headline Results:** GEPA achieves state-of-the-art performance across multiple domains, often surpassing specialized systems designed for specific tasks.

Benchmark Results Summary

| Benchmark | Baseline | GEPA Result | Improvement | Notes |
|---|---|---|---|---|
| Claude Code Bleve | 79.3% | 100% | +20.7pp | Perfect score on all test cases |
| ARC-AGI v1 | 32.5% | 89.5% | +57.0pp | Generalization mode with train/val split |
| AIME 2025 | 46.67% | 60% | +13.33pp | Mathematical reasoning benchmark |
| Circle Packing n=26 | Previous SOTA | New record | Beat AlphaEvolve | Score 2.63594 (matches LLM4AD record) |
| CloudCast Routing | Baseline routing | 40.2% savings | -40.2% cost | Cloud infrastructure routing optimization |

Detailed Results: ARC-AGI v1

| Method | Train Accuracy | Val Accuracy | Gap | Eval Calls |
|---|---|---|---|---|
| GEPA (Generalization) | 94.2% | 89.5% | 4.7% | 1,200 |
| GEPA (Single-Task) | 97.1% | 82.3% | 14.8% | 800 |
| AlphaEvolve | 91.0% | 85.0% | 6.0% | 2,500 |
| OpenEvolve | 88.5% | 80.2% | 8.3% | 1,800 |
| FunSearch | 78.0% | 72.5% | 5.5% | 5,000+ |

Detailed Results: Circle Packing

| n | Previous Best | GEPA Result | Notes |
|---|---|---|---|
| 20 | 2.52040 | 2.52040 | Matched known optimal |
| 22 | 2.56287 | 2.56290 | Slight improvement |
| 24 | 2.60240 | 2.60245 | Slight improvement |
| 26 | 2.63590 (AlphaEvolve) | 2.63594 | New record |

Cost Efficiency

| Benchmark | Total LLM Cost | Eval Calls | Wall Time | Cost per pp Improvement |
|---|---|---|---|---|
| Claude Code Bleve | $12.50 | 85 | 45 min | $0.60/pp |
| ARC-AGI v1 | $180.00 | 1,200 | 8 hours | $3.16/pp |
| AIME 2025 | $95.00 | 400 | 3 hours | $7.12/pp |
| Circle Packing n=26 | $45.00 | 300 | 2 hours | N/A |
| CloudCast Routing | $250.00 | 800 | 12 hours | $6.22/pp |

14. Case Studies

Case Study 1: Claude Code Bleve (79.3% → 100%)

GEPA optimized the Bleve search engine integration code for Claude Code, achieving a perfect score on all test cases.

**Setup:** Single-task mode, `claude-sonnet-4-20250514` for reflection, 85 evaluation calls over 45 minutes.

Optimization trajectory:

  • Iteration 0: Seed solution from Claude Code baseline: 79.3%
  • Iteration 12: ASI identified edge case in Unicode handling → fix improved to 88.1%
  • Iteration 28: ASI identified timeout in large document indexing → batch processing fix improved to 94.5%
  • Iteration 41: ASI identified race condition in concurrent search → mutex fix improved to 97.8%
  • Iteration 58: ASI identified rounding error in relevance scoring → float64 fix achieved 100%

Case Study 2: ARC-AGI v1 (32.5% → 89.5%)

GEPA used generalization mode with an 80/20 train/val split to optimize an ARC-AGI solver.

Key strategies employed by the reflection engine:

  • Pattern decomposition: The LLM learned to decompose ARC patterns into geometric primitives
  • Transformation library: Built up a library of reusable transformations over iterations
  • Cross-task transfer: Insights from color mapping tasks transferred to spatial reasoning tasks
  • ASI-driven debugging: Execution traces showed where the solver misidentified patterns

Case Study 3: CloudCast Routing (40.2% Cost Savings)

GEPA optimized cloud infrastructure routing rules, treating the routing configuration as a YAML artifact.

# Artifact: YAML routing configuration
artifact = Artifact(
    name="routing_config",
    language="yaml",
    template="""
routing:
  default_region: {{DEFAULT_REGION}}
  rules:
    {{ROUTING_RULES}}
  fallback:
    strategy: {{FALLBACK_STRATEGY}}
    timeout_ms: {{TIMEOUT}}
""",
)

# Multi-objective optimization: minimize cost, maximize latency SLA compliance
result = optimize(
    artifact=artifact,
    evaluate=eval_routing,
    metrics=[
        Metric("cost", direction="minimize", weight=0.6),
        Metric("sla_compliance", direction="maximize", weight=0.4),
    ],
    config=GEPAConfig(
        engine=EngineConfig(population_size=40),
        reflection=ReflectionConfig(minibatch_size=2),
    ),
)

15. Comparison with Other Systems

| Feature | GEPA | AlphaEvolve | OpenEvolve | FunSearch | ShinkaEvolve |
|---|---|---|---|---|---|
| Artifact Types | Any text | Code | Code | Code | Code |
| Declarative API | Yes | No | Partial | No | Partial |
| ASI Feedback | First-class | Basic errors | Basic errors | Score only | Errors + traces |
| Pareto Multi-Objective | Native | Weighted sum | Weighted sum | Single | Single |
| Multi-Task Mode | Yes | No | No | No | No |
| Generalization Mode | Yes | No | No | No | No |
| Seedless Bootstrap | Yes | Requires seed | Requires seed | Requires seed | Optional seed |
| Composable Stopping | Yes | Fixed | Fixed | Fixed | Configurable |
| Content-Hash Cache | Yes | Yes | Yes | Yes | Yes |
| Open Source | Yes | No | Yes | No | Yes |

16. Limitations & Future Work

Current Limitations

  • LLM Dependency: Quality of mutations is bounded by the capability of the underlying LLM. Weaker models produce weaker mutations.
  • Cost: Multi-task and generalization modes can be expensive due to the number of evaluation calls required.
  • ASI Design Burden: Effective ASI requires domain-specific instrumentation of the evaluation function. Users must think carefully about what diagnostic information to expose.
  • Pareto Scalability: With more than 4-5 objectives, the Pareto frontier grows exponentially and most solutions become non-dominated (the "curse of dimensionality" for multi-objective optimization).
  • Reflection Context Window: Very large candidates or extensive ASI can exceed LLM context limits, requiring truncation that may lose important information.
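
The Pareto-scalability point can be seen directly in the dominance check itself: a solution is dominated only if another is at least as good on every objective and strictly better on at least one, so each added objective makes domination harder and more solutions survive the filter. A minimal sketch, assuming all objectives are maximized:

```python
def dominates(a, b):
    """True if score vector a dominates b (>= everywhere, > somewhere)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the non-dominated score vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# With 2 objectives, trade-offs prune the population...
print(len(pareto_front([(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)])))  # 3
# ...but with more objectives, fewer points are dominated and the front swells.
```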

Future Directions

  • Auto-ASI: Automatically instrument evaluation functions to extract useful ASI without manual effort.
  • Hierarchical Evolution: Evolve meta-strategies (mutation operators, selection criteria) alongside the artifact itself.
  • Distributed Evolution: Distribute the evolutionary search across multiple machines for large-scale problems.
  • Interactive Mode: Human-in-the-loop evolution where users can guide the search direction mid-run.
  • Transfer Learning: Warm-start optimization from similar previously-solved problems.
  • Formal Verification: Integrate formal verification tools as ASI producers for safety-critical applications.

**Summary:** GEPA represents a significant step toward general-purpose LLM-driven optimization. Its declarative API, first-class ASI support, native Pareto optimization, and composable configuration make it the most flexible evolutionary optimization framework available. The benchmark results demonstrate that a well-designed general system can match or exceed specialized tools across diverse domains.

← Back to Index