GEPA: Optimize Anything
Declarative LLM-Driven Evolutionary Optimization for Text Artifacts
Authors: Lakshya A Agrawal, Donghyun Lee et al.
Affiliations: UC Berkeley, Stanford
Date: February 2026
Install: pip install gepa
License: Open Source
**Key Insight:** GEPA provides a declarative API for optimizing ANY text artifact — code, prompts, agent configurations, mathematical proofs — via LLM-driven evolutionary search with structured diagnostic feedback (Actionable Side Information). It achieves state-of-the-art results across coding, mathematics, and optimization benchmarks.
Table of Contents
- Overview & Motivation
- Installation & Quick Start
- System Architecture
- Optimization Modes
- Single-Task Optimization
- Multi-Task Optimization (Cross-Transfer)
- Generalization Mode (Train+Val)
- Actionable Side Information (ASI)
- Pareto-Efficient Search
- Reflection-Driven Mutation
- Seedless Mode
- Stopping Criteria
- Configuration System
- Evaluation Pipeline
- API Reference & Code Examples
- Results & Benchmarks
- Case Studies
- Comparison with Other Systems
- Limitations & Future Work
1. Overview & Motivation
GEPA (Genetic-Pareto) addresses a fundamental limitation in LLM-driven optimization: existing systems are tightly coupled to specific artifact types (code, prompts, etc.) and lack a unified framework for handling diagnostic feedback. GEPA introduces a declarative API where users specify what to optimize and how to measure quality, while the system handles the evolutionary search process automatically.
Core Design Principles
- Artifact Agnosticism: Any text-representable artifact can be optimized — Python functions, prompt templates, YAML configurations, agent policies, mathematical expressions, or even entire programs.
- Declarative Over Imperative: Users define objectives and constraints, not search procedures. The engine selects appropriate mutation strategies, population management, and stopping criteria.
- Diagnostic-First: Actionable Side Information (ASI) is a first-class concept. Every evaluation produces structured feedback (traces, errors, images, metrics) that directly informs the next mutation cycle.
- Pareto Optimality: Multi-objective optimization is native. The population maintains a Pareto frontier of non-dominated solutions rather than a single "best" solution.
- Reproducibility: Content-hash-based caching ensures that re-evaluating an identical artifact returns the cached result, guaranteeing consistency while avoiding redundant computation.
What GEPA Optimizes
Code
Python functions, algorithms, data structures, entire modules. Evaluated via test suites, benchmarks, or custom metrics.
Prompts
System prompts, few-shot examples, chain-of-thought templates. Evaluated against ground-truth datasets or human preferences.
Agent Configs
Multi-agent orchestration YAML, tool selection policies, routing rules. Evaluated via task completion rates.
Mathematical Expressions
Loss functions, optimization objectives, heuristic formulas. Evaluated via numerical accuracy or convergence speed.
Data Pipelines
ETL configurations, feature engineering scripts, preprocessing chains. Evaluated via downstream model performance.
Hybrid Artifacts
Combinations of code + prompts + configs. Co-evolved with inter-dependency tracking.
2. Installation & Quick Start
Installation
# Basic installation
pip install gepa
# With all optional dependencies
pip install gepa[all]
# With specific LLM backends
pip install gepa[openai,anthropic,google]
# Development installation
git clone https://github.com/gepa-ai/gepa.git
cd gepa
pip install -e ".[dev]"
Minimal Example: Optimize a Sorting Algorithm
from gepa import optimize, Artifact, Metric, EvalResult
# Define the artifact to optimize
artifact = Artifact(
name="sort_function",
template="""
def sort(arr: list[int]) -> list[int]:
# Your sorting implementation here
{{IMPLEMENTATION}}
""",
language="python"
)
# Define the evaluation metric
def evaluate_sort(candidate: str, task: dict) -> EvalResult:
    """Evaluate sorting correctness and performance."""
    import time
    exec_globals = {}
    exec(candidate, exec_globals)
    sort_fn = exec_globals['sort']
    test_cases = task['test_cases']
    correct = 0
    total_time = 0.0
    results = []
    for tc in test_cases:
        start = time.perf_counter()
        result = sort_fn(tc['input'].copy())
        elapsed = time.perf_counter() - start
        total_time += elapsed
        results.append(result)
        if result == tc['expected']:
            correct += 1
    accuracy = correct / len(test_cases)
    avg_time = total_time / len(test_cases)
    return EvalResult(
        scores={"accuracy": accuracy, "speed": 1.0 / (avg_time + 1e-9)},
        side_info={
            "failed_cases": [tc for tc, r in zip(test_cases, results) if r != tc['expected']],
            "avg_time_ms": avg_time * 1000,
        }
    )
# Run optimization
result = optimize(
artifact=artifact,
evaluate=evaluate_sort,
metrics=[
Metric("accuracy", direction="maximize", weight=0.8),
Metric("speed", direction="maximize", weight=0.2),
],
max_iterations=50,
llm="claude-sonnet-4-20250514",
)
print(f"Best solution: {result.best.score}")
print(result.best.code)
**Note:** The `optimize()` function is the primary entry point. It returns an `OptimizationResult` containing the Pareto frontier, best solution per metric, full history, and timing statistics.
3. System Architecture
+-----------------------------------------------------------------------------------+
| GEPA System Architecture |
+-----------------------------------------------------------------------------------+
| |
| +------------------+ +-------------------+ +-------------------------+ |
| | User API | | GEPAConfig | | Artifact Registry | |
| | |---->| |---->| | |
| | optimize() | | EngineConfig | | Templates, Validators | |
| | Artifact() | | ReflectionConfig | | Language Specs | |
| | Metric() | | MergeConfig | | Seed Solutions | |
| +------------------+ | RefinerConfig | +-------------------------+ |
| +-------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Evolution Engine | |
| | | |
| | +-------------------+ +--------------------+ +----------------+ | |
| | | Population Manager| | Pareto Frontier | | History Store | | |
| | | | | | | | | |
| | | - Candidates[] | | - Non-dominated | | - All evals | | |
| | | - Islands[] | | solutions | | - ASI records | | |
| | | - Selection | | - Dominance checks | | - Genealogy | | |
| | +-------------------+ +--------------------+ +----------------+ | |
| | | | | | |
| | v v v | |
| | +------------------------------------------------------------------+ | |
| | | Reflection Engine | | |
| | | | | |
| | | 1. Select parent from Pareto frontier | | |
| | | 2. Sample minibatch of 2-3 evaluation examples | | |
| | | 3. Attach ASI (traces, errors, images) to reflection prompt | | |
| | | 4. LLM proposes targeted mutation | | |
| | | 5. Validate & insert into population | | |
| | +------------------------------------------------------------------+ | |
| | | | |
| | v | |
| | +------------------------------------------------------------------+ | |
| | | Evaluation Pipeline | | |
| | | | | |
| | | [Cache Check] --> [Sandbox Exec] --> [Metric Compute] --> [ASI] | | |
| | | | | | | | | |
| | | Content Hash Docker/Process Multi-metric Traces | | |
| | | Dedup Isolation Aggregation Errors | | |
| | | max_workers Pareto update Images | | |
| | +------------------------------------------------------------------+ | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Stopping Controller | |
| | | |
| | MaxMetricCalls | Timeout | NoImprovement | ScoreThreshold | Composite | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | OptimizationResult | |
| | | |
| | .best - Best solution per primary metric | |
| | .pareto_front - All non-dominated solutions | |
| | .history - Full evaluation history with ASI | |
| | .stats - Timing, cost, convergence metrics | |
| +------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
Component Interactions
The architecture follows a clean separation of concerns:
- User API Layer: Declarative interface where users define artifacts, metrics, and evaluation functions. No knowledge of evolutionary internals required.
- Configuration Layer: GEPAConfig with sub-configs for engine, reflection, merge, and refinement. Sensible defaults with full override capability.
- Evolution Engine: Core loop managing population, Pareto frontier, and genealogy tracking. Orchestrates reflection and evaluation cycles.
- Reflection Engine: LLM-powered mutation generator that reads parent candidates, ASI feedback, and optimization history to propose targeted improvements.
- Evaluation Pipeline: Parallel execution with content-hash caching, sandbox isolation, multi-metric aggregation, and ASI extraction.
- Stopping Controller: Composable stopping criteria with AND/OR logic for flexible termination.
4. Optimization Modes
GEPA supports three distinct optimization modes, each addressing different real-world scenarios.
4.1 Single-Task Optimization
The simplest mode: optimize an artifact against a single task definition. This is the standard evolutionary optimization scenario.
from gepa import optimize, Artifact, Metric, SingleTaskConfig
result = optimize(
artifact=Artifact(
name="heuristic",
template="def heuristic(state: dict) -> float:\n {{BODY}}",
),
evaluate=my_eval_fn,
metrics=[Metric("score", direction="maximize")],
task={"problem_instance": load_instance("tsp_100")},
config=SingleTaskConfig(
population_size=20,
max_iterations=100,
reflection_minibatch_size=3,
),
)
**Use Case:** Optimizing a heuristic for a specific TSP instance, tuning a prompt for a fixed evaluation set, or improving an algorithm for a particular benchmark.
4.2 Multi-Task Optimization (Cross-Transfer)
Multi-task mode optimizes a single artifact across multiple tasks simultaneously, enabling cross-transfer learning. Solutions that perform well on one task can transfer insights to others.
from gepa import optimize, Artifact, Metric, MultiTaskConfig
tasks = [
{"name": "tsp_50", "instance": load_instance("tsp_50")},
{"name": "tsp_100", "instance": load_instance("tsp_100")},
{"name": "tsp_200", "instance": load_instance("tsp_200")},
]
result = optimize(
artifact=Artifact(name="tsp_heuristic", template="..."),
evaluate=eval_tsp,
metrics=[Metric("tour_length", direction="minimize")],
tasks=tasks,
config=MultiTaskConfig(
population_size=30,
cross_transfer=True, # Enable cross-task solution transfer
transfer_frequency=5, # Transfer every 5 iterations
min_improvement_for_transfer=0.01,
),
)
# Access per-task results
for task_name, task_result in result.per_task.items():
print(f"{task_name}: best={task_result.best.score:.4f}")
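Conceptually, each transfer round re-scores every task's best candidate on the other tasks and adopts it wherever it beats the incumbent by at least the configured improvement margin. A minimal sketch under those assumptions (higher-is-better scores; the function name and record shapes are illustrative, not the library's internals):

```python
def cross_transfer(per_task_best, evaluate, tasks, min_improvement=0.01):
    """Try each task's best candidate on every other task.

    A candidate is proposed for another task's population when it beats
    that task's current best score by at least `min_improvement`.
    Scores are assumed higher-is-better here.
    """
    transfers = []
    for src, candidate in per_task_best.items():
        for dst, task in tasks.items():
            if dst == src:
                continue  # never transfer a candidate back to its own task
            score = evaluate(candidate["code"], task)
            if score > per_task_best[dst]["score"] + min_improvement:
                transfers.append(
                    {"from": src, "to": dst, "code": candidate["code"], "score": score}
                )
    return transfers
```

The engine would run a round like this every `transfer_frequency` iterations and insert the returned candidates into the destination tasks' populations.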
4.3 Generalization Mode (Train+Val)
Generalization mode splits tasks into training and validation sets, optimizing for generalization rather than overfitting to specific instances.
from gepa import optimize, Artifact, Metric, GeneralizationConfig
all_tasks = load_benchmark_tasks("arc-agi-v1")
train_tasks = all_tasks[:80]
val_tasks = all_tasks[80:]
result = optimize(
artifact=Artifact(name="arc_solver", template="..."),
evaluate=eval_arc,
metrics=[Metric("accuracy", direction="maximize")],
train_tasks=train_tasks,
val_tasks=val_tasks,
config=GeneralizationConfig(
population_size=50,
val_frequency=10, # Validate every 10 iterations
early_stopping_patience=20, # Stop if val doesn't improve for 20 iters
overfitting_threshold=0.15, # Alert if train-val gap exceeds 15%
),
)
print(f"Train accuracy: {result.train_score:.3f}")
print(f"Val accuracy: {result.val_score:.3f}")
print(f"Generalization gap: {result.generalization_gap:.3f}")
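The early-stopping and overfitting checks configured above can be sketched as a small gate function (illustrative only, assuming a recorded history of validation scores; not the library's internal logic):

```python
def generalization_gate(train_score, val_history, patience=20, gap_threshold=0.15):
    """Early-stopping and overfitting checks for train/val optimization.

    `val_history` holds one validation score per validation round.
    Returns (stop, gap_alert): stop when the best validation score is at
    least `patience` rounds old; alert when the train/val gap is too wide.
    """
    if not val_history:
        return False, False
    # Index of the (first) best validation round seen so far
    best_idx = max(range(len(val_history)), key=lambda i: val_history[i])
    rounds_since_best = len(val_history) - 1 - best_idx
    stop = rounds_since_best >= patience
    gap_alert = (train_score - val_history[-1]) > gap_threshold
    return stop, gap_alert
```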
5. Actionable Side Information (ASI)
ASI is GEPA's most distinctive innovation. Instead of treating evaluation as a black-box score, ASI captures structured diagnostic feedback that the reflection engine uses to propose targeted mutations.
ASI Types
| ASI Type | Description | Example | Use in Reflection |
|---|---|---|---|
| `TraceASI` | Execution traces showing step-by-step behavior | Function call logs, variable states at each step | LLM identifies where execution diverges from expected behavior |
| `ErrorASI` | Exception details with full stack traces | TypeError, IndexError with line numbers | LLM directly fixes the error-causing code |
| `ImageASI` | Visual outputs (plots, renders, diagrams) | Circle packing visualization, loss curve | Multimodal LLM analyzes visual quality |
| `MetricASI` | Detailed metric breakdowns beyond the primary score | Per-test-case scores, timing breakdowns | LLM focuses on worst-performing sub-metrics |
| `ComparisonASI` | Diff between candidate and reference solution | Output mismatches, behavioral differences | LLM targets specific discrepancies |
| `TextASI` | Free-form text feedback | Human annotations, LLM-judge feedback | LLM incorporates qualitative feedback into mutation |
Implementing Custom ASI
from gepa import EvalResult, TraceASI, ErrorASI, MetricASI, ComparisonASI
def evaluate_with_asi(candidate: str, task: dict) -> EvalResult:
"""Evaluation function that produces rich ASI feedback."""
try:
# Execute candidate
exec_globals = {}
exec(candidate, exec_globals)
solve_fn = exec_globals['solve']
# Collect execution trace
trace = []
original_print = print
def traced_print(*args, **kwargs):
trace.append(" ".join(str(a) for a in args))
original_print(*args, **kwargs)
        import builtins
        builtins.print = traced_print
        try:
            result = solve_fn(task['input'])
        finally:
            builtins.print = original_print  # restore even if the candidate raises
# Compute metrics
accuracy = compute_accuracy(result, task['expected'])
efficiency = compute_efficiency(result)
# Build ASI
side_info = [
TraceASI(trace=trace, label="execution_trace"),
MetricASI(
metrics={
"accuracy": accuracy,
"efficiency": efficiency,
"output_length": len(str(result)),
},
label="detailed_metrics"
),
]
if accuracy < 1.0:
mismatches = find_mismatches(result, task['expected'])
side_info.append(
ComparisonASI(
expected=str(task['expected']),
actual=str(result),
diff=mismatches,
label="output_comparison"
)
)
return EvalResult(
scores={"accuracy": accuracy, "efficiency": efficiency},
side_info=side_info,
)
except Exception as e:
import traceback
return EvalResult(
scores={"accuracy": 0.0, "efficiency": 0.0},
side_info=[
ErrorASI(
error_type=type(e).__name__,
message=str(e),
traceback=traceback.format_exc(),
label="runtime_error",
)
],
)
ASI in the Reflection Prompt
When the reflection engine generates a mutation, ASI is injected into the LLM prompt as structured context:
# Internal reflection prompt template (simplified)
REFLECTION_PROMPT = """
You are optimizing a {artifact_type} to maximize {objectives}.
## Current Candidate (Score: {score})
{candidate_code}
## Diagnostic Feedback (ASI)
{formatted_asi}
## Optimization History
- Iteration {iter}: {score_history}
- Recent improvements: {recent_improvements}
- Stagnation detector: {stagnation_status}
## Reflection Minibatch ({minibatch_size} examples)
{minibatch_details}
## Task
Analyze the diagnostic feedback and propose a TARGETED improvement.
Focus on the specific failure modes revealed by the ASI.
Output the complete improved candidate.
"""
6. Pareto-Efficient Search
GEPA maintains a Pareto frontier of non-dominated solutions, enabling multi-objective optimization without manual weight tuning.
Pareto Dominance
A solution $x$ dominates solution $y$ (written $x \succ y$) if and only if:
$$ x \succ y \;\Longleftrightarrow\; \forall i:\ f_i(x) \ge f_i(y) \ \land\ \exists j:\ f_j(x) > f_j(y) $$
where $f_i$ denotes the $i$-th objective function, with direction-normalized values so that higher is always better.
Pareto Frontier Maintenance
$$ PF = \{\, x \in P : \nexists\, y \in P \ \text{such that}\ y \succ x \,\} $$
The Pareto frontier $PF$ is the set of all non-dominated solutions in the population $P$. When a new candidate is evaluated:
- Check if any existing frontier member dominates the new candidate → if yes, discard.
- Check if the new candidate dominates any existing frontier members → if yes, remove dominated members.
- If neither dominates, add the new candidate to the frontier (it represents a new trade-off).
Hypervolume Indicator
GEPA uses the hypervolume indicator to measure the quality of the Pareto frontier:
$$ HV(PF, r) = \Lambda\!\left(\{\, q \in \mathbb{R}^m : \exists\, p \in PF \ \text{such that}\ p \succeq q \succeq r \,\}\right) $$
where $\Lambda$ denotes the Lebesgue measure (volume), $r$ is the reference point, and $m$ is the number of objectives. The hypervolume captures both convergence to the true Pareto front and diversity along it.
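For two maximization objectives, the hypervolume reduces to a sum of rectangular slices swept left to right. A minimal sketch under that assumption (a non-dominated front of `(f1, f2)` tuples; `hypervolume_2d` is a hypothetical helper, not part of the gepa API):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective (maximize, maximize) Pareto front.

    Assumes `front` is non-dominated, so sorting ascending by f1 gives
    descending f2; each point then contributes one rectangular slice
    between its predecessor's f1 and its own, down to the reference point.
    """
    # Keep only points that strictly dominate the reference point
    pts = sorted(p for p in front if p[0] > ref[0] and p[1] > ref[1])
    hv = 0.0
    prev_x = ref[0]
    for x, y in pts:
        hv += (x - prev_x) * (y - ref[1])  # slice width times height above ref
        prev_x = x
    return hv
```

For example, the front `[(1, 3), (2, 2), (3, 1)]` with reference `(0, 0)` covers three slices of area 3, 2, and 1.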
Selection Strategy
import random

import numpy as np

# Simplified view of the frontier logic in gepa.engine
class ParetoFrontier:
    def __init__(self, objectives: list[str], directions: list[str]):
        self.objectives = objectives
        self.directions = directions  # "maximize" or "minimize"
        self.frontier: list[Solution] = []

    def dominates(self, a: Solution, b: Solution) -> bool:
        """Check if solution a dominates solution b."""
        at_least_one_better = False
        for obj, direction in zip(self.objectives, self.directions):
            a_val = a.scores[obj]
            b_val = b.scores[obj]
            if direction == "minimize":
                a_val, b_val = -a_val, -b_val
            if a_val < b_val:
                return False
            if a_val > b_val:
                at_least_one_better = True
        return at_least_one_better

    def update(self, candidate: Solution) -> bool:
        """Add candidate to frontier if non-dominated. Returns True if added."""
        # Check if any frontier member dominates the candidate
        for member in self.frontier:
            if self.dominates(member, candidate):
                return False  # Dominated, discard
        # Remove frontier members dominated by the candidate
        self.frontier = [
            m for m in self.frontier
            if not self.dominates(candidate, m)
        ]
        self.frontier.append(candidate)
        return True

    def select_parent(self, strategy: str = "crowding") -> Solution:
        """Select a parent from the frontier for mutation."""
        if strategy == "crowding":
            # Prefer solutions in sparse regions of the frontier
            distances = np.asarray(self._crowding_distances())
            probs = distances / distances.sum()
            idx = np.random.choice(len(self.frontier), p=probs)
            return self.frontier[idx]
        elif strategy == "random":
            return random.choice(self.frontier)
        elif strategy == "tournament":
            a, b = random.sample(self.frontier, 2)
            if self._hypervolume_contribution(a) > self._hypervolume_contribution(b):
                return a
            return b
        raise ValueError(f"Unknown selection strategy: {strategy}")
Crowding Distance
To maintain diversity on the Pareto frontier, GEPA computes crowding distances and preferentially selects parents from sparse regions:
$$ CD(i) = \sum_{m} \frac{f_m(i+1) - f_m(i-1)}{f_m^{\max} - f_m^{\min}} $$
where solutions are sorted by each objective $m$ and boundary solutions receive infinite crowding distance.
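The per-solution crowding distance can be computed as follows — a self-contained sketch assuming a list of direction-normalized objective vectors (`crowding_distances` is an illustrative name, not the library's internal `_crowding_distances`):

```python
import math

def crowding_distances(scores):
    """Crowding distance per solution, given a list of objective vectors.

    For each objective, solutions are sorted and each interior solution
    accumulates the normalized gap between its two neighbors; boundary
    solutions get infinite distance so they are always preserved.
    """
    n = len(scores)
    if n <= 2:
        return [math.inf] * n
    m = len(scores[0])
    dist = [0.0] * n
    for obj in range(m):
        order = sorted(range(n), key=lambda i: scores[i][obj])
        lo, hi = scores[order[0]][obj], scores[order[-1]][obj]
        dist[order[0]] = dist[order[-1]] = math.inf  # boundaries always kept
        span = hi - lo
        if span == 0:
            continue  # all solutions tie on this objective
        for rank in range(1, n - 1):
            gap = scores[order[rank + 1]][obj] - scores[order[rank - 1]][obj]
            dist[order[rank]] += gap / span
    return dist
```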
7. Reflection-Driven Mutation
GEPA's reflection engine is the primary mechanism for generating improved candidates. Unlike random mutation or simple LLM rewriting, reflection uses structured diagnostic feedback to propose targeted improvements.
Reflection Pipeline
- Parent Selection: Choose a parent from the Pareto frontier using crowding-distance-weighted selection.
- Minibatch Sampling: Sample 2-3 evaluation examples where the parent performs poorly (failure-biased sampling).
- ASI Assembly: Gather all ASI records for the selected examples — traces, errors, metrics, comparisons.
- History Context: Include recent mutation history (what was tried, what worked, what didn't).
- LLM Reflection: Send the complete context to the LLM with instructions to analyze failures and propose a targeted fix.
- Validation: Parse the LLM output, validate syntax/structure, and insert into the population for evaluation.
Minibatch Strategy
**Why 2-3 examples?** Reflection minibatches of 2-3 examples provide the optimal trade-off between context richness and LLM attention capacity. Too few examples (1) risk overfitting the mutation to a single case. Too many examples (5+) dilute the LLM's attention and produce generic rather than targeted improvements.
from gepa.reflection import ReflectionEngine, ReflectionConfig
config = ReflectionConfig(
minibatch_size=3, # Sample 2-3 examples per reflection
failure_bias=0.7, # 70% chance to sample failure cases
include_trace=True, # Include execution traces in prompt
include_error=True, # Include error details in prompt
include_comparison=True, # Include output comparisons
max_history_length=10, # Include last 10 mutations in context
temperature=0.8, # LLM temperature for diversity
reflection_prompt_version="v3", # Latest reflection prompt template
)
engine = ReflectionEngine(
config=config,
llm=LLMClient("claude-sonnet-4-20250514"),
artifact_spec=artifact,
)
# Generate a mutation
mutation = await engine.reflect(
parent=best_candidate,
eval_results=parent_eval_results,
history=optimization_history,
)
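The failure-biased sampling behind `failure_bias` can be sketched as follows — an illustrative helper assuming per-example records with a `score` field, not the engine's internal code:

```python
import random

def sample_minibatch(eval_records, size=3, failure_bias=0.7, pass_score=1.0, rng=None):
    """Sample a reflection minibatch, biased toward failing examples.

    Each slot draws from the failure pool with probability `failure_bias`,
    falling back to the other pool when one side is exhausted.
    """
    rng = rng or random.Random()
    failures = [r for r in eval_records if r["score"] < pass_score]
    successes = [r for r in eval_records if r["score"] >= pass_score]
    batch = []
    for _ in range(min(size, len(eval_records))):
        take_failure = failures and (not successes or rng.random() < failure_bias)
        pool = failures if take_failure else successes
        pick = rng.choice(pool)
        pool.remove(pick)  # sample without replacement
        batch.append(pick)
    return batch
```

With the default `failure_bias=0.7`, roughly two of every three minibatch slots come from failing examples, which is what keeps reflection focused on fixable defects.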
Reflection Prompt Engineering
The reflection prompt is carefully structured to maximize the quality of mutations:
# Reflection prompt structure
"""
[ROLE] You are an expert optimizer improving a {artifact_type}.
[OBJECTIVE] Maximize: {objectives_description}
[CURRENT SOLUTION] (Score: {score_summary})
{candidate_code}
[DIAGNOSTIC ANALYSIS]
--- Example 1/{minibatch_size} ---
Input: {example_input}
Expected: {expected_output}
Actual: {actual_output}
Trace: {execution_trace}
Error: {error_if_any}
Score: {per_example_score}
--- Example 2/{minibatch_size} ---
...
[MUTATION HISTORY]
Iteration {n-2}: Changed X to Y -> score improved by +0.05
Iteration {n-1}: Changed A to B -> score decreased by -0.02 (reverted)
[INSTRUCTIONS]
1. Analyze the diagnostic feedback above
2. Identify the ROOT CAUSE of failures
3. Propose a SPECIFIC, TARGETED fix (not a rewrite)
4. Explain your reasoning before the code
5. Output the complete improved solution
[OUTPUT FORMAT]
## Analysis
(your analysis here)
## Improved Solution
(complete code here)
"""
8. Seedless Mode
GEPA can bootstrap optimization from a natural language objective alone, without any seed solution. The system generates an initial population by prompting the LLM to create diverse starting points.
from gepa import optimize, Artifact, Metric, SeedlessConfig
# No seed solution provided - GEPA generates initial candidates
result = optimize(
artifact=Artifact(
name="bin_packing",
description="A function that packs items into bins to minimize wasted space.",
signature="def pack(items: list[tuple[float, float]], bin_capacity: float) -> list[list[int]]",
language="python",
# No template or seed - GEPA bootstraps from the description + signature
),
evaluate=eval_bin_packing,
metrics=[Metric("utilization", direction="maximize")],
config=SeedlessConfig(
initial_population_size=10, # Generate 10 diverse starting points
diversity_prompt=True, # Instruct LLM to generate diverse solutions
bootstrap_strategies=[
"greedy_first_fit",
"best_fit_decreasing",
"random_search",
"dynamic_programming",
"genetic_algorithm",
], # Hint different algorithmic approaches
),
)
Bootstrap Pipeline
- Strategy Enumeration: If strategies are provided, generate one candidate per strategy. Otherwise, ask the LLM to enumerate diverse approaches.
- Parallel Generation: Concurrently prompt the LLM to implement each strategy as a concrete solution.
- Validation: Each generated candidate is validated (syntax, type checking, basic execution) before insertion.
- Initial Evaluation: All valid candidates are evaluated to establish the initial Pareto frontier.
- Normal Evolution: Proceed with reflection-driven mutation from the initial population.
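The first three steps can be sketched as below; `llm_generate` stands in for whatever LLM client is configured, and the prompt wording and record shapes are illustrative, not the gepa API:

```python
import ast

def bootstrap_population(description, signature, strategies, llm_generate):
    """Generate and validate an initial seedless population.

    One candidate is requested per strategy hint; candidates that fail a
    basic syntax check are dropped before the initial evaluation round.
    """
    population = []
    for strategy in strategies:
        prompt = (
            f"Implement: {description}\n"
            f"Signature: {signature}\n"
            f"Use this approach: {strategy}\n"
            "Output only the complete Python function."
        )
        candidate = llm_generate(prompt)
        try:
            ast.parse(candidate)  # basic syntax validation before insertion
        except SyntaxError:
            continue
        population.append({"strategy": strategy, "code": candidate})
    return population
```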
9. Stopping Criteria
GEPA provides composable stopping criteria that can be combined with AND/OR logic.
| Criterion | Description | Parameters | Example |
|---|---|---|---|
| `MaxMetricCalls` | Stop after N evaluation calls | `max_calls: int` | `MaxMetricCalls(100)` |
| `Timeout` | Stop after elapsed time | `seconds: float` | `Timeout(3600)` (1 hour) |
| `NoImprovement` | Stop if no improvement for N iterations | `patience: int`, `min_delta: float` | `NoImprovement(20, 0.001)` |
| `ScoreThreshold` | Stop when target score is reached | `metric: str`, `threshold: float` | `ScoreThreshold("accuracy", 0.99)` |
| `Composite` | Combine criteria with AND/OR logic | `criteria: list`, `mode: str` | See example below |
from gepa.stopping import MaxMetricCalls, Timeout, NoImprovement, ScoreThreshold, Composite
# Stop when ANY of these conditions is met (OR logic)
stopping = Composite(
criteria=[
MaxMetricCalls(200), # Hard limit: 200 evaluations
Timeout(7200), # Hard limit: 2 hours
ScoreThreshold("accuracy", 1.0), # Perfect score reached
Composite( # AND: both must be true
criteria=[
NoImprovement(patience=30, min_delta=0.001),
MaxMetricCalls(50), # At least 50 evals before early stop
],
mode="AND",
),
],
mode="OR",
)
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
stopping=stopping,
)
print(f"Stopped because: {result.stop_reason}")
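Under the hood, composition like this reduces to `any`/`all` over the child criteria. A rough sketch of how such criteria might be implemented (illustrative class shapes and a simplified `state` dict, not the actual `gepa.stopping` internals):

```python
class Composite:
    """Combine child stopping criteria with AND/OR logic.

    Each child is any object with a `should_stop(state) -> bool` method;
    `state` is whatever snapshot of the run the engine passes in.
    """
    def __init__(self, criteria, mode="OR"):
        self.criteria = criteria
        self.mode = mode

    def should_stop(self, state):
        checks = (c.should_stop(state) for c in self.criteria)
        return all(checks) if self.mode == "AND" else any(checks)


class MaxMetricCalls:
    """Stop once the evaluation budget is spent."""
    def __init__(self, max_calls):
        self.max_calls = max_calls

    def should_stop(self, state):
        return state["metric_calls"] >= self.max_calls
```

Because `Composite` only requires the `should_stop` protocol, composites nest freely, which is what makes the AND-inside-OR pattern above possible.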
10. Configuration System
GEPA uses a hierarchical configuration system with sensible defaults at every level.
GEPAConfig Structure
from gepa.config import (
GEPAConfig, EngineConfig, ReflectionConfig,
MergeConfig, RefinerConfig
)
config = GEPAConfig(
# Engine configuration
engine=EngineConfig(
population_size=30, # Max population size
elite_size=5, # Protected elite solutions
tournament_size=3, # Tournament selection size
mutation_rate=1.0, # Always mutate (vs crossover)
crossover_rate=0.0, # No crossover by default
island_count=1, # Number of islands (for island model)
migration_rate=0.1, # Migration rate between islands
migration_interval=10, # Migrate every N iterations
),
# Reflection configuration
reflection=ReflectionConfig(
minibatch_size=3, # Examples per reflection
failure_bias=0.7, # Bias toward failure cases
temperature=0.8, # LLM temperature
max_tokens=4096, # Max response tokens
include_trace=True, # Include execution traces
include_error=True, # Include error details
include_comparison=True, # Include output comparisons
include_image=False, # Include image ASI (multimodal)
max_history_length=10, # Mutation history context
reflection_model="claude-sonnet-4-20250514", # Model for reflection
),
# Merge configuration (for crossover-like operations)
merge=MergeConfig(
strategy="llm_guided", # LLM picks best parts from 2 parents
merge_model="claude-sonnet-4-20250514",
max_parents=2, # Number of parents to merge
),
# Refiner configuration (post-mutation polish)
refiner=RefinerConfig(
enabled=True, # Enable post-mutation refinement
refiner_model="claude-sonnet-4-20250514",
max_refinement_steps=2, # Max refinement iterations
refinement_threshold=0.9, # Only refine if score > threshold
),
)
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
config=config,
)
Configuration via YAML
# gepa_config.yaml
engine:
population_size: 30
elite_size: 5
tournament_size: 3
reflection:
minibatch_size: 3
failure_bias: 0.7
temperature: 0.8
include_trace: true
include_error: true
merge:
strategy: llm_guided
max_parents: 2
refiner:
enabled: true
max_refinement_steps: 2
stopping:
type: composite
mode: OR
criteria:
- type: max_metric_calls
max_calls: 200
- type: timeout
seconds: 7200
- type: score_threshold
metric: accuracy
threshold: 1.0
from gepa import optimize_from_config
result = optimize_from_config("gepa_config.yaml")
11. Evaluation Pipeline
Parallel Evaluation
GEPA evaluates candidates in parallel using a configurable worker pool:
from gepa.evaluation import EvaluationPipeline, EvalConfig
pipeline = EvaluationPipeline(
eval_fn=evaluate,
config=EvalConfig(
max_workers=8, # Parallel evaluation workers
timeout_per_eval=60, # Timeout per evaluation (seconds)
retry_on_error=True, # Retry failed evaluations once
cache_enabled=True, # Enable content-hash caching
cache_backend="sqlite", # Cache backend: sqlite, redis, memory
sandbox="docker", # Sandbox: docker, subprocess, none
),
)
# Evaluate a batch of candidates
results = await pipeline.evaluate_batch(candidates, task)
Content-Hash Caching
Evaluations are cached by content hash, so identical candidates are never re-evaluated:
import hashlib
import json
def content_hash(candidate: str, task: dict) -> str:
"""Compute a deterministic hash for a (candidate, task) pair."""
content = candidate + json.dumps(task, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
# Cache lookup before evaluation
cache_key = content_hash(candidate_code, task)
if cache_key in cache:
return cache[cache_key] # Skip evaluation
# Evaluate and cache
result = evaluate(candidate_code, task)
cache[cache_key] = result
Sandbox Execution
All candidate evaluations run in isolated sandboxes to prevent side effects:
- Docker sandbox: Full container isolation with resource limits (CPU, memory, network, filesystem). Safest option for untrusted code.
- Subprocess sandbox: Process-level isolation with timeouts. Good balance of safety and speed.
- None: Direct execution in the main process. Fastest but no isolation. Only for trusted code.
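A minimal version of the subprocess option can be sketched with the standard library alone (illustrative only; the real pipeline would add memory/CPU limits and richer result marshalling):

```python
import subprocess
import sys

def run_in_subprocess(candidate_code, timeout=60):
    """Execute candidate code in a separate Python process with a timeout.

    Returns (ok, stdout, stderr); a hung candidate is killed when the
    timeout expires instead of blocking the evaluation pipeline.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout}s"
    return proc.returncode == 0, proc.stdout, proc.stderr
```

Process isolation means a crashing or infinitely looping candidate takes down only its own interpreter, never the evaluation workers.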
12. API Reference & Code Examples
Core Classes
[!info]- Artifact - Define what to optimize
```
class Artifact:
    """Defines a text artifact to be optimized."""

    def __init__(
        self,
        name: str,                              # Unique artifact identifier
        template: str | None = None,            # Template with {{PLACEHOLDERS}}
        seed: str | None = None,                # Initial solution (optional)
        description: str | None = None,         # Natural language description
        signature: str | None = None,           # Function signature (for code)
        language: str = "python",               # Language: python, yaml, text, etc.
        constraints: list[str] | None = None,   # Hard constraints for validation
        metadata: dict | None = None,           # Additional context
    ): ...

    def validate(self, candidate: str) -> ValidationResult:
        """Validate a candidate against artifact constraints."""
        ...

    def render(self, **kwargs) -> str:
        """Render template with placeholder values."""
        ...
```
[!info]- Metric - Define optimization objectives
```
class Metric:
    """Defines an optimization objective."""

    def __init__(
        self,
        name: str,                                  # Metric name (must match EvalResult keys)
        direction: str = "maximize",                # "maximize" or "minimize"
        weight: float = 1.0,                        # Weight for weighted-sum aggregation
        bounds: tuple[float, float] | None = None,  # Expected (min, max) range
        primary: bool = True,                       # Is this a primary objective?
    ): ...
```
[!info]- EvalResult - Evaluation output
```
class EvalResult:
    """Result of evaluating a candidate."""

    def __init__(
        self,
        scores: dict[str, float],                   # Metric name -> value
        side_info: list[ASI] | dict | None = None,  # Actionable Side Information
        metadata: dict | None = None,               # Additional evaluation metadata
        valid: bool = True,                         # Whether the candidate is valid
        error: str | None = None,                   # Error message if invalid
    ): ...
```
[!info]- OptimizationResult - Final output
```
class OptimizationResult:
    """Complete optimization result."""

    best: Solution                        # Best solution (primary metric)
    pareto_front: list[Solution]          # All non-dominated solutions
    history: list[EvaluationRecord]       # Full evaluation history
    stats: OptimizationStats              # Timing, cost, convergence stats
    stop_reason: str                      # Why optimization stopped
    config: GEPAConfig                    # Configuration used

    # Multi-task specific
    per_task: dict[str, TaskResult] | None

    # Generalization specific
    train_score: float | None
    val_score: float | None
    generalization_gap: float | None

    def plot_convergence(self) -> Figure:
        """Plot convergence curve."""
        ...

    def plot_pareto(self) -> Figure:
        """Plot Pareto frontier (2D or 3D)."""
        ...

    def export(self, path: str, format: str = "json"):
        """Export results to file."""
        ...
```
Advanced: Custom LLM Integration
from gepa.llm import LLMProvider, LLMResponse
class MyCustomLLM(LLMProvider):
"""Custom LLM provider for GEPA."""
def __init__(self, model_name: str, api_key: str):
self.model_name = model_name
self.api_key = api_key
async def generate(
self,
prompt: str,
temperature: float = 0.8,
max_tokens: int = 4096,
stop: list[str] | None = None,
) -> LLMResponse:
# Call your LLM API here
response = await my_api_call(
model=self.model_name,
prompt=prompt,
temperature=temperature,
max_tokens=max_tokens,
)
return LLMResponse(
text=response.text,
usage={"input_tokens": response.input_tokens, "output_tokens": response.output_tokens},
model=self.model_name,
)
# Use custom LLM with GEPA
result = optimize(
artifact=artifact,
evaluate=evaluate_fn,
metrics=metrics,
llm=MyCustomLLM("my-model", "my-api-key"),
)
Advanced: Multi-Artifact Co-Evolution
```
from gepa import co_optimize, Artifact, Metric

# Co-evolve a prompt and a post-processor together
prompt_artifact = Artifact(
    name="system_prompt",
    template="You are a helpful assistant. {{INSTRUCTIONS}}",
    language="text",
)
processor_artifact = Artifact(
    name="output_processor",
    template="def process(raw_output: str) -> str:\n    {{BODY}}",
    language="python",
)

result = co_optimize(
    artifacts=[prompt_artifact, processor_artifact],
    evaluate=eval_pipeline,  # Evaluation uses both artifacts
    metrics=[Metric("quality", direction="maximize")],
    dependencies={
        "output_processor": ["system_prompt"],  # Processor depends on prompt
    },
    max_iterations=50,
)

print(f"Best prompt: {result.artifacts['system_prompt'].best.code}")
print(f"Best processor: {result.artifacts['output_processor'].best.code}")
```
13. Results & Benchmarks
**Headline Results:** GEPA achieves state-of-the-art performance across multiple domains, often surpassing specialized systems designed for specific tasks.
Benchmark Results Summary
| Benchmark | Baseline | GEPA Result | Improvement | Notes |
|---|---|---|---|---|
| Claude Code Bleve | 79.3% | 100% | +20.7pp | Perfect score on all test cases |
| ARC-AGI v1 | 32.5% | 89.5% | +57.0pp | Generalization mode with train/val split |
| AIME 2025 | 46.67% | 60% | +13.33pp | Mathematical reasoning benchmark |
| Circle Packing n=26 | Previous SOTA | New Record | Beat AlphaEvolve | Score 2.63594 (matches LLM4AD record) |
| CloudCast Routing | Baseline routing | 40.2% savings | -40.2% cost | Cloud infrastructure routing optimization |
Detailed Results: ARC-AGI v1
| Method | Train Accuracy | Val Accuracy | Gap | Eval Calls |
|---|---|---|---|---|
| GEPA (Generalization) | 94.2% | 89.5% | 4.7% | 1,200 |
| GEPA (Single-Task) | 97.1% | 82.3% | 14.8% | 800 |
| AlphaEvolve | 91.0% | 85.0% | 6.0% | 2,500 |
| OpenEvolve | 88.5% | 80.2% | 8.3% | 1,800 |
| FunSearch | 78.0% | 72.5% | 5.5% | 5,000+ |
Detailed Results: Circle Packing
| n | Previous Best | GEPA Result | Method |
|---|---|---|---|
| 20 | 2.52040 | 2.52040 | Matched known optimal |
| 22 | 2.56287 | 2.56290 | Slight improvement |
| 24 | 2.60240 | 2.60245 | Slight improvement |
| 26 | 2.63590 (AlphaEvolve) | 2.63594 | New record |
Cost Efficiency
| Benchmark | Total LLM Cost | Eval Calls | Wall Time | Cost per % Improvement |
|---|---|---|---|---|
| Claude Code Bleve | $12.50 | 85 | 45 min | $0.60/pp |
| ARC-AGI v1 | $180.00 | 1,200 | 8 hours | $3.16/pp |
| AIME 2025 | $95.00 | 400 | 3 hours | $7.13/pp |
| Circle Packing n=26 | $45.00 | 300 | 2 hours | N/A |
| CloudCast Routing | $250.00 | 800 | 12 hours | $6.22/pp |
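The last column is simply total LLM cost divided by the percentage-point improvement from the headline table. A quick standalone check, with the values copied from the tables above:

```python
# Cost per percentage point of improvement: total_cost / improvement_pp.
# Costs and scores are taken from the benchmark tables above.
runs = {
    "Claude Code Bleve": (12.50, 100.0 - 79.3),   # $12.50, +20.7pp
    "ARC-AGI v1":        (180.00, 89.5 - 32.5),   # $180.00, +57.0pp
    "AIME 2025":         (95.00, 60.0 - 46.67),   # $95.00, +13.33pp
}

for name, (cost, pp) in runs.items():
    print(f"{name}: ${cost / pp:.2f}/pp")
```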
14. Case Studies
Case Study 1: Claude Code Bleve (79.3% → 100%)
GEPA optimized the Bleve search engine integration code for Claude Code, achieving a perfect score on all test cases.
**Setup:** Single-task mode, `claude-sonnet-4-20250514` for reflection, 85 evaluation calls over 45 minutes.
Optimization trajectory:
- Iteration 0: Seed solution from Claude Code baseline: 79.3%
- Iteration 12: ASI identified edge case in Unicode handling → fix improved to 88.1%
- Iteration 28: ASI identified timeout in large document indexing → batch processing fix improved to 94.5%
- Iteration 41: ASI identified race condition in concurrent search → mutex fix improved to 97.8%
- Iteration 58: ASI identified rounding error in relevance scoring → float64 fix achieved 100%
Case Study 2: ARC-AGI v1 (32.5% → 89.5%)
GEPA used generalization mode with an 80/20 train/val split to optimize an ARC-AGI solver.
Key strategies employed by the reflection engine:
- Pattern decomposition: The LLM learned to decompose ARC patterns into geometric primitives
- Transformation library: Built up a library of reusable transformations over iterations
- Cross-task transfer: Insights from color mapping tasks transferred to spatial reasoning tasks
- ASI-driven debugging: Execution traces showed where the solver misidentified patterns
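The 80/20 split and the generalization gap reported in the results table are straightforward to reproduce in isolation; a standalone sketch (the task IDs are placeholders):

```python
import random

tasks = [f"arc_task_{i:03d}" for i in range(100)]  # placeholder task IDs
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(tasks)

split = int(0.8 * len(tasks))
train, val = tasks[:split], tasks[split:]
assert len(train) == 80 and len(val) == 20

# Generalization gap = train score - val score; generalization mode selects
# candidates by val score to keep this gap small.
train_score, val_score = 94.2, 89.5  # from the ARC-AGI results table
gap = round(train_score - val_score, 1)
print(f"gap = {gap}pp")  # -> gap = 4.7pp
```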
Case Study 3: CloudCast Routing (40.2% Cost Savings)
GEPA optimized cloud infrastructure routing rules, treating the routing configuration as a YAML artifact.
```
# Artifact: YAML routing configuration
artifact = Artifact(
    name="routing_config",
    language="yaml",
    template="""
routing:
  default_region: {{DEFAULT_REGION}}
  rules:
{{ROUTING_RULES}}
  fallback:
    strategy: {{FALLBACK_STRATEGY}}
    timeout_ms: {{TIMEOUT}}
""",
)

# Multi-objective optimization: minimize cost, maximize latency SLA compliance
result = optimize(
    artifact=artifact,
    evaluate=eval_routing,
    metrics=[
        Metric("cost", direction="minimize", weight=0.6),
        Metric("sla_compliance", direction="maximize", weight=0.4),
    ],
    config=GEPAConfig(
        engine=EngineConfig(population_size=40),
        reflection=ReflectionConfig(minibatch_size=2),
    ),
)
```
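`eval_routing` is user-supplied; a minimal standalone sketch of its shape, with the traffic simulator stubbed out by a toy model and the result modeled as a plain dict (all names and the scoring logic are hypothetical):

```python
def simulate_traffic(config_yaml: str) -> tuple[float, float]:
    # Hypothetical simulator: returns (monthly_cost_usd, sla_compliance_frac).
    # A real implementation would replay traffic against the routing rules.
    cost = 1000.0 - 2.0 * config_yaml.count("spot")  # toy cost model
    sla = 0.99 if "fallback" in config_yaml else 0.90
    return cost, sla

def eval_routing(config_yaml: str) -> dict:
    cost, sla = simulate_traffic(config_yaml)
    side_info = []
    if sla < 0.95:
        side_info.append("SLA below 95%: no fallback strategy configured")
    # Two metrics, matching the Metric("cost", minimize) and
    # Metric("sla_compliance", maximize) declarations above.
    return {"scores": {"cost": cost, "sla_compliance": sla},
            "side_info": side_info}

print(eval_routing("routing:\n  rules: [spot]\n  fallback: {strategy: retry}"))
```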
15. Comparison with Other Systems
| Feature | GEPA | AlphaEvolve | OpenEvolve | FunSearch | ShinkaEvolve |
|---|---|---|---|---|---|
| Artifact Types | Any text | Code | Code | Code | Code |
| Declarative API | Yes | No | Partial | No | Partial |
| ASI Feedback | First-class | Basic errors | Basic errors | Score only | Errors + traces |
| Pareto Multi-Objective | Native | Weighted sum | Weighted sum | Single | Single |
| Multi-Task Mode | Yes | No | No | No | No |
| Generalization Mode | Yes | No | No | No | No |
| Seedless Bootstrap | Yes | Requires seed | Requires seed | Requires seed | Optional seed |
| Composable Stopping | Yes | Fixed | Fixed | Fixed | Configurable |
| Content-Hash Cache | Yes | Yes | Yes | Yes | Yes |
| Open Source | Yes | No | Yes | No | Yes |
16. Limitations & Future Work
Current Limitations
- LLM Dependency: Quality of mutations is bounded by the capability of the underlying LLM. Weaker models produce weaker mutations.
- Cost: Multi-task and generalization modes can be expensive due to the number of evaluation calls required.
- ASI Design Burden: Effective ASI requires domain-specific instrumentation of the evaluation function. Users must think carefully about what diagnostic information to expose.
- Pareto Scalability: With more than 4-5 objectives, the Pareto frontier grows exponentially and most solutions become non-dominated (the "curse of dimensionality" for multi-objective optimization).
- Reflection Context Window: Very large candidates or extensive ASI can exceed LLM context limits, requiring truncation that may lose important information.
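The Pareto scalability point is easy to see empirically: with uniformly random points, the fraction that is non-dominated rises quickly as the number of objectives grows. A small standalone demo:

```python
import random

def nondominated_fraction(n_points: int, n_objectives: int, seed: int = 0) -> float:
    """Fraction of random points not dominated by any other point
    (maximizing all objectives)."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(n_objectives))
           for _ in range(n_points)]

    def dominated(p):
        # q dominates p if q is at least as good in every objective
        # (ties are measure-zero for continuous random values).
        return any(all(q[i] >= p[i] for i in range(n_objectives)) and q != p
                   for q in pts)

    return sum(not dominated(p) for p in pts) / n_points

for m in (2, 3, 5, 8):
    print(f"{m} objectives: {nondominated_fraction(200, m):.0%} non-dominated")
```

With only 2 objectives a handful of points form the front; by 8 objectives most of the population is non-dominated, which is exactly why selection pressure collapses.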
Future Directions
- Auto-ASI: Automatically instrument evaluation functions to extract useful ASI without manual effort.
- Hierarchical Evolution: Evolve meta-strategies (mutation operators, selection criteria) alongside the artifact itself.
- Distributed Evolution: Distribute the evolutionary search across multiple machines for large-scale problems.
- Interactive Mode: Human-in-the-loop evolution where users can guide the search direction mid-run.
- Transfer Learning: Warm-start optimization from similar previously-solved problems.
- Formal Verification: Integrate formal verification tools as ASI producers for safety-critical applications.
**Summary:** GEPA represents a significant step toward general-purpose LLM-driven optimization. Its declarative API, first-class ASI support, native Pareto optimization, and composable configuration make it the most flexible evolutionary optimization framework available. The benchmark results demonstrate that a well-designed general system can match or exceed specialized tools across diverse domains.