GEPA: Optimize Anything
Part P02: General-Purpose Evolutionary Frameworks
7.1 Overview & Motivation
The systems surveyed in earlier chapters — FunSearch (Chapter 3), AlphaEvolve (Chapter 4), OpenEvolve (Chapter 5), and ShinkaEvolve (Chapter 6) — share a common architectural assumption: the artifact under optimization is a code fragment, typically a Python function. This assumption shapes every downstream design choice, from how candidates are represented and mutated to how fitness is evaluated and reported. GEPA (General-purpose Evolutionary Program Architecture), introduced in February 2026 by Agrawal, Lee, and colleagues at UC Berkeley and Stanford, challenges this assumption directly. Its central thesis is that any text-representable artifact — code, prompts, YAML configurations, mathematical expressions, agent policies, or hybrid combinations thereof — can be optimized through a single declarative interface backed by LLM-driven evolutionary search.
GEPA is released as an open-source Python package (pip install gepa,
repository: github.com/gepa-ai/gepa) and
introduces three technical contributions that distinguish it from prior work:
- Actionable Side Information (ASI) — a first-class abstraction for structured diagnostic feedback (execution traces, error stack traces, images, metric breakdowns, output comparisons) that flows from evaluation back into the mutation prompt, enabling targeted rather than generic improvements.
- Native Pareto multi-objective optimization — the population maintains a Pareto frontier of non-dominated solutions with crowding-distance-based selection and hypervolume tracking, replacing the weighted-sum reduction used by most prior systems.
- Three optimization modes — single-task, multi-task with cross-transfer, and generalization (train/validation split with validation-based model selection) — addressing overfitting, a problem that prior systems largely ignored.
Key Contribution
GEPA demonstrates that a declarative, artifact-agnostic evolutionary optimization framework with structured diagnostic feedback (ASI) can achieve competitive performance against domain-specialized systems across coding, mathematics, and infrastructure optimization benchmarks, while providing researchers with a unified API that separates the what (artifact definition, evaluation function, metrics) from the how (evolutionary search, mutation strategy, population management). The strongest empirical evidence for this claim comes from code and code-adjacent text artifacts; the broader "optimize anything" generality remains architecturally supported but less extensively validated.
7.1.1 The Artifact-Agnosticism Thesis
Prior LLM-driven evolution systems treat the candidate as a code string with
language-specific parsing, syntax validation, and execution semantics baked into the
pipeline. GEPA abstracts this into a generic Artifact class parameterized
by a name, an optional template with placeholders, a language tag, and optional
constraints. The GEPA documentation (§1: Overview & Motivation) explicitly
lists six categories of supported artifact types: (1) Code — Python
functions, algorithms, data structures, entire modules; (2) Prompts —
system prompts, few-shot examples, chain-of-thought templates;
(3) Agent Configs — multi-agent orchestration YAML, tool selection
policies, routing rules; (4) Mathematical Expressions — loss functions,
optimization objectives, heuristic formulas; (5) Data Pipelines — ETL
configurations, feature engineering scripts, preprocessing chains; and
(6) Hybrid Artifacts — combinations of code + prompts + configs,
co-evolved with inter-dependency tracking. This is not merely a
labeling convenience: the mutation engine, evaluation pipeline, and ASI feedback system
are all designed to operate on opaque text, with language-specific behavior injected only
through user-provided evaluation functions.
Scope of empirical support. It is important to distinguish GEPA's architectural artifact-agnosticism — the framework imposes no structural assumptions on candidate format — from the empirically demonstrated scope of that claim. The five benchmarks reported in Section 7.5 involve four code artifacts (Python functions for Claude Code Bleve, ARC-AGI, AIME 2025, and circle packing) and one code-adjacent text artifact (YAML routing rules for CloudCast). While the API design is genuinely artifact-agnostic, the "optimize anything" generality beyond code and structured configuration files has limited empirical support in the current documentation. Domains such as natural-language prose, mathematical proofs, or multi-modal artifacts are listed as supported artifact types but are not represented in the reported benchmarks.
7.1.2 Design Principles
The GEPA documentation (§1) articulates five design principles that shape the system:
| Principle | Implication |
|---|---|
| Artifact agnosticism | No hardcoded assumptions about candidate structure; any text is valid |
| Declarative over imperative | Users specify objectives and constraints; the engine selects mutation strategies, population management, and stopping criteria |
| Diagnostic-first | Evaluation produces structured feedback (ASI), not just scalar scores |
| Pareto optimality | Multi-objective optimization is native, not reduced to weighted sums |
| Reproducibility | Content-hash caching ensures identical artifacts are never re-evaluated |
7.1.3 Provenance and Documentation Basis
The analysis in this chapter draws on the GEPA documentation and repository at
github.com/gepa-ai/gepa. The documentation includes sixteen sections
covering an overview, quick-start, system architecture, optimization modes, ASI,
Pareto search, reflection-driven mutation, seedless mode, stopping criteria,
configuration, evaluation pipeline, API reference with code examples, benchmark
results, case studies, cross-system comparisons, and limitations. Where specific
import paths and constructor signatures are cited, these are taken directly from the
documentation's code examples and API reference. Where internal algorithmic mechanisms
are described beyond what the documentation explicitly specifies (for instance, the
exact control flow of the optimization loop or the behavior of undocumented helper
functions), these are labeled as reconstructed. Mathematical formalizations
are the author's rendering of the documented API contracts and standard multi-objective
optimization theory unless otherwise attributed.
The following provenance table maps the major technical claims in this chapter to their documentation status and source location within the GEPA documentation. This table is intended to resolve ambiguity about which elements are directly documented, which are referenced but incompletely specified, and which are the author's analysis.
| Claim / Element | Status | Source Location |
|---|---|---|
| optimize(), Artifact, Metric, EvalResult, OptimizationResult: classes, constructors, all field types | Documented | Docs §§2, 12 (API Reference) |
| co_optimize(): function, dependencies parameter, usage pattern | Documented | Docs §12 (Advanced: Multi-Artifact Co-Evolution) |
| GEPAConfig, EngineConfig, ReflectionConfig, MergeConfig, RefinerConfig: all parameters with defaults | Documented | Docs §10 (Configuration System) |
| SingleTaskConfig, MultiTaskConfig, GeneralizationConfig: parameters | Documented | Docs §§4.1–4.3 |
| SeedlessConfig: parameters and bootstrap pipeline description | Documented | Docs §8 (Seedless Mode) |
| ParetoFrontier (from gepa.engine): class with dominates(), update(), select_parent() code | Documented | Docs §6 (Pareto-Efficient Search) |
| ReflectionEngine (from gepa.reflection): class and instantiation | Documented | Docs §7 (Reflection-Driven Mutation) |
| EvaluationPipeline, EvalConfig (from gepa.evaluation): parameters | Documented | Docs §11 (Evaluation Pipeline) |
| Stopping criteria: MaxMetricCalls, Timeout, NoImprovement, ScoreThreshold, Composite | Documented | Docs §9 (Stopping Criteria) |
| ASI types: TraceASI, ErrorASI, ImageASI, MetricASI, ComparisonASI, TextASI | Documented | Docs §5 (ASI) |
| Solution, EvaluationRecord, OptimizationStats, TaskResult | Referenced but not fully defined | Used in ParetoFrontier code (§6) and OptimizationResult signature (§12); no independent constructor docs |
| Reflection prompt template (simplified) | Documented | Docs §§5, 7 (two versions shown) |
| Content-hash caching mechanism and code | Documented | Docs §11 |
| Architecture component names and data flow | Documented | Docs §3 (System Architecture) |
| Benchmark results, cost figures, case study trajectories | Documented (self-reported, no independent reproduction) | Docs §§13, 14 |
| Cross-system figures (AlphaEvolve, OpenEvolve, FunSearch on ARC-AGI) | Attributed in docs; not from matched-condition experiments | Docs §13 (single comparison table) |
| Optimization loop internal control flow (§7.8 pseudocode) | Reconstructed | Author assembly from documented components; helpers (bootstrap_seedless, cached_evaluate, refine, compute_stats) are not documented API |
| Mathematical formalizations: dominance, hypervolume, crowding distance, generalization gap | Author analysis | Standard EMOA theory applied to documented API contracts; crowding distance formula and hypervolume shown in docs §6 in simplified notation |
| Scalar aggregation for train_score/val_score in generalization mode | Mixed | Fields documented (docs §§4.3, 12); averaging formula over tasks is author formalization consistent with the documented scalar fields |
| Handling of infinite crowding distances in proportional selection | Underspecified in docs | Docs §6 shows probs = distances / distances.sum() and states boundary solutions get infinite distance; resolution not specified (see §7.3.4 discussion) |
7.2 System Architecture
GEPA's architecture follows a layered design with six principal components: the User API layer, the Configuration layer, the Evolution Engine, the Reflection Engine, the Evaluation Pipeline, and the Stopping Controller. The following diagram illustrates the data flow and component interactions as described in the GEPA documentation's architecture overview (§3).
Figure 7.1 — GEPA system architecture. The six top-level components and sub-component
names (Population Manager, Pareto Frontier, History Store, Reflection Engine) are drawn
from the documentation's architecture overview (§3). Configuration class names
(GEPAConfig, EngineConfig, ReflectionConfig,
MergeConfig, RefinerConfig) are documented with import paths
in §10. API classes (ParetoFrontier from gepa.engine,
ReflectionEngine from gepa.reflection,
EvaluationPipeline from gepa.evaluation) are documented
with import paths in §§6, 7, 11 respectively. The exact internal class boundaries
within the Evolution Engine may differ from the logical grouping shown here. The
dashed loop on the right represents the ASI feedback path from evaluation back into the
reflection engine. See Table 7.0 for full provenance details.
7.2.1 User API Layer
The primary entry point is the optimize() function, which accepts an
Artifact (what to optimize), an evaluation function (how to measure quality),
a list of Metric objects (what objectives to pursue), and optional
configuration. These four types — Artifact, Metric,
EvalResult, and OptimizationResult — are documented in the
GEPA API reference (§12) with their constructors and field types. This declarative surface
means the user never directly interacts with population management, parent selection,
or mutation scheduling. The documentation also describes
co_optimize() for multi-artifact co-evolution with explicit inter-artifact
dependency tracking (§12), and optimize_from_config() for YAML-driven
runs (§10).
The following code example is adapted from the GEPA documentation's quick-start guide
(§2). Import paths (from gepa import optimize, Artifact, Metric, EvalResult),
class constructors, and parameter names are documented verbatim. One adaptation was
made: the documentation's side_info failed_cases list
comprehension references an undefined variable results — an apparent
documentation error — so the version below uses an inline re-execution check instead.
The evaluation function's overall structure (inline exec, time measurement,
EvalResult return) follows the documented pattern. Note that
side_info is passed as a plain dict here; the typed ASI
pattern (using TraceASI, ErrorASI, etc.) is shown in §7.3.1.
# Adapted from GEPA docs §2 (quick-start guide).
# Import paths and constructor signatures are documented verbatim.
# The failed_cases comprehension is modified from the original (see text).
from gepa import optimize, Artifact, Metric, EvalResult

artifact = Artifact(
    name="sort_function",
    template="""
def sort(arr: list[int]) -> list[int]:
    # Your sorting implementation here
    {{IMPLEMENTATION}}
""",
    language="python",
)

def evaluate_sort(candidate: str, task: dict) -> EvalResult:
    """Evaluate sorting correctness and performance."""
    import time

    exec_globals = {}
    exec(candidate, exec_globals)
    sort_fn = exec_globals["sort"]

    test_cases = task["test_cases"]
    correct = 0
    total_time = 0.0
    for tc in test_cases:
        start = time.perf_counter()
        result = sort_fn(tc["input"].copy())
        elapsed = time.perf_counter() - start
        total_time += elapsed
        if result == tc["expected"]:
            correct += 1

    accuracy = correct / len(test_cases)
    avg_time = total_time / len(test_cases)
    return EvalResult(
        scores={"accuracy": accuracy, "speed": 1.0 / (avg_time + 1e-9)},
        side_info={
            "failed_cases": [
                tc for tc in test_cases
                if sort_fn(tc["input"].copy()) != tc["expected"]
            ],
            "avg_time_ms": avg_time * 1000,
        },
    )

result = optimize(
    artifact=artifact,
    evaluate=evaluate_sort,
    metrics=[
        Metric("accuracy", direction="maximize", weight=0.8),
        Metric("speed", direction="maximize", weight=0.2),
    ],
    max_iterations=50,
    llm="claude-sonnet-4-20250514",
)
print(f"Best solution: {result.best.score}")
print(result.best.code)
The documented Artifact constructor (§12) accepts name,
template (with {{PLACEHOLDER}} syntax), seed,
description, signature, language
(default: "python"),
constraints, and metadata parameters, plus
validate() and render() methods. The
Metric constructor (§12) accepts name, direction
("maximize" or "minimize"), weight
(default: 1.0), bounds, and primary
(default: True). The EvalResult constructor (§12)
accepts scores (dict[str, float]), side_info
(list[ASI] | dict | None — either typed ASI objects or a plain dictionary),
metadata, valid, and error. All
constructor signatures and defaults are documented in the API reference.
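The `{{PLACEHOLDER}}` template syntax can be illustrated with a minimal sketch. The helper `render_template` below is hypothetical and is not GEPA's `Artifact.render()`; it assumes that placeholders are upper-case identifiers wrapped in double braces and that a missing value should raise an error rather than be left in place.

```python
import re

# Hypothetical helper, not part of the GEPA API: fills {{NAME}} placeholders.
PLACEHOLDER = re.compile(r"\{\{([A-Z_][A-Z0-9_]*)\}\}")

def render_template(template: str, values: dict[str, str]) -> str:
    """Replace each {{NAME}} with values[NAME]; raise if a name is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]
    # A function replacement avoids backslash-escape surprises in re.sub.
    return PLACEHOLDER.sub(substitute, template)

template = "def sort(arr):\n    {{IMPLEMENTATION}}\n"
rendered = render_template(template, {"IMPLEMENTATION": "return sorted(arr)"})
print(rendered)
```

Because the rendered text is ordinary Python here, it can be passed to the same `exec`-based evaluation pattern used in the quick-start example.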
7.2.2 Evolution Engine
The engine manages the core evolutionary loop. The documented
EngineConfig (§10) specifies the following parameters with their defaults:
population_size (default: 30),
elite_size (5), tournament_size (3),
mutation_rate (1.0),
crossover_rate (0.0), island_count (1),
migration_rate (0.1),
and migration_interval (10). Based on these parameters and the
architecture overview (§3), the engine maintains three logical structures:
a candidate pool (Population Manager) with optional island-model
partitioning and configurable migration, a Pareto frontier
(ParetoFrontier, importable from gepa.engine per §6) of
non-dominated solutions, and a History Store recording all results
with ASI and genealogy tracking. The OptimizationResult fields
(.pareto_front, .history, .stats, documented
in §12) directly expose the frontier and history data.
7.2.3 Reflection Engine
The reflection engine is GEPA's primary mutation mechanism. Rather than applying random
perturbations or generic LLM rewrites, it constructs a structured reflection prompt
that includes the parent candidate, a minibatch of evaluation examples (biased toward
failure cases), all associated ASI records, and recent mutation history. The documented
ReflectionConfig (importable from both gepa.config per §10
and gepa.reflection per §7) exposes the following parameters with defaults:
minibatch_size (3),
failure_bias (0.7), temperature (0.8),
max_tokens (4096),
max_history_length (10), boolean flags
include_trace (True), include_error (True),
include_comparison (True), include_image (False),
reflection_model ("claude-sonnet-4-20250514"), and
reflection_prompt_version ("v3").
The ReflectionEngine class is importable from
gepa.reflection, as shown in the documentation's §7 code examples.
This design is the main architectural link between evaluation quality and mutation
quality — the richer the ASI, the more targeted the resulting mutation. Section 7.3
provides a detailed algorithmic treatment.
7.2.4 Evaluation Pipeline
Evaluation proceeds through four stages: (1) cache check via
content-hash deduplication, (2) sandbox execution with configurable
isolation (Docker container, subprocess, or direct execution — as specified by the
sandbox parameter in EvalConfig),
(3) metric computation via the user-provided evaluation function, and
(4) ASI extraction from the evaluation results. The
EvaluationPipeline class and EvalConfig class (both
importable from gepa.evaluation, documented in §11) support the following
parameters with defaults: max_workers (8),
timeout_per_eval (60 seconds), retry_on_error (True),
cache_enabled (True), cache_backend ("sqlite"),
and sandbox ("docker"). The three documented sandbox levels
are "docker" (full container isolation), "subprocess"
(process-level isolation), and "none" (direct execution for trusted code).
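The cache-check stage can be sketched as follows. This is an illustration of content-hash deduplication in general, not GEPA's implementation: `content_hash` and `CachedEvaluator` are hypothetical names, the cache is an in-memory dict rather than the documented SQLite backend, and including a task identifier in the key is an assumption.

```python
import hashlib
from typing import Callable

def content_hash(candidate: str, task_id: str) -> str:
    """Stable key derived from the candidate text and a task identifier."""
    payload = f"{task_id}\x00{candidate}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class CachedEvaluator:
    """Hypothetical wrapper (not GEPA API): skips re-evaluating identical text."""

    def __init__(self, evaluate: Callable[[str], float]):
        self.evaluate = evaluate
        self.cache: dict[str, float] = {}
        self.calls = 0  # number of real evaluations performed

    def __call__(self, candidate: str, task_id: str = "default") -> float:
        key = content_hash(candidate, task_id)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.evaluate(candidate)
        return self.cache[key]

# Toy scoring function: the score is just the text length.
cached = CachedEvaluator(lambda text: float(len(text)))
first = cached("return sorted(arr)")
second = cached("return sorted(arr)")   # identical text: served from the cache
third = cached("return sorted(arr) ")   # trailing space: different hash
print(cached.calls)
```

Hashing the raw text means that semantically equivalent candidates with different whitespace are treated as distinct, which is the conservative choice for an artifact-agnostic system.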
7.2.5 Stopping Controller
GEPA provides five composable stopping criteria — MaxMetricCalls,
Timeout, NoImprovement, ScoreThreshold,
and Composite — importable from gepa.stopping (documented
in §9) and combinable with AND/OR logic. This composability is a notable improvement
over the fixed stopping criteria found in FunSearch and AlphaEvolve. The following
example is reproduced from the GEPA documentation (§9), with inline comments condensed:
# Reproduced from GEPA docs §9 (Stopping Criteria).
# Class names, import path, and constructor signatures are documented.
from gepa.stopping import (
    MaxMetricCalls, Timeout, NoImprovement,
    ScoreThreshold, Composite,
)

stopping = Composite(
    criteria=[
        MaxMetricCalls(200),
        Timeout(7200),
        ScoreThreshold("accuracy", 1.0),
        Composite(
            criteria=[
                NoImprovement(patience=30, min_delta=0.001),
                MaxMetricCalls(50),
            ],
            mode="AND",
        ),
    ],
    mode="OR",
)
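How such nested AND/OR composition might evaluate can be sketched with plain predicates. Everything below is hypothetical: `RunState`, its fields, and the helper functions are illustrative stand-ins, not the `gepa.stopping` classes, and reading the inner `MaxMetricCalls(50)` as a minimum-budget guard on the plateau check is an interpretation of the documented example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunState:
    """Hypothetical snapshot of a run; field names are assumptions."""
    metric_calls: int
    stale_iterations: int   # iterations since the last improvement
    best_accuracy: float

Criterion = Callable[[RunState], bool]

def max_calls(limit: int) -> Criterion:
    return lambda s: s.metric_calls >= limit

def no_improvement(patience: int) -> Criterion:
    return lambda s: s.stale_iterations >= patience

def score_threshold(target: float) -> Criterion:
    return lambda s: s.best_accuracy >= target

def composite(criteria: list[Criterion], mode: str) -> Criterion:
    """Combine criteria: 'OR' stops when any fires, 'AND' when all fire."""
    if mode not in ("AND", "OR"):
        raise ValueError(f"unknown mode: {mode}")
    combine = all if mode == "AND" else any
    return lambda s: combine(c(s) for c in criteria)

should_stop = composite(
    [
        max_calls(200),
        score_threshold(1.0),
        composite([no_improvement(30), max_calls(50)], mode="AND"),
    ],
    mode="OR",
)

early_plateau = RunState(metric_calls=40, stale_iterations=35, best_accuracy=0.9)
late_plateau = RunState(metric_calls=60, stale_iterations=35, best_accuracy=0.9)
print(should_stop(early_plateau), should_stop(late_plateau))
```

Under this reading, a plateau alone does not end the run until at least 50 metric calls have been spent, while the outer OR still enforces the hard budget and the target score.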
7.3 Core Algorithms
Notation and Assumptions
The following notation is used throughout Sections 7.3–7.4. Unless otherwise noted, these formalizations are the author's rendering of the documented API contracts, not notation from the GEPA documentation itself. The GEPA documentation presents the dominance relation, hypervolume indicator, and crowding distance formula in simplified notation (§6); the full mathematical treatment below uses standard EMOA definitions consistent with that notation.
| Symbol | Meaning |
|---|---|
| $C$ | Candidate artifact space (set of all valid text artifacts) |
| $T$ | Task space (set of all task definitions) |
| $m$ | Number of objectives (dimensionality of the score vector) |
| $f_k : C \times T \to \mathbb{R}$ | The $k$-th scalar objective function, $k \in \{1, \ldots, m\}$ |
| $\mathbf{f} : C \times T \to \mathbb{R}^m$ | The vector-valued objective: $\mathbf{f}(c, \tau) = (f_1(c, \tau), \ldots, f_m(c, \tau))$ |
| $\mathcal{A}^*$ | Set of all finite sequences of ASI records (heterogeneous typed lists) |
| $P$ | Current population of candidate solutions |
| $\text{PF}$ | Pareto frontier: the set of non-dominated solutions in $P$ |
| $\text{HV}(\text{PF}, \mathbf{r})$ | Hypervolume indicator of $\text{PF}$ relative to reference point $\mathbf{r}$ |
| $\text{CD}(i)$ | Crowding distance of the $i$-th solution on the frontier |
Direction normalization convention. All objectives are
direction-normalized so that higher is always better. For a minimization
objective with raw value $v$, the normalized value is $-v$. This convention is
consistent with the Metric(direction="minimize") API parameter and the
sign-flip logic visible in the documented ParetoFrontier.dominates()
method (§6), which applies a_val, b_val = -a_val, -b_val for
minimization objectives.
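The convention amounts to a sign flip per minimization objective. The function below is a minimal sketch of it; `normalize_scores` is a hypothetical name, and the dictionary-based interface is an assumption chosen to mirror the `scores` field of `EvalResult`.

```python
def normalize_scores(scores: dict[str, float],
                     directions: dict[str, str]) -> dict[str, float]:
    """Flip the sign of minimization objectives so higher is always better."""
    normalized = {}
    for name, value in scores.items():
        direction = directions[name]
        if direction == "maximize":
            normalized[name] = value
        elif direction == "minimize":
            normalized[name] = -value
        else:
            raise ValueError(f"unknown direction for {name!r}: {direction!r}")
    return normalized

directions = {"accuracy": "maximize", "latency_ms": "minimize"}
fast = normalize_scores({"accuracy": 0.9, "latency_ms": 12.0}, directions)
slow = normalize_scores({"accuracy": 0.9, "latency_ms": 40.0}, directions)

# After normalization the faster candidate has the larger latency score.
print(fast["latency_ms"] > slow["latency_ms"])
```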
Scalar vs. vector objectives. The Pareto frontier (Section 7.3.3)
operates on the full $m$-dimensional score vector $\mathbf{f}(c, \tau)$. The
generalization mode (Section 7.4.3) aggregates per-task scores into scalar averages;
when $m > 1$, model selection uses the primary metric (the Metric
with primary=True) to select the best candidate. Section 7.4.3 makes this
aggregation explicit and notes which parts are documented and which are the author's
formalization.
7.3.1 Actionable Side Information (ASI)
ASI is GEPA's most distinctive algorithmic contribution. In prior systems, evaluation returns a scalar score (or a small vector of scores), and the LLM must infer why a candidate performs poorly from the code alone. GEPA inverts this: the evaluation function returns both scores and structured diagnostic records that are injected directly into the reflection prompt. This transforms mutation from a "guess what went wrong" task to a "here is exactly what went wrong, fix it" task.
The GEPA documentation (§5: Actionable Side Information) defines six ASI types:
| ASI Type | Content | Role in Reflection |
|---|---|---|
TraceASI |
Step-by-step execution traces, variable states | LLM identifies where execution diverges from expected behavior |
ErrorASI |
Exception type, message, full stack trace | LLM directly targets the error-causing code |
ImageASI |
Visual outputs (plots, rendered diagrams) | Multimodal LLM analyzes visual quality |
MetricASI |
Per-case score breakdowns, timing details | LLM focuses on worst-performing sub-metrics |
ComparisonASI |
Diff between actual and expected output | LLM targets specific discrepancies |
TextASI |
Free-form text feedback, LLM-judge annotations | LLM incorporates qualitative guidance |
Formally, let $c$ denote a candidate artifact and $\tau$ a task. In a traditional evolutionary system, evaluation is a function $\mathbf{f}: C \times T \to \mathbb{R}^m$, where $m$ is the number of objectives. GEPA extends this to:
$$\mathrm{eval}: C \times T \to \mathbb{R}^m \times \mathcal{A}^*$$
where $\mathbb{R}^m$ is the $m$-dimensional score vector (corresponding to the
scores field in EvalResult) and $\mathcal{A}^*$ is the set
of all finite sequences of ASI records (corresponding to the side_info
field). Each element of the ASI sequence can be any of the six types above.
This extended evaluation signature is the foundation for reflection-driven mutation.
The following code example is near-verbatim from the GEPA documentation (§5: Actionable
Side Information). The import path, class names, and constructor signatures for
EvalResult, TraceASI, ErrorASI,
MetricASI, and ComparisonASI are documented in the API
reference. One minor adaptation: ComparisonASI has been added to the
import statement — it is used in the code body of the documented example but was
omitted from the original import line (an apparent documentation oversight). The helper
functions compute_accuracy(), compute_efficiency(), and
find_mismatches() are used in the documentation as placeholder function
names without definitions; they are not part of the GEPA API.
# Near-verbatim from GEPA docs §5 (ASI).
# Import paths and ASI constructor signatures are documented.
# ComparisonASI added to import (used but missing in original).
# compute_accuracy, compute_efficiency, find_mismatches are
# undefined placeholders in the docs, not GEPA API functions.
# try/finally added around the traced call so that print is
# restored even when the candidate raises.
from gepa import EvalResult, TraceASI, ErrorASI, MetricASI, ComparisonASI

def evaluate_with_asi(candidate: str, task: dict) -> EvalResult:
    """Evaluation function that produces scores and structured ASI."""
    try:
        exec_globals = {}
        exec(candidate, exec_globals)
        solve_fn = exec_globals["solve"]

        # Collect execution trace by temporarily wrapping print
        trace = []
        original_print = print

        def traced_print(*args, **kwargs):
            trace.append(" ".join(str(a) for a in args))
            original_print(*args, **kwargs)

        import builtins
        builtins.print = traced_print
        try:
            result = solve_fn(task["input"])
        finally:
            builtins.print = original_print

        # Compute metrics
        accuracy = compute_accuracy(result, task["expected"])
        efficiency = compute_efficiency(result)

        # Build ASI records
        side_info = [
            TraceASI(trace=trace, label="execution_trace"),
            MetricASI(
                metrics={
                    "accuracy": accuracy,
                    "efficiency": efficiency,
                    "output_length": len(str(result)),
                },
                label="detailed_metrics",
            ),
        ]
        if accuracy < 1.0:
            mismatches = find_mismatches(result, task["expected"])
            side_info.append(
                ComparisonASI(
                    expected=str(task["expected"]),
                    actual=str(result),
                    diff=mismatches,
                    label="output_comparison",
                )
            )
        return EvalResult(
            scores={"accuracy": accuracy, "efficiency": efficiency},
            side_info=side_info,
        )
    except Exception as e:
        import traceback
        return EvalResult(
            scores={"accuracy": 0.0, "efficiency": 0.0},
            side_info=[
                ErrorASI(
                    error_type=type(e).__name__,
                    message=str(e),
                    traceback=traceback.format_exc(),
                    label="runtime_error",
                )
            ],
        )
The design choice to make ASI a first-class type in the evaluation contract, rather than an optional logging side-channel, has two consequences. First, it creates a strong incentive for users to instrument their evaluation functions with diagnostic output, since richer ASI directly improves mutation quality. Second, it places the burden of ASI design on the user — the system cannot automatically extract useful diagnostics from an arbitrary evaluation function. The GEPA authors acknowledge this as a current limitation (§16) and identify Auto-ASI (automatic instrumentation) as a future research direction.
7.3.2 Reflection-Driven Mutation
The reflection engine implements a six-step pipeline for generating improved candidates.
This is GEPA's primary search operator, replacing the random or semi-random mutation
operators used in classical evolutionary algorithms. The pipeline is described in the
GEPA documentation (§7: Reflection-Driven Mutation), which includes the
ReflectionConfig parameter documentation, a simplified version of the
reflection prompt template, and code showing ReflectionEngine instantiation
and the reflect() method call.
Step 1 — Parent selection. A parent is chosen from the Pareto frontier
using the ParetoFrontier.select_parent() method (see Section 7.3.4 for the
documented selection strategies). The documentation (§7) states that the default uses
"crowding-distance-weighted selection," consistent with the
strategy="crowding" default shown in the select_parent()
code (§6).
Step 2 — Minibatch sampling. A minibatch of 2–3 evaluation examples
is sampled, with a configurable failure bias (failure_bias, default: 0.7,
meaning 70% probability of sampling failure cases, per §§7, 10). The GEPA documentation
(§7, callout box) states that this minibatch size provides "the optimal trade-off between
context richness and LLM attention capacity" — a claim presented without experimental
evidence (see Section 7.5.6).
Step 3 — ASI assembly. All ASI records associated with the selected
examples are gathered and formatted into structured blocks within the reflection prompt.
The ReflectionConfig flags include_trace,
include_error, include_comparison, and
include_image (documented in §§7, 10) control which ASI types are
included.
Step 4 — History context. The last $k$ mutations (default $k = 10$,
configurable via max_history_length, per §§7, 10) are included, annotated
with whether each improved or degraded the score, providing the LLM with short-term
search memory.
Step 5 — LLM reflection. The assembled prompt is sent to the LLM with explicit instructions to (a) analyze the diagnostic feedback, (b) identify root causes of failure, (c) propose a specific, targeted fix rather than a full rewrite, and (d) output the complete improved candidate. The GEPA documentation includes two versions of this prompt template — a simplified version in §5 (within the ASI section) and a more detailed structural outline in §7 — with sections for role definition, current solution, diagnostic analysis, mutation history, and output format.
Step 6 — Validation and insertion. The LLM output is parsed,
validated against artifact constraints (syntax, type checking, basic execution via
Artifact.validate(), per §12), and inserted into the population for
evaluation.
The key design insight is that this prompt is not a generic "improve this code" instruction — it provides the LLM with the same diagnostic information a human developer would use when debugging.
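Steps 2 through 5 can be sketched end to end. This is a reconstruction under stated assumptions, not GEPA's code: `sample_minibatch` and `build_reflection_prompt` are hypothetical helpers, the example records with `passed` and `feedback` keys are invented for illustration, and the interpretation of `failure_bias` as a per-slot probability of drawing from the failure pool is an assumption.

```python
import random

def sample_minibatch(examples, size=3, failure_bias=0.7, rng=None):
    """Draw up to `size` distinct examples, favouring failed ones.

    Assumed semantics: each slot is filled from the failure pool with
    probability `failure_bias`; if the chosen pool is empty, the other is used.
    """
    rng = rng or random.Random()
    failures = [e for e in examples if not e["passed"]]
    successes = [e for e in examples if e["passed"]]
    rng.shuffle(failures)
    rng.shuffle(successes)
    batch = []
    while len(batch) < size and (failures or successes):
        want_failure = rng.random() < failure_bias
        use_failures = (want_failure and bool(failures)) or not successes
        batch.append(failures.pop() if use_failures else successes.pop())
    return batch

def build_reflection_prompt(parent, minibatch, history, max_history_length=10):
    """Assemble role, solution, diagnostics, history, and output sections."""
    diagnostics = [f"- case {e['id']}: {e['feedback']}" for e in minibatch]
    recent = history[-max_history_length:]  # short-term search memory
    past = [
        f"- {summary}: {'improved' if delta > 0 else 'no improvement'} ({delta:+.2f})"
        for summary, delta in recent
    ] or ["- none yet"]
    sections = [
        ("Role", "You improve a candidate using diagnostic feedback."),
        ("Current solution", parent),
        ("Diagnostics", "\n".join(diagnostics)),
        ("Mutation history", "\n".join(past)),
        ("Output format", "State the root cause, apply a targeted fix, "
                          "then return the complete improved candidate."),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)

examples = [
    {"id": 0, "passed": True, "feedback": "ok"},
    {"id": 1, "passed": False, "feedback": "expected [1, 2], got [2, 1]"},
    {"id": 2, "passed": False, "feedback": "IndexError on empty input"},
    {"id": 3, "passed": True, "feedback": "ok"},
]
batch = sample_minibatch(examples, size=2, failure_bias=1.0, rng=random.Random(0))
prompt = build_reflection_prompt(
    parent="def sort(arr):\n    return arr",
    minibatch=batch,
    history=[("swapped loop bounds", -0.10), ("added base case", 0.25)],
)
print(prompt)
```

With `failure_bias=1.0` the minibatch consists only of failing cases whenever enough exist, so the diagnostics section carries the two concrete failures that a developer would look at first.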
7.3.3 Pareto Frontier Maintenance
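GEPA maintains a Pareto frontier $\text{PF}$ of non-dominated solutions in the population $P$. Given $m$ objective functions $f_1, \ldots, f_m$ (direction-normalized so that higher is always better), solution $\mathbf{x}$ dominates solution $\mathbf{y}$ (written $\mathbf{x} \succ \mathbf{y}$) if and only if:
$$\mathbf{x} \succ \mathbf{y} \iff \big(\forall k \in \{1, \ldots, m\}: f_k(\mathbf{x}) \ge f_k(\mathbf{y})\big) \wedge \big(\exists j \in \{1, \ldots, m\}: f_j(\mathbf{x}) > f_j(\mathbf{y})\big)$$
The Pareto frontier is then defined as the set of all non-dominated solutions:
$$\text{PF} = \{\mathbf{x} \in P : \nexists\, \mathbf{y} \in P \text{ such that } \mathbf{y} \succ \mathbf{x}\}$$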
These are standard definitions from the multi-objective optimization literature
(Deb et al., 2002). The GEPA documentation (§6: Pareto-Efficient Search) includes
a code example of the ParetoFrontier class (importable from
gepa.engine) with dominates(), update(), and
select_parent() methods implementing this logic. The following code is
reproduced verbatim from the GEPA documentation §6, including the class definition,
all three methods, and the Solution type annotation. The Solution
type is used throughout the documented class but is not independently defined in the
API reference; it appears to wrap a candidate's code string and score dictionary.
# Verbatim from GEPA docs §6 (Pareto-Efficient Search).
# Import path: from gepa.engine import ParetoFrontier (documented).
# Solution type is used but not independently defined in the API reference.
from gepa.engine import ParetoFrontier

class ParetoFrontier:
    def __init__(self, objectives: list[str], directions: list[str]):
        self.objectives = objectives
        self.directions = directions  # "maximize" or "minimize"
        self.frontier: list[Solution] = []

    def dominates(self, a: Solution, b: Solution) -> bool:
        """Check if solution a dominates solution b."""
        at_least_one_better = False
        for obj, direction in zip(self.objectives, self.directions):
            a_val = a.scores[obj]
            b_val = b.scores[obj]
            if direction == "minimize":
                a_val, b_val = -a_val, -b_val
            if a_val < b_val:
                return False
            if a_val > b_val:
                at_least_one_better = True
        return at_least_one_better

    def update(self, candidate: Solution) -> bool:
        """Add candidate to frontier if non-dominated. Returns True if added."""
        for member in self.frontier:
            if self.dominates(member, candidate):
                return False  # Dominated, discard
        self.frontier = [
            m for m in self.frontier
            if not self.dominates(candidate, m)
        ]
        self.frontier.append(candidate)
        return True

    def select_parent(self, strategy: str = "crowding") -> Solution:
        """Select a parent from the frontier for mutation."""
        if strategy == "crowding":
            distances = self._crowding_distances()
            probs = distances / distances.sum()
            return np.random.choice(self.frontier, p=probs)
        elif strategy == "random":
            return random.choice(self.frontier)
        elif strategy == "tournament":
            a, b = random.sample(self.frontier, 2)
            return a if self._hypervolume_contribution(a) \
                > self._hypervolume_contribution(b) else b
The dominates() method performs the direction-normalized dominance check
described above — note the explicit sign-flip
(a_val, b_val = -a_val, -b_val) for minimization objectives.
The update() method implements the standard frontier
update: (1) if any existing member dominates the candidate, discard it;
(2) if the candidate dominates existing members, remove them; (3) otherwise, add the
candidate as a new trade-off point.
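The update rule can be traced on a small two-objective example. The standalone functions below re-express the same dominance and update logic on plain score tuples so that the trace runs without the GEPA package; they assume scores are already direction-normalized, and the names `dominates` and `update_frontier` are illustrative rather than GEPA API.

```python
Scores = tuple[float, ...]

def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_frontier(frontier: list[Scores], candidate: Scores) -> bool:
    """Insert candidate if non-dominated, pruning members it dominates."""
    if any(dominates(member, candidate) for member in frontier):
        return False
    frontier[:] = [m for m in frontier if not dominates(candidate, m)]
    frontier.append(candidate)
    return True

frontier: list[Scores] = []
for point in [(0.6, 0.5), (0.9, 0.2), (0.5, 0.4), (0.7, 0.6), (0.9, 0.2)]:
    added = update_frontier(frontier, point)
    print(point, "added" if added else "rejected", frontier)
```

The third point is rejected because (0.6, 0.5) dominates it, and the fourth evicts (0.6, 0.5). The final point repeats an existing score vector and is added again: equal vectors neither dominate nor are dominated under this rule, so duplicate trade-off points can accumulate unless they are filtered separately.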
7.3.4 Crowding Distance and Selection
To maintain diversity on the Pareto frontier, GEPA uses crowding distance to measure how isolated a solution is in objective space. The GEPA documentation (§6) presents the crowding distance formula, consistent with the standard definition from NSGA-II (Deb et al., 2002):
$$\text{CD}(i) = \sum_{k=1}^{m} \frac{f_k(i+1) - f_k(i-1)}{f_k^{\max} - f_k^{\min}}$$
where, for each objective $k$, solutions on the frontier are sorted by their $f_k$ value; $f_k(i+1)$ and $f_k(i-1)$ are the objective values of the neighboring solutions in that sorted order; and $f_k^{\max}$, $f_k^{\min}$ are the maximum and minimum values of objective $k$ across the frontier. The GEPA documentation (§6) states that "boundary solutions receive infinite crowding distance," which is the standard NSGA-II convention ensuring extreme trade-off points are preferentially preserved.
Two edge cases require explicit handling:
- Boundary solutions (first and last in the sorted order for any objective $k$): $\text{CD}(i) = \infty$ ensures these are always preferentially selected or preserved, as they represent extreme trade-offs.
- Zero objective range ($f_k^{\max} = f_k^{\min}$): when all frontier solutions have the same value for objective $k$, the contribution from that objective is zero (the term for objective $k$ is skipped). This prevents division by zero.
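The definition and both edge cases can be sketched in a few lines of Python. This is an illustrative reconstruction, not GEPA's implementation: the function name crowding_distances and the list-of-tuples input are assumptions, and skipping a zero-range objective before assigning its boundary distances is only one reasonable reading of the second edge case.

```python
import math


def crowding_distances(points):
    """NSGA-II-style crowding distance for a list of objective vectors.

    `points[i][k]` is the value of objective k for frontier member i.
    Hypothetical helper for illustration; not part of the GEPA API.
    """
    n = len(points)
    if n <= 2:
        # With one or two members, every member is a boundary solution.
        return [math.inf] * n
    m = len(points[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: points[i][k])
        f_min = points[order[0]][k]
        f_max = points[order[-1]][k]
        if f_max == f_min:
            # Zero objective range: skip this objective (avoids division by zero).
            continue
        # Boundary solutions for this objective receive infinite distance.
        dist[order[0]] = math.inf
        dist[order[-1]] = math.inf
        for pos in range(1, n - 1):
            i = order[pos]
            gap = points[order[pos + 1]][k] - points[order[pos - 1]][k]
            dist[i] += gap / (f_max - f_min)  # inf + finite stays inf
    return dist
```

For a four-point frontier such as [(1, 5), (2, 3), (3, 2), (5, 1)], the two extreme points receive infinite distance and the two interior points receive finite values, which is the situation that the selection code has to handle.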
Documented selection strategies. The GEPA documentation (§6) shows
three selection strategies in the ParetoFrontier.select_parent() method,
reproduced verbatim in the code above:
- strategy="crowding" (default): proportional selection where the probability of selecting each frontier member is proportional to its crowding distance (probs = distances / distances.sum()). This is the code shown in the documented class.
- strategy="random": uniform random selection from the frontier.
- strategy="tournament": two solutions are sampled from the frontier, and the one with higher hypervolume contribution is selected. This is the most computationally expensive strategy as it requires recomputing the hypervolume indicator with and without each candidate.
The documented proportional selection (strategy="crowding") with
probs = distances / distances.sum() is mathematically undefined when
boundary solutions have $\text{CD} = \infty$: the sum is then infinite, so each boundary
entry evaluates to infinity divided by infinity, which is NaN (and each interior entry to 0),
and NumPy's np.random.choice raises a ValueError if any
probability is NaN. This inconsistency exists within the documentation itself: §6
both specifies infinite boundary distances and shows the proportional-selection code
that cannot handle them.
Three plausible resolutions exist: (a) the implementation clips infinite distances to a large finite value before normalization, (b) boundary solutions are always selected first and proportional selection applies only to interior solutions, or (c) the implementation uses tournament selection on crowding distance (where boundary solutions always win), as in canonical NSGA-II. The separately documented
EngineConfig.tournament_size
parameter (§10, default: 3) suggests tournament-based selection may be available in the
engine's main loop beyond the three strategies shown in
ParetoFrontier.select_parent(). The documentation does not resolve this
inconsistency. Readers implementing GEPA-like systems should adopt one of these
strategies explicitly.
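As a concrete illustration of the problem and of resolution (a), the following pure-Python sketch first mirrors the documented normalization and then clips infinite distances before normalizing. The names naive_weights, clipped_weights, and pick_parent, and the cap_factor parameter, are hypothetical; nothing here is taken from the GEPA codebase, and the choice of cap is an assumption that a real implementation would need to tune.

```python
import math
import random


def naive_weights(distances):
    """Mirror probs = distances / distances.sum() without NumPy."""
    total = sum(distances)
    return [d / total for d in distances]  # inf / inf evaluates to NaN


def clipped_weights(distances, cap_factor=2.0):
    """Resolution (a): replace infinite distances by a finite cap, then normalize.

    Assumption: the cap is cap_factor times the largest finite distance,
    or 1.0 when no positive finite distance exists.
    """
    finite = [d for d in distances if math.isfinite(d)]
    largest = max(finite, default=0.0)
    cap = cap_factor * largest if largest > 0 else 1.0
    clipped = [d if math.isfinite(d) else cap for d in distances]
    total = sum(clipped)
    if total == 0:
        # All distances are zero: fall back to uniform selection.
        return [1.0 / len(distances)] * len(distances)
    return [d / total for d in clipped]


def pick_parent(frontier, distances, rng=random):
    """Draw one frontier member with probability proportional to clipped distance."""
    return rng.choices(frontier, weights=clipped_weights(distances), k=1)[0]
```

With distances [inf, 1.25, 1.25, inf], naive_weights yields NaN for the boundary entries, whereas clipped_weights yields valid probabilities that still favor the boundary solutions. Resolutions (b) and (c) avoid choosing a cap altogether, at the cost of a less uniform treatment of interior solutions.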
7.3.5 Hypervolume Indicator
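GEPA uses the hypervolume indicator to measure the overall quality of the Pareto frontier. The GEPA documentation (§6) presents this formula in simplified notation; the standard definition for a frontier $\mathcal{P}$ and reference point $\mathbf{r}$, consistent with the documented behavior, is:
$$\text{HV}(\mathcal{P}, \mathbf{r}) = \Lambda\left(\bigcup_{\mathbf{p} \in \mathcal{P}} \{\mathbf{q} \in \mathbb{R}^m : \mathbf{p} \geq \mathbf{q} \geq \mathbf{r}\}\right)$$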
where $\Lambda$ denotes the Lebesgue measure (volume) in $\mathbb{R}^m$, and the
inequality $\mathbf{p} \geq \mathbf{q}$ is component-wise, with all objectives
direction-normalized so higher is better. The reference point $\mathbf{r}$ is typically
the worst observed value in each objective. The hypervolume captures both
convergence toward the true Pareto front and diversity along it — a strictly dominated
frontier has lower hypervolume, and a frontier with gaps has lower hypervolume than one
with uniform coverage. This is a standard quality indicator from the EMOA literature
(Zitzler and Thiele, 1999). The hypervolume also appears in the documented
strategy="tournament" selection, where the solution with higher
hypervolume contribution wins the tournament (see select_parent() code
in §7.3.3).
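For two objectives the indicator reduces to the area of a staircase region, which makes both the hypervolume and the per-solution contribution used by tournament selection easy to compute. The sketch below assumes both objectives are already direction-normalized (higher is better); the function names are hypothetical, and for $m > 2$ a dedicated hypervolume algorithm would be needed instead of this sweep.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Only points strictly better than the reference in both objectives add area.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from largest x to smallest; each point adds a rectangle covering
    # the y-range that points with larger x have not covered yet.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume_contribution(points, ref, p):
    """Hypervolume lost when solution `p` is removed from the frontier."""
    without_p = [q for q in points if q != p]
    return hypervolume_2d(points, ref) - hypervolume_2d(without_p, ref)
```

Computing one contribution requires two hypervolume evaluations, which is why the tournament strategy is the most expensive of the three documented options.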
7.3.6 Seedless Bootstrap
Unlike FunSearch, AlphaEvolve, and OpenEvolve, which require a user-provided seed solution, GEPA can initialize the population from a natural language description alone. The bootstrap pipeline, as described in the GEPA documentation (§8: Seedless Mode), proceeds as follows:
- Strategy enumeration: If the user provides hint strategies via SeedlessConfig.bootstrap_strategies (e.g., ["greedy_first_fit", "best_fit_decreasing", "dynamic_programming"]), one candidate is generated per strategy. Otherwise, the LLM is prompted to enumerate diverse algorithmic approaches.
- Parallel generation: The LLM concurrently generates concrete implementations for each strategy, using the artifact's description and signature fields as context.
- Validation: Each candidate undergoes syntax checking, type verification, and basic execution before insertion.
- Initial evaluation: All valid candidates are evaluated to establish the initial Pareto frontier.
- Normal evolution: The standard reflection-driven mutation loop begins.
Seedless mode is configured via SeedlessConfig (documented in §8), which
specifies initial_population_size (default: 10),
diversity_prompt (boolean, default: True), and optional
bootstrap_strategies (a list of strategy name strings). These parameters
are documented in the configuration API with the code example shown in §8. This feature
is particularly valuable for exploratory optimization where the user does not have a
strong prior on solution structure.
7.3.7 Content-Hash Caching
GEPA avoids redundant evaluation through deterministic content-hash caching. The GEPA documentation (§11) includes a code example showing the caching mechanism. For a candidate $c$ and task $\tau$, the cache key is:
$$k(c, \tau) = \text{SHA-256}\big(c \,\|\, \text{json}_{\text{sorted}}(\tau)\big)$$
where $\|$ denotes string concatenation and $\text{json}_{\text{sorted}}$ is a
deterministic JSON serialization with sorted keys. This formalization matches the
documented code example (§11), which shows
hashlib.sha256(content.encode()).hexdigest() where content
is the candidate string concatenated with json.dumps(task, sort_keys=True).
Before evaluation, the pipeline checks whether $k(c, \tau)$ exists in the cache. The
documented EvalConfig.cache_backend parameter (§11) supports
"sqlite" (default), "redis", or "memory"
storage backends. If found, the cached EvalResult (including ASI) is
returned immediately. This is especially valuable in multi-task and generalization
modes, where the same candidate may be evaluated against overlapping task subsets.
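The key construction and lookup can be sketched as follows. The hashing expression follows the documented example (SHA-256 over the candidate string concatenated with json.dumps(task, sort_keys=True)); the function names cache_key and evaluate_with_cache are hypothetical, and a plain dict stands in for the documented "memory" backend.

```python
import hashlib
import json


def cache_key(candidate: str, task: dict) -> str:
    """Deterministic key: SHA-256 of candidate text plus sorted-key task JSON."""
    content = candidate + json.dumps(task, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()


def evaluate_with_cache(candidate, task, evaluate, cache):
    """Return the cached result when present; otherwise evaluate and store it.

    `evaluate` is any callable taking (candidate, task); `cache` is a dict
    used here as an in-memory stand-in for a real cache backend.
    """
    key = cache_key(candidate, task)
    if key not in cache:
        cache[key] = evaluate(candidate, task)
    return cache[key]
```

Sorting the keys matters because two dictionaries with the same content but different insertion order would otherwise serialize differently and miss the cache.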
7.4 Optimization Modes
GEPA defines three optimization modes that address different relationships between artifacts and tasks, each configured via a dedicated config class documented in the API (§§4.1–4.3). This is a structural contribution: prior systems in this survey offer only single-task optimization, leaving multi-task generalization to the user.
7.4.1 Single-Task Mode
The simplest mode: one artifact is optimized against one task definition. This is the
standard evolutionary optimization scenario, equivalent to what FunSearch, AlphaEvolve,
and OpenEvolve provide. The user specifies a task dictionary, and all evaluation calls
use that fixed task. Configured via SingleTaskConfig (documented in §4.1,
with parameters including population_size, max_iterations,
and reflection_minibatch_size). In this mode, the full $m$-dimensional
score vector $\mathbf{f}(c, \tau)$ is computed for each candidate, and the Pareto
frontier operates over all $m$ objectives.
7.4.2 Multi-Task Mode with Cross-Transfer
Multi-task mode optimizes a single artifact across multiple tasks simultaneously. The key mechanism is cross-transfer: solutions that perform well on one task can transfer insights to other tasks. For example, a TSP heuristic optimized on instances of size 50, 100, and 200 simultaneously may discover strategies that generalize across scales.
Cross-transfer is controlled by three documented MultiTaskConfig parameters
(§4.2): a boolean cross_transfer flag (shown as True in the
documented example), a transfer_frequency (default: 5, how often transfer
occurs in iterations), and a min_improvement_for_transfer threshold
(default: 0.01, minimum score delta to justify transferring a solution from one task's
population to another's). The result object includes per-task results via
result.per_task (documented in §12).
7.4.3 Generalization Mode
Generalization mode splits tasks into training and validation sets, addressing overfitting — a well-known risk in evolutionary optimization where highly specialized solutions perform well on training instances but fail on unseen problems. This mode imports a standard machine learning practice (train/validation splitting) into the LLM-driven evolutionary setting.
The mode is configured via GeneralizationConfig (documented in §4.3), which
specifies val_frequency (default: 10, how often validation is performed in
iterations), early_stopping_patience (default: 20, number of iterations
without validation improvement before stopping), and
overfitting_threshold (default: 0.15, maximum tolerated gap between
training and validation scores).
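One way to read the interaction between val_frequency and early_stopping_patience is sketched below. This is an interpretation, not documented behavior: it assumes validation runs at iterations that are multiples of val_frequency, that only a strict improvement resets the patience counter, and that patience is measured in iterations since the last improvement. The function name early_stop_iteration is hypothetical.

```python
def early_stop_iteration(val_scores, val_frequency=10, patience=20):
    """Return the iteration at which the run would stop, or None if it never does.

    val_scores[j] is the validation score observed at iteration
    (j + 1) * val_frequency. Higher scores are assumed to be better.
    """
    best = float("-inf")
    best_iter = 0
    for j, score in enumerate(val_scores):
        iteration = (j + 1) * val_frequency
        if score > best:
            # Strict improvement: remember it and reset the patience window.
            best, best_iter = score, iteration
        elif iteration - best_iter >= patience:
            return iteration
    return None
```

Under these assumptions and the default values, a run stops after two consecutive validation checks without improvement.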
Reconciling multi-objective scores with scalar aggregation.
Generalization mode requires reducing per-task, per-objective scores to scalar
summaries for train/val comparison. The GEPA documentation (§§4.3, 12) reports
train_score, val_score, and generalization_gap
as scalar fields on OptimizationResult. These fields are documented; the
following averaging formulas are the author's formalization of the
scalar aggregation implied by these documented fields, not equations taken from the
GEPA documentation. The exact internal aggregation method is not specified in the
documentation beyond the existence of these scalar result fields.
Let $f_k(c, \tau)$ denote the $k$-th objective
for candidate $c$ on task $\tau$, and let $k^*$ denote the primary metric
(the Metric with primary=True). Three roles must be
distinguished:
- Optimization objective (training). The evolutionary search (parent selection, reflection-driven mutation, population update) is guided by fitness on the training task set $\mathcal{T}_{\text{train}}$. For multi-objective problems ($m > 1$), the Pareto frontier operates on the full score vector averaged across training tasks. For model selection and early stopping, GEPA reduces to the primary metric. A natural formalization consistent with the documented scalar train_score field is:
  $$\bar{f}_{k^*,\text{train}}(c) = \frac{1}{|\mathcal{T}_{\text{train}}|} \sum_{\tau \in \mathcal{T}_{\text{train}}} f_{k^*}(c, \tau)$$
  This is the scalar training score reported as result.train_score. When $m = 1$ (single objective), $k^*$ is the only metric and the distinction is moot. (Author formalization: the documentation reports this as a scalar field but does not specify whether the aggregation is arithmetic mean, weighted mean, or some other function.)
- Selection criterion (validation). Periodically (every val_frequency iterations), the current best candidates are evaluated on the validation task set $\mathcal{T}_{\text{val}}$. The validation score for the primary metric is:
  $$\bar{f}_{k^*,\text{val}}(c) = \frac{1}{|\mathcal{T}_{\text{val}}|} \sum_{\tau \in \mathcal{T}_{\text{val}}} f_{k^*}(c, \tau)$$
  Crucially, validation fitness is not used to guide the evolutionary search itself. The validation set acts as a held-out check on generalization, analogous to validation in supervised learning. Validation is used for model selection (choosing which candidate to report as the final result) and for early stopping (halting when validation performance stagnates for early_stopping_patience iterations). (Author interpretation: this train/val separation is the standard interpretation consistent with the documented behavior; the documentation does not explicitly state that validation scores are excluded from the search fitness.)
- Final selection. The reported best candidate $c^*$ is selected by validation performance on the primary metric:
  $$c^* = \arg\max_{c \in \text{Candidates}} \; \bar{f}_{k^*,\text{val}}(c)$$
  where Candidates is the set of all candidates evaluated on the validation set during the run. This selects for a final output that generalizes beyond the training instances. The OptimizationResult reports this as result.val_score (documented in §12).
The generalization gap is defined on the primary metric:
$$\text{gap}(c^*) = \bar{f}_{k^*,\text{train}}(c^*) - \bar{f}_{k^*,\text{val}}(c^*)$$
A gap exceeding the configured overfitting_threshold (default: 0.15) triggers
an alert. The result object reports train_score, val_score, and
generalization_gap as documented scalar fields in §§4.3 and 12. Note that
when $m > 1$, the generalization gap as formalized above is defined only for the primary
metric; per-objective gaps could be computed but are not part of the documented API.
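The scalar aggregation, validation-based selection, and gap check described above can be combined in a short sketch. It follows the author formalization (arithmetic mean of the primary metric, gap taken as training score minus validation score) rather than any documented GEPA internals; the function names and the nested-dict input layout are hypothetical.

```python
from statistics import mean


def mean_primary_score(scores_by_task, task_ids):
    """Arithmetic mean of the primary metric over the given tasks."""
    return mean(scores_by_task[t] for t in task_ids)


def summarize(candidates, train_tasks, val_tasks, overfitting_threshold=0.15):
    """Select the best candidate by validation score and report the gap.

    `candidates` maps a candidate name to {task_id: primary-metric score}.
    """
    val = {name: mean_primary_score(s, val_tasks) for name, s in candidates.items()}
    best = max(val, key=val.get)  # argmax over validation score
    train_score = mean_primary_score(candidates[best], train_tasks)
    gap = train_score - val[best]
    return {
        "best": best,
        "train_score": train_score,
        "val_score": val[best],
        "generalization_gap": gap,
        "overfitting_alert": gap > overfitting_threshold,
    }
```

In this sketch a candidate with a higher training score but a lower validation score is passed over, which is the behavior that distinguishes validation-based selection from selecting on training fitness alone.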
Note on held-out evaluation. The GEPA documentation does not describe a separate test set beyond the train/val split. For rigorous empirical evaluation, researchers should reserve a third held-out set that is never used for either training or model selection, and report final performance on that set. The ARC-AGI results in Section 7.5 use a train/val split, but it is not documented whether the reported validation accuracy was also the final held-out evaluation or whether an additional test split was used.
7.5 Key Results
The GEPA authors report results across five benchmarks spanning code optimization, mathematical reasoning, geometric optimization, and infrastructure routing. This section separates self-reported GEPA results (Table 7.1) from cross-system comparisons (Table 7.3), as the two categories have fundamentally different evidence standards. All results cited below are from the GEPA project documentation (§§13–14) unless otherwise attributed.
7.5.1 Self-Reported GEPA Results
Table 7.1 presents the benchmark results reported in the GEPA documentation (§13). These are single-system results: GEPA's performance relative to a stated baseline, with no cross-system comparison.
| Benchmark | Artifact Type | Baseline | GEPA Result | Improvement | Eval Calls | LLM | Mode |
|---|---|---|---|---|---|---|---|
| Claude Code Bleve | Python code | 79.3% | 100% | +20.7 pp | 85 | claude-sonnet-4-20250514 (stated in §14) | Single-task |
| ARC-AGI v1 | Python code | 32.5%† | 89.5% | +57.0 pp | 1,200 | Not explicitly stated‡ | Generalization |
| AIME 2025 | Python code | 46.67%† | 60% | +13.33 pp | 400 | Not explicitly stated‡ | Single-task |
| Circle Packing n=26 | Python code | 2.63590†† | 2.63594 | +0.00004 | 300 | Not explicitly stated‡ | Not stated |
| CloudCast Routing | YAML config | Baseline routing | 40.2% cost savings | −40.2% cost | 800 | Not explicitly stated‡ | Not stated |
† Baseline source not attributed in the GEPA documentation. The 32.5%
ARC-AGI baseline may refer to a non-evolutionary baseline; the 46.67% AIME baseline
is not identified as a specific prior system.
†† Attributed to AlphaEvolve in the GEPA documentation.
‡ Documentation examples use claude-sonnet-4-20250514; whether
all benchmark runs used this model is not confirmed.
| Benchmark | Task Definition | Eval Function | Seed / Init | LLM Model | Reproducibility |
|---|---|---|---|---|---|
| Claude Code Bleve | GEPA-specific; described in prose (§14 case study) but not independently defined | Not provided | "Seed solution from Claude Code baseline: 79.3%" (§14) | claude-sonnet-4-20250514 (§14) | Low — task and eval function not publicly available |
| ARC-AGI v1 | Public benchmark; 80/20 train/val split stated (§14) | Not provided | Not stated | Not stated | Partial — public tasks, but eval/model/seed not specified |
| AIME 2025 | Public benchmark | Not provided | Not stated | Not stated | Partial — public tasks, but eval/model/seed not specified |
| Circle Packing n=26 | Standard problem (well-defined geometry) | Not provided | Not stated | Not stated | Partial — standard problem, but implementation details missing |
| CloudCast Routing | GEPA-specific; partial YAML template shown in §14 | Not provided | Not stated | Not stated | Low — task and eval function not publicly available |
Overall limitations: (1) All results are self-reported; none have been independently reproduced. (2) No confidence intervals, standard deviations, or multi-seed statistics are reported. (3) Two of five benchmarks (Claude Code Bleve and CloudCast Routing) are GEPA-specific with no independent task definitions. (4) Evaluation functions are not provided for any benchmark; only the pedagogical examples in §§2, 5 are shown. (5) Exact LLM model is confirmed only for Claude Code Bleve.
7.5.2 Intra-System Comparison: Generalization vs. Single-Task
The most methodologically sound comparison available is between GEPA's own generalization mode and single-task mode on ARC-AGI v1, as these share the same LLM backend, evaluation function, and codebase:
| GEPA Mode | Train Acc. | Val Acc. | Gap | Eval Calls |
|---|---|---|---|---|
| Generalization (train/val split) | 94.2% | 89.5% | 4.7% | 1,200 |
| Single-Task | 97.1% | 82.3% | 14.8% | 800 |
The 7.2 percentage-point improvement in validation accuracy (82.3% → 89.5%) with a concurrent reduction in generalization gap (14.8% → 4.7%) provides direct evidence for the value of the train/val split design, at the cost of 50% more evaluation calls (800 → 1,200). This is a single-run comparison without variance estimates, so the magnitude of the effect should be interpreted cautiously, but the direction is consistent with the expected behavior of regularization via validation-based model selection.
7.5.3 Cross-System Comparisons (Indicative)
Table 7.3 presents the cross-system comparisons drawn from the GEPA documentation (§13). The GEPA documentation presents these figures in a single table alongside GEPA's own results without noting experimental conditions; the separation into indicative comparisons and the "Matched Conditions?" column are the survey author's addition. These comparisons are indicative, not controlled.
| System | Train Acc. | Val Acc. | Gap | Eval Calls | Source | Matched Conditions? |
|---|---|---|---|---|---|---|
| GEPA (Generalization) | 94.2% | 89.5% | 4.7% | 1,200 | GEPA docs §13 | — |
| AlphaEvolve | 91.0% | 85.0% | 6.0% | 2,500 | Attributed in GEPA docs §13 | No: different LLM, evaluation function, seeds, and budget |
| OpenEvolve | 88.5% | 80.2% | 8.3% | 1,800 | Attributed in GEPA docs §13 | No: different LLM, evaluation function, seeds, and budget |
| FunSearch | 78.0% | 72.5% | 5.5% | 5,000+ | Attributed in GEPA docs §13 | No: different LLM, evaluation function, seeds, and budget |
Two observations emerge from this table, with the strong caveat that the cross-system numbers are not from controlled experiments:
Observation 1: Validation accuracy. GEPA's generalization mode reports the highest validation accuracy (89.5%) among the compared systems. However, each system used a different LLM backend, evaluation function, seed solution, and computational budget. These confounders make it impossible to attribute the performance difference to any specific GEPA feature (ASI, Pareto optimization, generalization mode, or the LLM itself).
Observation 2: Evaluation calls. The figures attributed to other systems show higher evaluation-call counts (2,500 for AlphaEvolve; 5,000+ for FunSearch) compared to GEPA's 1,200. This difference is consistent with the hypothesis that ASI-driven mutation is more sample-efficient, but it does not establish that claim: the systems differ in too many dimensions (LLM capability, evaluation granularity, seed quality, population management) for the evaluation-count comparison to isolate the effect of ASI. A controlled ablation removing ASI from GEPA while holding all other factors constant would be needed to support a sample-efficiency claim.
7.5.4 Circle Packing
Circle packing in a square is a well-studied geometric optimization problem with known optimal solutions for small $n$ and competitive results from multiple systems for larger $n$. GEPA reports the following results (§13):
| $n$ | Previous Best | GEPA Result | Status |
|---|---|---|---|
| 20 | 2.52040 | 2.52040 | Matched known optimal |
| 22 | 2.56287 | 2.56290 | Claimed improvement (+0.00003) |
| 24 | 2.60240 | 2.60245 | Claimed improvement (+0.00005) |
| 26 | 2.63590 (attributed to AlphaEvolve) | 2.63594 | Claimed record (the docs state this "matches LLM4AD record"; +0.00004 over AlphaEvolve attribution) |
The n=26 result is particularly notable because it claims to surpass AlphaEvolve's reported result on the same problem instance, using a general-purpose system rather than a code-specialized one. However, improvements at this scale (0.00004 difference for n=26) are within the regime where numerical precision, floating-point representation, and the exact problem formulation (objective function definition, coordinate representation) matter significantly. Without independent verification of both the GEPA result and the baseline, and without confirmation that the same objective function and precision conventions were used, these improvements should be treated as claims consistent with competitiveness rather than definitively established records. The GEPA documentation does not provide the evaluation function, coordinate output, or feasibility verification for the circle packing results.
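To make the precision concern concrete, the sketch below shows the kind of feasibility check that an independent verification would need. It assumes the common formulation in which $n$ non-overlapping circles are placed inside the unit square and the objective is the sum of their radii; the source does not state the formulation used, so this is an assumption, and the function name verify_packing and the tolerance parameter are hypothetical.

```python
import math


def verify_packing(circles, tol=1e-9):
    """Check a packing of (x, y, r) circles in the unit square.

    Returns (feasible, sum_of_radii). A packing is feasible when every circle
    lies inside [0, 1] x [0, 1] and no two circles overlap, up to `tol`.
    """
    for x, y, r in circles:
        if r <= 0:
            return False, 0.0
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False, 0.0
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Circles overlap when the center distance is below the radius sum.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return False, 0.0
    return True, math.fsum(r for _, _, r in circles)
```

A packing whose circles overlap by about 1e-5 is rejected at tol=1e-9 but accepted at tol=1e-4, which is the same order of magnitude as the claimed improvements. Comparing two reported values is therefore only meaningful when both were verified under the same tolerance.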
7.5.5 Case Study: Claude Code Bleve (79.3% → 100%)
The Claude Code Bleve case study (§14) illustrates the ASI feedback loop in action.
This is the best-documented benchmark in terms of experimental setup: the documentation
states single-task mode, claude-sonnet-4-20250514 for reflection, 85
evaluation calls, and 45 minutes of wall time. Starting from a baseline at 79.3%,
the documentation reports the following optimization trajectory:
- Iteration 12: ASI identified an edge case in Unicode handling → fix improved score to 88.1%.
- Iteration 28: ASI identified a timeout in large document indexing → batch processing fix improved to 94.5%.
- Iteration 41: ASI identified a race condition in concurrent search → mutex fix improved to 97.8%.
- Iteration 58: ASI identified a rounding error in relevance scoring → float64 fix achieved 100%.
Each improvement step was driven by specific ASI feedback (error traces, comparison diffs) rather than generic "make it better" prompting. This illustrates the core value proposition of ASI: the reflection engine can propose targeted fixes because the diagnostic information makes failure modes explicit. Note that Claude Code Bleve appears to be a GEPA-specific benchmark; the task definition, evaluation function, and baseline implementation are not independently described outside the GEPA documentation. As an anecdotal demonstration of ASI's utility, this case study is suggestive but does not constitute a controlled ablation of ASI vs. score-only feedback.
7.5.6 Missing Ablations and Open Empirical Questions
The GEPA documentation does not report systematic ablation studies for several key design decisions. The following experiments would substantially strengthen the empirical case for GEPA's architecture:
| Ablation | Question | Current Evidence |
|---|---|---|
| ASI vs. no-ASI | How much does structured diagnostic feedback improve mutation quality compared to score-only feedback? What is the marginal value of each ASI type (traces, errors, comparisons, images)? | No controlled ablation reported. The Claude Code Bleve case study (§14) provides anecdotal evidence that ASI-driven mutations are targeted, but no comparison to a score-only baseline is given. This is the most important missing ablation, as ASI is GEPA's primary claimed contribution. |
| Pareto vs. weighted-sum | Does native Pareto optimization produce better multi-objective trade-offs than weighted-sum aggregation with the same evaluation budget? | No direct comparison reported. The Metric.weight parameter exists in the API (§12), suggesting weighted aggregation is available, but no benchmark compares the two approaches under matched conditions. |
| Seedless vs. seeded initialization | How does seedless bootstrap compare to user-provided seeds in terms of final solution quality, convergence speed, and evaluation budget? | No controlled comparison reported. The ARC-AGI and circle packing results do not specify whether seeds were used. Seedless mode is documented (§8) as a capability but its relative effectiveness is not empirically characterized. |
| Generalization mode sensitivity | How sensitive is the train/val split to the split ratio, validation frequency, and early stopping patience? Does the optimal configuration vary across problem domains? | The intra-system comparison (generalization vs. single-task on ARC-AGI, Table 7.2) provides one data point but does not explore the hyperparameter space. |
| Reflection minibatch size | Is the documented default of 2–3 examples actually optimal? How does performance vary with minibatch sizes of 1, 3, 5, and 10? | The documentation (§7, callout box) claims 2–3 is "optimal" but provides no experimental evidence for this claim. |
| Failure bias sensitivity | Is the 70% failure bias optimal? How does varying failure_bias from 0.0 (uniform) to 1.0 (all failures) affect convergence? | Not reported. The 0.7 default is presented without justification beyond intuition. |
| LLM model sensitivity | How does GEPA's performance vary across LLM backends (e.g., GPT-4o vs. Claude Sonnet vs. Gemini)? Is ASI more valuable for some models than others? | Documentation examples use claude-sonnet-4-20250514. No cross-model comparison is provided. The custom LLM integration example (§12) shows that arbitrary backends can be used, but no results are reported with alternative models. |
The absence of these ablations is notable because GEPA's primary claims — that ASI improves mutation quality, that Pareto optimization outperforms weighted-sum, and that generalization mode prevents overfitting — are all empirically testable hypotheses that the system is well-positioned to evaluate. Future work should prioritize the ASI ablation, as it tests the core contribution, followed by the Pareto vs. weighted-sum comparison, which tests the second major design choice.
7.6 Cost Analysis & Implementation Details
7.6.1 LLM Cost Breakdown
The GEPA documentation (§13) reports the following cost figures for each benchmark. These represent total LLM API costs (reflection + evaluation where applicable) and are self-reported by the GEPA authors.
| Benchmark | Total LLM Cost | Eval Calls | Wall Time | Cost per pp Improvement |
|---|---|---|---|---|
| Claude Code Bleve | $12.50 | 85 | 45 min | $0.60/pp |
| ARC-AGI v1 | $180.00 | 1,200 | 8 hours | $3.16/pp |
| AIME 2025 | $95.00 | 400 | 3 hours | $7.12/pp |
| Circle Packing n=26 | $45.00 | 300 | 2 hours | N/A |
| CloudCast Routing | $250.00 | 800 | 12 hours | $6.22/pp |
Notes on cost figures: (1) These costs are self-reported by the GEPA authors
in the project documentation §13. (2) The LLM model used for reflection is stated as
claude-sonnet-4-20250514 in the documentation examples and confirmed for the
Claude Code Bleve case study (§14); whether this model was used for all benchmark runs
is not explicitly confirmed. (3) Exact pricing depends on the provider, date of the run,
and whether batch or real-time APIs were used; none of these details are specified.
(4) The "cost per percentage point improvement" metric is useful for comparison but
depends heavily on the baseline — improving from 79% to 100% (easy wins first) has a
different cost profile than improving from 85% to 90%. (5) Cost figures do not include
compute costs for evaluation (sandbox execution, metric computation), only LLM API costs.
7.6.2 Computational Requirements
GEPA is a Python package with minimal infrastructure requirements. The core optimization
loop runs on a single machine with network access to an LLM API. Evaluation parallelism
is achieved via the EvaluationPipeline class (from
gepa.evaluation, documented in §11) with a configurable
max_workers parameter (default: 8) in EvalConfig.
Sandbox isolation supports three documented levels (§11):
"docker" for full container isolation, "subprocess" for
process-level isolation, and "none" for direct execution of trusted code.
The content-hash cache reduces redundant computation. The documented
cache_backend parameter (§11) supports "sqlite" (default),
"redis", and "memory" backends. For large-scale runs, Redis
is recommended in the documentation to share the cache across multiple concurrent
processes.
7.6.3 Reproducibility
GEPA's content-hash caching guarantees that re-evaluating the same candidate on the same
task produces identical scores (assuming a deterministic evaluation function). However,
full run-level reproducibility depends on the LLM's output determinism and the stochastic
elements of parent selection and minibatch sampling. The OptimizationResult
object (documented in §12) includes the full configuration used (.config),
evaluation history (.history), and timing statistics (.stats),
providing the information needed to understand — though not necessarily reproduce — a run.
Factors limiting exact reproducibility include:
- LLM output non-determinism (even at temperature 0, some providers do not guarantee identical outputs across API calls)
- Evaluation function side effects or non-determinism (e.g., timing-dependent metrics, stochastic evaluation components)
- Parallel evaluation ordering (non-deterministic with multiple workers, affecting which candidates are available when the next mutation is generated)
7.7 Comparison with Prior Systems
The following table summarizes the feature comparison, combining information from the GEPA documentation (§15: Comparison with Other Systems) with observations from earlier chapters of this survey. The GEPA documentation includes its own comparison table (§15); the entries below for other systems incorporate corrections and qualifications from earlier chapters where appropriate. "Yes/No" entries reflect whether the feature is documented as available, not whether it has been empirically validated as effective.
| Feature | GEPA | AlphaEvolve (Ch. 4) | OpenEvolve (Ch. 5) | FunSearch (Ch. 3) | ShinkaEvolve (Ch. 6) |
|---|---|---|---|---|---|
| Artifact types | Any text (documented; empirically validated for code + YAML) | Code | Code | Code | Code |
| Declarative API | Yes (§§2, 12) | No | Partial | No | Partial |
| Diagnostic feedback | First-class ASI (6 types, §5) | Basic errors | Basic errors | Score only | Errors + traces |
| Multi-objective | Pareto frontier (§6) | Weighted sum | Weighted sum | Single objective | Single objective |
| Multi-task mode | Yes (cross-transfer, §4.2) | No | No | No | No |
| Generalization mode | Yes (train/val split, §4.3) | No | No | No | No |
| Seedless bootstrap | Yes (§8) | Requires seed | Requires seed | Requires seed | Optional seed |
| Composable stopping | AND/OR logic (§9) | Fixed | Fixed | Fixed | Configurable |
| Content-hash cache | Yes (§11) | Yes | Yes | Yes | Yes |
| Open source | Yes | No | Yes | No | Yes |
GEPA's distinguishing features fall into three categories relative to the surveyed systems:
Features without direct equivalents in other surveyed systems. ASI as a first-class typed abstraction with six modalities (§5), native Pareto multi-objective search with crowding-distance-based selection (§6), the generalization mode with train/val splitting and overfitting detection (§4.3), and multi-artifact co-evolution with dependency tracking (§12). These represent GEPA's clearest architectural contributions.
Adapted from prior work. Content-hash caching is standard across all systems. Island models with migration appear in OpenEvolve and ShinkaEvolve. LLM-driven reflection/mutation is a shared pattern with varying degrees of sophistication. Tournament selection and elite preservation are classical EA techniques.
Absent compared to some systems. GEPA does not appear to implement
MAP-Elites quality-diversity archives (used by AlphaEvolve), bandit-based model
selection across multiple LLM providers (used by ShinkaEvolve), or prompt co-evolution
(used by ShinkaEvolve). The documentation describes LLM-guided crossover
(MergeConfig, §10) and post-mutation refinement (RefinerConfig,
§10) as additional operators.
7.8 The Optimization Loop in Detail
The following pseudocode describes the complete GEPA optimization loop as reconstructed by the survey author from the documented API (§§2, 12), configuration schema (§10), architecture description (§3), and component documentation (§§6–9, 11). This is illustrative pseudocode, not code from the GEPA documentation. The internal implementation may differ in control flow, error handling, concurrency patterns, and helper function decomposition.
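The following names are documented in the GEPA API reference
with import paths: optimize, Artifact,
GEPAConfig, OptimizationResult, EvalResult,
ParetoFrontier (from gepa.engine, §6),
ReflectionEngine (from gepa.reflection, §7), and all stopping
criteria (from gepa.stopping, §9). The following are not
documented as standalone classes, functions, or methods and are used here as descriptive
placeholders: Solution (used in the documented ParetoFrontier
code but not independently importable); the helpers cached_evaluate,
bootstrap_seedless, get_results_for, refine, and
compute_stats; the methods marked "Placeholder" in the listing
(best_by_primary_metric(), solutions(), and reason());
and the llm_client variable, which the listing leaves undefined and which
stands in for whatever LLM client the configuration selects.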
# ILLUSTRATIVE PSEUDOCODE: author reconstruction from documented GEPA
# components. NOT from the GEPA documentation.
# Documented API names are marked with (§N) source references.
# Descriptive placeholders, not documented API: cached_evaluate,
# bootstrap_seedless, get_results_for, refine, compute_stats,
# default_stopping, llm_client, and members marked "Placeholder".
from typing import Callable

from gepa.engine import ParetoFrontier        # documented §6
from gepa.reflection import ReflectionEngine  # documented §7


async def optimize(                              # optimize() documented §§2, 12
    artifact: Artifact,                          # Artifact documented §12
    evaluate: Callable,
    metrics: list[Metric],                       # Metric documented §12
    config: GEPAConfig,                          # GEPAConfig documented §10
    tasks: list[dict] | None = None,
    stopping: StoppingCriterion | None = None,   # Stopping criteria §9
) -> OptimizationResult:                         # OptimizationResult documented §12
    # 0. Fall back to a default criterion so the loop condition below
    #    never dereferences None.
    if stopping is None:
        stopping = default_stopping(config)      # Placeholder

    # 1. Initialize population
    if artifact.seed is not None:
        population = [artifact.seed]
    elif artifact.description is not None:
        population = await bootstrap_seedless(artifact, config)  # Placeholder
    else:
        raise ValueError("Artifact must have either seed or description")

    # 2. Initialize Pareto frontier and history
    frontier = ParetoFrontier(
        objectives=[m.name for m in metrics],
        directions=[m.direction for m in metrics],
    )
    history = []
    cache = {}  # Actual cache backend per EvalConfig.cache_backend (§11)

    # 3. Initial evaluation
    for candidate in population:
        result = await cached_evaluate(candidate, tasks, evaluate, cache)
        frontier.update(Solution(code=candidate, scores=result.scores))
        history.append(result)

    # 4. Main evolution loop
    # llm_client is a placeholder for the LLM client selected by the config.
    reflection_engine = ReflectionEngine(
        config=config.reflection, llm=llm_client, artifact_spec=artifact,
    )
    iteration = 0
    while not stopping.should_stop(iteration, history, frontier):
        # Count every attempt, including rejected mutations, so that
        # iteration-based stopping criteria still terminate the loop.
        iteration += 1

        # 4a. Select parent via documented select_parent() method (§6)
        parent = frontier.select_parent(strategy="crowding")

        # 4b. Generate mutation via reflection with ASI
        mutation = await reflection_engine.reflect(
            parent=parent,
            eval_results=get_results_for(parent, history),  # Placeholder
            history=history[-config.reflection.max_history_length:],
        )

        # 4c. Validate mutation against artifact constraints
        #     (Artifact.validate() documented §12)
        if not artifact.validate(mutation.code).is_valid:
            continue

        # 4d. Optional post-mutation refinement (RefinerConfig documented §10)
        if (config.refiner.enabled
                and parent.primary_score > config.refiner.refinement_threshold):
            mutation = await refine(mutation, config.refiner)  # Placeholder

        # 4e. Evaluate mutation
        result = await cached_evaluate(mutation.code, tasks, evaluate, cache)
        history.append(result)

        # 4f. Update Pareto frontier via documented update() method (§6)
        frontier.update(Solution(code=mutation.code, scores=result.scores))

    # 5. Return results
    return OptimizationResult(
        best=frontier.best_by_primary_metric(metrics),  # Placeholder method
        pareto_front=frontier.solutions(),              # Placeholder method
        history=history,
        stats=compute_stats(history),                   # Placeholder
        stop_reason=stopping.reason(),                  # Placeholder
    )
Several design choices are visible in this reconstructed loop. The reflection engine
receives the parent's evaluation results (including ASI), enabling targeted mutation.
The cache check occurs before evaluation, preventing redundant computation. The Pareto
frontier is updated incrementally after each evaluation via the documented
update() method (§6). The stopping controller is consulted at each
iteration with full access to the history and frontier state. The optional refinement
step (controlled by RefinerConfig.enabled and
RefinerConfig.refinement_threshold, default: 0.9, documented in §10)
applies a post-mutation polish only when the parent's primary score already exceeds
the threshold, so refinement effort is spent on lineages that are already high-scoring.
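The cache-before-evaluate pattern can be sketched in a few lines. The following is a minimal synchronous sketch, not GEPA's implementation: the name cached_evaluate mirrors the placeholder used in the pseudocode above, while the SHA-256 key over the candidate text plus a JSON-serialized task list, the in-memory dict backend, and the toy_evaluate function are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any, Callable


def content_key(candidate: str, tasks: list[dict] | None) -> str:
    """Hash the candidate text together with the task list (assumed key scheme)."""
    payload = json.dumps({"candidate": candidate, "tasks": tasks}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_evaluate(
    candidate: str,
    tasks: list[dict] | None,
    evaluate: Callable[[str, list[dict] | None], Any],
    cache: dict[str, Any],
) -> Any:
    """Return the cached result for identical content; evaluate only on a miss."""
    key = content_key(candidate, tasks)
    if key not in cache:
        cache[key] = evaluate(candidate, tasks)
    return cache[key]


calls: list[str] = []  # records which candidates were actually evaluated


def toy_evaluate(candidate: str, tasks: list[dict] | None) -> dict:
    """Hypothetical stand-in for an expensive evaluation function."""
    calls.append(candidate)
    return {"length": len(candidate)}


cache: dict[str, Any] = {}
first = cached_evaluate("def f(): return 1", None, toy_evaluate, cache)
second = cached_evaluate("def f(): return 1", None, toy_evaluate, cache)  # cache hit
third = cached_evaluate("def f(): return 2", None, toy_evaluate, cache)   # cache miss
print(f"evaluations run: {len(calls)}, cache entries: {len(cache)}")
```

Keying on content rather than on candidate identity means that an LLM which regenerates a previously seen candidate costs no additional evaluation budget. Including the task list in the key is a design choice of this sketch: without it, a score computed on one task set could be returned for another.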
7.9 Multi-Artifact Co-Evolution
A less-discussed but architecturally significant feature is GEPA's support for
co-evolving multiple artifacts simultaneously with explicit dependency
tracking. The co_optimize() function is documented in the API reference
(§12: Advanced: Multi-Artifact Co-Evolution) as accepting a list of artifacts, an
evaluation function, metrics, and a dependency graph specifying which artifacts depend
on which others.
For example, a pipeline consisting of a system prompt and a post-processing function can be co-evolved with the constraint that the post-processor depends on the prompt (changes to the prompt may require compensating changes in the post-processor). The evaluation function receives all artifacts and returns a joint score. The following code example is reproduced verbatim from the GEPA documentation (§12: Advanced: Multi-Artifact Co-Evolution), with only minor formatting changes:
# Verbatim from GEPA docs §12 (Advanced: Multi-Artifact Co-Evolution).
# Import path (from gepa import co_optimize) and constructor
# signatures (Artifact, Metric) are documented in the API reference.
# eval_pipeline is a user-defined function, not part of the GEPA API.
from gepa import co_optimize, Artifact, Metric

# Co-evolve a prompt and a post-processor together
prompt_artifact = Artifact(
    name="system_prompt",
    template="You are a helpful assistant. {{INSTRUCTIONS}}",
    language="text",
)
processor_artifact = Artifact(
    name="output_processor",
    template="def process(raw_output: str) -> str:\n {{BODY}}",
    language="python",
)
result = co_optimize(
    artifacts=[prompt_artifact, processor_artifact],
    evaluate=eval_pipeline,  # Receives both artifacts, returns joint score
    metrics=[Metric("quality", direction="maximize")],
    dependencies={
        "output_processor": ["system_prompt"],  # Processor depends on prompt
    },
    max_iterations=50,
)

# Access per-artifact results
print(f"Best prompt: {result.artifacts['system_prompt'].best.code}")
print(f"Best processor: {result.artifacts['output_processor'].best.code}")
Co-evolution is particularly relevant for prompt + code pipelines, where the prompt
and the downstream processing logic must be jointly optimized. The internal mechanism
by which GEPA schedules mutations across dependent artifacts (whether it mutates one
artifact at a time while holding others fixed, or mutates multiple artifacts
simultaneously) is not documented. The result.artifacts dictionary
access pattern is shown in the documented example above, but the
.artifacts field is not listed in the OptimizationResult
class definition (§12), suggesting it may be specific to the co-evolution result type.
This capability has no equivalent in the other systems surveyed in this volume.
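Because the scheduling mechanism is undocumented, the following sketch shows only one plausible reading of the dependencies argument, not GEPA's behavior: treat the mapping as a dependency graph, visit upstream artifacts first, and revisit direct dependents after an upstream change. The helper names mutation_order and downstream_of are hypothetical; the dictionary is the one from the documented example, and graphlib is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Same shape as the documented `dependencies` argument:
# artifact name -> names of the artifacts it depends on.
dependencies = {"output_processor": ["system_prompt"]}
artifact_names = ["system_prompt", "output_processor"]


def mutation_order(names, deps):
    """Order artifacts so that each one comes after everything it depends on."""
    graph = {name: set(deps.get(name, ())) for name in names}
    # static_order() raises graphlib.CycleError if the dependencies are cyclic.
    return list(TopologicalSorter(graph).static_order())


def downstream_of(name, deps):
    """Artifacts that directly depend on `name` and may need a compensating change."""
    return sorted(a for a, upstream in deps.items() if name in upstream)


order = mutation_order(artifact_names, dependencies)
print("visit order:", order)
print("revisit after prompt change:", downstream_of("system_prompt", dependencies))
```

Under this reading, a mutation to the prompt would be followed by a reflection pass on the post-processor. Simultaneous mutation of both artifacts, the other possibility mentioned above, would not need an ordering at all.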
7.10 Limitations & Discussion
7.10.1 Acknowledged Limitations
The GEPA authors identify five limitations in their documentation (§16):
- LLM dependency. The quality of mutations is bounded by the capability of the underlying LLM. Weaker models produce weaker mutations, and the system provides no mechanism to compensate for a fundamentally incapable model.
- Cost. Multi-task and generalization modes multiply the evaluation budget by the number of tasks. The ARC-AGI run cost $180 for 1,200 evaluations (about $0.15 per evaluation), a modest cost for a research experiment but potentially prohibitive at scale.
- ASI design burden. Effective ASI requires domain-specific instrumentation of the evaluation function. Users must think carefully about what diagnostic information to expose, and poorly designed ASI can mislead the mutation engine.
- Pareto scalability. With more than 4–5 objectives, the Pareto frontier grows exponentially and most solutions become non-dominated (the "curse of dimensionality" for multi-objective optimization). This is a well-known limitation of Pareto-based EMOA approaches.
- Reflection context window. Very large candidates or extensive ASI can exceed LLM context limits, requiring truncation that may lose important diagnostic information.
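The Pareto scalability limitation can be illustrated with a small simulation. This is a sketch under stated assumptions, not an experiment from the GEPA documentation: candidates are drawn uniformly at random with independent objectives (all maximized), and the population size of 200 and the objective counts are arbitrary choices.

```python
import random


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fraction(num_points, num_objectives, rng):
    """Fraction of uniformly random score vectors that no other vector dominates."""
    points = [
        tuple(rng.random() for _ in range(num_objectives))
        for _ in range(num_points)
    ]
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return len(front) / num_points


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
fractions = {m: non_dominated_fraction(200, m, rng) for m in (2, 5, 10)}
for m, frac in fractions.items():
    print(f"{m} objectives: {frac:.0%} of random candidates are non-dominated")
```

With two objectives only a handful of candidates are non-dominated, whereas with ten objectives most of them are. Once nearly every candidate sits on the frontier, dominance alone no longer provides selection pressure, which is why the limitation applies to Pareto-based methods generally.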
7.10.2 Additional Observations
Beyond the authors' own assessment, several additional limitations merit discussion:
Comparison methodology. The benchmark comparisons in Section 7.5 compare GEPA against AlphaEvolve, OpenEvolve, and FunSearch using figures drawn from the GEPA documentation's comparison table (§13), which attributes figures to those systems without specifying whether they come from head-to-head experiments or from separate publications. It is not established that all systems used identical LLM backends, identical evaluation functions, identical seed solutions, or identical computational budgets. These cross-paper comparisons are labeled as indicative in Table 7.3 and should not be interpreted as definitive rankings.
Artifact-agnosticism: documented scope vs. empirical evidence.
GEPA's framework architecture is genuinely artifact-agnostic — the Artifact
class (§12), evaluation pipeline (§11), and reflection engine (§7) impose no structural
constraints on the candidate format. However, the empirical evidence supporting the
"optimize anything" thesis is concentrated in two artifact categories:
(1) code — Python functions for four of five reported benchmarks (Claude
Code Bleve, ARC-AGI, AIME 2025, circle packing), and (2) code-adjacent
structured text — YAML routing rules for CloudCast. The documentation (§1)
lists six categories of artifact types as supported (see §7.1.1), but benchmark results
are reported only for code and YAML. Whether ASI feedback, reflection prompting, and the
seedless bootstrap are equally effective for non-code artifacts remains an open empirical
question. Researchers applying GEPA to domains beyond code should be aware that the
strongest evidence base is for code optimization, and that the effectiveness of ASI
design patterns may vary significantly across artifact types.
Absence of quality-diversity mechanisms. GEPA uses Pareto optimization for multi-objective diversity but does not implement MAP-Elites-style behavioral characterization. For problems where solution diversity in a behavioral descriptor space (as opposed to objective space) is important, this may be a limitation compared to AlphaEvolve's approach.
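The difference between the two archive types can be made concrete with a toy example. This is a minimal sketch, not code from GEPA or AlphaEvolve: the candidate names, the score values, the behavior labels, and the use of the first score as the MAP-Elites fitness are all assumptions chosen for illustration.

```python
def dominates(a, b):
    """Maximization dominance: no worse everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_insert(front, candidate):
    """Keep only candidates that are non-dominated in objective space."""
    _, scores = candidate
    if any(dominates(s, scores) for _, s in front):
        return front  # candidate is dominated and therefore discarded
    return [(n, s) for n, s in front if not dominates(scores, s)] + [candidate]


def map_elites_insert(archive, candidate, cell):
    """Keep the fittest candidate per behavior cell (fitness = first score)."""
    incumbent = archive.get(cell)
    if incumbent is None or candidate[1][0] > incumbent[1][0]:
        archive[cell] = candidate
    return archive


# (name, (accuracy, speed), behavior descriptor), all hypothetical
candidates = [
    ("iterative", (0.90, 0.80), "loop"),
    ("recursive", (0.70, 0.60), "recursion"),
    ("vectorized", (0.85, 0.95), "loop"),
]

front, archive = [], {}
for name, scores, behavior in candidates:
    front = pareto_insert(front, (name, scores))
    archive = map_elites_insert(archive, (name, scores), behavior)

print("Pareto front:", [n for n, _ in front])
print("MAP-Elites archive:", {cell: c[0] for cell, c in archive.items()})
```

The recursive candidate is dominated on both objectives, so the Pareto archive drops it, while the behavior-indexed archive keeps it as the only occupant of its cell. That retained candidate is the kind of stepping stone a quality-diversity method preserves and a purely objective-space method does not.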
Missing ablation evidence. As detailed in Section 7.5.6, the documentation lacks controlled ablations for ASI vs. no-ASI, Pareto vs. weighted-sum, seedless vs. seeded initialization, and other key design choices. Without these ablations, the causal contribution of each component to GEPA's reported performance cannot be isolated.
Documentation vs. implementation gaps. Several items in the
documentation have internal inconsistencies or underspecified behavior (see Table 7.0):
the proportional selection with infinite crowding distances (§7.3.4), the
Solution type used in documented code but not independently defined, and
the result.artifacts dictionary in the co-evolution example (§7.9) that
does not appear in the main OptimizationResult class definition. These
are minor issues that suggest the documentation may describe a slightly idealized or
in-progress version of the API.
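The first of these gaps is easy to reproduce. In the standard NSGA-II formulation, boundary solutions on each objective receive an infinite crowding distance, so selection weights proportional to crowding distance cannot be normalized as written. The sketch below computes NSGA-II crowding distances and shows one possible workaround, capping infinite values at the largest finite distance. The capping rule and the function names are assumptions of this sketch, not GEPA's documented behavior.

```python
import math


def crowding_distances(scores):
    """NSGA-II crowding distance for a list of equal-length score tuples."""
    n = len(scores)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(scores[0])):
        order = sorted(range(n), key=lambda i: scores[i][k])
        lo, hi = scores[order[0]][k], scores[order[-1]][k]
        # Boundary solutions on each objective are always preserved.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # objective is constant; it adds nothing to the distance
        for pos in range(1, n - 1):
            gap = scores[order[pos + 1]][k] - scores[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist


def selection_weights(dist):
    """Proportional weights after capping infinities (assumed workaround)."""
    finite = [d for d in dist if math.isfinite(d) and d > 0]
    cap = max(finite) if finite else 1.0
    capped = [min(d, cap) for d in dist]
    total = sum(capped)
    return [c / total for c in capped]


front_scores = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (1.0, 0.0)]
distances = crowding_distances(front_scores)
weights = selection_weights(distances)
print("crowding distances:", distances)
print("selection weights:", [round(w, 3) for w in weights])
```

Other resolutions are equally defensible, such as always selecting boundary points first or switching to rank-based tournament selection as in the original NSGA-II. The documentation does not say which one GEPA uses.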
7.10.3 Future Directions
The GEPA authors outline six future directions (§16): Auto-ASI (automatic evaluation instrumentation), hierarchical evolution (co-evolving meta-strategies alongside artifacts), distributed search across multiple machines, interactive human-in-the-loop mode, transfer learning from previously solved problems, and formal verification integration for safety-critical applications. Of these, Auto-ASI addresses the most significant current limitation (the ASI design burden) and would substantially lower the barrier to effective use.
7.11 Research Contribution Analysis
GEPA's contribution to the LLM-driven evolutionary optimization landscape can be evaluated along three dimensions: what is genuinely novel, what is an effective synthesis of existing ideas, and what impact it may have on future work.
Novelty
The most novel contribution is the ASI abstraction. While prior systems pass error messages or basic feedback to the LLM mutation operator, GEPA formalizes this as a typed system with six distinct feedback modalities (§5), each with specific semantics for how the reflection engine should use them. The strength of this contribution rests on the hypothesis that structured diagnostic feedback materially improves mutation quality — a hypothesis that is architecturally well-supported but not yet empirically validated through controlled ablations (see Section 7.5.6). The generalization mode with train/val splitting and overfitting detection (§4.3) is also novel in this space — it imports a standard machine learning practice into evolutionary optimization. Multi-artifact co-evolution with explicit dependency graphs (§12) extends the single-artifact optimization paradigm.
Synthesis
GEPA effectively synthesizes several established techniques: Pareto frontier maintenance with crowding distance from NSGA-II (Deb et al., 2002), hypervolume-based quality assessment from the EMOA literature, content-hash caching from FunSearch/AlphaEvolve, and LLM-driven mutation from the broader program synthesis community. The contribution is not any individual technique but their integration into a coherent, artifact-agnostic framework with a clean declarative API.
Potential Impact
GEPA's declarative API lowers the barrier to entry for LLM-driven evolutionary
optimization. A researcher who wants to optimize a prompt, a configuration file, or a
scoring function can do so by writing an evaluation function and calling
optimize(), without understanding population management, selection
strategies, or mutation operators. This democratization effect — making evolutionary
search accessible to domain experts who are not evolutionary computation specialists —
may be GEPA's most significant practical contribution, though its full realization
depends on how well the framework performs on artifact types beyond code (see
Section 7.10.2). The open empirical questions identified in Section 7.5.6 represent
clear opportunities for follow-on research that could strengthen or qualify these
architectural claims.
7.12 Fitness Landscape and Search Dynamics
GEPA's approach to navigating the fitness landscape differs fundamentally from classical evolutionary algorithms because the mutation operator is not a random perturbation but an informed LLM-guided transformation. This changes the search dynamics in several ways that are worth characterizing formally. The formalization below is the survey author's analysis, not a claim made in the GEPA documentation.
In a classical $(\mu + \lambda)$ evolution strategy, the mutation operator $M$ maps a parent $c$ to a child $c'$ via a random perturbation: $c' = M(c, \epsilon)$ where $\epsilon$ is drawn from some distribution (e.g., Gaussian). The mutation is blind to the fitness landscape. In GEPA, the mutation operator is conditioned on the ASI feedback $\alpha$ and the optimization history $H$:
$$c' = M_{\text{LLM}}(c, \alpha, H;\ \theta)$$
where $c$ is the parent candidate, $\alpha \in \mathcal{A}^*$ is the ASI record from the parent's evaluation (as defined in the notation box in Section 7.3), $H = \{(c_i, \mathbf{s}_i, \alpha_i)\}_{i=1}^{t}$ is the optimization history up to iteration $t$ (each entry containing a candidate, its $m$-dimensional score vector $\mathbf{s}_i \in \mathbb{R}^m$, and its ASI), and $\theta$ represents the LLM parameters (model choice, temperature, prompt template). The LLM acts as a learned gradient estimator, using diagnostic feedback to propose directions of improvement in the space of text artifacts.
This has a conceptual analogy to gradient-based optimization: ASI provides a form of "derivative information" about the fitness function, and the LLM uses this information to propose a step in a promising direction. The key difference is that this "gradient" operates over discrete, structured artifacts (code, text) rather than continuous parameter vectors, and the LLM integrates multiple types of feedback (traces, errors, comparisons) that have no direct analogue in continuous optimization. This analogy is suggestive rather than formal — the LLM mutation operator has no provable convergence guarantees analogous to gradient descent.
The failure bias in minibatch sampling (default: 70% probability of sampling failure
cases, via the documented failure_bias parameter in
ReflectionConfig, §§7, 10) biases the search toward regions of the
fitness landscape where the current solution performs worst. This is analogous to
hard-example mining in machine learning: the system focuses its improvement effort on
the cases with the most room for improvement.
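Failure-biased sampling can be sketched as follows. The 0.7 default matches the documented failure_bias value, but the sampling procedure itself (an independent draw per slot, with replacement, and a fallback when one group is empty) is an assumption of this sketch, as are the function name and the result records.

```python
import random


def sample_minibatch(results, batch_size, failure_bias=0.7, rng=None):
    """Draw a minibatch in which each slot is a failure case with probability failure_bias."""
    if not results:
        raise ValueError("results must be non-empty")
    rng = rng or random.Random()
    failures = [r for r in results if not r["passed"]]
    passes = [r for r in results if r["passed"]]
    batch = []
    for _ in range(batch_size):
        want_failure = rng.random() < failure_bias
        pool = failures if want_failure else passes
        if not pool:  # fall back when the preferred group is empty
            pool = passes if want_failure else failures
        batch.append(rng.choice(pool))
    return batch


# Hypothetical evaluation results: 10 failing tasks and 90 passing tasks.
results = [{"task": i, "passed": i >= 10} for i in range(100)]
batch = sample_minibatch(results, 2000, failure_bias=0.7, rng=random.Random(0))
failure_share = sum(not r["passed"] for r in batch) / len(batch)
print(f"failing cases: 10% of tasks, {failure_share:.0%} of the minibatch")
```

Although only a tenth of the tasks fail, roughly 70% of the reflection minibatch consists of failures. The remaining slots keep passing cases in view, so a mutation that fixes the failures at the expense of previously solved tasks can still be noticed.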
Chapter Summary
Key takeaway: GEPA demonstrates that a declarative, artifact-agnostic framework with structured diagnostic feedback (ASI) can achieve competitive results across multiple domains — code, mathematics, geometric optimization, and infrastructure routing — without domain-specific architectural commitments. The reported results are promising but rely on self-reported benchmarks with limited cross-system controls and no published ablation studies. The strongest empirical evidence is for code and code-adjacent text artifacts; the broader "optimize anything" generality is architecturally supported but less extensively validated.
Main contribution: The formalization of Actionable Side Information (ASI) as a first-class typed abstraction that bridges evaluation and mutation, combined with native Pareto multi-objective optimization, multi-task cross-transfer, and a generalization mode with validation-based model selection and overfitting detection. These features collectively move LLM-driven evolutionary optimization from a code-specialized technique toward a general-purpose optimization paradigm.
Provenance summary: The vast majority of API classes, import paths, constructor signatures, configuration parameters, and benchmark figures cited in this chapter are directly documented in the GEPA documentation (§§1–16). See Table 7.0 for the full provenance mapping. The survey author's contributions are: (a) the mathematical formalizations in §§7.3–7.4, which render the documented API contracts in standard EMOA notation; (b) the reconstructed optimization loop in §7.8, which assembles documented components into illustrative pseudocode; (c) the evidence status assessment in §7.5, which evaluates reproducibility of each benchmark; and (d) the search dynamics analysis in §7.12.
What a researcher should know: GEPA's power comes from the quality of the evaluation function and its ASI instrumentation. The system is only as good as the diagnostic feedback it receives. Researchers planning to use GEPA should invest heavily in ASI design — this is where the largest returns likely lie. The generalization mode should be preferred over single-task optimization for any problem where overfitting is a concern, even at the cost of additional evaluation calls. Critical open questions remain around the marginal value of ASI (vs. score-only feedback), Pareto vs. weighted-sum performance, and seedless vs. seeded initialization effectiveness — see Section 7.5.6 for the full list of missing ablations. Cross-system comparisons in Table 7.3 are indicative only; no matched-condition head-to-head evaluations are available.