GEPA: Optimize Anything

Part P02: General-Purpose Evolutionary Frameworks

7.1 Overview & Motivation

The systems surveyed in earlier chapters — FunSearch (Chapter 3), AlphaEvolve (Chapter 4), OpenEvolve (Chapter 5), and ShinkaEvolve (Chapter 6) — share a common architectural assumption: the artifact under optimization is a code fragment, typically a Python function. This assumption shapes every downstream design choice, from how candidates are represented and mutated to how fitness is evaluated and reported. GEPA (General-purpose Evolutionary Program Architecture), introduced in February 2026 by Agrawal, Lee, and colleagues at UC Berkeley and Stanford, challenges this assumption directly. Its central thesis is that any text-representable artifact — code, prompts, YAML configurations, mathematical expressions, agent policies, or hybrid combinations thereof — can be optimized through a single declarative interface backed by LLM-driven evolutionary search.

GEPA is released as an open-source Python package (pip install gepa, repository: github.com/gepa-ai/gepa) and introduces three technical contributions that distinguish it from prior work:

Actionable Side Information (ASI) — a first-class abstraction for structured diagnostic feedback (execution traces, error stack traces, images, metric breakdowns, output comparisons) that flows from evaluation back into the mutation prompt, enabling targeted rather than generic improvements.
Native Pareto multi-objective optimization — the population maintains a Pareto frontier of non-dominated solutions with crowding-distance-based selection and hypervolume tracking, replacing the weighted-sum reduction used by most prior systems.
Three optimization modes — single-task, multi-task with cross-transfer, and generalization (train/validation split with validation-based model selection) — addressing overfitting, a problem that prior systems largely ignored.

Key Contribution

GEPA demonstrates that a declarative, artifact-agnostic evolutionary optimization framework with structured diagnostic feedback (ASI) can achieve competitive performance against domain-specialized systems across coding, mathematics, and infrastructure optimization benchmarks, while providing researchers with a unified API that separates the what (artifact definition, evaluation function, metrics) from the how (evolutionary search, mutation strategy, population management). The strongest empirical evidence for this claim comes from code and code-adjacent text artifacts; the broader "optimize anything" generality remains architecturally supported but less extensively validated.

7.1.1 The Artifact-Agnosticism Thesis

Prior LLM-driven evolution systems treat the candidate as a code string with language-specific parsing, syntax validation, and execution semantics baked into the pipeline. GEPA abstracts this into a generic Artifact class parameterized by a name, an optional template with placeholders, a language tag, and optional constraints. The GEPA documentation (§1: Overview & Motivation) explicitly lists six categories of supported artifact types: (1) Code — Python functions, algorithms, data structures, entire modules; (2) Prompts — system prompts, few-shot examples, chain-of-thought templates; (3) Agent Configs — multi-agent orchestration YAML, tool selection policies, routing rules; (4) Mathematical Expressions — loss functions, optimization objectives, heuristic formulas; (5) Data Pipelines — ETL configurations, feature engineering scripts, preprocessing chains; and (6) Hybrid Artifacts — combinations of code + prompts + configs, co-evolved with inter-dependency tracking. This is not merely a labeling convenience: the mutation engine, evaluation pipeline, and ASI feedback system are all designed to operate on opaque text, with language-specific behavior injected only through user-provided evaluation functions.

Scope of empirical support. It is important to distinguish GEPA's architectural artifact-agnosticism — the framework imposes no structural assumptions on candidate format — from the empirically demonstrated scope of that claim. The five benchmarks reported in Section 7.5 involve four code artifacts (Python functions for Claude Code Bleve, ARC-AGI, AIME 2025, and circle packing) and one code-adjacent text artifact (YAML routing rules for CloudCast). While the API design is genuinely artifact-agnostic, the "optimize anything" generality beyond code and structured configuration files has limited empirical support in the current documentation. Domains such as natural-language prose, mathematical proofs, or multi-modal artifacts are listed as supported artifact types but are not represented in the reported benchmarks.

7.1.2 Design Principles

The GEPA documentation (§1) articulates five design principles that shape the system:

Principle	Implication
Artifact agnosticism	No hardcoded assumptions about candidate structure; any text is valid
Declarative over imperative	Users specify objectives and constraints; the engine selects mutation strategies, population management, and stopping criteria
Diagnostic-first	Evaluation produces structured feedback (ASI), not just scalar scores
Pareto optimality	Multi-objective optimization is native, not reduced to weighted sums
Reproducibility	Content-hash caching ensures identical artifacts are never re-evaluated

7.1.3 Provenance and Documentation Basis

Documentation basis. All descriptions in this chapter are drawn from the GEPA project documentation (February 2026), available at github.com/gepa-ai/gepa. The documentation includes sixteen sections covering an overview, quick-start, system architecture, optimization modes, ASI, Pareto search, reflection-driven mutation, seedless mode, stopping criteria, configuration, evaluation pipeline, API reference with code examples, benchmark results, case studies, cross-system comparisons, and limitations. Where specific import paths and constructor signatures are cited, these are taken directly from the documentation's code examples and API reference. Where internal algorithmic mechanisms are described beyond what the documentation explicitly specifies (for instance, the exact control flow of the optimization loop or the behavior of undocumented helper functions), these are labeled as reconstructed. Mathematical formalizations are the author's rendering of the documented API contracts and standard multi-objective optimization theory unless otherwise attributed.

The following provenance table maps the major technical claims in this chapter to their documentation status and source location within the GEPA documentation. This table is intended to resolve ambiguity about which elements are directly documented, which are referenced but incompletely specified, and which are the author's analysis.

**Table 7.0 — Provenance of major technical claims.** "Documented" means the element appears with import paths, constructor signatures, or explicit parameter listings in the GEPA documentation. "Referenced" means the element is used in documented code but not independently defined. "Author analysis" means the element is the survey author's formalization or interpretation.
Claim / Element	Status	Source Location
`optimize()`, `Artifact`, `Metric`, `EvalResult`, `OptimizationResult` — classes, constructors, all field types	Documented	Docs §§2, 12 (API Reference)
`co_optimize()` — function, `dependencies` parameter, usage pattern	Documented	Docs §12 (Advanced: Multi-Artifact Co-Evolution)
`GEPAConfig`, `EngineConfig`, `ReflectionConfig`, `MergeConfig`, `RefinerConfig` — all parameters with defaults	Documented	Docs §10 (Configuration System)
`SingleTaskConfig`, `MultiTaskConfig`, `GeneralizationConfig` — parameters	Documented	Docs §§4.1–4.3
`SeedlessConfig` — parameters and bootstrap pipeline description	Documented	Docs §8 (Seedless Mode)
`ParetoFrontier` (from `gepa.engine`) — class with `dominates()`, `update()`, `select_parent()` code	Documented	Docs §6 (Pareto-Efficient Search)
`ReflectionEngine` (from `gepa.reflection`) — class and instantiation	Documented	Docs §7 (Reflection-Driven Mutation)
`EvaluationPipeline`, `EvalConfig` (from `gepa.evaluation`) — parameters	Documented	Docs §11 (Evaluation Pipeline)
Stopping criteria: `MaxMetricCalls`, `Timeout`, `NoImprovement`, `ScoreThreshold`, `Composite`	Documented	Docs §9 (Stopping Criteria)
ASI types: `TraceASI`, `ErrorASI`, `ImageASI`, `MetricASI`, `ComparisonASI`, `TextASI`	Documented	Docs §5 (ASI)
`Solution`, `EvaluationRecord`, `OptimizationStats`, `TaskResult`	Referenced but not fully defined	Used in `ParetoFrontier` code (§6) and `OptimizationResult` signature (§12); no independent constructor docs
Reflection prompt template (simplified)	Documented	Docs §§5, 7 (two versions shown)
Content-hash caching mechanism and code	Documented	Docs §11
Architecture component names and data flow	Documented	Docs §3 (System Architecture)
Benchmark results, cost figures, case study trajectories	Documented (self-reported, no independent reproduction)	Docs §§13, 14
Cross-system figures (AlphaEvolve, OpenEvolve, FunSearch on ARC-AGI)	Attributed in docs; not from matched-condition experiments	Docs §13 (single comparison table)
Optimization loop internal control flow (§7.8 pseudocode)	Reconstructed	Author assembly from documented components; helpers (`bootstrap_seedless`, `cached_evaluate`, `refine`, `compute_stats`) are not documented API
Mathematical formalizations: dominance, hypervolume, crowding distance, generalization gap	Author analysis	Standard EMOA theory applied to documented API contracts; crowding distance formula and hypervolume shown in docs §6 in simplified notation
Scalar aggregation for `train_score`/`val_score` in generalization mode	Mixed	Fields documented (docs §§4.3, 12); averaging formula over tasks is author formalization consistent with the documented scalar fields
Handling of infinite crowding distances in proportional selection	Underspecified in docs	Docs §6 shows `probs = distances / distances.sum()` and states boundary solutions get infinite distance; resolution not specified (see §7.3.4 discussion)

7.2 System Architecture

GEPA's architecture follows a layered design with six principal components: the User API layer, the Configuration layer, the Evolution Engine, the Reflection Engine, the Evaluation Pipeline, and the Stopping Controller. The following diagram illustrates the data flow and component interactions as described in the GEPA documentation's architecture overview (§3).

Figure 7.1 — GEPA system architecture. The six top-level components and sub-component names (Population Manager, Pareto Frontier, History Store, Reflection Engine) are drawn from the documentation's architecture overview (§3). Configuration class names (GEPAConfig, EngineConfig, ReflectionConfig, MergeConfig, RefinerConfig) are documented with import paths in §10. API classes (ParetoFrontier from gepa.engine, ReflectionEngine from gepa.reflection, EvaluationPipeline from gepa.evaluation) are documented with import paths in §§6, 7, 11 respectively. The exact internal class boundaries within the Evolution Engine may differ from the logical grouping shown here. The dashed loop on the right represents the ASI feedback path from evaluation back into the reflection engine. See Table 7.0 for full provenance details.

7.2.1 User API Layer

The primary entry point is the optimize() function, which accepts an Artifact (what to optimize), an evaluation function (how to measure quality), a list of Metric objects (what objectives to pursue), and optional configuration. These four types — Artifact, Metric, EvalResult, and OptimizationResult — are documented in the GEPA API reference (§12) with their constructors and field types. This declarative surface means the user never directly interacts with population management, parent selection, or mutation scheduling. The documentation also describes co_optimize() for multi-artifact co-evolution with explicit inter-artifact dependency tracking (§12), and optimize_from_config() for YAML-driven runs (§10).

The following code example is adapted from the GEPA documentation's quick-start guide (§2). Import paths (from gepa import optimize, Artifact, Metric, EvalResult), class constructors, and parameter names are documented verbatim. One adaptation was made: the documentation's side_info failed_cases list comprehension references an undefined variable results — an apparent documentation error — so the version below uses an inline re-execution check instead. The evaluation function's overall structure (inline exec, time measurement, EvalResult return) follows the documented pattern. Note that side_info is passed as a plain dict here; the typed ASI pattern (using TraceASI, ErrorASI, etc.) is shown in §7.3.1.

# Adapted from GEPA docs §2 (quick-start guide).
# Import paths and constructor signatures are documented verbatim.
# The failed_cases comprehension is modified from the original (see text).
from gepa import optimize, Artifact, Metric, EvalResult

artifact = Artifact(
    name="sort_function",
    template="""
def sort(arr: list[int]) -> list[int]:
    # Your sorting implementation here
    {{IMPLEMENTATION}}
""",
    language="python",
)

def evaluate_sort(candidate: str, task: dict) -> EvalResult:
    """Evaluate sorting correctness and performance."""
    import time
    exec_globals = {}
    exec(candidate, exec_globals)
    sort_fn = exec_globals["sort"]

    test_cases = task["test_cases"]
    correct = 0
    total_time = 0.0

    for tc in test_cases:
        start = time.perf_counter()
        result = sort_fn(tc["input"].copy())
        elapsed = time.perf_counter() - start
        total_time += elapsed
        if result == tc["expected"]:
            correct += 1

    accuracy = correct / len(test_cases)
    avg_time = total_time / len(test_cases)

    return EvalResult(
        scores={"accuracy": accuracy, "speed": 1.0 / (avg_time + 1e-9)},
        side_info={
            "failed_cases": [
                tc for tc in test_cases
                if sort_fn(tc["input"].copy()) != tc["expected"]
            ],
            "avg_time_ms": avg_time * 1000,
        },
    )

result = optimize(
    artifact=artifact,
    evaluate=evaluate_sort,
    metrics=[
        Metric("accuracy", direction="maximize", weight=0.8),
        Metric("speed", direction="maximize", weight=0.2),
    ],
    max_iterations=50,
    llm="claude-sonnet-4-20250514",
)

print(f"Best solution: {result.best.score}")
print(result.best.code)

The documented Artifact constructor (§12) accepts name, template (with {{PLACEHOLDER}} syntax), seed, description, signature, language (default: "python"), constraints, and metadata parameters, plus validate() and render() methods. The Metric constructor (§12) accepts name, direction ("maximize" or "minimize"), weight (default: 1.0), bounds, and primary (default: True). The EvalResult constructor (§12) accepts scores (dict[str, float]), side_info (list[ASI] | dict | None — either typed ASI objects or a plain dictionary), metadata, valid, and error. All constructor signatures and defaults are documented in the API reference.
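To make the {{PLACEHOLDER}} template convention concrete, the following minimal sketch fills a template by plain string substitution. The function render_template is a hypothetical stand-in written for this chapter; it is not the documented Artifact.render() method, whose internals are not described in the material cited here. It assumes placeholders are word characters wrapped in double braces.

```python
import re

# Hypothetical helper for illustration only; NOT gepa.Artifact.render().
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace each {{NAME}} in the template with values["NAME"].

    Raises KeyError when a placeholder has no value, so an incomplete
    candidate is rejected before it reaches evaluation.
    """
    def substitute(match: re.Match) -> str:
        return values[match.group(1)]

    return PLACEHOLDER.sub(substitute, template)


template = "def sort(arr):\n    {{IMPLEMENTATION}}\n"
candidate = render_template(template, {"IMPLEMENTATION": "return sorted(arr)"})

# The rendered candidate is ordinary Python source that can be executed.
namespace: dict = {}
exec(candidate, namespace)
print(namespace["sort"]([3, 1, 2]))  # [1, 2, 3]
```

A multi-line implementation would also need its continuation lines re-indented to match the placeholder's position; the sketch leaves that out.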

7.2.2 Evolution Engine

The engine manages the core evolutionary loop. The documented EngineConfig (§10) specifies the following parameters with their defaults: population_size (default: 30), elite_size (5), tournament_size (3), mutation_rate (1.0), crossover_rate (0.0), island_count (1), migration_rate (0.1), and migration_interval (10). Based on these parameters and the architecture overview (§3), the engine maintains three logical structures: a candidate pool (Population Manager) with optional island-model partitioning and configurable migration, a Pareto frontier (ParetoFrontier, importable from gepa.engine per §6) of non-dominated solutions, and a History Store recording all results with ASI and genealogy tracking. The OptimizationResult fields (.pareto_front, .history, .stats, documented in §12) directly expose the frontier and history data.
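The island-model parameters can be illustrated with a small sketch. The documentation cited here names island_count, migration_rate, and migration_interval but does not state the migration topology or replacement policy, so the ring topology and the "best replaces worst" rule below are assumptions, and migrate_ring is a hypothetical function rather than GEPA API. In a full loop it would be called once every migration_interval generations.

```python
Island = list[tuple[float, str]]  # (score, candidate text); higher score is better


def migrate_ring(islands: list[Island], migration_rate: float) -> None:
    """Copy each island's best candidates to the next island in a ring, in place."""
    if len(islands) < 2:
        return  # a single island (island_count=1) has nowhere to migrate to

    # Snapshot emigrants first so arrivals are not forwarded within the same round.
    emigrants = []
    for island in islands:
        k = max(1, int(migration_rate * len(island)))
        emigrants.append(sorted(island, reverse=True)[:k])

    for i, movers in enumerate(emigrants):
        destination = islands[(i + 1) % len(islands)]
        destination.sort()                   # ascending: worst candidates first
        destination[:len(movers)] = movers   # overwrite the worst with the arrivals


islands = [
    [(0.9, "a"), (0.1, "b")],
    [(0.5, "c"), (0.2, "d")],
]
migrate_ring(islands, migration_rate=0.5)
print(islands)  # [[(0.5, 'c'), (0.9, 'a')], [(0.9, 'a'), (0.5, 'c')]]
```

With the documented default island_count of 1 the function returns immediately, which matches the reading that island partitioning is optional.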

7.2.3 Reflection Engine

The reflection engine is GEPA's primary mutation mechanism. Rather than applying random perturbations or generic LLM rewrites, it constructs a structured reflection prompt that includes the parent candidate, a minibatch of evaluation examples (biased toward failure cases), all associated ASI records, and recent mutation history. The documented ReflectionConfig (importable from both gepa.config per §10 and gepa.reflection per §7) exposes the following parameters with defaults: minibatch_size (3), failure_bias (0.7), temperature (0.8), max_tokens (4096), max_history_length (10), the boolean flags include_trace (True), include_error (True), include_comparison (True), and include_image (False), and the string settings reflection_model ("claude-sonnet-4-20250514") and reflection_prompt_version ("v3"). The ReflectionEngine class is importable from gepa.reflection, as shown in the documentation's §7 code examples. This design is the main architectural link between evaluation quality and mutation quality: the richer the ASI, the more targeted the resulting mutation. Section 7.3 provides a detailed algorithmic treatment.

7.2.4 Evaluation Pipeline

Evaluation proceeds through four stages: (1) cache check via content-hash deduplication, (2) sandbox execution with configurable isolation (Docker container, subprocess, or direct execution — as specified by the sandbox parameter in EvalConfig), (3) metric computation via the user-provided evaluation function, and (4) ASI extraction from the evaluation results. The EvaluationPipeline class and EvalConfig class (both importable from gepa.evaluation, documented in §11) support the following parameters with defaults: max_workers (8), timeout_per_eval (60 seconds), retry_on_error (True), cache_enabled (True), cache_backend ("sqlite"), and sandbox ("docker"). The three documented sandbox levels are "docker" (full container isolation), "subprocess" (process-level isolation), and "none" (direct execution for trusted code).
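Stage (1) can be sketched as a content-addressed lookup placed in front of the user's evaluation function. The class below is a hypothetical in-memory stand-in, not the documented EvaluationPipeline or its SQLite backend; the choice of SHA-256 and of hashing the candidate together with the task are assumptions made for illustration.

```python
import hashlib
import json


class ContentHashCache:
    """In-memory stand-in for a content-addressed evaluation cache (illustrative)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(candidate: str, task: dict) -> str:
        # Serialize deterministically so equal inputs always hash to the same key.
        payload = json.dumps({"candidate": candidate, "task": task}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def evaluate(self, candidate: str, task: dict, evaluate_fn):
        k = self.key(candidate, task)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = evaluate_fn(candidate, task)
        self._store[k] = result
        return result


calls = []

def slow_eval(candidate, task):
    calls.append(candidate)  # record every real evaluation
    return {"accuracy": 1.0 if "sorted" in candidate else 0.0}

cache = ContentHashCache()
task = {"test_cases": [[3, 1, 2]]}
cache.evaluate("return sorted(arr)", task, slow_eval)
cache.evaluate("return sorted(arr)", task, slow_eval)  # identical text: served from cache
cache.evaluate("return arr", task, slow_eval)
print(cache.hits, cache.misses, len(calls))  # 1 2 2
```

Because the key is computed over raw text, two candidates that differ only in whitespace or comments would be evaluated separately under this scheme.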

7.2.5 Stopping Controller

GEPA provides five composable stopping criteria — MaxMetricCalls, Timeout, NoImprovement, ScoreThreshold, and Composite — importable from gepa.stopping (documented in §9) and combinable with AND/OR logic. This composability is a notable improvement over the fixed stopping criteria found in FunSearch and AlphaEvolve. The following example is reproduced from the GEPA documentation (§9), with inline comments condensed:

# Reproduced from GEPA docs §9 (Stopping Criteria).
# Class names, import path, and constructor signatures are documented.
from gepa.stopping import (
    MaxMetricCalls, Timeout, NoImprovement,
    ScoreThreshold, Composite,
)

stopping = Composite(
    criteria=[
        MaxMetricCalls(200),
        Timeout(7200),
        ScoreThreshold("accuracy", 1.0),
        Composite(
            criteria=[
                NoImprovement(patience=30, min_delta=0.001),
                MaxMetricCalls(50),
            ],
            mode="AND",
        ),
    ],
    mode="OR",
)
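The AND/OR nesting in the example above can be read as a small boolean tree. The sketch below mirrors that shape with hypothetical stand-in classes (RunState, CallBudget, Stagnation, Combined); none of them are gepa.stopping classes, and the should_stop method name and the choice to count patience in metric calls are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    metric_calls: int
    calls_since_improvement: int


class CallBudget:
    def __init__(self, limit: int):
        self.limit = limit

    def should_stop(self, state: RunState) -> bool:
        return state.metric_calls >= self.limit


class Stagnation:
    def __init__(self, patience: int):
        self.patience = patience

    def should_stop(self, state: RunState) -> bool:
        return state.calls_since_improvement >= self.patience


class Combined:
    """Combine criteria with AND (all must fire) or OR (any may fire)."""

    def __init__(self, criteria, mode: str = "OR"):
        if mode not in ("AND", "OR"):
            raise ValueError(f"unknown mode: {mode}")
        self.criteria = criteria
        self.mode = mode

    def should_stop(self, state: RunState) -> bool:
        votes = [c.should_stop(state) for c in self.criteria]
        return all(votes) if self.mode == "AND" else any(votes)


stopping = Combined(
    [CallBudget(200), Combined([Stagnation(30), CallBudget(50)], mode="AND")],
    mode="OR",
)

print(stopping.should_stop(RunState(40, 35)))   # False: stagnant, but fewer than 50 calls
print(stopping.should_stop(RunState(60, 35)))   # True: stagnant and past 50 calls
print(stopping.should_stop(RunState(200, 0)))   # True: hard budget reached
```

Under this reading, the inner AND group acts as a warm-up guard: stagnation can end the run only once a minimum number of evaluations has been spent.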

7.3 Core Algorithms

Notation and Assumptions

The following notation is used throughout Sections 7.3–7.4. Unless otherwise noted, these formalizations are the author's rendering of the documented API contracts, not notation from the GEPA documentation itself. The GEPA documentation presents the dominance relation, hypervolume indicator, and crowding distance formula in simplified notation (§6); the full mathematical treatment below uses standard EMOA definitions consistent with that notation.

$C$	Candidate artifact space (set of all valid text artifacts)
$T$	Task space (set of all task definitions)
$m$	Number of objectives (dimensionality of the score vector)
$f_k : C \times T \to \mathbb{R}$	The $k$-th scalar objective function, $k \in \{1, \ldots, m\}$
$\mathbf{f} : C \times T \to \mathbb{R}^m$	The vector-valued objective: $\mathbf{f}(c, \tau) = (f_1(c, \tau), \ldots, f_m(c, \tau))$
$\mathcal{A}^*$	Set of all finite sequences of ASI records (heterogeneous typed lists)
$P$	Current population of candidate solutions
$\text{PF}$	Pareto frontier: the set of non-dominated solutions in $P$
$\text{HV}(\text{PF}, \mathbf{r})$	Hypervolume indicator of $\text{PF}$ relative to reference point $\mathbf{r}$
$\text{CD}(i)$	Crowding distance of the $i$-th solution on the frontier

Direction normalization convention. All objectives are direction-normalized so that higher is always better. For a minimization objective with raw value $v$, the normalized value is $-v$. This convention is consistent with the Metric(direction="minimize") API parameter and the sign-flip logic visible in the documented ParetoFrontier.dominates() method (§6), which applies a_val, b_val = -a_val, -b_val for minimization objectives.

Scalar vs. vector objectives. The Pareto frontier (Section 7.3.3) operates on the full $m$-dimensional score vector $\mathbf{f}(c, \tau)$. The generalization mode (Section 7.4.3) aggregates per-task scores into scalar averages; when $m > 1$, model selection uses the primary metric (the Metric with primary=True) to select the best candidate. Section 7.4.3 makes this aggregation explicit and notes which parts are documented and which are the author's formalization.

7.3.1 Actionable Side Information (ASI)

ASI is GEPA's most distinctive algorithmic contribution. In prior systems, evaluation returns a scalar score (or a small vector of scores), and the LLM must infer why a candidate performs poorly from the code alone. GEPA inverts this: the evaluation function returns both scores and structured diagnostic records that are injected directly into the reflection prompt. This transforms mutation from a "guess what went wrong" task to a "here is exactly what went wrong, fix it" task.

The GEPA documentation (§5: Actionable Side Information) defines six ASI types:

ASI Type	Content	Role in Reflection
`TraceASI`	Step-by-step execution traces, variable states	LLM identifies where execution diverges from expected behavior
`ErrorASI`	Exception type, message, full stack trace	LLM directly targets the error-causing code
`ImageASI`	Visual outputs (plots, rendered diagrams)	Multimodal LLM analyzes visual quality
`MetricASI`	Per-case score breakdowns, timing details	LLM focuses on worst-performing sub-metrics
`ComparisonASI`	Diff between actual and expected output	LLM targets specific discrepancies
`TextASI`	Free-form text feedback, LLM-judge annotations	LLM incorporates qualitative guidance

Formally, let $c$ denote a candidate artifact and $\tau$ a task. In a traditional evolutionary system, evaluation is a function $\mathbf{f}: C \times T \to \mathbb{R}^m$, where $m$ is the number of objectives. GEPA extends this to:

$$\mathbf{f}_{\text{GEPA}}: C \times T \to \mathbb{R}^m \times \mathcal{A}^*$$

where $\mathbb{R}^m$ is the $m$-dimensional score vector (corresponding to the scores field in EvalResult) and $\mathcal{A}^*$ is the set of all finite sequences of ASI records (corresponding to the side_info field). Each element of the ASI sequence can be any of the six types above. This extended evaluation signature is the foundation for reflection-driven mutation.

The following code example is near-verbatim from the GEPA documentation (§5: Actionable Side Information). The import path, class names, and constructor signatures for EvalResult, TraceASI, ErrorASI, MetricASI, and ComparisonASI are documented in the API reference. One minor adaptation: ComparisonASI has been added to the import statement — it is used in the code body of the documented example but was omitted from the original import line (an apparent documentation oversight). The helper functions compute_accuracy(), compute_efficiency(), and find_mismatches() are used in the documentation as placeholder function names without definitions; they are not part of the GEPA API.

# Near-verbatim from GEPA docs §5 (ASI).
# Import paths and ASI constructor signatures are documented.
# ComparisonASI added to import (used but missing in original).
# compute_accuracy, compute_efficiency, find_mismatches are
# undefined placeholders in the docs, not GEPA API functions.
from gepa import EvalResult, TraceASI, ErrorASI, MetricASI, ComparisonASI

def evaluate_with_asi(candidate: str, task: dict) -> EvalResult:
    """Evaluation function that produces scores and structured ASI."""
    try:
        exec_globals = {}
        exec(candidate, exec_globals)
        solve_fn = exec_globals["solve"]

        # Collect execution trace
        trace = []
        original_print = print

        def traced_print(*args, **kwargs):
            trace.append(" ".join(str(a) for a in args))
            original_print(*args, **kwargs)

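        import builtins
        builtins.print = traced_print
        try:
            result = solve_fn(task["input"])
        finally:
            # Restore the real print even if solve_fn raises, so the patch
            # cannot leak into later evaluations.
            builtins.print = original_print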

        # Compute metrics
        accuracy = compute_accuracy(result, task["expected"])
        efficiency = compute_efficiency(result)

        # Build ASI records
        side_info = [
            TraceASI(trace=trace, label="execution_trace"),
            MetricASI(
                metrics={
                    "accuracy": accuracy,
                    "efficiency": efficiency,
                    "output_length": len(str(result)),
                },
                label="detailed_metrics",
            ),
        ]

        if accuracy < 1.0:
            mismatches = find_mismatches(result, task["expected"])
            side_info.append(
                ComparisonASI(
                    expected=str(task["expected"]),
                    actual=str(result),
                    diff=mismatches,
                    label="output_comparison",
                )
            )

        return EvalResult(
            scores={"accuracy": accuracy, "efficiency": efficiency},
            side_info=side_info,
        )

    except Exception as e:
        import traceback
        return EvalResult(
            scores={"accuracy": 0.0, "efficiency": 0.0},
            side_info=[
                ErrorASI(
                    error_type=type(e).__name__,
                    message=str(e),
                    traceback=traceback.format_exc(),
                    label="runtime_error",
                )
            ],
        )

The design choice to make ASI a first-class type in the evaluation contract, rather than an optional logging side-channel, has two consequences. First, it creates a strong incentive for users to instrument their evaluation functions with diagnostic output, since richer ASI directly improves mutation quality. Second, it places the burden of ASI design on the user — the system cannot automatically extract useful diagnostics from an arbitrary evaluation function. The GEPA authors acknowledge this as a current limitation (§16) and identify Auto-ASI (automatic instrumentation) as a future research direction.

7.3.2 Reflection-Driven Mutation

The reflection engine implements a six-step pipeline for generating improved candidates. This is GEPA's primary search operator, replacing the random or semi-random mutation operators used in classical evolutionary algorithms. The pipeline is described in the GEPA documentation (§7: Reflection-Driven Mutation), which includes the ReflectionConfig parameter documentation, a simplified version of the reflection prompt template, and code showing ReflectionEngine instantiation and the reflect() method call.

Step 1 — Parent selection. A parent is chosen from the Pareto frontier using the ParetoFrontier.select_parent() method (see Section 7.3.4 for the documented selection strategies). The documentation (§7) states that the default uses "crowding-distance-weighted selection," consistent with the strategy="crowding" default shown in the select_parent() code (§6).

Step 2 — Minibatch sampling. A minibatch of 2–3 evaluation examples is sampled, with a configurable failure bias (failure_bias, default: 0.7, meaning 70% probability of sampling failure cases, per §§7, 10). The GEPA documentation (§7, callout box) states that this minibatch size provides "the optimal trade-off between context richness and LLM attention capacity" — a claim presented without experimental evidence (see Section 7.5.6).

Step 3 — ASI assembly. All ASI records associated with the selected examples are gathered and formatted into structured blocks within the reflection prompt. The ReflectionConfig flags include_trace, include_error, include_comparison, and include_image (documented in §§7, 10) control which ASI types are included.

Step 4 — History context. The last $k$ mutations (default $k = 10$, configurable via max_history_length, per §§7, 10) are included, annotated with whether each improved or degraded the score, providing the LLM with short-term search memory.

Step 5 — LLM reflection. The assembled prompt is sent to the LLM with explicit instructions to (a) analyze the diagnostic feedback, (b) identify root causes of failure, (c) propose a specific, targeted fix rather than a full rewrite, and (d) output the complete improved candidate. The GEPA documentation includes two versions of this prompt template — a simplified version in §5 (within the ASI section) and a more detailed structural outline in §7 — with sections for role definition, current solution, diagnostic analysis, mutation history, and output format.

Step 6 — Validation and insertion. The LLM output is parsed, validated against artifact constraints (syntax, type checking, basic execution via Artifact.validate(), per §12), and inserted into the population for evaluation.

The key design insight is that this prompt is not a generic "improve this code" instruction — it provides the LLM with the same diagnostic information a human developer would use when debugging.
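Steps 2 through 4 of the pipeline can be sketched in a few lines. Both functions below are hypothetical and are not part of gepa.reflection; the record fields (case, passed, asi), the per-draw interpretation of failure_bias, and the prompt wording are assumptions. Only the parameter names and defaults (minibatch_size, failure_bias, max_history_length) come from the documented ReflectionConfig.

```python
import random


def sample_minibatch(records, minibatch_size=3, failure_bias=0.7, rng=None):
    """Draw examples without replacement, preferring failures with probability failure_bias."""
    rng = rng or random.Random()
    failures = [r for r in records if not r["passed"]]
    successes = [r for r in records if r["passed"]]
    batch = []
    while len(batch) < minibatch_size and (failures or successes):
        preferred = failures if rng.random() < failure_bias else successes
        pool = preferred or failures or successes  # fall back if the preferred pool is empty
        batch.append(pool.pop(rng.randrange(len(pool))))
    return batch


def build_reflection_prompt(parent, batch, history, max_history_length=10):
    """Assemble the parent, diagnostics, and recent history into one prompt string."""
    lines = ["## Current solution", parent, "", "## Diagnostic analysis"]
    for r in batch:
        status = "PASS" if r["passed"] else "FAIL"
        lines.append(f"- [{status}] {r['case']}: {r['asi']}")
    lines += ["", "## Mutation history"]
    for h in history[-max_history_length:]:
        outcome = "improved" if h["delta"] > 0 else "degraded or unchanged"
        lines.append(f"- {h['summary']} ({outcome})")
    lines += ["", "Propose one targeted fix and return the complete candidate."]
    return "\n".join(lines)


records = [
    {"case": "empty list", "passed": True, "asi": "ok"},
    {"case": "duplicates", "passed": False, "asi": "expected [1, 1, 2], got [1, 2]"},
    {"case": "negatives", "passed": False, "asi": "IndexError: list index out of range"},
    {"case": "sorted input", "passed": True, "asi": "ok"},
]
history = [{"summary": "switched to set-based dedup", "delta": -0.2}]

batch = sample_minibatch(records, rng=random.Random(0))
print(build_reflection_prompt("def sort(arr): ...", batch, history))
```

Passing a seeded random.Random makes the sampled minibatch repeatable, which helps when comparing prompts across runs.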

7.3.3 Pareto Frontier Maintenance

GEPA maintains a Pareto frontier $\text{PF}$ of non-dominated solutions in the population $P$. Given $m$ objective functions $f_1, \ldots, f_m$ (direction-normalized so that higher is always better), solution $\mathbf{x}$ dominates solution $\mathbf{y}$ (written $\mathbf{x} \succ \mathbf{y}$) if and only if:

$$\mathbf{x} \succ \mathbf{y} \iff \forall i \in \{1, \ldots, m\}: f_i(\mathbf{x}) \geq f_i(\mathbf{y}) \;\land\; \exists j \in \{1, \ldots, m\}: f_j(\mathbf{x}) > f_j(\mathbf{y})$$

The Pareto frontier is then defined as the set of all non-dominated solutions:

$$\text{PF} = \{\mathbf{x} \in P : \nexists\; \mathbf{y} \in P \text{ such that } \mathbf{y} \succ \mathbf{x}\}$$

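These are standard definitions from the multi-objective optimization literature (Deb et al., 2002). The GEPA documentation (§6: Pareto-Efficient Search) includes a code example of the ParetoFrontier class (importable from gepa.engine) with dominates(), update(), and select_parent() methods implementing this logic. The following code reproduces the class definition, all three methods, and the Solution type annotation from the GEPA documentation §6. Three import lines are added at the top so the listing can be loaded as written: the class body uses np and random, and the postponed-annotations import lets the undefined Solution name appear in method signatures. The documented import of ParetoFrontier is shown as a comment because the class definition that follows would otherwise shadow it. The Solution type is used throughout the documented class but is not independently defined in the API reference; it appears to wrap a candidate's code string and score dictionary. The helper methods _crowding_distances() and _hypervolume_contribution() are called in select_parent() but are not shown in the listing.

# Reproduced from GEPA docs §6 (Pareto-Efficient Search); imports added here.
# Documented import path: from gepa.engine import ParetoFrontier
# Solution type is used but not independently defined in the API reference.
from __future__ import annotations  # lets the undefined Solution name appear in annotations

import random

import numpy as np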

class ParetoFrontier:
    def __init__(self, objectives: list[str], directions: list[str]):
        self.objectives = objectives
        self.directions = directions  # "maximize" or "minimize"
        self.frontier: list[Solution] = []

    def dominates(self, a: Solution, b: Solution) -> bool:
        """Check if solution a dominates solution b."""
        at_least_one_better = False
        for obj, direction in zip(self.objectives, self.directions):
            a_val = a.scores[obj]
            b_val = b.scores[obj]
            if direction == "minimize":
                a_val, b_val = -a_val, -b_val
            if a_val < b_val:
                return False
            if a_val > b_val:
                at_least_one_better = True
        return at_least_one_better

    def update(self, candidate: Solution) -> bool:
        """Add candidate to frontier if non-dominated. Returns True if added."""
        for member in self.frontier:
            if self.dominates(member, candidate):
                return False  # Dominated, discard

        self.frontier = [
            m for m in self.frontier
            if not self.dominates(candidate, m)
        ]
        self.frontier.append(candidate)
        return True

    def select_parent(self, strategy: str = "crowding") -> Solution:
        """Select a parent from the frontier for mutation."""
        if strategy == "crowding":
            distances = self._crowding_distances()
            probs = distances / distances.sum()
            return np.random.choice(self.frontier, p=probs)
        elif strategy == "random":
            return random.choice(self.frontier)
        elif strategy == "tournament":
            a, b = random.sample(self.frontier, 2)
            return a if self._hypervolume_contribution(a) \
                > self._hypervolume_contribution(b) else b

The dominates() method performs the direction-normalized dominance check described above; note the explicit sign flip (a_val, b_val = -a_val, -b_val) for minimization objectives. The update() method implements the standard frontier update in three steps: (1) if any existing member dominates the candidate, discard the candidate and return False; (2) otherwise, remove every existing member that the candidate dominates; (3) append the candidate, which either replaces the members it dominated or joins the frontier as a new trade-off point.
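The two definitions above can also be checked on plain score tuples, independently of the documented class. The following is a minimal sketch, assuming scores are already direction-normalized (higher is better); the function names and the example population are illustrative and are not part of the GEPA API.

```python
from typing import Sequence

Point = tuple[float, ...]

def dominates(x: Point, y: Point) -> bool:
    """x dominates y: no worse in every objective, strictly better in at least one."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def pareto_front(points: Sequence[Point]) -> list[Point]:
    """Return every point that no other point dominates (simple quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical scores: (accuracy, negated latency in ms), both higher-is-better.
population = [(0.90, -120.0), (0.85, -80.0), (0.80, -150.0), (0.90, -100.0)]
front = pareto_front(population)
print(front)  # the two mutually non-dominated trade-off points
```

Here (0.90, -120.0) is removed because (0.90, -100.0) ties on accuracy and is strictly faster, while (0.85, -80.0) survives because nothing matches its latency.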

7.3.4 Crowding Distance and Selection

To maintain diversity on the Pareto frontier, GEPA uses crowding distance to measure how isolated a solution is in objective space. The GEPA documentation (§6) presents the crowding distance formula, consistent with the standard definition from NSGA-II (Deb et al., 2002):

$$\text{CD}(i) = \sum_{k=1}^{m} \frac{f_k(i+1) - f_k(i-1)}{f_k^{\max} - f_k^{\min}}$$

where, for each objective $k$, solutions on the frontier are sorted by their $f_k$ value; $f_k(i+1)$ and $f_k(i-1)$ are the objective values of the neighboring solutions in that sorted order; and $f_k^{\max}$, $f_k^{\min}$ are the maximum and minimum values of objective $k$ across the frontier. The GEPA documentation (§6) states that "boundary solutions receive infinite crowding distance," which is the standard NSGA-II convention ensuring extreme trade-off points are preferentially preserved.

Two edge cases require explicit handling:

Boundary solutions (first and last in the sorted order for any objective $k$): $\text{CD}(i) = \infty$ ensures these are always preferentially selected or preserved, as they represent extreme trade-offs.
Zero objective range ($f_k^{\max} = f_k^{\min}$): when all frontier solutions have the same value for objective $k$, the contribution from that objective is zero (the term for objective $k$ is skipped). This prevents division by zero.
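The formula and both edge cases can be sketched as follows. This is an illustrative implementation of the standard NSGA-II computation, not the body of GEPA's private _crowding_distances() method, which the documentation does not show. Treating frontiers of one or two members as all-boundary, and skipping a zero-range objective before assigning any infinite distance, are assumptions of this sketch.

```python
import math

def crowding_distances(scores: list[tuple[float, ...]]) -> list[float]:
    """Crowding distance per frontier member; scores[i][k] is objective k of member i."""
    n = len(scores)
    if n <= 2:
        return [math.inf] * n          # every member is a boundary solution
    dist = [0.0] * n
    for k in range(len(scores[0])):
        order = sorted(range(n), key=lambda i: scores[i][k])
        f_min, f_max = scores[order[0]][k], scores[order[-1]][k]
        if f_max == f_min:
            continue                   # zero range: skip this objective entirely
        dist[order[0]] = dist[order[-1]] = math.inf   # boundary solutions
        for pos in range(1, n - 1):
            i = order[pos]
            gap = scores[order[pos + 1]][k] - scores[order[pos - 1]][k]
            dist[i] += gap / (f_max - f_min)          # inf stays inf if already set
    return dist

frontier_scores = [(1.0, 5.0), (2.0, 4.0), (3.0, 2.0), (5.0, 1.0)]
print(crowding_distances(frontier_scores))
```

For this frontier the two interior members receive 0.5 + 0.75 = 1.25 and 0.75 + 0.75 = 1.5, and the two extremes receive infinity.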

Documented selection strategies. The GEPA documentation (§6) shows three selection strategies in the ParetoFrontier.select_parent() method, reproduced verbatim in the code above:

strategy="crowding" (default): proportional selection where the probability of selecting each frontier member is proportional to its crowding distance (probs = distances / distances.sum()). This is the code shown in the documented class.
strategy="random": uniform random selection from the frontier.
strategy="tournament": two solutions are sampled from the frontier, and the one with higher hypervolume contribution is selected. This is the most computationally expensive strategy as it requires recomputing the hypervolume indicator with and without each candidate.

Mathematical inconsistency in the documented code. The documented proportional selection (strategy="crowding") with probs = distances / distances.sum() breaks down when boundary solutions have $\text{CD} = \infty$: the sum is then infinite, so every infinite entry becomes $\infty / \infty$ (NaN) and every finite entry becomes 0, and NumPy's np.random.choice raises a ValueError when any probability is NaN. If _crowding_distances() follows the stated convention, every frontier contains boundary solutions, so this is the normal case rather than a rare corner case. The inconsistency exists within the documentation itself: §6 both specifies infinite boundary distances and shows proportional-selection code that cannot handle them.

Three plausible resolutions exist: (a) the implementation clips infinite distances to a large finite value before normalization, (b) boundary solutions are always selected first and proportional selection applies only to interior solutions, or (c) the implementation uses tournament selection on crowding distance (where boundary solutions always win), as in canonical NSGA-II. The separately documented EngineConfig.tournament_size parameter (§10, default: 3) suggests tournament-based selection may be available in the engine's main loop beyond the three strategies shown in ParetoFrontier.select_parent(). The documentation does not resolve this inconsistency. Readers implementing GEPA-like systems should adopt one of these strategies explicitly.
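Resolution (a) can be sketched as follows. This is one possible repair under stated assumptions, not GEPA's actual behavior: the boost factor of 2.0 applied to the largest finite distance is a hypothetical choice, and the sketch samples a frontier index rather than a Solution object.

```python
import numpy as np

def selection_probs(distances, boost: float = 2.0) -> np.ndarray:
    """Convert crowding distances to probabilities, tolerating infinite entries."""
    d = np.asarray(distances, dtype=float)
    if d.size == 0:
        raise ValueError("empty frontier")
    finite = d[np.isfinite(d)]
    # Replace inf with a finite cap: boost times the largest finite distance (assumed rule).
    cap = boost * finite.max() if finite.size and finite.max() > 0 else 1.0
    d = np.where(np.isfinite(d), d, cap)
    if d.sum() == 0:
        return np.full(d.size, 1.0 / d.size)   # all distances zero: fall back to uniform
    return d / d.sum()

distances = np.array([np.inf, 1.25, 1.5, np.inf])

# The documented normalization: inf / inf gives NaN, finite / inf gives 0.
with np.errstate(invalid="ignore"):
    naive = distances / distances.sum()

probs = selection_probs(distances)
rng = np.random.default_rng(0)
parent_index = rng.choice(len(distances), p=probs)
```

Clipping keeps boundary solutions the most likely parents without making interior solutions unreachable, which is the main behavioral difference from resolution (b).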

7.3.5 Hypervolume Indicator

GEPA uses the hypervolume indicator to measure the overall quality of the Pareto frontier. The GEPA documentation (§6) presents this formula in simplified notation; the standard definition, consistent with the documented behavior, is:

$$\text{HV}(\text{PF}, \mathbf{r}) = \Lambda\left(\left\{\mathbf{q} \in \mathbb{R}^m : \exists\; \mathbf{p} \in \text{PF} \text{ s.t. } \mathbf{p} \geq \mathbf{q} \geq \mathbf{r}\right\}\right)$$

where $\Lambda$ denotes the Lebesgue measure (volume) in $\mathbb{R}^m$, and the inequality $\mathbf{p} \geq \mathbf{q}$ is component-wise, with all objectives direction-normalized so higher is better. The reference point $\mathbf{r}$ is typically the worst observed value in each objective. The hypervolume captures both convergence toward the true Pareto front and diversity along it — a strictly dominated frontier has lower hypervolume, and a frontier with gaps has lower hypervolume than one with uniform coverage. This is a standard quality indicator from the EMOA literature (Zitzler and Thiele, 1999). The hypervolume also appears in the documented strategy="tournament" selection, where the solution with higher hypervolume contribution wins the tournament (see select_parent() code in §7.3.3).
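For two objectives the Lebesgue measure reduces to the area of a staircase, which makes the definition easy to verify by hand. The sketch below assumes $m = 2$ and direction-normalized (maximization) scores; the function names are illustrative, and GEPA's private _hypervolume_contribution() is not shown in the documentation.

```python
def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded below by `ref` (maximization, m = 2)."""
    # Points that do not strictly improve on the reference contribute no area.
    pts = [p for p in front if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest f1 down; each point adds a strip above the best f2 so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area

def hv_contribution(front, ref, p):
    """Hypervolume lost if `p` were removed from the frontier."""
    rest = [q for q in front if q != p]
    return hypervolume_2d(front, ref) - hypervolume_2d(rest, ref)

front = [(1.0, 5.0), (2.0, 4.0), (3.0, 2.0), (5.0, 1.0)]
ref = (0.0, 0.0)
print(hypervolume_2d(front, ref))               # 13.0
print(hv_contribution(front, ref, (3.0, 2.0)))  # 1.0
```

The total of 13 is the sum of the strips 5 + 3 + 4 + 1, and the point (3, 2) exclusively covers only the unit square between (2, 1) and (3, 2), so it would lose a tournament against the extreme point (5, 1), whose contribution is 2.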

7.3.6 Seedless Bootstrap

Unlike FunSearch, AlphaEvolve, and OpenEvolve, which require a user-provided seed solution, GEPA can initialize the population from a natural language description alone. The bootstrap pipeline, as described in the GEPA documentation (§8: Seedless Mode), proceeds as follows:

Strategy enumeration: If the user provides hint strategies via SeedlessConfig.bootstrap_strategies (e.g., ["greedy_first_fit", "best_fit_decreasing", "dynamic_programming"]), one candidate is generated per strategy. Otherwise, the LLM is prompted to enumerate diverse algorithmic approaches.
Parallel generation: The LLM concurrently generates concrete implementations for each strategy, using the artifact's description and signature fields as context.
Validation: Each candidate undergoes syntax checking, type verification, and basic execution before insertion.
Initial evaluation: All valid candidates are evaluated to establish the initial Pareto frontier.
Normal evolution: The standard reflection-driven mutation loop begins.

Seedless mode is configured via SeedlessConfig (documented in §8), which specifies initial_population_size (default: 10), diversity_prompt (boolean, default: True), and optional bootstrap_strategies (a list of strategy name strings). These parameters are documented in the configuration API with the code example shown in §8. This feature is particularly valuable for exploratory optimization where the user does not have a strong prior on solution structure.
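The validation step of the pipeline can be approximated with the standard library alone. The following sketch is an assumption about what "syntax checking" and "basic execution" might involve; the function name, the entry-point convention, and the smoke-test arguments are hypothetical, type verification is reduced to checking that the expected function exists, and real use would run the candidate inside a sandbox rather than via exec().

```python
import ast

def validate_candidate(source: str, entry_point: str, smoke_args: tuple) -> tuple[bool, str]:
    """Return (ok, reason) for one LLM-generated candidate."""
    try:
        tree = ast.parse(source)                                  # syntax check
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    if entry_point not in defined:                                # structural check
        return False, f"missing function {entry_point!r}"
    namespace: dict = {}
    try:
        exec(compile(tree, "<candidate>", "exec"), namespace)     # untrusted: sandbox in practice
        namespace[entry_point](*smoke_args)                       # basic execution
    except Exception as exc:
        return False, f"runtime error: {type(exc).__name__}"
    return True, "ok"

good = "def pack(items, cap):\n    return sorted(items, reverse=True)\n"
crash = "def pack(items, cap):\n    return items[0] / 0\n"
print(validate_candidate(good, "pack", ([3, 1, 2], 10)))
print(validate_candidate(crash, "pack", ([3, 1, 2], 10)))
```

Only candidates that pass all three checks would proceed to the initial evaluation that establishes the first Pareto frontier.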

7.3.7 Content-Hash Caching

GEPA avoids redundant evaluation through deterministic content-hash caching. The GEPA documentation (§11) includes a code example showing the caching mechanism. For a candidate $c$ and task $\tau$, the cache key is:

$$k(c, \tau) = \text{SHA-256}(c \;\|\; \text{json}_{\text{sorted}}(\tau))$$

where $\|$ denotes string concatenation and $\text{json}_{\text{sorted}}$ is a deterministic JSON serialization with sorted keys. This formalization matches the documented code example (§11), which shows hashlib.sha256(content.encode()).hexdigest() where content is the candidate string concatenated with json.dumps(task, sort_keys=True). Before evaluation, the pipeline checks whether $k(c, \tau)$ exists in the cache; if it does, the cached EvalResult (including ASI) is returned immediately and the evaluation is skipped. The documented EvalConfig.cache_backend parameter (§11) supports "sqlite" (default), "redis", or "memory" storage backends. Caching is especially valuable in multi-task and generalization modes, where the same candidate may be evaluated against overlapping task subsets.
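The key construction can be reproduced with the standard library. In the sketch below, the hashing expression follows the documented example; the cache_key and evaluate_cached helpers and the plain-dict cache are illustrative stand-ins for the documented "memory" backend, not GEPA's implementation.

```python
import hashlib
import json
from typing import Any, Callable

def cache_key(candidate: str, task: dict) -> str:
    """SHA-256 of the candidate text concatenated with the sorted-key JSON of the task."""
    content = candidate + json.dumps(task, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def evaluate_cached(candidate: str, task: dict,
                    evaluate: Callable[[str, dict], Any], cache: dict) -> Any:
    """Return the stored result on a hit; otherwise evaluate once and store it."""
    key = cache_key(candidate, task)
    if key not in cache:
        cache[key] = evaluate(candidate, task)
    return cache[key]

# Key order in the task dictionary does not change the key.
k1 = cache_key("def f(): pass", {"n": 50, "seed": 1})
k2 = cache_key("def f(): pass", {"seed": 1, "n": 50})
print(k1 == k2)
```

Sorting the keys is what makes the key deterministic: two task dictionaries with the same contents but different insertion order serialize to the same string and therefore hit the same cache entry.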

7.4 Optimization Modes

GEPA defines three optimization modes that address different relationships between artifacts and tasks, each configured via a dedicated config class documented in the API (§§4.1–4.3). This is a structural contribution: prior systems in this survey offer only single-task optimization, leaving multi-task generalization to the user.

7.4.1 Single-Task Mode

The simplest mode: one artifact is optimized against one task definition. This is the standard evolutionary optimization scenario, equivalent to what FunSearch, AlphaEvolve, and OpenEvolve provide. The user specifies a task dictionary, and all evaluation calls use that fixed task. The mode is configured via SingleTaskConfig (documented in §4.1), whose parameters include population_size, max_iterations, and reflection_minibatch_size. In this mode, the full $m$-dimensional score vector $\mathbf{f}(c, \tau)$ is computed for each candidate, and the Pareto frontier operates over all $m$ objectives.

7.4.2 Multi-Task Mode with Cross-Transfer

Multi-task mode optimizes a single artifact across multiple tasks simultaneously. The key mechanism is cross-transfer: solutions that perform well on one task can transfer insights to other tasks. For example, a TSP heuristic optimized on instances of size 50, 100, and 200 simultaneously may discover strategies that generalize across scales.

Cross-transfer is controlled by three documented MultiTaskConfig parameters (§4.2): a boolean cross_transfer flag (shown as True in the documented example), a transfer_frequency (default: 5, how often transfer occurs in iterations), and a min_improvement_for_transfer threshold (default: 0.01, minimum score delta to justify transferring a solution from one task's population to another's). The result object includes per-task results via result.per_task (documented in §12).

7.4.3 Generalization Mode

Generalization mode splits tasks into training and validation sets, addressing overfitting — a well-known risk in evolutionary optimization where highly specialized solutions perform well on training instances but fail on unseen problems. This mode imports a standard machine learning practice (train/validation splitting) into the LLM-driven evolutionary setting.

The mode is configured via GeneralizationConfig (documented in §4.3), which specifies val_frequency (default: 10, how often validation is performed in iterations), early_stopping_patience (default: 20, number of iterations without validation improvement before stopping), and overfitting_threshold (default: 0.15, maximum tolerated gap between training and validation scores).

Reconciling multi-objective scores with scalar aggregation. Generalization mode requires reducing per-task, per-objective scores to scalar summaries for train/val comparison. The GEPA documentation (§§4.3, 12) reports train_score, val_score, and generalization_gap as scalar fields on OptimizationResult. These fields are documented; the following averaging formulas are the author's formalization of the scalar aggregation implied by these documented fields, not equations taken from the GEPA documentation. The exact internal aggregation method is not specified in the documentation beyond the existence of these scalar result fields.

Let $f_k(c, \tau)$ denote the $k$-th objective for candidate $c$ on task $\tau$, and let $k^*$ denote the primary metric (the Metric with primary=True). Three roles must be distinguished:

Optimization objective (training). The evolutionary search — parent selection, reflection-driven mutation, population update — is guided by fitness on the training task set $\mathcal{T}_{\text{train}}$. For multi-objective problems ($m > 1$), the Pareto frontier operates on the full score vector averaged across training tasks. For model selection and early stopping, GEPA reduces to the primary metric. A natural formalization consistent with the documented scalar train_score field is:
$$\bar{f}_{k^*,\text{train}}(c) = \frac{1}{|\mathcal{T}_{\text{train}}|} \sum_{\tau \in \mathcal{T}_{\text{train}}} f_{k^*}(c, \tau)$$

This is the scalar training score reported as result.train_score. When $m = 1$ (single objective), $k^*$ is the only metric and the distinction is moot. (Author formalization: the documentation reports this as a scalar field but does not specify whether the aggregation is arithmetic mean, weighted mean, or some other function.)
Selection criterion (validation). Periodically (every val_frequency iterations), the current best candidates are evaluated on the validation task set $\mathcal{T}_{\text{val}}$. The validation score for the primary metric is:
$$\bar{f}_{k^*,\text{val}}(c) = \frac{1}{|\mathcal{T}_{\text{val}}|} \sum_{\tau \in \mathcal{T}_{\text{val}}} f_{k^*}(c, \tau)$$

Crucially, validation fitness is not used to guide the evolutionary search itself. The validation set acts as a held-out check on generalization, analogous to validation in supervised learning. Validation is used for model selection (choosing which candidate to report as the final result) and for early stopping (halting when validation performance stagnates for early_stopping_patience iterations). (Author interpretation: this train/val separation is the standard interpretation consistent with the documented behavior; the documentation does not explicitly state that validation scores are excluded from the search fitness.)
Final selection. The reported best candidate $c^*$ is selected by validation performance on the primary metric:
$$c^* = \arg\max_{c \in \text{Candidates}} \; \bar{f}_{k^*,\text{val}}(c)$$

where Candidates is the set of all candidates evaluated on the validation set during the run. This ensures the final output generalizes beyond the training instances. The OptimizationResult reports this as result.val_score (documented in §12).

The generalization gap is defined on the primary metric:

$$\text{gap}(c) = \bar{f}_{k^*,\text{train}}(c) - \bar{f}_{k^*,\text{val}}(c)$$

A gap exceeding the configured overfitting_threshold (default: 0.15) triggers an alert. The result object reports train_score, val_score, and generalization_gap as documented scalar fields in §§4.3 and 12. Note that when $m > 1$, the generalization gap as formalized above is defined only for the primary metric; per-objective gaps could be computed but are not part of the documented API.
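The author formalization above translates directly into a few lines of code. The sketch below assumes arithmetic-mean aggregation, primary-metric scores expressed as fractions in [0, 1] so that the 0.15 threshold is comparable, and patience counted in iterations since the best validation score; all function names and example scores are hypothetical, and only the field names train_score, val_score, and generalization_gap come from the documentation.

```python
from statistics import mean

def generalization_report(train: dict[str, float], val: dict[str, float],
                          overfitting_threshold: float = 0.15) -> dict:
    """Mean primary-metric score on each split, their gap, and an overfitting flag."""
    train_score = mean(list(train.values()))
    val_score = mean(list(val.values()))
    gap = train_score - val_score
    return {"train_score": train_score, "val_score": val_score,
            "generalization_gap": gap, "overfit_alert": gap > overfitting_threshold}

def select_best(val_scores: dict[str, float]) -> str:
    """Final selection: the candidate id with the highest validation score."""
    return max(val_scores, key=val_scores.get)

def should_stop(history: list[tuple[int, float]], current_iteration: int,
                patience: int = 20) -> bool:
    """history holds (iteration, val_score); stop once the best score is `patience` old."""
    best = max(score for _, score in history)
    best_iteration = next(it for it, score in history if score == best)
    return current_iteration - best_iteration >= patience

report = generalization_report({"t1": 1.0, "t2": 0.9, "t3": 0.8}, {"v1": 0.7, "v2": 0.7})
winner = select_best({"cand_a": 0.70, "cand_b": 0.74, "cand_c": 0.73})
history = [(10, 0.70), (20, 0.74), (30, 0.73), (40, 0.74)]
print(report["overfit_alert"], winner, should_stop(history, 40))
```

In this example the training mean is 0.9 and the validation mean is 0.7, so the gap of 0.2 exceeds the default threshold; the best validation score was first reached at iteration 20, so a patience of 20 stops the run at iteration 40.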

Note on held-out evaluation. The GEPA documentation does not describe a separate test set beyond the train/val split. For rigorous empirical evaluation, researchers should reserve a third held-out set that is never used for either training or model selection, and report final performance on that set. The ARC-AGI results in Section 7.5 use a train/val split, but it is not documented whether the reported validation accuracy was also the final held-out evaluation or whether an additional test split was used.

7.5 Key Results

The GEPA authors report results across five benchmarks spanning code optimization, mathematical reasoning, geometric optimization, and infrastructure routing. This section separates self-reported GEPA results (Table 7.1) from cross-system comparisons (Table 7.3), as the two categories have fundamentally different evidence standards. All results cited below are from the GEPA project documentation (§§13–14) unless otherwise attributed.

7.5.1 Self-Reported GEPA Results

Table 7.1 presents the benchmark results reported in the GEPA documentation (§13). These are single-system results: GEPA's performance relative to a stated baseline, with no cross-system comparison.

**Table 7.1 — Self-reported GEPA benchmark results.** All numbers from the GEPA documentation §13 (February 2026). No independent reproduction confirmed.
Benchmark	Artifact Type	Baseline	GEPA Result	Improvement	Eval Calls	LLM	Mode
Claude Code Bleve	Python code	79.3%	100%	+20.7 pp	85	claude-sonnet-4-20250514 (stated in §14)	Single-task
ARC-AGI v1	Python code	32.5%^†	89.5%	+57.0 pp	1,200	Not explicitly stated^‡	Generalization
AIME 2025	Python code	46.67%^†	60%	+13.33 pp	400	Not explicitly stated^‡	Single-task
Circle Packing n=26	Python code	2.63590^††	2.63594	+0.00004	300	Not explicitly stated^‡	Not stated
CloudCast Routing	YAML config	Baseline routing	40.2% cost savings	−40.2% cost	800	Not explicitly stated^‡	Not stated

^† Baseline source not attributed in the GEPA documentation. The 32.5% ARC-AGI baseline may refer to a non-evolutionary baseline; the 46.67% AIME baseline is not identified as a specific prior system.
^†† Attributed to AlphaEvolve in the GEPA documentation.
^‡ Documentation examples use claude-sonnet-4-20250514; whether all benchmark runs used this model is not confirmed.

Evidence status per benchmark. The following table documents what is reproducible from the public documentation versus merely reported in prose, for each of the five benchmarks.

Benchmark	Task Definition	Eval Function	Seed / Init	LLM Model	Reproducibility
Claude Code Bleve	GEPA-specific; described in prose (§14 case study) but not independently defined	Not provided	"Seed solution from Claude Code baseline: 79.3%" (§14)	claude-sonnet-4-20250514 (§14)	Low — task and eval function not publicly available
ARC-AGI v1	Public benchmark; 80/20 train/val split stated (§14)	Not provided	Not stated	Not stated	Partial — public tasks, but eval/model/seed not specified
AIME 2025	Public benchmark	Not provided	Not stated	Not stated	Partial — public tasks, but eval/model/seed not specified
Circle Packing n=26	Standard problem (well-defined geometry)	Not provided	Not stated	Not stated	Partial — standard problem, but implementation details missing
CloudCast Routing	GEPA-specific; partial YAML template shown in §14	Not provided	Not stated	Not stated	Low — task and eval function not publicly available

Overall limitations: (1) All results are self-reported; none have been independently reproduced. (2) No confidence intervals, standard deviations, or multi-seed statistics are reported. (3) Two of five benchmarks (Claude Code Bleve and CloudCast Routing) are GEPA-specific with no independent task definitions. (4) Evaluation functions are not provided for any benchmark; only the pedagogical examples in §§2, 5 are shown. (5) Exact LLM model is confirmed only for Claude Code Bleve.

7.5.2 Intra-System Comparison: Generalization vs. Single-Task

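The most methodologically sound comparison available is between GEPA's own generalization mode and single-task mode on ARC-AGI v1, because both runs come from the same codebase and presumably share the same LLM backend and evaluation function (the documentation does not state the model for either run, and the evaluation budgets differ):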

**Table 7.2 — GEPA internal ablation: generalization vs. single-task on ARC-AGI v1.** Both rows from GEPA documentation §13 and presumably use matched conditions.
GEPA Mode	Train Acc.	Val Acc.	Gap	Eval Calls
Generalization (train/val split)	94.2%	89.5%	4.7%	1,200
Single-Task	97.1%	82.3%	14.8%	800

The 7.2 percentage-point improvement in validation accuracy (82.3% → 89.5%) with a concurrent reduction in generalization gap (14.8% → 4.7%) provides direct evidence for the value of the train/val split design, at the cost of 50% more evaluation calls (800 → 1,200). This is a single-run comparison without variance estimates, so the magnitude of the effect should be interpreted cautiously, but the direction is consistent with the expected behavior of regularization via validation-based model selection.

7.5.3 Cross-System Comparisons (Indicative)

Table 7.3 presents the cross-system comparisons drawn from the GEPA documentation (§13). The GEPA documentation presents these figures in a single table alongside GEPA's own results without noting experimental conditions; the separation into indicative comparisons and the "Matched Conditions?" column are the survey author's addition. These comparisons are indicative, not controlled.

**Table 7.3 — Cross-system comparison on ARC-AGI v1 (indicative).** GEPA results from GEPA documentation §13. Other systems' figures are attributed in the GEPA documentation but sourced from their respective publications, not from head-to-head runs under matched conditions. The "Matched Conditions?" column is the survey author's assessment.
System	Train Acc.	Val Acc.	Gap	Eval Calls	Source	Matched Conditions?
GEPA (Generalization)	94.2%	89.5%	4.7%	1,200	GEPA docs §13	—
AlphaEvolve	91.0%	85.0%	6.0%	2,500	Attributed in GEPA docs §13	No: different LLM, evaluation function, seeds, and budget
OpenEvolve	88.5%	80.2%	8.3%	1,800	Attributed in GEPA docs §13	No: different LLM, evaluation function, seeds, and budget
FunSearch	78.0%	72.5%	5.5%	5,000+	Attributed in GEPA docs §13	No: different LLM, evaluation function, seeds, and budget

Two observations emerge from this table, with the strong caveat that the cross-system numbers are not from controlled experiments:

Observation 1: Validation accuracy. GEPA's generalization mode reports the highest validation accuracy (89.5%) among the compared systems. However, each system used a different LLM backend, evaluation function, seed solution, and computational budget. These confounders make it impossible to attribute the performance difference to any specific GEPA feature (ASI, Pareto optimization, generalization mode, or the LLM itself).

Observation 2: Evaluation calls. The figures attributed to other systems show higher evaluation-call counts (2,500 for AlphaEvolve; 1,800 for OpenEvolve; 5,000+ for FunSearch) compared to GEPA's 1,200. This difference is consistent with the hypothesis that ASI-driven mutation is more sample-efficient, but it does not establish that claim: the systems differ in too many dimensions (LLM capability, evaluation granularity, seed quality, population management) for the evaluation-count comparison to isolate the effect of ASI. A controlled ablation removing ASI from GEPA while holding all other factors constant would be needed to support a sample-efficiency claim.

7.5.4 Circle Packing

Circle packing in a square is a well-studied geometric optimization problem with known optimal solutions for small $n$ and competitive results from multiple systems for larger $n$. GEPA reports the following results (§13):

**Table 7.4 — Circle packing results (self-reported).** All from GEPA documentation §13. "Previous Best" attributions are from the same source.
$n$	Previous Best	GEPA Result	Status
20	2.52040	2.52040	Matched known optimal
22	2.56287	2.56290	Claimed improvement (+0.00003)
24	2.60240	2.60245	Claimed improvement (+0.00005)
26	2.63590 (attributed to AlphaEvolve)	2.63594	Claimed record (the docs state this "matches LLM4AD record"; +0.00004 over AlphaEvolve attribution)

The n=26 result is particularly notable because it claims to surpass AlphaEvolve's reported result on the same problem instance, using a general-purpose system rather than a code-specialized one. However, improvements at this scale (0.00004 difference for n=26) are within the regime where numerical precision, floating-point representation, and the exact problem formulation (objective function definition, coordinate representation) matter significantly. Without independent verification of both the GEPA result and the baseline, and without confirmation that the same objective function and precision conventions were used, these improvements should be treated as claims consistent with competitiveness rather than definitively established records. The GEPA documentation does not provide the evaluation function, coordinate output, or feasibility verification for the circle packing results.
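The sensitivity to precision conventions can be made concrete with a feasibility checker. The sketch below assumes the formulation used in AlphaEvolve's reported experiment (maximize the sum of radii of $n$ non-overlapping circles inside the unit square); the GEPA documentation does not state the objective, so the formulation, the (x, y, r) representation, and the tolerance values here are assumptions.

```python
import math

def verify_packing(circles, tol: float = 1e-9):
    """circles: list of (x, y, r). Returns (feasible, sum_of_radii)."""
    for x, y, r in circles:
        inside = (x - r >= -tol and x + r <= 1 + tol and
                  y - r >= -tol and y + r <= 1 + tol)
        if r <= 0 or not inside:
            return False, 0.0                      # circle leaves the unit square
    for i, (xi, yi, ri) in enumerate(circles):
        for xj, yj, rj in circles[i + 1:]:
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return False, 0.0                  # two circles overlap
    return True, math.fsum(r for _, _, r in circles)

centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
exact = [(x, y, 0.25) for x, y in centers]
inflated = [(x, y, 0.2501) for x, y in centers]    # each radius too large by 1e-4

print(verify_packing(exact))                       # feasible, sum of radii 1.0
print(verify_packing(inflated))                    # rejected under the strict tolerance
print(verify_packing(inflated, tol=1e-3))          # accepted under a loose tolerance
```

With a tolerance of 1e-3, the inflated configuration passes and reports a sum of radii about 0.0004 higher than the true optimum for this arrangement, an error ten times larger than the 0.00004 margin claimed at $n = 26$. This is why the tolerance and objective definition need to be stated alongside any record claim at this scale.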

7.5.5 Case Study: Claude Code Bleve (79.3% → 100%)

The Claude Code Bleve case study (§14) illustrates the ASI feedback loop in action. This is the best-documented benchmark in terms of experimental setup: the documentation states single-task mode, claude-sonnet-4-20250514 for reflection, 85 evaluation calls, and 45 minutes of wall time. Starting from a baseline at 79.3%, the documentation reports the following optimization trajectory:

Iteration 12: ASI identified an edge case in Unicode handling → fix improved score to 88.1%.
Iteration 28: ASI identified a timeout in large document indexing → batch processing fix improved to 94.5%.
Iteration 41: ASI identified a race condition in concurrent search → mutex fix improved to 97.8%.
Iteration 58: ASI identified a rounding error in relevance scoring → float64 fix achieved 100%.

Each improvement step was driven by specific ASI feedback (error traces, comparison diffs) rather than generic "make it better" prompting. This illustrates the core value proposition of ASI: the reflection engine can propose targeted fixes because the diagnostic information makes failure modes explicit. Note that Claude Code Bleve appears to be a GEPA-specific benchmark; the task definition, evaluation function, and baseline implementation are not independently described outside the GEPA documentation. As an anecdotal demonstration of ASI's utility, this case study is suggestive but does not constitute a controlled ablation of ASI vs. score-only feedback.

7.5.6 Missing Ablations and Open Empirical Questions

The GEPA documentation does not report systematic ablation studies for several key design decisions. The following experiments would substantially strengthen the empirical case for GEPA's architecture:

Ablation	Question	Current Evidence
ASI vs. no-ASI	How much does structured diagnostic feedback improve mutation quality compared to score-only feedback? What is the marginal value of each ASI type (traces, errors, comparisons, images)?	No controlled ablation reported. The Claude Code Bleve case study (§14) provides anecdotal evidence that ASI-driven mutations are targeted, but no comparison to a score-only baseline is given. This is the most important missing ablation, as ASI is GEPA's primary claimed contribution.
Pareto vs. weighted-sum	Does native Pareto optimization produce better multi-objective trade-offs than weighted-sum aggregation with the same evaluation budget?	No direct comparison reported. The `Metric.weight` parameter exists in the API (§12), suggesting weighted aggregation is available, but no benchmark compares the two approaches under matched conditions.
Seedless vs. seeded initialization	How does seedless bootstrap compare to user-provided seeds in terms of final solution quality, convergence speed, and evaluation budget?	No controlled comparison reported. The ARC-AGI and circle packing results do not specify whether seeds were used. Seedless mode is documented (§8) as a capability but its relative effectiveness is not empirically characterized.
Generalization mode sensitivity	How sensitive is the train/val split to the split ratio, validation frequency, and early stopping patience? Does the optimal configuration vary across problem domains?	The intra-system comparison (generalization vs. single-task on ARC-AGI, Table 7.2) provides one data point but does not explore the hyperparameter space.
Reflection minibatch size	Is the documented default of 2–3 examples actually optimal? How does performance vary with minibatch sizes of 1, 3, 5, and 10?	The documentation (§7, callout box) claims 2–3 is "optimal" but provides no experimental evidence for this claim.
Failure bias sensitivity	Is the 70% failure bias optimal? How does varying `failure_bias` from 0.0 (uniform) to 1.0 (all failures) affect convergence?	Not reported. The 0.7 default is presented without justification beyond intuition.
LLM model sensitivity	How does GEPA's performance vary across LLM backends (e.g., GPT-4o vs. Claude Sonnet vs. Gemini)? Is ASI more valuable for some models than others?	Documentation examples use `claude-sonnet-4-20250514`. No cross-model comparison is provided. The custom LLM integration example (§12) shows that arbitrary backends can be used, but no results are reported with alternative models.

The absence of these ablations is notable because GEPA's primary claims — that ASI improves mutation quality, that Pareto optimization outperforms weighted-sum, and that generalization mode prevents overfitting — are all empirically testable hypotheses that the system is well-positioned to evaluate. Future work should prioritize the ASI ablation, as it tests the core contribution, followed by the Pareto vs. weighted-sum comparison, which tests the second major design choice.

7.6 Cost Analysis & Implementation Details

7.6.1 LLM Cost Breakdown

The GEPA documentation (§13) reports the following cost figures for each benchmark. These represent total LLM API costs (reflection + evaluation where applicable) and are self-reported by the GEPA authors.

**Table 7.5 — Self-reported LLM cost figures.** All from GEPA documentation §13. LLM provider pricing, API tier, and exact run date not specified.
Benchmark	Total LLM Cost	Eval Calls	Wall Time	Cost per pp Improvement
Claude Code Bleve	$12.50	85	45 min	$0.60/pp
ARC-AGI v1	$180.00	1,200	8 hours	$3.16/pp
AIME 2025	$95.00	400	3 hours	$7.12/pp
Circle Packing n=26	$45.00	300	2 hours	N/A
CloudCast Routing	$250.00	800	12 hours	$6.22/pp

Notes on cost figures: (1) These costs are self-reported by the GEPA authors in the project documentation §13. (2) The LLM model used for reflection is stated as claude-sonnet-4-20250514 in the documentation examples and confirmed for the Claude Code Bleve case study (§14); whether this model was used for all benchmark runs is not explicitly confirmed. (3) Exact pricing depends on the provider, date of the run, and whether batch or real-time APIs were used; none of these details are specified. (4) The "cost per percentage point improvement" metric is useful for comparison but depends heavily on the baseline — improving from 79% to 100% (easy wins first) has a different cost profile than improving from 85% to 90%. (5) Cost figures do not include compute costs for evaluation (sandbox execution, metric computation), only LLM API costs.

7.6.2 Computational Requirements

GEPA is a Python package with minimal infrastructure requirements. The core optimization loop runs on a single machine with network access to an LLM API. Evaluation parallelism is achieved via the EvaluationPipeline class (from gepa.evaluation, documented in §11) with a configurable max_workers parameter (default: 8) in EvalConfig. Sandbox isolation supports three documented levels (§11): "docker" for full container isolation, "subprocess" for process-level isolation, and "none" for direct execution of trusted code.

The content-hash cache reduces redundant computation. The documented cache_backend parameter (§11) supports "sqlite" (default), "redis", and "memory" backends. For large-scale runs, Redis is recommended in the documentation to share the cache across multiple concurrent processes.

7.6.3 Reproducibility

GEPA's content-hash caching ensures that a repeated request for the same candidate on the same task returns the stored result, so scores are stable within a run; the cached score is representative only if the evaluation function is deterministic. Full run-level reproducibility, however, also depends on the LLM's output determinism and on the stochastic elements of parent selection and minibatch sampling. The OptimizationResult object (documented in §12) includes the full configuration used (.config), evaluation history (.history), and timing statistics (.stats), providing the information needed to understand (though not necessarily reproduce) a run.

Factors limiting exact reproducibility include:

LLM output non-determinism (even at temperature 0, some providers do not guarantee identical outputs across API calls)
Evaluation function side effects or non-determinism (e.g., timing-dependent metrics, stochastic evaluation components)
Parallel evaluation ordering (non-deterministic with multiple workers, affecting which candidates are available when the next mutation is generated)
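Of the stochastic elements, those on the search side (parent selection, minibatch sampling) are the easiest to pin down. The sketch below shows the general technique of routing all such draws through a dedicated seeded generator. The function sample_run is hypothetical, and whether GEPA exposes a seed parameter is not stated in the material surveyed here. Seeding does nothing for LLM output non-determinism.

```python
import random


def sample_run(seed: int, population: list[str], steps: int = 5) -> list[str]:
    """Draw a sequence of parents from a dedicated, seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return [rng.choice(population) for _ in range(steps)]


population = ["cand_a", "cand_b", "cand_c"]
run_1 = sample_run(42, population)
run_2 = sample_run(42, population)  # same seed, same sequence of parents
```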

7.7 Comparison with Prior Systems

The following table summarizes the feature comparison. It starts from the comparison table in the GEPA documentation (§15: Comparison with Other Systems) and adjusts the entries for other systems with corrections and qualifications from earlier chapters of this survey where appropriate. "Yes/No" entries reflect whether the feature is documented as available, not whether it has been empirically validated as effective.

Feature	GEPA	AlphaEvolve (Ch. 4)	OpenEvolve (Ch. 5)	FunSearch (Ch. 3)	ShinkaEvolve (Ch. 6)
Artifact types	Any text (documented; empirically validated for code + YAML)	Code	Code	Code	Code
Declarative API	Yes (§§2, 12)	No	Partial	No	Partial
Diagnostic feedback	First-class ASI (6 types, §5)	Basic errors	Basic errors	Score only	Errors + traces
Multi-objective	Pareto frontier (§6)	Weighted sum	Weighted sum	Single objective	Single objective
Multi-task mode	Yes (cross-transfer, §4.2)	No	No	No	No
Generalization mode	Yes (train/val split, §4.3)	No	No	No	No
Seedless bootstrap	Yes (§8)	Requires seed	Requires seed	Requires seed	Optional seed
Composable stopping	AND/OR logic (§9)	Fixed	Fixed	Fixed	Configurable
Content-hash cache	Yes (§11)	Yes	Yes	Yes	Yes
Open source	Yes	No	Yes	No	Yes

GEPA's distinguishing features fall into three categories relative to the surveyed systems:

Features without direct equivalents in other surveyed systems. These are ASI as a first-class typed abstraction with six modalities (§5), native Pareto multi-objective search with crowding-distance-based selection (§6), the generalization mode with train/val splitting and overfitting detection (§4.3), and multi-artifact co-evolution with dependency tracking (§12). They represent GEPA's clearest architectural contributions.

Adapted from prior work. Content-hash caching is standard across all systems. Island models with migration appear in OpenEvolve and ShinkaEvolve. LLM-driven reflection/mutation is a shared pattern with varying degrees of sophistication. Tournament selection and elite preservation are classical EA techniques.

Absent compared to some systems. GEPA does not appear to implement MAP-Elites quality-diversity archives (used by AlphaEvolve), bandit-based model selection across multiple LLM providers (used by ShinkaEvolve), or prompt co-evolution (used by ShinkaEvolve). The additional operators it does document are LLM-guided crossover (MergeConfig, §10) and post-mutation refinement (RefinerConfig, §10).
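Since Pareto-based selection is one of the distinguishing features listed above, a compact sketch of the two underlying NSGA-II building blocks may help: non-dominated filtering and crowding distance. This is textbook NSGA-II for maximization objectives, written by the survey author for illustration; it is not GEPA's ParetoFrontier code, and the function names are hypothetical.

```python
import math


def dominates(a: tuple, b: tuple) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points: list) -> list:
    """Keep only points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def crowding_distance(front: list) -> list:
    """NSGA-II crowding distance; boundary points receive infinity."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # objective is constant on this front
        for rank in range(1, n - 1):
            i = order[rank]
            gap = front[order[rank + 1]][k] - front[order[rank - 1]][k]
            dist[i] += gap / (hi - lo)
    return dist


points = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (5.0, 1.0), (2.0, 2.0)]
front = non_dominated(points)          # (2.0, 2.0) is dominated by (2.0, 4.0)
distances = crowding_distance(front)   # larger distance = less crowded region
```

The infinite distances assigned to boundary points are what make purely proportional selection over crowding distance ill-defined without some capping or special-casing rule.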

7.8 The Optimization Loop in Detail

The following pseudocode describes the complete GEPA optimization loop as reconstructed by the survey author from the documented API (§§2, 12), configuration schema (§10), architecture description (§3), and component documentation (§§6–9, 11). This is illustrative pseudocode, not code from the GEPA documentation. The internal implementation may differ in control flow, error handling, concurrency patterns, and helper function decomposition.

The following names are documented in the GEPA API reference with import paths: optimize, Artifact, GEPAConfig, OptimizationResult, EvalResult, ParetoFrontier (from gepa.engine, §6), ReflectionEngine (from gepa.reflection, §7), and all stopping criteria (from gepa.stopping, §9). The following are not documented as standalone classes or functions and are used here as descriptive placeholders: Solution (used in the documented ParetoFrontier code but not independently importable), llm_client, get_results_for, cached_evaluate, bootstrap_seedless, refine, and compute_stats.

# ILLUSTRATIVE PSEUDOCODE — Author reconstruction from
# documented GEPA components. NOT from the GEPA documentation.
# Documented API names marked with (§N) source references.
# Helper functions (cached_evaluate, bootstrap_seedless, refine,
# compute_stats, get_results_for) and llm_client are descriptive
# placeholders, not documented API.

from typing import Callable

async def optimize(                    # optimize() documented §§2, 12
    artifact: Artifact,                # Artifact documented §12
    evaluate: Callable,
    metrics: list[Metric],             # Metric documented §12
    config: GEPAConfig,                # GEPAConfig documented §10
    tasks: list[dict] | None = None,
    stopping: StoppingCriterion | None = None,  # Stopping criteria §9
) -> OptimizationResult:               # OptimizationResult documented §12
    # 1. Initialize population
    if artifact.seed is not None:
        population = [artifact.seed]
    elif artifact.description is not None:
        population = await bootstrap_seedless(artifact, config)  # Placeholder
    else:
        raise ValueError("Artifact must have either seed or description")

    # 2. Initialize Pareto frontier and history
    # ParetoFrontier documented: from gepa.engine import ParetoFrontier (§6)
    frontier = ParetoFrontier(
        objectives=[m.name for m in metrics],
        directions=[m.direction for m in metrics],
    )
    history = []
    cache = {}  # Actual cache backend per EvalConfig.cache_backend (§11)

    # 3. Initial evaluation
    for candidate in population:
        result = await cached_evaluate(candidate, tasks, evaluate, cache)
        frontier.update(Solution(code=candidate, scores=result.scores))
        history.append(result)

    # 4. Main evolution loop
    # ReflectionEngine documented: from gepa.reflection import ReflectionEngine (§7)
    reflection_engine = ReflectionEngine(
        config=config.reflection, llm=llm_client, artifact_spec=artifact,
    )
    iteration = 0

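    # Placeholder: assumes a non-None stopping criterion; handling of the
    # stopping=None default is omitted from this sketch.
    while not stopping.should_stop(iteration, history, frontier):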
        # 4a. Select parent via documented select_parent() method (§6)
        parent = frontier.select_parent(strategy="crowding")

        # 4b. Generate mutation via reflection with ASI
        mutation = await reflection_engine.reflect(
            parent=parent,
            eval_results=get_results_for(parent, history),  # Placeholder
            history=history[-config.reflection.max_history_length:],
        )

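        # 4c. Validate mutation against artifact constraints
        # Artifact.validate() documented §12
        if not artifact.validate(mutation.code).is_valid:
            # Count rejected mutations too, so that an iteration-based
            # stopping criterion still terminates the loop.
            iteration += 1
            continue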

        # 4d. Optional post-mutation refinement (RefinerConfig documented §10)
        if config.refiner.enabled \
                and parent.primary_score > config.refiner.refinement_threshold:
            mutation = await refine(mutation, config.refiner)  # Placeholder

        # 4e. Evaluate mutation
        result = await cached_evaluate(mutation.code, tasks, evaluate, cache)
        history.append(result)

        # 4f. Update Pareto frontier via documented update() method (§6)
        frontier.update(Solution(code=mutation.code, scores=result.scores))

        iteration += 1

    # 5. Return results
    return OptimizationResult(
        best=frontier.best_by_primary_metric(metrics),  # Placeholder method
        pareto_front=frontier.solutions(),                # Placeholder method
        history=history,
        stats=compute_stats(history),                     # Placeholder
        stop_reason=stopping.reason(),                    # Placeholder
    )

Several design choices are visible in this reconstructed loop. The reflection engine receives the parent's evaluation results (including ASI), enabling targeted mutation. The cache check occurs before evaluation, preventing redundant computation. The Pareto frontier is updated incrementally after each evaluation via the documented update() method (§6). The stopping controller is consulted at each iteration with full access to the history and frontier state. The optional refinement step (controlled by RefinerConfig.enabled and RefinerConfig.refinement_threshold, default: 0.9, documented in §10) applies a post-mutation polish only to high-scoring candidates.
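The stopping controller consulted in the loop above is described in the comparison table as composable with AND/OR logic. One common way to obtain that behavior in Python is operator overloading, sketched below. All class names and the simplified should_stop(iteration, best_score) signature are hypothetical and do not reproduce the gepa.stopping API.

```python
from dataclasses import dataclass


class Criterion:
    """Hypothetical base class; not GEPA's documented stopping API."""

    def should_stop(self, iteration: int, best_score: float) -> bool:
        raise NotImplementedError

    def __and__(self, other: "Criterion") -> "Criterion":
        return Combined(self, other, all)  # stop only if both agree

    def __or__(self, other: "Criterion") -> "Criterion":
        return Combined(self, other, any)  # stop if either fires


@dataclass
class MaxIterations(Criterion):
    limit: int

    def should_stop(self, iteration: int, best_score: float) -> bool:
        return iteration >= self.limit


@dataclass
class TargetScore(Criterion):
    target: float

    def should_stop(self, iteration: int, best_score: float) -> bool:
        return best_score >= self.target


class Combined(Criterion):
    """Combines two criteria with `all` (AND) or `any` (OR)."""

    def __init__(self, left: Criterion, right: Criterion, reducer) -> None:
        self.left, self.right, self.reducer = left, right, reducer

    def should_stop(self, iteration: int, best_score: float) -> bool:
        return self.reducer([
            self.left.should_stop(iteration, best_score),
            self.right.should_stop(iteration, best_score),
        ])


stop_either = MaxIterations(100) | TargetScore(0.95)
stop_both = MaxIterations(10) & TargetScore(0.95)
```

Because Combined is itself a Criterion, combinations nest arbitrarily, for example (a & b) | c.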

7.9 Multi-Artifact Co-Evolution

A less-discussed but architecturally significant feature is GEPA's support for co-evolving multiple artifacts simultaneously with explicit dependency tracking. The co_optimize() function is documented in the API reference (§12: Advanced: Multi-Artifact Co-Evolution) as accepting a list of artifacts, an evaluation function, metrics, and a dependency graph specifying which artifacts depend on which others.

For example, a pipeline consisting of a system prompt and a post-processing function can be co-evolved with the constraint that the post-processor depends on the prompt (changes to the prompt may require compensating changes in the post-processor). The evaluation function receives all artifacts and returns a joint score. The following code example is reproduced verbatim from the GEPA documentation (§12: Advanced: Multi-Artifact Co-Evolution), with only minor formatting changes:

# Verbatim from GEPA docs §12 (Advanced: Multi-Artifact Co-Evolution).
# Import path (from gepa import co_optimize) and constructor
# signatures (Artifact, Metric) are documented in the API reference.
# eval_pipeline is a user-defined function, not part of the GEPA API.
from gepa import co_optimize, Artifact, Metric

# Co-evolve a prompt and a post-processor together
prompt_artifact = Artifact(
    name="system_prompt",
    template="You are a helpful assistant. {{INSTRUCTIONS}}",
    language="text",
)

processor_artifact = Artifact(
    name="output_processor",
    template="def process(raw_output: str) -> str:\n    {{BODY}}",
    language="python",
)

result = co_optimize(
    artifacts=[prompt_artifact, processor_artifact],
    evaluate=eval_pipeline,  # Receives both artifacts, returns joint score
    metrics=[Metric("quality", direction="maximize")],
    dependencies={
        "output_processor": ["system_prompt"],  # Processor depends on prompt
    },
    max_iterations=50,
)

# Access per-artifact results
print(f"Best prompt: {result.artifacts['system_prompt'].best.code}")
print(f"Best processor: {result.artifacts['output_processor'].best.code}")

Co-evolution is particularly relevant for prompt + code pipelines, where the prompt and the downstream processing logic must be jointly optimized. The internal mechanism by which GEPA schedules mutations across dependent artifacts (whether it mutates one artifact at a time while holding others fixed, or mutates multiple artifacts simultaneously) is not documented. The result.artifacts dictionary access pattern appears in the documented example above, but the .artifacts field is not listed in the OptimizationResult class definition (§12), suggesting it may be specific to the co-evolution result type. This capability has no equivalent in the other systems surveyed in this volume.
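Because the scheduling mechanism is undocumented, the following is only one plausible reading of what a dependency graph enables: visiting artifacts in topological order so that each artifact is revised after the artifacts it depends on. The sketch uses Python's standard graphlib module; mutation_order is a hypothetical helper, and nothing here is claimed to match GEPA's internals.

```python
from graphlib import TopologicalSorter

# Same shape as the `dependencies` argument in the documented example:
# each key depends on the artifacts listed in its value.
dependencies = {"output_processor": ["system_prompt"]}
all_artifacts = ["system_prompt", "output_processor"]


def mutation_order(artifacts: list[str], deps: dict[str, list[str]]) -> list[str]:
    """Return artifact names with every dependency ahead of its dependents."""
    graph = {name: deps.get(name, []) for name in artifacts}
    # static_order() raises graphlib.CycleError if the dependencies are cyclic.
    return list(TopologicalSorter(graph).static_order())


order = mutation_order(all_artifacts, dependencies)
```

Under this assumed schedule the prompt would be mutated first and the post-processor adapted afterwards.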

7.10 Limitations & Discussion

7.10.1 Acknowledged Limitations

The GEPA authors identify five limitations in their documentation (§16):

LLM dependency. The quality of mutations is bounded by the capability of the underlying LLM. Weaker models produce weaker mutations, and the system provides no mechanism to compensate for a fundamentally incapable model.
Cost. Multi-task and generalization modes multiply the evaluation budget by the number of tasks. The ARC-AGI run cost $180 for 1,200 evaluations — a modest cost for a research experiment but potentially prohibitive at scale.
ASI design burden. Effective ASI requires domain-specific instrumentation of the evaluation function. Users must think carefully about what diagnostic information to expose, and poorly designed ASI can mislead the mutation engine.
Pareto scalability. With more than 4–5 objectives, the Pareto frontier grows exponentially and most solutions become non-dominated (the "curse of dimensionality" for multi-objective optimization). This is a well-known limitation of Pareto-based EMOA approaches.
Reflection context window. Very large candidates or extensive ASI can exceed LLM context limits, requiring truncation that may lose important diagnostic information.
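The Pareto scalability limitation is easy to observe empirically. The sketch below, written for this survey and independent of GEPA, draws uniformly random score vectors and measures which fraction is non-dominated as the number of objectives grows. The sample size, the seed, and the uniform distribution are arbitrary assumptions.

```python
import random


def dominates(a: tuple, b: tuple) -> bool:
    """Maximization dominance: no worse everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fraction(num_objectives: int, num_points: int = 200,
                           seed: int = 0) -> float:
    """Fraction of uniformly random points that lie on the Pareto front."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(num_objectives))
              for _ in range(num_points)]
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return len(front) / num_points


fractions = {m: non_dominated_fraction(m) for m in (2, 4, 8)}
```

As the fraction approaches one, dominance alone stops discriminating between candidates and selection has to rely on secondary criteria such as crowding distance.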

7.10.2 Additional Observations

Beyond the authors' own assessment, several additional limitations merit discussion:

Comparison methodology. The benchmark comparisons in Section 7.5 compare GEPA against AlphaEvolve, OpenEvolve, and FunSearch using figures drawn from the GEPA documentation's comparison table (§13), which attributes figures to those systems without specifying whether they come from head-to-head experiments or from separate publications. It is not established that all systems used identical LLM backends, identical evaluation functions, identical seed solutions, or identical computational budgets. These cross-paper comparisons are labeled as indicative in Table 7.3 and should not be interpreted as definitive rankings.

Artifact-agnosticism: documented scope vs. empirical evidence. GEPA's framework architecture is genuinely artifact-agnostic — the Artifact class (§12), evaluation pipeline (§11), and reflection engine (§7) impose no structural constraints on the candidate format. However, the empirical evidence supporting the "optimize anything" thesis is concentrated in two artifact categories: (1) code — Python functions for four of five reported benchmarks (Claude Code Bleve, ARC-AGI, AIME 2025, circle packing), and (2) code-adjacent structured text — YAML routing rules for CloudCast. The documentation (§1) lists six categories of artifact types as supported (see §7.1.1), but benchmark results are reported only for code and YAML. Whether ASI feedback, reflection prompting, and the seedless bootstrap are equally effective for non-code artifacts remains an open empirical question. Researchers applying GEPA to domains beyond code should be aware that the strongest evidence base is for code optimization, and that the effectiveness of ASI design patterns may vary significantly across artifact types.

Absence of quality-diversity mechanisms. GEPA uses Pareto optimization for multi-objective diversity but does not implement MAP-Elites-style behavioral characterization. For problems where solution diversity in a behavioral descriptor space (as opposed to objective space) is important, this may be a limitation compared to AlphaEvolve's approach.

Missing ablation evidence. As detailed in Section 7.5.6, the documentation lacks controlled ablations for ASI vs. no-ASI, Pareto vs. weighted-sum, seedless vs. seeded initialization, and other key design choices. Without these ablations, the causal contribution of each component to GEPA's reported performance cannot be isolated.

Documentation vs. implementation gaps. Several items in the documentation have internal inconsistencies or underspecified behavior (see Table 7.0): the proportional selection with infinite crowding distances (§7.3.4), the Solution type used in documented code but not independently defined, and the result.artifacts dictionary in the co-evolution example (§7.9) that does not appear in the main OptimizationResult class definition. These are minor issues that suggest the documentation may describe a slightly idealized or in-progress version of the API.

7.10.3 Future Directions

The GEPA authors outline six future directions (§16): Auto-ASI (automatic evaluation instrumentation), hierarchical evolution (co-evolving meta-strategies alongside artifacts), distributed search across multiple machines, interactive human-in-the-loop mode, transfer learning from previously solved problems, and formal verification integration for safety-critical applications. Of these, Auto-ASI addresses the most significant current limitation (the ASI design burden) and would substantially lower the barrier to effective use.

7.11 Research Contribution Analysis

GEPA's contribution to the LLM-driven evolutionary optimization landscape can be evaluated along three dimensions: what is genuinely novel, what is an effective synthesis of existing ideas, and what impact it may have on future work.

Novelty

The most novel contribution is the ASI abstraction. While prior systems pass error messages or basic feedback to the LLM mutation operator, GEPA formalizes this as a typed system with six distinct feedback modalities (§5), each with specific semantics for how the reflection engine should use them. The strength of this contribution rests on the hypothesis that structured diagnostic feedback materially improves mutation quality — a hypothesis that is architecturally well-supported but not yet empirically validated through controlled ablations (see Section 7.5.6). The generalization mode with train/val splitting and overfitting detection (§4.3) is also novel in this space — it imports a standard machine learning practice into evolutionary optimization. Multi-artifact co-evolution with explicit dependency graphs (§12) extends the single-artifact optimization paradigm.

Synthesis

GEPA effectively synthesizes several established techniques: Pareto frontier maintenance with crowding distance from NSGA-II (Deb et al., 2002), hypervolume-based quality assessment from the EMOA literature, content-hash caching from FunSearch/AlphaEvolve, and LLM-driven mutation from the broader program synthesis community. The contribution is not any individual technique but their integration into a coherent, artifact-agnostic framework with a clean declarative API.
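To make the hypervolume indicator concrete, the sketch below computes it for the two-objective maximization case with a simple sweep. It is a textbook construction written for this survey, not GEPA code; the function name and the default reference point (0, 0) are assumptions, and all points are assumed to lie at or above the reference point.

```python
def hypervolume_2d(front: list[tuple[float, float]],
                   reference: tuple[float, float] = (0.0, 0.0)) -> float:
    """Area dominated by a 2-objective maximization front, relative to `reference`."""
    ref_x, ref_y = reference
    area = 0.0
    prev_y = ref_y
    # Sweep from the largest x to the smallest; each point that raises the
    # y level contributes a horizontal strip between the old and new level.
    for x, y in sorted(front, key=lambda p: p[0], reverse=True):
        if y > prev_y:
            area += (x - ref_x) * (y - prev_y)
            prev_y = y
    return area


hv = hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)])
```

For the three points shown, the dominated region is the union of three rectangles anchored at the origin, with strips of area 3, 2, and 1. A growing hypervolume over iterations indicates that the frontier is advancing, widening, or both.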

Potential Impact

GEPA's declarative API lowers the barrier to entry for LLM-driven evolutionary optimization. A researcher who wants to optimize a prompt, a configuration file, or a scoring function can do so by writing an evaluation function and calling optimize(), without understanding population management, selection strategies, or mutation operators. This democratization effect — making evolutionary search accessible to domain experts who are not evolutionary computation specialists — may be GEPA's most significant practical contribution, though its full realization depends on how well the framework performs on artifact types beyond code (see Section 7.10.2). The open empirical questions identified in Section 7.5.6 represent clear opportunities for follow-on research that could strengthen or qualify these architectural claims.

7.12 Fitness Landscape and Search Dynamics

GEPA's approach to navigating the fitness landscape differs fundamentally from classical evolutionary algorithms because the mutation operator is not a random perturbation but an informed LLM-guided transformation. This changes the search dynamics in several ways that are worth characterizing formally. The formalization below is the survey author's analysis, not a claim made in the GEPA documentation.

In a classical $(\mu + \lambda)$ evolution strategy, the mutation operator $M$ maps a parent $c$ to a child $c'$ via a random perturbation: $c' = M(c, \epsilon)$ where $\epsilon$ is drawn from some distribution (e.g., Gaussian). The mutation is blind to the fitness landscape. In GEPA, the mutation operator is conditioned on the ASI feedback $\alpha$ and the optimization history $H$:

$$c' = M_{\text{LLM}}(c, \alpha, H, \theta)$$

where $c$ is the parent candidate, $\alpha \in \mathcal{A}^*$ is the ASI record from the parent's evaluation (as defined in the notation box in Section 7.3), $H = \{(c_i, \mathbf{s}_i, \alpha_i)\}_{i=1}^{t}$ is the optimization history up to iteration $t$ (each entry containing a candidate, its $m$-dimensional score vector $\mathbf{s}_i \in \mathbb{R}^m$, and its ASI), and $\theta$ represents the LLM configuration (model choice, temperature, prompt template). The LLM can thus be viewed, loosely, as a learned gradient estimator that uses diagnostic feedback to propose directions of improvement in the space of text artifacts.

This has a conceptual analogy to gradient-based optimization: ASI provides a form of "derivative information" about the fitness function, and the LLM uses this information to propose a step in a promising direction. The key difference is that this "gradient" operates over discrete, structured artifacts (code, text) rather than continuous parameter vectors, and the LLM integrates multiple types of feedback (traces, errors, comparisons) that have no direct analogue in continuous optimization. This analogy is suggestive rather than formal — the LLM mutation operator has no provable convergence guarantees analogous to gradient descent.
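For contrast with the feedback-conditioned operator, the classical blind mutation $c' = M(c, \epsilon)$ can be sketched as a small $(\mu + \lambda)$ evolution strategy. The toy objective (the sphere function, minimized), the population sizes, and the fixed step size are illustrative assumptions unrelated to GEPA.

```python
import random


def sphere(x: list[float]) -> float:
    """Toy fitness to minimize: squared distance from the origin."""
    return sum(v * v for v in x)


def mu_plus_lambda(mu: int = 5, lam: int = 20, dim: int = 3, sigma: float = 0.3,
                   generations: int = 30, seed: int = 0) -> tuple[float, float]:
    """Run a (mu + lambda) ES; return (initial best, final best) fitness."""
    rng = random.Random(seed)
    parents = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(mu)]
    initial_best = min(sphere(p) for p in parents)
    for _ in range(generations):
        children = []
        for _ in range(lam):
            base = rng.choice(parents)
            # Blind mutation: Gaussian noise that ignores why `base` scored as it did.
            children.append([v + rng.gauss(0.0, sigma) for v in base])
        # "Plus" selection: parents compete with children, so the best never worsens.
        parents = sorted(parents + children, key=sphere)[:mu]
    return initial_best, sphere(parents[0])


initial_best, final_best = mu_plus_lambda()
```

The only information flowing from evaluation back to variation here is the survival of better individuals; in GEPA's formulation the ASI record and the history additionally condition the proposal itself.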

The failure bias in minibatch sampling (default: 70% probability of sampling failure cases, via the documented failure_bias parameter in ReflectionConfig, §§7, 10) biases the search toward regions of the fitness landscape where the current solution performs worst. This is analogous to hard-example mining in machine learning: the system focuses its improvement effort on the cases with the most room for improvement.
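A failure-biased sampler of this kind can be sketched as follows. The interpretation is an assumption: each minibatch slot is drawn from the failing tasks with probability failure_bias and from the passing tasks otherwise, with replacement. The function name and signature are hypothetical and are not taken from ReflectionConfig.

```python
import random
from typing import Optional


def sample_minibatch(passed: dict[str, bool], size: int,
                     failure_bias: float = 0.7,
                     seed: Optional[int] = None) -> list[str]:
    """Sample task ids with replacement, favoring tasks the candidate failed."""
    rng = random.Random(seed)
    failures = [task for task, ok in passed.items() if not ok]
    successes = [task for task, ok in passed.items() if ok]
    if not failures and not successes:
        return []
    batch = []
    for _ in range(size):
        want_failure = rng.random() < failure_bias
        # Fall back to the other pool when the preferred one is empty.
        if (want_failure and failures) or not successes:
            batch.append(rng.choice(failures))
        else:
            batch.append(rng.choice(successes))
    return batch


# Two failing tasks out of ten, yet failures fill roughly 70% of the batch.
passed = {f"task_{i}": i >= 2 for i in range(10)}
batch = sample_minibatch(passed, size=1000, seed=0)
failure_share = sum(not passed[t] for t in batch) / len(batch)
```

Keeping a non-zero share of passing cases in the batch gives the reflection step some evidence of what already works, which plausibly reduces regressions.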

Chapter Summary

Key takeaway: GEPA demonstrates that a declarative, artifact-agnostic framework with structured diagnostic feedback (ASI) can achieve competitive results across multiple domains — code, mathematics, geometric optimization, and infrastructure routing — without domain-specific architectural commitments. The reported results are promising but rely on self-reported benchmarks with limited cross-system controls and no published ablation studies. The strongest empirical evidence is for code and code-adjacent text artifacts; the broader "optimize anything" generality is architecturally supported but less extensively validated.

Main contribution: The formalization of Actionable Side Information (ASI) as a first-class typed abstraction that bridges evaluation and mutation, combined with native Pareto multi-objective optimization, multi-task cross-transfer, and a generalization mode with validation-based model selection and overfitting detection. These features collectively move LLM-driven evolutionary optimization from a code-specialized technique toward a general-purpose optimization paradigm.

Provenance summary: The vast majority of API classes, import paths, constructor signatures, configuration parameters, and benchmark figures cited in this chapter are directly documented in the GEPA documentation (§§1–16). See Table 7.0 for the full provenance mapping. The survey author's contributions are: (a) the mathematical formalizations in §§7.3–7.4, which render the documented API contracts in standard EMOA notation; (b) the reconstructed optimization loop in §7.8, which assembles documented components into illustrative pseudocode; (c) the evidence status assessment in §7.5, which evaluates reproducibility of each benchmark; and (d) the search dynamics analysis in §7.12.

What a researcher should know: GEPA's power comes from the quality of the evaluation function and its ASI instrumentation. The system is only as good as the diagnostic feedback it receives. Researchers planning to use GEPA should invest heavily in ASI design — this is where the largest returns likely lie. The generalization mode should be preferred over single-task optimization for any problem where overfitting is a concern, even at the cost of additional evaluation calls. Critical open questions remain around the marginal value of ASI (vs. score-only feedback), Pareto vs. weighted-sum performance, and seedless vs. seeded initialization effectiveness — see Section 7.5.6 for the full list of missing ablations. Cross-system comparisons in Table 7.3 are indicative only; no matched-condition head-to-head evaluations are available.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}