Score8.62/10 — Final

Chapter 1

Introduction: The Rise of LLM-Powered Algorithm Discovery

Part P01: Introduction & Framework

1.1 Overview and Motivation

For decades, the design of algorithms has been a fundamentally human endeavor. Researchers study a problem's structure, draw on mathematical intuition, and iteratively refine candidate solutions until a satisfactory procedure emerges. This process is slow, expensive, and constrained by the cognitive bandwidth of individual experts. A sorting algorithm that took years to discover by hand might be improved in hours by an automated system capable of exploring a vastly larger space of program variants. The central question motivating this survey is: Can large language models serve as the mutation engine inside an evolutionary loop, enabling the automated discovery of novel algorithms and heuristics at superhuman scale?

Between 2024 and 2026, a new class of systems answered this question affirmatively. Beginning with DeepMind's FunSearch (paper: Romera-Paredes et al., Nature, 2024) and culminating in AlphaEvolve (blog/report: DeepMind, 2025), a series of research platforms demonstrated that LLMs — when embedded within structured evolutionary frameworks — can discover programs that exceed human-designed baselines on combinatorial optimization, mathematical reasoning, and engineering problems. These systems do not merely apply LLMs as code generators; they treat the LLM as a variation operator within a population-based search, combining the model's broad code understanding with the selection pressure of an automated evaluator.

This survey examines 17 systems published or released between January 2024 and early 2026 that share this core paradigm. We refer to the paradigm as LLM-powered evolutionary program search: an iterative process in which (1) a population of candidate programs is maintained, (2) an LLM proposes modifications to selected parents, (3) an automated evaluator scores the resulting offspring, and (4) selection pressure drives the population toward higher-fitness regions of program space. The paradigm sits at the intersection of three mature fields — evolutionary computation, program synthesis, and large language models — and its rapid emergence reflects the convergence of capabilities that none of these fields could achieve alone.

Survey Contribution. This book provides the first comprehensive, technically detailed survey of LLM-powered evolutionary program search systems spanning 2024–2026. It offers unified mathematical formalizations, architecture comparisons, reproducibility assessments, and a taxonomy of design decisions across 17 systems, enabling researchers to understand not only what each system does but why specific design choices were made and how they compare.

1.1.1 Why Now? The Convergence of Three Capabilities

The timing of this paradigm's emergence is not accidental. Three independent capability curves crossed critical thresholds nearly simultaneously:

LLM code competence. Models such as GPT-4 (paper: OpenAI, 2023), Claude 3 (report: Anthropic, 2024), and Gemini 1.5 (report: Google, 2024) achieved sufficiently high pass rates on code generation benchmarks (HumanEval, MBPP, SWE-bench) that their outputs could serve as meaningful variation operators. A mutation operator that produces syntactically invalid or semantically nonsensical code most of the time is useless inside an evolutionary loop; the compile rate and functional correctness rate must be high enough that selection pressure can act on a meaningful fraction of offspring.

Scalable evaluation infrastructure. Sandboxed code execution environments — containerized runtimes, restricted subprocesses, and cloud-based evaluation services — matured to the point where thousands of candidate programs could be safely executed and scored per hour. Without fast, automated evaluation, the evolutionary loop bottlenecks at fitness assessment.

Evolutionary computation theory. Decades of work on quality-diversity algorithms (MAP-Elites; paper: Mouret and Clune, 2015), island-model parallelism (paper: Whitley et al., 1999), and adaptive operator selection (paper: Fialho et al., 2008) provided the algorithmic scaffolding needed to structure an effective search. The LLM replaces or augments the traditional variation operators (crossover, mutation), but the surrounding evolutionary infrastructure — selection, population management, diversity maintenance — draws directly on established theory.

1.1.2 The Promise: From Manual Algorithm Design to Automated Discovery

Consider the problem of online bin packing — assigning items of varying sizes to bins of fixed capacity as each item arrives sequentially, without knowledge of future items. Because items must be placed immediately upon arrival, only online heuristics are applicable: algorithms like First Fit (place the item in the first bin with sufficient remaining capacity) and Best Fit (place the item in the bin where it fits most tightly) have been known for decades and remain standard baselines for this setting. FunSearch demonstrated that an LLM-evolutionary system could discover online bin-packing heuristics that outperform these classic online algorithms on specific distribution families, achieving lower waste ratios than any previously published online heuristic (paper: Romera-Paredes et al., 2024, Table 1 and Extended Data). The discovered heuristic was not a trivial recombination of known strategies; its structure contained novel conditional logic that no human had previously proposed.

Note on offline vs. online baselines. Offline algorithms such as First Fit Decreasing (FFD), which sort all items by size before packing, are not applicable in the online setting because they require advance knowledge of all items. FunSearch's contribution was specifically to the online setting, where the algorithm must commit to a bin assignment for each item before seeing subsequent items. Comparisons to offline optima are useful as lower bounds on achievable waste, but the performance gains reported by FunSearch are measured against online baselines.

This example illustrates the core promise: LLM-powered evolutionary search can navigate program spaces too large for exhaustive enumeration and too complex for gradient-based optimization, discovering genuinely novel solutions that advance the state of the art. The approach is domain-agnostic in principle — the same evolutionary loop can target sorting algorithms, mathematical conjectures, scheduling heuristics, or neural network architectures, provided an automated evaluator can score candidate programs.

1.2 Historical Context: From Genetic Programming to LLM-Driven Search

1.2.1 Classical Evolutionary Computation

Evolutionary computation (EC) encompasses a family of optimization meta-heuristics inspired by biological evolution. The major branches — genetic algorithms (paper: Holland, 1975), evolution strategies (paper: Rechenberg, 1973; Schwefel, 1977), evolutionary programming (paper: Fogel et al., 1966), and genetic programming (paper: Koza, 1992) — all share the same abstract loop: maintain a population, apply variation operators, evaluate fitness, and select survivors. The critical differences lie in the representation (bit strings, real vectors, tree-structured programs) and the variation operators (bitflip mutation, Gaussian perturbation, subtree crossover).

Genetic programming (GP) is the most directly relevant ancestor of the systems surveyed here. In GP, candidate solutions are programs — typically represented as abstract syntax trees (ASTs) — and variation operators manipulate program structure. Subtree crossover swaps subtrees between two parent programs; subtree mutation replaces a randomly selected subtree with a newly generated one. GP achieved notable successes in symbolic regression, circuit design, and game strategy evolution (paper: Koza, 1994; Poli et al., 2008), but suffered from fundamental limitations:

Bloat: Programs grow in size without corresponding fitness improvement, slowing evaluation and obscuring functional structure (paper: Langdon and Poli, 1997).
Brittleness of random operators: Random subtree mutation rarely produces semantically meaningful changes, leading to very low rates of beneficial mutation. Empirical studies of standard GP report beneficial mutation rates below 1% on most benchmark problems (paper: Banzhaf et al., 1998, Ch. 8; author synthesis: estimates vary by representation and problem, but 0.01–1% is a commonly cited range in the GP literature).
Representation gap: AST-based representations poorly capture the idioms of modern programming languages, limiting applicability to real-world software.
Scalability: Evolving programs beyond a few hundred lines remains impractical with classical GP operators.

1.2.2 The LLM Breakthrough: Semantic Variation Operators

Large language models fundamentally change the variation operator. Instead of random structural perturbation, an LLM can propose semantically meaningful modifications to a program. Given a parent program and a prompt describing the desired change (or simply asking for improvement), the LLM leverages its training on billions of lines of code to produce offspring that are syntactically valid, semantically coherent, and often functionally distinct from the parent. This transforms the mutation operator from a random walk in syntax space to a guided search in semantic space.

The key insight, first operationalized at scale by FunSearch, is that the LLM need not produce correct solutions directly. It needs only to produce better-than-random variations. An LLM that generates a beneficial mutation even 5–10% of the time is dramatically more effective than random subtree mutation, which may produce beneficial changes less than 0.1% of the time. The evolutionary framework provides the selection pressure; the LLM provides the variation quality.

We can express this core advantage as an inequality on beneficial mutation probabilities:

$$P(\text{beneficial} \mid \text{LLM mutation}) \gg P(\text{beneficial} \mid \text{random mutation})$$

where a beneficial mutation is one that produces an offspring with fitness strictly greater than its parent. This inequality is the fundamental reason the paradigm works: if LLM-generated variations were no better than random perturbations, the evolutionary loop would gain nothing from the LLM's computational expense.

Empirically, the systems surveyed in this book report beneficial mutation rates that vary widely depending on the task, model, and prompt design. FunSearch reports that a meaningful fraction of LLM-generated candidates compile and improve upon their parents (paper: Romera-Paredes et al., 2024, Section 3). Open-source systems such as OpenEvolve and GEPA report compile rates of 40–80% and fitness-improvement rates of 2–30% across their benchmark tasks (repository/README: OpenEvolve docs; paper: GEPA, 2025). These figures represent an author synthesis across reported results; individual rates depend heavily on problem difficulty, prompt engineering, and model capability. For comparison, classical GP beneficial mutation rates are typically estimated below 1% (paper: Banzhaf et al., 1998), placing LLM-based mutation one to two orders of magnitude higher.

1.2.3 Timeline of Key Systems (2024–2026)

Date	System	Organization	Key Contribution
Jan 2024	FunSearch	DeepMind	First LLM-evolutionary system to discover novel mathematical results (cap sets, bin packing)
Mid 2024	ELM / ReEvo	Various	Open-source reimplementations and extensions of the FunSearch paradigm
Late 2024	OpenELM / HumanEval-Evolve	Community	Benchmark-focused evolutionary code generation
Early 2025	AlphaEvolve	DeepMind	Multi-model MAP-Elites with cascaded evaluation; improvements to core Google infrastructure
Mid 2025	OpenEvolve	Community	Open-source recreation of AlphaEvolve's core architecture
2025	GEPA	Research	Multi-objective Pareto-based evolutionary program search with ASI scoring
2025–2026	OmniEvolve	Research	Unified platform synthesizing design patterns from 17 surveyed systems

1.3 The Paradigm: LLM-Powered Evolutionary Program Search

1.3.1 Formal Problem Statement

We formalize the problem as follows. Let $\mathcal{P}$ denote the space of all syntactically valid programs in a target language (typically Python). Let $f: \mathcal{P} \rightarrow \mathbb{R}$ be a fitness function that maps each program to a scalar score, where higher values indicate better solutions. The fitness function encapsulates the problem specification: for a sorting algorithm, $f$ might measure correctness and speed; for a bin-packing heuristic, $f$ might measure the negative waste ratio across a test distribution.

The goal is to find:

$$p^* = \arg\max_{p \in \mathcal{P}} f(p)$$

That is, we seek the program $p^*$ with the highest fitness score across the entire program space. Since $\mathcal{P}$ is countably infinite (the set of all valid programs) and $f$ is generally non-differentiable (small syntactic changes can produce large fitness differences) and expensive to evaluate (requiring program execution), direct optimization is intractable. We cannot enumerate $\mathcal{P}$, and we cannot compute gradients of $f$. Evolutionary search addresses this by maintaining a finite population $\mathbf{P}_t = \{p_1, p_2, \ldots, p_N\}$ at generation $t$ and iteratively improving it through variation and selection — a strategy that requires only the ability to evaluate $f$ at sampled points, not to differentiate or invert it.

1.3.2 The Evolutionary Loop

Every system surveyed in this book implements some variant of the following loop:

$$\mathbf{P}_{t+1} = \text{Select}\Big(\mathbf{P}_t \cup \big\{\text{Evaluate}\big(\text{LLM-Mutate}(p_i)\big) \mid p_i \in \text{Parents}(\mathbf{P}_t)\big\}\Big)$$

This equation reads as follows: to produce the next generation $\mathbf{P}_{t+1}$, we first select parent programs from the current population, generate offspring by applying LLM-based mutation to each parent, evaluate the offspring's fitness, and then apply a selection operator to the combined pool of current population members and evaluated offspring. The selection operator determines which individuals survive into the next generation, implementing the "survival of the fittest" pressure that drives the search toward better programs.

The components are:

$\mathbf{P}_t$ is the population at generation $t$, a set of programs with associated fitness scores.
$\text{Parents}(\mathbf{P}_t)$ selects one or more parent programs from the current population, typically biased toward higher-fitness individuals.
$\text{LLM-Mutate}(p_i)$ uses a large language model to generate a modified version of parent $p_i$.
$\text{Evaluate}(\cdot)$ executes the offspring program and computes its fitness score $f(\cdot)$.
$\text{Select}(\cdot)$ determines which programs survive into the next generation.

The systems differ in how each of these components is implemented: the population structure (flat pool, island model, MAP-Elites archive), the parent selection strategy (fitness-proportionate, tournament, power-law), the LLM mutation mechanism (single-model, multi-model, prompt-evolved), the evaluation pipeline (single-stage, cascaded, multi-objective), and the selection/archival policy (elitist, quality-diversity, Pareto-based).

1.3.3 Illustrative Pseudocode

The following Python pseudocode captures the minimal evolutionary loop shared by all surveyed systems. Individual chapters detail how each system extends, modifies, or replaces components of this template.

from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate program with its evaluated fitness."""
    source_code: str
    fitness: float

def evolutionary_program_search(
    seed_program: str,
    fitness_fn,          # Callable[[str], float]
    llm_mutate,          # Callable[[str, list[str]], str]
    select_parents,      # Callable[[list[Candidate]], list[Candidate]]
    max_evaluations: int = 10_000,
) -> Candidate:
    """Core loop of LLM-powered evolutionary program search."""
    # Initialize population with the seed
    initial_fitness = fitness_fn(seed_program)
    population = [Candidate(seed_program, initial_fitness)]
    best = population[0]

    for step in range(max_evaluations):
        # 1. Select parents (e.g., tournament selection)
        parents = select_parents(population)

        # 2. LLM generates a mutated offspring
        parent_sources = [p.source_code for p in parents]
        offspring_code = llm_mutate(parent_sources[0], parent_sources[1:])

        # 3. Evaluate the offspring
        try:
            offspring_fitness = fitness_fn(offspring_code)
        except (SyntaxError, TimeoutError, RuntimeError):
            continue  # Discard non-viable offspring

        offspring = Candidate(offspring_code, offspring_fitness)

        # 4. Update population (simple elitist strategy)
        population.append(offspring)
        population.sort(key=lambda c: c.fitness, reverse=True)
        population = population[:100]  # Keep top-N

        # Track best
        if offspring.fitness > best.fitness:
            best = offspring

    return best

This pseudocode omits many features present in real systems — island models, cascaded evaluation, diversity maintenance, prompt evolution, multi-model selection — but it captures the essential loop. Each subsequent chapter will show how a specific system extends this skeleton.

1.4 Key Design Dimensions

Through our analysis of 17 systems, we identified seven primary design dimensions along which these systems vary. These dimensions form a taxonomy that structures the remainder of this survey.

1.4.1 Population Structure

The simplest structure is a flat pool: a single, unstructured collection of candidates ranked by fitness. FunSearch introduced island models to this paradigm, partitioning the population into semi-isolated subpopulations (islands) with periodic migration of elite individuals between them. AlphaEvolve employs MAP-Elites (paper: Mouret and Clune, 2015), a quality-diversity algorithm that maintains a grid-structured archive indexed by behavioral descriptors, ensuring the population covers diverse regions of solution space rather than collapsing to a single mode.

The choice of population structure affects exploration-exploitation balance. Island models promote diversity through isolation and periodic injection of foreign genetic material. MAP-Elites provides a stronger diversity guarantee by explicitly partitioning the behavioral space. Flat pools are simplest but most prone to premature convergence.

Formally, a MAP-Elites archive is a function from a discretized behavioral descriptor space to programs:

$$\mathcal{A}: \mathcal{D}_1 \times \mathcal{D}_2 \times \cdots \times \mathcal{D}_k \rightarrow \mathcal{P} \cup \{\emptyset\}$$

Here $\mathcal{D}_i$ is the discretized range of the $i$-th behavioral descriptor (e.g., program length, cyclomatic complexity, or a task-specific feature), $k$ is the number of descriptor dimensions, and $\mathcal{P}$ is the program space. Each cell in the $k$-dimensional grid stores the highest-fitness program observed for that descriptor combination, or $\emptyset$ if no program has yet been placed there. The archive thus guarantees that every occupied cell contains the best program found with that particular behavioral profile — maintaining diversity across the descriptor space while still applying fitness-based selection within each cell. The total number of cells is $\prod_{i=1}^{k} |\mathcal{D}_i|$, which determines the archive's capacity and the granularity of diversity maintenance.

1.4.2 Parent Selection

Parent selection determines which programs are chosen as the basis for LLM mutation. Common strategies include:

Fitness-proportionate: Probability of selection proportional to fitness. Simple but susceptible to premature convergence when one individual dominates.
Tournament selection: Randomly sample $k$ individuals; select the fittest. The tournament size $k$ controls selection pressure.
Power-law sampling: Rank candidates by fitness and sample with probability proportional to $\text{rank}^{-\alpha}$ for exponent $\alpha > 0$. This strongly favors elite candidates while maintaining non-zero probability of selecting any individual.

Tournament selection analysis. For tournament selection with tournament size $k$ drawn from a population of $N$ individuals with replacement, we derive the probability that the individual ranked $r$ (where rank 1 is best, rank $N$ is worst) wins the tournament. The individual with rank $r$ wins if and only if it is the best-ranked individual among the $k$ drawn — equivalently, all $k$ drawn individuals have rank $\geq r$ (i.e., rank $r$ or worse), but not all have rank $\geq r + 1$ (i.e., rank $r$ itself must appear at least once). Since each draw is independent and uniform over the $N$ individuals:

$$P(\text{win} \mid \text{rank} = r) = \left(\frac{N - r + 1}{N}\right)^k - \left(\frac{N - r}{N}\right)^k$$

The first term counts the probability that all $k$ drawn individuals have rank $\geq r$ (there are $N - r + 1$ such individuals: those ranked $r, r{+}1, \ldots, N$). The second term subtracts the probability that all $k$ drawn individuals have rank $\geq r + 1$, which would exclude rank $r$ from being the winner. Their difference is exactly the probability that rank $r$ is the best (lowest-numbered) rank in the tournament.

We can verify this at the boundaries. For the best individual (rank $r = 1$):

$$P(\text{win} \mid \text{rank} = 1) = 1 - \left(\frac{N - 1}{N}\right)^k$$

This is simply one minus the probability that all $k$ draws miss rank 1, which approaches 1 rapidly as $k$ grows. For example, with $N = 100$ and $k = 5$, the best individual is selected with probability $1 - 0.99^5 \approx 0.049$, roughly five times its uniform baseline of $1/100$. For the worst individual (rank $r = N$): $P = (1/N)^k - 0 = N^{-k}$, which is vanishingly small for large $k$ — the worst individual can only win if all $k$ draws happen to select it.

As $k$ increases, selection pressure intensifies: the probability mass concentrates on the top-ranked individuals and the distribution becomes increasingly skewed. In the limit $k \to \infty$, selection becomes deterministic (always choosing the best individual), while $k = 1$ recovers uniform random selection. The tournament size $k$ thus provides a smooth knob for controlling the exploration-exploitation tradeoff in parent selection.

1.4.3 LLM Mutation Strategies

The LLM mutation component is what distinguishes this paradigm from classical GP. Systems vary along several axes:

Strategy	Description	Example Systems
Single-model, fixed prompt	One LLM, one prompt template	Early FunSearch variants
Single-model, evolved prompts	One LLM, prompts co-evolve with population	EvoPrompt, PromptBreeder
Multi-model, bandit-selected	Multiple LLMs; adaptive selection via bandit algorithm	AlphaEvolve, OpenEvolve
Hierarchical mutation	Different operators for different scales (line, function, file)	AlphaEvolve (diff vs. full rewrite)
Reflection-augmented	LLM receives evaluation feedback and reflects before mutating	GEPA, ReEvo

The multi-model, bandit-selected approach is particularly noteworthy. Given a set of $M$ available LLMs $\{m_1, m_2, \ldots, m_M\}$, the system must decide which model to query at each step. This is a classic exploration-exploitation tradeoff: the system should usually query the model that has historically performed best (exploitation), but must occasionally try under-sampled models that might be superior (exploration). The Upper Confidence Bound (UCB1) algorithm (paper: Auer et al., 2002) provides a principled solution:

$$m^* = \arg\max_{m \in \{m_1, \ldots, m_M\}} \left[\hat{\mu}_m + c\sqrt{\frac{\ln N_{\text{total}}}{N_m}}\right]$$

The first term $\hat{\mu}_m$ is the empirical mean reward of model $m$ — for example, the fraction of offspring generated by $m$ that achieved fitness strictly higher than their parent. The second term $c\sqrt{\ln N_{\text{total}} / N_m}$ is an exploration bonus that grows when model $m$ has been tried relatively few times ($N_m$ is small relative to the total number of selections $N_{\text{total}} = \sum_m N_m$). The constant $c > 0$ controls the balance: larger $c$ encourages more exploration of under-sampled models. Auer et al. (2002) show that with bounded rewards in $[0,1]$, UCB1 achieves logarithmic cumulative regret — meaning the cost of exploration diminishes over time. In practice, the non-stationarity of evolutionary search (model effectiveness changes as the population evolves) means the theoretical regret bounds do not strictly apply, but UCB remains an effective heuristic, as discussed in the system chapters.

1.4.4 Evaluation Pipeline

Evaluation is often the most expensive component — executing and scoring a candidate program can involve compilation, test-suite execution, and benchmark measurement. Systems employ several strategies to manage evaluation cost:

Cascaded evaluation: Multiple stages of increasing expense. A fast syntax/compile check filters obvious failures; a small test suite filters functional errors; the full benchmark runs only on candidates that pass preliminary stages. This is the most widely adopted strategy among the surveyed systems.
Caching: Programs with identical source code (or identical behavior on a subset of tests) reuse cached fitness scores.
Early stopping: If a candidate's partial score on the first $j$ test cases is already below the current best, terminate evaluation early.
Approximate fitness: Use a cheap proxy (e.g., a small subset of the full benchmark) for most evaluations, reserving the full benchmark for promising candidates.

Cascaded evaluation can be formalized as a sequence of filter stages. Let $E_1, E_2, \ldots, E_S$ be $S$ evaluator stages of increasing cost $c_1 < c_2 < \cdots < c_S$ and increasing fidelity. A candidate must pass stage $E_i$ (achieving a minimum threshold score) before proceeding to stage $E_{i+1}$. If the pass rate at stage $i$ is $\rho_i$, then the expected cost per candidate is:

$$\bar{C}_{\text{eval}} = c_1 + \rho_1 \cdot c_2 + \rho_1 \rho_2 \cdot c_3 + \cdots = \sum_{i=1}^{S} c_i \prod_{j=1}^{i-1} \rho_j$$

where $\prod_{j=1}^{0} \rho_j = 1$ by convention. Because $\rho_i < 1$ and typically $\rho_1$ is small (many candidates fail the first, cheapest check), the expected cost is dominated by the early stages. For example, if 60% of candidates fail a compile check ($\rho_1 = 0.4$) and 50% of the remainder fail basic tests ($\rho_2 = 0.5$), only 20% of candidates reach the expensive final stage — reducing average evaluation cost substantially.

1.4.5 Diversity Maintenance

Without explicit diversity mechanisms, LLM-evolutionary systems tend to converge rapidly — the LLM, biased by its training distribution, may repeatedly propose similar modifications. Systems address this through:

Island isolation: Subpopulations evolve independently, preventing global convergence.
Novelty filtering: Embedding-based similarity checks reject offspring too similar to existing population members.
Quality-diversity archives: MAP-Elites ensures coverage of the behavioral descriptor space.
Prompt diversity: Different islands or selection events use different prompt templates, encouraging varied mutation styles.
Temperature and sampling variation: Varying LLM sampling temperature across mutations to produce both conservative and exploratory offspring.

1.4.6 Safety and Sandboxing

Executing arbitrary LLM-generated code poses security risks. All mature systems employ sandboxing — restricted execution environments that limit filesystem access, network connectivity, memory usage, and execution time. The strength of isolation varies from lightweight subprocess restrictions to full container or VM-based sandboxes. This is a critical concern for production deployments and is discussed in detail in the systems chapters.

1.4.7 Cost and Budget Management

LLM API calls are the dominant cost in most systems. A single evolutionary run may require thousands to hundreds of thousands of LLM queries. Systems manage this through:

Model tiering: Use cheaper, faster models for exploration and more expensive, capable models for refinement.
Token budgeting: Track cumulative token usage and enforce per-run or per-experiment budget limits.
Adaptive model selection: Bandit algorithms route queries to the most cost-effective model for the current search phase.

The total monetary cost of a single evolutionary run can be decomposed as follows:

$$C_{\text{total}} = \underbrace{\sum_{t=1}^{T} \sum_{m \in M_t} n_{m,t} \cdot \left(c_m^{\text{input}} \cdot \bar{k}_m^{\text{in}} + c_m^{\text{output}} \cdot \bar{k}_m^{\text{out}}\right)}_{\text{LLM query cost}} + \underbrace{C_{\text{eval}}}_{\text{evaluation compute cost}}$$

The first sum ranges over $T$ generations and the set of models $M_t$ used in each generation. For each model $m$ in generation $t$: $n_{m,t}$ is the number of queries, $c_m^{\text{input}}$ and $c_m^{\text{output}}$ are per-token costs for input and output respectively, and $\bar{k}_m^{\text{in}}$ and $\bar{k}_m^{\text{out}}$ are average token counts per query. The term $C_{\text{eval}}$ captures compute costs for candidate evaluation (CPU time, sandbox overhead, resource provisioning). In practice, the LLM query cost dominates for API-based systems, while evaluation cost dominates for compute-heavy benchmarks. Reported costs for the open-source systems range from a few dollars for simple tasks to hundreds of dollars for complex, long-running experiments (repository/README: OpenEvolve, GEPA documentation; author synthesis: exact costs depend on model pricing, which changes frequently).

1.5 Taxonomy of Surveyed Systems

We organize the 17 surveyed systems into four generations based on their architectural sophistication and the design patterns they introduce. This taxonomy is conceptual, not strictly chronological — some later systems in earlier generations were published after systems in later generations — but it captures the progression of ideas in the field. The book's chapter structure is organized by analytical depth and architectural influence rather than by generation; the table below maps each generation to the chapters where its systems receive primary coverage.

Generation	Defining Feature	Representative Systems	Primary Chapters
G1: Pioneers	First demonstrations that LLMs can serve as mutation operators inside evolutionary loops, yielding novel results	FunSearch, ELM, ReEvo	6, 8
G2: Scaled Systems	Multi-model orchestration, cascaded evaluation, MAP-Elites, quality-diversity at scale	AlphaEvolve	4
G3: Open Recreations	Open-source implementations that reproduce and extend G2 architectures, enabling community experimentation	OpenEvolve, GEPA, EvoCodeBench	5, 7, 8
G4: Unified Platforms	Synthesis of design patterns from multiple systems into configurable, plugin-based research platforms	OmniEvolve	9

Each generation builds on the lessons of the previous one. G1 proved the concept; G2 proved it could scale to real-world impact; G3 democratized access and enabled reproducibility; G4 provides the tooling for systematic comparison and new method development.

Note on chapter ordering. AlphaEvolve (G2) is covered before the G1 pioneers because it is the most architecturally rich system in the survey and introduces the majority of the design vocabulary used throughout the book. Covering it first allows subsequent chapters to reference its patterns as a common baseline. FunSearch and ELM receive dedicated treatment in Chapters 6 and 8, where their historical priority and distinctive design choices are analyzed in detail.

1.6 Scope and Methodology

1.6.1 Inclusion Criteria

A system is included in this survey if it meets all of the following criteria:

LLM as variation operator: The system uses at least one large language model to generate or modify candidate programs within an iterative search loop. Systems that use LLMs only for initial code generation without iterative refinement are excluded.
Population-based search: The system maintains a population of candidate solutions (not just a single candidate) and employs some form of selection pressure. Single-trajectory refinement (e.g., iterative self-repair without a population) is excluded.
Automated evaluation: Candidate fitness is assessed by an automated evaluator, not by human judgment. Human-in-the-loop systems are noted but not the primary focus.
Publication or release between January 2024 and early 2026: The system was described in a peer-reviewed paper, preprint, technical report, or publicly released as open-source software during this period.

1.6.2 Analysis Framework

For each system, we analyze:

Architecture: Component design, data flow, and dependency structure, illustrated with diagrams.
Algorithms: Mathematical formalization of key mechanisms (selection, mutation, evaluation, archival), with pseudocode and, where available, reference to actual implementation.
Results: Reported benchmarks and outcomes, with careful attention to experimental methodology, baselines, and reproducibility. We distinguish between peer-reviewed results, vendor-reported claims, and community reproductions.
Cost analysis: LLM API costs, compute requirements, and budget management strategies.
Limitations: Known failure modes, scalability bottlenecks, and open questions.
Provenance: For each factual claim, we indicate whether it comes from a published paper, repository documentation, code inspection, or is an author reconstruction for expository purposes.

1.6.3 Notation Conventions

The following notation is used throughout this survey. Individual chapters may introduce additional symbols, which are defined at point of first use.

Symbol	Meaning
$\mathcal{P}$	Space of syntactically valid programs
$p, p_i$	A candidate program
$f(p)$	Fitness of program $p$
$\mathbf{P}_t$	Population at generation $t$
$N$	Population size
$T$	Total number of generations
$k$	Tournament size (in tournament selection)
$r$	Rank of an individual (rank 1 = best)
$\alpha$	Power-law exponent (in power-law selection)
$\mathcal{A}$	MAP-Elites archive
$\mathcal{D}_i$	$i$-th behavioral descriptor dimension
$m, m_i$	An LLM (model)
$\hat{\mu}_m$	Empirical mean reward for model $m$
$N_m$	Number of times model $m$ has been selected
$c$	UCB exploration constant
$C_{\text{total}}$	Total monetary cost of an evolutionary run
$\rho_i$	Pass rate at evaluation stage $i$

1.6.4 Provenance Labels

Following the lessons of earlier survey efforts, we adopt a consistent provenance labeling scheme throughout this book. When describing system internals, each claim is grounded in one of the following evidence categories:

Label	Meaning
paper	Stated in a peer-reviewed or preprint publication by the system authors
blog/report	Described in an official blog post, technical report, or announcement
repository	Verified by inspection of the public source code repository
README/docs	Described in repository documentation, README, or official guides
example config	Appears as a default or example in configuration files
author formalization	A mathematical formalization constructed by the survey authors to describe a mechanism; not claimed by the original system authors
author synthesis	A summary or comparison constructed by the survey authors from multiple sources; explicitly marked as interpretive
inferred	Reconstructed from indirect evidence; explicitly marked as uncertain

This scheme addresses a recurring challenge in surveying systems that are partially closed-source or incompletely documented. By being explicit about evidence quality, we enable readers to calibrate their confidence in each claim and to identify where further investigation is needed.

1.7 Research Landscape and Open Problems

1.7.1 What Has Been Demonstrated

As of early 2026, LLM-powered evolutionary program search has demonstrated several categories of results. We organize these by evidence strength to help readers calibrate the maturity of the field:

Discovery of novel mathematical results (paper, peer-reviewed: Romera-Paredes et al., Nature, 2024): FunSearch discovered new constructions for cap sets (a problem in additive combinatorics) that exceeded the previous best known bounds, constituting a genuine mathematical contribution verified by domain experts and published in Nature. This is the strongest evidence in the field: peer-reviewed, independently verifiable, and constituting a contribution to mathematics itself rather than to algorithm engineering alone.
Superhuman heuristic design (paper: Romera-Paredes et al., 2024; repository: multiple open-source systems): Multiple systems have discovered heuristics for online bin packing, scheduling, and routing that outperform standard textbook algorithms on specific benchmark distributions. These results are reported in published papers and, in several cases, have been reproduced by independent teams using open-source implementations. The qualification "on specific distributions" is important — generalization to arbitrary problem instances has not been established for most discovered heuristics.
Infrastructure-scale impact (blog/report: DeepMind, 2025): AlphaEvolve reportedly discovered improvements to hashing algorithms, kernel scheduling, and other components deployed in production systems at Google. These claims are made in an official DeepMind blog post and technical report but have not been independently verified or published in peer-reviewed venues as of early 2026. The impact claims are plausible given the system's architecture, but readers should note the distinction between vendor-reported production deployments and independently reproducible benchmarks.
Cross-domain generality (author synthesis across surveyed systems): The same evolutionary loop has been applied to combinatorial optimization, numerical methods, game-playing strategies, compiler optimization, and neural architecture search. This breadth is documented across the 17 surveyed systems, though no single system has demonstrated state-of-the-art results across all domains simultaneously. The generality is a property of the paradigm's architecture rather than any individual system's benchmarks.

1.7.2 Open Problems

Significant challenges remain, and we highlight them here to frame the critical analysis in subsequent chapters:

Reproducibility. Many flagship results have not been independently reproduced. The stochastic nature of both LLM generation and evolutionary search means that results depend on specific model versions, sampling parameters, random seeds, and evaluation conditions. Most systems do not provide sufficient experimental detail for exact reproduction. This survey documents the reproducibility status of each system's key claims.

Evaluation methodology. Comparing systems is difficult when they are tested on different tasks, with different evaluation budgets, using different LLM backends. There is no standardized benchmark suite for LLM-evolutionary systems. We propose comparison criteria in Chapter 10 and urge the community to adopt shared evaluation protocols.

Scalability to complex programs. Current successes are concentrated on relatively short programs (tens to hundreds of lines) with well-defined fitness functions. Scaling to larger programs — complete libraries, multi-file systems, or programs requiring complex integration testing — remains an open challenge. The fitness landscape becomes harder to navigate as program complexity increases.

Theoretical foundations. The convergence properties of LLM-evolutionary search are poorly understood. Classical evolutionary computation theory (schema theorems, runtime analysis, drift analysis) assumes specific properties of variation operators that LLMs do not satisfy — in particular, classical theory typically assumes that mutation is a memoryless stochastic process, whereas LLM-based mutation is context-dependent and non-stationary. New theoretical frameworks are needed to characterize when and why these systems work.

Safety and alignment. As these systems become more capable, the risk of discovering harmful programs increases. Current sandboxing approaches provide process-level isolation but may be insufficient against adversarial code. The alignment of the search objective with human intent — ensuring the system optimizes what we mean, not just what we measure — is an underexplored concern.

1.7.3 Positioning Relative to Other AI Research Directions

LLM-powered evolutionary program search is not the only approach to automated algorithm discovery. It is useful to situate the paradigm relative to alternatives:

Approach	Mechanism	Strengths	Limitations
LLM-evolutionary (this survey)	Population-based search with LLM variation	Broad applicability, novelty, diversity	High LLM cost, reproducibility challenges
Best-of-N sampling	Generate N solutions independently, select best	Simple, parallelizable	No iterative improvement, limited diversity
Iterative self-refinement	Single trajectory: generate → evaluate → refine	Low cost, simple implementation	No population diversity, prone to local optima
RL-based program synthesis	Train a policy to generate programs token by token	Principled optimization, gradient signal	Requires differentiable reward, training cost
Formal synthesis (e.g., Sketch, Rosette)	Constraint solving over program sketches	Correctness guarantees	Limited scalability, requires formal specs

The evolutionary approach's distinctive advantage is its ability to maintain population diversity while performing iterative improvement — a combination that neither best-of-N sampling nor single-trajectory refinement provides. This is why evolutionary methods have found novel solutions in spaces where simpler approaches converge to local optima.

1.8 Illustrative Comparison: Paradigms in Code

To make the distinction between paradigms concrete, the following code contrasts three approaches to the same task — improving a heuristic function — highlighting why population-based evolutionary search offers structural advantages.

import random

# --- Approach 1: Best-of-N Sampling ---
def best_of_n(llm, prompt: str, evaluator, n: int = 100):
    """Generate N independent solutions, return the best."""
    candidates = [llm.generate(prompt) for _ in range(n)]
    scored = [(c, evaluator(c)) for c in candidates]
    return max(scored, key=lambda x: x[1])
    # Limitation: no iterative improvement; each sample is independent

# --- Approach 2: Iterative Self-Refinement ---
def self_refine(llm, seed: str, evaluator, steps: int = 20):
    """Single-trajectory refinement: improve one solution repeatedly."""
    current = seed
    score = evaluator(current)
    for _ in range(steps):
        feedback = f"Current score: {score}. Improve this:\n{current}"
        candidate = llm.generate(feedback)
        new_score = evaluator(candidate)
        if new_score > score:
            current, score = candidate, new_score
    return current, score
    # Limitation: single trajectory, no diversity, easy to get stuck

# --- Approach 3: LLM-Evolutionary Search ---
def evolutionary_search(llm, seed: str, evaluator, pop_size: int = 50,
                        generations: int = 100):
    """Population-based search with LLM mutation and selection."""
    population = [(seed, evaluator(seed))]

    for gen in range(generations):
        # Tournament selection: pick 3, take best
        parents = random.choices(population, k=3)
        parent = max(parents, key=lambda x: x[1])

        # LLM-powered mutation with context from multiple parents
        context = "\n---\n".join(p[0] for p in random.sample(
            population, min(2, len(population))))
        prompt = f"Improve this function:\n{parent[0]}\n\nOther variants:\n{context}"
        offspring = llm.generate(prompt)

        try:
            score = evaluator(offspring)
        except Exception:
            continue

        # Add to population, maintain diversity
        population.append((offspring, score))
        population.sort(key=lambda x: x[1], reverse=True)
        population = population[:pop_size]

    return population[0]
    # Advantage: diverse population, iterative improvement, cross-pollination

The key structural difference is visible in Approach 3: the evolutionary search maintains a population of diverse solutions, uses selection to focus on promising regions, and provides the LLM with context from multiple parents, enabling cross-pollination of ideas between different solution strategies. This is precisely the mechanism that enables the discovery of novel solutions not present in the LLM's training data.

1.9 Book Structure

The remainder of this book is organized into five parts. The chapter numbering and part assignments below are definitive; cross-references throughout the book use these numbers.

Part I: Foundations (Chapters 1–3) establishes the conceptual and historical foundations. Chapter 2 reviews classical evolutionary computation and quality-diversity algorithms. Chapter 3 surveys LLM capabilities relevant to code generation and mutation.

Part II: System Analyses (Chapters 4–8) provides detailed technical analyses of individual systems, ordered by architectural influence and analytical depth rather than strict chronology.

Chapter 1: AlphaEvolve — the most architecturally sophisticated system in our survey (G2). Covered first because it introduces the majority of the design vocabulary used throughout the book: multi-model bandit selection, cascaded evaluation, MAP-Elites archives, and hierarchical mutation.
Chapter 5: OpenEvolve — the most widely adopted open-source implementation (G3). Recreates and extends AlphaEvolve's core architecture, providing the community's reference implementation.
Chapter 6: FunSearch and Descendants — the pioneering system (G1) and its direct descendants (ELM, ReEvo). Analyzed in depth for its historical priority and distinctive island-model design.
Chapter 7: GEPA — multi-objective Pareto-based evolutionary program search (G3), introducing ASI scoring and Pareto-front management.
Chapter 8: Additional Systems — EvoCodeBench, OpenELM, and remaining surveyed systems, with focused analyses of their distinctive contributions.

Part III: Unified Platforms and Synthesis (Chapters 9–10) covers systems that synthesize design patterns from multiple predecessors, and presents a comparative analysis across all surveyed systems including benchmark parity studies.

Chapter 9: OmniEvolve — a unified research platform (G4) synthesizing design patterns from 17 surveyed systems.
Chapter 10: Comparative Analysis — cross-system comparison, benchmark parity studies, and evaluation methodology recommendations.

Part IV: Frontiers (Chapters 11–12) discusses open problems, theoretical foundations, safety considerations, and future directions for the field.

Each system chapter is self-contained and can be read independently, though cross-references connect related design decisions across systems. We recommend reading this introduction and at least one system chapter (Chapter 4 on AlphaEvolve provides the richest material) before proceeding to the comparative analysis in Chapter 10.

The following table summarizes the mapping between the generation taxonomy (Section 1.5) and the book structure:

Part	Chapters	Content	Generations Covered
I: Foundations	1–3	Introduction, EC background, LLM capabilities	—
II: System Analyses	4–8	AlphaEvolve, OpenEvolve, FunSearch & descendants, GEPA, additional systems	G1, G2, G3
III: Synthesis	9–10	OmniEvolve, comparative analysis	G4
IV: Frontiers	11–12	Theory, safety, future directions	—

1.10 Limitations of This Survey

We acknowledge several limitations of this work:

Incomplete access. Some systems (notably AlphaEvolve) are not fully open-source, and our analysis of their internals relies on published papers, blog posts, and official documentation. Where we reconstruct mechanisms from incomplete information, we label these reconstructions explicitly using the provenance scheme defined in Section 1.6.4.

Rapid field evolution. The field moves fast. Systems released after our analysis cutoff (early 2026) are not covered. New LLM capabilities (longer context, better code understanding, lower costs) may change the tradeoff landscape significantly.

Benchmark comparability. Cross-system comparison is hampered by inconsistent evaluation protocols. We document comparison conditions carefully but cannot always establish strict apples-to-apples parity. Where budget parity, model parity, or task parity is not established, we note this explicitly.

Selection bias. Our inclusion criteria favor systems with publications or public releases. Industrial systems used internally without public disclosure are necessarily absent from our analysis.

Chapter Summary

Key takeaway: Between 2024 and 2026, a new paradigm emerged in which large language models serve as semantic variation operators within evolutionary search loops, enabling the automated discovery of algorithms and heuristics that match or exceed human-designed solutions. This paradigm — LLM-powered evolutionary program search — sits at the convergence of evolutionary computation, program synthesis, and large language models.

Main contribution of this survey: A comprehensive, technically rigorous analysis of 17 systems implementing this paradigm, with unified mathematical formalizations, architecture comparisons, reproducibility assessments, and a consistent provenance labeling scheme that distinguishes verified facts from author reconstructions.

What a researcher should know: The paradigm works — it has produced genuinely novel mathematical and algorithmic results — but significant challenges remain in reproducibility, cross-system comparison, theoretical foundations, and scalability to complex programs. The design space is rich, with major choices in population structure, parent selection, LLM mutation strategy, evaluation pipeline, and diversity maintenance, each with documented tradeoffs analyzed in the chapters that follow.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}