Score8.31/10 — Draft

Chapter 2

Taxonomy of Self-Evolving AI Systems

Part P01: Introduction & Framework

2.1 Overview and Motivation

Between 2024 and 2026, the intersection of large language models and evolutionary computation produced a rapid proliferation of systems capable of autonomously generating, mutating, and selecting programs or heuristics. These systems — which we term self-evolving AI systems — share a common core loop: an LLM proposes candidate solutions, an automated evaluator scores them, and a selection mechanism guides subsequent generations. Yet beneath this shared abstraction lies substantial architectural diversity in how each component is realized.

This chapter introduces a structured, multi-axis classification framework for analyzing these systems. The motivation is threefold. First, the field lacks a common vocabulary: different papers use terms like "evolutionary prompting," "LLM-guided search," and "neural program synthesis" to describe overlapping mechanisms, making cross-system comparison difficult. Second, understanding the design space helps researchers identify unexplored combinations — regions where no existing system operates — and guides future system design. Third, a rigorous taxonomy enables fair benchmarking by clarifying which architectural choices are being compared when two systems are evaluated on the same task.

The framework classifies systems along five analytically separable axes: search strategy, mutation mechanism, evaluation paradigm, population structure, and LLM integration pattern. We say "analytically separable" rather than "orthogonal" because, as Section 2.9 discusses, strong empirical couplings exist between certain axes — for instance, MAP-Elites search strategies almost always co-occur with structured population archives. The axes are chosen so that each captures a design decision that can, in principle, be varied independently of the others, even if practitioners rarely do so. Section 2.6.6 explicitly addresses the most common source of confusion: the boundary between search strategy (A1) and population structure (A4).

Key Contribution

This chapter proposes a five-axis classification framework for LLM-powered self-evolving systems (2024–2026) and applies it to representative systems from the surveyed literature. The taxonomy is intended as an analytical tool for cross-system comparison and design-space exploration, not as a definitive or exhaustive ontology. Where system classifications depend on incomplete public documentation, we mark assignments as provisional and record the evidence basis. Prior classification schemes in evolutionary computation (Poli et al., 2008), program synthesis (Gulwani et al., 2017), and LLM-agent surveys (Wang et al., 2024) have addressed related but distinct system families; our contribution is specific to the LLM-as-mutation-operator paradigm that emerged in 2024.

2.1.1 Scope and Limitations

The systems surveyed in this book span the period from early 2024 through mid-2026 and include both peer-reviewed publications and open-source platforms with significant community adoption. Representative systems discussed in this chapter include FunSearch (Romera-Paredes et al., 2024), EoH (Liu et al., 2024), ReEvo (Ye et al., 2024), AlphaEvolve (Google DeepMind, 2025), OpenEvolve (open-source, 2025), GEPA (2025), LLM4AD (2025), and several others referenced in individual system chapters. The taxonomy is applied to a representative subset; not every system receives a full classification in every table, and we note where evidence is insufficient for confident assignment.

2.2 The Five-Axis Classification Framework

2.2.1 Framework Design Principles

The five axes were selected according to three criteria: discriminative power (the axis meaningfully separates existing systems into distinct groups), design relevance (the axis corresponds to an architectural decision that practitioners explicitly make), and analytical separability (the axis can, in principle, be varied while holding others constant). We do not claim full orthogonality: Section 2.9 explicitly discusses observed couplings between axes and quantifies their frequency in the surveyed corpus. Rather, the axes represent analytically distinct dimensions of variation that together span the major design decisions in this class of systems.

We also note that this is not the first classification scheme applicable to evolutionary computation systems. Prior work in genetic programming (Poli et al., 2008), program synthesis (Gulwani et al., 2017), and LLM-agent surveys (Wang et al., 2024) has proposed various taxonomies. Our contribution is specific to the LLM-as-mutation-operator paradigm that emerged in 2024, where the search operator is a pretrained language model rather than a syntactic rewrite rule. This distinction changes the character of several classical axes — for instance, "representation" becomes less relevant when the LLM operates directly on source code strings, and "variation operator" must account for prompt engineering and model selection as design variables.

Excluded dimensions. Several plausible classification axes were considered but not elevated to top-level status. Task domain (combinatorial optimization, program synthesis, mathematical discovery) was excluded because it characterizes the problem rather than the system, and most surveyed systems are domain-agnostic in architecture. Safety and sandboxing (subprocess isolation, container execution, budget guards) was excluded because it is a cross-cutting infrastructure concern rather than a core algorithmic design choice — all systems require some form of safe execution, but the mechanism does not shape the evolutionary dynamics. Compute budget (number of LLM calls, evaluation cost) was excluded because it is a continuous resource parameter rather than a discrete architectural choice, though it strongly influences which axis combinations are practical (see §2.9). Representation (source code, AST, intermediate representation) was excluded because all surveyed systems operate on source-code strings, making this axis non-discriminative for the 2024–2026 generation. These dimensions remain important for system characterization and could be added as secondary metadata fields in future extensions of the taxonomy.

2.2.2 Axis Definitions

The five axes and their categories are defined as follows. Each axis is elaborated in its own section (§§2.3–2.7).

Axis	Question Answered	Categories
A1: Search Strategy	How is the solution space explored?	Single-population EA, Island-model EA, MAP-Elites / Quality-Diversity, Tree search (MCTS), Hybrid
A2: Mutation Mechanism	How are new candidates generated from existing ones?	LLM-rewrite, LLM-diff-edit, LLM-crossover, Reflect-then-mutate, Multi-operator with selection
A3: Evaluation Paradigm	How are candidates scored and filtered?	Single-stage execution, Cascade (multi-stage), Multi-objective, LLM-as-judge, Hybrid cascade + objective
A4: Population Structure	How are candidates organized and maintained?	Flat pool, Ranked archive, Behavioral archive (MAP-Elites grid), Island topology, Pareto front
A5: LLM Integration Pattern	What role does the LLM play in the system?	Single-model mutator, Multi-model pool with bandit, Hierarchical (proposer + critic), Prompt co-evolution, Model-as-component

2.2.3 Formal Model

We formalize a self-evolving system as a tuple $\mathcal{S} = (\text{select}, M, E, \sigma, \mathcal{L}, \pi)$ where:

A population $P_t = (c_1, c_2, \ldots, c_n)$ is an ordered sequence of candidate programs at generation $t$, allowing duplicates. We use a sequence rather than a set because population management operations such as ranking and selection depend on ordering and may admit repeated elements.
$\text{select}: P \to C^+$ is a parent selection operator that draws one or more parent candidates from the population according to a selection policy (e.g., tournament selection, fitness-proportional sampling, or uniform random). This is separated from survivor selection to make the two roles explicit.
$M: C^+ \times \mathcal{L} \times \Theta_M \to \Delta(C)$ is a mutation operator: a stochastic function that, given one or more selected parents, an LLM $\ell \in \mathcal{L}$, and operator parameters $\theta \in \Theta_M$, produces a distribution over candidate programs $C$. The stochasticity arises from both LLM sampling and any randomized operator selection. When the operator is crossover, $C^+$ contains two or more parents; for point mutation, a single parent suffices.
$E: C \to \mathbb{R}^k$ is an evaluation function mapping candidates to a $k$-dimensional score vector. When $k = 1$, this reduces to scalar fitness; when $k > 1$, it supports multi-objective evaluation.
$\sigma: P \times \{(c, E(c))\} \to P$ is a survivor selection operator that takes the current population and a set of evaluated children and returns the next-generation population. This operator handles population replacement: elitism, truncation, age-based removal, or archive insertion.
$\mathcal{L} = \{\ell_1, \ldots, \ell_m\}$ is a model pool: the set of available LLMs.
$\pi$ is a model routing policy. When the pool contains a single model ($|\mathcal{L}| = 1$), $\pi$ is trivial. When $|\mathcal{L}| > 1$, the routing policy takes one of two forms:
- Stochastic routing: $\pi: \text{State} \to \Delta(\mathcal{L})$, mapping search state to a probability distribution over models, from which a model is sampled.
- Index-based routing: $\pi: \text{State} \to \mathcal{L}$, a deterministic selection rule (such as UCB1) that computes a score for each model and selects the one with the highest score. The exploration arises from the score formula, not from randomized selection.

A single generation of the evolutionary loop is then:

$$P_{t+1} = \sigma\!\left(P_t,\;\left\{\left(c_i,\, E(c_i)\right) : c_i \sim M\!\left(\text{select}(P_t),\, \pi(s_t),\, \theta_t\right),\; i = 1, \ldots, \lambda\right\}\right)$$

where $\lambda$ is the number of children generated per generation, $\text{select}(P_t)$ draws parents from the current population according to the parent selection policy, $\pi(s_t)$ selects or samples a model given the current search state $s_t$, and $\theta_t$ parameterizes the mutation operator at generation $t$. This formulation is intentionally general: specific systems instantiate each component differently, and the taxonomy axes capture the major dimensions of this variation.

Note that this formal model is a descriptive abstraction designed to unify notation across systems. Individual systems may not implement all components (e.g., a single-model system has $|\mathcal{L}| = 1$ and trivial routing), and the actual control flow may include additional mechanisms such as caching, early stopping, or prompt adaptation that are not captured in this minimal formulation.

2.3 Axis 1: Search Strategy

The search strategy determines how the system explores the space of candidate programs. This is perhaps the most fundamental architectural choice, as it shapes the tradeoff between exploitation (refining known good solutions) and exploration (discovering structurally novel alternatives).

2.3.1 Single-Population Evolutionary Algorithm

The simplest approach maintains a single population of $n$ candidates, generates children via LLM mutation, evaluates them, and applies selection pressure to form the next generation. EoH (Liu et al., 2024) exemplifies this pattern: it maintains a population of heuristics, selects parents by fitness-proportional or tournament selection, prompts an LLM to produce variants, and retains the top-scoring candidates.

The selection pressure in single-population systems is typically controlled by tournament selection. In tournament selection with tournament size $k$, we draw $k$ individuals uniformly at random with replacement from a population of $n$ candidates and select the individual with the highest fitness (best rank) as the parent.

Convention. We rank candidates from $r = 1$ (best fitness) to $r = n$ (worst fitness), with ties broken arbitrarily. Under this convention, the winner of a tournament is the drawn individual with the lowest rank number.

Derivation. The probability that a specific individual with rank $r$ wins a tournament equals the probability that $r$ is the minimum of $k$ independent draws from the uniform distribution on $\{1, 2, \ldots, n\}$. Each draw independently falls in $\{r, r+1, \ldots, n\}$ (i.e., has rank $\geq r$) with probability $(n - r + 1)/n$. Therefore:

$$p_{\text{select}}(r) = \Pr(\min = r) = \Pr(\text{all draws} \geq r) - \Pr(\text{all draws} \geq r + 1) = \left(\frac{n - r + 1}{n}\right)^{\!k} - \left(\frac{n - r}{n}\right)^{\!k}$$

This is a telescoping decomposition: summing over all ranks $r = 1, \ldots, n$ yields $(n/n)^k - (0/n)^k = 1$, confirming a valid probability distribution.

Verification. For the best individual ($r = 1$):

$$p_{\text{select}}(1) = 1 - \left(\frac{n-1}{n}\right)^{\!k}$$

which approaches 1 as $k \to \infty$, correctly reflecting that larger tournaments strongly favor the best individual. For the worst individual ($r = n$): $p_{\text{select}}(n) = (1/n)^k$, which vanishes as $k$ grows.

Numeric example. With $n = 10$ candidates and tournament size $k = 2$:

Rank $r$	$p_{\text{select}}(r)$	Interpretation
1 (best)	$1 - 0.81 = 0.19$	Selected in ~19% of tournaments
2	$0.81 - 0.64 = 0.17$
5 (median)	$0.36 - 0.25 = 0.11$
10 (worst)	$0.01 - 0 = 0.01$	Selected in ~1% of tournaments

The best individual is 19 times more likely to be selected than the worst, demonstrating meaningful but not overwhelming selection pressure with $k = 2$. Increasing to $k = 3$ raises the best individual's probability to $1 - (0.9)^3 = 0.271$ and drops the worst to $(0.1)^3 = 0.001$, a 271:1 ratio. In practice, LLM-evolutionary systems typically use $k \in \{2, 3, 5\}$, balancing selection pressure against diversity loss.

2.3.2 Island-Model Evolutionary Algorithm

Island models partition the population into semi-independent subpopulations ("islands"), each evolving in parallel with periodic migration of high-fitness candidates between islands. This promotes diversity: each island may converge to a different region of the solution space, and migration allows cross-pollination without homogenizing the entire population.

FunSearch (Romera-Paredes et al., 2024) uses an island-based architecture with multiple independent "programs databases," as described in the Nature paper. OpenEvolve, an open-source system inspired by AlphaEvolve's design, implements configurable island counts with periodic migration based on fitness thresholds, as documented in its public repository. The migration mechanism typically transfers the best candidate from one island to another at a fixed interval, controlled by a migration rate parameter.

The key design parameters for island models are: the number of islands $I$, the migration interval $\tau_{\text{mig}}$ (in generations), the number of migrants per event $m$, and the migration topology (ring, fully connected, random). In the LLM-evolutionary context, islands may additionally differ in their assigned LLM models, mutation operators, or prompt templates — a dimension that connects A1 (search strategy) to A5 (LLM integration).

2.3.3 MAP-Elites and Quality-Diversity

MAP-Elites (Mouret and Clune, 2015) maintains a grid-structured archive where each cell is defined by a behavioral descriptor — a low-dimensional characterization of how a solution behaves, not just how well it scores. Each cell retains only the highest-fitness individual observed for that behavioral region. This produces a diverse collection of high-performing solutions spanning the descriptor space.

The descriptor space is discretized by partitioning each descriptor dimension $d_i \in [d_i^{\min}, d_i^{\max}]$ into $B_i$ bins of equal width:

$$\text{bin}_i(c) = \min\!\left(B_i - 1,\;\left\lfloor \frac{d_i(c) - d_i^{\min}}{d_i^{\max} - d_i^{\min}} \cdot B_i \right\rfloor\right)$$

where $d_i(c)$ is the $i$-th descriptor value of candidate $c$, and the $\min$ operation clamps the boundary case where $d_i(c) = d_i^{\max}$ to the last valid bin index $B_i - 1$. The total archive has $\prod_i B_i$ cells.

AlphaEvolve's published white paper describes a MAP-Elites-style archive structure. The specific behavioral descriptors used are not fully documented in public sources; for algorithmic discovery tasks, plausible descriptors might include code length, algorithmic complexity, or runtime characteristics, but we note that specific descriptor choices are system- and task-dependent and are not always disclosed. This classification is marked as provisional in the table in §2.8.

2.3.4 Tree Search (MCTS)

Monte Carlo Tree Search treats program construction as a sequential decision problem, building programs token-by-token or block-by-block while using upper confidence bounds to balance exploration and exploitation. MCTS-AHD (Liu et al., 2024) applies this approach to automatic heuristic design, using LLM evaluations at leaf nodes to estimate the value of partial programs.

MCTS occupies a boundary position in this taxonomy: it is not a population-based evolutionary algorithm in the classical sense, as it builds a search tree rather than maintaining and evolving a population. We include it because (a) it shares the LLM-as-generator evaluation-as-fitness paradigm, (b) several surveyed systems either use MCTS or compare against it, and (c) understanding where tree search differs from population-based approaches clarifies the design space. Systems using MCTS are marked as such in the classification table, with a note that their A4 (population structure) axis is not directly applicable.

2.3.5 Hybrid Strategies

Some systems combine multiple search strategies. For example, a system might use island-model evolution at the macro level while employing MAP-Elites-style archiving within each island. The LLM4AD platform (2025) provides configurable search backends, allowing users to select from or combine multiple strategies for a given task. When classifying hybrid systems, we record the primary strategy and note secondary mechanisms.

# Pseudocode — illustrative search strategy dispatch
# Shows how a configurable system selects among search strategies
# Not from any specific repository

def run_search(config, population, evaluator, llm):
    """Dispatch to configured search strategy."""
    strategy = config.search_strategy

    if strategy == "single_population":
        # Standard generational EA loop
        parents = tournament_select(population, k=config.tournament_size)
        children = [llm.mutate(p) for p in parents]
        scored = [(c, evaluator.evaluate(c)) for c in children]
        return update_population(population, scored)

    elif strategy == "island_model":
        # Parallel island evolution with periodic migration
        for island in population.islands:
            island.evolve_one_generation(llm, evaluator)
        if generation % config.migration_interval == 0:
            migrate_best(population.islands, n_migrants=config.n_migrants)
        return population

    elif strategy == "map_elites":
        # Quality-diversity: archive indexed by behavioral descriptors
        parent = sample_from_archive(population.archive)
        child = llm.mutate(parent)
        score, descriptors = evaluator.evaluate_with_descriptors(child)
        cell = compute_cell(descriptors, config.bins)
        if population.archive[cell] is None or score > population.archive[cell].score:
            population.archive[cell] = child
        return population

2.4 Axis 2: Mutation Mechanism

In classical evolutionary computation, mutation operators are syntactic transformations: point mutations, subtree swaps, or grammar-guided rewrites. The introduction of LLMs fundamentally changes the nature of mutation, replacing syntactic perturbation with semantic transformation guided by the model's understanding of code structure, algorithmic patterns, and natural-language instructions.

2.4.1 LLM-Rewrite

The simplest mutation mechanism prompts an LLM to rewrite an entire function or program given the current candidate and (optionally) its fitness score. The prompt typically includes the source code, a task description, and an instruction such as "improve this function to achieve higher performance." FunSearch (Romera-Paredes et al., 2024) uses this approach: programs are presented to the LLM with their scores, and the model generates complete replacement candidates.

The rewrite approach has the advantage of simplicity and allows the LLM to make arbitrarily large changes. However, it can be wasteful: the LLM regenerates unchanged portions of the code, consuming tokens without adding value. It also makes it harder to attribute improvements to specific modifications.

2.4.2 LLM-Diff-Edit

Rather than rewriting entire programs, diff-edit mutation asks the LLM to produce targeted modifications — insertions, deletions, or replacements at specific locations. AlphaEvolve's published white paper emphasizes this approach, where the LLM generates diffs that are applied to the existing code. This is more token-efficient and enables finer-grained control over the magnitude of changes.

The diff-edit approach mirrors how human programmers work: reading existing code, identifying specific improvement opportunities, and making focused changes. It naturally supports a "small step" search strategy where each mutation explores a local neighborhood of the current solution.

2.4.3 LLM-Crossover

Crossover in classical GP combines genetic material from two parents. In the LLM setting, crossover presents two or more parent programs to the LLM and asks it to combine their best features. This is qualitatively different from syntactic crossover because the LLM can semantically understand which aspects of each parent contribute to performance and integrate them coherently.

Systems that support crossover typically use it alongside point mutation, with the operator selected probabilistically or via a bandit mechanism (connecting A2 to A5). EoH and several other systems include crossover as one of multiple available operators.

2.4.4 Reflect-Then-Mutate

Some systems insert a reflection step between evaluation and mutation: the LLM first analyzes why a candidate scored as it did, then uses that analysis to guide the next mutation. ReEvo (Ye et al., 2024) formalizes this as a two-phase process where the LLM generates "reflections" — natural-language explanations of candidate strengths and weaknesses — that are incorporated into mutation prompts.

This approach leverages the LLM's ability to reason about code behavior, creating a form of gradient-free optimization where the "gradient" is a natural-language explanation of the fitness landscape. The cost is additional LLM calls per mutation, making it more expensive but potentially more sample-efficient. Note that reflection is classified under A2 (mutation mechanism) rather than A5 (LLM integration) because it describes how the mutation prompt is constructed, not the structural role of the LLM in the system. A system could use reflect-then-mutate with either a single model or a multi-model pool.

2.4.5 Multi-Operator with Adaptive Selection

Advanced systems maintain a portfolio of mutation operators and dynamically select among them based on recent performance. OpenEvolve implements multiple mutation types with a bandit-based selection mechanism that tracks which operators have recently produced fitness improvements. GEPA provides a configurable operator set with adaptive scheduling.

The adaptive selection problem can be framed as a multi-armed bandit: each operator $j$ is an arm, and pulling it produces a stochastic reward (whether the resulting child improves on its parent). The system uses a UCB1-based selection rule to balance trying underexplored operators against exploiting operators known to work well. The operator with the highest UCB1 index is selected:

$$j^* = \arg\max_j \left[\hat{\mu}_j + c\sqrt{\frac{\ln t}{n_j}}\right]$$

where $\hat{\mu}_j$ is the empirical mean reward of operator $j$, $n_j$ is the number of times it has been selected, $t$ is the total number of selections, and $c$ is an exploration constant. This is a deterministic selection rule given the current statistics: the exploration arises from the confidence-bound term $c\sqrt{(\ln t)/n_j}$, which inflates the index of underexplored operators, not from randomized selection. UCB1 with $c = \sqrt{2}$ achieves logarithmic regret for rewards bounded in $[0, 1]$ under stationary conditions (Auer et al., 2002). In evolutionary search, operator rewards are non-stationary — the usefulness of an operator depends on the current population state — so UCB1 serves as a practical heuristic rather than providing formal regret guarantees.

# Pseudocode — multi-operator mutation with UCB1 selection
# Illustrates adaptive operator selection; not from any specific repository

import math
from dataclasses import dataclass

@dataclass
class OperatorStats:
    name: str
    total_reward: float = 0.0
    n_selections: int = 0

    @property
    def mean_reward(self) -> float:
        if self.n_selections == 0:
            return 0.0
        return self.total_reward / self.n_selections

def select_operator_ucb1(
    operators: list[OperatorStats], t: int, c: float = 1.41
) -> OperatorStats:
    """Select mutation operator via UCB1 index rule.

    Returns the operator with the highest UCB1 index.
    This is a deterministic selection given current statistics;
    exploration comes from the confidence-bound term.
    """
    # Ensure each operator is tried at least once
    for op in operators:
        if op.n_selections == 0:
            return op
    # UCB1: select operator with highest index
    return max(
        operators,
        key=lambda op: op.mean_reward + c * math.sqrt(math.log(t) / op.n_selections)
    )

def mutate_with_adaptive_selection(parent, operators, llm, evaluator, t):
    """Generate a child using adaptively selected operator."""
    op = select_operator_ucb1(operators, t)
    op.n_selections += 1

    if op.name == "rewrite":
        child = llm.generate(prompt=f"Rewrite this function:\n{parent.code}")
    elif op.name == "diff_edit":
        child = llm.generate(prompt=f"Suggest targeted edits:\n{parent.code}")
    elif op.name == "crossover":
        other_parent = select_second_parent()
        child = llm.generate(prompt=f"Combine:\n{parent.code}\n---\n{other_parent.code}")

    score = evaluator.evaluate(child)
    reward = 1.0 if score > parent.score else 0.0  # Binary improvement signal
    op.total_reward += reward
    return child, score

2.5 Axis 3: Evaluation Paradigm

Evaluation determines how candidate quality is assessed. This axis has a disproportionate impact on system cost and throughput: in LLM-evolutionary systems, the evaluation function is often the bottleneck, especially when it involves executing generated code in a sandbox, running test suites, or performing expensive numerical simulations.

2.5.1 Single-Stage Execution

The simplest paradigm evaluates every candidate through the same pipeline: parse, execute against test inputs, and compute a scalar fitness score. FunSearch and EoH use variants of this approach, where candidates are Python functions executed against a fixed set of problem instances.

The fitness function for a combinatorial optimization heuristic might be:

$$f(c) = \frac{1}{|T|}\sum_{t \in T} \text{quality}(c(t), t)$$

where $T$ is a set of problem instances, $c(t)$ is the solution produced by candidate heuristic $c$ on instance $t$, and $\text{quality}$ measures solution quality (e.g., bin utilization, tour length, packing density). The average over instances promotes generalization beyond individual test cases.

2.5.2 Cascade Evaluation

Cascade evaluation applies progressively more expensive tests, filtering candidates at each stage. A typical three-stage cascade might be:

Syntax check: parse the generated code (near-zero cost)
Smoke test: run on a small subset of instances (low cost)
Full evaluation: run on the complete instance set (high cost)

AlphaEvolve's published white paper describes a multi-stage evaluation pipeline, and OpenEvolve implements a configurable cascade where early stages can reject candidates that fail basic correctness checks before investing in expensive full evaluation. This approach can reduce overall evaluation cost by an order of magnitude, since many LLM-generated candidates contain syntax errors or trivial bugs that can be caught cheaply.

If the pass rate at stage $s$ is $p_s$ and the cost of stage $s$ is $\kappa_s$, then the expected cost per candidate under a cascade is:

$$\mathbb{E}[\text{cost}] = \kappa_1 + p_1 \cdot \kappa_2 + p_1 \cdot p_2 \cdot \kappa_3 + \cdots = \sum_{s=1}^{S} \kappa_s \prod_{j=1}^{s-1} p_j$$

where $S$ is the total number of stages and $\prod_{j=1}^{0} p_j = 1$ by convention. If $p_1 = 0.3$ (only 30% of candidates parse), $\kappa_1 = 0.01$s, $\kappa_2 = 1$s, and $\kappa_3 = 60$s, then the expected cost is $0.01 + 0.3 \times 1 + 0.3 \times p_2 \times 60$. Even modest filtering at early stages dramatically reduces total cost.

2.5.3 Multi-Objective Evaluation

Some systems optimize multiple objectives simultaneously — for example, solution quality and runtime, or accuracy and code simplicity. GEPA supports multi-objective optimization via Pareto-front maintenance, where a candidate $c_1$ dominates $c_2$ if $c_1$ is at least as good as $c_2$ on all objectives and strictly better on at least one:

$$c_1 \succ c_2 \iff \forall i: f_i(c_1) \geq f_i(c_2) \;\land\; \exists j: f_j(c_1) > f_j(c_2)$$

where $f_i$ is the $i$-th objective (all assumed to be maximized; minimization objectives are negated). The Pareto front is the set of non-dominated candidates. The hypervolume indicator measures the volume of objective space dominated by the Pareto front relative to a reference point, providing a single scalar measure of Pareto-front quality.

2.5.4 LLM-as-Judge and Hybrid Approaches

When automated evaluation is difficult or expensive, some systems use a second LLM call to judge candidate quality — for example, evaluating code readability, documentation quality, or adherence to design patterns. This is less common in the systems surveyed here (which primarily target numerically evaluable objectives), but appears in creative and design-oriented applications.

Hybrid approaches combine automated execution with LLM judgment — for instance, using execution-based fitness for correctness and LLM evaluation for code quality or maintainability. These sit at the intersection of A3 and A5, illustrating the coupling between axes.

2.6 Axis 4: Population Structure

Population structure determines how candidates are organized, stored, and accessed for parent selection. This axis is related to but analytically distinct from A1 (search strategy) — see §2.6.6 for an explicit discussion of the boundary. The structure affects selection dynamics, diversity maintenance, and memory efficiency.

2.6.1 Flat Pool

The simplest structure: an unstructured list of candidates, possibly sorted by fitness. Parent selection draws from this pool, and new candidates either replace the worst members or are added subject to a capacity constraint. EoH in its basic configuration uses a flat population pool.

2.6.2 Ranked Archive

A ranked archive maintains candidates sorted by fitness, often with a fixed capacity. When a new candidate is evaluated, it is inserted into the archive if it exceeds the fitness of the worst current member. This naturally implements elitism (the best candidates are never lost) and provides a straightforward parent-selection distribution biased toward high-fitness individuals.

2.6.3 Behavioral Archive (MAP-Elites Grid)

As described in §2.3.3, a behavioral archive indexes candidates by their behavioral descriptors, retaining only the best candidate per cell. This decouples diversity maintenance from fitness: the archive is guaranteed to span the descriptor space regardless of fitness-based selection pressure. Parent selection from a behavioral archive typically samples cells uniformly or proportionally to a curiosity/novelty score, ensuring continued exploration of underrepresented behavioral regions.

2.6.4 Island Topology

Island-model systems partition the population across islands, each with its own internal structure (often a flat pool or ranked archive). The topology — which islands can exchange migrants — affects the rate at which information flows through the population. Common topologies include ring (each island connects to two neighbors), fully connected (all islands exchange freely), and random (migration targets are selected stochastically).

2.6.5 Pareto Front

For multi-objective systems, the population structure may be a Pareto front or a set of Pareto-optimal individuals. GEPA maintains a Pareto front as its primary archive, with selection based on crowding distance to promote spread along the front.

2.6.6 Distinguishing Search Strategy (A1) from Population Structure (A4)

Because A1 and A4 often co-occur in predictable pairs, a natural question is whether they are truly separate design decisions. We argue they are, on the following grounds.

A1 answers "what algorithm governs exploration?" — the control logic that decides how parents are selected, how many children are generated, and when to migrate or archive. A4 answers "how are candidates stored and indexed?" — the data structure that organizes the population and determines what retrieval operations are efficient. The distinction is analogous to the classical separation between an algorithm and its underlying data structure: the same sorting algorithm can operate on an array or a linked list, and the same data structure can serve different algorithms.

Counterexample 1: Same A1, different A4. Two systems both use a single-population EA (A1 = Single-pop). System A stores candidates in an unordered flat pool (A4 = Flat pool) and selects parents uniformly at random. System B stores candidates in a fitness-sorted ranked archive (A4 = Ranked archive) and biases parent selection toward the top of the archive. The search algorithm — generational replacement with tournament selection — is the same, but the storage and retrieval mechanisms differ, affecting selection dynamics.

Counterexample 2: Same A4, different A1. Two systems both use island-partitioned populations (A4 = Island topology). System A evolves islands independently with periodic migration of the best individual (A1 = Island-model EA). System B runs a distinct search strategy within each island — perhaps MAP-Elites on one island and a standard EA on another (A1 = Hybrid). The population structure is the same partitioned topology, but the algorithms governing exploration differ.

Acknowledged coupling. MAP-Elites (A1) requires a behavioral archive (A4), creating a strong A1→A4 dependency. However, the reverse does not hold: a behavioral archive could supplement a standard EA as a novelty-filtered secondary store. The coupling is empirical, not definitional — it reflects practical engineering constraints rather than logical identity. Section 2.9 quantifies these couplings from the surveyed corpus.

2.7 Axis 5: LLM Integration Pattern

This axis captures how LLMs are embedded in the evolutionary loop — not just that they are used (all surveyed systems use LLMs), but the structural role they play. This is the axis most distinctive to the 2024–2026 generation of systems and the one with the least precedent in classical evolutionary computation.

2.7.1 Single-Model Mutator

The simplest pattern: one LLM serves as the mutation operator. All candidates are generated by the same model with the same (or minimally varying) prompts. Early systems such as EoH and FunSearch used this pattern, relying on a single capable model for all code generation.

The advantage is simplicity; the disadvantage is that the search is limited by the biases and capabilities of a single model. If the model consistently fails to generate certain code patterns, those regions of the solution space become effectively unreachable.

2.7.2 Multi-Model Pool with Bandit Routing

More advanced systems maintain a pool of LLMs (potentially including different model families, sizes, and configurations) and use a bandit algorithm to route generation requests to the most promising model. AlphaEvolve uses a multi-model approach (white paper), and OpenEvolve implements this with configurable model backends and bandit-based selection (public repository).

The routing mechanism selects a model using a UCB1 index rule, analogous to the operator selection in §2.4.5:

$$\ell^* = \arg\max_{\ell \in \mathcal{L}} \left[\hat{\mu}_\ell + c\sqrt{\frac{\ln t}{n_\ell}}\right]$$

where $\hat{\mu}_\ell$ is the empirical improvement rate of model $\ell$, $n_\ell$ is the number of times it has been used, and $c$ controls the exploration-exploitation tradeoff. As with operator selection, this is a deterministic index-based rule — the model with the highest UCB1 score is selected — not a stochastic routing policy. The exploration arises from the confidence bound term, which favors underexplored models. In practice, model pools might include both expensive high-capability models (for complex mutations) and cheaper fast models (for simple refinements), with the bandit learning when each is cost-effective.

Note the distinction between this index-based routing and a true stochastic routing policy $\pi: \text{State} \to \Delta(\mathcal{L})$ that would sample from a distribution over models. Some systems may implement the latter (e.g., Boltzmann exploration over model indices), but the UCB1 formulation used in OpenEvolve and described for AlphaEvolve is deterministic given the current statistics.

2.7.3 Hierarchical Integration (Proposer + Critic)

Some systems use LLMs in multiple distinct roles: a proposer generates candidate code, while a critic (possibly the same or a different model) evaluates, analyzes, or refines the proposal before it enters the population. ReEvo's reflect-then-mutate pattern is a form of hierarchical integration where the reflection phase acts as an internal critic.

This pattern is analogous to generator-discriminator architectures in generative adversarial networks, though the "discriminator" here provides natural-language feedback rather than a scalar score.

2.7.4 Prompt Co-Evolution

An emerging pattern where the prompts used to instruct the LLM are themselves evolved alongside the candidate population. Rather than using fixed prompt templates, the system maintains a population of prompts, scores them by the quality of candidates they produce, and applies evolutionary pressure to improve prompting strategies over time.

OpenEvolve's documentation describes a form of prompt evolution, though the extent to which this is fully implemented varies across system versions. The idea connects to prompt optimization research (Zhou et al., 2023) and represents a meta-learning approach where the system learns not only better solutions but also better ways of asking for better solutions. We include this category because it appears in published system descriptions, while noting that it is less mature than the other A5 categories.

# Pseudocode — prompt co-evolution sketch
# Illustrates the concept; not from any specific repository

from dataclasses import dataclass

@dataclass
class EvolvedPrompt:
    template: str          # The prompt template with placeholders
    fitness: float = 0.0   # Average quality of candidates produced
    n_uses: int = 0        # Number of times used

def co_evolve_prompts(
    prompt_population: list[EvolvedPrompt],
    code_population: list,
    llm,
    evaluator,
):
    """One step of prompt co-evolution."""
    # Select a prompt based on its track record
    prompt = select_by_fitness(prompt_population)
    prompt.n_uses += 1

    # Select a parent candidate from the code population
    parent = tournament_select(code_population, k=3)

    # Generate child using the selected prompt
    filled_prompt = prompt.template.format(code=parent.code, score=parent.score)
    child_code = llm.generate(filled_prompt)
    child_score = evaluator.evaluate(child_code)

    # Update prompt fitness with exponential moving average
    alpha = 0.1
    prompt.fitness = (1 - alpha) * prompt.fitness + alpha * child_score

    # Occasionally mutate prompts themselves
    if random.random() < 0.05:  # Low mutation rate for prompts
        new_template = llm.generate(
            f"Improve this code-generation prompt:\n{prompt.template}"
        )
        prompt_population.append(EvolvedPrompt(template=new_template))

2.8 Classification of Surveyed Systems

The following tables apply the five-axis framework to representative systems from the survey. Important caveats: (1) Classifications are based on published papers, official documentation, and public repository inspection where available. For systems with limited public documentation, assignments may be provisional and are marked with a confidence indicator. (2) The table shows a representative subset of the surveyed corpus; not all systems discussed in the book are included. (3) Some systems span multiple categories on a given axis; we record the primary category and note alternatives in parentheses.

2.8.1 Classification Protocol

To make assignments reproducible, we apply the following decision rules for each axis:

A1 (Search Strategy): What is the primary structure governing exploration? If the system maintains a single population with generational replacement, it is "Single-pop EA." If it partitions into islands with migration, "Island-model." If it maintains a descriptor-indexed archive, "MAP-Elites." If it builds a search tree, "Tree search."
A2 (Mutation): What does the LLM prompt ask for? If the LLM produces a complete replacement function, "LLM-rewrite." If it produces a diff or edit, "LLM-diff-edit." If two parents are presented, "LLM-crossover." If a reflection precedes mutation, "Reflect-then-mutate." If multiple operators are combined with a selection mechanism, "Multi-operator."
A3 (Evaluation): How many stages does a candidate pass through before receiving its final score? One stage = "Single-stage." Multiple stages with early filtering = "Cascade." Multiple independent objectives = "Multi-objective."
A4 (Population): How are candidates stored and indexed? Unstructured list = "Flat pool." Sorted by fitness = "Ranked." Indexed by descriptors = "Behavioral archive." Partitioned across islands = "Island topology." Pareto-optimal set = "Pareto front."
A5 (LLM Integration): How many models, and what roles? One model, one role = "Single-model." Multiple models with routing = "Multi-model pool." Multiple roles (generator + critic) = "Hierarchical."

Confidence levels. Each system is assigned a confidence level based on the quality of public evidence available:

High (●●●): classified from peer-reviewed paper and inspectable public repository.
Medium (●●): classified from a peer-reviewed paper or detailed public documentation, but without full repository access or with some axis assignments requiring interpretation.
Low (●): classified from a white paper, blog post, or partial documentation where undisclosed implementation details require provisional inference.

2.8.2 Classification Table

System	Year	A1: Search	A2: Mutation	A3: Evaluation	A4: Population	A5: LLM Integration	Source	Conf.
FunSearch	2024	Island-model	LLM-rewrite	Single-stage	Island topology (ranked per island)	Single-model	Nature paper (Romera-Paredes et al., 2024, §Methods)	●●●
EoH	2024	Single-pop EA	LLM-rewrite (+crossover)	Single-stage	Flat pool	Single-model	Paper (Liu et al., 2024) + public repo	●●●
ReEvo	2024	Single-pop EA	Reflect-then-mutate	Single-stage	Flat pool	Hierarchical (reflect + mutate)	Paper (Ye et al., 2024) + public repo	●●●
MCTS-AHD	2024	Tree search (MCTS)	LLM-rewrite (at leaf nodes)	Single-stage	N/A (search tree)	Single-model	Paper (Liu et al., 2024)	●●
AlphaEvolve	2025	Island-model + MAP-Elites	Multi-operator (diff-edit primary)	Cascade	Behavioral archive + islands	Multi-model pool	White paper + blog (Google DeepMind, 2025)	●
OpenEvolve	2025	Island-model	Multi-operator with bandit	Cascade	Island topology (ranked)	Multi-model pool with bandit	Public repo (README, source code)	●●
GEPA	2025	Single-pop EA (multi-objective)	Multi-operator	Multi-objective	Pareto front	Single-model (configurable)	Paper + public repo	●●●

Notes on specific entries. AlphaEvolve's classification draws on the published white paper and blog post; undisclosed implementation details (such as the specific behavioral descriptors for MAP-Elites) are not inferred. The low confidence rating reflects that the system is not publicly available for inspection. MCTS-AHD's A4 is marked "N/A" because tree search does not maintain a population in the EA sense; it receives medium confidence because the paper is detailed but the A4 inapplicability introduces a structural mismatch with the taxonomy. OpenEvolve receives medium confidence because repository inspection confirms most architectural details, but some configurable features (e.g., prompt co-evolution) are partially documented.

2.8.3 Meta-Framework: LLM4AD

LLM4AD (2025) is a platform that integrates multiple published methods — EoH, FunSearch, ReEvo, MCTS-AHD, and others — rather than implementing a single fixed architecture. Its classification depends on which method is instantiated. We separate it from the main table because assigning fixed axis values to a configurable framework would either force a meaningless "Configurable" label on every axis or require choosing one instantiation as canonical, both of which distort the taxonomy's semantics.

Instantiation	A1	A2	A3	A4	A5
LLM4AD (EoH mode)	Single-pop	LLM-rewrite	Single-stage	Flat pool	Single-model
LLM4AD (FunSearch mode)	Island-model	LLM-rewrite	Single-stage	Island topology	Single-model
LLM4AD (ReEvo mode)	Single-pop	Reflect-then-mutate	Single-stage	Flat pool	Hierarchical
LLM4AD (MCTS mode)	Tree search	LLM-rewrite	Single-stage	N/A	Single-model

Source: paper + public repository. Confidence: ●● (medium) — instantiation modes documented but fidelity to original methods may vary. See Chapter 8 for detailed analysis.

2.9 Cross-Axis Couplings and Design Patterns

Although the five axes are analytically separable, certain combinations appear together more frequently than others in the surveyed systems. Understanding these couplings reveals design patterns — recurring architectural configurations that have proven effective in practice. We present the coupling analysis as an empirical observation from a small corpus ($n = 7$ systems with fixed architectures, excluding the LLM4AD meta-framework and noting that MCTS-AHD has limited applicability on A4). These frequencies are suggestive rather than statistically validated and should be treated as hypothesis-generating observations, not established regularities.

2.9.1 Observed Co-Occurrences

Three coupling patterns stand out in the surveyed systems, though we emphasize that these emerge from a small sample and should be interpreted as design-space hypotheses rather than confirmed regularities:

Pattern 1: The "Industrial-Scale" Configuration. Island-model search (A1) co-occurs with multi-operator mutation (A2), cascade evaluation (A3), and multi-model pools (A5) in 2 of 3 island-model systems (AlphaEvolve and OpenEvolve; FunSearch is the exception, using simpler mutation and evaluation). The logic is economic: at scale, the system generates enough candidates to amortize the overhead of operator/model selection, and cascade evaluation is essential to control cost.

Pattern 2: The "Research-Minimal" Configuration. Single-population EA (A1) co-occurs with simpler mutation (A2 = rewrite or reflect), single-stage evaluation (A3), flat or simple populations (A4), and single-model integration (A5). This pattern appears in all three single-population systems (EoH, ReEvo, GEPA), though GEPA extends it with multi-operator mutation and multi-objective evaluation. The pattern minimizes engineering complexity and is well-suited for research exploration.

Pattern 3: The "Diversity-Focused" Configuration. MAP-Elites search (A1) co-occurs with behavioral archives (A4). In the surveyed corpus, only AlphaEvolve (as a hybrid) employs MAP-Elites, so this pattern rests on a single observation plus the structural requirement discussed in §2.6.6. The broader quality-diversity literature (Mouret and Clune, 2015; Cully and Demiris, 2018) provides additional evidence that this coupling is robust, but within LLM-evolutionary systems specifically, it has limited empirical support so far.

2.9.2 Unexplored Regions

The coupling analysis also reveals design combinations that have not been explored in the surveyed literature. Some noteworthy gaps:

Tree search + multi-model pool: No surveyed system combines MCTS with bandit-selected model routing, though the combination is technically feasible and could allow different models to handle different depths of the search tree.
MAP-Elites + prompt co-evolution: Evolving prompts alongside a behavioral archive could enable the system to discover different prompting strategies for different behavioral regions.
Multi-objective + cascade evaluation: Cascade filtering is typically applied to a single fitness score; extending it to Pareto-based decisions (e.g., filtering candidates dominated by cached Pareto-front members) is an open design challenge.
Single-population + multi-model pool: No surveyed single-population system uses bandit-routed multi-model generation, though this could improve sample efficiency for simpler system architectures without the complexity of island models.

These gaps suggest directions for future system design, though we note that the absence of a combination may reflect practical engineering constraints or insufficient return-on-complexity rather than a fundamental incompatibility.

2.10 Evolutionary Trajectory of the Design Space

The surveyed systems span roughly two years (2024–2026), during which the design space has evolved along identifiable trends:

2.10.1 From Simple to Configurable

Early systems (2024) implemented a single fixed configuration: one search strategy, one mutation operator, one model. Later systems (2025–2026) are increasingly configurable platforms where each axis can be selected by the user. LLM4AD is the most prominent case — a framework that integrates multiple published methods and allows the user to mix and match components. This trajectory mirrors the historical evolution from bespoke GP systems to frameworks like DEAP and ECJ in classical evolutionary computation.

2.10.2 From Single-Objective to Multi-Faceted Evaluation

Early systems optimized a single scalar fitness. More recent systems incorporate multiple evaluation signals — cascade stages, multi-objective Pareto optimization, novelty scores, and code-quality metrics. This reflects a growing recognition that raw performance on a single metric is insufficient; practical heuristics must also be robust, efficient, readable, and generalizable.

2.10.3 From Single-Model to Model Ecosystems

The shift from single-model mutation to multi-model pools with adaptive routing is one of the most consequential architectural changes. It decouples the evolutionary search from any single LLM's capabilities and introduces a new level of adaptation: the system learns not only better programs but also which models are best suited for different types of mutations at different stages of search.

2.11 Worked Example: Classifying a System

To demonstrate the classification protocol, we walk through the assignment process for FunSearch (Romera-Paredes et al., 2024), which is well-documented in a peer-reviewed Nature paper.

A1 (Search Strategy): The paper describes multiple "programs databases" that evolve independently with periodic sharing of best programs (Nature paper, §Methods, "Distributed computing" subsection). Decision: This matches the island-model definition — semi-independent subpopulations with migration. → Island-model EA. Confidence: High — explicitly described in the paper.

A2 (Mutation): The LLM receives 1–2 high-scoring programs and generates a complete new function. The paper does not describe diff-based editing or crossover as a distinct operation. Decision: Complete function generation from examples. → LLM-rewrite. Confidence: High — the generation process is described in §Methods.

A3 (Evaluation): Generated programs are executed on the problem instance and scored. The paper mentions filtering for programs that "do not run" but does not describe a formal multi-stage cascade. Decision: Primarily single-stage, with implicit syntactic filtering. → Single-stage execution (with filtering caveat). Confidence: High, with the minor caveat that implicit pre-filtering could be interpreted as an informal cascade.

A4 (Population): Each programs database maintains candidates organized into score-based clusters, with sampling probability depending on cluster rank (Nature paper, §Methods, "Best-shot prompting" subsection). The island structure is the primary organizational principle. → Island topology (with ranked archive per island). Confidence: High — described in the paper.

A5 (LLM Integration): The paper describes using a single model (Codey, based on PaLM 2) for all code generation. → Single-model mutator. Confidence: High.

This worked example illustrates that even well-documented systems require interpretive judgment at the boundaries — for instance, whether FunSearch's syntactic filtering constitutes a cascade. The classification protocol favors the simpler assignment when the evidence is ambiguous.

2.12 Limitations of the Framework

Several limitations of this taxonomy should be acknowledged:

Axes are not fully independent. As §2.9 discusses, certain axis values predict others. The five-dimensional design space is not uniformly populated — many theoretical combinations are either impractical or unexplored. The taxonomy is useful for describing existing systems and identifying gaps, but it should not be interpreted as implying that all $\prod_i |A_i|$ combinations are equally viable. The A1-A4 coupling (§2.6.6) is the most prominent case.

Granularity tradeoffs. Each axis aggregates diverse mechanisms into a small number of categories. For instance, "LLM-rewrite" encompasses both zero-shot generation and few-shot generation with multiple exemplars — a distinction that may matter for performance but is not captured at this taxonomy level. Finer-grained subtypes are discussed within each axis section but are not elevated to top-level categories to keep the framework tractable.

Small classification corpus. The evidence for cross-axis couplings (§2.9) and design patterns rests on a sample of seven fixed-architecture systems. While these represent the major published systems in the 2024–2025 period, the counts are too small for statistical claims. Future surveys with larger corpora should revisit and test these patterns.

Temporal bias. The taxonomy is shaped by the systems that existed during the survey period (2024–2026). Future systems may introduce mechanisms that do not fit cleanly into any existing category — for instance, neuro-symbolic mutation operators or learned evaluation surrogates — requiring taxonomy extension.

Evidence asymmetry. Proprietary systems (AlphaEvolve) are classified from white papers and blog posts, while open-source systems (OpenEvolve, GEPA, LLM4AD) can be classified from code inspection. This creates an inherent confidence asymmetry that the three-tier confidence markers partially address but cannot fully resolve.

Prior classification schemes. This is not the first taxonomy applicable to program-generating systems. Poli et al. (2008) classified genetic programming systems by representation, O'Neill and Ryan (2003) taxonomized grammatical evolution, and recent LLM-agent surveys (Wang et al., 2024; Xi et al., 2023) classify agentic systems by planning, memory, and tool-use dimensions. Our framework is specifically designed for the LLM-as-mutation-operator paradigm and does not claim superiority over these prior schemes for their respective domains. The table below summarizes how our axes relate to selected prior frameworks:

Prior Framework	Domain	Overlapping Axes	Key Difference
Poli et al. (2008)	GP	Representation, Variation, Selection	No LLM integration axis; representation is a primary axis (uniform here)
Gulwani et al. (2017)	Program synthesis	Search strategy, Evaluation	No population or model-pool axis; focuses on deductive vs. inductive
Wang et al. (2024)	LLM agents	Planning, Memory, Tool use	No evolutionary/selection axis; agents are single-trajectory, not population-based

2.13 Summary

Chapter Summary

Key takeaway: Self-evolving AI systems (2024–2026) can be systematically analyzed along five axes — search strategy, mutation mechanism, evaluation paradigm, population structure, and LLM integration pattern — that capture the major design decisions differentiating these systems.

Main contribution: A classification framework that enables structured cross-system comparison, reveals tentative design patterns (industrial-scale, research-minimal, diversity-focused) from a small but representative corpus, identifies unexplored design-space regions, and tracks the evolutionary trajectory from fixed single-model systems to configurable multi-model platforms. The framework includes a classification protocol with decision rules, confidence levels, and per-system evidence citations to support reproducibility.

For researchers: When designing a new system, use the five axes to explicitly choose your position in the design space rather than defaulting to the architecture of the most recent publication. When evaluating a system, note which axes are being compared — two systems that differ on four axes simultaneously cannot yield clean ablation-style insights about any single design choice. When classifying a system, apply the decision rules from §2.8.1 and record the evidence basis and confidence level for each axis assignment.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}