Score7.77/10 — Draft
Chapter 66

Comprehensive Methods Catalog

Part P09: Synthesis & Future Directions

This chapter serves as a consolidated reference for the algorithmic methods employed across LLM-powered evolutionary systems surveyed in this book. Rather than describing any single system, it catalogs the full design space of mutation operators, selection strategies, evaluation pipelines, population management schemes, island models, bandit algorithms, and crossover techniques that have appeared in the literature and open-source implementations from 2024 through 2026. Each method is presented with its formal specification, typical use contexts, known trade-offs, and cross-references to the system chapters where it appears in practice. The goal is to equip researchers with a single-point reference that enables principled method selection when designing new evolutionary systems.

Key Contribution. This chapter provides the first unified methods catalog for LLM-powered evolutionary algorithm discovery, organizing 30+ distinct algorithmic components across seven method families into a structured taxonomy. By standardizing notation, stating assumptions, and mapping each method to the systems that employ it, the catalog enables direct comparison and informed recombination of techniques that have previously been described only in isolation within individual system papers.

66.1 Motivation and Scope

The rapid proliferation of LLM-powered evolutionary systems between 2024 and 2026 has produced a rich but fragmented landscape of algorithmic methods. FunSearch, AlphaEvolve, OpenEvolve, GEPA, EoH, ReEvo, LLM4AD, and numerous other systems each introduce or adapt methods drawn from classical evolutionary computation, multi-armed bandit theory, quality-diversity optimization, and LLM prompt engineering. Yet the same fundamental operation — say, tournament selection — may be described with different notation, different default parameters, and different implicit assumptions across papers.

This catalog addresses three problems. First, terminological fragmentation: the same method is sometimes called by different names, or the same name is applied to subtly different algorithms. Second, parameter opacity: many systems report that they use a method without specifying its configuration, making reproduction difficult. Third, interaction blindness: methods are typically described in isolation, but their effectiveness depends critically on how they compose — a mutation operator's performance depends on the selection strategy feeding it parents, the evaluation cascade scoring its children, and the population structure determining where successful offspring land.

The scope of this catalog covers methods that are algorithmically specified in at least one surveyed system's paper, documentation, or public repository. We exclude methods that are only mentioned speculatively or proposed as future work. For each method, we provide: (1) a formal definition with all symbols defined, (2) pseudocode where the algorithm involves non-trivial control flow, (3) a summary of known configurations across systems, and (4) notes on interactions with other methods.

66.2 Methods Taxonomy

We organize methods into seven families, each addressing a distinct decision point in the evolutionary loop. The taxonomy is functional rather than historical: families are defined by what role the method plays in the search process, not by where it was first proposed.

Methods Taxonomy — Seven Families Evolutionary Loop F1: Mutation Operators F2: Crossover Techniques F3: Selection Strategies F4: Evaluation Methods F5: Population Management F6: Island Models F7: Bandit Algorithms Solid arrows: direct role in loop. Dashed: adaptive control relationships.

The seven families and their roles are:

FamilyRole in LoopKey DecisionMethods Count
F1: Mutation OperatorsGenerate candidate variants from parentsHow to perturb code via LLM8 primary types
F2: Crossover TechniquesCombine traits from multiple parentsHow to merge code segments4 primary types
F3: Selection StrategiesChoose parents and survivorsExploitation vs. exploration balance6 primary types
F4: Evaluation MethodsScore candidate qualityFidelity vs. cost trade-off5 primary types
F5: Population ManagementMaintain and organize the candidate poolArchive structure and capacity5 primary types
F6: Island ModelsPartition search into semi-independent subpopulationsIsolation vs. communication4 primary types
F7: Bandit AlgorithmsAdaptively allocate resources across method variantsOnline learning over operators, models, or prompts4 primary types

66.3 F1: Mutation Operators Catalog

Mutation is the primary variation operator in LLM-powered evolution. Unlike classical genetic programming where mutation is a syntactic tree operation, LLM-based mutation operates at the semantic level: the language model receives a parent program (or programs) along with contextual information and generates a modified candidate. This semantic grounding is why LLM-based mutation produces syntactically valid and often semantically meaningful variants at rates far exceeding random tree perturbation.

We catalog eight mutation operator types that have appeared in the surveyed systems. Each operator is defined by its input specification (what context is provided to the LLM), its transformation intent (what kind of change is requested), and its output format (how the modified code is returned).

66.3.1 M1: Line-Level Diff Mutation

The LLM receives a parent program and produces a unified diff specifying lines to add, remove, or modify. This is the most surgical mutation type, typically producing small, focused changes. It is the default mutation operator in AlphaEvolve and is supported in OpenEvolve and GEPA.

Formal specification. Let $p \in \mathcal{P}$ be a parent program represented as a sequence of lines $p = (l_1, l_2, \ldots, l_n)$. A diff mutation $\mu_{\text{diff}}$ produces a child $c$ by applying a set of edit operations $\Delta = \{(i, \text{op}_i, \text{content}_i)\}$ where $i$ is a line index, $\text{op}_i \in \{\text{add}, \text{delete}, \text{replace}\}$, and $\text{content}_i$ is the new line content (empty for deletions). The child is:

$$c = \text{apply}(\Delta, p) \quad \text{where } \Delta = \mu_{\text{diff}}(\text{LLM}(p, \text{context}, \text{prompt}_{\text{diff}}))$$

where $\text{context}$ includes evaluation feedback, error messages, and task description, and $\text{prompt}_{\text{diff}}$ instructs the model to return changes in unified diff format.

# Pseudocode — line-level diff mutation
# Appears in: AlphaEvolve, OpenEvolve, GEPA (variant)

def diff_mutation(parent: str, context: MutationContext, llm: LLMBackend) -> str:
    """Generate a child program via line-level diff."""
    prompt = build_diff_prompt(
        parent_code=parent,
        task_description=context.task_description,
        evaluation_feedback=context.last_eval_result,
        error_messages=context.errors,
        instruction="Suggest improvements as a unified diff."
    )
    raw_response = llm.generate(prompt)
    diff_ops = parse_unified_diff(raw_response)  # Extract edit operations
    child = apply_diff(parent, diff_ops)          # Apply edits to parent
    return child

Trade-offs. Diff mutation excels at fine-grained optimization of near-optimal solutions but struggles to make large structural changes. It is most effective in later stages of evolution when the population has converged toward good solutions and further progress requires precise parameter tuning or local algorithmic adjustments.

66.3.2 M2: Full Rewrite Mutation

The LLM receives a parent program and generates a complete replacement. This is the most radical mutation type, capable of fundamentally restructuring the algorithm. It is used in FunSearch, EoH, and as an option in most multi-operator systems.

Formal specification. Given parent $p$, a full rewrite produces:

$$c = \mu_{\text{rewrite}}(\text{LLM}(p, \text{context}, \text{prompt}_{\text{rewrite}}))$$

where the prompt instructs the model to write a completely new implementation from scratch, using the parent only as inspiration. The key distinction from diff mutation is that $c$ need not share any lines with $p$.

Trade-offs. Rewrite mutation has high variance: it can discover radically better approaches but frequently produces worse candidates. Its expected improvement per evaluation is lower than diff mutation for mature populations, but its ability to escape local optima makes it essential for maintaining search diversity.

66.3.3 M3: Reflection-Guided Mutation

A two-phase operator where the LLM first analyzes the parent program's strengths and weaknesses, then generates a modified version informed by that analysis. This separates the "what to change" decision from the "how to change it" implementation.

Formal specification. The reflection operator decomposes mutation into an analysis phase and a synthesis phase:

$$r = \text{LLM}_{\text{analyze}}(p, \text{context}, \text{prompt}_{\text{reflect}})$$
$$c = \text{LLM}_{\text{synthesize}}(p, r, \text{context}, \text{prompt}_{\text{improve}})$$

where $r$ is the reflection text — a natural-language analysis of the parent's performance characteristics, algorithmic choices, and potential improvements. The synthesis call receives both the parent code $p$ and the reflection $r$.

# Pseudocode — reflection-guided mutation
# Appears in: ReEvo, GEPA, OpenEvolve (variant)

def reflection_mutation(parent: str, context: MutationContext, llm: LLMBackend) -> str:
    """Two-phase mutation: analyze then improve."""
    # Phase 1: Reflect on parent's performance
    reflection_prompt = build_reflection_prompt(
        parent_code=parent,
        fitness_score=context.last_eval_result.score,
        error_log=context.errors,
        instruction="Analyze this solution's strengths and weaknesses."
    )
    reflection = llm.generate(reflection_prompt)

    # Phase 2: Generate improved version informed by reflection
    synthesis_prompt = build_synthesis_prompt(
        parent_code=parent,
        reflection=reflection,
        instruction="Implement improvements based on the analysis above."
    )
    child = llm.generate(synthesis_prompt)
    return extract_code(child)

Trade-offs. Reflection mutation costs approximately 2× the LLM tokens of a single-phase mutation but typically produces higher-quality changes because the model has an explicit reasoning step. It is most valuable when evaluation feedback is rich enough for meaningful analysis.

66.3.4 M4: Error-Repair Mutation

A specialized operator triggered when a candidate fails evaluation due to runtime errors, syntax errors, or constraint violations. The LLM receives the parent code, the error trace, and instructions to fix the specific failure while preserving the algorithmic intent.

Formal specification. Given a failed candidate $p_f$ with error trace $e$:

$$c = \mu_{\text{repair}}(\text{LLM}(p_f, e, \text{prompt}_{\text{fix}}))$$

Error-repair is often applied as a post-processing step after any other mutation produces a non-compiling or non-executing candidate. The repair budget — the maximum number of repair attempts before discarding the candidate — is a key parameter, typically set to 1–3 in practice.

66.3.5 M5: Guided Perturbation Mutation

The operator receives explicit guidance about what aspect of the program to modify. The guidance may come from evaluation feedback decomposition, a learning log, or a bandit-selected mutation strategy. This differs from generic diff/rewrite mutation in that the prompt includes a specific directive such as "modify the scoring heuristic" or "change the data structure for the priority queue."

66.3.6 M6: Prompt-Template Mutation

Rather than mutating the candidate program, this operator mutates the prompt template used to generate future candidates. This meta-level mutation is a form of co-evolution where the prompt population evolves alongside the solution population. It appears in GEPA's prompt co-evolution mechanism and in OpenEvolve's adaptive prompt management.

66.3.7 Additional Mutation Types

M7: Simplification mutation instructs the LLM to reduce the complexity of a parent while maintaining performance. It counteracts code bloat — a well-known pathology in genetic programming where solutions grow in size without proportional fitness improvement. M8: Domain-injection mutation augments the prompt with external domain knowledge (e.g., mathematical identities, known algorithm patterns) to guide the LLM toward incorporating specific techniques.

66.3.8 Mutation Operator Comparison

OperatorScope of ChangeLLM CallsBest PhaseSystems
M1: Line-Level DiffLocal (1–10 lines)1Late (refinement)AlphaEvolve, OpenEvolve
M2: Full RewriteGlobal (entire function)1Early (exploration)FunSearch, EoH
M3: Reflection-GuidedVariable2Mid (directed improvement)ReEvo, GEPA
M4: Error RepairTargeted (error site)1Any (post-failure)Most systems
M5: Guided PerturbationDirected1Mid-LateAlphaEvolve, GEPA
M6: Prompt-TemplateMeta-level1ContinuousGEPA, OpenEvolve
M7: SimplificationReductive1Late (anti-bloat)OpenEvolve
M8: Domain InjectionAugmented1AnyReEvo

66.4 F2: Crossover Techniques

Crossover combines genetic material from two or more parent programs to produce offspring. In classical GP, crossover swaps subtrees between parents. In LLM-powered evolution, crossover is semantic: the LLM is presented with multiple parents and asked to combine their strengths. This makes LLM crossover fundamentally different from syntactic recombination — the model can identify complementary algorithmic strategies and merge them coherently.

66.4.1 C1: Two-Parent Prompted Crossover

The simplest crossover form: the LLM receives two parent programs and is instructed to combine their best features.

$$c = \chi(\text{LLM}(p_1, p_2, \text{context}, \text{prompt}_{\text{cross}}))$$

where $p_1, p_2 \in \mathcal{P}$ are parent programs, typically selected to have complementary strengths — one might score well on efficiency while the other excels at solution quality.

# Pseudocode — two-parent prompted crossover
# Appears in: FunSearch, EoH, AlphaEvolve, GEPA

def prompted_crossover(parent_a: str, parent_b: str,
                       context: CrossoverContext, llm: LLMBackend) -> str:
    """Combine two parents via LLM-guided semantic crossover."""
    prompt = build_crossover_prompt(
        parent_a_code=parent_a,
        parent_a_score=context.score_a,
        parent_b_code=parent_b,
        parent_b_score=context.score_b,
        instruction="Combine the strengths of both solutions into one."
    )
    child = llm.generate(prompt)
    return extract_code(child)

66.4.2 C2: Multi-Parent Crossover

Extends crossover to $k > 2$ parents. The LLM receives a curated set of high-performing programs, each annotated with performance characteristics. This is more information-dense but risks confusing the model when parents are too dissimilar. AlphaEvolve and GEPA support multi-parent crossover with $k$ typically in the range $[2, 5]$.

66.4.3 C3: Feature-Specific Crossover

The prompt explicitly identifies which component to take from each parent: "Use parent A's scoring function with parent B's search strategy." This requires the system to decompose programs into functional components, either through static analysis or through metadata attached to each candidate. GEPA's multi-objective decomposition naturally supports this by tracking which parent excels on which objective.

66.4.4 C4: Diff-Based Crossover

Rather than presenting full parent programs, the system computes the diff between two parents and asks the LLM to resolve the differences. This focuses the model's attention on the specific points of divergence rather than requiring it to process two complete programs.

66.5 F3: Selection Strategies

Selection determines which candidates from the population are chosen as parents for variation and which survive to the next generation. The choice of selection strategy fundamentally shapes the exploration-exploitation trade-off. We catalog six selection strategies observed across the surveyed systems.

66.5.1 S1: Fitness-Proportionate (Roulette Wheel) Selection

Each candidate $i$ in population $P$ of size $N$ is selected with probability proportional to its fitness $f_i$. For a maximization problem with non-negative fitness values:

$$P(\text{select } i) = \frac{f_i}{\sum_{j=1}^{N} f_j}$$

where $f_i \geq 0$ is the fitness of candidate $i$ and $N = |P|$ is the population size. This strategy is simple but suffers from premature convergence when a few candidates dominate the fitness landscape. It is rarely used as the sole selection mechanism in modern LLM-evolutionary systems.

66.5.2 S2: Tournament Selection

The most widely used selection strategy across the surveyed systems. A tournament of size $k$ is formed by sampling $k$ candidates uniformly at random (with or without replacement) from the population. The candidate with the highest fitness in the tournament is selected as the parent.

Formal specification. Let candidates be ranked by fitness so that rank $r = 1$ denotes the best individual out of $N$ total. In a tournament of size $k$ with sampling with replacement, the probability that the candidate of rank $r$ wins is:

$$P(\text{rank } r \text{ wins}) = \frac{r^k - (r-1)^k}{N^k}$$

where $r \in \{1, 2, \ldots, N\}$, $N$ is the population size, and $k$ is the tournament size. The derivation is as follows: the event "rank $r$ wins" is equivalent to "all $k$ sampled candidates have rank $\geq r$ and at least one has rank exactly $r$." The probability that a single draw has rank $\geq r$ (i.e., fitness $\leq$ rank $r$'s fitness under rank-1-best convention) is $r/N$. Thus $P(\text{all } k \text{ have rank} \geq r) = (r/N)^k$. The probability that all $k$ have rank $\geq r+1$ is $((r-1)/N)^k$. The difference gives the stated formula.

For the best candidate ($r=1$), this simplifies to:

$$P(\text{rank 1 wins}) = 1 - \left(\frac{N-1}{N}\right)^k$$

Typical configurations: FunSearch uses $k = 2$ (binary tournament), AlphaEvolve uses adaptive tournament size, and most open-source implementations default to $k \in \{2, 3, 4\}$.

66.5.3 S3: Power-Law Ranked Selection

A distinctive selection strategy employed by OpenEvolve and other systems inspired by AlphaEvolve's design. Candidates are ranked by fitness (rank 1 = best), and parent selection probability follows a power-law distribution over ranks:

$$P(\text{select rank } r) = \frac{r^{-\alpha}}{\sum_{j=1}^{N} j^{-\alpha}} = \frac{r^{-\alpha}}{H_{N,\alpha}}$$

where $\alpha > 0$ is the power-law exponent controlling selection pressure and $H_{N,\alpha} = \sum_{j=1}^{N} j^{-\alpha}$ is the generalized harmonic number serving as a normalization constant. Larger $\alpha$ concentrates selection on top-ranked candidates; $\alpha = 0$ recovers uniform random selection.

# Pseudocode — power-law ranked selection
# Appears in: OpenEvolve, systems inspired by AlphaEvolve

import numpy as np

def power_law_selection(population: list, fitnesses: list[float],
                        alpha: float = 1.0) -> object:
    """Select a parent using power-law distribution over fitness ranks."""
    N = len(population)
    # Rank candidates: rank 1 = highest fitness
    ranked_indices = np.argsort(fitnesses)[::-1]  # Descending fitness
    ranks = np.arange(1, N + 1, dtype=float)

    # Compute selection probabilities
    weights = ranks ** (-alpha)
    probabilities = weights / weights.sum()

    # Sample one parent
    chosen_rank = np.random.choice(N, p=probabilities)
    return population[ranked_indices[chosen_rank]]

Properties. At $\alpha = 1.0$, the top-ranked candidate is selected approximately $\ln(N)$ times more often than the median candidate. At $\alpha = 2.0$, this ratio increases to approximately $N/4$ for large $N$. The power-law distribution provides a smooth gradient between strong elitism and near-uniform exploration, controllable by a single parameter.

66.5.4 S4: MAP-Elites Quality-Diversity Selection

Used in quality-diversity approaches including AlphaEvolve's behavioral characterization. The population is organized into a behavioral descriptor grid, and parents are selected uniformly from occupied cells (or with bias toward recently updated cells). This ensures diversity in the type of solution explored, not just the quality.

The behavioral descriptor space $\mathcal{B} = \mathcal{B}_1 \times \mathcal{B}_2 \times \cdots \times \mathcal{B}_d$ is discretized into a grid. Each cell stores only the highest-fitness candidate that maps to that cell. For candidate $c$ with fitness $f(c)$ and descriptor $\mathbf{b}(c) = (b_1(c), \ldots, b_d(c))$:

$$\text{cell}(c) = \left(\left\lfloor \frac{b_i(c) - b_i^{\min}}{b_i^{\max} - b_i^{\min}} \cdot K_i \right\rfloor\right)_{i=1}^{d}$$

where $K_i$ is the number of bins along dimension $i$, and $b_i^{\min}, b_i^{\max}$ define the descriptor range for dimension $i$. Each floor result is clamped to $[0, K_i - 1]$ to handle boundary cases where $b_i(c) = b_i^{\max}$.

66.5.5 S5: Pareto-Based Multi-Objective Selection

When candidates are evaluated on multiple objectives, Pareto-based selection avoids collapsing objectives into a single scalar. GEPA implements NSGA-II-style non-dominated sorting with crowding distance:

$$\text{CD}(i) = \sum_{m=1}^{M} \frac{f_m(i_{+1}) - f_m(i_{-1})}{f_m^{\max} - f_m^{\min}}$$

where $\text{CD}(i)$ is the crowding distance for candidate $i$, $M$ is the number of objectives, $i_{+1}$ and $i_{-1}$ are the neighbors of $i$ along objective $m$ when the front is sorted by that objective, and $f_m^{\max}, f_m^{\min}$ are the maximum and minimum values of objective $m$ on the current Pareto front. Boundary candidates (those at the extremes of any objective) are assigned infinite crowding distance to ensure their preservation.

66.5.6 S6: Epsilon-Greedy Selection

With probability $1 - \epsilon$, the best candidate is selected; with probability $\epsilon$, a candidate is selected uniformly at random. This is the simplest exploration-exploitation strategy, sometimes used as a baseline or inner selector within more complex schemes. The parameter $\epsilon$ may decay over the course of evolution to shift from exploration to exploitation.

66.6 F4: Evaluation Methods

Evaluation determines candidate fitness and is typically the computational bottleneck in LLM-powered evolution. Evaluation methods span a spectrum from cheap, low-fidelity proxies to expensive, high-fidelity assessments.

66.6.1 E1: Cascade Evaluation

A multi-stage pipeline where candidates must pass cheaper evaluation stages before proceeding to more expensive ones. This dramatically reduces total evaluation cost by filtering low-quality candidates early.

Cascade Evaluation Pipeline Stage 1: Syntax Parse + type check Cost: ~0.01s pass? Stage 2: Smoke Small test cases Cost: ~0.1s pass? Stage 3: Standard Full test suite Cost: ~1-10s pass? Stage 4: Full Fidelity Large instances + stress Cost: ~10-100s ~40% rejected ~30% rejected ~20% rejected

The expected cost of evaluating a random candidate through the cascade is:

$$\mathbb{E}[\text{cost}] = \sum_{s=1}^{S} c_s \prod_{j=1}^{s-1} q_j$$

where $S$ is the number of stages, $c_s$ is the cost of stage $s$, and $q_j$ is the pass rate at stage $j$ (fraction of candidates reaching stage $j$ that pass it). The product $\prod_{j=1}^{s-1} q_j$ gives the fraction of candidates that reach stage $s$. When early stages are cheap and have low pass rates, the cascade can reduce total evaluation cost by 5–50× compared to running all candidates through the full evaluation.

66.6.2 E2: Sandbox Execution

Candidates are executed in a restricted environment with resource limits (CPU time, memory, file system access, network isolation). This is a safety-critical component: LLM-generated code may contain infinite loops, excessive memory allocation, or unintended side effects. All surveyed systems implement some form of sandboxing, ranging from simple subprocess timeouts to container-based isolation.

66.6.3 E3: Multi-Instance Aggregation

Candidates are evaluated across multiple problem instances and their scores are aggregated. Common aggregation functions include arithmetic mean, geometric mean (which penalizes inconsistency), and worst-case (minimax). For $T$ test instances with scores $s_1, s_2, \ldots, s_T$:

$$f_{\text{arith}} = \frac{1}{T}\sum_{t=1}^{T} s_t, \qquad f_{\text{geom}} = \left(\prod_{t=1}^{T} s_t\right)^{1/T}, \qquad f_{\text{worst}} = \min_{t} s_t$$

66.6.4 E4: Novelty-Filtered Evaluation

Before expensive evaluation, candidates are checked for semantic similarity to previously evaluated solutions. If a candidate's embedding is too similar to an existing evaluated candidate, it is skipped. This prevents wasting evaluation budget on near-duplicates. OpenEvolve uses cosine similarity with a configurable threshold:

$$\text{skip}(c) = \exists \, c' \in \mathcal{H} : \text{sim}(\mathbf{e}(c), \mathbf{e}(c')) > \tau_{\text{novelty}}$$

where $\mathbf{e}(c)$ is the embedding of candidate $c$'s code, $\mathcal{H}$ is the history of evaluated candidates, $\text{sim}(\cdot, \cdot)$ is cosine similarity, and $\tau_{\text{novelty}} \in [0, 1]$ is the novelty threshold.

66.6.5 E5: Adaptive Sample-Size Evaluation

Also known as Adaptive Sample Increase (ASI). Candidates are initially evaluated on a small sample of test instances. If the candidate appears promising, the sample size is increased for higher-confidence scoring. This allocates evaluation budget adaptively, spending more compute on candidates that are likely to be good.

66.7 F5: Population Management

Population management governs how the set of candidates is maintained, updated, and structured over the course of evolution. The choice of population structure interacts strongly with selection strategy and island models.

66.7.1 P1: Flat Population with Elitism

The simplest structure: a fixed-size population $P$ of $N$ candidates, where the $e$ best candidates (the elite) are guaranteed to survive to the next generation. New children replace the worst candidates. FunSearch uses this structure with its "programs database" maintaining a ranked list per island.

66.7.2 P2: MAP-Elites Grid Archive

Candidates are stored in a $d$-dimensional grid indexed by behavioral descriptors. Each cell holds exactly one candidate — the best found for that behavioral region. The archive grows by filling empty cells (exploration) or improving occupied cells (exploitation). Total archive capacity is $\prod_{i=1}^{d} K_i$ where $K_i$ is the resolution along dimension $i$.

66.7.3 P3: Pareto Archive

For multi-objective optimization, the population maintains the set of non-dominated candidates (the Pareto front). Archive size is bounded either by a fixed capacity with crowding-distance-based pruning, or by an $\epsilon$-dominance grid that limits front resolution. GEPA uses this structure for its multi-objective mode.

66.7.4 P4: Tiered Population

Candidates are organized into tiers (e.g., elite, active, nursery) with different retention policies. Top-tier candidates are preserved indefinitely; mid-tier candidates are subject to replacement; bottom-tier candidates are pruned aggressively. This creates a natural stratification that protects proven solutions while allowing experimentation.

66.7.5 P5: Skills Archive / Knowledge Base

Beyond storing candidates by fitness, some systems maintain a structured knowledge base of reusable components, strategies, or sub-solutions. ReEvo's learning logs and GEPA's knowledge integration represent this approach. The archive stores not just code but also metadata about what makes each solution effective.

66.8 F6: Island Models

Island models partition the population into semi-independent subpopulations (islands) that evolve in parallel with periodic migration of candidates between them. This architecture provides diversity maintenance, parallel speedup, and robustness to premature convergence.

66.8.1 I1: Static Island Topology

A fixed number of islands $n_I$ with a predetermined migration topology (ring, star, fully connected). Each island maintains its own population and applies its own selection pressure. Migration occurs every $g_{\text{mig}}$ generations, transferring $m$ candidates between connected islands.

The migration rate $\rho$ — the fraction of the island population replaced by migrants per migration event — controls the trade-off between island independence and information sharing:

$$\rho = \frac{m}{N_{\text{island}}}$$

where $m$ is the number of migrants and $N_{\text{island}}$ is the island population size. Low $\rho$ (e.g., 0.05) maintains diversity; high $\rho$ (e.g., 0.3) drives convergence across islands. FunSearch uses a static island model with independent programs databases.

# Pseudocode — island model with periodic migration
# Appears in: FunSearch, OpenEvolve, LLM4AD

def island_evolution_step(islands: list[Island], generation: int,
                          migration_interval: int, migration_rate: float):
    """One generation of island-model evolution."""
    # Evolve each island independently
    for island in islands:
        parents = island.select_parents()
        children = [island.mutate(p) for p in parents]
        evaluated = [island.evaluate(c) for c in children]
        island.update_population(evaluated)

    # Periodic migration
    if generation % migration_interval == 0:
        for i, island in enumerate(islands):
            target = islands[(i + 1) % len(islands)]  # Ring topology
            migrants = island.select_best(
                k=int(migration_rate * island.size)
            )
            target.receive_migrants(migrants)

66.8.2 I2: Adaptive Island Spawning

Islands are dynamically created or destroyed based on search progress. When an island stagnates (no fitness improvement for $g_{\text{stag}}$ generations), it may be reset with fresh random candidates or merged with a successful island. When overall diversity drops, new islands may be spawned with different operator configurations.

66.8.3 I3: Heterogeneous Islands

Different islands use different operator configurations, selection pressures, or even different LLM models. This creates a portfolio of search strategies that hedge against any single strategy being ineffective. One island might use aggressive exploitation (high selection pressure, diff mutation), while another uses broad exploration (low selection pressure, full rewrite mutation).

66.8.4 I4: Hierarchical Islands

Islands are organized in a hierarchy where top-level islands receive only the best migrants from lower-level islands. This creates a natural funnel from exploration at the bottom to exploitation at the top.

66.9 F7: Bandit Algorithms for Adaptive Control

Bandit algorithms provide online learning for adaptive operator, model, and prompt selection during evolution. The key insight is that the optimal mutation operator, LLM model, or prompt template changes over the course of a run, and the system must balance exploiting known-good configurations against exploring alternatives.

66.9.1 B1: UCB1 (Upper Confidence Bound)

The most widely used bandit algorithm in the surveyed systems. UCB1 selects the arm (operator, model, or configuration) that maximizes the upper confidence bound of its estimated reward:

$$a^* = \arg\max_{a \in \mathcal{A}} \left[ \bar{r}_a + c \sqrt{\frac{\ln n}{n_a}} \right]$$

where $\bar{r}_a$ is the empirical mean reward of arm $a$, $n$ is the total number of pulls across all arms, $n_a$ is the number of times arm $a$ has been pulled, and $c > 0$ is the exploration constant. The first term $\bar{r}_a$ favors arms with high observed reward (exploitation); the second term $c\sqrt{\ln n / n_a}$ favors under-explored arms (exploration).

Reward definition. The reward signal varies by application. For mutation operator selection, a common choice is binary reward: $r = 1$ if the mutated child improves upon its parent, $r = 0$ otherwise. For model selection, reward may incorporate cost-normalized improvement: $r = \Delta f / \text{cost}_{\text{API}}$ where $\Delta f$ is the fitness improvement and $\text{cost}_{\text{API}}$ is the API cost of the LLM call.

Theoretical guarantees. UCB1 achieves logarithmic regret under the assumption that rewards are bounded in $[0, 1]$ and stationary (the reward distribution for each arm does not change over time). In the context of LLM-powered evolution, stationarity is violated because the quality of a mutation operator depends on the current population state, which changes as evolution progresses. Nevertheless, UCB1 remains effective in practice because the non-stationarity is gradual and the exploration bonus naturally adapts to distributional drift.

The classic regret bound for UCB1 is gap-dependent:

$$R_n \leq \sum_{a : \Delta_a > 0} \left( \frac{8 \ln n}{\Delta_a} + (1 + \frac{\pi^2}{3}) \Delta_a \right)$$

where $R_n$ is the cumulative regret after $n$ rounds, $\Delta_a = \mu^* - \mu_a$ is the gap between the optimal arm's mean reward $\mu^*$ and arm $a$'s mean reward $\mu_a$, and the sum runs over all suboptimal arms. This bound is logarithmic in $n$, meaning regret grows slowly. However, practitioners should note that this bound assumes stationary rewards. In the evolutionary setting where reward distributions shift as the population evolves, UCB1 functions as a practical heuristic rather than an algorithm with formal guarantees.

# Pseudocode — UCB1 bandit for operator selection
# Appears in: OpenEvolve, GEPA, AlphaEvolve (inferred)

import math

class UCB1Selector:
    """Select among mutation operators using UCB1."""

    def __init__(self, arms: list[str], c: float = 1.414):
        self.arms = arms
        self.c = c                          # Exploration constant
        self.counts = {a: 0 for a in arms}  # n_a: pulls per arm
        self.rewards = {a: 0.0 for a in arms}  # Cumulative reward
        self.total = 0                      # n: total pulls

    def select(self) -> str:
        """Select the arm with highest UCB score."""
        # Ensure each arm is tried at least once
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def ucb_score(arm: str) -> float:
            mean_reward = self.rewards[arm] / self.counts[arm]
            exploration = self.c * math.sqrt(
                math.log(self.total) / self.counts[arm]
            )
            return mean_reward + exploration

        return max(self.arms, key=ucb_score)

    def update(self, arm: str, reward: float) -> None:
        """Update statistics after observing reward."""
        self.counts[arm] += 1
        self.rewards[arm] += reward
        self.total += 1

66.9.2 B2: Thompson Sampling

An alternative to UCB1 that uses Bayesian posterior sampling. Each arm maintains a Beta distribution $\text{Beta}(\alpha_a, \beta_a)$ over its success probability (for Bernoulli rewards). At each round, a sample is drawn from each arm's posterior, and the arm with the highest sample is selected:

$$\theta_a \sim \text{Beta}(\alpha_a, \beta_a), \qquad a^* = \arg\max_{a} \theta_a$$

where $\alpha_a$ is initialized to 1 (prior successes + 1) and incremented by 1 on each success, and $\beta_a$ is initialized to 1 and incremented by 1 on each failure. Thompson sampling naturally handles exploration through posterior uncertainty and adapts more gracefully to non-stationary environments when combined with a decay factor on historical counts.

66.9.3 B3: EXP3 (Exponential-Weight Algorithm for Exploration and Exploitation)

Designed for the adversarial bandit setting where rewards may be non-stochastic. EXP3 maintains a weight $w_a$ for each arm and selects arms with probability:

$$P(a) = (1 - \gamma) \frac{w_a}{\sum_{a'} w_{a'}} + \frac{\gamma}{K}$$

where $K = |\mathcal{A}|$ is the number of arms and $\gamma \in (0, 1]$ is a mixing parameter ensuring minimum exploration probability $\gamma / K$ for every arm. Weights are updated multiplicatively based on importance-weighted rewards. EXP3 provides worst-case regret of $O(\sqrt{KN \ln K})$ where $N$ is the number of rounds, making it suitable when reward distributions are highly non-stationary — as may occur during phase transitions in evolution.

66.9.4 B4: Sliding-Window UCB

A modification of UCB1 that only considers the most recent $W$ observations per arm, discarding older data. This explicitly addresses non-stationarity by forgetting outdated reward information:

$$a^* = \arg\max_{a \in \mathcal{A}} \left[ \bar{r}_{a,W} + c \sqrt{\frac{\ln n_W}{n_{a,W}}} \right]$$

where $\bar{r}_{a,W}$ is the mean reward of arm $a$ over its last $\min(n_a, W)$ observations, $n_W = \sum_a \min(n_a, W)$, and $n_{a,W} = \min(n_a, W)$. The window size $W$ controls the memory horizon: small $W$ tracks rapid changes but increases variance; large $W$ provides stable estimates but reacts slowly to shifts.

66.9.5 Bandit Application Domains

ApplicationArmsReward SignalTypical Algorithm
Mutation operator selectiondiff, rewrite, reflect, repairBinary: child improves parent?UCB1, Thompson
LLM model routingModel variants (e.g., GPT-4, Claude, Gemini)Cost-normalized improvementUCB1
Prompt template selectionPrompt populationChild fitness or improvementThompson, EXP3
Temperature selectionDiscrete temperature valuesDiversity-weighted improvementUCB1
Island configurationOperator/model combos per islandIsland-level progress rateSliding-Window UCB

66.10 Cross-Method Interactions and Design Patterns

Methods from different families do not operate in isolation. Their interactions create emergent system behaviors that can be beneficial or pathological. This section catalogs the most important interaction patterns observed across the surveyed systems.

66.10.1 Selection–Mutation Coupling

Strong selection pressure (high tournament size, steep power-law exponent) reduces population diversity, which in turn reduces the diversity of parents available to mutation operators. If combined with conservative mutation (diff-only), this coupling can cause premature convergence. The recommended mitigation is to pair strong selection with diverse mutation portfolios managed by a bandit.

66.10.2 Evaluation–Population Coupling

Cascade evaluation creates an implicit bias in the population: candidates that pass early stages may not be the same ones that would score highest on the full evaluation. If early-stage filtering is too aggressive (low pass rate on smoke tests), the population may converge on solutions that are "easy to evaluate" rather than truly optimal. The recommended mitigation is to periodically re-evaluate archived elites on the full evaluation to detect cascade artifacts.

66.10.3 Island–Bandit Coupling

When bandits manage per-island operator selection independently, different islands may converge to different optimal configurations — a desirable outcome that creates implicit heterogeneity. However, if migration rates are too high, the bandit statistics on each island are contaminated by migrants that were produced under different operator distributions, confounding the reward signal.

Method Interaction Map Selection (F3) Mutation (F1) Evaluation (F4) Population (F5) Islands (F6) Bandits (F7) Solid: data flow. Dashed: adaptive control.

66.10.4 Common Configuration Patterns

Analysis of the surveyed systems reveals several recurring configuration patterns that combine methods from multiple families:

Pattern NameSelectionMutationPopulationControlSystems
Exploit-HeavyPower-law ($\alpha \geq 2$)Diff-dominantFlat + elitismNoneLate-stage AlphaEvolve
Explore-HeavyUniform / low tournamentRewrite-dominantMAP-Elites gridNoneEarly-stage QD systems
Balanced AdaptiveTournament ($k$=2–3)Bandit-selected portfolioTieredUCB1OpenEvolve, GEPA
Multi-ObjectivePareto + crowdingReflection + crossoverPareto archiveThompsonGEPA
Island PortfolioPer-island variedPer-island variedPer-island flatPer-island banditFunSearch

66.11 Method Selection Guidelines

Choosing the right combination of methods depends on the problem characteristics, budget constraints, and desired solution properties. The following decision framework synthesizes lessons from across the surveyed systems.

Problem complexity and search space size. For narrow optimization tasks (improving a single heuristic parameter), diff mutation with tournament selection on a flat population is sufficient and cost-efficient. For open-ended algorithm discovery where the solution space is vast, full rewrite mutation with MAP-Elites population structure and island models provides the necessary diversity.

Evaluation cost. When evaluation is cheap (sub-second), the system can afford to evaluate many candidates and aggressive filtering is unnecessary. When evaluation is expensive (minutes per candidate), cascade evaluation and novelty filtering become essential, and the system should favor exploitation-heavy selection to avoid wasting budget on unpromising candidates.

Multi-objectivity. If the problem has a single clear objective, any selection strategy works. With multiple competing objectives, Pareto-based selection with crowding distance is strongly preferred over weighted-sum scalarization, as the latter requires knowing the correct weights a priori.

Budget and runtime. For budget-constrained settings (fixed API cost ceiling), bandit-based model selection with cost-normalized rewards is critical for maximizing progress per dollar. For time-constrained settings, parallel island models with asynchronous evaluation provide the best throughput.

66.12 Open Questions and Research Frontiers

Despite the rapid progress cataloged in this chapter, several fundamental questions remain open.

Optimal operator scheduling. While bandits provide a principled framework for adaptive operator selection, the non-stationarity of rewards in evolutionary settings means that theoretical regret guarantees do not directly apply. Developing bandit variants with provable guarantees under the specific non-stationarity patterns of evolutionary search is an open theoretical problem.

Crossover effectiveness. It remains unclear whether LLM-based crossover genuinely combines parental traits or simply uses the parents as inspiration for generating new candidates. Empirical studies measuring the heritability of specific code features through crossover would help resolve this question.

Evaluation fidelity gaps. Cascade evaluation introduces systematic bias by selecting for candidates that perform well on early, potentially unrepresentative test instances. Understanding and mitigating this bias, perhaps through adaptive stage design or periodic full-fidelity re-evaluation, is an important practical concern.

Scalability of behavioral descriptors. MAP-Elites-style population management requires meaningful behavioral descriptors, but defining these for arbitrary algorithm discovery tasks is non-trivial. Automated descriptor discovery — perhaps using LLMs to propose behavioral characterizations — is a promising but largely unexplored direction.

Method transfer across domains. Most systems are configured for specific benchmark families. Whether optimal method configurations transfer across problem domains (e.g., from combinatorial optimization to numerical algorithm design) remains an empirical question that no systematic study has yet addressed.

Chapter Summary

Key takeaway. The methods employed across LLM-powered evolutionary systems can be organized into seven families — mutation, crossover, selection, evaluation, population management, island models, and bandit algorithms — each with 4–8 well-characterized variants. The effectiveness of any method depends critically on its interactions with methods from other families, making system design a combinatorial configuration problem rather than a collection of independent choices.

Main contribution. This catalog provides the first unified, formally specified reference for the complete algorithmic toolkit of LLM-powered evolutionary systems, with consistent notation, explicit assumptions, and cross-system provenance for each method. It enables researchers to reason about method selection systematically rather than adopting configurations from individual systems without understanding the design space.

For the researcher. When designing a new LLM-powered evolutionary system, start with the five common configuration patterns in Section 66.10.4 and adapt based on your problem's evaluation cost, objectivity, and budget constraints. The interaction patterns in Section 66.10 are as important as the individual method specifications — a well-chosen combination of simple methods will typically outperform a collection of sophisticated methods that interact poorly.