Chapter 71

Method Shortcut: Component-by-Component Comparison

Part P09: Synthesis & Future Directions

This chapter is the index to every design decision an LLM-powered evolutionary system has to make. Chapter 67 (Comprehensive Methods Catalog) gives the long-form algorithmic reference for each method, and Chapter 65 (Comparative Architecture Analysis) compares systems layer by layer. This chapter inverts the view: for each design dimension, it enumerates the variants observed in the surveyed systems, shows which systems use which, and states the conditions under which each variant tends to pay off.

How to read this chapter. Fifteen tables, one per design dimension, followed by a cross-cutting meta-analysis (§71.16). Each row gives (i) the variant's name, (ii) a one-line summary, (iii) the Ch67 method ID so you can drop into the full reference, (iv) representative systems that use it with source-type tags and chapter pointers, and (v) the condition under which this variant tends to pay off. Use this chapter as a lookup: identify the dimension you care about, scan the rows, follow the links. Treat "When to pick" as a working rule of thumb, not a recommendation — the supporting ablation evidence is uneven across dimensions (see §71.16).

71.0.1   Methodological Note: What This Taxonomy Is and Is Not

Before the tables, three caveats the reader should keep in mind throughout.

The fifteen dimensions are not strictly orthogonal. Several concepts naturally span more than one table. The most conspicuous overlaps:

  • MAP-Elites is simultaneously a search strategy (§71.1), an archive structure (§71.2), a parent-sampling rule (§71.3), and a diversity mechanism (§71.5). We place each facet in the table it most directly constrains: the high-level decision to run a QD loop belongs to §71.1; the cellular archive data structure to §71.2; the "random cell, then champion" draw to §71.3; the behavioural-descriptor rationale to §71.5.
  • Bandits appear in parent selection (§71.3), model routing (§71.8), operator selection (§71.11), and budget allocation (§71.12). Each table lists only bandit uses specific to that decision; the same bandit algorithm (UCB1, Thompson) can reappear across tables.
  • Reflection crosses mutation (§71.4, "reflect-then-mutate"), memory (§71.10), meta-adaptation (§71.11, prompt evolution), and failure handling (§71.14, reflective repair). We list it in every table where it is the defining design choice for that row.
  • Skills archives appear in population structure (§71.2), prompt co-evolution (§71.9), and memory (§71.10); the same data structure, viewed through three different lenses.

Placement rule. Where a concept could belong to more than one table, we place it in the table whose question it most directly answers — "What shape is the search?" for §71.1, "How are candidates stored?" for §71.2, "How is the next parent drawn?" for §71.3, and so on. Cross-references in the commentary flag the siblings.

Evidence asymmetry. Not every "When to pick" entry rests on controlled ablations. Some dimensions (cascade evaluation, sandboxing, line-level diff mutation) are backed by multiple within-system ablations; others (e.g., crossover value, optimal bandit family, memory retrieval strategies) are under-evidenced and reflect practitioner rules of thumb. §71.16 summarises which claims are well-supported and which are working hypotheses.

71.0.2   Source-Type Tags for "Used by"

Every "Used by" attribution carries a one-letter tag distinguishing how the evidence was obtained. This lets readers audit claims at a glance and weight them accordingly.

Tag | Meaning | Epistemic weight
[P] | Published system with peer-reviewed paper and/or open source release; design is verifiable from artifacts. | Highest: independently auditable.
[I] | Industry / internal system described in a tech report, blog post, or paper without code release; design is reported but not reproducible. | Moderate: depends on the vendor's disclosure.
[B] | Blueprint or reference architecture: a design proposal (including this survey's own Next Evolution Architecture blueprint) that may not yet be fully implemented in any single running system. | Low: aspirational, informs design space rather than evidencing it.
[E] | Experimental branch, community fork, or configuration flag inside an otherwise-published system; the variant is implemented but not the project's default and is usually undocumented. | Low: existence confirmed, impact rarely ablated.

Chapter pointers of the form (Ch##) route to the survey chapter that documents the system. Where a system is flagged [I] or [B], readers should not treat the row as reproducible evidence, only as a design-space observation.

71.0   Method ID Index

The short codes in the Ch67 column throughout this chapter (e.g. M1, S4, P2, B1) are references into Chapter 67 — Comprehensive Methods Catalog, which contains the formal specification, pseudocode, and trade-offs for every method. Each ID in the table below is a clickable anchor that jumps to the matching subsection in Ch67. Ch67's section numbering follows the pattern #s67-<family>-<index> where <family> is the family's ordinal inside Ch67 (3 for Mutation, 4 for Crossover, 5 for Selection, 6 for Evaluation, 7 for Population, 8 for Islands, 9 for Bandits). Use this as the single index for the thirty-four methods across the seven families.

Family | Methods (click an ID to jump to Ch67)
F1 · Mutation | M1 Line-level diff · M2 Full rewrite · M3 Reflection-guided · M4 Error-repair · M5 Guided perturbation · M6 Prompt-template
F2 · Crossover | C1 Two-parent · C2 Multi-parent · C3 Feature-specific · C4 Diff-based
F3 · Selection | S1 Fitness-proportionate · S2 Tournament · S3 Power-law rank · S4 MAP-Elites cell · S5 Pareto · S6 ε-greedy
F4 · Evaluation | E1 Cascade · E2 Sandbox · E3 Multi-instance · E4 Novelty-filtered · E5 Adaptive sample-size
F5 · Population | P1 Flat elitism · P2 MAP-Elites grid · P3 Pareto archive · P4 Tiered · P5 Skills archive
F6 · Islands | I1 Static topology · I2 Adaptive spawning · I3 Heterogeneous · I4 Hierarchical
F7 · Bandits | B1 UCB1 · B2 Thompson · B3 EXP3 · B4 Sliding-window UCB

Shorthand for the rest of this chapter. In the per-dimension tables below, the Ch67 column lists the IDs that apply to each variant. When you see an unfamiliar code (say, B3 in §71.3), scroll back to this index and click it to jump to the formal definition in Ch67. A dash (–) in the Ch67 column means the variant is described only in Ch67's surrounding prose or does not yet have a catalogued method ID.
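The ID-to-anchor mapping described above can be sketched in a few lines. This is only an illustration: the function name `method_anchor` and the lookup table are hypothetical, and the exact anchor string is an assumption inferred from the pattern and the family ordinals stated in this section (3 for Mutation through 9 for Bandits).

```python
# Family letter -> ordinal of that family's section inside Ch67,
# as listed in the text above. The anchor format is an assumption.
FAMILY_ORDINAL = {"M": 3, "C": 4, "S": 5, "E": 6, "P": 7, "I": 8, "B": 9}


def method_anchor(method_id: str) -> str:
    """Map a method ID such as 'B3' to its assumed Ch67 anchor."""
    letter, index = method_id[:1].upper(), method_id[1:]
    if letter not in FAMILY_ORDINAL or not index.isdigit():
        raise ValueError(f"unrecognised method ID: {method_id!r}")
    return f"#s67-{FAMILY_ORDINAL[letter]}-{int(index)}"
```

Under these assumptions, `method_anchor("B3")` resolves to the third subsection of the Bandits family.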

71.1   Search Strategy

The top-level shape of the search. Most downstream choices are constrained by this decision, though several dimensions (mutation, evaluation, sandboxing) remain largely independent.

Variant | Summary | Ch67 | Used by | When to pick
Single-population EA | One flat pool, elitism, each iteration samples a parent and produces one child. | – | OpenEvolve default [P, Ch14], LLM4AD [P, Ch21] | Plausible when evaluators are cheap and diversity is not the binding constraint; fragile on deceptive or multi-modal landscapes.
Island-model EA | Several small populations in a ring; occasional migration between islands. | I1–I4 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12] | Worth the extra compute per island when premature convergence has been observed in single-pool runs on the same task.
MAP-Elites / Quality-Diversity | Grid archive indexed by behavioural descriptors; one champion per cell. | S4, P2 | AlphaEvolve behavioural-grid mode [I, Ch9], GEPA-family variants [P, Ch25] | Typically preferred when the landscape is known to be deceptive and cheap, meaningful descriptors are available; evidence of QD superiority over well-tuned island EAs on code-synthesis tasks is mixed.
Tree search (MCTS / AB-MCTS) | Program-synthesis tree with UCB selection and LLM-proposed expansions. | B1, B2 | AB-MCTS / TreeQuest [P, Ch19], Arcgentica [P, Ch17], Confluence Labs [I, Ch28], ALE-Agent [P, Ch20] | Useful when the task decomposes into verifiable step-by-step construction (ARC-AGI, heuristic contests) and when an informative per-node reward exists.
Hybrid (loop + tree) | Outer evolutionary loop whose mutation operator is itself a tree search. | – | ShinkaEvolve@ICFP [P, Ch12], Sakana Marlin [I, Ch30] | Reasonable when the problem has both long-horizon strategy choice and low-level synthesis, and when the added implementation complexity is justified by failures of either tier alone.
Evolution strategies / ES-at-scale | Perturb a parameter vector with Gaussian noise; reweight by fitness; no LLM mutation. | – | EGGROLL [P, Ch45], Evolution Strategies at Scale [I, Ch46], EvoX [P, Ch47] | Applicable only when the search object is a continuous parameter vector and massive parallel evaluation is available; not comparable to LLM-mutator systems on code tasks.
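To make the island-model row concrete, here is a minimal sketch of one ring-migration step. It is not taken from any surveyed system: the function name `migrate_ring`, the `(fitness, candidate)` tuple layout, and the replace-the-worst policy are all assumptions chosen for illustration.

```python
def migrate_ring(islands, n_migrants=1):
    """Copy each island's best candidates to its clockwise neighbour.

    islands: list of populations; each population is a list of
    (fitness, candidate) tuples where higher fitness is better.
    Population sizes are preserved by dropping the worst entries.
    """
    # Snapshot the emigrants first so migration within one step
    # does not cascade around the ring.
    emigrants = [
        sorted(pop, key=lambda fc: fc[0], reverse=True)[:n_migrants]
        for pop in islands
    ]
    for i, migrants in enumerate(emigrants):
        target = islands[(i + 1) % len(islands)]
        target.extend(migrants)
        target.sort(key=lambda fc: fc[0], reverse=True)
        # Remove as many of the worst entries as migrants arrived.
        del target[len(target) - len(migrants):]
    return islands
```

How often this step runs is the main tuning knob: the rarer the migration, the longer the islands diverge before good ideas spread.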

71.2   Population Structure

How candidates are stored and organised. Distinct from the search strategy: a FunSearch-lineage EA can in principle be paired with any of these archives, although in practice certain pairs dominate (see §71.16).

Variant | Summary | Ch67 | Used by | When to pick
Flat pool with elitism | Keep the top-k candidates by fitness; new children replace the worst. | P1 | OpenEvolve flat mode [P, Ch14], LLM4AD [P, Ch21] | Defensible as a baseline when the landscape looks unimodal; breaks down on deceptive tasks.
Ranked archive | All candidates ranked by a scalar; parent sampling by power-law over rank. | S3 | ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | A reasonable compromise when soft elitism is wanted (strong parents preferred, weaker ones still contribute) and the archive can grow large without memory pressure.
MAP-Elites grid archive | Behavioural descriptor partitions the space into cells; best candidate per cell. | P2, S4 | AlphaEvolve [I, Ch9], several derivative forks [E] | Justified when explicit diversity is an objective and cheap descriptors (runtime, length, ASI tags) are available; the descriptor choice usually matters more than the grid resolution.
Pareto archive | Non-dominated front under multi-objective fitness. | P3, S5 | AlphaEvolve multi-obj mode [I, Ch9], GEPA variants [E] | Appropriate when multiple competing objectives have unknown relative weights; less useful when a scalarisation is already trusted.
Tiered population | Multiple pools at different refinement stages; promotion by evaluation tier. | P4 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Natural fit when the evaluator is a cascade whose stages differ by an order of magnitude in cost.
Skills archive / knowledge base | Candidates stored as reusable skills with natural-language descriptions. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], RetroAgent [P, Ch34] | Most useful in agent systems where solutions are expected to compose across tasks; adds retrieval overhead that is wasted on single-task code synthesis.
Knowledge graph / graph memory | Entities and relations stored in a structured graph; retrieved by traversal. | – | OmniScientist [P, Ch52], Zochi [P, Ch53], Omni-SimpleMem [P, Ch51] | Worth the indexing cost primarily for research agents doing cross-paper synthesis; rarely beneficial in closed-box algorithmic tasks.
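The MAP-Elites grid archive row reduces to a small data structure: a mapping from a discretised behavioural descriptor to the best candidate seen in that cell. The sketch below is a generic illustration, not the implementation of any surveyed system; the class name `GridArchive`, the uniform binning, and the descriptor bounds are assumptions.

```python
class GridArchive:
    """One champion per cell of a uniformly binned descriptor space."""

    def __init__(self, bins, lows, highs):
        # bins, lows, highs: one entry per descriptor dimension.
        self.bins, self.lows, self.highs = bins, lows, highs
        self.cells = {}  # cell index tuple -> (fitness, candidate)

    def cell_of(self, descriptor):
        """Discretise a descriptor, clamping out-of-range values."""
        index = []
        for x, n, lo, hi in zip(descriptor, self.bins, self.lows, self.highs):
            frac = (x - lo) / (hi - lo)
            index.append(min(n - 1, max(0, int(frac * n))))
        return tuple(index)

    def add(self, candidate, fitness, descriptor):
        """Insert if the cell is empty or the newcomer is fitter."""
        key = self.cell_of(descriptor)
        incumbent = self.cells.get(key)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[key] = (fitness, candidate)
            return True
        return False

    def sample_parent(self, rng):
        """Pick a random occupied cell uniformly, return its champion."""
        key = rng.choice(sorted(self.cells))
        return self.cells[key][1]
```

`sample_parent` takes any object with a `choice` method, for example `random.Random(0)`, which keeps runs reproducible.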

71.3   Parent Selection & Sampling

How the next parent is drawn from whatever archive you chose in §71.2.

Variant | Summary | Ch67 | Used by | When to pick
Uniform | Sample any archive entry with equal probability. | – | FunSearch baselines [P, Ch5], MAP-Elites default [P, Ch9] | Sensible when the archive itself already encodes selection pressure (MAP-Elites cells); otherwise tends to waste budget on weak entries.
Fitness-proportionate (roulette) | Sampling probability proportional to fitness. | S1 | small-population baselines [P] | Rarely the best choice; sensitive to fitness scaling. Tournament is almost always preferable in practice.
Tournament | Sample k at random, pick the best. | S2 | ShinkaEvolve [P, Ch12], OpenEvolve [P, Ch14] | A robust default when fitness scaling is unknown; scale-invariant, tunable via k to dial selection pressure.
Power-law rank | Rank the archive and sample with density ∝ rank^(−α). | S3 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | Useful on long archives where soft elitism is wanted; α acts as a continuous explore/exploit knob.
MAP-Elites cell sampling | Pick a random occupied cell uniformly, then the champion of that cell. | S4 | AlphaEvolve QD mode [I, Ch9] | The idiomatic draw for a behavioural grid; other rules tend to undo the archive's diversity pressure.
ε-greedy | With probability ε pick uniform, otherwise pick the current best. | S6 | small-population baselines [P] | Adequate as a cheap exploration knob when no diversity machinery exists; tends to be outperformed by tournament or rank sampling when either is available.
UCB1 over candidates | Treat each candidate as a bandit arm; confidence-bound selection. | B1 | AB-MCTS [P, Ch19], TreeQuest [P, Ch19] | Natural in tree search where each node accumulates rollouts; less obviously beneficial over tournament in flat archives.
Thompson sampling | Sample from posterior; parent is arg-max of the sample. | B2 | ShinkaEvolve adaptive modes [E, Ch12], RD-Agent [P, Ch38] | Most useful on non-stationary problems where a calibrated probabilistic reward model already exists.
EXP3 / sliding-window UCB | Adversarial or non-stationary bandits for drifting rewards. | B3, B4 | hierarchical model routers [B], adaptive operator pools [E] | Applicable when the reward distribution is known to shift over the run; overkill for stationary tasks.
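Three of the sampling rules in the table above (tournament, power-law rank, ε-greedy) are short enough to state directly. The sketch below follows the one-line summaries in the table; the function names and the `(fitness, candidate)` archive layout are assumptions made for illustration.

```python
import random


def tournament(archive, k, rng):
    """Sample k entries at random and return the fittest candidate."""
    contenders = rng.sample(archive, min(k, len(archive)))
    return max(contenders, key=lambda fc: fc[0])[1]


def power_law_rank(archive, alpha, rng):
    """Sample with probability proportional to rank^(-alpha), rank 1 = best."""
    ranked = sorted(archive, key=lambda fc: fc[0], reverse=True)
    weights = [(rank + 1) ** (-alpha) for rank in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0][1]


def epsilon_greedy(archive, epsilon, rng):
    """With probability epsilon pick uniformly, otherwise pick the best."""
    if rng.random() < epsilon:
        return rng.choice(archive)[1]
    return max(archive, key=lambda fc: fc[0])[1]
```

Setting `alpha=0` makes `power_law_rank` uniform, and raising `k` towards the archive size makes `tournament` greedy, which is the sense in which both parameters act as selection-pressure knobs.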

71.4   Change Operators (Mutation)

How the LLM turns a parent into a child. Empirically the single largest source of inter-system variation.

Variant | Summary | Ch67 | Used by | When to pick
Line-level diff | LLM returns a unified diff that is applied to the parent. | M1 | AlphaEvolve default [I, Ch9], OpenEvolve [P, Ch14], GEPA [P, Ch25] | Effective on mature candidates where surgical edits dominate; token-efficient; fails when structural change is needed.
Full rewrite | LLM returns a complete new version of the target file or function. | M2 | ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21], AI Scientist [P, Ch48] | Usually preferred during early exploration, when structural change dominates, or when diffs repeatedly fail to apply.
Reflect-then-mutate | LLM first analyses weaknesses, then produces an edit informed by the analysis. | M3 | GEPA [P, Ch25], EurekaClaw [P, Ch15], RetroAgent [P, Ch34], Darwinian Evolver [P, Ch24], NeoSigma [I, Ch35] | Worth the extra tokens mainly when the evaluator is expensive and rich textual feedback is available; the sample-efficiency advantage shrinks on cheap evaluators.
Error-repair mutation | LLM sees error traces and is asked to produce a minimal patch. | M4 | OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12], most serious systems | Typically included as a fallback operator whenever execution can fail; near-universal in production systems.
Guided perturbation | Small semantic edit biased by a learning-log hint or benchmark feedback. | M5 | A-Evolve [P, Ch16], ShinkaEvolve variant [E, Ch12] | Pays off when the learning log carries high-quality, transferable advice; otherwise tends to degrade to random perturbation.
Prompt-template mutation | The mutated object is a prompt, not the code the prompt acts on. | M6 | GEPA [P, Ch25], GEPA Skills [P, Ch26], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | The right choice when the "program" being optimised is itself a prompt template or harness.
Two-parent crossover | LLM given two parents, asked to combine their strengths. | C1 | ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21] | Evidence is mixed: helpful on diverse high-quality archives, but whether it genuinely mixes or merely inspires is unresolved in the published ablations.
Multi-parent crossover | Generalises C1 to k parents; typically feature-level merge. | C2 | experimental branches of ShinkaEvolve [E, Ch12] | Plausible on large diverse archives where many ideas can combine; not reliably validated against two-parent crossover.
Feature-specific crossover | Extract a named subroutine from one parent, transplant into another. | C3 | Arcgentica [P, Ch17], some GEPA variants [E, Ch25] | Most useful when the representation supports clean decomposition (DSL, function library).
Diff-based crossover | Apply one parent's diff-from-root to another parent. | C4 | OpenEvolve experimental [E, Ch14] | Applicable when candidates share a common lineage; little independent validation of its benefit.
Gaussian perturbation (ES) | Add scaled noise to a parameter vector; no LLM involved. | – | EGGROLL [P, Ch45], Evolution Strategies at Scale [I, Ch46], EvoX [P, Ch47] | Applicable to continuous parameter spaces only. Not an LLM operator; listed here for taxonomic completeness.

71.5   Diversity & Novelty Machinery

How the system avoids collapsing onto a single idiom.

Variant | Summary | Ch67 | Used by | When to pick
None | Rely on fitness alone; accept collapse as a risk. | – | small-scale baselines [P] | Defensible only on short runs with small budgets where collapse is unlikely to be reached.
Behavioural descriptor + grid | MAP-Elites; diversity is enforced by the archive shape. | P2 | AlphaEvolve [I, Ch9], OpenEvolve QD mode [E, Ch14] | A strong choice when cheap, meaningful descriptors exist; the descriptor choice matters more than the grid resolution.
Embedding-distance novelty archive | Candidate accepted if far enough (in embedding space) from existing archive entries. | E4 | GEPA [P, Ch25], Omni-SimpleMem [P, Ch51], several research agents [P] | Usually the fallback when no natural descriptor exists but an embedder is available; sensitive to the embedder's alignment with task-relevant structure.
Similarity-filtered acceptance | Reject children too similar (by hash, tree edit, or embedding) to recent parents. | – | ShinkaEvolve [P, Ch12], OpenEvolve variant [E, Ch14] | Useful for fighting prompt collapse inside an already-running system; evidence is anecdotal.
Island isolation | Islands diverge naturally because migration is rare. | I1–I4 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12] | Population-level diversity at the cost of compute per island; scales well with budget.
Novelty-penalised fitness | Fitness function explicitly penalises similarity to recent candidates. | – | EurekaClaw [P, Ch15], GEPA [P, Ch25] | Viable when a meaningful distance in reward space exists; coefficient tuning is usually ad hoc.
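The embedding-distance novelty archive in the table above amounts to a nearest-neighbour threshold test. The following sketch assumes embeddings arrive as plain lists of floats from some external embedder (not shown), uses cosine distance as the metric, and performs a linear scan; the names `NoveltyArchive` and `min_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


class NoveltyArchive:
    """Accept a candidate only if it is far enough from every stored entry."""

    def __init__(self, min_distance):
        self.min_distance = min_distance
        self.embeddings = []

    def try_add(self, embedding):
        is_novel = all(
            cosine_distance(embedding, stored) >= self.min_distance
            for stored in self.embeddings
        )
        if is_novel:
            self.embeddings.append(list(embedding))
        return is_novel
```

The threshold interacts with the embedder: a metric that does not track task-relevant structure will reject useful variants or admit trivial rewrites, which is the sensitivity the table warns about.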

71.6   Evaluation & Fitness Assessment

How a candidate is scored. The cost and fidelity of this step tend to dominate the overall budget.

Variant | Summary | Ch67 | Used by | When to pick
Single-stage execution | Run once on the full test set; use the returned scalar. | – | small baselines [P], GEPA simple mode [P, Ch25] | Adequate when a single run is cheap relative to LLM calls; otherwise a cascade usually wins.
Cascade (smoke → light → full) | Tiered evaluator; candidates that fail early stages never reach expensive ones. | E1 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Strongly preferred whenever evaluator cost varies by 10× or more across fidelity levels; one of the most reliably beneficial choices in the survey (see §71.16).
Sandbox execution | Run in a resource-capped isolated environment with no network. | E2 | AutoHarness [P, Ch31], OpenEvolve [P, Ch14], AlphaEvolve [I, Ch9] | A safety requirement rather than a tuning choice whenever the system executes LLM-generated code.
Multi-instance aggregation | Score on many instances; aggregate by mean, median, or worst-case. | E3 | LLM4AD [P, Ch21], ShinkaEvolve [P, Ch12], ALE-Agent [P, Ch20] | Worth the cost on noisy or distribution-sensitive benchmarks; the aggregator (mean vs. worst) materially affects the robustness/progress trade-off.
Novelty-filtered evaluation | Reject candidates that fail a similarity filter before scoring. | E4 | GEPA [P, Ch25], Omni-SimpleMem [P, Ch51] | Reasonable when novelty is a first-class objective and the filter is cheap compared to the evaluator.
Adaptive sample size | More instances for promising candidates, fewer for weak ones. | E5 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Useful for variance-aware budgets on large test sets; related to racing algorithms.
LLM-as-judge | A strong LLM scores the candidate's output; no execution required. | – | AI Scientist [P, Ch48], AI-Researcher [P, Ch55], research agents [P] | Often the only option when execution is impossible (open-ended research, paper writing); known to introduce stylistic bias.
Hybrid LLM + execution | Execution for correctness, LLM for style/quality. | – | Meta-Harness [P, Ch32], RD-Agent [P, Ch38] | Appropriate when both correctness and subjective quality matter and their signals are separable.
Formal verifier | Proof checker confirms correctness; fitness is proof length or compilation time. | – | OpenProver [P, Ch40], Pi-Autoresearch [P, Ch39] | The idiomatic choice for theorem proving and verified code synthesis; rarely applicable outside those niches.
ASI diagnostic feedback | Structured failure analysis returned alongside the score; fed back into the learning log. | – | Next Evolution Architecture blueprint [B], GEPA [P, Ch25], RetroAgent [P, Ch34] | Most valuable when the evaluator can plausibly teach the system, not just grade it; this implies a rich per-failure signal.
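The cascade row is the easiest of these to express in code: stages run cheapest first, and a candidate that misses a stage threshold never reaches the later, more expensive stages. The sketch below is a generic illustration; the function name `cascade_evaluate`, the `(name, evaluator, threshold)` stage layout, and the result dictionary are assumptions, not an interface from any surveyed system.

```python
def cascade_evaluate(candidate, stages):
    """Run evaluation stages in order, stopping at the first failure.

    stages: list of (name, evaluator, threshold) tuples, cheapest first.
    Each evaluator takes the candidate and returns a float score; a score
    below the stage threshold ends the cascade early.
    """
    score = None
    for name, evaluator, threshold in stages:
        score = evaluator(candidate)
        if score < threshold:
            return {"passed": False, "stopped_at": name, "score": score}
    # The final stage's score is the candidate's fitness.
    return {"passed": True, "stopped_at": None, "score": score}
```

The saving comes entirely from the calls that are skipped, so the pattern pays off in proportion to the cost gap between the first and last stage.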

71.7   LLM Orchestration

How many LLMs are involved and how they are wired together.

Variant | Summary | Ch67 | Used by | When to pick
Single-model mutator | One LLM does everything. | – | small-scale baselines [P], Karpathy Autoresearch [I, Ch54] | Sensible when simplicity or single-provider deployment is a constraint.
Model pool + bandit routing | Several LLMs with different cost/quality profiles; bandit learns which one to call when. | B1–B4 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], RD-Agent [P, Ch38] | Worth the routing overhead when cost is a binding constraint and providers differ meaningfully on the target task distribution.
Proposer + critic | One LLM proposes, a second LLM critiques; the critic's feedback informs the next proposal. | – | GEPA [P, Ch25], AI Scientist v2 [P, Ch49], K-Dense Co-Scientist [P, Ch56] | Often a good trade when the evaluator is expensive but a cheap critic can identify obvious failures.
Hierarchical drafter + improver | Cheap model drafts many candidates, strong model improves only the promising ones. | – | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], Sakana Marlin [I, Ch30] | The most common cost-aware pattern in 2026-era systems when cost is the dominant bottleneck.
Multi-agent debate | Multiple LLM instances argue; final output is the consensus. | – | AI Scientist v2 [P, Ch49], DeepScientist [P, Ch43], AgentLaboratory [P, Ch44] | Most common in research agents and open-ended reasoning; little evidence of benefit on verifiable tasks.
Ensemble voting | Multiple LLMs each answer; majority or quality-weighted vote. | – | Confluence Labs [I, Ch28], AB-MCTS aggregation step [P, Ch19] | Reasonable when individual model errors are approximately independent and each call is cheap.

71.8   Model Selection & Routing

Given a model pool, which model is called for a given operation?

Variant | Summary | Ch67 | Used by | When to pick
Static assignment | Each operation is hard-coded to a specific model. | – | OpenEvolve [P, Ch14], LLM4AD [P, Ch21] | Adequate for early prototypes or when only one provider is available.
Cost-tier routing | Cheap model for draft / reject, strong model for promising candidates. | – | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | A strong default when at least two tiers are available and their cost ratio exceeds roughly 5×.
Hierarchical bandit | Nested bandits over model family and model instance. | B1, B2 | Next Evolution Architecture blueprint [B], RD-Agent [P, Ch38] | Overhead becomes worth it when provider diversity is high and cost/quality ratios drift over the run.
Per-stage routing | Model chosen per evaluation-cascade stage. | – | ShinkaEvolve [P, Ch12], Sakana Marlin [I, Ch30] | Effective when each cascade stage rewards a different skill (cheap filters vs. deep analysis).
Budget-aware router | Model chosen to maximise expected fitness per remaining dollar. | – | Next Evolution Architecture blueprint [B], AlphaEvolve internal [I, Ch9] | Appropriate under a hard fixed budget with a calibrated marginal-value model; theoretically clean but rarely ablated in public.
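Bandit-based routing, which recurs in §71.7 and in the hierarchical-bandit row above, can be illustrated with a flat UCB1 router over model names. This is the textbook UCB1 rule, not the router of any surveyed system; the class name `UCB1Router`, the model labels, and the assumption that rewards are scaled to [0, 1] are ours.

```python
import math


class UCB1Router:
    """Choose which model to call using the standard UCB1 index."""

    def __init__(self, models, c=math.sqrt(2)):
        self.c = c  # exploration constant; sqrt(2) gives classic UCB1
        self.counts = {m: 0 for m in models}
        self.totals = {m: 0.0 for m in models}

    def select(self):
        # Try every model once before trusting any estimate.
        for model, n in self.counts.items():
            if n == 0:
                return model
        t = sum(self.counts.values())

        def ucb(model):
            n = self.counts[model]
            mean = self.totals[model] / n
            return mean + self.c * math.sqrt(math.log(t) / n)

        return max(self.counts, key=ucb)

    def update(self, model, reward):
        """Record the reward (assumed in [0, 1]) observed for one call."""
        self.counts[model] += 1
        self.totals[model] += reward
```

A hierarchical variant would nest one such router per model family under a top-level router over families. Cost-awareness can be folded into the reward, for example by dividing the fitness gain by the call's price, though that choice is itself an assumption to validate.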

71.9   Prompt Engineering & Co-Evolution

Who writes the prompts, and do they evolve?

Variant | Summary | Ch67 | Used by | When to pick
Static prompt | Hand-written prompt, unchanged throughout the run. | – | FunSearch [P, Ch5], OpenEvolve [P, Ch14] | Reasonable as a baseline when the prompt is already well-tuned or when reproducibility is paramount.
Prompt slot mutation | Specific placeholders (persona, style, constraints) are mutated each iteration. | – | ShinkaEvolve [P, Ch12], AlphaEvolve variant [I, Ch9] | Low-cost diversity source; typically a cheap win compared to static prompts on long runs.
Prompt populations | A second population of prompts co-evolves alongside the primary population. | – | GEPA [P, Ch25], GEPA Skills [P, Ch26], AutoAgent [P, Ch7] | Worth the complexity when prompts, not code, carry the task-specific knowledge.
Reflect-and-rewrite-prompt | After each cycle the LLM rewrites the prompt based on what worked. | – | GEPA [P, Ch25], RetroAgent [P, Ch34], Darwin Gödel Machine [P, Ch23] | Most useful on long runs with rich feedback and tasks whose structure drifts over time.
Instruction co-evolution | Prompts, skills, and the harness evolve together under a joint fitness. | – | Darwin Gödel Machine [P, Ch23], NeoSigma [I, Ch35], EvoSkill [P, Ch33] | The right regime for self-improving agents when the boundary between code and prompt dissolves; sample-inefficient.
Skill prompts | Candidates are named skills with natural-language descriptions, retrieved by similarity. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], RetroAgent [P, Ch34] | Fit for compositional agent tasks with reusable sub-solutions; retrieval quality dominates outcomes.

71.10   Memory, Learning Logs & Skills

What the system remembers between iterations — and how memory is read back into the mutator.

Variant | Summary | Ch67 | Used by | When to pick
No memory | Each iteration is independent. | – | OpenEvolve default [P, Ch14], FunSearch baselines [P, Ch5] | Adequate for cheap iterations or stateless search where insight reuse has not been shown to help.
Learning log (text) | Append-only log of candidate, score, observation; subset retrieved into prompt. | – | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | Worth adding whenever insight reuse plausibly matters; a common cheap win.
Embedded log (vector) | Learning log entries indexed in a vector store; retrieved by semantic similarity. | – | Omni-SimpleMem [P, Ch51], RetroAgent [P, Ch34], several research agents [P] | Typically needed on long runs with thousands of entries where linear scans become a bottleneck.
Skills archive | Distilled reusable primitives with natural-language descriptions. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], AutoEvolver [P, Ch36] | Most useful in agent tasks where solutions compose; overkill for single-task code synthesis.
Knowledge graph | Entities, relations, observations stored in a graph; retrieved by traversal. | – | OmniScientist [P, Ch52], Zochi [P, Ch53], K-Dense Co-Scientist [P, Ch56] | Warranted mainly for research agents doing cross-paper synthesis; graph maintenance cost is non-trivial.
Lessons library | High-level abstracted heuristics learned from multiple runs. | – | Darwin Gödel Machine [P, Ch23], AI Scientist v2 [P, Ch49], Karpathy Autoresearch [I, Ch54] | Interesting for meta-level self-improvement over many tasks; evidence base for within-run benefit is thin.
Retrieval-augmented mutation | Parents are fetched not only by fitness but by relevance to a retrieved log entry. | – | ShinkaEvolve adaptive mode [E, Ch12], GEPA [P, Ch25] | Useful when similar past candidates plausibly carry transferable advice; depends on retrieval quality.
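The text learning log in the table above is an append-only list plus a rule for choosing which entries reach the prompt. The sketch below picks the best-scoring entries and the most recent ones; that retrieval rule, and the names `LogEntry`, `LearningLog`, and `for_prompt`, are assumptions made for illustration rather than the behaviour of any surveyed system.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    candidate_id: str
    score: float
    observation: str


class LearningLog:
    """Append-only log; a small subset is rendered into the mutator prompt."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)

    def for_prompt(self, k_best=2, k_recent=2):
        """Render the k_best highest-scoring and k_recent latest entries."""
        best = sorted(self._entries, key=lambda e: e.score, reverse=True)[:k_best]
        recent = self._entries[-k_recent:] if k_recent else []
        seen, lines = set(), []
        for entry in best + recent:
            if entry.candidate_id in seen:
                continue  # an entry can be both best and recent
            seen.add(entry.candidate_id)
            lines.append(
                f"- {entry.candidate_id} (score {entry.score:.3f}): {entry.observation}"
            )
        return "\n".join(lines)
```

Replacing the linear scan in `for_prompt` with a similarity lookup is what turns this into the embedded-log variant.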

71.11   Meta-Level Adaptation & Self-Improvement

Which parts of the system can evolve themselves?

Variant | Summary | Ch67 | Used by | When to pick
None (frozen) | Every hyperparameter is fixed at the start. | – | FunSearch [P, Ch5], OpenEvolve default [P, Ch14] | Preferred when reproducibility is the dominant concern or when the system is used as a baseline.
Adaptive mutation rate | Mutation strength increased on stagnation, decreased on progress. | – | ShinkaEvolve [P, Ch12], A-Evolve [P, Ch16] | Worth the tuning cost on long runs where the landscape's roughness changes with iteration.
Operator bandit | Bandit over a pool of mutation operators; rewards update by child fitness. | B1, B3 | ShinkaEvolve [P, Ch12], GEPA [P, Ch25], LLM4AD [P, Ch21] | Useful when multiple mutation styles each dominate in different regimes; bandit restart logic matters.
Prompt evolver | Prompts evolve alongside candidates (see §71.9). | – | GEPA [P, Ch25], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | See §71.9; same applicability conditions.
Harness evolution | The agent harness (tools, scratchpads, sandboxes) evolves as part of the optimisation. | – | Meta-Harness [P, Ch32], AutoHarness [P, Ch31], AutoAgent [P, Ch7] | Most useful on agent benchmarks where harness quality dominates model quality.
Self-modifying code (Gödelian) | The system proposes changes to its own source code and deploys the winning version. | – | Darwin Gödel Machine [P, Ch23], Darwinian Evolver [P, Ch24], NeoSigma [I, Ch35] | Research settings only; requires strong safety gates and offers uncertain efficiency gains.
Full self-rewrite | End-to-end regeneration of the system across many axes simultaneously. | – | Darwin Gödel Machine [P, Ch23], AutoEvolver [P, Ch36] | Experimental; sample-inefficient but maximally open-ended.

71.12   Cost & Budget Management

How the system decides when to stop spending and where to spend next.

Variant | Summary | Ch67 | Used by | When to pick
Fixed budget cap | Hard stop at N calls or $X cost. | – | Every serious system in practice [P/I] | Should be present as a safety net regardless of other budget machinery.
Cascade early stop | Abandon evaluation as soon as the light stage fails. | E1 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | Very strong win whenever cascade-stage cost varies 10× or more; see §71.16.
Adaptive sample size | More instances for promising candidates (racing). | E5 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Useful when evaluator variance is high and test sets are large.
Model-tier routing | Route by marginal fitness-per-dollar. | – | AlphaEvolve [I, Ch9], Sakana Marlin [I, Ch30], RD-Agent [P, Ch38] | Effective when provider pools span large cost ratios; static cost-tier routing usually captures most of the gain (§71.8).
Bandit budget split | Bandit dynamically reallocates budget across operators, islands, or models. | B1–B4 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Most justified on long runs with shifting regimes.
Per-stage cost ceiling | Each cascade stage has an independent cost budget. | – | ShinkaEvolve [P, Ch12], Meta-Harness [P, Ch32] | Useful when stage costs are predictable and bounded; prevents one stage from starving the others.
Marginal-value stopping | Stop when expected improvement per dollar falls below a threshold. | – | AlphaEvolve internal [I, Ch9], Next Evolution Architecture blueprint [B] | Theoretically clean; requires a calibrated improvement model that is rare in practice.
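The fixed budget cap and marginal-value stopping rows combine naturally into one check. The sketch below is deliberately crude: it substitutes the improvement realised over a trailing window for the calibrated expected-improvement model that the table says is rare in practice. The class name `SpendMonitor` and its parameters are hypothetical.

```python
class SpendMonitor:
    """Hard cost cap plus a trailing-window gain-per-dollar stopping rule."""

    def __init__(self, max_cost, window=5, min_gain_per_dollar=0.01):
        self.max_cost = max_cost
        self.window = window
        self.min_gain_per_dollar = min_gain_per_dollar
        self.spent = 0.0
        self.history = []  # (cumulative cost, best fitness so far)

    def record(self, cost, best_fitness):
        self.spent += cost
        self.history.append((self.spent, best_fitness))

    def should_stop(self):
        if self.spent >= self.max_cost:
            return True  # fixed budget cap: always enforced
        if len(self.history) <= self.window:
            return False  # not enough data for the marginal-value rule
        cost_then, best_then = self.history[-self.window - 1]
        cost_now, best_now = self.history[-1]
        spent_in_window = cost_now - cost_then
        if spent_in_window <= 0:
            return False
        gain_per_dollar = (best_now - best_then) / spent_in_window
        return gain_per_dollar < self.min_gain_per_dollar
```

Because realised gains in evolutionary search are bursty, a short window will stop runs just before a breakthrough; the window length is as much a modelling assumption as the threshold.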

71.13   Sandboxing & Safety

A safety requirement rather than a tuning choice: any system executing LLM-generated code needs isolation. The question is how much.

| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Subprocess | Untrusted code in a child process with minimal OS isolation. |  | OpenEvolve dev mode [P, Ch14], small prototypes [P] | Development only. Not appropriate for production or shared infrastructure. |
| Container (Docker) | Candidate runs in a restricted container with no network. |  | AutoHarness [P, Ch31], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | The typical production default for code-execution evaluators. |
| Syscall whitelist | seccomp / AppArmor filter on allowed syscalls. |  | AlphaEvolve internal [I, Ch9], Next Evolution Architecture blueprint [B] | Warranted when adversarial code (or reward-hacking) is a realistic threat. |
| Resource caps | Hard CPU, RAM, wall-clock, disk limits. |  | Every serious system [P/I] | Cheap to add and catches infinite loops; a standard baseline. |
| Network disabled | No outbound network from the sandbox. |  | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], AutoHarness [P, Ch31] | Standard for code-execution evaluators; deliberate exceptions require strong justification. |
| Static-analysis gate | Reject candidates whose AST contains banned constructs (exec, eval, import os). |  | Meta-Harness [P, Ch32], Next Evolution Architecture blueprint [B] | A belt-and-braces complement to the sandbox; useful when the cost of one bad execution is high. |
| Budget guard | Separate watchdog that kills runs exceeding cost or token limits. |  | AutoAgent [P, Ch7], RD-Agent [P, Ch38], Sakana Marlin [I, Ch30] | Standard practice for long autonomous runs where runaway costs are a realistic failure mode. |
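
As an illustration of the static-analysis gate row, the following is a minimal sketch using Python's standard `ast` module. The banned names come from the table row; the function name `static_gate` and the violation-list return format are assumptions for this example, not the interface of any surveyed system.

```python
import ast

BANNED_CALLS = {"exec", "eval"}  # constructs named in the table row
BANNED_MODULES = {"os"}          # extend per deployment (assumption)


def static_gate(source: str) -> list[str]:
    """Return a list of violations; an empty list means the candidate passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]

    violations = []
    for node in ast.walk(tree):
        # Direct calls such as eval(...) or exec(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(f"line {node.lineno}: call to {node.func.id}()")
        # import os, import os.path
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    violations.append(f"line {node.lineno}: import {alias.name}")
        # from os import ...
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_MODULES:
                violations.append(f"line {node.lineno}: from {node.module} import")
    return violations
```

A gate like this is easy to evade (for example through `__import__` or `getattr` on builtins), which is why the table treats it as a complement to the sandbox rather than a replacement for it.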

71.14   Failure Handling & Repair

What happens when a candidate crashes, loops, or fails its evaluator?

| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| None | Failed candidate is simply discarded; no learning from failure. |  | Baselines [P] | Appropriate only when failure traces carry no information, which is rare. |
| Retry | Re-run with same inputs; if flaky it may pass on retry. |  | OpenEvolve [P, Ch14], AutoHarness [P, Ch31] | Necessary on genuinely flaky evaluators (real-world benchmarks); should be bounded to avoid masking real bugs. |
| Error-repair operator | On failure, call the LLM to patch the candidate given the error trace. | M4 | most serious systems [P/I] | Standard as a fallback operator whenever execution can fail. |
| Learning-log entry | Record the failure with context so future prompts can avoid the same mistake. |  | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | Valuable when failure modes are informative and the log is actually read back into subsequent prompts. |
| Candidate quarantine | Track chronically failing candidates separately; never sample them as parents. |  | ShinkaEvolve [P, Ch12], RetroAgent [P, Ch34] | Useful when repair consistently fails on a minority of candidates that would otherwise keep consuming budget. |
| Reflective repair | Reflection step analyses the failure and proposes both a repair and a lesson for the log. |  | GEPA [P, Ch25], NeoSigma [I, Ch35], Darwin Gödel Machine [P, Ch23] | Best on expensive evaluators where each failure must teach something to justify its cost. |
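
Several of these rows compose naturally into one control flow: bounded retry, then error-repair, with a learning-log entry per failure and quarantine as the last resort. The sketch below shows one possible ordering. The `evaluate` and `repair` callables and the `FailureHandler` class are hypothetical stand-ins (in a real system `repair` would wrap an LLM call), and the ordering itself is an assumption rather than something the surveyed systems are documented to share.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not taken from any surveyed system.
Evaluate = Callable[[str], Tuple[bool, float, str]]  # -> (ok, score, error_trace)
Repair = Callable[[str, str], str]                   # (candidate, trace) -> patched


@dataclass
class FailureHandler:
    evaluate: Evaluate
    repair: Repair
    max_retries: int = 1   # bounded, so flakiness cannot mask real bugs
    max_repairs: int = 2
    learning_log: list = field(default_factory=list)
    quarantine: set = field(default_factory=set)

    def run(self, candidate: str) -> Optional[float]:
        """Return a score, or None if the candidate ends up quarantined."""
        current = candidate
        for repair_round in range(self.max_repairs + 1):
            trace = ""
            # First attempt plus up to max_retries identical re-runs.
            for _ in range(self.max_retries + 1):
                ok, score, trace = self.evaluate(current)
                if ok:
                    return score
            # All attempts failed: keep the trace for future prompts.
            self.learning_log.append({"candidate": current, "trace": trace})
            if repair_round < self.max_repairs:
                current = self.repair(current, trace)
        # Repair never succeeded: exclude the original from parent sampling.
        self.quarantine.add(candidate)
        return None
```

Keeping `max_retries` small follows the table's caution that retries should be bounded; the log only pays off if a later prompt builder actually reads `learning_log`.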

71.15   Evaluator Types

The full space of fitness functions. The choice is dictated by the problem rather than by the search strategy.

| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Code execution | Run the candidate program against a test suite. | E2 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21] | The natural choice for algorithmic discovery with deterministic tests. |
| Benchmark suite | Standardised, multi-instance benchmark (e.g., MLE-Bench, ALE-Bench). | E3 | RD-Agent [P, Ch38], ALE-Agent [P, Ch20], AI-Researcher [P, Ch55] | Required when comparable, reproducible results are the product. |
| LLM-as-judge | Another LLM scores the candidate's output. |  | AI Scientist [P, Ch48], AI Scientist v2 [P, Ch49], research agents [P] | Typically the only option for open-ended outputs (papers, hypotheses, plans); known to introduce stylistic bias. |
| Hybrid exec + judge | Execution for correctness, LLM for quality. |  | Meta-Harness [P, Ch32], RD-Agent [P, Ch38], DeepScientist [P, Ch43] | Appropriate when objectives mix verifiable and subjective components. |
| Human-in-loop | Periodic human review as a high-fidelity oracle. |  | AI Scientist [P, Ch48], K-Dense Co-Scientist [P, Ch56] | Warranted for safety-critical or domain-specific judgements; rate-limited by annotator bandwidth. |
| Simulator | Candidate drives a simulator; reward is the simulator's outcome. |  | EurekaClaw [P, Ch15], Matlantis [I, Ch41], Sakana Marlin [I, Ch30] | The idiomatic evaluator for robotics, materials science, and sim-to-real. |
| Formal verifier | Lean/Coq proof checker; fitness is proof-found / proof-length. |  | OpenProver [P, Ch40], Pi-Autoresearch [P, Ch39] | Suited to theorem proving and verified synthesis; applicability is narrow but fidelity is high. |
| Real-world deployment | Candidate runs in a production system; feedback is real traffic. |  | NeoSigma failure mining [I, Ch35], 7/24 Office [I, Ch42] | Appropriate when the only accurate oracle is production and the cost of bad candidates is bounded. |
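
The hybrid exec + judge row leaves open how the two signals are combined. One plausible scheme, shown here purely as an assumption, gates the judge behind full test success so that the LLM is only paid for on correct candidates. The function `hybrid_score`, the `run_tests` and `judge` callables, and the 0.3 weighting are all hypothetical.

```python
from typing import Callable


def hybrid_score(
    candidate: str,
    run_tests: Callable[[str], float],  # fraction of tests passed, in [0, 1]
    judge: Callable[[str], float],      # LLM quality score, in [0, 1]
    quality_weight: float = 0.3,
) -> float:
    """Combine a verifiable pass rate with a subjective quality score.

    Assumption: correctness acts as a gate. The judge is consulted only when
    every test passes, so an incorrect candidate can never outrank a correct one.
    """
    pass_rate = run_tests(candidate)
    correctness_weight = 1.0 - quality_weight
    if pass_rate < 1.0:
        # Partial credit for correctness only; no judge call, no judge cost.
        return correctness_weight * pass_rate
    return correctness_weight + quality_weight * judge(candidate)
```

Under this scheme scores for failing candidates stay below `1 - quality_weight`, which keeps the judge's stylistic bias from overriding verifiable correctness.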

71.16   Cross-Cutting Meta-Analysis

The fifteen tables above enumerate what can be done. This section asks the synthesis questions a reader is entitled to ask: which variants are actually common, which combinations recur together, which dimensions are strongly coupled, and where the surveyed systems genuinely disagree. The quantitative claims below are coarse counts over the sixty-one surveyed systems; treat them as directional rather than inferential.

71.16.1   Variant Frequency (Directional)

Frequencies are bucketed because not every system exposes every dimension and not every codebase was audited at the same depth. Dominant = seen in a clear majority of the surveyed systems exposing this dimension; common = seen in roughly a third to half; niche = seen in under ten systems; rare = seen in one to three and usually experimental.

| Dimension | Dominant variant(s) | Common | Niche / rare |
|---|---|---|---|
| §71.1 Search strategy | Single-pop EA, Island EA | Tree search, MAP-Elites | Hybrid loop+tree, ES-at-scale |
| §71.2 Population | Flat elitism, Ranked archive | MAP-Elites grid, Tiered | Knowledge graph, Skills archive |
| §71.3 Selection | Tournament, Power-law rank | MAP-Elites cell, UCB1 | Fitness-proportionate, EXP3 |
| §71.4 Mutation | Line-level diff, Full rewrite, Error-repair | Reflect-then-mutate, Prompt-template | Multi-parent, Feature-specific, Diff-based crossover |
| §71.5 Diversity | Island isolation | Embedding-distance novelty, Behavioural grid | Novelty-penalised fitness |
| §71.6 Evaluation | Cascade, Sandbox, Multi-instance | LLM-as-judge, Adaptive sample size | Formal verifier, ASI diagnostic feedback |
| §71.7 Orchestration | Hierarchical drafter+improver, Proposer+critic | Model pool + bandit routing | Multi-agent debate, Ensemble voting |
| §71.8 Routing | Static, Cost-tier | Per-stage | Hierarchical bandit, Budget-aware |
| §71.9 Prompts | Static, Slot mutation | Prompt populations, Reflect-and-rewrite | Instruction co-evolution |
| §71.10 Memory | No memory, Learning log (text) | Skills archive, Embedded log | Knowledge graph, Lessons library |
| §71.11 Meta-adaptation | None (frozen) | Adaptive mutation rate, Operator bandit | Self-modifying code, Full self-rewrite |
| §71.12 Budget | Fixed cap, Cascade early stop | Model-tier routing, Adaptive sample size | Marginal-value stopping |
| §71.13 Sandbox | Container, Resource caps, Network disabled | Budget guard, Static-analysis gate | Syscall whitelist |
| §71.14 Failure handling | Retry, Error-repair operator | Learning-log entry, Candidate quarantine | Reflective repair |
| §71.15 Evaluator | Code execution, Benchmark suite | LLM-as-judge, Hybrid exec+judge, Simulator | Formal verifier, Real-world deployment, Human-in-loop |

71.16.2   Recurring Combinations (Co-Occurrence)

Five combinations recur so often that they effectively define recognisable styles of LLM-powered evolution:

  1. FunSearch-lineage stack. Single-pop or island EA (§71.1) + flat elitism (§71.2) + tournament or power-law rank (§71.3) + line-level diff + full rewrite + error-repair (§71.4) + cascade evaluator (§71.6) + container sandbox (§71.13) + learning log (§71.10). Examples: OpenEvolve, AlphaEvolve, ShinkaEvolve, LLM4AD.
  2. Quality-diversity stack. MAP-Elites search (§71.1) + grid archive (§71.2) + cell sampling (§71.3) + behavioural-descriptor novelty (§71.5) + cascade evaluator (§71.6). Examples: AlphaEvolve QD mode, GEPA variants.
  3. Reflection-heavy agent stack. Reflect-then-mutate (§71.4) + proposer+critic orchestration (§71.7) + prompt populations or skill prompts (§71.9) + embedded log or skills archive (§71.10) + reflective repair (§71.14) + LLM-as-judge or hybrid evaluator (§71.15). Examples: GEPA, GEPA Skills, RetroAgent, EvoSkill.
  4. Tree-search stack. MCTS/AB-MCTS (§71.1) + UCB1 or Thompson selection (§71.3) + per-node fitness (§71.6). Examples: AB-MCTS, TreeQuest, Arcgentica, ALE-Agent.
  5. Self-modification stack. Harness evolution + self-modifying code (§71.11) + instruction co-evolution (§71.9) + lessons library (§71.10) + reflective repair (§71.14). Examples: Darwin Gödel Machine, Darwinian Evolver, NeoSigma.

Co-occurrence sketch (row = representative system, column = style; ● = strong alignment, ○ = partial, · = little or none):

                         FunSearch  QD   Reflect  Tree  Self-mod
OpenEvolve                  ●        ○    ·        ·     ·
AlphaEvolve                 ●        ●    ·        ·     ·
ShinkaEvolve                ●        ○    ○        ·     ·
LLM4AD                      ●        ·    ·        ·     ·
GEPA                        ○        ○    ●        ·     ·
GEPA Skills                 ·        ·    ●        ·     ·
RetroAgent                  ·        ·    ●        ·     ○
AB-MCTS / TreeQuest         ·        ·    ·        ●     ·
Arcgentica                  ·        ·    ·        ●     ·
ALE-Agent                   ·        ·    ·        ●     ·
Darwin Gödel Machine        ·        ·    ○        ·     ●
NeoSigma                    ·        ·    ○        ·     ●
EvoSkill                    ·        ·    ●        ·     ○
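
For programmatic lookups, the sketch above can be transcribed into a small data structure. The encoding below (2 = strong, 1 = partial, 0 for cells marked ·) and the helper name `systems_for` are choices made for this example; the cell values are copied from the sketch.

```python
STYLES = ("FunSearch", "QD", "Reflect", "Tree", "Self-mod")

# 2 = strong alignment, 1 = partial, 0 = cells marked with a dot.
ALIGNMENT = {
    "OpenEvolve":           (2, 1, 0, 0, 0),
    "AlphaEvolve":          (2, 2, 0, 0, 0),
    "ShinkaEvolve":         (2, 1, 1, 0, 0),
    "LLM4AD":               (2, 0, 0, 0, 0),
    "GEPA":                 (1, 1, 2, 0, 0),
    "GEPA Skills":          (0, 0, 2, 0, 0),
    "RetroAgent":           (0, 0, 2, 0, 1),
    "AB-MCTS / TreeQuest":  (0, 0, 0, 2, 0),
    "Arcgentica":           (0, 0, 0, 2, 0),
    "ALE-Agent":            (0, 0, 0, 2, 0),
    "Darwin Gödel Machine": (0, 0, 1, 0, 2),
    "NeoSigma":             (0, 0, 1, 0, 2),
    "EvoSkill":             (0, 0, 2, 0, 1),
}


def systems_for(style: str, min_level: int = 2) -> list[str]:
    """List systems whose alignment with `style` is at least `min_level`."""
    col = STYLES.index(style)
    return [name for name, row in ALIGNMENT.items() if row[col] >= min_level]


print(systems_for("Tree"))                   # strong tree-search systems
print(systems_for("Self-mod", min_level=1))  # strong or partial self-modification
```

Lowering `min_level` to 1 surfaces the hybrids, such as RetroAgent and EvoSkill, which are reflection-heavy but partially aligned with the self-modification style.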

71.16.3   Strongly Coupled Dimensions

Some dimensional choices are nearly forced by prior ones. The couplings below are observed as structural regularities, not proven causal dependencies:

  • Cascade evaluator ↔ tiered population ↔ cost-tier routing. Choosing a cascade in §71.6 implies candidates live at different refinement stages (§71.2) and that cheap vs. strong models run at different stages (§71.8). AlphaEvolve and ShinkaEvolve show all three.
  • MAP-Elites grid ↔ cell sampling ↔ behavioural descriptor. Choosing a QD archive (§71.2) effectively fixes the selection rule (§71.3) and the diversity mechanism (§71.5).
  • Skills archive ↔ prompt-populations ↔ skill-prompt retrieval. When §71.2 is a skills archive, §71.9 almost always includes some form of prompt-level retrieval, and §71.10 an embedded log.
  • Island model ↔ adaptive spawning ↔ heterogeneous islands. Within the island family, systems that ablate migration topology tend to also vary island configurations and operator mixes.
  • Reflect-then-mutate ↔ learning log ↔ reflective repair. Reflection machinery is rarely worthwhile unless the log retains lessons and failures feed back into prompts.
  • Self-modifying code ↔ instruction co-evolution ↔ lessons library. Gödelian self-modification almost always appears together with prompt/skill co-evolution and an explicit lessons archive.

Conversely, several dimensions are decoupled in practice: sandboxing choices (§71.13) are largely orthogonal to everything else, and mutation operator choice (§71.4) is surprisingly independent of search strategy (§71.1) — line-level diff appears in single-pop EAs, island EAs, and QD systems alike.
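
Because these couplings are observed regularities rather than hard constraints, one way to use them is as a design linter that warns instead of rejecting. The sketch below encodes three of the couplings listed above; the dimension and variant identifiers, the `COUPLINGS` table, and `coupling_warnings` are hypothetical names chosen for this example.

```python
# (dimension, variant) -> variants that usually accompany it.
# Identifiers are illustrative; the pairings follow the bullets above.
COUPLINGS = {
    ("population", "map_elites_grid"): [
        ("selection", "cell_sampling"),
        ("diversity", "behavioural_descriptor"),
    ],
    ("evaluation", "cascade"): [
        ("population", "tiered"),
        ("routing", "cost_tier"),
    ],
    ("mutation", "reflect_then_mutate"): [
        ("memory", "learning_log"),
        ("failure_handling", "reflective_repair"),
    ],
}


def coupling_warnings(design: dict[str, set[str]]) -> list[str]:
    """Warn about commonly co-occurring variants missing from a design.

    `design` maps each dimension to the set of variants the system uses.
    Warnings are advisory: a missing partner is unusual, not invalid.
    """
    warnings = []
    for (dim, variant), expected in COUPLINGS.items():
        if variant not in design.get(dim, set()):
            continue
        for exp_dim, exp_variant in expected:
            if exp_variant not in design.get(exp_dim, set()):
                warnings.append(
                    f"{dim}={variant} usually co-occurs with {exp_dim}={exp_variant}"
                )
    return warnings
```

Sandboxing is deliberately absent from the rule table, reflecting the observation that §71.13 choices are largely orthogonal to the rest of the design.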

71.16.4   Where the Surveyed Systems Genuinely Disagree

On several questions the field does not (yet) speak with one voice. These are the most instructive points for a practitioner deciding what to build:

  • Does crossover help? ShinkaEvolve and LLM4AD use two-parent crossover; FunSearch and OpenEvolve's default mode omit it. No cross-system controlled study settles whether the LLM genuinely mixes parents or is merely inspired by seeing two candidates at once.
  • Reflection vs. cheap rerolls. GEPA, EurekaClaw, and RetroAgent spend tokens on reflection before each mutation; AlphaEvolve and OpenEvolve largely prefer to generate more cheap diffs and let selection sort them out. Each stack reports state-of-the-art results, but on different task families, so the results do not settle the question.
  • Static vs. evolving prompts. FunSearch and OpenEvolve freeze the prompt; GEPA, Darwin Gödel Machine, and AutoAgent evolve it. The freeze camp cites reproducibility; the evolving camp cites long-run adaptation.
  • Self-modifying code. Darwin Gödel Machine and NeoSigma allow the system to rewrite itself; the rest of the field keeps the harness frozen. The disagreement is as much about safety posture as about efficiency.
  • Memory format. Text logs (AlphaEvolve, ShinkaEvolve) vs. embedded logs (Omni-SimpleMem, RetroAgent) vs. knowledge graphs (OmniScientist, Zochi). No published ablation cleanly compares them at matched budget.
  • Diversity via archive shape vs. fitness penalty. MAP-Elites enforces diversity structurally; EurekaClaw and GEPA enforce it via an explicit fitness term. The trade-off between the two has not been systematically benchmarked.

71.16.5   Which Claims in This Chapter Are Well-Supported?

Not all "When to pick" entries rest on the same evidence base. A rough tier list:

  • Well-supported (multiple within-system ablations across different teams): cascade evaluation beats single-stage evaluation whenever per-stage costs differ by more than 10×; sandboxing is mandatory for code execution; error-repair as a fallback operator is near-universal because its absence produces visible failure modes.
  • Moderately supported (one or two published ablations, or consistent anecdotal reports): tournament selection is a better default than fitness-proportionate selection; cost-tier routing saves budget on multi-provider pools; line-level diff dominates on mature candidates while full rewrite dominates on early exploration.
  • Weakly supported (practitioner consensus, little controlled evidence): crossover helps in large diverse archives; reflective repair is worth its cost on expensive evaluators; knowledge graphs pay off in research agents; power-law rank sampling beats uniform sampling on long archives.
  • Under-evidenced (claims that appear in the literature but have not been cleanly ablated): MAP-Elites vs. island EAs on code-synthesis tasks; optimal bandit family (UCB1 vs. Thompson vs. EXP3) for operator selection; whether embedded-log retrieval measurably outperforms text-log retrieval at matched budget; whether self-modifying systems converge faster than frozen-harness systems on any benchmark.

Readers designing a new system should weight each rule of thumb accordingly. The cross-cutting dimensions most likely to move the needle on realistic research budgets are, on current evidence, evaluator cascade design, mutation operator pool (including error-repair), and cost-tier model routing. Everything else is a secondary tuning knob whose impact depends strongly on problem structure.
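
To see why cascade design ranks so highly, it helps to work through the cost arithmetic. The sketch below uses a simplified two-stage model that is an assumption of this example, not a result from the surveyed ablations: every candidate pays for a cheap stage, and only the fraction that passes pays for the full stage. It ignores good candidates wrongly rejected by the cheap stage.

```python
def cascade_cost(cheap_cost: float, full_cost: float, pass_rate: float) -> float:
    """Expected cost per candidate in a two-stage cascade.

    Every candidate pays cheap_cost; a fraction pass_rate also pays full_cost.
    """
    return cheap_cost + pass_rate * full_cost


def break_even_pass_rate(cheap_cost: float, full_cost: float) -> float:
    """Pass rate below which the cascade is cheaper than full evaluation alone.

    Derived from cheap_cost + p * full_cost < full_cost,
    which rearranges to p < 1 - cheap_cost / full_cost.
    """
    return 1.0 - cheap_cost / full_cost


# Illustrative numbers: full stage is 10x the cheap stage, 20% of candidates pass.
per_candidate = cascade_cost(cheap_cost=1.0, full_cost=10.0, pass_rate=0.2)
threshold = break_even_pass_rate(cheap_cost=1.0, full_cost=10.0)
```

With a 10× cost gap the break-even pass rate is 0.9, so under this model the cascade saves money unless more than 90% of candidates survive the cheap stage. In the illustrative case the expected cost is 3.0 units per candidate against 10.0 for single-stage evaluation.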

Chapter Summary

Key takeaway. An LLM-powered evolutionary system is a point in an approximately fifteen-dimensional design space. Each of the fifteen tables in this chapter is one dimension, and each row is a variant observed in at least one surveyed system, tagged by source type ([P]/[I]/[B]/[E]) and chapter pointer for auditability. The dimensions are not strictly orthogonal (§71.0.1); several recognisable styles — FunSearch, quality-diversity, reflection-heavy, tree-search, self-modifying — arise from tightly coupled choices (§71.16.2–§71.16.3).

How to use this chapter. For each dimension, identify candidate variants and cross-check against the representative systems in the "Used by" column; follow the chapter pointer for context on how the variant behaves in that system, and the Ch67 method ID for its formal specification. Treat the "When to pick" column as conditional rules of thumb, not recommendations: §71.16.5 lists which of those rules are backed by controlled ablations and which are practitioner consensus awaiting stronger evidence.