Method Shortcut: Component-by-Component Comparison
Part P09: Synthesis & Future Directions
This chapter is the index to every design decision an LLM-powered evolutionary system has to make. Where Chapter 67 (Comprehensive Methods Catalog) gives the long-form algorithmic reference for each method, and Chapter 65 (Comparative Architecture Analysis) compares systems layer-by-layer, this chapter inverts the view: for each design dimension, it enumerates the variants observed in the surveyed systems, shows which systems use which, and states the conditions under which each variant tends to pay off.
71.0.1 Methodological Note: What This Taxonomy Is and Is Not
Before the tables, three caveats the reader should keep in mind throughout.
The fifteen dimensions are not strictly orthogonal. Several concepts naturally span more than one table. The most conspicuous overlaps:
- MAP-Elites is simultaneously a search strategy (§71.1), an archive structure (§71.2), a parent-sampling rule (§71.3), and a diversity mechanism (§71.5). We place each facet in the table it most directly constrains: the high-level decision to run a QD loop belongs to §71.1; the cellular archive data structure to §71.2; the "random cell, then champion" draw to §71.3; the behavioural-descriptor rationale to §71.5.
- Bandits appear in parent selection (§71.3), model routing (§71.8), operator selection (§71.11), and budget allocation (§71.12). Each table lists only bandit uses specific to that decision; the same bandit algorithm (UCB1, Thompson) can reappear across tables.
- Reflection crosses mutation (§71.4, "reflect-then-mutate"), memory (§71.10), meta-adaptation (§71.11, prompt evolution), and failure handling (§71.14, reflective repair). We list it in every table where it is the defining design choice for that row.
- Skills archives appear in population structure (§71.2), prompt co-evolution (§71.9), and memory (§71.10); it is the same data structure viewed through three different lenses.
Placement rule. Where a concept could belong to more than one table, we place it in the table whose question it most directly answers — "What shape is the search?" for §71.1, "How are candidates stored?" for §71.2, "How is the next parent drawn?" for §71.3, and so on. Cross-references in the commentary flag the siblings.
Evidence asymmetry. Not every "When to pick" recommendation rests on controlled ablations. Some dimensions (cascade evaluation, sandboxing, line-level diff mutation) are backed by multiple within-system ablations; others (e.g., crossover value, optimal bandit family, memory retrieval strategies) are under-evidenced and reflect practitioner rule-of-thumb. §71.16 summarises which claims are well-supported and which are working hypotheses.
71.0.2 Source-Type Tags for "Used by"
Every "Used by" attribution carries a one-letter tag distinguishing how the evidence was obtained. This lets readers audit claims at a glance and weight them accordingly.
| Tag | Meaning | Epistemic weight |
|---|---|---|
| [P] | Published system with peer-reviewed paper and/or open source release; design is verifiable from artifacts. | Highest — independently auditable. |
| [I] | Industry / internal system described in a tech report, blog post, or paper without code release; design is reported but not reproducible. | Moderate — depends on the vendor's disclosure. |
| [B] | Blueprint or reference architecture — a design proposal (including this survey's own Next Evolution Architecture blueprint) that may not yet be fully implemented in any single running system. | Low — aspirational, informs design space rather than evidencing it. |
| [E] | Experimental branch, community fork, or configuration flag inside an otherwise-published system; the variant is implemented but not the project's default and is usually undocumented. | Low — existence confirmed, impact rarely ablated. |
Chapter pointers of the form (Ch##) route to the survey chapter that documents the system. Where a system is flagged [I] or [B], readers should not treat the row as reproducible evidence, only as a design-space observation.
71.0 Method ID Index
The short codes in the Ch67 column throughout this chapter (e.g. M1, S4, P2, B1) are references into Chapter 67 — Comprehensive Methods Catalog, which contains the formal specification, pseudocode, and trade-offs for every method. Each ID in the table below is a clickable anchor that jumps to the matching subsection in Ch67. Ch67's section numbering follows the pattern #s67-<family>-<index> where <family> is the family's ordinal inside Ch67 (3 for Mutation, 4 for Crossover, 5 for Selection, 6 for Evaluation, 7 for Population, 8 for Islands, 9 for Bandits). Use this as the single index for the thirty-four methods across the seven families.
| Family | Methods (click an ID to jump to Ch67) |
|---|---|
| F1 · Mutation | M1 Line-level diff · M2 Full rewrite · M3 Reflection-guided · M4 Error-repair · M5 Guided perturbation · M6 Prompt-template |
| F2 · Crossover | C1 Two-parent · C2 Multi-parent · C3 Feature-specific · C4 Diff-based |
| F3 · Selection | S1 Fitness-proportionate · S2 Tournament · S3 Power-law rank · S4 MAP-Elites cell · S5 Pareto · S6 ε-greedy |
| F4 · Evaluation | E1 Cascade · E2 Sandbox · E3 Multi-instance · E4 Novelty-filtered · E5 Adaptive sample-size |
| F5 · Population | P1 Flat elitism · P2 MAP-Elites grid · P3 Pareto archive · P4 Tiered · P5 Skills archive |
| F6 · Islands | I1 Static topology · I2 Adaptive spawning · I3 Heterogeneous · I4 Hierarchical |
| F7 · Bandits | B1 UCB1 · B2 Thompson · B3 EXP3 · B4 Sliding-window UCB |
Shorthand for the rest of this chapter. In the per-dimension tables below, the Ch67 column lists the IDs that apply to each variant. When you see an unfamiliar code — say B3 in §71.3 — scroll back to this index and click it to jump to the formal definition in Ch67. A dash (—) in the Ch67 column means the variant is described only in Ch67's surrounding prose or does not yet have a catalogued method ID.
71.1 Search Strategy
The top-level shape of the search. Most downstream choices are constrained by this decision, though several dimensions (mutation, evaluation, sandboxing) remain largely independent.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Single-population EA | One flat pool, elitism, each iteration samples a parent and produces one child. | — | OpenEvolve default [P, Ch14], LLM4AD [P, Ch21] | Plausible when evaluators are cheap and diversity is not the binding constraint; fragile on deceptive or multi-modal landscapes. |
| Island-model EA | Several small populations in a ring; occasional migration between islands. | I1–I4 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12] | Worth the extra compute per island when premature convergence has been observed in single-pool runs on the same task. |
| MAP-Elites / Quality-Diversity | Grid archive indexed by behavioural descriptors; one champion per cell. | S4, P2 | AlphaEvolve behavioural-grid mode [I, Ch9], GEPA-family variants [P, Ch25] | Typically preferred when the landscape is known to be deceptive and when cheap, meaningful descriptors are available; evidence of QD superiority over well-tuned island EAs on code-synthesis tasks is mixed. |
| Tree search (MCTS / AB-MCTS) | Program-synthesis tree with UCB selection and LLM-proposed expansions. | B1, B2 | AB-MCTS / TreeQuest [P, Ch19], Arcgentica [P, Ch17], Confluence Labs [I, Ch28], ALE-Agent [P, Ch20] | Useful when the task decomposes into verifiable step-by-step construction (ARC-AGI, heuristic contests) and when an informative per-node reward exists. |
| Hybrid (loop + tree) | Outer evolutionary loop whose mutation operator is itself a tree search. | — | ShinkaEvolve@ICFP [P, Ch12], Sakana Marlin [I, Ch30] | Reasonable when the problem has both long-horizon strategy choice and low-level synthesis, and when the added implementation complexity is justified by failures of either tier alone. |
| Evolution strategies / ES-at-scale | Perturb a parameter vector with Gaussian noise; reweight by fitness; no LLM mutation. | — | EGGROLL [P, Ch45], Evolution Strategies at Scale [I, Ch46], EvoX [P, Ch47] | Applicable only when the search object is a continuous parameter vector and massive parallel evaluation is available; not comparable to LLM-mutator systems on code tasks. |
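The island-model row above describes several small populations in a ring with occasional migration. A minimal sketch of one migration step follows, assuming candidates are `(fitness, program)` tuples and that migrants are copied (not moved) to the clockwise neighbour, replacing its worst entries. The function name `migrate_ring` and the replacement policy are illustrative assumptions, not taken from any surveyed system.

```python
def migrate_ring(islands, n_migrants=1):
    """Copy each island's best candidates to its clockwise neighbour.

    `islands` is a list of lists of (fitness, program) tuples.
    Migrants overwrite the neighbour's worst entries, so island sizes stay fixed
    (assuming n_migrants does not exceed the island size).
    """
    # Snapshot emigrants first so that migration is simultaneous, not sequential.
    emigrants = [
        sorted(isl, key=lambda c: c[0], reverse=True)[:n_migrants]
        for isl in islands
    ]
    for i, movers in enumerate(emigrants):
        target = islands[(i + 1) % len(islands)]
        target.sort(key=lambda c: c[0])      # worst first
        target[:len(movers)] = movers        # overwrite the worst entries
    return islands


# Toy example: three islands of two candidates each.
islands = [
    [(0.9, "a1"), (0.2, "a2")],
    [(0.5, "b1"), (0.1, "b2")],
    [(0.7, "c1"), (0.3, "c2")],
]
migrate_ring(islands)
```

In a real run this step would be called only every few generations; how rarely is exactly the knob that trades diversity against the spread of good ideas.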
71.2 Population Structure
How candidates are stored and organised. Distinct from the search strategy: a FunSearch-lineage EA can in principle be paired with any of these archives, although in practice certain pairs dominate (see §71.16).
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Flat pool with elitism | Keep the top-k candidates by fitness; new children replace the worst. | P1 | OpenEvolve flat mode [P, Ch14], LLM4AD [P, Ch21] | Defensible as a baseline when the landscape looks unimodal; breaks down on deceptive tasks. |
| Ranked archive | All candidates ranked by a scalar; parent sampling by power-law over rank. | S3 | ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | A reasonable compromise when soft elitism is wanted — strong parents preferred, weaker ones still contribute — and the archive can grow large without memory pressure. |
| MAP-Elites grid archive | Behavioural descriptor partitions the space into cells; best candidate per cell. | P2, S4 | AlphaEvolve [I, Ch9], several derivative forks [E] | Justified when explicit diversity is an objective and cheap descriptors (runtime, length, ASI tags) are available; the descriptor choice usually matters more than the grid resolution. |
| Pareto archive | Non-dominated front under multi-objective fitness. | P3, S5 | AlphaEvolve multi-obj mode [I, Ch9], GEPA variants [E] | Appropriate when multiple competing objectives have unknown relative weights; less useful when a scalarisation is already trusted. |
| Tiered population | Multiple pools at different refinement stages; promotion by evaluation tier. | P4 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Natural fit when the evaluator is a cascade whose stages differ by an order of magnitude in cost. |
| Skills archive / knowledge base | Candidates stored as reusable skills with natural-language descriptions. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], RetroAgent [P, Ch34] | Most useful in agent systems where solutions are expected to compose across tasks; adds retrieval overhead that is wasted on single-task code synthesis. |
| Knowledge graph / graph memory | Entities and relations stored in a structured graph; retrieved by traversal. | — | OmniScientist [P, Ch52], Zochi [P, Ch53], Omni-SimpleMem [P, Ch51] | Worth the indexing cost primarily for research agents doing cross-paper synthesis; rarely beneficial in closed-box algorithmic tasks. |
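To make the MAP-Elites grid archive row concrete, here is a minimal sketch of a cellular archive that keeps one champion per behavioural cell. The class name `GridArchive`, the uniform binning, and the two example descriptors (runtime in seconds, length in lines) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GridArchive:
    """One champion per behavioural cell; higher fitness wins."""
    bins: tuple     # number of bins per descriptor dimension
    lows: tuple     # lower bound per dimension
    highs: tuple    # upper bound per dimension
    cells: dict = field(default_factory=dict)   # cell index -> (fitness, candidate)

    def cell_of(self, descriptor):
        idx = []
        for x, lo, hi, n in zip(descriptor, self.lows, self.highs, self.bins):
            frac = (x - lo) / (hi - lo)
            idx.append(min(n - 1, max(0, int(frac * n))))   # clamp out-of-range values
        return tuple(idx)

    def add(self, candidate, fitness, descriptor):
        """Insert if the cell is empty or the newcomer beats the incumbent."""
        key = self.cell_of(descriptor)
        incumbent = self.cells.get(key)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[key] = (fitness, candidate)
            return True
        return False


# Descriptors: (runtime in seconds, program length in lines).
archive = GridArchive(bins=(4, 4), lows=(0.0, 0.0), highs=(2.0, 200.0))
archive.add("p1", 0.5, (0.3, 40))    # fills cell (0, 0)
archive.add("p2", 0.7, (0.4, 45))    # same cell, better fitness: replaces p1
archive.add("p3", 0.4, (1.5, 150))   # fills cell (3, 3) despite lower fitness
archive.add("p4", 0.6, (0.45, 30))   # same cell as p2, worse fitness: rejected
```

Note that `p3` survives with the lowest fitness in the archive because it occupies its own cell; that is the diversity pressure the table refers to.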
71.3 Parent Selection & Sampling
How the next parent is drawn from whatever archive you chose in §71.2.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Uniform | Sample any archive entry with equal probability. | — | FunSearch baselines [P, Ch5], MAP-Elites default [P, Ch9] | Sensible when the archive itself already encodes selection pressure (MAP-Elites cells); otherwise tends to waste budget on weak entries. |
| Fitness-proportionate (roulette) | Sampling probability proportional to fitness. | S1 | small-population baselines [P] | Rarely the best choice; sensitive to fitness scaling. Tournament is almost always preferable in practice. |
| Tournament | Sample k at random, pick the best. | S2 | ShinkaEvolve [P, Ch12], OpenEvolve [P, Ch14] | A robust default when fitness scaling is unknown; scale-invariant, tunable via k to dial selection pressure. |
| Power-law rank | Rank the archive and sample with density ∝ rank^(−α). | S3 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | Useful on long archives where soft elitism is wanted; α acts as a continuous explore/exploit knob. |
| MAP-Elites cell sampling | Pick a random occupied cell uniformly, then the champion of that cell. | S4 | AlphaEvolve QD mode [I, Ch9] | The idiomatic draw for a behavioural grid; other rules tend to undo the archive's diversity pressure. |
| ε-greedy | With probability ε pick uniform, otherwise pick the current best. | S6 | small-population baselines [P] | Adequate as a cheap exploration knob when no diversity machinery exists; tends to be outperformed by tournament or rank sampling when either is available. |
| UCB1 over candidates | Treat each candidate as a bandit arm; confidence-bound selection. | B1 | AB-MCTS [P, Ch19], TreeQuest [P, Ch19] | Natural in tree search where each node accumulates rollouts; less obviously beneficial over tournament in flat archives. |
| Thompson sampling | Sample from posterior; parent is arg-max of the sample. | B2 | ShinkaEvolve adaptive modes [E, Ch12], RD-Agent [P, Ch38] | Most useful on non-stationary problems where a calibrated probabilistic reward model already exists. |
| EXP3 / sliding-window UCB | Adversarial or non-stationary bandits for drifting rewards. | B3, B4 | hierarchical model routers [B], adaptive operator pools [E] | Applicable when the reward distribution is known to shift over the run; overkill for stationary tasks. |
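Several of the sampling rules in the table above fit in a few lines each. The sketch below assumes the archive is a list of `(fitness, candidate)` tuples and that a MAP-Elites grid is a dict from cell index to champion; the function names are illustrative, not any system's API.

```python
import random


def tournament(archive, k, rng):
    """Sample k entries uniformly without replacement; return the fittest."""
    return max(rng.sample(archive, k), key=lambda c: c[0])


def power_law_rank(archive, alpha, rng):
    """Sample with probability proportional to rank^(-alpha); rank 1 is the best."""
    ranked = sorted(archive, key=lambda c: c[0], reverse=True)
    weights = [(r + 1) ** (-alpha) for r in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]


def epsilon_greedy(archive, epsilon, rng):
    """With probability epsilon pick uniformly, otherwise pick the current best."""
    if rng.random() < epsilon:
        return rng.choice(archive)
    return max(archive, key=lambda c: c[0])


def cell_sample(cells, rng):
    """MAP-Elites draw: uniform over occupied cells, then that cell's champion."""
    return cells[rng.choice(list(cells))]


rng = random.Random(0)
archive = [(0.9, "p1"), (0.6, "p2"), (0.3, "p3"), (0.1, "p4")]
parent = tournament(archive, k=2, rng=rng)
```

Tournament depends only on the ordering of fitness values, which is why it is insensitive to fitness scaling; in `power_law_rank`, `alpha = 0` recovers uniform sampling and larger values concentrate draws on the top ranks.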
71.4 Change Operators (Mutation)
How the LLM turns a parent into a child. Empirically the single largest source of inter-system variation.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Line-level diff | LLM returns a unified diff that is applied to the parent. | M1 | AlphaEvolve default [I, Ch9], OpenEvolve [P, Ch14], GEPA [P, Ch25] | Effective on mature candidates where surgical edits dominate; token-efficient; fails when structural change is needed. |
| Full rewrite | LLM returns a complete new version of the target file or function. | M2 | ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21], AI Scientist [P, Ch48] | Usually preferred during early exploration, when structural change dominates, or when diffs repeatedly fail to apply. |
| Reflect-then-mutate | LLM first analyses weaknesses, then produces an edit informed by the analysis. | M3 | GEPA [P, Ch25], EurekaClaw [P, Ch15], RetroAgent [P, Ch34], Darwinian Evolver [P, Ch24], NeoSigma [I, Ch35] | Worth the extra tokens mainly when the evaluator is expensive and rich textual feedback is available; the sample-efficiency advantage shrinks on cheap evaluators. |
| Error-repair mutation | LLM sees error traces and is asked to produce a minimal patch. | M4 | OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12], most serious systems | Typically included as a fallback operator whenever execution can fail; near-universal in production systems. |
| Guided perturbation | Small semantic edit biased by a learning-log hint or benchmark feedback. | M5 | A-Evolve [P, Ch16], ShinkaEvolve variant [E, Ch12] | Pays off when the learning log carries high-quality, transferable advice; otherwise tends to degrade to random perturbation. |
| Prompt-template mutation | The mutated object is a prompt, not the code the prompt acts on. | M6 | GEPA [P, Ch25], GEPA Skills [P, Ch26], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | The right choice when the "program" being optimised is itself a prompt template or harness. |
| Two-parent crossover | LLM given two parents, asked to combine their strengths. | C1 | ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21] | Evidence is mixed: helpful on diverse high-quality archives, but whether it genuinely mixes or merely inspires is unresolved in the published ablations. |
| Multi-parent crossover | Generalises C1 to k parents; typically feature-level merge. | C2 | experimental branches of ShinkaEvolve [E, Ch12] | Plausible on large diverse archives where many ideas can combine; not reliably validated against two-parent crossover. |
| Feature-specific crossover | Extract a named subroutine from one parent, transplant into another. | C3 | Arcgentica [P, Ch17], some GEPA variants [E, Ch25] | Most useful when the representation supports clean decomposition (DSL, function library). |
| Diff-based crossover | Apply one parent's diff-from-root to another parent. | C4 | OpenEvolve experimental [E, Ch14] | Applicable when candidates share a common lineage; little independent validation of its benefit. |
| Gaussian perturbation (ES) | Add scaled noise to a parameter vector; no LLM involved. | — | EGGROLL [P, Ch45], Evolution Strategies at Scale [I, Ch46], EvoX [P, Ch47] | Applicable to continuous parameter spaces only. Not an LLM operator; listed here for taxonomic completeness. |
71.5 Diversity & Novelty Machinery
How the system avoids collapsing onto a single idiom.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| None | Rely on fitness alone; accept collapse as a risk. | — | small-scale baselines [P] | Defensible only on short runs with small budgets where collapse is unlikely to be reached. |
| Behavioural descriptor + grid | MAP-Elites; diversity is enforced by the archive shape. | P2 | AlphaEvolve [I, Ch9], OpenEvolve QD mode [E, Ch14] | A strong choice when cheap, meaningful descriptors exist; the descriptor choice matters more than the grid resolution. |
| Embedding-distance novelty archive | Candidate accepted if far enough (in embedding space) from existing archive entries. | E4 | GEPA [P, Ch25], Omni-SimpleMem [P, Ch51], several research agents [P] | Usually the fallback when no natural descriptor exists but an embedder is available; sensitive to the embedder's alignment with task-relevant structure. |
| Similarity-filtered acceptance | Reject children too similar (by hash, tree edit, or embedding) to recent parents. | — | ShinkaEvolve [P, Ch12], OpenEvolve variant [E, Ch14] | Useful for fighting prompt collapse inside an already-running system; evidence is anecdotal. |
| Island isolation | Islands diverge naturally because migration is rare. | I1–I4 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12] | Population-level diversity at the cost of compute per island; scales well with budget. |
| Novelty-penalised fitness | Fitness function explicitly penalises similarity to recent candidates. | — | EurekaClaw [P, Ch15], GEPA [P, Ch25] | Viable when a meaningful distance in reward space exists; coefficient tuning is usually ad hoc. |
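The embedding-distance novelty archive can be sketched as a nearest-neighbour threshold test. The example below assumes embeddings are already available as plain lists of floats (the embedder itself is out of scope) and uses cosine distance; `NoveltyArchive`, `try_add`, and the threshold value are hypothetical names and settings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


class NoveltyArchive:
    def __init__(self, min_distance):
        self.min_distance = min_distance
        self.entries = []    # list of (embedding, candidate)

    def try_add(self, candidate, embedding):
        """Accept only if the nearest stored embedding is at least min_distance away."""
        nearest = min(
            (cosine_distance(embedding, e) for e, _ in self.entries),
            default=math.inf,    # an empty archive accepts anything
        )
        if nearest >= self.min_distance:
            self.entries.append((embedding, candidate))
            return True
        return False


novelty = NoveltyArchive(min_distance=0.1)
first = novelty.try_add("a", [1.0, 0.0])        # accepted: archive was empty
near = novelty.try_add("b", [0.99, 0.05])       # rejected: almost parallel to "a"
far = novelty.try_add("c", [0.0, 1.0])          # accepted: orthogonal to "a"
```

The linear scan is fine for small archives; the sensitivity the table mentions comes from the embedder, since two programs that differ in a task-relevant way may still sit close together in embedding space.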
71.6 Evaluation & Fitness Assessment
How a candidate is scored. The cost and fidelity of this step tend to dominate the overall budget.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Single-stage execution | Run once on the full test set; use the returned scalar. | — | small baselines [P], GEPA simple mode [P, Ch25] | Adequate when a single run is cheap relative to LLM calls; otherwise a cascade usually wins. |
| Cascade (smoke → light → full) | Tiered evaluator; candidates that fail early stages never reach expensive ones. | E1 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Strongly preferred whenever evaluator cost varies by 10× or more across fidelity levels; one of the most reliably beneficial choices in the survey (see §71.16). |
| Sandbox execution | Run in a resource-capped isolated environment with no network. | E2 | AutoHarness [P, Ch31], OpenEvolve [P, Ch14], AlphaEvolve [I, Ch9] | A safety requirement rather than a tuning choice whenever the system executes LLM-generated code. |
| Multi-instance aggregation | Score on many instances; aggregate by mean, median, or worst-case. | E3 | LLM4AD [P, Ch21], ShinkaEvolve [P, Ch12], ALE-Agent [P, Ch20] | Worth the cost on noisy or distribution-sensitive benchmarks; the aggregator (mean vs. worst) materially affects the robustness/progress trade-off. |
| Novelty-filtered evaluation | Reject candidates that fail a similarity filter before scoring. | E4 | GEPA [P, Ch25], Omni-SimpleMem [P, Ch51] | Reasonable when novelty is a first-class objective and the filter is cheap compared to the evaluator. |
| Adaptive sample size | More instances for promising candidates, fewer for weak ones. | E5 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Useful for variance-aware budgets on large test sets; related to racing algorithms. |
| LLM-as-judge | A strong LLM scores the candidate's output; no execution required. | — | AI Scientist [P, Ch48], AI-Researcher [P, Ch55], research agents [P] | Often the only option when execution is impossible (open-ended research, paper writing); known to introduce stylistic bias. |
| Hybrid LLM + execution | Execution for correctness, LLM for style/quality. | — | Meta-Harness [P, Ch32], RD-Agent [P, Ch38] | Appropriate when both correctness and subjective quality matter and their signals are separable. |
| Formal verifier | Proof checker confirms correctness; fitness is proof length or compilation time. | — | OpenProver [P, Ch40], Pi-Autoresearch [P, Ch39] | The idiomatic choice for theorem proving and verified code synthesis; rarely applicable outside those niches. |
| ASI diagnostic feedback | Structured failure analysis returned alongside the score; fed back into the learning log. | — | Next Evolution Architecture blueprint [B], GEPA [P, Ch25], RetroAgent [P, Ch34] | Most valuable when the evaluator can plausibly teach the system, not just grade it — implies a rich per-failure signal. |
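The cascade row (smoke → light → full) amounts to running stages in order of cost and stopping at the first failure. A minimal sketch under toy assumptions: each stage is a `(name, evaluate_fn, threshold, cost)` tuple, and candidates are dicts with pre-baked scores standing in for real execution. All names and numbers are illustrative.

```python
def cascade_evaluate(candidate, stages):
    """Run stages cheapest first; stop at the first stage whose score misses its threshold.

    Returns (score, last_stage_name, cost_spent); score is None on early rejection.
    """
    spent = 0.0
    score, name = None, None
    for name, evaluate, threshold, cost in stages:
        spent += cost
        score = evaluate(candidate)
        if score < threshold:
            return None, name, spent    # later, more expensive stages never run
    return score, name, spent


# Toy stages whose cost grows 10x per tier.
stages = [
    ("smoke", lambda c: 1.0 if c["compiles"] else 0.0, 1.0, 1.0),
    ("light", lambda c: c["light"], 0.5, 10.0),
    ("full", lambda c: c["full"], 0.0, 100.0),
]

good = {"compiles": True, "light": 0.7, "full": 0.8}
broken = {"compiles": False, "light": 0.0, "full": 0.0}
weak = {"compiles": True, "light": 0.2, "full": 0.3}

results = {
    "good": cascade_evaluate(good, stages),      # reaches the full stage
    "broken": cascade_evaluate(broken, stages),  # rejected at smoke, cost 1
    "weak": cascade_evaluate(weak, stages),      # rejected at light, cost 11
}
```

With these toy costs, a rejected candidate costs 1 or 11 units instead of 111, which is where the saving comes from when most children are bad.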
71.7 LLM Orchestration
How many LLMs are involved and how they are wired together.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Single-model mutator | One LLM does everything. | — | small-scale baselines [P], Karpathy Autoresearch [I, Ch54] | Sensible when simplicity or single-provider deployment is a constraint. |
| Model pool + bandit routing | Several LLMs with different cost/quality profiles; bandit learns which one to call when. | B1–B4 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], RD-Agent [P, Ch38] | Worth the routing overhead when cost is a binding constraint and providers differ meaningfully on the target task distribution. |
| Proposer + critic | One LLM proposes, a second LLM critiques; the critic's feedback informs the next proposal. | — | GEPA [P, Ch25], AI Scientist v2 [P, Ch49], K-Dense Co-Scientist [P, Ch56] | Often a good trade when the evaluator is expensive but a cheap critic can identify obvious failures. |
| Hierarchical drafter + improver | Cheap model drafts many candidates, strong model improves only the promising ones. | — | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], Sakana Marlin [I, Ch30] | The most common cost-aware pattern in 2026-era systems when cost is the dominant bottleneck. |
| Multi-agent debate | Multiple LLM instances argue; final output is the consensus. | — | AI Scientist v2 [P, Ch49], DeepScientist [P, Ch43], AgentLaboratory [P, Ch44] | Most common in research agents and open-ended reasoning; little evidence of benefit on verifiable tasks. |
| Ensemble voting | Multiple LLMs each answer; majority or quality-weighted vote. | — | Confluence Labs [I, Ch28], AB-MCTS aggregation step [P, Ch19] | Reasonable when individual model errors are approximately independent and each call is cheap. |
71.8 Model Selection & Routing
Given a model pool, which model is called for a given operation?
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Static assignment | Each operation is hard-coded to a specific model. | — | OpenEvolve [P, Ch14], LLM4AD [P, Ch21] | Adequate for early prototypes or when only one provider is available. |
| Cost-tier routing | Cheap model for draft / reject, strong model for promising candidates. | — | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | A strong default when at least two tiers are available and their cost ratio exceeds roughly 5×. |
| Hierarchical bandit | Nested bandits over model family and model instance. | B1, B2 | Next Evolution Architecture blueprint [B], RD-Agent [P, Ch38] | Overhead becomes worth it when provider diversity is high and cost/quality ratios drift over the run. |
| Per-stage routing | Model chosen per evaluation-cascade stage. | — | ShinkaEvolve [P, Ch12], Sakana Marlin [I, Ch30] | Effective when each cascade stage rewards a different skill (cheap filters vs. deep analysis). |
| Budget-aware router | Model chosen to maximise expected fitness per remaining dollar. | — | Next Evolution Architecture blueprint [B], AlphaEvolve internal [I, Ch9] | Appropriate under a hard fixed budget with a calibrated marginal-value model; theoretically clean but rarely ablated in public. |
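As a flat simplification of the bandit-based routers in the table, the sketch below applies UCB1 (method B1) to a two-model pool. The class `UCB1Router`, the model names, and the deterministic rewards are assumptions for illustration; a hierarchical bandit would nest one such router per model family.

```python
import math


class UCB1Router:
    """Pick the model with the highest mean reward plus confidence bonus."""

    def __init__(self, models, c=math.sqrt(2)):
        self.models = list(models)
        self.c = c
        self.counts = {m: 0 for m in self.models}
        self.means = {m: 0.0 for m in self.models}
        self.total = 0

    def select(self):
        # Play every arm once before the confidence bound is defined.
        for m in self.models:
            if self.counts[m] == 0:
                return m
        return max(
            self.models,
            key=lambda m: self.means[m]
            + self.c * math.sqrt(math.log(self.total) / self.counts[m]),
        )

    def update(self, model, reward):
        self.total += 1
        self.counts[model] += 1
        # Incremental mean update.
        self.means[model] += (reward - self.means[model]) / self.counts[model]


router = UCB1Router(["cheap", "strong"])
true_reward = {"cheap": 0.2, "strong": 0.8}   # toy stand-in for observed fitness gain
for _ in range(200):
    model = router.select()
    router.update(model, true_reward[model])
```

This version ignores cost entirely. One common adjustment, again an assumption here, is to divide the reward by the call's price so that the bandit optimises gain per dollar.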
71.9 Prompt Engineering & Co-Evolution
Who writes the prompts, and do they evolve?
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Static prompt | Hand-written prompt, unchanged throughout the run. | — | FunSearch [P, Ch5], OpenEvolve [P, Ch14] | Reasonable as a baseline when the prompt is already well-tuned or when reproducibility is paramount. |
| Prompt slot mutation | Specific placeholders (persona, style, constraints) are mutated each iteration. | — | ShinkaEvolve [P, Ch12], AlphaEvolve variant [I, Ch9] | Low-cost diversity source; typically a cheap win compared to static prompts on long runs. |
| Prompt populations | A second population of prompts co-evolves alongside the primary population. | — | GEPA [P, Ch25], GEPA Skills [P, Ch26], AutoAgent [P, Ch7] | Worth the complexity when prompts, not code, carry the task-specific knowledge. |
| Reflect-and-rewrite-prompt | After each cycle the LLM rewrites the prompt based on what worked. | — | GEPA [P, Ch25], RetroAgent [P, Ch34], Darwin Gödel Machine [P, Ch23] | Most useful on long runs with rich feedback and tasks whose structure drifts over time. |
| Instruction co-evolution | Prompts, skills, and the harness evolve together under a joint fitness. | — | Darwin Gödel Machine [P, Ch23], NeoSigma [I, Ch35], EvoSkill [P, Ch33] | The right regime for self-improving agents when the boundary between code and prompt dissolves; sample-inefficient. |
| Skill prompts | Candidates are named skills with natural-language descriptions, retrieved by similarity. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], RetroAgent [P, Ch34] | Fit for compositional agent tasks with reusable sub-solutions; retrieval quality dominates outcomes. |
71.10 Memory, Learning Logs & Skills
What the system remembers between iterations — and how memory is read back into the mutator.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| No memory | Each iteration is independent. | — | OpenEvolve default [P, Ch14], FunSearch baselines [P, Ch5] | Adequate for cheap iterations or stateless search where insight reuse has not been shown to help. |
| Learning log (text) | Append-only log of candidate, score, observation; subset retrieved into prompt. | — | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | Worth adding whenever insight reuse plausibly matters; a common cheap win. |
| Embedded log (vector) | Learning log entries indexed in a vector store; retrieved by semantic similarity. | — | Omni-SimpleMem [P, Ch51], RetroAgent [P, Ch34], several research agents [P] | Typically needed on long runs with thousands of entries where linear scans become a bottleneck. |
| Skills archive | Distilled reusable primitives with natural-language descriptions. | P5 | GEPA Skills [P, Ch26], EvoSkill [P, Ch33], AutoEvolver [P, Ch36] | Most useful in agent tasks where solutions compose; overkill for single-task code synthesis. |
| Knowledge graph | Entities, relations, observations stored in a graph; retrieved by traversal. | — | OmniScientist [P, Ch52], Zochi [P, Ch53], K-Dense Co-Scientist [P, Ch56] | Warranted mainly for research agents doing cross-paper synthesis; graph maintenance cost is non-trivial. |
| Lessons library | High-level abstracted heuristics learned from multiple runs. | — | Darwin Gödel Machine [P, Ch23], AI Scientist v2 [P, Ch49], Karpathy Autoresearch [I, Ch54] | Interesting for meta-level self-improvement over many tasks; evidence base for within-run benefit is thin. |
| Retrieval-augmented mutation | Parents are fetched not only by fitness but by relevance to a retrieved log entry. | — | ShinkaEvolve adaptive mode [E, Ch12], GEPA [P, Ch25] | Useful when similar past candidates plausibly carry transferable advice; depends on retrieval quality. |
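The text learning-log row (append-only log of candidate, score, observation, with a subset retrieved into the prompt) can be sketched as follows. The retrieval policy shown, the k highest-scoring entries, is only one plausible choice and is an assumption, as are the names `LogEntry` and `LearningLog`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    candidate_id: str
    score: float
    observation: str


class LearningLog:
    """Append-only log; entries are never edited or removed."""

    def __init__(self):
        self._entries = []

    def append(self, candidate_id, score, observation):
        self._entries.append(LogEntry(candidate_id, score, observation))

    def retrieve(self, k):
        """Assumed policy: the k highest-scoring entries, best first."""
        return sorted(self._entries, key=lambda e: e.score, reverse=True)[:k]

    def render(self, k):
        """Format the retrieved subset as plain text for a mutation prompt."""
        return "\n".join(
            f"- {e.candidate_id} (score {e.score:.2f}): {e.observation}"
            for e in self.retrieve(k)
        )


log = LearningLog()
log.append("c1", 0.42, "greedy packing wastes the last bin")
log.append("c2", 0.61, "sorting items first helps")
log.append("c3", 0.17, "recursion hits the depth limit")
prompt_section = log.render(k=2)
```

Swapping `retrieve` for a similarity search over a vector store turns this into the embedded-log variant without changing the rest of the interface.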
71.11 Meta-Level Adaptation & Self-Improvement
Which parts of the system can evolve themselves?
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| None (frozen) | Every hyperparameter is fixed at the start. | — | FunSearch [P, Ch5], OpenEvolve default [P, Ch14] | Preferred when reproducibility is the dominant concern or when the system is used as a baseline. |
| Adaptive mutation rate | Mutation strength increased on stagnation, decreased on progress. | — | ShinkaEvolve [P, Ch12], A-Evolve [P, Ch16] | Worth the tuning cost on long runs where the landscape's roughness changes with iteration. |
| Operator bandit | Bandit over a pool of mutation operators; rewards update by child fitness. | B1, B3 | ShinkaEvolve [P, Ch12], GEPA [P, Ch25], LLM4AD [P, Ch21] | Useful when multiple mutation styles each dominate in different regimes; bandit restart logic matters. |
| Prompt evolver | Prompts evolve alongside candidates (see §71.9). | — | GEPA [P, Ch25], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | See §71.9; same applicability conditions. |
| Harness evolution | The agent harness (tools, scratchpads, sandboxes) evolves as part of the optimisation. | — | Meta-Harness [P, Ch32], AutoHarness [P, Ch31], AutoAgent [P, Ch7] | Most useful on agent benchmarks where harness quality dominates model quality. |
| Self-modifying code (Gödelian) | The system proposes changes to its own source code and deploys the winning version. | — | Darwin Gödel Machine [P, Ch23], Darwinian Evolver [P, Ch24], NeoSigma [I, Ch35] | Research settings only; requires strong safety gates and offers uncertain efficiency gains. |
| Full self-rewrite | End-to-end regeneration of the system across many axes simultaneously. | — | Darwin Gödel Machine [P, Ch23], AutoEvolver [P, Ch36] | Experimental; sample-inefficient but maximally open-ended. |
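The adaptive mutation rate row (strength increased on stagnation, decreased on progress) reduces to a multiplicative update with clamping. The sketch below is one simple way to write it; the factors and bounds are arbitrary illustrative defaults, not values reported by any surveyed system.

```python
def adapt_rate(rate, improved, up=1.5, down=0.8, lo=0.05, hi=1.0):
    """Raise mutation strength on stagnation, lower it on progress, clamp to [lo, hi]."""
    new_rate = rate * (down if improved else up)
    return min(hi, max(lo, new_rate))


# Toy trace: two stagnant iterations followed by one improvement.
rate = 0.2
trace = []
for improved in (False, False, True):
    rate = adapt_rate(rate, improved)
    trace.append(rate)
```

In an LLM-mutator system, "mutation strength" has no single meaning; it might map to sampling temperature or to the probability of choosing a full rewrite over a line-level diff, and that mapping is a separate design choice.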
71.12 Cost & Budget Management
How the system decides when to stop spending and where to spend next.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Fixed budget cap | Hard stop at N calls or $X cost. | — | Every serious system in practice [P/I] | Should be present as a safety net regardless of other budget machinery. |
| Cascade early stop | Abandon evaluation as soon as the light stage fails. | E1 | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12] | Very strong win whenever cascade-stage cost varies 10× or more; see §71.16. |
| Adaptive sample size | More instances for promising candidates (racing). | E5 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Useful when evaluator variance is high and test sets are large. |
| Model-tier routing | Route by marginal fitness-per-dollar. | — | AlphaEvolve [I, Ch9], Sakana Marlin [I, Ch30], RD-Agent [P, Ch38] | Effective when provider pools span large cost ratios; static cost-tier routing usually captures most of the gain (§71.8). |
| Bandit budget split | Bandit dynamically reallocates budget across operators, islands, or models. | B1–B4 | ShinkaEvolve [P, Ch12], Next Evolution Architecture blueprint [B] | Most justified on long runs with shifting regimes. |
| Per-stage cost ceiling | Each cascade stage has an independent cost budget. | — | ShinkaEvolve [P, Ch12], Meta-Harness [P, Ch32] | Useful when stage costs are predictable and bounded; prevents one stage from starving the others. |
| Marginal-value stopping | Stop when expected improvement per dollar falls below a threshold. | — | AlphaEvolve internal [I, Ch9], Next Evolution Architecture blueprint [B] | Theoretically clean; requires a calibrated improvement model that is rare in practice. |
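The fixed budget cap and marginal-value stopping rows combine naturally into one stopping check. The sketch below uses a crude empirical estimate, the recent gain in best fitness divided by recent spend, in place of the calibrated improvement model the table calls for; the window and threshold are hypothetical.

```python
def should_stop(history, spent, budget, window=5, min_gain_per_dollar=1e-3):
    """Decide whether to stop the run.

    `history` is a list of (cost, best_fitness_so_far) per iteration, oldest first.
    """
    if spent >= budget:
        return True                      # the hard cap always wins
    if len(history) <= window:
        return False                     # too little data to estimate a trend
    recent = history[-window:]
    gain = recent[-1][1] - history[-window - 1][1]
    cost = sum(c for c, _ in recent)
    if cost <= 0:
        return False                     # free iterations never trigger the rule
    return gain / cost < min_gain_per_dollar


improving = [(1.0, 0.1 * i) for i in range(10)]   # best fitness still climbing
plateau = [(1.0, 0.5) for _ in range(10)]         # no improvement at all

stop_improving = should_stop(improving, spent=10.0, budget=100.0)
stop_plateau = should_stop(plateau, spent=10.0, budget=100.0)
stop_capped = should_stop(improving, spent=100.0, budget=100.0)
```

Because the estimate looks only backwards, it will stop early on tasks with long plateaus followed by jumps, which is one reason the cap should stay in place as a safety net and the marginal rule should be treated as advisory.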
71.13 Sandboxing & Safety
A safety requirement rather than a tuning choice: any system executing LLM-generated code needs isolation. The question is how much.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Subprocess | Untrusted code in a child process with minimal OS isolation. | — | OpenEvolve dev mode [P, Ch14], small prototypes [P] | Development only. Not appropriate for production or shared infrastructure. |
| Container (Docker) | Candidate runs in a restricted container with no network. | — | AutoHarness [P, Ch31], AutoAgent [P, Ch7], Meta-Harness [P, Ch32] | The typical production default for code-execution evaluators. |
| Syscall whitelist | seccomp / AppArmor filter on allowed syscalls. | — | AlphaEvolve internal [I, Ch9], Next Evolution Architecture blueprint [B] | Warranted when adversarial code (or reward-hacking) is a realistic threat. |
| Resource caps | Hard CPU, RAM, wall-clock, disk limits. | — | Every serious system [P/I] | Cheap to add and catches infinite loops; a standard baseline. |
| Network disabled | No outbound network from the sandbox. | — | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], AutoHarness [P, Ch31] | Standard for code-execution evaluators; deliberate exceptions require strong justification. |
| Static-analysis gate | Reject candidates whose AST contains banned constructs (exec, eval, import os). | — | Meta-Harness [P, Ch32], Next Evolution Architecture blueprint [B] | A belt-and-braces complement to the sandbox; useful when the cost of one bad execution is high. |
| Budget guard | Separate watchdog that kills runs exceeding cost or token limits. | — | AutoAgent [P, Ch7], RD-Agent [P, Ch38], Sakana Marlin [I, Ch30] | Standard practice for long autonomous runs where runaway costs are a realistic failure mode. |
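
The static-analysis gate row can be made concrete with Python's standard `ast` module. The following is a minimal sketch, assuming the candidate is Python source and using only the banned constructs named in the table (`exec`, `eval`, `import os`); the function name `violations` and the two constant sets are hypothetical choices for this example.

```python
import ast
from typing import List

BANNED_CALLS = {"exec", "eval"}   # call names listed in the table row
BANNED_MODULES = {"os"}           # top-level modules that may not be imported


def violations(source: str) -> List[str]:
    """Return reasons to reject `source`; an empty list means the gate passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Direct calls such as eval("...") or exec("...").
            if node.func.id in BANNED_CALLS:
                found.append(f"call to {node.func.id}() at line {node.lineno}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    found.append(f"import {alias.name} at line {node.lineno}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                found.append(f"from {node.module} import ... at line {node.lineno}")
    return found
```

A name-based check like this is easy to evade (for example through `__import__("os")` or `getattr` on builtins), which is why the table treats the gate as a complement to the sandbox and not a replacement for it.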
71.14 Failure Handling & Repair
What happens when a candidate crashes, loops, or fails its evaluator?
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| None | Failed candidate is simply discarded; no learning from failure. | — | Baselines [P] | Appropriate only when failure traces carry no information, which is rare. |
| Retry | Re-run with same inputs; if flaky it may pass on retry. | — | OpenEvolve [P, Ch14], AutoHarness [P, Ch31] | Necessary on genuinely flaky evaluators (real-world benchmarks); should be bounded to avoid masking real bugs. |
| Error-repair operator | On failure, call the LLM to patch the candidate given the error trace. | M4 | Most serious systems [P/I] | Standard as a fallback operator whenever execution can fail. |
| Learning-log entry | Record the failure with context so future prompts can avoid the same mistake. | — | AlphaEvolve [I, Ch9], ShinkaEvolve [P, Ch12], GEPA [P, Ch25] | Valuable when failure modes are informative and the log is actually read back into subsequent prompts. |
| Candidate quarantine | Track chronically failing candidates separately; never sample them as parents. | — | ShinkaEvolve [P, Ch12], RetroAgent [P, Ch34] | Useful when repair consistently fails on a minority of candidates that would otherwise keep consuming budget. |
| Reflective repair | Reflection step analyses the failure and proposes both a repair and a lesson for the log. | — | GEPA [P, Ch25], NeoSigma [I, Ch35], Darwin Gödel Machine [P, Ch23] | Best on expensive evaluators where each failure must teach something to justify its cost. |
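
Several of these rows are usually layered. The sketch below shows one plausible ordering, assuming a bounded retry first, then a learning-log entry, then the error-repair operator. The function `evaluate_with_repair` and the `evaluate` and `repair` callables are hypothetical stand-ins for an evaluator and an LLM wrapper; none of the surveyed systems is claimed to use this exact control flow.

```python
from typing import Callable, Dict, List, Optional, Tuple

EvalResult = Tuple[bool, float, str]  # (ok, fitness, error trace)


def evaluate_with_repair(
    candidate: str,
    evaluate: Callable[[str], EvalResult],
    repair: Callable[[str, str], str],
    learning_log: List[Dict[str, str]],
    max_retries: int = 1,
    max_repairs: int = 2,
) -> Optional[Tuple[str, float]]:
    """Return (candidate, fitness) on success, or None if the candidate is discarded."""
    for attempt in range(max_repairs + 1):
        trace = ""
        # Bounded retry: re-run the same candidate to absorb evaluator flakiness.
        for _ in range(max_retries + 1):
            ok, fitness, trace = evaluate(candidate)
            if ok:
                return candidate, fitness
        # Learning-log entry: keep the trace so later prompts can read it back.
        learning_log.append({"candidate": candidate, "error": trace})
        if attempt < max_repairs:
            # Error-repair operator: ask the (hypothetical) LLM wrapper for a patch.
            candidate = repair(candidate, trace)
    return None
```

Both loops are bounded on purpose: unbounded retries can mask real bugs, and unbounded repairs let a chronically failing candidate keep consuming budget, which is the situation candidate quarantine is meant to prevent.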
71.15 Evaluator Types
The full space of fitness functions. The choice is dictated by the problem rather than by the search strategy.
| Variant | Summary | Ch67 | Used by | When to pick |
|---|---|---|---|---|
| Code execution | Run the candidate program against a test suite. | E2 | AlphaEvolve [I, Ch9], OpenEvolve [P, Ch14], ShinkaEvolve [P, Ch12], LLM4AD [P, Ch21] | The natural choice for algorithmic discovery with deterministic tests. |
| Benchmark suite | Standardised, multi-instance benchmark (e.g., MLE-Bench, ALE-Bench). | E3 | RD-Agent [P, Ch38], ALE-Agent [P, Ch20], AI-Researcher [P, Ch55] | Required when comparable, reproducible results are the product. |
| LLM-as-judge | Another LLM scores the candidate's output. | — | AI Scientist [P, Ch48], AI Scientist v2 [P, Ch49], research agents [P] | Typically the only option for open-ended outputs (papers, hypotheses, plans); known to introduce stylistic bias. |
| Hybrid exec + judge | Execution for correctness, LLM for quality. | — | Meta-Harness [P, Ch32], RD-Agent [P, Ch38], DeepScientist [P, Ch43] | Appropriate when objectives mix verifiable and subjective components. |
| Human-in-loop | Periodic human review as a high-fidelity oracle. | — | AI Scientist [P, Ch48], K-Dense Co-Scientist [P, Ch56] | Warranted for safety-critical or domain-specific judgements; rate-limited by annotator bandwidth. |
| Simulator | Candidate drives a simulator; reward is the simulator's outcome. | — | EurekaClaw [P, Ch15], Matlantis [I, Ch41], Sakana Marlin [I, Ch30] | The idiomatic evaluator for robotics, materials science, and sim-to-real. |
| Formal verifier | Lean/Coq proof checker; fitness is proof-found / proof-length. | — | OpenProver [P, Ch40], Pi-Autoresearch [P, Ch39] | Suited to theorem proving and verified synthesis; applicability is narrow but fidelity is high. |
| Real-world deployment | Candidate runs in a production system; feedback is real traffic. | — | NeoSigma failure mining [I, Ch35], 7/24 Office [I, Ch42] | Appropriate when the only accurate oracle is production and the cost of bad candidates is bounded. |
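
To make the hybrid exec + judge row more tangible, here is a minimal sketch in which execution decides correctness and a judge contributes only a bounded quality term. The function `hybrid_fitness`, the `quality_weight` default, and the `judge` callable (standing in for an LLM call that returns a score in [0, 1]) are all assumptions for illustration.

```python
from typing import Callable, Sequence, Tuple


def hybrid_fitness(
    func: Callable[[int], int],
    source: str,
    tests: Sequence[Tuple[int, int]],
    judge: Callable[[str], float],
    quality_weight: float = 0.3,
) -> float:
    """Score in [0, 1]: execution gates correctness, the judge adds a quality term."""
    passed = 0
    for arg, expected in tests:
        try:
            if func(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    pass_rate = passed / len(tests) if tests else 0.0
    if pass_rate < 1.0:
        # Incorrect candidates never reach the judge, so they cost no judge tokens
        # and always score below every fully correct candidate.
        return (1.0 - quality_weight) * pass_rate
    quality = min(1.0, max(0.0, judge(source)))  # clamp the judge score to [0, 1]
    return (1.0 - quality_weight) + quality_weight * quality
```

Under this weighting a candidate that fails any test scores below `1 - quality_weight`, while a fully correct one scores at least that much, so the subjective term can only reorder candidates that are already verified.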
71.16 Cross-Cutting Meta-Analysis
The fifteen tables above enumerate what can be done. This section asks the synthesis questions the reader is entitled to: which variants are actually common, which combinations recur together, which dimensions are strongly coupled, and where do the surveyed systems genuinely disagree? The quantitative claims below are coarse counts over the sixty-one surveyed systems; treat them as directional rather than inferential.
71.16.1 Variant Frequency (Directional)
Frequencies are bucketed because not every system exposes every dimension and not every codebase was audited at the same depth. Dominant = seen in a clear majority of the surveyed systems exposing this dimension; common = seen in roughly a third to half; niche = seen in under ten systems; rare = seen in one to three and usually experimental.
| Dimension | Dominant variant(s) | Common | Niche / rare |
|---|---|---|---|
| §71.1 Search strategy | Single-pop EA, Island EA | Tree search, MAP-Elites | Hybrid loop+tree, ES-at-scale |
| §71.2 Population | Flat elitism, Ranked archive | MAP-Elites grid, Tiered | Knowledge graph, Skills archive |
| §71.3 Selection | Tournament, Power-law rank | MAP-Elites cell, UCB1 | Fitness-proportionate, EXP3 |
| §71.4 Mutation | Line-level diff, Full rewrite, Error-repair | Reflect-then-mutate, Prompt-template | Multi-parent, Feature-specific, Diff-based crossover |
| §71.5 Diversity | Island isolation | Embedding-distance novelty, Behavioural grid | Novelty-penalised fitness |
| §71.6 Evaluation | Cascade, Sandbox, Multi-instance | LLM-as-judge, Adaptive sample size | Formal verifier, ASI diagnostic feedback |
| §71.7 Orchestration | Hierarchical drafter+improver, Proposer+critic | Model pool + bandit routing | Multi-agent debate, Ensemble voting |
| §71.8 Routing | Static, Cost-tier | Per-stage | Hierarchical bandit, Budget-aware |
| §71.9 Prompts | Static, Slot mutation | Prompt populations, Reflect-and-rewrite | Instruction co-evolution |
| §71.10 Memory | No memory, Learning log (text) | Skills archive, Embedded log | Knowledge graph, Lessons library |
| §71.11 Meta-adaptation | None (frozen) | Adaptive mutation rate, Operator bandit | Self-modifying code, Full self-rewrite |
| §71.12 Budget | Fixed cap, Cascade early stop | Model-tier routing, Adaptive sample size | Marginal-value stopping |
| §71.13 Sandbox | Container, Resource caps, Network disabled | Budget guard, Static-analysis gate | Syscall whitelist |
| §71.14 Failure handling | Retry, Error-repair operator | Learning-log entry, Candidate quarantine | Reflective repair |
| §71.15 Evaluator | Code execution, Benchmark suite | LLM-as-judge, Hybrid exec+judge, Simulator | Formal verifier, Real-world deployment, Human-in-loop |
71.16.2 Recurring Combinations (Co-Occurrence)
Five combinations recur so often that they effectively define recognisable styles of LLM-powered evolution:
- FunSearch-lineage stack. Single-pop or island EA (§71.1) + flat elitism (§71.2) + tournament or power-law rank (§71.3) + line-level diff + full rewrite + error-repair (§71.4) + cascade evaluator (§71.6) + container sandbox (§71.13) + learning log (§71.10). Examples: OpenEvolve, AlphaEvolve, ShinkaEvolve, LLM4AD.
- Quality-diversity stack. MAP-Elites search (§71.1) + grid archive (§71.2) + cell sampling (§71.3) + behavioural-descriptor novelty (§71.5) + cascade evaluator (§71.6). Examples: AlphaEvolve QD mode, GEPA variants.
- Reflection-heavy agent stack. Reflect-then-mutate (§71.4) + proposer+critic orchestration (§71.7) + prompt populations or skill prompts (§71.9) + embedded log or skills archive (§71.10) + reflective repair (§71.14) + LLM-as-judge or hybrid evaluator (§71.15). Examples: GEPA, GEPA Skills, RetroAgent, EvoSkill.
- Tree-search stack. MCTS/AB-MCTS (§71.1) + UCB1 or Thompson selection (§71.3) + per-node fitness (§71.6). Examples: AB-MCTS, TreeQuest, Arcgentica, ALE-Agent.
- Self-modification stack. Harness evolution + self-modifying code (§71.11) + instruction co-evolution (§71.9) + lessons library (§71.10) + reflective repair (§71.14). Examples: Darwin Gödel Machine, Darwinian Evolver, NeoSigma.
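Co-occurrence sketch (row = representative system, column = style; ● = strong alignment, ○ = partial, · = no notable alignment):
| System | FunSearch | QD | Reflect | Tree | Self-mod |
|---|---|---|---|---|---|
| OpenEvolve | ● | ○ | · | · | · |
| AlphaEvolve | ● | ● | · | · | · |
| ShinkaEvolve | ● | ○ | ○ | · | · |
| LLM4AD | ● | · | · | · | · |
| GEPA | ○ | ○ | ● | · | · |
| GEPA Skills | · | · | ● | · | · |
| RetroAgent | · | · | ● | · | ○ |
| AB-MCTS / TreeQuest | · | · | · | ● | · |
| Arcgentica | · | · | · | ● | · |
| ALE-Agent | · | · | · | ● | · |
| Darwin Gödel Machine | · | · | ○ | · | ● |
| NeoSigma | · | · | ○ | · | ● |
| EvoSkill | · | · | ● | · | ○ |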
71.16.3 Strongly Coupled Dimensions
Some dimensional choices are nearly forced by prior ones. The couplings below are observed as structural regularities, not proven causal dependencies:
- Cascade evaluator ↔ tiered population ↔ cost-tier routing. Choosing a cascade in §71.6 implies candidates live at different refinement stages (§71.2) and that cheap vs. strong models run at different stages (§71.8). AlphaEvolve and ShinkaEvolve show all three.
- MAP-Elites grid ↔ cell sampling ↔ behavioural descriptor. Choosing a QD archive (§71.2) effectively fixes the selection rule (§71.3) and the diversity mechanism (§71.5).
- Skills archive ↔ prompt-populations ↔ skill-prompt retrieval. When §71.2 is a skills archive, §71.9 almost always includes some form of prompt-level retrieval, and §71.10 an embedded log.
- Island model ↔ adaptive spawning ↔ heterogeneous islands. Within the island family, systems that ablate migration topology tend to also vary island configurations and operator mixes.
- Reflect-then-mutate ↔ learning log ↔ reflective repair. Reflection machinery is rarely worthwhile unless the log retains lessons and failures feed back into prompts.
- Self-modifying code ↔ instruction co-evolution ↔ lessons library. Gödelian self-modification almost always appears together with prompt/skill co-evolution and an explicit lessons archive.
Conversely, several dimensions are decoupled in practice: sandboxing choices (§71.13) are largely orthogonal to everything else, and mutation operator choice (§71.4) is surprisingly independent of search strategy (§71.1) — line-level diff appears in single-pop EAs, island EAs, and QD systems alike.
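
One way to put these couplings to work is as an advisory lint over a design configuration. The sketch below is a minimal example, assuming each dimension maps to a set of chosen variants; the dimension keys, variant labels, and the function `coupling_warnings` are hypothetical names invented for this example, and only a subset of the couplings listed above is encoded.

```python
from typing import Dict, List, Set

# (trigger dimension, trigger variant, implied dimension, expected variant)
COUPLINGS = [
    ("evaluation", "cascade", "population", "tiered"),
    ("evaluation", "cascade", "routing", "cost_tier"),
    ("population", "map_elites_grid", "selection", "map_elites_cell"),
    ("population", "map_elites_grid", "diversity", "behavioural_descriptor"),
    ("mutation", "reflect_then_mutate", "memory", "learning_log"),
    ("mutation", "reflect_then_mutate", "failure_handling", "reflective_repair"),
]


def coupling_warnings(config: Dict[str, Set[str]]) -> List[str]:
    """List observed couplings that the given design configuration does not follow."""
    warnings = []
    for dim, variant, implied_dim, expected in COUPLINGS:
        if variant in config.get(dim, set()) and expected not in config.get(implied_dim, set()):
            warnings.append(
                f"{dim}={variant} usually appears with {implied_dim}={expected}"
            )
    return warnings
```

The function returns warnings and never raises, because the couplings are observed regularities and a design that breaks one may still be sound.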
71.16.4 Where the Surveyed Systems Genuinely Disagree
On several questions the field does not (yet) speak with one voice. These are the most instructive points for a practitioner deciding what to build:
- Does crossover help? ShinkaEvolve and LLM4AD use two-parent crossover; FunSearch and OpenEvolve's default mode omit it. No cross-system controlled study settles whether the LLM genuinely mixes parents or is merely inspired by seeing two candidates at once.
- Reflection vs. cheap rerolls. GEPA, EurekaClaw, and RetroAgent spend tokens on reflection before each mutation; AlphaEvolve and OpenEvolve largely prefer to generate more cheap diffs and let selection sort them out. Both stacks produce state-of-the-art results on different task families.
- Static vs. evolving prompts. FunSearch and OpenEvolve freeze the prompt; GEPA, Darwin Gödel Machine, and AutoAgent evolve it. The freeze camp cites reproducibility; the evolving camp cites long-run adaptation.
- Self-modifying code. Darwin Gödel Machine and NeoSigma allow the system to rewrite itself; the rest of the field keeps the harness frozen. The disagreement is as much about safety posture as about efficiency.
- Memory format. Text logs (AlphaEvolve, ShinkaEvolve) vs. embedded logs (Omni-SimpleMem, RetroAgent) vs. knowledge graphs (OmniScientist, Zochi). No published ablation cleanly compares them at matched budget.
- Diversity via archive shape vs. fitness penalty. MAP-Elites enforces diversity structurally; EurekaClaw and GEPA enforce it via an explicit fitness term. The trade-off between the two has not been systematically benchmarked.
71.16.5 Which Claims in This Chapter Are Well-Supported?
Not all "When to pick" recommendations rest on the same evidence base. A rough tier list:
- Well-supported (multiple within-system ablations across different teams): cascade evaluation beats single-stage evaluation whenever stage costs differ by 10× or more; sandboxing is mandatory for code execution; error-repair as a fallback operator is near-universal because its absence produces visible failure modes.
- Moderately supported (one or two published ablations, or consistent anecdotal reports): tournament > fitness-proportionate as a default; cost-tier routing saves budget on multi-provider pools; line-level diff dominates on mature candidates while full rewrite dominates on early exploration.
- Weakly supported (practitioner consensus, little controlled evidence): crossover helps in large diverse archives; reflective repair is worth its cost on expensive evaluators; knowledge graphs pay off in research agents; power-law rank > uniform on long archives.
- Under-evidenced (claims that appear in the literature but have not been cleanly ablated): MAP-Elites vs. island EAs on code-synthesis tasks; optimal bandit family (UCB1 vs. Thompson vs. EXP3) for operator selection; whether embedded-log retrieval measurably outperforms text-log retrieval at matched budget; whether self-modifying systems converge faster than frozen-harness systems on any benchmark.
Readers designing a new system should weight the recommendations accordingly. The cross-cutting dimensions most likely to move the needle on realistic research budgets are, on current evidence, evaluator cascade design, mutation operator pool (including error-repair), and cost-tier model routing. Everything else is a secondary tuning knob whose impact depends strongly on problem structure.
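
The arithmetic behind the cascade claim is simple enough to sketch. The example below compares total evaluation cost with and without early stopping; the stage costs, the pass rate, and the helper names `cascade_cost` and `compare_costs` are made-up values and names for illustration, not measurements from any surveyed system.

```python
from typing import Callable, Sequence, Tuple

Stage = Tuple[float, Callable[[str], bool]]  # (cost per run in dollars, pass/fail check)


def cascade_cost(candidate: str, stages: Sequence[Stage]) -> Tuple[bool, float]:
    """Run stages in order; stop at the first failure and return (passed, cost spent)."""
    spent = 0.0
    for cost, check in stages:
        spent += cost
        if not check(candidate):
            return False, spent  # early stop: later, costlier stages are skipped
    return True, spent


def compare_costs(candidates: Sequence[str], stages: Sequence[Stage]) -> Tuple[float, float]:
    """Return (cost with early stop, cost if every stage ran on every candidate)."""
    with_early_stop = sum(cascade_cost(c, stages)[1] for c in candidates)
    without = len(candidates) * sum(cost for cost, _ in stages)
    return with_early_stop, without


# Illustrative setup: a light syntactic check that is 10x cheaper than the heavy stage.
stages = [
    (0.01, lambda src: "def " in src),  # light stage: crude smoke test
    (0.10, lambda src: True),           # heavy stage: stub that always passes
]
candidates = ["def f(): pass"] * 2 + ["garbage"] * 8
early, full = compare_costs(candidates, stages)
```

With these assumed numbers, early stopping spends about 0.30 against about 1.10 for running both stages everywhere. The saving shrinks as the light stage's pass rate rises or as the cost gap between stages narrows, which is consistent with the 10× rule of thumb.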
Chapter Summary
Key takeaway. An LLM-powered evolutionary system is a point in an approximately fifteen-dimensional design space. Each of the fifteen tables in this chapter is one dimension, and each row is a variant observed in at least one surveyed system, tagged by source type ([P]/[I]/[B]/[E]) and chapter pointer for auditability. The dimensions are not strictly orthogonal (§71.0.1); several recognisable styles — FunSearch, quality-diversity, reflection-heavy, tree-search, self-modifying — arise from tightly coupled choices (§71.16.2–§71.16.3).
How to use this chapter. For each dimension, identify candidate variants and cross-check against the representative systems in the "Used by" column; follow the chapter pointer for context on how the variant behaves in that system, and the Ch67 method ID for its formal specification. Treat the "When to pick" column as conditional rules of thumb, not firm prescriptions: §71.16.5 lists which of those rules are backed by controlled ablations and which are practitioner consensus awaiting stronger evidence.