Evo Methods in 5 Minutes
Part P09: Synthesis & Future Directions
This chapter is a five-minute orientation for readers who have not yet opened the sixty-one system chapters in this book. The first section (§70.1) delivers the five-minute front layer itself: a single mental model, a compact ingredient schema, one taxonomy map, and four takeaways. Everything after §70.1 is deep dive — the decision procedure, the N/T/O lineage criteria, the hybrid and out-of-taxonomy cases, the comparative matrix, the evidence ledger, the design knobs, and the failure modes — and can be treated as an appendix-style reference by readers who want the orientation alone. A note on epistemic tone carried throughout: this is a synthesis chapter, not an empirical meta-analysis, so we distinguish between observations (a pattern visible in a cited chapter's reported tables), hypotheses (a reading of that pattern we find plausible), and conjectures (claims the surveyed literature does not yet settle).
70.1 The Five-Minute Front Layer
70.1.1 The universal loop
Most systems surveyed in this book, from FunSearch's island-pool function search (Ch3) through the Darwin Gödel Machine's self-rewriting harness (Ch23), can be read as an instantiation of the same four-step loop. Learning this loop is a good first pass at the field; it is not a theorem about it, and §70.3 lists the families for which the reading is lossy.
    # A first-pass reading of the LLM-evolution loop.
    # Useful for most, but not all, systems surveyed in this book.
    while budget_remaining():
        parent = select(state)                 # what to evolve from
        child = mutate(parent, llm)            # LLM proposes a variant
        score = evaluate(child)                # automated evaluator scores it
        state = update(state, child, score)    # archive / tree / frontier update
        memory.record(parent, child, score)    # optional: log, skills, reflection
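The loop above is pseudocode, so a toy instantiation may help make the four boxes concrete. The sketch below is illustrative only: `run_loop`, `TARGET`, the integer-list candidate, and the ranked pool of size 8 are assumptions introduced here, and `mutate` is a random nudge standing in for an LLM call.

```python
import random

def run_loop(evaluate, mutate, seed_candidate, budget=200, pool_size=8, rng=None):
    """Toy select -> mutate -> evaluate -> update loop over a ranked pool."""
    rng = rng or random.Random(0)
    state = [(evaluate(seed_candidate), seed_candidate)]   # archive: ranked pool
    memory = []                                            # optional log
    for _ in range(budget):
        _, parent = rng.choice(state)                      # select
        child = mutate(parent, rng)                        # mutate (LLM stand-in)
        score = evaluate(child)                            # evaluate
        state.append((score, child))                       # update: admit child ...
        state.sort(key=lambda pair: pair[0], reverse=True)
        del state[pool_size:]                              # ... and keep the top pool_size
        memory.append((parent, child, score))              # record
    return state[0], memory

TARGET = [3, 1, 4, 1, 5]  # hypothetical hidden optimum

def evaluate(candidate):
    # Higher is better: negative L1 distance to the target.
    return -sum(abs(a - b) for a, b in zip(candidate, TARGET))

def mutate(parent, rng):
    # Stand-in for an LLM edit: nudge one position by +1 or -1.
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice((-1, 1))
    return child

(best_score, best), log = run_loop(evaluate, mutate, [0, 0, 0, 0, 0])
```

Swapping the pool for a grid or a tree changes only the `select` and `update` lines, which is the sense in which the archive column carries most of the lineage distinction.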
70.1.2 Six ingredients, at a glance
Each chapter can be read as a set of answers for the same six ingredients. Holding the schema fixed is what makes the sixty-one chapters comparable at all.
| Ingredient | Question | Representative range |
|---|---|---|
| Representation | What is a candidate? | function, file, diff, prompt, skill, weight vector, proof, hypothesis, harness |
| Variation | How is a child produced? | diff, rewrite, crossover, reflect-then-edit, repair, prompt mutation, Gaussian noise |
| Evaluator | How is a child scored? | exec, cascade, multi-instance, LLM judge, benchmark, proof, human |
| Archive | Which candidates survive? | flat pool, ranked, island, MAP-Elites, Pareto, tree, tiered, skills library |
| Memory | What persists beyond the archive? | none, learning log, reflection buffer, skills, graph, self-modifying harness |
| Cost profile | Where does the budget go? | exec ms, cascade, multi-seed sec, RL rollout min, harness run hr, nested inner loop |
70.1.3 The taxonomy map
Systems cluster into four broad lineages and two out-of-taxonomy families. The figure below is the whole map: which lineage exists, which columns decide membership, and which edges between lineages the surveyed systems actually travel along. The deep-dive sections (§70.5 onward) unpack each box.
70.1.4 Four takeaways
- The loop is four boxes plus optional memory. State → select → mutate → evaluate → update is the common skeleton. Classifying a system starts by asking what goes in each box.
- Archive and memory are the decisive columns. Across the surveyed systems, the variation operator is shared widely (diff/rewrite appears in FS and QD; reflect-then-edit appears in RF and hybrid TS+RF), but the archive column (pool vs. grid vs. tree) and the memory column (none vs. log vs. reflection-buffer) do most of the lineage work.
- Four lineages, two OOT families. FunSearch, Quality-Diversity, Tree-search, and Reflection cover the bulk of the book. ES-at-scale and the outer layer of research-agent pipelines are deliberately kept out of the taxonomy because they fail a necessary criterion of every lineage.
- Evaluator fidelity and mutation granularity are the knobs most often reported as first-order; the clearest gap in the literature is cross-lineage head-to-heads on identical benchmarks with matched compute.
If you stop here. You now have the universal loop, the six ingredients, the taxonomy map, and the four takeaways. §§70.2–70.8 are a deep-dive reference that operationalises this front layer: decision procedures, lineage criteria, hybrids, the comparative matrix, the evidence ledger, design knobs, and failure modes. Return to them when you open a specific system chapter and want the classification framework made precise.
70.2 Beyond Five Minutes: Deep-Dive Roadmap
The rest of the chapter takes each element of the front layer and makes it formal: §70.3 lists where the loop abstraction bends; §70.4 makes the six ingredients into a decision procedure with an explicit precedence tree; §70.5 turns the four lineages into N/T/O criteria and catalogues hybrids and out-of-taxonomy cases; §70.6 presents the comparative matrix and its evidence ledger with counts; §70.7 maps the design knobs the surveyed ablations report as first-order and the comparisons still missing; §70.8 catalogues failure modes; §70.9 is a routing guide.
70.3 Where the Universal Loop Bends
Several families surveyed in this book map onto the loop only with qualifications. Calling them out up front is the honest move; §70.5.3 then decides which of these qualified mappings deserve their own taxonomic category.
- Tree search systems (AB-MCTS/TreeQuest Ch19, Arcgentica Ch17, ALE-Agent Ch18, Confluence Labs Ch20). The "population" is a search tree, select is a UCT/UCB-style traversal down that tree to an expansion node, and update is a backpropagation of child scores to ancestor statistics. The loop reading is faithful to the control flow, but "archive admission" hides that tree growth, pruning, and value backup are the actual operators. These systems are in-taxonomy (§70.5, Tree-search lineage) because the four boxes still carry real semantic weight.
- Beam search and best-first rollouts (used as inner loops in several research agents, e.g. AIRA₂, DeepScientist). The "population" is a fixed-width frontier; select is a top-k operation, not a sampler, and the variation operator is usually a single-shot generation with no explicit mutation distinction. These fit the loop as a degenerate case and are typically absorbed into whichever lineage the surrounding system belongs to.
- Self-modifying agents (Darwin Gödel Machine Ch23, Darwinian Evolver Ch24, NeoSigma Ch28). The "candidate" is the agent harness itself, so the evaluator must run the candidate on a downstream task to score it. The loop is still recognisable, but mutate is a source-code patch and evaluate can involve nested evolutionary runs, making the loop recursive rather than flat. These remain in-taxonomy under the Reflection lineage because the four boxes still carry weight — the candidate population is just unusually small and the mutation operator is unusually expressive.
- Research-agent pipelines (Part P07: AI Scientist, AIRA₂, Zochi, DeepScientist, RD-Agent, OmniScientist). These are end-to-end pipelines — literature review → hypothesis → experiment → paper — and only a subset of their stages is evolutionary. Reading the whole pipeline as one loop is lossy; reading the idea-mutation, experiment-refinement, or reward-shaping stages as loops is accurate. We treat the outer pipeline as an out-of-taxonomy special case in §70.5.3 because the architectural interest lives above the loop, not inside it; their inner loops are classified normally.
- Evolution-strategies-at-scale methods (Ch46, Ch47). The variation operator is Gaussian noise on a parameter vector, not an LLM edit, and the update rule is a weighted recombination rather than an archive insertion. We treat them as out-of-taxonomy special cases in §70.5.3: they belong in the book because they inform the LLM-evolution conversation about scaling and population dynamics, but they blur the "L" in LLM-evolution and do not fit any of the four lineages cleanly.
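To make the tree-search reading of select and update concrete, here is a minimal sketch of a UCB1 traversal and a score backup. It assumes the textbook UCB1 rule with an exploration constant of 1.4; the `Node` class and function names are hypothetical, and the surveyed systems may use other selection rules.

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0   # sum of scores backed up through this node

    def mean(self):
        return self.total / self.visits if self.visits else 0.0

def ucb1(child, parent_visits, c=1.4):
    if child.visits == 0:
        return math.inf    # unvisited children are tried first
    return child.mean() + c * math.sqrt(math.log(parent_visits) / child.visits)

def select(root):
    """Walk down from the root, following the best UCB1 child, until a leaf."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
    return node

def backpropagate(node, score):
    """Add one evaluation result to the node and to every ancestor."""
    while node is not None:
        node.visits += 1
        node.total += score
        node = node.parent

root = Node()
a, b = Node(root), Node(root)
root.children = [a, b]
backpropagate(a, 0.2)
backpropagate(b, 0.9)
chosen = select(root)   # b: equal visit counts, higher mean score
```

Nothing here resembles archive admission in the pool sense: the state that persists is the per-node `visits` and `total`, which is why the loop reading hides the real operators.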
70.4 The Six Ingredients and a Decision Procedure
The six ingredients become more useful when treated as a fixed schema rather than a checklist. Compute and evaluation cost gets its own row because §70.7.1 finds, across the surveyed ablations, that evaluator fidelity is among the knobs most often reported as first-order; a schema that treated cost as secondary would contradict the chapter's own synthesis.
| Ingredient | Question | Range of choices observed in the field |
|---|---|---|
| Representation | What is a candidate? | A Python function, a full file, a diff, a skill, a prompt, a neural-network weight vector, a proof, a scientific hypothesis. |
| Variation operator | How do we make a new candidate? | Line diff, full rewrite, two-parent crossover, reflect-then-edit, error repair, prompt mutation, Gaussian noise on parameters. |
| Evaluator | How do we score a candidate? | Single execution, cascade (smoke→light→full), multi-instance aggregation, LLM-as-judge, hybrid, real benchmark, formal proof, human review. |
| Selector & archive | Which candidates survive, which become parents? | Flat elitism, ranked archive, MAP-Elites grid†, Pareto front, island ring, search tree, tiered pool, skills library. |
| Memory / reflection | What does the system remember beyond the archive? | Nothing, a learning log, an embedded log, a skills archive, a reflection buffer, a knowledge graph, a self-modifying harness. |
| Cost & budget profile | What does one candidate cost, and where does the budget go? | Cheap deterministic exec (ms), cascaded exec with admission thresholds, multi-seed benchmarks (sec–min), RL rollouts or agent harness runs (min–hr), nested evolutionary inner loops. |
† MAP-Elites: a Quality-Diversity archive that indexes candidates by a low-dimensional behavioral descriptor — a hand- or auto-defined feature vector capturing how a candidate behaves, not how well — and keeps one champion per cell. See Ch67 §67.4 for the formal definition used throughout this book.
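The one-champion-per-cell rule in the footnote can be sketched in a few lines. This is an illustrative toy rather than the formal definition in Ch67: the uniform binning, the `[0, 1]` descriptor range, the four bins per dimension, and the names `cell_of` and `try_insert` are all assumptions made for the example.

```python
def cell_of(descriptor, bins, low=0.0, high=1.0):
    """Map a descriptor vector to a grid cell by uniform binning per dimension."""
    cell = []
    for value in descriptor:
        clipped = min(max(value, low), high)
        index = min(int((clipped - low) / (high - low) * bins), bins - 1)
        cell.append(index)
    return tuple(cell)

def try_insert(grid, candidate, score, descriptor, bins=4):
    """Admit the candidate only if its cell is empty or it beats the incumbent."""
    key = cell_of(descriptor, bins)
    incumbent = grid.get(key)
    if incumbent is None or score > incumbent[0]:
        grid[key] = (score, candidate)
        return True
    return False

grid = {}
try_insert(grid, "solver_a", 0.60, (0.10, 0.90))
try_insert(grid, "solver_b", 0.75, (0.12, 0.95))   # same cell as solver_a, higher score
try_insert(grid, "solver_c", 0.40, (0.80, 0.20))   # different cell, kept despite low score
```

The third call shows the point of the descriptor: a candidate with the lowest score still survives because it behaves differently from the others.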
70.4.1 Classification procedure
    # A reading protocol for placing a new LLM-evolutionary system.
    # Apply to any paper, repo, or chapter in this book.
    def classify(system) -> dict:
        schema = {}
        # Q1 Representation: what is ONE unit of selection?
        schema["representation"] = ...   # function|file|diff|prompt|skill|weight_vector|proof|hypothesis|harness
        # Q2 Variation operator: how is a child produced from parent(s)?
        schema["variation"] = ...        # diff|rewrite|crossover|reflect_then_edit|repair|prompt_mutation|gaussian_noise
        # Q3 Evaluator, fidelity, and noise profile.
        schema["evaluator"] = ...        # single_exec|cascade|multi_instance|llm_judge|hybrid|benchmark|proof|human
        schema["eval_fidelity"] = ...    # cheap | medium | expensive
        schema["eval_noise"] = ...       # deterministic | stochastic
        # Q4 Selector & archive.
        schema["archive"] = ...          # flat_pool|ranked|island|map_elites|pareto|tree|tiered|skills_library
        # Q5 Memory / reflection beyond the archive itself.
        schema["memory"] = ...           # none|learning_log|embedded_log|skills|reflection_buffer|graph|self_mod
        # Q6 Cost & budget profile.
        schema["cost_profile"] = ...     # exec_ms|cascade|multi_seed_sec|rollout_min|harness_run_hr|nested_inner_loop
        # Q7 Lineage: one primary + optional secondary tags (see §70.4.2).
        schema["lineage"] = infer_lineage(schema)
        return schema
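Because the protocol leaves each answer as a placeholder, a filled-in schema is easy to get subtly wrong. A small checker against the vocabularies listed in the comments might look like the sketch below. The `ALLOWED` table copies those vocabularies; `schema_problems` and the example draft are hypothetical names introduced here.

```python
ALLOWED = {
    "representation": {"function", "file", "diff", "prompt", "skill",
                       "weight_vector", "proof", "hypothesis", "harness"},
    "variation": {"diff", "rewrite", "crossover", "reflect_then_edit",
                  "repair", "prompt_mutation", "gaussian_noise"},
    "evaluator": {"single_exec", "cascade", "multi_instance", "llm_judge",
                  "hybrid", "benchmark", "proof", "human"},
    "eval_fidelity": {"cheap", "medium", "expensive"},
    "eval_noise": {"deterministic", "stochastic"},
    "archive": {"flat_pool", "ranked", "island", "map_elites", "pareto",
                "tree", "tiered", "skills_library"},
    "memory": {"none", "learning_log", "embedded_log", "skills",
               "reflection_buffer", "graph", "self_mod"},
    "cost_profile": {"exec_ms", "cascade", "multi_seed_sec", "rollout_min",
                     "harness_run_hr", "nested_inner_loop"},
}

def schema_problems(schema):
    """Return a list of human-readable problems; an empty list means the schema is usable."""
    problems = []
    for column, allowed in ALLOWED.items():
        if column not in schema:
            problems.append(f"{column}: missing")
        elif schema[column] not in allowed:
            problems.append(f"{column}: unknown value {schema[column]!r}")
    return problems

# A FunSearch-like draft with one misspelled archive value.
draft = {
    "representation": "function", "variation": "rewrite",
    "evaluator": "single_exec", "eval_fidelity": "cheap",
    "eval_noise": "deterministic", "archive": "islands",
    "memory": "learning_log", "cost_profile": "exec_ms",
}
problems = schema_problems(draft)
```

Running a check like this before lineage inference keeps a typo from silently falling through to a default rule.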
70.4.2 Lineage precedence: an ordered decision tree
When the six ingredients are mutually consistent, classification is trivial. The hard cases are conflicts: archive looks like TS but variation looks like RF; memory looks reflective but archive is still a ranked pool; configuration toggles change the archive type. The rule set below fires in order; the first matching rule wins, and hybrid tags are emitted when a later rule would have fired on a different column than the winner.
    def infer_lineage(s) -> str:
        # --------------- OOT gates (failure of a NECESSARY criterion of every lineage) ---------------
        # R0a. Variation is parameter-vector noise, not any form of LLM edit -> OOT (ES family).
        if s["variation"] == "gaussian_noise":
            return "OOT(ES)"
        # R0b. The system is an end-to-end research pipeline whose architectural interest is orchestration,
        #      and the loop-as-architecture granularity fails -> OOT (pipeline outer layer).
        if s.get("granularity") == "pipeline_outer":
            return "OOT(pipeline)"
        # --------------- Primary lineage by ARCHIVE column (decisive for FS/QD/TS) -----------------
        # R1. Explicit tree with per-node statistics -> TS, possibly hybridised by variation.
        if s["archive"] == "tree":
            if s["variation"] == "reflect_then_edit":
                return "TS+RF"  # ALE-Agent pattern: node stats + reflective proposal.
            return "TS"
        # R2. Descriptor-indexed grid with one champion per cell -> QD, optionally paired with a pool.
        if s["archive"] == "map_elites":
            if s.get("also_ranked_pool"):
                return "FS+QD"  # AlphaEvolve §4.5 pattern.
            return "QD"
        # R3. Configuration-dependent grid on top of a default pool -> FS (~QD).
        if s["archive"] in ("flat_pool", "ranked", "island") and s.get("configurable_grid"):
            return "FS(~QD)"  # OpenEvolve, LLM4AD pattern.
        # --------------- Variation-vs-memory tie-break when archive is a pool ----------------------
        # R4. Pool archive + reflect-then-edit variation + first-class memory store -> RF.
        #     Both of RF's necessary criteria must fire; memory alone is not enough (prevents
        #     FS-with-learning-log being mislabelled RF).
        if s["archive"] in ("flat_pool", "ranked", "island"):
            if s["variation"] == "reflect_then_edit" and s["memory"] in (
                "reflection_buffer", "skills", "self_mod", "graph"
            ):
                # Recursive-inner-loop annotation for self-modifying harnesses.
                if s["memory"] == "self_mod" and s.get("inner_loop_lineage") == "FS":
                    return "RF(primary)+FS(inner)"  # Darwin Gödel Machine pattern.
                return "RF"
            # R5. Default fall-through: pool + diff/rewrite + optional learning log -> FS.
            return "FS"
        # R6. Anything else (e.g. tiered_pool, skills_library without pool) -> promote to RF if
        #     both RF necessary criteria fire; otherwise emit an "unclassified" tag for review.
        if s["variation"] == "reflect_then_edit" and s["memory"] in (
            "reflection_buffer", "skills", "self_mod", "graph"
        ):
            return "RF"
        return "UNCLASSIFIED"
Read the tree as a strict priority list. R0 gates out the two OOT families first because no in-taxonomy rule can rescue a system that fails a necessary variation-operator criterion of every lineage. R1–R3 make the archive column authoritative for the FS/QD/TS distinction, which is why AlphaEvolve's MAP-Elites variant earns an FS+QD tag (archive points to QD, pool presence forces the hybrid) rather than a forced single-lineage call. R4 requires both Reflection necessary criteria (reflect-then-edit and first-class memory) to fire before promoting a pool system to RF; this is the load-bearing rule that prevents a FunSearch-style system with a learning log from being mislabelled. R5 is the FunSearch default. R6 handles the residual cases and emits an explicit review flag rather than a guess.
Applied to a system picked at random — say, GEPA (Ch7) — the procedure returns: representation = prompt; variation = reflect_then_edit; evaluator = benchmark, medium fidelity, stochastic; archive = ranked; memory = reflection_buffer; cost profile = multi_seed_sec; R4 fires (pool + reflect + first-class memory) → lineage = RF. For AlphaEvolve's MAP-Elites variant R2 fires with also_ranked_pool=True → FS+QD. For AB-MCTS R1 fires with non-reflective variation → TS. For ES-at-scale R0a fires → OOT(ES).
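The four walk-throughs above can also be traced as "which rule fires first". The sketch below restates the precedence as an ordered list of predicates and reports only the rule id; it is a compact reading aid, not a replacement for the full decision tree. The partial schemas are assumptions: in particular, AB-MCTS's variation is written as `"rewrite"` purely as a placeholder for "non-reflective".

```python
POOLS = ("flat_pool", "ranked", "island")
FIRST_CLASS_MEMORY = ("reflection_buffer", "skills", "self_mod", "graph")

def reflective(s):
    """Both Reflection necessary criteria: reflect-then-edit plus first-class memory."""
    return s["variation"] == "reflect_then_edit" and s["memory"] in FIRST_CLASS_MEMORY

RULES = [  # (rule id, predicate); list order is the precedence
    ("R0a", lambda s: s["variation"] == "gaussian_noise"),
    ("R0b", lambda s: s.get("granularity") == "pipeline_outer"),
    ("R1", lambda s: s["archive"] == "tree"),
    ("R2", lambda s: s["archive"] == "map_elites"),
    ("R3", lambda s: s["archive"] in POOLS and bool(s.get("configurable_grid"))),
    ("R4", lambda s: s["archive"] in POOLS and reflective(s)),
    ("R5", lambda s: s["archive"] in POOLS),
    ("R6", lambda s: True),   # residual cases
]

def first_rule(s):
    """Return the id of the first rule whose predicate holds."""
    return next(rule_id for rule_id, fires in RULES if fires(s))

gepa = {"variation": "reflect_then_edit", "archive": "ranked",
        "memory": "reflection_buffer"}
alphaevolve_grid = {"variation": "diff", "archive": "map_elites",
                    "memory": "learning_log", "also_ranked_pool": True}
ab_mcts = {"variation": "rewrite", "archive": "tree"}   # placeholder variation
es_at_scale = {"variation": "gaussian_noise"}

fired = [first_rule(s) for s in (gepa, alphaevolve_grid, ab_mcts, es_at_scale)]
```

Keeping the rules in a list makes the strict-priority reading explicit: reordering two entries changes the classification, which is exactly the property the precedence tree is meant to pin down.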
70.5 The Four Lineages
Lineage tags are reading aids, not crisp partitions: several systems borrow from more than one lineage (§70.5.2), and two families sit outside the taxonomy entirely (§70.5.3). Classification follows Table 70.5.1, with each criterion marked N (necessary — a system in this lineage cannot fail this test), T (typical — holds for almost all flagship members but has documented exceptions), or O (optional — common but not definitional).
70.5.1 Classification criteria and lineage definitions
| Criterion | FunSearch | Quality-Diversity | Tree-search | Reflection |
|---|---|---|---|---|
| Search state | N: flat pool or island ring | N: grid indexed by behavioral descriptor | N: explicit tree with per-node statistics | T: small ranked pool alongside a memory store |
| Variation operator | N: diff or full rewrite of a code candidate | T: same as FunSearch, but parent sampled from grid | T: LLM as proposal distribution at an expansion node | N: reflect-then-edit — LLM first analyses failure, then patches |
| Evaluator | T: deterministic code execution, usually cascaded | T: deterministic + extraction of behavioral descriptors | T: step-wise or terminal verification; sparse rewards backpropagated | O: expensive or noisy task reward, often agent rollouts |
| Archive structure | N: ranked list, sometimes per-island | N: one champion per occupied cell | N: tree with per-node value and visit counts | T: ranked pool + separate reflection/skills memory |
| Memory / reflection role | O: learning log, mostly for prompt context | O: log + archive of diverse exemplars | N: node statistics are the memory | N: first-class — reflection, skills, prompt evolution, or harness patching |
Reading key: N = necessary (a system failing this is not in the lineage); T = typical (holds for the overwhelming majority of members, with named exceptions); O = optional (frequent but not definitional).
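The N/T/O grades in Table 70.5.1 lend themselves to a simple membership test: a system can belong to a lineage only if it passes every N-criterion. The sketch below encodes the grades from the table; the short criterion keys and the function names `necessary` and `admissible` are illustrative choices, not notation used elsewhere in the book.

```python
CRITERIA = {  # lineage -> criterion -> grade, transcribed from Table 70.5.1
    "FS": {"search_state": "N", "variation": "N", "evaluator": "T",
           "archive": "N", "memory": "O"},
    "QD": {"search_state": "N", "variation": "T", "evaluator": "T",
           "archive": "N", "memory": "O"},
    "TS": {"search_state": "N", "variation": "T", "evaluator": "T",
           "archive": "N", "memory": "N"},
    "RF": {"search_state": "T", "variation": "N", "evaluator": "O",
           "archive": "T", "memory": "N"},
}

def necessary(lineage):
    """Criteria graded N for the lineage, in alphabetical order."""
    return sorted(c for c, grade in CRITERIA[lineage].items() if grade == "N")

def admissible(lineage, passed):
    """True when every N-criterion of the lineage is among the criteria the system passes."""
    return set(necessary(lineage)) <= set(passed)

rf_needs = necessary("RF")
# A pool-based system with a learning log matches RF's typical pool and archive,
# but passes neither RF's variation criterion nor its memory criterion.
fs_with_log_as_rf = admissible("RF", {"search_state", "archive"})
fs_with_log_as_fs = admissible("FS", {"search_state", "variation", "archive"})
```

T and O grades deliberately play no part in `admissible`; they describe what members usually look like, not what excludes a system.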
For each lineage the list below gives a positive example (a system that passes every N-criterion cleanly), a near-miss (a system that would be placed here naively but fails or strains an N-criterion and belongs elsewhere), and a tie-break rule for the commonest classification conflict.
- FunSearch lineage. Positive example: FunSearch itself (Ch3): flat island pool, diff/rewrite variation, deterministic cascade, ranked per-island archive, learning log. Near-miss: ES-at-scale (Ch47) superficially resembles FS at the loop level but fails the FS necessary variation-operator criterion because the variation is Gaussian noise on parameters. Tie-break rule: if any published configuration adds a MAP-Elites grid, tag as FS+QD rather than forcing one lineage; the decision tree's R2/R3 rules implement this automatically.
- Quality-Diversity / MAP-Elites. Positive example: AlphaEvolve's behavioural-grid variant (Ch4 §4.5). Near-miss: an island-pool FunSearch configuration is sometimes called "diversity-preserving" but is not QD because an island ring is not a descriptor-indexed grid. Tie-break rule: if archive and memory point in different directions (for example, a ranked pool plus a reflection buffer that also indexes by a behavioral tag), the archive column is authoritative; memory drives RF-lineage only when the archive is also a ranked pool, as in GEPA Skills (Ch8).
- Tree-search lineage. Positive example: AB-MCTS / TreeQuest (Ch19): explicit tree with per-node visit counts, UCB traversal, LLM proposals at expansion nodes. Near-miss: a research agent that builds a candidate frontier and picks top-k each step is not TS, because no persistent per-node statistic is backed up; it is a beam search. Tie-break rule: if node statistics are used by a reflect-then-edit proposal step (ALE-Agent Ch18), the system is TS+RF; TS is assigned by the archive column, RF by the variation column.
- Reflection lineage. Positive example: GEPA (Ch7): the LLM first reasons about why a prompt succeeded or failed, then edits it, and a reflection buffer is maintained alongside the ranked pool. Near-miss: a FunSearch-style system that pipes learning-log entries into its prompt context is not RF; the learning log is an O-criterion and the variation operator is still a plain diff or rewrite. Tie-break rule: when archive (pool) and memory (first-class store) both fit RF but variation is single-shot rewrite, the system is FS-with-memory, not RF.
Flagship placements under these criteria:
- FunSearch lineage. FunSearch (Ch3), AlphaEvolve (Ch4), OpenEvolve (Ch5), ShinkaEvolve (Ch6), LLM4AD (Ch10), SkyDiscover/AdaEvolve (Ch9). Where it shines: problems with a fast, deterministic scorer — combinatorial optimisation, numerical kernels, competitive programming.
- Quality-Diversity / MAP-Elites. AlphaEvolve's behavioural-grid variant (Ch4 §4.5), MAP-Elites configurations of OpenEvolve (Ch5 §5.6) and LLM4AD (Ch10). Where it shines: deceptive landscapes where approach-diversity matters as much as peak fitness.
- Tree-search lineage. AB-MCTS & TreeQuest (Ch19), Arcgentica (Ch17), Confluence Labs (Ch20), ALE-Agent (Ch18), ShinkaEvolve@ICFP (Ch6 §6.8). Where it shines: discrete, verifiable step problems — ARC-AGI, heuristic contests, theorem proving.
- Reflection lineage. GEPA (Ch7), GEPA Skills (Ch8), EurekaClaw (Ch26), RetroAgent (Ch25), Darwin Gödel Machine (Ch23), Darwinian Evolver (Ch24), NeoSigma (Ch28). Where it shines: expensive or noisy evaluators, where sample-efficiency matters more than throughput.
70.5.2 Hybrid and borderline cases
Forcing single-bucket assignment obscures some of the most interesting systems. Each hybrid below is annotated with the Table 70.5.1 criteria it stretches:
- AlphaEvolve (Ch4). Primarily FS. Ships a MAP-Elites archive variant in §4.5 that satisfies every QD necessary criterion. Its learning log has reflection-like properties but does not carry a first-class reflection store. Tag: FS + QD.
- ShinkaEvolve (Ch6). FS-family on core benchmarks. In its ICFP configuration (§6.8) it additionally satisfies every TS necessary criterion. Borrows reflection-style adaptive mutation scheduling (an O-criterion of RF). Tag: FS + TS, with Reflection-lite adaptation.
- Darwin Gödel Machine (Ch23). RF at the harness level. Fails RF's typical "small ranked pool" because the population collapses to near-singleton harnesses; passes both RF necessary criteria. Its downstream evaluation task can itself be a FunSearch-style search. Tag: RF (primary), FS (recursive inner loop). This is a self-modifying in-taxonomy case, not OOT.
- ALE-Agent (Ch18). Satisfies every TS necessary criterion but replaces the typical single-shot LLM proposal with a reflective step at each node, satisfying RF's necessary "reflect-then-edit" criterion. Tag: TS + RF.
- OpenEvolve (Ch5). FS by default; when configured with a behavioral-grid archive (§5.6) it additionally satisfies QD's necessary criterion. Marked FS(~QD) by R3 of the decision tree.
70.5.3 Out-of-taxonomy special cases
Two families do not fit any lineage. Critically, self-modifying Reflection-lineage systems such as DGM (Ch23) and Darwinian Evolver (Ch24) are not OOT — they are unusual RF members whose candidate happens to be a harness. The OOT category is reserved for families that fail a necessary criterion of every lineage.
- Evolution-strategies-at-scale methods (Ch46 EGGROLL, Ch47 Evolution Strategies at Scale). Why OOT: variation is Gaussian noise on a parameter vector, failing the necessary variation-operator criterion of every lineage. What they contribute: evidence about population sizing, compute-scaling, and diversity pressure that the LLM-evolution lineages can borrow.
- Research-agent pipelines, outer layer (Part P07: AI Scientist, AIRA₂, Zochi, DeepScientist, RD-Agent, OmniScientist). Why OOT: end-to-end pipelines whose interest lies in orchestration, long-horizon memory, and stage-level planning, not in any single evolutionary inner loop. Inner loops (idea mutation, experiment refinement, reward shaping) individually belong to FS, TS, or RF and are classified normally in Ch69; the outer pipeline fails the loop-as-architecture test.
The practical rule: if the decision tree in §70.4.2 reaches UNCLASSIFIED or R0 fires, the system is OOT. If it merely produces a compound lineage tag (R1 with hybrid branch, R2 with pool, R4 with inner annotation), the system is in §70.5.2. A self-modifying candidate is not, by itself, grounds for OOT status.
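The practical rule amounts to a three-way routing on the tag string that the decision tree emits. A minimal sketch, assuming the tag spellings used in §70.4.2 (`route` is a hypothetical helper name):

```python
def route(tag):
    """Send a lineage tag to the section that discusses it."""
    if tag == "UNCLASSIFIED" or tag.startswith("OOT"):
        return "out-of-taxonomy (§70.5.3)"
    if "+" in tag or "~" in tag:
        return "hybrid or borderline (§70.5.2)"
    return "single lineage (§70.5.1)"

routes = {tag: route(tag) for tag in (
    "OOT(ES)", "UNCLASSIFIED", "TS+RF", "FS(~QD)", "RF(primary)+FS(inner)", "RF",
)}
```

Note that `RF(primary)+FS(inner)` routes to the hybrid section, not to OOT: the self-modifying annotation alone does not change where the system is discussed.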
70.6 Comparative Matrix and Evidence Ledger
The matrix below maps fourteen representative systems against the ingredients and lineage tags. FS=FunSearch, QD=Quality-Diversity, TS=Tree-search, RF=Reflection, OOT=out-of-taxonomy. The marker ~ on a cell means configuration-dependent: the feature is present in at least one published or default configuration of the system, but is not universal across its releases.
| System | Ch. | Lineage | Representation | Variation op | Evaluator | Archive | Memory |
|---|---|---|---|---|---|---|---|
| FunSearch | Ch3 | FS | Python function | Full rewrite | Deterministic exec | Island pool | Learning log |
| AlphaEvolve | Ch4 | FS+QD | Function / file | Diff + rewrite | Cascade | Pool + MAP-Elites (§4.5) | Learning log |
| OpenEvolve | Ch5 | FS(~QD) | Function / file | Diff + rewrite | Cascade | Pool ~or grid (§5.6) | Learning log |
| ShinkaEvolve | Ch6 | FS+TS | File | Diff + rewrite | Cascade + multi-inst | Pool ~tree (ICFP §6.8) | Learning log + ~reflection |
| GEPA | Ch7 | RF | Prompt | Reflect-then-edit | Benchmark (stochastic) | Ranked pool | Reflection buffer |
| GEPA Skills | Ch8 | RF | Prompt + skill | Reflect-then-edit | Benchmark | Pool + skills library | Skills archive |
| LLM4AD | Ch10 | FS(~QD) | Heuristic function | Diff + rewrite | Deterministic exec | Pool ~or grid | Learning log |
| Arcgentica | Ch17 | TS | Program / plan | LLM proposal @ node | Step + terminal verify | Search tree | Node statistics |
| ALE-Agent | Ch18 | TS+RF | Heuristic program | Reflective proposal @ node | Contest scorer | Search tree | Node stats + reflection |
| AB-MCTS / TreeQuest | Ch19 | TS | Solution | LLM proposal @ node | Task verifier | Search tree (posterior) | Node statistics |
| Darwin Gödel Machine | Ch23 | RF (~FS inner) | Agent harness (code) | Reflect-then-edit patch | Downstream task run | Near-singleton pool | Self-modifying harness |
| EurekaClaw | Ch26 | RF | Reward function | Reflect-then-edit | RL rollout (noisy) | Ranked pool | Reflection buffer |
| ES-at-Scale | Ch47 | OOT | Weight vector | Gaussian noise | Task reward | Implicit (population) | None |
| AI Scientist | Ch58 | OOT (inner ~RF) | Research idea / experiment | Pipeline stage-dependent | Paper / experiment artifact | Pipeline state | Long-horizon notes |
Reading the matrix. Two patterns are visible, each with an exception note, a sampling-bias note, and a falsification test so the reader can audit the generalisation.
- Representation almost fully predicts evaluator. Code → deterministic cascade, prompt → stochastic benchmark, reward function → noisy rollout, weight vector → task reward, research idea → paper artifact. This is a hypothesis, not a law: the evaluator is usually the cost bottleneck, and the representation tends to be chosen so it is cheap to evaluate at the desired fidelity. Known exception: GEPA Skills (Ch8) mixes prompt and skill representations into the same benchmark evaluator, weakening the one-to-one coupling. Sampling bias: the surveyed systems over-represent code-execution benchmarks because that is where LLM-evolution first took off; a survey that sampled more heavily from agentic robotics or theorem-proving might loosen the mapping. Falsification test: a future system that deliberately decouples representation and evaluator — for example, evolving prompts but scoring them with a formal proof checker — would count as evidence against this reading.
- Archive and Memory columns drive lineage membership more than Variation op. AlphaEvolve and OpenEvolve use the same diff+rewrite operator as FunSearch but earn their QD tag entirely from the archive column; GEPA and EurekaClaw differ in representation and evaluator but share the same reflection-buffer memory and therefore the same lineage. Known exception: ALE-Agent (Ch18) is tagged TS+RF precisely because its variation operator is decisive: the TS archive alone would not give RF credit, and the reflect-then-edit step is what activates rule R1's hybrid branch. Sampling bias: the surveyed FS-family systems ship with configurable archive topologies, which inflates the apparent leverage of the archive column; a sample drawn from systems with fixed archives would show more variation-op leverage. Falsification test: a system whose lineage flips solely by swapping variation operator while holding archive and memory constant would weaken the claim.
A third, weaker observation: the ~ markers cluster in FS-family rows. This is not a flaw of FS but, we conjecture, a feature — those systems ship with configurable archive topologies, which is precisely why they can cross into QD or TS without a redesign. Reflection-lineage systems, by contrast, rarely toggle their memory store on and off in the surveyed configurations: it is definitional for them.
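The clustering of ~ markers can be checked mechanically against the matrix. The sketch below copies the Lineage, Archive, and Memory cells of the fourteen rows and counts, per primary lineage, the rows that carry at least one ~ in those cells. The helper `primary` and the choice to count rows instead of individual markers are assumptions of this example.

```python
from collections import Counter

ROWS = [  # (system, lineage, archive, memory), copied from the matrix
    ("FunSearch", "FS", "Island pool", "Learning log"),
    ("AlphaEvolve", "FS+QD", "Pool + MAP-Elites (§4.5)", "Learning log"),
    ("OpenEvolve", "FS(~QD)", "Pool ~or grid (§5.6)", "Learning log"),
    ("ShinkaEvolve", "FS+TS", "Pool ~tree (ICFP §6.8)", "Learning log + ~reflection"),
    ("GEPA", "RF", "Ranked pool", "Reflection buffer"),
    ("GEPA Skills", "RF", "Pool + skills library", "Skills archive"),
    ("LLM4AD", "FS(~QD)", "Pool ~or grid", "Learning log"),
    ("Arcgentica", "TS", "Search tree", "Node statistics"),
    ("ALE-Agent", "TS+RF", "Search tree", "Node stats + reflection"),
    ("AB-MCTS / TreeQuest", "TS", "Search tree (posterior)", "Node statistics"),
    ("Darwin Gödel Machine", "RF (~FS inner)", "Near-singleton pool", "Self-modifying harness"),
    ("EurekaClaw", "RF", "Ranked pool", "Reflection buffer"),
    ("ES-at-Scale", "OOT", "Implicit (population)", "None"),
    ("AI Scientist", "OOT (inner ~RF)", "Pipeline state", "Long-horizon notes"),
]

def primary(lineage):
    """Leading lineage code of a tag, e.g. 'FS(~QD)' -> 'FS', 'RF (~FS inner)' -> 'RF'."""
    for i, ch in enumerate(lineage):
        if not ch.isalpha():
            return lineage[:i]
    return lineage

tilde_rows = Counter(
    primary(lineage)
    for _, lineage, archive, memory in ROWS
    if any("~" in cell for cell in (lineage, archive, memory))
)
```

Under these counting choices, three of the five rows with a ~ marker have FS as their primary lineage, which is the clustering the paragraph above describes.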
70.6.1 Evidence ledger for §§70.6–70.8 synthesis claims
The synthesis statements in the matrix reading, in §70.7.1, and in §70.8 are not independent observations — each is the distillation of a specific set of chapters and reported ablations. The table below is the full ledger: for every non-trivial synthesis claim, it lists supporting chapters, what was actually compared, and our confidence level. Grades: observed (the claim restates a pattern directly visible in the cited chapters' tables), hypothesis (a reading of that pattern we find plausible), conjecture (a generalisation beyond what the literature establishes).
| Synthesis claim | Section | Supporting chapters | What was compared | Confidence |
|---|---|---|---|---|
| Representation almost fully predicts evaluator. | §70.6 reading | Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch17, Ch18, Ch19, Ch23, Ch26, Ch47, Ch58 | Cross-chapter reading of the Representation/Evaluator columns. | Observed |
| Archive and Memory columns drive lineage membership more than Variation op. | §70.6 reading | Ch3, Ch4, Ch5, Ch7, Ch8, Ch10, Ch26 | Same variation operator across FS and QD tags; same memory type across differing representations. | Observed |
| Evaluator fidelity / cascade design is one of the knobs most often reported as headline-moving. | §70.7.1 item 1 | Ch4, Ch5, Ch6 | Within-system ablations varying cheap-stage cost, admission thresholds, and cascade depth. | Observed in Ch4, Ch6; consistent with Ch5. |
| Mutation granularity shifts with run stage in FS-family systems. | §70.7.1 item 2 | Ch4, Ch5, Ch7, Ch26 | Operator-mix curves over run time; single operator swaps under expensive evaluators. | Observed within cited systems; hypothesis for generalisation. |
| Archive topology matters when the landscape is deceptive and the descriptor is meaningful. | §70.7.1 item 3 | Ch3, Ch4 §4.5 | Within-system archive ablations on shared benchmarks. | Observed in Ch4 §4.5; hypothesis that the descriptor-quality precondition is general. |
| Parent selection pressure is first-order for TS, second-order for FS-family. | §70.7.1 item 4 | Ch17, Ch19, Ch4, Ch6 | UCT sweeps; FS softmax / top-k sweeps. | Observed in Ch17, Ch19 for TS; hypothesis for the FS-family ordering. |
| Reflection / memory persistence is headline-moving in RF, neutral in FS/TS. | §70.7.1 item 5 | Ch7, Ch8, Ch26, Ch3, Ch4, Ch6 | Per-run vs cross-run vs distilled reflection; learning-log on/off. | Observed in RF chapters; hypothesis for the FS/TS neutrality. |
| Evaluator overfitting is a well-established failure mode. | §70.8 item 1 | Ch4, Ch6 | Multi-instance evaluation and held-out sets introduced in response to gaming. | Observed |
| Archive collapse is addressed by topology choice. | §70.8 item 2 | Ch3, Ch4 §4.5, Ch8 | Before/after introduction of island, grid, novelty filtering. | Observed |
| Reflection contamination is a failure mode in the RF lineage. | §70.8 item 3 | Ch7, Ch26 | Buffer pruning and score-conditional admission fixes. | Observed; dominance is a conjecture. |
| Prompt / harness collapse is lineage-specific. | §70.8 item 4 | Ch23, Ch28 | Case descriptions of harness edits degrading future quality; rollback checkpoints as countermeasure. | Observed instances; hypothesis that it generalises. |
| Cross-lineage head-to-heads on identical benchmarks are largely absent. | §70.7.2 | Survey-wide negative observation | Search for matched-compute, matched-LLM head-to-heads across FS/QD/TS/RF. | Observed (as absence). |
70.6.2 Summary counts for the synthesis claims
The ledger above is long. The table below compresses it into counts the reader can audit at a glance: how many chapters back each claim, whether the evidence is within-system (an ablation inside one chapter) or cross-system (a comparison across chapters), and what benchmark family the evidence comes from. This is explicitly a coverage count, not an effect-size tally.
| Claim cluster | # supporting chapters | Within-system | Cross-system | Benchmark family | Confidence |
|---|---|---|---|---|---|
| Repr predicts evaluator (§70.6 read 1) | 13 | — | 13 (cross-read) | code, NLP, RL, agentic, research | Observed |
| Archive+Memory drive lineage (§70.6 read 2) | 7 | — | 7 (cross-read) | code, NLP, RL | Observed |
| Evaluator fidelity knob (§70.7.1 #1) | 3 | 3 | 0 | code exec (cascades) | Observed |
| Mutation granularity knob (§70.7.1 #2) | 4 | 4 | 0 | code exec + expensive NLP/RL | Observed → hypothesis |
| Archive topology knob (§70.7.1 #3) | 2 | 2 | 0 | combinatorial (deceptive) | Observed → hypothesis |
| Selection pressure knob (§70.7.1 #4) | 4 (2 TS + 2 FS) | 4 | 0 | ARC/contests + code exec | Observed (TS) → hypothesis (FS) |
| Reflection persistence knob (§70.7.1 #5) | 6 (3 RF + 3 FS) | 6 | 0 | expensive NLP/RL + code exec | Observed (RF) → hypothesis (FS/TS) |
| Evaluator overfitting (§70.8 #1) | 2 | 2 | 0 | code exec | Observed |
| Archive collapse (§70.8 #2) | 3 | 3 | 0 | combinatorial | Observed |
| Reflection contamination (§70.8 #3) | 2 | 2 | 0 | expensive NLP/RL | Observed + conjecture |
| Harness collapse (§70.8 #4) | 2 | 2 | 0 | agent harness | Observed instances + hypothesis |
| Cross-lineage head-to-heads (§70.7.2) | 0 | 0 | 0 | — | Observed absence |
The compressed view makes the structural limitation of the current synthesis visible: every design-knob claim in §70.7.1 is supported by within-system ablations, and none by cross-system comparisons on matched benchmarks. Phrases like "knobs that appear to move the needle" should therefore be read as coverage observations, not effect-size rankings, and claims that depend on cross-system generalisation are tagged as hypotheses rather than observations.
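The audit described above can be made mechanical. The sketch below transcribes the within-system and cross-system counts for the §70.7.1 and §70.8 rows of the table and lists the claims that have no cross-system support. The `ClaimCoverage` record and the helper name are illustrative assumptions introduced here, not part of any surveyed system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClaimCoverage:
    """One row of the coverage table (hypothetical record type)."""
    claim: str
    section: str          # section the claim is stated in
    within_system: int    # chapters with a within-system ablation
    cross_system: int     # chapters with a cross-system comparison


# Counts transcribed from the design-knob and failure-mode rows of the table.
ROWS: List[ClaimCoverage] = [
    ClaimCoverage("Evaluator fidelity knob", "70.7.1", 3, 0),
    ClaimCoverage("Mutation granularity knob", "70.7.1", 4, 0),
    ClaimCoverage("Archive topology knob", "70.7.1", 2, 0),
    ClaimCoverage("Selection pressure knob", "70.7.1", 4, 0),
    ClaimCoverage("Reflection persistence knob", "70.7.1", 6, 0),
    ClaimCoverage("Evaluator overfitting", "70.8", 2, 0),
    ClaimCoverage("Archive collapse", "70.8", 3, 0),
    ClaimCoverage("Reflection contamination", "70.8", 2, 0),
    ClaimCoverage("Harness collapse", "70.8", 2, 0),
]


def lacks_cross_system_support(rows: List[ClaimCoverage], section: str) -> List[str]:
    """Return the claims in `section` backed only by within-system evidence."""
    return [
        r.claim
        for r in rows
        if r.section == section and r.within_system > 0 and r.cross_system == 0
    ]


knob_claims = lacks_cross_system_support(ROWS, "70.7.1")
print(f"{len(knob_claims)} design-knob claims rest on within-system ablations only")
```

The same check can be rerun whenever a row is added to the ledger, which keeps the coverage-versus-effect-size distinction auditable rather than rhetorical.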
70.7 Design Knobs and Missing Comparisons
The evidence base for this section is the ablation and sensitivity tables reported in the surveyed chapters, catalogued in §70.6.1 and counted in §70.6.2. Throughout, an observation is a pattern a cited chapter's table directly shows; a hypothesis is our reading of what that pattern means; a conjecture is a generalisation offered as a target for future work.
70.7.1 Knobs that appear to move the needle
Across the systems in §70.6, five design knobs recur in reported ablations as candidate first-order drivers of final performance. They are numbered to match the ledger rows in §70.6.1, and the order is not a ranking: the chapter counts in §70.6.2 (3, 4, 2, 4, and 6 supporting chapters for items 1 to 5) measure coverage, not effect size.
- Evaluator fidelity and cascade design (observed in Ch4, Ch5, Ch6). Across FS-family systems the cheap-stage cost, the admission threshold, and whether the cheap stage is deterministic or a surrogate all move headline numbers. Dominant pattern: cheapening the smoke stage is typically net-positive, up to the point where it becomes uninformative. Chapters that report their cascade admission curves (Ch4, Ch6) show the knee of that curve as the most cost-sensitive parameter within those two studies; generalising further is a conjecture.
- Mutation granularity: diff vs. rewrite vs. reflect-then-edit (observed in Ch4, Ch5, Ch7, Ch26). Where the same system ships multiple operators, the cited chapters report diffs dominating early in a run and rewrites dominating late, once the archive has saturated the local basin. RF systems report that replacing reflect-then-edit with a single-shot rewrite collapses their sample-efficiency advantage. Whether the stage-dependent shift generalises beyond the cited systems is a hypothesis.
- Archive topology: flat pool vs. island vs. MAP-Elites grid (observed in Ch3, Ch4 §4.5). FunSearch reports gains from island pools on deceptive landscapes. AlphaEvolve reports that MAP-Elites improves recovery from local optima on landscapes where the behavioral descriptor genuinely partitions the search space, and is neutral-to-negative where it does not.
- Parent selection pressure (observed in Ch17, Ch19 for TS; consistent with Ch4, Ch6 for FS-family). Whatever knob a system uses to trade off exploitation against exploration (softmax temperature, UCT exploration constant, or top-k width) is first-order in the TS ablations and appears second-order in the FS-family ablations; the FS-family ordering is a hypothesis. One plausible reason the TS dependence is so clean is that visit counts amplify early selection bias.
- Reflection / memory persistence (observed in Ch7, Ch8, Ch26 for RF; weakly consistent with Ch3, Ch4, Ch6 for FS). For RF, whether the reflection buffer is per-run, cross-run, or distilled into a skills library is headline-scale. In the cited FS chapters, adding a learning log is usually neutral or mildly positive, and removing it is rarely catastrophic; extending that neutrality to TS, for which no chapter is cited here, is a hypothesis. RF systems are their memory in a way that FS and TS systems are not.
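Two of the knobs above, the cascade admission threshold (item 1) and the softmax selection temperature (item 4), are easy to make concrete. The following is a minimal sketch under stated assumptions: the function names, the two-stage cascade shape, and the toy scores are all hypothetical and are not taken from any surveyed system.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, TypeVar

C = TypeVar("C")


def cascade_evaluate(
    candidate: C,
    cheap_stage: Callable[[C], float],
    full_stage: Callable[[C], float],
    admission_threshold: float,
) -> Optional[float]:
    """Two-stage cascade: run the expensive stage only if the cheap stage admits."""
    if cheap_stage(candidate) < admission_threshold:
        return None  # rejected early; the expensive evaluator is never called
    return full_stage(candidate)


def softmax_weights(scores: Sequence[float], temperature: float) -> List[float]:
    """Selection probabilities; low temperature exploits, high temperature explores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_parent(
    population: Sequence[C],
    scores: Sequence[float],
    temperature: float,
    rng: random.Random,
) -> C:
    """Sample one parent with probability proportional to its softmax weight."""
    weights = softmax_weights(scores, temperature)
    return rng.choices(population, weights=weights, k=1)[0]


# Toy scores: the same population under sharp and flat selection pressure.
toy_scores = [0.2, 0.5, 0.9]
sharp = softmax_weights(toy_scores, temperature=0.05)
flat = softmax_weights(toy_scores, temperature=5.0)
parent = select_parent(["a", "b", "c"], toy_scores, 0.05, random.Random(0))
```

With the low temperature nearly all probability mass sits on the best candidate, while the high temperature spreads it almost evenly. That is the exploitation-versus-exploration trade-off the selection-pressure bullet refers to, here in its softmax form only; UCT and top-k variants are not shown.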
70.7.2 Comparisons the literature has not yet made
- Lineage-against-lineage head-to-heads on identical benchmarks. Almost every system ablates against its own predecessors or against non-evolutionary baselines. Few report head-to-heads across lineages on identical task sets with matched compute and matched LLMs.
- Archive topology on identical landscapes. AlphaEvolve (Ch4 §4.5) is the cleanest within-system comparison of flat vs. MAP-Elites archives, but it is one study.
- Reflection-buffer provenance and transfer. GEPA, GEPA Skills, and EurekaClaw each report positive effects from their reflection stores, but they do not share a benchmark or a buffer format.
- LLM-model sensitivity. Most ablations hold the proposal LLM fixed. Studies that vary it (Ch4, Ch6) suggest model choice interacts with mutation granularity — stronger models benefit more from diffs, weaker models from rewrites — but the sample is too small to generalise.
- Out-of-taxonomy vs. in-taxonomy on shared tasks. ES-at-scale (Ch46, Ch47) and research-agent pipelines (Part P07) rarely share benchmarks with FS/QD/TS/RF lineages.
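The first gap in the list turns on what counts as a matched head-to-head. One possible reading, sketched below, requires two runs from different lineages to share a benchmark and a proposal LLM and to use roughly equal compute. The `RunRecord` fields, the use of LLM call counts as a compute proxy, and the 5% tolerance are assumptions made for illustration; the surveyed chapters do not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Metadata for one reported run (hypothetical schema)."""
    system: str
    lineage: str        # e.g. "FS", "QD", "TS", "RF"
    benchmark: str
    proposal_llm: str
    llm_calls: int      # assumed proxy for compute budget


def is_matched_head_to_head(
    a: RunRecord, b: RunRecord, budget_tolerance: float = 0.05
) -> bool:
    """True if a and b form a cross-lineage comparison under matched conditions."""
    if a.lineage == b.lineage:
        return False  # same lineage: not a cross-lineage comparison
    if a.benchmark != b.benchmark or a.proposal_llm != b.proposal_llm:
        return False  # task set or proposal model differs
    larger = max(a.llm_calls, b.llm_calls)
    return abs(a.llm_calls - b.llm_calls) <= budget_tolerance * larger


# Placeholder records; the names do not refer to real systems or benchmarks.
run_fs = RunRecord("system-a", "FS", "bench-x", "llm-x", 1000)
run_rf = RunRecord("system-b", "RF", "bench-x", "llm-x", 1030)
run_rf_other_llm = RunRecord("system-b", "RF", "bench-x", "llm-y", 1030)
```

Under this reading, `run_fs` and `run_rf` are comparable, while swapping the proposal LLM breaks the match even though the benchmark and budget agree.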
70.8 Failure Modes and How Lineages Respond
The same four lineages are visible in the failure modes each system reports and in the countermeasures it ships. Three failure modes are well-established; the fourth is lineage-specific.
- Evaluator overfitting (observed in Ch4, Ch6). Search finds candidates that game the cheap stage of a cascade or exploit a quirk of the benchmark. Countermeasures across FS and TS lineages: multi-instance evaluation, cascade admission thresholds, held-out instance sets.
- Archive collapse (observed in Ch3, Ch4 §4.5, Ch8). The population converges to near-identical candidates and progress stalls. Countermeasures: island ring topology (Ch3), MAP-Elites grid (Ch4 §4.5), explicit novelty filtering (Ch8).
- Reflection contamination (observed in Ch7, Ch26). A reflection buffer accumulates plausible-but-wrong explanations for past failures, and subsequent reflect-then-edit steps compound them. Countermeasures: reflection pruning and score-conditional buffer admission. Whether it is the dominant failure mode for RF systems is a conjecture.
- Prompt or harness collapse (observed in Ch23, Ch28). The system edits its own prompt or harness in a way that degrades its future proposal quality. Countermeasures — rollback checkpoints, sanity evals — are less uniform across systems than cascade admission thresholds. Treating this as a general self-modification failure mode, rather than two specific case reports, is a hypothesis.
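Two of the countermeasures named above, score-conditional buffer admission with pruning and rollback checkpoints guarded by a sanity eval, can be sketched in a few lines. This is an illustrative sketch only: the class and function names, the capacity-based pruning rule, and the `max_drop` parameter are assumptions, not the mechanisms the cited chapters implement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, TypeVar

H = TypeVar("H")


@dataclass
class Reflection:
    text: str
    score_gain: float  # child score minus parent score for the edit it led to


@dataclass
class ReflectionBuffer:
    """Buffer that admits only reflections tied to a measured score gain."""
    min_gain: float = 0.0
    capacity: int = 4
    entries: List[Reflection] = field(default_factory=list)

    def admit(self, reflection: Reflection) -> bool:
        """Return True if the reflection is retained after admission and pruning."""
        if reflection.score_gain <= self.min_gain:
            return False  # score-conditional admission: no gain, no entry
        self.entries.append(reflection)
        # Pruning: keep only the highest-gain entries, up to capacity.
        self.entries.sort(key=lambda r: r.score_gain, reverse=True)
        del self.entries[self.capacity:]
        return any(r is reflection for r in self.entries)


def apply_harness_edit(
    current: H,
    edited: H,
    sanity_eval: Callable[[H], float],
    max_drop: float = 0.0,
) -> H:
    """Keep the edited harness unless its sanity score drops by more than max_drop."""
    if sanity_eval(edited) < sanity_eval(current) - max_drop:
        return current  # roll back to the checkpoint
    return edited


buffer = ReflectionBuffer(min_gain=0.0, capacity=2)
admitted = [
    buffer.admit(Reflection("helped", 0.3)),
    buffer.admit(Reflection("plausible but wrong", -0.1)),
    buffer.admit(Reflection("helped more", 0.5)),
    buffer.admit(Reflection("marginal", 0.1)),
]
```

The design choice worth noting is that admission is tied to a measured score change rather than to how plausible the reflection reads, which is the property that targets plausible-but-wrong explanations.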
70.9 Reader Routing Guide
A reader arriving without a background in LLM evolution can use this chapter as a routing table. If your interest is code discovery with a fast deterministic scorer, start with the FS-lineage chapters (Ch3–Ch6) and §70.7.1 items 1–3. If your interest is sample-efficient search under expensive or noisy evaluators, start with the RF-lineage chapters (Ch7, Ch8, Ch26) and §70.7.1 item 5. If your interest is structured discrete problems with verifiable steps, start with the TS-lineage chapters (Ch17–Ch20) and §70.7.1 item 4.
If your interest is self-modifying agents whose candidate is their own harness, read Ch23 (Darwin Gödel Machine) and Ch24 (Darwinian Evolver). These systems are in-taxonomy under the Reflection lineage — §70.5.1 places them there because they satisfy both RF necessary criteria, and §70.5.2 tags DGM as RF (primary), FS (recursive inner loop). The only respect in which they are unusual is that their RF ranked-pool state collapses to a near-singleton harness; that makes them edge cases of RF, not exits from the taxonomy. Read them after the core RF chapters (Ch7, Ch8, Ch26).
If your interest is long-horizon autonomous science pipelines, read Part P07, but treat those systems as §70.5.3 out-of-taxonomy cases at the outer pipeline level — their architectural story is orchestration and stage planning, not a single evolutionary loop. Their inner loops are classified normally in Ch69. Return to §70.5.3 before Part P07 so the distinction between an unusual-but-in-taxonomy self-modifying system (Ch23–Ch24) and a pipeline-level OOT system (Part P07) stays clear.
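The routing advice in the preceding paragraphs can also be written down as a lookup table. The sketch below is a direct transcription of that advice; the dictionary keys and the `read_first` / `then` field names are hypothetical labels chosen for the example.

```python
from typing import Dict, List

# Each route lists prerequisites first, then the chapters for that interest.
ROUTES: Dict[str, Dict[str, List[str]]] = {
    "code_discovery_fast_scorer": {
        "read_first": ["Ch3", "Ch4", "Ch5", "Ch6"],
        "then": ["§70.7.1 items 1-3"],
    },
    "sample_efficient_expensive_evaluator": {
        "read_first": ["Ch7", "Ch8", "Ch26"],
        "then": ["§70.7.1 item 5"],
    },
    "structured_discrete_verifiable_steps": {
        "read_first": ["Ch17", "Ch18", "Ch19", "Ch20"],
        "then": ["§70.7.1 item 4"],
    },
    "self_modifying_agents": {
        "read_first": ["Ch7", "Ch8", "Ch26"],   # core RF chapters come first
        "then": ["Ch23", "Ch24"],
    },
    "long_horizon_science_pipelines": {
        "read_first": ["§70.5.3"],               # out-of-taxonomy framing first
        "then": ["Part P07"],
    },
}


def reading_order(interest: str) -> List[str]:
    """Return the full reading order for an interest key, prerequisites first."""
    route = ROUTES[interest]
    return route["read_first"] + route["then"]


order = reading_order("self_modifying_agents")
```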
The decision tree in §70.4.2 is designed to be applied to systems published after this book was written. If a future system produces a compound lineage tag under R1–R4, it belongs in §70.5.2 of a future edition; if R0 fires or the tree reaches UNCLASSIFIED, it belongs in §70.5.3. A self-modifying candidate, on its own, does not force an out-of-taxonomy classification. Either way, the classification is a contribution to the map rather than a failure of it: the map is there to show where the research frontier has moved, not to freeze it.