
Chapter 70

Evo Methods in 5 Minutes

Part P09: Synthesis & Future Directions

This chapter is a five-minute orientation for readers who have not yet opened the sixty-one system chapters in this book. The first section (§70.1) delivers the five-minute front layer itself: a single mental model, a compact ingredient schema, one taxonomy map, and four takeaways. Everything after §70.1 is deep dive — the decision procedure, the N/T/O lineage criteria, the hybrid and out-of-taxonomy cases, the comparative matrix, the evidence ledger, the design knobs, and the failure modes — and can be treated as an appendix-style reference by readers who want the orientation alone. A note on epistemic tone carried throughout: this is a synthesis chapter, not an empirical meta-analysis, so we distinguish between observations (a pattern visible in a cited chapter's reported tables), hypotheses (a reading of that pattern we find plausible), and conjectures (claims the surveyed literature does not yet settle).

70.1 The Five-Minute Front Layer

The one-sentence framing. Most LLM-powered evolutionary systems can be read as a loop that samples parents from some population-like state, asks a language model to propose variations, runs an automated evaluator to score each variant, and updates the state accordingly. Islands, MAP-Elites, cascades, bandits, reflection, skills, and self-modification are design choices layered on top of that loop; several surveyed systems stretch or break the abstraction, and those stretches are where the research frontier lives.

70.1.1 The universal loop

Most systems surveyed in this book, from FunSearch's island-pool program search (Ch3) through the Darwin Gödel Machine's self-rewriting harness (Ch23), can be read as an instantiation of the same four-step loop. Learning this loop is a good first pass at the field; it is not a theorem about it, and §70.3 lists the families for which the reading is lossy.

# A first-pass reading of the LLM-evolution loop.
# Useful for most, but not all, systems surveyed in this book.
while budget_remaining():
    parent     = select(state)                   # what to evolve from
    child      = mutate(parent, llm)             # LLM proposes a variant
    score      = evaluate(child)                 # automated evaluator scores it
    state      = update(state, child, score)     # archive / tree / frontier update
    memory.record(parent, child, score)          # optional: log, skills, reflection
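To make the four boxes concrete, the sketch below runs the same loop end to end on a toy problem. Everything in it is a stand-in we chose for illustration, not a surveyed system: a candidate is a list of bits, the "LLM" is a random bit flip, the evaluator counts ones, the state is a ranked flat pool, and `run_loop` with its parameters is a hypothetical name.

```python
import random

def run_loop(budget: int = 200, seed: int = 0):
    """Toy select -> mutate -> evaluate -> update loop (illustrative stand-ins only)."""
    rng = random.Random(seed)
    n_bits, pool_size = 16, 8

    def evaluate(candidate):
        return sum(candidate)                  # automated, deterministic scorer

    def mutate(parent):
        child = parent[:]                      # stand-in for an LLM-proposed edit
        i = rng.randrange(n_bits)
        child[i] = 1 - child[i]
        return child

    # state: a ranked flat pool of (score, candidate) pairs
    state = []
    for _ in range(pool_size):
        cand = [rng.randint(0, 1) for _ in range(n_bits)]
        state.append((evaluate(cand), cand))
    memory = []                                # optional log of (parent_score, child_score)

    for _ in range(budget):
        a, b = rng.sample(state, 2)            # select: tournament of two
        parent_score, parent = max(a, b, key=lambda p: p[0])
        child = mutate(parent)                 # mutate
        score = evaluate(child)                # evaluate
        # update: insert the child, keep the top pool_size entries
        state = sorted(state + [(score, child)], key=lambda p: p[0], reverse=True)[:pool_size]
        memory.append((parent_score, score))   # memory: here only a log

    best_score, best = state[0]
    return best, best_score, memory

best, best_score, memory = run_loop()
```

Swapping `mutate` for a model call and `evaluate` for a test harness changes the cost profile, not the control flow, which is the point of the first-pass reading.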

70.1.2 Six ingredients, at a glance

Each chapter can be read as a set of answers for the same six ingredients. Holding the schema fixed is what makes the sixty-one chapters comparable at all.

Ingredient	Question	Representative range
Representation	What is a candidate?	function, file, diff, prompt, skill, weight vector, proof, hypothesis, harness
Variation	How is a child produced?	diff, rewrite, crossover, reflect-then-edit, repair, prompt mutation, Gaussian noise
Evaluator	How is a child scored?	exec, cascade, multi-instance, LLM judge, benchmark, proof, human
Archive	Which candidates survive?	flat pool, ranked, island, MAP-Elites, Pareto, tree, tiered, skills library
Memory	What persists beyond the archive?	none, learning log, reflection buffer, skills, graph, self-modifying harness
Cost profile	Where does the budget go?	exec ms, cascade, multi-seed sec, RL rollout min, harness run hr, nested inner loop

70.1.3 The taxonomy map

Systems cluster into four broad lineages and two out-of-taxonomy families. The figure below is the whole map: which lineage exists, which columns decide membership, and which edges between lineages the surveyed systems actually travel along. The deep-dive sections (§70.5 onward) unpack each box.

70.1.4 Four takeaways

The loop is four boxes plus optional memory. State → select → mutate → evaluate → update is the common skeleton. Classifying a system starts by asking what goes in each box.
Archive and memory are the decisive columns. Across the surveyed systems, the variation operator is shared widely (diff/rewrite appears in FS and QD; reflect-then-edit appears in RF and hybrid TS+RF), but the archive column (pool vs. grid vs. tree) and the memory column (none vs. log vs. reflection-buffer) do most of the lineage work.
Four lineages, two OOT families. FunSearch, Quality-Diversity, Tree-search, and Reflection cover the bulk of the book. ES-at-scale and the outer layer of research-agent pipelines are deliberately kept out of the taxonomy because they fail a necessary criterion of every lineage.
Evaluator fidelity and mutation granularity are the knobs most often reported as first-order; the clearest gap in the literature is cross-lineage head-to-heads on identical benchmarks with matched compute.

If you stop here. You now have the universal loop, the six ingredients, the taxonomy map, and the four takeaways. §§70.2–70.9 are a deep-dive reference that operationalises this front layer: decision procedures, lineage criteria, hybrids, the comparative matrix, the evidence ledger, design knobs, failure modes, and a routing guide. Return to them when you open a specific system chapter and want the classification framework made precise.

70.2 Beyond Five Minutes: Deep-Dive Roadmap

The rest of the chapter takes each element of the front layer and makes it formal: §70.3 lists where the loop abstraction bends; §70.4 makes the six ingredients into a decision procedure with an explicit precedence tree; §70.5 turns the four lineages into N/T/O criteria and catalogues hybrids and out-of-taxonomy cases; §70.6 presents the comparative matrix and its evidence ledger with counts; §70.7 maps the design knobs the surveyed ablations report as first-order and the comparisons still missing; §70.8 catalogues failure modes; §70.9 is a routing guide.

70.3 Where the Universal Loop Bends

Several families surveyed in this book map onto the loop only with qualifications. Calling them out up front is the honest move; §70.5.3 then decides which of these qualified mappings deserve their own taxonomic category.

Tree search systems (AB-MCTS/TreeQuest Ch19, Arcgentica Ch17, ALE-Agent Ch18, Confluence Labs Ch20). The "population" is a search tree, select is a UCT/UCB-style traversal down that tree to an expansion node, and update is a backpropagation of child scores to ancestor statistics. The loop reading is faithful to the control flow, but "archive admission" hides that tree growth, pruning, and value backup are the actual operators. These systems are in-taxonomy (§70.5, Tree-search lineage) because the four boxes still carry real semantic weight.
Beam search and best-first rollouts (used as inner loops in several research agents, e.g. AIRA₂, DeepScientist). The "population" is a fixed-width frontier; select is a top-k operation, not a sampler, and the variation operator is usually a single-shot generation with no explicit mutation distinction. These fit the loop as a degenerate case and are typically absorbed into whichever lineage the surrounding system belongs to.
Self-modifying agents (Darwin Gödel Machine Ch23, Darwinian Evolver Ch24, NeoSigma Ch28). The "candidate" is the agent harness itself, so the evaluator must run the candidate on a downstream task to score it. The loop is still recognisable, but mutate is a source-code patch and evaluate can involve nested evolutionary runs, making the loop recursive rather than flat. These remain in-taxonomy under the Reflection lineage because the four boxes still carry weight — the candidate population is just unusually small and the mutation operator is unusually expressive.
Research-agent pipelines (Part P07: AI Scientist, AIRA₂, Zochi, DeepScientist, RD-Agent, OmniScientist). These are end-to-end pipelines — literature review → hypothesis → experiment → paper — and only a subset of their stages is evolutionary. Reading the whole pipeline as one loop is lossy; reading the idea-mutation, experiment-refinement, or reward-shaping stages as loops is accurate. We treat the outer pipeline as an out-of-taxonomy special case in §70.5.3 because the architectural interest lives above the loop, not inside it; their inner loops are classified normally.
Evolution-strategies-at-scale methods (Ch46, Ch47). The variation operator is Gaussian noise on a parameter vector, not an LLM edit, and the update rule is a weighted recombination rather than an archive insertion. We treat them as out-of-taxonomy special cases in §70.5.3: they belong in the book because they inform the LLM-evolution conversation about scaling and population dynamics, but they blur the "L" in LLM-evolution and do not fit any of the four lineages cleanly.
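The tree-search qualification above is easiest to see in code: select becomes a walk down the tree and update becomes a backup along the path. The sketch below uses plain UCT with an exploration constant of 1.4 and expands only at leaves. These are simplifying assumptions of ours; the names `Node`, `uct`, `select`, `expand`, and `backpropagate` are hypothetical, and this is not the selection rule of any specific surveyed system.

```python
import math
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    candidate: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    total: float = 0.0          # sum of scores backed up through this node

def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return math.inf         # unvisited children are tried first
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def select(root: Node) -> Node:
    """'select' box: follow the highest-UCT child until a leaf is reached."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    return node

def expand(node: Node, candidate: str) -> Node:
    """Attach a proposed variant (an LLM output in a real system) as a child."""
    child = Node(candidate=candidate, parent=node)
    node.children.append(child)
    return child

def backpropagate(node: "Node | None", score: float) -> None:
    """'update' box: add the score to the node and every ancestor."""
    while node is not None:
        node.visits += 1
        node.total += score
        node = node.parent

root = Node("seed")
a = expand(root, "variant-a"); backpropagate(a, 0.2)
b = expand(root, "variant-b"); backpropagate(b, 0.9)
leaf = select(root)             # the node the next proposal would be attached to
```

Nothing here is admitted to or evicted from an archive; the per-node `visits` and `total` fields do that work, which is why the loop reading is faithful to control flow but hides the real operators.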

70.4 The Six Ingredients and a Decision Procedure

The six ingredients become more useful when treated as a fixed schema rather than a checklist. We have separated compute and evaluation cost into its own row because §70.7.1 argues, on the basis of the surveyed ablations, that evaluator fidelity is one of the dominant drivers of final performance — and any schema that pretends cost is secondary would contradict the chapter's own synthesis.

Ingredient	Question	Range of choices observed in the field
Representation	What is a candidate?	A Python function, a full file, a diff, a skill, a prompt, a neural-network weight vector, a proof, a scientific hypothesis.
Variation operator	How do we make a new candidate?	Line diff, full rewrite, two-parent crossover, reflect-then-edit, error repair, prompt mutation, Gaussian noise on parameters.
Evaluator	How do we score a candidate?	Single execution, cascade (smoke→light→full), multi-instance aggregation, LLM-as-judge, hybrid, real benchmark, formal proof, human review.
Selector & archive	Which candidates survive, which become parents?	Flat elitism, ranked archive, MAP-Elites grid^†, Pareto front, island ring, search tree, tiered pool, skills library.
Memory / reflection	What does the system remember beyond the archive?	Nothing, a learning log, an embedded log, a skills archive, a reflection buffer, a knowledge graph, a self-modifying harness.
Cost & budget profile	What does one candidate cost, and where does the budget go?	Cheap deterministic exec (ms), cascaded exec with admission thresholds, multi-seed benchmarks (sec–min), RL rollouts or agent harness runs (min–hr), nested evolutionary inner loops.

^† MAP-Elites: a Quality-Diversity archive that indexes candidates by a low-dimensional behavioral descriptor — a hand- or auto-defined feature vector capturing how a candidate behaves, not how well — and keeps one champion per cell. See Ch67 §67.4 for the formal definition used throughout this book.
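As a minimal illustration of the footnote's definition, the sketch below keeps one champion per descriptor cell. The class name `MapElitesArchive` and the toy descriptor (source length bucket, whether the source contains a `for` loop) are assumptions made for the example; Ch67 §67.4 remains the reference definition.

```python
from typing import Callable, Hashable

class MapElitesArchive:
    """One champion per cell; cells are keyed by a behavioral descriptor."""

    def __init__(self, descriptor: Callable[[str], Hashable]):
        self.descriptor = descriptor
        self.cells: dict = {}                  # cell -> (fitness, candidate)

    def add(self, candidate: str, fitness: float) -> bool:
        """Insert if the cell is empty or the candidate beats the incumbent."""
        cell = self.descriptor(candidate)      # how it behaves, not how well
        incumbent = self.cells.get(cell)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[cell] = (fitness, candidate)
            return True
        return False

# Hypothetical descriptor: (length bucket, uses a for-loop?)
archive = MapElitesArchive(lambda src: (len(src) // 10, "for" in src))
kept_first = archive.add("x=1", 0.5)
kept_weaker = archive.add("y=2", 0.4)               # same cell, lower fitness
kept_other = archive.add("for i in r: pass", 0.1)   # different cell, kept despite low fitness
```

The third insertion is the behaviour that separates a grid from a ranked pool: a low-fitness candidate survives because it occupies a cell nothing else does.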

70.4.1 Classification procedure

# A reading protocol for placing a new LLM-evolutionary system.
# Apply to any paper, repo, or chapter in this book.

def classify(system) -> dict:
    schema = {}

    # Q1 Representation: what is ONE unit of selection?
    schema["representation"] = ...  # function|file|diff|prompt|skill|weight_vector|proof|hypothesis|harness

    # Q2 Variation operator: how is a child produced from parent(s)?
    schema["variation"] = ...       # diff|rewrite|crossover|reflect_then_edit|repair|prompt_mutation|gaussian_noise

    # Q3 Evaluator, fidelity, and noise profile.
    schema["evaluator"]      = ...  # single_exec|cascade|multi_instance|llm_judge|hybrid|benchmark|proof|human
    schema["eval_fidelity"]  = ...  # cheap | medium | expensive
    schema["eval_noise"]     = ...  # deterministic | stochastic

    # Q4 Selector & archive.
    schema["archive"] = ...         # flat_pool|ranked|island|map_elites|pareto|tree|tiered|skills_library

    # Q5 Memory / reflection beyond the archive itself.
    schema["memory"] = ...          # none|learning_log|embedded_log|skills|reflection_buffer|graph|self_mod

    # Q6 Cost & budget profile.
    schema["cost_profile"] = ...    # exec_ms|cascade|multi_seed_sec|rollout_min|harness_run_hr|nested_inner_loop

    # Q7 Lineage: one primary + optional secondary tags (see §70.4.2).
    schema["lineage"] = infer_lineage(schema)
    return schema

70.4.2 Lineage precedence: an ordered decision tree

When the six ingredients are mutually consistent, classification is trivial. The hard cases are conflicts: archive looks like TS but variation looks like RF; memory looks reflective but archive is still a ranked pool; configuration toggles change the archive type. The rule set below fires in order; the first matching rule wins, and hybrid tags are emitted when a later rule would have fired on a different column than the winner.

def infer_lineage(s) -> str:
    # --------------- OOT gates (failure of a NECESSARY criterion of every lineage) ---------------
    # R0a. Variation is parameter-vector noise, not any form of LLM edit -> OOT (ES family).
    if s["variation"] == "gaussian_noise":
        return "OOT(ES)"
    # R0b. The system is an end-to-end research pipeline whose architectural interest is orchestration,
    #       and the loop-as-architecture granularity fails -> OOT (pipeline outer layer).
    if s.get("granularity") == "pipeline_outer":
        return "OOT(pipeline)"

    # --------------- Primary lineage by ARCHIVE column (decisive for FS/QD/TS) -----------------
    # R1. Explicit tree with per-node statistics -> TS, possibly hybridised by variation.
    if s["archive"] == "tree":
        if s["variation"] == "reflect_then_edit":
            return "TS+RF"            # ALE-Agent pattern: node stats + reflective proposal.
        return "TS"

    # R2. Descriptor-indexed grid with one champion per cell -> QD, optionally paired with a pool.
    if s["archive"] == "map_elites":
        if s.get("also_ranked_pool"):
            return "FS+QD"            # AlphaEvolve §4.5 pattern.
        return "QD"

    # R3. Configuration-dependent grid on top of a default pool -> FS (~QD).
    if s["archive"] in ("flat_pool", "ranked", "island") and s.get("configurable_grid"):
        return "FS(~QD)"              # OpenEvolve, LLM4AD pattern.

    # --------------- Variation-vs-memory tie-break when archive is a pool ----------------------
    # R4. Pool archive + reflect-then-edit variation + first-class memory store -> RF.
    #     Both of RF's necessary criteria must fire; memory alone is not enough (prevents
    #     FS-with-learning-log being mislabelled RF).
    if s["archive"] in ("flat_pool", "ranked", "island"):
        if s["variation"] == "reflect_then_edit" and s["memory"] in (
            "reflection_buffer", "skills", "self_mod", "graph"
        ):
            # Recursive-inner-loop annotation for self-modifying harnesses.
            if s["memory"] == "self_mod" and s.get("inner_loop_lineage") == "FS":
                return "RF(primary)+FS(inner)"   # Darwin Gödel Machine pattern.
            return "RF"
        # R5. Default fall-through: pool + diff/rewrite + optional learning log -> FS.
        return "FS"

    # R6. Anything else (e.g. tiered, skills_library without pool) -> promote to RF if
    #     both RF necessary criteria fire (same first-class memory set as R4); otherwise
    #     emit an "UNCLASSIFIED" tag for review.
    if s["variation"] == "reflect_then_edit" and s["memory"] in (
        "reflection_buffer", "skills", "self_mod", "graph"
    ):
        return "RF"
    return "UNCLASSIFIED"

Read the tree as a strict priority list. R0 gates out the two OOT families first because no in-taxonomy rule can rescue a system that fails a necessary criterion of every lineage: the variation-operator criterion for ES methods (R0a) and the loop-as-architecture test for pipeline outer layers (R0b). R1–R3 make the archive column authoritative for the FS/QD/TS distinction, which is why AlphaEvolve's MAP-Elites variant earns an FS+QD tag (archive points to QD, pool presence forces the hybrid) rather than a forced single-lineage call. R4 requires both Reflection necessary criteria (reflect-then-edit and first-class memory) to fire before promoting a pool system to RF; this is the load-bearing rule that prevents a FunSearch-style system with a learning log from being mislabelled. R5 is the FunSearch default. R6 handles the residual cases and emits an explicit review flag rather than a guess.

Applied to a system picked at random — say, GEPA (Ch7) — the procedure returns: representation = prompt; variation = reflect_then_edit; evaluator = benchmark, medium fidelity, stochastic; archive = ranked; memory = reflection_buffer; cost profile = multi_seed_sec; R4 fires (pool + reflect + first-class memory) → lineage = RF. For AlphaEvolve's MAP-Elites variant R2 fires with also_ranked_pool=True → FS+QD. For AB-MCTS R1 fires with non-reflective variation → TS. For ES-at-scale R0a fires → OOT(ES).
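The four worked placements can be replayed mechanically. The sketch below restates the §70.4.2 rules compactly so that it runs on its own; the rule order and tags follow the decision tree, while the example dictionaries carry only the columns the rules read. The `"rewrite"` value for AB-MCTS is our placeholder for "any non-reflective operator", and R6 reuses R4's first-class memory set.

```python
POOLS = ("flat_pool", "ranked", "island")
FIRST_CLASS_MEMORY = ("reflection_buffer", "skills", "self_mod", "graph")  # set used by R4

def infer_lineage(s: dict) -> str:
    if s["variation"] == "gaussian_noise":                       # R0a
        return "OOT(ES)"
    if s.get("granularity") == "pipeline_outer":                 # R0b
        return "OOT(pipeline)"
    if s["archive"] == "tree":                                   # R1
        return "TS+RF" if s["variation"] == "reflect_then_edit" else "TS"
    if s["archive"] == "map_elites":                             # R2
        return "FS+QD" if s.get("also_ranked_pool") else "QD"
    if s["archive"] in POOLS and s.get("configurable_grid"):     # R3
        return "FS(~QD)"
    reflective = (s["variation"] == "reflect_then_edit"
                  and s.get("memory") in FIRST_CLASS_MEMORY)
    if s["archive"] in POOLS:
        if reflective:                                           # R4
            if s["memory"] == "self_mod" and s.get("inner_loop_lineage") == "FS":
                return "RF(primary)+FS(inner)"
            return "RF"
        return "FS"                                              # R5
    return "RF" if reflective else "UNCLASSIFIED"                # R6

examples = {
    "GEPA": {"variation": "reflect_then_edit", "archive": "ranked",
             "memory": "reflection_buffer"},
    "AlphaEvolve (MAP-Elites variant)": {"variation": "diff", "archive": "map_elites",
                                         "memory": "learning_log", "also_ranked_pool": True},
    "AB-MCTS": {"variation": "rewrite", "archive": "tree"},      # placeholder operator
    "ES-at-scale": {"variation": "gaussian_noise"},
}
tags = {name: infer_lineage(schema) for name, schema in examples.items()}
```

Because the rules fire in order, the ES example never reaches an archive lookup and the AB-MCTS example never consults the memory column, which mirrors how the precedence list is meant to be read.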

70.5 The Four Lineages

Lineage tags are reading aids, not crisp partitions: several systems borrow from more than one lineage (§70.5.2), and two families sit outside the taxonomy entirely (§70.5.3). Classification follows Table 70.5.1, with each criterion marked N (necessary — a system in this lineage cannot fail this test), T (typical — holds for almost all flagship members but has documented exceptions), or O (optional — common but not definitional).

70.5.1 Classification criteria and lineage definitions

Criterion	FunSearch	Quality-Diversity	Tree-search	Reflection
Search state	N: flat pool or island ring	N: grid indexed by behavioral descriptor	N: explicit tree with per-node statistics	T: small ranked pool alongside a memory store
Variation operator	N: diff or full rewrite of a code candidate	T: same as FunSearch, but parent sampled from grid	T: LLM as proposal distribution at an expansion node	N: reflect-then-edit — LLM first analyses failure, then patches
Evaluator	T: deterministic code execution, usually cascaded	T: deterministic + extraction of behavioral descriptors	T: step-wise or terminal verification; sparse rewards backpropagated	O: expensive or noisy task reward, often agent rollouts
Archive structure	N: ranked list, sometimes per-island	N: one champion per occupied cell	N: tree with per-node value and visit counts	T: ranked pool + separate reflection/skills memory
Memory / reflection role	O: learning log, mostly for prompt context	O: log + archive of diverse exemplars	N: node statistics are the memory	N: first-class — reflection, skills, prompt evolution, or harness patching

Reading key: N = necessary (a system failing this is not in the lineage); T = typical (holds for the overwhelming majority of members, with named exceptions); O = optional (frequent but not definitional).

For each lineage the list below gives a positive example (a system that passes every N-criterion cleanly), a near-miss (a system that would be placed here naively but fails or strains an N-criterion and belongs elsewhere), and a tie-break rule for the commonest classification conflict.

FunSearch lineage. Positive example: FunSearch itself (Ch3): flat island pool, diff/rewrite variation, deterministic cascade, ranked per-island archive, learning log. Near-miss: ES-at-scale (Ch47) superficially resembles FS at the loop level but fails the FS necessary variation-operator criterion because the variation is Gaussian noise on parameters. Tie-break rule: if any published configuration adds a MAP-Elites grid, tag the system as FS+QD, or as FS(~QD) when the grid is a configuration toggle on top of a default pool, rather than forcing one lineage; rules R2 and R3 of the decision tree implement this automatically.
Quality-Diversity / MAP-Elites. Positive example: AlphaEvolve's behavioural-grid variant (Ch4 §4.5). Near-miss: an island-pool FunSearch configuration is sometimes called "diversity-preserving" but is not QD because an island ring is not a descriptor-indexed grid. Tie-break rule: if archive and memory point in different directions — e.g. ranked pool plus a reflection buffer that also indexes by a behavioral tag — the archive column is authoritative; memory drives RF-lineage only when the archive is also a ranked pool, as in GEPA Skills (Ch8).
Tree-search lineage. Positive example: AB-MCTS / TreeQuest (Ch19) — explicit tree with per-node visit counts, UCB traversal, LLM proposals at expansion nodes. Near-miss: a research agent that builds a candidate frontier and picks top-k each step is not TS — no persistent per-node statistic is backed up — it is a beam search. Tie-break rule: if node statistics are used by a reflect-then-edit proposal step (ALE-Agent Ch18), the system is TS+RF; TS is assigned by the archive column, RF by the variation column.
Reflection lineage. Positive example: GEPA (Ch7) — the LLM first reasons about why a prompt succeeded or failed, then edits it, and a reflection buffer is maintained alongside the ranked pool. Near-miss: a FunSearch-style system that pipes learning-log entries into its prompt context is not RF — the learning log is an O-criterion and the variation operator is still a plain diff or rewrite. Tie-break rule: when archive (pool) and memory (first-class store) both fit RF but variation is single-shot rewrite, the system is FS-with-memory, not RF.
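The near-miss reasoning above amounts to checking a system against the N cells only. The sketch below encodes those cells in the §70.4.1 schema vocabulary; the mapping from the table's prose to schema values is our assumption, as are the names `NECESSARY` and `failed_necessary`.

```python
# N-criteria of Table 70.5.1, expressed as allowed schema values per column.
NECESSARY = {
    "FS": {"archive": {"flat_pool", "ranked", "island"},
           "variation": {"diff", "rewrite"}},
    "QD": {"archive": {"map_elites"}},
    "TS": {"archive": {"tree"}},
    "RF": {"variation": {"reflect_then_edit"},
           "memory": {"reflection_buffer", "skills", "self_mod", "graph"}},
}

def failed_necessary(schema: dict, lineage: str) -> list:
    """Return the columns on which `schema` fails an N-criterion of `lineage`."""
    return [col for col, allowed in NECESSARY[lineage].items()
            if schema.get(col) not in allowed]

# Reflection near-miss: FunSearch-style system with a learning log.
fs_with_log = {"archive": "ranked", "variation": "rewrite", "memory": "learning_log"}
# Reflection tie-break: pool plus first-class memory, but single-shot rewrite.
fs_with_memory = {"archive": "ranked", "variation": "rewrite", "memory": "skills"}
# ES-style system, granting it a pool for the sake of argument.
es_like = {"archive": "flat_pool", "variation": "gaussian_noise", "memory": "none"}

report = {
    "fs_with_log vs RF": failed_necessary(fs_with_log, "RF"),
    "fs_with_memory vs RF": failed_necessary(fs_with_memory, "RF"),
    "es_like vs FS": failed_necessary(es_like, "FS"),
}
```

T and O cells are deliberately absent from `NECESSARY`: failing them strains a placement but never excludes it.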

Flagship placements under these criteria:

FunSearch lineage. FunSearch (Ch3), AlphaEvolve (Ch4), OpenEvolve (Ch5), ShinkaEvolve (Ch6), LLM4AD (Ch10), SkyDiscover/AdaEvolve (Ch9). Where it shines: problems with a fast, deterministic scorer — combinatorial optimisation, numerical kernels, competitive programming.
Quality-Diversity / MAP-Elites. AlphaEvolve's behavioural-grid variant (Ch4 §4.5), MAP-Elites configurations of OpenEvolve (Ch5 §5.6) and LLM4AD (Ch10). Where it shines: deceptive landscapes where approach-diversity matters as much as peak fitness.
Tree-search lineage. AB-MCTS & TreeQuest (Ch19), Arcgentica (Ch17), Confluence Labs (Ch20), ALE-Agent (Ch18), ShinkaEvolve@ICFP (Ch6 §6.8). Where it shines: discrete, verifiable step problems — ARC-AGI, heuristic contests, theorem proving.
Reflection lineage. GEPA (Ch7), GEPA Skills (Ch8), EurekaClaw (Ch26), RetroAgent (Ch25), Darwin Gödel Machine (Ch23), Darwinian Evolver (Ch24), NeoSigma (Ch28). Where it shines: expensive or noisy evaluators, where sample-efficiency matters more than throughput.

70.5.2 Hybrid and borderline cases

Forcing single-bucket assignment obscures some of the most interesting systems. Each hybrid below is annotated with the Table 70.5.1 criteria it stretches:

AlphaEvolve (Ch4). Primarily FS. Ships a MAP-Elites archive variant in §4.5 that satisfies every QD necessary criterion. Its learning log has reflection-like properties but does not carry a first-class reflection store. Tag: FS + QD.
ShinkaEvolve (Ch6). FS-family on core benchmarks. In its ICFP configuration (§6.8) it additionally satisfies every TS necessary criterion. Borrows reflection-style adaptive mutation scheduling (an O-criterion of RF). Tag: FS + TS, with Reflection-lite adaptation.
Darwin Gödel Machine (Ch23). RF at the harness level. Fails RF's typical "small ranked pool" because the population collapses to near-singleton harnesses; passes both RF necessary criteria. Its downstream evaluation task can itself be a FunSearch-style search. Tag: RF (primary), FS (recursive inner loop). This is a self-modifying in-taxonomy case, not OOT.
ALE-Agent (Ch18). Satisfies every TS necessary criterion but replaces the typical single-shot LLM proposal with a reflective step at each node, satisfying RF's necessary "reflect-then-edit" criterion. Tag: TS + RF.
OpenEvolve (Ch5). FS by default; when configured with a behavioral-grid archive (§5.6) it additionally satisfies QD's necessary criterion. Marked FS(~QD) by R3 of the decision tree.

70.5.3 Out-of-taxonomy special cases

Two families do not fit any lineage. Critically, self-modifying Reflection-lineage systems such as DGM (Ch23) and Darwinian Evolver (Ch24) are not OOT — they are unusual RF members whose candidate happens to be a harness. The OOT category is reserved for families that fail a necessary criterion of every lineage.

Evolution-strategies-at-scale methods (Ch46 EGGROLL, Ch47 Evolution Strategies at Scale). Why OOT: variation is Gaussian noise on a parameter vector, failing the necessary variation-operator criterion of every lineage. What they contribute: evidence about population sizing, compute-scaling, and diversity pressure that the LLM-evolution lineages can borrow.
Research-agent pipelines, outer layer (Part P07: AI Scientist, AIRA₂, Zochi, DeepScientist, RD-Agent, OmniScientist). Why OOT: end-to-end pipelines whose interest lies in orchestration, long-horizon memory, and stage-level planning, not in any single evolutionary inner loop. Inner loops (idea mutation, experiment refinement, reward shaping) individually belong to FS, TS, or RF and are classified normally in Ch69; the outer pipeline fails the loop-as-architecture test.

The practical rule: if the decision tree in §70.4.2 reaches UNCLASSIFIED or R0 fires, the system is OOT. If it merely produces a compound lineage tag (R1 with hybrid branch, R2 with pool, R4 with inner annotation), the system is in §70.5.2. A self-modifying candidate is not, by itself, grounds for OOT status.
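Stated as code, the practical rule is a short dispatch on the tag string. This is a sketch under the assumption that tags are written as in §70.4.2 and the §70.6 matrix (compound tags contain `+` or `~`); `section_for` is a hypothetical helper name.

```python
def section_for(tag: str) -> str:
    """Map a lineage tag to the section that discusses it."""
    if tag == "UNCLASSIFIED" or tag.startswith("OOT"):
        return "70.5.3"        # out-of-taxonomy special cases
    if "+" in tag or "~" in tag:
        return "70.5.2"        # hybrid and borderline cases
    return "70.5.1"            # single-lineage placement

routes = {tag: section_for(tag)
          for tag in ("OOT(ES)", "UNCLASSIFIED", "TS+RF", "FS(~QD)",
                      "RF(primary)+FS(inner)", "RF")}
```

The OOT check runs first so that a tag such as `OOT (inner ~RF)` is not misrouted to the hybrid section by its `~` marker.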

70.6 Comparative Matrix and Evidence Ledger

The matrix below maps fourteen representative systems against the ingredients and lineage tags. FS=FunSearch, QD=Quality-Diversity, TS=Tree-search, RF=Reflection, OOT=out-of-taxonomy. The marker ~ on a cell means configuration-dependent: the feature is present in at least one published or default configuration of the system, but is not universal across its releases.

System	Ch.	Lineage	Representation	Variation op	Evaluator	Archive	Memory
FunSearch	Ch3	FS	Python function	Full rewrite	Deterministic exec	Island pool	Learning log
AlphaEvolve	Ch4	FS+QD	Function / file	Diff + rewrite	Cascade	Pool + MAP-Elites (§4.5)	Learning log
OpenEvolve	Ch5	FS(~QD)	Function / file	Diff + rewrite	Cascade	Pool ~or grid (§5.6)	Learning log
ShinkaEvolve	Ch6	FS+TS	File	Diff + rewrite	Cascade + multi-inst	Pool ~tree (ICFP §6.8)	Learning log + ~reflection
GEPA	Ch7	RF	Prompt	Reflect-then-edit	Benchmark (stochastic)	Ranked pool	Reflection buffer
GEPA Skills	Ch8	RF	Prompt + skill	Reflect-then-edit	Benchmark	Pool + skills library	Skills archive
LLM4AD	Ch10	FS(~QD)	Heuristic function	Diff + rewrite	Deterministic exec	Pool ~or grid	Learning log
Arcgentica	Ch17	TS	Program / plan	LLM proposal @ node	Step + terminal verify	Search tree	Node statistics
ALE-Agent	Ch18	TS+RF	Heuristic program	Reflective proposal @ node	Contest scorer	Search tree	Node stats + reflection
AB-MCTS / TreeQuest	Ch19	TS	Solution	LLM proposal @ node	Task verifier	Search tree (posterior)	Node statistics
Darwin Gödel Machine	Ch23	RF (~FS inner)	Agent harness (code)	Reflect-then-edit patch	Downstream task run	Near-singleton pool	Self-modifying harness
EurekaClaw	Ch26	RF	Reward function	Reflect-then-edit	RL rollout (noisy)	Ranked pool	Reflection buffer
ES-at-Scale	Ch47	OOT	Weight vector	Gaussian noise	Task reward	Implicit (population)	None
AI Scientist	Ch58	OOT (inner ~RF)	Research idea / experiment	Pipeline stage-dependent	Paper / experiment artifact	Pipeline state	Long-horizon notes

Reading the matrix. Two patterns are visible, each with an exception note, a sampling-bias note, and a falsification test so the reader can audit the generalisation.

Representation almost fully predicts evaluator. Code → deterministic cascade, prompt → stochastic benchmark, reward function → noisy rollout, weight vector → task reward, research idea → paper artifact. The pairing itself is observed in the matrix; the explanation we offer for it is a hypothesis, not a law: the evaluator is usually the cost bottleneck, and the representation tends to be chosen so it is cheap to evaluate at the desired fidelity. Known exception: GEPA Skills (Ch8) mixes prompt and skill representations into the same benchmark evaluator, weakening the one-to-one coupling. Sampling bias: the surveyed systems over-represent code-execution benchmarks because that is where LLM-evolution first took off; a survey that sampled more heavily from agentic robotics or theorem-proving might loosen the mapping. Falsification test: a future system that deliberately decouples representation and evaluator (for example, evolving prompts but scoring them with a formal proof checker) would count as evidence against this reading.
Archive and Memory columns drive lineage membership more than Variation op. AlphaEvolve and OpenEvolve use the same diff+rewrite operator as FunSearch but earn their QD tag entirely from the archive column; GEPA and EurekaClaw differ in representation and evaluator but share the same reflection-buffer memory and therefore the same lineage. Known exception: ALE-Agent (Ch18) is tagged TS+RF precisely because its variation operator is decisive — the TS archive alone would not give RF credit, and the reflect-then-edit step is what activates rule R1's hybrid branch. Sampling bias: the surveyed FS-family systems ship with configurable archive topologies, which inflates the apparent leverage of the archive column; a sample drawn from systems with fixed archives would show more variation-op leverage. Falsification test: a system whose lineage flips solely by swapping variation operator while holding archive and memory constant would weaken the claim.

A third, weaker observation: the ~ markers cluster in FS-family rows. This is not a flaw of FS but, we conjecture, a feature — those systems ship with configurable archive topologies, which is precisely why they can cross into QD or TS without a redesign. Reflection-lineage systems, by contrast, rarely toggle their memory store on and off in the surveyed configurations: it is definitional for them.
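The clustering claim can be audited directly against the matrix. The sketch below transcribes the System, Lineage, Archive, and Memory columns and lists the rows whose Archive or Memory cell carries a `~`. Restricting the check to those two columns is our reading of the observation, since the Lineage column also uses `~` for the Darwin Gödel Machine and AI Scientist rows.

```python
import re
from collections import Counter

# (system, lineage, archive, memory), transcribed from the matrix above.
ROWS = [
    ("FunSearch", "FS", "Island pool", "Learning log"),
    ("AlphaEvolve", "FS+QD", "Pool + MAP-Elites (§4.5)", "Learning log"),
    ("OpenEvolve", "FS(~QD)", "Pool ~or grid (§5.6)", "Learning log"),
    ("ShinkaEvolve", "FS+TS", "Pool ~tree (ICFP §6.8)", "Learning log + ~reflection"),
    ("GEPA", "RF", "Ranked pool", "Reflection buffer"),
    ("GEPA Skills", "RF", "Pool + skills library", "Skills archive"),
    ("LLM4AD", "FS(~QD)", "Pool ~or grid", "Learning log"),
    ("Arcgentica", "TS", "Search tree", "Node statistics"),
    ("ALE-Agent", "TS+RF", "Search tree", "Node stats + reflection"),
    ("AB-MCTS / TreeQuest", "TS", "Search tree (posterior)", "Node statistics"),
    ("Darwin Gödel Machine", "RF (~FS inner)", "Near-singleton pool", "Self-modifying harness"),
    ("EurekaClaw", "RF", "Ranked pool", "Reflection buffer"),
    ("ES-at-Scale", "OOT", "Implicit (population)", "None"),
    ("AI Scientist", "OOT (inner ~RF)", "Pipeline state", "Long-horizon notes"),
]

def primary(lineage: str) -> str:
    """Leading lineage code of a tag, e.g. 'FS(~QD)' -> 'FS'."""
    return re.match(r"[A-Z]+", lineage).group()

# Rows that are configuration-dependent in the archive or memory column.
configurable = [(name, primary(lin)) for name, lin, arch, mem in ROWS
                if "~" in arch or "~" in mem]
by_lineage = Counter(lin for _, lin in configurable)
```

On this transcription every configuration-dependent archive or memory cell sits in an FS-primary row, which is the pattern the conjecture sets out to explain.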

70.6.1 Evidence ledger for §§70.6–70.8 synthesis claims

The synthesis statements in the matrix reading, in §70.7.1, and in §70.8 are not independent observations — each is the distillation of a specific set of chapters and reported ablations. The table below is the full ledger: for every non-trivial synthesis claim, it lists supporting chapters, what was actually compared, and our confidence level. Grades: observed (the claim restates a pattern directly visible in the cited chapters' tables), hypothesis (a reading of that pattern we find plausible), conjecture (a generalisation beyond what the literature establishes).

Synthesis claim	Section	Supporting chapters	What was compared	Confidence
Representation almost fully predicts evaluator.	§70.6 reading	Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch17, Ch18, Ch19, Ch23, Ch26, Ch47, Ch58	Cross-chapter reading of the Representation/Evaluator columns.	Observed
Archive and Memory columns drive lineage membership more than Variation op.	§70.6 reading	Ch3, Ch4, Ch5, Ch7, Ch8, Ch10, Ch26	Same variation operator across FS and QD tags; same memory type across differing representations.	Observed
Evaluator fidelity / cascade design is one of the knobs most often reported as headline-moving.	§70.7.1 item 1	Ch4, Ch5, Ch6	Within-system ablations varying cheap-stage cost, admission thresholds, and cascade depth.	Observed in Ch4, Ch6; consistent with Ch5.
Mutation granularity shifts with run stage in FS-family systems.	§70.7.1 item 2	Ch4, Ch5, Ch7, Ch26	Operator-mix curves over run time; single operator swaps under expensive evaluators.	Observed within cited systems; hypothesis for generalisation.
Archive topology matters when the landscape is deceptive and the descriptor is meaningful.	§70.7.1 item 3	Ch3, Ch4 §4.5	Within-system archive ablations on shared benchmarks.	Observed in Ch4 §4.5; hypothesis that the descriptor-quality precondition is general.
Parent selection pressure is first-order for TS, second-order for FS-family.	§70.7.1 item 4	Ch17, Ch19, Ch4, Ch6	UCT sweeps; FS softmax / top-k sweeps.	Observed in Ch17, Ch19 for TS; hypothesis for the FS-family ordering.
Reflection / memory persistence is headline-moving in RF, neutral in FS/TS.	§70.7.1 item 5	Ch7, Ch8, Ch26, Ch3, Ch4, Ch6	Per-run vs cross-run vs distilled reflection; learning-log on/off.	Observed in RF chapters; hypothesis for the FS/TS neutrality.
Evaluator overfitting is a well-established failure mode.	§70.8 item 1	Ch4, Ch6	Multi-instance evaluation and held-out sets introduced in response to gaming.	Observed
Archive collapse is addressed by topology choice.	§70.8 item 2	Ch3, Ch4 §4.5, Ch8	Before/after introduction of island, grid, novelty filtering.	Observed
Reflection contamination is a failure mode in the RF lineage.	§70.8 item 3	Ch7, Ch26	Buffer pruning and score-conditional admission fixes.	Observed; dominance is a conjecture.
Prompt / harness collapse is lineage-specific.	§70.8 item 4	Ch23, Ch28	Case descriptions of harness edits degrading future quality; rollback checkpoints as countermeasure.	Observed instances; hypothesis that it generalises.
Cross-lineage head-to-heads on identical benchmarks are largely absent.	§70.7.2	Survey-wide negative observation	Search for matched-compute, matched-LLM head-to-heads across FS/QD/TS/RF.	Observed (as absence).

70.6.2 Summary counts for the synthesis claims

The ledger above is long. The table below compresses it into counts the reader can audit at a glance: how many chapters back each claim, whether the evidence is within-system (an ablation inside one chapter) or cross-system (a comparison across chapters), and what benchmark family the evidence comes from. This is explicitly a coverage count, not an effect-size tally.

Claim cluster	# supporting chapters	Within-system	Cross-system	Benchmark family	Confidence
Repr predicts evaluator (§70.6 read 1)	13	—	13 (cross-read)	code, NLP, RL, agentic, research	Observed
Archive+Memory drive lineage (§70.6 read 2)	7	—	7 (cross-read)	code, NLP, RL	Observed
Evaluator fidelity knob (§70.7.1 #1)	3	3	0	code exec (cascades)	Observed (Ch4, Ch6); consistent (Ch5)
Mutation granularity knob (§70.7.1 #2)	4	4	0	code exec + expensive NLP/RL	Observed → hypothesis
Archive topology knob (§70.7.1 #3)	2	2	0	combinatorial (deceptive)	Observed → hypothesis
Selection pressure knob (§70.7.1 #4)	4 (2 TS + 2 FS)	4	0	ARC/contests + code exec	Observed (TS) → hypothesis (FS)
Reflection persistence knob (§70.7.1 #5)	6 (3 RF + 3 FS)	6	0	expensive NLP/RL + code exec	Observed (RF) → hypothesis (FS/TS)
Evaluator overfitting (§70.8 #1)	2	2	0	code exec	Observed
Archive collapse (§70.8 #2)	3	3	0	combinatorial	Observed
Reflection contamination (§70.8 #3)	2	2	0	expensive NLP/RL	Observed + conjecture
Harness collapse (§70.8 #4)	2	2	0	agent harness	Observed instances + hypothesis
Cross-lineage head-to-heads (§70.7.2)	0	0	0	—	Observed absence

The compressed view makes the structural limitation of the current synthesis visible: every design-knob claim in §70.7.1 is supported by within-system ablations, and none by cross-system comparisons on matched benchmarks. Phrases like "knobs that appear to move the needle" should therefore be read as coverage observations, not effect-size rankings, and claims that depend on cross-system generalisation are tagged as hypotheses rather than observations.
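The chapter counts in the table can be re-derived mechanically from the "Supporting chapters" column of the ledger in §70.6.1. The sketch below is one way to do that audit, assuming the cells are copied verbatim and that a chapter tag always has the form `Ch` followed by digits; the dictionary keys are shortened labels chosen here for illustration, not names used elsewhere in the book.

```python
import re

# "Supporting chapters" cells copied from the ledger in §70.6.1.
# Keys are shortened, illustrative labels for each claim cluster.
LEDGER = {
    "Representation predicts evaluator":
        "Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch17, Ch18, Ch19, Ch23, Ch26, Ch47, Ch58",
    "Archive+Memory drive lineage": "Ch3, Ch4, Ch5, Ch7, Ch8, Ch10, Ch26",
    "Evaluator fidelity knob": "Ch4, Ch5, Ch6",
    "Mutation granularity knob": "Ch4, Ch5, Ch7, Ch26",
    "Archive topology knob": "Ch3, Ch4 §4.5",
    "Selection pressure knob": "Ch17, Ch19, Ch4, Ch6",
    "Reflection persistence knob": "Ch7, Ch8, Ch26, Ch3, Ch4, Ch6",
    "Evaluator overfitting": "Ch4, Ch6",
    "Archive collapse": "Ch3, Ch4 §4.5, Ch8",
    "Reflection contamination": "Ch7, Ch26",
    "Harness collapse": "Ch23, Ch28",
    "Cross-lineage head-to-heads": "Survey-wide negative observation",
}


def count_chapters(cell: str) -> int:
    """Count distinct chapter tags (for example 'Ch4') in one ledger cell."""
    return len(set(re.findall(r"Ch\d+", cell)))


counts = {claim: count_chapters(cell) for claim, cell in LEDGER.items()}

for claim, n in counts.items():
    print(f"{n:>2}  {claim}")
```

The script counts coverage only. It says nothing about effect size, and it cannot distinguish a chapter that directly reports an ablation from one that is merely consistent with the claim.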

70.7 Design Knobs and Missing Comparisons

The evidence base for this section is the ablation and sensitivity tables reported in the surveyed chapters, catalogued in §70.6.1 and counted in §70.6.2. Throughout, an observation is a pattern a cited chapter's table directly shows; a hypothesis is our reading of what that pattern means; a conjecture is a generalisation offered as a target for future work.

70.7.1 Knobs that appear to move the needle

Across the systems in §70.6, five design knobs recur in reported ablations as candidate first-order drivers of final performance. They are numbered in the order used by the ledger in §70.6.1. The per-knob chapter counts in §70.6.2 (3, 4, 2, 4, and 6) measure coverage, not effect size, so neither the list order nor the counts should be read as a ranking.

1. Evaluator fidelity and cascade design (observed in Ch4, Ch6; consistent with Ch5). Across FS-family systems, the cheap-stage cost, the admission threshold, and whether the cheap stage is deterministic or a surrogate all move headline numbers. The dominant pattern is that cheapening the smoke stage is typically net-positive, up to the point where the stage becomes uninformative. The two chapters that report cascade admission curves (Ch4, Ch6) show the knee of that curve as the most cost-sensitive parameter within those studies; generalising further is a conjecture.
2. Mutation granularity: diff vs. rewrite vs. reflect-then-edit (observed in Ch4, Ch5, Ch7, Ch26). Where the same system ships multiple operators, diffs dominate early in a run while rewrites dominate late, once the archive has saturated the local basin. RF systems report that replacing reflect-then-edit with a single-shot rewrite collapses their sample-efficiency advantage. Extending either pattern beyond the cited systems is a hypothesis.
3. Archive topology: flat pool vs. island vs. MAP-Elites grid (observed in Ch3, Ch4 §4.5). FunSearch reports gains from island pools on deceptive landscapes. AlphaEvolve reports that MAP-Elites improves recovery from local optima on landscapes where the behavioral descriptor genuinely partitions the search space, and is neutral-to-negative where it does not.
4. Parent selection pressure (observed in Ch17, Ch19 for TS; consistent with Ch4, Ch6 for FS-family). Whatever knob a system uses to trade off exploitation against exploration (softmax temperature, UCT exploration constant, or top-k width) is first-order in TS ablations and second-order in FS-family ablations. The TS dependence shows up clearly because visit counts amplify early selection bias; the FS-family ordering is a hypothesis.
5. Reflection / memory persistence (observed in Ch7, Ch8, Ch26 for RF; weakly consistent with Ch3, Ch4, Ch6 for FS). For RF, whether the reflection buffer is per-run, cross-run, or distilled into a skills library is headline-scale. For FS/TS, adding a learning log is usually neutral or mildly positive; removing it is rarely catastrophic. RF systems are their memory in a way that FS and TS systems are not.
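Two of the knobs above are easy to state in code. The sketch below shows a two-stage evaluator cascade with an admission threshold, and softmax parent selection with a temperature. It is an illustration under stated assumptions, not the implementation of any surveyed system: the function names, the toy smoke test, and the length-based score are all hypothetical stand-ins.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def cascade_evaluate(
    candidate: str,
    cheap_stage: Callable[[str], float],
    full_stage: Callable[[str], float],
    admission_threshold: float,
) -> Optional[float]:
    """Run the cheap stage first; pay for the full stage only if admitted."""
    if cheap_stage(candidate) < admission_threshold:
        return None  # rejected early, no expensive evaluation spent
    return full_stage(candidate)


def softmax_weights(scores: Sequence[float], temperature: float) -> List[float]:
    """Selection probabilities; a lower temperature means stronger exploitation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_parent(population, scores, temperature, rng):
    """Draw one parent with probability proportional to its softmax weight."""
    weights = softmax_weights(scores, temperature)
    return rng.choices(population, weights=weights, k=1)[0]


# Hypothetical stand-ins: a string-based smoke test and a length-based score.
def toy_smoke(candidate: str) -> float:
    return 1.0 if "return" in candidate else 0.0


def toy_full(candidate: str) -> float:
    return float(len(candidate))


population = ["def f(): return 1", "def f(): pass", "def f(): return 1 + 1"]
scores = [cascade_evaluate(c, toy_smoke, toy_full, 0.5) for c in population]
admitted = [(c, s) for c, s in zip(population, scores) if s is not None]

rng = random.Random(0)
for temperature in (0.1, 1.0, 10.0):
    weights = softmax_weights([s for _, s in admitted], temperature)
    print(temperature, [round(w, 3) for w in weights])

parent = sample_parent(
    [c for c, _ in admitted], [s for _, s in admitted], 1.0, rng
)
```

Raising `admission_threshold` saves full-stage evaluations at the risk of discarding good candidates. Lowering `temperature` concentrates selection on the current best parents.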

70.7.2 Comparisons the literature has not yet made

Lineage-against-lineage head-to-heads on identical benchmarks. Almost every system ablates against its own predecessors or against non-evolutionary baselines. Few report head-to-heads across lineages on identical task sets with matched compute and matched LLMs.
Archive topology on identical landscapes. AlphaEvolve (Ch4 §4.5) is the cleanest within-system comparison of flat vs. MAP-Elites archives, but it is one study.
Reflection-buffer provenance and transfer. GEPA, GEPA Skills, and EurekaClaw each report positive effects from their reflection stores, but they do not share a benchmark or a buffer format.
LLM-model sensitivity. Most ablations hold the proposal LLM fixed. Studies that vary it (Ch4, Ch6) suggest model choice interacts with mutation granularity — stronger models benefit more from diffs, weaker models from rewrites — but the sample is too small to generalise.
Out-of-taxonomy vs. in-taxonomy on shared tasks. ES-at-scale (Ch46, Ch47) and research-agent pipelines (Part P07) rarely share benchmarks with FS/QD/TS/RF lineages.

Synthesis claim. The two design choices that show up most consistently as first-order drivers in the surveyed ablations — by chapter coverage, not by measured effect size — are evaluator fidelity / cascade design and mutation granularity. The two choices whose effects are most visibly lineage-conditional are archive topology and reflection memory persistence. The comparisons most visibly missing from the literature are cross-lineage head-to-heads on identical benchmarks with matched compute. These are tentative synthesis patterns, backed by the ledger and counts in §§70.6.1–70.6.2 rather than by a meta-analysis, and they should be read as a starting map for future comparative work, not as settled conclusions.

70.8 Failure Modes and How Lineages Respond

The same four lineages are visible in the failure modes each system reports and in the countermeasures it ships. Three failure modes are well-established; the fourth is lineage-specific.

1. Evaluator overfitting (observed in Ch4, Ch6). Search finds candidates that game the cheap stage of a cascade or exploit a quirk of the benchmark. Countermeasures across FS and TS lineages: multi-instance evaluation, cascade admission thresholds, held-out instance sets.
2. Archive collapse (observed in Ch3, Ch4 §4.5, Ch8). The population converges to near-identical candidates and progress stalls. Countermeasures: island ring topology (Ch3), MAP-Elites grid (Ch4 §4.5), explicit novelty filtering (Ch8).
3. Reflection contamination (observed in Ch7, Ch26). A reflection buffer accumulates plausible-but-wrong explanations for past failures, and subsequent reflect-then-edit steps compound them. Countermeasures: reflection pruning and score-conditional buffer admission. Whether it is the dominant failure mode for RF systems is a conjecture.
4. Prompt or harness collapse (observed in Ch23, Ch28). The system edits its own prompt or harness in a way that degrades its future proposal quality. Countermeasures (rollback checkpoints, sanity evals) are less uniform across systems than cascade admission thresholds. Treating this as a general self-modification failure mode, rather than two specific case reports, is a hypothesis.
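Two of the countermeasures named above can be sketched in a few lines: a MAP-Elites-style grid that keeps at most one elite per descriptor cell (against archive collapse), and a reflection buffer with score-conditional admission plus pruning (against reflection contamination). The class names, the `min_gain` rule, and the prune-by-gain policy are assumptions made for this illustration; the cited chapters name the countermeasures but the exact rules they use are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Cell = Tuple[int, ...]  # discretised behavioral descriptor


@dataclass
class GridArchive:
    """MAP-Elites-style archive: at most one elite per descriptor cell."""
    elites: Dict[Cell, Tuple[float, str]] = field(default_factory=dict)

    def insert(self, candidate: str, score: float, cell: Cell) -> bool:
        """Keep the candidate only if its cell is empty or it beats the incumbent."""
        incumbent = self.elites.get(cell)
        if incumbent is None or score > incumbent[0]:
            self.elites[cell] = (score, candidate)
            return True
        return False


@dataclass
class ReflectionBuffer:
    """Hypothetical rule: admit a note only if its edit improved the score."""
    min_gain: float = 0.0
    max_size: int = 32
    notes: List[Tuple[float, str]] = field(default_factory=list)

    def admit(self, note: str, score_before: float, score_after: float) -> bool:
        gain = score_after - score_before
        if gain <= self.min_gain:
            return False  # no measured improvement, so the note is not stored
        self.notes.append((gain, note))
        # Pruning: retain only the highest-gain notes.
        self.notes.sort(key=lambda item: item[0], reverse=True)
        del self.notes[self.max_size:]
        return True


archive = GridArchive()
archive.insert("short greedy heuristic", score=0.6, cell=(0, 1))
archive.insert("short greedy variant", score=0.5, cell=(0, 1))       # same cell, lower score
archive.insert("long randomized heuristic", score=0.4, cell=(2, 0))  # new cell

buffer = ReflectionBuffer(min_gain=0.0, max_size=2)
buffer.admit("off-by-one in loop bound", score_before=0.4, score_after=0.6)
buffer.admit("evaluator is probably noisy", score_before=0.6, score_after=0.6)

print(sorted(archive.elites))
print(buffer.notes)
```

The grid keeps the lower-scoring candidate in cell `(2, 0)` because it occupies a different region of descriptor space, which is the property that resists collapse. As §70.7.1 notes for archive topology, this helps only when the descriptor genuinely partitions the search space.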

70.9 Reader Routing Guide

A reader arriving without a background in LLM evolution can use this chapter as a routing table. If your interest is code discovery with a fast deterministic scorer, start with the FS-lineage chapters (Ch3–Ch6) and §70.7.1 items 1–3. If your interest is sample-efficient search under expensive or noisy evaluators, start with the RF-lineage chapters (Ch7, Ch8, Ch26) and §70.7.1 item 5. If your interest is structured discrete problems with verifiable steps, start with the TS-lineage chapters (Ch17–Ch20) and §70.7.1 item 4.

If your interest is self-modifying agents whose candidate is their own harness, read Ch23 (Darwin Gödel Machine) and Ch24 (Darwinian Evolver). These systems are in-taxonomy under the Reflection lineage — §70.5.1 places them there because they satisfy both RF necessary criteria, and §70.5.2 tags DGM as RF (primary), FS (recursive inner loop). The only respect in which they are unusual is that their RF ranked-pool state collapses to a near-singleton harness; that makes them edge cases of RF, not exits from the taxonomy. Read them after the core RF chapters (Ch7, Ch8, Ch26).

If your interest is long-horizon autonomous science pipelines, read Part P07, but treat those systems as §70.5.3 out-of-taxonomy cases at the outer pipeline level — their architectural story is orchestration and stage planning, not a single evolutionary loop. Their inner loops are classified normally in Ch69. Return to §70.5.3 before Part P07 so the distinction between an unusual-but-in-taxonomy self-modifying system (Ch23–Ch24) and a pipeline-level OOT system (Part P07) stays clear.

The decision tree in §70.4.2 is designed to be applied to new systems that appear after this book was written. If a future system produces a compound lineage tag under R1–R4, it belongs in §70.5.2 of a future edition; if R0 fires or the tree reaches UNCLASSIFIED, it belongs in §70.5.3. A self-modifying candidate, on its own, does not force that outcome. Either way, the classification is a contribution to the map rather than a failure of it — the map is there to show where the research frontier has moved, not to freeze it.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}