Chapter 65

Next Evolution Architecture

Part P09: Synthesis & Future Directions

After surveying seventeen LLM-powered evolutionary systems published between 2024 and 2026, a striking conclusion emerges: no single system captures all the key innovations that the field has collectively discovered. Each system made a focused contribution—ShinkaEvolve's sample efficiency, GEPA's diagnostic feedback, the Darwin Gödel Machine's self-improvement, AlphaEvolve's population diversity—but the total design space has never been fully integrated. This chapter synthesizes the strongest design patterns from across the survey into a unified architectural blueprint called OmniEvolve, identifies six critical integration gaps that no existing system addresses, and proposes a formal framework for reasoning about the next generation of LLM-powered evolutionary systems.

Key Contribution

This chapter provides the first systematic gap analysis across seventeen surveyed evolutionary AI systems (2024–2026) and synthesizes their complementary innovations into a unified architectural blueprint. Rather than describing a deployed system, it serves as a design-space map for the field: identifying six integration gaps no existing system solves, proposing seven design principles derived from empirical observations, specifying formal mathematical foundations, and outlining concrete research directions for next-generation frameworks. The central hypothesis is that integrating sample-efficiency mechanisms, diagnostic feedback, meta-level evolution, and hybrid search strategies yields compound value substantially greater than the sum of the individual contributions, a claim the chapter motivates but that remains to be validated in a combined deployment.

65.1 Design Motivation: The Integration Imperative

The seventeen systems surveyed in this book can each be characterized by a small set of innovations. AlphaEvolve (Chapter 4) demonstrated MAP-Elites quality-diversity at industrial scale with Gemini ensembles. OpenEvolve (Chapter 5) democratized the evolutionary loop with open-source MAP-Elites and multi-provider LLM support. ShinkaEvolve (Chapter 6) achieved dramatic sample-efficiency gains through two-tier novelty filtering, bandit-based LLM selection, and prompt co-evolution. GEPA (Chapter 7) introduced Actionable Side Information (ASI) for diagnostic feedback and Pareto-based multi-objective search. The Darwin Gödel Machine (Chapter 10) pushed the frontier of self-modification. Darwinian Evolver (Chapter 9) contributed learning logs for cross-individual knowledge sharing. AB-MCTS (Chapter 12) brought adaptive tree search with Thompson Sampling. SkyDiscover/AdaEvolve (Chapter 11) demonstrated three-level hierarchical adaptation across 200+ benchmarks.

Yet when we map these innovations against each other, we find that every system has significant blind spots. The following table, derived from the survey analysis, summarizes what each key system contributed and what it lacks.

Table 65.1 — Innovation map: what each surveyed system got right and what it lacks. Compiled from Chapters 4–15 of this survey.
System	Key Innovation	What It Lacks
AlphaEvolve	MAP-Elites quality-diversity, Gemini ensemble at scale	Closed source, no prompt evolution, no learning logs, no self-improvement
OpenEvolve	Open MAP-Elites + islands, multi-provider LLM	No prompt evolution, no diagnostic ASI, no learning logs
ShinkaEvolve	2-tier novelty, bandit LLM selection, prompt co-evolution, async 5–10× speedup	No structured diagnostic feedback, no learning logs, no self-modification
GEPA	Actionable Side Information, Pareto search, reflection-driven mutation	No prompt evolution, no learning logs, limited population management
DGM	Full self-modification, expanding archive, cross-language transfer	No cost control, safety risks, no structured diagnostic feedback
Darwinian Evolver	Learning log system, failure-case-driven mutation	Sequential only, no prompt evolution, flat population, no novelty filtering
AB-MCTS	Thompson Sampling for adaptive branching, multi-LLM collaboration	No population management, not a full evolutionary framework
SkyDiscover	Three-level hierarchical adaptation, globally-normalized bandits, 200+ benchmarks	No prompt evolution, no learning logs, no MAP-Elites, no 2-tier novelty
LLM4AD	7 search methods unified, GUI, broad task coverage	No prompt evolution, no learning logs, no ASI
Arcgentica	Runtime-as-context, persistent REPL, live execution as reasoning surface	Not a full evolutionary system, no population

65.1.1 Six Critical Integration Gaps

From the cross-system analysis, six specific integration gaps emerge—pairs or clusters of innovations that have never been combined in a single framework. These gaps represent the highest-value opportunities for next-generation system design.

Gap 1 (Sample Efficiency + Diagnostics) represents the most immediate efficiency opportunity. ShinkaEvolve's two-tier novelty filtering and bandit LLM selection dramatically reduce evaluations-to-convergence. GEPA's Actionable Side Information explains why mutations fail. Combining both would make reflective mutation far more targeted, further reducing the number of evaluations needed. Gap 2 (Diversity + Learning Logs) addresses knowledge isolation: MAP-Elites maintains behavioral diversity across the population, but individuals in isolated cells cannot share what they have learned. Darwinian Evolver's learning logs provide exactly this cross-individual knowledge sharing, yet that system uses a flat population without quality-diversity structure. Gap 3 (Prompt Evolution + ASI) closes a feedback loop: when a prompt consistently produces mutations that fail with specific diagnostic patterns, the prompt itself should be mutated to address those patterns—but no existing system connects these two mechanisms.

Gap 4 (Controlled Self-Improvement) addresses the safety boundary. DGM's self-modification is the most ambitious mechanism in the survey, but it lacks scoped guarantees. A system that selectively improves its mutation strategies and prompts (but not its evaluation logic or safety harness) would capture most of the benefit while maintaining formal rollback safety. Gap 5 (Tree Search + Population) recognizes that depth-first refinement and breadth-first diversity serve complementary functions, yet no system dynamically switches between AB-MCTS tree search for deep individual refinement and evolutionary population search for broad exploration. Gap 6 (Runtime Context + Evolution) highlights that Arcgentica's use of live execution state as reasoning context could dramatically improve mutation quality, yet no evolutionary framework has incorporated this pattern.

65.2 Seven Design Principles

From the empirical observations and failure modes documented across all seventeen surveyed systems, seven design principles emerge for next-generation LLM-powered evolutionary frameworks. These are not abstract ideals but direct consequences of what worked and what did not in deployed systems.

Table 65.2 — Seven design principles for next-generation LLM-powered evolutionary systems, with empirical grounding from the survey.
#	Principle	Rationale	Empirical Source
P1	Sample Efficiency is the Primary Constraint	LLM API costs are the bottleneck, not algorithmic compute. Every design decision should minimize evaluations-to-convergence.	ShinkaEvolve's 5–10× async speedup; GEPA's reflection-driven mutation; Darwinian Evolver's failure-case targeting
P2	Diagnostic Data Drives Everything	Mutation operators should never be blind. Every evaluation should return actionable diagnostic information, not just a score.	GEPA's ASI schema; ShinkaEvolve's cascaded evaluation
P3	Population Diversity $\geq$ Best Individual Score	Premature convergence to local optima is the primary failure mode. MAP-Elites grids and novelty filtering must be maintained even at the cost of individual fitness.	AlphaEvolve's MAP-Elites; FunSearch's island isolation
P4	Meta-Level Evolution is the Multiplier	Systems that evolve how they evolve outperform those that do not. Prompt co-evolution, learning logs, and adaptive scheduling compound over time.	ShinkaEvolve's prompt archive; DGM's self-improvement; SkyDiscover's three-level adaptation
P5	Safety First, Capability Second	Self-modification must be strictly scoped. The evaluation harness, fitness functions, and safety checks must be immutable.	DGM's safety discussion; all systems' sandbox requirements
P6	Cost-Aware Architecture Throughout	Cost control is a first-class architectural concern, not a feature. Every component must track its cost contribution; bandit selection should use cheaper models when they suffice.	ShinkaEvolve's committed budget guard; OpenEvolve's per-iteration budget
P7	Generalization Over Task Specialization	The system should work on any problem expressible as mutable code + evaluation function. Domain knowledge belongs in prompts and evaluators, not in the engine.	LLM4AD's 7-method platform; GEPA's declarative API

Principles P1 and P6 are closely related but distinct: P1 concerns algorithmic sample efficiency (reducing the number of evaluations needed), while P6 concerns economic efficiency (reducing the dollar cost per evaluation). A system can be sample-efficient but cost-inefficient if it uses expensive models for every mutation, or cost-efficient but sample-inefficient if cheap models produce many wasted evaluations. The optimal design balances both through hierarchical model selection.

65.3 Unified Architecture Blueprint

The proposed OmniEvolve architecture integrates all six gap-filling innovations into a coherent system. The architecture is organized around five major subsystems: Population Management (three-layer hierarchical structure), Mutation Engine (six modes with bandit scheduling), Evaluation Pipeline (cascade with ASI diagnostics), Knowledge Layer (learning logs, prompt archive, skill store, meta-recommendations), and Safety & Cost Layer (committed budget guard, sandbox, snapshot-rollback). A Search Strategy Router dynamically switches between evolutionary and tree-search modes, and an LLM Orchestrator manages hierarchical model selection across all components.

65.3.1 Data Flow: One Evolutionary Iteration

A single evolutionary step in the proposed architecture traverses nine stages, each contributing to the system's sample efficiency and knowledge accumulation. The key insight is that every stage feeds information back into the knowledge layer, creating compound returns over time.

Stage 1 (Sample Parents): The population manager applies composite selection weights combining power-law rank fitness, sigmoid-based novelty penalty, and exponential recency weighting.

Stage 2 (Pre-Mutation Novelty Check): A two-tier novelty pipeline (cheap embedding similarity followed by LLM-as-judge) rejects mutations that are too similar to existing programs before any expensive evaluation occurs.

Stage 3 (Context Assembly): The mutation prompt is constructed by injecting parent code, best solutions, learning log entries retrieved by embedding similarity, ASI diagnostics from prior evaluations, and the currently active co-evolved prompt.

Stage 4 (LLM Mutation): A bandit-selected model generates the mutation using one of six modes chosen by a per-island UCB1 bandit.

Stage 5 (Post-Generation Verification): Quick syntax check, type check, and mini-evaluation reject obviously broken mutations cheaply.

Stage 6 (Cascade Evaluation): A multi-stage evaluation pipeline with Bayesian early stopping runs progressively more expensive test suites.

Stage 7 (Population Update): The new program is placed in the MAP-Elites grid, island archive, and Pareto front as appropriate.

Stage 8 (Knowledge Update): The learning log, prompt evolver, model bandit, and meta-recommendation store all receive update signals.

Stage 9 (Search Router Check): If an island is stagnant, the system may spawn an AB-MCTS tree search for deep refinement and reinject results back to the population.

65.4 Component Synthesis: Best Practices from the Field

65.4.1 Three-Layer Population Structure

The proposed population management unifies three complementary patterns observed in separate systems. Layer 1 is a MAP-Elites quality-diversity grid (from AlphaEvolve/OpenEvolve), where each cell holds the best program achieving a specific combination of behavioral features. Layer 2 is a set of specialized islands (from ShinkaEvolve) with different mutation configurations—exploitation, exploration, diversity, reflection, and crossover islands. Layer 3 is a global non-culled archive (from DGM) indexed by embedding vectors for long-term memory and cross-island crossover retrieval. Periodic migration across layers prevents isolation, while island specialization prevents homogenization.

The island specialization scheme assigns different mutation ratios and selection pressures to each island type. Exploitation islands use 80% diff patches with strong power-law selection ($\alpha = 2.0$), while exploration islands use 50% full rewrites with weak selection ($\alpha = 0.5$). Reflection islands dedicate 90% of their capacity to ASI-driven reflection mutation, creating a targeted repair mechanism. When an island shows no improvement for a configurable stagnation threshold, it is archived and replaced with a freshly seeded island drawn from diverse global archive samples.
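The island specialization scheme can be written down as a small configuration table. The sketch below is illustrative only: the `IslandConfig` type, its field names, the `other` catch-all share, the placeholder exponent for reflection islands, and the stagnation limit of 50 generations are all assumptions. Only the exploitation, exploration, and reflection percentages and the two $\alpha$ values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IslandConfig:
    """Hypothetical per-island settings; field names are illustrative."""
    name: str
    alpha: float                 # power-law selection exponent
    mode_shares: dict            # mutation mode -> fraction of mutations
    stagnation_limit: int = 50   # assumed default, not given in the text

    def __post_init__(self):
        # Mutation shares must form a complete distribution over modes.
        total = sum(self.mode_shares.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"{self.name}: mode shares sum to {total}, expected 1.0")


ISLANDS = [
    IslandConfig("exploitation", alpha=2.0, mode_shares={"diff": 0.8, "other": 0.2}),
    IslandConfig("exploration", alpha=0.5, mode_shares={"full_rewrite": 0.5, "other": 0.5}),
    # The text gives no alpha for reflection islands; 1.0 is a placeholder.
    IslandConfig("reflection", alpha=1.0, mode_shares={"reflection": 0.9, "other": 0.1}),
]


def is_stagnant(best_scores: list, limit: int) -> bool:
    """True if the best score has not improved over the last `limit` generations."""
    if len(best_scores) <= limit:
        return False
    return max(best_scores[-limit:]) <= max(best_scores[:-limit])
```

Keeping the shares in a validated structure makes it harder for a self-improvement step to produce an island whose mode ratios no longer sum to one.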

65.4.2 Composite Parent Selection

Parent selection combines three mechanisms into a multiplicative composite weight. For each program $p$ in the population, the selection probability is proportional to:

$$w(p) = w_{\text{fitness}}(p) \cdot w_{\text{novelty}}(p) \cdot w_{\text{recency}}(p)$$

The fitness weight uses power-law rank selection:

$$w_{\text{fitness}}(p) = \text{rank}(p)^{-\alpha}, \quad \alpha \in [0.5, 3.0]$$

where $\text{rank}(p) = 1$ for the best program, and $\alpha$ controls the exploitation-exploration trade-off (higher $\alpha$ concentrates probability on top-ranked programs). This follows ShinkaEvolve's approach (Chapter 6).

The novelty weight penalizes over-sampled parents using a sigmoid function, inspired by Darwinian Evolver (Chapter 9):

$$w_{\text{novelty}}(p) = \sigma\!\left(-\beta \cdot (\text{select\_count}(p) - \mu_{\text{select}})\right)$$

where $\sigma$ is the standard sigmoid, $\beta$ controls penalty sharpness (default: 0.5), and $\mu_{\text{select}}$ is the mean selection count across the population. Programs selected too frequently receive a multiplicative down-weight, preventing any individual from dominating as a parent.

The recency weight applies exponential time decay: $w_{\text{recency}}(p) = \exp(-\lambda \cdot \text{age}(p))$ with $\lambda = 0.01$, slightly favoring recently discovered programs without discarding historical knowledge.

# Pseudocode — illustrative composite selection, not from a public implementation
import numpy as np

def composite_selection_weights(
    programs: list,         # list of candidate programs
    alpha: float = 2.0,     # power-law exponent (island-specific)
    beta: float = 0.5,      # novelty penalty sharpness
    lam: float = 0.01       # recency decay rate
) -> np.ndarray:
    """Compute composite selection probabilities for parent sampling."""
    n = len(programs)
    scores = np.array([p.fitness for p in programs])

    # Fitness weight: power-law over rank (rank 1 = best, rank n = worst)
    ranks = n - np.argsort(np.argsort(scores))
    w_fit = ranks.astype(float) ** (-alpha)

    # Novelty weight: sigmoid anti-oversampling penalty
    select_counts = np.array([p.select_count for p in programs])
    mu_select = select_counts.mean()
    w_nov = 1.0 / (1.0 + np.exp(beta * (select_counts - mu_select)))

    # Recency weight: exponential time decay
    ages = np.array([p.generation_age for p in programs])
    w_rec = np.exp(-lam * ages)

    # Multiplicative composite, normalized to probability distribution
    w = w_fit * w_nov * w_rec
    return w / w.sum()

65.4.3 Adaptive Mutation Scheduling

Rather than fixing the ratio of mutation modes, the architecture proposes learning the optimal ratio per island per problem using a multi-armed bandit. Each island maintains a MutationBandit that tracks the empirical success of each mode using the UCB1 formula:

$$\text{UCB}_i(t) = \hat{\mu}_i + c \cdot \sqrt{\frac{\ln t}{n_i}}$$

where $\hat{\mu}_i$ is the empirical mean improvement (measured as delta-score clipped to non-negative) for mutation mode $i$, $n_i$ is the number of times mode $i$ has been selected, $t$ is the total number of mutations across all modes, and $c$ is the exploration constant (recommended default: 0.3). The bandit selects the mode with the highest UCB score at each step, ensuring that under-explored modes receive a bonus while high-performing modes are exploited. Note that this is a heuristic application of UCB1: the stationarity assumption of classical multi-armed bandits does not strictly hold as the search landscape changes, but empirical evidence from ShinkaEvolve and SkyDiscover suggests that UCB-based scheduling performs well in practice.
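A minimal sketch of the per-island mode bandit follows. The class name `MutationBandit` comes from the text; the method names, the incremental-mean update, and the rule of playing every mode once before applying the formula (so that $n_i > 0$) are assumptions added for illustration.

```python
import math


class MutationBandit:
    """UCB1 over mutation modes; the reward is the delta-score clipped at zero."""

    def __init__(self, modes, c=0.3):
        self.c = c
        self.counts = {m: 0 for m in modes}
        self.means = {m: 0.0 for m in modes}

    def select(self):
        # Play every mode once first so that n_i > 0 in the UCB formula.
        for mode, n in self.counts.items():
            if n == 0:
                return mode
        t = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda m: self.means[m]
            + self.c * math.sqrt(math.log(t) / self.counts[m]),
        )

    def update(self, mode, delta_score):
        reward = max(0.0, delta_score)  # clip to non-negative, as in the text
        self.counts[mode] += 1
        # Incremental update of the empirical mean reward for this mode.
        self.means[mode] += (reward - self.means[mode]) / self.counts[mode]
```

Because rewards are raw delta-scores rather than values in $[0, 1]$, the exploration constant $c$ has to be tuned to the scale of the fitness function; the default of 0.3 is only meaningful relative to that scale.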

65.4.4 Three-Stage Novelty Pipeline

The novelty filtering pipeline, synthesized from ShinkaEvolve's two-tier system and AlphaEvolve's behavioral descriptors, uses three stages of increasing cost to reject redundant mutations before expensive evaluation:

Stage 1 (Embedding Similarity, ~$0.001): Compute the embedding of the generated program and compare it against the $k$-nearest neighbors in the archive (using cosine similarity). If the maximum similarity exceeds a threshold (default: 0.92), reject the program as too similar. This is the cheapest filter and catches syntactic near-duplicates. Stage 2 (LLM Novelty Judge, ~$0.005): A fast LLM (Tier 1 model) is shown the new program alongside its $k$ most similar archived neighbors and asked a binary question: "Is this mutation algorithmically novel?" This catches semantic duplicates that differ syntactically. Stage 3 (Behavioral Novelty, during evaluation): Post-evaluation, the behavioral descriptor vector $B(p) = (f_1(p), f_2(p))$ is computed, where $f_i$ are normalized behavioral features (e.g., code complexity, test-case pass pattern). The MAP-Elites grid is indexed by discretized $B(p)$, and a program replaces a cell occupant only if its fitness exceeds the incumbent.
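Stages 1 and 3 are simple enough to sketch directly; Stage 2 needs an LLM call and is omitted. In the sketch below the function names, the default `k = 5`, the grid resolution of 10 bins per feature, and the dictionary-based grid are assumptions; the 0.92 threshold and the replace-only-if-fitter rule come from the text.

```python
import numpy as np


def embedding_filter(candidate, archive, k=5, threshold=0.92):
    """Stage 1: return (accepted, indices of the k most similar archive entries).

    `archive` is an (n, d) array of embeddings, `candidate` a length-d vector.
    """
    if archive.shape[0] == 0:
        return True, np.array([], dtype=int)
    a = archive / np.linalg.norm(archive, axis=1, keepdims=True)
    c = candidate / np.linalg.norm(candidate)
    sims = a @ c                            # cosine similarity to every entry
    nearest = np.argsort(sims)[::-1][:k]    # most similar first
    accepted = bool(sims[nearest[0]] <= threshold)
    return accepted, nearest                # neighbours can be passed to Stage 2


def insert_into_grid(grid, descriptor, fitness, program, bins=10):
    """Stage 3: index the cell by discretized B(p); replace only if fitter.

    `descriptor` holds behavioral features already normalized to [0, 1].
    """
    cell = tuple(min(int(f * bins), bins - 1) for f in descriptor)
    incumbent = grid.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        grid[cell] = (fitness, program)
        return True
    return False
```

Returning the neighbour indices from Stage 1 avoids a second nearest-neighbour search when the judge in Stage 2 needs the same programs.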

65.4.5 Hierarchical LLM Orchestration

The LLM orchestration subsystem uses a hierarchical bandit that first selects a cost tier, then selects a model within that tier. This prevents expensive models from starving cheap ones during the exploration phase. The proposed tier structure, reflecting the 2024–2026 LLM landscape, organizes models into four tiers: Fast (60–80% of calls, $0.01–$0.05 per call), Balanced (15–30%, $0.10–$0.40), Power (5–10%, $1.00–$5.00), and Local (fallback, $0.00 marginal cost). The reward signal for both tier and model bandits is efficiency-aware: quality divided by cost, ensuring that the system automatically gravitates toward the best quality-per-dollar option.

For tree-search mode (AB-MCTS), the architecture proposes Thompson Sampling for model selection, following AB-MCTS's demonstrated approach:

$$\theta_i \sim \text{Beta}(\alpha_i + \text{successes}_i, \, \beta_i + \text{failures}_i)$$

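Here $\alpha_i$ and $\beta_i$ are the prior pseudo-counts of the Beta distribution for model $i$ (distinct from the selection parameters $\alpha$ and $\beta$ of Section 65.4.2). At each selection step, $\theta_i$ is sampled from each model's posterior and the model with the highest sample is chosen. This provides automatic exploration-exploitation balance without requiring explicit tuning of an exploration constant $c$.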
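The Thompson Sampling rule can be sketched in a few lines. The class and method names below are hypothetical, and the uniform prior (both pseudo-counts equal to 1) is an assumed default rather than a value taken from AB-MCTS.

```python
import numpy as np


class ThompsonModelSelector:
    """Beta-Bernoulli Thompson Sampling over candidate models."""

    def __init__(self, models, prior_alpha=1.0, prior_beta=1.0, seed=None):
        self.models = list(models)
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        self.successes = {m: 0 for m in self.models}
        self.failures = {m: 0 for m in self.models}
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Draw one sample from each model's posterior and pick the largest.
        samples = {
            m: self.rng.beta(
                self.prior_alpha + self.successes[m],
                self.prior_beta + self.failures[m],
            )
            for m in self.models
        }
        return max(samples, key=samples.get)

    def update(self, model, improved):
        # A "success" is a mutation that improved on its parent.
        if improved:
            self.successes[model] += 1
        else:
            self.failures[model] += 1
```

A model with few trials has a wide posterior and is still sampled occasionally, which is where the exploration comes from.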

65.4.6 Diagnostic Feedback (ASI) as First-Class Contract

Drawing on GEPA's Actionable Side Information, the architecture mandates that every evaluation function return structured diagnostic data alongside the numeric score. The proposed schema includes: specific failed test cases (input, expected output, actual output, error type), performance profiling (bottleneck function, runtime breakdown, memory usage), complexity estimates, and a suggested fix approach. This data feeds into four downstream consumers simultaneously: the reflection-driven mutation mode (Mode D), the learning log, the prompt co-evolver, and the meta-recommendation synthesizer.

# Pseudocode — illustrative ASI schema for evaluation returns
from dataclasses import dataclass, field
from typing import Any

@dataclass
class FailedTest:
    input_data: Any
    expected: Any
    actual: Any
    error_type: str       # "wrong_answer", "timeout", "exception"
    traceback: str | None = None

@dataclass
class ActionableSideInfo:
    """Structured diagnostic data returned from every evaluation."""
    score: float                                  # Primary fitness metric
    failed_tests: list[FailedTest] = field(default_factory=list)
    error_type: str | None = None                 # "timeout", "wrong_answer", "crash"
    error_message: str | None = None
    complexity_estimate: str | None = None        # "O(n^2)", "O(n log n)", etc.
    memory_usage_mb: float | None = None
    bottleneck_function: str | None = None
    suggested_fix_approach: str | None = None     # e.g., "memoize recursive calls"
    custom_metrics: dict[str, Any] = field(default_factory=dict)

The key innovation proposed in this architecture—addressing Gap 3—is that ASI diagnostics drive not only mutation content but also prompt evolution. When a prompt consistently produces mutations that fail with a specific ASI error pattern (e.g., "recursion overflow"), the prompt itself is mutated to specifically address that failure mode. This creates a closed feedback loop: bad mutations produce diagnostic data, which improves the prompt that generated them.

65.4.7 Cross-Population Knowledge Sharing

The knowledge layer addresses Gap 2 by combining learning logs (from Darwinian Evolver), skill stores (from GEPA Skills), and meta-recommendations into a shared cross-population memory. Every mutation—successful or not—generates a log entry recording the mutation mode, parent IDs, ASI diagnostics, outcome, cost, and auto-generated pattern tags. The learning log grows without bound, so a hierarchical compression strategy is proposed: raw entries are compressed every 100 entries into LLM-generated summaries (~100 tokens each), summaries are clustered every 10 into insight digests (~50 tokens), and at query time the top-$K$ relevant entries are retrieved by embedding similarity to the current mutation context.
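The compression schedule implies a bounded prompt-side footprint even as the raw log grows. The helper below is a back-of-the-envelope sketch: the function name and the returned keys are hypothetical, and it counts only completed groups of 100 entries and 10 summaries.

```python
def compression_footprint(
    n_entries,
    entries_per_summary=100,
    summaries_per_digest=10,
    summary_tokens=100,
    digest_tokens=50,
):
    """Count summaries and digests produced by the two-level compression."""
    summaries = n_entries // entries_per_summary
    digests = summaries // summaries_per_digest
    return {
        "summaries": summaries,
        "digests": digests,
        "summary_tokens": summaries * summary_tokens,
        "digest_tokens": digests * digest_tokens,
    }
```

For a log of 10,000 raw entries this gives 100 summaries (about 10,000 tokens) and 10 insight digests (about 500 tokens), which is why retrieval at query time can operate on digests first and fall back to summaries or raw entries only when needed.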

Every $N$ generations (proposed default: 100), an LLM synthesizes the learning log into high-level strategic meta-recommendations that become permanent entries injected into all future mutation prompts. This mechanism allows the system to discover and propagate patterns like "memoization consistently improves recursive solutions" or "switching from greedy to dynamic programming yields 15%+ improvements on problems with overlapping subproblems."

65.5 Formal Mathematical Framework

65.5.1 Problem Formulation

Let $\mathcal{P}$ denote the space of programs (text strings over a programming language alphabet $\Sigma$). Let $f: \mathcal{P} \to \mathbb{R}$ be the fitness function returning a scalar score. The single-objective optimization target is:

$$p^* = \arg\max_{p \in \mathcal{P}} f(p)$$

In the multi-objective setting with $k$ objectives $f_1, \ldots, f_k$, the goal is to find the Pareto-optimal set:

$$\mathcal{P}^* = \{p \in \mathcal{P} \mid \nexists\, p' \in \mathcal{P} : f_i(p') \geq f_i(p) \;\forall i \;\wedge\; f_j(p') > f_j(p) \text{ for some } j\}$$

This is the set of programs for which no other program is strictly better on at least one objective without being worse on any other.
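In practice the Pareto set is computed over the finite set of programs evaluated so far rather than over all of $\mathcal{P}$. A minimal sketch, assuming each program is represented by a tuple of objective values to be maximized (the function names are illustrative):

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated subset of `points` (quadratic-time scan)."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]
```

For the four score vectors `(1, 5)`, `(3, 3)`, `(2, 2)`, and `(5, 1)`, only `(2, 2)` is removed, because `(3, 3)` is at least as good on both objectives and strictly better on each.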

65.5.2 System State

The full state of the proposed system at generation $t$ is a four-tuple:

$$S_t = (\Pi_t, \, L_t, \, \Phi_t, \, \Theta_t)$$

where $\Pi_t$ is the population state (MAP-Elites grid $+$ island populations $+$ global archive), $L_t$ is the learning log (complete history of all mutations and their outcomes), $\Phi_t$ is the prompt archive (population of mutation prompts per mode, with fitness tracking), and $\Theta_t$ is the bandit state (model selection posteriors $+$ mutation mode weights $+$ tier preferences). A single evolutionary step is a stochastic transition:

$$S_{t+1} = \text{Update}(S_t, \, p_{\text{new}}, \, r_t, \, d_t)$$

where $p_{\text{new}}$ is the newly generated program, $r_t = f(p_{\text{new}})$ is the evaluation reward, and $d_t$ is the ASI diagnostic output. The $\text{Update}$ function applies changes to all four state components simultaneously.

65.5.3 Sample Efficiency Analysis

Let $\mathbb{E}[T^*]$ be the expected number of full-cost evaluations needed to reach within $\epsilon$ of the global optimum $f^*$. Relative to naive random search, the architecture is expected to reduce this quantity by combining two mechanisms, each of which removes a fraction of the evaluation cost:

$$\mathbb{E}[T^*]_{\text{unified}} \leq \mathbb{E}[T^*]_{\text{random}} \cdot (1 - p_{\text{reject}}) \cdot (1 - p_{\text{early\_stop}})$$

where $p_{\text{reject}} \approx 0.6\text{--}0.8$ is the fraction of mutations rejected by novelty filtering and cascade stages before full evaluation, and $p_{\text{early\_stop}} \approx 0.3\text{--}0.5$ is the fraction of evaluation cost saved when SPRT terminates an evaluation early because the outcome is statistically clear. These empirical ranges are drawn from ShinkaEvolve's reported analysis. The combined factor therefore lies between $0.2 \times 0.5 = 0.10$ and $0.4 \times 0.7 = 0.28$, an estimated reduction of roughly 4–10$\times$ in evaluation cost compared to systems lacking both mechanisms. Note that this is a heuristic estimate, not a formal bound: it assumes the rejection mechanisms do not discard beneficial mutations at a significant rate.

65.5.4 Learning Log Information Value

The value of the learning log $L_t$ to future mutation proposals can be formalized as a mutual information quantity:

$$V(L_t) = I(\Delta r_{t+1};\; L_t \mid p_{\text{parent}}, \, m_t)$$

where $I(\cdot;\cdot)$ denotes mutual information, $\Delta r_{t+1}$ is the reward delta of the next mutation, $p_{\text{parent}}$ is the parent program, and $m_t$ is the mutation mode. High $V(L_t)$ means the log is highly predictive of mutation outcomes conditioned on the parent and mode—a signal that the accumulated knowledge is providing genuine value. This formalization suggests a practical diagnostic: if $V(L_t)$ plateaus, the log compression strategy may need updating, or the meta-recommendation synthesis frequency should increase. While direct estimation of $V(L_t)$ is expensive, proxy metrics such as the correlation between log-predicted and actual delta-scores can serve as practical indicators.
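The correlation proxy mentioned above could be tracked as follows. This is a sketch under a strong assumption: that some predictor (for example, an LLM asked to forecast the delta-score from retrieved log entries) supplies `predicted_deltas`. Neither that predictor nor the function name is part of the surveyed systems.

```python
import numpy as np


def log_predictiveness(predicted_deltas, actual_deltas):
    """Pearson correlation between log-predicted and observed delta-scores.

    Returns 0.0 when the correlation is undefined (fewer than two samples
    or a constant series).
    """
    predicted = np.asarray(predicted_deltas, dtype=float)
    actual = np.asarray(actual_deltas, dtype=float)
    if predicted.size < 2 or predicted.std() == 0 or actual.std() == 0:
        return 0.0
    return float(np.corrcoef(predicted, actual)[0, 1])
```

A value that stops rising across successive windows of mutations would be the practical counterpart of a plateau in $V(L_t)$.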

65.5.5 Cascade Early Stopping

For stochastic evaluations (averaged over $N$ random seeds), the proposed architecture uses a sequential probability ratio test (SPRT) to stop evaluation early when the outcome is statistically clear:

$$\Lambda_t = \sum_{i=1}^{t} \log \frac{p(x_i \mid H_1)}{p(x_i \mid H_0)}$$

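where $H_0$ is the hypothesis that the new program is not better than the current best, $H_1$ is the hypothesis that it is better, and $x_i$ is the score on the $i$-th test seed. With target Type I and Type II error rates $\alpha_{\text{err}}$ and $\beta_{\text{err}}$ (subscripted to distinguish them from the selection parameters of Section 65.4.2), evaluation stops and accepts $H_1$ once $\Lambda_t \geq \log\frac{1 - \beta_{\text{err}}}{\alpha_{\text{err}}}$, and stops and accepts $H_0$ once $\Lambda_t \leq \log\frac{\beta_{\text{err}}}{1 - \alpha_{\text{err}}}$. For $\alpha_{\text{err}} = \beta_{\text{err}} = 0.05$ the two boundaries are symmetric at $\pm\log 19 \approx \pm 2.94$, so the rule reduces to stopping when $|\Lambda_t|$ exceeds that value. This yields approximately 30–50% reduction in per-evaluation cost for stochastic problems, based on ShinkaEvolve's confidence-interval-based early stopping analysis.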
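To make the stopping rule concrete, the sketch below instantiates the test for one specific likelihood model. It assumes Gaussian seed scores with known standard deviation `sigma`, and it replaces the composite hypotheses with two simple ones: mean `mu0` (the incumbent's mean score) under $H_0$ and mean `mu1 > mu0` under $H_1$. These modelling choices and the function name are assumptions, not part of the surveyed systems.

```python
import math


def sprt_decision(scores, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald SPRT on a stream of per-seed scores.

    Returns (decision, n_used), where decision is "accept_h1", "accept_h0",
    or "undecided" if the scores run out before a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0
    for n, x in enumerate(scores, start=1):
        # Gaussian log-likelihood ratio log p(x | H1) / p(x | H0)
        llr += (mu1 - mu0) * (2 * x - mu0 - mu1) / (2 * sigma ** 2)
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "undecided", len(scores)
```

With `mu0 = 0`, `mu1 = 1`, and `sigma = 1`, each score of 2.0 adds 1.5 to the log-likelihood ratio, so the test accepts $H_1$ after two seeds instead of running the full set.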

65.6 Hybrid Search: Evolutionary Loop + Tree Search

One of the most distinctive proposals in this architecture is the Search Strategy Router, which dynamically switches between population-based evolutionary search and AB-MCTS tree search based on the current optimization landscape. The motivation comes directly from Gap 5: population search excels at broad exploration (many local optima, wide search space), while tree search excels at deep refinement of individual solutions.

The router operates on a simple heuristic: when an island has been stagnant for a configurable number of generations (no fitness improvement above a threshold), it triggers AB-MCTS tree search rooted at the island's best program. The tree search uses Thompson Sampling to decide between deepening (refining the current node via diff patches) and branching (generating new siblings via full rewrites):

$$P(\text{branch}) = P(\theta_{\text{branch}} > \theta_{\text{deepen}}) \quad\text{where}\quad \theta_i \sim \text{Beta}(\alpha_i, \beta_i)$$

where $\alpha_i$ and $\beta_i$ are updated based on whether each action produces an improvement. When tree search finds an improvement, the result is reinjected into the population, potentially revitalizing the stagnant island.

# Pseudocode — adaptive branching MCTS for deep refinement
# Illustrative algorithm logic, not from a public implementation.
# SearchNode, ThompsonBandit, _evaluate, _backpropagate, and _best_leaf
# are assumed helpers that are not shown here.

class AdaptiveBranchMCTS:
    """AB-MCTS tree search triggered by island stagnation."""

    def __init__(self, root_program, llm_orchestrator):
        self.root = SearchNode(program=root_program, score=root_program.score)
        self.branch_bandit = ThompsonBandit(arms=["branch", "deepen"])
        self.llm = llm_orchestrator

    def search(self, n_steps: int):
        """Run n_steps of adaptive tree search."""
        for _ in range(n_steps):
            # Select most promising leaf via UCT
            node = self._select_leaf(self.root)

            # Thompson Sampling: deepen (diff) or branch (rewrite)?
            action = self.branch_bandit.sample()
            mode = "diff" if action == "deepen" else "full"
            child_code = self.llm.mutate(node.program, mode=mode)

            # Evaluate the child and attach it so that the tree actually grows
            score = self._evaluate(child_code)
            child = SearchNode(program=child_code, score=score, parent=node)
            node.children.append(child)

            # Reward the chosen action if the child improved on its parent
            delta = score - node.score
            self.branch_bandit.update(action, success=(delta > 0))
            self._backpropagate(child, delta)

        return self._best_leaf().program

    def _select_leaf(self, node):
        """UCT-based tree traversal to most promising leaf."""
        while node.children:
            node = max(node.children, key=lambda c: c.uct_score())
        return node

65.7 Controlled Self-Improvement

The Darwin Gödel Machine (Chapter 10) demonstrated the potential of self-modifying evolutionary systems, but also revealed significant safety concerns. The proposed architecture addresses Gap 4 by defining a strict mutable/immutable boundary. The mutable zone includes: system prompts for each mutation mode, parent selection weights, mutation mode ratios per island, novelty filter thresholds, LLM model selection preferences, and learning log query strategies. The immutable zone includes: the evaluation function and fitness oracle, the sandbox execution harness, cost budget guards, safety checks, the population database interface, the snapshot-rollback mechanism, and the self-improvement module itself.

All self-improvement proposals require statistical validation via A/B testing. The proposed protocol takes a snapshot of the current configuration, runs $n$ mutations (proposed default: 50) with both the proposed and current configuration, and applies a two-sample $t$-test at significance level $p < 0.05$. The new configuration is adopted only if it shows a statistically significant positive improvement; otherwise the system rolls back to the snapshot. This bounds the probability of adopting a change with no real benefit at the chosen significance level (a statistical safeguard rather than an absolute guarantee against regression) while allowing the system to improve its own search strategies over time.

# Pseudocode — self-improvement with snapshot-rollback and A/B validation

import numpy as np
from scipy import stats

def evaluate_self_improvement(
    current_config: dict,
    proposed_config: dict,
    component: str,
    n_trials: int = 50,
    significance: float = 0.05
) -> bool:
    """A/B test a proposed configuration change with rollback safety.

    take_snapshot, run_mutations, apply_config, log_improvement, and
    rollback_to_snapshot are assumed helpers supplied by the host system.
    """
    snapshot_id = take_snapshot(component, current_config)

    # Run n_trials mutations with each configuration; run_mutations is
    # assumed to return one score delta per trial
    results_proposed = np.asarray(run_mutations(proposed_config, n_trials), dtype=float)
    results_current = np.asarray(run_mutations(current_config, n_trials), dtype=float)

    # Two-sample t-test (two-sided); the sign check below enforces direction
    t_stat, p_value = stats.ttest_ind(results_proposed, results_current)
    improvement = results_proposed.mean() - results_current.mean()

    if p_value < significance and improvement > 0:
        apply_config(proposed_config, component)
        log_improvement(component, improvement, snapshot_id)
        return True  # Improvement accepted
    else:
        rollback_to_snapshot(snapshot_id)
        return False  # No significant improvement; rollback

65.8 Cost Control as Architectural Concern

Following Principle P6, cost control permeates every layer of the architecture rather than being isolated in a single module. The most important mechanism is the committed cost budget guard, which tracks both realized cost (completed API calls) and in-flight cost (pending calls whose responses have not yet returned):

$$C_{\text{committed}}(t) = C_{\text{realized}}(t) + C_{\text{in-flight}}(t) \leq B_{\text{total}}$$

where $C_{\text{realized}}(t)$ is the cumulative actual cost of settled API calls, $C_{\text{in-flight}}(t)$ is the sum of estimated costs for currently pending calls, and $B_{\text{total}}$ is the total budget. Every API call must first reserve its estimated cost; the call is rejected if committed cost would exceed the budget. This prevents overspending due to concurrent asynchronous calls—a subtle failure mode observed in systems that only track realized cost.
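The reserve-then-settle discipline described above can be sketched as a small class. This is an illustrative sketch, not the implementation of any surveyed system; the names `CommittedBudgetGuard`, `reserve`, and `settle` are assumptions introduced here.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation would push committed cost over budget."""


class CommittedBudgetGuard:
    """Tracks realized plus in-flight cost against a fixed total budget."""

    def __init__(self, total_budget: float) -> None:
        self.total_budget = total_budget
        self.realized = 0.0   # cost of settled calls
        self.in_flight = 0.0  # estimated cost of pending calls
        self._lock = threading.Lock()

    @property
    def committed(self) -> float:
        return self.realized + self.in_flight

    def reserve(self, estimated_cost: float) -> float:
        """Reserve the estimate before issuing a call; reject if over budget."""
        with self._lock:
            if self.realized + self.in_flight + estimated_cost > self.total_budget:
                raise BudgetExceeded(
                    f"reserving {estimated_cost} would exceed {self.total_budget}"
                )
            self.in_flight += estimated_cost
        return estimated_cost

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Move a finished call from in-flight to realized at its actual cost."""
        with self._lock:
            self.in_flight -= estimated_cost
            self.realized += actual_cost
```

A caller would invoke `reserve` before each API call and `settle` once the response arrives. The inequality only holds strictly if estimates are upper bounds on actual cost; if a call can cost more than its estimate, the realized total can overshoot by the difference.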

The cascade evaluation pipeline (Section 65.4, Stage 6) provides further cost reduction through progressive filtering. Stage 0 (syntax check) costs approximately $0.00 and rejects 5–15% of candidates. Stage 1 (5% of test suite) costs approximately $0.01 and rejects 20–40%. Stage 2 (30% of test suite) costs approximately $0.05 and rejects 15–25%. Only programs surviving all stages reach the full evaluation with ASI diagnostic generation. These rejection rates are estimates based on the source analysis and would vary by problem domain.
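To see how the stage costs and rejection rates above combine, the following sketch computes the expected filtering cost per candidate and the fraction that reaches full evaluation. It assumes each rejection rate applies to the candidates entering that stage, which the text above does not specify, and it uses the midpoints of the quoted ranges. The `Stage` type and `cascade_summary` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    cost: float         # dollars to run this stage on one candidate
    reject_rate: float  # assumed fraction of candidates entering the stage that fail


def cascade_summary(stages: list[Stage]) -> tuple[float, float]:
    """Return (expected filtering cost per candidate, fraction reaching full evaluation)."""
    reaching = 1.0       # fraction of candidates that reach the current stage
    expected_cost = 0.0
    for stage in stages:
        expected_cost += reaching * stage.cost  # only survivors pay for this stage
        reaching *= 1.0 - stage.reject_rate
    return expected_cost, reaching


# Midpoints of the rejection ranges quoted above (5-15%, 20-40%, 15-25%)
STAGES = [
    Stage("syntax check", 0.00, 0.10),
    Stage("5% of test suite", 0.01, 0.30),
    Stage("30% of test suite", 0.05, 0.20),
]

cost, survivors = cascade_summary(STAGES)
print(f"expected filtering cost: ${cost:.4f}, reaching full evaluation: {survivors:.1%}")
```

Under these assumptions about half of all candidates reach full evaluation, at an expected filtering cost of roughly $0.04 each. If the quoted rates are instead fractions of all generated candidates, the rejections simply add and the surviving fraction is smaller.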

Table 65.3 — Cost control mechanisms and their source systems. Estimated savings are projections from individual system analyses, not from a combined deployment.
Mechanism	Source System	Estimated Savings	Layer
Committed budget guard	ShinkaEvolve	Prevents overspend from concurrency	Safety
Cascade evaluation	AlphaEvolve, OpenEvolve	60–80% of evaluations stopped early	Evaluation
Embedding novelty pre-filter	ShinkaEvolve	~$0.001 per rejected duplicate	Novelty
Hierarchical model bandit	ShinkaEvolve, AB-MCTS	Automatically favors cheap models when sufficient	LLM Orchestrator
SPRT early stopping	ShinkaEvolve	30–50% reduction for stochastic evaluations	Evaluation
Skill transfer to cheap models	GEPA Skills	Cheap models achieve ~85% of expensive quality	Knowledge

65.9 Projected Performance and Validation Strategy

The source analysis provides performance projections for the unified architecture compared to individual systems. These are analytical projections, not empirical measurements—they represent expected gains based on the documented contributions of individual components. They should be treated as hypotheses to be validated, not as established results.

Table 65.4 — Projected performance of unified architecture vs. individual systems. All projections are analytical estimates from the source gap analysis, not empirical measurements.
Metric	ShinkaEvolve	OpenEvolve	GEPA	Unified (Projected)	Projected Gain Source
Competitive programming percentile	~60th	~55th	~65th	~75–80th	Learning logs + ASI + AB-MCTS tree search
Cost per improvement unit	$0.05–0.20	$0.10–0.40	$0.05–0.25	$0.02–0.10	Cascade + novelty filter + bandit models
Evaluations to first valid solution	~100	~200	~80	~40–60	2-tier novelty + cascade + SPRT
Long-run improvement (gen 500+)	Plateau risk	Plateau risk	Plateau risk	Continued	Prompt co-evolution + self-improvement
Cross-problem transfer	Limited	None	Skills only	Skill store + logs	Skill store + cross-model transfer

The proposed validation strategy has four parts: (1) competitive programming on ALE-Bench's 40 problems, measuring percentile rank, cost, and evaluation count against published ShinkaEvolve and GEPA results; (2) mathematical discovery, replicating AlphaEvolve's cap-set and matrix multiplication problems to check that the population-diversity mechanisms find solutions of comparable quality; (3) an ablation study that disables one component at a time (novelty filter, ASI, learning logs, prompt evolution, bandit selection) to quantify each contribution; and (4) a long-run improvement test that runs 2000+ generations and tracks the score-vs-generation curve to check whether prompt co-evolution and self-improvement prevent a plateau.
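The leave-one-out ablation in part (3) can be set up mechanically. The sketch below generates one configuration per disabled component and reports each component's contribution as the score drop relative to the full system. The component flags and function names are hypothetical, and the scores in any real run would come from the benchmarks above.

```python
# Hypothetical feature flags for the five ablated components.
COMPONENTS = (
    "novelty_filter",
    "asi",
    "learning_logs",
    "prompt_evolution",
    "bandit_selection",
)


def ablation_configs(components=COMPONENTS) -> dict[str, dict[str, bool]]:
    """Build a full-system baseline plus one config per disabled component."""
    baseline = {name: True for name in components}
    configs = {"baseline": baseline}
    for name in components:
        variant = dict(baseline)
        variant[name] = False  # disable exactly one component
        configs[f"no_{name}"] = variant
    return configs


def contributions(scores: dict[str, float]) -> dict[str, float]:
    """Score drop relative to the baseline for each ablated run."""
    base = scores["baseline"]
    return {run: base - score for run, score in scores.items() if run != "baseline"}
```

Leave-one-out drops measure each component's marginal value in the presence of all others; they do not separate interaction effects, which would require disabling components in pairs or larger subsets.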

65.10 Design Decision Synthesis

The following table summarizes the ten key design decisions in the proposed architecture, the specific choice made, the rationale, and the source system(s) that inspired each choice. This serves as a compact reference for researchers seeking to understand the provenance of each component.

Table 65.5 — Summary of architectural design decisions with source attribution.
Design Decision	Choice	Rationale	Source System(s)
Population structure	MAP-Elites + Islands + Archive (3-layer)	Combines behavioral diversity, isolation, and long-term memory	AlphaEvolve + ShinkaEvolve + DGM
Mutation engine	6 modes with UCB1 bandit scheduling	No single mode is optimal; bandit learns per problem per phase	ShinkaEvolve + GEPA + Darwinian Evolver + Arcgentica
Parent selection	Power-law × novelty × recency (composite)	Avoids oversampling while favoring high-fitness parents	ShinkaEvolve + Darwinian Evolver
Novelty filtering	3-stage: embedding + LLM judge + behavioral	Cheap filter catches duplicates; expensive judge catches semantic similarity	ShinkaEvolve + AlphaEvolve
LLM orchestration	Hierarchical bandit (tier → model)	Prevents expensive models from starving cheap ones; efficiency reward	ShinkaEvolve + AB-MCTS
Diagnostic feedback	Mandatory ASI schema from all evaluators	Enables reflection, failure-targeting, log compression, prompt mutation	GEPA
Knowledge sharing	Learning logs + skill store + meta-recs	Cross-population discovery sharing; skill transfer to cheap models	Darwinian Evolver + GEPA Skills
Prompt evolution	ASI-informed prompt mutation per mode	Prompts producing diagnostic failures get targeted mutation	ShinkaEvolve (extended)
Self-improvement	Scoped mutable zone + A/B validation + rollback	Controllable improvement without risking evaluation integrity	DGM (constrained)
Search strategy	Adaptive: evolutionary + AB-MCTS hybrid	Population for breadth; tree search for depth on stagnant islands	AB-MCTS + ShinkaEvolve

65.11 Open Research Directions

The synthesis reveals six open research questions that extend beyond what any single surveyed system has addressed. These represent concrete PhD-level research directions for the next generation of LLM-powered evolutionary systems.

RQ1: Optimal Population Topology. How should islands communicate? The proposed architecture defaults to ring topology with periodic migration, but hub-and-spoke, fully connected, and adaptive topologies remain unexplored. A promising approach is to apply multi-armed bandit selection over topology configurations, measuring the diversity-convergence trade-off dynamically.
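As a point of reference for RQ1, the default ring topology can be expressed as a neighbor map, which makes alternative topologies a matter of swapping that map. The sketch below is illustrative only: individuals are represented by their fitness values (higher is better), migration copies rather than moves individuals, and the function names are assumptions.

```python
def ring_neighbors(n_islands: int) -> dict[int, int]:
    """Map each island to its successor around the ring."""
    return {i: (i + 1) % n_islands for i in range(n_islands)}


def migrate(islands: list[list[float]], k: int = 1) -> list[list[float]]:
    """Copy each island's top-k individuals to its ring successor.

    Emigrants are chosen from the pre-migration populations, so an
    individual travels at most one hop per migration event.
    """
    targets = ring_neighbors(len(islands))
    emigrants = [sorted(pop, reverse=True)[:k] for pop in islands]
    new_islands = [list(pop) for pop in islands]  # leave the input untouched
    for src, dst in targets.items():
        new_islands[dst].extend(emigrants[src])
    return new_islands
```

A hub-and-spoke or fully connected topology would replace `ring_neighbors` with a map from each island to a list of destinations; a bandit over topologies would then choose among such maps.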

RQ2: Behavioral Descriptor Design. For code evolution, what are the right behavioral descriptors for MAP-Elites? Code complexity, runtime profile, test coverage pattern, and semantic embeddings are all candidates, but the optimal descriptor space is problem-dependent. Unsupervised discovery of descriptors from evaluation data using dimensionality reduction could yield descriptors that capture meaningful diversity without combinatorial explosion.

RQ3: Learning Log Information Value. Can the value function $V(L_t)$ (Section 65.5.4) be estimated in practice? Does it exhibit diminishing returns, and if so, when should old entries be purged? What compression methods preserve maximum mutual information? Ablation studies combined with information-theoretic analysis of different log compression strategies could provide actionable answers.

RQ4: Prompt Evolution Convergence. Do evolved prompts converge to a fixed point, or do they continuously improve in tandem with the evolving population? Is there a theoretical bound on prompt fitness? Do evolved prompts generalize across problem types or remain problem-specific? Tracking prompt entropy over generations and measuring transfer across domains would clarify these questions.
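The text above does not define prompt entropy. One simple candidate, offered here as an assumption, is the Shannon entropy of the pooled token distribution across all prompts in a generation, using whitespace tokenization. A falling value over generations would suggest the prompt population is converging on a shared vocabulary.

```python
import math
from collections import Counter


def prompt_entropy(prompts: list[str]) -> float:
    """Shannon entropy in bits of the pooled whitespace-token distribution."""
    counts = Counter(tok for prompt in prompts for tok in prompt.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tokens, no uncertainty
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

This measure ignores word order and semantics; an embedding-based dispersion measure would be a natural alternative if lexical entropy proves too coarse.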

RQ5: Safe Self-Improvement Boundaries. What is the formal boundary between safe and unsafe self-modification? Can a type system or formal verification framework enforce the mutable/immutable boundary at compile time? How do safety guarantees compose when multiple components self-modify simultaneously? This connects to the broader AI safety literature on capability amplification.

RQ6: Multi-Task Transfer Learning. Can skills, prompts, and meta-recommendations transfer across fundamentally different problem domains (sorting, scientific discovery, infrastructure optimization)? Curriculum learning frameworks measuring zero-shot and few-shot transfer across domain pairs would establish whether the knowledge layer provides genuine generalization or merely task-specific memorization.

65.12 The Frontier Beyond

The proposed architecture integrates the strongest methods published in 2024–2026, but it remains a blueprint rather than a validated system. Looking further ahead, the source analysis identifies five research frontiers that go beyond component integration:

Continuous learning: systems that never stop improving, with formal guarantees on monotonic long-run performance.

Multi-agent co-evolution: populations of interacting agents that evolve communication protocols and collaborative strategies, not just individual programs.

Neural-symbolic hybrid search: combining LLM-based code mutation with formal verification and constraint propagation to reduce the search space by orders of magnitude.

Evolutionary meta-learning: systems that evolve their own evolutionary algorithms, realizing a safe version of the Gödel Machine concept.

Human-AI co-evolution: tight feedback loops where humans provide insight and domain knowledge while AI provides search scale, generalizing the collaborative patterns observed in competitive programming contests.

These directions share a common theme: the shift from evolving programs to evolving the processes that produce programs. The self-improvement module in the proposed architecture is a first step in this direction, but a fully recursive system—one that can safely modify its own modification strategies without limit—remains an open challenge with deep connections to both evolutionary theory and AI alignment.

65.13 Limitations and Caveats

Several important limitations of this architectural synthesis should be noted. First, the projected performance figures in Table 65.4 are analytical estimates, not empirical measurements. The compound effect of integrating all components simultaneously has not been validated; in practice, interactions between components may produce diminishing returns or unexpected interference. Second, the complexity of the unified system is substantially higher than any individual surveyed system. More components mean more configuration parameters, more failure modes, and higher engineering effort. The phased implementation roadmap (proposed as 10 phases over 24 weeks in the source) is itself ambitious. Third, several proposed mechanisms remain underspecified: the behavioral descriptor design for MAP-Elites, the learning log compression strategy, and the prompt evolution convergence criteria all require further research before production deployment. Fourth, the cost estimates are based on 2024–2026 LLM pricing, which has been declining rapidly; the optimal tier boundaries and model assignments will shift as pricing evolves.

Perhaps most importantly, there is a tension between Principle P7 (generalization) and practical effectiveness. A system optimized for generality may underperform domain-specific systems on individual tasks. The proposed architecture handles this through island specialization and adaptive scheduling, but the trade-off between general-purpose and specialist performance remains an empirical question.

Chapter Summary

Key takeaway: No single system from the 2024–2026 survey captures all the innovations the field has collectively discovered. A unified architecture that combines ShinkaEvolve's sample efficiency, GEPA's diagnostic feedback, Darwinian Evolver's learning logs, DGM's controlled self-improvement, AlphaEvolve's population diversity, and AB-MCTS's adaptive tree search addresses six critical integration gaps and is projected to substantially outperform any individual system.

Main contribution: A systematic gap analysis of seventeen surveyed systems yielding six specific integration opportunities, seven empirically grounded design principles, a complete architectural blueprint with formal mathematical foundations, and six concrete PhD research directions for next-generation LLM-powered evolutionary frameworks.

What researchers should know: The compound value of integrating meta-level evolution (prompt co-evolution, learning logs, self-improvement) with sample-efficiency mechanisms (novelty filtering, cascade evaluation, bandit model selection) is the single most important insight from this survey. Systems that evolve how they evolve have demonstrated the strongest long-run performance, and the next research frontier lies in making this meta-evolution process both more powerful and formally safe.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}