Next Evolution

Architectural Recommendations for the Optimal LLM-Powered Evolutionary System

A PhD-level synthesis of 17 evolutionary AI systems (2024–2026) — Design Blueprint for a Best-in-Class Implementation

Proposed System: OmniEvolve. A unified evolutionary framework synthesizing the best components from across the field: ShinkaEvolve's sample efficiency • GEPA's diagnostic feedback • Darwinian Evolver's learning logs • DGM's self-improvement • AlphaEvolve's population diversity • AB-MCTS's adaptive search • Arcgentica's runtime context • SkyDiscover's hierarchical adaptation.

1. Design Motivation & Gap Analysis

After surveying 17 evolutionary AI systems published between 2024 and 2026, a clear picture emerges: no single system captures all the key innovations that the field has collectively discovered. Each system made a focused contribution, but the total design space has never been fully integrated. This section maps the gaps.

1.1 What Each System Got Right

| System | Key Innovation | What It Lacks |
|---|---|---|
| AlphaEvolve | MAP-Elites quality diversity, Gemini ensemble at scale, real-world infra optimization | Closed source, no prompt evolution, no learning logs, no self-improvement |
| OpenEvolve | Open MAP-Elites + islands, multi-provider LLM, strong community | No prompt evolution, no diagnostic ASI, no learning logs, basic cost control |
| ShinkaEvolve | 2-tier novelty, bandit LLM selection, prompt co-evolution, async 5-10x speedup | No structured diagnostic feedback, no learning logs, no self-modification |
| GEPA | Actionable Side Information, Pareto search, declarative API, reflection-driven mutation | No prompt evolution, no learning logs, limited population management |
| DGM | Full self-modification, expanding archive, cross-language transfer | No cost control, safety risks, no structured diagnostic feedback |
| Darwinian Evolver | Learning log system, failure-case-driven mutation, clean design | Sequential only, no prompt evolution, flat population, no novelty filtering |
| AB-MCTS | Thompson Sampling for adaptive branching, multi-LLM collaboration, tree search | No population management, not a full evolutionary framework |
| GEPA Skills | Repository-specific skill learning, cross-model transfer, persistent knowledge | Limited to skill extraction, not full evolution |
| Arcgentica | Runtime-as-context, persistent REPL, live execution as reasoning surface | Not a full evolutionary system, no population |
| LLM4AD | 7 search methods unified, GUI, broad task coverage | Kitchen-sink design; no prompt evolution, no learning logs, no ASI |
| SkyDiscover | Three-level hierarchical adaptation, globally-normalized bandits, meta-guidance, 200+ benchmarks | No prompt evolution, no learning logs, no structured ASI, no MAP-Elites, no 2-tier novelty |

1.2 The Critical Gaps No System Solves Together

Gap 1: Sample Efficiency + Diagnostic Feedback

ShinkaEvolve dramatically reduces evaluations through 2-tier novelty filtering and bandit LLM selection. GEPA explains why mutations fail via Actionable Side Information. No system does both. Combining them would make reflective mutation far more targeted, further reducing evaluations.


Gap 2: Population Diversity + Learning Logs

MAP-Elites maintains behavioral diversity in the population. Darwinian Evolver's learning logs share cross-individual discoveries. No system combines quality-diversity grids with shared experiential memory. Individuals in isolated MAP-Elites cells could share mutation history.


Gap 3: Prompt Evolution + ASI Feedback

ShinkaEvolve evolves system prompts alongside programs. GEPA provides structured diagnostic data about what went wrong. No system uses diagnostic failure data to drive prompt mutation. This would create a closed feedback loop: bad mutations → diagnose failure → improve the prompt that generated them.


Gap 4: Controlled Self-Improvement

DGM's self-modification is powerful but unsafe at scale. No system offers safe, scoped self-improvement with formal rollback guarantees. A system that selectively improves its mutation strategies and prompts (not its core evaluation logic) would be both safe and powerful.


Gap 5: Tree Search + Evolutionary Population

AB-MCTS's adaptive tree search excels at deep refinement of individual solutions. MAP-Elites/island models excel at broad exploration. No system combines tree-depth refinement with population-level diversity. The optimal strategy uses trees for promising individuals, population search for diversity.


Gap 6: Runtime Context + Evolutionary Loop

Arcgentica's runtime-as-context (persistent REPL, live execution) makes code mutations smarter. No evolutionary framework uses live execution state as context for LLM mutations. Combining this with population management would enable mutations informed by actual runtime behavior.


2. Core Design Principles

The proposed system, OmniEvolve, is guided by seven principles derived from empirical observations across all 17 surveyed systems:

P1: Sample Efficiency is the Primary Constraint

LLM API costs are the bottleneck, not algorithmic compute. Every design decision should minimize evaluations-to-convergence. From ShinkaEvolve: reject before you evaluate. From GEPA: diagnose before you retry. From Darwinian Evolver: learn before you repeat.


P2: Diagnostic Data Drives Everything

Mutation operators should never be blind. Every evaluation should return not just a score, but actionable diagnostic information: what failed, why, where. This data feeds mutation prompts, parent selection, learning logs, and prompt evolution simultaneously.


P3: Population Diversity ≥ Best Individual Score

Premature convergence to local optima is the primary failure mode. MAP-Elites grids, island isolation, and embedding-based novelty filtering must be maintained even at the cost of individual fitness. The best solutions are found by maintaining diverse exploration fronts.


P4: Meta-Level Evolution is the Multiplier

Systems that evolve how they evolve outperform those that don't. Prompt co-evolution, learning logs, and adaptive mutation scheduling compound over time. A system that gets better at mutating code is geometrically more powerful than one with fixed strategies.


P5: Safety First, Capability Second

Self-modification must be strictly scoped. The evaluation harness, fitness functions, and safety checks must be immutable. Only mutation prompts, parent selection weights, and search heuristics may self-modify. All changes require snapshot-rollback capability.

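P5's scoping and rollback requirements can be made concrete. The sketch below is purely illustrative (the `MUTABLE_SCOPE` set and class names are hypothetical, not taken from any surveyed system): self-modifications outside the allowed scope are refused, and every change runs inside a snapshot transaction that rolls back on failure.

```python
import copy
from contextlib import contextmanager

# Hypothetical scope: only these components may self-modify under P5.
MUTABLE_SCOPE = {"mutation_prompts", "selection_weights", "search_heuristics"}

class ScopedSelfModifier:
    def __init__(self, state: dict):
        self.state = state

    @contextmanager
    def transaction(self):
        """Snapshot mutable state; restore it if the modification raises."""
        snapshot = copy.deepcopy(self.state)
        try:
            yield self.state
        except Exception:
            self.state.clear()
            self.state.update(snapshot)  # rollback to the snapshot
            raise

    def apply(self, key: str, value):
        """Refuse any write outside the self-modification scope."""
        if key not in MUTABLE_SCOPE:
            raise PermissionError(f"'{key}' is immutable (evaluation/safety layer)")
        self.state[key] = value
```

A failed self-modification (an exception inside the transaction) leaves the system exactly where it was before the change.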

P6: Cost-Aware Architecture Throughout

Cost control is not a feature — it is a first-class architectural concern. Every component tracks its cost contribution. Bandit selection automatically uses cheaper models when they suffice. Budget guards operate at committed (in-flight) cost, not just completed cost.

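A committed-cost budget guard, as described in P6, might be sketched as follows (names hypothetical): cost is reserved before an LLM call is dispatched and settled afterwards, so concurrent in-flight calls cannot collectively overrun the budget.

```python
import threading

class CommittedBudgetGuard:
    """Tracks in-flight (committed) cost, not just completed spend."""
    def __init__(self, total_budget: float):
        self.total = total_budget
        self.spent = 0.0        # settled cost of completed calls
        self.committed = 0.0    # reserved cost of in-flight calls
        self._lock = threading.Lock()

    def reserve(self, estimated_cost: float) -> bool:
        """Reserve budget before dispatch; reject if it would overrun."""
        with self._lock:
            if self.spent + self.committed + estimated_cost > self.total:
                return False
            self.committed += estimated_cost
            return True

    def settle(self, estimated_cost: float, actual_cost: float):
        """Release the reservation and record the actual spend."""
        with self._lock:
            self.committed -= estimated_cost
            self.spent += actual_cost
```

Guarding on `spent + committed` rather than `spent` alone is what prevents a burst of parallel mutations from blowing past the cap before any of them completes.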

P7: Generalization Over Task Specialization

The system should work on any problem expressible as (1) mutable code blocks, (2) an evaluation function returning score + diagnostics. No domain-specific logic should leak into the core. Domain knowledge belongs in prompts and evaluation functions, not in the evolutionary engine.

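P7's problem interface might be sketched like this (all names hypothetical): a problem is nothing more than named mutable code blocks plus an evaluation callable, so no domain logic ever enters the engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class EvalResult:
    score: float
    diagnostics: dict[str, Any] = field(default_factory=dict)

@dataclass
class Problem:
    """Minimal problem interface: mutable code blocks + evaluation function."""
    code_blocks: dict[str, str]                       # named mutable regions
    evaluate: Callable[[dict[str, str]], EvalResult]  # score + diagnostics

# Usage: any domain fits, as long as it supplies these two pieces.
def eval_sort(blocks: dict[str, str]) -> EvalResult:
    ns: dict[str, Any] = {}
    exec(blocks["solve"], ns)                         # run the candidate block
    ok = ns["solve"]([3, 1, 2]) == [1, 2, 3]
    return EvalResult(score=1.0 if ok else 0.0,
                      diagnostics={} if ok else {"failed_input": [3, 1, 2]})

problem = Problem({"solve": "def solve(xs):\n    return sorted(xs)"}, eval_sort)
```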

3. Proposed Architecture: OmniEvolve

3.1 High-Level System Overview

╔═══════════════════════════════════════════════════════════════════════════════════════════════╗
║                                    O M N I E V O L V E                                        ║
║                    Unified LLM-Powered Evolutionary Optimization Engine                        ║
╚═══════════════════════════════════════════════════════════════════════════════════════════════╝

                              ┌──────────────────────────────────┐
                              │         ORCHESTRATOR              │
                              │  (async event loop, budget guard)  │
                              └────────────┬─────────────────────┘
                                           │
          ┌────────────────────────────────┼────────────────────────────────┐
          │                               │                                  │
  ┌───────▼────────┐            ┌────────▼────────┐              ┌──────────▼──────────┐
  │  POPULATION    │            │  LLM MUTATION   │              │   EVALUATION        │
  │  MANAGER       │◄──────────►│  ENGINE         │◄────────────►│   PIPELINE          │
  │                │            │                 │              │                     │
  │ MAP-Elites     │            │ Diff / Full /   │              │ Sandbox Exec        │
  │ Island Model   │            │ Cross / Reflect │              │ Cascade Filter      │
  │ Pareto Front   │            │ Failure-Driven  │              │ ASI Diagnostics     │
  │ Archive        │            │ Runtime-Aware   │              │ Early Stopping      │
  └───────┬────────┘            └────────┬────────┘              └──────────┬──────────┘
          │                              │                                   │
          │              ┌───────────────▼──────────────────┐               │
          │              │         KNOWLEDGE LAYER           │               │
          │              │                                   │               │
          │              │  ┌──────────────┐ ┌────────────┐ │               │
          │              │  │ Learning Logs│ │  Prompt    │ │               │
          │              │  │ (cross-pop.) │ │  Archive   │ │               │
          │              │  └──────────────┘ └────────────┘ │               │
          │              │  ┌──────────────┐ ┌────────────┐ │               │
          │              │  │ Skill Store  │ │  Meta-Recs │ │               │
          │              │  └──────────────┘ └────────────┘ │               │
          │              └──────────────────────────────────┘               │
          │                              │                                   │
          │              ┌───────────────▼──────────────────┐               │
          │              │      LLM ORCHESTRATOR             │               │
          │              │                                   │               │
          │              │  Flash-Haiku  ┐                   │               │
          │              │  Flash-Pro    ├── UCB1 Bandit     │               │
          │              │  o3-mini      │   + Thompson      │               │
          │              │  Sonnet-4.x   │   Sampling        │               │
          │              │  local model ─┘                   │               │
          │              │                                   │               │
          │              │  Prompt Co-Evolver (v2 archive)   │               │
          │              └──────────────────────────────────┘               │
          │                              │                                   │
          └──────────────────────────────┼───────────────────────────────────┘
                                         │
                          ┌──────────────▼───────────────────┐
                          │      SEARCH STRATEGY ROUTER       │
                          │                                   │
                          │  Evolutionary Loop (default)      │
                          │  AB-MCTS Tree Search (deep refine)│
                          │  Hybrid (pop. → tree → pop.)      │
                          └──────────────────────────────────┘
                                         │
                          ┌──────────────▼───────────────────┐
                          │      SAFETY & COST LAYER          │
                          │                                   │
                          │  Committed Budget Guard           │
                          │  Sandbox (Docker / subprocess)    │
                          │  Snapshot Rollback                │
                          │  Immutability Enforcement         │
                          └──────────────────────────────────┘

3.2 Data Flow: One Evolutionary Iteration

┌─────────────────────────────────────────────────────────────────────────────────┐
 │                        SINGLE EVOLUTIONARY STEP                                  │
 │                                                                                  │
 │  1. SAMPLE PARENTS                                                               │
 │     Population DB → power-law rank selection (ShinkaEvolve)                     │
 │                    + novelty-adjusted weights (Darwinian Evolver)                │
 │                    + Pareto frontier candidates (GEPA)                           │
 │                                                                                  │
 │  2. PRE-MUTATION NOVELTY CHECK (ShinkaEvolve 2-tier)                            │
 │     Embedding similarity → cosine threshold → SKIP if too similar               │
 │     LLM-as-judge → semantic novelty → SKIP if not novel                         │
 │     [Cost: ~$0.001 per check, saves ~$0.50 per skipped evaluation]              │
 │                                                                                  │
 │  3. CONTEXT ASSEMBLY                                                              │
 │     Inject: parent code + best solutions + learning logs + ASI diagnostics      │
 │             + prompt from co-evolved archive + runtime REPL state (optional)    │
 │                                                                                  │
 │  4. LLM MUTATION (bandit-selected model)                                         │
 │     Mode A: Diff patch (targeted, cheap, fast)                                  │
 │     Mode B: Full rewrite (creative, expensive)                                  │
 │     Mode C: Crossover (two parents → child)                                     │
 │     Mode D: Reflection-driven (ASI feedback → targeted fix)                     │
 │     Mode E: Failure-case-driven (specific test failure → fix)                   │
 │                                                                                  │
 │  5. POST-GENERATION VERIFICATION (Darwinian Evolver)                             │
 │     Quick syntax check + type check + mini-eval → REJECT if clearly wrong       │
 │     [Cost: ~$0.05 per check, saves full evaluation cost]                        │
 │                                                                                  │
 │  6. CASCADE EVALUATION (AlphaEvolve / OpenEvolve)                               │
 │     Stage 1: Fast cheap test subset → REJECT if score < threshold               │
 │     Stage 2: Medium test set → REJECT if fails                                  │
 │     Stage 3: Full evaluation → ASI diagnostics returned                         │
 │     Early stopping (Bayesian / CI) → stop when confident in outcome             │
 │                                                                                  │
 │  7. POPULATION UPDATE                                                             │
 │     MAP-Elites: insert if dominates cell                                        │
 │     Island Archive: keep top-K per island                                       │
 │     Pareto Front: update if non-dominated                                       │
 │                                                                                  │
 │  8. KNOWLEDGE UPDATE                                                             │
 │     Learning log: append (mutation_type, parent_ids, ASI, outcome, delta_score) │
 │     Prompt evolver: if outcome positive → reinforce prompt; if negative → mutate│
 │     Bandit update: record (model, cost, quality) → update UCB1 arm              │
 │     Meta-recommendations: every N steps → LLM synthesizes pattern insights      │
 │                                                                                  │
 │  9. SEARCH ROUTER CHECK                                                           │
 │     If island is stagnant (N steps, no improvement) → spawn AB-MCTS tree        │
 │     If tree search finds improvement → reinject to population                   │
 └─────────────────────────────────────────────────────────────────────────────────┘
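The nine steps above can be condensed into one (heavily stubbed) loop body. Everything here is illustrative: `mutate`, `evaluate`, and `is_novel` stand in for the LLM mutation engine, cascade evaluator, and novelty pipeline, and the dict-based program representation is a placeholder.

```python
def one_step(population, archive, mutate, evaluate, is_novel, threshold=0.0):
    """Condensed sketch of steps 1-8; all components are pluggable stubs."""
    parent = max(population, key=lambda p: p["score"])   # 1. sample parent (greedy stub)
    candidate = mutate(parent["code"])                   # 4. LLM mutation (stub)
    if not is_novel(candidate, population):              # 2. novelty check: skip
        return None                                      #    before paying for eval
    score, asi = evaluate(candidate)                     # 6. cascade evaluation (stub)
    if score < threshold:                                #    early rejection
        return None
    child = {"code": candidate, "score": score, "asi": asi}
    population.append(child)                             # 7. population update
    archive.append({"mutation_type": "diff",             # 8. learning-log append
                    "parent_id": parent["code"],
                    "delta_score": score - parent["score"]})
    return child
```

The real loop adds parent-selection weights, post-generation verification, and the search-router check (step 9), but the data dependencies are exactly these.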

4. Component: Population Management

Sources: MAP-Elites (AlphaEvolve / OpenEvolve) • Island Model (ShinkaEvolve) • Pareto Front (GEPA) • Expanding Archive (DGM)

4.1 Three-Layer Population Structure

OmniEvolve uses a hierarchical three-layer population that simultaneously provides quality-diversity (breadth), competitive pressure (depth), and long-term memory (archive):

LAYER 1: MAP-ELITES QUALITY-DIVERSITY GRID
┌────────────────────────────────────────────────────────┐
│  Feature Dimension 1 (e.g., code length, complexity)    │
│  Feature Dimension 2 (e.g., score on subset A vs. B)    │
│                                                          │
│  Each cell: best program achieving that (f1, f2) combo  │
│  Resolution: 20×20 = 400 cells per island               │
│  Behavioral descriptors computed post-evaluation         │
└────────────────────────────────────────────────────────┘
         ↕  periodic migration (top-K from each)
LAYER 2: ISLAND MODEL (Isolated Populations)
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│  Island 1  │ │  Island 2  │ │  Island 3  │ │  Island N  │
│ (exploit)  │ │ (explore)  │ │ (diversity)│ │ (specialist│
│ diff-heavy │ │ full+cross │ │ beam search│ │  per task) │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
         ↕  ring topology migration every M steps
LAYER 3: GLOBAL ARCHIVE (non-culled)
┌────────────────────────────────────────────────────────┐
│  All programs ever accepted (DGM-style expanding)       │
│  Indexed by embedding vector for retrieval              │
│  Used for crossover parent selection across islands     │
│  Compressed after 10K entries (keep skeleton + score)  │
└────────────────────────────────────────────────────────┘
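Layer 2's ring-topology migration (every M steps) might look like the following sketch, assuming a minimal dict-based program representation:

```python
def migrate_ring(islands: list[list[dict]], top_k: int = 2):
    """Ring-topology migration: each island sends copies of its top-K
    programs to the next island in the ring."""
    n = len(islands)
    # Select emigrants from a pre-migration snapshot of every island.
    emigrants = [sorted(isl, key=lambda p: p["score"], reverse=True)[:top_k]
                 for isl in islands]
    for i, migrants in enumerate(emigrants):
        # Copy rather than move, so the source island keeps its elites.
        islands[(i + 1) % n].extend(dict(m) for m in migrants)
```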

4.2 Island Specialization

| Island Type | Mutation Ratio | Selection | Purpose |
|---|---|---|---|
| Exploitation Island | 80% diff, 20% full | Power-law (α=2.0) | Refine best known solutions |
| Exploration Island | 30% diff, 50% full, 20% cross | Power-law (α=0.5) | Creative search, new algorithms |
| Diversity Island | 40% diff, 40% full, 20% cross | Novelty-weighted | Maintain MAP-Elites cell coverage |
| Reflection Island | 90% reflection-driven, 10% full | ASI-diagnostic-based | Fix specific failure modes |
| Crossover Island | 10% diff, 10% full, 80% cross | Pareto frontier | Combine breakthroughs from other islands |

4.3 Dynamic Island Spawning

Inspired by ShinkaEvolve v1.1, when an island shows no improvement for stagnation_threshold steps, the system dynamically spawns a new island with a fresh random seed from the global archive. Stale islands are archived and removed. The number of active islands is bounded by max_islands (default: 8).

def check_stagnation(island: Island, window: int = 50) -> bool:
    """Detect stagnation using exponential moving average of delta-score."""
    recent_scores = island.score_history[-window:]
    if len(recent_scores) < window:
        return False
    ema_delta = 0.0
    alpha = 0.1
    for i in range(1, len(recent_scores)):
        delta = recent_scores[i] - recent_scores[i-1]
        ema_delta = alpha * delta + (1 - alpha) * ema_delta
    return abs(ema_delta) < island.config.stagnation_epsilon  # e.g., 1e-4

def spawn_new_island(population_db: PopulationDB) -> Island:
    """Spawn new island seeded from diverse global archive samples."""
    seeds = population_db.archive.sample_diverse(
        n=5,
        method="maximin_distance",  # maximize min pairwise embedding distance
        temperature=1.5             # higher = more random diversity
    )
    return Island(seed_programs=seeds, mutation_config=IslandConfig.random())

5. Component: Mutation Engine

Sources: Diff / Full / Cross (ShinkaEvolve) • Reflection-Driven (GEPA) • Failure-Case-Driven (Darwinian Evolver) • Runtime-Aware (Arcgentica)

5.1 Six Mutation Modes

The mutation engine supports six qualitatively distinct mutation modes, each suited to different evolutionary contexts:

Mode A: Diff Patch Mutation

When: exploit phase, refining good programs

LLM prompt: Given parent code + best solution + score, generate unified diff

Cost: Low (small output)

Success rate: High when good parent exists

# Prompt template (abbreviated)
You are improving a program. Current score: {score}.
Generate a unified diff that improves performance.
Focus on: {top_3_insights_from_learning_logs}
<<< PARENT PROGRAM >>>
{parent_code}
<<< BEST KNOWN PROGRAM >>>
{best_code}
Output ONLY a unified diff, no explanations.

Mode B: Full Rewrite Mutation

When: explore phase, crossing local optima

LLM prompt: Given problem spec + inspirational examples, write new algorithm

Cost: Medium (full program output)

Success rate: Lower but higher ceiling

# Prompt template (abbreviated)
Implement a novel algorithm for: {problem_description}
Inspirational approaches (do not copy exactly):
{top_diverse_programs}
Constraints: {fitness_function_description}
Key insights from successful mutations:
{meta_recommendations}

Mode C: Crossover Mutation

When: crossover island, combining breakthroughs

LLM prompt: Two parents from different islands/cells, combine best aspects

Cost: Medium

Success rate: Medium, high novelty

# Parent selection for crossover:
parent_a = island_A.sample(strategy="best")
parent_b = archive.sample_dissimilar(parent_a, method="embedding")
# Embedding distance > threshold ensures genuine diversity
assert cosine_distance(embed(parent_a), embed(parent_b)) > 0.3

Mode D: Reflection-Driven Mutation (GEPA ASI)

When: diagnostics available, reflection island

LLM prompt: Inject full ASI diagnostic data, ask LLM to reflect then mutate

Cost: Medium-high (longer context)

Success rate: Highest for fixing known failures

# ASI context injected:
{
  "score": 0.73,
  "failed_test_cases": [{input, expected, actual, error_type}],
  "performance_profile": {bottleneck_function, runtime_ms},
  "diagnostic_message": "Stack overflow on input size > 1000",
  "suggested_fix_category": "recursion_depth"
}

Mode E: Failure-Case-Driven Mutation

When: known failure cases, targeted repair (Darwinian Evolver style)

LLM prompt: Present specific failing test case, ask LLM to trace and fix

Cost: Low-medium

Success rate: High for specific bugs

# Learning log provides failure patterns:
failure_patterns = learning_log.get_recurring_failures(
    program_id=parent.id,
    min_occurrences=3
)
# Inject most frequent failure for targeted fix

Mode F: Runtime-Context Mutation (Arcgentica style)

When: REPL available, complex debugging needed

LLM prompt: Inject live execution trace, intermediate values, stack state

Cost: High (execution + analysis)

Success rate: Highest for runtime bugs

# Runtime context injected:
repl_session = RuntimeREPL(program=parent_code)
trace = repl_session.execute_with_trace(test_input)
# LLM sees: intermediate variable values,
#           branch decisions, recursion depth,
#           memory allocation patterns

5.2 Adaptive Mutation Scheduling

The ratio of mutation modes is not fixed — it is learned per island per problem using a multi-armed bandit. Each island maintains a MutationBandit tracking the success rate of each mode:

$$ \mathrm{UCB}_i(t) = \hat{\mu}_i + c \sqrt{\frac{\ln t}{n_i}} \quad \text{(Eq. 1 — UCB1 for Mutation Mode Selection)} $$

Where $\hat{\mu}_i$ is the empirical mean improvement for mutation mode $i$, $n_i$ is the number of times mode $i$ was selected, $t$ is the total number of mutations, and $c$ is the exploration constant (default: 0.3). The bandit selects the mode with the highest UCB score at each step.

import math

class MutationBandit:
    def __init__(self, modes: list[MutationMode], c: float = 0.3):
        self.modes = modes
        self.c = c
        self.counts = {m: 0 for m in modes}     # n_i
        self.rewards = {m: 0.0 for m in modes}  # cumulative delta-score

    def select(self) -> MutationMode:
        t = sum(self.counts.values()) + 1
        ucb_scores = {}
        for mode in self.modes:
            n_i = self.counts[mode]
            if n_i == 0:
                return mode  # ensure each mode tried at least once
            mu_hat = self.rewards[mode] / n_i
            ucb_scores[mode] = mu_hat + self.c * math.sqrt(math.log(t) / n_i)
        return max(ucb_scores, key=ucb_scores.get)

    def update(self, mode: MutationMode, delta_score: float):
        self.counts[mode] += 1
        self.rewards[mode] += max(0.0, delta_score)  # only reward improvements

6. Component: Parent Selection

Sources: Power-Law (ShinkaEvolve) • Sigmoid-Weighted + Novelty Bonus (Darwinian Evolver) • Pareto Frontier (GEPA)

6.1 Composite Selection Strategy

OmniEvolve combines three selection mechanisms into a composite weight system. For each program p in the population, the selection probability is:

$$ w(p) = w_{\text{fitness}}(p) \cdot w_{\text{novelty}}(p) \cdot w_{\text{recency}}(p) \quad \text{(Eq. 2 — Composite Selection Weight)} $$

Fitness Weight (Power-Law Rank)

$$ w_{\text{fitness}}(p) = \mathrm{rank}(p)^{-\alpha}, \quad \alpha \in [0.5, 3.0] \quad \text{(Eq. 3 — Power-Law Rank Selection)} $$

Programs are sorted by fitness descending; rank 1 = best. α controls exploitation/exploration: α=2.0 for exploit islands, α=0.5 for explore islands. Normalized to sum to 1.

Novelty Weight (Anti-Oversampling)

$$ w_{\text{novelty}}(p) = \sigma\left(-\beta \cdot (\mathrm{select\_count}(p) - \mu_{\text{select}})\right) \quad \text{(Eq. 4 — Sigmoid Novelty Penalty)} $$

Where $\sigma$ is the sigmoid function, $\beta$ controls sharpness (default: 0.5), and $\mu_{\text{select}}$ is the mean selection count. Programs selected too frequently get down-weighted, preventing any single program from dominating as a parent. Inspired by Darwinian Evolver.

Recency Weight (Time Decay)

$$ w_{\text{recency}}(p) = \exp(-\lambda \cdot \mathrm{age}(p)), \quad \lambda = 0.01 \quad \text{(Eq. 5 — Exponential Recency Weight)} $$

Slightly favors recently discovered programs, ensuring the system doesn't exclusively exploit historical knowledge. age(p) = number of generations since p was created.

import numpy as np

def composite_weight(programs: list[Program], alpha: float, beta: float) -> np.ndarray:
    n = len(programs)
    scores = np.array([p.fitness for p in programs])
    ranks = n - np.argsort(np.argsort(scores))  # rank 1 = best (highest fitness)

    # Fitness weight
    w_fit = ranks.astype(float) ** (-alpha)

    # Novelty weight (anti-oversampling)
    select_counts = np.array([p.select_count for p in programs])
    mu_select = select_counts.mean()
    w_nov = 1.0 / (1.0 + np.exp(beta * (select_counts - mu_select)))

    # Recency weight
    ages = np.array([p.age for p in programs])
    w_rec = np.exp(-0.01 * ages)

    # Composite
    w = w_fit * w_nov * w_rec
    return w / w.sum()

6.2 Multi-Objective Pareto Selection

For problems with multiple optimization targets (e.g., speed + accuracy + memory), OmniEvolve maintains a Pareto frontier and applies ε-greedy selection:

import random

def select_from_pareto(pareto_front: list[Program], epsilon: float = 0.1) -> Program:
    if random.random() < epsilon:
        return random.choice(pareto_front)  # explore random Pareto member
    # Exploit: select program with highest hypervolume contribution
    return max(pareto_front, key=lambda p: p.hypervolume_contribution)
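The snippet above assumes a maintained Pareto front. A minimal sketch of how that front could be computed (plain non-dominated filtering with all objectives maximized; hypervolume contributions are a separate, more involved computation) is:

```python
def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Return the non-dominated subset, maximizing every objective."""
    def dominates(a, b):
        # a dominates b: at least as good everywhere, strictly better somewhere
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This O(n²) filter is fine for the front sizes typical of evolutionary runs; larger archives would warrant a sorted sweep.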

7. Component: Novelty & Diversity Filtering

Sources: 2-tier novelty (ShinkaEvolve) • Behavioral descriptors (AlphaEvolve) • Failure-type categorization (Darwinian Evolver)

7.1 Three-Stage Novelty Pipeline

GENERATED PROGRAM
        │
        ▼
┌───────────────────────────────┐
│  STAGE 1: Embedding Similarity │  Cost: ~$0.001
│  cosine_sim(embed(new), embed(k-NN)) > threshold?  → REJECT
│  threshold = 0.92 (tunable)   │
└───────────────┬───────────────┘
                │ PASS
                ▼
┌───────────────────────────────┐
│  STAGE 2: LLM Novelty Judge   │  Cost: ~$0.005
│  "Is this mutation algorithmically novel? [YES/NO]"
│  Judge sees: new + k most similar archived programs
└───────────────┬───────────────┘
                │ YES (novel)
                ▼
┌───────────────────────────────┐
│  STAGE 3: Behavioral Novelty  │  Cost: 0 (during evaluation)
│  Post-eval: compute behavioral descriptor vector
│  (e.g., test case pass pattern, branch frequencies)
│  MAP-Elites cell placement uses behavioral descriptors
└───────────────────────────────┘

7.2 Embedding-Based Pre-filtering (Stage 1)

class NoveltyPreFilter:
    def __init__(self, model="text-embedding-3-small", threshold=0.92, k=10):
        self.model = model
        self.threshold = threshold
        self.k = k
        self.index = FaissIndex(dimension=1536)  # or ChromaDB, Qdrant

    def is_novel(self, program_code: str) -> bool:
        embedding = embed(program_code, model=self.model)
        if self.index.size() < self.k:
            self.index.add(embedding)
            return True  # insufficient history, always novel
        neighbors, distances = self.index.knn_query(embedding, k=self.k)
        max_similarity = max(1.0 - d for d in distances)
        if max_similarity > self.threshold:
            return False  # too similar to existing program
        self.index.add(embedding)
        return True

7.3 Behavioral Descriptors for MAP-Elites

Behavioral descriptors are computed post-evaluation and determine which MAP-Elites cell a program occupies. Descriptors should be:

  • Problem-agnostic: code complexity (cyclomatic), length (lines of code), estimated big-O class
  • Problem-specific: score on test subset A vs. B, performance on edge cases, approach category (greedy/DP/search)

$$ B(p) = (f_1(p), f_2(p)) \text{ where } f_i \in [0, 1] \text{ are normalized behavioral features} \quad \text{(Eq. 6 — Behavioral Descriptor Vector)} $$

The MAP-Elites grid is indexed by discretized B(p). A program replaces the cell occupant only if its fitness exceeds the incumbent: grid[B(p)] ← p iff fitness(p) > fitness(grid[B(p)]).
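The replacement rule can be sketched directly (hypothetical class, dict-based programs): descriptors in [0, 1] are discretized into grid indices, and a cell occupant is replaced only by a strictly fitter program.

```python
class MapElitesGrid:
    """Sketch of the discretized MAP-Elites update rule (20x20 by default)."""
    def __init__(self, resolution: int = 20):
        self.resolution = resolution
        self.cells: dict[tuple[int, int], dict] = {}

    def _cell(self, b: tuple[float, float]) -> tuple[int, int]:
        # Discretize normalized descriptors f_i in [0, 1] into grid indices.
        return tuple(min(int(f * self.resolution), self.resolution - 1)
                     for f in b)

    def insert(self, program: dict) -> bool:
        """grid[B(p)] <- p iff fitness(p) > fitness of the incumbent."""
        cell = self._cell(program["descriptor"])
        incumbent = self.cells.get(cell)
        if incumbent is None or program["fitness"] > incumbent["fitness"]:
            self.cells[cell] = program
            return True
        return False
```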

8. Component: LLM Orchestration

Sources: Bandit selection (ShinkaEvolve) • Thompson Sampling (AB-MCTS) • Gemini ensemble (AlphaEvolve)

8.1 Model Tier Architecture

Tier Models Cost/Call Primary Use Selection Strategy
Tier 1: Fast Gemini Flash 2.0, Claude Haiku 4.5, GPT-4o-mini $0.01–0.05 Novelty judging, diff patches, verification Default; used 60–80% of the time
Tier 2: Balanced Claude Sonnet 4.6, GPT-4o, Gemini Pro $0.10–0.40 Full rewrites, reflection-driven mutation UCB1 bandit; ~15–30%
Tier 3: Power Claude Opus 4.6, o3, Gemini Ultra $1.00–5.00 Complex crossover, breakthrough attempts UCB1 bandit; ~5–10%
Tier 4: Local Qwen-Coder, DeepSeek-Coder, CodeLlama $0.00 (infra) High-volume diff patches when budget constrained Budget guard trigger

8.2 Hierarchical Bandit Model Selection

A hierarchical bandit first selects the tier, then selects the model within the tier. This prevents expensive models from starving cheap ones in the exploration phase:

class HierarchicalModelBandit:
    def __init__(self, tiers: dict[str, list[LLMModel]]):
        self.tier_bandit = UCB1Bandit(arms=list(tiers.keys()), c=0.5)
        self.model_bandits = {
            tier: ThompsonSamplingBandit(arms=models, prior_alpha=1, prior_beta=1)
            for tier, models in tiers.items()
        }

    def select(self, context: MutationContext) -> LLMModel:
        # Override: use cheap model when budget is low
        if context.remaining_budget < context.total_budget * 0.1:
            return self.model_bandits["fast"].select()

        tier = self.tier_bandit.select()
        return self.model_bandits[tier].select()

    def update(self, model: LLMModel, tier: str,
               delta_score: float, cost: float, quality: float):
        # Reward = quality / cost (efficiency-aware)
        efficiency = quality / (cost + 1e-8)
        self.tier_bandit.update(tier, efficiency)
        self.model_bandits[tier].update(model, efficiency)

8.3 Thompson Sampling for Deep Refinement

When OmniEvolve switches to AB-MCTS tree search mode, it uses Thompson Sampling to balance exploration (trying new LLMs) vs. exploitation (using the best-performing one for the current problem):

$$ \theta_i \sim \mathrm{Beta}(\alpha_i + \mathrm{successes}_i,\ \beta_i + \mathrm{failures}_i) \quad \text{(Eq. 7 — Thompson Sampling for Model Selection)} $$

Each model arm has a Beta posterior. At each selection step, sample $\theta_i$ from each model's posterior and select $\arg\max_i \theta_i$. This provides automatic exploration-exploitation balance without explicit $c$ tuning.
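A minimal binary-reward version of this bandit might look like the sketch below. It is an assumption-laden illustration: the orchestrator in 8.2 feeds a continuous efficiency reward, which would require binarizing the reward or a Gaussian-posterior variant.

```python
import random

class ThompsonSamplingBandit:
    """Beta-posterior Thompson Sampling over model arms (Eq. 7 sketch)."""
    def __init__(self, arms: list[str],
                 prior_alpha: float = 1.0, prior_beta: float = 1.0):
        # One Beta(alpha, beta) posterior per arm.
        self.posterior = {a: [prior_alpha, prior_beta] for a in arms}

    def select(self) -> str:
        # Sample theta_i ~ Beta(alpha_i, beta_i) per arm; pick the argmax.
        samples = {a: random.betavariate(ab[0], ab[1])
                   for a, ab in self.posterior.items()}
        return max(samples, key=samples.get)

    def update(self, arm: str, success: bool):
        if success:
            self.posterior[arm][0] += 1  # alpha accumulates successes
        else:
            self.posterior[arm][1] += 1  # beta accumulates failures
```

Sampling from the posterior, rather than using its mean, is what gives the automatic exploration: uncertain arms occasionally draw high samples and get tried.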

9. Component: Diagnostic Feedback System (ASI)

Sources: Actionable Side Information (GEPA) • Reflection-driven mutation (GEPA / ReEvo)

9.1 ASI Schema

Every evaluation function in OmniEvolve must return an EvaluationResult with a structured ActionableSideInfo object alongside the numeric score:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class FailedTest:
    input: Any
    expected: Any
    actual: Any
    error_type: str     # "wrong_answer", "timeout", "exception"
    traceback: str | None = None

@dataclass
class ActionableSideInfo:
    # Required field
    score: float                          # Primary fitness metric

    # Diagnostic fields (optional, but strongly recommended)
    failed_tests: list[FailedTest] = field(default_factory=list)   # Specific failed inputs
    performance_profile: "PerformanceProfile | None" = None        # Runtime bottlenecks
    error_type: str | None = None         # "timeout", "wrong_answer", "crash", etc.
    error_message: str | None = None

    # Structural analysis
    complexity_estimate: str | None = None   # "O(n^2)", "O(n log n)", etc.
    memory_usage_mb: float | None = None
    bottleneck_function: str | None = None   # Function consuming most time

    # Custom fields (domain-specific)
    custom_metrics: dict[str, Any] = field(default_factory=dict)

    # For prompt evolution: what would help the next mutation?
    suggested_fix_approach: str | None = None   # e.g., "memoize recursive calls"
    blocking_insight: str | None = None         # Key insight needed to improve
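
As an illustration, a minimal evaluator wrapper could populate these diagnostics as follows (a sketch: the function name `evaluate_with_asi` is hypothetical, and it returns a plain dict mirroring the schema's fields rather than the dataclasses themselves):

```python
import traceback

def evaluate_with_asi(candidate_fn, test_cases, max_failures=3):
    """Run candidate_fn over (input, expected) pairs and build ASI-style diagnostics."""
    failed = []
    passed = 0
    for inp, expected in test_cases:
        try:
            actual = candidate_fn(inp)
        except Exception:
            failed.append({"input": inp, "expected": expected, "actual": None,
                           "error_type": "exception",
                           "traceback": traceback.format_exc()})
            continue
        if actual == expected:
            passed += 1
        else:
            failed.append({"input": inp, "expected": expected, "actual": actual,
                           "error_type": "wrong_answer", "traceback": None})
    return {
        "score": passed / len(test_cases),
        "failed_tests": failed[:max_failures],   # top-K failures for the prompt
        "error_type": failed[0]["error_type"] if failed else None,
    }
```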

9.2 Reflection-Driven Mutation Prompt

REFLECTION_PROMPT = """
You are improving an algorithm. Analyze the diagnostic feedback carefully before proposing changes.

## Current Program
```python
{program_code}
```

## Evaluation Result
- Score: {score} / {max_score}
- Error type: {error_type}

## Failed Test Cases (top 3)
{failed_tests_formatted}

## Performance Profile
- Bottleneck: {bottleneck_function} ({bottleneck_pct}% of runtime)
- Estimated complexity: {complexity_estimate}
- Memory usage: {memory_usage_mb} MB

## Suggested approach from evaluator
{suggested_fix_approach}

## Relevant learning log insights
{learning_log_insights}

## Task
1. REFLECT: In 2-3 sentences, diagnose the root cause of the failures.
2. PROPOSE: Describe your fix strategy in 1-2 sentences.
3. IMPLEMENT: Provide the complete improved program.

Start with "REFLECTION:" then "STRATEGY:" then "```python".
"""

## 10. Component: Knowledge Sharing & Learning Logs

**Sources:** Learning log system (Darwinian Evolver) • Meta-recommendations (ShinkaEvolve) • Skill store (GEPA Skills)

### 10.1 Learning Log Schema

The learning log is a **shared, cross-population memory**. Every mutation — successful or not — generates a log entry. LLMs read compressed summaries of this log when proposing new mutations.

```python
@dataclass
class LearningLogEntry:
    # Identity
    entry_id: UUID
    timestamp: datetime
    generation: int
    island_id: str

    # What was tried
    mutation_mode: MutationMode            # diff/full/cross/reflect/failure
    parent_program_ids: list[UUID]
    mutation_description: str             # 1-2 sentence summary of what changed

    # What happened
    pre_score: float
    post_score: float
    delta_score: float
    asi_before: ActionableSideInfo
    asi_after: ActionableSideInfo

    # Meta
    llm_model_used: str
    llm_cost_usd: float
    evaluation_time_s: float

    # Pattern tags (auto-generated by LLM summarizer)
    pattern_tags: list[str]   # e.g., ["memoization", "recursion_to_iter", "O(n2)_to_O(nlogn)"]
    outcome: Literal["improvement", "neutral", "regression", "crash"]
```

10.2 Log Compression and Retrieval

The learning log grows without bound; LLMs have finite context windows. OmniEvolve uses a hierarchical compression strategy:

RAW LOG ENTRIES (individual mutations)
         │  Every 100 entries
         ▼
COMPRESSED SUMMARIES (LLM-generated, ~100 tokens each)
         │  Every 10 summaries
         ▼
INSIGHT DIGESTS (pattern clusters, ~50 tokens each)
         │  At query time
         ▼
RETRIEVED CONTEXT (top-K relevant by embedding similarity to current mutation)
class LearningLogRetriever:
    def get_relevant_context(
        self,
        current_program: str,
        mutation_mode: MutationMode,
        max_tokens: int = 2000
    ) -> str:
        # Embed current program + mutation mode description
        query_embedding = embed(current_program + str(mutation_mode))

        # Retrieve top-K most similar entries by pattern
        relevant_summaries = self.vector_store.similarity_search(
            query=query_embedding, k=10
        )
        # Retrieve top-K failure patterns matching current ASI
        failure_patterns = []
        if self.current_asi:
            failure_patterns = self.failure_index.get_similar(
                self.current_asi.error_type, k=5
            )

        # Retrieve top-K successful mutation patterns
        successful_patterns = self.success_index.get_top(k=5)

        # Format and truncate to max_tokens
        return self._format_context(relevant_summaries, failure_patterns,
                                    successful_patterns, max_tokens)

10.3 Meta-Recommendations Generation

Every N generations (the `meta_rec_interval` setting), OmniEvolve asks an LLM to synthesize the learning log into high-level strategic recommendations. These become permanent entries in the Meta-Recommendation Store and are injected into all future mutation prompts:

META_REC_PROMPT = """
Analyze these {N} mutation outcomes and extract 3-5 high-level insights.
For each insight: (1) state the principle, (2) show 2 examples that support it,
(3) state when to apply it.

Focus on: what types of changes consistently improve performance?
What approaches consistently fail? What patterns emerge across successful mutations?

Mutation history summary:
{compressed_log_summary}

Output as JSON: [{{"insight": str, "examples": [str], "apply_when": str}}]
"""

10.4 Skill Store (Cross-Model Transfer)

Inspired by GEPA Skills, OmniEvolve maintains a Skill Store: a repository of extracted algorithmic knowledge that transfers across LLM models. Skills learned by expensive models become available to cheap models, enabling cost-efficient exploitation:

@dataclass
class Skill:
    name: str                      # e.g., "sliding_window_for_substring_search"
    description: str               # natural language description
    template_code: str             # illustrative code pattern
    applicable_patterns: list[str] # patterns that suggest this skill
    success_evidence: list[str]    # programs where applying improved score
    transfer_count: int            # times successfully transferred across models

11. Component: Prompt Co-Evolution

Sources: Prompt co-evolution (ShinkaEvolve v1.1) • System prompt archive (ShinkaEvolve)

11.1 How Prompts Evolve

OmniEvolve maintains a Prompt Population alongside the program population. Each mutation mode has its own prompt population. Prompts are selected, used, evaluated by their downstream effect on mutation success, and themselves mutated:

PROMPT POPULATION (one per mutation mode)
┌──────────────────────────────────────────────────────────┐
│  Prompt_1: score=0.34, uses=120, delta_avg=+0.021         │
│  Prompt_2: score=0.41, uses=85,  delta_avg=+0.038  ← best│
│  Prompt_3: score=0.29, uses=60,  delta_avg=+0.011         │
│  Prompt_4: score=0.22, uses=40,  delta_avg=-0.003         │
└──────────────────────────────────────────────────────────┘
        │                           │
   Prompt selection            Prompt mutation
   (power-law rank)            (LLM generates new variant)
        │                           │
        └──────────────────────────►│
                                    ▼
                           New prompt evaluated by
                           downstream mutation quality

11.2 Prompt Fitness Metric

$$ \mathrm{fitness}(\mathrm{prompt}) = \mathrm{EMA}_t\left(\Delta\mathrm{score} \mid \text{prompt used at } t\right) \quad \text{(Eq. 8 — Prompt Fitness)} $$

EMA denotes an exponential moving average (α = 0.1) of the delta-score achieved by mutations using this prompt. Prompts with negative or zero EMA are candidates for mutation or elimination.
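
The EMA bookkeeping of Eq. 8 is a few lines of code; a sketch (the class name `PromptFitnessTracker` is hypothetical, with α = 0.1 as in the text):

```python
class PromptFitnessTracker:
    """Track an EMA of delta-score per prompt and flag weak prompts."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.fitness = {}  # prompt_id -> EMA of delta_score
        self.uses = {}

    def record(self, prompt_id, delta_score):
        prev = self.fitness.get(prompt_id)
        if prev is None:
            self.fitness[prompt_id] = delta_score  # initialize EMA at first sample
        else:
            self.fitness[prompt_id] = self.alpha * delta_score + (1 - self.alpha) * prev
        self.uses[prompt_id] = self.uses.get(prompt_id, 0) + 1

    def elimination_candidates(self):
        # Prompts with non-positive EMA are candidates for mutation or removal
        return [p for p, f in self.fitness.items() if f <= 0.0]
```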

11.3 ASI-Informed Prompt Mutation

A key OmniEvolve innovation: when a prompt consistently produces mutations that fail with specific ASI diagnostics (e.g., "recursion overflow"), the prompt is mutated to specifically address that failure pattern:

PROMPT_MUTATION_TEMPLATE = """
This mutation prompt has been used {n_times} times with average delta-score {avg_delta:.4f}.

Here are the most common failure patterns it produces:
{top_failure_patterns}

Here is the current prompt:
---
{current_prompt}
---

Generate an improved version of this prompt that:
1. Addresses the most common failure: "{top_failure}"
2. Preserves what's working (the prompt works well when: {success_contexts})
3. Adds explicit guidance about: {suggested_addition}

Output ONLY the new prompt text, no explanations.
"""

12. Component: Selective Self-Improvement

Sources: Self-modification (DGM, Gödel Machine) • Skill accumulation (GEPA Skills)

12.1 What May and May Not Self-Modify

✓ Safe to Self-Modify (Mutable Zone)

  • System prompts for each mutation mode
  • Parent selection weights (α for power-law)
  • Mutation mode ratios per island
  • Novelty filter thresholds
  • LLM model selection preferences
  • Meta-recommendation injection logic
  • Learning log query strategies

✗ Immutable (Never Self-Modified)

  • Evaluation function / fitness oracle
  • Sandbox execution harness
  • Cost budget guards and limits
  • Safety checks (timeout, OOM kill)
  • Population database read/write interface
  • Snapshot and rollback mechanism
  • The self-improvement module itself
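
The split above can be enforced with a simple default-deny allow-list guard, checked before any self-modification is applied. A sketch (the component names are illustrative shorthand for the bullet items above, not identifiers from the source):

```python
MUTABLE_COMPONENTS = frozenset({
    "mutation_prompts", "parent_selection_weights", "mutation_mode_ratios",
    "novelty_thresholds", "model_selection_prefs",
    "meta_rec_injection", "log_query_strategy",
})

IMMUTABLE_COMPONENTS = frozenset({
    "fitness_oracle", "sandbox_harness", "budget_guards", "safety_checks",
    "population_db_interface", "snapshot_rollback", "self_improvement_module",
})

def check_self_modification(component: str) -> None:
    """Raise before any self-modification touches a non-mutable component."""
    if component in IMMUTABLE_COMPONENTS:
        raise PermissionError(f"Component '{component}' is outside the mutable zone")
    if component not in MUTABLE_COMPONENTS:
        raise PermissionError(f"Unknown component '{component}'; default-deny")
```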

12.2 Snapshot-Rollback Protocol

class SelfImprovementModule:
    def propose_improvement(self, component: str, current_config: dict) -> tuple[dict, Snapshot]:
        """LLM proposes a new configuration for a mutable component.
        Returns (proposed_config, snapshot) so the caller can roll back."""
        snapshot = self.take_snapshot(component)

        proposed_config = self.llm_propose(
            component=component,
            current_config=current_config,
            performance_history=self.get_history(component, n=100),
            meta_recommendations=self.meta_rec_store.get_relevant(component)
        )
        return proposed_config, snapshot

    def evaluate_improvement(
        self, snapshot_id: UUID, proposed_config: dict,
        component: str, n_trials: int = 50
    ) -> bool:
        """A/B test: run N mutations with proposed vs current config."""
        results_proposed = self.run_with_config(proposed_config, n_trials)
        results_current = self.run_with_config(self.current_config, n_trials)

        # Statistical test: t-test, p < 0.05
        t_stat, p_value = scipy.stats.ttest_ind(results_proposed, results_current)
        improvement = np.mean(results_proposed) - np.mean(results_current)

        if p_value < 0.05 and improvement > 0:
            self.apply_config(proposed_config, component)
            self.log_improvement(component, improvement, snapshot_id)
            return True
        else:
            self.rollback_to_snapshot(snapshot_id)
            return False

13. Component: Cost Control System

Sources: Committed cost model (ShinkaEvolve) • Cascade evaluation (AlphaEvolve) • Early stopping (ShinkaEvolve) • Per-iteration budget (OpenEvolve)

13.1 Committed Cost Budget Guard

The budget guard tracks committed cost = realized cost + in-flight cost (API calls not yet returned). This prevents overspending due to concurrent calls:

$$ C_{\mathrm{committed}}(t) = C_{\mathrm{realized}}(t) + C_{\mathrm{inflight}}(t) \le B_{\mathrm{total}} \quad \text{(Eq. 9 — Committed Cost Guard)} $$

class CommittedBudgetGuard:
    def __init__(self, total_budget_usd: float):
        self.total = total_budget_usd
        self._realized = 0.0
        self._inflight: dict[UUID, float] = {}  # call_id → estimated_cost
        self._lock = asyncio.Lock()

    async def reserve(self, estimated_cost: float) -> UUID | None:
        async with self._lock:
            committed = self._realized + sum(self._inflight.values())
            if committed + estimated_cost > self.total:
                return None  # budget exceeded, reject call
            call_id = uuid4()
            self._inflight[call_id] = estimated_cost
            return call_id

    async def settle(self, call_id: UUID, actual_cost: float):
        async with self._lock:
            self._inflight.pop(call_id, None)  # release the reservation
            self._realized += actual_cost

    @property
    def remaining(self) -> float:
        return self.total - self._realized - sum(self._inflight.values())

13.2 Cascade Evaluation Pipeline

| Stage | Test Coverage | Cost | Reject Threshold | Avg. Reject Rate |
|---|---|---|---|---|
| Stage 0: Syntax Check | Parse + type check | ~$0.00 | Any syntax error | 5–15% |
| Stage 1: Mini-eval | 5% of test suite (random) | ~$0.01 | score < 0.3 | 20–40% |
| Stage 2: Medium-eval | 30% of test suite | ~$0.05 | score < threshold | 15–25% |
| Stage 3: Full-eval | 100% + ASI generation | ~$0.10–$1.00 | none (final stage) | n/a |
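
The staged gating can be sketched as a short pipeline. Assumptions: the function name `cascade_evaluate` and its callback signatures are illustrative, the stage fractions mirror the `cascade_stages` config, and the stage-2 threshold (0.6 here) is a placeholder since the document leaves it configurable:

```python
def cascade_evaluate(program, full_suite, run_subset, check_syntax,
                     stages=(0.05, 0.30, 1.00), thresholds=(0.3, 0.6)):
    """Evaluate in stages; reject early to avoid paying full-eval cost.
    run_subset(program, tests) -> score in [0, 1]."""
    if not check_syntax(program):            # Stage 0: near-free syntax/type gate
        return None
    score = 0.0
    for i, frac in enumerate(stages):
        n = max(1, int(len(full_suite) * frac))
        score = run_subset(program, full_suite[:n])  # sketch: prefix stands in for a random subset
        if i < len(thresholds) and score < thresholds[i]:
            return None                      # rejected at this stage
    return score                             # survived the full evaluation
```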

13.3 Bayesian Early Stopping

For stochastic evaluations (e.g., average over N random seeds), OmniEvolve uses sequential probability ratio testing to stop early when the outcome is clear:

$$ \Lambda_t = \sum_{i=1}^{t} \log \frac{p(x_i \mid H_1)}{p(x_i \mid H_0)} \quad \text{(Eq. 10 — SPRT Log-Likelihood Ratio)} $$

H_0: the new program is not better than the best known. H_1: the new program is better. Sampling stops early when Λ_t crosses either decision boundary, with Type I/II error rates α = β = 0.05. Inspired by ShinkaEvolve's CI-based early stopping.

14. Component: Safety & Sandboxing

Sources: Sandbox execution (all systems) • Safety discussion (DGM paper)

14.1 Multi-Layer Sandboxing

| Layer | Mechanism | Protects Against |
|---|---|---|
| Layer 1: Process | subprocess with resource limits (ulimit, setrlimit) | CPU/RAM/disk exhaustion |
| Layer 2: Container | Docker/podman with no network, read-only FS, seccomp | Network access, filesystem modification |
| Layer 3: Static Analysis | AST scan for dangerous imports/calls before execution | Known malicious patterns (`os.system`, `__import__`) |
| Layer 4: Timeout | Hard wall-clock timeout (SIGKILL after T seconds) | Infinite loops, deadlocks |
| Layer 5: Output Validation | Schema validation on evaluation result before use | Evaluation function injection |
import re

# NOTE: regex screening is a coarse first pass; the AST scan (Layer 3)
# additionally catches aliased or indirect imports these patterns miss.
BLOCKED_PATTERNS = [
    r"import\s+os", r"import\s+subprocess", r"import\s+sys",
    r"__import__", r"exec\s*\(", r"eval\s*\(",
    r"open\s*\(", r"socket\.", r"urllib\.", r"requests\.",
    r"ctypes\.", r"cffi\.", r"multiprocessing\."
]

def static_safety_check(code: str) -> tuple[bool, str]:
    """Return (is_safe, reason)."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, code):
            return False, f"Blocked pattern: {pattern}"
    return True, ""
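
Layer 1 (process-level limits) can be sketched with a `preexec_fn` that applies `setrlimit` in the child before exec. This is POSIX-only and the helper name `run_limited` is hypothetical; the defaults mirror the safety config (`timeout_s: 30`, `memory_limit_mb: 512`):

```python
import resource
import subprocess
import sys

def run_limited(code: str, timeout_s: int = 30, memory_mb: int = 512):
    """Run candidate code in a child process with CPU and memory caps.
    Layers 2-5 (container, static analysis, output validation) wrap around this."""
    def set_limits():
        mem = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))               # address space cap
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # CPU seconds cap
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,      # applied in the child before exec (POSIX only)
        timeout=timeout_s,          # wall-clock kill (Layer 4)
        capture_output=True, text=True,
    )
```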

15. Component: Hybrid Search Strategy

Sources: AB-MCTS (Sakana AB-MCTS / TreeQuest) • Iterative refinement (Confluence Labs)

15.1 Search Router Logic

OmniEvolve uses an intelligent router that switches between evolutionary search and tree search based on the current optimization landscape:

SEARCH ROUTER (runs every K generations)

Is the problem "narrow and deep"?              Is the problem "broad and shallow"?
(few good solutions, need fine-grained         (many local optima, need exploration,
 refinement, single known promising approach)   wide search space)
                 │                                              │
                 ▼                                              ▼
         AB-MCTS Tree Search                        Evolutionary Loop
         ─────────────────                          ─────────────────
         Start from best solution                   MAP-Elites + Islands
         Thompson Sampling for depth/width          Multi-modal mutation
         Refinement-focused prompts                 Diversity-preserving selection
         Inject improvements back to population     Tree search on stagnant islands
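
A minimal version of the router's decision rule, applied per island (a sketch: the `island_stats` structure is hypothetical, and the stagnation window mirrors `mcts_trigger_stagnation: 100` in the configuration):

```python
def route_search(island_stats, stagnation_window=100):
    """Choose tree search for stagnant islands, evolutionary search otherwise.
    island_stats: island_id -> generations since last score improvement."""
    routes = {}
    for island_id, stagnant_gens in island_stats.items():
        if stagnant_gens >= stagnation_window:
            routes[island_id] = "ab_mcts"       # narrow-and-deep: refine the best solution
        else:
            routes[island_id] = "evolutionary"  # broad-and-shallow: keep exploring
    return routes
```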

15.2 Adaptive Branching MCTS

Inspired by AB-MCTS (Sakana), when OmniEvolve activates tree search mode, it uses Thompson Sampling to decide whether to deepen (refine current solution) vs. branch (generate new sibling):

$$ P(\mathrm{branch}) = P(\theta_{\mathrm{branch}} > \theta_{\mathrm{deepen}}), \quad \theta_i \sim \mathrm{Beta}(\alpha_i, \beta_i) \quad \text{(Eq. 11 — AB-MCTS Branching Probability)} $$

class AdaptiveBranchMCTS:
    def __init__(self, root_program: Program, llm_ensemble: LLMEnsemble):
        self.tree = SearchTree(root=SearchNode(root_program))
        self.branch_bandit = ThompsonSamplingBandit(["branch", "deepen"])
        self.llm = llm_ensemble

    def search(self, n_steps: int) -> Program:
        for _ in range(n_steps):
            node = self.select_node()  # UCT selection
            action = self.branch_bandit.select()

            if action == "deepen":
                child = self.refine(node)   # Diff patch mutation
            else:
                child = self.branch(node)   # Full rewrite from this node

            score = evaluate(child)
            delta = score - node.program.score
            self.branch_bandit.update(action, success=(delta > 0))
            self.backpropagate(node, delta)

        return self.tree.best_leaf().program

16. Formal Mathematical Specification

16.1 Problem Formulation

Let P be the space of programs (text over some programming language alphabet Σ). Let f: P → ℝ be the fitness function returning a scalar score. The optimization objective is:

$$ p^{*} = \arg\max_{p \in P} f(p) \quad \text{(Eq. 12 — Primary Optimization Objective)} $$

In the multi-objective setting, with objectives f1, ..., fk:

$$ P^{*} = \{\, p \in P \mid \neg\exists\, p' \in P : f_i(p') \ge f_i(p)\ \forall i \ \wedge\ f_j(p') > f_j(p) \text{ for some } j \,\} \quad \text{(Eq. 13 — Pareto Optimal Set)} $$
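
Eq. 13 translates directly into a short dominance check; a minimal sketch maximizing all objectives (the function name is illustrative):

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of points, each a tuple (f1, ..., fk)."""
    def dominates(a, b):
        # a dominates b: >= in every objective and strictly > in at least one
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```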

16.2 OmniEvolve State Space

The full state of OmniEvolve at generation t is:

$$ S_t = (\Pi_t, L_t, \Phi_t, \Theta_t) \quad \text{(Eq. 14 — System State)} $$

Where: Π_t = population (MAP-Elites grid + islands + archive), L_t = learning log (history of all mutations), Φ_t = prompt archive (population of mutation prompts), Θ_t = bandit state (model selection posteriors + mutation mode weights).

16.3 Transition Dynamics

One evolutionary step is a stochastic transition St → St+1:

$$ S_{t+1} = \mathrm{Update}(S_t, p_{\mathrm{new}}, r_t, d_t) \quad \text{(Eq. 15 — State Transition)} $$

Where p_new is the generated program, r_t = f(p_new) is the reward, d_t is the ASI diagnostic, and Update() applies all four updates: population (Π), log (L), prompts (Φ), bandits (Θ).

16.4 Sample Efficiency Bound

Let E[T] be the expected number of evaluations to reach within ε of the global optimum f(p*). OmniEvolve achieves the following improvement over naive random search:

$$ E[T]_{\mathrm{OmniEvolve}} \le E[T]_{\mathrm{random}} \cdot (1 - p_{\mathrm{reject}}) \cdot (1 - p_{\mathrm{early\_stop}}) \quad \text{(Eq. 16 — Sample Efficiency Improvement Bound)} $$

Where p_reject ≈ 0.6–0.8 (fraction of mutations rejected by the novelty filter before evaluation) and p_early_stop ≈ 0.3–0.5 (fraction of evaluations stopped early by SPRT). Combined, these mechanisms effectively reduce the evaluation count by roughly 4–10x compared to systems without them. Empirical values from ShinkaEvolve analysis.
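
Plugging the quoted ranges into the combined filter factor reproduces the claimed 4–10x figure (a quick numeric check; the helper name is illustrative):

```python
def evaluation_reduction_factor(p_reject, p_early_stop):
    """Effective evaluation-count reduction from the two filters in Eq. 16."""
    return 1.0 / ((1.0 - p_reject) * (1.0 - p_early_stop))

low = evaluation_reduction_factor(0.6, 0.3)   # conservative end of both ranges, ~3.6x
high = evaluation_reduction_factor(0.8, 0.5)  # aggressive end of both ranges, 10x
```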

16.5 Learning Log Value Function

The value of the learning log Lt to a new mutation proposal can be formalized as a mutual information quantity:

$$ V(L_t) = I(\Delta r_{t+1};\ L_t \mid p_{\mathrm{parent}}, \mathrm{mode}_t) \quad \text{(Eq. 17 — Learning Log Information Value)} $$

Where I(·;·) is the mutual information between the reward delta of the next mutation and the learning log history. High V(L_t) means the log is highly predictive of mutation outcomes, a signal that meta-recommendation synthesis should run.

17. Implementation Roadmap

17.1 Technology Stack Recommendations

| Layer | Recommended | Alternatives | Reason |
|---|---|---|---|
| Language | Python 3.12+ | (none) | Ecosystem compatibility with all 17 surveyed systems |
| Async framework | asyncio + aiohttp | trio | ShinkaEvolve's AsyncEvolutionRunner pattern is proven |
| LLM client | LiteLLM | langchain, direct SDKs | Unified API across all providers; cost tracking built in |
| Vector store | Qdrant (self-hosted) | ChromaDB, FAISS | Persistent, fast, filterable, supports metadata |
| Population DB | SQLite + JSON (small) / PostgreSQL (large) | DuckDB | Simple, reliable, queryable, no infra overhead |
| Sandbox | Docker + restrictedpython | Firejail, nsjail | Two-layer defense: Docker for infra isolation, restrictedpython for code |
| Config | Hydra (OmegaConf) | pydantic-settings | ShinkaEvolve uses Hydra; excellent sweep support |
| Monitoring | Prometheus + Grafana | W&B, MLflow | Real-time cost, score, and population diversity tracking |
| Experiment tracking | MLflow | W&B | Free, self-hosted, full artifact tracking |

17.2 Phased Implementation Plan

Phase 1: Core Evolution Loop (Weeks 1–3). Implement: async orchestrator, basic island model (3 islands), diff+full mutation modes, sandbox executor, basic fitness database. Target: reproduce OpenEvolve-level capability. Validate on: cap_set problem, TSP heuristic.

Phase 2: MAP-Elites + Novelty (Weeks 4–5). Add: MAP-Elites grid with configurable behavioral descriptors, embedding-based novelty pre-filter, LLM novelty judge, power-law parent selection. Target: +20% coverage of behavioral space vs. basic islands.

Phase 3: LLM Bandit Orchestration (Week 6). Add: UCB1 model selection bandit, LiteLLM integration for all providers, committed budget guard, cascade evaluation pipeline, SPRT early stopping. Target: 50% cost reduction vs. Phase 2.

Phase 4: Diagnostic Feedback (Weeks 7–8). Add: ASI schema, evaluator wrapper that returns diagnostics, reflection-driven mutation mode (Mode D), failure-case-driven mutation mode (Mode E). Target: +15% average fitness on problems with clear failure diagnostics.

Phase 5: Learning Logs (Weeks 9–10). Add: learning log database, log compression and retrieval, meta-recommendation generation, log injection into mutation prompts. Target: measurable improvement in mutation success rate on generations 50+ vs. 1–50.

Phase 6: Prompt Co-Evolution (Weeks 11–12). Add: prompt population per mutation mode, prompt fitness tracking, ASI-informed prompt mutation. Target: prompt fitness improves monotonically over 500+ mutations.

Phase 7: AB-MCTS Tree Search Integration (Weeks 13–14). Add: search router, adaptive branching MCTS, Thompson Sampling for depth/width, re-injection of tree search improvements into the population. Target: solve problems that pure evolutionary search fails on within the same budget.

Phase 8: Selective Self-Improvement (Weeks 15–17). Add: self-improvement module, snapshot-rollback system, A/B testing for configuration changes, immutability enforcement. Target: demonstrable improvement of mutation strategies over 1000+ generations.

Phase 9: Skill Store + Cross-Model Transfer (Weeks 18–20). Add: skill extraction from successful programs, skill store API, skill injection into cheap model prompts. Target: cheap model (Haiku) achieves 85%+ of expensive model (Opus) quality when skills are available.

Phase 10: Benchmarking & Evaluation (Weeks 21–24). Evaluate on: ALE-Bench (competitive programming), ARC-AGI-2 (generalization), mathematical discovery (AIME), infrastructure optimization (matrix multiply). Compare against ShinkaEvolve, OpenEvolve, GEPA, Darwinian Evolver.

17.3 Key Configuration Parameters

# omnievolve_config.yaml (Hydra)
evolution:
  n_islands: 8
  island_types: [exploit, explore, diversity, reflect, crossover]
  generations: 1000
  async: true
  max_concurrent_mutations: 16

population:
  map_elites_resolution: [20, 20]
  archive_max_size: 10000
  archive_compression_interval: 500
  migration_interval: 50
  migration_top_k: 3

mutation:
  modes: [diff, full, cross, reflect, failure, runtime]
  bandit_c: 0.3
  stagnation_window: 50
  stagnation_epsilon: 0.0001

novelty:
  embedding_model: text-embedding-3-small
  embedding_similarity_threshold: 0.92
  llm_novelty_judge: gemini-flash-2.0
  n_neighbors: 10

llm:
  provider: litellm
  models:
    tier1: [gemini/gemini-2.0-flash, claude-haiku-4-5]
    tier2: [claude-sonnet-4-6, gpt-4o]
    tier3: [claude-opus-4-6]
    tier4: [ollama/qwen2.5-coder:32b]
  bandit_c: 0.5
  thompson_prior_alpha: 1
  thompson_prior_beta: 1

cost:
  total_budget_usd: 100.0
  cascade_stages: [0.05, 0.30, 1.00]  # fraction of test suite per stage
  cascade_thresholds: [0.30, null]      # reject if below threshold
  early_stopping_alpha: 0.05
  early_stopping_beta: 0.05

prompt_evolution:
  enabled: true
  population_size: 10
  mutation_interval: 100  # mutate prompts every N mutations
  ema_alpha: 0.1

learning_log:
  enabled: true
  compression_interval: 100
  meta_rec_interval: 500
  retrieval_k: 10
  max_context_tokens: 2000

self_improvement:
  enabled: true
  trial_n: 50
  significance_threshold: 0.05
  snapshot_retention: 5  # keep last 5 snapshots

search:
  strategy: adaptive  # evolutionary | mcts | adaptive
  mcts_trigger_stagnation: 100
  mcts_steps: 50
  thompson_prior_alpha: 1
  thompson_prior_beta: 1

safety:
  sandbox: docker
  timeout_s: 30
  memory_limit_mb: 512
  network_disabled: true
  static_analysis: true

18. Expected Performance & Benchmarks

18.1 Performance Projections vs. Individual Systems

| Benchmark | ShinkaEvolve | OpenEvolve | GEPA | OmniEvolve (projected) | Gain Source |
|---|---|---|---|---|---|
| ALE-Bench Score (competitive prog.) | ~60th percentile | ~55th percentile | ~65th percentile | ~75–80th percentile | Learning logs + ASI + AB-MCTS tree search |
| Cost per improvement unit | $0.05–0.20 | $0.10–0.40 | $0.05–0.25 | $0.02–0.10 | Cascade eval + novelty pre-filter + bandit models |
| Evaluations to first valid solution | ~100 | ~200 | ~80 | ~40–60 | 2-tier novelty + cascade + SPRT early stopping |
| Long-run improvement (gen 500+) | Plateau risk | Plateau risk | Plateau risk | Continued improvement | Prompt co-evolution + self-improvement |
| Cross-problem transfer | Limited | None | Skills only | Skill store + learning logs | Skill store + cross-model transfer |

18.2 Evaluation Plan for PhD Validation

Benchmark 1: ALE-Bench

Run OmniEvolve on ALE-Bench's 40 competitive programming problems. Measure: percentile rank, cost per problem, evaluations per problem. Compare against published ShinkaEvolve and GEPA results.


Benchmark 2: Mathematical Discovery

Replicate AlphaEvolve's cap_set and matrix multiplication problems. Measure: best known values achieved, evaluation count, cost. Validate that population diversity (MAP-Elites) finds the same or better solutions.


Benchmark 3: Ablation Study

Disable one component at a time and measure performance degradation: (1) no novelty filter, (2) no ASI, (3) no learning logs, (4) no prompt evolution, (5) no bandit model selection. Quantify each component's contribution.


Benchmark 4: Long-Run Improvement

Run 2000 generations on a hard problem and measure score-vs-generation curve. Validate that prompt co-evolution and self-improvement prevent plateau. Compare against ShinkaEvolve and OpenEvolve on same problem.


19. Conclusion & PhD Research Directions

19.1 Summary of Design Choices

OmniEvolve synthesizes the best mechanisms from across the field into a unified, principled architecture. The key design decisions are:

| Decision | Choice | Rationale | Source |
|---|---|---|---|
| Population Structure | MAP-Elites + Islands + Archive (3-layer) | Combines behavioral diversity (MAP), isolation (islands), and long-term memory (archive) | AlphaEvolve + ShinkaEvolve + DGM |
| Mutation Engine | 6 modes with UCB1 bandit scheduling | No single mutation mode is optimal; the bandit learns what works per problem and per phase | ShinkaEvolve + GEPA + Darwinian Evolver |
| Parent Selection | Power-law rank × novelty penalty × recency | Composite weight avoids oversampling any individual while favoring high-fitness parents | ShinkaEvolve + Darwinian Evolver |
| Novelty Filtering | 2-stage: embedding + LLM judge | Cheap embedding filter removes obvious duplicates; the LLM judge handles semantic similarity | ShinkaEvolve |
| LLM Orchestration | Hierarchical bandit (tier, then model) | Prevents expensive models from starving cheap ones; efficiency-aware reward (quality/cost) | ShinkaEvolve + AB-MCTS |
| Diagnostic Feedback | Mandatory ASI schema from all evaluators | Structured diagnostics enable reflection-driven mutation, failure-targeted repair, and log compression | GEPA |
| Knowledge Sharing | Cross-population learning logs + skill store | Individual islands learn in isolation; logs share discoveries; skills transfer to cheap models | Darwinian Evolver + GEPA Skills |
| Prompt Evolution | ASI-informed prompt mutation per mode | Prompts that consistently produce diagnostic failures are specifically mutated to address those failures | ShinkaEvolve (extended) |
| Self-Improvement | Scoped to mutable zone + A/B validation | Controllable improvement without risking evaluation integrity; statistical validation prevents regressions | DGM (constrained) |
| Search Strategy | Adaptive: evolutionary + AB-MCTS | Population search for broad coverage; tree search for deep refinement on stagnant islands | AB-MCTS + ShinkaEvolve |

19.2 Open Research Questions (PhD Directions)

RQ1: Optimal Population Topology

How should islands communicate? Ring topology, hub-and-spoke, fully connected, adaptive? What migration policies maximize exploration without premature convergence? Can meta-learning discover optimal topologies per problem class?

Suggested approach: Multi-armed bandit over topology configurations; measure diversity vs. convergence trade-off.


RQ2: Behavioral Descriptor Design

For code evolution, what are the right behavioral descriptors for MAP-Elites? Code complexity, runtime profile, test coverage pattern, semantic embedding? How to design descriptors that capture meaningful diversity without explosion of cell count?

Suggested approach: Unsupervised discovery of descriptors from evaluation data using dimensionality reduction.


RQ3: Learning Log Information Value

How much does the learning log actually help? Can we formalize its value (Eq. 17) and show diminishing returns as it grows? When should old entries be purged? What compression ratio preserves maximum mutual information?

Suggested approach: Ablation + information-theoretic analysis of log compression methods.


RQ4: Prompt Evolution Convergence

Do evolved prompts converge to a fixed point, or do they continuously improve? Is there a theoretical bound on prompt fitness? Do evolved prompts generalize across problem types, or are they problem-specific?

Suggested approach: Track prompt entropy over generations; measure transfer across problem domains.


RQ5: Safe Self-Improvement Boundaries

What is the precise formal boundary between safe and unsafe self-modification in evolutionary systems? Can a type system enforce this boundary? How do safety guarantees compose when multiple components self-modify simultaneously?

Suggested approach: Formal verification framework; capability amplification bounds.


RQ6: Multi-Task Transfer Learning

Can OmniEvolve's learned skills, prompts, and meta-recommendations transfer across fundamentally different problem domains (e.g., sorting → scientific discovery → infrastructure optimization)? What is the transfer hierarchy?

Suggested approach: Curriculum learning framework; measure zero-shot and few-shot transfer across domain pairs.


19.3 The Frontier: What Comes After OmniEvolve

**The Next Level:** OmniEvolve represents the state-of-the-art integration of 2024–2026 methods. The next research frontier lies in:

**Continuous learning**: Systems that never stop improving, with formal guarantees on monotonic long-run performance

**Multi-agent co-evolution**: Populations of interacting agents that evolve communication protocols and collaborative strategies

**Neural-symbolic hybrid search**: Combining LLM-based code mutation with formal verification and constraint propagation

**Evolutionary meta-learning**: Systems that evolve their own evolutionary algorithms — a true Gödel Machine realized safely

**Human-AI co-evolution**: Humans and AI in tight feedback loops where humans provide insight, AI provides scale — the AHC058/ICFP model generalized

**Final Recommendation for PhD Implementation:** Begin with Phases 1–4 (core loop + MAP-Elites + bandit LLM + ASI diagnostics). This combination alone should surpass any single existing open-source system on most benchmarks. Phases 5–8 (learning logs + prompt evolution + self-improvement) are the genuine research contributions — they go beyond what any single surveyed system has fully implemented. Phases 9–10 provide the evaluation infrastructure for publication-quality results.


OmniEvolve Architecture Blueprint • Evolutionary AI Systems Survey 2024–2026

Based on analysis of 17 systems: AlphaEvolve, OpenEvolve, ShinkaEvolve, GEPA, LLM4AD, SkyDiscover/AdaEvolve, DGM, Darwinian Evolver, GEPA Skills, Confluence Labs, Arcgentica, AB-MCTS, AI Scientist, ALE-Bench, AHC058, ICFP2025, arxiv papers
