Next Evolution
Architectural Recommendations for the Optimal LLM-Powered Evolutionary System
A PhD-level synthesis of 17 evolutionary AI systems (2024–2026) — Design Blueprint for a Best-in-Class Implementation
Proposed System: OmniEvolve. A unified evolutionary framework synthesizing the best components from across the field: ShinkaEvolve's sample efficiency • GEPA's diagnostic feedback • Darwinian Evolver's learning logs • DGM's self-improvement • AlphaEvolve's population diversity • AB-MCTS's adaptive search • Arcgentica's runtime context • AdaEvolve's hierarchical adaptation
Table of Contents
- Design Motivation & Gap Analysis
- Core Design Principles
- Proposed Architecture: OmniEvolve
- Component: Population Management
- Component: Mutation Engine
- Component: Parent Selection
- Component: Novelty & Diversity
- Component: LLM Orchestration
- Component: Diagnostic Feedback (ASI)
- Component: Knowledge Sharing & Learning Logs
- Component: Prompt Co-Evolution
- Component: Selective Self-Improvement
- Component: Cost Control System
- Component: Safety & Sandboxing
- Component: Hybrid Search Strategy
- Formal Mathematical Specification
- Implementation Roadmap
- Expected Performance & Benchmarks
- Conclusion & PhD Research Directions
1. Design Motivation & Gap Analysis
After surveying 17 evolutionary AI systems published between 2024 and 2026, a clear picture emerges: no single system captures all the key innovations that the field has collectively discovered. Each system made a focused contribution, but the total design space has never been fully integrated. This section maps the gaps.
1.1 What Each System Got Right
| System | Key Innovation | What It Lacks |
|---|---|---|
| AlphaEvolve | MAP-Elites quality diversity, Gemini ensemble at scale, real-world infra optimization | Closed source, no prompt evolution, no learning logs, no self-improvement |
| OpenEvolve | Open MAP-Elites + islands, multi-provider LLM, strong community | No prompt evolution, no diagnostic ASI, no learning logs, basic cost control |
| ShinkaEvolve | 2-tier novelty, bandit LLM selection, prompt co-evolution, async 5-10x speedup | No structured diagnostic feedback, no learning logs, no self-modification |
| GEPA | Actionable Side Information, Pareto search, declarative API, reflection-driven mutation | No prompt evolution, no learning logs, limited population management |
| DGM | Full self-modification, expanding archive, cross-language transfer | No cost control, safety risks, no structured diagnostic feedback |
| Darwinian Evolver | Learning log system, failure-case-driven mutation, clean design | Sequential only, no prompt evolution, flat population, no novelty filtering |
| AB-MCTS | Thompson Sampling for adaptive branching, multi-LLM collaboration, tree search | No population management, not a full evolutionary framework |
| GEPA Skills | Repository-specific skill learning, cross-model transfer, persistent knowledge | Limited to skill extraction, not full evolution |
| Arcgentica | Runtime-as-context, persistent REPL, live execution as reasoning surface | Not a full evolutionary system, no population |
| LLM4AD | 7 search methods unified, GUI, broad task coverage | Kitchen-sink design; no prompt evolution, no learning logs, no ASI |
| SkyDiscover | Three-level hierarchical adaptation, globally-normalized bandits, meta-guidance, 200+ benchmarks | No prompt evolution, no learning logs, no structured ASI, no MAP-Elites, no 2-tier novelty |
1.2 The Critical Gaps No System Solves Together
Gap 1: Sample Efficiency + Diagnostic Feedback
ShinkaEvolve dramatically reduces evaluations through 2-tier novelty filtering and bandit LLM selection. GEPA adds why mutations fail via Actionable Side Information. No system does both. Combining them would make reflective mutation far more targeted, further reducing evaluations.
Gap 2: Population Diversity + Learning Logs
MAP-Elites maintains behavioral diversity in the population. Darwinian Evolver's learning logs share cross-individual discoveries. No system combines quality-diversity grids with shared experiential memory. Individuals in isolated MAP-Elites cells could share mutation history.
Gap 3: Prompt Evolution + ASI Feedback
ShinkaEvolve evolves system prompts alongside programs. GEPA provides structured diagnostic data about what went wrong. No system uses diagnostic failure data to drive prompt mutation. This would create a closed feedback loop: bad mutations → diagnose failure → improve the prompt that generated them.
Gap 4: Controlled Self-Improvement
DGM's self-modification is powerful but unsafe at scale. No system offers safe, scoped self-improvement with formal rollback guarantees. A system that selectively improves its mutation strategies and prompts (not its core evaluation logic) would be both safe and powerful.
Gap 5: Tree Search + Evolutionary Population
AB-MCTS's adaptive tree search excels at deep refinement of individual solutions. MAP-Elites/island models excel at broad exploration. No system combines tree-depth refinement with population-level diversity. The optimal strategy uses trees for promising individuals, population search for diversity.
Gap 6: Runtime Context + Evolutionary Loop
Arcgentica's runtime-as-context (persistent REPL, live execution) makes code mutations smarter. No evolutionary framework uses live execution state as context for LLM mutations. Combining this with population management would enable mutations informed by actual runtime behavior.
2. Core Design Principles
The proposed system, OmniEvolve, is guided by seven principles derived from empirical observations across all 17 surveyed systems:
P1: Sample Efficiency is the Primary Constraint
LLM API costs are the bottleneck, not algorithmic compute. Every design decision should minimize evaluations-to-convergence. From ShinkaEvolve: reject before you evaluate. From GEPA: diagnose before you retry. From Darwinian Evolver: learn before you repeat.
P2: Diagnostic Data Drives Everything
Mutation operators should never be blind. Every evaluation should return not just a score, but actionable diagnostic information: what failed, why, where. This data feeds mutation prompts, parent selection, learning logs, and prompt evolution simultaneously.
P3: Population Diversity ≥ Best Individual Score
Premature convergence to local optima is the primary failure mode. MAP-Elites grids, island isolation, and embedding-based novelty filtering must be maintained even at the cost of individual fitness. The best solutions are found by maintaining diverse exploration fronts.
P4: Meta-Level Evolution is the Multiplier
Systems that evolve how they evolve outperform those that don't. Prompt co-evolution, learning logs, and adaptive mutation scheduling compound over time. A system that gets better at mutating code is geometrically more powerful than one with fixed strategies.
P5: Safety First, Capability Second
Self-modification must be strictly scoped. The evaluation harness, fitness functions, and safety checks must be immutable. Only mutation prompts, parent selection weights, and search heuristics may self-modify. All changes require snapshot-rollback capability.
P6: Cost-Aware Architecture Throughout
Cost control is not a feature — it is a first-class architectural concern. Every component tracks its cost contribution. Bandit selection automatically uses cheaper models when they suffice. Budget guards operate at committed (in-flight) cost, not just completed cost.
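P6's "committed (in-flight) cost" idea can be made concrete with a small guard that admits an LLM call only if the budget covers both completed spend and the worst-case cost of calls still in flight. This is a minimal sketch; the `reserve`/`settle` names are illustrative, not from any surveyed system.

```python
import threading

class BudgetGuard:
    """Tracks committed (in-flight) spend, not just completed spend."""

    def __init__(self, total_budget_usd: float):
        self.total = total_budget_usd
        self.completed = 0.0   # cost of finished LLM calls
        self.committed = 0.0   # worst-case cost of calls still in flight
        self._lock = threading.Lock()

    def reserve(self, estimated_cost: float) -> bool:
        """Admit a call only if budget covers completed + in-flight cost."""
        with self._lock:
            if self.completed + self.committed + estimated_cost > self.total:
                return False  # reject before dispatch, not after the bill arrives
            self.committed += estimated_cost
            return True

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Replace the reservation with the real cost once the call returns."""
        with self._lock:
            self.committed -= estimated_cost
            self.completed += actual_cost
```

Guarding at reservation time is what prevents a burst of concurrent mutations from overshooting the budget before any of them completes.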
P7: Generalization Over Task Specialization
The system should work on any problem expressible as (1) mutable code blocks, (2) an evaluation function returning score + diagnostics. No domain-specific logic should leak into the core. Domain knowledge belongs in prompts and evaluation functions, not in the evolutionary engine.
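P7's two-part contract can be stated as a minimal interface. This is a hypothetical sketch; the `Problem`, `EvalResult`, `mutable_blocks`, and `evaluate` names are illustrative, not an API from the surveyed systems.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class EvalResult:
    score: float                                               # primary fitness
    diagnostics: dict[str, Any] = field(default_factory=dict)  # ASI payload

class Problem(Protocol):
    """Anything the engine can evolve: mutable code blocks + an evaluator."""

    def mutable_blocks(self) -> list[str]:
        """Return the code regions the mutation engine may rewrite."""
        ...

    def evaluate(self, candidate_code: str) -> EvalResult:
        """Score a candidate and return score + diagnostics."""
        ...
```

Keeping domain knowledge behind this boundary means the evolutionary core never needs to change when the task does.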
3. Proposed Architecture: OmniEvolve
3.1 High-Level System Overview
╔═══════════════════════════════════════════════════════════════════════════════════════════════╗
║ O M N I E V O L V E ║
║ Unified LLM-Powered Evolutionary Optimization Engine ║
╚═══════════════════════════════════════════════════════════════════════════════════════════════╝
┌──────────────────────────────────┐
│ ORCHESTRATOR │
│ (async event loop, budget guard) │
└────────────┬─────────────────────┘
│
┌────────────────────────────────┼────────────────────────────────┐
│ │ │
┌───────▼────────┐ ┌────────▼────────┐ ┌──────────▼──────────┐
│ POPULATION │ │ LLM MUTATION │ │ EVALUATION │
│ MANAGER │◄──────────►│ ENGINE │◄────────────►│ PIPELINE │
│ │ │ │ │ │
│ MAP-Elites │ │ Diff / Full / │ │ Sandbox Exec │
│ Island Model │ │ Cross / Reflect │ │ Cascade Filter │
│ Pareto Front │ │ Failure-Driven │ │ ASI Diagnostics │
│ Archive │ │ Runtime-Aware │ │ Early Stopping │
└───────┬────────┘ └────────┬────────┘ └──────────┬──────────┘
│ │ │
│ ┌───────────────▼──────────────────┐ │
│ │ KNOWLEDGE LAYER │ │
│ │ │ │
│ │ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Learning Logs│ │ Prompt │ │ │
│ │ │ (cross-pop.) │ │ Archive │ │ │
│ │ └──────────────┘ └────────────┘ │ │
│ │ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Skill Store │ │ Meta-Recs │ │ │
│ │ └──────────────┘ └────────────┘ │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ┌───────────────▼──────────────────┐ │
│ │ LLM ORCHESTRATOR │ │
│ │ │ │
│ │ Flash-Haiku ┐ │ │
│ │ Flash-Pro ├── UCB1 Bandit │ │
│ │ o3-mini │ + Thompson │ │
│ │ Sonnet-4.x │ Sampling │ │
│ │ local model ─┘ │ │
│ │ │ │
│ │ Prompt Co-Evolver (v2 archive) │ │
│ └──────────────────────────────────┘ │
│ │ │
└──────────────────────────────┼───────────────────────────────────┘
│
┌──────────────▼───────────────────┐
│ SEARCH STRATEGY ROUTER │
│ │
│ Evolutionary Loop (default) │
│ AB-MCTS Tree Search (deep refine)│
│ Hybrid (pop. → tree → pop.) │
└──────────────────────────────────┘
│
┌──────────────▼───────────────────┐
│ SAFETY & COST LAYER │
│ │
│ Committed Budget Guard │
│ Sandbox (Docker / subprocess) │
│ Snapshot Rollback │
│ Immutability Enforcement │
└──────────────────────────────────┘
3.2 Data Flow: One Evolutionary Iteration
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SINGLE EVOLUTIONARY STEP │
│ │
│ 1. SAMPLE PARENTS │
│ Population DB → power-law rank selection (ShinkaEvolve) │
│ + novelty-adjusted weights (Darwinian Evolver) │
│ + Pareto frontier candidates (GEPA) │
│ │
│ 2. PRE-MUTATION NOVELTY CHECK (ShinkaEvolve 2-tier) │
│ Embedding similarity → cosine threshold → SKIP if too similar │
│ LLM-as-judge → semantic novelty → SKIP if not novel │
[Cost: ~$0.001 per check, saves ~$0.50 per skipped evaluation]

│ │
│ 3. CONTEXT ASSEMBLY │
│ Inject: parent code + best solutions + learning logs + ASI diagnostics │
│ + prompt from co-evolved archive + runtime REPL state (optional) │
│ │
│ 4. LLM MUTATION (bandit-selected model) │
│ Mode A: Diff patch (targeted, cheap, fast) │
│ Mode B: Full rewrite (creative, expensive) │
│ Mode C: Crossover (two parents → child) │
│ Mode D: Reflection-driven (ASI feedback → targeted fix) │
│ Mode E: Failure-case-driven (specific test failure → fix) │
│ │
│ 5. POST-GENERATION VERIFICATION (Darwinian Evolver) │
│ Quick syntax check + type check + mini-eval → REJECT if clearly wrong │
[Cost: ~$0.05 per check, saves full evaluation cost]
│ │
│ 6. CASCADE EVALUATION (AlphaEvolve / OpenEvolve) │
│ Stage 1: Fast cheap test subset → REJECT if score < threshold │
│ Stage 2: Medium test set → REJECT if fails │
│ Stage 3: Full evaluation → ASI diagnostics returned │
│ Early stopping (Bayesian / CI) → stop when confident in outcome │
│ │
│ 7. POPULATION UPDATE │
│ MAP-Elites: insert if dominates cell │
│ Island Archive: keep top-K per island │
│ Pareto Front: update if non-dominated │
│ │
│ 8. KNOWLEDGE UPDATE │
│ Learning log: append (mutation_type, parent_ids, ASI, outcome, delta_score) │
│ Prompt evolver: if outcome positive → reinforce prompt; if negative → mutate│
│ Bandit update: record (model, cost, quality) → update UCB1 arm │
│ Meta-recommendations: every N steps → LLM synthesizes pattern insights │
│ │
│ 9. SEARCH ROUTER CHECK │
│ If island is stagnant (N steps, no improvement) → spawn AB-MCTS tree │
│ If tree search finds improvement → reinject to population │
└─────────────────────────────────────────────────────────────────────────────────┘
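The nine steps above can be condensed into a skeleton loop. All collaborator objects here (`state.population`, `state.mutator`, `state.novelty`, `state.verifier`, `state.evaluator`, `state.knowledge`, `state.router`) are hypothetical stand-ins for the components described in later sections, not APIs from any surveyed system.

```python
def evolutionary_step(state) -> None:
    """One iteration of the loop sketched above (sketch only)."""
    # 1. Sample parents with composite selection weights
    parents = state.population.sample_parents()
    # 3-4. Assemble context and mutate (bandits pick mode + model)
    child = state.mutator.mutate(parents, context=state.knowledge.context_for(parents))
    # 2/5. Cheap rejection gates fire BEFORE paying for a full evaluation
    if not state.novelty.is_novel(child) or not state.verifier.quick_check(child):
        return
    # 6. Cascade evaluation returns score + ASI diagnostics
    result = state.evaluator.cascade_evaluate(child)
    # 7-8. Update population layers, learning log, and bandit statistics
    state.population.insert(child, result)
    state.knowledge.record(child, parents, result)
    # 9. Stagnant island -> hand off to AB-MCTS tree search
    if state.population.current_island_stagnant():
        state.router.spawn_tree_search()
```

Note the ordering: both rejection gates sit between generation and evaluation, which is where the sample-efficiency savings of P1 come from.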
4. Component: Population Management
Sources:
MAP-Elites: AlphaEvolve / OpenEvolve
Island Model: ShinkaEvolve
Pareto Front: GEPA
Expanding Archive: DGM
4.1 Three-Layer Population Structure
OmniEvolve uses a hierarchical three-layer population that simultaneously provides quality-diversity (breadth), competitive pressure (depth), and long-term memory (archive):
LAYER 1: MAP-ELITES QUALITY-DIVERSITY GRID
┌────────────────────────────────────────────────────────┐
│ Feature Dimension 1 (e.g., code length, complexity) │
│ Feature Dimension 2 (e.g., score on subset A vs. B) │
│ │
│ Each cell: best program achieving that (f1, f2) combo │
│ Resolution: 20×20 = 400 cells per island │
│ Behavioral descriptors computed post-evaluation │
└────────────────────────────────────────────────────────┘
↕ periodic migration (top-K from each)
LAYER 2: ISLAND MODEL (Isolated Populations)
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Island 1 │ │ Island 2 │ │ Island 3 │ │ Island N │
│ (exploit) │ │ (explore) │ │ (diversity)│ │ (specialist│
│ diff-heavy │ │ full+cross │ │ beam search│ │ per task) │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
↕ ring topology migration every M steps
LAYER 3: GLOBAL ARCHIVE (non-culled)
┌────────────────────────────────────────────────────────┐
│ All programs ever accepted (DGM-style expanding) │
│ Indexed by embedding vector for retrieval │
│ Used for crossover parent selection across islands │
│ Compressed after 10K entries (keep skeleton + score) │
└────────────────────────────────────────────────────────┘
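The Layer-2 ring-topology migration ("every M steps") can be sketched in a few lines. `Island.top_k` and `Island.receive_migrants` are assumed interfaces here, not part of any cited codebase.

```python
def migrate_ring(islands: list, k: int = 3) -> None:
    """Ring-topology migration: each island sends copies of its top-k
    programs to its clockwise neighbor (sketch only)."""
    n = len(islands)
    # Snapshot emigrants first so migration is simultaneous, not sequential:
    # otherwise island i+1 could re-export programs it just received.
    emigrants = [island.top_k(k) for island in islands]
    for i in range(n):
        islands[(i + 1) % n].receive_migrants(emigrants[i])
```

Sending copies rather than moving programs preserves each island's local elite while still propagating breakthroughs around the ring.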
4.2 Island Specialization
| Island Type | Mutation Ratio | Selection | Purpose |
|---|---|---|---|
| Exploitation Island | 80% diff, 20% full | Power-law (α=2.0) | Refine best known solutions |
| Exploration Island | 30% diff, 50% full, 20% cross | Power-law (α=0.5) | Creative search, new algorithms |
| Diversity Island | 40% diff, 40% full, 20% cross | Novelty-weighted | Maintain MAP-Elites cell coverage |
| Reflection Island | 90% reflection-driven, 10% full | ASI-diagnostic-based | Fix specific failure modes |
| Crossover Island | 10% diff, 10% full, 80% cross | Pareto frontier | Combine breakthroughs from other islands |
4.3 Dynamic Island Spawning
Inspired by ShinkaEvolve v1.1, when an island shows no improvement for stagnation_threshold steps, the system dynamically spawns a new island with a fresh random seed from the global archive. Stale islands are archived and removed. The number of active islands is bounded by max_islands (default: 8).
def check_stagnation(island: Island, window: int = 50) -> bool:
"""Detect stagnation using exponential moving average of delta-score."""
recent_scores = island.score_history[-window:]
if len(recent_scores) < window:
return False
ema_delta = 0.0
alpha = 0.1
for i in range(1, len(recent_scores)):
delta = recent_scores[i] - recent_scores[i-1]
ema_delta = alpha * delta + (1 - alpha) * ema_delta
return abs(ema_delta) < island.config.stagnation_epsilon # e.g., 1e-4
def spawn_new_island(population_db: PopulationDB) -> Island:
"""Spawn new island seeded from diverse global archive samples."""
seeds = population_db.archive.sample_diverse(
n=5,
method="maximin_distance", # maximize min pairwise embedding distance
temperature=1.5 # higher = more random diversity
)
return Island(seed_programs=seeds, mutation_config=IslandConfig.random())
5. Component: Mutation Engine
Sources:
Diff/Full/Cross: ShinkaEvolve
Reflection-Driven: GEPA
Failure-Case-Driven: Darwinian Evolver
Runtime-Aware: Arcgentica
5.1 Six Mutation Modes
The mutation engine supports six qualitatively distinct mutation modes, each suited to a different evolutionary context:
Mode A: Diff Patch Mutation
When: exploit phase, refining good programs
LLM prompt: Given parent code + best solution + score, generate unified diff
Cost: Low (small output)
Success rate: High when good parent exists
# Prompt template (abbreviated)
You are improving a program. Current score: {score}.
Generate a unified diff that improves performance.
Focus on: {top_3_insights_from_learning_logs}
<<< PARENT PROGRAM >>>
{parent_code}
<<< BEST KNOWN PROGRAM >>>
{best_code}
Output ONLY a unified diff, no explanations.
Mode B: Full Rewrite Mutation
When: explore phase, crossing local optima
LLM prompt: Given problem spec + inspirational examples, write new algorithm
Cost: Medium (full program output)
Success rate: Lower but higher ceiling
# Prompt template (abbreviated)
Implement a novel algorithm for: {problem_description}
Inspirational approaches (do not copy exactly):
{top_diverse_programs}
Constraints: {fitness_function_description}
Key insights from successful mutations:
{meta_recommendations}
Mode C: Crossover Mutation
When: crossover island, combining breakthroughs
LLM prompt: Two parents from different islands/cells, combine best aspects
Cost: Medium
Success rate: Medium, high novelty
# Parent selection for crossover:
parent_a = island_A.sample(strategy="best")
parent_b = archive.sample_dissimilar(parent_a, method="embedding")
# Embedding distance > threshold ensures genuine diversity
assert cosine_distance(embed(parent_a), embed(parent_b)) > 0.3
Mode D: Reflection-Driven Mutation (GEPA ASI)
When: diagnostics available, reflection island
LLM prompt: Inject full ASI diagnostic data, ask LLM to reflect then mutate
Cost: Medium-high (longer context)
Success rate: Highest for fixing known failures
# ASI context injected:
{
"score": 0.73,
"failed_test_cases": [{input, expected, actual, error_type}],
"performance_profile": {bottleneck_function, runtime_ms},
"diagnostic_message": "Stack overflow on input size > 1000",
"suggested_fix_category": "recursion_depth"
}
Mode E: Failure-Case-Driven Mutation
When: known failure cases, targeted repair (Darwinian Evolver style)
LLM prompt: Present specific failing test case, ask LLM to trace and fix
Cost: Low-medium
Success rate: High for specific bugs
# Learning log provides failure patterns:
failure_patterns = learning_log.get_recurring_failures(
program_id=parent.id,
min_occurrences=3
)
# Inject most frequent failure for targeted fix
Mode F: Runtime-Context Mutation (Arcgentica style)
When: REPL available, complex debugging needed
LLM prompt: Inject live execution trace, intermediate values, stack state
Cost: High (execution + analysis)
Success rate: Highest for runtime bugs
# Runtime context injected:
repl_session = RuntimeREPL(program=parent_code)
trace = repl_session.execute_with_trace(test_input)
# LLM sees: intermediate variable values,
# branch decisions, recursion depth,
# memory allocation patterns
5.2 Adaptive Mutation Scheduling
The ratio of mutation modes is not fixed — it is learned per island per problem using a multi-armed bandit. Each island maintains a MutationBandit tracking the success rate of each mode:
$$ \mathrm{UCB}_i(t) = \hat{\mu}_i + c \sqrt{\frac{\ln t}{n_i}} \quad \text{(Eq. 1 — UCB1 for Mutation Mode Selection)} $$
Where $\hat{\mu}_i$ is the empirical mean improvement for mutation mode $i$, $n_i$ is the number of times mode $i$ was selected, $t$ is the total number of mutations, and $c$ is the exploration constant (default: 0.3). The bandit selects the mode with the highest UCB score at each step.
import math

class MutationBandit:
    def __init__(self, modes: list[MutationMode], c: float = 0.3):
        self.modes = modes
        self.c = c
        self.counts = {m: 0 for m in modes}    # n_i
        self.rewards = {m: 0.0 for m in modes} # cumulative delta-score

    def select(self) -> MutationMode:
        t = sum(self.counts.values()) + 1
        ucb_scores = {}
        for mode in self.modes:
            n_i = self.counts[mode]
            if n_i == 0:
                return mode  # ensure each mode is tried at least once
            mu_hat = self.rewards[mode] / n_i
            ucb_scores[mode] = mu_hat + self.c * math.sqrt(math.log(t) / n_i)
        return max(ucb_scores, key=ucb_scores.get)

    def update(self, mode: MutationMode, delta_score: float):
        self.counts[mode] += 1
        self.rewards[mode] += max(0.0, delta_score)  # only reward improvements
6. Component: Parent Selection
Sources:
Power-Law: ShinkaEvolve
Sigmoid-Weighted + Novelty Bonus: Darwinian Evolver
Pareto Frontier: GEPA
6.1 Composite Selection Strategy
OmniEvolve combines three selection mechanisms into a composite weight system. For each program p in the population, the selection probability is:
$$ w(p) = w_{\text{fitness}}(p) \cdot w_{\text{novelty}}(p) \cdot w_{\text{recency}}(p) \quad \text{(Eq. 2 — Composite Selection Weight)} $$
Fitness Weight (Power-Law Rank)
$$ w_{\text{fitness}}(p) = \mathrm{rank}(p)^{-\alpha}, \quad \alpha \in [0.5, 3.0] \quad \text{(Eq. 3 — Power-Law Rank Selection)} $$
Programs are sorted by fitness descending; rank 1 = best. α controls exploitation/exploration: α=2.0 for exploit islands, α=0.5 for explore islands. Normalized to sum to 1.
Novelty Weight (Anti-Oversampling)
$$ w_{\text{novelty}}(p) = \sigma\left(-\beta \cdot \left(\mathrm{select\_count}(p) - \mu_{\text{select}}\right)\right) \quad \text{(Eq. 4 — Sigmoid Novelty Penalty)} $$
Where $\sigma$ is the sigmoid function, $\beta$ controls sharpness (default: 0.5), and $\mu_{\text{select}}$ is the mean selection count. Programs selected too frequently are down-weighted, preventing any single program from dominating as a parent. Inspired by Darwinian Evolver.
Recency Weight (Time Decay)
$$ w_{\text{recency}}(p) = \exp(-\lambda \cdot \mathrm{age}(p)), \quad \lambda = 0.01 \quad \text{(Eq. 5 — Exponential Recency Weight)} $$
Slightly favors recently discovered programs, ensuring the system doesn't exclusively exploit historical knowledge. age(p) = number of generations since p was created.
import numpy as np

def composite_weight(programs: list[Program], alpha: float, beta: float) -> np.ndarray:
    n = len(programs)
    scores = np.array([p.fitness for p in programs])
    ranks = n - np.argsort(np.argsort(scores))  # rank 1 = best (highest fitness)
    # Fitness weight (Eq. 3)
    w_fit = ranks.astype(float) ** (-alpha)
    # Novelty weight, anti-oversampling (Eq. 4)
    select_counts = np.array([p.select_count for p in programs])
    mu_select = select_counts.mean()
    w_nov = 1.0 / (1.0 + np.exp(beta * (select_counts - mu_select)))
    # Recency weight (Eq. 5)
    ages = np.array([p.age for p in programs])
    w_rec = np.exp(-0.01 * ages)
    # Composite (Eq. 2), normalized to a probability distribution
    w = w_fit * w_nov * w_rec
    return w / w.sum()
6.2 Multi-Objective Pareto Selection
For problems with multiple optimization targets (e.g., speed + accuracy + memory), OmniEvolve maintains a Pareto frontier and applies ε-greedy selection:
def select_from_pareto(pareto_front: list[Program], epsilon: float = 0.1) -> Program:
if random.random() < epsilon:
return random.choice(pareto_front) # explore random Pareto member
# Exploit: select program with highest hypervolume contribution
return max(pareto_front, key=lambda p: p.hypervolume_contribution)
7. Component: Novelty & Diversity Filtering
Sources:
2-tier novelty: ShinkaEvolve
Behavioral descriptors: AlphaEvolve
Failure-type categorization: Darwinian Evolver
7.1 Three-Stage Novelty Pipeline
GENERATED PROGRAM
│
▼
┌───────────────────────────────┐
│ STAGE 1: Embedding Similarity │ Cost: ~$0.001
│ cosine_sim(embed(new), embed(k-NN)) > threshold? → REJECT
│ threshold = 0.92 (tunable) │
└───────────────┬───────────────┘
│ PASS
▼
┌───────────────────────────────┐
│ STAGE 2: LLM Novelty Judge │ Cost: ~$0.005
│ "Is this mutation algorithmically novel? [YES/NO]"
│ Judge sees: new + k most similar archived programs
└───────────────┬───────────────┘
│ YES (novel)
▼
┌───────────────────────────────┐
│ STAGE 3: Behavioral Novelty │ Cost: 0 (during evaluation)
│ Post-eval: compute behavioral descriptor vector
│ (e.g., test case pass pattern, branch frequencies)
│ MAP-Elites cell placement uses behavioral descriptors
└───────────────────────────────┘
7.2 Embedding-Based Pre-filtering (Stage 1)
class NoveltyPreFilter:
def __init__(self, model="text-embedding-3-small", threshold=0.92, k=10):
self.model = model
self.threshold = threshold
self.k = k
self.index = FaissIndex(dimension=1536) # or ChromaDB, Qdrant
def is_novel(self, program_code: str) -> bool:
embedding = embed(program_code, model=self.model)
if self.index.size() < self.k:
self.index.add(embedding)
return True # insufficient history, always novel
neighbors, distances = self.index.knn_query(embedding, k=self.k)
max_similarity = max(1.0 - d for d in distances)
if max_similarity > self.threshold:
return False # too similar to existing program
self.index.add(embedding)
return True
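Stage 2, the LLM novelty judge, has no code above, so here is a minimal sketch. The `llm_call` parameter is a caller-supplied function (prompt in, completion string out); the prompt wording is illustrative, and a Tier-1 model would typically serve this role.

```python
JUDGE_PROMPT = """Compare the CANDIDATE against the ARCHIVED programs.
Answer YES only if the candidate uses a genuinely different algorithmic
approach (not renaming, reordering, or constant tweaks).

CANDIDATE:
{candidate}

ARCHIVED (most similar {k}):
{neighbors}

Answer with a single word: YES or NO."""

def llm_judges_novel(candidate: str, neighbors: list[str], llm_call) -> bool:
    """Stage 2 of the novelty pipeline (sketch only)."""
    prompt = JUDGE_PROMPT.format(
        candidate=candidate,
        k=len(neighbors),
        neighbors="\n---\n".join(neighbors),
    )
    answer = llm_call(prompt).strip().upper()
    return answer.startswith("YES")  # conservative parse of the verdict
```

Showing the judge the k most similar archived programs (from the Stage 1 index) is what lets it distinguish semantic novelty from surface-level rewording.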
7.3 Behavioral Descriptors for MAP-Elites
Behavioral descriptors are computed post-evaluation and determine which MAP-Elites cell a program occupies. Descriptors should be:
- Problem-agnostic: code complexity (cyclomatic), length (lines of code), estimated big-O class
- Problem-specific: score on test subset A vs. B, performance on edge cases, approach category (greedy/DP/search)
$$ B(p) = (f_1(p), f_2(p)), \quad f_i \in [0, 1] \text{ normalized behavioral features} \quad \text{(Eq. 6 — Behavioral Descriptor Vector)} $$
The MAP-Elites grid is indexed by discretized B(p). A program replaces the cell occupant only if its fitness exceeds the incumbent: grid[B(p)] ← p iff fitness(p) > fitness(grid[B(p)]).
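The discretization and replacement rule just described can be sketched directly; the 20x20 resolution matches Section 4.1, while the `grid` dict and `(program, fitness)` tuple layout are illustrative assumptions.

```python
def cell_index(descriptor: tuple[float, float], resolution: int = 20) -> tuple[int, int]:
    """Discretize a normalized behavioral descriptor B(p) into a grid cell."""
    return tuple(min(int(f * resolution), resolution - 1) for f in descriptor)

def maybe_insert(grid: dict, program, descriptor, fitness: float) -> bool:
    """Insert only if the program strictly beats the cell's incumbent."""
    cell = cell_index(descriptor)
    incumbent = grid.get(cell)  # (program, fitness) or None
    if incumbent is None or fitness > incumbent[1]:
        grid[cell] = (program, fitness)
        return True
    return False
```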
8. Component: LLM Orchestration
Sources:
Bandit selection: ShinkaEvolve
Thompson Sampling: AB-MCTS
Gemini ensemble: AlphaEvolve
8.1 Model Tier Architecture
| Tier | Models | Cost/Call | Primary Use | Selection Strategy |
|---|---|---|---|---|
| Tier 1: Fast | Gemini Flash 2.0, Claude Haiku 4.5, GPT-4o-mini | $0.01–0.05 | Novelty judging, diff patches, verification | Default; used 60–80% of the time |
| Tier 2: Balanced | Claude Sonnet 4.6, GPT-4o, Gemini Pro | $0.10–0.40 | Full rewrites, reflection-driven mutation | UCB1 bandit; ~15–30% |
| Tier 3: Power | Claude Opus 4.6, o3, Gemini Ultra | $1.00–5.00 | Complex crossover, breakthrough attempts | UCB1 bandit; ~5–10% |
| Tier 4: Local | Qwen-Coder, DeepSeek-Coder, CodeLlama | $0.00 (infra) | High-volume diff patches when budget constrained | Budget guard trigger |
8.2 Hierarchical Bandit Model Selection
A hierarchical bandit first selects the tier, then selects the model within the tier. This prevents expensive models from starving cheap ones in the exploration phase:
class HierarchicalModelBandit:
def __init__(self, tiers: dict[str, list[LLMModel]]):
self.tier_bandit = UCB1Bandit(arms=list(tiers.keys()), c=0.5)
self.model_bandits = {
tier: ThompsonSamplingBandit(arms=models, prior_alpha=1, prior_beta=1)
for tier, models in tiers.items()
}
def select(self, context: MutationContext) -> LLMModel:
# Override: use cheap model when budget is low
if context.remaining_budget < context.total_budget * 0.1:
return self.model_bandits["fast"].select()
tier = self.tier_bandit.select()
return self.model_bandits[tier].select()
def update(self, model: LLMModel, tier: str,
delta_score: float, cost: float, quality: float):
# Reward = quality / cost (efficiency-aware)
efficiency = quality / (cost + 1e-8)
self.tier_bandit.update(tier, efficiency)
self.model_bandits[tier].update(model, efficiency)
8.3 Thompson Sampling for Deep Refinement
When OmniEvolve switches to AB-MCTS tree search mode, it uses Thompson Sampling to balance exploration (trying new LLMs) vs. exploitation (using the best-performing one for the current problem):
$$ \theta_i \sim \mathrm{Beta}(\alpha_i + \mathrm{successes}_i, \; \beta_i + \mathrm{failures}_i) \quad \text{(Eq. 7 — Thompson Sampling for Model Selection)} $$
Each model arm maintains a Beta posterior. At each selection step, sample $\theta_i$ from each arm's posterior and select $\arg\max_i \theta_i$. This provides automatic exploration-exploitation balance without tuning an explicit exploration constant $c$.
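Eq. 7 fits in a few lines. This sketch assumes a Beta-Bernoulli reward model, so each mutation outcome is binarized into improvement vs. no improvement before updating the posterior; the class name mirrors the one used in Section 8.2.

```python
import random

class ThompsonSamplingBandit:
    """Beta-posterior Thompson Sampling over model arms (Eq. 7, sketch)."""

    def __init__(self, arms: list[str], prior_alpha: float = 1.0, prior_beta: float = 1.0):
        # Each arm keeps [alpha, beta] parameters of its Beta posterior
        self.posterior = {a: [prior_alpha, prior_beta] for a in arms}

    def select(self) -> str:
        # Sample theta_i ~ Beta(alpha_i, beta_i) per arm, pick the argmax
        samples = {a: random.betavariate(al, be)
                   for a, (al, be) in self.posterior.items()}
        return max(samples, key=samples.get)

    def update(self, arm: str, improved: bool) -> None:
        # Success increments alpha; failure increments beta
        self.posterior[arm][0 if improved else 1] += 1
```

Because sampling from the posterior is itself the exploration mechanism, no separate exploration constant needs hand-tuning.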
9. Component: Diagnostic Feedback System (ASI)
Sources:
Actionable Side Information: GEPA
Reflection-driven mutation: GEPA / ReEvo
9.1 ASI Schema
Every evaluation function in OmniEvolve must return an EvaluationResult with a structured ActionableSideInfo object alongside the numeric score:
@dataclass
class ActionableSideInfo:
# Required fields
score: float # Primary fitness metric
# Diagnostic fields (all optional, but recommended)
failed_tests: list[FailedTest] # Specific failed inputs
performance_profile: PerformanceProfile # Runtime bottlenecks
error_type: str | None # "timeout", "wrong_answer", "crash", etc.
error_message: str | None
# Structural analysis
complexity_estimate: str | None # "O(n^2)", "O(n log n)", etc.
memory_usage_mb: float | None
bottleneck_function: str | None # Function consuming most time
# Custom fields (domain-specific)
custom_metrics: dict[str, Any] = field(default_factory=dict)
# For prompt evolution: what would help next mutation?
suggested_fix_approach: str | None # e.g., "memoize recursive calls"
blocking_insight: str | None # Key insight needed to improve
@dataclass
class FailedTest:
input: Any
expected: Any
actual: Any
error_type: str # "wrong_answer", "timeout", "exception"
traceback: str | None
9.2 Reflection-Driven Mutation Prompt
REFLECTION_PROMPT = """
You are improving an algorithm. Analyze the diagnostic feedback carefully before proposing changes.
## Current Program
```python
{program_code}
Evaluation Result
- Score: {score} / {max_score}
- Error type: {error_type}
Failed Test Cases (top 3)
{failed_tests_formatted}
Performance Profile
- Bottleneck: {bottleneck_function} ({bottleneck_pct}% of runtime)
- Estimated complexity: {complexity_estimate}
- Memory usage: {memory_usage_mb} MB
Suggested approach from evaluator
{suggested_fix_approach}
Relevant learning log insights
{learning_log_insights}
Task
- REFLECT: In 2-3 sentences, diagnose the root cause of the failures.
- PROPOSE: Describe your fix strategy in 1-2 sentences.
- IMPLEMENT: Provide the complete improved program.
Start with "REFLECTION:" then "STRATEGY:" then "```python". """
10. Component: Knowledge Sharing & Learning Logs
Sources:
Learning log system: Darwinian Evolver
Meta-recommendations: ShinkaEvolve
Skill store: GEPA Skills
10.1 Learning Log Schema
The learning log is a shared, cross-population memory. Every mutation — successful or not — generates a log entry. LLMs read compressed summaries of this log when proposing new mutations.
@dataclass
class LearningLogEntry:
# Identity
entry_id: UUID
timestamp: datetime
generation: int
island_id: str
# What was tried
mutation_mode: MutationMode # diff/full/cross/reflect/failure
parent_program_ids: list[UUID]
mutation_description: str # 1-2 sentence summary of what changed
# What happened
pre_score: float
post_score: float
delta_score: float
asi_before: ActionableSideInfo
asi_after: ActionableSideInfo
# Meta
llm_model_used: str
llm_cost_usd: float
evaluation_time_s: float
# Pattern tags (auto-generated by LLM summarizer)
pattern_tags: list[str] # e.g., ["memoization", "recursion_to_iter", "O(n2)_to_O(nlogn)"]
outcome: Literal["improvement", "neutral", "regression", "crash"]
10.2 Log Compression and Retrieval
The learning log grows without bound; LLMs have finite context windows. OmniEvolve uses a hierarchical compression strategy:
RAW LOG ENTRIES (individual mutations)
│ Every 100 entries
▼
COMPRESSED SUMMARIES (LLM-generated, ~100 tokens each)
│ Every 10 summaries
▼
INSIGHT DIGESTS (pattern clusters, ~50 tokens each)
│ At query time
▼
RETRIEVED CONTEXT (top-K relevant by embedding similarity to current mutation)
class LearningLogRetriever:
def get_relevant_context(
self,
current_program: str,
mutation_mode: MutationMode,
max_tokens: int = 2000
) -> str:
# Embed current program + mutation mode description
query_embedding = embed(current_program + str(mutation_mode))
# Retrieve top-K most similar entries by pattern
relevant_summaries = self.vector_store.similarity_search(
query=query_embedding, k=10
)
        # Retrieve top-K failure patterns matching current ASI
        failure_patterns = []
        if self.current_asi:
            failure_patterns = self.failure_index.get_similar(
                self.current_asi.error_type, k=5
            )
# Retrieve top-K successful mutation patterns
successful_patterns = self.success_index.get_top(k=5)
# Format and truncate to max_tokens
return self._format_context(relevant_summaries, failure_patterns,
successful_patterns, max_tokens)
10.3 Meta-Recommendations Generation
Every N=100 generations, OmniEvolve asks an LLM to synthesize the learning log into high-level strategic recommendations. These become permanent entries in the Meta-Recommendation Store and are injected into all future mutation prompts:
META_REC_PROMPT = """
Analyze these {N} mutation outcomes and extract 3-5 high-level insights.
For each insight: (1) state the principle, (2) show 2 examples that support it,
(3) state when to apply it.
Focus on: what types of changes consistently improve performance?
What approaches consistently fail? What patterns emerge across successful mutations?
Mutation history summary:
{compressed_log_summary}
Output as JSON: [{{"insight": str, "examples": [str], "apply_when": str}}]
"""
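The synthesis step around this prompt can be sketched as a small store that parses and retains the JSON the prompt requests. A minimal sketch: `MetaRecommendationStore` and its method names are illustrative, and the LLM call itself is elided.

```python
import json
from dataclasses import dataclass, field

@dataclass
class MetaRecommendationStore:
    """Permanent store of LLM-synthesized insights (illustrative sketch)."""
    insights: list[dict] = field(default_factory=list)

    def ingest(self, llm_json_output: str) -> int:
        """Parse the JSON the META_REC_PROMPT asks for; keep well-formed entries."""
        for entry in json.loads(llm_json_output):
            if {"insight", "examples", "apply_when"} <= entry.keys():
                self.insights.append(entry)
        return len(self.insights)

    def as_prompt_block(self, k: int = 5) -> str:
        """Format the most recent k insights for injection into mutation prompts."""
        lines = [f"- {e['insight']} (apply when: {e['apply_when']})"
                 for e in self.insights[-k:]]
        return "META-RECOMMENDATIONS:\n" + "\n".join(lines)
```

Every synthesis interval, the orchestrator would call `ingest` on the LLM's output and prepend `as_prompt_block()` to subsequent mutation prompts.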
10.4 Skill Store (Cross-Model Transfer)
Inspired by GEPA Skills, OmniEvolve maintains a Skill Store: a repository of extracted algorithmic knowledge that transfers across LLM models. Skills learned by expensive models become available to cheap models, enabling cost-efficient exploitation:
@dataclass
class Skill:
    name: str                       # e.g., "sliding_window_for_substring_search"
    description: str                # natural language description
    template_code: str              # illustrative code pattern
    applicable_patterns: list[str]  # patterns that suggest this skill
    success_evidence: list[str]     # programs where applying it improved score
    transfer_count: int             # times successfully transferred across models
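Skill injection into a cheap model's prompt can be sketched as a relevance-ranked append. The helper below is hypothetical; relevance here is plain pattern overlap rather than embeddings, and the `Skill` dataclass is trimmed to the fields used.

```python
from dataclasses import dataclass

@dataclass
class Skill:  # as defined above, trimmed to the fields this sketch uses
    name: str
    description: str
    template_code: str
    applicable_patterns: list[str]

def inject_skills(base_prompt: str, skills: list[Skill],
                  program_patterns: list[str], max_skills: int = 3) -> str:
    """Append the most relevant Skill Store entries to a mutation prompt.

    Relevance = number of overlapping patterns (a real system might rank
    by embedding similarity instead).
    """
    def overlap(skill: Skill) -> int:
        return len(set(skill.applicable_patterns) & set(program_patterns))

    relevant = sorted((s for s in skills if overlap(s) > 0),
                      key=overlap, reverse=True)[:max_skills]
    if not relevant:
        return base_prompt  # no matching skills: prompt unchanged
    blocks = [f"SKILL: {s.name}\n{s.description}\nPattern:\n{s.template_code}"
              for s in relevant]
    return base_prompt + "\n\nRELEVANT SKILLS:\n" + "\n---\n".join(blocks)
```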
11. Component: Prompt Co-Evolution
Sources:
Prompt co-evolutionShinkaEvolve v1.1
System prompt archiveShinkaEvolve
11.1 How Prompts Evolve
OmniEvolve maintains a Prompt Population alongside the program population. Each mutation mode has its own prompt population. Prompts are selected, used, evaluated by their downstream effect on mutation success, and themselves mutated:
PROMPT POPULATION (one per mutation mode)
┌──────────────────────────────────────────────────────────┐
│ Prompt_1: score=0.34, uses=120, delta_avg=+0.021 │
│ Prompt_2: score=0.41, uses=85, delta_avg=+0.038 ← best│
│ Prompt_3: score=0.29, uses=60, delta_avg=+0.011 │
│ Prompt_4: score=0.22, uses=40, delta_avg=-0.003 │
└──────────────────────────────────────────────────────────┘
│ │
Prompt selection Prompt mutation
(power-law rank) (LLM generates new variant)
│ │
└──────────────────────────►│
▼
New prompt evaluated by
downstream mutation quality
11.2 Prompt Fitness Metric
$$ \mathrm{fitness}(\mathrm{prompt}) = \mathrm{EMA}_t\big(\Delta\mathrm{score} \mid \mathrm{prompt\ used\ at\ } t\big) \quad \text{(Eq. 8 — Prompt Fitness)} $$
EMA = exponential moving average (α=0.1) of delta-score achieved by mutations using this prompt. Prompts with negative or zero EMA are candidates for mutation or elimination.
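The EMA update of Eq. 8 is one line per observation. A minimal sketch with illustrative class and method names, using the α = 0.1 from the text:

```python
class PromptFitnessTracker:
    """Per-prompt EMA of delta-score (Eq. 8 sketch; alpha = 0.1 as in the text)."""
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.ema: dict[str, float] = {}
        self.uses: dict[str, int] = {}

    def update(self, prompt_id: str, delta_score: float) -> float:
        prev = self.ema.get(prompt_id)
        # First observation seeds the EMA; later ones blend with weight alpha.
        self.ema[prompt_id] = delta_score if prev is None else (
            self.alpha * delta_score + (1 - self.alpha) * prev)
        self.uses[prompt_id] = self.uses.get(prompt_id, 0) + 1
        return self.ema[prompt_id]

    def elimination_candidates(self) -> list[str]:
        """Prompts with non-positive EMA are candidates for mutation/removal."""
        return [p for p, f in self.ema.items() if f <= 0.0]
```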
11.3 ASI-Informed Prompt Mutation
A key OmniEvolve innovation: when a prompt consistently produces mutations that fail with specific ASI diagnostics (e.g., "recursion overflow"), the prompt is mutated to specifically address that failure pattern:
PROMPT_MUTATION_TEMPLATE = """
This mutation prompt has been used {n_times} times with average delta-score {avg_delta:.4f}.
Here are the most common failure patterns it produces:
{top_failure_patterns}
Here is the current prompt:
---
{current_prompt}
---
Generate an improved version of this prompt that:
1. Addresses the most common failure: "{top_failure}"
2. Preserves what's working (the prompt works well when: {success_contexts})
3. Adds explicit guidance about: {suggested_addition}
Output ONLY the new prompt text, no explanations.
"""
12. Component: Selective Self-Improvement
Sources:
Self-modificationDGM (Gödel Machine)
Skill accumulationGEPA Skills
12.1 What May and May Not Self-Modify
✓ Safe to Self-Modify (Mutable Zone)
- System prompts for each mutation mode
- Parent selection weights (α for power-law)
- Mutation mode ratios per island
- Novelty filter thresholds
- LLM model selection preferences
- Meta-recommendation injection logic
- Learning log query strategies
✗ Immutable (Never Self-Modified)
- Evaluation function / fitness oracle
- Sandbox execution harness
- Cost budget guards and limits
- Safety checks (timeout, OOM kill)
- Population database read/write interface
- Snapshot and rollback mechanism
- The self-improvement module itself
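Immutability can be enforced with a simple registry check before any self-modification is applied. A sketch: the component names below are illustrative stand-ins for the lists above.

```python
# Illustrative registry mirroring the "Immutable (Never Self-Modified)" list above.
IMMUTABLE_COMPONENTS = frozenset({
    "fitness_oracle", "sandbox_harness", "budget_guard", "safety_checks",
    "population_db_interface", "snapshot_rollback", "self_improvement_module",
})

def check_mutable(component: str) -> None:
    """Raise before any proposed self-modification of the immutable zone."""
    if component in IMMUTABLE_COMPONENTS:
        raise PermissionError(f"Component '{component}' is in the immutable zone")
```

The self-improvement module would call `check_mutable` first in `propose_improvement`, so even an LLM-generated proposal targeting the evaluation function is rejected before a snapshot is taken.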
12.2 Snapshot-Rollback Protocol
import numpy as np
import scipy.stats

class SelfImprovementModule:
    def propose_improvement(
        self, component: str, current_config: dict
    ) -> tuple[dict, UUID]:
        """LLM proposes a new configuration for a mutable component.
        Returns the proposal plus a snapshot id for later rollback."""
        snapshot_id = self.take_snapshot(component)
        proposed_config = self.llm_propose(
            component=component,
            current_config=current_config,
            performance_history=self.get_history(component, n=100),
            meta_recommendations=self.meta_rec_store.get_relevant(component)
        )
        return proposed_config, snapshot_id

    def evaluate_improvement(
        self, snapshot_id: UUID, proposed_config: dict,
        component: str, n_trials: int = 50
    ) -> bool:
        """A/B test: run N mutations with proposed vs. current config."""
        results_proposed = self.run_with_config(proposed_config, n_trials)
        results_current = self.run_with_config(self.current_config, n_trials)
        # Statistical test: two-sample t-test, accept only if p < 0.05
        t_stat, p_value = scipy.stats.ttest_ind(results_proposed, results_current)
        improvement = np.mean(results_proposed) - np.mean(results_current)
        if p_value < 0.05 and improvement > 0:
            self.apply_config(proposed_config, component)
            self.log_improvement(component, improvement, snapshot_id)
            return True
        else:
            self.rollback_to_snapshot(snapshot_id)
            return False
13. Component: Cost Control System
Sources:
Committed cost modelShinkaEvolve
Cascade evaluationAlphaEvolve
Early stoppingShinkaEvolve
Per-iteration budgetOpenEvolve
13.1 Committed Cost Budget Guard
The budget guard tracks committed cost = realized cost + in-flight cost (API calls not yet returned). This prevents overspending due to concurrent calls:
$$ C_{\mathrm{committed}}(t) = C_{\mathrm{realized}}(t) + C_{\mathrm{inflight}}(t) \le B_{\mathrm{total}} \quad \text{(Eq. 9 — Committed Cost Guard)} $$
import asyncio
from uuid import UUID, uuid4

class CommittedBudgetGuard:
    def __init__(self, total_budget_usd: float):
        self.total = total_budget_usd
        self._realized = 0.0
        self._inflight: dict[UUID, float] = {}  # call_id → estimated_cost
        self._lock = asyncio.Lock()

    async def reserve(self, estimated_cost: float) -> UUID | None:
        async with self._lock:
            committed = self._realized + sum(self._inflight.values())
            if committed + estimated_cost > self.total:
                return None  # budget exceeded, reject call
            call_id = uuid4()
            self._inflight[call_id] = estimated_cost
            return call_id

    async def settle(self, call_id: UUID, actual_cost: float):
        async with self._lock:
            # Drop the reservation and book the actual cost in its place
            self._inflight.pop(call_id, None)
            self._realized += actual_cost

    @property
    def remaining(self) -> float:
        return self.total - self._realized - sum(self._inflight.values())
13.2 Cascade Evaluation Pipeline
| Stage | Test Coverage | Cost | Reject Threshold | Avg. Reject Rate |
|---|---|---|---|---|
| Stage 0: Syntax Check | Parse + type check | ~$0.00 | Any syntax error | 5–15% |
| Stage 1: Mini-eval | 5% of test suite (random) | ~$0.01 | score < 0.3 | 20–40% |
| Stage 2: Medium-eval | 30% of test suite | ~$0.05 | score < threshold | 15–25% |
| Stage 3: Full-eval | 100% + ASI generation | ~$0.10–$1.00 | — | — |
13.3 Bayesian Early Stopping
For stochastic evaluations (e.g., average over N random seeds), OmniEvolve uses sequential probability ratio testing to stop early when the outcome is clear:
$$ \Lambda_t = \sum_{i=1}^{t} \log \frac{p(x_i \mid H_1)}{p(x_i \mid H_0)} \quad \text{(Eq. 10 — SPRT Log-Likelihood Ratio)} $$
H0: the new program is not better than the best known; H1: it is better. Sampling stops as soon as Λ_t crosses either Wald boundary: accept H1 above log((1−β)/α), accept H0 below log(β/(1−α)). Type I/II error rates α = β = 0.05. Inspired by ShinkaEvolve's CI-based early stopping.
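Wald's boundaries give a compact stopping rule for Eq. 10. A minimal sketch, assuming the caller accumulates the cumulative log-likelihood ratio `llr` across seeds:

```python
import math

def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Wald SPRT decision for the cumulative log-likelihood ratio (Eq. 10).

    Keep sampling seeds until llr crosses a boundary; with alpha = beta = 0.05
    the boundaries are roughly ±2.94.
    """
    upper = math.log((1 - beta) / alpha)  # cross above: accept H1 (better)
    lower = math.log(beta / (1 - alpha))  # cross below: accept H0 (not better)
    if llr >= upper:
        return "accept_h1"
    if llr <= lower:
        return "accept_h0"
    return "continue"
```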
14. Component: Safety & Sandboxing
Sources:
Sandbox execAll systems
Safety discussionDGM paper
14.1 Multi-Layer Sandboxing
| Layer | Mechanism | Protects Against |
|---|---|---|
| Layer 1: Process | subprocess with resource limits (ulimit, setrlimit) | CPU/RAM/disk exhaustion |
| Layer 2: Container | Docker/podman with no-network, read-only FS, seccomp | Network access, filesystem modification |
| Layer 3: Static Analysis | AST scan for dangerous imports/calls before execution | Known malicious patterns (e.g., os.system, __import__) |
| Layer 4: Timeout | Hard wall-clock timeout (SIGKILL after T seconds) | Infinite loops, deadlocks |
| Layer 5: Output Validation | Schema validation on evaluation result before use | Evaluation function injection |
import re

BLOCKED_PATTERNS = [
    r"import\s+os", r"import\s+subprocess", r"import\s+sys",
    r"__import__", r"exec\s*\(", r"eval\s*\(",
    r"open\s*\(", r"socket\.", r"urllib\.", r"requests\.",
    r"ctypes\.", r"cffi\.", r"multiprocessing\."
]

def static_safety_check(code: str) -> tuple[bool, str]:
    """Return (is_safe, reason). A coarse pre-filter only — the container
    and process layers remain the real enforcement boundary."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, code):
            return False, f"Blocked pattern: {pattern}"
    return True, ""
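Layers 1 and 4 can be sketched with `subprocess` plus `resource` limits (Unix-only). This is a simplified stand-in for the real harness; the container and output-validation layers are omitted.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 30, memory_limit_mb: int = 512):
    """Layer 1 + Layer 4 sketch: run candidate code in a child process with an
    address-space limit and a hard wall-clock timeout. Returns the completed
    process, or None on timeout (treated as a crash outcome)."""
    def set_limits():
        import resource  # Unix-only
        limit = memory_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    try:
        return subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True,
            timeout=timeout_s,
            preexec_fn=set_limits if sys.platform != "win32" else None,
        )
    except subprocess.TimeoutExpired:
        return None
```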
15. Component: Hybrid Search Strategy
Sources:
AB-MCTSSakana AB-MCTS / TreeQuest
Iterative refinementConfluence Labs
15.1 Search Router Logic
OmniEvolve uses an intelligent router that switches between evolutionary search and tree search based on the current optimization landscape:
SEARCH ROUTER (runs every K generations)

 Narrow and deep problem?                  Broad and shallow problem?
 (few good solutions; needs fine-grained   (many local optima; needs exploration
 refinement of one promising approach)     over a wide search space)
         │                                          │
         ▼                                          ▼
 AB-MCTS Tree Search                       Evolutionary Loop
 ─────────────────                         ─────────────────
 Start from best solution                  MAP-Elites + Islands
 Thompson Sampling for depth/width         Multi-modal mutation
 Refinement-focused prompts                Diversity-preserving selection
 Inject improvements back to population    Tree search on stagnant islands
15.2 Adaptive Branching MCTS
Inspired by AB-MCTS (Sakana), when OmniEvolve activates tree search mode, it uses Thompson Sampling to decide whether to deepen (refine current solution) vs. branch (generate new sibling):
$$ P(\mathrm{branch}) = P(\theta_{\mathrm{branch}} > \theta_{\mathrm{deepen}}), \quad \theta_i \sim \mathrm{Beta}(\alpha_i, \beta_i) \quad \text{(Eq. 11 — AB-MCTS Branching Probability)} $$
class AdaptiveBranchMCTS:
    def __init__(self, root_program: Program, llm_ensemble: LLMEnsemble):
        self.tree = SearchTree(root=SearchNode(root_program))
        self.branch_bandit = ThompsonSamplingBandit(["branch", "deepen"])
        self.llm = llm_ensemble

    def search(self, n_steps: int) -> Program:
        for _ in range(n_steps):
            node = self.select_node()  # UCT selection
            action = self.branch_bandit.select()
            if action == "deepen":
                child = self.refine(node)  # Diff patch mutation
            else:
                child = self.branch(node)  # Full rewrite from this node
            score = evaluate(child.program)
            delta = score - node.program.score
            self.branch_bandit.update(action, success=(delta > 0))
            self.backpropagate(node, delta)
        return self.tree.best_leaf().program
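The `ThompsonSamplingBandit` used above can be sketched as a Beta-Bernoulli bandit, directly implementing Eq. 11's posterior sampling. The interface matches the calls in the MCTS code; prior parameters default to the config's α = β = 1.

```python
import random

class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson Sampling over named actions (Eq. 11 sketch)."""
    def __init__(self, actions: list[str],
                 prior_alpha: int = 1, prior_beta: int = 1):
        # Per-action Beta posterior parameters [alpha, beta]
        self.params = {a: [prior_alpha, prior_beta] for a in actions}

    def select(self) -> str:
        # Sample theta_i ~ Beta(alpha_i, beta_i) for each action; pick argmax.
        samples = {a: random.betavariate(ab[0], ab[1])
                   for a, ab in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, action: str, success: bool) -> None:
        # Success increments alpha; failure increments beta.
        self.params[action][0 if success else 1] += 1
```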
16. Formal Mathematical Specification
16.1 Problem Formulation
Let P be the space of programs (text over some programming language alphabet Σ). Let f: P → ℝ be the fitness function returning a scalar score. The optimization objective is:
$$ p^{*} = \arg\max_{p \in P} f(p) \quad \text{(Eq. 12 — Primary Optimization Objective)} $$
In the multi-objective setting, with objectives f_1, ..., f_k, the target is the Pareto-optimal set:
$$ P^{*} = \{\, p \in P \mid \neg\exists\, p' \in P : f_i(p') \ge f_i(p)\ \forall i \,\wedge\, f_j(p') > f_j(p) \text{ for some } j \,\} \quad \text{(Eq. 13 — Pareto Optimal Set)} $$
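For a finite candidate set, the Pareto set of Eq. 13 can be computed with a direct domination check. A quadratic-time sketch for illustration, assuming maximization in every objective:

```python
def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Return the points not dominated by any other point (Eq. 13, maximization).

    q dominates p iff q >= p in every objective and q > p in at least one.
    O(n^2); fine for archive-sized sets, not for huge populations.
    """
    def dominates(q, p):
        return (all(x >= y for x, y in zip(q, p))
                and any(x > y for x, y in zip(q, p)))

    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```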
16.2 OmniEvolve State Space
The full state of OmniEvolve at generation t is:
$$ S_t = (\Pi_t, L_t, \Phi_t, \Theta_t) \quad \text{(Eq. 14 — System State)} $$
Where: Π_t = population (MAP-Elites grid + islands + archive), L_t = learning log (history of all mutations), Φ_t = prompt archive (population of mutation prompts), Θ_t = bandit state (model-selection posteriors + mutation mode weights).
16.3 Transition Dynamics
One evolutionary step is a stochastic transition S_t → S_{t+1}:
$$ S_{t+1} = \mathrm{Update}(S_t, p_{\mathrm{new}}, r_t, d_t) \quad \text{(Eq. 15 — State Transition)} $$
Where p_new is the generated program, r_t = f(p_new) is the reward, d_t is the ASI diagnostic, and Update(·) applies all four updates: population (Π), log (L), prompts (Φ), bandits (Θ).
16.4 Sample Efficiency Bound
Let E[T] be the expected number of full evaluations to reach within ε of the global optimum f*. OmniEvolve achieves the following improvement over naive random search:
$$ E[T]_{\mathrm{OmniEvolve}} \le E[T]_{\mathrm{random}} \cdot (1 - p_{\mathrm{reject}}) \cdot (1 - p_{\mathrm{early\_stop}}) \quad \text{(Eq. 16 — Sample Efficiency Improvement Bound)} $$
Where p_reject ≈ 0.6–0.8 is the fraction of mutations rejected by the novelty filter before evaluation, and p_early_stop ≈ 0.3–0.5 is the fraction of evaluations stopped early by SPRT. Combined, the two factors cut the effective evaluation count by roughly 4–10x compared to systems without these mechanisms. Empirical values from ShinkaEvolve analysis.
16.5 Learning Log Value Function
The value of the learning log L_t to a new mutation proposal can be formalized as a mutual-information quantity:
$$ V(L_t) = I\big(\Delta r_{t+1};\, L_t \,\big|\, p_{\mathrm{parent}}, \mathrm{mode}_t\big) \quad \text{(Eq. 17 — Learning Log Information Value)} $$
Where I(·;·) is the mutual information between the reward delta of the next mutation and the learning-log history. High V(L_t) means the log is highly predictive of mutation outcomes — a signal that meta-recommendation synthesis should run.
17. Implementation Roadmap
17.1 Technology Stack Recommendations
| Layer | Recommended | Alternatives | Reason |
|---|---|---|---|
| Language | Python 3.12+ | — | Ecosystem compatibility with all 17 surveyed systems |
| Async framework | asyncio + aiohttp | trio | ShinkaEvolve's AsyncEvolutionRunner pattern proven |
| LLM client | LiteLLM | langchain, direct SDKs | Unified API across all providers; cost tracking built-in |
| Vector store | Qdrant (self-hosted) | ChromaDB, FAISS | Persistent, fast, filterable, supports metadata |
| Population DB | SQLite + JSON (small) / PostgreSQL (large) | DuckDB | Simple, reliable, queryable, no infra overhead |
| Sandbox | Docker + restrictedpython | Firejail, nsjail | Two-layer defense; Docker for infra isolation, restrictedpython for code |
| Config | Hydra (OmegaConf) | pydantic-settings | ShinkaEvolve uses Hydra; excellent sweep support |
| Monitoring | Prometheus + Grafana | W&B, MLflow | Real-time cost, score, and population diversity tracking |
| Experiment tracking | MLflow | W&B | Free, self-hosted, full artifact tracking |
17.2 Phased Implementation Plan
Phase 1: Core Evolution Loop (Weeks 1–3). Implement: async orchestrator, basic island model (3 islands), diff and full mutation modes, sandbox executor, basic fitness database. Target: reproduce OpenEvolve-level capability. Validate on: cap_set problem, TSP heuristic.
Phase 2: MAP-Elites + Novelty (Weeks 4–5). Add: MAP-Elites grid with configurable behavioral descriptors, embedding-based novelty pre-filter, LLM novelty judge, power-law parent selection. Target: +20% coverage of behavioral space vs. basic islands.
Phase 3: LLM Bandit Orchestration (Week 6). Add: UCB1 model-selection bandit, LiteLLM integration for all providers, committed budget guard, cascade evaluation pipeline, SPRT early stopping. Target: 50% cost reduction vs. Phase 2.
Phase 4: Diagnostic Feedback (Weeks 7–8). Add: ASI schema, evaluator wrapper that returns diagnostics, reflection-driven mutation mode (Mode D), failure-case-driven mutation mode (Mode E). Target: +15% average fitness on problems with clear failure diagnostics.
Phase 5: Learning Logs (Weeks 9–10). Add: learning log database, log compression and retrieval, meta-recommendation generation, log injection into mutation prompts. Target: measurable improvement in mutation success rate on generations 50+ vs. 1–50.
Phase 6: Prompt Co-Evolution (Weeks 11–12). Add: prompt population per mutation mode, prompt fitness tracking, ASI-informed prompt mutation. Target: prompt fitness improves monotonically over 500+ mutations.
Phase 7: AB-MCTS Tree Search Integration (Weeks 13–14). Add: search router, adaptive branching MCTS, Thompson Sampling for depth/width, re-injection of tree-search improvements into the population. Target: solve problems that pure evolutionary search fails on within the same budget.
Phase 8: Selective Self-Improvement (Weeks 15–17). Add: self-improvement module, snapshot-rollback system, A/B testing for configuration changes, immutability enforcement. Target: demonstrable improvement of mutation strategies over 1000+ generations.
Phase 9: Skill Store + Cross-Model Transfer (Weeks 18–20). Add: skill extraction from successful programs, skill store API, skill injection into cheap-model prompts. Target: cheap model (Haiku) achieves 85%+ of expensive-model (Opus) quality when skills are available.
Phase 10: Benchmarking & Evaluation (Weeks 21–24). Evaluate on: ALE-Bench (competitive programming), ARC-AGI-2 (generalization), mathematical discovery (AIME), infrastructure optimization (matrix multiply). Compare against ShinkaEvolve, OpenEvolve, GEPA, Darwinian Evolver.
17.3 Key Configuration Parameters
# omnievolve_config.yaml (Hydra)
evolution:
  n_islands: 8
  island_types: [exploit, explore, diversity, reflect, crossover]
  generations: 1000
  async: true
  max_concurrent_mutations: 16

population:
  map_elites_resolution: [20, 20]
  archive_max_size: 10000
  archive_compression_interval: 500
  migration_interval: 50
  migration_top_k: 3

mutation:
  modes: [diff, full, cross, reflect, failure, runtime]
  bandit_c: 0.3
  stagnation_window: 50
  stagnation_epsilon: 0.0001

novelty:
  embedding_model: text-embedding-3-small
  embedding_similarity_threshold: 0.92
  llm_novelty_judge: gemini-flash-2.0
  n_neighbors: 10

llm:
  provider: litellm
  models:
    tier1: [gemini/gemini-2.0-flash, claude-haiku-4-5]
    tier2: [claude-sonnet-4-6, gpt-4o]
    tier3: [claude-opus-4-6]
    tier4: [ollama/qwen2.5-coder:32b]
  bandit_c: 0.5
  thompson_prior_alpha: 1
  thompson_prior_beta: 1

cost:
  total_budget_usd: 100.0
  cascade_stages: [0.05, 0.30, 1.00]  # fraction of test suite per stage
  cascade_thresholds: [0.30, null]    # reject if a stage scores below its threshold
  early_stopping_alpha: 0.05
  early_stopping_beta: 0.05

prompt_evolution:
  enabled: true
  population_size: 10
  mutation_interval: 100  # mutate prompts every N mutations
  ema_alpha: 0.1

learning_log:
  enabled: true
  compression_interval: 100
  meta_rec_interval: 500
  retrieval_k: 10
  max_context_tokens: 2000

self_improvement:
  enabled: true
  trial_n: 50
  significance_threshold: 0.05
  snapshot_retention: 5  # keep last 5 snapshots

search:
  strategy: adaptive  # evolutionary | mcts | adaptive
  mcts_trigger_stagnation: 100
  mcts_steps: 50
  thompson_prior_alpha: 1
  thompson_prior_beta: 1

safety:
  sandbox: docker
  timeout_s: 30
  memory_limit_mb: 512
  network_disabled: true
  static_analysis: true
18. Expected Performance & Benchmarks
18.1 Performance Projections vs. Individual Systems
| Benchmark | ShinkaEvolve | OpenEvolve | GEPA | OmniEvolve (projected) | Gain Source |
|---|---|---|---|---|---|
| ALE-Bench score (competitive prog.) | ~60th percentile | ~55th percentile | ~65th percentile | ~75–80th percentile | Learning logs + ASI + AB-MCTS tree search |
| Cost per improvement unit | $0.05–0.20 | $0.10–0.40 | $0.05–0.25 | $0.02–0.10 | Cascade eval + novelty pre-filter + bandit models |
| Evaluations to first valid solution | ~100 | ~200 | ~80 | ~40–60 | 2-tier novelty + cascade + SPRT early stopping |
| Long-run improvement (gen 500+) | Plateau risk | Plateau risk | Plateau risk | Continued improvement | Prompt co-evolution + self-improvement |
| Cross-problem transfer | Limited | None | Skills only | Skill store + learning logs | Skill store + cross-model transfer |
18.2 Evaluation Plan for PhD Validation
Benchmark 1: ALE-Bench
Run OmniEvolve on ALE-Bench's 40 competitive programming problems. Measure: percentile rank, cost per problem, evaluations per problem. Compare against published ShinkaEvolve and GEPA results.
Benchmark 2: Mathematical Discovery
Replicate AlphaEvolve's cap_set and matrix multiplication problems. Measure: best known values achieved, evaluation count, cost. Validate that population diversity (MAP-Elites) finds the same or better solutions.
Benchmark 3: Ablation Study
Disable one component at a time and measure performance degradation: (1) no novelty filter, (2) no ASI, (3) no learning logs, (4) no prompt evolution, (5) no bandit model selection. Quantify each component's contribution.
Benchmark 4: Long-Run Improvement
Run 2000 generations on a hard problem and measure score-vs-generation curve. Validate that prompt co-evolution and self-improvement prevent plateau. Compare against ShinkaEvolve and OpenEvolve on same problem.
19. Conclusion & PhD Research Directions
19.1 Summary of Design Choices
OmniEvolve synthesizes the best mechanisms from across the field into a unified, principled architecture. The key design decisions are:
| Decision | Choice | Rationale | Source |
|---|---|---|---|
| Population Structure | MAP-Elites + Islands + Archive (3-layer) | Combines behavioral diversity (MAP), isolation (islands), and long-term memory (archive) | AlphaEvolve + ShinkaEvolve + DGM |
| Mutation Engine | 6 modes with UCB1 bandit scheduling | No single mutation mode is optimal; bandit learns what works per problem per phase | ShinkaEvolve + GEPA + Darwinian Evolver |
| Parent Selection | Power-law rank × novelty penalty × recency | Composite weight avoids oversampling any individual while favoring high-fitness parents | ShinkaEvolve + Darwinian Evolver |
| Novelty Filtering | 2-stage: embedding + LLM judge | Cheap embedding filter removes obvious duplicates; expensive LLM judge handles semantic similarity | ShinkaEvolve |
| LLM Orchestration | Hierarchical bandit (tier then model) | Prevents expensive models from starving cheap ones; efficiency-aware reward (quality/cost) | ShinkaEvolve + AB-MCTS |
| Diagnostic Feedback | Mandatory ASI schema from all evaluators | Structured diagnostics enable reflection-driven mutation, failure-targeted repair, and log compression | GEPA |
| Knowledge Sharing | Cross-population learning logs + skill store | Individual islands learn in isolation; logs share discoveries; skills transfer to cheap models | Darwinian Evolver + GEPA Skills |
| Prompt Evolution | ASI-informed prompt mutation per mode | Prompts that consistently produce diagnostic failures get specifically mutated to address those failures | ShinkaEvolve (extended) |
| Self-Improvement | Scoped to mutable zone + A/B validation | Controllable improvement without risking evaluation integrity; statistical validation prevents regressions | DGM (constrained) |
| Search Strategy | Adaptive: evolutionary + AB-MCTS | Population search for broad coverage; tree search for deep refinement on stagnant islands | AB-MCTS + ShinkaEvolve |
19.2 Open Research Questions (PhD Directions)
RQ1: Optimal Population Topology
How should islands communicate? Ring topology, hub-and-spoke, fully connected, adaptive? What migration policies maximize exploration without premature convergence? Can meta-learning discover optimal topologies per problem class?
Suggested approach: Multi-armed bandit over topology configurations; measure diversity vs. convergence trade-off.
RQ2: Behavioral Descriptor Design
For code evolution, what are the right behavioral descriptors for MAP-Elites? Code complexity, runtime profile, test coverage pattern, semantic embedding? How to design descriptors that capture meaningful diversity without explosion of cell count?
Suggested approach: Unsupervised discovery of descriptors from evaluation data using dimensionality reduction.
RQ3: Learning Log Information Value
How much does the learning log actually help? Can we formalize its value (Eq. 17) and show diminishing returns as it grows? When should old entries be purged? What compression ratio preserves maximum mutual information?
Suggested approach: Ablation + information-theoretic analysis of log compression methods.
RQ4: Prompt Evolution Convergence
Do evolved prompts converge to a fixed point, or do they continuously improve? Is there a theoretical bound on prompt fitness? Do evolved prompts generalize across problem types, or are they problem-specific?
Suggested approach: Track prompt entropy over generations; measure transfer across problem domains.
RQ5: Safe Self-Improvement Boundaries
What is the precise formal boundary between safe and unsafe self-modification in evolutionary systems? Can a type system enforce this boundary? How do safety guarantees compose when multiple components self-modify simultaneously?
Suggested approach: Formal verification framework; capability amplification bounds.
RQ6: Multi-Task Transfer Learning
Can OmniEvolve's learned skills, prompts, and meta-recommendations transfer across fundamentally different problem domains (e.g., sorting → scientific discovery → infrastructure optimization)? What is the transfer hierarchy?
Suggested approach: Curriculum learning framework; measure zero-shot and few-shot transfer across domain pairs.
19.3 The Frontier: What Comes After OmniEvolve
**The Next Level:** OmniEvolve represents the state-of-the-art integration of 2024–2026 methods. The next research frontier lies in:
**Continuous learning**: Systems that never stop improving, with formal guarantees on monotonic long-run performance
**Multi-agent co-evolution**: Populations of interacting agents that evolve communication protocols and collaborative strategies
**Neural-symbolic hybrid search**: Combining LLM-based code mutation with formal verification and constraint propagation
**Evolutionary meta-learning**: Systems that evolve their own evolutionary algorithms — a true Gödel Machine realized safely
**Human-AI co-evolution**: Humans and AI in tight feedback loops where humans provide insight, AI provides scale — the AHC058/ICFP model generalized
**Final Recommendation for PhD Implementation:** Begin with Phases 1–4 (core loop + MAP-Elites + bandit LLM + ASI diagnostics). This combination alone should surpass any single existing open-source system on most benchmarks. Phases 5–8 (learning logs + prompt evolution + self-improvement) are the genuine research contributions — they go beyond what any single surveyed system has fully implemented. Phases 9–10 provide the evaluation infrastructure for publication-quality results.
OmniEvolve Architecture Blueprint • Evolutionary AI Systems Survey 2024–2026
Based on analysis of 17 systems: AlphaEvolve, OpenEvolve, ShinkaEvolve, GEPA, LLM4AD, SkyDiscover/AdaEvolve, DGM, Darwinian Evolver, GEPA Skills, Confluence Labs, Arcgentica, AB-MCTS, AI Scientist, ALE-Bench, AHC058, ICFP2025, arXiv papers