EvoX: Meta-Evolution for Automated Discovery
An adaptive evolution method that optimizes its own evolutionary search process through two-level co-evolution of solutions and search strategies.

Organization: UC Berkeley Sky Lab + Stanford University + Bespoke Labs
Published: February 26, 2026 (v1), revised March 16, 2026 (v2)
Type: Research Paper (arXiv:2602.23413v2)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: EvoX: Meta-Evolution for Automated Discovery
Paper URL: arXiv:2602.23413v2
HTML Version: arxiv.org/html/2602.23413v2
PDF: arxiv.org/pdf/2602.23413v2
Code Repository: github.com/skydiscover-ai/skydiscover
arXiv Categories: cs.LG (Machine Learning), cs.CL (Computation and Language), cs.NE (Neural and Evolutionary Computing)
Submission History:
- v1: February 26, 2026 (5,230 KB)
- v2: March 16, 2026 (5,230 KB) — current version
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
DOI: 10.48550/arXiv.2602.23413
Lineage: Builds directly on the LLM-driven evolutionary search paradigm established by FunSearch (Nature 2023), AlphaEvolve (Google DeepMind 2025), and the earlier SkyDiscover/AdaEvolve framework (arXiv:2602.20133) from the same Berkeley lab. EvoX represents the team's second major contribution following SkyDiscover/AdaEvolve, pivoting from hierarchical adaptive resource allocation to full meta-evolution of the search strategy itself. The paper is released through the SkyDiscover framework infrastructure, sharing the same codebase and benchmark suite.
Relationship to SkyDiscover: EvoX is implemented as a first-class algorithm backend within the SkyDiscover framework. While AdaEvolve focuses on three-level hierarchical adaptation (local, global, meta) using accumulated improvement signals, EvoX takes a fundamentally different approach: it treats the entire search strategy as an evolvable artifact, using an LLM to generate complete strategy implementations (Python classes) rather than tuning pre-defined knobs.
2 Authors and Team
Authors (17 total, * denotes equal contribution):
Lead Authors (equal contribution): Shu Liu*^1, Shubham Agarwal*^1
Contributing Authors: Monishwaran Maheswaran^1, Mert Cemri^1, Zhifei Li^1, Qiuyang Mang^1, Ashwin Naren^1, Ethan Boneh^2, Audrey Cheng^1, Melissa Z. Pan^1, Alexander Du^1
Senior Authors: Kurt Keutzer^1, Alvin Cheung^1, Alexandros G. Dimakis^{1,3}, Koushik Sen^1, Matei Zaharia^1, Ion Stoica^1
Affiliations:
1. UC Berkeley
2. Stanford University
3. Bespoke Labs
Research Group Context
The team is anchored in the UC Berkeley Sky Lab, the same research group responsible for foundational systems infrastructure including Apache Spark (Zaharia), Ray (Stoica), vLLM, and Alpa. This systems-engineering DNA is visible in EvoX's emphasis on modular architecture, fair benchmarking infrastructure, and practical systems optimization tasks.
| PI | Known For | Role in EvoX |
|---|---|---|
| Ion Stoica | Ray, Spark, Anyscale | Systems architecture, distributed compute |
| Matei Zaharia | Spark, MLflow, Databricks | Data-intensive optimization, framework design |
| Koushik Sen | KLEE, concolic testing, program analysis | Search strategy formalization, program synthesis |
| Kurt Keutzer | Neural architecture efficiency | LLM cost optimization, model selection |
| Alvin Cheung | Database optimization, verified lifting | Systems benchmarks, SQL optimization tasks |
| Alexandros G. Dimakis | Compressed sensing, generative models | Theoretical foundations, Bespoke Labs connection |
Notable cross-institutional contributions: Ethan Boneh (Stanford) brings connections to the broader Stanford AI/systems community. The Dimakis affiliation with Bespoke Labs (a startup focused on AI-generated data) suggests potential commercial applications of the meta-evolution paradigm.
Author overlap with prior work: Significant overlap with the SkyDiscover/AdaEvolve team (Cemri, Agarwal, Liu, Cheng, Mang, Naren, Sen, Zaharia, Dimakis, Stoica), the GEPA team (UC Berkeley / Stanford), and the ADRS-Bench systems benchmark authors. This continuity of personnel means EvoX benefits from deep institutional knowledge of the benchmark suite and baseline implementations.
3 Core Contribution
[!important] Key Novelty EvoX is the first LLM-driven evolutionary system to treat the search strategy itself as an evolvable object, implementing a two-level co-evolution process where an outer "meta-evolution" loop generates, evaluates, and replaces complete search strategy implementations (Python classes) based on their observed downstream performance, while an inner "solution-evolution" loop generates candidate solutions under the currently active strategy.
The Central Insight
All prior LLM-driven evolutionary systems — including AlphaEvolve, OpenEvolve, GEPA, ShinkaEvolve, and even AdaEvolve — share a fundamental limitation: their search strategies are either fixed (predefined knobs that remain static throughout execution) or parametrically adaptive (adjusting pre-defined parameters like exploration-exploitation ratios through bandit algorithms). EvoX eliminates this constraint by making the search strategy itself a first-class citizen in the evolutionary process.
The key observation motivating this work is threefold:
- Cross-task variation: Different optimization problems require fundamentally different search strategies. Packing problems benefit from iterative refinement of strong candidates, while GPU placement problems may require wholesale paradigm shifts (e.g., from greedy heuristics to bin-packing formulations).
- Within-task non-stationarity: Even within a single optimization run, the optimal search strategy changes over time. Early stages benefit from exploration, later stages from exploitation — but the transition timing is task-dependent and cannot be predetermined.
- Strategy expressiveness bottleneck: Parametric adaptation (adjusting knobs within a fixed strategy template) cannot capture qualitative shifts in search behavior. Switching from tournament selection to Pareto-frontier sampling, or from single-parent refinement to multi-parent crossover, requires changing the structure of the strategy, not its parameters.
What Makes EvoX Novel
- Search strategy as code: Rather than parameterizing strategies as vectors of hyperparameters, EvoX represents each search strategy as a complete Python class implementing add() and sample() methods. This allows the LLM to generate arbitrary selection mechanisms, data structures, and heuristics — unconstrained by a predefined strategy template.
- Two-level co-evolution: The system maintains two coupled evolutionary processes operating at different timescales: a fast inner loop evolving candidate solutions (one LLM call per iteration) and a slow outer loop evolving search strategies (one LLM call per strategy window, triggered by stagnation detection).
- Progress-conditioned strategy generation: New strategies are generated by the LLM conditioned on: (a) the history of prior strategies and their performance, (b) the current state of the solution population (via a descriptor function), and (c) the downstream problem context. This allows the meta-evolution loop to learn from its own history.
- Strategy validation: Generated strategies undergo a validity test before deployment, ensuring they are syntactically correct Python, satisfy interface contracts, and produce valid outputs. Invalid strategies are discarded and regenerated.
- Stagnation-triggered adaptation: The meta-evolution loop is event-driven rather than periodic — new strategies are generated only when the current strategy fails to produce sufficient progress, as measured by a stagnation threshold τ. This avoids unnecessary strategy churn during productive phases.
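The validity test for generated strategies can be made concrete. The sketch below is an illustrative reconstruction based on the description above, not the paper's code; the function name and the exact checks (compilation, interface inspection) are assumptions.

```python
REQUIRED_METHODS = ("add", "sample")

def validate_strategy(source: str, class_name: str = "EvolvedProgramDatabase"):
    """Validity-test sketch: syntactic check, then interface check.
    Returns the strategy class on success, or None if the candidate
    should be discarded and regenerated."""
    # 1. Syntactic validity: the LLM output must compile as Python.
    try:
        code = compile(source, "<strategy>", "exec")
    except SyntaxError:
        return None
    # 2. Interface contract: the class must exist and expose add/sample.
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return None
    cls = namespace.get(class_name)
    if cls is None or not all(
        callable(getattr(cls, m, None)) for m in REQUIRED_METHODS
    ):
        return None
    return cls

# A minimal valid strategy passes; a syntactically broken one is rejected.
valid_src = """
class EvolvedProgramDatabase:
    def add(self, program): pass
    def sample(self): return {}, []
"""
assert validate_strategy(valid_src) is not None
assert validate_strategy("def add(:") is None  # syntax error -> discarded
```

A real implementation would additionally run the candidate strategy against a small synthetic population to confirm it produces valid outputs, per the paper's third criterion.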
Relationship to Prior Work
| System | Year | Search Strategy | Adaptation Mechanism | Strategy Expressiveness | Benchmark Scale |
|---|---|---|---|---|---|
| FunSearch | 2023 | Fixed (best-shot sampling) | None | N/A (static) | ~5 math tasks |
| AlphaEvolve | 2025 | MAP-Elites with fixed ratios | None (hand-tuned) | Low (predefined knobs) | ~20 tasks |
| OpenEvolve | 2025 | Static elite + diversity | None (hand-tuned) | Low (predefined heuristics) | ~15 tasks |
| ShinkaEvolve | 2025 | Islands + bandit LLM selection | Parametric (LLM model selection) | Medium (bandit over models) | ~20 tasks |
| GEPA | 2026 | Pareto-frontier + reflection | Parametric (reflection-guided) | Medium (fixed architecture) | ~30 tasks |
| AdaEvolve | 2026 | Three-level hierarchical | Parametric (bandit + meta-LLM directives) | Medium (three predefined levels) | 200+ tasks |
| ThetaEvolve | 2026 | Fixed strategy + RL generator | Model adaptation (RL fine-tuning) | Low (adapts model, not strategy) | ~20 tasks |
| CodeEvolve | 2026 | Island-based GA | None (fixed GA operators) | Low (predefined GA) | ~15 tasks |
| EvoX | 2026 | LLM-generated Python classes | Structural (full strategy evolution) | Maximal (arbitrary code) | 200+ tasks |
The fundamental distinction: prior systems adapt parameters of a fixed strategy architecture. EvoX adapts the architecture itself by generating complete strategy implementations as code.
Formal Problem Statement
EvoX formalizes LLM-driven optimization as two coupled optimization problems:
Inner optimization (solution evolution): Given a search strategy S, maximize the best score in the solution population D_T under a fixed evaluation budget T:
max_{(x,s,a) ∈ D_T} s
Outer optimization (meta-evolution): Find the sequence of search strategies S_0, S_1, ..., S_K that maximizes the final best score, where each strategy S_k is deployed for a window of W solution-evolution iterations and evaluated by the progress it induces:
max_{S_0,...,S_K} max_{(x,s,a) ∈ D_T} s
subject to: each S_k is a valid SearchStrategy implementation
K ≤ T/W (budget constraint)
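Assuming stand-in callables for G_sol, G_str, and the evaluator E, the coupled optimization can be sketched as a single driver loop; the stagnation rule (window improvement Δ < τ) and the budget constraint K ≤ T/W follow directly from the formulation above. All names here are illustrative, not the paper's API.

```python
def meta_evolve(gen_solution, gen_strategy, evaluate, T=100, W=10, tau=1e-6):
    """Two-level co-evolution sketch. gen_solution, gen_strategy, and
    evaluate stand in for G_sol, G_str, and E; the real system
    conditions both generators on rich history and population context."""
    history = []                       # prior strategies + their progress
    strategy = gen_strategy(history)   # S_0
    population, best = [], float("-inf")
    window_start_best = best
    for t in range(T):                 # inner loop: one G_sol call per step
        candidate = gen_solution(strategy, population)
        score = evaluate(candidate)
        population.append((candidate, score))
        best = max(best, score)
        if (t + 1) % W == 0:           # end of a strategy window
            delta = best - window_start_best
            history.append((strategy, delta))
            if delta < tau:            # stagnation-triggered meta-evolution
                strategy = gen_strategy(history)   # generate S_{k+1}
            window_start_best = best
    return best
```

Because at most one strategy-generation event can fire per window, the number of strategies K is bounded by T/W, matching the budget constraint.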
4 Supported Solutions
EvoX targets any optimization problem where candidate solutions can be expressed as code (programs, algorithms, heuristics) and evaluated by an automated scorer. The system is agnostic to the solution domain — it operates on the search process rather than on domain-specific representations.
4.1 Solution Types Supported
| Solution Type | Description | Demonstrated In Paper | Example |
|---|---|---|---|
| Mathematical constructions | Programs generating geometric/combinatorial objects | ✅ Yes (8 tasks) | Circle-packing configurations, point placements |
| Systems heuristics | Scheduling, placement, and routing algorithms | ✅ Yes (6 tasks) | GPU model placement, transaction scheduling |
| Competitive programming solutions | Full algorithmic solutions to CP problems | ✅ Yes (172 tasks) | FrontierCS open-ended problems |
| NP-hard optimization heuristics | Metaheuristics for AtCoder-style contests | ✅ Yes (10 tasks) | Territory control, sliding puzzle, graph classification |
| Signal processing filters | Causal, online filtering programs | ✅ Yes (1 task) | Multi-objective noisy time series filter |
| SQL generation strategies | LLM-driven SQL prompting/post-processing | ✅ Yes (1 task) | LLM-SQL accuracy optimization |
| Telemetry reconstruction | Missing signal reconstruction algorithms | ✅ Yes (1 task) | Temporal/spatial correlation exploitation |
| Abstract reasoning programs | ARC-AGI task solvers | ✅ Yes (evaluation) | Compositional reasoning programs |
4.2 What EvoX Does NOT Support
| Limitation | Reason |
|---|---|
| Non-programmatic solutions | All solutions must be expressible as executable code |
| Continuous optimization (direct) | No gradient computation — operates through code-level search |
| Real-time adaptation | Designed for offline optimization, not online/streaming settings |
| Multi-agent coordination | Single-population evolution; no explicit agent communication |
| Hardware-in-the-loop | Evaluators must be automated; no physical experiments |
4.3 Benchmark Portfolio
EvoX is evaluated across approximately 200 optimization tasks organized into three families:
| Domain | Count | Source | Task Type |
|---|---|---|---|
| Mathematical optimization | 8 | Custom + prior work | Circle packing, Heilbronn, autocorrelation, signal processing |
| Systems performance optimization | 6 | ADRS-Bench | GPU placement, scheduling, cloud transfer, telemetry |
| Algorithmic problems | 182 | ALE-Bench-Lite (10) + FrontierCS (172) | NP-hard optimization, constructive, interactive |
| Abstract reasoning | — | ARC-AGI-2 | Compositional reasoning (supplementary evaluation) |
| Total | ~196+ | — | — |
5 LLM Integration
5.1 Models Used
EvoX uses LLMs in two distinct roles, each potentially served by different models:
| Role | Model Options Tested | Purpose | Call Frequency |
|---|---|---|---|
| Solution generator (G_sol) | GPT-5, Gemini-3.0-Pro | Generate candidate solutions given parent + operator + inspiration set | Every inner-loop iteration (T calls total) |
| Strategy generator (G_str) | GPT-5, Gemini-3.0-Pro | Generate new search strategy implementations (Python classes) | Only on stagnation events (K << T calls total) |
| Operator instantiation | Lightweight model (unspecified) | Generate operator-specific prompts from problem description | Once per problem, at initialization |
A critical design choice: the strategy generator and solution generator can be the same or different LLM instances. The paper evaluates both GPT-5 and Gemini-3.0-Pro as backbone models, using the same model for both roles in each experiment.
5.2 How LLMs Are Used
EvoX uses LLMs in two fundamentally different modes:
Mode 1: Solution Generation (Inner Loop)
The solution LLM operates as a standard mutation operator in the evolutionary loop. Given a parent candidate, a variation operator label, and an optional inspiration set, it produces a modified candidate solution. This is identical in structure to how AlphaEvolve, OpenEvolve, and other prior systems use LLMs.
Mode 2: Strategy Generation (Outer Loop)
The strategy LLM operates as a meta-programmer. Given the history of prior strategies, their performance profiles, and the current population state, it generates a complete Python class implementing the EvolvedProgramDatabase interface. This is the novel contribution — the LLM is generating the search algorithm itself, not just candidate solutions.
5.3 Prompt Architecture
EvoX employs a multi-layered prompt architecture. The most important prompt is the search strategy evolution prompt (P_search), which instructs the LLM to generate a complete EvolvedProgramDatabase class.
Strategy Evolution System Prompt
System Model: LLM-Driven Optimization with Evolving Search
You are an expert developer implementing a search strategy for
LLM-driven program optimization. Your task is to implement
EvolvedProgramDatabase, which realizes the SearchStrategy
interface used by the optimizer.
The optimizer maintains a population database of evaluated
candidate programs. At each step, the search strategy constructs
the next-generation LLM input by choosing:
(i) a parent program to modify,
(ii) a variation operator that specifies how the parent
should be modified, and
(iii) an optional inspiration set of past programs drawn
from the population.
The generated candidate is then evaluated and added back to
the population. The quality of a search strategy is determined
by how effectively it improves the best solution over the course
of the optimization process.
You must implement a class named EvolvedProgramDatabase, which
defines the behavior of the search strategy through two methods:
- add: controls how evaluated programs are added to the population
- sample: selects the parent program, the variation operator
applied to it, and an optional inspiration set of past programs.
The base class provides:
self.programs: Dict[str, EvolvedProgram] (all stored programs)
self.get(program_id) for retrieval
self.DIVERGE_LABEL and self.REFINE_LABEL predefined constants
Strategy Evolution Constraints Block
======================== CONSTRAINTS ========================
1. Program evaluation metrics are read-only and must not be
modified.
2. All imports and dataclass definitions must be preserved.
3. All variables used by the strategy must be initialized with
safe default values to avoid runtime errors.
4. When accessing metrics, always validate types before
arithmetic (isinstance(value, (int, float))).
======================== CRITICAL INSTRUCTION ========================
Design a search strategy wisely and adaptively to maximize
improvement.
Strategy Evolution User Prompt Structure
The user message in the strategy evolution prompt contains four key sections:
1. Downstream problem context:
"Your search algorithm evolves solutions for the following
downstream problem. Use this information to inform your
search strategy."
{problem_template}
2. Search algorithm information:
- Search window definition (iteration range, optimization
horizon)
- Current population state via descriptor function φ(D_t)
3. Strategy history:
- Prior strategies and their observed performance metrics
- Improvement rates per strategy window
- Stagnation patterns observed
4. Generation instruction:
"Design a search strategy wisely and adaptively to maximize
improvement."
Variation Operator Labels
The strategy's sample() method selects among three variation operator labels that govern the type of modification requested from the solution LLM:
| Label | Constant | Semantics | Typical Use |
|---|---|---|---|
"" (empty string) |
Default | Free-form variation — no constraints on edit scope | Exploratory, early-stage search |
REFINE_LABEL |
self.REFINE_LABEL |
Local refinement — small, structure-preserving edits | Exploitation, parameter tuning |
DIVERGE_LABEL |
self.DIVERGE_LABEL |
Structural variation — larger, exploratory changes | Escaping local optima, paradigm shifts |
Each generation step applies exactly one variation operator. The search strategy's intelligence lies in choosing when to apply each operator based on population state.
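As an illustration of where that intelligence might live, the following sketch shows one plausible policy (not the paper's) for choosing among the three labels from recent score improvements:

```python
def choose_operator(recent_deltas, refine_label="REFINE", diverge_label="DIVERGE"):
    """Illustrative operator choice: free-form variation while history is
    too short to judge, REFINE while recent iterations still improve the
    best score, DIVERGE after a flat stretch (to escape local optima)."""
    if len(recent_deltas) < 5:
        return ""                 # free-form: exploratory early-stage search
    if max(recent_deltas[-5:]) > 0:
        return refine_label       # still improving: local refinement
    return diverge_label          # plateau: structural variation

assert choose_operator([]) == ""                        # no history yet
assert choose_operator([0.0, 0.1, 0.0, 0.2, 0.05]) == "REFINE"
assert choose_operator([0.0] * 8) == "DIVERGE"          # five flat steps
```

An evolved strategy is free to implement far richer policies than this (e.g., per-operator bandits over the lineage statistics), which is precisely the expressiveness argument made above.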
Solution Generation Prompt Structure
System message:
You are an expert software developer tasked with iteratively
improving a codebase. Your objective is to maximize the
combined score while exploring diverse solutions across
feature dimensions. The system maintains a population of
programs; both high performance and diversity are valuable
signals.
User message:
[Problem description]
[Parent code to modify]
[Variation operator instruction (refine/diverge/free-form)]
[Inspiration set programs (optional)]
[Evaluation feedback from parent]
5.4 Operator Instantiation
At the start of each problem, EvoX uses a lightweight model to generate operator-specific prompts from the problem description. This transforms the generic operator labels (REFINE, DIVERGE) into task-specific instructions. For example:
- REFINE for circle packing: "Adjust individual circle positions by small deltas while maintaining non-overlap constraints"
- DIVERGE for circle packing: "Redesign the packing strategy entirely — consider different initialization patterns, different optimization algorithms, or different constraint relaxation approaches"
- REFINE for GPU placement: "Tune load thresholds and replication factors in the current placement heuristic"
- DIVERGE for GPU placement: "Reformulate the placement problem as bin-packing, graph partitioning, or constraint satisfaction"
6 Key Results
6.1 Headline Results
EvoX is evaluated against strong baselines across nearly 200 tasks. The headline comparison against human-designed solutions and prior AI systems:
| Task | Human Best | AlphaEvolve | AI Best (Open) | EvoX | Delta vs Human | Delta vs AI Best |
|---|---|---|---|---|---|---|
| Circle-Packing (↑) | 2.634 | 2.635 | — | 2.636 | +0.08% | +0.04% (vs AlphaEvolve) |
| MinMaxMinDist (↓) | 4.16584 | 4.16579 | — | 4.16578 | -0.001% | -0.0002% (vs AlphaEvolve) |
| Cloud Transfer (↓) | 626.24 | — | 645.72 | 623.69 | -0.41% | -3.41% |
| GPU Placement (↑) | 21.89 | — | 26.26 | 30.52 | +39.4% | +16.2% |
| Txn Scheduling (↑) | 2724.8 | — | 4329 | 4347.8 | +59.6% | +0.44% |
| Frontier-CS median (↑) | — | — | 56.2 | 75.5 | — | +34.3% |
EvoX outperforms all prior AI-driven evolutionary methods on 96% of the math and system optimization benchmarks, and achieves a 34.3% improvement in median score on 172 competitive programming problems.
6.2 Detailed Mathematics Results (GPT-5 Backbone, Best-of-3-Runs)
| Task | Direction | Human | AlphaEvolve | ThetaEvolve | CodeEvolve | Random | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|---|---|---|---|---|---|
| Circle Packing | ↑ | 2.634 | 2.635 | 2.636 | 2.636 | 2.500 | 2.541 | 2.629 | 2.508 | 2.636 |
| Circle Packing Rect | ↑ | 2.364 | 2.366 | — | 2.366 | 2.278 | 2.276 | 2.354 | 2.358 | 2.360 |
| Heilbronn Convex | ↑ | 0.0306 | 0.0309 | — | — | 0.0198 | 0.0267 | 0.0269 | 0.0256 | 0.0272 |
| Heilbronn Triangle | ↑ | 0.0360 | 0.0365 | — | — | 0.0285 | 0.0283 | 0.0332 | 0.0338 | 0.0339 |
| MinMaxMinDist (16,2) | ↓ | 12.89 | 12.89 | — | 12.89 | 13.00 | 13.00 | 12.95 | 12.98 | 12.89 |
| MinMaxMinDist (14,3) | ↓ | 4.17 | 4.17 | — | 4.17 | 4.46 | 4.46 | 4.17 | 4.17 | 4.18 |
| Autocorrelation | ↓ | 1.4581 | 1.4557 | 1.4930 | — | 1.4607 | 1.4610 | 1.4758 | 1.4614 | 1.4609 |
| Signal Processing | ↑ | — | — | — | — | 0.5858 | 0.6219 | 0.7057 | 0.5328 | 0.6898 |
6.3 Detailed Systems Results (GPT-5 Backbone, Best-of-3-Runs)
| Task | Direction | Human | Random | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|---|---|---|
| EPLB | ↑ | 0.1265 | 0.1445 | 0.1272 | 0.1445 | 0.1272 | 0.1453 |
| PRISM | ↑ | 21.89 | 26.23 | 26.23 | 26.23 | 26.26 | 30.52 |
| LLM-SQL | ↑ | 0.692 | 0.729 | 0.716 | 0.713 | 0.713 | 0.730 |
| Cloudcast | ↓ | 626.24 | 644.84 | 729.80 | 645.72 | 812.74 | 637.14 |
| Transaction | ↑ | 2724.8 | 4149.38 | 4237.3 | 3984.1 | 4329.0 | 4347.83 |
| Telemetry | ↑ | 0.8222 | 0.9498 | 0.9515 | 0.9477 | 0.9515 | 0.9520 |
6.4 ARC-AGI-2 Results
EvoX was also evaluated on ARC-AGI-2 abstract reasoning tasks, which are not traditional optimization benchmarks. Under a matched inference budget of 30 LLM iterations per task:
| Backbone Model | OpenEvolve | EvoX | Delta |
|---|---|---|---|
| GPT-5 | 41% | 48% | +7pp |
| Gemini-3-Pro | 45% | 51% | +6pp |
The ARC-AGI results are notable because ARC tasks differ fundamentally from optimization — they assume strict train-test separation rather than iterative adaptation during inference. EvoX's improvement here suggests the meta-evolution of search strategies has benefits beyond pure optimization, potentially improving the LLM's program synthesis capability through more effective context construction.
6.5 Gemini-3.0-Pro Backbone Comparison
EvoX maintains its advantages when switching to Gemini-3.0-Pro as the backbone:
| Task | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|
| Circle Packing (↑, best) | 2.5414 | 2.6208 | 2.6358 | 2.6360 |
| Signal Processing (↑, best) | 0.5649 | 0.6798 | 0.5049 | 0.7429 |
| EPLB (↑, best) | 0.1272 | 0.1272 | 0.1272 | 0.1453 |
| Cloudcast (↓, best) | 707.82 | 667.06 | 1032.42 | 623.69 |
6.6 Win Rate Analysis
Across all evaluated tasks:
- Math + Systems (14 tasks): EvoX wins on 96% of tasks
- FrontierCS (172 tasks): EvoX achieves 75.5 median score vs. 56.2 for best baseline (34.3% improvement)
- ARC-AGI-2: EvoX improves accuracy by 6-7 percentage points over OpenEvolve
6.7 Key Quantitative Insights
GPU Placement (PRISM) is the most dramatic result: EvoX achieves 30.52 throughput vs. 21.89 human best — a 39.4% improvement. This task exemplifies EvoX's strength: the meta-evolution loop discovers that initial greedy placement strategies plateau, and generates a new strategy that reformulates the problem as bin-packing assignment, unlocking a qualitatively different solution space.
Signal Processing with Gemini-3.0-Pro shows EvoX's largest relative gain: 0.7429 vs. 0.5049 for ShinkaEvolve (47.1% improvement). The multi-objective nature of this task benefits strongly from strategy evolution, as the meta-loop discovers strategies that explicitly sample candidates along different Pareto trade-off objectives.
7 Reproducibility
7.1 Open Source Status
| Aspect | Status | Details |
|---|---|---|
| Paper | ✅ Open Access | arXiv with CC BY 4.0 license |
| Code | ✅ Open Source | github.com/skydiscover-ai/skydiscover |
| Benchmark Suite | ✅ Included | ~200 tasks shipped with the framework |
| Pre-trained Models | N/A | Uses commercial LLM APIs (GPT-5, Gemini-3.0-Pro) |
| Evaluation Harness | ✅ Included | Automated evaluators for all benchmarks |
| Baseline Implementations | ✅ Included | OpenEvolve, GEPA, ShinkaEvolve backends |
7.2 Reproducibility Requirements
LLM API Access:
- GPT-5 API access (OpenAI) — required for GPT-5 experiments
- Gemini-3.0-Pro API access (Google) — required for Gemini experiments
- API costs are non-trivial (see Section 8)

Software Dependencies:
- Python 3.x (version not explicitly specified in paper)
- SkyDiscover framework and its dependencies
- Standard scientific Python stack (numpy, etc.)

Compute Requirements:
- No GPU required for the evolutionary search itself (LLM calls are API-based)
- Some benchmark evaluators (e.g., GPU placement) may require specific hardware for evaluation
- Sequential evaluation budget: 100 iterations per task (fixed across all open AI frameworks)
7.3 Reproducibility Concerns
- LLM non-determinism: Both GPT-5 and Gemini-3.0-Pro produce stochastic outputs. The paper reports mean and best over three runs to mitigate this, but exact reproduction requires matching API versions and sampling parameters.
- AlphaEvolve comparison: AlphaEvolve results are "taken directly from its original publication" — not reproduced using the same infrastructure. AlphaEvolve uses internal Google infrastructure and Gemini models that may differ from the versions available via public API.
- Evaluation budget fairness: All open AI frameworks are evaluated under a fixed budget of 100 iterations. This is fair for wall-clock comparison, but may disadvantage methods designed for longer runs. AlphaEvolve, with internal Google compute, likely ran for many more iterations.
- Strategy generator cost: The paper notes that meta-evolution adds minimal overhead (~6% additional LLM calls), but the cost of invalid strategy generation and retry is not detailed.
8 Compute and API Costs
8.1 Evaluation Budget
All open AI frameworks are evaluated under a fixed budget of 100 solution-evaluation iterations per task. This means:
| Component | Calls | Cost Factor |
|---|---|---|
| Solution generation | 100 calls per task | ~100 × (GPT-5 or Gemini-3.0-Pro) per task |
| Strategy generation | K calls per task (K << 100) | ~5-15 × (same model) per task |
| Operator instantiation | 1 call per task (lightweight model) | Negligible |
| Total per task | ~105-115 LLM calls | — |
8.2 Cost Estimates
Based on current (April 2026) API pricing and typical prompt/response lengths for code generation tasks:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Est. Cost per Task | Est. Total (200 tasks) |
|---|---|---|---|---|
| GPT-5 | ~$10 | ~$30 | ~$5-15 | ~$1,000-3,000 |
| Gemini-3.0-Pro | ~$5 | ~$15 | ~$2.50-7.50 | ~$500-1,500 |
Per-experiment cost with 3 runs: Multiply by 3 for the mean/best statistics reported, yielding:
- GPT-5: ~$3,000-9,000 for the full benchmark suite
- Gemini-3.0-Pro: ~$1,500-4,500 for the full benchmark suite
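These per-task figures can be sanity-checked with a back-of-the-envelope calculation. The per-call token counts below are illustrative assumptions, not numbers from the paper:

```python
def task_cost(calls, in_tokens, out_tokens, in_price, out_price):
    """Estimated cost of one task in dollars.
    in_price/out_price are dollars per 1M tokens."""
    return calls * (in_tokens * in_price + out_tokens * out_price) / 1e6

# Assumption: ~110 LLM calls per task, with roughly 6k input tokens
# (problem context, parent code, history) and 1.5k output tokens per call.
gpt5_per_task = task_cost(110, 6_000, 1_500, in_price=10, out_price=30)
gemini_per_task = task_cost(110, 6_000, 1_500, in_price=5, out_price=15)
```

Under these assumptions the GPT-5 estimate lands near the upper half of the ~$5-15 per-task range and the Gemini-3.0-Pro estimate inside the ~$2.50-7.50 range, consistent with the table above.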
8.3 Meta-Evolution Overhead
The paper characterizes EvoX's meta-evolution overhead as approximately 6% additional LLM calls beyond what a fixed-strategy baseline would require. This low overhead is achieved through stagnation-triggered (rather than periodic) strategy evolution:
Strategy evolution overhead = K / T ≈ 5-15 / 100 = 5-15%
Where:
K = number of strategy evolution events
T = total solution evaluation budget (100)
W = window size per strategy (5-20 iterations)
The overhead is front-loaded: most strategy evolution events occur in the first half of the run, as the system discovers effective strategies that sustain progress without triggering stagnation. Later in the run, strategies tend to be retained for longer windows.
8.4 Compute Infrastructure
The paper does not specify the hardware infrastructure used for running experiments. Given that:
- LLM inference is API-based (no local GPU needed for generation)
- Evaluators for most tasks are lightweight (seconds per evaluation)
- The ADRS-Bench systems tasks may require specific hardware configurations
A reasonable estimate is that all experiments could be run on a single workstation with API access, though parallelization across tasks would significantly reduce wall-clock time.
9 Architecture Solution
9.1 System Architecture Overview
EvoX implements a two-level co-evolution architecture. The system can be understood as two coupled control loops operating at different timescales:
┌──────────────────────────────────────────────────────────┐
│ META-EVOLUTION LOOP │
│ (Outer / Slow Loop) │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Strategy │ │ Strategy │ │ Stagnation │ │
│ │ History │──▶│ Generator │──▶│ Detector │ │
│ │ H │ │ G_str(·) │ │ Δ < τ ? │ │
│ └──────────┘ └──────┬───────┘ └────────┬──────────┘ │
│ ▲ │ │ │
│ │ │ S_new │ stagnation │
│ │ ▼ │ signal │
│ │ ┌──────────────┐ │ │
│ │ │ Strategy │ │ │
│ └────────│ Validator │◀────────────┘ │
│ │ Valid(S)? │ │
│ └──────┬───────┘ │
│ │ S_valid │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ SOLUTION-EVOLUTION LOOP │ │
│ │ (Inner / Fast Loop) │ │
│ │ │ │
│ │ ┌──────────┐ │ │
│ │ │ Solution │ │ │
│ │ │ Pop. │◀──────────────────────────┐ │ │
│ │ │ D_t │ │ │ │
│ │ └────┬─────┘ │ │ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌───────┴──────┐ │ │
│ │ │ Search │ │ Solution │ │ Evaluator │ │ │
│ │ │ Strategy │──▶│Generator │──▶│ E(x) │ │ │
│ │ │ S_k │ │ G_sol(·) │ │ → (s, a) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────┘ │ │
│ │ Selects: │ │
│ │ - parent x_par Produces: Evaluates: │ │
│ │ - operator π x' ~ G_sol s(x'), a(x') │ │
│ │ - inspiration I │ │
│ │ │ │
│ │ Repeats for W iterations per strategy window │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
9.2 Data Flow
Problem Definition
│
▼
┌───────────────────────┐
│ Operator Instantiation│
│ (lightweight LLM) │
└───────────┬───────────┘
│ operator prompts
▼
┌───────────────────────┐
│ Initial Strategy │
│ S_0 (random sample) │
└───────────┬───────────┘
│
┌─────────────┴─────────────┐
│ │
▼ │
┌───────────────┐ │
│ Inner Loop │ │
│ (W iters) │ │
│ │ │
│ S_k.sample() │──▶ (x_par, π, I) │
│ │ │ │
│ ▼ │ │
│ G_sol(x_par, │ │
│ π, I) │ │
│ │ │ │
│ ▼ │ │
│ E(x') → s,a │ │
│ │ │ │
│ ▼ │ │
│ S_k.add(x') │ │
└───────┬───────┘ │
│ │
▼ │
┌───────────────┐ │
│ Stagnation? │ │
│ Δ < τ │ │
└───┬───────┬───┘ │
│No │Yes │
│ ▼ │
│ ┌─────────────┐ │
│ │ Meta-Evolve │ │
│ │ Strategy │──────────────┘
│ │ S_{k+1} │ new strategy
│ └─────────────┘
│
▼
┌───────────────┐
│ Budget T │
│ exhausted? │
└───┬───────┬───┘
│No │Yes
│ ▼
│ Return best
│ (x*, s*)
│
└──▶ Continue inner loop
9.3 Strategy Interface Contract
The search strategy is materialized as a Python class conforming to a well-defined interface:
class EvolvedProgramDatabase:
    """Search strategy interface that governs candidate
    selection and variation in the evolutionary loop."""

    # Provided by base class (read-only)
    programs: Dict[str, EvolvedProgram]
    DIVERGE_LABEL: str  # structural variation
    REFINE_LABEL: str   # local refinement

    def get(self, program_id: str) -> EvolvedProgram:
        """Retrieve a program by ID."""
        ...

    def add(self, program: EvolvedProgram) -> None:
        """Add an evaluated program to the population.

        The strategy controls how programs are organized
        internally — e.g., maintaining elite sets, diversity
        archives, Pareto frontiers, or custom data structures.
        """
        ...

    def sample(self) -> Tuple[
        Dict[str, EvolvedProgram],  # parent_dict
        List[EvolvedProgram],       # inspiration_set
    ]:
        """Select parent, variation operator, and inspiration.

        Returns:
            parent_dict: maps a label ("", REFINE_LABEL, or
                DIVERGE_LABEL) to exactly one parent
            inspiration_set: list of programs providing
                complementary context
        """
        ...
9.4 EvolvedProgram Data Structure
Each candidate program in the population carries the following fields:
@dataclass
class EvolvedProgram:
    id: str                       # unique identifier
    code: str                     # program source code
    metrics: ProgramMetrics       # evaluation results
    iteration_found: int          # when this candidate was created
    timestamp: float              # wall-clock creation time
    parent_id: Optional[str]      # ID of parent program
    parent_info: Tuple[str, str]  # (label, parent_id)
    metadata: Dict[str, Any]      # arbitrary metadata

    @property
    def combined_score(self) -> float:
        """Main optimization objective."""
        return self.metrics.combined_score
The parent_info tuple is a critical piece of lineage tracking: it records not just which program was the parent, but which variation operator was applied to it. This enables the strategy to reason about which operators are effective for which types of parents.
10 Component Breakdown
10.1 Solution-Evolution Loop
The inner loop is the standard LLM-driven evolutionary search paradigm, shared with prior systems:
| Component | Function | Implementation |
|---|---|---|
| Solution Population D_t | Stores all evaluated candidates with scores and artifacts | In-memory dictionary (Dict[str, EvolvedProgram]) |
| Search Strategy S_k | Selects parent, operator, and inspiration set | LLM-generated Python class |
| Solution Generator G_sol | Produces new candidate from prompt | GPT-5 or Gemini-3.0-Pro API call |
| Evaluator E | Scores candidate, returns (score, artifacts) | Task-specific automated evaluator |
Inner loop iteration flow:
1. S_k.sample() → (parent_dict, inspiration_set)
2. Construct prompt from parent, operator label, and inspiration set
3. G_sol(prompt) → x' (new candidate code)
4. E(x') → (s(x'), a(x')) (score and artifacts)
5. S_k.add(EvolvedProgram(x', s, a, ...)) → update population
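The five-step flow above can be sketched as a self-contained Python function. Here EvolvedProgram and ProgramMetrics are reduced to minimal stand-ins, and build_prompt, G_sol, and E are hypothetical placeholders for the prompt constructor, the solution-generator LLM call, and the evaluator — a sketch of the control flow, not the framework's actual code:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

# Minimal stand-ins for the framework's data types (illustrative only).
@dataclass
class ProgramMetrics:
    combined_score: float

@dataclass
class EvolvedProgram:
    id: str
    code: str
    metrics: ProgramMetrics
    iteration_found: int
    timestamp: float
    parent_id: Optional[str]
    parent_info: Tuple[str, str]
    metadata: Dict[str, Any] = field(default_factory=dict)

def inner_loop_step(strategy, build_prompt, G_sol, E, iteration):
    """One iteration of the solution-evolution inner loop (steps 1-5)."""
    parent_dict, inspirations = strategy.sample()       # 1. select
    (label, parent), = parent_dict.items()              # exactly one parent
    prompt = build_prompt(parent, label, inspirations)  # 2. build prompt
    code = G_sol(prompt)                                # 3. generate x'
    score, artifacts = E(code)                          # 4. evaluate
    program = EvolvedProgram(                           # 5. record + add
        id=str(uuid.uuid4()),
        code=code,
        metrics=ProgramMetrics(combined_score=score),
        iteration_found=iteration,
        timestamp=time.time(),
        parent_id=parent.id if parent else None,
        parent_info=(label, parent.id if parent else ""),
        metadata={"artifacts": artifacts},
    )
    strategy.add(program)
    return program
```

Note that the strategy object owns both ends of the loop: it chooses what to vary (sample) and how to file the result (add), which is exactly the surface meta-evolution later rewrites.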
10.2 Meta-Evolution Loop
The outer loop is EvoX's novel contribution:
| Component | Function | Implementation |
|---|---|---|
| Stagnation Detector | Monitors progress within each strategy window | Threshold comparison: Δ < τ |
| Population Descriptor φ(·) | Summarizes current population state for strategy generator | Function returning population statistics |
| Strategy History H | Records prior strategies and their performance profiles | List of (strategy_code, performance_metrics) tuples |
| Strategy Generator G_str | Produces new search strategy implementation | GPT-5 or Gemini-3.0-Pro API call |
| Strategy Validator Valid(·) | Checks syntactic and semantic validity of generated strategy | Python compilation + interface conformance test |
Meta-evolution trigger conditions:
1. Current strategy has been active for at least W iterations (minimum window)
2. Progress measure Δ falls below threshold τ (stagnation detected)
3. Budget T has not been exhausted
Strategy generation process:
1. Compute population descriptor: φ(D_t) → population statistics
2. Assemble strategy generation prompt: P_search(H, φ(D_t), problem_context)
3. Generate strategy: S_new ~ G_str(P_search)
4. Validate: Valid(S_new) → boolean
5. If valid: deploy S_new as S_{k+1}, record (S_k, performance_k) in H
6. If invalid: retry generation (up to retry limit)
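The generate-validate-deploy cycle above can be condensed into a short sketch. G_str and is_valid are stand-ins for the strategy-generator LLM call and the Valid(·) check, and the prompt assembly is simplified to a dictionary — illustrative structure only:

```python
def meta_evolve(G_str, is_valid, history, descriptor, context,
                current_strategy, current_perf, max_retries=3):
    """One meta-evolution event: record the retiring strategy, then try
    to generate and validate a replacement (steps 1-6; a sketch)."""
    # Step 5 bookkeeping: record the retiring strategy's performance in H.
    history.append((current_strategy, current_perf))
    # Steps 1-2: assemble the strategy-generation prompt P_search.
    prompt = {"history": history, "population": descriptor,
              "context": context}
    for _ in range(max_retries):       # step 6: bounded retries
        candidate = G_str(prompt)      # step 3: generate S_new
        if is_valid(candidate):        # step 4: validate
            return candidate           # deploy as S_{k+1}
    return current_strategy            # graceful degradation: keep S_k
```

The final fallback mirrors the paper's design: an invalid generation never replaces a working strategy.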
10.3 Variation Operator System
EvoX defines three variation operators that control the type of modification the solution LLM applies:
| Operator | Label | Scope | Analogy |
|---|---|---|---|
| Free-form | "" (empty) | Unconstrained | Exploratory mutation |
| Local Refinement | REFINE_LABEL | Structure-preserving, fine-grained | Parameter tuning, micro-optimization |
| Structural Variation | DIVERGE_LABEL | Coarse-grained redesign | Crossover, paradigm shift |
The operators are instantiated per-problem at initialization using a lightweight LLM to generate task-specific prompts from the problem description. This means the same REFINE_LABEL maps to different concrete instructions depending on the optimization domain.
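As a rough illustration of this instantiation step, a lightweight LLM call per operator label might look like the following. The meta-prompt wording and the light_llm callable are invented for illustration, not taken from the paper:

```python
def instantiate_operators(light_llm, problem_description):
    """Generate task-specific operator prompts at initialization (sketch).

    `light_llm` is a stand-in for the lightweight LLM call; the template
    wording below is hypothetical.
    """
    templates = {
        "": "Write an instruction for free-form, unconstrained edits",
        "REFINE": "Write an instruction for small, structure-preserving "
                  "refinements (tuning constants, micro-optimizations)",
        "DIVERGE": "Write an instruction for coarse-grained structural "
                   "redesigns (new algorithms, new representations)",
    }
    # One call per operator label; the same label thus maps to different
    # concrete instructions on different problems.
    return {
        label: light_llm(f"{ask} for this problem:\n{problem_description}")
        for label, ask in templates.items()
    }
```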
10.4 Evaluator Component
The evaluator is task-specific and provided by the benchmark suite. Key properties:
- Deterministic: Same code → same score (no evaluation noise)
- Automated: No human judgment required
- Scalar output: Primary score s(x) is a single real number
- Auxiliary artifacts: a(x) may include logs, traces, error messages, intermediate outputs
- Sandboxed execution: Candidate code runs in a restricted environment
10.5 Strategy Validation Component
The Valid(·) function ensures generated strategies are deployable:
def is_valid_strategy(strategy_code: str) -> bool:
    """Validate a generated search strategy."""
    # 1. Syntactic validity
    try:
        compile(strategy_code, "<strategy>", "exec")
    except SyntaxError:
        return False

    # 2. Class existence (exec can also fail at definition time,
    #    e.g., on a bad import, so it is guarded too)
    namespace = {}
    try:
        exec(strategy_code, namespace)
    except Exception:
        return False
    if "EvolvedProgramDatabase" not in namespace:
        return False

    # 3. Interface conformance
    cls = namespace["EvolvedProgramDatabase"]
    if not (hasattr(cls, "add") and hasattr(cls, "sample")):
        return False

    # 4. Runtime test with mock data (mock_program is a test fixture
    #    supplied by the validation harness)
    try:
        instance = cls()
        instance.add(mock_program)
        parent_dict, inspiration = instance.sample()
        # Verify output types
        assert isinstance(parent_dict, dict)
        assert isinstance(inspiration, list)
    except Exception:
        return False

    return True
11 Core Mechanisms (Detailed)
11.1 Algorithm 1: EvoX Two-Level Evolution
The complete algorithm, as presented in the paper:
Algorithm 1: EvoX Two-Level Evolution Process

INPUT:
    Budget T, window W, stagnation threshold τ
    Evaluator E
    Solution generator G_sol
    Strategy generator G_str
    Population descriptor φ(·)
    Validity test Valid(·)
    Initial solution database D_0
    Initial strategy S_0

INITIALIZE:
    D_t ← D_0
    S_t ← S_0
    H ← ∅ (strategy history)
    t ← 0

MAIN LOOP:
    while t < T:
        ────────── SOLUTION EVOLUTION (Inner Loop) ──────────
        for w = 1 to W:
            (x_par, π, I) ← S_t.sample(D_t)
            x' ~ G_sol(· | x_par, π, I)
            (s', a') ← E(x')
            D_{t+1} ← D_t ∪ {(x', s', a')}
            t ← t + 1
            if t ≥ T: break
        ────────── STAGNATION DETECTION ──────────
        Δ ← improvement over last W iterations
        if Δ < τ:
            ────────── META-EVOLUTION (Outer Loop) ──────────
            Record (S_t, performance) → H
            repeat:
                S_new ~ G_str(· | H, φ(D_t), problem_context)
            until Valid(S_new) or retry limit reached
            if Valid(S_new):
                S_t ← S_new
            else:
                keep S_t (no valid replacement found)

OUTPUT:
    Return (x*, s*) = argmax_{(x,s,a) ∈ D_T} s
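Algorithm 1 can also be transcribed into a compact, runnable Python driver. To keep the sketch short, strategies are modeled as plain callables S(D) rather than full EvolvedProgramDatabase classes, and E, G_sol, G_str, phi, and valid are passed in as stand-in functions — a sketch under those simplifying assumptions, not the paper's implementation:

```python
def evox(T, W, tau, E, G_sol, G_str, phi, valid, S0, context, retries=3):
    """Compact transcription of Algorithm 1 (sketch with stand-in callables)."""
    D = []                 # solution database: (candidate, score) pairs
    S, H, t = S0, [], 0

    def best():
        return max((s for _, s in D), default=float("-inf"))

    while t < T:
        window_start_best = best()
        for _ in range(W):              # ── inner loop: solution evolution
            x = G_sol(S(D))             # strategy drives candidate creation
            D.append((x, E(x)))         # evaluate and append (never prune)
            t += 1
            if t >= T:
                break
        delta = best() - window_start_best
        if delta < tau and t < T:       # ── stagnation detected
            H.append((S, delta))        # record retiring strategy in H
            for _ in range(retries):    # ── outer loop: meta-evolution
                S_new = G_str(H, phi(D), context)
                if valid(S_new):
                    S = S_new           # deploy S_{k+1}
                    break               # (else keep current S)
    return max(D, key=lambda pair: pair[1])   # (x*, s*)
```

A toy run makes the mechanics visible: a strategy that plateaus triggers stagnation, the "generator" hands back a better strategy, and the best score jumps after the switch.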
11.2 Search Strategy as Code: The Key Abstraction
The most innovative aspect of EvoX is representing search strategies as executable Python classes. This enables an unprecedented level of expressiveness compared to parametric strategy representations.
Example: A simple generated strategy (early in evolution)
class EvolvedProgramDatabase(BaseDatabase):
    """Random sampling strategy (initial S_0)."""

    def __init__(self):
        super().__init__()
        self.elite_size = 5

    def add(self, program: EvolvedProgram) -> None:
        self.programs[program.id] = program

    def sample(self) -> Tuple[Dict, List]:
        if not self.programs:
            return {"": None}, []
        # Random parent selection
        parent = random.choice(list(self.programs.values()))
        # Random operator selection
        label = random.choice(
            ["", self.REFINE_LABEL, self.DIVERGE_LABEL]
        )
        return {label: parent}, []
Example: A sophisticated generated strategy (after several meta-evolution steps)
class EvolvedProgramDatabase(BaseDatabase):
    """Adaptive strategy with Pareto-aware selection and
    stagnation-responsive operator scheduling."""

    def __init__(self):
        super().__init__()
        self.elite_archive = {}
        self.diversity_archive = {}
        self.recent_improvements = []
        self.operator_successes = {"": 0, "REFINE": 0, "DIVERGE": 0}
        self.operator_attempts = {"": 1, "REFINE": 1, "DIVERGE": 1}

    def add(self, program: EvolvedProgram) -> None:
        self.programs[program.id] = program
        score = program.metrics.combined_score
        if isinstance(score, (int, float)):
            # Track improvement
            if self.elite_archive:
                best = max(
                    p.metrics.combined_score
                    for p in self.elite_archive.values()
                    if isinstance(p.metrics.combined_score, (int, float))
                )
                if score > best:
                    self.recent_improvements.append(score - best)
                    # Credit the operator that produced this
                    label = program.parent_info[0]
                    if label in self.operator_successes:
                        self.operator_successes[label] += 1
            # Update elite archive (top-K)
            self.elite_archive[program.id] = program
            if len(self.elite_archive) > 10:
                worst = min(
                    self.elite_archive,
                    key=lambda k:
                        self.elite_archive[k].metrics.combined_score
                )
                del self.elite_archive[worst]

    def sample(self) -> Tuple[Dict, List]:
        if not self.programs:
            return {"": None}, []
        # Operator selection via UCB1
        total = sum(self.operator_attempts.values())
        ucb_scores = {}
        for op in ["", self.REFINE_LABEL, self.DIVERGE_LABEL]:
            key = op if op else ""
            exploit = (
                self.operator_successes.get(key, 0) /
                self.operator_attempts.get(key, 1)
            )
            explore = math.sqrt(
                2 * math.log(total) /
                self.operator_attempts.get(key, 1)
            )
            ucb_scores[op] = exploit + explore
        label = max(ucb_scores, key=ucb_scores.get)
        self.operator_attempts[label if label else ""] += 1
        # Parent selection: elite with probability 0.7, random otherwise
        if random.random() < 0.7 and self.elite_archive:
            parent = random.choice(list(self.elite_archive.values()))
        else:
            parent = random.choice(list(self.programs.values()))
        # Inspiration: diverse set from non-elite programs
        non_elite = [
            p for p in self.programs.values()
            if p.id not in self.elite_archive
        ]
        inspiration = random.sample(non_elite, min(3, len(non_elite)))
        return {label: parent}, inspiration
The transformation from the simple random strategy to the sophisticated UCB1-based adaptive strategy illustrates the power of EvoX's meta-evolution. The LLM generates qualitatively different strategy architectures — including new data structures (elite/diversity archives), new selection mechanisms (UCB1), and new feedback loops (operator success tracking) — none of which were predefined in the system.
11.3 Stagnation Detection Mechanism
Stagnation is detected by comparing the improvement in the best score over the most recent strategy window W against a threshold τ:
Δ_k = max_{(x,s,a) ∈ D_{t}} s - max_{(x,s,a) ∈ D_{t-W}} s
Stagnation triggered iff Δ_k < τ
Design considerations:
- τ is a fixed threshold, not adaptive (unlike AdaEvolve's accumulated improvement signal)
- The threshold should be calibrated per-task or set to a small positive value to detect near-zero improvement
- Using a fixed rather than relative threshold means the absolute magnitude of scores matters
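If the system keeps a running best-score history (one entry per iteration), the stagnation test reduces to a single window difference. A minimal sketch, with best_scores assumed to be a running maximum:

```python
def stagnated(best_scores, W, tau):
    """Check Δ_k = best(D_t) - best(D_{t-W}) < τ.

    `best_scores[i]` is the best score seen up to iteration i (a running
    max), so the window improvement is a simple difference (sketch).
    """
    if len(best_scores) <= W:
        return False          # not enough history for a full window yet
    delta = best_scores[-1] - best_scores[-1 - W]
    return delta < tau
```

Because the comparison is on raw score differences, the same τ behaves very differently on tasks whose scores live on different scales — the calibration concern noted above.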
Comparison with AdaEvolve's stagnation detection:
| Aspect | EvoX | AdaEvolve |
|---|---|---|
| Signal | Raw improvement Δ | Accumulated improvement (scale-invariant EMA) |
| Threshold | Fixed τ | Adaptive (percentile-based) |
| Response | Generate new strategy (structural) | Adjust bandit weights (parametric) |
| Expressiveness | Full strategy replacement | Three-level parameter adjustment |
11.4 Population Descriptor Function φ(·)
The population descriptor compresses the current state of the solution database into a summary that can be included in the strategy generation prompt. While the paper does not fully specify φ(·), based on the prompt structure and interface design, it likely includes:
import statistics

def population_descriptor(D_t: Dict[str, EvolvedProgram]) -> str:
    """Summarize population state for strategy generation."""
    programs = list(D_t.values())
    scores = sorted(
        p.metrics.combined_score for p in programs
        if isinstance(p.metrics.combined_score, (int, float))
    )
    # Hoist per-operator counts out of the f-string
    by_op = {"": 0, "REFINE": 0, "DIVERGE": 0}
    for p in programs:
        label = p.parent_info[0]
        if label in by_op:
            by_op[label] += 1
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"""
Population size: {len(programs)}
Score distribution:
  Best:   {scores[-1]:.6f}
  Median: {scores[len(scores) // 2]:.6f}
  Worst:  {scores[0]:.6f}
  Std:    {std:.6f}
Iteration range: {min(p.iteration_found for p in programs)}
  to {max(p.iteration_found for p in programs)}
Operator distribution:
  Free-form: {by_op[""]}
  Refine:    {by_op["REFINE"]}
  Diverge:   {by_op["DIVERGE"]}
Recent improvement trajectory: [last 10 deltas]
"""
11.5 Strategy History and Conditioning
The strategy history H accumulates all previously deployed strategies and their performance metrics. When generating a new strategy, the LLM is conditioned on H to:
- Avoid repeating failures: If a strategy based on pure exploitation stagnated, the LLM is less likely to generate another pure exploitation strategy.
- Build on successes: If a strategy with UCB1-based operator selection showed initial promise, the LLM may refine this approach.
- Combine ideas: The LLM can synthesize elements from multiple historical strategies into a novel combination.
Strategy History H = [
(S_0, {window: [0, 15], improvement: 0.42, stagnated: True}),
(S_1, {window: [15, 35], improvement: 0.18, stagnated: True}),
(S_2, {window: [35, 70], improvement: 0.31, stagnated: False}),
...
]
11.6 Search Strategy Design Space
EvoX's strategy evolution explores a rich design space. The LLM-generated strategies observed in practice include elements from multiple algorithmic families:
| Strategy Element | Examples Generated | Traditional Analogy |
|---|---|---|
| Parent selection | Tournament, fitness-proportional, rank-based, random, elite-only, Pareto-aware | Genetic algorithm selection operators |
| Inspiration selection | Diversity-maximizing, structurally dissimilar, score-complementary, random subset | Crossover partner selection |
| Operator scheduling | UCB1 bandit, epsilon-greedy, fixed ratio, stagnation-responsive, recency-weighted | Multi-armed bandit, adaptive operator selection |
| Population management | Elite archives, diversity archives, age-based pruning, Pareto fronts, sliding windows | MAP-Elites, NSGA-II, generational GA |
| Internal state tracking | Improvement histories, operator success rates, lineage graphs, novelty scores | Fitness landscape analysis |
The richness of this design space is a direct consequence of representing strategies as code rather than parameter vectors. A UCB1-bandit over operators with Pareto-aware parent selection cannot be expressed as a point in a predefined parameter space — it requires structural innovation.
11.7 Formal Convergence Properties
The paper does not provide formal convergence guarantees for the meta-evolution loop. This is expected given the generality of the approach — arbitrary Python code generation by an LLM is not amenable to standard convergence analysis. However, several informal properties are demonstrated empirically:
- Monotonic best-score: The solution population retains all evaluated candidates, so the best score can only improve or stay constant. The meta-evolution loop cannot decrease the best score.
- Strategy diversity: The stagnation-triggered mechanism ensures new strategies are only generated when current strategies fail, preventing unnecessary churn.
- History conditioning: By conditioning on the full strategy history, the meta-evolution loop has access to information about which approaches have been tried and failed, reducing redundant exploration.
- Graceful degradation: If no valid strategy can be generated, the system retains the current strategy rather than crashing or deploying an invalid strategy.
11.8 Comparison: Fixed vs. Evolved Search Dynamics
The paper demonstrates the difference between fixed and evolved search strategies through a signal processing case study:
Fixed strategy (MAP-Elites, red line):
Iteration 0-30: Rapid improvement (0.0 → 0.55)
Iteration 30-60: Slowing improvement (0.55 → 0.60)
Iteration 60-100: Stagnation (0.60 → 0.62)
Evolved strategy (EvoX, blue line):
Iteration 0-15: Moderate improvement (0.0 → 0.40)
Iteration 15-20: Strategy evolution event #1
Iteration 20-50: Resumed improvement (0.40 → 0.62)
Iteration 50-55: Strategy evolution event #2
Iteration 55-80: Breakthrough (0.62 → 0.72)
Iteration 80-100: Continued gains (0.72 → 0.74)
The characteristic pattern of EvoX is discrete performance breakthroughs following strategy evolution events. These breakthroughs represent qualitative shifts in the search process — not just parameter adjustments — enabled by the structural expressiveness of code-level strategy evolution.
12 Programming Language
12.1 Primary Language
| Language | Role | Notes |
|---|---|---|
| Python | Implementation language | All framework code, strategies, and most candidate solutions |
| Python | Strategy representation | Search strategies are generated as Python classes |
| Various | Candidate solutions | Depends on benchmark; primarily Python for algorithmic tasks |
12.2 Framework Dependencies
EvoX is built on the SkyDiscover framework infrastructure:
| Dependency | Purpose | Version/Notes |
|---|---|---|
| SkyDiscover | Core framework, benchmark suite, evaluation harness | Same codebase (github.com/skydiscover-ai/skydiscover) |
| OpenAI API | GPT-5 access | Via official Python SDK |
| Google AI API | Gemini-3.0-Pro access | Via official Python SDK |
| Python stdlib | Core data structures, random, math, statistics | Standard library |
12.3 Benchmark-Specific Dependencies
Different benchmarks may require additional dependencies:
| Benchmark Family | Dependencies | Notes |
|---|---|---|
| Mathematical optimization | NumPy, SciPy | Geometric computations, optimization |
| Systems (ADRS-Bench) | Benchmark-specific simulators | GPU placement, scheduling simulators |
| FrontierCS | Problem-specific evaluators | Competitive programming judges |
| ALE-Bench-Lite | AtCoder evaluators | NP-hard optimization judges |
12.4 Code Generation Target
The LLM generates Python code in two contexts:
1. Search strategies: Always Python classes conforming to EvolvedProgramDatabase interface
2. Candidate solutions: Typically Python functions or classes, though the framework supports arbitrary executable programs evaluated by the benchmark's evaluator
13 Memory Management
13.1 Solution Population Database
The solution population D_t is the primary stateful data structure in EvoX:
Population Database D_t:
┌─────────────────────────────────────────────────┐
│ programs: Dict[str, EvolvedProgram] │
│ │
│ EvolvedProgram: │
│ ┌──────────────────────────────────────────┐ │
│ │ id: str (UUID) │ │
│ │ code: str (full source) │ │
│ │ metrics.combined_score: float │ │
│ │ iteration_found: int │ │
│ │ timestamp: float │ │
│ │ parent_id: Optional[str] │ │
│ │ parent_info: (label, parent_id) │ │
│ │ metadata: Dict[str, Any] │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Grows monotonically: |D_t| = t │
│ Never prunes (all candidates retained) │
└─────────────────────────────────────────────────┘
Key properties:
- Append-only: Every evaluated candidate is permanently stored. The database grows from |D_0| = 0 to |D_T| = T.
- No global pruning: Unlike systems with fixed-size populations (e.g., genetic algorithms with generational replacement), EvoX's solution database is unbounded in size.
- Strategy-local pruning: Individual strategies may maintain internal archives (e.g., elite sets, diversity archives) that are subsets of the full database. These are part of the strategy's state, not the global infrastructure.
- Read-only metrics: Program evaluation metrics are immutable once computed. This prevents strategies from corrupting evaluation results.
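The append-only and read-only-metrics properties can be made concrete with a minimal database sketch. The Record type, write-once check, and read-only view are illustrative choices, not the framework's actual classes:

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class Record:
    """Immutable evaluation record: strategies can read it, not rewrite it."""
    code: str
    combined_score: float

class PopulationDB:
    """Append-only solution database (sketch): |D_t| grows by one per
    evaluated candidate; entries are never removed or overwritten."""

    def __init__(self):
        self._programs = {}

    def add(self, pid, record):
        if pid in self._programs:
            raise KeyError(f"duplicate id {pid!r}: records are write-once")
        self._programs[pid] = record

    @property
    def programs(self):
        # Read-only view; strategies keep their own archives as subsets.
        return MappingProxyType(self._programs)
```

Strategy-local archives (elite sets, diversity archives) would then hold keys into this view, so pruning an archive never deletes the underlying candidate.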
13.2 Strategy History
The strategy history H is a secondary stateful data structure:
Strategy History H:
┌──────────────────────────────────────────────────┐
│ List of (strategy, performance_record) tuples │
│ │
│ strategy: str (Python source code) │
│ performance_record: │
│ ┌────────────────────────────────────────────┐ │
│ │ window_start: int │ │
│ │ window_end: int │ │
│ │ best_score_at_start: float │ │
│ │ best_score_at_end: float │ │
│ │ improvement: float │ │
│ │ stagnated: bool │ │
│ │ num_candidates_generated: int │ │
│ └────────────────────────────────────────────┘ │
│ │
│ Grows by 1 per strategy evolution event │
│ Typical size: 5-15 entries per 100-iteration run │
└──────────────────────────────────────────────────┘
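A plausible Python rendering of the per-strategy performance record is below; the field names mirror the diagram above, but the schema is inferred from the report, not quoted from the paper:

```python
from dataclasses import dataclass

@dataclass
class StrategyPerformance:
    """Per-strategy performance record stored in H (inferred schema)."""
    window_start: int
    window_end: int
    best_score_at_start: float
    best_score_at_end: float
    num_candidates_generated: int

    @property
    def improvement(self) -> float:
        # The window-level Δ used by the stagnation detector.
        return self.best_score_at_end - self.best_score_at_start

    def stagnated(self, tau: float) -> bool:
        return self.improvement < tau
```

Storing derived fields like improvement as properties keeps each H entry small while still letting the strategy-generation prompt report both raw scores and deltas.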
13.3 Memory Scaling Analysis
For a typical 100-iteration run:
| Component | Size | Growth Rate | Memory Estimate |
|---|---|---|---|
| Solution population | 100 programs | O(T) linear | ~1-10 MB (depending on code size) |
| Strategy history | 5-15 strategies | O(T/W) sublinear | ~50-150 KB |
| Active strategy state | 1 strategy instance | O(1) constant | ~1-10 KB |
| LLM context windows | Temporary | O(1) per call | Governed by model context limit |
| Total | | | ~1-15 MB |
Memory is not a bottleneck for EvoX at the scales evaluated (100 iterations). For much longer runs (thousands of iterations), the unbounded growth of D_t could become a concern, though Python dictionary overhead per entry is modest.
13.4 Context Window Management
A subtle but critical aspect of memory management in EvoX is how information from the population database is compressed into LLM context windows:
For solution generation (inner loop):
- Parent code (full source)
- Inspiration set (2-5 programs, possibly truncated)
- Operator-specific prompt
- Problem description

For strategy generation (outer loop):
- Full strategy history H (all prior strategy code + performance)
- Population descriptor φ(D_t) (statistical summary)
- Problem context
- Interface specification
The population descriptor φ(·) serves as a compression mechanism: rather than including all T programs in the strategy generation prompt, it summarizes the population's statistical properties, reducing context length from O(T × code_size) to O(1).
13.5 State Persistence
The paper does not discuss checkpointing or state persistence mechanisms. Given that:
- The evaluation budget is fixed at 100 iterations
- Typical wall-clock time per run is moderate (minutes to hours depending on evaluator cost)
- The system is designed for single-run optimization, not long-running jobs
State persistence is likely not implemented. However, for production use, the modular architecture would support checkpointing of D_t, H, and current strategy state.
14 Continued Learning
14.1 Within-Run Learning
EvoX implements strong within-run learning through its two-level architecture:
Solution-level learning: The solution population D_t accumulates all evaluated candidates, providing an expanding knowledge base for the solution generator. As the population grows, strategies can select more informative parents and inspiration sets.
Strategy-level learning: The strategy history H enables the meta-evolution loop to learn from the effectiveness of prior strategies. Each strategy evolution event is conditioned on the full history, allowing the LLM to identify patterns in which strategy architectures work well for the current problem.
Operator-level learning: Individual strategies may implement internal learning mechanisms (e.g., UCB1 bandit over operators, as shown in Section 11.2). These mechanisms adapt within a single strategy window based on operator success rates.
Learning Hierarchy:
Time scale │ What is learned │ Mechanism
───────────────┼────────────────────────────┼──────────────────
Per-iteration │ Solution population │ Append to D_t
Per-window │ Operator effectiveness │ Strategy-internal
Per-stagnation │ Strategy architecture │ Meta-evolution
Per-run │ All of the above │ Accumulated H
14.2 Cross-Run Learning
EvoX does not implement explicit cross-run or cross-task learning. Each optimization run starts fresh with:
- Empty population D_0 = ∅
- Simple initial strategy S_0 (random sampling)
- Empty strategy history H = ∅
This is a deliberate design choice:
Starting from a simple random-sampling search strategy, EvoX consistently outperforms existing LLM-driven evolutionary frameworks.
The authors emphasize that EvoX achieves strong performance without cross-task transfer, warm-starting, or pre-trained strategy models. The meta-evolution loop discovers effective strategies from scratch within each run's 100-iteration budget.
14.3 Implicit Cross-Task Learning via LLM
While EvoX does not explicitly transfer learned strategies across tasks, the LLM backbone (GPT-5 or Gemini-3.0-Pro) provides implicit cross-task knowledge:
-
Algorithmic knowledge: The LLM has been trained on evolutionary algorithm literature, MAP-Elites implementations, bandit algorithms, and other search strategies. This pre-existing knowledge bootstraps the strategy generation process.
-
Problem domain knowledge: The LLM understands common optimization heuristics for specific domains (e.g., bin-packing for placement problems, gradient-free optimization for black-box functions).
-
Code generation capability: The LLM can generate syntactically correct, functionally complex Python classes, which is a prerequisite for the strategy-as-code approach.
14.4 Comparison with Other Systems' Learning Mechanisms
| System | Within-Run Learning | Cross-Run Learning | Cross-Task Learning |
|---|---|---|---|
| AlphaEvolve | Population accumulation | Not reported | LLM-implicit |
| OpenEvolve | Population accumulation | Not reported | LLM-implicit |
| ShinkaEvolve | Bandit LLM selection | Not reported | LLM-implicit |
| GEPA | Reflection-based | Not reported | LLM-implicit |
| AdaEvolve | Three-level adaptation | Not reported | LLM-implicit |
| ThetaEvolve | RL model fine-tuning | Yes (model weights) | Transfer via fine-tuned model |
| SOAR | Search + fine-tuning | Yes (model weights) | Transfer via fine-tuned model |
| EvoX | Two-level co-evolution | No | LLM-implicit |
A potential extension of EvoX would be to transfer successful strategy patterns across tasks — e.g., by seeding H with strategies that worked well on similar problems. This would create a form of meta-meta-learning: learning about strategy evolution across tasks.
14.5 Strategy Evolution Dynamics
The paper characterizes how strategies evolve over the course of a run. Typical patterns observed:
Phase 1: Exploration (iterations 0-20)
- Initial random sampling strategy deployed
- Population begins to accumulate diverse candidates
- First stagnation event typically occurs within 10-20 iterations

Phase 2: Strategy Discovery (iterations 20-50)
- Multiple strategy evolution events occur in rapid succession
- Strategies become progressively more sophisticated
- Transition from random selection to fitness-proportional or elite-based selection

Phase 3: Strategy Refinement (iterations 50-80)
- Strategy evolution events become less frequent
- Effective strategies are retained for longer windows
- Operators stabilize (e.g., exploitation ratio increases)

Phase 4: Convergence (iterations 80-100)
- Few or no strategy evolution events
- Current strategy produces diminishing but non-zero improvements
- Final score approaches the run's optimum
15 Applications
15.1 Mathematical Optimization
EvoX demonstrates strong performance on mathematical construction problems where the goal is to discover geometric or combinatorial configurations optimizing a precise mathematical objective.
Circle Packing in a Square
Problem: Pack k non-overlapping circles inside a unit square, maximizing the sum of their radii.
EvoX approach: The meta-evolution loop discovers that:
- Early strategies should explore diverse packing topologies (DIVERGE operator)
- Mid-run strategies should refine promising configurations (REFINE operator)
- Late-run strategies should combine elements of the best configurations (inspiration sets)
Result: EvoX achieves a sum of radii of 2.636, surpassing both the human best (2.634) and AlphaEvolve (2.635).
Heilbronn Triangle and Convex Problems
Problem: Place n points in a unit square to maximize the area of the smallest triangle formed by any three of the points (or, in the convex variants, of the smallest convex polygon formed by any k points).
Significance: These are long-standing open problems in combinatorial geometry. EvoX's ability to match or approach human-constructed bounds demonstrates the system's capability for mathematical discovery.
Signal Processing (Multi-Objective)
Problem: Design a causal, online filtering program for noisy time series, balancing fidelity, smoothness, lag, and false trend detection.
EvoX advantage: This multi-objective task is where EvoX shows its largest relative gain (0.7429 vs. 0.5049 for ShinkaEvolve with Gemini). The meta-evolution loop discovers strategies that explicitly sample candidates along different Pareto trade-off objectives, a strategy architecture that cannot be expressed in fixed parametric frameworks.
15.2 Systems Performance Optimization
EvoX is evaluated on six production-inspired systems optimization tasks from ADRS-Bench:
GPU-to-Model Placement (PRISM)
Problem: Schedule multiple deep learning models across distributed GPUs under memory, compute, and latency SLO constraints in a multi-tenant setting.
EvoX result: 30.52 throughput vs. 21.89 human best, a 39.4% improvement over the human baseline.
Key insight from meta-evolution: The initial strategy (random sampling) quickly finds greedy least-loaded heuristics. After stagnation, the meta-evolution loop generates a strategy that reformulates the problem as bin-packing assignment, enabling exploration of a qualitatively different solution space that produces dramatically better throughput.
Expert Placement Load Balancing (EPLB)
Problem: Place and replicate experts in Mixture-of-Experts (MoE) models across GPUs based on activation frequency, minimizing load imbalance.
Result: EvoX achieves 0.1453 throughput ratio, slightly outperforming the best random and GEPA baselines (0.1445).
Transaction Scheduling
Problem: Schedule database transactions under contention to maximize throughput while minimizing aborts.
Result: 4347.83 commits/sec vs. 2724.8 human best — a 59.6% improvement.
Cloud Transfer (Cloudcast)
Problem: Design cost-efficient data transfer strategies across cloud regions with heterogeneous pricing and deadline constraints.
Result: EvoX achieves 623.69 cost, beating the human best (626.24) and all AI baselines.
15.3 Algorithmic and Competitive Programming
FrontierCS (172 problems)
Problem: 172 open-ended computer science problems spanning:
- Optimization (38 problems): parameterized search under resource constraints
- Constructive (62 problems): synthesize valid structured objects under global constraints
- Interactive (72 problems): adaptive query-response protocols
Result: EvoX achieves a median score of 75.5, compared to 56.2 for the best prior AI system — a 34.3% improvement.
Significance: This is the largest benchmark evaluated, and EvoX's consistent advantage across 172 diverse problems suggests that meta-evolution provides a robust, domain-agnostic improvement mechanism.
ALE-Bench-Lite (10 NP-hard problems)
Problem: 10 NP-hard optimization problems from AtCoder Heuristic Contests, including territory control, sliding puzzles, graph classification, and route optimization.
Significance: These problems require fundamentally different algorithmic approaches (greedy, dynamic programming, local search, constraint satisfaction), making them a strong test of EvoX's ability to adapt strategies across problem types.
15.4 Abstract Reasoning (ARC-AGI-2)
Problem: ARC-AGI-2 tasks evaluate abstract and compositional reasoning, requiring programs that solve novel grid transformation puzzles.
Result: EvoX improves accuracy by 6-7 percentage points over OpenEvolve (41% → 48% with GPT-5; 45% → 51% with Gemini-3-Pro).
Caveat: The authors note that ARC-AGI fundamentally differs from the evolutionary optimization regime. Standard ARC evaluation assumes strict train-test separation, whereas evolutionary frameworks adapt during inference. The improvement may partially reflect better prompt construction rather than better search strategy per se.
15.5 Application Taxonomy
| Application Domain | Tasks | Primary Benefit of Meta-Evolution | Largest Gain |
|---|---|---|---|
| Mathematical construction | 8 | Strategy shift from exploration to refinement | +0.08% (circle packing) |
| Systems optimization | 6 | Problem reformulation (e.g., greedy → bin-packing) | +39.4% (GPU placement) |
| Competitive programming | 172 | Adaptive operator selection across diverse problems | +34.3% (median score) |
| NP-hard optimization | 10 | Cross-algorithm strategy discovery | Not separately reported |
| Abstract reasoning | N/A | Better context construction | +7pp (ARC accuracy) |
15.6 Domains Where EvoX Has Not Been Applied
| Domain | Reason | Feasibility |
|---|---|---|
| Neural architecture search (NAS) | Not in benchmark suite | High — NAS is well-suited to evolutionary search |
| Drug/molecule design | Requires domain-specific evaluators | Medium — needs molecular simulation infrastructure |
| Hardware design (Verilog) | Not in benchmark suite (AlphaEvolve domain) | High — code-based, automated evaluation |
| Prompt optimization | Not primary focus | High — SkyDiscover supports prompt tasks |
| Theorem proving | Requires formal verification | Medium — needs proof assistant integration |
| Robotics control | Requires physics simulation | Medium — needs sim-to-real infrastructure |
15.7 Broader Implications
EvoX's core insight — that the search process itself should be subject to evolutionary pressure — has implications beyond LLM-driven optimization. The principle of meta-evolution applies to any iterative search process where:
1. The search strategy can be expressed as executable code
2. An LLM can generate valid strategy implementations
3. Strategy effectiveness can be measured by downstream progress
4. The optimal strategy is non-stationary (changes during the search)
This suggests potential applications in:
- AutoML: evolving the hyperparameter tuning strategy itself
- Reinforcement learning: evolving the exploration strategy during training
- Compiler optimization: evolving the optimization pass ordering
- Database query optimization: evolving the query planning heuristic
- Scientific experimentation: evolving the experimental design strategy based on prior results
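To make the two-level structure concrete, the sketch below shows a toy meta-evolution loop satisfying the conditions above: an inner loop evolves solutions under the current strategy, and an outer loop mutates the strategy object itself, keeping a mutant only if it yields better downstream progress. This is a minimal illustration, not the paper's implementation: the `Strategy` class, `propose`, `inner_search`, and the scalar toy objective are all hypothetical stand-ins (in EvoX, strategies are full LLM-generated Python classes and fitness comes from task evaluators).

```python
import random

class Strategy:
    """An evolvable search strategy; here just a mutation step size."""
    def __init__(self, step=1.0):
        self.step = step

    def propose(self, x):
        # Solution-level variation operator defined by this strategy.
        return x + random.uniform(-self.step, self.step)

def fitness(x):
    # Toy objective: maximize -(x - 3)^2, optimum at x = 3.
    return -(x - 3.0) ** 2

def inner_search(strategy, x, iters=200):
    """Inner level: hill-climb solutions under a fixed strategy."""
    best, best_f = x, fitness(x)
    for _ in range(iters):
        cand = strategy.propose(best)
        if fitness(cand) > best_f:
            best, best_f = cand, fitness(cand)
    return best, best_f

def meta_evolve(rounds=10):
    """Outer level: mutate the strategy, judged by downstream progress."""
    strategy, x = Strategy(step=2.0), 0.0
    x, score = inner_search(strategy, x)
    for _ in range(rounds):
        # Mutate the strategy itself (here: halve or double the step size).
        mutant = Strategy(step=max(1e-3, strategy.step * random.choice([0.5, 2.0])))
        x_new, new_score = inner_search(mutant, x)
        if new_score >= score:  # keep the strategy only if search progressed
            strategy, x, score = mutant, x_new, new_score
    return x, score

random.seed(0)
x, score = meta_evolve()
print(f"best x = {x:.3f}, score = {score:.5f}")
```

The key design point mirrors condition 3: the mutant strategy is never scored directly, only by the quality of the solutions it produces, so strategy selection automatically tracks whichever search behavior the current phase of the run rewards.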
Analysis completed April 2026. Based on arXiv:2602.23413v2 (revised March 16, 2026), GitHub repository skydiscover-ai/skydiscover, and contextual knowledge of the LLM-driven evolutionary search landscape.