EvoX: Meta-Evolution for Automated Discovery
An adaptive evolution method that optimizes its own evolutionary search process through two-level co-evolution of solutions and search strategies.

Organization: UC Berkeley Sky Lab + Stanford University + Bespoke Labs
Published: February 26, 2026 (v1), revised March 16, 2026 (v2)
Type: Research Paper (arXiv:2602.23413v2)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: EvoX: Meta-Evolution for Automated Discovery
Paper URL: arXiv:2602.23413v2
HTML Version: arxiv.org/html/2602.23413v2
PDF: arxiv.org/pdf/2602.23413v2
Code Repository: github.com/skydiscover-ai/skydiscover
arXiv Categories: cs.LG (Machine Learning), cs.CL (Computation and Language), cs.NE (Neural and Evolutionary Computing)
Submission History:
- v1: February 26, 2026 (5,230 KB)
- v2: March 16, 2026 (5,230 KB) — current version
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
DOI: 10.48550/arXiv.2602.23413
Lineage: Builds directly on the LLM-driven evolutionary search paradigm established by FunSearch (Nature 2023), AlphaEvolve (Google DeepMind 2025), and the earlier SkyDiscover/AdaEvolve framework (arXiv:2602.20133) from the same Berkeley lab. EvoX represents the team's second major contribution following SkyDiscover/AdaEvolve, pivoting from hierarchical adaptive resource allocation to full meta-evolution of the search strategy itself. The paper is released through the SkyDiscover framework infrastructure, sharing the same codebase and benchmark suite.
Relationship to SkyDiscover: EvoX is implemented as a first-class algorithm backend within the SkyDiscover framework. While AdaEvolve focuses on three-level hierarchical adaptation (local, global, meta) using accumulated improvement signals, EvoX takes a fundamentally different approach: it treats the entire search strategy as an evolvable artifact, using an LLM to generate complete strategy implementations (Python classes) rather than tuning pre-defined knobs.
2 Authors and Team
Authors (17 total, * denotes equal contribution):
Lead Authors (equal contribution): Shu Liu*^1, Shubham Agarwal*^1
Contributing Authors: Monishwaran Maheswaran^1, Mert Cemri^1, Zhifei Li^1, Qiuyang Mang^1, Ashwin Naren^1, Ethan Boneh^2, Audrey Cheng^1, Melissa Z. Pan^1, Alexander Du^1
Senior Authors: Kurt Keutzer^1, Alvin Cheung^1, Alexandros G. Dimakis^{1,3}, Koushik Sen^1, Matei Zaharia^1, Ion Stoica^1
Affiliations:
1. UC Berkeley
2. Stanford University
3. Bespoke Labs
Research Group Context
The team is anchored in the UC Berkeley Sky Lab, the same research group responsible for foundational systems infrastructure including Apache Spark (Zaharia), Ray (Stoica), vLLM, and Alpa. This systems-engineering DNA is visible in EvoX's emphasis on modular architecture, fair benchmarking infrastructure, and practical systems optimization tasks.
| PI | Known For | Role in EvoX |
|---|---|---|
| Ion Stoica | Ray, Spark, Anyscale | Systems architecture, distributed compute |
| Matei Zaharia | Spark, MLflow, Databricks | Data-intensive optimization, framework design |
| Koushik Sen | KLEE, concolic testing, program analysis | Search strategy formalization, program synthesis |
| Kurt Keutzer | Neural architecture efficiency | LLM cost optimization, model selection |
| Alvin Cheung | Database optimization, verified lifting | Systems benchmarks, SQL optimization tasks |
| Alexandros G. Dimakis | Compressed sensing, generative models | Theoretical foundations, Bespoke Labs connection |
Notable cross-institutional contributions: Ethan Boneh (Stanford) brings connections to the broader Stanford AI/systems community. The Dimakis affiliation with Bespoke Labs (a startup focused on AI-generated data) suggests potential commercial applications of the meta-evolution paradigm.
Author overlap with prior work: Significant overlap with the SkyDiscover/AdaEvolve team (Cemri, Agarwal, Liu, Cheng, Mang, Naren, Sen, Zaharia, Dimakis, Stoica), the GEPA team (UC Berkeley / Stanford), and the ADRS-Bench systems benchmark authors. This continuity of personnel means EvoX benefits from deep institutional knowledge of the benchmark suite and baseline implementations.
3 Core Contribution
[!important] Key Novelty EvoX is the first LLM-driven evolutionary system to treat the search strategy itself as an evolvable object, implementing a two-level co-evolution process where an outer "meta-evolution" loop generates, evaluates, and replaces complete search strategy implementations (Python classes) based on their observed downstream performance, while an inner "solution-evolution" loop generates candidate solutions under the currently active strategy.
The Central Insight
All prior LLM-driven evolutionary systems — including AlphaEvolve, OpenEvolve, GEPA, ShinkaEvolve, and even AdaEvolve — share a fundamental limitation: their search strategies are either fixed (predefined knobs that remain static throughout execution) or parametrically adaptive (adjusting pre-defined parameters like exploration-exploitation ratios through bandit algorithms). EvoX eliminates this constraint by making the search strategy itself a first-class citizen in the evolutionary process.
The key observation motivating this work is threefold:
- Cross-task variation: Different optimization problems require fundamentally different search strategies. Packing problems benefit from iterative refinement of strong candidates, while GPU placement problems may require wholesale paradigm shifts (e.g., from greedy heuristics to bin-packing formulations).
- Within-task non-stationarity: Even within a single optimization run, the optimal search strategy changes over time. Early stages benefit from exploration, later stages from exploitation — but the transition timing is task-dependent and cannot be predetermined.
- Strategy expressiveness bottleneck: Parametric adaptation (adjusting knobs within a fixed strategy template) cannot capture qualitative shifts in search behavior. Switching from tournament selection to Pareto-frontier sampling, or from single-parent refinement to multi-parent crossover, requires changing the structure of the strategy, not its parameters.
What Makes EvoX Novel
- Search strategy as code: Rather than parameterizing strategies as vectors of hyperparameters, EvoX represents each search strategy as a complete Python class implementing add() and sample() methods. This allows the LLM to generate arbitrary selection mechanisms, data structures, and heuristics — unconstrained by a predefined strategy template.
- Two-level co-evolution: The system maintains two coupled evolutionary processes operating at different timescales: a fast inner loop evolving candidate solutions (one LLM call per iteration) and a slow outer loop evolving search strategies (one LLM call per strategy window, triggered by stagnation detection).
- Progress-conditioned strategy generation: New strategies are generated by the LLM conditioned on: (a) the history of prior strategies and their performance, (b) the current state of the solution population (via a descriptor function), and (c) the downstream problem context. This allows the meta-evolution loop to learn from its own history.
- Strategy validation: Generated strategies undergo a validity test before deployment, ensuring they are syntactically correct Python, satisfy interface contracts, and produce valid outputs. Invalid strategies are discarded and regenerated.
- Stagnation-triggered adaptation: The meta-evolution loop is event-driven rather than periodic — new strategies are generated only when the current strategy fails to produce sufficient progress, as measured by a stagnation threshold τ. This avoids unnecessary strategy churn during productive phases.
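The validity test for generated strategies can be made concrete. The sketch below is an illustrative reconstruction based on the description above, not the paper's code; the function name and the exact checks (compilation, interface inspection) are assumptions.

```python
REQUIRED_METHODS = ("add", "sample")

def validate_strategy(source: str, class_name: str = "EvolvedProgramDatabase"):
    """Validity-test sketch: syntactic check, then interface check.
    Returns the strategy class on success, or None if the candidate
    should be discarded and regenerated."""
    # 1. Syntactic validity: the LLM output must compile as Python.
    try:
        code = compile(source, "<strategy>", "exec")
    except SyntaxError:
        return None
    # 2. Interface contract: the class must exist and expose add/sample.
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return None
    cls = namespace.get(class_name)
    if cls is None or not all(
        callable(getattr(cls, m, None)) for m in REQUIRED_METHODS
    ):
        return None
    return cls

# A minimal valid strategy passes; a syntactically broken one is rejected.
valid_src = """
class EvolvedProgramDatabase:
    def add(self, program): pass
    def sample(self): return {}, []
"""
assert validate_strategy(valid_src) is not None
assert validate_strategy("def add(:") is None  # syntax error -> discarded
```

A real implementation would additionally run the candidate strategy against a small synthetic population to confirm it produces valid outputs, per the paper's third criterion.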
Relationship to Prior Work
| System | Year | Search Strategy | Adaptation Mechanism | Strategy Expressiveness | Benchmark Scale |
|---|---|---|---|---|---|
| FunSearch | 2023 | Fixed (best-shot sampling) | None | N/A (static) | ~5 math tasks |
| AlphaEvolve | 2025 | MAP-Elites with fixed ratios | None (hand-tuned) | Low (predefined knobs) | ~20 tasks |
| OpenEvolve | 2025 | Static elite + diversity | None (hand-tuned) | Low (predefined heuristics) | ~15 tasks |
| ShinkaEvolve | 2025 | Islands + bandit LLM selection | Parametric (LLM model selection) | Medium (bandit over models) | ~20 tasks |
| GEPA | 2026 | Pareto-frontier + reflection | Parametric (reflection-guided) | Medium (fixed architecture) | ~30 tasks |
| AdaEvolve | 2026 | Three-level hierarchical | Parametric (bandit + meta-LLM directives) | Medium (three predefined levels) | 200+ tasks |
| ThetaEvolve | 2026 | Fixed strategy + RL generator | Model adaptation (RL fine-tuning) | Low (adapts model, not strategy) | ~20 tasks |
| CodeEvolve | 2026 | Island-based GA | None (fixed GA operators) | Low (predefined GA) | ~15 tasks |
| EvoX | 2026 | LLM-generated Python classes | Structural (full strategy evolution) | Maximal (arbitrary code) | 200+ tasks |
The fundamental distinction: prior systems adapt parameters of a fixed strategy architecture. EvoX adapts the architecture itself by generating complete strategy implementations as code.
Formal Problem Statement
EvoX formalizes LLM-driven optimization as two coupled optimization problems:
Inner optimization (solution evolution): Given a search strategy S, maximize the best score in the solution population D_T under a fixed evaluation budget T:
max_{(x,s,a) ∈ D_T} s
Outer optimization (meta-evolution): Find the sequence of search strategies S_0, S_1, ..., S_K that maximizes the final best score, where each strategy S_k is deployed for a window of W solution-evolution iterations and evaluated by the progress it induces:
max_{S_0,...,S_K} max_{(x,s,a) ∈ D_T} s
subject to: each S_k is a valid SearchStrategy implementation
K ≤ T/W (budget constraint)
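Assuming stand-in callables for G_sol, G_str, and the evaluator E, the coupled optimization can be sketched as a single driver loop; the stagnation rule (window improvement Δ < τ) and the budget constraint K ≤ T/W follow directly from the formulation above. All names here are illustrative, not the paper's API.

```python
def meta_evolve(gen_solution, gen_strategy, evaluate, T=100, W=10, tau=1e-6):
    """Two-level co-evolution sketch. gen_solution, gen_strategy, and
    evaluate stand in for G_sol, G_str, and E; the real system
    conditions both generators on rich history and population context."""
    history = []                       # prior strategies + their progress
    strategy = gen_strategy(history)   # S_0
    population, best = [], float("-inf")
    window_start_best = best
    for t in range(T):                 # inner loop: one G_sol call per step
        candidate = gen_solution(strategy, population)
        score = evaluate(candidate)
        population.append((candidate, score))
        best = max(best, score)
        if (t + 1) % W == 0:           # end of a strategy window
            delta = best - window_start_best
            history.append((strategy, delta))
            if delta < tau:            # stagnation-triggered meta-evolution
                strategy = gen_strategy(history)   # generate S_{k+1}
            window_start_best = best
    return best
```

Because at most one strategy-generation event can fire per window, the number of strategies K is bounded by T/W, matching the budget constraint.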
4 Supported Solutions
EvoX targets any optimization problem where candidate solutions can be expressed as code (programs, algorithms, heuristics) and evaluated by an automated scorer. The system is agnostic to the solution domain — it operates on the search process rather than on domain-specific representations.
4.1 Solution Types Supported
| Solution Type | Description | Demonstrated In Paper | Example |
|---|---|---|---|
| Mathematical constructions | Programs generating geometric/combinatorial objects | ✅ Yes (8 tasks) | Circle-packing configurations, point placements |
| Systems heuristics | Scheduling, placement, and routing algorithms | ✅ Yes (6 tasks) | GPU model placement, transaction scheduling |
| Competitive programming solutions | Full algorithmic solutions to CP problems | ✅ Yes (172 tasks) | FrontierCS open-ended problems |
| NP-hard optimization heuristics | Metaheuristics for AtCoder-style contests | ✅ Yes (10 tasks) | Territory control, sliding puzzle, graph classification |
| Signal processing filters | Causal, online filtering programs | ✅ Yes (1 task) | Multi-objective noisy time series filter |
| SQL generation strategies | LLM-driven SQL prompting/post-processing | ✅ Yes (1 task) | LLM-SQL accuracy optimization |
| Telemetry reconstruction | Missing signal reconstruction algorithms | ✅ Yes (1 task) | Temporal/spatial correlation exploitation |
| Abstract reasoning programs | ARC-AGI task solvers | ✅ Yes (evaluation) | Compositional reasoning programs |
4.2 What EvoX Does NOT Support
| Limitation | Reason |
|---|---|
| Non-programmatic solutions | All solutions must be expressible as executable code |
| Continuous optimization (direct) | No gradient computation — operates through code-level search |
| Real-time adaptation | Designed for offline optimization, not online/streaming settings |
| Multi-agent coordination | Single-population evolution; no explicit agent communication |
| Hardware-in-the-loop | Evaluators must be automated; no physical experiments |
4.3 Benchmark Portfolio
EvoX is evaluated across approximately 200 optimization tasks organized into three families:
| Domain | Count | Source | Task Type |
|---|---|---|---|
| Mathematical optimization | 8 | Custom + prior work | Circle packing, Heilbronn, autocorrelation, signal processing |
| Systems performance optimization | 6 | ADRS-Bench | GPU placement, scheduling, cloud transfer, telemetry |
| Algorithmic problems | 182 | ALE-Bench-Lite (10) + FrontierCS (172) | NP-hard optimization, constructive, interactive |
| Abstract reasoning | — | ARC-AGI-2 | Compositional reasoning (supplementary evaluation) |
| Total | ~196+ | — | — |
5 LLM Integration
5.1 Models Used
EvoX uses LLMs in two distinct roles, each potentially served by different models:
| Role | Model Options Tested | Purpose | Call Frequency |
|---|---|---|---|
| Solution generator (G_sol) | GPT-5, Gemini-3.0-Pro | Generate candidate solutions given parent + operator + inspiration set | Every inner-loop iteration (T calls total) |
| Strategy generator (G_str) | GPT-5, Gemini-3.0-Pro | Generate new search strategy implementations (Python classes) | Only on stagnation events (K << T calls total) |
| Operator instantiation | Lightweight model (unspecified) | Generate operator-specific prompts from problem description | Once per problem, at initialization |
A critical design choice: the strategy generator and solution generator can be the same or different LLM instances. The paper evaluates both GPT-5 and Gemini-3.0-Pro as backbone models, using the same model for both roles in each experiment.
5.2 How LLMs Are Used
EvoX uses LLMs in two fundamentally different modes:
Mode 1: Solution Generation (Inner Loop)
The solution LLM operates as a standard mutation operator in the evolutionary loop. Given a parent candidate, a variation operator label, and an optional inspiration set, it produces a modified candidate solution. This is identical in structure to how AlphaEvolve, OpenEvolve, and other prior systems use LLMs.
Mode 2: Strategy Generation (Outer Loop)
The strategy LLM operates as a meta-programmer. Given the history of prior strategies, their performance profiles, and the current population state, it generates a complete Python class implementing the EvolvedProgramDatabase interface. This is the novel contribution — the LLM is generating the search algorithm itself, not just candidate solutions.
5.3 Prompt Architecture
EvoX employs a multi-layered prompt architecture. The most important prompt is the search strategy evolution prompt (P_search), which instructs the LLM to generate a complete EvolvedProgramDatabase class.
Strategy Evolution System Prompt
System Model: LLM-Driven Optimization with Evolving Search
You are an expert developer implementing a search strategy for
LLM-driven program optimization. Your task is to implement
EvolvedProgramDatabase, which realizes the SearchStrategy
interface used by the optimizer.
The optimizer maintains a population database of evaluated
candidate programs. At each step, the search strategy constructs
the next-generation LLM input by choosing:
(i) a parent program to modify,
(ii) a variation operator that specifies how the parent
should be modified, and
(iii) an optional inspiration set of past programs drawn
from the population.
The generated candidate is then evaluated and added back to
the population. The quality of a search strategy is determined
by how effectively it improves the best solution over the course
of the optimization process.
You must implement a class named EvolvedProgramDatabase, which
defines the behavior of the search strategy through two methods:
- add: controls how evaluated programs are added to the population
- sample: selects the parent program, the variation operator
applied to it, and an optional inspiration set of past programs.
The base class provides:
self.programs: Dict[str, EvolvedProgram] (all stored programs)
self.get(program_id) for retrieval
self.DIVERGE_LABEL and self.REFINE_LABEL predefined constants
Strategy Evolution Constraints Block
======================== CONSTRAINTS ========================
1. Program evaluation metrics are read-only and must not be
modified.
2. All imports and dataclass definitions must be preserved.
3. All variables used by the strategy must be initialized with
safe default values to avoid runtime errors.
4. When accessing metrics, always validate types before
arithmetic (isinstance(value, (int, float))).
======================== CRITICAL INSTRUCTION ========================
Design a search strategy wisely and adaptively to maximize
improvement.
Strategy Evolution User Prompt Structure
The user message in the strategy evolution prompt contains four key sections:
1. Downstream problem context:
"Your search algorithm evolves solutions for the following
downstream problem. Use this information to inform your
search strategy."
{problem_template}
2. Search algorithm information:
- Search window definition (iteration range, optimization
horizon)
- Current population state via descriptor function φ(D_t)
3. Strategy history:
- Prior strategies and their observed performance metrics
- Improvement rates per strategy window
- Stagnation patterns observed
4. Generation instruction:
"Design a search strategy wisely and adaptively to maximize
improvement."
Variation Operator Labels
The strategy's sample() method selects among three variation operator labels that govern the type of modification requested from the solution LLM:
| Label | Constant | Semantics | Typical Use |
|---|---|---|---|
"" (empty string) |
Default | Free-form variation — no constraints on edit scope | Exploratory, early-stage search |
REFINE_LABEL |
self.REFINE_LABEL |
Local refinement — small, structure-preserving edits | Exploitation, parameter tuning |
DIVERGE_LABEL |
self.DIVERGE_LABEL |
Structural variation — larger, exploratory changes | Escaping local optima, paradigm shifts |
Each generation step applies exactly one variation operator. The search strategy's intelligence lies in choosing when to apply each operator based on population state.
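As an illustration of where that intelligence might live, the following sketch shows one plausible policy (not the paper's) for choosing among the three labels from recent score improvements:

```python
def choose_operator(recent_deltas, refine_label="REFINE", diverge_label="DIVERGE"):
    """Illustrative operator choice: free-form variation while history is
    too short to judge, REFINE while recent iterations still improve the
    best score, DIVERGE after a flat stretch (to escape local optima)."""
    if len(recent_deltas) < 5:
        return ""                 # free-form: exploratory early-stage search
    if max(recent_deltas[-5:]) > 0:
        return refine_label       # still improving: local refinement
    return diverge_label          # plateau: structural variation

assert choose_operator([]) == ""                        # no history yet
assert choose_operator([0.0, 0.1, 0.0, 0.2, 0.05]) == "REFINE"
assert choose_operator([0.0] * 8) == "DIVERGE"          # five flat steps
```

An evolved strategy is free to implement far richer policies than this (e.g., per-operator bandits over the lineage statistics), which is precisely the expressiveness argument made above.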
Solution Generation Prompt Structure
System message:
You are an expert software developer tasked with iteratively
improving a codebase. Your objective is to maximize the
combined score while exploring diverse solutions across
feature dimensions. The system maintains a population of
programs; both high performance and diversity are valuable
signals.
User message:
[Problem description]
[Parent code to modify]
[Variation operator instruction (refine/diverge/free-form)]
[Inspiration set programs (optional)]
[Evaluation feedback from parent]
5.4 Operator Instantiation
At the start of each problem, EvoX uses a lightweight model to generate operator-specific prompts from the problem description. This transforms the generic operator labels (REFINE, DIVERGE) into task-specific instructions. For example:
- REFINE for circle packing: "Adjust individual circle positions by small deltas while maintaining non-overlap constraints"
- DIVERGE for circle packing: "Redesign the packing strategy entirely — consider different initialization patterns, different optimization algorithms, or different constraint relaxation approaches"
- REFINE for GPU placement: "Tune load thresholds and replication factors in the current placement heuristic"
- DIVERGE for GPU placement: "Reformulate the placement problem as bin-packing, graph partitioning, or constraint satisfaction"
6 Key Results
6.1 Headline Results
EvoX is evaluated against strong baselines across nearly 200 tasks. The headline comparison against human-designed solutions and prior AI systems:
| Task | Human Best | AlphaEvolve | AI Best (Open) | EvoX | Delta vs Human | Delta vs AI Best |
|---|---|---|---|---|---|---|
| Circle-Packing (↑) | 2.634 | 2.635 | — | 2.636 | +0.08% | +0.04% (vs AlphaEvolve) |
| MinMaxMinDist (↓) | 4.16584 | 4.16579 | — | 4.16578 | -0.001% | -0.0002% (vs AlphaEvolve) |
| Cloud Transfer (↓) | 626.24 | — | 645.72 | 623.69 | -0.41% | -3.41% |
| GPU Placement (↑) | 21.89 | — | 26.26 | 30.52 | +39.4% | +16.2% |
| Txn Scheduling (↑) | 2724.8 | — | 4329 | 4347.8 | +59.6% | +0.44% |
| Frontier-CS median (↑) | — | — | 56.2 | 75.5 | — | +34.3% |
EvoX outperforms all prior AI-driven evolutionary methods on 96% of the math and system optimization benchmarks, and achieves a 34.3% improvement in median score on 172 competitive programming problems.
6.2 Detailed Mathematics Results (GPT-5 Backbone, Best-of-3-Runs)
| Task | Direction | Human | AlphaEvolve | ThetaEvolve | CodeEvolve | Random | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|---|---|---|---|---|---|
| Circle Packing | ↑ | 2.634 | 2.635 | 2.636 | 2.636 | 2.500 | 2.541 | 2.629 | 2.508 | 2.636 |
| Circle Packing Rect | ↑ | 2.364 | 2.366 | — | 2.366 | 2.278 | 2.276 | 2.354 | 2.358 | 2.360 |
| Heilbronn Convex | ↑ | 0.0306 | 0.0309 | — | — | 0.0198 | 0.0267 | 0.0269 | 0.0256 | 0.0272 |
| Heilbronn Triangle | ↑ | 0.0360 | 0.0365 | — | — | 0.0285 | 0.0283 | 0.0332 | 0.0338 | 0.0339 |
| MinMaxMinDist (16,2) | ↓ | 12.89 | 12.89 | — | 12.89 | 13.00 | 13.00 | 12.95 | 12.98 | 12.89 |
| MinMaxMinDist (14,3) | ↓ | 4.17 | 4.17 | — | 4.17 | 4.46 | 4.46 | 4.17 | 4.17 | 4.18 |
| Autocorrelation | ↓ | 1.4581 | 1.4557 | 1.4930 | — | 1.4607 | 1.4610 | 1.4758 | 1.4614 | 1.4609 |
| Signal Processing | ↑ | — | — | — | — | 0.5858 | 0.6219 | 0.7057 | 0.5328 | 0.6898 |
6.3 Detailed Systems Results (GPT-5 Backbone, Best-of-3-Runs)
| Task | Direction | Human | Random | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|---|---|---|
| EPLB | ↑ | 0.1265 | 0.1445 | 0.1272 | 0.1445 | 0.1272 | 0.1453 |
| PRISM | ↑ | 21.89 | 26.23 | 26.23 | 26.23 | 26.26 | 30.52 |
| LLM-SQL | ↑ | 0.692 | 0.729 | 0.716 | 0.713 | 0.713 | 0.730 |
| Cloudcast | ↓ | 626.24 | 644.84 | 729.80 | 645.72 | 812.74 | 637.14 |
| Transaction | ↑ | 2724.8 | 4149.38 | 4237.3 | 3984.1 | 4329.0 | 4347.83 |
| Telemetry | ↑ | 0.8222 | 0.9498 | 0.9515 | 0.9477 | 0.9515 | 0.9520 |
6.4 ARC-AGI-2 Results
EvoX was also evaluated on ARC-AGI-2 abstract reasoning tasks, which are not traditional optimization benchmarks. Under a matched inference budget of 30 LLM iterations per task:
| Backbone Model | OpenEvolve | EvoX | Delta |
|---|---|---|---|
| GPT-5 | 41% | 48% | +7pp |
| Gemini-3-Pro | 45% | 51% | +6pp |
The ARC-AGI results are notable because ARC tasks differ fundamentally from optimization — they assume strict train-test separation rather than iterative adaptation during inference. EvoX's improvement here suggests the meta-evolution of search strategies has benefits beyond pure optimization, potentially improving the LLM's program synthesis capability through more effective context construction.
6.5 Gemini-3.0-Pro Backbone Comparison
EvoX maintains its advantages when switching to Gemini-3.0-Pro as the backbone:
| Task | OpenEvolve | GEPA | Shinka | EvoX |
|---|---|---|---|---|
| Circle Packing (↑, best) | 2.5414 | 2.6208 | 2.6358 | 2.6360 |
| Signal Processing (↑, best) | 0.5649 | 0.6798 | 0.5049 | 0.7429 |
| EPLB (↑, best) | 0.1272 | 0.1272 | 0.1272 | 0.1453 |
| Cloudcast (↓, best) | 707.82 | 667.06 | 1032.42 | 623.69 |
6.6 Win Rate Analysis
Across all evaluated tasks:
- Math + Systems (14 tasks): EvoX wins on 96% of tasks
- FrontierCS (172 tasks): EvoX achieves 75.5 median score vs. 56.2 for best baseline (34.3% improvement)
- ARC-AGI-2: EvoX improves accuracy by 6-7 percentage points over OpenEvolve
6.7 Key Quantitative Insights
GPU Placement (PRISM) is the most dramatic result: EvoX achieves 30.52 throughput vs. 21.89 human best — a 39.4% improvement. This task exemplifies EvoX's strength: the meta-evolution loop discovers that initial greedy placement strategies plateau, and generates a new strategy that reformulates the problem as bin-packing assignment, unlocking a qualitatively different solution space.
Signal Processing with Gemini-3.0-Pro shows EvoX's largest relative gain: 0.7429 vs. 0.5049 for ShinkaEvolve (47.1% improvement). The multi-objective nature of this task benefits strongly from strategy evolution, as the meta-loop discovers strategies that explicitly sample candidates along different Pareto trade-off objectives.
7 Reproducibility
7.1 Open Source Status
| Aspect | Status | Details |
|---|---|---|
| Paper | ✅ Open Access | arXiv with CC BY 4.0 license |
| Code | ✅ Open Source | github.com/skydiscover-ai/skydiscover |
| Benchmark Suite | ✅ Included | ~200 tasks shipped with the framework |
| Pre-trained Models | N/A | Uses commercial LLM APIs (GPT-5, Gemini-3.0-Pro) |
| Evaluation Harness | ✅ Included | Automated evaluators for all benchmarks |
| Baseline Implementations | ✅ Included | OpenEvolve, GEPA, ShinkaEvolve backends |
7.2 Reproducibility Requirements
LLM API Access:
- GPT-5 API access (OpenAI) — required for GPT-5 experiments
- Gemini-3.0-Pro API access (Google) — required for Gemini experiments
- API costs are non-trivial (see Section 8)

Software Dependencies:
- Python 3.x (version not explicitly specified in paper)
- SkyDiscover framework and its dependencies
- Standard scientific Python stack (numpy, etc.)

Compute Requirements:
- No GPU required for the evolutionary search itself (LLM calls are API-based)
- Some benchmark evaluators (e.g., GPU placement) may require specific hardware for evaluation
- Sequential evaluation budget: 100 iterations per task (fixed across all open AI frameworks)
7.3 Reproducibility Concerns
- LLM non-determinism: Both GPT-5 and Gemini-3.0-Pro produce stochastic outputs. The paper reports mean and best over three runs to mitigate this, but exact reproduction requires matching API versions and sampling parameters.
- AlphaEvolve comparison: AlphaEvolve results are "taken directly from its original publication" — not reproduced using the same infrastructure. AlphaEvolve uses internal Google infrastructure and Gemini models that may differ from the versions available via public API.
- Evaluation budget fairness: All open AI frameworks are evaluated under a fixed budget of 100 iterations. This is fair for wall-clock comparison, but may disadvantage methods designed for longer runs. AlphaEvolve, with internal Google compute, likely ran for many more iterations.
- Strategy generator cost: The paper notes that meta-evolution adds minimal overhead (~6% additional LLM calls), but the cost of invalid strategy generation and retry is not detailed.
8 Compute and API Costs
8.1 Evaluation Budget
All open AI frameworks are evaluated under a fixed budget of 100 solution-evaluation iterations per task. This means:
| Component | Calls | Cost Factor |
|---|---|---|
| Solution generation | 100 calls per task | ~100 × (GPT-5 or Gemini-3.0-Pro) per task |
| Strategy generation | K calls per task (K << 100) | ~5-15 × (same model) per task |
| Operator instantiation | 1 call per task (lightweight model) | Negligible |
| Total per task | ~105-115 LLM calls | — |
8.2 Cost Estimates
Based on current (April 2026) API pricing and typical prompt/response lengths for code generation tasks:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Est. Cost per Task | Est. Total (200 tasks) |
|---|---|---|---|---|
| GPT-5 | ~$10 | ~$30 | ~$5-15 | ~$1,000-3,000 |
| Gemini-3.0-Pro | ~$5 | ~$15 | ~$2.50-7.50 | ~$500-1,500 |
Per-experiment cost with 3 runs: Multiply by 3 for the mean/best statistics reported, yielding:
- GPT-5: ~$3,000-9,000 for the full benchmark suite
- Gemini-3.0-Pro: ~$1,500-4,500 for the full benchmark suite
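These per-task figures can be sanity-checked with a back-of-the-envelope calculation. The per-call token counts below are illustrative assumptions, not numbers from the paper:

```python
def task_cost(calls, in_tokens, out_tokens, in_price, out_price):
    """Estimated cost of one task in dollars.
    in_price/out_price are dollars per 1M tokens."""
    return calls * (in_tokens * in_price + out_tokens * out_price) / 1e6

# Assumption: ~110 LLM calls per task, with roughly 6k input tokens
# (problem context, parent code, history) and 1.5k output tokens per call.
gpt5_per_task = task_cost(110, 6_000, 1_500, in_price=10, out_price=30)
gemini_per_task = task_cost(110, 6_000, 1_500, in_price=5, out_price=15)
```

Under these assumptions the GPT-5 estimate lands near the upper half of the ~$5-15 per-task range and the Gemini-3.0-Pro estimate inside the ~$2.50-7.50 range, consistent with the table above.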
8.3 Meta-Evolution Overhead
The paper characterizes EvoX's meta-evolution overhead as approximately 6% additional LLM calls beyond what a fixed-strategy baseline would require. This low overhead is achieved through stagnation-triggered (rather than periodic) strategy evolution:
Strategy evolution overhead = K / T ≈ 5-15 / 100 = 5-15%
Where:
K = number of strategy evolution events
T = total solution evaluation budget (100)
W = window size per strategy (5-20 iterations)
The overhead is front-loaded: most strategy evolution events occur in the first half of the run, as the system discovers effective strategies that sustain progress without triggering stagnation. Later in the run, strategies tend to be retained for longer windows.
8.4 Compute Infrastructure
The paper does not specify the hardware infrastructure used for running experiments. Given that:
- LLM inference is API-based (no local GPU needed for generation)
- Evaluators for most tasks are lightweight (seconds per evaluation)
- The ADRS-Bench systems tasks may require specific hardware configurations
A reasonable estimate is that all experiments could be run on a single workstation with API access, though parallelization across tasks would significantly reduce wall-clock time.
9 Architecture Solution
9.1 System Architecture Overview
EvoX implements a two-level co-evolution architecture. The system can be understood as two coupled control loops operating at different timescales:
┌──────────────────────────────────────────────────────────┐
│ META-EVOLUTION LOOP │
│ (Outer / Slow Loop) │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Strategy │ │ Strategy │ │ Stagnation │ │
│ │ History │──▶│ Generator │──▶│ Detector │ │
│ │ H │ │ G_str(·) │ │ Δ < τ ? │ │
│ └──────────┘ └──────┬───────┘ └────────┬──────────┘ │
│ ▲ │ │ │
│ │ │ S_new │ stagnation │
│ │ ▼ │ signal │
│ │ ┌──────────────┐ │ │
│ │ │ Strategy │ │ │
│ └────────│ Validator │◀────────────┘ │
│ │ Valid(S)? │ │
│ └──────┬───────┘ │
│ │ S_valid │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ SOLUTION-EVOLUTION LOOP │ │
│ │ (Inner / Fast Loop) │ │
│ │ │ │
│ │ ┌──────────┐ │ │
│ │ │ Solution │ │ │
│ │ │ Pop. │◀──────────────────────────┐ │ │
│ │ │ D_t │ │ │ │
│ │ └────┬─────┘ │ │ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌───────┴──────┐ │ │
│ │ │ Search │ │ Solution │ │ Evaluator │ │ │
│ │ │ Strategy │──▶│Generator │──▶│ E(x) │ │ │
│ │ │ S_k │ │ G_sol(·) │ │ → (s, a) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────┘ │ │
│ │ Selects: │ │
│ │ - parent x_par Produces: Evaluates: │ │
│ │ - operator π x' ~ G_sol s(x'), a(x') │ │
│ │ - inspiration I │ │
│ │ │ │
│ │ Repeats for W iterations per strategy window │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
9.2 Data Flow
Problem Definition
│
▼
┌───────────────────────┐
│ Operator Instantiation│
│ (lightweight LLM) │
└───────────┬───────────┘
│ operator prompts
▼
┌───────────────────────┐
│ Initial Strategy │
│ S_0 (random sample) │
└───────────┬───────────┘
│
┌─────────────┴─────────────┐
│ │
▼ │
┌───────────────┐ │
│ Inner Loop │ │
│ (W iters) │ │
│ │ │
│ S_k.sample() │──▶ (x_par, π, I) │
│ │ │ │
│ ▼ │ │
│ G_sol(x_par, │ │
│ π, I) │ │
│ │ │ │
│ ▼ │ │
│ E(x') → s,a │ │
│ │ │ │
│ ▼ │ │
│ S_k.add(x') │ │
└───────┬───────┘ │
│ │
▼ │
┌───────────────┐ │
│ Stagnation? │ │
│ Δ < τ │ │
└───┬───────┬───┘ │
│No │Yes │
│ ▼ │
│ ┌─────────────┐ │
│ │ Meta-Evolve │ │
│ │ Strategy │──────────────┘
│ │ S_{k+1} │ new strategy
│ └─────────────┘
│
▼
┌───────────────┐
│ Budget T │
│ exhausted? │
└───┬───────┬───┘
│No │Yes
│ ▼
│ Return best
│ (x*, s*)
│
└──▶ Continue inner loop
9.3 Strategy Interface Contract
The search strategy is materialized as a Python class conforming to a well-defined interface:
class EvolvedProgramDatabase:
    """Search strategy interface that governs candidate
    selection and variation in the evolutionary loop."""

    # Provided by base class (read-only)
    programs: Dict[str, EvolvedProgram]
    DIVERGE_LABEL: str  # structural variation
    REFINE_LABEL: str   # local refinement

    def get(self, program_id: str) -> EvolvedProgram:
        """Retrieve a program by ID."""
        ...

    def add(self, program: EvolvedProgram) -> None:
        """Add an evaluated program to the population.

        The strategy controls how programs are organized
        internally — e.g., maintaining elite sets, diversity
        archives, Pareto frontiers, or custom data structures.
        """
        ...

    def sample(self) -> Tuple[
        Dict[str, EvolvedProgram],  # parent_dict
        List[EvolvedProgram],       # inspiration_set
    ]:
        """Select parent, variation operator, and inspiration.

        Returns:
            parent_dict: maps a label ("", REFINE_LABEL, or
                DIVERGE_LABEL) to exactly one parent
            inspiration_set: list of programs providing
                complementary context
        """
        ...
9.4 EvolvedProgram Data Structure
Each candidate program in the population carries the following fields:
@dataclass
class EvolvedProgram:
    id: str                       # unique identifier
    code: str                     # program source code
    metrics: ProgramMetrics       # evaluation results
    iteration_found: int          # when this candidate was created
    timestamp: float              # wall-clock creation time
    parent_id: Optional[str]      # ID of parent program
    parent_info: Tuple[str, str]  # (label, parent_id)
    metadata: Dict[str, Any]      # arbitrary metadata

    @property
    def combined_score(self) -> float:
        """Main optimization objective."""
        return self.metrics.combined_score
The parent_info tuple is a critical piece of lineage tracking: it records not just which program was the parent, but which variation operator was applied to it. This enables the strategy to reason about which operators are effective for which types of parents.
10 Component Breakdown
10.1 Solution-Evolution Loop
The inner loop is the standard LLM-driven evolutionary search paradigm, shared with prior systems:
| Component | Function | Implementation |
|---|---|---|
| Solution Population D_t | Stores all evaluated candidates with scores and artifacts | In-memory dictionary (Dict[str, EvolvedProgram]) |
| Search Strategy S_k | Selects parent, operator, and inspiration set | LLM-generated Python class |
| Solution Generator G_sol | Produces new candidate from prompt | GPT-5 or Gemini-3.0-Pro API call |
| Evaluator E | Scores candidate, returns (score, artifacts) | Task-specific automated evaluator |
Inner loop iteration flow:
1. S_k.sample() → (parent_dict, inspiration_set)
2. Construct prompt from parent, operator label, and inspiration set
3. G_sol(prompt) → x' (new candidate code)
4. E(x') → (s(x'), a(x')) (score and artifacts)
5. S_k.add(EvolvedProgram(x', s, a, ...)) → update population
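The five-step flow above can be sketched as a self-contained Python function. Here EvolvedProgram and ProgramMetrics are reduced to minimal stand-ins, and build_prompt, G_sol, and E are hypothetical placeholders for the prompt constructor, the solution-generator LLM call, and the evaluator — a sketch of the control flow, not the framework's actual code:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

# Minimal stand-ins for the framework's data types (illustrative only).
@dataclass
class ProgramMetrics:
    combined_score: float

@dataclass
class EvolvedProgram:
    id: str
    code: str
    metrics: ProgramMetrics
    iteration_found: int
    timestamp: float
    parent_id: Optional[str]
    parent_info: Tuple[str, str]
    metadata: Dict[str, Any] = field(default_factory=dict)

def inner_loop_step(strategy, build_prompt, G_sol, E, iteration):
    """One iteration of the solution-evolution inner loop (steps 1-5)."""
    parent_dict, inspirations = strategy.sample()       # 1. select
    (label, parent), = parent_dict.items()              # exactly one parent
    prompt = build_prompt(parent, label, inspirations)  # 2. build prompt
    code = G_sol(prompt)                                # 3. generate x'
    score, artifacts = E(code)                          # 4. evaluate
    program = EvolvedProgram(                           # 5. record + add
        id=str(uuid.uuid4()),
        code=code,
        metrics=ProgramMetrics(combined_score=score),
        iteration_found=iteration,
        timestamp=time.time(),
        parent_id=parent.id if parent else None,
        parent_info=(label, parent.id if parent else ""),
        metadata={"artifacts": artifacts},
    )
    strategy.add(program)
    return program
```

Note that the strategy object owns both ends of the loop: it chooses what to vary (sample) and how to file the result (add), which is exactly the surface meta-evolution later rewrites.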
10.2 Meta-Evolution Loop
The outer loop is EvoX's novel contribution:
| Component | Function | Implementation |
|---|---|---|
| Stagnation Detector | Monitors progress within each strategy window | Threshold comparison: Δ < τ |
| Population Descriptor φ(·) | Summarizes current population state for strategy generator | Function returning population statistics |
| Strategy History H | Records prior strategies and their performance profiles | List of (strategy_code, performance_metrics) tuples |
| Strategy Generator G_str | Produces new search strategy implementation | GPT-5 or Gemini-3.0-Pro API call |
| Strategy Validator Valid(·) | Checks syntactic and semantic validity of generated strategy | Python compilation + interface conformance test |
Meta-evolution trigger conditions:
1. Current strategy has been active for at least W iterations (minimum window)
2. Progress measure Δ falls below threshold τ (stagnation detected)
3. Budget T has not been exhausted
Strategy generation process:
1. Compute population descriptor: φ(D_t) → population statistics
2. Assemble strategy generation prompt: P_search(H, φ(D_t), problem_context)
3. Generate strategy: S_new ~ G_str(P_search)
4. Validate: Valid(S_new) → boolean
5. If valid: deploy S_new as S_{k+1}, record (S_k, performance_k) in H
6. If invalid: retry generation (up to retry limit)
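The generate-validate-deploy cycle above can be condensed into a short sketch. G_str and is_valid are stand-ins for the strategy-generator LLM call and the Valid(·) check, and the prompt assembly is simplified to a dictionary — illustrative structure only:

```python
def meta_evolve(G_str, is_valid, history, descriptor, context,
                current_strategy, current_perf, max_retries=3):
    """One meta-evolution event: record the retiring strategy, then try
    to generate and validate a replacement (steps 1-6; a sketch)."""
    # Step 5 bookkeeping: record the retiring strategy's performance in H.
    history.append((current_strategy, current_perf))
    # Steps 1-2: assemble the strategy-generation prompt P_search.
    prompt = {"history": history, "population": descriptor,
              "context": context}
    for _ in range(max_retries):       # step 6: bounded retries
        candidate = G_str(prompt)      # step 3: generate S_new
        if is_valid(candidate):        # step 4: validate
            return candidate           # deploy as S_{k+1}
    return current_strategy            # graceful degradation: keep S_k
```

The final fallback mirrors the paper's design: an invalid generation never replaces a working strategy.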
10.3 Variation Operator System
EvoX defines three variation operators that control the type of modification the solution LLM applies:
| Operator | Label | Scope | Analogy |
|---|---|---|---|
| Free-form | "" (empty) | Unconstrained | Exploratory mutation |
| Local Refinement | REFINE_LABEL | Structure-preserving, fine-grained | Parameter tuning, micro-optimization |
| Structural Variation | DIVERGE_LABEL | Coarse-grained redesign | Crossover, paradigm shift |
The operators are instantiated per-problem at initialization using a lightweight LLM to generate task-specific prompts from the problem description. This means the same REFINE_LABEL maps to different concrete instructions depending on the optimization domain.
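As a rough illustration of this instantiation step, a lightweight LLM call per operator label might look like the following. The meta-prompt wording and the light_llm callable are invented for illustration, not taken from the paper:

```python
def instantiate_operators(light_llm, problem_description):
    """Generate task-specific operator prompts at initialization (sketch).

    `light_llm` is a stand-in for the lightweight LLM call; the template
    wording below is hypothetical.
    """
    templates = {
        "": "Write an instruction for free-form, unconstrained edits",
        "REFINE": "Write an instruction for small, structure-preserving "
                  "refinements (tuning constants, micro-optimizations)",
        "DIVERGE": "Write an instruction for coarse-grained structural "
                   "redesigns (new algorithms, new representations)",
    }
    # One call per operator label; the same label thus maps to different
    # concrete instructions on different problems.
    return {
        label: light_llm(f"{ask} for this problem:\n{problem_description}")
        for label, ask in templates.items()
    }
```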
10.4 Evaluator Component
The evaluator is task-specific and provided by the benchmark suite. Key properties:
- Deterministic: Same code → same score (no evaluation noise)
- Automated: No human judgment required
- Scalar output: Primary score s(x) is a single real number
- Auxiliary artifacts: a(x) may include logs, traces, error messages, intermediate outputs
- Sandboxed execution: Candidate code runs in a restricted environment
10.5 Strategy Validation Component
The Valid(·) function ensures generated strategies are deployable:
def is_valid_strategy(strategy_code: str) -> bool:
    """Validate a generated search strategy."""
    # 1. Syntactic validity
    try:
        compile(strategy_code, "<strategy>", "exec")
    except SyntaxError:
        return False

    # 2. Class existence (exec can also fail at definition time,
    #    e.g., on a bad import, so it is guarded too)
    namespace = {}
    try:
        exec(strategy_code, namespace)
    except Exception:
        return False
    if "EvolvedProgramDatabase" not in namespace:
        return False

    # 3. Interface conformance
    cls = namespace["EvolvedProgramDatabase"]
    if not (hasattr(cls, "add") and hasattr(cls, "sample")):
        return False

    # 4. Runtime test with mock data (mock_program is a test fixture
    #    supplied by the validation harness)
    try:
        instance = cls()
        instance.add(mock_program)
        parent_dict, inspiration = instance.sample()
        # Verify output types
        assert isinstance(parent_dict, dict)
        assert isinstance(inspiration, list)
    except Exception:
        return False

    return True
11 Core Mechanisms (Detailed)
11.1 Algorithm 1: EvoX Two-Level Evolution
The complete algorithm, as presented in the paper:
Algorithm 1: EvoX Two-Level Evolution Process

INPUT:
    Budget T, window W, stagnation threshold τ
    Evaluator E
    Solution generator G_sol
    Strategy generator G_str
    Population descriptor φ(·)
    Validity test Valid(·)
    Initial solution database D_0
    Initial strategy S_0

INITIALIZE:
    D_t ← D_0
    S_t ← S_0
    H ← ∅ (strategy history)
    t ← 0

MAIN LOOP:
    while t < T:
        ────────── SOLUTION EVOLUTION (Inner Loop) ──────────
        for w = 1 to W:
            (x_par, π, I) ← S_t.sample(D_t)
            x' ~ G_sol(· | x_par, π, I)
            (s', a') ← E(x')
            D_{t+1} ← D_t ∪ {(x', s', a')}
            t ← t + 1
            if t ≥ T: break
        ────────── STAGNATION DETECTION ──────────
        Δ ← improvement over last W iterations
        if Δ < τ:
            ────────── META-EVOLUTION (Outer Loop) ──────────
            Record (S_t, performance) → H
            repeat:
                S_new ~ G_str(· | H, φ(D_t), problem_context)
            until Valid(S_new) or retry limit reached
            if Valid(S_new):
                S_t ← S_new
            else:
                keep S_t (no valid replacement found)

OUTPUT:
    Return (x*, s*) = argmax_{(x,s,a) ∈ D_T} s
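Algorithm 1 can also be transcribed into a compact, runnable Python driver. To keep the sketch short, strategies are modeled as plain callables S(D) rather than full EvolvedProgramDatabase classes, and E, G_sol, G_str, phi, and valid are passed in as stand-in functions — a sketch under those simplifying assumptions, not the paper's implementation:

```python
def evox(T, W, tau, E, G_sol, G_str, phi, valid, S0, context, retries=3):
    """Compact transcription of Algorithm 1 (sketch with stand-in callables)."""
    D = []                 # solution database: (candidate, score) pairs
    S, H, t = S0, [], 0

    def best():
        return max((s for _, s in D), default=float("-inf"))

    while t < T:
        window_start_best = best()
        for _ in range(W):              # ── inner loop: solution evolution
            x = G_sol(S(D))             # strategy drives candidate creation
            D.append((x, E(x)))         # evaluate and append (never prune)
            t += 1
            if t >= T:
                break
        delta = best() - window_start_best
        if delta < tau and t < T:       # ── stagnation detected
            H.append((S, delta))        # record retiring strategy in H
            for _ in range(retries):    # ── outer loop: meta-evolution
                S_new = G_str(H, phi(D), context)
                if valid(S_new):
                    S = S_new           # deploy S_{k+1}
                    break               # (else keep current S)
    return max(D, key=lambda pair: pair[1])   # (x*, s*)
```

A toy run makes the mechanics visible: a strategy that plateaus triggers stagnation, the "generator" hands back a better strategy, and the best score jumps after the switch.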
11.2 Search Strategy as Code: The Key Abstraction
The most innovative aspect of EvoX is representing search strategies as executable Python classes. This enables an unprecedented level of expressiveness compared to parametric strategy representations.
Example: A simple generated strategy (early in evolution)
class EvolvedProgramDatabase(BaseDatabase):
    """Random sampling strategy (initial S_0)."""

    def __init__(self):
        super().__init__()
        self.elite_size = 5

    def add(self, program: EvolvedProgram) -> None:
        self.programs[program.id] = program

    def sample(self) -> Tuple[Dict, List]:
        if not self.programs:
            return {"": None}, []
        # Random parent selection
        parent = random.choice(list(self.programs.values()))
        # Random operator selection
        label = random.choice(
            ["", self.REFINE_LABEL, self.DIVERGE_LABEL]
        )
        return {label: parent}, []
Example: A sophisticated generated strategy (after several meta-evolution steps)
class EvolvedProgramDatabase(BaseDatabase):
    """Adaptive strategy with Pareto-aware selection and
    stagnation-responsive operator scheduling."""

    def __init__(self):
        super().__init__()
        self.elite_archive = {}
        self.diversity_archive = {}
        self.recent_improvements = []
        self.operator_successes = {"": 0, "REFINE": 0, "DIVERGE": 0}
        self.operator_attempts = {"": 1, "REFINE": 1, "DIVERGE": 1}

    def add(self, program: EvolvedProgram) -> None:
        self.programs[program.id] = program
        score = program.metrics.combined_score
        if isinstance(score, (int, float)):
            # Track improvement
            if self.elite_archive:
                best = max(
                    p.metrics.combined_score
                    for p in self.elite_archive.values()
                    if isinstance(p.metrics.combined_score, (int, float))
                )
                if score > best:
                    self.recent_improvements.append(score - best)
                    # Credit the operator that produced this
                    label = program.parent_info[0]
                    if label in self.operator_successes:
                        self.operator_successes[label] += 1
            # Update elite archive (top-K)
            self.elite_archive[program.id] = program
            if len(self.elite_archive) > 10:
                worst = min(
                    self.elite_archive,
                    key=lambda k:
                        self.elite_archive[k].metrics.combined_score
                )
                del self.elite_archive[worst]

    def sample(self) -> Tuple[Dict, List]:
        if not self.programs:
            return {"": None}, []
        # Operator selection via UCB1
        total = sum(self.operator_attempts.values())
        ucb_scores = {}
        for op in ["", self.REFINE_LABEL, self.DIVERGE_LABEL]:
            key = op if op else ""
            exploit = (
                self.operator_successes.get(key, 0) /
                self.operator_attempts.get(key, 1)
            )
            explore = math.sqrt(
                2 * math.log(total) /
                self.operator_attempts.get(key, 1)
            )
            ucb_scores[op] = exploit + explore
        label = max(ucb_scores, key=ucb_scores.get)
        self.operator_attempts[label if label else ""] += 1
        # Parent selection: elite with probability 0.7, random otherwise
        if random.random() < 0.7 and self.elite_archive:
            parent = random.choice(list(self.elite_archive.values()))
        else:
            parent = random.choice(list(self.programs.values()))
        # Inspiration: diverse set from non-elite programs
        non_elite = [
            p for p in self.programs.values()
            if p.id not in self.elite_archive
        ]
        inspiration = random.sample(non_elite, min(3, len(non_elite)))
        return {label: parent}, inspiration
The transformation from the simple random strategy to the sophisticated UCB1-based adaptive strategy illustrates the power of EvoX's meta-evolution. The LLM generates qualitatively different strategy architectures — including new data structures (elite/diversity archives), new selection mechanisms (UCB1), and new feedback loops (operator success tracking) — none of which were predefined in the system.
11.3 Stagnation Detection Mechanism
Stagnation is detected by comparing the improvement in the best score over the most recent strategy window W against a threshold τ:
Δ_k = max_{(x,s,a) ∈ D_{t}} s - max_{(x,s,a) ∈ D_{t-W}} s
Stagnation triggered iff Δ_k < τ
Design considerations:
- τ is a fixed threshold, not adaptive (unlike AdaEvolve's accumulated improvement signal)
- The threshold should be calibrated per-task or set to a small positive value to detect near-zero improvement
- Using a fixed rather than relative threshold means the absolute magnitude of scores matters
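If the system keeps a running best-score history (one entry per iteration), the stagnation test reduces to a single window difference. A minimal sketch, with best_scores assumed to be a running maximum:

```python
def stagnated(best_scores, W, tau):
    """Check Δ_k = best(D_t) - best(D_{t-W}) < τ.

    `best_scores[i]` is the best score seen up to iteration i (a running
    max), so the window improvement is a simple difference (sketch).
    """
    if len(best_scores) <= W:
        return False          # not enough history for a full window yet
    delta = best_scores[-1] - best_scores[-1 - W]
    return delta < tau
```

Because the comparison is on raw score differences, the same τ behaves very differently on tasks whose scores live on different scales — the calibration concern noted above.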
Comparison with AdaEvolve's stagnation detection:
| Aspect | EvoX | AdaEvolve |
|---|---|---|
| Signal | Raw improvement Δ | Accumulated improvement (scale-invariant EMA) |
| Threshold | Fixed τ | Adaptive (percentile-based) |
| Response | Generate new strategy (structural) | Adjust bandit weights (parametric) |
| Expressiveness | Full strategy replacement | Three-level parameter adjustment |
11.4 Population Descriptor Function φ(·)
The population descriptor compresses the current state of the solution database into a summary that can be included in the strategy generation prompt. While the paper does not fully specify φ(·), based on the prompt structure and interface design, it likely includes:
import statistics

def population_descriptor(D_t: Dict[str, EvolvedProgram]) -> str:
    """Summarize population state for strategy generation."""
    programs = list(D_t.values())
    scores = sorted(
        p.metrics.combined_score for p in programs
        if isinstance(p.metrics.combined_score, (int, float))
    )
    # Hoist per-operator counts out of the f-string
    by_op = {"": 0, "REFINE": 0, "DIVERGE": 0}
    for p in programs:
        label = p.parent_info[0]
        if label in by_op:
            by_op[label] += 1
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"""
Population size: {len(programs)}
Score distribution:
  Best:   {scores[-1]:.6f}
  Median: {scores[len(scores) // 2]:.6f}
  Worst:  {scores[0]:.6f}
  Std:    {std:.6f}
Iteration range: {min(p.iteration_found for p in programs)}
  to {max(p.iteration_found for p in programs)}
Operator distribution:
  Free-form: {by_op[""]}
  Refine:    {by_op["REFINE"]}
  Diverge:   {by_op["DIVERGE"]}
Recent improvement trajectory: [last 10 deltas]
"""
11.5 Strategy History and Conditioning
The strategy history H accumulates all previously deployed strategies and their performance metrics. When generating a new strategy, the LLM is conditioned on H to:
- Avoid repeating failures: If a strategy based on pure exploitation stagnated, the LLM is less likely to generate another pure exploitation strategy.
- Build on successes: If a strategy with UCB1-based operator selection showed initial promise, the LLM may refine this approach.
- Combine ideas: The LLM can synthesize elements from multiple historical strategies into a novel combination.
Strategy History H = [
(S_0, {window: [0, 15], improvement: 0.42, stagnated: True}),
(S_1, {window: [15, 35], improvement: 0.18, stagnated: True}),
(S_2, {window: [35, 70], improvement: 0.31, stagnated: False}),
...
]
11.6 Search Strategy Design Space
EvoX's strategy evolution explores a rich design space. The LLM-generated strategies observed in practice include elements from multiple algorithmic families:
| Strategy Element | Examples Generated | Traditional Analogy |
|---|---|---|
| Parent selection | Tournament, fitness-proportional, rank-based, random, elite-only, Pareto-aware | Genetic algorithm selection operators |
| Inspiration selection | Diversity-maximizing, structurally dissimilar, score-complementary, random subset | Crossover partner selection |
| Operator scheduling | UCB1 bandit, epsilon-greedy, fixed ratio, stagnation-responsive, recency-weighted | Multi-armed bandit, adaptive operator selection |
| Population management | Elite archives, diversity archives, age-based pruning, Pareto fronts, sliding windows | MAP-Elites, NSGA-II, generational GA |
| Internal state tracking | Improvement histories, operator success rates, lineage graphs, novelty scores | Fitness landscape analysis |
The richness of this design space is a direct consequence of representing strategies as code rather than parameter vectors. A UCB1-bandit over operators with Pareto-aware parent selection cannot be expressed as a point in a predefined parameter space — it requires structural innovation.
11.7 Formal Convergence Properties
The paper does not provide formal convergence guarantees for the meta-evolution loop. This is expected given the generality of the approach — arbitrary Python code generation by an LLM is not amenable to standard convergence analysis. However, several informal properties are demonstrated empirically:
- Monotonic best-score: The solution population retains all evaluated candidates, so the best score can only improve or stay constant. The meta-evolution loop cannot decrease the best score.
- Strategy diversity: The stagnation-triggered mechanism ensures new strategies are only generated when current strategies fail, preventing unnecessary churn.
- History conditioning: By conditioning on the full strategy history, the meta-evolution loop has access to information about which approaches have been tried and failed, reducing redundant exploration.
- Graceful degradation: If no valid strategy can be generated, the system retains the current strategy rather than crashing or deploying an invalid strategy.
11.8 Comparison: Fixed vs. Evolved Search Dynamics
The paper demonstrates the difference between fixed and evolved search strategies through a signal processing case study:
Fixed strategy (MAP-Elites, red line):
Iteration 0-30: Rapid improvement (0.0 → 0.55)
Iteration 30-60: Slowing improvement (0.55 → 0.60)
Iteration 60-100: Stagnation (0.60 → 0.62)
Evolved strategy (EvoX, blue line):
Iteration 0-15: Moderate improvement (0.0 → 0.40)
Iteration 15-20: Strategy evolution event #1
Iteration 20-50: Resumed improvement (0.40 → 0.62)
Iteration 50-55: Strategy evolution event #2
Iteration 55-80: Breakthrough (0.62 → 0.72)
Iteration 80-100: Continued gains (0.72 → 0.74)
The characteristic pattern of EvoX is discrete performance breakthroughs following strategy evolution events. These breakthroughs represent qualitative shifts in the search process — not just parameter adjustments — enabled by the structural expressiveness of code-level strategy evolution.
12 Programming Language
12.1 Primary Language
| Language | Role | Notes |
|---|---|---|
| Python | Implementation language | All framework code, strategies, and most candidate solutions |
| Python | Strategy representation | Search strategies are generated as Python classes |
| Various | Candidate solutions | Depends on benchmark; primarily Python for algorithmic tasks |
12.2 Framework Dependencies
EvoX is built on the SkyDiscover framework infrastructure:
| Dependency | Purpose | Version/Notes |
|---|---|---|
| SkyDiscover | Core framework, benchmark suite, evaluation harness | Same codebase (github.com/skydiscover-ai/skydiscover) |
| OpenAI API | GPT-5 access | Via official Python SDK |
| Google AI API | Gemini-3.0-Pro access | Via official Python SDK |
| Python stdlib | Core data structures, random, math, statistics | Standard library |
12.3 Benchmark-Specific Dependencies
Different benchmarks may require additional dependencies:
| Benchmark Family | Dependencies | Notes |
|---|---|---|
| Mathematical optimization | NumPy, SciPy | Geometric computations, optimization |
| Systems (ADRS-Bench) | Benchmark-specific simulators | GPU placement, scheduling simulators |
| FrontierCS | Problem-specific evaluators | Competitive programming judges |
| ALE-Bench-Lite | AtCoder evaluators | NP-hard optimization judges |
12.4 Code Generation Target
The LLM generates Python code in two contexts:
1. Search strategies: Always Python classes conforming to EvolvedProgramDatabase interface
2. Candidate solutions: Typically Python functions or classes, though the framework supports arbitrary executable programs evaluated by the benchmark's evaluator
13 Memory Management
13.1 Solution Population Database
The solution population D_t is the primary stateful data structure in EvoX:
Population Database D_t:
┌─────────────────────────────────────────────────┐
│ programs: Dict[str, EvolvedProgram] │
│ │
│ EvolvedProgram: │
│ ┌──────────────────────────────────────────┐ │
│ │ id: str (UUID) │ │
│ │ code: str (full source) │ │
│ │ metrics.combined_score: float │ │
│ │ iteration_found: int │ │
│ │ timestamp: float │ │
│ │ parent_id: Optional[str] │ │
│ │ parent_info: (label, parent_id) │ │
│ │ metadata: Dict[str, Any] │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Grows monotonically: |D_t| = t │
│ Never prunes (all candidates retained) │
└─────────────────────────────────────────────────┘
Key properties:
- Append-only: Every evaluated candidate is permanently stored. The database grows from |D_0| = 0 to |D_T| = T.
- No global pruning: Unlike systems with fixed-size populations (e.g., genetic algorithms with generational replacement), EvoX's solution database is unbounded in size.
- Strategy-local pruning: Individual strategies may maintain internal archives (e.g., elite sets, diversity archives) that are subsets of the full database. These are part of the strategy's state, not the global infrastructure.
- Read-only metrics: Program evaluation metrics are immutable once computed. This prevents strategies from corrupting evaluation results.
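The append-only and read-only-metrics properties can be made concrete with a minimal database sketch. The Record type, write-once check, and read-only view are illustrative choices, not the framework's actual classes:

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class Record:
    """Immutable evaluation record: strategies can read it, not rewrite it."""
    code: str
    combined_score: float

class PopulationDB:
    """Append-only solution database (sketch): |D_t| grows by one per
    evaluated candidate; entries are never removed or overwritten."""

    def __init__(self):
        self._programs = {}

    def add(self, pid, record):
        if pid in self._programs:
            raise KeyError(f"duplicate id {pid!r}: records are write-once")
        self._programs[pid] = record

    @property
    def programs(self):
        # Read-only view; strategies keep their own archives as subsets.
        return MappingProxyType(self._programs)
```

Strategy-local archives (elite sets, diversity archives) would then hold keys into this view, so pruning an archive never deletes the underlying candidate.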
13.2 Strategy History
The strategy history H is a secondary stateful data structure:
Strategy History H:
┌──────────────────────────────────────────────────┐
│ List of (strategy, performance_record) tuples │
│ │
│ strategy: str (Python source code) │
│ performance_record: │
│ ┌────────────────────────────────────────────┐ │
│ │ window_start: int │ │
│ │ window_end: int │ │
│ │ best_score_at_start: float │ │
│ │ best_score_at_end: float │ │
│ │ improvement: float │ │
│ │ stagnated: bool │ │
│ │ num_candidates_generated: int │ │
│ └────────────────────────────────────────────┘ │
│ │
│ Grows by 1 per strategy evolution event │
│ Typical size: 5-15 entries per 100-iteration run │
└──────────────────────────────────────────────────┘
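A plausible Python rendering of the per-strategy performance record is below; the field names mirror the diagram above, but the schema is inferred from the report, not quoted from the paper:

```python
from dataclasses import dataclass

@dataclass
class StrategyPerformance:
    """Per-strategy performance record stored in H (inferred schema)."""
    window_start: int
    window_end: int
    best_score_at_start: float
    best_score_at_end: float
    num_candidates_generated: int

    @property
    def improvement(self) -> float:
        # The window-level Δ used by the stagnation detector.
        return self.best_score_at_end - self.best_score_at_start

    def stagnated(self, tau: float) -> bool:
        return self.improvement < tau
```

Storing derived fields like improvement as properties keeps each H entry small while still letting the strategy-generation prompt report both raw scores and deltas.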
13.3 Memory Scaling Analysis
For a typical 100-iteration run:
| Component | Size | Growth Rate | Memory Estimate |
|---|---|---|---|
| Solution population | 100 programs | O(T) linear | ~1-10 MB (depending on code size) |
| Strategy history | 5-15 strategies | O(T/W) sublinear | ~50-150 KB |
| Active strategy state | 1 strategy instance | O(1) constant | ~1-10 KB |
| LLM context windows | Temporary | O(1) per call | Governed by model context limit |
| Total | | | ~1-15 MB |
Memory is not a bottleneck for EvoX at the scales evaluated (100 iterations). For much longer runs (thousands of iterations), the unbounded growth of D_t could become a concern, though Python dictionary overhead per entry is modest.
13.4 Context Window Management
A subtle but critical aspect of memory management in EvoX is how information from the population database is compressed into LLM context windows:
For solution generation (inner loop):
- Parent code (full source)
- Inspiration set (2-5 programs, possibly truncated)
- Operator-specific prompt
- Problem description

For strategy generation (outer loop):
- Full strategy history H (all prior strategy code + performance)
- Population descriptor φ(D_t) (statistical summary)
- Problem context
- Interface specification
The population descriptor φ(·) serves as a compression mechanism: rather than including all T programs in the strategy generation prompt, it summarizes the population's statistical properties, reducing context length from O(T × code_size) to O(1).
13.5 State Persistence
The paper does not discuss checkpointing or state persistence mechanisms. Given that:
- The evaluation budget is fixed at 100 iterations
- Typical wall-clock time per run is moderate (minutes to hours depending on evaluator cost)
- The system is designed for single-run optimization, not long-running jobs
State persistence is likely not implemented. However, for production use, the modular architecture would support checkpointing of D_t, H, and current strategy state.
14 Continued Learning
14.1 Within-Run Learning
EvoX implements strong within-run learning through its two-level architecture:
Solution-level learning: The solution population D_t accumulates all evaluated candidates, providing an expanding knowledge base for the solution generator. As the population grows, strategies can select more informative parents and inspiration sets.
Strategy-level learning: The strategy history H enables the meta-evolution loop to learn from the effectiveness of prior strategies. Each strategy evolution event is conditioned on the full history, allowing the LLM to identify patterns in which strategy architectures work well for the current problem.
Operator-level learning: Individual strategies may implement internal learning mechanisms (e.g., UCB1 bandit over operators, as shown in Section 11.2). These mechanisms adapt within a single strategy window based on operator success rates.
Learning Hierarchy:
Time scale │ What is learned │ Mechanism
───────────────┼────────────────────────────┼──────────────────
Per-iteration │ Solution population │ Append to D_t
Per-window │ Operator effectiveness │ Strategy-internal
Per-stagnation │ Strategy architecture │ Meta-evolution
Per-run │ All of the above │ Accumulated H
14.2 Cross-Run Learning
EvoX does not implement explicit cross-run or cross-task learning. Each optimization run starts fresh with:
- Empty population D_0 = ∅
- Simple initial strategy S_0 (random sampling)
- Empty strategy history H = ∅
This is a deliberate design choice:
Starting from a simple random-sampling search strategy, EvoX consistently outperforms existing LLM-driven evolutionary frameworks.
The authors emphasize that EvoX achieves strong performance without cross-task transfer, warm-starting, or pre-trained strategy models. The meta-evolution loop discovers effective strategies from scratch within each run's 100-iteration budget.
14.3 Implicit Cross-Task Learning via LLM
While EvoX does not explicitly transfer learned strategies across tasks, the LLM backbone (GPT-5 or Gemini-3.0-Pro) provides implicit cross-task knowledge:
-
Algorithmic knowledge: The LLM has been trained on evolutionary algorithm literature, MAP-Elites implementations, bandit algorithms, and other search strategies. This pre-existing knowledge bootstraps the strategy generation process.
-
Problem domain knowledge: The LLM understands common optimization heuristics for specific domains (e.g., bin-packing for placement problems, gradient-free optimization for black-box functions).
-
Code generation capability: The LLM can generate syntactically correct, functionally complex Python classes, which is a prerequisite for the strategy-as-code approach.
14.4 Comparison with Other Systems' Learning Mechanisms
| System | Within-Run Learning | Cross-Run Learning | Cross-Task Learning |
|---|---|---|---|
| AlphaEvolve | Population accumulation | Not reported | LLM-implicit |
| OpenEvolve | Population accumulation | Not reported | LLM-implicit |
| ShinkaEvolve | Bandit LLM selection | Not reported | LLM-implicit |
| GEPA | Reflection-based | Not reported | LLM-implicit |
| AdaEvolve | Three-level adaptation | Not reported | LLM-implicit |
| ThetaEvolve | RL model fine-tuning | Yes (model weights) | Transfer via fine-tuned model |
| SOAR | Search + fine-tuning | Yes (model weights) | Transfer via fine-tuned model |
| EvoX | Two-level co-evolution | No | LLM-implicit |
A potential extension of EvoX would be to transfer successful strategy patterns across tasks — e.g., by seeding H with strategies that worked well on similar problems. This would create a form of meta-meta-learning: learning about strategy evolution across tasks.
14.5 Strategy Evolution Dynamics
The paper characterizes how strategies evolve over the course of a run. Typical patterns observed:
Phase 1: Exploration (iterations 0-20)
- Initial random sampling strategy deployed
- Population begins to accumulate diverse candidates
- First stagnation event typically occurs within 10-20 iterations

Phase 2: Strategy Discovery (iterations 20-50)
- Multiple strategy evolution events occur in rapid succession
- Strategies become progressively more sophisticated
- Transition from random selection to fitness-proportional or elite-based selection

Phase 3: Strategy Refinement (iterations 50-80)
- Strategy evolution events become less frequent
- Effective strategies are retained for longer windows
- Operators stabilize (e.g., exploitation ratio increases)

Phase 4: Convergence (iterations 80-100)
- Few or no strategy evolution events
- Current strategy produces diminishing but non-zero improvements
- Final score approaches the run's optimum
15 Applications
15.1 Mathematical Optimization
EvoX demonstrates strong performance on mathematical construction problems where the goal is to discover geometric or combinatorial configurations optimizing a precise mathematical objective.
Circle Packing in a Square
Problem: Pack k non-overlapping circles inside a unit square, maximizing the sum of their radii.
EvoX approach: The meta-evolution loop discovers that:
- Early strategies should explore diverse packing topologies (DIVERGE operator)
- Mid-run strategies should refine promising configurations (REFINE operator)
- Late-run strategies should combine elements of the best configurations (inspiration sets)
Result: EvoX achieves a sum of radii of 2.636, surpassing both the human best (2.634) and AlphaEvolve (2.635).
Heilbronn Triangle and Convex Problems
Problem: Place n points in a unit square to maximize the area of the smallest triangle formed by any three of the points (or, in the convex variants, of the smallest convex polygon formed by any k points).
Significance: These are long-standing open problems in combinatorial geometry. EvoX's ability to match or approach human-constructed bounds demonstrates the system's capability for mathematical discovery.
Signal Processing (Multi-Objective)
Problem: Design a causal, online filtering program for noisy time series, balancing fidelity, smoothness, lag, and false trend detection.
EvoX advantage: This multi-objective task is where EvoX shows its largest relative gain (0.7429 vs. 0.5049 for ShinkaEvolve with Gemini). The meta-evolution loop discovers strategies that explicitly sample candidates along different Pareto trade-off objectives, a strategy architecture that cannot be expressed in fixed parametric frameworks.
15.2 Systems Performance Optimization
EvoX is evaluated on six production-inspired systems optimization tasks from ADRS-Bench:
GPU-to-Model Placement (PRISM)
Problem: Schedule multiple deep learning models across distributed GPUs under memory, compute, and latency SLO constraints in a multi-tenant setting.
EvoX result: 30.52 throughput vs. 21.89 human best, a 39.4% improvement over the human baseline.
Key insight from meta-evolution: The initial strategy (random sampling) quickly finds greedy least-loaded heuristics. After stagnation, the meta-evolution loop generates a strategy that reformulates the problem as bin-packing assignment, enabling exploration of a qualitatively different solution space that produces dramatically better throughput.
Expert Placement Load Balancing (EPLB)
Problem: Place and replicate experts in Mixture-of-Experts (MoE) models across GPUs based on activation frequency, minimizing load imbalance.
Result: EvoX achieves 0.1453 throughput ratio, slightly outperforming the best random and GEPA baselines (0.1445).
Transaction Scheduling
Problem: Schedule database transactions under contention to maximize throughput while minimizing aborts.
Result: 4347.83 commits/sec vs. 2724.8 human best — a 59.6% improvement.
Cloud Transfer (Cloudcast)
Problem: Design cost-efficient data transfer strategies across cloud regions with heterogeneous pricing and deadline constraints.
Result: EvoX achieves 623.69 cost, beating the human best (626.24) and all AI baselines.
15.3 Algorithmic and Competitive Programming
FrontierCS (172 problems)
Problem: 172 open-ended computer science problems spanning:
- Optimization (38 problems): parameterized search under resource constraints
- Constructive (62 problems): synthesize valid structured objects under global constraints
- Interactive (72 problems): adaptive query-response protocols
Result: EvoX achieves a median score of 75.5, compared to 56.2 for the best prior AI system — a 34.3% improvement.
Significance: This is the largest benchmark evaluated, and EvoX's consistent advantage across 172 diverse problems suggests that meta-evolution provides a robust, domain-agnostic improvement mechanism.
ALE-Bench-Lite (10 NP-hard problems)
Problem: 10 NP-hard optimization problems from AtCoder Heuristic Contests, including territory control, sliding puzzles, graph classification, and route optimization.
Significance: These problems require fundamentally different algorithmic approaches (greedy, dynamic programming, local search, constraint satisfaction), making them a strong test of EvoX's ability to adapt strategies across problem types.
15.4 Abstract Reasoning (ARC-AGI-2)
Problem: ARC-AGI-2 tasks evaluate abstract and compositional reasoning, requiring programs that solve novel grid transformation puzzles.
Result: EvoX improves accuracy by 6-7 percentage points over OpenEvolve (41% → 48% with GPT-5; 45% → 51% with Gemini-3-Pro).
Caveat: The authors note that ARC-AGI fundamentally differs from the evolutionary optimization regime. Standard ARC evaluation assumes strict train-test separation, whereas evolutionary frameworks adapt during inference. The improvement may partially reflect better prompt construction rather than better search strategy per se.
15.5 Application Taxonomy
| Application Domain | Tasks | Primary Benefit of Meta-Evolution | Largest Gain |
|---|---|---|---|
| Mathematical construction | 8 | Strategy shift from exploration to refinement | +0.08% (circle packing) |
| Systems optimization | 6 | Problem reformulation (e.g., greedy → bin-packing) | +39.4% (GPU placement) |
| Competitive programming | 172 | Adaptive operator selection across diverse problems | +34.3% (median score) |
| NP-hard optimization | 10 | Cross-algorithm strategy discovery | Not separately reported |
| Abstract reasoning | N/A | Better context construction | +7pp (ARC accuracy) |
15.6 Domains Where EvoX Has Not Been Applied
| Domain | Reason | Feasibility |
|---|---|---|
| Neural architecture search (NAS) | Not in benchmark suite | High — NAS is well-suited to evolutionary search |
| Drug/molecule design | Requires domain-specific evaluators | Medium — needs molecular simulation infrastructure |
| Hardware design (Verilog) | Not in benchmark suite (AlphaEvolve domain) | High — code-based, automated evaluation |
| Prompt optimization | Not primary focus | High — SkyDiscover supports prompt tasks |
| Theorem proving | Requires formal verification | Medium — needs proof assistant integration |
| Robotics control | Requires physics simulation | Medium — needs sim-to-real infrastructure |
15.7 Broader Implications
EvoX's core insight — that the search process itself should be subject to evolutionary pressure — has implications beyond LLM-driven optimization. The principle of meta-evolution applies to any iterative search process where:
1. The search strategy can be expressed as executable code
2. An LLM can generate valid strategy implementations
3. Strategy effectiveness can be measured by downstream progress
4. The optimal strategy is non-stationary (changes during the search)
This suggests potential applications in:
- AutoML: evolving the hyperparameter tuning strategy itself
- Reinforcement learning: evolving the exploration strategy during training
- Compiler optimization: evolving the optimization pass ordering
- Database query optimization: evolving the query planning heuristic
- Scientific experimentation: evolving the experimental design strategy based on prior results
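To make the two-level structure concrete, the sketch below shows a toy meta-evolution loop satisfying the conditions above: an inner loop evolves solutions under the current strategy, and an outer loop mutates the strategy object itself, keeping a mutant only if it yields better downstream progress. This is a minimal illustration, not the paper's implementation: the `Strategy` class, `propose`, `inner_search`, and the scalar toy objective are all hypothetical stand-ins (in EvoX, strategies are full LLM-generated Python classes and fitness comes from task evaluators).

```python
import random

class Strategy:
    """An evolvable search strategy; here just a mutation step size."""
    def __init__(self, step=1.0):
        self.step = step

    def propose(self, x):
        # Solution-level variation operator defined by this strategy.
        return x + random.uniform(-self.step, self.step)

def fitness(x):
    # Toy objective: maximize -(x - 3)^2, optimum at x = 3.
    return -(x - 3.0) ** 2

def inner_search(strategy, x, iters=200):
    """Inner level: hill-climb solutions under a fixed strategy."""
    best, best_f = x, fitness(x)
    for _ in range(iters):
        cand = strategy.propose(best)
        if fitness(cand) > best_f:
            best, best_f = cand, fitness(cand)
    return best, best_f

def meta_evolve(rounds=10):
    """Outer level: mutate the strategy, judged by downstream progress."""
    strategy, x = Strategy(step=2.0), 0.0
    x, score = inner_search(strategy, x)
    for _ in range(rounds):
        # Mutate the strategy itself (here: halve or double the step size).
        mutant = Strategy(step=max(1e-3, strategy.step * random.choice([0.5, 2.0])))
        x_new, new_score = inner_search(mutant, x)
        if new_score >= score:  # keep the strategy only if search progressed
            strategy, x, score = mutant, x_new, new_score
    return x, score

random.seed(0)
x, score = meta_evolve()
print(f"best x = {x:.3f}, score = {score:.5f}")
```

The key design point mirrors condition 3: the mutant strategy is never scored directly, only by the quality of the solutions it produces, so strategy selection automatically tracks whichever search behavior the current phase of the run rewards.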
Analysis completed April 2026. Based on arXiv:2602.23413v2 (revised March 16, 2026), GitHub repository skydiscover-ai/skydiscover, and contextual knowledge of the LLM-driven evolutionary search landscape.