SkyDiscover & AdaEvolve
Part P02: General-Purpose Evolutionary Frameworks
9.1 Overview & Motivation
By early 2026, the landscape of LLM-powered evolutionary search had grown rich but fragmented. FunSearch (2023) demonstrated that large language models could discover novel mathematical constructions. AlphaEvolve and OpenEvolve (2025) introduced multi-island architectures with MAP-Elites-inspired quality-diversity. ShinkaEvolve added bandit-driven model selection and prompt co-evolution. GEPA brought Pareto-based multi-objective optimization with structured diagnostic feedback. Yet each system used its own benchmark suite, its own evaluation protocol, and its own definition of "improvement" — making rigorous cross-system comparison effectively impossible.
SkyDiscover, developed at the UC Berkeley Sky Lab and published in February 2026 (arXiv:2602.20133), addresses this fragmentation with two contributions. First, it provides a modular framework offering a unified interface for implementing, running, and fairly comparing discovery algorithms across more than 200 optimization tasks spanning mathematics, systems engineering, competitive programming, and creative applications. Second, it introduces AdaEvolve, a novel search algorithm implementing three-level hierarchical adaptation — local exploration intensity, global cross-island resource allocation, and meta-level tactical generation — all coordinated by a single unified signal.
Key Contribution
AdaEvolve implements three-level hierarchical adaptation coordinated by a single accumulated improvement signal — a scale-invariant exponential moving average of squared improvement magnitudes. This signal simultaneously drives within-island exploration intensity (Level 1), cross-island resource allocation via a globally-normalized UCB bandit (Level 2), and LLM-generated tactical paradigm shifts when stagnation is detected across all search fronts (Level 3). The paper reports a ~34% median improvement over OpenEvolve, GEPA, and ShinkaEvolve baselines across approximately 185 evaluated tasks within its benchmark suite (arXiv:2602.20133, Table 3 aggregate and Figure 5). This figure aggregates across diverse problem types with varying difficulty and should be interpreted as a summary statistic rather than a uniform improvement guarantee. All results are self-reported by the SkyDiscover authors; no independent reproduction has been published as of this writing.
The Berkeley Sky Lab team, including Ion Stoica and Matei Zaharia (known for their work on Spark and Ray), Koushik Sen (known for concolic testing tools such as CUTE), and Alex Dimakis, brings a distinctive systems-engineering perspective to evolutionary search. This is reflected in the framework's emphasis on fair benchmarking infrastructure, pluggable algorithm backends, and practical systems optimization tasks alongside traditional mathematical benchmarks. Several authors overlap with the GEPA team, and SkyDiscover includes GEPA as a first-class backend algorithm, positioning the framework as a unification point for the field.
Verification Methodology
This chapter draws on three evidence sources, distinguished throughout: (1) Paper-reported — claims from arXiv:2602.20133, cited by section, table, or figure. (2) Repository-documented — module paths, class names, configuration keys, and code patterns described in the repository README, docstrings, and source files at github.com/skydiscover-ai/skydiscover. Code excerpts presented below are simplified from the repository source for presentation clarity; full implementations include additional logging, error handling, and integration code. (3) Author's inference — analytical derivations, numerical range estimates, and extrapolative claims produced by this survey, explicitly labeled as such. Module paths reference the repository structure documented in the README and paper (arXiv:2602.20133, §2); where paths could not be independently verified against a pinned commit, this is noted.
9.1.1 The Adaptation Problem
Prior LLM-evolutionary systems operate with static search schedules: fixed island counts, predetermined mutation ratios, and time-invariant exploration-exploitation tradeoffs. OpenEvolve allocates compute equally across islands with fixed ring migration. ShinkaEvolve uses a bandit to select LLM providers but does not adapt the search topology itself. GEPA applies reflection-driven mutation within a single population. None of these systems dynamically reallocate compute based on observed search dynamics.
This is suboptimal because the improvement landscape of evolutionary search is highly non-stationary. Early iterations often see rapid gains from low-hanging improvements, followed by long plateaus where incremental refinement yields diminishing returns, punctuated by occasional breakthroughs that open new improvement trajectories. A static allocation strategy cannot respond to these dynamics: it wastes compute on stagnant search fronts and under-invests in productive ones.
AdaEvolve's hierarchical adaptation directly addresses this: it measures the volatility of improvement on each island in real time and uses this signal to make allocation decisions at three nested scales — within each island, across islands, and at the meta-strategic level.
9.2 Architecture
SkyDiscover separates concerns into two distinct layers: the framework layer, which handles task definitions, evaluation, configuration, monitoring, and benchmarking infrastructure; and the search algorithm layer, which implements the actual evolutionary search logic. This separation enables fair comparison: different search algorithms operate against the same evaluator API, the same benchmark tasks, and the same monitoring infrastructure.
9.2.1 Framework Layer
The framework layer comprises seven principal components. The following table documents each component with its module path and evidence status:
| Component | Module / Interface | Responsibility | Evidence |
|---|---|---|---|
| Evaluator API | evaluate(program_path) → EvaluationResult | Executes candidate programs; returns combined_score and structured artifacts dict that feeds into subsequent mutation context | Paper §2; repo interface |
| Search Router | skydiscover/search/ | Dispatches to configured algorithm: AdaEvolve, EvoX, Top-K, Beam Search, Best-of-N, or native backends (OpenEvolve, GEPA, ShinkaEvolve). Selected via search.algorithm config key | Paper §2; repo module |
| Context Builder | skydiscover/context_builder/ | Assembles LLM mutation prompts from source code, evaluation artifacts, EVOLVE-BLOCK regions, and current search state | Paper §2; repo module |
| LLM Provider Layer | llm.models[] YAML config | Weighted multi-model pools using provider/model format (e.g., openai/gpt-5); supports OpenAI, Google, Anthropic, and local Ollama/LiteLLM | Paper §2.3; repo config |
| Benchmark Suite | benchmarks/ | 200+ tasks with evaluators, initial programs, and documentation across 5 domains; task format specified in benchmarks/README.md | Paper §4; repo directory |
| Live Dashboard | Web UI | Real-time scatter plots, code diffs, metric tracking, human intervention controls | Paper §2.5 |
| Checkpoint System | JSON serialization | Full search-state serialization for interrupting and resuming long-running experiments | Paper §2.4; repo code |
The checkpoint system, live dashboard, and human feedback controls are framework-level services available to all search algorithms, not AdaEvolve-specific mechanisms. AdaEvolve extends the checkpoint system by serializing its own per-island accumulated improvement signals and bandit state alongside the standard program archives (see §9.5.4).
9.2.2 Evaluator API and EVOLVE-BLOCK Markers
SkyDiscover uses EVOLVE-BLOCK-START / EVOLVE-BLOCK-END markers to designate mutable regions within candidate programs. Code outside these markers is preserved as immutable context — imports, data loading, and output formatting remain fixed while the search algorithm evolves only the designated solution logic. If no markers are present, the entire program becomes mutable (arXiv:2602.20133, §2.1).
# Example benchmark task structure using EVOLVE-BLOCK markers.
# Convention documented in the repository's benchmarks/README.md.
import numpy as np

# Immutable context: preserved across all mutations
def load_data(path: str) -> np.ndarray:
    return np.load(path)

# EVOLVE-BLOCK-START
def solve(data: np.ndarray) -> float:
    """This function will be evolved by the search algorithm."""
    # Initial naive implementation: the seed program
    return float(np.sum(data))
# EVOLVE-BLOCK-END

# Immutable harness
if __name__ == "__main__":
    result = solve(load_data("input.npy"))
    print(f"Result: {result}")
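To make the marker convention concrete, the following is a minimal sketch of how a program could be split into immutable and mutable segments. The function name `split_evolve_regions` and the exact marker-matching rule (a line that equals the marker after stripping whitespace) are assumptions made for illustration; the repository's own parser may differ.

```python
EVOLVE_START = "# EVOLVE-BLOCK-START"
EVOLVE_END = "# EVOLVE-BLOCK-END"


def split_evolve_regions(source: str) -> list[tuple[bool, str]]:
    """Split source into (is_mutable, text) segments (hypothetical helper).

    Marker lines themselves are dropped. If no start marker is present,
    the whole program is returned as a single mutable segment.
    """
    if EVOLVE_START not in source:
        return [(True, source)]

    segments: list[tuple[bool, str]] = []
    buffer: list[str] = []
    mutable = False
    for line in source.splitlines(keepends=True):
        stripped = line.strip()
        if stripped == EVOLVE_START and not mutable:
            if buffer:
                segments.append((False, "".join(buffer)))
            buffer, mutable = [], True
        elif stripped == EVOLVE_END and mutable:
            segments.append((True, "".join(buffer)))
            buffer, mutable = [], False
        else:
            buffer.append(line)
    if buffer:
        # Trailing text; still mutable only if an end marker was missing.
        segments.append((mutable, "".join(buffer)))
    return segments


program = (
    "import numpy as np\n"
    "# EVOLVE-BLOCK-START\n"
    "def solve(data):\n"
    "    return float(np.sum(data))\n"
    "# EVOLVE-BLOCK-END\n"
    "print(solve(np.ones(3)))\n"
)
segments = split_evolve_regions(program)
```

Here `segments` holds three entries: the import line (immutable), the body of `solve` (mutable), and the harness line (immutable). A mutation step would rewrite only the mutable entries and then reassemble the file.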
The evaluator returns both a numeric score and structured artifacts. The evaluator contract, documented in the repository and paper (arXiv:2602.20133, §2.1), defines a standard interface that all benchmark tasks implement:
# Evaluator interface from skydiscover's evaluation module.
# Each benchmark task provides an evaluate() function following this contract.
# Simplified from the repository source for presentation.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvaluationResult:
    """Standard return type for all benchmark evaluators.

    combined_score: Primary optimization target (higher is better).
    artifacts: Structured feedback injected into subsequent LLM prompts
        via the context builder. Keys vary by benchmark task.
    """
    combined_score: float
    artifacts: dict[str, Any] = field(default_factory=dict)

def evaluate(program_path: str) -> EvaluationResult:
    """Evaluator contract implemented by every benchmark task.

    Executes the candidate program, measures performance against
    the task's scoring criteria, and returns both a numeric score
    and structured artifacts for the next mutation cycle.
    """
    # run_tests, get_failures, measure_runtime, measure_memory, and
    # generate_diagnostic_hint are task-specific helpers not shown here.
    score = run_tests(program_path)
    return EvaluationResult(
        combined_score=score,
        artifacts={
            "failed_test_cases": get_failures(program_path),
            "runtime_ms": measure_runtime(program_path),
            "memory_mb": measure_memory(program_path),
            "hint": generate_diagnostic_hint(program_path),
        },
    )
The artifacts — failed test cases, runtime measurements, memory usage, and free-form hints — are automatically incorporated into subsequent mutation prompts via the context builder. This creates a lightweight feedback loop analogous to GEPA's Adaptive Supplementary Information (ASI), though less formally structured: artifacts are free-form dictionaries rather than typed diagnostic schemas.
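As an illustration of this feedback loop, the sketch below renders an artifacts dictionary into a plain-text section that could be appended to a mutation prompt. The function `render_artifacts`, the section heading, and the truncation limit are all hypothetical; the source does not describe how the context builder actually formats artifacts.

```python
from typing import Any


def render_artifacts(artifacts: dict[str, Any], max_chars: int = 500) -> str:
    """Format free-form artifacts as prompt text (illustrative only)."""
    lines = ["## Evaluation feedback"]
    for key in sorted(artifacts):  # stable ordering keeps prompts reproducible
        value = str(artifacts[key])
        if len(value) > max_chars:
            # Long values (e.g. stack traces) are cut to protect the context budget.
            value = value[:max_chars] + " [truncated]"
        lines.append(f"- {key}: {value}")
    return "\n".join(lines)


feedback = render_artifacts(
    {"runtime_ms": 12.5, "failed_test_cases": ["case_3"]}
)
print(feedback)
```

Because artifacts are untyped dictionaries, a renderer of this kind can only stringify values; this is the practical consequence of the "less formally structured" design noted above.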
9.2.3 Weighted Multi-Model Pools
A distinctive feature of SkyDiscover's LLM integration is weighted multi-model pools for distributed sampling (arXiv:2602.20133, §2.3). Rather than using a single LLM per mutation, the framework samples from a weighted mixture of providers. This enables diversity in mutation style — different models produce characteristically different code modifications — and provides resilience against provider outages or rate limits.
LLM providers are specified in YAML configuration using the provider/model format. The relevant configuration keys, as documented in the repository:
# SkyDiscover LLM configuration: exact config keys from repository docs.
# The llm.models[] array defines the weighted provider pool.
llm:
  models:
    - model: "openai/gpt-5"              # provider/model format
      weight: 0.4                        # sampling weight
    - model: "gemini/gemini-3-pro"
      weight: 0.3
    - model: "anthropic/claude-sonnet-4-6"
      weight: 0.2
    - model: "ollama/qwen2.5-coder:32b"  # local model via Ollama
      weight: 0.1
  system_prompt: "You are an expert algorithm designer..."
max_iterations: 500
Supported providers include OpenAI (openai/), Google (gemini/), Anthropic (anthropic/), and local models via Ollama or LiteLLM (ollama/). The framework also provides an agentic mode where the LLM has access to the full project file structure during mutation, enabling context-aware modifications that consider imports, dependencies, and cross-file interactions (arXiv:2602.20133, §2.3). This is conceptually similar to multi-file evolution in other systems but operates at the file-system level rather than within a REPL.
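The weighted sampling behind such a pool can be sketched with the standard library alone. The pool below copies the model names and weights from the YAML example; `sample_model` is a hypothetical helper, and the framework's real sampler (including any retry or fallback logic) is not described in the source.

```python
import random
from collections import Counter

# Model names and weights copied from the YAML example; weights need not sum to 1.
MODEL_POOL: list[tuple[str, float]] = [
    ("openai/gpt-5", 0.4),
    ("gemini/gemini-3-pro", 0.3),
    ("anthropic/claude-sonnet-4-6", 0.2),
    ("ollama/qwen2.5-coder:32b", 0.1),
]


def sample_model(pool: list[tuple[str, float]], rng: random.Random) -> str:
    """Draw one model name with probability proportional to its weight."""
    names = [name for name, _ in pool]
    weights = [weight for _, weight in pool]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
counts = Counter(sample_model(MODEL_POOL, rng) for _ in range(10_000))
for name, _ in MODEL_POOL:
    print(f"{name:32s} {counts[name] / 10_000:.3f}")
```

Over many draws the empirical frequencies approach the normalized weights, so each mutation call is served by a different model in roughly the configured proportions.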
9.2.4 Multi-Algorithm Support and Search Router
The search router, implemented in the skydiscover/search/ module, dispatches to any of eight strategies (five built-in algorithms and three native backends), making SkyDiscover a platform rather than a single algorithm. The algorithm is selected via the search.algorithm configuration key:
# Search algorithm dispatch from skydiscover/search/.
# The router uses the search.algorithm config key to select and
# instantiate the appropriate search backend.
# Simplified from the repository source for presentation.
import importlib
from typing import Any

ALGORITHM_REGISTRY: dict[str, str] = {
    "adaevolve": "skydiscover.search.adaevolve.AdaEvolve",
    "evox": "skydiscover.search.evox.EvoX",
    "topk": "skydiscover.search.topk.TopKSearch",
    "beam": "skydiscover.search.beam.BeamSearch",
    "best_of_n": "skydiscover.search.best_of_n.BestOfN",
    "gepa": "skydiscover.search.backends.gepa.GEPABackend",
    "openevolve": "skydiscover.search.backends.openevolve.OpenEvolveBackend",
    "shinkaevolve": "skydiscover.search.backends.shinkaevolve.ShinkaEvolveBackend",
}

def create_search(config: dict[str, Any]) -> "SearchAlgorithm":
    """Instantiate the configured search algorithm.

    Args:
        config: Parsed YAML configuration dict.
    Returns:
        Initialized search algorithm with run() entry point.
    Raises:
        ValueError: If search.algorithm names an unknown backend.
    """
    algo_name = config.get("search", {}).get("algorithm", "adaevolve")
    if algo_name not in ALGORITHM_REGISTRY:
        available = ", ".join(sorted(ALGORITHM_REGISTRY))
        raise ValueError(f"Unknown algorithm '{algo_name}'. Available: {available}")
    module_path, class_name = ALGORITHM_REGISTRY[algo_name].rsplit(".", 1)
    module = importlib.import_module(module_path)
    algo_class = getattr(module, class_name)
    return algo_class(config)
| Algorithm | Config Key | Type | Description |
|---|---|---|---|
| AdaEvolve | adaevolve | Hierarchical adaptive | Three-level adaptation with accumulated improvement signal (default) |
| EvoX | evox | Self-evolving | Co-adapts solution generation and experience management via LLM-driven strategy evolution |
| Top-K | topk | Generic | Simple top-K parent selection with random mutation |
| Beam Search | beam | Generic | Maintains fixed-width beam of best candidates |
| Best-of-N | best_of_n | Generic | Generates N candidates per iteration, keeps the best |
| Native backends | gepa, openevolve, shinkaevolve | Delegated | Run as backend search engines through SkyDiscover's infrastructure |
This multi-algorithm design is what enables fair benchmarking: all algorithms share the same evaluator API, the same benchmark tasks, and the same LLM provider infrastructure.
9.3 Core Algorithms: AdaEvolve
AdaEvolve, implemented in skydiscover/search/adaevolve.py, provides a multi-island evolutionary search with three nested levels of adaptation. The core insight is that a single unified signal — the accumulated improvement signal — can coordinate decisions at all three levels, from fine-grained within-island exploration to coarse-grained strategic paradigm shifts.
9.3.1 The Accumulated Improvement Signal
The foundation of AdaEvolve is a per-island metric that quantifies the volatility of recent improvement. It is computed in two steps (arXiv:2602.20133, §3.1).
Step 1: Normalized improvement magnitude. After each mutation evaluation on island $k$ at step $t$, the improvement relative to the island's current best score is computed:
$$\delta_t^{(k)} = \max\left(\frac{f' - f_k^*}{|f_k^*| + \epsilon},\ 0\right) \tag{1}$$
where $f'$ is the newly evaluated candidate's score, $f_k^*$ is the best score currently on island $k$, and $\epsilon$ is a small constant (default $10^{-8}$) for numerical stability. The $\max(\cdot, 0)$ clipping ensures that only improvements contribute; deteriorations are clamped to zero. Dividing by $|f_k^*|$ makes the signal scale-invariant: a 1% improvement on a problem with scores in the thousands produces the same $\delta$ as a 1% improvement on a problem with scores near zero.
Note on formula variants: The paper presents a simpler form $\delta = \max((f' - f_k^*) / f_k^*, 0)$ without the absolute value or $\epsilon$ (arXiv:2602.20133, §3.1). Equation 1 above follows the defensive variant present in the repository's implementation, which handles negative scores and zero-score edge cases. Both are equivalent when $f_k^* > 0$, which holds for the majority of the benchmark suite.
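The difference between the two variants can be checked numerically. The sketch below writes both forms as small functions, following the formulas quoted in the note; the names `delta_paper` and `delta_defensive` are illustrative and do not come from the repository.

```python
def delta_paper(new_score: float, best: float) -> float:
    """Simple form quoted from the paper: max((f' - f*) / f*, 0)."""
    return max((new_score - best) / best, 0.0)


def delta_defensive(new_score: float, best: float, epsilon: float = 1e-8) -> float:
    """Defensive form: max((f' - f*) / (|f*| + eps), 0)."""
    return max((new_score - best) / (abs(best) + epsilon), 0.0)


# Positive scores: both forms agree up to the epsilon term.
print(delta_paper(12.0, 10.0), delta_defensive(12.0, 10.0))

# Negative scores: an improvement from -10 to -8 is a real gain, but the
# simple form divides by a negative best and clips the result to zero.
print(delta_paper(-8.0, -10.0), delta_defensive(-8.0, -10.0))
```

With a best score of exactly zero, the simple form raises `ZeroDivisionError`, while the defensive form returns a very large but finite value. That large value then enters the squared EMA, so zero-score baselines can still dominate the signal even though they no longer crash.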
Step 2: Exponential moving average of squared improvements. The accumulated improvement signal $G_t^{(k)}$ is an EMA of the squared normalized improvements:
$$G_t^{(k)} = \rho \, G_{t-1}^{(k)} + (1 - \rho) \left(\delta_t^{(k)}\right)^2 \tag{2}$$
where $\rho = 0.9$ is the decay rate (arXiv:2602.20133, §3.1; also documented as the default in the repository's search.rho config key). The squaring serves two purposes: it amplifies large breakthroughs relative to small incremental improvements, and it suppresses noise from near-zero improvements. $G_0^{(k)} = 0$ for all islands. High $G_t^{(k)}$ indicates a productive trajectory; low $G_t^{(k)}$ signals stagnation.
# From skydiscover/search/adaevolve.py: AccumulatedImprovementSignal
# Implements the per-island volatility metric (Eqs. 1–2, arXiv:2602.20133 §3.1).
# Simplified from repository source; full implementation includes logging
# and integration with the checkpoint serializer.
class AccumulatedImprovementSignal:
    """Per-island volatility metric driving all three adaptation levels.

    The signal G_t^(k) is an exponential moving average of squared
    normalized improvement magnitudes. It is the sole coordination
    mechanism across local, global, and meta adaptation.
    """

    def __init__(self, rho: float = 0.9, epsilon: float = 1e-8):
        self.rho = rho          # EMA decay rate (config: search.rho)
        self.epsilon = epsilon  # numerical stability constant
        self.G = 0.0            # accumulated signal, initialized to 0
        self.best_score = float('-inf')

    def update(self, new_score: float) -> float:
        """Update signal after evaluating a new candidate on this island.

        Returns the updated G value for use by all three adaptation levels.
        """
        if self.best_score > float('-inf'):
            # Normalized improvement magnitude (Eq. 1)
            delta = max(
                (new_score - self.best_score) / (abs(self.best_score) + self.epsilon),
                0.0
            )
        else:
            delta = 0.0  # first evaluation establishes baseline only
        # Exponential moving average of squared improvements (Eq. 2)
        self.G = self.rho * self.G + (1 - self.rho) * delta ** 2
        if new_score > self.best_score:
            self.best_score = new_score
        return self.G
This single scalar — computed identically on every island — is the input to all three adaptation levels described below. The design's elegance lies in its unification: rather than requiring separate signals for exploration scheduling, resource allocation, and stagnation detection, a single volatility metric serves all three purposes.
Author's Analytical Inference: Numerical Range of $G_t^{(k)}$
The following analysis is the survey author's own derivation, not reported in the paper. It is included to assess the numerical plausibility of the paper's fixed threshold values.
If a single improvement of normalized magnitude $\delta_0$ occurs at step $t_0$ followed by no further improvements, $G_t$ decays as $G_t = (1 - \rho) \cdot \delta_0^2 \cdot \rho^{(t - t_0)}$ for $t > t_0$. With $\rho = 0.9$, the signal's half-life is $\ln(2)/\ln(1/0.9) \approx 6.6$ steps.
| Scenario | $\delta$ | $G$ immediately after | $G$ after 10 zero steps | $G$ after 20 zero steps |
|---|---|---|---|---|
| Large breakthrough (score doubles) | 1.0 | 0.10 | 0.035 | 0.012 |
| Substantial improvement (50%) | 0.50 | 0.025 | 0.0087 | 0.003 |
| Moderate improvement (10%) | 0.10 | 0.001 | 0.00035 | 0.00012 |
| Small refinement (1%) | 0.01 | $10^{-5}$ | $3.5 \times 10^{-6}$ | $1.2 \times 10^{-6}$ |
| 10× improvement from low base | 9.0 | 8.1 | 2.82 | 0.98 |
Key observations from this analysis:
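- Starting from $G = 0$, a single score-doubling event yields $G = 0.10$, which already sits below the meta-guidance threshold ($\tau_{\text{meta}} = 0.12$); an isolated improvement must exceed roughly 110% ($\delta \approx 1.1$) to lift $G$ above it. Under sustained improvements of constant magnitude $\delta$, $G$ converges to $\delta^2$, so the threshold corresponds to roughly 35% improvement per step. A single 10× improvement would keep $G > 0.12$ for ~40 steps.
- The spawning threshold ($\tau_{\text{spawn}} = 0.02$) requires deeper stagnation: ~16 steps of no improvement after a score-doubling event ($0.10 \cdot 0.9^{16} \approx 0.019$).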
- For near-optimal problems (where typical improvements are fractions of a percent), $G_t$ would be expected to remain near zero ($< 10^{-4}$), which would keep meta-guidance chronically triggered and exploration intensity near maximum. This is consistent with the ablation results discussed in §9.4.4.
- For problems starting from naive seeds (Frontier-CS with initial scores near zero), early large improvements can push $G_t$ above 1.0, maintaining exploitation mode for extended periods.
The thresholds thus appear calibrated for problems exhibiting moderate-to-large relative improvements — the typical regime for Frontier-CS and ADRS benchmarks where the majority of the paper's results originate. The paper does not report threshold sensitivity analysis or alternative threshold values. The implications of this problem-dependent behavior are discussed in §9.6.3.
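The decay arithmetic in this analysis can be reproduced with a few lines of code. The sketch assumes the idealized setting used in the derivation: the signal starts at zero, one improvement of magnitude $\delta_0$ occurs, and no improvement follows. The helper names `signal_after` and `steps_until_below` are introduced here for illustration.

```python
import math


def signal_after(delta0: float, steps: int, rho: float = 0.9) -> float:
    """G after one improvement delta0 followed by `steps` zero-improvement steps."""
    return (1.0 - rho) * delta0 ** 2 * rho ** steps


def steps_until_below(delta0: float, tau: float, rho: float = 0.9) -> int:
    """Smallest number of zero-improvement steps after which G <= tau."""
    g0 = (1.0 - rho) * delta0 ** 2
    if g0 <= tau:
        return 0  # already at or below the threshold
    return math.ceil(math.log(tau / g0) / math.log(rho))


half_life = math.log(2) / math.log(1 / 0.9)
print(f"half-life: {half_life:.1f} steps")
for delta0 in (1.0, 0.5, 0.1, 9.0):
    row = [signal_after(delta0, n) for n in (0, 10, 20)]
    print(delta0, [f"{g:.2g}" for g in row])
print("steps for a 10x improvement to fall to 0.12:", steps_until_below(9.0, 0.12))
```

The printed rows correspond to the table entries, and the final line gives 40 steps for the 10× case, matching the estimate above.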
9.3.2 Level 1: Local Adaptation (Exploration Intensity)
Each island dynamically adjusts its exploration intensity $I_t^{(k)}$, a parameter controlling the balance between exploiting known good solutions and exploring creative alternatives, based on its accumulated improvement signal (arXiv:2602.20133, §3.2):
$$I_t^{(k)} = I_{\min} + \frac{I_{\max} - I_{\min}}{1 + \sqrt{G_t^{(k)} + \epsilon}} \tag{3}$$
where $I_{\min} = 0.1$ and $I_{\max} = 0.7$ are bounds documented in the paper (configurable via search.I_min and search.I_max). The behavior is intuitive:
- High $G_t^{(k)}$ (productive trajectory): The denominator grows, pushing $I_t^{(k)}$ toward $I_{\min}$. The island enters exploitation mode — sampling parents from top-ranked candidates and applying focused diff-style mutations.
- Low $G_t^{(k)}$ (stagnation): The denominator approaches 1, pushing $I_t^{(k)}$ toward $I_{\max}$. The island enters exploration mode — selecting parents more randomly and applying broader mutations including full rewrites.
This is analogous to temperature scheduling in simulated annealing, but driven by observed improvement dynamics rather than a predetermined cooling schedule. The square root provides a smooth, monotonically decreasing response that avoids abrupt transitions.
# From skydiscover/search/adaevolve.py: local adaptation (Eq. 3)
# Config keys: search.I_min (default 0.1), search.I_max (default 0.7)
def compute_exploration_intensity(
    G: float,
    I_min: float = 0.1,
    I_max: float = 0.7,
    epsilon: float = 1e-8,
) -> float:
    """Compute exploration intensity from accumulated improvement signal.

    High G (productive) → low intensity (exploit good solutions)
    Low G (stagnant) → high intensity (explore broadly)
    """
    return I_min + (I_max - I_min) / (1 + (G + epsilon) ** 0.5)
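To give a feel for the scale of this mapping, the snippet below evaluates the same formula at a few values of $G$ taken from the range analysis in §9.3.1. The function is restated so the snippet runs on its own; the chosen sample points are this survey's, not the paper's.

```python
def compute_exploration_intensity(
    G: float,
    I_min: float = 0.1,
    I_max: float = 0.7,
    epsilon: float = 1e-8,
) -> float:
    """Exploration intensity as a decreasing function of the signal G."""
    return I_min + (I_max - I_min) / (1 + (G + epsilon) ** 0.5)


# G values: fully stagnant, after a 10% gain, after a score doubling,
# and after a 10x jump from a low base.
samples = (0.0, 0.001, 0.1, 8.1)
intensities = [compute_exploration_intensity(G) for G in samples]
for G, I in zip(samples, intensities):
    print(f"G = {G:<6} ->  I = {I:.3f}")
```

Under these defaults the intensities come out at roughly 0.70, 0.68, 0.56, and 0.26. The lower bound $I_{\min}$ is approached only as $G$ grows without limit, so even a score-doubling event leaves an island well inside the upper half of the intensity range.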
9.3.3 Level 2: Global Adaptation (Cross-Island Resource Allocation)
While Level 1 adapts within each island, Level 2 adapts across islands by deciding which island should receive the next unit of compute (i.e., the next LLM mutation call). This is formulated as a multi-armed bandit problem with three innovations over standard UCB (arXiv:2602.20133, §3.3).
Innovation 1: Globally-normalized rewards. Measuring improvement relative to each island's local best creates "poor island bias": an island improving from score 10 to 12 (20% local improvement) would receive a larger reward than one improving from score 90 to 92 (2.2% local improvement), even though the latter is objectively closer to optimal. AdaEvolve normalizes rewards against the global best across all islands:
$$r_t^{(k)} = \frac{f' - f_k^*}{|f_{\text{global}}^*| + \epsilon} \tag{4}$$
where $f_{\text{global}}^*$ is the best score observed across all islands. The paper presents this without the absolute value and $\epsilon$ (arXiv:2602.20133, §3.3); the defensive form here matches the repository code pattern established in Eq. 1.
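The 10-to-12 versus 90-to-92 example can be worked through directly. The sketch below compares local and global normalization for those two islands, assuming the global best is 92 once both improvements are recorded; the function names are illustrative.

```python
EPS = 1e-8


def local_reward(new_score: float, island_best: float) -> float:
    """Reward normalized by the island's own best (the biased variant)."""
    return (new_score - island_best) / (abs(island_best) + EPS)


def global_reward(new_score: float, island_best: float, global_best: float) -> float:
    """Reward normalized by the best score across all islands."""
    return (new_score - island_best) / (abs(global_best) + EPS)


global_best = 92.0
weak_local = local_reward(12.0, 10.0)      # island improving 10 -> 12
strong_local = local_reward(92.0, 90.0)    # island improving 90 -> 92
weak_global = global_reward(12.0, 10.0, global_best)
strong_global = global_reward(92.0, 90.0, global_best)

print(f"local:  weak={weak_local:.4f}  strong={strong_local:.4f}")
print(f"global: weak={weak_global:.4f}  strong={strong_global:.4f}")
```

Local normalization rewards the weak island about nine times more than the strong one (0.20 versus 0.022), whereas global normalization assigns both the same reward of about 0.022, because equal absolute gains are divided by the same denominator.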
Innovation 2: Decayed cumulative tracking. Both the reward accumulation and visit counts use exponential decay:
$$R_t^{(k)} = \rho \, R_{t-1}^{(k)} + r_t^{(k)}, \qquad V_t^{(k)} = \rho \, V_{t-1}^{(k)} + 1 \tag{5}$$
where $\rho = 0.9$. The decay on $R_t^{(k)}$ prevents stale early breakthroughs from dominating future allocation. The decay on $V_t^{(k)}$ ensures the UCB exploration bonus reflects recent neglect rather than total neglect: an island ignored for 50 iterations but heavily explored before that will regain its exploration bonus. The effective "memory" of the decayed visit count is approximately $1/(1-\rho) = 10$ steps.
Innovation 3: UCB selection with decayed statistics. The island selection rule is:
$$k^* = \arg\max_k \left[ \frac{R_k}{V_k} + C \sqrt{\frac{\ln N}{V_k}} \right] \tag{6}$$
where $C = \sqrt{2}$ (configurable via search.C), $N$ is the total (undecayed) iteration count, and $V_k$ is the decayed visit count. Because $V_k$ is decayed, this is not standard UCB1 with stationary rewards; the decay mechanism adapts the bandit to the non-stationary improvement landscape. Standard UCB1 regret guarantees do not directly apply; the theoretical implications are discussed in §9.6.2.
# From skydiscover/search/adaevolve.py: GlobalAdaptationBandit
# Implements Eqs. 4–6 (arXiv:2602.20133, §3.3).
# Simplified from repository source for presentation.
import numpy as np

class GlobalAdaptationBandit:
    """UCB bandit for cross-island resource allocation.

    Key differences from standard UCB1:
    1. Globally-normalized rewards prevent poor-island bias (Eq. 4)
    2. Exponentially-decayed visit counts adapt to non-stationarity (Eq. 5)
    """

    def __init__(self, n_islands: int, rho: float = 0.9, C: float = 1.414):
        self.n_islands = n_islands
        self.rho = rho
        self.C = C                    # config: search.C (approx. sqrt(2))
        self.R = np.zeros(n_islands)  # decayed cumulative reward
        self.V = np.zeros(n_islands)  # decayed visit count
        self.total_visits = 0         # undecayed total (N in Eq. 6)
        self.global_best = float('-inf')

    def select_island(self) -> int:
        """Select next island via UCB with decayed statistics (Eq. 6)."""
        self.total_visits += 1
        # Unvisited islands keep an infinite score so they are tried first
        ucb_values = np.full(self.n_islands, float('inf'))
        for k in range(self.n_islands):
            if self.V[k] >= 1e-8:
                exploit = self.R[k] / self.V[k]
                explore = self.C * np.sqrt(np.log(self.total_visits) / self.V[k])
                ucb_values[k] = exploit + explore
        return int(np.argmax(ucb_values))

    def update(self, island_k: int, new_score: float, island_best: float):
        """Update with globally-normalized reward (Eqs. 4–5)."""
        if new_score > self.global_best:
            self.global_best = new_score
        # Global normalization prevents poor-island bias
        reward = (new_score - island_best) / (abs(self.global_best) + 1e-8)
        # Decayed accumulation for non-stationarity
        self.R[island_k] = self.rho * self.R[island_k] + reward
        self.V[island_k] = self.rho * self.V[island_k] + 1
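A small numeric example helps in reading Eqs. 5 and 6. The sketch below computes UCB scores for three islands with made-up statistics and shows how the decayed visit count saturates near $1/(1-\rho)$. The function names and the sample values for `R`, `V`, and `N` are invented for illustration.

```python
import numpy as np


def ucb_scores(R, V, total_visits: int, C: float = 2 ** 0.5) -> np.ndarray:
    """Vectorized form of Eq. 6; unvisited islands score +inf."""
    R = np.asarray(R, dtype=float)
    V = np.asarray(V, dtype=float)
    scores = np.full(R.shape, np.inf)
    visited = V >= 1e-8
    exploit = R[visited] / V[visited]
    explore = C * np.sqrt(np.log(total_visits) / V[visited])
    scores[visited] = exploit + explore
    return scores


def decayed_visits(n_visits: int, rho: float = 0.9) -> float:
    """Decayed count after n consecutive visits (Eq. 5 applied n times)."""
    v = 0.0
    for _ in range(n_visits):
        v = rho * v + 1.0
    return v


# Island 0: heavily visited and productive; island 1: rarely visited;
# island 2: never visited.
scores = ucb_scores(R=[0.5, 0.05, 0.0], V=[8.0, 2.0, 0.0], total_visits=20)
print(scores, "-> selected island", int(np.argmax(scores)))
print("decayed count after 5, 50, 200 visits:",
      [round(decayed_visits(n), 3) for n in (5, 50, 200)])
```

The never-visited island is selected first, and among visited islands the rarely visited one outranks the productive one because its exploration bonus dominates. The decayed count approaches 10 regardless of how many visits accumulate, which is why the bonus term never shrinks to zero as it would under standard UCB1.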
9.3.4 Level 3: Meta-Guidance (Tactical Generation)
The most distinctive component of AdaEvolve is its meta-guidance system (arXiv:2602.20133, §3.4). When the accumulated improvement signal drops below a threshold across all islands simultaneously, the system triggers a qualitatively different intervention: rather than adjusting numeric parameters, it invokes an LLM to analyze the current search state and generate high-level tactical directives that force a paradigm shift in the search strategy.
Trigger condition: $G_t^{(k)} \leq \tau_{\text{meta}}$ for all $k$, where $\tau_{\text{meta}} = 0.12$ is the paper's default (config key: search.meta_guidance_threshold). Per the analytical inference in §9.3.1, an island starting from $G = 0$ needs a single relative improvement above roughly 110%, or sustained improvements of roughly 35% per step, to lift its signal above this threshold; the condition therefore holds whenever no island has recently produced gains of that size. For near-optimal problems, this condition may be satisfied from early iterations, making meta-guidance a near-continuous rather than event-triggered mechanism, a point of significance when interpreting the ablation results (§9.4.4).
Action: The meta-guidance LLM receives the evaluator source code, the current best program, and a summary of recent failed mutation approaches. It is prompted to propose fundamentally different algorithmic approaches — paradigm shifts such as switching from greedy to dynamic programming, or using a different data structure that changes the problem's complexity class.
Effect: Generated tactics are injected into the mutation prompts of all islands for a configurable window (default: 20 iterations, config key: search.tactic_window), forcing the entire search to explore qualitatively new directions.
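The trigger and the tactic window described above can be sketched as a small piece of state. The class `TacticState` and the functions below are hypothetical; only the threshold default (0.12) and the window default (20 iterations) come from the text, and the repository may track this state differently.

```python
from dataclasses import dataclass, field


@dataclass
class TacticState:
    """Holds the currently injected tactics and when they expire (illustrative)."""
    tactics: list[str] = field(default_factory=list)
    expires_at: int = -1  # first iteration at which tactics are no longer injected

    def active(self, iteration: int) -> list[str]:
        """Tactics to inject into mutation prompts at this iteration."""
        return self.tactics if iteration < self.expires_at else []


def should_trigger_meta(signals: list[float], tau_meta: float = 0.12) -> bool:
    """True when every island's signal G is at or below the threshold."""
    return bool(signals) and all(g <= tau_meta for g in signals)


def install_tactics(
    state: TacticState, tactics: list[str], iteration: int, window: int = 20
) -> None:
    """Start a new tactic window at the given iteration."""
    state.tactics = list(tactics)
    state.expires_at = iteration + window


state = TacticState()
if should_trigger_meta([0.01, 0.08, 0.12]):
    install_tactics(state, ["Voronoi-based initialization"], iteration=100)
print(state.active(119), state.active(120))
```

In this sketch the tactics are visible for iterations 100 through 119 and disappear at iteration 120. Whether a new trigger inside an open window replaces or extends the current tactics is not specified in the source, so the sketch simply overwrites them.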
# Meta-guidance prompt template, illustrating the structure described
# in arXiv:2602.20133, §3.4. The actual prompt template in the repository
# may differ in formatting and instruction detail.
META_GUIDANCE_PROMPT = """
You are analyzing a stalled optimization process. The evolutionary search
has failed to make progress across all islands for the last several iterations.
## Evaluator Code
```python
{evaluator_code}
```
## Current Best Program (score: {best_score})
```python
{best_program}
```
## Recent Failed Approaches (last 10 mutations)
{failed_approaches}
## Task
Propose 2-3 fundamentally different algorithmic approaches. Do NOT suggest
incremental improvements. Instead, suggest:
1. A completely different algorithm class (e.g., switching from greedy to DP,
from brute force to divide-and-conquer)
2. Concrete techniques with specific library functions
(e.g., scipy.optimize.linear_sum_assignment)
3. A novel data structure that changes the problem's complexity class
Output each tactic as: TACTIC: [name] — [concrete description with code hints]
"""
The paper reports specific examples of tactics generated in practice (arXiv:2602.20133, §4, alongside the experimental results discussion):
- "Trust-region root finding for faster convergence on constrained optimization"
- "Voronoi-based initialization for better spatial coverage in packing problems"
- "Median filtering + linear sum assignment via scipy for robust matching"
- "Replace recursive DFS with iterative BFS + priority queue for better cache locality"
The ablation study (§9.4.4) reports that removing meta-guidance causes the largest performance degradation among the three adaptation levels, which supports its role as AdaEvolve's most impactful component.
9.3.5 Dynamic Island Spawning
A more extreme intervention than meta-guidance: when $G_t^{(k)} \leq \tau_{\text{spawn}}$ across all islands (severe stagnation), AdaEvolve dynamically spawns new islands (arXiv:2602.20133, §3.5). The default is $\tau_{\text{spawn}} = 0.02$ (config key: search.spawn_threshold), a stricter threshold than the meta-guidance trigger. New islands are seeded with diverse solutions selected via maximin distance from the population database, injected with current tactical directives, and initialized with high exploration intensity ($I = I_{\max}$):
# Dynamic island spawning logic from skydiscover/search/adaevolve.py.
# Simplified; actual implementation includes maximum island count limits
# and integration with the bandit's arm set.
def check_spawn_trigger(islands: list, tau_spawn: float = 0.02) -> bool:
    """Trigger spawning on severe global stagnation."""
    return all(island.signal.G <= tau_spawn for island in islands)

def spawn_island(population_db, meta_tactics: list) -> dict:
    """Create new island with diverse seeds and meta-tactic injection."""
    diverse_seeds = population_db.sample_diverse(n=5, method="maximin_distance")
    return {
        "seeds": diverse_seeds,
        "meta_tactics": meta_tactics,
        "exploration_intensity": 0.7,  # I_max: start in full exploration
        "signal": AccumulatedImprovementSignal(),  # fresh G=0
    }
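Maximin-distance seeding is commonly implemented as greedy farthest-point selection, sketched below. This is an assumption about what `sample_diverse(method="maximin_distance")` does, not a description of the repository code; the function `maximin_select` is hypothetical, and the sketch assumes each program has already been mapped to a numeric feature vector, which the source does not specify.

```python
import numpy as np


def maximin_select(features: np.ndarray, n: int, first: int = 0) -> list[int]:
    """Greedy farthest-point selection over feature vectors.

    Repeatedly picks the point whose distance to its nearest already-chosen
    point is largest. Returns the indices of the selected rows.
    """
    features = np.asarray(features, dtype=float)
    n = min(n, len(features))
    if n == 0:
        return []
    chosen = [first]
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = np.linalg.norm(features - features[first], axis=1)
    min_dist[first] = -1.0  # mask chosen points so they are never re-picked
    while len(chosen) < n:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -1.0
    return chosen


# Five toy "programs" in a 2-D feature space; points 0 and 1 are near-duplicates.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
seeds = maximin_select(points, n=3)
print(seeds)
```

Starting from point 0, the selection jumps to the opposite corner and then to one of the remaining corners, skipping the near-duplicate at index 1. Seeding a new island this way favors structurally distinct programs over several variants of the current best.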
9.3.6 EvoX: Self-Evolving Strategy
SkyDiscover also includes EvoX (implemented in skydiscover/search/evox.py), a secondary algorithm where the system co-adapts both solution generation and experience management using LLM-driven strategy evolution at runtime (arXiv:2602.20133, §3.6). While AdaEvolve adapts within a fixed algorithmic framework (adjusting parameters of a multi-island evolutionary search), EvoX takes a more radical approach: it evolves the search strategy itself. Strategy parameters — parent selection weights, mutation scope, context assembly rules — are evolved using LLM-driven strategy mutation, creating a meta-level optimization loop.
The paper positions EvoX as complementary to AdaEvolve rather than competitive: AdaEvolve provides more predictable, controllable adaptation through its mathematically grounded three-level hierarchy, while EvoX offers potentially greater flexibility at the cost of less interpretability.
9.4 Key Results
All results below are as reported in the AdaEvolve paper (arXiv:2602.20133). Cross-system comparisons use the SkyDiscover benchmarking platform, which provides identical evaluation infrastructure for all algorithms. Several important caveats apply:
- Backend dependence: The paper primarily reports results using GPT-5 and Gemini-3-Pro backends (arXiv:2602.20133, §4). Performance with other models is not systematically characterized.
- Implementation fidelity: Native backend integrations (OpenEvolve, GEPA, ShinkaEvolve) run through SkyDiscover's infrastructure, which may not perfectly reproduce each algorithm's original behavior (see §9.6.4).
- Self-reported: All numbers are from the SkyDiscover authors. No independent reproduction has been published as of this writing.
- Partial suite coverage: The benchmark suite contains 200+ tasks (14 math + 9 systems + 172 Frontier-CS + 10 ALE + 2+ creative = 207+), but the paper's main comparisons cover approximately 185 tasks (6 math + 7 ADRS + 172 Frontier-CS). The remaining tasks are available in the framework but not included in the comparative evaluation tables.
9.4.1 Mathematical Optimization
Six of the 14 mathematical optimization tasks in the benchmark suite were evaluated in the paper's main comparison, with a budget of 100 LLM iterations per task. Three representative results (arXiv:2602.20133, Table 1):
| Problem | AdaEvolve | OpenEvolve | GEPA | Human SOTA | Source |
|---|---|---|---|---|---|
| Circle Packing (Square) | 2.636 | 2.590 | 2.610 | 2.634 | Table 1 |
| Heilbronn Triangles | 0.036 | 0.028 | 0.031 | — | Table 1 |
| Signal Processing | 0.718 | 0.619 | 0.682 | — | Table 1 |
The circle packing result of 2.636 slightly exceeds the human SOTA reference of 2.634 cited in the paper. However, the paper does not specify the precise problem instance (the exact number of circles, the container dimensions, and the optimization objective, such as minimum gap or maximum radius), nor does it cite the source of the 2.634 baseline. Known optimal circle-packing configurations for specific circle counts in a unit square are catalogued in the mathematical optimization literature (e.g., Specht's Packomania database), and the value 2.634 is consistent with known results for certain small-count instances. Without the exact problem specification and reference, the "exceeds human SOTA" claim cannot be independently verified from the data provided in the paper. The remaining 8 mathematical tasks in the suite (of 14 total) are not reported in the main comparison table.
9.4.2 Real-World Systems Optimization (ADRS)
The paper reports results on 7 of the 9 systems optimization tasks under the Algorithmic Discovery for Real Systems (ADRS) label (arXiv:2602.20133, Table 2). These address practical infrastructure problems rather than academic benchmarks:
| Task | AdaEvolve (GPT-5) | AdaEvolve (Gemini-3-Pro) | Best Baseline | Source |
|---|---|---|---|---|
| Cloud Transfer Cost | 41% lower than baselines | 41% lower than baselines | OpenEvolve | Table 2 |
| GPU Load Balancing | 14% better than baselines | 14% better than baselines | GEPA | Table 2 |
| MoE Expert Placement | Best on all 7 reported ADRS tasks | Best on all 7 reported ADRS tasks | ShinkaEvolve | Table 2 |
The paper reports AdaEvolve wins on all 7 reported ADRS benchmarks under both GPT-5 and Gemini-3-Pro backends (arXiv:2602.20133, Table 2). The largest gains appear on sparse/bursty improvement tasks (e.g., transaction optimization: 4348 vs. baseline 4329), where adaptive resource allocation excels — consistent with the algorithm's design goal of detecting and concentrating compute on productive search fronts. The remaining 2 systems tasks in the suite (of 9 total) are not included in this comparison; the paper does not explain their omission.
9.4.3 Frontier-CS (172 Algorithm Design Problems)
The Frontier-CS benchmark comprises 172 algorithm design problems from competitive programming with a budget of only 50 LLM calls per problem (arXiv:2602.20133, Table 3). All 172 tasks in this benchmark family were evaluated:
| Metric | AdaEvolve | OpenEvolve | GEPA | Single-call GPT-5 | Source |
|---|---|---|---|---|---|
| Mean Score | 61.33 | 50.75 | 54.20 | 20.64 | Table 3 |
| Median Score | 75.15 | 56.37 | 60.12 | 15.30 | Table 3 |
| Improvement over OpenEvolve | +21% (mean) | — | — | — | Table 3 |
The 50-call-per-problem budget is stringent. A plausible explanation for AdaEvolve's lead over fixed-schedule baselines is its ability to concentrate compute on productive mutations within each problem rather than spreading calls uniformly. The roughly 3× gain in mean score over single-call GPT-5 (61.33 vs. 20.64) indicates that evolutionary search adds significant value even with modern frontier models. The paper does not report per-problem variance, number of independent runs, or confidence intervals for these aggregates.
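The relative gains can be recomputed directly from the Table 3 values quoted above. The snippet below is only an arithmetic check; the helper name `relative_gain` and the dictionary layout are illustrative.

```python
# Scores as quoted from Table 3 in the table above.
MEAN = {"AdaEvolve": 61.33, "OpenEvolve": 50.75, "GEPA": 54.20, "Single-call GPT-5": 20.64}
MEDIAN = {"AdaEvolve": 75.15, "OpenEvolve": 56.37, "GEPA": 60.12, "Single-call GPT-5": 15.30}


def relative_gain(new: float, base: float) -> float:
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base


def gains_over_baselines(scores: dict, system: str = "AdaEvolve") -> dict:
    """Relative gain of `system` over every other entry in `scores`."""
    return {
        name: relative_gain(scores[system], value)
        for name, value in scores.items()
        if name != system
    }


if __name__ == "__main__":
    for label, table in (("mean", MEAN), ("median", MEDIAN)):
        for baseline, gain in gains_over_baselines(table).items():
            print(f"{label:>6} vs {baseline}: {gain:+.1%}")
```

On the mean, the gain over OpenEvolve comes out at about 20.8% (the table's rounded +21%), and the ratio to single-call GPT-5 is about 2.97. On the median the same ratio is about 4.9, so the "3×" figure describes the mean only.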
9.4.4 Ablation Study
The ablation study (arXiv:2602.20133, Table 4) isolates the contribution of each adaptation level by removing them individually. Results are reported as mean ± standard deviation across multiple runs (the paper does not specify the exact number of runs or seeds):
| Configuration | Circle Packing | Signal Processing | Source |
|---|---|---|---|
| Full AdaEvolve | 2.6294 ± 0.003 | 0.7178 ± 0.019 | Table 4 |
| w/o Local Adaptation | 2.5906 ± 0.048 | 0.6807 ± 0.021 | Table 4 |
| w/o Bandit Selection | 2.6180 ± 0.005 | 0.6190 ± 0.054 | Table 4 |
| w/o Meta-Guidance | 2.5213 ± 0.028 | 0.5476 ± 0.011 | Table 4 |
Key observations from the paper's reported ablation:
- Meta-guidance removal causes the largest degradation in both tasks: −0.108 on circle packing and −0.170 on signal processing. On these two tasks, this supports tactical generation as AdaEvolve's most impactful component. Note that for circle packing (a near-optimal problem with small relative improvements), the analytical inference in §9.3.1 suggests the meta-guidance threshold may be satisfied from early iterations, meaning meta-guidance likely functions as a near-continuous mechanism rather than an occasional intervention for this task specifically. The paper does not report trigger frequencies or signal trajectories that would confirm this inference.
- Bandit removal has asymmetric impact: modest on circle packing (−0.011) but severe on signal processing (−0.099), suggesting globally-normalized resource allocation is most valuable on problems with heterogeneous island performance.
- Local adaptation removal increases variance: standard deviation jumps from 0.003 to 0.048 on circle packing, indicating less consistent results without adaptive exploration intensity.
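- All three levels contribute: removing any single level lowers the mean score on both tasks, and the second-largest effect differs by task (local adaptation on circle packing, bandit selection on signal processing). This supports the three-level architecture.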
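The deltas quoted in these observations follow from the Table 4 means. A minimal recomputation is sketched below; the dictionary layout and function name are chosen purely for illustration.

```python
# (mean, std) pairs as quoted from Table 4 in the table above.
ABLATION = {
    "Full AdaEvolve":       {"circle_packing": (2.6294, 0.003), "signal_processing": (0.7178, 0.019)},
    "w/o Local Adaptation": {"circle_packing": (2.5906, 0.048), "signal_processing": (0.6807, 0.021)},
    "w/o Bandit Selection": {"circle_packing": (2.6180, 0.005), "signal_processing": (0.6190, 0.054)},
    "w/o Meta-Guidance":    {"circle_packing": (2.5213, 0.028), "signal_processing": (0.5476, 0.011)},
}


def mean_drops(table: dict, task: str, reference: str = "Full AdaEvolve") -> dict:
    """Drop in mean score for each ablated configuration relative to the full system."""
    full_mean = table[reference][task][0]
    return {
        config: round(full_mean - scores[task][0], 4)
        for config, scores in table.items()
        if config != reference
    }


if __name__ == "__main__":
    for task in ("circle_packing", "signal_processing"):
        drops = mean_drops(ABLATION, task)
        ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
        print(task, ranked)
```

Ranking the drops per task shows meta-guidance first in both cases, with local adaptation second on circle packing and bandit selection second on signal processing.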
9.4.5 Comparison with AlphaEvolve
The paper reports that AdaEvolve matches AlphaEvolve on 6/6 systems tasks and 6/8 mathematical tasks (arXiv:2602.20133, §5). This comparison carries significant caveats:
- AlphaEvolve's results are self-reported by Google DeepMind using proprietary Gemini models with undisclosed compute budgets.
- The comparison is cross-paper, not head-to-head within SkyDiscover (AlphaEvolve is not available as a native backend).
- Differences in LLM backends, evaluation protocols, and problem setup may affect the comparison.
The SkyDiscover team acknowledges this by positioning their comparison as "matching" rather than "exceeding" (arXiv:2602.20133, §5). The "~34% median improvement" headline figure refers to improvement over the open-source baselines (OpenEvolve, GEPA, ShinkaEvolve) running within SkyDiscover (arXiv:2602.20133, Table 3 aggregate and Figure 5) — not over AlphaEvolve.
9.5 Implementation Details
9.5.1 Configuration and Minimal Setup
AdaEvolve is designed for minimal configuration (arXiv:2602.20133, §2.4). At minimum, it requires only a model name and an iteration budget. The three-level adaptation handles all scheduling decisions automatically. The full set of configurable parameters, with their documented defaults, is:
# AdaEvolve YAML configuration: exact config keys from repository docs.
# All search.* keys have documented defaults; only llm.models and
# llm.max_iterations are required.
llm:
  models:                            # required: at least one model
    - model: "openai/gpt-5"          # provider/model format
      weight: 0.4
    - model: "gemini/gemini-3-pro"
      weight: 0.3
  system_prompt: "You are an expert algorithm designer..."
  max_iterations: 500                # required: LLM call budget
search:
  algorithm: "adaevolve"             # default; also: evox, topk, beam, etc.
  rho: 0.9                           # EMA decay rate (Eqs. 2, 5)
  I_min: 0.1                         # min exploration intensity (Eq. 3)
  I_max: 0.7                         # max exploration intensity (Eq. 3)
  C: 1.414                           # UCB exploration constant ≈ √2 (Eq. 6)
  meta_guidance_threshold: 0.12      # τ_meta: stagnation trigger
  spawn_threshold: 0.02              # τ_spawn: severe stagnation trigger
  tactic_window: 20                  # iterations to persist tactical directives
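Because the spawn threshold is meant to be stricter than the meta-guidance trigger, user overrides of the `search.*` keys can usefully be sanity-checked before a run. The sketch below merges overrides onto the defaults listed above and applies a few consistency checks. The function `resolve_search_config` and the specific checks are assumptions made for illustration; they are not SkyDiscover's actual config loader.

```python
# Defaults mirror the search.* keys listed above.
SEARCH_DEFAULTS = {
    "algorithm": "adaevolve",
    "rho": 0.9,
    "I_min": 0.1,
    "I_max": 0.7,
    "C": 1.414,
    "meta_guidance_threshold": 0.12,
    "spawn_threshold": 0.02,
    "tactic_window": 20,
}


def resolve_search_config(overrides=None) -> dict:
    """Merge overrides onto defaults and validate (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(SEARCH_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown search keys: {sorted(unknown)}")
    cfg = {**SEARCH_DEFAULTS, **overrides}
    # Assumed constraint: an EMA decay rate must lie strictly between 0 and 1.
    if not 0.0 < cfg["rho"] < 1.0:
        raise ValueError("rho must be in (0, 1)")
    # Assumed constraint: the intensity range must be ordered and non-negative.
    if not 0.0 <= cfg["I_min"] <= cfg["I_max"]:
        raise ValueError("require 0 <= I_min <= I_max")
    # From the text: spawning is the stricter (lower) stagnation threshold.
    if not 0.0 <= cfg["spawn_threshold"] < cfg["meta_guidance_threshold"]:
        raise ValueError("require spawn_threshold < meta_guidance_threshold")
    if cfg["tactic_window"] <= 0:
        raise ValueError("tactic_window must be positive")
    return cfg


if __name__ == "__main__":
    print(resolve_search_config({"rho": 0.95}))
```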
9.5.2 Benchmark Suite and Task Coverage
The full benchmark suite, located in the benchmarks/ directory with task format documented in benchmarks/README.md, contains 200+ tasks across five domains. The paper's experimental evaluation covers a subset:
| Domain | In Suite | Evaluated in Paper | Source | Notes |
|---|---|---|---|---|
| Mathematics | 14 | 6 | Table 1 | Circle packing, Heilbronn triangles, signal processing, etc. |
| Systems (ADRS) | 9 | 7 | Table 2 | Cloud scheduling, load balancing, MoE, GPU kernels |
| Frontier-CS | 172 | 172 | Table 3 | Competitive programming problems, 50 calls each |
| ALE / AtCoder | 10 | — | — | Described in benchmark docs; not in main results tables |
| Creative / NLP | 2+ | — | — | Image generation, HotPotQA; not in main results tables |
The "200+ benchmarks" count (14 + 9 + 172 + 10 + 2+ = 207+) refers to tasks available in the framework, not tasks with full comparative evaluation. The paper's headline comparisons cover approximately 185 tasks (6 + 7 + 172). The ALE and creative/NLP tasks are described in the benchmark documentation but do not appear in the main results tables; the paper does not explain their exclusion from the comparative evaluation.
9.5.3 Cost Efficiency
| Setting | LLM Calls | Estimated Cost | Benchmark Result | Source |
|---|---|---|---|---|
| Frontier-CS (172 problems) | 50 per problem | $0.10–0.50 per problem | Mean score 61.33 (Table 3) | Paper §4.3 |
| Math optimization (100 iter) | 100 | $5–30 | See Table 1 in §9.4.1 | Paper §4.1 |
| Systems optimization (ADRS) | 50–200 | $2–20 | See Table 2 in §9.4.2 | Paper §4.2 |
These cost estimates are from the paper and reflect LLM API pricing at time of publication (February 2026). The paper claims ~20–30% fewer wasted LLM calls compared to fixed island allocation (OpenEvolve), attributed to the globally-normalized bandit preventing compute waste on low-performing islands. These efficiency claims are plausible given the adaptive allocation mechanism but are not validated with detailed per-call accounting in the paper.
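For scale, the per-problem Frontier-CS figures in the table can be extrapolated to the whole 172-problem family. The sketch below assumes the quoted per-problem range applies uniformly to every problem, which the paper's estimate does not guarantee; the function name is illustrative, and costs are held in integer cents to avoid rounding noise.

```python
PROBLEMS = 172                    # Frontier-CS problems (Table 3)
CALLS_PER_PROBLEM = 50            # LLM-call budget per problem
COST_PER_PROBLEM_CENTS = (10, 50) # quoted range: $0.10 to $0.50 per problem


def suite_totals(problems: int, calls: int, cost_range_cents: tuple) -> dict:
    """Extrapolate per-problem figures to a full suite (uniform-cost assumption)."""
    low, high = cost_range_cents
    return {
        "total_calls": problems * calls,
        "total_cost_cents": (problems * low, problems * high),
        "cost_per_call_cents": (low / calls, high / calls),
    }


if __name__ == "__main__":
    totals = suite_totals(PROBLEMS, CALLS_PER_PROBLEM, COST_PER_PROBLEM_CENTS)
    low, high = totals["total_cost_cents"]
    print(f"calls: {totals['total_calls']}, cost: ${low / 100:.2f} to ${high / 100:.2f}")
```

Under that assumption one pass over Frontier-CS is 8,600 LLM calls costing between $17.20 and $86.00, or roughly 0.2 to 1 cent per call.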
9.5.4 Checkpoint System and Reproducibility
The checkpoint system serializes the complete search state for interrupting and resuming long-running experiments (arXiv:2602.20133, §2.4). AdaEvolve extends the framework-level checkpoint with its own per-island signals and bandit state:
# Checkpoint serialization for AdaEvolve runs.
# Reconstructed from the documented checkpoint components
# (arXiv:2602.20133, §2.4) and the serialized state structure
# described in the repository. Actual serialization format may use
# a different encoding or include additional metadata.
import json
from pathlib import Path


def save_checkpoint(state: "AdaEvolveState", checkpoint_dir: Path) -> Path:
    """Serialize full AdaEvolve state for later resumption.

    Persists all components required to resume a run exactly where
    it left off: island archives, bandit statistics, meta-guidance
    history, and iteration counters.
    """
    checkpoint = {
        "version": 1,
        "iteration": state.iteration,
        "global_best_score": state.global_best_score,
        "global_best_program": state.global_best_program,
        "islands": [
            {
                "island_id": island.island_id,
                "programs": [p.to_dict() for p in island.programs],
                "best_score": island.best_score,
                "accumulated_signal_G": island.signal.G,
                "accumulated_signal_best": island.signal.best_score,
            }
            for island in state.islands
        ],
        "bandit": {
            "R": state.bandit.R.tolist(),
            "V": state.bandit.V.tolist(),
            "total_visits": state.bandit.total_visits,
            "global_best": state.bandit.global_best,
        },
        "meta_guidance": {
            "active_tactics": state.active_tactics,
            "tactic_remaining_iters": state.tactic_remaining,
            "tactic_history": state.tactic_history,
        },
        "cost": {
            "total_llm_calls": state.total_llm_calls,
            "total_cost_usd": state.total_cost_usd,
        },
    }
    # Create the directory on first use so the write below cannot fail on it.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"checkpoint_{state.iteration:06d}.json"
    path.write_text(json.dumps(checkpoint, indent=2))
    return path


def load_checkpoint(path: Path) -> "AdaEvolveState":
    """Restore AdaEvolve state from a checkpoint file.

    Reconstructs all island signals, bandit state, and meta-guidance
    context so the run can continue from the exact iteration where
    it was interrupted.
    """
    data = json.loads(path.read_text())
    # AdaEvolveState is assumed to be importable in the calling module.
    state = AdaEvolveState.from_checkpoint(data)
    return state
| Criterion | Status | Details |
|---|---|---|
| Source Code | Available | Apache 2.0, github.com/skydiscover-ai/skydiscover |
| Paper | Available | arXiv:2602.20133, CC BY 4.0 |
| Benchmarks | 200+ in suite | All benchmarks ship in benchmarks/; evaluators provided |
| Configuration | YAML-based | Experiment configs stated to be in repository |
| Checkpoint Resumption | Supported | Full state serialization including AdaEvolve-specific components |
| Determinism | Partial | LLM outputs are stochastic; seeds tracked where possible |
| Run Metadata | Not fully reported | Paper does not specify number of runs, seeds, or confidence intervals for most results |
| API Keys | Required | At least one LLM provider (or local Ollama) |
9.5.5 Technology Stack
SkyDiscover is implemented in Python using the standard scientific Python stack (NumPy, SciPy) plus LLM provider SDKs. Configuration is YAML-based with the config keys documented above. Evolved programs are primarily Python, though the evaluator API (evaluate(program_path) → EvaluationResult) is sufficiently general to support any language via custom evaluators.
Extension points documented in the repository:
- Custom search algorithms: Implement the SearchAlgorithm interface in skydiscover/search/ and register in the algorithm registry (see §9.2.4).
- Custom benchmark tasks: Follow the task format specified in benchmarks/README.md: provide an initial program, evaluator function, and task documentation.
- Custom LLM providers: Use the provider/model format via LiteLLM compatibility for providers not natively supported.
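To make the evaluator contract concrete, the sketch below shows one way a custom evaluator with the `evaluate(program_path)` shape could look. Everything beyond that signature is an assumption: the `EvaluationResult` dataclass here is a hypothetical stand-in for the framework's own type, and the convention that a candidate program exposes a zero-argument `solve()` returning a number is invented for this example.

```python
import runpy
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvaluationResult:
    """Hypothetical stand-in for the framework's result type."""
    score: float
    artifacts: dict = field(default_factory=dict)  # free-form feedback for the LLM


def evaluate(program_path: str) -> EvaluationResult:
    """Run a candidate program and score it.

    Assumed task convention: the candidate defines solve() -> number.
    Any failure is reported as the worst possible score plus an artifact.
    """
    try:
        namespace = runpy.run_path(program_path)
        value = float(namespace["solve"]())
    except Exception as exc:  # broken candidates must not crash the search loop
        return EvaluationResult(score=float("-inf"), artifacts={"error": repr(exc)})
    return EvaluationResult(score=value, artifacts={"raw_value": value})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        candidate = Path(tmp) / "candidate.py"
        candidate.write_text("def solve():\n    return 2.5\n")
        print(evaluate(str(candidate)))
```

Returning a worst-case score with an error artifact, rather than raising, keeps a single malformed candidate from aborting a run. An evaluator for a non-Python target would replace the `runpy` call with whatever build-and-run step that language needs.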
9.6 Limitations & Discussion
9.6.1 Missing Capabilities
SkyDiscover/AdaEvolve lacks several innovations present in other surveyed systems:
| Missing Capability | Available In | Impact |
|---|---|---|
| Prompt co-evolution | ShinkaEvolve | Cannot adapt mutation prompts alongside solutions |
| Formal learning logs | Darwinian Evolver | No structured cross-population mutation history; knowledge transfer is implicit via artifacts and migration |
| Structured ASI diagnostics | GEPA | Artifact injection is informal — free-form dicts rather than typed diagnostic schemas |
| MAP-Elites quality-diversity | AlphaEvolve, OpenEvolve | No behavioral descriptor archives for maintaining diverse solution niches |
| Two-tier novelty filtering | ShinkaEvolve | No deduplication or semantic novelty checks on candidates |
| Tree search integration | AB-MCTS | No Monte Carlo tree search over program space |
| Self-modification | Darwin Gödel Machine | Cannot modify its own search algorithm at runtime (EvoX partially addresses this) |
9.6.2 Theoretical Limitations of the Bandit Formulation
Standard UCB1 assumes bounded, stationary reward distributions and provides $O(\sqrt{KN \ln N})$ worst-case regret for $K$ arms over $N$ rounds. In AdaEvolve, rewards are inherently non-stationary: an island's improvement rate changes as it explores different regions of the search space. The exponential decay on $R_t^{(k)}$ and $V_t^{(k)}$ is a practical mitigation, but introduces several theoretical complications:
- The decayed visit count $V_k$ can shrink without intervening visits, meaning the exploration bonus $C\sqrt{\ln N / V_k}$ grows over time for neglected islands faster than in standard UCB1.
- The ratio $R_k / V_k$ can become negative when $f' < f_k^*$ (non-improving mutations yield negative rewards under Eq. 4). Standard UCB1 analysis assumes rewards bounded in $[0, 1]$, so negative rewards fall outside its guarantees.
- The use of undecayed $N$ in the exploration term alongside decayed $V_k$ creates a growing asymmetry not matched by the exploitation term's effective horizon.
The paper does not provide theoretical analysis of the modified bandit's properties. Sliding-window UCB and discounted UCB variants have been studied in the non-stationary bandits literature (Garivier & Moulines, 2011), but AdaEvolve's specific combination of per-arm decay with global normalization does not directly map to these analyzed settings. AdaEvolve's bandit should be understood as a practical heuristic rather than a theoretically grounded algorithm with provable bounds.
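The first and third complications can be made tangible with a small numeric illustration of the exploration term $C\sqrt{\ln N / V_k}$. The sketch assumes that an unvisited island's $V_k$ is multiplied by $\rho$ once per round while $N$ keeps counting undecayed rounds; that update rule, the function names, and the starting values are assumptions for illustration rather than a transcription of the paper's equations.

```python
import math


def ucb_bonus(total_rounds: int, visits: float, c: float = 1.414) -> float:
    """Exploration term C * sqrt(ln N / V_k)."""
    return c * math.sqrt(math.log(total_rounds) / visits)


def idle_arm_bonus(
    idle_rounds: int,
    rho: float = 0.9,
    visits_at_last_pull: float = 5.0,
    rounds_at_last_pull: int = 50,
    c: float = 1.414,
    decay: bool = True,
) -> float:
    """Bonus of an arm that has gone `idle_rounds` rounds without a visit.

    decay=True: V_k shrinks by rho each idle round (assumed decayed variant).
    decay=False: V_k stays fixed, as in standard UCB1.
    """
    factor = rho ** idle_rounds if decay else 1.0
    return ucb_bonus(rounds_at_last_pull + idle_rounds, visits_at_last_pull * factor, c)


if __name__ == "__main__":
    for idle in (0, 5, 10, 20):
        decayed = idle_arm_bonus(idle)
        standard = idle_arm_bonus(idle, decay=False)
        print(f"idle={idle:2d}  decayed={decayed:.3f}  standard={standard:.3f}")
```

Under this assumed rule the two bonuses differ by exactly $\rho^{-t/2}$ after $t$ idle rounds, so with $\rho = 0.9$ a neglected island's bonus is nearly three times the UCB1 value after 20 rounds. That forces revisits far more aggressively than the logarithmic growth UCB1's analysis relies on.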
9.6.3 Threshold Sensitivity and Signal Heterogeneity
The meta-guidance trigger ($\tau_{\text{meta}} = 0.12$) and spawning threshold ($\tau_{\text{spawn}} = 0.02$) are fixed constants applied uniformly across all problem types. As derived in the analytical inference in §9.3.1, the accumulated improvement signal $G_t^{(k)}$ spans many orders of magnitude depending on the problem's improvement profile. The practical consequence is that the fixed thresholds produce qualitatively different behavior across problem regimes:
- Problems with large early gains (e.g., Frontier-CS from naive seeds): thresholds function as designed, detecting genuine stagnation after productive phases end.
- Near-optimal problems (e.g., circle packing near known optima): both thresholds are expected to be chronically satisfied, collapsing the adaptation hierarchy — exploration intensity stays near $I_{\max}$, meta-guidance fires continuously, and spawning conditions persist. The paper does not report whether safeguards (e.g., cooldown periods, minimum intervals between spawns) mitigate this in practice.
- Problems with scores near zero: small absolute improvements create very large $\delta$ values (due to normalization by $|f_k^*|$), potentially keeping $G_t$ artificially high and suppressing useful interventions.
The paper does not report sensitivity analysis on threshold values, and the ablation study (§9.4.4) only compares full meta-guidance removal against the default threshold — not alternative threshold values. A more robust design might use per-problem adaptive thresholds calibrated from the observed $G_t$ distribution during a warm-up phase, or replace fixed thresholds with percentile-based triggers. These are natural extensions for future work.
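The regime dependence can be reproduced with a toy version of the signal. The sketch below assumes the form $G \leftarrow \rho G + (1-\rho)\,\delta^2$ with $\delta = \max(f' - f^*, 0) / |f^*|$, which matches the description of a scale-invariant EMA of squared improvements but is not copied from the paper's equations. The score trajectories are invented for illustration.

```python
TAU_META = 0.12   # meta-guidance trigger (default quoted above)
TAU_SPAWN = 0.02  # spawning trigger (default quoted above)


def update_signal(g: float, best: float, candidate: float, rho: float = 0.9):
    """One step of the assumed EMA-of-squared-relative-improvement signal."""
    delta = max(candidate - best, 0.0) / max(abs(best), 1e-12)
    return rho * g + (1.0 - rho) * delta ** 2, max(best, candidate)


def final_signal(scores: list, rho: float = 0.9) -> float:
    """Run the signal over a score trajectory, starting from G = 0."""
    g, best = 0.0, scores[0]
    for score in scores[1:]:
        g, best = update_signal(g, best, score, rho)
    return g


# Invented trajectories, one per regime discussed above.
near_optimal = [2.600 + 0.001 * i for i in range(21)]  # steady but tiny gains
early_gains_then_plateau = [1.0, 2.0, 3.0] + [3.0] * 30  # big jumps, then stall
near_zero = [0.001, 0.003]                              # tiny absolute gain, huge ratio

if __name__ == "__main__":
    for name, trajectory in (
        ("near_optimal", near_optimal),
        ("early_gains_then_plateau", early_gains_then_plateau),
        ("near_zero", near_zero),
    ):
        g = final_signal(trajectory)
        print(f"{name}: G={g:.3g}  meta={g <= TAU_META}  spawn={g <= TAU_SPAWN}")
```

Under these assumptions the near-optimal trajectory never lifts $G$ above either threshold even though it improves at every step, while a single improvement from 0.001 to 0.003 pushes $G$ well above $\tau_{\text{meta}}$. Only the plateau case crosses the thresholds for the reason they were designed to detect.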
9.6.4 Benchmarking Fairness Caveats
While SkyDiscover provides a unified platform for cross-algorithm comparison, several fairness caveats apply. Native backend integrations run through SkyDiscover's infrastructure layer, which may not perfectly reproduce each algorithm's original behavior. Differences in prompt construction, context assembly, and evaluation harness between SkyDiscover's integration and the original system could affect results. The paper does not report fidelity validation (e.g., running the original OpenEvolve codebase and comparing results against SkyDiscover's OpenEvolve backend on identical tasks). This is a common limitation of unified benchmarking platforms — the convenience of a shared evaluation interface comes at the cost of potential implementation drift.
9.6.5 Integration Opportunities
AdaEvolve's three-level hierarchical adaptation is notably orthogonal to most innovations in other systems, creating natural integration opportunities:
- The accumulated improvement signal could drive ShinkaEvolve's prompt co-evolution — mutating prompts more aggressively when $G_t^{(k)}$ is low.
- GEPA's structured ASI diagnostics could feed richer information into AdaEvolve's meta-guidance generator, enabling more targeted tactical recommendations.
- Darwinian Evolver's learning logs could provide historical context for tactical generation, allowing the meta-guidance LLM to reference a structured record of what has and hasn't worked.
- OpenEvolve's MAP-Elites behavioral descriptors could measure diversity within AdaEvolve's islands, providing an additional signal alongside the accumulated improvement metric.
9.7 Comparative Positioning
The following table, based on the feature matrix in the paper (arXiv:2602.20133, Table 5), positions SkyDiscover/AdaEvolve relative to contemporaneous systems:
| Feature | SkyDiscover / AdaEvolve | OpenEvolve | ShinkaEvolve | GEPA | LLM4AD |
|---|---|---|---|---|---|
| Adaptation | Three-level hierarchical | Static | Bandit LLM selection | Pareto-based | Method-specific |
| Island management | UCB + dynamic spawning | Fixed + ring migration | Dynamic spawning on stagnation | Single population | Method-specific |
| Resource allocation | Globally-normalized UCB | Equal across islands | Equal + bandit for LLMs | N/A | N/A |
| Stagnation response | Meta-guidance + spawning | None | Dynamic island spawning | Reflection-driven mutation | None |
| Benchmarks | 200+ (in suite; ~185 evaluated) | ~10 | ~20 | ~30 | ~50+ |
| Multi-algorithm | 6+ strategies | 1 | 1 | 1 | 7 methods |
| Live dashboard | Yes | No | No | No | Yes (GUI) |
| Checkpoint resume | Yes | Yes | Yes | Partial | Yes |
| Human feedback | Yes (dashboard) | No | No | No | No |
| Prompt evolution | No | No | Yes (v1.1) | No | No |
| Learning logs | No (implicit) | No | No | No | No |
| Diagnostic ASI | Partial (artifacts) | No | No | Yes | No |
SkyDiscover's primary competitive advantages are adaptive resource allocation, meta-guidance for stagnation breaking, the breadth of its unified benchmarking platform, and its multi-algorithm framework. Its primary gaps are in learning and knowledge management — the implicit artifact-based feedback loop is less structured than GEPA's ASI or ShinkaEvolve's prompt co-evolution, and it lacks formal learning logs entirely.
9.7.1 Relationship to Prior Berkeley Work
The overlap between the SkyDiscover and GEPA teams is reflected in the framework's design: GEPA is included as a first-class backend algorithm, and the paper's experimental evaluation directly compares AdaEvolve against GEPA. This positioning — building a platform that subsumes your own previous work as one component — reflects the Berkeley Sky Lab's trajectory from specific system (GEPA) to general platform (SkyDiscover), paralleling the lab's earlier arc from Spark-specific tools to general cluster computing infrastructure.
| System | Year | Adaptation | Search Strategy | Benchmark Coverage |
|---|---|---|---|---|
| FunSearch | 2023 | Static | Single population | Mathematics only |
| OpenEvolve | 2025 | Static islands | MAP-Elites + islands | ~10 tasks |
| ShinkaEvolve | 2025 | Bandit LLM selection | Islands + dynamic spawning | ~20 tasks |
| GEPA | 2026 | Pareto-based | Reflection-driven | ~30 tasks |
| SkyDiscover | 2026 | Three-level hierarchical | Multi-algorithm | 200+ (suite) |
9.8 Summary
Chapter Summary
Key takeaway: SkyDiscover/AdaEvolve demonstrates that adaptive compute allocation — deciding in real time where to invest the next LLM call — can outperform static search schedules, even when the underlying evolutionary operators remain unchanged. The accumulated improvement signal provides a unified mechanism for coordinating this adaptation across three nested scales, though its fixed thresholds exhibit problem-dependent behavior that limits universality (§9.6.3).
Main contribution: A three-level hierarchical adaptation framework where a single volatility metric (the accumulated improvement signal, Eqs. 1–2) simultaneously drives within-island exploration intensity (Eq. 3), cross-island resource allocation via globally-normalized UCB (Eqs. 4–6), and LLM-generated tactical paradigm shifts on stagnation (§9.3.4). The paper reports ~34% median improvement over open-source baselines across ~185 evaluated tasks using GPT-5/Gemini-3-Pro backends (arXiv:2602.20133, Table 3 aggregate, Figure 5). These results are self-reported and have not been independently reproduced.
Most important insight for researchers: The ablation results (arXiv:2602.20133, Table 4; discussed in §9.4.4) indicate that meta-guidance (invoking an LLM to generate qualitatively new algorithmic strategies when the entire search front stagnates) is the most impactful adaptation mechanism on the two ablated tasks, more so than numeric parameter adaptation. This suggests that the next frontier in LLM-powered evolutionary search lies not in better numeric optimization of search parameters, but in better strategic reasoning about when and how to redirect the search entirely. The accumulated improvement signal is orthogonal to most innovations in other systems, making it a natural candidate for integration into unified architectures.