Paper: arXiv:2509.19349

Introduced: 2025-09

Score: 8.62/10 (Final)

Chapter 6

ShinkaEvolve

Part P02: General-Purpose Evolutionary Frameworks

6.1 Overview & Motivation

ShinkaEvolve is an open-ended program evolution framework developed by Takuya Akiba and the Sakana AI team, published at ICLR 2026 and released under the Apache 2.0 license [paper]. The system addresses a fundamental bottleneck in LLM-driven evolutionary code search: sample efficiency. Prior systems such as FunSearch and AlphaEvolve demonstrated that large language models can serve as mutation operators in evolutionary loops, but they require thousands to hundreds of thousands of evaluations to converge on high-quality solutions. ShinkaEvolve reduces this cost through three interlocking innovations: power-law parent sampling that biases selection toward high-fitness individuals while preserving exploratory breadth, a two-tier novelty filtering pipeline that rejects trivially similar candidates before they consume evaluation budget, and a multi-armed bandit controller that dynamically allocates mutation requests to the most productive LLM in an ensemble [paper].

The system's generality derives from a simple contract: any problem expressible as a program containing mutable code blocks (delimited by EVOLVE-BLOCK-START / EVOLVE-BLOCK-END markers) together with an evaluation function returning a numeric score can be optimized [paper, README]. This contract has been exercised across competitive programming (AtCoder heuristic contests, ICFP 2025), mathematical reasoning (AIME), scientific discovery (circle packing), and machine learning optimization (mixture-of-experts training strategies) [paper]. ShinkaEvolve thus occupies a distinctive position in the design space—it is neither a domain-specific optimizer nor a general-purpose code generation tool, but an evolutionary search framework in which LLMs serve as semantically aware mutation operators.

Key Contribution

ShinkaEvolve demonstrates that sample efficiency in LLM-driven evolutionary search can be dramatically improved through the combination of intelligent parent selection, two-tier novelty filtering, and bandit-based model allocation. The system achieved state-of-the-art results across multiple domains (circle packing, AIME, competitive programming, MoE training) while keeping evaluation budgets in the low hundreds rather than thousands [paper]. The framework is fully open-source and reproducible, with Hydra-based configuration and checkpoint/resume support [repo, README].

Provenance note: The authors report a 10–100× efficiency improvement relative to AlphaEvolve [paper]. However, the specific tasks, evaluation budgets, seed conditions, and comparison protocol (whether baselines used identical evaluators, same seed programs, or matched computational budgets) underlying this comparison are not detailed in the available sources. Readers should treat this as an author-reported claim, not an independently verified or methodologically transparent comparison. No ablation studies isolating the contribution of each innovation are published.

Provenance Notation

This chapter tags implementation claims by source: [paper] = ICLR 2026 publication; [repo: file] = verified in repository source code; [example config] = repository example YAML files; [README] = repository documentation; [inferred] = reconstructed from code structure or module naming, not directly documented. Where no tag appears, the statement is a standard definition or the author's analytical commentary.

6.2 Architecture & System Design

ShinkaEvolve's architecture is organized around a central evolution runner that coordinates four subsystems: an LLM ensemble with bandit-based selection, a two-tier novelty filter, an island-based population database, and a parallel evaluation engine [paper; repo: directory structure]. The system provides three execution interfaces: shinka_launch (Hydra-based CLI for YAML-configured runs), shinka_run (agent CLI for interactive use), and shinka_visualize (a web UI for real-time monitoring) [README]. The following diagram illustrates the primary data flow.

6.2.1 Module Structure

The repository is organized into clearly separated packages, each responsible for a distinct concern. The following table maps the primary modules to their roles.

Module	Key Files	Responsibility	Source
`shinka/core/`	`evolution_runner.py`, `async_runner.py`, `async_novelty_judge.py`, `wrap_eval.py`	Main evolution loop (sync and async variants), novelty judgment, evaluation wrapping	[repo: directory listing]
`shinka/database/`	`database.py`, `async_dbase.py`, `island_sampler.py`, `prompt_dbase.py`	Island-based population storage, island selection, prompt archive	[repo: directory listing]
`shinka/llm/`	`providers/`	LLM provider abstraction (OpenRouter, Google, OpenAI, local endpoints)	[repo: directory listing]
`shinka/embed/`	Embedding providers	Code embedding computation for novelty Tier 1	[repo: directory listing]
`shinka/edit/`	Code editing operations	Diff application, block replacement, crossover assembly	[inferred from module name]
`shinka/prompts/`	`prompts_fix.py`, `prompts_prompt_evo.py`	Prompt templates for mutation and prompt co-evolution	[repo: directory listing]
`shinka/webui/`	`index.html`, `compare.html`	Dashboard visualization and run comparison	[repo: directory listing]
`shinka/plots/`	`plot_costs.py`, `plot_evals.py`, `plot_time.py`, `plot_llm.py`	Post-run analysis and visualization	[repo: directory listing]

6.2.2 Configuration System

ShinkaEvolve uses Hydra for hierarchical YAML configuration with three primary config objects [repo: Hydra config schema; README]:

Config Object	Key Parameters	Source
`EvolutionConfig`	`num_generations`, `patch_types` (diff/full/cross), novelty thresholds, LLM model list, `max_api_costs`	[repo: config schema]
`DatabaseConfig`	`num_islands` (4), `migration_interval` (10), `migration_rate` (10%), `elite_ratio` (30%)	[example config]; see note below
`JobConfig`	`LocalJobConfig`, `SlurmDockerJobConfig`, `SlurmCondaJobConfig`	[repo: config schema]

Parameter Default Provenance

The parameter values cited for DatabaseConfig (4 islands, 10-generation migration interval, 10% migration rate, 30% elite ratio) appear in the repository's example configuration files [example config]. It is not clear from the available documentation whether these are recommended defaults compiled into the code, or merely illustrative values in one example. The source material presents them without distinguishing these cases. Readers who adopt ShinkaEvolve should verify the actual compiled defaults in the DatabaseConfig class definition rather than relying on example files.

6.2.3 End-to-End Run Example

The following minimal, complete example ties together the configuration, evolve-block markers, evaluation contract, and CLI invocation required for a ShinkaEvolve run. The example uses a simplified circle packing task [adapted from repo: examples/circle_packing; README].

Step 1: Seed program (initial.py) — the program to be evolved, with mutable regions marked:

# initial.py — Seed program for circle packing optimization
# Source: adapted from repository example [repo: examples/circle_packing]
import math

def solve(n: int) -> tuple[list[tuple[float, float]], float]:
    """Pack n equal circles into a unit square. Returns (positions, radius)."""
    # === EVOLVE-BLOCK-START ===
    # Grid-based initialization (baseline strategy)
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    radius = 0.5 / max(cols, rows)
    positions = []
    for i in range(n):
        row, col = divmod(i, cols)
        x = (col + 0.5) / cols
        y = (row + 0.5) / rows
        positions.append((x, y))
    return positions[:n], radius
    # === EVOLVE-BLOCK-END ===

Step 2: Evaluation function (evaluate.py) — returns a dict with at minimum a combined_score key [paper; repo: core/wrap_eval.py contract]:

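# evaluate.py: fitness function for circle packing
# Contract: must return dict with "combined_score" key [paper, README]
import importlib.util
import math
EPS = 1e-9  # tolerance so exactly tangent circles do not count as overlapping

def load_module(path: str):
    """Illustrative helper: import the candidate program from its file path."""
    spec = importlib.util.spec_from_file_location("candidate", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def compute_overlap(positions, radius) -> float:
    """Illustrative helper: sum of pairwise overlap depths beyond EPS."""
    gaps = (2 * radius - math.dist(p, q)
            for i, p in enumerate(positions) for q in positions[i + 1:])
    return sum(g for g in gaps if g > EPS)

def compute_boundary_violations(positions, radius) -> float:
    """Illustrative helper: total protrusion of circles beyond the unit square."""
    excess = (max(radius - c, c + radius - 1.0) for p in positions for c in p)
    return sum(e for e in excess if e > EPS)

def evaluate(program_path: str) -> dict:
    """Run the packing algorithm and score the result."""
    module = load_module(program_path)
    n = 25  # number of circles
    positions, radius = module.solve(n)

    # Check validity: no overlaps, all within unit square
    overlap_penalty = compute_overlap(positions, radius)
    boundary_penalty = compute_boundary_violations(positions, radius)

    if overlap_penalty > 0 or boundary_penalty > 0:
        return {"combined_score": -1.0 * (overlap_penalty + boundary_penalty)}

    return {
        "combined_score": radius,   # maximize packing radius
        "num_circles": n,
        "coverage": n * math.pi * radius**2,  # fraction of square covered
    }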

Step 3: Configuration (Hydra YAML) [adapted from example config]:

# config.yaml — Hydra configuration for a circle packing run
# Source: adapted from repository example configs [example config]
evo:
  num_generations: 30
  patch_types: [diff, full, cross]
  max_api_costs: 50.0           # USD budget cap
  models:
    - openrouter/qwen/qwen3-coder
    - openai/gpt-4o

db:
  num_islands: 4
  migration_interval: 10
  migration_rate: 0.1
  elite_ratio: 0.3

novelty:
  embedding_threshold: 0.15
  use_llm_judge: true

job:
  type: local
  max_workers: 4

Step 4: CLI invocation [README]:

# Install and run [README]
git clone https://github.com/SakanaAI/ShinkaEvolve
cd ShinkaEvolve
uv pip install -e .

# Launch with the bundled circle packing example
shinka_launch variant=circle_packing_example

# Or with the agent CLI, pointing at a custom task directory
shinka_run --task-dir ./my_circle_packing --num_generations 30

# Override specific parameters
shinka_run --task-dir ./my_circle_packing \
  --set evo.max_api_costs=50.0 \
  --set db.num_islands=4 \
  --set evo.models='[openai/gpt-4o]'

# Monitor in real time
shinka_visualize --run-dir outputs/my_circle_packing

Note: the YAML config key names and CLI override syntax are adapted from the repository examples and README. Exact parameter paths may vary across ShinkaEvolve versions.

6.3 Core Algorithms

6.3.1 Code Modification & Mutation Operators

Programs in ShinkaEvolve contain explicitly marked mutable regions delimited by EVOLVE-BLOCK-START and EVOLVE-BLOCK-END markers [paper; README; repo: examples]. The LLM receives the complete program together with parent history, evaluation results, and meta-recommendations, then proposes modifications restricted to these marked blocks [paper]. This design separates the mutable search space from the fixed evaluation harness, ensuring that the LLM's mutations cannot accidentally corrupt the testing infrastructure.
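The separation between mutable and fixed code can be made concrete with a small marker parser. The following is a minimal sketch, not the repository's implementation: only the marker names EVOLVE-BLOCK-START and EVOLVE-BLOCK-END come from the source, while the function names, the regular expression, and the assumption that each marker sits on its own comment line are hypothetical.

```python
import re

# Matches everything between a line containing EVOLVE-BLOCK-START and the
# next line containing EVOLVE-BLOCK-END. The comment decoration around the
# marker words is an assumption; only the marker names come from the source.
_BLOCK = re.compile(
    r"(?P<start>^[^\n]*EVOLVE-BLOCK-START[^\n]*\n)"
    r"(?P<body>.*?)"
    r"(?P<end>^[^\n]*EVOLVE-BLOCK-END[^\n]*$)",
    re.DOTALL | re.MULTILINE,
)


def extract_evolve_blocks(source: str) -> list[str]:
    """Return the text of every mutable region, in file order."""
    return [m.group("body") for m in _BLOCK.finditer(source)]


def replace_evolve_block(source: str, new_body: str, index: int = 0) -> str:
    """Swap the body of the index-th mutable region, keeping markers intact."""
    matches = list(_BLOCK.finditer(source))
    if index >= len(matches):
        raise IndexError(f"program has {len(matches)} evolve block(s)")
    m = matches[index]
    if not new_body.endswith("\n"):
        new_body += "\n"
    return source[: m.start("body")] + new_body + source[m.end("body"):]


program = (
    "import math\n"
    "def solve(n):\n"
    "    # EVOLVE-BLOCK-START\n"
    "    return n\n"
    "    # EVOLVE-BLOCK-END\n"
    "print(solve(3))\n"
)
child = replace_evolve_block(program, "    return n * 2\n")
```

Because the replacement only touches the span between the markers, everything outside it (imports, the evaluation entry point) is carried over byte for byte, which is the property the marker contract relies on.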

Three mutation types are supported, referred to as patch types in the configuration [paper; repo: config schema]:

Patch Type	Mechanism	Typical Use	Source
`diff`	LLM generates a unified diff patch targeting specific lines within evolve blocks	Small, targeted refinements	[paper]
`full`	LLM generates a complete replacement of the evolve block contents	Major algorithmic restructuring	[paper]
`cross`	LLM receives two parent programs and combines elements from both	Crossover-like recombination of distinct strategies	[paper]

This is fundamentally different from classical genetic programming, where mutations operate at the syntactic level (e.g., subtree replacement in abstract syntax trees). ShinkaEvolve's LLM-mediated mutations are semantically aware: the model understands the algorithmic intent and can propose changes that preserve correctness while altering the optimization strategy. The following pseudocode illustrates the mutation dispatch logic:

# Illustrative pseudocode for mutation dispatch
# Simplified from the logic in shinka/core/evolution_runner.py [repo]
# NOT an exact repository excerpt; see provenance note below.

def propose_mutation(parent, patch_type, llm_client, prompt, context):
    """Generate a mutated child program from a parent.

    Args:
        parent: Source program with EVOLVE-BLOCK markers.
        patch_type: One of 'diff', 'full', 'cross' [paper].
        llm_client: Selected LLM provider instance.
        prompt: System prompt (possibly co-evolved) [paper, v1.1].
        context: Dict with elite archive, meta-recommendations, etc.
    """
    if patch_type == "diff":
        # LLM produces a unified diff against the evolve block
        user_msg = build_diff_prompt(parent, context)
        response = llm_client.generate(system=prompt, user=user_msg)
        child = apply_unified_diff(parent, response)

    elif patch_type == "full":
        # LLM rewrites the entire evolve block
        user_msg = build_full_prompt(parent, context)
        response = llm_client.generate(system=prompt, user=user_msg)
        child = replace_evolve_block(parent, response)

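    elif patch_type == "cross":
        # LLM combines two parents
        parent_b = select_second_parent(context["population"])
        user_msg = build_cross_prompt(parent, parent_b, context)
        response = llm_client.generate(system=prompt, user=user_msg)
        # Splice the recombined block back into the first parent so that,
        # as in the other branches, child is a complete program.
        child = replace_evolve_block(parent, extract_evolve_block(response))

    else:
        raise ValueError(f"unknown patch_type: {patch_type!r}")

    return child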

Provenance: this is simplified pseudocode faithful to the published mutation types [paper] and module structure [repo: core/evolution_runner.py]. It is not an exact repository excerpt. The helper names used here (build_diff_prompt, apply_unified_diff, etc.) are illustrative. Prompt construction and diff application logic are located in shinka/prompts/ and shinka/edit/ respectively [repo: directory listing], but their exact signatures and internal logic are not documented in the paper or README.

The v1.1 release introduced a fix mode [repo: prompts/prompts_fix.py]: when no program in the population produces correct output—common at the start of difficult tasks—special prompts direct the LLM to focus on fixing fundamental errors rather than optimizing performance [paper]. This addresses a bootstrapping problem where performance-oriented mutations are useless if the baseline program is incorrect.

6.3.2 Parent Selection & Sampling

Parent selection balances exploitation (refining known good solutions) with exploration (expanding into less-visited regions of the search space). ShinkaEvolve supports three selection strategies [paper], of which power-law selection is the documented default.

Power-Law Selection

Programs are ranked by fitness score in descending order within their island. The probability of selecting the program at rank $i$ follows a power-law distribution:

$$P(\text{rank} = i) = \frac{i^{-\alpha}}{\sum_{j=1}^{N} j^{-\alpha}}, \quad \alpha \in [1.0, 3.0]$$

[Paper-reported formulation. The exponent range $\alpha \in [1.0, 3.0]$ is from the source material; whether this is enforced in code or merely a recommended range is not documented.]

where $N$ is the number of programs in the island's population and $\alpha$ is the power-law exponent controlling the exploitation–exploration balance. Higher $\alpha$ concentrates selection on top-ranked programs; $\alpha = 1$ yields a gentler Zipf-like distribution that still favors high-rank programs but gives meaningful probability mass to the rest of the population. The normalizing constant $\sum_{j=1}^{N} j^{-\alpha}$ is the generalized harmonic number $H_{N,\alpha}$.
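To make the effect of $\alpha$ concrete, the rank distribution above can be computed directly. This is a minimal sketch of the formula as stated, not the repository's sampler; the function names and the use of Python's `random.choices` are assumptions made for illustration.

```python
import random


def power_law_probs(n: int, alpha: float) -> list[float]:
    """P(rank = i) = i^-alpha / sum_j j^-alpha for ranks i = 1..n."""
    weights = [i ** -alpha for i in range(1, n + 1)]
    total = sum(weights)  # generalized harmonic number H_{n,alpha}
    return [w / total for w in weights]


def sample_parent(programs, scores, alpha=1.0, rng=random):
    """Draw one parent; rank 1 is the highest-scoring program."""
    ranked = sorted(zip(scores, programs), key=lambda sp: sp[0], reverse=True)
    probs = power_law_probs(len(ranked), alpha)
    idx = rng.choices(range(len(ranked)), weights=probs, k=1)[0]
    return ranked[idx][1]


gentle = power_law_probs(5, alpha=1.0)  # top rank gets roughly 0.44
sharp = power_law_probs(5, alpha=3.0)   # top rank gets roughly 0.84
```

With five programs, moving from $\alpha = 1$ to $\alpha = 3$ roughly doubles the probability of picking the best program, which leaves correspondingly little mass for the lower ranks.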

Weighted Selection

An alternative strategy selects parents with probability proportional to their raw fitness scores rather than ranks [paper]:

$$P(\text{program} = p_i) = \frac{f(p_i)}{\sum_{j=1}^{N} f(p_j)}$$

[Standard fitness-proportionate selection (roulette wheel), applied per paper description.]

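where $f(p_i)$ is the fitness score (specifically, the combined_score returned by the evaluation function [paper, README]) of program $p_i$. This is equivalent to classical fitness-proportionate selection. It provides a more balanced exploitation–exploration trade-off than power-law selection but has two failure modes. When fitness values have low variance, the selection probabilities are nearly uniform and selection pressure vanishes. When any fitness value is negative, the ratio no longer defines a valid probability, so scores must be shifted or clipped before sampling.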
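The two edge cases of fitness-proportionate selection (near-identical scores and negative scores) are easy to reproduce. The sketch below is illustrative only: the function names are hypothetical, and the min-shift workaround is a common convention assumed here, not something the source attributes to ShinkaEvolve.

```python
def weighted_probs(scores: list[float]) -> list[float]:
    """Fitness-proportionate probabilities f(p_i) / sum_j f(p_j)."""
    if any(s < 0 for s in scores):
        raise ValueError("negative scores make the ratio ill-defined")
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)  # fall back to uniform
    return [s / total for s in scores]


def shift_to_positive(scores: list[float], eps: float = 1e-9) -> list[float]:
    """Assumed workaround: subtract the minimum so every score is positive."""
    low = min(scores)
    return [s - low + eps for s in scores]


# Low variance: three close scores give almost uniform probabilities.
flat = weighted_probs([0.90, 0.91, 0.92])

# Negative scores (for example, penalty values from an evaluator) must be
# shifted before the formula applies.
shifted = weighted_probs(shift_to_positive([-1.0, 0.5, 2.0]))
```

Note that shifting changes the relative weights, so it is a design decision rather than a neutral fix; rank-based schemes such as power-law selection avoid the issue entirely.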

Beam Search

A deterministic strategy that expands the top-$k$ programs exhaustively at each generation, concentrating the budget on greedy refinement of the current best candidates [paper]. This is appropriate when the search landscape has a narrow funnel toward the optimum and breadth is less important than greedy improvement.

In addition to parent selection, the system uses an elite ratio (example config value: 30% [example config]) to determine how many top-ranked programs from the archive are included as context in the LLM prompt. These elite programs serve as "inspiration"—the LLM can observe their strategies without directly copying them [paper].

6.3.3 Population Management & Island Architecture

ShinkaEvolve implements a multi-island evolutionary model inspired by allopatric speciation in biological island biogeography [paper]. The population is partitioned into semi-independent subpopulations (islands) that evolve in relative isolation, with periodic migration of elite individuals between islands. This design promotes diversity by allowing different islands to explore different regions of the solution space, while migration ensures that successful strategies eventually propagate across the entire population.
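The migration step can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's logic: the ring topology, the copy (rather than move) semantics, and the function name `migrate` are all hypothetical. The `interval` and `rate` defaults mirror the example config values (10 generations, 10%).

```python
import math


def migrate(islands, generation, interval=10, rate=0.1):
    """Copy the top `rate` fraction of each island to its ring neighbour.

    islands: list of lists of (score, program) tuples, modified in place.
    Returns the number of migrants copied.
    """
    if generation == 0 or generation % interval != 0:
        return 0
    # Snapshot emigrants first so a migrant cannot hop twice in one round.
    emigrants = []
    for island in islands:
        k = max(1, math.ceil(rate * len(island))) if island else 0
        best_first = sorted(island, key=lambda sp: sp[0], reverse=True)
        emigrants.append(best_first[:k])
    moved = 0
    for i, group in enumerate(emigrants):
        target = islands[(i + 1) % len(islands)]
        target.extend(group)
        moved += len(group)
    return moved


islands = [
    [(0.1, "a"), (0.9, "b")],
    [(0.5, "c")],
    [(0.2, "d")],
    [(0.7, "e")],
]
skipped = migrate(islands, generation=5)   # not a migration generation
copied = migrate(islands, generation=10)   # one elite per island moves on
```

Between migration events each island evolves independently, so a low `rate` or long `interval` preserves diversity at the cost of slower propagation of good strategies.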

The island_sampler.py module provides four strategies for selecting which island receives the next mutation request [repo: database/island_sampler.py]:

Strategy	Behavior	Source
`uniform`	Equal probability for each island regardless of performance	[repo: island_sampler.py]
`equal`	Round-robin cycling through islands in sequence	[repo: island_sampler.py]
`proportional`	Probability proportional to the island's best fitness score	[repo: island_sampler.py]
`weighted`	Custom per-island weights (e.g., to favor islands with higher diversity)	[repo: island_sampler.py]
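The four strategies in the table can be sketched as a single dispatcher. The class below is a hypothetical illustration of the described behaviors, not the contents of island_sampler.py; the class name, constructor arguments, and the assumption that `proportional` receives non-negative best scores are all made up for this example.

```python
import itertools
import random


class IslandSampler:
    """Pick the island that receives the next mutation request."""

    def __init__(self, num_islands, strategy="uniform", weights=None, seed=None):
        self.n = num_islands
        self.strategy = strategy
        self.weights = weights
        self.rng = random.Random(seed)
        self._cycle = itertools.cycle(range(num_islands))  # used by "equal"

    def next_island(self, best_scores=None):
        if self.strategy == "uniform":
            return self.rng.randrange(self.n)
        if self.strategy == "equal":
            return next(self._cycle)
        if self.strategy == "proportional":
            # Assumes non-negative best scores with a positive sum.
            return self.rng.choices(range(self.n), weights=best_scores, k=1)[0]
        if self.strategy == "weighted":
            return self.rng.choices(range(self.n), weights=self.weights, k=1)[0]
        raise ValueError(f"unknown strategy: {self.strategy}")


round_robin = IslandSampler(3, strategy="equal")
order = [round_robin.next_island() for _ in range(5)]
```

The `proportional` strategy inherits the same caveat as fitness-proportionate parent selection: it needs non-negative scores, so an evaluator that emits negative penalties would require a shift first.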

The v1.1 release added dynamic island spawning [paper]: when stagnation is detected, new islands are created with randomized strategies to inject fresh diversity into the population. The precise stagnation detection heuristic is not documented in the paper, README, or example configurations; see §6.3.9 for a consolidated summary of undocumented implementation details.

6.3.4 Novelty-Based Program Filtering

A central innovation in ShinkaEvolve is its two-tier novelty system, designed to prevent evaluation budget from being wasted on trivially similar candidates [paper]. The two tiers form a cascade filter: the cheaper embedding-based check runs first, and only candidates passing it are subjected to the more expensive LLM-based judgment.

Tier 1: Code Embedding Similarity

Before evaluation, each proposed program is compared to existing population members using code embeddings [paper]. Let $\text{embed}(p)$ denote the embedding vector produced by the configured embedding provider for program $p$, and let $\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$ denote cosine similarity between vectors $u$ and $v$. A candidate program $p_{\text{new}}$ is rejected if it is too similar to any existing population member:

$$\text{reject}(p_{\text{new}}) \iff \exists \, q \in \text{Pop} : \text{sim}\big(\text{embed}(p_{\text{new}}),\; \text{embed}(q)\big) > 1 - \theta_{\text{novelty}}$$

[Paper-reported mechanism. The threshold formulation using $1 - \theta_{\text{novelty}}$ is from the source material code examples. The cosine similarity operates on raw embedding vectors; whether vectors are L2-normalized before storage is not documented.]

where $\theta_{\text{novelty}}$ is a configurable novelty threshold. When $\theta_{\text{novelty}} = 0.15$, a candidate is rejected if its cosine similarity to any existing member exceeds $0.85$. The choice of embedding model is task-dependent; the shinka/embed/ module provides a pluggable provider interface [repo: embed/ directory], but the paper does not specify which embedding model is used by default nor provide calibration guidance for $\theta_{\text{novelty}}$ across different problem domains.

Provenance: the threshold value $\theta_{\text{novelty}} = 0.15$ appears in source material code examples [source material]. Whether this is a compiled default, an example config value, or a recommended starting point is not distinguished in the available documentation.
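A toy example shows how the threshold behaves. The two-dimensional vectors below stand in for real code embeddings, and the function names are hypothetical; only the rejection rule (similarity above $1 - \theta_{\text{novelty}}$) follows the formula above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_tier1(candidate_emb, population_embs, theta=0.15):
    """True if no stored embedding has similarity above 1 - theta."""
    limit = 1.0 - theta
    return all(cosine_similarity(candidate_emb, q) <= limit for q in population_embs)


population = [[1.0, 0.0], [0.0, 1.0]]
near_duplicate = [0.9, 0.1]  # similarity to [1, 0] is about 0.994
distinct = [1.0, 1.0]        # similarity to each member is about 0.707

rejected = not passes_tier1(near_duplicate, population)
accepted = passes_tier1(distinct, population)
```

Raising $\theta_{\text{novelty}}$ makes the filter stricter: at $\theta_{\text{novelty}} = 0.3$ the limit drops to $0.7$, and the `distinct` vector above would also be rejected.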

Tier 2: LLM-as-Novelty-Judge

Candidates that pass the embedding filter are then assessed by an LLM that evaluates whether the proposed changes represent a genuinely novel algorithmic approach or merely a superficial modification such as variable renaming, comment changes, or trivial reformatting [paper]. This tier is implemented in shinka/core/async_novelty_judge.py [repo: core/async_novelty_judge.py].

# Illustrative pseudocode for two-tier novelty filtering
# Based on mechanism described in paper [paper] and module
# structure [repo: core/async_novelty_judge.py, embed/]
# NOT an exact repository excerpt.

async def novelty_filter(candidate, population, embedder, llm_client, threshold=0.15):
    """Two-tier novelty check: embedding similarity then LLM judgment.

    Returns True if candidate is novel enough to merit evaluation.
    """
    # Tier 1: embedding-based fast rejection [paper]
    candidate_emb = await embedder.encode(candidate.code)
    for member in population:
        member_emb = await embedder.encode(member.code)
        similarity = cosine_similarity(candidate_emb, member_emb)
        if similarity > (1.0 - threshold):
            return False  # Too similar — reject without LLM call

    # Tier 2: LLM-based novelty judgment [paper]
    prompt = (
        "Compare the parent program and the proposed child program. "
        "Does the child introduce a genuinely novel algorithmic approach, "
        "or is it a superficial change (variable renaming, formatting, "
        "trivial constant tweaks)?\n\n"
        f"Parent:\n{candidate.parent_code}\n\n"
        f"Child:\n{candidate.code}\n\n"
        "Rate novelty 1-5. Explain briefly."
    )
    response = await llm_client.generate(prompt)
    novelty_score = extract_integer_score(response)
    return novelty_score >= 3  # Accept if score indicates meaningful novelty

Provenance: the 1–5 rating scale and the acceptance cutoff (score ≥ 3) are from the source material. Whether these values are configurable or hard-coded is not documented. The exact prompt template is illustrative; the actual prompt text in async_novelty_judge.py may differ.

The cascade structure is critical for cost control. Embedding comparisons are cheap (a single forward pass through a small model), while LLM novelty judgments require a full generation call. By filtering out the most obviously redundant candidates at Tier 1, the system avoids paying LLM costs for candidates that would have been rejected anyway. The paper reports that the combined effect of both tiers, together with power-law parent selection and bandit-based model allocation, contributes to the overall efficiency improvement [paper]—though no ablation isolating each tier's individual contribution is published.

6.3.5 Adaptive LLM Selection via Multi-Armed Bandits

When an ensemble of LLMs is available, ShinkaEvolve must decide which model to query for each mutation request. This is formulated as a multi-armed bandit problem, where each LLM is an arm [paper], and the reward signal reflects whether the mutation produced a fitness improvement.

The system implements the Upper Confidence Bound (UCB1) algorithm [paper; source material code]. Let $K$ models be available. After $t$ total selections, the UCB1 score for model $i$ is:

$$\text{UCB}_i(t) = \hat{\mu}_i(t) + c \cdot \sqrt{\frac{\ln t}{n_i(t)}}$$

[Standard UCB1 formulation (Auer et al., 2002), applied to this system per paper description and source material code.]

where:

$\hat{\mu}_i(t)$ is the empirical mean reward from model $i$ after its $n_i(t)$ selections. Based on the source material code, rewards are binary: $r = 1$ if the mutation produced a fitness improvement over the parent, $r = 0$ otherwise [source material code]. Thus $\hat{\mu}_i(t) = s_i / n_i(t)$, where $s_i$ is the count of improvement-producing mutations. Note that UCB1 is valid for any reward distribution bounded in $[0, 1]$, not only Bernoulli rewards; the binary encoding is a system design choice that discards information about the magnitude of improvement.
$n_i(t)$ is the number of times model $i$ has been selected through round $t$.
$c$ is the exploration weight. The source material uses $c = 1.4$ [source material code]. The standard UCB1 derivation (via the Hoeffding concentration inequality for rewards in $[0, 1]$) yields $c = \sqrt{2} \approx 1.414$ as the value ensuring logarithmic regret. The configured value $c = 1.4$ is close to $\sqrt{2}$; whether this is a deliberate approximation or an independently tuned parameter is not documented.
$\ln t$ is the natural logarithm of the total number of selection rounds.

The model with the highest UCB score is selected at each round. The exploration term $c \cdot \sqrt{\ln t \,/\, n_i(t)}$ gives under-explored models a bonus: it grows slowly (as $\sqrt{\ln t}$) while a model goes unselected and shrinks as $1/\sqrt{n_i(t)}$ each time the model is chosen.

Non-Stationarity and the Limits of Formal Guarantees

A critical caveat is that classical UCB1 analysis assumes stationary rewards: the reward distribution for each arm is fixed (i.i.d.) across all rounds. In ShinkaEvolve, this assumption is violated. The quality of mutations produced by a given LLM depends on the current state of the search—which parents are available, how mature the population is, and what regions of the solution space remain unexplored. A model that excels at generating diverse initial strategies may become less effective once the population has converged to a narrow region, and vice versa. The reward distributions are therefore non-stationary, shifting as the population evolves.

The standard gap-dependent regret bound—$R(T) = O\!\left(\sum_{i: \Delta_i > 0} \frac{\ln T}{\Delta_i}\right)$, where $\Delta_i = \mu^* - \mu_i$ is the gap between the best arm and arm $i$—does not formally hold under non-stationarity. Variants of UCB designed for non-stationary settings exist (e.g., Sliding-Window UCB by Garivier & Moulines, 2011; Discounted UCB by Kocsis & Szepesvári, 2006), which weight recent observations more heavily than older ones.

Despite this, UCB1 remains a practical heuristic for this application for several reasons: (1) the non-stationarity is typically gradual—population state changes incrementally between generations, not adversarially; (2) the exploration bonus naturally re-explores models that have been neglected, providing implicit adaptation; (3) the binary reward signal is relatively robust to moderate distribution shift. The source material does not discuss non-stationarity or alternative bandit formulations; the use of UCB1 should be understood as a pragmatic design choice without formal optimality guarantees in this setting.
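For comparison, a sliding-window variant can be sketched in a few lines. This is not part of ShinkaEvolve according to the available sources; it is a simplified illustration of the Sliding-Window UCB idea, in which statistics are computed only over the most recent `window` selections. The class name, the window size of 50, and the reuse of $c = 1.4$ are assumptions.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """UCB over the last `window` (model, reward) pairs only."""

    def __init__(self, models, window=50, c=1.4):
        self.models = list(models)
        self.c = c
        self.history = deque(maxlen=window)  # oldest entries drop off automatically

    def select(self):
        counts = {m: 0 for m in self.models}
        sums = {m: 0.0 for m in self.models}
        for model, reward in self.history:
            counts[model] += 1
            sums[model] += reward
        # Any model absent from the window is tried (again) first.
        for m in self.models:
            if counts[m] == 0:
                return m
        log_t = math.log(len(self.history))  # len(history) = min(t, window)
        return max(
            self.models,
            key=lambda m: sums[m] / counts[m] + self.c * math.sqrt(log_t / counts[m]),
        )

    def update(self, model, improved: bool):
        self.history.append((model, 1.0 if improved else 0.0))


selector = SlidingWindowUCB(["model_a", "model_b"], window=50)
first_pick = selector.select()
selector.update(first_pick, improved=True)
```

The trade-off is that a short window forgets useful evidence and re-explores more often, while a long window approaches plain UCB1; choosing the window size requires some knowledge of how quickly model usefulness drifts.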

The v1.1 release added bandit state persistence: the success counts $s_i$, trial counts $n_i$, and total round count $t$ are saved to disk on checkpoint and restored on resume [paper; source material]. This allows the bandit to retain its learned preferences across sessions. Note that restoring historical statistics across sessions further compounds the non-stationarity concern, since the search state at resume may differ substantially from when the statistics were accumulated.
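Persisting the three statistics amounts to serializing two dictionaries and a counter. The sketch below is an assumption-laden illustration, not the repository's checkpoint format: the JSON layout, file handling, and function names are hypothetical, and the fallback of one phantom trial for unseen models simply mirrors the initialization convention used in this chapter's pseudocode.

```python
import json
import os
import tempfile


def save_bandit_state(path, successes, trials, total_trials):
    """Atomically write bandit statistics as JSON."""
    state = {"successes": successes, "trials": trials, "total_trials": total_trials}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # rename is atomic on the same filesystem


def load_bandit_state(path, models):
    """Restore statistics; models missing from the checkpoint start fresh."""
    with open(path) as handle:
        state = json.load(handle)
    successes = {m: state["successes"].get(m, 0) for m in models}
    trials = {m: state["trials"].get(m, 1) for m in models}
    # Reuse the stored round count only if the ensemble is unchanged.
    same_models = set(models) == set(state["trials"])
    total = state["total_trials"] if same_models else sum(trials.values())
    return successes, trials, total
```

Writing to a temporary file and renaming it avoids leaving a truncated checkpoint if the process is killed mid-write, which matters for long runs that are resumed repeatedly.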

# Illustrative pseudocode for bandit-based LLM selection
# Faithful to the mechanism described in the source material [paper, source material code]
# NOT an exact repository excerpt.

import math

class BanditLLMSelector:
    def __init__(self, models, exploration_weight=1.4):
        self.models = models
        self.c = exploration_weight  # Source material uses 1.4 [source material]
        self.successes = {m: 0 for m in models}
        self.trials = {m: 1 for m in models}  # Init to 1 to avoid division by zero
        self.total_trials = len(models)

    def select(self):
        """Select the LLM with highest UCB1 score."""
        best_model, best_score = None, -float("inf")
        for model in self.models:
            mu = self.successes[model] / self.trials[model]
            exploration = self.c * math.sqrt(
                math.log(self.total_trials) / self.trials[model]
            )
            score = mu + exploration
            if score > best_score:
                best_model, best_score = model, score
        return best_model

    def update(self, model, improved: bool):
        """Update statistics after evaluating a mutation.

        Uses binary reward: 1 if fitness improved, 0 otherwise.
        Magnitude of improvement is discarded. [source material code]
        """
        self.trials[model] += 1
        self.total_trials += 1
        if improved:
            self.successes[model] += 1

Note: initialization of trials to 1 per model is a practical choice to avoid division by zero. It inflates the initial total_trials count by $K$ (number of models) and, because successes starts at 0, acts as one phantom failed trial per model, biasing early mean estimates slightly downward. Both effects become negligible after a modest number of rounds. Whether the actual implementation uses this convention or another initialization is not documented.

6.3.6 Prompt Co-Evolution

A distinctive feature introduced in v1.1 is prompt co-evolution: the system prompts that guide LLM mutations are themselves treated as an evolving population [paper]. This mechanism is implemented across shinka/database/prompt_dbase.py (prompt archive storage) and shinka/prompts/prompts_prompt_evo.py (prompt mutation templates) [repo: directory listing].

The prompt co-evolution loop operates as follows, based on the paper's description [paper]:

Prompt selection. A system prompt is selected from the prompt archive. The source material indicates that selection is fitness-proportionate, where prompt fitness is defined as the average improvement of programs generated under that prompt's guidance [paper].
Program mutation. The selected prompt is used as the system message when querying the LLM for a code mutation.
Credit assignment. After the mutated program is evaluated, the resulting fitness improvement (or lack thereof) is attributed back to the prompt that guided the mutation, updating the prompt's fitness statistics.
Prompt mutation. Periodically, a prompt is selected and mutated by an LLM based on its success history—the LLM is asked to improve the prompt to generate better code mutations, with the history of mutations it produced as context.

This creates a co-evolutionary dynamic in which prompts and programs exert mutual selection pressure: better prompts produce better programs, and the resulting fitness signal selects for better prompts. The approach is analogous to co-evolution of genotype and phenotype in biology, or to the co-evolution of instruction sets and solution populations in hyper-heuristic optimization.
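The four steps above (selection, mutation under the prompt, credit assignment, prompt mutation) can be sketched as a small archive class. Everything here is hypothetical: the class and method names, the clipped-improvement credit rule, and the selection floor are assumptions chosen for illustration, since the source does not document the actual credit formula, archive size, or mutation schedule. The `rewrite` callback stands in for an LLM call.

```python
import random
from dataclasses import dataclass


@dataclass
class PromptEntry:
    text: str
    total_gain: float = 0.0
    uses: int = 0

    @property
    def fitness(self) -> float:
        """Average improvement of children produced under this prompt."""
        return self.total_gain / self.uses if self.uses else 0.0


class PromptArchive:
    def __init__(self, prompts, floor=1e-3, seed=None):
        self.entries = [PromptEntry(p) for p in prompts]
        self.floor = floor  # keeps unused or unlucky prompts selectable
        self.rng = random.Random(seed)

    def select(self) -> PromptEntry:
        """Fitness-proportionate choice over average improvement."""
        weights = [max(e.fitness, 0.0) + self.floor for e in self.entries]
        return self.rng.choices(self.entries, weights=weights, k=1)[0]

    def credit(self, entry: PromptEntry, parent_score: float, child_score: float):
        """Assumed rule: credit the clipped improvement of child over parent."""
        entry.total_gain += max(0.0, child_score - parent_score)
        entry.uses += 1

    def mutate(self, rewrite):
        """Append a variant of a selected prompt produced by `rewrite`."""
        source = self.select()
        self.entries.append(PromptEntry(rewrite(source.text, source.fitness)))


archive = PromptArchive(["Refine the algorithm.", "Try a new approach."], seed=0)
chosen = archive.select()
archive.credit(chosen, parent_score=1.0, child_score=1.5)
```

The floor term is one way to stop a prompt with zero recorded improvement from being starved forever; without it, pure fitness-proportionate selection would never revisit such prompts.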

See §6.3.9 for a detailed summary of what is verified versus undocumented about prompt co-evolution, including the unknown credit assignment formula, prompt archive size, and mutation frequency.

6.3.7 Evaluation & Early Stopping

The evaluation system, implemented in shinka/core/wrap_eval.py [repo: core/wrap_eval.py], provides the fitness signal that drives selection. Users supply two files: an initial.py containing the seed program with evolve-block markers, and an evaluate.py that runs the program and returns a dictionary containing at minimum a combined_score key [paper; README; repo: examples].

The v1.1 evaluator includes several robustness features [paper; repo: config schema]:

Feature	Description	Source
Parallel execution	`run_workers` with optional `max_workers_cap`	[repo: config schema]
Early stopping	Three modes: `bayesian`, `ci` (confidence interval), `hybrid`	[repo: config schema]
NaN/Inf guards	Invalid scores automatically filtered before population update	[inferred from source material]
Deterministic ordering	Results sorted consistently regardless of parallel execution order	[inferred from source material]

Early stopping is a cascade evaluation strategy: rather than running all evaluation trials for every candidate, the evaluator can terminate a candidate's evaluation early if preliminary results indicate it is unlikely to outperform existing population members. The three modes named in the config schema (bayesian, ci, hybrid) likely correspond to Bayesian posterior estimation, frequentist confidence intervals, and a combination thereof. However, the precise stopping rules, confidence levels, prior specifications, and threshold criteria for each mode are not formally documented in the paper, README, or available source material. Researchers using these modes should consult the wrap_eval.py source code directly for the actual decision logic. See §6.3.9 for a consolidated view of verified versus undocumented implementation details.
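Since the actual stopping rules are undocumented, the following is only a generic sketch of what a confidence-interval mode could look like, not a reconstruction of wrap_eval.py. The function names, the normal-approximation critical value $z = 1.96$, and the minimum of three trials are all assumptions. A candidate is abandoned once the upper end of its score interval falls below the incumbent score.

```python
import math
import statistics


def ci_early_stop(trial_scores, incumbent, z=1.96, min_trials=3):
    """True if the candidate's upper confidence bound is below `incumbent`."""
    n = len(trial_scores)
    if n < min_trials:
        return False
    mean = statistics.fmean(trial_scores)
    sem = statistics.stdev(trial_scores) / math.sqrt(n)  # standard error
    return mean + z * sem < incumbent


def evaluate_with_early_stop(run_trial, max_trials, incumbent):
    """Run trials one by one; stop once the candidate looks hopeless."""
    scores = []
    for i in range(max_trials):
        scores.append(run_trial(i))
        if ci_early_stop(scores, incumbent):
            break
    return statistics.fmean(scores), len(scores)


weak_scores = [0.10, 0.12, 0.11, 0.10, 0.13]
mean_score, trials_used = evaluate_with_early_stop(
    lambda i: weak_scores[i], max_trials=5, incumbent=0.9
)
```

With so few trials a normal approximation is crude (a Student's t critical value would be more defensible), which is one reason the real thresholds should be read from the source code rather than inferred.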

6.3.8 Parallelization & Async Execution

ShinkaEvolve provides two execution modes with distinct concurrency characteristics [paper; repo: core/evolution_runner.py, core/async_runner.py]:

EvolutionRunner (synchronous): program generation is sequential, but evaluations within a generation run in parallel via the run_workers configuration [repo: core/evolution_runner.py]. This mode is suitable for debugging and small-scale experiments where deterministic behavior is preferred.

AsyncEvolutionRunner (asynchronous): both program generation and evaluation run concurrently, with the runner maintaining multiple in-flight proposals simultaneously [repo: core/async_runner.py]. The source material reports a 5–10× wall-clock speedup compared to the synchronous runner [paper]. This speedup arises because LLM generation calls (typically 1–10 seconds each) and program evaluations can overlap: while one candidate is being evaluated, the next candidates are already being generated.

The async runner uses Python's asyncio framework. A simplified view of the concurrency structure:

# Illustrative pseudocode for async evolution pipeline
# Based on paper description [paper] and module structure
# [repo: core/async_runner.py]. NOT an exact repository excerpt.

import asyncio

async def async_evolution_loop(config, database, llm_selector, evaluator):
    """Run evolution with concurrent generation and evaluation.

    Multiple proposals are in-flight simultaneously, overlapping
    LLM generation latency with evaluation compute.
    """
    budget_tracker = CostTracker(config.max_api_costs)

    for generation in range(config.num_generations):
        # Launch multiple proposals concurrently
        tasks = []
        for _ in range(config.max_parallel_jobs):
            if not budget_tracker.can_propose():
                break
            task = asyncio.create_task(
                propose_and_evaluate(
                    database, llm_selector, evaluator, budget_tracker, config
                )
            )
            tasks.append(task)

        if not tasks:
            break  # Budget exhausted: nothing left to dispatch

        # Wait for every proposal in this batch; gather() returns results in
        # task order once all tasks finish, with exceptions returned as values
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Update population with successful evaluations
        for result in results:
            if isinstance(result, BaseException):
                continue  # Skip failed proposals (a real runner would log them)
            if result.score is not None:
                database.insert(result.program, result.score, result.island_id)
                llm_selector.update(result.model, improved=result.improved)

        # Periodic migration between islands (skipped at generation 0)
        if generation > 0 and generation % config.migration_interval == 0:
            database.migrate_elites(config.migration_rate)

        # Generate meta-recommendations for next generation
        update_meta_recommendations(database, generation)

Provenance: simplified pseudocode illustrating the concurrency model as described in the paper [paper]. Actual implementation includes budget tracking for in-flight work, novelty filtering between proposal and evaluation, and async summarization [repo: core/async_summarizer.py]. The exact task/gather structure may differ from this illustration.

The cost tracker plays a critical role in the async runner: because multiple proposals are in-flight simultaneously, the budget guard must account for committed cost—the sum of realized costs plus the estimated cost of currently in-flight LLM calls—not just realized cost [paper; source material]. When committed cost reaches max_api_costs, no new proposals are dispatched, and the runner drains the remaining in-flight work before terminating.
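The committed-cost idea can be sketched as a reserve-then-settle pattern. The class below is a hypothetical stand-in for the cost tracker: the class name, method names, and the use of a caller-supplied cost estimate are assumptions, and the real tracker may differ.

```python
class CommittedCostGuard:
    """Budget guard that counts reserved (in-flight) cost as already spent."""

    def __init__(self, max_api_costs: float):
        self.max_api_costs = max_api_costs
        self.realized = 0.0   # cost of calls that have completed
        self.in_flight = 0.0  # estimated cost of calls still pending

    @property
    def committed(self) -> float:
        return self.realized + self.in_flight

    def try_reserve(self, estimated_cost: float) -> bool:
        """Reserve budget for a new call; refuse if it would exceed the cap."""
        if self.committed + estimated_cost > self.max_api_costs:
            return False
        self.in_flight += estimated_cost
        return True

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Convert a finished call's reservation into realized cost."""
        self.in_flight -= estimated_cost
        self.realized += actual_cost
```

A runner would call `try_reserve` before dispatching each proposal and `settle` when the call returns. Once `try_reserve` starts returning False, only the drain of pending work remains. Realized cost can still exceed the cap if actual costs run above their estimates.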

6.3.9 Verified vs. Unknown Implementation Details

Several mechanisms described in §§6.3.1–6.3.8 are not fully specified in the available sources. The following table consolidates what is verified from the paper, repository structure, and documentation versus what remains undocumented. Readers should not assume that undocumented details follow any particular implementation; consulting the repository source code directly is advised.

Mechanism	Verified	Unknown / Undocumented
Prompt co-evolution (§6.3.6)	Prompt archive exists (`prompt_dbase.py`) [repo]; prompt mutation templates exist (`prompts_prompt_evo.py`) [repo]; fitness-proportionate prompt selection [paper]; credit assigned to the prompt that guided a mutation [paper]	Prompt archive size (fixed or growing?); prompt mutation frequency (every generation? periodic?); precise credit assignment formula when multiple factors (prompt, model, parent quality, patch type) jointly contribute; whether interaction effects between prompt and model are modeled; prompt initialization strategy
Adaptive mutation scheduling (§6.3.1)	Listed as a v1.1 feature [paper]; the ratio of `diff`/`full`/`cross` patch types adapts based on observed success rates [paper]	The precise adaptation rule (bandit-based? exponential moving average? rule-based thresholds?); update frequency; whether per-island or global; interaction with the LLM bandit controller
Stagnation-triggered island spawning (§6.3.3)	Dynamic island spawning occurs when stagnation is detected [paper]; new islands use randomized strategies [paper]	The stagnation detection heuristic (number of generations without improvement? fitness variance threshold? relative or absolute?); spawning parameters (how many new islands? what population size?); whether old stagnant islands are pruned
Early stopping (§6.3.7)	Three named modes: `bayesian`, `ci`, `hybrid` [repo: config schema]; modes terminate candidate evaluation early based on preliminary results [paper]	Statistical stopping rules for each mode; confidence levels and significance thresholds; Bayesian prior specification; what "hybrid" combines and how; minimum number of trials before stopping can trigger
Meta-recommendations (§6.8)	Textual insights generated after each generation and included in subsequent LLM prompts [paper]; persist across sessions [paper]	The prompt template used to generate meta-recommendations; how many generations of history are summarized; whether recommendations are per-island or global; maximum context length allocated to recommendations
Novelty judge LLM (§6.3.4)	Tier 2 uses an LLM to assess novelty with a 1–5 rating scale, acceptance threshold ≥ 3 [source material]; implemented in `async_novelty_judge.py` [repo]	Whether the 1–5 scale and ≥ 3 cutoff are configurable; which LLM is used for novelty judgment (same as mutation models or a dedicated cheaper model?); the exact prompt template
Fix mode (§6.3.1)	Exists as `prompts_fix.py` [repo]; activated when no correct program exists in the population [paper]	How "no correct program" is determined (all scores negative? all below a threshold?); transition criteria from fix mode back to normal mode; whether fix mode uses different LLM selection

6.4 Key Results

ShinkaEvolve has been evaluated on five tasks spanning four problem domains [paper]. The following table summarizes the reported results. Because different tasks use different budget metrics (evaluations, generations, trials), these are separated into distinct columns to avoid conflating incomparable quantities.

Domain	Task	Evaluations	Generations	Trials	Outcome	Source
Scientific Discovery	Circle packing in a square	~150	—	—	State-of-the-art packing density via hybrid golden-spiral + gradient + simulated annealing	[paper, repo example]
Mathematical Reasoning	AIME problem solving	—	~75	—	Generalization across problem years and LLM backends	[paper]
Competitive Programming	AtCoder heuristic problems	—	—	Multiple tasks	2.3% average improvement over initial solutions	[paper]
ML Optimization	MoE training loss function	—	~30	—	Outperformed DeepSeek Global LBL strategy	[paper]
Competitive Programming	ICFP 2025 SAT optimization	—	—	320	10× speedup in SAT solving; ~$60 API cost	[paper]

Methodological Notes on Reported Results

Budget metrics are not interchangeable. "Evaluations" counts individual fitness assessments of candidate programs. "Generations" counts evolutionary cycles, each of which may produce and evaluate multiple candidates depending on population size and parallel workers. "Trials" in the ICFP context refers to optimization runs. These quantities cannot be compared across rows or across systems without normalization, which the available sources do not provide.
No ablation studies are published. The paper reports the combined effect of all sample-efficiency innovations. It is unknown how much of the efficiency gain is attributable to power-law parent selection, novelty filtering, or bandit-based model allocation individually, or whether these components interact synergistically.
Baselines and comparison protocols are not fully documented. The authors report a 10–100× efficiency improvement relative to prior approaches [paper]. The specific comparison methodology—whether baselines used identical evaluation functions, the same seed programs, matched computational budgets, or identical LLM versions—is not publicly described at a level sufficient for independent verification. Readers should treat this as an author-reported aggregate claim.
The 2.3% AtCoder improvement may appear modest, but competitive programming problems are typically highly optimized, and incremental improvements at the frontier can represent meaningful algorithmic innovation.

Cross-System Budget Comparison

Direct efficiency comparisons between ShinkaEvolve and other LLM-driven evolutionary systems are constrained by the lack of shared benchmarks, identical evaluation protocols, and published per-task budgets. The following table shows what is reported for each system; these numbers should not be read as direct comparisons because they involve different tasks, evaluation functions, and computational environments.

System	Representative Task	Reported Budget	Budget Type	Source
ShinkaEvolve	Circle packing	~150	Evaluations	[paper]
ShinkaEvolve	ICFP 2025 SAT	320	Trials	[paper]
AlphaEvolve	Various (math, algorithms)	Not fully detailed per task	—	[DeepMind blog/paper]
FunSearch	Cap set, bin packing	Millions of samples (reported)	Evaluated programs	[Nature 2024 paper]

Caveat: FunSearch's million-scale budget reflects a fundamentally different computational regime (massive parallelism with cheap evaluation) rather than a direct measure of algorithm inefficiency. Budget numbers across systems are not comparable without controlling for task difficulty, evaluation cost, LLM capability, and parallelism. The 10–100× efficiency claim [paper] is a global author-reported summary, not a task-matched comparison.

6.4.1 The Circle Packing Discovery

The circle packing result merits particular attention as it illustrates ShinkaEvolve's ability to discover non-obvious algorithmic strategies [paper]. The task, which follows the AlphaEvolve circle packing benchmark, is to place $n = 26$ non-overlapping circles inside a unit square so that the sum of their radii is maximized. Starting from a simple greedy initialization, ShinkaEvolve evolved through approximately 150 evaluations to discover a three-phase algorithm [paper; repo: examples/circle_packing]:

Golden-angle spiral initialization: circles are placed along a Fermat spiral with golden-angle separation ($\approx 137.5^\circ$), providing near-uniform coverage of the square [paper].
Gradient-based refinement: circle positions are optimized using gradient descent on a differentiable overlap penalty [paper].
Simulated annealing escape: when gradient descent converges to a local optimum, simulated annealing with a cooling schedule perturbs the configuration to escape [paper].

This hybrid is notable because each component alone is well-known, but their combination with appropriate transition criteria represents a non-trivial design decision. The evolutionary search effectively composed these components through incremental mutation, demonstrating that LLM-mediated evolution can discover composite strategies that emerge from the interaction of known techniques. The authors note this hybrid strategy would be difficult for a human to design from scratch [paper].
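The first phase is simple enough to sketch. The code below generates golden-angle spiral centers scaled to fit the unit square. It is an independent illustration of the technique, not the evolved program from the repository, and the function name, the `margin` parameter, and the square-root radial spacing are assumptions.

```python
import math

# Golden angle in radians: pi * (3 - sqrt(5)), about 137.5 degrees.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def golden_spiral_centers(n: int, margin: float = 0.05):
    """Place n points on a Fermat spiral scaled to fit inside the unit square."""
    max_radius = 0.5 - margin  # keeps every point inside [margin, 1 - margin]
    centers = []
    for k in range(n):
        # Square-root spacing gives roughly equal area per point.
        r = max_radius * math.sqrt((k + 0.5) / n)
        theta = k * GOLDEN_ANGLE
        centers.append((0.5 + r * math.cos(theta), 0.5 + r * math.sin(theta)))
    return centers
```

A plain spiral covers the inscribed disc and leaves the corners of the square empty, which is one reason a later refinement phase is needed to move circles outward.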

6.5 Implementation Details

6.5.1 Cost Model

ShinkaEvolve tracks realized cost in four categories, each corresponding to a distinct type of API call, and adds an estimate for calls that are still in flight [paper; source material]:

$$C_{\text{committed}} = C_{\text{api}} + C_{\text{embed}} + C_{\text{novelty}} + C_{\text{meta}} + C_{\text{in\text{-}flight}}$$

[Source material description of cost tracking categories. Whether this is the exact formula in code or a conceptual decomposition is not confirmed.]

where:

$C_{\text{api}}$ — cost of LLM calls for program mutation (the primary expense)
$C_{\text{embed}}$ — cost of computing code embeddings for Tier 1 novelty filtering
$C_{\text{novelty}}$ — cost of LLM calls for Tier 2 novelty judgment
$C_{\text{meta}}$ — cost of LLM calls for meta-recommendation generation
$C_{\text{in\text{-}flight}}$ — estimated cost of currently pending LLM calls that have been dispatched but not yet completed

The budget guard enforces $C_{\text{committed}} \leq C_{\text{max}}$, where $C_{\text{max}}$ is the user-configured max_api_costs [paper; source material]. When num_generations is omitted from the configuration, max_api_costs becomes required, ensuring every run has a termination condition [source material].

Reported cost figures for specific experiments:

Experiment	Reported Cost	Evidence Quality	Source
ICFP 2025 SAT optimization	~$60 (320 trials)	Specific trial count and cost reported together	[paper]
Circle packing	~$30–50	Approximate; no specific trial count paired	[source material, author estimate]
AIME mathematical reasoning	~$20–40	Approximate; no specific trial count paired	[source material, author estimate]

Provenance: only the ICFP cost figure ($60 for 320 trials) is reported with a paired trial count [paper]. The circle packing and AIME estimates are marked as approximate in the source material and should be treated as rough guidance, not precise measurements. Actual costs depend heavily on the LLM backend: frontier models (GPT-4o, Gemini Pro) cost $0.10–0.50 per mutation, while local models via Ollama or vLLM reduce API costs to near-zero at the expense of potentially lower mutation quality [source material].
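As a rough consistency check on the one paired figure, the ICFP numbers imply an average cost per trial of

$$\frac{\$60}{320\ \text{trials}} \approx \$0.19\ \text{per trial}$$

This sits inside the quoted per-mutation range for frontier models, under the assumption (not stated in the sources) that a trial corresponds to roughly one mutation call and that embedding, novelty, and meta-recommendation calls are a small share of the total.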

6.5.2 Reproducibility

ShinkaEvolve provides several mechanisms supporting reproducibility [README; repo]:

Mechanism	Description	Source
Open source	Full code under Apache 2.0	[repo; README]
Hydra configs	YAML-based configuration with composition and overrides; configs can be versioned	[repo; README]
Checkpoint/resume	Population, bandit state, meta-recommendations, and prompt archive persist to disk	[paper; source material]
Bundled examples	Circle packing, Game 2048, Julia prime counting, novelty examples	[repo: examples/]
Package management	`uv pip install -e .` with pinned dependencies	[README]

Reproducibility limitations. Several factors limit exact reproducibility even with identical configurations:

LLM non-determinism: LLM outputs are stochastic; different API calls with the same prompt may produce different mutations. Model versioning by providers further compounds this—results obtained with gpt-4o-2024-08-06 may not reproduce with a later version. This is an inherent limitation of any system using commercial LLM APIs.
Evaluation stochasticity: if the evaluation function itself involves randomness (e.g., Monte Carlo simulation), scores will vary across runs. The early-stopping mechanism may further amplify this by terminating evaluations at different points.
Concurrency ordering: in the async runner, the order in which proposals complete is non-deterministic, which can affect population state and subsequent selections despite deterministic result sorting within generations [inferred from async architecture].

In summary, ShinkaEvolve runs are configurationally reproducible (same config reproduces the same experimental setup) but not outcome-reproducible (same config does not guarantee identical results). This is standard for stochastic search methods and is not a deficiency specific to ShinkaEvolve.

6.5.3 Storage & Persistence

The population database uses SQLite with WAL (Write-Ahead Logging) mode for concurrent read/write access, with improved retry logic in v1.1 [source material]. The memory footprint grows linearly with population size but is bounded by the island architecture—each island maintains a fixed-size active population and a separate archive of historical best programs [inferred from database module structure]. The prompt archive is stored in a parallel database (prompt_dbase.py) with fitness statistics per prompt [repo: database/prompt_dbase.py].
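For readers unfamiliar with WAL mode, the sketch below shows how it is enabled through Python's standard `sqlite3` module. The function name `open_population_db`, the 30-second timeout, and the `programs` table layout are hypothetical and are not the repository schema.

```python
import sqlite3


def open_population_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database in WAL mode with a busy timeout."""
    # timeout: how long a connection waits on a locked database before failing
    conn = sqlite3.connect(path, timeout=30.0)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"WAL mode unavailable, got {mode!r}")
    # Hypothetical table layout, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS programs ("
        "id INTEGER PRIMARY KEY, island_id INTEGER, score REAL, code TEXT)"
    )
    conn.commit()
    return conn
```

WAL mode lets readers proceed while a single writer appends to the log, which suits a runner that inserts results while other tasks sample parents. Concurrent writers still serialize, which is presumably why retry logic matters.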

6.6 Comparative Analysis

The following table positions ShinkaEvolve relative to other LLM-driven evolutionary systems covered in this survey. Comparisons are limited to documented features; direct head-to-head benchmarks on identical tasks are not available for most pairs.

Feature	ShinkaEvolve	AlphaEvolve (Ch. 4)	OpenEvolve (Ch. 5)	FunSearch
Venue / Year	ICLR 2026	Google DeepMind 2025	Open-source reimpl.	Nature 2024
License	Apache 2.0	Proprietary	Open source	Not released
Parent selection	Power-law, weighted, beam [paper]	MAP-Elites based	Tournament	Fitness-proportionate
Novelty filtering	Two-tier (embedding + LLM judge) [paper]	Not documented	Not documented	Not documented
Multi-model ensemble	Bandit-selected ensemble [paper]	Gemini family	Configurable	Single model family
Island model	Yes (4 default, dynamic spawning) [paper; example config]	MAP-Elites niches	Yes	Island model
Prompt co-evolution	Yes (v1.1) [paper]	Not documented	Not documented	No
Async execution	Yes (5–10× speedup reported) [paper]	Distributed	Yes	Distributed
Budget control	Committed-cost guard [paper; source material]	Internal quota	Generation-based	Not documented
Early stopping	Bayesian / CI / Hybrid [repo: config]	Not documented	Not documented	Not documented

Caveat: "Not documented" means the feature is not described in available sources; it may exist in the actual implementation. This table compares documented design choices, not empirical performance, since systems have been evaluated on different tasks with different budgets.

ShinkaEvolve's primary differentiator is its explicit focus on sample efficiency [paper]. Where FunSearch and AlphaEvolve rely on large evaluation budgets enabled by proprietary compute infrastructure, ShinkaEvolve is designed for researchers with limited budgets. The novelty filter, bandit-based model selection, and power-law parent sampling are each intended to reduce wasted evaluations. Several trade-offs apply:

The novelty filter introduces additional hyperparameters (embedding threshold $\theta_{\text{novelty}}$, LLM judge acceptance cutoff) that require calibration per domain, and the two-tier cascade adds latency to each proposal.
The efficiency advantage is reported by the authors without published ablations or task-matched baselines, making it difficult to attribute the gain to specific innovations or to assess how much of the improvement generalizes across problem types.
Systems designed for massive parallelism (FunSearch, AlphaEvolve) may achieve higher absolute performance given sufficient compute; ShinkaEvolve's advantage is in the low-budget regime where each evaluation is precious.

6.7 Limitations & Discussion

Novelty Threshold Sensitivity

The two-tier novelty filter's effectiveness depends on proper calibration of $\theta_{\text{novelty}}$ and the LLM judge's acceptance threshold. If the embedding threshold is too aggressive, genuinely novel candidates may be rejected (false negatives); if too lenient, the filter provides little benefit (false positives). The source material does not report sensitivity analyses or provide guidelines for threshold selection across different problem domains [paper]. Practitioners should expect to tune these parameters when applying ShinkaEvolve to new tasks.

Credit Assignment in Co-Evolution

When a mutation succeeds, the credit is distributed across multiple contributing factors: the parent program, the selected LLM, the system prompt, and the patch type. The current framework attributes success independently to each factor (bandit updates the LLM statistics, prompt archive updates the prompt fitness) [paper; inferred from architecture], but does not model interactions between these factors. A successful mutation may be due to the combination of a specific prompt and a specific model, a signal that is lost when credit is assigned marginally. Factorial or contextual bandit approaches could capture these interactions but would increase the sample complexity of learning.
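A small synthetic example shows how marginal credit can hide an interaction. The records below are invented for illustration and are not measurements from ShinkaEvolve; the function name and record layout are likewise assumptions.

```python
from collections import defaultdict


def success_rates(outcomes):
    """Compute marginal and joint success rates from (prompt, model, success) records."""
    tallies = {
        "prompt": defaultdict(list),
        "model": defaultdict(list),
        "joint": defaultdict(list),
    }
    for prompt, model, success in outcomes:
        tallies["prompt"][prompt].append(success)
        tallies["model"][model].append(success)
        tallies["joint"][(prompt, model)].append(success)
    return {
        kind: {key: sum(vals) / len(vals) for key, vals in groups.items()}
        for kind, groups in tallies.items()
    }


# Synthetic records: each prompt only works with one particular model.
OUTCOMES = [
    ("p1", "m1", 1), ("p1", "m2", 0),
    ("p2", "m1", 0), ("p2", "m2", 1),
] * 5
```

Here every marginal rate is 0.5, so independent per-prompt and per-model statistics see no difference between the options, while the joint rates are 1.0 or 0.0. The cost of tracking joint statistics is that the number of cells to estimate grows multiplicatively.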

Scalability of Embedding-Based Novelty

Tier 1 novelty filtering compares each new candidate against all population members, requiring $O(N)$ embedding similarity computations per proposal, where $N$ is the population size. For large populations or long-running experiments, this linear scan may become a bottleneck, though approximate nearest-neighbor indices (e.g., FAISS, Annoy) could mitigate this. Whether the implementation uses such optimizations is not documented.
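The linear scan can be sketched in a few lines of plain Python. This is a minimal illustration of an embedding-similarity filter under stated assumptions: the function names are hypothetical, cosine similarity is assumed as the metric, and rejecting at similarity greater than or equal to the threshold is a guess at the comparison direction.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def passes_tier1(candidate, population, threshold):
    """Linear scan over stored embeddings: one similarity computation per member.

    Returns (accepted, max_similarity). The candidate is rejected when its
    closest neighbour is at least `threshold` similar (assumed rule).
    """
    max_sim = max(
        (cosine_similarity(candidate, emb) for emb in population),
        default=-1.0,  # an empty population rejects nothing
    )
    return max_sim < threshold, max_sim
```

Each call touches every stored embedding once, which is the $O(N)$ cost described above; an approximate nearest-neighbor index would replace the `max` over the whole population with a sublinear lookup.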

Evaluation Function Design

ShinkaEvolve's generality hinges on the user providing an appropriate evaluation function. For well-defined optimization problems (packing, competitive programming), the evaluation function is natural. For more open-ended tasks (code quality, maintainability), defining a numeric fitness score is itself a research challenge that the framework does not address.

Reproducibility Under LLM Drift

Results depend on specific LLM versions, which are updated by providers without notice. An experiment run with gpt-4o-2024-08-06 may not reproduce with a later model version, even with identical seeds and configurations. The use of local models (Ollama, vLLM) offers better reproducibility but at the cost of potentially lower mutation quality. This limitation affects all LLM-dependent evolutionary systems, not only ShinkaEvolve.

Limited Ablation Evidence

The source material reports the combined effect of all three sample-efficiency innovations (parent selection, novelty filtering, bandit allocation) but does not provide ablation studies isolating each component's individual contribution [paper]. It is therefore unclear how much of the reported efficiency gain is attributable to each mechanism, or whether they exhibit synergistic effects that would not be captured by independent ablation. This is the most significant gap in the empirical evidence: without ablations, the relative importance of each innovation cannot be assessed, and practitioners cannot determine which components are essential versus optional for their use case.

UCB1 Under Non-Stationarity

As discussed in §6.3.5, the bandit controller uses classical UCB1, which assumes stationary reward distributions. The formal regret guarantee does not apply when the quality of each LLM's mutations changes with the evolving population. While practical performance appears acceptable based on reported results, this theoretical gap means the system's model selection may be suboptimal in later stages of long-running searches when the reward landscape has shifted substantially from the initial distribution.
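For reference, the classical UCB1 index can be computed as in the sketch below. This is the textbook formula rather than an excerpt from the repository; the function name and the list-based inputs are assumptions.

```python
import math


def ucb1_scores(counts, mean_rewards, exploration=math.sqrt(2.0)):
    """Classical UCB1 index per arm: mean + sqrt(2 * ln(t) / n).

    Untried arms get +inf so that each arm is selected at least once.
    """
    total = sum(counts)
    scores = []
    for n, mean in zip(counts, mean_rewards):
        if n == 0:
            scores.append(math.inf)
        else:
            scores.append(mean + exploration * math.sqrt(math.log(total) / n))
    return scores
```

Because `mean_rewards` averages over the entire history, a model whose mutations were productive early keeps a high index after its usefulness drops. Sliding-window and discounted variants are standard remedies in the bandit literature, though the sources do not say whether ShinkaEvolve uses them.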

6.8 Meta-Level Mechanisms & Self-Improvement

Beyond the core evolution loop, ShinkaEvolve incorporates several meta-level mechanisms that improve the search process across generations [paper]:

Meta-recommendations [paper]: after each generation, the system generates high-level textual insights about which types of mutations have been successful. These insights are included in subsequent LLM prompts, providing a form of accumulated search wisdom that biases future mutations toward productive directions. Meta-recommendations persist across checkpoint/resume sessions [source material].
Adaptive mutation scheduling [paper]: the ratio of diff, full, and cross patch types adapts based on their observed success rates. The precise adaptation rule is not formally documented (see §6.3.9).
Stagnation detection and response [paper]: when island performance plateaus, the system can spawn new islands with randomized strategies to inject diversity. The stagnation detection heuristic is not publicly specified (see §6.3.9).

These meta-level features create a hierarchy of adaptation: at the lowest level, individual programs evolve through LLM-mediated mutation; at the middle level, the LLM allocation and mutation type ratios adapt through bandit algorithms and scheduling; at the highest level, system prompts co-evolve and meta-recommendations accumulate qualitative search knowledge. This hierarchical structure is a recurring theme in modern evolutionary AI systems—the recognition that the search process itself can and should be subject to optimization.

This hierarchical adaptation is architecturally elegant, but the available evidence characterizes only the lowest level (program evolution) with concrete results. The middle and upper levels are described as features but are not independently validated through ablation or comparative studies. Whether the meta-level mechanisms provide substantial improvement over simpler fixed strategies (e.g., uniform mutation scheduling, fixed prompts) remains an open empirical question.

Summary

Key takeaway: ShinkaEvolve demonstrates that sample efficiency is a critical bottleneck in LLM-driven evolutionary code search, and that principled solutions—power-law parent selection, two-tier novelty filtering, and bandit-based model allocation—can substantially reduce evaluation budgets while maintaining or exceeding the solution quality of prior systems [paper].

Main contribution: The framework introduces a coherent set of sample-efficiency innovations designed to make LLM-driven algorithm evolution practical for researchers without access to massive compute budgets. The two-tier novelty filtering (embedding similarity + LLM-as-judge) is, to the authors' knowledge, the first systematic approach to redundancy elimination in this setting. Prompt co-evolution adds a second evolving population that improves the mutation operator itself over time [paper].

What researchers should know: ShinkaEvolve is fully open-source (Apache 2.0) and designed for immediate use: any problem with mutable code blocks and a numeric evaluation function can be optimized. The ICLR 2026 publication and results on five tasks (circle packing, AIME, AtCoder heuristic problems, MoE training, ICFP 2025) establish it as a serious research tool [paper]. However, several caveats apply: (1) the 10–100× efficiency claim is author-reported without published comparison methodology or ablation studies; (2) several mechanisms (adaptive mutation scheduling, stagnation detection, early stopping rules, prompt co-evolution details) are not fully documented in public sources (see §6.3.9); (3) the UCB1 bandit operates under non-stationary conditions where formal guarantees do not hold (see §6.3.5); and (4) results depend on specific LLM versions and are not exactly reproducible. Researchers adopting ShinkaEvolve should plan for hyperparameter tuning, particularly of the novelty filter thresholds, and consult the repository source code for implementation details beyond what the paper documents.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}