ShinkaEvolve
Open-Ended Program Evolution Framework
Authors: Takuya Akiba & Sakana AI Team
Venue: ICLR 2026
License: Apache 2.0
Language: Python ≥3.10
Stars: 855 ★
Repository: github.com/SakanaAI/ShinkaEvolve
Table of Contents
- Core Contribution
- Supported Solutions
- Architecture
- Components
- Core Mechanisms
- Code Modification & Mutation Operators
- Parent Selection & Sampling
- Population Management & Island Architecture
- Solution Diversity & Novelty
- LLM Orchestration & Model Selection
- Search Strategies
- Evaluation & Fitness Assessment
- Prompt Engineering & Co-Evolution
- Cost Control & Budget Management
- Meta-Level & Self-Improvement
- LLM Integration
- Key Results
- Reproducibility
- Cost Information
- Memory Management
- Continued Learning
- Applications & Use Cases
1. Core Contribution
ShinkaEvolve is an evolutionary code optimization framework that discovers algorithms using LLMs with dramatically improved sample efficiency compared to prior approaches like AlphaEvolve. The system constructs an archive of evaluated programs, generates new programs via LLM ensembles acting as intelligent mutation operators, and evaluates their fitness through automated metrics.
**Key Innovation:** Three sample-efficiency innovations that reduce the number of evaluations needed by 10-100x compared to prior work:
1. **Parent Sampling Strategy** — Balances exploitation of good solutions with exploration of new ideas
2. **Novelty-Based Program Rejection** — Avoids evaluating minor variations via code embedding similarity and LLM-as-novelty-judge
3. **Adaptive LLM Prioritization** — Bandit-based dynamic selection of the best LLM from an ensemble
Unlike prior systems that treat LLMs as black-box text generators, ShinkaEvolve deeply integrates LLMs into the evolutionary loop as intelligent mutation operators that understand code semantics, can reason about algorithmic improvements, and propose structured modifications. The framework achieved state-of-the-art results across circle packing, mathematical reasoning (AIME), competitive programming, and MoE training optimization.
2. Supported Solutions
| Domain | Problem Types | Generalization |
|---|---|---|
| Algorithm Optimization | SAT solvers, sorting, search algorithms | Any algorithm with a measurable fitness function |
| Mathematical Reasoning | AIME problems, mathematical conjectures | Generalizes across problem years and LLM backends |
| Competitive Programming | Heuristic contests (AtCoder, ICFP) | Applicable to any optimization contest |
| ML Optimization | MoE training strategies, loss functions | Any ML pipeline with trainable components |
| Scientific Discovery | Circle packing, combinatorial optimization | Any problem expressible as code with evaluation |
| Infrastructure | Code optimization, performance tuning | General-purpose code improvement |
**Generalization Capability:** ShinkaEvolve is designed as a general-purpose framework. Any problem that can be expressed as (1) a program with mutable code blocks, and (2) an evaluation function returning a numeric score, can be optimized using this framework.
3. Architecture
┌─────────────────────────────────────────────────────────────────┐
│ ShinkaEvolve Framework │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ shinka_launch│ │ shinka_run │ │shinka_visualize│ │
│ │ (Hydra CLI) │ │ (Agent CLI) │ │ (WebUI) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌──────▼────────────────────▼────────────────────▼──────┐ │
│ │ Evolution Runner (sync/async) │ │
│ │ ┌─────────────────┐ ┌──────────────────────────┐ │ │
│ │ │EvolutionRunner │ │ AsyncEvolutionRunner │ │ │
│ │ │(sequential gen) │ │ (concurrent gen+eval) │ │ │
│ │ │ │ │ 5-10x speedup │ │ │
│ │ └─────────────────┘ └──────────────────────────┘ │ │
│ └───────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────┬───────────────┼───────────────┬───────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌────┐ ┌────────┐ ┌──────────────┐ ┌─────────┐ ┌────────┐ │
│ │LLM │ │Novelty │ │ Population │ │Evaluator│ │ Prompt │ │
│ │Ens.│ │ Judge │ │ Database │ │ Wrapper │ │Evolver │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │Prov│ │Embed + │ │Islands + Arch│ │Parallel │ │Sys.Prom│ │
│ │ider│ │LLM-as- │ │ive + Migrat.│ │Exec + │ │Archive │ │
│ │Mod.│ │Judge │ │Island Sampl.│ │E.Stop │ │Mutation│ │
│ └────┘ └────────┘ └──────────────┘ └─────────┘ └────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Bandit Controller │ │
│ │ Thompson/UCB-based LLM selection + meta-recommendations │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Three Execution Modes
EvolutionRunner
Synchronous pipeline. Parallel evaluations but sequential program generation. Best for debugging and small runs.
AsyncEvolutionRunner
Fully concurrent proposals and evaluations. Achieves 5-10x speedup. Best for production runs.
shinka_launch
Hydra-based CLI with YAML configuration. Supports config composition and overrides.
4. Components
Directory Structure
shinka/
├── core/ # Evolution runners (sync + async), novelty judge
│ ├── evolution_runner.py # Main synchronous loop
│ ├── async_runner.py # Async concurrent pipeline
│ ├── async_summarizer.py # Async result summarization
│ ├── async_novelty_judge.py # Async novelty evaluation
│ └── wrap_eval.py # Evaluation wrapper with parallelism
├── database/ # Population management
│ ├── database.py # Island-based population database
│ ├── async_dbase.py # Async database operations
│ ├── island_sampler.py # Island selection strategies
│ └── prompt_dbase.py # Prompt population storage
├── edit/ # Code editing operations
├── embed/ # Embedding providers for novelty
├── llm/ # LLM provider-based modules
│ └── providers/ # OpenAI, Google, local model support
├── prompts/ # Prompt templates
│ ├── prompts_fix.py # Fix-mode for incorrect populations
│ └── prompts_prompt_evo.py # Prompt evolution templates
├── plots/ # Visualization modules
│ ├── plot_costs.py, plot_evals.py, plot_time.py, plot_llm.py
├── webui/ # Dashboard and compare views
│ ├── index.html, compare.html
└── utils/ # Utility functions
Configuration Layers
| Config | Key Parameters | Purpose |
|---|---|---|
| `EvolutionConfig` | num_generations (10+), patch_types (diff/full/cross), novelty params, LLM models | Controls evolutionary search behavior |
| `DatabaseConfig` | num_islands (4), migration_interval (10 gen), migration_rate (10%), elite_ratio (30%) | Population topology and dynamics |
| `JobConfig` | LocalJobConfig, SlurmDockerJobConfig, SlurmCondaJobConfig | Execution environment |
5. Core Mechanisms
5.1 Code Modification & Mutation Operators
ShinkaEvolve uses LLMs as intelligent mutation operators. Programs contain EVOLVE-BLOCK-START/END markers that delineate mutable code regions. The system supports three mutation/patch types:
| Patch Type | Description | Use Case |
|---|---|---|
diff |
Generate unified diff patches to modify specific lines | Small targeted improvements |
full |
Generate complete replacement of evolve blocks | Major algorithmic changes |
cross |
Combine elements from two parent programs | Crossover-like recombination |
# Example: initial program with evolve markers
# === EVOLVE-BLOCK-START ===
def optimize(problem):
    """Optimize the given problem instance."""
    solution = greedy_initialize(problem)
    for _ in range(1000):
        neighbor = random_perturbation(solution)
        if evaluate(neighbor) > evaluate(solution):
            solution = neighbor
    return solution
# === EVOLVE-BLOCK-END ===
The LLM receives the current program, parent history, evaluation results, and meta-recommendations, then proposes modifications within the marked blocks. This is fundamentally different from random mutations in classical genetic programming — the LLM understands the algorithmic intent and proposes semantically meaningful changes.
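As a concrete illustration, the marker convention above can be handled with a few lines of string processing. This is a sketch, not ShinkaEvolve's actual implementation; the helper names and regex are illustrative:

```python
import re

EVOLVE_RE = re.compile(
    r"# === EVOLVE-BLOCK-START ===\n(.*?)# === EVOLVE-BLOCK-END ===",
    re.DOTALL,
)

def extract_evolve_blocks(source: str) -> list[str]:
    """Return the code inside each EVOLVE-BLOCK-START/END pair."""
    return EVOLVE_RE.findall(source)

def replace_evolve_block(source: str, new_body: str, index: int = 0) -> str:
    """Apply a 'full'-style patch: swap in a replacement body for the
    index-th evolve block, leaving all immutable code untouched."""
    counter = [0]

    def repl(match: re.Match) -> str:
        out = match.group(0)
        if counter[0] == index:
            out = ("# === EVOLVE-BLOCK-START ===\n"
                   + new_body
                   + "# === EVOLVE-BLOCK-END ===")
        counter[0] += 1
        return out

    return EVOLVE_RE.sub(repl, source)
```

Only the marked region changes; everything outside the markers is guaranteed to survive the mutation verbatim.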
5.2 Parent Selection & Sampling
Parent selection balances exploitation (using known good solutions) with exploration (trying novel approaches). ShinkaEvolve supports multiple strategies:
| Strategy | Mechanism | When Used |
|---|---|---|
| Power-Law | Higher-ranked programs exponentially more likely to be selected: P(i) ∝ i^-α | Default exploitation-heavy mode |
| Weighted | Selection proportional to fitness score: P(i) ∝ f(i) / Σf | Balanced exploration-exploitation |
| Beam Search | Top-k programs expanded exhaustively | Focused depth-first refinement |
$$ P(\mathrm{rank}_i) = \frac{\mathrm{rank}_i^{-\alpha}}{\sum_j \mathrm{rank}_j^{-\alpha}}, \quad \alpha \in [1.0, 3.0] $$
# Python implementation of power-law parent selection
import numpy as np

def power_law_selection(population, scores, alpha=2.0):
    """Select a parent using a power-law distribution over ranked programs."""
    # Rank programs by score (highest first)
    ranked_indices = np.argsort(scores)[::-1]
    ranks = np.arange(1, len(ranked_indices) + 1)
    # Compute power-law probabilities
    probs = ranks.astype(float) ** (-alpha)
    probs /= probs.sum()
    # Sample a parent
    selected_rank = np.random.choice(len(ranks), p=probs)
    return population[ranked_indices[selected_rank]]
Additionally, the elite selection ratio (default 30%) determines how many top programs from the archive are used as "inspiration" context in prompts to the LLM. This provides the LLM with examples of successful approaches without directly copying them.
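A minimal sketch of how the elite ratio could translate into prompt context. The archive layout (a list of `(code, score)` tuples) and the formatting are assumptions, not the framework's actual API:

```python
def build_inspiration_context(archive, elite_ratio=0.3, max_examples=3):
    """Format the top elite_ratio fraction of the archive as in-prompt
    'inspiration' examples for the mutation LLM."""
    ranked = sorted(archive, key=lambda p: p[1], reverse=True)
    n_elite = max(1, int(len(ranked) * elite_ratio))
    elites = ranked[:n_elite][:max_examples]  # cap what goes into the prompt
    sections = [
        f"# Inspiration {i + 1} (score={score:.3f})\n{code}"
        for i, (code, score) in enumerate(elites)
    ]
    return "\n\n".join(sections)
```

The `max_examples` cap matters in practice: even at a 30% elite ratio, only a handful of programs fit in the LLM's context window.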
5.3 Population Management & Island Architecture
ShinkaEvolve implements a multi-island evolutionary model inspired by biological island biogeography. Each island maintains a semi-independent subpopulation that evolves in relative isolation, with periodic migration of elite individuals between islands.
Island 1 Island 2 Island 3 Island 4
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Pop: 25 │ │ Pop: 25 │ │ Pop: 25 │ │ Pop: 25 │
│ Elite: 8 │ │ Elite: 8 │ │ Elite: 8 │ │ Elite: 8 │
│ Archive: 50 │ │ Archive: 50 │ │ Archive: 50 │ │ Archive: 50 │
│ │ │ │ │ │ │ │
│ Strategy: A │ │ Strategy: B │ │ Strategy: C │ │ Strategy: D │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ Migration (10%) │ │
└────────every 10 gen─────┴────────────────┘
│
┌─────▼──────┐
│ Global │
│ Archive │
│ Best-of-All│
└────────────┘
Island Configuration
- num_islands: Default 4 islands
- migration_interval: Every 10 generations
- migration_rate: 10% of elite programs migrate
- Dynamic island spawning: New islands spawned on stagnation detection (v1.1)
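The migration step above can be sketched as follows, assuming a ring topology and a list-of-tuples population representation (both illustrative; the actual topology and data structures may differ):

```python
def migrate(islands, migration_rate=0.10):
    """Ring-topology migration sketch: copy the top fraction of each
    island's programs to the next island in the ring.

    `islands` is a list of lists of (code, score) tuples.
    """
    emigrants = []
    for pop in islands:
        elites = sorted(pop, key=lambda p: p[1], reverse=True)
        k = max(1, int(len(elites) * migration_rate))
        emigrants.append(elites[:k])
    for i, movers in enumerate(emigrants):
        dest = islands[(i + 1) % len(islands)]
        dest.extend(movers)  # copies, not removals: source islands keep them
    return islands
```

Copying rather than moving elites preserves each island's own archive while still spreading good genetic material.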
Island Sampling Strategies (island_sampler.py)
| Strategy | Behavior |
|---|---|
| `uniform` | Equal probability for each island regardless of performance |
| `equal` | Round-robin cycling through islands |
| `proportional` | Probability proportional to island's best score |
| `weighted` | Custom weights per island (e.g., favoring diverse islands) |
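A sketch of two of these strategies (illustrative logic, not the actual code in island_sampler.py):

```python
import random

def sample_island(best_scores, strategy="proportional", rng=random):
    """Pick an island index given each island's best score.

    Sketch of the `uniform` and `proportional` strategies only;
    `equal` and `weighted` are omitted for brevity.
    """
    n = len(best_scores)
    if strategy == "uniform":
        return rng.randrange(n)
    if strategy == "proportional":
        total = sum(best_scores)
        if total <= 0:
            return rng.randrange(n)  # fall back when all scores are zero
        # Roulette-wheel sampling over best scores
        r = rng.random() * total
        acc = 0.0
        for i, score in enumerate(best_scores):
            acc += score
            if r <= acc:
                return i
        return n - 1
    raise ValueError(f"unknown strategy: {strategy}")
```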
5.4 Solution Diversity & Novelty
ShinkaEvolve employs a two-tier novelty system to prevent the population from converging to a single solution pattern:
Tier 1: Code Embedding Similarity
Before evaluation, new programs are compared to existing population members using code embeddings. Programs that are too similar (below a novelty threshold) are rejected without costly evaluation.
$$ \mathrm{novelty}(p) = \min_{q \in \mathrm{Pop}} d(\mathrm{embed}(p), \mathrm{embed}(q)), \quad \text{accept } p \text{ iff } \mathrm{novelty}(p) > \theta_{\mathrm{novelty}} $$
# Novelty rejection via embedding similarity
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_novelty_embedding(new_program, population, embedder, threshold=0.15):
    """Reject programs too similar to an existing population member."""
    new_emb = embedder.encode(new_program.code)
    for existing in population:
        existing_emb = embedder.encode(existing.code)
        if cosine_similarity(new_emb, existing_emb) > (1.0 - threshold):
            return False  # Too similar: reject without costly evaluation
    return True  # Novel enough to evaluate
Tier 2: LLM-as-Novelty-Judge
For programs passing the embedding filter, an LLM evaluates whether the proposed changes represent a genuinely novel algorithmic approach or merely a superficial modification (e.g., variable renaming, comment changes).
# LLM-based novelty assessment (async_novelty_judge.py)
async def judge_novelty(parent_code, child_code, llm_client):
    """Use an LLM to assess whether a mutation is algorithmically novel."""
    prompt = f"""Compare these two programs and assess if the child
introduces a genuinely novel algorithmic approach:
Parent: {parent_code}
Child: {child_code}
Rate novelty 1-5: 1=cosmetic change, 5=fundamentally new approach.
Explain your reasoning."""
    response = await llm_client.generate(prompt)
    novelty_score = extract_score(response)
    return novelty_score >= 3  # Accept only genuinely novel changes
5.5 LLM Orchestration & Model Selection
ShinkaEvolve uses a multi-armed bandit approach to dynamically select which LLM from the ensemble to use for each mutation. The bandit tracks the expected improvement from each LLM and adapts allocation over time.
$$ \mathrm{UCB}_i(t) = \mu_i(t) + c \sqrt{\frac{\ln t}{n_i(t)}} $$
where μ_i(t) is the average improvement from LLM i, n_i(t) is the number of times it was selected, and c controls exploration.
# Bandit-based LLM selection
import math

class BanditLLMSelector:
    def __init__(self, models, exploration_weight=1.4):
        self.models = models
        self.c = exploration_weight
        self.successes = {m: 0 for m in models}
        self.trials = {m: 1 for m in models}  # Avoid division by zero
        self.total_trials = len(models)

    def select(self):
        """Select an LLM using the UCB1 strategy."""
        ucb_scores = {}
        for model in self.models:
            mu = self.successes[model] / self.trials[model]
            exploration = self.c * math.sqrt(
                math.log(self.total_trials) / self.trials[model]
            )
            ucb_scores[model] = mu + exploration
        return max(ucb_scores, key=ucb_scores.get)

    def update(self, model, improvement):
        """Update model statistics after an evaluation."""
        self.trials[model] += 1
        self.total_trials += 1
        if improvement > 0:
            self.successes[model] += 1
The system supports diverse LLM backends through a provider-based architecture:
- `"openrouter/qwen/qwen3-coder"` — OpenRouter-hosted models
- `"local/qwen2.5-coder@localhost"` — Local model endpoints
- Google GenAI, OpenAI, and Anthropic providers
5.6 Search Strategies
ShinkaEvolve supports multiple search strategies that can be combined:
- Generational Evolution: Standard generational model with population replacement
- Steady-State: Replace worst individual immediately on finding better solution
- Beam Search: Expand top-k programs at each generation for depth-first refinement
- Fix Mode (v1.1): When no correct program exists in population, special prompts focus LLM on fixing fundamental errors rather than optimizing performance
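The steady-state variant above can be sketched in a few lines (illustrative; the `(code, score)` population representation is an assumption):

```python
def steady_state_insert(population, candidate, max_size):
    """Steady-state sketch: admit the candidate immediately and, if the
    population now exceeds capacity, evict the current worst program.

    `population` is a list of (code, score) tuples.
    """
    population.append(candidate)
    if len(population) > max_size:
        worst = min(population, key=lambda p: p[1])
        population.remove(worst)
    return population
```

Unlike generational replacement, improvements take effect immediately and can be selected as parents in the very next proposal.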
5.7 Evaluation & Fitness Assessment
The evaluation system (wrap_eval.py) provides robust program assessment:
# Evaluation setup: users provide two files.

# evaluate.py — runs experiments and returns metrics
def evaluate(program_path: str) -> dict:
    """Run the program and return metrics."""
    result = run_program(program_path)
    return {
        "combined_score": result.score,  # Required key
        "accuracy": result.accuracy,
        "runtime_ms": result.runtime,
    }

# initial.py — starting solution with EVOLVE-BLOCK markers
# === EVOLVE-BLOCK-START ===
def solve(input_data):
    return baseline_solution(input_data)
# === EVOLVE-BLOCK-END ===
Evaluation Features (v1.1)
- Per-run parallelism: `run_workers` with optional `max_workers_cap`
- Early stopping: `bayesian`, `ci` (confidence interval), and `hybrid` modes
- Plot artifact generation: Optional visualization of evaluation results
- Deterministic result ordering: Consistent results regardless of parallel execution order
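A rough sketch of the idea behind the `ci` early-stopping mode: abort remaining repeat runs once the upper confidence bound on the candidate's mean score drops below the best score seen so far. The exact criterion ShinkaEvolve uses is not shown here; this is an assumption:

```python
import math
import statistics

def should_stop_early(scores, best_so_far, z=1.96, min_runs=3):
    """CI-based early-stopping sketch.

    Stops further repeat runs when mean + z * SEM < best_so_far,
    i.e., the candidate is very unlikely to beat the incumbent.
    """
    if len(scores) < min_runs:
        return False  # not enough runs for a meaningful interval
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean + z * sem < best_so_far
```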
5.8 Prompt Engineering & Co-Evolution
A major innovation in v1.1 is prompt co-evolution: the system prompts used to guide LLM mutations evolve alongside the program population.
**Prompt Co-Evolution Components:**
- **System Prompt Archive** (prompt_dbase.py): Population of system prompts stored alongside the program population
- **Prompt Mutation** (prompt_evolver.py): LLM modifies system prompts based on which prompts led to successful mutations
- **Prompt Fitness Tracking**: Each prompt tracks the average improvement of the programs it helped generate
# Prompt co-evolution pseudocode
class PromptEvolver:
    def __init__(self, prompt_archive):
        self.archive = prompt_archive  # PromptDatabase

    def evolve_prompt(self, parent_prompt, success_history):
        """Mutate a system prompt based on its success history."""
        meta_prompt = f"""This system prompt was used to guide code mutations:
{parent_prompt}
History of mutations it produced:
{format_history(success_history)}
Improve this prompt to generate better code mutations.
Focus on what worked and amplify those strategies."""
        return self.llm.generate(meta_prompt)

    def select_prompt(self):
        """Select a system prompt using fitness-proportionate selection."""
        prompts = self.archive.get_all()
        scores = [p.avg_improvement for p in prompts]
        return weighted_sample(prompts, scores)
5.9 Cost Control & Budget Management
The v1.1 release introduced max_api_costs as a first-class runtime budget guard:
**Budget Model:** Committed cost = realized DB costs (`api_costs` + `embed_cost` + `novelty_cost` + `meta_cost`) + estimated cost of in-flight work. Once budget reached, new proposals stop and the runner drains ongoing jobs.
# Cost tracking pseudocode
class CostTracker:
    def __init__(self, max_budget):
        self.max_budget = max_budget
        self.realized_costs = {
            "api_costs": 0.0,     # LLM generation costs
            "embed_cost": 0.0,    # Embedding computation
            "novelty_cost": 0.0,  # Novelty judge LLM calls
            "meta_cost": 0.0,     # Meta-recommendation costs
        }
        self.in_flight_estimate = 0.0

    @property
    def committed_cost(self):
        return sum(self.realized_costs.values()) + self.in_flight_estimate

    def can_propose(self):
        return self.committed_cost < self.max_budget
When num_generations is omitted from configuration, max_api_costs becomes required, ensuring runs always have a termination condition.
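That rule can be expressed as a small check (illustrative; not the framework's actual config-validation code):

```python
def validate_termination(num_generations=None, max_api_costs=None):
    """Every run needs at least one stopping condition: a generation
    cap, a cost budget, or both. Names mirror the config keys; the
    logic here is a sketch."""
    if num_generations is None and max_api_costs is None:
        raise ValueError(
            "Set num_generations and/or max_api_costs: "
            "a run must have a termination condition."
        )
    return {"num_generations": num_generations, "max_api_costs": max_api_costs}
```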
5.10 Meta-Level & Self-Improvement
ShinkaEvolve includes meta-level mechanisms that improve the search process itself:
- Meta-recommendations: After each generation, the system generates high-level insights about what types of mutations have been successful, which are passed to the LLM in subsequent generations
- Bandit state persistence: LLM selection statistics persist across resume sessions (v1.1)
- Adaptive mutation scheduling: The ratio of diff/full/cross mutations adapts based on success rates
- Stagnation detection: When island performance plateaus, dynamic island spawning creates new populations with randomized strategies
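Adaptive mutation scheduling can be sketched as success-rate-weighted sampling over the three patch types. The Laplace smoothing and the exact weighting are assumptions; ShinkaEvolve's real scheduler may differ:

```python
import random

class MutationScheduler:
    """Weight diff/full/cross patch types by empirical success rate."""

    def __init__(self, rng=random):
        # Laplace-smoothed counts so every type starts with rate 0.5
        self.stats = {t: {"wins": 1, "tries": 2} for t in ("diff", "full", "cross")}
        self.rng = rng

    def rates(self):
        return {t: s["wins"] / s["tries"] for t, s in self.stats.items()}

    def sample(self):
        """Roulette-wheel sample a patch type by its success rate."""
        rates = self.rates()
        r = self.rng.random() * sum(rates.values())
        acc = 0.0
        for patch_type, rate in rates.items():
            acc += rate
            if r <= acc:
                return patch_type
        return "diff"

    def update(self, patch_type, improved):
        self.stats[patch_type]["tries"] += 1
        if improved:
            self.stats[patch_type]["wins"] += 1
```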
6. LLM Integration
ShinkaEvolve's provider-based LLM architecture supports any model accessible via API:
| Provider | Example Backend ID | Notes |
|---|---|---|
| OpenRouter | `openrouter/qwen/qwen3-coder` | Access to 100+ models |
| Local | `local/qwen2.5-coder@localhost` | Ollama, vLLM, llama.cpp |
| Google | `google/gemini-2.5-pro` | Via google-genai SDK |
| OpenAI | `openai/gpt-4o` | Standard API |
The system can run ensembles of models simultaneously, with the bandit controller learning which models perform best for the specific task at hand. This is particularly powerful when different models have complementary strengths (e.g., one model is better at algorithmic innovation while another excels at code optimization).
7. Key Results
| Domain | Efficiency | Performance | Notes |
|---|---|---|---|
| Circle Packing | 150 samples | State-of-the-art solution | Hybrid: golden spiral + gradient + simulated annealing |
| AIME Math | 75 generations | Generalizes across years/LLMs | Mathematical reasoning optimization |
| Competitive Prog. | Multiple tasks | 2.3% avg improvement | AtCoder heuristic problems |
| MoE Training | 30 generations | Beat DeepSeek Global LBL | Mixture-of-Experts loss function |
| ICFP 2025 | 320 trials, ~$60 | 10x speedup in SAT | Won programming contest |
**Circle Packing Discovery:** ShinkaEvolve identified a sophisticated hybrid algorithm combining golden-angle spiral initialization, gradient-based refinement, and simulated annealing to escape local optima — a solution that would be extremely difficult for a human to design from scratch.
8. Reproducibility
- Open source: Full code available under Apache 2.0
- Package manager: `uv pip install -e .`
- Examples included: Circle packing, Game 2048, Julia prime counting, novelty examples
- Configuration: Hydra-based YAML configs for full reproducibility
- WebUI: Real-time visualization via `shinka_visualize`
- Resume support: Checkpoint/resume with meta memory and bandit state
# Quick start
git clone https://github.com/SakanaAI/ShinkaEvolve
cd ShinkaEvolve
uv pip install -e .
# Run circle packing example
shinka_launch variant=circle_packing_example
# Or with agent CLI
shinka_run --task-dir examples/circle_packing --num_generations 20
# Customize
shinka_run --set evo.max_parallel_jobs=6 --set db.num_islands=3
9. Cost Information
| Experiment | Trials | Cost | Outcome |
|---|---|---|---|
| ICFP 2025 SAT Optimization | 320 | ~$60 | 10x speedup |
| Circle Packing (est.) | 150 | ~$30-50 | State-of-the-art |
| AIME Math (est.) | 75 gens | ~$20-40 | Cross-year generalization |
Cost is heavily dependent on the LLM backend chosen. Using local models via Ollama/vLLM reduces API costs to near-zero, while frontier models like GPT-4o cost $0.10-0.50 per mutation.
10. Memory Management
- Population Database: SQLite-backed with WAL mode for concurrent read/write (improved retry in v1.1)
- Evaluation Cache: Results cached to avoid re-evaluating identical programs
- Meta Memory: Bandit statistics and meta-recommendations persist across sessions
- Archive: Best programs from all islands stored in global archive for cross-island inspiration
- Prompt Archive: Co-evolved system prompts stored separately with fitness tracking
- Context Window Management: Parent programs and archive snippets selected to fit within LLM context limits
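A greedy sketch of context-budget selection as described in the last point. The 4-characters-per-token estimate and the greedy best-first policy are assumptions, not ShinkaEvolve's actual logic:

```python
def fit_to_context(snippets, token_budget, count_tokens=lambda s: len(s) // 4):
    """Greedily pack archive snippets (best score first) into a rough
    token budget. `snippets` is a list of (code, score) tuples."""
    chosen, used = [], 0
    for code, score in sorted(snippets, key=lambda p: p[1], reverse=True):
        cost = count_tokens(code)
        if used + cost > token_budget:
            continue  # skip snippets that would overflow the budget
        chosen.append(code)
        used += cost
    return chosen
```

In practice a real tokenizer would replace the character-count heuristic, but the budget-then-skip structure is the essential idea.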
11. Continued Learning
- Within-run adaptation: Bandit controller learns which LLMs and mutation types work best
- Cross-session resume: Population, bandit state, and meta-recommendations persist
- Prompt evolution: System prompts improve over generations, accumulating knowledge about what mutation instructions work
- Knowledge transfer: Insights from evolved programs can be manually extracted and applied to new domains (demonstrated in ICFP 2025)
- Archive growth: The archive continuously accumulates diverse high-quality solutions
12. Applications & Use Cases
Infrastructure Optimization
Optimize scheduling algorithms, resource allocation, database query optimization. Any infrastructure code with measurable performance.
Mathematical Discovery
AIME reasoning, circle packing, combinatorial problems. Discover novel mathematical strategies beyond human intuition.
AI/ML Optimization
Loss functions, training strategies (MoE), hyperparameter optimization code. Optimize ML pipeline components.
Scientific Discovery
Algorithm design for scientific computing, simulation optimization, experimental design.
Competitive Programming
Heuristic contest optimization (AtCoder, ICFP). Discover novel algorithmic strategies under time pressure.
Human-AI Collaboration
Evolved solutions provide insights that humans can extract and generalize. Bidirectional knowledge transfer demonstrated at ICFP 2025.