ShinkaEvolve
Open-Ended Program Evolution Framework
Authors: Takuya Akiba & Sakana AI Team
Venue: ICLR 2026
License: Apache 2.0
Language: Python ≥3.10
Stars: 855 ★
Repository: github.com/SakanaAI/ShinkaEvolve
Table of Contents
- Core Contribution
- Supported Solutions
- Architecture
- Components
- Core Mechanisms
- Code Modification & Mutation Operators
- Parent Selection & Sampling
- Population Management & Island Architecture
- Solution Diversity & Novelty
- LLM Orchestration & Model Selection
- Search Strategies
- Evaluation & Fitness Assessment
- Prompt Engineering & Co-Evolution
- Cost Control & Budget Management
- Meta-Level & Self-Improvement
- LLM Integration
- Key Results
- Reproducibility
- Cost Information
- Memory Management
- Continued Learning
- Applications & Use Cases
1. Core Contribution
ShinkaEvolve is an evolutionary code optimization framework that discovers algorithms using LLMs with dramatically improved sample efficiency compared to prior approaches like AlphaEvolve. The system constructs an archive of evaluated programs, generates new programs via LLM ensembles acting as intelligent mutation operators, and evaluates their fitness through automated metrics.
**Key Innovation:** Three sample-efficiency innovations that reduce the number of evaluations needed by 10-100x compared to prior work:
1. **Parent Sampling Strategy** — Balances exploitation of good solutions with exploration of new ideas
2. **Novelty-Based Program Rejection** — Avoids evaluating minor variations via code embedding similarity and LLM-as-novelty-judge
3. **Adaptive LLM Prioritization** — Bandit-based dynamic selection of the best LLM from an ensemble
Unlike prior systems that treat LLMs as black-box text generators, ShinkaEvolve deeply integrates LLMs into the evolutionary loop as intelligent mutation operators that understand code semantics, can reason about algorithmic improvements, and propose structured modifications. The framework achieved state-of-the-art results across circle packing, mathematical reasoning (AIME), competitive programming, and MoE training optimization.
2. Supported Solutions
| Domain | Problem Types | Generalization |
|---|---|---|
| Algorithm Optimization | SAT solvers, sorting, search algorithms | Any algorithm with a measurable fitness function |
| Mathematical Reasoning | AIME problems, mathematical conjectures | Generalizes across problem years and LLM backends |
| Competitive Programming | Heuristic contests (AtCoder, ICFP) | Applicable to any optimization contest |
| ML Optimization | MoE training strategies, loss functions | Any ML pipeline with trainable components |
| Scientific Discovery | Circle packing, combinatorial optimization | Any problem expressible as code with evaluation |
| Infrastructure | Code optimization, performance tuning | General-purpose code improvement |
**Generalization Capability:** ShinkaEvolve is designed as a general-purpose framework. Any problem that can be expressed as (1) a program with mutable code blocks, and (2) an evaluation function returning a numeric score, can be optimized using this framework.
3. Architecture
┌─────────────────────────────────────────────────────────────────┐
│ ShinkaEvolve Framework │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ shinka_launch│ │ shinka_run │ │shinka_visualize│ │
│ │ (Hydra CLI) │ │ (Agent CLI) │ │ (WebUI) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌──────▼────────────────────▼────────────────────▼──────┐ │
│ │ Evolution Runner (sync/async) │ │
│ │ ┌─────────────────┐ ┌──────────────────────────┐ │ │
│ │ │EvolutionRunner │ │ AsyncEvolutionRunner │ │ │
│ │ │(sequential gen) │ │ (concurrent gen+eval) │ │ │
│ │ │ │ │ 5-10x speedup │ │ │
│ │ └─────────────────┘ └──────────────────────────┘ │ │
│ └───────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────┬───────────────┼───────────────┬───────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌────┐ ┌────────┐ ┌──────────────┐ ┌─────────┐ ┌────────┐ │
│ │LLM │ │Novelty │ │ Population │ │Evaluator│ │ Prompt │ │
│ │Ens.│ │ Judge │ │ Database │ │ Wrapper │ │Evolver │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │Prov│ │Embed + │ │Islands + Arch│ │Parallel │ │Sys.Prom│ │
│ │ider│ │LLM-as- │ │ive + Migrat.│ │Exec + │ │Archive │ │
│ │Mod.│ │Judge │ │Island Sampl.│ │E.Stop │ │Mutation│ │
│ └────┘ └────────┘ └──────────────┘ └─────────┘ └────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Bandit Controller │ │
│ │ Thompson/UCB-based LLM selection + meta-recommendations │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Three Execution Modes
EvolutionRunner
Synchronous pipeline. Parallel evaluations but sequential program generation. Best for debugging and small runs.
AsyncEvolutionRunner
Fully concurrent proposals and evaluations. Achieves 5-10x speedup. Best for production runs.
shinka_launch
Hydra-based CLI with YAML configuration. Supports config composition and overrides.
4. Components
Directory Structure
shinka/
├── core/ # Evolution runners (sync + async), novelty judge
│ ├── evolution_runner.py # Main synchronous loop
│ ├── async_runner.py # Async concurrent pipeline
│ ├── async_summarizer.py # Async result summarization
│ ├── async_novelty_judge.py # Async novelty evaluation
│ └── wrap_eval.py # Evaluation wrapper with parallelism
├── database/ # Population management
│ ├── database.py # Island-based population database
│ ├── async_dbase.py # Async database operations
│ ├── island_sampler.py # Island selection strategies
│ └── prompt_dbase.py # Prompt population storage
├── edit/ # Code editing operations
├── embed/ # Embedding providers for novelty
├── llm/ # LLM provider-based modules
│ └── providers/ # OpenAI, Google, local model support
├── prompts/ # Prompt templates
│ ├── prompts_fix.py # Fix-mode for incorrect populations
│ └── prompts_prompt_evo.py # Prompt evolution templates
├── plots/ # Visualization modules
│ ├── plot_costs.py, plot_evals.py, plot_time.py, plot_llm.py
├── webui/ # Dashboard and compare views
│ ├── index.html, compare.html
└── utils/ # Utility functions
Configuration Layers
| Config | Key Parameters | Purpose |
|---|---|---|
| `EvolutionConfig` | num_generations (10+), patch_types (diff/full/cross), novelty params, LLM models | Controls evolutionary search behavior |
| `DatabaseConfig` | num_islands (4), migration_interval (10 gen), migration_rate (10%), elite_ratio (30%) | Population topology and dynamics |
| `JobConfig` | LocalJobConfig, SlurmDockerJobConfig, SlurmCondaJobConfig | Execution environment |
5. Core Mechanisms
5.1 Code Modification & Mutation Operators
ShinkaEvolve uses LLMs as intelligent mutation operators. Programs contain EVOLVE-BLOCK-START/END markers that delineate mutable code regions. The system supports three mutation/patch types:
| Patch Type | Description | Use Case |
|---|---|---|
diff |
Generate unified diff patches to modify specific lines | Small targeted improvements |
full |
Generate complete replacement of evolve blocks | Major algorithmic changes |
cross |
Combine elements from two parent programs | Crossover-like recombination |
# Example: initial program with evolve markers
# === EVOLVE-BLOCK-START ===
def optimize(problem):
    """Optimize the given problem instance."""
    solution = greedy_initialize(problem)
    for _ in range(1000):
        neighbor = random_perturbation(solution)
        if evaluate(neighbor) > evaluate(solution):
            solution = neighbor
    return solution
# === EVOLVE-BLOCK-END ===
The LLM receives the current program, parent history, evaluation results, and meta-recommendations, then proposes modifications within the marked blocks. This is fundamentally different from random mutations in classical genetic programming — the LLM understands the algorithmic intent and proposes semantically meaningful changes.
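As a concrete illustration, the marker convention above can be handled with a few lines of string processing. This is a sketch, not ShinkaEvolve's actual implementation; the helper names and regex are illustrative:

```python
import re

EVOLVE_RE = re.compile(
    r"# === EVOLVE-BLOCK-START ===\n(.*?)# === EVOLVE-BLOCK-END ===",
    re.DOTALL,
)

def extract_evolve_blocks(source: str) -> list[str]:
    """Return the code inside each EVOLVE-BLOCK-START/END pair."""
    return EVOLVE_RE.findall(source)

def replace_evolve_block(source: str, new_body: str, index: int = 0) -> str:
    """Apply a 'full'-style patch: swap in a replacement body for the
    index-th evolve block, leaving all immutable code untouched."""
    counter = [0]

    def repl(match: re.Match) -> str:
        out = match.group(0)
        if counter[0] == index:
            out = ("# === EVOLVE-BLOCK-START ===\n"
                   + new_body
                   + "# === EVOLVE-BLOCK-END ===")
        counter[0] += 1
        return out

    return EVOLVE_RE.sub(repl, source)
```

Only the marked region changes; everything outside the markers is guaranteed to survive the mutation verbatim.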
5.2 Parent Selection & Sampling
Parent selection balances exploitation (using known good solutions) with exploration (trying novel approaches). ShinkaEvolve supports multiple strategies:
| Strategy | Mechanism | When Used |
|---|---|---|
| Power-Law | Higher-ranked programs exponentially more likely to be selected: P(i) ∝ i^-α | Default exploitation-heavy mode |
| Weighted | Selection proportional to fitness score: P(i) ∝ f(i) / Σf | Balanced exploration-exploitation |
| Beam Search | Top-k programs expanded exhaustively | Focused depth-first refinement |
$$ P(\mathrm{rank}_i) = \frac{\mathrm{rank}_i^{-\alpha}}{\sum_j \mathrm{rank}_j^{-\alpha}}, \quad \alpha \in [1.0, 3.0] $$
# Python implementation of power-law parent selection
import numpy as np

def power_law_selection(population, scores, alpha=2.0):
    """Select a parent using a power-law distribution over ranked programs."""
    # Rank programs by score (highest first)
    ranked_indices = np.argsort(scores)[::-1]
    ranks = np.arange(1, len(ranked_indices) + 1)
    # Compute power-law probabilities
    probs = ranks.astype(float) ** (-alpha)
    probs /= probs.sum()
    # Sample a parent
    selected_rank = np.random.choice(len(ranks), p=probs)
    return population[ranked_indices[selected_rank]]
Additionally, the elite selection ratio (default 30%) determines how many top programs from the archive are used as "inspiration" context in prompts to the LLM. This provides the LLM with examples of successful approaches without directly copying them.
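A minimal sketch of how the elite ratio could translate into prompt context. The archive layout (a list of `(code, score)` tuples) and the formatting are assumptions, not the framework's actual API:

```python
def build_inspiration_context(archive, elite_ratio=0.3, max_examples=3):
    """Format the top elite_ratio fraction of the archive as in-prompt
    'inspiration' examples for the mutation LLM."""
    ranked = sorted(archive, key=lambda p: p[1], reverse=True)
    n_elite = max(1, int(len(ranked) * elite_ratio))
    elites = ranked[:n_elite][:max_examples]  # cap what goes into the prompt
    sections = [
        f"# Inspiration {i + 1} (score={score:.3f})\n{code}"
        for i, (code, score) in enumerate(elites)
    ]
    return "\n\n".join(sections)
```

The `max_examples` cap matters in practice: even at a 30% elite ratio, only a handful of programs fit in the LLM's context window.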
5.3 Population Management & Island Architecture
ShinkaEvolve implements a multi-island evolutionary model inspired by biological island biogeography. Each island maintains a semi-independent subpopulation that evolves in relative isolation, with periodic migration of elite individuals between islands.
Island 1 Island 2 Island 3 Island 4
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Pop: 25 │ │ Pop: 25 │ │ Pop: 25 │ │ Pop: 25 │
│ Elite: 8 │ │ Elite: 8 │ │ Elite: 8 │ │ Elite: 8 │
│ Archive: 50 │ │ Archive: 50 │ │ Archive: 50 │ │ Archive: 50 │
│ │ │ │ │ │ │ │
│ Strategy: A │ │ Strategy: B │ │ Strategy: C │ │ Strategy: D │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ Migration (10%) │ │
└────────every 10 gen─────┴────────────────┘
│
┌─────▼──────┐
│ Global │
│ Archive │
│ Best-of-All│
└────────────┘
Island Configuration
- num_islands: Default 4 islands
- migration_interval: Every 10 generations
- migration_rate: 10% of elite programs migrate
- Dynamic island spawning: New islands spawned on stagnation detection (v1.1)
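The migration step above can be sketched as follows, assuming a ring topology and a list-of-tuples population representation (both illustrative; the actual topology and data structures may differ):

```python
def migrate(islands, migration_rate=0.10):
    """Ring-topology migration sketch: copy the top fraction of each
    island's programs to the next island in the ring.

    `islands` is a list of lists of (code, score) tuples.
    """
    emigrants = []
    for pop in islands:
        elites = sorted(pop, key=lambda p: p[1], reverse=True)
        k = max(1, int(len(elites) * migration_rate))
        emigrants.append(elites[:k])
    for i, movers in enumerate(emigrants):
        dest = islands[(i + 1) % len(islands)]
        dest.extend(movers)  # copies, not removals: source islands keep them
    return islands
```

Copying rather than moving elites preserves each island's own archive while still spreading good genetic material.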
Island Sampling Strategies (island_sampler.py)
| Strategy | Behavior |
|---|---|
| `uniform` | Equal probability for each island regardless of performance |
| `equal` | Round-robin cycling through islands |
| `proportional` | Probability proportional to island's best score |
| `weighted` | Custom weights per island (e.g., favoring diverse islands) |
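A sketch of two of these strategies (illustrative logic, not the actual code in island_sampler.py):

```python
import random

def sample_island(best_scores, strategy="proportional", rng=random):
    """Pick an island index given each island's best score.

    Sketch of the `uniform` and `proportional` strategies only;
    `equal` and `weighted` are omitted for brevity.
    """
    n = len(best_scores)
    if strategy == "uniform":
        return rng.randrange(n)
    if strategy == "proportional":
        total = sum(best_scores)
        if total <= 0:
            return rng.randrange(n)  # fall back when all scores are zero
        # Roulette-wheel sampling over best scores
        r = rng.random() * total
        acc = 0.0
        for i, score in enumerate(best_scores):
            acc += score
            if r <= acc:
                return i
        return n - 1
    raise ValueError(f"unknown strategy: {strategy}")
```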
5.4 Solution Diversity & Novelty
ShinkaEvolve employs a two-tier novelty system to prevent the population from converging to a single solution pattern:
Tier 1: Code Embedding Similarity
Before evaluation, new programs are compared to existing population members using code embeddings. Programs that are too similar (below a novelty threshold) are rejected without costly evaluation.
$$ \mathrm{novelty}(p) = \min_{q \in \mathrm{Pop}} d(\mathrm{embed}(p), \mathrm{embed}(q)), \quad \text{accept } p \text{ iff } \mathrm{novelty}(p) > \theta_{\mathrm{novelty}} $$
# Novelty rejection via embedding similarity
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_novelty_embedding(new_program, population, embedder, threshold=0.15):
    """Reject programs too similar to an existing population member."""
    new_emb = embedder.encode(new_program.code)
    for existing in population:
        existing_emb = embedder.encode(existing.code)
        if cosine_similarity(new_emb, existing_emb) > (1.0 - threshold):
            return False  # Too similar: reject without costly evaluation
    return True  # Novel enough to evaluate
Tier 2: LLM-as-Novelty-Judge
For programs passing the embedding filter, an LLM evaluates whether the proposed changes represent a genuinely novel algorithmic approach or merely a superficial modification (e.g., variable renaming, comment changes).
# LLM-based novelty assessment (async_novelty_judge.py)
async def judge_novelty(parent_code, child_code, llm_client):
    """Use an LLM to assess whether a mutation is algorithmically novel."""
    prompt = f"""Compare these two programs and assess if the child
introduces a genuinely novel algorithmic approach:
Parent: {parent_code}
Child: {child_code}
Rate novelty 1-5: 1=cosmetic change, 5=fundamentally new approach.
Explain your reasoning."""
    response = await llm_client.generate(prompt)
    novelty_score = extract_score(response)
    return novelty_score >= 3  # Accept only genuinely novel changes
5.5 LLM Orchestration & Model Selection
ShinkaEvolve uses a multi-armed bandit approach to dynamically select which LLM from the ensemble to use for each mutation. The bandit tracks the expected improvement from each LLM and adapts allocation over time.
$$ \mathrm{UCB}_i(t) = \mu_i(t) + c \sqrt{\frac{\ln t}{n_i(t)}} $$
where μ_i(t) is the average improvement from LLM i, n_i(t) is the number of times it was selected, and c controls exploration.
# Bandit-based LLM selection
import math

class BanditLLMSelector:
    def __init__(self, models, exploration_weight=1.4):
        self.models = models
        self.c = exploration_weight
        self.successes = {m: 0 for m in models}
        self.trials = {m: 1 for m in models}  # Avoid division by zero
        self.total_trials = len(models)

    def select(self):
        """Select an LLM using the UCB1 strategy."""
        ucb_scores = {}
        for model in self.models:
            mu = self.successes[model] / self.trials[model]
            exploration = self.c * math.sqrt(
                math.log(self.total_trials) / self.trials[model]
            )
            ucb_scores[model] = mu + exploration
        return max(ucb_scores, key=ucb_scores.get)

    def update(self, model, improvement):
        """Update model statistics after an evaluation."""
        self.trials[model] += 1
        self.total_trials += 1
        if improvement > 0:
            self.successes[model] += 1
The system supports diverse LLM backends through a provider-based architecture:
- `"openrouter/qwen/qwen3-coder"` — OpenRouter-hosted models
- `"local/qwen2.5-coder@localhost"` — Local model endpoints
- Google GenAI, OpenAI, and Anthropic providers
5.6 Search Strategies
ShinkaEvolve supports multiple search strategies that can be combined:
- Generational Evolution: Standard generational model with population replacement
- Steady-State: Replace worst individual immediately on finding better solution
- Beam Search: Expand top-k programs at each generation for depth-first refinement
- Fix Mode (v1.1): When no correct program exists in population, special prompts focus LLM on fixing fundamental errors rather than optimizing performance
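The steady-state variant above can be sketched in a few lines (illustrative; the `(code, score)` population representation is an assumption):

```python
def steady_state_insert(population, candidate, max_size):
    """Steady-state sketch: admit the candidate immediately and, if the
    population now exceeds capacity, evict the current worst program.

    `population` is a list of (code, score) tuples.
    """
    population.append(candidate)
    if len(population) > max_size:
        worst = min(population, key=lambda p: p[1])
        population.remove(worst)
    return population
```

Unlike generational replacement, improvements take effect immediately and can be selected as parents in the very next proposal.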
5.7 Evaluation & Fitness Assessment
The evaluation system (wrap_eval.py) provides robust program assessment:
# Evaluation setup: users provide two files.

# evaluate.py — runs experiments and returns metrics
def evaluate(program_path: str) -> dict:
    """Run the program and return metrics."""
    result = run_program(program_path)
    return {
        "combined_score": result.score,  # Required key
        "accuracy": result.accuracy,
        "runtime_ms": result.runtime,
    }

# initial.py — starting solution with EVOLVE-BLOCK markers
# === EVOLVE-BLOCK-START ===
def solve(input_data):
    return baseline_solution(input_data)
# === EVOLVE-BLOCK-END ===
Evaluation Features (v1.1)
- Per-run parallelism: `run_workers` with optional `max_workers_cap`
- Early stopping: `bayesian`, `ci` (confidence interval), and `hybrid` modes
- Plot artifact generation: Optional visualization of evaluation results
- Deterministic result ordering: Consistent results regardless of parallel execution order
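A rough sketch of the idea behind the `ci` early-stopping mode: abort remaining repeat runs once the upper confidence bound on the candidate's mean score drops below the best score seen so far. The exact criterion ShinkaEvolve uses is not shown here; this is an assumption:

```python
import math
import statistics

def should_stop_early(scores, best_so_far, z=1.96, min_runs=3):
    """CI-based early-stopping sketch.

    Stops further repeat runs when mean + z * SEM < best_so_far,
    i.e., the candidate is very unlikely to beat the incumbent.
    """
    if len(scores) < min_runs:
        return False  # not enough runs for a meaningful interval
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean + z * sem < best_so_far
```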
5.8 Prompt Engineering & Co-Evolution
A major innovation in v1.1 is prompt co-evolution: the system prompts used to guide LLM mutations evolve alongside the program population.
**Prompt Co-Evolution Components:**
- **System Prompt Archive** (prompt_dbase.py): Population of system prompts stored alongside the program population
- **Prompt Mutation** (prompt_evolver.py): LLM modifies system prompts based on which prompts led to successful mutations
- **Prompt Fitness Tracking**: Each prompt tracks the average improvement of the programs it helped generate
# Prompt co-evolution pseudocode
class PromptEvolver:
    def __init__(self, prompt_archive):
        self.archive = prompt_archive  # PromptDatabase

    def evolve_prompt(self, parent_prompt, success_history):
        """Mutate a system prompt based on its success history."""
        meta_prompt = f"""This system prompt was used to guide code mutations:
{parent_prompt}
History of mutations it produced:
{format_history(success_history)}
Improve this prompt to generate better code mutations.
Focus on what worked and amplify those strategies."""
        return self.llm.generate(meta_prompt)

    def select_prompt(self):
        """Select a system prompt using fitness-proportionate selection."""
        prompts = self.archive.get_all()
        scores = [p.avg_improvement for p in prompts]
        return weighted_sample(prompts, scores)
5.9 Cost Control & Budget Management
The v1.1 release introduced max_api_costs as a first-class runtime budget guard:
**Budget Model:** Committed cost = realized DB costs (`api_costs` + `embed_cost` + `novelty_cost` + `meta_cost`) + estimated cost of in-flight work. Once budget reached, new proposals stop and the runner drains ongoing jobs.
# Cost tracking pseudocode
class CostTracker:
    def __init__(self, max_budget):
        self.max_budget = max_budget
        self.realized_costs = {
            "api_costs": 0.0,     # LLM generation costs
            "embed_cost": 0.0,    # Embedding computation
            "novelty_cost": 0.0,  # Novelty judge LLM calls
            "meta_cost": 0.0,     # Meta-recommendation costs
        }
        self.in_flight_estimate = 0.0

    @property
    def committed_cost(self):
        return sum(self.realized_costs.values()) + self.in_flight_estimate

    def can_propose(self):
        return self.committed_cost < self.max_budget
When num_generations is omitted from configuration, max_api_costs becomes required, ensuring runs always have a termination condition.
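That rule can be expressed as a small check (illustrative; not the framework's actual config-validation code):

```python
def validate_termination(num_generations=None, max_api_costs=None):
    """Every run needs at least one stopping condition: a generation
    cap, a cost budget, or both. Names mirror the config keys; the
    logic here is a sketch."""
    if num_generations is None and max_api_costs is None:
        raise ValueError(
            "Set num_generations and/or max_api_costs: "
            "a run must have a termination condition."
        )
    return {"num_generations": num_generations, "max_api_costs": max_api_costs}
```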
5.10 Meta-Level & Self-Improvement
ShinkaEvolve includes meta-level mechanisms that improve the search process itself:
- Meta-recommendations: After each generation, the system generates high-level insights about what types of mutations have been successful, which are passed to the LLM in subsequent generations
- Bandit state persistence: LLM selection statistics persist across resume sessions (v1.1)
- Adaptive mutation scheduling: The ratio of diff/full/cross mutations adapts based on success rates
- Stagnation detection: When island performance plateaus, dynamic island spawning creates new populations with randomized strategies
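Adaptive mutation scheduling can be sketched as success-rate-weighted sampling over the three patch types. The Laplace smoothing and the exact weighting are assumptions; ShinkaEvolve's real scheduler may differ:

```python
import random

class MutationScheduler:
    """Weight diff/full/cross patch types by empirical success rate."""

    def __init__(self, rng=random):
        # Laplace-smoothed counts so every type starts with rate 0.5
        self.stats = {t: {"wins": 1, "tries": 2} for t in ("diff", "full", "cross")}
        self.rng = rng

    def rates(self):
        return {t: s["wins"] / s["tries"] for t, s in self.stats.items()}

    def sample(self):
        """Roulette-wheel sample a patch type by its success rate."""
        rates = self.rates()
        r = self.rng.random() * sum(rates.values())
        acc = 0.0
        for patch_type, rate in rates.items():
            acc += rate
            if r <= acc:
                return patch_type
        return "diff"

    def update(self, patch_type, improved):
        self.stats[patch_type]["tries"] += 1
        if improved:
            self.stats[patch_type]["wins"] += 1
```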
6. LLM Integration
ShinkaEvolve's provider-based LLM architecture supports any model accessible via API:
| Provider | Example Backend ID | Notes |
|---|---|---|
| OpenRouter | `openrouter/qwen/qwen3-coder` | Access to 100+ models |
| Local | `local/qwen2.5-coder@localhost` | Ollama, vLLM, llama.cpp |
| Google | `google/gemini-2.5-pro` | Via google-genai SDK |
| OpenAI | `openai/gpt-4o` | Standard API |
The system can run ensembles of models simultaneously, with the bandit controller learning which models perform best for the specific task at hand. This is particularly powerful when different models have complementary strengths (e.g., one model is better at algorithmic innovation while another excels at code optimization).
7. Key Results
| Domain | Efficiency | Performance | Notes |
|---|---|---|---|
| Circle Packing | 150 samples | State-of-the-art solution | Hybrid: golden spiral + gradient + simulated annealing |
| AIME Math | 75 generations | Generalizes across years/LLMs | Mathematical reasoning optimization |
| Competitive Prog. | Multiple tasks | 2.3% avg improvement | AtCoder heuristic problems |
| MoE Training | 30 generations | Beat DeepSeek Global LBL | Mixture-of-Experts loss function |
| ICFP 2025 | 320 trials, ~$60 | 10x speedup in SAT | Won programming contest |
**Circle Packing Discovery:** ShinkaEvolve identified a sophisticated hybrid algorithm combining golden-angle spiral initialization, gradient-based refinement, and simulated annealing to escape local optima — a solution that would be extremely difficult for a human to design from scratch.
8. Reproducibility
- Open source: Full code available under Apache 2.0
- Package manager: `uv pip install -e .`
- Examples included: Circle packing, Game 2048, Julia prime counting, novelty examples
- Configuration: Hydra-based YAML configs for full reproducibility
- WebUI: Real-time visualization via `shinka_visualize`
- Resume support: Checkpoint/resume with meta memory and bandit state
# Quick start
git clone https://github.com/SakanaAI/ShinkaEvolve
cd ShinkaEvolve
uv pip install -e .
# Run circle packing example
shinka_launch variant=circle_packing_example
# Or with agent CLI
shinka_run --task-dir examples/circle_packing --num_generations 20
# Customize
shinka_run --set evo.max_parallel_jobs=6 --set db.num_islands=3
9. Cost Information
| Experiment | Trials | Cost | Outcome |
|---|---|---|---|
| ICFP 2025 SAT Optimization | 320 | ~$60 | 10x speedup |
| Circle Packing (est.) | 150 | ~$30-50 | State-of-the-art |
| AIME Math (est.) | 75 gens | ~$20-40 | Cross-year generalization |
Cost is heavily dependent on the LLM backend chosen. Using local models via Ollama/vLLM reduces API costs to near-zero, while frontier models like GPT-4o cost $0.10-0.50 per mutation.
10. Memory Management
- Population Database: SQLite-backed with WAL mode for concurrent read/write (improved retry in v1.1)
- Evaluation Cache: Results cached to avoid re-evaluating identical programs
- Meta Memory: Bandit statistics and meta-recommendations persist across sessions
- Archive: Best programs from all islands stored in global archive for cross-island inspiration
- Prompt Archive: Co-evolved system prompts stored separately with fitness tracking
- Context Window Management: Parent programs and archive snippets selected to fit within LLM context limits
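A greedy sketch of context-budget selection as described in the last point. The 4-characters-per-token estimate and the greedy best-first policy are assumptions, not ShinkaEvolve's actual logic:

```python
def fit_to_context(snippets, token_budget, count_tokens=lambda s: len(s) // 4):
    """Greedily pack archive snippets (best score first) into a rough
    token budget. `snippets` is a list of (code, score) tuples."""
    chosen, used = [], 0
    for code, score in sorted(snippets, key=lambda p: p[1], reverse=True):
        cost = count_tokens(code)
        if used + cost > token_budget:
            continue  # skip snippets that would overflow the budget
        chosen.append(code)
        used += cost
    return chosen
```

In practice a real tokenizer would replace the character-count heuristic, but the budget-then-skip structure is the essential idea.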
11. Continued Learning
- Within-run adaptation: Bandit controller learns which LLMs and mutation types work best
- Cross-session resume: Population, bandit state, and meta-recommendations persist
- Prompt evolution: System prompts improve over generations, accumulating knowledge about what mutation instructions work
- Knowledge transfer: Insights from evolved programs can be manually extracted and applied to new domains (demonstrated in ICFP 2025)
- Archive growth: The archive continuously accumulates diverse high-quality solutions
12. Applications & Use Cases
Infrastructure Optimization
Optimize scheduling algorithms, resource allocation, database query optimization. Any infrastructure code with measurable performance.
Mathematical Discovery
AIME reasoning, circle packing, combinatorial problems. Discover novel mathematical strategies beyond human intuition.
AI/ML Optimization
Loss functions, training strategies (MoE), hyperparameter optimization code. Optimize ML pipeline components.
Scientific Discovery
Algorithm design for scientific computing, simulation optimization, experimental design.
Competitive Programming
Heuristic contest optimization (AtCoder, ICFP). Discover novel algorithmic strategies under time pressure.
Human-AI Collaboration
Evolved solutions provide insights that humans can extract and generalize. Bidirectional knowledge transfer demonstrated at ICFP 2025.