Sakana AI: Evolutionary Code Generation for AtCoder Heuristic Contest 058
LLM-Driven Evolutionary Approach to Competitive Optimization Programming
Organization: Sakana AI (Tokyo, Japan)
Source: sakana.ai/ahc058
Domain: Evolutionary Computation + LLM
Competition: AtCoder Heuristic Contest 058
Report Date: March 2026
Table of Contents
- Full Title and Authors
- Core Contribution
- Supported Solutions and Generalizability
- LLM Integration
- Key Results
- Reproducibility
- Cost Analysis
- Architecture Solution
- High-Level Architecture Diagram
- Data Flow
- Components
- Core Mechanisms
- Code Modification and Mutation Operators
- Parent Selection and Sampling Strategies
- Population Management and Architecture
- Solution Diversity and Novelty Mechanisms
- LLM Orchestration and Model Selection
- Search Strategies and Tree Exploration
- Evaluation and Fitness Assessment
- Prompt Engineering and Co-Evolution
- Cost Control and Budget Management
- Meta-Level and Self-Improvement Techniques
- Programming Language
- Memory Management
- Continued Learning
- Applications and Use Cases
- References
1. Full Title and Authors
Full Title: Evolutionary Code Generation for Competitive Heuristic Programming -- An AI-Driven Approach to AtCoder Heuristic Contest 058 (AHC058)
Organization: Sakana AI, Inc., Tokyo, Japan
Key Contributors:
- Robert Tjarko Lange -- Research Lead (Evolutionary Computation)
- Yujin Tang -- Research Scientist (LLM Integration)
- Chris Lu -- Research (Meta-Learning / Discovered Algorithms)
- David Ha -- Co-founder, Chief Scientist (oversight)
This work builds directly upon Sakana AI's prior research in evolutionary code generation, including "The AI Scientist" and open-ended evolutionary approaches to generating novel algorithms. The AHC058 entry represents a targeted application of their broader research agenda on autonomous AI-driven code evolution to a well-defined competitive programming optimization problem.
2. Core Contribution
The AHC058 Problem
AtCoder Heuristic Contests (AHC) are optimization-focused programming contests hosted by AtCoder, one of the largest competitive programming platforms globally. Unlike standard competitive programming (which requires exact solutions), heuristic contests present NP-hard or complex optimization problems where contestants must develop heuristic/approximate solutions, scored on solution quality across multiple test instances.
AHC058 presented a combinatorial optimization problem involving spatial scheduling and arrangement. The problem required contestants to produce high-quality approximate solutions to instances of varying size and difficulty within strict time limits. Each submission was scored based on the quality of the output relative to known bounds, and the final ranking reflected performance across many test cases.
Key Innovation
Rather than having human programmers manually craft heuristic solutions, Sakana AI deployed an **autonomous evolutionary system** that uses LLMs to iteratively generate, mutate, evaluate, and select C++ heuristic programs. The system automatically discovers algorithmic strategies (greedy approaches, simulated annealing, beam search, etc.) without explicit human specification of the solution strategy.
What Makes This Novel
- Fully autonomous code evolution: The system writes complete C++ programs that compile and run, not just pseudo-code or partial solutions
- Applied to real competitive programming: This is not a toy benchmark -- AHC058 features real contestants from around the world, many of whom are expert competitive programmers
- LLM as mutation operator: Instead of traditional genetic programming (AST-level mutations), the system uses LLMs as semantically-aware code transformers
- Multi-strategy discovery: The evolutionary process discovers and combines multiple optimization strategies (greedy, local search, SA, etc.) without being told which to use
- Quality-diversity optimization: The population management explicitly maintains diverse solution strategies, not just the highest-scoring program
3. Supported Solutions and Generalizability
Primary Domain: Heuristic Optimization Programming
The system is specifically designed for heuristic/optimization programming contests where:
- Problems are NP-hard or computationally complex
- Approximate solutions are scored on quality
- Solutions must compile and execute within time/memory limits
- Multiple test instances are used for evaluation
Generalizability Assessment
| Problem Type | Applicability | Notes |
|---|---|---|
| Heuristic optimization contests (AHC-style) | Excellent | Primary target domain; direct applicability |
| General competitive programming (exact solutions) | Moderate | Requires adaptation; fitness signal differs |
| Algorithm design and optimization | Good | Core strength -- discovering novel algorithms |
| ARC-AGI / reasoning puzzles | Limited | Different problem structure; not the focus |
| Mathematical proof generation | Limited | Requires different evaluation mechanisms |
| Scientific code discovery | Good | Aligns with "AI Scientist" research line |
| Hyperparameter/architecture search | Good | Evolutionary search naturally extends |
| Code optimization (perf tuning) | Excellent | Fitness-based evolution directly applicable |
Structural Requirements for Generalization
The system can be generalized to any domain meeting these criteria:
- Automated evaluation: Solutions must be programmatically scoreable (a fitness function exists)
- Code as genotype: The solution space is program source code
- Compilable/executable: Generated code must be syntactically and semantically valid
- Quantitative fitness: A scalar or comparable fitness signal must be obtainable
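As a quick sanity check, the four criteria can be expressed as a preflight test for a candidate domain. This is a sketch; all names here are illustrative, not from the Sakana AI codebase:

```python
from typing import Callable, Optional

def domain_is_evolvable(
    fitness_fn: Optional[Callable[[str], float]],  # automated, quantitative evaluation
    genotype_is_source_code: bool,                 # solution space is program text
    can_compile_and_run: bool,                     # generated code is executable
) -> bool:
    """True iff all structural requirements for generalization hold."""
    # A scalar fitness signal requires an automated evaluation function
    if fitness_fn is None:
        return False
    return genotype_is_source_code and can_compile_and_run

# Toy domain: fitness = negative source length (shorter programs score higher)
toy_fitness = lambda src: -float(len(src))
```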
4. LLM Integration
Models Used
| Model | Role | Characteristics |
|---|---|---|
| Claude 3.5 Sonnet | Primary code generation and mutation | Strong code understanding; reliable C++ output |
| GPT-4o | Alternative mutation engine | Diversity of mutation styles |
| Gemini 1.5 Pro | Long-context analysis and refactoring | Large context window for whole-program understanding |
| Claude 3 Opus / Haiku | Expensive/cheap alternatives for different mutation tiers | Quality-cost tradeoff |
Integration Architecture
LLMs are integrated as semantically-aware mutation operators within the evolutionary loop. Unlike traditional genetic programming where mutations are syntactic (swap a subtree, change an operator), here the LLM understands the code's intent and can perform meaningful transformations:
# Pseudo-code for LLM-based mutation
def llm_mutate(parent_code: str, problem_description: str,
               mutation_type: str, model: str) -> str:
    """Use LLM to generate a mutated version of parent code."""
    prompt = build_mutation_prompt(
        code=parent_code,
        problem=problem_description,
        mutation=mutation_type,
        best_score=get_population_best(),
        parent_score=get_score(parent_code),
        error_history=get_recent_errors()
    )
    response = call_llm_api(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=sample_temperature(),  # Varies 0.3-1.0
        max_tokens=8192
    )
    return extract_code(response)
Multi-Model Strategy
The system employs multiple LLMs simultaneously for several strategic reasons:
- Diversity of generations: Different LLMs have different "biases" in how they write code; this increases population diversity
- Robustness: If one API is rate-limited or down, others continue operating
- Cost optimization: Cheaper models handle simpler mutations; expensive models handle complex refactoring
- Parallel throughput: Multiple APIs can be called concurrently, increasing evolution speed
API Usage Pattern
# Model selection based on mutation complexity and budget
def select_model(mutation_type: str, budget_remaining: float) -> str:
    if mutation_type in ["major_refactor", "strategy_change"]:
        if budget_remaining > 0.5:
            return "claude-3-5-sonnet"  # High quality for big changes
        return "gpt-4o"
    elif mutation_type in ["parameter_tune", "local_fix"]:
        return "claude-3-haiku"  # Cheap for small changes
    else:
        return random_choice(["claude-3-5-sonnet", "gpt-4o"])
5. Key Results
Competition Performance
Achievement
The Sakana AI evolutionary system achieved a strong ranking in AHC058, competing against hundreds of human contestants including expert competitive programmers. The AI-generated solution placed in the **upper tier** of participants, demonstrating that autonomous evolutionary code generation can compete with skilled human programmers on heuristic optimization tasks.
| Metric | Value | Notes |
|---|---|---|
| Final ranking | Upper tier (est. top 20-30%) | Among hundreds of participants |
| Score relative to top human | Competitive but not 1st place | Gap primarily in fine-tuned SA parameters |
| Number of evolutionary generations | Several hundred | Over the contest duration |
| Programs evaluated | Thousands | Including compilation failures |
| Compilation success rate | ~60-75% | Improved over evolution |
| Strategy diversity discovered | 5+ distinct approaches | Greedy, SA, beam search, etc. |
Score Progression
The evolutionary system showed characteristic improvement dynamics:
- Phase 1 (Early generations): Random/naive solutions with basic greedy approaches. Low scores but rapid improvement.
- Phase 2 (Discovery phase): LLM discovers simulated annealing and beam search strategies. Major score jumps.
- Phase 3 (Refinement): Parameter tuning, constant optimization, edge-case handling. Diminishing returns per generation.
- Phase 4 (Final push): Combining best strategies, micro-optimizations for speed. Incremental improvements.
Qualitative Results
- The system independently discovered simulated annealing as an effective approach
- Multiple distinct solution strategies were maintained in the population
- The best evolved solutions contained sophisticated C++ including efficient data structures
- Some mutations introduced creative algorithmic ideas not present in the parent
6. Reproducibility
Reproducibility Assessment
| Factor | Status | Detail |
|---|---|---|
| Code availability | Partial | Sakana AI has released code for related systems; AHC058-specific code may be partially available |
| Architecture description | Detailed | Blog post provides substantial architectural detail |
| LLM API access | Available | All models used are commercially available via API |
| Problem availability | Available | AHC058 problems remain available on AtCoder |
| Exact reproduction | Difficult | LLM non-determinism means exact runs differ |
| Qualitative reproduction | Feasible | Architecture is sufficiently described to build a comparable system |
Key Challenges for Reproduction
- Non-determinism: LLM outputs vary even with same prompts and temperature; exact evolutionary trajectories cannot be reproduced
- API costs: Running thousands of LLM calls is expensive; reproducing requires non-trivial budget
- Prompt engineering: Exact prompt templates are crucial to performance and may not be fully disclosed
- Infrastructure: Concurrent API calls, compilation sandboxing, and evaluation infrastructure require engineering effort
7. Cost Analysis
Estimated Cost Breakdown
| Component | Estimated Cost | Notes |
|---|---|---|
| LLM API calls (total) | $200-$1000+ | Depends on model mix and number of generations |
| Claude 3.5 Sonnet calls | $3/M input, $15/M output | Primary mutation engine |
| GPT-4o calls | $5/M input, $15/M output | Alternative mutations |
| Cheaper models (Haiku etc.) | $0.25-$1/M tokens | Simple mutations, parameter tuning |
| Compute (compilation/eval) | Minimal | C++ compilation is fast; evaluation is bounded |
| Infrastructure | Minimal | Standard server; no GPU required for evolution |
Cost Efficiency
The system employs several cost management strategies: tiered model selection (cheap models for simple mutations), caching of compilations, early termination of poor solutions, and budget-aware model selection that shifts to cheaper models as budget depletes.
Cost per Generation
$$
\text{Cost}_{\text{gen}} = N_{\text{offspring}} \times \left( \text{Cost}_{\text{LLM}}(\text{model}, \text{tokens}_{\text{in}}, \text{tokens}_{\text{out}}) + \text{Cost}_{\text{compile}} + \text{Cost}_{\text{eval}} \right)
$$
where $N_{\text{offspring}}$ is the number of offspring per generation (~5-20), and $\text{Cost}_{\text{eval}}$ involves running the compiled binary on test instances.
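The cost formula can be checked numerically. In this sketch, the token counts and per-generation offspring count are illustrative assumptions, while the prices match the Sonnet-tier figures from the cost table above:

```python
def llm_call_cost(tokens_in: int, tokens_out: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """API cost in dollars for one call, given per-million-token prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

def generation_cost(n_offspring: int, tokens_in: int, tokens_out: int,
                    price_in: float, price_out: float,
                    cost_compile: float = 0.0, cost_eval: float = 0.0) -> float:
    """Cost_gen = N_offspring * (Cost_LLM + Cost_compile + Cost_eval)."""
    per_offspring = llm_call_cost(tokens_in, tokens_out, price_in, price_out)
    return n_offspring * (per_offspring + cost_compile + cost_eval)

# 10 offspring, ~6k input / 4k output tokens each, $3/$15 per M tokens
cost = generation_cost(10, 6000, 4000, 3.0, 15.0)  # -> $0.78 per generation
```

At these assumed rates, a few hundred generations lands in the low-hundreds-of-dollars range, consistent with the $200-$1000+ estimate above.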
8. Architecture Solution
8.1 High-Level Architecture Diagram
+========================================================================+
| SAKANA AI AHC058 EVOLUTIONARY SYSTEM |
+========================================================================+
| |
| +------------------+ +-----------------------+ |
| | PROBLEM CONTEXT | | POPULATION STORE | |
| | - Problem desc | | (Quality-Diversity) | |
| | - Test cases | | +---------------+ | |
| | - Constraints | | | Individual 1 | | |
| | - Scoring rules | | | - code (C++) | | |
| +--------+---------+ | | - fitness | | |
| | | | - strategy | | |
| v | | - lineage | | |
| +--------+---------+ | +---------------+ | |
| | PARENT SELECTION |<---+ | Individual 2 | | |
| | - Fitness-prop | | | ... | | |
| | - Tournament | | +---------------+ | |
| | - Diversity bonus | | | Individual N | | |
| +--------+---------+ | +---------------+ | |
| | +-----------+-----------+ |
| v ^ |
| +--------+---------+ | |
| | MUTATION ENGINE | | (survivors + |
| | (LLM-Based) | | new offspring) |
| | | | |
| | +---------------+ | +----------+----------+ |
| | | Model Select | | | SELECTION/SURVIVAL | |
| | | - Claude 3.5 | | | - Truncation | |
| | | - GPT-4o | | | - Elitism | |
| | | - Gemini 1.5 | | | - Diversity pres. | |
| | +-------+-------+ | +----------+----------+ |
| | | | ^ |
| | +-------v-------+ | | |
| | | Mutation Type | | +----------+----------+ |
| | | - Refactor | | | EVALUATION ENGINE | |
| | | - Optimize | | | - Compile (g++) | |
| | | - Bug fix | | | - Run on test cases | |
| | | - Strategy | | | - Score aggregation | |
| | | - Parameter | | | - Timeout handling | |
| | +-------+-------+ | +----------+----------+ |
| | | | ^ |
| | +-------v-------+ | | |
| | | Prompt Build | | +----------+----------+ |
| | | + Context | | | SANDBOX EXECUTION | |
| | +-------+-------+ | | - Compilation | |
| | | | | - Resource limits | |
| | +-------v-------+ | | - Output capture | |
| | | LLM API Call | | +---------------------+ |
| | +-------+-------+ | |
| | | | |
| | +-------v-------+ | |
| | | Code Extract +-------> [New C++ Program] -----> Compile & Eval |
| | +---------------+ | |
| +-------------------+ |
| |
| +------------------------------------------------------------------+ |
| | ORCHESTRATION LAYER | |
| | - Generation loop control - Budget management | |
| | - Parallel API dispatch - Logging and monitoring | |
| | - Error recovery - Convergence detection | |
| +------------------------------------------------------------------+ |
+========================================================================+
8.2 Data Flow
Generation Loop:
[Population] --select--> [Parents] --mutate(LLM)--> [Offspring Code]
^ |
| v
| [Compile (g++)]
| |
| [Pass?]--No--> discard
| |Yes
| v
| [Run on test cases]
| |
| v
| [Compute fitness]
| |
| v
+--------[Survival selection]<---------[Scored offspring]
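The loop above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `llm_mutate`, `compile_cpp`, and `evaluate` stand in for the components described in this report, and dicts stand in for individuals:

```python
import random

def generation_step(population, llm_mutate, compile_cpp, evaluate,
                    n_offspring=10, survivors=20):
    """One select -> mutate -> compile -> evaluate -> survive iteration."""
    offspring = []
    for _ in range(n_offspring):
        # Tournament selection of a parent (best of 3 random picks)
        parent = max(random.sample(population, k=3), key=lambda p: p["fitness"])
        code = llm_mutate(parent["code"])
        binary = compile_cpp(code)
        if binary is None:  # compilation failure: discard offspring
            continue
        offspring.append({"code": code, "fitness": evaluate(binary)})
    # Truncation survival: keep the fittest individuals overall
    merged = sorted(population + offspring, key=lambda p: p["fitness"], reverse=True)
    return merged[:survivors]
```

With stub functions for mutation and evaluation, repeated calls to `generation_step` monotonically improve the best fitness in the population, mirroring the flow chart.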
9. Components
Component 1: Problem Context Manager
Maintains the static context for the competition problem:
- Full problem statement (parsed from AtCoder)
- Input/output format specifications
- Scoring formula and constraints
- Sample test cases and locally generated additional cases
- Time and memory limits
Component 2: Population Store
The population store manages a collection of individuals (programs), each consisting of:
- Genotype: Full C++ source code (typically 100-500 lines)
- Phenotype metadata: Fitness score, compilation status, runtime stats
- Lineage information: Parent ID, mutation type, generation number
- Strategy tag: Categorical label of the algorithmic approach (greedy, SA, beam, etc.)
- Behavioral descriptor: Feature vector describing solution characteristics
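One plausible encoding of such an individual is a small record type. The field names mirror the list above, but this is a sketch rather than the actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Individual:
    code: str                        # genotype: full C++ source
    fitness: float = 0.0             # phenotype: aggregated score
    compiled: bool = False
    runtime_ms: float = 0.0
    parent_id: Optional[int] = None  # lineage: parent, mutation, generation
    mutation_type: str = "init"
    generation: int = 0
    strategy_tag: str = "OTHER"      # greedy / SA / beam / hybrid / ...
    behavior: List[float] = field(default_factory=list)  # behavioral descriptor
```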
Component 3: Parent Selection Module
Selects individuals from the population to serve as parents for the next generation of mutations. Implements multiple selection strategies with diversity-awareness.
Component 4: LLM Mutation Engine
The central component that transforms parent programs into offspring via LLM-based code generation. Includes model selection, prompt construction, API calling, and code extraction sub-modules.
Component 5: Compilation and Sandbox
Compiles generated C++ code using g++ with optimization flags, runs it in a sandboxed environment with resource limits (time, memory), and captures output.
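This component can be sketched with Python's `subprocess` module. The g++ flags match those named in the Evaluation Engine section; the helper names and limits are illustrative, and a production sandbox would also need memory caps and filesystem isolation:

```python
import os
import subprocess
import tempfile

def compile_cpp(source: str, timeout_s: float = 30.0):
    """Compile C++ source with g++; return binary path, or None on failure."""
    workdir = tempfile.mkdtemp()
    src_path = os.path.join(workdir, "sol.cpp")
    bin_path = os.path.join(workdir, "sol")
    with open(src_path, "w") as f:
        f.write(source)
    result = subprocess.run(
        ["g++", "-O2", "-std=c++17", src_path, "-o", bin_path],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return bin_path if result.returncode == 0 else None

def run_sandboxed(bin_path: str, input_data: str, time_limit_s: float = 5.0):
    """Run the binary with a wall-clock limit; return stdout, or None on TLE/RE."""
    try:
        proc = subprocess.run([bin_path], input=input_data, capture_output=True,
                              text=True, timeout=time_limit_s)
    except subprocess.TimeoutExpired:
        return None  # time limit exceeded
    return proc.stdout if proc.returncode == 0 else None
```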
Component 6: Evaluation Engine
Runs compiled programs on test instances, computes scores using the contest scoring formula, and aggregates multi-instance results into a single fitness value.
Component 7: Survival Selection
Determines which individuals survive to the next generation, balancing fitness elitism with diversity preservation.
Component 8: Orchestration Layer
Controls the overall evolutionary loop, manages parallelism, handles errors and retries, monitors budget, and detects convergence.
10. Core Mechanisms
10.1 Code Modification and Mutation Operators
The system employs LLMs as intelligent mutation operators. Unlike traditional GP where mutations are syntactic (swap AST nodes), LLM mutations are semantic -- the model understands the code's purpose and can make meaningful algorithmic changes.
Mutation Types
| Mutation Type | Description | Magnitude | Frequency |
|---|---|---|---|
| Parameter Tuning | Adjust constants, thresholds, iteration counts | Small | High (30%) |
| Local Optimization | Improve a specific function or loop | Small-Medium | Medium (20%) |
| Bug Fix | Fix compilation errors or runtime issues | Small | As needed |
| Algorithm Refinement | Improve the core algorithm (e.g., better neighbor generation in SA) | Medium | Medium (20%) |
| Strategy Change | Replace the core approach (e.g., greedy to SA) | Large | Low (10%) |
| Major Refactor | Restructure the entire program | Large | Low (10%) |
| Crossover | Combine ideas from two parent programs | Variable | Low (10%) |
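The frequency column above can be turned into a weighted sampler. This is a sketch: the type names mirror the table, and bug-fix mutations are assumed to be triggered on demand (after failures) rather than sampled:

```python
import random

# Weights taken from the mutation-type table above
MUTATION_WEIGHTS = {
    "parameter_tune": 0.30,
    "local_optimization": 0.20,
    "algorithm_refine": 0.20,
    "strategy_change": 0.10,
    "major_refactor": 0.10,
    "crossover": 0.10,
}

def sample_mutation(rng: random.Random = random) -> str:
    """Draw a mutation type according to the table's frequencies."""
    types = list(MUTATION_WEIGHTS)
    weights = list(MUTATION_WEIGHTS.values())
    return rng.choices(types, weights=weights, k=1)[0]
```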
Example: Parameter Tuning Mutation Prompt
## System prompt (parameter tuning)
"""
You are an expert competitive programmer. You are given a C++ solution
for an optimization problem. Your task is to ONLY adjust numerical
constants, thresholds, and parameters to improve the score.
DO NOT change the algorithm structure. ONLY modify:
- Temperature schedules (initial temp, cooling rate)
- Iteration counts and loop bounds
- Scoring weights and penalties
- Threshold values for decisions
Return the complete modified C++ code.
"""
## User prompt
"""
Problem: {problem_description}
Current solution (score: {current_score}, best known: {best_score}):
```cpp
{parent_code}
```
The solution runs for {runtime}ms out of {time_limit}ms.
Scores on test cases: {per_test_scores}
Adjust the parameters to improve the average score. Focus on: {suggested_parameters}
"""
#### Example: Strategy Change Mutation
## System prompt (strategy change)
"""
You are an expert competitive programmer specializing in heuristic
optimization. You are given a solution that uses one approach. Your task
is to COMPLETELY REWRITE the core algorithm using a different
optimization strategy.
Consider strategies like:
- Simulated Annealing with geometric cooling
- Beam Search with pruning
- Genetic/Evolutionary local search
- Hill climbing with random restarts
- Greedy construction + 2-opt improvement
Return a COMPLETE, compilable C++ program.
"""
#### Crossover Operator
# Crossover: combine two parent programs
def crossover_prompt(parent_a: str, parent_b: str,
                     score_a: float, score_b: float) -> str:
    return f"""You are given TWO different solutions to the same optimization problem.

Solution A (score: {score_a}):
{parent_a}

Solution B (score: {score_b}):
{parent_b}

Create a NEW solution that combines the best ideas from both:
- Take the data structures from the better-structured solution
- Combine optimization strategies where possible
- Use the more efficient I/O handling

Return a complete, compilable C++ program."""
### 10.2 Parent Selection and Sampling Strategies
Parent selection determines which individuals from the population are chosen to be mutated. The system uses a multi-strategy approach:
#### Fitness-Proportionate Selection
$$
P(\text{select } i) = \frac{f(i)^{\alpha}}{\sum_{j} f(j)^{\alpha}}
$$
where $f(i)$ is the fitness of individual $i$ and $\alpha$ is a selection pressure parameter (typically 1.0-2.0).
import numpy as np
def fitness_proportionate_selection(population, alpha=1.5):
    """Select parent with probability proportional to fitness^alpha."""
    fitnesses = np.array([ind.fitness for ind in population])
    # Normalize to [0, 1] range
    fitnesses = (fitnesses - fitnesses.min()) / (fitnesses.max() - fitnesses.min() + 1e-8)
    # Apply selection pressure
    weights = fitnesses ** alpha
    probabilities = weights / weights.sum()
    return np.random.choice(population, p=probabilities)
#### Tournament Selection
def tournament_selection(population, tournament_size=3):
    """Select best individual from a random tournament."""
    tournament = np.random.choice(population, size=tournament_size, replace=False)
    return max(tournament, key=lambda ind: ind.fitness)
#### Diversity-Aware Selection
def diversity_aware_selection(population, diversity_weight=0.3):
    """Select parents balancing fitness with behavioral diversity."""
    fitnesses = normalize([ind.fitness for ind in population])
    diversities = compute_diversity_scores(population)
    # Combined score: fitness + diversity bonus
    combined = (1 - diversity_weight) * fitnesses + diversity_weight * diversities
    probabilities = combined / combined.sum()
    return np.random.choice(population, p=probabilities)

def compute_diversity_scores(population):
    """Compute how different each individual is from all others."""
    scores = []
    for ind in population:
        # Average behavioral distance to all other individuals
        distances = [behavioral_distance(ind, other)
                     for other in population if other is not ind]
        scores.append(np.mean(distances))
    return normalize(np.array(scores))
### 10.3 Population Management and Architecture
The population is structured as a **quality-diversity (QD) archive** rather than a simple flat list. This ensures the system maintains a diverse collection of high-quality solutions.
#### Population Structure
+---------------------------------------------------------------+
|                      POPULATION ARCHIVE                       |
|                                                               |
|  Strategy Niche Map:                                          |
|  +----------+----------+----------+----------+                |
|  |  Greedy  |    SA    |   Beam   |  Hybrid  | ...            |
|  |          |          |  Search  |          |                |
|  |  [Best]  |  [Best]  |  [Best]  |  [Best]  |                |
|  | Score:85 | Score:92 | Score:88 | Score:90 |                |
|  |          |          |          |          |                |
|  |  [2nd]   |  [2nd]   |  [2nd]   |  [2nd]   |                |
|  | Score:80 | Score:89 | Score:84 | Score:87 |                |
|  |          |          |          |          |                |
|  |  [3rd]   |  [3rd]   |          |  [3rd]   |                |
|  | Score:78 | Score:86 |          | Score:83 |                |
|  +----------+----------+----------+----------+                |
|                                                               |
|  Elite Buffer: [Top 5 across all niches regardless of type]   |
|  Hall of Fame: [Best ever seen for each niche]                |
+---------------------------------------------------------------+
class QDArchive:
    """Quality-Diversity Archive for population management."""

    def __init__(self, max_per_niche=5, niche_categories=None):
        self.niches = {}  # strategy_type -> list of individuals
        self.max_per_niche = max_per_niche
        self.elite_buffer = []  # Top K overall
        self.hall_of_fame = {}  # Best ever per niche

    def add(self, individual):
        """Add individual to archive, maintaining quality within its niche."""
        niche = individual.strategy_tag
        if niche not in self.niches:
            self.niches[niche] = []
        self.niches[niche].append(individual)
        self.niches[niche].sort(key=lambda x: x.fitness, reverse=True)
        # Keep only top K per niche
        if len(self.niches[niche]) > self.max_per_niche:
            self.niches[niche] = self.niches[niche][:self.max_per_niche]
        # Update hall of fame
        if niche not in self.hall_of_fame or individual.fitness > self.hall_of_fame[niche].fitness:
            self.hall_of_fame[niche] = individual
        # Update elite buffer
        self._update_elite(individual)

    def _update_elite(self, individual, k=5):
        """Keep the top-k individuals overall, regardless of niche."""
        self.elite_buffer.append(individual)
        self.elite_buffer.sort(key=lambda x: x.fitness, reverse=True)
        self.elite_buffer = self.elite_buffer[:k]

    def get_population(self):
        """Return all individuals across all niches."""
        return [ind for niche in self.niches.values() for ind in niche]

    def population_size(self):
        return sum(len(n) for n in self.niches.values())
#### Population Parameters
| Parameter | Typical Value | Purpose |
| --- | --- | --- |
| Max population size | 30-50 | Total across all niches |
| Max per niche | 5-10 | Quality within strategy type |
| Elite buffer size | 5 | Global best preservation |
| Offspring per generation | 5-20 | Exploration-exploitation tradeoff |
| Elitism count | 1-3 | Best individuals guaranteed survival |
### 10.4 Solution Diversity and Novelty Mechanisms
Maintaining diversity is critical to avoid premature convergence. The system employs multiple mechanisms:
#### Strategy Tagging
Each program is tagged with its algorithmic strategy type. The LLM is asked to classify the approach:
def classify_strategy(code: str) -> str:
    """Use LLM to classify the algorithmic strategy."""
    prompt = f"""Classify this C++ solution into ONE category:
- GREEDY: Greedy construction
- SA: Simulated Annealing
- BEAM: Beam Search
- LS: Local Search (hill climbing, 2-opt, etc.)
- HYBRID: Combination of strategies
- OTHER: None of the above

Code:
{code}

Respond with ONLY the category tag."""
    return call_llm(prompt, model="claude-3-haiku").strip()
#### Behavioral Descriptors
Solutions are characterized by behavioral features computed from their execution:
def compute_behavioral_descriptor(program, test_cases):
    """Compute behavioral feature vector from execution traces."""
    features = {}
    # Score and runtime distribution across test cases
    results = [run_and_score(program, tc) for tc in test_cases]
    scores = [r.score for r in results]
    runtimes = [r.runtime for r in results]
    features['mean_score'] = np.mean(scores)
    features['score_variance'] = np.var(scores)
    features['min_score'] = np.min(scores)
    # Runtime characteristics
    features['avg_runtime'] = np.mean(runtimes)
    features['memory_usage'] = max(r.peak_memory for r in results)
    # Code structural features
    features['code_length'] = len(program.code.split('\n'))
    features['num_functions'] = count_functions(program.code)
    return np.array(list(features.values()))
#### Novelty Search Component
$$
\text{novelty}(x) = \frac{1}{k} \sum_{i=1}^{k} \text{dist}\big(\text{behavior}(x), \text{behavior}(nn_i)\big)
$$
where $nn_i$ are the $k$ nearest neighbors of $x$ in behavioral space.
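A minimal implementation of this k-nearest-neighbor novelty score, sketched here with Euclidean distance over behavioral descriptor vectors:

```python
import numpy as np

def novelty_score(behavior: np.ndarray, archive: np.ndarray, k: int = 3) -> float:
    """Mean Euclidean distance to the k nearest neighbors in behavioral space.

    behavior: descriptor vector of the candidate, shape (d,)
    archive:  descriptors of the population/archive, shape (n, d)
    """
    dists = np.linalg.norm(archive - behavior, axis=1)  # distance to each member
    nearest = np.sort(dists)[:k]                        # k smallest distances
    return float(nearest.mean())
```

A candidate far from everything in the archive gets a high score, so selection pressure toward novelty keeps the population spread out in behavioral space.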
#### Multi-Model Diversity
Using multiple LLMs inherently generates diverse mutations because each model has different coding styles, preferences for data structures, and optimization intuitions.
### 10.5 LLM Orchestration and Model Selection
#### Model Selection Strategy
class ModelOrchestrator:
    """Manages multi-model LLM orchestration."""

    MODELS = {
        "claude-3-5-sonnet": {"quality": 0.95, "cost_per_1k": 0.009, "speed": "medium"},
        "gpt-4o": {"quality": 0.90, "cost_per_1k": 0.010, "speed": "medium"},
        "claude-3-haiku": {"quality": 0.70, "cost_per_1k": 0.001, "speed": "fast"},
        "gemini-1.5-pro": {"quality": 0.88, "cost_per_1k": 0.007, "speed": "medium"},
    }

    def select_model(self, mutation_type, budget_fraction_remaining):
        """Select optimal model for this mutation."""
        # High-impact mutations get best models
        if mutation_type in ["strategy_change", "major_refactor", "crossover"]:
            if budget_fraction_remaining > 0.3:
                return "claude-3-5-sonnet"
            return "gpt-4o"
        # Simple mutations use cheap models
        if mutation_type in ["parameter_tune", "bug_fix"]:
            return "claude-3-haiku"
        # Default: rotate for diversity
        return random.choice(["claude-3-5-sonnet", "gpt-4o"])

    async def batch_mutate(self, parents, mutation_types):
        """Run multiple mutations in parallel across models."""
        tasks = []
        for parent, mut_type in zip(parents, mutation_types):
            model = self.select_model(mut_type, self.budget_remaining)
            tasks.append(self.async_mutate(parent, mut_type, model))
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in results if not isinstance(r, Exception)]
#### Temperature Scheduling
The LLM temperature is varied strategically:
def get_temperature(mutation_type, generation, max_generations):
    """Adaptive temperature based on mutation type and progress."""
    progress = generation / max_generations  # 0.0 to 1.0
    base_temps = {
        "parameter_tune": 0.3,
        "local_fix": 0.4,
        "algorithm_refine": 0.6,
        "strategy_change": 0.9,
        "major_refactor": 0.8,
        "crossover": 0.7,
    }
    base = base_temps.get(mutation_type, 0.6)
    # Cool down as evolution progresses (exploit more, explore less)
    temp = base * (1.0 - 0.3 * progress)
    return max(0.2, min(1.0, temp))
### 10.6 Search Strategies and Tree Exploration
The evolutionary search can be viewed as exploring a tree of program transformations. Several strategies govern this exploration:
#### Lineage Tree
              [Initial Random Solution]
               /          |          \
       [Greedy v1]     [SA v1]     [Beam v1]
        /      \        /    \          |
 [Greedy v2]  [+SA] [SA v2] [SA v3] [Beam v2]
      |              |     \
 [Greedy v3]   [SA v2.1]  [SA+Beam Hybrid]
                     |          |
               [SA v2.2]  [Hybrid v2 -- BEST]
#### Exploration vs Exploitation Balance
$$
\text{mutation\_type} \sim \text{Categorical}\big(p_{\text{explore}}(g),\; p_{\text{exploit}}(g)\big)
$$
$$
p_{\text{explore}}(g) = p_{\text{explore}}^{\text{init}} \cdot \text{decay}^{g} + p_{\text{explore}}^{\min}, \qquad p_{\text{exploit}}(g) = 1 - p_{\text{explore}}(g)
$$
The exploration probability decays over generations, shifting focus from strategy discovery to refinement.
def sample_mutation_type(generation, max_gen):
    """Sample mutation type with exploration-exploitation schedule."""
    progress = generation / max_gen
    # Exploration probability decays with progress, floored at 0.1
    explore_prob = max(0.1, 0.5 * (1.0 - progress))
    if random.random() < explore_prob:
        # Exploration: large-magnitude mutations
        return random.choice(["strategy_change", "major_refactor", "crossover"])
    else:
        # Exploitation: refine existing solutions
        return random.choices(
            ["parameter_tune", "local_fix", "algorithm_refine"],
            weights=[0.4, 0.3, 0.3]
        )[0]
### 10.7 Evaluation and Fitness Assessment
#### Evaluation Pipeline
class Evaluator:
    def evaluate(self, code: str, test_cases: list) -> EvalResult:
        """Full evaluation pipeline for a candidate program."""
        # Step 1: Compilation
        compile_result = self.compile_cpp(code, flags=["-O2", "-std=c++17"])
        if not compile_result.success:
            return EvalResult(
                fitness=0,
                status="COMPILE_ERROR",
                error=compile_result.stderr
            )
        # Step 2: Run on each test case
        scores = []
        runtimes = []
        for tc in test_cases:
            result = self.run_sandboxed(
                binary=compile_result.binary,
                input_data=tc.input,
                time_limit=5.0,    # seconds
                memory_limit=1024  # MB
            )
            if result.status == "OK":
                score = self.compute_score(result.output, tc.expected, tc.params)
                scores.append(score)
                runtimes.append(result.runtime)
            elif result.status == "TLE":
                scores.append(0)  # Time limit exceeded
            elif result.status == "RE":
                scores.append(0)  # Runtime error
        # Step 3: Aggregate fitness
        if len(scores) == 0:
            fitness = 0
        else:
            # Weighted: penalize variance, reward consistency
            fitness = np.mean(scores) - 0.1 * np.std(scores)
        return EvalResult(
            fitness=fitness,
            status="OK",
            scores=scores,
            runtimes=runtimes
        )
#### Fitness Aggregation
$$
\text{fitness}(p) = \frac{1}{N} \sum_{i=1}^{N} \text{score}(p, \text{test}_i) \;-\; \lambda \cdot \text{std}\big(\{\text{score}(p, \text{test}_i)\}\big)
$$
Mean score across test cases, with a penalty for inconsistency ($\lambda \approx 0.1$).
#### Fast Evaluation (Subset Sampling)
To reduce evaluation cost, the system uses a two-stage evaluation:
def tiered_evaluation(candidate, all_tests, elite_threshold):
    """Two-tier evaluation: fast screening, then full eval."""
    # Stage 1: Quick eval on a small random subset
    quick_tests = random.sample(all_tests, k=5)
    quick_fitness = evaluate(candidate, quick_tests)
    # Only fully evaluate promising candidates
    if quick_fitness > elite_threshold * 0.8:
        # Stage 2: Full evaluation
        full_fitness = evaluate(candidate, all_tests)
        return full_fitness
    return quick_fitness  # Not worth a full eval
### 10.8 Prompt Engineering and Co-Evolution
#### Prompt Template Structure
Prompts are constructed from multiple components:
class PromptBuilder:
    def build(self, parent, mutation_type, population_context):
        sections = []
        # 1. System context
        sections.append(self.system_prompt(mutation_type))
        # 2. Problem description (always included)
        sections.append(f"## Problem\n{self.problem_description}")
        # 3. Parent code
        sections.append(f"## Current Solution (Score: {parent.fitness})\n"
                        f"```cpp\n{parent.code}\n```")
        # 4. Performance context
        sections.append(f"## Performance\n"
                        f"- Best known score: {population_context['best_score']}\n"
                        f"- This solution's score: {parent.fitness}\n"
                        f"- Per-test scores: {parent.per_test_scores}\n"
                        f"- Runtime: {parent.avg_runtime}ms")
        # 5. Error history (if applicable)
        if parent.recent_errors:
            sections.append(f"## Recent Errors\n{parent.recent_errors}")
        # 6. Mutation-specific instructions
        sections.append(self.mutation_instructions(mutation_type))
        # 7. Output format
        sections.append("## Output\nReturn the COMPLETE modified C++ code "
                        "in a ```cpp code block.")
        return "\n\n".join(sections)
#### Prompt Co-Evolution
The system can evolve its own prompts over time. Prompts that lead to successful mutations (fitness-improving offspring) are reinforced:
```python
class PromptEvolver:
    def __init__(self):
        self.prompt_variants = {}  # mutation_type -> list of prompt templates
        self.prompt_scores = {}    # prompt_id -> list of success booleans

    def record_outcome(self, prompt_id, fitness_improvement):
        """Track which prompts lead to improvements."""
        if prompt_id not in self.prompt_scores:
            self.prompt_scores[prompt_id] = []
        self.prompt_scores[prompt_id].append(fitness_improvement > 0)

    def select_prompt_template(self, mutation_type):
        """Select a prompt template weighted by historical success."""
        variants = self.prompt_variants[mutation_type]
        success_rates = [
            np.mean(self.prompt_scores.get(v.id, [0.5]))
            for v in variants
        ]
        # Softmax selection with temperature scaling
        probs = softmax(np.array(success_rates) * 3.0)
        return np.random.choice(variants, p=probs)
```
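The `softmax` used in `select_prompt_template` is not defined in the excerpt; a minimal, numerically stable version would be:

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtract the max before exponentiating."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# The 3.0 scale factor acts as an inverse temperature: higher success
# rates receive disproportionately more probability mass.
probs = softmax(np.array([0.2, 0.5, 0.8]) * 3.0)
print(probs.round(3))  # sums to 1; the highest-rate template is most likely
```

Without the scaling, success rates between 0 and 1 would produce nearly uniform selection; the temperature sharpens the preference while still exploring weaker templates occasionally.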
### 10.9 Cost Control and Budget Management
```python
class BudgetManager:
    def __init__(self, total_budget: float, total_generations: int):
        self.total_budget = total_budget
        self.spent = 0.0
        self.total_generations = total_generations
        self.current_generation = 0

    def get_generation_budget(self) -> float:
        """Budget allowance for the current generation."""
        remaining = self.total_budget - self.spent
        remaining_gens = self.total_generations - self.current_generation
        # Reserve more budget for later (refinement phase)
        if self.current_generation < self.total_generations * 0.3:
            # Early: explore freely but don't overspend
            return remaining / remaining_gens * 0.8
        else:
            # Later: steady spend
            return remaining / remaining_gens

    def should_use_cheap_model(self) -> bool:
        """Switch to cheaper models when the budget is running low."""
        fraction_remaining = (self.total_budget - self.spent) / self.total_budget
        fraction_time_remaining = 1 - self.current_generation / self.total_generations
        # If spending faster than time progresses, use cheaper models
        return fraction_remaining < fraction_time_remaining * 0.8

    def record_cost(self, cost: float):
        self.spent += cost
```
### 10.10 Meta-Level and Self-Improvement Techniques
The system incorporates several meta-learning mechanisms:
#### Adaptive Mutation Rate Scheduling
The distribution over mutation types is adapted based on which types have been most successful recently:
```python
class AdaptiveMutationScheduler:
    def __init__(self, mutation_types, window_size=20):
        self.types = mutation_types
        self.history = {t: [] for t in mutation_types}
        self.window = window_size

    def record(self, mutation_type, improved: bool):
        self.history[mutation_type].append(improved)

    def get_probabilities(self):
        """Compute mutation-type probabilities from recent success."""
        success_rates = {}
        for t in self.types:
            recent = self.history[t][-self.window:]
            if len(recent) > 0:
                success_rates[t] = np.mean(recent) + 0.1  # smoothing
            else:
                success_rates[t] = 0.5  # uninformed prior
        total = sum(success_rates.values())
        return {t: v / total for t, v in success_rates.items()}
```
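Running the same probability computation as a standalone function on hypothetical histories illustrates the intended behavior: successful types dominate, failing types keep a smoothed floor (never zero), and untried types receive the prior:

```python
import numpy as np

def mutation_probabilities(history, window=20):
    """Recent success rate per mutation type, smoothed (+0.1) and normalized."""
    rates = {}
    for t, outcomes in history.items():
        recent = outcomes[-window:]
        rates[t] = (np.mean(recent) + 0.1) if recent else 0.5  # 0.5 = prior
    total = sum(rates.values())
    return {t: r / total for t, r in rates.items()}

probs = mutation_probabilities({
    "optimize": [True, True, False, True],      # 75% recent success
    "rewrite":  [False, False, False, False],   # all failures
    "tune":     [],                             # never tried
})
print({t: round(p, 3) for t, p in probs.items()})
```

The +0.1 smoothing matters: a mutation type with a short streak of failures is down-weighted, not eliminated, so the scheduler can recover if the fitness landscape changes later in the run.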
#### Feedback-Driven Prompting
Error messages and execution failures are fed back into subsequent prompts, enabling the system to learn from its mistakes within a single evolutionary run:
```python
# If a mutation produces a compilation error, the error is included
# in the next attempt's prompt
if eval_result.status == "COMPILE_ERROR":
    error_context = f"""
The previous mutation attempt failed to compile:

Error: {eval_result.error}

Please fix this error while also improving the solution.
"""
```
#### Population Statistics as Meta-Information
The system tracks and uses aggregate statistics about the evolutionary run to inform its decisions:
- Fitness improvement rate (stagnation detection)
- Compilation success rate per mutation type
- Average fitness improvement per model
- Diversity metrics over time
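Stagnation detection, for instance, can be as simple as comparing the best fitness across a sliding window of generations. The following is a minimal sketch; the window size and improvement threshold here are assumptions, not reported values:

```python
def is_stagnating(best_history, window=10, min_improvement=1e-6):
    """Flag stagnation when the best fitness has barely moved
    over the last `window` generations."""
    if len(best_history) < window:
        return False  # not enough data yet
    recent = best_history[-window:]
    return (recent[-1] - recent[0]) < min_improvement

improving = [float(g) for g in range(20)]  # steady gains each generation
flat = [5.0] * 20                          # no progress at all
print(is_stagnating(improving), is_stagnating(flat))  # False True
```

A stagnation signal like this can then trigger the exploration-oriented responses described earlier, such as boosting mutation diversity or re-seeding from the archive.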
## 11. Programming Language
### System Implementation
The evolutionary framework itself is implemented in **Python**, leveraging:
- `asyncio` for concurrent LLM API calls
- `subprocess` for compilation and execution sandboxing
- `numpy` for fitness calculations and selection
- Standard HTTP libraries for API communication (httpx/aiohttp)
### Generated Code Language
The evolved solutions are written in **C++17**, the dominant language for AtCoder competitive programming:
- Compiled with `g++ -O2 -std=c++17`
- Utilizes STL containers, algorithms, and data structures
- Solutions typically 100-500 lines
- Includes competitive programming idioms (fast I/O, macros, etc.)
```cpp
// Typical generated solution structure
#include <bits/stdc++.h>
using namespace std;

// Constants and globals
const int MAXN = 1005;
const double INIT_TEMP = 1000.0;
const double COOL_RATE = 0.9999;

// Problem-specific data structures
struct State {
    vector<int> data;  // problem-specific fields (element type lost in extraction)
    int score;
};

// Simulated annealing (auto-discovered by evolution)
State solve(/* inputs */) {
    State current = greedy_init(/*...*/);
    State best = current;
    double temp = INIT_TEMP;
    while (elapsed() < TIME_LIMIT) {
        State neighbor = generate_neighbor(current);
        int delta = neighbor.score - current.score;
        if (delta > 0 || exp(delta / temp) > rng()) {
            current = neighbor;
            if (current.score > best.score)
                best = current;
        }
        temp *= COOL_RATE;
    }
    return best;
}
```
## 12. Memory Management
### Context Window Management
LLM context windows are a critical resource. The system manages them carefully:
| Context Component | Approximate Tokens | Management Strategy |
| --- | --- | --- |
| System prompt | 200-500 | Static, always included |
| Problem description | 500-2000 | Compressed summary for smaller contexts |
| Parent code | 500-3000 | Full code always included |
| Performance context | 200-500 | Summarized scores and timing |
| Error history | 0-500 | Only recent/relevant errors |
| Mutation instructions | 200-800 | Mutation-type specific |
| **Total per call** | **1600-7300** | |
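One way to keep a call within a fixed token budget is to treat the optional components (error history, per-test detail) as droppable while required ones are always kept. The greedy policy and the rough 4-characters-per-token estimator below are illustrative assumptions, not the reported system:

```python
def assemble_context(sections, max_tokens, est=lambda s: len(s) // 4):
    """Keep required sections unconditionally; add optional ones
    only while the (rough) token estimate stays within budget."""
    kept, used = [], 0
    for text, required in sections:
        cost = est(text)
        if required or used + cost <= max_tokens:
            kept.append(text)
            used += cost
    return "\n\n".join(kept)

prompt = assemble_context(
    [("system prompt", True),
     ("problem description", True),
     ("x" * 8000, False),  # oversized error history: dropped
     ("mutation instructions", True)],
    max_tokens=500,
)
print("x" * 8000 not in prompt)  # True: the optional section was dropped
```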
### History and Caching
- **Compilation cache:** Identical source code is not recompiled; hash-based lookup
- **Evaluation cache:** If an identical program was already evaluated, reuse the score
- **Lineage history:** Full parent-child relationships stored for analysis but not included in prompts (too large)
- **Error deduplication:** Repeated identical errors are collapsed in feedback
```python
class EvalCache:
    def __init__(self):
        self.cache = {}  # hash(code) -> EvalResult

    def get_or_eval(self, code, evaluator, test_cases):
        code_hash = hashlib.sha256(code.encode()).hexdigest()
        if code_hash in self.cache:
            return self.cache[code_hash]
        result = evaluator.evaluate(code, test_cases)
        self.cache[code_hash] = result
        return result
```
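A quick, self-contained check of the hash-keyed caching idea, using a stub evaluator in place of the real sandbox:

```python
import hashlib

cache = {}
eval_calls = []  # counts how often the "expensive" evaluation actually runs

def get_or_eval(code, evaluate):
    """Memoize evaluation results by a hash of the source code."""
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in cache:
        cache[key] = evaluate(code)
    return cache[key]

def stub_evaluate(code):
    eval_calls.append(code)
    return 42.0  # stand-in fitness

get_or_eval("int main() { return 0; }", stub_evaluate)
get_or_eval("int main() { return 0; }", stub_evaluate)  # cache hit
print(len(eval_calls))  # 1: the identical program was evaluated only once
```

Note that the key is exact: even a whitespace-only change produces a different hash and triggers a fresh evaluation.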
## 13. Continued Learning
### Within-Run Learning
- Adaptive mutation distribution: Shifts toward successful mutation types over generations
- Error-informed prompting: Compilation and runtime errors are fed back to prevent repeated mistakes
- Population-level learning: The archive accumulates knowledge of what strategies work, implicitly guiding future exploration
- Prompt success tracking: Successful prompt templates are preferentially selected
### Cross-Run Learning
- Seed solutions: Best solutions from previous runs can seed new populations
- Prompt library: Effective prompts can be stored and reused across problems
- Strategy priors: Knowledge that certain strategies (e.g., SA) work well for certain problem types can bias initial exploration
- Model performance profiles: Track which LLMs perform best for which mutation types
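Seeding a new run from a previous archive can be as simple as loading the top-k programs by fitness. The JSON layout below is an assumption for illustration, not the reported archive format:

```python
import json
import tempfile

def seed_population(archive_path, k=5):
    """Return the code of the top-k solutions from a prior run's archive."""
    with open(archive_path) as f:
        records = json.load(f)
    records.sort(key=lambda r: r["fitness"], reverse=True)
    return [r["code"] for r in records[:k]]

# Demo with a throwaway archive file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"code": "A", "fitness": 90.0},
               {"code": "B", "fitness": 99.5},
               {"code": "C", "fitness": 10.0}], f)
    path = f.name

print(seed_population(path, k=2))  # ['B', 'A']
```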
### No Fine-Tuning of LLMs (Design Decision)
The system does **not** fine-tune the underlying LLMs. All learning is at the population/prompt level. The LLMs are used as fixed "mutation engines" via their APIs. This is a deliberate choice: it avoids the cost and complexity of fine-tuning while leveraging the broad capabilities of frontier models.
## 14. Applications and Use Cases
### Direct Applications
- Competitive programming automation: Generating solutions for heuristic contests (AHC, Google Hash Code, etc.)
- Algorithm discovery: Automatically finding novel algorithms for optimization problems
- Code optimization: Evolving high-performance implementations of known algorithms
- Hyperparameter tuning via code: Evolving code that includes its own optimized parameters
### Extended Applications
- Scientific computing: Discovering efficient numerical methods (cf. Sakana's "AI Scientist")
- ML algorithm design: Evolving loss functions, optimizers, architectures (cf. Sakana's discovered algorithms work)
- Industrial optimization: Logistics, scheduling, resource allocation -- any domain with a computable fitness function
- Software engineering: Automated code improvement, refactoring, bug fixing at scale
- Education: Generating diverse solution approaches to programming problems for teaching
### Limitations and Boundaries
- Requires a clear, automated fitness function (cannot handle subjective quality)
- Generated code may be difficult to understand or maintain (evolved, not designed)
- Cost of LLM API calls limits applicability to budget-constrained settings
- Non-deterministic: different runs produce different results
- Currently limited to single-file programs; multi-module systems require extension
## 15. References
- Sakana AI, "Evolutionary Code Generation for AHC058," https://sakana.ai/ahc058/, 2025.
- C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, D. Ha, "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," Sakana AI, 2024.
- R. T. Lange, Y. Tang, D. Ha, "Discovering Attention-Based Genetic Algorithms via Large Language Models," Sakana AI, 2024.
- J. Lehman, J. Gordon, S. Jain, K. Ndousse, C. Castricato, S. Bansal, "Evolution through Large Models," 2023.
- J.-B. Mouret, J. Clune, "Illuminating Search Spaces by Mapping Elites," arXiv:1504.04909, 2015.
- S. Risi, J. Togelius, "Increasing Generality in Machine Learning through Procedural Content Generation," Nature Machine Intelligence, 2020.
- AtCoder Heuristic Contest 058, https://atcoder.jp/contests/ahc058.
Technical Research Report | Sakana AI AHC058 Evolutionary Code Generation System