
Sakana AI: Evolutionary Code Generation for AtCoder Heuristic Contest 058

LLM-Driven Evolutionary Approach to Competitive Optimization Programming

Organization: Sakana AI (Tokyo, Japan)
Source: sakana.ai/ahc058
Domain: Evolutionary Computation + LLM
Competition: AtCoder Heuristic Contest 058
Report Date: March 2026

Table of Contents

  1. Full Title and Authors
  2. Core Contribution
  3. Supported Solutions and Generalizability
  4. LLM Integration
  5. Key Results
  6. Reproducibility
  7. Cost Analysis
  8. Architecture Solution
  9. Components
  10. Core Mechanisms

1. Full Title and Authors

Full Title: Evolutionary Code Generation for Competitive Heuristic Programming -- An AI-Driven Approach to AtCoder Heuristic Contest 058 (AHC058)

Organization: Sakana AI, Inc., Tokyo, Japan

Key Contributors:

  • Robert Tjarko Lange -- Research Lead (Evolutionary Computation)
  • Yujin Tang -- Research Scientist (LLM Integration)
  • Chris Lu -- Research (Meta-Learning / Discovered Algorithms)
  • David Ha -- Co-founder, Chief Scientist (oversight)

This work builds directly upon Sakana AI's prior research in evolutionary code generation, including "The AI Scientist" and open-ended evolutionary approaches to generating novel algorithms. The AHC058 entry represents a targeted application of their broader research agenda on autonomous AI-driven code evolution to a well-defined competitive programming optimization problem.

2. Core Contribution

The AHC058 Problem

AtCoder Heuristic Contests (AHC) are optimization-focused programming contests hosted by AtCoder, one of the largest competitive programming platforms globally. Unlike standard competitive programming (which requires exact solutions), heuristic contests present NP-hard or complex optimization problems where contestants must develop heuristic/approximate solutions, scored on solution quality across multiple test instances.

AHC058 presented a combinatorial optimization problem involving spatial scheduling and arrangement. The problem required contestants to produce high-quality approximate solutions to instances of varying size and difficulty within strict time limits. Each submission was scored based on the quality of the output relative to known bounds, and the final ranking reflected performance across many test cases.

Key Innovation

Rather than having human programmers manually craft heuristic solutions, Sakana AI deployed an **autonomous evolutionary system** that uses LLMs to iteratively generate, mutate, evaluate, and select C++ heuristic programs. The system automatically discovers algorithmic strategies (greedy approaches, simulated annealing, beam search, etc.) without explicit human specification of the solution strategy.

What Makes This Novel

  • Fully autonomous code evolution: The system writes complete C++ programs that compile and run, not just pseudo-code or partial solutions
  • Applied to real competitive programming: This is not a toy benchmark -- AHC058 features real contestants from around the world, many of whom are expert competitive programmers
  • LLM as mutation operator: Instead of traditional genetic programming (AST-level mutations), the system uses LLMs as semantically-aware code transformers
  • Multi-strategy discovery: The evolutionary process discovers and combines multiple optimization strategies (greedy, local search, SA, etc.) without being told which to use
  • Quality-diversity optimization: The population management explicitly maintains diverse solution strategies, not just the highest-scoring program

3. Supported Solutions and Generalizability

Primary Domain: Heuristic Optimization Programming

The system is specifically designed for heuristic/optimization programming contests where:

  • Problems are NP-hard or computationally complex
  • Approximate solutions are scored on quality
  • Solutions must compile and execute within time/memory limits
  • Multiple test instances are used for evaluation

Generalizability Assessment

| Problem Type | Applicability | Notes |
| --- | --- | --- |
| Heuristic optimization contests (AHC-style) | Excellent | Primary target domain; direct applicability |
| General competitive programming (exact solutions) | Moderate | Requires adaptation; fitness signal differs |
| Algorithm design and optimization | Good | Core strength -- discovering novel algorithms |
| ARC-AGI / reasoning puzzles | Limited | Different problem structure; not the focus |
| Mathematical proof generation | Limited | Requires different evaluation mechanisms |
| Scientific code discovery | Good | Aligns with "AI Scientist" research line |
| Hyperparameter/architecture search | Good | Evolutionary search naturally extends |
| Code optimization (perf tuning) | Excellent | Fitness-based evolution directly applicable |

Structural Requirements for Generalization

The system can be generalized to any domain meeting these criteria:

  1. Automated evaluation: Solutions must be programmatically scoreable (a fitness function exists)
  2. Code as genotype: The solution space is program source code
  3. Compilable/executable: Generated code must be syntactically and semantically valid
  4. Quantitative fitness: A scalar or comparable fitness signal must be obtainable
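As a sketch, these four criteria suggest a minimal domain interface. The names below (`EvolvableDomain`, `ToyDomain`, `EvalReport`) are hypothetical illustrations, not an API published by Sakana AI:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Result of scoring one candidate program (illustrative fields)."""
    compiled: bool   # criterion 3: code must be valid and executable
    fitness: float   # criterion 4: scalar fitness signal

class EvolvableDomain:
    """Abstract interface a domain must satisfy; subclass per problem."""
    def compile(self, source_code: str) -> bool:
        raise NotImplementedError
    def score(self, source_code: str) -> EvalReport:
        """Criteria 1 and 2: programmatic scoring of source code."""
        raise NotImplementedError

class ToyDomain(EvolvableDomain):
    """Toy stand-in: 'fitness' is just the length of the source string."""
    def compile(self, source_code: str) -> bool:
        return bool(source_code.strip())
    def score(self, source_code: str) -> EvalReport:
        ok = self.compile(source_code)
        return EvalReport(compiled=ok,
                          fitness=float(len(source_code)) if ok else 0.0)
```

Any contest or code-optimization task that can implement these two methods can, in principle, be plugged into the same evolutionary loop.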

4. LLM Integration

Models Used

| Model | Role | Characteristics |
| --- | --- | --- |
| Claude 3.5 Sonnet | Primary code generation and mutation | Strong code understanding; reliable C++ output |
| GPT-4o | Alternative mutation engine | Diversity of mutation styles |
| Gemini 1.5 Pro | Long-context analysis and refactoring | Large context window for whole-program understanding |
| Claude 3 Opus / Haiku | Expensive/cheap alternatives for different mutation tiers | Quality-cost tradeoff |

Integration Architecture

LLMs are integrated as semantically-aware mutation operators within the evolutionary loop. Unlike traditional genetic programming where mutations are syntactic (swap a subtree, change an operator), here the LLM understands the code's intent and can perform meaningful transformations:

# Pseudo-code for LLM-based mutation
def llm_mutate(parent_code: str, problem_description: str,
               mutation_type: str, model: str) -> str:
    """Use LLM to generate a mutated version of parent code."""

    prompt = build_mutation_prompt(
        code=parent_code,
        problem=problem_description,
        mutation=mutation_type,
        best_score=get_population_best(),
        parent_score=get_score(parent_code),
        error_history=get_recent_errors()
    )

    response = call_llm_api(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=sample_temperature(),  # Varies 0.3-1.0
        max_tokens=8192
    )

    return extract_code(response)

Multi-Model Strategy

The system employs multiple LLMs simultaneously for several strategic reasons:

  • Diversity of generations: Different LLMs have different "biases" in how they write code; this increases population diversity
  • Robustness: If one API is rate-limited or down, others continue operating
  • Cost optimization: Cheaper models handle simpler mutations; expensive models handle complex refactoring
  • Parallel throughput: Multiple APIs can be called concurrently, increasing evolution speed

API Usage Pattern

# Model selection based on mutation complexity and budget
def select_model(mutation_type: str, budget_remaining: float) -> str:
    if mutation_type in ["major_refactor", "strategy_change"]:
        if budget_remaining > 0.5:
            return "claude-3-5-sonnet"  # High quality for big changes
        return "gpt-4o"
    elif mutation_type in ["parameter_tune", "local_fix"]:
        return "claude-3-haiku"     # Cheap for small changes
    else:
        return random_choice(["claude-3-5-sonnet", "gpt-4o"])

5. Key Results

Competition Performance

Achievement

The Sakana AI evolutionary system achieved a strong ranking in AHC058, competing against hundreds of human contestants including expert competitive programmers. The AI-generated solution placed in the **upper tier** of participants, demonstrating that autonomous evolutionary code generation can compete with skilled human programmers on heuristic optimization tasks.

| Metric | Value | Notes |
| --- | --- | --- |
| Final ranking | Upper tier (est. top 20-30%) | Among hundreds of participants |
| Score relative to top human | Competitive but not 1st place | Gap primarily in fine-tuned SA parameters |
| Number of evolutionary generations | Several hundred | Over the contest duration |
| Programs evaluated | Thousands | Including compilation failures |
| Compilation success rate | ~60-75% | Improved over evolution |
| Strategy diversity discovered | 5+ distinct approaches | Greedy, SA, beam search, etc. |

Score Progression

The evolutionary system showed characteristic improvement dynamics:

  1. Phase 1 (Early generations): Random/naive solutions with basic greedy approaches. Low scores but rapid improvement.
  2. Phase 2 (Discovery phase): LLM discovers simulated annealing and beam search strategies. Major score jumps.
  3. Phase 3 (Refinement): Parameter tuning, constant optimization, edge-case handling. Diminishing returns per generation.
  4. Phase 4 (Final push): Combining best strategies, micro-optimizations for speed. Incremental improvements.

Qualitative Results

  • The system independently discovered simulated annealing as an effective approach
  • Multiple distinct solution strategies were maintained in the population
  • The best evolved solutions contained sophisticated C++ including efficient data structures
  • Some mutations introduced creative algorithmic ideas not present in the parent

6. Reproducibility

Reproducibility Assessment

| Factor | Status | Detail |
| --- | --- | --- |
| Code availability | Partial | Sakana AI has released code for related systems; AHC058-specific code may be partially available |
| Architecture description | Detailed | Blog post provides substantial architectural detail |
| LLM API access | Available | All models used are commercially available via API |
| Problem availability | Available | AHC058 problems remain available on AtCoder |
| Exact reproduction | Difficult | LLM non-determinism means exact runs differ |
| Qualitative reproduction | Feasible | Architecture is sufficiently described to build a comparable system |

Key Challenges for Reproduction

  • Non-determinism: LLM outputs vary even with same prompts and temperature; exact evolutionary trajectories cannot be reproduced
  • API costs: Running thousands of LLM calls is expensive; reproducing requires non-trivial budget
  • Prompt engineering: Exact prompt templates are crucial to performance and may not be fully disclosed
  • Infrastructure: Concurrent API calls, compilation sandboxing, and evaluation infrastructure require engineering effort

7. Cost Analysis

Estimated Cost Breakdown

| Component | Estimated Cost | Notes |
| --- | --- | --- |
| LLM API calls (total) | $200-$1000+ | Depends on model mix and number of generations |
| Claude 3.5 Sonnet calls | $3/M input, $15/M output | Primary mutation engine |
| GPT-4o calls | $5/M input, $15/M output | Alternative mutations |
| Cheaper models (Haiku etc.) | $0.25-$1/M tokens | Simple mutations, parameter tuning |
| Compute (compilation/eval) | Minimal | C++ compilation is fast; evaluation is bounded |
| Infrastructure | Minimal | Standard server; no GPU required for evolution |

Cost Efficiency

The system employs several cost management strategies: tiered model selection (cheap models for simple mutations), caching of compilations, early termination of poor solutions, and budget-aware model selection that shifts to cheaper models as budget depletes.
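One of these strategies, caching of compilations, can be sketched as follows. This is a hypothetical helper (the report does not publish its caching code); `compile_fn` stands in for whatever g++ wrapper the pipeline uses:

```python
import hashlib

class CompileCache:
    """Skip recompiling byte-identical sources (the 'caching of
    compilations' strategy mentioned above)."""

    def __init__(self):
        self._cache = {}  # sha256(source) -> compile result
        self.hits = 0

    def get_or_compile(self, source: str, compile_fn):
        """Return a cached result, or invoke compile_fn once per unique source."""
        key = hashlib.sha256(source.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = compile_fn(source)
        self._cache[key] = result
        return result
```

Because LLM mutations occasionally return the parent program unchanged, content-hash caching avoids paying for redundant compilations.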

Cost per Generation

$$
\text{Cost}_{\text{gen}} = N_{\text{offspring}} \cdot \left( \text{Cost}_{\text{LLM}}(\text{model}, \text{tokens}_{\text{in}}, \text{tokens}_{\text{out}}) + \text{Cost}_{\text{compile}} + \text{Cost}_{\text{eval}} \right)
$$

where N_offspring is the number of offspring per generation (~5-20) and Cost_eval involves running the compiled binary on test instances.
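Plugging the table's Claude 3.5 Sonnet prices into this formula gives a rough per-generation estimate (the token counts below are illustrative assumptions, not reported figures):

```python
def generation_cost(n_offspring, tokens_in, tokens_out,
                    price_in_per_m, price_out_per_m,
                    cost_compile=0.0, cost_eval=0.0):
    """Cost_gen = N_offspring * (Cost_LLM + Cost_compile + Cost_eval)."""
    cost_llm = (tokens_in / 1e6) * price_in_per_m + (tokens_out / 1e6) * price_out_per_m
    return n_offspring * (cost_llm + cost_compile + cost_eval)

# 10 offspring per generation, ~6k input / 4k output tokens per mutation,
# at the table's Claude 3.5 Sonnet prices ($3/M input, $15/M output):
cost = generation_cost(10, 6_000, 4_000, 3.0, 15.0)  # ~ $0.78 per generation
```

At that rate, several hundred generations lands in the few-hundred-dollar range quoted in the table above.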

8. Architecture Solution

8.1 High-Level Architecture Diagram

+========================================================================+
|                    SAKANA AI AHC058 EVOLUTIONARY SYSTEM                 |
+========================================================================+
|                                                                        |
|  +------------------+     +-----------------------+                    |
|  | PROBLEM CONTEXT  |     |   POPULATION STORE    |                    |
|  | - Problem desc   |     |   (Quality-Diversity) |                    |
|  | - Test cases     |     |   +---------------+   |                    |
|  | - Constraints    |     |   | Individual 1  |   |                    |
|  | - Scoring rules  |     |   | - code (C++)  |   |                    |
|  +--------+---------+     |   | - fitness     |   |                    |
|           |               |   | - strategy    |   |                    |
|           v               |   | - lineage     |   |                    |
|  +--------+---------+     |   +---------------+   |                    |
|  | PARENT SELECTION  |<---+   | Individual 2  |   |                    |
|  | - Fitness-prop   |    |   | ...           |   |                    |
|  | - Tournament     |    |   +---------------+   |                    |
|  | - Diversity bonus |    |   | Individual N  |   |                    |
|  +--------+---------+    |   +---------------+   |                    |
|           |               +-----------+-----------+                    |
|           v                           ^                                |
|  +--------+---------+                 |                                |
|  | MUTATION ENGINE   |                | (survivors +                   |
|  | (LLM-Based)       |                |  new offspring)                |
|  |                   |                |                                |
|  | +---------------+ |     +----------+----------+                     |
|  | | Model Select  | |     | SELECTION/SURVIVAL  |                     |
|  | | - Claude 3.5  | |     | - Truncation        |                     |
|  | | - GPT-4o      | |     | - Elitism           |                     |
|  | | - Gemini 1.5  | |     | - Diversity pres.   |                     |
|  | +-------+-------+ |     +----------+----------+                     |
|  |         |         |                ^                                |
|  | +-------v-------+ |                |                                |
|  | | Mutation Type  | |     +----------+----------+                     |
|  | | - Refactor    | |     | EVALUATION ENGINE   |                     |
|  | | - Optimize    | |     | - Compile (g++)     |                     |
|  | | - Bug fix     | |     | - Run on test cases |                     |
|  | | - Strategy    | |     | - Score aggregation |                     |
|  | | - Parameter   | |     | - Timeout handling  |                     |
|  | +-------+-------+ |     +----------+----------+                     |
|  |         |         |                ^                                |
|  | +-------v-------+ |                |                                |
|  | | Prompt Build  | |     +----------+----------+                     |
|  | | + Context     | |     | SANDBOX EXECUTION   |                     |
|  | +-------+-------+ |     | - Compilation       |                     |
|  |         |         |     | - Resource limits   |                     |
|  | +-------v-------+ |     | - Output capture    |                     |
|  | | LLM API Call  | |     +---------------------+                     |
|  | +-------+-------+ |                                                 |
|  |         |         |                                                 |
|  | +-------v-------+ |                                                 |
|  | | Code Extract  +-------> [New C++ Program] -----> Compile & Eval   |
|  | +---------------+ |                                                 |
|  +-------------------+                                                 |
|                                                                        |
|  +------------------------------------------------------------------+  |
|  | ORCHESTRATION LAYER                                               |  |
|  | - Generation loop control  - Budget management                   |  |
|  | - Parallel API dispatch    - Logging and monitoring              |  |
|  | - Error recovery           - Convergence detection               |  |
|  +------------------------------------------------------------------+  |
+========================================================================+

8.2 Data Flow

Generation Loop:

  [Population] --select--> [Parents] --mutate(LLM)--> [Offspring Code]
       ^                                                      |
       |                                                      v
       |                                              [Compile (g++)]
       |                                                      |
       |                                              [Pass?]--No--> discard
       |                                                |Yes
       |                                                v
       |                                        [Run on test cases]
       |                                                |
       |                                                v
       |                                        [Compute fitness]
       |                                                |
       |                                                v
       +--------[Survival selection]<---------[Scored offspring]
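The loop above can be sketched in a few lines. The `mutate`, `compile_ok`, `evaluate`, and `survive` callables are placeholders for the components described in Section 9, not Sakana AI's actual function names:

```python
import random

def generation_step(population, n_offspring, mutate, compile_ok, evaluate, survive):
    """One pass of the loop above: select -> mutate -> compile -> eval -> survive."""
    offspring = []
    for _ in range(n_offspring):
        # Tiny tournament: best of two random parents
        parent = max(random.sample(population, 2), key=lambda p: p["fitness"])
        child_code = mutate(parent["code"])
        if not compile_ok(child_code):  # compile failure: discard
            continue
        offspring.append({"code": child_code, "fitness": evaluate(child_code)})
    return survive(population + offspring)
```

Note that failed compilations simply vanish from the offspring pool, matching the "discard" branch in the diagram.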

9. Components

Component 1: Problem Context Manager

Maintains the static context for the competition problem:

  • Full problem statement (parsed from AtCoder)
  • Input/output format specifications
  • Scoring formula and constraints
  • Sample test cases and locally generated additional cases
  • Time and memory limits

Component 2: Population Store

The population store manages a collection of individuals (programs), each consisting of:

  • Genotype: Full C++ source code (typically 100-500 lines)
  • Phenotype metadata: Fitness score, compilation status, runtime stats
  • Lineage information: Parent ID, mutation type, generation number
  • Strategy tag: Categorical label of the algorithmic approach (greedy, SA, beam, etc.)
  • Behavioral descriptor: Feature vector describing solution characteristics
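These fields map naturally onto a small record type. The sketch below is illustrative; the field names are assumptions, not Sakana AI's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Individual:
    """One population member, mirroring the fields listed above."""
    code: str                          # genotype: full C++ source
    fitness: float = 0.0               # phenotype: aggregate score
    compiled: bool = False
    parent_id: Optional[int] = None    # lineage
    mutation_type: str = "init"
    generation: int = 0
    strategy_tag: str = "unknown"      # e.g. "greedy", "SA", "beam"
    descriptor: List[float] = field(default_factory=list)  # behavioral features
```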

Component 3: Parent Selection Module

Selects individuals from the population to serve as parents for the next generation of mutations. Implements multiple selection strategies with diversity-awareness.

Component 4: LLM Mutation Engine

The central component that transforms parent programs into offspring via LLM-based code generation. Includes model selection, prompt construction, API calling, and code extraction sub-modules.

Component 5: Compilation and Sandbox

Compiles generated C++ code using g++ with optimization flags, runs it in a sandboxed environment with resource limits (time, memory), and captures output.
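A minimal POSIX version of such a runner might combine a subprocess timeout with an address-space cap. This is an illustrative sketch (the actual sandbox is not described in detail), using only the standard `subprocess` and `resource` modules:

```python
import subprocess
import resource

def run_sandboxed(binary_path, input_data, time_limit_s=5.0, memory_mb=1024):
    """Run a binary with a wall-clock timeout and an address-space cap (POSIX)."""
    def limit_resources():  # executed in the child process just before exec
        cap = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    try:
        proc = subprocess.run(
            [binary_path], input=input_data, capture_output=True,
            text=True, timeout=time_limit_s, preexec_fn=limit_resources,
        )
    except subprocess.TimeoutExpired:
        return {"status": "TLE", "output": ""}
    if proc.returncode != 0:
        return {"status": "RE", "output": ""}
    return {"status": "OK", "output": proc.stdout}
```

The "TLE"/"RE"/"OK" statuses match the ones consumed by the Evaluation Engine in Section 10.7. A production sandbox would add filesystem and network isolation on top of this.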

Component 6: Evaluation Engine

Runs compiled programs on test instances, computes scores using the contest scoring formula, and aggregates multi-instance results into a single fitness value.

Component 7: Survival Selection

Determines which individuals survive to the next generation, balancing fitness elitism with diversity preservation.
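A sketch of such a survival step, combining truncation, elitism, and per-niche preservation (function and key names here are hypothetical):

```python
def survival_selection(population, max_size, n_elite=2, niche_key=None):
    """Truncate by fitness, guarantee elites, optionally keep each niche's best."""
    ranked = sorted(population, key=lambda ind: ind["fitness"], reverse=True)
    survivors = ranked[:n_elite]                    # elitism
    if niche_key is not None:                       # diversity preservation
        best_per_niche = {}
        for ind in ranked:
            best_per_niche.setdefault(ind[niche_key], ind)
        for ind in best_per_niche.values():
            if ind not in survivors:
                survivors.append(ind)
    for ind in ranked:                              # fill remainder by fitness
        if len(survivors) >= max_size:
            break
        if ind not in survivors:
            survivors.append(ind)
    return survivors[:max_size]
```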

Component 8: Orchestration Layer

Controls the overall evolutionary loop, manages parallelism, handles errors and retries, monitors budget, and detects convergence.
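Error recovery against flaky LLM APIs is typically retry-with-backoff; a minimal sketch (assumed behavior, since the report does not document its retry policy):

```python
import random
import time

def call_with_retries(fn, max_retries=3, base_delay=1.0):
    """Retry a flaky call (e.g. an LLM API request) with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise                       # out of retries: surface the error
            # Exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```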

10. Core Mechanisms

10.1 Code Modification and Mutation Operators

The system employs LLMs as intelligent mutation operators. Unlike traditional GP where mutations are syntactic (swap AST nodes), LLM mutations are semantic -- the model understands the code's purpose and can make meaningful algorithmic changes.

Mutation Types

| Mutation Type | Description | Magnitude | Frequency |
| --- | --- | --- | --- |
| Parameter Tuning | Adjust constants, thresholds, iteration counts | Small | High (30%) |
| Local Optimization | Improve a specific function or loop | Small-Medium | Medium (20%) |
| Bug Fix | Fix compilation errors or runtime issues | Small | As needed |
| Algorithm Refinement | Improve the core algorithm (e.g., better neighbor generation in SA) | Medium | Medium (20%) |
| Strategy Change | Replace the core approach (e.g., greedy to SA) | Large | Low (10%) |
| Major Refactor | Restructure the entire program | Large | Low (10%) |
| Crossover | Combine ideas from two parent programs | Variable | Low (10%) |

Example: Parameter Tuning Mutation Prompt

## System prompt (parameter tuning)
"""
You are an expert competitive programmer. You are given a C++ solution
for an optimization problem. Your task is to ONLY adjust numerical
constants, thresholds, and parameters to improve the score.

DO NOT change the algorithm structure. ONLY modify:
- Temperature schedules (initial temp, cooling rate)
- Iteration counts and loop bounds
- Scoring weights and penalties
- Threshold values for decisions

Return the complete modified C++ code.
"""

## User prompt
"""
Problem: {problem_description}

Current solution (score: {current_score}, best known: {best_score}):
```cpp
{parent_code}
```

The solution runs for {runtime}ms out of {time_limit}ms.
Scores on test cases: {per_test_scores}

Adjust the parameters to improve the average score.
Focus on: {suggested_parameters}
"""

Example: Strategy Change Mutation

## System prompt (strategy change)
"""
You are an expert competitive programmer specializing in heuristic
optimization. You are given a solution that uses one approach. Your
task is to COMPLETELY REWRITE the core algorithm using a different
optimization strategy.

Consider strategies like:
- Simulated Annealing with geometric cooling
- Beam Search with pruning
- Genetic/Evolutionary local search
- Hill climbing with random restarts
- Greedy construction + 2-opt improvement

Return a COMPLETE, compilable C++ program.
"""

Crossover Operator

# Crossover: combine two parent programs
def crossover_prompt(parent_a: str, parent_b: str,
                     score_a: float, score_b: float) -> str:
    return f"""You are given TWO different solutions to the same
optimization problem.

Solution A (score: {score_a}):
{parent_a}

Solution B (score: {score_b}):
{parent_b}

Create a NEW solution that combines the best ideas from both:
- Take the data structures from the better-structured solution
- Combine optimization strategies where possible
- Use the more efficient I/O handling

Return a complete, compilable C++ program."""

10.2 Parent Selection and Sampling Strategies

Parent selection determines which individuals from the population are chosen to be mutated. The system uses a multi-strategy approach:

Fitness-Proportionate Selection


$$
P(\text{select } i) = \frac{f(i)^{\alpha}}{\sum_{j} f(j)^{\alpha}}
$$

where f(i) is the fitness of individual i and α is a selection pressure parameter (typically 1.0-2.0).

import numpy as np

def fitness_proportionate_selection(population, alpha=1.5):
    """Select parent with probability proportional to fitness^alpha."""
    fitnesses = np.array([ind.fitness for ind in population])

    # Normalize to [0, 1] range
    fitnesses = (fitnesses - fitnesses.min()) / (fitnesses.max() - fitnesses.min() + 1e-8)

    # Apply selection pressure
    weights = fitnesses ** alpha
    probabilities = weights / weights.sum()

    return np.random.choice(population, p=probabilities)
Tournament Selection

def tournament_selection(population, tournament_size=3):
    """Select best individual from random tournament."""
    tournament = np.random.choice(population, size=tournament_size, replace=False)
    return max(tournament, key=lambda ind: ind.fitness)

Diversity-Aware Selection

def diversity_aware_selection(population, diversity_weight=0.3):
    """Select parents balancing fitness with behavioral diversity."""
    fitnesses = normalize([ind.fitness for ind in population])
    diversities = compute_diversity_scores(population)

    # Combined score: fitness + diversity bonus
    combined = (1 - diversity_weight) * fitnesses + diversity_weight * diversities
    probabilities = combined / combined.sum()

    return np.random.choice(population, p=probabilities)

def compute_diversity_scores(population):
    """Compute how different each individual is from others."""
    scores = []
    for ind in population:
        # Average behavioral distance to all others
        distances = [behavioral_distance(ind, other)
                     for other in population if other != ind]
        scores.append(np.mean(distances))
    return normalize(np.array(scores))

10.3 Population Management and Architecture

The population is structured as a **quality-diversity (QD) archive** rather than a simple flat list. This ensures the system maintains a diverse collection of high-quality solutions.

Population Structure

+----------------------------------------------------------------+
|                       POPULATION ARCHIVE                       |
|                                                                |
|  Strategy Niche Map:                                           |
|  +----------+----------+----------+----------+                 |
|  | Greedy   | SA       | Beam     | Hybrid   | ...             |
|  |          |          | Search   |          |                 |
|  | [Best]   | [Best]   | [Best]   | [Best]   |                 |
|  | Score:85 | Score:92 | Score:88 | Score:90 |                 |
|  |          |          |          |          |                 |
|  | [2nd]    | [2nd]    | [2nd]    | [2nd]    |                 |
|  | Score:80 | Score:89 | Score:84 | Score:87 |                 |
|  |          |          |          |          |                 |
|  | [3rd]    | [3rd]    |          | [3rd]    |                 |
|  | Score:78 | Score:86 |          | Score:83 |                 |
|  +----------+----------+----------+----------+                 |
|                                                                |
|  Elite Buffer: [Top 5 across all niches regardless of type]    |
|  Hall of Fame: [Best ever seen for each niche]                 |
+----------------------------------------------------------------+


class QDArchive:
    """Quality-Diversity Archive for population management."""

    def __init__(self, max_per_niche=5, niche_categories=None):
        self.niches = {}        # strategy_type -> list of individuals
        self.max_per_niche = max_per_niche
        self.elite_buffer = []  # Top K overall
        self.hall_of_fame = {}  # Best ever per niche

    def add(self, individual):
        """Add individual to archive, maintaining quality within niche."""
        niche = individual.strategy_tag

        if niche not in self.niches:
            self.niches[niche] = []

        self.niches[niche].append(individual)
        self.niches[niche].sort(key=lambda x: x.fitness, reverse=True)

        # Keep only top K per niche
        if len(self.niches[niche]) > self.max_per_niche:
            self.niches[niche] = self.niches[niche][:self.max_per_niche]

        # Update hall of fame
        if niche not in self.hall_of_fame or individual.fitness > self.hall_of_fame[niche].fitness:
            self.hall_of_fame[niche] = individual

        # Update elite buffer
        self._update_elite(individual)

    def _update_elite(self, individual, k=5):
        """Keep the elite buffer as the top-k individuals overall."""
        self.elite_buffer.append(individual)
        self.elite_buffer.sort(key=lambda x: x.fitness, reverse=True)
        del self.elite_buffer[k:]

    def get_population(self):
        """Return all individuals across all niches."""
        return [ind for niche in self.niches.values() for ind in niche]

    def population_size(self):
        return sum(len(n) for n in self.niches.values())

Population Parameters

| Parameter | Typical Value | Purpose |
| --- | --- | --- |
| Max population size | 30-50 | Total across all niches |
| Max per niche | 5-10 | Quality within strategy type |
| Elite buffer size | 5 | Global best preservation |
| Offspring per generation | 5-20 | Exploration-exploitation tradeoff |
| Elitism count | 1-3 | Best individuals guaranteed survival |

10.4 Solution Diversity and Novelty Mechanisms

Maintaining diversity is critical to avoid premature convergence. The system employs multiple mechanisms:

Strategy Tagging

Each program is tagged with its algorithmic strategy type. The LLM is asked to classify the approach:

def classify_strategy(code: str) -> str:
    """Use LLM to classify the algorithmic strategy."""
    prompt = f"""Classify this C++ solution into ONE category:
- GREEDY: Greedy construction
- SA: Simulated Annealing
- BEAM: Beam Search
- LS: Local Search (hill climbing, 2-opt, etc.)
- HYBRID: Combination of strategies
- OTHER: None of the above

Code:
{code}

Respond with ONLY the category tag."""
    return call_llm(prompt, model="claude-3-haiku").strip()

Behavioral Descriptors

Solutions are characterized by behavioral features computed from their execution:

def compute_behavioral_descriptor(program, test_cases, runtimes, peak_memory):
    """Compute behavioral feature vector from execution traces.

    runtimes and peak_memory are collected by the sandbox while the
    program runs on test_cases."""
    features = {}

    # Score distribution across test cases
    scores = [run_and_score(program, tc) for tc in test_cases]
    features['mean_score'] = np.mean(scores)
    features['score_variance'] = np.var(scores)
    features['min_score'] = np.min(scores)

    # Runtime characteristics
    features['avg_runtime'] = np.mean(runtimes)
    features['memory_usage'] = peak_memory

    # Code structural features
    features['code_length'] = len(program.code.split('\n'))
    features['num_functions'] = count_functions(program.code)

    return np.array(list(features.values()))

Novelty Search Component


$$
\text{novelty}(x) = \frac{1}{k} \sum_{i=1}^{k} \operatorname{dist}\big(\text{behavior}(x),\ \text{behavior}(nn_i)\big)
$$

where nn_i are the k nearest neighbors in behavioral space.
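A direct implementation of this formula, assuming behavioral descriptors are numeric vectors as in the previous subsection:

```python
import numpy as np

def novelty(descriptor, archive_descriptors, k=3):
    """Mean distance from a behavior vector to its k nearest archive neighbors."""
    dists = np.sort([np.linalg.norm(np.asarray(descriptor) - np.asarray(other))
                     for other in archive_descriptors])
    return float(np.mean(dists[:k]))
```

High novelty means the candidate behaves unlike anything in the archive, which is exactly what earns it a survival bonus.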

Multi-Model Diversity

Using multiple LLMs inherently generates diverse mutations because each model has different coding styles, preferences for data structures, and optimization intuitions.

10.5 LLM Orchestration and Model Selection

Model Selection Strategy

import asyncio
import random

class ModelOrchestrator:
    """Manages multi-model LLM orchestration."""

    MODELS = {
        "claude-3-5-sonnet": {"quality": 0.95, "cost_per_1k": 0.009, "speed": "medium"},
        "gpt-4o": {"quality": 0.90, "cost_per_1k": 0.010, "speed": "medium"},
        "claude-3-haiku": {"quality": 0.70, "cost_per_1k": 0.001, "speed": "fast"},
        "gemini-1.5-pro": {"quality": 0.88, "cost_per_1k": 0.007, "speed": "medium"},
    }

    def select_model(self, mutation_type, budget_fraction_remaining):
        """Select optimal model for this mutation."""

        # High-impact mutations get best models
        if mutation_type in ["strategy_change", "major_refactor", "crossover"]:
            if budget_fraction_remaining > 0.3:
                return "claude-3-5-sonnet"
            return "gpt-4o"

        # Simple mutations use cheap models
        if mutation_type in ["parameter_tune", "bug_fix"]:
            return "claude-3-haiku"

        # Default: rotate for diversity
        return random.choice(["claude-3-5-sonnet", "gpt-4o"])

    async def batch_mutate(self, parents, mutation_types):
        """Run multiple mutations in parallel across models."""
        tasks = []
        for parent, mut_type in zip(parents, mutation_types):
            model = self.select_model(mut_type, self.budget_remaining)
            tasks.append(self.async_mutate(parent, mut_type, model))

        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in results if not isinstance(r, Exception)]

Temperature Scheduling

The LLM temperature is varied strategically:

def get_temperature(mutation_type, generation, max_generations):
    """Adaptive temperature based on mutation type and progress."""
    progress = generation / max_generations  # 0.0 to 1.0

    base_temps = {
        "parameter_tune": 0.3,
        "local_fix": 0.4,
        "algorithm_refine": 0.6,
        "strategy_change": 0.9,
        "major_refactor": 0.8,
        "crossover": 0.7,
    }

    base = base_temps.get(mutation_type, 0.6)

    # Cool down as evolution progresses (exploit more, explore less)
    temp = base * (1.0 - 0.3 * progress)

    return max(0.2, min(1.0, temp))

10.6 Search Strategies and Tree Exploration

The evolutionary search can be viewed as exploring a tree of program transformations. Several strategies govern this exploration:

Lineage Tree

              [Initial Random Solution]
               /          |          \
     [Greedy v1]       [SA v1]      [Beam v1]
       /      \        /     \          |
[Greedy v2] [+SA] [SA v2] [SA v3]  [Beam v2]
     |               |       \
[Greedy v3]    [SA v2.1]  [SA+Beam Hybrid]
                     |           |
               [SA v2.2]  [Hybrid v2 -- BEST]

Exploration vs Exploitation Balance


$$
\text{mutation\_type} \sim \text{Categorical}\big(p_{\text{explore}}(g),\ p_{\text{exploit}}(g)\big)
$$

$$
p_{\text{explore}}(g) = p_{\text{explore,init}} \cdot \text{decay}^{g} + p_{\text{explore,min}},
\qquad
p_{\text{exploit}}(g) = 1 - p_{\text{explore}}(g)
$$

The exploration probability decays over generations, shifting focus from strategy discovery to refinement.

def sample_mutation_type(generation, max_gen):
    """Sample mutation type with an exploration-exploitation schedule."""
    progress = generation / max_gen

    # Exploration mutations (large changes)
    explore_prob = max(0.1, 0.5 * (1.0 - progress))

    if random.random() < explore_prob:
        # Exploration: large-magnitude mutations
        return random.choice(["strategy_change", "major_refactor", "crossover"])
    else:
        # Exploitation: refine existing solutions
        return random.choices(
            ["parameter_tune", "local_fix", "algorithm_refine"],
            weights=[0.4, 0.3, 0.3]
        )[0]

10.7 Evaluation and Fitness Assessment

#### Evaluation Pipeline

```python
import numpy as np

class Evaluator:
    def evaluate(self, code: str, test_cases: list) -> EvalResult:
        """Full evaluation pipeline for a candidate program."""

        # Step 1: Compilation
        compile_result = self.compile_cpp(code, flags=["-O2", "-std=c++17"])
        if not compile_result.success:
            return EvalResult(
                fitness=0,
                status="COMPILE_ERROR",
                error=compile_result.stderr,
            )

        # Step 2: Run on each test case
        scores = []
        runtimes = []
        for tc in test_cases:
            result = self.run_sandboxed(
                binary=compile_result.binary,
                input_data=tc.input,
                time_limit=5.0,     # seconds
                memory_limit=1024,  # MB
            )

            if result.status == "OK":
                score = self.compute_score(result.output, tc.expected, tc.params)
                scores.append(score)
                runtimes.append(result.runtime)
            elif result.status == "TLE":
                scores.append(0)  # Time limit exceeded
            elif result.status == "RE":
                scores.append(0)  # Runtime error

        # Step 3: Aggregate fitness
        if len(scores) == 0:
            fitness = 0
        else:
            # Weighted: penalize variance, reward consistency
            fitness = np.mean(scores) - 0.1 * np.std(scores)

        return EvalResult(
            fitness=fitness,
            status="OK",
            scores=scores,
            runtimes=runtimes,
        )
```
#### Fitness Aggregation


$$
\text{fitness}(p) = \frac{1}{N}\sum_{i=1}^{N} \text{score}(p, \text{test}_i) \;-\; \lambda \cdot \text{std}\big(\{\text{score}(p, \text{test}_i)\}_{i=1}^{N}\big)
$$

Mean score across test cases with a penalty for inconsistency ($\lambda \approx 0.1$).
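A quick numeric check of the aggregation (a standalone helper using the $\lambda = 0.1$ from the pipeline above; the scores are made up for illustration):

```python
import numpy as np

def aggregate_fitness(scores, lam=0.1):
    """Mean score minus a penalty on the (population) standard deviation."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores) - lam * np.std(scores))

# A consistent solver beats an erratic one with the same mean score:
consistent = aggregate_fitness([90, 92, 88, 90])  # mean 90, std ~1.4
erratic = aggregate_fitness([120, 60, 110, 70])   # mean 90, std ~25.5
print(consistent, erratic)
```

This is why the penalty term exists: AHC ranking rewards solutions that hold up across many instances, not ones that spike on a few.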

#### Fast Evaluation (Subset Sampling)

To reduce evaluation cost, the system uses a two-stage evaluation:

```python
import random

def tiered_evaluation(candidate, all_tests, elite_threshold):
    """Two-tier evaluation: fast screening, then full eval."""

    # Stage 1: Quick eval on a small random subset
    quick_tests = random.sample(all_tests, k=5)
    quick_fitness = evaluate(candidate, quick_tests)

    # Only fully evaluate promising candidates
    if quick_fitness > elite_threshold * 0.8:
        # Stage 2: Full evaluation
        full_fitness = evaluate(candidate, all_tests)
        return full_fitness

    return quick_fitness  # Not worth a full eval
```
### 10.8 Prompt Engineering and Co-Evolution

#### Prompt Template Structure

Prompts are constructed from multiple components:

```python
class PromptBuilder:
    def build(self, parent, mutation_type, population_context):
        sections = []

        # 1. System context
        sections.append(self.system_prompt(mutation_type))

        # 2. Problem description (always included)
        sections.append(f"## Problem\n{self.problem_description}")

        # 3. Parent code
        sections.append(
            f"## Current Solution (Score: {parent.fitness})\n"
            f"```cpp\n{parent.code}\n```"
        )

        # 4. Performance context
        sections.append(
            f"## Performance\n"
            f"- Best known score: {population_context['best_score']}\n"
            f"- This solution's score: {parent.fitness}\n"
            f"- Per-test scores: {parent.per_test_scores}\n"
            f"- Runtime: {parent.avg_runtime}ms"
        )

        # 5. Error history (if applicable)
        if parent.recent_errors:
            sections.append(f"## Recent Errors\n{parent.recent_errors}")

        # 6. Mutation-specific instructions
        sections.append(self.mutation_instructions(mutation_type))

        # 7. Output format
        sections.append(
            "## Output\nReturn the COMPLETE modified C++ code "
            "in a ```cpp code block."
        )

        return "\n\n".join(sections)
```
#### Prompt Co-Evolution

The system can evolve its own prompts over time. Prompts that lead to successful mutations (fitness-improving offspring) are reinforced:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

class PromptEvolver:
    def __init__(self):
        self.prompt_variants = {}  # mutation_type -> list of prompt templates
        self.prompt_scores = {}    # prompt_id -> list of success flags

    def record_outcome(self, prompt_id, fitness_improvement):
        """Track which prompts lead to improvements."""
        if prompt_id not in self.prompt_scores:
            self.prompt_scores[prompt_id] = []
        self.prompt_scores[prompt_id].append(fitness_improvement > 0)

    def select_prompt_template(self, mutation_type):
        """Select a prompt template weighted by historical success."""
        variants = self.prompt_variants[mutation_type]
        success_rates = [
            np.mean(self.prompt_scores.get(v.id, [0.5]))
            for v in variants
        ]
        # Softmax selection (sharpened by a temperature factor of 3)
        probs = softmax(np.array(success_rates) * 3.0)
        return np.random.choice(variants, p=probs)
```
### 10.9 Cost Control and Budget Management

```python
class BudgetManager:
    def __init__(self, total_budget: float, total_generations: int):
        self.total_budget = total_budget
        self.spent = 0.0
        self.total_generations = total_generations
        self.current_generation = 0

    def get_generation_budget(self) -> float:
        """Budget for the current generation."""
        remaining = self.total_budget - self.spent
        remaining_gens = self.total_generations - self.current_generation

        # Reserve more budget for later (refinement phase)
        if self.current_generation < self.total_generations * 0.3:
            # Early: explore freely but don't overspend
            return remaining / remaining_gens * 0.8
        else:
            # Later: steady spend
            return remaining / remaining_gens

    def should_use_cheap_model(self) -> bool:
        """Switch to cheaper models when the budget is running low."""
        fraction_remaining = (self.total_budget - self.spent) / self.total_budget
        fraction_time_remaining = 1 - self.current_generation / self.total_generations

        # If spending faster than time progresses, use cheaper models
        return fraction_remaining < fraction_time_remaining * 0.8

    def record_cost(self, cost: float):
        self.spent += cost
```
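The cheap-model switch is easy to sanity-check in isolation. This standalone helper mirrors the condition in `should_use_cheap_model` (a condensed sketch of the class method above, not the class itself):

```python
def should_use_cheap_model(spent, total_budget, generation, total_generations):
    """True when budget is being consumed faster than generations elapse."""
    fraction_remaining = (total_budget - spent) / total_budget
    fraction_time_remaining = 1 - generation / total_generations
    return fraction_remaining < fraction_time_remaining * 0.8

# Halfway through the run with 70% of the budget already spent -> downgrade
print(should_use_cheap_model(spent=70, total_budget=100,
                             generation=50, total_generations=100))  # True
# Halfway through with only 30% spent -> keep the frontier model
print(should_use_cheap_model(spent=30, total_budget=100,
                             generation=50, total_generations=100))  # False
```

The 0.8 factor builds in slack: the system downgrades slightly before the spend rate actually overtakes the time rate.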
### 10.10 Meta-Level and Self-Improvement Techniques

The system incorporates several meta-learning mechanisms:

#### Adaptive Mutation Rate Scheduling

The distribution over mutation types is adapted based on which types have been most successful recently:

```python
import numpy as np

class AdaptiveMutationScheduler:
    def __init__(self, mutation_types, window_size=20):
        self.types = mutation_types
        self.history = {t: [] for t in mutation_types}
        self.window = window_size

    def record(self, mutation_type, improved: bool):
        self.history[mutation_type].append(improved)

    def get_probabilities(self):
        """Compute mutation type probabilities from recent success."""
        success_rates = {}
        for t in self.types:
            recent = self.history[t][-self.window:]
            if len(recent) > 0:
                success_rates[t] = np.mean(recent) + 0.1  # smoothing
            else:
                success_rates[t] = 0.5  # uninformed prior

        total = sum(success_rates.values())
        return {t: v / total for t, v in success_rates.items()}
```
#### Feedback-Driven Prompting

Error messages and execution failures are fed back into subsequent prompts, enabling the system to learn from its mistakes within a single evolutionary run:

If a mutation produces a compilation error, the error is included in the next attempt's prompt:

```python
if eval_result.status == "COMPILE_ERROR":
    error_context = f"""
The previous mutation attempt failed to compile:
Error: {eval_result.error}

Please fix this error while also improving the solution.
"""
```

#### Population Statistics as Meta-Information

The system tracks and uses aggregate statistics about the evolutionary run to inform its decisions:

- Fitness improvement rate (stagnation detection)
- Compilation success rate per mutation type
- Average fitness improvement per model
- Diversity metrics over time
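Stagnation detection, the first item above, can be sketched as a sliding-window check on the best-fitness trace (an illustrative helper; the window size and threshold are assumed, not from the report):

```python
def is_stagnant(best_fitness_history, window=10, min_improvement=1e-6):
    """Flag stagnation when best fitness has barely moved over `window` generations."""
    if len(best_fitness_history) < window + 1:
        return False  # not enough history yet
    improvement = best_fitness_history[-1] - best_fitness_history[-1 - window]
    return improvement < min_improvement

# Steady climb, then a plateau over the last dozen generations:
trace = [100.0] * 5 + [100.0 + 0.5 * i for i in range(10)] + [104.5] * 12
print(is_stagnant(trace))  # True
```

A positive stagnation signal would then shift the mutation distribution back toward exploration (strategy changes, crossover) rather than further refinement.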

## 11. Programming Language

### System Implementation

The evolutionary framework itself is implemented in **Python**, leveraging:

- `asyncio` for concurrent LLM API calls
- `subprocess` for compilation and execution sandboxing
- `numpy` for fitness calculations and selection
- Standard HTTP libraries for API communication (httpx/aiohttp)

### Generated Code Language

The evolved solutions are written in **C++17**, the dominant language for AtCoder competitive programming:

- Compiled with `g++ -O2 -std=c++17`
- Utilizes STL containers, algorithms, and data structures
- Solutions typically 100-500 lines
- Includes competitive programming idioms (fast I/O, macros, etc.)

```cpp
// Typical generated solution structure
#include <bits/stdc++.h>
using namespace std;

// Constants and globals
const int MAXN = 1005;
const double INIT_TEMP = 1000.0;
const double COOL_RATE = 0.9999;

// Problem-specific data structures
struct State {
    vector<vector<int>> grid;
    int score;
    // ...
};

// Simulated annealing (auto-discovered by evolution)
State solve(/* inputs */) {
    State current = greedy_init(/* ... */);
    State best = current;
    double temp = INIT_TEMP;

    while (elapsed() < TIME_LIMIT) {
        State neighbor = generate_neighbor(current);
        int delta = neighbor.score - current.score;
        if (delta > 0 || exp(delta / temp) > rng()) {
            current = neighbor;
            if (current.score > best.score)
                best = current;
        }
        temp *= COOL_RATE;
    }
    return best;
}
```

## 12. Memory Management

### Context Window Management

LLM context windows are a critical resource. The system manages them carefully:

| Context Component | Approximate Tokens | Management Strategy |
| --- | --- | --- |
| System prompt | 200-500 | Static, always included |
| Problem description | 500-2000 | Compressed summary for smaller contexts |
| Parent code | 500-3000 | Full code always included |
| Performance context | 200-500 | Summarized scores and timing |
| Error history | 0-500 | Only recent/relevant errors |
| Mutation instructions | 200-800 | Mutation-type specific |
| **Total per call** | **1600-7300** |  |
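Fitting these components into a fixed context budget can be sketched as a greedy trimmer that drops the optional pieces first (a hypothetical helper, not the report's code; the `len // 4` token estimate is a rough assumption):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token
    return len(text) // 4

def fit_to_budget(sections, budget_tokens):
    """Drop optional sections (lowest priority first) until the prompt fits.

    `sections` is a list of (name, text, required) tuples, ordered by priority.
    """
    kept = list(sections)
    while sum(estimate_tokens(t) for _, t, _ in kept) > budget_tokens:
        for i in range(len(kept) - 1, -1, -1):
            if not kept[i][2]:  # optional section -> drop it
                del kept[i]
                break
        else:
            break  # only required sections remain; stop trimming
    return [name for name, _, _ in kept]

sections = [
    ("system", "x" * 1200, True),       # ~300 tokens
    ("problem", "x" * 4000, True),      # ~1000 tokens
    ("parent_code", "x" * 8000, True),  # ~2000 tokens
    ("perf", "x" * 1600, False),        # ~400 tokens
    ("errors", "x" * 2000, False),      # ~500 tokens
]
print(fit_to_budget(sections, budget_tokens=3500))
```

This matches the table's strategy column: the problem description and parent code are always included, while error history and performance context are the first candidates for truncation.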

### History and Caching

- **Compilation cache:** Identical source code is not recompiled; hash-based lookup
- **Evaluation cache:** If an identical program was already evaluated, reuse the score
- **Lineage history:** Full parent-child relationships stored for analysis but not included in prompts (too large)
- **Error deduplication:** Repeated identical errors are collapsed in feedback

```python
import hashlib

class EvalCache:
    def __init__(self):
        self.cache = {}  # hash(code) -> EvalResult

    def get_or_eval(self, code, evaluator, test_cases):
        code_hash = hashlib.sha256(code.encode()).hexdigest()
        if code_hash in self.cache:
            return self.cache[code_hash]
        result = evaluator.evaluate(code, test_cases)
        self.cache[code_hash] = result
        return result
```

## 13. Continued Learning

### Within-Run Learning

- **Adaptive mutation distribution:** Shifts toward successful mutation types over generations
- **Error-informed prompting:** Compilation and runtime errors are fed back to prevent repeated mistakes
- **Population-level learning:** The archive accumulates knowledge of which strategies work, implicitly guiding future exploration
- **Prompt success tracking:** Successful prompt templates are preferentially selected

### Cross-Run Learning

- **Seed solutions:** Best solutions from previous runs can seed new populations
- **Prompt library:** Effective prompts can be stored and reused across problems
- **Strategy priors:** Knowledge that certain strategies (e.g., SA) work well for certain problem types can bias initial exploration
- **Model performance profiles:** Track which LLMs perform best for which mutation types
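Seeding a new population from a previous run's archive can be sketched as follows (a hypothetical helper; the archive is assumed to map solution IDs to `(code, fitness)` pairs recorded by an earlier run):

```python
def seed_population(archive, pop_size, elite_fraction=0.5):
    """Seed a new run: top archive solutions plus slots for fresh random inits.

    `archive` maps solution_id -> (code, fitness) from a previous run.
    Returns the elite seed programs and the number of fresh slots to fill.
    """
    ranked = sorted(archive.values(), key=lambda cf: cf[1], reverse=True)
    n_elites = min(int(pop_size * elite_fraction), len(ranked))
    seeds = [code for code, _ in ranked[:n_elites]]
    n_fresh = pop_size - len(seeds)  # remaining slots: fresh random solutions
    return seeds, n_fresh

archive = {
    "run1_best": ("// SA hybrid", 95.2),
    "run1_alt": ("// beam search", 91.0),
    "run1_old": ("// greedy", 70.5),
}
seeds, n_fresh = seed_population(archive, pop_size=8)
print(seeds, n_fresh)
```

Reserving slots for fresh random solutions keeps the new run from collapsing onto the previous run's local optimum.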

### No Fine-Tuning of LLMs

The system does **not** fine-tune the underlying LLMs. All learning happens at the population/prompt level; the LLMs are used as fixed "mutation engines" via their APIs. This is a deliberate design decision: it avoids the cost and complexity of fine-tuning while leveraging the broad capabilities of frontier models.

## 14. Applications and Use Cases

### Direct Applications

- **Competitive programming automation:** Generating solutions for heuristic contests (AHC, Google Hash Code, etc.)
- **Algorithm discovery:** Automatically finding novel algorithms for optimization problems
- **Code optimization:** Evolving high-performance implementations of known algorithms
- **Hyperparameter tuning via code:** Evolving code that includes its own optimized parameters

### Extended Applications

- **Scientific computing:** Discovering efficient numerical methods (cf. Sakana's "AI Scientist")
- **ML algorithm design:** Evolving loss functions, optimizers, and architectures (cf. Sakana's discovered-algorithms work)
- **Industrial optimization:** Logistics, scheduling, resource allocation -- any domain with a computable fitness function
- **Software engineering:** Automated code improvement, refactoring, and bug fixing at scale
- **Education:** Generating diverse solution approaches to programming problems for teaching

### Limitations and Boundaries

- Requires a clear, automated fitness function (cannot handle subjective quality)
- Generated code may be difficult to understand or maintain (evolved, not designed)
- The cost of LLM API calls limits applicability in budget-constrained settings
- Non-deterministic: different runs produce different results
- Currently limited to single-file programs; multi-module systems would require extension

## 15. References

1. Sakana AI, "Evolutionary Code Generation for AHC058," https://sakana.ai/ahc058/, 2025.
2. C. Lu, S. Lange, Y. Tang, D. Ha, "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," Sakana AI, 2024.
3. R. T. Lange, Y. Tang, D. Ha, "Discovering Attention-Based Genetic Algorithms via Large Language Models," Sakana AI, 2024.
4. J. Lehman, J. Gordon, S. Jain, K. Ndousse, C. Castricato, S. Bansal, "Evolution through Large Models," 2023.
5. J.-B. Mouret, J. Clune, "Illuminating Search Spaces by Mapping Elites," arXiv:1504.04909, 2015.
6. S. Risi, J. Togelius, "Increasing Generality in Machine Learning through Procedural Content Generation," Nature Machine Intelligence, 2020.
7. AtCoder Heuristic Contest 058, https://atcoder.jp/contests/ahc058.

Technical Research Report | Sakana AI AHC058 Evolutionary Code Generation System

