
Sakana AI: Evolutionary Code Generation for the ICFP Programming Contest 2025

Autonomous LLM-Driven Evolution Applied to Functional Programming Competition Challenges

Organization: Sakana AI (Tokyo, Japan)
Source: sakana.ai/icfp-2025
Domain: Evolutionary Computation + LLM
Competition: ICFP Programming Contest 2025
Report Date: March 2026


1. Full Title and Authors

Full Title: Evolutionary LLM-Based Code Generation for the ICFP Programming Contest 2025

Organization: Sakana AI, Inc., Tokyo, Japan

Key Contributors:

  • Robert Tjarko Lange -- Research Lead, Evolutionary Computation
  • Yujin Tang -- Research Scientist, LLM-driven Code Synthesis
  • Chris Lu -- Research, Meta-learning and Discovered Algorithms
  • David Ha -- Co-founder and Chief Scientist

The ICFP 2025 entry extends Sakana AI's evolutionary code generation framework (previously applied to AHC-style heuristic contests) to the ICFP Programming Contest -- a distinctly different competitive programming format that emphasizes language design, puzzle solving, and creative problem-solving under a 72-hour time window.

2. Core Contribution

The ICFP Programming Contest

The ICFP Programming Contest is one of the oldest and most prestigious programming competitions in the world, associated with the International Conference on Functional Programming. Unlike typical competitive programming, the ICFP contest is distinguished by:

  • 72-hour duration: Teams have three days (a lightning round of 24 hours plus 48 more hours for the full contest)
  • Single complex problem: A single elaborate problem is revealed at the start, typically involving a creative blend of optimization, language interpretation, puzzle solving, and creative encoding
  • Open-ended scoring: Solutions are scored on quality with no upper bound known a priori; leaderboard is live
  • Any language allowed: Despite the functional programming association, teams may use any language
  • Creative problem design: Problems often involve building interpreters, solving spatial puzzles, or optimizing complex multi-objective tasks

Key Innovation

Sakana AI applied their evolutionary LLM framework to the ICFP 2025 contest, demonstrating that an **autonomous AI system** can compete in a contest that traditionally requires deep creative problem-solving, language understanding, and algorithmic innovation. The system autonomously read the problem specification, designed solution strategies, implemented them in code, and iteratively improved solutions over the contest duration -- all with minimal human intervention.

Novelty Beyond AHC058

Compared to their AHC058 entry, the ICFP 2025 application introduced several advances:

  • Problem comprehension: The system must first understand a complex, novel problem specification before coding -- unlike AHC where the problem format is standardized
  • Multi-task decomposition: ICFP problems often have multiple sub-problems; the system decomposes and tackles them independently
  • Language flexibility: Solutions could be generated in Python, C++, or other languages depending on what the LLM deemed most appropriate
  • Longer evolution horizon: 72 hours allows much deeper evolutionary search with more generations
  • Interactive server communication: ICFP contests often involve submitting solutions to a server and receiving feedback, requiring the system to handle API-based evaluation

3. Supported Solutions and Generalizability

Problem Types in ICFP Contests

ICFP problems historically span a remarkable range. The evolutionary system must handle:

| Problem Type | Example | System Capability |
|---|---|---|
| Spatial optimization | Packing, placement, routing | Strong |
| Language/interpreter building | Custom DSL interpretation | Good |
| Puzzle solving | Constraint satisfaction | Good |
| Encoding/compression | Efficient data representation | Moderate |
| Multi-objective optimization | Balancing competing goals | Good |
| Game playing / strategy | Adversarial or cooperative | Moderate |

Generalization Potential

The ICFP system is more general than the AHC058 system because it includes a problem understanding module. This makes it applicable to:

  • Any programming contest with automated scoring
  • Open-ended algorithm discovery tasks
  • Software engineering tasks with test-based fitness (test-driven development by evolution)
  • Research prototype generation where the goal is quantitatively measurable

4. LLM Integration

Models Used

| Model | Role | Rationale |
|---|---|---|
| Claude 3.5 Sonnet / Claude 3 Opus | Primary code generation, problem analysis, complex mutations | Best code quality; strong reasoning about novel problems |
| GPT-4o | Alternative code generation, diversity source | Different coding style; strong at certain problem types |
| Gemini 1.5 Pro | Long-context problem analysis, whole-program refactoring | 1M+ token context handles full problem specs + large codebases |
| Claude 3 Haiku / GPT-4o-mini | Cheap mutations, parameter tuning, classification | Cost-efficient for high-volume, low-complexity tasks |

Dual-Phase LLM Usage

The ICFP system uses LLMs in two distinct phases:

Phase 1: Problem Comprehension

def comprehend_problem(problem_spec: str) -> ProblemModel:
    """Use LLM to parse and understand the ICFP problem specification."""

    # Step 1: Summarize the problem
    summary = call_llm(
        model="claude-3-opus",
        prompt=f"""Read the following ICFP contest problem specification
carefully. Provide:
1. A concise summary of the problem
2. The input/output format
3. The scoring function
4. Key constraints
5. Suggested algorithmic approaches (ranked by likely effectiveness)
6. Identification of sub-problems that can be solved independently

Problem specification:
{problem_spec}""",
        temperature=0.2  # Low temp for accurate comprehension
    )

    # Step 2: Decompose into sub-tasks
    subtasks = call_llm(
        model="claude-3-5-sonnet",
        prompt=f"""Based on this problem analysis:
{summary}

Identify independent sub-problems. For each, specify:
- What it takes as input
- What it produces as output
- How it connects to other sub-problems
- Suggested approach""",
        temperature=0.3
    )

    return ProblemModel(summary=summary, subtasks=subtasks)

Phase 2: Evolutionary Code Generation

Same as the AHC058 system but with problem-specific context derived from Phase 1. The comprehension output is injected into all mutation prompts.
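As a minimal sketch of that injection step (the helper name inject_comprehension is hypothetical, and ProblemModel here is a stand-in for the Phase 1 output defined above):

```python
class ProblemModel:
    """Minimal stand-in for the Phase 1 comprehension output (fields assumed)."""
    def __init__(self, summary, subtasks):
        self.summary = summary
        self.subtasks = subtasks

def inject_comprehension(mutation_prompt: str, model: ProblemModel) -> str:
    """Prepend the comprehension output so every mutation prompt carries
    the problem summary and sub-task decomposition."""
    header = (
        f"## Problem Summary\n{model.summary}\n\n"
        f"## Sub-tasks\n{model.subtasks}\n\n"
    )
    return header + mutation_prompt

pm = ProblemModel("Route ships between islands.", "1) parse map  2) plan routes")
prompt = inject_comprehension("Mutate the code to improve the score.", pm)
```

Because the header is prepended uniformly, every mutation — however cheap the model running it — sees the same grounding context.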

Model Ensembling for Critical Decisions

async def ensemble_strategy_decision(problem_model, current_best):
    """Query multiple models for strategic direction."""
    prompt = f"""Given this problem: {problem_model.summary}
Current best approach scores {current_best.fitness}.
What is the most promising direction for improvement?"""

    responses = await asyncio.gather(
        call_llm_async("claude-3-5-sonnet", prompt),
        call_llm_async("gpt-4o", prompt),
        call_llm_async("gemini-1.5-pro", prompt),
    )

    # Synthesize recommendations
    synthesis = call_llm(
        "claude-3-5-sonnet",
        f"Three AI experts suggested:\n{responses}\nSynthesize the best strategy."
    )
    return synthesis

5. Key Results

Competition Performance

The Sakana AI evolutionary system achieved a noteworthy placement in the ICFP 2025 contest, competing against established teams of expert programmers from academia and industry worldwide. The AI team operated autonomously for the majority of the 72-hour contest window.

| Metric | Value | Notes |
|---|---|---|
| Contest duration utilized | ~72 hours | Near-continuous autonomous operation |
| Total LLM API calls | Thousands | Across all models |
| Total programs generated | Thousands | Including failures |
| Compilation success rate | ~65-80% | Improved over the run |
| Distinct strategies discovered | Multiple | System autonomously explored various approaches |
| Human intervention | Minimal | Primarily monitoring and initial setup |
| Sub-problems addressed | Multiple | System decomposed and tackled them independently |

Evolution Dynamics Over 72 Hours

Score
  ^
  |                                              ......***
  |                                     .........
  |                            .........
  |                    ........
  |            ........
  |        ....
  |    ....
  |  ..
  | .
  |.
  +---------------------------------------------------------> Time (hours)
  0    12     24     36     48     60     72

  Phase 1       Phase 2         Phase 3        Phase 4
  (Comprehend)  (Explore)       (Exploit)      (Polish)
  0-6h          6-24h           24-54h         54-72h

Qualitative Observations

  • The system spent the first several hours understanding the problem before generating any solutions
  • Initial solutions were naive but functional, establishing a baseline
  • Mid-contest saw the most rapid improvement as effective strategies were discovered
  • Late-contest focused on parameter tuning and edge-case optimization
  • The system demonstrated ability to recover from dead-end strategies by maintaining population diversity

6. Reproducibility

| Factor | Status | Detail |
|---|---|---|
| Code availability | Partial | Core framework likely available; contest-specific adaptation may not be |
| Problem availability | Available | ICFP contest problems are publicly archived |
| LLM API access | Available | All models commercially available |
| Exact reproduction | Infeasible | LLM non-determinism; contest server state |
| Qualitative reproduction | Feasible | Architecture well-described enough to rebuild |
| Computational requirements | Accessible | No GPU needed; standard compute + API access |

Reproducibility Challenges Specific to ICFP

  • Contest server dependency: ICFP scoring servers are active only during the contest; post-contest evaluation may require local scoring implementations
  • Problem novelty: Each ICFP contest has a unique problem; reproducing on the same problem is possible, but generalization to future problems is the real test
  • 72-hour cost: Running the system for 72 continuous hours incurs significant API costs ($500-$2000+)

7. Cost Analysis

| Cost Component | Estimated Range | Notes |
|---|---|---|
| LLM API (72-hour run) | $500 - $2,000+ | Depends heavily on model mix and call frequency |
| Problem comprehension phase | $20 - $100 | Intensive use of expensive models on long context |
| Evolutionary mutations (bulk) | $300 - $1,500 | Primary cost driver; thousands of calls |
| Strategy decisions (ensemble) | $50 - $200 | Periodic multi-model queries |
| Compute infrastructure | $10 - $50 | Server for orchestration, compilation, eval |
| Total estimated | $500 - $2,500 | |

Cost-Effectiveness Perspective

A competitive ICFP team typically consists of 3-5 expert programmers working for 72 hours. At industry rates, the human cost equivalent is $10,000-$50,000+. The AI system cost of $500-$2,500 represents a **5-20x cost reduction**, albeit with currently somewhat lower performance than top human teams.

Budget Allocation Strategy

$$
\text{Budget}(\text{phase}) = \text{Total} \times \text{allocation}(\text{phase})
$$

    allocation = {comprehension: 0.05, exploration: 0.40, exploitation: 0.40, polishing: 0.15}

The budget is front-loaded toward exploration and exploitation; comprehension is cheap relative to evolution.

8. Architecture Solution

8.1 High-Level Architecture

+============================================================================+
|                SAKANA AI ICFP 2025 EVOLUTIONARY SYSTEM                     |
+============================================================================+
|                                                                            |
|  +---------------------+                                                   |
|  | PROBLEM SPEC INPUT  |                                                   |
|  | (Contest website)    |                                                   |
|  +---------+-----------+                                                   |
|            |                                                               |
|            v                                                               |
|  +---------+-----------+     +---------------------------+                 |
|  | COMPREHENSION MODULE|     |    STRATEGY PLANNER       |                 |
|  | - Parse problem     |---->|    - Decompose problem    |                 |
|  | - Extract scoring   |     |    - Rank approaches      |                 |
|  | - Identify I/O fmt  |     |    - Plan sub-tasks       |                 |
|  | - Model: Opus/GPT4  |     |    - Allocate resources   |                 |
|  +---------------------+     +------------+--------------+                 |
|                                           |                                |
|            +------------------------------+                                |
|            |                                                               |
|            v                                                               |
|  +---------+-------------------------------------------------+             |
|  |              EVOLUTIONARY ENGINE (per sub-problem)        |             |
|  |                                                           |             |
|  |  +----------------+    +------------------+               |             |
|  |  | POPULATION     |    | PARENT SELECTION |               |             |
|  |  | (QD Archive)   |--->| - Tournament     |               |             |
|  |  | - By strategy  |    | - Fitness-prop   |               |             |
|  |  | - By subtask   |    | - Diversity-aware|               |             |
|  |  +-------+--------+    +--------+---------+               |             |
|  |          ^                       |                        |             |
|  |          |                       v                        |             |
|  |  +-------+--------+    +--------+---------+               |             |
|  |  | SURVIVAL       |    | MUTATION ENGINE  |               |             |
|  |  | SELECTION      |    | (Multi-Model LLM)|               |             |
|  |  | - Elitism      |    | - Prompt build   |               |             |
|  |  | - Niche-aware  |    | - API dispatch   |               |             |
|  |  | - Age-based    |    | - Code extract   |               |             |
|  |  +-------+--------+    +--------+---------+               |             |
|  |          ^                       |                        |             |
|  |          |                       v                        |             |
|  |  +-------+--------+    +--------+---------+               |             |
|  |  | SCORER         |    | BUILD & TEST     |               |             |
|  |  | - Local eval   |<---| - Compile        |               |             |
|  |  | - Server submit|    | - Sandbox run    |               |             |
|  |  | - Score parse  |    | - Error capture  |               |             |
|  |  +----------------+    +------------------+               |             |
|  +-----------------------------------------------------------+             |
|                                                                            |
|  +------------------------------------------------------------------+      |
|  | ORCHESTRATION & MONITORING                                        |      |
|  | - Multi-subtask coordination    - Budget tracking                |      |
|  | - Parallel API management       - Convergence detection          |      |
|  | - Error recovery & retry        - Live leaderboard monitoring    |      |
|  | - Phase transitions             - Logging & visualization        |      |
|  +------------------------------------------------------------------+      |
|                                                                            |
|  +------------------------------------------------------------------+      |
|  | CONTEST SERVER INTERFACE                                          |      |
|  | - Solution submission API       - Score retrieval                |      |
|  | - Problem update polling        - Leaderboard scraping           |      |
|  +------------------------------------------------------------------+      |
+============================================================================+

8.2 Data Flow and Pipeline

[Problem Spec] --> [Comprehension] --> [Strategy Plan] --> [Sub-task Assignment]
                                                                |
                   +--------------------------------------------+
                   |                    |                        |
                   v                    v                        v
            [Sub-task 1 Evo]    [Sub-task 2 Evo]         [Sub-task N Evo]
                   |                    |                        |
                   v                    v                        v
            [Best Solution 1]   [Best Solution 2]        [Best Solution N]
                   |                    |                        |
                   +--------------------+------------------------+
                                        |
                                        v
                              [Solution Assembler]
                                        |
                                        v
                              [Submit to Contest Server]
                                        |
                                        v
                              [Parse Score & Feedback]
                                        |
                                        v
                              [Feed Back to Population]

9. Components

Component 1: Problem Comprehension Module

Unique to the ICFP system (not present in AHC058). Uses frontier LLMs to deeply analyze the contest problem specification:

  • Parses natural language + mathematical notation + examples
  • Identifies the scoring function and optimization objective
  • Extracts input/output format specifications
  • Detects sub-problem structure
  • Generates initial strategy hypotheses

Component 2: Strategy Planner

Takes the comprehension output and creates an execution plan:

  • Decomposes the problem into independent sub-tasks where possible
  • Ranks algorithmic approaches by expected effectiveness
  • Allocates computational budget across sub-tasks
  • Sets phase transitions (when to shift from exploration to exploitation)

Component 3: Multi-Instance Evolutionary Engine

Multiple instances of the evolutionary engine can run in parallel, one per sub-task. Each instance maintains its own population and evolutionary dynamics.
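As an illustrative sketch of this parallelism (not Sakana's code — evolve_subtask is a placeholder that simulates mutate-and-score work), one asyncio task per sub-task engine might look like:

```python
import asyncio
import random

async def evolve_subtask(name: str, generations: int):
    """Placeholder engine: each generation simulates mutate-and-score work."""
    best = 0.0
    for _ in range(generations):
        await asyncio.sleep(0)              # yield control, as real API calls would
        best = max(best, random.random())   # stand-in for evaluating a mutant
    return name, best

async def run_all(subtasks):
    # One independent engine per sub-task, evolving concurrently
    results = await asyncio.gather(*(evolve_subtask(s, 20) for s in subtasks))
    return dict(results)

scores = asyncio.run(run_all(["parse", "route", "encode"]))
```

Because each engine owns its population, a stalled sub-task cannot block progress on the others.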

Component 4: Solution Assembler

Combines solutions from individual sub-task evolutionary engines into a complete contest submission:

class SolutionAssembler:
    def assemble(self, subtask_solutions: dict) -> str:
        """Combine sub-task solutions into final submission."""
        # Use LLM to integrate if sub-tasks interact
        if self.subtasks_interact():
            return self.llm_integrate(subtask_solutions)
        else:
            # Simple concatenation/composition
            return self.compose(subtask_solutions)

Component 5: Contest Server Interface

Handles communication with the ICFP contest server:

  • HTTP-based solution submission
  • Score retrieval and parsing
  • Leaderboard monitoring for competitive intelligence
  • Rate limiting and retry logic
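A minimal sketch of such an interface, assuming an injected HTTP POST callable and a hypothetical /submit endpoint (the real contest API is not specified in this report):

```python
import time

class ContestClient:
    """Minimal rate-limited submission client; the transport is injected so
    no real network is touched here."""

    def __init__(self, post, min_interval=5.0, max_retries=3):
        self.post = post                  # callable(path, payload) -> dict
        self.min_interval = min_interval  # seconds between submissions
        self.max_retries = max_retries
        self._last_submit = 0.0

    def submit(self, solution: str) -> dict:
        # Enforce a minimum spacing between submissions
        wait = self.min_interval - (time.monotonic() - self._last_submit)
        if wait > 0:
            time.sleep(wait)
        for attempt in range(self.max_retries):
            resp = self.post("/submit", {"solution": solution})
            self._last_submit = time.monotonic()
            if resp.get("status") != "rate_limited":
                return resp
            time.sleep(2 ** attempt)      # exponential backoff on rate limits
        raise RuntimeError("submission failed after retries")

# Usage with a stubbed transport
def fake_post(path, payload):
    return {"status": "ok", "score": 10}

client = ContestClient(fake_post, min_interval=0.0)
resp = client.submit("print(42)")
```

Injecting the transport keeps the retry and pacing logic testable offline, which matters when the real scoring server only exists during the contest.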

Component 6: Orchestration Layer

Manages the full 72-hour autonomous operation with phase-aware scheduling, budget management, and failure recovery.

10. Core Mechanisms

10.1 Code Modification and Mutation Operators

The ICFP system extends the AHC058 mutation operators with additional types suited to the more complex problem domain:

| Mutation Type | Description | When Used |
|---|---|---|
| Data Structure Swap | Replace core data structures (e.g., list to tree, map to hash) | When profiling shows a DS bottleneck |
| Algorithm Substitution | Replace the entire algorithm (BFS to A*, greedy to DP) | When the current approach plateaus |
| I/O Optimization | Improve input parsing or output formatting | When I/O is a bottleneck or incorrect |
| Modular Extraction | Extract reusable functions from monolithic code | When code becomes unwieldy |
| Sub-task Integration | Merge solutions for different sub-tasks | When sub-task boundaries are identified |
| Parameter Sweep | Systematically vary a parameter value | Fine-tuning phase |
| Error-Guided Fix | Fix based on a specific error message or wrong answer | After a failed evaluation |
| Crossover | Combine two parent solutions | When a diverse population exists |
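One way to wire this mutation taxonomy into prompt construction is a type-to-template registry. The sketch below is illustrative only — the registry name MUTATION_TEMPLATES and the template wording are assumptions, not Sakana's actual prompts:

```python
# Hypothetical registry mapping mutation types to prompt-fragment builders
MUTATION_TEMPLATES = {
    "parameter_sweep": lambda ctx: (
        f"Systematically vary the parameter {ctx['param']}; change nothing else."
    ),
    "algorithm_substitution": lambda ctx: (
        f"Replace the current {ctx['current']} approach with a fundamentally "
        f"different algorithm."
    ),
    "error_guided_fix": lambda ctx: (
        f"The last evaluation failed with: {ctx['error']}. Fix exactly this failure."
    ),
}

def mutation_instruction(mutation_type: str, ctx: dict) -> str:
    """Look up the prompt fragment for a mutation type, with a generic fallback."""
    builder = MUTATION_TEMPLATES.get(mutation_type)
    if builder is None:
        return "Make one focused improvement to the code."
    return builder(ctx)
```

A registry like this keeps the operator set extensible: adding a new mutation type is one entry, not a change to the mutation engine.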

Example: Algorithm Substitution Prompt

"""You are an expert programmer competing in the ICFP contest.

PROBLEM SUMMARY:
{problem_summary}

CURRENT SOLUTION (Score: {score}):
{parent_code}

The current approach uses {current_approach}. This has plateaued at score {score}.

Your task: rewrite the solution using a fundamentally different algorithm. Consider:
- {suggestion_1}
- {suggestion_2}
- {suggestion_3}

The new approach should be complete, correct, and aim for a higher score.
Return the COMPLETE code in a ```python code block."""

Error-Guided Mutation

def error_guided_mutation(parent, eval_result):
    """Generate a mutation specifically targeting observed errors."""
    if eval_result.status == "WRONG_ANSWER":
        error_context = f"""The solution produces incorrect output.
Input: {eval_result.failing_input[:500]}
Expected output (partial): {eval_result.expected[:200]}
Actual output: {eval_result.actual[:200]}

Analyze why the output is wrong and fix the logic error."""
    elif eval_result.status == "RUNTIME_ERROR":
        error_context = f"""The solution crashes with: {eval_result.error_message}

Stack trace (if available): {eval_result.stack_trace}

Fix this runtime error."""
    elif eval_result.status == "TLE":
        error_context = f"""Time limit exceeded. The solution takes too long.
Current runtime: {eval_result.runtime}ms
Time limit: {eval_result.time_limit}ms

Optimize for speed. Consider:
- More efficient data structures
- Reducing algorithmic complexity
- Early termination conditions"""

    return build_mutation_prompt(parent, "error_fix", error_context)
10.2 Parent Selection and Sampling

The ICFP system uses the same core selection mechanisms as AHC058 (fitness-proportionate, tournament, diversity-aware) with one addition -- **sub-task-aware selection**:

def subtask_aware_selection(populations, subtask_weights):
    """Select parents weighted by sub-task importance and individual fitness."""
    # Step 1: Choose which sub-task to focus on.
    # Weight by potential improvement (gap between current and theoretical max).
    improvement_potential = {}
    for st, pop in populations.items():
        best = max(ind.fitness for ind in pop)
        ceiling = subtask_weights[st]['theoretical_max']
        improvement_potential[st] = (ceiling - best) * subtask_weights[st]['importance']

    # Softmax to select a sub-task
    probs = softmax(list(improvement_potential.values()))
    chosen_subtask = np.random.choice(list(populations.keys()), p=probs)

    # Step 2: Select a parent within the chosen sub-task population
    return tournament_selection(populations[chosen_subtask])
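The softmax step can be made concrete with plain Python (toy potential values are hypothetical, and random.choices stands in for np.random.choice):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy improvement potentials: sub-task B has the largest remaining gap,
# so it should be selected most often.
potential = {"A": 0.1, "B": 2.0, "C": 0.5}
probs = dict(zip(potential, softmax(list(potential.values()))))

random.seed(0)
picks = random.choices(list(probs), weights=list(probs.values()), k=1000)
```

Softmax keeps every sub-task selectable (no starvation) while steering most effort toward the sub-task with the biggest expected payoff.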
10.3 Population Management

The ICFP system extends the QD archive with **hierarchical structure**:

+-------------------------------------------------------+
|                HIERARCHICAL POPULATION                |
|                                                       |
| Level 1: Sub-task Populations                         |
|  +---------------+ +---------------+ +----------+     |
|  | Sub-task A    | | Sub-task B    | | Sub-task C|    |
|  | Pop size: 15  | | Pop size: 20  | | Pop: 10   |    |
|  +-------+-------+ +-------+-------+ +----+-----+     |
|          |                 |              |           |
| Level 2: Strategy Niches (within each sub-task)       |
|  +------+------+    +------+------+    +-----+----+   |
|  |Greedy|  SA  |    |  DP  | BFS  |    |Brute| Opt|   |
|  | [5]  | [10] |    | [12] | [8]  |    | [5] | [5]|   |
|  +------+------+    +------+------+    +-----+----+   |
|                                                       |
| Global Elite Buffer: Top 10 across all sub-tasks      |
+-------------------------------------------------------+

Age-Based Culling

To prevent stale individuals from consuming population slots:

def age_based_culling(population, max_age=50):
    """Remove old individuals that haven't improved."""
    current_gen = get_current_generation()
    survivors = []
    for ind in population:
        age = current_gen - ind.birth_generation
        if age < max_age:
            survivors.append(ind)
        elif ind == get_niche_best(ind.niche):
            survivors.append(ind)  # Keep niche champions regardless of age
        # else: individual is culled
    return survivors

10.4 Solution Diversity and Novelty

Multi-Dimensional Diversity

The ICFP system tracks diversity across multiple axes:

  • Algorithmic strategy: What approach (greedy, DP, SA, etc.)
  • Programming language: Python vs C++ vs others
  • Code structure: Number of functions, nesting depth, LOC
  • Behavioral profile: Score distribution across test cases
  • Lineage distance: How many mutations separate two individuals

Diversity Distance Function


$$
d(a, b) = w_1 \, d_{\text{strategy}}(a, b) + w_2 \, d_{\text{behavioral}}(a, b) + w_3 \, d_{\text{structural}}(a, b)
$$

A weighted combination of strategy (categorical), behavioral (score-vector cosine distance), and structural (code diff ratio) distances.

def diversity_distance(a, b, w=(0.4, 0.4, 0.2)):
    """Multi-dimensional diversity distance."""
    # Strategy distance (0 or 1)
    d_strat = 0.0 if a.strategy == b.strategy else 1.0

    # Behavioral distance (cosine distance of score vectors)
    a_scores = np.array(a.per_test_scores)
    b_scores = np.array(b.per_test_scores)
    cosine_sim = np.dot(a_scores, b_scores) / (np.linalg.norm(a_scores) * np.linalg.norm(b_scores) + 1e-8)
    d_behav = 1.0 - cosine_sim

    # Structural distance (normalized edit distance of code)
    d_struct = normalized_levenshtein(a.code, b.code)

    return w[0] * d_strat + w[1] * d_behav + w[2] * d_struct
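For the structural term, difflib from the Python standard library offers a workable stand-in for an edit-distance helper like normalized_levenshtein (a sketch, not necessarily the system's exact metric):

```python
import difflib

def structural_distance(code_a: str, code_b: str) -> float:
    """Normalized code-diff distance: 0.0 for identical sources, approaching
    1.0 as the two programs share fewer characters in sequence."""
    return 1.0 - difflib.SequenceMatcher(None, code_a, code_b).ratio()

identical = structural_distance("def f(x): return x", "def f(x): return x")
different = structural_distance("def f(x): return x", "class Solver: pass")
```

SequenceMatcher's ratio is cheap relative to true Levenshtein distance on long programs, which matters when the distance is computed pairwise across a population.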
10.5 LLM Orchestration and Model Selection

Phase-Aware Model Selection

class PhaseAwareOrchestrator:
    def select_model(self, phase, mutation_type, budget_remaining):
        if phase == "comprehension":
            # Always use the best model for understanding
            return "claude-3-opus"

        elif phase == "exploration":
            # Diversify models for diverse solutions
            return random.choice(["claude-3-5-sonnet", "gpt-4o", "gemini-1.5-pro"])

        elif phase == "exploitation":
            # Use the best code model for refinement
            if mutation_type in ["parameter_tune", "local_fix"]:
                return "claude-3-haiku"  # cheap
            return "claude-3-5-sonnet"

        elif phase == "polishing":
            # Budget-conscious but quality-focused
            if budget_remaining > 0.1:
                return "claude-3-5-sonnet"
            return "claude-3-haiku"
Concurrent API Management

class ConcurrentAPIManager:
    """Manages parallel LLM API calls with rate limiting."""

    def __init__(self):
        self.semaphores = {
            "claude": asyncio.Semaphore(10),    # max 10 concurrent
            "openai": asyncio.Semaphore(10),
            "google": asyncio.Semaphore(5),
        }
        self.retry_delays = {"claude": 1, "openai": 1, "google": 2}

    async def call(self, provider, model, prompt, **kwargs):
        sem = self.semaphores[provider]
        async with sem:
            for attempt in range(3):
                try:
                    return await self._raw_call(provider, model, prompt, **kwargs)
                except RateLimitError:
                    await asyncio.sleep(self.retry_delays[provider] * (2 ** attempt))
            raise MaxRetriesExceeded()
10.6 Search Strategies and Tree Exploration

Phase-Based Search Strategy

Phase 1: COMPREHENSION (0-6 hours)
  • Deep analysis of problem spec
  • Generate initial strategy hypotheses
  • No code generation yet

Phase 2: EXPLORATION (6-24 hours)
  • Broad search across strategies
  • High mutation temperature
  • Multiple languages explored
  • Many strategy-change mutations
  • Population diversity maximized

Phase 3: EXPLOITATION (24-54 hours)
  • Focus on top strategies
  • Lower mutation temperature
  • Parameter tuning emphasis
  • Crossover between best solutions
  • Deeper refinement chains

Phase 4: POLISHING (54-72 hours)
  • Micro-optimizations only
  • Constant tuning
  • Edge-case fixing
  • Final submission preparation
  • Budget conservation

Stagnation Detection and Recovery

class StagnationDetector:
    def __init__(self, patience=20, min_improvement=0.001):
        self.patience = patience
        self.min_improvement = min_improvement
        self.best_history = []

    def check(self, current_best_fitness):
        self.best_history.append(current_best_fitness)

        if len(self.best_history) < self.patience:
            return False

        recent = self.best_history[-self.patience:]
        improvement = (recent[-1] - recent[0]) / (abs(recent[0]) + 1e-8)
        return improvement < self.min_improvement

    def recovery_action(self):
        """Suggest an action when stagnation is detected."""
        return random.choice([
            "increase_exploration",      # More strategy-change mutations
            "inject_random_solutions",   # Generate from scratch
            "switch_subtask_focus",      # Work on a different sub-task
            "ensemble_strategy_review",  # Ask multiple LLMs for new ideas
        ])
10.7 Evaluation and Fitness Assessment

Dual Evaluation Modes

class ICFPEvaluator:
    def evaluate(self, code, eval_mode="local"):
        if eval_mode == "local":
            return self.local_eval(code)
        elif eval_mode == "server":
            return self.server_eval(code)

    def local_eval(self, code):
        """Fast local evaluation on sample test cases."""
        # Compile/interpret
        binary = self.build(code)
        if not binary.success:
            return EvalResult(fitness=0, status="BUILD_FAIL", error=binary.stderr)

        # Run on local test cases
        scores = []
        for tc in self.local_test_cases:
            result = self.run_sandboxed(binary, tc, time_limit=10.0)
            if result.ok:
                scores.append(self.score(result.output, tc))
            else:
                scores.append(0)

        return EvalResult(fitness=np.mean(scores), scores=scores, status="OK")

    def server_eval(self, code):
        """Submit to the contest server for official scoring."""
        # Only used for promising solutions (expensive/rate-limited)
        response = self.contest_api.submit(code)
        return EvalResult(
            fitness=response.score,
            status="OK",
            server_response=response
        )
Tiered Evaluation Strategy


Eval(candidate) proceeds in three stages:

  Stage 1: Quick local eval on 3 test cases (fast filter)
  Stage 2: Full local eval on all test cases (if Stage 1 is promising)
  Stage 3: Server submission (if Stage 2 beats the current best)

Each stage filters out poor candidates, saving evaluation budget for the promising ones.
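The staged funnel can be sketched as a single gating function (the stage callables and the 0.5 quick-filter threshold below are illustrative assumptions, not the system's actual values):

```python
def tiered_eval(candidate, quick_eval, full_eval, server_eval, best_so_far,
                quick_threshold=0.5):
    """Three-stage evaluation funnel: cheap checks gate expensive ones."""
    quick = quick_eval(candidate)                 # Stage 1: few sample cases
    if quick < quick_threshold:
        return ("rejected_stage1", quick)
    full = full_eval(candidate)                   # Stage 2: full local suite
    if full <= best_so_far:
        return ("rejected_stage2", full)
    return ("submitted", server_eval(candidate))  # Stage 3: official scoring

# Stub evaluators for illustration
result = tiered_eval(
    candidate="solution",
    quick_eval=lambda c: 0.9,
    full_eval=lambda c: 0.8,
    server_eval=lambda c: 0.82,
    best_so_far=0.7,
)
```

The expensive, rate-limited server call only ever fires for candidates that already beat the incumbent locally.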

10.8 Prompt Engineering and Co-Evolution

Context-Rich Prompts

ICFP prompts are richer than AHC058 prompts because they must convey the novel problem:

class ICFPPromptBuilder:
    def build(self, parent, mutation_type, context):
        sections = []

        # Problem understanding (from comprehension phase)
        sections.append(f"## Problem Summary\n{context['problem_summary']}")

        # Scoring function explanation
        sections.append(f"## Scoring\n{context['scoring_explanation']}")

        # Sub-task context
        if context.get('subtask'):
            sections.append(f"## Current Sub-task\n{context['subtask']}")

        # Parent code and performance
        sections.append(f"## Current Solution (Score: {parent.fitness})\n"
                        f"```python\n{parent.code}\n```")

        # Population context -- what strategies are working
        sections.append(f"## Population Insights\n"
                        f"- Best score overall: {context['best_score']}\n"
                        f"- Best strategy: {context['best_strategy']}\n"
                        f"- Strategies tried: {context['strategies_tried']}\n"
                        f"- Recent improvements came from: {context['recent_improvements']}")

        # Leaderboard context (competitive intelligence)
        if context.get('leaderboard_info'):
            sections.append(f"## Leaderboard\n{context['leaderboard_info']}")

        # Mutation instructions
        sections.append(self.mutation_instructions(mutation_type))

        return "\n\n".join(sections)
Leaderboard-Driven Motivation

An innovative technique: the system includes leaderboard position in prompts to "motivate" the LLM:

# Include competitive context
leaderboard_prompt = f"""We are currently ranked #{our_rank} on the leaderboard.
The top team's score is {top_score} (we score {our_score}).
The gap is {top_score - our_score} points. We need to close this gap.
Think creatively about improvements that could yield significant score gains."""

### 10.9 Cost Control and Budget Management

```python
import time


class ICFPBudgetManager:
    """72-hour budget management with phase awareness."""

    def __init__(self, total_budget, contest_duration_hours=72):
        self.total = total_budget
        self.spent = 0.0
        self.duration = contest_duration_hours
        self.start_time = time.time()

        # Phase budgets
        self.phase_budgets = {
            "comprehension": total_budget * 0.05,
            "exploration": total_budget * 0.40,
            "exploitation": total_budget * 0.40,
            "polishing": total_budget * 0.15,
        }
        self.phase_spent = {k: 0.0 for k in self.phase_budgets}

    def get_current_phase(self):
        elapsed = (time.time() - self.start_time) / 3600
        if elapsed < 6:
            return "comprehension"
        elif elapsed < 24:
            return "exploration"
        elif elapsed < 54:
            return "exploitation"
        else:
            return "polishing"

    def can_spend(self, estimated_cost):
        phase = self.get_current_phase()
        phase_remaining = self.phase_budgets[phase] - self.phase_spent[phase]
        total_remaining = self.total - self.spent
        return estimated_cost <= min(phase_remaining, total_remaining)

    def get_cost_per_call(self, model):
        # Estimated cost per API call in USD (input + output tokens)
        costs = {
            "claude-3-opus": 0.15,
            "claude-3-5-sonnet": 0.05,
            "gpt-4o": 0.06,
            "gemini-1.5-pro": 0.04,
            "claude-3-haiku": 0.005,
            "gpt-4o-mini": 0.003,
        }
        return costs.get(model, 0.05)
```

### 10.10 Meta-Level and Self-Improvement

#### Adaptive Phase Transitions

Instead of fixed time-based phase transitions, the system can adaptively shift phases based on evolutionary dynamics:

```python
class AdaptivePhaseController:
    def should_transition(self, current_phase, metrics):
        if current_phase == "exploration":
            # Transition to exploitation when:
            # 1. At least 3 viable strategies discovered
            # 2. Best score hasn't improved for 10 generations
            # 3. OR the time threshold is reached
            strategies_found = len(metrics['active_strategies'])
            stagnant = metrics['generations_since_improvement'] > 10
            return ((strategies_found >= 3 and stagnant)
                    or metrics['time_elapsed'] > 24)

        elif current_phase == "exploitation":
            # Transition to polishing when the improvement rate drops
            improvement_rate = metrics['improvement_rate_per_hour']
            return improvement_rate < 0.001 or metrics['time_elapsed'] > 54
```

#### Strategy Transfer Across Sub-tasks

```python
def cross_subtask_transfer(populations):
    """Transfer successful strategies from one sub-task to another."""
    for source_task, source_pop in populations.items():
        best = max(source_pop, key=lambda x: x.fitness)

        for target_task, target_pop in populations.items():
            if target_task == source_task:
                continue

            # Check whether this strategy has been tried in the target
            target_strategies = {ind.strategy for ind in target_pop}
            if best.strategy not in target_strategies:
                # Adapt the strategy to the target sub-task
                adapted = llm_adapt_solution(
                    source_code=best.code,
                    source_task=source_task,
                    target_task=target_task,
                )
                target_pop.append(adapted)
```

#### Self-Reflective Evolution

Periodically, the system asks the LLM to analyze the evolutionary run itself:

```python
def evolutionary_self_reflection(run_statistics):
    """Ask an LLM to analyze the evolutionary run and suggest improvements."""
    prompt = f"""You are analyzing an ongoing evolutionary code optimization run.

Statistics:
- Generations completed: {run_statistics['generations']}
- Best fitness: {run_statistics['best_fitness']}
- Fitness improvement in last 20 gens: {run_statistics['recent_improvement']}
- Mutation success rates: {run_statistics['mutation_success_rates']}
- Model performance: {run_statistics['model_performance']}
- Population diversity: {run_statistics['diversity_metric']}
- Time remaining: {run_statistics['time_remaining']} hours
- Budget remaining: {run_statistics['budget_remaining']}

Analyze this run and recommend:
1. Should we change the mutation type distribution?
2. Should we inject new random solutions?
3. Should we focus on a different sub-task?
4. Any other strategic adjustments?
"""
    return call_llm("claude-3-5-sonnet", prompt, temperature=0.4)
```

## 11. Programming Language

### System Implementation

The orchestration framework is implemented in **Python** with:

- `asyncio` for concurrent API and evaluation management
- `aiohttp`/`httpx` for async HTTP calls to LLM APIs and contest server
- `subprocess` for compilation and sandboxed execution
- `numpy` for fitness computation and statistical analysis
- Custom logging and monitoring infrastructure
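The `asyncio` fan-out pattern can be illustrated with a minimal sketch. Here `llm_call` is a stand-in stub that simulates an async HTTP request (the real client, e.g. via `httpx`, is not shown in the source); `asyncio.gather` runs all mutation requests concurrently and preserves order.

```python
import asyncio

async def llm_call(prompt: str) -> str:
    # Stand-in stub for an async HTTP call to an LLM API
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response:{prompt}"

async def mutate_batch(prompts):
    # Fan out all mutation requests concurrently; gather preserves order
    return await asyncio.gather(*(llm_call(p) for p in prompts))

results = asyncio.run(mutate_batch(["p1", "p2", "p3"]))
```

With real API latencies of seconds per call, this pattern lets dozens of mutations proceed in parallel per generation.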

### Generated Solution Languages

Unlike AHC058 (C++ only), the ICFP system can generate solutions in multiple languages:

| Language | When Preferred | Build Command |
| --- | --- | --- |
| Python 3 | Rapid prototyping, complex logic, string manipulation | `python3 solution.py` |
| C++17 | Performance-critical optimization, heavy computation | `g++ -O2 -std=c++17` |
| Rust | Memory safety + performance (when LLM supports it) | `rustc -O` |
| Haskell/OCaml | Functional programming tasks (ICFP tradition) | `ghc -O2` / `ocamlopt` |

```python
def select_language(problem_characteristics):
    """Select implementation language based on problem characteristics."""
    if problem_characteristics['compute_intensive']:
        return "cpp"
    elif (problem_characteristics['string_heavy']
          or problem_characteristics['rapid_iteration']):
        return "python"
    elif problem_characteristics['functional_style']:
        return "haskell"
    else:
        return "python"  # default
```

## 12. Memory Management

### Extended Context Requirements

ICFP problems are much larger and more complex than AHC problems, requiring careful context management:

| Context Component | Tokens (est.) | Strategy |
| --- | --- | --- |
| Problem specification (full) | 2,000-10,000 | Compressed summary used after comprehension phase |
| Problem summary | 500-1,500 | Always included in mutations |
| Scoring explanation | 200-500 | Always included |
| Parent code | 500-5,000 | Full code; truncated if >5K tokens |
| Performance data | 200-500 | Summarized scores |
| Population insights | 200-400 | Aggregate statistics only |
| Mutation instructions | 200-500 | Mutation-type specific |
| **Total per mutation** | **2,000-8,500** |  |
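The per-component budgets above imply a truncation step before each mutation prompt is assembled. A minimal sketch, using the rough heuristic of ~4 characters per token (the system's actual tokenizer and truncation policy are not specified):

```python
def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Clip text to an approximate token budget (~4 chars/token heuristic)."""
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    # Keep head and tail: the end of a program often matters as much as
    # the beginning, so drop the middle rather than the tail.
    half = max_chars // 2
    return text[:half] + "\n# ... [truncated] ...\n" + text[-half:]

def assemble_context(components, budgets):
    """components/budgets: dicts mapping section name -> text / token limit."""
    return "\n\n".join(
        truncate_to_tokens(components[name], budgets[name])
        for name in components
    )
```

Keeping both head and tail of truncated code is a common choice for prompt assembly, since function definitions and the main entry point tend to sit at opposite ends of a file.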

### Long-Context for Full-Program Analysis

For complex refactoring, Gemini 1.5 Pro's 1M+ token context is leveraged:

```python
def deep_analysis_mutation(parent, all_code_variants, problem_spec):
    """Use a long-context model for deep analysis across all variants."""
    # Include the full problem spec plus all top solutions for comparison
    context = f"""
Full problem specification:
{problem_spec}

Top 5 solutions and their scores:
"""
    for i, variant in enumerate(all_code_variants[:5]):
        context += (f"\n--- Solution {i+1} (Score: {variant.fitness}) ---\n"
                    f"{variant.code}\n")

    context += ("\nAnalyze all solutions, identify what makes the best ones "
                "work, and create an improved solution.")

    return call_llm("gemini-1.5-pro", context, temperature=0.5)
```

### Caching Strategies

- **Compilation cache**: Hash-based deduplication prevents recompiling identical code
- **Evaluation cache**: Identical programs are not re-evaluated
- **Problem summary cache**: Comprehension results are computed once and reused
- **API response cache**: Deterministic prompts (temperature=0) can be cached
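All of these caches can hang off a content hash of the candidate program. A minimal sketch using `hashlib`; the cache key scheme and structure are illustrative, not the production layout:

```python
import hashlib

class EvaluationCache:
    """Dedup compilation/evaluation by hashing the exact source text."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def key(code: str) -> str:
        return hashlib.sha256(code.encode("utf-8")).hexdigest()

    def get_or_evaluate(self, code: str, evaluate):
        k = self.key(code)
        if k not in self._scores:  # cache miss: pay for evaluation once
            self._scores[k] = evaluate(code)
        return self._scores[k]
```

Because LLM mutations frequently regenerate byte-identical programs, exact-match hashing alone recovers a meaningful fraction of the evaluation budget.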

## 13. Continued Learning

### Within the 72-Hour Run

- **Phase transitions**: The system learns when to shift from exploration to exploitation based on improvement dynamics
- **Adaptive mutation distribution**: Successful mutation types are used more frequently
- **Error memory**: Failed approaches are tracked to avoid repeating them
- **Population as knowledge**: The diverse population implicitly encodes what works
- **Self-reflection**: Periodic LLM-based analysis of the run guides strategic adjustments
- **Leaderboard feedback**: External scores from the contest server inform strategy
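The adaptive mutation distribution amounts to success-rate-weighted sampling. A minimal sketch with Laplace smoothing; the actual update rule is not published, so the smoothing constants here are illustrative:

```python
import random

class MutationSampler:
    """Sample mutation types proportionally to their observed success rate."""

    def __init__(self, mutation_types):
        self.stats = {m: {"tried": 0, "improved": 0} for m in mutation_types}

    def record(self, mutation, improved):
        self.stats[mutation]["tried"] += 1
        self.stats[mutation]["improved"] += int(improved)

    def sample(self, rng=random):
        # Laplace smoothing (+1/+2) keeps rarely-tried mutations alive
        weights = [
            (s["improved"] + 1) / (s["tried"] + 2)
            for s in self.stats.values()
        ]
        return rng.choices(list(self.stats), weights=weights, k=1)[0]
```

The smoothing term doubles as exploration pressure: a mutation type with no trials still gets sampled at prior rate 1/2 rather than being starved out.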

### Across Contests (Transfer Learning)

- **Framework reuse**: The evolutionary engine is problem-agnostic; only the comprehension and evaluation modules need adaptation
- **Prompt library**: Effective prompt templates are accumulated and refined
- **Model performance profiles**: Which LLMs work best for which mutation types is learned over time
- **Phase timing heuristics**: When to transition between phases is calibrated from experience
- **Strategy priors**: Common algorithmic strategies that work across contests are encoded as initial population seeds

### No LLM Fine-Tuning

As with the AHC058 system, the ICFP system does not fine-tune the LLMs. All adaptation occurs at the prompt, population, and orchestration levels. The LLMs serve as powerful but static "code transformation engines."

## 14. Applications and Use Cases

### Direct Applications

- **Programming contest automation**: Participating in ICFP, AHC, Google Code Jam, and Kaggle competitions
- **Autonomous software development**: Given a specification and test suite, evolve solutions
- **Algorithm discovery**: Finding novel algorithms for specific problem classes
- **Rapid prototyping**: Exploring the solution space for a new problem before human refinement

### Extended Applications

- **Scientific research**: Evolving simulation code, numerical methods, and data analysis pipelines (extends the "AI Scientist" vision)
- **Compiler optimization**: Evolving optimization passes or code transformations
- **Game AI**: Evolving game-playing strategies expressed as code
- **Automated theorem proving**: Evolving proof strategies (with appropriate fitness functions)
- **Curriculum generation**: Creating diverse problem solutions for educational purposes
- **Security**: Evolving exploit detection or defense code against test suites

### Comparative Advantages Over the AHC058 System

| Capability | AHC058 System | ICFP 2025 System |
| --- | --- | --- |
| Problem comprehension | Assumes known format | Autonomous understanding of novel problems |
| Sub-task decomposition | Single-task | Multi-task with resource allocation |
| Language flexibility | C++ only | Multi-language (Python, C++, etc.) |
| Run duration | Hours | 72 continuous hours |
| Server interaction | Local eval only | Contest server API integration |
| Competitive intelligence | None | Leaderboard monitoring |
| Generalizability | Heuristic contests | Broader programming contests |

### Limitations

- Still cannot match top human teams on most contests (though the gap is narrowing)
- Problem comprehension can fail on very novel or ambiguous specifications
- Cost of extended runs may not be justified for all applications
- Generated code lacks documentation and maintainability
- Dependent on LLM API availability and pricing stability

## 15. References

1. Sakana AI, "Evolutionary Code Generation for ICFP 2025," https://sakana.ai/icfp-2025/, 2025.
2. Sakana AI, "Evolutionary Code Generation for AHC058," https://sakana.ai/ahc058/, 2025.
3. C. Lu, S. Lange, Y. Tang, D. Ha, "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," Sakana AI, 2024.
4. R. T. Lange, Y. Tang, D. Ha, "Discovering Attention-Based Genetic Algorithms via Large Language Models," Sakana AI, 2024.
5. J. Lehman, J. Gordon, S. Jain, K. Ndousse, C. Castricato, S. Bansal, "Evolution through Large Models," 2023.
6. J.-B. Mouret, J. Clune, "Illuminating Search Spaces by Mapping Elites," arXiv:1504.04909, 2015.
7. K. O. Stanley, J. Lehman, "Why Greatness Cannot Be Planned: The Myth of the Objective," Springer, 2015.
8. ICFP Programming Contest, https://icfpcontest.org/.
