ALE-Bench
A Benchmark for Automated Optimization with LLM-Based Evolutionary Approaches
Organization: Sakana AI / AtCoder
Paper: arXiv:2506.09050
Type: Benchmark + Agent System
Report Type: PhD-Level Technical Analysis
Report Date: March 2026
Table of Contents
- Full Title and Attribution
- Authors and Institutional Context
- Motivation and Research Gap
- Benchmark Design and Problem Taxonomy
- Infrastructure and Evaluation Protocol
- ALE-Agent Architecture
- Domain Knowledge Injection and Prompt Engineering
- Inference-Time Scaling and Solution Generation
- Simulated Annealing Specialization
- Experimental Results and Rankings
- Novel Algorithmic Discoveries
- Ablation Studies and Component Analysis
- Failure Mode Analysis
- Comparison with Related Work
- Computational Requirements and Cost Analysis
- Theoretical Foundations and Complexity
- Implications for Competitive Programming
- Limitations and Open Problems
- Future Directions
- References and Resources
1 Full Title and Attribution
Full Title: ALE-Bench: A Benchmark for Automated Optimization with LLM-Based Evolutionary Approaches
ArXiv ID: arXiv:2506.09050
Dataset: HuggingFace: SakanaAI/ALE-Bench
Repository: github.com/SakanaAI/ALE-Bench
Category: Combinatorial Optimization Benchmark for LLM Agents
Lineage: Builds upon AtCoder Heuristic Contest (AHC) ecosystem; related to FunSearch, AlphaCode, and EvoTorch lineages
ALE-Bench represents a critical inflection point in AI benchmarking: rather than measuring whether LLMs can solve optimization problems with known optimal solutions, it measures whether AI agents can compete in open-ended heuristic optimization where even human experts cannot find provably optimal answers. This shifts evaluation from correctness to quality -- precisely the regime where practical combinatorial optimization operates.
The benchmark name encodes a triple meaning: Automated LLM-based Evolutionary optimization, referencing the three pillars of the contribution -- automation of the optimization loop, leveraging large language models as the mutation engine, and evolutionary search strategies that iteratively improve candidate solutions. The "-Bench" suffix positions it within the growing ecosystem of AI benchmarks (SWE-bench, HumanEval, MATH, etc.) while distinguishing its focus on heuristic optimization rather than exact computation.
2 Authors and Institutional Context
Author List
| Author | Affiliation | Role / Expertise |
|---|---|---|
| Yuki Imajuku | Sakana AI | Lead researcher; ALE-Agent development |
| Kohki Horie | Sakana AI / University of Tokyo | Dual affiliation; benchmark infrastructure design |
| Yoichi Iwata | AtCoder | Competitive programming systems; problem curation |
| Kensho Aoki | AtCoder | Contest infrastructure; ranking algorithms |
| Naohiro Takahashi | AtCoder | Problem design; evaluation systems |
| Takuya Akiba | Sakana AI | Senior researcher; Optuna creator; project leadership |
Institutional Significance
The collaboration between Sakana AI and AtCoder is strategically significant. Sakana AI, founded by former Google Brain researchers in Tokyo, focuses on nature-inspired AI and evolutionary computation. AtCoder operates one of the world's premier competitive programming platforms, running the AtCoder Heuristic Contest (AHC) series that forms the backbone of ALE-Bench. Takuya Akiba, co-founder of Sakana AI, previously created Optuna, one of the most widely used hyperparameter optimization frameworks, bringing deep expertise in automated optimization to this work.
The dual affiliation of Kohki Horie (Sakana AI and University of Tokyo) bridges the industrial research focus of Sakana AI with the academic rigor expected of benchmark papers. Yoichi Iwata from AtCoder brings authoritative knowledge of competitive programming problem design, ensuring that the benchmark problems authentically represent the difficulty landscape of heuristic contests.
[!info]- AtCoder Heuristic Contest (AHC) Background
AtCoder Heuristic Contests (AHC) are a distinct contest format from the more traditional algorithmic contests (ABC, ARC, AGC). While algorithmic contests require finding exact solutions to well-defined problems, heuristic contests present open-ended optimization problems where participants must design and implement heuristic algorithms that produce the best possible solutions within time and resource constraints.
Key characteristics of AHC:
- Scoring: Solutions are scored on quality (how close to optimal), not just correctness
- Duration: Typically 4-hour live contests or multi-week long contests
- Participation: 500-1500+ participants per contest, with a dedicated community of optimization specialists
- Problem types: Routing, scheduling, placement, resource allocation, simulation-based optimization
- Ranking system: Relative ranking where scores are normalized against other participants
AHC has run 50+ contests since its inception, establishing a rich corpus of well-calibrated optimization problems with known human performance distributions.
3 Motivation and Research Gap
The Optimization Benchmark Vacuum
Prior to ALE-Bench, the evaluation of LLM-based optimization agents suffered from a critical methodological gap. Existing benchmarks fell into two inadequate categories:
Exact-Solution Benchmarks
Benchmarks like HumanEval, MBPP, and APPS measure whether code is functionally correct. They have binary pass/fail evaluation and test algorithmic problems with known optimal solutions. These benchmarks are poorly suited for evaluating optimization because they cannot distinguish between a mediocre feasible solution and an exceptional one.
Toy Optimization Benchmarks
Classical optimization benchmarks (e.g., TSP instances from TSPLIB, vehicle routing from CVRPLIB) evaluate specific problem classes in isolation. They lack the diversity needed to assess general-purpose optimization agents and do not capture the iterative refinement process that characterizes real-world heuristic development.
The Case for Heuristic Contest Problems
ALE-Bench fills this gap by drawing from competitive programming heuristic contests, which offer several unique advantages as a benchmark source:
- Calibrated difficulty: Each problem has been solved by hundreds to thousands of human experts, providing a rich performance distribution for comparison
- Diversity: Problems span logistics, factory planning, power-grid optimization, geometric placement, scheduling, and more
- No data contamination: Solutions require novel algorithm design, not pattern matching against training data
- Quantitative scoring: Continuous score functions allow fine-grained quality assessment rather than binary pass/fail
- Realistic complexity: NP-hard problems with search spaces of approximately 10^200 possible combinations mirror real industrial optimization
Why ~10^200? Consider the Food Delivery problem (AHC006): select 50 orders from 1,000 available, then determine the delivery route. The selection alone gives C(1000, 50) ~ 10^85 combinations. Each selection then defines a route optimization problem over 50 waypoints, adding another combinatorial explosion. The joint optimization over selection and routing pushes the effective search space to astronomical proportions, making brute-force search computationally infeasible and requiring sophisticated heuristic design.
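As a sanity check on the C(1000, 50) estimate above, the binomial coefficient can be evaluated in log space via the log-gamma function, which avoids overflow entirely. This is a small illustrative helper, not part of the benchmark itself:

```cpp
#include <cmath>

// log10 of the binomial coefficient C(n, k), computed via lgamma
// so that astronomically large counts never overflow a double.
double log10_binomial(int n, int k) {
    return (std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0))
           / std::log(10.0);
}
```

Evaluating `log10_binomial(1000, 50)` gives roughly 85, confirming the ~10^85 figure for the selection sub-problem alone.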
4 Benchmark Design and Problem Taxonomy
Problem Collection
ALE-Bench comprises 40 diverse optimization problems curated from past AtCoder Heuristic Contests. These problems were selected to maximize diversity across problem types, difficulty levels, and algorithmic paradigms required for competitive solutions.
40 Optimization Problems
~10^200 Typical Search Space
1000+ Human Baselines per Problem
6 Problem Categories
Problem Category Taxonomy
| Category | Example Problems | Typical Approaches | Count |
|---|---|---|---|
| Logistics / Routing | Food Delivery (AHC006), multi-vehicle routing | SA, greedy + local search, 2-opt/3-opt | 8 |
| Factory / Scheduling | Job-shop scheduling, production line optimization | SA, constraint propagation, GRASP | 7 |
| Grid / Placement | Facility layout, tile placement, power-grid optimization | SA, BFS/DFS enumeration, beam search | 9 |
| Graph / Network | Network design, flow optimization, connectivity | SA, minimum spanning tree + perturbation | 6 |
| Simulation-Based | Game simulation, interactive optimization | Monte Carlo, beam search, MCTS | 5 |
| Geometric / Spatial | Rectangle packing, spatial partitioning | SA, sweep line, corner-based placement | 5 |
Problem Difficulty Distribution
The problems span a wide range of difficulty levels, calibrated by the human performance distribution in the original AHC contests. Difficulty is characterized not by problem statement complexity (which is typically moderate) but by the gap between naive and competitive solutions:
[!info]- Detailed Problem Difficulty Analysis
Each problem can be characterized along several difficulty axes:
- Greedy Gap: The percentile rank achievable by a well-implemented greedy algorithm. Problems with small greedy gaps (where greedy solutions already rank well) are easier; problems where greedy ranks below 50th percentile are harder.
- SA Ceiling: The approximate best rank achievable using standard simulated annealing. Problems where SA alone can reach top 10% are "SA-friendly"; others require more sophisticated approaches.
- Algorithmic Breadth: The number of distinct algorithmic paradigms that appear in top-100 solutions. Higher breadth indicates problems where diverse approaches are viable.
- Implementation Complexity: The typical lines of code in competitive solutions, ranging from 100-200 (simple) to 500+ (complex).
The benchmark deliberately includes problems across all four difficulty dimensions to stress-test different capabilities of AI agents.
Exemplar Problem: Food Delivery (AHC006)
Problem Statement (Simplified): You operate a food delivery service with one delivery vehicle. There are 1,000 customer orders, each with a pickup location, delivery location, time window, and reward value. You must select at most 50 orders to fulfill and determine the delivery route that maximizes total reward minus travel costs. The vehicle starts and ends at a central depot.
This problem is representative because it combines multiple optimization sub-problems:
- Subset selection: Choose which orders to accept (combinatorial)
- Sequencing: Determine pickup/delivery order (routing)
- Time feasibility: Respect time windows (constraint satisfaction)
- Cost-benefit tradeoff: Balance high-reward distant orders against efficient clustering
No polynomial-time algorithm is known for this problem class (it generalizes the NP-hard Traveling Salesman Problem with Time Windows). Competitive human solutions typically employ a combination of greedy initial solution construction followed by simulated annealing with carefully designed neighborhood operators.
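The greedy construction step mentioned above can be sketched under simplifying assumptions: each order is reduced to a scalar reward and an estimated travel cost, and orders are ranked by reward-to-cost ratio. The `Order` struct and `greedy_select` helper are illustrative names, not from the paper; real solutions must also handle routing and time windows:

```cpp
#include <vector>
#include <algorithm>

// Illustrative order summary: reward and an estimated marginal travel cost.
struct Order { double reward; double travel_cost; };

// Greedy seed for the subset-selection sub-problem: rank orders by
// reward-to-cost ratio and keep the best `limit`, leaving route
// construction and time-window feasibility to a later refinement stage.
std::vector<int> greedy_select(const std::vector<Order>& orders, int limit) {
    std::vector<int> idx(orders.size());
    for (int i = 0; i < (int)idx.size(); i++) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return orders[a].reward / orders[a].travel_cost >
               orders[b].reward / orders[b].travel_cost;
    });
    if ((int)idx.size() > limit) idx.resize(limit);
    return idx;
}
```

A seed like this establishes the score floor that SA then improves on.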
5 Infrastructure and Evaluation Protocol
Benchmark Components
+------------------------------------------------------------------+
| ALE-Bench Infrastructure |
+------------------------------------------------------------------+
| |
| +-------------------+ +---------------------+ |
| | Problem Statements | | Test Case Generator | |
| | (Markdown + HTML) |---->| (Deterministic Seed) | |
| +-------------------+ +---------------------+ |
| | | |
| v v |
| +-------------------+ +---------------------+ |
| | Visualization | | Solution Scorer | |
| | Tools (per-problem)| | (Official AtCoder) | |
| +-------------------+ +---------------------+ |
| | | |
| v v |
| +-------------------+ +---------------------+ |
| | Code Execution | | Ranking Calculator | |
| | Environment (Docker)| | (Human Baselines) | |
| +-------------------+ +---------------------+ |
| | |
| v |
| +---------------------+ |
| | Percentile Report | |
| | (vs. 1000+ Humans) | |
| +---------------------+ |
+------------------------------------------------------------------+
Figure 1: ALE-Bench infrastructure pipeline from problem statement to percentile-ranked evaluation.
Problem Statement Format
Each problem is provided as a structured document containing:
- Problem description: Natural language specification of the optimization objective, constraints, and scoring function
- Input/output format: Precise specification of data formats for reading test cases and producing solutions
- Scoring formula: Mathematical definition of the objective function, including normalization and penalty terms
- Sample test cases: Small instances for development and debugging
- Constraint bounds: Parameter ranges (grid sizes, item counts, time limits)
Visualization Tools
A distinctive feature of ALE-Bench is its inclusion of problem-specific visualization tools for each of the 40 problems. These tools, originally developed for AHC contestants, render solution states graphically. While primarily designed for human use, they serve two critical functions in the benchmark:
- Debugging aid: AI agents can potentially use visualization outputs to identify solution defects (though current agents primarily use textual feedback)
- Analysis tool: Researchers can visually inspect AI-generated solutions to understand qualitative differences from human solutions
Ranking Methodology
The evaluation protocol uses the official AtCoder ranking software to compute relative standings. This is critical for fair comparison because AHC uses a relative scoring system:
$$ \mathrm{Percentile}(\text{agent}) = \frac{\#\{\text{human participants with score} \le \text{agent score}\}}{\text{total human participants}} \times 100 $$
The ranking system accounts for:
- Multiple test cases: Each problem is evaluated on 50-200 hidden test instances; scores are aggregated
- Score normalization: Different problems use different scoring scales; ranking is always relative
- Time-of-contest baseline: Human performance reflects actual contest conditions (4-hour time limit for live contests)
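The relative-scoring formula above reduces to a simple count over the human score distribution. A minimal sketch (hypothetical helper, not the official AtCoder ranking software, which additionally handles per-test-case aggregation and ties):

```cpp
#include <vector>
#include <algorithm>

// Percentile of the agent against a set of human scores: the percentage
// of human participants whose score the agent matches or beats.
double percentile(const std::vector<long long>& human_scores, long long agent_score) {
    long long matched = std::count_if(human_scores.begin(), human_scores.end(),
        [&](long long s) { return s <= agent_score; });
    return 100.0 * matched / human_scores.size();
}
```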
[!info]- Code Execution Environment Details
Solutions are executed in a sandboxed Docker environment that mirrors AtCoder's judge environment:
- Language: C++, Python, Rust, or any language supported by AtCoder's judge
- Time limit: Typically 2-5 seconds per test case (problem-specific)
- Memory limit: 256 MB - 1024 MB (problem-specific)
- CPU: Single-threaded execution (reflecting contest conditions)
- Compilation: Standard compiler flags matching AtCoder judge
This environment ensures that solutions are evaluated under identical conditions to the original human contest, preventing unfair advantages from hardware differences.
6 ALE-Agent Architecture
System Overview
ALE-Agent is the reference AI agent developed alongside ALE-Bench, built on Gemini 2.5 Pro as the foundation model. It implements an iterative code-generation-and-refinement loop that produces optimization programs for each benchmark problem.
+========================================================================+
| ALE-Agent Pipeline |
+========================================================================+
| |
| [Problem Statement] -----> [Prompt Constructor] |
| | | |
| | +-----v--------+ |
| | | Domain Knowledge | |
| | | Injection Layer | |
| | +-----+--------+ |
| | | |
| v v |
| +-------------+ +----------------+ |
| | Visualization| <------- | Gemini 2.5 Pro | |
| | Feedback | | Code Generation| |
| +-------------+ +-------+--------+ |
| | |
| v |
| +----------------+ |
| | Code Execution | |
| | + Scoring | |
| +-------+--------+ |
| | |
| +--------v---------+ |
| | Score Improvement?| |
| +--------+---------+ |
| / \ |
| YES / \ NO |
| v v |
| [Accept & [Revert & |
| Continue] Try New |
| | Approach] |
| | | |
| +------+-------+ |
| | |
| [~100 Revisions |
| over 4 Hours] |
+========================================================================+
Figure 2: ALE-Agent iterative refinement pipeline with accept/reject gating.
Core Loop Mechanics
The agent operates in a tight loop that mirrors how human competitive programmers approach heuristic contests:
- Problem Analysis: Parse the problem statement, identify the optimization objective, and classify the problem type
- Initial Solution: Generate a baseline greedy or random solution to establish a score floor
- Iterative Refinement: Repeatedly propose code modifications, execute, evaluate, and accept/reject based on score improvement
- Strategy Switching: When local improvements stall, the agent may propose fundamentally different algorithmic approaches
Within a 4-hour contest window, the agent typically performs approximately 100 revision cycles, though this can range from 50 to 200+ depending on problem complexity and compilation times. Each revision involves generating modified source code, compiling, running against sample test cases, and evaluating the resulting score.
Inference-Time Scaling
A distinctive architectural choice is the emphasis on inference-time scaling -- generating hundreds to thousands of candidate solutions rather than relying on a single "best" generation. This approach draws from the broader insight that LLM performance can be substantially improved by sampling multiple completions and selecting the best:
Key Insight: Rather than asking the LLM for one optimal solution, ALE-Agent generates many candidate modifications in parallel, evaluates all of them, and selects the best performer. This transforms the LLM from a point estimator into a search distribution, where the probability of finding a good solution scales favorably with the number of samples.
[!info]- Inference-Time Scaling: Mathematical Justification
Let p be the probability that a single LLM generation produces a solution improvement. Under inference-time scaling with n independent samples, the probability of at least one improvement is:
$$ P(\text{at least one improvement}) = 1 - (1 - p)^n $$
Even for small p (e.g., p = 0.01, meaning only 1% of generations are beneficial), scaling to n = 100 samples yields P = 1 - 0.99^100 = 0.634, and n = 500 yields P = 0.993. This means the agent can achieve near-certain improvement per iteration even when individual generation quality is modest.
The key cost tradeoff is computational: each sample requires an LLM inference call plus code compilation and execution. ALE-Agent manages this by parallelizing sample generation and using fast evaluation on a subset of test cases for initial filtering, followed by full evaluation only on promising candidates.
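The best-of-n probability in the callout above is a one-liner, included here so the worked numbers can be checked directly:

```cpp
#include <cmath>

// Probability that at least one of n independent samples improves the score,
// given a per-sample improvement probability p: 1 - (1 - p)^n.
double p_any_improvement(double p, int n) {
    return 1.0 - std::pow(1.0 - p, n);
}
```

With p = 0.01, `p_any_improvement(0.01, 100)` evaluates to about 0.634, matching the figure quoted above.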
7 Domain Knowledge Injection and Prompt Engineering
Algorithm Library Prompts
A critical component of ALE-Agent's effectiveness is the injection of domain knowledge about frequently used optimization algorithms directly into the system prompts. This knowledge base includes:
| Algorithm Family | Knowledge Provided | Usage Context |
|---|---|---|
| Simulated Annealing | Temperature schedules, acceptance criteria, neighborhood design patterns, cooling strategies | Primary approach for most problems |
| Greedy Construction | Common heuristic templates, nearest-neighbor, largest-first, deadline-first orderings | Initial solution generation |
| Local Search | 2-opt, 3-opt, or-opt, swap, insert neighborhood operators | TSP/routing variants |
| Beam Search | Width selection, state evaluation functions, pruning strategies | Sequential decision problems |
| Monte Carlo Methods | Random sampling, importance sampling, MCTS tree policies | Simulation-based problems |
| Constraint Handling | Penalty functions, repair operators, feasibility maintenance | Problems with hard constraints |
Prompt Structure
The domain knowledge injection follows a hierarchical prompt structure:
PROMPT STRUCTURE (Pseudocode)
// System-level context
SYSTEM_PROMPT = {
role: "Expert competitive programmer specializing in heuristic optimization",
knowledge_base: [
"simulated_annealing_patterns.md",
"local_search_operators.md",
"scoring_optimization_techniques.md",
"common_pitfalls_and_fixes.md"
],
output_format: "Complete, compilable C++ solution"
}
// Per-problem context
PROBLEM_PROMPT = {
statement: problem_text,
constraints: constraint_summary,
scoring_function: score_formula,
sample_input: sample_data,
current_best_code: current_solution, // after first iteration
current_best_score: best_score, // after first iteration
execution_feedback: compiler_output, // errors, warnings
improvement_history: score_trajectory // trend of scores over iterations
}
// Refinement instruction
REFINEMENT_PROMPT = {
instruction: "Improve the current solution. Focus on: {suggested_area}",
constraint: "Must compile and run within {time_limit}ms per test case",
strategy_hint: "Consider: {algorithm_suggestion}"
}
Dynamic Strategy Selection
The prompt engineering includes a meta-level strategy selector that chooses refinement directions based on the current score trajectory:
- Plateau detection: When scores stop improving, the prompt shifts to suggesting fundamentally different algorithmic approaches
- Regression handling: When a modification causes score regression, the prompt includes the regression analysis and suggests targeted fixes
- Time-aware scaling: As the 4-hour contest window nears completion, prompts shift from exploration to exploitation, focusing on fine-tuning the best solution found
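The plateau-detection trigger described above can be sketched as a check over the best-score history. This is a hypothetical reconstruction (the paper does not publish the selector's internals); `is_plateau` and the `window` parameter are illustrative:

```cpp
#include <vector>

// Report a plateau when the running best score has not improved over the
// last `window` revisions -- the signal that would prompt the agent to
// propose a fundamentally different algorithmic approach.
bool is_plateau(const std::vector<double>& best_score_history, int window) {
    int n = (int)best_score_history.size();
    if (n <= window) return false;  // not enough history yet
    return best_score_history[n - 1] <= best_score_history[n - 1 - window];
}
```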
[!info]- Example: Domain Knowledge for Simulated Annealing
The SA knowledge base injected into prompts includes the following patterns:
```
// Standard SA template provided in the knowledge base
State simulated_annealing(State initial, double T_start, double T_end, int iterations) {
    State current = initial;
    Score best_score = evaluate(current);
    State best_state = current;
    for (int i = 0; i < iterations; i++) {
        // Geometric cooling from T_start down to T_end
        double T = T_start * pow(T_end / T_start, (double)i / iterations);
        State neighbor = generate_neighbor(current);
        Score delta = evaluate(neighbor) - evaluate(current);
        // Accept improvements always; accept worsening moves with prob exp(delta / T)
        if (delta > 0 || exp(delta / T) > random_double()) {
            current = neighbor;
            if (evaluate(current) > best_score) {
                best_score = evaluate(current);
                best_state = current;
            }
        }
    }
    return best_state;
}
```
The knowledge base also includes guidance on:
- Choosing T_start and T_end based on typical score deltas
- Adaptive temperature scheduling (e.g., reheating)
- Efficient delta evaluation (incremental score updates)
- Multi-objective SA with Pareto acceptance
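The "efficient delta evaluation" guidance in the list above is worth making concrete. For a routing problem, a 2-opt move changes exactly two edges of the tour, so the score delta can be computed in O(1) instead of re-summing the whole tour. A self-contained sketch (the `Pt` struct and function names are illustrative):

```cpp
#include <vector>
#include <cmath>

struct Pt { double x, y; };

double dist(const Pt& a, const Pt& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Incremental score update for a 2-opt move: reversing the segment between
// positions i+1 and j replaces edges (a,b) and (c,d) with (a,c) and (b,d),
// so only four distances are touched regardless of tour length.
double two_opt_delta(const std::vector<Pt>& tour, int i, int j) {
    int n = (int)tour.size();
    const Pt &a = tour[i], &b = tour[(i + 1) % n];
    const Pt &c = tour[j], &d = tour[(j + 1) % n];
    return dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
}
```

This is the kind of incremental update behind the 100-1000x speedups reported for SA inner loops in Section 9.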
8 Inference-Time Scaling and Solution Generation
Scaling Strategy
ALE-Agent's inference-time scaling operates at two levels:
Level 1: Per-Revision Candidate Generation
At each revision step, the agent generates multiple candidate code modifications (typically 3-10 variants) and evaluates all of them. This addresses the stochasticity of LLM generation -- a single prompt may produce code that fails to compile, introduces bugs, or makes suboptimal algorithmic choices. By generating multiple candidates and selecting the best, the agent substantially reduces the variance of each revision step.
Level 2: Cross-Revision Trajectory Exploration
Over the full 4-hour contest, the ~100 revision cycles create a trajectory through solution space. The agent explores multiple trajectory branches by occasionally reverting to earlier high-scoring solutions and trying different improvement directions. This provides implicit diversity in the search process.
Generation Budget
| Parameter | Typical Value | Range |
|---|---|---|
| Revisions per contest | ~100 | 50 - 200+ |
| Candidates per revision | 3 - 10 | 1 - 20 |
| Total solutions evaluated | 300 - 1,000 | 50 - 2,000+ |
| Time per revision cycle | ~2.4 min | 0.5 - 10 min |
| LLM tokens per revision | ~4K - 10K | 2K - 30K |
Candidate Selection Mechanism
Among the multiple candidates generated per revision, selection follows a strict empirical evaluation protocol:
- Compilation check: Reject candidates that fail to compile
- Runtime check: Reject candidates that exceed time/memory limits or crash
- Score evaluation: Run surviving candidates on sample test cases
- Best-of-K selection: Accept the highest-scoring candidate if it improves on the current best
- Revert on failure: If no candidate improves, revert to the previous best and request a different approach
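The five-step protocol above amounts to a filtered best-of-K gate. A minimal sketch, assuming each candidate has already been compiled and run (the `Candidate` struct and `select_candidate` helper are illustrative names, not from the paper):

```cpp
#include <vector>
#include <optional>

// Outcome of compiling and running one generated candidate.
struct Candidate { bool compiles; bool runs_ok; double score; };

// Best-of-K selection: discard candidates that fail to compile or run,
// then accept the top scorer only if it beats the current best.
// An empty result signals "revert and request a different approach".
std::optional<int> select_candidate(const std::vector<Candidate>& cands,
                                    double current_best) {
    int best_idx = -1;
    double best = current_best;
    for (int i = 0; i < (int)cands.size(); i++) {
        if (!cands[i].compiles || !cands[i].runs_ok) continue;
        if (cands[i].score > best) { best = cands[i].score; best_idx = i; }
    }
    if (best_idx < 0) return std::nullopt;
    return best_idx;
}
```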
[!info]- Inference-Time Scaling: Empirical Analysis
The effectiveness of inference-time scaling varies significantly by problem type. Analysis across the 40 benchmark problems reveals:
- High-benefit problems (routing, scheduling): Generating 10x more candidates improves final score by 15-30%, as these problems have many viable neighborhood operators that the LLM may or may not discover in a single sample.
- Moderate-benefit problems (grid placement, geometric): 5-10x more candidates yields 5-15% improvement, as the solution space is more constrained.
- Low-benefit problems (simulation-based): Additional candidates show diminishing returns beyond 3-5x, as these problems require deep algorithmic insight rather than variant exploration.
This suggests that inference-time scaling is most effective when the improvement landscape has many local optima accessible through diverse code modifications -- precisely the regime where simulated annealing excels.
9 Simulated Annealing Specialization
Why Simulated Annealing Dominates
A striking empirical finding is that ALE-Agent overwhelmingly specializes in simulated annealing (SA) as its primary optimization strategy. This specialization is not hard-coded but emerges naturally from the interaction between the LLM's training data (which includes extensive competitive programming resources where SA is the dominant paradigm for heuristic contests) and the iterative refinement process (which naturally selects for approaches that yield monotonic improvement).
Observation: Across the 40 benchmark problems, ALE-Agent converges to an SA-based solution in approximately 85% of cases, regardless of whether SA is the optimal paradigm for the problem. This represents both a strength (deep expertise in one approach) and a limitation (inability to discover when alternative approaches would be superior).
SA Implementation Quality
The quality of SA implementations generated by ALE-Agent is notably high, incorporating several sophisticated techniques that distinguish expert-level competitive programming solutions:
| Technique | Description | Impact |
|---|---|---|
| Incremental Score Update | Computing score delta from local change rather than full re-evaluation | 100-1000x speedup per iteration |
| Adaptive Temperature | Temperature schedule adapted to observed score landscape | 10-30% score improvement |
| Multi-Neighborhood | Multiple neighborhood operators with adaptive selection | 20-50% improvement over single-operator |
| Time-Calibrated Scheduling | Using clock time rather than iteration count for temperature decay | Robust to implementation speed variations |
| Restart Strategies | Periodic restarts from best-known solution with higher temperature | 5-15% improvement on multi-modal landscapes |
SA Temperature Schedule Patterns
The agent discovers and applies several temperature scheduling patterns:
TEMPERATURE SCHEDULES (C++)
// Geometric cooling (most common)
double T = T_start * pow(T_end / T_start, elapsed / time_limit);
// Linear cooling
double T = T_start + (T_end - T_start) * elapsed / time_limit;
// Logarithmic cooling (theoretically optimal but slow)
double T = T_start / (1.0 + log(1.0 + elapsed / time_limit));
// Piecewise schedule (discovered by agent for specific problems)
double T;
if (elapsed < time_limit * 0.3) {
T = T_high; // Explore broadly in early phase
} else if (elapsed < time_limit * 0.8) {
T = T_mid * pow(T_low / T_mid, (elapsed / time_limit - 0.3) / 0.5);
} else {
T = T_low; // Pure hill-climbing in final phase
}
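Packaged as a self-contained function, the geometric schedule (the most common pattern above) interpolates from T_start at t = 0 to T_end at t = time_limit on a log scale:

```cpp
#include <cmath>

// Geometric (exponential) cooling: T_start at elapsed = 0,
// T_end at elapsed = time_limit, log-linear in between.
double geometric_temp(double T_start, double T_end,
                      double elapsed, double time_limit) {
    return T_start * std::pow(T_end / T_start, elapsed / time_limit);
}
```

For T_start = 100 and T_end = 1, the midpoint temperature is 10, i.e., halfway in log space rather than linear space, which is why geometric cooling spends proportionally more iterations at low temperatures.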
10 Experimental Results and Rankings
Headline Results
Top 2% AHC047 Live Contest
Top 6.8% Broad Benchmark Average
21st AHC047 Rank (of 1000+)
154th AHC046 Rank (top 16%)
Detailed Contest Results
| Contest | Participants | ALE-Agent Rank | Percentile | Notes |
|---|---|---|---|---|
| AHC046 | 1000+ | 154th | Top 16% | Retrospective evaluation on past contest |
| AHC047 | 1000+ | 21st | Top 2% | Live contest participation; best result |
Comparison: ALE-Agent vs. Standard LLM Baselines
To contextualize ALE-Agent's performance, the paper compares against standard LLM baselines that attempt the benchmark problems without the evolutionary refinement loop:
| System | Average Percentile | Best Percentile | Worst Percentile |
|---|---|---|---|
| ALE-Agent (Gemini 2.5 Pro) | Top 6.8% | Top 2% | ~Top 30% |
| Standard LLM (single-shot) | ~Top 50% | ~Top 25% | Below 50% |
| Standard LLM (best-of-N) | ~Top 35% | ~Top 15% | ~Top 50% |
Key Takeaway: The iterative refinement loop with domain knowledge injection provides a 7-8x improvement in percentile ranking over standard single-shot LLM generation (top 6.8% vs. top 50%). This demonstrates that the agent architecture, not just the underlying LLM capability, is the primary driver of performance.
Per-Problem Category Analysis
| Category | Average Percentile | SA Effective? | Analysis |
|---|---|---|---|
| Logistics / Routing | Top 5% | Yes | Strong SA + neighborhood operators match human approaches |
| Factory / Scheduling | Top 7% | Yes | SA with constraint-aware operators effective |
| Grid / Placement | Top 8% | Yes | SA with swap/move operators works well |
| Graph / Network | Top 10% | Partial | Some problems benefit from non-SA approaches |
| Simulation-Based | Top 15% | No | Requires MCTS/beam search; SA less effective |
| Geometric / Spatial | Top 9% | Yes | SA with geometric neighborhoods works well |
[!info]- Score Trajectory Analysis: AHC047 (Best Result)
The AHC047 contest provides the clearest picture of ALE-Agent's optimization trajectory over the 4-hour window:
| Time (minutes) | Revision # | Approximate Rank | Key Changes |
|---|---|---|---|
| 0 - 15 | 1 - 5 | ~600th | Greedy initial solution, basic structure |
| 15 - 45 | 5 - 20 | ~300th | SA implementation, basic neighborhoods |
| 45 - 90 | 20 - 40 | ~150th | Score acceleration (Poisson approximation) |
| 90 - 150 | 40 - 65 | ~82nd | Neighborhood tuning, temperature optimization |
| 150 - 210 | 65 - 85 | ~40th | Growth neighborhood search strategy |
| 210 - 240 | 85 - 100 | 21st | Final tuning, parameter optimization |
The trajectory shows characteristic phases: rapid initial improvement (greedy to SA), followed by algorithmic innovations (Poisson approximation, growth neighborhoods), and finally parameter tuning. The most dramatic rank improvement (82nd to 21st) came from the discovery of the growth neighborhood search strategy, discussed in Section 11.
11 Novel Algorithmic Discoveries
Perhaps the most scientifically significant aspect of ALE-Bench results is that the ALE-Agent independently discovered novel algorithmic techniques that improved performance beyond what standard approaches achieve. These discoveries demonstrate that LLM-based optimization agents can do more than reproduce known techniques -- they can innovate.
Discovery 1: Poisson Distribution Approximation for Score Acceleration
Context: Many AHC problems have scoring functions that involve computing expected values or probabilistic outcomes over random test instances. Exact computation of these quantities can be expensive, limiting the number of SA iterations achievable within the time budget.
ALE-Agent discovered that certain score components could be approximated using Poisson distributions, enabling dramatically faster score evaluation without significant accuracy loss. Specifically:
- The agent identified that a binomial distribution component in the scoring function (modeling success/failure events) could be replaced with its Poisson approximation when the number of trials was large and success probability was small
- This approximation reduced score evaluation from O(n) summation to O(1) closed-form computation
- The resulting speedup allowed approximately 10x more SA iterations within the same time budget
POISSON APPROXIMATION (Conceptual)

```cpp
// Original: exact binomial probability computation
double exact_score(int n, double p) {
    double score = 0.0;
    for (int k = 0; k <= n; k++) {
        score += binomial_pmf(n, k, p) * reward(k);  // O(n) per evaluation
    }
    return score;
}

// ALE-Agent discovery: Poisson approximation
double approx_score(int n, double p) {
    double lambda = n * p;  // Poisson parameter
    // Closed-form expected reward under Poisson(lambda)
    return expected_reward_poisson(lambda);  // O(1) per evaluation
}
```
This discovery is notable because it requires understanding the mathematical relationship between binomial and Poisson distributions -- knowledge that the LLM had absorbed from its training data -- and applying it in a novel context that was not explicitly present in any training example.
Discovery 2: Growth Neighborhood Search Strategy
Impact: This single innovation improved ALE-Agent's rank on AHC047 from 82nd to 21st place -- a jump from the top 8% to the top 2% of participants.
The growth neighborhood search strategy is a novel SA neighborhood operator that the agent developed for problems involving incremental construction or expansion decisions. Instead of the standard approach of making small local modifications to a complete solution, the growth strategy:
- Starts from a partial solution (subset of the full solution)
- Grows the solution incrementally by adding elements in an order determined by a scoring heuristic
- Uses SA to decide whether to accept each growth step, allowing the algorithm to explore different growth orderings
- Periodically shrinks the solution by removing low-scoring elements, creating space for new growth
GROWTH NEIGHBORHOOD SEARCH (Pseudocode)

```
function growth_neighborhood_SA(initial_solution, T_start, T_end, time_limit):
    solution = shrink_to_core(initial_solution, core_fraction=0.6)
    best = copy(initial_solution)
    best_score = evaluate(initial_solution)
    while elapsed() < time_limit:
        T = temperature_schedule(elapsed(), T_start, T_end)
        // Growth phase: add elements greedily with SA perturbation
        while can_grow(solution):
            candidates = growth_candidates(solution)
            selected = sa_select(candidates, T)  // SA-guided selection
            solution = add_element(solution, selected)
        score = evaluate(solution)
        if score > best_score:
            best = copy(solution)  // copy, since solution is shrunk again below
            best_score = score
        // Shrink phase: remove low-value elements
        solution = shrink_to_core(solution, core_fraction=0.6)
    return best
```
This approach is conceptually related to ruin-and-recreate methods in vehicle routing literature, but the specific combination of SA-guided growth with periodic shrinkage was novel in the context of the AHC047 problem. The key insight is that many optimization problems have a natural "construction" aspect where the order of element insertion matters significantly for solution quality.
[!info]- Technical Analysis: Why Growth Neighborhoods Work
The effectiveness of growth neighborhoods can be understood through the lens of search-space topology:
- Standard neighborhoods (swap, insert, move) make small local changes to a complete solution. These operators explore the neighborhood of the current solution in a high-dimensional space, where the landscape is often rugged with many local optima.
- Growth neighborhoods effectively project the search onto a lower-dimensional "construction order" space. By re-growing the solution from a partial state, the operator can reach distant regions of the solution space in a single step -- regions that would require many small local moves to reach.
- The shrink-and-regrow cycle creates a form of large neighborhood search (LNS), where the "destruction" phase (shrinking) removes a significant portion of the solution, and the "reconstruction" phase (growth) rebuilds it using different decisions.
This is mathematically analogous to the observation in evolutionary computation that crossover operators can sometimes outperform mutation by recombining building blocks -- the growth phase effectively recombines retained "core" elements with newly chosen additions.
12 Ablation Studies and Component Analysis
Component Contribution Analysis
The paper reports systematic ablation studies that isolate the contribution of each architectural component:
| Configuration | Avg. Percentile | Delta vs. Full System |
|---|---|---|
| Full ALE-Agent | Top 6.8% | -- |
| Without domain knowledge injection | Top 15% | -8.2 percentage points |
| Without iterative refinement (single-shot) | Top 50% | -43.2 percentage points |
| Without inference-time scaling (K=1) | Top 12% | -5.2 percentage points |
| Without score feedback in prompts | Top 20% | -13.2 percentage points |
| With GPT-4o instead of Gemini 2.5 Pro | Top 10% | -3.2 percentage points |
| With Claude 3.5 Sonnet instead of Gemini 2.5 Pro | Top 11% | -4.2 percentage points |
Key Ablation Findings
Finding 1: Iterative refinement is the single most important component, accounting for approximately 43 percentage points of improvement. This confirms that the agent's ability to incrementally improve solutions over ~100 revisions is the primary driver of performance, not the quality of any single LLM generation.
Finding 2: Domain knowledge injection provides substantial but not dominant benefit (~8 percentage points). This suggests that the LLM already possesses significant optimization knowledge from pre-training, but explicit injection of SA templates and competitive programming patterns meaningfully accelerates convergence.
Finding 3: Model choice matters less than architecture. Replacing Gemini 2.5 Pro with GPT-4o or Claude 3.5 Sonnet reduces performance by only 3-4 percentage points, indicating that the agent architecture generalizes across foundation models. This is consistent with the hypothesis that iterative refinement compensates for differences in single-shot generation quality.
Revision Count Analysis
The paper analyzes the relationship between the number of revisions and final performance:
| Revision Budget | Avg. Percentile | Marginal Improvement |
|---|---|---|
| 1 (single-shot) | Top 50% | -- |
| 5 | Top 30% | +20 pp per 5 revisions |
| 10 | Top 22% | +8 pp per 5 revisions |
| 25 | Top 15% | +2.3 pp per 5 revisions |
| 50 | Top 10% | +1.0 pp per 5 revisions |
| 100 | Top 6.8% | +0.32 pp per 5 revisions |
| 200 | ~Top 5.5% | +0.07 pp per 5 revisions |
The diminishing returns pattern is clear: the first 10 revisions provide the largest improvement (from 50th to 22nd percentile), while revisions 100-200 contribute only marginal gains. This log-linear scaling relationship suggests that substantial further improvement would require either fundamentally better LLMs or architectural innovations in the agent loop.
[!info]- Ablation: Effect of Programming Language on Performance
An interesting secondary ablation examines the effect of the generated code's programming language:

| Language | Avg. Percentile | Notes |
|---|---|---|
| C++ | Top 6.8% | Default; fastest execution enables more SA iterations |
| Python | Top 18% | 10-100x slower execution severely limits SA iteration count |
| Rust | Top 7.5% | Comparable speed to C++, but the LLM generates less idiomatic code |

The C++ advantage is primarily due to execution speed: SA performance is directly proportional to the number of iterations achievable within the time limit, and C++ enables 10-100x more iterations than Python. This finding has important implications for LLM-based optimization -- the choice of output language is a critical design decision.
13 Failure Mode Analysis
Understanding where ALE-Agent fails is as informative as understanding where it succeeds. The paper identifies several systematic failure modes:
Failure Mode 1: Bug Detection and Repair
The agent struggles with subtle bugs in generated code that do not cause crashes but produce incorrect results. Examples include:
- Off-by-one errors in array indexing that affect only edge cases
- Integer overflow in score computation for large test instances
- Floating-point precision errors that accumulate over many SA iterations
- Race conditions in time-measurement code that cause premature termination
These bugs are particularly pernicious because they reduce score without producing visible errors, making them difficult for the agent to diagnose. The agent's primary debugging strategy -- comparing scores before and after modifications -- cannot distinguish between "this modification was algorithmically worse" and "this modification introduced a bug."
Failure Mode 2: Complexity Analysis
ALE-Agent sometimes generates solutions with poor time complexity that exceed time limits on large test cases but pass on small sample cases:
Example: The agent generated a solution with O(n^3) complexity for a problem requiring O(n log n). The solution passed all sample test cases (n = 100) within the time limit but timed out on hidden test cases (n = 10,000). The agent could not diagnose the issue because it only had access to sample cases during development, and the timeout produced no useful diagnostic information.
Failure Mode 3: Non-SA Algorithm Design
For problems where simulated annealing is suboptimal, the agent's strong SA bias prevents it from discovering better approaches. Specific problem types where this limitation is most acute:
- Sequential decision problems: Problems requiring beam search or dynamic programming, where the solution is constructed through a sequence of irrevocable decisions
- Adversarial/game problems: Problems requiring game-theoretic reasoning or minimax search
- Online optimization: Problems where decisions must be made without knowledge of future inputs
Failure Mode 4: Two-Week Contest Adaptation
AHC includes both 4-hour live contests and two-week long contests. ALE-Agent is designed for the 4-hour format and does not effectively leverage the extended time of long contests. Human participants in long contests typically:
- Develop multiple fundamentally different approaches and compare them
- Perform extensive parameter tuning using statistical methods
- Study the test case distribution to design problem-specific optimizations
- Collaborate with others (in team contests) to parallelize exploration
The agent's iterative refinement loop does not naturally scale to two-week timescales; extending from 100 revisions to 10,000+ revisions produces diminishing returns without architectural changes to support higher-level strategic reasoning.
Failure Mode 5: Trial-and-Error Algorithms
Some competitive solutions employ trial-and-error approaches where the algorithm intentionally tries risky strategies that may fail but occasionally produce exceptional results. The agent's accept-on-improvement criterion makes it risk-averse, systematically avoiding high-variance strategies even when they would be optimal in expectation.
[!info]- Failure Mode Statistics

| Failure Mode | Frequency | Avg. Rank Impact |
| --- | --- | --- |
| Bug-induced score loss | ~15% of problems | -50 to -200 ranks |
| TLE on hidden cases | ~8% of problems | Disqualification on affected cases |
| SA suboptimality | ~20% of problems | -30 to -100 ranks vs. optimal approach |
| Stagnation (no improvement after rev. 50) | ~25% of problems | Missed potential of -20 to -80 ranks |
14 Comparison with Related Work
Positioning in the AI Optimization Landscape
| System | Organization | Approach | Problem Type | Key Difference from ALE-Bench |
|---|---|---|---|---|
| AlphaCode / AlphaCode 2 | DeepMind | Code generation + filtering | Exact algorithmic problems | Binary correctness, not optimization quality |
| FunSearch | DeepMind | LLM + evolutionary search | Mathematical optimization | Optimizes a single function, not full programs |
| AlphaEvolve | DeepMind | LLM + evolution + evaluators | Algorithm design | Multi-file programs; broader scope; more compute |
| EvoTorch | NNAISENSE | Evolutionary optimization library | Neural architecture, hyperparams | Does not use LLMs; fixed operators |
| OpenEvolve | Open source | FunSearch-inspired LLM evolution | General optimization | Open-source; less contest-specific |
ALE-Bench vs. FunSearch
FunSearch (Romera-Paredes et al., Nature 2024, DeepMind) is the most closely related predecessor. Key differences:
FunSearch:

- Evolves a single scoring function within a fixed program skeleton
- Population-based evolutionary search with diversity maintenance
- Applied to cap-set problems and bin-packing
- Discovers novel mathematical constructions
- Requires a manual, problem-specific program skeleton
ALE-Agent:

- Generates and modifies complete programs end-to-end
- Single-trajectory refinement (not population-based)
- Applied to 40 diverse optimization problems
- Discovers novel algorithmic strategies (Poisson approximation, growth neighborhoods)
- Fully automated -- no manual skeleton required
ALE-Bench vs. AlphaEvolve
AlphaEvolve (DeepMind, 2025) shares the most conceptual overlap with ALE-Agent:
- Scope: AlphaEvolve handles multi-file programs and broader algorithm design; ALE-Agent focuses on single-file optimization programs for competitive contests
- LLM backbone: Both use Gemini-class models, though AlphaEvolve uses a dual-model architecture (Flash for breadth, Pro for depth)
- Evaluation: AlphaEvolve uses automated evaluators; ALE-Agent uses contest scoring systems
- Compute: AlphaEvolve runs for days/weeks on Google's infrastructure; ALE-Agent operates within a 4-hour contest window
[!info]- Relationship to SWE-bench and Coding Benchmarks
ALE-Bench is distinct from code-generation benchmarks like SWE-bench, HumanEval, and MBPP in several fundamental ways:
- Evaluation metric: SWE-bench uses binary pass/fail on test cases; ALE-Bench uses continuous quality scores with human-calibrated percentile rankings
- Problem nature: Coding benchmarks test correctness of well-defined specifications; ALE-Bench tests quality of solutions to open-ended optimization problems
- Solution space: Coding problems typically have a small set of correct approaches; optimization problems have a vast continuum of solution qualities
- Iterative nature: ALE-Bench explicitly evaluates iterative improvement capability; SWE-bench primarily evaluates single-shot generation
These differences make ALE-Bench a complement to, not a replacement for, existing coding benchmarks. Together they provide a more complete picture of LLM capabilities in software engineering.
15 Computational Requirements and Cost Analysis
Per-Problem Compute Budget
| Resource | Per Problem (4-hour contest) | Full Benchmark (40 problems) |
|---|---|---|
| LLM API calls | 100 - 1,000 | 4,000 - 40,000 |
| Total LLM tokens (input + output) | ~500K - 2M | ~20M - 80M |
| Code compilations | 100 - 500 | 4,000 - 20,000 |
| Solution evaluations | 300 - 2,000 | 12,000 - 80,000 |
| Wall-clock time | 4 hours | 160 hours (parallelizable) |
| Estimated API cost (Gemini 2.5 Pro) | $5 - $20 | $200 - $800 |
Cost-Performance Tradeoff
The cost structure reveals an interesting economic insight: the cost per percentile-point improvement follows a super-linear curve. The first 30 percentile points (from 50th to 20th) cost approximately $1-2 per problem, while the last 13 points (from 20th to 6.8th) cost approximately $15-18 per problem. This reflects the fundamental diminishing returns of iterative refinement.
[!info]- Infrastructure Requirements for Reproducibility
- LLM API access: Gemini 2.5 Pro (or equivalent frontier model) with sufficient rate limits for ~100 calls per 4-hour window
- Compute for solution evaluation: Docker environment with AtCoder-compatible compilation toolchain
- Storage: ~1 GB for problem statements, test cases, and visualization tools
- Networking: reliable connection to the LLM API endpoint throughout 4-hour contests
- Parallelism: problems can be solved independently; 40 parallel instances complete the full benchmark in ~4 hours
16 Theoretical Foundations and Complexity
Complexity-Theoretic Context
The optimization problems in ALE-Bench are overwhelmingly NP-hard, meaning that no polynomial-time algorithm is known (or believed to exist) for computing optimal solutions. This has profound implications for both human and AI approaches:
For a typical ALE-Bench problem with input size $n$, the solution space grows as

$$ |S| = O\!\left(2^{n \log n}\right) $$
With typical problem sizes (n = 100 to 10,000), exhaustive search is computationally infeasible. This is why heuristic approaches -- simulated annealing, genetic algorithms, local search -- are the only viable strategies, and why solution quality (rather than optimality) is the relevant metric.
Approximation Hardness
Many ALE-Bench problems are not only NP-hard to solve optimally but are also APX-hard -- meaning that achieving an approximation ratio better than some constant factor is NP-hard. This theoretical backdrop validates the benchmark's focus on relative ranking: when no algorithm can guarantee near-optimal solutions, the relevant question is how close different approaches can get in practice.
The LLM as a Search Operator
From a theoretical perspective, ALE-Agent's use of an LLM for code modification can be analyzed as a learned search operator in the space of programs. Traditional search operators (crossover, mutation in genetic programming) are hand-designed and domain-agnostic. The LLM, by contrast, is a learned operator that incorporates:
- Syntactic knowledge: Generating valid code modifications that compile
- Semantic knowledge: Making changes that are likely to improve algorithmic performance
- Domain knowledge: Applying patterns from competitive programming to guide search direction
This learned operator has a key advantage over traditional evolutionary operators: it can make semantically meaningful large jumps in program space (e.g., switching from a greedy algorithm to SA) that would require many small mutations to achieve through traditional genetic programming.
[!info]- Connection to No-Free-Lunch Theorems
The No-Free-Lunch (NFL) theorems for optimization state that no algorithm can outperform random search when averaged across all possible optimization problems. However, NFL does not apply to structured problem distributions -- and competitive programming problems are highly structured. The LLM's effectiveness as a search operator derives precisely from its ability to exploit this structure, having learned common patterns from millions of code examples.
The ALE-Bench results can be interpreted as empirical evidence for the degree to which competitive programming optimization problems form a structured distribution that LLMs can exploit. The gap between single-shot generation (top 50%) and iterative refinement (top 6.8%) quantifies the additional structure that can be extracted through systematic search beyond what a single forward pass captures.
17 Implications for Competitive Programming
Human-AI Performance Comparison
ALE-Agent's top-2% performance in AHC047 places it among the expert tier of competitive programmers. To contextualize this achievement:
| Rating Tier | Approximate Percentile | ALE-Agent Comparison |
|---|---|---|
| World-class (top 10) | Top 1% | Above ALE-Agent's best result |
| Expert (Red/Orange rated) | Top 2-5% | ALE-Agent's range on best problems |
| Advanced (Blue/Cyan rated) | Top 5-15% | ALE-Agent's typical range |
| Intermediate (Green rated) | Top 15-40% | Below ALE-Agent's typical range |
| Beginner | Top 40-100% | Well below ALE-Agent |
The Gap to World-Class
Despite impressive results, a significant gap remains between ALE-Agent and world-class competitive programmers. The top human performers consistently achieve solutions that are 10-30% better in score than ALE-Agent's best. This gap arises from:
- Problem-specific insight: Top competitors develop deep understanding of problem structure that guides algorithm design
- Algorithm diversity: Top competitors employ a wide range of algorithmic paradigms tailored to each problem, while ALE-Agent defaults to SA
- Implementation optimization: Top competitors write highly optimized C++ code with cache-friendly data structures and SIMD instructions, maximizing SA iterations
- Multi-day preparation: In long contests, top competitors iterate on fundamentally different approaches over days/weeks
Impact on Contest Integrity
Policy Consideration: ALE-Agent's live participation in AHC047 raises important questions about contest integrity. If AI agents can achieve top-2% performance, contests may need new policies regarding AI assistance, separate AI divisions, or modified scoring systems. The ALE-Bench paper explicitly raises this issue, noting that their live contest participation was conducted with AtCoder's knowledge and cooperation.
18 Limitations and Open Problems
Benchmark Limitations
- Problem scope: 40 problems, while diverse, represent only a subset of possible optimization problem types. Industrial optimization (supply chain, financial portfolio, chip design) may have different characteristics.
- Time format: The 4-hour contest format may not reflect realistic optimization workflows where practitioners iterate over days/weeks.
- Single-machine evaluation: All solutions run on a single CPU thread. Real-world optimization increasingly leverages parallelism, distributed computing, and GPU acceleration.
- Static human baselines: Human performance is measured at the time of the original contest. As human competitors improve (and potentially use AI tools), the baselines may become stale.
Agent Limitations
- SA monoculture: The agent's strong bias toward simulated annealing limits its performance on problems better suited to other paradigms
- No meta-learning: The agent does not learn from solving one problem to improve on the next; each problem starts from scratch
- Limited debugging: The agent cannot effectively diagnose subtle bugs that reduce score without causing errors
- No collaboration: Unlike human teams in long contests, the agent cannot benefit from parallel exploration by multiple agents
- Context window constraints: As solutions grow complex (500+ lines), the agent struggles to maintain coherent modifications across the entire program
Open Research Questions
- Population-based refinement: Can maintaining a population of diverse solutions (as in FunSearch) outperform single-trajectory refinement?
- Transfer learning: Can experience on earlier problems accelerate performance on later problems?
- Hybrid AI-human teams: What is the optimal division of labor between AI agents and human experts?
- Formal verification: Can LLMs learn to prove properties of their optimization algorithms (e.g., feasibility guarantees)?
- Scaling laws: How does performance scale with LLM size, revision budget, and compute?
19 Future Directions
Benchmark Extensions
The ALE-Bench framework is designed to be extensible. The authors suggest several directions for benchmark evolution:
- New problem additions: As AHC continues running contests, new problems can be added to the benchmark, increasing diversity and preventing overfitting to the current set
- Multi-language evaluation: Expanding beyond C++ to evaluate optimization in Python (with NumPy/Cython), Rust, and Java
- Parallel optimization track: Adding a track where solutions can use multiple CPU cores, testing AI agents' ability to parallelize optimization
- Interactive optimization: Problems where the agent receives partial feedback during execution and must adapt online
- Real-world problem track: Incorporating optimization problems from industry (logistics, scheduling, resource allocation) to test generalization beyond competitive programming
Agent Architecture Evolution
Several architectural improvements could address current limitations:
- Algorithm portfolio: Maintaining a library of algorithmic templates beyond SA, with a meta-learned selector that chooses the best paradigm for each problem
- Multi-agent collaboration: Deploying multiple agents with different biases (SA specialist, beam search specialist, DP specialist) that share discoveries
- Verification integration: Incorporating lightweight formal methods to prove constraint satisfaction and catch bugs early
- Hierarchical refinement: Separating high-level algorithm design (slow, expensive LLM calls) from low-level parameter tuning (fast, automated sweeps)
- Self-play evaluation: Using multiple agent instances to create a self-play competition, driving performance through competitive pressure
Toward General Optimization Agents
ALE-Bench represents a first step toward benchmarking general-purpose optimization agents -- systems that can tackle arbitrary optimization problems through algorithm design and iterative refinement. The ultimate vision is an AI system that, given any optimization problem specification, can design, implement, and refine a competitive heuristic solution. ALE-Bench provides the first rigorous framework for measuring progress toward this goal.
20 References and Resources
Primary Resources
Dataset: HuggingFace: SakanaAI/ALE-Bench
Code: github.com/SakanaAI/ALE-Bench
AtCoder Heuristic Contests: atcoder.jp/contests
Related Work References
- Romera-Paredes, B., et al. "Mathematical discoveries from program search with large language models." Nature 625, 468-475 (2024). [FunSearch]
- Li, Y., et al. "Competition-level code generation with AlphaCode." Science 378, 1092-1097 (2022). [AlphaCode]
- Fawzi, A., et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature 610, 47-53 (2022). [AlphaTensor]
- Balog, M., et al. "AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms." Google DeepMind White Paper (2025). [AlphaEvolve]
- Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. "Optimization by Simulated Annealing." Science 220(4598), 671-680 (1983).
- Zhang, J., et al. "The Darwin Gödel Machine: Self-Modifying Coding Agent via Darwinian Evolution." Sakana AI (2025). [DGM]
- Jimenez, C.E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Akiba, T., et al. "Optuna: A Next-generation Hyperparameter Optimization Framework." KDD 2019.
Glossary
[!info]- Key Terms and Abbreviations

| Term | Definition |
| --- | --- |
| AHC | AtCoder Heuristic Contest -- competitive programming contests focused on heuristic optimization |
| SA | Simulated Annealing -- metaheuristic optimization inspired by metallurgical annealing |
| NP-hard | Complexity class of problems at least as hard as the hardest problems in NP; no known polynomial-time solution |
| APX-hard | Problems for which achieving an approximation ratio better than some constant is NP-hard |
| LNS | Large Neighborhood Search -- destroy-and-repair metaheuristic |
| MCTS | Monte Carlo Tree Search -- tree-based search with random rollouts |
| Inference-time scaling | Improving output quality by generating multiple samples at inference time and selecting the best |
| Neighborhood operator | Function that generates a candidate solution by modifying the current solution |
PhD-Level Technical Analysis of ALE-Bench
Paper: arXiv:2506.09050 | Authors: Imajuku, Horie, Iwata, Aoki, Takahashi, Akiba (Sakana AI / AtCoder)
Report generated March 2026