ALE-Bench
A Benchmark for Automated Optimization with LLM-Based Evolutionary Approaches
Organization: Sakana AI / AtCoder
Paper: arXiv:2506.09050
Type: Benchmark + Agent System
Report Type: PhD-Level Technical Analysis
Report Date: March 2026
Table of Contents
- Full Title and Attribution
- Authors and Institutional Context
- Motivation and Research Gap
- Benchmark Design and Problem Taxonomy
- Infrastructure and Evaluation Protocol
- ALE-Agent Architecture
- Domain Knowledge Injection and Prompt Engineering
- Inference-Time Scaling and Solution Generation
- Simulated Annealing Specialization
- Experimental Results and Rankings
- Novel Algorithmic Discoveries
- Ablation Studies and Component Analysis
- Failure Mode Analysis
- Comparison with Related Work
- Computational Requirements and Cost Analysis
- Theoretical Foundations and Complexity
- Implications for Competitive Programming
- Limitations and Open Problems
- Future Directions
- References and Resources
1 Full Title and Attribution
Full Title: ALE-Bench: A Benchmark for Automated Optimization with LLM-Based Evolutionary Approaches
ArXiv ID: arXiv:2506.09050
Dataset: HuggingFace: SakanaAI/ALE-Bench
Repository: github.com/SakanaAI/ALE-Bench
Category: Combinatorial Optimization Benchmark for LLM Agents
Lineage: Builds upon AtCoder Heuristic Contest (AHC) ecosystem; related to FunSearch, AlphaCode, and EvoTorch lineages
ALE-Bench represents a critical inflection point in AI benchmarking: rather than measuring whether LLMs can solve optimization problems with known optimal solutions, it measures whether AI agents can compete in open-ended heuristic optimization where even human experts cannot find provably optimal answers. This shifts evaluation from correctness to quality -- precisely the regime where practical combinatorial optimization operates.
The benchmark name encodes a triple meaning: Automated LLM-based Evolutionary optimization, referencing the three pillars of the contribution -- automation of the optimization loop, leveraging large language models as the mutation engine, and evolutionary search strategies that iteratively improve candidate solutions. The "-Bench" suffix positions it within the growing ecosystem of AI benchmarks (SWE-bench, HumanEval, MATH, etc.) while distinguishing its focus on heuristic optimization rather than exact computation.
2 Authors and Institutional Context
Author List
| Author | Affiliation | Role / Expertise |
|---|---|---|
| Yuki Imajuku | Sakana AI | Lead researcher; ALE-Agent development |
| Kohki Horie | Sakana AI / University of Tokyo | Dual affiliation; benchmark infrastructure design |
| Yoichi Iwata | AtCoder | Competitive programming systems; problem curation |
| Kensho Aoki | AtCoder | Contest infrastructure; ranking algorithms |
| Naohiro Takahashi | AtCoder | Problem design; evaluation systems |
| Takuya Akiba | Sakana AI | Senior researcher; Optuna creator; project leadership |
Institutional Significance
The collaboration between Sakana AI and AtCoder is strategically significant. Sakana AI, founded by former Google Brain researchers in Tokyo, focuses on nature-inspired AI and evolutionary computation. AtCoder operates one of the world's premier competitive programming platforms, running the AtCoder Heuristic Contest (AHC) series that forms the backbone of ALE-Bench. Takuya Akiba, co-founder of Sakana AI, previously created Optuna, one of the most widely used hyperparameter optimization frameworks, bringing deep expertise in automated optimization to this work.
The dual affiliation of Kohki Horie (Sakana AI and University of Tokyo) bridges the industrial research focus of Sakana AI with the academic rigor expected of benchmark papers. Yoichi Iwata from AtCoder brings authoritative knowledge of competitive programming problem design, ensuring that the benchmark problems authentically represent the difficulty landscape of heuristic contests.
[!info]- AtCoder Heuristic Contest (AHC) Background
AtCoder Heuristic Contests (AHC) are a distinct contest format from the more traditional algorithmic contests (ABC, ARC, AGC). While algorithmic contests require finding exact solutions to well-defined problems, heuristic contests present open-ended optimization problems where participants must design and implement heuristic algorithms that produce the best possible solutions within time and resource constraints.
Key characteristics of AHC:
- Scoring: Solutions are scored on quality (how close to optimal), not just correctness
- Duration: Typically 4-hour live contests or multi-week long contests
- Participation: 500-1500+ participants per contest, with a dedicated community of optimization specialists
- Problem types: Routing, scheduling, placement, resource allocation, simulation-based optimization
- Ranking system: Relative ranking where scores are normalized against other participants
AHC has run 50+ contests since its inception, establishing a rich corpus of well-calibrated optimization problems with known human performance distributions.
3 Motivation and Research Gap
The Optimization Benchmark Vacuum
Prior to ALE-Bench, the evaluation of LLM-based optimization agents suffered from a critical methodological gap. Existing benchmarks fell into two inadequate categories:
Exact-Solution Benchmarks
Benchmarks like HumanEval, MBPP, and APPS measure whether code is functionally correct. They have binary pass/fail evaluation and test algorithmic problems with known optimal solutions. These benchmarks are poorly suited for evaluating optimization because they cannot distinguish between a mediocre feasible solution and an exceptional one.
Toy Optimization Benchmarks
Classical optimization benchmarks (e.g., TSP instances from TSPLIB, vehicle routing from CVRPLIB) evaluate specific problem classes in isolation. They lack the diversity needed to assess general-purpose optimization agents and do not capture the iterative refinement process that characterizes real-world heuristic development.
The Case for Heuristic Contest Problems
ALE-Bench fills this gap by drawing from competitive programming heuristic contests, which offer several unique advantages as a benchmark source:
- Calibrated difficulty: Each problem has been solved by hundreds to thousands of human experts, providing a rich performance distribution for comparison
- Diversity: Problems span logistics, factory planning, power-grid optimization, geometric placement, scheduling, and more
- No data contamination: Solutions require novel algorithm design, not pattern matching against training data
- Quantitative scoring: Continuous score functions allow fine-grained quality assessment rather than binary pass/fail
- Realistic complexity: NP-hard problems with search spaces of approximately 10^200 possible combinations mirror real industrial optimization
Why ~10^200? Consider the Food Delivery problem (AHC006): select 50 orders from 1,000 available, then determine the delivery route. The selection alone gives C(1000, 50) ~ 10^85 combinations. Each selection then defines a route optimization problem over 50 waypoints, adding another combinatorial explosion. The joint optimization over selection and routing pushes the effective search space to astronomical proportions, making brute-force search computationally infeasible and requiring sophisticated heuristic design.
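As a sanity check on the C(1000, 50) estimate above, the binomial coefficient can be evaluated in log space via the log-gamma function, which avoids overflow entirely. This is a small illustrative helper, not part of the benchmark itself:

```cpp
#include <cmath>

// log10 of the binomial coefficient C(n, k), computed via lgamma
// so that astronomically large counts never overflow a double.
double log10_binomial(int n, int k) {
    return (std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0))
           / std::log(10.0);
}
```

Evaluating `log10_binomial(1000, 50)` gives roughly 85, confirming the ~10^85 figure for the selection sub-problem alone.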
4 Benchmark Design and Problem Taxonomy
Problem Collection
ALE-Bench comprises 40 diverse optimization problems curated from past AtCoder Heuristic Contests. These problems were selected to maximize diversity across problem types, difficulty levels, and algorithmic paradigms required for competitive solutions.
40 Optimization Problems
~10^200 Typical Search Space
1000+ Human Baselines per Problem
6 Problem Categories
Problem Category Taxonomy
| Category | Example Problems | Typical Approaches | Count |
|---|---|---|---|
| Logistics / Routing | Food Delivery (AHC006), multi-vehicle routing | SA, greedy + local search, 2-opt/3-opt | 8 |
| Factory / Scheduling | Job-shop scheduling, production line optimization | SA, constraint propagation, GRASP | 7 |
| Grid / Placement | Facility layout, tile placement, power-grid optimization | SA, BFS/DFS enumeration, beam search | 9 |
| Graph / Network | Network design, flow optimization, connectivity | SA, minimum spanning tree + perturbation | 6 |
| Simulation-Based | Game simulation, interactive optimization | Monte Carlo, beam search, MCTS | 5 |
| Geometric / Spatial | Rectangle packing, spatial partitioning | SA, sweep line, corner-based placement | 5 |
Problem Difficulty Distribution
The problems span a wide range of difficulty levels, calibrated by the human performance distribution in the original AHC contests. Difficulty is characterized not by problem statement complexity (which is typically moderate) but by the gap between naive and competitive solutions:
[!info]- Detailed Problem Difficulty Analysis
Each problem can be characterized along several difficulty axes:
- Greedy Gap: The percentile rank achievable by a well-implemented greedy algorithm. Problems with small greedy gaps (where greedy solutions already rank well) are easier; problems where greedy ranks below 50th percentile are harder.
- SA Ceiling: The approximate best rank achievable using standard simulated annealing. Problems where SA alone can reach top 10% are "SA-friendly"; others require more sophisticated approaches.
- Algorithmic Breadth: The number of distinct algorithmic paradigms that appear in top-100 solutions. Higher breadth indicates problems where diverse approaches are viable.
- Implementation Complexity: The typical lines of code in competitive solutions, ranging from 100-200 (simple) to 500+ (complex).
The benchmark deliberately includes problems across all four difficulty dimensions to stress-test different capabilities of AI agents.
Exemplar Problem: Food Delivery (AHC006)
Problem Statement (Simplified): You operate a food delivery service with one delivery vehicle. There are 1,000 customer orders, each with a pickup location, delivery location, time window, and reward value. You must select at most 50 orders to fulfill and determine the delivery route that maximizes total reward minus travel costs. The vehicle starts and ends at a central depot.
This problem is representative because it combines multiple optimization sub-problems:
- Subset selection: Choose which orders to accept (combinatorial)
- Sequencing: Determine pickup/delivery order (routing)
- Time feasibility: Respect time windows (constraint satisfaction)
- Cost-benefit tradeoff: Balance high-reward distant orders against efficient clustering
No polynomial-time algorithm is known for this problem class (it generalizes the NP-hard Traveling Salesman Problem with Time Windows). Competitive human solutions typically employ a combination of greedy initial solution construction followed by simulated annealing with carefully designed neighborhood operators.
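The greedy construction step mentioned above can be sketched under simplifying assumptions: each order is reduced to a scalar reward and an estimated travel cost, and orders are ranked by reward-to-cost ratio. The `Order` struct and `greedy_select` helper are illustrative names, not from the paper; real solutions must also handle routing and time windows:

```cpp
#include <vector>
#include <algorithm>

// Illustrative order summary: reward and an estimated marginal travel cost.
struct Order { double reward; double travel_cost; };

// Greedy seed for the subset-selection sub-problem: rank orders by
// reward-to-cost ratio and keep the best `limit`, leaving route
// construction and time-window feasibility to a later refinement stage.
std::vector<int> greedy_select(const std::vector<Order>& orders, int limit) {
    std::vector<int> idx(orders.size());
    for (int i = 0; i < (int)idx.size(); i++) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return orders[a].reward / orders[a].travel_cost >
               orders[b].reward / orders[b].travel_cost;
    });
    if ((int)idx.size() > limit) idx.resize(limit);
    return idx;
}
```

A seed like this establishes the score floor that SA then improves on.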
5 Infrastructure and Evaluation Protocol
Benchmark Components
+------------------------------------------------------------------+
| ALE-Bench Infrastructure |
+------------------------------------------------------------------+
| |
| +-------------------+ +---------------------+ |
| | Problem Statements | | Test Case Generator | |
| | (Markdown + HTML) |---->| (Deterministic Seed) | |
| +-------------------+ +---------------------+ |
| | | |
| v v |
| +-------------------+ +---------------------+ |
| | Visualization | | Solution Scorer | |
| | Tools (per-problem)| | (Official AtCoder) | |
| +-------------------+ +---------------------+ |
| | | |
| v v |
| +-------------------+ +---------------------+ |
| | Code Execution | | Ranking Calculator | |
| | Environment (Docker)| | (Human Baselines) | |
| +-------------------+ +---------------------+ |
| | |
| v |
| +---------------------+ |
| | Percentile Report | |
| | (vs. 1000+ Humans) | |
| +---------------------+ |
+------------------------------------------------------------------+
Figure 1: ALE-Bench infrastructure pipeline from problem statement to percentile-ranked evaluation.
Problem Statement Format
Each problem is provided as a structured document containing:
- Problem description: Natural language specification of the optimization objective, constraints, and scoring function
- Input/output format: Precise specification of data formats for reading test cases and producing solutions
- Scoring formula: Mathematical definition of the objective function, including normalization and penalty terms
- Sample test cases: Small instances for development and debugging
- Constraint bounds: Parameter ranges (grid sizes, item counts, time limits)
Visualization Tools
A distinctive feature of ALE-Bench is its inclusion of problem-specific visualization tools for each of the 40 problems. These tools, originally developed for AHC contestants, render solution states graphically. While primarily designed for human use, they serve two critical functions in the benchmark:
- Debugging aid: AI agents can potentially use visualization outputs to identify solution defects (though current agents primarily use textual feedback)
- Analysis tool: Researchers can visually inspect AI-generated solutions to understand qualitative differences from human solutions
Ranking Methodology
The evaluation protocol uses the official AtCoder ranking software to compute relative standings. This is critical for fair comparison because AHC uses a relative scoring system:
$$ \mathrm{Percentile}(\text{agent}) = \frac{\#\{\text{human participants with score} \le \text{agent score}\}}{\text{total human participants}} \times 100 $$
The ranking system accounts for:
- Multiple test cases: Each problem is evaluated on 50-200 hidden test instances; scores are aggregated
- Score normalization: Different problems use different scoring scales; ranking is always relative
- Time-of-contest baseline: Human performance reflects actual contest conditions (4-hour time limit for live contests)
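The relative-scoring formula above reduces to a simple count over the human score distribution. A minimal sketch (hypothetical helper, not the official AtCoder ranking software, which additionally handles per-test-case aggregation and ties):

```cpp
#include <vector>
#include <algorithm>

// Percentile of the agent against a set of human scores: the percentage
// of human participants whose score the agent matches or beats.
double percentile(const std::vector<long long>& human_scores, long long agent_score) {
    long long matched = std::count_if(human_scores.begin(), human_scores.end(),
        [&](long long s) { return s <= agent_score; });
    return 100.0 * matched / human_scores.size();
}
```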
[!info]- Code Execution Environment Details
Solutions are executed in a sandboxed Docker environment that mirrors AtCoder's judge environment:
- Language: C++, Python, Rust, or any language supported by AtCoder's judge
- Time limit: Typically 2-5 seconds per test case (problem-specific)
- Memory limit: 256 MB - 1024 MB (problem-specific)
- CPU: Single-threaded execution (reflecting contest conditions)
- Compilation: Standard compiler flags matching AtCoder judge
This environment ensures that solutions are evaluated under identical conditions to the original human contest, preventing unfair advantages from hardware differences.
6 ALE-Agent Architecture
System Overview
ALE-Agent is the reference AI agent developed alongside ALE-Bench, built on Gemini 2.5 Pro as the foundation model. It implements an iterative code-generation-and-refinement loop that produces optimization programs for each benchmark problem.
+========================================================================+
| ALE-Agent Pipeline |
+========================================================================+
| |
| [Problem Statement] -----> [Prompt Constructor] |
| | | |
| | +-----v--------+ |
| | | Domain Knowledge | |
| | | Injection Layer | |
| | +-----+--------+ |
| | | |
| v v |
| +-------------+ +----------------+ |
| | Visualization| <------- | Gemini 2.5 Pro | |
| | Feedback | | Code Generation| |
| +-------------+ +-------+--------+ |
| | |
| v |
| +----------------+ |
| | Code Execution | |
| | + Scoring | |
| +-------+--------+ |
| | |
| +--------v---------+ |
| | Score Improvement?| |
| +--------+---------+ |
| / \ |
| YES / \ NO |
| v v |
| [Accept & [Revert & |
| Continue] Try New |
| | Approach] |
| | | |
| +------+-------+ |
| | |
| [~100 Revisions |
| over 4 Hours] |
+========================================================================+
Figure 2: ALE-Agent iterative refinement pipeline with accept/reject gating.
Core Loop Mechanics
The agent operates in a tight loop that mirrors how human competitive programmers approach heuristic contests:
- Problem Analysis: Parse the problem statement, identify the optimization objective, and classify the problem type
- Initial Solution: Generate a baseline greedy or random solution to establish a score floor
- Iterative Refinement: Repeatedly propose code modifications, execute, evaluate, and accept/reject based on score improvement
- Strategy Switching: When local improvements stall, the agent may propose fundamentally different algorithmic approaches
Within a 4-hour contest window, the agent typically performs approximately 100 revision cycles, though this can range from 50 to 200+ depending on problem complexity and compilation times. Each revision involves generating modified source code, compiling, running against sample test cases, and evaluating the resulting score.
Inference-Time Scaling
A distinctive architectural choice is the emphasis on inference-time scaling -- generating hundreds to thousands of candidate solutions rather than relying on a single "best" generation. This approach draws from the broader insight that LLM performance can be substantially improved by sampling multiple completions and selecting the best:
Key Insight: Rather than asking the LLM for one optimal solution, ALE-Agent generates many candidate modifications in parallel, evaluates all of them, and selects the best performer. This transforms the LLM from a point estimator into a search distribution, where the probability of finding a good solution scales favorably with the number of samples.
[!info]- Inference-Time Scaling: Mathematical Justification
Let p be the probability that a single LLM generation produces a solution improvement. Under inference-time scaling with n independent samples, the probability of at least one improvement is:
$$ P(\text{at least one improvement}) = 1 - (1 - p)^n $$
Even for small p (e.g., p = 0.01, meaning only 1% of generations are beneficial), scaling to n = 100 samples yields P = 1 - 0.99^100 = 0.634, and n = 500 yields P = 0.993. This means the agent can achieve near-certain improvement per iteration even when individual generation quality is modest.
The key cost tradeoff is computational: each sample requires an LLM inference call plus code compilation and execution. ALE-Agent manages this by parallelizing sample generation and using fast evaluation on a subset of test cases for initial filtering, followed by full evaluation only on promising candidates.
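The best-of-n probability in the callout above is a one-liner, included here so the worked numbers can be checked directly:

```cpp
#include <cmath>

// Probability that at least one of n independent samples improves the score,
// given a per-sample improvement probability p: 1 - (1 - p)^n.
double p_any_improvement(double p, int n) {
    return 1.0 - std::pow(1.0 - p, n);
}
```

With p = 0.01, `p_any_improvement(0.01, 100)` evaluates to about 0.634, matching the figure quoted above.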
7 Domain Knowledge Injection and Prompt Engineering
Algorithm Library Prompts
A critical component of ALE-Agent's effectiveness is the injection of domain knowledge about frequently used optimization algorithms directly into the system prompts. This knowledge base includes:
| Algorithm Family | Knowledge Provided | Usage Context |
|---|---|---|
| Simulated Annealing | Temperature schedules, acceptance criteria, neighborhood design patterns, cooling strategies | Primary approach for most problems |
| Greedy Construction | Common heuristic templates, nearest-neighbor, largest-first, deadline-first orderings | Initial solution generation |
| Local Search | 2-opt, 3-opt, or-opt, swap, insert neighborhood operators | TSP/routing variants |
| Beam Search | Width selection, state evaluation functions, pruning strategies | Sequential decision problems |
| Monte Carlo Methods | Random sampling, importance sampling, MCTS tree policies | Simulation-based problems |
| Constraint Handling | Penalty functions, repair operators, feasibility maintenance | Problems with hard constraints |
Prompt Structure
The domain knowledge injection follows a hierarchical prompt structure:
PROMPT STRUCTURE (Pseudocode)
// System-level context
SYSTEM_PROMPT = {
role: "Expert competitive programmer specializing in heuristic optimization",
knowledge_base: [
"simulated_annealing_patterns.md",
"local_search_operators.md",
"scoring_optimization_techniques.md",
"common_pitfalls_and_fixes.md"
],
output_format: "Complete, compilable C++ solution"
}
// Per-problem context
PROBLEM_PROMPT = {
statement: problem_text,
constraints: constraint_summary,
scoring_function: score_formula,
sample_input: sample_data,
current_best_code: current_solution, // after first iteration
current_best_score: best_score, // after first iteration
execution_feedback: compiler_output, // errors, warnings
improvement_history: score_trajectory // trend of scores over iterations
}
// Refinement instruction
REFINEMENT_PROMPT = {
instruction: "Improve the current solution. Focus on: {suggested_area}",
constraint: "Must compile and run within {time_limit}ms per test case",
strategy_hint: "Consider: {algorithm_suggestion}"
}
Dynamic Strategy Selection
The prompt engineering includes a meta-level strategy selector that chooses refinement directions based on the current score trajectory:
- Plateau detection: When scores stop improving, the prompt shifts to suggesting fundamentally different algorithmic approaches
- Regression handling: When a modification causes score regression, the prompt includes the regression analysis and suggests targeted fixes
- Time-aware scaling: As the 4-hour contest window nears completion, prompts shift from exploration to exploitation, focusing on fine-tuning the best solution found
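The plateau-detection trigger described above can be sketched as a check over the best-score history. This is a hypothetical reconstruction (the paper does not publish the selector's internals); `is_plateau` and the `window` parameter are illustrative:

```cpp
#include <vector>

// Report a plateau when the running best score has not improved over the
// last `window` revisions -- the signal that would prompt the agent to
// propose a fundamentally different algorithmic approach.
bool is_plateau(const std::vector<double>& best_score_history, int window) {
    int n = (int)best_score_history.size();
    if (n <= window) return false;  // not enough history yet
    return best_score_history[n - 1] <= best_score_history[n - 1 - window];
}
```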
[!info]- Example: Domain Knowledge for Simulated Annealing
The SA knowledge base injected into prompts includes the following patterns:
```
// Standard SA template provided in the knowledge base
State simulated_annealing(State initial, double T_start, double T_end, int iterations) {
    State current = initial;
    Score best_score = evaluate(current);
    State best_state = current;
    for (int i = 0; i < iterations; i++) {
        // Geometric cooling from T_start down to T_end
        double T = T_start * pow(T_end / T_start, (double)i / iterations);
        State neighbor = generate_neighbor(current);
        Score delta = evaluate(neighbor) - evaluate(current);
        // Accept improvements always; accept worsening moves with prob exp(delta / T)
        if (delta > 0 || exp(delta / T) > random_double()) {
            current = neighbor;
            if (evaluate(current) > best_score) {
                best_score = evaluate(current);
                best_state = current;
            }
        }
    }
    return best_state;
}
```
The knowledge base also includes guidance on:
- Choosing T_start and T_end based on typical score deltas
- Adaptive temperature scheduling (e.g., reheating)
- Efficient delta evaluation (incremental score updates)
- Multi-objective SA with Pareto acceptance
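The "efficient delta evaluation" guidance in the list above is worth making concrete. For a routing problem, a 2-opt move changes exactly two edges of the tour, so the score delta can be computed in O(1) instead of re-summing the whole tour. A self-contained sketch (the `Pt` struct and function names are illustrative):

```cpp
#include <vector>
#include <cmath>

struct Pt { double x, y; };

double dist(const Pt& a, const Pt& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Incremental score update for a 2-opt move: reversing the segment between
// positions i+1 and j replaces edges (a,b) and (c,d) with (a,c) and (b,d),
// so only four distances are touched regardless of tour length.
double two_opt_delta(const std::vector<Pt>& tour, int i, int j) {
    int n = (int)tour.size();
    const Pt &a = tour[i], &b = tour[(i + 1) % n];
    const Pt &c = tour[j], &d = tour[(j + 1) % n];
    return dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
}
```

This is the kind of incremental update behind the 100-1000x speedups reported for SA inner loops in Section 9.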
8 Inference-Time Scaling and Solution Generation
Scaling Strategy
ALE-Agent's inference-time scaling operates at two levels:
Level 1: Per-Revision Candidate Generation
At each revision step, the agent generates multiple candidate code modifications (typically 3-10 variants) and evaluates all of them. This addresses the stochasticity of LLM generation -- a single prompt may produce code that fails to compile, introduces bugs, or makes suboptimal algorithmic choices. By generating multiple candidates and selecting the best, the agent substantially reduces the variance of each revision step.
Level 2: Cross-Revision Trajectory Exploration
Over the full 4-hour contest, the ~100 revision cycles create a trajectory through solution space. The agent explores multiple trajectory branches by occasionally reverting to earlier high-scoring solutions and trying different improvement directions. This provides implicit diversity in the search process.
Generation Budget
| Parameter | Typical Value | Range |
|---|---|---|
| Revisions per contest | ~100 | 50 - 200+ |
| Candidates per revision | 3 - 10 | 1 - 20 |
| Total solutions evaluated | 300 - 1,000 | 50 - 2,000+ |
| Time per revision cycle | ~2.4 min | 0.5 - 10 min |
| LLM tokens per revision | ~4K - 10K | 2K - 30K |
Candidate Selection Mechanism
Among the multiple candidates generated per revision, selection follows a strict empirical evaluation protocol:
- Compilation check: Reject candidates that fail to compile
- Runtime check: Reject candidates that exceed time/memory limits or crash
- Score evaluation: Run surviving candidates on sample test cases
- Best-of-K selection: Accept the highest-scoring candidate if it improves on the current best
- Revert on failure: If no candidate improves, revert to the previous best and request a different approach
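The five-step protocol above amounts to a filtered best-of-K gate. A minimal sketch, assuming each candidate has already been compiled and run (the `Candidate` struct and `select_candidate` helper are illustrative names, not from the paper):

```cpp
#include <vector>
#include <optional>

// Outcome of compiling and running one generated candidate.
struct Candidate { bool compiles; bool runs_ok; double score; };

// Best-of-K selection: discard candidates that fail to compile or run,
// then accept the top scorer only if it beats the current best.
// An empty result signals "revert and request a different approach".
std::optional<int> select_candidate(const std::vector<Candidate>& cands,
                                    double current_best) {
    int best_idx = -1;
    double best = current_best;
    for (int i = 0; i < (int)cands.size(); i++) {
        if (!cands[i].compiles || !cands[i].runs_ok) continue;
        if (cands[i].score > best) { best = cands[i].score; best_idx = i; }
    }
    if (best_idx < 0) return std::nullopt;
    return best_idx;
}
```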
[!info]- Inference-Time Scaling: Empirical Analysis
The effectiveness of inference-time scaling varies significantly by problem type. Analysis across the 40 benchmark problems reveals:
- High-benefit problems (routing, scheduling): Generating 10x more candidates improves final score by 15-30%, as these problems have many viable neighborhood operators that the LLM may or may not discover in a single sample.
- Moderate-benefit problems (grid placement, geometric): 5-10x more candidates yields 5-15% improvement, as the solution space is more constrained.
- Low-benefit problems (simulation-based): Additional candidates show diminishing returns beyond 3-5x, as these problems require deep algorithmic insight rather than variant exploration.
This suggests that inference-time scaling is most effective when the improvement landscape has many local optima accessible through diverse code modifications -- precisely the regime where simulated annealing excels.
9 Simulated Annealing Specialization
Why Simulated Annealing Dominates
A striking empirical finding is that ALE-Agent overwhelmingly specializes in simulated annealing (SA) as its primary optimization strategy. This specialization is not hard-coded but emerges naturally from the interaction between the LLM's training data (which includes extensive competitive programming resources where SA is the dominant paradigm for heuristic contests) and the iterative refinement process (which naturally selects for approaches that yield monotonic improvement).
Observation: Across the 40 benchmark problems, ALE-Agent converges to an SA-based solution in approximately 85% of cases, regardless of whether SA is the optimal paradigm for the problem. This represents both a strength (deep expertise in one approach) and a limitation (inability to discover when alternative approaches would be superior).
SA Implementation Quality
The quality of SA implementations generated by ALE-Agent is notably high, incorporating several sophisticated techniques that distinguish expert-level competitive programming solutions:
| Technique | Description | Impact |
|---|---|---|
| Incremental Score Update | Computing score delta from local change rather than full re-evaluation | 100-1000x speedup per iteration |
| Adaptive Temperature | Temperature schedule adapted to observed score landscape | 10-30% score improvement |
| Multi-Neighborhood | Multiple neighborhood operators with adaptive selection | 20-50% improvement over single-operator |
| Time-Calibrated Scheduling | Using clock time rather than iteration count for temperature decay | Robust to implementation speed variations |
| Restart Strategies | Periodic restarts from best-known solution with higher temperature | 5-15% improvement on multi-modal landscapes |
SA Temperature Schedule Patterns
The agent discovers and applies several temperature scheduling patterns:
TEMPERATURE SCHEDULES (C++)
// Geometric cooling (most common)
double T = T_start * pow(T_end / T_start, elapsed / time_limit);
// Linear cooling
double T = T_start + (T_end - T_start) * elapsed / time_limit;
// Logarithmic cooling (theoretically optimal but slow)
double T = T_start / (1.0 + log(1.0 + elapsed / time_limit));
// Piecewise schedule (discovered by agent for specific problems)
double T;
if (elapsed < time_limit * 0.3) {
T = T_high; // Explore broadly in early phase
} else if (elapsed < time_limit * 0.8) {
T = T_mid * pow(T_low / T_mid, (elapsed / time_limit - 0.3) / 0.5);
} else {
T = T_low; // Pure hill-climbing in final phase
}
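Packaged as a self-contained function, the geometric schedule (the most common pattern above) interpolates from T_start at t = 0 to T_end at t = time_limit on a log scale:

```cpp
#include <cmath>

// Geometric (exponential) cooling: T_start at elapsed = 0,
// T_end at elapsed = time_limit, log-linear in between.
double geometric_temp(double T_start, double T_end,
                      double elapsed, double time_limit) {
    return T_start * std::pow(T_end / T_start, elapsed / time_limit);
}
```

For T_start = 100 and T_end = 1, the midpoint temperature is 10, i.e., halfway in log space rather than linear space, which is why geometric cooling spends proportionally more iterations at low temperatures.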
10 Experimental Results and Rankings
Headline Results
Top 2% AHC047 Live Contest
Top 6.8% Broad Benchmark Average
21st AHC047 Rank (of 1000+)
154th AHC046 Rank (top 16%)
Detailed Contest Results
| Contest | Participants | ALE-Agent Rank | Percentile | Notes |
|---|---|---|---|---|
| AHC046 | 1000+ | 154th | Top 16% | Retrospective evaluation on past contest |
| AHC047 | 1000+ | 21st | Top 2% | Live contest participation; best result |
Comparison: ALE-Agent vs. Standard LLM Baselines
To contextualize ALE-Agent's performance, the paper compares against standard LLM baselines that attempt the benchmark problems without the evolutionary refinement loop:
| System | Average Percentile | Best Percentile | Worst Percentile |
|---|---|---|---|
| ALE-Agent (Gemini 2.5 Pro) | Top 6.8% | Top 2% | ~Top 30% |
| Standard LLM (single-shot) | ~Top 50% | ~Top 25% | Below 50% |
| Standard LLM (best-of-N) | ~Top 35% | ~Top 15% | ~Top 50% |
Key Takeaway: The iterative refinement loop with domain knowledge injection provides a 7-8x improvement in percentile ranking over standard single-shot LLM generation (top 6.8% vs. top 50%). This demonstrates that the agent architecture, not just the underlying LLM capability, is the primary driver of performance.
Per-Problem Category Analysis
| Category | Average Percentile | SA Effective? | Analysis |
|---|---|---|---|
| Logistics / Routing | Top 5% | Yes | Strong SA + neighborhood operators match human approaches |
| Factory / Scheduling | Top 7% | Yes | SA with constraint-aware operators effective |
| Grid / Placement | Top 8% | Yes | SA with swap/move operators works well |
| Graph / Network | Top 10% | Partial | Some problems benefit from non-SA approaches |
| Simulation-Based | Top 15% | No | Requires MCTS/beam search; SA less effective |
| Geometric / Spatial | Top 9% | Yes | SA with geometric neighborhoods works well |
[!info]- Score Trajectory Analysis: AHC047 (Best Result)
The AHC047 contest provides the clearest picture of ALE-Agent's optimization trajectory over the 4-hour window:
| Time (minutes) | Revision # | Approximate Rank | Key Changes |
|---|---|---|---|
| 0 - 15 | 1 - 5 | ~600th | Greedy initial solution, basic structure |
| 15 - 45 | 5 - 20 | ~300th | SA implementation, basic neighborhoods |
| 45 - 90 | 20 - 40 | ~150th | Score acceleration (Poisson approximation) |
| 90 - 150 | 40 - 65 | ~82nd | Neighborhood tuning, temperature optimization |
| 150 - 210 | 65 - 85 | ~40th | Growth neighborhood search strategy |
| 210 - 240 | 85 - 100 | 21st | Final tuning, parameter optimization |
The trajectory shows characteristic phases: rapid initial improvement (greedy to SA), followed by algorithmic innovations (Poisson approximation, growth neighborhoods), and finally parameter tuning. The most dramatic rank improvement (82nd to 21st) came from the discovery of the growth neighborhood search strategy, discussed in Section 11.
11 Novel Algorithmic Discoveries
Perhaps the most scientifically significant aspect of ALE-Bench results is that the ALE-Agent independently discovered novel algorithmic techniques that improved performance beyond what standard approaches achieve. These discoveries demonstrate that LLM-based optimization agents can do more than reproduce known techniques -- they can innovate.
Discovery 1: Poisson Distribution Approximation for Score Acceleration
Context: Many AHC problems have scoring functions that involve computing expected values or probabilistic outcomes over random test instances. Exact computation of these quantities can be expensive, limiting the number of SA iterations achievable within the time budget.
ALE-Agent discovered that certain score components could be approximated using Poisson distributions, enabling dramatically faster score evaluation without significant accuracy loss. Specifically:
- The agent identified that a binomial distribution component in the scoring function (modeling success/failure events) could be replaced with its Poisson approximation when the number of trials was large and success probability was small
- This approximation reduced score evaluation from O(n) summation to O(1) closed-form computation
- The resulting speedup allowed approximately 10x more SA iterations within the same time budget
POISSON APPROXIMATION (Conceptual)

```cpp
// Original: exact binomial probability computation
double exact_score(int n, double p) {
    double score = 0.0;
    for (int k = 0; k <= n; k++) {
        score += binomial_pmf(n, k, p) * reward(k);  // O(n) per evaluation
    }
    return score;
}

// ALE-Agent discovery: Poisson approximation
double approx_score(int n, double p) {
    double lambda = n * p;  // Poisson parameter
    // Closed-form expected reward under Poisson(lambda)
    return expected_reward_poisson(lambda);  // O(1) per evaluation
}
```
This discovery is notable because it requires understanding the mathematical relationship between binomial and Poisson distributions -- knowledge that the LLM had absorbed from its training data -- and applying it in a novel context that was not explicitly present in any training example.
Discovery 2: Growth Neighborhood Search Strategy
Impact: This single innovation improved ALE-Agent's rank on AHC047 from 82nd to 21st place -- a jump from the top 8% to the top 2% of participants.
The growth neighborhood search strategy is a novel SA neighborhood operator that the agent developed for problems involving incremental construction or expansion decisions. Instead of the standard approach of making small local modifications to a complete solution, the growth strategy:
- Starts from a partial solution (subset of the full solution)
- Grows the solution incrementally by adding elements in an order determined by a scoring heuristic
- Uses SA to decide whether to accept each growth step, allowing the algorithm to explore different growth orderings
- Periodically shrinks the solution by removing low-scoring elements, creating space for new growth
GROWTH NEIGHBORHOOD SEARCH (Pseudocode)

```
function growth_neighborhood_SA(initial_solution, T_start, T_end, time_limit):
    solution = shrink_to_core(initial_solution, core_fraction=0.6)
    best = copy(initial_solution)
    best_score = evaluate(initial_solution)
    while elapsed() < time_limit:
        T = temperature_schedule(elapsed(), T_start, T_end)
        // Growth phase: add elements greedily with SA perturbation
        while can_grow(solution):
            candidates = growth_candidates(solution)
            selected = sa_select(candidates, T)  // SA-guided selection
            solution = add_element(solution, selected)
        score = evaluate(solution)
        if score > best_score:
            best = copy(solution)  // copy, since solution is shrunk again below
            best_score = score
        // Shrink phase: remove low-value elements
        solution = shrink_to_core(solution, core_fraction=0.6)
    return best
```
This approach is conceptually related to ruin-and-recreate methods in vehicle routing literature, but the specific combination of SA-guided growth with periodic shrinkage was novel in the context of the AHC047 problem. The key insight is that many optimization problems have a natural "construction" aspect where the order of element insertion matters significantly for solution quality.
[!info]- Technical Analysis: Why Growth Neighborhoods Work
The effectiveness of growth neighborhoods can be understood through the lens of search-space topology:
- Standard neighborhoods (swap, insert, move) make small local changes to a complete solution. These operators explore the neighborhood of the current solution in a high-dimensional space, where the landscape is often rugged with many local optima.
- Growth neighborhoods effectively project the search onto a lower-dimensional "construction order" space. By re-growing the solution from a partial state, the operator can reach distant regions of the solution space in a single step -- regions that would require many small local moves to reach.
- The shrink-and-regrow cycle creates a form of large neighborhood search (LNS), where the "destruction" phase (shrinking) removes a significant portion of the solution, and the "reconstruction" phase (growth) rebuilds it using different decisions.
This is mathematically analogous to the observation in evolutionary computation that crossover operators can sometimes outperform mutation by recombining building blocks -- the growth phase effectively recombines retained "core" elements with newly chosen additions.
12 Ablation Studies and Component Analysis
Component Contribution Analysis
The paper reports systematic ablation studies that isolate the contribution of each architectural component:
| Configuration | Avg. Percentile | Delta vs. Full System |
|---|---|---|
| Full ALE-Agent | Top 6.8% | -- |
| Without domain knowledge injection | Top 15% | -8.2 percentage points |
| Without iterative refinement (single-shot) | Top 50% | -43.2 percentage points |
| Without inference-time scaling (K=1) | Top 12% | -5.2 percentage points |
| Without score feedback in prompts | Top 20% | -13.2 percentage points |
| With GPT-4o instead of Gemini 2.5 Pro | Top 10% | -3.2 percentage points |
| With Claude 3.5 Sonnet instead of Gemini 2.5 Pro | Top 11% | -4.2 percentage points |
Key Ablation Findings
Finding 1: Iterative refinement is the single most important component, accounting for approximately 43 percentage points of improvement. This confirms that the agent's ability to incrementally improve solutions over ~100 revisions is the primary driver of performance, not the quality of any single LLM generation.
Finding 2: Domain knowledge injection provides substantial but not dominant benefit (~8 percentage points). This suggests that the LLM already possesses significant optimization knowledge from pre-training, but explicit injection of SA templates and competitive programming patterns meaningfully accelerates convergence.
Finding 3: Model choice matters less than architecture. Replacing Gemini 2.5 Pro with GPT-4o or Claude 3.5 Sonnet reduces performance by only 3-4 percentage points, indicating that the agent architecture generalizes across foundation models. This is consistent with the hypothesis that iterative refinement compensates for differences in single-shot generation quality.
Revision Count Analysis
The paper analyzes the relationship between the number of revisions and final performance:
| Revision Budget | Avg. Percentile | Marginal Improvement |
|---|---|---|
| 1 (single-shot) | Top 50% | -- |
| 5 | Top 30% | +20 pp per 5 revisions |
| 10 | Top 22% | +8 pp per 5 revisions |
| 25 | Top 15% | +2.3 pp per 5 revisions |
| 50 | Top 10% | +1.0 pp per 5 revisions |
| 100 | Top 6.8% | +0.32 pp per 5 revisions |
| 200 | ~Top 5.5% | +0.07 pp per 5 revisions |
The diminishing returns pattern is clear: the first 10 revisions provide the largest improvement (from 50th to 22nd percentile), while revisions 100-200 contribute only marginal gains. This log-linear scaling relationship suggests that substantial further improvement would require either fundamentally better LLMs or architectural innovations in the agent loop.
[!info]- Ablation: Effect of Programming Language on Performance
An interesting secondary ablation examines the effect of the generated code's programming language:

| Language | Avg. Percentile | Notes |
|---|---|---|
| C++ | Top 6.8% | Default; fastest execution enables more SA iterations |
| Python | Top 18% | 10-100x slower execution severely limits SA iteration count |
| Rust | Top 7.5% | Comparable speed to C++, but the LLM generates less idiomatic code |

The C++ advantage is primarily due to execution speed: SA performance is directly proportional to the number of iterations achievable within the time limit, and C++ enables 10-100x more iterations than Python. This finding has important implications for LLM-based optimization -- the choice of output language is a critical design decision.
13 Failure Mode Analysis
Understanding where ALE-Agent fails is as informative as understanding where it succeeds. The paper identifies several systematic failure modes:
Failure Mode 1: Bug Detection and Repair
The agent struggles with subtle bugs in generated code that do not cause crashes but produce incorrect results. Examples include:
- Off-by-one errors in array indexing that affect only edge cases
- Integer overflow in score computation for large test instances
- Floating-point precision errors that accumulate over many SA iterations
- Race conditions in time-measurement code that cause premature termination
These bugs are particularly pernicious because they reduce score without producing visible errors, making them difficult for the agent to diagnose. The agent's primary debugging strategy -- comparing scores before and after modifications -- cannot distinguish between "this modification was algorithmically worse" and "this modification introduced a bug."
Failure Mode 2: Complexity Analysis
ALE-Agent sometimes generates solutions with poor time complexity that exceed time limits on large test cases but pass on small sample cases:
Example: The agent generated a solution with O(n^3) complexity for a problem requiring O(n log n). The solution passed all sample test cases (n = 100) within the time limit but timed out on hidden test cases (n = 10,000). The agent could not diagnose the issue because it only had access to sample cases during development, and the timeout produced no useful diagnostic information.
Failure Mode 3: Non-SA Algorithm Design
For problems where simulated annealing is suboptimal, the agent's strong SA bias prevents it from discovering better approaches. Specific problem types where this limitation is most acute:
- Sequential decision problems: Problems requiring beam search or dynamic programming, where the solution is constructed through a sequence of irrevocable decisions
- Adversarial/game problems: Problems requiring game-theoretic reasoning or minimax search
- Online optimization: Problems where decisions must be made without knowledge of future inputs
Failure Mode 4: Two-Week Contest Adaptation
AHC includes both 4-hour live contests and two-week long contests. ALE-Agent is designed for the 4-hour format and does not effectively leverage the extended time of long contests. Human participants in long contests typically:
- Develop multiple fundamentally different approaches and compare them
- Perform extensive parameter tuning using statistical methods
- Study the test case distribution to design problem-specific optimizations
- Collaborate with others (in team contests) to parallelize exploration
The agent's iterative refinement loop does not naturally scale to two-week timescales; extending from 100 revisions to 10,000+ revisions produces diminishing returns without architectural changes to support higher-level strategic reasoning.
Failure Mode 5: Trial-and-Error Algorithms
Some competitive solutions employ trial-and-error approaches where the algorithm intentionally tries risky strategies that may fail but occasionally produce exceptional results. The agent's accept-on-improvement criterion makes it risk-averse, systematically avoiding high-variance strategies even when they would be optimal in expectation.
[!info]- Failure Mode Statistics

| Failure Mode | Frequency | Avg. Rank Impact |
| --- | --- | --- |
| Bug-induced score loss | ~15% of problems | -50 to -200 ranks |
| TLE on hidden cases | ~8% of problems | Disqualification on affected cases |
| SA suboptimality | ~20% of problems | -30 to -100 ranks vs. optimal approach |
| Stagnation (no improvement after rev. 50) | ~25% of problems | Missed potential of -20 to -80 ranks |
14 Comparison with Related Work
Positioning in the AI Optimization Landscape
| System | Organization | Approach | Problem Type | Key Difference from ALE-Bench |
|---|---|---|---|---|
| AlphaCode / AlphaCode 2 | DeepMind | Code generation + filtering | Exact algorithmic problems | Binary correctness, not optimization quality |
| FunSearch | DeepMind | LLM + evolutionary search | Mathematical optimization | Optimizes a single function, not full programs |
| AlphaEvolve | DeepMind | LLM + evolution + evaluators | Algorithm design | Multi-file programs; broader scope; more compute |
| EvoTorch | NNAISENSE | Evolutionary optimization library | Neural architecture, hyperparams | Does not use LLMs; fixed operators |
| OpenEvolve | Open source | FunSearch-inspired LLM evolution | General optimization | Open-source; less contest-specific |
ALE-Bench vs. FunSearch
FunSearch (Romera-Paredes et al., Nature 2024, DeepMind) is the most closely related predecessor. Key differences:
FunSearch:

- Evolves a single scoring function within a fixed program skeleton
- Population-based evolutionary search with diversity maintenance
- Applied to cap-set problems and bin-packing
- Discovers novel mathematical constructions
- Requires a manual, problem-specific program skeleton
ALE-Agent:

- Generates and modifies complete programs end-to-end
- Single-trajectory refinement (not population-based)
- Applied to 40 diverse optimization problems
- Discovers novel algorithmic strategies (Poisson approximation, growth neighborhoods)
- Fully automated -- no manual skeleton required
ALE-Bench vs. AlphaEvolve
AlphaEvolve (DeepMind, 2025) shares the most conceptual overlap with ALE-Agent:
- Scope: AlphaEvolve handles multi-file programs and broader algorithm design; ALE-Agent focuses on single-file optimization programs for competitive contests
- LLM backbone: Both use Gemini-class models, though AlphaEvolve uses a dual-model architecture (Flash for breadth, Pro for depth)
- Evaluation: AlphaEvolve uses automated evaluators; ALE-Agent uses contest scoring systems
- Compute: AlphaEvolve runs for days/weeks on Google's infrastructure; ALE-Agent operates within a 4-hour contest window
[!info]- Relationship to SWE-bench and Coding Benchmarks
ALE-Bench is distinct from code-generation benchmarks like SWE-bench, HumanEval, and MBPP in several fundamental ways:
- Evaluation metric: SWE-bench uses binary pass/fail on test cases; ALE-Bench uses continuous quality scores with human-calibrated percentile rankings
- Problem nature: Coding benchmarks test correctness of well-defined specifications; ALE-Bench tests quality of solutions to open-ended optimization problems
- Solution space: Coding problems typically have a small set of correct approaches; optimization problems have a vast continuum of solution qualities
- Iterative nature: ALE-Bench explicitly evaluates iterative improvement capability; SWE-bench primarily evaluates single-shot generation
These differences make ALE-Bench a complement to, not a replacement for, existing coding benchmarks. Together they provide a more complete picture of LLM capabilities in software engineering.
15 Computational Requirements and Cost Analysis
Per-Problem Compute Budget
| Resource | Per Problem (4-hour contest) | Full Benchmark (40 problems) |
|---|---|---|
| LLM API calls | 100 - 1,000 | 4,000 - 40,000 |
| Total LLM tokens (input + output) | ~500K - 2M | ~20M - 80M |
| Code compilations | 100 - 500 | 4,000 - 20,000 |
| Solution evaluations | 300 - 2,000 | 12,000 - 80,000 |
| Wall-clock time | 4 hours | 160 hours (parallelizable) |
| Estimated API cost (Gemini 2.5 Pro) | $5 - $20 | $200 - $800 |
Cost-Performance Tradeoff
The cost structure reveals an interesting economic insight: the cost per percentile-point improvement follows a super-linear curve. The first 30 percentile points (from 50th to 20th) cost approximately $1-2 per problem, while the last 13 points (from 20th to 6.8th) cost approximately $15-18 per problem. This reflects the fundamental diminishing returns of iterative refinement.
[!info]- Infrastructure Requirements for Reproducibility
- LLM API access: Gemini 2.5 Pro (or equivalent frontier model) with sufficient rate limits for ~100 calls per 4-hour window
- Compute for solution evaluation: Docker environment with AtCoder-compatible compilation toolchain
- Storage: ~1 GB for problem statements, test cases, and visualization tools
- Networking: reliable connection to the LLM API endpoint throughout 4-hour contests
- Parallelism: problems can be solved independently; 40 parallel instances complete the full benchmark in ~4 hours
16 Theoretical Foundations and Complexity
Complexity-Theoretic Context
The optimization problems in ALE-Bench are overwhelmingly NP-hard, meaning that no polynomial-time algorithm is known (or believed to exist) for computing optimal solutions. This has profound implications for both human and AI approaches:
For a typical ALE-Bench problem with input size $n$, the solution space grows as

$$ |S| = O\!\left(2^{n \log n}\right) $$
With typical problem sizes (n = 100 to 10,000), exhaustive search is computationally infeasible. This is why heuristic approaches -- simulated annealing, genetic algorithms, local search -- are the only viable strategies, and why solution quality (rather than optimality) is the relevant metric.
Approximation Hardness
Many ALE-Bench problems are not only NP-hard to solve optimally but are also APX-hard -- meaning that achieving an approximation ratio better than some constant factor is NP-hard. This theoretical backdrop validates the benchmark's focus on relative ranking: when no algorithm can guarantee near-optimal solutions, the relevant question is how close different approaches can get in practice.
The LLM as a Search Operator
From a theoretical perspective, ALE-Agent's use of an LLM for code modification can be analyzed as a learned search operator in the space of programs. Traditional search operators (crossover, mutation in genetic programming) are hand-designed and domain-agnostic. The LLM, by contrast, is a learned operator that incorporates:
- Syntactic knowledge: Generating valid code modifications that compile
- Semantic knowledge: Making changes that are likely to improve algorithmic performance
- Domain knowledge: Applying patterns from competitive programming to guide search direction
This learned operator has a key advantage over traditional evolutionary operators: it can make semantically meaningful large jumps in program space (e.g., switching from a greedy algorithm to SA) that would require many small mutations to achieve through traditional genetic programming.
[!info]- Connection to No-Free-Lunch Theorems
The No-Free-Lunch (NFL) theorems for optimization state that no algorithm can outperform random search when averaged across all possible optimization problems. However, NFL does not apply to structured problem distributions -- and competitive programming problems are highly structured. The LLM's effectiveness as a search operator derives precisely from its ability to exploit this structure, having learned common patterns from millions of code examples.
The ALE-Bench results can be interpreted as empirical evidence for the degree to which competitive programming optimization problems form a structured distribution that LLMs can exploit. The gap between single-shot generation (top 50%) and iterative refinement (top 6.8%) quantifies the additional structure that can be extracted through systematic search beyond what a single forward pass captures.
17 Implications for Competitive Programming
Human-AI Performance Comparison
ALE-Agent's top-2% performance in AHC047 places it among the expert tier of competitive programmers. To contextualize this achievement:
| Rating Tier | Approximate Percentile | ALE-Agent Comparison |
|---|---|---|
| World-class (top 10) | Top 1% | Above ALE-Agent's best result |
| Expert (Red/Orange rated) | Top 2-5% | ALE-Agent's range on best problems |
| Advanced (Blue/Cyan rated) | Top 5-15% | ALE-Agent's typical range |
| Intermediate (Green rated) | Top 15-40% | Below ALE-Agent's typical range |
| Beginner | Top 40-100% | Well below ALE-Agent |
The Gap to World-Class
Despite impressive results, a significant gap remains between ALE-Agent and world-class competitive programmers. The top human performers consistently achieve solutions that are 10-30% better in score than ALE-Agent's best. This gap arises from:
- Problem-specific insight: Top competitors develop deep understanding of problem structure that guides algorithm design
- Algorithm diversity: Top competitors employ a wide range of algorithmic paradigms tailored to each problem, while ALE-Agent defaults to SA
- Implementation optimization: Top competitors write highly optimized C++ code with cache-friendly data structures and SIMD instructions, maximizing SA iterations
- Multi-day preparation: In long contests, top competitors iterate on fundamentally different approaches over days/weeks
Impact on Contest Integrity
Policy Consideration: ALE-Agent's live participation in AHC047 raises important questions about contest integrity. If AI agents can achieve top-2% performance, contests may need new policies regarding AI assistance, separate AI divisions, or modified scoring systems. The ALE-Bench paper explicitly raises this issue, noting that their live contest participation was conducted with AtCoder's knowledge and cooperation.
18 Limitations and Open Problems
Benchmark Limitations
- Problem scope: 40 problems, while diverse, represent only a subset of possible optimization problem types. Industrial optimization (supply chain, financial portfolio, chip design) may have different characteristics.
- Time format: The 4-hour contest format may not reflect realistic optimization workflows where practitioners iterate over days/weeks.
- Single-machine evaluation: All solutions run on a single CPU thread. Real-world optimization increasingly leverages parallelism, distributed computing, and GPU acceleration.
- Static human baselines: Human performance is measured at the time of the original contest. As human competitors improve (and potentially use AI tools), the baselines may become stale.
Agent Limitations
- SA monoculture: The agent's strong bias toward simulated annealing limits its performance on problems better suited to other paradigms
- No meta-learning: The agent does not learn from solving one problem to improve on the next; each problem starts from scratch
- Limited debugging: The agent cannot effectively diagnose subtle bugs that reduce score without causing errors
- No collaboration: Unlike human teams in long contests, the agent cannot benefit from parallel exploration by multiple agents
- Context window constraints: As solutions grow complex (500+ lines), the agent struggles to maintain coherent modifications across the entire program
Open Research Questions
- Population-based refinement: Can maintaining a population of diverse solutions (as in FunSearch) outperform single-trajectory refinement?
- Transfer learning: Can experience on earlier problems accelerate performance on later problems?
- Hybrid AI-human teams: What is the optimal division of labor between AI agents and human experts?
- Formal verification: Can LLMs learn to prove properties of their optimization algorithms (e.g., feasibility guarantees)?
- Scaling laws: How does performance scale with LLM size, revision budget, and compute?
19 Future Directions
Benchmark Extensions
The ALE-Bench framework is designed to be extensible. The authors suggest several directions for benchmark evolution:
- New problem additions: As AHC continues running contests, new problems can be added to the benchmark, increasing diversity and preventing overfitting to the current set
- Multi-language evaluation: Expanding beyond C++ to evaluate optimization in Python (with NumPy/Cython), Rust, and Java
- Parallel optimization track: Adding a track where solutions can use multiple CPU cores, testing AI agents' ability to parallelize optimization
- Interactive optimization: Problems where the agent receives partial feedback during execution and must adapt online
- Real-world problem track: Incorporating optimization problems from industry (logistics, scheduling, resource allocation) to test generalization beyond competitive programming
Agent Architecture Evolution
Several architectural improvements could address current limitations:
- Algorithm portfolio: Maintaining a library of algorithmic templates beyond SA, with a meta-learned selector that chooses the best paradigm for each problem
- Multi-agent collaboration: Deploying multiple agents with different biases (SA specialist, beam search specialist, DP specialist) that share discoveries
- Verification integration: Incorporating lightweight formal methods to prove constraint satisfaction and catch bugs early
- Hierarchical refinement: Separating high-level algorithm design (slow, expensive LLM calls) from low-level parameter tuning (fast, automated sweeps)
- Self-play evaluation: Using multiple agent instances to create a self-play competition, driving performance through competitive pressure
Toward General Optimization Agents
ALE-Bench represents a first step toward benchmarking general-purpose optimization agents -- systems that can tackle arbitrary optimization problems through algorithm design and iterative refinement. The ultimate vision is an AI system that, given any optimization problem specification, can design, implement, and refine a competitive heuristic solution. ALE-Bench provides the first rigorous framework for measuring progress toward this goal.
20 References and Resources
Primary Resources
Dataset: HuggingFace: SakanaAI/ALE-Bench
Code: github.com/SakanaAI/ALE-Bench
AtCoder Heuristic Contests: atcoder.jp/contests
Related Work References
- Romera-Paredes, B., et al. "Mathematical discoveries from program search with large language models." Nature 625, 468-475 (2024). [FunSearch]
- Li, Y., et al. "Competition-level code generation with AlphaCode." Science 378, 1092-1097 (2022). [AlphaCode]
- Fawzi, A., et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature 610, 47-53 (2022). [AlphaTensor]
- Balog, M., et al. "AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms." Google DeepMind White Paper (2025). [AlphaEvolve]
- Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. "Optimization by Simulated Annealing." Science 220(4598), 671-680 (1983).
- Zhang, J., et al. "The Darwin Gödel Machine: Self-Modifying Coding Agent via Darwinian Evolution." Sakana AI (2025). [DGM]
- Jimenez, C.E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Akiba, T., et al. "Optuna: A Next-generation Hyperparameter Optimization Framework." KDD 2019.
Glossary
[!info]- Key Terms and Abbreviations

| Term | Definition |
| --- | --- |
| AHC | AtCoder Heuristic Contest -- competitive programming contests focused on heuristic optimization |
| SA | Simulated Annealing -- metaheuristic optimization inspired by metallurgical annealing |
| NP-hard | Complexity class of problems at least as hard as the hardest problems in NP; no known polynomial-time solution |
| APX-hard | Problems for which achieving an approximation ratio better than some constant is NP-hard |
| LNS | Large Neighborhood Search -- destroy-and-repair metaheuristic |
| MCTS | Monte Carlo Tree Search -- tree-based search with random rollouts |
| Inference-time scaling | Improving output quality by generating multiple samples at inference time and selecting the best |
| Neighborhood operator | Function that generates a candidate solution by modifying the current solution |
PhD-Level Technical Analysis of ALE-Bench
Paper: arXiv:2506.09050 | Authors: Imajuku, Horie, Iwata, Aoki, Takahashi, Akiba (Sakana AI / AtCoder)
Report generated March 2026