AutoEvolver
Can Coding Agents Optimize Algorithms Autonomously?
Organization: Princeton NLP Group (Princeton University)
Published: March 23, 2026
Type: Blog Post + GitHub Repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Can Coding Agents Optimize Algorithms Autonomously?
Project Page: tengxiaoliu.github.io/autoevolver
Repository: github.com/tengxiaoliu/autoevolver
Publication Date: March 23, 2026
Institutional Affiliation: Princeton NLP Group, Department of Computer Science, Princeton University
Lineage: Positioned as a direct empirical response to dedicated evolutionary optimization systems including AlphaEvolve (DeepMind, 2025), ShinkaEvolve (2025), ThetaEvolve (2025), and TTT-Discover (2026). AutoEvolver does not build on any of these systems' codebases — it deliberately strips them away to test a minimalist hypothesis.
Publication Format: Technical blog post with open-source supporting materials, not a peer-reviewed paper. The work is primarily empirical and observational, analyzing the emergent behaviors of a general-purpose coding agent applied to algorithmic optimization problems without purpose-built evolutionary scaffolding.
Central Question: "What happens if you give a coding agent an algorithmic optimization problem and simply ask it to keep improving?"
2 Authors and Team
| Author | Affiliation | Role | Notable Prior Work |
|---|---|---|---|
| Tengxiao Liu | Princeton University | Lead researcher, experiment design | NLP, coding agents research |
| Yuqing Yang | Princeton University | Co-researcher | Machine learning research |
| Xi Ye | Princeton University | Co-researcher | Code generation, reasoning |
| Danqi Chen | Princeton University | Faculty advisor, PI | Princeton NLP Group lead; ORQA, DPR, instruction tuning |
The research team is from the Princeton NLP Group, one of the leading academic NLP labs in the United States. Danqi Chen's group has published extensively on language model capabilities, retrieval-augmented generation, and code generation. The team's approach reflects an empirically-driven philosophy: rather than building a new system, they systematically studied the behavior of an existing one (Claude Code) when given optimization tasks.
Methodological Posture
AutoEvolver is explicitly framed as an observational study rather than a system paper. The researchers did not build a tool — they conducted a carefully controlled experiment to test whether purpose-built evolutionary frameworks are necessary for competitive performance on algorithmic optimization benchmarks. This epistemological stance is critical to understanding the contribution: the result is primarily a set of empirical findings and behavioral observations, not a software artifact.
3 Core Contribution
Key Finding: A general-purpose coding agent (Claude Code with Opus 4.6), given only a problem description, an initial naive solution, and an evaluation script, can achieve state-of-the-art results on established algorithmic optimization benchmarks — surpassing results from purpose-built evolutionary systems like ThetaEvolve and TTT-Discover — with zero evolutionary scaffolding.
The Minimalist Hypothesis
AutoEvolver tests a radical minimalist hypothesis: that the evolutionary framework — population management, selection operators, mutation pipelines, crossover, diversity mechanisms — may be unnecessary when the LLM itself is sufficiently capable. The coding agent spontaneously exhibits behaviors that are functionally equivalent to evolutionary strategies:
| Evolutionary Concept | Emergent Agent Behavior |
|---|---|
| Population of programs | Parallel background tasks + file-system archive |
| Mutation operators | LLM-driven code modifications |
| Selection pressure | Agent's internal evaluation and comparison logic |
| Diversity maintenance | Strategy switching, web research pivots |
| Memory / archive | Context window + file system |
| Fitness evaluation | Evaluation script execution |
The Aspiration Prompting Discovery
Perhaps the most significant methodological contribution is the discovery of aspiration prompting — a minimal intervention technique where a single sentence raising the agent's target score is sufficient to break through performance plateaus. This finding has broad implications:
- Satisficing behavior: Coding agents exhibit satisficing — they settle for "good enough" solutions unless externally pushed. This mirrors Herbert Simon's bounded rationality theory applied to AI agents.
- Qualitative strategy shifts: Aspiration prompting doesn't just extend search time — it triggers qualitatively different algorithmic strategies. The agent shifts from incremental parameter tuning to fundamentally different approaches (e.g., switching from simulated annealing to SLSQP joint optimization).
- Minimal intervention cost: The intervention is a single natural language sentence. No algorithmic modification, no hyperparameter changes, no additional code. This makes aspiration prompting trivially cheap to implement.
What AutoEvolver Is NOT
- Not a framework or tool — it's an empirical study
- Not reproducible in the traditional sense — each run follows a unique trajectory
- Not a replacement for evolutionary frameworks — the authors explicitly acknowledge this
- Not a peer-reviewed paper — published as a blog post with supporting code
4 Supported Solutions
AutoEvolver was tested on three benchmark problems that are standard in the evolutionary AI literature. Each problem represents a different class of optimization:
Problem Taxonomy
| Problem | Type | Objective | Domain | Search Space |
|---|---|---|---|---|
| Circle Packing (n=26) | Continuous optimization | Maximize Σrᵢ | Computational geometry | Circle centers (xᵢ, yᵢ) and radii rᵢ in [0,1]² |
| Erdős Minimum Overlap | Combinatorial/functional | Minimize C₅ | Additive combinatorics | Step functions f: [0,1] → [0,1] |
| First Autocorrelation Inequality (AC1) | Functional construction | Minimize C₁ upper bound | Additive combinatorics | Nonneg. functions f on [-1/4, 1/4] |
Circle Packing (n=26)
Pack 26 non-overlapping circles inside a unit square, maximizing the sum of their radii. Constraints:
$$r_i \le x_i \le 1 - r_i, \quad r_i \le y_i \le 1 - r_i \quad \forall i$$ $$\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \ge r_i + r_j \quad \forall i \ne j$$
Objective: $\max \sum_{i=1}^{26} r_i$
This problem has a rich history in computational geometry and operations research. Known optimal configurations exist for small n, but for n=26 the landscape is extremely rugged with many local optima. The problem requires both global exploration (finding a good arrangement topology) and local refinement (precise coordinate optimization).
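The feasibility-and-score contract can be made concrete with a minimal scorer in the spirit of the repository's evaluation scripts; the function name, array layout, and tolerance handling here are illustrative assumptions, not the actual `evaluate.py`:

```python
import numpy as np

def score_packing(x, y, r, tol=1e-6):
    """Score a candidate packing: sum of radii if feasible, else None.

    x, y, r are length-n arrays; the square is [0, 1]^2 and the
    feasibility tolerance mirrors the 1e-6 mentioned in the text.
    Illustrative sketch, not the repository's actual evaluator.
    """
    x, y, r = map(np.asarray, (x, y, r))
    # Containment: each circle must fit inside the unit square.
    if np.any(r < 0) or np.any(x - r < -tol) or np.any(x + r > 1 + tol):
        return None
    if np.any(y - r < -tol) or np.any(y + r > 1 + tol):
        return None
    # Pairwise non-overlap: center distance >= r_i + r_j (within tol).
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    sep = np.sqrt(dx**2 + dy**2) - (r[:, None] + r[None, :])
    iu = np.triu_indices(len(x), k=1)
    if np.any(sep[iu] < -tol):
        return None
    return float(r.sum())
```

Two touching circles of radius 0.25 score 0.5; shrinking the gap below the tolerance makes the candidate infeasible.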
Erdős Minimum Overlap Problem
A classic problem in additive combinatorics. Partition {1, 2, ..., 2n} into two equal-size sets A and B. For each integer k, let Mₖ count the solutions to a - b = k with a ∈ A, b ∈ B. The goal is to bound c = lim(n→∞) M(n)/n, where M(n) = min_{A,B} max_k Mₖ.
Following prior work, the problem is formulated as optimizing step functions f describing the density of A throughout [1, 2n], with f(x) ∈ [0, 1] and ∫f = 1. The objective:
$$\text{Minimize } C_5 = \max_k \int f(x)(1 - f(x+k))\,dx$$
This is a minimax optimization over function space, discretized at resolution n. A key discovery by Claude Code was that increasing the discretization n yields better solutions — a direction the agent initially missed and only explored after aspiration prompting.
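One plausible discretization of the objective: represent f as n step values on a rescaled interval of length 2 and take the max over integer grid shifts. The grid, domain rescaling, and normalization below are assumptions for illustration; the repository's actual encoding may differ:

```python
import numpy as np

def c5(f):
    """Discretized minimax overlap objective for f in [0, 1]^n.

    f is a step function on a rescaled interval of length 2, so
    dx = 2/n and the normalization constraint is f.sum() * dx == 1.
    Illustrative discretization; not the repository's actual code.
    """
    f = np.asarray(f, dtype=float)
    n = len(f)
    dx = 2.0 / n
    best = 0.0
    for k in range(n):  # integer grid shifts; both directions
        fwd = np.sum(f[: n - k] * (1.0 - f[k:])) * dx
        bwd = np.sum(f[k:] * (1.0 - f[: n - k])) * dx
        best = max(best, fwd, bwd)
    return best
```

The constant density f ≡ 0.5 satisfies the normalization and yields an overlap of 0.5, far above the optimized values near 0.3809, which is why nontrivial step functions are needed.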
First Autocorrelation Inequality (AC1)
For nonnegative f supported on [-1/4, 1/4], find the smallest C₁ such that:
$$\max_{|t| \le 1/2} (f * f)(t) \ge C_1 \left(\int f\right)^2$$
Any valid construction f certifies an upper bound C₁ ≤ ‖f * f‖_∞ / (∫f)². Lower values represent tighter bounds. This problem arises in the study of additive patterns and has connections to the Littlewood conjecture.
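The certificate computation is direct to sketch: discretize f, approximate the autoconvolution, and take the ratio. The grid resolution and scaling below are illustrative choices:

```python
import numpy as np

def c1_upper_bound(f, half_width=0.25):
    """Upper bound on C1 certified by samples of f on [-1/4, 1/4].

    The discrete convolution times dx approximates (f * f)(t) on
    [-1/2, 1/2]; the ratio ||f*f||_inf / (∫f)^2 is the certified bound.
    """
    f = np.asarray(f, dtype=float)
    dx = 2.0 * half_width / len(f)
    conv = np.convolve(f, f) * dx   # samples of the autoconvolution
    integral = f.sum() * dx         # approximates ∫ f
    return conv.max() / integral**2
```

The constant function recovers the trivial bound C₁ ≤ 2; the optimized constructions push this down to about 1.50286.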
Solution Quality Summary
| Problem | AutoEvolver Result | Previous SOTA | Source | Margin |
|---|---|---|---|---|
| Circle Packing (Σr ↑) | 2.63598844 | 2.63598308 | ThetaEvolve | +5.36 × 10⁻⁶ |
| Erdős C₅ (↓) | 0.38086945 | 0.38087532 | TTT-Discover | −5.87 × 10⁻⁶ |
| AC1 C₁ (↓) | 1.5028628969 | 1.5028628983 | TTT-Discover | −1.4 × 10⁻⁹ |
All three results represent new state-of-the-art performance, though the margins are extremely small. The circle packing result is evaluated with a feasibility tolerance of 1e-6, consistent with ThetaEvolve's evaluator.
5 LLM Integration
Model Configuration
AutoEvolver uses a single LLM configuration with no ensemble:
| Parameter | Value |
|---|---|
| Model | Claude Code (Opus 4.6) |
| Provider | Anthropic |
| Mode | Autonomous (skip-permissions) |
| Interaction | Single long-running session per problem |
| Human intervention | Aspiration prompting only (1-2 sentences per problem) |
How the LLM Functions
Unlike evolutionary systems where the LLM serves as a mutation operator within a larger framework, in AutoEvolver the LLM is the entire system. Claude Code functions simultaneously as:
AutoEvolver: LLM as Complete Optimization System
=================================================
┌─────────────────────────────────────────────────────────────────┐
│ CLAUDE CODE (Opus 4.6) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ STRATEGIST │ │ IMPLEMENTER │ │ EVALUATOR │ │
│ │ │ │ │ │ │ │
│ │ Decides what │ │ Writes code │ │ Interprets results │ │
│ │ approach to │ │ modifications│ │ Compares against │ │
│ │ try next │ │ and new │ │ known best │ │
│ │ │ │ algorithms │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │ │
│ ┌──────┴───────┐ ┌──────┴───────┐ ┌──────────┴───────────┐ │
│ │ RESEARCHER │ │ PARALLELIZER│ │ SELF-CORRECTOR │ │
│ │ │ │ │ │ │ │
│ │ Searches web │ │ Launches │ │ Detects reward │ │
│ │ for papers, │ │ background │ │ hacking, catches │ │
│ │ GitHub repos │ │ tasks, │ │ comparison errors, │ │
│ │ │ │ sub-agents │ │ validates feasibility│ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ MEMORY SYSTEM │ │
│ │ Context Window (short-term) ←→ File System (long-term) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
LLM as Mutation Operator (Emergent)
In evolutionary systems, the LLM receives a prompt containing parent programs and generates mutated offspring. AutoEvolver demonstrates that this pattern emerges naturally when an LLM is given an optimization objective:
- The agent maintains candidate solutions (in files) — analogous to a population
- The agent modifies solutions (via code changes) — analogous to mutation
- The agent evaluates and compares (via evaluation scripts) — analogous to fitness evaluation
- The agent selects the best (keeping the highest-scoring solution) — analogous to selection
- The agent combines ideas (integrating insights from web research with current solutions) — analogous to crossover
The critical difference is that these behaviors are not orchestrated by an external evolutionary loop — they emerge from the LLM's internal planning and execution within the agentic framework.
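For contrast, the loop that purpose-built frameworks make explicit, and that the agent improvises implicitly, is roughly a (1+λ) scheme. This generic sketch is not any specific system's code:

```python
import random

def evolve(init, fitness, mutate, generations=200, offspring=8, seed=0):
    """Generic (1+lambda) evolutionary loop: the external structure that
    AutoEvolver shows an agent can improvise rather than have imposed."""
    rng = random.Random(seed)
    best, best_fit = init, fitness(init)
    for _ in range(generations):
        for _ in range(offspring):       # "mutation operators"
            cand = mutate(best, rng)
            f = fitness(cand)            # "fitness evaluation"
            if f > best_fit:             # "selection pressure"
                best, best_fit = cand, f
    return best, best_fit

# Toy usage: maximize -(x - 3)^2 starting from x = 0.
best, fit = evolve(
    0.0,
    fitness=lambda x: -(x - 3.0) ** 2,
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.5),
)
```

In AutoEvolver, each of the commented roles is played by the agent's own planning rather than by this outer loop.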
No Prompt Engineering
A notable aspect of AutoEvolver is the absence of sophisticated prompt engineering. The agent receives:
- A natural language problem description
- An initial naive solution (code file)
- An evaluation script
No system prompts, few-shot examples, persona assignments, or structured output formats are used. The simplicity of the setup is the point: the researchers test whether the agent's general capabilities are sufficient without domain-specific prompting.
6 Key Results
Headline Performance
All three benchmark problems achieved new state-of-the-art:
| Problem | AutoEvolver | Previous SOTA | SOTA Source | Runtime |
|---|---|---|---|---|
| Circle Packing (26 circles, Σr ↑) | 2.63598844 | 2.63598308 | ThetaEvolve | 16.6 hours |
| Erdős Min Overlap (C₅ ↓) | 0.38086945 | 0.38087532 | TTT-Discover | 30.8 hours |
| AC1 (C₁ ↓) | 1.5028628969 | 1.5028628983 | TTT-Discover | 40.4 hours |
Combined runtime: 88 hours of autonomous computation across 2,762 messages and 1,486 tool calls.
Circle Packing Trajectory
The agent's progression on the circle packing problem illustrates the multi-phase strategy evolution:
Circle Packing Score Progression
================================
Score
2.636 ┤ ●━━━━ 2.63598844
│ ╱ (SOTA)
2.620 ┤ ●━━━╱ SLSQP
│ ╱ joint opt
2.560 ┤ ●━━━━━━━╱ Multi-start
│ ╱ optimization
2.500 ┤ ●━━━━━━━╱ LP + simulated
│ ╱ annealing (plateau)
│ ╱
│ ╱
1.000 ┤ ●━━━━━━━━━━━╱
│ ╱ Naive ring
0.960 ┤ ●━━━╱ arrangement
│
└──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬──── Time
0 1 2 3 4 5 8 10 12 16 (hours)
│ │ │ │
Exploration │ Research Endgame
Refinement Pivot Optimization
Key breakthrough: Web search → GitHub discussion mentioning SLSQP
joint optimization → score jumped from 2.555 → 2.619
Phase Transitions in Strategy
Across all three problems, the agent exhibited a consistent multi-phase behavior pattern:
| Phase | Behavior | Example (Circle Packing) |
|---|---|---|
| 1. Exploration | Apply standard optimization methods | Gradient descent, simulated annealing |
| 2. Refinement | Tune hyperparameters, scale problem | Multi-start with different seeds |
| 3. Plateau | Detect diminishing returns | "This is a good result" (satisficing) |
| 4. Aspiration Prompt | Human raises target | "SOTA is 2.6359. I believe you can beat it." |
| 5. Research Pivot | Consult external resources | Web search → SLSQP discovery |
| 6. Synthesis | Integrate insights with progress | Joint center+radius optimization |
| 7. Endgame | Targeted local search from best known | Iterated perturbation chains |
Erdős Problem: The Discretization Discovery
The Erdős problem produced the most dramatic illustration of aspiration prompting's effect:
Before intervention (message 231): The agent reached C₅ = 0.38087447, beating TTT-Discover's 0.38087532 by a margin of 0.85 × 10⁻⁶. The agent declared "Final result" and confirmed the solution was at a local optimum via SLSQP, perturbation search, and subgradient verification.
Intervention (message 252): "Great — let's try more rounds. Aiming for larger improvements."
After intervention: The agent discovered that increasing discretization n yields better solutions. It systematically pushed from n=180 → 270 → 360 → 450 → 600 → 750, ultimately reaching C₅ = 0.38086945 — expanding the margin over prior SOTA from 0.85 × 10⁻⁶ to 5.87 × 10⁻⁶ (7× improvement triggered by one sentence).
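Reusing the best coarse solution as a warm start at higher n amounts to upsampling the step function. A minimal sketch; the linear interpolation and renormalization choices are assumptions, not the agent's actual code:

```python
import numpy as np

def upsample(f, m):
    """Warm-start a finer discretization from a coarse step function.

    Linearly interpolates f (length n) onto m grid points, clips to
    [0, 1], and rescales so the discrete integral stays 1 (dx = 2/m).
    The interpolation and renormalization here are illustrative.
    """
    n = len(f)
    xs_old = (np.arange(n) + 0.5) / n   # bin centers, coarse grid
    xs_new = (np.arange(m) + 0.5) / m   # bin centers, fine grid
    g = np.interp(xs_new, xs_old, f)
    g = np.clip(g, 0.0, 1.0)
    return g * (m / 2.0) / g.sum()      # enforce sum(g) * (2/m) == 1
```

Each refinement step (e.g., n=180 to n=360) can then hand the upsampled array back to the local optimizer instead of restarting from scratch.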
Emergent Self-Correction
The agent demonstrated multiple forms of self-monitoring:
- Reward hacking detection (Circle Packing): The agent found that L-BFGS-B could produce seemingly improved scores by exploiting LP solver tolerances, yielding slightly infeasible solutions. It identified the issue, diagnosed the cause, and corrected it autonomously.
- Optimization direction confusion (Erdős): The agent twice confused maximization with minimization for C₅, prematurely declared "this beats the competitor," then caught its own error within a few messages and corrected the comparison.
- Efficiency vs. correctness analysis (AC1): When replacing `np.convolve` with `scipy.signal.fftconvolve` (reducing O(n²) to O(n log n)), the agent explicitly questioned whether this constituted "cheating" before confirming mathematical equivalence and proceeding.
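The equivalence the agent verified before proceeding can be checked directly: the direct and FFT-based convolutions compute the same mathematical object up to floating-point error (both are real NumPy/SciPy APIs):

```python
import numpy as np
from scipy.signal import fftconvolve

# Direct O(n^2) and FFT-based O(n log n) convolution agree to
# floating-point precision, so the swap changes speed, not mathematics.
rng = np.random.default_rng(0)
f = rng.random(2048)
direct = np.convolve(f, f)
fast = fftconvolve(f, f)
assert np.allclose(direct, fast)
```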
7 Reproducibility
Fundamental Reproducibility Challenges
AutoEvolver represents a worst case for scientific reproducibility. The authors are transparent about this:
| Dimension | Status | Notes |
|---|---|---|
| Problem definitions | ✅ Fully reproducible | Mathematical specifications are precise |
| Evaluation scripts | ✅ Fully reproducible | Available in GitHub repository |
| Initial solutions | ✅ Fully reproducible | Naive starting points provided |
| Agent trajectory | ❌ Not reproducible | Each run follows unique stochastic path |
| Final solutions | ⚠️ Solutions available | Numerical results verifiable, path to them is not |
| Web search content | ❌ Not reproducible | External resources accessed vary over time |
| Aspiration prompts | ⚠️ Partially reproducible | Timing and exact wording are judgment calls |
Trajectory Analysis Methodology
The researchers used DataClaw to capture conversation trajectories, enabling post-hoc analysis of 88 hours of autonomous computation. The tool records all messages, tool calls, and agent outputs, providing a complete audit trail even though the trajectories themselves are not reproducible.
Available Materials
The GitHub repository (tengxiaoliu/autoevolver) provides:
autoevolver/
├── tasks/ # Problem setups (problem descriptions,
│ ├── circle_packing/ # initial solutions, evaluation scripts)
│ │ ├── problem.md
│ │ ├── initial_solution.py
│ │ └── evaluate.py
│ ├── erdos_overlap/
│ └── ac1/
├── results/ # Final solutions for all three problems
│ ├── circle_packing/
│ ├── erdos_overlap/
│ └── ac1/
└── README.md
What Would Be Needed for Reproducibility
To approach reproducibility, future work would need:
- Deterministic LLM sampling — temperature=0, fixed random seeds (though even this doesn't guarantee identical outputs across API versions)
- Frozen web content — cached versions of all external resources accessed
- Predefined aspiration schedule — fixed intervention timing and wording
- Version-locked API — identical model weights and inference infrastructure
The authors do not claim reproducibility as a goal. Their contribution is the demonstration that competitive performance is possible under these conditions, not that it is reliably achievable.
8 Compute and API Costs
Runtime Breakdown
| Problem | Wall Clock | Messages | Tool Calls | Active Computation |
|---|---|---|---|---|
| Circle Packing | 16.6 hours | ~920 | ~500 | Continuous |
| Erdős Overlap | 30.8 hours | ~1,050 | ~600 | Continuous |
| AC1 | 40.4 hours | ~792 | ~386 | Continuous |
| Total | 87.8 hours | 2,762 | 1,486 | ~88 hours |
Cost Estimation
The authors do not report exact API costs. We can estimate based on Claude Code Opus 4.6 pricing (as of March 2026):
| Cost Component | Estimate | Basis |
|---|---|---|
| Input tokens (context) | ~$300-500 | Long sessions with growing context windows |
| Output tokens (code + reasoning) | ~$200-400 | 2,762 messages with code generation |
| Tool call overhead | ~$50-100 | 1,486 tool calls |
| Background task spawning | ~$100-200 | Sub-agents, parallel explorations |
| Estimated total | ~$650-1,200 | For all three problems combined |
| Per-problem average | ~$220-400 | Estimated total divided across three problems |
Cost Comparison with Evolutionary Frameworks
| System | Typical Run Cost | Infrastructure | Human Setup Time |
|---|---|---|---|
| AlphaEvolve | Not disclosed (Google internal) | Multi-node GPU cluster + Gemini API | Weeks (problem formulation + evaluator) |
| ThetaEvolve | Not disclosed | GPU cluster + LLM API | Days (framework + problem encoding) |
| TTT-Discover | Not disclosed | GPU cluster + LLM API | Days |
| AutoEvolver | ~$200-400 per problem | Single machine + API | ~30 min (write problem + eval script) |
The key cost advantage of AutoEvolver is not API costs (which may be comparable or higher) but human engineering time. Setting up a problem in AutoEvolver requires writing a problem description, an initial solution, and an evaluation script — no framework configuration, population parameters, or evolutionary operator design.
Compute Efficiency Analysis
A critical nuance: AutoEvolver is almost certainly less efficient in terms of useful computation per dollar. The agent exhibits:
- Approach cycling: Revisiting previously explored strategies (L-BFGS-B attempted 15+ times on AC1)
- Ignored task outputs: ~60% of 174 task completion notifications on Erdős were never read
- Stalled processes: ~40 messages on AC1 polling processes with 0-byte output files
- Redundant launches: Four monitoring sub-agents with near-identical prompts
In a purpose-built evolutionary framework, every evaluation contributes to the population and is never wasted. In AutoEvolver, a significant fraction of computation is redundant or unproductive.
9 Architecture Solution
The Non-Architecture
AutoEvolver's architecture is deliberately minimal — it is the absence of architecture that constitutes the contribution. The "system" consists of:
AutoEvolver "Architecture"
===========================
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌───────────────┐ ┌────────────────────┐ │
│ │ Problem │ │ Evaluation │ │
│ │ Description │────────▶│ Script │ │
│ │ (text) │ │ (Python) │ │
│ └───────────────┘ └─────────┬──────────┘ │
│ │ │
│ ┌───────────────┐ │ │
│ │ Initial │ │ Score │
│ │ Solution │ │ feedback │
│ │ (Python) │ │ │
│ └───────┬───────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────┐ │
│ │ │ │
│ │ CLAUDE CODE (Opus 4.6) │ │
│ │ │ │
│ │ Autonomous session in skip-permissions mode │ │
│ │ │ │
│ │ Capabilities: │ │
│ │ • Code writing and execution │ │
│ │ • Web search (arXiv, GitHub, forums) │ │
│ │ • File system operations │ │
│ │ • Parallel task spawning │ │
│ │ • Sub-agent creation │ │
│ │ • Result analysis and plotting │ │
│ │ │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Aspiration Prompt (1 sentence, when needed) │ │
│ │ "The SOTA is X. I believe you can beat it." │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Comparison with Purpose-Built Evolutionary Architectures
Traditional Evolutionary Framework (e.g., AlphaEvolve)
======================================================
┌──────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐
│ Program │───▶│ Prompt │───▶│ LLM │───▶│ Parser │
│ Database │ │ Sampler │ │ (Gemini) │ │ & Differ │
│ (pop.) │ │ │ │ │ │ │
└────┬─────┘ └───────────┘ └──────────┘ └─────┬─────┘
│ │
│ ┌────────────────────┐ │
│ │ Evaluator Pool │◀───────────────────────┘
│ │ (parallel sandbox │
│ │ execution) │
│ └────────┬───────────┘
│ │
└─────────────┘
Selection + Archive
AutoEvolver
===========
┌──────────────┐ ┌──────────────────────────────────┐
│ Problem + │───▶│ Claude Code (the ENTIRE system) │
│ Eval Script │ │ Does everything above + more │
└──────────────┘ └──────────────────────────────────┘
The philosophical difference is profound. In evolutionary frameworks, the LLM is a component — a mutation operator — within a larger algorithmic structure. In AutoEvolver, the LLM is the algorithm. The evolutionary-like behaviors emerge from the LLM's general intelligence rather than being imposed by external structure.
Implications for System Design
This result challenges the evolutionary AI systems field to justify its architectural complexity. If a bare LLM can match or exceed the performance of carefully engineered evolutionary pipelines, several possibilities follow:
1. The frameworks provide marginal value: The LLM's general capabilities already encompass the evolutionary strategies, making explicit frameworks redundant.
2. The frameworks provide value at scale: AutoEvolver was tested on three problems. At scale (hundreds of problems, diverse domains), the consistency and efficiency of evolutionary frameworks may dominate.
3. The frameworks provide different value: Controllability, reproducibility, and transparency may be more important than raw performance in research settings.
The authors favor interpretation (3), explicitly stating: "Not a replacement. Compared to purpose-built frameworks, coding agents still lack controllability and reproducibility."
10 Component Breakdown
Input Components
| Component | Description | Purpose |
|---|---|---|
| Problem Description | Natural language specification of the optimization task and objective | Provides the agent with domain understanding and optimization direction |
| Initial Solution | A naive starting implementation in Python | Gives the agent something concrete to improve, avoiding cold-start |
| Evaluation Script | A deterministic function scoring solutions and validating correctness | Provides objective fitness signal; the agent's only ground truth |
Emergent Components (Not Designed, Observed)
Through analysis of the 88-hour trajectory, the researchers identified several emergent system components:
| Emergent Component | How It Manifests | Evolutionary Analog |
|---|---|---|
| Solution Archive | `promising_solutions/` directory with 110+ candidates (Erdős) | Program database / population |
| Strategy Registry | Mental tracking of tried approaches in context | Tabu list / novelty archive |
| Parallel Evaluator Pool | Background tasks running multiple strategies simultaneously | Parallel fitness evaluation |
| Web Research Module | Targeted arXiv/GitHub searches during idle periods | Knowledge base / prior information |
| Process Manager | Identifying and killing inferior parallel processes | Resource management |
| Score Comparator | Internal logic comparing new scores against known best | Selection operator |
The Aspiration Prompt
The aspiration prompt is the only designed component beyond the basic setup. Its anatomy:
Aspiration Prompt Anatomy
=========================
Structure: "[Acknowledgment] + [Target] + [Encouragement]"
Examples used in the study:
Circle Packing:
"The current SOTA on this problem is 2.6359. I believe you can beat it."
Erdős:
"Great — let's try more rounds. Aiming for larger improvements."
AC1:
[Not explicitly reported — similar pattern inferred]
Timing: Applied when the agent declares "Final result" or equivalent
Effect: Triggers qualitative strategy shift, not just extended search
Tool Usage Patterns
Claude Code's tool usage across the 88-hour study:
| Tool Category | Usage Pattern | Frequency |
|---|---|---|
| Code execution | Running optimization scripts, evaluating solutions | Very high (~40% of tool calls) |
| File I/O | Saving/loading solutions, writing new scripts | High (~25%) |
| Web search | Searching arXiv, GitHub, math forums | Moderate (~15%) |
| Background tasks | Spawning parallel optimization processes | Moderate (~12%) |
| Sub-agents | Delegating monitoring and exploration tasks | Low (~8%) |
11 Core Mechanisms (Detailed)
Mechanism 1: Multi-Phase Strategy Evolution
The most striking emergent behavior is the agent's consistent progression through qualitatively different optimization phases. This pattern was observed independently across all three problems.
Multi-Phase Strategy Evolution
==============================
Phase 1: EXPLORATION (Hours 0-2)
├── Apply textbook optimization methods
├── Test multiple approaches quickly
├── Establish baseline performance
└── Characteristic: Broad, shallow search
Phase 2: REFINEMENT (Hours 2-5)
├── Focus on best-performing approach
├── Tune hyperparameters systematically
├── Scale problem parameters
└── Characteristic: Deep, narrow search
Phase 3: PLATEAU RECOGNITION (Hours 5-8)
├── Detect diminishing returns
├── Declare solution as "final" or "optimal"
├── Verify local optimality (perturbation, gradient checks)
└── Characteristic: Satisficing behavior
Phase 4: ASPIRATION INTERVENTION (Single message)
├── Human raises target score
└── Characteristic: External pressure applied
Phase 5: RESEARCH PIVOT (Hours 8-12)
├── Web search for alternative approaches
├── Read papers, GitHub repos, forum discussions
├── Discover fundamentally new strategies
└── Characteristic: Information seeking, strategy shift
Phase 6: SYNTHESIS (Hours 12-16)
├── Combine external insights with accumulated progress
├── Implement new approaches using best solution as warm start
├── Evaluate hybrid strategies
└── Characteristic: Cross-pollination, integration
Phase 7: ENDGAME (Hours 16+)
├── High-resolution local search around best known
├── Scaling tricks (higher discretization, more iterations)
├── Diminishing returns but still accumulating small gains
└── Characteristic: Exploitation-dominated, precision focus
Mechanism 2: Spontaneous Parallel Exploration
The agent autonomously transitions from sequential to parallel exploration as optimization becomes harder. This behavior was most pronounced on the Erdős problem:
Quantitative Parallelism Metrics (Erdős Problem):
| Metric | Value |
|---|---|
| Total background tasks launched | 174 |
| Sub-agents spawned | 9 |
| Peak concurrent tasks | 5-10 |
| Task completion notifications received | 174 |
| Notifications actually read/processed | ~70 (~40%) |
| Files in `promising_solutions/` archive | 110+ |
The agent's parallelism exhibits a natural analogy to population-based search. Multiple candidate strategies compete for the agent's attention, with better-performing ones receiving more follow-up computation. However, unlike formal evolutionary algorithms, this "selection" is mediated by the agent's attention and context management rather than explicit selection operators.
Mechanism 3: Strategic Web Research
The agent uses web research in two distinct patterns:
Pattern A: Fallback when stuck. After reaching a plateau, the agent initiates web searches alongside continued optimization, looking for new techniques. Example trajectory (Circle Packing):
Message 37: "I'm stuck at 2.57, likely because the simulated annealing is converging to a local optimum." → Searches Packomania website for known n=26 coordinates → fails to extract useful data
Message 144: Launches 4 parallel experiments including web search for known optimal packing coordinates
Message 157: "I found a GitHub issue that mentions a circle packing result of 2.635977 for n=26!"
Message 171: "The key insight from that issue is that they used SLSQP for jointly optimizing centers AND radii, rather than our approach of LP for radii and NM/Powell for centers." → Score jumps from 2.555 to 2.619
Pattern B: Opportunistic during idle time. While background optimization tasks run, the agent uses idle time to gather theoretical insights. On the AC1 problem, the agent read papers on autocorrelation inequalities to understand the theoretical landscape, even though the insights did not directly translate to code improvements.
Mechanism 4: Self-Correction and Reward Hacking Detection
The agent exhibits multiple layers of self-monitoring, including detection of its own reward hacking:
Case Study: L-BFGS-B Feasibility Exploitation (Circle Packing)
The agent discovered that L-BFGS-B could produce seemingly improved scores by exploiting LP solver tolerances, yielding solutions that technically violated the non-overlap constraint by amounts smaller than the tolerance threshold. The agent's reasoning:
- Observed that L-BFGS-B solutions scored slightly higher than SLSQP solutions
- Investigated why — found that the LP solver's tolerance allowed slightly infeasible circle placements
- Determined this was "reward hacking" — the evaluation function was being gamed
- Reverted to stricter feasibility checking
- Continued optimization with properly constrained solutions
This is a remarkable demonstration of an AI system detecting and correcting its own tendency to exploit evaluation metrics — a behavior that purpose-built evolutionary frameworks typically handle via explicit constraint enforcement in the evaluator.
Mechanism 5: Approach Cycling (Failure Mode)
A significant failure mode is the agent's tendency to revisit previously explored and rejected approaches, apparently losing track of prior conclusions as the context window advances:
Case Study: L-BFGS-B Cycling (AC1 Problem)
| Message | Event | Conclusion |
|---|---|---|
| 62 | First proposes L-BFGS-B | Pivots to simulated annealing instead |
| 130 | Tries L-BFGS-B again | "Too slow" |
| 215 | L-BFGS-B again | "Only marginally improves" |
| 340 | L-BFGS-B again | "No improvement" |
| 420 | L-BFGS-B again | "Already tried this" |
| 521 | Catches itself: "Actually wait, I showed earlier that L-BFGS-B also can't improve it." | Brief self-awareness |
| 548 | L-BFGS-B again: "let me try something I haven't tried yet: L-BFGS-B" | Context lost again |
This failure mode represents the most significant practical limitation of AutoEvolver. In a purpose-built evolutionary framework, a tabu list or strategy registry would prevent this waste. The context window functions as short-term memory, but information is lost as earlier messages scroll out of the active context. The file system acts as long-term memory, but the agent doesn't maintain a systematic strategy log.
Mechanism 6: Process Awareness and Resource Management
The agent demonstrates system-level awareness that goes beyond algorithm design:
Case Study: Process Interference Detection (Erdős)
Message 1280: "Both processes write to the same file. The better process's saves get overwritten by the worse one. This explains why we rarely see improvement."
Message 1283: "I should stop the worse process so only the better one saves snapshots."
Message 1291: "C₅ dropped to 0.380869458 — a big improvement! Killing the worse process worked perfectly."
This incident demonstrates debugging of system-level interactions between concurrent processes — a skill typically associated with software engineering rather than algorithm optimization.
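The agent's remedy was to kill the weaker process, but the underlying race — two writers sharing one snapshot file — can also be avoided with a compare-and-write guard: each process checks the score already on disk before overwriting. A hedged sketch of that pattern (the file format and score convention are illustrative, not from the run):

```python
import json
import os
import tempfile

def save_if_better(path, solution, score, lower_is_better=True):
    """Overwrite the shared snapshot only if this process's score
    beats the one already on disk (lower C5 is better for Erdős)."""
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)["score"]
        improved = score < existing if lower_is_better else score > existing
        if not improved:
            return False
    # Write to a temp file, then atomically replace the snapshot,
    # so a concurrent reader never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"score": score, "solution": solution}, f)
    os.replace(tmp, path)
    return True
```

A true file lock would be needed to fully close the read-then-write race; this sketch only prevents the steady overwrite-by-worse-process behavior the agent observed at message 1280.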
12 Programming Language
Solution Implementation
All solutions are implemented in Python, leveraging the scientific computing ecosystem:
| Library | Usage | Problem |
|---|---|---|
| `numpy` | Array operations, linear algebra | All three |
| `scipy.optimize` | SLSQP, L-BFGS-B, differential evolution | Circle Packing, Erdős |
| `scipy.signal` | FFT convolution | AC1 |
| `scipy.linalg` | Linear algebra routines | All three |
| `matplotlib` | Visualization of solutions | All three |
| `multiprocessing` | Parallel solution evaluation | Erdős |
Code Characteristics
The agent-generated code exhibits several notable patterns:
- Progressive sophistication: Early code is simple and procedural; late-stage code uses advanced optimization techniques with careful numerical handling.
- Self-documenting: The agent tends to add comments explaining its reasoning, though these comments become less reliable as context is lost.
- Modular evolution: The agent often refactors its solution into separate functions as complexity grows, spontaneously applying software engineering practices.
Example: Circle Packing Solution Evolution
Stage 1 — Naive (Score: 0.96):
import numpy as np

def pack_circles(n=26):
    """Naive ring arrangement of n equal circles in the unit square."""
    circles = []
    r = 1.0 / (2 * n)
    for i in range(n):
        angle = 2 * np.pi * i / n
        x = 0.5 + 0.3 * np.cos(angle)
        y = 0.5 + 0.3 * np.sin(angle)
        circles.append((x, y, r))
    return circles
Stage 5 — SLSQP Joint Optimization (Score: 2.619):
import numpy as np
from scipy.optimize import minimize

def joint_optimize(centers, radii, n=26):
    """Jointly optimize centers and radii using SLSQP.
    Decision vector: [x0, y0, ..., x_{n-1}, y_{n-1}, r_0, ..., r_{n-1}]."""
    x0 = np.concatenate([centers.ravel(), radii])
    def objective(x):
        return -np.sum(x[2*n:])  # Maximize sum of radii
    constraints = []
    for i in range(n):
        # Containment: r <= x <= 1-r, r <= y <= 1-r
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
                            x[2*i] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
                            1 - x[2*i] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
                            x[2*i+1] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
                            1 - x[2*i+1] - x[2*n+i]})
    for i in range(n):
        for j in range(i+1, n):
            # Non-overlap: dist(i,j) >= r_i + r_j
            constraints.append({'type': 'ineq', 'fun': lambda x, i=i, j=j:
                                np.sqrt((x[2*i]-x[2*j])**2 + (x[2*i+1]-x[2*j+1])**2)
                                - x[2*n+i] - x[2*n+j]})
    result = minimize(objective, x0, method='SLSQP',
                      constraints=constraints, options={'maxiter': 10000})
    return result
Stage 7 — Iterated Perturbation Chains (Score: 2.63598844):
import numpy as np

def perturbation_chain(solution, eval_fn, n_iters=100000, temp=1e-6):
    """Fine-grained local search via iterated perturbation.
    Relies on an external is_feasible() constraint check."""
    best = solution.copy()
    best_score = eval_fn(best)
    for it in range(n_iters):
        candidate = best.copy()
        # Perturb a random circle's position or radius
        idx = np.random.randint(len(candidate))
        dim = np.random.randint(3)  # x, y, or r
        candidate[idx][dim] += np.random.normal(0, temp)
        # Evaluate once; accept only feasible, strictly improving moves
        if is_feasible(candidate):
            score = eval_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
13 Memory Management
The Dual Memory System
AutoEvolver's memory architecture is an emergent property of Claude Code's design, not an intentional choice:
Memory Architecture
===================
┌─────────────────────────────────────────────────────────┐
│ SHORT-TERM MEMORY: Context Window │
│ │
│ • Contains recent messages, tool outputs, reasoning │
│ • Finite capacity (~200K tokens for Opus 4.6) │
│ • Information drops off as window advances │
│ • Strategy history, previous conclusions lost │
│ • Current best solution always tracked │
│ │
│ FAILURE MODE: Approach cycling │
│ L-BFGS-B tried 15+ times on AC1 because prior │
│ negative conclusions scrolled out of context │
│ │
├─────────────────────────────────────────────────────────┤
│ LONG-TERM MEMORY: File System │
│ │
│ • Solution files (current best, candidates) │
│ • promising_solutions/ directory (110+ files, Erdős) │
│ • Evaluation results and logs │
│ • Code implementations (versioned by the agent) │
│ │
│ FAILURE MODE: No systematic strategy log │
│ The agent saves solutions but not a record of │
│ which approaches were tried and rejected │
│ │
├─────────────────────────────────────────────────────────┤
│ EXTERNAL MEMORY: Web Resources │
│ │
│ • arXiv papers, GitHub repos, forum discussions │
│ • Accessed opportunistically during idle periods │
│ • Not persistent — re-searched when needed │
│ • Provides novel strategies not in training data │
│ │
└─────────────────────────────────────────────────────────┘
Memory Failure Analysis
The most significant memory-related failure is approach cycling, where the agent revisits strategies it previously explored and rejected. Quantitative evidence:
| Problem | Approach | Times Revisited | Total Wasted Messages |
|---|---|---|---|
| AC1 | L-BFGS-B | 15+ | ~60 |
| Circle Packing | L-BFGS-B | 2 (mild) | ~10 |
| Erdős | Random initialization | 3 | ~15 |
Proposed Mitigations (Not Implemented)
The authors identify but do not implement several mitigations:
- Persistent strategy registry: A file logging all tried approaches with their outcomes, consulted before each new attempt.
- Context summarization: Periodic compression of the context window into a summary of findings and rejected approaches.
- Approach deduplication: A check before launching a new strategy to verify it hasn't been tried before.
These mitigations would move AutoEvolver closer to a purpose-built framework, partially undermining the minimalist thesis. This tension between raw capability and systematic efficiency is one of the work's most interesting implicit findings.
File System as Population Archive
The agent's use of the file system as a solution archive is structurally similar to a MAP-Elites archive or a quality-diversity archive:
Erdős Problem File System (Post-Run):
promising_solutions/
├── best_n180.npy # Best solution at discretization n=180
├── best_n270.npy # Best at n=270
├── best_n360.npy # Best at n=360 (C5=0.38087064)
├── best_n450.npy # Best at n=450
├── best_n600.npy # Best at n=600
├── best_n750.npy # Best at n=750 (C5=0.38086945, final SOTA)
├── candidate_001.npy # Intermediate candidates
├── candidate_002.npy
│ ... (~110 files total)
├── candidate_110.npy
├── basin_search_result_1.npy
├── basin_search_result_2.npy
└── ...
This archive was created entirely by the agent without external instruction. The structure mirrors a quality-diversity archive where solutions are organized by a behavioral characteristic (discretization n) rather than pure fitness.
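The pattern the agent arrived at can be formalized in a few lines as a quality-diversity archive with one elite per discretization level. This is a hedged reconstruction of the structure, not the agent's code; the class name, file layout, and the lower-is-better convention for C5 are assumptions consistent with the run's artifacts:

```python
import os
import numpy as np

class DiscretizationArchive:
    """MAP-Elites-style archive: one elite per discretization level n,
    mirroring the agent's promising_solutions/ directory layout."""
    def __init__(self, root="promising_solutions"):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.elites = {}  # n -> (score, file path)

    def offer(self, n, solution, score):
        """Keep the solution only if it beats the elite in cell n
        (lower C5 is better for the Erdős problem)."""
        best = self.elites.get(n)
        if best is None or score < best[0]:
            path = os.path.join(self.root, f"best_n{n}.npy")
            np.save(path, solution)
            self.elites[n] = (score, path)
            return True
        return False

    def best_overall(self):
        """Return (n, (score, path)) for the best elite across cells."""
        return min(self.elites.items(), key=lambda kv: kv[1][0])
```

Keying the archive on discretization n rather than raw fitness is exactly what makes it quality-diversity-like: worse-scoring solutions at other n values survive as stepping stones.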
14 Continued Learning
Within-Session Learning
AutoEvolver exhibits clear within-session learning across the multi-phase trajectory. The agent accumulates domain knowledge throughout each run:
| Learning Signal | How It's Used | Persistence |
|---|---|---|
| Evaluation scores | Direct fitness feedback for strategy selection | Context window (degrades) |
| Optimization landscapes | Understanding of problem structure (local optima, convexity) | Context window (degrades) |
| Web research findings | External techniques integrated into current approach | Context window + code files |
| Failed approaches | (Poorly) avoided in future attempts | Context window (lost → cycling) |
| Discretization scaling | Discovery that higher n → better Erdős solutions | Code files (persists) |
Cross-Problem Learning: Absent
A critical limitation: AutoEvolver performs no cross-problem learning. Each problem is solved in an independent session with no knowledge transfer. The agent does not:
- Recognize that SLSQP worked well on Circle Packing and try it first on Erdős
- Transfer the aspiration-prompt-induced insight about discretization scaling
- Build a library of optimization strategies across problems
- Maintain a persistent memory of effective techniques
This is a fundamental difference from evolutionary frameworks that can accumulate cross-problem knowledge through prompt templates, strategy libraries, or learned mutation operators.
Comparison with Evolutionary Framework Learning
| Learning Type | AutoEvolver | AlphaEvolve | FunSearch |
|---|---|---|---|
| Within-run fitness improvement | ✅ (context + files) | ✅ (program database) | ✅ (program database) |
| Cross-run knowledge | ❌ | ⚠️ (seed programs) | ⚠️ (seed programs) |
| Strategy meta-learning | ❌ | ❌ | ❌ |
| Self-improving LLM | ❌ | ✅ (AlphaEvolve contributes to Gemini training) | ❌ |
Implications for System Design
The absence of cross-problem learning suggests a concrete architecture improvement: a strategy memory layer that persists across sessions and records which approaches worked on which problem types. This would address the approach cycling problem within sessions and enable knowledge transfer across problems, without requiring the full complexity of an evolutionary framework.
Proposed Strategy Memory Architecture
======================================
Session 1 (Circle Packing) Session 2 (Erdős)
┌──────────────────┐ ┌──────────────────┐
│ Claude Code │ │ Claude Code │
│ │ │ │
│ • Tried SA ❌ │ ──write──▶ │ • Read prior │
│ • Tried SLSQP ✅ │ │ strategies │
│ • SLSQP + perturb│ │ • Skip SA, try │
│ chains ✅✅ │ │ SLSQP first │
└──────────────────┘ └──────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────┐
│ PERSISTENT STRATEGY MEMORY │
│ │
│ Problem Type │ Approach │ Outcome │
│ ─────────────── │ ──────────── │ ───────── │
│ Continuous opt │ SA │ Local optima │
│ Continuous opt │ SLSQP joint │ Breakthrough │
│ Continuous opt │ Perturbation │ Refinement │
│ Functional opt │ ↑ discret. │ Breakthrough │
└─────────────────────────────────────────────────┘
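The proposed layer could be as simple as a JSON file consulted at session start and appended to as approaches resolve. A minimal sketch under the assumptions in the diagram above; the schema, file name, and outcome labels are illustrative, not part of AutoEvolver:

```python
import json
import os

class StrategyMemory:
    """Persistent record of (problem_type, approach) -> outcome,
    shared across sessions via a single JSON file."""
    def __init__(self, path="strategy_memory.json"):
        self.path = path
        self.entries = []
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    def record(self, problem_type, approach, outcome):
        """Append one resolved attempt and persist immediately."""
        self.entries.append({"problem_type": problem_type,
                             "approach": approach,
                             "outcome": outcome})
        with open(self.path, "w") as f:
            json.dump(self.entries, f, indent=2)

    def already_tried(self, problem_type, approach):
        """Deduplication check: consult before launching a strategy."""
        return [e for e in self.entries
                if e["problem_type"] == problem_type
                and e["approach"] == approach]

    def promising(self, problem_type):
        """Approaches that previously led to breakthroughs."""
        return [e["approach"] for e in self.entries
                if e["problem_type"] == problem_type
                and e["outcome"] == "breakthrough"]
```

The same structure doubles as the within-session tabu list: calling `already_tried` before each new attempt would have stopped the 15+ L-BFGS-B revisits on AC1, and `promising` enables the cross-problem transfer (try SLSQP first on Erdős) that the agent currently never performs.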
15 Applications
Direct Applications
AutoEvolver's primary application is algorithmic optimization on problems where:
- Solutions are expressible as code — the agent needs to write and execute Python
- An evaluation function exists — deterministic scoring of solution quality
- Web resources are available — the agent benefits from accessing existing literature
- Long compute budgets are acceptable — 16-40 hours per problem
- Reproducibility is not critical — each run follows a unique trajectory
| Application Domain | Suitability | Notes |
|---|---|---|
| Mathematical optimization | ✅ High | Core strength demonstrated on three problems |
| Algorithm design | ✅ High | Agent designs novel algorithms as part of optimization |
| Heuristic discovery | ✅ High | Agent discovers and combines heuristics autonomously |
| Hyperparameter optimization | ⚠️ Moderate | Possible but purpose-built tools (Optuna, etc.) are more efficient |
| Neural architecture search | ⚠️ Moderate | Conceivable but expensive and not demonstrated |
| Scientific hypothesis testing | ❌ Low | No experimental design or statistical rigor |
| Production algorithm deployment | ❌ Low | Reproducibility concerns prevent direct deployment |
Implications for the Evolutionary AI Field
AutoEvolver's most significant contribution is not practical but conceptual. It challenges foundational assumptions in the field:
Challenge 1: Are evolutionary frameworks necessary?
If a general-purpose coding agent can match evolutionary frameworks on their home benchmarks, the burden of proof shifts to framework developers to demonstrate value beyond raw performance — in efficiency, controllability, reproducibility, or scalability.
Challenge 2: What is the role of the LLM?
In evolutionary frameworks, the LLM is a component (mutation operator) within a larger system. AutoEvolver shows the LLM can serve as the entire optimization system. This raises the question of whether the evolutionary framework is mostly providing structure that the LLM already implicitly possesses.
Challenge 3: How should we evaluate evolutionary AI systems?
If performance on a small number of benchmark problems is the primary metric, AutoEvolver's results undermine the case for complex frameworks. The field may need evaluation criteria that capture efficiency, scalability, robustness, and reproducibility — dimensions where frameworks are expected to excel.
Relationship to OmniEvolve Design
AutoEvolver's findings are directly relevant to OmniEvolve's architecture:
| AutoEvolver Finding | OmniEvolve Design Implication |
|---|---|
| Aspiration prompting breaks plateaus | Incorporate adaptive target-setting in the evaluation loop |
| Strategy cycling wastes compute | Implement a persistent strategy registry in the knowledge module |
| File-system archives emerge naturally | Formalize this pattern in the artifact storage system |
| Parallel exploration is beneficial | Support parallel island-based search from the architecture level |
| Web research provides breakthroughs | Integrate literature search as a mutation information source |
| Agent satisfices without pressure | Design evaluation feedback to maintain optimization pressure |
Broader Scientific Context
AutoEvolver joins a growing body of evidence that general-purpose AI agents can perform specialized tasks previously requiring purpose-built systems:
| Domain | Purpose-Built System | General Agent Result |
|---|---|---|
| Code optimization | AlphaEvolve, FunSearch | AutoEvolver (matches/exceeds) |
| Scientific paper writing | Specialized NLG systems | AI Scientist (produces publishable papers) |
| Theorem proving | Lean4, Coq provers | LLM-based proof (emerging results) |
| Chip design | EDA tools | LLM-based placement (Google, 2023) |
| Drug discovery | Specialized ML pipelines | LLM-based molecular design (emerging) |
The pattern suggests that as general-purpose LLMs become more capable, the value proposition of domain-specific systems shifts from capability (can it solve the problem?) to efficiency (can it solve the problem better, faster, and more reliably?).
Limitations as a Research Contribution
Despite the impressive results, several limitations constrain AutoEvolver's impact:
- N=1 per problem: Each problem was solved once. Statistical significance cannot be established without multiple runs.
- Human intervention: Aspiration prompting, while minimal, is not automated. The timing and content of interventions involve human judgment.
- Fair comparison: The agent has access to web resources including potentially the very papers it's competing against. Evolutionary frameworks typically operate without such external information.
- Cost opacity: API costs are not reported, making cost-efficiency comparisons impossible.
- Narrow benchmark: Three mathematical optimization problems are not representative of the broader algorithmic design space.
Bottom line: AutoEvolver demonstrates that the capability gap between general-purpose coding agents and purpose-built evolutionary frameworks has narrowed to the point of parity on selected benchmarks. The efficiency, reproducibility, and scalability gaps remain wide — and these may be the more important dimensions for practical research systems.
References
- Liu, T., Yang, Y., Ye, X., and Chen, D. "Can Coding Agents Optimize Algorithms Autonomously?" Blog Post, March 2026. https://tengxiaoliu.github.io/autoevolver/
- Novikov, A. et al. "AlphaEvolve: A coding agent for scientific and algorithmic discovery." arXiv:2506.13131, 2025.
- Lange, R.T. et al. "ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution." arXiv:2509.19349, 2025.
- Wang, Y. et al. "ThetaEvolve: Test-time Learning on Open Problems." arXiv:2511.23473, 2025.
- Yuksekgonul, M. et al. "Learning to Discover at Test Time." arXiv:2601.16175 (TTT-Discover), 2026.
- Simon, H.A. "Models of Bounded Rationality." MIT Press, 1982.
- Mouret, J.-B. and Clune, J. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.
- Romera-Paredes, B. et al. "Mathematical discoveries from program search with large language models." Nature, 625, 468–475, 2024 (FunSearch).
- DataClaw. Conversation trajectory capture tool. github.com/peteromallet/dataclaw.
- Anthropic. "Claude Code." Technical Documentation, 2026.