AutoEvolver

Can Coding Agents Optimize Algorithms Autonomously?
Organization: Princeton NLP Group (Princeton University)
Published: March 23, 2026
Type: Blog Post + GitHub Repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: Can Coding Agents Optimize Algorithms Autonomously?

Project Page: tengxiaoliu.github.io/autoevolver

Repository: github.com/tengxiaoliu/autoevolver

Publication Date: March 23, 2026

Institutional Affiliation: Princeton NLP Group, Department of Computer Science, Princeton University

Lineage: Positioned as a direct empirical response to dedicated evolutionary optimization systems including AlphaEvolve (DeepMind, 2025), ShinkaEvolve (2025), ThetaEvolve (2025), and TTT-Discover (2026). AutoEvolver does not build on any of these systems' codebases — it deliberately strips them away to test a minimalist hypothesis.

Publication Format: Technical blog post with open-source supporting materials, not a peer-reviewed paper. The work is primarily empirical and observational, analyzing the emergent behaviors of a general-purpose coding agent applied to algorithmic optimization problems without purpose-built evolutionary scaffolding.

Central Question: "What happens if you give a coding agent an algorithmic optimization problem and simply ask it to keep improving?"

2 Authors and Team

| Author | Affiliation | Role | Notable Prior Work |
|---|---|---|---|
| Tengxiao Liu | Princeton University | Lead researcher, experiment design | NLP, coding agents research |
| Yuqing Yang | Princeton University | Co-researcher | Machine learning research |
| Xi Ye | Princeton University | Co-researcher | Code generation, reasoning |
| Danqi Chen | Princeton University | Faculty advisor, PI | Princeton NLP Group lead; ORQA, DPR, instruction tuning |

The research team is from the Princeton NLP Group, one of the leading academic NLP labs in the United States. Danqi Chen's group has published extensively on language model capabilities, retrieval-augmented generation, and code generation. The team's approach reflects an empirically-driven philosophy: rather than building a new system, they systematically studied the behavior of an existing one (Claude Code) when given optimization tasks.

Methodological Posture

AutoEvolver is explicitly framed as an observational study rather than a system paper. The researchers did not build a tool — they conducted a carefully controlled experiment to test whether purpose-built evolutionary frameworks are necessary for competitive performance on algorithmic optimization benchmarks. This epistemological stance is critical to understanding the contribution: the result is primarily a set of empirical findings and behavioral observations, not a software artifact.

3 Core Contribution

Key Finding: A general-purpose coding agent (Claude Code with Opus 4.6), given only a problem description, an initial naive solution, and an evaluation script, can achieve state-of-the-art results on established algorithmic optimization benchmarks — surpassing results from purpose-built evolutionary systems like ThetaEvolve and TTT-Discover — with zero evolutionary scaffolding.

The Minimalist Hypothesis

AutoEvolver tests a radical minimalist hypothesis: that the evolutionary framework — population management, selection operators, mutation pipelines, crossover, diversity mechanisms — may be unnecessary when the LLM itself is sufficiently capable. The coding agent spontaneously exhibits behaviors that are functionally equivalent to evolutionary strategies:

| Evolutionary Concept | Emergent Agent Behavior |
|---|---|
| Population of programs | Parallel background tasks + file-system archive |
| Mutation operators | LLM-driven code modifications |
| Selection pressure | Agent's internal evaluation and comparison logic |
| Diversity maintenance | Strategy switching, web research pivots |
| Memory / archive | Context window + file system |
| Fitness evaluation | Evaluation script execution |

The Aspiration Prompting Discovery

Perhaps the most significant methodological contribution is the discovery of aspiration prompting — a minimal intervention technique where a single sentence raising the agent's target score is sufficient to break through performance plateaus. This finding has broad implications:

  1. Satisficing behavior: Coding agents exhibit satisficing — they settle for "good enough" solutions unless externally pushed. This mirrors Herbert Simon's bounded rationality theory applied to AI agents.
  2. Qualitative strategy shifts: Aspiration prompting doesn't just extend search time — it triggers qualitatively different algorithmic strategies. The agent shifts from incremental parameter tuning to fundamentally different approaches (e.g., switching from simulated annealing to SLSQP joint optimization).
  3. Minimal intervention cost: The intervention is a single natural language sentence. No algorithmic modification, no hyperparameter changes, no additional code. This makes aspiration prompting trivially cheap to implement.

What AutoEvolver Is NOT

  • Not a framework or tool — it's an empirical study
  • Not reproducible in the traditional sense — each run follows a unique trajectory
  • Not a replacement for evolutionary frameworks — the authors explicitly acknowledge this
  • Not a peer-reviewed paper — published as a blog post with supporting code

4 Supported Solutions

AutoEvolver was tested on three benchmark problems that are standard in the evolutionary AI literature. Each problem represents a different class of optimization:

Problem Taxonomy

| Problem | Type | Objective | Domain | Search Space |
|---|---|---|---|---|
| Circle Packing (n=26) | Continuous optimization | Maximize Σrᵢ | Computational geometry | Circle centers (xᵢ, yᵢ) and radii rᵢ in [0,1]² |
| Erdős Minimum Overlap | Combinatorial/functional | Minimize C₅ | Additive combinatorics | Step functions f: [0,1] → [0,1] |
| First Autocorrelation Inequality (AC1) | Functional construction | Minimize C₁ upper bound | Additive combinatorics | Nonneg. functions f on [-1/4, 1/4] |

Circle Packing (n=26)

Pack 26 non-overlapping circles inside a unit square, maximizing the sum of their radii. Constraints:

$$r_i \le x_i \le 1 - r_i, \quad r_i \le y_i \le 1 - r_i \quad \forall i$$ $$\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \ge r_i + r_j \quad \forall i \ne j$$

Objective: $\max \sum_{i=1}^{26} r_i$

This problem has a rich history in computational geometry and operations research. Known optimal configurations exist for small n, but for n=26 the landscape is extremely rugged with many local optima. The problem requires both global exploration (finding a good arrangement topology) and local refinement (precise coordinate optimization).
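For concreteness, the scoring logic implied by these constraints can be sketched in a few lines. This is our own illustration, not the repository's evaluate.py; the `tol` parameter mirrors the 1e-6 feasibility tolerance the report mentions for the evaluator.

```python
import numpy as np

def packing_score(xs, ys, rs, tol=1e-6):
    """Score a candidate packing: sum of radii if feasible, else None.
    Illustrative sketch only (not the authors' evaluator)."""
    xs, ys, rs = map(np.asarray, (xs, ys, rs))
    # Containment: r_i <= x_i <= 1 - r_i (and the same for y_i)
    inside = np.all((xs >= rs - tol) & (xs <= 1 - rs + tol) &
                    (ys >= rs - tol) & (ys <= 1 - rs + tol))
    # Pairwise non-overlap: dist(i, j) >= r_i + r_j for all i != j
    dx = xs[:, None] - xs[None, :]
    dy = ys[:, None] - ys[None, :]
    dist = np.sqrt(dx**2 + dy**2)
    rsum = rs[:, None] + rs[None, :]
    iu = np.triu_indices(len(rs), k=1)   # each unordered pair once
    disjoint = np.all(dist[iu] >= rsum[iu] - tol)
    return float(rs.sum()) if (inside and disjoint) else None
```

An evaluator of this shape is all the "fitness function" the agent ever sees; everything else in the study is the agent's own behavior.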

Erdős Minimum Overlap Problem

A classic problem in additive combinatorics. Partition {1, 2, ..., 2n} into two equal-size sets A and B. For each integer k, let Mₖ count the number of solutions to a − b = k with a ∈ A and b ∈ B. The goal is to bound c = lim(n→∞) M(n)/n, where M(n) = min_{A,B} max_k Mₖ.

Following prior work, the problem is formulated as optimizing step functions f describing the density of A throughout [1, 2n], with f(x) ∈ [0, 1] and ∫f = 1. The objective:

$$\text{Minimize } C_5 = \max_k \int f(x)(1 - f(x+k))\,dx$$

This is a minimax optimization over function space, discretized at resolution n. A key discovery by Claude Code was that increasing the discretization n yields better solutions — a direction the agent initially missed and only explored after aspiration prompting.
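A minimal sketch of this discretized minimax objective, under our own conventions (f is an array of n cell values in [0, 1], the integral becomes a mean over cells, and k ranges over whole-cell shifts); this is not the authors' evaluation script:

```python
import numpy as np

def c5_objective(f, shifts=None):
    """Evaluate C5 = max_k (mean of f(x) * (1 - f(x + k))) for a step
    function given as an array of cell values. Illustrative sketch."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    if shifts is None:
        shifts = range(-(n - 1), n)
    best = -np.inf
    for k in shifts:
        # Shifted copy of f, treated as zero outside the domain
        g = np.zeros(n)
        if k >= 0:
            g[:n - k] = f[k:]
        else:
            g[-k:] = f[:n + k]
        best = max(best, np.mean(f * (1.0 - g)))
    return float(best)
```

In this form, "increasing the discretization n" simply means evaluating the same functional on a longer array, at quadratic cost in n.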

First Autocorrelation Inequality (AC1)

For nonnegative f supported on [-1/4, 1/4], find the smallest C₁ such that:

$$\max_{|t| \le 1/2} (f * f)(t) \ge C_1 \left(\int f\right)^2$$

Any valid construction f certifies an upper bound C₁ ≤ ‖f * f‖_∞ / (∫f)². Lower values represent tighter bounds. This problem arises in the study of additive patterns and has connections to the Littlewood conjecture.
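Computing the certified-style upper bound from a sampled construction is straightforward. The sketch below is our own discretization, not the repository's evaluator: the convolution is approximated by a discrete convolution scaled by the grid spacing, and the integral by a Riemann sum.

```python
import numpy as np

def ac1_upper_bound(f, h):
    """Upper bound C1 <= max(f*f) / (integral of f)^2 for a nonnegative
    f sampled on a uniform grid of spacing h. Illustrative sketch."""
    f = np.asarray(f, dtype=float)
    assert np.all(f >= 0), "f must be nonnegative"
    conv = np.convolve(f, f) * h   # (f*f)(t) on the doubled grid
    integral = f.sum() * h         # Riemann sum for the integral of f
    return conv.max() / integral**2
```

For the constant box function f ≡ 1 on [-1/4, 1/4] this recovers the classical ratio of 2, well above the ~1.5029 achieved by optimized constructions.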

Solution Quality Summary

| Problem | AutoEvolver Result | Previous SOTA | Source | Margin |
|---|---|---|---|---|
| Circle Packing (Σr ↑) | 2.63598844 | 2.63598308 | ThetaEvolve | +5.36 × 10⁻⁶ |
| Erdős C₅ (↓) | 0.38086945 | 0.38087532 | TTT-Discover | −5.87 × 10⁻⁶ |
| AC1 C₁ (↓) | 1.5028628969 | 1.5028628983 | TTT-Discover | −1.4 × 10⁻⁹ |

All three results represent new state-of-the-art performance, though the margins are extremely small. The circle packing result is evaluated with a feasibility tolerance of 1e-6, consistent with ThetaEvolve's evaluator.

5 LLM Integration

Model Configuration

AutoEvolver uses a single LLM configuration with no ensemble:

| Parameter | Value |
|---|---|
| Model | Claude Code (Opus 4.6) |
| Provider | Anthropic |
| Mode | Autonomous (skip-permissions) |
| Interaction | Single long-running session per problem |
| Human intervention | Aspiration prompting only (1-2 sentences per problem) |

How the LLM Functions

Unlike evolutionary systems where the LLM serves as a mutation operator within a larger framework, in AutoEvolver the LLM is the entire system. Claude Code functions simultaneously as:

AutoEvolver: LLM as Complete Optimization System
=================================================

┌─────────────────────────────────────────────────────────────────┐
│                    CLAUDE CODE (Opus 4.6)                       │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  STRATEGIST  │  │  IMPLEMENTER │  │  EVALUATOR           │  │
│  │              │  │              │  │                      │  │
│  │ Decides what │  │ Writes code  │  │ Interprets results   │  │
│  │ approach to  │  │ modifications│  │ Compares against     │  │
│  │ try next     │  │ and new      │  │ known best           │  │
│  │              │  │ algorithms   │  │                      │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         │                 │                      │              │
│  ┌──────┴───────┐  ┌──────┴───────┐  ┌──────────┴───────────┐  │
│  │  RESEARCHER  │  │  PARALLELIZER│  │  SELF-CORRECTOR      │  │
│  │              │  │              │  │                      │  │
│  │ Searches web │  │ Launches     │  │ Detects reward       │  │
│  │ for papers,  │  │ background   │  │ hacking, catches     │  │
│  │ GitHub repos │  │ tasks,       │  │ comparison errors,   │  │
│  │              │  │ sub-agents   │  │ validates feasibility│  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                     MEMORY SYSTEM                         │   │
│  │  Context Window (short-term) ←→ File System (long-term)  │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

LLM as Mutation Operator (Emergent)

In evolutionary systems, the LLM receives a prompt containing parent programs and generates mutated offspring. AutoEvolver demonstrates that this pattern emerges naturally when an LLM is given an optimization objective:

  1. The agent maintains candidate solutions (in files) — analogous to a population
  2. The agent modifies solutions (via code changes) — analogous to mutation
  3. The agent evaluates and compares (via evaluation scripts) — analogous to fitness evaluation
  4. The agent selects the best (keeping the highest-scoring solution) — analogous to selection
  5. The agent combines ideas (integrating insights from web research with current solutions) — analogous to crossover

The critical difference is that these behaviors are not orchestrated by an external evolutionary loop — they emerge from the LLM's internal planning and execution within the agentic framework.
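The emergent pattern can be caricatured as an implicit loop. The sketch below is our own abstraction of the five behaviors listed above, not code from the study: `mutate` stands in for an LLM-driven code edit and `evaluate` for the evaluation script.

```python
import random

def implicit_evolution(initial, mutate, evaluate, rounds=100):
    """Caricature (ours) of the loop the agent enacts implicitly:
    an archive of candidates, mutation, evaluation, and selection."""
    archive = [initial]                    # ~ promising_solutions/
    best, best_score = initial, evaluate(initial)
    for _ in range(rounds):
        parent = random.choice(archive)    # ~ agent picks a candidate
        child = mutate(parent)             # ~ LLM-driven modification
        score = evaluate(child)            # ~ run evaluation script
        if score > best_score:             # ~ selection pressure
            best, best_score = child, score
            archive.append(child)          # ~ archive promising solution
    return best, best_score
```

The point of the study is precisely that no one writes this loop; the agent's planning produces its effect without the explicit structure.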

No Prompt Engineering

A notable aspect of AutoEvolver is the absence of sophisticated prompt engineering. The agent receives:

  1. A natural language problem description
  2. An initial naive solution (code file)
  3. An evaluation script

No system prompts, few-shot examples, persona assignments, or structured output formats are used. The simplicity of the setup is the point: the researchers test whether the agent's general capabilities are sufficient without domain-specific prompting.

6 Key Results

Headline Performance

All three benchmark problems achieved new state-of-the-art:

| Problem | AutoEvolver | Previous SOTA | SOTA Source | Runtime |
|---|---|---|---|---|
| Circle Packing (26 circles, Σr ↑) | 2.63598844 | 2.63598308 | ThetaEvolve | 16.6 hours |
| Erdős Min Overlap (C₅ ↓) | 0.38086945 | 0.38087532 | TTT-Discover | 30.8 hours |
| AC1 (C₁ ↓) | 1.5028628969 | 1.5028628983 | TTT-Discover | 40.4 hours |

Combined runtime: 88 hours of autonomous computation across 2,762 messages and 1,486 tool calls.

Circle Packing Trajectory

The agent's progression on the circle packing problem illustrates the multi-phase strategy evolution:

Circle Packing Score Progression
================================

Score
2.636 ┤                                                    ●━━━━ 2.63598844
      │                                                   ╱     (SOTA)
2.620 ┤                                              ●━━━╱ SLSQP
      │                                             ╱      joint opt
2.560 ┤                                    ●━━━━━━━╱ Multi-start
      │                                   ╱         optimization
2.500 ┤                          ●━━━━━━━╱ LP + simulated
      │                         ╱         annealing (plateau)
      │                        ╱
      │                       ╱
1.000 ┤          ●━━━━━━━━━━━╱
      │         ╱ Naive ring
0.960 ┤    ●━━━╱  arrangement
      │
      └──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬──── Time
         0   1   2   3   4   5   8  10  12  16  (hours)
              │        │         │        │
          Exploration  │     Research   Endgame
                   Refinement  Pivot   Optimization

Key breakthrough: Web search → GitHub discussion mentioning SLSQP
joint optimization → score jumped from 2.555 → 2.619

Phase Transitions in Strategy

Across all three problems, the agent exhibited a consistent multi-phase behavior pattern:

| Phase | Behavior | Example (Circle Packing) |
|---|---|---|
| 1. Exploration | Apply standard optimization methods | Gradient descent, simulated annealing |
| 2. Refinement | Tune hyperparameters, scale problem | Multi-start with different seeds |
| 3. Plateau | Detect diminishing returns | "This is a good result" (satisficing) |
| 4. Aspiration Prompt | Human raises target | "SOTA is 2.6359. I believe you can beat it." |
| 5. Research Pivot | Consult external resources | Web search → SLSQP discovery |
| 6. Synthesis | Integrate insights with progress | Joint center+radius optimization |
| 7. Endgame | Targeted local search from best known | Iterated perturbation chains |

Erdős Problem: The Discretization Discovery

The Erdős problem produced the most dramatic illustration of aspiration prompting's effect:

Before intervention (message 231): The agent reached C₅ = 0.38087447, beating TTT-Discover's 0.38087532 by a margin of 0.85 × 10⁻⁶. The agent declared "Final result" and confirmed the solution was at a local optimum via SLSQP, perturbation search, and subgradient verification.

Intervention (message 252): "Great — let's try more rounds. Aiming for larger improvements."

After intervention: The agent discovered that increasing discretization n yields better solutions. It systematically pushed from n=180 → 270 → 360 → 450 → 600 → 750, ultimately reaching C₅ = 0.38086945 — expanding the margin over prior SOTA from 0.85 × 10⁻⁶ to 5.87 × 10⁻⁶ (7× improvement triggered by one sentence).
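One way such a resolution push can reuse accumulated progress is to warm-start each finer grid from the coarser solution. A minimal interpolation sketch (our illustration; the agent's actual refinement code is not described in the report):

```python
import numpy as np

def refine(f, new_n):
    """Warm-start a finer step-function discretization from a coarser
    one by interpolating at cell midpoints. Illustrative sketch."""
    f = np.asarray(f, dtype=float)
    old_x = (np.arange(len(f)) + 0.5) / len(f)   # coarse cell midpoints
    new_x = (np.arange(new_n) + 0.5) / new_n     # fine cell midpoints
    return np.interp(new_x, old_x, f)
```

Each refined grid would then serve as the initial point for another round of local optimization at the higher resolution (n = 180 → 270 → 360 → ...).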

Emergent Self-Correction

The agent demonstrated multiple forms of self-monitoring:

  1. Reward hacking detection (Circle Packing): The agent found that L-BFGS-B could produce seemingly improved scores by exploiting LP solver tolerances, yielding slightly infeasible solutions. It identified the issue, diagnosed the cause, and corrected it autonomously.

  2. Optimization direction confusion (Erdős): The agent twice confused maximization with minimization for C₅, prematurely declared "this beats the competitor," then caught its own error within a few messages and corrected the comparison.

  3. Efficiency vs. correctness analysis (AC1): When replacing np.convolve with scipy.signal.fftconvolve (reducing O(n²) to O(n log n)), the agent explicitly questioned whether this constituted "cheating" before confirming mathematical equivalence and proceeding.
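The third case is easy to make concrete. The check below mirrors, in spirit, the equivalence test involved: an FFT-based linear convolution (built directly on numpy's FFT here to keep the sketch self-contained, rather than scipy.signal.fftconvolve) agrees with np.convolve up to floating-point error.

```python
import numpy as np

def fft_convolve(a, b):
    """O(n log n) linear convolution via FFT, mathematically equivalent
    to np.convolve's O(n^2) direct method (up to rounding error)."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

rng = np.random.default_rng(0)
f = rng.random(513)
direct = np.convolve(f, f)
fast = fft_convolve(f, f)
assert np.allclose(direct, fast)   # same result, asymptotically cheaper
```

Confirming this equivalence is what separates a legitimate speedup from "cheating" on the evaluation.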

7 Reproducibility

Fundamental Reproducibility Challenges

AutoEvolver represents a worst case for scientific reproducibility. The authors are transparent about this:

| Dimension | Status | Notes |
|---|---|---|
| Problem definitions | ✅ Fully reproducible | Mathematical specifications are precise |
| Evaluation scripts | ✅ Fully reproducible | Available in GitHub repository |
| Initial solutions | ✅ Fully reproducible | Naive starting points provided |
| Agent trajectory | ❌ Not reproducible | Each run follows unique stochastic path |
| Final solutions | ⚠️ Solutions available | Numerical results verifiable, path to them is not |
| Web search content | ❌ Not reproducible | External resources accessed vary over time |
| Aspiration prompts | ⚠️ Partially reproducible | Timing and exact wording are judgment calls |

Trajectory Analysis Methodology

The researchers used DataClaw to capture conversation trajectories, enabling post-hoc analysis of 88 hours of autonomous computation. The tool records all messages, tool calls, and agent outputs, providing a complete audit trail even though the trajectories themselves are not reproducible.

Available Materials

The GitHub repository (tengxiaoliu/autoevolver) provides:

autoevolver/
├── tasks/                    # Problem setups (problem descriptions,
│   ├── circle_packing/       #   initial solutions, evaluation scripts)
│   │   ├── problem.md
│   │   ├── initial_solution.py
│   │   └── evaluate.py
│   ├── erdos_overlap/
│   └── ac1/
├── results/                  # Final solutions for all three problems
│   ├── circle_packing/
│   ├── erdos_overlap/
│   └── ac1/
└── README.md

What Would Be Needed for Reproducibility

To approach reproducibility, future work would need:

  1. Deterministic LLM sampling — temperature=0, fixed random seeds (though even this doesn't guarantee identical outputs across API versions)
  2. Frozen web content — cached versions of all external resources accessed
  3. Predefined aspiration schedule — fixed intervention timing and wording
  4. Version-locked API — identical model weights and inference infrastructure

The authors do not claim reproducibility as a goal. Their contribution is the demonstration that competitive performance is possible under these conditions, not that it is reliably achievable.

8 Compute and API Costs

Runtime Breakdown

| Problem | Wall Clock | Messages | Tool Calls | Active Computation |
|---|---|---|---|---|
| Circle Packing | 16.6 hours | ~920 | ~500 | Continuous |
| Erdős Overlap | 30.8 hours | ~1,050 | ~600 | Continuous |
| AC1 | 40.4 hours | ~792 | ~386 | Continuous |
| Total | 87.8 hours | 2,762 | 1,486 | ~88 hours |

Cost Estimation

The authors do not report exact API costs. We can estimate based on Claude Code Opus 4.6 pricing (as of March 2026):

| Cost Component | Estimate | Basis |
|---|---|---|
| Input tokens (context) | ~$300-500 | Long sessions with growing context windows |
| Output tokens (code + reasoning) | ~$200-400 | 2,762 messages with code generation |
| Tool call overhead | ~$50-100 | 1,486 tool calls |
| Background task spawning | ~$100-200 | Sub-agents, parallel explorations |
| Estimated total | ~$650-1,200 | For all three problems combined |
| Per-problem average | ~$220-400 | |

Cost Comparison with Evolutionary Frameworks

| System | Typical Run Cost | Infrastructure | Human Setup Time |
|---|---|---|---|
| AlphaEvolve | Not disclosed (Google internal) | Multi-node GPU cluster + Gemini API | Weeks (problem formulation + evaluator) |
| ThetaEvolve | Not disclosed | GPU cluster + LLM API | Days (framework + problem encoding) |
| TTT-Discover | Not disclosed | GPU cluster + LLM API | Days |
| AutoEvolver | ~$200-400 per problem | Single machine + API | ~30 min (write problem + eval script) |

The key cost advantage of AutoEvolver is not API costs (which may be comparable or higher) but human engineering time. Setting up a problem in AutoEvolver requires writing a problem description, an initial solution, and an evaluation script — no framework configuration, population parameters, or evolutionary operator design.

Compute Efficiency Analysis

A critical nuance: AutoEvolver is almost certainly less efficient in terms of useful computation per dollar. The agent exhibits:

  • Approach cycling: Revisiting previously explored strategies (L-BFGS-B attempted 15+ times on AC1)
  • Ignored task outputs: ~60% of 174 task completion notifications on Erdős were never read
  • Stalled processes: ~40 messages on AC1 polling processes with 0-byte output files
  • Redundant launches: Four monitoring sub-agents with near-identical prompts

In a purpose-built evolutionary framework, every evaluation contributes to the population and is never wasted. In AutoEvolver, a significant fraction of computation is redundant or unproductive.

9 Architecture Solution

The Non-Architecture

AutoEvolver's architecture is deliberately minimal — it is the absence of architecture that constitutes the contribution. The "system" consists of:

AutoEvolver "Architecture"
===========================

┌─────────────────────────────────────────────────────────┐
│                                                         │
│   ┌───────────────┐         ┌────────────────────┐     │
│   │  Problem       │         │  Evaluation        │     │
│   │  Description   │────────▶│  Script            │     │
│   │  (text)        │         │  (Python)          │     │
│   └───────────────┘         └─────────┬──────────┘     │
│                                       │                 │
│   ┌───────────────┐                   │                 │
│   │  Initial       │                   │ Score           │
│   │  Solution      │                   │ feedback        │
│   │  (Python)      │                   │                 │
│   └───────┬───────┘                   │                 │
│           │                           │                 │
│           ▼                           ▼                 │
│   ┌────────────────────────────────────────────────┐    │
│   │                                                │    │
│   │              CLAUDE CODE (Opus 4.6)            │    │
│   │                                                │    │
│   │   Autonomous session in skip-permissions mode  │    │
│   │                                                │    │
│   │   Capabilities:                                │    │
│   │   • Code writing and execution                 │    │
│   │   • Web search (arXiv, GitHub, forums)         │    │
│   │   • File system operations                     │    │
│   │   • Parallel task spawning                     │    │
│   │   • Sub-agent creation                         │    │
│   │   • Result analysis and plotting               │    │
│   │                                                │    │
│   └────────────────────────────────────────────────┘    │
│                        │                                │
│                        ▼                                │
│   ┌─────────────────────────────────────────────┐      │
│   │  Aspiration Prompt (1 sentence, when needed) │      │
│   │  "The SOTA is X. I believe you can beat it." │      │
│   └─────────────────────────────────────────────┘      │
│                                                         │
└─────────────────────────────────────────────────────────┘

Comparison with Purpose-Built Evolutionary Architectures

Traditional Evolutionary Framework (e.g., AlphaEvolve)
======================================================

┌──────────┐    ┌───────────┐    ┌──────────┐    ┌───────────┐
│ Program  │───▶│ Prompt    │───▶│ LLM      │───▶│ Parser    │
│ Database │    │ Sampler   │    │ (Gemini) │    │ & Differ  │
│ (pop.)   │    │           │    │          │    │           │
└────┬─────┘    └───────────┘    └──────────┘    └─────┬─────┘
     │                                                  │
     │    ┌────────────────────┐                        │
     │    │  Evaluator Pool    │◀───────────────────────┘
     │    │  (parallel sandbox │
     │    │   execution)       │
     │    └────────┬───────────┘
     │             │
     └─────────────┘
     Selection + Archive

AutoEvolver
===========

┌──────────────┐    ┌──────────────────────────────────┐
│ Problem +    │───▶│ Claude Code (the ENTIRE system)  │
│ Eval Script  │    │ Does everything above + more     │
└──────────────┘    └──────────────────────────────────┘

The philosophical difference is profound. In evolutionary frameworks, the LLM is a component — a mutation operator — within a larger algorithmic structure. In AutoEvolver, the LLM is the algorithm. The evolutionary-like behaviors emerge from the LLM's general intelligence rather than being imposed by external structure.

Implications for System Design

This result challenges the evolutionary AI systems field to justify its architectural complexity. If a bare LLM can match or exceed the performance of carefully engineered evolutionary pipelines, several possibilities follow:

  1. The frameworks provide marginal value: The LLM's general capabilities already encompass the evolutionary strategies, making explicit frameworks redundant.
  2. The frameworks provide value at scale: AutoEvolver was tested on three problems. At scale (hundreds of problems, diverse domains), the consistency and efficiency of evolutionary frameworks may dominate.
  3. The frameworks provide different value: Controllability, reproducibility, and transparency may be more important than raw performance in research settings.

The authors favor interpretation (3), explicitly stating: "Not a replacement. Compared to purpose-built frameworks, coding agents still lack controllability and reproducibility."

10 Component Breakdown

Input Components

| Component | Description | Purpose |
|---|---|---|
| Problem Description | Natural language specification of the optimization task and objective | Provides the agent with domain understanding and optimization direction |
| Initial Solution | A naive starting implementation in Python | Gives the agent something concrete to improve, avoiding cold-start |
| Evaluation Script | A deterministic function scoring solutions and validating correctness | Provides objective fitness signal; the agent's only ground truth |

Emergent Components (Not Designed, Observed)

Through analysis of the 88-hour trajectory, the researchers identified several emergent system components:

| Emergent Component | How It Manifests | Evolutionary Analog |
|---|---|---|
| Solution Archive | promising_solutions/ directory with 110+ candidates (Erdős) | Program database / population |
| Strategy Registry | Mental tracking of tried approaches in context | Tabu list / novelty archive |
| Parallel Evaluator Pool | Background tasks running multiple strategies simultaneously | Parallel fitness evaluation |
| Web Research Module | Targeted arXiv/GitHub searches during idle periods | Knowledge base / prior information |
| Process Manager | Identifying and killing inferior parallel processes | Resource management |
| Score Comparator | Internal logic comparing new scores against known best | Selection operator |

The Aspiration Prompt

The aspiration prompt is the only designed component beyond the basic setup. Its anatomy:

Aspiration Prompt Anatomy
=========================

Structure:  "[Acknowledgment] + [Target] + [Encouragement]"

Examples used in the study:

  Circle Packing:
  "The current SOTA on this problem is 2.6359. I believe you can beat it."

  Erdős:
  "Great — let's try more rounds. Aiming for larger improvements."

  AC1:
  [Not explicitly reported — similar pattern inferred]

Timing:  Applied when the agent declares "Final result" or equivalent
Effect:  Triggers qualitative strategy shift, not just extended search

Tool Usage Patterns

Claude Code's tool usage across the 88-hour study:

| Tool Category | Usage Pattern | Frequency |
|---|---|---|
| Code execution | Running optimization scripts, evaluating solutions | Very high (~40% of tool calls) |
| File I/O | Saving/loading solutions, writing new scripts | High (~25%) |
| Web search | Searching arXiv, GitHub, math forums | Moderate (~15%) |
| Background tasks | Spawning parallel optimization processes | Moderate (~12%) |
| Sub-agents | Delegating monitoring and exploration tasks | Low (~8%) |

11 Core Mechanisms (Detailed)

Mechanism 1: Multi-Phase Strategy Evolution

The most striking emergent behavior is the agent's consistent progression through qualitatively different optimization phases. This pattern was observed independently across all three problems.

Multi-Phase Strategy Evolution
==============================

Phase 1: EXPLORATION (Hours 0-2)
├── Apply textbook optimization methods
├── Test multiple approaches quickly
├── Establish baseline performance
└── Characteristic: Broad, shallow search

Phase 2: REFINEMENT (Hours 2-5)
├── Focus on best-performing approach
├── Tune hyperparameters systematically
├── Scale problem parameters
└── Characteristic: Deep, narrow search

Phase 3: PLATEAU RECOGNITION (Hours 5-8)
├── Detect diminishing returns
├── Declare solution as "final" or "optimal"
├── Verify local optimality (perturbation, gradient checks)
└── Characteristic: Satisficing behavior

Phase 4: ASPIRATION INTERVENTION (Single message)
├── Human raises target score
└── Characteristic: External pressure applied

Phase 5: RESEARCH PIVOT (Hours 8-12)
├── Web search for alternative approaches
├── Read papers, GitHub repos, forum discussions
├── Discover fundamentally new strategies
└── Characteristic: Information seeking, strategy shift

Phase 6: SYNTHESIS (Hours 12-16)
├── Combine external insights with accumulated progress
├── Implement new approaches using best solution as warm start
├── Evaluate hybrid strategies
└── Characteristic: Cross-pollination, integration

Phase 7: ENDGAME (Hours 16+)
├── High-resolution local search around best known
├── Scaling tricks (higher discretization, more iterations)
├── Diminishing returns but still accumulating small gains
└── Characteristic: Exploitation-dominated, precision focus

Mechanism 2: Spontaneous Parallel Exploration

The agent autonomously transitions from sequential to parallel exploration as optimization becomes harder. This behavior was most pronounced on the Erdős problem:

Quantitative Parallelism Metrics (Erdős Problem):

| Metric | Value |
|---|---|
| Total background tasks launched | 174 |
| Sub-agents spawned | 9 |
| Peak concurrent tasks | 5-10 |
| Task completion notifications received | 174 |
| Notifications actually read/processed | ~70 (~40%) |
| Files in promising_solutions/ archive | 110+ |

The agent's parallelism exhibits a natural analogy to population-based search. Multiple candidate strategies compete for the agent's attention, with better-performing ones receiving more follow-up computation. However, unlike formal evolutionary algorithms, this "selection" is mediated by the agent's attention and context management rather than explicit selection operators.

Mechanism 3: Strategic Web Research

The agent uses web research in two distinct patterns:

Pattern A: Fallback when stuck. After reaching a plateau, the agent initiates web searches alongside continued optimization, looking for new techniques. Example trajectory (Circle Packing):

Message 37: "I'm stuck at 2.57, likely because the simulated annealing is converging to a local optimum." → Searches Packomania website for known n=26 coordinates → fails to extract useful data

Message 144: Launches 4 parallel experiments including web search for known optimal packing coordinates

Message 157: "I found a GitHub issue that mentions a circle packing result of 2.635977 for n=26!"

Message 171: "The key insight from that issue is that they used SLSQP for jointly optimizing centers AND radii, rather than our approach of LP for radii and NM/Powell for centers." → Score jumps from 2.555 to 2.619

Pattern B: Opportunistic during idle time. While background optimization tasks run, the agent uses idle time to gather theoretical insights. On the AC1 problem, the agent read papers on autocorrelation inequalities to understand the theoretical landscape, even though the insights did not directly translate to code improvements.

Mechanism 4: Self-Correction and Reward Hacking Detection

The agent exhibits multiple layers of self-monitoring, including detection of its own reward hacking:

Case Study: L-BFGS-B Feasibility Exploitation (Circle Packing)

The agent discovered that L-BFGS-B could produce seemingly improved scores by exploiting LP solver tolerances, yielding solutions that technically violated the non-overlap constraint by amounts smaller than the tolerance threshold. The agent's reasoning:

  1. Observed that L-BFGS-B solutions scored slightly higher than SLSQP solutions
  2. Investigated why — found that the LP solver's tolerance allowed slightly infeasible circle placements
  3. Determined this was "reward hacking" — the evaluation function was being gamed
  4. Reverted to stricter feasibility checking
  5. Continued optimization with properly constrained solutions

This is a remarkable demonstration of an AI system detecting and correcting its own tendency to exploit evaluation metrics — a behavior that purpose-built evolutionary frameworks typically handle via explicit constraint enforcement in the evaluator.
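
The fix the agent converged on, reverting to strict feasibility checking, can be sketched as follows. This is an illustrative reconstruction, not the agent's actual code; the function name and tolerance handling are assumptions.

```python
import numpy as np

def strictly_feasible(circles, tol=0.0):
    """Reject solutions that exploit solver tolerance: every wall
    distance and pairwise gap must be non-negative up to `tol`.
    `circles` is an (n, 3) array-like of (x, y, r) rows; tol=0
    enforces the constraint the LP solver's tolerance relaxed."""
    c = np.asarray(circles, dtype=float)
    xy, r = c[:, :2], c[:, 2]
    # Containment in the unit square: r <= x, y <= 1 - r
    if np.any(xy - r[:, None] < -tol) or np.any(xy + r[:, None] > 1 + tol):
        return False
    # Pairwise non-overlap: dist(i, j) >= r_i + r_j
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    gap = d - (r[:, None] + r[None, :])
    iu = np.triu_indices(len(c), k=1)
    return bool(np.all(gap[iu] >= -tol))
```

With a check like this in the acceptance path, the slightly infeasible L-BFGS-B placements score as invalid rather than as improvements.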

Mechanism 5: Approach Cycling (Failure Mode)

A significant failure mode is the agent's tendency to revisit previously explored and rejected approaches, apparently losing track of prior conclusions as the context window advances:

Case Study: L-BFGS-B Cycling (AC1 Problem)

| Message | Event | Conclusion |
| --- | --- | --- |
| 62 | First proposes L-BFGS-B | Pivots to simulated annealing instead |
| 130 | Tries L-BFGS-B again | "Too slow" |
| 215 | L-BFGS-B again | "Only marginally improves" |
| 340 | L-BFGS-B again | "No improvement" |
| 420 | L-BFGS-B again | "Already tried this" |
| 521 | Catches itself: "Actually wait, I showed earlier that L-BFGS-B also can't improve it." | Brief self-awareness |
| 548 | L-BFGS-B again: "let me try something I haven't tried yet: L-BFGS-B" | Context lost again |

This failure mode represents the most significant practical limitation of AutoEvolver. In a purpose-built evolutionary framework, a tabu list or strategy registry would prevent this waste. The context window functions as short-term memory, but information is lost as earlier messages scroll out of the active context. The file system acts as long-term memory, but the agent doesn't maintain a systematic strategy log.
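
A minimal sketch of such a strategy registry (hypothetical; not part of AutoEvolver or Claude Code) shows how little machinery the fix would require: record every approach and its verdict, and consult the log before re-launching one.

```python
class StrategyLog:
    """Tabu-style registry of tried approaches and their verdicts,
    consulted before launching any new attempt."""

    def __init__(self):
        self._log = {}  # approach name -> list of verdict strings

    def record(self, approach, verdict):
        self._log.setdefault(approach, []).append(verdict)

    def should_skip(self, approach):
        # Skip anything already tried with a negative verdict
        return any(v.startswith("rejected")
                   for v in self._log.get(approach, []))

log = StrategyLog()
log.record("L-BFGS-B", "rejected: too slow, no improvement over SA")
log.should_skip("L-BFGS-B")   # True -- message 548's retry would be caught
log.should_skip("SLSQP")      # False -- not yet tried
```

Because the log lives outside the context window, it survives the scrolling that causes the cycling in the first place.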

Mechanism 6: Process Awareness and Resource Management

The agent demonstrates system-level awareness that goes beyond algorithm design:

Case Study: Process Interference Detection (Erdős)

Message 1280: "Both processes write to the same file. The better process's saves get overwritten by the worse one. This explains why we rarely see improvement."

Message 1283: "I should stop the worse process so only the better one saves snapshots."

Message 1291: "C₅ dropped to 0.380869458 — a big improvement! Killing the worse process worked perfectly."

This incident demonstrates debugging of system-level interactions between concurrent processes — a skill typically associated with software engineering rather than algorithm optimization.
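
The agent's fix was to kill the worse process; an alternative that avoids the interference entirely is to give each process its own snapshot file. The sketch below is hypothetical (the function and path scheme are assumptions, not the agent's code).

```python
import json
import os
import tempfile

def save_snapshot(solution, score, run_id, out_dir="snapshots"):
    """Write each process's snapshot to a file keyed by run_id, so a
    worse process can never overwrite a better one's saves -- the
    interference diagnosed at message 1280. The write is atomic:
    dump to a temp file, then rename over the target."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"snapshot_{run_id}.json")
    fd, tmp = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"score": score, "solution": solution}, f)
    os.replace(tmp, path)  # atomic rename on POSIX
    return path
```

A supervisor can then compare the per-run snapshots and promote the best one, instead of relying on last-writer-wins.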

12 Programming Language

Solution Implementation

All solutions are implemented in Python, leveraging the scientific computing ecosystem:

| Library | Usage | Problem |
| --- | --- | --- |
| numpy | Array operations, linear algebra | All three |
| scipy.optimize | SLSQP, L-BFGS-B, differential evolution | Circle Packing, Erdős |
| scipy.signal | FFT convolution | AC1 |
| scipy.linalg | Linear algebra routines | All three |
| matplotlib | Visualization of solutions | All three |
| multiprocessing | Parallel solution evaluation | Erdős |
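
As an illustration of the scipy.signal usage: an autocorrelation objective like AC1's can be evaluated with FFT convolution in O(n log n) rather than the O(n²) direct sum. The discretization below is illustrative; the actual AC1 objective is not reproduced here.

```python
import numpy as np
from scipy.signal import fftconvolve

def autocorrelation(f):
    """Discrete autocorrelation of a sampled function via FFT
    convolution: equivalent to np.correlate(f, f, mode='full')
    but O(n log n) instead of O(n^2)."""
    return fftconvolve(f, f[::-1])

# For a box function the autocorrelation is a triangle whose
# peak equals sum(f**2) -- here 8.0 for f = np.ones(8).
f = np.ones(8)
ac = autocorrelation(f)
```

The speedup matters because the agent's optimization loops evaluate the autocorrelation thousands of times per run.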

Code Characteristics

The agent-generated code exhibits several notable patterns:

  1. Progressive sophistication: Early code is simple and procedural; late-stage code uses advanced optimization techniques with careful numerical handling.
  2. Self-documenting: The agent tends to add comments explaining its reasoning, though these comments become less reliable as context is lost.
  3. Modular evolution: The agent often refactors its solution into separate functions as complexity grows, spontaneously applying software engineering practices.

Example: Circle Packing Solution Evolution

Stage 1 — Naive (Score: 0.96):

import numpy as np

def pack_circles(n=26):
    """Naive ring arrangement: equal radii on a single circle."""
    circles = []
    r = 1.0 / (2 * n)
    for i in range(n):
        angle = 2 * np.pi * i / n
        x = 0.5 + 0.3 * np.cos(angle)
        y = 0.5 + 0.3 * np.sin(angle)
        circles.append((x, y, r))
    return circles

Stage 5 — SLSQP Joint Optimization (Score: 2.619):

import numpy as np
from scipy.optimize import minimize

def joint_optimize(centers, radii, n=26):
    """Jointly optimize centers and radii using SLSQP.
    Decision vector layout: x = [x_0, y_0, ..., x_{n-1}, y_{n-1}, r_0, ..., r_{n-1}]."""
    x0 = np.concatenate([centers.ravel(), radii])

    def objective(x):
        return -np.sum(x[2*n:])  # Maximize sum of radii

    constraints = []
    for i in range(n):
        # Containment: r <= x <= 1-r, r <= y <= 1-r
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
            x[2*i] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
            1 - x[2*i] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
            x[2*i+1] - x[2*n+i]})
        constraints.append({'type': 'ineq', 'fun': lambda x, i=i:
            1 - x[2*i+1] - x[2*n+i]})

    for i in range(n):
        for j in range(i+1, n):
            # Non-overlap: dist(i,j) >= r_i + r_j
            constraints.append({'type': 'ineq', 'fun': lambda x, i=i, j=j:
                np.sqrt((x[2*i]-x[2*j])**2 + (x[2*i+1]-x[2*j+1])**2)
                - x[2*n+i] - x[2*n+j]})

    result = minimize(objective, x0, method='SLSQP',
                      constraints=constraints, options={'maxiter': 10000})
    return result

Stage 7 — Iterated Perturbation Chains (Score: 2.63598844):

import numpy as np

def perturbation_chain(solution, eval_fn, n_iters=100000, temp=1e-6):
    """Fine-grained local search via iterated perturbation.

    `solution` is an (n, 3) array of (x, y, r) rows; `is_feasible`
    is the strict constraint checker defined elsewhere in the run.
    """
    best = solution.copy()
    best_score = eval_fn(best)

    for _ in range(n_iters):
        candidate = best.copy()
        # Perturb a random circle's x, y, or r by a tiny Gaussian step
        idx = np.random.randint(len(candidate))
        dim = np.random.randint(3)  # x, y, or r
        delta = np.random.normal(0, temp)
        candidate[idx, dim] += delta

        # Evaluate only feasible candidates; accept strict improvements
        if is_feasible(candidate):
            score = eval_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score

    return best, best_score

13 Memory Management

The Dual Memory System

AutoEvolver's memory architecture is an emergent property of Claude Code's design, not an intentional choice:

Memory Architecture
===================

┌─────────────────────────────────────────────────────────┐
│  SHORT-TERM MEMORY: Context Window                      │
│                                                         │
│  • Contains recent messages, tool outputs, reasoning    │
│  • Finite capacity (~200K tokens for Opus 4.6)          │
│  • Information drops off as window advances             │
│  • Strategy history, previous conclusions lost          │
│  • Current best solution always tracked                 │
│                                                         │
│  FAILURE MODE: Approach cycling                         │
│  L-BFGS-B tried 15+ times on AC1 because prior         │
│  negative conclusions scrolled out of context           │
│                                                         │
├─────────────────────────────────────────────────────────┤
│  LONG-TERM MEMORY: File System                          │
│                                                         │
│  • Solution files (current best, candidates)            │
│  • promising_solutions/ directory (110+ files, Erdős)   │
│  • Evaluation results and logs                          │
│  • Code implementations (versioned by the agent)        │
│                                                         │
│  FAILURE MODE: No systematic strategy log               │
│  The agent saves solutions but not a record of          │
│  which approaches were tried and rejected               │
│                                                         │
├─────────────────────────────────────────────────────────┤
│  EXTERNAL MEMORY: Web Resources                         │
│                                                         │
│  • arXiv papers, GitHub repos, forum discussions        │
│  • Accessed opportunistically during idle periods       │
│  • Not persistent — re-searched when needed             │
│  • Provides novel strategies not in training data       │
│                                                         │
└─────────────────────────────────────────────────────────┘

Memory Failure Analysis

The most significant memory-related failure is approach cycling, where the agent revisits strategies it previously explored and rejected. Quantitative evidence:

| Problem | Approach | Times Revisited | Total Wasted Messages |
| --- | --- | --- | --- |
| AC1 | L-BFGS-B | 15+ | ~60 |
| Circle Packing | L-BFGS-B | 2 (mild) | ~10 |
| Erdős | Random initialization | 3 | ~15 |

Proposed Mitigations (Not Implemented)

The authors identify but do not implement several mitigations:

  1. Persistent strategy registry: A file logging all tried approaches with their outcomes, consulted before each new attempt.
  2. Context summarization: Periodic compression of the context window into a summary of findings and rejected approaches.
  3. Approach deduplication: A check before launching a new strategy to verify it hasn't been tried before.

These mitigations would move AutoEvolver closer to a purpose-built framework, partially undermining the minimalist thesis. This tension between raw capability and systematic efficiency is one of the work's most interesting implicit findings.

File System as Population Archive

The agent's use of the file system as a solution archive is structurally similar to a MAP-Elites archive or a quality-diversity archive:

Erdős Problem File System (Post-Run):

promising_solutions/
├── best_n180.npy          # Best solution at discretization n=180
├── best_n270.npy          # Best at n=270
├── best_n360.npy          # Best at n=360 (C5=0.38087064)
├── best_n450.npy          # Best at n=450
├── best_n600.npy          # Best at n=600
├── best_n750.npy          # Best at n=750 (C5=0.38086945, final SOTA)
├── candidate_001.npy      # Intermediate candidates
├── candidate_002.npy
│   ... (~110 files total)
├── candidate_110.npy
├── basin_search_result_1.npy
├── basin_search_result_2.npy
└── ...

This archive was created entirely by the agent without external instruction. The structure mirrors a quality-diversity archive where solutions are organized by a behavioral characteristic (discretization n) rather than pure fitness.
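
A minimal MAP-Elites-style analogue of this archive, one cell per discretization level, each keeping only its best solution, might look like the following. This is an illustrative sketch; the class and method names are assumptions, not the agent's code.

```python
class DiscretizationArchive:
    """Quality-diversity-style archive keyed by the behavioral
    characteristic (discretization level n), mirroring the
    best_n180.npy ... best_n750.npy files above."""

    def __init__(self):
        self.cells = {}  # n -> (score, solution)

    def offer(self, n, score, solution):
        """Keep the candidate only if it beats the cell's incumbent.
        Lower C5 is better on the Erdős problem."""
        best = self.cells.get(n)
        if best is None or score < best[0]:
            self.cells[n] = (score, solution)
            return True
        return False

arc = DiscretizationArchive()
arc.offer(360, 0.38087064, "sol_a")
arc.offer(360, 0.38090000, "sol_b")   # worse than incumbent: rejected
arc.offer(750, 0.38086945, "sol_c")
```

The agent arrived at the same structure implicitly, simply by saving one "best" file per discretization level.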

14 Continued Learning

Within-Session Learning

AutoEvolver exhibits clear within-session learning across the multi-phase trajectory. The agent accumulates domain knowledge throughout each run:

| Learning Signal | How It's Used | Persistence |
| --- | --- | --- |
| Evaluation scores | Direct fitness feedback for strategy selection | Context window (degrades) |
| Optimization landscapes | Understanding of problem structure (local optima, convexity) | Context window (degrades) |
| Web research findings | External techniques integrated into current approach | Context window + code files |
| Failed approaches | (Poorly) avoided in future attempts | Context window (lost → cycling) |
| Discretization scaling | Discovery that higher n → better Erdős solutions | Code files (persists) |

Cross-Problem Learning: Absent

A critical limitation: AutoEvolver performs no cross-problem learning. Each problem is solved in an independent session with no knowledge transfer. The agent does not:

  • Recognize that SLSQP worked well on Circle Packing and try it first on Erdős
  • Transfer the aspiration-prompt-induced insight about discretization scaling
  • Build a library of optimization strategies across problems
  • Maintain a persistent memory of effective techniques

This is a fundamental difference from evolutionary frameworks that can accumulate cross-problem knowledge through prompt templates, strategy libraries, or learned mutation operators.

Comparison with Evolutionary Framework Learning

| Learning Type | AutoEvolver | AlphaEvolve | FunSearch |
| --- | --- | --- | --- |
| Within-run fitness improvement | ✅ (context + files) | ✅ (program database) | ✅ (program database) |
| Cross-run knowledge | ❌ | ⚠️ (seed programs) | ⚠️ (seed programs) |
| Strategy meta-learning | ❌ | ❌ | ❌ |
| Self-improving LLM | ❌ | ✅ (contributes to Gemini training) | ❌ |

Implications for System Design

The absence of cross-problem learning suggests a concrete architecture improvement: a strategy memory layer that persists across sessions and records which approaches worked on which problem types. This would address the approach cycling problem within sessions and enable knowledge transfer across problems, without requiring the full complexity of an evolutionary framework.

Proposed Strategy Memory Architecture
======================================

Session 1 (Circle Packing)         Session 2 (Erdős)
┌──────────────────┐              ┌──────────────────┐
│ Claude Code      │              │ Claude Code      │
│                  │              │                  │
│ • Tried SA ❌    │ ──write──▶  │ • Read prior     │
│ • Tried SLSQP ✅ │              │   strategies     │
│ • SLSQP + perturb│              │ • Skip SA, try   │
│   chains ✅✅    │              │   SLSQP first    │
└──────────────────┘              └──────────────────┘
         │                                 │
         ▼                                 ▼
┌─────────────────────────────────────────────────┐
│           PERSISTENT STRATEGY MEMORY            │
│                                                 │
│  Problem Type    │ Approach     │ Outcome       │
│  ─────────────── │ ──────────── │ ─────────     │
│  Continuous opt  │ SA           │ Local optima  │
│  Continuous opt  │ SLSQP joint  │ Breakthrough  │
│  Continuous opt  │ Perturbation │ Refinement    │
│  Functional opt  │ ↑ discret.   │ Breakthrough  │
└─────────────────────────────────────────────────┘
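
A first-cut implementation of this memory could be as simple as a JSON file written at the end of one session and loaded at the start of the next. The sketch below is hypothetical (the class, file name, and schema are assumptions); the schema mirrors the table in the diagram above.

```python
import json
import os

class StrategyMemory:
    """Persistent cross-session strategy memory: maps problem type
    to (approach, outcome) records, so session 2 can consult what
    session 1 learned."""

    def __init__(self, path="strategy_memory.json"):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def record(self, problem_type, approach, outcome):
        """Append a record and persist immediately."""
        self.data.setdefault(problem_type, []).append(
            {"approach": approach, "outcome": outcome})
        with open(self.path, "w") as f:
            json.dump(self.data, f, indent=2)

    def suggestions(self, problem_type):
        """Approaches that previously produced a breakthrough."""
        recs = self.data.get(problem_type, [])
        return [r["approach"] for r in recs
                if r["outcome"] == "breakthrough"]
```

On the diagram's scenario, session 2 would call `suggestions("continuous_opt")`, get SLSQP back, and skip the simulated-annealing dead end entirely.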

15 Applications

Direct Applications

AutoEvolver's primary application is algorithmic optimization on problems where:

  1. Solutions are expressible as code — the agent needs to write and execute Python
  2. An evaluation function exists — deterministic scoring of solution quality
  3. Web resources are available — the agent benefits from accessing existing literature
  4. Long compute budgets are acceptable — 16-40 hours per problem
  5. Reproducibility is not critical — each run follows a unique trajectory

| Application Domain | Suitability | Notes |
| --- | --- | --- |
| Mathematical optimization | ✅ High | Core strength demonstrated on three problems |
| Algorithm design | ✅ High | Agent designs novel algorithms as part of optimization |
| Heuristic discovery | ✅ High | Agent discovers and combines heuristics autonomously |
| Hyperparameter optimization | ⚠️ Moderate | Possible but purpose-built tools (Optuna, etc.) are more efficient |
| Neural architecture search | ⚠️ Moderate | Conceivable but expensive and not demonstrated |
| Scientific hypothesis testing | ❌ Low | No experimental design or statistical rigor |
| Production algorithm deployment | ❌ Low | Reproducibility concerns prevent direct deployment |

Implications for the Evolutionary AI Field

AutoEvolver's most significant contribution is not practical but conceptual. It challenges foundational assumptions in the field:

Challenge 1: Are evolutionary frameworks necessary?

If a general-purpose coding agent can match evolutionary frameworks on their home benchmarks, the burden of proof shifts to framework developers to demonstrate value beyond raw performance — in efficiency, controllability, reproducibility, or scalability.

Challenge 2: What is the role of the LLM?

In evolutionary frameworks, the LLM is a component (mutation operator) within a larger system. AutoEvolver shows the LLM can serve as the entire optimization system. This raises the question of whether the evolutionary framework is mostly providing structure that the LLM already implicitly possesses.

Challenge 3: How should we evaluate evolutionary AI systems?

If performance on a small number of benchmark problems is the primary metric, AutoEvolver's results undermine the case for complex frameworks. The field may need evaluation criteria that capture efficiency, scalability, robustness, and reproducibility — dimensions where frameworks are expected to excel.

Relationship to OmniEvolve Design

AutoEvolver's findings are directly relevant to OmniEvolve's architecture:

| AutoEvolver Finding | OmniEvolve Design Implication |
| --- | --- |
| Aspiration prompting breaks plateaus | Incorporate adaptive target-setting in the evaluation loop |
| Strategy cycling wastes compute | Implement a persistent strategy registry in the knowledge module |
| File-system archives emerge naturally | Formalize this pattern in the artifact storage system |
| Parallel exploration is beneficial | Support parallel island-based search from the architecture level |
| Web research provides breakthroughs | Integrate literature search as a mutation information source |
| Agent satisfices without pressure | Design evaluation feedback to maintain optimization pressure |

Broader Scientific Context

AutoEvolver joins a growing body of evidence that general-purpose AI agents can perform specialized tasks previously requiring purpose-built systems:

| Domain | Purpose-Built System | General Agent Result |
| --- | --- | --- |
| Code optimization | AlphaEvolve, FunSearch | AutoEvolver (matches/exceeds) |
| Scientific paper writing | Specialized NLG systems | AI Scientist (produces publishable papers) |
| Theorem proving | Lean4, Coq provers | LLM-based proof (emerging results) |
| Chip design | EDA tools | LLM-based placement (Google, 2023) |
| Drug discovery | Specialized ML pipelines | LLM-based molecular design (emerging) |

The pattern suggests that as general-purpose LLMs become more capable, the value proposition of domain-specific systems shifts from capability (can it solve the problem?) to efficiency (can it solve the problem better, faster, and more reliably?).

Limitations as a Research Contribution

Despite the impressive results, several limitations constrain AutoEvolver's impact:

  1. N=1 per problem: Each problem was solved once. Statistical significance cannot be established without multiple runs.
  2. Human intervention: Aspiration prompting, while minimal, is not automated. The timing and content of interventions involve human judgment.
  3. Fair comparison: The agent has access to web resources including potentially the very papers it's competing against. Evolutionary frameworks typically operate without such external information.
  4. Cost opacity: API costs are not reported, making cost-efficiency comparisons impossible.
  5. Narrow benchmark: Three mathematical optimization problems are not representative of the broader algorithmic design space.

Bottom line: AutoEvolver demonstrates that the capability gap between general-purpose coding agents and purpose-built evolutionary frameworks has narrowed to the point of parity on selected benchmarks. The efficiency, reproducibility, and scalability gaps remain wide — and these may be the more important dimensions for practical research systems.

References

  1. Liu, T., Yang, Y., Ye, X., and Chen, D. "Can Coding Agents Optimize Algorithms Autonomously?" Blog Post, March 2026. https://tengxiaoliu.github.io/autoevolver/
  2. Novikov, A. et al. "AlphaEvolve: A coding agent for scientific and algorithmic discovery." arXiv:2506.13131, 2025.
  3. Lange, R.T. et al. "ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution." arXiv:2509.19349, 2025.
  4. Wang, Y. et al. "ThetaEvolve: Test-time Learning on Open Problems." arXiv:2511.23473, 2025.
  5. Yuksekgonul, M. et al. "Learning to Discover at Test Time." arXiv:2601.16175 (TTT-Discover), 2026.
  6. Simon, H.A. "Models of Bounded Rationality." MIT Press, 1982.
  7. Mouret, J.-B. and Clune, J. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.
  8. Romera-Paredes, B. et al. "Mathematical discoveries from program search with large language models." Nature, 625, 468–475, 2024 (FunSearch).
  9. DataClaw. Conversation trajectory capture tool. github.com/peteromallet/dataclaw.
  10. Anthropic. "Claude Code." Technical Documentation, 2026.

Back to Index