A-Evolve
Universal Infrastructure for Self-Improving Agents via Agentic Evolution
Organization: Amazon (A-EVO-Lab)
Published: February 3, 2026 (paper); March 25, 2026 (open-source release)
Type: Position Paper (ICML) + Open-Source Framework
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Position: Agentic Evolution is the Path to Evolving LLMs
arXiv: 2602.00359 (v2)
GitHub Repository: A-EVO-Lab/a-evolve
License: MIT
Publication Venue: ICML 2026 (Machine Learning)
Publication Date: February 3, 2026 (arXiv preprint); March 25, 2026 (open-source infrastructure release)
Lineage: Positioned as a unifying framework that subsumes and extends prior work on LLM self-improvement — including Reflexion (Shinn et al., 2023), test-time training (Huang et al., 2025), prompt optimization (Zhou et al., 2022), and heuristic memory accumulation (Wang et al., 2024; Gao et al., 2025). The authors explicitly frame A-Evolve as the "PyTorch for Agentic AI" — a standardized infrastructure layer rather than a standalone agent.
BibTeX:
```bibtex
@article{lin2026position,
  title={Position: Agentic Evolution is the Path to Evolving LLMs},
  author={Lin, Minhua and Lu, Hanqing and Shi, Zhan and He, Bing and
          Mao, Rui and Zhang, Zhiwei and Wu, Zongyu and Tang, Xianfeng and
          Liu, Hui and Dai, Zhenwei and others},
  journal={arXiv preprint arXiv:2602.00359},
  year={2026}
}
```
2 Authors and Team
The paper is authored by a large team at Amazon, operating under the lab name A-EVO-Lab:
Primary Authors: Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei
The team spans Amazon's applied science and research divisions. Several authors have prior work in continual learning, LLM agents, and deployment-time adaptation. The inclusion of Jian Pei — a prominent database and data mining researcher (Simon Fraser University → Duke University → Amazon) — signals the team's focus on systematic, infrastructure-grade approaches to agent improvement rather than ad-hoc optimization.
The lab name "A-EVO-Lab" (Amazon Evolution Lab) and the paper's ICML positioning suggest this is a strategic research direction for Amazon — framing deployment-time agent self-improvement as a first-class scaling axis alongside training-time and inference-time compute.
| Role | Names |
|---|---|
| Lead authors | Minhua Lin, Hanqing Lu |
| Core framework | Zhan Shi, Bing He, Rui Mao |
| Algorithm design | Zhiwei Zhang, Zongyu Wu, Xianfeng Tang |
| Evaluation / benchmarks | Hui Liu, Zhenwei Dai |
| Theory / formalism | Xiang Zhang, Suhang Wang |
| Senior leadership | Benoit Dumoulin, Jian Pei |
3 Core Contribution
Key Novelty: A-Evolve introduces agentic evolution as a third scaling axis for LLM systems — complementary to training-time compute and inference-time compute — where an explicit evolver agent autonomously diagnoses failures, proposes structured mutations to persistent agent state, gates updates through validation, and commits only verified improvements. The framework treats the entire agent workspace (prompts, skills, tools, memory) as mutable file-system state, enabling domain-agnostic, algorithm-agnostic, and agent-agnostic evolution.
The Central Argument
The paper makes a position argument with three pillars:
1. The Train-Deploy Gap is Fundamental. Static LLM training (pre-training + post-training) cannot anticipate the infinite variety of real-world deployment scenarios. Models inevitably degrade under distribution shift, API changes, and evolving constraints. Neither scaling training data nor extending inference-time reasoning chains closes this gap.
2. Existing Adaptation Methods Are Insufficient. Parametric approaches (fine-tuning, test-time training) are opaque, risk catastrophic forgetting, and lack semantic accountability. Non-parametric heuristic approaches (memory accumulation, append-and-retrieve) saturate with noisy text and exhibit diminishing returns. Both fail because the update mechanism F_Evolve is static and heuristic rather than adaptive and goal-directed.
3. Evolution Must Be Agentic. The evolution process itself must be elevated from a fixed pipeline to an autonomous evolver agent that reasons about failures, decides what/when/how to change, and produces verified, composable updates. This is the only path to sustained, open-ended adaptation over indefinite deployment horizons.
What Makes A-Evolve Novel
| Dimension | Prior Work | A-Evolve |
|---|---|---|
| Update mechanism | Fixed heuristic (append memory, gradient step) | Autonomous evolver agent with Diagnose→Plan→Update→Verify pipeline |
| Mutation target | Single axis (weights OR prompts OR memory) | Composite policy π = (π_θ, π_S) — any combination of parametric and non-parametric state |
| Update governance | Unconditional (always apply) | Conditional commit gate: propose, verify, accept/reject |
| Artifact structure | Unstructured text blobs | Typed persistent artifacts: Knowledge registry K, Tool registry T, Validation registry V |
| Evolution scope | Domain-specific | Domain-agnostic framework: BYOA (agent), BYOE (benchmark), BYO-Algo (algorithm) |
| Scaling theory | None | Evolution-Scaling Hypothesis: adaptation frontier scales with evolution-time compute |
| Reproducibility | Ad hoc | Every mutation git-tagged (evo-1, evo-2, …) with full provenance |
The Evolution-Scaling Hypothesis
The paper's most ambitious theoretical contribution is the Evolution-Scaling Hypothesis — a conjecture that deployment-time adaptation capacity scales predictably with compute allocated to the evolution process, analogous to training-time scaling laws (Kaplan et al., 2020) and inference-time scaling (Snell et al., 2025):
P*(C_evolve, π_0) = max_{F_evolve} E_{π ~ F_evolve(π_0)} [P(π)]
Where:
- P*(C_evolve, π_0) is the compute-optimal evolution frontier
- C_evolve is the total evolution-time compute budget
- F_evolve ranges over all evolution strategies within that budget
- π_0 is the initial deployed policy
The hypothesis states that P* is strictly increasing with C_evolve: more evolution compute → more accurate diagnosis, more candidate updates considered, more robust artifact synthesis, and stronger verification before committing. Validated updates persist and compound, making evolution a convergent process whose effectiveness scales with resources.
This positions agentic evolution as a third scaling law — after training-time and inference-time — and the first to operate during deployment rather than before it.
4 Supported Solutions
A-Evolve is a framework, not a standalone agent. It evolves any agent that implements the BaseAgent.solve() interface, across any domain with a BenchmarkAdapter, using any evolution strategy via EvolutionEngine.step(). The types of evolvable artifacts include:
| Artifact Type | Location in Workspace | Description | Example Mutation |
|---|---|---|---|
| System prompts | `prompts/system.md` | Instructional text governing LLM reasoning | Harden output format constraints, add domain-specific procedures |
| Skills | `skills/*/SKILL.md` | Reusable procedural knowledge files | Synthesize entity-verification skill, search-iteration strategy |
| Tools | `tools/` | External interface configurations and wrappers | Add API schema adapter, patch parser for new JSON format |
| Memory | `memory/*.jsonl` | Episodic and semantic memory entries | Record failure patterns, amortize successful reasoning traces |
| Manifest | `manifest.yaml` | Agent identity, entrypoint, evolvable layer declarations | Update `evolvable_layers`, change reload strategy |
Solution Domain Breadth
The framework ships with adapters for four diverse benchmark domains:
| Domain | Benchmark | Task Type | Seed Workspace |
|---|---|---|---|
| Software engineering | SWE-bench Verified | Real GitHub issue resolution (Python repos) | seed_workspaces/swe/ |
| Tool calling | MCP-Atlas | Multi-server MCP tool orchestration (16+ servers) | seed_workspaces/mcp/ |
| Terminal operations | Terminal-Bench 2.0 | CLI operations in Docker containers | seed_workspaces/terminal/ |
| Skill discovery | SkillsBench | Autonomous capability acquisition | seed_workspaces/reasoning/ |
What an Evolved Agent Looks Like
A concrete before/after from the MCP-Atlas evolution:
Before (seed workspace):
mcp_agent/
├── manifest.yaml
├── prompts/system.md ← 20 lines, generic
├── skills/ ← empty
└── memory/ ← empty
After (evolved — 79.4% on MCP-Atlas, #1 ranked):
mcp_agent/
├── manifest.yaml
├── prompts/system.md ← 20 lines, unchanged
├── skills/
│ ├── entity-verification/SKILL.md ← NEW
│ ├── search-iteration/SKILL.md ← NEW
│ ├── multi-requirement/SKILL.md ← NEW
│ ├── code-execution/SKILL.md ← NEW
│ └── conditional-handler/SKILL.md ← NEW
└── memory/
└── episodic.jsonl ← 6 entries
Key insight: 5 targeted skills outperformed 10 generic ones. The evolution engine learned to synthesize specific, verified skills rather than accumulating broad but shallow capabilities. The system prompt was left unchanged — all improvement came through structured artifact creation.
5 LLM Integration
Composite Policy Model
A-Evolve models an LLM system as a composite policy:
π_t = (π_θ,t, π_S,t)
Where:
- π_θ,t — the parametric backbone (LLM weights, frozen during non-parametric evolution)
- π_S,t — the persistent artifact state (prompts, skills, tools, memory) that conditions behavior across episodes
This separation is fundamental: the LLM backbone provides the reasoning engine, while the artifact state provides the accumulated deployment knowledge. Evolution can target either or both, though the current implementation focuses on non-parametric artifact evolution (mutating π_S) while keeping π_θ fixed.
LLM Roles in the Framework
| Role | Where | How |
|---|---|---|
| Solver | Solve phase | The base LLM executes tasks using current artifacts (π_S as read-only context) |
| Diagnoser | Evolve → Diagnose | LLM analyzes deployment evidence, identifies failure modes and root causes |
| Planner | Evolve → Plan | LLM translates diagnostic insights into structured edit plans (target artifacts, operators, ordering) |
| Updater | Evolve → Update | LLM synthesizes concrete artifact changes: writes SKILL.md files, patches prompts, generates memory entries |
| Verifier | Evolve → Verify | LLM (or automated tests) evaluates candidate updates against validation registry |
Provider Abstraction
The framework abstracts LLM access through an `LLMProvider.complete()` interface:
| Provider | Status | Notes |
|---|---|---|
| Anthropic (Claude) | Built-in | Primary model used for benchmark results (Claude Opus-4.6) |
| OpenAI (GPT-4o) | Built-in | Demonstrated in tutorial; via openai SDK |
| AWS Bedrock | Built-in | Amazon's managed API access |
| Custom | Via interface | Implement LLMProvider.complete() |
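A custom provider then only needs to implement `complete()`. The sketch below is illustrative: the `LLMProvider` base class is stubbed locally (in the real framework it would be imported from `agent_evolve`), and `TemplateProvider` is a hypothetical provider that wraps an arbitrary callable.

```python
class LLMProvider:
    """Local stand-in for the framework's provider base class (assumption)."""
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError

class TemplateProvider(LLMProvider):
    """Hypothetical provider routing prompts to any callable (e.g., a local model)."""
    def __init__(self, fn):
        self.fn = fn

    def complete(self, prompt: str, **kwargs) -> str:
        # Delegate completion to the wrapped callable.
        return self.fn(prompt)

# Usage: wrap any callable as a provider.
provider = TemplateProvider(lambda p: f"[stub completion for: {p[:20]}]")
print(provider.complete("Diagnose the failed trajectory"))
```

Because every framework role (solver, diagnoser, planner, updater, verifier) goes through the same interface, swapping providers requires no changes elsewhere.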
Dual-Use Architecture
Unlike systems that use separate models for mutation and evaluation (e.g., AlphaEvolve's Flash/Pro ensemble), A-Evolve uses the same model for both solving and evolving by default. The evolution engine can optionally use a different (stronger) model for the evolution phases, but the default configuration demonstrates that a single model can bootstrap its own improvement through structured workspace mutations.
Critical distinction from AlphaEvolve: Where AlphaEvolve evolves code programs as solutions to mathematical/algorithmic problems, A-Evolve evolves agents — the prompts, skills, tools, and memory that govern an LLM's task-solving behavior. The mutation target is not a standalone algorithm but the deployed system's persistent state.
6 Key Results
Headline Performance
All results achieved with a single Claude Opus-4.6 base model, evolved using A-Evolve's reference algorithms, with zero hours of human harness engineering:
| Benchmark | Domain | Baseline | Evolved | Improvement | Ranking | Algorithm Used |
|---|---|---|---|---|---|---|
| MCP-Atlas | Tool calling (MCP) | 76.0% | 79.4% | +3.4pp | 🥇 #1 | adaptive_evolve |
| SWE-bench Verified | Software engineering | 74.2% | 76.8% | +2.6pp | ~#5 | guided_synth |
| Terminal-Bench 2.0 | CLI operations | 63.5% | 76.5% | +13.0pp | ~#7 | adaptive_skill |
| SkillsBench | Skill discovery | 19.7% | 34.9% | +15.2pp | #2 | skillforge |
Result Significance
| Metric | Analysis |
|---|---|
| MCP-Atlas #1 | Achieved the top rank on a benchmark specifically designed to test tool-calling capabilities across 16+ MCP servers. The evolution engine synthesized 5 targeted skills that outperformed manually engineered prompts from competing systems. |
| Terminal-Bench +13pp | The largest absolute improvement (+13.0 percentage points), suggesting that CLI/terminal operations have a high "evolvability ceiling" — many failure modes are systematic and addressable through structured skill synthesis. |
| SkillsBench +15.2pp | The largest relative improvement (77% relative gain over baseline), demonstrating that autonomous skill discovery is where agentic evolution provides the most leverage over static agent configurations. |
| SWE-bench +2.6pp | The most modest improvement, consistent with the expectation that real-world software engineering tasks have higher complexity variance and diminishing returns from non-parametric evolution alone. |
Ablation Study: Evolver Component Contributions
The paper includes ablation experiments (referenced in Sections 5-6 of the paper) that decompose the contribution of each evolver component:
| Configuration | Description | Relative to Full A-Evolve |
|---|---|---|
| Full A-Evolve | Diagnose + Plan + Update + Verify | Baseline (100%) |
| A-Evolve/D (Diagnose only) | Diagnosis without planning → raw update attempts | Significant performance drop; broken tools committed, ~15% solver efficiency degradation |
| A-Evolve/P (+ Planning) | Diagnosis + Planning → structured action sequences | Substantial recovery; planning enables implementable fixes |
| A-Evolve/V (+ Verify) | Full pipeline without gating | Regressions from uncommitted bad mutations |
| No evolution | Static baseline agent | Lowest performance across all benchmarks |
Key ablation finding: The planning stage was the most impactful individual component. Without planning, the diagnosis stage produced correct failure attributions but the resulting updates were often syntactically broken or semantically incomplete. Planning translates "what's wrong" into "how to fix it" — bridging the gap between understanding and actionable improvement.
Convergence Behavior
The evolution loop converges when EGL (Evolutionary Generality Loss) stabilizes or max_cycles is reached. EGL measures the gap between training performance and holdout performance:
EGL(t) = Score_train(t) - Score_holdout(t)
The framework uses an egl_window parameter (configurable, default varies by algorithm) to detect convergence. When EGL stops improving over the window, evolution halts — preventing overfitting to the training task distribution. This mechanism is critical for ensuring that evolved skills generalize to unseen tasks.
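A minimal sketch of this windowed check, assuming a simple "gap stopped narrowing" criterion (the actual rule is algorithm-specific, and the function names here are illustrative):

```python
def egl(score_train: float, score_holdout: float) -> float:
    """EGL(t) = Score_train(t) - Score_holdout(t)."""
    return score_train - score_holdout

def converged(egl_history: list[float], egl_window: int = 2, tol: float = 1e-3) -> bool:
    """Halt when the train/holdout gap has stopped narrowing over the window."""
    if len(egl_history) < egl_window + 1:
        return False  # not enough cycles observed yet
    recent = egl_history[-(egl_window + 1):]
    # Converged if no step in the window narrowed the gap by more than tol.
    return all(later >= earlier - tol for earlier, later in zip(recent, recent[1:]))

# Gap narrows early, then plateaus: evolution should halt.
history = [egl(0.70, 0.55), egl(0.74, 0.62), egl(0.76, 0.64), egl(0.76, 0.64)]
```

With `egl_window=2`, the plateau in the last three entries triggers the halt; a still-narrowing gap would not.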
7 Reproducibility
Reproducibility Infrastructure
A-Evolve provides unusually strong reproducibility guarantees for an LLM-based system:
| Mechanism | Implementation | Purpose |
|---|---|---|
| Git-tagged mutations | Every accepted mutation auto-tagged `evo-1`, `evo-2`, … | Full audit trail of workspace evolution; rollback to any checkpoint |
| Deterministic workspace | Agent reads from file system; π_S is serialized state | Identical workspace → identical agent behavior (modulo LLM sampling) |
| Gated commits | Mutations validated on holdout tasks before acceptance | Prevents regression; rejected mutations don't contaminate workspace |
| Observation logging | All trajectories, feedback, and diagnostic outputs logged | Complete evidence trail for post-hoc analysis |
| Seed workspaces | Pre-built starting points for each benchmark | `seed_workspaces/{swe,mcp,terminal,reasoning}/` |
| Version control integration | Git-backed rollback on failed gate checks | Automatic reversion to last-known-good state |
Open Source Status
| Component | Status | Location |
|---|---|---|
| Framework core (`agent_evolve`) | ✅ Open source (MIT) | `agent_evolve/` |
| 4 reference algorithms | ✅ Open source | `algorithms/` or `agent_evolve/algorithms/` |
| 4 seed workspaces | ✅ Open source | `seed_workspaces/` |
| 4 benchmark adapters | ✅ Open source | Built-in adapters |
| Evolved agent checkpoints | ✅ Git-tagged | Reproducible via `evolver.run()` |
| Training data / benchmark tasks | External dependencies | SWE-bench, MCP-Atlas, Terminal-Bench, SkillsBench (separate repos) |
Reproducing Results
```python
import agent_evolve as ae

# Reproduce MCP-Atlas #1 result
evolver = ae.Evolver(
    agent="mcp",               # built-in seed workspace
    benchmark="mcp-atlas",     # built-in benchmark adapter
    engine="adaptive_evolve",  # reference algorithm
)
results = evolver.run(cycles=10)
# Expected: ~79.4% final score
```
Caveat: Exact numerical reproduction depends on LLM API determinism (temperature, sampling seed), which is not fully guaranteed across API versions. However, the directional results and convergence behavior should reproduce consistently.
8 Compute and API Costs
Cost Model
A-Evolve's compute costs decompose into two categories:
| Phase | Compute Type | Cost Driver | Notes |
|---|---|---|---|
| Solve | LLM inference | Token count × model price per token | Proportional to batch_size × num_cycles × avg_task_tokens |
| Evolve | LLM inference + tool invocation | Diagnosis reasoning + plan generation + artifact synthesis + verification | Typically 2-5× more tokens per cycle than solve phase |
| Gate | LLM inference (holdout) | Holdout task count × model price | egl_window tasks per cycle for convergence check |
Estimated Costs Per Benchmark
Based on the reported configurations (Claude Opus-4.6, 10 evolution cycles):
| Benchmark | Tasks/Cycle | Cycles | Est. Solve Tokens | Est. Evolve Tokens | Est. Total Cost |
|---|---|---|---|---|---|
| MCP-Atlas | ~50 | 10 | ~2M | ~4M | $150–300 |
| SWE-bench Verified | ~50 | 10 | ~5M | ~8M | $400–800 |
| Terminal-Bench 2.0 | ~50 | 10 | ~3M | ~5M | $200–400 |
| SkillsBench | ~50 | 10 | ~1.5M | ~3M | $100–200 |
Important: These are rough estimates. Actual costs depend heavily on task complexity, agent trajectory length, and LLM pricing at time of use. The evolution phase involves multiple LLM calls per cycle (diagnose, plan, update, verify), and complex benchmarks like SWE-bench generate longer trajectories.
Cost Scaling
The Evolution-Scaling Hypothesis implies that costs scale linearly with C_evolve but returns are sublinear (log-like improvement curve). Key cost levers:
| Lever | Effect |
|---|---|
| Fewer cycles | Lower cost, less evolved agent |
| Smaller batch size | Lower cost per cycle, noisier signal |
| Weaker model for evolve phase | Lower cost, lower mutation quality |
| Early convergence (EGL gate) | Auto-stops when improvement plateaus |
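For back-of-envelope budgeting, the cost model above reduces to simple arithmetic. The rate used below is a hypothetical blended price per million tokens, not a quote; the token counts come from the estimate table above.

```python
def estimate_cost(solve_tokens: float, evolve_tokens: float,
                  usd_per_mtok: float) -> float:
    """Rough cost: total tokens x price per million tokens (illustrative only)."""
    return (solve_tokens + evolve_tokens) / 1e6 * usd_per_mtok

# MCP-Atlas row (~2M solve + ~4M evolve tokens) at a hypothetical
# blended rate of $30 per million tokens:
print(f"${estimate_cost(2e6, 4e6, 30.0):.0f}")  # prints $180, inside the $150-300 band
```

Varying the rate or token counts over plausible ranges reproduces the spread of the per-benchmark estimates.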
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────┐
│ A-EVOLVE FRAMEWORK │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────┐ │
│ │ Agent (BYOA) │ │ Evolution Engine (BYO-Algo) │ │
│ │ │ │ │ │
│ │ BaseAgent │ │ EvolutionEngine │ │
│ │ .solve(task) │ │ .step(workspace, obs, hist) │ │
│ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ ┌─────────────┐ │ │
│ │ │ Solver │ │ │ │ Diagnose │→ │ Plan │ │ │
│ │ │ (LLM) │ │ │ └──────────┘ └─────────────┘ │ │
│ │ └──────────┘ │ │ │ │ │ │
│ └────────┬───────┘ │ ┌────▼──────┐ ┌───▼────────┐ │ │
│ │ │ │ Update │→ │ Verify │ │ │
│ │ │ └───────────┘ └────────────┘ │ │
│ │ └──────────────┬───────────────────┘ │
│ │ │ │
│ ┌────────▼──────────────────────────────▼───────────────────┐ │
│ │ AGENT WORKSPACE (π_S) │ │
│ │ │ │
│ │ manifest.yaml prompts/ skills/ tools/ memory/ │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │ │
│ │ │Knowledge │ │ Tool │ │ Validation │ │ │
│ │ │Registry │ │ Registry │ │ Registry │ │ │
│ │ │ (K_t) │ │ (T_t) │ │ (V_t) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Benchmark Adapter (BYOE) │ │
│ │ .get_tasks(split, limit) .evaluate(task, trajectory) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ VersionControl │ │ EvolutionHist│ │ TrialRunner │ │
│ │ (git-backed) │ │ (obs+logs) │ │ (on-demand │ │
│ │ │ │ │ │ validation) │ │
│ └─────────────────┘ └──────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
The Solve-Evolve Control Loop
The core loop separates instance-level task execution from cross-episode capability improvement:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────┐ ┌────────┐
│ SOLVE │───▶│ OBSERVE │───▶│ EVOLVE │───▶│ GATE │───▶│ RELOAD │
│ │ │ │ │ │ │ │ │ │
│ Agent │ │ Collect │ │ Engine │ │Check │ │ Agent │
│ runs on │ │ trajs + │ │ mutates │ │hold- │ │ reloads│
│ tasks │ │ feedback│ │ work- │ │out │ │ from │
│ │ │ into │ │ space │ │tasks │ │ (maybe │
│ π_S is │ │ Obs │ │ files │ │ │ │ rolled │
│ read- │ │ buffer │ │ │ │accept│ │ back) │
│ only │ │ │ │ │ │/ │ │ work- │
│ │ │ │ │ │ │reject│ │ space │
└─────────┘ └─────────┘ └─────────┘ └──────┘ └────────┘
│ │
└────────────────────────────────────────────────────────┘
NEXT CYCLE
Formal specification:
Solve: τ_t = Solve(π_t, x_t) # Black-box execution
Observe: Obs_{1:t} = Obs_{1:t-1} ∪ {τ_t} # Evidence accumulation
Evolve: Δ_t ← F_Evolve(π_θ,t, π_S,t, Obs_{1:t}) # Structured mutation
Gate: c_t ← C(π_t, Δ_t, Obs_{1:t}) # Commit decision {0,1}
Reload: π_{t+1} = π_t ⊕ (c_t · Δ_t) # Conditional update
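The five steps can be paraphrased as executable pseudocode. Everything here (function names, the toy policy-as-skill-set) is illustrative rather than the framework's API, but the control flow mirrors the formal specification above:

```python
def run_cycles(policy, tasks, solve, evolve, gate, apply, cycles=3):
    """Solve -> Observe -> Evolve -> Gate -> Reload, repeated per cycle."""
    observations = []
    for t in range(cycles):
        traj = solve(policy, tasks[t % len(tasks)])  # Solve: black-box execution
        observations.append(traj)                    # Observe: evidence accumulation
        delta = evolve(policy, observations)         # Evolve: propose mutation Delta_t
        if gate(policy, delta, observations):        # Gate: commit decision c_t
            policy = apply(policy, delta)            # Reload: commit Delta_t
        # else: Delta_t is discarded (rollback to last-known-good state)
    return policy

# Toy instantiation: the policy is a set of "skills"; each cycle adds one.
final = run_cycles(
    policy=set(),
    tasks=["t1", "t2"],
    solve=lambda pi, x: (x, len(pi)),
    evolve=lambda pi, obs: f"skill-{len(obs)}",
    gate=lambda pi, d, obs: True,  # always-accept gate, for the demo only
    apply=lambda pi, d: pi | {d},
)
```

The key structural point survives the simplification: the solver never mutates the policy; only a gated `apply` does.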
Three-Axis Pluggability
| Axis | Interface | Contract | Extension Point |
|---|---|---|---|
| Agent (BYOA) | `BaseAgent` | Implement `solve(task: Task) -> Trajectory` | Any architecture: ReAct, Plan-and-Solve, multi-agent, custom |
| Benchmark (BYOE) | `BenchmarkAdapter` | Implement `get_tasks(split, limit)` + `evaluate(task, trajectory) -> Feedback` | Any domain with task + evaluation signal |
| Algorithm (BYO-Algo) | `EvolutionEngine` | Implement `step(workspace, observations, history, trial) -> StepResult` | Any evolution strategy: LLM mutation, RL, genetic programming |
Key architectural insight: The Agent Workspace is the unifying abstraction. By standardizing agent state as a file-system directory with a manifest, the evolution engine can mutate any agent without knowing its internals. The agent reloads from its workspace after each cycle, picking up mutations transparently. This decoupling is what makes A-Evolve truly agent-agnostic.
10 Component Breakdown
Core Type System
```python
from agent_evolve.types import Task, Trajectory, Feedback, StepResult

# Task: input to the agent
Task(
    id: str,          # Unique task identifier
    input: str,       # Task prompt/description
    metadata: dict,   # Benchmark-specific metadata (rule, answer, etc.)
)

# Trajectory: agent's execution trace
Trajectory(
    task_id: str,     # Links back to Task
    output: str,      # Agent's final answer
    steps: list[dict],# Execution trace (tool calls, reasoning, etc.)
)

# Feedback: benchmark evaluation result
Feedback(
    success: bool,    # Binary pass/fail
    score: float,     # Continuous score [0, 1]
    detail: str,      # JSON-serialized evaluation details
    raw: dict,        # Raw evaluation data
)

# StepResult: evolution engine output
StepResult(
    mutated: bool,    # Whether workspace was modified
    summary: str,     # Human-readable mutation description
    metadata: dict,   # Engine-specific metadata
)
```
Agent Workspace Contract (AgentWorkspace)
The AgentWorkspace class mediates all file-system interactions:
| Method | Signature | Purpose |
|---|---|---|
| `read_prompt()` | `→ str` | Read current system prompt |
| `write_prompt(text)` | `→ None` | Overwrite system prompt |
| `list_skills()` | `→ list[SkillInfo]` | Enumerate available skills |
| `write_skill(name, content)` | `→ None` | Create/update a SKILL.md file |
| `get_skill_content(name)` | `→ str` | Read a specific skill's content |
| `add_memory(entry, category)` | `→ None` | Append to episodic/semantic memory |
| `list_memories()` | `→ list[dict]` | Read all memory entries |
Base Agent Contract (BaseAgent)
```python
from agent_evolve.protocol.base_agent import BaseAgent

class BaseAgent:
    def __init__(self, workspace_dir: str | Path):
        self.workspace = AgentWorkspace(workspace_dir)
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def solve(self, task: Task) -> Trajectory:
        """Implement this: process task, return trajectory."""
        raise NotImplementedError

    def reload_from_fs(self):
        """Re-read workspace state after evolution mutations."""
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def remember(self, content: str, category: str = "episodic"):
        """Store episodic memory during solve phase."""
        self.workspace.add_memory({"content": content}, category)
```
Evolution Engine Contract (EvolutionEngine)
```python
from agent_evolve.engine.base import EvolutionEngine

class EvolutionEngine:
    def step(
        self,
        workspace: AgentWorkspace,        # Mutable workspace reference
        observations: list[Observation],  # Accumulated solve results
        history: EvolutionHistory,        # Previous cycles' data
        trial: TrialRunner,               # On-demand validation
    ) -> StepResult:
        """One evolution cycle. Analyze failures, mutate workspace."""
        raise NotImplementedError
```
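Under this contract, a toy engine might look like the following sketch. The base class is stubbed locally, the workspace is a plain dict, and the failure-memory strategy is purely illustrative (it is not one of the four reference algorithms):

```python
class EvolutionEngine:
    """Local stand-in for the framework's engine base class (assumption)."""
    def step(self, workspace, observations, history, trial):
        raise NotImplementedError

class FailureMemoryEngine(EvolutionEngine):
    """Toy strategy: record each failed observation as an episodic memory entry."""
    def step(self, workspace, observations, history, trial):
        failures = [o for o in observations if not o.get("success", False)]
        for obs in failures:
            # Mutate the (dict-modeled) workspace in place.
            workspace.setdefault("memory", []).append(
                {"content": f"failed on {obs['task_id']}"}
            )
        return {"mutated": bool(failures),
                "summary": f"recorded {len(failures)} failure(s)"}

ws = {}
engine = FailureMemoryEngine()
result = engine.step(ws, [{"task_id": "t1", "success": False},
                          {"task_id": "t2", "success": True}], None, None)
```

The engine only sees the workspace and observations, which is exactly why any mutation strategy can be swapped in behind `step()`.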
Benchmark Adapter Contract (BenchmarkAdapter)
```python
from agent_evolve.benchmarks.base import BenchmarkAdapter

class BenchmarkAdapter:
    def get_tasks(
        self,
        split: str = "train",  # "train" or "holdout"
        limit: int = 10,
    ) -> list[Task]:
        """Return benchmark tasks for the given split."""
        raise NotImplementedError

    def evaluate(
        self,
        task: Task,
        trajectory: Trajectory,
    ) -> Feedback:
        """Score agent output against ground truth."""
        raise NotImplementedError
```
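A minimal adapter satisfying this contract, sketched with a locally stubbed base class and a hypothetical exact-match QA benchmark (tasks and feedback are modeled as plain dicts for self-containment):

```python
class BenchmarkAdapter:
    """Local stand-in for the framework's adapter base class (assumption)."""
    def get_tasks(self, split="train", limit=10):
        raise NotImplementedError
    def evaluate(self, task, trajectory):
        raise NotImplementedError

class ExactMatchAdapter(BenchmarkAdapter):
    """Toy adapter: tasks are Q/A pairs, scored by exact string match."""
    def __init__(self, qa_pairs):
        self.qa_pairs = qa_pairs

    def get_tasks(self, split="train", limit=10):
        return [{"id": str(i), "input": q, "metadata": {"answer": a}}
                for i, (q, a) in enumerate(self.qa_pairs[:limit])]

    def evaluate(self, task, trajectory):
        ok = trajectory["output"].strip() == task["metadata"]["answer"]
        return {"success": ok, "score": 1.0 if ok else 0.0}

adapter = ExactMatchAdapter([("2+2?", "4")])
task = adapter.get_tasks()[0]
fb = adapter.evaluate(task, {"task_id": task["id"], "output": "4"})
```

Any domain that can enumerate tasks and score trajectories fits this shape, which is the BYOE claim in practice.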
Evolver Configuration (EvolveConfig)
```python
import agent_evolve as ae

config = ae.EvolveConfig(
    batch_size=8,   # Tasks per solve batch
    max_cycles=10,  # Maximum evolution iterations
    egl_window=2,   # EGL convergence window size
)
```
Shared Primitives
The evolution engine has access to three shared primitives:
| Primitive | Purpose | API |
|---|---|---|
| TrialRunner | On-demand validation during evolution | Run holdout tasks to test candidate mutations before committing |
| EvolutionHistory | Observation + version queries | Query past cycles, compare scores, retrieve failure patterns |
| VersionControl | Git-based rollback | Tag accepted mutations, revert rejected ones, maintain audit trail |
11 Core Mechanisms (Detailed)
Mechanism 1: The Four-Phase Evolver
The evolver F_Evolve is decomposed into four cooperating functions that implement the three principles of agentic evolution:
┌─────────────────┐
│ Obs_{1:t} │ Accumulated deployment evidence
│ (trajectories, │ (tool traces, errors, feedback)
│ feedback) │
└────────┬────────┘
│
┌────────▼────────┐
│ DIAGNOSE │ Goal-oriented: identify WHAT to change
│ │
│ • Classify │ Analyze failure modes across tasks
│ failure │ Attribute to root causes
│ modes │ (missing tools, brittle logic,
│ • Attribute │ interface mismatches)
│ root causes │
│ • Produce │ Output: update objective g_t
│ update │
│ objective │
└────────┬────────┘
│ g_t
┌────────▼────────┐
│ PLAN │ Compositional: specify HOW to change
│ │
│ • Target │ Select target artifacts
│ artifacts │ Choose edit operators
│ • Edit ops │ (add, patch, refactor, prune)
│ (add/patch/ │ Define ordering constraints
│ refactor/ │
│ prune) │ Output: edit plan p_t
│ • Ordering │
└────────┬────────┘
│ p_t
┌────────▼────────┐
│ UPDATE │ Execute the plan
│ │
│ • Synthesize │ Generate concrete file changes
│ artifact │ Write SKILL.md, patch prompts,
│ changes │ update memory
│ • Attach │ Include provenance + tests
│ provenance │
│ • Build Δ_t │ Output: candidate update Δ_t
└────────┬────────┘
│ Δ_t
┌────────▼────────┐
│ VERIFY │ Autonomy: decide WHEN to commit
│ │
│ • Run holdout │ Evaluate on held-out tasks
│ validation │ Check for regressions
│ • Check for │ Return commit decision c_t ∈ {0,1}
│ regressions │
│ • Commit or │ Output: c_t
│ rollback │
└─────────────────┘
Mechanism 2: Persistent Artifact State (π_S)
The artifact state is organized into three typed registries:
Knowledge Registry (K_t):
- Stores structured or textual artifacts: schemas, workflows, interface contracts, exemplars
- Addressable and versioned — supports retrieval, patching, and replacement
- Enables goal-oriented evolution by localizing failures to specific knowledge components
- Physically: prompts/, skills/ directories
Tool Registry (T_t):
- Contains executable functions with explicit input-output signatures and associated tests
- During solve-time: provides deterministic action primitives reducing inference variance
- During evolve-time: serves as diagnostic instruments for replaying failures and probing edge cases
- Physically: tools/ directory
Validation Registry (V_t):
- Contains governance assets: unit tests, regression suites, human review hooks
- Validation artifacts are themselves editable (the evolution engine can synthesize new tests)
- Grounds the commit decision c_t: updates committed only if they pass verification
- Critical for preventing regressions over long deployment horizons
Mechanism 3: Edit Operators
A-Evolve supports a small set of canonical edit operators over π_S:
| Operator | Target | Description | Example |
|---|---|---|---|
| ADD | K, T, V | Create new artifact | Synthesize entity-verification/SKILL.md |
| PATCH | K, T | Modify existing artifact | Update parser tool for new JSON schema |
| REFACTOR | K, T | Restructure without changing behavior | Split monolithic skill into composable sub-skills |
| PRUNE | K, T, V | Remove obsolete artifact | Delete skill that no longer contributes to score |
All proposed updates are logged with full provenance, enabling auditing, rollback, and risk-based human oversight.
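As a sketch, three of the operators can be modeled as a pure function over an in-memory registry (REFACTOR is omitted since it is behavior-preserving restructuring). This is illustrative only, not the framework's file-system implementation:

```python
def apply_op(registry: dict, op: str, name: str, content=None) -> dict:
    """Apply one canonical edit operator, returning a new registry state."""
    new = dict(registry)  # never mutate in place: the old state stays recoverable
    if op == "ADD":
        assert name not in new, "ADD must target a new artifact"
        new[name] = content
    elif op == "PATCH":
        assert name in new, "PATCH must target an existing artifact"
        new[name] = content
    elif op == "PRUNE":
        new.pop(name, None)
    else:
        raise ValueError(f"unknown operator: {op}")
    return new

K = {}
K = apply_op(K, "ADD", "entity-verification/SKILL.md", "v1")
K = apply_op(K, "PATCH", "entity-verification/SKILL.md", "v2")
```

Keeping each operator a pure state transition is what makes the provenance log (old state, operator, new state) sufficient for audit and rollback.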
Mechanism 4: Reference Evolution Algorithms
A-Evolve ships with four reference algorithms, each implementing EvolutionEngine.step():
| Algorithm | File | Strategy | Key Innovation | Best Domain |
|---|---|---|---|---|
| `adaptive_evolve` | `algorithms/adaptive_evolve/` | Per-claim feedback analysis + meta-learning | Analyzes individual claims within task feedback to attribute failures at the finest granularity; meta-learns which mutation patterns are most effective | MCP-Atlas (🥇 79.4%) |
| `adaptive_skill` | `algorithms/adaptive_skill/` | LLM-driven workspace mutation with bash tool access | Grants the evolution engine shell access to test mutations programmatically; can run scripts, validate outputs, and iterate within a single evolution step | Terminal-Bench 2.0 (76.5%) |
| `skillforge` | `algorithms/skillforge/` | LLM-driven workspace mutation with EGL gating | Focuses on skill synthesis with strict EGL-based convergence detection; stops evolving when holdout improvement plateaus | SkillsBench (34.9%) |
| `guided_synth` | `algorithms/guided_synth/` | Memory-first evolution + LLM-guided intervention synthesis | Prioritizes memory accumulation before skill synthesis; uses episodic memory to guide when and how to intervene | SWE-bench Verified (76.8%) |
Mechanism 5: Evolutionary Generality Loss (EGL)
EGL is A-Evolve's convergence detection metric:
EGL(t) = Score_train(t) - Score_holdout(t)
The framework monitors EGL across a sliding window (egl_window parameter). When EGL stabilizes (the gap between training and holdout performance stops narrowing), evolution halts. This serves two purposes:
- Prevents overfitting: If training score improves but holdout score doesn't, the agent is memorizing task-specific solutions rather than learning generalizable capabilities
- Saves compute: Stops evolution when further cycles are unlikely to yield meaningful holdout improvement
The EGL window is configurable per algorithm — skillforge uses strict EGL gating, while guided_synth uses a looser window to allow longer exploration before convergence.
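The gating logic described above can be sketched in a few lines. The helper names and the stabilization tolerance `tol` are assumptions for illustration; the framework's actual egl_window semantics may differ:

```python
def egl(score_train, score_holdout):
    """Evolutionary Generality Loss: the train/holdout generalization gap."""
    return score_train - score_holdout

def egl_converged(history, egl_window=3, tol=0.01):
    """Halt evolution once EGL has stabilized over the last egl_window cycles.

    history is a list of (score_train, score_holdout) pairs, one per cycle;
    convergence is declared when the EGL values in the window vary by < tol.
    """
    if len(history) < egl_window:
        return False
    window = [egl(tr, ho) for tr, ho in history[-egl_window:]]
    return max(window) - min(window) < tol
```

A strict gate (skillforge-style) would use a small window and tolerance; a looser gate (guided_synth-style) would widen both to allow longer exploration.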
Mechanism 6: Git-Based Version Control
Every accepted mutation is git-tagged with an incrementing version:
evo-0 → Initial seed workspace
evo-1 → First accepted mutation (e.g., prompt hardening)
evo-2 → Second accepted mutation (e.g., added json-sum-exact skill)
evo-3 → Third accepted mutation (e.g., added episodic memory patterns)
...
If the Gate phase rejects a mutation (holdout regression), the workspace is automatically rolled back to the last tagged version. This provides:
- Full audit trail: Every evolution step is inspectable via git diff evo-N..evo-N+1
- Reproducibility: Checkout any tag to reproduce the agent at that evolution stage
- Safety: No permanent damage from bad mutations; always recoverable
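The tag-and-rollback discipline can be implemented with plain git commands. This is a sketch under stated assumptions: the helper names and the exact commit/clean policy are illustrative, and the framework's internals may differ:

```python
import subprocess

def run_git(workspace, *args):
    """Run a git command inside the workspace directory."""
    return subprocess.run(["git", "-C", workspace, *args],
                          check=True, capture_output=True, text=True)

def tag_accepted(workspace, version):
    """After the Gate accepts a mutation, snapshot and tag the workspace."""
    run_git(workspace, "add", "-A")
    run_git(workspace, "commit", "-m", f"evolution step {version}", "--allow-empty")
    run_git(workspace, "tag", f"evo-{version}")

def rollback(workspace, version):
    """On holdout regression, restore the last-known-good tagged state."""
    run_git(workspace, "reset", "--hard", f"evo-{version}")
    run_git(workspace, "clean", "-fd")  # drop untracked files left by the bad mutation
```

Because tags are immutable pointers, rollback is O(1) regardless of how many files the rejected mutation touched.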
12 Programming Language
Framework Implementation
| Component | Language | Version | Notes |
|---|---|---|---|
| Core framework | Python | 3.11+ | Package: agent_evolve |
| Evolution algorithms | Python | 3.11+ | Under algorithms/ |
| Seed workspaces | YAML + Markdown | — | manifest.yaml, SKILL.md, system.md |
| Memory storage | JSONL | — | memory/episodic.jsonl |
| Version control | Git | — | Automated tagging and rollback |
| Package management | pip | — | pip install -e ".[all,dev]" |
Package Structure
a-evolve/
├── agent_evolve/ # Core framework package
│ ├── __init__.py # ae.Evolver, ae.EvolveConfig exports
│ ├── protocol/
│ │ └── base_agent.py # BaseAgent ABC
│ ├── benchmarks/
│ │ └── base.py # BenchmarkAdapter ABC
│ ├── engine/
│ │ └── base.py # EvolutionEngine ABC
│ ├── contract/
│ │ └── workspace.py # AgentWorkspace file-system abstraction
│ ├── types.py # Task, Trajectory, Feedback, StepResult
│ └── algorithms/ # Reference evolution algorithms
│ ├── adaptive_evolve/
│ ├── adaptive_skill/
│ ├── skillforge/
│ └── guided_synth/
├── seed_workspaces/ # Pre-built starting points
│ ├── swe/
│ ├── mcp/
│ ├── terminal/
│ └── reasoning/
├── docs/ # Benchmark-specific guides
│ ├── swe-bench-demo.md
│ ├── mcp-atlas-demo.md
│ ├── terminal-bench-demo.md
│ ├── skill-bench-demo.md
│ └── algorithms/ # Algorithm documentation
│ ├── adaptive-evolve.md
│ ├── adaptive-skill.md
│ ├── skillforge.md
│ └── guided-synth.md
├── figs/ # Figures for README
├── pyproject.toml / setup.py # Package definition
└── README.md
API Design Philosophy
The API is designed for minimal surface area — "3 lines of code" is the guiding principle:
import agent_evolve as ae
evolver = ae.Evolver(agent="swe-verified", benchmark="swe-verified")
results = evolver.run(cycles=10)
Extensibility is achieved through interface implementation rather than configuration complexity. Custom agents implement solve(), custom benchmarks implement get_tasks() + evaluate(), and custom algorithms implement step(). Each interface is a single method.
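A minimal sketch of what implementing these interfaces might look like. The method names (solve, get_tasks, evaluate) come from the text above, but the Task fields, return types, and scoring loop are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical stand-in for agent_evolve's Task type."""
    task_id: str
    prompt: str
    expected: str

class EchoAgent:
    """Custom agent: the single required method is solve()."""
    def solve(self, task):
        return task.prompt.upper()  # trivial policy for illustration

class ToyBenchmark:
    """Custom benchmark: get_tasks() + evaluate()."""
    def get_tasks(self):
        return [Task("t1", "hello", "HELLO"), Task("t2", "world", "WORLD")]

    def evaluate(self, task, answer):
        return 1.0 if answer == task.expected else 0.0

def score(agent, benchmark):
    """The kind of harness loop an Evolver would run around these interfaces."""
    tasks = benchmark.get_tasks()
    return sum(benchmark.evaluate(t, agent.solve(t)) for t in tasks) / len(tasks)
```

Since each interface is a single method, swapping in a real LLM-backed agent or a production benchmark leaves the harness untouched.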
13 Memory Management
Memory Architecture
A-Evolve employs a file-system-native memory model where all memory is serialized to the workspace directory. This is fundamentally different from in-process memory systems — memory persists across agent restarts, evolution cycles, and even deployment sessions.
| Memory Type | Storage | Format | Lifecycle |
|---|---|---|---|
| Episodic memory | memory/episodic.jsonl | JSON Lines (one entry per line) | Appended during solve phase; analyzed during evolve phase |
| Semantic memory | memory/semantic.jsonl | JSON Lines | Synthesized during evolve phase from recurring patterns |
| Skill memory | skills/*/SKILL.md | Markdown with YAML frontmatter | Created/patched by evolution engine |
| Prompt memory | prompts/system.md | Markdown | Hardened by evolution engine (append constraints) |
Memory Flow Through the Loop
SOLVE PHASE EVOLVE PHASE
───────────── ──────────────
Agent reads: Engine reads:
• prompts/system.md • Obs buffer (all trajectories)
• skills/ • memory/episodic.jsonl
• memory/ (last N entries) • Current workspace state
Agent writes: Engine writes:
• memory/episodic.jsonl • skills/new-skill/SKILL.md
(task outcomes, traces) • prompts/system.md (patches)
• memory/episodic.jsonl (patterns)
Memory Capacity and Saturation
Unlike heuristic memory accumulation systems (Reflexion, Voyager) that can suffer from context saturation, A-Evolve's approach to memory is qualitatively different:
- Memory is structured, not raw text. Episodic entries have a schema (content, category, metadata); skills have YAML frontmatter plus procedural content.
- Memory is curated by the evolution engine. The engine does not merely append; it synthesizes, refactors, and prunes, which addresses the diminishing-returns problem of naive memory accumulation.
- Memory is bounded by workspace size. The file system imposes natural limits, and the agent reads only the last N entries at solve time, preventing context overflow.
- Skills amortize memory. Recurring failure patterns are compiled into permanent skills rather than remaining as raw episodic traces. This is the key compositional principle in action: fragile reasoning is crystallized into reusable capability.
The amortization argument: A-Evolve's central memory insight is that memory should be consumed by the evolution engine to produce skills, not merely accumulated for the solver to re-read. This prevents the context saturation that plagues append-and-retrieve systems.
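The file paths (memory/episodic.jsonl) and entry schema (content, category, metadata) below come from the text; the helper names and the read-last-N policy are assumptions sketching how the bounded-read discipline might work:

```python
import json
from pathlib import Path

def append_episode(workspace, content, category, metadata=None):
    """Solve phase: append one structured entry to memory/episodic.jsonl."""
    path = Path(workspace) / "memory" / "episodic.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"content": content, "category": category, "metadata": metadata or {}}
    with path.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def read_recent(workspace, n=20):
    """Solve-time read: only the last n entries, bounding context size."""
    path = Path(workspace) / "memory" / "episodic.jsonl"
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in lines[-n:]]
```

The evolve phase, by contrast, would read the full file, mine it for recurring patterns, and emit skills, after which obsolete entries can be pruned.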
14 Continued Learning
Continual Deployment-Time Adaptation
A-Evolve's entire thesis is built on continued learning. The system is explicitly designed for open-ended, indefinite deployment horizons where the agent must adapt to:
| Challenge | How A-Evolve Handles It |
|---|---|
| Distribution shift | Evolution engine diagnoses failures caused by environment changes and synthesizes targeted fixes |
| API drift | Schema changes detected through failure analysis; adapter tools synthesized and validated |
| New task types | Skill synthesis creates new capabilities; memory patterns bootstrap adaptation |
| Capability degradation | EGL monitoring detects regression; git rollback preserves last-known-good state |
| Context saturation | Skills amortize raw memory into permanent capabilities; pruning removes obsolete artifacts |
The Three Scaling Axes
A-Evolve's theoretical framework positions evolution alongside two established scaling axes:
┌──────────────────────────┐
│ CAPABILITY FRONTIER │
│ │
Scaling Axis 1: │ ┌──────────────────┐ │
Training-Time ─────────│──▶│ Static ability │ │
Compute │ │ (pre-training + │ │
(Kaplan et al.) │ │ post-training) │ │
│ └──────────────────┘ │
│ │ │
Scaling Axis 2: │ ┌────────▼─────────┐ │
Inference-Time ────────│──▶│ Per-task │ │
Compute │ │ reasoning │ │
(Snell et al.) │ │ (CoT, search) │ │
│ └──────────────────┘ │
│ │ │
Scaling Axis 3: │ ┌────────▼─────────┐ │
Evolution-Time ────────│──▶│ Cross-episode │ │
Compute │ │ adaptation │ │
(A-Evolve) │ │ (skills, tools, │ │
│ │ memory) │ │
│ └──────────────────┘ │
└──────────────────────────┘
Convergence and Termination
The evolution loop terminates when any of these conditions is met:
- EGL convergence: The Evolutionary Generality Loss stabilizes over the egl_window
- Max cycles reached: Hard limit on evolution iterations
- Perfect score: Training score reaches 100% (rare)
- Budget exhausted: Compute budget for evolution-time is depleted
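The four conditions above can be combined into a single stop check. The function name, argument shapes, and check ordering here are assumptions, not the framework's actual API:

```python
def should_stop(cycle, history, *, max_cycles=10, budget_remaining=1.0,
                egl_window=3, tol=0.01):
    """Return a stop reason if any termination condition fires, else None.

    history holds (score_train, score_holdout) pairs, one per completed cycle.
    """
    if history and history[-1][0] >= 1.0:
        return "perfect_score"          # training score hit 100% (rare)
    if cycle >= max_cycles:
        return "max_cycles"             # hard iteration limit
    if budget_remaining <= 0:
        return "budget_exhausted"       # evolution-time compute depleted
    if len(history) >= egl_window:
        egls = [tr - ho for tr, ho in history[-egl_window:]]
        if max(egls) - min(egls) < tol:
            return "egl_convergence"    # train/holdout gap has stabilized
    return None
```

Returning a named reason rather than a bare boolean keeps the provenance log informative about why a given run halted.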
Long-Horizon Deployment Vision
The paper envisions A-Evolve running continuously in production:
Deployment Timeline
────────────────────────────────────────────────────────
Day 1: Deploy seed agent (π_0)
Day 2-5: Accumulate deployment evidence (Obs)
Day 5: Evolution cycle 1 → add entity-verification skill
Day 6-10: More evidence; improved performance
Day 10: Evolution cycle 2 → patch prompt, add memory patterns
Day 15: API change detected → evolution cycle 3 → schema adapter
Day 20: EGL converges → evolution pauses
Day 30: New failure pattern → evolution resumes
...
Open question from the paper: whether the evolution-scaling frontier P*(C_evolve) eventually saturates (suggesting fundamental limits to non-parametric adaptation) or continues to improve (suggesting that evolution can substitute for retraining). The authors present this as an empirical question requiring longer-horizon experiments than current benchmarks support.
Relationship to Other Continual Learning Paradigms
| Paradigm | Parametric Updates | Artifact Updates | Agent-Directed | Governed |
|---|---|---|---|---|
| Online fine-tuning | ✅ | ❌ | ❌ | ❌ |
| Reflexion (Shinn 2023) | ❌ | ✅ (text memory) | ❌ | ❌ |
| Voyager (Wang 2023) | ❌ | ✅ (skills) | ❌ | ❌ |
| ADAS (Hu 2024) | ❌ | ✅ (agent architectures) | Partial | ❌ |
| A-Evolve | ✅ (planned) | ✅ (typed artifacts) | ✅ (evolver agent) | ✅ (gate + git) |
15 Applications
Current Applications (Demonstrated)
| Application | Benchmark | Task Description | Result |
|---|---|---|---|
| Software engineering automation | SWE-bench Verified | Resolve real GitHub issues in Python repositories | 76.8% (~#5) |
| Multi-tool orchestration | MCP-Atlas | Coordinate 16+ MCP servers for complex tool-calling tasks | 79.4% (🥇 #1) |
| Terminal/CLI operations | Terminal-Bench 2.0 | Execute system administration and DevOps tasks in Docker | 76.5% (~#7) |
| Autonomous skill discovery | SkillsBench | Learn and apply new capabilities without human instruction | 34.9% (#2) |
Potential Applications (Framework Affordances)
Because A-Evolve is a framework with pluggable agents, benchmarks, and algorithms, it can in principle be applied to any domain where:
- Tasks can be automated (agent can attempt them)
- Evaluation signal exists (success/failure can be measured)
- Agent state is file-representable (prompts, skills, tools, memory)
| Domain | Potential Agent | Potential Benchmark | Evolution Target |
|---|---|---|---|
| Customer support | RAG-based chatbot | Resolution rate, CSAT score | Skills for handling edge cases, domain knowledge |
| Data analysis | Code-generating analyst | Accuracy on analytical queries | SQL patterns, visualization templates |
| Security monitoring | Alert triage agent | True positive rate, response time | Detection rules, investigation playbooks |
| Content generation | Marketing copywriter | A/B test click-through rate | Style guides, audience-specific templates |
| Research assistance | Literature review agent | Relevance scoring, citation accuracy | Search strategies, synthesis templates |
| Cloud operations | Infrastructure-as-code agent | Deployment success rate | Terraform patterns, error recovery scripts |
Comparison with Other Evolutionary Systems
| System | Year | Evolution Target | Evolution Mechanism | Domain Scope | Governance |
|---|---|---|---|---|---|
| FunSearch | 2023 | Single Python function | LLM + evolutionary search | Mathematical problems | None (best-score selection) |
| AlphaEvolve | 2025 | Entire codebases | Gemini Flash + Pro ensemble | Algorithms, math, hardware | Automated evaluation |
| OpenEvolve | 2025 | Code programs | LLM-as-mutator + MAP-Elites | General code optimization | Evaluator pipeline |
| A-Evolve | 2026 | Agent workspace (prompts, skills, tools, memory) | Autonomous evolver agent (Diagnose→Plan→Update→Verify) | Any agent domain | EGL gating + git rollback + validation registry |
| GEPA | 2025 | Heuristic algorithms | LLM-guided population evolution | Combinatorial optimization | Best-score selection |
| ShinkaEvolve | 2025 | Optimization algorithms | LLM-driven mutation + island model | Algorithm design | Fitness-based selection |
Key Differentiators for Applications
A-Evolve's unique position in the evolutionary AI landscape is that it evolves agents (the deployed system's behavior) rather than programs (standalone code solving a specific problem). This means:
- The mutation target is the deployment-time policy, not a static algorithm
- Evolution happens during deployment, not in a separate research loop
- The evolver is itself an agent, not a fixed pipeline — it can adapt its own strategy
- Governance is built-in, not bolted on — the validation registry and git integration ensure safety
- The framework is agent-agnostic — it can evolve any system whose state lives on the file system
Limitations and Open Questions
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM sampling non-determinism | Exact numerical reproduction not guaranteed | Git-tagged checkpoints enable qualitative reproduction |
| Non-parametric evolution only (current release) | Cannot modify model weights — bounded by base model capability | Parametric evolution planned for future work; current non-parametric results already competitive |
| Benchmark-dependent evaluation | Evolved skills may overfit to benchmark-specific patterns | EGL gating on holdout set; but transfer to production settings untested |
| Evolution-time cost | Multiple LLM calls per cycle for diagnose/plan/update/verify | Early stopping via EGL convergence; configurable batch sizes |
| Single-model default | Same model used for both solving and evolving | Optionally use stronger model for evolution; but self-evolution is a feature, not a bug |
| Limited convergence theory | Evolution-Scaling Hypothesis is conjectured, not proven | Empirical evidence supports it across 4 benchmarks; formal analysis remains future work |
This analysis is based on the arXiv paper (v2, February 2026), the open-source repository (March 2026), the MarkTechPost coverage and tutorial (March 2026), and the Hugging Face paper page. The A-Evolve framework represents a significant conceptual advance in how we think about LLM system improvement — shifting from manual prompt engineering and static training to autonomous, governed, deployment-time evolution. Whether the Evolution-Scaling Hypothesis holds at scale and across diverse production environments remains the central open empirical question.