A-Evolve
Universal Infrastructure for Self-Improving Agents via Agentic Evolution
Organization: Amazon (A-EVO-Lab)
Published: February 3, 2026 (paper); March 25, 2026 (open-source release)
Type: Position Paper (ICML) + Open-Source Framework
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Position: Agentic Evolution is the Path to Evolving LLMs
arXiv: 2602.00359 (v2)
GitHub Repository: A-EVO-Lab/a-evolve
License: MIT
Publication Venue: ICML 2026 (Machine Learning)
Publication Date: February 3, 2026 (arXiv preprint); March 25, 2026 (open-source infrastructure release)
Lineage: Positioned as a unifying framework that subsumes and extends prior work on LLM self-improvement — including Reflexion (Shinn et al., 2023), test-time training (Huang et al., 2025), prompt optimization (Zhou et al., 2022), and heuristic memory accumulation (Wang et al., 2024; Gao et al., 2025). The authors explicitly frame A-Evolve as the "PyTorch for Agentic AI" — a standardized infrastructure layer rather than a standalone agent.
BibTeX:
```bibtex
@article{lin2026position,
  title={Position: Agentic Evolution is the Path to Evolving LLMs},
  author={Lin, Minhua and Lu, Hanqing and Shi, Zhan and He, Bing and
          Mao, Rui and Zhang, Zhiwei and Wu, Zongyu and Tang, Xianfeng and
          Liu, Hui and Dai, Zhenwei and others},
  journal={arXiv preprint arXiv:2602.00359},
  year={2026}
}
```
2 Authors and Team
The paper is authored by a large team at Amazon, operating under the lab name A-EVO-Lab:
Primary Authors: Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei
The team spans Amazon's applied science and research divisions. Several authors have prior work in continual learning, LLM agents, and deployment-time adaptation. The inclusion of Jian Pei — a prominent database and data mining researcher (Simon Fraser University → Duke University → Amazon) — signals the team's focus on systematic, infrastructure-grade approaches to agent improvement rather than ad-hoc optimization.
The lab name "A-EVO-Lab" (Amazon Evolution Lab) and the paper's ICML positioning suggest this is a strategic research direction for Amazon — framing deployment-time agent self-improvement as a first-class scaling axis alongside training-time and inference-time compute.
| Role | Names |
|---|---|
| Lead authors | Minhua Lin, Hanqing Lu |
| Core framework | Zhan Shi, Bing He, Rui Mao |
| Algorithm design | Zhiwei Zhang, Zongyu Wu, Xianfeng Tang |
| Evaluation / benchmarks | Hui Liu, Zhenwei Dai |
| Theory / formalism | Xiang Zhang, Suhang Wang |
| Senior leadership | Benoit Dumoulin, Jian Pei |
3 Core Contribution
Key Novelty: A-Evolve introduces agentic evolution as a third scaling axis for LLM systems — complementary to training-time compute and inference-time compute — where an explicit evolver agent autonomously diagnoses failures, proposes structured mutations to persistent agent state, gates updates through validation, and commits only verified improvements. The framework treats the entire agent workspace (prompts, skills, tools, memory) as mutable file-system state, enabling domain-agnostic, algorithm-agnostic, and agent-agnostic evolution.
The Central Argument
The paper makes a position argument with three pillars:
1. The Train-Deploy Gap is Fundamental. Static LLM training (pre-training + post-training) cannot anticipate the infinite variety of real-world deployment scenarios. Models inevitably degrade under distribution shift, API changes, and evolving constraints. Neither scaling training data nor extending inference-time reasoning chains closes this gap.
2. Existing Adaptation Methods Are Insufficient. Parametric approaches (fine-tuning, test-time training) are opaque, risk catastrophic forgetting, and lack semantic accountability. Non-parametric heuristic approaches (memory accumulation, append-and-retrieve) saturate with noisy text and exhibit diminishing returns. Both fail because the update mechanism F_Evolve is static and heuristic rather than adaptive and goal-directed.
3. Evolution Must Be Agentic. The evolution process itself must be elevated from a fixed pipeline to an autonomous evolver agent that reasons about failures, decides what/when/how to change, and produces verified, composable updates. This is the only path to sustained, open-ended adaptation over indefinite deployment horizons.
What Makes A-Evolve Novel
| Dimension | Prior Work | A-Evolve |
|---|---|---|
| Update mechanism | Fixed heuristic (append memory, gradient step) | Autonomous evolver agent with Diagnose→Plan→Update→Verify pipeline |
| Mutation target | Single axis (weights OR prompts OR memory) | Composite policy π = (π_θ, π_S) — any combination of parametric and non-parametric state |
| Update governance | Unconditional (always apply) | Conditional commit gate: propose, verify, accept/reject |
| Artifact structure | Unstructured text blobs | Typed persistent artifacts: Knowledge registry K, Tool registry T, Validation registry V |
| Evolution scope | Domain-specific | Domain-agnostic framework: BYOA (agent), BYOE (benchmark), BYO-Algo (algorithm) |
| Scaling theory | None | Evolution-Scaling Hypothesis: adaptation frontier scales with evolution-time compute |
| Reproducibility | Ad hoc | Every mutation git-tagged (evo-1, evo-2, …) with full provenance |
The Evolution-Scaling Hypothesis
The paper's most ambitious theoretical contribution is the Evolution-Scaling Hypothesis — a conjecture that deployment-time adaptation capacity scales predictably with compute allocated to the evolution process, analogous to training-time scaling laws (Kaplan et al., 2020) and inference-time scaling (Snell et al., 2025):
P*(C_evolve, π_0) = max_{F_evolve} E_{π ~ F_evolve(π_0)} [P(π)]
Where:
- P*(C_evolve, π_0) is the compute-optimal evolution frontier
- C_evolve is the total evolution-time compute budget
- F_evolve ranges over all evolution strategies within that budget
- π_0 is the initial deployed policy
The hypothesis states that P* is strictly increasing with C_evolve: more evolution compute → more accurate diagnosis, more candidate updates considered, more robust artifact synthesis, and stronger verification before committing. Validated updates persist and compound, making evolution a convergent process whose effectiveness scales with resources.
This positions agentic evolution as a third scaling law — after training-time and inference-time — and the first to operate during deployment rather than before it.
4 Supported Solutions
A-Evolve is a framework, not a standalone agent. It evolves any agent that implements the BaseAgent.solve() interface, across any domain with a BenchmarkAdapter, using any evolution strategy via EvolutionEngine.step(). The types of evolvable artifacts include:
| Artifact Type | Location in Workspace | Description | Example Mutation |
|---|---|---|---|
| System prompts | `prompts/system.md` | Instructional text governing LLM reasoning | Harden output format constraints, add domain-specific procedures |
| Skills | `skills/*/SKILL.md` | Reusable procedural knowledge files | Synthesize entity-verification skill, search-iteration strategy |
| Tools | `tools/` | External interface configurations and wrappers | Add API schema adapter, patch parser for new JSON format |
| Memory | `memory/*.jsonl` | Episodic and semantic memory entries | Record failure patterns, amortize successful reasoning traces |
| Manifest | `manifest.yaml` | Agent identity, entrypoint, evolvable layer declarations | Update `evolvable_layers`, change reload strategy |
Solution Domain Breadth
The framework ships with adapters for four diverse benchmark domains:
| Domain | Benchmark | Task Type | Seed Workspace |
|---|---|---|---|
| Software engineering | SWE-bench Verified | Real GitHub issue resolution (Python repos) | seed_workspaces/swe/ |
| Tool calling | MCP-Atlas | Multi-server MCP tool orchestration (16+ servers) | seed_workspaces/mcp/ |
| Terminal operations | Terminal-Bench 2.0 | CLI operations in Docker containers | seed_workspaces/terminal/ |
| Skill discovery | SkillsBench | Autonomous capability acquisition | seed_workspaces/reasoning/ |
What an Evolved Agent Looks Like
A concrete before/after from the MCP-Atlas evolution:
Before (seed workspace):
mcp_agent/
├── manifest.yaml
├── prompts/system.md ← 20 lines, generic
├── skills/ ← empty
└── memory/ ← empty
After (evolved — 79.4% on MCP-Atlas, #1 ranked):
mcp_agent/
├── manifest.yaml
├── prompts/system.md ← 20 lines, unchanged
├── skills/
│ ├── entity-verification/SKILL.md ← NEW
│ ├── search-iteration/SKILL.md ← NEW
│ ├── multi-requirement/SKILL.md ← NEW
│ ├── code-execution/SKILL.md ← NEW
│ └── conditional-handler/SKILL.md ← NEW
└── memory/
└── episodic.jsonl ← 6 entries
Key insight: 5 targeted skills outperformed 10 generic ones. The evolution engine learned to synthesize specific, verified skills rather than accumulating broad but shallow capabilities. The system prompt was left unchanged — all improvement came through structured artifact creation.
5 LLM Integration
Composite Policy Model
A-Evolve models an LLM system as a composite policy:
π_t = (π_θ,t, π_S,t)
Where:
- π_θ,t — the parametric backbone (LLM weights, frozen during non-parametric evolution)
- π_S,t — the persistent artifact state (prompts, skills, tools, memory) that conditions behavior across episodes
This separation is fundamental: the LLM backbone provides the reasoning engine, while the artifact state provides the accumulated deployment knowledge. Evolution can target either or both, though the current implementation focuses on non-parametric artifact evolution (mutating π_S) while keeping π_θ fixed.
LLM Roles in the Framework
| Role | Where | How |
|---|---|---|
| Solver | Solve phase | The base LLM executes tasks using current artifacts (π_S as read-only context) |
| Diagnoser | Evolve → Diagnose | LLM analyzes deployment evidence, identifies failure modes and root causes |
| Planner | Evolve → Plan | LLM translates diagnostic insights into structured edit plans (target artifacts, operators, ordering) |
| Updater | Evolve → Update | LLM synthesizes concrete artifact changes: writes SKILL.md files, patches prompts, generates memory entries |
| Verifier | Evolve → Verify | LLM (or automated tests) evaluates candidate updates against validation registry |
Provider Abstraction
The framework abstracts LLM access through an `LLMProvider.complete()` interface:
| Provider | Status | Notes |
|---|---|---|
| Anthropic (Claude) | Built-in | Primary model used for benchmark results (Claude Opus-4.6) |
| OpenAI (GPT-4o) | Built-in | Demonstrated in tutorial; via openai SDK |
| AWS Bedrock | Built-in | Amazon's managed API access |
| Custom | Via interface | Implement LLMProvider.complete() |
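A custom provider then only needs to implement `complete()`. The sketch below is illustrative: the `LLMProvider` base class is stubbed locally (in the real framework it would be imported from `agent_evolve`), and `TemplateProvider` is a hypothetical provider that wraps an arbitrary callable.

```python
class LLMProvider:
    """Local stand-in for the framework's provider base class (assumption)."""
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError

class TemplateProvider(LLMProvider):
    """Hypothetical provider routing prompts to any callable (e.g., a local model)."""
    def __init__(self, fn):
        self.fn = fn

    def complete(self, prompt: str, **kwargs) -> str:
        # Delegate completion to the wrapped callable.
        return self.fn(prompt)

# Usage: wrap any callable as a provider.
provider = TemplateProvider(lambda p: f"[stub completion for: {p[:20]}]")
print(provider.complete("Diagnose the failed trajectory"))
```

Because every framework role (solver, diagnoser, planner, updater, verifier) goes through the same interface, swapping providers requires no changes elsewhere.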
Dual-Use Architecture
Unlike systems that use separate models for mutation and evaluation (e.g., AlphaEvolve's Flash/Pro ensemble), A-Evolve uses the same model for both solving and evolving by default. The evolution engine can optionally use a different (stronger) model for the evolution phases, but the default configuration demonstrates that a single model can bootstrap its own improvement through structured workspace mutations.
Critical distinction from AlphaEvolve: Where AlphaEvolve evolves code programs as solutions to mathematical/algorithmic problems, A-Evolve evolves agents — the prompts, skills, tools, and memory that govern an LLM's task-solving behavior. The mutation target is not a standalone algorithm but the deployed system's persistent state.
6 Key Results
Headline Performance
All results achieved with a single Claude Opus-4.6 base model, evolved using A-Evolve's reference algorithms, with zero hours of human harness engineering:
| Benchmark | Domain | Baseline | Evolved | Improvement | Ranking | Algorithm Used |
|---|---|---|---|---|---|---|
| MCP-Atlas | Tool calling (MCP) | 76.0% | 79.4% | +3.4pp | 🥇 #1 | adaptive_evolve |
| SWE-bench Verified | Software engineering | 74.2% | 76.8% | +2.6pp | ~#5 | guided_synth |
| Terminal-Bench 2.0 | CLI operations | 63.5% | 76.5% | +13.0pp | ~#7 | adaptive_skill |
| SkillsBench | Skill discovery | 19.7% | 34.9% | +15.2pp | #2 | skillforge |
Result Significance
| Metric | Analysis |
|---|---|
| MCP-Atlas #1 | Achieved the top rank on a benchmark specifically designed to test tool-calling capabilities across 16+ MCP servers. The evolution engine synthesized 5 targeted skills that outperformed manually engineered prompts from competing systems. |
| Terminal-Bench +13pp | The largest absolute improvement (+13.0 percentage points), suggesting that CLI/terminal operations have a high "evolvability ceiling" — many failure modes are systematic and addressable through structured skill synthesis. |
| SkillsBench +15.2pp | The largest relative improvement (77% relative gain over baseline), demonstrating that autonomous skill discovery is where agentic evolution provides the most leverage over static agent configurations. |
| SWE-bench +2.6pp | The most modest improvement, consistent with the expectation that real-world software engineering tasks have higher complexity variance and diminishing returns from non-parametric evolution alone. |
Ablation Study: Evolver Component Contributions
The paper includes ablation experiments (referenced in Sections 5-6 of the paper) that decompose the contribution of each evolver component:
| Configuration | Description | Relative to Full A-Evolve |
|---|---|---|
| Full A-Evolve | Diagnose + Plan + Update + Verify | Baseline (100%) |
| A-Evolve/D (Diagnose only) | Diagnosis without planning → raw update attempts | Significant performance drop; broken tools committed, ~15% solver efficiency degradation |
| A-Evolve/P (+ Planning) | Diagnosis + Planning → structured action sequences | Substantial recovery; planning enables implementable fixes |
| A-Evolve/V (+ Verify) | Full pipeline without gating | Regressions from uncommitted bad mutations |
| No evolution | Static baseline agent | Lowest performance across all benchmarks |
Key ablation finding: The planning stage was the most impactful individual component. Without planning, the diagnosis stage produced correct failure attributions but the resulting updates were often syntactically broken or semantically incomplete. Planning translates "what's wrong" into "how to fix it" — bridging the gap between understanding and actionable improvement.
Convergence Behavior
The evolution loop converges when EGL (Evolutionary Generality Loss) stabilizes or max_cycles is reached. EGL measures the gap between training performance and holdout performance:
EGL(t) = Score_train(t) - Score_holdout(t)
The framework uses an egl_window parameter (configurable, default varies by algorithm) to detect convergence. When EGL stops improving over the window, evolution halts — preventing overfitting to the training task distribution. This mechanism is critical for ensuring that evolved skills generalize to unseen tasks.
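A minimal sketch of this windowed check, assuming a simple "gap stopped narrowing" criterion (the actual rule is algorithm-specific, and the function names here are illustrative):

```python
def egl(score_train: float, score_holdout: float) -> float:
    """EGL(t) = Score_train(t) - Score_holdout(t)."""
    return score_train - score_holdout

def converged(egl_history: list[float], egl_window: int = 2, tol: float = 1e-3) -> bool:
    """Halt when the train/holdout gap has stopped narrowing over the window."""
    if len(egl_history) < egl_window + 1:
        return False  # not enough cycles observed yet
    recent = egl_history[-(egl_window + 1):]
    # Converged if no step in the window narrowed the gap by more than tol.
    return all(later >= earlier - tol for earlier, later in zip(recent, recent[1:]))

# Gap narrows early, then plateaus: evolution should halt.
history = [egl(0.70, 0.55), egl(0.74, 0.62), egl(0.76, 0.64), egl(0.76, 0.64)]
```

With `egl_window=2`, the plateau in the last three entries triggers the halt; a still-narrowing gap would not.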
7 Reproducibility
Reproducibility Infrastructure
A-Evolve provides unusually strong reproducibility guarantees for an LLM-based system:
| Mechanism | Implementation | Purpose |
|---|---|---|
| Git-tagged mutations | Every accepted mutation auto-tagged `evo-1`, `evo-2`, … | Full audit trail of workspace evolution; rollback to any checkpoint |
| Deterministic workspace | Agent reads from file system; π_S is serialized state | Identical workspace → identical agent behavior (modulo LLM sampling) |
| Gated commits | Mutations validated on holdout tasks before acceptance | Prevents regression; rejected mutations don't contaminate workspace |
| Observation logging | All trajectories, feedback, and diagnostic outputs logged | Complete evidence trail for post-hoc analysis |
| Seed workspaces | Pre-built starting points for each benchmark | `seed_workspaces/{swe,mcp,terminal,reasoning}/` |
| Version control integration | Git-backed rollback on failed gate checks | Automatic reversion to last-known-good state |
Open Source Status
| Component | Status | Location |
|---|---|---|
| Framework core (`agent_evolve`) | ✅ Open source (MIT) | `agent_evolve/` |
| 4 reference algorithms | ✅ Open source | `algorithms/` or `agent_evolve/algorithms/` |
| 4 seed workspaces | ✅ Open source | `seed_workspaces/` |
| 4 benchmark adapters | ✅ Open source | Built-in adapters |
| Evolved agent checkpoints | ✅ Git-tagged | Reproducible via `evolver.run()` |
| Training data / benchmark tasks | External dependencies | SWE-bench, MCP-Atlas, Terminal-Bench, SkillsBench (separate repos) |
Reproducing Results
```python
import agent_evolve as ae

# Reproduce MCP-Atlas #1 result
evolver = ae.Evolver(
    agent="mcp",               # built-in seed workspace
    benchmark="mcp-atlas",     # built-in benchmark adapter
    engine="adaptive_evolve",  # reference algorithm
)
results = evolver.run(cycles=10)
# Expected: ~79.4% final score
```
Caveat: Exact numerical reproduction depends on LLM API determinism (temperature, sampling seed), which is not fully guaranteed across API versions. However, the directional results and convergence behavior should reproduce consistently.
8 Compute and API Costs
Cost Model
A-Evolve's compute costs decompose into two categories:
| Phase | Compute Type | Cost Driver | Notes |
|---|---|---|---|
| Solve | LLM inference | Token count × model price per token | Proportional to batch_size × num_cycles × avg_task_tokens |
| Evolve | LLM inference + tool invocation | Diagnosis reasoning + plan generation + artifact synthesis + verification | Typically 2-5× more tokens per cycle than solve phase |
| Gate | LLM inference (holdout) | Holdout task count × model price | egl_window tasks per cycle for convergence check |
Estimated Costs Per Benchmark
Based on the reported configurations (Claude Opus-4.6, 10 evolution cycles):
| Benchmark | Tasks/Cycle | Cycles | Est. Solve Tokens | Est. Evolve Tokens | Est. Total Cost |
|---|---|---|---|---|---|
| MCP-Atlas | ~50 | 10 | ~2M | ~4M | $150–300 |
| SWE-bench Verified | ~50 | 10 | ~5M | ~8M | $400–800 |
| Terminal-Bench 2.0 | ~50 | 10 | ~3M | ~5M | $200–400 |
| SkillsBench | ~50 | 10 | ~1.5M | ~3M | $100–200 |
Important: These are rough estimates. Actual costs depend heavily on task complexity, agent trajectory length, and LLM pricing at time of use. The evolution phase involves multiple LLM calls per cycle (diagnose, plan, update, verify), and complex benchmarks like SWE-bench generate longer trajectories.
Cost Scaling
The Evolution-Scaling Hypothesis implies that costs scale linearly with C_evolve but returns are sublinear (log-like improvement curve). Key cost levers:
| Lever | Effect |
|---|---|
| Fewer cycles | Lower cost, less evolved agent |
| Smaller batch size | Lower cost per cycle, noisier signal |
| Weaker model for evolve phase | Lower cost, lower mutation quality |
| Early convergence (EGL gate) | Auto-stops when improvement plateaus |
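For back-of-envelope budgeting, the cost model above reduces to simple arithmetic. The rate used below is a hypothetical blended price per million tokens, not a quote; the token counts come from the estimate table above.

```python
def estimate_cost(solve_tokens: float, evolve_tokens: float,
                  usd_per_mtok: float) -> float:
    """Rough cost: total tokens x price per million tokens (illustrative only)."""
    return (solve_tokens + evolve_tokens) / 1e6 * usd_per_mtok

# MCP-Atlas row (~2M solve + ~4M evolve tokens) at a hypothetical
# blended rate of $30 per million tokens:
print(f"${estimate_cost(2e6, 4e6, 30.0):.0f}")  # prints $180, inside the $150-300 band
```

Varying the rate or token counts over plausible ranges reproduces the spread of the per-benchmark estimates.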
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────┐
│ A-EVOLVE FRAMEWORK │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────┐ │
│ │ Agent (BYOA) │ │ Evolution Engine (BYO-Algo) │ │
│ │ │ │ │ │
│ │ BaseAgent │ │ EvolutionEngine │ │
│ │ .solve(task) │ │ .step(workspace, obs, hist) │ │
│ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ ┌─────────────┐ │ │
│ │ │ Solver │ │ │ │ Diagnose │→ │ Plan │ │ │
│ │ │ (LLM) │ │ │ └──────────┘ └─────────────┘ │ │
│ │ └──────────┘ │ │ │ │ │ │
│ └────────┬───────┘ │ ┌────▼──────┐ ┌───▼────────┐ │ │
│ │ │ │ Update │→ │ Verify │ │ │
│ │ │ └───────────┘ └────────────┘ │ │
│ │ └──────────────┬───────────────────┘ │
│ │ │ │
│ ┌────────▼──────────────────────────────▼───────────────────┐ │
│ │ AGENT WORKSPACE (π_S) │ │
│ │ │ │
│ │ manifest.yaml prompts/ skills/ tools/ memory/ │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │ │
│ │ │Knowledge │ │ Tool │ │ Validation │ │ │
│ │ │Registry │ │ Registry │ │ Registry │ │ │
│ │ │ (K_t) │ │ (T_t) │ │ (V_t) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Benchmark Adapter (BYOE) │ │
│ │ .get_tasks(split, limit) .evaluate(task, trajectory) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ VersionControl │ │ EvolutionHist│ │ TrialRunner │ │
│ │ (git-backed) │ │ (obs+logs) │ │ (on-demand │ │
│ │ │ │ │ │ validation) │ │
│ └─────────────────┘ └──────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
The Solve-Evolve Control Loop
The core loop separates instance-level task execution from cross-episode capability improvement:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────┐ ┌────────┐
│ SOLVE │───▶│ OBSERVE │───▶│ EVOLVE │───▶│ GATE │───▶│ RELOAD │
│ │ │ │ │ │ │ │ │ │
│ Agent │ │ Collect │ │ Engine │ │Check │ │ Agent │
│ runs on │ │ trajs + │ │ mutates │ │hold- │ │ reloads│
│ tasks │ │ feedback│ │ work- │ │out │ │ from │
│ │ │ into │ │ space │ │tasks │ │ (maybe │
│ π_S is │ │ Obs │ │ files │ │ │ │ rolled │
│ read- │ │ buffer │ │ │ │accept│ │ back) │
│ only │ │ │ │ │ │/ │ │ work- │
│ │ │ │ │ │ │reject│ │ space │
└─────────┘ └─────────┘ └─────────┘ └──────┘ └────────┘
│ │
└────────────────────────────────────────────────────────┘
NEXT CYCLE
Formal specification:
Solve: τ_t = Solve(π_t, x_t) # Black-box execution
Observe: Obs_{1:t} = Obs_{1:t-1} ∪ {τ_t} # Evidence accumulation
Evolve: Δ_t ← F_Evolve(π_θ,t, π_S,t, Obs_{1:t}) # Structured mutation
Gate: c_t ← C(π_t, Δ_t, Obs_{1:t}) # Commit decision {0,1}
Reload: π_{t+1} = π_t ⊕ (c_t · Δ_t) # Conditional update
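The five steps can be paraphrased as executable pseudocode. Everything here (function names, the toy policy-as-skill-set) is illustrative rather than the framework's API, but the control flow mirrors the formal specification above:

```python
def run_cycles(policy, tasks, solve, evolve, gate, apply, cycles=3):
    """Solve -> Observe -> Evolve -> Gate -> Reload, repeated per cycle."""
    observations = []
    for t in range(cycles):
        traj = solve(policy, tasks[t % len(tasks)])  # Solve: black-box execution
        observations.append(traj)                    # Observe: evidence accumulation
        delta = evolve(policy, observations)         # Evolve: propose mutation Delta_t
        if gate(policy, delta, observations):        # Gate: commit decision c_t
            policy = apply(policy, delta)            # Reload: commit Delta_t
        # else: Delta_t is discarded (rollback to last-known-good state)
    return policy

# Toy instantiation: the policy is a set of "skills"; each cycle adds one.
final = run_cycles(
    policy=set(),
    tasks=["t1", "t2"],
    solve=lambda pi, x: (x, len(pi)),
    evolve=lambda pi, obs: f"skill-{len(obs)}",
    gate=lambda pi, d, obs: True,  # always-accept gate, for the demo only
    apply=lambda pi, d: pi | {d},
)
```

The key structural point survives the simplification: the solver never mutates the policy; only a gated `apply` does.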
Three-Axis Pluggability
| Axis | Interface | Contract | Extension Point |
|---|---|---|---|
| Agent (BYOA) | `BaseAgent` | Implement `solve(task: Task) -> Trajectory` | Any architecture: ReAct, Plan-and-Solve, multi-agent, custom |
| Benchmark (BYOE) | `BenchmarkAdapter` | Implement `get_tasks(split, limit)` + `evaluate(task, trajectory) -> Feedback` | Any domain with task + evaluation signal |
| Algorithm (BYO-Algo) | `EvolutionEngine` | Implement `step(workspace, observations, history, trial) -> StepResult` | Any evolution strategy: LLM mutation, RL, genetic programming |
Key architectural insight: The Agent Workspace is the unifying abstraction. By standardizing agent state as a file-system directory with a manifest, the evolution engine can mutate any agent without knowing its internals. The agent reloads from its workspace after each cycle, picking up mutations transparently. This decoupling is what makes A-Evolve truly agent-agnostic.
10 Component Breakdown
Core Type System
```python
from agent_evolve.types import Task, Trajectory, Feedback, StepResult

# Task: input to the agent
Task(
    id: str,          # Unique task identifier
    input: str,       # Task prompt/description
    metadata: dict,   # Benchmark-specific metadata (rule, answer, etc.)
)

# Trajectory: agent's execution trace
Trajectory(
    task_id: str,     # Links back to Task
    output: str,      # Agent's final answer
    steps: list[dict],# Execution trace (tool calls, reasoning, etc.)
)

# Feedback: benchmark evaluation result
Feedback(
    success: bool,    # Binary pass/fail
    score: float,     # Continuous score [0, 1]
    detail: str,      # JSON-serialized evaluation details
    raw: dict,        # Raw evaluation data
)

# StepResult: evolution engine output
StepResult(
    mutated: bool,    # Whether workspace was modified
    summary: str,     # Human-readable mutation description
    metadata: dict,   # Engine-specific metadata
)
```
Agent Workspace Contract (AgentWorkspace)
The AgentWorkspace class mediates all file-system interactions:
| Method | Signature | Purpose |
|---|---|---|
| `read_prompt()` | `→ str` | Read current system prompt |
| `write_prompt(text)` | `→ None` | Overwrite system prompt |
| `list_skills()` | `→ list[SkillInfo]` | Enumerate available skills |
| `write_skill(name, content)` | `→ None` | Create/update a SKILL.md file |
| `get_skill_content(name)` | `→ str` | Read a specific skill's content |
| `add_memory(entry, category)` | `→ None` | Append to episodic/semantic memory |
| `list_memories()` | `→ list[dict]` | Read all memory entries |
Base Agent Contract (BaseAgent)
```python
from agent_evolve.protocol.base_agent import BaseAgent

class BaseAgent:
    def __init__(self, workspace_dir: str | Path):
        self.workspace = AgentWorkspace(workspace_dir)
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def solve(self, task: Task) -> Trajectory:
        """Implement this: process task, return trajectory."""
        raise NotImplementedError

    def reload_from_fs(self):
        """Re-read workspace state after evolution mutations."""
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def remember(self, content: str, category: str = "episodic"):
        """Store episodic memory during solve phase."""
        self.workspace.add_memory({"content": content}, category)
```
Evolution Engine Contract (EvolutionEngine)
```python
from agent_evolve.engine.base import EvolutionEngine

class EvolutionEngine:
    def step(
        self,
        workspace: AgentWorkspace,        # Mutable workspace reference
        observations: list[Observation],  # Accumulated solve results
        history: EvolutionHistory,        # Previous cycles' data
        trial: TrialRunner,               # On-demand validation
    ) -> StepResult:
        """One evolution cycle. Analyze failures, mutate workspace."""
        raise NotImplementedError
```
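Under this contract, a toy engine might look like the following sketch. The base class is stubbed locally, the workspace is a plain dict, and the failure-memory strategy is purely illustrative (it is not one of the four reference algorithms):

```python
class EvolutionEngine:
    """Local stand-in for the framework's engine base class (assumption)."""
    def step(self, workspace, observations, history, trial):
        raise NotImplementedError

class FailureMemoryEngine(EvolutionEngine):
    """Toy strategy: record each failed observation as an episodic memory entry."""
    def step(self, workspace, observations, history, trial):
        failures = [o for o in observations if not o.get("success", False)]
        for obs in failures:
            # Mutate the (dict-modeled) workspace in place.
            workspace.setdefault("memory", []).append(
                {"content": f"failed on {obs['task_id']}"}
            )
        return {"mutated": bool(failures),
                "summary": f"recorded {len(failures)} failure(s)"}

ws = {}
engine = FailureMemoryEngine()
result = engine.step(ws, [{"task_id": "t1", "success": False},
                          {"task_id": "t2", "success": True}], None, None)
```

The engine only sees the workspace and observations, which is exactly why any mutation strategy can be swapped in behind `step()`.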
Benchmark Adapter Contract (BenchmarkAdapter)
```python
from agent_evolve.benchmarks.base import BenchmarkAdapter

class BenchmarkAdapter:
    def get_tasks(
        self,
        split: str = "train",  # "train" or "holdout"
        limit: int = 10,
    ) -> list[Task]:
        """Return benchmark tasks for the given split."""
        raise NotImplementedError

    def evaluate(
        self,
        task: Task,
        trajectory: Trajectory,
    ) -> Feedback:
        """Score agent output against ground truth."""
        raise NotImplementedError
```
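A minimal adapter satisfying this contract, sketched with a locally stubbed base class and a hypothetical exact-match QA benchmark (tasks and feedback are modeled as plain dicts for self-containment):

```python
class BenchmarkAdapter:
    """Local stand-in for the framework's adapter base class (assumption)."""
    def get_tasks(self, split="train", limit=10):
        raise NotImplementedError
    def evaluate(self, task, trajectory):
        raise NotImplementedError

class ExactMatchAdapter(BenchmarkAdapter):
    """Toy adapter: tasks are Q/A pairs, scored by exact string match."""
    def __init__(self, qa_pairs):
        self.qa_pairs = qa_pairs

    def get_tasks(self, split="train", limit=10):
        return [{"id": str(i), "input": q, "metadata": {"answer": a}}
                for i, (q, a) in enumerate(self.qa_pairs[:limit])]

    def evaluate(self, task, trajectory):
        ok = trajectory["output"].strip() == task["metadata"]["answer"]
        return {"success": ok, "score": 1.0 if ok else 0.0}

adapter = ExactMatchAdapter([("2+2?", "4")])
task = adapter.get_tasks()[0]
fb = adapter.evaluate(task, {"task_id": task["id"], "output": "4"})
```

Any domain that can enumerate tasks and score trajectories fits this shape, which is the BYOE claim in practice.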
Evolver Configuration (EvolveConfig)
```python
import agent_evolve as ae

config = ae.EvolveConfig(
    batch_size=8,   # Tasks per solve batch
    max_cycles=10,  # Maximum evolution iterations
    egl_window=2,   # EGL convergence window size
)
```
Shared Primitives
The evolution engine has access to three shared primitives:
| Primitive | Purpose | API |
|---|---|---|
| TrialRunner | On-demand validation during evolution | Run holdout tasks to test candidate mutations before committing |
| EvolutionHistory | Observation + version queries | Query past cycles, compare scores, retrieve failure patterns |
| VersionControl | Git-based rollback | Tag accepted mutations, revert rejected ones, maintain audit trail |
11 Core Mechanisms (Detailed)
Mechanism 1: The Four-Phase Evolver
The evolver F_Evolve is decomposed into four cooperating functions that implement the three principles of agentic evolution:
┌─────────────────┐
│ Obs_{1:t} │ Accumulated deployment evidence
│ (trajectories, │ (tool traces, errors, feedback)
│ feedback) │
└────────┬────────┘
│
┌────────▼────────┐
│ DIAGNOSE │ Goal-oriented: identify WHAT to change
│ │
│ • Classify │ Analyze failure modes across tasks
│ failure │ Attribute to root causes
│ modes │ (missing tools, brittle logic,
│ • Attribute │ interface mismatches)
│ root causes │
│ • Produce │ Output: update objective g_t
│ update │
│ objective │
└────────┬────────┘
│ g_t
┌────────▼────────┐
│ PLAN │ Compositional: specify HOW to change
│ │
│ • Target │ Select target artifacts
│ artifacts │ Choose edit operators
│ • Edit ops │ (add, patch, refactor, prune)
│ (add/patch/ │ Define ordering constraints
│ refactor/ │
│ prune) │ Output: edit plan p_t
│ • Ordering │
└────────┬────────┘
│ p_t
┌────────▼────────┐
│ UPDATE │ Execute the plan
│ │
│ • Synthesize │ Generate concrete file changes
│ artifact │ Write SKILL.md, patch prompts,
│ changes │ update memory
│ • Attach │ Include provenance + tests
│ provenance │
│ • Build Δ_t │ Output: candidate update Δ_t
└────────┬────────┘
│ Δ_t
┌────────▼────────┐
│ VERIFY │ Autonomy: decide WHEN to commit
│ │
│ • Run holdout │ Evaluate on held-out tasks
│ validation │ Check for regressions
│ • Check for │ Return commit decision c_t ∈ {0,1}
│ regressions │
│ • Commit or │ Output: c_t
│ rollback │
└─────────────────┘
Mechanism 2: Persistent Artifact State (π_S)
The artifact state is organized into three typed registries:
Knowledge Registry (K_t):
- Stores structured or textual artifacts: schemas, workflows, interface contracts, exemplars
- Addressable and versioned — supports retrieval, patching, and replacement
- Enables goal-oriented evolution by localizing failures to specific knowledge components
- Physically: prompts/, skills/ directories
Tool Registry (T_t):
- Contains executable functions with explicit input-output signatures and associated tests
- During solve-time: provides deterministic action primitives reducing inference variance
- During evolve-time: serves as diagnostic instruments for replaying failures and probing edge cases
- Physically: tools/ directory
Validation Registry (V_t):
- Contains governance assets: unit tests, regression suites, human review hooks
- Validation artifacts are themselves editable (the evolution engine can synthesize new tests)
- Grounds the commit decision c_t: updates committed only if they pass verification
- Critical for preventing regressions over long deployment horizons
Mechanism 3: Edit Operators
A-Evolve supports a small set of canonical edit operators over π_S:
| Operator | Target | Description | Example |
|---|---|---|---|
| ADD | K, T, V | Create new artifact | Synthesize entity-verification/SKILL.md |
| PATCH | K, T | Modify existing artifact | Update parser tool for new JSON schema |
| REFACTOR | K, T | Restructure without changing behavior | Split monolithic skill into composable sub-skills |
| PRUNE | K, T, V | Remove obsolete artifact | Delete skill that no longer contributes to score |
All proposed updates are logged with full provenance, enabling auditing, rollback, and risk-based human oversight.
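As a sketch, three of the operators can be modeled as a pure function over an in-memory registry (REFACTOR is omitted since it is behavior-preserving restructuring). This is illustrative only, not the framework's file-system implementation:

```python
def apply_op(registry: dict, op: str, name: str, content=None) -> dict:
    """Apply one canonical edit operator, returning a new registry state."""
    new = dict(registry)  # never mutate in place: the old state stays recoverable
    if op == "ADD":
        assert name not in new, "ADD must target a new artifact"
        new[name] = content
    elif op == "PATCH":
        assert name in new, "PATCH must target an existing artifact"
        new[name] = content
    elif op == "PRUNE":
        new.pop(name, None)
    else:
        raise ValueError(f"unknown operator: {op}")
    return new

K = {}
K = apply_op(K, "ADD", "entity-verification/SKILL.md", "v1")
K = apply_op(K, "PATCH", "entity-verification/SKILL.md", "v2")
```

Keeping each operator a pure state transition is what makes the provenance log (old state, operator, new state) sufficient for audit and rollback.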
Mechanism 4: Reference Evolution Algorithms
A-Evolve ships with four reference algorithms, each implementing EvolutionEngine.step():
| Algorithm | File | Strategy | Key Innovation | Best Domain |
|---|---|---|---|---|
| `adaptive_evolve` | `algorithms/adaptive_evolve/` | Per-claim feedback analysis + meta-learning | Analyzes individual claims within task feedback to attribute failures at the finest granularity; meta-learns which mutation patterns are most effective | MCP-Atlas (🥇 79.4%) |
| `adaptive_skill` | `algorithms/adaptive_skill/` | LLM-driven workspace mutation with bash tool access | Grants the evolution engine shell access to test mutations programmatically; can run scripts, validate outputs, and iterate within a single evolution step | Terminal-Bench 2.0 (76.5%) |
| `skillforge` | `algorithms/skillforge/` | LLM-driven workspace mutation with EGL gating | Focuses on skill synthesis with strict EGL-based convergence detection; stops evolving when holdout improvement plateaus | SkillsBench (34.9%) |
| `guided_synth` | `algorithms/guided_synth/` | Memory-first evolution + LLM-guided intervention synthesis | Prioritizes memory accumulation before skill synthesis; uses episodic memory to guide when and how to intervene | SWE-bench Verified (76.8%) |
Mechanism 5: Evolutionary Generality Loss (EGL)
EGL is A-Evolve's convergence detection metric:
EGL(t) = Score_train(t) - Score_holdout(t)
The framework monitors EGL across a sliding window (egl_window parameter). When EGL stabilizes (the gap between training and holdout performance stops narrowing), evolution halts. This serves two purposes:
- Prevents overfitting: If training score improves but holdout score doesn't, the agent is memorizing task-specific solutions rather than learning generalizable capabilities
- Saves compute: Stops evolution when further cycles are unlikely to yield meaningful holdout improvement
The EGL window is configurable per algorithm — skillforge uses strict EGL gating, while guided_synth uses a looser window to allow longer exploration before convergence.
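The gating logic described above can be sketched in a few lines. The helper names and the stabilization tolerance `tol` are assumptions for illustration; the framework's actual egl_window semantics may differ:

```python
def egl(score_train, score_holdout):
    """Evolutionary Generality Loss: the train/holdout generalization gap."""
    return score_train - score_holdout

def egl_converged(history, egl_window=3, tol=0.01):
    """Halt evolution once EGL has stabilized over the last egl_window cycles.

    history is a list of (score_train, score_holdout) pairs, one per cycle;
    convergence is declared when the EGL values in the window vary by < tol.
    """
    if len(history) < egl_window:
        return False
    window = [egl(tr, ho) for tr, ho in history[-egl_window:]]
    return max(window) - min(window) < tol
```

A strict gate (skillforge-style) would use a small window and tolerance; a looser gate (guided_synth-style) would widen both to allow longer exploration.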
Mechanism 6: Git-Based Version Control
Every accepted mutation is git-tagged with an incrementing version:
evo-0 → Initial seed workspace
evo-1 → First accepted mutation (e.g., prompt hardening)
evo-2 → Second accepted mutation (e.g., added json-sum-exact skill)
evo-3 → Third accepted mutation (e.g., added episodic memory patterns)
...
If the Gate phase rejects a mutation (holdout regression), the workspace is automatically rolled back to the last tagged version. This provides:
- Full audit trail: Every evolution step is inspectable via git diff evo-N..evo-N+1
- Reproducibility: Checkout any tag to reproduce the agent at that evolution stage
- Safety: No permanent damage from bad mutations; always recoverable
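The tag-and-rollback discipline can be implemented with plain git commands. This is a sketch under stated assumptions: the helper names and the exact commit/clean policy are illustrative, and the framework's internals may differ:

```python
import subprocess

def run_git(workspace, *args):
    """Run a git command inside the workspace directory."""
    return subprocess.run(["git", "-C", workspace, *args],
                          check=True, capture_output=True, text=True)

def tag_accepted(workspace, version):
    """After the Gate accepts a mutation, snapshot and tag the workspace."""
    run_git(workspace, "add", "-A")
    run_git(workspace, "commit", "-m", f"evolution step {version}", "--allow-empty")
    run_git(workspace, "tag", f"evo-{version}")

def rollback(workspace, version):
    """On holdout regression, restore the last-known-good tagged state."""
    run_git(workspace, "reset", "--hard", f"evo-{version}")
    run_git(workspace, "clean", "-fd")  # drop untracked files left by the bad mutation
```

Because tags are immutable pointers, rollback is O(1) regardless of how many files the rejected mutation touched.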
12 Programming Language
Framework Implementation
| Component | Language | Version | Notes |
|---|---|---|---|
| Core framework | Python | 3.11+ | Package: agent_evolve |
| Evolution algorithms | Python | 3.11+ | Under algorithms/ |
| Seed workspaces | YAML + Markdown | — | manifest.yaml, SKILL.md, system.md |
| Memory storage | JSONL | — | memory/episodic.jsonl |
| Version control | Git | — | Automated tagging and rollback |
| Package management | pip | — | pip install -e ".[all,dev]" |
Package Structure
a-evolve/
├── agent_evolve/ # Core framework package
│ ├── __init__.py # ae.Evolver, ae.EvolveConfig exports
│ ├── protocol/
│ │ └── base_agent.py # BaseAgent ABC
│ ├── benchmarks/
│ │ └── base.py # BenchmarkAdapter ABC
│ ├── engine/
│ │ └── base.py # EvolutionEngine ABC
│ ├── contract/
│ │ └── workspace.py # AgentWorkspace file-system abstraction
│ ├── types.py # Task, Trajectory, Feedback, StepResult
│ └── algorithms/ # Reference evolution algorithms
│ ├── adaptive_evolve/
│ ├── adaptive_skill/
│ ├── skillforge/
│ └── guided_synth/
├── seed_workspaces/ # Pre-built starting points
│ ├── swe/
│ ├── mcp/
│ ├── terminal/
│ └── reasoning/
├── docs/ # Benchmark-specific guides
│ ├── swe-bench-demo.md
│ ├── mcp-atlas-demo.md
│ ├── terminal-bench-demo.md
│ ├── skill-bench-demo.md
│ └── algorithms/ # Algorithm documentation
│ ├── adaptive-evolve.md
│ ├── adaptive-skill.md
│ ├── skillforge.md
│ └── guided-synth.md
├── figs/ # Figures for README
├── pyproject.toml / setup.py # Package definition
└── README.md
API Design Philosophy
The API is designed for minimal surface area — "3 lines of code" is the guiding principle:
import agent_evolve as ae
evolver = ae.Evolver(agent="swe-verified", benchmark="swe-verified")
results = evolver.run(cycles=10)
Extensibility is achieved through interface implementation rather than configuration complexity. Custom agents implement solve(), custom benchmarks implement get_tasks() + evaluate(), and custom algorithms implement step(). Each interface is a single method.
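A minimal sketch of what implementing these interfaces might look like. The method names (solve, get_tasks, evaluate) come from the text above, but the Task fields, return types, and scoring loop are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical stand-in for agent_evolve's Task type."""
    task_id: str
    prompt: str
    expected: str

class EchoAgent:
    """Custom agent: the single required method is solve()."""
    def solve(self, task):
        return task.prompt.upper()  # trivial policy for illustration

class ToyBenchmark:
    """Custom benchmark: get_tasks() + evaluate()."""
    def get_tasks(self):
        return [Task("t1", "hello", "HELLO"), Task("t2", "world", "WORLD")]

    def evaluate(self, task, answer):
        return 1.0 if answer == task.expected else 0.0

def score(agent, benchmark):
    """The kind of harness loop an Evolver would run around these interfaces."""
    tasks = benchmark.get_tasks()
    return sum(benchmark.evaluate(t, agent.solve(t)) for t in tasks) / len(tasks)
```

Since each interface is a single method, swapping in a real LLM-backed agent or a production benchmark leaves the harness untouched.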
13 Memory Management
Memory Architecture
A-Evolve employs a file-system-native memory model where all memory is serialized to the workspace directory. This is fundamentally different from in-process memory systems — memory persists across agent restarts, evolution cycles, and even deployment sessions.
| Memory Type | Storage | Format | Lifecycle |
|---|---|---|---|
| Episodic memory | memory/episodic.jsonl | JSON Lines (one entry per line) | Appended during solve phase; analyzed during evolve phase |
| Semantic memory | memory/semantic.jsonl | JSON Lines | Synthesized during evolve phase from recurring patterns |
| Skill memory | skills/*/SKILL.md | Markdown with YAML frontmatter | Created/patched by evolution engine |
| Prompt memory | prompts/system.md | Markdown | Hardened by evolution engine (append constraints) |
Memory Flow Through the Loop
SOLVE PHASE EVOLVE PHASE
───────────── ──────────────
Agent reads: Engine reads:
• prompts/system.md • Obs buffer (all trajectories)
• skills/ • memory/episodic.jsonl
• memory/ (last N entries) • Current workspace state
Agent writes: Engine writes:
• memory/episodic.jsonl • skills/new-skill/SKILL.md
(task outcomes, traces) • prompts/system.md (patches)
• memory/episodic.jsonl (patterns)
Memory Capacity and Saturation
Unlike heuristic memory accumulation systems (Reflexion, Voyager) that can suffer from context saturation, A-Evolve's approach to memory is qualitatively different:
- Memory is structured, not raw text. Episodic entries have a schema (content, category, metadata); skills have YAML frontmatter plus procedural content.
- Memory is curated by the evolution engine. The engine does not merely append; it synthesizes, refactors, and prunes, which addresses the diminishing-returns problem of naive memory accumulation.
- Memory is bounded by workspace size. The file system imposes natural limits, and the agent reads only the last N entries at solve time, preventing context overflow.
- Skills amortize memory. Recurring failure patterns are compiled into permanent skills rather than remaining as raw episodic traces. This is the key compositional principle in action: fragile reasoning is crystallized into reusable capability.
The amortization argument: A-Evolve's central memory insight is that memory should be consumed by the evolution engine to produce skills, not merely accumulated for the solver to re-read. This prevents the context saturation that plagues append-and-retrieve systems.
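The file paths (memory/episodic.jsonl) and entry schema (content, category, metadata) below come from the text; the helper names and the read-last-N policy are assumptions sketching how the bounded-read discipline might work:

```python
import json
from pathlib import Path

def append_episode(workspace, content, category, metadata=None):
    """Solve phase: append one structured entry to memory/episodic.jsonl."""
    path = Path(workspace) / "memory" / "episodic.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"content": content, "category": category, "metadata": metadata or {}}
    with path.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def read_recent(workspace, n=20):
    """Solve-time read: only the last n entries, bounding context size."""
    path = Path(workspace) / "memory" / "episodic.jsonl"
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in lines[-n:]]
```

The evolve phase, by contrast, would read the full file, mine it for recurring patterns, and emit skills, after which obsolete entries can be pruned.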
14 Continued Learning
Continual Deployment-Time Adaptation
A-Evolve's entire thesis is built on continued learning. The system is explicitly designed for open-ended, indefinite deployment horizons where the agent must adapt to:
| Challenge | How A-Evolve Handles It |
|---|---|
| Distribution shift | Evolution engine diagnoses failures caused by environment changes and synthesizes targeted fixes |
| API drift | Schema changes detected through failure analysis; adapter tools synthesized and validated |
| New task types | Skill synthesis creates new capabilities; memory patterns bootstrap adaptation |
| Capability degradation | EGL monitoring detects regression; git rollback preserves last-known-good state |
| Context saturation | Skills amortize raw memory into permanent capabilities; pruning removes obsolete artifacts |
The Three Scaling Axes
A-Evolve's theoretical framework positions evolution alongside two established scaling axes:
┌──────────────────────────┐
│ CAPABILITY FRONTIER │
│ │
Scaling Axis 1: │ ┌──────────────────┐ │
Training-Time ─────────│──▶│ Static ability │ │
Compute │ │ (pre-training + │ │
(Kaplan et al.) │ │ post-training) │ │
│ └──────────────────┘ │
│ │ │
Scaling Axis 2: │ ┌────────▼─────────┐ │
Inference-Time ────────│──▶│ Per-task │ │
Compute │ │ reasoning │ │
(Snell et al.) │ │ (CoT, search) │ │
│ └──────────────────┘ │
│ │ │
Scaling Axis 3: │ ┌────────▼─────────┐ │
Evolution-Time ────────│──▶│ Cross-episode │ │
Compute │ │ adaptation │ │
(A-Evolve) │ │ (skills, tools, │ │
│ │ memory) │ │
│ └──────────────────┘ │
└──────────────────────────┘
Convergence and Termination
The evolution loop terminates when any of these conditions is met:
- EGL convergence: The Evolutionary Generality Loss stabilizes over the egl_window
- Max cycles reached: Hard limit on evolution iterations
- Perfect score: Training score reaches 100% (rare)
- Budget exhausted: Compute budget for evolution-time is depleted
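The four conditions above can be combined into a single stop check. The function name, argument shapes, and check ordering here are assumptions, not the framework's actual API:

```python
def should_stop(cycle, history, *, max_cycles=10, budget_remaining=1.0,
                egl_window=3, tol=0.01):
    """Return a stop reason if any termination condition fires, else None.

    history holds (score_train, score_holdout) pairs, one per completed cycle.
    """
    if history and history[-1][0] >= 1.0:
        return "perfect_score"          # training score hit 100% (rare)
    if cycle >= max_cycles:
        return "max_cycles"             # hard iteration limit
    if budget_remaining <= 0:
        return "budget_exhausted"       # evolution-time compute depleted
    if len(history) >= egl_window:
        egls = [tr - ho for tr, ho in history[-egl_window:]]
        if max(egls) - min(egls) < tol:
            return "egl_convergence"    # train/holdout gap has stabilized
    return None
```

Returning a named reason rather than a bare boolean keeps the provenance log informative about why a given run halted.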
Long-Horizon Deployment Vision
The paper envisions A-Evolve running continuously in production:
Deployment Timeline
────────────────────────────────────────────────────────
Day 1: Deploy seed agent (π_0)
Day 2-5: Accumulate deployment evidence (Obs)
Day 5: Evolution cycle 1 → add entity-verification skill
Day 6-10: More evidence; improved performance
Day 10: Evolution cycle 2 → patch prompt, add memory patterns
Day 15: API change detected → evolution cycle 3 → schema adapter
Day 20: EGL converges → evolution pauses
Day 30: New failure pattern → evolution resumes
...
Open question from the paper: whether the evolution-scaling frontier P*(C_evolve) eventually saturates (suggesting fundamental limits to non-parametric adaptation) or continues to improve (suggesting that evolution can substitute for retraining). The authors present this as an empirical question requiring longer-horizon experiments than current benchmarks support.
Relationship to Other Continual Learning Paradigms
| Paradigm | Parametric Updates | Artifact Updates | Agent-Directed | Governed |
|---|---|---|---|---|
| Online fine-tuning | ✅ | ❌ | ❌ | ❌ |
| Reflexion (Shinn 2023) | ❌ | ✅ (text memory) | ❌ | ❌ |
| Voyager (Wang 2023) | ❌ | ✅ (skills) | ❌ | ❌ |
| ADAS (Hu 2024) | ❌ | ✅ (agent architectures) | Partial | ❌ |
| A-Evolve | ✅ (planned) | ✅ (typed artifacts) | ✅ (evolver agent) | ✅ (gate + git) |
15 Applications
Current Applications (Demonstrated)
| Application | Benchmark | Task Description | Result |
|---|---|---|---|
| Software engineering automation | SWE-bench Verified | Resolve real GitHub issues in Python repositories | 76.8% (~#5) |
| Multi-tool orchestration | MCP-Atlas | Coordinate 16+ MCP servers for complex tool-calling tasks | 79.4% (🥇 #1) |
| Terminal/CLI operations | Terminal-Bench 2.0 | Execute system administration and DevOps tasks in Docker | 76.5% (~#7) |
| Autonomous skill discovery | SkillsBench | Learn and apply new capabilities without human instruction | 34.9% (#2) |
Potential Applications (Framework Affordances)
Because A-Evolve is a framework with pluggable agents, benchmarks, and algorithms, it can in principle be applied to any domain where:
- Tasks can be automated (agent can attempt them)
- Evaluation signal exists (success/failure can be measured)
- Agent state is file-representable (prompts, skills, tools, memory)
| Domain | Potential Agent | Potential Benchmark | Evolution Target |
|---|---|---|---|
| Customer support | RAG-based chatbot | Resolution rate, CSAT score | Skills for handling edge cases, domain knowledge |
| Data analysis | Code-generating analyst | Accuracy on analytical queries | SQL patterns, visualization templates |
| Security monitoring | Alert triage agent | True positive rate, response time | Detection rules, investigation playbooks |
| Content generation | Marketing copywriter | A/B test click-through rate | Style guides, audience-specific templates |
| Research assistance | Literature review agent | Relevance scoring, citation accuracy | Search strategies, synthesis templates |
| Cloud operations | Infrastructure-as-code agent | Deployment success rate | Terraform patterns, error recovery scripts |
Comparison with Other Evolutionary Systems
| System | Year | Evolution Target | Evolution Mechanism | Domain Scope | Governance |
|---|---|---|---|---|---|
| FunSearch | 2023 | Single Python function | LLM + evolutionary search | Mathematical problems | None (best-score selection) |
| AlphaEvolve | 2025 | Entire codebases | Gemini Flash + Pro ensemble | Algorithms, math, hardware | Automated evaluation |
| OpenEvolve | 2025 | Code programs | LLM-as-mutator + MAP-Elites | General code optimization | Evaluator pipeline |
| A-Evolve | 2026 | Agent workspace (prompts, skills, tools, memory) | Autonomous evolver agent (Diagnose→Plan→Update→Verify) | Any agent domain | EGL gating + git rollback + validation registry |
| GEPA | 2025 | Heuristic algorithms | LLM-guided population evolution | Combinatorial optimization | Best-score selection |
| ShinkaEvolve | 2025 | Optimization algorithms | LLM-driven mutation + island model | Algorithm design | Fitness-based selection |
Key Differentiators for Applications
A-Evolve's unique position in the evolutionary AI landscape is that it evolves agents (the deployed system's behavior) rather than programs (standalone code solving a specific problem). This means:
- The mutation target is the deployment-time policy, not a static algorithm
- Evolution happens during deployment, not in a separate research loop
- The evolver is itself an agent, not a fixed pipeline — it can adapt its own strategy
- Governance is built-in, not bolted on — the validation registry and git integration ensure safety
- The framework is agent-agnostic — it can evolve any system whose state lives on the file system
Limitations and Open Questions
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM sampling non-determinism | Exact numerical reproduction not guaranteed | Git-tagged checkpoints enable qualitative reproduction |
| Non-parametric evolution only (current release) | Cannot modify model weights — bounded by base model capability | Parametric evolution planned for future work; current non-parametric results already competitive |
| Benchmark-dependent evaluation | Evolved skills may overfit to benchmark-specific patterns | EGL gating on holdout set; but transfer to production settings untested |
| Evolution-time cost | Multiple LLM calls per cycle for diagnose/plan/update/verify | Early stopping via EGL convergence; configurable batch sizes |
| Single-model default | Same model used for both solving and evolving | Optionally use stronger model for evolution; but self-evolution is a feature, not a bug |
| Limited convergence theory | Evolution-Scaling Hypothesis is conjectured, not proven | Empirical evidence supports it across 4 benchmarks; formal analysis remains future work |
This analysis is based on the arXiv paper (v2, February 2026), the open-source repository (March 2026), the MarkTechPost coverage and tutorial (March 2026), and the Hugging Face paper page. The A-Evolve framework represents a significant conceptual advance in how we think about LLM system improvement — shifting from manual prompt engineering and static training to autonomous, governed, deployment-time evolution. Whether the Evolution-Scaling Hypothesis holds at scale and across diverse production environments remains the central open empirical question.