
A-Evolve

Universal Infrastructure for Self-Improving Agents via Agentic Evolution

Organization: Amazon (A-EVO-Lab)
Published: February 3, 2026 (paper); March 25, 2026 (open-source release)
Type: Position Paper (ICML) + Open-Source Framework
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Position: Agentic Evolution is the Path to Evolving LLMs

arXiv: 2602.00359 (v2)

GitHub Repository: A-EVO-Lab/a-evolve

License: MIT

Publication Venue: ICML 2026 (Machine Learning)

Publication Date: February 3, 2026 (arXiv preprint); March 25, 2026 (open-source infrastructure release)

Lineage: Positioned as a unifying framework that subsumes and extends prior work on LLM self-improvement — including Reflexion (Shinn et al., 2023), test-time training (Huang et al., 2025), prompt optimization (Zhou et al., 2022), and heuristic memory accumulation (Wang et al., 2024; Gao et al., 2025). The authors explicitly frame A-Evolve as the "PyTorch for Agentic AI" — a standardized infrastructure layer rather than a standalone agent.

BibTeX:

@article{lin2026position,
  title={Position: Agentic Evolution is the Path to Evolving LLMs},
  author={Lin, Minhua and Lu, Hanqing and Shi, Zhan and He, Bing and
          Mao, Rui and Zhang, Zhiwei and Wu, Zongyu and Tang, Xianfeng
          and Liu, Hui and Dai, Zhenwei and others},
  journal={arXiv preprint arXiv:2602.00359},
  year={2026}
}

2 Authors and Team

The paper is authored by a large team at Amazon, operating under the lab name A-EVO-Lab:

Primary Authors: Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei

The team spans Amazon's applied science and research divisions. Several authors have prior work in continual learning, LLM agents, and deployment-time adaptation. The inclusion of Jian Pei — a prominent database and data mining researcher (Simon Fraser University → Duke University → Amazon) — signals the team's focus on systematic, infrastructure-grade approaches to agent improvement rather than ad-hoc optimization.

The lab name "A-EVO-Lab" (Amazon Evolution Lab) and the paper's ICML positioning suggest this is a strategic research direction for Amazon — framing deployment-time agent self-improvement as a first-class scaling axis alongside training-time and inference-time compute.

| Role | Names |
| --- | --- |
| Lead authors | Minhua Lin, Hanqing Lu |
| Core framework | Zhan Shi, Bing He, Rui Mao |
| Algorithm design | Zhiwei Zhang, Zongyu Wu, Xianfeng Tang |
| Evaluation / benchmarks | Hui Liu, Zhenwei Dai |
| Theory / formalism | Xiang Zhang, Suhang Wang |
| Senior leadership | Benoit Dumoulin, Jian Pei |

3 Core Contribution

Key Novelty: A-Evolve introduces agentic evolution as a third scaling axis for LLM systems — complementary to training-time compute and inference-time compute — where an explicit evolver agent autonomously diagnoses failures, proposes structured mutations to persistent agent state, gates updates through validation, and commits only verified improvements. The framework treats the entire agent workspace (prompts, skills, tools, memory) as mutable file-system state, enabling domain-agnostic, algorithm-agnostic, and agent-agnostic evolution.

The Central Argument

The paper makes a position argument with three pillars:

  1. The Train-Deploy Gap is Fundamental. Static LLM training (pre-training + post-training) cannot anticipate the infinite variety of real-world deployment scenarios. Models inevitably degrade under distribution shift, API changes, and evolving constraints. Neither scaling training data nor extending inference-time reasoning chains closes this gap.

  2. Existing Adaptation Methods Are Insufficient. Parametric approaches (fine-tuning, test-time training) are opaque, risk catastrophic forgetting, and lack semantic accountability. Non-parametric heuristic approaches (memory accumulation, append-and-retrieve) saturate with noisy text and exhibit diminishing returns. Both fail because the update mechanism F_Evolve is static and heuristic rather than adaptive and goal-directed.

  3. Evolution Must Be Agentic. The evolution process itself must be elevated from a fixed pipeline to an autonomous evolver agent that reasons about failures, decides what/when/how to change, and produces verified, composable updates. This is the only path to sustained, open-ended adaptation over indefinite deployment horizons.

What Makes A-Evolve Novel

| Dimension | Prior Work | A-Evolve |
| --- | --- | --- |
| Update mechanism | Fixed heuristic (append memory, gradient step) | Autonomous evolver agent with Diagnose→Plan→Update→Verify pipeline |
| Mutation target | Single axis (weights OR prompts OR memory) | Composite policy π = (π_θ, π_S): any combination of parametric and non-parametric state |
| Update governance | Unconditional (always apply) | Conditional commit gate: propose, verify, accept/reject |
| Artifact structure | Unstructured text blobs | Typed persistent artifacts: Knowledge registry K, Tool registry T, Validation registry V |
| Evolution scope | Domain-specific | Domain-agnostic framework: BYOA (agent), BYOE (benchmark), BYO-Algo (algorithm) |
| Scaling theory | None | Evolution-Scaling Hypothesis: adaptation frontier scales with evolution-time compute |
| Reproducibility | Ad hoc | Every mutation git-tagged (evo-1, evo-2, …) with full provenance |

The Evolution-Scaling Hypothesis

The paper's most ambitious theoretical contribution is the Evolution-Scaling Hypothesis — a conjecture that deployment-time adaptation capacity scales predictably with compute allocated to the evolution process, analogous to training-time scaling laws (Kaplan et al., 2020) and inference-time scaling (Snell et al., 2025):

P*(C_evolve, π_0) = max_{F_evolve} E_{π ~ F_evolve(π_0)} [P(π)]

Where: - P*(C_evolve, π_0) is the compute-optimal evolution frontier - C_evolve is the total evolution-time compute budget - F_evolve ranges over all evolution strategies within that budget - π_0 is the initial deployed policy

The hypothesis states that P* is strictly increasing with C_evolve: more evolution compute → more accurate diagnosis, more candidate updates considered, more robust artifact synthesis, and stronger verification before committing. Validated updates persist and compound, making evolution a convergent process whose effectiveness scales with resources.

This positions agentic evolution as a third scaling law — after training-time and inference-time — and the first to operate during deployment rather than before it.
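The shape of the hypothesis can be illustrated with a toy best-of-N model (purely illustrative, not from the paper): treat each unit of evolution compute as one candidate update sampled by the evolver, with the gate committing the best verified candidate. The resulting frontier rises monotonically but with diminishing returns:

```python
import random

def evolution_frontier(c_evolve, trials=2000, seed=0):
    """Toy model of P*(C_evolve): each unit of compute buys one candidate
    update; the gate keeps the best verified candidate. Returns the
    expected best score over many simulated deployments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = max(rng.random() for _ in range(c_evolve))
        total += best
    return total / trials

frontier = [evolution_frontier(c) for c in (1, 2, 4, 8)]
# frontier is strictly increasing in evolution compute, with sublinear gains
```

The monotone, log-like curve mirrors the claim that more evolution compute buys more candidate updates and stronger verification, while each additional unit helps less than the last.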

4 Supported Solutions

A-Evolve is a framework, not a standalone agent. It evolves any agent that implements the BaseAgent.solve() interface, across any domain with a BenchmarkAdapter, using any evolution strategy via EvolutionEngine.step(). The types of evolvable artifacts include:

| Artifact Type | Location in Workspace | Description | Example Mutation |
| --- | --- | --- | --- |
| System prompts | prompts/system.md | Instructional text governing LLM reasoning | Harden output format constraints, add domain-specific procedures |
| Skills | skills/*/SKILL.md | Reusable procedural knowledge files | Synthesize entity-verification skill, search-iteration strategy |
| Tools | tools/ | External interface configurations and wrappers | Add API schema adapter, patch parser for new JSON format |
| Memory | memory/*.jsonl | Episodic and semantic memory entries | Record failure patterns, amortize successful reasoning traces |
| Manifest | manifest.yaml | Agent identity, entrypoint, evolvable layer declarations | Update evolvable_layers, change reload strategy |

Solution Domain Breadth

The framework ships with adapters for four diverse benchmark domains:

| Domain | Benchmark | Task Type | Seed Workspace |
| --- | --- | --- | --- |
| Software engineering | SWE-bench Verified | Real GitHub issue resolution (Python repos) | seed_workspaces/swe/ |
| Tool calling | MCP-Atlas | Multi-server MCP tool orchestration (16+ servers) | seed_workspaces/mcp/ |
| Terminal operations | Terminal-Bench 2.0 | CLI operations in Docker containers | seed_workspaces/terminal/ |
| Skill discovery | SkillsBench | Autonomous capability acquisition | seed_workspaces/reasoning/ |

What an Evolved Agent Looks Like

A concrete before/after from the MCP-Atlas evolution:

Before (seed workspace):

mcp_agent/
├── manifest.yaml
├── prompts/system.md      ← 20 lines, generic
├── skills/                ← empty
└── memory/                ← empty

After (evolved — 79.4% on MCP-Atlas, #1 ranked):

mcp_agent/
├── manifest.yaml
├── prompts/system.md      ← 20 lines, unchanged
├── skills/
│   ├── entity-verification/SKILL.md   ← NEW
│   ├── search-iteration/SKILL.md      ← NEW
│   ├── multi-requirement/SKILL.md     ← NEW
│   ├── code-execution/SKILL.md        ← NEW
│   └── conditional-handler/SKILL.md   ← NEW
└── memory/
    └── episodic.jsonl     ← 6 entries

Key insight: 5 targeted skills outperformed 10 generic ones. The evolution engine learned to synthesize specific, verified skills rather than accumulating broad but shallow capabilities. The system prompt was left unchanged — all improvement came through structured artifact creation.

5 LLM Integration

Composite Policy Model

A-Evolve models an LLM system as a composite policy:

π_t = (π_θ,t, π_S,t)

Where: - π_θ,t — the parametric backbone (LLM weights, frozen during non-parametric evolution) - π_S,t — the persistent artifact state (prompts, skills, tools, memory) that conditions behavior across episodes

This separation is fundamental: the LLM backbone provides the reasoning engine, while the artifact state provides the accumulated deployment knowledge. Evolution can target either or both, though the current implementation focuses on non-parametric artifact evolution (mutating π_S) while keeping π_θ fixed.

LLM Roles in the Framework

| Role | Where | How |
| --- | --- | --- |
| Solver | Solve phase | The base LLM executes tasks using current artifacts (π_S as read-only context) |
| Diagnoser | Evolve → Diagnose | LLM analyzes deployment evidence, identifies failure modes and root causes |
| Planner | Evolve → Plan | LLM translates diagnostic insights into structured edit plans (target artifacts, operators, ordering) |
| Updater | Evolve → Update | LLM synthesizes concrete artifact changes: writes SKILL.md files, patches prompts, generates memory entries |
| Verifier | Evolve → Verify | LLM (or automated tests) evaluates candidate updates against the validation registry |

Provider Abstraction

The framework abstracts LLM access through an LLMProvider.complete() interface:

| Provider | Status | Notes |
| --- | --- | --- |
| Anthropic (Claude) | Built-in | Primary model used for benchmark results (Claude Opus-4.6) |
| OpenAI (GPT-4o) | Built-in | Demonstrated in tutorial; via openai SDK |
| AWS Bedrock | Built-in | Amazon's managed API access |
| Custom | Via interface | Implement LLMProvider.complete() |

Dual-Use Architecture

Unlike systems that use separate models for mutation and evaluation (e.g., AlphaEvolve's Flash/Pro ensemble), A-Evolve uses the same model for both solving and evolving by default. The evolution engine can optionally use a different (stronger) model for the evolution phases, but the default configuration demonstrates that a single model can bootstrap its own improvement through structured workspace mutations.

Critical distinction from AlphaEvolve: Where AlphaEvolve evolves code programs as solutions to mathematical/algorithmic problems, A-Evolve evolves agents — the prompts, skills, tools, and memory that govern an LLM's task-solving behavior. The mutation target is not a standalone algorithm but the deployed system's persistent state.

6 Key Results

Headline Performance

All results achieved with a single Claude Opus-4.6 base model, evolved using A-Evolve's reference algorithms, with zero hours of human harness engineering:

| Benchmark | Domain | Baseline | Evolved | Improvement | Ranking | Algorithm Used |
| --- | --- | --- | --- | --- | --- | --- |
| MCP-Atlas | Tool calling (MCP) | 76.0% | 79.4% | +3.4pp | 🥇 #1 | adaptive_evolve |
| SWE-bench Verified | Software engineering | 74.2% | 76.8% | +2.6pp | ~#5 | guided_synth |
| Terminal-Bench 2.0 | CLI operations | 63.5% | 76.5% | +13.0pp | ~#7 | adaptive_skill |
| SkillsBench | Skill discovery | 19.7% | 34.9% | +15.2pp | #2 | skillforge |

Result Significance

| Metric | Analysis |
| --- | --- |
| MCP-Atlas #1 | Achieved the top rank on a benchmark specifically designed to test tool-calling capabilities across 16+ MCP servers. The evolution engine synthesized 5 targeted skills that outperformed manually engineered prompts from competing systems. |
| Terminal-Bench +13pp | The largest absolute improvement (+13.0 percentage points), suggesting that CLI/terminal operations have a high "evolvability ceiling": many failure modes are systematic and addressable through structured skill synthesis. |
| SkillsBench +15.2pp | The largest relative improvement (77% relative gain over baseline), demonstrating that autonomous skill discovery is where agentic evolution provides the most leverage over static agent configurations. |
| SWE-bench +2.6pp | The most modest improvement, consistent with the expectation that real-world software engineering tasks have higher complexity variance and diminishing returns from non-parametric evolution alone. |

Ablation Study: Evolver Component Contributions

The paper includes ablation experiments (referenced in Sections 5-6 of the paper) that decompose the contribution of each evolver component:

| Configuration | Description | Relative to Full A-Evolve |
| --- | --- | --- |
| Full A-Evolve | Diagnose + Plan + Update + Verify | Baseline (100%) |
| A-Evolve/D (Diagnose only) | Diagnosis without planning → raw update attempts | Significant performance drop; broken tools committed, ~15% solver efficiency degradation |
| A-Evolve/P (+ Planning) | Diagnosis + Planning → structured action sequences | Substantial recovery; planning enables implementable fixes |
| A-Evolve/V (+ Verify removed) | Full pipeline without gating | Regressions from ungated bad mutations being committed |
| No evolution | Static baseline agent | Lowest performance across all benchmarks |

Key ablation finding: The planning stage was the most impactful individual component. Without planning, the diagnosis stage produced correct failure attributions but the resulting updates were often syntactically broken or semantically incomplete. Planning translates "what's wrong" into "how to fix it" — bridging the gap between understanding and actionable improvement.

Convergence Behavior

The evolution loop converges when EGL (Evolutionary Generality Loss) stabilizes or max_cycles is reached. EGL measures the gap between training performance and holdout performance:

EGL(t) = Score_train(t) - Score_holdout(t)

The framework uses an egl_window parameter (configurable, default varies by algorithm) to detect convergence. When EGL stops improving over the window, evolution halts — preventing overfitting to the training task distribution. This mechanism is critical for ensuring that evolved skills generalize to unseen tasks.
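A minimal sketch of this convergence check (hypothetical helper names; the framework's actual egl_window logic may differ in detail):

```python
def egl(score_train, score_holdout):
    """Evolutionary Generality Loss: the train-holdout generalization gap."""
    return score_train - score_holdout

def has_converged(egl_history, egl_window=2, tol=1e-3):
    """Halt evolution when EGL has stopped improving (narrowing) over the
    last egl_window cycles, compared to the best gap seen before that."""
    if len(egl_history) <= egl_window:
        return False  # not enough cycles to judge a plateau
    recent_best = min(egl_history[-egl_window:])
    prior_best = min(egl_history[:-egl_window])
    return recent_best >= prior_best - tol
```

For example, a history of gaps 0.30 → 0.25 → 0.20 → 0.15 is still narrowing and evolution continues, while 0.30 → 0.20 → 0.20 → 0.20 has plateaued and evolution halts.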

7 Reproducibility

Reproducibility Infrastructure

A-Evolve provides unusually strong reproducibility guarantees for an LLM-based system:

| Mechanism | Implementation | Purpose |
| --- | --- | --- |
| Git-tagged mutations | Every accepted mutation auto-tagged evo-1, evo-2, … | Full audit trail of workspace evolution; rollback to any checkpoint |
| Deterministic workspace | Agent reads from file system; π_S is serialized state | Identical workspace → identical agent behavior (modulo LLM sampling) |
| Gated commits | Mutations validated on holdout tasks before acceptance | Prevents regression; rejected mutations don't contaminate workspace |
| Observation logging | All trajectories, feedback, and diagnostic outputs logged | Complete evidence trail for post-hoc analysis |
| Seed workspaces | Pre-built starting points for each benchmark | seed_workspaces/{swe,mcp,terminal,reasoning}/ |
| Version control integration | Git-backed rollback on failed gate checks | Automatic reversion to last-known-good state |

Open Source Status

| Component | Status | Location |
| --- | --- | --- |
| Framework core (agent_evolve) | ✅ Open source (MIT) | agent_evolve/ |
| 4 reference algorithms | ✅ Open source | algorithms/ or agent_evolve/algorithms/ |
| 4 seed workspaces | ✅ Open source | seed_workspaces/ |
| 4 benchmark adapters | ✅ Open source | Built-in adapters |
| Evolved agent checkpoints | ✅ Git-tagged | Reproducible via evolver.run() |
| Training data / benchmark tasks | External dependencies | SWE-bench, MCP-Atlas, Terminal-Bench, SkillsBench (separate repos) |

Reproducing Results

import agent_evolve as ae

# Reproduce MCP-Atlas #1 result
evolver = ae.Evolver(
    agent="mcp",                    # built-in seed workspace
    benchmark="mcp-atlas",          # built-in benchmark adapter
    engine="adaptive_evolve",       # reference algorithm
)
results = evolver.run(cycles=10)
# Expected: ~79.4% final score

Caveat: Exact numerical reproduction depends on LLM API determinism (temperature, sampling seed), which is not fully guaranteed across API versions. However, the directional results and convergence behavior should reproduce consistently.

8 Compute and API Costs

Cost Model

A-Evolve's compute costs decompose into two categories:

| Phase | Compute Type | Cost Driver | Notes |
| --- | --- | --- | --- |
| Solve | LLM inference | Token count × model price per token | Proportional to batch_size × num_cycles × avg_task_tokens |
| Evolve | LLM inference + tool invocation | Diagnosis reasoning + plan generation + artifact synthesis + verification | Typically 2-5× more tokens per cycle than solve phase |
| Gate | LLM inference (holdout) | Holdout task count × model price | egl_window tasks per cycle for convergence check |

Estimated Costs Per Benchmark

Based on the reported configurations (Claude Opus-4.6, 10 evolution cycles):

| Benchmark | Tasks/Cycle | Cycles | Est. Solve Tokens | Est. Evolve Tokens | Est. Total Cost |
| --- | --- | --- | --- | --- | --- |
| MCP-Atlas | ~50 | 10 | ~2M | ~4M | $150–300 |
| SWE-bench Verified | ~50 | 10 | ~5M | ~8M | $400–800 |
| Terminal-Bench 2.0 | ~50 | 10 | ~3M | ~5M | $200–400 |
| SkillsBench | ~50 | 10 | ~1.5M | ~3M | $100–200 |

Important: These are rough estimates. Actual costs depend heavily on task complexity, agent trajectory length, and LLM pricing at time of use. The evolution phase involves multiple LLM calls per cycle (diagnose, plan, update, verify), and complex benchmarks like SWE-bench generate longer trajectories.
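A back-of-envelope model makes the decomposition concrete (all parameters below are illustrative assumptions, not the paper's accounting; pricing varies by provider and date):

```python
def estimate_cost(cycles, tasks_per_cycle, avg_task_tokens,
                  evolve_multiplier=2.0, price_per_mtok=15.0):
    """Rough evolution-run cost in dollars: solve tokens scale with
    cycles x batch x task length, and the evolve phase adds an overhead
    multiplier in the report's 2-5x range. Prices are per million tokens."""
    solve_tokens = cycles * tasks_per_cycle * avg_task_tokens
    evolve_tokens = solve_tokens * evolve_multiplier
    return (solve_tokens + evolve_tokens) / 1e6 * price_per_mtok

# e.g. 10 cycles x ~50 tasks x ~4k tokens, with a 2x evolve overhead
cost = estimate_cost(10, 50, 4_000)
```

Plugging in longer trajectories or a higher evolve multiplier quickly moves an estimate from the low to the high end of the table's ranges, which is why the ranges span roughly 2×.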

Cost Scaling

The Evolution-Scaling Hypothesis implies that costs scale linearly with C_evolve but returns are sublinear (log-like improvement curve). Key cost levers:

| Lever | Effect |
| --- | --- |
| Fewer cycles | Lower cost, less evolved agent |
| Smaller batch size | Lower cost per cycle, noisier signal |
| Weaker model for evolve phase | Lower cost, lower mutation quality |
| Early convergence (EGL gate) | Auto-stops when improvement plateaus |

9 Architecture Solution

High-Level Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        A-EVOLVE FRAMEWORK                        │
│                                                                  │
│  ┌────────────────┐       ┌──────────────────────────────────┐  │
│  │  Agent (BYOA)  │       │     Evolution Engine (BYO-Algo)  │  │
│  │                │       │                                  │  │
│  │  BaseAgent     │       │  EvolutionEngine                 │  │
│  │  .solve(task)  │       │  .step(workspace, obs, hist)     │  │
│  │                │       │                                  │  │
│  │  ┌──────────┐  │       │  ┌──────────┐  ┌─────────────┐  │  │
│  │  │ Solver   │  │       │  │ Diagnose │→ │    Plan     │  │  │
│  │  │ (LLM)    │  │       │  └──────────┘  └─────────────┘  │  │
│  │  └──────────┘  │       │       │              │           │  │
│  └────────┬───────┘       │  ┌────▼──────┐  ┌───▼────────┐  │  │
│           │               │  │  Update   │→ │  Verify    │  │  │
│           │               │  └───────────┘  └────────────┘  │  │
│           │               └──────────────┬───────────────────┘  │
│           │                              │                      │
│  ┌────────▼──────────────────────────────▼───────────────────┐  │
│  │                    AGENT WORKSPACE (π_S)                   │  │
│  │                                                            │  │
│  │  manifest.yaml  prompts/  skills/  tools/  memory/        │  │
│  │                                                            │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐    │  │
│  │  │Knowledge │  │  Tool    │  │   Validation         │    │  │
│  │  │Registry  │  │ Registry │  │   Registry           │    │  │
│  │  │  (K_t)   │  │  (T_t)   │  │    (V_t)             │    │  │
│  │  └──────────┘  └──────────┘  └──────────────────────┘    │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              Benchmark Adapter (BYOE)                      │  │
│  │  .get_tasks(split, limit)    .evaluate(task, trajectory)   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌─────────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│  │  VersionControl │  │ EvolutionHist│  │  TrialRunner    │    │
│  │  (git-backed)   │  │  (obs+logs)  │  │  (on-demand     │    │
│  │                 │  │              │  │   validation)   │    │
│  └─────────────────┘  └──────────────┘  └─────────────────┘    │
└──────────────────────────────────────────────────────────────────┘

The Solve-Evolve Control Loop

The core loop separates instance-level task execution from cross-episode capability improvement:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌──────┐    ┌────────┐
│  SOLVE  │───▶│ OBSERVE │───▶│ EVOLVE  │───▶│ GATE │───▶│ RELOAD │
│         │    │         │    │         │    │      │    │        │
│ Agent   │    │ Collect │    │ Engine  │    │Check │    │ Agent  │
│ runs on │    │ trajs + │    │ mutates │    │hold- │    │ reloads│
│ tasks   │    │ feedback│    │ work-   │    │out   │    │ from   │
│         │    │ into    │    │ space   │    │tasks │    │ (maybe │
│ π_S is  │    │ Obs     │    │ files   │    │      │    │ rolled │
│ read-   │    │ buffer  │    │         │    │accept│    │ back)  │
│ only    │    │         │    │         │    │/     │    │ work-  │
│         │    │         │    │         │    │reject│    │ space  │
└─────────┘    └─────────┘    └─────────┘    └──────┘    └────────┘
     │                                                        │
     └────────────────────────────────────────────────────────┘
                          NEXT CYCLE

Formal specification:

Solve:    τ_t = Solve(π_t, x_t)                    # Black-box execution
Observe:  Obs_{1:t} = Obs_{1:t-1} ∪ {τ_t}          # Evidence accumulation
Evolve:   Δ_t ← F_Evolve(π_θ,t, π_S,t, Obs_{1:t}) # Structured mutation
Gate:     c_t ← C(π_t, Δ_t, Obs_{1:t})             # Commit decision {0,1}
Reload:   π_{t+1} = π_t ⊕ (c_t · Δ_t)             # Conditional update
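The Gate and Reload steps reduce to a conditional merge of the candidate update into the workspace. A minimal sketch, treating the workspace as a dict from artifact paths to contents (hypothetical helper, not framework code):

```python
def gate_and_reload(workspace, delta, holdout_before, holdout_after, tol=0.0):
    """pi_{t+1} = pi_t (+) (c_t * Delta_t): merge the candidate update only
    when the holdout score does not regress; otherwise keep the old state.
    Returns (next_workspace, committed)."""
    c_t = holdout_after >= holdout_before - tol   # commit decision in {0, 1}
    if c_t:
        return {**workspace, **delta}, True       # accept and apply delta
    return dict(workspace), False                 # reject: rollback

ws = {"prompts/system.md": "v1"}
delta = {"skills/entity-verification/SKILL.md": "# SKILL"}
improved, committed = gate_and_reload(ws, delta, 0.760, 0.794)
unchanged, committed_bad = gate_and_reload(ws, delta, 0.760, 0.700)
```

The real framework performs the same decision against git-backed state, tagging accepted mutations and reverting rejected ones, but the commit logic is exactly this conditional update.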

Three-Axis Pluggability

| Axis | Interface | Contract | Extension Point |
| --- | --- | --- | --- |
| Agent (BYOA) | BaseAgent | Implement solve(task: Task) -> Trajectory | Any architecture: ReAct, Plan-and-Solve, multi-agent, custom |
| Benchmark (BYOE) | BenchmarkAdapter | Implement get_tasks(split, limit) + evaluate(task, trajectory) -> Feedback | Any domain with task + evaluation signal |
| Algorithm (BYO-Algo) | EvolutionEngine | Implement step(workspace, observations, history, trial) -> StepResult | Any evolution strategy: LLM mutation, RL, genetic programming |

Key architectural insight: The Agent Workspace is the unifying abstraction. By standardizing agent state as a file-system directory with a manifest, the evolution engine can mutate any agent without knowing its internals. The agent reloads from its workspace after each cycle, picking up mutations transparently. This decoupling is what makes A-Evolve truly agent-agnostic.

10 Component Breakdown

Core Type System

from dataclasses import dataclass, field

# Dataclass-style sketch of the four core types, importable in practice as:
#   from agent_evolve.types import Task, Trajectory, Feedback, StepResult

@dataclass
class Task:
    """Input to the agent."""
    id: str                  # Unique task identifier
    input: str               # Task prompt/description
    metadata: dict = field(default_factory=dict)  # Benchmark-specific (rule, answer, etc.)

@dataclass
class Trajectory:
    """Agent's execution trace."""
    task_id: str             # Links back to Task
    output: str              # Agent's final answer
    steps: list = field(default_factory=list)     # Tool calls, reasoning, etc.

@dataclass
class Feedback:
    """Benchmark evaluation result."""
    success: bool            # Binary pass/fail
    score: float             # Continuous score in [0, 1]
    detail: str              # JSON-serialized evaluation details
    raw: dict = field(default_factory=dict)       # Raw evaluation data

@dataclass
class StepResult:
    """Evolution engine output."""
    mutated: bool            # Whether the workspace was modified
    summary: str             # Human-readable mutation description
    metadata: dict = field(default_factory=dict)  # Engine-specific metadata

Agent Workspace Contract (AgentWorkspace)

The AgentWorkspace class mediates all file-system interactions:

| Method Signature | Purpose |
| --- | --- |
| read_prompt() → str | Read current system prompt |
| write_prompt(text) → None | Overwrite system prompt |
| list_skills() → list[SkillInfo] | Enumerate available skills |
| write_skill(name, content) → None | Create/update a SKILL.md file |
| get_skill_content(name) → str | Read a specific skill's content |
| add_memory(entry, category) → None | Append to episodic/semantic memory |
| list_memories() → list[dict] | Read all memory entries |
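The contract can be mocked over a plain directory in a few lines. The sketch below is a simplified stand-in for illustration only; the real AgentWorkspace additionally handles memory files, the manifest, and validation assets:

```python
from pathlib import Path
import tempfile

class MiniWorkspace:
    """Toy file-system-backed workspace honoring part of the contract."""
    def __init__(self, root):
        self.root = Path(root)
        (self.root / "prompts").mkdir(parents=True, exist_ok=True)
        (self.root / "skills").mkdir(exist_ok=True)

    def read_prompt(self) -> str:
        return (self.root / "prompts" / "system.md").read_text()

    def write_prompt(self, text: str) -> None:
        (self.root / "prompts" / "system.md").write_text(text)

    def write_skill(self, name: str, content: str) -> None:
        skill_dir = self.root / "skills" / name
        skill_dir.mkdir(parents=True, exist_ok=True)
        (skill_dir / "SKILL.md").write_text(content)

    def list_skills(self):
        # Each skill lives at skills/<name>/SKILL.md
        return sorted(p.parent.name for p in self.root.glob("skills/*/SKILL.md"))

ws = MiniWorkspace(tempfile.mkdtemp())
ws.write_prompt("You are a careful tool-calling agent.")
ws.write_skill("entity-verification", "# SKILL: verify entities before use")
```

Because all state lives on disk in this layout, "mutating the agent" is just writing files, which is what makes git tagging and rollback natural.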

Base Agent Contract (BaseAgent)

from agent_evolve.protocol.base_agent import BaseAgent

class BaseAgent:
    def __init__(self, workspace_dir: str | Path):
        self.workspace = AgentWorkspace(workspace_dir)
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def solve(self, task: Task) -> Trajectory:
        """Implement this: process task, return trajectory."""
        raise NotImplementedError

    def reload_from_fs(self):
        """Re-read workspace state after evolution mutations."""
        self.system_prompt = self.workspace.read_prompt()
        self.skills = self.workspace.list_skills()
        self.memories = self.workspace.list_memories()

    def remember(self, content: str, category: str = "episodic"):
        """Store episodic memory during solve phase."""
        self.workspace.add_memory({"content": content}, category)
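A minimal concrete solver under this contract might look like the following self-contained sketch (with local stand-ins for the Task/Trajectory types; a real BaseAgent subclass would call an LLM through LLMProvider.complete() rather than echoing):

```python
from dataclasses import dataclass, field

@dataclass
class Task:                       # stand-in for agent_evolve.types.Task
    id: str
    input: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Trajectory:                 # stand-in for agent_evolve.types.Trajectory
    task_id: str
    output: str
    steps: list = field(default_factory=list)

class EchoAgent:
    """Trivial solve() implementation: records one step, echoes the input.
    Illustrates the shape of the contract, not a useful agent."""
    def solve(self, task: Task) -> Trajectory:
        steps = [{"action": "read_task", "observation": task.input}]
        return Trajectory(task_id=task.id, output=task.input.strip(), steps=steps)

traj = EchoAgent().solve(Task(id="t1", input=" 42 "))
```

Anything with this solve(task) -> Trajectory shape, however it is built internally, can be dropped into the evolution loop.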

Evolution Engine Contract (EvolutionEngine)

from agent_evolve.engine.base import EvolutionEngine

class EvolutionEngine:
    def step(
        self,
        workspace: AgentWorkspace,      # Mutable workspace reference
        observations: list[Observation], # Accumulated solve results
        history: EvolutionHistory,       # Previous cycles' data
        trial: TrialRunner               # On-demand validation
    ) -> StepResult:
        """One evolution cycle. Analyze failures, mutate workspace."""
        raise NotImplementedError

Benchmark Adapter Contract (BenchmarkAdapter)

from agent_evolve.benchmarks.base import BenchmarkAdapter

class BenchmarkAdapter:
    def get_tasks(
        self,
        split: str = "train",    # "train" or "holdout"
        limit: int = 10
    ) -> list[Task]:
        """Return benchmark tasks for the given split."""
        raise NotImplementedError

    def evaluate(
        self,
        task: Task,
        trajectory: Trajectory
    ) -> Feedback:
        """Score agent output against ground truth."""
        raise NotImplementedError

Evolver Configuration (EvolveConfig)

import agent_evolve as ae

config = ae.EvolveConfig(
    batch_size=8,       # Tasks per solve batch
    max_cycles=10,      # Maximum evolution iterations
    egl_window=2,       # EGL convergence window size
)

Shared Primitives

The evolution engine has access to three shared primitives:

| Primitive | Purpose | Typical Use |
| --- | --- | --- |
| TrialRunner | On-demand validation during evolution | Run holdout tasks to test candidate mutations before committing |
| EvolutionHistory | Observation + version queries | Query past cycles, compare scores, retrieve failure patterns |
| VersionControl | Git-based rollback | Tag accepted mutations, revert rejected ones, maintain audit trail |

11 Core Mechanisms (Detailed)

Mechanism 1: The Four-Phase Evolver

The evolver F_Evolve is decomposed into four cooperating functions that implement the three principles of agentic evolution:

                    ┌─────────────────┐
                    │  Obs_{1:t}      │  Accumulated deployment evidence
                    │  (trajectories, │  (tool traces, errors, feedback)
                    │   feedback)     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │    DIAGNOSE     │  Goal-oriented: identify WHAT to change
                    │                 │
                    │  • Classify     │  Analyze failure modes across tasks
                    │    failure      │  Attribute to root causes
                    │    modes        │  (missing tools, brittle logic,
                    │  • Attribute    │   interface mismatches)
                    │    root causes  │
                    │  • Produce      │  Output: update objective g_t
                    │    update       │
                    │    objective    │
                    └────────┬────────┘
                             │ g_t
                    ┌────────▼────────┐
                    │      PLAN       │  Compositional: specify HOW to change
                    │                 │
                    │  • Target       │  Select target artifacts
                    │    artifacts    │  Choose edit operators
                    │  • Edit ops     │  (add, patch, refactor, prune)
                    │    (add/patch/  │  Define ordering constraints
                    │     refactor/   │
                    │     prune)      │  Output: edit plan p_t
                    │  • Ordering     │
                    └────────┬────────┘
                             │ p_t
                    ┌────────▼────────┐
                    │     UPDATE      │  Execute the plan
                    │                 │
                    │  • Synthesize   │  Generate concrete file changes
                    │    artifact     │  Write SKILL.md, patch prompts,
                    │    changes      │  update memory
                    │  • Attach       │  Include provenance + tests
                    │    provenance   │
                    │  • Build Δ_t    │  Output: candidate update Δ_t
                    └────────┬────────┘
                             │ Δ_t
                    ┌────────▼────────┐
                    │     VERIFY      │  Autonomy: decide WHEN to commit
                    │                 │
                    │  • Run holdout  │  Evaluate on held-out tasks
                    │    validation   │  Check for regressions
                    │  • Check for   │  Return commit decision c_t ∈ {0,1}
                    │    regressions  │
                    │  • Commit or    │  Output: c_t
                    │    rollback     │
                    └─────────────────┘

Mechanism 2: Persistent Artifact State (π_S)

The artifact state is organized into three typed registries:

Knowledge Registry (K_t): - Stores structured or textual artifacts: schemas, workflows, interface contracts, exemplars - Addressable and versioned — supports retrieval, patching, and replacement - Enables goal-oriented evolution by localizing failures to specific knowledge components - Physically: prompts/, skills/ directories

Tool Registry (T_t): - Contains executable functions with explicit input-output signatures and associated tests - During solve-time: provides deterministic action primitives reducing inference variance - During evolve-time: serves as diagnostic instruments for replaying failures and probing edge cases - Physically: tools/ directory

Validation Registry (V_t): - Contains governance assets: unit tests, regression suites, human review hooks - Validation artifacts are themselves editable (the evolution engine can synthesize new tests) - Grounds the commit decision c_t: updates committed only if they pass verification - Critical for preventing regressions over long deployment horizons

Mechanism 3: Edit Operators

A-Evolve supports a small set of canonical edit operators over π_S:

| Operator | Target | Description | Example |
|---|---|---|---|
| ADD | K, T, V | Create new artifact | Synthesize entity-verification/SKILL.md |
| PATCH | K, T | Modify existing artifact | Update parser tool for new JSON schema |
| REFACTOR | K, T | Restructure without changing behavior | Split monolithic skill into composable sub-skills |
| PRUNE | K, T, V | Remove obsolete artifact | Delete skill that no longer contributes to score |

All proposed updates are logged with full provenance, enabling auditing, rollback, and risk-based human oversight.
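The operator set and the provenance requirement can be sketched together. The operator names (ADD/PATCH/REFACTOR/PRUNE) come from the table above; the dictionary-based registry and the log-entry shape are assumptions for illustration.

```python
# Sketch: apply a canonical edit operator to a registry and log provenance.
# Registry and log structures are illustrative, not the framework's own.
import time

def apply_edit(registry: dict, op: str, name: str, content=None, log=None):
    if op in ("ADD", "PATCH"):
        registry[name] = content
    elif op == "REFACTOR":
        registry[name] = content  # behavior-preserving restructure
    elif op == "PRUNE":
        registry.pop(name, None)
    else:
        raise ValueError(f"unknown operator: {op}")
    if log is not None:
        # Full provenance enables auditing, rollback, and human oversight.
        log.append({"op": op, "target": name, "ts": time.time()})
    return registry
```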

Mechanism 4: Reference Evolution Algorithms

A-Evolve ships with four reference algorithms, each implementing EvolutionEngine.step():

| Algorithm | File | Strategy | Key Innovation | Best Domain |
|---|---|---|---|---|
| adaptive_evolve | algorithms/adaptive_evolve/ | Per-claim feedback analysis + meta-learning | Analyzes individual claims within task feedback to attribute failures at the finest granularity; meta-learns which mutation patterns are most effective | MCP-Atlas (🥇 79.4%) |
| adaptive_skill | algorithms/adaptive_skill/ | LLM-driven workspace mutation with bash tool access | Grants the evolution engine shell access to test mutations programmatically; can run scripts, validate outputs, and iterate within a single evolution step | Terminal-Bench 2.0 (76.5%) |
| skillforge | algorithms/skillforge/ | LLM-driven workspace mutation with EGL gating | Focuses on skill synthesis with strict EGL-based convergence detection; stops evolving when holdout improvement plateaus | SkillsBench (34.9%) |
| guided_synth | algorithms/guided_synth/ | Memory-first evolution + LLM-guided intervention synthesis | Prioritizes memory accumulation before skill synthesis; uses episodic memory to guide when and how to intervene | SWE-bench Verified (76.8%) |
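All four algorithms implement `EvolutionEngine.step()`. A toy sketch of that contract follows; the abstract base mirrors the interface named above, while `MemoryFirstEngine` and the `step()` signature are hypothetical simplifications (the real signature is not reproduced here).

```python
# Sketch of the EvolutionEngine contract the reference algorithms implement.
# The single abstract method step() is from the text; details are assumed.
from abc import ABC, abstractmethod

class EvolutionEngine(ABC):
    @abstractmethod
    def step(self, trajectories, workspace):
        """Consume solve-phase evidence, return a candidate update."""

class MemoryFirstEngine(EvolutionEngine):
    """Toy stand-in for a memory-first strategy in the spirit of guided_synth."""
    def step(self, trajectories, workspace):
        failures = [t for t in trajectories if not t.get("success")]
        # Accumulate failure patterns into memory before synthesizing skills.
        workspace.setdefault("memory", []).extend(failures)
        return {"op": "PATCH", "target": "memory"}
```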

Mechanism 5: Evolutionary Generality Loss (EGL)

EGL is A-Evolve's convergence detection metric:

EGL(t) = Score_train(t) - Score_holdout(t)

The framework monitors EGL across a sliding window (the egl_window parameter). When EGL stabilizes, meaning the gap between training and holdout performance stops changing over the window, evolution halts. This serves two purposes:

  1. Prevents overfitting: If training score improves but holdout score doesn't, the agent is memorizing task-specific solutions rather than learning generalizable capabilities
  2. Saves compute: Stops evolution when further cycles are unlikely to yield meaningful holdout improvement

The EGL window is configurable per algorithm — skillforge uses strict EGL gating, while guided_synth uses a looser window to allow longer exploration before convergence.
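The windowed convergence check can be sketched directly from the definition. `egl_window` matches the parameter named above; the stability threshold `eps` and the max-minus-min stability test are assumed details, not the framework's exact rule.

```python
# Sketch of EGL convergence detection. EGL(t) = Score_train(t) - Score_holdout(t);
# evolution halts once EGL stops changing over the sliding window.
# The eps threshold is an illustrative assumption.

def egl(train_score: float, holdout_score: float) -> float:
    return train_score - holdout_score

def has_converged(egl_history, egl_window: int = 3, eps: float = 0.01) -> bool:
    """True once the last egl_window EGL values vary by less than eps."""
    if len(egl_history) < egl_window:
        return False
    window = egl_history[-egl_window:]
    return max(window) - min(window) < eps
```

A strict gate (as in skillforge) would use a small window and tight eps; a looser gate (as in guided_synth) would widen both to allow longer exploration.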

Mechanism 6: Git-Based Version Control

Every accepted mutation is git-tagged with an incrementing version:

evo-0  →  Initial seed workspace
evo-1  →  First accepted mutation (e.g., prompt hardening)
evo-2  →  Second accepted mutation (e.g., added json-sum-exact skill)
evo-3  →  Third accepted mutation (e.g., added episodic memory patterns)
...

If the Gate phase rejects a mutation (holdout regression), the workspace is automatically rolled back to the last tagged version. This provides:

  • Full audit trail: Every evolution step is inspectable via git diff evo-N..evo-N+1
  • Reproducibility: Checkout any tag to reproduce the agent at that evolution stage
  • Safety: No permanent damage from bad mutations; always recoverable
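The tag-and-rollback pattern can be reproduced with plain git commands. This is a sketch, not the framework's implementation: the helper names are hypothetical, and only the evo-N tag convention comes from the text.

```python
# Sketch of git-based evolution versioning: tag each accepted mutation as
# evo-N, restore the workspace from the last tag on rejection.
# Helper names are illustrative assumptions.
import subprocess

def tag_evolution_step(repo_dir: str, version: int) -> None:
    # Record the accepted mutation and tag it with an incrementing version.
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m",
                    f"evolution step {version}"], check=True)
    subprocess.run(["git", "-C", repo_dir, "tag", f"evo-{version}"], check=True)

def rollback_to(repo_dir: str, version: int) -> None:
    # Restore all workspace files to the last-known-good tagged state.
    subprocess.run(["git", "-C", repo_dir, "checkout", f"evo-{version}",
                    "--", "."], check=True)
```

Inspecting a step then reduces to `git diff evo-N..evo-N+1`, as noted above.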

12 Programming Language

Framework Implementation

| Component | Language | Version | Notes |
|---|---|---|---|
| Core framework | Python | 3.11+ | Package: agent_evolve |
| Evolution algorithms | Python | 3.11+ | Under algorithms/ |
| Seed workspaces | YAML + Markdown | — | manifest.yaml, SKILL.md, system.md |
| Memory storage | JSONL | — | memory/episodic.jsonl |
| Version control | Git | — | Automated tagging and rollback |
| Package management | pip | — | pip install -e ".[all,dev]" |

Package Structure

a-evolve/
├── agent_evolve/              # Core framework package
│   ├── __init__.py            # ae.Evolver, ae.EvolveConfig exports
│   ├── protocol/
│   │   └── base_agent.py      # BaseAgent ABC
│   ├── benchmarks/
│   │   └── base.py            # BenchmarkAdapter ABC
│   ├── engine/
│   │   └── base.py            # EvolutionEngine ABC
│   ├── contract/
│   │   └── workspace.py       # AgentWorkspace file-system abstraction
│   ├── types.py               # Task, Trajectory, Feedback, StepResult
│   └── algorithms/            # Reference evolution algorithms
│       ├── adaptive_evolve/
│       ├── adaptive_skill/
│       ├── skillforge/
│       └── guided_synth/
├── seed_workspaces/           # Pre-built starting points
│   ├── swe/
│   ├── mcp/
│   ├── terminal/
│   └── reasoning/
├── docs/                      # Benchmark-specific guides
│   ├── swe-bench-demo.md
│   ├── mcp-atlas-demo.md
│   ├── terminal-bench-demo.md
│   ├── skill-bench-demo.md
│   └── algorithms/            # Algorithm documentation
│       ├── adaptive-evolve.md
│       ├── adaptive-skill.md
│       ├── skillforge.md
│       └── guided-synth.md
├── figs/                      # Figures for README
├── pyproject.toml / setup.py  # Package definition
└── README.md

API Design Philosophy

The API is designed for minimal surface area — "3 lines of code" is the guiding principle:

import agent_evolve as ae

evolver = ae.Evolver(agent="swe-verified", benchmark="swe-verified")
results = evolver.run(cycles=10)

Extensibility is achieved through interface implementation rather than configuration complexity. Custom agents implement solve(), custom benchmarks implement get_tasks() + evaluate(), and custom algorithms implement step(); each interface exposes only one or two methods.
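A sketch of that extensibility model: the method names (solve, get_tasks, evaluate, step) come from the text, while the simplified signatures, class names, and the toy scoring loop are assumptions for illustration.

```python
# Hedged sketch: custom components implement the named methods directly.
# Signatures are simplified; the real interfaces use richer Task/Feedback types.

class MyAgent:
    def solve(self, task: str) -> str:
        # A trivial solver standing in for an LLM-backed agent.
        return f"answer to {task}"

class MyBenchmark:
    def get_tasks(self):
        return ["task-1", "task-2"]

    def evaluate(self, task: str, answer: str) -> bool:
        # A toy evaluation signal: success iff the answer looks well-formed.
        return answer.startswith("answer")

agent, bench = MyAgent(), MyBenchmark()
score = sum(bench.evaluate(t, agent.solve(t)) for t in bench.get_tasks())
```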

13 Memory Management

Memory Architecture

A-Evolve employs a file-system-native memory model where all memory is serialized to the workspace directory. This is fundamentally different from in-process memory systems — memory persists across agent restarts, evolution cycles, and even deployment sessions.

| Memory Type | Storage | Format | Lifecycle |
|---|---|---|---|
| Episodic memory | memory/episodic.jsonl | JSON Lines (one entry per line) | Appended during solve phase; analyzed during evolve phase |
| Semantic memory | memory/semantic.jsonl | JSON Lines | Synthesized during evolve phase from recurring patterns |
| Skill memory | skills/*/SKILL.md | Markdown with YAML frontmatter | Created/patched by evolution engine |
| Prompt memory | prompts/system.md | Markdown | Hardened by evolution engine (append constraints) |

Memory Flow Through the Loop

SOLVE PHASE                    EVOLVE PHASE
─────────────                  ──────────────
Agent reads:                   Engine reads:
  • prompts/system.md            • Obs buffer (all trajectories)
  • skills/                      • memory/episodic.jsonl
  • memory/ (last N entries)     • Current workspace state

Agent writes:                  Engine writes:
  • memory/episodic.jsonl        • skills/new-skill/SKILL.md
    (task outcomes, traces)      • prompts/system.md (patches)
                                 • memory/episodic.jsonl (patterns)

Memory Capacity and Saturation

A-Evolve's approach to memory is qualitatively different from heuristic memory accumulation systems such as Reflexion and Voyager, which can suffer from context saturation:

  1. Memory is structured, not raw text. Episodic entries have schema (content, category, metadata). Skills have YAML frontmatter + procedural content.

  2. Memory is curated by the evolution engine. The engine doesn't just append — it synthesizes, refactors, and prunes. This addresses the diminishing returns problem of naive memory accumulation.

  3. Memory is bounded by workspace size. The file system imposes natural limits, and the agent reads only the last N entries during solve-time. This prevents context overflow.

  4. Skills amortize memory. Recurring failure patterns are compiled into permanent skills rather than remaining as raw episodic traces. This is the key compositional principle in action — fragile reasoning is crystallized into reusable capability.

The amortization argument: A-Evolve's central memory insight is that memory should be consumed by the evolution engine to produce skills, not merely accumulated for the solver to re-read. This prevents the context saturation that plagues append-and-retrieve systems.
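The amortization principle above can be sketched as a compilation step: recurring failure categories in episodic memory are distilled into skill artifacts. Everything here (the `amortize` function, the `min_count` threshold, the frontmatter shape) is a hypothetical illustration of the idea, not the engine's actual synthesis logic.

```python
# Sketch: compile recurring episodic failure patterns into SKILL.md text
# instead of leaving them as raw traces. All names are assumptions.
from collections import Counter

def amortize(episodes, min_count: int = 3):
    """Return SKILL.md text for failure categories seen >= min_count times."""
    counts = Counter(e["category"] for e in episodes if not e.get("success"))
    skills = {}
    for category, n in counts.items():
        if n >= min_count:
            # Markdown with YAML frontmatter, matching the skill format above.
            skills[category] = (f"---\nname: {category}\n---\n"
                                f"Mitigation distilled from {n} failures.")
    return skills
```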

14 Continued Learning

Continual Deployment-Time Adaptation

A-Evolve's entire thesis is built on continued learning. The system is explicitly designed for open-ended, indefinite deployment horizons where the agent must adapt to:

| Challenge | How A-Evolve Handles It |
|---|---|
| Distribution shift | Evolution engine diagnoses failures caused by environment changes and synthesizes targeted fixes |
| API drift | Schema changes detected through failure analysis; adapter tools synthesized and validated |
| New task types | Skill synthesis creates new capabilities; memory patterns bootstrap adaptation |
| Capability degradation | EGL monitoring detects regression; git rollback preserves last-known-good state |
| Context saturation | Skills amortize raw memory into permanent capabilities; pruning removes obsolete artifacts |

The Three Scaling Axes

A-Evolve's theoretical framework positions evolution alongside two established scaling axes:

                     ┌────────────────────────────┐
                     │    CAPABILITY FRONTIER     │
                     │                            │
 Scaling Axis 1:     │   ┌────────────────────┐   │
 Training-Time ──────│──▶│  Static ability    │   │
 Compute             │   │  (pre-training +   │   │
 (Kaplan et al.)     │   │   post-training)   │   │
                     │   └─────────┬──────────┘   │
                     │             │              │
 Scaling Axis 2:     │   ┌─────────▼──────────┐   │
 Inference-Time ─────│──▶│  Per-task          │   │
 Compute             │   │  reasoning         │   │
 (Snell et al.)      │   │  (CoT, search)     │   │
                     │   └─────────┬──────────┘   │
                     │             │              │
 Scaling Axis 3:     │   ┌─────────▼──────────┐   │
 Evolution-Time ─────│──▶│  Cross-episode     │   │
 Compute             │   │  adaptation        │   │
 (A-Evolve)          │   │  (skills, tools,   │   │
                     │   │   memory)          │   │
                     │   └────────────────────┘   │
                     └────────────────────────────┘

Convergence and Termination

The evolution loop terminates when any of these conditions is met:

  1. EGL convergence: The Evolutionary Generality Loss stabilizes over the egl_window
  2. Max cycles reached: Hard limit on evolution iterations
  3. Perfect score: Training score reaches 100% (rare)
  4. Budget exhausted: Compute budget for evolution-time is depleted
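The four conditions combine into a single disjunctive check. This sketch folds in the windowed EGL test; the function and parameter names are illustrative assumptions.

```python
# Sketch: the evolution loop terminates when ANY of the four conditions
# listed above holds. Names and thresholds are assumptions.

def should_terminate(egl_history, cycle, max_cycles,
                     train_score, budget_remaining,
                     egl_window=3, eps=0.01) -> bool:
    window = egl_history[-egl_window:]
    egl_converged = (len(egl_history) >= egl_window and
                     max(window) - min(window) < eps)   # 1. EGL convergence
    return (egl_converged
            or cycle >= max_cycles                       # 2. Max cycles
            or train_score >= 1.0                        # 3. Perfect score
            or budget_remaining <= 0)                    # 4. Budget exhausted
```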

Long-Horizon Deployment Vision

The paper envisions A-Evolve running continuously in production:

Deployment Timeline
────────────────────────────────────────────────────────
Day 1:    Deploy seed agent (π_0)
Day 2-5:  Accumulate deployment evidence (Obs)
Day 5:    Evolution cycle 1 → add entity-verification skill
Day 6-10: More evidence; improved performance
Day 10:   Evolution cycle 2 → patch prompt, add memory patterns
Day 15:   API change detected → evolution cycle 3 → schema adapter
Day 20:   EGL converges → evolution pauses
Day 30:   New failure pattern → evolution resumes
...

Open question from the paper: Whether the evolution-scaling frontier P*(C_evolve) eventually saturates (suggesting fundamental limits to non-parametric adaptation) or continues to improve (suggesting that evolution can substitute for retraining). The authors present this as an empirical question requiring longer-horizon experiments than current benchmarks support.

Relationship to Other Continual Learning Paradigms

| Paradigm | Parametric Updates | Artifact Updates | Agent-Directed | Governed |
|---|---|---|---|---|
| Online fine-tuning | ✅ | — | — | — |
| Reflexion (Shinn 2023) | — | ✅ (text memory) | — | — |
| Voyager (Wang 2023) | — | ✅ (skills) | — | — |
| ADAS (Hu 2024) | — | ✅ (agent architectures) | Partial | — |
| A-Evolve | ✅ (planned) | ✅ (typed artifacts) | ✅ (evolver agent) | ✅ (gate + git) |

15 Applications

Current Applications (Demonstrated)

| Application | Benchmark | Task Description | Result |
|---|---|---|---|
| Software engineering automation | SWE-bench Verified | Resolve real GitHub issues in Python repositories | 76.8% (~#5) |
| Multi-tool orchestration | MCP-Atlas | Coordinate 16+ MCP servers for complex tool-calling tasks | 79.4% (🥇 #1) |
| Terminal/CLI operations | Terminal-Bench 2.0 | Execute system administration and DevOps tasks in Docker | 76.5% (~#7) |
| Autonomous skill discovery | SkillsBench | Learn and apply new capabilities without human instruction | 34.9% (#2) |

Potential Applications (Framework Affordances)

Because A-Evolve is a framework with pluggable agents, benchmarks, and algorithms, it can in principle be applied to any domain where:

  1. Tasks can be automated (agent can attempt them)
  2. Evaluation signal exists (success/failure can be measured)
  3. Agent state is file-representable (prompts, skills, tools, memory)

| Domain | Potential Agent | Potential Benchmark | Evolution Target |
|---|---|---|---|
| Customer support | RAG-based chatbot | Resolution rate, CSAT score | Skills for handling edge cases, domain knowledge |
| Data analysis | Code-generating analyst | Accuracy on analytical queries | SQL patterns, visualization templates |
| Security monitoring | Alert triage agent | True positive rate, response time | Detection rules, investigation playbooks |
| Content generation | Marketing copywriter | A/B test click-through rate | Style guides, audience-specific templates |
| Research assistance | Literature review agent | Relevance scoring, citation accuracy | Search strategies, synthesis templates |
| Cloud operations | Infrastructure-as-code agent | Deployment success rate | Terraform patterns, error recovery scripts |

Comparison with Other Evolutionary Systems

| System | Year | Evolution Target | Evolution Mechanism | Domain Scope | Governance |
|---|---|---|---|---|---|
| FunSearch | 2023 | Single Python function | LLM + evolutionary search | Mathematical problems | None (best-score selection) |
| AlphaEvolve | 2025 | Entire codebases | Gemini Flash + Pro ensemble | Algorithms, math, hardware | Automated evaluation |
| OpenEvolve | 2025 | Code programs | LLM-as-mutator + MAP-Elites | General code optimization | Evaluator pipeline |
| GEPA | 2025 | Heuristic algorithms | LLM-guided population evolution | Combinatorial optimization | Best-score selection |
| ShinkaEvolve | 2025 | Optimization algorithms | LLM-driven mutation + island model | Algorithm design | Fitness-based selection |
| A-Evolve | 2026 | Agent workspace (prompts, skills, tools, memory) | Autonomous evolver agent (Diagnose→Plan→Update→Verify) | Any agent domain | EGL gating + git rollback + validation registry |

Key Differentiators for Applications

A-Evolve's unique position in the evolutionary AI landscape is that it evolves agents (the deployed system's behavior) rather than programs (standalone code solving a specific problem). This means:

  1. The mutation target is the deployment-time policy, not a static algorithm
  2. Evolution happens during deployment, not in a separate research loop
  3. The evolver is itself an agent, not a fixed pipeline — it can adapt its own strategy
  4. Governance is built-in, not bolted on — the validation registry and git integration ensure safety
  5. The framework is agent-agnostic — it can evolve any system whose state lives on the file system

Limitations and Open Questions

| Limitation | Impact | Mitigation |
|---|---|---|
| LLM sampling non-determinism | Exact numerical reproduction not guaranteed | Git-tagged checkpoints enable qualitative reproduction |
| Non-parametric evolution only (current release) | Cannot modify model weights; bounded by base model capability | Parametric evolution planned for future work; current non-parametric results already competitive |
| Benchmark-dependent evaluation | Evolved skills may overfit to benchmark-specific patterns | EGL gating on holdout set; but transfer to production settings untested |
| Evolution-time cost | Multiple LLM calls per cycle for diagnose/plan/update/verify | Early stopping via EGL convergence; configurable batch sizes |
| Single-model default | Same model used for both solving and evolving | Optionally use stronger model for evolution; but self-evolution is a feature, not a bug |
| Limited convergence theory | Evolution-Scaling Hypothesis is conjectured, not proven | Empirical evidence supports it across 4 benchmarks; formal analysis remains future work |

This analysis is based on the arXiv paper (v2, February 2026), the open-source repository (March 2026), the MarkTechPost coverage and tutorial (March 2026), and the Hugging Face paper page. The A-Evolve framework represents a significant conceptual advance in how we think about LLM system improvement — shifting from manual prompt engineering and static training to autonomous, governed, deployment-time evolution. Whether the Evolution-Scaling Hypothesis holds at scale and across diverse production environments remains the central open empirical question.