EvoSkill
Self-evolving framework that automatically discovers and refines reusable coding agent skills through iterative failure analysis, Pareto frontier selection, and textual feedback descent.
Organization: Sentient + Virginia Tech
Published: March 3, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: EvoSkill: Automated Skill Discovery for Multi-Agent Systems
System Name: EvoSkill
Paper: arXiv:2603.02766 (cs.AI, cs.MA)
Repository: github.com/sentient-agi/EvoSkill — Apache 2.0 License, 323 stars
Submission Date: March 3, 2026
Status: Preprint, work in progress
License: Apache 2.0
Related Work: Builds on ROMA (Recursive Open Meta-Agent Framework, arXiv:2602.01848) by the same group. Extends the Feedback Descent paradigm (arXiv:2511.07919) to skill-level optimization. Directly targets the Agent Skills specification (agentskills.io).
Positioning Statement: EvoSkill operates at a fundamentally different abstraction level than prior evolutionary AI systems. Where AlphaEvolve evolves code and GEPA evolves prompts, EvoSkill evolves skills — structured, portable, interpretable capability modules that persist across tasks, models, and agent harnesses. This makes it the first system to apply evolutionary optimization to the "skill" unit of abstraction for coding agents.
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Salaheddin Alzubi | Sentient | Lead author, framework design |
| Noah Provenzano | Virginia Tech | Benchmark evaluation, experiments |
| Jaydon Bingham | Virginia Tech | Implementation, skill builder |
| Weiyuan Chen | Virginia Tech | Evaluation infrastructure |
| Tu Vu | Virginia Tech | Supervision, research direction |
Organizational Context:
- Sentient — AI research organization focused on autonomous agent systems. Previously published ROMA (Recursive Open Meta-Agent Framework), a hierarchical agent framework that EvoSkill builds upon.
- Virginia Tech — Provides the academic research backbone, with Tu Vu's group contributing to agent skill optimization research.
Team Size: 5 authors — a compact team, contrasting with the large collaborations typical of the agent infrastructure space. The small team size suggests focused, iterative development rather than broad infrastructure building.
Intellectual Lineage:
- ROMA framework (Alzubi et al., 2026) → hierarchical multi-agent systems
- Feedback Descent (Lee, Boen, Finn, 2025) → textual feedback as optimization signal
- Self-Refine (Madaan et al., 2023) → iterative LLM self-improvement
- Voyager (Wang et al., 2023) → skill libraries for embodied agents
- Agent Skills specification (2025) → structured skill format standard
- AlphaEvolve (Google DeepMind) → evolutionary code optimization (comparison target)
- GEPA (Agrawal et al., 2026) → evolutionary prompt optimization (comparison target)
3 Core Contribution
The Problem
Coding agents (Claude Code, OpenHands, Codex) are increasingly used as general-purpose problem solvers. However, their general-purpose flexibility does not confer domain expertise. Three specific gaps:
- Hand-crafted skills are labor-intensive. The Agent Skills ecosystem (Claude Code skills, Codex skills) provides excellent infrastructure for using skills, but creating them requires manual domain knowledge and significant effort.
- Existing evolutionary approaches optimize the wrong abstraction level. AlphaEvolve optimizes codebases; GEPA optimizes prompts. Both produce artifacts tightly coupled to specific models and tasks — they don't yield reusable, transferable capabilities.
- No systematic failure-driven skill discovery. Agents repeatedly fail on the same categories of tasks, but there's no automated mechanism to analyze failures, propose remedies, and validate improvements.
The Solution
EvoSkill introduces evolutionary optimization at the skill level — discovering structured, reusable skill folders through iterative failure analysis:
┌──────────────────────────────────────────────────────────────────┐
│ EvoSkill: Evolutionary Skill Discovery │
│ │
│ Input: Coding agent + benchmark dataset + scoring function │
│ Output: Optimized skill library (structured skill folders) │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Iteration t │ │
│ │ │ │
│ │ 1. SELECT parent program from frontier (round-robin) │ │
│ │ 2. EVALUATE parent on training batch │ │
│ │ 3. COLLECT failures (score < threshold τ) │ │
│ │ 4. PROPOSE skill via failure analysis (Proposer) │ │
│ │ 5. BUILD skill folder (Skill-Builder) │ │
│ │ 6. EVALUATE candidate on validation set │ │
│ │ 7. UPDATE frontier if candidate outperforms weakest │ │
│ │ 8. APPEND to feedback history │ │
│ │ │ │
│ │ Repeat for T iterations │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Invariant: Underlying model is FROZEN throughout │
│ Output: Best program in frontier G │
└──────────────────────────────────────────────────────────────────┘
Three Key Insights
| Insight | Implication |
|---|---|
| Skills are the right unit of evolution | Unlike prompts or code, skills are structured, portable, and interpretable — they transfer across tasks and models |
| Failure analysis drives discovery | Instead of random mutation, EvoSkill uses structured failure diagnosis to propose targeted improvements |
| Pareto frontier + feedback history prevents regression | Only improvements survive; cumulative history prevents redundant proposals |
Key Insight: EvoSkill's central claim is that optimizing at the skill level produces a qualitatively different kind of improvement than prompt or code optimization. Skills discovered by EvoSkill are not opaque tunings — they are interpretable, domain-relevant capabilities (e.g., "data extraction verification protocol") that can be understood, composed, and transferred.
4 Supported Solutions
Agent Harness Compatibility
EvoSkill is designed for any coding agent harness that supports structured skill folders:
| Harness | Compatibility | Skill Format |
|---|---|---|
| Claude Code | Primary target | .claude/skills/ directories |
| Codex | Supported | .codex/skills/ directories |
| OpenCode | Supported | Compatible skill format |
| Any Agent Skills-compatible | Supported | Follows agentskills.io spec |
SDK and Model Support
| SDK | Models Tested | Notes |
|---|---|---|
| Claude SDK (default) | Claude Opus 4.5, Sonnet, Haiku | Primary evaluation platform |
| OpenCode SDK | DeepSeek-V3, Gemini 2.0 Flash | Alternative models via --sdk opencode |
Benchmark Tasks
EvoSkill is evaluated on three benchmark tasks with distinct characteristics:
| Benchmark | Type | Domain | Challenge |
|---|---|---|---|
| OfficeQA | Grounded reasoning | U.S. Treasury Bulletins (89K pages) | Dense tables, document navigation, quantitative reasoning |
| SealQA | Search-augmented QA | Open web with noisy retrieval | Conflicting results, premature search termination |
| BrowseComp | Fact-seeking browsing | Web browsing with short correct answers | Transfer target (zero-shot) |
Evolution Modes
| Mode | What Evolves | What's Fixed | Use Case |
|---|---|---|---|
| skill_only | Skill folders (SKILL.md + scripts) | System prompt | Primary mode — discovers reusable skills |
| prompt_only | System prompt | Skills | Alternative — optimizes base instructions |
Skill Anatomy
Each discovered skill conforms to the Agent Skills specification:
.claude/skills/data-extraction-verification/
├── SKILL.md # Trigger metadata + procedural instructions
├── helpers/
│ ├── validate.py # Helper scripts for verification
│ └── templates/ # Reference materials
└── examples/ # Usage examples (optional)
A SKILL.md file contains:
- Name and description — human-readable identification
- Trigger conditions — when the agent should invoke this skill
- Instructions — step-by-step procedural guidance
- Constraints — what not to do, common pitfalls
- Validation — how to verify correct application
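For illustration, the anatomy above can be scaffolded programmatically. The following is a minimal sketch — `scaffold_skill` and the template fields are hypothetical helpers, not part of the released repository:

```python
from pathlib import Path

# Hypothetical SKILL.md template mirroring the fields listed above.
SKILL_MD_TEMPLATE = """\
---
name: {name}
description: {description}
---

## Trigger conditions
{trigger}

## Instructions
{instructions}
"""

def scaffold_skill(root: Path, name: str, description: str,
                   trigger: str, instructions: str) -> Path:
    """Create a minimal skill folder: SKILL.md plus an empty helpers/ dir."""
    skill_dir = root / name
    (skill_dir / "helpers").mkdir(parents=True, exist_ok=True)
    skill_md = skill_dir / "SKILL.md"
    skill_md.write_text(SKILL_MD_TEMPLATE.format(
        name=name, description=description,
        trigger=trigger, instructions=instructions))
    return skill_md
```

In EvoSkill this materialization step is performed by the Skill-Builder agent rather than a fixed template.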
5 LLM Integration
Three-Agent Architecture
EvoSkill's evolutionary loop is powered by three collaborating LLM agents, each with a distinct role:
┌─────────────────────────────────────────────────────────────────┐
│ Three-Agent System │
│ │
│ ┌──────────────┐ │
│ │ Executor (A) │ Executes tasks under current program │
│ │ │ Read access to base repo │
│ │ Role: Worker │ No write access to skills │
│ └──────┬───────┘ │
│ │ Output traces, predicted answers │
│ ▼ │
│ ┌──────────────┐ │
│ │ Proposer (P) │ Analyzes failures + ground truth │
│ │ │ Reads feedback history H │
│ │ Role: Critic │ Proposes skill edits or new skills │
│ │ │ Read access to base repo │
│ └──────┬───────┘ │
│ │ Textual proposal π │
│ ▼ │
│ ┌──────────────┐ │
│ │ Skill-Builder │ Materializes proposal into skill folder │
│ │ (S) │ Write access to skills directory │
│ │ │ Bootstrapped with meta-skill for authoring │
│ │ Role: Builder│ Read access to base repo │
│ └──────────────┘ │
│ │
│ Access Control: │
│ - All agents: READ access to base agent repository │
│ - Only Skill-Builder: WRITE access to skills directory │
│ - Proposer: maintains cumulative feedback history H │
└─────────────────────────────────────────────────────────────────┘
Model Selection
The paper evaluates primarily with Claude Opus 4.5 as the underlying model for all three agents. The repository supports configurable model selection:
# Default: Claude Opus
python scripts/run_loop.py --model opus
# Alternatives
python scripts/run_loop.py --model sonnet
python scripts/run_loop.py --model haiku
# Via OpenCode SDK with third-party models
python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3
The Frozen Model Principle
A defining characteristic of EvoSkill is that the underlying model is never modified:
| What Changes | What Stays Frozen |
|---|---|
| Skill folders (SKILL.md, scripts) | Model weights |
| System prompt (in prompt_only mode) | Model architecture |
| Feedback history H | Model API |
| Frontier programs G | Inference parameters |
This is a critical distinction from fine-tuning approaches. EvoSkill demonstrates that meaningful performance improvements can be achieved purely through skill-level optimization — essentially discovering the right "textbook" for the agent to consult, without changing the agent's underlying capabilities.
Ground Truth Usage
The Proposer receives ground-truth answers to enable root-cause diagnosis:
Ground-truth answers are provided to enable root-cause diagnosis, analogous to examining labeled misclassifications during error analysis in supervised learning, and are not propagated to the generated skills themselves.
This is an important methodological detail — the skills themselves don't contain answers or ground truth. The ground truth is used only to diagnose why the agent failed, not to embed the answers into the skills.
Progressive Context Enrichment
The Proposer maintains a cumulative feedback history H that grows across iterations:
H = [(proposal_1, score_1, verdict_1),
(proposal_2, score_2, verdict_2),
...
(proposal_t, score_t, verdict_t)]
This serves two purposes:
1. Prevents redundant proposals — the Proposer knows what has already been tried
2. Enables refinement — the Proposer can build on partial successes and avoid repeating failures
This is directly analogous to the feedback history mechanism in Feedback Descent (Lee et al., 2025), applied to the skill discovery setting.
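A minimal sketch of how such a history might be represented and serialized into the Proposer's context. The `HistoryEntry` fields and `format_history` helper are assumptions for illustration, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass
class HistoryEntry:
    proposal: str   # short summary of the proposed skill change
    score: float    # validation score of the resulting candidate
    verdict: str    # "accepted" or "rejected" by the frontier update

def format_history(history: list[HistoryEntry]) -> str:
    """Render H as a compact context block so the Proposer can avoid
    re-proposing ideas that already failed and refine ones that helped."""
    if not history:
        return "No prior proposals."
    lines = [f"{i}. [{e.verdict}, score={e.score:.3f}] {e.proposal}"
             for i, e in enumerate(history, start=1)]
    return "\n".join(lines)
```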
6 Key Results
OfficeQA Results
| Configuration | 0.00% (exact) | 0.10% | 1.00% | 5.00% | 10.00% |
|---|---|---|---|---|---|
| Baseline | 60.6 | 66.3 | 72.8 | 77.2 | 79.7 |
| 5% train | 63.4 | 67.4 | 74.3 | 77.6 | 80.1 |
| 10% train | 65.8 | 69.9 | 76.4 | 80.5 | 82.5 |
| 15% train | 64.5 | 69.9 | 75.6 | 79.3 | 81.3 |
| merge-unique | 68.1 | 70.8 | 77.1 | 80.5 | 82.4 |
Tolerance = allowable relative error in numeric answers.
Key Findings:
- +7.5% exact-match improvement (60.6% → 68.1%) via the merge-unique skill-merge configuration
- Diminishing returns beyond 10% training data — 15% split performs slightly worse than 10%, suggesting mild overfitting
- Skill-merge outperforms all individual runs — skills from independent runs are complementary
- Consistent improvement across all tolerance levels — gains at nonzero tolerances range from 2.7% to 4.5%
SealQA Results
| Configuration | Accuracy |
|---|---|
| Baseline | 26.6% |
| EvoSkill (10% train) | 38.7% |
+12.1% absolute improvement — the largest gain across all benchmarks, attributable to the search-persistence-protocol skill that enforces exhaustive search before committing to answers.
Zero-Shot Transfer: SealQA → BrowseComp
| Configuration | BrowseComp Accuracy |
|---|---|
| Baseline | 43.5% |
| + SealQA skill (zero-shot) | 48.8% |
+5.3% improvement with no modification — a skill discovered on SealQA transfers directly to BrowseComp, demonstrating that EvoSkill produces generalizable capabilities rather than task-specific tunings.
Result Significance Analysis
| Metric | Assessment |
|---|---|
| Statistical rigor | Single-run evaluation due to computational cost (Opus 4.5); variance analysis deferred to future work |
| Baseline fairness | Baseline cross-referenced with benchmark authors' latest results |
| Transfer evidence | Strongest contribution — zero-shot transfer provides causal evidence for generalization |
| Scalability | Small training sets (5-15%) sufficient, suggesting efficiency |
| Interpretability | Discovered skills are human-readable and domain-relevant |
Most Important Result: The zero-shot transfer experiment is EvoSkill's strongest claim. It demonstrates that optimizing at the skill level — rather than prompts or code — produces capabilities that generalize beyond the training task. A search-persistence skill learned from SealQA helps on BrowseComp because the underlying capability (exhaustive search before committing) is domain-general.
7 Reproducibility
Open-Source Infrastructure
| Component | Availability | Notes |
|---|---|---|
| Framework code | GitHub (Apache 2.0) | Full evolutionary loop |
| Python API | src/api.py | High-level EvoSkill() and EvalRunner() |
| CLI scripts | scripts/ | run_loop.py, run_eval.py variants |
| Agent profiles | src/agent_profiles/ | Task-specific agent configurations |
| Evaluation scorers | src/evaluation/ | Per-benchmark scoring functions |
| Task registration | src/api.py | Extensible register_task() |
Reproducing Results
# Install
uv sync # or pip install -e .
# Set API key
export ANTHROPIC_API_KEY=your-key-here
# Run the evolutionary loop on OfficeQA
python scripts/run_loop.py --mode skill_only --max-iterations 20
# Run the evolutionary loop on SealQA
python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20
# Evaluate a configuration
python scripts/run_eval.py --model opus --max-concurrent 8
Via Python API
from src.api import EvoSkill

# Run the full self-improvement loop
result = await EvoSkill(
    task="sealqa",
    model="opus",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
).run()
Reproducibility Concerns
| Factor | Assessment | Mitigation |
|---|---|---|
| LLM non-determinism | All three agents (Executor, Proposer, Skill-Builder) use LLMs — inherent stochasticity | Git branch tracking provides full audit trail |
| Single-run evaluation | Results reported from single runs due to cost | Authors acknowledge; variance analysis is future work |
| Model versioning | Opus 4.5 is a specific model snapshot | Future model versions may yield different results |
| Dataset availability | OfficeQA and SealQA are public; BrowseComp stratified sample details not fully specified | Datasets referenced but hosting varies |
| Cost barrier | Running Opus 4.5 at scale is expensive | Limits community reproduction |
| Scoring functions | OfficeQA uses fuzzy matching; SealQA uses LLM-graded scoring | LLM-based scoring adds another layer of non-determinism |
Git-Based Audit Trail
EvoSkill stores every agent program as a git branch:
main (base agent)
├── evo/iter-1/candidate-1 (+ data-extraction-verification skill)
├── evo/iter-2/candidate-1 (+ quantitative-analysis skill)
├── evo/iter-3/candidate-1 (rejected — score decreased)
└── evo/iter-4/candidate-1 (+ search-persistence-protocol)
Each branch diverges from its parent only in skill folders and metadata. This provides complete traceability: every skill, every proposal, every score is version-controlled.
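A sketch of how branch-per-candidate bookkeeping might look. The `branch_commands` helper is hypothetical; it only constructs the git invocations (it does not execute them), following the `evo/iter-t/candidate-i` naming shown above:

```python
def candidate_branch(iteration: int, candidate: int) -> str:
    """Branch name for a candidate program, per the evo/iter-t/candidate-i scheme."""
    return f"evo/iter-{iteration}/candidate-{candidate}"

def branch_commands(parent: str, iteration: int, candidate: int,
                    message: str) -> list[list[str]]:
    """Git commands that snapshot a candidate as a branch diverging from
    its parent only in skill folders and metadata."""
    branch = candidate_branch(iteration, candidate)
    return [
        ["git", "checkout", "-b", branch, parent],
        ["git", "add", ".claude/skills", ".agent-metadata.json"],
        ["git", "commit", "-m", message],
    ]
```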
8 Compute and API Costs
Cost Model
EvoSkill's cost is dominated by LLM API calls across three agents:
Cost per iteration ≈
Cost(Executor on training batch) [most expensive]
+ Cost(Proposer analyzing failures) [moderate]
+ Cost(Skill-Builder creating skill) [moderate]
+ Cost(Executor on validation set) [most expensive]
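The cost model above can be turned into a back-of-envelope estimator. All dollar figures here are illustrative midpoints, not measured costs:

```python
def iteration_cost(n_train_questions: int, n_val_questions: int,
                   exec_cost_per_question: float = 2.0,
                   proposer_cost: float = 0.5,
                   builder_cost: float = 0.3) -> float:
    """Rough per-iteration API cost in dollars. Executor calls dominate
    because every training and validation question is a full agent run."""
    executor = (n_train_questions + n_val_questions) * exec_cost_per_question
    return executor + proposer_cost + builder_cost
```

For example, 10 training questions plus a ~17-item validation set at an assumed $2 per question gives roughly $54.80 per iteration, consistent with the per-run totals below.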
Estimated Per-Iteration Costs
| Agent | Estimated Tokens | Cost (Opus 4.5) | Notes |
|---|---|---|---|
| Executor (training) | 50K-200K per question | $0.50-$5.00 per question | Multiple questions per batch |
| Proposer | 20K-50K | $0.30-$0.75 | Analyzes failure traces |
| Skill-Builder | 10K-30K | $0.15-$0.45 | Generates skill folder |
| Executor (validation) | 50K-200K per question | $0.50-$5.00 per question | Full validation set (~17 items) |
Total Run Costs
| Configuration | Iterations | Estimated Total Cost |
|---|---|---|
| OfficeQA, 5% train | ~18 (1.5 epochs) | $200-$800 |
| OfficeQA, 10% train | ~36 (1.5 epochs) | $400-$1,600 |
| OfficeQA, 15% train | ~54 (1.5 epochs) | $600-$2,400 |
| SealQA, 10% train | ~17 (1.5 epochs) | $300-$1,200 |
| Skill-merge (3 independent runs) | 3× above | 3× above |
| Full reproduction | All configurations | $3,000-$15,000 |
Cost-Efficiency Analysis
| Approach | Cost to Improve by ~7% | Reusability |
|---|---|---|
| EvoSkill (skill discovery) | $200-$1,600 | High — skills transfer |
| Fine-tuning | $1,000-$10,000+ | Low — model-specific |
| Manual skill authoring | 10-40 engineer-hours | Moderate — domain-specific |
| Prompt engineering | 5-20 engineer-hours | Low — task-specific |
Cost Insight: EvoSkill's costs are non-trivial but competitive with alternatives. The key advantage is that discovered skills are reusable — a single run on SealQA produces skills that also improve BrowseComp for free. Amortized across multiple deployment tasks, the ROI is favorable.
Hardware Requirements
| Component | Requirement |
|---|---|
| Compute | No GPU needed — all work is done via API calls |
| Storage | Minimal — skill folders are small text files |
| Memory | Standard — Python process + git operations |
| Network | API access to Anthropic (or alternative provider) |
| Docker | Required for sandboxed execution of helper scripts during evaluation |
9 Architecture Solution
System Architecture
┌───────────────────────────────────────────────────────────────────────┐
│ EvoSkill System Architecture │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Data Layer │ │
│ │ │ │
│ │ Dataset D = {(x_i, y_i)} LLM Classifier Stratified │ │
│ │ ┌───────────────┐ ┌──────────┐ Partitions │ │
│ │ │ Raw benchmark │──────────│ Cluster │───────► Train / Val / │ │
│ │ │ questions │ K cats │ into K │ Test splits │ │
│ │ └───────────────┘ │categories│ │ │
│ │ └──────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Evolution Layer │ │
│ │ │ │
│ │ Frontier G = {p_1, p_2, ..., p_k} (top-k programs) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Parent │───►│ Evaluate │───►│ Collect │───►│ Propose │ │ │
│ │ │ Select │ │ on Train │ │ Failures │ │ Skill │ │ │
│ │ │(round- │ │ Batch │ │ (< τ) │ │ Change │ │ │
│ │ │ robin) │ └──────────┘ └──────────┘ └────┬─────┘ │ │
│ │ └──────────┘ │ │ │
│ │ ▲ ▼ │ │
│ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ │ Update │◄───│ Evaluate │◄───│ Build │ │ │
│ │ └─────────│ Frontier │ │ on Val │ │ Skill │ │ │
│ │ │ (if │ │ Set │ │ Folder │ │ │
│ │ │ better) │ └──────────┘ └──────────┘ │ │
│ │ └──────────┘ │ │
│ │ │ │
│ │ Feedback History H: [(π_1,s_1), (π_2,s_2), ..., (π_t,s_t)] │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │
│ │ Git Repository (fixed codebase) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ main │ │ program │ │ program │ │ │
│ │ │ (base) │ │ branch 1 │ │ branch 2 │ │ │
│ │ │ │ │ + skills │ │ + skills │ │ │
│ │ │ │ │ + metadata │ │ + metadata │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
Design Principles
1. Skill-Level Abstraction
EvoSkill's fundamental design decision is to evolve skills rather than lower-level artifacts:
| Abstraction Level | Example Systems | Artifacts | Transferability |
|---|---|---|---|
| Weights | Fine-tuning, RLHF | Model parameters | None |
| Code | AlphaEvolve | Source code files | Low |
| Prompts | GEPA, DSPy | Text strings | Low |
| Skills | EvoSkill | Structured folders (SKILL.md + scripts) | High |
Skills are the right level because they are:
- Structured: Metadata, instructions, and code in a defined format
- Portable: Follow the Agent Skills specification — work across harnesses
- Interpretable: Human-readable, auditable, modifiable
- Composable: Multiple skills coexist and activate independently
- Transferable: Demonstrated zero-shot transfer across benchmarks
2. Separation of Concerns
Each of the three agents has a clearly delineated responsibility:
| Agent | Reads | Writes | Responsibility |
|---|---|---|---|
| Executor | Base repo + skills | Nothing | Execute tasks, produce traces |
| Proposer | Traces + ground truth + history H | Nothing (appends to H) | Diagnose failures, propose skills |
| Skill-Builder | Base repo + proposal π | Skills directory | Build concrete skill folders |
This separation ensures:
- The Executor cannot cheat by looking at ground truth
- The Proposer cannot directly modify the agent
- The Skill-Builder cannot see evaluation scores (only the proposal)
3. Evolutionary Selection with Memory
The frontier G and feedback history H together implement evolutionary selection with memory:
- Frontier G: Maintains top-k programs (population management)
- History H: Prevents redundant exploration (cumulative memory)
- Round-robin selection: Ensures all frontier members are explored
- Score-based replacement: Only improvements enter the frontier
10 Component Breakdown
Core Components
1. Self-Improving Loop (src/loop/)
The central evolutionary loop that orchestrates all three agents:
# Simplified pseudocode from Algorithm 1
H = []              # feedback history
G = [base_agent]    # frontier (initially just the base program)
scores = {base_agent: evaluate(base_agent, validation_set)}  # baseline score

for t in range(T):
    # Round-robin parent selection
    parent = G[t % len(G)]
    # Collect failures (score below threshold tau) on a training batch
    failures = collect_failures(parent, train_batch, threshold=tau)
    if not failures:
        continue
    # Propose skill change from failure analysis + history
    proposal = proposer(failures, H)
    # Build candidate program (parent + new or edited skill)
    candidate = skill_builder(parent, proposal)
    # Evaluate on the validation set
    scores[candidate] = evaluate(candidate, validation_set)
    # Frontier update: admit if there is room or it beats the weakest
    if len(G) < k or scores[candidate] > min(scores[p] for p in G):
        G.append(candidate)
        if len(G) > k:
            G.remove(min(G, key=scores.get))
    # Record history
    H.append((proposal, scores[candidate]))

return max(G, key=scores.get)
2. Agent Profiles (src/agent_profiles/)
Task-specific agent configurations:
src/agent_profiles/
├── __init__.py
├── base_agent/
│ ├── __init__.py
│ ├── base_agent.py # Default agent options
│ └── prompt.txt # Base system prompt
├── sealqa_agent/
│ ├── __init__.py
│ ├── sealqa_agent.py # SealQA-specific options
│ └── prompt.txt # SealQA system prompt
└── dabstep_agent/
└── ...
Each agent profile defines:
- System prompt
- Available tools
- Timeout configurations
- Model selection
3. Evaluation System (src/evaluation/)
Modular scoring framework:
| Scorer | Benchmark | Method |
|---|---|---|
| Multi-tolerance | OfficeQA | Fuzzy matching at 0%, 0.1%, 1%, 5%, 10% tolerance |
| LLM-graded | SealQA | GPT-based semantic equivalence judgment |
| Exact match | BrowseComp | String comparison |
# Scoring interface
def score(question: str, predicted: str, ground_truth: str) -> float:
"""Return score in [0.0, 1.0]."""
...
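A hedged sketch of what a single-tolerance numeric check for the OfficeQA-style scorer might look like. The parsing rules (stripping commas and dollar signs, string fallback) are assumptions; the repository's `multi_tolerance.py` may differ:

```python
def numeric_within_tolerance(predicted: str, ground_truth: str,
                             tolerance: float) -> bool:
    """True if the predicted number is within a relative tolerance of the
    ground truth; falls back to exact string match for non-numeric answers."""
    try:
        p = float(predicted.replace(",", "").replace("$", ""))
        g = float(ground_truth.replace(",", "").replace("$", ""))
    except ValueError:
        return predicted.strip() == ground_truth.strip()
    if g == 0.0:
        return p == 0.0
    return abs(p - g) / abs(g) <= tolerance
```

Running this at each tolerance in {0%, 0.1%, 1%, 5%, 10%} yields the multi-tolerance columns reported in the OfficeQA results.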
4. Data Splitting (src/data/)
LLM-based stratified partitioning:
- Classification: LLM classifies each example into K categories
- Stratified split: Ensures every category appears in both train and validation
- Holdout: Test set is never exposed during evolution
Default ratios are configurable:
- Train: ~10-18% (for failure detection)
- Validation: ~7-12% (for frontier scoring)
- Test: remaining ~70-80% (for final evaluation)
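A minimal sketch of category-stratified splitting under such ratios (function name and signature are assumptions, not the repository's `src/data/` API):

```python
import random
from collections import defaultdict

def stratified_split(examples, categories, train_ratio=0.10,
                     val_ratio=0.10, seed=0):
    """Split so every category appears in train and validation;
    the remainder of each category is held out as test."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex, cat in zip(examples, categories):
        by_cat[cat].append(ex)
    train, val, test = [], [], []
    for cat, items in by_cat.items():
        rng.shuffle(items)
        n_train = max(1, int(len(items) * train_ratio))
        n_val = max(1, int(len(items) * val_ratio))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```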
5. Frontier Management (src/frontier/)
Git-based program versioning:
Git branch structure:
main # Fixed codebase
├── evo/base # Base program (no skills)
├── evo/iter-1/candidate-0 # First evolved program
├── evo/iter-2/candidate-0 # Second evolved program
└── evo/iter-N/candidate-0 # N-th evolved program
Each branch contains:
.claude/skills/ # Accumulated skill folders
.agent-metadata.json # System prompt, lineage, score
Supporting Components
| Component | Location | Purpose |
|---|---|---|
| Proposer prompts | src/prompts/ | Structured templates for failure analysis |
| Skill-Builder meta-skill | src/meta_skill/ | Best practices for skill authoring |
| Caching | src/cache/ | Avoid re-evaluating identical programs |
| Concurrency control | src/concurrency/ | Parallel evaluation with configurable limits |
| Result tracking | src/results/ | Score history, skill provenance |
| CLI scripts | scripts/ | Entry points for loop and evaluation |
11 Core Mechanisms (Detailed)
Mechanism 1: Textual Feedback Descent for Skills
EvoSkill adapts the Feedback Descent framework (Lee, Boen, Finn, 2025) to skill discovery. The key innovation is using rich textual feedback rather than scalar rewards:
Original Feedback Descent:
Maintain frontier of top-k candidates
Each iteration:
1. Select candidate from frontier
2. Evaluate candidate → get textual feedback
3. Editor LLM uses feedback to produce improved candidate
4. New candidate enters frontier if better
EvoSkill's Adaptation:
Maintain frontier of top-k PROGRAMS (agent + skills)
Each iteration:
1. Select program from frontier (round-robin)
2. Execute program on tasks → collect FAILURE TRACES
3. Proposer LLM diagnoses failures → produces SKILL PROPOSAL
4. Skill-Builder LLM materializes proposal → SKILL FOLDER
5. New program enters frontier if validation score improves
The critical difference: where Feedback Descent optimizes a single artifact (molecule, SVG, prompt), EvoSkill optimizes a composition of skills — the improvement is additive across iterations.
Mechanism 2: Failure-Driven Skill Discovery
The Proposer performs structured failure analysis:
Input to Proposer:
- Execution traces (full agent conversation for each failed question)
- Predicted answers (what the agent output)
- Ground-truth answers (correct answers for diagnosis)
- Feedback history H (all prior proposals and outcomes)
Proposer Analysis Steps:
1. Review execution traces to identify WHERE the agent went wrong
2. Classify failure mode:
- Data extraction error (wrong cell, wrong metric)
- Reasoning error (incorrect computation, wrong formula)
- Search error (premature termination, missed source)
- Comprehension error (misunderstood question intent)
3. Check existing skills: is there a skill that SHOULD have prevented this?
- If yes: propose EDIT to strengthen that skill
- If no: propose NEW skill to address the gap
4. Consult history H: has a similar proposal been tried before?
- If tried and failed: propose different approach
- If tried and partially worked: propose refinement
5. Output: structured proposal π specifying skill name, trigger,
instructions, and rationale
Mechanism 3: Pareto Frontier with Round-Robin Selection
The frontier G maintains the top-k programs:
Frontier G = {p_1, p_2, ..., p_k} where k = frontier_size (default: 3)
Selection: round-robin cycling
Iteration 1: parent = G[0]
Iteration 2: parent = G[1]
Iteration 3: parent = G[2]
Iteration 4: parent = G[0] (cycle restarts)
...
This ensures:
- Every frontier member gets explored before any is revisited
- No single program dominates the mutation budget
- Diverse exploration starting points
Replacement: score-based
if len(G) < k or candidate_score > min(G.scores):
    G.add(candidate)
    if len(G) > k:
        G.remove(argmin(G.scores))
Why Round-Robin over Tournament/Roulette?
Round-robin selection guarantees that all frontier members receive equal exploration effort. In a small frontier (k=3), this matters: tournament selection could repeatedly select the strongest member, leading to premature convergence on a local optimum. Round-robin ensures that weaker frontier members — which may contain useful partial skills — also get a chance to be improved.
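The round-robin selection and score-based replacement described above can be sketched as a small class. This is an illustrative reimplementation, not the repository's actual `src/frontier.py`:

```python
class Frontier:
    """Top-k program frontier with round-robin parent selection (sketch)."""

    def __init__(self, k: int = 3):
        self.k = k
        self.members: list[tuple[object, float]] = []  # (program, score)
        self._cursor = 0

    def select_parent(self):
        """Cycle through frontier members so each gets equal exploration."""
        program, _ = self.members[self._cursor % len(self.members)]
        self._cursor += 1
        return program

    def maybe_add(self, program, score: float) -> bool:
        """Admit the candidate if the frontier has room or it beats the weakest."""
        if len(self.members) < self.k:
            self.members.append((program, score))
            return True
        worst = min(self.members, key=lambda m: m[1])
        if score > worst[1]:
            self.members.remove(worst)
            self.members.append((program, score))
            return True
        return False
```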
Mechanism 4: Skill Composition and Accumulation
Skills accumulate across iterations — a program's skill library grows monotonically:
Iteration 0: program_0 = {system_prompt}
Iteration 1: program_1 = {system_prompt, skill_A}
Iteration 3: program_3 = {system_prompt, skill_A, skill_B}
Iteration 7: program_7 = {system_prompt, skill_A, skill_B (edited), skill_C}
The Agent Skills specification enables this through progressive disclosure:
- Metadata loading at startup: Agent reads all skill triggers — minimal context cost
- On-demand activation: Full SKILL.md is read only when trigger conditions match
- Helper execution: Scripts run in subprocess, never entering context window
This means an agent can maintain dozens of skills with negligible context overhead. The skills activate contextually, responding to the specific characteristics of each input.
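A sketch of the startup-time metadata scan that makes this cheap — reading only each SKILL.md frontmatter header. This is a hypothetical helper for illustration; the real loader lives in the agent harness, not in EvoSkill:

```python
from pathlib import Path

def load_skill_triggers(skills_root: Path) -> dict[str, str]:
    """Read only the YAML-frontmatter header of each SKILL.md, so dozens
    of skills cost a few hundred tokens at startup; the full body is
    loaded later, on demand, when a trigger matches."""
    triggers = {}
    for skill_md in skills_root.glob("*/SKILL.md"):
        header_lines = []
        lines = skill_md.read_text().splitlines()
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":   # end of frontmatter
                    break
                header_lines.append(line)
        triggers[skill_md.parent.name] = "\n".join(header_lines)
    return triggers
```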
Mechanism 5: Skill Merging
The paper introduces a skill-merge strategy that combines skills from independent runs:
Run 1 (5% train): discovers skills {A, B, C}
Run 2 (10% train): discovers skills {B', D, E}
Run 3 (15% train): discovers skills {A', E', F}
Merge strategy:
1. Identify unique skills by name/description
2. For overlapping skills (B/B', A/A', E/E'):
keep version from highest-scoring run
3. Result: merged library {A', B, C, D, E, F}
This simple merging strategy outperforms any individual run (+7.5% vs. the best individual run's +5.2%), demonstrating that different training configurations surface complementary capabilities.
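The merge policy described above can be sketched in a few lines. This is an illustration of the stated policy, not the repository's actual merge tooling:

```python
def merge_unique(runs):
    """Merge skill libraries from independent runs: unique skills are kept,
    and for name collisions the version from the highest-scoring run wins.

    `runs` is a list of (run_score, {skill_name: skill_content}) pairs.
    """
    merged = {}
    best_source = {}   # skill_name -> score of the run that contributed it
    for run_score, skills in runs:
        for name, content in skills.items():
            if name not in merged or run_score > best_source[name]:
                merged[name] = content
                best_source[name] = run_score
    return merged
```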
Mechanism 6: Category-Aware Training
The data setup uses LLM-based clustering to ensure balanced exposure:
Dataset D → LLM Classifier → K categories
Category 1: questions about debt instruments
Category 2: questions about revenue figures
Category 3: questions about cross-document comparison
...
Stratified split ensures:
- Every category appears in train, val, and test
- Evolution sees diverse failure modes (not just one category)
- Validation is representative of the full distribution
Training data are organized as category-keyed pools. During evolution, batches are sampled without replacement, cycling through all examples before repeating. This ensures the Proposer sees failures from all categories, not just the most frequent ones.
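The category-keyed, without-replacement sampling described above can be sketched as follows (class and method names are assumptions):

```python
import random
from collections import deque

class CategoryPool:
    """Sample batches without replacement from category-keyed pools,
    cycling through all examples in a category before repeating any."""

    def __init__(self, pools: dict, seed: int = 0):
        self.rng = random.Random(seed)
        self.pools = pools
        self.queues = {cat: deque() for cat in pools}

    def next_batch(self, per_category: int = 1) -> list:
        """Draw per_category examples from every category pool."""
        batch = []
        for cat, items in self.pools.items():
            for _ in range(per_category):
                if not self.queues[cat]:          # exhausted: reshuffle and cycle
                    shuffled = list(items)
                    self.rng.shuffle(shuffled)
                    self.queues[cat] = deque(shuffled)
                batch.append(self.queues[cat].popleft())
        return batch
```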
12 Programming Language
Implementation Stack
| Component | Language | Framework |
|---|---|---|
| Core framework | Python 3.12+ | asyncio, dataclasses |
| Agent interaction | Python | Claude Agent SDK / OpenCode SDK |
| Evaluation | Python | Custom scorers |
| Version control | Git | Branch-per-program storage |
| Package management | uv (recommended) or pip | pyproject.toml |
| Sandbox | Docker | For secure code execution |
Code Organization
EvoSkill/
├── src/
│ ├── api.py # High-level Python API
│ ├── loop/ # Self-improving loop implementation
│ ├── agent_profiles/ # Per-task agent configurations
│ │ ├── base_agent/
│ │ ├── sealqa_agent/
│ │ └── dabstep_agent/
│ ├── evaluation/ # Scoring functions
│ │ ├── eval_full.py # Full evaluation runner
│ │ ├── sealqa_scorer.py
│ │ └── multi_tolerance.py
│ ├── prompts/ # Proposer/Builder prompt templates
│ ├── schemas.py # Pydantic data models
│ ├── frontier.py # Git-based frontier management
│ └── data/ # Dataset splitting utilities
├── scripts/
│ ├── run_loop.py # CLI: run evolution loop
│ ├── run_loop_sealqa.py # CLI: run SealQA loop
│ ├── run_eval.py # CLI: run evaluation
│ └── run_eval_sealqa.py # CLI: run SealQA eval
├── .dataset/ # Benchmark data (user-provided)
├── .claude/skills/ # Evolved skill folders (output)
├── pyproject.toml # Package configuration
└── uv.lock # Dependency lock file
API Design
EvoSkill provides both async and sync interfaces:
# Async (recommended)
result = await EvoSkill(task="sealqa").run()
# Sync (convenience wrapper)
result = EvoSkill(task="sealqa").run_sync()
# Evaluation only
summary = await EvalRunner(task="sealqa", model="sonnet").run()
# Task registration
register_task(TaskConfig(
name="my_task",
make_agent_options=my_options_factory,
scorer=my_scorer,
default_dataset=".dataset/my_data.csv",
))
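The scorer passed to `register_task` could be as simple as the following sketch. The contract assumed here, a plain function from prediction and ground truth to a float in [0, 1], is an assumption rather than the repository's verified interface:

```python
def exact_match_scorer(prediction: str, ground_truth: str) -> float:
    """Return 1.0 on a whitespace/case-normalized exact match, else 0.0.

    NOTE: the (prediction, ground_truth) -> float contract is assumed;
    EvoSkill's actual scorer interface may differ.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0

# Would then be wired into the task registration shown above, e.g.:
# register_task(TaskConfig(name="my_task", make_agent_options=...,
#                          scorer=exact_match_scorer,
#                          default_dataset=".dataset/my_data.csv"))
```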
Language Choice Rationale
- Agent SDK compatibility: Claude Code SDK and OpenCode SDK are Python-native
- Async-first: The evaluation loop is I/O-bound (API calls), making asyncio the right concurrency model
- Git integration: Python's subprocess module provides clean wrapping of git operations
- ML ecosystem: Scoring functions and data processing leverage Python's ML stack
- Skill output format: SKILL.md files are language-agnostic, but helper scripts are typically Python
13 Memory Management
Context Window Management
The most critical memory constraint in EvoSkill is not RAM but LLM context windows. Each agent must fit its inputs within the model's context limit:
Executor Agent:
Context budget:
System prompt: ~2K tokens
Skill triggers (all skills): ~50 tokens × N_skills
Active skill (on demand): ~2K-5K tokens per skill
Task question: ~1K-10K tokens
Tool outputs: ~5K-50K tokens (document content)
Agent reasoning: ~5K-20K tokens
Total: ~15K-80K tokens per question
Proposer Agent:
Context budget:
System prompt + instructions: ~3K tokens
Failure traces (N failures): ~5K-20K tokens per failure
Ground-truth answers: ~500 tokens per failure
Feedback history H: ~1K-5K tokens (growing)
Existing skill inventory: ~1K-3K tokens
Total: ~20K-100K tokens per iteration
Skill-Builder Agent:
Context budget:
Meta-skill (authoring guidelines): ~3K tokens
Parent program metadata: ~1K tokens
Proposal π: ~2K-5K tokens
Base repository context: ~5K-10K tokens
Total: ~10K-20K tokens per skill construction
Progressive Disclosure for Skill Loading
The Agent Skills specification's progressive disclosure model is essential for EvoSkill's scalability:
Startup: Load ALL skill triggers
┌────────────────────────────────────┐
│ data-extraction-verification │ ~50 tokens
│ trigger: "extracting table data" │
├────────────────────────────────────┤
│ quantitative-analysis │ ~50 tokens
│ trigger: "financial calculation" │
├────────────────────────────────────┤
│ search-persistence-protocol │ ~50 tokens
│ trigger: "searching for answers" │
└────────────────────────────────────┘
Total: ~150 tokens for 3 skills
Activation: Load ONLY matching skill
┌────────────────────────────────────┐
│ data-extraction-verification │
│ [Full SKILL.md: ~2000 tokens] │
│ [Helper scripts: executed, not │
│ loaded into context] │
└────────────────────────────────────┘
Total: ~2000 tokens (one skill)
This means EvoSkill can accumulate 50+ skills at a startup cost of only ~2,500 tokens. Only the skill(s) relevant to a given question consume full context.
Feedback History Growth
The feedback history H grows linearly with iterations:
After 20 iterations:
H ≈ 20 × (proposal summary + score + verdict)
≈ 20 × 200 tokens
≈ 4000 tokens
After 100 iterations:
H ≈ 100 × 200 tokens
≈ 20,000 tokens
For long runs, H may need summarization or sliding-window management. The paper's experiments (≤20 iterations) stay well within manageable limits.
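One possible sliding-window mitigation, not implemented in the paper, is to keep only the most recent entries verbatim and collapse older ones into a running count (a sketch):

```python
from collections import deque

class FeedbackHistory:
    """Keep the last `window` feedback entries verbatim; older entries
    are collapsed into a running count so H stays O(window) tokens."""

    def __init__(self, window: int = 20):
        self.window = window
        self.recent = deque(maxlen=window)
        self.evicted = 0

    def append(self, entry: str) -> None:
        if len(self.recent) == self.window:
            self.evicted += 1          # deque drops the oldest automatically
        self.recent.append(entry)

    def render(self) -> str:
        """Serialize H for the Proposer's context."""
        header = f"[{self.evicted} earlier entries summarized]\n" if self.evicted else ""
        return header + "\n".join(self.recent)
```

A hierarchical variant would replace the bare count with an LLM-written summary of the evicted entries; the window keeps the token bound either way.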
System Memory
| Component | Memory Footprint | Notes |
|---|---|---|
| Python process | ~100-300 MB | Framework + data structures |
| Git repository | ~10-100 MB | Branches + skill folders |
| Dataset in memory | ~10-500 MB | Loaded benchmark data |
| Evaluation cache | ~10-100 MB | Cached API responses |
| Total | ~130 MB - 1 GB | Minimal by modern standards |
14 Continued Learning
Iterative Skill Accumulation
EvoSkill's core loop is inherently a continued learning process. Skills accumulate monotonically — each iteration can add new skills or refine existing ones, but never removes successful skills:
Evolution trajectory (OfficeQA example):
Iteration 0: Base agent (60.6% accuracy)
Skills: {}
Iteration 3: +data-extraction-verification
Skills: {DEV}
Accuracy: 63.4%
Iteration 7: +quantitative-analysis-methodology
Skills: {DEV, QAM}
Accuracy: 65.8%
Iteration 12: Refined DEV (strengthened table parsing rules)
Skills: {DEV', QAM}
Accuracy: 67.9%
Skill Refinement vs. Creation
The Proposer can propose two types of changes:
| Action | When Used | Outcome |
|---|---|---|
| New skill | No existing skill covers the failure mode | New skill folder added |
| Edit skill | Existing skill partially addresses failures but has gaps | Modified SKILL.md or helper scripts |
This creates a natural learn-then-refine cycle:
1. Early iterations discover broad skills (addressing common failure modes)
2. Later iterations refine existing skills (handling edge cases)
3. Occasionally, late iterations discover entirely new capabilities
Transfer Learning
EvoSkill demonstrates three forms of transfer:
1. Within-Task Transfer (Skill Merging)
Independent runs on same task → merge unique skills
Result: 67.9% > 65.8% (best individual)
2. Cross-Task Transfer (Zero-Shot)
SealQA skill → BrowseComp (no modification)
Result: 43.5% → 48.8% (+5.3%)
3. Cross-Model Transfer (Untested but Architecturally Supported)
Skills evolved with Opus → applied with Sonnet/Haiku
Architecturally supported (skills are model-agnostic text)
Empirical validation is future work
Limitations of Current Continued Learning
| Limitation | Impact | Possible Mitigation |
|---|---|---|
| No skill pruning | Skill library grows unboundedly | Relevance scoring + periodic pruning |
| No multi-task joint optimization | Skills optimized for one task at a time | Multi-objective frontier over multiple benchmarks |
| No inter-skill conflict detection | Two skills could give contradictory instructions | Consistency checking agent |
| Linear feedback history | H grows without summarization | Hierarchical summarization or sliding window |
| Single-task evaluation | Validation score measures only target benchmark | Multi-benchmark validation set |
Future Directions for Continued Learning
The paper identifies several promising directions:
- Shared skill libraries. Skills discovered by multiple users/organizations could be pooled into a community skill registry.
- Multi-modal skill evolution. Extend to tasks requiring vision, code, and language coordination.
- Cross-harness transfer. Test whether skills evolved for Claude Code transfer to Codex, OpenHands, etc.
- Hierarchical skill structures. Skills that invoke other skills, creating composable capability trees.
- Active curriculum selection. Instead of random training batch sampling, actively select the most informative failure cases for each iteration.
15 Applications
Primary Application: Agent Skill Optimization
EvoSkill's direct application is automating the creation of agent skills for any domain:
Workflow:
1. Define a benchmark dataset with (question, ground_truth) pairs
2. Implement a scoring function
3. Run EvoSkill's evolutionary loop
4. Deploy discovered skills to production agents
Result: Domain-specialized agent without fine-tuning or manual skill authoring
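Step 4 of the workflow, deploying discovered skills, amounts to copying the evolved skill folders out of the run's output directory (`.claude/skills/` in the repository tree). The helper below is a hypothetical sketch, not part of EvoSkill:

```python
import shutil
from pathlib import Path

def deploy_skills(evolved_dir: Path, target_dir: Path) -> list[str]:
    """Copy evolved skill folders (each containing a SKILL.md) from a
    run's output directory into a production agent's skills directory.
    Folders without a SKILL.md are treated as incomplete and skipped."""
    target_dir.mkdir(parents=True, exist_ok=True)
    deployed = []
    for skill in sorted(evolved_dir.iterdir()):
        if skill.is_dir() and (skill / "SKILL.md").exists():
            shutil.copytree(skill, target_dir / skill.name, dirs_exist_ok=True)
            deployed.append(skill.name)
    return deployed

# e.g. deploy_skills(Path(".claude/skills"), Path("/srv/agent/.claude/skills"))
```

Because skills are plain folders of text and scripts, deployment is a file copy; no model weights, adapters, or serving changes are involved.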
Demonstrated Applications
| Application | Benchmark | Skills Discovered | Impact |
|---|---|---|---|
| Treasury document analysis | OfficeQA | Data extraction verification, quantitative analysis methodology | +7.3% exact match |
| Search-augmented QA | SealQA | Search persistence protocol (exhaustive search before committing) | +12.1% accuracy |
| Fact-seeking browsing | BrowseComp | Transfer of search persistence protocol | +5.3% accuracy (zero-shot) |
Potential Application Domains
| Domain | Task Type | Expected Skill Categories |
|---|---|---|
| Software engineering | Bug fixing, code review | Debugging protocols, testing strategies |
| Data analysis | Table reasoning, visualization | Data validation, statistical methods |
| Scientific research | Literature review, experiment design | Citation verification, methodology checks |
| Legal analysis | Contract review, case law | Clause interpretation, precedent search |
| Medical diagnosis | Clinical decision support | Symptom verification, differential diagnosis |
| Financial analysis | Risk assessment, compliance | Regulatory checking, calculation verification |
| Customer support | Ticket resolution, escalation | Issue categorization, resolution protocols |
Qualitative Analysis of Discovered Skills
Data Extraction Verification (OfficeQA):
This skill emerged from failures where the agent extracted values from wrong table cells, a common error when parsing dense financial documents. The skill enforces:
- Adjacent cell verification (check neighboring cells aren't the intended target)
- Metric disambiguation (ensure the correct metric is selected from similar-sounding options)
- Time granularity verification (monthly vs. quarterly vs. annual figures)
- Source page confirmation (verify the extraction location)
This skill is domain-specific but pattern-general — similar verification protocols would be useful for any document extraction task, not just Treasury bulletins.
Quantitative Analysis Methodology (OfficeQA):
This skill provides structured methodology for financial calculations:
- Mandatory validation checkpoints before computation
- Prevention of common errors: wrong data transformations, date misalignment, population/sample confusion
- Risk calculation frameworks
- Currency conversion and statistical inference guidance
Search Persistence Protocol (SealQA):
The most transferable skill discovered, this protocol enforces:
- Term interpretation expansion (consider alternative phrasings)
- Multi-source verification (don't trust a single search result)
- Completeness checks (ensure all aspects of the question are addressed)
- Resistance to premature search termination
This skill transferred zero-shot to BrowseComp — a benchmark with entirely different questions — because the underlying capability (thorough search before committing) is domain-general.
Relationship to Other Systems
| System | Relationship to EvoSkill |
|---|---|
| AlphaEvolve (Google DeepMind) | Evolves code; EvoSkill evolves skills (higher abstraction) |
| GEPA (DSPy) | Evolves prompts within DSPy; EvoSkill evolves structured skill folders |
| Voyager (Minecraft) | Discovered code-based skills for an embodied agent; EvoSkill discovers text+code skills for coding agents |
| Self-Refine | Single-output refinement; EvoSkill accumulates skills across iterations |
| Feedback Descent | General optimization framework; EvoSkill applies it to skill discovery |
| Agent Skills spec | Defines the skill format; EvoSkill automates skill creation |
| DiscoGen | Generates tasks for ADA optimization; EvoSkill optimizes the agent's skill library |
| ROMA | Hierarchical multi-agent framework; EvoSkill evolves its skills |
Strategic Position in the Evolutionary AI Landscape
EvoSkill occupies a unique position at the intersection of three trends:
┌──────────────────────────────────────────────────────────────┐
│ │
│ Evolutionary Optimization Agent Skill Ecosystem │
│ (AlphaEvolve, GEPA, ←────► (Claude Code Skills, │
│ FunSearch, OpenELM) Codex Skills, ROMA) │
│ │ │ │
│ └──────────────┬─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ EvoSkill │ │
│ │ │ │
│ │ Evolutionary │ │
│ │ optimization │ │
│ │ OF agent │ │
│ │ skills │ │
│ └─────────────────┘ │
│ │ │
│ ▼ │
│ Transferable, interpretable │
│ capability improvements │
│ WITHOUT model modification │
│ │
└──────────────────────────────────────────────────────────────┘
EvoSkill's key strategic advantage: It is the only system that combines evolutionary optimization with the structured skill format, producing artifacts that are both evolved (automatically discovered through search) and portable (work across models, tasks, and harnesses). This positions it as the bridge between the evolutionary AI community (focused on search and optimization) and the agent infrastructure community (focused on capability and deployment).
Open Questions
- Skill library scaling: How many skills can an agent effectively maintain before interference or confusion?
- Skill interaction effects: Do skills ever conflict or produce negative interference?
- Convergence properties: Does the frontier converge to a fixed point, or does performance continue improving with more iterations?
- Multi-benchmark optimization: Can EvoSkill optimize skills for multiple benchmarks simultaneously?
- Compositional generalization: Can skills learned individually be composed to solve tasks requiring multiple capabilities?
- Human-in-the-loop refinement: Can human experts improve EvoSkill-discovered skills, or are they already near-optimal for their domains?
References
@misc{alzubi2026evoskill,
title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems},
author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham
and Weiyuan Chen and Tu Vu},
year={2026},
eprint={2603.02766},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.02766},
}
@misc{alzubi2026roma,
title={ROMA: Recursive Open Meta-Agent Framework for Long-Horizon
Multi-Agent Systems},
author={Salaheddin Al Zu'bi and Bala Nama and Ashish Kaz
and Aditya Eswaran and Weiyuan Chen and Shantanu Khetan
and Rajat Bala and Tu Vu and Samuel Oh},
year={2026},
eprint={2602.01848},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.01848},
}
@misc{lee2025feedbackdescent,
title={Feedback Descent: Open-Ended Text Optimization via
Pairwise Comparison},
author={Yoonsang Lee and Jarett Boen and Chelsea Finn},
year={2025},
eprint={2511.07919},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2511.07919},
}