
EvoSkill

Self-evolving framework that automatically discovers and refines reusable coding agent skills through iterative failure analysis, Pareto frontier selection, and textual feedback descent.

Organization: Sentient + Virginia Tech
Published: March 3, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: EvoSkill: Automated Skill Discovery for Multi-Agent Systems

System Name: EvoSkill

Paper: arXiv:2603.02766 (cs.AI, cs.MA)

Repository: github.com/sentient-agi/EvoSkill — Apache 2.0 License, 323 stars

Submission Date: March 3, 2026

Status: Preprint, work in progress

License: Apache 2.0

Related Work: Builds on ROMA (Recursive Open Meta-Agent Framework, arXiv:2602.01848) by the same group. Extends the Feedback Descent paradigm (arXiv:2511.07919) to skill-level optimization. Directly targets the Agent Skills specification (agentskills.io).

Positioning Statement: EvoSkill operates at a fundamentally different abstraction level than prior evolutionary AI systems. Where AlphaEvolve evolves code and GEPA evolves prompts, EvoSkill evolves skills — structured, portable, interpretable capability modules that persist across tasks, models, and agent harnesses. This makes it the first system to apply evolutionary optimization to the "skill" unit of abstraction for coding agents.


2 Authors and Team

Author Affiliation Role
Salaheddin Alzubi Sentient Lead author, framework design
Noah Provenzano Virginia Tech Benchmark evaluation, experiments
Jaydon Bingham Virginia Tech Implementation, skill builder
Weiyuan Chen Virginia Tech Evaluation infrastructure
Tu Vu Virginia Tech Supervision, research direction

Organizational Context:

  • Sentient — AI research organization focused on autonomous agent systems. Previously published ROMA (Recursive Open Meta-Agent Framework), a hierarchical agent framework that EvoSkill builds upon.
  • Virginia Tech — Provides the academic research backbone, with Tu Vu's group contributing to agent skill optimization research.

Team Size: 5 authors — a compact team, contrasting with the large collaborations typical of the agent infrastructure space. The small team size suggests focused, iterative development rather than broad infrastructure building.

Intellectual Lineage:

  • ROMA framework (Alzubi et al., 2026) → hierarchical multi-agent systems
  • Feedback Descent (Lee, Boen, Finn, 2025) → textual feedback as optimization signal
  • Self-Refine (Madaan et al., 2023) → iterative LLM self-improvement
  • Voyager (Wang et al., 2023) → skill libraries for embodied agents
  • Agent Skills specification (2025) → structured skill format standard
  • AlphaEvolve (Google DeepMind) → evolutionary code optimization (comparison target)
  • GEPA (Agrawal et al., 2026) → evolutionary prompt optimization (comparison target)


3 Core Contribution

The Problem

Coding agents (Claude Code, OpenHands, Codex) are increasingly used as general-purpose problem solvers. However, their general-purpose flexibility does not confer domain expertise. Three specific gaps:

  1. Hand-crafted skills are labor-intensive. The Agent Skills ecosystem (Claude Code skills, Codex skills) provides excellent infrastructure for using skills, but creating them requires manual domain knowledge and significant effort.
  2. Existing evolutionary approaches optimize the wrong abstraction level. AlphaEvolve optimizes codebases; GEPA optimizes prompts. Both produce artifacts tightly coupled to specific models and tasks — they don't yield reusable, transferable capabilities.
  3. No systematic failure-driven skill discovery. Agents repeatedly fail on the same categories of tasks, but there's no automated mechanism to analyze failures, propose remedies, and validate improvements.

The Solution

EvoSkill introduces evolutionary optimization at the skill level — discovering structured, reusable skill folders through iterative failure analysis:

┌──────────────────────────────────────────────────────────────────┐
│              EvoSkill: Evolutionary Skill Discovery               │
│                                                                   │
│  Input: Coding agent + benchmark dataset + scoring function       │
│  Output: Optimized skill library (structured skill folders)       │
│                                                                   │
│  ┌────────────────────────────────────────────────────────┐       │
│  │  Iteration t                                           │       │
│  │                                                        │       │
│  │  1. SELECT parent program from frontier (round-robin)  │       │
│  │  2. EVALUATE parent on training batch                  │       │
│  │  3. COLLECT failures (score < threshold τ)             │       │
│  │  4. PROPOSE skill via failure analysis (Proposer)      │       │
│  │  5. BUILD skill folder (Skill-Builder)                 │       │
│  │  6. EVALUATE candidate on validation set               │       │
│  │  7. UPDATE frontier if candidate outperforms weakest   │       │
│  │  8. APPEND to feedback history                         │       │
│  │                                                        │       │
│  │  Repeat for T iterations                               │       │
│  └────────────────────────────────────────────────────────┘       │
│                                                                   │
│  Invariant: Underlying model is FROZEN throughout                 │
│  Output: Best program in frontier G                               │
└──────────────────────────────────────────────────────────────────┘

Three Key Insights

Insight Implication
Skills are the right unit of evolution Unlike prompts or code, skills are structured, portable, and interpretable — they transfer across tasks and models
Failure analysis drives discovery Instead of random mutation, EvoSkill uses structured failure diagnosis to propose targeted improvements
Pareto frontier + feedback history prevents regression Only improvements survive; cumulative history prevents redundant proposals

Key Insight: EvoSkill's central claim is that optimizing at the skill level produces a qualitatively different kind of improvement than prompt or code optimization. Skills discovered by EvoSkill are not opaque tunings — they are interpretable, domain-relevant capabilities (e.g., "data extraction verification protocol") that can be understood, composed, and transferred.


4 Supported Solutions

Agent Harness Compatibility

EvoSkill is designed for any coding agent harness that supports structured skill folders:

Harness Compatibility Skill Format
Claude Code Primary target .claude/skills/ directories
Codex Supported .codex/skills/ directories
OpenCode Supported Compatible skill format
Any Agent Skills-compatible Supported Follows agentskills.io spec

SDK and Model Support

SDK Models Tested Notes
Claude SDK (default) Claude Opus 4.5, Sonnet, Haiku Primary evaluation platform
OpenCode SDK DeepSeek-V3, Gemini 2.0 Flash Alternative models via --sdk opencode

Benchmark Tasks

EvoSkill is evaluated on three benchmark tasks with distinct characteristics:

Benchmark Type Domain Challenge
OfficeQA Grounded reasoning U.S. Treasury Bulletins (89K pages) Dense tables, document navigation, quantitative reasoning
SealQA Search-augmented QA Open web with noisy retrieval Conflicting results, premature search termination
BrowseComp Fact-seeking browsing Web browsing with short correct answers Transfer target (zero-shot)

Evolution Modes

Mode What Evolves What's Fixed Use Case
skill_only Skill folders (SKILL.md + scripts) System prompt Primary mode — discovers reusable skills
prompt_only System prompt Skills Alternative — optimizes base instructions

Skill Anatomy

Each discovered skill conforms to the Agent Skills specification:

.claude/skills/data-extraction-verification/
├── SKILL.md          # Trigger metadata + procedural instructions
├── helpers/
│   ├── validate.py   # Helper scripts for verification
│   └── templates/    # Reference materials
└── examples/         # Usage examples (optional)

A SKILL.md file contains:

  • Name and description — human-readable identification
  • Trigger conditions — when the agent should invoke this skill
  • Instructions — step-by-step procedural guidance
  • Constraints — what not to do, common pitfalls
  • Validation — how to verify correct application
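To make the anatomy concrete, here is a minimal hypothetical SKILL.md following that five-part structure. The wording is invented for illustration — it is not the actual skill discovered by EvoSkill:

```markdown
---
name: data-extraction-verification
description: Verify extracted table values before using them in calculations
---

# Data Extraction Verification

## When to use
Trigger when answering questions that require reading numeric values
from dense tables or multi-page documents.

## Instructions
1. Locate the target table and confirm the row and column headers match the question.
2. Re-read the extracted cell a second time before computing.
3. Run helpers/validate.py on the extracted value when units are involved.

## Constraints
- Do not average across rows unless the question asks for an aggregate.
- Do not guess units; re-check the column header.

## Validation
Confirm the final answer's magnitude is plausible given neighboring cells.
```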


5 LLM Integration

Three-Agent Architecture

EvoSkill's evolutionary loop is powered by three collaborating LLM agents, each with a distinct role:

┌─────────────────────────────────────────────────────────────────┐
│                    Three-Agent System                             │
│                                                                  │
│  ┌──────────────┐                                                │
│  │  Executor (A) │  Executes tasks under current program         │
│  │               │  Read access to base repo                     │
│  │  Role: Worker │  No write access to skills                    │
│  └──────┬───────┘                                                │
│         │ Output traces, predicted answers                       │
│         ▼                                                        │
│  ┌──────────────┐                                                │
│  │  Proposer (P) │  Analyzes failures + ground truth             │
│  │               │  Reads feedback history H                     │
│  │  Role: Critic │  Proposes skill edits or new skills           │
│  │               │  Read access to base repo                     │
│  └──────┬───────┘                                                │
│         │ Textual proposal π                                     │
│         ▼                                                        │
│  ┌──────────────┐                                                │
│  │ Skill-Builder │  Materializes proposal into skill folder      │
│  │      (S)      │  Write access to skills directory             │
│  │               │  Bootstrapped with meta-skill for authoring   │
│  │  Role: Builder│  Read access to base repo                     │
│  └──────────────┘                                                │
│                                                                  │
│  Access Control:                                                 │
│  - All agents: READ access to base agent repository              │
│  - Only Skill-Builder: WRITE access to skills directory          │
│  - Proposer: maintains cumulative feedback history H             │
└─────────────────────────────────────────────────────────────────┘

Model Selection

The paper evaluates primarily with Claude Opus 4.5 as the underlying model for all three agents. The repository supports configurable model selection:

# Default: Claude Opus
python scripts/run_loop.py --model opus

# Alternatives
python scripts/run_loop.py --model sonnet
python scripts/run_loop.py --model haiku

# Via OpenCode SDK with third-party models
python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3

The Frozen Model Principle

A defining characteristic of EvoSkill is that the underlying model is never modified:

What Changes What Stays Frozen
Skill folders (SKILL.md, scripts) Model weights
System prompt (in prompt_only mode) Model architecture
Feedback history H Model API
Frontier programs G Inference parameters

This is a critical distinction from fine-tuning approaches. EvoSkill demonstrates that meaningful performance improvements can be achieved purely through skill-level optimization — essentially discovering the right "textbook" for the agent to consult, without changing the agent's underlying capabilities.

Ground Truth Usage

The Proposer receives ground-truth answers to enable root-cause diagnosis:

Ground-truth answers are provided to enable root-cause diagnosis, analogous to examining labeled misclassifications during error analysis in supervised learning, and are not propagated to the generated skills themselves.

This is an important methodological detail — the skills themselves don't contain answers or ground truth. The ground truth is used only to diagnose why the agent failed, not to embed the answers into the skills.

Progressive Context Enrichment

The Proposer maintains a cumulative feedback history H that grows across iterations:

H = [(proposal_1, score_1, verdict_1),
     (proposal_2, score_2, verdict_2),
     ...
     (proposal_t, score_t, verdict_t)]

This serves two purposes:

  1. Prevents redundant proposals — the Proposer knows what's been tried
  2. Enables refinement — the Proposer can build on partial successes and avoid repeating failures

This is directly analogous to the feedback history mechanism in Feedback Descent (Lee et al., 2025), applied to the skill discovery setting.
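A minimal sketch of how such a history might be represented and rendered into the Proposer's context. The class and function names are assumptions for illustration, not the repository's actual API:

```python
from dataclasses import dataclass

@dataclass
class FeedbackEntry:
    proposal: str   # textual skill proposal from the Proposer
    score: float    # validation score of the resulting candidate
    verdict: str    # "accepted" or "rejected" by the frontier update

def summarize_history(history: list[FeedbackEntry]) -> str:
    """Render the cumulative history as text for the Proposer's context."""
    return "\n".join(
        f"{i}. [{e.verdict}, score={e.score:.2f}] {e.proposal}"
        for i, e in enumerate(history, 1)
    )

history = [
    FeedbackEntry("Add table-verification skill", 0.63, "accepted"),
    FeedbackEntry("Add answer-rounding skill", 0.61, "rejected"),
]
print(summarize_history(history))
```

Because the rendered history grows with each iteration, the Proposer can see at a glance which directions have already been tried and with what outcome.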


6 Key Results

OfficeQA Results

Configuration 0.00% (exact) 0.10% 1.00% 5.00% 10.00%
Baseline 60.6 66.3 72.8 77.2 79.7
5% train 63.4 67.4 74.3 77.6 80.1
10% train 65.8 69.9 76.4 80.5 82.5
15% train 64.5 69.9 75.6 79.3 81.3
merge-unique 68.1 70.8 77.1 80.5 82.4

Tolerance = allowable relative error in numeric answers.

Key Findings:

  1. +7.5% exact-match improvement (60.6% → 68.1%) via the skill-merge (merge-unique) configuration
  2. Diminishing returns beyond 10% training data — 15% split performs slightly worse than 10%, suggesting mild overfitting
  3. Skill-merge outperforms all individual runs — skills from independent runs are complementary
  4. Consistent improvement across all tolerance levels — gains range from 2.7-4.5%

SealQA Results

Configuration Accuracy
Baseline 26.6%
EvoSkill (10% train) 38.7%

+12.1% absolute improvement — the largest gain across all benchmarks, attributable to the search-persistence-protocol skill that enforces exhaustive search before committing to answers.

Zero-Shot Transfer: SealQA → BrowseComp

Configuration BrowseComp Accuracy
Baseline 43.5%
+ SealQA skill (zero-shot) 48.8%

+5.3% improvement with no modification — a skill discovered on SealQA transfers directly to BrowseComp, demonstrating that EvoSkill produces generalizable capabilities rather than task-specific tunings.

Result Significance Analysis

Metric Assessment
Statistical rigor Single-run evaluation due to computational cost (Opus 4.5); variance analysis deferred to future work
Baseline fairness Baseline cross-referenced with benchmark authors' latest results
Transfer evidence Strongest contribution — zero-shot transfer provides causal evidence for generalization
Scalability Small training sets (5-15%) sufficient, suggesting efficiency
Interpretability Discovered skills are human-readable and domain-relevant

Most Important Result: The zero-shot transfer experiment is EvoSkill's strongest claim. It demonstrates that optimizing at the skill level — rather than prompts or code — produces capabilities that generalize beyond the training task. A search-persistence skill learned from SealQA helps on BrowseComp because the underlying capability (exhaustive search before committing) is domain-general.


7 Reproducibility

Open-Source Infrastructure

Component Availability Notes
Framework code GitHub (Apache 2.0) Full evolutionary loop
Python API src/api.py High-level EvoSkill() and EvalRunner()
CLI scripts scripts/ run_loop.py, run_eval.py variants
Agent profiles src/agent_profiles/ Task-specific agent configurations
Evaluation scorers src/evaluation/ Per-benchmark scoring functions
Task registration src/api.py Extensible register_task()

Reproducing Results

# Install
uv sync  # or pip install -e .

# Set API key
export ANTHROPIC_API_KEY=your-key-here

# Run the evolutionary loop on OfficeQA
python scripts/run_loop.py --mode skill_only --max-iterations 20

# Run the evolutionary loop on SealQA
python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20

# Evaluate a configuration
python scripts/run_eval.py --model opus --max-concurrent 8

Via Python API

from src.api import EvoSkill

# Run the full self-improvement loop
result = await EvoSkill(
    task="sealqa",
    model="opus",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
).run()

Reproducibility Concerns

Factor Assessment Mitigation
LLM non-determinism All three agents (Executor, Proposer, Skill-Builder) use LLMs — inherent stochasticity Git branch tracking provides full audit trail
Single-run evaluation Results reported from single runs due to cost Authors acknowledge; variance analysis is future work
Model versioning Opus 4.5 is a specific model snapshot Future model versions may yield different results
Dataset availability OfficeQA and SealQA are public; BrowseComp stratified sample details not fully specified Datasets referenced but hosting varies
Cost barrier Running Opus 4.5 at scale is expensive Limits community reproduction
Scoring functions OfficeQA uses fuzzy matching; SealQA uses LLM-graded scoring LLM-based scoring adds another layer of non-determinism

Git-Based Audit Trail

EvoSkill stores every agent program as a git branch:

main (base agent)
├── evo/iter-1/candidate-1    (+ data-extraction-verification skill)
├── evo/iter-2/candidate-1    (+ quantitative-analysis skill)
├── evo/iter-3/candidate-1    (rejected — score decreased)
└── evo/iter-4/candidate-1    (+ search-persistence-protocol)

Each branch diverges from its parent only in skill folders and metadata. This provides complete traceability: every skill, every proposal, every score is version-controlled.


8 Compute and API Costs

Cost Model

EvoSkill's cost is dominated by LLM API calls across three agents:

Cost per iteration ≈ 
    Cost(Executor on training batch)      [most expensive]
  + Cost(Proposer analyzing failures)     [moderate]
  + Cost(Skill-Builder creating skill)    [moderate]
  + Cost(Executor on validation set)      [most expensive]
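The cost model above can be sketched numerically. All token counts and the blended price are illustrative assumptions taken from the rough ranges in this section, not measured values:

```python
# Illustrative per-iteration cost estimate. PRICE_PER_MTOK and all
# default token counts are assumptions from this section's rough ranges.
PRICE_PER_MTOK = 15.0  # assumed blended $ per 1M tokens, Opus-class model

def call_cost(tokens: int) -> float:
    """Dollar cost of a single agent call at the assumed blended price."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

def iteration_cost(train_questions: int, val_questions: int,
                   exec_tokens: int = 100_000,
                   proposer_tokens: int = 35_000,
                   builder_tokens: int = 20_000) -> float:
    """Sum the four terms of the cost model: two Executor passes
    (per-question), plus one Proposer and one Skill-Builder call."""
    executor_train = train_questions * call_cost(exec_tokens)
    executor_val = val_questions * call_cost(exec_tokens)
    return (executor_train + call_cost(proposer_tokens)
            + call_cost(builder_tokens) + executor_val)

# e.g. a 5-question training batch and a ~17-item validation set
print(round(iteration_cost(5, 17), 2))
```

Even in this rough sketch, the two Executor passes dominate, which is why small training and validation splits matter so much for total cost.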

Estimated Per-Iteration Costs

Agent Estimated Tokens Cost (Opus 4.5) Notes
Executor (training) 50K-200K per question $0.50-$5.00 per question Multiple questions per batch
Proposer 20K-50K $0.30-$0.75 Analyzes failure traces
Skill-Builder 10K-30K $0.15-$0.45 Generates skill folder
Executor (validation) 50K-200K per question $0.50-$5.00 per question Full validation set (~17 items)

Total Run Costs

Configuration Iterations Estimated Total Cost
OfficeQA, 5% train ~18 (1.5 epochs) $200-$800
OfficeQA, 10% train ~36 (1.5 epochs) $400-$1,600
OfficeQA, 15% train ~54 (1.5 epochs) $600-$2,400
SealQA, 10% train ~17 (1.5 epochs) $300-$1,200
Skill-merge (3 independent runs) 3× above 3× above
Full reproduction All configurations $3,000-$15,000

Cost-Efficiency Analysis

Approach Cost to Improve by ~7% Reusability
EvoSkill (skill discovery) $200-$1,600 High — skills transfer
Fine-tuning $1,000-$10,000+ Low — model-specific
Manual skill authoring 10-40 engineer-hours Moderate — domain-specific
Prompt engineering 5-20 engineer-hours Low — task-specific

Cost Insight: EvoSkill's costs are non-trivial but competitive with alternatives. The key advantage is that discovered skills are reusable — a single run on SealQA produces skills that also improve BrowseComp for free. Amortized across multiple deployment tasks, the ROI is favorable.

Hardware Requirements

Component Requirement
Compute No GPU needed — all work is done via API calls
Storage Minimal — skill folders are small text files
Memory Standard — Python process + git operations
Network API access to Anthropic (or alternative provider)
Docker Required for LiveCodeBench evaluation sandbox


9 Architecture Solution

System Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                      EvoSkill System Architecture                      │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                     Data Layer                                    │  │
│  │                                                                   │  │
│  │  Dataset D = {(x_i, y_i)}    LLM Classifier     Stratified       │  │
│  │  ┌───────────────┐          ┌──────────┐        Partitions       │  │
│  │  │ Raw benchmark │──────────│ Cluster  │───────► Train / Val /   │  │
│  │  │ questions     │  K cats  │ into K   │        Test splits      │  │
│  │  └───────────────┘          │categories│                          │  │
│  │                              └──────────┘                          │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                              │                                         │
│                              ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                     Evolution Layer                                │  │
│  │                                                                   │  │
│  │  Frontier G = {p_1, p_2, ..., p_k}  (top-k programs)            │  │
│  │                                                                   │  │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │  │
│  │  │ Parent   │───►│ Evaluate │───►│ Collect  │───►│ Propose  │   │  │
│  │  │ Select   │    │ on Train │    │ Failures │    │ Skill    │   │  │
│  │  │(round-   │    │ Batch    │    │ (< τ)    │    │ Change   │   │  │
│  │  │ robin)   │    └──────────┘    └──────────┘    └────┬─────┘   │  │
│  │  └──────────┘                                         │          │  │
│  │       ▲                                               ▼          │  │
│  │       │         ┌──────────┐    ┌──────────┐    ┌──────────┐   │  │
│  │       │         │ Update   │◄───│ Evaluate │◄───│ Build    │   │  │
│  │       └─────────│ Frontier │    │ on Val   │    │ Skill    │   │  │
│  │                 │ (if      │    │ Set      │    │ Folder   │   │  │
│  │                 │ better)  │    └──────────┘    └──────────┘   │  │
│  │                 └──────────┘                                    │  │
│  │                                                                   │  │
│  │  Feedback History H: [(π_1,s_1), (π_2,s_2), ..., (π_t,s_t)]    │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                              │                                         │
│                              ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                     Storage Layer                                 │  │
│  │                                                                   │
│  │  Git Repository (fixed codebase)                                  │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │  │
│  │  │ main       │  │ program    │  │ program    │                 │  │
│  │  │ (base)     │  │ branch 1   │  │ branch 2   │                 │  │
│  │  │            │  │ + skills   │  │ + skills   │                 │  │
│  │  │            │  │ + metadata │  │ + metadata │                 │  │
│  │  └────────────┘  └────────────┘  └────────────┘                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘

Design Principles

1. Skill-Level Abstraction

EvoSkill's fundamental design decision is to evolve skills rather than lower-level artifacts:

Abstraction Level Example Systems Artifacts Transferability
Weights Fine-tuning, RLHF Model parameters None
Code AlphaEvolve Source code files Low
Prompts GEPA, DSPy Text strings Low
Skills EvoSkill Structured folders (SKILL.md + scripts) High

Skills are the right level because they are:

  • Structured: Metadata, instructions, and code in a defined format
  • Portable: Follow the Agent Skills specification — work across harnesses
  • Interpretable: Human-readable, auditable, modifiable
  • Composable: Multiple skills coexist and activate independently
  • Transferable: Demonstrated zero-shot transfer across benchmarks

2. Separation of Concerns

Each of the three agents has a clearly delineated responsibility:

Agent Reads Writes Responsibility
Executor Base repo + skills Nothing Execute tasks, produce traces
Proposer Traces + ground truth + history H Nothing (appends to H) Diagnose failures, propose skills
Skill-Builder Base repo + proposal π Skills directory Build concrete skill folders

This separation ensures:

  • The Executor cannot cheat by looking at ground truth
  • The Proposer cannot directly modify the agent
  • The Skill-Builder cannot see evaluation scores (only the proposal)

3. Evolutionary Selection with Memory

The frontier G and feedback history H together implement evolutionary selection with memory:

  • Frontier G: Maintains top-k programs (population management)
  • History H: Prevents redundant exploration (cumulative memory)
  • Round-robin selection: Ensures all frontier members are explored
  • Score-based replacement: Only improvements enter the frontier


10 Component Breakdown

Core Components

1. Self-Improving Loop (src/loop/)

The central evolutionary loop that orchestrates all three agents:

# Simplified Python rendering of Algorithm 1 (a sketch: proposer,
# skill_builder, collect_failures, and evaluate stand in for the three
# agents' entry points; tau, k, and T are the failure threshold,
# frontier size, and iteration budget)
H = []                                                    # feedback history
G = [(base_agent, evaluate(base_agent, validation_set))]  # frontier of (program, score)

for t in range(T):
    # Round-robin parent selection
    parent, _ = G[t % len(G)]

    # Collect failures (score < threshold tau) on the training batch
    failures = collect_failures(parent, train_batch, threshold=tau)
    if not failures:
        continue

    # Propose a skill change from failures and history
    proposal = proposer(failures, H)

    # Build the candidate program
    candidate = skill_builder(parent, proposal)

    # Evaluate on the validation set
    score = evaluate(candidate, validation_set)

    # Frontier update: admit if there is room or the candidate beats the weakest
    if len(G) < k or score > min(s for _, s in G):
        G.append((candidate, score))
        if len(G) > k:
            G.remove(min(G, key=lambda m: m[1]))

    # Record history
    H.append((proposal, score))

best_program, best_score = max(G, key=lambda m: m[1])

2. Agent Profiles (src/agent_profiles/)

Task-specific agent configurations:

src/agent_profiles/
├── __init__.py
├── base_agent/
│   ├── __init__.py
│   ├── base_agent.py      # Default agent options
│   └── prompt.txt          # Base system prompt
├── sealqa_agent/
│   ├── __init__.py
│   ├── sealqa_agent.py    # SealQA-specific options
│   └── prompt.txt          # SealQA system prompt
└── dabstep_agent/
    └── ...

Each agent profile defines:

  • System prompt
  • Available tools
  • Timeout configurations
  • Model selection

3. Evaluation System (src/evaluation/)

Modular scoring framework:

Scorer Benchmark Method
Multi-tolerance OfficeQA Fuzzy matching at 0%, 0.1%, 1%, 5%, 10% tolerance
LLM-graded SealQA GPT-based semantic equivalence judgment
Exact match BrowseComp String comparison

# Scoring interface
def score(question: str, predicted: str, ground_truth: str) -> float:
    """Return score in [0.0, 1.0]."""
    ...
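A minimal sketch of the multi-tolerance idea behind the OfficeQA scorer, implementing the interface above. The handling of commas and non-numeric answers is an assumption; the real scorer's fuzzy matching is more involved:

```python
def numeric_score(predicted: str, ground_truth: str, tolerance: float) -> float:
    """Score 1.0 if the predicted number is within a relative-error
    tolerance of the ground truth, else 0.0. Falls back to exact string
    matching for non-numeric answers (an assumption for this sketch)."""
    try:
        pred = float(predicted.replace(",", ""))
        truth = float(ground_truth.replace(",", ""))
    except ValueError:
        return 1.0 if predicted.strip() == ground_truth.strip() else 0.0
    if truth == 0:
        return 1.0 if pred == 0 else 0.0
    return 1.0 if abs(pred - truth) / abs(truth) <= tolerance else 0.0

# report at every tolerance level, as in the OfficeQA results table
for tol in (0.0, 0.001, 0.01, 0.05, 0.10):
    print(tol, numeric_score("1,234", "1236", tol))
```

A prediction of 1,234 against a ground truth of 1236 has a relative error of about 0.16%, so it fails at the 0% and 0.1% levels but passes from 1% upward — exactly the graded-credit behavior the tolerance columns report.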

4. Data Splitting (src/data/)

LLM-based stratified partitioning:

  1. Classification: LLM classifies each example into K categories
  2. Stratified split: Ensures every category appears in both train and validation
  3. Holdout: Test set is never exposed during evolution

Default ratios are configurable:

  • Train: ~10-18% (for failure detection)
  • Validation: ~7-12% (for frontier scoring)
  • Test: remaining ~70-80% (for final evaluation)
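The stratified-split step can be sketched as follows, given per-example categories already assigned by the LLM classifier. The function name and signature are assumptions, not the repository's actual API:

```python
import random
from collections import defaultdict

def stratified_split(examples, categories, train_ratio=0.10, val_ratio=0.07, seed=0):
    """Split examples so every category appears in train, val, and test.
    `categories[i]` is the LLM-assigned category of `examples[i]`;
    the classification step itself is not shown here."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for ex, cat in zip(examples, categories):
        pools[cat].append(ex)
    train, val, test = [], [], []
    for cat, pool in pools.items():
        rng.shuffle(pool)
        # at least one example per category in train and val
        n_train = max(1, int(len(pool) * train_ratio))
        n_val = max(1, int(len(pool) * val_ratio))
        train += pool[:n_train]
        val += pool[n_train:n_train + n_val]
        test += pool[n_train + n_val:]
    return train, val, test
```

Splitting per category (rather than over the whole dataset) is what guarantees the holdout invariant: even a rare category contributes at least one training failure and one validation item.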

5. Frontier Management (src/frontier/)

Git-based program versioning:

Git branch structure:
  main                          # Fixed codebase
  ├── evo/base                  # Base program (no skills)
  ├── evo/iter-1/candidate-0    # First evolved program
  ├── evo/iter-2/candidate-0    # Second evolved program
  └── evo/iter-N/candidate-0    # N-th evolved program

Each branch contains:
  .claude/skills/               # Accumulated skill folders
  .agent-metadata.json          # System prompt, lineage, score

Supporting Components

Component Location Purpose
Proposer prompts src/prompts/ Structured templates for failure analysis
Skill-Builder meta-skill src/meta_skill/ Best practices for skill authoring
Caching src/cache/ Avoid re-evaluating identical programs
Concurrency control src/concurrency/ Parallel evaluation with configurable limits
Result tracking src/results/ Score history, skill provenance
CLI scripts scripts/ Entry points for loop and evaluation


11 Core Mechanisms (Detailed)

Mechanism 1: Textual Feedback Descent for Skills

EvoSkill adapts the Feedback Descent framework (Lee, Boen, Finn, 2025) to skill discovery. The key innovation is using rich textual feedback rather than scalar rewards:

Original Feedback Descent:

Maintain frontier of top-k candidates
Each iteration:
  1. Select candidate from frontier
  2. Evaluate candidate → get textual feedback
  3. Editor LLM uses feedback to produce improved candidate
  4. New candidate enters frontier if better

EvoSkill's Adaptation:

Maintain frontier of top-k PROGRAMS (agent + skills)
Each iteration:
  1. Select program from frontier (round-robin)
  2. Execute program on tasks → collect FAILURE TRACES
  3. Proposer LLM diagnoses failures → produces SKILL PROPOSAL
  4. Skill-Builder LLM materializes proposal → SKILL FOLDER
  5. New program enters frontier if validation score improves

The critical difference: where Feedback Descent optimizes a single artifact (molecule, SVG, prompt), EvoSkill optimizes a composition of skills — the improvement is additive across iterations.

Mechanism 2: Failure-Driven Skill Discovery

The Proposer performs structured failure analysis:

Input to Proposer:
  - Execution traces (full agent conversation for each failed question)
  - Predicted answers (what the agent output)
  - Ground-truth answers (correct answers for diagnosis)
  - Feedback history H (all prior proposals and outcomes)

Proposer Analysis Steps:
  1. Review execution traces to identify WHERE the agent went wrong
  2. Classify failure mode:
     - Data extraction error (wrong cell, wrong metric)
     - Reasoning error (incorrect computation, wrong formula)
     - Search error (premature termination, missed source)
     - Comprehension error (misunderstood question intent)
  3. Check existing skills: is there a skill that SHOULD have prevented this?
     - If yes: propose EDIT to strengthen that skill
     - If no: propose NEW skill to address the gap
  4. Consult history H: has a similar proposal been tried before?
     - If tried and failed: propose different approach
     - If tried and partially worked: propose refinement
  5. Output: structured proposal π specifying skill name, trigger,
     instructions, and rationale

Mechanism 3: Pareto Frontier with Round-Robin Selection

The frontier G maintains the top-k programs:

Frontier G = {p_1, p_2, ..., p_k}  where k = frontier_size (default: 3)

Selection: round-robin cycling
  Iteration 1: parent = G[0]
  Iteration 2: parent = G[1]
  Iteration 3: parent = G[2]
  Iteration 4: parent = G[0]  (cycle restarts)
  ...

This ensures:
  - Every frontier member gets explored before any is revisited
  - No single program dominates the mutation budget
  - Diverse exploration starting points

Replacement: score-based
  if candidate_score > min(G.scores):
    G.add(candidate)
    if len(G) > k:
      G.remove(argmin(G.scores))

Why Round-Robin over Tournament/Roulette?

Round-robin selection guarantees that all frontier members receive equal exploration effort. In a small frontier (k=3), this matters: tournament selection could repeatedly select the strongest member, leading to premature convergence on a local optimum. Round-robin ensures that weaker frontier members — which may contain useful partial skills — also get a chance to be improved.
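The frontier's round-robin selection and score-based replacement can be sketched as a small class. The class and method names are assumptions for illustration:

```python
class Frontier:
    """Top-k program frontier with round-robin parent selection."""
    def __init__(self, k=3):
        self.k = k
        self.members = []   # list of (program, score) pairs
        self._cursor = 0    # round-robin position

    def select_parent(self):
        """Cycle through frontier members so each gets equal exploration."""
        program, _ = self.members[self._cursor % len(self.members)]
        self._cursor += 1
        return program

    def maybe_add(self, program, score):
        """Admit the candidate only if the frontier has room or the
        candidate beats the weakest member; return whether it entered."""
        if len(self.members) < self.k:
            self.members.append((program, score))
            return True
        worst = min(self.members, key=lambda m: m[1])
        if score > worst[1]:
            self.members.remove(worst)
            self.members.append((program, score))
            return True
        return False

f = Frontier(k=2)
f.maybe_add("base", 0.60)
f.maybe_add("base+skill_A", 0.64)
accepted = f.maybe_add("base+skill_B", 0.58)  # weaker than both: rejected
```

Note that `select_parent` ignores scores entirely — exploration effort is distributed evenly, and fitness only matters at admission time.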

Mechanism 4: Skill Composition and Accumulation

Skills accumulate across iterations — a program's skill library grows monotonically:

Iteration 0: program_0 = {system_prompt}
Iteration 1: program_1 = {system_prompt, skill_A}
Iteration 3: program_3 = {system_prompt, skill_A, skill_B}
Iteration 7: program_7 = {system_prompt, skill_A, skill_B (edited), skill_C}

The Agent Skills specification enables this through progressive disclosure:

- Metadata loading at startup: Agent reads all skill triggers — minimal context cost
- On-demand activation: Full SKILL.md is read only when trigger conditions match
- Helper execution: Scripts run in subprocess, never entering context window

This means an agent can maintain dozens of skills with negligible context overhead. The skills activate contextually, responding to the specific characteristics of each input.

Mechanism 5: Skill Merging

The paper introduces a skill-merge strategy that combines skills from independent runs:

Run 1 (5% train):  discovers skills {A, B, C}
Run 2 (10% train): discovers skills {B', D, E}
Run 3 (15% train): discovers skills {A', E', F}

Merge strategy:
  1. Identify unique skills by name/description
  2. For overlapping skills (B/B', A/A', E/E'):
     keep version from highest-scoring run
  3. Result: merged library {A', B, C, D, E, F}

This simple merging strategy outperforms any individual run (+7.3% vs best individual +5.2%), demonstrating that different training configurations surface complementary capabilities.
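A minimal sketch of this merge rule, assuming each run is summarized as a score plus a name-keyed skill dictionary (the data shapes and version labels are illustrative, not the repository's API):

```python
def merge_skill_libraries(runs):
    """runs: list of (run_score, {skill_name: skill_body}) pairs.
    Union by name; for overlaps, keep the version from the highest-scoring run."""
    merged = {}
    best_source = {}  # skill_name -> score of the run that contributed it
    for run_score, skills in runs:
        for name, body in skills.items():
            if name not in merged or run_score > best_source[name]:
                merged[name] = body
                best_source[name] = run_score
    return merged

runs = [
    (0.634, {"A": "A-v1", "B": "B-v1", "C": "C-v1"}),   # Run 1
    (0.658, {"B": "B-v2", "D": "D-v1", "E": "E-v1"}),   # Run 2
    (0.645, {"A": "A-v2", "E": "E-v2", "F": "F-v1"}),   # Run 3
]
library = merge_skill_libraries(runs)
# Overlaps resolve toward the higher-scoring run: B and E from Run 2, A from Run 3.
```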

Mechanism 6: Category-Aware Training

The data setup uses LLM-based clustering to ensure balanced exposure:

Dataset D → LLM Classifier → K categories
  Category 1: questions about debt instruments
  Category 2: questions about revenue figures
  Category 3: questions about cross-document comparison
  ...

Stratified split ensures:
  - Every category appears in train, val, and test
  - Evolution sees diverse failure modes (not just one category)
  - Validation is representative of the full distribution

Training data are organized as category-keyed pools. During evolution, batches are sampled without replacement, cycling through all examples before repeating. This ensures the Proposer sees failures from all categories, not just the most frequent ones.
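The category-keyed pools with without-replacement sampling can be sketched as below; the class and category names are illustrative, not the repository's data utilities:

```python
import random

class CategoryPool:
    """Sample batches across categories, exhausting each pool before repeating."""

    def __init__(self, pools, seed: int = 0):
        self.pools = pools                          # {category: [examples]}
        self.rng = random.Random(seed)
        self._remaining = {cat: [] for cat in pools}

    def sample_batch(self, per_category: int = 1):
        batch = []
        for cat, examples in self.pools.items():
            for _ in range(per_category):
                if not self._remaining[cat]:        # pool exhausted: refill and reshuffle
                    self._remaining[cat] = examples[:]
                    self.rng.shuffle(self._remaining[cat])
                batch.append(self._remaining[cat].pop())
        return batch

pool = CategoryPool({
    "debt_instruments": ["q1", "q2"],
    "revenue_figures": ["q3", "q4"],
    "cross_document": ["q5", "q6"],
})
batch = pool.sample_batch(per_category=1)   # one question from every category
```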


12 Programming Language

Implementation Stack

| Component | Language | Framework |
|---|---|---|
| Core framework | Python 3.12+ | asyncio, dataclasses |
| Agent interaction | Python | Claude Agent SDK / OpenCode SDK |
| Evaluation | Python | Custom scorers |
| Version control | Git | Branch-per-program storage |
| Package management | uv (recommended) or pip | pyproject.toml |
| Sandbox | Docker | For secure code execution |

Code Organization

EvoSkill/
├── src/
│   ├── api.py              # High-level Python API
│   ├── loop/               # Self-improving loop implementation
│   ├── agent_profiles/     # Per-task agent configurations
│   │   ├── base_agent/
│   │   ├── sealqa_agent/
│   │   └── dabstep_agent/
│   ├── evaluation/         # Scoring functions
│   │   ├── eval_full.py    # Full evaluation runner
│   │   ├── sealqa_scorer.py
│   │   └── multi_tolerance.py
│   ├── prompts/            # Proposer/Builder prompt templates
│   ├── schemas.py          # Pydantic data models
│   ├── frontier.py         # Git-based frontier management
│   └── data/               # Dataset splitting utilities
├── scripts/
│   ├── run_loop.py         # CLI: run evolution loop
│   ├── run_loop_sealqa.py  # CLI: run SealQA loop
│   ├── run_eval.py         # CLI: run evaluation
│   └── run_eval_sealqa.py  # CLI: run SealQA eval
├── .dataset/               # Benchmark data (user-provided)
├── .claude/skills/         # Evolved skill folders (output)
├── pyproject.toml          # Package configuration
└── uv.lock                 # Dependency lock file

API Design

EvoSkill provides both async and sync interfaces:

# Async (recommended)
result = await EvoSkill(task="sealqa").run()

# Sync (convenience wrapper)
result = EvoSkill(task="sealqa").run_sync()

# Evaluation only
summary = await EvalRunner(task="sealqa", model="sonnet").run()

# Task registration
register_task(TaskConfig(
    name="my_task",
    make_agent_options=my_options_factory,
    scorer=my_scorer,
    default_dataset=".dataset/my_data.csv",
))

Language Choice Rationale

  1. Agent SDK compatibility: Claude Code SDK and OpenCode SDK are Python-native
  2. Async-first: The evaluation loop is I/O-bound (API calls), making asyncio the right concurrency model
  3. Git integration: Python's subprocess provides clean git operation wrapping
  4. ML ecosystem: Scoring functions and data processing leverage Python's ML stack
  5. Skill output format: SKILL.md files are language-agnostic, but helper scripts are typically Python

13 Memory Management

Context Window Management

The most critical memory constraint in EvoSkill is not RAM but LLM context windows. Each agent must fit its inputs within the model's context limit:

Executor Agent:

Context budget:
  System prompt:                    ~2K tokens
  Skill triggers (all skills):     ~50 tokens × N_skills
  Active skill (on demand):        ~2K-5K tokens per skill
  Task question:                   ~1K-10K tokens
  Tool outputs:                    ~5K-50K tokens (document content)
  Agent reasoning:                 ~5K-20K tokens

Total: ~15K-80K tokens per question

Proposer Agent:

Context budget:
  System prompt + instructions:    ~3K tokens
  Failure traces (N failures):     ~5K-20K tokens per failure
  Ground-truth answers:            ~500 tokens per failure
  Feedback history H:              ~1K-5K tokens (growing)
  Existing skill inventory:        ~1K-3K tokens

Total: ~20K-100K tokens per iteration

Skill-Builder Agent:

Context budget:
  Meta-skill (authoring guidelines): ~3K tokens
  Parent program metadata:           ~1K tokens
  Proposal π:                        ~2K-5K tokens
  Base repository context:           ~5K-10K tokens

Total: ~10K-20K tokens per skill construction
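As a rough sanity check, the Executor's budget can be summed from the midpoints of the ranges above, using ~50 tokens per trigger (consistent with the progressive-disclosure figures in this section). The context limit and midpoint choices are assumptions, not measured values:

```python
CONTEXT_LIMIT = 200_000  # assumed model context window, not from the paper

def executor_budget(n_skills: int, active_skills: int = 1) -> int:
    """Sum midpoint token estimates for one Executor question."""
    tokens = {
        "system_prompt": 2_000,
        "skill_triggers": 50 * n_skills,          # ~50 tokens per trigger
        "active_skills": 3_500 * active_skills,   # midpoint of 2K-5K
        "question": 5_500,                        # midpoint of 1K-10K
        "tool_outputs": 27_500,                   # midpoint of 5K-50K
        "reasoning": 12_500,                      # midpoint of 5K-20K
    }
    return sum(tokens.values())

total = executor_budget(n_skills=3)               # 51,150 tokens
assert total < CONTEXT_LIMIT                      # comfortably within the limit
```

Even at 50 skills, the trigger metadata adds only ~2,500 tokens, which is why the total stays dominated by tool outputs rather than the skill library.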

Progressive Disclosure for Skill Loading

The Agent Skills specification's progressive disclosure model is essential for EvoSkill's scalability:

Startup: Load ALL skill triggers
  ┌────────────────────────────────────┐
  │ data-extraction-verification       │  ~50 tokens
  │ trigger: "extracting table data"   │
  ├────────────────────────────────────┤
  │ quantitative-analysis              │  ~50 tokens
  │ trigger: "financial calculation"   │
  ├────────────────────────────────────┤
  │ search-persistence-protocol        │  ~50 tokens
  │ trigger: "searching for answers"   │
  └────────────────────────────────────┘
  Total: ~150 tokens for 3 skills

Activation: Load ONLY matching skill
  ┌────────────────────────────────────┐
  │ data-extraction-verification       │
  │ [Full SKILL.md: ~2000 tokens]      │
  │ [Helper scripts: executed, not     │
  │  loaded into context]              │
  └────────────────────────────────────┘
  Total: ~2000 tokens (one skill)

This means EvoSkill can accumulate 50+ skills with a startup cost of only ~2500 tokens. Only the relevant skill(s) for a given question consume full context.
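The two phases above can be sketched as follows, with naive substring matching standing in for the harness's real trigger-matching logic (the paths and dictionary structure are illustrative):

```python
# Skill metadata registry; bodies stay on disk until a trigger matches.
skills = {
    "data-extraction-verification": {
        "trigger": "extracting table data",
        "body_path": "skills/data-extraction-verification/SKILL.md",
    },
    "search-persistence-protocol": {
        "trigger": "searching for answers",
        "body_path": "skills/search-persistence-protocol/SKILL.md",
    },
}

def startup_context(skills):
    """Cheap metadata: ~50 tokens per skill, loaded for ALL skills at startup."""
    return [(name, meta["trigger"]) for name, meta in skills.items()]

def activate(skills, task_description):
    """Expensive bodies: return paths only for skills whose trigger matches."""
    return [
        meta["body_path"]
        for meta in skills.values()
        if meta["trigger"] in task_description
    ]

paths = activate(skills, "You are searching for answers about bond yields.")
```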

Feedback History Growth

The feedback history H grows linearly with iterations:

After 20 iterations:
  H ≈ 20 × (proposal summary + score + verdict)
    ≈ 20 × 200 tokens
    ≈ 4000 tokens

After 100 iterations:
  H ≈ 100 × 200 tokens
    ≈ 20,000 tokens

For long runs, H may need summarization or sliding-window management. The paper's experiments (≤20 iterations) stay well within manageable limits.
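One possible sliding-window mitigation, assuming the report's ~200 tokens per history entry (the function and token budget are illustrative, not the paper's implementation):

```python
def trim_history(history, max_tokens: int = 4000, tokens_per_entry: int = 200):
    """Keep only the most recent entries that fit within the token budget."""
    max_entries = max_tokens // tokens_per_entry
    return history[-max_entries:]

# 100 iterations of feedback, trimmed to the 4000 / 200 = 20 most recent
history = [f"iter {i}: proposal summary, score, verdict" for i in range(100)]
window = trim_history(history)
```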

System Memory

| Component | Memory Footprint | Notes |
|---|---|---|
| Python process | ~100-300 MB | Framework + data structures |
| Git repository | ~10-100 MB | Branches + skill folders |
| Dataset in memory | ~10-500 MB | Loaded benchmark data |
| Evaluation cache | ~10-100 MB | Cached API responses |
| Total | ~130 MB - 1 GB | Minimal by modern standards |

14 Continued Learning

Iterative Skill Accumulation

EvoSkill's core loop is inherently a continued learning process. Skills accumulate monotonically — each iteration can add new skills or refine existing ones, but never removes successful skills:

Evolution trajectory (OfficeQA example):

Iteration 0:  Base agent (60.6% accuracy)
              Skills: {}

Iteration 3:  +data-extraction-verification
              Skills: {DEV}
              Accuracy: 63.4%

Iteration 7:  +quantitative-analysis-methodology
              Skills: {DEV, QAM}
              Accuracy: 65.8%

Iteration 12: Refined DEV (strengthened table parsing rules)
              Skills: {DEV', QAM}
              Accuracy: 67.9%

Skill Refinement vs. Creation

The Proposer can propose two types of changes:

| Action | When Used | Outcome |
|---|---|---|
| New skill | No existing skill covers the failure mode | New skill folder added |
| Edit skill | Existing skill partially addresses failures but has gaps | Modified SKILL.md or helper scripts |

This creates a natural learn-then-refine cycle:

1. Early iterations discover broad skills (addressing common failure modes)
2. Later iterations refine existing skills (handling edge cases)
3. Occasionally, late iterations discover entirely new capabilities

Transfer Learning

EvoSkill demonstrates three forms of transfer:

1. Within-Task Transfer (Skill Merging)

Independent runs on same task → merge unique skills
Result: 67.9% > 65.8% (best individual)

2. Cross-Task Transfer (Zero-Shot)

SealQA skill → BrowseComp (no modification)
Result: 43.5% → 48.8% (+5.3%)

3. Cross-Model Transfer (Untested but Architecturally Supported)

Skills evolved with Opus → applied with Sonnet/Haiku
Architecturally supported (skills are model-agnostic text)
Empirical validation is future work

Limitations of Current Continued Learning

| Limitation | Impact | Possible Mitigation |
|---|---|---|
| No skill pruning | Skill library grows unboundedly | Relevance scoring + periodic pruning |
| No multi-task joint optimization | Skills optimized for one task at a time | Multi-objective frontier over multiple benchmarks |
| No inter-skill conflict detection | Two skills could give contradictory instructions | Consistency checking agent |
| Linear feedback history | H grows without summarization | Hierarchical summarization or sliding window |
| Single-task evaluation | Validation score measures only target benchmark | Multi-benchmark validation set |

Future Directions for Continued Learning

The paper identifies several promising directions:

  1. Shared skill libraries. Skills discovered by multiple users/organizations could be pooled into a community skill registry.
  2. Multi-modal skill evolution. Extend to tasks requiring vision, code, and language coordination.
  3. Cross-harness transfer. Test whether skills evolved for Claude Code transfer to Codex, OpenHands, etc.
  4. Hierarchical skill structures. Skills that invoke other skills, creating composable capability trees.
  5. Active curriculum selection. Instead of random training batch sampling, actively select the most informative failure cases for each iteration.

15 Applications

Primary Application: Agent Skill Optimization

EvoSkill's direct application is automating the creation of agent skills for any domain:

Workflow:
  1. Define a benchmark dataset with (question, ground_truth) pairs
  2. Implement a scoring function
  3. Run EvoSkill's evolutionary loop
  4. Deploy discovered skills to production agents

Result: Domain-specialized agent without fine-tuning or manual skill authoring
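Step 2 of the workflow requires only a function mapping (prediction, ground_truth) to a score. A minimal exact-match scorer sketch (the normalization rules are illustrative, not the repository's scorer):

```python
def exact_match_scorer(prediction: str, ground_truth: str) -> float:
    """Return 1.0 on a case/whitespace-insensitive exact match, else 0.0."""
    def normalize(s: str) -> str:
        # Lowercase and collapse all runs of whitespace (split() also strips).
        return " ".join(s.lower().split())
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0

score = exact_match_scorer("  $4.2 Billion ", "$4.2 billion")   # 1.0
```

A scorer like this would then be passed to `register_task(...)` as shown in the API Design section above.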

Demonstrated Applications

| Application | Benchmark | Skills Discovered | Impact |
|---|---|---|---|
| Treasury document analysis | OfficeQA | Data extraction verification, quantitative analysis methodology | +7.3% exact match |
| Search-augmented QA | SealQA | Search persistence protocol (exhaustive search before committing) | +12.1% accuracy |
| Fact-seeking browsing | BrowseComp | Transfer of search persistence protocol | +5.3% accuracy (zero-shot) |

Potential Application Domains

| Domain | Task Type | Expected Skill Categories |
|---|---|---|
| Software engineering | Bug fixing, code review | Debugging protocols, testing strategies |
| Data analysis | Table reasoning, visualization | Data validation, statistical methods |
| Scientific research | Literature review, experiment design | Citation verification, methodology checks |
| Legal analysis | Contract review, case law | Clause interpretation, precedent search |
| Medical diagnosis | Clinical decision support | Symptom verification, differential diagnosis |
| Financial analysis | Risk assessment, compliance | Regulatory checking, calculation verification |
| Customer support | Ticket resolution, escalation | Issue categorization, resolution protocols |

Qualitative Analysis of Discovered Skills

Data Extraction Verification (OfficeQA):

This skill emerged from failures where the agent extracted values from wrong table cells — a common error when parsing dense financial documents. The skill enforces:

- Adjacent cell verification (check neighboring cells aren't the intended target)
- Metric disambiguation (ensure the correct metric is selected from similar-sounding options)
- Time granularity verification (monthly vs. quarterly vs. annual figures)
- Source page confirmation (verify the extraction location)

This skill is domain-specific but pattern-general — similar verification protocols would be useful for any document extraction task, not just Treasury bulletins.

Quantitative Analysis Methodology (OfficeQA):

This skill provides structured methodology for financial calculations:

- Mandatory validation checkpoints before computation
- Prevention of common errors: wrong data transformations, date misalignment, population/sample confusion
- Risk calculation frameworks
- Currency conversion and statistical inference guidance

Search Persistence Protocol (SealQA):

The most transferable skill discovered, this protocol enforces:

- Term interpretation expansion (consider alternative phrasings)
- Multi-source verification (don't trust a single search result)
- Completeness checks (ensure all aspects of the question are addressed)
- Resistance to premature search termination

This skill transferred zero-shot to BrowseComp — a benchmark with entirely different questions — because the underlying capability (thorough search before committing) is domain-general.

Relationship to Other Systems

| System | Relationship to EvoSkill |
|---|---|
| AlphaEvolve (Google DeepMind) | Evolves code; EvoSkill evolves skills (higher abstraction) |
| GEPA (DSPy) | Evolves prompts within DSPy; EvoSkill evolves structured skill folders |
| Voyager (Minecraft) | Discovered code-based skills for an embodied agent; EvoSkill discovers text+code skills for coding agents |
| Self-Refine | Single-output refinement; EvoSkill accumulates skills across iterations |
| Feedback Descent | General optimization framework; EvoSkill applies it to skill discovery |
| Agent Skills spec | Defines the skill format; EvoSkill automates skill creation |
| DiscoGen | Generates tasks for ADA optimization; EvoSkill optimizes the agent's skill library |
| ROMA | Hierarchical multi-agent framework; EvoSkill evolves its skills |

Strategic Position in the Evolutionary AI Landscape

EvoSkill occupies a unique position at the intersection of three trends:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Evolutionary Optimization          Agent Skill Ecosystem    │
│  (AlphaEvolve, GEPA,       ←────►   (Claude Code Skills,     │
│   FunSearch, OpenELM)                Codex Skills, ROMA)     │
│         │                                    │               │
│         └──────────────┬─────────────────────┘               │
│                        │                                     │
│                        ▼                                     │
│              ┌─────────────────┐                             │
│              │    EvoSkill     │                             │
│              │                 │                             │
│              │  Evolutionary   │                             │
│              │  optimization   │                             │
│              │  OF agent       │                             │
│              │  skills         │                             │
│              └─────────────────┘                             │
│                        │                                     │
│                        ▼                                     │
│              Transferable, interpretable                     │
│              capability improvements                         │
│              WITHOUT model modification                      │
│                                                              │
└──────────────────────────────────────────────────────────────┘

EvoSkill's key strategic advantage: It is the only system that combines evolutionary optimization with the structured skill format, producing artifacts that are both evolved (automatically discovered through search) and portable (work across models, tasks, and harnesses). This positions it as the bridge between the evolutionary AI community (focused on search and optimization) and the agent infrastructure community (focused on capability and deployment).

Open Questions

  1. Skill library scaling: How many skills can an agent effectively maintain before interference or confusion?
  2. Skill interaction effects: Do skills ever conflict or produce negative interference?
  3. Convergence properties: Does the frontier converge to a fixed point, or does performance continue improving with more iterations?
  4. Multi-benchmark optimization: Can EvoSkill optimize skills for multiple benchmarks simultaneously?
  5. Compositional generalization: Can skills learned individually be composed to solve tasks requiring multiple capabilities?
  6. Human-in-the-loop refinement: Can human experts improve EvoSkill-discovered skills, or are they already near-optimal for their domains?

References

@misc{alzubi2026evoskill,
  title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems},
  author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham 
          and Weiyuan Chen and Tu Vu},
  year={2026},
  eprint={2603.02766},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02766},
}

@misc{alzubi2026roma,
  title={ROMA: Recursive Open Meta-Agent Framework for Long-Horizon 
         Multi-Agent Systems},
  author={Salaheddin Al Zu'bi and Bala Nama and Ashish Kaz 
          and Aditya Eswaran and Weiyuan Chen and Shantanu Khetan 
          and Rajat Bala and Tu Vu and Samuel Oh},
  year={2026},
  eprint={2602.01848},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.01848},
}

@misc{lee2025feedbackdescent,
  title={Feedback Descent: Open-Ended Text Optimization via 
         Pairwise Comparison},
  author={Yoonsang Lee and Jarett Boen and Chelsea Finn},
  year={2025},
  eprint={2511.07919},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2511.07919},
}