EvoSkill
Self-evolving framework that automatically discovers and refines reusable coding agent skills through iterative failure analysis, Pareto frontier selection, and textual feedback descent.
Organization: Sentient + Virginia Tech
Published: March 3, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: EvoSkill: Automated Skill Discovery for Multi-Agent Systems
System Name: EvoSkill
Paper: arXiv:2603.02766 (cs.AI, cs.MA)
Repository: github.com/sentient-agi/EvoSkill — Apache 2.0 License, 323 stars
Submission Date: March 3, 2026
Status: Preprint, work in progress
License: Apache 2.0
Related Work: Builds on ROMA (Recursive Open Meta-Agent Framework, arXiv:2602.01848) by the same group. Extends the Feedback Descent paradigm (arXiv:2511.07919) to skill-level optimization. Directly targets the Agent Skills specification (agentskills.io).
Positioning Statement: EvoSkill operates at a fundamentally different abstraction level than prior evolutionary AI systems. Where AlphaEvolve evolves code and GEPA evolves prompts, EvoSkill evolves skills — structured, portable, interpretable capability modules that persist across tasks, models, and agent harnesses. This makes it the first system to apply evolutionary optimization to the "skill" unit of abstraction for coding agents.
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Salaheddin Alzubi | Sentient | Lead author, framework design |
| Noah Provenzano | Virginia Tech | Benchmark evaluation, experiments |
| Jaydon Bingham | Virginia Tech | Implementation, skill builder |
| Weiyuan Chen | Virginia Tech | Evaluation infrastructure |
| Tu Vu | Virginia Tech | Supervision, research direction |
Organizational Context:
- Sentient — AI research organization focused on autonomous agent systems. Previously published ROMA (Recursive Open Meta-Agent Framework), a hierarchical agent framework that EvoSkill builds upon.
- Virginia Tech — Provides the academic research backbone, with Tu Vu's group contributing to agent skill optimization research.
Team Size: 5 authors — a compact team, contrasting with the large collaborations typical of the agent infrastructure space. The small team size suggests focused, iterative development rather than broad infrastructure building.
Intellectual Lineage:
- ROMA framework (Alzubi et al., 2026) → hierarchical multi-agent systems
- Feedback Descent (Lee, Boen, Finn, 2025) → textual feedback as optimization signal
- Self-Refine (Madaan et al., 2023) → iterative LLM self-improvement
- Voyager (Wang et al., 2023) → skill libraries for embodied agents
- Agent Skills specification (2025) → structured skill format standard
- AlphaEvolve (Google DeepMind) → evolutionary code optimization (comparison target)
- GEPA (Agrawal et al., 2026) → evolutionary prompt optimization (comparison target)
3 Core Contribution
The Problem
Coding agents (Claude Code, OpenHands, Codex) are increasingly used as general-purpose problem solvers. However, their general-purpose flexibility does not confer domain expertise. Three specific gaps:
- Hand-crafted skills are labor-intensive. The Agent Skills ecosystem (Claude Code skills, Codex skills) provides excellent infrastructure for using skills, but creating them requires manual domain knowledge and significant effort.
- Existing evolutionary approaches optimize the wrong abstraction level. AlphaEvolve optimizes codebases; GEPA optimizes prompts. Both produce artifacts tightly coupled to specific models and tasks — they don't yield reusable, transferable capabilities.
- No systematic failure-driven skill discovery. Agents repeatedly fail on the same categories of tasks, but there's no automated mechanism to analyze failures, propose remedies, and validate improvements.
The Solution
EvoSkill introduces evolutionary optimization at the skill level — discovering structured, reusable skill folders through iterative failure analysis:
┌──────────────────────────────────────────────────────────────────┐
│ EvoSkill: Evolutionary Skill Discovery │
│ │
│ Input: Coding agent + benchmark dataset + scoring function │
│ Output: Optimized skill library (structured skill folders) │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Iteration t │ │
│ │ │ │
│ │ 1. SELECT parent program from frontier (round-robin) │ │
│ │ 2. EVALUATE parent on training batch │ │
│ │ 3. COLLECT failures (score < threshold τ) │ │
│ │ 4. PROPOSE skill via failure analysis (Proposer) │ │
│ │ 5. BUILD skill folder (Skill-Builder) │ │
│ │ 6. EVALUATE candidate on validation set │ │
│ │ 7. UPDATE frontier if candidate outperforms weakest │ │
│ │ 8. APPEND to feedback history │ │
│ │ │ │
│ │ Repeat for T iterations │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Invariant: Underlying model is FROZEN throughout │
│ Output: Best program in frontier G │
└──────────────────────────────────────────────────────────────────┘
Three Key Insights
| Insight | Implication |
|---|---|
| Skills are the right unit of evolution | Unlike prompts or code, skills are structured, portable, and interpretable — they transfer across tasks and models |
| Failure analysis drives discovery | Instead of random mutation, EvoSkill uses structured failure diagnosis to propose targeted improvements |
| Pareto frontier + feedback history prevents regression | Only improvements survive; cumulative history prevents redundant proposals |
Key Insight: EvoSkill's central claim is that optimizing at the skill level produces a qualitatively different kind of improvement than prompt or code optimization. Skills discovered by EvoSkill are not opaque tunings — they are interpretable, domain-relevant capabilities (e.g., "data extraction verification protocol") that can be understood, composed, and transferred.
4 Supported Solutions
Agent Harness Compatibility
EvoSkill is designed for any coding agent harness that supports structured skill folders:
| Harness | Compatibility | Skill Format |
|---|---|---|
| Claude Code | Primary target | .claude/skills/ directories |
| Codex | Supported | .codex/skills/ directories |
| OpenCode | Supported | Compatible skill format |
| Any Agent Skills-compatible | Supported | Follows agentskills.io spec |
SDK and Model Support
| SDK | Models Tested | Notes |
|---|---|---|
| Claude SDK (default) | Claude Opus 4.5, Sonnet, Haiku | Primary evaluation platform |
| OpenCode SDK | DeepSeek-V3, Gemini 2.0 Flash | Alternative models via --sdk opencode |
Benchmark Tasks
EvoSkill is evaluated on three benchmark tasks with distinct characteristics:
| Benchmark | Type | Domain | Challenge |
|---|---|---|---|
| OfficeQA | Grounded reasoning | U.S. Treasury Bulletins (89K pages) | Dense tables, document navigation, quantitative reasoning |
| SealQA | Search-augmented QA | Open web with noisy retrieval | Conflicting results, premature search termination |
| BrowseComp | Fact-seeking browsing | Web browsing with short correct answers | Transfer target (zero-shot) |
Evolution Modes
| Mode | What Evolves | What's Fixed | Use Case |
|---|---|---|---|
| skill_only | Skill folders (SKILL.md + scripts) | System prompt | Primary mode — discovers reusable skills |
| prompt_only | System prompt | Skills | Alternative — optimizes base instructions |
Skill Anatomy
Each discovered skill conforms to the Agent Skills specification:
.claude/skills/data-extraction-verification/
├── SKILL.md # Trigger metadata + procedural instructions
├── helpers/
│ ├── validate.py # Helper scripts for verification
│ └── templates/ # Reference materials
└── examples/ # Usage examples (optional)
A SKILL.md file contains:
- Name and description — human-readable identification
- Trigger conditions — when the agent should invoke this skill
- Instructions — step-by-step procedural guidance
- Constraints — what not to do, common pitfalls
- Validation — how to verify correct application
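For illustration, the anatomy above can be scaffolded programmatically. The following is a minimal sketch — `scaffold_skill` and the template fields are hypothetical helpers, not part of the released repository:

```python
from pathlib import Path

# Hypothetical SKILL.md template mirroring the fields listed above.
SKILL_MD_TEMPLATE = """\
---
name: {name}
description: {description}
---

## Trigger conditions
{trigger}

## Instructions
{instructions}
"""

def scaffold_skill(root: Path, name: str, description: str,
                   trigger: str, instructions: str) -> Path:
    """Create a minimal skill folder: SKILL.md plus an empty helpers/ dir."""
    skill_dir = root / name
    (skill_dir / "helpers").mkdir(parents=True, exist_ok=True)
    skill_md = skill_dir / "SKILL.md"
    skill_md.write_text(SKILL_MD_TEMPLATE.format(
        name=name, description=description,
        trigger=trigger, instructions=instructions))
    return skill_md
```

In EvoSkill this materialization step is performed by the Skill-Builder agent rather than a fixed template.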
5 LLM Integration
Three-Agent Architecture
EvoSkill's evolutionary loop is powered by three collaborating LLM agents, each with a distinct role:
┌─────────────────────────────────────────────────────────────────┐
│ Three-Agent System │
│ │
│ ┌──────────────┐ │
│ │ Executor (A) │ Executes tasks under current program │
│ │ │ Read access to base repo │
│ │ Role: Worker │ No write access to skills │
│ └──────┬───────┘ │
│ │ Output traces, predicted answers │
│ ▼ │
│ ┌──────────────┐ │
│ │ Proposer (P) │ Analyzes failures + ground truth │
│ │ │ Reads feedback history H │
│ │ Role: Critic │ Proposes skill edits or new skills │
│ │ │ Read access to base repo │
│ └──────┬───────┘ │
│ │ Textual proposal π │
│ ▼ │
│ ┌──────────────┐ │
│ │ Skill-Builder │ Materializes proposal into skill folder │
│ │ (S) │ Write access to skills directory │
│ │ │ Bootstrapped with meta-skill for authoring │
│ │ Role: Builder│ Read access to base repo │
│ └──────────────┘ │
│ │
│ Access Control: │
│ - All agents: READ access to base agent repository │
│ - Only Skill-Builder: WRITE access to skills directory │
│ - Proposer: maintains cumulative feedback history H │
└─────────────────────────────────────────────────────────────────┘
Model Selection
The paper evaluates primarily with Claude Opus 4.5 as the underlying model for all three agents. The repository supports configurable model selection:
# Default: Claude Opus
python scripts/run_loop.py --model opus
# Alternatives
python scripts/run_loop.py --model sonnet
python scripts/run_loop.py --model haiku
# Via OpenCode SDK with third-party models
python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3
The Frozen Model Principle
A defining characteristic of EvoSkill is that the underlying model is never modified:
| What Changes | What Stays Frozen |
|---|---|
| Skill folders (SKILL.md, scripts) | Model weights |
| System prompt (in prompt_only mode) | Model architecture |
| Feedback history H | Model API |
| Frontier programs G | Inference parameters |
This is a critical distinction from fine-tuning approaches. EvoSkill demonstrates that meaningful performance improvements can be achieved purely through skill-level optimization — essentially discovering the right "textbook" for the agent to consult, without changing the agent's underlying capabilities.
Ground Truth Usage
The Proposer receives ground-truth answers to enable root-cause diagnosis:
Ground-truth answers are provided to enable root-cause diagnosis, analogous to examining labeled misclassifications during error analysis in supervised learning, and are not propagated to the generated skills themselves.
This is an important methodological detail — the skills themselves don't contain answers or ground truth. The ground truth is used only to diagnose why the agent failed, not to embed the answers into the skills.
Progressive Context Enrichment
The Proposer maintains a cumulative feedback history H that grows across iterations:
H = [(proposal_1, score_1, verdict_1),
(proposal_2, score_2, verdict_2),
...
(proposal_t, score_t, verdict_t)]
This serves two purposes:
1. Prevents redundant proposals — the Proposer knows what has already been tried
2. Enables refinement — the Proposer can build on partial successes and avoid repeating failures
This is directly analogous to the feedback history mechanism in Feedback Descent (Lee et al., 2025), applied to the skill discovery setting.
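A minimal sketch of how such a history might be represented and serialized into the Proposer's context. The `HistoryEntry` fields and `format_history` helper are assumptions for illustration, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass
class HistoryEntry:
    proposal: str   # short summary of the proposed skill change
    score: float    # validation score of the resulting candidate
    verdict: str    # "accepted" or "rejected" by the frontier update

def format_history(history: list[HistoryEntry]) -> str:
    """Render H as a compact context block so the Proposer can avoid
    re-proposing ideas that already failed and refine ones that helped."""
    if not history:
        return "No prior proposals."
    lines = [f"{i}. [{e.verdict}, score={e.score:.3f}] {e.proposal}"
             for i, e in enumerate(history, start=1)]
    return "\n".join(lines)
```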
6 Key Results
OfficeQA Results
| Configuration | 0.00% (exact) | 0.10% | 1.00% | 5.00% | 10.00% |
|---|---|---|---|---|---|
| Baseline | 60.6 | 66.3 | 72.8 | 77.2 | 79.7 |
| 5% train | 63.4 | 67.4 | 74.3 | 77.6 | 80.1 |
| 10% train | 65.8 | 69.9 | 76.4 | 80.5 | 82.5 |
| 15% train | 64.5 | 69.9 | 75.6 | 79.3 | 81.3 |
| merge-unique | 68.1 | 70.8 | 77.1 | 80.5 | 82.4 |
Tolerance = allowable relative error in numeric answers.
Key Findings:
- +7.5% exact-match improvement (60.6% → 68.1%) via the merge-unique skill-merge configuration
- Diminishing returns beyond 10% training data — 15% split performs slightly worse than 10%, suggesting mild overfitting
- Skill-merge outperforms all individual runs — skills from independent runs are complementary
- Consistent improvement across all tolerance levels — gains at nonzero tolerances range from 2.7% to 4.5%
SealQA Results
| Configuration | Accuracy |
|---|---|
| Baseline | 26.6% |
| EvoSkill (10% train) | 38.7% |
+12.1% absolute improvement — the largest gain across all benchmarks, attributable to the search-persistence-protocol skill that enforces exhaustive search before committing to answers.
Zero-Shot Transfer: SealQA → BrowseComp
| Configuration | BrowseComp Accuracy |
|---|---|
| Baseline | 43.5% |
| + SealQA skill (zero-shot) | 48.8% |
+5.3% improvement with no modification — a skill discovered on SealQA transfers directly to BrowseComp, demonstrating that EvoSkill produces generalizable capabilities rather than task-specific tunings.
Result Significance Analysis
| Metric | Assessment |
|---|---|
| Statistical rigor | Single-run evaluation due to computational cost (Opus 4.5); variance analysis deferred to future work |
| Baseline fairness | Baseline cross-referenced with benchmark authors' latest results |
| Transfer evidence | Strongest contribution — zero-shot transfer provides causal evidence for generalization |
| Scalability | Small training sets (5-15%) sufficient, suggesting efficiency |
| Interpretability | Discovered skills are human-readable and domain-relevant |
Most Important Result: The zero-shot transfer experiment is EvoSkill's strongest claim. It demonstrates that optimizing at the skill level — rather than prompts or code — produces capabilities that generalize beyond the training task. A search-persistence skill learned from SealQA helps on BrowseComp because the underlying capability (exhaustive search before committing) is domain-general.
7 Reproducibility
Open-Source Infrastructure
| Component | Availability | Notes |
|---|---|---|
| Framework code | GitHub (Apache 2.0) | Full evolutionary loop |
| Python API | src/api.py | High-level EvoSkill() and EvalRunner() |
| CLI scripts | scripts/ | run_loop.py, run_eval.py variants |
| Agent profiles | src/agent_profiles/ | Task-specific agent configurations |
| Evaluation scorers | src/evaluation/ | Per-benchmark scoring functions |
| Task registration | src/api.py | Extensible register_task() |
Reproducing Results
# Install
uv sync # or pip install -e .
# Set API key
export ANTHROPIC_API_KEY=your-key-here
# Run the evolutionary loop on OfficeQA
python scripts/run_loop.py --mode skill_only --max-iterations 20
# Run the evolutionary loop on SealQA
python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20
# Evaluate a configuration
python scripts/run_eval.py --model opus --max-concurrent 8
Via Python API
from src.api import EvoSkill

# Run the full self-improvement loop
result = await EvoSkill(
    task="sealqa",
    model="opus",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
).run()
Reproducibility Concerns
| Factor | Assessment | Mitigation |
|---|---|---|
| LLM non-determinism | All three agents (Executor, Proposer, Skill-Builder) use LLMs — inherent stochasticity | Git branch tracking provides full audit trail |
| Single-run evaluation | Results reported from single runs due to cost | Authors acknowledge; variance analysis is future work |
| Model versioning | Opus 4.5 is a specific model snapshot | Future model versions may yield different results |
| Dataset availability | OfficeQA and SealQA are public; BrowseComp stratified sample details not fully specified | Datasets referenced but hosting varies |
| Cost barrier | Running Opus 4.5 at scale is expensive | Limits community reproduction |
| Scoring functions | OfficeQA uses fuzzy matching; SealQA uses LLM-graded scoring | LLM-based scoring adds another layer of non-determinism |
Git-Based Audit Trail
EvoSkill stores every agent program as a git branch:
main (base agent)
├── evo/iter-1/candidate-1 (+ data-extraction-verification skill)
├── evo/iter-2/candidate-1 (+ quantitative-analysis skill)
├── evo/iter-3/candidate-1 (rejected — score decreased)
└── evo/iter-4/candidate-1 (+ search-persistence-protocol)
Each branch diverges from its parent only in skill folders and metadata. This provides complete traceability: every skill, every proposal, every score is version-controlled.
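A sketch of how branch-per-candidate bookkeeping might look. The `branch_commands` helper is hypothetical; it only constructs the git invocations (it does not execute them), following the `evo/iter-t/candidate-i` naming shown above:

```python
def candidate_branch(iteration: int, candidate: int) -> str:
    """Branch name for a candidate program, per the evo/iter-t/candidate-i scheme."""
    return f"evo/iter-{iteration}/candidate-{candidate}"

def branch_commands(parent: str, iteration: int, candidate: int,
                    message: str) -> list[list[str]]:
    """Git commands that snapshot a candidate as a branch diverging from
    its parent only in skill folders and metadata."""
    branch = candidate_branch(iteration, candidate)
    return [
        ["git", "checkout", "-b", branch, parent],
        ["git", "add", ".claude/skills", ".agent-metadata.json"],
        ["git", "commit", "-m", message],
    ]
```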
8 Compute and API Costs
Cost Model
EvoSkill's cost is dominated by LLM API calls across three agents:
Cost per iteration ≈
Cost(Executor on training batch) [most expensive]
+ Cost(Proposer analyzing failures) [moderate]
+ Cost(Skill-Builder creating skill) [moderate]
+ Cost(Executor on validation set) [most expensive]
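The cost model above can be turned into a back-of-envelope estimator. All dollar figures here are illustrative midpoints, not measured costs:

```python
def iteration_cost(n_train_questions: int, n_val_questions: int,
                   exec_cost_per_question: float = 2.0,
                   proposer_cost: float = 0.5,
                   builder_cost: float = 0.3) -> float:
    """Rough per-iteration API cost in dollars. Executor calls dominate
    because every training and validation question is a full agent run."""
    executor = (n_train_questions + n_val_questions) * exec_cost_per_question
    return executor + proposer_cost + builder_cost
```

For example, 10 training questions plus a ~17-item validation set at an assumed $2 per question gives roughly $54.80 per iteration, consistent with the per-run totals below.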
Estimated Per-Iteration Costs
| Agent | Estimated Tokens | Cost (Opus 4.5) | Notes |
|---|---|---|---|
| Executor (training) | 50K-200K per question | $0.50-$5.00 per question | Multiple questions per batch |
| Proposer | 20K-50K | $0.30-$0.75 | Analyzes failure traces |
| Skill-Builder | 10K-30K | $0.15-$0.45 | Generates skill folder |
| Executor (validation) | 50K-200K per question | $0.50-$5.00 per question | Full validation set (~17 items) |
Total Run Costs
| Configuration | Iterations | Estimated Total Cost |
|---|---|---|
| OfficeQA, 5% train | ~18 (1.5 epochs) | $200-$800 |
| OfficeQA, 10% train | ~36 (1.5 epochs) | $400-$1,600 |
| OfficeQA, 15% train | ~54 (1.5 epochs) | $600-$2,400 |
| SealQA, 10% train | ~17 (1.5 epochs) | $300-$1,200 |
| Skill-merge (3 independent runs) | 3× above | 3× above |
| Full reproduction | All configurations | $3,000-$15,000 |
Cost-Efficiency Analysis
| Approach | Cost to Improve by ~7% | Reusability |
|---|---|---|
| EvoSkill (skill discovery) | $200-$1,600 | High — skills transfer |
| Fine-tuning | $1,000-$10,000+ | Low — model-specific |
| Manual skill authoring | 10-40 engineer-hours | Moderate — domain-specific |
| Prompt engineering | 5-20 engineer-hours | Low — task-specific |
Cost Insight: EvoSkill's costs are non-trivial but competitive with alternatives. The key advantage is that discovered skills are reusable — a single run on SealQA produces skills that also improve BrowseComp for free. Amortized across multiple deployment tasks, the ROI is favorable.
Hardware Requirements
| Component | Requirement |
|---|---|
| Compute | No GPU needed — all work is done via API calls |
| Storage | Minimal — skill folders are small text files |
| Memory | Standard — Python process + git operations |
| Network | API access to Anthropic (or alternative provider) |
| Docker | Required for sandboxed execution of helper scripts during evaluation |
9 Architecture Solution
System Architecture
┌───────────────────────────────────────────────────────────────────────┐
│ EvoSkill System Architecture │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Data Layer │ │
│ │ │ │
│ │ Dataset D = {(x_i, y_i)} LLM Classifier Stratified │ │
│ │ ┌───────────────┐ ┌──────────┐ Partitions │ │
│ │ │ Raw benchmark │──────────│ Cluster │───────► Train / Val / │ │
│ │ │ questions │ K cats │ into K │ Test splits │ │
│ │ └───────────────┘ │categories│ │ │
│ │ └──────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Evolution Layer │ │
│ │ │ │
│ │ Frontier G = {p_1, p_2, ..., p_k} (top-k programs) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Parent │───►│ Evaluate │───►│ Collect │───►│ Propose │ │ │
│ │ │ Select │ │ on Train │ │ Failures │ │ Skill │ │ │
│ │ │(round- │ │ Batch │ │ (< τ) │ │ Change │ │ │
│ │ │ robin) │ └──────────┘ └──────────┘ └────┬─────┘ │ │
│ │ └──────────┘ │ │ │
│ │ ▲ ▼ │ │
│ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ │ Update │◄───│ Evaluate │◄───│ Build │ │ │
│ │ └─────────│ Frontier │ │ on Val │ │ Skill │ │ │
│ │ │ (if │ │ Set │ │ Folder │ │ │
│ │ │ better) │ └──────────┘ └──────────┘ │ │
│ │ └──────────┘ │ │
│ │ │ │
│ │ Feedback History H: [(π_1,s_1), (π_2,s_2), ..., (π_t,s_t)] │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │
│ │ Git Repository (fixed codebase) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ main │ │ program │ │ program │ │ │
│ │ │ (base) │ │ branch 1 │ │ branch 2 │ │ │
│ │ │ │ │ + skills │ │ + skills │ │ │
│ │ │ │ │ + metadata │ │ + metadata │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
Design Principles
1. Skill-Level Abstraction
EvoSkill's fundamental design decision is to evolve skills rather than lower-level artifacts:
| Abstraction Level | Example Systems | Artifacts | Transferability |
|---|---|---|---|
| Weights | Fine-tuning, RLHF | Model parameters | None |
| Code | AlphaEvolve | Source code files | Low |
| Prompts | GEPA, DSPy | Text strings | Low |
| Skills | EvoSkill | Structured folders (SKILL.md + scripts) | High |
Skills are the right level because they are:
- Structured: Metadata, instructions, and code in a defined format
- Portable: Follow the Agent Skills specification — work across harnesses
- Interpretable: Human-readable, auditable, modifiable
- Composable: Multiple skills coexist and activate independently
- Transferable: Demonstrated zero-shot transfer across benchmarks
2. Separation of Concerns
Each of the three agents has a clearly delineated responsibility:
| Agent | Reads | Writes | Responsibility |
|---|---|---|---|
| Executor | Base repo + skills | Nothing | Execute tasks, produce traces |
| Proposer | Traces + ground truth + history H | Nothing (appends to H) | Diagnose failures, propose skills |
| Skill-Builder | Base repo + proposal π | Skills directory | Build concrete skill folders |
This separation ensures:
- The Executor cannot cheat by looking at ground truth
- The Proposer cannot directly modify the agent
- The Skill-Builder cannot see evaluation scores (only the proposal)
3. Evolutionary Selection with Memory
The frontier G and feedback history H together implement evolutionary selection with memory:
- Frontier G: Maintains top-k programs (population management)
- History H: Prevents redundant exploration (cumulative memory)
- Round-robin selection: Ensures all frontier members are explored
- Score-based replacement: Only improvements enter the frontier
10 Component Breakdown
Core Components
1. Self-Improving Loop (src/loop/)
The central evolutionary loop that orchestrates all three agents:
# Simplified pseudocode from Algorithm 1
H = []              # feedback history
G = [base_agent]    # frontier (initially just the base program)
scores = {base_agent: evaluate(base_agent, validation_set)}  # baseline score

for t in range(T):
    # Round-robin parent selection
    parent = G[t % len(G)]
    # Collect failures (score below threshold tau) on a training batch
    failures = collect_failures(parent, train_batch, threshold=tau)
    if not failures:
        continue
    # Propose skill change from failure analysis + history
    proposal = proposer(failures, H)
    # Build candidate program (parent + new or edited skill)
    candidate = skill_builder(parent, proposal)
    # Evaluate on the validation set
    scores[candidate] = evaluate(candidate, validation_set)
    # Frontier update: admit if there is room or it beats the weakest
    if len(G) < k or scores[candidate] > min(scores[p] for p in G):
        G.append(candidate)
        if len(G) > k:
            G.remove(min(G, key=scores.get))
    # Record history
    H.append((proposal, scores[candidate]))

return max(G, key=scores.get)
2. Agent Profiles (src/agent_profiles/)
Task-specific agent configurations:
src/agent_profiles/
├── __init__.py
├── base_agent/
│ ├── __init__.py
│ ├── base_agent.py # Default agent options
│ └── prompt.txt # Base system prompt
├── sealqa_agent/
│ ├── __init__.py
│ ├── sealqa_agent.py # SealQA-specific options
│ └── prompt.txt # SealQA system prompt
└── dabstep_agent/
└── ...
Each agent profile defines:
- System prompt
- Available tools
- Timeout configurations
- Model selection
3. Evaluation System (src/evaluation/)
Modular scoring framework:
| Scorer | Benchmark | Method |
|---|---|---|
| Multi-tolerance | OfficeQA | Fuzzy matching at 0%, 0.1%, 1%, 5%, 10% tolerance |
| LLM-graded | SealQA | GPT-based semantic equivalence judgment |
| Exact match | BrowseComp | String comparison |
# Scoring interface
def score(question: str, predicted: str, ground_truth: str) -> float:
"""Return score in [0.0, 1.0]."""
...
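A hedged sketch of what a single-tolerance numeric check for the OfficeQA-style scorer might look like. The parsing rules (stripping commas and dollar signs, string fallback) are assumptions; the repository's `multi_tolerance.py` may differ:

```python
def numeric_within_tolerance(predicted: str, ground_truth: str,
                             tolerance: float) -> bool:
    """True if the predicted number is within a relative tolerance of the
    ground truth; falls back to exact string match for non-numeric answers."""
    try:
        p = float(predicted.replace(",", "").replace("$", ""))
        g = float(ground_truth.replace(",", "").replace("$", ""))
    except ValueError:
        return predicted.strip() == ground_truth.strip()
    if g == 0.0:
        return p == 0.0
    return abs(p - g) / abs(g) <= tolerance
```

Running this at each tolerance in {0%, 0.1%, 1%, 5%, 10%} yields the multi-tolerance columns reported in the OfficeQA results.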
4. Data Splitting (src/data/)
LLM-based stratified partitioning:
- Classification: LLM classifies each example into K categories
- Stratified split: Ensures every category appears in both train and validation
- Holdout: Test set is never exposed during evolution
Default ratios are configurable:
- Train: ~10-18% (for failure detection)
- Validation: ~7-12% (for frontier scoring)
- Test: remaining ~70-80% (for final evaluation)
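A minimal sketch of category-stratified splitting under such ratios (function name and signature are assumptions, not the repository's `src/data/` API):

```python
import random
from collections import defaultdict

def stratified_split(examples, categories, train_ratio=0.10,
                     val_ratio=0.10, seed=0):
    """Split so every category appears in train and validation;
    the remainder of each category is held out as test."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex, cat in zip(examples, categories):
        by_cat[cat].append(ex)
    train, val, test = [], [], []
    for cat, items in by_cat.items():
        rng.shuffle(items)
        n_train = max(1, int(len(items) * train_ratio))
        n_val = max(1, int(len(items) * val_ratio))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```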
5. Frontier Management (src/frontier/)
Git-based program versioning:
Git branch structure:
main # Fixed codebase
├── evo/base # Base program (no skills)
├── evo/iter-1/candidate-0 # First evolved program
├── evo/iter-2/candidate-0 # Second evolved program
└── evo/iter-N/candidate-0 # N-th evolved program
Each branch contains:
.claude/skills/ # Accumulated skill folders
.agent-metadata.json # System prompt, lineage, score
Supporting Components
| Component | Location | Purpose |
|---|---|---|
| Proposer prompts | src/prompts/ | Structured templates for failure analysis |
| Skill-Builder meta-skill | src/meta_skill/ | Best practices for skill authoring |
| Caching | src/cache/ | Avoid re-evaluating identical programs |
| Concurrency control | src/concurrency/ | Parallel evaluation with configurable limits |
| Result tracking | src/results/ | Score history, skill provenance |
| CLI scripts | scripts/ | Entry points for loop and evaluation |
11 Core Mechanisms (Detailed)
Mechanism 1: Textual Feedback Descent for Skills
EvoSkill adapts the Feedback Descent framework (Lee, Boen, Finn, 2025) to skill discovery. The key innovation is using rich textual feedback rather than scalar rewards:
Original Feedback Descent:
Maintain frontier of top-k candidates
Each iteration:
1. Select candidate from frontier
2. Evaluate candidate → get textual feedback
3. Editor LLM uses feedback to produce improved candidate
4. New candidate enters frontier if better
EvoSkill's Adaptation:
Maintain frontier of top-k PROGRAMS (agent + skills)
Each iteration:
1. Select program from frontier (round-robin)
2. Execute program on tasks → collect FAILURE TRACES
3. Proposer LLM diagnoses failures → produces SKILL PROPOSAL
4. Skill-Builder LLM materializes proposal → SKILL FOLDER
5. New program enters frontier if validation score improves
The critical difference: where Feedback Descent optimizes a single artifact (molecule, SVG, prompt), EvoSkill optimizes a composition of skills — the improvement is additive across iterations.
Mechanism 2: Failure-Driven Skill Discovery
The Proposer performs structured failure analysis:
Input to Proposer:
- Execution traces (full agent conversation for each failed question)
- Predicted answers (what the agent output)
- Ground-truth answers (correct answers for diagnosis)
- Feedback history H (all prior proposals and outcomes)
Proposer Analysis Steps:
1. Review execution traces to identify WHERE the agent went wrong
2. Classify failure mode:
- Data extraction error (wrong cell, wrong metric)
- Reasoning error (incorrect computation, wrong formula)
- Search error (premature termination, missed source)
- Comprehension error (misunderstood question intent)
3. Check existing skills: is there a skill that SHOULD have prevented this?
- If yes: propose EDIT to strengthen that skill
- If no: propose NEW skill to address the gap
4. Consult history H: has a similar proposal been tried before?
- If tried and failed: propose different approach
- If tried and partially worked: propose refinement
5. Output: structured proposal π specifying skill name, trigger,
instructions, and rationale
Mechanism 3: Pareto Frontier with Round-Robin Selection
The frontier G maintains the top-k programs:
Frontier G = {p_1, p_2, ..., p_k} where k = frontier_size (default: 3)
Selection: round-robin cycling
Iteration 1: parent = G[0]
Iteration 2: parent = G[1]
Iteration 3: parent = G[2]
Iteration 4: parent = G[0] (cycle restarts)
...
This ensures:
- Every frontier member gets explored before any is revisited
- No single program dominates the mutation budget
- Diverse exploration starting points
Replacement: score-based
if len(G) < k or candidate_score > min(G.scores):
    G.add(candidate)
    if len(G) > k:
        G.remove(argmin(G.scores))
Why Round-Robin over Tournament/Roulette?
Round-robin selection guarantees that all frontier members receive equal exploration effort. In a small frontier (k=3), this matters: tournament selection could repeatedly select the strongest member, leading to premature convergence on a local optimum. Round-robin ensures that weaker frontier members — which may contain useful partial skills — also get a chance to be improved.
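The round-robin selection and score-based replacement described above can be sketched as a small class. This is an illustrative reimplementation, not the repository's actual `src/frontier.py`:

```python
class Frontier:
    """Top-k program frontier with round-robin parent selection (sketch)."""

    def __init__(self, k: int = 3):
        self.k = k
        self.members: list[tuple[object, float]] = []  # (program, score)
        self._cursor = 0

    def select_parent(self):
        """Cycle through frontier members so each gets equal exploration."""
        program, _ = self.members[self._cursor % len(self.members)]
        self._cursor += 1
        return program

    def maybe_add(self, program, score: float) -> bool:
        """Admit the candidate if the frontier has room or it beats the weakest."""
        if len(self.members) < self.k:
            self.members.append((program, score))
            return True
        worst = min(self.members, key=lambda m: m[1])
        if score > worst[1]:
            self.members.remove(worst)
            self.members.append((program, score))
            return True
        return False
```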
Mechanism 4: Skill Composition and Accumulation
Skills accumulate across iterations — a program's skill library grows monotonically:
Iteration 0: program_0 = {system_prompt}
Iteration 1: program_1 = {system_prompt, skill_A}
Iteration 3: program_3 = {system_prompt, skill_A, skill_B}
Iteration 7: program_7 = {system_prompt, skill_A, skill_B (edited), skill_C}
The Agent Skills specification enables this through progressive disclosure:
- Metadata loading at startup: Agent reads all skill triggers — minimal context cost
- On-demand activation: Full SKILL.md is read only when trigger conditions match
- Helper execution: Scripts run in subprocess, never entering context window
This means an agent can maintain dozens of skills with negligible context overhead. The skills activate contextually, responding to the specific characteristics of each input.
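A sketch of the startup-time metadata scan that makes this cheap — reading only each SKILL.md frontmatter header. This is a hypothetical helper for illustration; the real loader lives in the agent harness, not in EvoSkill:

```python
from pathlib import Path

def load_skill_triggers(skills_root: Path) -> dict[str, str]:
    """Read only the YAML-frontmatter header of each SKILL.md, so dozens
    of skills cost a few hundred tokens at startup; the full body is
    loaded later, on demand, when a trigger matches."""
    triggers = {}
    for skill_md in skills_root.glob("*/SKILL.md"):
        header_lines = []
        lines = skill_md.read_text().splitlines()
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":   # end of frontmatter
                    break
                header_lines.append(line)
        triggers[skill_md.parent.name] = "\n".join(header_lines)
    return triggers
```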
Mechanism 5: Skill Merging
The paper introduces a skill-merge strategy that combines skills from independent runs:
Run 1 (5% train): discovers skills {A, B, C}
Run 2 (10% train): discovers skills {B', D, E}
Run 3 (15% train): discovers skills {A', E', F}
Merge strategy:
1. Identify unique skills by name/description
2. For overlapping skills (B/B', A/A', E/E'):
keep version from highest-scoring run
3. Result: merged library {A', B, C, D, E, F}
This simple merging strategy outperforms any individual run (+7.5% vs. the best individual run's +5.2%), demonstrating that different training configurations surface complementary capabilities.
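The merge policy described above can be sketched in a few lines. This is an illustration of the stated policy, not the repository's actual merge tooling:

```python
def merge_unique(runs):
    """Merge skill libraries from independent runs: unique skills are kept,
    and for name collisions the version from the highest-scoring run wins.

    `runs` is a list of (run_score, {skill_name: skill_content}) pairs.
    """
    merged = {}
    best_source = {}   # skill_name -> score of the run that contributed it
    for run_score, skills in runs:
        for name, content in skills.items():
            if name not in merged or run_score > best_source[name]:
                merged[name] = content
                best_source[name] = run_score
    return merged
```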
Mechanism 6: Category-Aware Training
The data setup uses LLM-based clustering to ensure balanced exposure:
Dataset D → LLM Classifier → K categories
Category 1: questions about debt instruments
Category 2: questions about revenue figures
Category 3: questions about cross-document comparison
...
Stratified split ensures:
- Every category appears in train, val, and test
- Evolution sees diverse failure modes (not just one category)
- Validation is representative of the full distribution
Training data are organized as category-keyed pools. During evolution, batches are sampled without replacement, cycling through all examples before repeating. This ensures the Proposer sees failures from all categories, not just the most frequent ones.
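The category-keyed, without-replacement sampling described above can be sketched as follows (class and method names are assumptions):

```python
import random
from collections import deque

class CategoryPool:
    """Sample batches without replacement from category-keyed pools,
    cycling through all examples in a category before repeating any."""

    def __init__(self, pools: dict, seed: int = 0):
        self.rng = random.Random(seed)
        self.pools = pools
        self.queues = {cat: deque() for cat in pools}

    def next_batch(self, per_category: int = 1) -> list:
        """Draw per_category examples from every category pool."""
        batch = []
        for cat, items in self.pools.items():
            for _ in range(per_category):
                if not self.queues[cat]:          # exhausted: reshuffle and cycle
                    shuffled = list(items)
                    self.rng.shuffle(shuffled)
                    self.queues[cat] = deque(shuffled)
                batch.append(self.queues[cat].popleft())
        return batch
```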
12 Programming Language
Implementation Stack
| Component | Language | Framework |
|---|---|---|
| Core framework | Python 3.12+ | asyncio, dataclasses |
| Agent interaction | Python | Claude Agent SDK / OpenCode SDK |
| Evaluation | Python | Custom scorers |
| Version control | Git | Branch-per-program storage |
| Package management | uv (recommended) or pip | pyproject.toml |
| Sandbox | Docker | For secure code execution |
Code Organization
EvoSkill/
├── src/
│ ├── api.py # High-level Python API
│ ├── loop/ # Self-improving loop implementation
│ ├── agent_profiles/ # Per-task agent configurations
│ │ ├── base_agent/
│ │ ├── sealqa_agent/
│ │ └── dabstep_agent/
│ ├── evaluation/ # Scoring functions
│ │ ├── eval_full.py # Full evaluation runner
│ │ ├── sealqa_scorer.py
│ │ └── multi_tolerance.py
│ ├── prompts/ # Proposer/Builder prompt templates
│ ├── schemas.py # Pydantic data models
│ ├── frontier.py # Git-based frontier management
│ └── data/ # Dataset splitting utilities
├── scripts/
│ ├── run_loop.py # CLI: run evolution loop
│ ├── run_loop_sealqa.py # CLI: run SealQA loop
│ ├── run_eval.py # CLI: run evaluation
│ └── run_eval_sealqa.py # CLI: run SealQA eval
├── .dataset/ # Benchmark data (user-provided)
├── .claude/skills/ # Evolved skill folders (output)
├── pyproject.toml # Package configuration
└── uv.lock # Dependency lock file
API Design
EvoSkill provides both async and sync interfaces:
# Async (recommended)
result = await EvoSkill(task="sealqa").run()
# Sync (convenience wrapper)
result = EvoSkill(task="sealqa").run_sync()
# Evaluation only
summary = await EvalRunner(task="sealqa", model="sonnet").run()
# Task registration
register_task(TaskConfig(
name="my_task",
make_agent_options=my_options_factory,
scorer=my_scorer,
default_dataset=".dataset/my_data.csv",
))
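The scorer passed to `register_task` could be as simple as the following sketch. The contract assumed here, a plain function from prediction and ground truth to a float in [0, 1], is an assumption rather than the repository's verified interface:

```python
def exact_match_scorer(prediction: str, ground_truth: str) -> float:
    """Return 1.0 on a whitespace/case-normalized exact match, else 0.0.

    NOTE: the (prediction, ground_truth) -> float contract is assumed;
    EvoSkill's actual scorer interface may differ.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0

# Would then be wired into the task registration shown above, e.g.:
# register_task(TaskConfig(name="my_task", make_agent_options=...,
#                          scorer=exact_match_scorer,
#                          default_dataset=".dataset/my_data.csv"))
```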
Language Choice Rationale
- Agent SDK compatibility: Claude Code SDK and OpenCode SDK are Python-native
- Async-first: The evaluation loop is I/O-bound (API calls), making asyncio the right concurrency model
- Git integration: Python's subprocess module provides clean wrapping of git operations
- ML ecosystem: Scoring functions and data processing leverage Python's ML stack
- Skill output format: SKILL.md files are language-agnostic, but helper scripts are typically Python
13 Memory Management
Context Window Management
The most critical memory constraint in EvoSkill is not RAM but LLM context windows. Each agent must fit its inputs within the model's context limit:
Executor Agent:
Context budget:
System prompt: ~2K tokens
Skill triggers (all skills): ~50 tokens × N_skills
Active skill (on demand): ~2K-5K tokens per skill
Task question: ~1K-10K tokens
Tool outputs: ~5K-50K tokens (document content)
Agent reasoning: ~5K-20K tokens
Total: ~15K-80K tokens per question
Proposer Agent:
Context budget:
System prompt + instructions: ~3K tokens
Failure traces (N failures): ~5K-20K tokens per failure
Ground-truth answers: ~500 tokens per failure
Feedback history H: ~1K-5K tokens (growing)
Existing skill inventory: ~1K-3K tokens
Total: ~20K-100K tokens per iteration
Skill-Builder Agent:
Context budget:
Meta-skill (authoring guidelines): ~3K tokens
Parent program metadata: ~1K tokens
Proposal π: ~2K-5K tokens
Base repository context: ~5K-10K tokens
Total: ~10K-20K tokens per skill construction
Progressive Disclosure for Skill Loading
The Agent Skills specification's progressive disclosure model is essential for EvoSkill's scalability:
Startup: Load ALL skill triggers
┌────────────────────────────────────┐
│ data-extraction-verification │ ~50 tokens
│ trigger: "extracting table data" │
├────────────────────────────────────┤
│ quantitative-analysis │ ~50 tokens
│ trigger: "financial calculation" │
├────────────────────────────────────┤
│ search-persistence-protocol │ ~50 tokens
│ trigger: "searching for answers" │
└────────────────────────────────────┘
Total: ~150 tokens for 3 skills
Activation: Load ONLY matching skill
┌────────────────────────────────────┐
│ data-extraction-verification │
│ [Full SKILL.md: ~2000 tokens] │
│ [Helper scripts: executed, not │
│ loaded into context] │
└────────────────────────────────────┘
Total: ~2000 tokens (one skill)
This means EvoSkill can accumulate 50+ skills at a startup cost of only ~2,500 tokens. Only the skill(s) relevant to a given question consume full context.
Feedback History Growth
The feedback history H grows linearly with iterations:
After 20 iterations:
H ≈ 20 × (proposal summary + score + verdict)
≈ 20 × 200 tokens
≈ 4000 tokens
After 100 iterations:
H ≈ 100 × 200 tokens
≈ 20,000 tokens
For long runs, H may need summarization or sliding-window management. The paper's experiments (≤20 iterations) stay well within manageable limits.
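One possible sliding-window mitigation, not implemented in the paper, is to keep only the most recent entries verbatim and collapse older ones into a running count (a sketch):

```python
from collections import deque

class FeedbackHistory:
    """Keep the last `window` feedback entries verbatim; older entries
    are collapsed into a running count so H stays O(window) tokens."""

    def __init__(self, window: int = 20):
        self.window = window
        self.recent = deque(maxlen=window)
        self.evicted = 0

    def append(self, entry: str) -> None:
        if len(self.recent) == self.window:
            self.evicted += 1          # deque drops the oldest automatically
        self.recent.append(entry)

    def render(self) -> str:
        """Serialize H for the Proposer's context."""
        header = f"[{self.evicted} earlier entries summarized]\n" if self.evicted else ""
        return header + "\n".join(self.recent)
```

A hierarchical variant would replace the bare count with an LLM-written summary of the evicted entries; the window keeps the token bound either way.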
System Memory
| Component | Memory Footprint | Notes |
|---|---|---|
| Python process | ~100-300 MB | Framework + data structures |
| Git repository | ~10-100 MB | Branches + skill folders |
| Dataset in memory | ~10-500 MB | Loaded benchmark data |
| Evaluation cache | ~10-100 MB | Cached API responses |
| Total | ~130 MB - 1 GB | Minimal by modern standards |
14 Continued Learning
Iterative Skill Accumulation
EvoSkill's core loop is inherently a continued learning process. Skills accumulate monotonically — each iteration can add new skills or refine existing ones, but never removes successful skills:
Evolution trajectory (OfficeQA example):
Iteration 0: Base agent (60.6% accuracy)
Skills: {}
Iteration 3: +data-extraction-verification
Skills: {DEV}
Accuracy: 63.4%
Iteration 7: +quantitative-analysis-methodology
Skills: {DEV, QAM}
Accuracy: 65.8%
Iteration 12: Refined DEV (strengthened table parsing rules)
Skills: {DEV', QAM}
Accuracy: 67.9%
Skill Refinement vs. Creation
The Proposer can propose two types of changes:
| Action | When Used | Outcome |
|---|---|---|
| New skill | No existing skill covers the failure mode | New skill folder added |
| Edit skill | Existing skill partially addresses failures but has gaps | Modified SKILL.md or helper scripts |
This creates a natural learn-then-refine cycle:
1. Early iterations discover broad skills (addressing common failure modes)
2. Later iterations refine existing skills (handling edge cases)
3. Occasionally, late iterations discover entirely new capabilities
Transfer Learning
EvoSkill demonstrates three forms of transfer:
1. Within-Task Transfer (Skill Merging)
Independent runs on same task → merge unique skills
Result: 67.9% > 65.8% (best individual)
2. Cross-Task Transfer (Zero-Shot)
SealQA skill → BrowseComp (no modification)
Result: 43.5% → 48.8% (+5.3%)
3. Cross-Model Transfer (Untested but Architecturally Supported)
Skills evolved with Opus → applied with Sonnet/Haiku
Architecturally supported (skills are model-agnostic text)
Empirical validation is future work
Limitations of Current Continued Learning
| Limitation | Impact | Possible Mitigation |
|---|---|---|
| No skill pruning | Skill library grows unboundedly | Relevance scoring + periodic pruning |
| No multi-task joint optimization | Skills optimized for one task at a time | Multi-objective frontier over multiple benchmarks |
| No inter-skill conflict detection | Two skills could give contradictory instructions | Consistency checking agent |
| Linear feedback history | H grows without summarization | Hierarchical summarization or sliding window |
| Single-task evaluation | Validation score measures only target benchmark | Multi-benchmark validation set |
Future Directions for Continued Learning
The paper identifies several promising directions:
- Shared skill libraries. Skills discovered by multiple users/organizations could be pooled into a community skill registry.
- Multi-modal skill evolution. Extend to tasks requiring vision, code, and language coordination.
- Cross-harness transfer. Test whether skills evolved for Claude Code transfer to Codex, OpenHands, etc.
- Hierarchical skill structures. Skills that invoke other skills, creating composable capability trees.
- Active curriculum selection. Instead of random training batch sampling, actively select the most informative failure cases for each iteration.
15 Applications
Primary Application: Agent Skill Optimization
EvoSkill's direct application is automating the creation of agent skills for any domain:
Workflow:
1. Define a benchmark dataset with (question, ground_truth) pairs
2. Implement a scoring function
3. Run EvoSkill's evolutionary loop
4. Deploy discovered skills to production agents
Result: Domain-specialized agent without fine-tuning or manual skill authoring
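Step 4 of the workflow, deploying discovered skills, amounts to copying the evolved skill folders out of the run's output directory (`.claude/skills/` in the repository tree). The helper below is a hypothetical sketch, not part of EvoSkill:

```python
import shutil
from pathlib import Path

def deploy_skills(evolved_dir: Path, target_dir: Path) -> list[str]:
    """Copy evolved skill folders (each containing a SKILL.md) from a
    run's output directory into a production agent's skills directory.
    Folders without a SKILL.md are treated as incomplete and skipped."""
    target_dir.mkdir(parents=True, exist_ok=True)
    deployed = []
    for skill in sorted(evolved_dir.iterdir()):
        if skill.is_dir() and (skill / "SKILL.md").exists():
            shutil.copytree(skill, target_dir / skill.name, dirs_exist_ok=True)
            deployed.append(skill.name)
    return deployed

# e.g. deploy_skills(Path(".claude/skills"), Path("/srv/agent/.claude/skills"))
```

Because skills are plain folders of text and scripts, deployment is a file copy; no model weights, adapters, or serving changes are involved.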
Demonstrated Applications
| Application | Benchmark | Skills Discovered | Impact |
|---|---|---|---|
| Treasury document analysis | OfficeQA | Data extraction verification, quantitative analysis methodology | +7.3% exact match |
| Search-augmented QA | SealQA | Search persistence protocol (exhaustive search before committing) | +12.1% accuracy |
| Fact-seeking browsing | BrowseComp | Transfer of search persistence protocol | +5.3% accuracy (zero-shot) |
Potential Application Domains
| Domain | Task Type | Expected Skill Categories |
|---|---|---|
| Software engineering | Bug fixing, code review | Debugging protocols, testing strategies |
| Data analysis | Table reasoning, visualization | Data validation, statistical methods |
| Scientific research | Literature review, experiment design | Citation verification, methodology checks |
| Legal analysis | Contract review, case law | Clause interpretation, precedent search |
| Medical diagnosis | Clinical decision support | Symptom verification, differential diagnosis |
| Financial analysis | Risk assessment, compliance | Regulatory checking, calculation verification |
| Customer support | Ticket resolution, escalation | Issue categorization, resolution protocols |
Qualitative Analysis of Discovered Skills
Data Extraction Verification (OfficeQA):
This skill emerged from failures where the agent extracted values from wrong table cells, a common error when parsing dense financial documents. The skill enforces:
- Adjacent cell verification (check neighboring cells aren't the intended target)
- Metric disambiguation (ensure the correct metric is selected from similar-sounding options)
- Time granularity verification (monthly vs. quarterly vs. annual figures)
- Source page confirmation (verify the extraction location)
This skill is domain-specific but pattern-general — similar verification protocols would be useful for any document extraction task, not just Treasury bulletins.
Quantitative Analysis Methodology (OfficeQA):
This skill provides structured methodology for financial calculations:
- Mandatory validation checkpoints before computation
- Prevention of common errors: wrong data transformations, date misalignment, population/sample confusion
- Risk calculation frameworks
- Currency conversion and statistical inference guidance
Search Persistence Protocol (SealQA):
The most transferable skill discovered, this protocol enforces:
- Term interpretation expansion (consider alternative phrasings)
- Multi-source verification (don't trust a single search result)
- Completeness checks (ensure all aspects of the question are addressed)
- Resistance to premature search termination
This skill transferred zero-shot to BrowseComp — a benchmark with entirely different questions — because the underlying capability (thorough search before committing) is domain-general.
Relationship to Other Systems
| System | Relationship to EvoSkill |
|---|---|
| AlphaEvolve (Google DeepMind) | Evolves code; EvoSkill evolves skills (higher abstraction) |
| GEPA (DSPy) | Evolves prompts within DSPy; EvoSkill evolves structured skill folders |
| Voyager (Minecraft) | Discovered code-based skills for an embodied agent; EvoSkill discovers text+code skills for coding agents |
| Self-Refine | Single-output refinement; EvoSkill accumulates skills across iterations |
| Feedback Descent | General optimization framework; EvoSkill applies it to skill discovery |
| Agent Skills spec | Defines the skill format; EvoSkill automates skill creation |
| DiscoGen | Generates tasks for ADA optimization; EvoSkill optimizes the agent's skill library |
| ROMA | Hierarchical multi-agent framework; EvoSkill evolves its skills |
Strategic Position in the Evolutionary AI Landscape
EvoSkill occupies a unique position at the intersection of three trends:
┌──────────────────────────────────────────────────────────────┐
│ │
│ Evolutionary Optimization Agent Skill Ecosystem │
│ (AlphaEvolve, GEPA, ←────► (Claude Code Skills, │
│ FunSearch, OpenELM) Codex Skills, ROMA) │
│ │ │ │
│ └──────────────┬─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ EvoSkill │ │
│ │ │ │
│ │ Evolutionary │ │
│ │ optimization │ │
│ │ OF agent │ │
│ │ skills │ │
│ └─────────────────┘ │
│ │ │
│ ▼ │
│ Transferable, interpretable │
│ capability improvements │
│ WITHOUT model modification │
│ │
└──────────────────────────────────────────────────────────────┘
EvoSkill's key strategic advantage: It is the only system that combines evolutionary optimization with the structured skill format, producing artifacts that are both evolved (automatically discovered through search) and portable (work across models, tasks, and harnesses). This positions it as the bridge between the evolutionary AI community (focused on search and optimization) and the agent infrastructure community (focused on capability and deployment).
Open Questions
- Skill library scaling: How many skills can an agent effectively maintain before interference or confusion?
- Skill interaction effects: Do skills ever conflict or produce negative interference?
- Convergence properties: Does the frontier converge to a fixed point, or does performance continue improving with more iterations?
- Multi-benchmark optimization: Can EvoSkill optimize skills for multiple benchmarks simultaneously?
- Compositional generalization: Can skills learned individually be composed to solve tasks requiring multiple capabilities?
- Human-in-the-loop refinement: Can human experts improve EvoSkill-discovered skills, or are they already near-optimal for their domains?
References
@misc{alzubi2026evoskill,
title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems},
author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham
and Weiyuan Chen and Tu Vu},
year={2026},
eprint={2603.02766},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.02766},
}
@misc{alzubi2026roma,
title={ROMA: Recursive Open Meta-Agent Framework for Long-Horizon
Multi-Agent Systems},
author={Salaheddin Al Zu'bi and Bala Nama and Ashish Kaz
and Aditya Eswaran and Weiyuan Chen and Shantanu Khetan
and Rajat Bala and Tu Vu and Samuel Oh},
year={2026},
eprint={2602.01848},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.01848},
}
@misc{lee2025feedbackdescent,
title={Feedback Descent: Open-Ended Text Optimization via
Pairwise Comparison},
author={Yoonsang Lee and Jarett Boen and Chelsea Finn},
year={2025},
eprint={2511.07919},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2511.07919},
}