Introduced: 2026-03

Score: 7.81/10 (Draft)

Chapter 14

EvoSkill

Part P03: Self-Improving Agent Systems

14.1 Overview & Motivation

Coding agents such as Claude Code, OpenHands, and Codex have become general-purpose problem solvers. Their generality, however, does not equate to domain expertise. A coding agent may handle arbitrary programming tasks, yet it will repeatedly fail on the same category of domain-specific challenge—extracting the correct cell from a dense financial table, persisting through multiple search queries before committing to an answer, or verifying quantitative calculations against source documents. These failures are systematic, not random, and they recur because agents lack structured domain knowledge that would prevent them.

Three specific gaps motivate EvoSkill. First, the Agent Skills ecosystem (e.g., .claude/skills/ directories, the agentskills.io specification) provides excellent infrastructure for using skills, but creating them requires manual domain knowledge and significant engineering effort. Second, existing evolutionary approaches optimize at the wrong abstraction level: AlphaEvolve (Chapter 4) optimizes source code, GEPA (Chapter 7) optimizes prompts, and both produce artifacts tightly coupled to specific models and tasks—they do not yield reusable, transferable capabilities. Third, no prior system provides an automated mechanism to analyze agent failures, propose remedies as structured skills, and validate those skills against held-out data.

EvoSkill, published in March 2026 by researchers at Sentient and Virginia Tech (Alzubi et al., 2026), addresses all three gaps by introducing evolutionary optimization at the skill level. Rather than evolving code or prompts, EvoSkill evolves structured skill folders—portable, interpretable capability modules conforming to the Agent Skills specification—through iterative failure analysis, Pareto frontier selection, and textual feedback descent. The result is an automated pipeline that discovers domain-relevant skills which persist across tasks, transfer across benchmarks, and remain interpretable to human practitioners.

Key Contribution

EvoSkill is the first system to apply evolutionary optimization to the "skill" unit of abstraction for coding agents. By evolving structured skill folders rather than code or prompts, it produces artifacts that are simultaneously discovered (through automated search), portable (conforming to a cross-harness specification), interpretable (human-readable procedural instructions), and transferable (demonstrated zero-shot cross-benchmark generalization). This positions EvoSkill at the intersection of evolutionary optimization and the agent skill ecosystem—a niche occupied by no prior system in the surveyed literature.

14.1.1 Intellectual Lineage

EvoSkill draws on several distinct research threads. The Feedback Descent framework (Lee, Boen, & Finn, 2025; arXiv:2511.07919) provides the core optimization paradigm: maintaining a frontier of candidates, evaluating them to collect textual feedback, and using an LLM editor to produce improved candidates. EvoSkill adapts this from single-artifact optimization to skill-library optimization. The ROMA framework (Alzubi et al., 2026; arXiv:2602.01848), by the same group, provides the hierarchical multi-agent architecture on which EvoSkill's three-agent system builds. Voyager (Wang et al., 2023) pioneered skill libraries for embodied agents in Minecraft, demonstrating that LLM-discovered skills can accumulate into a growing capability repertoire; EvoSkill translates this insight from embodied to coding agents. Finally, the Agent Skills specification (agentskills.io, 2025) defines the structured format that makes EvoSkill's output portable across agent harnesses.

14.1.2 Abstraction Level Comparison

A central claim of EvoSkill is that skills are the right unit of evolution. The following table situates this choice relative to alternatives explored by other systems in this survey:

Abstraction Level	Representative Systems	Evolved Artifacts	Transferability	Interpretability
Weights	Fine-tuning, RLHF	Model parameters	None (model-specific)	Low
Code	AlphaEvolve, FunSearch, OpenEvolve	Source code files	Low (task-specific)	Medium
Prompts	GEPA, EvoPrompt, PromptBreeder	Text strings	Low (model-sensitive)	Medium
Skills	EvoSkill	Structured folders (SKILL.md + scripts)	High (cross-task, cross-harness)	High

Skills combine the structured procedural knowledge of code with the natural-language interpretability of prompts, while adding metadata (trigger conditions, constraints, validation rules) that enables contextual activation and composition. Unlike prompts, skills are not monolithic instructions injected into every context window—they activate selectively based on trigger conditions, preserving context budget for the actual task.

14.2 Architecture

EvoSkill's architecture comprises three layers: a data layer that partitions benchmarks into stratified training, validation, and test splits; an evolution layer that maintains a Pareto frontier of agent programs and iteratively improves them through failure-driven skill discovery; and a storage layer that uses git branches to version every candidate program. These layers are orchestrated by a self-improving loop implemented in src/loop/ (per the repository structure at github.com/sentient-agi/EvoSkill).

14.2.1 Three-Agent System

The evolution layer is powered by three collaborating LLM agents, each with strictly delineated access control. This separation of concerns is a deliberate design choice that prevents information leakage and ensures that discovered skills do not embed ground-truth answers.

Agent	Role	Reads	Writes	Responsibility
Executor (A)	Worker	Base repo + skill folders	Nothing	Executes tasks under the current program; produces traces and predicted answers
Proposer (P)	Critic	Execution traces + ground truth + feedback history $H$	Appends to $H$	Diagnoses failures via root-cause analysis; proposes skill edits or new skills
Skill-Builder (S)	Builder	Base repo + proposal $\pi$	Skills directory	Materializes a textual proposal into a concrete skill folder conforming to the Agent Skills spec

A critical methodological detail: the Proposer receives ground-truth answers solely for diagnostic purposes, analogous to examining labeled misclassifications during error analysis in supervised learning. As stated in the paper: "Ground-truth answers are provided to enable root-cause diagnosis [...] and are not propagated to the generated skills themselves" (Alzubi et al., 2026, §3). The Skill-Builder never sees ground truth or evaluation scores—only the Proposer's textual proposal.
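The access rules in the table above can be made concrete with a small sketch. All names here (`TaskResult`, `proposer_view`, `builder_view`) are hypothetical and do not come from the repository, and the list of existing skill names is only a stand-in for the base repo that the Skill-Builder reads. The sketch shows the intended data flow: ground truth reaches the Proposer's input but is absent from the Skill-Builder's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    question: str
    trace: str          # Executor conversation log
    predicted: str      # Executor's answer
    ground_truth: str   # Label; meant for the Proposer only
    score: float


def proposer_view(failures, history):
    """Inputs the Proposer may read: traces, predictions, labels, history H."""
    return {
        "failures": [
            {
                "trace": r.trace,
                "predicted": r.predicted,
                "ground_truth": r.ground_truth,
            }
            for r in failures
        ],
        "history": list(history),
    }


def builder_view(proposal_text, skill_names):
    """Inputs the Skill-Builder may read: the textual proposal plus repo state.

    No scores and no labels are passed through this interface.
    """
    return {"proposal": proposal_text, "existing_skills": sorted(skill_names)}


failed = TaskResult("Q1", "searched once, then stopped", "1.9", "2.4", 0.0)
p_in = proposer_view([failed], history=[])
b_in = builder_view(
    "Add a skill that re-checks adjacent table cells.",
    {"search-persistence-protocol"},
)
```

The sketch only constrains the interface. Whether the wording of a proposal could itself carry an answer is a separate question that an interface of this kind does not settle.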

14.2.2 Skill Anatomy

Each discovered skill conforms to the Agent Skills specification and is stored as a structured directory:

# From repo: .claude/skills/ (output directory)
# Example discovered skill structure:
#
# .claude/skills/data-extraction-verification/
# ├── SKILL.md          # Trigger metadata + procedural instructions
# ├── helpers/
# │   ├── validate.py   # Helper scripts for verification
# │   └── templates/    # Reference materials
# └── examples/         # Usage examples (optional)
#
# A SKILL.md file contains:
#   - Name and description (human-readable identification)
#   - Trigger conditions (when the agent should invoke this skill)
#   - Instructions (step-by-step procedural guidance)
#   - Constraints (what not to do, common pitfalls)
#   - Validation (how to verify correct application)

The progressive disclosure model of the Agent Skills specification is essential for scalability. At startup, the agent loads only the trigger metadata for all skills, approximately 50 tokens per skill. When a trigger condition matches, the full SKILL.md is loaded on demand. Helper scripts execute in subprocesses and never enter the context window. At roughly 50 tokens per trigger, 20 skills cost about 1,000 tokens at startup, so an agent can maintain dozens of skills for at most a few thousand tokens of startup context.
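The context accounting behind progressive disclosure can be sketched as follows. This is an illustration under stated assumptions, not the harness's loader: the `Skill` class, the keyword-based `activate` function, and the four-characters-per-token estimate are all hypothetical. In a real harness the model itself decides which skill to load from the trigger descriptions.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    trigger: str      # short metadata, always loaded at startup
    body: str         # full SKILL.md text, loaded only when activated
    keywords: tuple   # hypothetical stand-in for trigger matching


def approx_tokens(text):
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def startup_cost(skills):
    """Tokens spent at startup: trigger metadata only."""
    return sum(approx_tokens(s.trigger) for s in skills)


def activate(skills, task):
    """Return the skills whose keywords appear in the task text."""
    task_lower = task.lower()
    return [s for s in skills if any(k in task_lower for k in s.keywords)]


def context_cost(skills, task):
    """Startup metadata plus the full body of each activated skill."""
    active = activate(skills, task)
    return startup_cost(skills) + sum(approx_tokens(s.body) for s in active)


skills = [
    Skill("data-extraction-verification",
          "Use when reading values from dense tables.",
          "step " * 2000, ("table",)),
    Skill("search-persistence-protocol",
          "Use when answering from web search results.",
          "step " * 2000, ("search",)),
]
task = "Find the value in the Treasury table"
active_names = [s.name for s in activate(skills, task)]
```

Only the matching skill's body is charged to the context; the other skill contributes its trigger line and nothing else.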

14.2.3 Agent Harness Compatibility

EvoSkill targets any coding agent harness that supports structured skill folders. The paper documents primary support for Claude Code (.claude/skills/), Codex (.codex/skills/), and OpenCode, as well as any harness compatible with the agentskills.io specification. The repository supports configurable SDK selection via command-line flags: --sdk opencode selects the OpenCode SDK for use with third-party models such as DeepSeek-V3 or Gemini 2.0 Flash (per scripts/run_eval.py).

14.3 Core Algorithms

14.3.1 Textual Feedback Descent for Skills

EvoSkill adapts the Feedback Descent framework (Lee, Boen, & Finn, 2025) to skill discovery. In the original formulation, Feedback Descent maintains a frontier of top-$k$ candidates and iteratively improves them using textual feedback from an LLM evaluator. Each iteration selects a candidate, evaluates it, collects feedback, and uses an editor LLM to produce an improved candidate that enters the frontier if it outperforms the weakest member.

EvoSkill's adaptation replaces the single-artifact optimization target with a composition of skills. Where Feedback Descent optimizes a molecule, SVG, or prompt, EvoSkill optimizes an agent's entire skill library. This is a critical difference: improvements are additive across iterations, since each iteration can add a new skill or refine an existing one without removing successful skills.

Formally, let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a benchmark dataset of question-answer pairs. Let $\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{val}}, \mathcal{D}_{\text{test}}$ be stratified partitions with ratios configurable by the user (defaults approximately 10–18%, 7–12%, and 70%+ respectively). Define a program $p$ as a tuple of a system prompt and a set of skill folders: $p = (\sigma, \{s_1, s_2, \ldots, s_m\})$. The objective is:

$$p^* = \arg\max_{p \in \mathcal{P}} \; \text{Score}(p, \mathcal{D}_{\text{val}})$$

where $\text{Score}(p, \mathcal{D}_{\text{val}}) = \frac{1}{|\mathcal{D}_{\text{val}}|} \sum_{(x,y) \in \mathcal{D}_{\text{val}}} f\bigl(A(x; p), y\bigr)$ measures the average scoring function $f$ over the validation set, $A(x; p)$ denotes the Executor agent's output on input $x$ under program $p$, and $\mathcal{P}$ is the space of all reachable programs (base program plus any combination of discoverable skills).
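The objective above is an average of per-example scores, which a few lines of Python make explicit. The function names and the dictionary-backed toy executor are assumptions for illustration only; in EvoSkill the executor is an LLM agent and $f$ is a task-specific scorer.

```python
def score(executor, program, dataset, f):
    """Mean of f(A(x; p), y) over the dataset, as in the Score definition."""
    pairs = list(dataset)
    if not pairs:
        raise ValueError("dataset must be non-empty")
    return sum(f(executor(x, program), y) for x, y in pairs) / len(pairs)


def exact_match(pred, gold):
    """A simple scoring function f: 1.0 on an exact string match, else 0.0."""
    return 1.0 if pred.strip() == gold.strip() else 0.0


def toy_executor(x, program):
    """Toy stand-in for A(x; p): the 'program' is just a lookup table."""
    return program.get(x, "unknown")


val = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
base = {"q1": "a"}
improved = {"q1": "a", "q2": "b", "q3": "c"}

base_score = score(toy_executor, base, val, exact_match)
improved_score = score(toy_executor, improved, val, exact_match)

# arg max over a (tiny) candidate set of programs
best = max([base, improved],
           key=lambda p: score(toy_executor, p, val, exact_match))
```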

14.3.2 The Self-Improving Loop

Algorithm 1 in the paper describes the core evolutionary loop, implemented in src/loop/. The following pseudocode reflects the repository's actual structure:

# From repo: src/loop/ — self-improving loop (simplified from Algorithm 1)
# Also exposed via src/api.py as EvoSkill(...).run()

import asyncio
from typing import List, Tuple

# Assumes the coroutines evaluate, evaluate_batch, proposer and skill_builder
# are provided elsewhere, and that a Program carries .skills and .score.
async def evolve(
    base_program: "Program",   # Starting agent (no skills yet)
    dataset_train: list,       # Training partition
    dataset_val: list,         # Validation partition
    frontier_size: int = 3,    # k: max programs in frontier G
    max_iterations: int = 20,  # T: total iterations
    threshold: float = 0.5,    # τ: failure threshold
) -> "Program":
    H: List[Tuple[str, float, str]] = []  # Feedback history
    # Score the base agent so it can be compared with later candidates
    base_program.score = await evaluate(base_program, dataset_val)
    G = [base_program]                     # Frontier (initially base agent)

    for t in range(max_iterations):
        # Round-robin parent selection from frontier
        parent = G[t % len(G)]

        # Evaluate parent on a training batch; collect failures
        results = await evaluate_batch(parent, dataset_train)
        failures = [r for r in results if r.score < threshold]

        if not failures:
            continue  # No failures to learn from

        # Proposer: structured failure analysis → textual proposal
        proposal = await proposer(
            failure_traces=failures,
            feedback_history=H,
            existing_skills=parent.skills,
        )

        # Skill-Builder: materialize proposal into skill folder
        candidate = await skill_builder(parent, proposal)

        # Evaluate candidate on validation set and record its score
        score = await evaluate(candidate, dataset_val)
        candidate.score = score

        # Frontier update: accept if better than weakest member
        if len(G) < frontier_size or score > min(p.score for p in G):
            G.append(candidate)
            if len(G) > frontier_size:
                worst = min(G, key=lambda p: p.score)
                G.remove(worst)

        # Append to cumulative feedback history
        verdict = "accepted" if candidate in G else "rejected"
        H.append((proposal.text, score, verdict))

    return max(G, key=lambda p: p.score)

14.3.3 Pareto Frontier with Round-Robin Selection

The frontier $G$ maintains the top-$k$ programs, where $k$ is configurable (default $k=3$). Parent selection uses a deterministic round-robin policy:

$$\text{parent}(t) = G\bigl[t \bmod |G|\bigr]$$

where $t$ is the iteration index and $|G|$ is the current frontier size. While the frontier's membership is unchanged, this policy gives every member one turn before any member is revisited; when a candidate is admitted or evicted, the indices shift and the allocation is only approximately even. In a small frontier ($k=3$), this matters considerably: tournament or roulette selection could repeatedly select the strongest member, leading to premature convergence on a local optimum. Round-robin ensures that weaker frontier members, which may contain useful partial skills amenable to refinement, also receive mutation attempts.
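A short sketch shows how the selection rule distributes attempts when the frontier is held fixed. The function name `round_robin_parent` and the placeholder program labels are illustrative assumptions.

```python
from collections import Counter


def round_robin_parent(frontier, t):
    """parent(t) = G[t mod |G|]."""
    if not frontier:
        raise ValueError("frontier must be non-empty")
    return frontier[t % len(frontier)]


frontier = ["p_a", "p_b", "p_c"]            # placeholder program labels
picks = [round_robin_parent(frontier, t) for t in range(6)]
counts = Counter(picks)                      # attempts per frontier member
```

With three members and six iterations, each member is selected exactly twice, in index order. A score-proportional sampler would give no such guarantee for the lowest-scoring member.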

The replacement rule is score-based: a new candidate enters the frontier if its validation score exceeds the minimum score in $G$. When the frontier exceeds size $k$, the weakest member is removed:

$$G_{t+1} = \begin{cases} G_t \cup \{p_{\text{new}}\} \setminus \{\arg\min_{p \in G_t \cup \{p_{\text{new}}\}} \text{Score}(p)\} & \text{if } |G_t| = k \text{ and } \text{Score}(p_{\text{new}}) > \min_{p \in G_t} \text{Score}(p) \\ G_t \cup \{p_{\text{new}}\} & \text{if } |G_t| < k \\ G_t & \text{otherwise} \end{cases}$$

where $p_{\text{new}}$ is the newly constructed candidate program. This is a standard $(k+1)$-truncation scheme applied to a small population. The simplicity of the scheme is intentional: with a frontier of only 3 programs and 20 iterations, more complex selection mechanisms would add overhead without benefit.
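The three cases of the replacement rule translate directly into a small function. This is a minimal sketch: programs are represented as hypothetical `(name, score)` tuples, and `update_frontier` is an illustrative name, not the repository's API.

```python
def update_frontier(frontier, candidate, k):
    """Apply the frontier update rule; return (new_frontier, accepted).

    frontier and candidate hold (name, validation_score) tuples.
    """
    if len(frontier) < k:
        # Case |G_t| < k: admit unconditionally
        return frontier + [candidate], True
    weakest = min(frontier, key=lambda p: p[1])
    if candidate[1] > weakest[1]:
        # Case |G_t| = k and the candidate beats the minimum: swap it in
        kept = list(frontier)
        kept.remove(weakest)
        return kept + [candidate], True
    # Otherwise: frontier unchanged
    return list(frontier), False


G = [("base", 0.60)]
G, ok1 = update_frontier(G, ("c1", 0.58), k=3)   # below capacity
G, ok2 = update_frontier(G, ("c2", 0.65), k=3)   # fills the frontier
G, ok3 = update_frontier(G, ("c3", 0.59), k=3)   # evicts c1 (0.58)
G, ok4 = update_frontier(G, ("c4", 0.50), k=3)   # rejected
names = [name for name, _ in G]
```

Note that while the frontier is below capacity, a candidate is admitted even if it scores below every current member, as `c1` does here; the score comparison only applies once the frontier is full.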

14.3.4 Failure-Driven Skill Discovery

The Proposer performs structured failure analysis, which is the primary mechanism that distinguishes EvoSkill from random mutation approaches. The Proposer receives: (1) execution traces—the full agent conversation for each failed question; (2) predicted answers; (3) ground-truth answers for diagnostic purposes; and (4) the cumulative feedback history $H$.

The Proposer's analysis proceeds through a structured sequence: first, it reviews execution traces to identify where the agent went wrong. Second, it classifies the failure mode—data extraction errors, reasoning errors, search errors, or comprehension errors. Third, it checks whether an existing skill should have prevented the failure (proposing an edit if so, or a new skill if not). Fourth, it consults history $H$ to avoid redundant proposals. The output is a structured proposal $\pi$ specifying a skill name, trigger conditions, instructions, and rationale.

The feedback history $H$ serves as a cumulative memory that prevents the evolutionary process from repeating failed strategies:

$$H_t = \bigl[(\pi_1, s_1, v_1), (\pi_2, s_2, v_2), \ldots, (\pi_t, s_t, v_t)\bigr]$$

where $\pi_i$ is the $i$-th proposal text, $s_i$ is the resulting validation score, and $v_i \in \{\text{accepted}, \text{rejected}\}$ is the frontier verdict. This is directly analogous to the feedback history in Feedback Descent (Lee et al., 2025), adapted to the skill discovery setting. The Proposer can build on partial successes (proposals with modest score improvement) and avoid repeating strategies that failed.
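One way to picture $H$ is as a list of records that is rendered into text for the Proposer's prompt. The `FeedbackEntry` class, the `render_history` function, and the output format are hypothetical; the scores are toy values, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEntry:
    proposal: str   # pi_i: proposal text
    score: float    # s_i: validation score of the resulting candidate
    verdict: str    # v_i: "accepted" or "rejected"


def render_history(history, baseline):
    """Format H as numbered lines, with each score's change vs. the base agent."""
    lines = []
    for i, e in enumerate(history, start=1):
        delta = e.score - baseline
        lines.append(
            f"{i}. [{e.verdict}] {e.proposal} "
            f"(val score {e.score:.3f}, {delta:+.3f} vs. base)"
        )
    return "\n".join(lines)


H = [
    FeedbackEntry("Verify adjacent table cells before answering.", 0.55, "accepted"),
    FeedbackEntry("Always answer in scientific notation.", 0.48, "rejected"),
]
text = render_history(H, baseline=0.50)
```

Showing the signed change next to each verdict lets the Proposer distinguish a rejected proposal that still helped slightly from one that hurt.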

14.3.5 Skill Composition and Accumulation

A defining feature of EvoSkill is that skills accumulate monotonically: along a lineage, a program's skill library never shrinks, because each iteration either adds a skill or refines an existing one. The Proposer may propose either a new skill (when no existing skill covers the failure mode) or an edit to an existing skill (when an existing skill partially addresses the failures but has gaps). This creates a natural learn-then-refine cycle:

# Skill accumulation trajectory (from repo: git branch history)
# Each program is stored as a git branch diverging from its parent

# Iteration 0:  base agent — no skills
# program_0.skills = {}

# Iteration 3:  data-extraction-verification discovered
# program_3.skills = {"data-extraction-verification"}

# Iteration 7:  quantitative-analysis-methodology discovered
# program_7.skills = {"data-extraction-verification",
#                      "quantitative-analysis-methodology"}

# Iteration 12: data-extraction-verification REFINED
#               (strengthened table parsing rules)
# program_12.skills = {"data-extraction-verification (v2)",
#                       "quantitative-analysis-methodology"}

# The progressive disclosure model means context cost scales
# only with the NUMBER of triggers (~50 tokens each),
# not the total content of all skills.
# 20 skills ≈ 1000 tokens at startup; only activated skills
# consume full context (~2000-5000 tokens each).

14.3.6 Skill Merging

The paper introduces a skill-merge strategy that combines skills from independent evolutionary runs. Given $R$ independent runs, each producing a set of skills, the merge procedure identifies unique skills by name and description, keeps the version from the highest-scoring run for overlapping skills, and unions the non-overlapping skills:

$$\mathcal{S}_{\text{merged}} = \bigcup_{r=1}^{R} \mathcal{S}_r^{\text{unique}} \;\cup\; \bigl\{ s_j^{r^*} : r^* = \arg\max_r \text{Score}(p_r) \text{ for overlapping } s_j \bigr\}$$

where $\mathcal{S}_r^{\text{unique}}$ are skills appearing only in run $r$, and $s_j^{r^*}$ selects the version of overlapping skill $s_j$ from the run with the highest overall score. In the OfficeQA results (Section 14.4.1), the merged library beats every individual run at exact match and at the 0.10% and 1.00% tolerances, ties the best run at 5.00%, and trails it by 0.1 point at 10.00%, which suggests that different training configurations surface complementary capabilities.
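The merge rule can be sketched as a single pass over the runs. This is an assumption-laden illustration: `merge_skills` is a hypothetical name, skills are matched by name only (the paper matches on name and description), and the run scores are toy values.

```python
def merge_skills(runs):
    """Merge skill sets from independent runs.

    runs: list of (run_score, {skill_name: skill_content}).
    Skills unique to one run are kept as-is; for overlapping names,
    the version from the highest-scoring run wins.
    """
    merged = {}
    source_score = {}   # score of the run each merged skill came from
    for run_score, skills in runs:
        for name, content in skills.items():
            if name not in merged or run_score > source_score[name]:
                merged[name] = content
                source_score[name] = run_score
    return merged


runs = [
    (0.63, {"extract": "extract-v1"}),
    (0.66, {"extract": "extract-v2", "quant": "quant-v1"}),
    (0.64, {"search": "search-v1", "quant": "quant-v0"}),
]
merged = merge_skills(runs)
```

The result contains the union of all skill names, with `extract` and `quant` both taken from the 0.66 run.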

14.3.7 Category-Aware Data Partitioning

The data layer uses an LLM-based clustering step to ensure stratified exposure across failure modes. Given dataset $\mathcal{D}$, an LLM classifier assigns each example $(x_i, y_i)$ to one of $K$ categories. The stratified split ensures every category appears in training, validation, and test partitions. Training data are organized as category-keyed pools; during evolution, batches are sampled without replacement, cycling through all examples before repeating. This ensures the Proposer encounters failures from all categories, not just the most frequent ones.
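The partitioning and sampling described above can be sketched with the standard library. The sketch assumes category labels have already been assigned (in EvoSkill, by an LLM classifier); `stratified_split` and `CyclingPool` are hypothetical names, and each category needs at least three examples for all three partitions to be non-empty.

```python
import random
from collections import defaultdict


def stratified_split(examples, train_ratio, val_ratio, seed=0):
    """Split (item, category) pairs so every category lands in each partition."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for item, cat in examples:
        by_cat[cat].append(item)
    train, val, test = [], [], []
    for cat in sorted(by_cat):
        items = by_cat[cat][:]
        rng.shuffle(items)
        n_train = max(1, round(len(items) * train_ratio))
        n_val = max(1, round(len(items) * val_ratio))
        train += [(x, cat) for x in items[:n_train]]
        val += [(x, cat) for x in items[n_train:n_train + n_val]]
        test += [(x, cat) for x in items[n_train + n_val:]]
    return train, val, test


class CyclingPool:
    """Serve items without replacement; reshuffle only after a full pass."""

    def __init__(self, items, seed=0):
        self._items = list(items)
        if not self._items:
            raise ValueError("pool must be non-empty")
        self._rng = random.Random(seed)
        self._queue = []

    def next_batch(self, n):
        batch = []
        while len(batch) < n:
            if not self._queue:
                self._queue = self._items[:]
                self._rng.shuffle(self._queue)
            batch.append(self._queue.pop())
        return batch


examples = [(f"q{i}", "tables" if i % 2 else "math") for i in range(20)]
train, val, test = stratified_split(examples, train_ratio=0.1, val_ratio=0.1)
pool = CyclingPool(range(5))
first_pass = pool.next_batch(5)
second_pass = pool.next_batch(5)
```

In a category-keyed setup, one such pool would be kept per category, so rare categories are cycled through as reliably as frequent ones.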

14.4 Key Results

All results reported below are from single runs using Claude Opus 4.5 as the underlying model for all three agents, as documented in the paper (Alzubi et al., 2026). The authors acknowledge that variance analysis across multiple runs is deferred to future work due to computational cost.

14.4.1 OfficeQA

OfficeQA is a grounded reasoning benchmark over U.S. Treasury Bulletin documents (approximately 89,000 pages), requiring dense table navigation, quantitative reasoning, and precise data extraction. Results are reported at multiple tolerance levels (allowable relative error in numeric answers):

Configuration	0.00% (exact)	0.10%	1.00%	5.00%	10.00%
Baseline (no skills)	60.6	66.3	72.8	77.2	79.7
EvoSkill (5% train)	63.4	67.4	74.3	77.6	80.1
EvoSkill (10% train)	65.8	69.9	76.4	80.5	82.5
EvoSkill (15% train)	64.5	69.9	75.6	79.3	81.3
Skill-merge (unique)	68.1	70.8	77.1	80.5	82.4

Source: Alzubi et al. (2026), Table 1. All numbers are single-run accuracy percentages with Claude Opus 4.5.

The skill-merge configuration achieves +7.5 percentage points on exact match (60.6% → 68.1%), the largest gain on OfficeQA. Notably, the 15% training split performs slightly worse than the 10% split at every tolerance except 0.10%, where the two tie at 69.9 (for example, 64.5 vs. 65.8 at exact match). This is consistent with diminishing returns or mild overfitting beyond 10%, although a single run cannot separate either from run-to-run noise. Skill-merge improves on the baseline at every tolerance level (by 2.7 to 7.5 points), indicating that the discovered skills improve fundamental extraction accuracy, not just tolerance to rounding.
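The multi-tolerance columns can be read as the same predictions scored under progressively looser relative-error thresholds. The sketch below assumes the check is $|\hat{y} - y| / |y| \le \text{tol}$; the function names, the zero-gold fallback, and the toy predictions are assumptions and do not reproduce the benchmark's scorer.

```python
TOLERANCES = (0.0, 0.001, 0.01, 0.05, 0.10)   # 0.00%, 0.10%, 1%, 5%, 10%


def within_tolerance(pred, gold, tol):
    """True if the relative error of pred against gold is at most tol."""
    if gold == 0:
        return pred == 0        # assumption: no relative error is defined
    return abs(pred - gold) / abs(gold) <= tol


def accuracy_by_tolerance(pairs, tolerances=TOLERANCES):
    """Accuracy (in percent) at each tolerance for (pred, gold) pairs."""
    return {
        tol: 100.0 * sum(within_tolerance(p, g, tol) for p, g in pairs) / len(pairs)
        for tol in tolerances
    }


# Toy predictions with relative errors of 0%, 0.05%, 0.5%, 4% and 20%
pairs = [(100.0, 100.0), (100.05, 100.0), (100.5, 100.0),
         (104.0, 100.0), (120.0, 100.0)]
acc = accuracy_by_tolerance(pairs)
```

Accuracy can only stay flat or rise as the tolerance loosens, which matches the left-to-right increase in every row of the table.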

14.4.2 SealQA

SealQA is a search-augmented question-answering benchmark involving open web retrieval with noisy, conflicting results. The primary challenge is preventing premature search termination—agents often commit to an answer from the first search result rather than exhaustively verifying across multiple sources.

Configuration	Accuracy
Baseline (no skills)	26.6%
EvoSkill (10% train)	38.7%

Source: Alzubi et al. (2026), Table 2. Single-run, Claude Opus 4.5. SealQA uses LLM-graded semantic equivalence scoring.

The +12.1 percentage point absolute improvement is the largest gain across all benchmarks. The paper attributes this primarily to the search-persistence-protocol skill, which enforces exhaustive search before committing to answers—a behavioral change that addresses the single most common failure mode in the baseline agent.

14.4.3 Zero-Shot Transfer: SealQA → BrowseComp

The strongest empirical contribution is the zero-shot transfer experiment. Skills discovered on SealQA are applied without modification to BrowseComp, a separate fact-seeking web browsing benchmark:

Configuration	BrowseComp Accuracy
Baseline (no skills)	43.5%
+ SealQA skills (zero-shot)	48.8%

Source: Alzubi et al. (2026), Table 3. Zero-shot transfer — no modification to skills. BrowseComp uses a stratified sample; full sample details are not specified in the paper.

This +5.3 percentage point improvement with zero modification is evidence, from a single run, that EvoSkill produces generalizable capabilities rather than task-specific tunings. The search-persistence-protocol, which enforces multi-source verification and resistance to premature search termination, plausibly transfers because the underlying capability is domain-general: thorough search before committing applies regardless of whether the questions concern web facts (BrowseComp) or noisy retrieval (SealQA).

14.4.4 Evidence Assessment

Several caveats apply to the reported results. All numbers are from single runs due to the computational cost of Claude Opus 4.5 at scale; variance analysis is explicitly deferred to future work. SealQA uses LLM-graded scoring (GPT-based semantic equivalence), which introduces an additional layer of non-determinism. The BrowseComp evaluation uses a stratified sample whose full specification is not provided. The baseline is cross-referenced with benchmark authors' latest results, but head-to-head budget parity with other optimization approaches (e.g., prompt engineering or fine-tuning) is not established. These limitations are standard for the computational scale of the work but should be noted when interpreting the magnitude of reported gains.

14.5 Implementation Details

14.5.1 Repository Structure

The repository (github.com/sentient-agi/EvoSkill, Apache 2.0 license) is organized as follows, per the documented source structure:

Component	Location	Purpose
High-level API	`src/api.py`	`EvoSkill()` and `EvalRunner()` classes; `register_task()` for extensibility
Evolutionary loop	`src/loop/`	Core self-improving loop implementation
Agent profiles	`src/agent_profiles/`	Per-task configurations (base, SealQA, DABStep)
Evaluation	`src/evaluation/`	Modular scorers: multi-tolerance, LLM-graded, exact match
Proposer prompts	`src/prompts/`	Structured templates for failure analysis
Meta-skill	`src/meta_skill/`	Skill-authoring best practices (bootstraps the Skill-Builder)
Frontier management	`src/frontier.py`	Git-based program versioning
Data splitting	`src/data/`	LLM-based stratified partitioning
Pydantic models	`src/schemas.py`	Data models for proposals, scores, programs
CLI entry points	`scripts/`	`run_loop.py`, `run_eval.py`, and task-specific variants

14.5.2 Usage

The repository documents both CLI and Python API usage:

# From repo: src/api.py — Python API for running the evolutionary loop

from src.api import EvoSkill, EvalRunner, register_task, TaskConfig

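# Run the full evolutionary loop on SealQA
# (top-level `await` needs an async context, e.g. a notebook cell or a
# coroutine passed to asyncio.run)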
result = await EvoSkill(
    task="sealqa",
    model="opus",              # Claude Opus 4.5
    mode="skill_only",         # Evolve skills, not system prompt
    max_iterations=20,         # T = 20
    frontier_size=3,           # k = 3
    concurrency=4,             # Parallel evaluation limit
    train_ratio=0.18,          # 18% for training
    val_ratio=0.12,            # 12% for validation
).run()

# Evaluate a specific configuration
summary = await EvalRunner(
    task="sealqa",
    model="opus",
    max_concurrent=8,
).run()

# Register a custom task
register_task(TaskConfig(
    name="my_task",
    make_agent_options=my_options_factory,  # Agent configuration factory
    scorer=my_scorer,                       # Scoring function
    default_dataset=".dataset/my_data.csv", # Benchmark data path
))

# From repo: scripts/run_loop.py — CLI invocation

# OfficeQA evolution (skill_only mode, 20 iterations)
# $ python scripts/run_loop.py --mode skill_only --max-iterations 20

# SealQA evolution
# $ python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20

# Evaluation with alternative model via OpenCode SDK
# $ python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3

# Evaluation with concurrency control
# $ python scripts/run_eval.py --model opus --max-concurrent 8

14.5.3 Cost Analysis

EvoSkill's cost is dominated by LLM API calls across the three agents. The Executor is invoked twice per iteration (once on the training batch, once on the validation set), making it the largest cost component. The following estimates are derived from the token volumes and Opus 4.5 pricing documented in the paper:

Agent	Estimated Tokens / Call	Estimated Cost (Opus 4.5)	Notes
Executor (training batch)	50K–200K per question	$0.50–$5.00 per question	Multiple questions per batch
Proposer	20K–50K	$0.30–$0.75	Analyzes failure traces
Skill-Builder	10K–30K	$0.15–$0.45	Generates skill folder
Executor (validation)	50K–200K per question	$0.50–$5.00 per question	~17 items in validation set

Total run costs scale with the number of iterations and the training set size:

Configuration	Iterations	Estimated Total Cost
OfficeQA, 5% train	~18 (1.5 epochs)	$200–$800
OfficeQA, 10% train	~36 (1.5 epochs)	$400–$1,600
OfficeQA, 15% train	~54 (1.5 epochs)	$600–$2,400
SealQA, 10% train	~17 (1.5 epochs)	$300–$1,200
Skill-merge (3 runs)	3× above	3× above
Full reproduction	All configurations	$3,000–$15,000

Source: Alzubi et al. (2026), §5. Cost estimates based on Opus 4.5 API pricing at time of publication. Ranges reflect uncertainty in per-question token consumption.

These costs are non-trivial but competitive with alternatives. Fine-tuning a model to achieve comparable improvement typically costs $1,000–$10,000+ and produces model-specific artifacts. Manual skill authoring requires 10–40 engineer-hours but yields only moderate transferability. EvoSkill's key advantage is that discovered skills are reusable: a single SealQA run produces skills that also improve BrowseComp with no further optimization spend. The one-time discovery cost is therefore amortized across every task on which the skills are deployed.

14.5.4 Hardware and Infrastructure

EvoSkill requires no GPU—all computation occurs via API calls. Hardware requirements are minimal: a standard Python process with network access to the Anthropic API (or alternative provider), sufficient storage for git operations and skill folders (typically under 100 MB), and Docker for sandbox-based code execution benchmarks such as LiveCodeBench. The repository uses uv for package management (pyproject.toml + uv.lock), though pip install -e . is also supported.

14.5.5 Reproducibility

The git-based storage model provides a complete audit trail: every candidate program is stored as a git branch (evo/iter-N/candidate-M), diverging from its parent only in skill folders and metadata. This means every skill, every proposal, and every score is version-controlled. However, several factors limit strict reproducibility: all three agents use LLMs with inherent stochasticity; results are from single runs; model versioning means that Opus 4.5 at the time of publication may behave differently from future API versions; and the cost barrier ($3,000–$15,000 for full reproduction) limits community replication.

14.6 Qualitative Analysis of Discovered Skills

A distinctive feature of EvoSkill relative to other evolutionary systems is that its output artifacts are human-readable and domain-interpretable. This section examines the three primary skills documented in the paper.

14.6.1 Data Extraction Verification (OfficeQA)

This skill emerged from repeated failures where the Executor extracted values from incorrect table cells in dense financial documents. The skill enforces four verification steps: (1) adjacent cell verification—checking that neighboring cells are not the intended target; (2) metric disambiguation—ensuring the correct metric is selected from similar-sounding options; (3) time granularity verification—distinguishing monthly, quarterly, and annual figures; and (4) source page confirmation—verifying the extraction location. The skill is domain-specific (financial documents) but pattern-general—similar verification protocols would be useful for any document extraction task, not just Treasury bulletins.

14.6.2 Quantitative Analysis Methodology (OfficeQA)

This skill provides structured methodology for financial calculations, including mandatory validation checkpoints before computation, prevention of common errors (wrong data transformations, date misalignment, population/sample confusion), risk calculation frameworks, and currency conversion guidance. It addresses a different failure mode than the extraction skill: cases where the agent extracted the correct data but performed incorrect calculations.

14.6.3 Search Persistence Protocol (SealQA)

The most transferable skill discovered, this protocol enforces: term interpretation expansion (considering alternative phrasings for search queries), multi-source verification (refusing to trust a single search result), completeness checks (ensuring all aspects of a multi-part question are addressed), and resistance to premature search termination. This skill transferred zero-shot to BrowseComp because the underlying capability (thorough, exhaustive search before committing to an answer) is domain-general. This result is the strongest evidence offered for EvoSkill's central thesis that skill-level optimization produces qualitatively more transferable artifacts than prompt or code optimization, though it rests on a single run (Section 14.4.4).

14.7 Comparative Analysis

EvoSkill occupies a unique position in the evolutionary AI landscape, operating at a different abstraction level than all other systems surveyed in this book:

System	Evolved Artifact	Selection Mechanism	Transfer	Interpretability	Model Modification
AlphaEvolve (Ch. 4)	Source code	MAP-Elites archive	Low	Medium	Frozen
FunSearch (Ch. 5)	Source code	Island populations	Low	Medium	Frozen
OpenEvolve (Ch. 6)	Source code	Pareto frontier + bandit	Low	Medium	Frozen
GEPA (Ch. 7)	Prompts (DSPy)	Pareto frontier + ASI	Low	Medium	Frozen
EvoSkill	Skill folders	Pareto frontier + round-robin	High	High	Frozen
Voyager	Code-based skills	Curriculum + success gating	Medium	Medium	Frozen
Self-Refine	Single outputs	Iterative refinement (no population)	None	High	Frozen

Several distinguishing features are worth highlighting. Unlike AlphaEvolve, FunSearch, and OpenEvolve, which evolve code within a specific problem formulation, EvoSkill evolves metacognitive artifacts—instructions that change how an agent approaches problems rather than solving specific problem instances. Unlike GEPA, which optimizes prompts within the DSPy framework, EvoSkill produces structured skill folders that are framework-agnostic and activate selectively via trigger conditions rather than being injected wholesale into the system prompt. Unlike Voyager, which discovers executable code skills for an embodied agent, EvoSkill discovers primarily textual procedural skills for coding agents, making them more interpretable and easier to audit.

The shared characteristic across all these systems is the frozen model principle: the underlying LLM's weights are never modified. EvoSkill demonstrates that meaningful performance improvements (up to +12.1% on SealQA) can be achieved purely through skill-level optimization, without any fine-tuning. This finding strengthens the broader thesis of this survey: that evolutionary search over LLM-generated artifacts is a powerful optimization paradigm that complements, rather than replaces, model training.

14.8 Context Window Management

The most critical resource constraint in EvoSkill is not RAM or compute but the LLM context window. Each of the three agents must fit its inputs within the model's context limit, and the cost per iteration scales with token consumption. The paper documents approximate context budgets per agent.

The progressive disclosure model is critical for scalability. An agent with 20 accumulated skills incurs only ~1,000 tokens at startup for trigger metadata (about 50 tokens per skill). Only the 1–2 skills whose triggers match a given input consume full context (2,000–5,000 tokens each). Per-question context cost is therefore dominated by the few activated skills rather than by the size of the library, so the library can grow substantially without a proportional increase in tokens consumed per question.
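The arithmetic above can be made concrete with a small sketch. The class and function names here are hypothetical, and the 50-token metadata figure is an assumption derived from the stated ~1,000 tokens for 20 skills; the eager-loading comparison is added only for contrast and is not something the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisclosureBudget:
    """Token figures taken from the text; per-skill metadata is derived (assumption)."""
    metadata_tokens_per_skill: int = 50   # ~1,000 tokens / 20 skills
    body_tokens_min: int = 2_000          # lower bound for one activated skill
    body_tokens_max: int = 5_000          # upper bound for one activated skill


DEFAULT_BUDGET = DisclosureBudget()


def progressive_cost(n_skills: int, n_active: int,
                     budget: DisclosureBudget = DEFAULT_BUDGET) -> tuple[int, int]:
    """Return (min, max) tokens when only matching skills load their full body."""
    if not 0 <= n_active <= n_skills:
        raise ValueError("n_active must be between 0 and n_skills")
    startup = n_skills * budget.metadata_tokens_per_skill
    return (startup + n_active * budget.body_tokens_min,
            startup + n_active * budget.body_tokens_max)


def eager_cost(n_skills: int,
               budget: DisclosureBudget = DEFAULT_BUDGET) -> tuple[int, int]:
    """Return (min, max) tokens if every skill body were injected up front."""
    return (n_skills * budget.body_tokens_min,
            n_skills * budget.body_tokens_max)


if __name__ == "__main__":
    print(progressive_cost(20, 1))  # one matching skill
    print(progressive_cost(20, 2))  # two matching skills
    print(eager_cost(20))           # hypothetical: all bodies loaded
```

Under these assumptions, 20 skills with two active ones cost between 5,000 and 11,000 tokens per question, whereas loading every skill body would cost 40,000 to 100,000 tokens.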

The feedback history $H$ grows linearly with the iteration count: at roughly 200 tokens per entry, 20 iterations give $H \approx 20 \times 200 = 4{,}000$ tokens. For longer runs ($\gg 20$ iterations), the authors note that $H$ may need summarization or sliding-window management, though the paper's experiments remain well within manageable limits.
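One way to picture the sliding-window option is the sketch below. It is not the paper's implementation: the function names, the `(text, tokens)` entry shape, and the 4,000-token budget are illustrative assumptions, and token counts are supplied by the caller rather than computed with a real tokenizer.

```python
def history_tokens(n_iterations: int, tokens_per_entry: int = 200) -> int:
    """Linear growth of the feedback history, using the ~200 tokens/entry figure."""
    return n_iterations * tokens_per_entry


def sliding_window(entries: list[tuple[str, int]],
                   max_tokens: int) -> list[tuple[str, int]]:
    """Keep the most recent entries whose combined token count fits max_tokens.

    Each entry is a (text, token_count) pair; order is oldest to newest.
    """
    kept: list[tuple[str, int]] = []
    total = 0
    # Walk from newest to oldest and stop once the budget would be exceeded.
    for text, tokens in reversed(entries):
        if total + tokens > max_tokens:
            break
        kept.append((text, tokens))
        total += tokens
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    # Hypothetical 50-iteration run with 200-token feedback entries.
    history = [(f"iter {i}", 200) for i in range(50)]
    window = sliding_window(history, max_tokens=4_000)
    print(history_tokens(50), len(window), window[0][0], window[-1][0])
```

A pure sliding window discards the oldest feedback outright; the summarization alternative mentioned by the authors would instead compress those entries, trading some detail for retained coverage.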

14.9 Limitations & Discussion

14.9.1 Statistical Limitations

All reported results are from single runs. The inherent stochasticity of LLM-based agents—present in all three agents plus the LLM-graded scorer for SealQA—means that the true variance of these results is unknown. A result of +12.1% on SealQA is impressive, but without confidence intervals or multiple-seed runs, the reliability of the exact magnitude is uncertain. The authors acknowledge this and identify variance analysis as future work, citing the cost barrier (full reproduction at $3,000–$15,000).
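To illustrate why a single-run accuracy figure carries uncertainty, the sketch below computes a Wilson score interval for a proportion. The function name and the example counts are hypothetical and are not taken from the paper. Note also that this interval captures only sampling uncertainty over the question set; it does not capture run-to-run stochasticity of the agents or the LLM grader, which would require repeated runs.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical example: 50 correct answers on a 100-question benchmark.
    low, high = wilson_interval(50, 100)
    print(f"accuracy 0.50, 95% interval [{low:.3f}, {high:.3f}]")
```

With these illustrative counts the interval spans roughly 0.40 to 0.60, which shows how a gain of around ten points on a small evaluation set can sit close to the width of the sampling uncertainty alone.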

14.9.2 Skill Library Scaling

Skills accumulate monotonically—the current system adds skills and refines them but never prunes unsuccessful ones. As the skill library grows, several risks emerge: potential conflicts between skills that give contradictory instructions; increased startup context cost (though progressive disclosure mitigates this); and the possibility that too many trigger conditions create confusion about which skill should activate. No inter-skill conflict detection mechanism exists in the current system.

14.9.3 Single-Task Optimization

Each evolutionary run optimizes for a single benchmark. While the zero-shot transfer result demonstrates that some discovered skills generalize, there is no mechanism for multi-task joint optimization. A skill that improves SealQA accuracy could in principle degrade performance on other tasks, and the current evaluation framework would not detect this. Multi-objective optimization over multiple benchmarks simultaneously is an open direction.
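A multi-benchmark frontier of the kind suggested here could, in principle, compare candidates on a vector of per-benchmark scores. The following is a minimal sketch of Pareto dominance under that assumption; the function names, candidate names, and scores are all hypothetical, and the paper does not implement this.

```python
def dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """True if a is at least as good as b on every benchmark and better on one."""
    if a.keys() != b.keys():
        raise ValueError("candidates must be scored on the same benchmarks")
    return (all(a[k] >= b[k] for k in a)
            and any(a[k] > b[k] for k in a))


def pareto_front(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Names of candidates that no other candidate dominates (input order kept)."""
    return [
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]


if __name__ == "__main__":
    # Hypothetical accuracies; not results from the paper.
    candidates = {
        "baseline": {"sealqa": 0.30, "browsecomp": 0.40},
        "skill_a":  {"sealqa": 0.40, "browsecomp": 0.38},
        "skill_b":  {"sealqa": 0.35, "browsecomp": 0.45},
        "skill_c":  {"sealqa": 0.33, "browsecomp": 0.44},
    }
    print(pareto_front(candidates))
```

In this illustrative data, `skill_a` has the best SealQA score but regresses on BrowseComp relative to the baseline. A single-benchmark evaluation would select it without noticing the regression, whereas the multi-benchmark frontier keeps it alongside `skill_b` and makes the trade-off explicit.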

14.9.4 Convergence Properties

The paper does not establish formal convergence properties. It is unknown whether the frontier converges to a fixed point, whether performance continues improving indefinitely with more iterations, or whether there exists a theoretical performance ceiling for skill-only optimization. The diminishing returns observed between the 10% and 15% training splits (§14.4.1) suggest practical convergence within the tested regime, but a characterization of the convergence rate as a function of frontier size $k$, iteration count $T$, and training ratio would strengthen the theoretical contribution.

14.9.5 Ground Truth Requirement

The Proposer requires ground-truth answers for failure diagnosis. This limits EvoSkill's applicability to domains where labeled evaluation data exists. In many real-world deployment scenarios—customer support, creative writing, open-ended research—ground truth may be unavailable or ill-defined. Extending EvoSkill to work with weaker supervision signals (e.g., human preference rankings, automated quality metrics, or reward models) would broaden its applicability considerably.

14.9.6 Cross-Model Transfer

The paper evaluates primarily with Claude Opus 4.5. While the architecture supports cross-model deployment (skills are model-agnostic text), empirical validation of cross-model transfer—e.g., skills evolved with Opus applied to Sonnet, Haiku, or non-Anthropic models—is not provided. Whether a skill's effectiveness is robust to model capability differences remains an open empirical question.

14.9.7 Open Questions

Question	Current Status	Potential Approach
Skill library capacity limit?	Untested beyond ~10 skills	Relevance scoring + periodic pruning
Inter-skill interference?	Not measured	Consistency checking agent; ablation studies
Multi-benchmark optimization?	Single-task only	Multi-objective frontier over benchmark suite
Compositional generalization?	Individual skills tested	Tasks requiring multiple simultaneous capabilities
Human-AI co-refinement?	Untested	Human expert editing of EvoSkill-discovered skills
Community skill registries?	Architecturally possible	Shared skill libraries across organizations

14.10 Broader Significance

EvoSkill represents an important conceptual advance in the evolutionary AI landscape surveyed in this book. While other systems have demonstrated that LLMs can serve as effective mutation operators for code (AlphaEvolve, FunSearch, OpenEvolve) and prompts (GEPA, EvoPrompt), EvoSkill demonstrates that skills—structured, portable, interpretable capability modules—are a viable and arguably superior unit of evolution when the goal is transferable agent improvement.

This has implications beyond the specific system. The Agent Skills ecosystem (Claude Code skills, Codex skills, the agentskills.io specification) provides infrastructure for using skills, but the bottleneck has been skill creation. EvoSkill automates this bottleneck, potentially enabling a flywheel: agents deployed in production encounter failures → EvoSkill analyzes failures and discovers skills → skills are deployed back to production agents → the agents encounter fewer failures but discover new ones → the cycle continues. This is a concrete instantiation of the "self-improving agent" vision, achieved without any model modification.

The zero-shot transfer result (SealQA → BrowseComp, +5.3%) is particularly significant for the research community because it provides evidence that evolutionary optimization can produce genuinely general capabilities—not just task-specific tunings. If this finding replicates across more benchmarks and models, it would suggest that skill-level evolution is a fundamentally different optimization modality than prompt or code evolution, one that inherently produces more transferable artifacts because the skill format forces the system to discover procedures rather than answers.

Summary

Key takeaway: EvoSkill demonstrates that evolving structured skill folders—rather than code or prompts—produces interpretable, portable, and transferable agent capabilities through automated failure analysis, achieving up to +12.1% accuracy improvement on search-augmented QA and +5.3% zero-shot cross-benchmark transfer, all without model modification.

Main contribution: The first system to apply evolutionary optimization at the skill abstraction level for coding agents, bridging the evolutionary AI community (focused on search and optimization) with the agent infrastructure community (focused on capability and deployment). The Feedback Descent adaptation to skill-library optimization, combined with the three-agent architecture enforcing strict separation of concerns, represents a novel synthesis that produces qualitatively different—more general and more interpretable—artifacts than prior evolutionary approaches.

What researchers should know: EvoSkill's strongest evidence is the zero-shot transfer experiment, which suggests that skill-level optimization can produce domain-general capabilities. However, all results are from single runs, primarily with Claude Opus 4.5; variance analysis is absent; and the cost barrier ($3,000–$15,000 for full reproduction) limits community verification. The system is best understood as a proof-of-concept for skill-level evolution: architecturally clean and empirically promising, but awaiting the statistical rigor (multi-seed runs, cross-model evaluation, multi-benchmark joint optimization) needed to establish robust claims about generalization.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}