GEPA Skills

Part P03: Self-Improving Agent Systems

12.1 Overview and Motivation

Modern large language models exhibit impressive generalization across programming tasks, yet consistently underperform human developers on repository-specific work. A developer familiar with a codebase has internalized thousands of micro-decisions—which test runner to invoke, how errors propagate through the module hierarchy, which files to inspect first for a given symptom—that no general-purpose LLM can replicate from its training data alone. This repository knowledge gap manifests as navigation inefficiency (agents waste tokens exploring irrelevant files), convention violations (semantically correct patches that fail project-specific tests), tool misuse (running the wrong test suite), and costly retry spirals that exhaust token budgets on dead-end approaches.

The natural response to this gap is to fine-tune the LLM on repository-specific data. However, fine-tuning requires access to model weights (unavailable for proprietary APIs such as Claude or GPT-5), risks catastrophic forgetting of general coding abilities, is expensive to maintain as repositories evolve, produces non-transferable improvements (a fine-tuned GPT-5-mini cannot help Claude Sonnet), and yields opaque learned representations that are difficult to inspect or debug.

GEPA (General-purpose Evolutionary Program search for Anything), introduced by Tan, Agrawal, Sandadi, Klein, Sen, Dimakis, and Zaharia in February 2026, proposes a fundamentally different approach. Rather than modifying model weights, GEPA operates at the prompt level, automatically discovering structured natural-language skill documents that encode repository-specific knowledge. These skills are prepended to an agent's system prompt, teaching it domain-specific patterns, idioms, and troubleshooting heuristics. The skills are discovered through evolutionary search over a population of candidate documents, evaluated against verifiable coding tasks, and refined through LLM-based reflection on evaluation traces.

Key Contribution

GEPA demonstrates that the primary bottleneck for coding agents is not raw reasoning capability but repository-specific knowledge, and that this knowledge can be automatically extracted through evolutionary search over natural-language skill documents. Skills learned on a cheap model (GPT-5-mini, ~$40 total cost) transfer to expensive models (Claude Sonnet 4.5) and across agent frameworks without re-training—achieving up to +69 percentage points improvement on repository-specific benchmarks. This establishes a new paradigm of knowledge curation as an alternative to model training for improving AI coding agents.

The core thesis is that coding agents can be dramatically improved not by changing LLM weights or the agent harness, but by automatically learning repository-specific skills—structured natural-language instructions that encode domain knowledge, coding conventions, and debugging strategies—discovered through evolutionary search and transferable across both different LLM backends and different agent frameworks.

12.1.1 Formal Problem Statement

Given a repository $R$, an agent harness $H$, and a base LLM $M$, the skill optimization problem seeks a skill document $S^*$ that maximizes the expected task resolution rate:

$$S^* = \arg\max_{S} \; \mathbb{E}_{t \sim \mathcal{T}(R)}\bigl[\text{resolve}(H(M, S),\; t)\bigr]$$

where $\mathcal{T}(R)$ is the distribution of coding tasks for repository $R$, $H(M, S)$ denotes the agent using harness $H$ powered by model $M$ with skill $S$ prepended to its system prompt, and $\text{resolve}(\cdot, t) \in \{0, 1\}$ is a binary indicator of whether the agent correctly resolves task $t$ (validated by a test oracle). The search space is the set of all natural-language documents up to a token budget (typically 2000–4000 tokens), and the fitness function is the empirical resolve rate on a held-out validation set.

12.2 System Architecture

GEPA consists of three major layers: a data layer (SWE-smith) that generates verifiable coding tasks from repository history, a search layer (the optimize_anything core and its coding-specific instantiation gskill) that performs evolutionary search over skill documents, and an execution layer that runs agents with candidate skills and collects evaluation traces. The following diagram illustrates the end-to-end architecture and data flow.

12.2.1 The `optimize_anything` API

At the heart of GEPA is a general-purpose evolutionary optimization interface that can optimize any text-representable artifact using LLM-based evaluation and reflection. While the paper focuses on coding agent skills, the API is designed to be domain-agnostic. Three design principles govern its operation:

Artifact agnosticism. The API treats the optimized artifact as an opaque string. The evaluation function is the sole interface between the optimization loop and the application domain. For skill learning, it runs the agent with the candidate skill on a task set and returns the resolve rate; for other applications, it could evaluate prompt quality, configuration effectiveness, or any other measurable property.
LLM-as-mutator. Unlike traditional evolutionary algorithms that use random mutations, GEPA uses an LLM (the reflection model) to propose informed mutations. The reflection model receives the current artifact and its fitness, the full population with scores, detailed evaluation traces, and a mutation strategy directive.
Evaluation trace feedback. The evaluation function returns not just a scalar fitness score but structured traces—per-task success/failure information, error messages, and agent trajectory summaries. This rich feedback enables the reflection model to make targeted improvements rather than blind mutations.

# Pseudocode — no public implementation available
# Illustrates the optimize_anything API as described in the paper

class GEPA:
    def optimize_anything(
        self,
        artifact_type: str,                # e.g., "skill", "prompt", "config"
        initial_population: list[str],      # seed artifacts (can be empty)
        evaluate_fn: Callable[[str], EvalResult],  # fitness + traces
        reflect_model: str,                 # LLM for reflection/proposal
        num_generations: int = 10,
        population_size: int = 5,
        mutation_strategies: list[str] = ["refine", "combine", "simplify"],
        selection_method: str = "tournament",
        crossover_rate: float = 0.3,
        elite_count: int = 1,
    ) -> OptimizationResult:
        """Evolutionary search over text artifacts using LLM-based reflection."""

        population = self._initialize(initial_population, population_size)

        for gen in range(num_generations):
            # Step 1: Evaluate each candidate on a task batch
            results = []
            for candidate in population:
                eval_result = evaluate_fn(candidate)  # returns fitness + traces
                results.append((candidate, eval_result))

            # Step 2: Select elite (best candidate always survives)
            results.sort(key=lambda r: r[1].fitness, reverse=True)
            elite = results[0][0]

            # Step 3: Reflect on traces and propose new candidates
            reflection = self._reflect(
                population=results,
                model=reflect_model,
                strategy=self._choose_strategy(gen, mutation_strategies),
            )
            new_candidates = self._propose(
                reflection=reflection,
                model=reflect_model,
                count=population_size - elite_count,
            )

            # Step 4: Form next generation
            population = [elite] + new_candidates

        return OptimizationResult(
            best=elite,
            fitness=results[0][1].fitness,
            trace=all_generation_results,
        )

12.2.2 SWE-smith: The Data Pipeline

A critical enabler for GEPA is the ability to generate large numbers of verifiable coding tasks for arbitrary repositories. The SWE-smith pipeline addresses this by mining historical pull requests and commits to create task instances with ground-truth test oracles. The process has four stages:

Commit mining: SWE-smith scans the repository's git history for commits that modify source code files, have associated test changes or existing tests that exercise the modified code, and have clear commit messages or linked issues describing the intent.
Task extraction: For each qualifying commit, SWE-smith creates a task instance consisting of the repository state before the commit (the base state), a natural-language description of the problem (derived from the commit message or linked issue), and the set of tests that the patch must pass (the oracle).
Test validation: Each task is validated by running the test oracle against both the base state (should fail) and the patched state (should pass). Tasks where the oracle does not cleanly distinguish the two states are discarded.
Split allocation: The approximately 300 tasks per repository are split into 200 training tasks (used to evaluate skill candidates during evolutionary search), 50 validation tasks (used for early stopping and hyperparameter selection), and 60 test tasks (held out for final evaluation, never seen during skill learning).

The paper introduces Mini-SWE, a benchmark suite consisting of the SWE-smith-generated test splits across multiple repositories, serving as a controlled evaluation environment where test tasks are guaranteed unseen during skill learning. Two benchmark repositories are reported: Jinja (Python template engine, ~310 tasks) and Bleve (Go full-text search, ~290 tasks).

12.3 Core Algorithms

12.3.1 Evolutionary Search Formalization

GEPA maintains a population $P^{(g)} = \{S_1^{(g)}, \ldots, S_k^{(g)}\}$ at generation $g$. The population update rule is:

$$P^{(g+1)} = \text{Elite}(P^{(g)}) \cup \text{Propose}\bigl(\text{Reflect}(P^{(g)}, \Phi^{(g)})\bigr)$$

where the components are defined as follows:

$\text{Elite}(P^{(g)}) = \{S^* : S^* = \arg\max_{S \in P^{(g)}} f(S)\}$—the single best artifact is always preserved (elitism).
$\Phi^{(g)} = \{(t, r_t, \ell_t) : t \in T_{\text{batch}}\}$—structured evaluation traces from the current generation, where $r_t \in \{0,1\}$ is the binary resolution result and $\ell_t$ is the agent's execution log for task $t$.
$\text{Reflect}$—an LLM-based analysis function that maps the population and traces to a natural-language reflection document identifying patterns, failure modes, and improvement opportunities.
$\text{Propose}$—an LLM-based generation function that produces $k - 1$ new candidate artifacts conditioned on the reflection.

The objective function $f(S)$ is defined as the mean resolve rate over a task set $T$:

$$f(S) = \frac{1}{|T|} \sum_{t \in T} \text{resolve}\bigl(A_{H,M}(S),\; t\bigr)$$

where $A_{H,M}(S)$ denotes the agent using harness $H$, model $M$, and skill $S$, and $\text{resolve}$ returns 1 if the agent correctly resolves task $t$, 0 otherwise. This function is black-box (no gradients), stochastic (agent behavior is non-deterministic due to sampling temperature), and expensive to evaluate (each evaluation requires a full LLM agent run).

12.3.2 Fitness Estimation Under Stochasticity

Because LLM agent behavior is stochastic, a single evaluation of $f(S)$ on task $t$ is a Bernoulli random variable. The estimated fitness over a batch of $n$ tasks has variance:

$$\text{Var}[\hat{f}(S)] = \frac{f(S)(1 - f(S))}{n}$$

For $f(S) = 0.7$ and $n = 50$, this yields a standard deviation of approximately 0.065, or about 6.5 percentage points. This noise level is acceptable for tournament selection (distinguishing between candidates with $>$10 pp difference) but insufficient for fine-grained ranking. The authors address this by using relatively large evaluation batches (50 tasks per generation), evaluating the final best candidate on the full validation set (50 tasks, multiple runs), and reporting test-set performance with confidence intervals.

12.3.3 Mutation Strategies

Unlike traditional evolutionary algorithms where mutations are random perturbations, GEPA's mutations are semantically informed proposals generated by an LLM conditioned on evaluation traces. The paper describes five mutation strategies:

Strategy	Description	Typical Usage
`refine`	Analyze failures of current best artifact and propose targeted fixes	Most common (~60%); used when clear failure patterns exist
`combine`	Merge strengths of two parent artifacts via crossover	~20%; when population has diverse but complementary skills
`simplify`	Remove unnecessary content while maintaining performance	~20%; late-stage optimization, reducing token overhead
`specialize`	Add detailed instructions for specific failure categories	When a few hard tasks remain unsolved
`generalize`	Abstract specific instructions into broader principles	Improving transfer to unseen tasks

The default strategy mix is approximately 60% refine, 20% combine, and 20% simplify. This distribution matches the natural cadence of optimization: mostly targeted improvements with occasional exploration and compression. The paper reports that refine-only strategies converge but get stuck in local optima, combine-heavy strategies provide high diversity but slow convergence, and the balanced default yields the best final performance.

12.3.4 The Reflective Proposer

The reflection step is what distinguishes GEPA from naive hill-climbing with LLM mutations. The reflection model receives structured evaluation traces and performs causal analysis of failures before proposing mutations. This is a critical mechanism: the reflection model identifies systematic patterns across failed tasks (e.g., "8 of 12 failures involve the analysis package—the agent looks in index/ but the correct location is analysis/") and proposes targeted corrections rather than blind modifications.

# Pseudocode — no public implementation available
# Illustrates the reflection and proposal mechanism

def reflect_and_propose(
    current_best: str,
    current_fitness: float,
    population: list[tuple[str, float, list[TraceEntry]]],
    strategy: str,
    reflection_model: str,
    max_skill_tokens: int = 2000,
) -> list[str]:
    """
    Analyze evaluation traces and propose improved skill candidates.
    The reflection model receives:
    - The current best skill and its resolve rate
    - Detailed traces from failed tasks (agent trajectory, errors)
    - Detailed traces from successful tasks (what worked)
    - A mutation strategy directive
    """

    # Separate failed and successful traces from the best candidate
    best_traces = population[0][2]
    failed = [t for t in best_traces if not t.resolved]
    succeeded = [t for t in best_traces if t.resolved]

    # Build the reflection prompt
    prompt = REFLECTION_TEMPLATE.format(
        fitness=current_fitness * 100,
        current_skill=current_best,
        failed_traces=format_traces(failed, max_per_task=500),
        success_traces=format_traces(succeeded, max_per_task=200),
        strategy=strategy,
        strategy_instructions=STRATEGY_DESCRIPTIONS[strategy],
        max_tokens=max_skill_tokens,
    )

    # The reflection model analyzes patterns and proposes improved skills
    # Key: it receives per-task success/failure info, error messages,
    # and agent trajectory summaries — not just a scalar score
    response = llm_call(model=reflection_model, prompt=prompt)

    # Parse one or more proposed skill documents from the response
    proposed_skills = parse_skill_proposals(response)
    return proposed_skills

The paper provides a concrete example of reflection output from a Bleve skill optimization run. The reflection model identified that 8 of 12 failures involved the analysis package (the agent was searching in index/ instead of the separate analysis/ package), 3 failures were due to the agent running go test ./... (which times out on large repos) instead of targeted test commands, and 1 failure was a genuine reasoning error requiring Go concurrency expertise that skills could not easily address. This causal analysis led to specific, targeted skill improvements.

12.3.5 The `gskill` Pipeline

gskill is the instantiation of the optimize_anything API for coding agent skills specifically. It orchestrates the end-to-end process from initialization through evolutionary refinement to final evaluation. Three initialization strategies are explored:

Empty initialization: Start with a blank skill document. The first generation of reflection generates skills entirely from evaluation traces. This is the most general approach but requires more generations (~15–20 versus 8–12 with seeding).
Human-written seed: A domain expert writes an initial skill document. Provides a warm start but may introduce biases.
LLM-seeded initialization: An LLM generates initial skills from the repository's README, documentation, and directory structure. Reduces generations needed for convergence without human effort.

Several strategies manage the computational cost of evaluation: batch evaluation uses random subsets of 30–50 training tasks per generation rather than the full 200-task set; parallel execution runs task evaluations concurrently within each batch; early termination stops evaluation of a candidate if it scores significantly below the current best on the first $N$ tasks; and caching reuses results per (skill_hash, task_id) pair when the same skill is re-evaluated due to elitism.

12.3.6 Convergence Behavior

The paper reports a characteristic convergence pattern across multiple runs. Generations 1–3 show rapid improvement as basic repository knowledge is encoded (file structure, test commands, common patterns). Generations 4–7 exhibit moderate improvement as subtler patterns are captured (error handling idioms, edge case strategies). Generations 8–12 show diminishing returns with improvements from fine-tuning language and addressing specific hard tasks. Beyond generation 12, further iterations risk overfitting to the training task batch.

A notable emergent property is implicit curriculum learning: early generations master easy improvements (high marginal fitness impact) while later generations tackle harder patterns. This happens naturally because once easy tasks are solved, the fitness gradient shifts toward harder tasks, and the reflection model focuses on remaining failures, which are progressively more difficult.

12.4 Skill Format and Structure

Skills are structured natural-language documents with recognizable sections that encode different types of repository knowledge. The paper does not prescribe a rigid format, but the evolutionary process consistently converges on documents with similar organizational patterns—an emergent structure that reflects the information most useful to coding agents.

12.4.1 Emergent Skill Organization

Across multiple runs and repositories, evolved skills converge on four major sections despite no explicit format specification:

Section	Typical Allocation	Purpose	Impact on Agent Behavior
Orientation	~20% of tokens	High-level project description, domain, architecture	Reduces initial exploration; agent navigates to relevant areas faster
Navigation	~30% of tokens	File/module map with responsibilities	Agent makes correct assumptions about where to find and modify code; consistently the highest-value section
Procedures	~30% of tokens	How-to instructions for running tests, debugging, making changes	Agent runs correct tests on first attempt; avoids running irrelevant test suites
Heuristics	~20% of tokens	Pattern-matching rules: symptom → likely cause → solution	Symptom-to-fix shortcuts that bypass exploratory debugging cycles

This emergent organization is a form of self-organizing representation: the evolutionary process, without any structural template, discovers that this particular decomposition of repository knowledge is optimal for coding agent performance. The token allocation reflects the relative value of each knowledge type, with navigation and procedures dominating because they most directly reduce wasted exploration.

12.4.2 Token Budget and Emergent Compression

The paper experiments with skill lengths from 500 to 4000 tokens. The empirically optimal range is 1500–2500 tokens: shorter skills miss important patterns, while longer skills introduce noise that can confuse the agent or consume too much of the context window. Performance follows an inverted-U relationship with skill length.

An important observation is that the evolutionary process naturally tends toward concise skills. Early generations produce verbose documents, but the simplify mutation strategy and competitive selection pressure drive convergence toward information-dense, non-redundant skill documents. The authors describe this as a form of emergent compression: selective pressure against wasted tokens in the context window acts as a natural regularizer against bloat.

12.4.3 Skill Example: Bleve (Go)

The paper includes examples of evolved skill documents. The following is a representative excerpt from a Bleve skill, illustrating the emergent structure and information density. This is the actual skill content discovered through evolutionary search, not a human-authored document:

# Pseudocode — no public implementation available
# Represents the structure of an evolved skill document for Bleve (Go)
# This illustrates the content discovered by evolutionary search

BLEVE_SKILL = """
# Repository Skill: Bleve Full-Text Search Engine

## Project Overview
Bleve is a full-text search and indexing library for Go. The codebase
is organized around index types (scorch, upsidedown), analysis pipelines,
and search query types.

## Key Architecture Patterns
- All index implementations satisfy the Index interface in index/index.go
- Analysis chains: char filter -> tokenizer -> token filter
- Search queries implement the Query interface with Searcher() method
- The mapping package defines how documents map to index fields

## Common Bug Patterns and Fixes
1. Search relevance issues: check scoring logic in search/scorer/
   (most bugs involve TF-IDF weight calculation)
2. Index corruption: traces to segment merger in index/scorch/merge.go
3. Analysis pipeline bugs: check custom analyzers register all
   components in registry/

## Testing Strategy
- Run specific tests with: go test ./... -run TestName
- Integration tests in test/ directory
- Table-driven tests; match existing patterns in *_test.go
- ALWAYS run specific test file, not entire suite

## Debugging Heuristics
- Search returns no results -> check mapping first (mapping/)
- Indexing panics -> check for nil pointer in document.Fields
- Error 'unknown field' -> mapping is incomplete
"""

The per-task failure analysis from the paper reveals the mechanisms through which these skill sections help: 34% of baseline failures involved editing the wrong file (addressed by the architecture section), 22% involved incorrect test commands (addressed by the testing section), 18% were convention violations (addressed by code style guidance), 15% were incorrect logic (partially addressed by bug patterns), 7% were context window exhaustion (mitigated by reduced exploration), and 4% were fundamental misunderstandings that skills could not address.

12.5 Key Results

All results reported here are from the held-out test split (60 tasks) that was never used during skill learning, averaged over 3 runs with different random seeds, as stated in the paper.

12.5.1 Repository-Specific Performance

Mini-SWE Benchmark Results — Jinja (Python Template Engine)
Model	Baseline	With Skill	Improvement	Duration Change
GPT-5-mini	55%	82%	+27 pp	~−20%
Claude Haiku 4.5	72%	88%	+16 pp	~−15%
Claude Sonnet 4.5	89%	95%	+6 pp	~−25%

Mini-SWE Benchmark Results — Bleve (Go Full-Text Search)
Model	Baseline	With Skill	Improvement	Duration (s)
GPT-5-mini	24%	93%	+69 pp	—
Claude Haiku 4.5	79.3%	98.3%	+19 pp	173 → 142 (−18%)
Claude Sonnet 4.5	94.8%	100%	+5.2 pp	285 → 169 (−41%)

12.5.2 Result Patterns

Three patterns emerge from the empirical results:

Inversely proportional gains. The weaker the baseline model, the larger the absolute improvement from skills. GPT-5-mini gains +69 pp on Bleve while Sonnet gains +5.2 pp. This is consistent with the hypothesis that weaker models have more knowledge gaps that skills can fill, while stronger models already know much of what skills encode. The relationship is not linear—the marginal value of domain knowledge is highest when the model lacks it entirely.

Universal speed improvement. Even when the resolve rate improvement is modest (Sonnet: +5.2 pp on Bleve), the duration reduction is substantial (−41%). Skills help agents navigate directly to the right solution, eliminating wasted exploration. This is particularly valuable for high-cost models where token usage translates directly to monetary cost. The speed improvements are arguably more economically significant than the accuracy improvements for frontier models.

Near-ceiling performance. On Bleve, Claude Sonnet 4.5 with skills achieves 100% resolve rate on the test set. While impressive, this raises concerns about benchmark saturation—the Mini-SWE Bleve test set may be insufficiently challenging for frontier models with skills. The Jinja benchmark, with a 95% ceiling, retains more headroom for future evaluation.

12.5.3 Ablation: Evolutionary Search vs. Alternatives

The paper compares GEPA's evolutionary search against simpler skill-generation baselines, all evaluated on Bleve with GPT-5-mini:

Method	Bleve (GPT-5-mini)	Jinja (GPT-5-mini)
No skill (baseline)	24%	55%
Human-written skill	~55%	~68%
LLM-generated (one-shot, from README)	~48%	~65%
LLM + 1 reflection cycle	~65%	~72%
GEPA (5 generations)	~82%	~77%
GEPA (10 generations)	93%	82%

Each component contributes meaningfully. The one-shot LLM approach captures basic repository knowledge but misses subtle patterns that only emerge from evaluation feedback. A single reflection cycle helps but cannot discover the multi-layered knowledge that 10 generations of evolutionary refinement produce. As the authors note, the evolutionary process discovers knowledge that no human or single LLM call could produce because it is grounded in empirical evaluation traces—the system learns from what actually fails and why.

12.5.4 Sensitivity Analysis

The paper reports ablations across three key hyperparameters:

Population size. $K = 1$ (hill climbing) reaches ~88% after 15+ generations. $K = 3$ reaches ~91% in 10 generations at 2.8× cost. $K = 5$ (default) reaches ~93% in 7 generations at 4.5× cost. $K = 10$ reaches ~93% in 6 generations at 8.5× cost. Beyond $K = 5$, marginal benefit diminishes because the reflection model can only meaningfully analyze a limited number of evaluation traces per generation.

Evaluation batch size. 10 tasks per batch yields ~14 pp standard deviation in fitness estimates and unstable search. 30 tasks reduces noise to ~8 pp. 50 tasks (default) gives ~6 pp standard deviation and reliable comparisons. 100 tasks provides ~4 pp but at significantly higher cost per generation.

Skill token length. 500 tokens yields ~78% (too short to encode important patterns). 1000 tokens yields ~87%. 2000 tokens (default) yields ~93% (comprehensive without noise). 4000 tokens yields ~91% (slight degradation from redundancy confusing the agent).

12.6 Cross-Model Transfer

Perhaps the most surprising and practically important finding in the paper is that skills transfer across LLM models and agent frameworks. A skill learned using GPT-5-mini as the base model improves Claude Haiku 4.5 and Claude Sonnet 4.5 without any adaptation. This section analyzes the mechanisms and implications.

12.6.1 Transfer Results

12.6.2 Why Transfer Works

The transferability can be understood by classifying the knowledge types encoded in skills and their model-dependence:

Knowledge Type	Model-Dependent?	Encoded in Skills?	Transfer Expectation
Repository structure (which files exist, what they do)	No	Yes	Transfers perfectly
Testing conventions (how to run tests)	No	Yes	Transfers perfectly
Bug pattern mappings (symptom → likely cause)	No	Yes	Transfers well
Code style conventions	No	Yes	Transfers well
Reasoning strategies (problem decomposition)	Partially	Partially	Transfers moderately
Prompt-following ability	Yes	No	N/A (model-intrinsic)

The key insight from the paper is that the majority of information in evolved skills is factual domain knowledge (repository structure, testing commands, bug patterns) rather than reasoning strategies. Factual knowledge is model-independent: it is equally useful whether the reader is GPT-5-mini or Claude Sonnet. Skills transfer well because they are primarily knowledge documents, not reasoning guides.

12.6.3 The "Train-Weak, Deploy-Strong" Principle

The authors articulate a principle they call train-weak, deploy-strong: skills encode domain knowledge, not reasoning ability. A weaker model that knows which file to look at and which test to run can solve a task that a stronger model without that knowledge cannot. The knowledge is model-independent; the reasoning that applies it is model-dependent.

This has a practical corollary: the cost of skill learning is determined by the cheapest model in the hierarchy, while the benefit accrues to the most expensive model. A skill learning run costing ~$40 on GPT-5-mini can improve Claude Sonnet 4.5 performance by 5+ percentage points—each Sonnet invocation costing approximately 10–50× more than a GPT-5-mini invocation. The cost-benefit analysis strongly favors transfer: the marginal gain from model-specific skill learning does not justify the 10–50× higher evaluation cost of running a frontier model during the training phase.

The paper also demonstrates cross-harness transfer: skills learned in one agent framework (e.g., Moatless Tools) transfer to entirely different harness architectures. This works because skills are prepended to the system prompt, which is a universal interface across harness architectures. The skill content (domain knowledge) is harness-agnostic; only the format in which the agent applies the knowledge (tool calls, file operations) is harness-specific.

12.7 Cost Model and Computational Economics

12.7.1 Total Cost of Skill Learning

The total cost of a gskill run is modeled as:

$$C_{\text{total}} = G \cdot K \cdot B \cdot (C_{\text{agent}} + C_{\text{test}}) + G \cdot C_{\text{reflect}}$$

where $G$ is the number of generations (typically 10–15), $K$ is the population size (typically 3–5), $B$ is the batch size per evaluation (typically 30–50 tasks), $C_{\text{agent}}$ is the cost per agent run (depends on model; ~$0.01–$0.10 for GPT-5-mini), $C_{\text{test}}$ is the cost of test execution (compute, typically negligible compared to LLM cost), and $C_{\text{reflect}}$ is the cost of the reflection LLM call (single call per generation, ~$0.05).

For the paper's default parameters ($G = 10$, $K = 5$, $B = 40$, $C_{\text{agent}} = \$0.02$):

$$C_{\text{total}} = 10 \times 5 \times 40 \times \$0.02 + 10 \times \$0.05 = \$40.50$$

This is a one-time cost per repository, amortized over all future agent invocations. Compared to fine-tuning (which requires GPU hours costing hundreds to thousands of dollars), skill learning is orders of magnitude cheaper. The paper notes that the reflection model (which analyzes traces and proposes mutations) can be the same cheap model as the training model or a more capable model; using a stronger reflection model slightly improves convergence speed but does not significantly affect final skill quality, because the quality of skills is ultimately bounded by the evaluation signal, not the reflection model's reasoning ability.

LLM Hierarchy: Cost and Performance
Model	Role	Approx. Cost/Task	Baseline (Bleve)	With Skill
GPT-5-mini	Training + Deployment	~$0.02	24%	93% (+69 pp)
Claude Haiku 4.5	Transfer Deployment	~$0.08	79.3%	98.3% (+19 pp)
Claude Sonnet 4.5	Transfer Deployment	~$0.30	94.8%	100% (+5.2 pp)

Note: All cost figures and benchmark numbers in this section are as reported by the paper authors. Independent reproduction has not been published at the time of writing. The paper reports results averaged over 3 seeds on held-out test splits.

12.8 Compositional Evolution and Skill Discovery Mechanisms

12.8.1 Evolutionary Search over Natural Language

Traditional evolutionary algorithms operate over fixed-length numerical vectors or structured programs. GEPA's central innovation is applying evolutionary search to natural-language documents using an LLM as the mutation operator. This creates a qualitatively different search dynamic compared to classical approaches:

Informed mutations: Unlike random bit-flips or point mutations, LLM-proposed mutations are semantically coherent. A mutation might add a new section on error handling patterns, rephrase an ambiguous instruction, or remove a section that is hurting generalization—all based on causal analysis of evaluation traces.
Crossover via synthesis: When combining two parent skills, the LLM can intelligently merge them rather than performing random splicing. It identifies complementary sections and resolves conflicts between contradictory instructions.
Adaptive search radius: The LLM can make both small (rephrasing a single sentence) and large (restructuring the entire document) changes based on the reflection analysis, effectively adapting the mutation magnitude to the optimization landscape without requiring an explicit step-size parameter.

12.8.2 Cross-Task Generalization

A critical question is whether skills overfit to training tasks. The evaluation protocol (train/val/test splits) is designed to detect this, and the results show strong generalization: test-set performance closely tracks validation-set performance. The mechanism behind generalization is that skills encode structural knowledge about the repository (file layout, module responsibilities, testing infrastructure) rather than task-specific solutions. This structural knowledge is relevant to any task within the repository, not just the training tasks.

The evolutionary search implicitly selects for general knowledge because task-specific knowledge has low marginal fitness: it helps on only one task out of the evaluation batch (typically 30–50 tasks), contributing at most $1/B$ to the fitness score. Structural knowledge that helps across many tasks provides a much stronger selection signal. This is a natural form of regularization that emerges from the population-based evaluation protocol.

12.8.3 Hierarchical Decomposition and Future Directions

The current GEPA system treats each skill as a monolithic document. The paper identifies several open research directions related to compositional and hierarchical skill architectures:

Hierarchical skill learning would learn skills at multiple levels of abstraction: language-level skills (Go idioms, Python patterns), framework-level skills (standard library patterns, testing conventions), and repository-level skills (project-specific architecture, bug patterns). Higher-level skills would transfer more broadly across repositories; lower-level skills would be more specific and powerful within a single repository.

Compositional skills would decompose the monolithic skill document into independent modules (architecture, testing, debugging) learned separately and composed at deployment. This could enable more modular and maintainable skill libraries where individual components can be updated without re-optimizing the entire skill.

Task-adaptive skill selection would maintain a library of specialized skills and select the most relevant one(s) for each task, implemented as a learned routing function or retrieval system. Rather than a single skill serving all tasks, the agent would receive a task-specific composition of skill modules.

Continuous skill evolution would continuously evolve skills as the repository changes. Each merged pull request could trigger a lightweight skill update, keeping skills synchronized with the codebase and addressing the skill staleness problem that the current one-shot learning paradigm does not solve.

12.9 Connections to Related Work

GEPA sits at the intersection of several active research threads. Understanding its relationship to prior and concurrent work clarifies both its novelty and its limitations.

System	Approach	Relationship to GEPA
AlphaEvolve (DeepMind, 2025)	Evolutionary code optimization with LLM proposers	Shared algorithmic DNA: LLM-based mutation and evaluation loops. GEPA applies the same evolutionary search paradigm but to natural-language skill documents rather than code artifacts
OpenEvolve	Open-source evolutionary program search	Shared core architecture; GEPA's `optimize_anything` is more general than code-specific search, operating over arbitrary text artifacts
DSPy (Khattab et al.)	Programmatic prompt optimization	DSPy optimizes prompt templates with fixed structures; GEPA optimizes free-form documents without structural constraints, allowing emergent organization
SWE-bench	Coding agent benchmark	GEPA's SWE-smith generates similar task instances at scale from arbitrary repositories; Mini-SWE extends the paradigm to controlled skill evaluation
Reflexion (Shinn et al., 2023)	LLM self-reflection for task improvement	Reflexion uses reflection within a single task attempt; GEPA uses reflection across tasks and populations, enabling pattern discovery that single-task reflection cannot achieve
ADAS (Hu et al., 2024)	Automated Design of Agentic Systems	ADAS evolves agent architectures; GEPA evolves agent knowledge while holding the architecture fixed—complementary rather than competitive
EoH (Liu et al.)	Evolution of Heuristics with LLMs	EoH evolves algorithmic heuristics (code); GEPA evolves natural-language knowledge documents—different search spaces with shared evolutionary methodology

GEPA's distinctive contribution relative to this landscape is the demonstration that the knowledge context presented to an LLM at inference time can be optimized with the same rigor and methodology applied to the model itself. While other systems evolve code, prompts, or agent architectures, GEPA evolves domain knowledge documents—a qualitatively different artifact type that is human-readable, model-agnostic, and transferable across both models and frameworks.

12.10 Limitations and Open Questions

12.10.1 Scope of Evaluation

The paper demonstrates results on two repositories: Jinja (Python) and Bleve (Go). While these span two programming languages and different domains, the generality of the approach across diverse codebases—monorepos, polyglot projects, repositories with more than 1 million lines of code—remains unvalidated. The SWE-smith pipeline's ability to generate quality tasks may vary across repository types, particularly for repositories with poor test coverage or unconventional commit histories.

12.10.2 Benchmark Saturation

On Bleve, Claude Sonnet 4.5 with skills achieves 100% on the 60-task test set. This ceiling effect limits the ability to measure further improvements and raises questions about whether Mini-SWE is sufficiently challenging for frontier models with skills. The Jinja benchmark retains more headroom at 95%, but both benchmarks are relatively small. Scaling to larger and more diverse benchmarks (such as full SWE-bench with 12 Python repositories) would provide a more rigorous evaluation.

12.10.3 Skill Staleness

As repositories evolve, skills may become outdated—file paths change, modules are renamed, testing conventions shift. The paper does not address how skills should be maintained over time. A practical deployment would require a mechanism to detect skill degradation (e.g., monitoring resolve rates on a continuous validation set) and trigger re-optimization. The one-time cost of ~$40 makes periodic re-learning feasible, but the optimal cadence is unknown.

12.10.4 Single-Repository Scope

Current skills are repository-specific. Cross-repository skills (e.g., "general Go coding patterns" or "pytest conventions") are not explored. Learning meta-skills that transfer across repositories in the same language or framework is an open research direction that could significantly reduce the per-repository cost of skill learning.

12.10.5 Evaluation Methodology Considerations

Several aspects of the evaluation methodology warrant scrutiny. The 3-seed averaging provides some protection against stochastic variation, but the relatively small test sets (60 tasks) mean that individual task outcomes can materially affect the aggregate metric. The paper does not report confidence intervals for all comparisons. Additionally, the "human-written skill" baseline in the ablation study (~55% on Bleve) may not reflect the quality achievable by a developer with deep Bleve expertise spending significant time on skill authoring; it serves more as a baseline for casual human effort than for expert-level skill engineering.

12.11 Broader Significance: From Model Training to Knowledge Curation

GEPA represents a paradigm shift in how we think about improving AI coding agents. The traditional paradigm—collect data, train or fine-tune a model, deploy—treats the model as the primary locus of improvement. GEPA demonstrates an alternative where the model is held fixed and the knowledge context is optimized instead. The authors describe this as the knowledge curation paradigm: rather than training models to be better at everything, we curate the right knowledge to present to them at inference time.

This is analogous to how human expertise works: a junior developer given the right documentation and mentorship can perform at a much higher level than one working in isolation. Skills are the automated equivalent of documentation and mentorship, discovered through evolutionary search rather than human authoring.

The practical implications are significant. For software teams, GEPA enables a new development practice: skill engineering alongside code engineering, where teams maintain automatically updated skill documents for their repositories. For agent framework developers, it suggests that frameworks should be designed with standardized skill injection points, enabling a skill ecosystem where skills are shared, versioned, and composed. For model providers, it demonstrates that model quality is necessary but not sufficient—even frontier models benefit substantially from repository-specific context.

If the approach generalizes, it implies the emergence of a skill economy: a marketplace where repository-specific skills are learned, shared, and traded. Open-source repositories could include a .skills/ directory with pre-learned skills, just as they include documentation and CI configuration. The economics are favorable: skills are cheap to learn (~$40), valuable to users (significant performance improvement), and improve with the repository ecosystem (more tasks and more diverse evaluation lead to better skills).

Chapter Summary

Key takeaway: GEPA demonstrates that the primary bottleneck for AI coding agents is not reasoning capability but repository-specific knowledge, and that this knowledge can be automatically discovered through evolutionary search over natural-language skill documents at a cost of approximately $40 per repository.

Main contribution: The system establishes a new paradigm of knowledge curation—optimizing what models are told at inference time rather than what they learn during training. Skills learned on cheap models (GPT-5-mini) transfer to expensive models (Claude Sonnet 4.5) and across agent frameworks, achieving up to +69 percentage points on repository-specific benchmarks and up to 41% reduction in execution time. The "train-weak, deploy-strong" principle and the optimize_anything API provide a general framework for evolutionary optimization of any text artifact.

What researchers should know: GEPA's skills encode factual domain knowledge (file structure, testing commands, bug patterns), not reasoning strategies, which is why they transfer across models. The evolutionary process exhibits emergent properties—consistent document structure, natural compression, implicit curriculum learning—that arise from selection pressure without explicit design. The main limitations are narrow evaluation scope (two repositories), potential benchmark saturation for frontier models, and the open problem of skill maintenance as codebases evolve. The approach is complementary to, not competitive with, model training and agent architecture search.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}