GEPA: Automatically Learning Skills for Coding Agents
Evolutionary Skill Optimization for Repository-Specific Coding Agent Enhancement via GEPA
Authors: Shangyin Tan, Lakshya A Agrawal, Rohit Sandadi, Dan Klein, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia
Date: February 18, 2026
Affiliations: UC Berkeley, UT Austin, Databricks
Domain: Evolutionary Skill Learning, Coding Agents, Cross-Model Transfer
Table of Contents
- Abstract and Core Contribution
- Motivation and Problem Setting
- The GEPA optimize_anything API
- SWE-smith Data Pipeline
- The gskill Skill Learning Pipeline
- Skill Format and Structure
- Mathematical Formulation
- LLM Hierarchy and Cost Architecture
- Experimental Results
- Cross-Model Transfer Analysis
- Core Mechanisms Deep Dive
- System Architecture and Diagrams
- Ablations and Sensitivity Analysis
- Limitations and Future Directions
- Broader Significance
1 Abstract and Core Contribution
Core thesis: Coding agents can be dramatically improved not by changing the LLM weights or the agent harness, but by automatically learning repository-specific skills -- structured natural-language instructions that encode domain knowledge, coding conventions, and debugging strategies -- discovered through evolutionary search and transferable across both different LLM backends and different agent frameworks.
The paper introduces GEPA (General-purpose Evolutionary Program search for Anything),
a framework whose central API, optimize_anything, treats arbitrary text artifacts as
optimization targets. When instantiated for coding agents, GEPA searches over the space of
skill documents -- structured prompts prepended to an agent's system prompt -- that teach
the agent repository-specific patterns, idioms, and troubleshooting heuristics.
The key insight is that the bottleneck for coding agents is not raw reasoning capability but repository-specific knowledge. A developer familiar with a codebase has internalized thousands of micro-decisions: which test runner to use, how errors propagate through the module hierarchy, which files to check first for a given symptom. GEPA extracts this knowledge automatically from a set of verifiable training tasks and encodes it into transferable skill documents.
Key Claims
- Massive performance gains: Up to +69 percentage points on repository-specific coding benchmarks when applied to weaker models.
- Cross-model transfer: Skills learned using gpt-5-mini improve Claude Haiku 4.5 and Claude Sonnet 4.5 without re-training.
- Cross-harness transfer: Skills learned in one agent framework (e.g., Moatless Tools) transfer to entirely different harness architectures.
- Speed improvements: Skilled agents solve tasks faster (reduced wall-clock time), suggesting that skills reduce wasted exploration.
- Data pipeline: SWE-smith generates ~300 verifiable tasks per GitHub repository, providing the evaluation substrate for evolutionary search.
- +69pp: Bleve (Go) with gpt-5-mini (24% -> 93%)
- +27pp: Jinja (Python) with gpt-5-mini (55% -> 82%)
- 100%: Claude Sonnet 4.5 on Bleve (from a 94.8% baseline)
- -41% duration: Claude Sonnet 4.5 on Bleve (285s -> 169s)
2 Motivation and Problem Setting
The Repository Knowledge Gap
Modern LLMs, despite their impressive generalization, consistently underperform on repository-specific tasks compared to developers with domain familiarity. This gap manifests in several ways:
- Navigation inefficiency: Agents waste significant tokens exploring irrelevant files because they lack a mental model of the repository structure.
- Convention violations: Code patches that are semantically correct may fail tests because they violate project-specific conventions (naming, error handling patterns, import orders).
- Tool misuse: Agents use generic debugging strategies instead of project-specific ones (e.g., running the wrong test suite, misinterpreting custom error formats).
- Retry spirals: Without knowledge of common failure modes, agents enter costly retry loops, exhausting their token budgets on dead-end approaches.
Why Not Fine-Tune?
The natural response to this knowledge gap is to fine-tune the LLM on repository-specific data. However, fine-tuning has critical disadvantages:
Fine-tuning limitations:
- Requires access to model weights (not available for proprietary APIs like Claude, GPT-5).
- Risk of catastrophic forgetting of general coding abilities.
- Expensive to maintain as the repository evolves.
- Non-transferable: a fine-tuned gpt-5-mini cannot help Claude Sonnet.
- Difficult to inspect and debug what was learned.
Skills as an Alternative
GEPA instead operates at the prompt level. Skills are structured text documents that are prepended to the agent's system prompt. This approach is:
- Model-agnostic: Works with any LLM that accepts a system prompt.
- Inspectable: Skills are human-readable; engineers can audit, edit, or override them.
- Composable: Multiple skills can be combined for different contexts.
- Cheap to iterate: No GPU cluster needed; skill search uses only API calls.
- Transferable: Skills learned on one model improve different models.
Formal Problem Statement
Given a repository R, an agent harness H, and a base LLM M, find a skill document S* such that:
$$ S^* = \arg\max_{S} \; \mathbb{E}_{t \sim \mathrm{Tasks}(R)}\big[\, \mathrm{resolve}(H(M, S), t) \,\big] $$
where resolve(H(M, S), t) = 1 if agent harness H, powered by model M and equipped with skill S, correctly resolves task t, and 0 otherwise.
The search space is the set of all natural-language documents up to a token budget (typically 2000-4000 tokens). The fitness function is the empirical resolve rate on a held-out validation set. The optimization method is evolutionary: populations of skill candidates are maintained, evaluated, reflected upon, and mutated.
3 The GEPA optimize_anything API
At the heart of the system is the optimize_anything API -- a general-purpose evolutionary
optimization interface that can optimize any text-representable artifact using LLM-based evaluation
and reflection. While the paper focuses on coding agent skills, the API is designed to be domain-agnostic.
API Signature
Python -- GEPA Core API
class GEPA:
    def optimize_anything(
        self,
        artifact_type: str,                   # e.g., "skill", "prompt", "config"
        initial_population: List[str],        # seed artifacts (can be empty)
        evaluate_fn: Callable[[str], float],  # fitness function
        reflect_model: str,                   # LLM for reflection/proposal
        num_generations: int = 10,
        population_size: int = 5,
        mutation_strategies: List[str] = ["refine", "combine", "simplify"],
        selection_method: str = "tournament",
        crossover_rate: float = 0.3,
        elite_count: int = 1,
    ) -> OptimizationResult:
        """
        Evolutionary search over text artifacts using LLM-based
        reflection and proposal.

        Returns the best artifact found, along with the full
        evolutionary trace for analysis.
        """
        ...
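As a usage sketch, a call might be wired as follows. The toy `evaluate_skill` fitness below is an illustrative stand-in (it scores a skill by keyword coverage); a real `evaluate_fn` would run the agent on a task batch and return the resolve rate, and the commented-out `GEPA()` call assumes the API shown above.

```python
from typing import List

# Hypothetical stand-in topics a useful skill should cover; in a real
# run the fitness would be the agent's resolve rate on a task batch.
USEFUL_TOPICS = ["testing", "architecture", "debugging", "style"]

def evaluate_skill(skill: str) -> float:
    """Toy fitness: fraction of useful topics the skill mentions."""
    text = skill.lower()
    hits = sum(1 for topic in USEFUL_TOPICS if topic in text)
    return hits / len(USEFUL_TOPICS)

seed_skills: List[str] = [
    "## Testing\nRun pytest -x on the affected test file only.",
    "## Architecture\nStart exploration in the core module.",
]

# With a GEPA instance (API above), the call would look like:
# result = GEPA().optimize_anything(
#     artifact_type="skill",
#     initial_population=seed_skills,
#     evaluate_fn=evaluate_skill,
#     reflect_model="gpt-5-mini",
# )
```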
Key Design Principles
1. Artifact Agnosticism
The API treats the optimized artifact as an opaque string. The evaluate_fn is the
sole interface between the optimization loop and the domain. For skill learning, the evaluation
function runs the agent with the candidate skill on a task set and returns the resolve rate.
For other applications, it could evaluate prompt quality, configuration effectiveness, or any
other measurable property of a text artifact.
2. LLM-as-Mutator
Unlike traditional evolutionary algorithms that use random mutations, GEPA uses an LLM (the reflection model) to propose informed mutations. The reflection model receives:
- The current artifact and its fitness score
- The artifacts from the current population with their scores
- Detailed evaluation traces (which tasks succeeded/failed and why)
- A mutation strategy directive (refine, combine, or simplify)
3. Evaluation Trace Feedback
Critically, the evaluation function returns not just a scalar fitness score but also structured traces -- per-task success/failure information, error messages, and (optionally) agent trajectory summaries. This rich feedback enables the reflection model to make targeted improvements rather than blind mutations.
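A minimal sketch of what such structured traces could look like as Python dataclasses. The paper specifies per-task success/failure, error messages, and optional trajectory summaries but not an exact schema, so the field names here are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskTrace:
    task_id: str
    resolved: bool
    error_message: Optional[str] = None       # test failure output, if any
    trajectory_summary: Optional[str] = None  # optional agent-step digest

@dataclass
class EvaluationResult:
    fitness: float                 # mean resolve rate over the batch
    traces: List[TaskTrace] = field(default_factory=list)

    def failed(self) -> List[TaskTrace]:
        """Failure traces are what the reflection model mines for fixes."""
        return [t for t in self.traces if not t.resolved]

# Example batch result over two (hypothetical) Jinja tasks.
result = EvaluationResult(
    fitness=0.5,
    traces=[
        TaskTrace("jinja-0012", True),
        TaskTrace("jinja-0013", False, error_message="TemplateSyntaxError"),
    ],
)
```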
Mutation Strategies
| Strategy | Description | Typical Usage |
|---|---|---|
| refine | Analyze failures of the current best artifact and propose targeted fixes | Most common; used when clear failure patterns exist |
| combine | Merge strengths of two parent artifacts via crossover | When the population has diverse but complementary skills |
| simplify | Remove unnecessary content; reduce prompt length while maintaining performance | Late-stage optimization; reducing token overhead |
| specialize | Add detailed instructions for specific failure categories | When a few hard tasks remain unsolved |
| generalize | Abstract specific instructions into broader principles | Improving transfer to unseen tasks |
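A generation-dependent schedule over these strategies can be sketched as follows. The paper does not give a concrete scheduling rule, so the thresholds below (refine early, combine mid-run, simplify late) are assumptions consistent with the typical-usage column.

```python
def choose_strategy(generation: int, max_gen: int = 10) -> str:
    """Illustrative schedule: the thresholds are assumptions, not the paper's."""
    if generation < max_gen // 2:
        return "refine"       # early: fix clear failure patterns
    elif generation < int(max_gen * 0.8):
        return "combine"      # mid: merge complementary parents
    else:
        return "simplify"     # late: cut token overhead
```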
Evolutionary Loop
Generation 0: [S_1, S_2, ..., S_k] (initial population, possibly random/empty)
|
v
+-------------------+
| Evaluate each | ---> Run agent with S_i on task set
| S_i on tasks | ---> Collect resolve rates + traces
+-------------------+
|
v
+-------------------+
| Select parents | ---> Tournament selection (top-k)
| (elitism: keep | ---> Best artifact always survives
| best S*) |
+-------------------+
|
v
+-------------------+
| Reflect + Mutate | ---> LLM analyzes traces
| (LLM proposer) | ---> Proposes new artifacts
+-------------------+
|
v
Generation 1: [S*_elite, S'_1, S'_2, ..., S'_{k-1}]
|
...
v
Generation N: Return best S* across all generations
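The loop above can be condensed into a runnable Python sketch. The LLM-based reflect/mutate step is abstracted into a caller-supplied `propose_fn`; the toy fitness and proposal functions at the bottom are placeholders so the loop runs end to end, not part of the paper's method.

```python
from typing import Callable, List, Tuple

def evolve(
    population: List[str],
    evaluate_fn: Callable[[str], float],
    propose_fn: Callable[[List[Tuple[str, float]]], List[str]],
    num_generations: int = 5,
) -> str:
    """Generational loop with elitism, mirroring the diagram above."""
    best, best_fit = None, -1.0
    for _ in range(num_generations):
        # Evaluate each candidate and sort by fitness, best first.
        scored = [(s, evaluate_fn(s)) for s in population]
        scored.sort(key=lambda x: x[1], reverse=True)
        if scored[0][1] > best_fit:
            best, best_fit = scored[0]
        # Elitism: the best artifact always survives; the rest of the
        # population is replaced by proposed mutations.
        population = [scored[0][0]] + propose_fn(scored)[: len(population) - 1]
    return best

# Toy instantiation: fitness rewards longer strings, proposals append a token.
fitness = lambda s: len(s) / 100.0
propose = lambda scored: [s + "!" for s, _ in scored]
winner = evolve(["a", "bb"], fitness, propose, num_generations=3)
```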
4 SWE-smith Data Pipeline
A critical enabler for GEPA is the ability to generate large numbers of verifiable coding tasks for arbitrary repositories. The SWE-smith pipeline addresses this need by mining historical pull requests and commits to create task instances with ground-truth test oracles.
Pipeline Overview
GitHub Repository
|
v
+-------------------+ +-------------------+ +-------------------+
| Commit Mining | --> | Task Extraction | --> | Test Validation |
| - PR merges | | - Diff isolation | | - Test runs |
| - Bug fixes | | - Context window | | - Oracle check |
| - Feature adds | | - Issue linking | | - Dedup + filter |
+-------------------+ +-------------------+ +-------------------+
|
v
+-------------------+
| Task Splitting |
| 200 train |
| 50 validation |
| 60 test |
+-------------------+
Task Generation Process
1. Commit Mining: SWE-smith scans the repository's git history for commits that (a) modify source code files, (b) have associated test changes or existing tests that exercise the modified code, and (c) have clear commit messages or linked issues describing the intent.
2. Task Extraction: For each qualifying commit, SWE-smith creates a task instance consisting of: the repository state before the commit (the "base" state), a natural-language description of the problem (derived from the commit message and/or linked issue), and the set of tests that the patch must pass (the "oracle").
3. Test Validation: Each task is validated by running the test oracle against both the base state (it should fail) and the patched state (it should pass). Tasks where the oracle does not cleanly distinguish the base from the patched state are discarded.
4. Difficulty Calibration: Tasks are annotated with difficulty metrics based on diff size, number of files modified, test complexity, and (optionally) baseline agent performance. This enables stratified sampling for training and evaluation.
5. Split Allocation: The ~300 tasks per repository are split into 200 training tasks (used to evaluate skill candidates during evolutionary search), 50 validation tasks (used for early stopping and hyperparameter selection), and 60 test tasks (held out for final evaluation, never seen during skill learning).
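The test-validation check (step 3) can be sketched as follows. The `git`/`pytest` command lines and helper names are illustrative assumptions; SWE-smith's actual harness is not specified at this level of detail.

```python
import subprocess
from dataclasses import dataclass
from typing import List

@dataclass
class TaskCandidate:
    repo_dir: str
    base_commit: str
    patched_commit: str
    oracle_tests: List[str]  # e.g., ["tests/test_lexer.py::test_ws"]

def run_oracle(repo_dir: str, commit: str, tests: List[str]) -> bool:
    """Check out `commit` and return True if the oracle tests pass there."""
    subprocess.run(["git", "-C", repo_dir, "checkout", "-q", commit], check=True)
    proc = subprocess.run(["python", "-m", "pytest", "-q", *tests], cwd=repo_dir)
    return proc.returncode == 0

def oracle_is_valid(passes_on_base: bool, passes_on_patched: bool) -> bool:
    """A valid oracle fails before the fix and passes after it."""
    return (not passes_on_base) and passes_on_patched

def validate(task: TaskCandidate) -> bool:
    # Keep the task only if the oracle cleanly separates the two states.
    return oracle_is_valid(
        run_oracle(task.repo_dir, task.base_commit, task.oracle_tests),
        run_oracle(task.repo_dir, task.patched_commit, task.oracle_tests),
    )
```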
Benchmark Repositories
| Repository | Language | Domain | Tasks Generated | Train / Val / Test |
|---|---|---|---|---|
| Jinja | Python | Template engine | ~310 | 200 / 50 / 60 |
| Bleve | Go | Full-text search | ~290 | 200 / 50 / 60 (approx.) |
Why ~300 tasks per repo? The authors found this to be a sweet spot: enough tasks to provide statistically reliable fitness estimates during evolutionary search (even with 50-task evaluation batches, the standard error on resolve rate is manageable), while being feasible to generate from most actively maintained repositories with >1000 commits.
Mini-SWE Benchmark
The paper introduces Mini-SWE, a benchmark suite consisting of the SWE-smith-generated test splits across multiple repositories. Mini-SWE serves as a controlled evaluation environment where the test tasks are guaranteed to be unseen during skill learning. This addresses a key concern with benchmark contamination: because skills are learned on the training split and evaluated on the held-out test split, overfitting to specific tasks is mitigated.
5 The gskill Skill Learning Pipeline
gskill is the instantiation of GEPA's optimize_anything API for the
specific domain of coding agent skills. It orchestrates the end-to-end process from skill
initialization through evolutionary refinement to final evaluation.
Pipeline Architecture
gskill Pipeline
=====================================================
INPUT: Repository R, Agent Harness H, Base LLM M
Training tasks T_train, Validation tasks T_val
STEP 1: INITIALIZATION
+--------------------------------------------------+
| Option A: Empty skill (blank slate) |
| Option B: Human-written seed skill |
| Option C: LLM-generated seed from repo README |
+--------------------------------------------------+
|
v
STEP 2: EVALUATION (per generation)
+--------------------------------------------------+
| For each skill S_i in population: |
| For each task t_j in T_train (subset): |
| agent_output = H(M, system_prompt + S_i, t_j)|
| result_j = run_tests(agent_output, t_j) |
| fitness(S_i) = mean(result_j for j in batch) |
| traces(S_i) = {(t_j, result_j, logs_j)} |
+--------------------------------------------------+
|
v
STEP 3: REFLECTION (LLM Proposer)
+--------------------------------------------------+
| reflection_model.propose( |
| current_best = S*, |
| population = [(S_i, fitness_i, traces_i)], |
| strategy = choose_strategy(generation), |
| ) --> [S'_1, S'_2, ..., S'_k] |
+--------------------------------------------------+
|
v
STEP 4: SELECTION + ELITISM
+--------------------------------------------------+
| new_population = [S*] + top_k(S'_1..S'_k) |
| (elite always survives) |
+--------------------------------------------------+
|
v
STEP 5: REPEAT or TERMINATE
+--------------------------------------------------+
| If generation < max_gen and fitness improving: |
| Go to STEP 2 |
| Else: |
| Evaluate S* on T_val for final selection |
+--------------------------------------------------+
Initialization Strategies
The choice of initialization significantly affects convergence speed but has limited impact on final performance (given enough generations). The paper explores three strategies:
Empty Initialization
Start with a blank skill document. The first generation of reflection generates skills entirely from the evaluation traces. This is the most general approach but requires more generations to converge (~15-20 generations vs. 8-12 with seeding).
Human-Seeded Initialization
Start from a seed skill written by a developer familiar with the repository (Option B in the pipeline diagram above), which the evolutionary search then refines.
LLM-Seeded Initialization
Use an LLM to generate initial skills from the repository's README, documentation, and directory structure. This provides a warm start with basic project knowledge, reducing the number of generations needed for convergence.
Evaluation Strategy
Evaluating a skill candidate requires running the full agent pipeline on multiple tasks, which is computationally expensive. The paper employs several strategies to manage this cost:
- Batch evaluation: Each generation evaluates skills on a random subset of training tasks (typically 30-50 tasks) rather than the full 200-task training set.
- Parallel execution: Task evaluations within a batch run concurrently, limited by API rate limits rather than sequential processing.
- Early termination: If a skill candidate fails on the first N tasks in a batch with a score significantly below the current best, evaluation is stopped early.
- Caching: Task results are cached per (skill_hash, task_id) pair; if the same skill is re-evaluated (e.g., due to elitism), cached results are reused.
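These cost controls can be sketched together in one evaluation routine. The cache key, early-termination threshold, and helper signatures below are assumptions for illustration; the paper names the techniques but not their exact parameters.

```python
from typing import Callable, Dict, List, Tuple

# Cache keyed by (skill_hash, task_id), as described above.
_cache: Dict[Tuple[int, str], bool] = {}

def evaluate_batch(
    skill: str,
    task_ids: List[str],
    run_agent: Callable[[str, str], bool],  # (skill, task_id) -> resolved?
    best_so_far: float,
    early_n: int = 10,     # assumed: min tasks before early termination
    margin: float = 0.3,   # assumed: gap below best that triggers a stop
) -> float:
    skill_key = hash(skill)
    results: List[bool] = []
    for i, tid in enumerate(task_ids):
        key = (skill_key, tid)
        if key not in _cache:            # cache miss: actually run the agent
            _cache[key] = run_agent(skill, tid)
        results.append(_cache[key])
        # Early termination: stop once the running resolve rate trails the
        # current best by more than `margin` after at least `early_n` tasks.
        if i + 1 >= early_n:
            rate = sum(results) / len(results)
            if rate < best_so_far - margin:
                return rate
    return sum(results) / len(results)
```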
Reflection Prompt Structure
Reflection Model Prompt Template
# System prompt for the reflection/proposer LLM
You are an expert at improving coding agent skills. You will be given:
1. The current best skill document and its resolve rate
2. Detailed traces from tasks the agent FAILED on
3. Detailed traces from tasks the agent SUCCEEDED on
Your job is to propose an improved skill document that:
- Addresses the specific failure modes observed in the traces
- Preserves the strategies that led to successes
- Remains general enough to help on unseen tasks
- Stays within the token budget of {max_tokens} tokens
## Current Best Skill (resolve rate: {fitness}%)
{current_skill}
## Failed Task Traces
{failed_traces}
## Successful Task Traces
{success_traces}
## Mutation Strategy: {strategy}
{strategy_instructions}
Propose an improved skill document:
Convergence Behavior
Typical gskill runs exhibit the following convergence pattern:
- Generations 1-3: Rapid improvement as basic repository knowledge is encoded (file structure, test commands, common patterns).
- Generations 4-7: Moderate improvement as subtler patterns are captured (error handling idioms, edge case strategies).
- Generations 8-12: Diminishing returns; improvements come from fine-tuning language and addressing specific hard tasks.
- Generations 12+: Plateau; further iterations risk overfitting to the training task batch.
6 Skill Format and Structure
Skills are structured natural-language documents with specific sections that encode different types of repository knowledge. The paper does not prescribe a rigid format, but the evolutionary process consistently converges on documents with recognizable structural patterns.
Typical Skill Structure
Example Skill Document -- Bleve (Go)
# Repository Skill: Bleve Full-Text Search Engine
## Project Overview
Bleve is a full-text search and indexing library for Go. The codebase
is organized around index types (scorch, upsidedown), analysis pipelines,
and search query types.
## Key Architecture Patterns
- All index implementations satisfy the Index interface in index/index.go
- Analysis chains: char filter -> tokenizer -> token filter
- Search queries implement the Query interface with Searcher() method
- The mapping package defines how documents map to index fields
## Common Bug Patterns and Fixes
1. When fixing search relevance issues, check the scoring logic in
search/scorer/ -- most bugs involve TF-IDF weight calculation
2. Index corruption bugs usually trace to the segment merger in
index/scorch/merge.go
3. Analysis pipeline bugs: check that custom analyzers register all
components in registry/
## Testing Strategy
- Run specific tests with: go test ./... -run TestName
- Integration tests are in test/ directory
- Most packages have table-driven tests; match the existing pattern
- ALWAYS run the specific test file, not the entire suite
## Debugging Heuristics
- If a search returns no results, check the mapping first (mapping/)
- If indexing panics, check for nil pointer in document.Fields
- For performance issues, profile the segment merge path
- Error "unknown field" usually means the mapping is incomplete
## Code Style Requirements
- Error handling: always wrap with fmt.Errorf("context: %w", err)
- No bare returns; always explicit return values
- Use table-driven tests matching existing patterns in *_test.go
- Comments follow Go conventions: start with function name
Skill Sections Analysis
| Section | Purpose | Impact on Agent Behavior |
|---|---|---|
| Project Overview | Orients the agent to the codebase domain and purpose | Reduces time spent on initial exploration; agent navigates to relevant areas faster |
| Architecture Patterns | Describes key abstractions and their relationships | Agent makes correct assumptions about where to find and modify code |
| Bug Patterns | Catalogs common failure modes with root cause mappings | Drastically reduces debugging time; agent jumps to likely cause instead of searching |
| Testing Strategy | Specifies how to run tests, test file locations, test conventions | Agent runs correct tests on first attempt; avoids running irrelevant test suites |
| Debugging Heuristics | Maps symptoms to likely causes | Symptom-to-fix shortcuts that bypass exploratory debugging cycles |
| Code Style | Project-specific conventions and patterns | Patches conform to project style, reducing test failures from convention violations |
Skill Token Budget
The paper experiments with skill lengths of 500-4000 tokens. The empirically optimal range is 1500-2500 tokens: shorter skills miss important patterns, while longer skills introduce noise that can confuse the agent or consume too much of the context window.
Observation: The evolutionary process naturally tends toward concise skills. Early generations produce verbose documents, but the simplify mutation strategy and competitive selection pressure drive convergence toward information-dense, non-redundant skill documents. This is a form of emergent compression in the evolutionary search.
Jinja Skill Example (Excerpt)
Example Skill Document -- Jinja (Python)
# Repository Skill: Jinja Template Engine
## Core Architecture
Jinja compiles templates to Python code via: source -> lexer -> parser -> AST -> codegen
- Lexer: jinja2/lexer.py (tokenizes template syntax)
- Parser: jinja2/parser.py (builds AST from tokens)
- Compiler: jinja2/compiler.py (generates Python from AST)
- Environment: jinja2/environment.py (central configuration object)
## Critical: Template Compilation Pipeline
Most bugs affect one stage of: lex -> parse -> compile -> execute
Identify which stage is broken BEFORE attempting a fix.
Symptoms by stage:
- Lex errors: TemplateSyntaxError with unexpected character
- Parse errors: TemplateSyntaxError with unexpected tag/token
- Compile errors: Usually silent; manifests as wrong output
- Runtime errors: UndefinedError, TypeError during render
## Testing
- pytest tests/ -x -k "test_name" (NEVER run full suite)
- Test files mirror source: test_lexer, test_parser, etc.
- Use Environment(undefined=StrictUndefined) to catch missing vars
- Regression tests in tests/test_regression.py
7 Mathematical Formulation
Skill Optimization as Black-Box Search
Formally, let S denote the space of all valid skill documents (strings up to a maximum token length L). The objective function is:
$$ f(S) = \frac{1}{|T|} \sum_{t \in T} \mathrm{resolve}\big(A_{H,M}(S),\, t\big) $$
where A_{H,M}(S) denotes the agent using harness H, model M, and skill S, and resolve is the binary indicator of task resolution. The function f is black-box (no gradients), stochastic (agent behavior is non-deterministic), and expensive to evaluate (each evaluation requires running an LLM agent end-to-end).
Evolutionary Search Formalization
GEPA maintains a population P^(g) = {S_1^(g), ..., S_k^(g)} at generation g. The update rule is:
$$ P^{(g+1)} = \mathrm{Elite}\big(P^{(g)}\big) \,\cup\, \mathrm{Propose}\Big(\mathrm{Reflect}\big(P^{(g)}, \mathrm{Traces}^{(g)}\big)\Big) $$
where:
- Elite(P^(g)) = { argmax_{S in P^(g)} f(S) } -- the single best artifact is always preserved (elitism).
- Traces^(g) = {(t, result, log) : t in T_batch} -- structured evaluation traces from the current generation.
- Reflect -- an LLM-based analysis function that maps (population, traces) to a natural-language reflection document identifying patterns, failure modes, and improvement opportunities.
- Propose -- an LLM-based generation function that produces k-1 new candidate artifacts conditioned on the reflection.
Fitness Estimation Under Stochasticity
Because LLM agent behavior is stochastic (sampling temperature > 0), a single evaluation of f(S) on task t is a Bernoulli random variable. The estimated fitness over a batch of n tasks has variance:
$$ \mathrm{Var}\big[\hat{f}(S)\big] = \frac{f(S)\,\big(1 - f(S)\big)}{n} $$
For f(S) = 0.7 and n = 50, this gives a standard deviation of ~0.065, or about 6.5 percentage points. This noise level is acceptable for tournament selection (distinguishing between candidates with >10pp difference) but insufficient for fine-grained ranking. The authors address this by:
- Using relatively large evaluation batches (50 tasks per generation).
- Evaluating the final best candidate on the full validation set (50 tasks, multiple runs).
- Reporting test-set performance with confidence intervals.
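A quick check of the noise figure quoted above, using the Bernoulli variance formula:

```python
import math

def resolve_rate_stderr(f: float, n: int) -> float:
    """Standard error of the estimated resolve rate over n Bernoulli trials."""
    return math.sqrt(f * (1.0 - f) / n)

# For f(S) = 0.7 on a 50-task batch: ~0.065, i.e. ~6.5 percentage points.
se = resolve_rate_stderr(0.7, 50)
```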
Computational Cost Model
The total cost of a gskill run can be modeled as:
$$ \mathrm{Cost} = G \cdot K \cdot B \cdot \big(C_{\mathrm{agent}} + C_{\mathrm{test}}\big) + G \cdot C_{\mathrm{reflect}} $$
where:
- G = number of generations (typically 10-15)
- K = population size (typically 3-5)
- B = batch size per evaluation (typically 30-50 tasks)
- C_agent = cost per agent run (depends on model; ~$0.01-0.10 for gpt-5-mini)
- C_test = cost of test execution (compute, typically negligible vs. LLM cost)
- C_reflect = cost of reflection LLM call (single call per generation, ~$0.05)
For typical parameters (G=10, K=5, B=40, Cagent=$0.02), total cost is approximately:
$$ \mathrm{Cost} = 10 \times 5 \times 40 \times \$0.02 + 10 \times \$0.05 = \$40.50 $$
This is a one-time cost per repository, amortized over all future agent invocations. Compared to fine-tuning (which requires GPU hours costing hundreds to thousands of dollars), skill learning is orders of magnitude cheaper.
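The arithmetic can be verified directly from the cost model:

```python
def gskill_cost(G: int, K: int, B: int, c_agent: float,
                c_test: float, c_reflect: float) -> float:
    """Total skill-learning cost: agent runs plus one reflection per generation."""
    return G * K * B * (c_agent + c_test) + G * c_reflect

# Typical parameters from the text: G=10, K=5, B=40, C_agent=$0.02.
total = gskill_cost(G=10, K=5, B=40, c_agent=0.02, c_test=0.0, c_reflect=0.05)
# total == 40.5, matching the $40.50 figure above
```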
8 LLM Hierarchy and Cost Architecture
A key architectural insight in GEPA is the deliberate use of an LLM hierarchy: skills are trained on a cheap, fast model and then transferred to expensive, capable models. This creates a two-tier system where the optimization cost is minimized while the deployment benefit is maximized.
The Hierarchy
TRAINING PHASE DEPLOYMENT PHASE (Transfer)
============================ ================================
+-------------------+ +------------------------+
| gpt-5-mini | -- skill --> | Claude Haiku 4.5 |
| (cost-efficient) | transfer | (mid-tier deployment) |
| ~$0.02/task | | +19pp on Bleve |
+-------------------+ +------------------------+
| |
| Skill S* |
| learned here v
| +------------------------+
| | Claude Sonnet 4.5 |
| | (high-tier deployment) |
| | +5.2pp on Bleve |
| | -41% duration |
+-------------------+ +------------------------+
|
v
Reflection Model
(gpt-5-mini or gpt-5)
- Analyzes traces
- Proposes mutations
- ~$0.05/generation
Why Train on Weak Models?
The "train-weak, deploy-strong" principle: Skills learned on weaker models transfer to stronger models because skills encode domain knowledge, not reasoning ability. A weaker model that knows which file to look at and which test to run can solve a task that a stronger model without that knowledge cannot. The knowledge is model-independent; the reasoning that applies it is model-dependent.
This principle has a practical corollary: the cost of skill learning is determined by the cheapest model in the hierarchy, while the benefit accrues to the most expensive model. A skill learning run costing ~$40 on gpt-5-mini can improve Claude Sonnet 4.5 performance by 5+ percentage points -- each Sonnet invocation costing ~10-50x more than a mini invocation.
Model Cost Comparison
| Model | Role | Approx. Cost/Task | Baseline Resolve Rate (Bleve) | With Skill |
|---|---|---|---|---|
| gpt-5-mini | Training + Deployment | ~$0.02 | 24% | 93% (+69pp) |
| Claude Haiku 4.5 | Transfer Deployment | ~$0.08 | 79.3% | 98.3% (+19pp) |
| Claude Sonnet 4.5 | Transfer Deployment | ~$0.30 | 94.8% | 100% (+5.2pp) |
Reflection Model Selection
The reflection/proposer model -- the LLM that analyzes evaluation traces and proposes skill mutations -- can be the same as the training model (gpt-5-mini) or a more capable model. The paper finds that using a stronger reflection model (e.g., gpt-5) slightly improves convergence speed but does not significantly affect final skill quality. This is because the quality of skills is ultimately bounded by the evaluation signal, not the reflection model's reasoning ability.
9 Experimental Results
Mini-SWE Benchmark Results
All results are reported on the held-out test split (60 tasks) that was never used during skill learning. Results are averaged over 3 runs with different random seeds.
Jinja (Python Template Engine)
| Model | Baseline | With Skill | Improvement | Avg. Duration |
|---|---|---|---|---|
| gpt-5-mini | 55% | 82% | +27pp | Reduced ~20% |
| Claude Haiku 4.5 | 72% | 88% | +16pp | Reduced ~15% |
| Claude Sonnet 4.5 | 89% | 95% | +6pp | Reduced ~25% |
Bleve (Go Full-Text Search)
| Model | Baseline | With Skill | Improvement | Avg. Duration (s) |
|---|---|---|---|---|
| gpt-5-mini | 24% | 93% | +69pp | -- |
| Claude Haiku 4.5 | 79.3% | 98.3% | +19pp | 173s -> 142s (-18%) |
| Claude Sonnet 4.5 | 94.8% | 100% | +5.2pp | 285s -> 169s (-41%) |
Key Result Patterns
Pattern 1: Inversely proportional gains. The weaker the baseline model, the larger the absolute improvement from skills. gpt-5-mini gains +69pp on Bleve while Sonnet gains +5.2pp. This makes intuitive sense: weaker models have more "knowledge gaps" that skills can fill, while stronger models already know much of what skills encode.
Pattern 2: Universal speed improvement. Even when the resolve rate improvement is modest (Sonnet: +5.2pp), the duration reduction is substantial (-41%). Skills help agents navigate directly to the right solution, eliminating wasted exploration. This is particularly valuable for high-cost models where token usage translates directly to cost.
Pattern 3: Near-ceiling performance. On Bleve, Claude Sonnet 4.5 with skills achieves 100% resolve rate on the test set. While this is impressive, it also raises the question of benchmark saturation -- the Mini-SWE Bleve test set may be insufficiently challenging for frontier models with skills. The Jinja benchmark, with a 95% ceiling, retains more headroom.
Per-Task Analysis
Examining which tasks flip from fail to pass reveals the mechanisms through which skills help:
| Failure Category (Baseline) | % of Failures | Resolved by Skill? | Mechanism |
|---|---|---|---|
| Wrong file edited | 34% | Yes (90%+) | Architecture section directs agent to correct module |
| Test command error | 22% | Yes (95%+) | Testing section specifies exact commands |
| Style/convention violation | 18% | Yes (80%+) | Code style section prevents common violations |
| Incorrect logic | 15% | Partial (40%) | Bug patterns help, but some require genuine reasoning |
| Context window exceeded | 7% | Yes (70%) | Efficiency gains from reduced exploration |
| Fundamental misunderstanding | 4% | Rarely | Skills cannot substitute for missing reasoning |
10 Cross-Model Transfer Analysis
Perhaps the most surprising and practically important finding is that skills transfer across LLM models. A skill learned using gpt-5-mini as the base model improves Claude Haiku 4.5 and Claude Sonnet 4.5 without any adaptation. This section analyzes why and how this works.
Transfer Results Summary
Skill trained on gpt-5-mini (Bleve, Go)
|
|-- Applied to gpt-5-mini: 24% --> 93% (+69pp) [same model]
|-- Applied to Claude Haiku: 79.3% --> 98.3% (+19pp) [cross-model]
|-- Applied to Claude Sonnet: 94.8% --> 100% (+5.2pp) [cross-model]
|
| Duration improvements on transfer targets:
|-- Claude Haiku: 173s --> 142s (-18%)
|-- Claude Sonnet: 285s --> 169s (-41%)
Why Transfer Works
The transferability of skills can be understood through the lens of knowledge types:
| Knowledge Type | Model-Dependent? | Encoded in Skills? | Transfer Expectation |
|---|---|---|---|
| Repository structure (which files exist, what they do) | No | Yes | Transfers perfectly |
| Testing conventions (how to run tests) | No | Yes | Transfers perfectly |
| Bug pattern mappings (symptom -> likely cause) | No | Yes | Transfers well |
| Code style conventions | No | Yes | Transfers well |
| Reasoning strategies (how to decompose a problem) | Partially | Partially | Transfers moderately |
| Prompt-following ability | Yes | No | N/A (model-intrinsic) |
Key insight: The majority of information in evolved skills is factual domain knowledge (repository structure, testing commands, bug patterns) rather than reasoning strategies. Factual knowledge is model-independent: it is equally useful whether the reader is gpt-5-mini or Claude Sonnet. This is why skills transfer so well -- they are primarily knowledge documents, not reasoning guides.
Cross-Harness Transfer
The paper also demonstrates that skills transfer across different agent harnesses (e.g., from Moatless Tools to a custom harness). This works because skills are prepended to the system prompt, which is a universal interface across harness architectures. The skill content (domain knowledge) is harness-agnostic; only the format in which the agent applies the knowledge (tool calls, file operations) is harness-specific.
Transfer Efficiency Analysis
An interesting question is whether skills are optimally transferred or whether model-specific skill learning would yield better results. The paper provides limited evidence on this, but the results suggest:
- For weaker target models, transferred skills are nearly as good as model-specific skills because the main bottleneck is domain knowledge, which transfers perfectly.
- For stronger target models, there may be marginal gains from model-specific learning, particularly in formatting the skill to match the model's preferred instruction-following patterns.
- The cost-benefit analysis strongly favors transfer: the marginal gain from model-specific skill learning does not justify the 10-50x higher evaluation cost of running Claude Sonnet during training.
11 Core Mechanisms Deep Dive
Mechanism 1: Evolutionary Search over Natural Language
Traditional evolutionary algorithms operate over fixed-length numerical vectors or structured programs. GEPA's key innovation is applying evolutionary search to natural-language documents, using an LLM as the mutation operator. This creates a qualitatively different search dynamic:
- Informed mutations: Unlike random bit-flips, LLM-proposed mutations are semantically coherent. A mutation might add a new section on error handling patterns, rephrase an ambiguous instruction, or remove a section that is hurting generalization.
- Crossover via synthesis: When combining two parent skills, the LLM can intelligently merge them rather than performing random splicing. It identifies complementary sections and resolves conflicts.
- Adaptive search radius: The LLM can make both small (rephrasing) and large (restructuring) changes based on the reflection analysis, effectively adapting the mutation magnitude to the optimization landscape.
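The outer loop of this search can be sketched in a few lines of Python. Everything below is illustrative: `llm_mutate` is a stand-in for the LLM proposer, and the population/fitness interfaces are assumptions, not the paper's actual API.

```python
def llm_mutate(skill: str, feedback: str) -> str:
    """Stand-in for the LLM mutation operator: in the real system an LLM
    rewrites the skill document guided by reflection feedback."""
    return skill + f"\n<!-- revised per feedback: {feedback[:40]} -->"

def evolve(population, fitness_fn, feedback_fn, generations=5):
    """Evolutionary search over natural-language skill documents.

    population  : list of skill strings
    fitness_fn  : skill -> float (e.g. resolve rate on an eval batch)
    feedback_fn : skill -> str   (reflection text over failure traces)
    """
    for _ in range(generations):
        scored = sorted(population, key=fitness_fn, reverse=True)
        elite = scored[0]                          # elitism: keep the best
        children = [llm_mutate(parent, feedback_fn(parent))
                    for parent in scored[:len(population) - 1]]
        population = [elite] + children            # next generation
    return max(population, key=fitness_fn)
```

The structure is ordinary (mu+lambda)-style evolution; what is unusual is that the mutation operator is an LLM conditioned on evaluation feedback, so each step is an informed edit rather than a random perturbation.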
Mechanism 2: Reflective Proposer
The reflection step is what distinguishes GEPA from naive hill-climbing with LLM mutations. The reflection model receives structured evaluation traces and performs causal analysis:
Reflection Analysis Example
# Reflection output from the proposer model (real example, Bleve)
## Analysis of Failed Tasks
Pattern 1: 8/12 failures involve the analysis package
- The agent consistently looks in index/ for analysis-related bugs
- The correct location is analysis/ (separate package)
- Current skill mentions "analysis pipelines" but does not specify
the package location explicitly
Pattern 2: 3/12 failures due to incorrect test commands
- Agent runs `go test ./...` which times out on large repos
- Should run `go test ./specific/package -run TestName`
- Current skill says "run specific tests" but doesn't give syntax
Pattern 3: 1/12 failure is a genuine reasoning error
- Task requires understanding concurrent map access in Go
- Skill cannot easily help with this; it requires Go expertise
## Proposed Changes
1. Add explicit package map: analysis/ (not index/!) for analysis bugs
2. Add concrete test command syntax with package path
3. Add note about concurrent access patterns in Go maps
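The input to this reflection step, structured evaluation traces, can be assembled as sketched below. The trace schema (`task`, `passed`, `log` keys) is illustrative, not the paper's actual format.

```python
def build_reflection_prompt(skill: str, traces: list) -> str:
    """Assemble failed-task traces into a prompt for the reflection LLM.
    Each trace is assumed to be a dict with 'task', 'passed', and 'log'
    keys (hypothetical schema)."""
    failures = [t for t in traces if not t["passed"]]
    lines = [
        "You are analyzing a coding-agent skill document.",
        f"Current skill:\n{skill}",
        f"\n{len(failures)}/{len(traces)} tasks failed. Failure traces:",
    ]
    for t in failures:
        # Only the log tail is included to keep the prompt compact.
        lines.append(f"- Task: {t['task']}\n  Log tail: {t['log'][-200:]}")
    lines.append("\nIdentify failure patterns and propose skill edits.")
    return "\n".join(lines)
```

Feeding only failures (with aggregate counts) is what lets the reflection model do the kind of cross-task pattern analysis shown in the example above, rather than reasoning about one task at a time.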
Mechanism 3: Cross-Task Generalization
A critical question is whether skills overfit to the training tasks. The evaluation protocol (train/val/test splits) is designed to detect this, and the results show strong generalization: test-set performance closely tracks validation-set performance, suggesting that the repository knowledge encoded in skills generalizes to unseen tasks within the same repository.
The mechanism behind generalization is that skills encode structural knowledge about the repository (file layout, module responsibilities, testing infrastructure) rather than task-specific solutions. This structural knowledge is relevant to any task within the repository, not just the training tasks.
Generalization hypothesis: Skills act as a "compressed repository manual" that provides the agent with an accurate prior over the repository's structure and conventions. This prior reduces the agent's uncertainty about where to look and what to do, regardless of the specific task. The evolutionary search implicitly selects for general knowledge because task-specific knowledge has low marginal fitness (it helps on only one task out of the evaluation batch).
Mechanism 4: Emergent Skill Organization
Across multiple runs and repositories, evolved skills converge on similar organizational structures despite no explicit format specification. This emergent organization reflects the information structure that is most useful to coding agents:
- Orientation section (first ~20% of tokens): High-level project description that helps the agent form a mental model of the codebase.
- Navigation section (next ~30%): File/module map with responsibilities. This is consistently the highest-value section.
- Procedure section (next ~30%): How-to instructions for common tasks (running tests, debugging, making changes).
- Heuristics section (final ~20%): Pattern-matching rules mapping symptoms to causes and solutions.
Mechanism 5: Implicit Curriculum Learning
The evolutionary process exhibits a form of implicit curriculum learning. Early generations master "easy" improvements (basic repository knowledge) while later generations tackle harder improvements (subtle debugging patterns, edge case handling). This happens naturally because:
- Easy improvements have higher marginal fitness impact, so they are selected first.
- Once easy tasks are solved, the fitness gradient shifts toward harder tasks.
- The reflection model naturally focuses on remaining failures, which become progressively harder.
12 System Architecture and Diagrams
End-to-End System Architecture
+========================================================================+
| GEPA System Architecture |
+========================================================================+
+---------------------------+
| Repository (GitHub) |
| - Source code |
| - Git history |
| - Tests |
+-------------+-------------+
|
+-------------v-------------+
| SWE-smith Pipeline |
| - Mine commits/PRs |
| - Extract task instances |
| - Validate test oracles |
| - Split train/val/test |
+-------------+-------------+
|
+----------------+----------------+
| | |
+-----v-----+ +-----v-----+ +------v------+
| T_train | | T_val | | T_test |
| (200 tasks)| | (50 tasks)| | (60 tasks) |
+-----+------+ +-----+-----+ +------+------+
| | |
| | | (held out)
+============v================v====+ |
| gskill Pipeline | |
| +---------------------------+ | |
| | Skill Population | | |
| | [S_1, S_2, ..., S_k] | | |
| +------------+--------------+ | |
| | | |
| +------------v--------------+ | |
| | Evaluation Engine | | |
| | - Run agent per (S_i, t_j)| | |
| | - Collect results + traces| | |
| | - Compute fitness scores | | |
| +------------+--------------+ | |
| | | |
| +------------v--------------+ | |
| | Reflection LLM | | |
| | - Analyze failure traces | | |
| | - Identify patterns | | |
| | - Propose improvements | | |
| +------------+--------------+ | |
| | | |
| +------------v--------------+ | |
| | Selection + Elitism | | |
| | - Keep best S* | | |
| | - Tournament select rest | | |
| +---------------------------+ | |
| | |
| Output: Best Skill S* | |
+==============+====================+ |
| |
| Skill S* |
| |
+--------------v-------------------------------v----+
| Deployment / Evaluation |
| +---------------------------------------------+ |
| | Agent Harness H + LLM M + Skill S* | |
| | system_prompt = base_prompt + S* | |
| | For each task t in T_test: | |
| | output = Agent(M, system_prompt, t) | |
| | result = run_tests(output, t.oracle) | |
| +---------------------------------------------+ |
+---------------------------------------------------+
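The Deployment / Evaluation box at the bottom of the diagram amounts to a short loop. A minimal sketch, where `run_agent` and `run_tests` are placeholders for the harness+LLM and the test oracle:

```python
def evaluate_skill(skill, base_prompt, tasks, run_agent, run_tests):
    """Resolve rate of an agent equipped with skill S* on a task set.

    run_agent(system_prompt, task) -> patch  (harness + LLM, stubbed here)
    run_tests(patch, oracle)       -> bool   (test oracle, stubbed here)
    """
    # As in the diagram: system_prompt = base_prompt + S*
    system_prompt = base_prompt + "\n\n" + skill
    passed = sum(
        run_tests(run_agent(system_prompt, t), t["oracle"]) for t in tasks
    )
    return passed / len(tasks)
```

Note that the skill enters only through the system prompt; the harness, model, and test oracle are untouched, which is what makes the approach deployable on any existing agent stack.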
Agent Invocation Detail
+------------------------------------------------------------------+
| Agent Invocation with Skill |
+------------------------------------------------------------------+
Task Description t:
"Fix: search query with boost returns incorrect scores"
+
|
v
+-------------------------------------------------+
| System Prompt Construction |
| |
| [Base Agent Instructions] |
| You are a coding agent. You can read files, |
| edit files, and run commands... |
| |
| [Skill S* -- prepended] |
| # Repository Skill: Bleve Full-Text Search |
| ## Key Architecture Patterns |
| - Scoring logic in search/scorer/ |
| - Query types implement Query interface |
| ## Testing: go test ./search/... -run TestName |
| ## Debugging: boost bugs in search/scorer/ |
| |
| [Task Description] |
| Fix the following issue: search query with |
| boost returns incorrect scores... |
+-------------------------------------------------+
|
v
+-------------------------------------------------+
| Agent Execution Loop |
| |
| 1. Read search/scorer/scorer.go |
| (directed by skill: "scoring logic here") |
| 2. Identify bug in boost weight calculation |
| 3. Edit scorer.go: fix weight multiplication |
| 4. Run: go test ./search/scorer -run TestBoost |
| (directed by skill: exact test command) |
| 5. Tests pass -> submit patch |
+-------------------------------------------------+
|
v
Result: PASS (correct patch, tests satisfied)
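The prompt construction in the diagram can be written out explicitly. The chat-message schema below is the common OpenAI-style format, used here as an assumption; harnesses differ in the exact wire format.

```python
def build_messages(base_instructions: str, skill: str, task: str) -> list:
    """Assemble the chat messages as in the diagram: base agent
    instructions, then the skill document, then the task description."""
    system_prompt = "\n\n".join([base_instructions, skill])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Fix the following issue: {task}"},
    ]
```

Because the skill is plain text concatenated into the system prompt, the same skill document works unchanged across harnesses and models, which is the basis for the transfer results in Section 10.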
Evolutionary Population Dynamics
Generation Population Best Fitness
========= ================================== =============
G=0 [S_empty, S_empty, S_empty] 0% (no skill)
|
G=1 [S_a(basic_struct), S_b(tests), 45%
S_c(style)]
|
G=2 [S_a'(struct+tests), S_d(struct+bugs), 62%
S_c'(style+tests)]
|
G=3 [S_a''(comprehensive), S_d'(bugs+debug), 74%
S_e(combined_a_d)]
|
G=4 [S_e'(refined), S_f(specialized), 81%
S_a''(elite)]
|
G=5 [S_e''(optimized), S_g(simplified), 86%
S_e'(elite)]
| ...
G=10 [S_final(mature_skill)] 93%
Component Interaction Diagram
+-------------------+ +-------------------+
| SWE-smith | | GEPA Core |
| (Data Layer) | | (Search Layer) |
| | | |
| mine_commits() ------> optimize_anything() |
| extract_tasks() ------> evaluate_fn() |
| validate_tests()------> reflect() |
| split_data() ------> propose() |
| | | select() |
+-------------------+ +--------+----------+
|
+--------v----------+
| Agent Runtime |
| (Execution Layer) |
| |
| harness.run(M, S, t)|
| test_runner.eval() |
| trace_collector() |
+--------------------+
13 Ablations and Sensitivity Analysis
Ablation: Evolutionary Search vs. Alternatives
The paper compares GEPA's evolutionary search against simpler skill-generation baselines:
| Method | Description | Bleve (gpt-5-mini) | Jinja (gpt-5-mini) |
|---|---|---|---|
| No skill | Baseline agent without any skill | 24% | 55% |
| Human-written skill | Domain expert writes a skill document | ~55% | ~68% |
| LLM-generated (one-shot) | LLM writes skill from README + docs (no iteration) | ~48% | ~65% |
| LLM + 1 reflection | Generate skill, evaluate, reflect once, regenerate | ~65% | ~72% |
| GEPA (5 generations) | Full evolutionary search, 5 generations | ~82% | ~77% |
| GEPA (10 generations) | Full evolutionary search, 10 generations | 93% | 82% |
Key finding: Each component contributes meaningfully. The one-shot LLM approach captures basic repository knowledge but misses the subtle patterns that only emerge from evaluation feedback. A single reflection helps but cannot discover the multi-layered knowledge that 10 generations of evolutionary refinement produce. The evolutionary process is not merely prompt engineering at scale -- it discovers knowledge that no human or single LLM call could produce, because it is grounded in empirical evaluation traces.
Ablation: Population Size
| Population Size | Generations to 85% | Final Performance | Cost Multiplier |
|---|---|---|---|
| K=1 (hill climbing) | 15+ | ~88% | 1.0x |
| K=3 | 10 | ~91% | 2.8x |
| K=5 (default) | 7 | ~93% | 4.5x |
| K=10 | 6 | ~93% | 8.5x |
Population size K=5 represents the sweet spot: enough diversity to explore the search space effectively without the linear cost increase of larger populations. Beyond K=5, the marginal benefit of additional candidates diminishes because the reflection model can only meaningfully analyze a limited number of evaluation traces per generation.
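The Selection + Elitism step from the architecture diagram (keep the best candidate, tournament-select the rest) can be sketched as follows; the tournament size and seeding are illustrative choices, not values from the paper.

```python
import random

def select_next_generation(scored, k=5, tournament_size=2, rng=None):
    """scored: list of (skill, fitness) pairs.
    Returns k survivors: the single best skill (elitism) plus
    tournament winners drawn from the whole population."""
    rng = rng or random.Random(0)
    best = max(scored, key=lambda p: p[1])
    survivors = [best[0]]                        # elitism: best always survives
    while len(survivors) < k:
        contenders = rng.sample(scored, tournament_size)
        winner = max(contenders, key=lambda p: p[1])
        survivors.append(winner[0])
    return survivors
```

Elitism guarantees fitness never regresses across generations, while tournament selection keeps weaker-but-diverse candidates in play, which matters for the crossover mutations described in Section 11.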
Ablation: Evaluation Batch Size
The number of tasks per evaluation batch affects the noise level of fitness estimates:
| Batch Size | Fitness Std. Dev. | Search Stability | Cost per Generation |
|---|---|---|---|
| 10 tasks | ~14pp | Unstable; frequent false improvements | Low |
| 30 tasks | ~8pp | Moderate; occasional noise-driven selection | Moderate |
| 50 tasks (default) | ~6pp | Stable; reliable comparison between candidates | High |
| 100 tasks | ~4pp | Very stable, but slow and expensive | Very High |
Ablation: Skill Token Length
| Max Skill Length | Performance (Bleve) | Notes |
|---|---|---|
| 500 tokens | ~78% | Too short; misses important patterns |
| 1000 tokens | ~87% | Adequate for basic repository knowledge |
| 2000 tokens (default) | ~93% | Sweet spot; comprehensive without noise |
| 4000 tokens | ~91% | Slight degradation; redundancy confuses agent |
Insight: There is a clear inverted-U relationship between skill length and performance. Very short skills lack essential information, while very long skills introduce redundancy and noise that can confuse the agent. The optimal length (~2000 tokens) is enough to cover all major knowledge categories without padding.
Ablation: Mutation Strategy Mix
The relative frequency of different mutation strategies matters for convergence:
- Refine-only: Converges but gets stuck in local optima. The skill accumulates patches without structural reorganization.
- Combine-heavy: High diversity but slow convergence. Too much crossover destroys good skill sections before they can be refined.
- Balanced (default): 60% refine, 20% combine, 20% simplify. This matches the natural cadence of optimization: mostly targeted improvements with occasional exploration and compression.
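The balanced mix amounts to weighted sampling over mutation operators; the 60/20/20 weights come from the text above, while the function shape is an illustrative sketch.

```python
import random

# Balanced default mix: mostly targeted refinement, with occasional
# crossover (combine) and compression (simplify).
MUTATION_WEIGHTS = {"refine": 0.6, "combine": 0.2, "simplify": 0.2}

def pick_mutation(rng=None):
    """Sample a mutation strategy according to the balanced mix."""
    rng = rng or random
    ops, weights = zip(*MUTATION_WEIGHTS.items())
    return rng.choices(ops, weights=weights, k=1)[0]
```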
14 Limitations and Future Directions
Current Limitations
1. Repository Scope
The paper demonstrates results on two repositories (Jinja, Bleve). While these span two programming languages and very different domains, the generality of the approach across diverse codebases (monorepos, polyglot projects, very large codebases with >1M LOC) remains to be validated. The SWE-smith pipeline's ability to generate quality tasks may vary across repository types.
2. Benchmark Saturation
On Bleve, Claude Sonnet 4.5 with skills achieves 100% on the test set. This ceiling effect limits the ability to measure further improvements and raises questions about whether the Mini-SWE benchmark is sufficiently challenging for frontier models with skills. Future work should include harder benchmark tasks or repositories.
3. Skill Staleness
As repositories evolve, skills may become outdated. The paper does not address how skills should be maintained over time. A practical deployment would need a mechanism to detect skill degradation (e.g., monitoring resolve rates over time) and trigger re-optimization.
4. Single-Repository Skills
Current skills are repository-specific. Cross-repository skills (e.g., "general Go coding patterns") are not explored. Learning meta-skills that transfer across repositories in the same language or framework is an open research direction.
5. Limited Exploration of Skill Composition
The paper treats each skill as a monolithic document. Compositional skills -- where different skill modules (architecture, testing, debugging) are learned independently and composed at deployment -- could enable more modular and maintainable skill libraries.
Future Research Directions
1. Hierarchical Skill Learning: Learn skills at multiple levels of abstraction: language-level skills (Go idioms), framework-level skills (standard library patterns), and repository-level skills. Higher-level skills transfer more broadly; lower-level skills are more specific and powerful.
2. Continuous Skill Evolution: Instead of one-shot skill learning, continuously evolve skills as the repository changes. Each merged PR could trigger a lightweight skill update, keeping skills synchronized with the codebase.
3. Task-Adaptive Skill Selection: Instead of a single monolithic skill, maintain a library of specialized skills and select the most relevant one(s) for each task. This could be implemented as a learned routing function or retrieval system.
4. Multi-Objective Skill Optimization: Currently, the sole objective is resolve rate. Future work could optimize jointly for resolve rate, execution speed, token efficiency, and patch quality (code style, minimal diff size).
5. Scaling to SWE-bench: Applying GEPA to the full SWE-bench dataset, which covers 12 diverse Python repositories, would provide a more comprehensive evaluation and potentially establish new state-of-the-art results.
15 Broader Significance
Paradigm Shift: From Model Training to Knowledge Curation
GEPA represents a shift in how we think about improving AI coding agents. The traditional paradigm -- collect data, train (or fine-tune) a model, deploy -- treats the model as the primary locus of improvement. GEPA suggests an alternative paradigm where the model is held fixed and the knowledge context is optimized instead.
The "knowledge curation" paradigm: Rather than training models to be better at everything, we curate the right knowledge to present to them at inference time. This is analogous to how human expertise works: a junior developer given the right documentation and mentorship can perform at a much higher level than one working in isolation. Skills are the automated equivalent of documentation and mentorship.
Implications for the AI Coding Agent Ecosystem
For Model Providers
GEPA suggests that model quality is necessary but not sufficient for coding agent performance. Even the best model (Claude Sonnet 4.5) benefits from repository-specific skills. Model providers should consider exposing skill-learning APIs or supporting skill integration natively.
For Agent Framework Developers
Agent frameworks should be designed with skill injection points -- standardized interfaces for prepending repository-specific knowledge to system prompts. The skill format should be documented and stable, enabling a skill ecosystem where skills are shared, versioned, and composed.
For Software Teams
GEPA enables a new development practice: skill engineering alongside code engineering. Teams could maintain a skill document for their repository that is automatically updated as the codebase evolves, dramatically improving the effectiveness of AI coding assistants on their specific codebase.
Connections to Related Work
| Related Work | Approach | Relationship to GEPA |
|---|---|---|
| AlphaEvolve (DeepMind) | Evolutionary code optimization with LLM proposers | GEPA applies similar evolutionary LLM search but to natural-language skills rather than code |
| OpenEvolve | Open-source evolutionary program search | Shared algorithmic DNA; GEPA's optimize_anything is more general than code-specific search |
| DSPy | Programmatic prompt optimization | DSPy optimizes prompt templates with fixed structures; GEPA optimizes free-form documents |
| SWE-bench | Coding agent benchmark | GEPA's SWE-smith generates similar task instances at scale; Mini-SWE extends the paradigm |
| Reflexion (Shinn et al.) | LLM self-reflection for task improvement | GEPA uses reflection across tasks/populations, not within a single task attempt |
| ADAS (Hu et al.) | Automated Design of Agentic Systems | ADAS evolves agent architectures; GEPA evolves agent knowledge while fixing the architecture |
| EoH (Liu et al.) | Evolution of Heuristics with LLMs | EoH evolves algorithmic heuristics; GEPA evolves natural-language knowledge documents |
The Skill Economy
If GEPA's approach proves general, it implies the emergence of a skill economy: a
marketplace where repository-specific skills are learned, shared, and traded. Open-source
repositories could include a .skills/ directory with pre-learned skills, just
as they include documentation and CI configuration. The economic dynamics are favorable:
skills are cheap to learn (~$40), valuable to users (significant performance improvement),
and improve with the repository ecosystem (more tasks = better skills).
Philosophical Implications
GEPA raises an interesting question about the nature of expertise in AI systems. Traditional AI views expertise as internal to the model (weights encode knowledge). GEPA demonstrates that a significant portion of task-relevant expertise can be externalized as structured context. This suggests that the boundary between "what the model knows" and "what the model is told" is more fluid than commonly assumed, and that the latter can be optimized algorithmically with the same rigor applied to the former.
Final thought: GEPA demonstrates that the next frontier in AI agent performance may not require larger models or more training data, but rather better knowledge engineering -- the systematic discovery and curation of the right information to present to existing models at the right time. Evolutionary search, powered by LLM reflection, provides a principled and scalable method for this knowledge engineering process.