AutoHarness
LLM-driven automatic synthesis of code harnesses that constrain agent action spaces, enabling smaller models to outperform larger ones through learned rejection sampling and program-space search.

Organization: Google DeepMind
Published: February 10, 2026
Type: paper
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: AutoHarness: improving LLM agents by automatically synthesizing a code harness
arXiv ID: 2603.03329
DOI: 10.48550/arXiv.2603.03329
License: CC BY 4.0
Venue: arXiv preprint (cs.CL, cs.AI)
Submission Date: February 10, 2026
"Often people manually write 'harnesses' around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment."
This paper addresses a fundamental and pervasive failure mode of LLM agents: action illegality. When deployed as decision-making agents in structured environments, LLMs frequently propose actions that are syntactically or semantically invalid—not merely suboptimal, but strictly prohibited by the environment's rules. The authors frame this as an instance of the "action applicability" problem studied in AI planning (Kokel et al., 2025) and propose a self-contained solution where the LLM generates its own safety wrapper.
2 Authors and Team
| Author | Affiliation | Role / Expertise |
|---|---|---|
| Xinghua Lou | Google DeepMind | Lead author, corresponding author; code synthesis and agent architecture |
| Miguel Lázaro-Gredilla | Google DeepMind | Probabilistic models, search methods, Thompson sampling design |
| Antoine Dedieu | Google DeepMind | Program synthesis, iterative refinement |
| Carter Wendelken | Google DeepMind | Agent evaluation, game environment integration |
| Wolfgang Lehrach | Google DeepMind | Code world models, game simulation (prior work on code world models for general game playing) |
| Kevin P. Murphy | Google DeepMind | Senior researcher; probabilistic ML, structured prediction, agent design |
Contact: {xinghua, lazarogredilla, adedieu, cwendelken, wpl, kpmurphy}@deepmind.com
Team Context: This team sits within Google DeepMind's agent systems group, with strong ties to the Gemini model family. Murphy is a well-known figure in probabilistic ML (author of Machine Learning: A Probabilistic Perspective). Lehrach has prior work on code world models for general game playing (Lehrach et al., 2025), which is the complementary problem of synthesizing the environment's state-transition function rather than the agent's action filter. The team brings together expertise in Bayesian optimization (Thompson sampling), program synthesis, and large-scale LLM evaluation.
3 Core Contribution
The Problem: Action Illegality in LLM Agents
The paper identifies a critical and quantifiable failure mode: LLM agents make illegal moves. This is not about strategy quality—it is about rule compliance.
The motivating statistic is devastating:
78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition were attributed to illegal moves—not strategic blunders.
This highlights a disconnect between the model's apparent understanding of a domain and its ability to comply with structural constraints. The problem generalizes beyond games to any environment with a constrained action space:
| Domain | Illegal Action Example |
|---|---|
| Chess | Moving a piece to a square occupied by a friendly piece |
| API orchestration | Calling an endpoint with invalid parameters |
| Code generation | Generating syntactically invalid code |
| Robotic control | Commanding a joint beyond its physical limits |
| Database queries | Writing SQL that violates schema constraints |
The Solution: Code-as-Harness
The core insight is meta-level self-correction: rather than fine-tuning the LLM (expensive, degrades other capabilities) or manually writing harnesses (brittle, labor-intensive), the LLM synthesizes its own harness through iterative code refinement with environment feedback.
┌─────────────────────────────────────────────────────────┐
│ TRADITIONAL APPROACH │
│ │
│ Human Engineer ──writes──> Harness Code │
│ (per game, per domain) │
│ ↓ │
│ LLM Agent ──proposes──> Action ──filter──> Valid? │
│ │ │ │
│ Yes No │
│ ↓ ↓ │
│ Execute Retry │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ AUTOHARNESS APPROACH │
│ │
│ LLM (Gemini-2.5-Flash) │
│ │ │
│ ├──generates──> Harness Code (is_legal_action()) │
│ │ ↑ │
│ │ feedback │
│ │ │ │
│ └──refines via──> Tree Search + Thompson Sampling │
│ ↓ │
│ LLM Agent ──proposes──> Action ──harness──> Valid? │
│ │ │ │
│ Yes No │
│ ↓ ↓ │
│ Execute Retry │
└─────────────────────────────────────────────────────────┘
Key distinction from prior work: The harness is not learned from demonstrations or fine-tuned into the model—it is synthesized as executable code through a structured search over the program space, using the LLM as a mutation operator.
Three Harness Modes
The framework supports three configurations of increasing autonomy:
| Mode | LLM at Test Time? | Description |
|---|---|---|
| Harness-as-Action-Filter | Yes (ranking) | Code generates legal move set; LLM ranks and selects |
| Harness-as-Action-Verifier | Yes (generation + verification) | LLM proposes action; code verifies legality; retry on rejection |
| Harness-as-Policy | No | Code generates action directly; zero LLM calls at inference |
The paper primarily focuses on harness-as-action-verifier but demonstrates the striking result that harness-as-policy (no LLM at test time) outperforms even the largest frontier models.
4 Supported Solutions
Solution Space: What AutoHarness Generates
AutoHarness generates Python programs with two primary functions:
def is_legal_action(observation: str, action: str) -> bool:
    """Given the current game observation and a proposed action,
    returns True if the action is legal, False otherwise."""
    ...

def propose_action(observation: str) -> str:
    """Given the current game observation,
    proposes a valid action."""
    ...
These functions are domain-specific but structurally uniform—the same interface applies across all 145 games.
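To make the interface concrete, here is a minimal hypothetical harness for a toy "guess a number between 1 and 100" game. This example is ours, not from the paper; the bracketed action format and the fixed midpoint proposal are illustrative assumptions:

```python
# Hypothetical harness for a toy guess-a-number game, illustrating the
# two-function interface AutoHarness synthesizes (not a paper example).
import re

def is_legal_action(observation: str, action: str) -> bool:
    """Return True if the action is a well-formed guess in [1, 100]."""
    match = re.fullmatch(r"\[(\d+)\]", action.strip())
    if match is None:
        return False
    return 1 <= int(match.group(1)) <= 100

def propose_action(observation: str) -> str:
    """Propose a valid guess; here simply a fixed midpoint."""
    return "[50]"
```

A real synthesized harness replaces these bodies with game-specific parsing and rule logic, but keeps exactly this signature across all 145 games.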
Solution Variants by Harness Mode
Harness-as-Action-Verifier (Primary):
while not done:
    action = LLM.generate(observation)
    if not is_legal_action(observation, action):
        action = LLM.generate(observation + "ILLEGAL: " + action)
    execute(action)
Harness-as-Action-Filter:
while not done:
    legal_actions = [a for a in all_actions if is_legal_action(obs, a)]
    action = LLM.rank_and_select(observation, legal_actions)
    execute(action)
Harness-as-Policy (Code-Only):
while not done:
    action = propose_action(observation)
    execute(action)
# Zero LLM calls
Environment Integration
The system operates within the TextArena framework (Guertler et al., 2025), a comprehensive collection of text-based games. Key properties of supported environments:
| Property | Specification |
|---|---|
| Observation format | Text strings describing game state |
| Action format | Text strings (structured, e.g., UCI notation for chess) |
| Feedback signals | Legal/illegal indicator, reward (sparse, at trajectory end) |
| Game types | 1-player puzzle/strategy, 2-player competitive |
| Total games supported | 145 (after excluding 9 free-form text/dialog games) |
| Action space | Discrete, game-specific, often combinatorially large |
Additional Hardening
The authors deliberately removed "Available Moves" hints from game observations to prevent trivial copy-paste solutions. This forces the harness to deduce legality from structural understanding of the game rules, rather than reading an explicit list.
5 LLM Integration
Model Usage During Harness Synthesis (Training)
| Component | Model | Purpose |
|---|---|---|
| Harness Generator | Gemini-2.5-Flash | Generates and refines is_legal_action() and propose_action() code |
| Critic | Gemini-2.5-Flash | Consolidates error messages from failed rollouts into structured feedback |
| Refiner | Gemini-2.5-Flash | Takes current code + critic feedback → proposes improved code |
Single-model training: The entire harness synthesis uses only Gemini-2.5-Flash—a deliberately smaller, cheaper model. This is a key design choice: the harness enables a small model to match or exceed the performance of models orders of magnitude larger.
Model Usage During Agent Evaluation (Test Time)
| Agent Configuration | Model | Purpose |
|---|---|---|
| Gemini-2.5-Flash (vanilla) | Gemini-2.5-Flash | Baseline: raw LLM without harness |
| Gemini-2.5-Pro (vanilla) | Gemini-2.5-Pro | Strong baseline: larger model without harness |
| Gemini-2.5-Flash + Harness (Ours) | Gemini-2.5-Flash | Flash with synthesized harness for action verification |
| GPT-5.2 (no thinking) | GPT-5.2 | Cross-family baseline |
| GPT-5.2-High (high thinking) | GPT-5.2-High | Strongest baseline (high reasoning budget) |
| Harness-as-Policy (Ours) | None | Pure code, zero LLM calls at test time |
Prompt Engineering
The system uses a structured prompt for the LLM-as-policy configuration:
You are an expert, logical, and strategic AI game player. Your task is to
analyze the following game information and determine the single best move
to make.
Read the game rules, your player role, the current game state, and all
available moves carefully. Your objective is to play optimally to
maximize your chances of winning the game.
You are now player {player_id}.
The game information is as follows:
{observation}
**YOUR TASK:**
**Step 1: Think**
First, provide your step-by-step reasoning. Analyze the current game
state, your goal, and the available moves.
**Step 2: Move**
After your thinking block, provide *only* the single best move you have
chosen. Enclose your final move in `<move></move>` tags.
The same optimized prompt is used across all agent configurations, ensuring fair comparison.
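The paper does not specify how the `<move></move>` tags are parsed out of the model's response; a straightforward extractor (our assumption, taking the last tagged block so that reasoning-stage examples are ignored) might look like:

```python
# Hypothetical parser for the <move></move> output format in the prompt
# above; the "take the last match" policy is our assumption.
import re

def extract_move(llm_output: str):
    """Return the content of the last <move>...</move> block, or None."""
    matches = re.findall(r"<move>(.*?)</move>", llm_output, flags=re.DOTALL)
    return matches[-1].strip() if matches else None
```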
LLM-as-Mutation-Operator
The most novel aspect of LLM integration is using the model as a gradient-free code optimizer. The LLM does not merely generate code from scratch—it iteratively refines existing code based on structured feedback:
Input to Refiner:
1. Current harness code (Python source)
2. Up to 5 failed execution traces with:
- Game observation at failure point
- Proposed action
- Whether is_legal_action() returned True/False
- Whether the action was actually legal/illegal
- Error messages from environment
3. Critic's consolidated analysis
Output from Refiner:
→ Refined harness code (improved Python source)
This creates a closed-loop optimization where the LLM acts as both the search operator and the solution representation.
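The loop can be sketched as a single refinement step with the LLM calls stubbed out; `critic_llm` and `refiner_llm` are hypothetical placeholders for the two Gemini-2.5-Flash roles, and the 5-trace cap follows the paper's description:

```python
# Sketch of one critic -> refiner iteration (LLM calls are stub callables).
def refine_step(code, failure_traces, critic_llm, refiner_llm):
    """One refinement iteration: consolidate failures, then rewrite code."""
    traces = failure_traces[:5]      # at most 5 failed traces, per the paper
    analysis = critic_llm(traces)    # Critic: consolidated error analysis
    prompt = {"code": code, "analysis": analysis, "traces": traces}
    return refiner_llm(prompt)       # Refiner: full revised harness source
```

In the full system this step sits inside the tree search: the refined code becomes a new child node whose legal-action accuracy is then measured by fresh environment rollouts.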
6 Key Results
Legal Action Rate: Perfect Score
The harness achieves 100% legal action accuracy across all 145 TextArena games on held-out test rollouts (1000 steps × 10 random seeds per game). This is a binary success/failure metric—the harness either catches all illegal actions or it doesn't.
| Metric | Value |
|---|---|
| Games evaluated | 145 |
| Games with 100% legal action rate | 145 (100%) |
| Test rollout length | 1,000 steps |
| Random seeds per game | 10 |
| Average training iterations to converge | 14.5 |
| Games converging in <10 iterations | 19/32 (subset) |
2-Player Game Results: Flash+Harness vs. Pro
When evaluating game-playing performance on 16 selected 2-player games (40 matches each, balanced for first/second player):
| Matchup | Win Rate (Ours) | Win Rate (Opponent) | Games Won |
|---|---|---|---|
| Flash+Harness vs. Gemini-2.5-Pro | 56.3% | 38.2% | 9/16 |
| Flash+Harness vs. Gemini-2.5-Flash | 64.8% | — | 12/16 |
A smaller model (Flash) with an automatically synthesized harness beats a much larger model (Pro) at competitive gameplay, winning the majority of games.
1-Player Game Results: Average Reward
On 16 single-player games (20 matches each):
| Agent | Average Reward | Games Won (vs. nearest) |
|---|---|---|
| Gemini-2.5-Flash (vanilla) | 0.673 | — |
| Gemini-2.5-Pro (vanilla) | 0.707 | — |
| Gemini-2.5-Flash + Harness | 0.745 | 8/16 wins, 5/16 ties vs. Pro |
Harness-as-Policy: The Headline Result
The most striking finding—a pure code policy with zero LLM calls at test time outperforms all frontier models:
| Agent | Average Reward | Test-Time LLM Calls | Test Cost |
|---|---|---|---|
| GPT-5.2 (no thinking) | 0.635 | Many | ~$640* |
| Gemini-2.5-Flash (vanilla) | 0.673 | Many | Moderate |
| Gemini-2.5-Pro (vanilla) | 0.707 | Many | High |
| GPT-5.2-High (high thinking) | 0.844 | Many | ~$640* |
| Harness-as-Policy (Ours) | 0.870 | 0 | ~$0 |
*GPT-5.2 costs are for reduced evaluation (10 and 5 repeats respectively vs. 20 for others).
"Since Harness-as-Policy generates pure (Python) code, our test time cost is nearly zero, while the GPT-5.2 and GPT-5.2-High experiments cost approximately $640."
This result demonstrates that an LLM can synthesize a complete decision-making policy in code form that exceeds the performance of the LLM itself (and even larger/more expensive models), at effectively zero marginal cost per decision.
Per-Game Detail: Training Difficulty
The most challenging games for harness learning:
| Game | Training Iterations | Difficulty Driver |
|---|---|---|
| GermanWhist-v0 (2P) | 43 | Complex card game rules |
| Cryptarithm-v0 (1P) | 45 | Cryptarithmetic puzzle constraints |
| Othello-v0 (2P) | 62 | Board state validation complexity |
| Chess-v0 (2P) | 64 | Complex piece movement rules, castling, en passant |
| Breakthrough-v0-small (2P) | 136 | Small board amplifies edge cases |
The simplest games (PigDice variants, Snake, ColonelBlotto) converge in 1 iteration—their action spaces are trivially enumerable.
7 Reproducibility
Environment Availability
| Component | Availability | URL |
|---|---|---|
| TextArena game suite | Open source | TextArena on arXiv |
| Game definitions | Included in TextArena | — |
| Evaluation protocol | Fully specified in paper | — |
| Generated harness code | Examples in Appendix D | — |
Experimental Configuration
| Parameter | Value |
|---|---|
| Parallel environments during training | 10 |
| Max rollout steps per iteration | 1,000 |
| Max failed steps fed to Critic | 5 |
| Thompson sampling weight | 1.0 |
| Training termination | Legal action rate = 1.0 or timeout |
| 1P evaluation: matches per game | 20 |
| 2P evaluation: matches per game | 40 (20 as P1, 20 as P2) |
| Harness-as-Policy: max training iterations | 256 |
| Harness-as-Policy: average training iterations | 89.4 |
| Harness-as-Policy: average heuristic at termination | 0.939 |
| Games excluded | 9 free-form text/dialog games |
Reproducibility Challenges
- Model Access: Requires access to Gemini-2.5-Flash (and Pro, GPT-5.2 for baselines), which are proprietary APIs with potential non-determinism.
- No Code Release: The paper does not reference a public code repository. The harness synthesis framework itself is not open-sourced.
- TextArena Modifications: The authors modified some games to remove "Available Moves" hints. The exact modifications are described but not released as a patch.
- Stochasticity: Thompson sampling and LLM generation introduce randomness; reproducing exact numbers requires multiple seeds.
- API Versioning: Results are tied to specific model versions (Gemini-2.5-Flash/Pro, GPT-5.2/5.2-High) which may change over time.
8 Compute and API Costs
Training Costs
| Phase | Model | Calls per Game (avg) | Cost Driver |
|---|---|---|---|
| Harness synthesis (verifier) | Gemini-2.5-Flash | ~14.5 iterations × (10 envs × rollout + critic + refiner) | Low per-token cost of Flash |
| Harness synthesis (policy) | Gemini-2.5-Flash | ~89.4 iterations (max 256) | Higher iteration count but still Flash pricing |
Each iteration involves:
- 10 parallel environment rollouts (up to 1,000 steps each)
- Critic analysis of up to 5 failure traces
- Refiner code generation
Rough cost estimate per game (harness-as-verifier): With Flash pricing around ~$0.15/M input tokens and ~$0.60/M output tokens, and typical prompts of a few thousand tokens, each game's harness synthesis likely costs $1–5 in API calls.
For all 145 games: ~$150–700 total for harness synthesis.
Inference Costs
| Agent | Cost per Decision | Cost per Game (est.) |
|---|---|---|
| Gemini-2.5-Flash (vanilla) | Standard Flash pricing | ~$0.01–0.05 |
| Gemini-2.5-Pro (vanilla) | Standard Pro pricing | ~$0.10–0.50 |
| Flash + Harness (verifier) | Flash pricing + negligible code execution | ~$0.01–0.05 |
| GPT-5.2-High | Premium pricing | ~$1–10 per game |
| Harness-as-Policy | ~$0 (pure code) | ~$0 |
Cost Efficiency Analysis
The paper explicitly highlights the cost advantage:
"$640 for GPT-5.2 and GPT-5.2-High experiments" (reduced evaluation: 10 and 5 repeats)
vs.
"Harness-as-Policy test time cost is nearly zero"
This represents a cost reduction of several orders of magnitude for comparable or superior performance.
Training-to-Inference Amortization
Training Cost (one-time): ~$1-5 per game × 145 games = ~$150-700
Inference Cost (per decision): $0 (harness-as-policy)
Break-even vs. GPT-5.2-High: After ~100 decisions per game
(assuming ~$1/game for GPT-5.2-High per 20-match evaluation)
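The break-even arithmetic above can be checked in a few lines; all dollar figures here are the rough estimates from this section, not paper-reported numbers:

```python
# Back-of-envelope break-even check (estimates, not paper-reported costs).
training_cost_per_game = 5.0    # upper end of the $1-5 synthesis estimate
gpt_cost_per_decision = 0.05    # assuming ~$1/game over ~20 decisions
break_even_decisions = training_cost_per_game / gpt_cost_per_decision
print(break_even_decisions)     # prints 100.0
```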
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────┐
│ AUTOHARNESS SYSTEM ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ TRAINING PHASE │ │
│ │ │ │
│ │ ┌─────────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │ Tree │───>│ Thompson │───>│ Node │ │ │
│ │ │ Store │ │ Sampling │ │ Select │ │ │
│ │ └─────────┘ └──────────┘ └────┬────┘ │ │
│ │ ↑ │ │ │
│ │ │ ▼ │ │
│ │ ┌────┴────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │ New │<───│ Refiner │<───│ Critic │ │ │
│ │ │ Node │ │ (LLM) │ │ (LLM) │ │ │
│ │ └─────────┘ └──────────┘ └────┬────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Environment │ │ │
│ │ │ (TextArena) │ │ │
│ │ │ 10 parallel │ │ │
│ │ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ INFERENCE PHASE │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ LLM │───>│ Harness │───>│ Env │ │ │
│ │ │ Agent │ │ Code │ │ Execute │ │ │
│ │ │(optional)│ │(learned) │ │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Mode A: LLM proposes → Code verifies → Execute/Retry │ │
│ │ Mode B: Code proposes legal set → LLM ranks → Execute │ │
│ │ Mode C: Code proposes action → Execute (no LLM) │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Key Architectural Decisions
- Program-Space Search, Not Weight-Space: The search operates over Python programs, not model parameters. This preserves the LLM's general capabilities while adding domain-specific constraints.
- Tree Structure with Thompson Sampling: Unlike linear iterative refinement (e.g., Reflexion), the system maintains a tree of candidate programs and uses Thompson sampling to balance exploration (trying new logical structures) vs. exploitation (refining partially working harnesses).
- Environment-in-the-Loop: The training loop is closed through the game environment itself—not through a learned proxy or reward model. The environment provides ground-truth legality signals.
- Separation of Concerns: The Critic consolidates raw error traces into structured feedback; the Refiner uses this structured feedback to propose code improvements. This two-stage pipeline prevents the Refiner from being overwhelmed by raw error logs.
- Single-Model Architecture: Both Critic and Refiner use the same model (Gemini-2.5-Flash), keeping the system simple and cost-effective.
Relationship to Broader Agent Architecture Patterns
COMPARISON: Agent Architecture Paradigms
─────────────────────────────────────────────────────
Fine-Tuning Manual AutoHarness
Harness
─────────────────────────────────────────────────────
Modifies LLM Yes No No
Requires human eng. No Yes No
Domain transfer Poor None Automatic
Cost at training Very High Zero Low
Cost at inference Standard Standard Zero (policy)
Preserves LLM caps No Yes Yes
Scalable to N games Manual Manual Automatic
─────────────────────────────────────────────────────
10 Component Breakdown
Component 1: Tree Store
Purpose: Maintains the population of candidate harness programs as a tree data structure.
| Property | Specification |
|---|---|
| Structure | Tree (nodes = code versions, edges = refinement steps) |
| Root | Initial empty/template harness |
| Node value | Average legal action accuracy (0.0–1.0) |
| Growth | New nodes added via Refiner output |
| Termination | Any node reaches value 1.0, or timeout |
Each node stores:
- The complete Python source code of the harness
- The heuristic value (average fraction of legal actions across test rollouts)
- Parent pointer (for lineage tracking)
- Number of evaluations (for Thompson sampling confidence)
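A node carrying these fields can be sketched as a small dataclass; the field names are ours, chosen to match the description above:

```python
# Sketch of a tree-store node (field names are our own, not the paper's).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HarnessNode:
    code: str                        # full Python source of the harness
    value: float = 0.0               # average legal-action accuracy in [0, 1]
    parent: Optional["HarnessNode"] = None
    n_evals: int = 0                 # evaluation count for Thompson sampling
    children: list = field(default_factory=list)

    def add_child(self, code: str, value: float) -> "HarnessNode":
        """Attach a refined harness as a child (one refinement edge)."""
        child = HarnessNode(code=code, value=value, parent=self)
        self.children.append(child)
        return child
```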
Component 2: Thompson Sampling Selector
Purpose: Selects which tree node to refine next, balancing exploration and exploitation.
The system follows the approach of Tang et al. (2024) for code repair with exploration-exploitation tradeoffs. Thompson sampling maintains a posterior distribution over each node's true quality and samples from these distributions to select the next node to refine.
For each node n in tree:
    Sample θ_n ~ Beta(α_n, β_n)   # posterior on true quality,
                                  # where α_n = successes + 1, β_n = failures + 1
Select node n* = argmax_n θ_n
Refine n* → create new child node
The heuristic weight parameter is set to 1.0, controlling the exploration-exploitation balance.
Why Thompson sampling over simpler strategies:
- Random selection wastes iterations on unpromising branches
- Greedy selection gets stuck in local optima
- UCB1 requires tuning an exploration constant
- Thompson sampling naturally adapts its exploration as uncertainty decreases
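The selection rule above is a few lines of standard-library Python. This is a generic Beta-posterior Thompson sampler over (successes, failures) counts, not the paper's implementation:

```python
# Generic Beta-posterior Thompson sampling over nodes represented as
# (successes, failures) tuples; a sketch, not the paper's code.
import random

def thompson_select(nodes):
    """Return the node whose sampled Beta(s+1, f+1) draw is largest."""
    best, best_theta = None, -1.0
    for node in nodes:
        s, f = node
        theta = random.betavariate(s + 1, f + 1)  # posterior sample
        if theta > best_theta:
            best, best_theta = node, theta
    return best
```

Over repeated calls this concentrates on high-accuracy nodes while still occasionally probing uncertain ones, which is exactly the exploration/exploitation balance the tree search needs.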
Component 3: Environment Rollout Engine
Purpose: Executes candidate harness code against actual game environments to measure legal action accuracy.
| Parameter | Value |
|---|---|
| Parallel environments | 10 |
| Max steps per rollout | 1,000 |
| Termination | Illegal move by code, code execution failure, or step limit |
| Auto-reset | Environment resets automatically when a game ends |
The rollout engine produces execution traces that include:
- The observation at each step
- The action proposed
- Whether is_legal_action() returned True or False
- Whether the environment accepted the action
- Any error messages or exceptions
Component 4: Critic
Purpose: Consolidates raw execution failure traces into structured, actionable feedback.
The Critic receives up to 5 sampled failure traces and produces a consolidated analysis identifying:
- Common failure patterns across traces
- Root causes of illegal actions
- Specific code locations that need modification
- Environmental constraints that the harness fails to capture
This consolidation step is critical—raw error traces can be verbose and contradictory, and the Refiner benefits from a clean summary.
Component 5: Refiner (LLM as Mutation Operator)
Purpose: Takes current harness code + Critic feedback and generates improved harness code.
The Refiner operates as a gradient-free code optimizer. Its input is:
1. The current harness source code
2. The Critic's consolidated error analysis
3. The specific failure traces (for reference)
Its output is a complete, revised harness program.
Refinement Logic:
if is_legal_action() returns True but the action is actually ILLEGAL:
    → refine BOTH is_legal_action() AND propose_action()
      (the legality check is wrong)

if is_legal_action() returns False and the action is actually ILLEGAL:
    → refine ONLY propose_action()
      (the legality check is correct; the proposal is bad)
This asymmetric refinement strategy prevents correct components from being needlessly modified.
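The dispatch can be written as a small pure function. The two illegal-action branches follow the paper's rule; the handling of legal actions (including the false-negative case, where a legal move is wrongly rejected) is our assumption for completeness:

```python
# Sketch of the asymmetric error-attribution rule; the legal-action
# branches are our own extrapolation, not stated in the paper.
def components_to_refine(harness_said_legal: bool, actually_legal: bool):
    """Return the set of harness functions the Refiner should modify."""
    if actually_legal:
        if not harness_said_legal:
            # Assumed false-negative case: check wrongly rejects a legal move.
            return {"is_legal_action"}
        return set()  # no failure occurred
    # The action was actually illegal:
    if harness_said_legal:
        return {"is_legal_action", "propose_action"}  # the check is wrong
    return {"propose_action"}  # check is right; the proposal is bad
```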
Component 6: Harness Code (The Artifact)
Purpose: The generated Python program that serves as the agent's action filter or complete policy.
For harness-as-action-verifier, the typical structure:
import re
import numpy as np

def is_legal_action(observation: str, action: str) -> bool:
    # Parse game state from observation text
    board = parse_board(observation)
    move = parse_move(action)
    # Validate move against game rules
    if not is_valid_piece(board, move):
        return False
    if not is_valid_destination(board, move):
        return False
    if causes_self_check(board, move):
        return False
    return True

def propose_action(observation: str) -> str:
    # Generate a candidate move (may or may not use heuristics)
    board = parse_board(observation)
    legal_moves = get_all_legal_moves(board)
    return format_move(legal_moves[0])  # or heuristic selection
For harness-as-policy, propose_action() encodes the complete decision-making logic—including strategic reasoning—without any LLM calls.
11 Core Mechanisms (Detailed)
Mechanism 1: Tree Search over Program Space
The search over program space is the paper's primary technical contribution. Unlike simple iterative prompting (e.g., Reflexion's linear chain), the tree structure enables:
Branching: A single parent node can spawn multiple children with different refinement strategies. If one refinement path leads to a dead end, the system can backtrack to a sibling or parent.
Depth vs. Breadth: Thompson sampling naturally controls the tree's shape. Early in training (high uncertainty), sampling favors breadth (exploring diverse approaches). Later (low uncertainty), sampling favors depth (refining the best approach).
Tree Evolution Example (Chess Harness):
─────────────────────────────────────────
Iteration 0: [Root: empty harness]
Value: 0.0
Iteration 1: [Root] → [V1: basic piece movement]
Value: 0.3
Iteration 3: [Root] → [V1] → [V2: + captures]
Value: 0.6
→ [V1b: alternative parsing]
Value: 0.4
Iteration 8: [Root] → [V1] → [V2] → [V3: + castling]
Value: 0.85
→ [V4: + en passant]
Value: 0.92
→ [V1b] → [abandoned]
Iteration 14: [Root] → [V1] → [V2] → ... → [V7: complete]
Value: 1.0
Mechanism 2: Heuristic Function Design
For harness-as-action-verifier, the heuristic is simply:
H = fraction of legal actions across all rollout steps
For harness-as-policy, the heuristic incorporates reward:
H = 0               if any illegal action is taken
H = 0.5 + 0.5 × r   otherwise, where r ∈ [0, 1] is the environment reward
This design encodes a strict priority:
1. First priority: eliminate all illegal actions (H = 0 until all actions are legal)
2. Second priority: maximize game reward (H scales from 0.5 to 1.0 with reward)
The 0.5 offset ensures that a policy with all legal actions but zero reward (H = 0.5) is strictly preferred over a policy with any illegal action (H = 0), preventing the system from reverting to illegal-but-high-reward strategies.
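The heuristic is direct to implement; this is a literal transcription of the formula above:

```python
# Harness-as-policy heuristic: H = 0 if any illegal action occurred,
# else 0.5 + 0.5 * r for environment reward r in [0, 1].
def policy_heuristic(all_legal: bool, reward: float) -> float:
    if not all_legal:
        return 0.0          # any illegal action zeroes the score
    return 0.5 + 0.5 * reward
```

Note the gap between the two cases: even a zero-reward but fully legal policy (H = 0.5) dominates any policy with a single illegal action (H = 0).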
Mechanism 3: Iterative Code Refinement with Rich Feedback
The refinement loop uses environment execution as ground truth—not a learned reward model or human feedback. This gives the system access to:
- Exact failure points: Which step, which observation, which action
- Error types: Syntax error, runtime error, logical error (action executed but illegal)
- Environment messages: Game-specific error messages explaining why an action was rejected
This rich feedback is qualitatively different from the scalar rewards used in RL or the binary pass/fail signals used in program synthesis competitions.
Mechanism 4: Observation Processing Without Available Moves
A subtle but important design decision: the authors remove "Available Moves" hints from game observations before feeding them to the harness. In the original TextArena Chess observation:
Valid moves: [g1h3], [g1f3], [b1c3], [b1a3], [h2h3], ...
This line is deleted, forcing the harness to:
1. Parse the board state from the ASCII representation
2. Understand piece movement rules
3. Enumerate legal moves from first principles
This makes the problem significantly harder but also significantly more interesting—the harness must encode genuine understanding of game mechanics, not just string matching.
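The paper describes the removal but not its mechanics; one plausible implementation (our assumption, keyed on the "Valid moves:" prefix shown above) is a single multiline regex substitution:

```python
# Hypothetical hint-stripping step; the regex is our guess at one way to
# delete the "Valid moves: ..." line the paper says is removed.
import re

def strip_move_hints(observation: str) -> str:
    """Remove any line beginning with 'Valid moves:' from the observation."""
    return re.sub(r"^Valid moves:.*$\n?", "", observation, flags=re.MULTILINE)
```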
Mechanism 5: Rejection Sampling Interpretation
The harness-as-action-verifier mode can be viewed as learned rejection sampling:
Standard rejection sampling:
    Sample x ~ q(x)                              (LLM generates action)
    Accept with probability p(x) / (M × q(x))    (harness evaluates legality)
    If rejected, resample

AutoHarness:
    Sample action ~ LLM(observation)
    Accept if is_legal_action(observation, action)
    If rejected, resample with augmented prompt
The key difference from standard rejection sampling: the acceptance criterion is_legal_action() is itself learned through program synthesis, and the rejection triggers a modified resampling (the LLM receives additional context about why the action was rejected).
This creates an adaptive rejection sampler where:
- The acceptance function is learned (not fixed)
- Rejection modifies the proposal distribution (not just resampling from the same distribution)
- The acceptance function is a program (not a density ratio)
Mechanism 6: Asymmetric Error Attribution
The refinement strategy distinguishes two failure modes:
| Failure Mode | is_legal_action() says | Truth | What to Refine |
|---|---|---|---|
| False positive | True (legal) | Illegal | Both functions |
| True negative | False (illegal) | Illegal | Only propose_action() |
This is important because false positives indicate a bug in the legality check itself (it failed to catch an illegal move), while true negatives indicate the legality check is working but the proposal generator needs improvement.
12 Programming Language
Primary Language: Python
The harness code is generated as Python, which is natural given:
- LLMs are most capable at Python code generation
- The TextArena environment is Python-based
- Python supports the standard libraries needed for game logic (re, numpy)
Allowed Dependencies
The generated harness code uses only:
- Python standard library (particularly re for pattern matching)
- numpy for numerical operations
- No external game-specific libraries (no python-chess for chess, etc.)
This constraint is important: the harness must learn game rules from scratch, not wrap existing rule engines.
Code Quality Observations
From the paper's appendix examples, the generated harness code is:
- Readable: clear function names, logical structure
- Correct: handles edge cases that emerge through iterative refinement
- Domain-specific: each game gets custom parsing and validation logic
- Self-contained: no state persisted between calls (pure functions)
Language Considerations for Extension
The approach is language-agnostic in principle—any language the LLM can generate code in could serve as the harness language. However, Python's dominance in LLM training data makes it the practical choice for maximizing code quality.
13 Memory Management
Training-Time Memory
The tree store grows during harness synthesis. Each node stores:
- Complete Python source code (~100–500 lines per harness)
- Heuristic value (float)
- Parent pointer
- Evaluation count
With an average of 14.5 iterations and typical branching factor of 1–3, the tree is small (dozens of nodes, <1MB per game).
Rollout memory: 10 parallel environments maintain game state. TextArena games are lightweight text-based environments; memory per environment is negligible (<1MB).
Inference-Time Memory
For harness-as-action-verifier:
- The harness code is loaded once and called per step
- No persistent state between function calls
- Memory footprint: negligible (the harness is a pure function)

For harness-as-policy:
- Same as above—pure code with no persistent state
- Some harnesses may maintain internal data structures (e.g., tracking which cells have been visited in Minesweeper), but these are within the game scope
Knowledge Persistence
The system does not maintain cross-game memory. Each game's harness is synthesized independently. There is no skill library, no transfer learning between games, and no persistent knowledge base.
This is explicitly identified as a limitation and future work direction: "We also hope to explore building up a library of reusable harnesses."
Comparison to Memory-Intensive Approaches
| Approach | Memory Pattern | AutoHarness Equivalent |
|---|---|---|
| Reflexion | Verbal memory of past failures | Implicit in tree structure (parent nodes) |
| Voyager | Skill library with embeddings | None (independent per game) |
| AlphaEvolve | Evolutionary archive of programs | Tree store (smaller scale) |
| Fine-tuning | Model weights | No weight modification |
14 Continued Learning
Within-Game Learning (During Synthesis)
The tree search is itself a learning process: the system improves its harness code through iterative refinement. Learning curves show characteristic patterns:
| Game | Convergence Pattern |
|---|---|
| Simple games (PigDice, Snake) | 1 iteration: trivial action space |
| Medium games (Checkers, Sudoku) | 3–10 iterations: progressive rule discovery |
| Complex games (Chess, Othello) | 30–64 iterations: complex rule systems with edge cases |
| Pathological (Breakthrough-small) | 136 iterations: small board amplifies combinatorial edge cases |
Across-Game Transfer: Not Yet
The current system trains each game independently. The authors identify this as the key limitation:
"Currently we generate a separate harness for each environment (game). In the future, we would like to distill the resulting domain specific experts (agents) back into the base LLM, so that the whole system becomes recursively self-improving."
Future Directions for Continued Learning
The paper outlines three future directions that would enable continued learning:
1. Recursive Self-Improvement: Distilling game-specific harnesses back into the base LLM, creating a feedback loop where the LLM becomes better at generating harnesses over time.
2. Reusable Harness Library: Building a library of harness components (board parsers, move validators, strategy heuristics) that can be composed for new games.
3. Multimodal Extension: Applying the approach to more complex environments like Craftax and Terra Nova, which require visual perception and more sophisticated state representations.
Relationship to AlphaEvolve
The paper explicitly positions itself relative to AlphaEvolve (Novikov et al., 2025), which applies evolutionary algorithms to entire codebases using an LLM as a mutation function. AutoHarness is more constrained:
| Dimension | AlphaEvolve | AutoHarness |
|---|---|---|
| Scope | Entire codebase | Single harness program |
| Search method | Evolutionary (population-based) | Tree search + Thompson sampling |
| Feedback | Algorithm-specific metrics | Environment legal/illegal signals |
| Goal | Discover algorithms | Synthesize action constraints |
| Scale | Large codebases | Small programs (~100–500 LOC) |
Potential for Self-Improvement Loops
The paper hints at a vision where AutoHarness becomes part of a larger self-improving system:
┌─────────────────────────────────────────────┐
│ RECURSIVE IMPROVEMENT VISION │
│ │
│ LLM ──generates──> Harnesses │
│ ↑ │ │
│ │ │ game-specific │
│ │ │ expertise │
│ │ ▼ │
│ └──distill────── Domain Experts │
│ │ │
│ │ compose │
│ ▼ │
│ Harness Library │
│ │ │
│ │ seed │
│ ▼ │
│ New Game → faster harness │
└─────────────────────────────────────────────┘
This vision is unrealized in the current paper but represents a clear roadmap.
15 Applications
Direct Applications
1. Game-Playing Agents
The paper's primary domain. AutoHarness enables:
- Compliant game play across diverse text-based games
- Smaller models to compete with larger ones
- Zero-cost inference via harness-as-policy
Target environments: TextArena (145 games), Craftax, Terra Nova, and any text-based game environment.
2. API Orchestration Agents
LLM agents calling APIs frequently generate invalid requests. AutoHarness could synthesize:
- Request validators based on API schemas
- Parameter constraint checkers
- State-aware action filters (e.g., "you can't cancel an order that's already shipped")
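By analogy with the game harness, the "can't cancel a shipped order" example could be expressed as a synthesized-style pure validator over an order-state machine. The transition table and function name below are hypothetical, chosen only to illustrate a state-aware action filter:

```python
# Hypothetical state-aware action filter for an order API.
# Legal API actions depend on the current order state.
VALID_TRANSITIONS = {
    "pending":   {"cancel", "ship"},
    "shipped":   {"deliver"},   # cancel is illegal once shipped
    "delivered": set(),
}

def is_legal_request(order_state: str, action: str) -> bool:
    """Pure function: accept the request iff the state machine allows it."""
    return action in VALID_TRANSITIONS.get(order_state, set())
```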
3. Robotic Control
The code-as-policies paradigm (Liang et al., 2023) already uses LLMs for robot control. AutoHarness adds: - Joint limit enforcement - Collision avoidance constraints - Workspace boundary enforcement
4. Code Generation Pipelines
LLM-generated code often fails type checking or violates schema constraints. AutoHarness could:
- Synthesize type validators for generated code
- Enforce API compatibility constraints
- Validate database query structure before execution
Broader Implications
The "Small Model + Harness > Large Model" Result
This is the paper's most provocative finding for the broader AI community. It suggests that:
1. Model size is not destiny. A Gemini-2.5-Flash with a synthesized harness beats Gemini-2.5-Pro. A synthesized code policy beats GPT-5.2-High.
2. Compute allocation matters. Spending compute on harness synthesis (a one-time cost) rather than on larger-model inference (a per-decision cost) is more efficient.
3. Code > neural networks for constraint enforcement. Learned code harnesses are deterministic, verifiable, and zero-cost at inference; neural constraint enforcement is probabilistic, opaque, and expensive.
The Rejection Sampling Connection
The paper's framework can be interpreted as learning the optimal rejection criterion for LLM outputs. This has implications beyond games:
| Application | Rejection Criterion |
|---|---|
| Code generation | Does the code compile? Pass tests? |
| Math proofs | Is each step logically valid? |
| Scientific hypotheses | Is the hypothesis consistent with known data? |
| Drug design | Does the molecule satisfy pharmacological constraints? |
In each case, the constraint checker could potentially be synthesized by the LLM itself, following the AutoHarness paradigm.
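The rejection-sampling view generalizes to a proposer plus an accept predicate. The sketch below is a minimal generic form (not the paper's implementation), using Python's built-in `compile()` as the criterion for the "does the code compile?" row of the table:

```python
def rejection_sample(propose, accept, max_tries=5):
    """Return the first proposal that passes the accept criterion, else None."""
    for _ in range(max_tries):
        candidate = propose()
        if accept(candidate):
            return candidate
    return None

def compiles(src: str) -> bool:
    """Accept criterion for code generation: does the source parse?"""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

# A proposer that yields a broken snippet first, then a valid one.
snippets = iter(["def f(:", "def f(x): return x + 1"])
result = rejection_sample(lambda: next(snippets), compiles)
```

AutoHarness's contribution, in this framing, is that `accept` itself is synthesized by the LLM from environment feedback rather than hand-written.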
LLM-as-Compiler
AutoHarness can be viewed as using the LLM to "compile" domain rules from natural language game descriptions into executable constraint checking code. This compiler metaphor suggests applications in:
- Regulatory compliance: Compile regulation text into automated compliance checkers
- Contract enforcement: Compile contract terms into automated verification
- Safety constraints: Compile safety specifications into runtime monitors
Limitations and Open Questions
1. Scalability to continuous action spaces: The current approach assumes discrete, enumerable actions. Extension to continuous domains (e.g., continuous robotic control) requires fundamentally different harness structures.
2. Multi-agent coordination: The harness is synthesized for a single agent. Multi-agent scenarios may require coordination-aware constraint checking.
3. Dynamic environments: The harness encodes static game rules. Environments with changing rules would require online harness adaptation.
4. Observation parsing brittleness: The harness relies on parsing text observations. Changes to the observation format could break the harness.
5. Strategic vs. tactical: The harness ensures legality (tactical correctness) but does not directly improve strategic quality. The harness-as-policy mode addresses this partially, but strategic depth is limited by the code synthesis budget.
Related Conceptual Connections
Connection to Evolutionary Algorithms
AutoHarness's tree search with Thompson sampling shares deep structural similarities with evolutionary algorithms:
| Concept | EA | AutoHarness |
|---|---|---|
| Individual | Candidate solution | Candidate harness program |
| Fitness | Objective function value | Legal action accuracy |
| Mutation | Random perturbation | LLM-guided code refinement |
| Selection | Tournament/fitness-proportionate | Thompson sampling |
| Population | Generation of individuals | Tree of program variants |
| Crossover | Recombination of parents | Not explicitly used (single-parent refinement) |
The LLM-as-mutation-operator is particularly interesting: it provides semantically meaningful mutations rather than random perturbations. The LLM understands the program's intent and can make targeted improvements based on error feedback, which is qualitatively different from random mutation operators in traditional genetic programming.
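The "Selection" row of the table can be sketched concretely. The paper specifies Thompson sampling over tree nodes; the Beta-posterior parameterization over per-node legal/illegal counts below is an assumption about its exact form, chosen because it is the standard Thompson-sampling setup for Bernoulli outcomes:

```python
import random

def thompson_select(nodes):
    """Thompson sampling over program variants.

    nodes: list of (successes, failures) legal/illegal counts per node.
    Draw from each node's Beta(1+s, 1+f) posterior; expand the argmax.
    """
    draws = [random.betavariate(1 + s, 1 + f) for s, f in nodes]
    return max(range(len(nodes)), key=lambda i: draws[i])

# Three candidate harnesses: mostly-legal, untested, mostly-illegal.
random.seed(0)
picks = [thompson_select([(90, 10), (5, 5), (1, 99)]) for _ in range(200)]
```

Unlike greedy selection, the posterior draws occasionally pick the uncertain middle node, which is the exploration behavior that distinguishes Thompson sampling from pure fitness-proportionate selection.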
Connection to Program Synthesis
AutoHarness sits at the intersection of:
- Neural program synthesis (the LLM generates code)
- Search-based program synthesis (tree search explores program space)
- Feedback-driven synthesis (the environment provides correctness signals)
This combination achieves what neither approach does alone: the LLM provides strong priors for generating plausible programs, while the search structure and environment feedback correct the LLM's errors.
Connection to Formal Verification
The is_legal_action() function is effectively a runtime assertion or invariant checker. In formal verification terms:
- Pre-condition: The observation represents a valid game state
- Post-condition: The action is legal in that state
- Invariant: The agent never executes an illegal action
The difference from formal verification is that these conditions are learned (synthesized as code) rather than specified a priori. This is a form of learned formal methods—an emerging area at the intersection of ML and software engineering.
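The runtime-assertion reading above can be written down directly. In this sketch, `looks_like_valid_state` and `is_legal_action` are hypothetical stand-ins for synthesized harness functions, and the assertions encode the pre-condition, post-condition, and invariant just described:

```python
def looks_like_valid_state(observation: str) -> bool:
    # Stand-in pre-condition check on the observation.
    return len(observation) > 0

def is_legal_action(observation: str, action: str) -> bool:
    # Stand-in for a synthesized legality checker.
    return action in observation.split()

def checked_step(env_step, observation: str, action: str):
    """Wrap an environment step in learned runtime assertions."""
    assert looks_like_valid_state(observation)   # pre-condition
    assert is_legal_action(observation, action)  # post-condition on the proposal
    # Invariant: no illegal action ever reaches the environment.
    return env_step(action)
```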
This analysis is based on the paper as published on arXiv (2603.03329v1, February 10, 2026). The paper was authored by researchers at Google DeepMind. No code repository has been publicly released as of the analysis date.