
AutoHarness

LLM-driven automatic synthesis of code harnesses that constrain agent action spaces, enabling smaller models to outperform larger ones through learned rejection sampling and program-space search.

Organization: Google DeepMind
Published: February 10, 2026
Type: paper
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: AutoHarness: improving LLM agents by automatically synthesizing a code harness

arXiv ID: 2603.03329

DOI: 10.48550/arXiv.2603.03329

License: CC BY 4.0

Venue: arXiv preprint (cs.CL, cs.AI)

Submission Date: February 10, 2026

"Often people manually write 'harnesses' around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment."

This paper addresses a fundamental and pervasive failure mode of LLM agents: action illegality. When deployed as decision-making agents in structured environments, LLMs frequently propose actions that are syntactically or semantically invalid—not merely suboptimal, but strictly prohibited by the environment's rules. The authors frame this as an instance of the "action applicability" problem studied in AI planning (Kokel et al., 2025) and propose a self-contained solution where the LLM generates its own safety wrapper.


2 Authors and Team

Author Affiliation Role / Expertise
Xinghua Lou Google DeepMind Lead author, corresponding author; code synthesis and agent architecture
Miguel Lázaro-Gredilla Google DeepMind Probabilistic models, search methods, Thompson sampling design
Antoine Dedieu Google DeepMind Program synthesis, iterative refinement
Carter Wendelken Google DeepMind Agent evaluation, game environment integration
Wolfgang Lehrach Google DeepMind Code world models, game simulation (prior work on code world models for general game playing)
Kevin P. Murphy Google DeepMind Senior researcher; probabilistic ML, structured prediction, agent design

Contact: {xinghua, lazarogredilla, adedieu, cwendelken, wpl, kpmurphy}@deepmind.com

Team Context: This team sits within Google DeepMind's agent systems group, with strong ties to the Gemini model family. Murphy is a well-known figure in probabilistic ML (author of Machine Learning: A Probabilistic Perspective). Lehrach has prior work on code world models for general game playing (Lehrach et al., 2025), which is the complementary problem of synthesizing the environment's state-transition function rather than the agent's action filter. The team brings together expertise in Bayesian optimization (Thompson sampling), program synthesis, and large-scale LLM evaluation.


3 Core Contribution

The Problem: Action Illegality in LLM Agents

The paper identifies a critical and quantifiable failure mode: LLM agents make illegal moves. This is not about strategy quality—it is about rule compliance.

The motivating statistic is devastating:

78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition were attributed to illegal moves—not strategic blunders.

This highlights a disconnect between the model's apparent understanding of a domain and its ability to comply with structural constraints. The problem generalizes beyond games to any environment with a constrained action space:

Domain Illegal Action Example
Chess Moving a piece to an occupied friendly square
API orchestration Calling an endpoint with invalid parameters
Code generation Generating syntactically invalid code
Robotic control Commanding a joint beyond its physical limits
Database queries Writing SQL that violates schema constraints

The Solution: Code-as-Harness

The core insight is meta-level self-correction: rather than fine-tuning the LLM (expensive, degrades other capabilities) or manually writing harnesses (brittle, labor-intensive), the LLM synthesizes its own harness through iterative code refinement with environment feedback.

┌─────────────────────────────────────────────────────────┐
│                    TRADITIONAL APPROACH                  │
│                                                         │
│   Human Engineer ──writes──> Harness Code               │
│                              (per game, per domain)     │
│                              ↓                          │
│   LLM Agent ──proposes──> Action ──filter──> Valid?     │
│                                              │  │       │
│                                             Yes  No     │
│                                              ↓   ↓      │
│                                          Execute Retry   │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                    AUTOHARNESS APPROACH                  │
│                                                         │
│   LLM (Gemini-2.5-Flash)                               │
│     │                                                   │
│     ├──generates──> Harness Code (is_legal_action())    │
│     │                  ↑                                │
│     │              feedback                             │
│     │                  │                                │
│     └──refines via──> Tree Search + Thompson Sampling   │
│                        ↓                                │
│   LLM Agent ──proposes──> Action ──harness──> Valid?    │
│                                               │  │      │
│                                              Yes  No    │
│                                               ↓   ↓     │
│                                           Execute Retry  │
└─────────────────────────────────────────────────────────┘

Key distinction from prior work: The harness is not learned from demonstrations or fine-tuned into the model—it is synthesized as executable code through a structured search over the program space, using the LLM as a mutation operator.

Three Harness Modes

The framework supports three configurations of increasing autonomy:

Mode LLM at Test Time? Description
Harness-as-Action-Filter Yes (ranking) Code generates legal move set; LLM ranks and selects
Harness-as-Action-Verifier Yes (generation + verification) LLM proposes action; code verifies legality; retry on rejection
Harness-as-Policy No Code generates action directly; zero LLM calls at inference

The paper primarily focuses on harness-as-action-verifier but demonstrates the striking result that harness-as-policy (no LLM at test time) outperforms even the largest frontier models.


4 Supported Solutions

Solution Space: What AutoHarness Generates

AutoHarness generates Python programs with two primary functions:

def is_legal_action(observation: str, action: str) -> bool:
    """Given the current game observation and a proposed action,
    returns True if the action is legal, False otherwise."""
    ...

def propose_action(observation: str) -> str:
    """Given the current game observation,
    proposes a valid action."""
    ...

These functions are domain-specific but structurally uniform—the same interface applies across all 145 games.
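To make the interface concrete, here is a minimal sketch of the two functions for a hypothetical toy guessing game (the game, its observation format, and the bracketed action syntax are illustrative assumptions, not a TextArena game from the paper):

```python
import re

# Hypothetical toy game (NOT from the paper): the observation states a
# valid range, and actions are written as "[guess N]".

def is_legal_action(observation: str, action: str) -> bool:
    """Return True if the proposed action is legal in this toy game."""
    m = re.fullmatch(r"\[guess (\d+)\]", action.strip())
    if m is None:
        return False  # malformed action string
    lo, hi = map(int, re.search(r"between (\d+) and (\d+)", observation).groups())
    return lo <= int(m.group(1)) <= hi

def propose_action(observation: str) -> str:
    """Propose a legal action: guess the midpoint of the stated range."""
    lo, hi = map(int, re.search(r"between (\d+) and (\d+)", observation).groups())
    return f"[guess {(lo + hi) // 2}]"

obs = "Guess a number between 1 and 100."
print(is_legal_action(obs, "[guess 50]"))   # True
print(is_legal_action(obs, "[guess 500]"))  # False
print(propose_action(obs))                  # [guess 50]
```

The synthesized harnesses for real games follow this same two-function contract, with far more elaborate parsing and rule logic.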

Solution Variants by Harness Mode

Harness-as-Action-Verifier (Primary):

while not done:
    action = LLM.generate(observation)
    if not is_legal_action(observation, action):
        action = LLM.generate(observation + "ILLEGAL: " + action)
    execute(action)

Harness-as-Action-Filter:

while not done:
    legal_actions = [a for a in all_actions if is_legal_action(obs, a)]
    action = LLM.rank_and_select(observation, legal_actions)
    execute(action)

Harness-as-Policy (Code-Only):

while not done:
    action = propose_action(observation)
    execute(action)
    # Zero LLM calls

Environment Integration

The system operates within the TextArena framework (Guertler et al., 2025), a comprehensive collection of text-based games. Key properties of supported environments:

Property Specification
Observation format Text strings describing game state
Action format Text strings (structured, e.g., UCI notation for chess)
Feedback signals Legal/illegal indicator, reward (sparse, at trajectory end)
Game types 1-player puzzle/strategy, 2-player competitive
Total games supported 145 (after excluding 9 free-form text/dialog games)
Action space Discrete, game-specific, often combinatorially large

Additional Hardening

The authors deliberately removed "Available Moves" hints from game observations to prevent trivial copy-paste solutions. This forces the harness to deduce legality from structural understanding of the game rules, rather than reading an explicit list.


5 LLM Integration

Model Usage During Harness Synthesis (Training)

Component Model Purpose
Harness Generator Gemini-2.5-Flash Generates and refines is_legal_action() and propose_action() code
Critic Gemini-2.5-Flash Consolidates error messages from failed rollouts into structured feedback
Refiner Gemini-2.5-Flash Takes current code + critic feedback → proposes improved code

Single-model training: The entire harness synthesis uses only Gemini-2.5-Flash—a deliberately smaller, cheaper model. This is a key design choice: the harness enables a small model to match or exceed the performance of models orders of magnitude larger.

Model Usage During Agent Evaluation (Test Time)

Agent Configuration Model Purpose
Gemini-2.5-Flash (vanilla) Gemini-2.5-Flash Baseline: raw LLM without harness
Gemini-2.5-Pro (vanilla) Gemini-2.5-Pro Strong baseline: larger model without harness
Gemini-2.5-Flash + Harness (Ours) Gemini-2.5-Flash Flash with synthesized harness for action verification
GPT-5.2 (no thinking) GPT-5.2 Cross-family baseline
GPT-5.2-High (high thinking) GPT-5.2-High Strongest baseline (high reasoning budget)
Harness-as-Policy (Ours) None Pure code, zero LLM calls at test time

Prompt Engineering

The system uses a structured prompt for the LLM-as-policy configuration:

You are an expert, logical, and strategic AI game player. Your task is to
analyze the following game information and determine the single best move
to make.

Read the game rules, your player role, the current game state, and all
available moves carefully. Your objective is to play optimally to
maximize your chances of winning the game.

You are now player {player_id}.

The game information is as follows:
{observation}

**YOUR TASK:**

**Step 1: Think**
First, provide your step-by-step reasoning. Analyze the current game
state, your goal, and the available moves.

**Step 2: Move**
After your thinking block, provide *only* the single best move you have
chosen. Enclose your final move in `<move></move>` tags.

The same optimized prompt is used across all agent configurations, ensuring fair comparison.

LLM-as-Mutation-Operator

The most novel aspect of LLM integration is using the model as a gradient-free code optimizer. The LLM does not merely generate code from scratch—it iteratively refines existing code based on structured feedback:

Input to Refiner:
  1. Current harness code (Python source)
  2. Up to 5 failed execution traces with:
     - Game observation at failure point
     - Proposed action
     - Whether is_legal_action() returned True/False
     - Whether the action was actually legal/illegal
     - Error messages from environment
  3. Critic's consolidated analysis

Output from Refiner:
  → Refined harness code (improved Python source)

This creates a closed-loop optimization where the LLM acts as both the search operator and the solution representation.
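A minimal runnable sketch of this closed loop, with the Critic and Refiner stubbed out (the real system calls Gemini-2.5-Flash for both; every name and the toy harness here are illustrative assumptions):

```python
# Stubbed closed refinement loop: rollout -> critic -> refiner -> rollout.
# The real Critic/Refiner are LLM calls; here they are deterministic stubs
# so the control flow is runnable.

def rollout(harness_code):
    """Stub: execute the harness, return (legal_fraction, failure_traces)."""
    ns = {}
    exec(harness_code, ns)  # compile the candidate harness
    ok = ns["is_legal_action"]("obs", "[guess 50]")
    return (1.0, []) if ok else (0.0, [{"action": "[guess 50]", "error": "rejected legal move"}])

def critic(traces):
    """Stub Critic: consolidate up to 5 failure traces into one summary."""
    return "Fix is_legal_action: " + "; ".join(t["error"] for t in traces[:5])

def refiner(code, feedback):
    """Stub Refiner: in the real system an LLM rewrites the code."""
    return code.replace("return False", "return True")

harness = "def is_legal_action(obs, act):\n    return False\n"
for it in range(10):
    score, traces = rollout(harness)
    if score == 1.0:            # termination: legal action rate = 1.0
        break
    harness = refiner(harness, critic(traces))
print(it, score)                # converges after one refinement
```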


6 Key Results

The harness achieves 100% legal action accuracy across all 145 TextArena games on held-out test rollouts (1000 steps × 10 random seeds per game). This is a binary success/failure metric—the harness either catches all illegal actions or it doesn't.

Metric Value
Games evaluated 145
Games with 100% legal action rate 145 (100%)
Test rollout length 1,000 steps
Random seeds per game 10
Average training iterations to converge 14.5
Games converging in <10 iterations 19/32 (subset)

2-Player Game Results: Flash+Harness vs. Pro

When evaluating game-playing performance on 16 selected 2-player games (40 matches each, balanced for first/second player):

| Matchup | Win Rate (Ours) | Win Rate (Opponent) | Games Won |
|---|---|---|---|
| Flash+Harness vs. Gemini-2.5-Pro | 56.3% | 38.2% | 9/16 |
| Flash+Harness vs. Gemini-2.5-Flash | 64.8% | | 12/16 |

A smaller model (Flash) with an automatically synthesized harness beats a much larger model (Pro) at competitive gameplay, winning the majority of games.

1-Player Game Results: Average Reward

On 16 single-player games (20 matches each):

| Agent | Average Reward | Games Won (vs. nearest) |
|---|---|---|
| Gemini-2.5-Flash (vanilla) | 0.673 | |
| Gemini-2.5-Pro (vanilla) | 0.707 | |
| Gemini-2.5-Flash + Harness | 0.745 | 8/16 wins, 5/16 ties vs. Pro |

Harness-as-Policy: The Headline Result

The most striking finding—a pure code policy with zero LLM calls at test time outperforms all frontier models:

| Agent | Average Reward | Test-Time LLM Calls | Test Cost |
|---|---|---|---|
| GPT-5.2 (no thinking) | 0.635 | Many | ~$640* |
| Gemini-2.5-Flash (vanilla) | 0.673 | Many | Moderate |
| Gemini-2.5-Pro (vanilla) | 0.707 | Many | High |
| GPT-5.2-High (high thinking) | 0.844 | Many | ~$640* |
| Harness-as-Policy (Ours) | 0.870 | 0 | ~$0 |

*GPT-5.2 costs are for reduced evaluation (10 and 5 repeats respectively vs. 20 for others).

"Since Harness-as-Policy generates pure (Python) code, our test time cost is nearly zero, while the GPT-5.2 and GPT-5.2-High experiments cost approximately $640."

This result demonstrates that an LLM can synthesize a complete decision-making policy in code form that exceeds the performance of the LLM itself (and even larger/more expensive models), at effectively zero marginal cost per decision.

Per-Game Detail: Training Difficulty

The most challenging games for harness learning:

Game Training Iterations Difficulty Driver
GermanWhist-v0 (2P) 43 Complex card game rules
Cryptarithm-v0 (1P) 45 Cryptarithmetic puzzle constraints
Othello-v0 (2P) 62 Board state validation complexity
Chess-v0 (2P) 64 Complex piece movement rules, castling, en passant
Breakthrough-v0-small (2P) 136 Small board amplifies edge cases

The simplest games (PigDice variants, Snake, ColonelBlotto) converge in 1 iteration—their action spaces are trivially enumerable.


7 Reproducibility

Environment Availability

Component Availability URL
TextArena game suite Open source TextArena on arXiv
Game definitions Included in TextArena
Evaluation protocol Fully specified in paper
Generated harness code Examples in Appendix D

Experimental Configuration

Parameter Value
Parallel environments during training 10
Max rollout steps per iteration 1,000
Max failed steps fed to Critic 5
Thompson sampling weight 1.0
Training termination Legal action rate = 1.0 or timeout
1P evaluation: matches per game 20
2P evaluation: matches per game 40 (20 as P1, 20 as P2)
Harness-as-Policy: max training iterations 256
Harness-as-Policy: average training iterations 89.4
Harness-as-Policy: average heuristic at termination 0.939
Games excluded 9 free-form text/dialog games

Reproducibility Challenges

  1. Model Access: Requires access to Gemini-2.5-Flash (and Pro, GPT-5.2 for baselines), which are proprietary APIs with potential non-determinism.
  2. No Code Release: The paper does not reference a public code repository. The harness synthesis framework itself is not open-sourced.
  3. TextArena Modifications: The authors modified some games to remove "Available Moves" hints. The exact modifications are described but not released as a patch.
  4. Stochasticity: Thompson sampling and LLM generation introduce randomness; reproducing exact numbers requires multiple seeds.
  5. API Versioning: Results are tied to specific model versions (Gemini-2.5-Flash/Pro, GPT-5.2/5.2-High) which may change over time.

8 Compute and API Costs

Training Costs

| Phase | Model | Calls per Game (avg) | Cost Driver |
|---|---|---|---|
| Harness synthesis (verifier) | Gemini-2.5-Flash | ~14.5 iterations × (10 envs × rollout + critic + refiner) | Low per-token cost of Flash |
| Harness synthesis (policy) | Gemini-2.5-Flash | ~89.4 iterations (max 256) | Higher iteration count but still Flash pricing |

Each iteration involves:

  • 10 parallel environment rollouts (up to 1,000 steps each)
  • Critic analysis of up to 5 failure traces
  • Refiner code generation

Rough cost estimate per game (harness-as-verifier): With Flash pricing around $0.15/M input tokens and $0.60/M output tokens, and typical prompts of a few thousand tokens, each game's harness synthesis likely costs $1–5 in API calls.

For all 145 games: ~$150–700 total for harness synthesis.

Inference Costs

Agent Cost per Decision Cost per Game (est.)
Gemini-2.5-Flash (vanilla) Standard Flash pricing ~$0.01–0.05
Gemini-2.5-Pro (vanilla) Standard Pro pricing ~$0.10–0.50
Flash + Harness (verifier) Flash pricing + negligible code execution ~$0.01–0.05
GPT-5.2-High Premium pricing ~$1–10 per game
Harness-as-Policy ~$0 (pure code) ~$0

Cost Efficiency Analysis

The paper explicitly highlights the cost advantage:

"$640 for GPT-5.2 and GPT-5.2-High experiments" (reduced evaluation: 10 and 5 repeats)

vs.

"Harness-as-Policy test time cost is nearly zero"

This represents a cost reduction of several orders of magnitude for comparable or superior performance.

Training-to-Inference Amortization

Training Cost (one-time):     ~$1-5 per game × 145 games = ~$150-700
Inference Cost (per decision): $0 (harness-as-policy)

Break-even vs. GPT-5.2-High:  After ~100 decisions per game
(assuming ~$1/game for GPT-5.2-High per 20-match evaluation)
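The break-even arithmetic above can be checked in a few lines; all figures are the report's own rough estimates, and the decisions-per-game count is an additional assumption:

```python
# Back-of-envelope break-even vs. GPT-5.2-High. All numbers are rough
# estimates from the report (training cost, per-game API cost); the
# decisions-per-game figure is an illustrative assumption.

training_cost_per_game = 3.0   # midpoint of the ~$1-5 estimate
gpt_cost_per_game = 1.0        # assumed ~$1/game for GPT-5.2-High
decisions_per_game = 30        # assumed typical game length

gpt_cost_per_decision = gpt_cost_per_game / decisions_per_game
break_even_decisions = training_cost_per_game / gpt_cost_per_decision
print(round(break_even_decisions))  # 90, i.e. on the order of ~100
```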

9 Architecture Solution

High-Level Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    AUTOHARNESS SYSTEM ARCHITECTURE               │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                    TRAINING PHASE                         │    │
│  │                                                           │    │
│  │  ┌─────────┐    ┌──────────┐    ┌─────────┐              │    │
│  │  │  Tree    │───>│ Thompson │───>│  Node   │              │    │
│  │  │  Store   │    │ Sampling │    │ Select  │              │    │
│  │  └─────────┘    └──────────┘    └────┬────┘              │    │
│  │       ↑                              │                    │    │
│  │       │                              ▼                    │    │
│  │  ┌────┴────┐    ┌──────────┐    ┌─────────┐              │    │
│  │  │  New    │<───│ Refiner  │<───│ Critic  │              │    │
│  │  │  Node   │    │  (LLM)   │    │  (LLM)  │              │    │
│  │  └─────────┘    └──────────┘    └────┬────┘              │    │
│  │                                      │                    │    │
│  │                                      ▼                    │    │
│  │                              ┌──────────────┐             │    │
│  │                              │  Environment │             │    │
│  │                              │  (TextArena) │             │    │
│  │                              │  10 parallel │             │    │
│  │                              └──────────────┘             │    │
│  └──────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                   INFERENCE PHASE                         │    │
│  │                                                           │    │
│  │  ┌──────────┐    ┌──────────┐    ┌──────────┐            │    │
│  │  │   LLM    │───>│ Harness  │───>│  Env     │            │    │
│  │  │  Agent   │    │  Code    │    │ Execute  │            │    │
│  │  │(optional)│    │(learned) │    │          │            │    │
│  │  └──────────┘    └──────────┘    └──────────┘            │    │
│  │                                                           │    │
│  │  Mode A: LLM proposes → Code verifies → Execute/Retry    │    │
│  │  Mode B: Code proposes legal set → LLM ranks → Execute   │    │
│  │  Mode C: Code proposes action → Execute (no LLM)         │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘

Key Architectural Decisions

  1. Program-Space Search, Not Weight-Space: The search operates over Python programs, not model parameters. This preserves the LLM's general capabilities while adding domain-specific constraints.

  2. Tree Structure with Thompson Sampling: Unlike linear iterative refinement (e.g., Reflexion), the system maintains a tree of candidate programs and uses Thompson sampling to balance exploration (trying new logical structures) vs. exploitation (refining partially working harnesses).

  3. Environment-in-the-Loop: The training loop is closed through the game environment itself—not through a learned proxy or reward model. The environment provides ground-truth legality signals.

  4. Separation of Concerns: The Critic consolidates raw error traces into structured feedback; the Refiner uses this structured feedback to propose code improvements. This two-stage pipeline prevents the refiner from being overwhelmed by raw error logs.

  5. Single-Model Architecture: Both Critic and Refiner use the same model (Gemini-2.5-Flash), keeping the system simple and cost-effective.

Relationship to Broader Agent Architecture Patterns

COMPARISON: Agent Architecture Paradigms
─────────────────────────────────────────────────────
                    Fine-Tuning    Manual    AutoHarness
                                   Harness
─────────────────────────────────────────────────────
Modifies LLM         Yes           No        No
Requires human eng.  No            Yes       No
Domain transfer      Poor          None      Automatic
Cost at training     Very High     Zero      Low
Cost at inference    Standard      Standard  Zero (policy)
Preserves LLM caps   No            Yes       Yes
Scalable to N games  Manual        Manual    Automatic
─────────────────────────────────────────────────────

10 Component Breakdown

Component 1: Tree Store

Purpose: Maintains the population of candidate harness programs as a tree data structure.

Property Specification
Structure Tree (nodes = code versions, edges = refinement steps)
Root Initial empty/template harness
Node value Average legal action accuracy (0.0–1.0)
Growth New nodes added via Refiner output
Termination Any node reaches value 1.0, or timeout

Each node stores:

  • The complete Python source code of the harness
  • The heuristic value (average fraction of legal actions across test rollouts)
  • Parent pointer (for lineage tracking)
  • Number of evaluations (for Thompson sampling confidence)
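One way the tree-store node could look as a Python dataclass (the field names are illustrative; the paper does not publish code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HarnessNode:
    """Hypothetical tree-store node for one candidate harness program."""
    code: str                        # complete Python source of the harness
    value: float                     # avg legal-action accuracy in [0, 1]
    parent: Optional["HarnessNode"]  # lineage pointer (None at the root)
    n_evals: int = 0                 # evaluation count for Thompson sampling

root = HarnessNode(code="", value=0.0, parent=None)
child = HarnessNode(code="def is_legal_action(o, a): ...", value=0.3, parent=root)
assert child.parent is root          # lineage is tracked via parent pointers
```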

Component 2: Thompson Sampling Selector

Purpose: Selects which tree node to refine next, balancing exploration and exploitation.

The system follows the approach of Tang et al. (2024) for code repair with exploration-exploitation tradeoffs. Thompson sampling maintains a posterior distribution over each node's true quality and samples from these distributions to select the next node to refine.

For each node n in tree:
  Sample θ_n ~ Beta(α_n, β_n)     # posterior on true quality
  where α_n = successes + 1, β_n = failures + 1

Select node n* = argmax_n θ_n
Refine n* → create new child node

The heuristic weight parameter is set to 1.0, controlling the exploration-exploitation balance.

Why Thompson sampling over simpler strategies:

  • Random selection wastes iterations on unpromising branches
  • Greedy selection gets stuck in local optima
  • UCB1 requires tuning an exploration constant
  • Thompson sampling naturally adapts its exploration as uncertainty decreases
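The Beta-posterior selection rule in the pseudocode above can be sketched as runnable code (the node counts are illustrative; the paper's actual heuristic-weighted variant is not published):

```python
import random

def thompson_select(nodes, rng=random):
    """nodes: list of (successes, failures) per tree node.
    Sample theta_n ~ Beta(s+1, f+1) and return the argmax index."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in nodes]
    return max(range(len(nodes)), key=samples.__getitem__)

rng = random.Random(0)
nodes = [(9, 1), (1, 9), (0, 0)]  # strong node, weak node, unexplored node
picks = [thompson_select(nodes, rng) for _ in range(1000)]
print(picks.count(0), picks.count(1), picks.count(2))
# The strong node dominates, yet the unexplored node (wide posterior)
# still gets a meaningful share of draws: exploration without a tuned constant.
```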

Component 3: Environment Rollout Engine

Purpose: Executes candidate harness code against actual game environments to measure legal action accuracy.

Parameter Value
Parallel environments 10
Max steps per rollout 1,000
Termination Illegal move by code, code execution failure, or step limit
Auto-reset Environment resets automatically when a game ends

The rollout engine produces execution traces that include:

  • The observation at each step
  • The action proposed
  • Whether is_legal_action() returned True or False
  • Whether the environment accepted the action
  • Any error messages or exceptions

Component 4: Critic

Purpose: Consolidates raw execution failure traces into structured, actionable feedback.

The Critic receives up to 5 sampled failure traces and produces a consolidated analysis identifying:

  • Common failure patterns across traces
  • Root causes of illegal actions
  • Specific code locations that need modification
  • Environmental constraints that the harness fails to capture

This consolidation step is critical—raw error traces can be verbose and contradictory, and the Refiner benefits from a clean summary.

Component 5: Refiner (LLM as Mutation Operator)

Purpose: Takes current harness code + Critic feedback and generates improved harness code.

The Refiner operates as a gradient-free code optimizer. Its input is:

  1. The current harness source code
  2. The Critic's consolidated error analysis
  3. The specific failure traces (for reference)

Its output is a complete, revised harness program.

Refinement Logic:

if is_legal_action() returns True but action is actually ILLEGAL:
    → Refine BOTH is_legal_action() AND propose_action()
    (The legality check is wrong)

if is_legal_action() returns False but action is actually ILLEGAL:
    → Refine ONLY propose_action()
    (The legality check is correct; the proposal is bad)

This asymmetric refinement strategy prevents correct components from being needlessly modified.
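The attribution rule above reduces to a small decision function; this is an illustrative restatement of the paper's prose rule, not code from the paper:

```python
def components_to_refine(check_said_legal: bool, truly_legal: bool) -> set:
    """Given the legality check's verdict and ground truth for one step,
    return which harness functions the Refiner should modify."""
    if check_said_legal and not truly_legal:
        # False positive: the check missed an illegal move -> both are suspect.
        return {"is_legal_action", "propose_action"}
    if not check_said_legal and not truly_legal:
        # True negative: the check is right; the proposal was bad.
        return {"propose_action"}
    return set()  # action was legal: nothing to refine

print(components_to_refine(True, False))   # both functions
print(components_to_refine(False, False))  # {'propose_action'}
```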

Component 6: Harness Code (The Artifact)

Purpose: The generated Python program that serves as the agent's action filter or complete policy.

For harness-as-action-verifier, the typical structure:

import re
import numpy as np

def is_legal_action(observation: str, action: str) -> bool:
    # Parse game state from observation text
    board = parse_board(observation)
    move = parse_move(action)

    # Validate move against game rules
    if not is_valid_piece(board, move):
        return False
    if not is_valid_destination(board, move):
        return False
    if causes_self_check(board, move):
        return False
    return True

def propose_action(observation: str) -> str:
    # Generate a candidate move (may or may not use heuristics)
    board = parse_board(observation)
    legal_moves = get_all_legal_moves(board)
    return format_move(legal_moves[0])  # or heuristic selection

For harness-as-policy, propose_action() encodes the complete decision-making logic—including strategic reasoning—without any LLM calls.


11 Core Mechanisms (Detailed)

Mechanism 1: Tree Search over Program Space

The search over program space is the paper's primary technical contribution. Unlike simple iterative prompting (e.g., Reflexion's linear chain), the tree structure enables:

Branching: A single parent node can spawn multiple children with different refinement strategies. If one refinement path leads to a dead end, the system can backtrack to a sibling or parent.

Depth vs. Breadth: Thompson sampling naturally controls the tree's shape. Early in training (high uncertainty), sampling favors breadth (exploring diverse approaches). Later (low uncertainty), sampling favors depth (refining the best approach).

Tree Evolution Example (Chess Harness):
─────────────────────────────────────────
Iteration 0:  [Root: empty harness]
              Value: 0.0

Iteration 1:  [Root] → [V1: basic piece movement]
                       Value: 0.3

Iteration 3:  [Root] → [V1] → [V2: + captures]
                               Value: 0.6
                     → [V1b: alternative parsing]
                       Value: 0.4

Iteration 8:  [Root] → [V1] → [V2] → [V3: + castling]
                                       Value: 0.85
                                     → [V4: + en passant]
                                       Value: 0.92
                     → [V1b] → [abandoned]

Iteration 14: [Root] → [V1] → [V2] → ... → [V7: complete]
                                              Value: 1.0

Mechanism 2: Heuristic Function Design

For harness-as-action-verifier, the heuristic is simply:

H = fraction of legal actions across all rollout steps

For harness-as-policy, the heuristic incorporates reward:

H = 0              if any illegal action is taken
H = 0.5 + 0.5 × r  otherwise, where r ∈ [0, 1] is the environment reward

This design encodes a strict priority:

  1. First priority: eliminate all illegal actions (H = 0 until all actions are legal)
  2. Second priority: maximize game reward (H scales from 0.5 to 1.0 with reward)

The 0.5 offset ensures that a policy with all legal actions but zero reward (H = 0.5) is strictly preferred over a policy with any illegal action (H = 0), preventing the system from reverting to illegal-but-high-reward strategies.
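Written out directly (a one-to-one transcription of the formula above, with illustrative names):

```python
def policy_heuristic(any_illegal: bool, reward: float) -> float:
    """Harness-as-policy heuristic: legality strictly dominates reward."""
    if any_illegal:
        return 0.0                 # first priority: full legality
    return 0.5 + 0.5 * reward      # second priority: scale with reward in [0, 1]

assert policy_heuristic(True, 1.0) == 0.0   # illegal but high reward loses...
assert policy_heuristic(False, 0.0) == 0.5  # ...to legal with zero reward
assert policy_heuristic(False, 1.0) == 1.0
```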

Mechanism 3: Iterative Code Refinement with Rich Feedback

The refinement loop uses environment execution as ground truth—not a learned reward model or human feedback. This gives the system access to:

  • Exact failure points: Which step, which observation, which action
  • Error types: Syntax error, runtime error, logical error (action executed but illegal)
  • Environment messages: Game-specific error messages explaining why an action was rejected

This rich feedback is qualitatively different from the scalar rewards used in RL or the binary pass/fail signals used in program synthesis competitions.

Mechanism 4: Observation Processing Without Available Moves

A subtle but important design decision: the authors remove "Available Moves" hints from game observations before feeding them to the harness. In the original TextArena Chess observation:

Valid moves: [g1h3], [g1f3], [b1c3], [b1a3], [h2h3], ...

This line is deleted, forcing the harness to:

  1. Parse the board state from the ASCII representation
  2. Understand piece movement rules
  3. Enumerate legal moves from first principles

This makes the problem significantly harder but also significantly more interesting—the harness must encode genuine understanding of game mechanics, not just string matching.
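The hardening step itself amounts to deleting one line from the observation; a sketch (the exact TextArena line format here is assumed from the excerpt above, and the helper name is hypothetical):

```python
import re

def strip_move_hints(observation: str) -> str:
    """Remove any 'Valid moves: ...' line so legality must be derived
    from the board state itself, not read from an explicit list."""
    return re.sub(r"^Valid moves:.*$\n?", "", observation, flags=re.MULTILINE)

obs = "r n b q k b n r\n...board rows...\nValid moves: [g1h3], [g1f3]\n"
print(strip_move_hints(obs))  # board text only, hint line removed
```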

Mechanism 5: Rejection Sampling Interpretation

The harness-as-action-verifier mode can be viewed as learned rejection sampling:

Standard Rejection Sampling:
  Sample x ~ q(x)              (LLM generates action)
  Accept with probability       (harness evaluates legality)
    p(x) / (M × q(x))
  If rejected, resample

AutoHarness:
  Sample action ~ LLM(observation)
  Accept if is_legal_action(observation, action)
  If rejected, resample with augmented prompt

The key difference from standard rejection sampling: the acceptance criterion is_legal_action() is itself learned through program synthesis, and the rejection triggers a modified resampling (the LLM receives additional context about why the action was rejected).

This creates an adaptive rejection sampler where:

  • The acceptance function is learned (not fixed)
  • Rejection modifies the proposal distribution (not just resampling from the same distribution)
  • The acceptance function is a program (not a density ratio)
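In loop form, with a stubbed proposal distribution standing in for the LLM (all names and the single-position legality stub are illustrative assumptions):

```python
def llm_propose(prompt: str) -> str:
    """Stub LLM: proposes an illegal move first, a legal one once told why."""
    return "[e2e5]" if "ILLEGAL" not in prompt else "[e2e4]"

def is_legal_action(observation: str, action: str) -> bool:
    return action == "[e2e4]"      # stub legality check for one toy position

def act(observation: str, max_retries: int = 3) -> str:
    """Adaptive rejection sampling: each rejection augments the prompt,
    shifting the proposal distribution rather than resampling blindly."""
    prompt = observation
    for _ in range(max_retries):
        action = llm_propose(prompt)
        if is_legal_action(observation, action):
            return action
        prompt = observation + f"\nILLEGAL: {action}"
    raise RuntimeError("no legal action found within retry budget")

print(act("White to move."))  # [e2e4]
```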

Mechanism 6: Asymmetric Error Attribution

The refinement strategy distinguishes two failure modes:

| Failure Mode | is_legal_action() says | Truth | What to Refine |
|---|---|---|---|
| False positive | True (legal) | Illegal | Both functions |
| True negative | False (illegal) | Illegal | Only propose_action() |

This is important because false positives indicate a bug in the legality check itself (it failed to catch an illegal move), while true negatives indicate the legality check is working but the proposal generator needs improvement.


12 Programming Language

Primary Language: Python

The harness code is generated as Python, which is natural given:

  • LLMs are most capable at Python code generation
  • The TextArena environment is Python-based
  • Python supports the standard libraries needed for game logic (re, numpy)

Allowed Dependencies

The generated harness code uses only:

  • The Python standard library (particularly re for pattern matching)
  • numpy for numerical operations
  • No external game-specific libraries (no python-chess for chess, etc.)

This constraint is important: the harness must learn game rules from scratch, not wrap existing rule engines.
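To make the constraint concrete, here is a minimal, hypothetical harness in the style the paper describes, for Tic-Tac-Toe rather than any game in the paper's appendix: pure functions, standard library only (`re`), no game engine. The observation format is assumed for illustration, not taken from TextArena.

```python
import re

def parse_board(observation):
    """Extract a 3x3 Tic-Tac-Toe board from an ASCII observation.
    Assumes cells are rendered as X, O, or - separated by whitespace."""
    return re.findall(r"[XO-]", observation)[:9]

def get_legal_actions(observation):
    """Legal moves are the indices of empty cells."""
    board = parse_board(observation)
    return [i for i, cell in enumerate(board) if cell == "-"]

def is_legal_action(observation, action):
    """Pure function: no state persists between calls; legality is
    recomputed from the observation every time."""
    return action in get_legal_actions(observation)
```

Note that the board is re-parsed from text on every call, exactly the self-contained, stateless pattern the paper's appendix examples exhibit.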

Code Quality Observations

From the paper's appendix examples, the generated harness code is:

  • Readable: clear function names, logical structure
  • Correct: handles edge cases that emerge through iterative refinement
  • Domain-specific: each game gets custom parsing and validation logic
  • Self-contained: no state persisted between calls (pure functions)

Language Considerations for Extension

The approach is language-agnostic in principle—any language the LLM can generate code in could serve as the harness language. However, Python's dominance in LLM training data makes it the practical choice for maximizing code quality.


13 Memory Management

Training-Time Memory

The tree store grows during harness synthesis. Each node stores:

  • Complete Python source code (~100–500 lines per harness)
  • Heuristic value (float)
  • Parent pointer
  • Evaluation count

With an average of 14.5 iterations and typical branching factor of 1–3, the tree is small (dozens of nodes, <1MB per game).
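Under those assumptions, a node might look like the following dataclass; the field names are illustrative, not taken from the (unreleased) implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HarnessNode:
    """One node of the synthesis tree: a full harness program plus the
    bookkeeping needed for selection and refinement."""
    source_code: str                        # complete harness, ~100-500 lines
    heuristic_value: float                  # e.g. legal-action rate on rollouts
    parent: Optional["HarnessNode"] = None  # refinement lineage
    eval_count: int = 0                     # rollouts used to score this node

# ~500 lines x ~60 chars is roughly 30 kB of source per node, so even a
# tree of dozens of nodes stays well under 1 MB per game.
```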

Rollout memory: 10 parallel environments maintain game state. TextArena games are lightweight text-based environments; memory per environment is negligible (<1MB).

Inference-Time Memory

For harness-as-action-verifier:

  • The harness code is loaded once and called per step
  • No persistent state between function calls
  • Memory footprint: negligible (the harness is a pure function)

For harness-as-policy:

  • Same as above—pure code with no persistent state
  • Some harnesses may maintain internal data structures (e.g., tracking which cells have been visited in Minesweeper), but these are within the game scope

Knowledge Persistence

The system does not maintain cross-game memory. Each game's harness is synthesized independently. There is no skill library, no transfer learning between games, and no persistent knowledge base.

This is explicitly identified as a limitation and future work direction: "We also hope to explore building up a library of reusable harnesses."

Comparison to Memory-Intensive Approaches

| Approach | Memory Pattern | AutoHarness Equivalent |
|---|---|---|
| Reflexion | Verbal memory of past failures | Implicit in tree structure (parent nodes) |
| Voyager | Skill library with embeddings | None (independent per game) |
| AlphaEvolve | Evolutionary archive of programs | Tree store (smaller scale) |
| Fine-tuning | Model weights | No weight modification |

14 Continued Learning

Within-Game Learning (During Synthesis)

The tree search is itself a learning process: the system improves its harness code through iterative refinement. Learning curves show characteristic patterns:

| Game | Convergence Pattern |
|---|---|
| Simple games (PigDice, Snake) | 1 iteration: trivial action space |
| Medium games (Checkers, Sudoku) | 3–10 iterations: progressive rule discovery |
| Complex games (Chess, Othello) | 30–64 iterations: complex rule systems with edge cases |
| Pathological (Breakthrough-small) | 136 iterations: small board amplifies combinatorial edge cases |

Across-Game Transfer: Not Yet

The current system synthesizes each game's harness independently. The authors identify this as the key limitation:

"Currently we generate a separate harness for each environment (game). In the future, we would like to distill the resulting domain specific experts (agents) back into the base LLM, so that the whole system becomes recursively self-improving."

Future Directions for Continued Learning

The paper outlines three future directions that would enable continued learning:

  1. Recursive Self-Improvement: Distilling game-specific harnesses back into the base LLM, creating a feedback loop where the LLM becomes better at generating harnesses over time.

  2. Reusable Harness Library: Building a library of harness components (board parsers, move validators, strategy heuristics) that can be composed for new games.

  3. Multimodal Extension: Applying the approach to more complex environments like Craftax and Terra Nova, which require visual perception and more sophisticated state representations.

Relationship to AlphaEvolve

The paper explicitly positions itself relative to AlphaEvolve (Novikov et al., 2025), which applies evolutionary algorithms to entire codebases using an LLM as a mutation function. AutoHarness is more constrained:

| Dimension | AlphaEvolve | AutoHarness |
|---|---|---|
| Scope | Entire codebase | Single harness program |
| Search method | Evolutionary (population-based) | Tree search + Thompson sampling |
| Feedback | Algorithm-specific metrics | Environment legal/illegal signals |
| Goal | Discover algorithms | Synthesize action constraints |
| Scale | Large codebases | Small programs (~100–500 LOC) |

Potential for Self-Improvement Loops

The paper hints at a vision where AutoHarness becomes part of a larger self-improving system:

┌─────────────────────────────────────────────┐
│           RECURSIVE IMPROVEMENT VISION       │
│                                              │
│   LLM ──generates──> Harnesses              │
│    ↑                    │                    │
│    │                    │ game-specific       │
│    │                    │ expertise           │
│    │                    ▼                    │
│    └──distill────── Domain Experts           │
│                         │                    │
│                         │ compose             │
│                         ▼                    │
│                    Harness Library            │
│                         │                    │
│                         │ seed               │
│                         ▼                    │
│                    New Game → faster harness  │
└─────────────────────────────────────────────┘

This vision is unrealized in the current paper but represents a clear roadmap.


15 Applications

Direct Applications

1. Game-Playing Agents

The paper's primary domain. AutoHarness enables:

  • Compliant game play across diverse text-based games
  • Smaller models to compete with larger ones
  • Zero-cost inference via harness-as-policy

Target environments: TextArena (145 games), Craftax, Terra Nova, and any text-based game environment.

2. API Orchestration Agents

LLM agents calling APIs frequently generate invalid requests. AutoHarness could synthesize:

  • Request validators based on API schemas
  • Parameter constraint checkers
  • State-aware action filters (e.g., "you can't cancel an order that's already shipped")
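As a toy illustration of what a synthesized request validator could look like, here is a sketch using an invented schema format (field name → expected type and whether it is required); it mirrors no real API and stands in for code the LLM would generate from an actual schema.

```python
def make_validator(schema):
    """Build an is_legal_request() check from a simple schema dict.
    Schema format (hypothetical): {field_name: (type, required)}."""
    def is_legal_request(request):
        for name, (ftype, required) in schema.items():
            if name not in request:
                if required:
                    return False  # missing required field
                continue
            if not isinstance(request[name], ftype):
                return False      # wrong type
        return True
    return is_legal_request
```

The same rejection-sampling loop used for game actions would then wrap the LLM's API-call proposals with this checker.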

3. Robotic Control

The code-as-policies paradigm (Liang et al., 2023) already uses LLMs for robot control. AutoHarness adds:

  • Joint limit enforcement
  • Collision avoidance constraints
  • Workspace boundary enforcement

4. Code Generation Pipelines

LLM-generated code often fails type checking or violates schema constraints. AutoHarness could:

  • Synthesize type validators for generated code
  • Enforce API compatibility constraints
  • Validate database query structure before execution

Broader Implications

The "Small Model + Harness > Large Model" Result

This is the paper's most provocative finding for the broader AI community. It suggests that:

  1. Model size is not destiny. A Gemini-2.5-Flash with a synthesized harness beats Gemini-2.5-Pro. A synthesized code policy beats GPT-5.2-High.

  2. Compute allocation matters. Spending compute on harness synthesis (one-time) rather than larger model inference (per-decision) is more efficient.

  3. Code > Neural Networks for constraint enforcement. Learned code harnesses are deterministic, verifiable, and zero-cost at inference. Neural network-based constraint enforcement is probabilistic, opaque, and expensive.

The Rejection Sampling Connection

The paper's framework can be interpreted as learning the optimal rejection criterion for LLM outputs. This has implications beyond games:

| Application | Rejection Criterion |
|---|---|
| Code generation | Does the code compile? Pass tests? |
| Math proofs | Is each step logically valid? |
| Scientific hypotheses | Is the hypothesis consistent with known data? |
| Drug design | Does the molecule satisfy pharmacological constraints? |

In each case, the constraint checker could potentially be synthesized by the LLM itself, following the AutoHarness paradigm.

LLM-as-Compiler

AutoHarness can be viewed as using the LLM to "compile" domain rules from natural language game descriptions into executable constraint checking code. This compiler metaphor suggests applications in:

  • Regulatory compliance: Compile regulation text into automated compliance checkers
  • Contract enforcement: Compile contract terms into automated verification
  • Safety constraints: Compile safety specifications into runtime monitors

Limitations and Open Questions

  1. Scalability to continuous action spaces: The current approach assumes discrete, enumerable actions. Extension to continuous domains (e.g., continuous robotic control) requires fundamentally different harness structures.

  2. Multi-agent coordination: The harness is synthesized for a single agent. Multi-agent scenarios may require coordination-aware constraint checking.

  3. Dynamic environments: The harness encodes static game rules. Environments with changing rules would require online harness adaptation.

  4. Observation parsing brittleness: The harness relies on parsing text observations. Changes to observation format could break the harness.

  5. Strategic vs. tactical: The harness ensures legality (tactical correctness) but does not directly improve strategic quality. The harness-as-policy mode addresses this partially, but strategic depth is limited by the code synthesis budget.


Connection to Evolutionary Algorithms

AutoHarness's tree search with Thompson sampling shares deep structural similarities with evolutionary algorithms:

| Concept | EA | AutoHarness |
|---|---|---|
| Individual | Candidate solution | Candidate harness program |
| Fitness | Objective function value | Legal action accuracy |
| Mutation | Random perturbation | LLM-guided code refinement |
| Selection | Tournament / fitness-proportionate | Thompson sampling |
| Population | Generation of individuals | Tree of program variants |
| Crossover | Recombination of parents | Not explicitly used (single-parent refinement) |

The LLM-as-mutation-operator is particularly interesting: it provides semantically meaningful mutations rather than random perturbations. The LLM understands the program's intent and can make targeted improvements based on error feedback, which is qualitatively different from random mutation operators in traditional genetic programming.
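Thompson-sampling selection over program variants can be sketched as follows, assuming each node tracks legal/illegal rollout counts and using a Beta posterior over its success rate; the paper's exact posterior and node statistics are not reproduced here, so treat this as an illustration of the selection principle only.

```python
import random

def pick_node_thompson(nodes):
    """Select which harness variant to refine next: draw one sample
    from each node's Beta posterior over its rollout success rate and
    expand the node with the highest draw. This trades off exploiting
    high-scoring programs against exploring uncertain ones."""
    best, best_draw = None, -1.0
    for node in nodes:
        # Beta(1 + successes, 1 + failures): uniform prior updated
        # with the node's observed legal/illegal rollout counts.
        draw = random.betavariate(1 + node["successes"], 1 + node["failures"])
        if draw > best_draw:
            best, best_draw = node, draw
    return best
```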

Connection to Program Synthesis

AutoHarness sits at the intersection of:

  • Neural program synthesis (LLM generates code)
  • Search-based program synthesis (tree search explores program space)
  • Feedback-driven synthesis (environment provides correctness signals)

This combination achieves what neither approach does alone: the LLM provides strong priors for generating plausible programs, while the search structure and environment feedback correct the LLM's errors.

Connection to Formal Verification

The is_legal_action() function is effectively a runtime assertion or invariant checker. In formal verification terms:

  • Pre-condition: The observation represents a valid game state
  • Post-condition: The action is legal in that state
  • Invariant: The agent never executes an illegal action

The difference from formal verification is that these conditions are learned (synthesized as code) rather than specified a priori. This is a form of learned formal methods—an emerging area at the intersection of ML and software engineering.
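As a sketch, the invariant can be enforced with a runtime assertion around the environment step; `env.step` and the checker here are stand-in interfaces, not the paper's code.

```python
def safe_step(env, observation, action, is_legal_action):
    """Enforce the invariant 'the agent never executes an illegal action':
    check the learned post-condition before stepping the environment.
    The pre-condition (observation is a valid state) is assumed."""
    assert is_legal_action(observation, action), "harness invariant violated"
    return env.step(action)
```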


This analysis is based on the paper as published on arXiv (2603.03329v1, February 10, 2026). The paper was authored by researchers at Google DeepMind. No code repository has been publicly released as of the analysis date.