Introduced: 2026-02

Score: 8.54/10 (Final)

Chapter 57

AutoHarness

Part P08: Harness & Agent Frameworks

57.1 Overview & Motivation

Large language models deployed as decision-making agents in structured environments suffer from a persistent and quantifiable failure mode: action illegality. When an LLM agent plays chess, it does not merely make suboptimal moves—it proposes moves that violate the rules entirely. When it orchestrates API calls, it generates requests with invalid parameters. When it controls a robot, it commands joints beyond physical limits. The problem is not one of strategy but of constraint compliance.

AutoHarness, published by Lou et al. at Google DeepMind in February 2026 (arXiv:2603.03329), addresses this problem through a striking inversion: rather than fine-tuning the LLM to produce fewer illegal actions—an expensive process that degrades other capabilities—the system uses the LLM to synthesize its own constraint checker as executable code. The resulting "harness" is a Python program that filters, verifies, or replaces the LLM's proposed actions, ensuring perfect rule compliance at negligible inference cost.

The motivating statistic is stark: 78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition were attributed to illegal moves, not strategic blunders (Lou et al., 2026). This disconnect between apparent domain understanding and structural compliance motivates the entire framework. The problem generalizes across domains:

Domain	Illegal Action Example
Chess	Moving a piece to a square occupied by a friendly piece
API orchestration	Calling an endpoint with invalid parameter types
Robotic control	Commanding a joint beyond its physical limits
Database queries	Writing SQL that violates schema constraints
Code generation	Generating syntactically invalid code

Key Contribution

AutoHarness demonstrates that an LLM (Gemini-2.5-Flash) can automatically synthesize a code-based action harness through tree search with Thompson sampling over program space, achieving 100% legal action accuracy across 145 text-based games. The most striking result is that a pure code policy with zero LLM calls at inference outperforms frontier models including GPT-5.2-High (average reward 0.870 vs. 0.844), at effectively zero marginal inference cost. This establishes that compute invested in one-time harness synthesis can yield greater returns than compute invested in larger models at test time.

57.1.1 Framing Within Evolutionary AI

AutoHarness belongs to the lineage of systems that use LLMs as mutation operators within a search over program space. Unlike systems such as FunSearch or AlphaEvolve that evolve algorithms or heuristics to maximize a performance objective, AutoHarness evolves constraint programs that define the boundary between legal and illegal actions. The search objective is binary correctness (does the harness catch all illegal moves?) rather than continuous optimization, though the harness-as-policy mode extends this to reward maximization. The evolutionary structure—tree of candidate programs, fitness-based selection via Thompson sampling, LLM-driven mutation via the Refiner—places AutoHarness squarely within the LLM-powered evolutionary paradigm surveyed in this book, while its application domain (agent safety and constraint enforcement) distinguishes it from prior systems focused on algorithm discovery.

57.2 Architecture

The AutoHarness architecture separates into two distinct phases: a training phase that synthesizes the harness through iterative search, and an inference phase that deploys the synthesized harness alongside (or in place of) the LLM agent. The training phase is a closed-loop optimization system where the LLM generates code, the environment provides ground-truth feedback, and Thompson sampling guides exploration of the program space.

57.2.1 Key Architectural Decisions

Program-space search, not weight-space. The search operates over Python programs, not model parameters. This preserves the LLM's general capabilities while adding domain-specific constraints as an external artifact. No fine-tuning occurs at any stage.

Tree structure with Thompson sampling. Unlike linear iterative refinement approaches such as Reflexion (Shinn et al., 2023), which maintain a single chain of refinements, AutoHarness maintains a tree of candidate programs. Thompson sampling balances exploration of diverse program structures against exploitation of partially working harnesses. This prevents the search from committing prematurely to a single refinement trajectory.

Environment-in-the-loop. The training loop is closed through the actual game environment, not through a learned proxy or reward model. The environment provides ground-truth legality signals: either an action is legal or it is not. This avoids the reward-model misalignment that can affect reinforcement learning from human feedback.

Single-model architecture. Both the Critic and Refiner components use Gemini-2.5-Flash. The system does not require access to larger or more capable models during training. This is a deliberate design choice: the entire point is that a small model can bootstrap its own constraint infrastructure.

Separation of analysis and synthesis. The Critic consolidates raw execution error traces into structured feedback; the Refiner consumes this structured feedback to propose code improvements. This two-stage pipeline prevents the Refiner from being overwhelmed by verbose, potentially contradictory raw error logs.

57.2.2 Three Inference Modes

The synthesized harness can be deployed in three configurations of increasing autonomy, as described in the paper:

Mode	LLM at Inference?	Mechanism
Harness-as-Action-Verifier	Yes (proposes actions)	LLM proposes action; `is_legal_action()` verifies; retry on rejection
Harness-as-Action-Filter	Yes (ranks)	Code enumerates all legal actions; LLM ranks and selects
Harness-as-Policy	No	`propose_action()` generates action directly; zero LLM calls

The paper primarily evaluates the verifier and policy modes. The policy mode produces the headline result: a pure code agent that outperforms frontier LLMs at zero inference cost.
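The three modes in the table can be sketched as three small dispatch functions. This is a minimal illustration, not the authors' implementation: `llm_propose`, `llm_rank`, and `enumerate_legal_actions` are hypothetical callables, and the retry limit of 5 is an assumed value.

```python
from typing import Callable, Sequence


def act_as_verifier(
    observation: str,
    llm_propose: Callable[[str, Sequence[str]], str],  # hypothetical LLM call
    is_legal_action: Callable[[str, str], bool],
    max_retries: int = 5,  # assumed limit, not taken from the paper
) -> str:
    """Verifier mode: the LLM proposes, the harness checks, rejections are fed back."""
    rejected: list[str] = []
    for _ in range(max_retries):
        action = llm_propose(observation, rejected)
        if is_legal_action(observation, action):
            return action
        rejected.append(action)
    raise RuntimeError(f"no legal action after {max_retries} attempts")


def act_as_filter(
    observation: str,
    enumerate_legal_actions: Callable[[str], Sequence[str]],  # hypothetical harness function
    llm_rank: Callable[[str, Sequence[str]], str],            # hypothetical LLM call
) -> str:
    """Filter mode: code enumerates the legal actions, the LLM picks one of them."""
    legal = list(enumerate_legal_actions(observation))
    if not legal:
        raise ValueError("harness enumerated no legal actions")
    choice = llm_rank(observation, legal)
    # Guard against the LLM answering outside the offered list.
    return choice if choice in legal else legal[0]


def act_as_policy(observation: str, propose_action: Callable[[str], str]) -> str:
    """Policy mode: the synthesized code decides alone, with zero LLM calls."""
    return propose_action(observation)
```

Only the policy path avoids the model entirely; the other two still pay for at least one LLM call per decision.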

57.3 Core Algorithms

57.3.1 Harness Interface

Every synthesized harness conforms to a uniform interface consisting of two functions. This interface is domain-agnostic—the same signature applies to chess, Sudoku, card games, and all other environments in the 145-game evaluation suite:

# Pseudocode — no public implementation available
# Harness interface as described in Lou et al. (2026), Section 3

def is_legal_action(observation: str, action: str) -> bool:
    """Determine whether a proposed action is legal given the current
    game observation. The observation is the raw text string from the
    environment; the action is the agent's proposed text response.

    Returns True if the action complies with game rules, False otherwise.
    """
    # Domain-specific parsing and validation logic
    # synthesized through iterative refinement
    ...

def propose_action(observation: str) -> str:
    """Given the current game observation, propose a valid action.
    For harness-as-policy mode, this encodes the complete
    decision-making policy without any LLM calls.

    Returns a text string representing the chosen action.
    """
    # Domain-specific strategy logic
    # synthesized through iterative refinement with reward signal
    ...

An important design constraint noted by the authors: the "Available Moves" hints were deliberately removed from game observations before being passed to the harness. In the original TextArena chess environment, the observation includes a line such as `Valid moves: [g1h3], [g1f3], ...`. This line is deleted, forcing the harness to deduce legality from structural understanding of game mechanics rather than simple string matching.
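A minimal sketch of such a hint-stripping step is shown here. The exact preprocessing used by the authors is not released; the regular expression below simply assumes the hint occupies a single line beginning with `Valid moves:`, as in the example line quoted above.

```python
import re

# Assumption: the hint is one line starting with "Valid moves:".
_HINT_LINE = re.compile(r"^[ \t]*Valid moves:.*(?:\r?\n|$)", re.MULTILINE)


def strip_move_hints(observation: str) -> str:
    """Remove every 'Valid moves: ...' line from a text observation."""
    return _HINT_LINE.sub("", observation)
```

Other games may phrase the hint differently, so a real preprocessing pass would likely need one pattern per environment.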

57.3.2 Thompson Sampling for Node Selection

The search maintains a tree of candidate harness programs. Each node $n$ in the tree stores the harness source code, an evaluation count, and a success/failure record. Thompson sampling selects which node to refine next by maintaining a Beta posterior over each node's true quality.

Let $\alpha_n$ and $\beta_n$ denote the posterior parameters for node $n$, initialized as $\alpha_n = 1, \beta_n = 1$ (uniform prior). After each evaluation of node $n$, the parameters are updated based on the observed heuristic value. The selection procedure at each iteration is:

$$\theta_n \sim \text{Beta}(\alpha_n, \beta_n) \quad \text{for each node } n \in \mathcal{T}$$

$$n^* = \arg\max_{n \in \mathcal{T}} \theta_n$$

where $\mathcal{T}$ is the set of all nodes in the current tree, $\theta_n$ is the sampled quality estimate for node $n$, and $n^*$ is the node selected for refinement. The Beta distribution naturally encodes uncertainty: nodes with few evaluations have wide posteriors (encouraging exploration), while nodes with many evaluations have narrow posteriors concentrated around their empirical quality (favoring exploitation).

The paper reports a Thompson sampling weight parameter of 1.0 (Lou et al., 2026, experimental configuration). This controls the exploration-exploitation balance, though the exact mechanism by which this weight modulates the sampling is not further specified in the paper.

# Pseudocode — no public implementation available
# Thompson sampling node selection (Lou et al., 2026, Section 3)

import numpy as np
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    code: str                    # Python source of the harness
    parent: "TreeNode | None"    # Parent node (None for root)
    children: list = field(default_factory=list)
    alpha: float = 1.0           # Beta posterior: successes + 1
    beta_param: float = 1.0      # Beta posterior: failures + 1
    heuristic: float = 0.0       # Average legal action rate

def select_node(tree_nodes: list[TreeNode]) -> TreeNode:
    """Select node to refine via Thompson sampling.
    Sample from each node's Beta posterior; select the argmax."""
    samples = [
        np.random.beta(node.alpha, node.beta_param)
        for node in tree_nodes
    ]
    return tree_nodes[np.argmax(samples)]

def update_node(node: TreeNode, legal_rate: float) -> None:
    """Update Beta posterior after evaluating the node's harness.
    legal_rate: fraction of actions that were legal in rollouts."""
    # Treat legal_rate as a Bernoulli-like success signal
    node.alpha += legal_rate
    node.beta_param += (1.0 - legal_rate)
    node.heuristic = (node.alpha - 1.0) / (node.alpha + node.beta_param - 2.0)

The authors cite Tang et al. (2024) for the application of Thompson sampling to code repair with exploration-exploitation tradeoffs. The rationale for Thompson sampling over alternatives is articulated in the paper: random selection wastes iterations on unpromising branches; greedy selection (always refining the best node) risks local optima; UCB1 requires tuning an exploration constant; Thompson sampling adapts exploration naturally as uncertainty decreases.
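To make the exploration argument concrete, the sketch below computes the closed-form mean and standard deviation of a Beta posterior for two hypothetical nodes that share the same empirical legal rate (0.8) but differ in evaluation count. The fractional updates mirror `update_node` above; the node values are illustrative, not results from the paper.

```python
import math


def beta_mean_std(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total**2 * (total + 1.0))
    return mean, math.sqrt(variance)


# Hypothetical nodes, both with an observed legal rate of 0.8.
fresh = beta_mean_std(1.0 + 0.8, 1.0 + 0.2)      # after 1 evaluation
seasoned = beta_mean_std(1.0 + 16.0, 1.0 + 4.0)  # after 20 evaluations

print(f"fresh:    mean={fresh[0]:.3f}, std={fresh[1]:.3f}")
print(f"seasoned: mean={seasoned[0]:.3f}, std={seasoned[1]:.3f}")
```

The freshly created node has the wider posterior, so its samples occasionally exceed those of the well-evaluated node even though its posterior mean is lower. That is the mechanism by which Thompson sampling keeps exploring young branches.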

57.3.3 Heuristic Function Design

The heuristic function $H$ that drives the search differs between the two training objectives:

For harness-as-action-verifier, the heuristic is simply the fraction of legal actions across all rollout steps:

$$H_{\text{verifier}} = \frac{\text{number of legal actions in rollout}}{\text{total actions in rollout}}$$

Training terminates when $H_{\text{verifier}} = 1.0$ for a node, indicating perfect legal action accuracy across all test rollout steps (1,000 steps × 10 random seeds per game).

For harness-as-policy, the heuristic incorporates the environment reward $r \in [0, 1]$:

$$H_{\text{policy}} = \begin{cases} 0 & \text{if any illegal action is taken} \\ 0.5 + 0.5 \cdot r & \text{otherwise} \end{cases}$$

where $r$ is the environment reward at episode end. This design encodes a strict lexicographic priority:

First priority: Eliminate all illegal actions. Any illegal action forces $H = 0$, regardless of reward.
Second priority: Maximize game reward. Among fully legal policies, $H$ scales linearly from 0.5 (zero reward) to 1.0 (maximum reward).

The 0.5 offset is crucial: it ensures that a policy producing all legal actions with zero reward ($H = 0.5$) is strictly preferred over any policy with even one illegal action ($H = 0$). This prevents the search from exploiting high-reward but rule-violating strategies.
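Both heuristics are short enough to write out directly. The sketch below follows the two formulas in this section; the representation of a rollout as a list of per-step legality flags is an assumption made for illustration.

```python
from typing import Sequence


def verifier_heuristic(legal_flags: Sequence[bool]) -> float:
    """H_verifier: fraction of rollout steps whose action was legal."""
    if not legal_flags:
        raise ValueError("rollout contains no steps")
    return sum(legal_flags) / len(legal_flags)


def policy_heuristic(legal_flags: Sequence[bool], reward: float) -> float:
    """H_policy: 0 on any illegal action, otherwise 0.5 + 0.5 * reward."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    if not all(legal_flags):
        return 0.0
    return 0.5 + 0.5 * reward
```

For instance, a fully legal episode with zero reward scores 0.5, while an episode with maximum reward and a single illegal step scores 0.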

57.3.4 Iterative Refinement Loop

The core training loop consists of four stages executed iteratively until convergence or timeout:

# Pseudocode — no public implementation available
# Main training loop (Lou et al., 2026, Section 3)

def synthesize_harness(
    game_env,           # TextArena game environment
    max_iterations: int = 256,
    n_parallel: int = 10,
    max_steps: int = 1000,
    max_failures: int = 5,
) -> str:
    """Synthesize a harness for the given game environment.
    Returns the Python source code of the best harness found."""

    # Initialize tree with empty/template harness
    root = TreeNode(code=TEMPLATE_HARNESS, parent=None)
    tree_nodes = [root]

    for iteration in range(max_iterations):
        # Stage 1: Select node via Thompson sampling
        node = select_node(tree_nodes)

        # Stage 2: Rollout — execute harness in parallel environments
        traces = []
        for env_id in range(n_parallel):
            trace = rollout(game_env, node.code, max_steps)
            traces.append(trace)

        # Compute heuristic (legal action rate or reward-weighted)
        legal_rate = compute_legal_rate(traces)
        update_node(node, legal_rate)

        # Check termination
        if legal_rate == 1.0:
            return node.code

        # Stage 3: Critic — consolidate failure traces
        failures = sample_failures(traces, max_count=max_failures)
        critic_analysis = llm_critic(node.code, failures)

        # Stage 4: Refiner — generate improved code
        refined_code = llm_refiner(node.code, critic_analysis, failures)

        # Add new node to tree
        child = TreeNode(code=refined_code, parent=node)
        node.children.append(child)
        tree_nodes.append(child)

    # Return best node if no perfect harness found
    return max(tree_nodes, key=lambda n: n.heuristic).code

57.3.5 Asymmetric Error Attribution

The Refiner employs an asymmetric refinement strategy based on the type of failure observed. This mechanism prevents correct components from being needlessly modified:

Failure Type	`is_legal_action()` Returns	Ground Truth	What to Refine
False positive	True (legal)	Actually illegal	Both `is_legal_action()` and `propose_action()`
True negative (correct rejection)	False (illegal)	Actually illegal	Only `propose_action()`

The rationale: a false positive means the legality checker itself is buggy (it failed to catch an illegal move), requiring repair of the validation logic. A true negative means the checker works correctly—the proposal generator simply chose a bad action and needs to improve its selection. This asymmetry reduces the search space by protecting working components from unnecessary mutation.
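The attribution rule in the table reduces to a small decision function. The sketch below is illustrative only: the function name is hypothetical, and because the table does not cover actions that are actually legal, this sketch chooses to leave the code untouched in those cases.

```python
def refinement_targets(checker_says_legal: bool, actually_legal: bool) -> set[str]:
    """Return which harness functions the Refiner should rewrite for one failure."""
    if actually_legal:
        # Not covered by the table above; this sketch makes no change.
        return set()
    if checker_says_legal:
        # False positive: the checker missed an illegal action, so repair both.
        return {"is_legal_action", "propose_action"}
    # Correct rejection: the checker works, only the proposal logic needs work.
    return {"propose_action"}
```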

57.3.6 Rejection Sampling Interpretation

The harness-as-action-verifier mode can be formally interpreted as learned rejection sampling. In standard rejection sampling, one draws samples from a proposal distribution $q(x)$ and accepts each sample with probability $p(x) / (M\,q(x))$, where $p$ is the target density and $M$ is a constant satisfying $p(x) \le M\,q(x)$ for all $x$. In AutoHarness:

$$a \sim \text{LLM}(\cdot \mid o) \quad \text{(propose action given observation)}$$

$$\text{Accept } a \iff \texttt{is\_legal\_action}(o, a) = \text{True}$$

where $o$ is the current observation and $a$ is the proposed action. Two features distinguish this from classical rejection sampling: (1) the acceptance criterion is itself learned through program synthesis rather than specified analytically, and (2) rejection modifies the proposal distribution—the LLM receives the illegal action as negative feedback in its next attempt, rather than simply resampling from the same conditional distribution.
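As a rough baseline for the cost of this scheme, suppose (as a simplifying assumption, not a claim from the paper) that each proposal is legal independently with a fixed probability $p$. The number of LLM calls $N$ until the first accepted action is then geometric:

$$P(N = k) = (1 - p)^{k-1}\, p, \qquad \mathbb{E}[N] = \frac{1}{p}$$

For example, $p = 0.8$ gives $\mathbb{E}[N] = 1.25$ calls per accepted action. Because rejected actions are fed back to the LLM, successive proposals are not actually independent, so this figure is only a reference point for the no-feedback case.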

57.4 Key Results

All results reported in this section are from Lou et al. (2026). The evaluation uses the TextArena framework (Guertler et al., 2025) with 145 games (9 free-form text/dialog games excluded). No independent replication has been published as of April 2026.

57.4.1 Legal Action Accuracy

The harness-as-action-verifier achieves 100% legal action accuracy across all 145 TextArena games on held-out test rollouts. Each game was evaluated with 1,000 rollout steps across 10 random seeds. This is a binary all-or-nothing metric—the harness either catches every illegal action or it does not.

Metric	Value	Source
Games evaluated	145	Paper §4
Games with 100% legal action rate	145 (100%)	Paper §4
Test rollout length	1,000 steps	Paper §4
Random seeds per game	10	Paper §4
Average training iterations (verifier)	14.5	Paper §4

57.4.2 Two-Player Game Performance

On 16 selected two-player games, evaluated with 40 matches each (20 as player 1, 20 as player 2) to balance first-mover effects:

Matchup	Win Rate (AutoHarness)	Win Rate (Opponent)	Games Won
Flash + Harness vs. Gemini-2.5-Pro	56.3%	38.2%	9/16
Flash + Harness vs. Gemini-2.5-Flash	64.8%	—	12/16

The smaller model (Flash) equipped with an automatically synthesized harness beats the larger model (Pro) without a harness, winning the majority of individual games. This is perhaps the paper's cleanest demonstration that constraint compliance, not model scale, is the binding bottleneck in structured decision-making.

57.4.3 Single-Player Game Performance

On 16 single-player games, evaluated with 20 matches each:

Agent	Average Reward	vs. Pro
Gemini-2.5-Flash (vanilla)	0.673	—
Gemini-2.5-Pro (vanilla)	0.707	—
Gemini-2.5-Flash + Harness	0.745	8/16 wins, 5/16 ties

57.4.4 Harness-as-Policy: The Headline Result

The most striking finding is that a pure code policy with zero LLM calls at inference outperforms all evaluated frontier models, including GPT-5.2-High with high reasoning budget:

Agent	Average Reward	Test-Time LLM Calls	Reported Test Cost
GPT-5.2 (no thinking)	0.635	Many	~$640 (reduced eval)*
Gemini-2.5-Flash (vanilla)	0.673	Many	Moderate
Gemini-2.5-Pro (vanilla)	0.707	Many	High
GPT-5.2-High (high thinking)	0.844	Many	~$640 (reduced eval)*
Harness-as-Policy (AutoHarness)	0.870	0	~$0

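*GPT-5.2 costs are reported by the authors for a reduced evaluation (10 and 5 repeats for GPT-5.2 and GPT-5.2-High respectively, vs. 20 for other agents). The ~$640 figure is the combined cost of both GPT-5.2 configurations, not a per-row cost. The cost comparison should be interpreted in light of this evaluation asymmetry.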

This result demonstrates that an LLM can synthesize a complete decision-making policy in code form that exceeds the performance of the LLM itself—and even larger, more expensive models—at effectively zero marginal cost per decision.

57.4.5 Training Difficulty by Game

Convergence speed varies dramatically across games, revealing how the complexity of game rules maps to harness synthesis difficulty:

Game	Training Iterations	Difficulty Driver
PigDice, Snake, ColonelBlotto	1	Trivially enumerable action space
GermanWhist-v0 (2P)	43	Complex card game rules
Cryptarithm-v0 (1P)	45	Cryptarithmetic puzzle constraints
Othello-v0 (2P)	62	Board state validation complexity
Chess-v0 (2P)	64	Complex piece movement, castling, en passant
Breakthrough-v0-small (2P)	136	Small board amplifies edge cases

For the harness-as-policy mode (which must encode strategy, not just legality), the maximum was 256 iterations with an average of 89.4 iterations and average heuristic at termination of 0.939 (Lou et al., 2026, experimental configuration).

57.5 Implementation Details

57.5.1 Model and Environment Setup

All harness synthesis (training) uses a single model: Gemini-2.5-Flash. This model serves triple duty as the initial code generator, the Critic, and the Refiner. No larger model is required during synthesis. The evaluation (inference) phase tests multiple models as baselines, but the harness itself is always produced by Flash.

The environment is TextArena (Guertler et al., 2025), a collection of text-based games where observations and actions are both strings. The authors modified certain games to remove "Available Moves" hints from observations, though the exact modifications are described in the paper rather than released as a code patch.

57.5.2 Cost Analysis

The paper does not provide exact dollar costs for harness synthesis. Based on the reported iteration counts and the structure of each iteration (10 parallel rollouts, Critic analysis, Refiner generation), the per-game synthesis cost can be estimated:

Phase	Model	Estimated Cost per Game	Basis
Verifier synthesis	Gemini-2.5-Flash	~$1–5	Estimated here (not by the paper's authors): ~14.5 iters × (rollouts + critic + refiner calls)
Policy synthesis	Gemini-2.5-Flash	~$5–30	Higher iteration count (~89.4 avg) but still Flash pricing
Full suite (145 games, verifier)	Gemini-2.5-Flash	~$150–700 total	Extrapolation from per-game estimate

Note: These cost estimates are reconstructed from the paper's reported parameters and approximate Flash pricing (~$0.15/M input tokens, ~$0.60/M output tokens as of early 2026). They are not directly reported in the paper and should be treated as order-of-magnitude estimates.
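The extrapolation in the table is simple arithmetic, sketched below. The per-token prices are the approximate figures quoted in the note above, and the token counts passed to `call_cost_usd` are placeholders to be filled in by the reader, since the paper does not report them.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float = 0.15,   # approximate price quoted in the note above
    usd_per_m_output: float = 0.60,  # approximate price quoted in the note above
) -> float:
    """Cost of one LLM call given its token counts."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def suite_cost_range(
    n_games: int, per_game_low: float, per_game_high: float
) -> tuple[float, float]:
    """Scale a per-game cost range linearly to a whole game suite."""
    return n_games * per_game_low, n_games * per_game_high


low, high = suite_cost_range(145, 1.0, 5.0)
print(f"Full verifier suite: ${low:.0f} to ${high:.0f}")
```

Scaling the $1–5 per-game range to 145 games gives $145–725, which the table rounds to roughly $150–700.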

The inference cost comparison is more dramatic and directly reported:

$$\text{Cost ratio} = \frac{\text{GPT-5.2 evaluation cost}}{\text{Harness-as-Policy cost}} = \frac{\sim\$640}{\sim\$0} \to \infty$$

The paper states explicitly: "Since Harness-as-Policy generates pure (Python) code, our test time cost is nearly zero, while the GPT-5.2 and GPT-5.2-High experiments cost approximately $640" (Lou et al., 2026). The $640 figure covers a reduced evaluation (10 and 5 repeats for GPT-5.2 and GPT-5.2-High respectively, compared to 20 repeats for other methods).

57.5.3 Reproducibility Assessment

Aspect	Status	Details
Code release	Not available	No public repository referenced in the paper
TextArena environment	Open source	Available via arXiv:2504.11442
Model access	Proprietary API	Gemini-2.5-Flash/Pro via Google API; GPT-5.2 via OpenAI API
TextArena modifications	Described, not released	Removal of "Available Moves" hints
Generated harness examples	Paper Appendix D	Selected examples only
Experimental parameters	Fully specified	All hyperparameters reported in paper
Stochasticity	Present	Thompson sampling + LLM generation introduce variance
API versioning	Version-specific	Results tied to specific Gemini/GPT model versions

Independent reproduction would require: (1) access to Gemini-2.5-Flash and evaluation-model APIs, (2) reimplementing the tree search, Critic, and Refiner pipeline from the paper description, (3) replicating the TextArena modifications, and (4) running sufficient seeds to account for stochastic variation. While the paper provides enough detail for a faithful reimplementation, the absence of released code and the dependence on proprietary model APIs are significant barriers.

57.5.4 Allowed Dependencies in Generated Code

The synthesized harness code is restricted to Python standard library modules (particularly `re` for pattern matching) and `numpy` for numerical operations. Critically, no domain-specific libraries are permitted: the harness cannot import `python-chess` for chess or any equivalent game-specific library. This constraint forces the LLM to synthesize game rules from first principles, making the harness a genuine demonstration of rule comprehension rather than a thin wrapper around an existing rule engine.

57.6 Connections to Evolutionary Computation

57.6.1 Structural Parallels

AutoHarness exhibits deep structural parallels with evolutionary algorithms, as analyzed in the paper and elaborated here. The following mapping connects EA concepts to AutoHarness components:

EA Concept	AutoHarness Equivalent
Individual	Candidate harness program (tree node)
Fitness function	Legal action rate / reward-weighted heuristic
Mutation operator	LLM-driven code refinement (Refiner)
Selection mechanism	Thompson sampling over Beta posteriors
Population	Tree of program variants
Crossover	Not used (single-parent refinement only)
Generation	One iteration of select → rollout → critique → refine

The most notable distinction from traditional evolutionary algorithms is the use of an LLM as the mutation operator. Traditional genetic programming applies syntactic transformations (subtree mutation, point mutation) that are semantically blind—they modify program structure without understanding program intent. The LLM-as-mutator provides semantically meaningful mutations: given error feedback explaining why the current harness fails, the LLM proposes targeted improvements that address the specific failure mode. This is qualitatively different from random perturbation and explains why the system converges in tens of iterations rather than thousands.

57.6.2 Relationship to AlphaEvolve and FunSearch

The paper explicitly positions AutoHarness relative to AlphaEvolve (Novikov et al., 2025), which applies LLM-powered evolutionary search to entire codebases. The distinction is in scope and objective:

Dimension	AlphaEvolve	FunSearch	AutoHarness
Search scope	Entire codebase	Single function	Single harness program (~100–500 LOC)
Search method	Evolutionary (population-based)	Evolutionary (island model)	Tree search + Thompson sampling
Feedback signal	Algorithm-specific metrics	Scalar score	Binary legal/illegal + reward
Objective	Algorithm/heuristic discovery	Function optimization	Constraint program synthesis
LLM role	Mutation operator	Mutation operator	Generator, Critic, Refiner
Inference cost	Varies	N/A (offline)	Zero (policy mode)

AutoHarness is more constrained in scope but targets a complementary problem: rather than discovering novel algorithms, it discovers the constraint boundaries within which an agent must operate. The two approaches are potentially composable—one could imagine using AlphaEvolve to discover strategy heuristics while AutoHarness ensures those heuristics never propose illegal actions.

57.6.3 Tree Search vs. Population-Based Search

AutoHarness uses a tree rather than a flat population. This choice has implications for how the search explores program space:

The tree structure enables backtracking: if a refinement path leads to a dead end (the harness cannot be improved further along that branch), Thompson sampling will shift probability mass toward other branches, including earlier ancestors that remain in the tree. A flat population that discards displaced individuals cannot return to such an ancestor once it has been replaced. However, the absence of crossover means AutoHarness cannot combine partial solutions from different branches, a potential limitation for games where the harness requires multiple independent rule modules that could be developed separately.

57.7 Broader Applications and Implications

57.7.1 The "Small Model + Harness > Large Model" Thesis

AutoHarness's most provocative contribution to the broader AI discourse is the empirical demonstration that model size is not destiny for constrained decision-making. The result that Flash + Harness beats Pro, and that a code-only policy beats GPT-5.2-High, suggests a fundamental reallocation principle:

$$\text{Effective capability} = f(\text{model capability}, \text{constraint compliance})$$

where constraint compliance acts as a multiplicative gate. If most of an agent's losses stem from illegal actions, as with the 78% figure cited above for Gemini-2.5-Flash in chess, strategic sophistication matters little: effective capability is capped by the compliance rate. The harness lifts that cap to 1.0, allowing even a smaller model's strategic capability to be fully expressed.

This has direct implications for deployment economics. Instead of scaling model size to reduce error rates, which typically yields diminishing returns, one can invest a fixed, bounded cost in harness synthesis and achieve perfect compliance on the evaluated games. The one-time synthesis cost (estimated above at roughly $1–5 per game for the verifier) amortizes toward zero over subsequent deployments.

57.7.2 LLM as Compiler

The paper implicitly demonstrates the LLM functioning as a compiler from natural language rules to executable constraint-checking code. The TextArena game descriptions are presented in natural language; the harness translates these descriptions into Python functions that can verify rule compliance deterministically. This "compilation" metaphor suggests applications in:

Regulatory compliance: Compile regulation text into automated compliance checkers
Contract enforcement: Compile contract terms into runtime verification
Safety specifications: Compile safety constraints into runtime monitors for autonomous systems

The key insight is that the compilation need not be perfect on the first attempt—the iterative refinement loop with environment feedback corrects errors systematically. This is a form of test-driven compilation where the environment serves as the test oracle.

57.7.3 Connection to Formal Verification

The `is_legal_action()` function is functionally equivalent to a runtime assertion or precondition checker in formal verification. In design-by-contract terms:

Precondition: The observation represents a valid game state
Postcondition: The executed action is legal in that state
Invariant: The agent never executes an illegal action

The difference from traditional formal verification is that these conditions are synthesized from execution feedback rather than specified by a human engineer. This positions AutoHarness within the emerging area of learned formal methods—using machine learning to produce artifacts that serve verification purposes.
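The contract view can be illustrated with a thin wrapper that refuses to forward an action unless the harness accepts it. This is a sketch under assumptions: `guarded`, `IllegalActionError`, and the single-argument `step` callable are hypothetical names, not part of TextArena or the paper.

```python
from typing import Callable


class IllegalActionError(RuntimeError):
    """Raised when the harness rejects an action before execution."""


def guarded(
    step: Callable[[str], object],                # hypothetical environment step function
    is_legal_action: Callable[[str, str], bool],  # synthesized harness check
) -> Callable[[str, str], object]:
    """Wrap `step` so the invariant 'no illegal action is executed' is enforced."""

    def guarded_step(observation: str, action: str) -> object:
        # Runtime check: the action must be legal in the observed state.
        if not is_legal_action(observation, action):
            raise IllegalActionError(f"harness rejected action {action!r}")
        return step(action)

    return guarded_step
```

The invariant holds only as far as the synthesized checker is correct, which is why the training loop treats false positives as checker bugs.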

57.8 Limitations & Discussion

57.8.1 Scope Limitations

Discrete action spaces only. The current approach assumes actions can be enumerated and individually validated. Extension to continuous action spaces (e.g., continuous robotic control with real-valued torques) would require fundamentally different harness structures—perhaps constraint satisfaction over continuous domains rather than discrete acceptance/rejection.

Static game rules. The harness encodes a fixed set of rules. Environments where rules change dynamically (e.g., games with evolving mechanics, regulatory environments with updating policies) would require online harness adaptation, which the current framework does not support.

Observation parsing brittleness. The harness relies on parsing text observations using regular expressions and string manipulation. Changes to the observation format—even minor ones like altered whitespace or field ordering—could break the harness. This is an inherent fragility of string-based parsing without a formal grammar.

No cross-game transfer. Each game's harness is synthesized independently. There is no skill library, no transfer learning, and no shared components across games. The authors explicitly identify this as a limitation: "Currently we generate a separate harness for each environment (game). In the future, we would like to distill the resulting domain specific experts (agents) back into the base LLM" (Lou et al., 2026). The absence of transfer means synthesis cost scales linearly with the number of environments.

57.8.2 Methodological Considerations

Evaluation asymmetry. The GPT-5.2 baselines were evaluated with fewer repeats (10 and 5) compared to other agents (20), due to cost constraints. This asymmetry means the GPT-5.2 results have higher variance, and the comparison should be interpreted with appropriate caution.

No public code release. As of April 2026, the harness synthesis framework has not been open-sourced. While the paper provides sufficient algorithmic detail for reimplementation, the absence of code prevents direct verification of results and limits community extension of the work.

Proprietary model dependence. Results are tied to specific versions of Gemini-2.5-Flash, Gemini-2.5-Pro, and GPT-5.2, which are proprietary APIs. Model updates, rate limiting, and non-determinism in API responses could affect reproducibility.

Strategic depth in policy mode. While the harness-as-policy achieves impressive average reward (0.870), the strategic depth of a synthesized code policy is inherently limited by the code synthesis budget (256 iterations maximum). For games requiring deep strategic planning (e.g., Go), a 500-line Python program may not capture the necessary complexity. The 0.939 average heuristic at termination (below the theoretical maximum of 1.0) indicates that the policy mode does not achieve optimal play on all games.

57.8.3 Open Questions

Scaling to complex environments. TextArena games are text-based with relatively compact state representations. Whether the approach scales to environments with rich visual observations (Craftax, Minecraft) or high-dimensional state spaces remains unexplored. The authors list Craftax and Terra Nova as future targets.

Multi-agent coordination. The harness is synthesized for a single agent. In multi-agent settings where legality depends on other agents' simultaneous actions, the constraint checking problem becomes substantially more complex.

Recursive self-improvement. The paper outlines but does not implement a vision where harness-generated domain expertise is distilled back into the base LLM, creating a recursive improvement loop. Whether such distillation would actually improve the LLM's ability to generate better harnesses—or merely overfit to the training game distribution—is an open question of significant importance.

57.9 Summary

Chapter Summary

Key takeaway: AutoHarness demonstrates that an LLM can synthesize its own constraint-checking code through tree search over program space, achieving perfect rule compliance across 145 games and enabling a pure code policy to outperform frontier models at zero inference cost.

Main contribution: The system introduces automatic harness synthesis as an alternative to both manual engineering and model fine-tuning for action compliance. By framing the problem as program search with Thompson sampling and environment-in-the-loop feedback, it achieves 100% legal action accuracy while being fully automated. The harness-as-policy result—outperforming GPT-5.2-High with zero LLM calls—establishes a new paradigm where one-time compute investment in code synthesis dominates repeated large-model inference.

What researchers should know: AutoHarness operationalizes the idea that LLM failures in structured environments are primarily constraint violations, not capability gaps. The practical implication is architectural: rather than scaling model size to reduce error rates, one can wrap any model in a synthesized constraint layer at bounded cost. The approach requires only the ability to obtain ground-truth feedback on action legality from the environment—a signal that is available in many deployment domains beyond games. The absence of open-source code and the reliance on proprietary model APIs are the primary barriers to independent verification and extension.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}