AutoHarness
Part P08: Harness & Agent Frameworks
57.1 Overview & Motivation
Large language models deployed as decision-making agents in structured environments suffer from a persistent and quantifiable failure mode: action illegality. When an LLM agent plays chess, it does not merely make suboptimal moves—it proposes moves that violate the rules entirely. When it orchestrates API calls, it generates requests with invalid parameters. When it controls a robot, it commands joints beyond physical limits. The problem is not one of strategy but of constraint compliance.
AutoHarness, published by Lou et al. at Google DeepMind in February 2026 (arXiv:2603.03329), addresses this problem through a striking inversion: rather than fine-tuning the LLM to produce fewer illegal actions—an expensive process that degrades other capabilities—the system uses the LLM to synthesize its own constraint checker as executable code. The resulting "harness" is a Python program that filters, verifies, or replaces the LLM's proposed actions, ensuring perfect rule compliance at negligible inference cost.
The motivating statistic is stark: 78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition were attributed to illegal moves, not strategic blunders (Lou et al., 2026). This disconnect between apparent domain understanding and structural compliance motivates the entire framework. The problem generalizes across domains:
| Domain | Illegal Action Example |
|---|---|
| Chess | Moving a piece to a square occupied by a friendly piece |
| API orchestration | Calling an endpoint with invalid parameter types |
| Robotic control | Commanding a joint beyond its physical limits |
| Database queries | Writing SQL that violates schema constraints |
| Code generation | Generating syntactically invalid code |
Key Contribution
AutoHarness demonstrates that an LLM (Gemini-2.5-Flash) can automatically synthesize a code-based action harness through tree search with Thompson sampling over program space, achieving 100% legal action accuracy across 145 text-based games. The most striking result is that a pure code policy with zero LLM calls at inference outperforms frontier models including GPT-5.2-High (average reward 0.870 vs. 0.844), at effectively zero marginal inference cost. This establishes that compute invested in one-time harness synthesis can yield greater returns than compute invested in larger models at test time.
57.1.1 Framing Within Evolutionary AI
AutoHarness belongs to the lineage of systems that use LLMs as mutation operators within a search over program space. Unlike systems such as FunSearch or AlphaEvolve that evolve algorithms or heuristics to maximize a performance objective, AutoHarness evolves constraint programs that define the boundary between legal and illegal actions. The search objective is binary correctness (does the harness catch all illegal moves?) rather than continuous optimization, though the harness-as-policy mode extends this to reward maximization. The evolutionary structure—tree of candidate programs, fitness-based selection via Thompson sampling, LLM-driven mutation via the Refiner—places AutoHarness squarely within the LLM-powered evolutionary paradigm surveyed in this book, while its application domain (agent safety and constraint enforcement) distinguishes it from prior systems focused on algorithm discovery.
57.2 Architecture
The AutoHarness architecture separates into two distinct phases: a training phase that synthesizes the harness through iterative search, and an inference phase that deploys the synthesized harness alongside (or in place of) the LLM agent. The training phase is a closed-loop optimization system where the LLM generates code, the environment provides ground-truth feedback, and Thompson sampling guides exploration of the program space.
57.2.1 Key Architectural Decisions
Program-space search, not weight-space. The search operates over Python programs, not model parameters. This preserves the LLM's general capabilities while adding domain-specific constraints as an external artifact. No fine-tuning occurs at any stage.
Tree structure with Thompson sampling. Unlike linear iterative refinement approaches such as Reflexion (Shinn et al., 2023), which maintain a single chain of refinements, AutoHarness maintains a tree of candidate programs. Thompson sampling balances exploration of diverse program structures against exploitation of partially working harnesses. This prevents the search from committing prematurely to a single refinement trajectory.
Environment-in-the-loop. The training loop is closed through the actual game environment, not through a learned proxy or reward model. The environment provides ground-truth legality signals—either an action is legal or it is not. This eliminates the reward-model misalignment that plagues reinforcement learning from human feedback.
Single-model architecture. Both the Critic and Refiner components use Gemini-2.5-Flash. The system does not require access to larger or more capable models during training. This is a deliberate design choice: the entire point is that a small model can bootstrap its own constraint infrastructure.
Separation of analysis and synthesis. The Critic consolidates raw execution error traces into structured feedback; the Refiner consumes this structured feedback to propose code improvements. This two-stage pipeline prevents the Refiner from being overwhelmed by verbose, potentially contradictory raw error logs.
57.2.2 Three Inference Modes
The synthesized harness can be deployed in three configurations of increasing autonomy, as described in the paper:
| Mode | LLM at Inference? | Mechanism |
|---|---|---|
| Harness-as-Action-Verifier | Yes (proposes actions) | LLM proposes action; is_legal_action() verifies; retry on rejection |
| Harness-as-Action-Filter | Yes (ranks) | Code enumerates all legal actions; LLM ranks and selects |
| Harness-as-Policy | No | propose_action() generates action directly; zero LLM calls |
The paper primarily evaluates the verifier and policy modes. The policy mode produces the headline result: a pure code agent that outperforms frontier LLMs at zero inference cost.
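The three modes in the table above can be sketched as thin wrappers around the harness functions. This is a minimal illustration, not the authors' implementation: `llm_propose` and `llm_rank` are hypothetical stand-ins for LLM calls, the retry budget of 3 and the `None` fallback are assumptions, and filter mode is shown filtering a caller-supplied candidate list because the paper's enumeration mechanism is not reproduced here.

```python
from typing import Callable, Optional, Sequence

LegalFn = Callable[[str, str], bool]


def verifier_mode(
    observation: str,
    llm_propose: Callable[[str, Sequence[str]], str],  # hypothetical LLM wrapper
    is_legal_action: LegalFn,
    max_retries: int = 3,  # assumed retry budget, not from the paper
) -> Optional[str]:
    """LLM proposes; the harness verifies; rejected actions are fed back."""
    rejected: list[str] = []
    for _ in range(max_retries):
        action = llm_propose(observation, rejected)
        if is_legal_action(observation, action):
            return action
        rejected.append(action)  # negative feedback for the next attempt
    return None  # fallback behaviour is a choice of this sketch


def filter_mode(
    observation: str,
    candidates: Sequence[str],
    llm_rank: Callable[[str, Sequence[str]], str],  # hypothetical LLM wrapper
    is_legal_action: LegalFn,
) -> Optional[str]:
    """Code keeps only legal candidates; the LLM picks among them."""
    legal = [a for a in candidates if is_legal_action(observation, a)]
    return llm_rank(observation, legal) if legal else None


def policy_mode(observation: str, propose_action: Callable[[str], str]) -> str:
    """Pure code policy: no LLM call at inference."""
    return propose_action(observation)
```

Only `verifier_mode` and `filter_mode` touch the LLM at inference; `policy_mode` is a plain function call, which is why its marginal cost is close to zero.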
57.3 Core Algorithms
57.3.1 Harness Interface
Every synthesized harness conforms to a uniform interface consisting of two functions. This interface is domain-agnostic—the same signature applies to chess, Sudoku, card games, and all other environments in the 145-game evaluation suite:
# Pseudocode: no public implementation available
# Harness interface as described in Lou et al. (2026), Section 3

def is_legal_action(observation: str, action: str) -> bool:
    """Determine whether a proposed action is legal given the current
    game observation. The observation is the raw text string from the
    environment; the action is the agent's proposed text response.
    Returns True if the action complies with game rules, False otherwise.
    """
    # Domain-specific parsing and validation logic
    # synthesized through iterative refinement
    ...


def propose_action(observation: str) -> str:
    """Given the current game observation, propose a valid action.
    For harness-as-policy mode, this encodes the complete
    decision-making policy without any LLM calls.
    Returns a text string representing the chosen action.
    """
    # Domain-specific strategy logic
    # synthesized through iterative refinement with reward signal
    ...
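To make the interface concrete, here is a hand-written toy harness for tic-tac-toe. It is illustrative only: the observation format (`Board: ` followed by nine characters from `X`, `O`, `.`) and the action format (`[k]` with a 0-based cell index) are invented for this sketch and are not the TextArena formats, and the code was written by hand rather than synthesized.

```python
import re

# Hypothetical formats, invented for this example only.
_BOARD = re.compile(r"Board:\s*([XO.]{9})")
_ACTION = re.compile(r"^\s*\[([0-8])\]\s*$")


def _parse_board(observation: str):
    """Return the 9-character board string, or None if it cannot be parsed."""
    match = _BOARD.search(observation)
    return match.group(1) if match else None


def is_legal_action(observation: str, action: str) -> bool:
    """Legal iff the action names an existing, empty cell."""
    board = _parse_board(observation)
    match = _ACTION.match(action)
    if board is None or match is None:
        return False
    return board[int(match.group(1))] == "."


def propose_action(observation: str) -> str:
    """Fixed preference order: centre, then corners, then edges."""
    board = _parse_board(observation)
    if board is None:
        return ""
    for cell in (4, 0, 2, 6, 8, 1, 3, 5, 7):
        if board[cell] == ".":
            return f"[{cell}]"
    return ""  # board is full; no legal move exists
```

Both functions take only strings, so the same signatures would apply unchanged to a game with entirely different rules; only the parsing and validation bodies differ.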
An important design constraint noted by the authors: the "Available Moves" hints were deliberately removed from game observations before being passed to the harness. In the original TextArena chess environment, the observation includes a line such as Valid moves: [g1h3], [g1f3], .... This line is deleted, forcing the harness to deduce legality from structural understanding of game mechanics rather than simple string matching.
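The hint removal described above amounts to a small preprocessing step. The following is a sketch under the assumption that the hint occupies a single line beginning with `Valid moves:`, as in the example quoted in this section; the authors' actual modification is not released, so the pattern is a guess at the shape rather than a reproduction.

```python
import re

# Assumes the hint is one line starting with "Valid moves:" (per the example above).
_HINT_LINE = re.compile(r"^[ \t]*Valid moves:.*\n?", re.MULTILINE)


def strip_move_hints(observation: str) -> str:
    """Delete any 'Valid moves: ...' line before the harness sees the observation."""
    return _HINT_LINE.sub("", observation)
```

Applying this before every call to the harness ensures the synthesized code cannot pass the legality check by matching the proposed action against a provided list.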
57.3.2 Thompson Sampling for Node Selection
The search maintains a tree of candidate harness programs. Each node $n$ in the tree stores the harness source code, an evaluation count, and a success/failure record. Thompson sampling selects which node to refine next by maintaining a Beta posterior over each node's true quality.
Let $\alpha_n$ and $\beta_n$ denote the posterior parameters for node $n$, initialized as $\alpha_n = 1, \beta_n = 1$ (uniform prior). After each evaluation of node $n$, the parameters are updated based on the observed heuristic value. The selection procedure at each iteration is:
$$\theta_n \sim \mathrm{Beta}(\alpha_n, \beta_n) \quad \text{for each } n \in \mathcal{T}, \qquad n^* = \arg\max_{n \in \mathcal{T}} \theta_n$$
where $\mathcal{T}$ is the set of all nodes in the current tree, $\theta_n$ is the sampled quality estimate for node $n$, and $n^*$ is the node selected for refinement. The Beta distribution naturally encodes uncertainty: nodes with few evaluations have wide posteriors (encouraging exploration), while nodes with many evaluations have narrow posteriors concentrated around their empirical quality (favoring exploitation).
The paper reports a Thompson sampling weight parameter of 1.0 (Lou et al., 2026, experimental configuration). This controls the exploration-exploitation balance, though the exact mechanism by which this weight modulates the sampling is not further specified in the paper.
# Pseudocode: no public implementation available
# Thompson sampling node selection (Lou et al., 2026, Section 3)
import numpy as np
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    code: str                        # Python source of the harness
    parent: "TreeNode | None"        # Parent node (None for root)
    children: list = field(default_factory=list)
    alpha: float = 1.0               # Beta posterior: successes + 1
    beta_param: float = 1.0          # Beta posterior: failures + 1
    heuristic: float = 0.0           # Average legal action rate


def select_node(tree_nodes: list[TreeNode]) -> TreeNode:
    """Select node to refine via Thompson sampling.
    Sample from each node's Beta posterior; select the argmax."""
    samples = [
        np.random.beta(node.alpha, node.beta_param)
        for node in tree_nodes
    ]
    return tree_nodes[int(np.argmax(samples))]


def update_node(node: TreeNode, legal_rate: float) -> None:
    """Update Beta posterior after evaluating the node's harness.
    legal_rate: fraction of actions that were legal in rollouts."""
    # Treat legal_rate as a Bernoulli-like (fractional) success signal
    node.alpha += legal_rate
    node.beta_param += (1.0 - legal_rate)
    # Mean of all legal rates observed so far for this node
    node.heuristic = (node.alpha - 1.0) / (node.alpha + node.beta_param - 2.0)
The authors cite Tang et al. (2024) for the application of Thompson sampling to code repair with exploration-exploitation tradeoffs. The rationale for Thompson sampling over alternatives is articulated in the paper: random selection wastes iterations on unpromising branches; greedy selection (always refining the best node) risks local optima; UCB1 requires tuning an exploration constant; Thompson sampling adapts exploration naturally as uncertainty decreases.
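The adaptive behaviour claimed for Thompson sampling can be checked with a small simulation. The sketch below uses only the standard library; the three posteriors are hypothetical numbers chosen for illustration (an unevaluated node, a node that has mostly succeeded, and a node that has mostly failed), not values from the paper.

```python
import random


def thompson_pick(posteriors: dict, rng: random.Random) -> str:
    """Draw one sample per node from Beta(alpha, beta) and return the argmax."""
    samples = {name: rng.betavariate(a, b) for name, (a, b) in posteriors.items()}
    return max(samples, key=samples.get)


def selection_counts(posteriors: dict, n_draws: int = 1000, seed: int = 0) -> dict:
    """Count how often each node is selected over repeated sampling rounds."""
    rng = random.Random(seed)
    counts = {name: 0 for name in posteriors}
    for _ in range(n_draws):
        counts[thompson_pick(posteriors, rng)] += 1
    return counts


# Hypothetical posteriors: (alpha, beta) per node.
example = {
    "fresh": (1.0, 1.0),   # never evaluated: uniform prior
    "strong": (9.0, 2.0),  # mostly legal rollouts so far
    "weak": (2.0, 9.0),    # mostly illegal rollouts so far
}
counts = selection_counts(example)
```

In such a run the strong node receives most selections, the unevaluated node still receives a meaningful share because its posterior is wide, and the weak node is rarely chosen. No exploration constant is tuned to obtain this split.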
57.3.3 Heuristic Function Design
The heuristic function $H$ that drives the search differs between the two training objectives:
For harness-as-action-verifier, the heuristic is simply the fraction of legal actions across all rollout steps:
$$H_{\text{verifier}} = \frac{\text{number of legal actions}}{\text{total number of actions}}$$
Training terminates when $H_{\text{verifier}} = 1.0$ for a node, indicating perfect legal action accuracy across all test rollout steps (1,000 steps × 10 random seeds per game).
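For harness-as-policy, the heuristic incorporates the environment reward $r \in [0, 1]$:
$$H_{\text{policy}} = \begin{cases} 0 & \text{if any action in the rollout is illegal} \\ 0.5 + 0.5\,r & \text{otherwise} \end{cases}$$
where $r$ is the environment reward at episode end, and $H_{\text{policy}}$ is written simply as $H$ below. This design encodes a strict lexicographic priority: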
- First priority: Eliminate all illegal actions. Any illegal action forces $H = 0$, regardless of reward.
- Second priority: Maximize game reward. Among fully legal policies, $H$ scales linearly from 0.5 (zero reward) to 1.0 (maximum reward).
The 0.5 offset is crucial: it ensures that a policy producing all legal actions with zero reward ($H = 0.5$) is strictly preferred over any policy with even one illegal action ($H = 0$). This prevents the search from exploiting high-reward but rule-violating strategies.
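Both heuristics are short enough to write out directly. The sketch below follows the description in this section (fraction of legal actions for the verifier; zero on any illegal action and otherwise a linear map of reward onto the range 0.5 to 1.0 for the policy). The function names and the per-step boolean representation of legality are assumptions of this example.

```python
from typing import Sequence


def verifier_heuristic(legal_flags: Sequence[bool]) -> float:
    """Fraction of rollout steps whose action was legal."""
    if not legal_flags:
        raise ValueError("need at least one rollout step")
    return sum(legal_flags) / len(legal_flags)


def policy_heuristic(legal_flags: Sequence[bool], reward: float) -> float:
    """0 if any action was illegal; otherwise 0.5 + 0.5 * reward, reward in [0, 1]."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    if not all(legal_flags):
        return 0.0
    return 0.5 + 0.5 * reward
```

For example, a fully legal rollout with zero reward scores 0.5, while a rollout with maximum reward and a single illegal action scores 0.0, so the first always ranks above the second.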
57.3.4 Iterative Refinement Loop
The core training loop consists of four stages executed iteratively until convergence or timeout:
# Pseudocode: no public implementation available
# Main training loop (Lou et al., 2026, Section 3)
# Uses TreeNode, select_node, and update_node from Section 57.3.2.
# TEMPLATE_HARNESS, rollout, compute_legal_rate, sample_failures,
# llm_critic, and llm_refiner are placeholders that are not defined here.
def synthesize_harness(
    game_env,                    # TextArena game environment
    max_iterations: int = 256,
    n_parallel: int = 10,
    max_steps: int = 1000,
    max_failures: int = 5,
) -> str:
    """Synthesize a harness for the given game environment.
    Returns the Python source code of the best harness found."""
    # Initialize tree with empty/template harness
    root = TreeNode(code=TEMPLATE_HARNESS, parent=None)
    tree_nodes = [root]

    for iteration in range(max_iterations):
        # Stage 1: Select node via Thompson sampling
        node = select_node(tree_nodes)

        # Stage 2: Rollout. Execute harness in parallel environments
        traces = []
        for env_id in range(n_parallel):
            trace = rollout(game_env, node.code, max_steps)
            traces.append(trace)

        # Compute heuristic (legal action rate or reward-weighted)
        legal_rate = compute_legal_rate(traces)
        update_node(node, legal_rate)

        # Check termination
        if legal_rate == 1.0:
            return node.code

        # Stage 3: Critic. Consolidate failure traces
        failures = sample_failures(traces, max_count=max_failures)
        critic_analysis = llm_critic(node.code, failures)

        # Stage 4: Refiner. Generate improved code
        refined_code = llm_refiner(node.code, critic_analysis, failures)

        # Add new node to tree
        child = TreeNode(code=refined_code, parent=node)
        node.children.append(child)
        tree_nodes.append(child)

    # Return best node if no perfect harness found
    return max(tree_nodes, key=lambda n: n.heuristic).code
57.3.5 Asymmetric Error Attribution
The Refiner employs an asymmetric refinement strategy based on the type of failure observed. This mechanism prevents correct components from being needlessly modified:
| Failure Type | is_legal_action() Returns | Ground Truth | What to Refine |
|---|---|---|---|
| False positive | True (legal) | Actually illegal | Both is_legal_action() and propose_action() |
| True negative (correct rejection) | False (illegal) | Actually illegal | Only propose_action() |
The rationale: a false positive means the legality checker itself is buggy (it failed to catch an illegal move), requiring repair of the validation logic. A true negative means the checker works correctly—the proposal generator simply chose a bad action and needs to improve its selection. This asymmetry reduces the search space by protecting working components from unnecessary mutation.
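The attribution rule in the table can be expressed as a small decision function. This is a sketch of the rule as described here, not code from the paper; the function name is hypothetical, and the handling of cases the table does not list (the environment accepted the action) is a choice made for this example.

```python
def components_to_refine(checker_said_legal: bool, env_said_legal: bool) -> set[str]:
    """Map one observed step to the harness functions that should be refined."""
    if env_said_legal:
        # No illegal action occurred; the table above does not cover this case,
        # so this sketch leaves both functions untouched.
        return set()
    if checker_said_legal:
        # False positive: the checker missed an illegal action.
        return {"is_legal_action", "propose_action"}
    # Correct rejection: the checker works; only the proposal was bad.
    return {"propose_action"}
```

Passing the returned set to the Refiner prompt is one way to keep a working `is_legal_action()` out of the mutation scope.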
57.3.6 Rejection Sampling Interpretation
The harness-as-action-verifier mode can be formally interpreted as learned rejection sampling. In standard rejection sampling, one draws samples from a proposal distribution $q(x)$ and accepts each sample with probability proportional to the target density. In AutoHarness:
$$q(a \mid o) = \pi_{\text{LLM}}(a \mid o), \qquad \text{accept}(o, a) = \mathbb{1}\left[\texttt{is\_legal\_action}(o, a)\right]$$
where $o$ is the current observation and $a$ is the proposed action; the LLM acts as the proposal distribution and the synthesized checker decides acceptance. Two features distinguish this from classical rejection sampling: (1) the acceptance criterion is itself learned through program synthesis rather than specified analytically, and (2) rejection modifies the proposal distribution, because the LLM receives the illegal action as negative feedback in its next attempt rather than simply resampling from the same conditional distribution.
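As a rough worked example of what rejection costs, suppose (purely for illustration, and ignoring the feedback effect in feature (2)) that retries were independent draws from $q(\cdot \mid o)$, each passing is_legal_action() with probability $p$. The number of LLM calls per accepted action would then be geometrically distributed:

$$\mathbb{E}[\text{calls}] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p}$$

A proposal that is legal 80% of the time (a hypothetical figure, not one from the paper) would need $1/0.8 = 1.25$ calls per action on average under this idealization. To the extent that negative feedback steers the LLM away from repeating a rejected action, the real system should need fewer calls than this baseline suggests.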
57.4 Key Results
All results reported in this section are from Lou et al. (2026). The evaluation uses the TextArena framework (Guertler et al., 2025) with 145 games (9 free-form text/dialog games excluded). No independent replication has been published as of April 2026.
57.4.1 Legal Action Accuracy
The harness-as-action-verifier achieves 100% legal action accuracy across all 145 TextArena games on held-out test rollouts. Each game was evaluated with 1,000 rollout steps across 10 random seeds. This is a binary all-or-nothing metric—the harness either catches every illegal action or it does not.
| Metric | Value | Source |
|---|---|---|
| Games evaluated | 145 | Paper §4 |
| Games with 100% legal action rate | 145 (100%) | Paper §4 |
| Test rollout length | 1,000 steps | Paper §4 |
| Random seeds per game | 10 | Paper §4 |
| Average training iterations (verifier) | 14.5 | Paper §4 |
57.4.2 Two-Player Game Performance
On 16 selected two-player games, evaluated with 40 matches each (20 as player 1, 20 as player 2) to balance first-mover effects:
| Matchup | Win Rate (AutoHarness) | Win Rate (Opponent) | Games Won |
|---|---|---|---|
| Flash + Harness vs. Gemini-2.5-Pro | 56.3% | 38.2% | 9/16 |
| Flash + Harness vs. Gemini-2.5-Flash | 64.8% | — | 12/16 |
The smaller model (Flash) equipped with an automatically synthesized harness beats the larger model (Pro) without a harness, winning the majority of individual games. This is perhaps the paper's cleanest demonstration that constraint compliance, not model scale, is the binding bottleneck in structured decision-making.
57.4.3 Single-Player Game Performance
On 16 single-player games, evaluated with 20 matches each:
| Agent | Average Reward | Per-game comparison vs. Gemini-2.5-Pro |
|---|---|---|
| Gemini-2.5-Flash (vanilla) | 0.673 | — |
| Gemini-2.5-Pro (vanilla) | 0.707 | — |
| Gemini-2.5-Flash + Harness | 0.745 | 8/16 wins, 5/16 ties |
57.4.4 Harness-as-Policy: The Headline Result
The most striking finding is that a pure code policy with zero LLM calls at inference outperforms all evaluated frontier models, including GPT-5.2-High with high reasoning budget:
| Agent | Average Reward | Test-Time LLM Calls | Reported Test Cost |
|---|---|---|---|
| GPT-5.2 (no thinking) | 0.635 | Many | ~$640 (reduced eval)* |
| Gemini-2.5-Flash (vanilla) | 0.673 | Many | Moderate |
| Gemini-2.5-Pro (vanilla) | 0.707 | Many | High |
| GPT-5.2-High (high thinking) | 0.844 | Many | ~$640 (reduced eval)* |
| Harness-as-Policy (AutoHarness) | 0.870 | 0 | ~$0 |
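*GPT-5.2 costs are those reported by the authors for a reduced evaluation (10 repeats for GPT-5.2 and 5 for GPT-5.2-High, versus 20 for other agents). The authors quote a single figure of approximately $640 for the GPT-5.2 and GPT-5.2-High experiments, so the cost comparison should be interpreted in light of this evaluation asymmetry.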
This result demonstrates that an LLM can synthesize a complete decision-making policy in code form that exceeds the performance of the LLM itself—and even larger, more expensive models—at effectively zero marginal cost per decision.
57.4.5 Training Difficulty by Game
Convergence speed varies dramatically across games, revealing how the complexity of game rules maps to harness synthesis difficulty:
| Game | Training Iterations | Difficulty Driver |
|---|---|---|
| PigDice, Snake, ColonelBlotto | 1 | Trivially enumerable action space |
| GermanWhist-v0 (2P) | 43 | Complex card game rules |
| Cryptarithm-v0 (1P) | 45 | Cryptarithmetic puzzle constraints |
| Othello-v0 (2P) | 62 | Board state validation complexity |
| Chess-v0 (2P) | 64 | Complex piece movement, castling, en passant |
| Breakthrough-v0-small (2P) | 136 | Small board amplifies edge cases |
For the harness-as-policy mode (which must encode strategy, not just legality), synthesis was capped at 256 iterations; it averaged 89.4 iterations, with an average heuristic of 0.939 at termination (Lou et al., 2026, experimental configuration).
57.5 Implementation Details
57.5.1 Model and Environment Setup
All harness synthesis (training) uses a single model: Gemini-2.5-Flash. This model serves triple duty as the initial code generator, the Critic, and the Refiner. No larger model is required during synthesis. The evaluation (inference) phase tests multiple models as baselines, but the harness itself is always produced by Flash.
The environment is TextArena (Guertler et al., 2025), a collection of text-based games where observations and actions are both strings. The authors modified certain games to remove "Available Moves" hints from observations, though the exact modifications are described in the paper rather than released as a code patch.
57.5.2 Cost Analysis
The paper does not provide exact dollar costs for harness synthesis. Based on the reported iteration counts and the structure of each iteration (10 parallel rollouts, Critic analysis, Refiner generation), the per-game synthesis cost can be estimated:
| Phase | Model | Estimated Cost per Game | Basis |
|---|---|---|---|
| Verifier synthesis | Gemini-2.5-Flash | ~$1–5 | Chapter estimate: ~14.5 iters × (rollouts + critic + refiner calls) |
| Policy synthesis | Gemini-2.5-Flash | ~$5–30 | Higher iteration count (~89.4 avg) but still Flash pricing |
| Full suite (145 games, verifier) | Gemini-2.5-Flash | ~$150–700 total | Extrapolation from per-game estimate |
Note: These cost estimates are reconstructed from the paper's reported parameters and approximate Flash pricing (~$0.15/M input tokens, ~$0.60/M output tokens as of early 2026). They are not directly reported in the paper and should be treated as order-of-magnitude estimates.
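The arithmetic behind these estimates is simple enough to make explicit. The sketch below uses the approximate Flash prices quoted in the note as defaults; the function names are hypothetical, and any token counts passed in would themselves be assumptions, since the paper does not report token usage.

```python
def llm_cost_usd(
    input_tokens: float,
    output_tokens: float,
    usd_per_m_input: float = 0.15,   # approximate price quoted in the note above
    usd_per_m_output: float = 0.60,  # approximate price quoted in the note above
) -> float:
    """Dollar cost of a batch of LLM calls at per-million-token prices."""
    return (input_tokens / 1e6) * usd_per_m_input + (output_tokens / 1e6) * usd_per_m_output


def suite_cost_range(n_games: int, per_game_low: float, per_game_high: float) -> tuple:
    """Extrapolate a per-game cost range to a whole suite of games."""
    return n_games * per_game_low, n_games * per_game_high


# Per-game verifier estimate of $1 to $5, extrapolated to 145 games.
low, high = suite_cost_range(145, 1.0, 5.0)
```

The extrapolation gives $145 to $725, which the table rounds to roughly $150–700.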
The inference cost comparison is more dramatic and directly reported:
The paper states explicitly: "Since Harness-as-Policy generates pure (Python) code, our test time cost is nearly zero, while the GPT-5.2 and GPT-5.2-High experiments cost approximately $640" (Lou et al., 2026). The $640 figure covers a reduced evaluation (10 and 5 repeats for GPT-5.2 and GPT-5.2-High respectively, compared to 20 repeats for other methods).
57.5.3 Reproducibility Assessment
| Aspect | Status | Details |
|---|---|---|
| Code release | Not available | No public repository referenced in the paper |
| TextArena environment | Open source | Available via arXiv:2504.11442 |
| Model access | Proprietary API | Gemini-2.5-Flash/Pro via Google API; GPT-5.2 via OpenAI API |
| TextArena modifications | Described, not released | Removal of "Available Moves" hints |
| Generated harness examples | Paper Appendix D | Selected examples only |
| Experimental parameters | Fully specified | All hyperparameters reported in paper |
| Stochasticity | Present | Thompson sampling + LLM generation introduce variance |
| API versioning | Version-specific | Results tied to specific Gemini/GPT model versions |
Independent reproduction would require: (1) access to Gemini-2.5-Flash and evaluation-model APIs, (2) reimplementing the tree search, Critic, and Refiner pipeline from the paper description, (3) replicating the TextArena modifications, and (4) running sufficient seeds to account for stochastic variation. While the paper provides enough detail for a faithful reimplementation, the absence of released code and the dependence on proprietary model APIs are significant barriers.
57.5.4 Allowed Dependencies in Generated Code
The synthesized harness code is restricted to Python standard library modules (particularly re for pattern matching) and numpy for numerical operations. Critically, no domain-specific libraries are permitted—the harness cannot import python-chess for chess or any equivalent game-specific library. This constraint forces the LLM to synthesize game rules from first principles, making the harness a genuine demonstration of rule comprehension rather than a thin wrapper around an existing rule engine.
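One way such a restriction could be enforced is a static check on the generated source before it is executed. The paper does not describe its enforcement mechanism, so the following is only a sketch: the allow-list contents are hypothetical, and a static import scan does not catch dynamic imports such as `__import__` or `importlib`.

```python
import ast

# Hypothetical allow-list: a few standard-library modules plus numpy.
ALLOWED_MODULES = frozenset({"re", "math", "itertools", "collections", "random", "numpy"})


def disallowed_imports(source: str, allowed=ALLOWED_MODULES) -> set:
    """Return top-level module names imported by `source` that are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - set(allowed)
```

A candidate harness for which this returns a non-empty set could be rejected before rollout, and the offending module names could be fed to the Critic as an error.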
57.6 Connections to Evolutionary Computation
57.6.1 Structural Parallels
AutoHarness exhibits deep structural parallels with evolutionary algorithms, as analyzed in the paper and elaborated here. The following mapping connects EA concepts to AutoHarness components:
| EA Concept | AutoHarness Equivalent |
|---|---|
| Individual | Candidate harness program (tree node) |
| Fitness function | Legal action rate / reward-weighted heuristic |
| Mutation operator | LLM-driven code refinement (Refiner) |
| Selection mechanism | Thompson sampling over Beta posteriors |
| Population | Tree of program variants |
| Crossover | Not used (single-parent refinement only) |
| Generation | One iteration of select → rollout → critique → refine |
The most notable distinction from traditional evolutionary algorithms is the use of an LLM as the mutation operator. Traditional genetic programming applies syntactic transformations (subtree mutation, point mutation) that are semantically blind—they modify program structure without understanding program intent. The LLM-as-mutator provides semantically meaningful mutations: given error feedback explaining why the current harness fails, the LLM proposes targeted improvements that address the specific failure mode. This is qualitatively different from random perturbation and explains why the system converges in tens of iterations rather than thousands.
57.6.2 Relationship to AlphaEvolve and FunSearch
The paper explicitly positions AutoHarness relative to AlphaEvolve (Novikov et al., 2025), which applies LLM-powered evolutionary search to entire codebases. The distinction is in scope and objective:
| Dimension | AlphaEvolve | FunSearch | AutoHarness |
|---|---|---|---|
| Search scope | Entire codebase | Single function | Single harness program (~100–500 LOC) |
| Search method | Evolutionary (population-based) | Evolutionary (island model) | Tree search + Thompson sampling |
| Feedback signal | Algorithm-specific metrics | Scalar score | Binary legal/illegal + reward |
| Objective | Algorithm/heuristic discovery | Function optimization | Constraint program synthesis |
| LLM role | Mutation operator | Mutation operator | Generator, Critic, Refiner |
| Inference cost | Varies | N/A (offline) | Zero (policy mode) |
AutoHarness is more constrained in scope but targets a complementary problem: rather than discovering novel algorithms, it discovers the constraint boundaries within which an agent must operate. The two approaches are potentially composable—one could imagine using AlphaEvolve to discover strategy heuristics while AutoHarness ensures those heuristics never propose illegal actions.
57.6.3 Tree Search vs. Population-Based Search
AutoHarness uses a tree rather than a flat population. This choice has implications for how the search explores program space:
The tree structure enables backtracking: if a refinement path leads to a dead end (the harness cannot be improved further along that branch), Thompson sampling will shift probability mass toward other branches. This is not possible with flat population-based methods where parent-child lineage is not preserved. However, the absence of crossover means AutoHarness cannot combine partial solutions from different branches—a potential limitation for games where the harness requires multiple independent rule modules that could be developed separately.
57.7 Broader Applications and Implications
57.7.1 The "Small Model + Harness > Large Model" Thesis
AutoHarness's most provocative contribution to the broader AI discourse is the empirical demonstration that model size is not destiny for constrained decision-making. The result that Flash + Harness beats Pro, and that a code-only policy beats GPT-5.2-High, suggests a fundamental reallocation principle:
$$\text{effective capability} \approx \text{compliance rate} \times \text{strategic capability}$$
where constraint compliance acts as a multiplicative gate. If illegal moves account for most of an agent's losses, as in the 78% figure cited in Section 57.1, then strategic sophistication matters little: effective capability is bounded by the compliance rate. The harness lifts this bound to 1.0, allowing even a smaller model's strategic capability to be fully expressed.
This has direct implications for deployment economics. Instead of scaling model size to reduce error rates, which tends to yield diminishing returns, one can invest a fixed, bounded cost in harness synthesis and, on the paper's benchmark, achieve perfect compliance. The one-time synthesis cost, estimated in Section 57.5.2 at roughly $1–5 per game for verifier mode, is amortized over all subsequent deployments, so the marginal cost per decision approaches zero.
57.7.2 LLM as Compiler
The paper implicitly demonstrates the LLM functioning as a compiler from natural language rules to executable constraint-checking code. The TextArena game descriptions are presented in natural language; the harness translates these descriptions into Python functions that can verify rule compliance deterministically. This "compilation" metaphor suggests applications in:
- Regulatory compliance: Compile regulation text into automated compliance checkers
- Contract enforcement: Compile contract terms into runtime verification
- Safety specifications: Compile safety constraints into runtime monitors for autonomous systems
The key insight is that the compilation need not be perfect on the first attempt—the iterative refinement loop with environment feedback corrects errors systematically. This is a form of test-driven compilation where the environment serves as the test oracle.
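The "environment as test oracle" idea can be sketched as a differential test: run the candidate checker and the ground-truth oracle on the same cases and collect the disagreements, which become the failure traces for the next refinement step. The helper name and the toy game below (only the strings `roll` and `hold` are legal) are hypothetical and chosen for illustration.

```python
from typing import Callable, Iterable

LegalFn = Callable[[str, str], bool]


def find_disagreements(
    candidate: LegalFn,
    oracle: LegalFn,
    cases: Iterable[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return the (observation, action) pairs on which candidate and oracle differ."""
    return [(obs, act) for obs, act in cases if candidate(obs, act) != oracle(obs, act)]


# Toy example: a first-draft checker that accepts any lowercase word.
def draft_checker(obs: str, act: str) -> bool:
    return act.isalpha() and act.islower()


def toy_oracle(obs: str, act: str) -> bool:
    return act in {"roll", "hold"}


cases = [("turn 1", "roll"), ("turn 1", "hold"), ("turn 1", "fold")]
failures = find_disagreements(draft_checker, toy_oracle, cases)
```

Here the only disagreement is the action `fold`, which the draft accepts and the oracle rejects. An empty result over a sufficiently large set of cases plays the role of a passing test suite.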
57.7.3 Connection to Formal Verification
The is_legal_action() function is functionally equivalent to a runtime assertion or precondition checker in formal verification. In design-by-contract terms:
- Precondition: The observation represents a valid game state
- Postcondition: The executed action is legal in that state
- Invariant: The agent never executes an illegal action
The difference from traditional formal verification is that these conditions are synthesized from execution feedback rather than specified by a human engineer. This positions AutoHarness within the emerging area of learned formal methods—using machine learning to produce artifacts that serve verification purposes.
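In code, the invariant above corresponds to a runtime guard around the policy. The sketch below is illustrative: `guarded` and `IllegalActionError` are hypothetical names, and raising an exception is only one possible response to a violation (retrying or falling back to propose_action() are others).

```python
from typing import Callable


class IllegalActionError(RuntimeError):
    """Raised when a policy emits an action the checker rejects."""


def guarded(
    policy: Callable[[str], str],
    is_legal_action: Callable[[str, str], bool],
) -> Callable[[str], str]:
    """Wrap a policy so that the postcondition 'action is legal' is checked on every call."""
    def wrapped(observation: str) -> str:
        action = policy(observation)
        if not is_legal_action(observation, action):
            raise IllegalActionError(f"illegal action {action!r}")
        return action
    return wrapped
```

The guard is only as sound as the synthesized checker: it enforces the invariant relative to is_legal_action(), which the search has validated empirically rather than proved correct.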
57.8 Limitations & Discussion
57.8.1 Scope Limitations
Discrete action spaces only. The current approach assumes actions can be enumerated and individually validated. Extension to continuous action spaces (e.g., continuous robotic control with real-valued torques) would require fundamentally different harness structures—perhaps constraint satisfaction over continuous domains rather than discrete acceptance/rejection.
Static game rules. The harness encodes a fixed set of rules. Environments where rules change dynamically (e.g., games with evolving mechanics, regulatory environments with updating policies) would require online harness adaptation, which the current framework does not support.
Observation parsing brittleness. The harness relies on parsing text observations using regular expressions and string manipulation. Changes to the observation format—even minor ones like altered whitespace or field ordering—could break the harness. This is an inherent fragility of string-based parsing without a formal grammar.
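A small example shows how little it takes to break a string-based parser. Both patterns and the observation text below are hypothetical, not taken from a synthesized harness; the point is only that an exact-match pattern fails on a doubled space or trailing whitespace, while a whitespace-tolerant pattern survives that particular change (and would still break on a renamed field).

```python
import re

# Exact-format pattern: one space after the colon, nothing after the name.
STRICT = re.compile(r"^Current player: (\w+)$", re.MULTILINE)

# Tolerant pattern: flexible spaces/tabs and case-insensitive field name.
TOLERANT = re.compile(
    r"^[ \t]*current[ \t]+player[ \t]*:[ \t]*(\w+)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def current_player(observation: str, pattern: re.Pattern):
    """Return the player name captured by `pattern`, or None if it does not match."""
    match = pattern.search(observation)
    return match.group(1) if match else None


original = "Move 12\nCurrent player: White\n"
reformatted = "Move 12\nCurrent player:  White \n"  # extra space, trailing space
```

With these inputs the strict pattern parses `original` but returns `None` for `reformatted`, while the tolerant pattern returns `White` for both.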
No cross-game transfer. Each game's harness is synthesized independently. There is no skill library, no transfer learning, and no shared components across games. The authors explicitly identify this as a limitation: "Currently we generate a separate harness for each environment (game). In the future, we would like to distill the resulting domain specific experts (agents) back into the base LLM" (Lou et al., 2026). The absence of transfer means synthesis cost scales linearly with the number of environments.
57.8.2 Methodological Considerations
Evaluation asymmetry. The GPT-5.2 baselines were evaluated with fewer repeats (10 and 5) compared to other agents (20), due to cost constraints. This asymmetry means the GPT-5.2 results have higher variance, and the comparison should be interpreted with appropriate caution.
No public code release. As of April 2026, the harness synthesis framework has not been open-sourced. While the paper provides sufficient algorithmic detail for reimplementation, the absence of code prevents direct verification of results and limits community extension of the work.
Proprietary model dependence. Results are tied to specific versions of Gemini-2.5-Flash, Gemini-2.5-Pro, and GPT-5.2, which are proprietary APIs. Model updates, rate limiting, and non-determinism in API responses could affect reproducibility.
Strategic depth in policy mode. While the harness-as-policy achieves impressive average reward (0.870), the strategic depth of a synthesized code policy is inherently limited by the code synthesis budget (256 iterations maximum). For games requiring deep strategic planning (e.g., Go), a 500-line Python program may not capture the necessary complexity. The 0.939 average heuristic at termination (below the theoretical maximum of 1.0) indicates that the policy mode does not achieve optimal play on all games.
57.8.3 Open Questions
Scaling to complex environments. TextArena games are text-based with relatively compact state representations. Whether the approach scales to environments with rich visual observations (Craftax, Minecraft) or high-dimensional state spaces remains unexplored. The authors list Craftax and Terra Nova as future targets.
Multi-agent coordination. The harness is synthesized for a single agent. In multi-agent settings where legality depends on other agents' simultaneous actions, the constraint checking problem becomes substantially more complex.
Recursive self-improvement. The paper outlines but does not implement a vision where harness-generated domain expertise is distilled back into the base LLM, creating a recursive improvement loop. Whether such distillation would actually improve the LLM's ability to generate better harnesses—or merely overfit to the training game distribution—is an open question of significant importance.
57.9 Summary
Chapter Summary
Key takeaway: AutoHarness demonstrates that an LLM can synthesize its own constraint-checking code through tree search over program space, achieving perfect rule compliance across 145 games and enabling a pure code policy to outperform frontier models at zero inference cost.
Main contribution: The system introduces automatic harness synthesis as an alternative to both manual engineering and model fine-tuning for action compliance. By framing the problem as program search with Thompson sampling and environment-in-the-loop feedback, it achieves 100% legal action accuracy while being fully automated. The harness-as-policy result—outperforming GPT-5.2-High with zero LLM calls—establishes a new paradigm where one-time compute investment in code synthesis dominates repeated large-model inference.
What researchers should know: AutoHarness operationalizes the idea that LLM failures in structured environments are primarily constraint violations, not capability gaps. The practical implication is architectural: rather than scaling model size to reduce error rates, one can wrap any model in a synthesized constraint layer at bounded cost. The approach requires only the ability to obtain ground-truth feedback on action legality from the environment—a signal that is available in many deployment domains beyond games. The absence of open-source code and the reliance on proprietary model APIs are the primary barriers to independent verification and extension.