Toward Meta-Harness Optimization: A Formal Framework for Evaluation Pipeline Quality

Part: Benchmarks, Discovery & Applications

Conceptual-framework chapter — motivated by the Meta-Harness concept from Stanford IRIS Lab

Chapter Classification & Evidence Disclosure. This chapter is classified as a conceptual-framework chapter, not a system chapter. The repository stanford-iris-lab/meta-harness-tbench2-artifact was not publicly accessible at the time of writing (last checked April 2026), and no archival publication with full experimental details was located. Consequently, this chapter does not describe, reconstruct, or reverse-engineer the artifact's internals.

The chapter makes three clearly scoped contributions: (1) a formal, benchmark-agnostic framework for evaluation-harness quality grounded in generalizability theory and multi-objective optimization; (2) a concrete evaluation protocol specifying the exact metrics, replication design, ablation schema, and ranking-stability tests needed to validate any meta-harness system; and (3) a worked empirical demonstration (§28.7) applying the framework to synthetic and publicly reproducible harness data, showing that every proposed metric can be computed in practice. All equations, code, and architectural diagrams are author-constructed analytical tools. Provenance is marked as:

▸ Published — traceable to a cited paper, table, or figure.
▸ Framework — author-constructed formalization or illustration.
▸ Principle — established practice from cited related work, applied here.

28.1 Overview & Motivation

The evaluation of LLM-powered autonomous agents presents a distinctive engineering challenge: the harness (the infrastructure that provisions environments, dispatches tasks, captures agent trajectories, and computes scores) is itself a complex software system whose design shapes the validity, reproducibility, and cost of every benchmark result it produces. Yet most research on LLM agent evaluation focuses on task design and scoring rubrics while treating the harness as fixed infrastructure. The Meta-Harness concept, associated with Terminal-Bench 2.0 from the Stanford Intelligence through Robotic Interaction at Scale (IRIS) Lab, targets this gap by framing harness optimization as a first-class research problem.

The central thesis is straightforward: if we can automatically optimize the evaluation pipeline itself—environment bootstrapping, task parameterization, scoring reliability, runtime efficiency—then every downstream benchmark result becomes more trustworthy and less expensive to produce. This is a meta-level optimization: rather than improving the agents under test, the system improves the instrument that measures them.

28.1.1 The Harness Bottleneck

Modern LLM agent benchmarks such as SWE-bench (Jimenez et al., 2024), WebArena (Zhou et al., 2024), AgentBench (Liu et al., 2024), and MLAgentBench (Huang et al., 2024a) all require substantial infrastructure to execute a single evaluation pass. A typical harness must:

Provision an isolated environment (container, VM, or sandbox) with the correct OS, packages, and filesystem state for each task.
Instrument the environment to capture agent actions, terminal output, file modifications, and network calls without perturbing the agent's behavior.
Dispatch the task prompt to the agent under test and manage the interaction loop (tool calls, retries, timeouts).
Score the outcome against ground-truth criteria, which may involve deterministic checks (file diffs, exit codes), heuristic evaluation (partial credit), or LLM-as-judge rubrics.
Aggregate results across tasks, seeds, and model configurations while controlling for variance, cost, and contamination.

Each stage introduces potential failure modes: environment drift between runs, non-deterministic scoring, excessive provisioning latency, and incomplete trajectory capture. When benchmark papers report results without controlling for harness variance, the measurement noise from the harness itself can rival the signal from differences between agents. Luo et al. (2014) documented analogous problems in software testing, where flaky test infrastructure obscured genuine regressions. Parry et al. (2021) showed that infrastructure-induced non-determinism accounted for 12–28% of test failures in large CI systems. Shi et al. (2019) found that flaky tests eroded developer trust in CI signals, leading to ignored regressions—a directly analogous risk for LLM agent leaderboards.

Key Contribution. This chapter formalizes evaluation-harness optimization as a multi-objective optimization problem with three measurable quality components—reliability, discriminability, and coverage—plus a diagnostic metric (scorer calibration) and an efficiency constraint. The formalization draws on generalizability theory (Brennan, 2001), effect-size methodology (Cohen, 1988), multi-fidelity optimization (Kandasamy et al., 2017), and the flaky-test literature (Parry et al., 2021). It is benchmark-agnostic: it applies to any LLM agent evaluation harness. We additionally provide: (a) a concrete, executable evaluation protocol (§28.10) specifying exact metrics, replication counts, ablation table schema, and ranking-stability tests; and (b) a worked empirical demonstration (§28.7) applying all proposed metrics to synthetic score tensors calibrated from published harness-variance data, and sketching a fully reproducible application to the lm-evaluation-harness project. Within the evolutionary AI context of this survey, we show that harness quality directly governs the signal-to-noise ratio of fitness evaluation, making meta-harness optimization a high-leverage enabler for evolutionary search.

28.1.2 Terminal-Bench Context

Terminal-Bench, associated with researchers at Stanford, evaluates LLM agents on their ability to complete real-world tasks in terminal environments—system administration, software debugging, configuration management, and development workflows. The name "meta-harness" draws an analogy to meta-learning: just as meta-learning optimizes the learning algorithm rather than the model parameters (Finn et al., 2017), meta-harness optimization targets the evaluation procedure rather than the evaluated agent. This framing places the work at the intersection of benchmark design, software testing infrastructure, and automated configuration optimization.

28.1.3 Chapter Scope and Classification

Because the Meta-Harness artifact is not publicly accessible, this chapter cannot serve as a system description. It is classified as a conceptual-framework chapter. The distinction is important: a system chapter audits an implemented artifact; a conceptual-framework chapter develops analytical tools motivated by a system concept. All architecture (§28.4), equations (§28.5), code (§§28.5–28.7), and evaluation protocols (§28.10) are the survey author's constructions. When the Meta-Harness artifact becomes available, it can be evaluated against this framework; the framework itself is independently useful for any evaluation harness.

28.2 Evidence Base for the Meta-Harness Artifact

We systematically attempted to ground this chapter in verified implementation details. The following table lists everything that is publicly known.

Evidence Source	Status	What It Tells Us
GitHub repository URL	Exists but not publicly accessible (checked Apr 2026)	Confirms artifact exists; associated with Stanford IRIS Lab
Repository name	Verified: `meta-harness-tbench2-artifact`	Suggests link to Terminal-Bench 2.0; "meta-harness" confirms focus on harness-level optimization
Archival publication	None found in ACL/NeurIPS/ICML/ICLR/arXiv (searched Apr 2026)	No quantitative results or system design details available
Terminal-Bench literature	General context from Stanford IRIS Lab research	Terminal-based LLM agent evaluation is a documented research direction
Implementation details	None accessible	No information on modules, config schemas, search algorithms, or scoring pipelines

Table 28.1: Exhaustive evidence base for the Meta-Harness artifact. Five sources were checked; only the repository URL and name are confirmed. No implementation claim in this chapter is derived from the artifact.

The remainder of the chapter develops a formal framework and evaluation protocol without further accessibility disclaimers. Readers should recall that all architecture, equations, and code are the survey author's constructions unless marked ▸ Published.

28.3 Related Work on Evaluation Infrastructure

Meta-harness optimization draws on four largely independent bodies of work. This section synthesizes them to establish the intellectual foundation.

28.3.1 Test Infrastructure Reliability & Flaky Tests

▸ Published. Luo et al. (2014) conducted the first large-scale study of flaky tests, analyzing 201 commits in Apache projects and identifying 10 root-cause categories, with async wait, concurrency, and test order dependency as the top three. Memon et al. (2017) reported that at Google, 1 in 7 tests exhibits flaky behavior. Parry et al. (2021) provided a systematic literature review covering 76 primary studies, documenting that infrastructure-induced non-determinism accounted for 12–28% of test failures across studied CI systems. Their taxonomy of causes—platform dependency, resource leaks, network sensitivity, and timing assumptions—maps directly to LLM evaluation harnesses. Shi et al. (2019) showed that flaky test detection via reruns is both expensive and unreliable, motivating automated root-cause analysis. Lam et al. (2020) demonstrated that 74% of flaky tests could be deflaked by controlling non-deterministic dependencies—evidence that targeted infrastructure optimization yields concrete reliability gains.

28.3.2 Scorer Calibration in LLM Evaluation

▸ Published. Zheng et al. (2024) introduced MT-Bench and the LLM-as-judge paradigm, documenting systematic biases: position bias, verbosity bias, and self-enhancement bias, with inter-judge agreement of ~80% for pairwise comparisons using GPT-4. Li et al. (2024) developed AlpacaEval 2.0, demonstrating that scorer calibration directly affects model rankings via length-controlled evaluation. Dubois et al. (2024) showed that uncalibrated scoring could invert pairwise rankings for models within 2–3 percentage points of each other. For agent evaluation, Wang et al. (2024) highlighted that multi-step task scoring is substantially harder than single-turn scoring. Xia et al. (2024) documented that SWE-bench's test-suite-based scoring fails to credit valid patches that achieve correct outcomes through unanticipated approaches—a solution plurality problem directly relevant to harness scoring calibration.

28.3.3 Benchmark Infrastructure Optimization

▸ Published. Shu et al. (2017) demonstrated that Docker image layering optimization can reduce container provisioning time by 40–60% in CI/CD pipelines. Esfahani et al. (2016) introduced test case prioritization techniques that reduce CI feedback time by 30–50% while maintaining fault-detection effectiveness. In the ML evaluation domain, HELM (Liang et al., 2023) demonstrated that scoring methodology choices change model rankings even when agent behavior is identical. Biderman et al. (2024) documented the lm-evaluation-harness project, showing that seemingly minor implementation differences (tokenization of few-shot examples, answer extraction regex) cause inter-harness score discrepancies of 2–5 percentage points—large enough to shuffle leaderboard positions. These findings empirically justify treating harness configuration as an optimization variable.

Feurer & Hutter (2019) surveyed AutoML, which shares structural similarity with meta-harness optimization: both search configuration spaces to optimize quality metrics. However, in AutoML the optimized object is a learned predictor and the loss measures predictive accuracy; in meta-harness optimization the optimized object is the measurement instrument and the loss measures measurement quality—a distinction that changes which configuration parameters are tunable and what constitutes a pathological optimum.

28.3.4 Psychometric Foundations

▸ Published. Classical test theory (CTT; Lord & Novick, 1968) decomposes an observed score as $X = T + E$ and defines reliability as $\rho_{XX'} = \sigma^2_T / \sigma^2_X$. The intraclass correlation coefficient (ICC), introduced by Fisher (1925) and operationalized by Shrout & Fleiss (1979), extends this to multi-rater settings; we adapt ICC to multi-run evaluation in §28.5. Confidence intervals for ICC use F-distribution-based methods from McGraw & Wong (1996).

Item response theory (IRT; Lord, 1980; Rasch, 1960) provides task-level analysis. The two-parameter logistic model $P(\text{correct} \mid \theta, a_j, b_j) = [1 + \exp(-a_j(\theta - b_j))]^{-1}$ parameterizes each task by difficulty $b_j$ and discrimination $a_j$, enabling principled task selection that maximizes information at target ability levels.
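To make the task-selection idea concrete, the following minimal sketch evaluates the 2PL response probability and the standard 2PL item information $a_j^2 P (1 - P)$, then picks the most informative tasks at a target ability. The task parameters and the helper name `top_informative_tasks` are illustrative assumptions, not values estimated from any benchmark.

```python
import numpy as np


def p_correct(theta, a, b):
    """2PL success probability for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)


def top_informative_tasks(theta, a, b, k):
    """Indices of the k tasks with the highest information at theta (hypothetical helper)."""
    info = item_information(theta, a, b)
    return np.argsort(info)[::-1][:k]


# Hypothetical task parameters, for illustration only.
a = np.array([0.5, 1.2, 2.0, 1.0])    # discrimination a_j
b = np.array([-1.0, 0.0, 0.1, 2.5])   # difficulty b_j

selected = top_informative_tasks(theta=0.0, a=a, b=b, k=2)
print(selected)  # tasks whose difficulty is near theta and whose a_j is large
```

Information peaks where $\theta \approx b_j$ and grows with $a_j^2$, so a very hard task (here $b_j = 2.5$) contributes little when the agents of interest sit near $\theta = 0$.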

Generalizability theory (G-theory; Brennan, 2001) extends CTT to multiple facets of measurement simultaneously. A crossed design with models, tasks, and runs as facets provides a natural framework for decomposing variance in LLM evaluation. The meta-harness reliability formulation in §28.5 is a direct application of G-theory, with a full derivation connecting the three-facet model to the ICC coefficient.

28.4 Author-Proposed Reference Architecture

Author-Proposed Reference Framework. The architecture below is the survey author's analytical construction, motivated by the Meta-Harness concept and by standard evaluation-harness design patterns documented in SWE-bench (Jimenez et al., 2024), WebArena (Zhou et al., 2024), HELM (Liang et al., 2023), and lm-evaluation-harness (Biderman et al., 2024). No claim is made that the Meta-Harness artifact implements this architecture. This section serves as a reference model identifying the configuration space that meta-harness optimization would search over.

An evaluation harness comprises four principal layers, each with tunable configurations that jointly determine measurement quality. The key insight of meta-harness optimization is that these configurations form a search space amenable to automated optimization.

Figure 28.1: ▸ Framework. Author-proposed reference architecture for meta-harness optimization. The meta-optimization controller treats environment, dispatch, capture, and scoring configurations as a joint search space. Harness fitness feeds back to guide configuration updates. This diagram is an analytical construction, not a description of the Meta-Harness artifact.

28.4.1 Environment Management

▸ Principle. The environment manager provisions isolated execution contexts. In terminal-based benchmarks, this typically involves Docker containers or lightweight VMs. SWE-bench (Jimenez et al., 2024) uses per-repository Docker images; WebArena (Zhou et al., 2024) uses Docker Compose stacks with multiple services. Optimization targets include:

Bootstrap time: Image caching, layer sharing, and lazy initialization reduce per-task overhead. Shu et al. (2017) demonstrated 40–60% provisioning-time reduction through systematic layering optimization.
State fidelity: Content-addressable image digests (e.g., ubuntu:22.04@sha256:abc...) enforce exact reproducibility; floating tags introduce drift.
Isolation guarantees: Container-based isolation prevents cross-task leakage but does not provide VM-level security boundaries (Gao et al., 2023). For adversarial contexts, Firecracker microVMs provide stronger isolation.
Resource allocation: CPU, memory, and disk limits must be consistent across runs.
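The state-fidelity point above can be enforced mechanically. The sketch below flags task images that are referenced by a floating tag rather than a content digest. The function names and the task-to-image mapping are hypothetical, and the digest in the example is an all-zero placeholder rather than a real image digest.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref):
    """True if the image reference is pinned to a sha256 content digest."""
    return _DIGEST_RE.search(image_ref) is not None


def unpinned_images(task_images):
    """Sorted task ids whose image reference uses a floating tag (hypothetical helper)."""
    return sorted(t for t, ref in task_images.items() if not is_digest_pinned(ref))


# Illustrative mapping; the digest is a placeholder, not a real one.
PLACEHOLDER_DIGEST = "0" * 64
task_images = {
    "task-a": "ubuntu:22.04@sha256:" + PLACEHOLDER_DIGEST,
    "task-b": "ubuntu:22.04",
    "task-c": "python:latest",
}
print(unpinned_images(task_images))
```

A check of this kind could run before any evaluation pass so that drift from floating tags is caught at configuration time rather than discovered as unexplained run-to-run variance.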

28.4.2 Task Dispatch and Parameterization

▸ Principle. The task dispatcher controls task ordering, prompt parameterization, timeout policies, and retry logic. Meta-harness optimization treats these as tunable parameters, drawing on the principle that automated search over structured configuration spaces typically outperforms manual tuning (Bergstra & Bengio, 2012).
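As a rough illustration of searching a dispatch configuration space, the sketch below runs plain random search in the spirit of Bergstra & Bengio (2012). The parameter names (`timeout_s`, `max_retries`, `task_order`), their candidate values, and the toy fitness function are all assumptions made for the example; a real fitness would come from running the harness.

```python
import random

# Hypothetical dispatch search space (names and values are illustrative).
SEARCH_SPACE = {
    "timeout_s": [300, 600, 1200],
    "max_retries": [0, 1, 2],
    "task_order": ["fixed", "shuffled"],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(fitness, n_trials, seed=0):
    """Return the best configuration and fitness found in n_trials random draws."""
    rng = random.Random(seed)
    best_cfg, best_fit = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        fit = fitness(cfg)
        if fit > best_fit:
            best_cfg, best_fit = cfg, fit
    return best_cfg, best_fit


def toy_fitness(cfg):
    """Placeholder fitness: prefers a 600 s timeout and no retries. Not a real harness metric."""
    return -abs(cfg["timeout_s"] - 600) / 600 - 0.1 * cfg["max_retries"]


best_cfg, best_fit = random_search(toy_fitness, n_trials=200, seed=0)
print(best_cfg, best_fit)
```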

28.4.3 Trajectory Capture

▸ Principle. Complete trajectory capture is essential for scoring and post-hoc analysis. A key design tension is capture completeness versus overhead: more detailed recording provides richer data but can slow execution or alter agent behavior through observable side effects. This observer effect is documented in software profiling (Mytkowicz et al., 2010).

28.4.4 Scoring and Aggregation

▸ Principle. Terminal-based tasks typically admit multiple valid solutions, requiring scoring logic that checks functional correctness rather than exact command matching. HELM (Liang et al., 2023) and Biderman et al. (2024) demonstrated that scoring methodology choices can change model rankings even when underlying agent behavior is identical. Zheng et al. (2024) documented systematic biases in LLM-as-judge scoring.

28.5 Formal Framework for Harness Quality

▸ Framework. This section presents the chapter's primary analytical contribution: a formal framework for harness optimization grounded in generalizability theory (Brennan, 2001) and multi-objective optimization.

28.5.1 Problem Formulation

Let $\mathcal{H}$ denote the space of harness configurations, where each configuration $h \in \mathcal{H}$ specifies environment setup parameters, dispatch policies, capture settings, and scoring rubrics. Let $\mathcal{T} = \{t_1, \ldots, t_N\}$ be a benchmark task suite, and let $\mathcal{M} = \{m_1, \ldots, m_K\}$ be a set of models (agents). Under $L \geq 2$ independent replications, the harness produces a score tensor:

$$\mathbf{S}(h) = \bigl\{s_{ij\ell}(h)\bigr\}_{i \in [K],\, j \in [N],\, \ell \in [L]}$$

where $s_{ij\ell}(h) \in \mathbb{R}$ is the score assigned to model $m_i$ on task $t_j$ in replication $\ell$ under harness configuration $h$. The meta-optimization objective is to find a configuration $h^*$ that maximizes harness quality:

$$h^* \in \arg\max_{h \in \mathcal{H}} \bigl(R(h),\; D(h),\; C(h)\bigr) \quad \text{subject to} \quad \tau(h) \cdot c(h) \leq B \tag{28.1}$$

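where $R(h) \in [0,1]$ is reliability (§28.5.2), $D(h) \geq 0$ is discriminability (§28.5.3), $C(h) \in [0,1]$ is coverage (§28.5.4), $\tau(h)$ is wall-clock time, $c(h)$ is monetary cost per unit time, and $B$ is a budget constraint. Because the objective is vector-valued, the $\arg\max$ is taken in the Pareto sense: $h^*$ is any budget-feasible configuration for which no other feasible configuration is at least as good on all of $R$, $D$, and $C$ and strictly better on at least one. This constrained multi-objective formulation avoids the commensurability problems inherent in combining bounded quality metrics with unbounded cost. The Pareto-optimal configurations define a quality-cost frontier.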
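A minimal sketch of how the feasible Pareto set of Eq. 28.1 could be extracted from a handful of evaluated configurations follows. The candidate names and their $(R, D, C, \text{cost})$ values are invented for illustration.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective and strictly better on one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(candidates, budget):
    """Names of budget-feasible, non-dominated configurations.

    candidates maps a name to (R, D, C, cost); only R, D, C are objectives,
    cost is used solely for the budget constraint.
    """
    feasible = {n: v[:3] for n, v in candidates.items() if v[3] <= budget}
    return sorted(
        n for n, p in feasible.items()
        if not any(dominates(q, p) for m, q in feasible.items() if m != n)
    )


# Hypothetical evaluated configurations: (R, D, C, cost in USD).
candidates = {
    "h0": (0.70, 0.90, 0.95, 100.0),
    "h1": (0.85, 1.10, 0.97, 140.0),
    "h2": (0.90, 0.80, 0.99, 120.0),
    "h3": (0.95, 1.30, 0.99, 400.0),  # best on every objective but over budget
    "h4": (0.80, 1.00, 0.96, 90.0),
}
print(pareto_front(candidates, budget=200.0))
```

In this toy set, `h3` is excluded by the budget, `h0` and `h4` are dominated by `h1`, and `h1` and `h2` remain because each beats the other on at least one objective.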

Scalarized alternative. When a single configuration is needed:

$$F(h) = \alpha \cdot R(h) + \beta \cdot \tilde{D}(h) + \gamma \cdot C(h) \quad \text{s.t.} \; \tau(h) \cdot c(h) \leq B \tag{28.2}$$

where $\alpha, \beta, \gamma > 0$ with $\alpha + \beta + \gamma = 1$, and $\tilde{D}(h) = \min(D(h), 3) / 3$ normalizes discriminability to $[0,1]$ (Cohen's $d > 3$ is practically indistinguishable from perfect separation). This makes all three terms commensurate. The budget constraint keeps efficiency separate from quality.
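As an illustrative calculation, take the hypothetical component values $R = 0.85$, $D = 1.09$, and $C = 0.97$, with the default weights used in Listing 28.1 ($\alpha = 0.40$, $\beta = 0.35$, $\gamma = 0.25$):

$$\tilde{D} = \frac{\min(1.09,\, 3)}{3} \approx 0.363, \qquad F = 0.40 \cdot 0.85 + 0.35 \cdot 0.363 + 0.25 \cdot 0.97 \approx 0.340 + 0.127 + 0.2425 \approx 0.710$$

Discriminability contributes only about $0.127$ here even though $D$ exceeds 1, because the clip-and-scale normalization spreads the range $[0, 3]$ over $[0, 1]$.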

28.5.2 Reliability via Mixed-Effects Model

▸ Framework. Reliability measures the consistency of scores across independent replications. We formalize this using a crossed random-effects model from generalizability theory (Brennan, 2001).

Generative model. Assume the score tensor follows a three-facet crossed random-effects model:

$$s_{ij\ell} = \mu + \alpha_i + \beta_j + \gamma_\ell + (\alpha\beta)_{ij} + (\alpha\gamma)_{i\ell} + (\beta\gamma)_{j\ell} + \epsilon_{ij\ell} \tag{28.3}$$

where $\mu$ is the grand mean, $\alpha_i \sim \mathcal{N}(0, \sigma^2_m)$ is the model effect (signal), $\beta_j \sim \mathcal{N}(0, \sigma^2_t)$ is the task effect, $\gamma_\ell \sim \mathcal{N}(0, \sigma^2_r)$ is the run effect (harness noise), $(\alpha\beta)_{ij} \sim \mathcal{N}(0, \sigma^2_{mt})$ is the model×task interaction, $(\alpha\gamma)_{i\ell} \sim \mathcal{N}(0, \sigma^2_{mr})$ is the model×run interaction, $(\beta\gamma)_{j\ell} \sim \mathcal{N}(0, \sigma^2_{tr})$ is the task×run interaction, and $\epsilon_{ij\ell} \sim \mathcal{N}(0, \sigma^2_\epsilon)$ is the residual. All random effects are mutually independent. The effect symbols $\alpha_i$, $\beta_j$, and $\gamma_\ell$ are always subscripted and are unrelated to the scalarization weights $\alpha$, $\beta$, $\gamma$ of Eq. 28.2.

Assumptions. (A1) The design is fully crossed: every model is evaluated on every task in every run. (A2) Replications are independent: no information from run $\ell$ influences run $\ell'$. (A3) Random effects are normally distributed with mean zero. Assumption A3 is standard in G-theory and approximately holds for continuous scores; for binary pass/fail, a generalized linear mixed model is more appropriate. Assumptions A1 and A2 are design requirements that the harness must enforce.
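To see what the generative model implies, the sketch below draws a synthetic score tensor from Eq. 28.3 using NumPy broadcasting. The standard deviations are arbitrary illustrative choices, the function name is hypothetical, and the scores are left unclipped, so they can fall outside $[0, 1]$.

```python
import numpy as np


def simulate_scores(K, N, L, sd, mu=0.5, seed=0):
    """Draw a (K, N, L) tensor from the crossed random-effects model of Eq. 28.3.

    sd maps 'm', 't', 'r', 'mt', 'mr', 'tr', 'e' to standard deviations.
    Each effect is shaped so that broadcasting repeats it along the facets
    it does not depend on.
    """
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, sd["m"], size=(K, 1, 1))   # model effect
    beta = rng.normal(0.0, sd["t"], size=(1, N, 1))    # task effect
    gamma = rng.normal(0.0, sd["r"], size=(1, 1, L))   # run effect
    ab = rng.normal(0.0, sd["mt"], size=(K, N, 1))     # model x task
    ag = rng.normal(0.0, sd["mr"], size=(K, 1, L))     # model x run
    bg = rng.normal(0.0, sd["tr"], size=(1, N, L))     # task x run
    eps = rng.normal(0.0, sd["e"], size=(K, N, L))     # residual
    return mu + alpha + beta + gamma + ab + ag + bg + eps


# Illustrative standard deviations (assumed, not calibrated to any benchmark).
SD = {"m": 0.10, "t": 0.15, "r": 0.02, "mt": 0.08, "mr": 0.03, "tr": 0.03, "e": 0.10}
scores = simulate_scores(K=4, N=30, L=5, sd=SD)
print(scores.shape)
```

Tensors generated this way have known variance components, which makes them convenient for checking that an estimator recovers the values it was given.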

Derivation of the reliability coefficient. The purpose of the harness is to rank models. We therefore derive the ICC for model rankings based on task-averaged scores. Define the task-averaged score for model $i$ in run $\ell$:

$$\bar{s}_{i\cdot\ell} \;=\; \frac{1}{N}\sum_{j=1}^{N} s_{ij\ell} \;=\; \mu + \alpha_i + \bar{\beta}_{\cdot} + \gamma_\ell + \overline{(\alpha\beta)}_{i\cdot} + (\alpha\gamma)_{i\ell} + \overline{(\beta\gamma)}_{\cdot\ell} + \bar{\epsilon}_{i\cdot\ell}$$

where $\overline{(\alpha\beta)}_{i\cdot} = \frac{1}{N}\sum_j (\alpha\beta)_{ij}$ and similarly for other averaged terms. By independence of the random effects, $\text{Var}(\overline{(\alpha\beta)}_{i\cdot}) = \sigma^2_{mt}/N$, $\text{Var}(\overline{(\beta\gamma)}_{\cdot\ell}) = \sigma^2_{tr}/N$, and $\text{Var}(\bar{\epsilon}_{i\cdot\ell}) = \sigma^2_\epsilon / N$. The averaged task effect $\bar{\beta}_{\cdot}$ is identical for every model and every run once the task set is fixed, so it shifts all entries of the $K \times L$ matrix equally and contributes to neither group. Collecting terms that vary between models (signal) and terms that vary within models across runs (noise):

Between-model variance (signal for ranking): $\sigma^2_{\text{between}} = \sigma^2_m + \sigma^2_{mt}/N$. The $\sigma^2_{mt}/N$ term enters the signal because model×task interactions contribute to systematic model-level differences when averaged over a fixed task set.
Within-model, across-run variance (noise): $\sigma^2_{\text{within}} = \sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}/N + \sigma^2_\epsilon/N$. These are the components that cause a model's task-averaged score to fluctuate between runs.

The reliability coefficient is the ratio of signal to total variance in task-averaged model scores, which is the ICC(2,1) form from Shrout & Fleiss (1979) applied to the $K \times L$ matrix of $\bar{s}_{i\cdot\ell}$:

$$R(h) = \frac{\sigma^2_m + \sigma^2_{mt}/N}{\sigma^2_m + \sigma^2_{mt}/N + \sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}/N + \sigma^2_\epsilon/N} \tag{28.4}$$

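Interpretation follows Cicchetti (1994): $R < 0.50$ is poor, $0.50 \leq R < 0.75$ is moderate, $0.75 \leq R < 0.90$ is good, $R \geq 0.90$ is excellent. A high-quality harness maximizes $R(h)$ by minimizing the run-related variance components $\sigma^2_r$, $\sigma^2_{mr}$, and $\sigma^2_{tr}$, together with the residual $\sigma^2_\epsilon$. Of these, only $\sigma^2_{tr}$ and $\sigma^2_\epsilon$ are divided by $N$, so adding tasks cannot remove the contribution of $\sigma^2_r$ and $\sigma^2_{mr}$.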
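The behaviour of Eq. 28.4 as the task count grows can be checked numerically. The sketch below plugs hypothetical variance components into the formula; the component values and the function name are assumptions for illustration.

```python
def reliability(sig2, N):
    """R(h) from Eq. 28.4 given variance components and the number of tasks N."""
    signal = sig2["m"] + sig2["mt"] / N
    noise = sig2["r"] + sig2["mr"] + sig2["tr"] / N + sig2["e"] / N
    return signal / (signal + noise)


# Hypothetical variance components (illustrative only).
sig2 = {"m": 0.010, "mt": 0.020, "r": 0.001, "mr": 0.002, "tr": 0.004, "e": 0.030}

r_10 = reliability(sig2, N=10)   # about 0.65
r_50 = reliability(sig2, N=50)   # about 0.74

# Limit as N grows without bound: only m, r, and mr survive.
r_limit = sig2["m"] / (sig2["m"] + sig2["r"] + sig2["mr"])  # about 0.77
print(round(r_10, 3), round(r_50, 3), round(r_limit, 3))
```

With these numbers, going from 10 to 50 tasks moves $R$ from the moderate band toward the good band, but no task count lifts it past roughly 0.77; further gains would have to come from reducing $\sigma^2_r$ and $\sigma^2_{mr}$.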

Confidence intervals for ICC. The 95% CI for ICC(2,1) is constructed using the F-distribution method of McGraw & Wong (1996). Defining $F_0 = MS_\text{model} / MS_\text{residual}$ from the ANOVA of the $K \times L$ matrix of task-averaged scores, the CI bounds are functions of $F_0$, the degrees of freedom $(K-1, (K-1)(L-1))$, and the F-distribution quantiles $F_{\alpha/2}$ and $F_{1-\alpha/2}$. The exact expressions are given in McGraw & Wong (1996, Eqs. 3–6). For comparing two ICC values from different harness configurations, we use a paired bootstrap procedure: resample the task set with replacement $B_{\text{boot}} \geq 2000$ times (written $B_{\text{boot}}$ to keep it distinct from the budget $B$ of Eq. 28.1), apply the same resampled task indices to both configurations, compute $\Delta R = R(h^*) - R(h_0)$ for each resample, and report the percentile CI.
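The paired bootstrap can be sketched as follows. For brevity this example estimates ICC(2,1) directly from the task-averaged $K \times L$ matrix using the mean-square form of Shrout & Fleiss (1979), which is a simpler plug-in estimator than the full three-facet decomposition and does not truncate negative components. Function names and the synthetic data are assumptions.

```python
import numpy as np


def icc_a1(x):
    """ICC(2,1), absolute agreement, for an (n_targets, k_raters) matrix.

    Mean-square form: (MSR - MSE) / (MSR + (k-1)*MSE + k*(MSC - MSE)/n).
    """
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # rows: models
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # columns: runs
    ss_e = np.sum((x - grand) ** 2) - ms_r * (n - 1) - ms_c * (k - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    denom = ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n
    return float((ms_r - ms_e) / denom) if abs(denom) > 1e-12 else float("nan")


def paired_bootstrap_delta_r(scores_new, scores_base, n_boot=2000, seed=0):
    """95% percentile CI for R(new) - R(base); tensors are (K, N, L) over the same tasks."""
    rng = np.random.default_rng(seed)
    n_tasks = scores_new.shape[1]
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_tasks, size=n_tasks)  # one task resample, used for both
        r_new = icc_a1(scores_new[:, idx, :].mean(axis=1))
        r_base = icc_a1(scores_base[:, idx, :].mean(axis=1))
        deltas[b] = r_new - r_base
    lo, hi = np.nanpercentile(deltas, [2.5, 97.5])
    return float(lo), float(hi)


# Synthetic illustration: identical model effects, very different noise levels.
rng = np.random.default_rng(1)
model = np.linspace(0.0, 1.0, 5)[:, None, None]
base = model + rng.normal(0.0, 2.0, size=(5, 20, 4))
new = model + rng.normal(0.0, 0.01, size=(5, 20, 4))
print(paired_bootstrap_delta_r(new, base, n_boot=500))
```

An interval that excludes zero supports the claim that the new configuration is more reliable on this task set; resampling tasks jointly for both configurations is what makes the comparison paired.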

Practical estimation. Variance components are estimated from the score tensor via ANOVA-based estimators for balanced designs (equivalent to REML in this case; Searle et al., 1992), or via REML for unbalanced designs. Negative ANOVA estimates are truncated to zero per convention (Brennan, 2001, Ch. 4). Standard implementations: R (lme4::lmer), Python (statsmodels.MixedLM), Julia (MixedModels.jl).

28.5.3 Discriminability

▸ Framework. Discriminability measures the harness's ability to distinguish between models. We define it as the mean pairwise Cohen's $d$ effect size:

$$D(h) = \frac{2}{K(K-1)} \sum_{i < i'} d_{ii'}(h), \quad d_{ii'}(h) = \frac{|\bar{s}_{i\cdot\cdot}(h) - \bar{s}_{i'\cdot\cdot}(h)|}{\sqrt{\frac{1}{2}(\hat\sigma_i^2 + \hat\sigma_{i'}^2)}} \tag{28.5}$$

where $\bar{s}_{i\cdot\cdot}(h) = \frac{1}{NL}\sum_{j,\ell} s_{ij\ell}(h)$ is model $m_i$'s grand mean, and $\hat\sigma_i^2 = \frac{1}{NL-1}\sum_{j,\ell}(s_{ij\ell}(h) - \bar{s}_{i\cdot\cdot}(h))^2$ is the total variance of model $m_i$'s scores across tasks and runs.

Design-choice justification. Eq. 28.5 is not a canonical statistic—it is a design choice for this framework, selected for three reasons: (1) Cohen's $d$ is scale-free and interpretable via well-established conventions ($d > 0.8$ is "large"; Cohen, 1988), (2) pairwise averaging captures the harness's discrimination across the full model spectrum rather than only adjacent pairs, and (3) the pooled-variance denominator penalizes harnesses that inflate score variance without improving separation. Alternatives considered: $\eta^2$ or $\omega^2$ from the model-effect ANOVA (Fisher, 1925) provide a single omnibus measure of model-attributable variance. However, these compress the information into a single ratio and are sensitive to the number of models $K$. A pairwise approach retains the granularity of which model pairs are well-separated, enabling diagnosis of where the harness fails to discriminate. Signal detection $d'$ (Green & Swets, 1966) assumes univariate normal distributions per model, which may not hold for task-score distributions. The mean-pairwise-$d$ formulation was chosen as the best balance of interpretability, diagnostic value, and robustness.

Numerical example. Consider $K=3$ models with means $0.42, 0.58, 0.71$ and standard deviations $0.18, 0.20, 0.16$. Pairwise: $d_{12} = 0.84$, $d_{13} = 1.70$, $d_{23} = 0.72$, yielding $D(h) = 1.09$. This average exceeds Cohen's (1988) threshold of $0.8$ for a large effect, so the harness provides strong discrimination overall, although the pair $(m_2, m_3)$ at $d_{23} = 0.72$ falls just short of that threshold.
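The arithmetic of the numerical example can be reproduced in a few lines, using only the means and standard deviations stated above:

```python
import math
from itertools import combinations

means = [0.42, 0.58, 0.71]
sds = [0.18, 0.20, 0.16]


def cohens_d(i, j):
    """Cohen's d for models i and j with the equal-weight pooled SD of Eq. 28.5."""
    pooled_sd = math.sqrt(0.5 * (sds[i] ** 2 + sds[j] ** 2))
    return abs(means[i] - means[j]) / pooled_sd


pairs = list(combinations(range(len(means)), 2))
d_values = {pair: cohens_d(*pair) for pair in pairs}
D = sum(d_values.values()) / len(pairs)

for (i, j), d in d_values.items():
    print(f"d_{i + 1}{j + 1} = {d:.2f}")
print(f"D(h) = {D:.2f}")
```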

28.5.4 Efficiency and Coverage

▸ Framework. Efficiency is a constraint (Eq. 28.1), not an objective component. Total cost decomposes as:

$$\text{Cost}(h) = K \cdot N \cdot L \cdot \bigl(\tau_{\text{env}}(h) + \tau_{\text{exec}}(h) + \tau_{\text{score}}(h)\bigr) \cdot c_{\text{unit}} \tag{28.6}$$

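where $\tau_{\text{env}}$, $\tau_{\text{exec}}$, and $\tau_{\text{score}}$ are per-evaluation times for provisioning, execution, and scoring, and $c_{\text{unit}}$ is the cost rate (USD/hour), so the three times must be expressed in hours. $\text{Cost}(h)$ is the quantity bounded by $B$ in Eq. 28.1, with $\tau(h)$ read as compute time summed over all $K \cdot N \cdot L$ evaluations and $c(h) = c_{\text{unit}}$. This decomposition enables targeted optimization: if provisioning dominates, layer caching is high-leverage; if scoring dominates, batching and caching are indicated.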
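A small sketch of Eq. 28.6 with a per-stage breakdown follows. The timings, evaluation counts, and hourly rate are hypothetical, and the function names are illustrative.

```python
def harness_cost_usd(K, N, L, t_env_s, t_exec_s, t_score_s, usd_per_hour):
    """Total cost per Eq. 28.6; stage times are given in seconds and converted to hours."""
    per_eval_hours = (t_env_s + t_exec_s + t_score_s) / 3600.0
    return K * N * L * per_eval_hours * usd_per_hour


def stage_shares(t_env_s, t_exec_s, t_score_s):
    """Fraction of per-evaluation time spent in each stage."""
    total = t_env_s + t_exec_s + t_score_s
    return {
        "env": t_env_s / total,
        "exec": t_exec_s / total,
        "score": t_score_s / total,
    }


# Hypothetical run: 5 models, 100 tasks, 3 replications, 300 s per evaluation, 2 USD/hour.
cost = harness_cost_usd(K=5, N=100, L=3, t_env_s=45, t_exec_s=240, t_score_s=15,
                        usd_per_hour=2.0)
shares = stage_shares(45, 240, 15)
print(cost, shares)
```

Here 1,500 evaluations at 300 seconds each amount to 125 compute hours. Provisioning accounts for 15% of per-evaluation time, so even eliminating it entirely would cut cost by at most 15%.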

Coverage $C(h)$ measures the fraction of evaluations that produce valid, scoreable results:

$$C(h) = \frac{1}{KNL} \sum_{i,j,\ell} \mathbb{1}\bigl[\text{valid}(s_{ij\ell}(h))\bigr] \tag{28.7}$$

Critically, agent failures (the agent could not complete the task) should score as zero and count as valid; harness failures (environment crash, scorer error, timeout before agent acts) produce missing data and count as invalid. This is analogous to the distinction between a failed test and a test infrastructure error (Luo et al., 2014).
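The distinction between agent failures and harness failures can be encoded directly in the result type. The sketch below uses a hypothetical three-valued outcome for binary pass/fail tasks; the enum and function names are assumptions.

```python
from enum import Enum

import numpy as np


class Outcome(Enum):
    PASS = "pass"                    # agent solved the task: valid, score 1
    AGENT_FAIL = "agent_fail"        # agent ran but failed: valid, score 0
    HARNESS_ERROR = "harness_error"  # environment or scorer failed: invalid, no score


def to_scores_and_mask(outcomes):
    """Map outcomes to (scores, validity_mask); harness errors become NaN and invalid."""
    scores = np.full(len(outcomes), np.nan)
    valid = np.zeros(len(outcomes), dtype=bool)
    for idx, outcome in enumerate(outcomes):
        if outcome is Outcome.PASS:
            scores[idx], valid[idx] = 1.0, True
        elif outcome is Outcome.AGENT_FAIL:
            scores[idx], valid[idx] = 0.0, True
    return scores, valid


outcomes = [Outcome.PASS, Outcome.AGENT_FAIL, Outcome.HARNESS_ERROR, Outcome.PASS]
scores, valid = to_scores_and_mask(outcomes)
coverage = float(valid.mean())            # C(h) per Eq. 28.7
mean_valid_score = float(np.nanmean(scores))
print(coverage, mean_valid_score)
```

Keeping harness errors as missing values rather than zeros prevents infrastructure failures from being counted against the agent, while the validity mask still exposes them through $C(h)$.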

28.5.5 Scoring Calibration (Diagnostic Metric)

▸ Framework. Scoring calibration is treated as a diagnostic metric rather than an optimization component in Eq. 28.2. The reason: computing ScorerAUC requires curated reference solutions, which are not always available during optimization. When reference solutions exist, scorer quality is measured as:

$$\text{ScorerAUC}(h) = \frac{1}{N} \sum_{j=1}^{N} \text{AUC}\bigl(\{(r_{jp}, y_{jp})\}_{p=1}^{P_j},\; \text{score}(\cdot \mid t_j, h)\bigr) \tag{28.8}$$

where $\{r_{j1}, \ldots, r_{jP_j}\}$ are reference solutions for task $t_j$ with validity labels $y_{jp} \in \{0, 1\}$. The per-task AUC is defined only when the reference set contains at least one valid and one invalid solution. ScorerAUC can serve as a constraint ($\text{ScorerAUC}(h) \geq \text{AUC}_{\min}$ for a chosen floor $\text{AUC}_{\min}$, ensuring the optimizer does not adopt a degenerate scorer) or as a post-hoc diagnostic. It is excluded from the scalarized objective because (a) reference-solution availability is inconsistent across tasks, and (b) including it in the objective creates a dependency on the curated set that may not generalize.
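One way to compute Eq. 28.8 without external dependencies is through the Mann-Whitney formulation of AUC, sketched below. Skipping tasks whose reference set lacks either a valid or an invalid solution is an assumption of this sketch (Eq. 28.8 as written averages over all $N$ tasks), and the scores and labels are invented.

```python
import numpy as np


def auc(scores, labels):
    """AUC as P(valid scored above invalid) + 0.5 * P(tie); NaN if a class is empty."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]          # all valid-vs-invalid pairs
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def scorer_auc(per_task):
    """Mean per-task AUC over tasks where AUC is defined (assumption: others skipped)."""
    values = [auc(s, y) for s, y in per_task]
    values = [v for v in values if not np.isnan(v)]
    return float(np.mean(values)) if values else float("nan")


# Hypothetical scorer outputs and validity labels for two tasks.
per_task = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),   # scorer separates valid from invalid
    ([0.9, 0.2, 0.5, 0.1], [1, 1, 0, 0]),   # one valid solution is under-scored
]
print(scorer_auc(per_task))
```

In the second task the scorer ranks a valid reference solution below an invalid one, which is the kind of solution-plurality failure the diagnostic is meant to surface.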

28.5.6 Computational Implementation

▸ Framework. The following reference implementation computes the fitness components from Eqs. 28.3–28.7; ScorerAUC (Eq. 28.8) appears only as a stored diagnostic field, since computing it requires reference solutions. All code in this chapter is the survey author's construction: reference pseudocode, not repository excerpts.

# Reference implementation: harness fitness components (Eqs. 28.3–28.7).
# Author-constructed pseudocode — NOT from any repository.

from __future__ import annotations

import math
from dataclasses import dataclass

import numpy as np
from numpy.typing import NDArray


@dataclass(frozen=True)
class HarnessFitness:
    """Composite fitness of a harness configuration.

    Three quality components (R, D, C) plus diagnostic
    scorer calibration and efficiency constraint.
    """
    reliability: float       # R(h): ICC(2,1), Eq. 28.4
    discriminability: float  # D(h): mean pairwise Cohen's d, Eq. 28.5
    coverage: float          # C(h): valid-run fraction, Eq. 28.7
    scorer_auc: float        # Diagnostic only (Eq. 28.8); NaN if unavailable
    cost_usd: float          # τ(h)·c(h), Eq. 28.6

    def scalarized(
        self,
        alpha: float = 0.40,
        beta: float = 0.35,
        gamma: float = 0.25,
    ) -> float:
        """Weighted quality score F(h) per Eq. 28.2.

        Efficiency excluded (constraint). ScorerAUC excluded
        (diagnostic). D normalized to [0,1] via clip-and-scale.
        """
        d_norm = min(self.discriminability, 3.0) / 3.0
        return alpha * self.reliability + beta * d_norm + gamma * self.coverage


def compute_reliability_gtheory(
    scores: NDArray[np.float64],  # shape (K, N, L)
) -> tuple[float, dict[str, float]]:
    """ICC(2,1) for model rankings via G-theory (Eq. 28.4).

    Implements ANOVA-based variance component estimation for the
    three-facet crossed model (Eq. 28.3) per Brennan (2001, Ch. 4).
    Balanced design assumed; for missing data, use REML.

    Returns (icc, variance_components_dict).
    """
    K, N, L = scores.shape
    grand = np.mean(scores)

    # Marginal means
    m_model = np.mean(scores, axis=(1, 2))   # (K,)
    m_task  = np.mean(scores, axis=(0, 2))   # (N,)
    m_run   = np.mean(scores, axis=(0, 1))   # (L,)
    m_mt    = np.mean(scores, axis=2)        # (K, N)
    m_mr    = np.mean(scores, axis=1)        # (K, L)
    m_tr    = np.mean(scores, axis=0)        # (N, L)

    # Sums of squares (Type III ANOVA)
    ss_m  = N * L * np.sum((m_model - grand) ** 2)
    ss_t  = K * L * np.sum((m_task - grand) ** 2)
    ss_r  = K * N * np.sum((m_run - grand) ** 2)
    ss_mt = L * np.sum((m_mt - m_model[:, None] - m_task[None, :] + grand) ** 2)
    ss_mr = N * np.sum((m_mr - m_model[:, None] - m_run[None, :] + grand) ** 2)
    ss_tr = K * np.sum((m_tr - m_task[:, None] - m_run[None, :] + grand) ** 2)
    ss_e  = np.sum((scores - grand) ** 2) - ss_m - ss_t - ss_r - ss_mt - ss_mr - ss_tr

    # Mean squares
    ms_m  = ss_m  / max(K - 1, 1)
    ms_t  = ss_t  / max(N - 1, 1)
    ms_r  = ss_r  / max(L - 1, 1)
    ms_mt = ss_mt / max((K - 1) * (N - 1), 1)
    ms_mr = ss_mr / max((K - 1) * (L - 1), 1)
    ms_tr = ss_tr / max((N - 1) * (L - 1), 1)
    ms_e  = ss_e  / max((K - 1) * (N - 1) * (L - 1), 1)

    # Variance components (negative estimates truncated to 0; Brennan, 2001)
    sig2_e  = ms_e
    sig2_mt = max(0.0, (ms_mt - ms_e) / L)
    sig2_mr = max(0.0, (ms_mr - ms_e) / N)
    sig2_tr = max(0.0, (ms_tr - ms_e) / K)
    sig2_m  = max(0.0, (ms_m - ms_mt - ms_mr + ms_e) / (N * L))
    sig2_t  = max(0.0, (ms_t - ms_mt - ms_tr + ms_e) / (K * L))
    sig2_r  = max(0.0, (ms_r - ms_mr - ms_tr + ms_e) / (K * N))

    components = {
        "sigma2_model": sig2_m, "sigma2_task": sig2_t,
        "sigma2_run": sig2_r,   "sigma2_model_task": sig2_mt,
        "sigma2_model_run": sig2_mr, "sigma2_task_run": sig2_tr,
        "sigma2_residual": sig2_e,
    }

    # ICC(2,1) for task-averaged model scores — Eq. 28.4 derivation
    signal = sig2_m + sig2_mt / N
    noise  = sig2_r + sig2_mr + sig2_tr / N + sig2_e / N
    denom  = signal + noise
    icc = signal / denom if denom > 1e-12 else 1.0

    return icc, components


def compute_discriminability(
    scores: NDArray[np.float64],  # shape (K, N, L)
) -> float:
    """Mean pairwise Cohen's d (Eq. 28.5, design choice — see §28.5.3)."""
    K, N, L = scores.shape
    model_means = np.mean(scores, axis=(1, 2))
    model_vars = np.var(scores.reshape(K, N * L), axis=1, ddof=1)

    total_d, n_pairs = 0.0, 0
    for i in range(K):
        for ip in range(i + 1, K):
            pooled_sd = math.sqrt(0.5 * (model_vars[i] + model_vars[ip]))
            if pooled_sd > 1e-12:
                total_d += abs(model_means[i] - model_means[ip]) / pooled_sd
            n_pairs += 1
    return total_d / max(n_pairs, 1)


def compute_coverage(validity_mask: NDArray[np.bool_]) -> float:
    """Valid-run fraction C(h), Eq. 28.7."""
    return float(np.mean(validity_mask))

Listing 28.1: ▸ Framework — Reference implementation of harness fitness components (Eqs. 28.3–28.7). Author-constructed pseudocode. The compute_reliability_gtheory function implements the full three-facet ANOVA decomposition from generalizability theory (Brennan, 2001). For unbalanced designs, substitute REML estimation via statsmodels.regression.mixed_linear_model.MixedLM.

28.6 The Meta-Optimization Loop

28.6.1 Optimization Procedure

▸ Framework. The meta-optimization loop iteratively proposes harness configurations, executes (subsampled) evaluations, computes harness fitness, and updates the search.

Figure 28.2: ▸ Framework. The meta-harness optimization loop. Steps iterate until the Pareto front stabilizes or the meta-optimization budget is exhausted.

The search algorithm for Step 1 (PROPOSE) is a key design choice:

Bayesian optimization (Snoek et al., 2012): Models $F(h)$ as a Gaussian process. Suited for expensive $F$ with $\leq$20 parameters.
Evolutionary search: CMA-ES (Hansen, 2006) for continuous parameters; more scalable to high-dimensional or mixed-type spaces.
Multi-fidelity methods (Kandasamy et al., 2017; Li et al., 2017): Use cheap low-fidelity evaluations to screen candidates. Hyperband (Li et al., 2017) is canonical.
Multi-objective extensions: NSGA-II (Deb et al., 2002) or ParEGO (Knowles, 2006) for directly computing the Pareto front in Eq. 28.1.

# Reference implementation: meta-optimization loop (§28.6.1, Fig. 28.2).
# Author-constructed pseudocode — NOT from any repository.

from __future__ import annotations

import random
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class HarnessConfig:
    """Harness configuration vector (§28.4)."""
    env_params: dict[str, Any]
    dispatch_params: dict[str, Any]
    capture_params: dict[str, Any]
    scoring_params: dict[str, Any]


@dataclass
class OptimizationResult:
    best_config: HarnessConfig
    best_fitness: HarnessFitness
    pareto_front: list[tuple[HarnessConfig, HarnessFitness]]
    history: list[tuple[HarnessConfig, HarnessFitness]]
    iterations_used: int
    converged: bool
    total_cost_usd: float


def meta_harness_optimize(
    task_suite: list[Any],
    models: list[Any],
    evaluate_fn: Callable,   # (config, tasks, models, n_reps) -> scores
    fitness_fn: Callable,    # (scores, cost) -> HarnessFitness
    propose_fn: Callable,    # (history) -> HarnessConfig
    *,
    budget_usd: float = 10_000.0,
    max_iterations: int = 100,
    subsample_fraction: float = 0.3,
    min_replications: int = 3,
    patience: int = 10,
    convergence_tol: float = 0.01,
) -> OptimizationResult:
    """Meta-harness optimization loop (§28.6.1, Fig. 28.2).

    PROPOSE → EXECUTE → MEASURE → UPDATE → CONVERGE?
    Search strategy injected via propose_fn (BO, CMA-ES, NSGA-II, etc.).
    """
    history: list[tuple[HarnessConfig, HarnessFitness]] = []
    pareto: list[tuple[HarnessConfig, HarnessFitness]] = []
    best_scalar, best_cfg, best_fit = float("-inf"), None, None
    total_cost, stagnation = 0.0, 0

    for it in range(max_iterations):
        if total_cost >= budget_usd:
            break

        config = propose_fn(history)                              # Step 1
        n_sub = max(1, int(len(task_suite) * subsample_fraction))
        tasks = random.sample(task_suite, k=n_sub)
        scores = evaluate_fn(config, tasks, models, min_replications)  # Step 2
        # Cost scales with tasks × models × replications (cf. Eq. 28.10).
        cost = (n_sub * len(models) * min_replications
                * config.env_params.get("cost_per_eval_usd", 0.50))
        total_cost += cost
        fitness = fitness_fn(scores, cost)                        # Step 3
        history.append((config, fitness))
        pareto = _update_pareto(pareto, config, fitness)

        s = fitness.scalarized()                                  # Step 4
        if s > best_scalar + convergence_tol:
            best_scalar, best_cfg, best_fit, stagnation = s, config, fitness, 0
        else:
            stagnation += 1
        if stagnation >= patience:                                # Step 5
            break

    return OptimizationResult(
        best_config=best_cfg, best_fitness=best_fit,
        pareto_front=pareto, history=history,
        iterations_used=len(history),  # completed evaluations; safe if the loop never ran
        converged=(stagnation >= patience), total_cost_usd=total_cost,
    )


def _dominates(a, b):
    """a is no worse than b on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def _update_pareto(front, config, fitness):
    """Non-dominated set over (R, D, C, −Cost)."""
    obj = (fitness.reliability, fitness.discriminability,
           fitness.coverage, -fitness.cost_usd)
    kept = []
    for c, f in front:
        e = (f.reliability, f.discriminability, f.coverage, -f.cost_usd)
        if e == obj or _dominates(e, obj):
            return front          # candidate is a duplicate or is dominated
        if not _dominates(obj, e):
            kept.append((c, f))   # incumbent survives the candidate
    kept.append((config, fitness))
    return kept

Listing 28.2: ▸ Framework — Reference implementation of the meta-optimization loop (Figure 28.2) with Pareto front maintenance. Author-constructed pseudocode.
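The non-dominated-set update behind the Pareto front can be illustrated on bare objective tuples. The following is a minimal sketch, assuming every objective is oriented so that larger is better (cost enters negated); the names `dominates` and `pareto_insert` are illustrative and are not part of Listing 28.2.

```python
from __future__ import annotations

Objectives = tuple[float, ...]  # e.g. (R, D, C, -cost); larger is better


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b: no worse on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_insert(front: list[Objectives], new: Objectives) -> list[Objectives]:
    """Return the non-dominated set after offering `new` to `front`."""
    if any(old == new or dominates(old, new) for old in front):
        return list(front)  # duplicate or dominated: front is unchanged
    return [old for old in front if not dominates(new, old)] + [new]


front: list[Objectives] = []
for point in [
    (0.80, 1.0, 0.95, -100.0),   # first point always enters
    (0.85, 1.1, 0.95, -100.0),   # dominates the first point and replaces it
    (0.90, 0.9, 0.97, -250.0),   # trade-off: kept alongside the second
    (0.85, 1.1, 0.95, -100.0),   # exact duplicate: ignored
]:
    front = pareto_insert(front, point)
```

Using strict dominance (better on at least one objective) keeps a point from being evicted by an exact duplicate of itself, which matters when the same configuration is proposed twice.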

28.6.2 Subsampling for Tractability

▸ Principle. A full evaluation pass is prohibitively expensive as an inner loop. Standard strategies:

Task subsampling: Stratification by difficulty reduces variance relative to simple random subsampling at the same fraction: $\text{Var}_{\text{strat}}[\hat{F}_f] \approx (1 - \rho_\text{strat}) \cdot \text{Var}_{\text{SRS}}[\hat{F}_f]$, where $\rho_\text{strat}$ is the between-stratum fraction of variance (Cochran, 1977).
Model subsampling: A small reference set (weak, medium, strong) suffices for discriminability estimation.
Early stopping: Terminate if harness failures exceed a threshold, analogous to sequential rejection (Wald, 1945).
Progressive refinement: Start coarse, increase precision for promising configs. Hyperband (Li et al., 2017) is the canonical instantiation.

Under simple random subsampling with fraction $f$, fitness estimate variance scales as $\text{Var}[\hat{F}_f(h)] \approx \text{Var}[\hat{F}_1(h)] / f$, so a 30% subsample increases variance by roughly $3.3\times$.
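A proportional-allocation stratified subsample, together with the two approximate variance factors discussed above, can be sketched as follows. This assumes each task already carries a difficulty label; the function names and the `stratum_of` mapping are hypothetical.

```python
from __future__ import annotations

import random
from collections import defaultdict


def stratified_subsample(
    tasks: list[str],
    stratum_of: dict[str, str],   # task id -> difficulty stratum (assumed available)
    fraction: float,
    seed: int = 0,
) -> list[str]:
    """Draw about `fraction` of the tasks from every stratum (at least one each)."""
    rng = random.Random(seed)
    by_stratum: dict[str, list[str]] = defaultdict(list)
    for task in tasks:
        by_stratum[stratum_of[task]].append(task)
    chosen: list[str] = []
    for name in sorted(by_stratum):            # sorted for reproducibility
        members = by_stratum[name]
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return chosen


def variance_inflation(fraction: float, rho_strat: float = 0.0) -> float:
    """Approximate Var[F_f] / Var[F_1]: 1/f under SRS, scaled by (1 - rho) if stratified."""
    return (1.0 - rho_strat) / fraction


tasks = [f"task{i:03d}" for i in range(100)]
strata = {t: ("easy", "medium", "hard")[i % 3] for i, t in enumerate(tasks)}
subset = stratified_subsample(tasks, strata, fraction=0.3)
```

Both factors are first-order approximations that ignore the finite-population correction, so they are best read as planning estimates rather than guarantees.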

28.7 Empirical Grounding: A Worked Demonstration

▸ Framework. This section applies the full framework (§28.5) to a concrete dataset, demonstrating that all proposed metrics are operational—computable, interpretable, and diagnostic. We present (a) a self-contained demonstration using synthetic score tensors calibrated from published harness-variance data, and (b) a reproducible recipe for applying the framework to the public lm-evaluation-harness project.

28.7.1 Scenario and Calibration

We simulate a realistic evaluation scenario: $K = 5$ models evaluated on $N = 50$ tasks across $L = 5$ independent runs. The variance components are calibrated from two published findings:

Biderman et al. (2024) report 2–5 percentage-point score discrepancies across harness implementations, implying non-trivial run-related variance even under ostensibly identical conditions.
Parry et al. (2021) report 12–28% of test failures attributable to infrastructure non-determinism, which, applied to evaluation scores, suggests $\sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}$ constitutes a meaningful fraction of total variance.

We set the true variance components as follows, with calibration rationale:

Component	Symbol	True Value	Calibration Rationale
Model (signal)	$\sigma^2_m$	0.030	SD ≈ 0.17, i.e., a ~17pp spread in model means
Task	$\sigma^2_t$	0.050	Large task difficulty spread, typical for heterogeneous benchmarks
Run (harness noise)	$\sigma^2_r$	0.002	Global run-to-run shift ≈ 4.5pp; reflects container startup variance
Model × Task	$\sigma^2_{mt}$	0.020	Models have differential strengths across task types
Model × Run	$\sigma^2_{mr}$	0.003	Harness noise differentially affects models (e.g., timeout sensitivity)
Task × Run	$\sigma^2_{tr}$	0.004	Some tasks are more sensitive to run-level variance (e.g., timing-dependent)
Residual	$\sigma^2_\epsilon$	0.010	Unexplained per-observation variability

Table 28.2: True variance components for the synthetic demonstration. Total variance = 0.119. Run-related components ($\sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}$) = 0.009, constituting 7.6% of total variance. This sits below the 12–28% infrastructure-noise range of Parry et al. (2021), a conservative setting given that their figure concerns binary pass/fail test outcomes rather than continuous scores.
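The arithmetic behind the table can be checked directly. The short sketch below recomputes the total, the run-related share, and the standard deviations quoted in the rationale column; the dictionary keys are shorthand chosen for this example.

```python
import math

# True variance components from Table 28.2.
true_vc = {
    "model": 0.030, "task": 0.050, "run": 0.002,
    "model_task": 0.020, "model_run": 0.003, "task_run": 0.004,
    "residual": 0.010,
}

total = sum(true_vc.values())
run_related = true_vc["run"] + true_vc["model_run"] + true_vc["task_run"]
run_share_pct = 100.0 * run_related / total

# Standard deviations in percentage points (score scale: 1.0 = 100pp).
sd_pp = {name: 100.0 * math.sqrt(v) for name, v in true_vc.items()}

print(f"total variance    = {total:.3f}")
print(f"run-related share = {run_share_pct:.1f}%")
print(f"model SD, run SD  = {sd_pp['model']:.1f}pp, {sd_pp['run']:.1f}pp")
```

This prints a total of 0.119, a run-related share of 7.6%, and standard deviations of 17.3pp (model) and 4.5pp (run).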

28.7.2 Full Worked Analysis

# Empirical grounding exercise: complete worked demonstration.
# Requires the functions of Listing 28.1 and Listing 28.3 to be in scope;
# with those defined, this code reproduces the results in §28.7.

from __future__ import annotations

import numpy as np
from scipy import stats


def generate_synthetic_scores(
    K: int = 5, N: int = 50, L: int = 5, seed: int = 2026,
) -> tuple[np.ndarray, dict[str, float]]:
    """Generate a synthetic score tensor from the G-theory model (Eq. 28.3).

    Variance components calibrated from Biderman et al. (2024) and
    Parry et al. (2021) — see Table 28.2.
    """
    rng = np.random.default_rng(seed)

    # True variance components
    true_vc = {
        "sigma2_model": 0.030, "sigma2_task": 0.050,
        "sigma2_run": 0.002,   "sigma2_model_task": 0.020,
        "sigma2_model_run": 0.003, "sigma2_task_run": 0.004,
        "sigma2_residual": 0.010,
    }
    mu = 0.55  # grand mean (plausible pass rate)

    alpha = rng.normal(0, np.sqrt(true_vc["sigma2_model"]), K)
    beta  = rng.normal(0, np.sqrt(true_vc["sigma2_task"]), N)
    gamma = rng.normal(0, np.sqrt(true_vc["sigma2_run"]), L)
    ab = rng.normal(0, np.sqrt(true_vc["sigma2_model_task"]), (K, N))
    ag = rng.normal(0, np.sqrt(true_vc["sigma2_model_run"]), (K, L))
    bg = rng.normal(0, np.sqrt(true_vc["sigma2_task_run"]), (N, L))
    eps = rng.normal(0, np.sqrt(true_vc["sigma2_residual"]), (K, N, L))

    scores = (mu + alpha[:, None, None] + beta[None, :, None]
              + gamma[None, None, :] + ab[:, :, None]
              + ag[:, None, :] + bg[None, :, :] + eps)
    scores = np.clip(scores, 0.0, 1.0)  # bound to [0, 1]

    return scores, true_vc


def full_analysis(scores: np.ndarray) -> dict:
    """Apply the complete harness-quality framework to a score tensor."""
    K, N, L = scores.shape
    results = {}

    # ── 1. Variance decomposition and ICC (Eq. 28.4) ──
    icc, vc = compute_reliability_gtheory(scores)  # From Listing 28.1
    results["icc"] = icc
    results["variance_components"] = vc

    # Percentage of total variance per component
    total_var = sum(vc.values())
    results["pct_variance"] = {k: v / total_var * 100 for k, v in vc.items()}

    # Run-related noise fraction
    run_noise = vc["sigma2_run"] + vc["sigma2_model_run"] + vc["sigma2_task_run"]
    results["run_noise_pct"] = run_noise / total_var * 100

    # ── 2. ICC confidence interval via task bootstrap, B = 2000 ──
    # (F-distribution alternative: McGraw & Wong, 1996)
    rng = np.random.default_rng(42)
    boot_iccs = []
    for _ in range(2000):
        tidx = rng.choice(N, size=N, replace=True)
        boot_icc, _ = compute_reliability_gtheory(scores[:, tidx, :])
        boot_iccs.append(boot_icc)
    boot_arr = np.array(boot_iccs)
    results["icc_ci95"] = (float(np.percentile(boot_arr, 2.5)),
                           float(np.percentile(boot_arr, 97.5)))

    # ── 3. Discriminability (Eq. 28.5) ──
    D = compute_discriminability(scores)            # From Listing 28.1
    results["discriminability"] = D

    # Pairwise Cohen's d matrix
    model_means = np.mean(scores, axis=(1, 2))
    model_vars = np.var(scores.reshape(K, N * L), axis=1, ddof=1)
    d_matrix = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            pooled = np.sqrt(0.5 * (model_vars[i] + model_vars[j]))
            d_matrix[i, j] = d_matrix[j, i] = abs(
                model_means[i] - model_means[j]) / max(pooled, 1e-12)
    results["d_matrix"] = d_matrix
    results["model_means"] = model_means

    # ── 4. Coverage (Eq. 28.7) ──
    # Simulate 2% harness failure rate (random missing data)
    rng2 = np.random.default_rng(99)
    validity = rng2.random((K, N, L)) > 0.02
    results["coverage"] = compute_coverage(validity)  # From Listing 28.1

    # ── 5. Ranking stability ──
    results["bootstrap_stability"] = ranking_stability_bootstrap(scores)
    results["split_half"] = split_half_reliability(scores)

    # ── 6. MDES (Eq. 28.9) ──
    sig2_within = (vc["sigma2_model_task"] + vc["sigma2_model_run"]
                   + vc["sigma2_task_run"] + vc["sigma2_residual"])
    mdes = 2.80 * np.sqrt(2 * sig2_within / (N * L))
    results["mdes"] = mdes

    return results


# ── Execute the demonstration ──
scores, true_vc = generate_synthetic_scores()
results = full_analysis(scores)

Running this code with seed 2026 produces the results summarized in Figure 28.3.

Figure 28.3: ▸ Framework. Complete results from the empirical grounding exercise. The synthetic score tensor ($K=5, N=50, L=5$) was analyzed using all framework metrics. Approximate values shown; exact values are obtained by running the §28.7.2 code (together with Listings 28.1 and 28.3) with seed 2026.

28.7.3 Interpretation and Diagnostic Value

The demonstration reveals several actionable diagnostics:

The harness is reliable but not excellent. ICC = 0.85 places it in the "good" range (Cicchetti, 1994) but below the 0.90 threshold for "excellent." The run-related variance components ($\sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr} \approx 7.6\%$) are the optimization target. If meta-optimization could reduce these by half (plausible given Lam et al., 2020's finding that 74% of flaky tests can be deflaked), ICC would rise to approximately 0.92.
Adjacent-rank discrimination is the weak point. While $D(h) = 1.12$ is large overall, the minimum pairwise $d \approx 0.41$ means models ranked 3rd and 4th are only moderately separated. The 8.2% top-3 inversion rate under bootstrap resampling confirms this vulnerability.
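The MDES sets the resolution floor. Evaluating Eq. 28.9 at the true components of Table 28.2 ($\sigma^2_{\text{within}} = 0.037$, $NL = 250$) gives MDES ≈ 0.048, so model-mean differences smaller than ~5 percentage points cannot be reliably detected. Within $\sigma^2_{\text{within}}$, only $\sigma^2_{mr}$ and $\sigma^2_{tr}$ are under harness control: halving them lowers MDES to ≈ 0.046 (about 5%), and eliminating them entirely lowers it to ≈ 0.043 (about 10%).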

These diagnostics illustrate the framework's practical value: each metric points to a specific aspect of harness quality and suggests specific intervention strategies. The G-theory decomposition is especially informative because it separates harness-controllable noise ($\sigma^2_r, \sigma^2_{mr}, \sigma^2_{tr}$) from inherent variability ($\sigma^2_t, \sigma^2_{mt}, \sigma^2_\epsilon$), preventing the optimizer from pursuing impossible targets.
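The ICC figures quoted above can be reproduced from the true components of Table 28.2. The sketch below uses the same signal/noise split as Listing 28.1, applied to population values rather than sample estimates; the helper names are illustrative.

```python
from __future__ import annotations


def icc_task_averaged(vc: dict[str, float], n_tasks: int) -> float:
    """ICC for task-averaged model scores (signal/noise split of Listing 28.1)."""
    signal = vc["model"] + vc["model_task"] / n_tasks
    noise = (vc["run"] + vc["model_run"]
             + vc["task_run"] / n_tasks + vc["residual"] / n_tasks)
    return signal / (signal + noise)


def scale_run_noise(vc: dict[str, float], factor: float) -> dict[str, float]:
    """Copy of vc with the three run-related components multiplied by `factor`."""
    scaled = dict(vc)
    for key in ("run", "model_run", "task_run"):
        scaled[key] = vc[key] * factor
    return scaled


true_vc = {
    "model": 0.030, "task": 0.050, "run": 0.002,
    "model_task": 0.020, "model_run": 0.003, "task_run": 0.004,
    "residual": 0.010,
}
icc_baseline = icc_task_averaged(true_vc, n_tasks=50)
icc_halved = icc_task_averaged(scale_run_noise(true_vc, 0.5), n_tasks=50)
```

Here `icc_baseline` is about 0.852 and `icc_halved` about 0.917, matching the 0.85 and ~0.92 figures in the first diagnostic. Estimates from a finite score tensor will scatter around these values.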

28.7.4 Application Recipe for lm-evaluation-harness

The framework can be applied to any public evaluation harness. Here is a concrete recipe for EleutherAI's lm-evaluation-harness (Biderman et al., 2024), selected because it is (a) widely used, (b) open source, and (c) documented to exhibit inter-implementation score discrepancies of 2–5pp:

Step 1: Data collection. Select $K \geq 3$ models (e.g., Llama-3-8B, Mistral-7B, Phi-3-mini) and $N$ tasks from a single benchmark (e.g., MMLU, 57 subjects). Run each model $L \geq 5$ times with identical configuration but different random seeds for any stochastic components (few-shot example ordering, batch composition). Record per-task accuracy for each (model, task, run) triple into the score tensor $\mathbf{S} \in \mathbb{R}^{K \times N \times L}$.

Step 2: Variance decomposition. Apply compute_reliability_gtheory (Listing 28.1) to estimate $\sigma^2_m, \sigma^2_t, \sigma^2_r, \sigma^2_{mt}, \sigma^2_{mr}, \sigma^2_{tr}, \sigma^2_\epsilon$.

Step 3: Metric computation. Compute ICC(2,1), $D(h)$, $C(h)$ (validity mask: True unless the harness crashes or returns NaN), MDES (Eq. 28.9), and ranking stability (bootstrap + split-half). Report all metrics with 95% CIs.

Step 4: Diagnosis. If $\sigma^2_r$ is large, investigate global run-level factors (GPU scheduling, API latency variability). If $\sigma^2_{mr}$ is large, investigate model-specific sensitivity (some models may be more affected by prompt formatting or timeout settings). If $\sigma^2_{tr}$ is large, investigate task-specific flakiness (ambiguous scoring criteria, regex sensitivity).

Expected findings. For lm-evaluation-harness with deterministic greedy decoding and fixed few-shot examples, we expect $\sigma^2_r \approx 0$ (near-perfect rerun determinism on the same hardware). The primary variance source would be $\sigma^2_{mt}$ (model×task interaction) and $\sigma^2_t$ (task difficulty). If non-trivial $\sigma^2_r$ is observed, it likely indicates hardware-induced non-determinism (GPU floating-point variability) or batching effects—both actionable findings. Biderman et al. (2024)'s 2–5pp discrepancies were observed between different harness implementations; within a single harness with pinned configuration, run-to-run variance should be much smaller, providing a useful lower bound for the run-related components.
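Step 1 of the recipe can be sketched as follows, assuming per-run results have already been exported as (model, task, run, accuracy) records. That record layout, the placeholder model and task names, and the function `build_score_tensor` are illustrative assumptions; none of them is part of lm-evaluation-harness.

```python
from __future__ import annotations

import numpy as np


def build_score_tensor(
    records: list[tuple[str, str, int, float]],  # (model, task, run index, accuracy)
    models: list[str],
    tasks: list[str],
    n_runs: int,
) -> tuple[np.ndarray, np.ndarray]:
    """Return (scores, validity): scores[k, n, l] plus a bool mask of usable cells."""
    scores = np.full((len(models), len(tasks), n_runs), np.nan)
    m_idx = {m: i for i, m in enumerate(models)}
    t_idx = {t: j for j, t in enumerate(tasks)}
    for model, task, run, acc in records:
        scores[m_idx[model], t_idx[task], run] = acc
    validity = ~np.isnan(scores)   # missing or NaN cells count as harness failures
    return scores, validity


models = ["model-a", "model-b"]          # placeholder names
tasks = ["subject-1", "subject-2"]
records = [
    ("model-a", "subject-1", 0, 0.62), ("model-a", "subject-1", 1, 0.61),
    ("model-a", "subject-2", 0, 0.48), ("model-a", "subject-2", 1, 0.50),
    ("model-b", "subject-1", 0, 0.55), ("model-b", "subject-1", 1, 0.56),
    ("model-b", "subject-2", 0, 0.41),   # run 1 missing: treated as a harness failure
]
scores, validity = build_score_tensor(records, models, tasks, n_runs=2)
coverage = float(validity.mean())        # valid-run fraction C(h), Eq. 28.7
```

Because the ANOVA estimator in Listing 28.1 assumes a balanced design, a tensor with invalid cells would need those cells handled first (for example by REML, as the listing caption suggests) before Step 2.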

28.8 Model Assessment Methodology

28.8.1 Beyond Pass/Fail: Structured Assessment

▸ Principle. A well-optimized harness enables assessment dimensions beyond binary pass/fail, adapted from software testing (Beizer, 1990) and computerized adaptive testing (van der Linden & Glas, 2000):

Assessment Dimension	Description	Measurement Approach
Functional Correctness	Does the final state satisfy the task specification?	Deterministic state checks, test suites
Path Efficiency	Steps taken relative to expert solutions	Step count ratio: $\eta_j = n_{\text{agent}} / n_{\text{expert}}$ (1 = expert parity; larger = less efficient)
Error Recovery	Can the agent recover from mistakes?	Fraction of self-corrected errors in trajectory
Environment Awareness	Does the agent correctly read system state?	Accuracy of agent's state model vs. ground truth
Safety	Does the agent avoid destructive actions?	Violation count against policy constraints
Generalization	Performance transfer across task variants	Score variance across parameterized instances

28.8.2 Statistical Rigor in Agent Ranking

▸ Principle. High reliability makes smaller differences detectable. The minimum detectable effect size (MDES) for a paired comparison between two models is (Cohen, 1988):

$$\text{MDES} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sqrt{\frac{2\,\sigma^2_{\text{within}}}{NL}} \tag{28.9}$$

where $z_{1-\alpha/2} = 1.96$ (significance 0.05), $z_{1-\beta} = 0.84$ (power 0.80), and $\sigma^2_{\text{within}} = \sigma^2_{mt} + \sigma^2_{mr} + \sigma^2_{tr} + \sigma^2_\epsilon$ is the within-model variance. Harness optimization targets the run-related components ($\sigma^2_{mr}$, $\sigma^2_{tr}$). Because these are only part of $\sigma^2_{\text{within}}$, the attainable MDES reduction is bounded by their share of it; §28.7.3 works through the calibrated scenario.
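Eq. 28.9 is simple enough to evaluate directly. The sketch below does so at the true components of Table 28.2 ($N = 50$, $L = 5$); these are population values, so figures computed from estimated components of a particular score tensor will differ. The function name `mdes` is illustrative.

```python
import math


def mdes(sig2_within: float, n_tasks: int, n_runs: int,
         z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Minimum detectable effect size per Eq. 28.9."""
    return (z_alpha + z_beta) * math.sqrt(2.0 * sig2_within / (n_tasks * n_runs))


# True components from Table 28.2.
sig2_mt, sig2_mr, sig2_tr, sig2_e = 0.020, 0.003, 0.004, 0.010

baseline = mdes(sig2_mt + sig2_mr + sig2_tr + sig2_e, n_tasks=50, n_runs=5)
no_harness_noise = mdes(sig2_mt + sig2_e, n_tasks=50, n_runs=5)
relative_gain = 1.0 - no_harness_noise / baseline
```

The gap between `baseline` and `no_harness_noise` is the most that harness optimization alone can contribute under these assumptions; further gains require more tasks or more replications.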

28.9 Benchmark Design Principles

28.9.1 Task Design for Terminal Environments

▸ Principle. Task design principles drawn from Schlangen (2021), Raji et al. (2021), and Bowman & Dahl (2021):

Ecological validity. Tasks should reflect genuine scenarios. SWE-bench (Jimenez et al., 2024) draws from real GitHub issues; terminal tasks should similarly reflect authentic system administration and development workflows.

Difficulty calibration. Tasks should span the capability spectrum. In IRT (Lord, 1980), this corresponds to maximizing the test information function $I(\theta) = \sum_{j=1}^{N} I_j(\theta)$ across the target ability range.

Solution plurality. Terminal environments are prone to multiple valid solutions (e.g., chmod vs. setfacl vs. /etc/fstab). ScorerAUC (§28.5.5) explicitly measures whether the scoring function handles this correctly.
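The difficulty-calibration principle can be made concrete with a small test-information calculation. The sketch assumes a two-parameter logistic (2PL) item model, which the chapter does not specify; under 2PL the item information is $I_j(\theta) = a_j^2 P_j(\theta)(1 - P_j(\theta))$. Item parameters are invented for illustration.

```python
from __future__ import annotations

import math


def item_information(theta: float, a: float, b: float) -> float:
    """2PL item information: a^2 * P * (1 - P), with P the success probability."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def total_information(theta: float, items: list[tuple[float, float]]) -> float:
    """Test information I(theta) as the sum of item informations."""
    return sum(item_information(theta, a, b) for a, b in items)


# Five items of equal discrimination: all at medium difficulty vs. spread out.
clustered = [(1.0, 0.0)] * 5
spread = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]

info_mid = (total_information(0.0, clustered), total_information(0.0, spread))
info_high = (total_information(2.0, clustered), total_information(2.0, spread))
```

The clustered set is more informative for mid-ability agents, while the spread set is more informative at high ability, which is the trade-off that difficulty calibration manages.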

28.9.2 Contamination and Leakage

▸ Principle. Harness-tunable contamination mitigations include: (1) task parameterization with variable names, paths, and ports drawn from a parameter space; (2) dynamic environments depending on runtime state; (3) holdout task pools; and (4) canary token detection (Golchin & Surdeanu, 2024; Oren et al., 2024; Sainz et al., 2023).

28.10 Evaluation Protocol

▸ Framework. This section specifies a concrete, executable evaluation protocol for validating any meta-harness system. The protocol converts the absence of empirical Meta-Harness results into an actionable survey contribution: a precise specification of what to measure, how many replications to run, what table schema to report, and what statistical tests to apply.

Figure 28.4: ▸ Framework. Five-phase evaluation protocol with required reporting schema. All statistical tests use Kendall's $\tau_b$ (standard, handles ties) rather than weighted variants; ICC CIs use task bootstrap or the F-distribution method of McGraw & Wong (1996), as specified in §28.10.1.

28.10.1 Phase 1: Baseline Characterization

Run the unoptimized (default) harness with $K \geq 5$ models spanning the capability spectrum, $N \geq 100$ tasks (stratified by category and difficulty), and $L \geq 5$ replications. Compute and report:

ICC(2,1) from the full G-theory decomposition (Eq. 28.4), with 95% CI via task-bootstrap ($B = 2000$) or the F-distribution method of McGraw & Wong (1996)
Mean pairwise Cohen's $d$ (Eq. 28.5), with bootstrap 95% CI
Coverage rate (Eq. 28.7), distinguishing agent failures from harness failures
Cost breakdown: total USD, per-task mean, time decomposition ($\tau_{\text{env}}, \tau_{\text{exec}}, \tau_{\text{score}}$)
Kendall's $\tau_b$ between model rankings across all $\binom{L}{2}$ replication pairs

$L \geq 5$ replications are the minimum needed for G-theory variance-component estimation with acceptable standard errors (Brennan, 2001, Ch. 3). If budget permits, $L = 10$ reduces standard error of the ICC estimate by approximately $\sqrt{2}$.
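The last Phase 1 item, $\tau_b$ across all $\binom{L}{2}$ replication pairs, can be sketched in pure Python. The statistic is written out here for transparency; scipy.stats.kendalltau, as used in Listing 28.3, computes the same quantity. The input layout `run_means[l][k]` (mean score of model k in replication l) is an assumption of this example.

```python
from __future__ import annotations

import math
from itertools import combinations


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall's tau-b between two equally long score vectors (tie-corrected)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


def replication_pair_taus(run_means: list[list[float]]) -> dict[tuple[int, int], float]:
    """tau-b for every unordered pair of replications."""
    return {(a, b): kendall_tau_b(run_means[a], run_means[b])
            for a, b in combinations(range(len(run_means)), 2)}


run_means = [
    [0.70, 0.60, 0.50, 0.40],
    [0.68, 0.62, 0.49, 0.41],
    [0.60, 0.66, 0.52, 0.39],   # models 0 and 1 swap places in this run
]
taus = replication_pair_taus(run_means)
mean_tau = sum(taus.values()) / len(taus)
```

Reporting the full distribution of pairwise values, not only the mean, shows whether instability comes from one outlier replication or is spread across all of them.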

28.10.2 Phase 2: Meta-Optimization

Run the optimization loop (§28.6) for $I \geq 50$ iterations with subsample fraction $f \in [0.2, 0.5]$ and $L_{\text{meta}} \geq 3$ replications per configuration. Report the fitness trajectory $\{F(h_t)\}_{t=1}^{I}$, the Pareto front at convergence, distinct configurations explored, and total meta-optimization cost.

28.10.3 Phase 3: Full Evaluation Under $h^*$

Re-run the full evaluation under $h^*$ with the same setup as Phase 1. The primary comparison is $\Delta R = R(h^*) - R(h_0)$ with 95% CI via paired bootstrap: resample the task set $B \geq 2000$ times, compute $R(h^*)$ and $R(h_0)$ on each resample, and report the percentile CI of the difference. Similarly report $\Delta D$, $\Delta C$, and $\Delta\text{Cost}$.
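The paired bootstrap described above can be sketched as follows. The key point is that each resample draws one set of task indices and applies it to both harnesses, so task-level variation cancels in the difference. `paired_bootstrap_delta` and `mean_score` are illustrative names; in practice the metric would be the ICC from Listing 28.1, and the synthetic arrays are stand-ins for real score tensors.

```python
from __future__ import annotations

from collections.abc import Callable

import numpy as np


def paired_bootstrap_delta(
    scores_opt: np.ndarray,    # (K, N, L) under h*
    scores_base: np.ndarray,   # (K, N, L) under h0, same tasks in the same order
    metric_fn: Callable[[np.ndarray], float],
    n_bootstrap: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Point estimate and 95% percentile CI of metric(h*) - metric(h0)."""
    rng = np.random.default_rng(seed)
    n_tasks = scores_opt.shape[1]
    point = metric_fn(scores_opt) - metric_fn(scores_base)
    deltas = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n_tasks, size=n_tasks)  # shared by both harnesses
        deltas[b] = (metric_fn(scores_opt[:, idx, :])
                     - metric_fn(scores_base[:, idx, :]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(point), float(lo), float(hi)


def mean_score(s: np.ndarray) -> float:
    """Stand-in metric for illustration; substitute R(h), D(h), or C(h)."""
    return float(np.mean(s))


rng = np.random.default_rng(1)
base = rng.random((3, 40, 5))   # synthetic stand-in scores
opt = base + 0.1                # uniform shift, purely for illustration
delta, ci_lo, ci_hi = paired_bootstrap_delta(opt, base, mean_score, n_bootstrap=200)
```

An interval that excludes zero supports a real difference between $h^*$ and $h_0$ for that metric; the same function can be reused for $\Delta D$ and $\Delta C$ by swapping `metric_fn`.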

28.10.4 Phase 4: Ablation Study

Optimize with each fitness component removed or disabled in turn, redistributing weights proportionally:

Condition	$\alpha$ (R)	$\beta$ (D)	$\gamma$ (C)	Budget	Expected Effect
Full $F(h)$	0.40	0.35	0.25	$\leq B$	Reference condition
$-$Reliability	0.00	0.58	0.42	$\leq B$	Lower $\tau_b$; unstable rankings
$-$Discriminability	0.62	0.00	0.38	$\leq B$	Compressed scores; more CI overlaps
$-$Coverage	0.53	0.47	0.00	$\leq B$	Higher harness failure rate
Unconstrained cost	0.40	0.35	0.25	No limit	Quality ceiling; highest cost

Table 28.4: Ablation conditions. Each runs with $L \geq 3$ replications on the full task suite. Report ICC, $D$, $C$, Cost, $\tau_b$, and MDES for each. If removing a component produces no measurable degradation, that component may not be contributing in the given search space.
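The proportional redistribution used for the ablation weights can be reproduced with a few lines. The function name `ablate_weights` and the dictionary keys are illustrative.

```python
from __future__ import annotations


def ablate_weights(weights: dict[str, float], drop: str) -> dict[str, float]:
    """Zero out one component and rescale the others to sum to 1."""
    remaining = {k: v for k, v in weights.items() if k != drop}
    total = sum(remaining.values())
    ablated = {k: v / total for k, v in remaining.items()}
    ablated[drop] = 0.0
    return ablated


full = {"reliability": 0.40, "discriminability": 0.35, "coverage": 0.25}
conditions = {f"-{name}": ablate_weights(full, name) for name in full}

for label, w in conditions.items():
    print(label, {k: round(v, 2) for k, v in w.items()})
```

Rounded to two decimals, the output matches the three ablation rows of the table (0.58/0.42, 0.62/0.38, and 0.53/0.47).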

28.10.5 Phase 5: Ranking Stability Analysis

Two complementary procedures:

Bootstrap resampling: For $h^*$, resample $N$ tasks with replacement $B \geq 1000$ times, recomputing model rankings each time. Report: mean Kendall's $\tau_b$ between each bootstrap ranking and the full-data ranking, 95% CI, and top-3 inversion rate.

Split-half comparison: Randomly split the $L$ replications into two halves, compute rankings from each, report $\tau_b$ across 100 random splits.

# Reference implementation: ranking stability analysis (§28.10.5).
# Author-constructed pseudocode — NOT from any repository.

from __future__ import annotations

import numpy as np
from scipy import stats


def ranking_stability_bootstrap(
    scores: np.ndarray,  # (K, N, L)
    n_bootstrap: int = 1000,
    seed: int = 42,
) -> dict[str, float]:
    """Ranking stability via task-bootstrap resampling.

    Returns mean Kendall τ_b (standard, not weighted), its 95% percentile
    interval, and the fraction of bootstrap samples whose top-3 set of
    models differs from the full-data top-3 set.
    """
    rng = np.random.default_rng(seed)
    K, N, L = scores.shape

    full_means = np.mean(scores, axis=(1, 2))
    full_ranking = stats.rankdata(-full_means)  # higher score = rank 1
    full_top3 = set(np.argsort(full_ranking)[:3])  # fixed across resamples

    taus = []
    top3_inversions = 0

    for _ in range(n_bootstrap):
        task_idx = rng.choice(N, size=N, replace=True)
        boot_means = np.mean(scores[:, task_idx, :], axis=(1, 2))
        boot_ranking = stats.rankdata(-boot_means)

        # scipy.stats.kendalltau computes τ_b (handles ties)
        tau, _ = stats.kendalltau(full_ranking, boot_ranking)
        taus.append(tau)

        # Count a change in top-3 membership (order within the top 3 is ignored)
        boot_top3 = set(np.argsort(boot_ranking)[:3])
        if full_top3 != boot_top3:
            top3_inversions += 1

    taus_arr = np.array(taus)
    return {
        "mean_kendall_tau_b": float(np.mean(taus_arr)),
        "ci_lower": float(np.percentile(taus_arr, 2.5)),
        "ci_upper": float(np.percentile(taus_arr, 97.5)),
        "top3_inversion_rate": top3_inversions / n_bootstrap,
    }


def split_half_reliability(
    scores: np.ndarray,  # (K, N, L), L ≥ 4 recommended
    n_splits: int = 100,
    seed: int = 42,
) -> dict[str, float]:
    """Split-half ranking agreement across replications.

    Uses standard Kendall τ_b (via scipy.stats.kendalltau).
    """
    rng = np.random.default_rng(seed)
    K, N, L = scores.shape
    if L < 2:
        raise ValueError("Need at least 2 replications for split-half")

    taus = []
    for _ in range(n_splits):
        perm = rng.permutation(L)
        half1, half2 = perm[: L // 2], perm[L // 2 :]
        means1 = np.mean(scores[:, :, half1], axis=(1, 2))
        means2 = np.mean(scores[:, :, half2], axis=(1, 2))

        tau, _ = stats.kendalltau(
            stats.rankdata(-means1), stats.rankdata(-means2)
        )
        taus.append(tau)

    taus_arr = np.array(taus)
    return {
        "mean_split_half_tau_b": float(np.mean(taus_arr)),
        "std_split_half_tau_b": float(np.std(taus_arr)),
    }

Listing 28.3: ▸ Framework — Ranking stability analysis (Phase 5). Uses standard Kendall's $\tau_b$ throughout (computed by scipy.stats.kendalltau, which handles ties). Author-constructed reference pseudocode.

28.10.6 Analytical Expectations

Based on the formal framework and related infrastructure-optimization evidence:

Reliability gains. Parry et al. (2021) found that systematic infrastructure debugging improved test reliability by 15–30 percentage points. Lam et al. (2020) reported 74% deflaking success. If analogous gain magnitudes apply to LLM harnesses, improving ICC from ~0.80 to ~0.90–0.95 is plausible. The exact improvement depends on the fraction of variance attributable to run-related components—an empirical question whose answer the G-theory decomposition provides. The worked example in §28.7 shows that with ~7.6% run noise, halving that noise raises ICC from 0.85 to ~0.92.

Efficiency reductions. Container layering optimization yields 40–60% provisioning-time reduction in CI/CD contexts (Shu et al., 2017). The practical impact depends on the provisioning-to-execution time ratio, which varies by benchmark.

Coverage improvements. In the SWE-bench ecosystem, environment setup failures are a documented source of lost evaluations (Jimenez et al., 2024). Automated detection and repair of failure-prone configurations should improve usable data yield.

28.11 Comparison Context

To contextualize the meta-harness contribution, the following table compares evaluation infrastructure characteristics of prominent benchmarks. Each cell is sourced from the cited primary reference. The "Reproducibility Criteria" column uses four observable properties (Env = environment images pinned to content-addressable digests; Rerun = documented rerun determinism testing; Scripts = harness execution scripts publicly available; Scorer = scoring function is deterministic and auditable). A check (✓) indicates the criterion is met based on the cited source; a cross (✗) indicates it is not met or not documented; a tilde (~) indicates partial or unclear support.

Benchmark	Env Type	Harness Tuning	Scoring	Reproducibility Criteria	Source
SWE-bench	Docker + Git repos	Manual	Test suite pass/fail	Env: ✓ (per-repo images) · Rerun: ~ (not formally tested) · Scripts: ✓ · Scorer: ✓ (deterministic)	Jimenez et al., 2024, §3
WebArena	Docker Compose	Manual	URL/content + functional	Env: ~ (Compose, not pinned digests) · Rerun: ✗ · Scripts: ✓ · Scorer: ~ (partial heuristic)	Zhou et al., 2024, §4
AgentBench	Multi-env (8 types)	Per-env manual	Task-specific + LLM judge	Env: ~ (varies by sub-env) · Rerun: ✗ · Scripts: ✓ · Scorer: ✗ (LLM judge = stochastic)	Liu et al., 2024, §3
GAIA	Web access (live)	None	Exact-match answers	Env: ✗ (live web = non-reproducible) · Rerun: ✗ · Scripts: ~ · Scorer: ✓ (exact match)	Mialon et al., 2023, §2
HELM	API-based (no env)	Standardized once	Multi-metric per scenario	Env: N/A · Rerun: ~ (API non-determinism) · Scripts: ✓ · Scorer: ✓ (deterministic)	Liang et al., 2023, §3
lm-eval-harness	API/local inference	Community-maintained	Per-task metric	Env: ~ (pip deps, not containers) · Rerun: ~ (Biderman: 2–5pp inter-harness gap) · Scripts: ✓ · Scorer: ✓	Biderman et al., 2024
Meta-Harness	Unknown	Automated (concept)	Unknown	Not assessable (repo inaccessible)	This chapter (concept only)

Table 28.3: Evaluation infrastructure comparison. Each row except the last is sourced from the cited reference. Reproducibility criteria: Env = environment pinned to content-addressable digests; Rerun = rerun determinism formally tested and documented; Scripts = harness execution scripts publicly available; Scorer = scoring function deterministic and auditable. The Meta-Harness row reflects only the concept developed in this chapter; no implementation claims are made.

The distinguishing feature of the meta-harness concept is automated harness optimization. All other benchmarks treat harness design as a one-time or periodic manual engineering effort. Biderman et al. (2024) provide the closest precedent: community contributions iteratively improve lm-evaluation-harness, but through human-driven pull requests rather than automated search.

28.12 Implementation Considerations

28.12.1 Compute Cost of Meta-Optimization

▸ Framework. Meta-optimization introduces overhead following a two-level hierarchy:

$$C_{\text{total}} = \underbrace{I \cdot f \cdot K_{\text{ref}} \cdot N \cdot L_{\text{meta}} \cdot c_{\text{run}}}_{\text{meta-optimization phase}} + \underbrace{K' \cdot N \cdot L \cdot c_{\text{run}}}_{\text{production evaluation}} \tag{28.10}$$

Numerical example. With $I = 50$, $f = 0.3$, $K_{\text{ref}} = 5$, $N = 200$, $L_{\text{meta}} = 3$, $c_{\text{run}} = \$0.50$: meta-optimization costs $50 \times 0.3 \times 5 \times 200 \times 3 \times 0.50 = \$22{,}500$. A single production evaluation with $K' = 10$, $L = 3$ costs $10 \times 200 \times 3 \times 0.50 = \$3{,}000$, so the one-time meta-optimization phase costs as much as 7.5 production evaluations. It is amortized only to the extent that $h^*$ lowers per-cycle cost or improves per-cycle data quality. These numbers are illustrative and depend strongly on $c_{\text{run}}$; operators should estimate from their infrastructure before committing.
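The two terms of Eq. 28.10 can be wrapped in small helpers for budgeting. This is a minimal sketch; the function and parameter names simply mirror the symbols of the equation.

```python
def meta_optimization_cost(
    iterations: int, subsample_fraction: float, n_ref_models: int,
    n_tasks: int, n_reps_meta: int, cost_per_run: float,
) -> float:
    """First term of Eq. 28.10: I * f * K_ref * N * L_meta * c_run."""
    return (iterations * subsample_fraction * n_ref_models
            * n_tasks * n_reps_meta * cost_per_run)


def production_cost(
    n_models: int, n_tasks: int, n_reps: int, cost_per_run: float,
) -> float:
    """Second term of Eq. 28.10: K' * N * L * c_run."""
    return n_models * n_tasks * n_reps * cost_per_run


meta = meta_optimization_cost(50, 0.3, 5, 200, 3, 0.50)
prod = production_cost(10, 200, 3, 0.50)
cycles_equivalent = meta / prod   # meta-optimization cost in production-run units
```

With the values from the numerical example, `meta` is 22,500, `prod` is 3,000, and `cycles_equivalent` is 7.5.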

28.12.2 Reproducibility Design

▸ Principle. The optimized configuration $h^*$ must be fully specified: version-controlled YAML/JSON, content-addressable image digests (not floating tags), explicit seed management, and configuration fingerprinting via deterministic hash.
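One way to realize configuration fingerprinting is to serialize the configuration to a canonical byte string and hash it. The sketch below assumes the configuration is a JSON-serializable dictionary; the function name `config_fingerprint`, the example keys, and the placeholder image digest are hypothetical. Sorting keys and fixing separators makes the hash independent of key order and whitespace, although it is not a full canonicalization scheme such as RFC 8785.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON serialization.

    Keys are sorted and separators fixed so that two dictionaries with
    the same content always produce the same fingerprint.
    """
    canonical = json.dumps(
        config,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical configuration; the image digest is a placeholder.
    h_star = {
        "image": "registry.example/agent-env@sha256:" + "0" * 64,
        "seed": 1234,
        "timeout_s": 600,
        "n_replications": 3,
    }
    print(config_fingerprint(h_star))
```

Recording this fingerprint alongside every reported score lets a reader confirm that two results were produced under the same $h^*$ without diffing configuration files by hand.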

28.12.3 Security Considerations

▸ Principle. Harnesses executing agent-generated commands face security risks. Container-based isolation prevents cross-task leakage but does not provide VM-level security boundaries. Container-escape vulnerabilities are documented (Gao et al., 2023); Linux namespaces do not constitute a hypervisor-equivalent boundary. For adversarial contexts, hardware-virtualized isolation such as Firecracker microVMs provides a stronger boundary and is the safer default. Harness documentation should specify the exact isolation boundary and threat model.

28.13 Limitations & Open Questions

28.13.1 Limitations of the Framework

Goodhart's risk. Optimizing a composite fitness risks Goodharting: the optimizer may find configurations that score well through unintended mechanisms. For example, aggressive timeout reduction might improve cost efficiency while systematically disadvantaging slower-but-more-thorough agents. The constraint-based formulation (Eq. 28.1) partially mitigates this by separating efficiency from quality. Manheim & Garrabrant (2019) provide a formal taxonomy of Goodhart effects.

Overfitting to reference models. $D(h)$ is computed against a fixed reference set $\mathcal{M}$, so a configuration optimized to separate those models may not discriminate equally well among models outside $\mathcal{M}$. Periodic re-optimization with updated model rosters is necessary.

Bootstrap cost. The meta-optimization loop requires substantial compute (§28.12.1). For infrequent benchmarking, the upfront cost may not be justified.

Scorer circularity. When scoring includes LLM-as-judge components, harness quality depends on an LLM, which is itself the type of system being evaluated. Zheng et al. (2024) documented systematic judge biases, including position, verbosity, and self-enhancement bias; meta-optimization must account for such judge-model artifacts.

Normality assumption. The G-theory model (Eq. 28.3) assumes normally distributed random effects. For binary pass/fail scores (common in agent benchmarks), this is violated. A generalized linear mixed model with a logit link would be more appropriate but substantially increases estimation complexity. For continuous partial-credit scores (e.g., 0–1), the normality assumption is approximately satisfied and the ANOVA-based estimators are robust (Searle et al., 1992).

28.13.2 Open Questions

Transfer across benchmark families: Can optimization insights transfer between benchmark types (terminal → web → code generation)? The shared quality-metric structure suggests some transferability, but environment-specific components are likely domain-dependent.
Adaptive testing integration: Could the harness adaptively adjust task difficulty during evaluation, similar to computerized adaptive testing (van der Linden & Glas, 2000)? This requires real-time IRT parameter estimation but could substantially reduce the number of tasks needed per model.
Multi-stakeholder Pareto: Different benchmark users may weight quality components differently. The Pareto formulation in Eq. 28.1 naturally accommodates this by presenting the full quality-cost frontier.
Harness versioning: Re-optimized configurations may produce scores incomparable with prior versions. Bridging studies (shared model set evaluated under both configurations) are the standard psychometric solution (Holland & Dorans, 2006) but add cost.
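For the harness-versioning question, a bridging study could be analyzed with mean-sigma linear linking, one of the simplest methods in the linking and equating literature. The sketch below assumes a single-group design in which the same models are scored under both configurations; the function names and the example scores are hypothetical, and a real study would also need to check that a linear relation is adequate.

```python
import statistics
from typing import Sequence, Tuple


def linear_link(new_scores: Sequence[float],
                old_scores: Sequence[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept mapping new-configuration scores
    onto the old-configuration scale by matching mean and std. dev."""
    if len(new_scores) != len(old_scores) or len(new_scores) < 2:
        raise ValueError("need the same models (at least 2) under both configs")
    mu_new = statistics.fmean(new_scores)
    mu_old = statistics.fmean(old_scores)
    sd_new = statistics.stdev(new_scores)
    sd_old = statistics.stdev(old_scores)
    if sd_new == 0:
        raise ValueError("new-configuration scores have zero variance")
    slope = sd_old / sd_new
    intercept = mu_old - slope * mu_new
    return slope, intercept


def to_old_scale(score: float, slope: float, intercept: float) -> float:
    """Express a new-configuration score on the old-configuration scale."""
    return slope * score + intercept


if __name__ == "__main__":
    # Hypothetical scores for four shared models under each configuration.
    old = [0.20, 0.35, 0.50, 0.65]
    new = [0.30, 0.40, 0.50, 0.60]
    slope, intercept = linear_link(new, old)
    print(slope, intercept, to_old_scale(0.55, slope, intercept))
```

The cost noted in the text shows up directly here: every model in the bridging set must be evaluated twice, once per configuration, before any new score can be placed on the old scale.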

28.14 Relationship to Evolutionary AI

Within this survey, meta-harness optimization is not itself an evolutionary system, but it provides the evaluation infrastructure upon which evolutionary systems are assessed. Harness quality directly affects the reliability of fitness signals driving evolutionary search.

▸ Framework. Under a standard additive noise model $s_{\text{observed}} = s_{\text{true}} + \epsilon_{\text{harness}}$ where $\epsilon_{\text{harness}} \sim \mathcal{N}(0, \sigma^2_{\text{harness}})$:

$$\text{SNR}_{\text{benchmark}} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{true}} + \sigma^2_{\text{harness}}} \tag{28.11}$$

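This coincides with $R(h)$ from Eq. 28.4 when $\sigma^2_{\text{true}} = \sigma^2_m + \sigma^2_{mt}/N$ (between-model signal in task-averaged scores) and $\sigma^2_{\text{harness}} = \sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}/N + \sigma^2_\epsilon/N$ (run-related noise in task-averaged scores). Note that Eq. 28.11 is the signal fraction of total observed variance, bounded in $[0, 1]$, rather than the unnormalized ratio $\sigma^2_{\text{true}}/\sigma^2_{\text{harness}}$. The connection is not coincidental: reliability is SNR, in this normalized form, applied to measurement.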

When $\sigma^2_{\text{harness}}$ is large relative to $\sigma^2_{\text{true}}$—common in early-generation evolutionary populations where candidates have similar capability—selection becomes nearly random. Beyer (2001, Ch. 4) provides the general theory relating selection intensity to noise levels in evolution strategies.

Illustrative example. Suppose $\sigma^2_{\text{true}} = 0.04$ and $\sigma^2_{\text{harness}} = 0.06$. Then $\text{SNR} = 0.40$. If harness optimization reduces $\sigma^2_{\text{harness}}$ to $0.01$, $\text{SNR} = 0.80$—a $2\times$ improvement that substantially sharpens selection pressure. The precise impact on convergence speed depends on population size, tournament size, mutation rate, and fitness landscape shape; these interactions generally require empirical measurement (Beyer, 2001).
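The effect of harness noise on selection can be illustrated with a toy Monte Carlo under the additive noise model above. This is a sketch under strong assumptions: true fitness and harness noise are independent zero-mean Gaussians, selection is a binary tournament between two candidates, and the function names are hypothetical. It does not model population size, mutation, or landscape shape.

```python
import random


def snr(var_true: float, var_harness: float) -> float:
    """Eq. 28.11: signal fraction of total observed variance."""
    return var_true / (var_true + var_harness)


def pairwise_selection_accuracy(var_true: float, var_harness: float,
                                trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often a binary tournament on noisy scores picks
    the candidate whose true fitness is higher."""
    rng = random.Random(seed)
    sd_true = var_true ** 0.5
    sd_harness = var_harness ** 0.5
    correct = 0
    for _ in range(trials):
        # True fitness of two candidates drawn from the population.
        a = rng.gauss(0.0, sd_true)
        b = rng.gauss(0.0, sd_true)
        # Observed scores add independent harness noise to each.
        obs_a = a + rng.gauss(0.0, sd_harness)
        obs_b = b + rng.gauss(0.0, sd_harness)
        if (obs_a > obs_b) == (a > b):
            correct += 1
    return correct / trials


if __name__ == "__main__":
    # Variances from the illustrative example in the text.
    for var_h in (0.06, 0.01):
        print(f"var_harness={var_h}: SNR={snr(0.04, var_h):.2f}, "
              f"tournament accuracy={pairwise_selection_accuracy(0.04, var_h):.3f}")
```

An accuracy of 0.5 corresponds to random selection, so the gap between the two printed accuracies gives a rough sense of how much selection signal the noisier harness discards in this simplified setting.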

This positions meta-harness optimization as a meta-level enabler: improving the measurement instrument makes all downstream evolutionary optimization more efficient. Benchmark operators who invest in harness quality are investing in the convergence speed of every system that uses their benchmark as a fitness function.

28.15 Summary

Key Takeaway. This chapter developed a formal, benchmark-agnostic framework for treating LLM agent evaluation infrastructure as an optimization target. The framework is motivated by the Meta-Harness concept from Stanford IRIS Lab but stands independently as an analytical contribution. The chapter is classified as a conceptual-framework chapter; no implementation claims about the Meta-Harness artifact are made.

Main Contributions.

Multi-objective harness fitness (Eqs. 28.1–28.2): reliability (ICC from G-theory), discriminability (mean pairwise Cohen's $d$, explicitly justified as a design choice), and coverage as quality objectives, with efficiency as a budget constraint and scorer calibration as a diagnostic.
Rigorous reliability formalization with full derivation (Eqs. 28.3–28.4): a three-facet crossed random-effects model yielding ICC(2,1) via task-averaging, with explicit derivation showing how between-model signal ($\sigma^2_m + \sigma^2_{mt}/N$) separates from within-model noise ($\sigma^2_r + \sigma^2_{mr} + \sigma^2_{tr}/N + \sigma^2_\epsilon/N$). Confidence intervals via F-distribution method of McGraw & Wong (1996).
Worked empirical demonstration (§28.7): all metrics computed on synthetic score tensors calibrated from published harness-variance data (Biderman et al., 2024; Parry et al., 2021), with a reproducible application recipe for lm-evaluation-harness.
Five-phase evaluation protocol (§28.10): baseline → optimization → full evaluation → ablation → ranking stability, with Kendall's $\tau_b$ used throughout, bootstrap-based inference for ICC differences, and an operationalized reporting schema.
Operationalized benchmark comparison (Table 28.3): reproducibility assessed via four observable criteria (environment pinning, rerun determinism, public scripts, scorer determinism) rather than subjective labels.

Epistemic Status. This chapter could not verify the Meta-Harness repository's implementation or report its empirical results. All architecture, equations, code, and analytical expectations are the survey author's constructions. The evaluation protocol (§28.10) specifies exactly what evidence would validate any meta-harness system.

For Researchers. The formal tools in this chapter—G-theory reliability decomposition, discriminability measurement, ranking-stability tests, and the composite fitness function—can be applied immediately to any existing evaluation harness. The code in §28.7 is self-contained and executable. Within the evolutionary AI context, harness optimization directly improves the signal-to-noise ratio of fitness evaluation (Eq. 28.11), making it a high-leverage investment for any system that uses benchmarks as selection criteria.

References

Beizer, B. (1990). Software Testing Techniques, 2nd ed. Van Nostrand Reinhold.

Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281–305.

Beyer, H.-G. (2001). The Theory of Evolution Strategies. Springer.

Biderman, S., et al. (2024). Lessons from the trenches on reproducible evaluation of language models. arXiv:2405.14782.

Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in natural language understanding? NAACL.

Brennan, R. L. (2001). Generalizability Theory. Springer.

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.

Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Erlbaum.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comp., 6(2), 182–197.

Dubois, Y., et al. (2024). Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv:2404.04475.

Esfahani, H., et al. (2016). Test case prioritization for continuous integration. ISSTA.

Feurer, M., & Hutter, F. (2019). Hyperparameter optimization. In AutoML: Methods, Systems, Challenges. Springer.

Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.

Gao, X., et al. (2023). A survey on container security. ACM Computing Surveys, 55(9).

Golchin, S., & Surdeanu, M. (2024). Time travel in LLMs: Tracing data contamination in large language models. ICLR.

Green, D. M., & Swets, J. A. (1966). Signal Detection Theory and Psychophysics. Wiley.

Hansen, N. (2006). The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation. Springer.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In Educational Measurement, 4th ed. Praeger.

Huang, Q., et al. (2024a). MLAgentBench: Evaluating language agents on machine learning experimentation. ICML.

Jimenez, C. E., et al. (2024). SWE-bench: Can language models resolve real-world GitHub issues? ICLR.

Kandasamy, K., et al. (2017). Multi-fidelity Bayesian optimisation with continuous approximations. ICML.

Knowles, J. (2006). ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Trans. Evol. Comp., 10(1), 50–66.

Lam, W., et al. (2020). A study on the lifecycle of flaky tests. ICSE.

Li, L., et al. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18(185), 1–52.

Li, X., et al. (2024). AlpacaEval: An automatic evaluator for instruction-following language models. GitHub repository, tatsu-lab/alpaca_eval.

Liang, P., et al. (2023). Holistic evaluation of language models. TMLR.

Liu, X., et al. (2024). AgentBench: Evaluating LLMs as agents. ICLR.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley.

Luo, Q., et al. (2014). An empirical analysis of flaky tests. FSE.

Manheim, D., & Garrabrant, S. (2019). Categorizing variants of Goodhart's law. arXiv:1803.04585.

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.

Memon, A., et al. (2017). Taming Google-scale continuous testing. ICSE-SEIP.

Mialon, G., et al. (2023). GAIA: A benchmark for general AI assistants. arXiv:2311.12983.

Mytkowicz, T., et al. (2010). Producing wrong data without doing anything obviously wrong! ASPLOS.

Oren, Y., et al. (2024). Proving test set contamination in black-box language models. ICLR.

Parry, O., Kapfhammer, G. M., Hilton, M., & McMinn, P. (2021). A survey of flaky tests. ACM Trans. Software Engineering and Methodology, 31(1).

Raji, I. D., et al. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks Track.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research.

Sainz, O., et al. (2023). NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. EMNLP Findings.

Schlangen, D. (2021). Targeting the benchmark: On methodology in current NLP research. ACL.

Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance Components. Wiley.

Shi, A., et al. (2019). iFixFlakies: A framework for automatically fixing order-dependent flaky tests. ESEC/FSE.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

Shu, R., et al. (2017). A study of layered container image management in Docker. USENIX ATC.

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.

van der Linden, W. J., & Glas, C. A. W. (Eds.). (2000). Computerized Adaptive Testing: Theory and Practice. Springer.

Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2), 117–186.

Wang, X., et al. (2024). Evaluating multi-step agent task completion: Challenges and opportunities. arXiv:2404.12253.

Xia, C. S., et al. (2024). Agentless: Demystifying LLM-based software engineering agents. arXiv:2407.01489.

Zheng, L., et al. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS.

Zhou, S., et al. (2024). WebArena: A realistic web environment for building autonomous agents. ICLR.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}