Introduced: 2026-03

Score: 8.35/10 (Draft)

Chapter 32

AIRA₂

Part P07: Autonomous Research Systems

32.1 Overview and Motivation

Automated machine learning research agents, systems that autonomously solve complex ML engineering tasks, face a fundamental tension between exploration breadth and evaluation reliability. As of early 2026, several systems compete on MLE-bench, a benchmark suite derived from Kaggle competitions that tests an agent's ability to produce complete ML pipelines. Despite rapid progress, prior systems exhibited a puzzling failure mode: their performance degraded when they were given more compute time, suggesting that extended search was counterproductive.

AIRA₂ (Asynchronous multi-GPU research agent, second generation) addresses this failure mode through a systems-level analysis that identifies three structural bottlenecks in AI research agents and resolves each with a targeted architectural decision. The system achieves state-of-the-art results on MLE-bench-30, a curated 30-task subset used in the GPT-5 system card, reaching 71.8% percentile rank at 24 hours and 76.0% at 72 hours—the first demonstration that performance improves monotonically with compute in this setting (Hambardzumyan et al., 2026).

Key Contribution

AIRA₂'s primary contribution is a systems-level diagnosis and resolution of three bottlenecks that prevent AI research agents from scaling with compute. (1) Asynchronous multi-GPU steady-state evolution eliminates synchronization barriers, achieving approximately linear throughput scaling. (2) Hidden Consistent Evaluation (HCE) decouples search optimization from selection, revealing that previously reported "overfitting" was evaluation noise rather than true data memorization. (3) ReAct agent operators replace static single-turn prompts with autonomous multi-step reasoning trajectories. Together, these enable monotonically improving performance with compute—a prerequisite for meaningful compute scaling laws in research agents.

The paper explicitly frames itself as a systems contribution rather than an algorithmic advance: the evolutionary search uses standard temperature-scaled rank selection, and the LLM backbone (Gemini 3.0 Pro Preview) is shared with baselines. The innovation lies in the infrastructure that makes long-horizon evolutionary search productive. This framing distinguishes AIRA₂ from concurrent work such as MARS+ and FM-Agent 2.0, which focus on operator design or model capabilities.

32.1.1 Lineage and Context

AIRA₂ is the direct successor to AIRA-dojo (Toledo et al., 2025), which formalized automated ML research as a search problem with three components: a search policy selecting which candidates to expand, operators transforming candidates into new solutions, and an evaluation signal providing fitness. AIRA-dojo achieved 39.5% percentile rank on MLE-bench at 24 hours with a single GPU, establishing the evolutionary framework but also exposing the three bottlenecks that AIRA₂ resolves.

The competitive landscape on MLE-bench-30 as of March 2026 provides context for AIRA₂'s positioning:

System	Percentile Rank (24h)	Year	Source
AIRA-dojo	39.5%	2025	Toledo et al., 2025
PiEvolve	54.1%	2025	Botla et al., 2025
ML-Master 2.0	57.6%	2025	Liu et al., 2025
MARS	60.4%	2026	Chen et al., 2026
MLEvolve	64.1%	2025	Du et al., 2025
FM-Agent 2.0	69.6%	2025	Li et al., 2025
MARS+	69.9%	2026	Chen et al., 2026
AIRA₂	71.8%	2026	Hambardzumyan et al., 2026
AIRA₂ (72h)	76.0%	2026	Hambardzumyan et al., 2026

All systems except ML-Master 2.0 (which uses DeepSeek V3.2-Speciale) use Gemini 3.0 Pro Preview, making the LLM dimension approximately controlled across the comparison. The 24-hour budget is the standard comparison point; AIRA₂'s 72-hour results are unique to this paper, as other systems do not report extended-time performance.

32.2 Architecture

AIRA₂ uses a two-tier architecture: a global evolutionary orchestrator that maintains population state and performs selection, and an asynchronous pool of worker agents that execute multi-step reasoning trajectories to produce candidate solutions. A dedicated evaluation subsystem (HCE) provides fitness signals while preventing agents from accessing evaluation labels.

32.2.1 Design Decisions and Rationale

Several architectural decisions in AIRA₂ are explicitly motivated by the bottleneck analysis. The following table summarizes these decisions as reported in the paper:

Decision	Rationale (paper-stated)
Steady-state evolution (not generational)	Workers never idle at synchronization barriers; fast mutations feed back immediately while slow ones do not block others
1:1 static GPU-to-worker mapping	Eliminates dynamic scheduling complexity; each worker gets a clean-slate environment
Separate evaluation container	Agents never see labels; prevents metric gaming and evaluation feedback loops
In-memory population database	Fast access for the orchestrator; large artifacts spill to disk
Ephemeral Apptainer containers	Crashed containers do not affect the orchestrator or other workers
Fakeroot mode in containers	Agents can run `apt install` or `pip install` without actual root privileges

32.2.2 Resource Allocation

Each worker is allocated a fixed resource budget as reported in the paper: one NVIDIA H200 GPU (141 GB VRAM), 12 logical CPU cores, and 120 GB system RAM. A dedicated evaluation GPU runs the HCE protocol. The full system for the main experiments thus requires 9 GPUs: 8 workers plus 1 evaluator. A 9-hour hard time limit is imposed on each individual code execution, and a configurable global wall-clock limit (72 hours in the main experiments) bounds the whole run.

32.3 Core Algorithms

32.3.1 Asynchronous Steady-State Evolution

Unlike generational evolutionary algorithms where the entire population is replaced in synchronized waves, AIRA₂ uses steady-state evolution (Syswerda, 1991). When any worker becomes idle, the orchestrator immediately dispatches a new task using the current population state. Newly evaluated individuals are added to the population as soon as they complete, without waiting for other workers.

This design is particularly important for ML research tasks where mutation duration varies by orders of magnitude—from minutes for a hyperparameter tweak to hours for training a large model from scratch. In a generational scheme, fast workers would idle while waiting for the slowest member of each generation; steady-state evolution eliminates this waste entirely.
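The idle-time argument can be made concrete with a small scheduling simulation. This is a minimal sketch, not the paper's implementation: the job durations are hypothetical, and `generational_makespan` and `steady_state_makespan` are illustrative names introduced here. Both schedulers process the same jobs in the same order on 8 workers; only the dispatch rule differs.

```python
import heapq
import random

def generational_makespan(durations, n_workers):
    """Wall-clock time when each batch of n_workers jobs waits for its slowest member."""
    total = 0.0
    for start in range(0, len(durations), n_workers):
        total += max(durations[start:start + n_workers])
    return total

def steady_state_makespan(durations, n_workers):
    """Wall-clock time when every idle worker immediately takes the next job."""
    free_at = [0.0] * n_workers            # min-heap of times at which workers become free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)     # earliest available worker
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

rng = random.Random(0)
# Hypothetical durations in hours: mostly quick tweaks, occasionally a long training run
durations = [rng.choice([0.1, 0.2, 0.5, 4.0]) for _ in range(64)]
n_workers = 8
busy = sum(durations)

gen = generational_makespan(durations, n_workers)
ss = steady_state_makespan(durations, n_workers)
print(f"generational: {gen:.1f} h, GPU utilization {busy / (n_workers * gen):.0%}")
print(f"steady-state: {ss:.1f} h, GPU utilization {busy / (n_workers * ss):.0%}")
```

With the same job order, the steady-state schedule never finishes later than the generational one, because a worker is always free by the time the corresponding generation would have started.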

The following pseudocode illustrates the orchestrator's main loop:

# Pseudocode — no public implementation available
# Illustrates the steady-state evolutionary loop described in Hambardzumyan et al. (2026)

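import asyncio
import random
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Candidate:
    code: str
    fitness: float  # D_search score; assumed oriented so that higher is better
    parent_ids: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    eval_result: Optional[Any] = None  # EvaluationResult attached by the HCE evaluator

async def orchestrator_loop(
    population: list[Candidate],  # assumed non-empty (seeded with initial drafts)
    workers: list,  # pool of async worker handles
    evaluator,      # HCE evaluation service
    temperature: float = 0.2,
    crossover_prob: float = 0.15,
    wall_clock_limit: float = 72 * 3600,
) -> Candidate:
    """Steady-state evolutionary orchestrator.

    Workers are dispatched as soon as they become idle.
    No synchronization barriers between workers.

    run_worker is a hypothetical coroutine: it runs one agent trajectory,
    has the child evaluated, appends it to `population`, and returns the
    worker to `idle_workers`.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + wall_clock_limit
    idle_workers: asyncio.Queue = asyncio.Queue()
    for w in workers:
        idle_workers.put_nowait(w)
    in_flight: set[asyncio.Task] = set()  # strong references keep tasks alive

    while (remaining := deadline - loop.time()) > 0:
        # Wait for any worker to become available, but not past the deadline
        try:
            worker = await asyncio.wait_for(idle_workers.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break

        # Select parent(s) via temperature-scaled rank selection
        parent = rank_select(population, temperature)
        second_parent = None
        if random.random() < crossover_prob:
            second_parent = rank_select(population, temperature)

        # Dispatch task asynchronously; this does not block other workers
        task = asyncio.create_task(
            run_worker(worker, parent, second_parent,
                       evaluator, population, idle_workers)
        )
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    # Stop trajectories that are still running when the budget expires
    for task in list(in_flight):
        task.cancel()

    # Final selection uses D_val scores, not D_search scores
    return select_final_submission(population, split="D_val")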

32.3.2 Temperature-Scaled Rank Selection

AIRA₂ uses rank-based parent selection with a temperature parameter that controls the exploration–exploitation tradeoff. Given a population of $N$ individuals, each individual $i$ is assigned a rank $r_i$ where $r_i = 1$ denotes the best individual. The selection probability is:

$$p(i) = \frac{(N - r_i + 1)^{1/T}}{\sum_{j=1}^{N} (N - r_j + 1)^{1/T}}$$

where:

$N$ is the current population size,
$r_i$ is the rank of individual $i$ (1 = highest fitness),
$T$ is the temperature parameter (default $T = 0.2$).

The temperature governs selection pressure: as $T \to 0$, selection becomes greedy (only the best individual is selected); as $T \to \infty$, selection becomes uniform (all individuals equally likely). The default $T = 0.2$ strongly biases selection toward high-fitness individuals while maintaining diversity. Rank-based selection is chosen over fitness-proportionate selection because ranks are invariant to the magnitude and scale of fitness scores, which vary widely across Kaggle tasks (e.g., RMSE vs. AUC vs. accuracy).

# Pseudocode — no public implementation available
# Temperature-scaled rank selection from Hambardzumyan et al. (2026), Eq. in Section 3

import numpy as np

def rank_select(population: list[Candidate], temperature: float = 0.2) -> Candidate:
    """Select a parent using temperature-scaled rank-based probabilities.

    Lower temperature → stronger exploitation (prefer top-ranked).
    Higher temperature → more exploration (more uniform).
    """
    if not population:
        raise ValueError("population must contain at least one candidate")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    N = len(population)
    # Sort by fitness (descending), so index 0 holds rank 1 (the best)
    sorted_pop = sorted(population, key=lambda c: c.fitness, reverse=True)

    # Unnormalized log-weights: rank 1 gets log(N)/T, rank N gets log(1)/T = 0.
    # Working in log space avoids overflow of N**(1/T) at small temperatures.
    ranks = np.arange(1, N + 1)
    log_weights = np.log(N - ranks + 1) / temperature
    weights = np.exp(log_weights - log_weights.max())

    # Normalize to probabilities
    probabilities = weights / weights.sum()

    # Sample one parent
    idx = np.random.choice(N, p=probabilities)
    return sorted_pop[idx]

To illustrate the effect of temperature: with $N = 10$ and $T = 0.2$, the weights are $k^5$ for $k = 1, \ldots, 10$, so the best-ranked individual receives $10^5 / 220{,}825 \approx 45\%$ of the selection probability mass, while with $T = 1.0$ it receives $10/55 \approx 18\%$. The paper does not report experiments with adaptive temperature schedules.
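The selection formula can be evaluated directly to see how temperature reshapes the distribution. The helper name `rank_selection_probs` is introduced here for illustration; it implements only the probability formula stated above and does no sampling.

```python
import numpy as np

def rank_selection_probs(n: int, temperature: float) -> np.ndarray:
    """p(i) for ranks 1..n under p(i) proportional to (n - r_i + 1)^(1/T); index 0 is rank 1."""
    ranks = np.arange(1, n + 1)
    # Log space keeps very small temperatures from overflowing
    log_w = np.log(n - ranks + 1) / temperature
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

for t in (0.2, 1.0, 5.0):
    p = rank_selection_probs(10, t)
    print(f"T={t}: best-ranked {p[0]:.3f}, worst-ranked {p[-1]:.6f}")
```

For $N = 10$ the best-ranked individual receives about 0.453 of the probability mass at $T = 0.2$ and about 0.182 at $T = 1.0$; at larger temperatures the distribution flattens toward the uniform value of 0.1.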

32.3.3 Hidden Consistent Evaluation (HCE)

HCE is arguably AIRA₂'s most important contribution, both as a practical safeguard and as an experimental diagnostic tool. The protocol addresses three sources of evaluation noise that the authors identified in prior systems:

Implementation bugs inflating metrics: Agents sometimes produce code with data leakage or other bugs that report artificially high validation scores.
Brittle output parsing: Missing or erroneous score extraction from agent output leads to incorrect fitness signals.
Stochastic data splitting: Random seeds for train/validation splits introduce variance; inferior solutions can survive selection due to favorable random partitions.

HCE eliminates all three by externalizing evaluation entirely. The available labeled data for each task is partitioned once (deterministically) into three fixed splits:

$$D_{\text{labeled}} = D_{\text{train}} \cup D_{\text{search}} \cup D_{\text{val}}, \quad |D_{\text{train}}| : |D_{\text{search}}| : |D_{\text{val}}| = 80 : 10 : 10$$

where:

$D_{\text{train}}$ (80%) is visible to the agent for model training,
$D_{\text{search}}$ (10%) provides fitness scores to the orchestrator; labels are hidden from agents,
$D_{\text{val}}$ (10%) is used only for final submission selection after search terminates,
$D_{\text{test}}$ (the competition's held-out test set) is used exclusively for final reporting.
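To see why a single fixed split matters, the toy comparison below scores two solutions either on freshly drawn random splits or on one shared split. This is a sketch under stated assumptions: the dataset size, the two accuracy levels, and the seeds are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-example correctness; solution A is truly better than solution B
correct_a = rng.random(n) < 0.80
correct_b = rng.random(n) < 0.78

def score(correct, idx):
    """Accuracy of a solution on the examples selected by idx."""
    return correct[idx].mean()

# (a) Stochastic splitting: every evaluation draws its own random 10% split
trials = 1000
flips = 0
for _ in range(trials):
    idx_a = rng.choice(n, size=n // 10, replace=False)
    idx_b = rng.choice(n, size=n // 10, replace=False)
    flips += int(score(correct_b, idx_b) > score(correct_a, idx_a))
print(f"reshuffled splits: B outscores A in {flips / trials:.0%} of comparisons")

# (b) Fixed split: all candidates are always scored on the same indices
fixed_idx = np.random.default_rng(42).permutation(n)[int(0.8 * n):int(0.9 * n)]
outcomes = {bool(score(correct_a, fixed_idx) > score(correct_b, fixed_idx))
            for _ in range(5)}
print(f"fixed split: distinct comparison outcomes = {outcomes}")
```

A fixed split does not remove the sampling error of a 10% holdout, but it makes every comparison between two candidates repeatable, so a candidate cannot survive selection merely because it happened to draw a favorable partition.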

The evaluation flow proceeds as follows: when a worker submits a solution, predictions are generated for all splits within a separate evaluation container. The orchestrator scores the submission on $D_{\text{search}}$ and uses this score as fitness. $D_{\text{val}}$ scores are computed but not used during search. After the time budget expires, the final submission is selected by $D_{\text{val}}$ score—providing a clean, unbiased selection signal independent of the search fitness.

The critical insight: the decoupling between $D_{\text{search}}$ (used to guide evolutionary search) and $D_{\text{val}}$ (used to select the final submission) means that even if the search process overfits to $D_{\text{search}}$, the final selection corrects for this by choosing the candidate that generalizes best to unseen validation data.
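The effect of decoupling can be illustrated with a toy simulation. It is a sketch only: the quality distribution, the size of the search-specific gain, and the noise level are hypothetical values chosen for illustration, and the model assumes that overfitting to $D_{\text{search}}$ inflates the search score without affecting true quality.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_candidates = 2000, 50
shape = (n_runs, n_candidates)

# Toy model; all magnitudes are hypothetical
true_quality = rng.normal(0.70, 0.02, size=shape)
# Gain that exists only on D_search (for example, from adapting to that split)
search_only_gain = np.abs(rng.normal(0.0, 0.08, size=shape))

def split_noise():
    """Sampling noise from scoring on a finite held-out split."""
    return rng.normal(0.0, 0.03, size=shape)

search_score = true_quality + search_only_gain + split_noise()
val_score = true_quality + split_noise()

rows = np.arange(n_runs)
by_search = true_quality[rows, search_score.argmax(axis=1)].mean()
by_val = true_quality[rows, val_score.argmax(axis=1)].mean()
print(f"mean true quality when selecting by D_search: {by_search:.4f}")
print(f"mean true quality when selecting by D_val:    {by_val:.4f}")
```

In this model, selecting by the $D_{\text{val}}$ score recovers candidates of higher true quality than selecting by the $D_{\text{search}}$ score, because the validation score is not contaminated by gains specific to the search split.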

# Pseudocode — no public implementation available
# Hidden Consistent Evaluation protocol from Hambardzumyan et al. (2026)

from dataclasses import dataclass

import numpy as np

@dataclass
class EvaluationResult:
    search_score: float   # fitness on D_search (used during evolution)
    val_score: float      # score on D_val (used only for final selection)
    test_score: float     # score on D_test (used only for reporting)

def create_fixed_splits(labeled_data, seed: int = 42):
    """Create deterministic, fixed data splits. Done once per task."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(labeled_data))
    n = len(indices)
    train_end = int(0.8 * n)
    search_end = int(0.9 * n)
    return {
        "D_train": indices[:train_end],        # 80% — visible to agent
        "D_search": indices[train_end:search_end],  # 10% — hidden labels, fitness
        "D_val": indices[search_end:],          # 10% — final selection only
    }

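def evaluate_candidate(candidate: Candidate, splits: dict, task) -> EvaluationResult:
    """Evaluate in isolated container. Agent never sees labels.

    This runs in a SEPARATE container from the worker,
    ensuring the agent cannot access evaluation data.
    run_in_eval_container and compute_metric are hypothetical helpers.
    """
    predictions = run_in_eval_container(candidate.code, task.data)

    result = EvaluationResult(
        search_score=compute_metric(predictions, task.labels, splits["D_search"]),
        val_score=compute_metric(predictions, task.labels, splits["D_val"]),
        test_score=compute_metric(predictions, task.labels, task.test_indices),
    )
    # Attach the result so that final selection can read val_score later
    candidate.eval_result = result
    return result

def select_final_submission(population: list[Candidate], split: str = "D_val",
                            higher_is_better: bool = True) -> Candidate:
    """After search terminates, select by D_val, NOT by D_search."""
    attr = {"D_val": "val_score", "D_search": "search_score"}[split]
    evaluated = [c for c in population if c.eval_result is not None]
    # Metrics such as RMSE are minimized, metrics such as AUC are maximized
    pick = max if higher_is_better else min
    return pick(evaluated, key=lambda c: getattr(c.eval_result, attr))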

32.3.4 ReAct Agent Operators

AIRA₂ replaces the static, template-based operators of its predecessor (AIRA-dojo used separate Draft, Improve, Debug, and EDA prompts) with autonomous ReAct agents (Yao et al., 2022). Each mutation or crossover operation is performed by a multi-step agent that determines its own action sequence based on the task context. A trajectory $\tau$ consists of interleaved reasoning, action, and observation steps:

$$\tau = (\text{Reason}_1, \text{Act}_1, \text{Obs}_1, \ldots, \text{Reason}_{K-1}, \text{Act}_{K-1}, \text{Obs}_{K-1}, \text{Reason}_K, \text{Act}_K)$$

where actions are either Python code executions or Bash commands in a sandboxed container, and observations include stdout/stderr along with execution duration. The trajectory terminates when the agent invokes a "submit" tool to send its solution to the orchestrator.
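The trajectory structure can be sketched as a short loop. This is an illustrative toy, not the paper's agent: `scripted_policy` stands in for the LLM, the `decision` dictionary format is an assumption made here, and `exec` in a shared namespace stands in for the sandboxed container (it provides no isolation).

```python
import contextlib
import io
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    reason: str
    action: str                        # Python source, or "" for the final submit
    observation: Optional[str] = None  # stdout/error text plus execution duration

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    submission: Optional[str] = None

def run_trajectory(policy: Callable[[list], dict], max_steps: int = 20) -> Trajectory:
    namespace: dict = {}               # persists across turns, like a stateful kernel
    traj = Trajectory()
    for _ in range(max_steps):
        decision = policy(traj.steps)  # stand-in for one LLM reasoning step
        if decision["tool"] == "submit":
            traj.steps.append(Step(decision["reason"], ""))  # Reason_K, Act_K; no observation
            traj.submission = decision["code"]
            break
        buf, error = io.StringIO(), ""
        start = time.perf_counter()
        try:
            with contextlib.redirect_stdout(buf):
                exec(decision["code"], namespace)   # NOT sandboxed; illustration only
        except Exception as exc:                    # the error text becomes the observation
            error = f"{type(exc).__name__}: {exc}"
        elapsed = time.perf_counter() - start
        observation = f"{(buf.getvalue() + error).strip()} [{elapsed:.2f}s]"
        traj.steps.append(Step(decision["reason"], decision["code"], observation))
    return traj

def scripted_policy(history: list) -> dict:
    """Hypothetical fixed script that mimics: try, observe an error, fix, submit."""
    script = [
        {"tool": "python", "reason": "Load data", "code": "xs = [1, 2, 3]"},
        {"tool": "python", "reason": "Compute the mean", "code": "print(sum(xs) / len(x))"},
        {"tool": "python", "reason": "Fix the name error", "code": "print(sum(xs) / len(xs))"},
        {"tool": "submit", "reason": "Output is correct", "code": "mean = sum(xs) / len(xs)"},
    ]
    return script[len(history)]

traj = run_trajectory(scripted_policy)
for i, step in enumerate(traj.steps, start=1):
    print(i, step.reason, "|", step.observation)
```

The second step fails with a `NameError`, the third step reuses the variable `xs` defined in the first step, and the fourth step ends the trajectory, so the run has the shape of $\tau$ with $K = 4$.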

The key advantages of ReAct agents over static operators are dynamic compute allocation and runtime control over scope: the agent, not a fixed prompt template, decides how much analysis, debugging, or experimentation a step warrants. On easy sub-tasks, agents submit quickly after a few reasoning steps. On hard sub-tasks, agents spend many turns debugging, experimenting, and iterating, so LLM compute is allocated roughly in proportion to difficulty. This is impossible with fixed, single-turn operators.

Capability	Static Operators (AIRA-dojo)	ReAct Agents (AIRA₂)
Exploratory data analysis	Pre-defined EDA prompt	Agent decides scope at runtime
Debugging	Separate Debug operator, no iterative access	Within same trajectory: observe traceback, hypothesize, re-execute
Resource allocation	Fixed compute per operator	Dynamic—more time on harder sub-problems
Local experimentation	Not supported	Agent can run experiments before committing
State persistence	Stateless between turns	Bash and Jupyter state persists across turns

A critical design detail: within the ReAct trajectory, "no additional guidance and instructions are provided" beyond the initial context (Hambardzumyan et al., 2026). The orchestrator provides the task description, parent solution code and its fitness score, optional second parent for crossover, and population metadata (scores, strategies attempted). The agent then autonomously determines all subsequent actions.

32.3.5 Crossover

With probability $p_c = 0.15$, the orchestrator selects two parents instead of one and dispatches a crossover task. The ReAct agent receives both parent solutions and their respective fitness scores, then produces a child solution that combines elements from both lineages. The low crossover probability reflects the difficulty of meaningful code-level crossover compared to mutation, while still providing a mechanism for combining independently discovered strategies.
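A minimal sketch of how the orchestrator might assemble the initial context for a mutation or crossover task follows. The field names (`task`, `operation`, `parents`, `population_summary`) and the `Parent` type are hypothetical; the source lists what the context contains but not its format.

```python
import random
from dataclasses import dataclass

@dataclass
class Parent:
    code: str
    fitness: float

def build_task_context(task_description, population, select,
                       crossover_prob=0.15, rng=random):
    """Return the initial context handed to a ReAct agent (hypothetical layout)."""
    parent = select(population)
    # With probability p_c a second parent is drawn and the task becomes a crossover
    second = select(population) if rng.random() < crossover_prob else None
    return {
        "task": task_description,
        "operation": "crossover" if second is not None else "mutation",
        "parents": [parent] + ([second] if second is not None else []),
        "population_summary": [{"fitness": p.fitness} for p in population],
    }

rng = random.Random(0)
population = [Parent("model_a.py", 0.71), Parent("model_b.py", 0.78)]
contexts = [
    build_task_context("Predict house prices", population,
                       select=lambda pop: max(pop, key=lambda p: p.fitness), rng=rng)
    for _ in range(10_000)
]
crossover_fraction = sum(c["operation"] == "crossover" for c in contexts) / len(contexts)
print(f"fraction of crossover tasks: {crossover_fraction:.3f}")
```

As in the orchestrator pseudocode, nothing here prevents the second parent from being the same individual as the first; whether AIRA₂ enforces distinct parents is not stated in the source.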

32.4 Key Results

32.4.1 Primary Results

All results below are reported by Hambardzumyan et al. (2026) on MLE-bench-30 using Gemini 3.0 Pro Preview, with 3 independent seeds per task and mean ± standard error intervals.

Time Budget	AIRA₂ (8 GPU)	Best Published Baseline	Gap
3h	59.9% ± 3.6	—	—
6h	65.5%	—	—
12h	68.8%	—	—
24h	71.8% ± 3.5	69.9% (MARS+)	+1.9 pp
72h	76.0% ± 3.4	—	+6.1 pp vs. 24h SOTA

At 72 hours, medal rates are: Bronze+ 61.1% ± 5.2, Silver+ 58.9% ± 5.2, Gold 36.7% ± 5.1. The paper chose percentile rank over medal rate as the primary metric because it is continuous (avoiding threshold effects near medal boundaries), captures the full distribution rather than binary outcomes, and has lower variance.
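The threshold effect can be seen in a small example. The definition of percentile rank used below (the percentage of leaderboard entries that a score beats) is an assumption for illustration, as are the leaderboard and the medal cutoff.

```python
def percentile_rank(score, leaderboard, higher_is_better=True):
    """Percentage of leaderboard entries beaten by score (assumed definition)."""
    if higher_is_better:
        beaten = sum(s < score for s in leaderboard)
    else:
        beaten = sum(s > score for s in leaderboard)
    return 100.0 * beaten / len(leaderboard)

def has_medal(score, threshold, higher_is_better=True):
    """Binary medal outcome relative to a single cutoff."""
    return score >= threshold if higher_is_better else score <= threshold

leaderboard = [0.50 + 0.005 * i for i in range(100)]  # hypothetical human scores
bronze_threshold = 0.90                                 # hypothetical cutoff

for s in (0.897, 0.899, 0.901):
    print(s, percentile_rank(s, leaderboard), has_medal(s, bronze_threshold))
```

Moving the score from 0.899 to 0.901 changes the percentile rank by a single point (80.0 to 81.0) but flips the medal outcome, which is the kind of boundary sensitivity that a continuous metric avoids.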

32.4.2 Ablation Analysis

The ablation study at 72 hours isolates the contribution of each architectural component:

Configuration	Percentile Rank (72h)	Δ from Full System
Full AIRA₂ (8 GPU, ReAct, HCE)	76.0%	—
No Subagents (static operators)	73.7%	−2.3 pp
1 GPU (with ReAct + HCE)	63.5%	−12.5 pp
No HCE	56.3%	−19.7 pp
No Evolution (Best-of-K, 8 GPU)	65.2%	−10.8 pp

Three findings from this ablation deserve detailed discussion:

HCE is the largest single contributor. Removing HCE drops performance by 19.7 percentage points. More critically, without HCE, performance slips from 56.8% at 24h to 56.3% at 72h, so the additional 48 hours of search buy nothing and slightly hurt when evaluation is unreliable. With HCE, performance improves monotonically. This supports the paper's central claim that previously reported "overfitting" was evaluation noise.

Parallelism without evolution is insufficient. The Best-of-K configuration runs 8 independent workers without information sharing (no selection, no crossover, no population). It achieves 65.2%—better than 1-GPU (63.5%) but far below the full system (76.0%). The paper shows that Best-of-K saturates at the same final performance as a single GPU given sufficient time, demonstrating that parallelism alone cannot substitute for evolutionary information sharing.

ReAct agents provide a moderate but consistent improvement. Replacing ReAct agents with static operators (the AIRA-dojo-style prompt templates) reduces performance by 2.3 percentage points. While smaller than the HCE or parallelism effects, this demonstrates that dynamic operators contribute meaningfully, especially on tasks requiring interactive debugging or exploratory data analysis.

32.4.3 Compute Scaling Analysis

The paper presents evidence for a compute scaling law in AI research agents by normalizing performance against cumulative GPU-hours:

At low GPU-hours (< 24), 1-GPU is slightly more efficient per GPU-hour because it avoids the overhead of building an initial population.
At 24+ GPU-hours, the 8-GPU configuration becomes increasingly efficient. The gap widens to 7.5 percentile rank points at 144 GPU-hours.
Performance improves approximately logarithmically with GPU-hours and shows no sign of saturation at 576 GPU-hours (8 GPUs × 72 hours).

This non-saturating behavior suggests that further scaling—more GPUs, longer time horizons, or both—would yield additional gains, paralleling scaling laws observed in LLM pretraining but applied to the meta-level of automated research capability.
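As a rough check of the logarithmic trend, one can fit a line in log GPU-hours to the five 8-GPU data points from the primary results table in Section 32.4.1. This is the present chapter's own back-of-the-envelope fit, not the paper's analysis, and five points cannot distinguish a logarithm from other slowly growing curves.

```python
import numpy as np

hours = np.array([3, 6, 12, 24, 72], dtype=float)     # wall-clock budgets from Section 32.4.1
gpu_hours = 8 * hours                                   # 8 worker GPUs
percentile = np.array([59.9, 65.5, 68.8, 71.8, 76.0])  # reported percentile ranks

# Least-squares fit of percentile = intercept + slope * ln(GPU-hours)
slope, intercept = np.polyfit(np.log(gpu_hours), percentile, deg=1)
pred = intercept + slope * np.log(gpu_hours)
r2 = 1 - ((percentile - pred) ** 2).sum() / ((percentile - percentile.mean()) ** 2).sum()

print(f"percentile ~ {intercept:.1f} + {slope:.2f} * ln(GPU-hours), R^2 = {r2:.3f}")
print(f"gain per doubling of compute ~ {slope * np.log(2):.1f} points")
```

The fit gives a slope of roughly 5 percentile points per e-fold of compute. Because percentile rank is bounded at 100, such a fit describes the measured range only and should not be extrapolated far beyond 576 GPU-hours.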

32.5 Implementation Details

32.5.1 Containerization

AIRA₂ uses Apptainer (formerly Singularity) containers with the Superimage environment, a comprehensive pre-installed ML development environment inherited from AIRA-dojo. The container lifecycle for each mutation is: (1) spawn a fresh Apptainer container with pre-installed Python, PyTorch, CUDA, and standard data science libraries; (2) mount parent solution code; (3) execute the ReAct agent trajectory with stateful bash and Jupyter sessions; (4) extract the submitted solution; (5) destroy the container; (6) evaluate the solution in a separate container via HCE.
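The worker-side steps of this lifecycle map naturally onto a single `apptainer exec` invocation. The sketch below only builds the argument list and does not run anything. The source confirms fakeroot mode and ephemeral containers; the remaining flags, the image name `superimage.sif`, and the `/workspace` mount point are assumptions for illustration.

```python
import shlex

def apptainer_exec_argv(image, workdir, command):
    """Build an `apptainer exec` argument list for one ephemeral worker container."""
    return [
        "apptainer", "exec",
        "--nv",               # expose the host NVIDIA GPU inside the container
        "--fakeroot",         # root-like privileges inside the container only
        "--containall",       # isolate filesystem, PID, IPC, and environment from the host
        "--writable-tmpfs",   # temporary writable layer, discarded when the container exits
        "--bind", f"{workdir}:/workspace",  # mount the parent solution code
        "--pwd", "/workspace",
        image,
        *command,
    ]

argv = apptainer_exec_argv("superimage.sif", "/tmp/worker_0", ["python", "solution.py"])
print(shlex.join(argv))
```

Passing the command as a list rather than a shell string avoids quoting problems when agent-generated file names contain spaces or special characters.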

A distinctive feature is stateful tool execution: unlike AIRA-dojo and AIDE, bash and Jupyter kernel state persists across turns within a trajectory. This enables iterative development workflows where an agent writes code, runs it, observes errors, fixes them, and re-runs—all within a single continuous session. Environment variables, working directory state, installed packages, and defined functions all persist.

32.5.2 LLM Configuration

All main experiments use Gemini 3.0 Pro Preview (Google DeepMind, 2025) as the reasoning engine. The paper does not report experiments with alternative LLMs. Baselines also use Gemini 3.0 Pro Preview (except ML-Master 2.0, which uses DeepSeek V3.2-Speciale), making the LLM dimension approximately controlled. The LLM is used exclusively within ReAct agent trajectories; there is no separate LLM call for evaluation, selection, or orchestration logic.

32.5.3 Cost Estimation

The following cost estimates are derived from the paper's hardware specifications and approximate 2026 cloud pricing. Meta likely uses internal GPU clusters, making actual costs lower.

Resource	Quantity	Duration	Estimated Cost
H200 GPUs (workers)	8	72h	~$8,000–12,000
H200 GPU (evaluation)	1	72h (intermittent)	~$500–1,000
LLM API (Gemini 3.0 Pro)	~1000s of trajectories	72h	~$500–2,000
Total per 30-task run			~$9,000–15,000

Note: These are the present author's estimates based on the paper's reported hardware (8× H200 GPUs) and approximate cloud pricing; they are not figures reported in the paper itself.

The high compute cost means that full MLE-bench-30 evaluation is accessible primarily to well-funded research labs. A single experimental condition (3 seeds × 30 tasks × 72 hours) represents a substantial investment, which partially explains the limited number of ablation variants reported.
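The scale of one experimental condition follows from simple arithmetic on the reported configuration. The sketch counts worker GPU-hours only (the evaluation GPU is used intermittently) and leaves the hourly price as a parameter, since no price is given in the paper; the rate in the usage line is a placeholder, not a quoted price.

```python
WORKERS, HOURS, TASKS, SEEDS = 8, 72, 30, 3

worker_hours_per_run = WORKERS * HOURS                          # one task, one seed
worker_hours_per_sweep = worker_hours_per_run * TASKS           # all 30 tasks, one seed
worker_hours_per_condition = worker_hours_per_sweep * SEEDS     # 3 seeds x 30 tasks

def gpu_cost(gpu_hours, usd_per_gpu_hour):
    """Cost in USD for a given number of GPU-hours at a caller-supplied rate."""
    return gpu_hours * usd_per_gpu_hour

print(worker_hours_per_run)        # 576
print(worker_hours_per_sweep)      # 17280
print(worker_hours_per_condition)  # 51840
print(gpu_cost(worker_hours_per_run, usd_per_gpu_hour=1.0))  # placeholder rate
```

These totals are the same whether the 30 tasks run one after another on 8 GPUs or concurrently on a larger cluster; only the wall-clock time differs.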

32.5.4 Reproducibility Assessment

Aspect	Status	Detail
Benchmark definition	Public	MLE-bench-30 is defined in the GPT-5 system card (Singh et al., 2025)
Evaluation protocol	Fully specified	80/10/10 split, externalized grading, deterministic given seeds
Hyperparameters	Fully reported	$T = 0.2$, $p_c = 0.15$, 9h execution cap
Hardware specification	Fully reported	8× H200, 12 CPU cores, 120 GB RAM per worker
Container environment	Partially public	Superimage from AIRA-dojo is publicly available
Statistical methodology	Adequate	3 seeds, standard error intervals
Source code	Not released	Not open-sourced at time of publication
LLM dependency	Proprietary API	Gemini 3.0 Pro Preview is versioned; behavior may change
Prompt content	Not fully reproduced	System prompts and ReAct instructions not published in full

The most significant reproducibility barrier is the combination of closed source code and proprietary LLM dependency. While the algorithmic components are well-specified, the substantial engineering required to replicate the asynchronous orchestrator, containerization system, and remote tool execution makes independent reproduction challenging. The predecessor AIRA-dojo was open-sourced, partially mitigating this concern since AIRA₂ builds on that infrastructure.

32.6 The Overfitting Diagnosis

The paper's treatment of overfitting in agent systems deserves separate discussion because it has implications beyond AIRA₂ itself.

32.6.1 The Problem

Toledo et al. (2025) observed that AIRA-dojo's performance degraded with extended search time, which they attributed to overfitting to validation data. This is a concerning finding for the field: if longer search hurts, then the fundamental premise of evolutionary approaches—that more exploration yields better solutions—is undermined.

32.6.2 The Diagnosis

AIRA₂'s authors performed oracle experiments comparing selection based on the validation set (as agents see it) versus selection based on the held-out test set. They found a 9–13% gap in medal rate between validation-selected and test-selected submissions, demonstrating that the validation signal was unreliable. Crucially, this was not classical overfitting (memorizing training data) but evaluation noise from the three sources described in Section 32.3.3.

The distinction matters: classical overfitting calls for algorithmic remedies such as regularization or early stopping, whereas evaluation noise calls for evaluation protocol design, a systems-level fix. HCE provides exactly this fix by ensuring consistent, externalized, agent-inaccessible evaluation.

32.6.3 Empirical Validation

The ablation data provides direct empirical validation. Without HCE (8 GPU, with ReAct agents), performance moves from 56.8% at 24 hours to 56.3% at 72 hours, a 0.5 pp decline despite three times the search budget. With HCE (same configuration otherwise), performance improves monotonically from 71.8% at 24 hours to 76.0% at 72 hours, a 4.2 pp gain. Because the two configurations differ only in the evaluation protocol, the contrast between the flat-to-declining trajectory and the improving one isolates HCE's effect.

This finding generalizes beyond AIRA₂: any agent system with noisy evaluation signals could benefit from a similar externalized evaluation protocol. The methodology for diagnosing evaluation noise versus true overfitting (oracle experiment comparing validation-selected versus test-selected outcomes) is itself a reusable contribution.
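The oracle diagnostic is straightforward to reproduce on any system that logs both a validation score and a test score per candidate. The sketch below uses synthetic scores; the function name `oracle_gap`, the noise levels, and the medal threshold are hypothetical.

```python
import numpy as np

def oracle_gap(val_scores, test_scores, medal_threshold):
    """Compare validation-selected and oracle (test-selected) submissions.

    Both inputs have shape (n_runs, n_candidates); higher scores are better.
    Returns (medal rate by validation, medal rate by oracle, gap).
    """
    rows = np.arange(val_scores.shape[0])
    picked_by_val = test_scores[rows, val_scores.argmax(axis=1)]
    picked_by_oracle = test_scores.max(axis=1)
    val_rate = float((picked_by_val >= medal_threshold).mean())
    oracle_rate = float((picked_by_oracle >= medal_threshold).mean())
    return val_rate, oracle_rate, oracle_rate - val_rate

rng = np.random.default_rng(0)
test = rng.normal(0.70, 0.05, size=(200, 30))          # hypothetical test scores
noisy_val = test + rng.normal(0.0, 0.05, size=test.shape)  # unreliable validation signal

print("noisy validation:", oracle_gap(noisy_val, test, medal_threshold=0.80))
print("perfect validation:", oracle_gap(test, test, medal_threshold=0.80))
```

The gap is zero when the validation signal ranks candidates exactly as the test set does and grows as the validation signal becomes noisier, so a large measured gap points to an unreliable selection signal rather than to a lack of good candidates in the population.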

32.7 Memory and Information Flow

32.7.1 Population as Implicit Memory

AIRA₂ maintains no explicit long-term memory, knowledge base, or skill library. Instead, the population serves as implicit memory: good strategies survive through fitness-based selection, and bad strategies are displaced. Cross-worker information transfer is mediated entirely through the population:

$$\text{Worker A submits} \xrightarrow{\text{evaluate}} \text{Population updated} \xrightarrow{\text{select}} \text{Worker B inherits A's improvements}$$

This is a deliberate simplification. Unlike some research agent systems that extract reusable skills or maintain per-task knowledge graphs, AIRA₂ treats each task as fully independent. There is no mechanism for transferring knowledge between tasks, summarizing lessons across the population, or building an explicit library of strategies. The population metadata (scores, strategies attempted) is injected into agent prompts, providing some implicit signal about what has been tried, but this falls short of structured memory.

32.7.2 Worker-Level State

Within a single trajectory, each ReAct agent maintains rich state through the conversation history, stateful bash session, and stateful Jupyter kernel. This within-trajectory memory enables complex multi-step workflows but is entirely ephemeral—it is discarded when the trajectory ends and the container is destroyed. Only the final submitted code survives.

32.8 Limitations and Discussion

32.8.1 Scope Limitations

AIRA₂ is evaluated exclusively on MLE-bench-30, a benchmark of Kaggle competitions with well-defined metrics. Several limitations follow from this scope:

No open-ended research: Real research often requires formulating problems, not just solving pre-defined competitions. AIRA₂ does not generate hypotheses, write papers, or perform literature review.
No cross-task transfer: Each of the 30 tasks starts from scratch. The system cannot leverage experience from one task to improve performance on another.
Single-benchmark evaluation: Results on MLE-bench-30 may not generalize to other benchmarks, especially those requiring different skills (e.g., theorem proving, scientific simulation).
Fixed LLM backbone: All experiments use a single LLM (Gemini 3.0 Pro Preview). The interaction between architecture choices and model capability is unexplored.

32.8.2 Methodological Considerations

The 24-hour comparison with baselines has a potential confound: AIRA₂ uses 8 GPUs while most baselines use fewer. While the Best-of-K ablation demonstrates that parallelism without evolution is insufficient, the 24-hour comparison is not strictly GPU-hour-normalized across all systems. The paper's compute efficiency analysis partially addresses this by showing 8-GPU superiority at matched GPU-hours, but this analysis is internal to AIRA₂ rather than a cross-system comparison.

The 3-seed statistical design provides reasonable uncertainty estimates but is modest given the high variance across tasks. Some tasks may have high inherent stochasticity that is not captured by 3 seeds.

32.8.3 Engineering Complexity

The system requires substantial infrastructure: asynchronous orchestrator, Apptainer container management, dedicated evaluation GPUs, remote tool execution, and multi-GPU coordination. This engineering complexity creates a barrier to adoption outside well-resourced labs. The fact that the source code is not released exacerbates this barrier.

32.8.4 Open Questions

Several directions are not explored in the paper:

Adaptive temperature: Would a temperature schedule that starts high (exploration) and decreases (exploitation) improve performance?
Population management: The paper does not discuss population size limits, diversity maintenance beyond crossover, or archive strategies.
Multi-model ensembles: Using different LLMs for different workers could provide implicit diversity, but this is untested.
Transfer learning across tasks: A meta-learning layer that extracts reusable strategies could amortize compute across tasks.

32.9 Relationship to Other Systems

AIRA₂ occupies a distinctive position in the landscape of LLM-powered evolutionary systems. It is a consumer of evolutionary search rather than a platform for it: the evolutionary algorithm is standard (rank selection, mutation, crossover), and the contribution lies in the infrastructure that makes evolution productive for ML research tasks.

Dimension	AIRA₂	AlphaEvolve	OpenEvolve
Primary goal	ML competition solving	General algorithm discovery	Open-source evolutionary platform
Search representation	Complete ML pipelines	Program fragments	Program fragments
Evaluation	HCE (externalized, hidden)	Automated scoring	Automated scoring
Parallelism	Async steady-state (8 GPU)	Async island model	Async workers
Operators	ReAct agents (dynamic)	LLM prompts (static)	LLM prompts (configurable)
Memory	Population only (implicit)	MAP-Elites archive	Program database
Code availability	Closed	Closed	Open source
Compute cost	~$10K per 30-task run	Not comparable (different tasks)	Variable

The most direct comparators are the concurrent MLE-bench agents: MARS/MARS+, MLEvolve, PiEvolve, FM-Agent 2.0, and ML-Master 2.0. Among these, AIRA₂ is unique in its explicit treatment of evaluation reliability (HCE) and its demonstration of monotonic compute scaling. Most competitors report only 24-hour results without investigating the scaling behavior that AIRA₂'s 72-hour experiments reveal.

32.10 Summary

Key Takeaway

AIRA₂ demonstrates that the primary obstacle to scaling AI research agents with compute is not algorithmic but infrastructural: unreliable evaluation causes extended search to degrade performance, not improve it. By externalizing evaluation (HCE), enabling asynchronous parallel exploration, and replacing static operators with autonomous ReAct agents, AIRA₂ achieves monotonically improving performance with compute—reaching 76.0% percentile rank on MLE-bench-30 at 72 hours.

Main contribution to the field: The Hidden Consistent Evaluation protocol resolves evaluation noise that masqueraded as overfitting in prior systems, enabling the first demonstration of a compute scaling law for ML research agents. The diagnosis methodology (oracle experiments comparing validation-selected vs. test-selected outcomes) is a reusable contribution applicable to any agent system with noisy evaluation signals.

Most important thing a researcher should know: If your evolutionary or agent-based system shows performance degradation with extended search, the cause may be evaluation noise rather than true overfitting. Before adding regularization or early stopping, test whether externalizing and fixing the evaluation protocol restores monotonic improvement. AIRA₂'s HCE protocol provides a concrete template for this investigation.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}