AIRA₂
Part P07: Autonomous Research Systems
32.1 Overview and Motivation
Automated machine learning research agents—systems that autonomously solve complex ML engineering tasks—face a fundamental tension between exploration breadth and evaluation reliability. As of early 2026, several systems compete on MLE-bench, a benchmark suite derived from Kaggle competitions that tests an agent's ability to produce complete ML pipelines. Despite rapid progress, prior systems exhibited a puzzling failure mode: performance degraded when given more compute time, suggesting that extended search was counterproductive.
AIRA₂ (Asynchronous multi-GPU research agent, second generation) addresses this failure mode through a systems-level analysis that identifies three structural bottlenecks in AI research agents and resolves each with a targeted architectural decision. The system achieves state-of-the-art results on MLE-bench-30, a curated 30-task subset used in the GPT-5 system card, reaching 71.8% percentile rank at 24 hours and 76.0% at 72 hours—the first demonstration that performance improves monotonically with compute in this setting (Hambardzumyan et al., 2026).
Key Contribution
AIRA₂'s primary contribution is a systems-level diagnosis and resolution of three bottlenecks that prevent AI research agents from scaling with compute. (1) Asynchronous multi-GPU steady-state evolution eliminates synchronization barriers, achieving approximately linear throughput scaling. (2) Hidden Consistent Evaluation (HCE) decouples search optimization from selection, revealing that previously reported "overfitting" was evaluation noise rather than true data memorization. (3) ReAct agent operators replace static single-turn prompts with autonomous multi-step reasoning trajectories. Together, these enable monotonically improving performance with compute—a prerequisite for meaningful compute scaling laws in research agents.
The paper explicitly frames itself as a systems contribution rather than an algorithmic advance: the evolutionary search uses standard temperature-scaled rank selection, and the LLM backbone (Gemini 3.0 Pro Preview) is shared with baselines. The innovation lies in the infrastructure that makes long-horizon evolutionary search productive. This framing distinguishes AIRA₂ from concurrent work such as MARS+ and FM-Agent 2.0, which focus on operator design or model capabilities.
32.1.1 Lineage and Context
AIRA₂ is the direct successor to AIRA-dojo (Toledo et al., 2025), which formalized automated ML research as a search problem with three components: a search policy selecting which candidates to expand, operators transforming candidates into new solutions, and an evaluation signal providing fitness. AIRA-dojo achieved 39.5% percentile rank on MLE-bench at 24 hours with a single GPU, establishing the evolutionary framework but also exposing the three bottlenecks that AIRA₂ resolves.
The competitive landscape on MLE-bench-30 as of March 2026 provides context for AIRA₂'s positioning:
| System | Percentile Rank (24h) | Year | Source |
|---|---|---|---|
| AIRA-dojo | 39.5% | 2025 | Toledo et al., 2025 |
| PiEvolve | 54.1% | 2025 | Botla et al., 2025 |
| ML-Master 2.0 | 57.6% | 2025 | Liu et al., 2025 |
| MARS | 60.4% | 2026 | Chen et al., 2026 |
| MLEvolve | 64.1% | 2025 | Du et al., 2025 |
| FM-Agent 2.0 | 69.6% | 2025 | Li et al., 2025 |
| MARS+ | 69.9% | 2026 | Chen et al., 2026 |
| AIRA₂ | 71.8% | 2026 | Hambardzumyan et al., 2026 |
| AIRA₂ (72h) | 76.0% | 2026 | Hambardzumyan et al., 2026 |
All systems except ML-Master 2.0 (which uses DeepSeek V3.2-Speciale) use Gemini 3.0 Pro Preview, making the LLM dimension approximately controlled across the comparison. The 24-hour budget is the standard comparison point; AIRA₂'s 72-hour results are unique to this paper, as other systems do not report extended-time performance.
32.2 Architecture
AIRA₂ uses a two-tier architecture: a global evolutionary orchestrator that maintains population state and performs selection, and an asynchronous pool of worker agents that execute multi-step reasoning trajectories to produce candidate solutions. A dedicated evaluation subsystem (HCE) provides fitness signals while preventing agents from accessing evaluation labels.
32.2.1 Design Decisions and Rationale
Several architectural decisions in AIRA₂ are explicitly motivated by the bottleneck analysis. The following table summarizes these decisions as reported in the paper:
| Decision | Rationale (paper-stated) |
|---|---|
| Steady-state evolution (not generational) | Workers never idle at synchronization barriers; fast mutations feed back immediately while slow ones do not block others |
| 1:1 static GPU-to-worker mapping | Eliminates dynamic scheduling complexity; each worker gets a clean-slate environment |
| Separate evaluation container | Agents never see labels; prevents metric gaming and evaluation feedback loops |
| In-memory population database | Fast access for the orchestrator; large artifacts spill to disk |
| Ephemeral Apptainer containers | Crashed containers do not affect the orchestrator or other workers |
| Fakeroot mode in containers | Agents can run apt install or pip install without actual root privileges |
32.2.2 Resource Allocation
Each worker is allocated a fixed resource budget as reported in the paper: one NVIDIA H200 GPU (141 GB VRAM), 12 logical CPU cores, and 120 GB system RAM. A dedicated evaluation GPU runs the HCE protocol. The full system for main experiments thus requires 9 GPUs: 8 workers plus 1 evaluator. A 9-hour hard time limit is imposed on each individual code execution, with a configurable global wall-clock limit (72 hours in the main experiments).
32.3 Core Algorithms
32.3.1 Asynchronous Steady-State Evolution
Unlike generational evolutionary algorithms where the entire population is replaced in synchronized waves, AIRA₂ uses steady-state evolution (Syswerda, 1991). When any worker becomes idle, the orchestrator immediately dispatches a new task using the current population state. Newly evaluated individuals are added to the population as soon as they complete, without waiting for other workers.
This design is particularly important for ML research tasks where mutation duration varies by orders of magnitude—from minutes for a hyperparameter tweak to hours for training a large model from scratch. In a generational scheme, fast workers would idle while waiting for the slowest member of each generation; steady-state evolution eliminates this waste entirely.
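The idle-time argument can be made concrete with a toy scheduling calculation. The sketch below is not from the paper: the function names and the mutation durations (in minutes) are hypothetical. It assumes a generational scheme waits for the slowest member of each batch, while a steady-state scheme hands the next task to whichever worker frees up first.

```python
import heapq


def generational_makespan(durations, n_workers):
    """Total time when each batch of n_workers tasks ends at a barrier."""
    total = 0
    for start in range(0, len(durations), n_workers):
        # Every worker in the batch waits for the slowest task.
        total += max(durations[start:start + n_workers])
    return total


def steady_state_makespan(durations, n_workers):
    """Total time when the next task goes to the first worker that is free."""
    free_at = [0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finish = 0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical workload: mostly quick tweaks, two long training runs (minutes).
durations = [5, 5, 5, 240, 5, 5, 5, 240]
print(generational_makespan(durations, n_workers=4))  # 480
print(steady_state_makespan(durations, n_workers=4))  # 250
```

In this toy workload the barrier nearly doubles wall-clock time, because three workers sit idle for most of each generation while the long run finishes.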
The following pseudocode illustrates the orchestrator's main loop:
```python
# Pseudocode: no public implementation available.
# Illustrates the steady-state evolutionary loop described in Hambardzumyan et al. (2026).
# rank_select, run_worker, and select_final_submission are defined or assumed elsewhere.
import asyncio
import random
from dataclasses import dataclass, field


@dataclass
class Candidate:
    code: str
    fitness: float
    parent_ids: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    eval_result: object = None  # EvaluationResult attached by the HCE evaluator


async def orchestrator_loop(
    population: list[Candidate],
    workers: list,                        # pool of async worker handles
    evaluator,                            # HCE evaluation service
    temperature: float = 0.2,
    crossover_prob: float = 0.15,
    wall_clock_limit: float = 72 * 3600,
) -> Candidate:
    """Steady-state evolutionary orchestrator.

    Workers are dispatched as soon as they become idle.
    No synchronization barriers between workers.
    """
    loop = asyncio.get_running_loop()
    start_time = loop.time()
    idle_workers = asyncio.Queue()
    for w in workers:
        await idle_workers.put(w)
    in_flight = set()  # hold references so pending tasks are not garbage-collected

    while (loop.time() - start_time) < wall_clock_limit:
        # Wait for any worker to become available
        worker = await idle_workers.get()

        # Select parent(s) via temperature-scaled rank selection
        parent = rank_select(population, temperature)
        second_parent = None
        if random.random() < crossover_prob:
            second_parent = rank_select(population, temperature)

        # Dispatch task asynchronously; this does not block other workers
        task = asyncio.create_task(
            run_worker(worker, parent, second_parent,
                       evaluator, population, idle_workers)
        )
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    # Final selection uses D_val scores, not D_search scores
    return select_final_submission(population, split="D_val")
```
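For readers who want to run the dispatch pattern, here is a self-contained toy version. It is a sketch under loud assumptions, not the authors' code: `toy_worker` and `toy_orchestrator` are hypothetical names, a "mutation" is a short random sleep that perturbs a scalar fitness, parent selection is greedy rather than rank-based, and the loop stops after a fixed number of dispatches instead of a wall-clock limit.

```python
import asyncio
import random


async def toy_worker(worker_id, parent, population, idle, rng):
    """Simulate one mutation with a variable duration, then report back."""
    await asyncio.sleep(rng.choice([0.001, 0.005]))   # fast or slow mutation
    population.append(parent + rng.uniform(-0.1, 0.2))  # child joins immediately
    await idle.put(worker_id)                         # worker is idle again


async def toy_orchestrator(n_workers=4, n_dispatches=20, seed=0):
    rng = random.Random(seed)
    population = [0.0]                 # fitness values only, for brevity
    idle = asyncio.Queue()
    for worker_id in range(n_workers):
        idle.put_nowait(worker_id)

    in_flight = set()                  # keep task references alive
    for _ in range(n_dispatches):
        worker_id = await idle.get()   # resumes as soon as ANY worker is free
        parent = max(population)       # greedy stand-in for rank selection
        task = asyncio.create_task(
            toy_worker(worker_id, parent, population, idle, rng)
        )
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    await asyncio.gather(*in_flight)   # drain the remaining workers
    return population


population = asyncio.run(toy_orchestrator())
print(len(population))  # 21: the seed individual plus one child per dispatch
```

The orchestrator only ever waits on the idle queue, so a slow worker delays nothing except its own next assignment.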
32.3.2 Temperature-Scaled Rank Selection
AIRA₂ uses rank-based parent selection with a temperature parameter that controls the exploration–exploitation tradeoff. Given a population of $N$ individuals, each individual $i$ is assigned a rank $r_i$ where $r_i = 1$ denotes the best individual. The selection probability is:

$$p(i) = \frac{(N - r_i + 1)^{1/T}}{\sum_{j=1}^{N} (N - r_j + 1)^{1/T}}$$

where:
- $N$ is the current population size,
- $r_i$ is the rank of individual $i$ (1 = highest fitness),
- $T$ is the temperature parameter (default $T = 0.2$).
The temperature governs selection pressure: as $T \to 0$, selection becomes greedy (only the best individual is selected); as $T \to \infty$, selection becomes uniform (all individuals equally likely). The default $T = 0.2$ strongly biases selection toward high-fitness individuals while maintaining diversity. Rank-based selection is chosen over fitness-proportionate selection because ranks are invariant to the magnitude and scale of fitness scores, which vary widely across Kaggle tasks (e.g., RMSE vs. AUC vs. accuracy).
```python
# Pseudocode: no public implementation available.
# Temperature-scaled rank selection from Hambardzumyan et al. (2026), Eq. in Section 3.
# Candidate is the dataclass from the orchestrator sketch in Section 32.3.1.
import numpy as np


def rank_select(population: list["Candidate"], temperature: float = 0.2) -> "Candidate":
    """Select a parent using temperature-scaled rank-based probabilities.

    Lower temperature → stronger exploitation (prefer top-ranked).
    Higher temperature → more exploration (more uniform).
    """
    N = len(population)

    # Sort by fitness (descending), assign ranks 1..N
    sorted_pop = sorted(population, key=lambda c: c.fitness, reverse=True)

    # Compute unnormalized selection weights
    # rank 1 (best) gets weight (N)^(1/T), rank N (worst) gets weight 1^(1/T)
    weights = np.array([(N - rank + 1) ** (1.0 / temperature)
                        for rank in range(1, N + 1)])

    # Normalize to probabilities
    probabilities = weights / weights.sum()

    # Sample one parent
    idx = np.random.choice(N, p=probabilities)
    return sorted_pop[idx]
```
To illustrate the effect of temperature under weights $(N - r_i + 1)^{1/T}$: with $N = 10$ and $T = 0.2$, the best-ranked individual receives $10^5 / \sum_{k=1}^{10} k^5 = 100000 / 220825 \approx 45\%$ of the selection probability mass, while with $T = 1.0$ it receives $10/55 \approx 18\%$. The paper does not report experiments with adaptive temperature schedules.
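The probability mass at any temperature can be checked numerically. The helper below is a small sketch (the name `selection_probabilities` is ours) that applies the same weights as the `rank_select` pseudocode, $(N - r_i + 1)^{1/T}$, and prints the share of the best and worst ranks.

```python
import numpy as np


def selection_probabilities(n: int, temperature: float) -> np.ndarray:
    """Selection probability for ranks 1..n (index 0 is the best rank)."""
    ranks = np.arange(1, n + 1)
    weights = (n - ranks + 1.0) ** (1.0 / temperature)
    return weights / weights.sum()


for t in (0.2, 1.0, 5.0):
    p = selection_probabilities(10, t)
    print(f"T={t}: best rank {p[0]:.3f}, worst rank {p[-1]:.6f}")
```

Raising $T$ flattens the distribution toward uniform ($1/N$ per individual), and lowering it concentrates mass on the top ranks.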
32.3.3 Hidden Consistent Evaluation (HCE)
HCE is arguably AIRA₂'s most important contribution, both as a practical safeguard and as an experimental diagnostic tool. The protocol addresses three sources of evaluation noise that the authors identified in prior systems:
- Implementation bugs inflating metrics: Agents sometimes produce code with data leakage or other bugs that report artificially high validation scores.
- Brittle output parsing: Missing or erroneous score extraction from agent output leads to incorrect fitness signals.
- Stochastic data splitting: Random seeds for train/validation splits introduce variance; inferior solutions can survive selection due to favorable random partitions.
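The third noise source is easy to underestimate. The simulation below is an illustrative sketch with hypothetical numbers, not data from the paper: two candidates with true accuracies of 0.80 and 0.78 are each scored on their own random validation draw of 500 examples, and we count how often the weaker one scores strictly higher.

```python
import numpy as np


def inferior_win_rate(acc_good=0.80, acc_bad=0.78, n_val=500,
                      trials=10_000, seed=0):
    """Fraction of trials in which the weaker model looks better."""
    rng = np.random.default_rng(seed)
    # Each candidate is scored on an independent random validation draw.
    good = rng.binomial(n_val, acc_good, size=trials) / n_val
    bad = rng.binomial(n_val, acc_bad, size=trials) / n_val
    return float(np.mean(bad > good))


print(f"weaker candidate wins in {inferior_win_rate():.1%} of trials")
```

With a gap of two accuracy points and a validation set of this size, the weaker candidate wins a substantial minority of comparisons, and an evolutionary search makes thousands of such comparisons.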
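HCE eliminates all three by externalizing evaluation entirely. The available labeled data for each task, $D_{\text{labeled}}$, is partitioned once (deterministically) into three fixed, disjoint splits:

$$D_{\text{labeled}} = D_{\text{train}} \cup D_{\text{search}} \cup D_{\text{val}}, \qquad |D_{\text{train}}| : |D_{\text{search}}| : |D_{\text{val}}| = 80 : 10 : 10$$

where: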
- $D_{\text{train}}$ (80%) is visible to the agent for model training,
- $D_{\text{search}}$ (10%) provides fitness scores to the orchestrator; labels are hidden from agents,
- $D_{\text{val}}$ (10%) is used only for final submission selection after search terminates,
- $D_{\text{test}}$ (the competition's held-out test set) is used exclusively for final reporting.
The evaluation flow proceeds as follows: when a worker submits a solution, predictions are generated for all splits within a separate evaluation container. The orchestrator scores the submission on $D_{\text{search}}$ and uses this score as fitness. $D_{\text{val}}$ scores are computed but not used during search. After the time budget expires, the final submission is selected by $D_{\text{val}}$ score—providing a clean, unbiased selection signal independent of the search fitness.
The critical insight: the decoupling between $D_{\text{search}}$ (used to guide evolutionary search) and $D_{\text{val}}$ (used to select the final submission) means that even if the search process overfits to $D_{\text{search}}$, the final selection corrects for this by choosing the candidate that generalizes best to unseen validation data.
```python
# Pseudocode: no public implementation available.
# Hidden Consistent Evaluation protocol from Hambardzumyan et al. (2026).
# run_in_eval_container and compute_metric are hypothetical helpers; Candidate is
# the dataclass from Section 32.3.1, with an eval_result attribute set after scoring.
from dataclasses import dataclass

import numpy as np


@dataclass
class EvaluationResult:
    search_score: float  # fitness on D_search (used during evolution)
    val_score: float     # score on D_val (used only for final selection)
    test_score: float    # score on D_test (used only for reporting)


def create_fixed_splits(labeled_data, seed: int = 42):
    """Create deterministic, fixed data splits. Done once per task."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(labeled_data))
    n = len(indices)
    train_end = int(0.8 * n)
    search_end = int(0.9 * n)
    return {
        "D_train": indices[:train_end],             # 80%: visible to agent
        "D_search": indices[train_end:search_end],  # 10%: hidden labels, fitness
        "D_val": indices[search_end:],              # 10%: final selection only
    }


def evaluate_candidate(candidate, splits: dict, task) -> EvaluationResult:
    """Evaluate in isolated container. Agent never sees labels.

    This runs in a SEPARATE container from the worker,
    ensuring the agent cannot access evaluation data.
    """
    predictions = run_in_eval_container(candidate.code, task.data)
    return EvaluationResult(
        search_score=compute_metric(predictions, task.labels, splits["D_search"]),
        val_score=compute_metric(predictions, task.labels, splits["D_val"]),
        test_score=compute_metric(predictions, task.labels, task.test_indices),
    )


def select_final_submission(population, split: str = "D_val"):
    """After search terminates, select by D_val, NOT by D_search."""
    score_attr = {"D_val": "val_score", "D_search": "search_score"}[split]
    return max(population, key=lambda c: getattr(c.eval_result, score_attr))
```
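Why hold out $D_{\text{val}}$ at all, when $D_{\text{search}}$ is already hidden from agents? Because any score used to pick a winner becomes optimistic for that winner. The simulation below is a sketch with hypothetical parameters (50 equally good candidates, Gaussian scoring noise). It picks the best candidate by search score and compares the winner's search score with its score on an independent split.

```python
import numpy as np


def selection_bias(n_candidates=50, noise=0.02, trials=2000, seed=0):
    """Return (search_bias, val_bias) of the search-selected winner."""
    rng = np.random.default_rng(seed)
    true_quality = 0.80  # every candidate is equally good in this toy setup
    shape = (trials, n_candidates)
    search = true_quality + rng.normal(0.0, noise, size=shape)
    val = true_quality + rng.normal(0.0, noise, size=shape)

    winners = search.argmax(axis=1)   # select by search score
    rows = np.arange(trials)
    search_bias = search[rows, winners].mean() - true_quality
    val_bias = val[rows, winners].mean() - true_quality
    return float(search_bias), float(val_bias)


search_bias, val_bias = selection_bias()
print(f"winner's search score is inflated by {search_bias:+.3f}")
print(f"winner's independent-split score is off by {val_bias:+.3f}")
```

The winner's search score overstates its quality even though no candidate is better than any other, while the independent split stays close to the truth. Scores on a split that the search never optimized against are therefore a cleaner basis for the final pick.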
32.3.4 ReAct Agent Operators
AIRA₂ replaces the static, template-based operators of its predecessor (AIRA-dojo used separate Draft, Improve, Debug, and EDA prompts) with autonomous ReAct agents (Yao et al., 2022). Each mutation or crossover operation is performed by a multi-step agent that determines its own action sequence based on the task context. A trajectory $\tau$ consists of interleaved reasoning, action, and observation steps:

$$\tau = (z_1, a_1, o_1, \; z_2, a_2, o_2, \; \ldots, \; z_K, a_K, o_K)$$

where $z_k$ is the reasoning text at step $k$, $a_k$ is the action, and $o_k$ is the resulting observation (the symbols are this chapter's notation). Actions are either Python code executions or Bash commands in a sandboxed container, and observations include stdout/stderr along with execution duration. The trajectory terminates when the agent invokes a "submit" tool to send its solution to the orchestrator.
The key advantages of ReAct agents over static operators are dynamic compute allocation and scope engineering. On easy sub-tasks, agents submit quickly after a few reasoning steps. On hard sub-tasks, agents spend many turns debugging, experimenting, and iterating—naturally allocating LLM compute proportional to difficulty. This is impossible with fixed, single-turn operators.
| Capability | Static Operators (AIRA-dojo) | ReAct Agents (AIRA₂) |
|---|---|---|
| Exploratory data analysis | Pre-defined EDA prompt | Agent decides scope at runtime |
| Debugging | Separate Debug operator, no iterative access | Within same trajectory: observe traceback, hypothesize, re-execute |
| Resource allocation | Fixed compute per operator | Dynamic—more time on harder sub-problems |
| Local experimentation | Not supported | Agent can run experiments before committing |
| State persistence | Stateless between turns | Bash and Jupyter state persists across turns |
A critical design detail: within the ReAct trajectory, "no additional guidance and instructions are provided" beyond the initial context (Hambardzumyan et al., 2026). The orchestrator provides the task description, parent solution code and its fitness score, optional second parent for crossover, and population metadata (scores, strategies attempted). The agent then autonomously determines all subsequent actions.
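The control flow of such a trajectory can be sketched in a few lines. Everything here is illustrative rather than the paper's implementation: `react_trajectory`, `run_python`, and `scripted_policy` are hypothetical names, the LLM is replaced by a fixed script, only a Python tool is modeled (no Bash, no container), and statefulness is imitated by reusing one namespace dictionary across steps.

```python
import contextlib
import io
import traceback


def run_python(code, namespace):
    """Execute code in a persistent namespace; return stdout plus any traceback."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, namespace)
        except Exception:
            buffer.write(traceback.format_exc())
    return buffer.getvalue()


def react_trajectory(policy, context, max_steps=10):
    """Loop reason -> act -> observe until the policy submits."""
    namespace = {}   # survives across steps, like a Jupyter kernel
    history = []     # (thought, tool, payload, observation) tuples
    for _ in range(max_steps):
        thought, tool, payload = policy(context, history)
        if tool == "submit":
            history.append((thought, tool, payload, ""))
            return payload, history
        observation = run_python(payload, namespace)
        history.append((thought, tool, payload, observation))
    return None, history   # step budget exhausted without a submission


def scripted_policy(context, history):
    """Stand-in for the LLM: a fixed three-step script."""
    script = [
        ("Try a first draft.", "python",
         "rows = [1, 2, 3]\nprint(sum(rows) / count)"),
        ("NameError: define count from the persisted rows.", "python",
         "count = len(rows)\nprint(sum(rows) / count)"),
        ("Works; submit.", "submit", "print(sum([1, 2, 3]) / 3)"),
    ]
    return script[len(history)]


solution, history = react_trajectory(scripted_policy, context="toy task")
print(history[0][3].splitlines()[-1])  # NameError: name 'count' is not defined
print(history[1][3].strip())           # 2.0
```

The second step succeeds only because `rows` from the failed first step is still in the namespace, which is the kind of in-trajectory debugging a stateless single-turn operator cannot do.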
32.3.5 Crossover
With probability $p_c = 0.15$, the orchestrator selects two parents instead of one and dispatches a crossover task. The ReAct agent receives both parent solutions and their respective fitness scores, then produces a child solution that combines elements from both lineages. The low crossover probability reflects the difficulty of meaningful code-level crossover compared to mutation, while still providing a mechanism for combining independently discovered strategies.
32.4 Key Results
32.4.1 Primary Results
All results below are reported by Hambardzumyan et al. (2026) on MLE-bench-30 using Gemini 3.0 Pro Preview, with 3 independent seeds per task and mean ± standard error intervals.
| Time Budget | AIRA₂ (8 GPU) | Best Published Baseline | Gap |
|---|---|---|---|
| 3h | 59.9% ± 3.6 | — | — |
| 6h | 65.5% | — | — |
| 12h | 68.8% | — | — |
| 24h | 71.8% ± 3.5 | 69.9% (MARS+) | +1.9 pp |
| 72h | 76.0% ± 3.4 | — | +6.1 pp vs. 24h SOTA |
At 72 hours, medal rates are: Bronze+ 61.1% ± 5.2, Silver+ 58.9% ± 5.2, Gold 36.7% ± 5.1. The paper chose percentile rank over medal rate as the primary metric because it is continuous (avoiding threshold effects near medal boundaries), captures the full distribution rather than binary outcomes, and has lower variance.
32.4.2 Ablation Analysis
The ablation study at 72 hours isolates the contribution of each architectural component:
| Configuration | Percentile Rank (72h) | Δ from Full System |
|---|---|---|
| Full AIRA₂ (8 GPU, ReAct, HCE) | 76.0% | — |
| No Subagents (static operators) | 73.7% | −2.3 pp |
| 1 GPU (with ReAct + HCE) | 63.5% | −12.5 pp |
| No HCE | 56.3% | −19.7 pp |
| No Evolution (Best-of-K, 8 GPU) | 65.2% | −10.8 pp |
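The Δ column can be recomputed directly from the reported percentile ranks. This short check uses only the numbers in the table above and sorts the ablations by impact.

```python
full_system = 76.0  # Full AIRA₂, percentile rank at 72h

ablations = {
    "No Subagents (static operators)": 73.7,
    "1 GPU (with ReAct + HCE)": 63.5,
    "No HCE": 56.3,
    "No Evolution (Best-of-K, 8 GPU)": 65.2,
}

# Difference from the full system, in percentage points.
deltas = {name: round(score - full_system, 1) for name, score in ablations.items()}

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {delta:+.1f} pp")
```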
Three findings from this ablation deserve detailed discussion:
HCE is the largest single contributor. Removing HCE drops performance by 19.7 percentage points. More critically, without HCE, performance slips from 56.8% at 24h to 56.3% at 72h: two additional days of search yield no gain, and possibly a small loss, when evaluation is unreliable. With HCE, performance improves monotonically. This supports the paper's central claim that previously reported "overfitting" was evaluation noise.
Parallelism without evolution is insufficient. The Best-of-K configuration runs 8 independent workers without information sharing (no selection, no crossover, no population). It achieves 65.2%—better than 1-GPU (63.5%) but far below the full system (76.0%). The paper shows that Best-of-K saturates at the same final performance as a single GPU given sufficient time, demonstrating that parallelism alone cannot substitute for evolutionary information sharing.
ReAct agents provide a moderate but consistent improvement. Replacing ReAct agents with static operators (the AIRA-dojo-style prompt templates) reduces performance by 2.3 percentage points. While smaller than the HCE or parallelism effects, this demonstrates that dynamic operators contribute meaningfully, especially on tasks requiring interactive debugging or exploratory data analysis.
32.4.3 Compute Scaling Analysis
The paper presents evidence for a compute scaling law in AI research agents by normalizing performance against cumulative GPU-hours:
- At low GPU-hours (< 24), 1-GPU is slightly more efficient per GPU-hour because it avoids the overhead of building an initial population.
- At 24+ GPU-hours, the 8-GPU configuration becomes increasingly efficient. The gap widens to 7.5 percentile rank points at 144 GPU-hours.
- Performance improves approximately logarithmically with GPU-hours and shows no sign of saturation at 576 GPU-hours (8 GPUs × 72 hours).
This non-saturating behavior suggests that further scaling—more GPUs, longer time horizons, or both—would yield additional gains, paralleling scaling laws observed in LLM pretraining but applied to the meta-level of automated research capability.
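As a rough check on the "approximately logarithmic" description, one can fit a line to percentile rank against the log of GPU-hours using the five 8-GPU means from Section 32.4.1. This is a descriptive sketch over five reported averages, not the paper's scaling analysis, and it ignores the standard errors.

```python
import numpy as np

# 8-GPU results reported in Section 32.4.1 (mean percentile rank).
hours = np.array([3, 6, 12, 24, 72], dtype=float)
percentile = np.array([59.9, 65.5, 68.8, 71.8, 76.0])
gpu_hours = 8 * hours  # 24 to 576 worker GPU-hours

# Least-squares fit of: percentile = intercept + slope * ln(gpu_hours)
slope, intercept = np.polyfit(np.log(gpu_hours), percentile, deg=1)
per_doubling = slope * np.log(2)

print(f"slope per unit of ln(GPU-hours): {slope:.2f}")
print(f"gain per doubling of GPU-hours: {per_doubling:.2f} pp")
```

A log-linear fit predicts a fixed gain per doubling of compute, so it cannot hold indefinitely on a metric capped at 100%. Extrapolating beyond the measured 576 GPU-hours is speculative.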
32.5 Implementation Details
32.5.1 Containerization
AIRA₂ uses Apptainer (formerly Singularity) containers with the Superimage environment, a comprehensive pre-installed ML development environment inherited from AIRA-dojo. The container lifecycle for each mutation is: (1) spawn a fresh Apptainer container with pre-installed Python, PyTorch, CUDA, and standard data science libraries; (2) mount parent solution code; (3) execute the ReAct agent trajectory with stateful bash and Jupyter sessions; (4) extract the submitted solution; (5) destroy the container; (6) evaluate the solution in a separate container via HCE.
A distinctive feature is stateful tool execution: unlike AIRA-dojo and AIDE, bash and Jupyter kernel state persists across turns within a trajectory. This enables iterative development workflows where an agent writes code, runs it, observes errors, fixes them, and re-runs—all within a single continuous session. Environment variables, working directory state, installed packages, and defined functions all persist.
32.5.2 LLM Configuration
All main experiments use Gemini 3.0 Pro Preview (Google DeepMind, 2025) as the reasoning engine. The paper does not report experiments with alternative LLMs. Baselines also use Gemini 3.0 Pro Preview (except ML-Master 2.0, which uses DeepSeek V3.2-Speciale), making the LLM dimension approximately controlled. The LLM is used exclusively within ReAct agent trajectories; there is no separate LLM call for evaluation, selection, or orchestration logic.
32.5.3 Cost Estimation
The following cost estimates are derived from the paper's hardware specifications and approximate 2026 cloud pricing. Meta likely uses internal GPU clusters, making actual costs lower.
| Resource | Quantity | Duration | Estimated Cost |
|---|---|---|---|
| H200 GPUs (workers) | 8 | 72h | ~$8,000–12,000 |
| H200 GPU (evaluation) | 1 | 72h (intermittent) | ~$500–1,000 |
| LLM API (Gemini 3.0 Pro) | ~1000s of trajectories | 72h | ~$500–2,000 |
| Total per 30-task run | | | ~$9,000–15,000 |
Note: These are the present author's estimates based on the paper's reported hardware (8× H200 GPUs) and approximate cloud pricing; they are not figures reported in the paper itself.
The high compute cost means that full MLE-bench-30 evaluation is accessible primarily to well-funded research labs. A single experimental condition (3 seeds × 30 tasks × 72 hours) represents a substantial investment, which partially explains the limited number of ablation variants reported.
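The GPU-hour accounting behind any such estimate can be written out explicitly. The sketch below counts worker GPU-hours only, on the assumption that every task runs for the full 72 hours on 8 workers. The function names are ours, and the hourly price is deliberately left as a caller-supplied parameter because no rate is given in the paper.

```python
def worker_gpu_hours(workers: int = 8, hours: int = 72,
                     tasks: int = 30, seeds: int = 3) -> dict:
    """Worker GPU-hours at three levels of aggregation."""
    per_task_run = workers * hours              # one task, one seed
    per_benchmark_pass = per_task_run * tasks   # all 30 tasks, one seed
    per_condition = per_benchmark_pass * seeds  # all tasks, all seeds
    return {
        "per_task_run": per_task_run,
        "per_benchmark_pass": per_benchmark_pass,
        "per_condition": per_condition,
    }


def estimated_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Linear cost model; usd_per_gpu_hour is an assumption the reader supplies."""
    return gpu_hours * usd_per_gpu_hour


usage = worker_gpu_hours()
print(usage)
# {'per_task_run': 576, 'per_benchmark_pass': 17280, 'per_condition': 51840}
```

Any dollar estimate therefore scales linearly with the assumed hourly rate and with the number of seeds and ablation conditions.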
32.5.4 Reproducibility Assessment
| Aspect | Status | Detail |
|---|---|---|
| Benchmark definition | Public | MLE-bench-30 is defined in the GPT-5 system card (Singh et al., 2025) |
| Evaluation protocol | Fully specified | 80/10/10 split, externalized grading, deterministic given seeds |
| Hyperparameters | Fully reported | $T = 0.2$, $p_c = 0.15$, 9h execution cap |
| Hardware specification | Fully reported | 8× H200, 12 CPU cores, 120 GB RAM per worker |
| Container environment | Partially public | Superimage from AIRA-dojo is publicly available |
| Statistical methodology | Adequate | 3 seeds, standard error intervals |
| Source code | Not released | Not open-sourced at time of publication |
| LLM dependency | Proprietary API | Gemini 3.0 Pro Preview is versioned; behavior may change |
| Prompt content | Not fully reproduced | System prompts and ReAct instructions not published in full |
The most significant reproducibility barrier is the combination of closed source code and proprietary LLM dependency. While the algorithmic components are well-specified, the substantial engineering required to replicate the asynchronous orchestrator, containerization system, and remote tool execution makes independent reproduction challenging. The predecessor AIRA-dojo was open-sourced, partially mitigating this concern since AIRA₂ builds on that infrastructure.
32.6 The Overfitting Diagnosis
The paper's treatment of overfitting in agent systems deserves separate discussion because it has implications beyond AIRA₂ itself.
32.6.1 The Problem
Toledo et al. (2025) observed that AIRA-dojo's performance degraded with extended search time, which they attributed to overfitting to validation data. This is a concerning finding for the field: if longer search hurts, then the fundamental premise of evolutionary approaches—that more exploration yields better solutions—is undermined.
32.6.2 The Diagnosis
AIRA₂'s authors performed oracle experiments comparing selection based on the validation set (as agents see it) versus selection based on the held-out test set. They found a 9–13% gap in medal rate between validation-selected and test-selected submissions, demonstrating that the validation signal was unreliable. Crucially, this was not classical overfitting (memorizing training data) but evaluation noise from the three sources described in Section 32.3.3.
The distinction matters: classical overfitting calls for algorithmic remedies such as regularization or early stopping. Evaluation noise calls for evaluation protocol design, a systems-level fix rather than an algorithmic one. HCE provides exactly this fix by ensuring consistent, externalized, agent-inaccessible evaluation.
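The oracle comparison reduces to a simple computation per task: pick a submission by validation score, pick another by test score, and measure what the validation-based pick gives up. The sketch below uses made-up scores for three hypothetical candidates. The paper reports the gap in medal rate across tasks, whereas this toy version measures a raw score gap on one task.

```python
def selection_gap(candidates):
    """Test-score loss from selecting by validation instead of by test (oracle)."""
    val_pick = max(candidates, key=lambda c: c["val"])
    oracle_pick = max(candidates, key=lambda c: c["test"])
    return oracle_pick["test"] - val_pick["test"]


# Hypothetical scores, higher is better. Candidate "a" has an inflated
# validation score (for example from leakage) but the worst test score.
candidates = [
    {"name": "a", "val": 0.91, "test": 0.84},
    {"name": "b", "val": 0.88, "test": 0.87},
    {"name": "c", "val": 0.86, "test": 0.85},
]

print(round(selection_gap(candidates), 2))  # 0.03
```

A gap that stays large across many tasks points to an unreliable selection signal. The oracle is a diagnostic only, since test labels are unavailable to a deployed system.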
32.6.3 Empirical Validation
The ablation data provides direct empirical support. Without HCE (8 GPU, with ReAct agents), performance moves from 56.8% at 24 hours to 56.3% at 72 hours, so extended search with unreliable evaluation fails to help and slightly lowers the score. With HCE (same configuration otherwise), performance improves monotonically from 71.8% at 24 hours to 76.0% at 72 hours. Because the two configurations differ only in the evaluation protocol, the contrast between a flat-to-declining trajectory and an improving one isolates HCE's effect.
This finding generalizes beyond AIRA₂: any agent system with noisy evaluation signals could benefit from a similar externalized evaluation protocol. The methodology for diagnosing evaluation noise versus true overfitting (oracle experiment comparing validation-selected versus test-selected outcomes) is itself a reusable contribution.
32.7 Memory and Information Flow
32.7.1 Population as Implicit Memory
AIRA₂ maintains no explicit long-term memory, knowledge base, or skill library. Instead, the population serves as implicit memory: good strategies survive through fitness-based selection, and bad strategies are displaced. Cross-worker information transfer is mediated entirely through the population: a child evaluated for one worker joins the population, and any later dispatch, to any worker, can select it as a parent.
This is a deliberate simplification. Unlike some research agent systems that extract reusable skills or maintain per-task knowledge graphs, AIRA₂ treats each task as fully independent. There is no mechanism for transferring knowledge between tasks, summarizing lessons across the population, or building an explicit library of strategies. The population metadata (scores, strategies attempted) is injected into agent prompts, providing some implicit signal about what has been tried, but this falls short of structured memory.
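To make "population metadata injected into agent prompts" concrete, here is one way an initial context could be assembled. The actual prompts are not published, so the `Entry` fields, the `build_context` function, and the text layout are all hypothetical. The sketch only mirrors the ingredients the chapter lists: task description, parent code and fitness, optional second parent, and a summary of what the population has tried.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    code: str
    fitness: float
    strategy: str  # hypothetical one-line summary of the approach


def build_context(task_description, parent, population,
                  second_parent=None, max_entries=5):
    """Assemble the initial context handed to a worker agent."""
    lines = [
        f"Task: {task_description}",
        "",
        f"Parent solution (fitness {parent.fitness:.4f}):",
        parent.code,
    ]
    if second_parent is not None:  # crossover dispatch
        lines += [
            "",
            f"Second parent (fitness {second_parent.fitness:.4f}):",
            second_parent.code,
        ]
    lines += ["", "Population summary (best first):"]
    ranked = sorted(population, key=lambda e: e.fitness, reverse=True)
    for entry in ranked[:max_entries]:
        lines.append(f"- {entry.fitness:.4f}: {entry.strategy}")
    return "\n".join(lines)


population = [
    Entry("model = GBM()", 0.81, "gradient boosting on raw features"),
    Entry("model = MLP()", 0.77, "small MLP with standard scaling"),
]
context = build_context("Predict churn (AUC)", population[0], population)
print(context)
```

Nothing else crosses between workers or survives a trajectory, which is why this summary falls short of structured memory.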
32.7.2 Worker-Level State
Within a single trajectory, each ReAct agent maintains rich state through the conversation history, stateful bash session, and stateful Jupyter kernel. This within-trajectory memory enables complex multi-step workflows but is entirely ephemeral—it is discarded when the trajectory ends and the container is destroyed. Only the final submitted code survives.
32.8 Limitations and Discussion
32.8.1 Scope Limitations
AIRA₂ is evaluated exclusively on MLE-bench-30, a benchmark of Kaggle competitions with well-defined metrics. Several limitations follow from this scope:
- No open-ended research: Real research often requires formulating problems, not just solving pre-defined competitions. AIRA₂ does not generate hypotheses, write papers, or perform literature review.
- No cross-task transfer: Each of the 30 tasks starts from scratch. The system cannot leverage experience from one task to improve performance on another.
- Single-benchmark evaluation: Results on MLE-bench-30 may not generalize to other benchmarks, especially those requiring different skills (e.g., theorem proving, scientific simulation).
- Fixed LLM backbone: All experiments use a single LLM (Gemini 3.0 Pro Preview). The interaction between architecture choices and model capability is unexplored.
32.8.2 Methodological Considerations
The 24-hour comparison with baselines has a potential confound: AIRA₂ uses 8 GPUs while most baselines use fewer. While the Best-of-K ablation demonstrates that parallelism without evolution is insufficient, the 24-hour comparison is not strictly GPU-hour-normalized across all systems. The paper's compute efficiency analysis partially addresses this by showing 8-GPU superiority at matched GPU-hours, but this analysis is internal to AIRA₂ rather than a cross-system comparison.
The 3-seed statistical design provides reasonable uncertainty estimates but is modest given the high variance across tasks. Some tasks may have high inherent stochasticity that is not captured by 3 seeds.
32.8.3 Engineering Complexity
The system requires substantial infrastructure: asynchronous orchestrator, Apptainer container management, dedicated evaluation GPUs, remote tool execution, and multi-GPU coordination. This engineering complexity creates a barrier to adoption outside well-resourced labs. The fact that the source code is not released exacerbates this barrier.
32.8.4 Open Questions
Several directions are not explored in the paper:
- Adaptive temperature: Would a temperature schedule that starts high (exploration) and decreases (exploitation) improve performance?
- Population management: The paper does not discuss population size limits, diversity maintenance beyond crossover, or archive strategies.
- Multi-model ensembles: Using different LLMs for different workers could provide implicit diversity, but this is untested.
- Transfer learning across tasks: A meta-learning layer that extracts reusable strategies could amortize compute across tasks.
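The adaptive-temperature question is simple to prototype. The schedule below is purely hypothetical and untested on MLE-bench: it interpolates geometrically from an assumed exploratory start ($T = 1.0$) to the paper's default ($T = 0.2$) over the wall-clock budget. The function name and the choice of geometric decay are our assumptions.

```python
def annealed_temperature(elapsed: float, budget: float,
                         t_start: float = 1.0, t_end: float = 0.2) -> float:
    """Geometric interpolation from t_start (at time 0) to t_end (at budget)."""
    fraction = min(max(elapsed / budget, 0.0), 1.0)  # clamp to [0, 1]
    return t_start * (t_end / t_start) ** fraction


budget_hours = 72
for hour in (0, 18, 36, 54, 72):
    print(hour, round(annealed_temperature(hour, budget_hours), 3))
```

The returned value would replace the fixed `temperature` argument at each dispatch. Whether it helps is an empirical question the paper leaves open.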
32.9 Relationship to Other Systems
AIRA₂ occupies a distinctive position in the landscape of LLM-powered evolutionary systems. It is a consumer of evolutionary search rather than a platform for it: the evolutionary algorithm is standard (rank selection, mutation, crossover), and the contribution lies in the infrastructure that makes evolution productive for ML research tasks.
| Dimension | AIRA₂ | AlphaEvolve | OpenEvolve |
|---|---|---|---|
| Primary goal | ML competition solving | General algorithm discovery | Open-source evolutionary platform |
| Search representation | Complete ML pipelines | Program fragments | Program fragments |
| Evaluation | HCE (externalized, hidden) | Automated scoring | Automated scoring |
| Parallelism | Async steady-state (8 GPU) | Async island model | Async workers |
| Operators | ReAct agents (dynamic) | LLM prompts (static) | LLM prompts (configurable) |
| Memory | Population only (implicit) | MAP-Elites archive | Program database |
| Code availability | Closed | Closed | Open source |
| Compute cost | ~$10K per 30-task run | Not comparable (different tasks) | Variable |
The most direct comparators are the concurrent MLE-bench agents: MARS/MARS+, MLEvolve, PiEvolve, FM-Agent 2.0, and ML-Master 2.0. Among these, AIRA₂ is unique in its explicit treatment of evaluation reliability (HCE) and its demonstration of monotonic compute scaling. Most competitors report only 24-hour results without investigating the scaling behavior that AIRA₂'s 72-hour experiments reveal.
32.10 Summary
Key Takeaway
AIRA₂ demonstrates that the primary obstacle to scaling AI research agents with compute is not algorithmic but infrastructural: unreliable evaluation causes extended search to degrade performance, not improve it. By externalizing evaluation (HCE), enabling asynchronous parallel exploration, and replacing static operators with autonomous ReAct agents, AIRA₂ achieves monotonically improving performance with compute—reaching 76.0% percentile rank on MLE-bench-30 at 72 hours.
Main contribution to the field: The Hidden Consistent Evaluation protocol resolves evaluation noise that masqueraded as overfitting in prior systems, enabling the first demonstration of a compute scaling law for ML research agents. The diagnosis methodology (oracle experiments comparing validation-selected vs. test-selected outcomes) is a reusable contribution applicable to any agent system with noisy evaluation signals.
Most important thing a researcher should know: If your evolutionary or agent-based system shows performance degradation with extended search, the cause may be evaluation noise rather than true overfitting. Before adding regularization or early stopping, test whether externalizing and fixing the evaluation protocol restores monotonic improvement. AIRA₂'s HCE protocol provides a concrete template for this investigation.