AIRA₂
Asynchronous multi-GPU research agent with evolutionary search, Hidden Consistent Evaluation, and ReAct operators that achieves state-of-the-art on MLE-bench-30.
- Organization: FAIR at Meta / University College London / University of Oxford
- Published: March 27, 2026
- Type: paper (arXiv:2603.26499)
- Report Type: PhD-Level Technical Analysis
- Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
AIRA₂: Overcoming Bottlenecks in AI Research Agents
- Venue: arXiv preprint (cs.AI), March 2026
- DOI: 10.48550/arXiv.2603.26499
- License: CC-BY 4.0
- Predecessor: AIRA-dojo (Toledo et al., 2025) — the previous best open-source agent on MLE-bench
- Benchmark: MLE-bench-30 (curated 30-task subset of MLE-bench, used in GPT-5 system card)
- Relation to prior work: Direct successor to AIRA-dojo; builds on the three-bottleneck formalization from Toledo et al. (2025) and addresses each bottleneck with a targeted architectural choice
The paper explicitly frames itself as a systems paper: the contribution is not a new search algorithm or LLM capability, but a set of engineering decisions that resolve identified structural bottlenecks preventing scaling of AI research agents.
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Karen Hambardzumyan* | FAIR at Meta, UCL | Equal contribution (lead) |
| Nicolas Baldwin* | FAIR at Meta | Equal contribution |
| Edan Toledo* | FAIR at Meta, UCL | Equal contribution; lead of predecessor AIRA-dojo |
| Rishi Hazra* | FAIR at Meta | Equal contribution |
| Michael Kuchnik* | FAIR at Meta | Equal contribution |
| Bassel Al Omari | FAIR at Meta | — |
| Thomas Simon Foster | FAIR at Meta, Oxford | — |
| Anton Protopopov | FAIR at Meta | — |
| Jean-Christophe Gagnon-Audet | FAIR at Meta | — |
| Ishita Mediratta | FAIR at Meta | — |
| Kelvin Niu | FAIR at Meta | — |
| Michael Shvartsman | FAIR at Meta | — |
| Alisia Lupidi | FAIR at Meta, Oxford | — |
| Alexis Audran-Reiss | FAIR at Meta | — |
| Parth Pathak | FAIR at Meta | — |
| Tatiana Shavrina | FAIR at Meta | — |
| Despoina Magka | FAIR at Meta | — |
| Hela Momand | FAIR at Meta | — |
| Derek Dunfield | FAIR at Meta | — |
| Nicola Cancedda | FAIR at Meta | — |
| Pontus Stenetorp | UCL | Academic advisor |
| Carole-Jean Wu | FAIR at Meta | Senior leadership |
| Jakob Nicolaus Foerster | FAIR at Meta, Oxford | Academic advisor |
| Yoram Bachrach | FAIR at Meta | Senior researcher |
| Martin Josifoski* | FAIR at Meta | Equal contribution; correspondence author |
*Equal contribution — author order determined by Mario Kart placement (as noted in the paper).
Team size: 25 authors across FAIR at Meta, University College London, and University of Oxford. This is a large industrial research team with significant compute access, reflecting the resource-intensive nature of the problem.
3 Core Contribution
AIRA₂ identifies and resolves three structural bottlenecks in AI research agents, each with a targeted architectural solution:
| Bottleneck | Problem | AIRA₂ Solution |
|---|---|---|
| Compute Throughput | Synchronous single-GPU execution starves exploration; agents are "sample-bound" (≈1–20 candidates/day on compute-heavy tasks) | Asynchronous multi-GPU worker pool with steady-state evolution; 8 GPUs yield ≈8× throughput |
| Generalization Gap | Validation-based selection causes overfitting; oracle experiments show 9–13% medal rate improvement with test-based selection | Hidden Consistent Evaluation (HCE) protocol: fixed splits, hidden labels, decoupled selection |
| Static Operator Limitation | Fixed single-turn LLM prompts cannot adapt to novel task demands; advanced search cannot compensate for shallow operators | ReAct agents with dynamic scoping and interactive debugging replace all static operators |
Key insight: The paper demonstrates that "overfitting" reported in prior work (Toledo et al., 2025) was actually evaluation noise — not true data memorization. HCE eliminates this noise, enabling performance to improve monotonically with compute.
The contribution is primarily systems-level design rather than a new algorithm. The evolutionary search itself is standard (temperature-scaled rank selection); the innovation lies in the infrastructure that makes long-horizon evolutionary search actually work.
Relationship to the Field
The paper positions itself within a rapid progression of ML research agents competing on MLE-bench:
Timeline of MLE-bench agents (2025–2026), ordered by 24h percentile rank:
─────────────────────────────────────────────────────
AIRA-dojo (2025)     → 39.5% PR @ 24h (single GPU)
PiEvolve (2025)      → 54.1% PR @ 24h
ML-Master 2.0 (2025) → 57.6% PR @ 24h
MARS (2026)          → 60.4% PR @ 24h
MLEvolve (2025)      → 64.1% PR @ 24h
FM-Agent 2.0 (2025)  → 69.6% PR @ 24h
MARS+ (2026)         → 69.9% PR @ 24h ← previous SOTA
AIRA₂ (2026)         → 71.8% PR @ 24h → 76.0% @ 72h ← new SOTA
─────────────────────────────────────────────────────
4 Supported Solutions
Problem Framing
AIRA₂ frames automated ML research as a search problem over a graph of candidate solutions. This decomposition was formalized in AIRA-dojo:
- Search policy: Selects which node (candidate solution) to expand
- Operators: Transform a node into new candidates (mutation, crossover)
- Evaluation signal: Provides fitness to guide the search
Solution Space
The system operates on Kaggle competition solutions as the unit of search. Each candidate is a complete ML pipeline: data loading, preprocessing, feature engineering, model architecture, training loop, hyperparameter configuration, and prediction formatting.
Search Methods Supported
| Method | Description | Status in AIRA₂ |
|---|---|---|
| Steady-state evolution | Asynchronous; new individuals added as workers complete | Primary method |
| Temperature-scaled rank selection | Rank-based parent selection with tunable exploration | Default (T=0.2) |
| Mutation | Single-parent ReAct agent trajectory | Primary operator |
| Crossover | Two-parent ReAct agent trajectory | 15% probability |
| Greedy / Best-of-K | Ablation baseline; parallel without information sharing | Ablation only |
What AIRA₂ Does NOT Do
- No neural architecture search (NAS) as a separate subsystem
- No explicit hyperparameter optimization (handled implicitly by ReAct agents)
- No meta-learning or transfer across tasks
- No automated paper writing or research ideation — focused exclusively on competition solution engineering
5 LLM Integration
Primary Model
| Parameter | Value |
|---|---|
| Model | Gemini 3.0 Pro Preview (Google DeepMind, 2025) |
| Role | Powers all ReAct agent workers |
| Context | Multi-turn with stateful tool execution |
| Alternatives tested | Not reported (all main results use Gemini 3.0 Pro) |
Note: Baselines also use Gemini 3.0 Pro Preview, except ML-Master 2.0 which uses DeepSeek V3.2-Speciale. This makes the comparison fair on the LLM dimension.
LLM Usage Pattern
The LLM is used exclusively as the reasoning engine within ReAct agent trajectories:
ReAct Trajectory τ:
Reason₁ → Act₁ → Obs₁ → Reason₂ → Act₂ → Obs₂ → ... → Reason_K → Submit
Actions: Python code execution, Bash commands (in sandboxed container)
Observations: stdout/stderr, execution duration, file contents
Final: "submit" tool sends solution to orchestrator
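The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `call_llm`, `run_tool`, `Step`, and the step budget are all hypothetical names and values, since the actual tool protocol is not published.

```python
from dataclasses import dataclass

@dataclass
class Step:
    reason: str       # LLM reasoning text
    action: str       # serialized tool call, e.g. a bash or Jupyter request
    observation: str  # stdout/stderr, timing, file contents

def run_trajectory(call_llm, run_tool, context, max_steps=50):
    """Reason -> Act -> Observe until the agent calls the submit tool."""
    history = []
    for _ in range(max_steps):
        reason, action = call_llm(context, history)
        if action["tool"] == "submit":          # final action: hand the
            return action["solution"], history  # solution to the orchestrator
        obs = run_tool(action)                  # stateful bash / Jupyter call
        history.append(Step(reason, str(action), obs))
    raise TimeoutError("trajectory exceeded step budget")
```

Note that no guidance is injected inside the loop: the only inputs are the initial context and the agent's own accumulated history, matching the paper's "no additional guidance within trajectory" design.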
Key design decisions about LLM integration:
- No prompt engineering per operator: Unlike AIRA-dojo which had separate Draft/Improve/Debug prompts, AIRA₂ gives the ReAct agent a single context (parent solution + task description + metadata) and lets it decide what to do
- No additional guidance within trajectory: "Within the ReAct trajectory, no additional guidance and instructions are provided" — the agent autonomously determines its action sequence
- Stateful tool execution: Unlike AIDE and AIRA-dojo, bash and Jupyter kernel tools maintain state across turns, enabling iterative debugging
- Execution duration feedback: Tool outputs include timing information, enabling agents to monitor their own efficiency
Prompt Structure
The orchestrator provides the ReAct agent with:
- Task description and data schema
- Parent solution code and its fitness score
- (For crossover) Second parent solution
- Population metadata (scores, strategies attempted)
The agent then autonomously executes a multi-step trajectory of reasoning and code execution.
6 Key Results
Primary Metric: Percentile Rank on MLE-bench-30
| Time Budget | AIRA₂ (8 GPU) | Best Baseline | Gap |
|---|---|---|---|
| 3h | 59.9% ± 3.6 | — | — |
| 6h | 65.5% | — | — |
| 12h | 68.8% | — | — |
| 24h | 71.8% ± 3.5 | 69.9% (MARS+) | +1.9 pp |
| 72h | 76.0% ± 3.4 | — | +6.1 pp vs 24h SOTA |
Medal Rates at 72 Hours
| Medal Tier | Rate |
|---|---|
| Bronze+ | 61.1% ± 5.2 |
| Silver+ | 58.9% ± 5.2 |
| Gold | 36.7% ± 5.1 |
Ablation Results (72h)
| Configuration | Percentile Rank | Δ from Full |
|---|---|---|
| Full AIRA₂ (8 GPU) | 76.0% | — |
| No Subagents (8 GPU, static operators) | 73.7% | −2.3 pp |
| 1 GPU (with subagents + HCE) | 63.5% | −12.5 pp |
| No HCE (8 GPU, with subagents) | 56.3% | −19.7 pp |
| No Evolution (Best-of-K, 8 GPU) | 65.2% | −10.8 pp |
Critical finding: Removing HCE causes performance to degrade from 24h to 72h (56.8% → 56.3%), confirming that without reliable evaluation, longer search actively hurts. With HCE, performance monotonically improves.
Compute Efficiency Analysis
The paper normalizes performance by cumulative GPU-hours to compare 1-GPU vs 8-GPU:
- At low GPU-hours (< 24), 1-GPU is slightly more efficient (no overhead of building initial population)
- At 24+ GPU-hours, 8-GPU becomes increasingly efficient
- The gap widens to 7.5 percentile rank points at 144 GPU-hours
- Parallel compute without evolution (Best-of-K) saturates — it converges to the same final performance as 1-GPU, showing that parallelism alone is insufficient; evolutionary information sharing is essential
Statistical Methodology
- 3 independent seeds per task, mean ± standard error reported
- Percentile Rank chosen over medal rate as primary metric because it is:
  - Continuous (not discrete thresholds)
  - Captures the full distribution (not a binary medal outcome)
  - Avoids threshold effects near medal boundaries that amplify noise
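As a concrete illustration of the metric, percentile rank can be computed against a competition leaderboard roughly as follows. The exact MLE-bench tie-handling and direction conventions are not spelled out in this report, so treat this as an assumption:

```python
def percentile_rank(agent_score, leaderboard, higher_is_better=True):
    """Percentage of human leaderboard entries the agent's score beats.

    Tie-handling is an assumption (ties count as not beaten); the
    report does not reproduce the exact MLE-bench formula.
    """
    if not higher_is_better:          # e.g. error metrics: lower is better
        agent_score = -agent_score
        leaderboard = [-s for s in leaderboard]
    beaten = sum(1 for s in leaderboard if agent_score > s)
    return 100.0 * beaten / len(leaderboard)
```

Because this is a continuous fraction of the leaderboard, a small solution improvement moves the metric smoothly, whereas a medal rate only changes when a discrete threshold is crossed.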
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Benchmark | MLE-bench-30 is publicly defined (used in GPT-5 system card) |
| Evaluation protocol | HCE fully specified (80/10/10 split, externalized grading) |
| Hyperparameters | Temperature T=0.2, crossover probability pc=15% fully reported |
| Hardware | 8× NVIDIA H200 GPUs, 12 CPU cores/worker, 120GB RAM/worker |
| Container environment | Apptainer containers with Superimage (publicly available) |
| Statistical methodology | 3 seeds, SE intervals, proper normalization |
Limitations
| Aspect | Concern |
|---|---|
| Code availability | Not open-sourced at time of publication |
| LLM dependency | Gemini 3.0 Pro Preview is a proprietary, versioned API — exact behavior may change |
| Infrastructure complexity | Requires custom orchestrator, containerization system, remote tool execution — significant engineering to replicate |
| Cost | 8× H200 GPUs for 72 hours is expensive (≈$9,000–15,000 per 30-task evaluation; see Section 8) |
| Prompt content | System prompts and ReAct agent instructions not fully reproduced in paper |
| Data splits | Specific split indices not published (though the splitting procedure is deterministic given seeds) |
Predecessor Availability
AIRA-dojo (the predecessor) was open-sourced, including the Superimage container environment. This partially mitigates reproducibility concerns since AIRA₂ builds on that infrastructure.
8 Compute and API Costs
Hardware Configuration
Per-worker allocation:
┌──────────────────────────┐
│ 1× NVIDIA H200 (141GB) │
│ 12 logical CPU cores │
│ 120GB system RAM │
│ Dedicated GPU mapping │
└──────────────────────────┘
Full system (main experiments):
┌────────────────────────────────┐
│ 8× H200 workers │
│ 1× orchestrator (CPU-only) │
│ 1× evaluation GPU (dedicated) │
│ Total: ~9 GPUs active │
└────────────────────────────────┘
Estimated Costs per Full Evaluation
| Resource | Quantity | Duration | Estimated Cost |
|---|---|---|---|
| H200 GPUs (workers) | 8 | 72h | ~$8,000–12,000 (cloud pricing) |
| H200 GPU (evaluation) | 1 | 72h (intermittent) | ~$500–1,000 |
| LLM API (Gemini 3.0 Pro) | ~1000s of trajectories | 72h | ~$500–2,000 (estimated) |
| Total per 30-task run | — | — | ~$9,000–15,000 |
These are rough estimates based on 2026 cloud H200 pricing. Meta likely uses internal GPU clusters, making actual costs lower.
Throughput Statistics
- Single GPU: ≈1–20 candidates evaluated per day (task-dependent)
- 8 GPUs: ≈8× throughput (approximately linear scaling demonstrated)
- Execution cap: 9-hour hard time limit per individual code execution
- Scaling: Paper claims linear throughput scaling with GPU count; no diminishing returns observed up to 8 GPUs
Cost-Effectiveness Argument
The paper argues that 8-GPU parallelism is not merely "doing the same thing faster" — the evolutionary information sharing between workers creates solutions that a single GPU cannot reach regardless of time:
"Parallelism without information sharing (Best-of-K) saturates early, converging to the same final performance as the single-GPU agent."
9 Architecture Solution
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ AIRA₂ SYSTEM ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ EVOLUTIONARY ORCHESTRATOR │ │
│ │ │ │
│ │ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Population P │ │ Selection │ │ Dispatch │ │ │
│ │ │ (candidates, │ │ (rank-based,│ │ (async, │ │ │
│ │ │ scores, │ │ T=0.2) │ │ on-demand)│ │ │
│ │ │ metadata) │ │ │ │ │ │ │
│ │ └──────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ┌─────┴─────┐ ┌────┴────┐ ┌────┴────┐ │
│ ▼ ▼ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Worker 1│ │Worker 2│ │Worker 3│ ··· │Worker N│ │
│ │ ReAct │ │ ReAct │ │ ReAct │ │ ReAct │ │
│ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Dev │ │Dev │ │Dev │ │Dev │ │
│ │Contanr │ │Contanr │ │Contanr │ │Contanr │ │
│ │(1×GPU) │ │(1×GPU) │ │(1×GPU) │ │(1×GPU) │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ └──────────┴──────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ HIDDEN CONSISTENT│ │
│ │ EVALUATION (HCE) │ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Eval Contanr │ │ │
│ │ │ (1×GPU) │ │ │
│ │ └──────────────┘ │ │
│ │ │ │
│ │ D_search → score │ │
│ │ D_val → select │ │
│ │ D_test → report │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Two-Tier Design
The architecture comprises two tiers:
- Global Orchestrator: Maintains population state, performs selection, dispatches mutation/crossover tasks to available workers, integrates evaluated results back into the population
- Asynchronous Worker Pool: N independent ReAct agents, each with a dedicated GPU and ephemeral container, executing multi-step reasoning trajectories to produce candidate solutions
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Steady-state evolution (not generational) | Workers never idle waiting for synchronization barriers; fast mutations complete and feed back immediately |
| 1:1 GPU-worker mapping (static allocation) | Eliminates dynamic scheduling complexity; clean-slate environments for every execution |
| Separate evaluation container | Agents never see labels; prevents metric gaming and feedback loops |
| In-memory program database | Fast access for orchestrator; large artifacts spill to disk |
| Ephemeral Apptainer containers | Crashed containers don't affect orchestrator or other workers |
| Fakeroot mode | Agents can apt install or pip install without actual root privileges |
10 Component Breakdown
Component 1: Evolutionary Orchestrator
Purpose: Coordinate search over population of candidate solutions.
| Parameter | Value |
|---|---|
| Selection | Temperature-scaled rank-based |
| Temperature T | 0.2 (biased toward exploitation) |
| Crossover probability | 15% |
| Population storage | In-memory with disk spillover |
| Communication | Asynchronous; no synchronization barriers |
Selection Formula:
p(i) = (N - rᵢ + 1)^(1/T) / Σⱼ (N - rⱼ + 1)^(1/T)
Where:
- N = population size
- rᵢ = rank of individual i (1 = best)
- T = temperature (T→0: greedy, T→∞: uniform)
Rank-based selection is chosen over fitness-proportionate because ranks are invariant to magnitude and scale of fitness scores, which vary widely across Kaggle tasks.
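The selection rule above is simple enough to state directly in Python. The following sketch uses the reported default T = 0.2; the helper names are ours, not the paper's:

```python
import random

def rank_selection_probs(ranks, T=0.2):
    """p(i) ∝ (N - r_i + 1)^(1/T), with rank 1 = best individual."""
    N = len(ranks)
    weights = [(N - r + 1) ** (1.0 / T) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]

def select_parent(population, scores, T=0.2, rng=random):
    """Rank individuals by score (higher = fitter), then sample a parent."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return rng.choices(population, weights=rank_selection_probs(ranks, T))[0]
```

At T = 0.2 the exponent 1/T = 5 concentrates probability heavily on the top ranks (for N = 4, the best individual is picked about 79% of the time), while raising T flattens the distribution toward uniform, matching the limits stated above. Because only ranks enter the formula, the same T behaves consistently whether a task's metric ranges over [0, 1] or over thousands.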
Component 2: Asynchronous Worker Pool
Purpose: Execute ReAct agent trajectories in parallel.
Each worker:
1. Receives parent solution(s) + task context from orchestrator
2. Executes multi-step ReAct trajectory (Reason → Act → Observe loop)
3. Submits candidate solution via "submit" tool
4. Returns to the pool; orchestrator evaluates and integrates the result
Container Environment (Superimage):
- Pre-installed: Python, PyTorch, CUDA, standard data science stack
- Fakeroot mode: perceived root for apt/pip installations
- Isolated: crashed containers don't affect other workers
- Stateful: Bash and Jupyter kernel state persists across turns within a trajectory
Component 3: Hidden Consistent Evaluation (HCE)
Purpose: Provide reliable evaluation signal for search; decouple search optimization from final selection.
Data Partitioning (created once, reused across all seeds):
Available labeled data
├── D_train (80%) → visible to agent for model training
├── D_search (10%) → used by orchestrator for fitness; labels hidden from agents
└── D_val (10%) → used ONLY for final selection after search terminates
And separately:
D_test → held-out competition test set (never used during search or selection)
Evaluation flow:
1. Worker submits solution → predictions generated for all splits
2. Orchestrator evaluates on D_search → fitness score returned to search
3. D_val scores computed but NOT used during search
4. After search terminates, final submission selected by D_val score
5. D_test used only for final reporting
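A minimal sketch of the split construction and score gating follows. The 80/10/10 proportions and determinism come from the paper; the seeded-shuffle procedure and class names are our assumptions:

```python
import random

def make_hce_splits(example_ids, seed=0):
    """Deterministic 80/10/10 partition, created once and reused across
    all runs and seeds (the 'consistent' part of HCE)."""
    ids = sorted(example_ids)                  # canonical order first
    random.Random(seed).shuffle(ids)           # then a fixed shuffle
    n = len(ids)
    n_search, n_val = n // 10, n // 10
    return {
        "search": ids[:n_search],                   # fitness; labels hidden
        "val":    ids[n_search:n_search + n_val],   # final selection only
        "train":  ids[n_search + n_val:],           # visible to the agent
    }

class HCEScores:
    """Scores on every split are computed up front, but the search loop
    is only ever shown D_search (the 'hidden' part of HCE)."""
    def __init__(self, search, val, test):
        self._scores = {"search": search, "val": val, "test": test}
    def fitness(self):               # what the evolutionary search sees
        return self._scores["search"]
    def final_selection_score(self): # consulted once, after search ends
        return self._scores["val"]
```

The key property is that the agent never computes or observes its own fitness: scoring happens in a separate evaluation container, against labels the agent's container never mounts.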
Component 4: ReAct Agent Operators
Purpose: Replace static Draft/Improve/Debug prompts with autonomous multi-step agents.
Trajectory format:
τ = (Reason₁, Act₁, Obs₁, ..., Reason_{K-1}, Act_{K-1}, Obs_{K-1}, Reason_K, Act_K)
Capabilities gained over static operators:
| Capability | Static Operators | ReAct Agents |
|---|---|---|
| Exploratory data analysis | Pre-defined prompt | Agent decides at runtime |
| Debugging | Separate "Debug" operator, no iterative access | Same trajectory, observes traceback, hypothesizes fix, re-executes |
| Resource allocation | Fixed per operator | Dynamic — spends more time on harder sub-problems |
| Scope engineering | Hand-designed per anticipated sub-task | Agent determines needed actions autonomously |
| Local experimentation | Not supported | Can run small experiments before committing to ideas |
Component 5: Resource Management
| Resource | Allocation |
|---|---|
| GPU | 1:1 worker mapping (NVIDIA H200, 141GB VRAM) |
| CPU | 12 logical cores per worker |
| RAM | 120GB per worker |
| Time limit | 9-hour hard cap per code execution |
| Global limit | 72-hour wall-clock (configurable) |
| Evaluation | Foreground execution (no separate job queue) |
11 Core Mechanisms (Detailed)
11.1 Asynchronous Steady-State Evolution
Traditional generational evolution waits for all individuals to be evaluated before proceeding. AIRA₂ uses steady-state evolution (Syswerda, 1991), where:
- When any worker becomes idle → orchestrator immediately dispatches new task
- New individual is added to population immediately upon evaluation
- No synchronization barriers between workers
- Fast mutations complete quickly and provide early feedback; slow mutations don't block others
This is particularly critical for ML research tasks where mutation duration varies enormously — from minutes (hyperparameter tweak) to hours (training a large model from scratch).
Information flow in steady-state vs. generational evolution:
Generational (synchronous):
Gen 0: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶
Gen 1: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶
Problem: Fast workers idle, waiting for slow workers
Steady-state (async, AIRA₂):
W1: ───────────submit──▶ ──────submit──▶ ───submit──▶
W2: ──submit──▶ ──────submit──▶ ────────submit──▶
W3: ────────────────submit──▶ ──submit──▶ ────▶
...
Advantage: No idle time; fast tasks complete quickly
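The steady-state dispatch loop can be sketched with Python's standard thread pool: whenever any future completes, its result is integrated and the freed worker is immediately re-dispatched, with no generation barrier. All names here are illustrative, not the paper's:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def steady_state_search(mutate, select_parent, population,
                        n_workers=8, total_tasks=32):
    """Steady-state loop: each completed mutation is integrated into the
    population immediately and the freed worker is re-dispatched, so
    fast mutations feed back while slow ones are still running."""
    dispatched = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Fill the worker pool once at startup.
        while dispatched < min(n_workers, total_tasks):
            pending.add(pool.submit(mutate, select_parent(population)))
            dispatched += 1
        while pending:
            # Wake as soon as ANY worker finishes -- no barrier.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                population.append(fut.result())   # integrate immediately
                if dispatched < total_tasks:      # keep the worker busy
                    pending.add(pool.submit(mutate, select_parent(population)))
                    dispatched += 1
    return population
```

Because `select_parent` reads the live population at dispatch time, a worker started late inherits improvements discovered by workers that finished early, which is exactly the information-sharing that Best-of-K lacks.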
11.2 Hidden Consistent Evaluation (HCE) — Deep Analysis
HCE is arguably the paper's most important contribution. The authors use it as both a practical safeguard and an experimental tool to diagnose overfitting.
The overfitting diagnosis:
Prior work (Toledo et al., 2025) reported that agent performance degrades with extended search — interpreted as overfitting to validation data. AIRA₂'s experiments reveal this was actually evaluation noise from three sources:
- Implementation bugs inflating metrics: Agents sometimes produce code that reports artificially high validation scores due to bugs (e.g., data leakage in preprocessing)
- Brittle output parsing: Missing or erroneous score extraction leads to incorrect fitness signals
- Stochastic data splitting: Random seeds for train/val splits introduce variance; inferior solutions survive selection due to favorable random partitions
HCE eliminates all three:
- External evaluation prevents self-reported metric gaming
- Consistent splits eliminate seed-induced variance
- Hidden labels prevent agents from accessing evaluation data
Empirical validation: Without HCE, performance degrades from 56.8% (24h) to 56.3% (72h) — confirming overfitting. With HCE, performance improves monotonically: 71.8% (24h) → 76.0% (72h).
11.3 ReAct Agent Dynamics
The shift from static operators to ReAct agents represents a fundamental change in how the system allocates compute:
Static operator paradigm (AIRA-dojo):
Orchestrator → select operator (Draft/Improve/Debug/EDA)
→ fill template prompt
→ single LLM call
→ parse output
→ evaluate
ReAct agent paradigm (AIRA₂):
Orchestrator → provide context (parent solution, task, metadata)
→ ReAct agent autonomously determines:
- What analyses to run
- What code to write
- When to debug
- When to experiment locally
- When to submit
→ Multi-turn trajectory until submission
Dynamic compute allocation: On easy sub-tasks, agents submit quickly (few reasoning steps). On hard sub-tasks, agents spend many turns debugging, experimenting, and iterating. This naturally allocates LLM compute proportional to difficulty — something impossible with fixed, single-turn operators.
11.4 Containerization and Isolation
The containerization system uses Apptainer (formerly Singularity) with the Superimage environment:
Container lifecycle per mutation:
1. Spawn fresh Apptainer container
2. Mount parent solution code
3. ReAct agent executes in container (stateful bash + Jupyter)
4. Agent submits → solution extracted
5. Container destroyed
6. Solution evaluated in SEPARATE container (HCE)
Key properties:
- Fakeroot mode: Agents perceive root privileges without actual root access
- Pre-installed environment: Comprehensive ML/data science stack avoids installation delays
- Crash isolation: Container failures don't propagate
- Stateful tools: Unlike prior work, bash and Jupyter state persists across agent turns (within one mutation trajectory)
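The per-mutation container launch might look roughly like the following command construction. The specific flags are assumptions based on standard Apptainer usage (`--fakeroot`, `--nv`, `--containall`, `--bind`, `--env`), not the paper's actual invocation, and the image/path names are hypothetical:

```python
def build_exec_cmd(image, workdir, gpu_index, command,
                   apptainer="apptainer"):
    """Construct (not run) an Apptainer invocation for one mutation.
    Flag choice is an assumption from standard Apptainer usage; the
    paper does not list its exact launch command."""
    return [
        apptainer, "exec",
        "--fakeroot",            # perceived root for apt/pip installs
        "--nv",                  # pass through the worker's NVIDIA GPU
        "--containall",          # isolate env/home/tmp from the host
        "--bind", f"{workdir}:/workspace",              # mount parent code
        "--env", f"CUDA_VISIBLE_DEVICES={gpu_index}",   # 1:1 GPU mapping
        image, "bash", "-lc", command,
    ]
```

The orchestrator would run this list with a process launcher, destroy the container after the submit call, and then run evaluation in a separate container built the same way but with the hidden-label data mounted.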
11.5 Population Dynamics and Diversity
The paper uses temperature T=0.2, which biases selection toward high-fitness individuals while maintaining some diversity. Analysis shows:
- At very early stages (first few GPU-hours), the 8-GPU setup actually underperforms 1-GPU because it spends compute building a diverse initial population
- After the initial population is established, the 8-GPU setup achieves increasingly superior performance
- The crossover rate of 15% introduces genetic diversity by combining strategies from different lineages
The paper also demonstrates that parallelism without evolution is insufficient: Best-of-K (8 independent workers, no information sharing) saturates at the same performance level as a single GPU given enough time. Only evolutionary information sharing (selection pressure + crossover) enables the 8-GPU setup to find solutions unreachable by any single worker.
12 Programming Language
System Implementation
The paper does not specify the exact implementation language, but based on the predecessor (AIRA-dojo) and the described infrastructure:
| Component | Likely Language | Evidence |
|---|---|---|
| Orchestrator | Python | AIRA-dojo is Python; ML research tooling standard |
| ReAct Agent | Python (LLM-generated actions) | Agents write Python/Bash in containers |
| Container Environment | Python (Superimage) | Pre-installed Python ML stack |
| Candidate Solutions | Python | Kaggle competition solutions are Python |
Agent-Generated Code
Agents produce Python code for ML solutions within Apptainer containers. The environment includes:
- PyTorch, TensorFlow, JAX
- scikit-learn, XGBoost, LightGBM, CatBoost
- pandas, numpy, scipy
- Standard data science preprocessing libraries
Tool Interface
The ReAct agents interact via two primary tool types:
- Bash command execution — stateful shell session
- Jupyter kernel execution — stateful Python kernel
Both maintain state across turns, enabling iterative development workflows (write code → run → observe error → fix → run again) within a single mutation trajectory.
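The value of statefulness is easy to see in a toy version of the Jupyter tool: one persistent namespace shared across `run()` calls, so a variable defined in one agent turn survives into the next. This is an illustration, not the actual tool implementation:

```python
import io, contextlib, traceback

class StatefulKernel:
    """Toy stand-in for the stateful Jupyter tool: a single namespace
    persists across run() calls, so data loaded or functions defined in
    one agent turn remain available in later turns."""
    def __init__(self):
        self.ns = {}                             # survives across calls
    def run(self, code):
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.ns)              # same dict every call
        except Exception:
            # Return the traceback as the observation, so the agent can
            # read the error and attempt a fix in the next turn.
            return buf.getvalue() + traceback.format_exc()
        return buf.getvalue()
```

A stateless tool would have to re-run data loading and preprocessing on every call; here the agent can load once, then iterate on the failing step alone, which is what makes interactive debugging cheap.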
13 Memory Management
Population-Level Memory
The program database serves as the system's primary memory:
| Aspect | Implementation |
|---|---|
| Storage location | In-memory (orchestrator process) |
| Large artifacts | Automatically offloaded to disk |
| Contents per individual | Code, fitness score, metadata, parent lineage |
| Access pattern | Fast read for selection; append on new evaluation |
| Communication | Subagents communicate exclusively through central database |
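A minimal sketch of such a program database follows. The field names and id scheme are assumptions, since the paper describes the contents (code, fitness, metadata, lineage) but not the schema, and disk spillover for large artifacts is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    code: str               # full candidate solution script
    fitness: float          # D_search score from HCE
    parent_ids: tuple       # lineage: 1 id (mutation) or 2 (crossover)
    metadata: dict = field(default_factory=dict)

class ProgramDatabase:
    """In-memory population store: append on each evaluation, fast
    reads for selection. Workers never touch it directly -- the
    orchestrator mediates all access."""
    def __init__(self):
        self._items = []
    def add(self, ind):
        self._items.append(ind)
        return len(self._items) - 1       # id = insertion index
    def ranked(self):
        """Individuals sorted best-first, as rank selection needs."""
        return sorted(self._items, key=lambda i: i.fitness, reverse=True)
```

Keeping lineage as parent ids makes the evolutionary history reconstructible after the fact, which is how analyses like "crossover-driven breakthroughs" can be traced.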
Worker-Level Memory
Each ReAct agent maintains trajectory-level memory through:
- Conversation history: Full Reason-Act-Observe chain from the current trajectory
- Stateful bash session: Environment variables, working directory, installed packages persist across turns
- Stateful Jupyter kernel: Variables, loaded data, defined functions persist across turns
- Parent context: Injected at trajectory start (parent solution code, score, task metadata)
Cross-Worker Communication
Workers do not communicate directly with each other. All information sharing is mediated through the population:
Worker A → submits solution → evaluated → added to population
↓
Population state updated
↓
Worker B → samples parent from population → inherits A's improvements
This is a critical design choice: it avoids the complexity of peer-to-peer communication while still enabling evolutionary information transfer through the selection mechanism.
No Explicit Long-Term Memory
Unlike some research agent systems that maintain explicit knowledge bases or skill libraries, AIRA₂ relies entirely on the population as implicit memory. Good strategies survive through fitness-based selection; bad strategies are displaced. There is no explicit mechanism for:
- Extracting reusable skills or patterns
- Summarizing lessons learned across tasks
- Transferring knowledge between tasks (each task is independent)
14 Continued Learning
Within-Task Learning
AIRA₂ demonstrates within-task continued improvement through evolutionary search:
Performance trajectory (mean across 30 tasks):
3h → 59.9% PR
6h → 65.5% PR (+5.6 pp)
9h → ~67% PR (interpolated)
12h → 68.8% PR (+3.3 pp from 6h)
24h → 71.8% PR (+3.0 pp from 12h)
48h → ~74% PR (interpolated)
72h → 76.0% PR (+4.2 pp from 24h)
The monotonic improvement with compute is the paper's key finding — enabled specifically by HCE preventing overfitting. Prior systems showed degradation after 24 hours; AIRA₂ shows continued gains.
Cross-Task Learning
AIRA₂ does not implement cross-task learning. Each of the 30 MLE-bench tasks is solved independently. There is no mechanism for:
- Transferring learned strategies between tasks
- Building a library of reusable components
- Meta-learning over task distributions
This is explicitly noted as future work.
Evolutionary Dynamics as Learning
The evolutionary process itself constitutes a form of learning within each task:
- Exploration phase (early hours): Workers explore diverse strategies; population builds breadth
- Exploitation phase (later hours): Selection pressure focuses on refinement of promising approaches
- Crossover-driven innovation: Combining elements from different lineages occasionally produces breakthroughs
The temperature parameter T=0.2 controls this exploration-exploitation tradeoff. The paper does not report experiments with adaptive temperature schedules, which could potentially further improve performance.
Scaling Laws
The paper presents evidence for a compute scaling law in AI research agents:
- Performance improves approximately logarithmically with GPU-hours
- The relationship is not yet saturating at 72 hours / 576 GPU-hours
- This suggests that with even more compute (e.g., 32 GPUs for 72 hours), further gains are possible
This parallels the scaling laws observed in LLM pretraining, but applied to the meta-level of automated research agent performance.
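The "approximately logarithmic" claim can be sanity-checked against the numbers quoted in this report with a least-squares fit of PR = a + b·ln(hours). This is our own back-of-the-envelope check, not an analysis from the paper:

```python
import math

# Reported percentile ranks vs wall-clock hours (8-GPU configuration).
HOURS = [3, 6, 12, 24, 72]
PR    = [59.9, 65.5, 68.8, 71.8, 76.0]

def log_fit(xs, ys):
    """Least-squares fit of y = a + b * ln(x)."""
    lx = [math.log(x) for x in xs]
    mx, my = sum(lx) / len(lx), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ys))
    sxx = sum((x - mx) ** 2 for x in lx)
    b = sxy / sxx
    return my - b * mx, b

a, b = log_fit(HOURS, PR)
doubling_gain = b * math.log(2)   # PR points gained per doubling of time
```

The fit gives roughly 3–4 percentile rank points per doubling of wall-clock time, with residuals under about 1.5 points at every reported budget, consistent with a logarithmic trend that has not yet saturated at 72 hours.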
15 Applications
Direct Applications
| Application | Description | Status |
|---|---|---|
| Kaggle competition solving | Automated entry into ML competitions | Demonstrated (MLE-bench-30) |
| Automated ML pipeline engineering | End-to-end ML solution development | Core capability |
| Inference-time compute scaling | Using additional compute to improve solution quality | Demonstrated (monotonic gains through 576 GPU-hours) |
Broader Research Implications
1. Overfitting Diagnosis in Agent Systems
The HCE protocol provides a general methodology for diagnosing whether observed performance degradation in agent systems is due to true overfitting vs. evaluation noise. This has implications beyond ML research agents — any agent system with noisy evaluation signals could benefit from this analysis framework.
2. Parallelism + Evolution as a General Pattern
The finding that "parallelism without information sharing is insufficient" suggests a general principle: to effectively utilize additional compute in agent systems, you need both (a) parallel execution and (b) an information-sharing mechanism (like evolution) that enables workers to build on each other's discoveries.
3. Dynamic Operators via ReAct
The replacement of static operators with ReAct agents is applicable to any evolutionary system where operator design is a bottleneck. Rather than hand-designing operators for each domain, let agents determine their own action sequences.
Limitations and Scope
| Limitation | Impact |
|---|---|
| Kaggle-only evaluation | Unclear if results generalize to open-ended research (no predefined competition metric) |
| No paper writing | Produces code solutions, not research papers |
| Single-benchmark focus | MLE-bench-30 only; no evaluation on other benchmarks |
| Fixed LLM | Tied to Gemini 3.0 Pro; no multi-model experiments |
| No cross-task transfer | Each task starts from scratch |
| High compute cost | 8× H200 for 72h is accessible only to well-funded labs |
Connections to OmniEvolve
AIRA₂'s architecture maps directly to several OmniEvolve design patterns:
| AIRA₂ Component | OmniEvolve Equivalent |
|---|---|
| Evolutionary Orchestrator | omnievolve/orchestrator/ + omnievolve/search/ |
| Asynchronous Worker Pool | Async worker management in search backends |
| HCE Protocol | omnievolve/evaluation/ cascade evaluator with externalized grading |
| ReAct Agents as Operators | omnievolve/mutation/ — dynamic mutation operators |
| Population Database | omnievolve/knowledge/ program database |
| Temperature-scaled selection | omnievolve/search/ selection strategies |
| Containerized execution | omnievolve/safety/ sandbox execution |
References
- Hambardzumyan, K., et al. (2026). "AIRA₂: Overcoming Bottlenecks in AI Research Agents." arXiv:2603.26499.
- Toledo, E., et al. (2025). "AIRA-dojo." [predecessor system]
- Chan, J., et al. (2025). "MLE-bench." [benchmark definition]
- Singh, I., et al. (2025). "MLE-bench-30." [GPT-5 system card subset]
- Chen, X., et al. (2026). "MARS / MARS+." [concurrent work]
- Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models."
- Syswerda, G. (1991). "A Study of Reproduction in Generational and Steady State Genetic Algorithms."
- Kurtzer, G., et al. (2021). "Apptainer." [container runtime]
- Du, M., et al. (2025). "MLEvolve." [concurrent work]
- Botla, V., et al. (2025). "PiEvolve." [concurrent work]
- Li, Z., et al. (2025). "FM-Agent 2.0." [concurrent work]
- Liu, J., et al. (2025). "ML-Master 2.0." [concurrent work]
Classification: Autoresearch — AIRA₂ is an AI system that autonomously conducts ML research (solving Kaggle competitions as a proxy for research capability) using evolutionary search with LLM-powered operators. It falls squarely in the autoresearch category as an agent harness for automated experimental research.