
AIRA₂

Asynchronous multi-GPU research agent with evolutionary search, Hidden Consistent Evaluation, and ReAct operators that achieves state-of-the-art on MLE-bench-30.

  • Organization: FAIR at Meta / University College London / University of Oxford
  • Published: March 27, 2026
  • Type: paper (arXiv:2603.26499)
  • Report Type: PhD-Level Technical Analysis
  • Report Date: April 2026

1 Full Title and Attribution

AIRA₂: Overcoming Bottlenecks in AI Research Agents

  • Venue: arXiv preprint (cs.AI), March 2026
  • DOI: 10.48550/arXiv.2603.26499
  • License: CC-BY 4.0
  • Predecessor: AIRA-dojo (Toledo et al., 2025) — the previous best open-source agent on MLE-bench
  • Benchmark: MLE-bench-30 (curated 30-task subset of MLE-bench, used in GPT-5 system card)
  • Relation to prior work: Direct successor to AIRA-dojo; builds on the three-bottleneck formalization from Toledo et al. (2025) and addresses each bottleneck with a targeted architectural choice

The paper explicitly frames itself as a systems paper: the contribution is not a new search algorithm or LLM capability, but a set of engineering decisions that resolve identified structural bottlenecks preventing scaling of AI research agents.


2 Authors and Team

| Author | Affiliation | Role |
|---|---|---|
| Karen Hambardzumyan* | FAIR at Meta, UCL | Equal contribution (lead) |
| Nicolas Baldwin* | FAIR at Meta | Equal contribution |
| Edan Toledo* | FAIR at Meta, UCL | Equal contribution; lead of predecessor AIRA-dojo |
| Rishi Hazra* | FAIR at Meta | Equal contribution |
| Michael Kuchnik* | FAIR at Meta | Equal contribution |
| Bassel Al Omari | FAIR at Meta | |
| Thomas Simon Foster | FAIR at Meta, Oxford | |
| Anton Protopopov | FAIR at Meta | |
| Jean-Christophe Gagnon-Audet | FAIR at Meta | |
| Ishita Mediratta | FAIR at Meta | |
| Kelvin Niu | FAIR at Meta | |
| Michael Shvartsman | FAIR at Meta | |
| Alisia Lupidi | FAIR at Meta, Oxford | |
| Alexis Audran-Reiss | FAIR at Meta | |
| Parth Pathak | FAIR at Meta | |
| Tatiana Shavrina | FAIR at Meta | |
| Despoina Magka | FAIR at Meta | |
| Hela Momand | FAIR at Meta | |
| Derek Dunfield | FAIR at Meta | |
| Nicola Cancedda | FAIR at Meta | |
| Pontus Stenetorp | UCL | Academic advisor |
| Carole-Jean Wu | FAIR at Meta | Senior leadership |
| Jakob Nicolaus Foerster | FAIR at Meta, Oxford | Academic advisor |
| Yoram Bachrach | FAIR at Meta | Senior researcher |
| Martin Josifoski* | FAIR at Meta | Equal contribution; corresponding author |

*Equal contribution — author order determined by Mario Kart placement (as noted in the paper).

Team size: 25 authors across FAIR at Meta, University College London, and University of Oxford. This is a large industrial research team with significant compute access, reflecting the resource-intensive nature of the problem.


3 Core Contribution

AIRA₂ identifies and resolves three structural bottlenecks in AI research agents, each with a targeted architectural solution:

| Bottleneck | Problem | AIRA₂ Solution |
|---|---|---|
| Compute Throughput | Synchronous single-GPU execution starves exploration; agents are "sample-bound" (≈1–20 candidates/day on compute-heavy tasks) | Asynchronous multi-GPU worker pool with steady-state evolution; 8 GPUs yield ≈8× throughput |
| Generalization Gap | Validation-based selection causes overfitting; oracle experiments show 9–13% medal-rate improvement with test-based selection | Hidden Consistent Evaluation (HCE) protocol: fixed splits, hidden labels, decoupled selection |
| Static Operator Limitation | Fixed single-turn LLM prompts cannot adapt to novel task demands; advanced search cannot compensate for shallow operators | ReAct agents with dynamic scoping and interactive debugging replace all static operators |

Key insight: The paper demonstrates that "overfitting" reported in prior work (Toledo et al., 2025) was actually evaluation noise — not true data memorization. HCE eliminates this noise, enabling performance to improve monotonically with compute.

The contribution is primarily a systems-level design paper rather than an algorithmic advance. The evolutionary search itself is standard (temperature-scaled rank selection); the innovation lies in the infrastructure that makes long-horizon evolutionary search actually work.

Relationship to the Field

The paper positions itself within a rapid progression of ML research agents competing on MLE-bench:

Timeline of MLE-bench agents (2025–2026), ordered by 24h percentile rank:
─────────────────────────────────────────────────────
AIRA-dojo (2025)     → 39.5% PR @ 24h (single GPU)
PiEvolve (2025)      → 54.1% PR @ 24h
ML-Master 2.0 (2025) → 57.6% PR @ 24h
MARS (2026)          → 60.4% PR @ 24h
MLEvolve (2025)      → 64.1% PR @ 24h
FM-Agent 2.0 (2025)  → 69.6% PR @ 24h
MARS+ (2026)         → 69.9% PR @ 24h ← previous SOTA
AIRA₂ (2026)         → 71.8% PR @ 24h → 76.0% @ 72h ← new SOTA
─────────────────────────────────────────────────────

4 Supported Solutions

Problem Framing

AIRA₂ frames automated ML research as a search problem over a graph of candidate solutions. This decomposition was formalized in AIRA-dojo:

  • Search policy: Selects which node (candidate solution) to expand
  • Operators: Transform a node into new candidates (mutation, crossover)
  • Evaluation signal: Provides fitness to guide the search

Solution Space

The system operates on Kaggle competition solutions as the unit of search. Each candidate is a complete ML pipeline: data loading, preprocessing, feature engineering, model architecture, training loop, hyperparameter configuration, and prediction formatting.

Search Methods Supported

| Method | Description | Status in AIRA₂ |
|---|---|---|
| Steady-state evolution | Asynchronous; new individuals added as workers complete | Primary method |
| Temperature-scaled rank selection | Rank-based parent selection with tunable exploration | Default (T=0.2) |
| Mutation | Single-parent ReAct agent trajectory | Primary operator |
| Crossover | Two-parent ReAct agent trajectory | 15% probability |
| Greedy / Best-of-K | Parallel search without information sharing | Ablation baseline only |

What AIRA₂ Does NOT Do

  • No neural architecture search (NAS) as a separate subsystem
  • No explicit hyperparameter optimization (handled implicitly by ReAct agents)
  • No meta-learning or transfer across tasks
  • No automated paper writing or research ideation — focused exclusively on competition solution engineering

5 LLM Integration

Primary Model

| Parameter | Value |
|---|---|
| Model | Gemini 3.0 Pro Preview (Google DeepMind, 2025) |
| Role | Powers all ReAct agent workers |
| Context | Multi-turn with stateful tool execution |
| Alternatives tested | Not reported (all main results use Gemini 3.0 Pro) |

Note: Baselines also use Gemini 3.0 Pro Preview, except ML-Master 2.0, which uses DeepSeek V3.2-Speciale. This makes the comparison fair on the LLM dimension for all but that one baseline.

LLM Usage Pattern

The LLM is used exclusively as the reasoning engine within ReAct agent trajectories:

ReAct Trajectory τ:
  Reason₁ → Act₁ → Obs₁ → Reason₂ → Act₂ → Obs₂ → ... → Reason_K → Submit

Actions: Python code execution, Bash commands (in sandboxed container)
Observations: stdout/stderr, execution duration, file contents
Final: "submit" tool sends solution to orchestrator

Key design decisions about LLM integration:

  1. No prompt engineering per operator: Unlike AIRA-dojo which had separate Draft/Improve/Debug prompts, AIRA₂ gives the ReAct agent a single context (parent solution + task description + metadata) and lets it decide what to do
  2. No additional guidance within trajectory: "Within the ReAct trajectory, no additional guidance and instructions are provided" — the agent autonomously determines its action sequence
  3. Stateful tool execution: Unlike AIDE and AIRA-dojo, bash and Jupyter kernel tools maintain state across turns, enabling iterative debugging
  4. Execution duration feedback: Tool outputs include timing information, enabling agents to monitor their own efficiency

Prompt Structure

The orchestrator provides the ReAct agent with:

  • Task description and data schema
  • Parent solution code and its fitness score
  • (For crossover) Second parent solution
  • Population metadata (scores, strategies attempted)

The agent then autonomously executes a multi-step trajectory of reasoning and code execution.


6 Key Results

Primary Metric: Percentile Rank on MLE-bench-30

| Time Budget | AIRA₂ (8 GPU) | Best Baseline | Gap |
|---|---|---|---|
| 3h | 59.9% ± 3.6 | | |
| 6h | 65.5% | | |
| 12h | 68.8% | | |
| 24h | 71.8% ± 3.5 | 69.9% (MARS+) | +1.9 pp |
| 72h | 76.0% ± 3.4 | | +6.1 pp vs 24h SOTA |

Medal Rates at 72 Hours

| Medal Tier | Rate |
|---|---|
| Bronze+ | 61.1% ± 5.2 |
| Silver+ | 58.9% ± 5.2 |
| Gold | 36.7% ± 5.1 |

Ablation Results (72h)

| Configuration | Percentile Rank | Δ from Full |
|---|---|---|
| Full AIRA₂ (8 GPU) | 76.0% | |
| No Subagents (8 GPU, static operators) | 73.7% | −2.3 pp |
| 1 GPU (with subagents + HCE) | 63.5% | −12.5 pp |
| No HCE (8 GPU, with subagents) | 56.3% | −19.7 pp |
| No Evolution (Best-of-K, 8 GPU) | 65.2% | −10.8 pp |

Critical finding: Removing HCE causes performance to degrade from 24h to 72h (56.8% → 56.3%), confirming that without reliable evaluation, longer search actively hurts. With HCE, performance monotonically improves.

Compute Efficiency Analysis

The paper normalizes performance by cumulative GPU-hours to compare 1-GPU vs 8-GPU:

  • At low GPU-hours (< 24), 1-GPU is slightly more efficient (no overhead of building initial population)
  • At 24+ GPU-hours, 8-GPU becomes increasingly efficient
  • The gap widens to 7.5 percentile rank points at 144 GPU-hours
  • Parallel compute without evolution (Best-of-K) saturates — it converges to the same final performance as 1-GPU, showing that parallelism alone is insufficient; evolutionary information sharing is essential

Statistical Methodology

  • 3 independent seeds per task, mean ± standard error reported
  • Percentile Rank chosen over medal rate as the primary metric because it is continuous (no discrete thresholds), captures the full score distribution rather than a binary medal outcome, and avoids threshold effects near medal boundaries that amplify noise
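To make the metric concrete, here is a minimal sketch of a percentile-rank computation. The function name and tie-handling are illustrative assumptions; the benchmark's official grader may define edge cases differently.

```python
def percentile_rank(agent_score, leaderboard_scores, higher_is_better=True):
    """Fraction (as a percentage) of human leaderboard entries the agent's
    submission beats. Illustrative definition; exact tie-handling may differ."""
    if higher_is_better:
        beaten = sum(1 for s in leaderboard_scores if agent_score > s)
    else:
        beaten = sum(1 for s in leaderboard_scores if agent_score < s)
    return 100.0 * beaten / len(leaderboard_scores)
```

Because this value moves smoothly as the agent's score improves, it avoids the step-function behavior of medal thresholds.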

7 Reproducibility

Strengths

| Aspect | Assessment |
|---|---|
| Benchmark | MLE-bench-30 is publicly defined (used in GPT-5 system card) |
| Evaluation protocol | HCE fully specified (80/10/10 split, externalized grading) |
| Hyperparameters | Temperature T=0.2, crossover probability pc=15% fully reported |
| Hardware | 8× NVIDIA H200 GPUs, 12 CPU cores/worker, 120GB RAM/worker |
| Container environment | Apptainer containers with Superimage (publicly available) |
| Statistical methodology | 3 seeds, SE intervals, proper normalization |

Limitations

| Aspect | Concern |
|---|---|
| Code availability | Not open-sourced at time of publication |
| LLM dependency | Gemini 3.0 Pro Preview is a proprietary, versioned API — exact behavior may change |
| Infrastructure complexity | Requires custom orchestrator, containerization system, remote tool execution — significant engineering to replicate |
| Cost | 8× H200 GPUs for 72 hours is expensive (roughly $9,000–15,000 per 30-task evaluation; see Section 8) |
| Prompt content | System prompts and ReAct agent instructions not fully reproduced in paper |
| Data splits | Specific split indices not published (though the splitting procedure is deterministic given seeds) |

Predecessor Availability

AIRA-dojo (the predecessor) was open-sourced, including the Superimage container environment. This partially mitigates reproducibility concerns since AIRA₂ builds on that infrastructure.


8 Compute and API Costs

Hardware Configuration

Per-worker allocation:
┌──────────────────────────┐
│  1× NVIDIA H200 (141GB)  │
│  12 logical CPU cores    │
│  120GB system RAM        │
│  Dedicated GPU mapping   │
└──────────────────────────┘

Full system (main experiments):
┌────────────────────────────────┐
│  8× H200 workers               │
│  1× orchestrator (CPU-only)    │
│  1× evaluation GPU (dedicated) │
│  Total: ~9 GPUs active         │
└────────────────────────────────┘

Estimated Costs per Full Evaluation

| Resource | Quantity | Duration | Estimated Cost |
|---|---|---|---|
| H200 GPUs (workers) | 8 | 72h | ~$8,000–12,000 (cloud pricing) |
| H200 GPU (evaluation) | 1 | 72h (intermittent) | ~$500–1,000 |
| LLM API (Gemini 3.0 Pro) | ~1000s of trajectories | 72h | ~$500–2,000 (estimated) |
| Total per 30-task run | | | ~$9,000–15,000 |

These are rough estimates based on 2026 cloud H200 pricing. Meta likely uses internal GPU clusters, making actual costs lower.

Throughput Statistics

  • Single GPU: ≈1–20 candidates evaluated per day (task-dependent)
  • 8 GPUs: ≈8× throughput (approximately linear scaling demonstrated)
  • Execution cap: 9-hour hard time limit per individual code execution
  • Scaling: Paper claims linear throughput scaling with GPU count; no diminishing returns observed up to 8 GPUs

Cost-Effectiveness Argument

The paper argues that 8-GPU parallelism is not merely "doing the same thing faster" — the evolutionary information sharing between workers creates solutions that a single GPU cannot reach regardless of time:

"Parallelism without information sharing (Best-of-K) saturates early, converging to the same final performance as the single-GPU agent."


9 Architecture Solution

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AIRA₂ SYSTEM ARCHITECTURE                 │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              EVOLUTIONARY ORCHESTRATOR                 │   │
│  │                                                        │   │
│  │  ┌──────────────┐  ┌─────────────┐  ┌─────────────┐  │   │
│  │  │  Population P │  │  Selection   │  │  Dispatch   │  │   │
│  │  │  (candidates, │  │  (rank-based,│  │  (async,    │  │   │
│  │  │   scores,     │  │   T=0.2)     │  │   on-demand)│  │   │
│  │  │   metadata)   │  │              │  │             │  │   │
│  │  └──────────────┘  └─────────────┘  └─────────────┘  │   │
│  └──────────────────────────────────────────────────────┘   │
│           │               │              │                    │
│     ┌─────┴─────┐   ┌────┴────┐   ┌────┴────┐              │
│     ▼           ▼   ▼         ▼   ▼         ▼              │
│  ┌────────┐ ┌────────┐ ┌────────┐      ┌────────┐          │
│  │Worker 1│ │Worker 2│ │Worker 3│ ···  │Worker N│          │
│  │ ReAct  │ │ ReAct  │ │ ReAct  │      │ ReAct  │          │
│  │ Agent  │ │ Agent  │ │ Agent  │      │ Agent  │          │
│  └───┬────┘ └───┬────┘ └───┬────┘      └───┬────┘          │
│      │          │          │                │                │
│      ▼          ▼          ▼                ▼                │
│  ┌────────┐ ┌────────┐ ┌────────┐      ┌────────┐          │
│  │Dev Con-│ │Dev Con-│ │Dev Con-│      │Dev Con-│          │
│  │tainer  │ │tainer  │ │tainer  │      │tainer  │          │
│  │(1×GPU) │ │(1×GPU) │ │(1×GPU) │      │(1×GPU) │          │
│  └───┬────┘ └───┬────┘ └───┬────┘      └───┬────┘          │
│      │          │          │                │                │
│      └──────────┴──────────┴────────────────┘                │
│                         │                                    │
│                         ▼                                    │
│               ┌───────────────────┐                          │
│               │ HIDDEN CONSISTENT │                          │
│               │ EVALUATION (HCE)  │                          │
│               │                   │                          │
│               │ ┌───────────────┐ │                          │
│               │ │ Eval Container│ │                          │
│               │ │ (1×GPU)       │ │                          │
│               │ └───────────────┘ │                          │
│               │                   │                          │
│               │ D_search → score  │                          │
│               │ D_val    → select │                          │
│               │ D_test   → report │                          │
│               └───────────────────┘                          │
└─────────────────────────────────────────────────────────────┘

Two-Tier Design

The architecture comprises two tiers:

  1. Global Orchestrator: Maintains population state, performs selection, dispatches mutation/crossover tasks to available workers, integrates evaluated results back into the population
  2. Asynchronous Worker Pool: N independent ReAct agents, each with a dedicated GPU and ephemeral container, executing multi-step reasoning trajectories to produce candidate solutions

Key Architectural Decisions

| Decision | Rationale |
|---|---|
| Steady-state evolution (not generational) | Workers never idle waiting for synchronization barriers; fast mutations complete and feed back immediately |
| 1:1 GPU-worker mapping (static allocation) | Eliminates dynamic scheduling complexity; clean-slate environments for every execution |
| Separate evaluation container | Agents never see labels; prevents metric gaming and feedback loops |
| In-memory program database | Fast access for orchestrator; large artifacts spill to disk |
| Ephemeral Apptainer containers | Crashed containers don't affect orchestrator or other workers |
| Fakeroot mode | Agents can apt install or pip install without actual root privileges |

10 Component Breakdown

Component 1: Evolutionary Orchestrator

Purpose: Coordinate search over population of candidate solutions.

| Parameter | Value |
|---|---|
| Selection | Temperature-scaled rank-based |
| Temperature T | 0.2 (biased toward exploitation) |
| Crossover probability | 15% |
| Population storage | In-memory with disk spillover |
| Communication | Asynchronous; no synchronization barriers |

Selection Formula:

p(i) = (N - rᵢ + 1)^(1/T) / Σⱼ (N - rⱼ + 1)^(1/T)

Where:

  • N = population size
  • rᵢ = rank of individual i (1 = best)
  • T = temperature (T→0: greedy, T→∞: uniform)

Rank-based selection is chosen over fitness-proportionate because ranks are invariant to magnitude and scale of fitness scores, which vary widely across Kaggle tasks.
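The selection rule can be sketched directly from the formula above; `rank_selection_probs` and `select_parent` are hypothetical names, not the paper's actual API.

```python
import random

def rank_selection_probs(fitnesses, T=0.2):
    """Temperature-scaled rank selection: p(i) ∝ (N - r_i + 1)^(1/T),
    where r_i is the rank of individual i (1 = best fitness)."""
    N = len(fitnesses)
    # Rank 1 = highest fitness; ranks are invariant to fitness scale.
    order = sorted(range(N), key=lambda i: fitnesses[i], reverse=True)
    rank = {i: r + 1 for r, i in enumerate(order)}
    weights = [(N - rank[i] + 1) ** (1.0 / T) for i in range(N)]
    total = sum(weights)
    return [w / total for w in weights]

def select_parent(population, fitnesses, T=0.2, rng=random):
    """Sample one parent according to the rank-based distribution."""
    probs = rank_selection_probs(fitnesses, T)
    return rng.choices(population, weights=probs, k=1)[0]
```

At T=0.2 the exponent 1/T = 5 strongly favors top-ranked individuals while leaving the rest a small but nonzero chance, matching the "biased toward exploitation" setting.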

Component 2: Asynchronous Worker Pool

Purpose: Execute ReAct agent trajectories in parallel.

Each worker:

  1. Receives parent solution(s) + task context from orchestrator
  2. Executes a multi-step ReAct trajectory (Reason → Act → Observe loop)
  3. Submits a candidate solution via the "submit" tool
  4. Returns to the pool; the orchestrator evaluates and integrates the result

Container Environment (Superimage):

  • Pre-installed: Python, PyTorch, CUDA, standard data science stack
  • Fakeroot mode: perceived root for apt/pip installations
  • Isolated: crashed containers don't affect other workers
  • Stateful: Bash and Jupyter kernel state persists across turns within a trajectory

Component 3: Hidden Consistent Evaluation (HCE)

Purpose: Provide reliable evaluation signal for search; decouple search optimization from final selection.

Data Partitioning (created once, reused across all seeds):

Available labeled data
├── D_train (80%) → visible to agent for model training
├── D_search (10%) → used by orchestrator for fitness; labels hidden from agents
└── D_val (10%) → used ONLY for final selection after search terminates

And separately:

D_test → held-out competition test set (never used during search or selection)

Evaluation flow:

  1. Worker submits solution → predictions generated for all splits
  2. Orchestrator evaluates on D_search → fitness score returned to search
  3. D_val scores computed but NOT used during search
  4. After search terminates, final submission selected by D_val score
  5. D_test used only for final reporting
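The partitioning and hidden-label scoring can be sketched as follows, assuming a deterministic seed and hypothetical helper names (the paper does not publish its splitting code):

```python
import random

def make_hce_splits(example_ids, seed=0):
    """Create the fixed 80/10/10 HCE partition once and reuse it everywhere.
    A deterministic seed makes splits consistent across seeds and runs,
    removing split-induced variance from the fitness signal."""
    ids = sorted(example_ids)          # canonical order before shuffling
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.8 * n)
    n_search = int(0.1 * n)
    return {
        "train": ids[:n_train],                     # visible to the agent
        "search": ids[n_train:n_train + n_search],  # fitness; labels hidden
        "val": ids[n_train + n_search:],            # final selection only
    }

def score_hidden(predictions, hidden_labels, metric):
    """External scoring: the agent submits predictions and only ever sees
    the scalar score, never the labels themselves."""
    return metric(predictions, hidden_labels)
```

The competition test set (D_test) stays entirely outside this function and is touched only for final reporting.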

Component 4: ReAct Agent Operators

Purpose: Replace static Draft/Improve/Debug prompts with autonomous multi-step agents.

Trajectory format:

τ = (Reason₁, Act₁, Obs₁, ..., Reason_{K−1}, Act_{K−1}, Obs_{K−1}, Reason_K, Act_K)

Capabilities gained over static operators:

| Capability | Static Operators | ReAct Agents |
|---|---|---|
| Exploratory data analysis | Pre-defined prompt | Agent decides at runtime |
| Debugging | Separate "Debug" operator, no iterative access | Same trajectory: observes traceback, hypothesizes fix, re-executes |
| Resource allocation | Fixed per operator | Dynamic — spends more time on harder sub-problems |
| Scope engineering | Hand-designed per anticipated sub-task | Agent determines needed actions autonomously |
| Local experimentation | Not supported | Can run small experiments before committing to ideas |

Component 5: Resource Management

| Resource | Allocation |
|---|---|
| GPU | 1:1 worker mapping (NVIDIA H200, 141GB VRAM) |
| CPU | 12 logical cores per worker |
| RAM | 120GB per worker |
| Time limit | 9-hour hard cap per code execution |
| Global limit | 72-hour wall-clock (configurable) |
| Evaluation | Foreground execution (no separate job queue) |

11 Core Mechanisms (Detailed)

11.1 Asynchronous Steady-State Evolution

Traditional generational evolution waits for all individuals to be evaluated before proceeding. AIRA₂ uses steady-state evolution (Syswerda, 1991), where:

  1. When any worker becomes idle → orchestrator immediately dispatches new task
  2. New individual is added to population immediately upon evaluation
  3. No synchronization barriers between workers
  4. Fast mutations complete quickly and provide early feedback; slow mutations don't block others

This is particularly critical for ML research tasks where mutation duration varies enormously — from minutes (hyperparameter tweak) to hours (training a large model from scratch).

Information flow in steady-state vs. generational evolution:

Generational (synchronous):
   Gen 0: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶
   Gen 1: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶

   Problem: Fast workers idle, waiting for slow workers

Steady-state (async, AIRA₂):
   W1: ───────────submit──▶ ──────submit──▶ ───submit──▶
   W2: ──submit──▶ ──────submit──▶ ────────submit──▶
   W3: ────────────────submit──▶ ──submit──▶ ────▶
   ...

   Advantage: No idle time; fast tasks complete quickly
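A minimal asyncio sketch of the steady-state dispatch loop, with selection and mutation reduced to stubs; all names are illustrative, not the paper's implementation.

```python
import asyncio
import random

async def run_steady_state(population, n_workers, total_tasks, mutate):
    """Steady-state evolution: whenever a worker finishes, its result is
    integrated immediately and a new task is dispatched. There are no
    generation barriers, so fast mutations never wait on slow ones."""
    remaining = total_tasks

    async def worker():
        nonlocal remaining
        while remaining > 0:
            remaining -= 1
            parent = random.choice(population)  # selection stub
            child = await mutate(parent)        # ReAct trajectory stub
            population.append(child)            # integrate on completion

    await asyncio.gather(*(worker() for _ in range(n_workers)))
    return population
```

In the real system `mutate` would launch a containerized ReAct trajectory whose duration varies from minutes to hours, which is exactly why the barrier-free design matters.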

11.2 Hidden Consistent Evaluation (HCE) — Deep Analysis

HCE is arguably the paper's most important contribution. The authors use it as both a practical safeguard and an experimental tool to diagnose overfitting.

The overfitting diagnosis:

Prior work (Toledo et al., 2025) reported that agent performance degrades with extended search — interpreted as overfitting to validation data. AIRA₂'s experiments reveal this was actually evaluation noise from three sources:

  1. Implementation bugs inflating metrics: Agents sometimes produce code that reports artificially high validation scores due to bugs (e.g., data leakage in preprocessing)
  2. Brittle output parsing: Missing or erroneous score extraction leads to incorrect fitness signals
  3. Stochastic data splitting: Random seeds for train/val splits introduce variance; inferior solutions survive selection due to favorable random partitions

HCE eliminates all three:

  • External evaluation prevents self-reported metric gaming
  • Consistent splits eliminate seed-induced variance
  • Hidden labels prevent agents from accessing evaluation data

Empirical validation: Without HCE, performance degrades from 56.8% (24h) to 56.3% (72h) — confirming overfitting. With HCE, performance improves monotonically: 71.8% (24h) → 76.0% (72h).

11.3 ReAct Agent Dynamics

The shift from static operators to ReAct agents represents a fundamental change in how the system allocates compute:

Static operator paradigm (AIRA-dojo):

Orchestrator → select operator (Draft/Improve/Debug/EDA)
            → fill template prompt
            → single LLM call
            → parse output
            → evaluate

ReAct agent paradigm (AIRA₂):

Orchestrator → provide context (parent solution, task, metadata)
            → ReAct agent autonomously determines:
               - What analyses to run
               - What code to write
               - When to debug
               - When to experiment locally
               - When to submit
            → Multi-turn trajectory until submission

Dynamic compute allocation: On easy sub-tasks, agents submit quickly (few reasoning steps). On hard sub-tasks, agents spend many turns debugging, experimenting, and iterating. This naturally allocates LLM compute proportional to difficulty — something impossible with fixed, single-turn operators.
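A generic ReAct loop under these assumptions might look as follows; `llm` and `tools` are hypothetical interfaces standing in for the model calls and the stateful bash/Jupyter tools.

```python
def run_react_trajectory(llm, tools, context, max_turns=50):
    """Generic ReAct loop: the agent alone decides how many turns to spend,
    so LLM compute naturally scales with sub-task difficulty."""
    history = [context]
    for _ in range(max_turns):
        reason, action, args = llm(history)     # Reason_t, Act_t
        if action == "submit":
            return args                         # candidate solution
        observation = tools[action](args)       # stateful tool execution
        history.append((reason, action, observation))
    return None                                 # turn budget exhausted
```

An easy sub-task might terminate after two or three turns; a hard one consumes many turns of debugging before the agent calls "submit".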

11.4 Containerization and Isolation

The containerization system uses Apptainer (formerly Singularity) with the Superimage environment:

Container lifecycle per mutation:
1. Spawn fresh Apptainer container
2. Mount parent solution code
3. ReAct agent executes in container (stateful bash + Jupyter)
4. Agent submits → solution extracted
5. Container destroyed
6. Solution evaluated in SEPARATE container (HCE)

Key properties:

  • Fakeroot mode: agents perceive root privileges without actual root access
  • Pre-installed environment: comprehensive ML/data science stack avoids installation delays
  • Crash isolation: container failures don't propagate
  • Stateful tools: unlike prior work, bash and Jupyter state persists across agent turns (within one mutation trajectory)

11.5 Population Dynamics and Diversity

The paper uses temperature T=0.2, which biases selection toward high-fitness individuals while maintaining some diversity. Analysis shows:

  • At very early stages (first few GPU-hours), the 8-GPU setup actually underperforms 1-GPU because it spends compute building a diverse initial population
  • After the initial population is established, the 8-GPU setup achieves increasingly superior performance
  • The crossover rate of 15% introduces genetic diversity by combining strategies from different lineages

The paper also demonstrates that parallelism without evolution is insufficient: Best-of-K (8 independent workers, no information sharing) saturates at the same performance level as a single GPU given enough time. Only evolutionary information sharing (selection pressure + crossover) enables the 8-GPU setup to find solutions unreachable by any single worker.


12 Programming Language

System Implementation

The paper does not specify the exact implementation language, but based on the predecessor (AIRA-dojo) and the described infrastructure:

| Component | Likely Language | Evidence |
|---|---|---|
| Orchestrator | Python | AIRA-dojo is Python; standard for ML research tooling |
| ReAct Agent | Python (LLM-generated actions) | Agents write Python/Bash in containers |
| Container Environment | Python (Superimage) | Pre-installed Python ML stack |
| Candidate Solutions | Python | Kaggle competition solutions are Python |

Agent-Generated Code

Agents produce Python code for ML solutions within Apptainer containers. The environment includes:

  • PyTorch, TensorFlow, JAX
  • scikit-learn, XGBoost, LightGBM, CatBoost
  • pandas, numpy, scipy
  • Standard data science preprocessing libraries

Tool Interface

The ReAct agents interact via two primary tool types:

  1. Bash command execution — stateful shell session
  2. Jupyter kernel execution — stateful Python kernel

Both maintain state across turns, enabling iterative development workflows (write code → run → observe error → fix → run again) within a single mutation trajectory.
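A stateful kernel-style tool can be approximated in a few lines; this is a simplified stand-in for the containerized Jupyter tool, not the paper's implementation.

```python
import contextlib
import io
import traceback

class StatefulPythonTool:
    """Minimal stand-in for a stateful kernel tool: variables defined in one
    turn persist into the next, enabling write → run → observe → fix loops."""

    def __init__(self):
        self.namespace = {}  # survives across turns within one trajectory

    def run(self, code):
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)
        except Exception:
            # The agent observes the traceback and can attempt a fix.
            return buf.getvalue() + traceback.format_exc()
        return buf.getvalue()
```

The key property is the shared `namespace`: a model trained in turn 3 is still in memory when the agent writes prediction code in turn 7.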


13 Memory Management

Population-Level Memory

The program database serves as the system's primary memory:

| Aspect | Implementation |
|---|---|
| Storage location | In-memory (orchestrator process) |
| Large artifacts | Automatically offloaded to disk |
| Contents per individual | Code, fitness score, metadata, parent lineage |
| Access pattern | Fast read for selection; append on new evaluation |
| Communication | Subagents communicate exclusively through the central database |

Worker-Level Memory

Each ReAct agent maintains trajectory-level memory through:

  1. Conversation history: Full Reason-Act-Observe chain from the current trajectory
  2. Stateful bash session: Environment variables, working directory, installed packages persist across turns
  3. Stateful Jupyter kernel: Variables, loaded data, defined functions persist across turns
  4. Parent context: Injected at trajectory start (parent solution code, score, task metadata)

Cross-Worker Communication

Workers do not communicate directly with each other. All information sharing is mediated through the population:

Worker A → submits solution → evaluated → added to population
                                              ↓
                              Population state updated
                                              ↓
Worker B → samples parent from population → inherits A's improvements

This is a critical design choice: it avoids the complexity of peer-to-peer communication while still enabling evolutionary information transfer through the selection mechanism.
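A minimal sketch of such a population-mediated channel, with hypothetical names; the real database also tracks lineage and spills large artifacts to disk.

```python
import threading

class ProgramDatabase:
    """Central population store: workers never talk to each other directly;
    all information flows through submit() and sample_parent()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._individuals = []  # records of code, fitness, metadata

    def submit(self, code, fitness, metadata=None):
        """Called by the orchestrator after a solution is evaluated."""
        with self._lock:
            self._individuals.append(
                {"code": code, "fitness": fitness, "metadata": metadata or {}})

    def sample_parent(self, select):
        """`select` maps the list of fitnesses to a chosen index
        (e.g., temperature-scaled rank selection)."""
        with self._lock:
            fits = [ind["fitness"] for ind in self._individuals]
            return self._individuals[select(fits)]
```

Worker B "hears about" Worker A's discovery only by sampling A's submitted individual as a parent.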

No Explicit Long-Term Memory

Unlike some research agent systems that maintain explicit knowledge bases or skill libraries, AIRA₂ relies entirely on the population as implicit memory. Good strategies survive through fitness-based selection; bad strategies are displaced. There is no explicit mechanism for:

  • Extracting reusable skills or patterns
  • Summarizing lessons learned across tasks
  • Transferring knowledge between tasks (each task is independent)


14 Continued Learning

Within-Task Learning

AIRA₂ demonstrates within-task continued improvement through evolutionary search:

Performance trajectory (mean across 30 tasks):
  3h  → 59.9% PR
  6h  → 65.5% PR   (+5.6 pp)
  9h  → ~67%  PR    (interpolated)
  12h → 68.8% PR    (+3.3 pp from 6h)
  24h → 71.8% PR    (+3.0 pp from 12h)
  48h → ~74%  PR    (interpolated)
  72h → 76.0% PR    (+4.2 pp from 24h)

The monotonic improvement with compute is the paper's key finding — enabled specifically by HCE preventing overfitting. Prior systems showed degradation after 24 hours; AIRA₂ shows continued gains.

Cross-Task Learning

AIRA₂ does not implement cross-task learning. Each of the 30 MLE-bench tasks is solved independently. There is no mechanism for:

  • Transferring learned strategies between tasks
  • Building a library of reusable components
  • Meta-learning over task distributions

This is explicitly noted as future work.

Evolutionary Dynamics as Learning

The evolutionary process itself constitutes a form of learning within each task:

  1. Exploration phase (early hours): Workers explore diverse strategies; population builds breadth
  2. Exploitation phase (later hours): Selection pressure focuses on refinement of promising approaches
  3. Crossover-driven innovation: Combining elements from different lineages occasionally produces breakthroughs

The temperature parameter T=0.2 controls this exploration-exploitation tradeoff. The paper does not report experiments with adaptive temperature schedules, which could potentially further improve performance.

Scaling Laws

The paper presents evidence for a compute scaling law in AI research agents:

  • Performance improves approximately logarithmically with GPU-hours
  • The relationship is not yet saturating at 72 hours / 576 GPU-hours
  • This suggests that with even more compute (e.g., 32 GPUs for 72 hours), further gains are possible

This parallels the scaling laws observed in LLM pretraining, but applied to the meta-level of automated research agent performance.
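As a rough sanity check of the logarithmic claim, one can least-squares fit PR ≈ a + b·ln(GPU-hours) to the reported 8-GPU trajectory above; this fit is our illustration, not an analysis from the paper.

```python
import math

# Reported mean percentile ranks on the 8-GPU system (wall-clock hours, PR %).
# GPU-hours = 8 × wall-clock hours.
points = [(3, 59.9), (6, 65.5), (12, 68.8), (24, 71.8), (72, 76.0)]

def fit_log(points):
    """Least-squares fit of PR ≈ a + b·ln(GPU-hours)."""
    xs = [math.log(8 * h) for h, _ in points]
    ys = [pr for _, pr in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b
```

The fitted line tracks the reported points to within about one percentile-rank point, consistent with "approximately logarithmic" scaling that has not yet saturated at 576 GPU-hours.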


15 Applications

Direct Applications

| Application | Description | Status |
|---|---|---|
| Kaggle competition solving | Automated entry into ML competitions | Demonstrated (MLE-bench-30) |
| Automated ML pipeline engineering | End-to-end ML solution development | Core capability |
| Inference-time compute scaling | Using additional compute to improve solution quality | Demonstrated (linear scaling) |

Broader Research Implications

1. Overfitting Diagnosis in Agent Systems

The HCE protocol provides a general methodology for diagnosing whether observed performance degradation in agent systems is due to true overfitting vs. evaluation noise. This has implications beyond ML research agents — any agent system with noisy evaluation signals could benefit from this analysis framework.

2. Parallelism + Evolution as a General Pattern

The finding that "parallelism without information sharing is insufficient" suggests a general principle: to effectively utilize additional compute in agent systems, you need both (a) parallel execution and (b) an information-sharing mechanism (like evolution) that enables workers to build on each other's discoveries.

3. Dynamic Operators via ReAct

The replacement of static operators with ReAct agents is applicable to any evolutionary system where operator design is a bottleneck. Rather than hand-designing operators for each domain, let agents determine their own action sequences.

Limitations and Scope

| Limitation | Impact |
|---|---|
| Kaggle-only evaluation | Unclear if results generalize to open-ended research (no predefined competition metric) |
| No paper writing | Produces code solutions, not research papers |
| Single-benchmark focus | MLE-bench-30 only; no evaluation on other benchmarks |
| Fixed LLM | Tied to Gemini 3.0 Pro; no multi-model experiments |
| No cross-task transfer | Each task starts from scratch |
| High compute cost | 8× H200 for 72h is accessible only to well-funded labs |

Connections to OmniEvolve

AIRA₂'s architecture maps directly to several OmniEvolve design patterns:

| AIRA₂ Component | OmniEvolve Equivalent |
|---|---|
| Evolutionary Orchestrator | omnievolve/orchestrator/ + omnievolve/search/ |
| Asynchronous Worker Pool | Async worker management in search backends |
| HCE Protocol | omnievolve/evaluation/ cascade evaluator with externalized grading |
| ReAct Agents as Operators | omnievolve/mutation/ — dynamic mutation operators |
| Population Database | omnievolve/knowledge/ program database |
| Temperature-scaled selection | omnievolve/search/ selection strategies |
| Containerized execution | omnievolve/safety/ sandbox execution |

References

  • Hambardzumyan, K., et al. (2026). "AIRA₂: Overcoming Bottlenecks in AI Research Agents." arXiv:2603.26499.
  • Toledo, E., et al. (2025). "AIRA-dojo." [predecessor system]
  • Chan, J., et al. (2025). "MLE-bench." [benchmark definition]
  • Singh, I., et al. (2025). "MLE-bench-30." [GPT-5 system card subset]
  • Chen, X., et al. (2026). "MARS / MARS+." [concurrent work]
  • Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models."
  • Syswerda, G. (1991). "A Study of Reproduction in Generational and Steady State Genetic Algorithms."
  • Kurtzer, G., et al. (2021). "Apptainer." [container runtime]
  • Du, M., et al. (2025). "MLEvolve." [concurrent work]
  • Botla, V., et al. (2025). "PiEvolve." [concurrent work]
  • Li, Z., et al. (2025). "FM-Agent 2.0." [concurrent work]
  • Liu, J., et al. (2025). "ML-Master 2.0." [concurrent work]

Classification: Autoresearch — AIRA₂ is an AI system that autonomously conducts ML research (solving Kaggle competitions as a proxy for research capability) using evolutionary search with LLM-powered operators. It falls squarely in the autoresearch category as an agent harness for automated experimental research.