AIRA₂
Asynchronous multi-GPU research agent with evolutionary search, Hidden Consistent Evaluation, and ReAct operators that achieves state-of-the-art on MLE-bench-30.
- Organization: FAIR at Meta / University College London / University of Oxford
- Published: March 27, 2026
- Type: paper (arXiv:2603.26499)
- Report Type: PhD-Level Technical Analysis
- Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
AIRA₂: Overcoming Bottlenecks in AI Research Agents
- Venue: arXiv preprint (cs.AI), March 2026
- DOI: 10.48550/arXiv.2603.26499
- License: CC-BY 4.0
- Predecessor: AIRA-dojo (Toledo et al., 2025) — the previous best open-source agent on MLE-bench
- Benchmark: MLE-bench-30 (curated 30-task subset of MLE-bench, used in GPT-5 system card)
- Relation to prior work: Direct successor to AIRA-dojo; builds on the three-bottleneck formalization from Toledo et al. (2025) and addresses each bottleneck with a targeted architectural choice
The paper explicitly frames itself as a systems paper: the contribution is not a new search algorithm or LLM capability, but a set of engineering decisions that resolve identified structural bottlenecks preventing scaling of AI research agents.
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Karen Hambardzumyan* | FAIR at Meta, UCL | Equal contribution (lead) |
| Nicolas Baldwin* | FAIR at Meta | Equal contribution |
| Edan Toledo* | FAIR at Meta, UCL | Equal contribution; lead of predecessor AIRA-dojo |
| Rishi Hazra* | FAIR at Meta | Equal contribution |
| Michael Kuchnik* | FAIR at Meta | Equal contribution |
| Bassel Al Omari | FAIR at Meta | — |
| Thomas Simon Foster | FAIR at Meta, Oxford | — |
| Anton Protopopov | FAIR at Meta | — |
| Jean-Christophe Gagnon-Audet | FAIR at Meta | — |
| Ishita Mediratta | FAIR at Meta | — |
| Kelvin Niu | FAIR at Meta | — |
| Michael Shvartsman | FAIR at Meta | — |
| Alisia Lupidi | FAIR at Meta, Oxford | — |
| Alexis Audran-Reiss | FAIR at Meta | — |
| Parth Pathak | FAIR at Meta | — |
| Tatiana Shavrina | FAIR at Meta | — |
| Despoina Magka | FAIR at Meta | — |
| Hela Momand | FAIR at Meta | — |
| Derek Dunfield | FAIR at Meta | — |
| Nicola Cancedda | FAIR at Meta | — |
| Pontus Stenetorp | UCL | Academic advisor |
| Carole-Jean Wu | FAIR at Meta | Senior leadership |
| Jakob Nicolaus Foerster | FAIR at Meta, Oxford | Academic advisor |
| Yoram Bachrach | FAIR at Meta | Senior researcher |
| Martin Josifoski* | FAIR at Meta | Equal contribution; correspondence author |
*Equal contribution — author order determined by Mario Kart placement (as noted in the paper).
Team size: 25 authors across FAIR at Meta, University College London, and University of Oxford. This is a large industrial research team with significant compute access, reflecting the resource-intensive nature of the problem.
3 Core Contribution
AIRA₂ identifies and resolves three structural bottlenecks in AI research agents, each with a targeted architectural solution:
| Bottleneck | Problem | AIRA₂ Solution |
|---|---|---|
| Compute Throughput | Synchronous single-GPU execution starves exploration; agents are "sample-bound" (≈1–20 candidates/day on compute-heavy tasks) | Asynchronous multi-GPU worker pool with steady-state evolution; 8 GPUs yield ≈8× throughput |
| Generalization Gap | Validation-based selection causes overfitting; oracle experiments show 9–13% medal rate improvement with test-based selection | Hidden Consistent Evaluation (HCE) protocol: fixed splits, hidden labels, decoupled selection |
| Static Operator Limitation | Fixed single-turn LLM prompts cannot adapt to novel task demands; advanced search cannot compensate for shallow operators | ReAct agents with dynamic scoping and interactive debugging replace all static operators |
Key insight: The paper demonstrates that "overfitting" reported in prior work (Toledo et al., 2025) was actually evaluation noise — not true data memorization. HCE eliminates this noise, enabling performance to improve monotonically with compute.
The contribution is primarily systems-level design rather than a new algorithm. The evolutionary search itself is standard (temperature-scaled rank selection); the innovation lies in the infrastructure that makes long-horizon evolutionary search actually work.
Relationship to the Field
The paper positions itself within a rapid progression of ML research agents competing on MLE-bench:
Timeline of MLE-bench agents (2025–2026), ordered by 24h percentile rank:
─────────────────────────────────────────────────────
AIRA-dojo (2025)     → 39.5% PR @ 24h (single GPU)
PiEvolve (2025)      → 54.1% PR @ 24h
ML-Master 2.0 (2025) → 57.6% PR @ 24h
MARS (2026)          → 60.4% PR @ 24h
MLEvolve (2025)      → 64.1% PR @ 24h
FM-Agent 2.0 (2025)  → 69.6% PR @ 24h
MARS+ (2026)         → 69.9% PR @ 24h ← previous SOTA
AIRA₂ (2026)         → 71.8% PR @ 24h → 76.0% @ 72h ← new SOTA
─────────────────────────────────────────────────────
4 Supported Solutions
Problem Framing
AIRA₂ frames automated ML research as a search problem over a graph of candidate solutions. This decomposition was formalized in AIRA-dojo:
- Search policy: Selects which node (candidate solution) to expand
- Operators: Transform a node into new candidates (mutation, crossover)
- Evaluation signal: Provides fitness to guide the search
Solution Space
The system operates on Kaggle competition solutions as the unit of search. Each candidate is a complete ML pipeline: data loading, preprocessing, feature engineering, model architecture, training loop, hyperparameter configuration, and prediction formatting.
Search Methods Supported
| Method | Description | Status in AIRA₂ |
|---|---|---|
| Steady-state evolution | Asynchronous; new individuals added as workers complete | Primary method |
| Temperature-scaled rank selection | Rank-based parent selection with tunable exploration | Default (T=0.2) |
| Mutation | Single-parent ReAct agent trajectory | Primary operator |
| Crossover | Two-parent ReAct agent trajectory | 15% probability |
| Greedy / Best-of-K | Ablation baseline; parallel without information sharing | Ablation only |
What AIRA₂ Does NOT Do
- No neural architecture search (NAS) as a separate subsystem
- No explicit hyperparameter optimization (handled implicitly by ReAct agents)
- No meta-learning or transfer across tasks
- No automated paper writing or research ideation — focused exclusively on competition solution engineering
5 LLM Integration
Primary Model
| Parameter | Value |
|---|---|
| Model | Gemini 3.0 Pro Preview (Google DeepMind, 2025) |
| Role | Powers all ReAct agent workers |
| Context | Multi-turn with stateful tool execution |
| Alternatives tested | Not reported (all main results use Gemini 3.0 Pro) |
Note: Baselines also use Gemini 3.0 Pro Preview, except ML-Master 2.0 which uses DeepSeek V3.2-Speciale. This makes the comparison fair on the LLM dimension.
LLM Usage Pattern
The LLM is used exclusively as the reasoning engine within ReAct agent trajectories:
ReAct Trajectory τ:
Reason₁ → Act₁ → Obs₁ → Reason₂ → Act₂ → Obs₂ → ... → Reason_K → Submit
Actions: Python code execution, Bash commands (in sandboxed container)
Observations: stdout/stderr, execution duration, file contents
Final: "submit" tool sends solution to orchestrator
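The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `call_llm`, `run_tool`, `Step`, and the step budget are all hypothetical names and values, since the actual tool protocol is not published.

```python
from dataclasses import dataclass

@dataclass
class Step:
    reason: str       # LLM reasoning text
    action: str       # serialized tool call, e.g. a bash or Jupyter request
    observation: str  # stdout/stderr, timing, file contents

def run_trajectory(call_llm, run_tool, context, max_steps=50):
    """Reason -> Act -> Observe until the agent calls the submit tool."""
    history = []
    for _ in range(max_steps):
        reason, action = call_llm(context, history)
        if action["tool"] == "submit":          # final action: hand the
            return action["solution"], history  # solution to the orchestrator
        obs = run_tool(action)                  # stateful bash / Jupyter call
        history.append(Step(reason, str(action), obs))
    raise TimeoutError("trajectory exceeded step budget")
```

Note that no guidance is injected inside the loop: the only inputs are the initial context and the agent's own accumulated history, matching the paper's "no additional guidance within trajectory" design.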
Key design decisions about LLM integration:
- No prompt engineering per operator: Unlike AIRA-dojo which had separate Draft/Improve/Debug prompts, AIRA₂ gives the ReAct agent a single context (parent solution + task description + metadata) and lets it decide what to do
- No additional guidance within trajectory: "Within the ReAct trajectory, no additional guidance and instructions are provided" — the agent autonomously determines its action sequence
- Stateful tool execution: Unlike AIDE and AIRA-dojo, bash and Jupyter kernel tools maintain state across turns, enabling iterative debugging
- Execution duration feedback: Tool outputs include timing information, enabling agents to monitor their own efficiency
Prompt Structure
The orchestrator provides the ReAct agent with:
- Task description and data schema
- Parent solution code and its fitness score
- (For crossover) Second parent solution
- Population metadata (scores, strategies attempted)
The agent then autonomously executes a multi-step trajectory of reasoning and code execution.
6 Key Results
Primary Metric: Percentile Rank on MLE-bench-30
| Time Budget | AIRA₂ (8 GPU) | Best Baseline | Gap |
|---|---|---|---|
| 3h | 59.9% ± 3.6 | — | — |
| 6h | 65.5% | — | — |
| 12h | 68.8% | — | — |
| 24h | 71.8% ± 3.5 | 69.9% (MARS+) | +1.9 pp |
| 72h | 76.0% ± 3.4 | — | +6.1 pp vs 24h SOTA |
Medal Rates at 72 Hours
| Medal Tier | Rate |
|---|---|
| Bronze+ | 61.1% ± 5.2 |
| Silver+ | 58.9% ± 5.2 |
| Gold | 36.7% ± 5.1 |
Ablation Results (72h)
| Configuration | Percentile Rank | Δ from Full |
|---|---|---|
| Full AIRA₂ (8 GPU) | 76.0% | — |
| No Subagents (8 GPU, static operators) | 73.7% | −2.3 pp |
| 1 GPU (with subagents + HCE) | 63.5% | −12.5 pp |
| No HCE (8 GPU, with subagents) | 56.3% | −19.7 pp |
| No Evolution (Best-of-K, 8 GPU) | 65.2% | −10.8 pp |
Critical finding: Removing HCE causes performance to degrade from 24h to 72h (56.8% → 56.3%), confirming that without reliable evaluation, longer search actively hurts. With HCE, performance monotonically improves.
Compute Efficiency Analysis
The paper normalizes performance by cumulative GPU-hours to compare 1-GPU vs 8-GPU:
- At low GPU-hours (< 24), 1-GPU is slightly more efficient (no overhead of building initial population)
- At 24+ GPU-hours, 8-GPU becomes increasingly efficient
- The gap widens to 7.5 percentile rank points at 144 GPU-hours
- Parallel compute without evolution (Best-of-K) saturates — it converges to the same final performance as 1-GPU, showing that parallelism alone is insufficient; evolutionary information sharing is essential
Statistical Methodology
- 3 independent seeds per task, mean ± standard error reported
- Percentile Rank chosen over medal rate as primary metric because it is:
  - Continuous (not discrete thresholds)
  - Captures the full distribution (not a binary medal outcome)
  - Avoids threshold effects near medal boundaries that amplify noise
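As a concrete illustration of the metric, percentile rank can be computed against a competition leaderboard roughly as follows. The exact MLE-bench tie-handling and direction conventions are not spelled out in this report, so treat this as an assumption:

```python
def percentile_rank(agent_score, leaderboard, higher_is_better=True):
    """Percentage of human leaderboard entries the agent's score beats.

    Tie-handling is an assumption (ties count as not beaten); the
    report does not reproduce the exact MLE-bench formula.
    """
    if not higher_is_better:          # e.g. error metrics: lower is better
        agent_score = -agent_score
        leaderboard = [-s for s in leaderboard]
    beaten = sum(1 for s in leaderboard if agent_score > s)
    return 100.0 * beaten / len(leaderboard)
```

Because this is a continuous fraction of the leaderboard, a small solution improvement moves the metric smoothly, whereas a medal rate only changes when a discrete threshold is crossed.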
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Benchmark | MLE-bench-30 is publicly defined (used in GPT-5 system card) |
| Evaluation protocol | HCE fully specified (80/10/10 split, externalized grading) |
| Hyperparameters | Temperature T=0.2, crossover probability pc=15% fully reported |
| Hardware | 8× NVIDIA H200 GPUs, 12 CPU cores/worker, 120GB RAM/worker |
| Container environment | Apptainer containers with Superimage (publicly available) |
| Statistical methodology | 3 seeds, SE intervals, proper normalization |
Limitations
| Aspect | Concern |
|---|---|
| Code availability | Not open-sourced at time of publication |
| LLM dependency | Gemini 3.0 Pro Preview is a proprietary, versioned API — exact behavior may change |
| Infrastructure complexity | Requires custom orchestrator, containerization system, remote tool execution — significant engineering to replicate |
| Cost | 8× H200 GPUs for 72 hours is expensive (≈$9,000–15,000 per 30-task evaluation; see Section 8) |
| Prompt content | System prompts and ReAct agent instructions not fully reproduced in paper |
| Data splits | Specific split indices not published (though the splitting procedure is deterministic given seeds) |
Predecessor Availability
AIRA-dojo (the predecessor) was open-sourced, including the Superimage container environment. This partially mitigates reproducibility concerns since AIRA₂ builds on that infrastructure.
8 Compute and API Costs
Hardware Configuration
Per-worker allocation:
┌──────────────────────────┐
│ 1× NVIDIA H200 (141GB) │
│ 12 logical CPU cores │
│ 120GB system RAM │
│ Dedicated GPU mapping │
└──────────────────────────┘
Full system (main experiments):
┌────────────────────────────────┐
│ 8× H200 workers │
│ 1× orchestrator (CPU-only) │
│ 1× evaluation GPU (dedicated) │
│ Total: ~9 GPUs active │
└────────────────────────────────┘
Estimated Costs per Full Evaluation
| Resource | Quantity | Duration | Estimated Cost |
|---|---|---|---|
| H200 GPUs (workers) | 8 | 72h | ~$8,000–12,000 (cloud pricing) |
| H200 GPU (evaluation) | 1 | 72h (intermittent) | ~$500–1,000 |
| LLM API (Gemini 3.0 Pro) | ~1000s of trajectories | 72h | ~$500–2,000 (estimated) |
| Total per 30-task run | — | — | ~$9,000–15,000 |
These are rough estimates based on 2026 cloud H200 pricing. Meta likely uses internal GPU clusters, making actual costs lower.
Throughput Statistics
- Single GPU: ≈1–20 candidates evaluated per day (task-dependent)
- 8 GPUs: ≈8× throughput (approximately linear scaling demonstrated)
- Execution cap: 9-hour hard time limit per individual code execution
- Scaling: Paper claims linear throughput scaling with GPU count; no diminishing returns observed up to 8 GPUs
Cost-Effectiveness Argument
The paper argues that 8-GPU parallelism is not merely "doing the same thing faster" — the evolutionary information sharing between workers creates solutions that a single GPU cannot reach regardless of time:
"Parallelism without information sharing (Best-of-K) saturates early, converging to the same final performance as the single-GPU agent."
9 Architecture Solution
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ AIRA₂ SYSTEM ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ EVOLUTIONARY ORCHESTRATOR │ │
│ │ │ │
│ │ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Population P │ │ Selection │ │ Dispatch │ │ │
│ │ │ (candidates, │ │ (rank-based,│ │ (async, │ │ │
│ │ │ scores, │ │ T=0.2) │ │ on-demand)│ │ │
│ │ │ metadata) │ │ │ │ │ │ │
│ │ └──────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ┌─────┴─────┐ ┌────┴────┐ ┌────┴────┐ │
│ ▼ ▼ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Worker 1│ │Worker 2│ │Worker 3│ ··· │Worker N│ │
│ │ ReAct │ │ ReAct │ │ ReAct │ │ ReAct │ │
│ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Dev │ │Dev │ │Dev │ │Dev │ │
│ │Contanr │ │Contanr │ │Contanr │ │Contanr │ │
│ │(1×GPU) │ │(1×GPU) │ │(1×GPU) │ │(1×GPU) │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │ │
│ └──────────┴──────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ HIDDEN CONSISTENT│ │
│ │ EVALUATION (HCE) │ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Eval Contanr │ │ │
│ │ │ (1×GPU) │ │ │
│ │ └──────────────┘ │ │
│ │ │ │
│ │ D_search → score │ │
│ │ D_val → select │ │
│ │ D_test → report │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Two-Tier Design
The architecture comprises two tiers:
- Global Orchestrator: Maintains population state, performs selection, dispatches mutation/crossover tasks to available workers, integrates evaluated results back into the population
- Asynchronous Worker Pool: N independent ReAct agents, each with a dedicated GPU and ephemeral container, executing multi-step reasoning trajectories to produce candidate solutions
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Steady-state evolution (not generational) | Workers never idle waiting for synchronization barriers; fast mutations complete and feed back immediately |
| 1:1 GPU-worker mapping (static allocation) | Eliminates dynamic scheduling complexity; clean-slate environments for every execution |
| Separate evaluation container | Agents never see labels; prevents metric gaming and feedback loops |
| In-memory program database | Fast access for orchestrator; large artifacts spill to disk |
| Ephemeral Apptainer containers | Crashed containers don't affect orchestrator or other workers |
| Fakeroot mode | Agents can apt install or pip install without actual root privileges |
10 Component Breakdown
Component 1: Evolutionary Orchestrator
Purpose: Coordinate search over population of candidate solutions.
| Parameter | Value |
|---|---|
| Selection | Temperature-scaled rank-based |
| Temperature T | 0.2 (biased toward exploitation) |
| Crossover probability | 15% |
| Population storage | In-memory with disk spillover |
| Communication | Asynchronous; no synchronization barriers |
Selection Formula:
p(i) = (N - rᵢ + 1)^(1/T) / Σⱼ (N - rⱼ + 1)^(1/T)
Where:
- N = population size
- rᵢ = rank of individual i (1 = best)
- T = temperature (T→0: greedy, T→∞: uniform)
Rank-based selection is chosen over fitness-proportionate because ranks are invariant to magnitude and scale of fitness scores, which vary widely across Kaggle tasks.
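The selection rule above is simple enough to state directly in Python. The following sketch uses the reported default T = 0.2; the helper names are ours, not the paper's:

```python
import random

def rank_selection_probs(ranks, T=0.2):
    """p(i) ∝ (N - r_i + 1)^(1/T), with rank 1 = best individual."""
    N = len(ranks)
    weights = [(N - r + 1) ** (1.0 / T) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]

def select_parent(population, scores, T=0.2, rng=random):
    """Rank individuals by score (higher = fitter), then sample a parent."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return rng.choices(population, weights=rank_selection_probs(ranks, T))[0]
```

At T = 0.2 the exponent 1/T = 5 concentrates probability heavily on the top ranks (for N = 4, the best individual is picked about 79% of the time), while raising T flattens the distribution toward uniform, matching the limits stated above. Because only ranks enter the formula, the same T behaves consistently whether a task's metric ranges over [0, 1] or over thousands.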
Component 2: Asynchronous Worker Pool
Purpose: Execute ReAct agent trajectories in parallel.
Each worker:
1. Receives parent solution(s) + task context from orchestrator
2. Executes multi-step ReAct trajectory (Reason → Act → Observe loop)
3. Submits candidate solution via "submit" tool
4. Returns to the pool; orchestrator evaluates and integrates the result
Container Environment (Superimage):
- Pre-installed: Python, PyTorch, CUDA, standard data science stack
- Fakeroot mode: perceived root for apt/pip installations
- Isolated: crashed containers don't affect other workers
- Stateful: Bash and Jupyter kernel state persists across turns within a trajectory
Component 3: Hidden Consistent Evaluation (HCE)
Purpose: Provide reliable evaluation signal for search; decouple search optimization from final selection.
Data Partitioning (created once, reused across all seeds):
Available labeled data
├── D_train (80%) → visible to agent for model training
├── D_search (10%) → used by orchestrator for fitness; labels hidden from agents
└── D_val (10%) → used ONLY for final selection after search terminates
And separately:
D_test → held-out competition test set (never used during search or selection)
Evaluation flow:
1. Worker submits solution → predictions generated for all splits
2. Orchestrator evaluates on D_search → fitness score returned to search
3. D_val scores computed but NOT used during search
4. After search terminates, final submission selected by D_val score
5. D_test used only for final reporting
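A minimal sketch of the split construction and score gating follows. The 80/10/10 proportions and determinism come from the paper; the seeded-shuffle procedure and class names are our assumptions:

```python
import random

def make_hce_splits(example_ids, seed=0):
    """Deterministic 80/10/10 partition, created once and reused across
    all runs and seeds (the 'consistent' part of HCE)."""
    ids = sorted(example_ids)                  # canonical order first
    random.Random(seed).shuffle(ids)           # then a fixed shuffle
    n = len(ids)
    n_search, n_val = n // 10, n // 10
    return {
        "search": ids[:n_search],                   # fitness; labels hidden
        "val":    ids[n_search:n_search + n_val],   # final selection only
        "train":  ids[n_search + n_val:],           # visible to the agent
    }

class HCEScores:
    """Scores on every split are computed up front, but the search loop
    is only ever shown D_search (the 'hidden' part of HCE)."""
    def __init__(self, search, val, test):
        self._scores = {"search": search, "val": val, "test": test}
    def fitness(self):               # what the evolutionary search sees
        return self._scores["search"]
    def final_selection_score(self): # consulted once, after search ends
        return self._scores["val"]
```

The key property is that the agent never computes or observes its own fitness: scoring happens in a separate evaluation container, against labels the agent's container never mounts.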
Component 4: ReAct Agent Operators
Purpose: Replace static Draft/Improve/Debug prompts with autonomous multi-step agents.
Trajectory format:
τ = (Reason₁, Act₁, Obs₁, ..., Reason_{K-1}, Act_{K-1}, Obs_{K-1}, Reason_K, Act_K)
Capabilities gained over static operators:
| Capability | Static Operators | ReAct Agents |
|---|---|---|
| Exploratory data analysis | Pre-defined prompt | Agent decides at runtime |
| Debugging | Separate "Debug" operator, no iterative access | Same trajectory, observes traceback, hypothesizes fix, re-executes |
| Resource allocation | Fixed per operator | Dynamic — spends more time on harder sub-problems |
| Scope engineering | Hand-designed per anticipated sub-task | Agent determines needed actions autonomously |
| Local experimentation | Not supported | Can run small experiments before committing to ideas |
Component 5: Resource Management
| Resource | Allocation |
|---|---|
| GPU | 1:1 worker mapping (NVIDIA H200, 141GB VRAM) |
| CPU | 12 logical cores per worker |
| RAM | 120GB per worker |
| Time limit | 9-hour hard cap per code execution |
| Global limit | 72-hour wall-clock (configurable) |
| Evaluation | Foreground execution (no separate job queue) |
11 Core Mechanisms (Detailed)
11.1 Asynchronous Steady-State Evolution
Traditional generational evolution waits for all individuals to be evaluated before proceeding. AIRA₂ uses steady-state evolution (Syswerda, 1991), where:
- When any worker becomes idle → orchestrator immediately dispatches new task
- New individual is added to population immediately upon evaluation
- No synchronization barriers between workers
- Fast mutations complete quickly and provide early feedback; slow mutations don't block others
This is particularly critical for ML research tasks where mutation duration varies enormously — from minutes (hyperparameter tweak) to hours (training a large model from scratch).
Information flow in steady-state vs. generational evolution:
Generational (synchronous):
Gen 0: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶
Gen 1: [W1|W2|W3|W4|W5|W6|W7|W8] ──barrier──▶
Problem: Fast workers idle, waiting for slow workers
Steady-state (async, AIRA₂):
W1: ───────────submit──▶ ──────submit──▶ ───submit──▶
W2: ──submit──▶ ──────submit──▶ ────────submit──▶
W3: ────────────────submit──▶ ──submit──▶ ────▶
...
Advantage: No idle time; fast tasks complete quickly
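The steady-state dispatch loop can be sketched with Python's standard thread pool: whenever any future completes, its result is integrated and the freed worker is immediately re-dispatched, with no generation barrier. All names here are illustrative, not the paper's:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def steady_state_search(mutate, select_parent, population,
                        n_workers=8, total_tasks=32):
    """Steady-state loop: each completed mutation is integrated into the
    population immediately and the freed worker is re-dispatched, so
    fast mutations feed back while slow ones are still running."""
    dispatched = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Fill the worker pool once at startup.
        while dispatched < min(n_workers, total_tasks):
            pending.add(pool.submit(mutate, select_parent(population)))
            dispatched += 1
        while pending:
            # Wake as soon as ANY worker finishes -- no barrier.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                population.append(fut.result())   # integrate immediately
                if dispatched < total_tasks:      # keep the worker busy
                    pending.add(pool.submit(mutate, select_parent(population)))
                    dispatched += 1
    return population
```

Because `select_parent` reads the live population at dispatch time, a worker started late inherits improvements discovered by workers that finished early, which is exactly the information-sharing that Best-of-K lacks.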
11.2 Hidden Consistent Evaluation (HCE) — Deep Analysis
HCE is arguably the paper's most important contribution. The authors use it as both a practical safeguard and an experimental tool to diagnose overfitting.
The overfitting diagnosis:
Prior work (Toledo et al., 2025) reported that agent performance degrades with extended search — interpreted as overfitting to validation data. AIRA₂'s experiments reveal this was actually evaluation noise from three sources:
- Implementation bugs inflating metrics: Agents sometimes produce code that reports artificially high validation scores due to bugs (e.g., data leakage in preprocessing)
- Brittle output parsing: Missing or erroneous score extraction leads to incorrect fitness signals
- Stochastic data splitting: Random seeds for train/val splits introduce variance; inferior solutions survive selection due to favorable random partitions
HCE eliminates all three:
- External evaluation prevents self-reported metric gaming
- Consistent splits eliminate seed-induced variance
- Hidden labels prevent agents from accessing evaluation data
Empirical validation: Without HCE, performance degrades from 56.8% (24h) to 56.3% (72h) — confirming overfitting. With HCE, performance improves monotonically: 71.8% (24h) → 76.0% (72h).
11.3 ReAct Agent Dynamics
The shift from static operators to ReAct agents represents a fundamental change in how the system allocates compute:
Static operator paradigm (AIRA-dojo):
Orchestrator → select operator (Draft/Improve/Debug/EDA)
→ fill template prompt
→ single LLM call
→ parse output
→ evaluate
ReAct agent paradigm (AIRA₂):
Orchestrator → provide context (parent solution, task, metadata)
→ ReAct agent autonomously determines:
- What analyses to run
- What code to write
- When to debug
- When to experiment locally
- When to submit
→ Multi-turn trajectory until submission
Dynamic compute allocation: On easy sub-tasks, agents submit quickly (few reasoning steps). On hard sub-tasks, agents spend many turns debugging, experimenting, and iterating. This naturally allocates LLM compute proportional to difficulty — something impossible with fixed, single-turn operators.
11.4 Containerization and Isolation
The containerization system uses Apptainer (formerly Singularity) with the Superimage environment:
Container lifecycle per mutation:
1. Spawn fresh Apptainer container
2. Mount parent solution code
3. ReAct agent executes in container (stateful bash + Jupyter)
4. Agent submits → solution extracted
5. Container destroyed
6. Solution evaluated in SEPARATE container (HCE)
Key properties:
- Fakeroot mode: Agents perceive root privileges without actual root access
- Pre-installed environment: Comprehensive ML/data science stack avoids installation delays
- Crash isolation: Container failures don't propagate
- Stateful tools: Unlike prior work, bash and Jupyter state persists across agent turns (within one mutation trajectory)
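The per-mutation container launch might look roughly like the following command construction. The specific flags are assumptions based on standard Apptainer usage (`--fakeroot`, `--nv`, `--containall`, `--bind`, `--env`), not the paper's actual invocation, and the image/path names are hypothetical:

```python
def build_exec_cmd(image, workdir, gpu_index, command,
                   apptainer="apptainer"):
    """Construct (not run) an Apptainer invocation for one mutation.
    Flag choice is an assumption from standard Apptainer usage; the
    paper does not list its exact launch command."""
    return [
        apptainer, "exec",
        "--fakeroot",            # perceived root for apt/pip installs
        "--nv",                  # pass through the worker's NVIDIA GPU
        "--containall",          # isolate env/home/tmp from the host
        "--bind", f"{workdir}:/workspace",              # mount parent code
        "--env", f"CUDA_VISIBLE_DEVICES={gpu_index}",   # 1:1 GPU mapping
        image, "bash", "-lc", command,
    ]
```

The orchestrator would run this list with a process launcher, destroy the container after the submit call, and then run evaluation in a separate container built the same way but with the hidden-label data mounted.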
11.5 Population Dynamics and Diversity
The paper uses temperature T=0.2, which biases selection toward high-fitness individuals while maintaining some diversity. Analysis shows:
- At very early stages (first few GPU-hours), the 8-GPU setup actually underperforms 1-GPU because it spends compute building a diverse initial population
- After the initial population is established, the 8-GPU setup achieves increasingly superior performance
- The crossover rate of 15% introduces genetic diversity by combining strategies from different lineages
The paper also demonstrates that parallelism without evolution is insufficient: Best-of-K (8 independent workers, no information sharing) saturates at the same performance level as a single GPU given enough time. Only evolutionary information sharing (selection pressure + crossover) enables the 8-GPU setup to find solutions unreachable by any single worker.
12 Programming Language
System Implementation
The paper does not specify the exact implementation language, but based on the predecessor (AIRA-dojo) and the described infrastructure:
| Component | Likely Language | Evidence |
|---|---|---|
| Orchestrator | Python | AIRA-dojo is Python; ML research tooling standard |
| ReAct Agent | Python (LLM-generated actions) | Agents write Python/Bash in containers |
| Container Environment | Python (Superimage) | Pre-installed Python ML stack |
| Candidate Solutions | Python | Kaggle competition solutions are Python |
Agent-Generated Code
Agents produce Python code for ML solutions within Apptainer containers. The environment includes:
- PyTorch, TensorFlow, JAX
- scikit-learn, XGBoost, LightGBM, CatBoost
- pandas, numpy, scipy
- Standard data science preprocessing libraries
Tool Interface
The ReAct agents interact via two primary tool types:
- Bash command execution — stateful shell session
- Jupyter kernel execution — stateful Python kernel
Both maintain state across turns, enabling iterative development workflows (write code → run → observe error → fix → run again) within a single mutation trajectory.
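The value of statefulness is easy to see in a toy version of the Jupyter tool: one persistent namespace shared across `run()` calls, so a variable defined in one agent turn survives into the next. This is an illustration, not the actual tool implementation:

```python
import io, contextlib, traceback

class StatefulKernel:
    """Toy stand-in for the stateful Jupyter tool: a single namespace
    persists across run() calls, so data loaded or functions defined in
    one agent turn remain available in later turns."""
    def __init__(self):
        self.ns = {}                             # survives across calls
    def run(self, code):
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.ns)              # same dict every call
        except Exception:
            # Return the traceback as the observation, so the agent can
            # read the error and attempt a fix in the next turn.
            return buf.getvalue() + traceback.format_exc()
        return buf.getvalue()
```

A stateless tool would have to re-run data loading and preprocessing on every call; here the agent can load once, then iterate on the failing step alone, which is what makes interactive debugging cheap.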
13 Memory Management
Population-Level Memory
The program database serves as the system's primary memory:
| Aspect | Implementation |
|---|---|
| Storage location | In-memory (orchestrator process) |
| Large artifacts | Automatically offloaded to disk |
| Contents per individual | Code, fitness score, metadata, parent lineage |
| Access pattern | Fast read for selection; append on new evaluation |
| Communication | Subagents communicate exclusively through central database |
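A minimal sketch of such a program database follows. The field names and id scheme are assumptions, since the paper describes the contents (code, fitness, metadata, lineage) but not the schema, and disk spillover for large artifacts is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    code: str               # full candidate solution script
    fitness: float          # D_search score from HCE
    parent_ids: tuple       # lineage: 1 id (mutation) or 2 (crossover)
    metadata: dict = field(default_factory=dict)

class ProgramDatabase:
    """In-memory population store: append on each evaluation, fast
    reads for selection. Workers never touch it directly -- the
    orchestrator mediates all access."""
    def __init__(self):
        self._items = []
    def add(self, ind):
        self._items.append(ind)
        return len(self._items) - 1       # id = insertion index
    def ranked(self):
        """Individuals sorted best-first, as rank selection needs."""
        return sorted(self._items, key=lambda i: i.fitness, reverse=True)
```

Keeping lineage as parent ids makes the evolutionary history reconstructible after the fact, which is how analyses like "crossover-driven breakthroughs" can be traced.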
Worker-Level Memory
Each ReAct agent maintains trajectory-level memory through:
- Conversation history: Full Reason-Act-Observe chain from the current trajectory
- Stateful bash session: Environment variables, working directory, installed packages persist across turns
- Stateful Jupyter kernel: Variables, loaded data, defined functions persist across turns
- Parent context: Injected at trajectory start (parent solution code, score, task metadata)
Cross-Worker Communication
Workers do not communicate directly with each other. All information sharing is mediated through the population:
Worker A → submits solution → evaluated → added to population
↓
Population state updated
↓
Worker B → samples parent from population → inherits A's improvements
This is a critical design choice: it avoids the complexity of peer-to-peer communication while still enabling evolutionary information transfer through the selection mechanism.
No Explicit Long-Term Memory
Unlike some research agent systems that maintain explicit knowledge bases or skill libraries, AIRA₂ relies entirely on the population as implicit memory. Good strategies survive through fitness-based selection; bad strategies are displaced. There is no explicit mechanism for:
- Extracting reusable skills or patterns
- Summarizing lessons learned across tasks
- Transferring knowledge between tasks (each task is independent)
14 Continued Learning
Within-Task Learning
AIRA₂ demonstrates within-task continued improvement through evolutionary search:
Performance trajectory (mean across 30 tasks):
3h → 59.9% PR
6h → 65.5% PR (+5.6 pp)
9h → ~67% PR (interpolated)
12h → 68.8% PR (+3.3 pp from 6h)
24h → 71.8% PR (+3.0 pp from 12h)
48h → ~74% PR (interpolated)
72h → 76.0% PR (+4.2 pp from 24h)
The monotonic improvement with compute is the paper's key finding — enabled specifically by HCE preventing overfitting. Prior systems showed degradation after 24 hours; AIRA₂ shows continued gains.
Cross-Task Learning
AIRA₂ does not implement cross-task learning. Each of the 30 MLE-bench tasks is solved independently. There is no mechanism for:
- Transferring learned strategies between tasks
- Building a library of reusable components
- Meta-learning over task distributions
This is explicitly noted as future work.
Evolutionary Dynamics as Learning
The evolutionary process itself constitutes a form of learning within each task:
- Exploration phase (early hours): Workers explore diverse strategies; population builds breadth
- Exploitation phase (later hours): Selection pressure focuses on refinement of promising approaches
- Crossover-driven innovation: Combining elements from different lineages occasionally produces breakthroughs
The temperature parameter T=0.2 controls this exploration-exploitation tradeoff. The paper does not report experiments with adaptive temperature schedules, which could potentially further improve performance.
Scaling Laws
The paper presents evidence for a compute scaling law in AI research agents:
- Performance improves approximately logarithmically with GPU-hours
- The relationship is not yet saturating at 72 hours / 576 GPU-hours
- This suggests that with even more compute (e.g., 32 GPUs for 72 hours), further gains are possible
This parallels the scaling laws observed in LLM pretraining, but applied to the meta-level of automated research agent performance.
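The "approximately logarithmic" claim can be sanity-checked against the numbers quoted in this report with a least-squares fit of PR = a + b·ln(hours). This is our own back-of-the-envelope check, not an analysis from the paper:

```python
import math

# Reported percentile ranks vs wall-clock hours (8-GPU configuration).
HOURS = [3, 6, 12, 24, 72]
PR    = [59.9, 65.5, 68.8, 71.8, 76.0]

def log_fit(xs, ys):
    """Least-squares fit of y = a + b * ln(x)."""
    lx = [math.log(x) for x in xs]
    mx, my = sum(lx) / len(lx), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ys))
    sxx = sum((x - mx) ** 2 for x in lx)
    b = sxy / sxx
    return my - b * mx, b

a, b = log_fit(HOURS, PR)
doubling_gain = b * math.log(2)   # PR points gained per doubling of time
```

The fit gives roughly 3–4 percentile rank points per doubling of wall-clock time, with residuals under about 1.5 points at every reported budget, consistent with a logarithmic trend that has not yet saturated at 72 hours.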
15 Applications
Direct Applications
| Application | Description | Status |
|---|---|---|
| Kaggle competition solving | Automated entry into ML competitions | Demonstrated (MLE-bench-30) |
| Automated ML pipeline engineering | End-to-end ML solution development | Core capability |
| Inference-time compute scaling | Using additional compute to improve solution quality | Demonstrated (monotonic gains through 576 GPU-hours) |
Broader Research Implications
1. Overfitting Diagnosis in Agent Systems
The HCE protocol provides a general methodology for diagnosing whether observed performance degradation in agent systems is due to true overfitting vs. evaluation noise. This has implications beyond ML research agents — any agent system with noisy evaluation signals could benefit from this analysis framework.
2. Parallelism + Evolution as a General Pattern
The finding that "parallelism without information sharing is insufficient" suggests a general principle: to effectively utilize additional compute in agent systems, you need both (a) parallel execution and (b) an information-sharing mechanism (like evolution) that enables workers to build on each other's discoveries.
3. Dynamic Operators via ReAct
The replacement of static operators with ReAct agents is applicable to any evolutionary system where operator design is a bottleneck. Rather than hand-designing operators for each domain, let agents determine their own action sequences.
Limitations and Scope
| Limitation | Impact |
|---|---|
| Kaggle-only evaluation | Unclear if results generalize to open-ended research (no predefined competition metric) |
| No paper writing | Produces code solutions, not research papers |
| Single-benchmark focus | MLE-bench-30 only; no evaluation on other benchmarks |
| Fixed LLM | Tied to Gemini 3.0 Pro; no multi-model experiments |
| No cross-task transfer | Each task starts from scratch |
| High compute cost | 8× H200 for 72h is accessible only to well-funded labs |
Connections to OmniEvolve
AIRA₂'s architecture maps directly to several OmniEvolve design patterns:
| AIRA₂ Component | OmniEvolve Equivalent |
|---|---|
| Evolutionary Orchestrator | omnievolve/orchestrator/ + omnievolve/search/ |
| Asynchronous Worker Pool | Async worker management in search backends |
| HCE Protocol | omnievolve/evaluation/ cascade evaluator with externalized grading |
| ReAct Agents as Operators | omnievolve/mutation/ — dynamic mutation operators |
| Population Database | omnievolve/knowledge/ program database |
| Temperature-scaled selection | omnievolve/search/ selection strategies |
| Containerized execution | omnievolve/safety/ sandbox execution |
References
- Hambardzumyan, K., et al. (2026). "AIRA₂: Overcoming Bottlenecks in AI Research Agents." arXiv:2603.26499.
- Toledo, E., et al. (2025). "AIRA-dojo." [predecessor system]
- Chan, J., et al. (2025). "MLE-bench." [benchmark definition]
- Singh, I., et al. (2025). "MLE-bench-30." [GPT-5 system card subset]
- Chen, X., et al. (2026). "MARS / MARS+." [concurrent work]
- Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models."
- Syswerda, G. (1991). "A Study of Reproduction in Generational and Steady State Genetic Algorithms."
- Kurtzer, G., et al. (2021). "Apptainer." [container runtime]
- Du, M., et al. (2025). "MLEvolve." [concurrent work]
- Botla, V., et al. (2025). "PiEvolve." [concurrent work]
- Li, Z., et al. (2025). "FM-Agent 2.0." [concurrent work]
- Liu, J., et al. (2025). "ML-Master 2.0." [concurrent work]
Classification: Autoresearch — AIRA₂ is an AI system that autonomously conducts ML research (solving Kaggle competitions as a proxy for research capability) using evolutionary search with LLM-powered operators. It falls squarely in the autoresearch category as an agent harness for automated experimental research.