
AI Scientist — Nature Publication

Towards End-to-End Automation of AI Research
Organization: Sakana AI, University of British Columbia, Vector Institute, University of Oxford
Published: 2026 (Nature)
Type: Journal Paper (Nature)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

Scope Note: This document covers the Nature publication (s41586-026-10265-5) and the AI Scientist v2 system (arXiv:2504.08066). For the original AI Scientist v1 system (arXiv:2408.06292, August 2024), see The AI Scientist. This report focuses on what is new relative to v1: template-free operation, agentic tree search, the peer review milestone, automated reviewer validation, and scaling laws of AI science.


1 Full Title and Attribution

Nature Paper Title: Towards End-to-End Automation of AI Research

Nature DOI: 10.1038/s41586-026-10265-5

AI Scientist v2 Preprint: arXiv:2504.08066 — "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search" (April 10, 2025)

Original Preprint (v1): arXiv:2408.06292 — "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" (August 2024)

Open Access: Yes (Nature open-access publication)

Repositories:

  • AI Scientist v1: github.com/SakanaAI/AI-Scientist
  • AI Scientist v2: github.com/SakanaAI/AI-Scientist-v2

Lineage: This Nature publication consolidates and extends two prior releases — the original AI Scientist (August 2024) and AI Scientist v2 (April 2025) — adding new scaling results, the complete peer review experiment, and an Automated Reviewer validation study.

Publication Timeline

Date Event Reference
August 2024 AI Scientist v1 preprint released arXiv:2408.06292
August 2024 Open-source release of v1 code GitHub: SakanaAI/AI-Scientist
September 2024 ICLR 2025 ICBINB workshop submission 3 AI-generated papers submitted
January 2025 Peer review results: 1 paper accepted Scores: 6, 7, 6 (avg 6.33)
February 2025 Paper withdrawn per pre-established protocol Responsible AI commitment
April 2025 AI Scientist v2 preprint released arXiv:2504.08066
April 2025 Open-source release of v2 code GitHub: SakanaAI/AI-Scientist-v2
2026 Nature paper published s41586-026-10265-5

2 Authors and Team

Nature Paper Authors

The Nature paper represents a collaboration across four institutions:

Author Affiliation Role
Chris Lu Sakana AI Co-lead, original AI Scientist creator
Cong Lu Sakana AI Co-lead, original AI Scientist creator
Yutaro Yamada Sakana AI v2 lead, agentic tree search
Robert Tjarko Lange Sakana AI v2 development, evolutionary methods expert
Shengran Hu Sakana AI v2 development
David Ha Sakana AI CTO, strategic direction
Jakob Foerster University of Oxford Faculty collaborator
Jeff Clune UBC / Vector Institute / CIFAR Faculty collaborator, open-endedness expert

Team Context

The author list bridges multiple research communities:

  • Sakana AI (Tokyo) — an AI research company founded by David Ha (former Google Brain) and Llion Jones (Transformer co-author). Sakana focuses on nature-inspired AI systems, with The AI Scientist as a flagship project.
  • Jeff Clune (UBC / Vector Institute) — a pioneer in open-ended learning, quality-diversity algorithms, and AI-generating algorithms (OMNI, MAP-Elites). His research group has long argued that open-ended search is key to artificial general intelligence. His influence is visible in the system's emphasis on open-ended discovery and archive-based idea generation.
  • Jakob Foerster (Oxford) — expert in multi-agent systems and game theory. His involvement connects the project to multi-agent research methodology.

v1 → v2 → Nature: Team Evolution

The original v1 was primarily the work of Chris Lu and Cong Lu at Sakana AI. The v2 system added Yutaro Yamada as lead for the agentic tree search methodology, and Robert Tjarko Lange, whose evolutionary methods expertise (EvoJAX, ShinkaEvolve) shaped the tree search design. The Nature paper synthesizes contributions from both phases and adds the scaling analysis and peer review experiment as new material.

3 Core Contribution

What's New in the Nature Paper: The Nature publication is not simply a republication of v1 or v2. It consolidates both systems, adds substantial new analysis, and presents the first demonstration of a fully AI-generated paper passing peer review at a top-tier ML conference workshop. The three core contributions beyond v1 are: (1) template-free operation via agentic tree search, (2) validated Automated Reviewer matching human reviewer accuracy, and (3) scaling laws showing that better models → better papers.

Delta from the Original AI Scientist (v1)

Dimension AI Scientist v1 (Aug 2024) Nature Paper / v2 (2025-2026)
Template dependency Required human-provided code templates Template-free mode generates code from scratch
Experimentation Linear execution of experiment plan Agentic tree search with 4 stages
Code generation Used Aider for code modifications Direct LLM-powered tree search (no Aider)
Peer review test Hypothetical ("could approach acceptance") Actual submission + acceptance at ICLR workshop
Automated Reviewer Described but not rigorously validated Validated against ICLR OpenReview at scale
Scaling analysis Not included Paper quality scales with model capability
Compute scaling Not studied Paper quality scales with compute budget
Figure quality LLM-generated, no visual feedback VLM feedback loop for figure refinement
Idea generation Simple prompting + novelty check Archive-based progressive idea generation
Domain scope 3 templates (NanoGPT, Diffusion, Grokking) Any ML topic (template-free)
IRB approval Not obtained UBC IRB H24-02652 approved

The Three Pillars of the Nature Contribution

Pillar 1: Template-Free Scientific Discovery

The original AI Scientist required a human-prepared code template as a starting point. This constraint limited the system to predefined research domains and required non-trivial human setup. The v2/Nature system introduces a template-free mode where the AI Scientist receives only a broad research direction (e.g., "investigating deep learning limitations") and generates its own code, experiments, and papers without any starting scaffold.

Pillar 2: The Peer Review Turing Test

The authors explicitly frame the peer review experiment as an "AI scientist Turing test" — a test of whether AI-generated science is indistinguishable from human science when evaluated by standard scientific processes. One paper passed this test, albeit at a workshop with a 70% acceptance rate rather than a main conference (32% acceptance rate).

Pillar 3: Scaling Laws of AI Science

Perhaps the most significant finding for the field's future trajectory: paper quality (as measured by the Automated Reviewer) scales predictably with both the capability of the underlying LLM and the compute budget allocated to experimentation. This implies that future improvements to foundation models will automatically improve AI scientific output, without requiring changes to the AI Scientist system itself.

4 Supported Solutions

Output Artifacts

The AI Scientist produces complete scientific manuscripts as its primary output. The Nature paper evaluates these in two operational modes:

Mode Input Output Code Source Experiment Structure
Template-based Human-provided code template + broad topic Full paper on template topic Modified template code Linear experiment plan
Template-free Broad research direction only Full paper on novel topic AI-generated from scratch 4-stage agentic tree search

Paper Output Quality Spectrum

Based on the Automated Reviewer and human evaluation across the Nature paper's experiments:

Quality Level % of Output Description Comparison
Workshop-acceptable ~15-25% (recent models) Would pass workshop peer review Top 45% of workshop submissions
Borderline ~30-40% Approaches workshop standards Interesting ideas, execution issues
Below threshold ~35-50% Significant quality gaps Naive ideas, implementation bugs, hallucinations
Main conference quality ~0% None yet meet main conference bar Acknowledged limitation

Research Domains Covered

Template-based mode (inherited from v1):

Template Domain Base Model GPU Time
NanoGPT Language modeling Karpathy's NanoGPT ~30 min
2D Diffusion Generative modeling Score-based diffusion ~15 min
Grokking Generalization theory Modular arithmetic ~10 min
7 community templates Various ML domains Various Various

Template-free mode (new in v2/Nature):

Research Direction Example Generated Topic Template Created
"Deep learning limitations" Negative results in optimization AI-generated from scratch
"Novel training strategies" Curriculum learning dynamics AI-generated from scratch
"Representation learning" Feature emergence in transformers AI-generated from scratch

The ICLR ICBINB Submission

Three fully AI-generated papers were submitted to the ICLR 2025 I Can't Believe It's Not Better (ICBINB) workshop:

Paper Topic Reviewer Scores Average Outcome
Paper 1 Negative result in deep learning 6, 7, 6 6.33 Would have been accepted (withdrawn)
Paper 2 [Not disclosed] Below threshold Below threshold Not accepted
Paper 3 [Not disclosed] Below threshold Below threshold Not accepted

Key context:

  • Workshop acceptance rate: 70% (vs. 32% for the ICLR main conference)
  • The accepted paper's average score (6.33) placed it in the top 45% of submissions
  • The paper scored higher than 55% of human-authored papers at the workshop
  • Reviewers were informed that some submissions were AI-generated, but not which ones (blind)
  • The accepted paper reported a negative result, aligning with the workshop's focus
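As a quick sanity check on these numbers, the three reported reviewer scores can be averaged against the stated 6.0 acceptance threshold. This is a minimal illustrative snippet, not part of the system:

```python
# Illustrative check of the accepted paper's reported statistics:
# three reviewer scores averaged against the workshop's 6.0 threshold.
scores = [6, 7, 6]  # weak accept, accept, weak accept (from the report)
average = sum(scores) / len(scores)
accepted = average >= 6.0

print(f"average = {average:.2f}, accepted = {accepted}")  # average = 6.33
```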

5 LLM Integration

Foundation Model as Core Engine

The AI Scientist uses foundation models as its central reasoning engine across all phases. The Nature paper evaluates performance across a range of models, revealing the critical finding that model quality directly determines paper quality.

Model Evaluation Across Generations

The Nature paper's most impactful finding is the scaling law: paper quality improves with model release date.

Paper Quality vs. Model Generation
====================================

Automated
Reviewer
Score
  7 │                                              ●
    │                                           ╱
  6 │                                        ●╱     Newer models
    │                                     ╱         (2025-2026)
  5 │                              ●───●╱
    │                           ╱
  4 │                     ●──●╱
    │                  ╱                    Older models
  3 │            ●──●╱                     (2023-2024)
    │         ╱
  2 │    ●──╱
    │
  1 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──────── Model Release Date
       │  │  │  │  │  │  │  │  │  │
      GPT Claude  Gemini     Claude  (Latest
      3.5 Sonnet  1.5 Pro    Opus    models)
           3.5

Correlation: P < 0.00001 (statistically significant)

This correlation implies that The AI Scientist is a general-purpose amplifier of model capability: as models improve, the quality of AI-generated science improves correspondingly, without modifications to the system itself.

Dual Operating Modes

Mode Code Generation LLM Role
Template-based Uses Aider (open-source coding assistant) Generates ideas, modifies template code via Aider, writes paper
Template-free Direct LLM code generation (no Aider) Generates ideas, writes code from scratch, manages tree search, writes paper

The shift from Aider to direct LLM code generation in template-free mode is significant. Aider provides structured code editing (diff-based patches, file management), but constrains the system to modifying existing code. Direct generation enables the LLM to create entirely new codebases, at the cost of more implementation bugs.

Vision-Language Model (VLM) Integration (New in v2)

A notable addition in v2 is the VLM feedback loop for figure quality:

VLM Figure Refinement Loop (New in v2)
=======================================

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Generate    │     │  VLM         │     │  Refine      │
│  Figure      │────▶│  Evaluate    │────▶│  Figure      │
│  (matplotlib)│     │  (visual     │     │  (adjust     │
│              │     │   quality)   │     │   layout,    │
│              │     │              │     │   colors,    │
│              │◀────│  Feedback:   │     │   labels)    │
│              │     │  "axis labels│     │              │
│              │     │   too small, │     └──────────────┘
│              │     │   legend     │
│              │     │   overlaps"  │
│              │     └──────────────┘
│              │
└──────────────┘

Contribution: Addresses v1's weakness of low-quality figures
with duplicates, missing labels, and poor formatting.

Multi-Model Architecture

The Nature paper reveals that different phases can use different models:

Phase Recommended Model Reasoning
Idea generation Strongest available Creative ideation benefits from frontier capability
Code generation Strong coder Implementation correctness is critical
Experiment execution Agent-capable model Needs tool use, file management
Paper writing Strongest available Long-form coherent academic writing
Automated Review Ensembled (5 reviews) Ensemble reduces individual model bias

6 Key Results

Result 1: First AI-Generated Paper to Pass Peer Review

The headline achievement: a fully AI-generated paper was accepted via the peer review process at a workshop of a top-tier ML conference (ICLR 2025 ICBINB).

Statistical Context:

Metric Value
Workshop acceptance rate 70%
Main conference acceptance rate 32%
Total workshop submissions 43
AI-generated submissions 3
AI submissions accepted 1
Accepted paper's scores 6 (weak accept), 7 (accept), 6 (weak accept)
Average score 6.33
Human acceptance threshold 6.0 (average)
Paper's percentile among submissions Top 45%
Papers it scored higher than 55% of human-authored papers

Important caveats the authors acknowledge:

  1. Workshop papers have a substantially lower bar than main conference papers
  2. Only 1 of 3 submissions was accepted (33% success rate, below workshop average)
  3. The accepted paper reported a negative result, which aligned with the workshop's specific focus
  4. Manual filtering of candidate papers was performed before submission (selecting the most promising from a larger pool)
  5. The paper was withdrawn after acceptance per the pre-established protocol

Result 2: Automated Reviewer Matches Human Reviewers

The Automated Reviewer was rigorously validated against the OpenReview dataset of human decisions:

Metric Automated Reviewer Human Reviewers (NeurIPS 2021)
Balanced Accuracy (pre-cutoff) 69% Comparable (NeurIPS consistency study)
Balanced Accuracy (post-cutoff) 66% Comparable
F1 Score Exceeds inter-human agreement Baseline
Statistical significance P < 0.001 (bootstrap test)
Data contamination effect Minimal (69% → 66%) N/A
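The balanced-accuracy metric used in this validation can be reproduced in miniature. The sketch below computes it on toy accept/reject labels; the data is invented purely for illustration.

```python
# Minimal sketch of the balanced-accuracy metric used to validate the
# Automated Reviewer against human accept/reject decisions. Balanced
# accuracy averages per-class recall, so it is robust to the class
# imbalance of conference decisions (far more rejects than accepts).
def balanced_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# Toy data: True = accepted, False = rejected (illustrative only).
human =    [True, True, True, False, False, False, False, False]
reviewer = [True, True, False, False, False, False, True, False]
print(balanced_accuracy(human, reviewer))
```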

Automated Reviewer Architecture:

Automated Reviewer Pipeline (Nature)
=====================================

Paper (LaTeX/PDF)
       │
       ▼
┌─────────────────────────────────────────────┐
│          FIVE INDEPENDENT REVIEWS            │
│                                             │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│  │ R1  │ │ R2  │ │ R3  │ │ R4  │ │ R5  │ │
│  │     │ │     │ │     │ │     │ │     │ │
│  │Score│ │Score│ │Score│ │Score│ │Score│ │
│  │S/W  │ │S/W  │ │S/W  │ │S/W  │ │S/W  │ │
│  │Dec. │ │Dec. │ │Dec. │ │Dec. │ │Dec. │ │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│     │       │       │       │       │     │
│     └───────┴───────┼───────┴───────┘     │
│                     ▼                     │
│           ┌─────────────────┐             │
│           │   META-REVIEW   │             │
│           │   (Area Chair   │             │
│           │    persona)     │             │
│           │                 │             │
│           │  Synthesizes 5  │             │
│           │  reviews into   │             │
│           │  final decision │             │
│           └────────┬────────┘             │
│                    │                      │
└────────────────────┼──────────────────────┘
                     ▼
            ┌────────────────┐
            │ Final Decision │
            │ + Scores       │
            │ + Strengths    │
            │ + Weaknesses   │
            └────────────────┘

Based on NeurIPS review guidelines.
5-run ensemble for replicability.
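The numeric side of this pipeline can be sketched briefly. In the real system the meta-review is an LLM acting as an Area Chair persona; the mean-score and majority-vote synthesis below is only a plausible stand-in for illustration.

```python
# Sketch of the 5-review ensemble's numeric aggregation. The Area Chair
# step is an LLM in the actual system; this mean + majority-vote rule is
# an illustrative assumption, not the paper's method.
from statistics import mean

def meta_review(reviews):
    """reviews: list of dicts with 'score' (1-10) and 'decision' (bool)."""
    avg = mean(r["score"] for r in reviews)
    accepts = sum(r["decision"] for r in reviews)
    return {"score": round(avg, 2), "decision": accepts > len(reviews) / 2}

reviews = [
    {"score": 6, "decision": True},
    {"score": 7, "decision": True},
    {"score": 5, "decision": False},
    {"score": 6, "decision": True},
    {"score": 4, "decision": False},
]
print(meta_review(reviews))  # mean 5.6, majority accept
```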

Key differences from v1 reviewer:

Aspect v1 Reviewer Nature Reviewer
Number of reviews 3 personas (base, negative, positive) 5 independent reviews + meta-review
Validation Described, limited testing Rigorous validation against 1,000+ ICLR papers
Guideline basis Conference review rubric Official NeurIPS guidelines
Area Chair role Simple aggregation Explicit Area Chair persona for meta-review
Post-cutoff testing Not performed Tested on 2025 papers (after training cutoff)
Statistical rigor Not reported P-values, bootstrap CIs, z-tests

Result 3: Scaling Laws of AI Science

Two scaling relationships are demonstrated:

Scaling Law 1: Model Quality → Paper Quality

Papers generated by newer, more capable models receive higher Automated Reviewer scores. The correlation is statistically significant (P < 0.00001). This is tested across model generations from GPT-3.5 through the latest Claude and Gemini models.

Implication: No system-level changes are needed to improve output. Simply using a better foundation model produces better science.
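The correlation underlying this law can be reconstructed in a few lines. The data points below are hypothetical, shaped like the chart in Section 5; only the existence of a strong positive correlation is the paper's claim.

```python
# Illustrative reconstruction of Scaling Law 1: Automated Reviewer score
# vs. model release date. Data is hypothetical; the pearson helper is a
# plain implementation of the Pearson correlation coefficient.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

release_year = [2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5, 2026.0]
review_score = [2.1, 3.0, 3.8, 4.6, 5.2, 6.1, 6.9]  # hypothetical
print(f"r = {pearson(release_year, review_score):.3f}")  # strongly positive
```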

Scaling Law 2: Compute Budget → Paper Quality

Increasing the number of experimental nodes in the agentic tree search improves paper quality. This suggests that test-time compute scaling — a central trend in modern AI — applies to scientific discovery as well.

Compute Scaling: Tree Search Depth → Paper Quality
===================================================

Score │
  7   │                                          ●
      │                                       ╱
  6   │                                 ●──●╱
      │                              ╱
  5   │                       ●───●╱
      │                    ╱
  4   │             ●───●╱
      │          ╱
  3   │    ●──●╱
      │
  2   └──┬──┬──┬──┬──┬──┬──┬──┬──┬──── Compute Budget
         1  2  4  8  16 32 64 128 256  (tree nodes)

Each additional doubling of compute budget yields
diminishing but consistent quality improvements.
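The "gain per doubling" reading of this chart can be made concrete. The node counts below follow the chart's axis; the scores are hypothetical points consistent with a roughly log-linear trend, not the paper's raw data.

```python
# Sketch of Scaling Law 2: score vs. compute budget in tree nodes.
# With score ≈ a + b·log2(nodes), b is the gain per compute doubling.
nodes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
scores = [2.9, 3.4, 3.9, 4.4, 4.9, 5.4, 5.9, 6.4, 6.9]  # hypothetical

gains = [s2 - s1 for s1, s2 in zip(scores, scores[1:])]
per_doubling = sum(gains) / len(gains)
print(f"average gain per compute doubling: {per_doubling:.2f} points")
```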

Combined implication: Both scaling axes — model capability and inference-time compute — are on exponentially improving trajectories. If the trend holds, future versions of The AI Scientist will produce substantially better science with both better models and more efficient compute.

Result 4: Template-Free Operation

The template-free mode represents a qualitative capability expansion:

Capability Template-Based Template-Free
Requires human code setup Yes No
Research domain Fixed by template Open-ended
Code origin Modified human code AI-generated from scratch
Experimentation structure Linear plan 4-stage tree search
Code quality Higher (human starting point) Lower (more bugs)
Idea novelty Constrained by template Broader exploration
Paper diversity Limited to template domain Diverse topics

7 Reproducibility

Reproducibility Framework

The Nature paper takes reproducibility significantly more seriously than v1, partly driven by the requirements of Nature's publication standards:

Component Reproducibility Status Evidence
System code ✅ Open-source (v1 + v2) GitHub repositories
Automated Reviewer ✅ Open-source + validated Tested against OpenReview
Generated papers ✅ Available in supplementary Full manuscripts in appendix
Peer review experiment ⚠️ Process documented IRB approved, organizer consent
Model weights ❌ Commercial models API access required
Exact paper regeneration ❌ Stochastic process Different runs produce different papers
Scaling curves ⚠️ Aggregated statistics Mean ± standard error reported

IRB and Ethical Approval

A significant new element is the formal ethical framework:

  • IRB: University of British Columbia IRB approval H24-02652
  • Conference consent: ICLR 2025 leadership and ICBINB workshop organizers explicitly agreed
  • Pre-registration: Decision to withdraw accepted papers was made before submission
  • Disclosure: Reviewers were informed some submissions were AI-generated (blind — they didn't know which ones)
  • Watermarking: All AI-generated papers were watermarked as AI-generated

Statistical Methodology

The Nature paper employs rigorous statistical methods:

Analysis Statistical Method Result
Model scaling correlation Pearson correlation + significance test P < 0.00001
Automated Reviewer accuracy Balanced accuracy + bootstrapped 95% CI 5,000 bootstrap replicates
Human vs. automated agreement Two-sample z-test P = 0.319 (pre-cutoff), P = 0.921 (post-cutoff)
F1 score comparison Non-parametric bootstrap test Automated outperformance P < 0.001
Data contamination Pre/post-cutoff comparison 69% → 66% (minimal effect)
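The bootstrap methodology in the table can be sketched in a few lines. The 5,000-replicate count comes from the table; the toy sample and the percentile-interval choice are illustrative assumptions.

```python
# Minimal percentile-bootstrap sketch of a 95% CI (5,000 replicates, as in
# the table). The toy sample encodes 69% point accuracy for illustration.
import random

random.seed(0)
correct = [1] * 69 + [0] * 31  # toy sample: 69% accuracy

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05):
    reps = sorted(
        stat([random.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(correct, lambda xs: sum(xs) / len(xs))
print(f"95% CI for accuracy: [{lo:.2f}, {hi:.2f}]")
```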

8 Compute and API Costs

Cost Structure (Nature Paper)

The Nature paper does not report exact costs per paper, but we can estimate from the v1 analysis and the scaling experiments:

Template-based mode (inherited from v1):

Stage Estimated Cost % of Total
Idea Generation ~$1.50 ~9%
Experimentation ~$3.00 ~18%
Paper Write-up ~$7.50 ~44%
Peer Review (5-review ensemble) ~$5.00 ~29%
Total (template-based) ~$17 100%

Template-free mode (new cost profile):

Stage Estimated Cost Notes
Idea Generation + Code Generation ~$5-10 Generating code from scratch is more expensive
Agentic Tree Search (4 stages) ~$20-50 Scales with tree depth; main cost driver
Paper Write-up + VLM Figure Refinement ~$10-15 VLM feedback adds cost
Automated Review (5 reviews + meta) ~$5-8 More reviews than v1
Total (template-free, basic) ~$40-80 At minimal tree depth
Total (template-free, full search) ~$100-300 At deeper tree search

Scaling Cost Analysis

The scaling experiments reveal the cost-quality tradeoff:

Tree Nodes Estimated Cost Quality Score Quality/Dollar
4 ~$40 ~3.5 0.088
16 ~$80 ~4.5 0.056
64 ~$160 ~5.5 0.034
256 ~$500+ ~6.5 0.013

The quality/dollar ratio falls as the compute budget grows: quality improves roughly linearly in the logarithm of compute, as is typical of scaling laws, while cost grows linearly with the number of nodes expanded. Producing a workshop-acceptable paper costs roughly $200-500 in template-free mode with sufficient compute.
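The quality/dollar column can be recomputed directly from the table above; the 256-node row uses the table's "$500+" lower bound.

```python
# Recomputing quality per dollar from the scaling-cost table above.
rows = [  # (tree nodes, estimated cost in dollars, quality score)
    (4, 40, 3.5),
    (16, 80, 4.5),
    (64, 160, 5.5),
    (256, 500, 6.5),  # "$500+" lower bound
]
ratios = [score / cost for _, cost, score in rows]
for (nodes, cost, score), r in zip(rows, ratios):
    print(nodes, "nodes:", r)
```

The ratios decrease monotonically, matching the diminishing-returns reading of the table.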

Comparison: Cost to Produce Publishable Science

Producer Cost per Paper Quality Time
PhD student ~$50,000-100,000/year (salary + overhead) for ~2-4 papers Main conference quality 3-6 months
AI Scientist (template-based) ~$17 Below workshop bar (v1) Hours
AI Scientist (template-free, scaled) ~$200-500 Workshop-acceptable (~15-25%) Hours-days
AI Scientist (projected future) ~$50-100 (as models improve + costs drop) Main conference (projected) Hours

The economic implications are substantial. Even at current quality levels, the AI Scientist can generate candidate ideas and preliminary experiments at a cost several orders of magnitude lower than human research. The value proposition is strongest when used for broad exploration — generating many candidate directions cheaply, then having humans select and refine the most promising ones.

GPU Compute Costs

In addition to API costs, the system requires GPU compute for running ML experiments:

Experiment Type GPU Time Cloud Cost (A100)
Template-based (NanoGPT) ~30 min ~$1
Template-based (Grokking) ~10 min ~$0.30
Template-free (basic) ~1-2 hours ~$3-6
Template-free (full tree) ~4-12 hours ~$12-36

9 Architecture Solution

Architectural Evolution: v1 → v2 → Nature

The AI Scientist architecture has evolved significantly between versions. The Nature paper presents both architectures and their trade-offs.

AI Scientist v1 Architecture (Template-Based)
===============================================

┌────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ IDEATION   │    │ EXPERIMENT  │    │ WRITE-UP    │    │ REVIEW      │
│            │    │             │    │             │    │             │
│ LLM ideates│───▶│ Aider edits │───▶│ LaTeX gen   │───▶│ 3 personas  │
│ Novelty    │    │ template    │    │ section by  │    │ 3 reflections│
│ check via  │    │ code        │    │ section     │    │ each        │
│ Semantic   │    │ Linear exec │    │ Citation    │    │             │
│ Scholar    │    │             │    │ search      │    │             │
└────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     Simple            Aider              Good              Basic
     effective         dependent          quality           validation


AI Scientist v2 / Nature Architecture (Template-Free)
======================================================

┌─────────────┐    ┌──────────────────────────────────┐    ┌──────────────┐
│ IDEATION    │    │ AGENTIC TREE SEARCH              │    │ WRITE-UP     │
│             │    │                                  │    │              │
│ Archive-    │    │ Stage 1: Initial Investigation   │    │ LaTeX gen    │
│ based idea  │    │    ├── Baseline implementation   │    │ VLM figure   │
│ generation  │    │    └── Multiple code attempts    │    │ refinement   │
│             │    │                                  │    │ 20 citation  │
│ Progressive │    │ Stage 2: Hyperparameter Tuning   │    │ search rounds│
│ archive     │───▶│    ├── Grid/random search        │───▶│              │
│ growth      │    │    └── Best checkpoint → next    │    │              │
│             │    │                                  │    │              │
│ Semantic    │    │ Stage 3: Research Agenda         │    │              │
│ Scholar +   │    │    ├── Tree search over ideas    │    │              │
│ web search  │    │    └── Best checkpoint → next    │    │              │
│ filtering   │    │                                  │    │              │
│             │    │ Stage 4: Ablation Studies        │    │              │
│             │    │    └── Systematic ablations      │    │              │
└─────────────┘    └──────────────────────────────────┘    └──────────────┘
                                │                                │
                                ▼                                ▼
                   ┌───────────────────────┐         ┌───────────────────────┐
                   │ EXPERIMENT MANAGER    │         │ AUTOMATED REVIEWER    │
                   │ AGENT                 │         │                       │
                   │ Coordinates tree      │         │ 5 independent reviews │
                   │ search, selects       │         │ + Area Chair meta-    │
                   │ checkpoints,          │         │ review                │
                   │ manages resources     │         │ NeurIPS guidelines    │
                   └───────────────────────┘         └───────────────────────┘

Key Architectural Differences

Component v1 v2 / Nature
Code modification Aider (diff-based) Direct LLM generation (tree search)
Experiment structure Linear plan execution 4-stage tree search with checkpointing
Idea management Flat list with novelty check Progressive archive (grows over time)
Figure quality Basic matplotlib VLM feedback loop
Review system 3 personas, 3 reflections 5 reviews + meta-review (Area Chair)
Experiment management Sequential Dedicated Experiment Manager Agent
Template requirement Mandatory Optional (template-free mode available)

The Agentic Tree Search (Core Innovation)

The most significant architectural addition is the agentic tree search for experimentation. This replaces v1's linear experiment execution with a search tree where each node represents an experimental state (code + results + analysis):

Agentic Tree Search Visualization
===================================

                    ROOT (broad research idea)
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │ Stage 1  │  │ Stage 1  │  │ Stage 1  │
    │ Impl. A  │  │ Impl. B  │  │ Impl. C  │
    │ score: 3 │  │ score: 5 │  │ score: 2 │
    └──────────┘  └────┬─────┘  └──────────┘
                       │                        ← Best selected
              ┌────────┼────────┐
              ▼        ▼        ▼
        ┌─────────┐ ┌─────────┐ ┌─────────┐
        │ Stage 2 │ │ Stage 2 │ │ Stage 2 │
        │ HP: lr  │ │ HP: bs  │ │ HP: wd  │
        │ =0.001  │ │ =64     │ │ =0.01   │
        │ score:6 │ │ score:5 │ │ score:4 │
        └────┬────┘ └─────────┘ └─────────┘
             │                              ← Best selected
        ┌────┼────────────┐
        ▼    ▼            ▼
    ┌──────┐ ┌──────┐ ┌──────┐
    │ S3   │ │ S3   │ │ S3   │
    │Idea A│ │Idea B│ │Idea C│
    │ s: 7 │ │ s: 5 │ │ s: 6 │
    └──┬───┘ └──────┘ └──────┘
       │                                    ← Best selected
       ▼
    ┌──────┐
    │ S4   │
    │Ablate│
    │ s: 7 │    ← Final paper uses this checkpoint
    └──────┘

Stage 1: Initial Investigation (multiple implementation attempts)
Stage 2: Hyperparameter Tuning (grid/random search)
Stage 3: Research Agenda Execution (tree search over ideas)
Stage 4: Ablation Studies (systematic ablations)

At each stage boundary, the best-performing checkpoint
is selected to seed the next stage.
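The stage-boundary selection rule above can be sketched with a minimal node structure. The `Node` fields and scores below are illustrative (mirroring the diagram), not the system's real data model.

```python
# Hedged sketch of stage-boundary checkpoint selection: each node holds an
# experimental state with a score, and the best-scoring node seeds the
# next stage. Fields and scores mirror the diagram above, for illustration.
from dataclasses import dataclass, field

@dataclass
class Node:
    stage: int
    label: str
    score: float
    children: list = field(default_factory=list)

def best_checkpoint(nodes):
    """Select the best-performing node at a stage boundary."""
    return max(nodes, key=lambda n: n.score)

stage1 = [Node(1, "Impl. A", 3), Node(1, "Impl. B", 5), Node(1, "Impl. C", 2)]
seed = best_checkpoint(stage1)  # Impl. B seeds Stage 2
stage2 = [Node(2, "lr=0.001", 6), Node(2, "bs=64", 5), Node(2, "wd=0.01", 4)]
seed.children = stage2
print(best_checkpoint(stage2).label)  # lr=0.001
```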

Experiment Manager Agent (New)

The v2 system introduces a dedicated Experiment Manager Agent that coordinates the tree search. This agent:

  1. Decides which nodes to expand next (exploration vs. exploitation)
  2. Selects the best checkpoint at stage boundaries
  3. Manages compute budget allocation across tree branches
  4. Tracks experimental progress and identifies promising directions
  5. Kills unproductive branches to conserve resources

This is architecturally significant because it introduces a meta-level agent that reasons about the structure of the search rather than performing the search itself. In evolutionary computation terms, this is analogous to a strategy adaptation mechanism.
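The exploration/exploitation and pruning duties above can be illustrated with a numeric policy. The real manager is an LLM agent; the UCB-style rule, quality floor, and tuple representation below are assumptions chosen purely to make the trade-off concrete.

```python
# Hypothetical Experiment Manager policy: UCB-style choice of which branch
# to expand, plus pruning of weak branches. Illustrative only; the actual
# agent reasons in natural language rather than with this formula.
import math

def select_branch(branches, c=1.0):
    """branches: list of (mean_score, visit_count). Returns index to expand."""
    total = sum(n for _, n in branches)
    ucb = [m + c * math.sqrt(math.log(total) / n) for m, n in branches]
    return max(range(len(branches)), key=lambda i: ucb[i])

def prune(branches, floor=3.0):
    """Drop branches whose mean score has fallen below a quality floor."""
    return [b for b in branches if b[0] >= floor]

branches = prune([(5.0, 4), (4.5, 1), (2.0, 6)])  # kills the 2.0 branch
print(select_branch(branches))  # the barely-explored 4.5 branch wins
```

Note how the under-visited branch is expanded despite its lower mean score: that is the exploration bonus at work.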

10 Component Breakdown

Phase 1: Idea Generation (Enhanced)

v1 approach: LLM generates ideas → novelty check via Semantic Scholar → ideas formatted as JSON.

Nature enhancements:

Enhancement Description Impact
Archive-based generation Ideas are generated relative to a growing archive of prior ideas Enables progressive exploration
Web access tools LLM can search web + Semantic Scholar as tools (not just API calls) Broader literature coverage
Template-free prompting Ideas can target any ML topic, not just template domains Broader scope
Idea filtering pipeline Multi-stage filtering: novelty → feasibility → alignment Higher-quality ideas passed to experimentation
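The multi-stage filter (novelty → feasibility → alignment) can be sketched as a chain of predicates. The predicates here stand in for LLM and Semantic Scholar judgments; the field names and helper are illustrative assumptions, not the system's real interface.

```python
# Sketch of the multi-stage idea filtering pipeline described above.
# Each check stands in for an LLM / literature-search judgment.
def filter_ideas(ideas, checks):
    """Apply checks in order; an idea must pass every stage to survive."""
    for name, check in checks:
        ideas = [i for i in ideas if check(i)]
        print(f"after {name}: {len(ideas)} ideas remain")
    return ideas

ideas = [
    {"title": "A", "novel": True,  "feasible": True,  "aligned": True},
    {"title": "B", "novel": False, "feasible": True,  "aligned": True},
    {"title": "C", "novel": True,  "feasible": False, "aligned": True},
    {"title": "D", "novel": True,  "feasible": True,  "aligned": False},
]
survivors = filter_ideas(ideas, [
    ("novelty",     lambda i: i["novel"]),
    ("feasibility", lambda i: i["feasible"]),
    ("alignment",   lambda i: i["aligned"]),
])
print([i["title"] for i in survivors])  # ['A']
```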

Phase 2: Experimentation (Major Overhaul)

The experimentation phase is completely redesigned in the template-free mode:

v1: Linear execution of experiment plan via Aider code modifications.

Nature (template-free): 4-stage agentic tree search managed by Experiment Manager Agent.

Stage Goal Method Output
1. Initial Investigation Create working baseline Multiple code generation attempts Best baseline code + results
2. Hyperparameter Tuning Optimize baseline Grid/random search over key HPs Best HP configuration
3. Research Agenda Implement the research idea Tree search over implementation variants Best implementation + results
4. Ablation Studies Validate contribution Systematic ablations of key components Ablation table + analysis

Phase 3: Paper Write-up (Enhanced)

v1 approach: Section-by-section LaTeX generation with Semantic Scholar citation search.

Nature enhancements:

Enhancement Description
VLM figure feedback Vision-Language Model evaluates figure quality; iterative refinement
20-round citation search More thorough literature integration (v1 used fewer rounds)
Citation justification For each citation, generates textual justification for inclusion
Experimental journal notes Agent takes structured notes during experimentation for write-up

Phase 4: Automated Review (Major Enhancement)

v1 approach: 3 reviewer personas (base, negative, positive) with 3 reflection rounds each.

Nature approach: 5 independent reviews + Area Chair meta-review.

| Dimension | v1 | Nature |
|---|---|---|
| Reviews per paper | 3 | 5 |
| Persona types | Base, negative, positive | 5 independent (no explicit bias) |
| Meta-review | Simple aggregation (median scores) | Area Chair persona synthesizes |
| Review guidelines | Generic conference rubric | Official NeurIPS guidelines |
| Validation | Limited comparison | 1,000+ papers, OpenReview dataset |
| Scores output | Scores + decision | Scores + decision + strengths + weaknesses |
| Replicability | Single pass | 5-run ensemble |
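A minimal sketch of the ensemble-plus-meta-review pattern described above. `review_fn` stands in for an LLM reviewer call, and the 6.0 accept threshold is chosen purely for illustration, not taken from the paper:

```python
import statistics

def run_review_ensemble(paper_text, review_fn, n_reviews=5):
    """Collect n independent reviews, then synthesize an Area-Chair-style
    meta-review. `review_fn` is a placeholder for an LLM call returning a
    dict with a numeric score plus strengths/weaknesses lists."""
    reviews = [review_fn(paper_text, seed=i) for i in range(n_reviews)]
    scores = [r["score"] for r in reviews]
    meta = {
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "decision": "accept" if statistics.mean(scores) >= 6.0 else "reject",
        # The real Area Chair persona synthesizes these; here we just pool them.
        "strengths": [s for r in reviews for s in r["strengths"]],
        "weaknesses": [w for r in reviews for w in r["weaknesses"]],
    }
    return reviews, meta
```

The 5-run ensemble reduces the variance of any single stochastic review, mirroring why the Nature version replaced v1's single pass.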

Supporting Components

| Component | Status | Function |
|---|---|---|
| Semantic Scholar API | Enhanced | Literature search, novelty checking, citation retrieval |
| Web search tools | New in v2 | Broader information access beyond Semantic Scholar |
| LaTeX compiler | Inherited | Manuscript compilation and PDF generation |
| Python runtime | Enhanced | Experiment execution, data analysis, figure generation |
| Experiment Manager | New in v2 | Tree search coordination, checkpoint selection |
| VLM | New in v2 | Figure quality assessment and feedback |

11 Core Mechanisms (Detailed)

Mechanism 1: Agentic Tree Search for Experimentation

The most significant new mechanism is the 4-stage tree search. Unlike v1's linear execution, the tree search enables the system to:

  1. Explore multiple implementation approaches before committing
  2. Recover from dead ends by backtracking to earlier checkpoints
  3. Systematically vary one dimension at a time (stages 2 and 4)
  4. Scale with compute by expanding more nodes

How the tree search works in detail:

At each node, the LLM generates code, runs experiments, and analyzes results. The Experiment Manager then decides whether to:

  - Expand the node (try a variation)
  - Select it as the best checkpoint for the next stage
  - Prune it (abandon unproductive branches)

The selection mechanism at stage boundaries uses the experimental results to identify the best-performing checkpoint. This checkpoint's code and data become the starting point for the next stage, ensuring that subsequent work builds on the strongest foundation.
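A minimal sketch of one stage of this expand/prune/select loop. `propose_fn` and `evaluate_fn` stand in for LLM code generation and experiment execution, and the greedy expand-from-best policy is an illustrative simplification of the Experiment Manager's actual strategy:

```python
def run_stage(base_code, propose_fn, evaluate_fn, budget=8):
    """One tree-search stage: repeatedly expand a child from the current
    best node, prune failed runs, and return the best checkpoint, which
    becomes the starting point for the next stage."""
    frontier = [(evaluate_fn(base_code), base_code)]
    for _ in range(budget):
        _, parent = max(frontier)      # expand from the strongest node
        child = propose_fn(parent)     # stand-in for LLM code variation
        score = evaluate_fn(child)     # stand-in for running the experiment
        if score is None:              # failed run -> pruned branch
            continue
        frontier.append((score, child))
    return max(frontier)[1]            # best checkpoint crosses the boundary
```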

Relationship to evolutionary computation:

| Tree Search Component | Evolutionary Analog |
|---|---|
| Nodes | Individuals in population |
| Node expansion | Mutation (child programs from parent) |
| Stage boundary selection | Elitist selection (best survives) |
| Multiple Stage 1 attempts | Population initialization |
| Stage 3 branching | Population diversity |
| Stage 4 ablation | Fitness landscape analysis |
| Experiment Manager | Strategy adaptation controller |

Mechanism 2: Progressive Archive-Based Ideation

The idea generation phase uses a growing archive inspired by quality-diversity algorithms (MAP-Elites, OMNI):

Archive-Based Idea Generation
==============================

Cycle 1:                    Cycle 2:                    Cycle 3:
┌──────────────┐           ┌──────────────┐           ┌──────────────┐
│ Archive: {}  │           │ Archive:     │           │ Archive:     │
│              │           │ {Idea A,     │           │ {Idea A,     │
│ Generate:    │           │  Idea B}     │           │  Idea B,     │
│ Idea A       │──────────▶│              │──────────▶│  Idea C,     │
│ Idea B       │           │ Generate:    │           │  Idea D}     │
│              │           │ Idea C       │           │              │
│              │           │ Idea D       │           │ Generate:    │
└──────────────┘           │ (informed by │           │ Idea E       │
                           │  A, B)       │           │ (informed by │
                           └──────────────┘           │  A-D)        │
                                                      └──────────────┘

Each new idea is generated in the context of all prior
ideas, enabling progressive refinement and diversification.

This mechanism is directly inspired by Jeff Clune's work on open-ended learning, where an archive of diverse solutions drives continued exploration. The archive acts as a form of curiosity — the system is implicitly rewarded for generating ideas that differ from what's already in the archive.
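The archive cycle above can be sketched as follows. `generate_fn` is a placeholder for the LLM ideation call; the membership check is a stand-in for the real novelty filtering, which goes through Semantic Scholar rather than exact matching:

```python
def ideation_cycles(generate_fn, n_cycles=3, ideas_per_cycle=2):
    """Archive-based ideation: each call to the generator is conditioned
    on every previously archived idea, so later ideas can refine and
    diversify away from earlier ones."""
    archive = []
    for _ in range(n_cycles):
        for _ in range(ideas_per_cycle):
            idea = generate_fn(context=list(archive))
            if idea not in archive:    # crude novelty pressure
                archive.append(idea)   # archive grows monotonically
    return archive
```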

Mechanism 3: Automated Reviewer as Fitness Function

A key insight of the Nature paper is that the Automated Reviewer functions as a fitness function for AI-generated science. By validating that its accuracy matches that of human reviewers, the authors establish that optimizing for the Automated Reviewer's scores is a reasonable proxy for optimizing for actual scientific quality.

This is analogous to the fitness-function design problem in evolutionary computation:

  - The fitness function must accurately capture the optimization objective
  - A misaligned fitness function leads to reward hacking (Goodhart's Law)
  - The Automated Reviewer is validated to be as aligned with true scientific quality as human reviewers are with each other

Scaling implications: If the Automated Reviewer is a valid fitness function, then the scaling law (more compute → better papers) can be interpreted as a compute-quality Pareto frontier, analogous to scaling laws in evolutionary optimization.

Mechanism 4: VLM-Augmented Figure Refinement

The VLM figure feedback loop introduces multimodal reasoning into the pipeline:

  1. Matplotlib generates a figure from experimental data
  2. The VLM receives the rendered figure image
  3. The VLM evaluates: layout, label readability, color accessibility, legend placement, axis scaling
  4. Feedback is provided in natural language
  5. The code-generating LLM modifies the matplotlib code based on VLM feedback
  6. The cycle repeats until the VLM is satisfied or iteration budget is exhausted
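The six steps above form a simple critique-revise loop, sketched below. All three callbacks are placeholders for the real rendering and model calls, and the empty-feedback stopping rule is an assumption about how "the VLM is satisfied" is detected:

```python
def refine_figure(plot_code, render_fn, vlm_critique_fn, revise_fn, max_iters=5):
    """VLM-in-the-loop figure refinement: render the figure, ask the VLM
    for natural-language critique, revise the plotting code, and repeat
    until the critique is empty or the iteration budget runs out."""
    for _ in range(max_iters):
        image = render_fn(plot_code)          # e.g. matplotlib -> PNG
        feedback = vlm_critique_fn(image)     # issues found, or "" if none
        if not feedback:
            break
        plot_code = revise_fn(plot_code, feedback)  # LLM edits the code
    return plot_code
```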

This mechanism addresses common failure modes in v1, where figures had:

  - Overlapping labels and legends
  - Unreadable axis ticks
  - Duplicated figures in the main text and appendix
  - Missing or misleading color coding
  - Poor formatting for publication standards

Mechanism 5: Citation Integration Pipeline

The citation search has been enhanced from v1:

v1: For each concept needing citation, search Semantic Scholar → rank by relevance → insert BibTeX.

Nature: 20-round citation refinement where:

  1. The LLM generates draft text
  2. Identifies claims requiring citations
  3. Searches Semantic Scholar + web for relevant papers
  4. Generates textual justification for each citation's inclusion
  5. Compares found literature against the manuscript
  6. Iterates 20 times to improve citation coverage and accuracy

This more thorough process helps mitigate v1's citation hallucination problem, though the Nature paper acknowledges it does not fully eliminate it.
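The 20-round refinement loop can be sketched as below. `find_claims_fn` and `search_fn` stand in for the LLM claim extraction and the Semantic Scholar/web search; the justification string is a simplified stand-in for the generated textual justification:

```python
def refine_citations(draft, find_claims_fn, search_fn, n_rounds=20):
    """Iterative citation refinement: each round locates still-uncited
    claims, searches for supporting papers, and records a justification
    alongside each citation. Stops early if no uncited claims remain."""
    citations = {}
    for _ in range(n_rounds):
        claims = find_claims_fn(draft, citations)   # claims lacking support
        if not claims:
            break                                   # full coverage reached
        for claim in claims:
            paper = search_fn(claim)                # best-matching paper, or None
            if paper is not None:
                citations[claim] = {
                    "bibtex_key": paper,
                    "justification": f"supports claim: {claim}",
                }
    return citations
```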

12 Programming Language

System Implementation

| Component | Language | Framework |
|---|---|---|
| AI Scientist pipeline | Python | Custom orchestration |
| Template-based code editing | Python | Aider (open-source coding assistant) |
| Template-free code generation | Python | Direct LLM generation |
| Automated Reviewer | Python | LLM API calls |
| Generated experiments | Python | PyTorch, NumPy, matplotlib |
| Paper output | LaTeX | NeurIPS/ICML templates |

Generated Code Characteristics

The template-free mode generates Python code from scratch, introducing new challenges:

| Characteristic | Template-Based | Template-Free |
|---|---|---|
| Code origin | Human template modified by Aider | AI-generated from scratch |
| Bug frequency | Low (human starting point) | Higher (common implementation errors) |
| Library usage | Follows template patterns | Variable, sometimes non-idiomatic |
| Testing | Inherits template tests | No tests (significant gap) |
| Documentation | Template-level docs | Variable quality |

Common Code Generation Failures (Template-Free)

The Nature paper documents several recurring code generation issues:

  1. Incorrect implementations — the code doesn't correctly implement the proposed idea
  2. Import errors — referencing libraries not installed or modules that don't exist
  3. Shape mismatches — tensor dimension errors in PyTorch code
  4. Hardcoded paths — assumptions about directory structure
  5. Missing error handling — crashes on edge cases instead of graceful degradation

These failures are handled by the tree search — failed code attempts become pruned branches, and the search continues from working checkpoints.
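The failure-to-pruning conversion can be sketched as a guarded evaluation step. The `result` variable convention is an assumption for illustration, and running generated code via bare `exec` is only a sketch — the real system needs sandboxed execution:

```python
def evaluate_candidate(code):
    """Run a generated code candidate; any exception (import error, shape
    mismatch, crash on an edge case, etc.) returns None, which the tree
    search treats as a pruned branch rather than a fatal error."""
    scope = {}
    try:
        exec(code, scope)            # stand-in for running an experiment
        return scope.get("result")   # assumed convention: code sets `result`
    except Exception:
        return None                  # failed attempt -> prune this branch
```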

13 Memory Management

Memory Architecture

The Nature AI Scientist operates with several memory layers:

Memory Architecture (v2 / Nature)
==================================

┌──────────────────────────────────────────────────────────────┐
│  CONTEXT WINDOW (per-LLM-call)                               │
│  • Current phase context (idea, code, results)               │
│  • Experimental journal notes from prior phases              │
│  • Relevant archive entries                                  │
│  • Recent conversation history                               │
│  Limited by model context length                             │
├──────────────────────────────────────────────────────────────┤
│  IDEA ARCHIVE (persistent across idea generation cycles)     │
│  • All previously generated ideas                            │
│  • Their experiment plans and outcomes                       │
│  • Enables progressive exploration                           │
│  Grows monotonically; never pruned                           │
├──────────────────────────────────────────────────────────────┤
│  TREE SEARCH STATE (per-paper)                               │
│  • Node states (code checkpoints + results)                  │
│  • Branch decisions and pruning history                      │
│  • Best checkpoint at each stage boundary                    │
│  Managed by Experiment Manager Agent                         │
├──────────────────────────────────────────────────────────────┤
│  EXPERIMENTAL JOURNAL (per-paper)                            │
│  • Structured notes taken after each experiment              │
│  • Observations, hypotheses, next steps                      │
│  • Used during paper write-up phase                          │
│  Explicit prompt: "take notes in the style of an             │
│  experimental journal for future planning and write-up"      │
├──────────────────────────────────────────────────────────────┤
│  EXTERNAL KNOWLEDGE (accessed on demand)                     │
│  • Semantic Scholar API (literature search, citations)       │
│  • Web search (broader information access)                   │
│  • Not cached between sessions                               │
└──────────────────────────────────────────────────────────────┘
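The per-paper layers above can be summarized as a small state container. The field and method names here are illustrative, not the system's actual identifiers:

```python
from dataclasses import dataclass, field

@dataclass
class PaperMemory:
    """Per-paper state mirroring the memory layers above: the idea archive
    (session-persistent), tree-search checkpoints, and the experimental
    journal consumed later by the write-up phase."""
    idea_archive: list = field(default_factory=list)   # grows across cycles
    tree_nodes: dict = field(default_factory=dict)     # checkpoint -> results
    journal: list = field(default_factory=list)        # structured notes

    def log(self, observation, hypothesis, next_step):
        """Append one structured journal entry after an experiment."""
        self.journal.append({"observation": observation,
                             "hypothesis": hypothesis,
                             "next": next_step})
```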

Key Memory Improvements Over v1

| Memory Component | v1 | Nature |
|---|---|---|
| Idea archive | Present but limited | Progressive archive with explicit growth |
| Experiment state | Linear (sequential steps) | Tree structure with checkpoints |
| Journal notes | Implicit (figure notes only) | Explicit experimental journal |
| Citation memory | Per-section | 20-round iterative refinement |
| Cross-paper memory | None | Archive carries across idea generation cycles |

Memory Limitations

  1. No cross-session persistence: Each complete pipeline run starts from scratch (except the idea archive within a single session)
  2. No learned patterns: The system doesn't learn "what makes a good paper" from its own prior successes and failures
  3. Context window constraints: Complex experiments may exceed context limits, requiring summarization that loses detail
  4. No negative result memory: Failed approaches are not systematically recorded for future avoidance

14 Continued Learning

Within-Pipeline Learning

The AI Scientist exhibits learning within a single pipeline run:

| Learning Signal | Mechanism | Persistence |
|---|---|---|
| Idea novelty feedback | Semantic Scholar API filters duplicate ideas | Session-level |
| Experiment results | Tree search uses results to guide exploration | Paper-level |
| Code debugging | Failed code attempts inform subsequent attempts | Stage-level |
| Review feedback (v1) | Iterative refinement loop incorporates review | Cross-paper (v1 only) |
| Figure quality feedback | VLM loop improves figures within a paper | Paper-level |

The Scaling Law as Implicit Learning

The most significant "learning" in the AI Scientist system happens at the foundation model level, not the system level. The scaling law demonstrates that improvements to the underlying LLM automatically improve the AI Scientist's output. This is a form of transfer learning — the foundation model's general capabilities (reasoning, coding, writing, analysis) transfer directly to the specialized task of scientific research.

| Model Generation | Paper Quality | Key Improvements |
|---|---|---|
| Early (GPT-3.5 era) | Score ~2-3 | Basic structure, poor execution |
| Mid (GPT-4 era) | Score ~4-5 | Better ideas, more rigorous experiments |
| Recent (Claude Opus, Gemini Pro) | Score ~5-6 | Workshop-quality, coherent arguments |
| Projected future | Score ~7+ | Main conference quality (projected) |

Cross-Paper Learning: The Open-Ended Loop

The v1 system's iterative refinement loop — where reviewer feedback feeds back into the ideation stage — represents the most ambitious learning mechanism:

Open-Ended Research Loop
=========================

   Paper 1                Paper 2                Paper 3
┌──────────┐          ┌──────────┐          ┌──────────┐
│ Idea     │          │ Idea     │          │ Idea     │
│ (novel)  │          │ (builds  │          │ (builds  │
│          │          │  on P1)  │          │  on P1+2)│
├──────────┤          ├──────────┤          ├──────────┤
│ Exp.     │          │ Exp.     │          │ Exp.     │
├──────────┤          ├──────────┤          ├──────────┤
│ Write-up │          │ Write-up │          │ Write-up │
├──────────┤          ├──────────┤          ├──────────┤
│ Review   │──feedback│ Review   │──feedback│ Review   │
│ Score: 4 │──────────│ Score: 5 │──────────│ Score: 6 │
└──────────┘          └──────────┘          └──────────┘
                                                  │
                                                  ▼
                                        Workshop-quality
                                        paper achieved

Each paper's review feedback informs subsequent idea
generation, creating a progressive improvement trajectory.

What's Missing: Meta-Learning

The AI Scientist does not perform meta-learning — it doesn't learn how to do research better from its own experience. Several potential meta-learning signals are currently unused:

  1. Review score prediction: Learning which types of ideas tend to receive higher scores
  2. Implementation pattern learning: Recognizing which code patterns lead to successful experiments
  3. Writing quality patterns: Learning which paper structures and argumentation styles receive better reviews
  4. Failure mode avoidance: Systematically avoiding previously observed failure modes (hallucinated citations, duplicate figures, etc.)

These could be implemented via fine-tuning, retrieval-augmented generation, or explicit strategy databases, but are not part of the current system.

15 Applications

Current Applications

The AI Scientist's current applications are in machine learning research automation:

| Application | Maturity | Evidence |
|---|---|---|
| ML research exploration | Moderate | Workshop-quality papers demonstrated |
| Literature survey augmentation | High | Semantic Scholar integration works well |
| Experimental idea generation | Moderate | Ideas pass novelty checks |
| Paper drafting assistance | Moderate | Full manuscripts generated |
| Automated peer review | High | Validated against human reviewers |
| Research brainstorming | High | Archive-based idea generation |

Future Domains (Projected)

The Nature paper outlines expansion plans:

"At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made)."

| Domain | Feasibility | Required Adaptations |
|---|---|---|
| Computational ML | ✅ Current | — |
| Computational biology | ⚠️ Medium-term | Molecular simulation tools, bio-specific templates |
| Automated chemistry | ⚠️ Medium-term | Lab automation interfaces, safety constraints |
| Materials science | ⚠️ Medium-term | Simulation software integration |
| Robotics | ⚠️ Medium-term | Simulation environments |
| Theoretical mathematics | ❌ Longer-term | Proof verification (Lean4, Coq) |
| Social science | ❌ Longer-term | Data collection, IRB constraints |
| Physical experiments | ❌ Longer-term | Hardware interfaces, safety |

Ethical and Societal Implications

The Nature paper and its companion editorials raise significant concerns:

Risks identified:

  1. Overwhelming peer review: Mass-generated papers could flood review systems
  2. Credential inflation: Using AI papers to inflate publication records
  3. Idea appropriation: AI may recombine others' ideas without proper attribution
  4. Job displacement: Potential impact on early-career research positions
  5. Noise in scientific literature: Low-quality AI papers polluting the knowledge base
  6. Unethical experiments: AI systems conducting experiments without ethical oversight

Mitigations implemented:

| Risk | Mitigation |
|---|---|
| Deceptive submission | Pre-registered withdrawal protocol |
| Lack of consent | ICLR leadership + workshop organizer consent |
| Ethical oversight | UBC IRB approval (H24-02652) |
| Disclosure | All AI papers watermarked |
| Precedent-setting | Withdrew accepted paper to avoid normalizing undisclosed AI submissions |

Relationship to Evolutionary AI Systems

The AI Scientist's relationship to the evolutionary AI systems surveyed in this collection is primarily complementary rather than competitive:

| Evolutionary System | AI Scientist's Relationship |
|---|---|
| AlphaEvolve | Uses evolutionary framework for algorithm discovery; AI Scientist could write papers about AlphaEvolve discoveries |
| FunSearch | AI Scientist could automate the write-up of FunSearch discoveries |
| ShinkaEvolve | Tree search in AI Scientist v2 has structural parallels to evolutionary search |
| AutoEvolver | Both demonstrate emergent search behaviors from LLM agents |

The evolutionary strategy classification for the AI Scientist is justified by:

  1. The agentic tree search is structurally analogous to evolutionary search with selection pressure
  2. The archive-based ideation mirrors quality-diversity archives (MAP-Elites)
  3. The iterative refinement loop implements an evolutionary improvement cycle
  4. The scaling laws parallel compute-performance scaling in evolutionary algorithms
  5. The Automated Reviewer functions as a fitness function

Classification: EVOLVE

Both the architectural mechanisms (tree search, archive-based exploration, fitness-function-driven selection) and the broader paradigm (iterative improvement of AI-generated artifacts through automated evaluation) place the AI Scientist firmly in the evolutionary strategy category. The Nature publication strengthens this classification by demonstrating scaling laws that parallel evolutionary optimization dynamics — more compute and better operators (models) yield better solutions.

Significance Assessment

The Nature publication represents a landmark in AI research automation:

Impact Level: High. The first demonstration of fully AI-generated science passing human peer review, combined with validated scaling laws suggesting rapid future improvement, establishes the AI Scientist as a turning point. While current quality remains below main-conference standards, the trajectory — supported by both model scaling and compute scaling — suggests that conference-quality AI science is within reach on a 2-3 year horizon.

Limitation Caveat: The 70% workshop acceptance rate, the accepted paper's negative result aligning with the ICBINB workshop's theme, and the 33% success rate (1 of 3 submissions accepted) all temper the headline claim. Main-conference acceptance remains an unmet challenge.

Limitations Specific to the Nature Paper

  1. Selective reporting: Only 3 of many generated papers were submitted; manual filtering introduces human selection bias
  2. Workshop vs. main conference: Workshop acceptance (70% rate) is not equivalent to main conference acceptance (32% rate)
  3. Negative result advantage: The accepted paper reported a negative result, which aligned with the ICBINB workshop's specific focus — this may not generalize
  4. Model access dependency: Results depend on commercial API access to frontier models; full reproducibility requires matching model capabilities
  5. Limited domain: Only ML research is demonstrated; claims about broader scientific applicability are aspirational
  6. No longitudinal study: The scaling laws are cross-sectional (comparing different models at one time point), not longitudinal (tracking the same system over time)
  7. Automated Reviewer limitations: The reviewer is validated on AI/ML papers only; it may not generalize to other scientific domains

References

  1. Lu, C., Lu, C., Yamada, Y., Lange, R.T., Hu, S., Foerster, J., Clune, J., and Ha, D. "Towards End-to-End Automation of AI Research." Nature, s41586-026-10265-5, 2026.
  2. Yamada, Y., Lange, R.T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., and Ha, D. "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search." arXiv:2504.08066, April 2025.
  3. Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., and Ha, D. "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292, August 2024.
  4. NeurIPS 2021 Consistency Experiment. "The NeurIPS 2021 Consistency Experiment." NeurIPS Blog, December 2021.
  5. Mouret, J.-B. and Clune, J. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.
  6. Clune, J. "AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence." arXiv:1905.10985, 2019.
  7. Sakana AI. "AI Scientist v1 Repository." github.com/SakanaAI/AI-Scientist.
  8. Sakana AI. "AI Scientist v2 Repository." github.com/SakanaAI/AI-Scientist-v2.
  9. Aider. Open-source AI coding assistant. aider.chat.
  10. Gauthier, J. et al. "OpenReview: A Scientific Review Platform." 2014.
  11. Sakana AI. "The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature." Blog Post, 2026. sakana.ai/ai-scientist-nature.
  12. Anthropic. "Claude." 2024-2026.
  13. Google DeepMind. "Gemini." 2024-2026.
