
AI Scientist — Nature Publication

Towards End-to-End Automation of AI Research
Organization: Sakana AI, University of British Columbia, Vector Institute, University of Oxford
Published: 2026 (Nature)
Type: Journal Paper (Nature)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

Scope Note: This document covers the Nature publication (s41586-026-10265-5) and the AI Scientist v2 system (arXiv:2504.08066). For the original AI Scientist v1 system (arXiv:2408.06292, August 2024), see The AI Scientist. This report focuses on what is new relative to v1: template-free operation, agentic tree search, the peer review milestone, automated reviewer validation, and scaling laws of AI science.


1 Full Title and Attribution

Nature Paper Title: Towards End-to-End Automation of AI Research

Nature DOI: 10.1038/s41586-026-10265-5

AI Scientist v2 Preprint: arXiv:2504.08066 — "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search" (April 10, 2025)

Original Preprint (v1): arXiv:2408.06292 — "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" (August 2024)

Open Access: Yes (Nature open-access publication)

Repositories:

  • AI Scientist v1: github.com/SakanaAI/AI-Scientist
  • AI Scientist v2: github.com/SakanaAI/AI-Scientist-v2

Lineage: This Nature publication consolidates and extends two prior releases — the original AI Scientist (August 2024) and AI Scientist v2 (April 2025) — adding new scaling results, the complete peer review experiment, and an Automated Reviewer validation study.

Publication Timeline

Date Event Reference
August 2024 AI Scientist v1 preprint released arXiv:2408.06292
August 2024 Open-source release of v1 code GitHub: SakanaAI/AI-Scientist
September 2024 ICLR 2025 ICBINB workshop submission 3 AI-generated papers submitted
January 2025 Peer review results: 1 paper accepted Scores: 6, 7, 6 (avg 6.33)
February 2025 Paper withdrawn per pre-established protocol Responsible AI commitment
April 2025 AI Scientist v2 preprint released arXiv:2504.08066
April 2025 Open-source release of v2 code GitHub: SakanaAI/AI-Scientist-v2
2026 Nature paper published s41586-026-10265-5

2 Authors and Team

Nature Paper Authors

The Nature paper represents a collaboration across four institutions:

Author Affiliation Role
Chris Lu Sakana AI Co-lead, original AI Scientist creator
Cong Lu Sakana AI Co-lead, original AI Scientist creator
Yutaro Yamada Sakana AI v2 lead, agentic tree search
Robert Tjarko Lange Sakana AI v2 development, evolutionary methods expert
Shengran Hu Sakana AI v2 development
David Ha Sakana AI CTO, strategic direction
Jakob Foerster University of Oxford Faculty collaborator
Jeff Clune UBC / Vector Institute / CIFAR Faculty collaborator, open-endedness expert

Team Context

The author list bridges multiple research communities:

  • Sakana AI (Tokyo) — an AI research company founded by David Ha (former Google Brain) and Llion Jones (Transformer co-author). Sakana focuses on nature-inspired AI systems, with The AI Scientist as a flagship project.
  • Jeff Clune (UBC / Vector Institute) — a pioneer in open-ended learning, quality-diversity algorithms, and AI-generating algorithms (OMNI, MAP-Elites). His research group has long argued that open-ended search is key to artificial general intelligence. His influence is visible in the system's emphasis on open-ended discovery and archive-based idea generation.
  • Jakob Foerster (Oxford) — expert in multi-agent systems and game theory. His involvement connects the project to multi-agent research methodology.

v1 → v2 → Nature: Team Evolution

The original v1 was primarily the work of Chris Lu and Cong Lu at Sakana AI. The v2 system added Yutaro Yamada as lead for the agentic tree search methodology, and Robert Tjarko Lange, whose evolutionary methods expertise (EvoJAX, ShinkaEvolve) shaped the tree search design. The Nature paper synthesizes contributions from both phases and adds the scaling analysis and peer review experiment as new material.

3 Core Contribution

What's New in the Nature Paper: The Nature publication is not simply a republication of v1 or v2. It consolidates both systems, adds substantial new analysis, and presents the first demonstration of a fully AI-generated paper passing peer review at a top-tier ML conference workshop. The three core contributions beyond v1 are: (1) template-free operation via agentic tree search, (2) validated Automated Reviewer matching human reviewer accuracy, and (3) scaling laws showing that better models → better papers.

Delta from the Original AI Scientist (v1)

Dimension AI Scientist v1 (Aug 2024) Nature Paper / v2 (2025-2026)
Template dependency Required human-provided code templates Template-free mode generates code from scratch
Experimentation Linear execution of experiment plan Agentic tree search with 4 stages
Code generation Used Aider for code modifications Direct LLM-powered tree search (no Aider)
Peer review test Hypothetical ("could approach acceptance") Actual submission + acceptance at ICLR workshop
Automated Reviewer Described but not rigorously validated Validated against ICLR OpenReview at scale
Scaling analysis Not included Paper quality scales with model capability
Compute scaling Not studied Paper quality scales with compute budget
Figure quality LLM-generated, no visual feedback VLM feedback loop for figure refinement
Idea generation Simple prompting + novelty check Archive-based progressive idea generation
Domain scope 3 templates (NanoGPT, Diffusion, Grokking) Any ML topic (template-free)
IRB approval Not obtained UBC IRB H24-02652 approved

The Three Pillars of the Nature Contribution

Pillar 1: Template-Free Scientific Discovery

The original AI Scientist required a human-prepared code template as a starting point. This constraint limited the system to predefined research domains and required non-trivial human setup. The v2/Nature system introduces a template-free mode where the AI Scientist receives only a broad research direction (e.g., "investigating deep learning limitations") and generates its own code, experiments, and papers without any starting scaffold.

Pillar 2: The Peer Review Turing Test

The authors explicitly frame the peer review experiment as an "AI scientist Turing test" — a test of whether AI-generated science is indistinguishable from human science when evaluated by standard scientific processes. One paper passed this test, albeit at a workshop with a 70% acceptance rate rather than a main conference (32% acceptance rate).

Pillar 3: Scaling Laws of AI Science

Perhaps the most significant finding for the field's future trajectory: paper quality (as measured by the Automated Reviewer) scales predictably with both the capability of the underlying LLM and the compute budget allocated to experimentation. This implies that future improvements to foundation models will automatically improve AI scientific output, without requiring changes to the AI Scientist system itself.

4 Supported Solutions

Output Artifacts

The AI Scientist produces complete scientific manuscripts as its primary output. The Nature paper evaluates these in two operational modes:

Mode Input Output Code Source Experiment Structure
Template-based Human-provided code template + broad topic Full paper on template topic Modified template code Linear experiment plan
Template-free Broad research direction only Full paper on novel topic AI-generated from scratch 4-stage agentic tree search

Paper Output Quality Spectrum

Based on the Automated Reviewer and human evaluation across the Nature paper's experiments:

Quality Level % of Output Description Comparison
Workshop-acceptable ~15-25% (recent models) Would pass workshop peer review Top 45% of workshop submissions
Borderline ~30-40% Approaches workshop standards Interesting ideas, execution issues
Below threshold ~35-50% Significant quality gaps Naive ideas, implementation bugs, hallucinations
Main conference quality ~0% None yet meet main conference bar Acknowledged limitation

Research Domains Covered

Template-based mode (inherited from v1):

Template Domain Base Model GPU Time
NanoGPT Language modeling Karpathy's NanoGPT ~30 min
2D Diffusion Generative modeling Score-based diffusion ~15 min
Grokking Generalization theory Modular arithmetic ~10 min
7 community templates Various ML domains Various Various

Template-free mode (new in v2/Nature):

Research Direction Example Generated Topic Template Created
"Deep learning limitations" Negative results in optimization AI-generated from scratch
"Novel training strategies" Curriculum learning dynamics AI-generated from scratch
"Representation learning" Feature emergence in transformers AI-generated from scratch

The ICLR ICBINB Submission

Three fully AI-generated papers were submitted to the ICLR 2025 I Can't Believe It's Not Better (ICBINB) workshop:

Paper Topic Reviewer Scores Average Outcome
Paper 1 Negative result in deep learning 6, 7, 6 6.33 Would have been accepted (withdrawn)
Paper 2 [Not disclosed] Below threshold Below threshold Not accepted
Paper 3 [Not disclosed] Below threshold Below threshold Not accepted

Key context:

  • Workshop acceptance rate: 70% (vs. 32% for the ICLR main conference)
  • The accepted paper's average score (6.33) placed it in the top 45% of submissions
  • The paper scored higher than 55% of human-authored papers at the workshop
  • Reviewers were informed that some submissions were AI-generated, but not which ones (blind)
  • The accepted paper reported a negative result, aligning with the workshop's focus
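As a quick sanity check on these numbers, the three reported reviewer scores can be averaged against the stated 6.0 acceptance threshold. This is a minimal illustrative snippet, not part of the system:

```python
# Illustrative check of the accepted paper's reported statistics:
# three reviewer scores averaged against the workshop's 6.0 threshold.
scores = [6, 7, 6]  # weak accept, accept, weak accept (from the report)
average = sum(scores) / len(scores)
accepted = average >= 6.0

print(f"average = {average:.2f}, accepted = {accepted}")  # average = 6.33
```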

5 LLM Integration

Foundation Model as Core Engine

The AI Scientist uses foundation models as its central reasoning engine across all phases. The Nature paper evaluates performance across a range of models, revealing the critical finding that model quality directly determines paper quality.

Model Evaluation Across Generations

The Nature paper's most impactful finding is the scaling law: paper quality improves with model release date.

Paper Quality vs. Model Generation
====================================

Automated
Reviewer
Score
  7 │                                              ●
    │                                           ╱
  6 │                                        ●╱     Newer models
    │                                     ╱         (2025-2026)
  5 │                              ●───●╱
    │                           ╱
  4 │                     ●──●╱
    │                  ╱                    Older models
  3 │            ●──●╱                     (2023-2024)
    │         ╱
  2 │    ●──╱
    │
  1 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──────── Model Release Date
       │  │  │  │  │  │  │  │  │  │
      GPT Claude  Gemini     Claude  (Latest
      3.5 Sonnet  1.5 Pro    Opus    models)
           3.5

Correlation: P < 0.00001 (statistically significant)

This correlation implies that The AI Scientist is a general-purpose amplifier of model capability: as models improve, the quality of AI-generated science improves correspondingly, without modifications to the system itself.

Dual Operating Modes

Mode Code Generation LLM Role
Template-based Uses Aider (open-source coding assistant) Generates ideas, modifies template code via Aider, writes paper
Template-free Direct LLM code generation (no Aider) Generates ideas, writes code from scratch, manages tree search, writes paper

The shift from Aider to direct LLM code generation in template-free mode is significant. Aider provides structured code editing (diff-based patches, file management), but constrains the system to modifying existing code. Direct generation enables the LLM to create entirely new codebases, at the cost of more implementation bugs.

Vision-Language Model (VLM) Integration (New in v2)

A notable addition in v2 is the VLM feedback loop for figure quality:

VLM Figure Refinement Loop (New in v2)
=======================================

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Generate    │     │  VLM         │     │  Refine      │
│  Figure      │────▶│  Evaluate    │────▶│  Figure      │
│  (matplotlib)│     │  (visual     │     │  (adjust     │
│              │     │   quality)   │     │   layout,    │
│              │     │              │     │   colors,    │
│              │◀────│  Feedback:   │     │   labels)    │
│              │     │  "axis labels│     │              │
│              │     │   too small, │     └──────────────┘
│              │     │   legend     │
│              │     │   overlaps"  │
│              │     └──────────────┘
│              │
└──────────────┘

Contribution: Addresses v1's weakness of low-quality figures
with duplicates, missing labels, and poor formatting.

Multi-Model Architecture

The Nature paper reveals that different phases can use different models:

Phase Recommended Model Reasoning
Idea generation Strongest available Creative ideation benefits from frontier capability
Code generation Strong coder Implementation correctness is critical
Experiment execution Agent-capable model Needs tool use, file management
Paper writing Strongest available Long-form coherent academic writing
Automated Review Ensembled (5 reviews) Ensemble reduces individual model bias

6 Key Results

Result 1: First AI-Generated Paper to Pass Peer Review

The headline achievement: a fully AI-generated paper was accepted via the peer review process at a workshop of a top-tier ML conference (ICLR 2025 ICBINB).

Statistical Context:

Metric Value
Workshop acceptance rate 70%
Main conference acceptance rate 32%
Total workshop submissions 43
AI-generated submissions 3
AI submissions accepted 1
Accepted paper's scores 6 (weak accept), 7 (accept), 6 (weak accept)
Average score 6.33
Human acceptance threshold 6.0 (average)
Paper's percentile among submissions Top 45%
Papers it scored higher than 55% of human-authored papers

Important caveats the authors acknowledge:

  1. Workshop papers have a substantially lower bar than main conference papers
  2. Only 1 of 3 submissions was accepted (33% success rate, below workshop average)
  3. The accepted paper reported a negative result, which aligned with the workshop's specific focus
  4. Manual filtering of candidate papers was performed before submission (selecting the most promising from a larger pool)
  5. The paper was withdrawn after acceptance per the pre-established protocol

Result 2: Automated Reviewer Matches Human Reviewers

The Automated Reviewer was rigorously validated against the OpenReview dataset of human decisions:

Metric Automated Reviewer Human Reviewers (NeurIPS 2021)
Balanced Accuracy (pre-cutoff) 69% Comparable (NeurIPS consistency study)
Balanced Accuracy (post-cutoff) 66% Comparable
F1 Score Exceeds inter-human agreement Baseline
Statistical significance P < 0.001 (bootstrap test)
Data contamination effect Minimal (69% → 66%) N/A
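The balanced-accuracy metric used in this validation can be reproduced in miniature. The sketch below computes it on toy accept/reject labels; the data is invented purely for illustration.

```python
# Minimal sketch of the balanced-accuracy metric used to validate the
# Automated Reviewer against human accept/reject decisions. Balanced
# accuracy averages per-class recall, so it is robust to the class
# imbalance of conference decisions (far more rejects than accepts).
def balanced_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# Toy data: True = accepted, False = rejected (illustrative only).
human =    [True, True, True, False, False, False, False, False]
reviewer = [True, True, False, False, False, False, True, False]
print(balanced_accuracy(human, reviewer))
```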

Automated Reviewer Architecture:

Automated Reviewer Pipeline (Nature)
=====================================

Paper (LaTeX/PDF)
       │
       ▼
┌─────────────────────────────────────────────┐
│          FIVE INDEPENDENT REVIEWS            │
│                                             │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│  │ R1  │ │ R2  │ │ R3  │ │ R4  │ │ R5  │ │
│  │     │ │     │ │     │ │     │ │     │ │
│  │Score│ │Score│ │Score│ │Score│ │Score│ │
│  │S/W  │ │S/W  │ │S/W  │ │S/W  │ │S/W  │ │
│  │Dec. │ │Dec. │ │Dec. │ │Dec. │ │Dec. │ │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│     │       │       │       │       │     │
│     └───────┴───────┼───────┴───────┘     │
│                     ▼                     │
│           ┌─────────────────┐             │
│           │   META-REVIEW   │             │
│           │   (Area Chair   │             │
│           │    persona)     │             │
│           │                 │             │
│           │  Synthesizes 5  │             │
│           │  reviews into   │             │
│           │  final decision │             │
│           └────────┬────────┘             │
│                    │                      │
└────────────────────┼──────────────────────┘
                     ▼
            ┌────────────────┐
            │ Final Decision │
            │ + Scores       │
            │ + Strengths    │
            │ + Weaknesses   │
            └────────────────┘

Based on NeurIPS review guidelines.
5-run ensemble for replicability.
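The numeric side of this pipeline can be sketched briefly. In the real system the meta-review is an LLM acting as an Area Chair persona; the mean-score and majority-vote synthesis below is only a plausible stand-in for illustration.

```python
# Sketch of the 5-review ensemble's numeric aggregation. The Area Chair
# step is an LLM in the actual system; this mean + majority-vote rule is
# an illustrative assumption, not the paper's method.
from statistics import mean

def meta_review(reviews):
    """reviews: list of dicts with 'score' (1-10) and 'decision' (bool)."""
    avg = mean(r["score"] for r in reviews)
    accepts = sum(r["decision"] for r in reviews)
    return {"score": round(avg, 2), "decision": accepts > len(reviews) / 2}

reviews = [
    {"score": 6, "decision": True},
    {"score": 7, "decision": True},
    {"score": 5, "decision": False},
    {"score": 6, "decision": True},
    {"score": 4, "decision": False},
]
print(meta_review(reviews))  # mean 5.6, majority accept
```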

Key differences from v1 reviewer:

Aspect v1 Reviewer Nature Reviewer
Number of reviews 3 personas (base, negative, positive) 5 independent reviews + meta-review
Validation Described, limited testing Rigorous validation against 1,000+ ICLR papers
Guideline basis Conference review rubric Official NeurIPS guidelines
Area Chair role Simple aggregation Explicit Area Chair persona for meta-review
Post-cutoff testing Not performed Tested on 2025 papers (after training cutoff)
Statistical rigor Not reported P-values, bootstrap CIs, z-tests

Result 3: Scaling Laws of AI Science

Two scaling relationships are demonstrated:

Scaling Law 1: Model Quality → Paper Quality

Papers generated by newer, more capable models receive higher Automated Reviewer scores. The correlation is statistically significant (P < 0.00001). This is tested across model generations from GPT-3.5 through the latest Claude and Gemini models.

Implication: No system-level changes are needed to improve output. Simply using a better foundation model produces better science.
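The correlation underlying this law can be reconstructed in a few lines. The data points below are hypothetical, shaped like the chart in Section 5; only the existence of a strong positive correlation is the paper's claim.

```python
# Illustrative reconstruction of Scaling Law 1: Automated Reviewer score
# vs. model release date. Data is hypothetical; the pearson helper is a
# plain implementation of the Pearson correlation coefficient.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

release_year = [2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5, 2026.0]
review_score = [2.1, 3.0, 3.8, 4.6, 5.2, 6.1, 6.9]  # hypothetical
print(f"r = {pearson(release_year, review_score):.3f}")  # strongly positive
```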

Scaling Law 2: Compute Budget → Paper Quality

Increasing the number of experimental nodes in the agentic tree search improves paper quality. This suggests that test-time compute scaling — a central trend in modern AI — applies to scientific discovery as well.

Compute Scaling: Tree Search Depth → Paper Quality
===================================================

Score │
  7   │                                          ●
      │                                       ╱
  6   │                                 ●──●╱
      │                              ╱
  5   │                       ●───●╱
      │                    ╱
  4   │             ●───●╱
      │          ╱
  3   │    ●──●╱
      │
  2   └──┬──┬──┬──┬──┬──┬──┬──┬──┬──── Compute Budget
         1  2  4  8  16 32 64 128 256  (tree nodes)

Each additional doubling of compute budget yields
diminishing but consistent quality improvements.
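The "gain per doubling" reading of this chart can be made concrete. The node counts below follow the chart's axis; the scores are hypothetical points consistent with a roughly log-linear trend, not the paper's raw data.

```python
# Sketch of Scaling Law 2: score vs. compute budget in tree nodes.
# With score ≈ a + b·log2(nodes), b is the gain per compute doubling.
nodes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
scores = [2.9, 3.4, 3.9, 4.4, 4.9, 5.4, 5.9, 6.4, 6.9]  # hypothetical

gains = [s2 - s1 for s1, s2 in zip(scores, scores[1:])]
per_doubling = sum(gains) / len(gains)
print(f"average gain per compute doubling: {per_doubling:.2f} points")
```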

Combined implication: Both scaling axes — model capability and inference-time compute — are on exponentially improving trajectories. If the trend holds, future versions of The AI Scientist will produce substantially better science with both better models and more efficient compute.

Result 4: Template-Free Operation

The template-free mode represents a qualitative capability expansion:

Capability Template-Based Template-Free
Requires human code setup Yes No
Research domain Fixed by template Open-ended
Code origin Modified human code AI-generated from scratch
Experimentation structure Linear plan 4-stage tree search
Code quality Higher (human starting point) Lower (more bugs)
Idea novelty Constrained by template Broader exploration
Paper diversity Limited to template domain Diverse topics

7 Reproducibility

Reproducibility Framework

The Nature paper takes reproducibility significantly more seriously than v1, partly driven by the requirements of Nature's publication standards:

Component Reproducibility Status Evidence
System code ✅ Open-source (v1 + v2) GitHub repositories
Automated Reviewer ✅ Open-source + validated Tested against OpenReview
Generated papers ✅ Available in supplementary Full manuscripts in appendix
Peer review experiment ⚠️ Process documented IRB approved, organizer consent
Model weights ❌ Commercial models API access required
Exact paper regeneration ❌ Stochastic process Different runs produce different papers
Scaling curves ⚠️ Aggregated statistics Mean ± standard error reported

IRB and Ethical Approval

A significant new element is the formal ethical framework:

  • IRB: University of British Columbia IRB approval H24-02652
  • Conference consent: ICLR 2025 leadership and ICBINB workshop organizers explicitly agreed
  • Pre-registration: Decision to withdraw accepted papers was made before submission
  • Disclosure: Reviewers were informed some submissions were AI-generated (blind — they didn't know which ones)
  • Watermarking: All AI-generated papers were watermarked as AI-generated

Statistical Methodology

The Nature paper employs rigorous statistical methods:

Analysis Statistical Method Result
Model scaling correlation Pearson correlation + significance test P < 0.00001
Automated Reviewer accuracy Balanced accuracy + bootstrapped 95% CI 5,000 bootstrap replicates
Human vs. automated agreement Two-sample z-test P = 0.319 (pre-cutoff), P = 0.921 (post-cutoff)
F1 score comparison Non-parametric bootstrap test Automated outperformance P < 0.001
Data contamination Pre/post-cutoff comparison 69% → 66% (minimal effect)
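The bootstrap methodology in the table can be sketched in a few lines. The 5,000-replicate count comes from the table; the toy sample and the percentile-interval choice are illustrative assumptions.

```python
# Minimal percentile-bootstrap sketch of a 95% CI (5,000 replicates, as in
# the table). The toy sample encodes 69% point accuracy for illustration.
import random

random.seed(0)
correct = [1] * 69 + [0] * 31  # toy sample: 69% accuracy

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05):
    reps = sorted(
        stat([random.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(correct, lambda xs: sum(xs) / len(xs))
print(f"95% CI for accuracy: [{lo:.2f}, {hi:.2f}]")
```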

8 Compute and API Costs

Cost Structure (Nature Paper)

The Nature paper does not report exact costs per paper, but we can estimate from the v1 analysis and the scaling experiments:

Template-based mode (inherited from v1):

Stage Estimated Cost % of Total
Idea Generation ~$1.50 ~9%
Experimentation ~$3.00 ~18%
Paper Write-up ~$7.50 ~44%
Peer Review (5-review ensemble) ~$5.00 ~29%
Total (template-based) ~$17 100%

Template-free mode (new cost profile):

Stage Estimated Cost Notes
Idea Generation + Code Generation ~$5-10 Generating code from scratch is more expensive
Agentic Tree Search (4 stages) ~$20-50 Scales with tree depth; main cost driver
Paper Write-up + VLM Figure Refinement ~$10-15 VLM feedback adds cost
Automated Review (5 reviews + meta) ~$5-8 More reviews than v1
Total (template-free, basic) ~$40-80 At minimal tree depth
Total (template-free, full search) ~$100-300 At deeper tree search

Scaling Cost Analysis

The scaling experiments reveal the cost-quality tradeoff:

Tree Nodes Estimated Cost Quality Score Quality/Dollar
4 ~$40 ~3.5 0.088
16 ~$80 ~4.5 0.056
64 ~$160 ~5.5 0.034
256 ~$500+ ~6.5 0.013

The quality/dollar ratio falls as the compute budget grows: quality improves roughly linearly in the logarithm of compute, as is typical of scaling laws, while cost grows linearly with the number of nodes expanded. Producing a workshop-acceptable paper costs roughly $200-500 in template-free mode with sufficient compute.
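The quality/dollar column can be recomputed directly from the table above; the 256-node row uses the table's "$500+" lower bound.

```python
# Recomputing quality per dollar from the scaling-cost table above.
rows = [  # (tree nodes, estimated cost in dollars, quality score)
    (4, 40, 3.5),
    (16, 80, 4.5),
    (64, 160, 5.5),
    (256, 500, 6.5),  # "$500+" lower bound
]
ratios = [score / cost for _, cost, score in rows]
for (nodes, cost, score), r in zip(rows, ratios):
    print(nodes, "nodes:", r)
```

The ratios decrease monotonically, matching the diminishing-returns reading of the table.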

Comparison: Cost to Produce Publishable Science

Producer Cost per Paper Quality Time
PhD student ~$50,000-100,000/year (salary + overhead) for ~2-4 papers Main conference quality 3-6 months
AI Scientist (template-based) ~$17 Below workshop bar (v1) Hours
AI Scientist (template-free, scaled) ~$200-500 Workshop-acceptable (~15-25%) Hours-days
AI Scientist (projected future) ~$50-100 (as models improve + costs drop) Main conference (projected) Hours

The economic implications are substantial. Even at current quality levels, the AI Scientist can generate candidate ideas and preliminary experiments at a cost several orders of magnitude lower than human research. The value proposition is strongest when used for broad exploration — generating many candidate directions cheaply, then having humans select and refine the most promising ones.

GPU Compute Costs

In addition to API costs, the system requires GPU compute for running ML experiments:

Experiment Type GPU Time Cloud Cost (A100)
Template-based (NanoGPT) ~30 min ~$1
Template-based (Grokking) ~10 min ~$0.30
Template-free (basic) ~1-2 hours ~$3-6
Template-free (full tree) ~4-12 hours ~$12-36

9 Architecture Solution

Architectural Evolution: v1 → v2 → Nature

The AI Scientist architecture has evolved significantly between versions. The Nature paper presents both architectures and their trade-offs.

AI Scientist v1 Architecture (Template-Based)
===============================================

┌────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ IDEATION   │    │ EXPERIMENT  │    │ WRITE-UP    │    │ REVIEW      │
│            │    │             │    │             │    │             │
│ LLM ideates│───▶│ Aider edits │───▶│ LaTeX gen   │───▶│ 3 personas  │
│ Novelty    │    │ template    │    │ section by  │    │ 3 reflections│
│ check via  │    │ code        │    │ section     │    │ each        │
│ Semantic   │    │ Linear exec │    │ Citation    │    │             │
│ Scholar    │    │             │    │ search      │    │             │
└────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     Simple            Aider              Good              Basic
     effective         dependent          quality           validation


AI Scientist v2 / Nature Architecture (Template-Free)
======================================================

┌─────────────┐    ┌──────────────────────────────────┐    ┌──────────────┐
│ IDEATION    │    │ AGENTIC TREE SEARCH              │    │ WRITE-UP     │
│             │    │                                  │    │              │
│ Archive-    │    │ Stage 1: Initial Investigation   │    │ LaTeX gen    │
│ based idea  │    │    ├── Baseline implementation   │    │ VLM figure   │
│ generation  │    │    └── Multiple code attempts    │    │ refinement   │
│             │    │                                  │    │ 20 citation  │
│ Progressive │    │ Stage 2: Hyperparameter Tuning   │    │ search rounds│
│ archive     │───▶│    ├── Grid/random search        │───▶│              │
│ growth      │    │    └── Best checkpoint → next    │    │              │
│             │    │                                  │    │              │
│ Semantic    │    │ Stage 3: Research Agenda         │    │              │
│ Scholar +   │    │    ├── Tree search over ideas    │    │              │
│ web search  │    │    └── Best checkpoint → next    │    │              │
│ filtering   │    │                                  │    │              │
│             │    │ Stage 4: Ablation Studies        │    │              │
│             │    │    └── Systematic ablations      │    │              │
└─────────────┘    └──────────────────────────────────┘    └──────────────┘
                                │                                │
                                ▼                                ▼
                   ┌───────────────────────┐         ┌───────────────────────┐
                   │ EXPERIMENT MANAGER    │         │ AUTOMATED REVIEWER    │
                   │ AGENT                 │         │                       │
                   │ Coordinates tree      │         │ 5 independent reviews │
                   │ search, selects       │         │ + Area Chair meta-    │
                   │ checkpoints,          │         │ review                │
                   │ manages resources     │         │ NeurIPS guidelines    │
                   └───────────────────────┘         └───────────────────────┘

Key Architectural Differences

Component v1 v2 / Nature
Code modification Aider (diff-based) Direct LLM generation (tree search)
Experiment structure Linear plan execution 4-stage tree search with checkpointing
Idea management Flat list with novelty check Progressive archive (grows over time)
Figure quality Basic matplotlib VLM feedback loop
Review system 3 personas, 3 reflections 5 reviews + meta-review (Area Chair)
Experiment management Sequential Dedicated Experiment Manager Agent
Template requirement Mandatory Optional (template-free mode available)

The Agentic Tree Search (Core Innovation)

The most significant architectural addition is the agentic tree search for experimentation. This replaces v1's linear experiment execution with a search tree where each node represents an experimental state (code + results + analysis):

Agentic Tree Search Visualization
===================================

                    ROOT (broad research idea)
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │ Stage 1  │  │ Stage 1  │  │ Stage 1  │
    │ Impl. A  │  │ Impl. B  │  │ Impl. C  │
    │ score: 3 │  │ score: 5 │  │ score: 2 │
    └──────────┘  └────┬─────┘  └──────────┘
                       │                        ← Best selected
              ┌────────┼────────┐
              ▼        ▼        ▼
        ┌─────────┐ ┌─────────┐ ┌─────────┐
        │ Stage 2 │ │ Stage 2 │ │ Stage 2 │
        │ HP: lr  │ │ HP: bs  │ │ HP: wd  │
        │ =0.001  │ │ =64     │ │ =0.01   │
        │ score:6 │ │ score:5 │ │ score:4 │
        └────┬────┘ └─────────┘ └─────────┘
             │                              ← Best selected
        ┌────┼────────────┐
        ▼    ▼            ▼
    ┌──────┐ ┌──────┐ ┌──────┐
    │ S3   │ │ S3   │ │ S3   │
    │Idea A│ │Idea B│ │Idea C│
    │ s: 7 │ │ s: 5 │ │ s: 6 │
    └──┬───┘ └──────┘ └──────┘
       │                                    ← Best selected
       ▼
    ┌──────┐
    │ S4   │
    │Ablate│
    │ s: 7 │    ← Final paper uses this checkpoint
    └──────┘

Stage 1: Initial Investigation (multiple implementation attempts)
Stage 2: Hyperparameter Tuning (grid/random search)
Stage 3: Research Agenda Execution (tree search over ideas)
Stage 4: Ablation Studies (systematic ablations)

At each stage boundary, the best-performing checkpoint
is selected to seed the next stage.
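The stage-boundary selection rule above can be sketched with a minimal node structure. The `Node` fields and scores below are illustrative (mirroring the diagram), not the system's real data model.

```python
# Hedged sketch of stage-boundary checkpoint selection: each node holds an
# experimental state with a score, and the best-scoring node seeds the
# next stage. Fields and scores mirror the diagram above, for illustration.
from dataclasses import dataclass, field

@dataclass
class Node:
    stage: int
    label: str
    score: float
    children: list = field(default_factory=list)

def best_checkpoint(nodes):
    """Select the best-performing node at a stage boundary."""
    return max(nodes, key=lambda n: n.score)

stage1 = [Node(1, "Impl. A", 3), Node(1, "Impl. B", 5), Node(1, "Impl. C", 2)]
seed = best_checkpoint(stage1)  # Impl. B seeds Stage 2
stage2 = [Node(2, "lr=0.001", 6), Node(2, "bs=64", 5), Node(2, "wd=0.01", 4)]
seed.children = stage2
print(best_checkpoint(stage2).label)  # lr=0.001
```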

Experiment Manager Agent (New)

The v2 system introduces a dedicated Experiment Manager Agent that coordinates the tree search. This agent:

  1. Decides which nodes to expand next (exploration vs. exploitation)
  2. Selects the best checkpoint at stage boundaries
  3. Manages compute budget allocation across tree branches
  4. Tracks experimental progress and identifies promising directions
  5. Kills unproductive branches to conserve resources

This is architecturally significant because it introduces a meta-level agent that reasons about the structure of the search rather than performing the search itself. In evolutionary computation terms, this is analogous to a strategy adaptation mechanism.
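The exploration/exploitation and pruning duties above can be illustrated with a numeric policy. The real manager is an LLM agent; the UCB-style rule, quality floor, and tuple representation below are assumptions chosen purely to make the trade-off concrete.

```python
# Hypothetical Experiment Manager policy: UCB-style choice of which branch
# to expand, plus pruning of weak branches. Illustrative only; the actual
# agent reasons in natural language rather than with this formula.
import math

def select_branch(branches, c=1.0):
    """branches: list of (mean_score, visit_count). Returns index to expand."""
    total = sum(n for _, n in branches)
    ucb = [m + c * math.sqrt(math.log(total) / n) for m, n in branches]
    return max(range(len(branches)), key=lambda i: ucb[i])

def prune(branches, floor=3.0):
    """Drop branches whose mean score has fallen below a quality floor."""
    return [b for b in branches if b[0] >= floor]

branches = prune([(5.0, 4), (4.5, 1), (2.0, 6)])  # kills the 2.0 branch
print(select_branch(branches))  # the barely-explored 4.5 branch wins
```

Note how the under-visited branch is expanded despite its lower mean score: that is the exploration bonus at work.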

10 Component Breakdown

Phase 1: Idea Generation (Enhanced)

v1 approach: LLM generates ideas → novelty check via Semantic Scholar → ideas formatted as JSON.

Nature enhancements:

Enhancement Description Impact
Archive-based generation Ideas are generated relative to a growing archive of prior ideas Enables progressive exploration
Web access tools LLM can search web + Semantic Scholar as tools (not just API calls) Broader literature coverage
Template-free prompting Ideas can target any ML topic, not just template domains Broader scope
Idea filtering pipeline Multi-stage filtering: novelty → feasibility → alignment Higher-quality ideas passed to experimentation
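The multi-stage filter (novelty → feasibility → alignment) can be sketched as a chain of predicates. The predicates here stand in for LLM and Semantic Scholar judgments; the field names and helper are illustrative assumptions, not the system's real interface.

```python
# Sketch of the multi-stage idea filtering pipeline described above.
# Each check stands in for an LLM / literature-search judgment.
def filter_ideas(ideas, checks):
    """Apply checks in order; an idea must pass every stage to survive."""
    for name, check in checks:
        ideas = [i for i in ideas if check(i)]
        print(f"after {name}: {len(ideas)} ideas remain")
    return ideas

ideas = [
    {"title": "A", "novel": True,  "feasible": True,  "aligned": True},
    {"title": "B", "novel": False, "feasible": True,  "aligned": True},
    {"title": "C", "novel": True,  "feasible": False, "aligned": True},
    {"title": "D", "novel": True,  "feasible": True,  "aligned": False},
]
survivors = filter_ideas(ideas, [
    ("novelty",     lambda i: i["novel"]),
    ("feasibility", lambda i: i["feasible"]),
    ("alignment",   lambda i: i["aligned"]),
])
print([i["title"] for i in survivors])  # ['A']
```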

Phase 2: Experimentation (Major Overhaul)

The experimentation phase is completely redesigned in the template-free mode:

v1: Linear execution of experiment plan via Aider code modifications.

Nature (template-free): 4-stage agentic tree search managed by Experiment Manager Agent.

Stage Goal Method Output
1. Initial Investigation Create working baseline Multiple code generation attempts Best baseline code + results
2. Hyperparameter Tuning Optimize baseline Grid/random search over key HPs Best HP configuration
3. Research Agenda Implement the research idea Tree search over implementation variants Best implementation + results
4. Ablation Studies Validate contribution Systematic ablations of key components Ablation table + analysis

Phase 3: Paper Write-up (Enhanced)

v1 approach: Section-by-section LaTeX generation with Semantic Scholar citation search.

Nature enhancements:

Enhancement Description
VLM figure feedback Vision-Language Model evaluates figure quality; iterative refinement
20-round citation search More thorough literature integration (v1 used fewer rounds)
Citation justification For each citation, generates textual justification for inclusion
Experimental journal notes Agent takes structured notes during experimentation for write-up

Phase 4: Automated Review (Major Enhancement)

v1 approach: 3 reviewer personas (base, negative, positive) with 3 reflection rounds each.

Nature approach: 5 independent reviews + Area Chair meta-review.

| Dimension | v1 | Nature |
|---|---|---|
| Reviews per paper | 3 | 5 |
| Persona types | Base, negative, positive | 5 independent (no explicit bias) |
| Meta-review | Simple aggregation (median scores) | Area Chair persona synthesizes |
| Review guidelines | Generic conference rubric | Official NeurIPS guidelines |
| Validation | Limited comparison | 1,000+ papers, OpenReview dataset |
| Scores output | Scores + decision | Scores + decision + strengths + weaknesses |
| Replicability | Single pass | 5-run ensemble |
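A minimal sketch of the ensemble-plus-meta-review pattern described above. `review_fn` stands in for an LLM reviewer call, and the 6.0 accept threshold is chosen purely for illustration, not taken from the paper:

```python
import statistics

def run_review_ensemble(paper_text, review_fn, n_reviews=5):
    """Collect n independent reviews, then synthesize an Area-Chair-style
    meta-review. `review_fn` is a placeholder for an LLM call returning a
    dict with a numeric score plus strengths/weaknesses lists."""
    reviews = [review_fn(paper_text, seed=i) for i in range(n_reviews)]
    scores = [r["score"] for r in reviews]
    meta = {
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "decision": "accept" if statistics.mean(scores) >= 6.0 else "reject",
        # The real Area Chair persona synthesizes these; here we just pool them.
        "strengths": [s for r in reviews for s in r["strengths"]],
        "weaknesses": [w for r in reviews for w in r["weaknesses"]],
    }
    return reviews, meta
```

The 5-run ensemble reduces the variance of any single stochastic review, mirroring why the Nature version replaced v1's single pass.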

Supporting Components

| Component | Status | Function |
|---|---|---|
| Semantic Scholar API | Enhanced | Literature search, novelty checking, citation retrieval |
| Web search tools | New in v2 | Broader information access beyond Semantic Scholar |
| LaTeX compiler | Inherited | Manuscript compilation and PDF generation |
| Python runtime | Enhanced | Experiment execution, data analysis, figure generation |
| Experiment Manager | New in v2 | Tree search coordination, checkpoint selection |
| VLM | New in v2 | Figure quality assessment and feedback |

11 Core Mechanisms (Detailed)

Mechanism 1: Agentic Tree Search for Experimentation

The most significant new mechanism is the 4-stage tree search. Unlike v1's linear execution, the tree search enables the system to:

  1. Explore multiple implementation approaches before committing
  2. Recover from dead ends by backtracking to earlier checkpoints
  3. Systematically vary one dimension at a time (stages 2 and 4)
  4. Scale with compute by expanding more nodes

How the tree search works in detail:

At each node, the LLM generates code, runs experiments, and analyzes results. The Experiment Manager then decides whether to:

  - Expand the node (try a variation)
  - Select it as the best checkpoint for the next stage
  - Prune it (abandon unproductive branches)

The selection mechanism at stage boundaries uses the experimental results to identify the best-performing checkpoint. This checkpoint's code and data become the starting point for the next stage, ensuring that subsequent work builds on the strongest foundation.
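A minimal sketch of one stage of this expand/prune/select loop. `propose_fn` and `evaluate_fn` stand in for LLM code generation and experiment execution, and the greedy expand-from-best policy is an illustrative simplification of the Experiment Manager's actual strategy:

```python
def run_stage(base_code, propose_fn, evaluate_fn, budget=8):
    """One tree-search stage: repeatedly expand a child from the current
    best node, prune failed runs, and return the best checkpoint, which
    becomes the starting point for the next stage."""
    frontier = [(evaluate_fn(base_code), base_code)]
    for _ in range(budget):
        _, parent = max(frontier)      # expand from the strongest node
        child = propose_fn(parent)     # stand-in for LLM code variation
        score = evaluate_fn(child)     # stand-in for running the experiment
        if score is None:              # failed run -> pruned branch
            continue
        frontier.append((score, child))
    return max(frontier)[1]            # best checkpoint crosses the boundary
```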

Relationship to evolutionary computation:

| Tree Search Component | Evolutionary Analog |
|---|---|
| Nodes | Individuals in population |
| Node expansion | Mutation (child programs from parent) |
| Stage boundary selection | Elitist selection (best survives) |
| Multiple Stage 1 attempts | Population initialization |
| Stage 3 branching | Population diversity |
| Stage 4 ablation | Fitness landscape analysis |
| Experiment Manager | Strategy adaptation controller |

Mechanism 2: Progressive Archive-Based Ideation

The idea generation phase uses a growing archive inspired by quality-diversity algorithms (MAP-Elites, OMNI):

Archive-Based Idea Generation
==============================

Cycle 1:                    Cycle 2:                    Cycle 3:
┌──────────────┐           ┌──────────────┐           ┌──────────────┐
│ Archive: {}  │           │ Archive:     │           │ Archive:     │
│              │           │ {Idea A,     │           │ {Idea A,     │
│ Generate:    │           │  Idea B}     │           │  Idea B,     │
│ Idea A       │──────────▶│              │──────────▶│  Idea C,     │
│ Idea B       │           │ Generate:    │           │  Idea D}     │
│              │           │ Idea C       │           │              │
│              │           │ Idea D       │           │ Generate:    │
└──────────────┘           │ (informed by │           │ Idea E       │
                           │  A, B)       │           │ (informed by │
                           └──────────────┘           │  A-D)        │
                                                      └──────────────┘

Each new idea is generated in the context of all prior
ideas, enabling progressive refinement and diversification.

This mechanism is directly inspired by Jeff Clune's work on open-ended learning, where an archive of diverse solutions drives continued exploration. The archive acts as a form of curiosity — the system is implicitly rewarded for generating ideas that differ from what's already in the archive.
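The archive cycle above can be sketched as follows. `generate_fn` is a placeholder for the LLM ideation call; the membership check is a stand-in for the real novelty filtering, which goes through Semantic Scholar rather than exact matching:

```python
def ideation_cycles(generate_fn, n_cycles=3, ideas_per_cycle=2):
    """Archive-based ideation: each call to the generator is conditioned
    on every previously archived idea, so later ideas can refine and
    diversify away from earlier ones."""
    archive = []
    for _ in range(n_cycles):
        for _ in range(ideas_per_cycle):
            idea = generate_fn(context=list(archive))
            if idea not in archive:    # crude novelty pressure
                archive.append(idea)   # archive grows monotonically
    return archive
```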

Mechanism 3: Automated Reviewer as Fitness Function

A key insight of the Nature paper is that the Automated Reviewer functions as a fitness function for AI-generated science. By validating that its accuracy matches that of human reviewers, the authors establish that optimizing for the Automated Reviewer's scores is a reasonable proxy for optimizing for actual scientific quality.

This is analogous to the fitness-function design problem in evolutionary computation:

  - The fitness function must accurately capture the optimization objective
  - A misaligned fitness function leads to reward hacking (Goodhart's Law)
  - The Automated Reviewer is validated to be as aligned with true scientific quality as human reviewers are with each other

Scaling implications: If the Automated Reviewer is a valid fitness function, then the scaling law (more compute → better papers) can be interpreted as a compute-quality Pareto frontier, analogous to scaling laws in evolutionary optimization.

Mechanism 4: VLM-Augmented Figure Refinement

The VLM figure feedback loop introduces multimodal reasoning into the pipeline:

  1. Matplotlib generates a figure from experimental data
  2. The VLM receives the rendered figure image
  3. The VLM evaluates: layout, label readability, color accessibility, legend placement, axis scaling
  4. Feedback is provided in natural language
  5. The code-generating LLM modifies the matplotlib code based on VLM feedback
  6. The cycle repeats until the VLM is satisfied or iteration budget is exhausted
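The six steps above form a simple critique-revise loop, sketched below. All three callbacks are placeholders for the real rendering and model calls, and the empty-feedback stopping rule is an assumption about how "the VLM is satisfied" is detected:

```python
def refine_figure(plot_code, render_fn, vlm_critique_fn, revise_fn, max_iters=5):
    """VLM-in-the-loop figure refinement: render the figure, ask the VLM
    for natural-language critique, revise the plotting code, and repeat
    until the critique is empty or the iteration budget runs out."""
    for _ in range(max_iters):
        image = render_fn(plot_code)          # e.g. matplotlib -> PNG
        feedback = vlm_critique_fn(image)     # issues found, or "" if none
        if not feedback:
            break
        plot_code = revise_fn(plot_code, feedback)  # LLM edits the code
    return plot_code
```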

This mechanism addresses common failure modes in v1, where figures had:

  - Overlapping labels and legends
  - Unreadable axis ticks
  - Duplicated figures in the main text and appendix
  - Missing or misleading color coding
  - Poor formatting for publication standards

Mechanism 5: Citation Integration Pipeline

The citation search has been enhanced from v1:

v1: For each concept needing citation, search Semantic Scholar → rank by relevance → insert BibTeX.

Nature: 20-round citation refinement where:

  1. The LLM generates draft text
  2. Identifies claims requiring citations
  3. Searches Semantic Scholar + web for relevant papers
  4. Generates textual justification for each citation's inclusion
  5. Compares found literature against the manuscript
  6. Iterates 20 times to improve citation coverage and accuracy

This more thorough process helps mitigate v1's citation hallucination problem, though the Nature paper acknowledges it does not fully eliminate it.
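The 20-round refinement loop can be sketched as below. `find_claims_fn` and `search_fn` stand in for the LLM claim extraction and the Semantic Scholar/web search; the justification string is a simplified stand-in for the generated textual justification:

```python
def refine_citations(draft, find_claims_fn, search_fn, n_rounds=20):
    """Iterative citation refinement: each round locates still-uncited
    claims, searches for supporting papers, and records a justification
    alongside each citation. Stops early if no uncited claims remain."""
    citations = {}
    for _ in range(n_rounds):
        claims = find_claims_fn(draft, citations)   # claims lacking support
        if not claims:
            break                                   # full coverage reached
        for claim in claims:
            paper = search_fn(claim)                # best-matching paper, or None
            if paper is not None:
                citations[claim] = {
                    "bibtex_key": paper,
                    "justification": f"supports claim: {claim}",
                }
    return citations
```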

12 Programming Language

System Implementation

| Component | Language | Framework |
|---|---|---|
| AI Scientist pipeline | Python | Custom orchestration |
| Template-based code editing | Python | Aider (open-source coding assistant) |
| Template-free code generation | Python | Direct LLM generation |
| Automated Reviewer | Python | LLM API calls |
| Generated experiments | Python | PyTorch, NumPy, matplotlib |
| Paper output | LaTeX | NeurIPS/ICML templates |

Generated Code Characteristics

The template-free mode generates Python code from scratch, introducing new challenges:

| Characteristic | Template-Based | Template-Free |
|---|---|---|
| Code origin | Human template modified by Aider | AI-generated from scratch |
| Bug frequency | Low (human starting point) | Higher (common implementation errors) |
| Library usage | Follows template patterns | Variable, sometimes non-idiomatic |
| Testing | Inherits template tests | No tests (significant gap) |
| Documentation | Template-level docs | Variable quality |

Common Code Generation Failures (Template-Free)

The Nature paper documents several recurring code generation issues:

  1. Incorrect implementations — the code doesn't correctly implement the proposed idea
  2. Import errors — referencing libraries not installed or modules that don't exist
  3. Shape mismatches — tensor dimension errors in PyTorch code
  4. Hardcoded paths — assumptions about directory structure
  5. Missing error handling — crashes on edge cases instead of graceful degradation

These failures are handled by the tree search — failed code attempts become pruned branches, and the search continues from working checkpoints.
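The failure-to-pruning conversion can be sketched as a guarded evaluation step. The `result` variable convention is an assumption for illustration, and running generated code via bare `exec` is only a sketch — the real system needs sandboxed execution:

```python
def evaluate_candidate(code):
    """Run a generated code candidate; any exception (import error, shape
    mismatch, crash on an edge case, etc.) returns None, which the tree
    search treats as a pruned branch rather than a fatal error."""
    scope = {}
    try:
        exec(code, scope)            # stand-in for running an experiment
        return scope.get("result")   # assumed convention: code sets `result`
    except Exception:
        return None                  # failed attempt -> prune this branch
```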

13 Memory Management

Memory Architecture

The Nature AI Scientist operates with several memory layers:

Memory Architecture (v2 / Nature)
==================================

┌──────────────────────────────────────────────────────────────┐
│  CONTEXT WINDOW (per-LLM-call)                               │
│  • Current phase context (idea, code, results)               │
│  • Experimental journal notes from prior phases              │
│  • Relevant archive entries                                  │
│  • Recent conversation history                               │
│  Limited by model context length                             │
├──────────────────────────────────────────────────────────────┤
│  IDEA ARCHIVE (persistent across idea generation cycles)     │
│  • All previously generated ideas                            │
│  • Their experiment plans and outcomes                       │
│  • Enables progressive exploration                           │
│  Grows monotonically; never pruned                           │
├──────────────────────────────────────────────────────────────┤
│  TREE SEARCH STATE (per-paper)                               │
│  • Node states (code checkpoints + results)                  │
│  • Branch decisions and pruning history                      │
│  • Best checkpoint at each stage boundary                    │
│  Managed by Experiment Manager Agent                         │
├──────────────────────────────────────────────────────────────┤
│  EXPERIMENTAL JOURNAL (per-paper)                            │
│  • Structured notes taken after each experiment              │
│  • Observations, hypotheses, next steps                      │
│  • Used during paper write-up phase                          │
│  Explicit prompt: "take notes in the style of an             │
│  experimental journal for future planning and write-up"      │
├──────────────────────────────────────────────────────────────┤
│  EXTERNAL KNOWLEDGE (accessed on demand)                     │
│  • Semantic Scholar API (literature search, citations)       │
│  • Web search (broader information access)                   │
│  • Not cached between sessions                               │
└──────────────────────────────────────────────────────────────┘
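The per-paper layers above can be summarized as a small state container. The field and method names here are illustrative, not the system's actual identifiers:

```python
from dataclasses import dataclass, field

@dataclass
class PaperMemory:
    """Per-paper state mirroring the memory layers above: the idea archive
    (session-persistent), tree-search checkpoints, and the experimental
    journal consumed later by the write-up phase."""
    idea_archive: list = field(default_factory=list)   # grows across cycles
    tree_nodes: dict = field(default_factory=dict)     # checkpoint -> results
    journal: list = field(default_factory=list)        # structured notes

    def log(self, observation, hypothesis, next_step):
        """Append one structured journal entry after an experiment."""
        self.journal.append({"observation": observation,
                             "hypothesis": hypothesis,
                             "next": next_step})
```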

Key Memory Improvements Over v1

| Memory Component | v1 | Nature |
|---|---|---|
| Idea archive | Present but limited | Progressive archive with explicit growth |
| Experiment state | Linear (sequential steps) | Tree structure with checkpoints |
| Journal notes | Implicit (figure notes only) | Explicit experimental journal |
| Citation memory | Per-section | 20-round iterative refinement |
| Cross-paper memory | None | Archive carries across idea generation cycles |

Memory Limitations

  1. No cross-session persistence: Each complete pipeline run starts from scratch (except the idea archive within a single session)
  2. No learned patterns: The system doesn't learn "what makes a good paper" from its own prior successes and failures
  3. Context window constraints: Complex experiments may exceed context limits, requiring summarization that loses detail
  4. No negative result memory: Failed approaches are not systematically recorded for future avoidance

14 Continued Learning

Within-Pipeline Learning

The AI Scientist exhibits learning within a single pipeline run:

| Learning Signal | Mechanism | Persistence |
|---|---|---|
| Idea novelty feedback | Semantic Scholar API filters duplicate ideas | Session-level |
| Experiment results | Tree search uses results to guide exploration | Paper-level |
| Code debugging | Failed code attempts inform subsequent attempts | Stage-level |
| Review feedback (v1) | Iterative refinement loop incorporates review | Cross-paper (v1 only) |
| Figure quality feedback | VLM loop improves figures within a paper | Paper-level |

The Scaling Law as Implicit Learning

The most significant "learning" in the AI Scientist system happens at the foundation model level, not the system level. The scaling law demonstrates that improvements to the underlying LLM automatically improve the AI Scientist's output. This is a form of transfer learning — the foundation model's general capabilities (reasoning, coding, writing, analysis) transfer directly to the specialized task of scientific research.

| Model Generation | Paper Quality | Key Improvements |
|---|---|---|
| Early (GPT-3.5 era) | Score ~2-3 | Basic structure, poor execution |
| Mid (GPT-4 era) | Score ~4-5 | Better ideas, more rigorous experiments |
| Recent (Claude Opus, Gemini Pro) | Score ~5-6 | Workshop-quality, coherent arguments |
| Projected future | Score ~7+ | Main conference quality (projected) |

Cross-Paper Learning: The Open-Ended Loop

The v1 system's iterative refinement loop — where reviewer feedback feeds back into the ideation stage — represents the most ambitious learning mechanism:

Open-Ended Research Loop
=========================

   Paper 1                Paper 2                Paper 3
┌──────────┐          ┌──────────┐          ┌──────────┐
│ Idea     │          │ Idea     │          │ Idea     │
│ (novel)  │          │ (builds  │          │ (builds  │
│          │          │  on P1)  │          │  on P1+2)│
├──────────┤          ├──────────┤          ├──────────┤
│ Exp.     │          │ Exp.     │          │ Exp.     │
├──────────┤          ├──────────┤          ├──────────┤
│ Write-up │          │ Write-up │          │ Write-up │
├──────────┤          ├──────────┤          ├──────────┤
│ Review   │──feedback│ Review   │──feedback│ Review   │
│ Score: 4 │──────────│ Score: 5 │──────────│ Score: 6 │
└──────────┘          └──────────┘          └──────────┘
                                                  │
                                                  ▼
                                        Workshop-quality
                                        paper achieved

Each paper's review feedback informs subsequent idea
generation, creating a progressive improvement trajectory.

What's Missing: Meta-Learning

The AI Scientist does not perform meta-learning — it doesn't learn how to do research better from its own experience. Several potential meta-learning signals are currently unused:

  1. Review score prediction: Learning which types of ideas tend to receive higher scores
  2. Implementation pattern learning: Recognizing which code patterns lead to successful experiments
  3. Writing quality patterns: Learning which paper structures and argumentation styles receive better reviews
  4. Failure mode avoidance: Systematically avoiding previously observed failure modes (hallucinated citations, duplicate figures, etc.)

These could be implemented via fine-tuning, retrieval-augmented generation, or explicit strategy databases, but are not part of the current system.

15 Applications

Current Applications

The AI Scientist's current applications are in machine learning research automation:

| Application | Maturity | Evidence |
|---|---|---|
| ML research exploration | Moderate | Workshop-quality papers demonstrated |
| Literature survey augmentation | High | Semantic Scholar integration works well |
| Experimental idea generation | Moderate | Ideas pass novelty checks |
| Paper drafting assistance | Moderate | Full manuscripts generated |
| Automated peer review | High | Validated against human reviewers |
| Research brainstorming | High | Archive-based idea generation |

Future Domains (Projected)

The Nature paper outlines expansion plans:

"At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made)."

| Domain | Feasibility | Required Adaptations |
|---|---|---|
| Computational ML | ✅ Current | — |
| Computational biology | ⚠️ Medium-term | Molecular simulation tools, bio-specific templates |
| Automated chemistry | ⚠️ Medium-term | Lab automation interfaces, safety constraints |
| Materials science | ⚠️ Medium-term | Simulation software integration |
| Robotics | ⚠️ Medium-term | Simulation environments |
| Theoretical mathematics | ❌ Longer-term | Proof verification (Lean4, Coq) |
| Social science | ❌ Longer-term | Data collection, IRB constraints |
| Physical experiments | ❌ Longer-term | Hardware interfaces, safety |

Ethical and Societal Implications

The Nature paper and its companion editorials raise significant concerns:

Risks identified:

  1. Overwhelming peer review: Mass-generated papers could flood review systems
  2. Credential inflation: Using AI papers to inflate publication records
  3. Idea appropriation: AI may recombine others' ideas without proper attribution
  4. Job displacement: Potential impact on early-career research positions
  5. Noise in scientific literature: Low-quality AI papers polluting the knowledge base
  6. Unethical experiments: AI systems conducting experiments without ethical oversight

Mitigations implemented:

| Risk | Mitigation |
|---|---|
| Deceptive submission | Pre-registered withdrawal protocol |
| Lack of consent | ICLR leadership + workshop organizer consent |
| Ethical oversight | UBC IRB approval (H24-02652) |
| Disclosure | All AI papers watermarked |
| Precedent-setting | Withdrew accepted paper to avoid normalizing undisclosed AI submissions |

Relationship to Evolutionary AI Systems

The AI Scientist's relationship to the evolutionary AI systems surveyed in this collection is primarily complementary rather than competitive:

| Evolutionary System | AI Scientist's Relationship |
|---|---|
| AlphaEvolve | Uses evolutionary framework for algorithm discovery; AI Scientist could write papers about AlphaEvolve discoveries |
| FunSearch | AI Scientist could automate the write-up of FunSearch discoveries |
| ShinkaEvolve | Tree search in AI Scientist v2 has structural parallels to evolutionary search |
| AutoEvolver | Both demonstrate emergent search behaviors from LLM agents |

The evolutionary strategy classification for the AI Scientist is justified by:

  1. The agentic tree search is structurally analogous to evolutionary search with selection pressure
  2. The archive-based ideation mirrors quality-diversity archives (MAP-Elites)
  3. The iterative refinement loop implements an evolutionary improvement cycle
  4. The scaling laws parallel compute-performance scaling in evolutionary algorithms
  5. The Automated Reviewer functions as a fitness function

Classification: EVOLVE

Both the architectural mechanisms (tree search, archive-based exploration, fitness-function-driven selection) and the broader paradigm (iterative improvement of AI-generated artifacts through automated evaluation) place the AI Scientist firmly in the evolutionary strategy category. The Nature publication strengthens this classification by demonstrating scaling laws that parallel evolutionary optimization dynamics — more compute and better operators (models) yield better solutions.

Significance Assessment

The Nature publication represents a landmark in AI research automation:

Impact Level: High. The first demonstration of fully AI-generated science passing human peer review, combined with validated scaling laws suggesting rapid future improvement, establishes the AI Scientist as a turning point. While current quality remains below main-conference standards, the trajectory — supported by both model scaling and compute scaling — suggests that conference-quality AI science is within reach on a 2-3 year horizon.

Limitation Caveat: The 70% workshop acceptance rate, the accepted paper's negative result aligning with the ICBINB workshop's theme, and the 33% success rate (1 of 3 submissions accepted) all temper the headline claim. Main-conference acceptance remains an unmet challenge.

Limitations Specific to the Nature Paper

  1. Selective reporting: Only 3 of many generated papers were submitted; manual filtering introduces human selection bias
  2. Workshop vs. main conference: Workshop acceptance (70% rate) is not equivalent to main conference acceptance (32% rate)
  3. Negative result advantage: The accepted paper reported a negative result, which aligned with the ICBINB workshop's specific focus — this may not generalize
  4. Model access dependency: Results depend on commercial API access to frontier models; full reproducibility requires matching model capabilities
  5. Limited domain: Only ML research is demonstrated; claims about broader scientific applicability are aspirational
  6. No longitudinal study: The scaling laws are cross-sectional (comparing different models at one time point), not longitudinal (tracking the same system over time)
  7. Automated Reviewer limitations: The reviewer is validated on AI/ML papers only; it may not generalize to other scientific domains

References

  1. Lu, C., Lu, C., Yamada, Y., Lange, R.T., Hu, S., Foerster, J., Clune, J., and Ha, D. "Towards End-to-End Automation of AI Research." Nature, s41586-026-10265-5, 2026.
  2. Yamada, Y., Lange, R.T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., and Ha, D. "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search." arXiv:2504.08066, April 2025.
  3. Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., and Ha, D. "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292, August 2024.
  4. NeurIPS 2021 Consistency Experiment. "The NeurIPS 2021 Consistency Experiment." NeurIPS Blog, December 2021.
  5. Mouret, J.-B. and Clune, J. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.
  6. Clune, J. "AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence." arXiv:1905.10985, 2019.
  7. Sakana AI. "AI Scientist v1 Repository." github.com/SakanaAI/AI-Scientist.
  8. Sakana AI. "AI Scientist v2 Repository." github.com/SakanaAI/AI-Scientist-v2.
  9. Aider. Open-source AI coding assistant. aider.chat.
  10. Gauthier, J. et al. "OpenReview: A Scientific Review Platform." 2014.
  11. Sakana AI. "The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature." Blog Post, 2026. sakana.ai/ai-scientist-nature.
  12. Anthropic. "Claude." 2024-2026.
  13. Google DeepMind. "Gemini." 2024-2026.
