AI Scientist — Nature Publication
Towards End-to-End Automation of AI Research
Organizations: Sakana AI, University of British Columbia, Vector Institute, University of Oxford
Published: 2026 (Nature)
Type: Journal Paper (Nature)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Scope Note: This document covers the Nature publication (s41586-026-10265-5) and the AI Scientist v2 system (arXiv:2504.08066). For the original AI Scientist v1 system (arXiv:2408.06292, August 2024), see The AI Scientist. This report focuses on what is new relative to v1: template-free operation, agentic tree search, the peer review milestone, automated reviewer validation, and scaling laws of AI science.
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Nature Paper Title: Towards End-to-End Automation of AI Research
Nature DOI: 10.1038/s41586-026-10265-5
AI Scientist v2 Preprint: arXiv:2504.08066 — "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search" (April 10, 2025)
Original Preprint (v1): arXiv:2408.06292 — "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" (August 2024)
Open Access: Yes (Nature open-access publication)
Repositories:
- AI Scientist v1: github.com/SakanaAI/AI-Scientist
- AI Scientist v2: github.com/SakanaAI/AI-Scientist-v2
Lineage: This Nature publication consolidates and extends two prior releases — the original AI Scientist (August 2024) and AI Scientist v2 (April 2025) — adding new scaling results, the complete peer review experiment, and an Automated Reviewer validation study.
Publication Timeline
| Date | Event | Reference |
|---|---|---|
| August 2024 | AI Scientist v1 preprint released | arXiv:2408.06292 |
| August 2024 | Open-source release of v1 code | GitHub: SakanaAI/AI-Scientist |
| September 2024 | ICLR 2025 ICBINB workshop submission | 3 AI-generated papers submitted |
| January 2025 | Peer review results: 1 paper accepted | Scores: 6, 7, 6 (avg 6.33) |
| February 2025 | Paper withdrawn per pre-established protocol | Responsible AI commitment |
| April 2025 | AI Scientist v2 preprint released | arXiv:2504.08066 |
| April 2025 | Open-source release of v2 code | GitHub: SakanaAI/AI-Scientist-v2 |
| 2026 | Nature paper published | s41586-026-10265-5 |
2 Authors and Team
Nature Paper Authors
The Nature paper represents a collaboration across four institutions:
| Author | Affiliation | Role |
|---|---|---|
| Chris Lu | Sakana AI | Co-lead, original AI Scientist creator |
| Cong Lu | Sakana AI | Co-lead, original AI Scientist creator |
| Yutaro Yamada | Sakana AI | v2 lead, agentic tree search |
| Robert Tjarko Lange | Sakana AI | v2 development, evolutionary methods expert |
| Shengran Hu | Sakana AI | v2 development |
| David Ha | Sakana AI | CTO, strategic direction |
| Jakob Foerster | University of Oxford | Faculty collaborator |
| Jeff Clune | UBC / Vector Institute / CIFAR | Faculty collaborator, open-endedness expert |
Team Context
The author list bridges multiple research communities:
- Sakana AI (Tokyo) — an AI research company founded by David Ha (former Google Brain) and Llion Jones (Transformer co-author). Sakana focuses on nature-inspired AI systems, with The AI Scientist as a flagship project.
- Jeff Clune (UBC / Vector Institute) — a pioneer in open-ended learning, quality-diversity algorithms, and AI-generating algorithms (OMNI, MAP-Elites). His research group has long argued that open-ended search is key to artificial general intelligence. His influence is visible in the system's emphasis on open-ended discovery and archive-based idea generation.
- Jakob Foerster (Oxford) — expert in multi-agent systems and game theory. His involvement connects the project to multi-agent research methodology.
v1 → v2 → Nature: Team Evolution
The original v1 was primarily the work of Chris Lu and Cong Lu at Sakana AI. The v2 system added Yutaro Yamada as lead for the agentic tree search methodology, and Robert Tjarko Lange, whose evolutionary methods expertise (EvoJAX, ShinkaEvolve) shaped the tree search design. The Nature paper synthesizes contributions from both phases and adds the scaling analysis and peer review experiment as new material.
3 Core Contribution
What's New in the Nature Paper: The Nature publication is not simply a republication of v1 or v2. It consolidates both systems, adds substantial new analysis, and presents the first demonstration of a fully AI-generated paper passing peer review at a top-tier ML conference workshop. The three core contributions beyond v1 are: (1) template-free operation via agentic tree search, (2) validated Automated Reviewer matching human reviewer accuracy, and (3) scaling laws showing that better models → better papers.
Delta from the Original AI Scientist (v1)
| Dimension | AI Scientist v1 (Aug 2024) | Nature Paper / v2 (2025-2026) |
|---|---|---|
| Template dependency | Required human-provided code templates | Template-free mode generates code from scratch |
| Experimentation | Linear execution of experiment plan | Agentic tree search with 4 stages |
| Code generation | Used Aider for code modifications | Direct LLM-powered tree search (no Aider) |
| Peer review test | Hypothetical ("could approach acceptance") | Actual submission + acceptance at ICLR workshop |
| Automated Reviewer | Described but not rigorously validated | Validated against ICLR OpenReview at scale |
| Scaling analysis | Not included | Paper quality scales with model capability |
| Compute scaling | Not studied | Paper quality scales with compute budget |
| Figure quality | LLM-generated, no visual feedback | VLM feedback loop for figure refinement |
| Idea generation | Simple prompting + novelty check | Archive-based progressive idea generation |
| Domain scope | 3 templates (NanoGPT, Diffusion, Grokking) | Any ML topic (template-free) |
| IRB approval | Not obtained | UBC IRB H24-02652 approved |
The Three Pillars of the Nature Contribution
Pillar 1: Template-Free Scientific Discovery
The original AI Scientist required a human-prepared code template as a starting point. This constraint limited the system to predefined research domains and required non-trivial human setup. The v2/Nature system introduces a template-free mode where the AI Scientist receives only a broad research direction (e.g., "investigating deep learning limitations") and generates its own code, experiments, and papers without any starting scaffold.
Pillar 2: The Peer Review Turing Test
The authors explicitly frame the peer review experiment as an "AI scientist Turing test" — a test of whether AI-generated science is indistinguishable from human science when evaluated by standard scientific processes. One paper passed this test, albeit at a workshop with a 70% acceptance rate rather than a main conference (32% acceptance rate).
Pillar 3: Scaling Laws of AI Science
Perhaps the most significant finding for the field's future trajectory: paper quality (as measured by the Automated Reviewer) scales predictably with both the capability of the underlying LLM and the compute budget allocated to experimentation. This implies that future improvements to foundation models will automatically improve AI scientific output, without requiring changes to the AI Scientist system itself.
4 Supported Solutions
Output Artifacts
The AI Scientist produces complete scientific manuscripts as its primary output. The Nature paper evaluates these in two operational modes:
| Mode | Input | Output | Code Source | Experiment Structure |
|---|---|---|---|---|
| Template-based | Human-provided code template + broad topic | Full paper on template topic | Modified template code | Linear experiment plan |
| Template-free | Broad research direction only | Full paper on novel topic | AI-generated from scratch | 4-stage agentic tree search |
Paper Output Quality Spectrum
Based on the Automated Reviewer and human evaluation across the Nature paper's experiments:
| Quality Level | % of Output | Description | Comparison |
|---|---|---|---|
| Workshop-acceptable | ~15-25% (recent models) | Would pass workshop peer review | Top 45% of workshop submissions |
| Borderline | ~30-40% | Approaches workshop standards | Interesting ideas, execution issues |
| Below threshold | ~35-50% | Significant quality gaps | Naive ideas, implementation bugs, hallucinations |
| Main conference quality | ~0% | None yet meet main conference bar | Acknowledged limitation |
Research Domains Covered
Template-based mode (inherited from v1):
| Template | Domain | Base Model | GPU Time |
|---|---|---|---|
| NanoGPT | Language modeling | Karpathy's NanoGPT | ~30 min |
| 2D Diffusion | Generative modeling | Score-based diffusion | ~15 min |
| Grokking | Generalization theory | Modular arithmetic | ~10 min |
| 7 community templates | Various ML domains | Various | Various |
Template-free mode (new in v2/Nature):
| Research Direction Example | Generated Topic | Template Created |
|---|---|---|
| "Deep learning limitations" | Negative results in optimization | AI-generated from scratch |
| "Novel training strategies" | Curriculum learning dynamics | AI-generated from scratch |
| "Representation learning" | Feature emergence in transformers | AI-generated from scratch |
The ICLR ICBINB Submission
Three fully AI-generated papers were submitted to the ICLR 2025 I Can't Believe It's Not Better (ICBINB) workshop:
| Paper | Topic | Reviewer Scores | Average | Outcome |
|---|---|---|---|---|
| Paper 1 | Negative result in deep learning | 6, 7, 6 | 6.33 | Would have been accepted (withdrawn) |
| Paper 2 | [Not disclosed] | Below threshold | Below threshold | Not accepted |
| Paper 3 | [Not disclosed] | Below threshold | Below threshold | Not accepted |
Key context:
- Workshop acceptance rate: 70% (vs. 32% for ICLR main conference)
- The accepted paper's average score (6.33) placed it in the top 45% of submissions
- The paper scored higher than 55% of human-authored papers at the workshop
- Reviewers were informed some submissions were AI-generated but not which ones (blind)
- The accepted paper reported a negative result, aligning with the workshop's focus
5 LLM Integration
Foundation Model as Core Engine
The AI Scientist uses foundation models as its central reasoning engine across all phases. The Nature paper evaluates performance across a range of models, revealing the critical finding that model quality directly determines paper quality.
Model Evaluation Across Generations
The Nature paper's most impactful finding is the scaling law: paper quality improves with model release date.
Paper Quality vs. Model Generation
====================================
Automated
Reviewer
Score
7 │ ●
│ ╱
6 │ ●╱ Newer models
│ ╱ (2025-2026)
5 │ ●───●╱
│ ╱
4 │ ●──●╱
│ ╱ Older models
3 │ ●──●╱ (2023-2024)
│ ╱
2 │ ●──╱
│
1 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──────── Model Release Date
│ │ │ │ │ │ │ │ │ │
GPT Claude Gemini Claude (Latest
3.5 Sonnet 1.5 Pro Opus models)
3.5
Correlation: P < 0.00001 (statistically significant)
This correlation implies that The AI Scientist is a general-purpose amplifier of model capability: as models improve, the quality of AI-generated science improves correspondingly, without modifications to the system itself.
Dual Operating Modes
| Mode | Code Generation | LLM Role |
|---|---|---|
| Template-based | Uses Aider (open-source coding assistant) | Generates ideas, modifies template code via Aider, writes paper |
| Template-free | Direct LLM code generation (no Aider) | Generates ideas, writes code from scratch, manages tree search, writes paper |
The shift from Aider to direct LLM code generation in template-free mode is significant. Aider provides structured code editing (diff-based patches, file management), but constrains the system to modifying existing code. Direct generation enables the LLM to create entirely new codebases, at the cost of more implementation bugs.
Vision-Language Model (VLM) Integration (New in v2)
A notable addition in v2 is the VLM feedback loop for figure quality:
VLM Figure Refinement Loop (New in v2)
=======================================
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Generate │ │ VLM │ │ Refine │
│ Figure │────▶│ Evaluate │────▶│ Figure │
│ (matplotlib)│ │ (visual │ │ (adjust │
│ │ │ quality) │ │ layout, │
│ │ │ │ │ colors, │
│ │◀────│ Feedback: │ │ labels) │
│ │ │ "axis labels│ │ │
│ │ │ too small, │ └──────────────┘
│ │ │ legend │
│ │ │ overlaps" │
│ │ └──────────────┘
│ │
└──────────────┘
Contribution: Addresses v1's weakness of low-quality figures
with duplicates, missing labels, and poor formatting.
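The loop above can be written down as a simple render-critique-revise cycle. The following is a hypothetical sketch, not the actual v2 implementation: `vlm_critique` and `revise_plot_code` are placeholder stand-ins for the real VLM and LLM calls, and the scoring threshold is an assumption.

```python
# Hypothetical sketch of the v2 VLM figure-refinement loop.
# `vlm_critique` and `revise_plot_code` stand in for real model calls.

def vlm_critique(figure_path: str) -> dict:
    """Placeholder for a VLM call that scores a rendered figure."""
    # A real implementation would send the image to a vision-language
    # model and parse its structured feedback.
    return {"score": 4, "issues": ["axis labels too small", "legend overlaps data"]}

def revise_plot_code(plot_code: str, issues: list) -> str:
    """Placeholder for an LLM call that patches the plotting code."""
    return plot_code + "\n# adjusted: " + "; ".join(issues)

def refine_figure(plot_code: str, max_rounds: int = 3, target_score: int = 8) -> str:
    """Iteratively render, critique, and revise until quality is acceptable."""
    for _ in range(max_rounds):
        figure_path = "figure.png"  # render(plot_code) in a real system
        feedback = vlm_critique(figure_path)
        if feedback["score"] >= target_score:
            break
        plot_code = revise_plot_code(plot_code, feedback["issues"])
    return plot_code
```

The key design point is the bounded loop: refinement stops either when the VLM is satisfied or after a fixed budget of rounds, so a hard-to-please critic cannot stall the pipeline.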
Multi-Model Architecture
The Nature paper reveals that different phases can use different models:
| Phase | Recommended Model | Reasoning |
|---|---|---|
| Idea generation | Strongest available | Creative ideation benefits from frontier capability |
| Code generation | Strong coder | Implementation correctness is critical |
| Experiment execution | Agent-capable model | Needs tool use, file management |
| Paper writing | Strongest available | Long-form coherent academic writing |
| Automated Review | Ensembled (5 reviews) | Ensemble reduces individual model bias |
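Such phase-to-model routing can be expressed as a small lookup table. This is an illustrative configuration sketch; the model names are placeholders, not the paper's actual choices.

```python
# Illustrative phase-to-model routing; names are placeholders, not the
# configuration used in the Nature paper.
PHASE_MODELS = {
    "idea_generation": "frontier-model",
    "code_generation": "strong-coder-model",
    "experiment_execution": "agentic-model",
    "paper_writing": "frontier-model",
    "automated_review": "ensemble-of-5",
}

def model_for(phase: str) -> str:
    """Look up which model class a pipeline phase should call."""
    try:
        return PHASE_MODELS[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase}")
```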
6 Key Results
Result 1: First AI-Generated Paper to Pass Peer Review
The headline achievement: a fully AI-generated paper was accepted via the peer review process at a workshop of a top-tier ML conference (ICLR 2025 ICBINB).
Statistical Context:
| Metric | Value |
|---|---|
| Workshop acceptance rate | 70% |
| Main conference acceptance rate | 32% |
| Total workshop submissions | 43 |
| AI-generated submissions | 3 |
| AI submissions accepted | 1 |
| Accepted paper's scores | 6 (weak accept), 7 (accept), 6 (weak accept) |
| Average score | 6.33 |
| Human acceptance threshold | 6.0 (average) |
| Paper's percentile among submissions | Top 45% |
| Papers it scored higher than | 55% of human-authored papers |
Important caveats the authors acknowledge:
- Workshop papers have a substantially lower bar than main conference papers
- Only 1 of 3 submissions was accepted (33% success rate, below workshop average)
- The accepted paper reported a negative result, which aligned with the workshop's specific focus
- Manual filtering of candidate papers was performed before submission (selecting the most promising from a larger pool)
- The paper was withdrawn after acceptance per the pre-established protocol
Result 2: Automated Reviewer Matches Human Reviewers
The Automated Reviewer was rigorously validated against the OpenReview dataset of human decisions:
| Metric | Automated Reviewer | Human Reviewers (NeurIPS 2021) |
|---|---|---|
| Balanced Accuracy (pre-cutoff) | 69% | Comparable (NeurIPS consistency study) |
| Balanced Accuracy (post-cutoff) | 66% | Comparable |
| F1 Score | Exceeds inter-human agreement | Baseline |
| Statistical significance | P < 0.001 (bootstrap test) | — |
| Data contamination effect | Minimal (69% → 66%) | N/A |
Automated Reviewer Architecture:
Automated Reviewer Pipeline (Nature)
=====================================
Paper (LaTeX/PDF)
│
▼
┌─────────────────────────────────────────────┐
│ FIVE INDEPENDENT REVIEWS │
│ │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ R1 │ │ R2 │ │ R3 │ │ R4 │ │ R5 │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │Score│ │Score│ │Score│ │Score│ │Score│ │
│ │S/W │ │S/W │ │S/W │ │S/W │ │S/W │ │
│ │Dec. │ │Dec. │ │Dec. │ │Dec. │ │Dec. │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │ │
│ └───────┴───────┼───────┴───────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ META-REVIEW │ │
│ │ (Area Chair │ │
│ │ persona) │ │
│ │ │ │
│ │ Synthesizes 5 │ │
│ │ reviews into │ │
│ │ final decision │ │
│ └────────┬────────┘ │
│ │ │
└────────────────────┼──────────────────────┘
▼
┌────────────────┐
│ Final Decision │
│ + Scores │
│ + Strengths │
│ + Weaknesses │
└────────────────┘
Based on NeurIPS review guidelines.
5-run ensemble for replicability.
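A toy numeric aggregation mirroring the five-review-plus-meta-review shape is sketched below. Note this is an assumption-laden simplification: the real meta-review is an LLM "Area Chair" persona that synthesizes text, not a fixed scoring rule, and the 6.0 threshold is borrowed from the acceptance-threshold table elsewhere in this report.

```python
import statistics

# Toy aggregation mirroring the 5-review + meta-review structure.
# The real meta-review is an LLM "Area Chair" persona, not a numeric rule.

def meta_review(reviews: list, threshold: float = 6.0) -> dict:
    """Collapse five independent reviews into one decision with evidence."""
    scores = [r["score"] for r in reviews]
    mean_score = statistics.mean(scores)
    return {
        "mean_score": mean_score,
        "decision": "accept" if mean_score >= threshold else "reject",
        # The meta-review carries forward the strengths/weaknesses, not just scores.
        "strengths": [s for r in reviews for s in r["strengths"]],
        "weaknesses": [w for r in reviews for w in r["weaknesses"]],
    }
```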
Key differences from v1 reviewer:
| Aspect | v1 Reviewer | Nature Reviewer |
|---|---|---|
| Number of reviews | 3 personas (base, negative, positive) | 5 independent reviews + meta-review |
| Validation | Described, limited testing | Rigorous validation against 1,000+ ICLR papers |
| Guideline basis | Conference review rubric | Official NeurIPS guidelines |
| Area Chair role | Simple aggregation | Explicit Area Chair persona for meta-review |
| Post-cutoff testing | Not performed | Tested on 2025 papers (after training cutoff) |
| Statistical rigor | Not reported | P-values, bootstrap CIs, z-tests |
Result 3: Scaling Laws of AI Science
Two scaling relationships are demonstrated:
Scaling Law 1: Model Quality → Paper Quality
Papers generated by newer, more capable models receive higher Automated Reviewer scores. The correlation is statistically significant (P < 0.00001). This is tested across model generations from GPT-3.5 through the latest Claude and Gemini models.
Implication: No system-level changes are needed to improve output. Simply using a better foundation model produces better science.
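The reported statistic is a standard Pearson correlation between release date and reviewer score. A self-contained sketch with illustrative data (not the paper's measurements) follows.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative only: model release year vs. mean Automated Reviewer score.
years  = [2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5]
scores = [2.8, 3.4, 4.1, 4.9, 5.6, 6.3]
```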
Scaling Law 2: Compute Budget → Paper Quality
Increasing the number of experimental nodes in the agentic tree search improves paper quality. This suggests that test-time compute scaling — a central trend in modern AI — applies to scientific discovery as well.
Compute Scaling: Tree Search Depth → Paper Quality
===================================================
Score │
7 │ ●
│ ╱
6 │ ●──●╱
│ ╱
5 │ ●───●╱
│ ╱
4 │ ●───●╱
│ ╱
3 │ ●──●╱
│
2 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──── Compute Budget
1 2 4 8 16 32 64 128 256 (tree nodes)
Each additional doubling of compute budget yields
diminishing but consistent quality improvements.
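The "diminishing but consistent" pattern is exactly what a log-linear model predicts: quality grows by a fixed increment per doubling of tree nodes. The coefficients below are assumed for illustration, not the paper's fitted values.

```python
import math

def projected_quality(nodes: int, base: float = 3.0,
                      gain_per_doubling: float = 0.45) -> float:
    """Illustrative log-linear model: each doubling of tree nodes adds a
    fixed quality increment (diminishing returns per node, constant per
    doubling)."""
    return base + gain_per_doubling * math.log2(nodes)

# Each doubling of the budget yields the same increment:
deltas = [projected_quality(2 * n) - projected_quality(n) for n in (4, 16, 64)]
```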
Combined implication: Both scaling axes — model capability and inference-time compute — are on exponentially improving trajectories. If the trend holds, future versions of The AI Scientist will produce substantially better science with both better models and more efficient compute.
Result 4: Template-Free Operation
The template-free mode represents a qualitative capability expansion:
| Capability | Template-Based | Template-Free |
|---|---|---|
| Requires human code setup | Yes | No |
| Research domain | Fixed by template | Open-ended |
| Code origin | Modified human code | AI-generated from scratch |
| Experimentation structure | Linear plan | 4-stage tree search |
| Code quality | Higher (human starting point) | Lower (more bugs) |
| Idea novelty | Constrained by template | Broader exploration |
| Paper diversity | Limited to template domain | Diverse topics |
7 Reproducibility
Reproducibility Framework
The Nature paper takes reproducibility significantly more seriously than v1, partly driven by Nature's publication standards:
| Component | Reproducibility Status | Evidence |
|---|---|---|
| System code | ✅ Open-source (v1 + v2) | GitHub repositories |
| Automated Reviewer | ✅ Open-source + validated | Tested against OpenReview |
| Generated papers | ✅ Available in supplementary | Full manuscripts in appendix |
| Peer review experiment | ⚠️ Process documented | IRB approved, organizer consent |
| Model weights | ❌ Commercial models | API access required |
| Exact paper regeneration | ❌ Stochastic process | Different runs produce different papers |
| Scaling curves | ⚠️ Aggregated statistics | Mean ± standard error reported |
IRB and Ethical Approval
A significant new element is the formal ethical framework:
- IRB: University of British Columbia IRB approval H24-02652
- Conference consent: ICLR 2025 leadership and ICBINB workshop organizers explicitly agreed
- Pre-registration: Decision to withdraw accepted papers was made before submission
- Disclosure: Reviewers were informed some submissions were AI-generated (blind — they didn't know which ones)
- Watermarking: All AI-generated papers were watermarked as AI-generated
Statistical Methodology
The Nature paper employs rigorous statistical methods:
| Analysis | Statistical Method | Result |
|---|---|---|
| Model scaling correlation | Pearson correlation + significance test | P < 0.00001 |
| Automated Reviewer accuracy | Balanced accuracy + bootstrapped 95% CI | 5,000 bootstrap replicates |
| Human vs. automated agreement | Two-sample z-test | P = 0.319 (pre-cutoff), P = 0.921 (post-cutoff) |
| F1 score comparison | Non-parametric bootstrap test | Automated outperformance P < 0.001 |
| Data contamination | Pre/post-cutoff comparison | 69% → 66% (minimal effect) |
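The balanced-accuracy and bootstrap machinery used here is standard. A self-contained sketch, with illustrative data rather than the paper's OpenReview labels, shows how such a validation might be computed:

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (accept/reject), robust to class imbalance."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def bootstrap_ci(y_true, y_pred, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, mirroring the paper's 5,000-replicate setup."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(balanced_accuracy([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```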
8 Compute and API Costs
Cost Structure (Nature Paper)
The Nature paper does not report exact costs per paper, but we can estimate from the v1 analysis and the scaling experiments:
Template-based mode (inherited from v1):
| Stage | Estimated Cost | % of Total |
|---|---|---|
| Idea Generation | ~$1.50 | ~9% |
| Experimentation | ~$3.00 | ~18% |
| Paper Write-up | ~$7.50 | ~44% |
| Peer Review (5-review ensemble) | ~$5.00 | ~29% |
| Total (template-based) | ~$17 | 100% |
Template-free mode (new cost profile):
| Stage | Estimated Cost | Notes |
|---|---|---|
| Idea Generation + Code Generation | ~$5-10 | Generating code from scratch is more expensive |
| Agentic Tree Search (4 stages) | ~$20-50 | Scales with tree depth; main cost driver |
| Paper Write-up + VLM Figure Refinement | ~$10-15 | VLM feedback adds cost |
| Automated Review (5 reviews + meta) | ~$5-8 | More reviews than v1 |
| Total (template-free, basic) | ~$40-80 | At minimal tree depth |
| Total (template-free, full search) | ~$100-300 | At deeper tree search |
Scaling Cost Analysis
The scaling experiments reveal the cost-quality tradeoff:
| Tree Nodes | Estimated Cost | Quality Score | Quality/Dollar |
|---|---|---|---|
| 4 | ~$40 | ~3.5 | 0.088 |
| 16 | ~$80 | ~4.5 | 0.056 |
| 64 | ~$160 | ~5.5 | 0.034 |
| 256 | ~$500+ | ~6.5 | 0.013 |
The quality/dollar ratio falls as the compute budget grows: quality rises roughly log-linearly with spend, so each marginal dollar buys less improvement, the pattern typical of scaling laws. The cost of producing a workshop-acceptable paper is roughly $200-500 in the template-free mode with sufficient compute.
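The quality/dollar column above is simply the estimated score divided by the estimated cost; a quick check of the figures:

```python
# (tree nodes, estimated cost in $, estimated quality score) from the table above.
runs = [(4, 40, 3.5), (16, 80, 4.5), (64, 160, 5.5), (256, 500, 6.5)]

# Quality per dollar declines monotonically: more compute keeps helping,
# but each additional dollar buys less.
ratios = {nodes: score / cost for nodes, cost, score in runs}
```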
Comparison: Cost to Produce Publishable Science
| Producer | Cost per Paper | Quality | Time |
|---|---|---|---|
| PhD student | ~$50,000-100,000/year (salary + overhead) for ~2-4 papers | Main conference quality | 3-6 months |
| AI Scientist (template-based) | ~$17 | Below workshop bar (v1) | Hours |
| AI Scientist (template-free, scaled) | ~$200-500 | Workshop-acceptable (~15-25%) | Hours-days |
| AI Scientist (projected future) | ~$50-100 (as models improve + costs drop) | Main conference (projected) | Hours |
The economic implications are substantial. Even at current quality levels, the AI Scientist can generate candidate ideas and preliminary experiments at a cost several orders of magnitude lower than human research. The value proposition is strongest when used for broad exploration — generating many candidate directions cheaply, then having humans select and refine the most promising ones.
GPU Compute Costs
In addition to API costs, the system requires GPU compute for running ML experiments:
| Experiment Type | GPU Time | Cloud Cost (A100) |
|---|---|---|
| Template-based (NanoGPT) | ~30 min | ~$1 |
| Template-based (Grokking) | ~10 min | ~$0.30 |
| Template-free (basic) | ~1-2 hours | ~$3-6 |
| Template-free (full tree) | ~4-12 hours | ~$12-36 |
9 Architecture Solution
Architectural Evolution: v1 → v2 → Nature
The AI Scientist architecture has evolved significantly between versions. The Nature paper presents both architectures and their trade-offs.
AI Scientist v1 Architecture (Template-Based)
===============================================
┌────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ IDEATION │ │ EXPERIMENT │ │ WRITE-UP │ │ REVIEW │
│ │ │ │ │ │ │ │
│ LLM ideates│───▶│ Aider edits │───▶│ LaTeX gen │───▶│ 3 personas │
│ Novelty │ │ template │ │ section by │ │ 3 reflections│
│ check via │ │ code │ │ section │ │ each │
│ Semantic │ │ Linear exec │ │ Citation │ │ │
│ Scholar │ │ │ │ search │ │ │
└────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Simple Aider Good Basic
effective dependent quality validation
AI Scientist v2 / Nature Architecture (Template-Free)
======================================================
┌─────────────┐ ┌──────────────────────────────────┐ ┌──────────────┐
│ IDEATION │ │ AGENTIC TREE SEARCH │ │ WRITE-UP │
│ │ │ │ │ │
│ Archive- │ │ Stage 1: Initial Investigation │ │ LaTeX gen │
│ based idea │ │ ├── Baseline implementation │ │ VLM figure │
│ generation │ │ └── Multiple code attempts │ │ refinement │
│ │ │ │ │ 20 citation │
│ Progressive │ │ Stage 2: Hyperparameter Tuning │ │ search rounds│
│ archive │───▶│ ├── Grid/random search │───▶│ │
│ growth │ │ └── Best checkpoint → next │ │ │
│ │ │ │ │ │
│ Semantic │ │ Stage 3: Research Agenda │ │ │
│ Scholar + │ │ ├── Tree search over ideas │ │ │
│ web search │ │ └── Best checkpoint → next │ │ │
│ filtering │ │ │ │ │
│ │ │ Stage 4: Ablation Studies │ │ │
│ │ │ └── Systematic ablations │ │ │
└─────────────┘ └──────────────────────────────────┘ └──────────────┘
│ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ EXPERIMENT MANAGER │ │ AUTOMATED REVIEWER │
│ AGENT │ │ │
│ Coordinates tree │ │ 5 independent reviews │
│ search, selects │ │ + Area Chair meta- │
│ checkpoints, │ │ review │
│ manages resources │ │ NeurIPS guidelines │
└───────────────────────┘ └───────────────────────┘
Key Architectural Differences
| Component | v1 | v2 / Nature |
|---|---|---|
| Code modification | Aider (diff-based) | Direct LLM generation (tree search) |
| Experiment structure | Linear plan execution | 4-stage tree search with checkpointing |
| Idea management | Flat list with novelty check | Progressive archive (grows over time) |
| Figure quality | Basic matplotlib | VLM feedback loop |
| Review system | 3 personas, 3 reflections | 5 reviews + meta-review (Area Chair) |
| Experiment management | Sequential | Dedicated Experiment Manager Agent |
| Template requirement | Mandatory | Optional (template-free mode available) |
The Agentic Tree Search (Core Innovation)
The most significant architectural addition is the agentic tree search for experimentation. This replaces v1's linear experiment execution with a search tree where each node represents an experimental state (code + results + analysis):
Agentic Tree Search Visualization
===================================
ROOT (broad research idea)
│
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Stage 1 │ │ Stage 1 │ │ Stage 1 │
│ Impl. A │ │ Impl. B │ │ Impl. C │
│ score: 3 │ │ score: 5 │ │ score: 2 │
└──────────┘ └────┬─────┘ └──────────┘
│ ← Best selected
┌────────┼────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Stage 2 │ │ Stage 2 │ │ Stage 2 │
│ HP: lr │ │ HP: bs │ │ HP: wd │
│ =0.001 │ │ =64 │ │ =0.01 │
│ score:6 │ │ score:5 │ │ score:4 │
└────┬────┘ └─────────┘ └─────────┘
│ ← Best selected
┌────┼────────────┐
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐
│ S3 │ │ S3 │ │ S3 │
│Idea A│ │Idea B│ │Idea C│
│ s: 7 │ │ s: 5 │ │ s: 6 │
└──┬───┘ └──────┘ └──────┘
│ ← Best selected
▼
┌──────┐
│ S4 │
│Ablate│
│ s: 7 │ ← Final paper uses this checkpoint
└──────┘
Stage 1: Initial Investigation (multiple implementation attempts)
Stage 2: Hyperparameter Tuning (grid/random search)
Stage 3: Research Agenda Execution (tree search over ideas)
Stage 4: Ablation Studies (systematic ablations)
At each stage boundary, the best-performing checkpoint
is selected to seed the next stage.
Experiment Manager Agent (New)
The v2 system introduces a dedicated Experiment Manager Agent that coordinates the tree search. This agent:
- Decides which nodes to expand next (exploration vs. exploitation)
- Selects the best checkpoint at stage boundaries
- Manages compute budget allocation across tree branches
- Tracks experimental progress and identifies promising directions
- Kills unproductive branches to conserve resources
This is architecturally significant because it introduces a meta-level agent that reasons about the structure of the search rather than performing the search itself. In evolutionary computation terms, this is analogous to a strategy adaptation mechanism.
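A toy illustration of the expand/select/prune choice follows. The numeric thresholds are assumptions made for the sketch; the real Experiment Manager reasons with an LLM over code and results rather than applying fixed score rules.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One experimental state in the search tree: code + results + a score."""
    name: str
    score: float
    children: list = field(default_factory=list)

def manager_action(node: Node, stage_best: float, prune_below: float = 0.5) -> str:
    """Toy policy for the Experiment Manager's three choices at a node."""
    if node.score < prune_below * stage_best:
        return "prune"      # abandon clearly unproductive branches
    if node.score >= stage_best:
        return "select"     # candidate checkpoint for the next stage
    return "expand"         # promising enough to try another variation
```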
10 Component Breakdown
Phase 1: Idea Generation (Enhanced)
v1 approach: LLM generates ideas → novelty check via Semantic Scholar → ideas formatted as JSON.
Nature enhancements:
| Enhancement | Description | Impact |
|---|---|---|
| Archive-based generation | Ideas are generated relative to a growing archive of prior ideas | Enables progressive exploration |
| Web access tools | LLM can search web + Semantic Scholar as tools (not just API calls) | Broader literature coverage |
| Template-free prompting | Ideas can target any ML topic, not just template domains | Broader scope |
| Idea filtering pipeline | Multi-stage filtering: novelty → feasibility → alignment | Higher-quality ideas passed to experimentation |
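The multi-stage filter can be sketched as a chain of predicates applied in order. The predicates below are stand-ins for what are, in the real system, LLM and Semantic Scholar judgments; the idea records are invented for illustration.

```python
# Sketch of a multi-stage idea filter (novelty -> feasibility -> alignment).
# The lambda predicates stand in for LLM / Semantic Scholar judgments.

def filter_ideas(ideas, checks):
    """Apply each named check in order; an idea must pass all to survive."""
    surviving = list(ideas)
    for name, passes in checks:
        surviving = [idea for idea in surviving if passes(idea)]
    return surviving

checks = [
    ("novelty", lambda idea: idea["novel"]),
    ("feasibility", lambda idea: idea["feasible"]),
    ("alignment", lambda idea: idea["on_topic"]),
]

example_ideas = [  # invented examples
    {"name": "grokking-revisit", "novel": True,  "feasible": True,  "on_topic": True},
    {"name": "rehash-of-adam",   "novel": False, "feasible": True,  "on_topic": True},
    {"name": "train-frontier",   "novel": True,  "feasible": False, "on_topic": True},
]
```

Ordering the cheap checks first (novelty via literature search) before expensive ones mirrors the pipeline's goal of passing only high-quality ideas to the costly experimentation phase.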
Phase 2: Experimentation (Major Overhaul)
The experimentation phase is completely redesigned in the template-free mode:
v1: Linear execution of experiment plan via Aider code modifications.
Nature (template-free): 4-stage agentic tree search managed by Experiment Manager Agent.
| Stage | Goal | Method | Output |
|---|---|---|---|
| 1. Initial Investigation | Create working baseline | Multiple code generation attempts | Best baseline code + results |
| 2. Hyperparameter Tuning | Optimize baseline | Grid/random search over key HPs | Best HP configuration |
| 3. Research Agenda | Implement the research idea | Tree search over implementation variants | Best implementation + results |
| 4. Ablation Studies | Validate contribution | Systematic ablations of key components | Ablation table + analysis |
Phase 3: Paper Write-up (Enhanced)
v1 approach: Section-by-section LaTeX generation with Semantic Scholar citation search.
Nature enhancements:
| Enhancement | Description |
|---|---|
| VLM figure feedback | Vision-Language Model evaluates figure quality; iterative refinement |
| 20-round citation search | More thorough literature integration (v1 used fewer rounds) |
| Citation justification | For each citation, generates textual justification for inclusion |
| Experimental journal notes | Agent takes structured notes during experimentation for write-up |
Phase 4: Automated Review (Major Enhancement)
v1 approach: 3 reviewer personas (base, negative, positive) with 3 reflection rounds each.
Nature approach: 5 independent reviews + Area Chair meta-review.
| Dimension | v1 | Nature |
|---|---|---|
| Reviews per paper | 3 | 5 |
| Persona types | Base, negative, positive | 5 independent (no explicit bias) |
| Meta-review | Simple aggregation (median scores) | Area Chair persona synthesizes |
| Review guidelines | Generic conference rubric | Official NeurIPS guidelines |
| Validation | Limited comparison | 1,000+ papers, OpenReview dataset |
| Scores output | Scores + decision | Scores + decision + strengths + weaknesses |
| Replicability | Single pass | 5-run ensemble |
Supporting Components
| Component | Status | Function |
|---|---|---|
| Semantic Scholar API | Enhanced | Literature search, novelty checking, citation retrieval |
| Web search tools | New in v2 | Broader information access beyond Semantic Scholar |
| LaTeX compiler | Inherited | Manuscript compilation and PDF generation |
| Python runtime | Enhanced | Experiment execution, data analysis, figure generation |
| Experiment Manager | New in v2 | Tree search coordination, checkpoint selection |
| VLM | New in v2 | Figure quality assessment and feedback |
11 Core Mechanisms (Detailed)
Mechanism 1: Agentic Tree Search for Experimentation
The most significant new mechanism is the 4-stage tree search. Unlike v1's linear execution, the tree search enables the system to:
- Explore multiple implementation approaches before committing
- Recover from dead ends by backtracking to earlier checkpoints
- Systematically vary one dimension at a time (stages 2 and 4)
- Scale with compute by expanding more nodes
How the tree search works in detail:
At each node, the LLM generates code, runs experiments, and analyzes results. The Experiment Manager then decides whether to:
- Expand the node (try a variation)
- Select it as the best checkpoint for the next stage
- Prune it (abandon unproductive branches)
The selection mechanism at stage boundaries uses the experimental results to identify the best-performing checkpoint. This checkpoint's code and data become the starting point for the next stage, ensuring that subsequent work builds on the strongest foundation.
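The expand/select/prune cycle can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system's actual implementation: `run_experiment` stands in for the LLM writing and executing code at a node, and `code_version` stands in for a code checkpoint.

```python
import random
from dataclasses import dataclass, field

def run_experiment(code_version: int) -> float:
    """Hypothetical stand-in for 'LLM writes code, runs it, returns a result'."""
    random.seed(code_version)           # deterministic toy "experimental result"
    return random.random()

@dataclass
class Node:
    code_version: int                   # stands in for a code checkpoint
    score: float
    children: list = field(default_factory=list)

def expand(parent: Node, n_variations: int) -> list:
    """Try several variations of the parent checkpoint (analog of mutation)."""
    children = [Node(v, run_experiment(v))
                for v in range(parent.code_version + 1,
                               parent.code_version + 1 + n_variations)]
    parent.children = children
    return children

def best_checkpoint(nodes: list) -> Node:
    """Stage-boundary selection: the strongest node seeds the next stage."""
    return max(nodes, key=lambda n: n.score)

# One stage: expand the root, keep only the best-scoring checkpoint.
root = Node(0, run_experiment(0))
candidates = expand(root, n_variations=3)
survivor = best_checkpoint([root] + candidates)
```

All other branches are implicitly pruned; only `survivor`'s code and data carry over into the next stage.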
Relationship to evolutionary computation:
| Tree Search Component | Evolutionary Analog |
|---|---|
| Nodes | Individuals in population |
| Node expansion | Mutation (child programs from parent) |
| Stage boundary selection | Elitist selection (best survives) |
| Multiple Stage 1 attempts | Population initialization |
| Stage 3 branching | Population diversity |
| Stage 4 ablation | Fitness landscape analysis |
| Experiment Manager | Strategy adaptation controller |
Mechanism 2: Progressive Archive-Based Ideation
The idea generation phase uses a growing archive inspired by quality-diversity algorithms (MAP-Elites, OMNI):
Archive-Based Idea Generation
==============================
Cycle 1: Cycle 2: Cycle 3:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Archive: {} │ │ Archive: │ │ Archive: │
│ │ │ {Idea A, │ │ {Idea A, │
│ Generate: │ │ Idea B} │ │ Idea B, │
│ Idea A │──────────▶│ │──────────▶│ Idea C, │
│ Idea B │ │ Generate: │ │ Idea D} │
│ │ │ Idea C │ │ │
│ │ │ Idea D │ │ Generate: │
└──────────────┘ │ (informed by │ │ Idea E │
│ A, B) │ │ (informed by │
└──────────────┘ │ A-D) │
└──────────────┘
Each new idea is generated in the context of all prior
ideas, enabling progressive refinement and diversification.
This mechanism is directly inspired by Jeff Clune's work on open-ended learning, in which an archive of diverse solutions drives continued exploration. The archive acts as a curiosity-like pressure: the system is implicitly rewarded for generating ideas that differ from what is already archived.
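A minimal sketch of archive-conditioned ideation, under stated assumptions: `propose_idea` stands in for an LLM call that sees the full archive, and `novelty` is a toy word-overlap metric standing in for the Semantic Scholar novelty check.

```python
def novelty(idea: str, archive: list) -> float:
    """Fraction of the idea's words not already used by any archived idea."""
    words = set(idea.split())
    seen = set().union(*[set(a.split()) for a in archive]) if archive else set()
    return len(words - seen) / len(words)

def propose_idea(cycle: int, archive: list) -> str:
    """Hypothetical stand-in: a real system would prompt the LLM with
    the archive contents so each proposal is informed by prior ideas."""
    pool = ["sparse attention", "curriculum pruning", "sparse pruning"]
    return pool[cycle % len(pool)]

archive = []
for cycle in range(3):
    idea = propose_idea(cycle, archive)
    if novelty(idea, archive) > 0.4:    # keep only sufficiently new ideas
        archive.append(idea)
```

On this toy run the third idea ("sparse pruning") recombines words the archive already contains, so it is rejected, which is exactly the diversity pressure the archive is meant to supply.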
Mechanism 3: Automated Reviewer as Fitness Function
A key insight of the Nature paper is that the Automated Reviewer functions as a fitness function for AI-generated science. By showing that its accuracy matches that of human reviewers, the authors establish that optimizing for the Automated Reviewer's scores is a reasonable proxy for optimizing for actual scientific quality.
This is analogous to the fitness function design problem in evolutionary computation:
- The fitness function must accurately capture the optimization objective
- A misaligned fitness function leads to reward hacking (Goodhart's Law)
- The Automated Reviewer is validated to be as aligned with true scientific quality as human reviewers are with each other
Scaling implications: If the Automated Reviewer is a valid fitness function, then the scaling law (more compute → better papers) can be interpreted as a compute-quality Pareto frontier, analogous to scaling laws in evolutionary optimization.
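The fitness-function framing can be made concrete with a sketch. Everything here is hypothetical scaffolding: `review` stands in for one LLM reviewer call (the real system returns NeurIPS-style scores), and the 5-run average mirrors the Nature setup's 5-review ensemble.

```python
from statistics import mean

def review(draft: str, run: int) -> float:
    """Toy reviewer: longer, more detailed drafts score higher (capped at 10)."""
    return min(10.0, len(draft) / 10 + 0.1 * run)

def fitness(draft: str, n_reviews: int = 5) -> float:
    """Ensemble of independent reviews, averaged into one fitness value."""
    return mean(review(draft, run) for run in range(n_reviews))

# Selection: keep the candidate manuscript the reviewer ensemble prefers.
drafts = ["short abstract", "a longer and more detailed manuscript draft"]
best = max(drafts, key=fitness)
```

The design risk the paper flags applies directly here: if `review` rewards the wrong surface feature (length, in this toy), optimizing `fitness` produces reward hacking rather than better science.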
Mechanism 4: VLM-Augmented Figure Refinement
The VLM figure feedback loop introduces multimodal reasoning into the pipeline:
- Matplotlib generates a figure from experimental data
- The VLM receives the rendered figure image
- The VLM evaluates: layout, label readability, color accessibility, legend placement, axis scaling
- Feedback is provided in natural language
- The code-generating LLM modifies the matplotlib code based on VLM feedback
- The cycle repeats until the VLM is satisfied or iteration budget is exhausted
This mechanism addresses a common failure mode in v1, where figures had:
- Overlapping labels and legends
- Unreadable axis ticks
- Duplicated figures in main text and appendix
- Missing or misleading color coding
- Poor formatting for publication standards
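The refinement loop above can be sketched as follows. This is a minimal illustration under stated assumptions: `vlm_critique` and `revise_plot_code` are hypothetical stand-ins for the VLM call and the LLM code edit, operating here on a list of known defects rather than a rendered image.

```python
def vlm_critique(defects: list):
    """Return the first remaining defect, or None when the VLM is satisfied."""
    return defects[0] if defects else None

def revise_plot_code(code: str, feedback: str) -> str:
    """Stand-in for the LLM editing the matplotlib code per the feedback."""
    return code + f"  # fixed: {feedback}"

def refine_figure(code: str, defects: list, budget: int = 5):
    for iteration in range(budget):
        feedback = vlm_critique(defects)
        if feedback is None:            # VLM satisfied: stop early
            return code, iteration
        code = revise_plot_code(code, feedback)
        defects = defects[1:]           # assume each fix resolves one defect
    return code, budget                 # iteration budget exhausted

code, rounds = refine_figure("plt.plot(x, y)",
                             ["overlapping legend", "unreadable ticks"])
```

The two termination conditions (VLM satisfied, or budget exhausted) match the loop described in the text.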
Mechanism 5: Citation Integration Pipeline
The citation search has been enhanced from v1:
v1: For each concept needing citation, search Semantic Scholar → rank by relevance → insert BibTeX.
Nature: 20-round citation refinement where:
- The LLM generates draft text
- Identifies claims requiring citations
- Searches Semantic Scholar + web for relevant papers
- Generates textual justification for each citation's inclusion
- Compares found literature against the manuscript
- Repeats until the 20-round budget is exhausted, improving citation coverage and accuracy each pass
This more thorough process helps mitigate v1's citation hallucination problem, though the Nature paper acknowledges it does not fully eliminate it.
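One round of this refinement can be sketched as a claim-coverage loop. The structure is assumed for illustration: `[claim:...]` markers are a toy device, and `search_literature` stands in for the Semantic Scholar/web query plus the LLM-written justification.

```python
import re

def find_uncited_claims(text: str, citations: dict) -> list:
    """Claims are toy '[claim:...]' markers; uncited = no citation recorded."""
    claims = re.findall(r"\[claim:(\w+)\]", text)
    return [c for c in claims if c not in citations]

def search_literature(claim: str):
    """Hypothetical lookup: returns (bibkey, one-line justification)."""
    return f"ref_{claim}", f"supports the '{claim}' claim directly"

citations = {}
draft = "We build on [claim:scaling] and [claim:treesearch] results."
for round_idx in range(20):             # up to 20 refinement rounds
    missing = find_uncited_claims(draft, citations)
    if not missing:                     # full coverage reached: stop early
        break
    claim = missing[0]
    citations[claim] = search_literature(claim)   # citation + justification
```

Storing the justification alongside the BibTeX key is the key difference from v1: every citation carries an explicit reason for its inclusion.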
12 Programming Language
System Implementation
| Component | Language | Framework |
|---|---|---|
| AI Scientist pipeline | Python | Custom orchestration |
| Template-based code editing | Python | Aider (open-source coding assistant) |
| Template-free code generation | Python | Direct LLM generation |
| Automated Reviewer | Python | LLM API calls |
| Generated experiments | Python | PyTorch, NumPy, matplotlib |
| Paper output | LaTeX | NeurIPS/ICML templates |
Generated Code Characteristics
The template-free mode generates Python code from scratch, introducing new challenges:
| Characteristic | Template-Based | Template-Free |
|---|---|---|
| Code origin | Human template modified by Aider | AI-generated from scratch |
| Bug frequency | Low (human starting point) | Higher (common implementation errors) |
| Library usage | Follows template patterns | Variable, sometimes non-idiomatic |
| Testing | Inherits template tests | No tests (significant gap) |
| Documentation | Template-level docs | Variable quality |
Common Code Generation Failures (Template-Free)
The Nature paper documents several recurring code generation issues:
- Incorrect implementations — the code doesn't correctly implement the proposed idea
- Import errors — referencing libraries not installed or modules that don't exist
- Shape mismatches — tensor dimension errors in PyTorch code
- Hardcoded paths — assumptions about directory structure
- Missing error handling — crashes on edge cases instead of graceful degradation
These failures are handled by the tree search — failed code attempts become pruned branches, and the search continues from working checkpoints.
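This error-driven pruning can be sketched directly. The example is a toy under stated assumptions: `attempt` stands in for executing a candidate node's code, raising on the failure modes listed above, and the score is a placeholder for a real experimental result.

```python
def attempt(code: str) -> float:
    """Execute a candidate; raise on failure modes like missing imports."""
    if "import missing_lib" in code:
        raise ImportError("module not installed")
    return float(len(code))             # toy 'result' for surviving code

candidates = ["import missing_lib", "x = 1", "x = 1; y = x + 1"]
checkpoints = []
for code in candidates:
    try:
        checkpoints.append((code, attempt(code)))
    except Exception:                   # failed attempt becomes a pruned branch
        continue

# The search continues only from checkpoints whose code actually ran.
best_code, best_score = max(checkpoints, key=lambda cr: cr[1])
```

Crashing candidates never enter the checkpoint pool, so subsequent stages always build on code that executed successfully.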
13 Memory Management
Memory Architecture
The Nature AI Scientist operates with several memory layers:
Memory Architecture (v2 / Nature)
==================================
┌──────────────────────────────────────────────────────────────┐
│ CONTEXT WINDOW (per-LLM-call) │
│ • Current phase context (idea, code, results) │
│ • Experimental journal notes from prior phases │
│ • Relevant archive entries │
│ • Recent conversation history │
│ Limited by model context length │
├──────────────────────────────────────────────────────────────┤
│ IDEA ARCHIVE (persistent across idea generation cycles) │
│ • All previously generated ideas │
│ • Their experiment plans and outcomes │
│ • Enables progressive exploration │
│ Grows monotonically; never pruned │
├──────────────────────────────────────────────────────────────┤
│ TREE SEARCH STATE (per-paper) │
│ • Node states (code checkpoints + results) │
│ • Branch decisions and pruning history │
│ • Best checkpoint at each stage boundary │
│ Managed by Experiment Manager Agent │
├──────────────────────────────────────────────────────────────┤
│ EXPERIMENTAL JOURNAL (per-paper) │
│ • Structured notes taken after each experiment │
│ • Observations, hypotheses, next steps │
│ • Used during paper write-up phase │
│ Explicit prompt: "take notes in the style of an │
│ experimental journal for future planning and write-up" │
├──────────────────────────────────────────────────────────────┤
│ EXTERNAL KNOWLEDGE (accessed on demand) │
│ • Semantic Scholar API (literature search, citations) │
│ • Web search (broader information access) │
│ • Not cached between sessions │
└──────────────────────────────────────────────────────────────┘
Key Memory Improvements Over v1
| Memory Component | v1 | Nature |
|---|---|---|
| Idea archive | Present but limited | Progressive archive with explicit growth |
| Experiment state | Linear (sequential steps) | Tree structure with checkpoints |
| Journal notes | Implicit (figure notes only) | Explicit experimental journal |
| Citation memory | Per-section | 20-round iterative refinement |
| Cross-paper memory | None | Archive carries across idea generation cycles |
Memory Limitations
- No cross-session persistence: Each complete pipeline run starts from scratch (except the idea archive within a single session)
- No learned patterns: The system doesn't learn "what makes a good paper" from its own prior successes and failures
- Context window constraints: Complex experiments may exceed context limits, requiring summarization that loses detail
- No negative result memory: Failed approaches are not systematically recorded for future avoidance
14 Continued Learning
Within-Pipeline Learning
The AI Scientist exhibits learning within a single pipeline run:
| Learning Signal | Mechanism | Persistence |
|---|---|---|
| Idea novelty feedback | Semantic Scholar API filters duplicate ideas | Session-level |
| Experiment results | Tree search uses results to guide exploration | Paper-level |
| Code debugging | Failed code attempts inform subsequent attempts | Stage-level |
| Review feedback (v1) | Iterative refinement loop incorporates review | Cross-paper (v1 only) |
| Figure quality feedback | VLM loop improves figures within a paper | Paper-level |
The Scaling Law as Implicit Learning
The most significant "learning" in the AI Scientist system happens at the foundation model level, not the system level. The scaling law demonstrates that improvements to the underlying LLM automatically improve the AI Scientist's output. This is a form of transfer learning — the foundation model's general capabilities (reasoning, coding, writing, analysis) transfer directly to the specialized task of scientific research.
| Model Generation | Paper Quality | Key Improvements |
|---|---|---|
| Early (GPT-3.5 era) | Score ~2-3 | Basic structure, poor execution |
| Mid (GPT-4 era) | Score ~4-5 | Better ideas, more rigorous experiments |
| Recent (Claude Opus, Gemini Pro) | Score ~5-6 | Workshop-quality, coherent arguments |
| Projected future | Score ~7+ | Main conference quality (projected) |
Cross-Paper Learning: The Open-Ended Loop
The v1 system's iterative refinement loop — where reviewer feedback feeds back into the ideation stage — represents the most ambitious learning mechanism:
Open-Ended Research Loop
=========================
Paper 1 Paper 2 Paper 3
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Idea │ │ Idea │ │ Idea │
│ (novel) │ │ (builds │ │ (builds │
│ │ │ on P1) │ │ on P1+2)│
├──────────┤ ├──────────┤ ├──────────┤
│ Exp. │ │ Exp. │ │ Exp. │
├──────────┤ ├──────────┤ ├──────────┤
│ Write-up │ │ Write-up │ │ Write-up │
├──────────┤ ├──────────┤ ├──────────┤
│ Review │──feedback│ Review │──feedback│ Review │
│ Score: 4 │──────────│ Score: 5 │──────────│ Score: 6 │
└──────────┘ └──────────┘ └──────────┘
│
▼
Workshop-quality
paper achieved
Each paper's review feedback informs subsequent idea
generation, creating a progressive improvement trajectory.
What's Missing: Meta-Learning
The AI Scientist does not perform meta-learning — it doesn't learn how to do research better from its own experience. Several potential meta-learning signals are currently unused:
- Review score prediction: Learning which types of ideas tend to receive higher scores
- Implementation pattern learning: Recognizing which code patterns lead to successful experiments
- Writing quality patterns: Learning which paper structures and argumentation styles receive better reviews
- Failure mode avoidance: Systematically avoiding previously observed failure modes (hallucinated citations, duplicate figures, etc.)
These could be implemented via fine-tuning, retrieval-augmented generation, or explicit strategy databases, but are not part of the current system.
15 Applications
Current Applications
The AI Scientist's current applications are in machine learning research automation:
| Application | Maturity | Evidence |
|---|---|---|
| ML research exploration | Moderate | Workshop-quality papers demonstrated |
| Literature survey augmentation | High | Semantic Scholar integration works well |
| Experimental idea generation | Moderate | Ideas pass novelty checks |
| Paper drafting assistance | Moderate | Full manuscripts generated |
| Automated peer review | High | Validated against human reviewers |
| Research brainstorming | High | Archive-based idea generation |
Future Domains (Projected)
The Nature paper outlines expansion plans:
"At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made)."
| Domain | Feasibility | Required Adaptations |
|---|---|---|
| Computational ML | ✅ Current | — |
| Computational biology | ⚠️ Medium-term | Molecular simulation tools, bio-specific templates |
| Automated chemistry | ⚠️ Medium-term | Lab automation interfaces, safety constraints |
| Materials science | ⚠️ Medium-term | Simulation software integration |
| Robotics | ⚠️ Medium-term | Simulation environments |
| Theoretical mathematics | ❌ Longer-term | Proof verification (Lean4, Coq) |
| Social science | ❌ Longer-term | Data collection, IRB constraints |
| Physical experiments | ❌ Longer-term | Hardware interfaces, safety |
Ethical and Societal Implications
The Nature paper and its companion editorials raise significant concerns:
Risks identified:
- Overwhelming peer review: Mass-generated papers could flood review systems
- Credential inflation: Using AI papers to inflate publication records
- Idea appropriation: AI may recombine others' ideas without proper attribution
- Job displacement: Potential impact on early-career research positions
- Noise in scientific literature: Low-quality AI papers polluting the knowledge base
- Unethical experiments: AI systems conducting experiments without ethical oversight
Mitigations implemented:
| Risk | Mitigation |
|---|---|
| Deceptive submission | Pre-registered withdrawal protocol |
| Lack of consent | ICLR leadership + workshop organizer consent |
| Ethical oversight | UBC IRB approval (H24-02652) |
| Disclosure | All AI papers watermarked |
| Precedent-setting | Withdrew accepted paper to avoid normalizing undisclosed AI submissions |
Relationship to Evolutionary AI Systems
The AI Scientist's relationship to the evolutionary AI systems surveyed in this collection is primarily complementary rather than competitive:
| Evolutionary System | AI Scientist's Relationship |
|---|---|
| AlphaEvolve | Uses evolutionary framework for algorithm discovery; AI Scientist could write papers about AlphaEvolve discoveries |
| FunSearch | AI Scientist could automate the write-up of FunSearch discoveries |
| ShinkaEvolve | Tree search in AI Scientist v2 has structural parallels to evolutionary search |
| AutoEvolver | Both demonstrate emergent search behaviors from LLM agents |
The evolutionary strategy classification for the AI Scientist is justified by:
- The agentic tree search is structurally analogous to evolutionary search with selection pressure
- The archive-based ideation mirrors quality-diversity archives (MAP-Elites)
- The iterative refinement loop implements an evolutionary improvement cycle
- The scaling laws parallel compute-performance scaling in evolutionary algorithms
- The Automated Reviewer functions as a fitness function
Classification: EVOLVE
Both the architectural mechanisms (tree search, archive-based exploration, fitness-function-driven selection) and the broader paradigm (iterative improvement of AI-generated artifacts through automated evaluation) place the AI Scientist firmly in the evolutionary strategy category. The Nature publication strengthens this classification by demonstrating scaling laws that parallel evolutionary optimization dynamics — more compute and better operators (models) yield better solutions.
Significance Assessment
The Nature publication represents a landmark in AI research automation:
Impact Level: High. The first demonstration of fully AI-generated science passing human peer review, combined with validated scaling laws suggesting rapid future improvement, establishes the AI Scientist as a turning point. While current quality remains below main-conference standards, the trajectory — supported by both model scaling and compute scaling — suggests that conference-quality AI science is within reach on a 2-3 year horizon.
Limitation Caveat: The workshop's relatively high baseline acceptance rate (~70%), the accepted paper's negative result aligning with the ICBINB workshop's specific theme, and the 33% success rate (1 of 3 submissions accepted) all temper the headline claim. Main conference acceptance remains an unmet challenge.
Limitations Specific to the Nature Paper
- Selective reporting: Only 3 of many generated papers were submitted; manual filtering introduces human selection bias
- Workshop vs. main conference: Workshop acceptance (70% rate) is not equivalent to main conference acceptance (32% rate)
- Negative result advantage: The accepted paper reported a negative result, which aligned with the ICBINB workshop's specific focus — this may not generalize
- Model access dependency: Results depend on commercial API access to frontier models; full reproducibility requires matching model capabilities
- Limited domain: Only ML research is demonstrated; claims about broader scientific applicability are aspirational
- No longitudinal study: The scaling laws are cross-sectional (comparing different models at one time point), not longitudinal (tracking the same system over time)
- Automated Reviewer limitations: The reviewer is validated on AI/ML papers only; it may not generalize to other scientific domains
References
- Lu, C., Lu, C., Yamada, Y., Lange, R.T., Hu, S., Foerster, J., Clune, J., and Ha, D. "Towards End-to-End Automation of AI Research." Nature, s41586-026-10265-5, 2026.
- Yamada, Y., Lange, R.T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., and Ha, D. "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search." arXiv:2504.08066, April 2025.
- Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., and Ha, D. "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292, August 2024.
- "The NeurIPS 2021 Consistency Experiment." NeurIPS Blog, December 2021.
- Mouret, J.-B. and Clune, J. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.
- Clune, J. "AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence." arXiv:1905.10985, 2019.
- Sakana AI. "AI Scientist v1 Repository." github.com/SakanaAI/AI-Scientist.
- Sakana AI. "AI Scientist v2 Repository." github.com/SakanaAI/AI-Scientist-v2.
- Aider. Open-source AI coding assistant. aider.chat.
- Gauthier, J. et al. "OpenReview: A Scientific Review Platform." 2014.
- Sakana AI. "The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature." Blog Post, 2026. sakana.ai/ai-scientist-nature.
- Anthropic. "Claude." 2024-2026.
- Google DeepMind. "Gemini." 2024-2026.