The AI Scientist v2
End-to-end agentic system that produced the first entirely AI-generated, peer-review-accepted workshop paper through progressive tree search over the scientific discovery pipeline.
Organization: Sakana AI / Foerster Lab (University of Oxford) / University of British Columbia / Vector Institute
Published: April 10, 2025
Type: paper (arXiv:2504.08066)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
- Venue: arXiv preprint (cs.AI, cs.CL, cs.LG), April 2025; evaluated at ICLR 2025 Workshop "I Can't Believe It's Not Better" (ICBINB)
- DOI: 10.48550/arXiv.2504.08066
- License: CC-BY 4.0 (paper); Responsible AI Source Code License (code; a RAIL derivative)
- Predecessor: The AI Scientist v1 (Lu et al., arXiv:2408.06292, August 2024)
- Code: github.com/SakanaAI/AI-Scientist-v2
- Workshop Experiment Data: github.com/SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment
- Blog Post: sakana.ai/ai-scientist-v2
- Relation to prior work: Direct successor to AI Scientist v1; tree search component built on AIDE (Jiang et al., 2025); evaluated via formal ICLR peer review rather than automated reviewer alone
The paper positions itself as a milestone: the first demonstration that a fully autonomous AI system can generate a scientific manuscript that passes real human peer review at a recognized venue. The contribution is both systems-level (architectural innovations enabling template-free exploration) and empirical (a controlled experiment with blind peer review).
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Yutaro Yamada* | Sakana AI | Equal contribution; correspondence author |
| Robert Tjarko Lange* | Sakana AI | Equal contribution; correspondence author |
| Cong Lu* | Sakana AI, UBC, Vector Institute | Equal contribution; correspondence author; v1 lead |
| Shengran Hu | Sakana AI, UBC, Vector Institute | — |
| Chris Lu | FLAIR, University of Oxford | v1 first author |
| Jakob Foerster | FLAIR, University of Oxford | Equal advising |
| Jeff Clune | UBC, Vector Institute, Canada CIFAR AI Chair | Equal advising |
| David Ha | Sakana AI | Equal advising; Sakana AI co-founder |
*Equal contribution.
Team size: 8 authors across Sakana AI, the University of British Columbia, the Vector Institute, and the University of Oxford. The team is notably smaller than those behind many competing systems, reflecting Sakana AI's lean research style, and overlaps significantly with v1's (Lu, Lu, Lange, Foerster, Clune, Ha), providing direct continuity.
Key figures:
- David Ha: Former Google Brain researcher; co-founder and CEO of Sakana AI. Known for world models, neuroevolution, and creative AI research.
- Jeff Clune: Pioneer of open-endedness, quality-diversity, and MAP-Elites. His intellectual fingerprint is visible in the tree-search exploration paradigm.
- Jakob Foerster: Oxford faculty working on multi-agent systems and meta-learning. Connects the work to the FLAIR lab's broader agenda on agent coordination.
- Chris Lu and Cong Lu: Co-leads of v1; their transition to v2 ensures architectural continuity while enabling the paradigm shift.
3 Core Contribution
The AI Scientist v2 addresses three fundamental limitations of its predecessor through targeted architectural innovations:
| Limitation in v1 | Problem | v2 Solution |
|---|---|---|
| Template dependency | Required human-authored code templates for each research domain; could not operate without experiment.py, plot.py, and prompt.json hand-crafted per topic | Eliminates templates entirely; starts from a markdown topic description + generated JSON idea file |
| Linear experimentation | Followed a strictly sequential hypothesis → code → execute → analyze pipeline; short-sighted, unable to explore branching hypotheses | Progressive agentic tree search (BFTS) with parallel exploration of multiple experimental directions |
| Text-only review | Automated reviewer evaluated text only; no visual understanding of figures | VLM (Vision-Language Model) feedback loop for iterative figure refinement during both experimentation and paper writing |
The Headline Result
One of three fully autonomous manuscripts submitted to the ICLR 2025 ICBINB workshop received scores of 6, 7, and 6 (average 6.33/10), placing it in the top 45% of all submissions — above the average human acceptance threshold. This is the first documented instance of a fully AI-generated paper passing real human peer review.
What v2 Is NOT
- Not necessarily better than v1 per-paper: The authors explicitly note that v1 with a strong template can produce higher success rates on well-defined tasks
- Not conference-level: The accepted paper was workshop-level; the authors state none of the three manuscripts met main-track conference standards
- Not free of hallucinations: External evaluations found fabricated results, hallucinated methodology, and overestimated novelty in some outputs
- Not a replacement for human scientists: The system demonstrates capability at the workshop level, not the ability to produce trustworthy science independently
Relationship to the Field
Timeline of autonomous scientific discovery systems:
──────────────────────────────────────────────────────────────────────
AI Scientist v1 (Aug 2024) → First end-to-end system; $15/paper;
template-dependent; automated reviewer only
MLAgentBench (2024) → ML experiment automation benchmark
AIDE (2025) → BFTS for ML engineering; MLEBench SOTA
AI Scientist v2 (Apr 2025) → Template-free; agentic tree search;
first real peer-review acceptance
AutoResearchClaw (2025) → Multi-agent research; ReAct + tools
EurekaClaw (2025) → Evolutionary research; skills library
AIRA₂ (Mar 2026) → 8-GPU async evolution; MLE-bench SOTA
──────────────────────────────────────────────────────────────────────
The AI Scientist v2 occupies a unique position: it targets the full scientific pipeline (idea → paper) rather than just ML engineering tasks. AIDE influenced the tree search; the AI Scientist lineage contributes the end-to-end manuscript generation and peer review evaluation.
4 Supported Solutions
Problem Framing
The AI Scientist v2 frames automated scientific discovery as a tree-structured search problem over the space of experimental programs and manuscripts. The unit of search is not a competition solution (as in MLE-bench agents) but a complete scientific workflow: hypothesis, implementation, experiments, analysis, figures, and paper.
Solution Space
The system operates on ML research ideas as the unit of exploration. Each candidate path through the tree represents:
- A scientific hypothesis (generated during ideation)
- An implementation strategy (Python code for experiments)
- Experimental results (metrics, training curves, outputs)
- Visualizations (figures generated by plotting code)
- A manuscript (LaTeX paper with citations, figures, and text)
Search Methods Supported
| Method | Description | Role in v2 |
|---|---|---|
| Best-First Tree Search (BFTS) | LLM-evaluated nodes ranked by experimental quality; best nodes expanded first | Primary search method across all 4 stages |
| Parallel node expansion | Multiple nodes expanded concurrently via num_workers parallel paths | Accelerates exploration; default 3 workers |
| Probabilistic debugging | Buggy nodes selected for debugging with probability debug_prob | Enables error recovery without abandoning investment |
| Stage-gated progression | Tree search organized into 4 sequential stages with explicit stopping criteria | Structured exploration from prototype to ablations |
| LLM-based node evaluation | Dedicated LLM evaluator scores non-buggy nodes for best-first selection | Replaces scalar metrics with qualitative assessment |
What v2 Does NOT Do
- No neural architecture search as a subsystem
- No reinforcement learning or gradient-based optimization of the search itself
- No meta-learning or cross-experiment transfer (each idea starts fresh)
- No automated submission to venues (manuscripts are generated but submission was manual)
- No real-time human-in-the-loop steering during the search (fully autonomous once launched)
- No multi-idea orchestration within a single run (one idea per launch_scientist_bfts.py invocation)
5 LLM Integration
Model Configuration
Unlike v1, which used a single --model argument, v2 employs task-specific model routing, with separate LLMs for different pipeline stages:
| Stage | Recommended Model | Purpose |
|---|---|---|
| Ideation | GPT-4o (2024-05-13) | Idea generation, novelty checking via Semantic Scholar |
| Experimentation (BFTS) | Claude 3.5 Sonnet | Code generation, experiment planning, debugging within tree search |
| Aggregate Plots | o3-mini (2025-01-31) | Consolidating visualization across replication runs |
| Paper Writing | o1-preview (2024-09-12) | Single-pass manuscript generation + reflection |
| Citation Gathering | GPT-4o (2024-11-20) | Literature search integration with Semantic Scholar |
| Paper Review | GPT-4o (2024-11-20) | Automated text review |
| VLM Feedback | VLM (vision-capable model) | Figure quality assessment during experiments and writing |
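The routing in the table above can be captured as a small lookup. This is a sketch: the stage keys and the helper function are illustrative assumptions, not identifiers from the v2 codebase.

```python
# Sketch of v2-style per-stage model routing. The stage -> model mapping
# follows the table above; the helper itself is an illustration only.
STAGE_MODELS = {
    "ideation": "gpt-4o-2024-05-13",
    "experimentation": "claude-3-5-sonnet",   # served via AWS Bedrock
    "aggregate_plots": "o3-mini-2025-01-31",
    "paper_writing": "o1-preview-2024-09-12",
    "citations": "gpt-4o-2024-11-20",
    "review": "gpt-4o-2024-11-20",
}

def model_for(stage: str) -> str:
    """Look up the model for a pipeline stage, failing loudly on unknown names."""
    if stage not in STAGE_MODELS:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(STAGE_MODELS)}")
    return STAGE_MODELS[stage]
```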
LLM Usage Patterns
The system uses LLMs in fundamentally different ways across the pipeline:
1. Code Generation (Experimentation)
Node Expansion Cycle:
LLM generates: (plan_text, experiment_code.py)
→ Execute in Python interpreter
→ If error: record trace, mark buggy, stop
→ If success: save metrics to .npy files
→ LLM generates: plot_code.py
→ Execute plotting code
→ VLM evaluates generated figures
→ If VLM flags issues: mark buggy, record feedback
→ If passes: mark non-buggy, record node metadata
2. Best-First Node Selection
LLM Evaluator receives:
- All non-buggy node descriptions
- Performance metrics per node
- Training dynamics summaries
- Generated plot quality assessments
LLM returns: ranked ordering of nodes for expansion
3. Experiment Manager Agent
After each stage completion:
Manager LLM evaluates all terminal nodes
→ Selects best node as seed for next stage
→ Records checkpoint
→ Launches replication runs for statistics
→ Transitions to next stage
4. Paper Writing
v1 approach: Iterative Aider-based editing (multiple rounds of code-level edits)
v2 approach: Single-pass generation by reasoning model (o1) + separate reflection stage
→ Substantial simplification over v1's incremental writing
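The node expansion cycle (usage pattern 1 above) can be sketched in Python. Here `run_llm` and `vlm_check` are hypothetical stand-ins for the LLM and VLM API calls; neither name nor the dict-shaped node comes from the v2 codebase.

```python
import os
import subprocess
import sys
import tempfile

def expand_node(run_llm, vlm_check, parent_plan):
    """Hypothetical sketch of one BFTS node-expansion cycle."""
    node = {"status": "non-buggy", "error": None, "vlm_feedback": None}
    # LLM proposes a plan plus a complete experiment script.
    plan, code = run_llm(f"Refine this plan and emit full code:\n{parent_plan}")
    node["plan"], node["code"] = plan, code
    # Execute the script; a failure marks the node buggy and ends the cycle.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script = f.name
    try:
        result = subprocess.run([sys.executable, script],
                                capture_output=True, text=True)
    finally:
        os.unlink(script)
    if result.returncode != 0:
        node["status"], node["error"] = "buggy", result.stderr
        return node
    # LLM writes plotting code; the VLM then gates the resulting figures.
    _, node["plot_code"] = run_llm("Write plotting code for the saved metrics")
    passed, feedback = vlm_check(node["plot_code"])
    if not passed:
        node["status"], node["vlm_feedback"] = "buggy", feedback
    return node
```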
Key Architectural Difference from v1
v1 used the Aider coding assistant (aider-chat library) as an intermediary between the LLM and the codebase. Aider managed git history, diff application, and multi-turn code editing conversations. v2 removes Aider entirely — the LLM directly generates complete code files within the tree search loop. This eliminates a layer of abstraction and gives the system more direct control over code structure, at the cost of losing Aider's git-based undo capabilities.
API Integration
| API | Purpose | Required |
|---|---|---|
| OpenAI API | GPT-4o, o1-preview, o3-mini for writing/review/ideation | Yes (for default config) |
| AWS Bedrock | Claude 3.5 Sonnet for experimentation | Yes (for default config) |
| Gemini API | Alternative experimentation model | Optional |
| Semantic Scholar API | Literature search, novelty checking, citation gathering | Optional (recommended) |
6 Key Results
ICLR 2025 Workshop Submission Experiment
Three fully autonomous manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. The experiment was conducted with full transparency:
- Workshop organizers and ICLR leadership were informed in advance
- Reviewers were told "some submissions might be AI-generated" but not which ones
- Reviewers could opt out of reviewing potentially AI-generated papers
- IRB approval was obtained
- Accepted manuscripts would be withdrawn post-review (and were)
| Paper | Topic | Reviewer Scores | Average | Outcome |
|---|---|---|---|---|
| Compositional Regularization | Temporal consistency regularizer on LSTM embeddings for compositional generalization | 6, 7, 6 | 6.33 | Exceeded acceptance threshold |
| Real-World Pest Detection | Deep learning for agricultural pest classification | 3, 7, 4 | 4.67 | Rejected |
| Label Noise & Model Calibration | Effect of label noise on model calibration | 3, 3, 3 | 3.00 | Rejected |
The Accepted Paper: Compositional Regularization
The accepted paper investigated whether incorporating an explicit compositional regularization term into neural network training improves compositional generalization:
- Approach: Penalizes large deviations between embeddings of successive time steps in LSTM-based sequence models
- Hypothesis: Encouraging smoothness in embedding space promotes compositionality
- Datasets: Synthetic arithmetic expression datasets
- Finding: Compositional regularization did not yield significant improvements and occasionally harmed performance — a negative result
- Reviewer assessment: Appreciated for "clearly identifying the challenges of effective compositional regularization and reporting on negative results"
- Ranking: Top 45% of 43 submissions (roughly 20th position)
- Post-review status: Withdrawn as pre-arranged with organizers
Internal Quality Assessment
The authors conducted their own evaluation of all three manuscripts:
| Quality Dimension | Assessment |
|---|---|
| Workshop-level quality | One manuscript (the accepted one) meets workshop standards |
| Conference-level quality | None of the three manuscripts meets main-track standards |
| Citation accuracy | Occasional hallucinated citations (known LLM limitation) |
| Methodological rigor | Adequate for workshops; lacks depth for conferences |
| Figure quality | Improved over v1 via VLM feedback, but some caption inaccuracies remain |
| Code quality | Functional but not always well-structured or documented |
External Evaluations and Known Limitations
Independent analyses (MLR-Bench, Pebblous) identified systemic issues in AI Scientist v2 outputs:
| Issue | Description | Frequency |
|---|---|---|
| Fabricated results | System hides failed experiments and reports them as successful | Found in multiple outputs |
| Hallucinated methodology | Describes methods not actually implemented in code | Intermittent |
| Overestimated novelty | Presents well-known concepts as novel discoveries | Common |
| Dataset overlap | Potential train/test contamination in some experiments | Identified in accepted paper |
| Caption inaccuracies | Figure captions not always matching figure content | Several instances |
Comparison with v1 Results
| Metric | v1 | v2 |
|---|---|---|
| Evaluation method | Automated reviewer only | Real human peer review at ICLR workshop |
| Cost per paper | ~$15 | ~$20–25 |
| Template requirement | Yes (per domain) | No |
| Domain flexibility | 3 specific domains (NanoGPT, 2D Diffusion, Grokking) | Any ML topic describable in markdown |
| Success rate | Higher (within template scope) | Lower (broader, exploratory) |
| Best reviewer score | Exceeded automated reviewer threshold | 6.33/10 average from human reviewers |
| Accepted at real venue | No (not submitted) | Yes (1 of 3 at ICBINB workshop) |
| Paper writing approach | Iterative Aider-based editing | Single-pass o1 generation + reflection |
| Experiment depth | Shallow, linear | Deep, tree-structured, multi-stage |
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Code availability | Fully open-sourced at github.com/SakanaAI/AI-Scientist-v2 |
| Workshop experiment data | Separately published with full manuscripts and reviews |
| Configuration | bfts_config.yaml provides declarative tree search configuration |
| Installation | Conda environment with pinned dependencies; requirements.txt provided |
| Hardware requirements | Explicitly stated: Linux, NVIDIA GPUs, CUDA, PyTorch |
| API requirements | All required API keys documented (OpenAI, AWS Bedrock, Semantic Scholar) |
| Prompts | Included in Appendix B of the paper |
| Hyperparameters | Full sampling hyperparameters in Appendix A |
| Tree visualization | unified_tree_viz.html generated for each run, enabling post-hoc inspection |
Limitations
| Aspect | Concern |
|---|---|
| LLM API dependency | Results depend on specific model versions (Claude 3.5 Sonnet, o1-preview) that may be deprecated or updated |
| Non-determinism | LLM sampling introduces stochasticity; exact tree trajectories are not reproducible across runs |
| Cost barrier | $20–25 per paper attempt; multiple attempts needed for reliable results |
| GPU requirement | NVIDIA GPU with CUDA required; not runnable on CPU or Apple Silicon |
| Success rate variability | "Higher success rates are generally observed when using powerful models like Claude 3.5 Sonnet" — weaker models may fail more often |
| LaTeX dependencies | Requires poppler, chktex, and LaTeX toolchain — brittle cross-platform |
| Semantic Scholar dependency | Rate limits may affect ideation and citation stages without API key |
| Sandbox requirements | Executes LLM-generated code; requires Docker/container isolation for safety |
Predecessor Reproducibility
AI Scientist v1 is fully open-sourced at github.com/SakanaAI/AI-Scientist with templates for three domains. The v2 codebase explicitly acknowledges building on AIDE (WecoAI/aideml) for the tree search component.
8 Compute and API Costs
Cost Breakdown per Paper
Cost structure per paper generation attempt:
┌─────────────────────────────────────────────────┐
│ Stage │ Estimated Cost │
├─────────────────────────────────────────────────┤
│ Idea Generation │ ~$3 │
│ (LLM calls + Semantic Scholar queries) │
├─────────────────────────────────────────────────┤
│ BFTS Experimentation │ $15–20 │
│ (Claude 3.5 Sonnet for code gen/debug/eval) │
├─────────────────────────────────────────────────┤
│ Paper Writing │ ~$5 │
│ (o1-preview + GPT-4o for citations) │
├─────────────────────────────────────────────────┤
│ Total per paper attempt │ ~$20–25 │
└─────────────────────────────────────────────────┘
Comparison with v1 Costs
| Metric | v1 | v2 |
|---|---|---|
| Total per paper | ~$15 | ~$20–25 |
| GPU compute | Minimal (uses template code) | Moderate (runs ML experiments) |
| LLM API | Single model, ~$15 | Multi-model, ~$20 |
| Time to completion | Few hours | Several hours (experimentation) + 20–30 min (writing) |
| Human setup cost | High (template creation per domain) | Low (markdown topic description) |
Hardware Requirements
Minimum configuration:
┌──────────────────────────────────────┐
│ Linux OS (required) │
│ NVIDIA GPU with CUDA support │
│ PyTorch-compatible GPU drivers │
│ Sufficient VRAM for target experiments│
│ Docker/container runtime (recommended)│
└──────────────────────────────────────┘
BFTS Configuration Parameters
The tree search is controlled via bfts_config.yaml:
| Parameter | Description | Default |
|---|---|---|
| num_workers | Parallel exploration paths | 3 |
| steps | Maximum nodes to explore | 21 |
| num_seeds | Initial root nodes per tree | 3 |
| num_drafts | Independent trees to grow (Stage 1) | Configurable |
| max_debug_depth | Max debug attempts before abandoning a node | Configurable |
| debug_prob | Probability of selecting a buggy node for debugging | Configurable |
With num_workers=3 and steps=21, the system explores up to 21 nodes total, expanding 3 concurrently per step. This gives roughly 7 expansion rounds per stage.
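That round count follows directly from the defaults; a one-line check, where the ceiling division for non-divisible cases is my assumption:

```python
import math

def expansion_rounds(steps: int = 21, num_workers: int = 3) -> int:
    """Rounds needed to explore `steps` nodes with `num_workers` parallel paths."""
    # With the defaults, ceil(21 / 3) = 7 expansion rounds per stage.
    return math.ceil(steps / num_workers)
```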
Cost-Effectiveness Analysis
The $20–25 per paper is misleadingly low as a headline number. Important caveats:
- Success rate is not 100%: Many runs fail to produce a viable manuscript; effective cost per publishable paper is significantly higher
- GPU costs not included: If experiments require significant GPU compute (training models), hardware costs add substantially
- Human review not included: The system does not guarantee quality; human review and potential revision would add labor costs
- The $15 v1 comparison is apples-to-oranges: v1 cost was lower because templates did the heavy lifting that v2 must do from scratch
9 Architecture Solution
High-Level Architecture
The AI Scientist v2 architecture consists of two major phases executed sequentially, with the experimentation phase internally organized as a four-stage tree search:
┌─────────────────────────────────────────────────────────────────────────┐
│ AI SCIENTIST v2 PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: IDEATION │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Topic Description (.md) │ │
│ │ ↓ │ │
│ │ LLM Idea Generation (with Semantic Scholar novelty check) │ │
│ │ ↓ │ │
│ │ Structured Ideas (.json) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 2: EXPERIMENTATION (BFTS) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │ │
│ │ │ Stage 1 │→│ Stage 2 │→│ Stage 3 │ │ │
│ │ │ Preliminary │ │ Hyperparameter│ │ Research Agenda │ │ │
│ │ │Investigation │ │ Tuning │ │ Execution │ │ │
│ │ └─────────────┘ └──────────────┘ └────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌────────────────┐ │ │
│ │ Experiment Manager ←───────│ Stage 4 │ │ │
│ │ Agent (coordinates) │ Ablation Studies│ │ │
│ │ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 3: MANUSCRIPT GENERATION │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Plot Aggregation → Paper Writing → Citation Gathering │ │
│ │ → VLM Figure Review → Reflection → Final PDF │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 4: REVIEW │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LLM Text Review + VLM Figure/Caption Review │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
v1 → v2 Architecture Comparison
AI SCIENTIST v1 (LINEAR):
─────────────────────────────────────────────────────────────
Template Code → Idea Gen → Aider Code Editing → Execute
↑ ↓
(human-authored) Visualize
↓
Paper Write
(iterative)
↓
Auto Review
↓
Improvement
AI SCIENTIST v2 (TREE-STRUCTURED):
─────────────────────────────────────────────────────────────
Topic (.md) → Idea Gen → ┌─ Stage 1: Prototype ─── BFTS ──┐
(+ S2) │ Stage 2: Tune ─── BFTS ──│
│ Stage 3: Agenda ─── BFTS ──│
│ Stage 4: Ablations ─── BFTS ──│
└──────────────────────────────────┘
↓
Plot Aggregation + VLM Review
↓
Single-Pass Paper Writing (o1)
↓
VLM + LLM Review
Orchestration Architecture
The system is orchestrated through two entry points:
- perform_ideation_temp_free.py — Generates research ideas from a topic description
- launch_scientist_bfts.py — Executes the full BFTS experimentation + writing + review pipeline for a single idea
Key architectural decisions:
| Decision | v1 Approach | v2 Approach | Rationale |
|---|---|---|---|
| Configuration | Command-line args only | bfts_config.yaml + CLI args | Tree search has too many parameters for CLI alone |
| Idea scope per run | Multiple ideas per invocation | Single idea per invocation | Each tree search is computationally expensive |
| Model selection | Single --model flag | Separate flags per stage | Different tasks benefit from different model strengths |
| Code editing | Aider-mediated diffs | Direct LLM code generation | Tree search needs complete code per node, not incremental diffs |
| Experiment execution | Sequential, in-process | Parallel, multi-worker | Tree search enables natural parallelization |
| Paper writing | Multi-round Aider editing | Single-pass reasoning model + reflection | Simplifies writing; leverages o1's long-form capabilities |
Module Structure
ai_scientist/
├── perform_ideation_temp_free.py # Phase 1: Idea generation
├── treesearch/ # Phase 2: Core BFTS engine
│ └── perform_experiments_bfts_with_agentmanager # Main entry point
├── perform_plotting.py # Plot aggregation across nodes
├── perform_writeup.py # Paper writing (normal + ICBINB formats)
├── perform_llm_review.py # Text-based automated review
├── perform_vlm_review.py # Vision-based figure/caption review
├── gather_citations.py # Semantic Scholar citation integration
├── ideas/ # Topic descriptions and generated ideas
│ └── i_cant_believe_its_not_better.md # Example topic file
└── llm.py # LLM client creation utilities
bfts_config.yaml # Tree search configuration
launch_scientist_bfts.py # Main orchestrator script
Dependency Architecture
v2 introduces significant new dependencies reflecting the architectural shift:
| Library | Purpose | v1 Status | v2 Status |
|---|---|---|---|
| aider-chat | LLM-mediated code editing | Core dependency | Removed |
| omegaconf | Hierarchical YAML configuration | Not present | Added (for bfts_config.yaml) |
| python-igraph | Graph data structure for tree | Not present | Added (tree management) |
| seaborn | Statistical visualization | Not present | Added |
| rich | Terminal output formatting | Not present | Added |
| jsonschema | JSON validation | Not present | Added |
| dataclasses-json | Serialization of node state | Not present | Added |
| boto3 / botocore | AWS Bedrock for Claude | Not present | Added |
| pymupdf4llm | PDF processing for LLMs | pymupdf (basic) | Upgraded |
| torch | ML framework | Required | Not in requirements.txt (assumed pre-installed) |
| google-generativeai | Gemini API | Present | Removed (Gemini via OpenAI API instead) |
10 Component Breakdown
Component 1: Idea Generation Engine
Purpose: Generate structured research ideas from a high-level topic description, with literature-grounded novelty assessment.
Input: Markdown file with Title, Keywords, TL;DR, and Abstract sections defining the research scope.
Process:
1. LLM generates candidate research ideas based on the topic description
2. Each idea undergoes multiple reflection rounds (--num-reflections, default 5)
3. Semantic Scholar is queried in-loop for novelty checking and related work identification
4. Ideas are refined based on literature context
5. Output: structured JSON with hypotheses, proposed experiments, and related work
Output: ideas/<topic_name>.json containing a list of structured research ideas.
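A minimal validator illustrates the structured-JSON contract of the output. The field names here are assumptions about the schema, not the repository's actual keys:

```python
def validate_idea(idea: dict) -> list:
    """Return a list of problems with a generated idea record (empty = OK).

    The required fields below are illustrative guesses at the shape of an
    entry in ideas/<topic_name>.json; the real schema may differ.
    """
    required = {
        "title": str,
        "hypothesis": str,
        "experiments": list,   # proposed experiments
        "related_work": list,  # literature surfaced via Semantic Scholar
    }
    problems = []
    for name, expected_type in required.items():
        if name not in idea:
            problems.append(f"missing field: {name}")
        elif not isinstance(idea[name], expected_type):
            problems.append(f"{name} should be a {expected_type.__name__}")
    return problems
```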
Key difference from v1: v1's ideation was constrained by the code template — ideas were incremental modifications to existing code. v2 starts from abstract concepts (like a grant proposal) before any code exists.
Component 2: Experiment Progress Manager Agent
Purpose: Coordinate the four-stage experimental lifecycle, enforcing structure while allowing flexible exploration within each stage.
Stage definitions and stopping criteria:
Stage 1: PRELIMINARY INVESTIGATION
├── Goal: Establish feasibility via minimal working prototype
├── Stop when: Basic working prototype successfully executes
├── Output: Root node for Stage 2
└── Search: BFTS with parallel initial nodes
Stage 2: HYPERPARAMETER TUNING
├── Goal: Optimize critical hyperparameters for robust baseline
├── Stop when: Training curves converge + success on ≥2 datasets
├── Output: Tuned baseline node for Stage 3
├── Search: BFTS with specialized hyperparameter nodes
└── Tracking: Previously tested hyperparameters recorded to avoid redundancy
Stage 3: RESEARCH AGENDA EXECUTION
├── Goal: Systematically implement the core research agenda
├── Stop when: Computational budget exhausted
├── Runtime check: If runs finish too fast → suggest increasing complexity
├── Output: Best experimental node for Stage 4
└── Search: BFTS with refinement and debugging
Stage 4: ABLATION STUDIES
├── Goal: Assess importance of research components
├── Stop when: Computational budget exhausted
├── Output: Final experimental results + replication statistics
├── Search: BFTS with specialized ablation nodes
└── Replication: Multiple seeds for statistical robustness
Inter-stage transitions: After each stage, the manager:
1. Evaluates all terminal nodes using a dedicated LLM evaluator
2. Selects the best-performing node based on articulated criteria
3. Creates a checkpoint
4. Launches replication runs of the best node for mean/std statistics
5. Seeds the next stage with the selected best node
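These five steps can be sketched as one function, with `evaluate` and `replicate` as hypothetical stand-ins for the LLM evaluator and the replication runner (both names, and the dict-shaped nodes, are assumptions):

```python
import statistics

def stage_transition(terminal_nodes, evaluate, replicate, n_seeds=3):
    """Hypothetical sketch of the manager's five-step inter-stage handoff."""
    best = max(terminal_nodes, key=evaluate)        # steps 1-2: score and select
    checkpoint = dict(best)                         # step 3: snapshot the node
    # Step 4: replicate the best node across seeds for mean/std statistics.
    scores = [replicate(best, seed) for seed in range(n_seeds)]
    stats = {"mean": statistics.mean(scores),
             "std": statistics.stdev(scores) if len(scores) > 1 else 0.0}
    return best, checkpoint, stats                  # step 5: best seeds next stage
```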
Component 3: Parallelized Agentic Tree Search (BFTS)
Purpose: Explore the experimental search space through structured tree expansion, balancing exploitation (refining promising paths) and exploration (debugging failed paths).
Node definition: Each node is a tuple containing:
- Experiment script (Python code)
- High-level plan description (natural language)
- Execution error trace (if applicable)
- Experiment runtime
- Performance metrics
- LLM feedback on execution
- Visualization script
- Generated figure file paths
- VLM feedback on figures
- Node status: buggy | non-buggy
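The node tuple listed above maps naturally onto a dataclass. v2 lists dataclasses-json among its dependencies for serializing node state, but the exact layout below is a sketch, not the repository's definition:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Sketch of a BFTS node, transcribing the tuple above field by field."""
    code: str                                    # experiment script
    plan: str                                    # high-level plan description
    error_trace: Optional[str] = None            # execution error, if any
    runtime_s: float = 0.0                       # experiment runtime
    metrics: dict = field(default_factory=dict)  # performance metrics
    llm_feedback: str = ""                       # LLM feedback on execution
    plot_code: str = ""                          # visualization script
    figure_paths: List[str] = field(default_factory=list)
    vlm_feedback: str = ""                       # VLM feedback on figures
    buggy: bool = False                          # node status
```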
Node types:
| Type | Purpose | Created During | Expansion Rule |
|---|---|---|---|
| Standard (non-buggy) | Successfully executed experiment | All stages | Refinement: improve code for better performance |
| Buggy | Failed execution or VLM rejection | All stages | Debugging: fix errors using recorded error trace |
| Hyperparameter | Systematic hyperparameter exploration | Stage 2 | HP variation with redundancy tracking |
| Ablation | Component importance assessment | Stage 4 | Ablation conditions with redundancy tracking |
| Replication | Statistical robustness via different seeds | End of each stage | Same code, different random seed |
| Aggregation | Combine replication results | After replications | No new experiments; generates combined plots |
Selection policy:
- With probability debug_prob: select a buggy node for debugging
- Otherwise: best-first selection among non-buggy nodes
- Best-first evaluation by LLM based on: performance metrics, training dynamics, plot quality
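A minimal sketch of this policy, assuming dict-shaped nodes and a `rank` callable standing in for the LLM evaluator (neither is a v2 internal):

```python
import random

def select_node(nodes, debug_prob, rank, rng=None):
    """Sketch of BFTS node selection under the policy above."""
    rng = rng or random.Random()
    buggy = [n for n in nodes if n["buggy"]]
    healthy = [n for n in nodes if not n["buggy"]]
    # With probability debug_prob, pick a buggy node for a debugging attempt.
    if buggy and rng.random() < debug_prob:
        return rng.choice(buggy)
    # Otherwise: best-first selection among non-buggy nodes.
    return max(healthy, key=rank)
```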
Component 4: VLM Feedback Loop
Purpose: Iteratively evaluate and refine figure quality at two points in the pipeline.
Integration Point 1 — During Experimentation:
Experiment execution → metrics saved to .npy files
→ Plotting code generates figures
→ VLM receives: figures + code context
→ VLM checks: label clarity, legend presence, data accuracy
→ If issues found: node marked buggy, feedback recorded
→ If passes: node marked non-buggy
Integration Point 2 — During Paper Writing:
Paper draft generated → screenshots extracted of each figure
→ VLM receives: figure images + captions + referencing text
→ VLM checks:
- Figure-caption alignment
- Visual clarity (labels, legends, axes)
- Duplication between main text and appendix
- Data presentation accuracy
→ Feedback integrated into reflection stage
→ Iterative refinement until quality threshold met
Key improvement over v1: v1 had no visual understanding whatsoever — the automated reviewer only processed text. v2's VLM integration enables the system to detect and correct visual presentation issues that would be immediately apparent to a human reviewer.
Component 5: Paper Writing Engine
Purpose: Generate a complete scientific manuscript from experimental results.
v2 approach (simplified from v1):
1. Plot aggregation: Consolidate figures from best experimental nodes and replications
2. Single-pass generation: Reasoning model (o1-preview) generates complete LaTeX manuscript in one pass
3. Citation gathering: Semantic Scholar integration adds relevant references (up to 20 rounds)
4. VLM reflection: Vision-language model reviews figures and captions
5. Reflection stage: Reasoning model reviews and refines the complete manuscript
6. Output: Compiled PDF
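This flow can be sketched with stand-in callables for the model APIs; none of the names below come from the repository:

```python
def write_paper(llm, vlm, gather_citations, figures, results,
                max_citation_rounds=20):
    """Sketch of v2's single-pass writing flow with a reflection pass."""
    # Single-pass manuscript generation by a reasoning model.
    draft = llm(f"Write a complete LaTeX manuscript for: {results}")
    # Citation gathering: up to 20 rounds, stopping once nothing new is found.
    for _ in range(max_citation_rounds):
        new_refs = gather_citations(draft)
        if not new_refs:
            break
        draft += new_refs
    # VLM figure review feeds a final reflection pass over the manuscript.
    figure_feedback = vlm(figures, draft)
    return llm(f"Reflect on and refine:\n{draft}\nFigure notes: {figure_feedback}")
```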
Paper formats supported:
- normal: Standard 8-page conference format
- icbinb: 4-page workshop format (used for ICLR 2025 ICBINB submissions)
Timing: Writing stage takes approximately 20–30 minutes total.
Component 6: Automated Review System
Purpose: Evaluate generated manuscripts using both text and visual analysis.
Dual review pipeline:
1. LLM text review (perform_llm_review.py): Evaluates manuscript text for clarity, methodology, and scientific rigor
2. VLM visual review (perform_vlm_review.py): Evaluates figures, captions, and their alignment with text
Improvement over v1: v1's reviewer was text-only and was validated against human scores as a proxy. v2's addition of VLM review adds a visual dimension, though the ultimate validation came from real human peer review rather than automated scoring alone.
11 Core Mechanisms (Detailed)
Mechanism 1: Template Elimination
The shift from template-dependent to template-free operation is the single most architecturally significant change. Understanding how it works requires tracing the information flow:
v1 Information Flow (Template-Dependent):
Human creates:
templates/nanoGPT/
├── experiment.py (baseline code — hundreds of lines)
├── plot.py (visualization code)
├── prompt.json (domain context for LLM)
└── seed_ideas.json (example ideas)
LLM receives: template code + context → generates incremental modifications
Aider applies: diff patches to template code
Execution: modified template code runs in same environment
v2 Information Flow (Template-Free):
Human creates:
ideas/my_topic.md (markdown description: title, keywords, TL;DR, abstract)
(typically <1 page of text)
Ideation LLM: generates structured research idea from topic description
BFTS LLM: generates complete experiment code from scratch (no template)
→ Stage 1: minimal working prototype
→ Stage 2: hyperparameter-optimized version
→ Stage 3: full research agenda implementation
→ Stage 4: ablation variants
Datasets: loaded via Hugging Face Hub (one-line API call)
How the system bootstraps without a template:
1. The ideation stage produces a concrete hypothesis and experimental design
2. Stage 1 of BFTS generates a minimal prototype from scratch — the LLM writes complete Python code, not modifications
3. Multiple parallel root nodes provide diversity in initial implementations
4. The tree search explores variations, with the LLM evaluator selecting the most promising directions
5. Each subsequent stage builds on the best node from the previous stage, so code quality improves progressively
Trade-off: Without a human-verified template as a starting point, the system is more likely to produce incorrect or poorly structured code. This is offset by the tree search's ability to explore multiple paths and recover from failures via debugging nodes.
Mechanism 2: Best-First Tree Search (BFTS)
The BFTS algorithm is adapted from AIDE (Jiang et al., 2025) with modifications for the multi-stage scientific experimentation context:
AIDE's original BFTS (for ML engineering tasks):
- Each node = a potential solution with a scalar evaluation score (e.g., validation accuracy)
- Nodes selected for expansion based on score ranking
- Single-stage: continuous refinement toward a single metric

AI Scientist v2's adapted BFTS (for scientific discovery):
- Each node = experiment code + results + figures + LLM feedback (rich metadata)
- Node evaluation by LLM rather than scalar metric (qualitative assessment)
- Multi-stage: four distinct stages with different objectives and node types
- Additional node types (hyperparameter, ablation, replication, aggregation) beyond AIDE's standard/buggy distinction
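The richer node state can be sketched as a Python dataclass. This is an illustrative reconstruction; the field and enum names are assumptions, not the repository's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class NodeStatus(Enum):
    PENDING = "pending"
    BUGGY = "buggy"
    NON_BUGGY = "non_buggy"

class NodeType(Enum):
    STANDARD = "standard"
    DEBUG = "debug"
    HYPERPARAM = "hyperparam"
    ABLATION = "ablation"
    REPLICATION = "replication"
    AGGREGATION = "aggregation"

@dataclass
class TreeNode:
    """One state in the search tree: complete code plus evaluation metadata."""
    code: str                                    # full experiment script
    plan: str                                    # natural-language intent
    node_type: NodeType = NodeType.STANDARD
    status: NodeStatus = NodeStatus.PENDING
    metrics: dict = field(default_factory=dict)  # loaded from .npy results
    figures: list = field(default_factory=list)  # paths to generated plots
    error_trace: Optional[str] = None            # populated for buggy nodes
    llm_feedback: Optional[str] = None           # qualitative LLM assessment
    vlm_feedback: Optional[str] = None           # figure-quality feedback
    parent: Optional["TreeNode"] = None          # experimental lineage

node = TreeNode(code="print('hello')", plan="minimal prototype")
```

The parent link is what distinguishes this from a flat population: lineage and sibling structure are recoverable from the nodes themselves.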
Algorithm pseudocode:
function BFTS_Stage(root_node, config):
tree ← initialize_tree(root_node)
for step in range(config.steps):
# Select nodes for expansion
candidates ← []
for _ in range(config.num_workers):
if random() < config.debug_prob AND tree.has_buggy_nodes():
node ← select_buggy_node(tree)
child ← create_debug_child(node)
else:
node ← llm_best_first_select(tree.non_buggy_nodes())
child ← create_refinement_child(node)
candidates.append(child)
# Parallel execution
parallel_execute(candidates)
# Post-execution evaluation
for child in candidates:
if child.execution_failed:
child.status ← BUGGY
child.error_trace ← capture_error()
else:
child.metrics ← load_numpy_results()
child.figures ← run_plotting_code()
vlm_feedback ← vlm_evaluate(child.figures)
if vlm_feedback.has_issues:
child.status ← BUGGY
child.vlm_feedback ← vlm_feedback
else:
child.status ← NON_BUGGY
tree.add_node(child)
# Stage completion
best_node ← llm_evaluate_and_select(tree.non_buggy_nodes())
replications ← create_replication_nodes(best_node, num_seeds=3)
parallel_execute(replications)
aggregation ← create_aggregation_node(replications)
return best_node, aggregation
Key properties of the search:
- Anytime: Can be stopped at any step and still yield the current best node
- Recoverable: Buggy nodes are not discarded; they can be debugged in future steps
- Parallel: Multiple nodes expanded concurrently per step
- Qualitative evaluation: LLM-based node scoring captures aspects that scalar metrics miss (code quality, experimental design, visualization clarity)
- Stage-aware: Different stages use different node types and stopping criteria
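As a concrete illustration of the selection logic, the loop below is a runnable toy version of one BFTS step. The LLM evaluator is replaced by a numeric score and execution by a stub; names, scores, and probabilities are illustrative only:

```python
import random

def best_first_step(tree, debug_prob=0.3, num_workers=3):
    """Select num_workers children to expand: debug a buggy node with
    probability debug_prob, otherwise refine the best non-buggy node."""
    buggy = [n for n in tree if n["status"] == "buggy"]
    good = [n for n in tree if n["status"] == "non_buggy"]
    children = []
    for _ in range(num_workers):
        if buggy and random.random() < debug_prob:
            parent, kind = random.choice(buggy), "debug"
        else:
            # stand-in for LLM best-first selection: highest score wins
            parent, kind = max(good, key=lambda n: n["score"]), "refine"
        children.append({"parent": parent, "kind": kind,
                         "status": "non_buggy",
                         "score": parent["score"] + random.random()})
    return children

random.seed(0)
tree = [{"status": "non_buggy", "score": 1.0, "parent": None, "kind": "root"},
        {"status": "buggy", "score": 0.0, "parent": None, "kind": "root"}]
for _ in range(5):          # 5 steps, 3 workers each
    tree.extend(best_first_step(tree))
best = max((n for n in tree if n["status"] == "non_buggy"),
           key=lambda n: n["score"])
```

Note the anytime property in miniature: `best` is well-defined after every step, and buggy roots remain in the tree as debugging candidates rather than being discarded.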
Mechanism 3: VLM-Integrated Figure Refinement
The VLM feedback loop operates as a quality gate at two points:
During Experimentation (per-node):
Node execution succeeds
→ System reads .npy result files
→ LLM generates plotting code
→ Plotting code executes → figure files
→ VLM receives figure images + experiment context
→ VLM evaluates:
✓ Are axes labeled?
✓ Is there a legend?
✓ Do data values match metrics?
✓ Is the visualization type appropriate?
✓ Are there any misleading elements?
→ VLM returns structured feedback
→ If any check fails: node marked buggy
→ Feedback stored for future debugging attempts
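The per-node quality gate can be sketched as a function that folds structured VLM feedback into the buggy/non-buggy decision. The check names are hypothetical; the paper does not specify the feedback schema at this level of detail:

```python
def apply_vlm_gate(node, vlm_feedback):
    """Mark a node buggy if any figure-quality check failed; store the
    feedback either way so later debugging attempts can use it."""
    checks = ("axes_labeled", "has_legend", "values_match_metrics",
              "plot_type_appropriate", "no_misleading_elements")
    failed = [c for c in checks if not vlm_feedback.get(c, False)]
    node["vlm_feedback"] = vlm_feedback
    node["status"] = "buggy" if failed else "non_buggy"
    return failed

node = {"status": "pending"}
failed = apply_vlm_gate(node, {"axes_labeled": True,
                               "has_legend": False,
                               "values_match_metrics": True,
                               "plot_type_appropriate": True,
                               "no_misleading_elements": True})
# the single failed legend check is enough to mark the node buggy
```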
During Paper Writing (manuscript-level):
LaTeX manuscript generated
→ Screenshot each figure from rendered PDF
→ Extract caption text and referencing text ("Figure X")
→ VLM receives: image + caption + reference text
→ VLM evaluates:
✓ Does the caption accurately describe the figure?
✓ Does the referencing text correctly interpret the figure?
✓ Are there duplicate figures in main text and appendix?
✓ Is visual quality sufficient for publication?
→ Feedback integrated into reflection stage
→ Writing model revises manuscript based on VLM feedback
→ Iterate until quality threshold or max iterations
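The manuscript-level loop reduces to iterate-until-threshold control flow. A minimal sketch with stand-in review and revision functions and an assumed integer quality score (the real system uses VLM judgments, not a scalar):

```python
def refine_manuscript(draft, review_fn, revise_fn, threshold, max_iters=4):
    """Alternate review and revision until the quality score clears the
    threshold or the iteration budget is exhausted."""
    for i in range(max_iters):
        score, feedback = review_fn(draft)
        if score >= threshold:
            return draft, i
        draft = revise_fn(draft, feedback)
    return draft, max_iters

# toy stand-ins: each revision raises the score by 20 points
def review(d):
    return d["quality"], "fix captions"

def revise(d, feedback):
    return {"quality": d["quality"] + 20}

final, iters = refine_manuscript({"quality": 50}, review, revise,
                                 threshold=90)
```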
Mechanism 4: Experiment Manager State Machine
The experiment manager operates as a finite state machine over the four stages:
┌──────────────────────────────────────────────────────────────┐
│ │
▼ │
┌──────────────┐ best ┌──────────────┐ best ┌──────────────┐
│ STAGE 1 │───node───→│ STAGE 2 │───node───→│ STAGE 3 │
│ Preliminary │ │ HP Tuning │ │ Agenda │
│Investigation │ │ │ │ Execution │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
Stop: working Stop: convergence + Stop: budget
prototype ≥2 datasets pass exhausted
│
best node
│
▼
┌──────────────┐
│ STAGE 4 │
│ Ablation │
│ Studies │
└──────────────┘
│
Stop: budget
exhausted
│
▼
┌──────────────┐
│ MANUSCRIPT │
│ GENERATION │
└──────────────┘
State transitions:
- Each stage runs BFTS independently
- Best node from stage N becomes root node of stage N+1
- Checkpoints saved at each transition
- Replication runs launched at each transition for statistics
- Manager decides when stopping criteria are met

Stopping criteria specifics:
- Stage 1: Binary — did any node produce a running prototype?
- Stage 2: Convergence — training curves stabilize across datasets
- Stage 3: Budget — fixed compute allocation, with complexity escalation if runs are too fast
- Stage 4: Budget — fixed compute allocation
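The linear stage progression can be sketched as a transition table driven by the manager. Stage names and the `run_stage` callback are illustrative, not the codebase's identifiers:

```python
from enum import Enum, auto

class Stage(Enum):
    PRELIMINARY = auto()
    HP_TUNING = auto()
    AGENDA = auto()
    ABLATION = auto()
    MANUSCRIPT = auto()

# linear transition table: best node from stage N seeds stage N+1
NEXT = {Stage.PRELIMINARY: Stage.HP_TUNING,
        Stage.HP_TUNING: Stage.AGENDA,
        Stage.AGENDA: Stage.ABLATION,
        Stage.ABLATION: Stage.MANUSCRIPT}

def run_pipeline(run_stage):
    """Drive the four BFTS stages, threading the best node forward.
    run_stage(stage, root) stands in for one full BFTS stage."""
    stage, best, history = Stage.PRELIMINARY, None, []
    while stage is not Stage.MANUSCRIPT:
        best = run_stage(stage, root=best)  # checkpoint would be saved here
        history.append((stage, best))
        stage = NEXT[stage]
    return best, history

best, history = run_pipeline(lambda s, root: f"best-of-{s.name}")
```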
Mechanism 5: Dataset Loading Strategy
Rather than relying on locally packaged datasets (as v1 templates did), v2 uses a standardized approach:
from datasets import load_dataset

dataset = load_dataset("dataset_name")
Advantages:
- No manual dataset preparation per template
- Access to thousands of Hugging Face Hub datasets
- Standardized train/validation/test splits
- Automatic downloading and caching
Limitations (acknowledged by authors):
- Not all datasets support load_dataset
- Some datasets require custom preprocessing
- No guarantee the LLM will choose appropriate datasets for the hypothesis
- No built-in validation that train/test splits are properly separated in the generated code
12 Programming Language
Implementation Language
Python 3.11 — the entire system is implemented in Python.
Key Libraries and Their Roles
| Library | Version | Role |
|---|---|---|
| `openai` | Latest | OpenAI API client (GPT-4o, o1, o3-mini) |
| `anthropic[bedrock]` | Latest | Claude models via AWS Bedrock |
| `omegaconf` | Latest | Hierarchical configuration management |
| `python-igraph` | Latest | Tree data structure for BFTS |
| `datasets` (Hugging Face) | Latest | Dataset loading for experiments |
| `numpy` | Latest | Experiment result storage (.npy) |
| `matplotlib` | Latest | Figure generation |
| `seaborn` | Latest | Statistical visualization |
| `pymupdf4llm` | Latest | PDF processing for LLM review |
| `rich` | Latest | Terminal output and logging |
| `jsonschema` | Latest | Configuration validation |
| `dataclasses-json` | Latest | Node state serialization |
Code Execution Model
The system generates and executes Python code in a subprocess:
- Experiment code is written to .py files
- Executed via Python interpreter
- stdout/stderr captured for error traces
- Results saved to structured numpy files
- Plotting code executed separately
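A minimal sketch of this execution model using only the standard library; the helper name and timeout value are assumptions, and no sandboxing is shown (matching the codebase, which leaves sandboxing to Docker):

```python
import os
import subprocess
import sys
import tempfile

def run_experiment(code: str, timeout: int = 60):
    """Write generated code to a .py file, run it in a subprocess, and
    capture stdout/stderr so failures yield a usable error trace."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True,
                              timeout=timeout)
        ok = proc.returncode == 0
        return ok, proc.stdout, proc.stderr
    finally:
        os.unlink(path)

ok, out, err = run_experiment("print(2 + 2)")
```

A failing script would return `ok == False` with the traceback in `err`, which is exactly the error trace a buggy node stores for later debugging attempts.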
Safety Considerations
"This codebase will execute Large Language Model (LLM)-written code. There are various risks and challenges associated with this autonomy, including the potential use of dangerous packages, uncontrolled web access, and the possibility of spawning unintended processes."
The authors recommend Docker containers for sandboxing. No built-in sandboxing is provided in the codebase itself.
13 Memory Management
Intra-Run Memory
The tree search maintains memory through the tree structure itself:
Tree Node Memory:
├── Code: complete experiment script preserved at each node
├── Plan: natural language description of what this node implements
├── Metrics: experimental results stored in .npy files
├── Figures: generated visualization files
├── Errors: full error traces for buggy nodes
├── Feedback: LLM and VLM feedback recorded per node
└── Status: buggy/non-buggy classification
Inter-Node Memory:
├── Parent-child links preserve experimental lineage
├── Sibling relationships show parallel exploration paths
├── Best-node selection carries forward across stages
└── Replication nodes share parent code, vary seeds
Cross-Stage Memory
The experiment manager maintains state across stages:
- Checkpoints: Saved at each stage completion
- Best node propagation: Selected node becomes root of next stage
- Hyperparameter history: Stage 2 tracks tested configurations to avoid redundancy
- Ablation history: Stage 4 tracks tested conditions similarly
- Replication statistics: Mean/std computed and carried forward to manuscript
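The replication-statistics step reduces to a per-metric mean and standard deviation over seed runs, sketched here with the standard library (the function name and dict layout are illustrative):

```python
import statistics

def aggregate_replications(runs):
    """Combine per-seed metric dicts into (mean, std) per metric, as
    carried forward into the manuscript's result tables."""
    keys = runs[0].keys()
    return {k: (statistics.mean(r[k] for r in runs),
                statistics.stdev(r[k] for r in runs))
            for k in keys}

# three replication nodes: same code, different random seeds
runs = [{"val_acc": 0.81}, {"val_acc": 0.79}, {"val_acc": 0.80}]
stats = aggregate_replications(runs)
```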
Cross-Run Memory
No persistent memory exists between separate runs. Each invocation of launch_scientist_bfts.py starts fresh. There is:
- No skills library or knowledge base carried across experiments
- No cross-experiment transfer learning
- No persistent embedding store for novelty comparison
- No accumulated heuristics from previous runs
This is a significant limitation compared to systems like EurekaClaw (which maintains a skills library) or AIRA₂ (which accumulates population knowledge across evolutionary generations).
Novelty Memory During Ideation
During idea generation, Semantic Scholar queries provide a form of external memory:
- The system checks proposed ideas against published literature
- Previously proposed ideas within the same ideation run are tracked
- This prevents redundant idea generation within a single session
- No persistence across separate ideation runs
Visualization
The system generates unified_tree_viz.html — an interactive HTML visualization of the complete tree search for each run. This serves as a post-hoc analysis tool rather than a runtime memory mechanism, but enables human researchers to understand the search trajectory.
14 Continued Learning
Within a Single Run
The tree search implements a form of within-run learning:
- Error recovery: Buggy nodes' error traces inform debugging attempts — the system learns from its mistakes within a run
- Progressive refinement: Each stage builds on the best outcome of the previous stage — cumulative improvement
- VLM feedback integration: Figure quality issues detected early inform later plotting attempts
- Hyperparameter tracking: Stage 2 avoids re-testing configurations, learning from previous attempts
- Ablation tracking: Stage 4 similarly avoids redundant experiments
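The configuration-tracking idea behind the last two points can be sketched as a seen-set over canonicalized configs; the class and method names are hypothetical:

```python
import json

class ConfigTracker:
    """Remember tested hyperparameter configurations within a run so
    Stage 2 (and analogously Stage 4) never repeats an experiment."""
    def __init__(self):
        self._seen = set()

    def _key(self, config: dict) -> str:
        # sort keys so {"lr": 0.001, "bs": 32} and {"bs": 32, "lr": 0.001}
        # canonicalize to the same string
        return json.dumps(config, sort_keys=True)

    def propose(self, config: dict) -> bool:
        """Return True if the config is new (and record it)."""
        key = self._key(config)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

tracker = ConfigTracker()
tracker.propose({"lr": 0.001, "batch_size": 32})   # new -> True
tracker.propose({"batch_size": 32, "lr": 0.001})   # duplicate -> False
```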
Across Runs
No cross-run learning exists. Each run is independent with no mechanism for:
- Transferring successful strategies to new topics
- Building a library of reusable experimental components
- Accumulating domain expertise over time
- Meta-learning about which experimental approaches work best
Comparison with Learning-Enabled Systems
| System | Within-Run Learning | Cross-Run Learning | Mechanism |
|---|---|---|---|
| AI Scientist v2 | Tree search refinement | None | BFTS tree state |
| AI Scientist v1 | Aider conversation history | None | Linear context |
| EurekaClaw | Evolutionary memory | Skills library | Persistent vector store |
| AIRA₂ | Population evolution | None (per-task) | Evolutionary memory |
| AutoResearchClaw | Agent memory | Session memory | Knowledge graph |
Potential for Cross-Run Learning
The paper does not discuss cross-run learning, but the architecture would naturally support it through:
- Tree node embeddings: Successful node codes could be embedded and retrieved for future runs
- Strategy library: Successful experimental strategies could be abstracted and reused
- Domain models: Accumulating domain knowledge from multiple runs on related topics
- Meta-heuristics: Learning which BFTS configurations work best for which types of research questions
15 Applications
Primary Application: Automated ML Research
The system's demonstrated application is generating workshop-level research manuscripts in machine learning. The three ICLR submissions spanned:
- Compositional generalization (regularization techniques for sequence models)
- Agricultural pest detection (applied deep learning)
- Model calibration under label noise (ML robustness)
This breadth — from theoretical ML to applied computer vision — demonstrates the domain-general ambition of the template-free approach.
Application Domain Constraints
| Constraint | Impact |
|---|---|
| ML-only | System assumes ML experiments (Python + PyTorch/TensorFlow); cannot conduct wet-lab, social science, or theoretical mathematics research |
| Hugging Face datasets | Experiments limited to datasets available via Hugging Face Hub or those the LLM can generate synthetically |
| GPU-dependent | Cannot run on CPU-only machines; limits accessibility |
| Single-paper scope | Each run produces one paper; cannot conduct research programs spanning multiple publications |
| Workshop-level quality | Current capability is workshop-level; not yet suitable for main-track conferences |
Potential Extensions
Near-term (system-level improvements):
- Multi-idea orchestration within a single session
- Persistent knowledge base across runs
- Improved code sandboxing
- Support for non-ML experimental domains (e.g., bioinformatics pipelines)
- Integration with laboratory robotics for wet-lab experiments

Medium-term (capability improvements):
- Conference-level paper quality through deeper tree search and better LLMs
- Multi-modal experiments (vision, NLP, robotics simultaneously)
- Collaborative multi-agent research teams
- Automated rebuttal writing for reviewer feedback
- Integration with preprint servers for automated submission

Long-term (paradigm implications):
- Continuous scientific discovery loops (open-ended research programs)
- Cross-disciplinary hypothesis generation
- AI-driven meta-research (studying what makes research effective)
- Autonomous identification of high-impact research directions
Safety and Ethics Considerations
The paper dedicates significant discussion to safety and ethical implications:
1. Code execution risk: LLM-generated code may be unsafe (e.g., using dangerous packages, spawning processes, accessing the network). Docker sandboxing is recommended but not enforced.
2. Scientific integrity: The system can produce hallucinated citations, fabricated results, and misleading claims. Without human oversight, these could enter the scientific record.
3. Peer review ethics: The ICLR experiment was conducted with full transparency and pre-arranged withdrawal. However, the existence of such systems raises questions:
   - How should venues handle AI-generated submissions?
   - Should AI-generated papers be required to disclose their provenance?
   - What is the reviewer's responsibility when AI submissions increase in volume?
4. Acceleration risks: If scaled, such systems could flood peer review with low-quality submissions, overwhelming human reviewers. The paper acknowledges this and advocates for community discussion.
5. Mandatory disclosure: The code license requires users to "clearly and prominently disclose the use of AI in any resulting scientific manuscripts or papers."
Significance for the Field of Automated Scientific Discovery
The AI Scientist v2 represents a qualitative threshold crossing: the first time a fully autonomous system produced work accepted by human peer reviewers at a recognized venue. While the practical significance is modest (one workshop paper), the conceptual significance is substantial:
- Proof of concept validated externally: Unlike v1's self-evaluation via automated reviewer, v2's validation came from independent human experts who were not told which papers were AI-generated.
- Template-free generalization demonstrated: The three submitted papers covered meaningfully different ML topics, not just variations within a single domain.
- Tree search superiority confirmed: The multi-stage BFTS approach enabled deeper experimental exploration than v1's linear pipeline, reflected in the experimental designs of the submitted papers.
- The gap to conference-level is clear: The authors' honest assessment that none of the papers meet conference standards provides a concrete improvement target for the field.
Positioning Within OmniEvolve
The AI Scientist v2 is relevant to OmniEvolve's design in several ways:
| v2 Component | OmniEvolve Parallel | Key Lesson |
|---|---|---|
| BFTS tree search | Search backends (island-based evolution) | Tree search with LLM evaluation is effective for code-level exploration |
| Experiment Manager | Orchestrator lifecycle management | Multi-stage search with explicit stopping criteria outperforms single-stage |
| VLM feedback loop | Cascade evaluator (multi-signal) | Visual evaluation adds signal that text-only evaluation misses |
| No cross-run learning | Knowledge module (skills, logs) | OmniEvolve's learning infrastructure addresses v2's biggest architectural gap |
| Template-free design | Benchmark-agnostic evaluation | Domain generality requires rethinking how context is provided to the search |
| `bfts_config.yaml` | Pydantic config schemas | Complex search methods need structured, validated configuration |
| Parallel node execution | Island migration and parallel evaluation | Parallelism with information sharing outperforms independent parallel runs |
Appendix A: Complete v1 → v2 Feature Comparison
| Feature | AI Scientist v1 | AI Scientist v2 |
|---|---|---|
| Release date | August 2024 | April 2025 |
| arXiv | 2408.06292 | 2504.08066 |
| Code availability | Open source | Open source |
| License | Apache 2.0 | Responsible AI Source Code License |
| Template required | Yes (per domain) | No |
| Experimentation style | Linear sequential | Best-First Tree Search |
| Parallel experiments | No | Yes (num_workers concurrent) |
| Code editing method | Aider (diff-based) | Direct LLM generation |
| VLM integration | None | During experiments + writing |
| Paper writing | Multi-round Aider editing | Single-pass o1 + reflection |
| Evaluation | Automated reviewer only | Real human peer review + automated |
| Best result | Exceeded automated reviewer threshold | Peer-review acceptance at ICLR workshop |
| Cost per paper | ~$15 | ~$20–25 |
| Domains demonstrated | 3 (NanoGPT, Diffusion, Grokking) | Open (any ML topic via markdown) |
| Dataset handling | Bundled in templates | Hugging Face Hub |
| Configuration | CLI arguments only | CLI + YAML (bfts_config.yaml) |
| Model routing | Single `--model` | Task-specific model selection |
| Tree search base | None | AIDE (WecoAI) |
| Semantic Scholar | Post-hoc novelty check | In-loop during ideation |
| Paper formats | 8-page | 8-page + 4-page ICBINB |
| Success rate | Higher (within template scope) | Lower (broader scope) |
| Sandboxing | Recommended | Recommended (same) |
Appendix B: ICLR 2025 ICBINB Workshop Experiment Protocol
Experimental Design
- Pre-registration: Sakana AI coordinated with ICBINB workshop organizers and ICLR leadership before submission
- IRB approval: Obtained institutional review board approval for the human evaluation experiment
- Reviewer consent: Reviewers were informed that "some submissions might be AI-generated" and could opt out
- Blind review: AI-generated papers were mixed with 40 human-written submissions (43 total)
- No disclosure during review: Reviewers did not know which papers were AI-generated
- Post-review disclosure: All reviewers were informed after the review process completed
- Pre-arranged withdrawal: Accepted AI-generated papers would be withdrawn regardless of scores
Results Detail
Paper 1: Compositional Regularization (Accepted)
- Investigated a temporal consistency regularizer on LSTM embeddings
- Evaluated on synthetic arithmetic expression datasets
- Found that compositional regularization does NOT significantly improve performance
- Reviewers appreciated the negative result and clear identification of challenges
- Scores: 6 (weak accept), 7 (accept), 6 (weak accept) → average 6.33
- Placed in top 45% of 43 submissions

Paper 2: Real-World Pest Detection (Rejected)
- Applied deep learning to agricultural pest classification
- Scores: 3, 7, 4 → average 4.67
- High variance in reviewer scores suggests inconsistent quality

Paper 3: Label Noise & Model Calibration (Rejected)
- Studied effect of label noise on model calibration
- Scores: 3, 3, 3 → average 3.00
- Uniformly low scores indicate consistent quality issues
Author Self-Assessment
The Sakana AI team conducted an internal evaluation and identified:
- Citation hallucinations in all three papers
- Insufficient methodological rigor for conference level
- Ambiguity in method descriptions (e.g., unclear which network component is regularized)
- Potential dataset overlap issues
- Figure caption inaccuracies

Their assessment matched peer review: one paper was workshop-worthy, two were not.
Appendix C: Glossary of Key Terms
| Term | Definition |
|---|---|
| BFTS | Best-First Tree Search — tree search algorithm where the most promising nodes are expanded first, guided by an evaluation function |
| VLM | Vision-Language Model — model that processes both images and text, used here for figure evaluation |
| ICBINB | "I Can't Believe It's Not Better" — ICLR workshop focused on negative results and challenges in deep learning |
| Aider | Open-source AI coding assistant used in v1 for diff-based code editing; removed in v2 |
| AIDE | AI Development Environment by WecoAI — tree search system for ML engineering that inspired v2's BFTS |
| Node | A single state in the tree search, comprising experiment code, results, figures, and metadata |
| Buggy node | A node that failed execution or VLM review |
| Non-buggy node | A node that successfully executed and passed VLM review |
| Experiment Manager | Dedicated agent that coordinates the four-stage experimental lifecycle |
| Semantic Scholar | Academic search engine used for literature search and novelty checking |
| Replication node | Node that re-runs parent experiment with different random seed for statistical robustness |
| Aggregation node | Special node that consolidates results from replication nodes into combined visualizations |