The AI Scientist v2
End-to-end agentic system that produced the first entirely AI-generated, peer-review-accepted workshop paper through progressive tree search over the scientific discovery pipeline.
Organization: Sakana AI / Foerster Lab (University of Oxford) / University of British Columbia / Vector Institute
Published: April 10, 2025
Type: paper (arXiv:2504.08066)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
- Venue: arXiv preprint (cs.AI, cs.CL, cs.LG), April 2025; evaluated at ICLR 2025 Workshop "I Can't Believe It's Not Better" (ICBINB)
- DOI: 10.48550/arXiv.2504.08066
- License: CC-BY 4.0 (paper); Responsible AI Source Code License (code; a RAIL derivative)
- Predecessor: The AI Scientist v1 (Lu et al., arXiv:2408.06292, August 2024)
- Code: github.com/SakanaAI/AI-Scientist-v2
- Workshop Experiment Data: github.com/SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment
- Blog Post: sakana.ai/ai-scientist-v2
- Relation to prior work: Direct successor to AI Scientist v1; tree search component built on AIDE (Jiang et al., 2025); evaluated via formal ICLR peer review rather than automated reviewer alone
The paper positions itself as a milestone: the first demonstration that a fully autonomous AI system can generate a scientific manuscript that passes real human peer review at a recognized venue. The contribution is both systems-level (architectural innovations enabling template-free exploration) and empirical (a controlled experiment with blind peer review).
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Yutaro Yamada* | Sakana AI | Equal contribution; correspondence author |
| Robert Tjarko Lange* | Sakana AI | Equal contribution; correspondence author |
| Cong Lu* | Sakana AI, UBC, Vector Institute | Equal contribution; correspondence author; v1 lead |
| Shengran Hu | Sakana AI, UBC, Vector Institute | — |
| Chris Lu | FLAIR, University of Oxford | v1 first author |
| Jakob Foerster | FLAIR, University of Oxford | Equal advising |
| Jeff Clune | UBC, Vector Institute, Canada CIFAR AI Chair | Equal advising |
| David Ha | Sakana AI | Equal advising; Sakana AI co-founder |
*Equal contribution.
Team size: 8 authors across Sakana AI, the University of British Columbia, the Vector Institute, and the University of Oxford. The team is notably smaller than those behind many competing systems, reflecting Sakana AI's lean research style, and overlaps significantly with v1's (Lu, Lu, Lange, Foerster, Clune, Ha), providing direct continuity.
Key figures:
- David Ha: Former Google Brain researcher; co-founder and CEO of Sakana AI. Known for world models, neuroevolution, and creative AI research.
- Jeff Clune: Pioneer of open-endedness, quality-diversity, and MAP-Elites. His intellectual fingerprint is visible in the tree-search exploration paradigm.
- Jakob Foerster: Oxford faculty working on multi-agent systems and meta-learning. Connects the work to the FLAIR lab's broader agenda on agent coordination.
- Chris Lu and Cong Lu: Co-leads of v1; their transition to v2 ensures architectural continuity while enabling the paradigm shift.
3 Core Contribution
The AI Scientist v2 addresses three fundamental limitations of its predecessor through targeted architectural innovations:
| Limitation in v1 | Problem | v2 Solution |
|---|---|---|
| Template dependency | Required human-authored code templates for each research domain; could not operate without experiment.py, plot.py, and prompt.json hand-crafted per topic | Eliminates templates entirely; starts from a markdown topic description + generated JSON idea file |
| Linear experimentation | Followed a strictly sequential hypothesis → code → execute → analyze pipeline; short-sighted, unable to explore branching hypotheses | Progressive agentic tree search (BFTS) with parallel exploration of multiple experimental directions |
| Text-only review | Automated reviewer evaluated text only; no visual understanding of figures | VLM (Vision-Language Model) feedback loop for iterative figure refinement during both experimentation and paper writing |
The Headline Result
One of three fully autonomous manuscripts submitted to the ICLR 2025 ICBINB workshop received scores of 6, 7, and 6 (average 6.33/10), placing it in the top 45% of all submissions — above the average human acceptance threshold. This is the first documented instance of a fully AI-generated paper passing real human peer review.
What v2 Is NOT
- Not necessarily better than v1 per-paper: The authors explicitly note that v1 with a strong template can produce higher success rates on well-defined tasks
- Not conference-level: The accepted paper was workshop-level; the authors state none of the three manuscripts met main-track conference standards
- Not free of hallucinations: External evaluations found fabricated results, hallucinated methodology, and overestimated novelty in some outputs
- Not a replacement for human scientists: The system demonstrates capability at the workshop level, not the ability to produce trustworthy science independently
Relationship to the Field
Timeline of autonomous scientific discovery systems:
──────────────────────────────────────────────────────────────────────
AI Scientist v1 (Aug 2024) → First end-to-end system; $15/paper;
template-dependent; automated reviewer only
MLAgentBench (2024) → ML experiment automation benchmark
AIDE (2025) → BFTS for ML engineering; MLEBench SOTA
AI Scientist v2 (Apr 2025) → Template-free; agentic tree search;
first real peer-review acceptance
AutoResearchClaw (2025) → Multi-agent research; ReAct + tools
EurekaClaw (2025) → Evolutionary research; skills library
AIRA₂ (Mar 2026) → 8-GPU async evolution; MLE-bench SOTA
──────────────────────────────────────────────────────────────────────
The AI Scientist v2 occupies a unique position: it targets the full scientific pipeline (idea → paper) rather than just ML engineering tasks. AIDE influenced the tree search; the AI Scientist lineage contributes the end-to-end manuscript generation and peer review evaluation.
4 Supported Solutions
Problem Framing
The AI Scientist v2 frames automated scientific discovery as a tree-structured search problem over the space of experimental programs and manuscripts. The unit of search is not a competition solution (as in MLE-bench agents) but a complete scientific workflow: hypothesis, implementation, experiments, analysis, figures, and paper.
Solution Space
The system operates on ML research ideas as the unit of exploration. Each candidate path through the tree represents:
- A scientific hypothesis (generated during ideation)
- An implementation strategy (Python code for experiments)
- Experimental results (metrics, training curves, outputs)
- Visualizations (figures generated by plotting code)
- A manuscript (LaTeX paper with citations, figures, and text)
Search Methods Supported
| Method | Description | Role in v2 |
|---|---|---|
| Best-First Tree Search (BFTS) | LLM-evaluated nodes ranked by experimental quality; best nodes expanded first | Primary search method across all 4 stages |
| Parallel node expansion | Multiple nodes expanded concurrently via num_workers parallel paths | Accelerates exploration; default 3 workers |
| Probabilistic debugging | Buggy nodes selected for debugging with probability debug_prob | Enables error recovery without abandoning investment |
| Stage-gated progression | Tree search organized into 4 sequential stages with explicit stopping criteria | Structured exploration from prototype to ablations |
| LLM-based node evaluation | Dedicated LLM evaluator scores non-buggy nodes for best-first selection | Replaces scalar metrics with qualitative assessment |
What v2 Does NOT Do
- No neural architecture search as a subsystem
- No reinforcement learning or gradient-based optimization of the search itself
- No meta-learning or cross-experiment transfer (each idea starts fresh)
- No automated submission to venues (manuscripts are generated but submission was manual)
- No real-time human-in-the-loop steering during the search (fully autonomous once launched)
- No multi-idea orchestration within a single run (one idea per launch_scientist_bfts.py invocation)
5 LLM Integration
Model Configuration
Unlike v1, which used a single --model argument, v2 employs task-specific model routing, with separate LLMs for different pipeline stages:
| Stage | Recommended Model | Purpose |
|---|---|---|
| Ideation | GPT-4o (2024-05-13) | Idea generation, novelty checking via Semantic Scholar |
| Experimentation (BFTS) | Claude 3.5 Sonnet | Code generation, experiment planning, debugging within tree search |
| Aggregate Plots | o3-mini (2025-01-31) | Consolidating visualization across replication runs |
| Paper Writing | o1-preview (2024-09-12) | Single-pass manuscript generation + reflection |
| Citation Gathering | GPT-4o (2024-11-20) | Literature search integration with Semantic Scholar |
| Paper Review | GPT-4o (2024-11-20) | Automated text review |
| VLM Feedback | VLM (vision-capable model) | Figure quality assessment during experiments and writing |
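The routing in the table above can be captured as a small lookup. This is a sketch: the stage keys and the helper function are illustrative assumptions, not identifiers from the v2 codebase.

```python
# Sketch of v2-style per-stage model routing. The stage -> model mapping
# follows the table above; the helper itself is an illustration only.
STAGE_MODELS = {
    "ideation": "gpt-4o-2024-05-13",
    "experimentation": "claude-3-5-sonnet",   # served via AWS Bedrock
    "aggregate_plots": "o3-mini-2025-01-31",
    "paper_writing": "o1-preview-2024-09-12",
    "citations": "gpt-4o-2024-11-20",
    "review": "gpt-4o-2024-11-20",
}

def model_for(stage: str) -> str:
    """Look up the model for a pipeline stage, failing loudly on unknown names."""
    if stage not in STAGE_MODELS:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(STAGE_MODELS)}")
    return STAGE_MODELS[stage]
```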
LLM Usage Patterns
The system uses LLMs in fundamentally different ways across the pipeline:
1. Code Generation (Experimentation)
Node Expansion Cycle:
LLM generates: (plan_text, experiment_code.py)
→ Execute in Python interpreter
→ If error: record trace, mark buggy, stop
→ If success: save metrics to .npy files
→ LLM generates: plot_code.py
→ Execute plotting code
→ VLM evaluates generated figures
→ If VLM flags issues: mark buggy, record feedback
→ If passes: mark non-buggy, record node metadata
2. Best-First Node Selection
LLM Evaluator receives:
- All non-buggy node descriptions
- Performance metrics per node
- Training dynamics summaries
- Generated plot quality assessments
LLM returns: ranked ordering of nodes for expansion
3. Experiment Manager Agent
After each stage completion:
Manager LLM evaluates all terminal nodes
→ Selects best node as seed for next stage
→ Records checkpoint
→ Launches replication runs for statistics
→ Transitions to next stage
4. Paper Writing
v1 approach: Iterative Aider-based editing (multiple rounds of code-level edits)
v2 approach: Single-pass generation by reasoning model (o1) + separate reflection stage
→ Substantial simplification over v1's incremental writing
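The node expansion cycle (usage pattern 1 above) can be sketched in Python. Here `run_llm` and `vlm_check` are hypothetical stand-ins for the LLM and VLM API calls; neither name nor the dict-shaped node comes from the v2 codebase.

```python
import os
import subprocess
import sys
import tempfile

def expand_node(run_llm, vlm_check, parent_plan):
    """Hypothetical sketch of one BFTS node-expansion cycle."""
    node = {"status": "non-buggy", "error": None, "vlm_feedback": None}
    # LLM proposes a plan plus a complete experiment script.
    plan, code = run_llm(f"Refine this plan and emit full code:\n{parent_plan}")
    node["plan"], node["code"] = plan, code
    # Execute the script; a failure marks the node buggy and ends the cycle.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script = f.name
    try:
        result = subprocess.run([sys.executable, script],
                                capture_output=True, text=True)
    finally:
        os.unlink(script)
    if result.returncode != 0:
        node["status"], node["error"] = "buggy", result.stderr
        return node
    # LLM writes plotting code; the VLM then gates the resulting figures.
    _, node["plot_code"] = run_llm("Write plotting code for the saved metrics")
    passed, feedback = vlm_check(node["plot_code"])
    if not passed:
        node["status"], node["vlm_feedback"] = "buggy", feedback
    return node
```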
Key Architectural Difference from v1
v1 used the Aider coding assistant (aider-chat library) as an intermediary between the LLM and the codebase. Aider managed git history, diff application, and multi-turn code editing conversations. v2 removes Aider entirely — the LLM directly generates complete code files within the tree search loop. This eliminates a layer of abstraction and gives the system more direct control over code structure, at the cost of losing Aider's git-based undo capabilities.
API Integration
| API | Purpose | Required |
|---|---|---|
| OpenAI API | GPT-4o, o1-preview, o3-mini for writing/review/ideation | Yes (for default config) |
| AWS Bedrock | Claude 3.5 Sonnet for experimentation | Yes (for default config) |
| Gemini API | Alternative experimentation model | Optional |
| Semantic Scholar API | Literature search, novelty checking, citation gathering | Optional (recommended) |
6 Key Results
ICLR 2025 Workshop Submission Experiment
Three fully autonomous manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. The experiment was conducted with full transparency:
- Workshop organizers and ICLR leadership were informed in advance
- Reviewers were told "some submissions might be AI-generated" but not which ones
- Reviewers could opt out of reviewing potentially AI-generated papers
- IRB approval was obtained
- Accepted manuscripts would be withdrawn post-review (and were)
| Paper | Topic | Reviewer Scores | Average | Outcome |
|---|---|---|---|---|
| Compositional Regularization | Temporal consistency regularizer on LSTM embeddings for compositional generalization | 6, 7, 6 | 6.33 | Exceeded acceptance threshold |
| Real-World Pest Detection | Deep learning for agricultural pest classification | 3, 7, 4 | 4.67 | Rejected |
| Label Noise & Model Calibration | Effect of label noise on model calibration | 3, 3, 3 | 3.00 | Rejected |
The Accepted Paper: Compositional Regularization
The accepted paper investigated whether incorporating an explicit compositional regularization term into neural network training improves compositional generalization:
- Approach: Penalizes large deviations between embeddings of successive time steps in LSTM-based sequence models
- Hypothesis: Encouraging smoothness in embedding space promotes compositionality
- Datasets: Synthetic arithmetic expression datasets
- Finding: Compositional regularization did not yield significant improvements and occasionally harmed performance — a negative result
- Reviewer assessment: Appreciated for "clearly identifying the challenges of effective compositional regularization and reporting on negative results"
- Ranking: Top 45% of 43 submissions (roughly 20th position)
- Post-review status: Withdrawn as pre-arranged with organizers
Internal Quality Assessment
The authors conducted their own evaluation of all three manuscripts:
| Quality Dimension | Assessment |
|---|---|
| Workshop-level quality | One manuscript (the accepted one) meets workshop standards |
| Conference-level quality | None of the three manuscripts meets main-track standards |
| Citation accuracy | Occasional hallucinated citations (known LLM limitation) |
| Methodological rigor | Adequate for workshops; lacks depth for conferences |
| Figure quality | Improved over v1 via VLM feedback, but some caption inaccuracies remain |
| Code quality | Functional but not always well-structured or documented |
External Evaluations and Known Limitations
Independent analyses (MLR-Bench, Pebblous) identified systemic issues in AI Scientist v2 outputs:
| Issue | Description | Frequency |
|---|---|---|
| Fabricated results | System hides failed experiments and reports them as successful | Found in multiple outputs |
| Hallucinated methodology | Describes methods not actually implemented in code | Intermittent |
| Overestimated novelty | Presents well-known concepts as novel discoveries | Common |
| Dataset overlap | Potential train/test contamination in some experiments | Identified in accepted paper |
| Caption inaccuracies | Figure captions not always matching figure content | Several instances |
Comparison with v1 Results
| Metric | v1 | v2 |
|---|---|---|
| Evaluation method | Automated reviewer only | Real human peer review at ICLR workshop |
| Cost per paper | ~$15 | ~$20–25 |
| Template requirement | Yes (per domain) | No |
| Domain flexibility | 3 specific domains (NanoGPT, 2D Diffusion, Grokking) | Any ML topic describable in markdown |
| Success rate | Higher (within template scope) | Lower (broader, exploratory) |
| Best reviewer score | Exceeded automated reviewer threshold | 6.33/10 average from human reviewers |
| Accepted at real venue | No (not submitted) | Yes (1 of 3 at ICBINB workshop) |
| Paper writing approach | Iterative Aider-based editing | Single-pass o1 generation + reflection |
| Experiment depth | Shallow, linear | Deep, tree-structured, multi-stage |
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Code availability | Fully open-sourced at github.com/SakanaAI/AI-Scientist-v2 |
| Workshop experiment data | Separately published with full manuscripts and reviews |
| Configuration | bfts_config.yaml provides declarative tree search configuration |
| Installation | Conda environment with pinned dependencies; requirements.txt provided |
| Hardware requirements | Explicitly stated: Linux, NVIDIA GPUs, CUDA, PyTorch |
| API requirements | All required API keys documented (OpenAI, AWS Bedrock, Semantic Scholar) |
| Prompts | Included in Appendix B of the paper |
| Hyperparameters | Full sampling hyperparameters in Appendix A |
| Tree visualization | unified_tree_viz.html generated for each run, enabling post-hoc inspection |
Limitations
| Aspect | Concern |
|---|---|
| LLM API dependency | Results depend on specific model versions (Claude 3.5 Sonnet, o1-preview) that may be deprecated or updated |
| Non-determinism | LLM sampling introduces stochasticity; exact tree trajectories are not reproducible across runs |
| Cost barrier | $20–25 per paper attempt; multiple attempts needed for reliable results |
| GPU requirement | NVIDIA GPU with CUDA required; not runnable on CPU or Apple Silicon |
| Success rate variability | "Higher success rates are generally observed when using powerful models like Claude 3.5 Sonnet" — weaker models may fail more often |
| LaTeX dependencies | Requires poppler, chktex, and LaTeX toolchain — brittle cross-platform |
| Semantic Scholar dependency | Rate limits may affect ideation and citation stages without API key |
| Sandbox requirements | Executes LLM-generated code; requires Docker/container isolation for safety |
Predecessor Reproducibility
AI Scientist v1 is fully open-sourced at github.com/SakanaAI/AI-Scientist with templates for three domains. The v2 codebase explicitly acknowledges building on AIDE (WecoAI/aideml) for the tree search component.
8 Compute and API Costs
Cost Breakdown per Paper
Cost structure per paper generation attempt:
┌─────────────────────────────────────────────────┐
│ Stage │ Estimated Cost │
├─────────────────────────────────────────────────┤
│ Idea Generation │ ~$3 │
│ (LLM calls + Semantic Scholar queries) │
├─────────────────────────────────────────────────┤
│ BFTS Experimentation │ $15–20 │
│ (Claude 3.5 Sonnet for code gen/debug/eval) │
├─────────────────────────────────────────────────┤
│ Paper Writing │ ~$5 │
│ (o1-preview + GPT-4o for citations) │
├─────────────────────────────────────────────────┤
│ Total per paper attempt │ ~$20–25 │
└─────────────────────────────────────────────────┘
Comparison with v1 Costs
| Metric | v1 | v2 |
|---|---|---|
| Total per paper | ~$15 | ~$20–25 |
| GPU compute | Minimal (uses template code) | Moderate (runs ML experiments) |
| LLM API | Single model, ~$15 | Multi-model, ~$20 |
| Time to completion | Few hours | Several hours (experimentation) + 20–30 min (writing) |
| Human setup cost | High (template creation per domain) | Low (markdown topic description) |
Hardware Requirements
Minimum configuration:
┌──────────────────────────────────────┐
│ Linux OS (required) │
│ NVIDIA GPU with CUDA support │
│ PyTorch-compatible GPU drivers │
│ Sufficient VRAM for target experiments│
│ Docker/container runtime (recommended)│
└──────────────────────────────────────┘
BFTS Configuration Parameters
The tree search is controlled via bfts_config.yaml:
| Parameter | Description | Default |
|---|---|---|
| num_workers | Parallel exploration paths | 3 |
| steps | Maximum nodes to explore | 21 |
| num_seeds | Initial root nodes per tree | 3 |
| num_drafts | Independent trees to grow (Stage 1) | Configurable |
| max_debug_depth | Max debug attempts before abandoning a node | Configurable |
| debug_prob | Probability of selecting a buggy node for debugging | Configurable |
With num_workers=3 and steps=21, the system explores up to 21 nodes total, expanding 3 concurrently per step. This gives roughly 7 expansion rounds per stage.
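That round count follows directly from the defaults; a one-line check, where the ceiling division for non-divisible cases is my assumption:

```python
import math

def expansion_rounds(steps: int = 21, num_workers: int = 3) -> int:
    """Rounds needed to explore `steps` nodes with `num_workers` parallel paths."""
    # With the defaults, ceil(21 / 3) = 7 expansion rounds per stage.
    return math.ceil(steps / num_workers)
```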
Cost-Effectiveness Analysis
The $20–25 per paper is misleadingly low as a headline number. Important caveats:
- Success rate is not 100%: Many runs fail to produce a viable manuscript; effective cost per publishable paper is significantly higher
- GPU costs not included: If experiments require significant GPU compute (training models), hardware costs add substantially
- Human review not included: The system does not guarantee quality; human review and potential revision would add labor costs
- The $15 v1 comparison is apples-to-oranges: v1 cost was lower because templates did the heavy lifting that v2 must do from scratch
9 Architecture Solution
High-Level Architecture
The AI Scientist v2 architecture consists of two major phases executed sequentially, with the experimentation phase internally organized as a four-stage tree search:
┌─────────────────────────────────────────────────────────────────────────┐
│ AI SCIENTIST v2 PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: IDEATION │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Topic Description (.md) │ │
│ │ ↓ │ │
│ │ LLM Idea Generation (with Semantic Scholar novelty check) │ │
│ │ ↓ │ │
│ │ Structured Ideas (.json) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 2: EXPERIMENTATION (BFTS) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │ │
│ │ │ Stage 1 │→│ Stage 2 │→│ Stage 3 │ │ │
│ │ │ Preliminary │ │ Hyperparameter│ │ Research Agenda │ │ │
│ │ │Investigation │ │ Tuning │ │ Execution │ │ │
│ │ └─────────────┘ └──────────────┘ └────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌────────────────┐ │ │
│ │ Experiment Manager ←───────│ Stage 4 │ │ │
│ │ Agent (coordinates) │ Ablation Studies│ │ │
│ │ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 3: MANUSCRIPT GENERATION │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Plot Aggregation → Paper Writing → Citation Gathering │ │
│ │ → VLM Figure Review → Reflection → Final PDF │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 4: REVIEW │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ LLM Text Review + VLM Figure/Caption Review │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
v1 → v2 Architecture Comparison
AI SCIENTIST v1 (LINEAR):
─────────────────────────────────────────────────────────────
Template Code → Idea Gen → Aider Code Editing → Execute
↑ ↓
(human-authored) Visualize
↓
Paper Write
(iterative)
↓
Auto Review
↓
Improvement
AI SCIENTIST v2 (TREE-STRUCTURED):
─────────────────────────────────────────────────────────────
Topic (.md) → Idea Gen → ┌─ Stage 1: Prototype ─── BFTS ──┐
(+ S2) │ Stage 2: Tune ─── BFTS ──│
│ Stage 3: Agenda ─── BFTS ──│
│ Stage 4: Ablations ─── BFTS ──│
└──────────────────────────────────┘
↓
Plot Aggregation + VLM Review
↓
Single-Pass Paper Writing (o1)
↓
VLM + LLM Review
Orchestration Architecture
The system is orchestrated through two entry points:
- perform_ideation_temp_free.py — Generates research ideas from a topic description
- launch_scientist_bfts.py — Executes the full BFTS experimentation + writing + review pipeline for a single idea
Key architectural decisions:
| Decision | v1 Approach | v2 Approach | Rationale |
|---|---|---|---|
| Configuration | Command-line args only | bfts_config.yaml + CLI args | Tree search has too many parameters for CLI alone |
| Idea scope per run | Multiple ideas per invocation | Single idea per invocation | Each tree search is computationally expensive |
| Model selection | Single --model flag | Separate flags per stage | Different tasks benefit from different model strengths |
| Code editing | Aider-mediated diffs | Direct LLM code generation | Tree search needs complete code per node, not incremental diffs |
| Experiment execution | Sequential, in-process | Parallel, multi-worker | Tree search enables natural parallelization |
| Paper writing | Multi-round Aider editing | Single-pass reasoning model + reflection | Simplifies writing; leverages o1's long-form capabilities |
Module Structure
ai_scientist/
├── perform_ideation_temp_free.py # Phase 1: Idea generation
├── treesearch/ # Phase 2: Core BFTS engine
│ └── perform_experiments_bfts_with_agentmanager # Main entry point
├── perform_plotting.py # Plot aggregation across nodes
├── perform_writeup.py # Paper writing (normal + ICBINB formats)
├── perform_llm_review.py # Text-based automated review
├── perform_vlm_review.py # Vision-based figure/caption review
├── gather_citations.py # Semantic Scholar citation integration
├── ideas/ # Topic descriptions and generated ideas
│ └── i_cant_believe_its_not_better.md # Example topic file
└── llm.py # LLM client creation utilities
bfts_config.yaml # Tree search configuration
launch_scientist_bfts.py # Main orchestrator script
Dependency Architecture
v2 introduces significant new dependencies reflecting the architectural shift:
| Library | Purpose | v1 Status | v2 Status |
|---|---|---|---|
| aider-chat | LLM-mediated code editing | Core dependency | Removed |
| omegaconf | Hierarchical YAML configuration | Not present | Added (for bfts_config.yaml) |
| python-igraph | Graph data structure for tree | Not present | Added (tree management) |
| seaborn | Statistical visualization | Not present | Added |
| rich | Terminal output formatting | Not present | Added |
| jsonschema | JSON validation | Not present | Added |
| dataclasses-json | Serialization of node state | Not present | Added |
| boto3 / botocore | AWS Bedrock for Claude | Not present | Added |
| pymupdf4llm | PDF processing for LLMs | pymupdf (basic) | Upgraded |
| torch | ML framework | Required | Not in requirements.txt (assumed pre-installed) |
| google-generativeai | Gemini API | Present | Removed (Gemini via OpenAI API instead) |
10 Component Breakdown
Component 1: Idea Generation Engine
Purpose: Generate structured research ideas from a high-level topic description, with literature-grounded novelty assessment.
Input: Markdown file with Title, Keywords, TL;DR, and Abstract sections defining the research scope.
Process:
1. LLM generates candidate research ideas based on the topic description
2. Each idea undergoes multiple reflection rounds (--num-reflections, default 5)
3. Semantic Scholar is queried in-loop for novelty checking and related work identification
4. Ideas are refined based on literature context
5. Output: structured JSON with hypotheses, proposed experiments, and related work
Output: ideas/<topic_name>.json containing a list of structured research ideas.
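A minimal validator illustrates the structured-JSON contract of the output. The field names here are assumptions about the schema, not the repository's actual keys:

```python
def validate_idea(idea: dict) -> list:
    """Return a list of problems with a generated idea record (empty = OK).

    The required fields below are illustrative guesses at the shape of an
    entry in ideas/<topic_name>.json; the real schema may differ.
    """
    required = {
        "title": str,
        "hypothesis": str,
        "experiments": list,   # proposed experiments
        "related_work": list,  # literature surfaced via Semantic Scholar
    }
    problems = []
    for name, expected_type in required.items():
        if name not in idea:
            problems.append(f"missing field: {name}")
        elif not isinstance(idea[name], expected_type):
            problems.append(f"{name} should be a {expected_type.__name__}")
    return problems
```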
Key difference from v1: v1's ideation was constrained by the code template — ideas were incremental modifications to existing code. v2 starts from abstract concepts (like a grant proposal) before any code exists.
Component 2: Experiment Progress Manager Agent
Purpose: Coordinate the four-stage experimental lifecycle, enforcing structure while allowing flexible exploration within each stage.
Stage definitions and stopping criteria:
Stage 1: PRELIMINARY INVESTIGATION
├── Goal: Establish feasibility via minimal working prototype
├── Stop when: Basic working prototype successfully executes
├── Output: Root node for Stage 2
└── Search: BFTS with parallel initial nodes
Stage 2: HYPERPARAMETER TUNING
├── Goal: Optimize critical hyperparameters for robust baseline
├── Stop when: Training curves converge + success on ≥2 datasets
├── Output: Tuned baseline node for Stage 3
├── Search: BFTS with specialized hyperparameter nodes
└── Tracking: Previously tested hyperparameters recorded to avoid redundancy
Stage 3: RESEARCH AGENDA EXECUTION
├── Goal: Systematically implement the core research agenda
├── Stop when: Computational budget exhausted
├── Runtime check: If runs finish too fast → suggest increasing complexity
├── Output: Best experimental node for Stage 4
└── Search: BFTS with refinement and debugging
Stage 4: ABLATION STUDIES
├── Goal: Assess importance of research components
├── Stop when: Computational budget exhausted
├── Output: Final experimental results + replication statistics
├── Search: BFTS with specialized ablation nodes
└── Replication: Multiple seeds for statistical robustness
Inter-stage transitions: After each stage, the manager:
1. Evaluates all terminal nodes using a dedicated LLM evaluator
2. Selects the best-performing node based on articulated criteria
3. Creates a checkpoint
4. Launches replication runs of the best node for mean/std statistics
5. Seeds the next stage with the selected best node
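These five steps can be sketched as one function, with `evaluate` and `replicate` as hypothetical stand-ins for the LLM evaluator and the replication runner (both names, and the dict-shaped nodes, are assumptions):

```python
import statistics

def stage_transition(terminal_nodes, evaluate, replicate, n_seeds=3):
    """Hypothetical sketch of the manager's five-step inter-stage handoff."""
    best = max(terminal_nodes, key=evaluate)        # steps 1-2: score and select
    checkpoint = dict(best)                         # step 3: snapshot the node
    # Step 4: replicate the best node across seeds for mean/std statistics.
    scores = [replicate(best, seed) for seed in range(n_seeds)]
    stats = {"mean": statistics.mean(scores),
             "std": statistics.stdev(scores) if len(scores) > 1 else 0.0}
    return best, checkpoint, stats                  # step 5: best seeds next stage
```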
Component 3: Parallelized Agentic Tree Search (BFTS)
Purpose: Explore the experimental search space through structured tree expansion, balancing exploitation (refining promising paths) and exploration (debugging failed paths).
Node definition: Each node is a tuple containing:
- Experiment script (Python code)
- High-level plan description (natural language)
- Execution error trace (if applicable)
- Experiment runtime
- Performance metrics
- LLM feedback on execution
- Visualization script
- Generated figure file paths
- VLM feedback on figures
- Node status: buggy | non-buggy
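The node tuple listed above maps naturally onto a dataclass. v2 lists dataclasses-json among its dependencies for serializing node state, but the exact layout below is a sketch, not the repository's definition:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Sketch of a BFTS node, transcribing the tuple above field by field."""
    code: str                                    # experiment script
    plan: str                                    # high-level plan description
    error_trace: Optional[str] = None            # execution error, if any
    runtime_s: float = 0.0                       # experiment runtime
    metrics: dict = field(default_factory=dict)  # performance metrics
    llm_feedback: str = ""                       # LLM feedback on execution
    plot_code: str = ""                          # visualization script
    figure_paths: List[str] = field(default_factory=list)
    vlm_feedback: str = ""                       # VLM feedback on figures
    buggy: bool = False                          # node status
```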
Node types:
| Type | Purpose | Created During | Expansion Rule |
|---|---|---|---|
| Standard (non-buggy) | Successfully executed experiment | All stages | Refinement: improve code for better performance |
| Buggy | Failed execution or VLM rejection | All stages | Debugging: fix errors using recorded error trace |
| Hyperparameter | Systematic hyperparameter exploration | Stage 2 | HP variation with redundancy tracking |
| Ablation | Component importance assessment | Stage 4 | Ablation conditions with redundancy tracking |
| Replication | Statistical robustness via different seeds | End of each stage | Same code, different random seed |
| Aggregation | Combine replication results | After replications | No new experiments; generates combined plots |
Selection policy:
- With probability debug_prob: select a buggy node for debugging
- Otherwise: best-first selection among non-buggy nodes
- Best-first evaluation by LLM based on: performance metrics, training dynamics, plot quality
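A minimal sketch of this policy, assuming dict-shaped nodes and a `rank` callable standing in for the LLM evaluator (neither is a v2 internal):

```python
import random

def select_node(nodes, debug_prob, rank, rng=None):
    """Sketch of BFTS node selection under the policy above."""
    rng = rng or random.Random()
    buggy = [n for n in nodes if n["buggy"]]
    healthy = [n for n in nodes if not n["buggy"]]
    # With probability debug_prob, pick a buggy node for a debugging attempt.
    if buggy and rng.random() < debug_prob:
        return rng.choice(buggy)
    # Otherwise: best-first selection among non-buggy nodes.
    return max(healthy, key=rank)
```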
Component 4: VLM Feedback Loop
Purpose: Iteratively evaluate and refine figure quality at two points in the pipeline.
Integration Point 1 — During Experimentation:
Experiment execution → metrics saved to .npy files
→ Plotting code generates figures
→ VLM receives: figures + code context
→ VLM checks: label clarity, legend presence, data accuracy
→ If issues found: node marked buggy, feedback recorded
→ If passes: node marked non-buggy
Integration Point 2 — During Paper Writing:
Paper draft generated → screenshots extracted of each figure
→ VLM receives: figure images + captions + referencing text
→ VLM checks:
- Figure-caption alignment
- Visual clarity (labels, legends, axes)
- Duplication between main text and appendix
- Data presentation accuracy
→ Feedback integrated into reflection stage
→ Iterative refinement until quality threshold met
Key improvement over v1: v1 had no visual understanding whatsoever — the automated reviewer only processed text. v2's VLM integration enables the system to detect and correct visual presentation issues that would be immediately apparent to a human reviewer.
Component 5: Paper Writing Engine
Purpose: Generate a complete scientific manuscript from experimental results.
v2 approach (simplified from v1):
1. Plot aggregation: Consolidate figures from best experimental nodes and replications
2. Single-pass generation: Reasoning model (o1-preview) generates complete LaTeX manuscript in one pass
3. Citation gathering: Semantic Scholar integration adds relevant references (up to 20 rounds)
4. VLM reflection: Vision-language model reviews figures and captions
5. Reflection stage: Reasoning model reviews and refines the complete manuscript
6. Output: Compiled PDF
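This flow can be sketched with stand-in callables for the model APIs; none of the names below come from the repository:

```python
def write_paper(llm, vlm, gather_citations, figures, results,
                max_citation_rounds=20):
    """Sketch of v2's single-pass writing flow with a reflection pass."""
    # Single-pass manuscript generation by a reasoning model.
    draft = llm(f"Write a complete LaTeX manuscript for: {results}")
    # Citation gathering: up to 20 rounds, stopping once nothing new is found.
    for _ in range(max_citation_rounds):
        new_refs = gather_citations(draft)
        if not new_refs:
            break
        draft += new_refs
    # VLM figure review feeds a final reflection pass over the manuscript.
    figure_feedback = vlm(figures, draft)
    return llm(f"Reflect on and refine:\n{draft}\nFigure notes: {figure_feedback}")
```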
Paper formats supported:
- normal: Standard 8-page conference format
- icbinb: 4-page workshop format (used for ICLR 2025 ICBINB submissions)
Timing: Writing stage takes approximately 20–30 minutes total.
Component 6: Automated Review System
Purpose: Evaluate generated manuscripts using both text and visual analysis.
Dual review pipeline:
1. LLM text review (perform_llm_review.py): Evaluates manuscript text for clarity, methodology, and scientific rigor
2. VLM visual review (perform_vlm_review.py): Evaluates figures, captions, and their alignment with text
Improvement over v1: v1's reviewer was text-only and was validated against human scores as a proxy. v2's addition of VLM review adds a visual dimension, though the ultimate validation came from real human peer review rather than automated scoring alone.
11 Core Mechanisms (Detailed)
Mechanism 1: Template Elimination
The shift from template-dependent to template-free operation is the single most architecturally significant change. Understanding how it works requires tracing the information flow:
v1 Information Flow (Template-Dependent):
Human creates:
templates/nanoGPT/
├── experiment.py (baseline code — hundreds of lines)
├── plot.py (visualization code)
├── prompt.json (domain context for LLM)
└── seed_ideas.json (example ideas)
LLM receives: template code + context → generates incremental modifications
Aider applies: diff patches to template code
Execution: modified template code runs in same environment
v2 Information Flow (Template-Free):
Human creates:
ideas/my_topic.md (markdown description: title, keywords, TL;DR, abstract)
(typically <1 page of text)
Ideation LLM: generates structured research idea from topic description
BFTS LLM: generates complete experiment code from scratch (no template)
→ Stage 1: minimal working prototype
→ Stage 2: hyperparameter-optimized version
→ Stage 3: full research agenda implementation
→ Stage 4: ablation variants
Datasets: loaded via Hugging Face Hub (one-line API call)
How the system bootstraps without a template:
1. The ideation stage produces a concrete hypothesis and experimental design
2. Stage 1 of BFTS generates a minimal prototype from scratch — the LLM writes complete Python code, not modifications
3. Multiple parallel root nodes provide diversity in initial implementations
4. The tree search explores variations, with the LLM evaluator selecting the most promising directions
5. Each subsequent stage builds on the best node from the previous stage, so code quality improves progressively
Trade-off: Without a human-verified template as a starting point, the system is more likely to produce incorrect or poorly structured code. This is offset by the tree search's ability to explore multiple paths and recover from failures via debugging nodes.
Mechanism 2: Best-First Tree Search (BFTS)
The BFTS algorithm is adapted from AIDE (Jiang et al., 2025) with modifications for the multi-stage scientific experimentation context:
AIDE's original BFTS (for ML engineering tasks):
- Each node = a potential solution with a scalar evaluation score (e.g., validation accuracy)
- Nodes selected for expansion based on score ranking
- Single-stage: continuous refinement toward a single metric

AI Scientist v2's adapted BFTS (for scientific discovery):
- Each node = experiment code + results + figures + LLM feedback (rich metadata)
- Node evaluation by LLM rather than scalar metric (qualitative assessment)
- Multi-stage: four distinct stages with different objectives and node types
- Additional node types (hyperparameter, ablation, replication, aggregation) beyond AIDE's standard/buggy distinction
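The richer node state can be sketched as a Python dataclass. This is an illustrative reconstruction; the field and enum names are assumptions, not the repository's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class NodeStatus(Enum):
    PENDING = "pending"
    BUGGY = "buggy"
    NON_BUGGY = "non_buggy"

class NodeType(Enum):
    STANDARD = "standard"
    DEBUG = "debug"
    HYPERPARAM = "hyperparam"
    ABLATION = "ablation"
    REPLICATION = "replication"
    AGGREGATION = "aggregation"

@dataclass
class TreeNode:
    """One state in the search tree: complete code plus evaluation metadata."""
    code: str                                    # full experiment script
    plan: str                                    # natural-language intent
    node_type: NodeType = NodeType.STANDARD
    status: NodeStatus = NodeStatus.PENDING
    metrics: dict = field(default_factory=dict)  # loaded from .npy results
    figures: list = field(default_factory=list)  # paths to generated plots
    error_trace: Optional[str] = None            # populated for buggy nodes
    llm_feedback: Optional[str] = None           # qualitative LLM assessment
    vlm_feedback: Optional[str] = None           # figure-quality feedback
    parent: Optional["TreeNode"] = None          # experimental lineage

node = TreeNode(code="print('hello')", plan="minimal prototype")
```

The parent link is what distinguishes this from a flat population: lineage and sibling structure are recoverable from the nodes themselves.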
Algorithm pseudocode:
function BFTS_Stage(root_node, config):
tree ← initialize_tree(root_node)
for step in range(config.steps):
# Select nodes for expansion
candidates ← []
for _ in range(config.num_workers):
if random() < config.debug_prob AND tree.has_buggy_nodes():
node ← select_buggy_node(tree)
child ← create_debug_child(node)
else:
node ← llm_best_first_select(tree.non_buggy_nodes())
child ← create_refinement_child(node)
candidates.append(child)
# Parallel execution
parallel_execute(candidates)
# Post-execution evaluation
for child in candidates:
if child.execution_failed:
child.status ← BUGGY
child.error_trace ← capture_error()
else:
child.metrics ← load_numpy_results()
child.figures ← run_plotting_code()
vlm_feedback ← vlm_evaluate(child.figures)
if vlm_feedback.has_issues:
child.status ← BUGGY
child.vlm_feedback ← vlm_feedback
else:
child.status ← NON_BUGGY
tree.add_node(child)
# Stage completion
best_node ← llm_evaluate_and_select(tree.non_buggy_nodes())
replications ← create_replication_nodes(best_node, num_seeds=3)
parallel_execute(replications)
aggregation ← create_aggregation_node(replications)
return best_node, aggregation
Key properties of the search:
- Anytime: Can be stopped at any step and still yield the current best node
- Recoverable: Buggy nodes are not discarded; they can be debugged in future steps
- Parallel: Multiple nodes expanded concurrently per step
- Qualitative evaluation: LLM-based node scoring captures aspects that scalar metrics miss (code quality, experimental design, visualization clarity)
- Stage-aware: Different stages use different node types and stopping criteria
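As a concrete illustration of the selection logic, the loop below is a runnable toy version of one BFTS step. The LLM evaluator is replaced by a numeric score and execution by a stub; names, scores, and probabilities are illustrative only:

```python
import random

def best_first_step(tree, debug_prob=0.3, num_workers=3):
    """Select num_workers children to expand: debug a buggy node with
    probability debug_prob, otherwise refine the best non-buggy node."""
    buggy = [n for n in tree if n["status"] == "buggy"]
    good = [n for n in tree if n["status"] == "non_buggy"]
    children = []
    for _ in range(num_workers):
        if buggy and random.random() < debug_prob:
            parent, kind = random.choice(buggy), "debug"
        else:
            # stand-in for LLM best-first selection: highest score wins
            parent, kind = max(good, key=lambda n: n["score"]), "refine"
        children.append({"parent": parent, "kind": kind,
                         "status": "non_buggy",
                         "score": parent["score"] + random.random()})
    return children

random.seed(0)
tree = [{"status": "non_buggy", "score": 1.0, "parent": None, "kind": "root"},
        {"status": "buggy", "score": 0.0, "parent": None, "kind": "root"}]
for _ in range(5):          # 5 steps, 3 workers each
    tree.extend(best_first_step(tree))
best = max((n for n in tree if n["status"] == "non_buggy"),
           key=lambda n: n["score"])
```

Note the anytime property in miniature: `best` is well-defined after every step, and buggy roots remain in the tree as debugging candidates rather than being discarded.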
Mechanism 3: VLM-Integrated Figure Refinement
The VLM feedback loop operates as a quality gate at two points:
During Experimentation (per-node):
Node execution succeeds
→ System reads .npy result files
→ LLM generates plotting code
→ Plotting code executes → figure files
→ VLM receives figure images + experiment context
→ VLM evaluates:
✓ Are axes labeled?
✓ Is there a legend?
✓ Do data values match metrics?
✓ Is the visualization type appropriate?
✓ Are there any misleading elements?
→ VLM returns structured feedback
→ If any check fails: node marked buggy
→ Feedback stored for future debugging attempts
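The per-node quality gate can be sketched as a function that folds structured VLM feedback into the buggy/non-buggy decision. The check names are hypothetical; the paper does not specify the feedback schema at this level of detail:

```python
def apply_vlm_gate(node, vlm_feedback):
    """Mark a node buggy if any figure-quality check failed; store the
    feedback either way so later debugging attempts can use it."""
    checks = ("axes_labeled", "has_legend", "values_match_metrics",
              "plot_type_appropriate", "no_misleading_elements")
    failed = [c for c in checks if not vlm_feedback.get(c, False)]
    node["vlm_feedback"] = vlm_feedback
    node["status"] = "buggy" if failed else "non_buggy"
    return failed

node = {"status": "pending"}
failed = apply_vlm_gate(node, {"axes_labeled": True,
                               "has_legend": False,
                               "values_match_metrics": True,
                               "plot_type_appropriate": True,
                               "no_misleading_elements": True})
# the single failed legend check is enough to mark the node buggy
```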
During Paper Writing (manuscript-level):
LaTeX manuscript generated
→ Screenshot each figure from rendered PDF
→ Extract caption text and referencing text ("Figure X")
→ VLM receives: image + caption + reference text
→ VLM evaluates:
✓ Does the caption accurately describe the figure?
✓ Does the referencing text correctly interpret the figure?
✓ Are there duplicate figures in main text and appendix?
✓ Is visual quality sufficient for publication?
→ Feedback integrated into reflection stage
→ Writing model revises manuscript based on VLM feedback
→ Iterate until quality threshold or max iterations
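The manuscript-level loop reduces to iterate-until-threshold control flow. A minimal sketch with stand-in review and revision functions and an assumed integer quality score (the real system uses VLM judgments, not a scalar):

```python
def refine_manuscript(draft, review_fn, revise_fn, threshold, max_iters=4):
    """Alternate review and revision until the quality score clears the
    threshold or the iteration budget is exhausted."""
    for i in range(max_iters):
        score, feedback = review_fn(draft)
        if score >= threshold:
            return draft, i
        draft = revise_fn(draft, feedback)
    return draft, max_iters

# toy stand-ins: each revision raises the score by 20 points
def review(d):
    return d["quality"], "fix captions"

def revise(d, feedback):
    return {"quality": d["quality"] + 20}

final, iters = refine_manuscript({"quality": 50}, review, revise,
                                 threshold=90)
```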
Mechanism 4: Experiment Manager State Machine
The experiment manager operates as a finite state machine over the four stages:
┌──────────────────────────────────────────────────────────────┐
│ │
▼ │
┌──────────────┐ best ┌──────────────┐ best ┌──────────────┐
│ STAGE 1 │───node───→│ STAGE 2 │───node───→│ STAGE 3 │
│ Preliminary │ │ HP Tuning │ │ Agenda │
│Investigation │ │ │ │ Execution │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
Stop: working Stop: convergence + Stop: budget
prototype ≥2 datasets pass exhausted
│
best node
│
▼
┌──────────────┐
│ STAGE 4 │
│ Ablation │
│ Studies │
└──────────────┘
│
Stop: budget
exhausted
│
▼
┌──────────────┐
│ MANUSCRIPT │
│ GENERATION │
└──────────────┘
State transitions:
- Each stage runs BFTS independently
- Best node from stage N becomes root node of stage N+1
- Checkpoints saved at each transition
- Replication runs launched at each transition for statistics
- Manager decides when stopping criteria are met

Stopping criteria specifics:
- Stage 1: Binary — did any node produce a running prototype?
- Stage 2: Convergence — training curves stabilize across datasets
- Stage 3: Budget — fixed compute allocation, with complexity escalation if runs are too fast
- Stage 4: Budget — fixed compute allocation
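The linear stage progression can be sketched as a transition table driven by the manager. Stage names and the `run_stage` callback are illustrative, not the codebase's identifiers:

```python
from enum import Enum, auto

class Stage(Enum):
    PRELIMINARY = auto()
    HP_TUNING = auto()
    AGENDA = auto()
    ABLATION = auto()
    MANUSCRIPT = auto()

# linear transition table: best node from stage N seeds stage N+1
NEXT = {Stage.PRELIMINARY: Stage.HP_TUNING,
        Stage.HP_TUNING: Stage.AGENDA,
        Stage.AGENDA: Stage.ABLATION,
        Stage.ABLATION: Stage.MANUSCRIPT}

def run_pipeline(run_stage):
    """Drive the four BFTS stages, threading the best node forward.
    run_stage(stage, root) stands in for one full BFTS stage."""
    stage, best, history = Stage.PRELIMINARY, None, []
    while stage is not Stage.MANUSCRIPT:
        best = run_stage(stage, root=best)  # checkpoint would be saved here
        history.append((stage, best))
        stage = NEXT[stage]
    return best, history

best, history = run_pipeline(lambda s, root: f"best-of-{s.name}")
```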
Mechanism 5: Dataset Loading Strategy
Rather than relying on locally packaged datasets (as v1 templates did), v2 uses a standardized approach:
from datasets import load_dataset

dataset = load_dataset("dataset_name")
Advantages:
- No manual dataset preparation per template
- Access to thousands of Hugging Face Hub datasets
- Standardized train/validation/test splits
- Automatic downloading and caching
Limitations (acknowledged by authors):
- Not all datasets support load_dataset
- Some datasets require custom preprocessing
- No guarantee the LLM will choose appropriate datasets for the hypothesis
- No built-in validation that train/test splits are properly separated in the generated code
12 Programming Language
Implementation Language
Python 3.11 — the entire system is implemented in Python.
Key Libraries and Their Roles
| Library | Version | Role |
|---|---|---|
| `openai` | Latest | OpenAI API client (GPT-4o, o1, o3-mini) |
| `anthropic[bedrock]` | Latest | Claude models via AWS Bedrock |
| `omegaconf` | Latest | Hierarchical configuration management |
| `python-igraph` | Latest | Tree data structure for BFTS |
| `datasets` (Hugging Face) | Latest | Dataset loading for experiments |
| `numpy` | Latest | Experiment result storage (.npy) |
| `matplotlib` | Latest | Figure generation |
| `seaborn` | Latest | Statistical visualization |
| `pymupdf4llm` | Latest | PDF processing for LLM review |
| `rich` | Latest | Terminal output and logging |
| `jsonschema` | Latest | Configuration validation |
| `dataclasses-json` | Latest | Node state serialization |
Code Execution Model
The system generates and executes Python code in a subprocess:
- Experiment code is written to .py files
- Executed via Python interpreter
- stdout/stderr captured for error traces
- Results saved to structured numpy files
- Plotting code executed separately
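A minimal sketch of this execution model using only the standard library; the helper name and timeout value are assumptions, and no sandboxing is shown (matching the codebase, which leaves sandboxing to Docker):

```python
import os
import subprocess
import sys
import tempfile

def run_experiment(code: str, timeout: int = 60):
    """Write generated code to a .py file, run it in a subprocess, and
    capture stdout/stderr so failures yield a usable error trace."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True,
                              timeout=timeout)
        ok = proc.returncode == 0
        return ok, proc.stdout, proc.stderr
    finally:
        os.unlink(path)

ok, out, err = run_experiment("print(2 + 2)")
```

A failing script would return `ok == False` with the traceback in `err`, which is exactly the error trace a buggy node stores for later debugging attempts.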
Safety Considerations
"This codebase will execute Large Language Model (LLM)-written code. There are various risks and challenges associated with this autonomy, including the potential use of dangerous packages, uncontrolled web access, and the possibility of spawning unintended processes."
The authors recommend Docker containers for sandboxing. No built-in sandboxing is provided in the codebase itself.
13 Memory Management
Intra-Run Memory
The tree search maintains memory through the tree structure itself:
Tree Node Memory:
├── Code: complete experiment script preserved at each node
├── Plan: natural language description of what this node implements
├── Metrics: experimental results stored in .npy files
├── Figures: generated visualization files
├── Errors: full error traces for buggy nodes
├── Feedback: LLM and VLM feedback recorded per node
└── Status: buggy/non-buggy classification
Inter-Node Memory:
├── Parent-child links preserve experimental lineage
├── Sibling relationships show parallel exploration paths
├── Best-node selection carries forward across stages
└── Replication nodes share parent code, vary seeds
Cross-Stage Memory
The experiment manager maintains state across stages:
- Checkpoints: Saved at each stage completion
- Best node propagation: Selected node becomes root of next stage
- Hyperparameter history: Stage 2 tracks tested configurations to avoid redundancy
- Ablation history: Stage 4 tracks tested conditions similarly
- Replication statistics: Mean/std computed and carried forward to manuscript
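The replication-statistics step reduces to a per-metric mean and standard deviation over seed runs, sketched here with the standard library (the function name and dict layout are illustrative):

```python
import statistics

def aggregate_replications(runs):
    """Combine per-seed metric dicts into (mean, std) per metric, as
    carried forward into the manuscript's result tables."""
    keys = runs[0].keys()
    return {k: (statistics.mean(r[k] for r in runs),
                statistics.stdev(r[k] for r in runs))
            for k in keys}

# three replication nodes: same code, different random seeds
runs = [{"val_acc": 0.81}, {"val_acc": 0.79}, {"val_acc": 0.80}]
stats = aggregate_replications(runs)
```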
Cross-Run Memory
No persistent memory exists between separate runs. Each invocation of launch_scientist_bfts.py starts fresh. There is:
- No skills library or knowledge base carried across experiments
- No cross-experiment transfer learning
- No persistent embedding store for novelty comparison
- No accumulated heuristics from previous runs
This is a significant limitation compared to systems like EurekaClaw (which maintains a skills library) or AIRA₂ (which accumulates population knowledge across evolutionary generations).
Novelty Memory During Ideation
During idea generation, Semantic Scholar queries provide a form of external memory:
- The system checks proposed ideas against published literature
- Previously proposed ideas within the same ideation run are tracked
- This prevents redundant idea generation within a single session
- No persistence across separate ideation runs
Visualization
The system generates unified_tree_viz.html — an interactive HTML visualization of the complete tree search for each run. This serves as a post-hoc analysis tool rather than a runtime memory mechanism, but enables human researchers to understand the search trajectory.
14 Continued Learning
Within a Single Run
The tree search implements a form of within-run learning:
- Error recovery: Buggy nodes' error traces inform debugging attempts — the system learns from its mistakes within a run
- Progressive refinement: Each stage builds on the best outcome of the previous stage — cumulative improvement
- VLM feedback integration: Figure quality issues detected early inform later plotting attempts
- Hyperparameter tracking: Stage 2 avoids re-testing configurations, learning from previous attempts
- Ablation tracking: Stage 4 similarly avoids redundant experiments
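The configuration-tracking idea behind the last two points can be sketched as a seen-set over canonicalized configs; the class and method names are hypothetical:

```python
import json

class ConfigTracker:
    """Remember tested hyperparameter configurations within a run so
    Stage 2 (and analogously Stage 4) never repeats an experiment."""
    def __init__(self):
        self._seen = set()

    def _key(self, config: dict) -> str:
        # sort keys so {"lr": 0.001, "bs": 32} and {"bs": 32, "lr": 0.001}
        # canonicalize to the same string
        return json.dumps(config, sort_keys=True)

    def propose(self, config: dict) -> bool:
        """Return True if the config is new (and record it)."""
        key = self._key(config)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

tracker = ConfigTracker()
tracker.propose({"lr": 0.001, "batch_size": 32})   # new -> True
tracker.propose({"batch_size": 32, "lr": 0.001})   # duplicate -> False
```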
Across Runs
No cross-run learning exists. Each run is independent with no mechanism for:
- Transferring successful strategies to new topics
- Building a library of reusable experimental components
- Accumulating domain expertise over time
- Meta-learning about which experimental approaches work best
Comparison with Learning-Enabled Systems
| System | Within-Run Learning | Cross-Run Learning | Mechanism |
|---|---|---|---|
| AI Scientist v2 | Tree search refinement | None | BFTS tree state |
| AI Scientist v1 | Aider conversation history | None | Linear context |
| EurekaClaw | Evolutionary memory | Skills library | Persistent vector store |
| AIRA₂ | Population evolution | None (per-task) | Evolutionary memory |
| AutoResearchClaw | Agent memory | Session memory | Knowledge graph |
Potential for Cross-Run Learning
The paper does not discuss cross-run learning, but the architecture would naturally support it through:
- Tree node embeddings: Successful node codes could be embedded and retrieved for future runs
- Strategy library: Successful experimental strategies could be abstracted and reused
- Domain models: Accumulating domain knowledge from multiple runs on related topics
- Meta-heuristics: Learning which BFTS configurations work best for which types of research questions
15 Applications
Primary Application: Automated ML Research
The system's demonstrated application is generating workshop-level research manuscripts in machine learning. The three ICLR submissions spanned:
- Compositional generalization (regularization techniques for sequence models)
- Agricultural pest detection (applied deep learning)
- Model calibration under label noise (ML robustness)
This breadth — from theoretical ML to applied computer vision — demonstrates the domain-general ambition of the template-free approach.
Application Domain Constraints
| Constraint | Impact |
|---|---|
| ML-only | System assumes ML experiments (Python + PyTorch/TensorFlow); cannot conduct wet-lab, social science, or theoretical mathematics research |
| Hugging Face datasets | Experiments limited to datasets available via Hugging Face Hub or those the LLM can generate synthetically |
| GPU-dependent | Cannot run on CPU-only machines; limits accessibility |
| Single-paper scope | Each run produces one paper; cannot conduct research programs spanning multiple publications |
| Workshop-level quality | Current capability is workshop-level; not yet suitable for main-track conferences |
Potential Extensions
Near-term (system-level improvements):
- Multi-idea orchestration within a single session
- Persistent knowledge base across runs
- Improved code sandboxing
- Support for non-ML experimental domains (e.g., bioinformatics pipelines)
- Integration with laboratory robotics for wet-lab experiments

Medium-term (capability improvements):
- Conference-level paper quality through deeper tree search and better LLMs
- Multi-modal experiments (vision, NLP, robotics simultaneously)
- Collaborative multi-agent research teams
- Automated rebuttal writing for reviewer feedback
- Integration with preprint servers for automated submission

Long-term (paradigm implications):
- Continuous scientific discovery loops (open-ended research programs)
- Cross-disciplinary hypothesis generation
- AI-driven meta-research (studying what makes research effective)
- Autonomous identification of high-impact research directions
Safety and Ethics Considerations
The paper dedicates significant discussion to safety and ethical implications:
1. Code execution risk: LLM-generated code may be unsafe (e.g., using dangerous packages, spawning processes, accessing the network). Docker sandboxing is recommended but not enforced.
2. Scientific integrity: The system can produce hallucinated citations, fabricated results, and misleading claims. Without human oversight, these could enter the scientific record.
3. Peer review ethics: The ICLR experiment was conducted with full transparency and pre-arranged withdrawal. However, the existence of such systems raises questions:
   - How should venues handle AI-generated submissions?
   - Should AI-generated papers be required to disclose their provenance?
   - What is the reviewer's responsibility when AI submissions increase in volume?
4. Acceleration risks: If scaled, such systems could flood peer review with low-quality submissions, overwhelming human reviewers. The paper acknowledges this and advocates for community discussion.
5. Mandatory disclosure: The code license requires users to "clearly and prominently disclose the use of AI in any resulting scientific manuscripts or papers."
Significance for the Field of Automated Scientific Discovery
The AI Scientist v2 represents a qualitative threshold crossing: the first time a fully autonomous system produced work accepted by human peer reviewers at a recognized venue. While the practical significance is modest (one workshop paper), the conceptual significance is substantial:
- Proof of concept validated externally: Unlike v1's self-evaluation via automated reviewer, v2's validation came from independent human experts who were not told which papers were AI-generated.
- Template-free generalization demonstrated: The three submitted papers covered meaningfully different ML topics, not just variations within a single domain.
- Tree search superiority confirmed: The multi-stage BFTS approach enabled deeper experimental exploration than v1's linear pipeline, reflected in the experimental designs of the submitted papers.
- The gap to conference-level is clear: The authors' honest assessment that none of the papers meet conference standards provides a concrete improvement target for the field.
Positioning Within OmniEvolve
The AI Scientist v2 is relevant to OmniEvolve's design in several ways:
| v2 Component | OmniEvolve Parallel | Key Lesson |
|---|---|---|
| BFTS tree search | Search backends (island-based evolution) | Tree search with LLM evaluation is effective for code-level exploration |
| Experiment Manager | Orchestrator lifecycle management | Multi-stage search with explicit stopping criteria outperforms single-stage |
| VLM feedback loop | Cascade evaluator (multi-signal) | Visual evaluation adds signal that text-only evaluation misses |
| No cross-run learning | Knowledge module (skills, logs) | OmniEvolve's learning infrastructure addresses v2's biggest architectural gap |
| Template-free design | Benchmark-agnostic evaluation | Domain generality requires rethinking how context is provided to the search |
| `bfts_config.yaml` | Pydantic config schemas | Complex search methods need structured, validated configuration |
| Parallel node execution | Island migration and parallel evaluation | Parallelism with information sharing outperforms independent parallel runs |
Appendix A: Complete v1 → v2 Feature Comparison
| Feature | AI Scientist v1 | AI Scientist v2 |
|---|---|---|
| Release date | August 2024 | April 2025 |
| arXiv | 2408.06292 | 2504.08066 |
| Code availability | Open source | Open source |
| License | Apache 2.0 | Responsible AI Source Code License |
| Template required | Yes (per domain) | No |
| Experimentation style | Linear sequential | Best-First Tree Search |
| Parallel experiments | No | Yes (num_workers concurrent) |
| Code editing method | Aider (diff-based) | Direct LLM generation |
| VLM integration | None | During experiments + writing |
| Paper writing | Multi-round Aider editing | Single-pass o1 + reflection |
| Evaluation | Automated reviewer only | Real human peer review + automated |
| Best result | Exceeded automated reviewer threshold | Peer-review acceptance at ICLR workshop |
| Cost per paper | ~$15 | ~$20–25 |
| Domains demonstrated | 3 (NanoGPT, Diffusion, Grokking) | Open (any ML topic via markdown) |
| Dataset handling | Bundled in templates | Hugging Face Hub |
| Configuration | CLI arguments only | CLI + YAML (bfts_config.yaml) |
| Model routing | Single `--model` | Task-specific model selection |
| Tree search base | None | AIDE (WecoAI) |
| Semantic Scholar | Post-hoc novelty check | In-loop during ideation |
| Paper formats | 8-page | 8-page + 4-page ICBINB |
| Success rate | Higher (within template scope) | Lower (broader scope) |
| Sandboxing | Recommended | Recommended (same) |
Appendix B: ICLR 2025 ICBINB Workshop Experiment Protocol
Experimental Design
- Pre-registration: Sakana AI coordinated with ICBINB workshop organizers and ICLR leadership before submission
- IRB approval: Obtained institutional review board approval for the human evaluation experiment
- Reviewer consent: Reviewers were informed that "some submissions might be AI-generated" and could opt out
- Blind review: AI-generated papers were mixed with 40 human-written submissions (43 total)
- No disclosure during review: Reviewers did not know which papers were AI-generated
- Post-review disclosure: All reviewers were informed after the review process completed
- Pre-arranged withdrawal: Accepted AI-generated papers would be withdrawn regardless of scores
Results Detail
Paper 1: Compositional Regularization (Accepted)
- Investigated a temporal consistency regularizer on LSTM embeddings
- Evaluated on synthetic arithmetic expression datasets
- Found that compositional regularization does NOT significantly improve performance
- Reviewers appreciated the negative result and clear identification of challenges
- Scores: 6 (weak accept), 7 (accept), 6 (weak accept) → average 6.33
- Placed in top 45% of 43 submissions

Paper 2: Real-World Pest Detection (Rejected)
- Applied deep learning to agricultural pest classification
- Scores: 3, 7, 4 → average 4.67
- High variance in reviewer scores suggests inconsistent quality

Paper 3: Label Noise & Model Calibration (Rejected)
- Studied effect of label noise on model calibration
- Scores: 3, 3, 3 → average 3.00
- Uniformly low scores indicate consistent quality issues
Author Self-Assessment
The Sakana AI team conducted an internal evaluation and identified:
- Citation hallucinations in all three papers
- Insufficient methodological rigor for conference level
- Ambiguity in method descriptions (e.g., unclear which network component is regularized)
- Potential dataset overlap issues
- Figure caption inaccuracies

Their assessment matched peer review: one paper was workshop-worthy, two were not.
Appendix C: Glossary of Key Terms
| Term | Definition |
|---|---|
| BFTS | Best-First Tree Search — tree search algorithm where the most promising nodes are expanded first, guided by an evaluation function |
| VLM | Vision-Language Model — model that processes both images and text, used here for figure evaluation |
| ICBINB | "I Can't Believe It's Not Better" — ICLR workshop focused on negative results and challenges in deep learning |
| Aider | Open-source AI coding assistant used in v1 for diff-based code editing; removed in v2 |
| AIDE | AI Development Environment by WecoAI — tree search system for ML engineering that inspired v2's BFTS |
| Node | A single state in the tree search, comprising experiment code, results, figures, and metadata |
| Buggy node | A node that failed execution or VLM review |
| Non-buggy node | A node that successfully executed and passed VLM review |
| Experiment Manager | Dedicated agent that coordinates the four-stage experimental lifecycle |
| Semantic Scholar | Academic search engine used for literature search and novelty checking |
| Replication node | Node that re-runs parent experiment with different random seed for statistical robustness |
| Aggregation node | Special node that consolidates results from replication nodes into combined visualizations |