
The AI Scientist v2

End-to-end agentic system that produced the first entirely AI-generated peer-review-accepted workshop paper through progressive tree search over the scientific discovery pipeline.

Organization: Sakana AI / Foerster Lab (University of Oxford) / University of British Columbia / Vector Institute
Published: April 10, 2025
Type: paper (arXiv:2504.08066)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

  • Venue: arXiv preprint (cs.AI, cs.CL, cs.LG), April 2025; evaluated at ICLR 2025 Workshop "I Can't Believe It's Not Better" (ICBINB)
  • DOI: 10.48550/arXiv.2504.08066
  • License: CC-BY 4.0 (paper); Responsible AI Source Code License (code — derivative of RAIL)
  • Predecessor: The AI Scientist v1 (Lu et al., arXiv:2408.06292, August 2024)
  • Code: github.com/SakanaAI/AI-Scientist-v2
  • Workshop Experiment Data: github.com/SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment
  • Blog Post: sakana.ai/ai-scientist-v2
  • Relation to prior work: Direct successor to AI Scientist v1; tree search component built on AIDE (Jiang et al., 2025); evaluated via formal ICLR peer review rather than automated reviewer alone

The paper positions itself as a milestone: the first demonstration that a fully autonomous AI system can generate a scientific manuscript that passes real human peer review at a recognized venue. The contribution is both systems-level (architectural innovations enabling template-free exploration) and empirical (a controlled experiment with blind peer review).


2 Authors and Team

Author | Affiliation | Role
Yutaro Yamada* | Sakana AI | Equal contribution; corresponding author
Robert Tjarko Lange* | Sakana AI | Equal contribution; corresponding author
Cong Lu* | Sakana AI, UBC, Vector Institute | Equal contribution; corresponding author; v1 lead
Shengran Hu | Sakana AI, UBC, Vector Institute
Chris Lu | FLAIR, University of Oxford | v1 first author
Jakob Foerster | FLAIR, University of Oxford | Equal advising
Jeff Clune | UBC, Vector Institute, Canada CIFAR AI Chair | Equal advising
David Ha | Sakana AI | Equal advising; Sakana AI co-founder

*Equal contribution.

Team size: 8 authors across Sakana AI, University of British Columbia, Vector Institute, and University of Oxford. Notably smaller than many competing systems, reflecting Sakana AI's lean research style. The team overlaps significantly with v1 (Lu, Lu, Lange, Foerster, Clune, Ha), providing direct continuity.

Key figures:

  • David Ha: Former Google Brain researcher, co-founder and CEO of Sakana AI. Known for world models, neuroevolution, and creative AI research.
  • Jeff Clune: Pioneer of open-endedness, quality-diversity, and MAP-Elites. His intellectual fingerprint is visible in the tree-search exploration paradigm.
  • Jakob Foerster: Oxford faculty; multi-agent systems, meta-learning. Connects the work to the FLAIR lab's broader agenda on agent coordination.
  • Chris Lu and Cong Lu: Co-leads of v1; their transition to v2 ensures architectural continuity while enabling the paradigm shift.


3 Core Contribution

The AI Scientist v2 addresses three fundamental limitations of its predecessor through targeted architectural innovations:

Limitation in v1 | Problem | v2 Solution
Template dependency | Required human-authored code templates for each research domain; could not operate without experiment.py, plot.py, and prompt.json hand-crafted per topic | Eliminates templates entirely; starts from a markdown topic description plus a generated JSON idea file
Linear experimentation | Followed a strictly sequential hypothesis → code → execute → analyze pipeline; short-sighted, unable to explore branching hypotheses | Progressive agentic tree search (BFTS) with parallel exploration of multiple experimental directions
Text-only review | Automated reviewer evaluated text only; no visual understanding of figures | VLM (vision-language model) feedback loop for iterative figure refinement during both experimentation and paper writing

The Headline Result

One of three fully autonomous manuscripts submitted to the ICLR 2025 ICBINB workshop received scores of 6, 7, and 6 (average 6.33/10), placing it in the top 45% of all submissions — above the average human acceptance threshold. This is the first documented instance of a fully AI-generated paper passing real human peer review.

What v2 Is NOT

  • Not necessarily better than v1 per-paper: The authors explicitly note that v1 with a strong template can produce higher success rates on well-defined tasks
  • Not conference-level: The accepted paper was workshop-level; the authors state none of the three manuscripts met main-track conference standards
  • Not free of hallucinations: External evaluations found fabricated results, hallucinated methodology, and overestimated novelty in some outputs
  • Not a replacement for human scientists: The system demonstrates capability at the workshop level, not the ability to produce trustworthy science independently

Relationship to the Field

Timeline of autonomous scientific discovery systems:
──────────────────────────────────────────────────────────────────────
AI Scientist v1 (Aug 2024)    → First end-to-end system; $15/paper;
                                 template-dependent; automated reviewer only
MLAgentBench (2024)            → ML experiment automation benchmark
AIDE (2025)                    → BFTS for ML engineering; MLEBench SOTA
AI Scientist v2 (Apr 2025)    → Template-free; agentic tree search;
                                 first real peer-review acceptance
AutoResearchClaw (2025)        → Multi-agent research; ReAct + tools
EurekaClaw (2025)              → Evolutionary research; skills library
AIRA₂ (Mar 2026)              → 8-GPU async evolution; MLE-bench SOTA
──────────────────────────────────────────────────────────────────────

The AI Scientist v2 occupies a unique position: it targets the full scientific pipeline (idea → paper) rather than just ML engineering tasks. AIDE influenced the tree search; the AI Scientist lineage contributes the end-to-end manuscript generation and peer review evaluation.


4 Supported Solutions

Problem Framing

The AI Scientist v2 frames automated scientific discovery as a tree-structured search problem over the space of experimental programs and manuscripts. The unit of search is not a competition solution (as in MLE-bench agents) but a complete scientific workflow: hypothesis, implementation, experiments, analysis, figures, and paper.

Solution Space

The system operates on ML research ideas as the unit of exploration. Each candidate path through the tree represents:

  1. A scientific hypothesis (generated during ideation)
  2. An implementation strategy (Python code for experiments)
  3. Experimental results (metrics, training curves, outputs)
  4. Visualizations (figures generated by plotting code)
  5. A manuscript (LaTeX paper with citations, figures, and text)

Search Methods Supported

Method | Description | Role in v2
Best-First Tree Search (BFTS) | LLM-evaluated nodes ranked by experimental quality; best nodes expanded first | Primary search method across all 4 stages
Parallel node expansion | Multiple nodes expanded concurrently via num_workers parallel paths | Accelerates exploration; default 3 workers
Probabilistic debugging | Buggy nodes selected for debugging with probability debug_prob | Enables error recovery without abandoning investment
Stage-gated progression | Tree search organized into 4 sequential stages with explicit stopping criteria | Structured exploration from prototype to ablations
LLM-based node evaluation | Dedicated LLM evaluator scores non-buggy nodes for best-first selection | Replaces scalar metrics with qualitative assessment

What v2 Does NOT Do

  • No neural architecture search as a subsystem
  • No reinforcement learning or gradient-based optimization of the search itself
  • No meta-learning or cross-experiment transfer (each idea starts fresh)
  • No automated submission to venues (manuscripts are generated but submission was manual)
  • No real-time human-in-the-loop steering during the search (fully autonomous once launched)
  • No multi-idea orchestration within a single run (one idea per launch_scientist_bfts.py invocation)

5 LLM Integration

Model Configuration

Unlike v1, which used a single --model argument, v2 employs task-specific model routing, with separate LLMs for different pipeline stages:

Stage | Recommended Model | Purpose
Ideation | GPT-4o (2024-05-13) | Idea generation, novelty checking via Semantic Scholar
Experimentation (BFTS) | Claude 3.5 Sonnet | Code generation, experiment planning, debugging within tree search
Aggregate Plots | o3-mini (2025-01-31) | Consolidating visualization across replication runs
Paper Writing | o1-preview (2024-09-12) | Single-pass manuscript generation + reflection
Citation Gathering | GPT-4o (2024-11-20) | Literature search integration with Semantic Scholar
Paper Review | GPT-4o (2024-11-20) | Automated text review
VLM Feedback | VLM (vision-capable model) | Figure quality assessment during experiments and writing

LLM Usage Patterns

The system uses LLMs in fundamentally different ways across the pipeline:

1. Code Generation (Experimentation)

Node Expansion Cycle:
  LLM generates: (plan_text, experiment_code.py)
  → Execute in Python interpreter
  → If error: record trace, mark buggy, stop
  → If success: save metrics to .npy files
  → LLM generates: plot_code.py
  → Execute plotting code
  → VLM evaluates generated figures
  → If VLM flags issues: mark buggy, record feedback
  → If passes: mark non-buggy, record node metadata
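The cycle above can be sketched in a few lines of Python. This is a hedged sketch, not the repository's actual API: the `llm` and `vlm` objects and their method names (`generate_experiment`, `generate_plot_code`, `review_figures`) are invented stand-ins; only the control flow mirrors the description.

```python
# Hypothetical sketch of one BFTS node-expansion cycle (invented names;
# the real engine lives under ai_scientist/treesearch/).
import os
import subprocess
import sys
import tempfile


def run_script(code: str, timeout: int = 60):
    """Execute generated code in a subprocess; return (ok, stderr_trace)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)


def expand_node(llm, vlm, parent):
    """Generate code, execute it, plot, then gate the node on VLM feedback."""
    plan, code = llm.generate_experiment(parent)     # (plan_text, experiment code)
    ok, trace = run_script(code)
    if not ok:                                       # record trace, mark buggy
        return {"status": "buggy", "plan": plan, "error_trace": trace}
    ok, trace = run_script(llm.generate_plot_code(plan))  # plotting code → figures
    if not ok:
        return {"status": "buggy", "plan": plan, "error_trace": trace}
    feedback = vlm.review_figures(plan)              # label/legend/data checks
    status = "buggy" if feedback.has_issues else "non-buggy"
    return {"status": status, "plan": plan, "vlm_feedback": feedback}
```

The key design point the sketch preserves is that a node is never silently dropped: failures are recorded as buggy nodes with their error traces, so the debugging policy can revisit them later.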

2. Best-First Node Selection

LLM Evaluator receives:
  - All non-buggy node descriptions
  - Performance metrics per node
  - Training dynamics summaries
  - Generated plot quality assessments
LLM returns: ranked ordering of nodes for expansion

3. Experiment Manager Agent

After each stage completion:
  Manager LLM evaluates all terminal nodes
  → Selects best node as seed for next stage
  → Records checkpoint
  → Launches replication runs for statistics
  → Transitions to next stage

4. Paper Writing

v1 approach: Iterative Aider-based editing (multiple rounds of code-level edits)
v2 approach: Single-pass generation by reasoning model (o1) + separate reflection stage
  → Substantial simplification over v1's incremental writing

Key Architectural Difference from v1

v1 used the Aider coding assistant (aider-chat library) as an intermediary between the LLM and the codebase. Aider managed git history, diff application, and multi-turn code editing conversations. v2 removes Aider entirely — the LLM directly generates complete code files within the tree search loop. This eliminates a layer of abstraction and gives the system more direct control over code structure, at the cost of losing Aider's git-based undo capabilities.

API Integration

API | Purpose | Required
OpenAI API | GPT-4o, o1-preview, o3-mini for writing/review/ideation | Yes (for default config)
AWS Bedrock | Claude 3.5 Sonnet for experimentation | Yes (for default config)
Gemini API | Alternative experimentation model | Optional
Semantic Scholar API | Literature search, novelty checking, citation gathering | Optional (recommended)

6 Key Results

ICLR 2025 Workshop Submission Experiment

Three fully autonomous manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. The experiment was conducted with full transparency:

  • Workshop organizers and ICLR leadership were informed in advance
  • Reviewers were told "some submissions might be AI-generated" but not which ones
  • Reviewers could opt out of reviewing potentially AI-generated papers
  • IRB approval was obtained
  • Accepted manuscripts would be withdrawn post-review (and were)

Paper | Topic | Reviewer Scores | Average | Outcome
Compositional Regularization | Temporal consistency regularizer on LSTM embeddings for compositional generalization | 6, 7, 6 | 6.33 | Exceeded acceptance threshold
Real-World Pest Detection | Deep learning for agricultural pest classification | 3, 7, 4 | 4.67 | Rejected
Label Noise & Model Calibration | Effect of label noise on model calibration | 3, 3, 3 | 3.00 | Rejected

The Accepted Paper: Compositional Regularization

The accepted paper investigated whether incorporating an explicit compositional regularization term into neural network training improves compositional generalization:

  • Approach: Penalizes large deviations between embeddings of successive time steps in LSTM-based sequence models
  • Hypothesis: Encouraging smoothness in embedding space promotes compositionality
  • Datasets: Synthetic arithmetic expression datasets
  • Finding: Compositional regularization did not yield significant improvements and occasionally harmed performance — a negative result
  • Reviewer assessment: Appreciated for "clearly identifying the challenges of effective compositional regularization and reporting on negative results"
  • Ranking: Top 45% of 43 submissions (roughly 20th position)
  • Post-review status: Withdrawn as pre-arranged with organizers

Internal Quality Assessment

The authors conducted their own evaluation of all three manuscripts:

Quality Dimension | Assessment
Workshop-level quality | One manuscript (the accepted one) meets workshop standards
Conference-level quality | None of the three manuscripts meets main-track standards
Citation accuracy | Occasional hallucinated citations (known LLM limitation)
Methodological rigor | Adequate for workshops; lacks depth for conferences
Figure quality | Improved over v1 via VLM feedback, but some caption inaccuracies remain
Code quality | Functional but not always well-structured or documented

External Evaluations and Known Limitations

Independent analyses (MLR-Bench, Pebblous) identified systemic issues in AI Scientist v2 outputs:

Issue | Description | Frequency
Fabricated results | System hides failed experiments and reports them as successful | Found in multiple outputs
Hallucinated methodology | Describes methods not actually implemented in code | Intermittent
Overestimated novelty | Presents well-known concepts as novel discoveries | Common
Dataset overlap | Potential train/test contamination in some experiments | Identified in accepted paper
Caption inaccuracies | Figure captions not always matching figure content | Several instances

Comparison with v1 Results

Metric | v1 | v2
Evaluation method | Automated reviewer only | Real human peer review at ICLR workshop
Cost per paper | ~$15 | ~$20–25
Template requirement | Yes (per domain) | No
Domain flexibility | 3 specific domains (NanoGPT, 2D Diffusion, Grokking) | Any ML topic describable in markdown
Success rate | Higher (within template scope) | Lower (broader, exploratory)
Best reviewer score | Exceeded automated reviewer threshold | 6.33/10 average from human reviewers
Accepted at real venue | No (not submitted) | Yes (1 of 3 at ICBINB workshop)
Paper writing approach | Iterative Aider-based editing | Single-pass o1 generation + reflection
Experiment depth | Shallow, linear | Deep, tree-structured, multi-stage

7 Reproducibility

Strengths

Aspect | Assessment
Code availability | Fully open-sourced at github.com/SakanaAI/AI-Scientist-v2
Workshop experiment data | Separately published with full manuscripts and reviews
Configuration | bfts_config.yaml provides declarative tree search configuration
Installation | Conda environment with pinned dependencies; requirements.txt provided
Hardware requirements | Explicitly stated: Linux, NVIDIA GPUs, CUDA, PyTorch
API requirements | All required API keys documented (OpenAI, AWS Bedrock, Semantic Scholar)
Prompts | Included in Appendix B of the paper
Hyperparameters | Full sampling hyperparameters in Appendix A
Tree visualization | unified_tree_viz.html generated for each run, enabling post-hoc inspection

Limitations

Aspect | Concern
LLM API dependency | Results depend on specific model versions (Claude 3.5 Sonnet, o1-preview) that may be deprecated or updated
Non-determinism | LLM sampling introduces stochasticity; exact tree trajectories are not reproducible across runs
Cost barrier | $20–25 per paper attempt; multiple attempts needed for reliable results
GPU requirement | NVIDIA GPU with CUDA required; not runnable on CPU or Apple Silicon
Success rate variability | "Higher success rates are generally observed when using powerful models like Claude 3.5 Sonnet"; weaker models may fail more often
LaTeX dependencies | Requires poppler, chktex, and a LaTeX toolchain; brittle cross-platform
Semantic Scholar dependency | Rate limits may affect ideation and citation stages without an API key
Sandbox requirements | Executes LLM-generated code; requires Docker/container isolation for safety

Predecessor Reproducibility

AI Scientist v1 is fully open-sourced at github.com/SakanaAI/AI-Scientist with templates for three domains. The v2 codebase explicitly acknowledges building on AIDE (WecoAI/aideml) for the tree search component.


8 Compute and API Costs

Cost Breakdown per Paper

Cost structure per paper generation attempt:
┌─────────────────────────────────────────────────┐
│ Stage                      │ Estimated Cost     │
├─────────────────────────────────────────────────┤
│ Idea Generation            │ ~$3                │
│ (LLM calls + Semantic Scholar queries)          │
├─────────────────────────────────────────────────┤
│ BFTS Experimentation       │ $15–20             │
│ (Claude 3.5 Sonnet for code gen/debug/eval)     │
├─────────────────────────────────────────────────┤
│ Paper Writing              │ ~$5                │
│ (o1-preview + GPT-4o for citations)             │
├─────────────────────────────────────────────────┤
│ Total per paper attempt    │ ~$20–25            │
└─────────────────────────────────────────────────┘

Comparison with v1 Costs

Metric | v1 | v2
Total per paper | ~$15 | ~$20–25
GPU compute | Minimal (uses template code) | Moderate (runs ML experiments)
LLM API | Single model, ~$15 | Multi-model, ~$20
Time to completion | Few hours | Several hours (experimentation) + 20–30 min (writing)
Human setup cost | High (template creation per domain) | Low (markdown topic description)

Hardware Requirements

Minimum configuration:
┌──────────────────────────────────────┐
│  Linux OS (required)                  │
│  NVIDIA GPU with CUDA support         │
│  PyTorch-compatible GPU drivers       │
│  Sufficient VRAM for target experiments│
│  Docker/container runtime (recommended)│
└──────────────────────────────────────┘

BFTS Configuration Parameters

The tree search is controlled via bfts_config.yaml:

Parameter | Description | Default
num_workers | Parallel exploration paths | 3
steps | Maximum nodes to explore | 21
num_seeds | Initial root nodes per tree | 3
num_drafts | Independent trees to grow (Stage 1) | Configurable
max_debug_depth | Max debug attempts before abandoning a node | Configurable
debug_prob | Probability of selecting a buggy node for debugging | Configurable

With num_workers=3 and steps=21, the system explores up to 21 nodes total, expanding 3 concurrently per step. This gives ~7 expansion rounds per stage.
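As a sanity check on that arithmetic, a few lines of Python (variable names mirror the bfts_config.yaml parameters; the values are the documented defaults):

```python
# Back-of-envelope BFTS budget: steps caps the total nodes explored,
# num_workers sets how many nodes expand concurrently per round.
num_workers = 3   # parallel exploration paths (default)
steps = 21        # maximum nodes to explore (default)

expansion_rounds = steps // num_workers
print(expansion_rounds)  # 7 rounds of 3 concurrent expansions
```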

Cost-Effectiveness Analysis

The $20–25 per paper is misleadingly low as a headline number. Important caveats:

  1. Success rate is not 100%: Many runs fail to produce a viable manuscript; effective cost per publishable paper is significantly higher
  2. GPU costs not included: If experiments require significant GPU compute (training models), hardware costs add substantially
  3. Human review not included: The system does not guarantee quality; human review and potential revision would add labor costs
  4. The $15 v1 comparison is apples-to-oranges: v1 cost was lower because templates did the heavy lifting that v2 must do from scratch

9 Architecture Solution

High-Level Architecture

The AI Scientist v2 architecture consists of two major phases executed sequentially, with the experimentation phase internally organized as a four-stage tree search:

┌─────────────────────────────────────────────────────────────────────────┐
│                        AI SCIENTIST v2 PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  PHASE 1: IDEATION                                                     │
│  ┌───────────────────────────────────────────────────────────┐         │
│  │  Topic Description (.md)                                   │         │
│  │       ↓                                                    │         │
│  │  LLM Idea Generation (with Semantic Scholar novelty check) │         │
│  │       ↓                                                    │         │
│  │  Structured Ideas (.json)                                  │         │
│  └───────────────────────────────────────────────────────────┘         │
│       ↓                                                                 │
│  PHASE 2: EXPERIMENTATION (BFTS)                                       │
│  ┌───────────────────────────────────────────────────────────┐         │
│  │  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐   │         │
│  │  │   Stage 1    │→│   Stage 2     │→│    Stage 3      │   │         │
│  │  │ Preliminary  │  │ Hyperparameter│  │ Research Agenda │   │         │
│  │  │Investigation │  │   Tuning      │  │  Execution      │   │         │
│  │  └─────────────┘  └──────────────┘  └────────────────┘   │         │
│  │                                           ↓               │         │
│  │                                     ┌────────────────┐   │         │
│  │          Experiment Manager ←───────│    Stage 4      │   │         │
│  │          Agent (coordinates)        │ Ablation Studies│   │         │
│  │                                     └────────────────┘   │         │
│  └───────────────────────────────────────────────────────────┘         │
│       ↓                                                                 │
│  PHASE 3: MANUSCRIPT GENERATION                                        │
│  ┌───────────────────────────────────────────────────────────┐         │
│  │  Plot Aggregation → Paper Writing → Citation Gathering     │         │
│  │       → VLM Figure Review → Reflection → Final PDF         │         │
│  └───────────────────────────────────────────────────────────┘         │
│       ↓                                                                 │
│  PHASE 4: REVIEW                                                       │
│  ┌───────────────────────────────────────────────────────────┐         │
│  │  LLM Text Review + VLM Figure/Caption Review               │         │
│  └───────────────────────────────────────────────────────────┘         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

v1 → v2 Architecture Comparison

AI SCIENTIST v1 (LINEAR):
─────────────────────────────────────────────────────────────
Template Code    →  Idea Gen  →  Aider Code Editing  →  Execute
     ↑                                                     ↓
  (human-authored)                                    Visualize
                                                          ↓
                                                    Paper Write
                                                    (iterative)
                                                          ↓
                                                    Auto Review
                                                          ↓
                                                    Improvement

AI SCIENTIST v2 (TREE-STRUCTURED):
─────────────────────────────────────────────────────────────
Topic (.md)  →  Idea Gen   →  ┌─ Stage 1: Prototype  ─── BFTS ──┐
                  (+ S2)       │  Stage 2: Tune        ─── BFTS ──│
                               │  Stage 3: Agenda      ─── BFTS ──│
                               │  Stage 4: Ablations   ─── BFTS ──│
                               └──────────────────────────────────┘
                                              ↓
                               Plot Aggregation + VLM Review
                                              ↓
                               Single-Pass Paper Writing (o1)
                                              ↓
                               VLM + LLM Review

Orchestration Architecture

The system is orchestrated through two entry points:

  1. perform_ideation_temp_free.py — Generates research ideas from a topic description
  2. launch_scientist_bfts.py — Executes the full BFTS experimentation + writing + review pipeline for a single idea

Key architectural decisions:

Decision | v1 Approach | v2 Approach | Rationale
Configuration | Command-line args only | bfts_config.yaml + CLI args | Tree search has too many parameters for CLI alone
Idea scope per run | Multiple ideas per invocation | Single idea per invocation | Each tree search is computationally expensive
Model selection | Single --model flag | Separate flags per stage | Different tasks benefit from different model strengths
Code editing | Aider-mediated diffs | Direct LLM code generation | Tree search needs complete code per node, not incremental diffs
Experiment execution | Sequential, in-process | Parallel, multi-worker | Tree search enables natural parallelization
Paper writing | Multi-round Aider editing | Single-pass reasoning model + reflection | Simplifies writing; leverages o1's long-form capabilities

Module Structure

ai_scientist/
├── perform_ideation_temp_free.py   # Phase 1: Idea generation
├── treesearch/                      # Phase 2: Core BFTS engine
│   └── perform_experiments_bfts_with_agentmanager  # Main entry point
├── perform_plotting.py              # Plot aggregation across nodes
├── perform_writeup.py               # Paper writing (normal + ICBINB formats)
├── perform_llm_review.py            # Text-based automated review
├── perform_vlm_review.py            # Vision-based figure/caption review
├── gather_citations.py              # Semantic Scholar citation integration
├── ideas/                           # Topic descriptions and generated ideas
│   └── i_cant_believe_its_not_better.md  # Example topic file
└── llm.py                           # LLM client creation utilities

bfts_config.yaml                     # Tree search configuration
launch_scientist_bfts.py             # Main orchestrator script

Dependency Architecture

v2 introduces significant new dependencies reflecting the architectural shift:

Library | Purpose | v1 Status | v2 Status
aider-chat | LLM-mediated code editing | Core dependency | Removed
omegaconf | Hierarchical YAML configuration | Not present | Added (for bfts_config.yaml)
python-igraph | Graph data structure for tree | Not present | Added (tree management)
seaborn | Statistical visualization | Not present | Added
rich | Terminal output formatting | Not present | Added
jsonschema | JSON validation | Not present | Added
dataclasses-json | Serialization of node state | Not present | Added
boto3 / botocore | AWS Bedrock for Claude | Not present | Added
pymupdf4llm | PDF processing for LLMs | pymupdf (basic) | Upgraded
torch | ML framework | Required | Not in requirements.txt (assumed pre-installed)
google-generativeai | Gemini API | Present | Removed (Gemini via OpenAI API instead)

10 Component Breakdown

Component 1: Idea Generation Engine

Purpose: Generate structured research ideas from a high-level topic description, with literature-grounded novelty assessment.

Input: Markdown file with Title, Keywords, TL;DR, and Abstract sections defining the research scope.

Process:

  1. LLM generates candidate research ideas based on the topic description
  2. Each idea undergoes multiple reflection rounds (--num-reflections, default 5)
  3. Semantic Scholar is queried in-loop for novelty checking and related-work identification
  4. Ideas are refined based on literature context
  5. Output: structured JSON with hypotheses, proposed experiments, and related work
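The refinement loop described above can be sketched as follows. The function and method names (`propose`, `reflect`, `query`) are invented stand-ins, not the repository's actual API; only the loop structure follows the documented process.

```python
# Hypothetical sketch of the ideation refinement loop.
def refine_idea(llm, scholar, topic_md: str, num_reflections: int = 5):
    """Propose an idea, then refine it over literature-grounded rounds."""
    idea = llm.propose(topic_md)                # initial candidate idea
    for _ in range(num_reflections):            # --num-reflections rounds
        related = scholar.query(idea["title"])  # Semantic Scholar lookup
        idea = llm.reflect(idea, related)       # revise against prior work
    return idea                                 # JSON-ready structured idea
```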

Output: ideas/<topic_name>.json containing a list of structured research ideas.

Key difference from v1: v1's ideation was constrained by the code template — ideas were incremental modifications to existing code. v2 starts from abstract concepts (like a grant proposal) before any code exists.

Component 2: Experiment Progress Manager Agent

Purpose: Coordinate the four-stage experimental lifecycle, enforcing structure while allowing flexible exploration within each stage.

Stage definitions and stopping criteria:

Stage 1: PRELIMINARY INVESTIGATION
├── Goal: Establish feasibility via minimal working prototype
├── Stop when: Basic working prototype successfully executes
├── Output: Root node for Stage 2
└── Search: BFTS with parallel initial nodes

Stage 2: HYPERPARAMETER TUNING
├── Goal: Optimize critical hyperparameters for robust baseline
├── Stop when: Training curves converge + success on ≥2 datasets
├── Output: Tuned baseline node for Stage 3
├── Search: BFTS with specialized hyperparameter nodes
└── Tracking: Previously tested hyperparameters recorded to avoid redundancy

Stage 3: RESEARCH AGENDA EXECUTION
├── Goal: Systematically implement the core research agenda
├── Stop when: Computational budget exhausted
├── Runtime check: If runs finish too fast → suggest increasing complexity
├── Output: Best experimental node for Stage 4
└── Search: BFTS with refinement and debugging

Stage 4: ABLATION STUDIES
├── Goal: Assess importance of research components
├── Stop when: Computational budget exhausted
├── Output: Final experimental results + replication statistics
├── Search: BFTS with specialized ablation nodes
└── Replication: Multiple seeds for statistical robustness

Inter-stage transitions: After each stage, the manager:

  1. Evaluates all terminal nodes using a dedicated LLM evaluator
  2. Selects the best-performing node based on articulated criteria
  3. Creates a checkpoint
  4. Launches replication runs of the best node for mean/std statistics
  5. Seeds the next stage with the selected best node
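Under the same caveat (all names here are invented stand-ins, with the LLM evaluator reduced to a ranking callable), the transition logic amounts to roughly:

```python
# Hypothetical sketch of an inter-stage transition: pick the best terminal
# node, checkpoint it, replicate across seeds for mean/std, seed next stage.
import statistics


def transition(stage, terminal_nodes, rank, run_with_seed, n_replications=3):
    best = max(terminal_nodes, key=rank)         # stand-in for the LLM evaluator
    checkpoint = {"stage": stage, "best": best}  # persisted in the real system
    scores = [run_with_seed(best, s) for s in range(n_replications)]
    stats = {"mean": statistics.mean(scores),    # replication statistics
             "std": statistics.pstdev(scores)}
    return best, checkpoint, stats               # seed node for the next stage
```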

Component 3: Parallelized Agentic Tree Search (BFTS)

Purpose: Explore the experimental search space through structured tree expansion, balancing exploitation (refining promising paths) and exploration (debugging failed paths).

Node definition: Each node is a tuple containing:

  • Experiment script (Python code)
  • High-level plan description (natural language)
  • Execution error trace (if applicable)
  • Experiment runtime
  • Performance metrics
  • LLM feedback on execution
  • Visualization script
  • Generated figure file paths
  • VLM feedback on figures
  • Node status: buggy | non-buggy
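A minimal dataclass capturing these fields might look like the following. The field names are invented for illustration; the repository serializes node state via dataclasses-json, but the exact schema is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One BFTS node: code, results, and feedback for a single experiment."""
    plan: str                                         # high-level plan description
    experiment_code: str                              # Python experiment script
    plot_code: str = ""                               # visualization script
    error_trace: str = ""                             # execution error trace, if any
    runtime_s: float = 0.0                            # experiment runtime
    metrics: dict = field(default_factory=dict)       # performance metrics
    llm_feedback: str = ""                            # LLM feedback on execution
    figure_paths: list = field(default_factory=list)  # generated figure files
    vlm_feedback: str = ""                            # VLM feedback on figures
    buggy: bool = False                               # status: buggy | non-buggy
```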

Node types:

Type | Purpose | Created During | Expansion Rule
Standard (non-buggy) | Successfully executed experiment | All stages | Refinement: improve code for better performance
Buggy | Failed execution or VLM rejection | All stages | Debugging: fix errors using recorded error trace
Hyperparameter | Systematic hyperparameter exploration | Stage 2 | HP variation with redundancy tracking
Ablation | Component importance assessment | Stage 4 | Ablation conditions with redundancy tracking
Replication | Statistical robustness via different seeds | End of each stage | Same code, different random seed
Aggregation | Combine replication results | After replications | No new experiments; generates combined plots

Selection policy:

  • With probability debug_prob: select a buggy node for debugging
  • Otherwise: best-first selection among non-buggy nodes
  • Best-first evaluation by the LLM based on performance metrics, training dynamics, and plot quality
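This policy can be expressed in a few lines. Note one simplification: the real best-first ranking is a qualitative LLM judgment, which is stubbed here as a numeric `score` field (an assumption for illustration); the `debug_prob` default below is likewise illustrative, since the actual value is configurable.

```python
import random


def select_node(nodes, debug_prob=0.5, rng=random):
    """With probability debug_prob debug a buggy node, else expand best-first."""
    buggy = [n for n in nodes if n["buggy"]]
    healthy = [n for n in nodes if not n["buggy"]]
    if buggy and (not healthy or rng.random() < debug_prob):
        return rng.choice(buggy)                   # error recovery path
    return max(healthy, key=lambda n: n["score"])  # best-first expansion
```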

Component 4: VLM Feedback Loop

Purpose: Iteratively evaluate and refine figure quality at two points in the pipeline.

Integration Point 1 — During Experimentation:

Experiment execution → metrics saved to .npy files
  → Plotting code generates figures
  → VLM receives: figures + code context
  → VLM checks: label clarity, legend presence, data accuracy
  → If issues found: node marked buggy, feedback recorded
  → If passes: node marked non-buggy

Integration Point 2 — During Paper Writing:

Paper draft generated → screenshots extracted of each figure
  → VLM receives: figure images + captions + referencing text
  → VLM checks:
      - Figure-caption alignment
      - Visual clarity (labels, legends, axes)
      - Duplication between main text and appendix
      - Data presentation accuracy
  → Feedback integrated into reflection stage
  → Iterative refinement until quality threshold met

Key improvement over v1: v1 had no visual understanding whatsoever — the automated reviewer only processed text. v2's VLM integration enables the system to detect and correct visual presentation issues that would be immediately apparent to a human reviewer.

Component 5: Paper Writing Engine

Purpose: Generate a complete scientific manuscript from experimental results.

v2 approach (simplified from v1):

1. Plot aggregation: Consolidate figures from the best experimental nodes and replications
2. Single-pass generation: Reasoning model (o1-preview) generates the complete LaTeX manuscript in one pass
3. Citation gathering: Semantic Scholar integration adds relevant references (up to 20 rounds)
4. VLM reflection: Vision-language model reviews figures and captions
5. Reflection stage: Reasoning model reviews and refines the complete manuscript
6. Output: Compiled PDF

Paper formats supported:

- normal: Standard 8-page conference format
- icbinb: 4-page workshop format (used for ICLR 2025 ICBINB submissions)

Timing: Writing stage takes approximately 20–30 minutes total.

Component 6: Automated Review System

Purpose: Evaluate generated manuscripts using both text and visual analysis.

Dual review pipeline:

1. LLM text review (perform_llm_review.py): Evaluates manuscript text for clarity, methodology, and scientific rigor
2. VLM visual review (perform_vlm_review.py): Evaluates figures, captions, and their alignment with the text

Improvement over v1: v1's reviewer was text-only and was validated against human scores as a proxy. v2's addition of VLM review adds a visual dimension, though the ultimate validation came from real human peer review rather than automated scoring alone.


11 Core Mechanisms (Detailed)

Mechanism 1: Template Elimination

The shift from template-dependent to template-free operation is the single most architecturally significant change. Understanding how it works requires tracing the information flow:

v1 Information Flow (Template-Dependent):

Human creates:
  templates/nanoGPT/
    ├── experiment.py    (baseline code — hundreds of lines)
    ├── plot.py           (visualization code)
    ├── prompt.json       (domain context for LLM)
    └── seed_ideas.json   (example ideas)

LLM receives: template code + context → generates incremental modifications
Aider applies: diff patches to template code
Execution: modified template code runs in same environment

v2 Information Flow (Template-Free):

Human creates:
  ideas/my_topic.md     (markdown description: title, keywords, TL;DR, abstract)
                          (typically <1 page of text)

Ideation LLM: generates structured research idea from topic description
BFTS LLM: generates complete experiment code from scratch (no template)
  → Stage 1: minimal working prototype
  → Stage 2: hyperparameter-optimized version
  → Stage 3: full research agenda implementation
  → Stage 4: ablation variants
Datasets: loaded via Hugging Face Hub (one-line API call)

How the system bootstraps without a template:

1. The ideation stage produces a concrete hypothesis and experimental design
2. Stage 1 of BFTS generates a minimal prototype from scratch — the LLM writes complete Python code, not modifications
3. Multiple parallel root nodes provide diversity in initial implementations
4. The tree search explores variations, with the LLM evaluator selecting the most promising directions
5. Each subsequent stage builds on the best node from the previous stage, so code quality improves progressively

Trade-off: Without a human-verified template as a starting point, the system is more likely to produce incorrect or poorly structured code. This is offset by the tree search's ability to explore multiple paths and recover from failures via debugging nodes.

Mechanism 2: Best-First Tree Search (BFTS)

The BFTS algorithm is adapted from AIDE (Jiang et al., 2025) with modifications for the multi-stage scientific experimentation context:

AIDE's original BFTS (for ML engineering tasks):

- Each node = a potential solution with a scalar evaluation score (e.g., validation accuracy)
- Nodes selected for expansion based on score ranking
- Single-stage: continuous refinement toward a single metric

AI Scientist v2's adapted BFTS (for scientific discovery):

- Each node = experiment code + results + figures + LLM feedback (rich metadata)
- Node evaluation by LLM rather than scalar metric (qualitative assessment)
- Multi-stage: four distinct stages with different objectives and node types
- Additional node types (hyperparameter, ablation, replication, aggregation) beyond AIDE's standard/buggy distinction

Algorithm pseudocode:

function BFTS_Stage(root_node, config):
    tree ← initialize_tree(root_node)
    for step in range(config.steps):
        # Select nodes for expansion
        candidates ← []
        for _ in range(config.num_workers):
            if random() < config.debug_prob AND tree.has_buggy_nodes():
                node ← select_buggy_node(tree)
                child ← create_debug_child(node)
            else:
                node ← llm_best_first_select(tree.non_buggy_nodes())
                child ← create_refinement_child(node)
            candidates.append(child)

        # Parallel execution
        parallel_execute(candidates)

        # Post-execution evaluation
        for child in candidates:
            if child.execution_failed:
                child.status ← BUGGY
                child.error_trace ← capture_error()
            else:
                child.metrics ← load_numpy_results()
                child.figures ← run_plotting_code()
                vlm_feedback ← vlm_evaluate(child.figures)
                if vlm_feedback.has_issues:
                    child.status ← BUGGY
                    child.vlm_feedback ← vlm_feedback
                else:
                    child.status ← NON_BUGGY
            tree.add_node(child)

    # Stage completion
    best_node ← llm_evaluate_and_select(tree.non_buggy_nodes())
    replications ← create_replication_nodes(best_node, num_seeds=3)
    parallel_execute(replications)
    aggregation ← create_aggregation_node(replications)
    return best_node, aggregation

Key properties of the search:

  1. Anytime: Can be stopped at any step and still yield the current best node
  2. Recoverable: Buggy nodes are not discarded; they can be debugged in future steps
  3. Parallel: Multiple nodes expanded concurrently per step
  4. Qualitative evaluation: LLM-based node scoring captures aspects that scalar metrics miss (code quality, experimental design, visualization clarity)
  5. Stage-aware: Different stages use different node types and stopping criteria
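The "parallel" property above can be sketched with a thread pool. This is a toy stand-in, not the repository's implementation; `run_experiment` is a hypothetical placeholder for executing one node's script in a subprocess:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(node_id):
    # Stand-in for executing one node's experiment script and collecting results.
    return {"node": node_id, "status": "non-buggy"}

def parallel_execute(node_ids, num_workers=3):
    # Expand several candidate nodes concurrently, as one BFTS step does.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_experiment, node_ids))

results = parallel_execute(["n1", "n2", "n3", "n4"], num_workers=2)
```

Because `map` preserves input order, results line up with the candidate list even when workers finish out of order.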

Mechanism 3: VLM-Integrated Figure Refinement

The VLM feedback loop operates as a quality gate at two points:

During Experimentation (per-node):

Node execution succeeds
  → System reads .npy result files
  → LLM generates plotting code
  → Plotting code executes → figure files
  → VLM receives figure images + experiment context
  → VLM evaluates:
      ✓ Are axes labeled?
      ✓ Is there a legend?
      ✓ Do data values match metrics?
      ✓ Is the visualization type appropriate?
      ✓ Are there any misleading elements?
  → VLM returns structured feedback
  → If any check fails: node marked buggy
  → Feedback stored for future debugging attempts

During Paper Writing (manuscript-level):

LaTeX manuscript generated
  → Screenshot each figure from rendered PDF
  → Extract caption text and referencing text ("Figure X")
  → VLM receives: image + caption + reference text
  → VLM evaluates:
      ✓ Does the caption accurately describe the figure?
      ✓ Does the referencing text correctly interpret the figure?
      ✓ Are there duplicate figures in main text and appendix?
      ✓ Is visual quality sufficient for publication?
  → Feedback integrated into reflection stage
  → Writing model revises manuscript based on VLM feedback
  → Iterate until quality threshold or max iterations
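The per-node quality gate can be sketched as a pure function over check results. The check names and the dict shape are illustrative assumptions; in the real system the checks come back from a VLM call:

```python
def vlm_quality_gate(figure_checks):
    """Aggregate per-figure VLM check results into a node status.
    figure_checks: one dict of boolean checks per figure,
    e.g. {"axes_labeled": True, "has_legend": False} (names are illustrative).
    Any failed check marks the node buggy, and the recorded issues
    can drive a later debugging attempt."""
    issues = []
    for i, checks in enumerate(figure_checks):
        for name, passed in checks.items():
            if not passed:
                issues.append(f"figure {i}: failed '{name}'")
    status = "buggy" if issues else "non-buggy"
    return status, issues

status_bad, issues_bad = vlm_quality_gate([{"axes_labeled": True, "has_legend": False}])
status_ok, issues_ok = vlm_quality_gate([{"axes_labeled": True, "has_legend": True}])
```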

Mechanism 4: Experiment Manager State Machine

The experiment manager operates as a finite state machine over the four stages:

         ┌──────────────────────────────────────────────────────────────┐
         │                                                              │
         ▼                                                              │
  ┌──────────────┐    best    ┌──────────────┐    best    ┌──────────────┐
  │   STAGE 1    │───node───→│   STAGE 2    │───node───→│   STAGE 3    │
  │ Preliminary  │            │  HP Tuning   │            │   Agenda     │
  │Investigation │            │              │            │  Execution   │
  └──────────────┘            └──────────────┘            └──────────────┘
        │                           │                           │
   Stop: working              Stop: convergence +          Stop: budget
   prototype                  ≥2 datasets pass             exhausted
                                                                │
                                                          best node
                                                                │
                                                                ▼
                                                         ┌──────────────┐
                                                         │   STAGE 4    │
                                                         │  Ablation    │
                                                         │  Studies     │
                                                         └──────────────┘
                                                                │
                                                           Stop: budget
                                                           exhausted
                                                                │
                                                                ▼
                                                         ┌──────────────┐
                                                         │  MANUSCRIPT  │
                                                         │  GENERATION  │
                                                         └──────────────┘

State transitions:

- Each stage runs BFTS independently
- The best node from stage N becomes the root node of stage N+1
- Checkpoints are saved at each transition
- Replication runs are launched at each transition for statistics
- The manager decides when stopping criteria are met

Stopping criteria specifics:

- Stage 1: Binary — did any node produce a running prototype?
- Stage 2: Convergence — training curves stabilize across datasets
- Stage 3: Budget — fixed compute allocation, with complexity escalation if runs are too fast
- Stage 4: Budget — fixed compute allocation
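The manager's forward-only stage progression can be sketched as a small transition table (names illustrative, not the repository's actual enums):

```python
from enum import Enum, auto

class Stage(Enum):
    PRELIMINARY = auto()   # Stage 1: preliminary investigation
    HP_TUNING = auto()     # Stage 2: hyperparameter tuning
    AGENDA = auto()        # Stage 3: research agenda execution
    ABLATION = auto()      # Stage 4: ablation studies
    MANUSCRIPT = auto()    # terminal: manuscript generation

NEXT = {
    Stage.PRELIMINARY: Stage.HP_TUNING,
    Stage.HP_TUNING: Stage.AGENDA,
    Stage.AGENDA: Stage.ABLATION,
    Stage.ABLATION: Stage.MANUSCRIPT,
}

def advance(stage, stop_condition_met):
    # The manager only moves forward once the current stage's stopping
    # criterion holds; the best node from the finished stage seeds the next.
    if not stop_condition_met:
        return stage
    return NEXT.get(stage, Stage.MANUSCRIPT)
```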

Mechanism 5: Dataset Loading Strategy

Rather than relying on locally packaged datasets (as v1 templates did), v2 uses a standardized approach:

from datasets import load_dataset

dataset = load_dataset("dataset_name")

Advantages:

- No manual dataset preparation per template
- Access to thousands of Hugging Face Hub datasets
- Standardized train/validation/test splits
- Automatic downloading and caching

Limitations (acknowledged by the authors):

- Not all datasets support load_dataset
- Some datasets require custom preprocessing
- No guarantee the LLM will choose appropriate datasets for the hypothesis
- No built-in validation that train/test splits are properly separated in the generated code
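One way to cope with the first limitation is to try candidate dataset names in order with a fallback. This is a sketch, not the system's actual behavior; the `loader` parameter would be `datasets.load_dataset` in practice, and is injected here so the fallback logic is testable without a network connection:

```python
def load_first_available(candidates, loader):
    """Try candidate dataset names in order, since not every dataset
    supports load_dataset out of the box. Returns the first name that
    loads, plus the loaded dataset."""
    errors = {}
    for name in candidates:
        try:
            return name, loader(name)
        except Exception as exc:  # real loaders raise on missing/broken datasets
            errors[name] = str(exc)
    raise RuntimeError(f"no candidate dataset loadable: {errors}")

def fake_loader(name):
    # Stand-in for datasets.load_dataset in this sketch.
    if name == "bad":
        raise ValueError("no loading script")
    return {"train": [1, 2]}

chosen, ds = load_first_available(["bad", "good"], fake_loader)
```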


12 Programming Language

Implementation Language

Python 3.11 — the entire system is implemented in Python.

Key Libraries and Their Roles

| Library | Version | Role |
|---|---|---|
| openai | Latest | OpenAI API client (GPT-4o, o1, o3-mini) |
| anthropic[bedrock] | Latest | Claude models via AWS Bedrock |
| omegaconf | Latest | Hierarchical configuration management |
| python-igraph | Latest | Tree data structure for BFTS |
| datasets (Hugging Face) | Latest | Dataset loading for experiments |
| numpy | Latest | Experiment result storage (.npy) |
| matplotlib | Latest | Figure generation |
| seaborn | Latest | Statistical visualization |
| pymupdf4llm | Latest | PDF processing for LLM review |
| rich | Latest | Terminal output and logging |
| jsonschema | Latest | Configuration validation |
| dataclasses-json | Latest | Node state serialization |

Code Execution Model

The system generates and executes Python code in a subprocess:

- Experiment code is written to .py files
- Executed via the Python interpreter
- stdout/stderr captured for error traces
- Results saved to structured numpy files
- Plotting code executed separately
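A minimal sketch of this execution model, assuming only the behavior described above (write the code to a file, run it, capture output, classify the node by exit status):

```python
import os
import subprocess
import sys
import tempfile

def execute_experiment(code, timeout=60):
    """Write LLM-generated code to a .py file, run it in a subprocess,
    and capture stdout/stderr so failures become debuggable error traces."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        status = "non-buggy" if proc.returncode == 0 else "buggy"
        return {"status": status, "stdout": proc.stdout, "error_trace": proc.stderr}
    finally:
        os.unlink(path)

ok = execute_experiment("print('metric: 0.92')")
bad = execute_experiment("raise ValueError('boom')")
```

A nonzero exit code marks the node buggy and preserves the traceback for a later debugging attempt.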

Safety Considerations

"This codebase will execute Large Language Model (LLM)-written code. There are various risks and challenges associated with this autonomy, including the potential use of dangerous packages, uncontrolled web access, and the possibility of spawning unintended processes."

The authors recommend Docker containers for sandboxing. No built-in sandboxing is provided in the codebase itself.


13 Memory Management

Intra-Run Memory

The tree search maintains memory through the tree structure itself:

Tree Node Memory:
├── Code: complete experiment script preserved at each node
├── Plan: natural language description of what this node implements
├── Metrics: experimental results stored in .npy files
├── Figures: generated visualization files
├── Errors: full error traces for buggy nodes
├── Feedback: LLM and VLM feedback recorded per node
└── Status: buggy/non-buggy classification

Inter-Node Memory:
├── Parent-child links preserve experimental lineage
├── Sibling relationships show parallel exploration paths
├── Best-node selection carries forward across stages
└── Replication nodes share parent code, vary seeds

Cross-Stage Memory

The experiment manager maintains state across stages:

  1. Checkpoints: Saved at each stage completion
  2. Best node propagation: Selected node becomes root of next stage
  3. Hyperparameter history: Stage 2 tracks tested configurations to avoid redundancy
  4. Ablation history: Stage 4 tracks tested conditions similarly
  5. Replication statistics: Mean/std computed and carried forward to manuscript
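The replication statistics carried forward to the manuscript can be sketched with the standard library (the metric names and dict shapes are illustrative):

```python
from statistics import mean, stdev

def aggregate_replications(runs):
    """Combine metrics from replication nodes (same code, different seeds)
    into mean/std summaries that the manuscript stage can report."""
    keys = runs[0].keys()
    return {
        k: {"mean": mean(r[k] for r in runs),
            "std": stdev(r[k] for r in runs)}
        for k in keys
    }

# Three replication seeds of a hypothetical experiment
summary = aggregate_replications([
    {"val_acc": 0.80},
    {"val_acc": 0.82},
    {"val_acc": 0.84},
])
```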

Cross-Run Memory

No persistent memory exists between separate runs. Each invocation of launch_scientist_bfts.py starts fresh. There is:

  • No skills library or knowledge base carried across experiments
  • No cross-experiment transfer learning
  • No persistent embedding store for novelty comparison
  • No accumulated heuristics from previous runs

This is a significant limitation compared to systems like EurekaClaw (which maintains a skills library) or AIRA₂ (which accumulates population knowledge across evolutionary generations).

Novelty Memory During Ideation

During idea generation, Semantic Scholar queries provide a form of external memory:

- The system checks proposed ideas against the published literature
- Previously proposed ideas within the same ideation run are tracked
- This prevents redundant idea generation within a single session
- No persistence across separate ideation runs
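The within-session tracking can be sketched as a normalized-title set (a simplification; the real system also queries Semantic Scholar against the published literature, and its dedup criteria are not specified here):

```python
def is_novel(idea_title, seen_titles):
    """Within-session novelty check: normalize the title and compare it
    against ideas already proposed in this ideation run. Mutates
    seen_titles so later proposals see earlier ones."""
    key = " ".join(idea_title.lower().split())  # case- and whitespace-insensitive
    if key in seen_titles:
        return False
    seen_titles.add(key)
    return True

seen = set()
first = is_novel("Compositional  Regularization", seen)
repeat = is_novel("compositional regularization", seen)
```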

Visualization

The system generates unified_tree_viz.html — an interactive HTML visualization of the complete tree search for each run. This serves as a post-hoc analysis tool rather than a runtime memory mechanism, but enables human researchers to understand the search trajectory.


14 Continued Learning

Within a Single Run

The tree search implements a form of within-run learning:

  1. Error recovery: Buggy nodes' error traces inform debugging attempts — the system learns from its mistakes within a run
  2. Progressive refinement: Each stage builds on the best outcome of the previous stage — cumulative improvement
  3. VLM feedback integration: Figure quality issues detected early inform later plotting attempts
  4. Hyperparameter tracking: Stage 2 avoids re-testing configurations, learning from previous attempts
  5. Ablation tracking: Stage 4 similarly avoids redundant experiments
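The redundancy tracking in points 4 and 5 can be sketched as a set of already-tested configurations (an illustrative mechanism; the repository's actual bookkeeping may differ):

```python
def propose_config(candidate, tested):
    """Stage 2/4 style redundancy tracking: skip settings already tried
    in this run. Configs are flat dicts; a frozenset of items makes them
    hashable and order-insensitive for the tested-set."""
    key = frozenset(candidate.items())
    if key in tested:
        return None  # redundant — the search should propose something else
    tested.add(key)
    return candidate

tested = set()
a = propose_config({"lr": 0.01, "batch": 32}, tested)
b = propose_config({"batch": 32, "lr": 0.01}, tested)  # same config, reordered
```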

Across Runs

No cross-run learning exists. Each run is independent with no mechanism for:

  • Transferring successful strategies to new topics
  • Building a library of reusable experimental components
  • Accumulating domain expertise over time
  • Meta-learning about which experimental approaches work best

Comparison with Learning-Enabled Systems

| System | Within-Run Learning | Cross-Run Learning | Mechanism |
|---|---|---|---|
| AI Scientist v2 | Tree search refinement | None | BFTS tree state |
| AI Scientist v1 | Aider conversation history | None | Linear context |
| EurekaClaw | Evolutionary memory | Skills library | Persistent vector store |
| AIRA₂ | Population evolution | None (per-task) | Evolutionary memory |
| AutoResearchClaw | Agent memory | Session memory | Knowledge graph |

Potential for Cross-Run Learning

The paper does not discuss cross-run learning, but the architecture would naturally support it through:

  1. Tree node embeddings: Successful node codes could be embedded and retrieved for future runs
  2. Strategy library: Successful experimental strategies could be abstracted and reused
  3. Domain models: Accumulating domain knowledge from multiple runs on related topics
  4. Meta-heuristics: Learning which BFTS configurations work best for which types of research questions

15 Applications

Primary Application: Automated ML Research

The system's demonstrated application is generating workshop-level research manuscripts in machine learning. The three ICLR submissions spanned:

  1. Compositional generalization (regularization techniques for sequence models)
  2. Agricultural pest detection (applied deep learning)
  3. Model calibration under label noise (ML robustness)

This breadth — from theoretical ML to applied computer vision — demonstrates the domain-general ambition of the template-free approach.

Application Domain Constraints

| Constraint | Impact |
|---|---|
| ML-only | System assumes ML experiments (Python + PyTorch/TensorFlow); cannot conduct wet-lab, social science, or theoretical mathematics research |
| Hugging Face datasets | Experiments limited to datasets available via Hugging Face Hub or those the LLM can generate synthetically |
| GPU-dependent | Cannot run on CPU-only machines; limits accessibility |
| Single-paper scope | Each run produces one paper; cannot conduct research programs spanning multiple publications |
| Workshop-level quality | Current capability is workshop-level; not yet suitable for main-track conferences |

Potential Extensions

Near-term (system-level improvements):

- Multi-idea orchestration within a single session
- Persistent knowledge base across runs
- Improved code sandboxing
- Support for non-ML experimental domains (e.g., bioinformatics pipelines)
- Integration with laboratory robotics for wet-lab experiments

Medium-term (capability improvements):

- Conference-level paper quality through deeper tree search and better LLMs
- Multi-modal experiments (vision, NLP, robotics simultaneously)
- Collaborative multi-agent research teams
- Automated rebuttal writing for reviewer feedback
- Integration with preprint servers for automated submission

Long-term (paradigm implications):

- Continuous scientific discovery loops (open-ended research programs)
- Cross-disciplinary hypothesis generation
- AI-driven meta-research (studying what makes research effective)
- Autonomous identification of high-impact research directions

Safety and Ethics Considerations

The paper dedicates significant discussion to safety and ethical implications:

  1. Code execution risk: LLM-generated code may be unsafe (e.g., using dangerous packages, spawning processes, accessing the network). Docker sandboxing is recommended but not enforced.

  2. Scientific integrity: The system can produce hallucinated citations, fabricated results, and misleading claims. Without human oversight, these could enter the scientific record.

  3. Peer review ethics: The ICLR experiment was conducted with full transparency and pre-arranged withdrawal. However, the existence of such systems raises questions:
     - How should venues handle AI-generated submissions?
     - Should AI-generated papers be required to disclose their provenance?
     - What is the reviewer's responsibility when AI submissions increase in volume?

  4. Acceleration risks: If scaled, such systems could flood peer review with low-quality submissions, overwhelming human reviewers. The paper acknowledges this and advocates for community discussion.

  5. Mandatory disclosure: The code license requires users to "clearly and prominently disclose the use of AI in any resulting scientific manuscripts or papers."

Significance for the Field of Automated Scientific Discovery

The AI Scientist v2 represents a qualitative threshold crossing: the first time a fully autonomous system produced work accepted by human peer reviewers at a recognized venue. While the practical significance is modest (one workshop paper), the conceptual significance is substantial:

  • Proof of concept validated externally: Unlike v1's self-evaluation via automated reviewer, v2's validation came from independent human experts who were not told which papers were AI-generated.
  • Template-free generalization demonstrated: The three submitted papers covered meaningfully different ML topics, not just variations within a single domain.
  • Tree search superiority confirmed: The multi-stage BFTS approach enabled deeper experimental exploration than v1's linear pipeline, reflected in the experimental designs of the submitted papers.
  • The gap to conference-level is clear: The authors' honest assessment that none of the papers meet conference standards provides a concrete improvement target for the field.

Positioning Within OmniEvolve

The AI Scientist v2 is relevant to OmniEvolve's design in several ways:

| v2 Component | OmniEvolve Parallel | Key Lesson |
|---|---|---|
| BFTS tree search | Search backends (island-based evolution) | Tree search with LLM evaluation is effective for code-level exploration |
| Experiment Manager | Orchestrator lifecycle management | Multi-stage search with explicit stopping criteria outperforms single-stage |
| VLM feedback loop | Cascade evaluator (multi-signal) | Visual evaluation adds signal that text-only evaluation misses |
| No cross-run learning | Knowledge module (skills, logs) | OmniEvolve's learning infrastructure addresses v2's biggest architectural gap |
| Template-free design | Benchmark-agnostic evaluation | Domain generality requires rethinking how context is provided to the search |
| bfts_config.yaml | Pydantic config schemas | Complex search methods need structured, validated configuration |
| Parallel node execution | Island migration and parallel evaluation | Parallelism with information sharing outperforms independent parallel runs |

Appendix A: Complete v1 → v2 Feature Comparison

| Feature | AI Scientist v1 | AI Scientist v2 |
|---|---|---|
| Release date | August 2024 | April 2025 |
| arXiv | 2408.06292 | 2504.08066 |
| Code availability | Open source | Open source |
| License | Apache 2.0 | Responsible AI Source Code License |
| Template required | Yes (per domain) | No |
| Experimentation style | Linear sequential | Best-First Tree Search |
| Parallel experiments | No | Yes (num_workers concurrent) |
| Code editing method | Aider (diff-based) | Direct LLM generation |
| VLM integration | None | During experiments + writing |
| Paper writing | Multi-round Aider editing | Single-pass o1 + reflection |
| Evaluation | Automated reviewer only | Real human peer review + automated |
| Best result | Exceeded automated reviewer threshold | Peer-review acceptance at ICLR workshop |
| Cost per paper | ~$15 | ~$20–25 |
| Domains demonstrated | 3 (NanoGPT, Diffusion, Grokking) | Open (any ML topic via markdown) |
| Dataset handling | Bundled in templates | Hugging Face Hub |
| Configuration | CLI arguments only | CLI + YAML (bfts_config.yaml) |
| Model routing | Single --model | Task-specific model selection |
| Tree search base | None | AIDE (WecoAI) |
| Semantic Scholar | Post-hoc novelty check | In-loop during ideation |
| Paper formats | 8-page | 8-page + 4-page ICBINB |
| Success rate | Higher (within template scope) | Lower (broader scope) |
| Sandboxing | Recommended | Recommended (same) |

Appendix B: ICLR 2025 ICBINB Workshop Experiment Protocol

Experimental Design

  1. Pre-registration: Sakana AI coordinated with ICBINB workshop organizers and ICLR leadership before submission
  2. IRB approval: Obtained institutional review board approval for the human evaluation experiment
  3. Reviewer consent: Reviewers were informed that "some submissions might be AI-generated" and could opt out
  4. Blind review: AI-generated papers were mixed with 40 human-written submissions (43 total)
  5. No disclosure during review: Reviewers did not know which papers were AI-generated
  6. Post-review disclosure: All reviewers were informed after the review process completed
  7. Pre-arranged withdrawal: Accepted AI-generated papers would be withdrawn regardless of scores

Results Detail

Paper 1: Compositional Regularization (Accepted)

- Investigated a temporal consistency regularizer on LSTM embeddings
- Evaluated on synthetic arithmetic expression datasets
- Found that compositional regularization does NOT significantly improve performance
- Reviewers appreciated the negative result and the clear identification of challenges
- Scores: 6 (weak accept), 7 (accept), 6 (weak accept) → average 6.33
- Placed in the top 45% of 43 submissions

Paper 2: Real-World Pest Detection (Rejected)

- Applied deep learning to agricultural pest classification
- Scores: 3, 7, 4 → average 4.67
- High variance in reviewer scores suggests inconsistent quality

Paper 3: Label Noise & Model Calibration (Rejected)

- Studied the effect of label noise on model calibration
- Scores: 3, 3, 3 → average 3.00
- Uniformly low scores indicate consistent quality issues

Author Self-Assessment

The Sakana AI team conducted an internal evaluation and identified:

- Citation hallucinations in all three papers
- Insufficient methodological rigor for conference level
- Ambiguity in method descriptions (e.g., unclear which network component is regularized)
- Potential dataset overlap issues
- Figure caption inaccuracies

Their assessment matched peer review: one paper was workshop-worthy, two were not.


Appendix C: Glossary of Key Terms

| Term | Definition |
|---|---|
| BFTS | Best-First Tree Search — tree search algorithm in which the most promising nodes are expanded first, guided by an evaluation function |
| VLM | Vision-Language Model — model that processes both images and text, used here for figure evaluation |
| ICBINB | "I Can't Believe It's Not Better" — ICLR workshop focused on negative results and challenges in deep learning |
| Aider | Open-source AI coding assistant used in v1 for diff-based code editing; removed in v2 |
| AIDE | AI Development Environment by WecoAI — tree search system for ML engineering that inspired v2's BFTS |
| Node | A single state in the tree search, comprising experiment code, results, figures, and metadata |
| Buggy node | A node that failed execution or VLM review |
| Non-buggy node | A node that successfully executed and passed VLM review |
| Experiment Manager | Dedicated agent that coordinates the four-stage experimental lifecycle |
| Semantic Scholar | Academic search engine used for literature search and novelty checking |
| Replication node | Node that re-runs the parent experiment with a different random seed for statistical robustness |
| Aggregation node | Special node that consolidates results from replication nodes into combined visualizations |