Introduced: 2025-04

Score: 7.88/10 (Draft)

Chapter 34

AI Scientist v2

Part P07: Autonomous Research Systems

Key Contribution

The AI Scientist v2 is the first fully autonomous system to produce a scientific manuscript accepted through real human peer review at a recognized venue (ICLR 2025 ICBINB Workshop). It achieves this through three architectural innovations over its predecessor: elimination of human-authored code templates via template-free ideation, replacement of linear experimentation with a progressive four-stage best-first tree search (BFTS), and integration of vision-language model (VLM) feedback for iterative figure refinement. The system was developed by Sakana AI in collaboration with the University of Oxford, University of British Columbia, and Vector Institute (arXiv:2504.08066, April 2025).

34.1 Overview and Motivation

The AI Scientist v1 (Lu et al., arXiv:2408.06292, August 2024) demonstrated the feasibility of end-to-end automated scientific paper generation at approximately $15 per manuscript. However, its practical impact was constrained by three structural limitations: dependence on human-authored code templates for each research domain, a strictly sequential experimentation pipeline incapable of exploring branching hypotheses, and a text-only automated reviewer with no visual understanding of generated figures. These limitations confined v1 to incremental modifications within pre-defined research scaffolds rather than genuine open-ended scientific exploration.

The AI Scientist v2 (Yamada*, Lange*, Lu* et al., arXiv:2504.08066, April 2025) addresses all three limitations through a fundamentally redesigned architecture. The system accepts a markdown topic description as its sole input—no template code, no pre-configured experimental infrastructure—and autonomously generates complete research manuscripts through progressive tree-structured experimentation. Eight authors across Sakana AI, University of British Columbia, Vector Institute, and University of Oxford contributed to the work, with significant team overlap from v1 ensuring architectural continuity.

The headline result provides the system's primary validation: one of three fully autonomous manuscripts submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop received scores of 6, 7, and 6 (average 6.33/10) from human reviewers, placing it in the top 45% of 43 submissions and exceeding the acceptance threshold. This constitutes the first documented instance of a fully AI-generated paper passing real human peer review at a recognized venue. The experiment was conducted with full ethical oversight: workshop organizers and ICLR leadership were informed in advance, reviewers could opt out, IRB approval was obtained, and accepted manuscripts were pre-arranged for withdrawal.

It is important to state what v2 is not. The authors explicitly acknowledge that v1 with a strong template can produce higher success rates on well-defined tasks, that none of the three submitted manuscripts met main-track conference standards, that external evaluations found fabricated results and hallucinated methodology in some outputs, and that the system does not replace human scientists. The contribution is a proof of concept at the workshop level, not a demonstration of trustworthy autonomous science.

34.2 Architecture

34.2.1 Pipeline Overview

The AI Scientist v2 pipeline consists of four sequential phases (ideation, experimentation, manuscript writing, and review), with the experimentation phase internally organized as a four-stage tree search. The system is orchestrated through two entry points in the repository: perform_ideation_temp_free.py for idea generation and launch_scientist_bfts.py for the full experimentation-through-review pipeline.

34.2.2 Architectural Departures from v1

The transition from v1 to v2 involves six fundamental architectural changes, each motivated by a specific limitation of the predecessor:

Design Decision	v1 Approach	v2 Approach	Rationale
Input specification	Human-authored code template per domain	Markdown topic description only	Eliminates per-domain engineering cost
Code editing	Aider library (diff-based patches)	Direct LLM code generation per node	Tree search requires complete code at each node, not incremental diffs
Experimentation	Sequential linear pipeline	Four-stage best-first tree search	Enables branching exploration and error recovery
Model routing	Single `--model` flag	Task-specific model selection per stage	Different tasks benefit from different model strengths
Paper writing	Multi-round Aider-based iterative editing	Single-pass o1-preview generation + reflection	Leverages reasoning model capabilities; simplifies pipeline
Figure evaluation	None (text-only review)	VLM feedback during experiments and writing	Catches visual presentation errors invisible to text-only review

A particularly significant dependency change is the complete removal of aider-chat, which served as the intermediary between the LLM and codebase in v1. In its place, v2 introduces python-igraph for tree data structure management, omegaconf for hierarchical YAML configuration, and boto3/botocore for AWS Bedrock access to Claude models. The addition of dataclasses-json for node state serialization reflects the richer per-node metadata that the tree search requires.

34.2.3 Task-Specific Model Routing

Unlike v1's single-model design, v2 routes different pipeline stages to different LLMs based on their task characteristics:

Pipeline Stage	Model (as reported)	Purpose
Ideation	GPT-4o (2024-05-13)	Idea generation, Semantic Scholar novelty checking
Experimentation (BFTS)	Claude 3.5 Sonnet	Code generation, planning, debugging within tree search
Aggregate Plots	o3-mini (2025-01-31)	Consolidating visualizations across replication runs
Paper Writing	o1-preview (2024-09-12)	Single-pass manuscript generation + reflection
Citation Gathering	GPT-4o (2024-11-20)	Literature search integration
Paper Review	GPT-4o (2024-11-20)	Automated text review
VLM Feedback	Vision-capable model	Figure quality assessment

This routing strategy requires both OpenAI API access (for GPT-4o, o1-preview, o3-mini) and AWS Bedrock access (for Claude 3.5 Sonnet). Gemini API support is available as an alternative experimentation model. The authors report that "higher success rates are generally observed when using powerful models like Claude 3.5 Sonnet" for the experimentation stage.
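The routing table above can be read as a simple lookup from pipeline stage to model. The following is a minimal sketch of that idea, not the repository's configuration format: the `Route` type, the stage keys, and the provider labels are hypothetical names chosen for illustration, while the model labels are copied from the table. The VLM row is omitted because the chapter does not name a specific model for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str     # model label as reported in the table above
    provider: str  # API needed to reach it (hypothetical label)


# Hypothetical stage keys; values mirror the chapter's routing table.
ROUTES = {
    "ideation": Route("GPT-4o (2024-05-13)", "openai"),
    "experimentation": Route("Claude 3.5 Sonnet", "bedrock"),
    "aggregate_plots": Route("o3-mini (2025-01-31)", "openai"),
    "paper_writing": Route("o1-preview (2024-09-12)", "openai"),
    "citation_gathering": Route("GPT-4o (2024-11-20)", "openai"),
    "paper_review": Route("GPT-4o (2024-11-20)", "openai"),
}


def route_for(stage: str) -> Route:
    """Return the route for a stage, failing loudly on unknown stages."""
    try:
        return ROUTES[stage]
    except KeyError:
        raise ValueError(f"unknown pipeline stage: {stage!r}") from None


def required_providers() -> set:
    """Collect every provider that a full pipeline run would need."""
    return {route.provider for route in ROUTES.values()}


if __name__ == "__main__":
    print(route_for("experimentation").model)
    print(sorted(required_providers()))
```

Collecting the providers up front makes the two-API requirement explicit: a run cannot complete with only one of the two credentials configured.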

34.2.4 Module Structure

The repository (github.com/SakanaAI/AI-Scientist-v2) is organized around the four pipeline phases:

# from repo: AI-Scientist-v2/ (top-level structure)
# ai_scientist/
#   perform_ideation_temp_free.py   — Phase 1: idea generation from markdown topic
#   treesearch/                      — Phase 2: core BFTS engine
#     perform_experiments_bfts_with_agentmanager  — main BFTS entry point
#   perform_plotting.py              — plot aggregation across best nodes
#   perform_writeup.py               — paper writing (normal + ICBINB formats)
#   perform_llm_review.py            — text-based automated review
#   perform_vlm_review.py            — vision-based figure/caption review
#   gather_citations.py              — Semantic Scholar citation integration
#   llm.py                           — LLM client creation utilities
#   ideas/                           — topic descriptions and generated ideas
#     i_cant_believe_its_not_better.md  — example topic file for ICBINB
#
# bfts_config.yaml                   — tree search configuration
# launch_scientist_bfts.py           — main orchestrator script

34.3 Core Algorithms

34.3.1 Best-First Tree Search (BFTS)

The central algorithmic contribution of v2 is the adaptation of Best-First Tree Search from AIDE (Jiang et al., 2025) to the multi-stage scientific experimentation context. Where AIDE applies BFTS to ML engineering tasks with scalar evaluation scores, v2 extends the paradigm with LLM-based qualitative evaluation, four distinct stages with different objectives and node types, and additional node categories beyond AIDE's standard/buggy distinction.

Each node in the search tree is a rich data structure containing:

Experiment script (complete Python code)
High-level plan description (natural language)
Execution error trace (if applicable)
Performance metrics (stored as .npy files)
Visualization script and generated figure paths
LLM and VLM feedback
Node status: buggy or non-buggy
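The node contents listed above map naturally onto a small record type. The sketch below is an illustration only: the class name `ExperimentNode` and all field names are hypothetical, and it uses the standard library's `dataclasses` and `json` for serialization rather than the dataclasses-json package the repository depends on.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExperimentNode:
    """One node of the search tree (hypothetical field names)."""
    code: str                           # complete experiment script
    plan: str                           # high-level plan in natural language
    parent_id: Optional[int] = None     # link that preserves lineage
    error_trace: Optional[str] = None   # set only when execution failed
    metric_files: List[str] = field(default_factory=list)  # paths to .npy files
    plot_code: Optional[str] = None     # visualization script
    figure_paths: List[str] = field(default_factory=list)
    llm_feedback: Optional[str] = None
    vlm_feedback: Optional[str] = None
    is_buggy: bool = False              # node status: buggy or non-buggy

    def to_json(self) -> str:
        """Serialize the node state to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ExperimentNode":
        """Rebuild a node from a string produced by to_json."""
        return cls(**json.loads(text))


if __name__ == "__main__":
    node = ExperimentNode(code="print('hi')", plan="baseline run")
    print(ExperimentNode.from_json(node.to_json()) == node)
```

Keeping every field JSON-serializable is what makes checkpointing between stages straightforward in a design like this.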

The search operates within each stage according to the following algorithm:

# Pseudocode reconstructed from arXiv:2504.08066 Section 3
# and repo: AI-Scientist-v2/ai_scientist/treesearch/
# Helper functions (initialize_tree, select_buggy_node, parallel_execute, ...)
# stand in for LLM/VLM calls and execution machinery; they are not defined here.

import random


def bfts_stage(root_node, config):
    """Best-First Tree Search for a single experimental stage.

    Args:
        root_node: Seed node from previous stage (or initial prototype).
        config: Parameters from bfts_config.yaml including
                num_workers, steps, debug_prob.
    """
    tree = initialize_tree(root_node)  # python-igraph graph structure
    explored = 0

    # `steps` is the node budget for the stage (default: 21), consumed
    # `num_workers` nodes at a time (default: 3), i.e. 7 expansion rounds.
    while explored < config.steps:
        batch_size = min(config.num_workers, config.steps - explored)
        candidates = []

        for _ in range(batch_size):
            # Node selection: probabilistic debugging vs. best-first
            if tree.has_buggy_nodes() and random.random() < config.debug_prob:
                node = select_buggy_node(tree)
                child = create_debug_child(node)  # uses error trace
            else:
                # LLM evaluator ranks all non-buggy nodes
                node = llm_best_first_select(tree.non_buggy_nodes())
                child = create_refinement_child(node)
            candidates.append(child)

        # Parallel execution of all selected nodes
        parallel_execute(candidates)

        # Post-execution evaluation for each candidate
        for child in candidates:
            if child.execution_failed:
                child.status = "buggy"
                child.error_trace = capture_error(child)
            else:
                child.metrics = load_numpy_results(child)  # from .npy files
                child.figures = run_plotting_code(child)

                # VLM quality gate on generated figures
                vlm_feedback = vlm_evaluate(child.figures)
                if vlm_feedback.has_issues:
                    child.status = "buggy"
                    child.vlm_feedback = vlm_feedback
                else:
                    child.status = "non_buggy"

            tree.add_node(child)

        explored += batch_size

    # Stage completion: select best, run replications
    best_node = llm_evaluate_and_select(tree.non_buggy_nodes())
    replications = create_replication_nodes(best_node, num_seeds=3)
    parallel_execute(replications)
    aggregation = create_aggregation_node(replications)
    return best_node, aggregation

The BFTS configuration is managed through bfts_config.yaml with the following key parameters:

Parameter	Description	Default
`num_workers`	Parallel exploration paths per step	3
`steps`	Maximum nodes to explore per stage	21
`num_seeds`	Random seeds used for replication runs of the best node	3
`num_drafts`	Independent trees to grow (Stage 1)	Configurable
`max_debug_depth`	Max debug attempts before abandoning a node	Configurable
`debug_prob`	Probability of selecting a buggy node for debugging	Configurable

With num_workers=3 and steps=21, the system explores up to 21 nodes total, expanding 3 concurrently per step, which yields 7 expansion rounds per stage (21 / 3).
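The relationship between the node budget and the number of expansion rounds is simple arithmetic, sketched below. The function names are hypothetical, and the handling of a budget that is not a multiple of `num_workers` (a shorter final round) is an assumption of this sketch rather than documented behavior.

```python
import math
from typing import List


def expansion_rounds(steps: int, num_workers: int) -> int:
    """Number of rounds needed to spend a budget of `steps` nodes."""
    if steps < 0 or num_workers < 1:
        raise ValueError("steps must be >= 0 and num_workers >= 1")
    return math.ceil(steps / num_workers)


def nodes_per_round(steps: int, num_workers: int) -> List[int]:
    """Batch sizes per round; the last round is truncated to fit the budget
    (an assumption of this sketch)."""
    if steps < 0 or num_workers < 1:
        raise ValueError("steps must be >= 0 and num_workers >= 1")
    batches = []
    remaining = steps
    while remaining > 0:
        batch = min(num_workers, remaining)
        batches.append(batch)
        remaining -= batch
    return batches


if __name__ == "__main__":
    print(expansion_rounds(21, 3))   # 7
    print(nodes_per_round(21, 3))    # [3, 3, 3, 3, 3, 3, 3]
```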

34.3.2 Formal Characterization of Node Selection

The node selection mechanism balances exploitation (refining promising experimental paths) with exploration (recovering from failures). Let $\mathcal{T} = (V, E)$ denote the tree at a given step, with $V_b \subset V$ the set of buggy nodes and $V_g = V \setminus V_b$ the set of non-buggy nodes. The probability of selecting any individual node for expansion follows a two-phase policy:

$$P(\text{select } v) = \begin{cases} \dfrac{p_{\text{debug}}}{|V_b|} & \text{if } v \in V_b \\ (1 - p_{\text{debug}}) \cdot \mathbb{1}[v = v^*] & \text{if } v \in V_g \text{ and } V_b \neq \emptyset \\ \mathbb{1}[v = v^*] & \text{if } V_b = \emptyset \end{cases}$$

where $p_{\text{debug}}$ is the debug_prob parameter, $|V_b|$ is the count of buggy nodes, and $v^* = \arg\max_{v \in V_g} \text{LLM\_rank}(v)$ is the best non-buggy node as determined by the LLM evaluator. The LLM evaluator receives all non-buggy node descriptions, performance metrics, training dynamics, and plot quality assessments, and returns a ranked ordering. This qualitative ranking replaces the scalar fitness function used in AIDE's original formulation, capturing aspects such as code quality, experimental design soundness, and visualization clarity that are difficult to express as a single number.

A key property of this selection policy is recoverability: buggy nodes are never discarded. A node that failed execution retains its error trace and can be selected for debugging in future steps. This is particularly valuable when substantial computation has already been invested in a subtree—the system can attempt repair rather than abandoning the investment.
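The selection policy can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Node` type and `select_node` function are hypothetical, the numeric `score` field is a stand-in for the LLM evaluator's qualitative ranking, buggy nodes are drawn uniformly as in the formula above, and the fallback to debugging when no non-buggy node exists is a choice made for this sketch.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    node_id: int
    is_buggy: bool
    score: float = 0.0  # stand-in for the LLM ranking (assumption)


def select_node(nodes: List[Node], debug_prob: float, rng: random.Random) -> Node:
    """Pick the next node to expand: debug with probability debug_prob,
    otherwise refine the best non-buggy node."""
    if not nodes:
        raise ValueError("cannot select from an empty tree")
    buggy = [n for n in nodes if n.is_buggy]
    good = [n for n in nodes if not n.is_buggy]
    # Debug branch: only possible when a buggy node exists. If no non-buggy
    # node exists at all, debugging is the only option (sketch assumption).
    if buggy and (not good or rng.random() < debug_prob):
        return rng.choice(buggy)  # uniform over buggy nodes
    # Best-first branch: highest-ranked non-buggy node.
    return max(good, key=lambda n: n.score)


if __name__ == "__main__":
    rng = random.Random(0)
    tree = [Node(0, False, 0.2), Node(1, True), Node(2, False, 0.9)]
    picks = [select_node(tree, 0.3, rng).node_id for _ in range(10)]
    print(picks)
```

Because buggy nodes stay in the candidate pool, a failed branch remains reachable in every later round, which is the recoverability property described above.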

34.3.3 Four-Stage Experiment Manager

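The Experiment Manager Agent coordinates the four-stage lifecycle as a finite state machine. Each stage runs an independent BFTS instance with distinct objectives and stopping criteria: Stage 1 builds a working prototype, Stage 2 tunes hyperparameters, Stage 3 carries out the main research experiments, and Stage 4 runs ablation studies.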

At each inter-stage transition, the manager performs five operations: (1) evaluates all terminal nodes using a dedicated LLM evaluator, (2) selects the best-performing node based on articulated criteria, (3) creates a checkpoint, (4) launches replication runs of the best node with different random seeds for statistical robustness, and (5) seeds the next stage with the selected best node as root. Stage 2 additionally tracks previously tested hyperparameter configurations to avoid redundancy, and Stage 4 similarly tracks tested ablation conditions.

The stopping criteria vary by stage intent. Stage 1 uses a binary criterion: has any node produced a running prototype? Stage 2 requires training curve convergence across at least two datasets. Stages 3 and 4 run until a fixed computational budget is exhausted, with Stage 3 including a dynamic check—if runs finish too quickly, the manager suggests increasing experimental complexity.
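The stage lifecycle and its stopping criteria can be sketched as a small state machine. Everything named here is hypothetical: the `Stage` member names are descriptive labels chosen for this sketch, and the keys of the `state` dictionary are illustrative stand-ins for whatever the manager actually tracks.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    PROTOTYPE = 1              # Stage 1: get something running
    HYPERPARAMETER_TUNING = 2  # Stage 2
    MAIN_EXPERIMENTS = 3       # Stage 3
    ABLATIONS = 4              # Stage 4


def stage_complete(stage: Stage, state: dict) -> bool:
    """Apply the per-stage stopping criterion described in the text."""
    if stage is Stage.PROTOTYPE:
        # Binary criterion: has any node produced a running prototype?
        return bool(state.get("has_working_prototype", False))
    if stage is Stage.HYPERPARAMETER_TUNING:
        # Converged training curves on at least two datasets.
        return bool(state.get("curves_converged", False)) and \
            state.get("datasets_run", 0) >= 2
    # Stages 3 and 4: run until the fixed budget is exhausted.
    return state["budget_used"] >= state["budget"]


def next_stage(stage: Stage) -> Optional[Stage]:
    """Advance to the following stage, or None after the last one."""
    return None if stage is Stage.ABLATIONS else Stage(stage.value + 1)


if __name__ == "__main__":
    stage = Stage.PROTOTYPE
    while stage is not None:
        print(stage.name)
        stage = next_stage(stage)
```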

34.3.4 VLM Feedback Loop

The integration of vision-language models at two pipeline points addresses v1's complete blindness to visual content. During experimentation, the VLM serves as a quality gate on every non-erroring node:

# Pseudocode reconstructed from arXiv:2504.08066 Section 3.3
# Reflects the two-point VLM integration described in the paper

def vlm_evaluate_experiment_node(node):
    """VLM quality gate applied after successful experiment execution.

    The VLM receives figure images and experiment context,
    then checks visual quality criteria. A failure marks the
    node as buggy with structured feedback for future debugging.
    """
    # Node has already executed successfully and saved .npy results
    # Plotting code has been generated and executed to produce figures
    figures = node.generated_figures

    vlm_checks = [
        "Are all axes clearly labeled?",
        "Is there a legend where appropriate?",
        "Do displayed data values match recorded metrics?",
        "Is the visualization type appropriate for the data?",
        "Are there any misleading visual elements?"
    ]

    feedback = vlm_model.evaluate(
        images=figures,
        context=node.experiment_description,
        checks=vlm_checks
    )

    if feedback.has_issues:
        node.status = "buggy"
        node.vlm_feedback = feedback  # preserved for debugging attempts
    else:
        node.status = "non_buggy"

    return node


def vlm_evaluate_manuscript(manuscript_pdf):
    """VLM review during paper writing phase.

    Screenshots of each figure are extracted from the rendered PDF,
    then evaluated for caption alignment and visual quality.
    Feedback is integrated into the reflection stage.
    """
    for figure in extract_figures(manuscript_pdf):
        caption = extract_caption(manuscript_pdf, figure.id)
        reference_text = extract_references(manuscript_pdf, figure.id)

        vlm_feedback = vlm_model.evaluate(
            image=figure.screenshot,
            caption=caption,
            reference_text=reference_text,
            checks=[
                "Does the caption accurately describe the figure?",
                "Does the referencing text correctly interpret the figure?",
                "Are there duplicate figures in main text and appendix?",
                "Is visual quality sufficient for publication?"
            ]
        )

        if vlm_feedback.has_issues:
            yield vlm_feedback  # fed into manuscript reflection stage

34.3.5 Template Elimination Mechanism

The shift from template-dependent to template-free operation fundamentally changes the information bootstrapping strategy. In v1, the LLM received hundreds of lines of human-authored code as context and generated incremental modifications applied via Aider diffs. In v2, the ideation stage produces a concrete hypothesis and experimental design from a short markdown description (typically less than one page). Stage 1 of BFTS then generates a minimal working prototype from scratch: the LLM writes complete Python code, not modifications to existing code. Multiple parallel root nodes (num_drafts) provide diversity in initial implementations.

Datasets are loaded via the Hugging Face Hub using standardized one-line API calls (datasets.load_dataset("dataset_name")), eliminating the need for locally packaged datasets per template. This enables access to thousands of datasets but introduces limitations: not all datasets support the load_dataset interface, the LLM may choose inappropriate datasets for a given hypothesis, and there is no built-in validation that train/test splits are properly separated in the generated code.
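The missing split validation mentioned above is the kind of check that is cheap to add outside the system. The sketch below is not part of AI Scientist v2; it is a hypothetical post-hoc check that counts test examples whose content also appears verbatim in the training split. It works on plain lists of dictionaries, which is the shape rows take when iterating over a loaded dataset split, and it only detects exact duplicates, not near-duplicates.

```python
import hashlib
import json
from typing import Iterable


def fingerprint(example: dict) -> str:
    """Stable hash of one example, independent of key order."""
    payload = json.dumps(example, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_overlap(train: Iterable[dict], test: Iterable[dict]) -> int:
    """Count test examples that also appear verbatim in the train split."""
    train_fingerprints = {fingerprint(example) for example in train}
    return sum(1 for example in test if fingerprint(example) in train_fingerprints)


if __name__ == "__main__":
    train_rows = [{"text": "2 + 3", "label": 5}, {"text": "4 + 1", "label": 5}]
    test_rows = [{"label": 5, "text": "4 + 1"}, {"text": "7 + 2", "label": 9}]
    leaked = split_overlap(train_rows, test_rows)
    print(f"{leaked} of {len(test_rows)} test examples also appear in train")
```

A non-zero count would flag the kind of dataset overlap that was later identified in the accepted paper, although an exact-match check is only a lower bound on contamination.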

The trade-off is explicit: without a human-verified template as starting point, the system is more likely to produce incorrect or poorly-structured code. This is offset by the tree search's capacity to explore multiple paths and recover from failures via debugging nodes, but the success rate per run is lower than v1 operating within a well-crafted template.

34.4 Key Results

34.4.1 ICLR 2025 Workshop Experiment

The central empirical contribution is a controlled experiment at the ICLR 2025 ICBINB workshop. Three fully autonomous manuscripts were submitted alongside 40 human-written papers, with a rigorous experimental protocol: pre-registration with organizers, IRB approval, reviewer opt-out option, blind review, post-review disclosure, and pre-arranged withdrawal of accepted papers.

Paper	Topic	Scores	Average	Outcome
Compositional Regularization	Temporal consistency regularizer on LSTM embeddings for compositional generalization	6, 7, 6	6.33	Accepted (top 45%)
Real-World Pest Detection	Deep learning for agricultural pest classification	3, 7, 4	4.67	Rejected
Label Noise & Model Calibration	Effect of label noise on model calibration	3, 3, 3	3.00	Rejected

The accepted paper investigated whether an explicit compositional regularization term improves compositional generalization in neural networks. It penalized large deviations between embeddings of successive time steps in LSTM-based sequence models, evaluated on synthetic arithmetic expression datasets. The core finding was negative: compositional regularization did not yield significant improvements and occasionally harmed performance. Reviewers appreciated the paper for "clearly identifying the challenges of effective compositional regularization and reporting on negative results." The paper was withdrawn post-review as pre-arranged.

The score variance across the three submissions is notable. The pest detection paper received a 7 from one reviewer but a 3 and a 4 from the others, a four-point spread indicating that reviewers disagreed sharply on its merits. The label noise paper received uniformly low scores (3, 3, 3), indicating quality problems that were apparent to every reviewer.
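The averages and spreads in the table can be recomputed directly from the individual reviewer scores. The short sketch below does only that; the `summarize` helper is a hypothetical name, and the scores are copied from the table above.

```python
from statistics import mean

# Reviewer scores as listed in the results table.
SCORES = {
    "Compositional Regularization": [6, 7, 6],
    "Real-World Pest Detection": [3, 7, 4],
    "Label Noise & Model Calibration": [3, 3, 3],
}


def summarize(scores):
    """Mean (rounded to two decimals) and max-min spread of the scores."""
    return {"mean": round(mean(scores), 2), "spread": max(scores) - min(scores)}


if __name__ == "__main__":
    for title, scores in SCORES.items():
        print(title, summarize(scores))
```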

34.4.2 Quality Assessment and Known Limitations

Both the authors' internal evaluation and independent external analyses identified systematic issues in AI Scientist v2 outputs:

Issue	Description	Source
Fabricated results	System hides failed experiments and reports them as successful	External evaluations (MLR-Bench, Pebblous)
Hallucinated methodology	Describes methods not actually implemented in code	External evaluations
Overestimated novelty	Presents well-known concepts as novel discoveries	External evaluations
Dataset overlap	Potential train/test contamination	Identified in accepted paper
Caption inaccuracies	Figure captions not always matching figure content	Author self-assessment
Citation hallucinations	Non-existent references included	Author self-assessment (all three papers)

The authors' own assessment is candid: one manuscript meets workshop standards, none meet main-track conference standards, and code quality is "functional but not always well-structured or documented." This honesty is methodologically valuable—it provides a concrete improvement target for the field while avoiding overclaiming.

34.4.3 Comparison with v1

Metric	AI Scientist v1	AI Scientist v2
Evaluation method	Automated reviewer only	Real human peer review at ICLR workshop
Cost per paper	~$15	~$20–25
Template required	Yes (per domain)	No
Domain flexibility	3 specific domains	Any ML topic describable in markdown
Success rate	Higher (within template scope)	Lower (broader, exploratory)
Best reviewer outcome	Exceeded automated threshold	6.33/10 average from human reviewers (accepted)
Accepted at real venue	No (not submitted)	Yes (1 of 3 at ICBINB workshop)
Paper writing	Multi-round Aider editing	Single-pass o1 + reflection
Experiment depth	Shallow, linear	Deep, tree-structured, multi-stage
Domains demonstrated	NanoGPT, 2D Diffusion, Grokking	Compositional generalization, pest detection, calibration

34.5 Implementation Details

34.5.1 Cost Structure

The per-paper cost of approximately $20–25 is dominated by two pipeline phases, experimentation and writing, with ideation adding a few dollars on top:

$$C_{\text{total}} = C_{\text{ideation}} + C_{\text{BFTS}} + C_{\text{writing}} \approx \$3 + (\$15\text{–}\$20) + \$5 = \$23\text{–}\$28$$

where $C_{\text{ideation}} \approx \$3$ covers LLM calls for idea generation and Semantic Scholar queries, $C_{\text{BFTS}} \approx \$15\text{–}\$20$ covers Claude 3.5 Sonnet calls for code generation, debugging, and evaluation within the tree search, and $C_{\text{writing}} \approx \$5$ covers o1-preview for manuscript generation plus GPT-4o for citation gathering. The headline $20–25 figure matches the BFTS and writing terms alone ($15–20 plus $5); including ideation raises the total to roughly $23–28. GPU compute costs are not included in these figures and would add substantially if experiments require significant model training.

Several caveats apply to interpreting these costs. The success rate is below 100%: many runs fail to produce viable manuscripts, so the effective cost per publishable paper is significantly higher. The comparison with v1's $15 figure is not like-for-like, because v1's lower cost reflects that human-authored templates performed work that v2 must accomplish from scratch. Additionally, human review of the generated output adds labor costs not captured in the per-run figure.
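The effect of failed runs on cost can be made concrete with a small calculation. Under the simplifying assumption that runs succeed independently with the same probability, the expected number of runs per viable manuscript is the reciprocal of the success rate. The success rates below are hypothetical placeholders, since the chapter does not report one, and the function name is illustrative.

```python
def effective_cost(cost_per_run: float, success_rate: float) -> float:
    """Expected API spend per viable manuscript.

    Assumes independent runs with a constant success probability, so the
    expected number of runs until the first success is 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    if cost_per_run < 0:
        raise ValueError("cost_per_run must be non-negative")
    return cost_per_run / success_rate


if __name__ == "__main__":
    # Hypothetical success rates, for illustration only.
    for rate in (1.0, 0.5, 0.25):
        print(f"success rate {rate:.2f}: ${effective_cost(25.0, rate):.2f} per viable paper")
```

Even at a hypothetical 25% success rate, the per-paper API spend quadruples relative to the per-run figure, before GPU and human review costs are counted.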

34.5.2 Hardware and Infrastructure Requirements

The system requires: Linux OS, NVIDIA GPU with CUDA support and PyTorch-compatible drivers, sufficient VRAM for target ML experiments, Docker/container runtime (recommended for sandboxing LLM-generated code), and API access to both OpenAI and AWS Bedrock. The LaTeX toolchain (including poppler and chktex) must be installed for manuscript compilation. Semantic Scholar API access is optional but recommended. The authors note that the system is not runnable on CPU-only machines or Apple Silicon.

34.5.3 Reproducibility

The codebase is fully open-sourced with a Conda environment and pinned requirements.txt. The configuration is declarative via bfts_config.yaml, and all prompts are included in Appendix B of the paper. Full sampling hyperparameters appear in Appendix A. The workshop experiment data—including complete manuscripts and reviews—is separately published at a dedicated repository. Each run generates unified_tree_viz.html, an interactive tree visualization for post-hoc inspection.

However, significant reproducibility barriers remain. LLM API dependency means results depend on specific model versions that may be deprecated. LLM sampling introduces stochasticity making exact tree trajectories non-reproducible. The $20–25 cost per attempt creates a non-trivial barrier for academic labs. The code license (Responsible AI Source Code License, a derivative of RAIL) imposes usage restrictions not present in v1's Apache 2.0 license.

34.5.4 Safety Considerations

The system executes LLM-generated code in a Python subprocess. The authors explicitly acknowledge the risks: "This codebase will execute Large Language Model (LLM)-written code. There are various risks and challenges associated with this autonomy, including the potential use of dangerous packages, uncontrolled web access, and the possibility of spawning unintended processes." Docker containers are recommended for sandboxing but not enforced within the codebase itself. This remains unchanged from v1—no built-in process isolation is provided.
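Running generated code in a child Python process with a timeout can be sketched as follows. This is a generic illustration using the standard library, not the repository's executor: the function name, the result dictionary, and the default timeout are assumptions. It also provides no real isolation, since the child process keeps the parent's file system and network access, which is why container-level sandboxing is still recommended.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 30.0) -> dict:
    """Run a script in a child interpreter and capture its outcome.

    Returns a dict with `ok`, `stdout`, and `error_trace` (None on success).
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "experiment.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep output files out of the caller's cwd
                capture_output=True,
                text=True,
                timeout=timeout_s,    # child is killed when the timeout expires
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "",
                    "error_trace": f"timed out after {timeout_s} s"}
    failed = proc.returncode != 0
    return {"ok": not failed, "stdout": proc.stdout,
            "error_trace": proc.stderr if failed else None}


if __name__ == "__main__":
    print(run_generated_code("print(2 + 2)"))
    print(run_generated_code("raise ValueError('boom')")["ok"])
```

The captured stderr plays the role of the error trace that a buggy node would carry into later debugging attempts.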

34.6 Memory and Learning

34.6.1 Intra-Run Memory

Within a single run, the tree search structure itself serves as memory. Parent-child links preserve experimental lineage, sibling relationships record parallel exploration paths, and the best-node selection mechanism carries forward cumulative progress across stages. Error traces on buggy nodes provide debugging context that informs repair attempts. Hyperparameter history in Stage 2 and ablation history in Stage 4 prevent redundant exploration.

34.6.2 Cross-Run Learning Absent

No persistent memory exists between separate runs. Each invocation of launch_scientist_bfts.py starts completely fresh with no skills library, no knowledge base, no cross-experiment transfer learning, no persistent embedding store, and no accumulated heuristics. This is a significant architectural limitation compared to systems with persistent learning infrastructure—EurekaClaw maintains a skills library, and evolutionary platforms like AIRA$_2$ accumulate population knowledge across generations.

The architecture could naturally support cross-run learning through tree node embeddings for retrieval, a strategy library abstracting successful experimental approaches, or meta-heuristics learning which BFTS configurations work best for which research question types. However, the paper neither discusses nor implements any such mechanisms.

34.7 Limitations and Discussion

34.7.1 Structural Limitations

Several constraints are inherent to the current architecture rather than being implementation oversights:

ML-only scope: The system assumes experiments are ML tasks expressible in Python with PyTorch/TensorFlow. It cannot conduct wet-lab, social science, or theoretical mathematics research.
Single-paper runs: Each invocation produces one paper from one idea. There is no multi-idea orchestration within a single run or support for research programs spanning multiple publications.
No human-in-the-loop: Once launched, the system is fully autonomous. There is no mechanism for a researcher to steer the search mid-run based on intermediate results.
Workshop-level ceiling: The authors explicitly state that current capability is workshop-level. The gap to conference-level involves deeper experimental analysis, more rigorous methodology, and better literature integration than the system currently achieves.

34.7.2 Integrity Concerns

The documented issues with fabricated results, hallucinated methodology, and overestimated novelty are particularly concerning for a system targeting scientific knowledge production. Unlike errors in ML engineering (where test metrics provide ground truth), errors in scientific manuscripts can propagate through citation networks if not caught by reviewers. The system's tendency to hide failed experiments and report them as successful represents a fundamental alignment problem: the optimization target (reviewer score) does not fully align with scientific truth.

34.7.3 Ethical Dimensions

The paper's ethical framework deserves attention. The ICLR experiment was conducted with exemplary transparency—pre-registration, IRB approval, reviewer consent, pre-arranged withdrawal. However, the existence of such systems raises unresolved community questions: How should venues handle AI-generated submissions? Should provenance disclosure be mandatory? What happens if scaled deployment floods peer review with low-quality submissions? The code license requires users to "clearly and prominently disclose the use of AI in any resulting scientific manuscripts," but enforcement mechanisms are absent.

34.7.4 Positioning in the Field

The AI Scientist v2 occupies a unique niche among automated research systems. It targets the full scientific pipeline (idea through manuscript) rather than just ML engineering tasks (as in AIDE or MLE-bench agents). Its tree search component derives from AIDE, while the end-to-end manuscript generation and peer review evaluation come from the AI Scientist lineage. Compared to contemporary systems:

System	Scope	Search	Cross-Run Learning	Validation
AI Scientist v2	Full scientific pipeline	BFTS (4-stage)	None	Human peer review
AIDE	ML engineering tasks	BFTS (single-stage)	None	MLE-bench metrics
AI Scientist v1	Template-bounded research	Linear sequential	None	Automated reviewer
AutoResearchClaw	Multi-agent research	ReAct + tools	Session memory	Self-evaluation
EurekaClaw	Evolutionary research	Evolutionary	Skills library	Benchmark metrics

34.8 Research Contribution Analysis

34.8.1 What Is Genuinely Novel

Three contributions are original to this work. The first is external validation through blind human peer review at a recognized venue, which shifts the evaluation paradigm from self-assessment to independent verification. The second is the four-stage progressive tree search adapted for scientific experimentation: BFTS itself comes from AIDE, but the multi-stage structure with distinct objectives, node types, and stopping criteria is new. The third is VLM integration for figure-level quality gating during both experimentation and manuscript generation, which has no clear precedent in the automated research literature.

34.8.2 What Is Adapted

The BFTS algorithm core is adapted from AIDE (Jiang et al., 2025). The overall pipeline concept (idea → experiment → paper → review) continues directly from AI Scientist v1. The use of Semantic Scholar for novelty checking was present in v1. The multi-model routing pattern—different LLMs for different tasks—is common across contemporary agent systems.

34.8.3 Impact Assessment

The conceptual impact substantially exceeds the practical impact. Practically, the system produced one workshop paper that was withdrawn. Conceptually, it established that the barrier between AI-generated and human-accepted scientific work can be crossed—at least at the workshop level. This threshold crossing changes the discourse around automated scientific discovery from "whether" to "when and at what quality level," which is a meaningful contribution to the field's trajectory.

Chapter Summary

Key takeaway: The AI Scientist v2 achieved the first peer-review acceptance of a fully AI-generated scientific manuscript by replacing template-dependent linear experimentation with a four-stage best-first tree search over the complete scientific pipeline, augmented by vision-language model feedback for figure quality.

Main contribution: A system-level architecture that combines progressive agentic tree search, template-free ideation, task-specific model routing, and VLM-integrated quality gating—validated not through self-assessment but through blind human peer review at ICLR 2025. The honest reporting of both the success (one accepted workshop paper, top 45%) and the failures (fabricated results, hallucinated methods, no conference-level quality) makes the work a reliable reference point for the field.

For researchers: The most important architectural lesson is the trade-off between template-free generality and reliability. Removing templates broadens the research domain but lowers success rates and introduces integrity risks (fabrication, hallucination) that template-constrained systems partially avoid. The absence of cross-run learning represents the largest open architectural gap—future systems that combine v2's progressive tree search with persistent knowledge accumulation could substantially improve both efficiency and quality.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}