Introduced: 2025-09
Score: 8.73/10 (Final)
Chapter 39

DeepScientist

Part: Autonomous Research Systems

39.1 Overview & Motivation

DeepScientist (Weng et al., 2025; arXiv:2509.26603) reframes autonomous scientific discovery as a Bayesian Optimization (BO) problem, where the search space comprises all possible scientific methods, an LLM Reviewer serves as the surrogate model, and a UCB acquisition function governs the exploration–exploitation tradeoff [PAPER, §3]. The system was developed at Westlake University's Engineering School and represents a direct evolution from the same group's CycleResearcher work on review-driven iterative paper refinement [PAPER, §1].

The paper reports that DeepScientist produced 21 genuine scientific innovations from approximately 5,000 generated hypotheses across three frontier AI tasks, with three methods that the authors claim surpassed human state-of-the-art: A2P for agent failure attribution (+183.7% accuracy), ACRA for LLM inference acceleration (+1.9% tokens/s), and PA-Detect for AI text detection (+7.9% AUROC with 65.5% lower latency, equivalent to a 190% gain in inference speed) [PAPER, §5, Tables 1–2]. These are author-reported results evaluated by the team's own DeepReviewer model and three human experts; no independent reproduction has been documented as of April 2026.

Within this survey's taxonomy, DeepScientist occupies a distinctive position among autonomous research systems. While AI Scientist (Sakana, 2024), AI Researcher (Alibaba, 2024), and Zochi (2025) focus on generating research papers evaluated by automated reviewers, DeepScientist targets working implementations evaluated against quantitative task metrics [PAPER, §2]. The shift from paper generation to method discovery — and the adoption of a principled optimization framework rather than ad-hoc search — is the paper's central claim to novelty.

Key Contribution: DeepScientist formalizes autonomous scientific discovery as Bayesian Optimization with an LLM surrogate, using a three-component valuation vector (utility, quality, exploration) and UCB acquisition to guide month-long GPU campaigns. The paper reports verified SOTA-surpassing results on three frontier AI tasks — a claim that, if independently reproduced, would represent a milestone in autonomous research [PAPER, §5].

39.2 Architecture

39.2.1 Three-Stage Discovery Cycle

DeepScientist's architecture centers on a three-stage hierarchical discovery cycle that repeats continuously for the duration of a campaign (approximately one month per task) [PAPER, §3]. The stages are:

  1. Strategize & Hypothesize — LLM Reviewer (Gemini-2.5-Pro) analyzes the Findings Memory, generates novel hypotheses, and produces valuation vectors [PAPER, §3.1].
  2. Implement & Verify — UCB acquisition selects the most promising hypothesis; a coding agent (Claude-4-Opus) implements and experimentally validates it [PAPER, §3.2].
  3. Analyze & Report — Triggered only for implementations that surpass the baseline; performs ablation studies, cross-dataset evaluation, and paper synthesis [PAPER, §3.3].

All three stages share a persistent Findings Memory — a structured, three-level knowledge base (Idea Findings, Implement Findings, Progress Findings) that accumulates knowledge across the campaign [PAPER, §3.1]. When memory exceeds the LLM's context window, a retrieval model selects the Top-K most relevant findings [PAPER, §3.1].

[Diagram: Stage 1, Strategize & Hypothesize (Gemini-2.5-Pro: analyze Findings Memory, retrieve Top-K on overflow, generate hypotheses, produce V = ⟨v_u, v_q, v_e⟩, store as Idea Findings) → UCB Selector (argmax of w_u·v_u + w_q·v_q + κ·v_e; deterministic, no LLM) → Stage 2, Implement & Verify (Claude-4-Opus coding agent: read repository, plan and implement code changes, run sandboxed GPU experiments, debug, record f(I_{t+1}), update Implement Findings) → if f(I) > baseline → Stage 3, Analyze & Report (Gemini-2.5-Pro, conditionally triggered: ablation studies, cross-dataset evaluation, MCP-managed experimental lifecycle, paper synthesis, Progress Findings). All stages read from and write to the Findings Memory, shared across all GPU instances: Level 1, Idea Findings (~5,000; hypotheses + valuation vectors V); Level 2, Implement Findings (~1,100; code + experimental results f(I)); Level 3, Progress Findings (21; innovations + analysis + papers); plus human knowledge, retrieval index, and historical results. The cycle repeats for ~1 month on 16 parallel GPU instances (2 servers × 8 NVIDIA H800 GPUs), each running an independent cycle over the shared memory.]

Figure 39.1: DeepScientist three-stage discovery cycle with shared Findings Memory and parallel GPU execution. All component labels and data flows are derived from the paper's description [PAPER, §3, Figures 1–2].

39.2.2 Dual-Model Architecture

DeepScientist employs a functional separation between two frontier LLMs [PAPER, §3]:

| Role | Model | Stages | Evidence |
| --- | --- | --- | --- |
| Reasoning / strategy / review / paper synthesis | Gemini-2.5-Pro | Stages 1, 3 | [PAPER, §3] |
| Code generation / implementation / debugging | Claude-4-Opus | Stage 2 | [PAPER, §3] |

This is a functional separation (reasoning vs. coding), distinct from the hierarchical separation (cheap model for quantity, expensive for quality) seen in AlphaEvolve's Gemini Flash + Pro ensemble [PAPER, §2]. The coding agent in Stage 2 has full permissions: repository read/write, sandboxed execution, internet access, package installation, and dedicated H800 GPU access [PAPER, §3.2].

39.2.3 Repository Status

The paper cites a public repository at github.com/ResearAI/DeepScientist [PAPER, §1] and a project page at ai-researcher.net [PAPER, §1]. The repository is released under CC BY-NC-SA 4.0 [PAPER, §1].

Repository Verification Note: This chapter has not performed a pinned-commit audit of the DeepScientist repository. All implementation claims below are sourced from the paper or README descriptions unless otherwise noted. No [REPO] tags are used in this chapter because the repository contents have not been independently verified by the survey authors. Readers intending to reproduce or extend the system should verify the current repository state against the claims made here.

39.3 Core Algorithms

39.3.1 Bayesian Optimization Formulation

DeepScientist's central intellectual contribution is formalizing autonomous discovery as Bayesian Optimization [PAPER, §3]. The optimization objective:

$$ I^* = \underset{I \in \mathcal{I}}{\operatorname{argmax}} \; f(I) $$ [Published formula — paper §3, Eq. 1]
| Symbol | Meaning | Domain |
| --- | --- | --- |
| \(\mathcal{I}\) | Space of all possible scientific methods (hypotheses, algorithms, implementations) | Combinatorial, unstructured |
| \(f(I)\) | True value function: actual performance of method \(I\) when implemented and evaluated | \(\mathbb{R}\) |
| \(I^*\) | Globally optimal method (unknown, approximated through search) | \(\mathcal{I}\) |

Evaluating \(f(I)\) is expensive: each evaluation requires generating a full implementation, running GPU experiments, and analyzing results against baselines. DeepScientist addresses this by using an LLM Reviewer as a surrogate model that cheaply approximates \(f\). For each candidate hypothesis, the reviewer produces a three-component valuation vector [PAPER, §3.1]:

$$ \mathbf{V}(I) = \langle v_u(I), \, v_q(I), \, v_e(I) \rangle $$ [Published formula — paper §3, Eq. 2]
| Symbol | Meaning | Role in BO |
| --- | --- | --- |
| \(v_u\) | Utility value: estimated practical impact and performance improvement | Exploitation signal |
| \(v_q\) | Quality value: estimated methodological soundness and rigor | Exploitation signal |
| \(v_e\) | Exploration value: estimated novelty relative to previously explored regions | Exploration signal |

The UCB acquisition function selects the next hypothesis to evaluate [PAPER, §3.1]:

$$ I_{t+1} = \underset{I}{\operatorname{argmax}} \left( w_u \cdot v_u(I) + w_q \cdot v_q(I) + \kappa \cdot v_e(I) \right) $$ [Published formula — paper §3, Eq. 3]
| Symbol | Meaning | Notes |
| --- | --- | --- |
| \(w_u\) | Weight on utility | Exploitation coefficient; values not reported in paper |
| \(w_q\) | Weight on quality | Exploitation coefficient; values not reported in paper |
| \(\kappa\) | Exploration coefficient | Controls exploration–exploitation tradeoff; value not reported |
| \(I_{t+1}\) | Selected hypothesis for evaluation at step \(t+1\) | Promoted from Idea Finding to Implement Finding |

[INFERRED] Theoretical Guarantees and Their Limits: In classical Bayesian Optimization, UCB provides sublinear cumulative regret bounds when the surrogate model is a well-specified Gaussian Process with calibrated posterior uncertainty. DeepScientist's LLM surrogate replaces the GP posterior with prompt-based assessment, which sacrifices the calibrated uncertainty estimates (\(\sigma(x)\)) that underpin UCB's theoretical guarantees. The exploration component \(v_e\) is a heuristic substitute for posterior variance, produced by the same LLM that generates hypotheses — there is no formal guarantee of calibration or coverage. The BO formulation should therefore be understood as a principled heuristic inspired by the UCB framework rather than a system with provable regret bounds. The paper does not claim formal convergence guarantees [PAPER, §3].
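For reference, the classical rule this callout contrasts against can be written in the standard GP-UCB form from the bandit literature. This is textbook material, not a formula from the DeepScientist paper; \(\mu_{t-1}\) and \(\sigma_{t-1}\) denote the GP posterior mean and standard deviation after \(t-1\) observations, \(\beta_t\) is the confidence parameter, and \(\gamma_T\) is the maximum information gain after \(T\) rounds.

$$ x_t = \underset{x}{\operatorname{argmax}} \; \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x), \qquad R_T = \sum_{t=1}^{T} \bigl( f(x^*) - f(x_t) \bigr) = \mathcal{O}\!\left(\sqrt{T \, \beta_T \, \gamma_T}\right) \text{ with high probability} $$

The bound is sublinear in \(T\) only when \(\gamma_T\) grows slowly, and it presupposes that \(\sigma_{t-1}\) is a calibrated posterior. Neither condition has an obvious analogue for an LLM-produced \(v_e\).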

39.3.2 The Discovery Loop

[Diagram: (1) Surrogate Update: the LLM Reviewer produces V(I) for candidates, using Top-K context from Findings Memory → (2) UCB Acquisition: select I_{t+1} = argmax UCB(I); deterministic, no LLM call → (3) Experimental Evaluation: the coding agent implements I_{t+1} and returns f(I_{t+1}) from GPU experiments → (4) Memory Update: M_t → M_{t+1} with result f(I_{t+1}); if f(I) > baseline, promote to Progress Finding, otherwise record the negative result and return to step 1 → (5) Deep Analysis (conditional, only if the baseline was surpassed): ablations, cross-dataset evaluation, paper synthesis; then repeat. The persistent, shared Findings Memory holds Level 1 Idea Findings (hypotheses + V), Level 2 Implement Findings (code + results), and Level 3 Progress Findings (innovations + papers), plus the human knowledge seed and retrieval index.]

Figure 39.2: DeepScientist BO discovery loop. All steps are paper-described [PAPER, §3].

39.3.3 Three-Component Valuation and Surrogate Mapping

The decomposition of the exploitation signal into utility (\(v_u\)) and quality (\(v_q\)) is the key departure from classical BO, where the surrogate produces a single mean prediction \(\mu(x)\) and uncertainty \(\sigma(x)\) [PAPER, §3.1]. The mapping between classical and DeepScientist formulations:

$$ \underbrace{w_u \cdot v_u + w_q \cdot v_q}_{\text{exploitation} \;\approx\; \mu(x)} + \underbrace{\kappa \cdot v_e}_{\text{exploration} \;\approx\; \kappa \cdot \sigma(x)} $$ [Author-derived formalization — mapping described in paper §3, explicit algebraic correspondence constructed by survey authors]

This decomposition allows the system to distinguish between bold but risky ideas (high \(v_u\), low \(v_q\)) and conservative but reliable ones (low \(v_u\), high \(v_q\)). The paper does not report the specific values of \(w_u\), \(w_q\), or \(\kappa\) used in the campaigns, nor whether these parameters were fixed or adapted during execution [PAPER; values not reported].
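To make the effect of the weights concrete, the following is a minimal sketch of how the choice of \(w_u\), \(w_q\), and \(\kappa\) changes which hypothesis the acquisition rule selects. The `Idea` class, the 0–10 score scale, and every numeric value are assumptions for illustration only; the paper reports neither the scale nor the weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    """Hypothetical container for one hypothesis and its valuation vector."""
    name: str
    v_u: float  # utility estimate
    v_q: float  # quality estimate
    v_e: float  # exploration (novelty) estimate


def ucb_score(idea, w_u, w_q, kappa):
    """Weighted sum from the acquisition rule: w_u*v_u + w_q*v_q + kappa*v_e."""
    return w_u * idea.v_u + w_q * idea.v_q + kappa * idea.v_e


def select(ideas, w_u, w_q, kappa):
    """Return the idea with the highest acquisition score."""
    return max(ideas, key=lambda idea: ucb_score(idea, w_u, w_q, kappa))


# Made-up scores on an assumed 0-10 scale.
ideas = [
    Idea("bold", v_u=9.0, v_q=3.0, v_e=5.0),
    Idea("conservative", v_u=4.0, v_q=9.0, v_e=5.0),
    Idea("novel", v_u=5.0, v_q=5.0, v_e=9.0),
]

utility_heavy = select(ideas, w_u=1.0, w_q=0.2, kappa=0.2)
quality_heavy = select(ideas, w_u=0.2, w_q=1.0, kappa=0.2)
exploration_heavy = select(ideas, w_u=0.2, w_q=0.2, kappa=1.0)
print(utility_heavy.name, quality_heavy.name, exploration_heavy.name)
```

With these made-up numbers, each weighting picks a different idea, which illustrates why the unreported weights matter for interpreting the campaign's behavior.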

39.3.4 Pseudocode for Discovery Cycle

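# Pseudocode, reconstructed from the paper's description (§3).
# No class names, function signatures, or module paths are verified from the repository.
# llm_reviewer, coding_agent, analyzer, and synthesizer are hypothetical collaborator
# objects, passed in explicitly so the function does not rely on hidden globals.

def deepscientist_campaign(task, memory, config,
                           llm_reviewer, coding_agent, analyzer, synthesizer):
    """Main discovery loop; runs for ~1 month per task."""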
    # Seed memory with human knowledge (papers, baselines, codebases)
    memory.load_human_knowledge(task.seed_papers, task.baseline_repo)

    for t in range(config.max_iterations):
        # --- Stage 1: Strategize & Hypothesize ---
        if memory.exceeds_context_window():
            context = memory.retrieve_top_k(query=task.description, k=config.top_k)
        else:
            context = memory.all_findings()

        # LLM Reviewer (Gemini-2.5-Pro) generates and evaluates hypotheses
        idea_findings = llm_reviewer.generate_hypotheses(
            task=task, context=context
        )
        # Each idea_finding carries V = <v_u, v_q, v_e> (utility, quality, exploration)
        memory.store_ideas(idea_findings)

        # --- UCB Acquisition (no LLM, deterministic) ---
        candidate = select_by_ucb(
            ideas=memory.unselected_ideas(),
            w_u=config.w_u, w_q=config.w_q, kappa=config.kappa
        )
        if candidate is None:
            continue  # no unselected ideas yet; go back and generate more
        candidate.promote_to_implement()

        # --- Stage 2: Implement & Verify ---
        result = coding_agent.implement_and_evaluate(
            hypothesis=candidate,
            repository=task.baseline_repo,
            gpu=config.assigned_gpu
        )
        candidate.record_result(result)
        memory.update(candidate)

        # --- Stage 3 (conditional): Analyze & Report ---
        if result.surpasses_baseline(task.baseline_metric):
            candidate.promote_to_progress()
            analysis = analyzer.deep_analysis(
                finding=candidate,
                ablation_configs=task.ablation_suite,
                additional_datasets=task.eval_datasets
            )
            paper = synthesizer.generate_paper(candidate, analysis)
            memory.store_progress(candidate, analysis, paper)


# Pseudocode, reconstructed from the paper's description (§3.1).
# The UCB selector is the only non-LLM component in the critical path.

def select_by_ucb(ideas, w_u, w_q, kappa):
    """Return the idea with the highest UCB score, or None if there are no ideas."""
    return max(
        ideas,
        key=lambda idea: w_u * idea.v_u + w_q * idea.v_q + kappa * idea.v_e,
        default=None,
    )

39.3.5 Findings Memory Lifecycle

Each finding follows a promotion lifecycle through three levels [PAPER, §3]:

  1. Idea Finding — generated by the Strategist, carries hypothesis text and valuation vector \(\mathbf{V}\). Remains in memory regardless of selection.
  2. Implement Finding — promoted from Level 1 after UCB selection. Augmented with implementation details, experimental results, and baseline delta. Negative results are retained, preventing re-exploration of failed approaches.
  3. Progress Finding — promoted from Level 2 when \(f(I) > f_{\text{baseline}}\). Triggers Stage 3 analysis, ablation studies, and paper synthesis.
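The three-level lifecycle above can be sketched as a small state machine. This is an illustrative data structure under stated assumptions, not the repository's schema: `Level`, `Finding`, `FindingsMemory`, and all field names are hypothetical, and the baseline and scores are made-up numbers.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    IDEA = 1
    IMPLEMENT = 2
    PROGRESS = 3


@dataclass
class Finding:
    hypothesis: str
    valuation: tuple  # (v_u, v_q, v_e)
    level: Level = Level.IDEA
    result: Optional[float] = None  # measured f(I), set on reaching Level 2


@dataclass
class FindingsMemory:
    baseline: float
    findings: list = field(default_factory=list)

    def add_idea(self, hypothesis, valuation):
        finding = Finding(hypothesis, valuation)
        self.findings.append(finding)
        return finding

    def record_result(self, finding, score):
        """Promote to Level 2; promote again to Level 3 only above baseline."""
        finding.result = score
        finding.level = Level.IMPLEMENT
        if score > self.baseline:
            finding.level = Level.PROGRESS
        return finding.level

    def count(self, level):
        """Number of findings that have reached at least the given level."""
        return sum(1 for f in self.findings if f.level >= level)


memory = FindingsMemory(baseline=0.50)
a = memory.add_idea("idea A", (7, 6, 4))
b = memory.add_idea("idea B", (5, 8, 3))
c = memory.add_idea("idea C", (4, 4, 9))
memory.record_result(a, 0.62)  # beats the baseline, reaches Level 3
memory.record_result(b, 0.41)  # negative result, retained at Level 2
funnel = [memory.count(level) for level in Level]
print(funnel)
```

Nothing is ever deleted in this sketch: the unselected idea and the failed implementation both stay in `findings`, mirroring the retention of negative results described above.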

The paper reports approximate volumes over a month-long campaign: ~5,000 Idea Findings, ~1,100 Implement Findings, 21 Progress Findings [PAPER, §5]. The funnel conversion rate (0.42% from ideas to innovations) is itself a significant empirical finding about the difficulty of autonomous discovery.

39.3.6 Retrieval for Context Management

As the Findings Memory grows beyond the LLM context window, a retrieval model selects the Top-K most relevant findings as input to the Strategist [PAPER, §3.1]. The paper does not specify the retrieval model, embedding dimension, indexing strategy, or the value of K. This mechanism is what enables month-long campaigns: without it, the system would be limited to however many findings fit in a single context window.

[INFERRED] Retrieval Implementation Unknowns: The paper describes the retrieval mechanism at a functional level but does not specify: (a) the embedding model used, (b) whether retrieval is semantic, keyword-based, or hybrid, (c) the value of K or how it is set, (d) whether findings are indexed incrementally or re-indexed periodically, or (e) how the balance between recent findings and historically important ones is managed. These implementation details significantly affect the system's ability to maintain continuity over long campaigns.
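Because the retrieval model is unspecified, any concrete example is necessarily a stand-in. The sketch below shows one of the simplest possible Top-K retrievers, bag-of-words cosine similarity using only the standard library, purely to illustrate the functional role. The function names, the similarity measure, and the example findings are all assumptions; the actual system may use a learned embedding model or a hybrid scheme.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(findings, query, k):
    """Return the k findings most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(findings, key=lambda f: cosine(bag_of_words(f), q), reverse=True)
    return ranked[:k]


# Made-up findings for illustration only.
findings = [
    "speculative decoding with draft tree pruning raised tokens per second",
    "contrastive prompt for failure attribution lowered accuracy",
    "kv cache quantization hurt output quality on long prompts",
    "draft model distillation for speculative decoding had no effect",
]
top = retrieve_top_k(findings, "speculative decoding throughput", k=2)
for finding in top:
    print(finding)
```

Note that the negative result ("had no effect") is retrieved alongside the positive one. A purely similarity-based retriever has no notion of recency or importance, which is exactly unknown (e) in the list above.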

39.4 Key Results

39.4.0 Evaluation Caveats

Evaluation Caveats — Read Before Interpreting Results:
  • Missing absolute baselines: The paper reports percentage improvements over baselines but does not consistently report absolute baseline scores for all three tasks. Without absolute values, the magnitude of improvements cannot be independently assessed [PAPER, Tables 1–2].
  • Unreported seed/run counts: The paper does not specify how many independent runs were conducted per task, nor whether the reported results reflect single best-of-campaign outcomes or aggregated statistics. The stochastic nature of both LLM generation and experimental execution means single-run results carry high variance.
  • Reviewer circularity: The automated review scores (Table 2) use DeepReviewer, developed by the same team as part of their CycleResearcher work. While validated against human judgments, systematic bias toward the team's output style cannot be excluded [PAPER, §4].
  • Unmatched compute budgets: DeepScientist used 16 H800 GPUs for ~1 month per task. Comparisons with AI Scientist, AI Researcher, and other systems do not control for compute budget [PAPER, §4].
  • Human verification limited: Three human experts verified outputs, but the paper does not detail the verification protocol, how many findings each expert reviewed, or their selection criteria.
  • No independent reproduction: As of April 2026, no independent team has attempted to reproduce DeepScientist's claimed SOTA results.

39.4.1 SOTA-Surpassing Results

| Benchmark / Task | Task Description | Baseline Score | System Score | Δ | Seeds / Runs | Compute Budget | Evaluation Protocol | Evidence Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Agent Failure Attribution | Attribute agent failures to causal actions | Not reported as absolute | Not reported as absolute | +183.7% accuracy (paper-reported) | Not reported | ~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs | Not reported in detail | [PAPER, §5, Table 1] |
| LLM Inference Acceleration | Increase tokens/s without quality loss | Not reported as absolute | Not reported as absolute | +1.9% tokens/s (paper-reported) | Not reported | ~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs | Not reported in detail | [PAPER, §5, Table 1] |
| AI Text Detection | Distinguish AI-generated from human text | Not reported as absolute | Not reported as absolute | +7.9% AUROC, −65.5% latency (paper-reported) | Not reported | ~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs | Not reported in detail | [PAPER, §5, Table 1] |

The discovered methods are A2P (Abduction-Action-Prediction), ACRA, and PA-Detect. The paper describes A2P as a three-phase reasoning framework (abduction → action analysis → prediction) and PA-Detect as achieving a Pareto improvement on the accuracy–latency frontier [PAPER, §5]. ACRA's internal mechanism is not described in detail beyond the name.
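A latency reduction and a throughput gain describe the same change on different scales, which matters when reading an efficiency figure such as the −65.5% latency listed for PA-Detect. The following is a minimal sketch of the conversion, under the assumption (made here for illustration) that the underlying measurement is a 190% increase in inference speed. The function name is hypothetical.

```python
def latency_reduction_from_speedup(speedup_gain):
    """Fractional latency reduction implied by a fractional throughput gain.

    A gain of g means the new system is (1 + g) times as fast, so each item
    takes 1 / (1 + g) of the old time.
    """
    return 1.0 - 1.0 / (1.0 + speedup_gain)


reduction = latency_reduction_from_speedup(1.90)  # +190% throughput
print(f"{reduction:.1%}")
```

Under that assumption the two figures agree. The conversion also shows why a latency reduction can never reach 100%, however large the speedup.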

39.4.2 Automated Review Evaluation (DeepReviewer)

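| System | Papers | Soundness | Presentation | Contribution | Rating | Accept Rate | Evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AI Scientist | 10 | 2.08 | 1.80 | 1.75 | 3.35 | 0% | [PAPER, Table 2] |
| AI Researcher | 7 | 1.75 | 1.46 | 1.57 | 2.57 | 0% | [PAPER, Table 2] |
| AI Scientist v2 | 3 | 1.67 | 1.50 | 1.50 | 2.33 | 0% | [PAPER, Table 2] |
| CycleResearcher | 6 | 2.25 | 1.75 | 2.13 | 3.75 | 0% | [PAPER, Table 2] |
| Zochi | 2 | 2.38 | 2.38 | 2.25 | 4.63 | 0% | [PAPER, Table 2] |
| DeepScientist | 5 | 2.90 | 2.90 | 2.90 | 5.90 | 60% | [PAPER, Table 2] |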

DeepScientist is the only system achieving non-zero acceptance under DeepReviewer (3 of 5 papers). The reviewer circularity risk described in §39.4.0 applies directly here.

39.4.3 Human Expert Review

To address automated reviewer bias, the paper reports human expert evaluation [PAPER, §4.2]:

| Metric | DeepScientist Papers | ICLR 2025 Human Papers | Evidence |
| --- | --- | --- | --- |
| Average rating | 5.00 | 5.08 | [PAPER, §4.2] |
| Reviewers per paper | 3 | 3–4 | [PAPER, §4.2] |
| Inter-annotator agreement (Krippendorff α) | 0.739 | Not reported | [PAPER, §4.2] |

The 5.00 average is effectively indistinguishable from the 5.08 ICLR baseline given the small sample (5 papers, 3 reviewers each). Krippendorff's α = 0.739 [PAPER, §4.2] clears the 0.667 floor that Krippendorff proposes for drawing tentative conclusions, but falls short of the 0.80 level conventionally required for reliable agreement. Note also that an ICLR rating of 5 corresponds roughly to "marginally below acceptance threshold," so these papers are competitive with, but not clearly above, venue-quality human work.
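For readers unfamiliar with the agreement statistic, the sketch below computes Krippendorff's α for interval-scaled ratings as one minus the ratio of observed to expected disagreement. The ratings are toy values chosen so the result can be checked by hand; they are not the paper's review scores, which are not reproduced in this chapter. The function name is hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    `units` is a list of rating lists, one per rated item. Items with fewer
    than two ratings are ignored because they contain no pairable values.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences across all pairable values.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - d_o / d_e


# Toy example: two items, two raters each.
alpha = krippendorff_alpha_interval([[1, 2], [3, 4]])
print(round(alpha, 3))
```

In the toy example the observed disagreement is 1 and the expected disagreement is 40/12, giving α = 0.7. With only five papers and three raters each, an estimate of this kind carries wide uncertainty, which is a further reason to read the reported 0.739 cautiously.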

39.4.4 Innovation Funnel

| Stage | Volume | Conversion Rate | Evidence |
| --- | --- | --- | --- |
| Idea Findings generated | ~5,000 | | [PAPER, §5] |
| Implement Findings (selected by UCB) | ~1,100 | ~22% of ideas | [PAPER, §5] |
| Progress Findings (innovations) | 21 | ~1.9% of implementations | [PAPER, §5] |
| SOTA-surpassing methods | 3 | ~14.3% of innovations | [PAPER, §5] |
| Ideas → SOTA | | ~0.06% | [PAPER; computed from reported numbers] |
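The conversion rates in the funnel follow directly from the four reported volumes. A short sketch of the arithmetic, treating the approximate counts (~5,000 and ~1,100) as exact for the purpose of the calculation:

```python
# Reported volumes; the first two are approximate in the paper.
funnel = [
    ("Idea Findings", 5000),
    ("Implement Findings", 1100),
    ("Progress Findings", 21),
    ("SOTA-surpassing methods", 3),
]

# Stage-to-stage conversion rates.
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.2%}")

# End-to-end rates measured from the initial idea pool.
ideas = funnel[0][1]
ideas_to_innovations = funnel[2][1] / ideas
ideas_to_sota = funnel[3][1] / ideas
print(f"Ideas -> innovations: {ideas_to_innovations:.2%}")
print(f"Ideas -> SOTA: {ideas_to_sota:.2%}")
```

Because the two largest counts are rounded, the derived percentages should be read as approximate to roughly the same precision.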

39.5 Implementation & Cost

39.5.1 Paper-Reported Infrastructure

| Component | Specification | Evidence |
| --- | --- | --- |
| GPU hardware | 2 servers × 8 NVIDIA H800 (80 GB) = 16 GPUs total | [PAPER, §4] |
| GPU hours consumed | 20,000+ total across all tasks | [PAPER, §4] |
| Campaign duration | ~1 month per task | [PAPER, §4] |
| Parallelism | 16 GPU instances, each running independent exploration thread | [PAPER, §4] |
| Reasoning model | Gemini-2.5-Pro | [PAPER, §3] |
| Coding model | Claude-4-Opus | [PAPER, §3] |
| Human experts | 3 (for output verification) | [PAPER, §4] |
| License | CC BY-NC-SA 4.0 | [PAPER, §1] |
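A quick consistency check on the figures in this table: the ceiling on GPU hours follows from the GPU count and campaign length. The sketch assumes a 30-day month, full use of all 16 GPUs for each task, and three sequential campaigns; none of these scheduling details are stated in the paper.

```python
GPUS = 16
DAYS_PER_CAMPAIGN = 30  # "~1 month", assumed here to be 30 days
TASKS = 3

max_hours_per_task = GPUS * DAYS_PER_CAMPAIGN * 24
max_hours_all_tasks = max_hours_per_task * TASKS
utilization = 20_000 / max_hours_all_tasks  # reported "20,000+" as a floor

print(max_hours_per_task, max_hours_all_tasks, f"{utilization:.0%}")
```

Under these assumptions a single task cannot exceed about 11,500 GPU hours, so the reported 20,000+ hours is only consistent as a total across tasks, corresponding to roughly 58% or more of the theoretical ceiling.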

39.5.2 Cost Estimates (Author-Derived)

Provenance Note: The following cost estimates are not reported in the paper. They are survey-author calculations based on publicly available cloud pricing as of April 2026. Actual costs may differ significantly depending on institutional pricing, API negotiation, and usage patterns.
| Component | Quantity | Estimated Unit Cost | Estimated Total | Provenance |
| --- | --- | --- | --- | --- |
| H800 GPU hours | 20,000+ | $2–4/hr (cloud) | $40K–80K | [INFERRED from public pricing] |
| Gemini-2.5-Pro API | Not reported | $0.00125–0.01/1K tokens | $5K–50K (rough range) | [INFERRED] |
| Claude-4-Opus API | Not reported | $0.015–0.075/1K tokens | $10K–100K (rough range) | [INFERRED] |
| Human expert time | 3 experts × est. 40 hrs | $100–200/hr | $12K–24K | [INFERRED] |
| Estimated total per task | | | $67K–254K | [INFERRED] |

[INFERRED] Cost Context: The estimated $67K–254K per task is comparable to a postdoctoral researcher-year in many institutions. However, DeepScientist completes a task in ~1 month rather than 1–5 years, and the architecture is parallelizable — adding GPUs increases throughput approximately linearly (per the paper's claim that instances run independently with shared memory). Whether this scaling holds beyond 16 GPUs is not demonstrated.
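The total range is the sum of the component lows and the sum of the component highs. The sketch below reproduces that arithmetic from the survey-derived estimates above; the dictionary layout is purely illustrative.

```python
# (low, high) estimates in thousands of US dollars, copied from the cost table.
components = {
    "H800 GPU hours": (40, 80),
    "Gemini-2.5-Pro API": (5, 50),
    "Claude-4-Opus API": (10, 100),
    "Human expert time": (12, 24),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"${low}K-{high}K")

# Share of the total attributable to GPU time at each end of the range.
gpu_share_low = components["H800 GPU hours"][0] / low
gpu_share_high = components["H800 GPU hours"][1] / high
print(f"GPU share: {gpu_share_low:.0%} (low case), {gpu_share_high:.0%} (high case)")
```

Summing lows and highs separately yields the widest possible range. The spread is dominated by the two API rows, whose token volumes are not reported, so the upper bound in particular is weakly constrained.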

39.6 Reproducibility Checklist

| Requirement | Status | Notes | Evidence |
| --- | --- | --- | --- |
| Code publicly released | Partial | Repository exists at github.com/ResearAI/DeepScientist; completeness of released code not independently verified | [PAPER, §1] |
| Config files available | Not independently verified | Paper does not detail config schema; repo may contain examples | [PAPER] |
| Pretrained weights / checkpoints | N/A | System uses API-based LLMs (Gemini, Claude); no custom weights | [PAPER, §3] |
| Documented entry point or run command | Not independently verified | Not described in paper; may exist in repo README | [PAPER] |
| Compute requirements stated | Yes | 16×H800 GPUs, 20,000+ GPU hours, ~1 month per task | [PAPER, §4] |
| Seeds and run counts reported | No | Neither seed values nor number of independent runs reported | [PAPER] |
| Independent reproduction attempted | No | No independent reproduction documented as of April 2026 | |
| Findings Memory dumps released | Not independently verified | Would enable trajectory analysis; paper does not confirm release | [PAPER] |
| Discovered method code (A2P, ACRA, PA-Detect) | Partial | Paper describes A2P and PA-Detect at algorithmic level; code may or may not be in repo | [PAPER, §5] |
| Experimental logs released | Not independently verified | Not mentioned in paper | [PAPER] |

The primary reproducibility barrier is economic, not technical. The paper's methodology is described clearly enough to implement, but actually running DeepScientist requires compute resources (~$67K–254K per task) that few academic labs can dedicate to reproduction. A scaled-down validation (fewer GPUs, shorter campaigns, cheaper models) would test the architectural concepts but could not reproduce the claimed SOTA results.

[INFERRED] Suggested Reduced-Scale Validation: A partial reproduction targeting concept validation (BO loop correctness, memory accumulation, UCB selection behavior) could be conducted with 1–2 GPUs over 1–2 weeks using smaller models (e.g., Gemini Flash, Claude Sonnet) for an estimated $1K–10K. This would not replicate SOTA-surpassing outcomes but could verify that the architectural mechanisms function as described.

39.7 Threats to Validity

This section consolidates all identified threats to the validity of DeepScientist's claims, organized by category.

39.7.1 Reviewer Circularity

The automated review scores (Table 2) are produced by DeepReviewer, developed by the same research group as part of CycleResearcher [PAPER, §4]. While DeepReviewer was validated against human judgments in the CycleResearcher paper, the possibility of systematic bias toward the authoring team's output style cannot be excluded. The 60% accept rate (vs. 0% for all competitors) is a dramatic gap that warrants skepticism until confirmed by an independent review model. The paper partially addresses this with human expert review (§39.4.3), but the sample size is small (5 papers, 3 reviewers).

39.7.2 Compute-Budget Mismatch

DeepScientist consumed 20,000+ GPU hours on H800 hardware across its three tasks [PAPER, §4]. Comparisons with AI Scientist (~10 GPU hours per paper), AI Researcher, and other systems do not control for this compute gap of two to three orders of magnitude [PAPER, §4–5]. Whether the other systems would achieve comparable results given equivalent compute remains an open question.

39.7.3 Missing Absolute Baselines and Evaluation Protocols

The three SOTA-surpassing claims are reported as relative improvements (+183.7%, +1.9%, +7.9%) without consistently reporting the absolute baseline scores, the evaluation datasets, the exact evaluation protocols, or the statistical significance of the improvements [PAPER, §5]. The +183.7% accuracy gain for A2P is particularly difficult to interpret without knowing the absolute baseline — a 183.7% improvement from 10% accuracy (to ~28%) is very different from 183.7% from 30% (to ~85%).
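The ambiguity can be made concrete with a few lines of arithmetic. The sketch below applies the reported relative gain to the two hypothetical baselines mentioned above and also derives the largest baseline that is logically possible, since accuracy cannot exceed 100%. The baselines are illustrative, not values from the paper.

```python
RELATIVE_GAIN = 1.837  # +183.7%, as reported


def improved(baseline_pct):
    """Absolute accuracy (in %) after applying the relative gain."""
    return baseline_pct * (1.0 + RELATIVE_GAIN)


for baseline in (10.0, 30.0):
    print(f"{baseline:.0f}% -> {improved(baseline):.1f}%")

# Accuracy cannot exceed 100%, which bounds the baseline from above.
max_baseline = 100.0 / (1.0 + RELATIVE_GAIN)
print(f"baseline <= {max_baseline:.1f}%")
```

The bound is informative in its own right: whatever the unreported baseline was, it must have been below roughly 35% accuracy, which indicates that the task started from a weak baseline.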

39.7.4 Human-in-the-Loop Verification Gaps

Three human experts verified outputs and filtered hallucinations [PAPER, §4]. The paper does not specify: (a) the verification protocol, (b) how many findings each expert reviewed, (c) whether experts could intervene during campaigns to guide exploration, (d) the criteria for accepting vs. rejecting a Progress Finding, or (e) whether expert filtering was applied before or after the SOTA claims were formulated. The boundary between autonomous discovery and human-guided selection is therefore not fully transparent.

39.7.5 Absence of Independent Reproduction

As of April 2026, no independent team has reproduced DeepScientist's results. The extreme compute cost (~$200K+ across three tasks) makes independent reproduction prohibitively expensive for most groups. Until independent reproduction occurs, the SOTA claims rest solely on the authors' self-evaluation.

39.7.6 Surrogate Calibration

The LLM Reviewer produces valuation vectors \(\mathbf{V}\) that are used for UCB selection, but no analysis of surrogate calibration is reported [PAPER]. Key questions: How well does \(v_u\) correlate with actual experimental improvement? Does \(v_e\) accurately reflect novelty? Does the surrogate improve over the course of a campaign? Without calibration data, it is unclear whether the BO formulation is contributing meaningfully versus simpler selection heuristics.
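One simple form such a calibration analysis could take is a rank correlation between the predicted utility \(v_u\) and the measured gain over baseline across Implement Findings. The sketch below implements Spearman's ρ for tie-free data using only the standard library. The data are hypothetical placeholders, since the paper releases no such pairs, and the function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free samples of equal length."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical data: predicted utility v_u vs. measured gain over baseline.
predicted_v_u = [8.5, 6.0, 7.2, 3.1, 9.0]
measured_gain = [0.04, -0.01, 0.00, -0.03, 0.02]
rho = spearman(predicted_v_u, measured_gain)
print(round(rho, 2))
```

A ρ near zero on real campaign data would suggest the surrogate adds little over random selection, while a value that rises over the course of a campaign would indicate that the accumulated memory is improving the reviewer's judgments.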

39.7.7 Single-Campaign Results per Task

Each task appears to have been run as a single month-long campaign [PAPER, §4]. With no replicate campaigns, the variance of outcomes across campaigns is unknown. A different random seed could yield a substantially different set of innovations — or none at all.

39.8 Limitations & Open Questions

39.8.1 Paper-Acknowledged Limitations

The paper explicitly acknowledges the efficiency problem, framing it as the next research frontier [PAPER, §6]:

"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"

The 0.06% conversion rate from ideas to SOTA methods, while demonstrating that autonomous discovery is possible, also quantifies its current inefficiency.

| Limitation | Severity | Evidence |
| --- | --- | --- |
| Extreme compute cost (20,000+ GPU hours across all tasks) | High | [PAPER, §4] |
| API dependency on frontier models (Gemini-2.5-Pro, Claude-4-Opus) | High | [PAPER, §3] |
| Human expert requirement for verification | Medium | [PAPER, §4] |
| Low conversion rate (0.42% ideas → innovations) | Medium | [PAPER, §5] |
| Month-long campaign duration | Medium | [PAPER, §4] |
| No cross-campaign learning mechanism described | Medium | [PAPER; mechanism not mentioned] |

39.8.2 Open Questions

[INFERRED] Open Research Questions:
  1. Surrogate vs. random selection: How much does the UCB acquisition function contribute compared to random selection of hypotheses? An ablation removing UCB (replacing it with random or round-robin selection) would isolate the value of the BO formulation.
  2. Cross-campaign transfer: Can Findings Memory from one task seed another? Meta-strategies that generalize across tasks (e.g., "ablation-first approaches are reliable") could accelerate future campaigns.
  3. Scaling laws: The paper claims near-linear scaling with GPUs but provides data at only one operating point (16 GPUs). Do innovation yields scale linearly to 64, 256, or 1,024 GPUs?
  4. Open-weight models: Replacing Gemini and Claude with open-weight alternatives would remove the API dependency and enable true reproducibility, but at what quality cost?
  5. Domain generalization: All three demonstrated tasks are frontier AI problems. Can the BO formulation extend to domains where the LLM has less prior knowledge (chemistry, materials science, biology)?

39.9 Survey Positioning

39.9.1 Comparison with Related Autonomous Research Systems

| System | Origin | Search Paradigm | Primary Output | SOTA Claims | Compute Scale | Memory | Budget Match |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AI Scientist (Sakana, 2024) | Industry | LLM ideation, heuristic scoring | Papers | No | ~10 GPU-hrs/paper | Stateless | Not matched. Budgets differ by 2–3 orders of magnitude from DeepScientist. |
| AI Scientist v2 (Sakana, 2025) | Industry | Open-ended multi-paper campaigns | Papers | No | Not reported in detail | Campaign-level | |
| CycleResearcher (Westlake, 2024) | Academic | Review-driven iteration | Papers | No | Not reported in detail | Session-level | |
| Zochi (2025) | Unknown | High-quality generation | Papers | No | Not reported in detail | Not described | |
| DeepScientist (Westlake, 2025) | Academic | UCB over LLM surrogate (BO) | Methods + Papers | Yes (3 tasks, author-reported) | 20,000+ GPU-hrs | 3-level Findings Memory | |

39.9.2 Comparison with Evolutionary Code Discovery Systems

DeepScientist's BO-based approach contrasts with evolutionary approaches to code discovery:

| Dimension | FunSearch / AlphaEvolve | DeepScientist |
| --- | --- | --- |
| Search paradigm | Evolutionary (population-based) | Bayesian Optimization (surrogate-guided) |
| What is maintained | Population of programs | Memory of findings (hypotheses + results) |
| Selection mechanism | Fitness-proportional / MAP-Elites | UCB acquisition function |
| Exploration signal | Mutation stochasticity / behavior diversity | Explicit exploration score \(v_e\) |
| Negative results | Discarded (low-fitness individuals die) | Retained in memory for future guidance |
| Theoretical basis | Evolutionary dynamics | Bayesian optimization / bandits |
| Output granularity | Functions / programs | Full scientific methods (code + experiments + papers) |
| Campaign scope | Hours to days | ~1 month |

The retention of negative results is a distinctive design choice. Evolutionary systems discard low-fitness individuals; DeepScientist preserves failed implementations as informative data for future hypothesis generation. Whether this provides a measurable advantage over evolutionary forgetting is not empirically tested.

39.9.3 Positioning within Autonomous Research Trajectory

DeepScientist positions itself as the culmination of an academic trajectory from paper refinement to method discovery [PAPER, §2]:

  • CycleResearcher (2024): Review-driven iterative refinement of research papers. Output: better-written papers.
  • DeepScientist (2025): BO-guided discovery of research methods. Output: working implementations that outperform baselines.

The paper claims that DeepScientist is the first autonomous system to demonstrate verified SOTA-surpassing results on frontier AI tasks [PAPER, §1]. This claim is bounded to the specific comparison class of "autonomous research systems that produce implemented methods evaluated on standard benchmarks," and is author-reported rather than independently verified. Among the systems surveyed in this book, no other autonomous research system has made comparable verified-SOTA claims, though AlphaEvolve has demonstrated improvements on specific mathematical and coding tasks in a different problem class (program synthesis vs. scientific method discovery).

39.10 Summary

Key Takeaway: DeepScientist provides a mathematically principled framework (Bayesian Optimization with an LLM surrogate and UCB acquisition) for autonomous scientific discovery, and the paper reports three methods that surpassed human state-of-the-art on frontier AI tasks — a claim that, if independently reproduced, would mark a watershed in autonomous research.

Main Contribution: The formalization of discovery as BO over hypothesis space, with a three-component valuation vector (utility, quality, exploration) that enables principled exploration–exploitation balance. The Findings Memory, which retains negative results alongside successes, provides a cumulative knowledge base that scales beyond context-window limits via retrieval. The innovation funnel (5,000 → 1,100 → 21 → 3) quantifies both the possibility and the current inefficiency of autonomous scientific discovery.

What Researchers Should Know: DeepScientist's claimed results are compelling but carry significant caveats: reviewer circularity (DeepReviewer is from the same team), absence of absolute baselines, unreported seed/run counts, extreme compute requirements (20,000+ GPU-hours across the three tasks), and no independent reproduction. The system requires human expert verification (3 experts), making it semi-autonomous rather than fully autonomous. The true scientific contribution may be less the specific SOTA results and more the demonstration that principled optimization over a structured hypothesis memory, at sufficient scale, can produce genuine methodological innovations, together with the honest quantification of how inefficient that process currently is (0.06% conversion rate from ideas to SOTA methods).