Introduced: 2025-09

Score: 8.73/10 (Final)

Chapter 39

DeepScientist

Part: Autonomous Research Systems

39.1 Overview & Motivation

DeepScientist (Weng et al., 2025; arXiv:2509.26603) reframes autonomous scientific discovery as a Bayesian Optimization (BO) problem, where the search space comprises all possible scientific methods, an LLM Reviewer serves as the surrogate model, and a UCB acquisition function governs the exploration–exploitation tradeoff [PAPER, §3]. The system was developed at Westlake University's Engineering School and represents a direct evolution from the same group's CycleResearcher work on review-driven iterative paper refinement [PAPER, §1].

The paper reports that DeepScientist produced 21 genuine scientific innovations from approximately 5,000 generated hypotheses across three frontier AI tasks, with three methods that the authors claim surpassed human state-of-the-art: A2P for agent failure attribution (+183.7% accuracy), ACRA for LLM inference acceleration (+1.9% tokens/s), and PA-Detect for AI text detection (+7.9% AUROC with a 65.5% latency reduction, i.e. roughly 2.9× faster inference) [PAPER, §5, Tables 1–2]. These are author-reported results evaluated by the team's own DeepReviewer model and three human experts; no independent reproduction has been documented as of April 2026.

Within this survey's taxonomy, DeepScientist occupies a distinctive position among autonomous research systems. While AI Scientist (Sakana, 2024), AI Researcher (Alibaba, 2024), and Zochi (2025) focus on generating research papers evaluated by automated reviewers, DeepScientist targets working implementations evaluated against quantitative task metrics [PAPER, §2]. The shift from paper generation to method discovery — and the adoption of a principled optimization framework rather than ad-hoc search — is the paper's central claim to novelty.

Key Contribution: DeepScientist formalizes autonomous scientific discovery as Bayesian Optimization with an LLM surrogate, using a three-component valuation vector (utility, quality, exploration) and UCB acquisition to guide month-long GPU campaigns. The paper reports verified SOTA-surpassing results on three frontier AI tasks — a claim that, if independently reproduced, would represent a milestone in autonomous research [PAPER, §5].

39.2 Architecture

39.2.1 Three-Stage Discovery Cycle

DeepScientist's architecture centers on a three-stage hierarchical discovery cycle that repeats continuously for the duration of a campaign (approximately one month per task) [PAPER, §3]. The stages are:

Strategize & Hypothesize — LLM Reviewer (Gemini-2.5-Pro) analyzes the Findings Memory, generates novel hypotheses, and produces valuation vectors [PAPER, §3.1].
Implement & Verify — UCB acquisition selects the most promising hypothesis; a coding agent (Claude-4-Opus) implements and experimentally validates it [PAPER, §3.2].
Analyze & Report — Triggered only for implementations that surpass the baseline; performs ablation studies, cross-dataset evaluation, and paper synthesis [PAPER, §3.3].

All three stages share a persistent Findings Memory — a structured, three-level knowledge base (Idea Findings, Implement Findings, Progress Findings) that accumulates knowledge across the campaign [PAPER, §3.1]. When memory exceeds the LLM's context window, a retrieval model selects the Top-K most relevant findings [PAPER, §3.1].

Figure 39.1: DeepScientist three-stage discovery cycle with shared Findings Memory and parallel GPU execution. All component labels and data flows are derived from the paper's description [PAPER, §3, Figures 1–2].

39.2.2 Dual-Model Architecture

DeepScientist splits its work between two frontier LLMs according to function [PAPER, §3]:

Role	Model	Stages	Evidence
Reasoning / strategy / review / paper synthesis	Gemini-2.5-Pro	Stages 1, 3	[PAPER, §3]
Code generation / implementation / debugging	Claude-4-Opus	Stage 2	[PAPER, §3]

This is a functional separation (reasoning vs. coding), distinct from the hierarchical separation (cheap model for quantity, expensive for quality) seen in AlphaEvolve's Gemini Flash + Pro ensemble [PAPER, §2]. The coding agent in Stage 2 has full permissions: repository read/write, sandboxed execution, internet access, package installation, and dedicated H800 GPU access [PAPER, §3.2].

39.2.3 Repository Status

The paper cites a public repository at github.com/ResearAI/DeepScientist [PAPER, §1] and a project page at ai-researcher.net [PAPER, §1]. The repository is released under CC BY-NC-SA 4.0 [PAPER, §1].

Repository Verification Note: This chapter has not performed a pinned-commit audit of the DeepScientist repository. All implementation claims below are sourced from the paper or README descriptions unless otherwise noted. No [REPO] tags are used in this chapter because the repository contents have not been independently verified by the survey authors. Readers intending to reproduce or extend the system should verify the current repository state against the claims made here.

39.3 Core Algorithms

39.3.1 Bayesian Optimization Formulation

DeepScientist's central intellectual contribution is formalizing autonomous discovery as Bayesian Optimization [PAPER, §3]. The optimization objective:

$$ I^* = \underset{I \in \mathcal{I}}{\operatorname{argmax}} \; f(I) $$ [Published formula — paper §3, Eq. 1]

Symbol	Meaning	Domain
$\mathcal{I}$	Space of all possible scientific methods (hypotheses, algorithms, implementations)	Combinatorial, unstructured
$f(I)$	True value function — actual performance of method $I$ when implemented and evaluated	$\mathbb{R}$
$I^*$	Globally optimal method (unknown, approximated through search)	$\mathcal{I}$

Evaluating $f(I)$ is expensive: each evaluation requires generating a full implementation, running GPU experiments, and analyzing results against baselines. DeepScientist addresses this by using an LLM Reviewer as a surrogate model that cheaply approximates $f$. For each candidate hypothesis, the reviewer produces a three-component valuation vector [PAPER, §3.1]:

$$ \mathbf{V}(I) = \langle v_u(I), \, v_q(I), \, v_e(I) \rangle $$ [Published formula — paper §3, Eq. 2]

Symbol	Meaning	Role in BO
$v_u$	Utility value — estimated practical impact and performance improvement	Exploitation signal
$v_q$	Quality value — estimated methodological soundness and rigor	Exploitation signal
$v_e$	Exploration value — estimated novelty relative to previously explored regions	Exploration signal

The UCB acquisition function selects the next hypothesis to evaluate [PAPER, §3.1]:

$$ I_{t+1} = \underset{I}{\operatorname{argmax}} \left( w_u \cdot v_u(I) + w_q \cdot v_q(I) + \kappa \cdot v_e(I) \right) $$ [Published formula — paper §3, Eq. 3]

Symbol	Meaning	Notes
$w_u$	Weight on utility	Exploitation coefficient; values not reported in paper
$w_q$	Weight on quality	Exploitation coefficient; values not reported in paper
$\kappa$	Exploration coefficient	Controls exploration–exploitation tradeoff; value not reported
$I_{t+1}$	Selected hypothesis for evaluation at step $t+1$	Promoted from Idea Finding to Implement Finding

[INFERRED] Theoretical Guarantees and Their Limits: In classical Bayesian Optimization, UCB provides sublinear cumulative regret bounds when the surrogate model is a well-specified Gaussian Process with calibrated posterior uncertainty. DeepScientist's LLM surrogate replaces the GP posterior with prompt-based assessment, which sacrifices the calibrated uncertainty estimates ($\sigma(x)$) that underpin UCB's theoretical guarantees. The exploration component $v_e$ is a heuristic substitute for posterior variance, produced by the same LLM that generates hypotheses — there is no formal guarantee of calibration or coverage. The BO formulation should therefore be understood as a principled heuristic inspired by the UCB framework rather than a system with provable regret bounds. The paper does not claim formal convergence guarantees [PAPER, §3].

39.3.2 The Discovery Loop

Figure 39.2: DeepScientist BO discovery loop. All steps are paper-described [PAPER, §3].

39.3.3 Three-Component Valuation and Surrogate Mapping

The decomposition of the exploitation signal into utility ($v_u$) and quality ($v_q$) is the key departure from classical BO, where the surrogate produces a single mean prediction $\mu(x)$ and uncertainty $\sigma(x)$ [PAPER, §3.1]. The mapping between classical and DeepScientist formulations:

$$ \underbrace{w_u \cdot v_u + w_q \cdot v_q}_{\text{exploitation} \;\approx\; \mu(x)} + \underbrace{\kappa \cdot v_e}_{\text{exploration} \;\approx\; \kappa \cdot \sigma(x)} $$ [Author-derived formalization — mapping described in paper §3, explicit algebraic correspondence constructed by survey authors]

This decomposition allows the system to distinguish between bold but risky ideas (high $v_u$, low $v_q$) and conservative but reliable ones (low $v_u$, high $v_q$). The paper does not report the specific values of $w_u$, $w_q$, or $\kappa$ used in the campaigns, nor whether these parameters were fixed or adapted during execution [PAPER; values not reported].
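
To make this trade-off concrete, the following minimal sketch scores two hypothetical ideas with the Eq. 3 acquisition rule under two weight settings. The `Idea` class, the valuation numbers, and every weight value are illustrative assumptions; the paper reports none of $w_u$, $w_q$, or $\kappa$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    """Hypothetical container for one valuation vector V = <v_u, v_q, v_e>."""
    name: str
    v_u: float  # utility estimate
    v_q: float  # quality estimate
    v_e: float  # exploration estimate


def ucb_score(idea: Idea, w_u: float, w_q: float, kappa: float) -> float:
    """Eq. 3 score: weighted exploitation terms plus weighted exploration term."""
    return w_u * idea.v_u + w_q * idea.v_q + kappa * idea.v_e


# Illustrative values only; not taken from the paper.
bold = Idea("bold-but-risky", v_u=0.9, v_q=0.3, v_e=0.8)
safe = Idea("conservative", v_u=0.4, v_q=0.9, v_e=0.2)


def pick(w_u: float, w_q: float, kappa: float) -> str:
    """Return the name of the idea with the higher Eq. 3 score."""
    return max((bold, safe), key=lambda i: ucb_score(i, w_u, w_q, kappa)).name


utility_heavy = pick(w_u=1.0, w_q=0.2, kappa=0.5)  # favors utility and novelty
quality_heavy = pick(w_u=0.2, w_q=1.0, kappa=0.0)  # favors rigor, no exploration
print(utility_heavy, quality_heavy)
```

Under these assumed numbers the utility-heavy setting selects the risky idea and the quality-heavy setting selects the conservative one, which is why the unreported weights matter for interpreting the campaigns.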

39.3.4 Pseudocode for Discovery Cycle

# Pseudocode — reconstructed from paper description (§3).
# No class names, function signatures, or module paths are verified from the repository.

def deepscientist_campaign(task, memory, config):
    """Main discovery loop — runs for ~1 month per task."""
    # Seed memory with human knowledge (papers, baselines, codebases)
    memory.load_human_knowledge(task.seed_papers, task.baseline_repo)

    for t in range(config.max_iterations):
        # --- Stage 1: Strategize & Hypothesize ---
        if memory.exceeds_context_window():
            context = memory.retrieve_top_k(query=task.description, k=config.top_k)
        else:
            context = memory.all_findings()

        # LLM Reviewer (Gemini-2.5-Pro) generates and evaluates hypotheses
        idea_findings = llm_reviewer.generate_hypotheses(
            task=task, context=context
        )
        # Each idea_finding carries V = <v_u, v_q, v_e>
        memory.store_ideas(idea_findings)

        # --- UCB Acquisition (no LLM — deterministic) ---
        candidate = select_by_ucb(
            ideas=memory.unselected_ideas(),
            w_u=config.w_u, w_q=config.w_q, kappa=config.kappa
        )
        candidate.promote_to_implement()

        # --- Stage 2: Implement & Verify ---
        result = coding_agent.implement_and_evaluate(
            hypothesis=candidate,
            repository=task.baseline_repo,
            gpu=config.assigned_gpu
        )
        candidate.record_result(result)
        memory.update(candidate)

        # --- Stage 3 (conditional): Analyze & Report ---
        if result.surpasses_baseline(task.baseline_metric):
            candidate.promote_to_progress()
            analysis = analyzer.deep_analysis(
                finding=candidate,
                ablation_configs=task.ablation_suite,
                additional_datasets=task.eval_datasets
            )
            paper = synthesizer.generate_paper(candidate, analysis)
            memory.store_progress(candidate, analysis, paper)

# Pseudocode — reconstructed from paper description (§3.1).
# UCB selector is the only non-LLM component in the critical path.

def select_by_ucb(ideas, w_u, w_q, kappa):
    """Select idea with highest UCB score."""
    best_idea, best_score = None, float('-inf')
    for idea in ideas:
        score = (w_u * idea.v_u) + (w_q * idea.v_q) + (kappa * idea.v_e)
        if score > best_score:
            best_idea, best_score = idea, score
    return best_idea
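
The pseudocode above depends on LLM and GPU components that cannot be run here. As a rough illustration of the control flow only, the sketch below replaces the LLM Reviewer and the coding agent with random stubs. Every name (`Finding`, `stub_reviewer`, `stub_experiment`, `run_toy_campaign`), the noise model, and the default weights are assumptions made for this example and do not come from the paper or the repository.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """Hypothetical record; level is 'idea', 'implement', or 'progress'."""
    v_u: float
    v_q: float
    v_e: float
    level: str = "idea"
    result: Optional[float] = None


def stub_reviewer(rng: random.Random, n: int) -> List[Finding]:
    """Stand-in for the LLM Reviewer: emits n ideas with random valuations."""
    return [Finding(rng.random(), rng.random(), rng.random()) for _ in range(n)]


def stub_experiment(rng: random.Random, idea: Finding) -> float:
    """Stand-in for implement-and-evaluate: noisy score loosely tied to v_u."""
    return 0.5 * idea.v_u + rng.gauss(0.0, 0.2)


def run_toy_campaign(steps: int, ideas_per_step: int, baseline: float,
                     w_u: float = 1.0, w_q: float = 1.0, kappa: float = 1.0,
                     seed: int = 0) -> List[Finding]:
    rng = random.Random(seed)
    memory: List[Finding] = []
    for _ in range(steps):
        # Stage 1: generate and store new ideas.
        memory.extend(stub_reviewer(rng, ideas_per_step))
        # Acquisition: best not-yet-implemented idea by the Eq. 3 score.
        pool = [f for f in memory if f.level == "idea"]
        chosen = max(pool, key=lambda f: w_u * f.v_u + w_q * f.v_q + kappa * f.v_e)
        # Stage 2: run the stubbed experiment; failed attempts stay in memory.
        chosen.level = "implement"
        chosen.result = stub_experiment(rng, chosen)
        # Stage 3 trigger: promote only results that beat the baseline.
        if chosen.result > baseline:
            chosen.level = "progress"
    return memory


memory = run_toy_campaign(steps=20, ideas_per_step=5, baseline=0.6)
counts = {lvl: sum(f.level == lvl for f in memory)
          for lvl in ("idea", "implement", "progress")}
print(counts)
```

The toy run implements exactly one idea per step, so 20 of the 100 stored ideas leave the idea level; how many reach the progress level depends on the assumed noise and baseline.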

39.3.5 Findings Memory Lifecycle

Each finding follows a promotion lifecycle through three levels [PAPER, §3]:

Level 1 (Idea Finding): generated in Stage 1 by the LLM Reviewer; carries the hypothesis text and valuation vector $\mathbf{V}$. Remains in memory regardless of selection.
Level 2 (Implement Finding): promoted from Level 1 after UCB selection. Augmented with implementation details, experimental results, and baseline delta. Negative results are retained, preventing re-exploration of failed approaches.
Level 3 (Progress Finding): promoted from Level 2 when $f(I) > f_{\text{baseline}}$. Triggers Stage 3 analysis, ablation studies, and paper synthesis.

The paper reports approximate volumes across the three month-long campaigns: ~5,000 Idea Findings, ~1,100 Implement Findings, 21 Progress Findings [PAPER, §5]. The funnel conversion rate from ideas to innovations (21 of ~5,000, or 0.42%) is itself a significant empirical finding about the difficulty of autonomous discovery.
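
The promotion lifecycle can be sketched as a small state machine. The schema below (`Level`, `FindingRecord`, and its field and method names) is a hypothetical illustration; the paper does not publish a data model, and the repository has not been audited for one.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    IDEA = 1
    IMPLEMENT = 2
    PROGRESS = 3


@dataclass
class FindingRecord:
    """Hypothetical schema; field names are not taken from the paper or repo."""
    hypothesis: str
    valuation: tuple  # (v_u, v_q, v_e)
    level: Level = Level.IDEA
    result: Optional[float] = None
    history: list = field(default_factory=list)

    def promote_to_implement(self, result: float) -> None:
        """Level 1 -> 2: attach the experimental result after selection."""
        if self.level is not Level.IDEA:
            raise ValueError("only Idea Findings can be implemented")
        self.level, self.result = Level.IMPLEMENT, result
        self.history.append("implemented")

    def promote_to_progress(self, baseline: float) -> bool:
        """Level 2 -> 3 only if the result beats the baseline."""
        if self.level is not Level.IMPLEMENT:
            raise ValueError("only Implement Findings can be promoted")
        if self.result > baseline:
            self.level = Level.PROGRESS
            self.history.append("progress")
            return True
        return False  # the negative result is retained at Level 2, not deleted


win = FindingRecord("idea A", (0.8, 0.7, 0.5))
win.promote_to_implement(result=0.72)
loss = FindingRecord("idea B", (0.6, 0.6, 0.9))
loss.promote_to_implement(result=0.41)
promoted = [r.promote_to_progress(baseline=0.65) for r in (win, loss)]
print(win.level.name, loss.level.name)
```

Keeping the failed record at Level 2 with its result attached is the part that mirrors the retention of negative results described above.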

39.3.6 Retrieval for Context Management

As the Findings Memory grows beyond the LLM context window, a retrieval model selects the Top-K most relevant findings as input to the Strategist [PAPER, §3.1]. The paper does not specify the retrieval model, embedding dimension, indexing strategy, or the value of K. This mechanism is what enables month-long campaigns: without it, the system would be limited to however many findings fit in a single context window.

[INFERRED] Retrieval Implementation Unknowns: The paper describes the retrieval mechanism at a functional level but does not specify: (a) the embedding model used, (b) whether retrieval is semantic, keyword-based, or hybrid, (c) the value of K or how it is set, (d) whether findings are indexed incrementally or re-indexed periodically, or (e) how the balance between recent findings and historically important ones is managed. These implementation details significantly affect the system's ability to maintain continuity over long campaigns.
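
Because the retrieval model is unspecified, any concrete example is necessarily a stand-in. The sketch below uses token-overlap (Jaccard) similarity purely to illustrate the Top-K interface; the scoring function, the tokenizer, the function names, and the sample findings are all assumptions, and a real system would more plausibly use an embedding model.

```python
from typing import List, Tuple


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization; a deliberately crude stand-in."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_top_k(query: str, findings: List[str], k: int) -> List[Tuple[float, str]]:
    """Return the k findings most similar to the query, best first."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(f)), f) for f in findings]
    # Sort by score descending; ties keep insertion order because sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical findings, written for this example only.
findings = [
    "speculative decoding draft model raised tokens per second",
    "watermark detector failed on paraphrased text",
    "kv cache pruning reduced latency but hurt accuracy",
]
top = retrieve_top_k("faster decoding tokens per second", findings, k=2)
print([text for _, text in top])
```

Even this crude version shows the design question raised in item (e): a pure similarity ranking has no notion of recency or importance, so those would need separate terms in the score.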

39.4 Key Results

39.4.0 Evaluation Caveats

Evaluation Caveats — Read Before Interpreting Results:

Missing absolute baselines: The paper reports percentage improvements over baselines but does not consistently report absolute baseline scores for all three tasks. Without absolute values, the magnitude of improvements cannot be independently assessed [PAPER, Tables 1–2].
Unreported seed/run counts: The paper does not specify how many independent runs were conducted per task, nor whether the reported results reflect single best-of-campaign outcomes or aggregated statistics. The stochastic nature of both LLM generation and experimental execution means single-run results carry high variance.
Reviewer circularity: The automated review scores (Table 2) use DeepReviewer, developed by the same team as part of their CycleResearcher work. While validated against human judgments, systematic bias toward the team's output style cannot be excluded [PAPER, §4].
Unmatched compute budgets: DeepScientist used 16 H800 GPUs for ~1 month per task. Comparisons with AI Scientist, AI Researcher, and other systems do not control for compute budget [PAPER, §4].
Human verification limited: Three human experts verified outputs, but the paper does not detail the verification protocol, how many findings each expert reviewed, or their selection criteria.
No independent reproduction: As of April 2026, no independent team has attempted to reproduce DeepScientist's claimed SOTA results.

39.4.1 SOTA-Surpassing Results

Benchmark / Task	Task Description	Baseline Score	System Score	Δ	Seeds / Runs	Compute Budget	Evaluation Protocol	Evidence Source
Agent Failure Attribution	Attribute agent failures to causal actions	— (not reported as absolute)	— (not reported as absolute)	+183.7% accuracy (paper-reported)	— (not reported)	~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs	— (not reported in detail)	[PAPER, §5, Table 1]
LLM Inference Acceleration	Increase tokens/s without quality loss	— (not reported as absolute)	— (not reported as absolute)	+1.9% tokens/s (paper-reported)	— (not reported)	~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs	— (not reported in detail)	[PAPER, §5, Table 1]
AI Text Detection	Distinguish AI-generated from human text	— (not reported as absolute)	— (not reported as absolute)	+7.9% AUROC, −65.5% latency (paper-reported)	— (not reported)	~1 month, 16×H800 GPUs, Gemini-2.5-Pro + Claude-4-Opus APIs	— (not reported in detail)	[PAPER, §5, Table 1]

The discovered methods are A2P (Abduction-Action-Prediction), ACRA, and PA-Detect. The paper describes A2P as a three-phase reasoning framework (abduction → action analysis → prediction) and PA-Detect as achieving a Pareto improvement on the accuracy–latency frontier [PAPER, §5]. ACRA's internal mechanism is not described in detail beyond the name.

39.4.2 Automated Review Evaluation (DeepReviewer)

System	Papers	Soundness	Presentation	Contribution	Rating	Accept Rate	Evidence
AI Scientist	10	2.08	1.80	1.75	3.35	0%	[PAPER, Table 2]
AI Researcher	7	1.75	1.46	1.57	2.57	0%	[PAPER, Table 2]
AI Scientist v2	3	1.67	1.50	1.50	2.33	0%	[PAPER, Table 2]
CycleResearcher	6	2.25	1.75	2.13	3.75	0%	[PAPER, Table 2]
Zochi	2	2.38	2.38	2.25	4.63	0%	[PAPER, Table 2]
DeepScientist	5	2.90	2.90	2.90	5.90	60%	[PAPER, Table 2]

DeepScientist is the only system achieving non-zero acceptance under DeepReviewer (3 of 5 papers). The reviewer circularity risk described in §39.4.0 applies directly here.

39.4.3 Human Expert Review

To address automated reviewer bias, the paper reports human expert evaluation [PAPER, §4.2]:

Metric	DeepScientist Papers	ICLR 2025 Human Papers	Evidence
Average rating	5.00	5.08	[PAPER, §4.2]
Reviewers per paper	3	3–4	[PAPER, §4.2]
Inter-annotator agreement (Krippendorff α)	0.739	— (not reported)	[PAPER, §4.2]

With only 5 papers and 3 reviewers each, the 5.00 rating cannot be statistically distinguished from the 5.08 ICLR baseline. Krippendorff's α = 0.739 exceeds the 0.667 threshold that the paper cites for "substantial agreement" [PAPER, §4.2]; note that Krippendorff's own guidance treats 0.667 as the floor for tentative conclusions and α ≥ 0.80 as the bar for reliable agreement. Moreover, at ICLR a rating of 5 corresponds roughly to "marginally below acceptance threshold," meaning these papers are competitive with but not clearly above venue-quality human work.

39.4.4 Innovation Funnel

Stage	Volume	Conversion Rate	Evidence
Idea Findings generated	~5,000	—	[PAPER, §5]
Implement Findings (selected by UCB)	~1,100	~22% of ideas	[PAPER, §5]
Progress Findings (innovations)	21	~1.9% of implementations	[PAPER, §5]
SOTA-surpassing methods	3	~14.3% of innovations	[PAPER, §5]
Ideas → SOTA	—	~0.06%	[PAPER; computed from reported numbers]
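
The conversion rates in the table follow directly from the four reported volumes. The short check below recomputes them; the only inputs are the approximate counts quoted above, and the helper name `pct` is an assumption of this example.

```python
# Approximate volumes as quoted in the funnel table (paper §5).
ideas, implemented, progress, sota = 5000, 1100, 21, 3


def pct(numerator: int, denominator: int) -> float:
    """Conversion rate as a percentage, rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


funnel = {
    "ideas -> implemented": pct(implemented, ideas),
    "implemented -> progress": pct(progress, implemented),
    "progress -> SOTA": pct(sota, progress),
    "ideas -> progress": pct(progress, ideas),
    "ideas -> SOTA": pct(sota, ideas),
}
for stage, rate in funnel.items():
    print(f"{stage}: {rate}%")
```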

39.5 Implementation & Cost

39.5.1 Paper-Reported Infrastructure

Component	Specification	Evidence
GPU hardware	2 servers × 8 NVIDIA H800 (80 GB) = 16 GPUs total	[PAPER, §4]
GPU hours consumed	20,000+ total across all tasks	[PAPER, §4]
Campaign duration	~1 month per task	[PAPER, §4]
Parallelism	16 GPU instances, each running independent exploration thread	[PAPER, §4]
Reasoning model	Gemini-2.5-Pro	[PAPER, §3]
Coding model	Claude-4-Opus	[PAPER, §3]
Human experts	3 (for output verification)	[PAPER, §4]
License	CC BY-NC-SA 4.0	[PAPER, §1]

39.5.2 Cost Estimates (Author-Derived)

Provenance Note: The following cost estimates are not reported in the paper. They are survey-author calculations based on publicly available cloud pricing as of April 2026. Actual costs may differ significantly depending on institutional pricing, API negotiation, and usage patterns.

Component	Quantity	Estimated Unit Cost	Estimated Total	Provenance
H800 GPU hours	20,000+	$2–4/hr (cloud)	$40K–80K	[INFERRED from public pricing]
Gemini-2.5-Pro API	Not reported	$0.00125–0.01/1K tokens	$5K–50K (rough range)	[INFERRED]
Claude-4-Opus API	Not reported	$0.015–0.075/1K tokens	$10K–100K (rough range)	[INFERRED]
Human expert time	3 experts × est. 40 hrs	$100–200/hr	$12K–24K	[INFERRED]
Estimated total per task			$67K–254K	[INFERRED]

[INFERRED] Cost Context: The estimated $67K–254K per task is comparable to a postdoctoral researcher-year in many institutions. However, DeepScientist completes a task in ~1 month rather than 1–5 years, and the architecture is parallelizable — adding GPUs increases throughput approximately linearly (per the paper's claim that instances run independently with shared memory). Whether this scaling holds beyond 16 GPUs is not demonstrated.
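
The range in the cost table is a straight sum of the component ranges, as the sketch below recomputes. All inputs are the survey-author estimates from the table, not paper-reported figures. Note also that the 20,000+ GPU-hour input is listed in §39.5.1 as a total across all tasks, so attributing the full sum to a single task is a conservative reading.

```python
# (low, high) estimates in thousands of USD, copied from the cost table.
components = {
    "H800 GPU hours": (40, 80),
    "Gemini-2.5-Pro API": (5, 50),
    "Claude-4-Opus API": (10, 100),
    "Human expert time": (12, 24),
}
low_total = sum(low for low, _ in components.values())
high_total = sum(high for _, high in components.values())

# Two rows can be rebuilt from their stated quantities and unit costs.
gpu_range = (20_000 * 2 / 1000, 20_000 * 4 / 1000)          # 20,000 h at $2-4/h
expert_range = (3 * 40 * 100 / 1000, 3 * 40 * 200 / 1000)   # 3 x 40 h at $100-200/h

print(f"total: ${low_total}K-${high_total}K")
print(f"GPU: {gpu_range}, experts: {expert_range}")
```

The two API rows dominate the width of the range, which reflects that token volumes are not reported rather than any measured variability.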

39.6 Reproducibility Checklist

Requirement	Status	Notes	Evidence
Code publicly released	Partial	Repository exists at github.com/ResearAI/DeepScientist; completeness of released code not independently verified	[PAPER, §1]
Config files available	— (not independently verified)	Paper does not detail config schema; repo may contain examples	[PAPER]
Pretrained weights / checkpoints	N/A	System uses API-based LLMs (Gemini, Claude); no custom weights	[PAPER, §3]
Documented entry point or run command	— (not independently verified)	Not described in paper; may exist in repo README	[PAPER]
Compute requirements stated	✓	16×H800 GPUs, 20,000+ GPU hours in total, ~1 month per task	[PAPER, §4]
Seeds and run counts reported	✗	Neither seed values nor number of independent runs reported	[PAPER]
Independent reproduction attempted	✗	No independent reproduction documented as of April 2026	—
Findings Memory dumps released	— (not independently verified)	Would enable trajectory analysis; paper does not confirm release	[PAPER]
Discovered method code (A2P, ACRA, PA-Detect)	Partial	Paper describes A2P and PA-Detect at algorithmic level; code may or may not be in repo	[PAPER, §5]
Experimental logs released	— (not independently verified)	Not mentioned in paper	[PAPER]

The primary reproducibility barrier is economic, not technical. The paper's methodology is described clearly enough to implement, but actually running DeepScientist requires compute resources (~$67K–254K per task) that few academic labs can dedicate to reproduction. A scaled-down validation (fewer GPUs, shorter campaigns, cheaper models) would test the architectural concepts but could not reproduce the claimed SOTA results.

[INFERRED] Suggested Reduced-Scale Validation: A partial reproduction targeting concept validation (BO loop correctness, memory accumulation, UCB selection behavior) could be conducted with 1–2 GPUs over 1–2 weeks using smaller models (e.g., Gemini Flash, Claude Sonnet) for an estimated $1K–10K. This would not replicate SOTA-surpassing outcomes but could verify that the architectural mechanisms function as described.

39.7 Threats to Validity

This section consolidates all identified threats to the validity of DeepScientist's claims, organized by category.

39.7.1 Reviewer Circularity

The automated review scores (Table 2) are produced by DeepReviewer, developed by the same research group as part of CycleResearcher [PAPER, §4]. While DeepReviewer was validated against human judgments in the CycleResearcher paper, the possibility of systematic bias toward the authoring team's output style cannot be excluded. The 60% accept rate (vs. 0% for all competitors) is a dramatic gap that warrants skepticism until confirmed by an independent review model. The paper partially addresses this with human expert review (§39.4.3), but the sample size is small (5 papers, 3 reviewers).

39.7.2 Compute-Budget Mismatch

DeepScientist consumed 20,000+ GPU hours on H800 hardware in total across its three tasks (§39.5.1). Comparisons with AI Scientist (~10 GPU hours per paper), AI Researcher, and other systems do not control for this gap, which is roughly three orders of magnitude when total campaign compute is set against a single AI Scientist paper [PAPER, §4–5]. It is an open question whether the other systems would achieve comparable results given equivalent compute.
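
A back-of-envelope check helps put the compute figures quoted in this chapter on a common footing. The sketch assumes a 30-day month and full utilization of all 16 GPUs, neither of which the paper states; under those assumptions a single task cannot exceed about 11,500 GPU hours, so the 20,000+ figure is best read as the cross-task total that §39.5.1 lists.

```python
GPUS = 16
DAYS_PER_CAMPAIGN = 30   # assumption: "~1 month" taken as 30 days
HOURS_PER_DAY = 24

# Upper bound on GPU hours if every GPU runs for the whole campaign.
max_gpu_hours_per_task = GPUS * DAYS_PER_CAMPAIGN * HOURS_PER_DAY
max_gpu_hours_three_tasks = 3 * max_gpu_hours_per_task

reported_total = 20_000          # "20,000+" from the infrastructure table
ai_scientist_per_paper = 10      # "~10 GPU hours per paper" as quoted above

ratio_total_vs_paper = reported_total / ai_scientist_per_paper
implied_utilization = reported_total / max_gpu_hours_three_tasks

print(max_gpu_hours_per_task, max_gpu_hours_three_tasks)
print(ratio_total_vs_paper, round(implied_utilization, 2))
```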

39.7.3 Missing Absolute Baselines and Evaluation Protocols

The three SOTA-surpassing claims are reported as relative improvements (+183.7%, +1.9%, +7.9%) without consistently reporting the absolute baseline scores, the evaluation datasets, the exact evaluation protocols, or the statistical significance of the improvements [PAPER, §5]. The +183.7% accuracy gain for A2P is particularly difficult to interpret without knowing the absolute baseline — a 183.7% improvement from 10% accuracy (to ~28%) is very different from 183.7% from 30% (to ~85%).
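
The arithmetic behind that comparison is worth spelling out, because it also bounds the unknown baseline. The sketch below applies the reported +183.7% gain to the two hypothetical baselines used above and derives the largest baseline for which the improved accuracy would still be at most 100%. The baseline values are illustrative, not reported numbers.

```python
def apply_relative_gain(baseline: float, gain_pct: float) -> float:
    """New score after a relative improvement of gain_pct percent."""
    return baseline * (1.0 + gain_pct / 100.0)


GAIN = 183.7  # reported relative accuracy improvement for A2P, in percent

# Hypothetical baselines, matching the two cases discussed in the text.
improved = {b: apply_relative_gain(b, GAIN) for b in (0.10, 0.30)}

# Accuracy cannot exceed 1.0, so the baseline must satisfy b * 2.837 <= 1.
max_feasible_baseline = 1.0 / (1.0 + GAIN / 100.0)

for b, new in improved.items():
    print(f"baseline {b:.0%} -> {new:.1%}")
print(f"largest feasible baseline: {max_feasible_baseline:.1%}")
```

Whatever the true baseline is, the reported gain implies it was below roughly 35% accuracy, which says the prior state of the art on this task was far from saturated.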

39.7.4 Human-in-the-Loop Verification Gaps

Three human experts verified outputs and filtered hallucinations [PAPER, §4]. The paper does not specify: (a) the verification protocol, (b) how many findings each expert reviewed, (c) whether experts could intervene during campaigns to guide exploration, (d) the criteria for accepting vs. rejecting a Progress Finding, or (e) whether expert filtering was applied before or after the SOTA claims were formulated. The boundary between autonomous discovery and human-guided selection is therefore not fully transparent.

39.7.5 Absence of Independent Reproduction

As of April 2026, no independent team has reproduced DeepScientist's results. The extreme compute cost (~$200K+ across three tasks) makes independent reproduction prohibitively expensive for most groups. Until independent reproduction occurs, the SOTA claims rest solely on the authors' self-evaluation.

39.7.6 Surrogate Calibration

The LLM Reviewer produces valuation vectors $\mathbf{V}$ that are used for UCB selection, but no analysis of surrogate calibration is reported [PAPER]. Key questions: How well does $v_u$ correlate with actual experimental improvement? Does $v_e$ accurately reflect novelty? Does the surrogate improve over the course of a campaign? Without calibration data, it is unclear whether the BO formulation is contributing meaningfully versus simpler selection heuristics.
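
One simple way to answer the first of these questions, if campaign logs were available, would be a rank correlation between predicted utility and observed improvement. The sketch below implements Spearman's coefficient in plain Python and runs it on made-up numbers; the data, the variable names, and the choice of Spearman over other calibration measures are all assumptions of this example, not an analysis the paper performs.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: surrogate utility vs. measured gain over baseline.
predicted_v_u = [0.9, 0.7, 0.8, 0.3, 0.5]
observed_gain = [0.12, 0.08, 0.02, -0.05, 0.01]
rho = spearman(predicted_v_u, observed_gain)
print(round(rho, 3))
```

A coefficient near zero on real logs would suggest the acquisition step is adding little beyond random choice; tracking it over time would address the question of whether the surrogate improves during a campaign.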

39.7.7 Single-Campaign Results per Task

Each task appears to have been run as a single month-long campaign [PAPER, §4]. With no replicate campaigns, the variance of outcomes across campaigns is unknown. A different random seed could yield a substantially different set of innovations — or none at all.

39.8 Limitations & Open Questions

39.8.1 Paper-Acknowledged Limitations

The paper explicitly acknowledges the efficiency problem, framing it as the next research frontier [PAPER, §6]:

"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"

The 0.06% conversion rate from ideas to SOTA methods, while demonstrating that autonomous discovery is possible, also quantifies its current inefficiency.

Limitation	Severity	Evidence
Extreme compute cost (20,000+ GPU hours in total; 16 GPUs for ~1 month per task)	High	[PAPER, §4]
API dependency on frontier models (Gemini-2.5-Pro, Claude-4-Opus)	High	[PAPER, §3]
Human expert requirement for verification	Medium	[PAPER, §4]
Low conversion rate (0.42% ideas → innovations)	Medium	[PAPER, §5]
Month-long campaign duration	Medium	[PAPER, §4]
No cross-campaign learning mechanism described	Medium	[PAPER; mechanism not mentioned]

39.8.2 Open Questions

[INFERRED] Open Research Questions:

Surrogate vs. random selection: How much does the UCB acquisition function contribute compared to random selection of hypotheses? An ablation removing UCB (replacing it with random or round-robin selection) would isolate the value of the BO formulation.
Cross-campaign transfer: Can Findings Memory from one task seed another? Meta-strategies that generalize across tasks (e.g., "ablation-first approaches are reliable") could accelerate future campaigns.
Scaling laws: The paper claims near-linear scaling with GPUs but provides data at only one operating point (16 GPUs). Do innovation yields scale linearly to 64, 256, or 1,024 GPUs?
Open-weight models: Replacing Gemini and Claude with open-weight alternatives would remove the API dependency and enable true reproducibility, but at what quality cost?
Domain generalization: All three demonstrated tasks are frontier AI problems. Can the BO formulation extend to domains where the LLM has less prior knowledge (chemistry, materials science, biology)?
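
The first question above lends itself to a small harness. The sketch below compares surrogate-guided selection against uniformly random selection under a fixed implementation budget, using a synthetic pool in which the surrogate score is the true value plus Gaussian noise. It is a deliberate simplification: selection is one-shot rather than sequential, the exploration term is omitted, and every name, distribution, and parameter is an assumption of this example rather than the paper's protocol.

```python
import random
from typing import List, Tuple

Pair = Tuple[float, float]  # (surrogate_score, true_value)


def make_pool(rng: random.Random, n: int, noise: float) -> List[Pair]:
    """Synthetic ideas; noise controls how informative the surrogate is."""
    pool = []
    for _ in range(n):
        true_value = rng.random()
        pool.append((true_value + rng.gauss(0.0, noise), true_value))
    return pool


def select_top(pool: List[Pair], budget: int) -> List[Pair]:
    """Guided arm: implement the `budget` highest-scoring ideas."""
    return sorted(pool, key=lambda p: p[0], reverse=True)[:budget]


def select_random(pool: List[Pair], budget: int, rng: random.Random) -> List[Pair]:
    """Ablation arm: implement `budget` ideas chosen uniformly at random."""
    return rng.sample(pool, budget)


def mean_true_value(selected: List[Pair]) -> float:
    """Average ground-truth value of the implemented ideas."""
    return sum(value for _, value in selected) / len(selected)


rng = random.Random(0)
pool = make_pool(rng, n=500, noise=0.0)  # noise=0: a perfectly informative surrogate
guided = mean_true_value(select_top(pool, budget=50))
ablated = mean_true_value(select_random(pool, budget=50, rng=rng))
print(round(guided, 3), round(ablated, 3))
```

With zero noise the guided arm is optimal by construction; raising `noise` shrinks the gap, which is the quantity a real ablation on campaign logs would need to estimate.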

39.9 Survey Positioning

39.9.1 Comparison with Related Autonomous Research Systems

System	Origin	Search Paradigm	Primary Output	SOTA Claims	Compute Scale	Memory	Budget Match
AI Scientist (Sakana, 2024)	Industry	LLM ideation, heuristic scoring	Papers	No	~10 GPU-hrs/paper	Stateless	Not matched. Budgets differ by 2–3 orders of magnitude from DeepScientist.
AI Scientist v2 (Sakana, 2025)	Industry	Open-ended multi-paper campaigns	Papers	No	Not reported in detail	Campaign-level	Not matched (compute not reported in detail)
CycleResearcher (Westlake, 2024)	Academic	Review-driven iteration	Papers	No	Not reported in detail	Session-level	Not matched (compute not reported in detail)
Zochi (2025)	Unknown	High-quality generation	Papers	No	Not reported in detail	Not described	Not matched (compute not reported in detail)
DeepScientist (Westlake, 2025)	Academic	UCB over LLM surrogate (BO)	Methods + Papers	Yes (3 tasks, author-reported)	20,000+ GPU-hrs	3-level Findings Memory	N/A (reference system)

39.9.2 Comparison with Evolutionary Code Discovery Systems

DeepScientist's BO-based approach contrasts with evolutionary approaches to code discovery:

Dimension	FunSearch / AlphaEvolve	DeepScientist
Search paradigm	Evolutionary (population-based)	Bayesian Optimization (surrogate-guided)
What is maintained	Population of programs	Memory of findings (hypotheses + results)
Selection mechanism	Fitness-proportional / MAP-Elites	UCB acquisition function
Exploration signal	Mutation stochasticity / behavior diversity	Explicit exploration score $v_e$
Negative results	Discarded (low-fitness individuals die)	Retained in memory for future guidance
Theoretical basis	Evolutionary dynamics	Bayesian optimization / bandits
Output granularity	Functions / programs	Full scientific methods (code + experiments + papers)
Campaign scope	Hours to days	~1 month

The retention of negative results is a distinctive design choice. Evolutionary systems discard low-fitness individuals; DeepScientist preserves failed implementations as informative data for future hypothesis generation. Whether this provides a measurable advantage over evolutionary forgetting is not empirically tested.

39.9.3 Positioning within Autonomous Research Trajectory

DeepScientist positions itself as the culmination of an academic trajectory from paper refinement to method discovery [PAPER, §2]:

CycleResearcher (2024): Review-driven iterative refinement of research papers. Output: better-written papers.
DeepScientist (2025): BO-guided discovery of research methods. Output: working implementations that outperform baselines.

The paper claims that DeepScientist is the first autonomous system to demonstrate verified SOTA-surpassing results on frontier AI tasks [PAPER, §1]. This claim is bounded to the specific comparison class of "autonomous research systems that produce implemented methods evaluated on standard benchmarks," and is author-reported rather than independently verified. Among the systems surveyed in this book, no other autonomous research system has made comparable verified-SOTA claims, though AlphaEvolve has demonstrated improvements on specific mathematical and coding tasks in a different problem class (program synthesis vs. scientific method discovery).

39.10 Summary

Key Takeaway: DeepScientist provides a mathematically principled framework (Bayesian Optimization with an LLM surrogate and UCB acquisition) for autonomous scientific discovery, and the paper reports three methods that surpassed human state-of-the-art on frontier AI tasks — a claim that, if independently reproduced, would mark a watershed in autonomous research.

Main Contribution: The formalization of discovery as BO over hypothesis space, with a three-component valuation vector (utility, quality, exploration) that enables principled exploration–exploitation balance. The Findings Memory, which retains negative results alongside successes, provides a cumulative knowledge base that scales beyond context-window limits via retrieval. The innovation funnel (5,000 → 1,100 → 21 → 3) quantifies both the possibility and the current inefficiency of autonomous scientific discovery.

What Researchers Should Know: DeepScientist's claimed results are compelling but carry significant caveats: reviewer circularity (DeepReviewer is from the same team), absence of absolute baselines, unreported seed/run counts, extreme compute requirements (20,000+ GPU-hours in total on 16 H800 GPUs), and no independent reproduction. The system requires human expert verification (3 experts), making it semi-autonomous rather than fully autonomous. The true scientific contribution may be less the specific SOTA results and more two things: the demonstration that principled optimization over a structured hypothesis memory, at sufficient scale, can produce genuine methodological innovations, and the honest quantification of how inefficient that process currently is (0.06% conversion rate from ideas to SOTA methods).

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}
