DeepScientist
Bayesian Optimization-guided autonomous scientific discovery system that surpassed human state-of-the-art on three frontier AI tasks through month-long continuous GPU campaigns.
- Organization: Westlake University (Engineering School)
- Published: September 30, 2025
- Type: paper (arXiv:2509.26603)
- Report Type: PhD-Level Technical Analysis
- Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
- arXiv: 2509.26603
- Published: September 30, 2025
- Code: github.com/ResearAI/DeepScientist
- Project Page: ai-researcher.net
- License: CC BY-NC-SA 4.0
- Tagline: Progressive scientific discovery through Bayesian Optimization over an LLM-driven hypothesis–implementation–analysis cycle
- Default models: Gemini-2.5-Pro (core logic / reviewer / strategist), Claude-4-Opus (code generation / implementation agent)
- Input modes: Fully autonomous month-long research campaigns on frontier AI tasks, seeded with a codebase repository and initial findings memory
Naming and Lineage
"DeepScientist" signals two things: the "Deep" prefix invokes both deep learning and the depth of the system's search (thousands of hypotheses, hundreds of implementations, month-long campaigns), while "Scientist" positions the system as an autonomous researcher rather than merely an assistant or copilot. The name also places the system in deliberate contrast to Sakana AI's "AI Scientist" — claiming a more rigorous, results-driven approach to the same aspiration.
The system is a direct evolution of the CycleResearcher line of work from the same lead author (Yixuan Weng). CycleResearcher introduced the idea of review-driven iterative refinement of AI-generated research papers. DeepScientist extends this from paper refinement to full-stack scientific discovery — from hypothesis generation through experimental validation to paper synthesis.
Lineage Chain
CycleResearcher (Weng et al., 2024)
│ Review-driven iterative paper refinement
│ Introduced DeepReviewer for automated evaluation
│
└── DeepScientist (Weng et al., 2025) ← this system
Full Bayesian Optimization formulation
Hypothesis → Implementation → Analysis cycle
Findings Memory as cumulative knowledge base
Surpassed human SOTA on 3 frontier tasks
The evolution from CycleResearcher to DeepScientist represents a fundamental shift: CycleResearcher operated on papers (text artifacts), while DeepScientist operates on methods (code + experiments + results). The review model from CycleResearcher becomes the surrogate function in DeepScientist's Bayesian Optimization loop.
Unique Position in the Ecosystem
DeepScientist is the first autonomous research system to demonstrate verified state-of-the-art surpassing results on frontier AI tasks. While other systems (AI Scientist, AI Researcher, CycleResearcher) generate research papers that are then evaluated by automated reviewers, DeepScientist produces working implementations that measurably outperform existing methods. This is the critical distinction: the output is not a paper that scores well on review metrics — it is a method that scores well on task metrics.
Ecosystem Positioning
│
├── AI Scientist (Sakana) — breadth: generates ML experiment papers
├── AI Researcher (Alibaba) — breadth: generates research papers from ideas
├── CycleResearcher (Westlake) — refinement: iterative paper improvement via review
├── AI Scientist v2 (Sakana) — evolution: open-ended, multi-paper campaigns
├── Zochi — quality: high-quality paper generation
└── DeepScientist (Westlake) — depth + results: BO-guided discovery with SOTA outcomes ← this system
2 Authors and Team
| Author | Role (Inferred) | Note |
|---|---|---|
| Yixuan Weng* | Co-lead, system architect | Same lead author as CycleResearcher; * indicates equal contribution |
| Minjun Zhu* | Co-lead, implementation lead | * indicates equal contribution |
| Qiujie Xie | Core contributor | |
| Qiyao Sun | Core contributor | |
| Zhen Lin | Core contributor | |
| Sifan Liu | Core contributor | |
| Yue Zhang† | Corresponding author, PI | † indicates corresponding; senior researcher at Westlake |
BibTeX Citation
@article{weng2025deepscientist,
title = {DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively},
author = {Weng, Yixuan and Zhu, Minjun and Xie, Qiujie and Sun, Qiyao
and Lin, Zhen and Liu, Sifan and Zhang, Yue},
journal = {arXiv preprint arXiv:2509.26603},
year = {2025}
}
Team composition: Seven authors from Westlake University's Engineering School. The equal-contribution designation for the first two authors suggests a system architect / implementation lead split — Weng bringing the conceptual framework from CycleResearcher, Zhu leading the engineering implementation. Yue Zhang as corresponding author and PI indicates this is a focused research lab effort with strong senior supervision.
Institutional context: Westlake University is a private research-intensive university in Hangzhou, China, founded in 2018 with an explicit mandate for frontier research. The Engineering School's NLP/AI group (under Yue Zhang) has been productive in the AI-for-science space, with CycleResearcher and DeepScientist representing a coherent multi-year research program rather than a one-off contribution.
Human supervision: The paper acknowledges 3 human experts who verified outputs and filtered hallucinations. This is significant — DeepScientist is not fully autonomous in the way Karpathy's autoresearch is. Human experts serve as a final filter on the discovery pipeline, ensuring that claimed innovations are genuine. The paper is transparent about this, which strengthens its credibility.
3 Core Contribution
Key Novelty: DeepScientist formalizes autonomous scientific discovery as a Bayesian Optimization problem, where the search space is all possible scientific methods, the objective function is the true value of a method, and an LLM Reviewer serves as the surrogate model — enabling UCB-guided exploration/exploitation of hypothesis space that produced 21 genuine scientific innovations from ~5,000 generated ideas, including three methods that surpassed human state-of-the-art.
The Bayesian Optimization Formulation
This is DeepScientist's central intellectual contribution — not just a system, but a mathematical framework for autonomous discovery. The formulation:
Objective:
I* = argmax_{I ∈ 𝓘} f(I)
Where:
- 𝓘 is the space of all possible scientific methods (hypotheses, algorithms, implementations)
- f(I) is the true value function — the actual performance of method I when implemented and evaluated
- I* is the globally optimal method (unknown, approximated through search)
The Problem: Evaluating f(I) is extremely expensive. Each evaluation requires:
1. Generating a full implementation of method I
2. Running experiments (potentially hours of GPU time)
3. Analyzing results against baselines
This makes exhaustive search impossible. Bayesian Optimization provides the principled framework for deciding which hypothesis to evaluate next.
The Surrogate Model: DeepScientist uses an LLM Reviewer as the surrogate model that approximates f cheaply. For each candidate hypothesis, the reviewer produces a valuation vector:
V = <v_u, v_q, v_e>
Where:
- v_u = utility value — estimated practical impact and performance improvement
- v_q = quality value — estimated methodological soundness and rigor
- v_e = exploration value — estimated novelty and distance from previously explored regions
This three-dimensional valuation is a significant design choice. Traditional Bayesian Optimization uses a scalar objective; DeepScientist explicitly decomposes the surrogate into orthogonal axes that capture different aspects of scientific value.
The Acquisition Function: UCB (Upper Confidence Bound) balances exploitation of promising hypotheses with exploration of novel directions:
I_{t+1} = argmax_I ( w_u · v_u(I) + w_q · v_q(I) + κ · v_e(I) )
Where:
- w_u = weight on utility (exploitation signal)
- w_q = weight on quality (exploitation signal)
- κ = exploration coefficient (controls exploration–exploitation tradeoff)
- v_e = exploration value (pure exploration signal)
The UCB formulation has a well-known theoretical foundation in the multi-armed bandit literature. By casting hypothesis selection as a bandit problem with an LLM-estimated reward model, DeepScientist inherits the regret guarantees and convergence properties of UCB — at least in principle, modulo the accuracy of the LLM surrogate.
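The acquisition step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the weight values `w_u`, `w_q`, and `kappa` are illustrative placeholders, not numbers reported by the authors:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    v_u: float  # utility value from the LLM reviewer
    v_q: float  # quality value
    v_e: float  # exploration (novelty) value

def ucb_score(h: Hypothesis, w_u: float = 1.0, w_q: float = 0.5,
              kappa: float = 0.8) -> float:
    """UCB-style acquisition: weighted exploitation terms plus an exploration bonus."""
    return w_u * h.v_u + w_q * h.v_q + kappa * h.v_e

def select_next(candidates: list[Hypothesis], **weights) -> Hypothesis:
    """Pick the hypothesis with the highest acquisition score to implement next."""
    return max(candidates, key=lambda h: ucb_score(h, **weights))

candidates = [
    Hypothesis("incremental tweak", v_u=0.7, v_q=0.8, v_e=0.1),
    Hypothesis("novel cross-domain idea", v_u=0.5, v_q=0.6, v_e=0.9),
]
best = select_next(candidates)
```

With these illustrative weights the higher-novelty hypothesis wins despite its lower utility estimate, which is exactly the behavior κ is meant to control.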
Why This Formulation Matters
Most autonomous research systems use ad-hoc methods for deciding what to try next:
| System | Hypothesis Selection Strategy | Theoretical Foundation |
|---|---|---|
| AI Scientist | LLM generates ideas, scores by novelty/feasibility | Heuristic |
| AlphaEvolve | MAP-Elites archive + LLM mutation | Evolutionary (quality-diversity) |
| FunSearch | LLM mutation + island model + best-shot sampling | Evolutionary |
| Autoresearch (Karpathy) | LLM decides freely, greedy hill-climbing | None (LLM intuition) |
| OpenEvolve | Multi-island evolution + bandit-based selection | Partially principled |
| DeepScientist | UCB acquisition over LLM surrogate | Bayesian Optimization |
DeepScientist is unique in providing a mathematically principled framework for the explore/exploit tradeoff in scientific discovery. The UCB acquisition function is not just a heuristic — it is an instance of a well-studied algorithm with known optimality properties.
The Five Differentiating Claims
- Mathematical formulation: Scientific discovery as Bayesian Optimization, not ad-hoc search
- Surrogate model: LLM Reviewer as cheap approximation to expensive evaluation
- UCB acquisition: Principled exploration/exploitation balance, not greedy or random
- Hierarchical memory: Three-level Findings Memory that accumulates knowledge across campaigns
- Verified SOTA results: Three methods that measurably surpassed human state-of-the-art
Comparison to Related Systems
| System | Origin | Paradigm | Output | SOTA Claims | Scale |
|---|---|---|---|---|---|
| AI Scientist (Sakana, 2024) | Industry | LLM paper generation | Papers | No | ~10 papers |
| AI Researcher (Alibaba, 2024) | Industry | Paper generation pipeline | Papers | No | ~7 papers |
| CycleResearcher (Westlake, 2024) | Academic | Review-driven refinement | Papers | No | ~6 papers |
| AI Scientist v2 (Sakana, 2025) | Industry | Open-ended campaigns | Papers | No | ~3 papers |
| Zochi (2025) | Unknown | High-quality generation | Papers | No | ~2 papers |
| DeepScientist (Westlake, 2025) | Academic | Bayesian Optimization | Methods + Papers | Yes (3 tasks) | ~5,000 ideas → 21 innovations |
The scale differential is striking. While other systems produce single-digit papers, DeepScientist generates thousands of hypotheses and hundreds of implementations. The funnel ratio (~5,000 → ~1,100 → 21) reveals the true difficulty of autonomous discovery: only 0.42% of generated ideas lead to genuine innovations. This is a profoundly important empirical finding about the nature of LLM-driven research.
4 Supported Solutions
Primary Domain: Frontier AI Research Tasks
DeepScientist targets frontier AI research problems where: 1. A codebase repository exists with a baseline implementation 2. Quantitative evaluation metrics are well-defined 3. The search space of possible improvements is large 4. Experimental validation is computationally feasible (hours, not weeks per trial)
The Three Demonstrated Tasks
| Task | Domain | Baseline | Metric | DeepScientist Method | Improvement |
|---|---|---|---|---|---|
| Agent Failure Attribution | AI agents | Existing attribution methods | Accuracy | A2P (Abduction-Action-Prediction) | +183.7% |
| LLM Inference Acceleration | Systems/ML | Standard inference pipelines | Tokens/second | ACRA | +1.9% |
| AI Text Detection | NLP/Security | Detection classifiers | AUROC + Latency | PA-Detect | +7.9% AUROC, -65.5% latency (~2.9× faster) |
Task 1: Agent Failure Attribution
Problem: Given a trace of an AI agent's actions and an observed failure, attribute the failure to the specific action(s) that caused it.
DeepScientist's Discovery — A2P (Abduction-Action-Prediction): The A2P method uses a three-phase reasoning approach: 1. Abduction — infer possible causes of the failure from the observed outcome 2. Action analysis — evaluate each action in the trace against the abduced causes 3. Prediction — predict which action-cause pair best explains the failure
The +183.7% improvement over baseline is by far the largest gain among the three tasks. This suggests the baseline was particularly weak or the problem was particularly amenable to LLM-style reasoning improvements.
Task 2: LLM Inference Acceleration
Problem: Accelerate the inference speed of large language models without degrading output quality.
DeepScientist's Discovery — ACRA: The +1.9% improvement in tokens/second is modest in absolute terms but significant in context — inference optimization is a heavily-studied area where marginal gains are hard-won. The method name suggests it involves some form of adaptive or conditional computation routing.
Task 3: AI Text Detection
Problem: Distinguish AI-generated text from human-written text.
DeepScientist's Discovery — PA-Detect: This is arguably the most impressive result across two dimensions: - +7.9% AUROC — substantial improvement in detection accuracy - -65.5% latency — simultaneously making detection nearly 3× faster
Achieving accuracy and speed improvements simultaneously is rare — most optimizations trade one for the other. PA-Detect's dual improvement suggests a fundamentally better approach rather than incremental tuning.
Solution Space Characterization
For each task, DeepScientist operates within a defined solution space:
Per-Task Solution Space
│
├── Repository-Level Modifications
│ ├── Algorithm changes (new methods, modified pipelines)
│ ├── Architecture changes (model structure, layers, attention)
│ ├── Training modifications (loss functions, schedules, augmentation)
│ ├── Inference optimizations (caching, batching, pruning)
│ └── Evaluation protocol changes (metrics, preprocessing)
│
├── Hypothesis-Level Innovations
│ ├── Novel combinations of existing techniques
│ ├── Theoretical insights applied to implementation
│ ├── Cross-domain transfer of methods
│ └── Ablation-discovered simplifications
│
└── Constraints
├── Must produce runnable code (not just ideas)
├── Must improve quantitative metrics vs. baseline
├── Must be validated through controlled experiments
└── Must survive human expert review
The Innovation Funnel
DeepScientist's most revealing metric is the conversion rate through its funnel:
~5,000 Idea Findings generated
│
│ UCB acquisition selects most promising
│ (~22% selection rate)
│
~1,100 Implement Findings validated
│
│ Only those surpassing baseline promoted
│ (~1.9% of implementations succeed)
│
21 Progress Findings (genuine innovations)
│
│ Only ~0.42% of original ideas lead to innovation
│
3 SOTA-surpassing methods published
This funnel ratio is itself a scientific contribution. It quantifies the difficulty of autonomous discovery: even with a principled search strategy and powerful LLM models, the vast majority of hypotheses fail. The 0.42% success rate suggests that scientific innovation — even in well-defined AI tasks — remains extremely hard.
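The funnel percentages quoted above follow directly from the stage counts; a quick arithmetic check:

```python
# Stage counts reported for DeepScientist's discovery funnel
ideas, implemented, innovations, sota = 5000, 1100, 21, 3

selection_rate = implemented / ideas            # ideas promoted to implementation
implementation_yield = innovations / implemented  # implementations that became innovations
idea_to_innovation = innovations / ideas
idea_to_sota = sota / ideas

print(f"selection rate:       {selection_rate:.1%}")        # 22.0%
print(f"implementation yield: {implementation_yield:.1%}")  # 1.9%
print(f"idea -> innovation:   {idea_to_innovation:.2%}")    # 0.42%
print(f"idea -> SOTA:         {idea_to_sota:.2%}")          # 0.06%
```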
5 LLM Integration
Dual-Model Architecture
DeepScientist employs a deliberate separation of concerns between two frontier LLMs:
| Role | Model | Justification |
|---|---|---|
| Core logic | Gemini-2.5-Pro | Strategist, reviewer, hypothesis evaluator, report writer — requires broad reasoning, long context, and nuanced scientific judgment |
| Code generation | Claude-4-Opus | Implementation agent — requires precise, repository-level code generation with strong debugging capabilities |
This is not a cost optimization (both are frontier models) but a capability optimization. The authors implicitly argue that the best reasoning model and the best coding model are different — and that the system benefits from using each where it excels.
LLM as Surrogate Model
The most novel use of the LLM is as the surrogate function in the Bayesian Optimization loop. In classical BO, the surrogate is a Gaussian Process or a neural network trained on past evaluations. In DeepScientist, the surrogate is an LLM Reviewer — a prompted Gemini-2.5-Pro that takes a hypothesis description and returns a valuation vector V = <v_u, v_q, v_e>.
Classical BO DeepScientist BO
┌─────────────────┐ ┌─────────────────┐
│ Gaussian Process │ │ LLM Reviewer │
│ (trained on data)│ │ (prompted, zero- │
│ │ │ or few-shot) │
│ Input: x ∈ ℝ^d │ │ Input: I (text) │
│ Output: μ(x),σ(x)│ │ Output: V=<v_u, │
│ │ │ v_q, v_e> │
└─────────────────┘ └─────────────────┘
Key differences from classical surrogates:
| Property | Gaussian Process | LLM Reviewer |
|---|---|---|
| Input space | Continuous ℝ^d | Natural language (hypothesis descriptions) |
| Training | Fitted to (x, y) pairs | Pre-trained on scientific literature |
| Uncertainty | Calibrated posterior variance σ(x) | Heuristic exploration score v_e |
| Update | Bayesian posterior update | Prompt-based (context of past findings) |
| Cost | O(n³) for n observations | O(1) API call per evaluation |
| Expressiveness | Smooth functions | Arbitrary scientific reasoning |
The LLM surrogate sacrifices calibrated uncertainty (the core theoretical advantage of GPs) for expressiveness over a vastly richer input space. This is a pragmatic choice: you cannot represent "a new method for agent failure attribution based on abductive reasoning" as a point in ℝ^d, but an LLM can reason about it.
How Each LLM Is Used at Each Stage
Stage 1 — Strategize & Hypothesize (Gemini-2.5-Pro):
Input: Findings Memory (thousands of structured records)
+ Retrieved Top-K relevant findings (when memory exceeds context)
+ Task description and current SOTA baselines
LLM Operations:
1. Analyze patterns in past findings (successes, failures, trends)
2. Generate novel hypothesis based on gap analysis
3. Produce valuation vector V = <v_u, v_q, v_e> for each hypothesis
4. Store as "Idea Finding" in memory
Output: Ranked set of Idea Findings with valuation vectors
Stage 2 — Implement & Verify (Claude-4-Opus):
Input: Selected Idea Finding (highest UCB score)
+ Repository codebase (full access)
+ Experimental baselines and metrics
LLM Operations:
1. Plan implementation strategy (reading existing code structure)
2. Generate code changes (repository-level modifications)
3. Execute experiments in sandboxed environment
4. Debug failures, iterate on implementation
5. Produce experimental logs and results
Output: Implementation + experimental results + updated finding record
Stage 3 — Analyze & Report (Gemini-2.5-Pro):
Input: Successful implementation results
+ Baseline comparisons
+ Full experimental logs
LLM Operations:
1. Design deeper analytical experiments (ablations, new datasets)
2. Manage experimental lifecycle via MCP tools
3. Synthesize results into coherent narrative
4. Generate research paper with proper structure and citations
Output: Research paper + comprehensive experimental analysis
Agent Capabilities
The implementation agent (Stage 2) has notably broad permissions:
| Permission | Scope | Rationale |
|---|---|---|
| Read code | Full repository access | Must understand existing codebase to modify it |
| Write code | Full repository modification | Must implement novel methods |
| Execute code | Sandboxed environment | Must run experiments |
| Internet access | Yes | May need to reference documentation, download dependencies |
| Install packages | Yes | May need new libraries for implementation |
| GPU access | Dedicated H800 GPU | Experiments require accelerator compute |
This is more permissive than most autonomous research systems. AI Scientist (Sakana) restricts agents to a predefined template. Karpathy's autoresearch limits modifications to a single file. DeepScientist gives the coding agent full repository-level access — reflecting the reality that genuine scientific innovation often requires structural changes, not just parameter tuning.
Prompt Engineering and Review Architecture
The LLM Reviewer (surrogate model) is a cornerstone of the system's effectiveness. It must:
- Assess utility — estimate how much a proposed method would improve task performance
- Assess quality — evaluate methodological soundness, potential pitfalls, feasibility
- Assess exploration value — determine how novel the hypothesis is relative to previously explored ideas
The three-dimensional output is critical. A single scalar score would collapse these orthogonal concerns, making it impossible for the UCB acquisition function to properly balance exploration and exploitation. By separating the signals, the system can independently control how much it values novelty (via κ) versus expected performance (via w_u, w_q).
Contrast with Single-Model Systems
| System | Models Used | Separation of Concerns |
|---|---|---|
| AI Scientist | GPT-4 / Claude for everything | None — same model ideates, codes, writes, reviews |
| Autoresearch | Single coding agent | None — LLM does all reasoning and coding |
| AlphaEvolve | Gemini Flash + Pro ensemble | Model hierarchy (fast for quantity, large for quality) |
| DeepScientist | Gemini-2.5-Pro + Claude-4-Opus | Functional (reasoning vs. coding) |
DeepScientist's separation is functional (reasoning vs. coding) rather than hierarchical (cheap vs. expensive). This is a defensible architecture decision: the best available reasoning model need not be the best coder, and vice versa.
6 Key Results
SOTA-Surpassing Results
The headline results are the three methods that surpassed human state-of-the-art:
Result 1: Agent Failure Attribution — A2P Method
| Metric | Baseline SOTA | DeepScientist (A2P) | Improvement |
|---|---|---|---|
| Accuracy | Not specified | +183.7% over baseline | +183.7% |
The A2P (Abduction-Action-Prediction) method is a three-phase reasoning framework: 1. Abduction: Given the observed failure, generate candidate causal explanations 2. Action Analysis: For each action in the agent trace, evaluate its alignment with each candidate cause 3. Prediction: Select the action-cause pair with highest explanatory power
The magnitude of improvement (+183.7%) is extraordinary. Such a large gain typically indicates either: - The baseline was particularly weak (a common criticism) - The task was particularly amenable to the type of reasoning an LLM can bring - The method represents a genuinely transformative approach
Given that this is a relatively new task (agent failure attribution), the first explanation is most likely — but the result is still noteworthy as a demonstration of autonomous discovery.
Result 2: LLM Inference Acceleration — ACRA Method
| Metric | Baseline SOTA | DeepScientist (ACRA) | Improvement |
|---|---|---|---|
| Tokens/second | Not specified | +1.9% over baseline | +1.9% |
The +1.9% improvement is modest but meaningful in a domain where: - Inference optimization is heavily studied by well-funded teams (NVIDIA, Google, Meta) - Most easy optimizations have already been found - Even 1-2% improvements translate to significant cost savings at scale
The ACRA method name suggests Adaptive/Conditional computation with some form of Routing or Attention modification. The fact that an autonomous system found a genuine improvement in this heavily-optimized space is itself a significant result.
Result 3: AI Text Detection — PA-Detect Method
| Metric | Baseline SOTA | DeepScientist (PA-Detect) | Improvement |
|---|---|---|---|
| AUROC | Not specified | +7.9% over baseline | +7.9% |
| Latency | Not specified | -65.5% (190% faster) | +190% speed |
PA-Detect is the most compelling result because it achieves Pareto improvement — simultaneously better on two competing objectives:
AUROC (higher is better)
▲
│ ★ PA-Detect
│ ╱
│ ╱ Pareto frontier shift
│ ╱
│ ○ Previous SOTA
│
└──────────────────► Latency (lower is better)
Achieving +7.9% AUROC while simultaneously reducing latency by 65.5% means PA-Detect doesn't just find a better point on the existing accuracy-speed tradeoff curve — it shifts the entire Pareto frontier. This is rare in optimization and suggests a fundamentally better algorithmic approach rather than hyperparameter tuning.
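A Pareto improvement of this kind is easy to state precisely. The sketch below uses invented absolute numbers (the paper reports only relative deltas) to show that applying the reported deltas yields dominance on both axes:

```python
def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if method a Pareto-dominates b.
    Each point is (auroc, latency); higher AUROC and lower latency are better."""
    auroc_a, lat_a = a
    auroc_b, lat_b = b
    return (auroc_a >= auroc_b and lat_a <= lat_b
            and (auroc_a > auroc_b or lat_a < lat_b))

# Illustrative absolute baseline (not from the paper); deltas are the reported ones.
previous_sota = (0.85, 100.0)                        # AUROC, latency (ms)
pa_detect = (0.85 * 1.079, 100.0 * (1 - 0.655))      # +7.9% AUROC, -65.5% latency

assert dominates(pa_detect, previous_sota)
speedup = previous_sota[1] / pa_detect[1]            # ~2.9x, i.e. +190% speed
```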
Automated Review Evaluation (DeepReviewer)
Table 2 from the paper compares DeepScientist's papers against other AI research systems using DeepReviewer (the automated review model from CycleResearcher):
| System | Papers | Soundness | Presentation | Contribution | Rating | Accept Rate |
|---|---|---|---|---|---|---|
| AI Scientist | 10 | 2.08 | 1.80 | 1.75 | 3.35 | 0% |
| AI Researcher | 7 | 1.75 | 1.46 | 1.57 | 2.57 | 0% |
| AI Scientist v2 | 3 | 1.67 | 1.50 | 1.50 | 2.33 | 0% |
| CycleResearcher | 6 | 2.25 | 1.75 | 2.13 | 3.75 | 0% |
| Zochi | 2 | 2.38 | 2.38 | 2.25 | 4.63 | 0% |
| DeepScientist | 5 | 2.90 | 2.90 | 2.90 | 5.90 | 60% |
Analysis of the Review Scores
DeepScientist dominates every dimension. The gap is not marginal:
| Dimension | DeepScientist | Next Best | Gap |
|---|---|---|---|
| Soundness | 2.90 | 2.38 (Zochi) | +0.52 |
| Presentation | 2.90 | 2.38 (Zochi) | +0.52 |
| Contribution | 2.90 | 2.25 (Zochi) | +0.65 |
| Rating | 5.90 | 4.63 (Zochi) | +1.27 |
| Accept Rate | 60% | 0% (all others) | +60pp |
The 60% accept rate is the most striking number. Every other system — including sophisticated ones like AI Scientist v2 and Zochi — has a 0% accept rate under DeepReviewer. DeepScientist is the first to cross the acceptance threshold, and it does so with a substantial majority (3 of 5 papers accepted).
Potential confound: DeepReviewer was developed by the same team (from CycleResearcher). While the automated reviewer was validated against human judgments in the CycleResearcher paper, there is a risk of systemic bias — the reviewer may favor the same team's output style. The human expert review helps address this concern.
Human Expert Review
To address the automated reviewer concern, the paper reports human expert evaluation:
| Metric | DeepScientist Papers | ICLR 2025 Human Papers |
|---|---|---|
| Average rating | 5.00 | 5.08 |
| Number of reviewers | 3 per paper | 3-4 per paper |
| Inter-annotator agreement (Krippendorff α) | 0.739 | N/A |
The DeepScientist papers received ratings statistically indistinguishable from human-written papers at a top ML venue.
Key observations:
- Rating 5.00 vs 5.08: The 0.08 gap is well within noise. At ICLR, a rating of 5 corresponds roughly to "marginally below acceptance threshold" — meaning these papers are competitive with, but not clearly above, venue-quality human research.
- Krippendorff's α = 0.739: This indicates "substantial agreement" (>0.667 threshold). The reviewers were consistent in their assessments, suggesting the ratings are reliable rather than artifacts of reviewer disagreement.
- Caveat: Three reviewers across a handful of papers provides limited statistical power. The confidence interval around 5.00 is wide. But the directional finding — that AI-generated research papers can achieve parity with human submissions — is still noteworthy.
Scale Metrics
| Metric | Value | Interpretation |
|---|---|---|
| GPU hours consumed | 20,000+ | Equivalent to ~$200K-400K at cloud rates |
| Unique scientific ideas | ~5,000 | Massive hypothesis generation capacity |
| Experimentally validated | ~1,100 | 22% of ideas selected for implementation |
| Scientific innovations | 21 | 1.9% of implementations, 0.42% of ideas |
| SOTA-surpassing methods | 3 | 14.3% of innovations, 0.06% of all ideas |
| Campaign duration | ~1 month per task | Continuous autonomous operation |
| Human experts for verification | 3 | Necessary for hallucination filtering |
The Dissipation Problem
The paper's own analysis reveals a critical challenge: the vast majority of compute is "wasted" on hypotheses that don't work. The funnel from 5,000 ideas to 3 SOTA methods represents a 0.06% conversion rate. The authors frame this honestly:
"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"
This framing shifts the research agenda from "can AI do science?" (answered: yes) to "can AI do science efficiently?" (answered: not yet). The 20,000+ GPU hours for 3 results is orders of magnitude more expensive than a human research team producing comparable results — but the cost is expected to decrease as models improve and search becomes more efficient.
7 Reproducibility
Code Availability
| Artifact | Available | Location |
|---|---|---|
| Source code | Yes | github.com/ResearAI/DeepScientist |
| Project page | Yes | ai-researcher.net |
| Paper | Yes | arXiv:2509.26603 |
| Discovered methods (A2P, ACRA, PA-Detect) | Partial | In paper descriptions |
| Findings Memory dumps | Unknown | Not explicitly released |
| Experimental logs | Unknown | Not explicitly released |
| DeepReviewer model | Inherited from CycleResearcher | Separate release |
Reproducibility Barriers
High barriers to exact reproduction:
| Barrier | Severity | Notes |
|---|---|---|
| Hardware | Critical | 2 servers × 8 NVIDIA H800 GPUs = 16 H800s required |
| Compute cost | Critical | 20,000+ GPU hours ≈ $200K-400K at cloud rates |
| Model access | High | Requires Gemini-2.5-Pro and Claude-4-Opus API access |
| API costs | High | Month-long continuous LLM API calls at frontier model rates |
| Time | High | Month-long campaigns per task — cannot reproduce quickly |
| Stochasticity | Medium | LLM outputs are stochastic; same prompts yield different hypotheses |
| Human experts | Medium | 3 domain experts needed for output verification |
| Task baselines | Low | Frontier AI tasks with known baselines |
The reproducibility challenge is primarily economic, not technical. The system architecture is documented, the code is released, and the methodology is clear. But actually running DeepScientist requires resources that few academic labs possess — essentially two full compute nodes with H800 GPUs running for a month, plus substantial API budgets.
Partial Reproduction Path
A realistic partial reproduction strategy:
Full reproduction (prohibitive for most):
├── 16 × H800 GPUs for 1 month per task
├── Gemini-2.5-Pro + Claude-4-Opus API budget
├── 3 domain experts
└── Total: ~$200K-400K per task
Scaled-down reproduction (feasible):
├── 1-2 × consumer GPUs (A100/H100)
├── Smaller hypothesis budget (100-500 instead of 5,000)
├── Shorter campaigns (days instead of months)
├── Smaller models (Gemini Flash, Claude Sonnet)
└── Total: ~$1K-10K per task
Concept validation (cheap):
├── Single GPU
├── Reproduce the BO formulation on a toy task
├── Verify UCB acquisition logic
├── Test Findings Memory data structures
└── Total: ~$100-500
What Would Strengthen Reproducibility
- Release Findings Memory dumps — allow researchers to analyze the discovery trajectory
- Release experimental logs — enable understanding of which hypotheses failed and why
- Release the discovered methods — full code for A2P, ACRA, PA-Detect (partially available in paper)
- Provide scaling curves — how do results degrade with fewer GPUs, shorter campaigns, cheaper models?
- Open-source the evaluation harness — standardized benchmarks for comparing autonomous research systems
8 Compute and API Costs
Hardware Configuration
Server 1                              Server 2
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  8 × NVIDIA H800 (80GB)         │   │  8 × NVIDIA H800 (80GB)         │
│                                 │   │                                 │
│  GPU 0:  DeepScientist instance │   │  GPU 8:  DeepScientist instance │
│  GPU 1:  DeepScientist instance │   │  GPU 9:  DeepScientist instance │
│  GPU 2:  DeepScientist instance │   │  GPU 10: DeepScientist instance │
│  GPU 3:  DeepScientist instance │   │  GPU 11: DeepScientist instance │
│  GPU 4:  DeepScientist instance │   │  GPU 12: DeepScientist instance │
│  GPU 5:  DeepScientist instance │   │  GPU 13: DeepScientist instance │
│  GPU 6:  DeepScientist instance │   │  GPU 14: DeepScientist instance │
│  GPU 7:  DeepScientist instance │   │  GPU 15: DeepScientist instance │
└─────────────────────────────────┘   └─────────────────────────────────┘
Each GPU runs a separate DeepScientist instance
16 parallel exploration threads
Month-long continuous operation per task
Cost Estimation
| Component | Quantity | Unit Cost (est.) | Total (est.) |
|---|---|---|---|
| H800 GPU hours | 20,000+ | $2-4/hr (cloud) | $40,000-80,000 |
| Gemini-2.5-Pro API | ~millions of tokens | $0.00125-0.01/1K tokens | $5,000-50,000 |
| Claude-4-Opus API | ~millions of tokens (code gen) | $0.015-0.075/1K tokens | $10,000-100,000 |
| Human expert time | 3 experts × ~40 hrs | $100-200/hr | $12,000-24,000 |
| Total per task | | | $67,000-254,000 |
| Total (3 tasks) | | | $200,000-762,000 |
These are rough estimates based on publicly available pricing. The actual costs may be lower if the team had institutional discounts or higher if API usage was more intensive than estimated.
Cost per Innovation
| Metric | Value | Interpretation |
|---|---|---|
| Cost per idea generated | ~$40-150 | Cheap: LLM inference for hypothesis generation |
| Cost per implementation validated | ~$180-690 | Moderate: GPU hours for experimentation |
| Cost per innovation discovered | ~$9,500-36,300 | Expensive: reflects the low hit rate |
| Cost per SOTA-surpassing method | ~$67,000-254,000 | Very expensive: but comparable to a postdoc-year |
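The per-unit figures above follow from dividing the estimated three-task totals by the funnel counts reported in the paper (~5,000 ideas, ~1,100 implementations, 21 innovations, 3 SOTA methods). A quick sanity check of that arithmetic (the dollar totals are this report's estimates, not published numbers):

```python
# Divide the estimated three-task cost totals by the counts from the paper.
TOTAL_LOW, TOTAL_HIGH = 200_000, 762_000          # est. cost, 3 tasks (USD)
COUNTS = {"idea": 5_000, "implementation": 1_100,
          "innovation": 21, "SOTA method": 3}

for label, n in COUNTS.items():
    low, high = TOTAL_LOW / n, TOTAL_HIGH / n
    print(f"cost per {label}: ${low:,.0f}-${high:,.0f}")
# cost per idea: $40-$152
# cost per implementation: $182-$693
# cost per innovation: $9,524-$36,286
# cost per SOTA method: $66,667-$254,000
```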
Comparison to Human Research Costs
| Research Mode | Cost per SOTA Result | Time to Result | Scalability |
|---|---|---|---|
| PhD student | ~$60,000-120,000/year | 1-5 years | Not scalable |
| Industry research team | ~$500,000-2M/year | 6-18 months | Limited |
| DeepScientist | ~$67K-254K | ~1 month | Parallelizable |
The comparison is imperfect — human researchers produce understanding, mentorship, and serendipitous discoveries alongside their primary results. But on the narrow dimension of "time and cost to produce a SOTA-surpassing method," DeepScientist is competitive with or faster than human researchers, albeit at similar cost.
Scaling Properties
DeepScientist exhibits a property that human research does not: near-linear scaling with compute. More GPUs mean more parallel exploration threads, more hypotheses evaluated, and, in expectation, more innovations discovered. Human research teams face diminishing returns as team size grows (coordination costs, communication overhead, duplication of effort).
Innovation yield vs. compute (conceptual)
Innovations │ DeepScientist
discovered │ ╱ (near-linear with compute)
│ ╱
│ ╱
│ ╱ Human team
│ ╱ ╱ (diminishing returns)
│ ╱ ╱
│ ╱ ╱
│╱ ╱
└─────────────────── Compute / Team Size
This scaling property — if it holds as hypothesized — is the most consequential implication of the DeepScientist framework. It suggests that scientific discovery can be industrialized: throw more GPUs at the problem and get more results, linearly.
9 Architecture Solution
Three-Stage Hierarchical Discovery Cycle
DeepScientist's architecture is a three-stage cycle, where each stage builds on the outputs of the previous one. The cycle repeats continuously for the duration of a campaign (approximately one month per task).
┌──────────────────────────────────────────────────────────────────────┐
│ DEEPSCIENTIST DISCOVERY CYCLE │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ STAGE 1 │ │ FINDINGS MEMORY │ │
│ │ Strategize & │◄───┤ │ │
│ │ Hypothesize │ │ ┌─────────────────┐ │ │
│ │ │ │ │ Idea Findings │ │ │
│ │ • Analyze memory │───►│ │ (hypotheses) │ │ │
│ │ • Retrieve Top-K │ │ ├─────────────────┤ │ │
│ │ • LLM surrogate │ │ │ Implement │ │ │
│ │ valuation V │ │ │ Findings │ │ │
│ │ • Store Idea │ │ │ (code + results) │ │ │
│ │ Findings │ │ ├─────────────────┤ │ │
│ └────────┬────────────┘ │ │ Progress │ │ │
│ │ │ │ Findings │ │ │
│ │ UCB selects │ │ (innovations) │ │ │
│ │ best candidate │ └─────────────────┘ │ │
│ ▼ │ │ │
│ ┌─────────────────────┐ │ Also contains: │ │
│ │ STAGE 2 │ │ • Human knowledge │ │
│ │ Implement & │ │ (papers, code) │ │
│ │ Verify │ │ • Historical results │ │
│ │ │ │ • Valuation vectors │ │
│ │ • Coding agent │ └───────────────────────┘ │
│ │ • Sandboxed env │ │
│ │ • Full repo access │ │
│ │ • GPU experiments │ │
│ │ • Result f(I_{t+1})│ │
│ └────────┬────────────┘ │
│ │ │
│ │ If result surpasses baseline │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ STAGE 3 │ │
│ │ Analyze & │ │
│ │ Report │ │
│ │ │ │
│ │ • Ablation studies │ │
│ │ • New dataset eval │ │
│ │ • MCP lifecycle │ │
│ │ • Paper synthesis │ │
│ └─────────────────────┘ │
│ │
│ ← Cycle repeats for ~1 month per task → │
└──────────────────────────────────────────────────────────────────────┘
Stage 1: Strategize & Hypothesize
Purpose: Generate and evaluate candidate hypotheses using the LLM as a surrogate model.
Process:
1. Load current Findings Memory (potentially thousands of structured records)
2. When memory exceeds context window, use retrieval model for Top-K relevant findings
3. LLM Reviewer (Gemini-2.5-Pro) analyzes patterns in past findings
4. Generate new hypotheses based on identified gaps and opportunities
5. For each hypothesis, produce valuation vector V = <v_u, v_q, v_e>
6. Store as "Idea Finding" in memory
Key design choice: The retrieval step is critical for scaling. As the Findings Memory grows to thousands of entries, it cannot fit in a single LLM context window. The retrieval model ensures that the most relevant past findings are presented to the strategist, providing continuity even as the memory grows beyond context limits.
Stage 2: Implement & Verify
Purpose: Select the most promising hypothesis via UCB, implement it, and evaluate experimentally.
Process:
1. UCB acquisition function selects the Idea Finding with highest score:
I_{t+1} = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)
2. Selected finding promoted to "Implement Finding"
3. Coding agent (Claude-4-Opus) receives the hypothesis and repository
4. Agent plans implementation strategy by reading existing code
5. Agent implements changes (full repository-level modifications)
6. Experiments executed in sandboxed environment with dedicated GPU
7. Results f(I_{t+1}) recorded and finding record updated
Key design choice: The coding agent has full permissions (read code, access internet, install packages). This is necessary for genuine innovation — constraining the agent to a template or a single file would preclude the kinds of structural changes that lead to breakthrough methods.
Stage 3: Analyze & Report
Purpose: Triggered only by successful implementations that surpass the baseline. Performs deeper analysis and generates a research paper.
Process:
1. Successful implementation promoted to "Progress Finding"
2. MCP tools manage the experimental lifecycle
3. Deeper analytical experiments: ablation studies, evaluation on new datasets
4. Synthesis agent (Gemini-2.5-Pro) produces a coherent research paper
Key design choice: Stage 3 is conditional — it only fires for successes. This prevents wasting compute on analyzing failed experiments. The asymmetry is important: hypothesis generation and implementation are cheap enough to run speculatively, but deep analysis and paper writing should only happen for results worth reporting.
Parallel Execution Architecture
16 GPU instances, each running independently:
GPU 0: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 1: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 2: Stage 1 → Stage 2 (experiment) → Stage 3 (success!) → Stage 1 → ...
...
GPU 15: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
Shared: Findings Memory (read by all, written by all)
Each GPU instance operates as an independent exploration thread, sharing the Findings Memory. This architecture enables:
- Parallel exploration: 16 hypotheses can be evaluated simultaneously
- Shared learning: discoveries on one GPU inform hypothesis generation on all others
- Fault tolerance: failure of one instance doesn't affect others
- Linear scaling: adding GPUs proportionally increases throughput
The shared Findings Memory is the key coordination mechanism. Without it, each GPU would explore independently, duplicating effort and missing opportunities to build on each other's findings.
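The coordination pattern can be sketched with threads standing in for GPU instances (a toy model; the real system runs separate processes, one per GPU, against a persistent store):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class SharedFindingsMemory:
    """Shared read/write store, modeled as a lock-protected list."""
    _lock: threading.Lock = field(default_factory=threading.Lock)
    findings: list = field(default_factory=list)

    def snapshot(self) -> list:
        with self._lock:              # readers get a consistent copy
            return list(self.findings)

    def record(self, finding: dict) -> None:
        with self._lock:              # writers serialize their appends
            self.findings.append(finding)

def worker(gpu_id: int, memory: SharedFindingsMemory, cycles: int) -> None:
    for t in range(cycles):
        context = memory.snapshot()   # "shared learning": see others' results
        result = {"gpu": gpu_id, "cycle": t, "seen": len(context)}
        memory.record(result)         # discovery becomes visible to all

memory = SharedFindingsMemory()
threads = [threading.Thread(target=worker, args=(g, memory, 3))
           for g in range(16)]        # 16 parallel exploration threads
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(memory.findings))           # 48 findings from 16 instances × 3 cycles
```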
10 Component Breakdown
Component Architecture
DeepScientist System Components
│
├── Strategist (Gemini-2.5-Pro)
│ ├── Memory Analyzer — reads and patterns over Findings Memory
│ ├── Hypothesis Generator — produces novel Idea Findings
│ ├── Surrogate Evaluator — LLM Reviewer producing V = <v_u, v_q, v_e>
│ └── Retrieval Model — Top-K finding selection when memory exceeds context
│
├── Selector (UCB Acquisition Function)
│ ├── Weight Manager — maintains w_u, w_q, κ parameters
│ ├── Score Calculator — computes UCB score per Idea Finding
│ └── Promotion Logic — Idea Finding → Implement Finding
│
├── Implementer (Claude-4-Opus)
│ ├── Code Reader — analyzes repository structure
│ ├── Plan Generator — designs implementation strategy
│ ├── Code Writer — repository-level modifications
│ ├── Experiment Runner — executes in sandboxed environment
│ └── Result Logger — records f(I_{t+1}) in finding record
│
├── Analyzer (Gemini-2.5-Pro)
│ ├── Ablation Designer — plans deeper experiments
│ ├── Dataset Evaluator — tests on additional benchmarks
│ ├── MCP Lifecycle Manager — experimental lifecycle tools
│ └── Paper Synthesizer — generates research paper
│
├── Findings Memory
│ ├── Idea Findings Store — hypotheses with valuation vectors
│ ├── Implement Findings Store — implementations with results
│ ├── Progress Findings Store — successful innovations
│ ├── Human Knowledge Base — papers, code repositories
│ └── Retrieval Index — for Top-K selection
│
└── Infrastructure
├── GPU Scheduler — assigns instances to GPUs
├── Sandbox Manager — isolated execution environments
├── API Client Pool — manages LLM API connections
└── Logging & Monitoring — tracks campaign progress
Component Interaction Matrix
| Component | Reads From | Writes To | LLM Model |
|---|---|---|---|
| Strategist | Findings Memory, Task Description | Idea Findings | Gemini-2.5-Pro |
| Selector | Idea Findings (valuation vectors) | Implement Findings (promotion) | None (pure computation) |
| Implementer | Implement Findings, Repository | Implement Findings (results) | Claude-4-Opus |
| Analyzer | Progress Findings, Experimental Data | Progress Findings (analysis), Papers | Gemini-2.5-Pro |
| Retrieval Model | Findings Memory | Top-K selections | Embedding model |
The Surrogate Model Component
The surrogate model deserves special attention as the most architecturally novel component:
Surrogate Model (LLM Reviewer)
│
├── Input Assembly
│ ├── Hypothesis description (natural language)
│ ├── Task context (problem definition, metrics, baseline)
│ ├── Relevant past findings (Top-K from memory)
│ └── Current state of knowledge (patterns, trends)
│
├── Evaluation Process
│ ├── Assess utility: "How much would this improve task performance?"
│ ├── Assess quality: "Is this methodologically sound and feasible?"
│ └── Assess exploration: "How novel is this vs. what we've tried?"
│
└── Output
├── v_u (utility value) — scalar estimate of expected improvement
├── v_q (quality value) — scalar estimate of methodological quality
└── v_e (exploration value) — scalar estimate of novelty
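A minimal sketch of how the surrogate's input assembly and output parsing might be wired. The prompt wording and the JSON reply contract are assumptions; the real prompts for the Gemini-2.5-Pro reviewer are not published, so a stubbed reply stands in for the API call:

```python
import json
from typing import NamedTuple

class ValuationVector(NamedTuple):
    utility: float      # v_u: expected performance improvement
    quality: float      # v_q: methodological soundness
    exploration: float  # v_e: novelty vs. the already-explored space

def build_reviewer_prompt(hypothesis: str, task_context: str,
                          top_k_findings: list[str]) -> str:
    # Input assembly: task context + Top-K retrieved findings + hypothesis.
    findings = "\n".join(f"- {f}" for f in top_k_findings)
    return (
        f"Task: {task_context}\n"
        f"Relevant past findings:\n{findings}\n"
        f"Hypothesis: {hypothesis}\n"
        "Score utility, quality, and exploration in [0, 1]; reply as JSON: "
        '{"utility": ..., "quality": ..., "exploration": ...}'
    )

def parse_valuation(llm_reply: str) -> ValuationVector:
    scores = json.loads(llm_reply)
    return ValuationVector(scores["utility"], scores["quality"],
                           scores["exploration"])

# Stubbed reviewer reply in place of the real LLM call:
reply = '{"utility": 0.7, "quality": 0.8, "exploration": 0.4}'
print(parse_valuation(reply))
# ValuationVector(utility=0.7, quality=0.8, exploration=0.4)
```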
The UCB Selector Component
UCB Acquisition Function
│
├── Input
│ ├── Set of Idea Findings with valuation vectors V_i = <v_u, v_q, v_e>
│ └── Parameters: w_u, w_q, κ
│
├── Score Computation (for each Idea Finding I_i)
│ └── UCB(I_i) = w_u · v_u(I_i) + w_q · v_q(I_i) + κ · v_e(I_i)
│
├── Selection
│ └── I_{t+1} = argmax_{I_i} UCB(I_i)
│
└── Promotion
└── Selected I_{t+1}: Idea Finding → Implement Finding
The UCB selector is the only non-LLM component in the critical path. It is a deterministic function of the valuation vectors, ensuring that the exploration/exploitation tradeoff is principled rather than dependent on LLM stochasticity.
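The selector reduces to a few lines; a sketch with illustrative weight values, since the paper does not publish the actual w_u, w_q, and κ settings:

```python
from typing import NamedTuple

class IdeaFinding(NamedTuple):
    id: str
    v_u: float  # utility
    v_q: float  # quality
    v_e: float  # exploration

def ucb_score(f: IdeaFinding, w_u: float, w_q: float, kappa: float) -> float:
    # UCB(I) = w_u · v_u + w_q · v_q + κ · v_e
    return w_u * f.v_u + w_q * f.v_q + kappa * f.v_e

def select_next(candidates: list[IdeaFinding],
                w_u: float = 1.0, w_q: float = 1.0,
                kappa: float = 0.5) -> IdeaFinding:
    # Deterministic argmax: the only non-LLM step in the critical path.
    return max(candidates, key=lambda f: ucb_score(f, w_u, w_q, kappa))

candidates = [
    IdeaFinding("safe-tweak", v_u=0.3, v_q=0.9, v_e=0.1),
    IdeaFinding("bold-idea",  v_u=0.9, v_q=0.4, v_e=0.8),
]
print(select_next(candidates).id)                     # bold-idea (1.70 vs 1.25)
print(select_next(candidates, w_q=2.0, kappa=0.0).id) # safe-tweak (risk-averse)
```

Raising w_q and lowering κ shifts the same candidate pool toward the low-risk choice, which is the risk-averse/risk-seeking dial described above.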
MCP Tools for Experimental Lifecycle
Stage 3 uses MCP (Model Context Protocol) tools to manage the experimental lifecycle:
| Tool Category | Purpose | Used In |
|---|---|---|
| Experiment Design | Plan ablation studies, control experiments | Stage 3 |
| Dataset Management | Load, preprocess, evaluate on new datasets | Stage 3 |
| Result Compilation | Aggregate metrics, generate tables and figures | Stage 3 |
| Paper Formatting | Structure sections, manage references, LaTeX | Stage 3 |
The use of MCP tools rather than hardcoded logic allows the analysis pipeline to be flexible — the LLM can decide which tools to invoke based on the specific findings, rather than following a rigid template.
11 Core Mechanisms (Detailed)
Mechanism 1: Bayesian Optimization over Hypothesis Space
Formal setup:
Let 𝓘 denote the space of all possible scientific methods (a vast, combinatorial space of algorithms, architectures, and configurations). Let f: 𝓘 → ℝ be the true value function that maps each method to its actual performance. The goal:
I* = argmax_{I ∈ 𝓘} f(I)
The BO loop proceeds as follows:
t = 0: Initialize Findings Memory M_0 with human knowledge (papers, baselines)
For t = 1, 2, 3, ...:
1. SURROGATE UPDATE
Use LLM Reviewer to produce valuation vectors for new Idea Findings:
V(I) = <v_u(I), v_q(I), v_e(I)> for each I in candidates
2. ACQUISITION
Select next hypothesis to evaluate:
I_{t+1} = argmax_{I} UCB(I) = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)
3. EVALUATION
Deploy coding agent to implement and test I_{t+1}:
f(I_{t+1}) = ExperimentalResult(I_{t+1})
4. MEMORY UPDATE
Update M_t → M_{t+1} with result:
- If f(I_{t+1}) > f(I_best): promote to Progress Finding
- Otherwise: record result and update valuation model's context
5. REPEAT
Why UCB and not other acquisition functions?
| Acquisition Function | Formula | Properties | Verdict |
|---|---|---|---|
| UCB | μ(x) + κσ(x) | Deterministic, tunable exploration | Used |
| Expected Improvement (EI) | E[max(f(x) − f*, 0)] | Greedy, low exploration | Too exploitative for open-ended search |
| Thompson Sampling | Sample from posterior | Stochastic, naturally explores | Hard to implement with LLM surrogate |
| Knowledge Gradient | Value of information | Optimal for finite budget | Computationally expensive |
UCB is the natural choice because:
1. It is deterministic given the valuation vector — no additional stochasticity beyond the LLM
2. The exploration coefficient κ is explicitly tunable — the researchers can control the exploration/exploitation balance
3. It works with any surrogate that produces mean + uncertainty — the three-component valuation vector is a natural fit
Mechanism 2: The Three-Component Valuation Vector
The valuation vector V = <v_u, v_q, v_e> decomposition is not standard in BO. Classical BO surrogates produce a mean prediction μ(x) and uncertainty σ(x). DeepScientist's decomposition maps to BO concepts as follows:
Classical BO: DeepScientist:
├── μ(x) (mean) ←→ w_u · v_u + w_q · v_q (exploitation signal)
└── σ(x) (variance) ←→ κ · v_e (exploration signal)
The separation of the exploitation signal into utility (v_u) and quality (v_q) is the key innovation. In classical BO, the mean is a single scalar. In DeepScientist, the exploitation signal is a weighted combination of two semantically distinct assessments:
- Utility (v_u): "How much performance improvement would this method achieve?" — focuses on the magnitude of the expected gain
- Quality (v_q): "How methodologically sound is this approach?" — focuses on the probability of the gain being real
This separation allows the system to distinguish between:
- High utility, low quality: bold ideas that promise large gains but may be unsound (high-risk, high-reward)
- Low utility, high quality: incremental improvements that are almost certain to work (low-risk, low-reward)
- High utility, high quality: the most promising candidates (rare, highly selected)
By adjusting the weights w_u and w_q, the system can shift between risk-seeking and risk-averse strategies.
Mechanism 3: Findings Memory as Cumulative Knowledge Base
The Findings Memory is a structured, list-style database with thousands of records organized into three hierarchical levels:
Findings Memory
│
├── Level 1: IDEA FINDINGS
│ ├── Source: Generated by Strategist (LLM)
│ ├── Content: Hypothesis description, rationale, related work references
│ ├── Metadata: Valuation vector V, generation timestamp, source findings
│ ├── Lifecycle: Created → [Selected by UCB → Promoted to Level 2] or [Persists]
│ └── Volume: ~5,000 over a month-long campaign
│
├── Level 2: IMPLEMENT FINDINGS
│ ├── Source: Promoted from Level 1 after UCB selection
│ ├── Content: Implementation details, code changes, experimental setup
│ ├── Metadata: Result f(I), experimental logs, runtime, resource usage
│ ├── Lifecycle: Created → [Surpasses baseline → Promoted to Level 3] or [Persists]
│ └── Volume: ~1,100 over a month-long campaign
│
└── Level 3: PROGRESS FINDINGS
├── Source: Promoted from Level 2 after successful validation
├── Content: Full method description, ablation results, multi-dataset evaluation
├── Metadata: Improvement over baseline, paper draft, human review status
├── Lifecycle: Created → [Deeper analysis via Stage 3] → [Paper publication]
└── Volume: 21 over a month-long campaign
Dual-source knowledge: The memory contains both:
1. Human knowledge — structured records from existing papers, codebases, and known methods
2. System-generated knowledge — the system's own hypotheses, implementations, and results
This creates a feedback loop: human knowledge seeds the initial exploration, the system generates new findings, which in turn inform future hypothesis generation. Over time, the system-generated knowledge dominates, as the Findings Memory becomes a comprehensive record of what has been tried and what works.
Mechanism 4: Retrieval for Context-Length Management
As the Findings Memory grows, it exceeds the context window of even the largest LLMs. DeepScientist addresses this with a retrieval model:
Findings Memory (thousands of records)
│
├── When memory fits in context:
│ └── Pass entire memory to Strategist LLM
│
└── When memory exceeds context:
├── Retrieval model indexes all findings
├── Query: current task context + recent findings + identified gaps
├── Top-K most relevant findings retrieved
└── Top-K findings passed to Strategist LLM
This ensures:
├── Relevant historical context is always available
├── The system doesn't "forget" important past findings
├── Context window is used efficiently
└── The system can scale to arbitrary campaign lengths
The retrieval mechanism is what enables month-long campaigns. Without it, the system would be limited to the number of findings that fit in a single context window — probably a few hundred at most. With retrieval, it can maintain continuity over thousands of findings.
Mechanism 5: Conditional Stage Triggering
Stage 3 (Analyze & Report) is not always triggered. It fires only when an implementation surpasses the baseline:
Stage 2 result: f(I_{t+1})
Decision logic:
├── f(I_{t+1}) > f(I_baseline)?
│ ├── YES → Promote to Progress Finding → Trigger Stage 3
│ └── NO → Record result in memory → Return to Stage 1
│
│ Filtering ratios (from the paper):
│ ├── ~1,100 implementations attempted
│ ├── ~21 surpassed baseline (1.9%)
│ └── Only 21 triggered Stage 3
This asymmetry is crucial for efficiency. Stage 3 involves expensive operations (ablation studies, multi-dataset evaluation, paper synthesis). Running it for every implementation would be wasteful. By restricting it to successes, the system focuses its analytical budget on findings that matter.
Mechanism 6: Progressive Promotion Lifecycle
Each finding follows a lifecycle from idea to publication:
Generated by LLM
│
IDEA FINDING
(hypothesis + V)
│
Selected by UCB?
├── No → Stays in memory
│ (available for future analysis)
└── Yes ↓
│
IMPLEMENT FINDING
(code + experiments)
│
Surpasses baseline?
├── No → Stays in memory with result
│ (negative result is valuable data)
└── Yes ↓
│
PROGRESS FINDING
(innovation + analysis)
│
Deeper analysis (Stage 3)
│
Paper synthesis
│
Human expert review
│
Published method
Critically, negative results are retained in memory. An implementation that fails to surpass the baseline is not discarded — its result is recorded and available to the Strategist. This prevents the system from re-trying failed approaches and allows it to learn from its mistakes. The information content of "hypothesis H was implemented and produced result R which was below baseline" is valuable for guiding future hypothesis generation.
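The lifecycle reduces to a small state machine; a sketch with hypothetical function and field names (note that a finding is only ever promoted, never deleted — a below-baseline result persists at its current level):

```python
from enum import Enum
from typing import Optional

class Level(Enum):
    IDEA = "idea"
    IMPLEMENT = "implement"
    PROGRESS = "progress"

def step_lifecycle(level: Level, selected_by_ucb: bool = False,
                   result: Optional[float] = None,
                   baseline: Optional[float] = None) -> Level:
    """One transition of the promotion lifecycle sketched above."""
    if level is Level.IDEA and selected_by_ucb:
        return Level.IMPLEMENT            # UCB selection promotes the idea
    if (level is Level.IMPLEMENT and result is not None
            and baseline is not None and result > baseline):
        return Level.PROGRESS             # surpassing baseline triggers Stage 3
    return level                          # otherwise the finding persists

assert step_lifecycle(Level.IDEA, selected_by_ucb=True) is Level.IMPLEMENT
assert step_lifecycle(Level.IMPLEMENT, result=0.81, baseline=0.79) is Level.PROGRESS
# A below-baseline result is retained at its level, not discarded:
assert step_lifecycle(Level.IMPLEMENT, result=0.75, baseline=0.79) is Level.IMPLEMENT
```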
12 Programming Language
Implementation Stack
| Component | Language/Framework | Notes |
|---|---|---|
| Core orchestration | Python | Campaign management, memory operations, retrieval |
| Coding agent (implementations) | Python (generated) | Task-specific code for each hypothesis |
| LLM integration | Python (API clients) | Gemini API, Claude API |
| Experimental execution | Python + CUDA | GPU-accelerated experiments |
| MCP tools | Python | Experimental lifecycle management |
| Findings Memory | Structured storage (list-style database) | JSON/database records |
| Retrieval model | Python + embedding model | For Top-K finding selection |
Repository Structure (Inferred)
DeepScientist/
├── deepscientist/
│ ├── strategist/ ← Stage 1: hypothesis generation and evaluation
│ │ ├── analyzer.py ← Findings Memory pattern analysis
│ │ ├── generator.py ← Hypothesis generation
│ │ ├── evaluator.py ← LLM Reviewer (surrogate model)
│ │ └── retriever.py ← Top-K finding retrieval
│ │
│ ├── selector/ ← UCB acquisition function
│ │ ├── ucb.py ← UCB score computation
│ │ └── promoter.py ← Idea → Implement finding promotion
│ │
│ ├── implementer/ ← Stage 2: implementation and verification
│ │ ├── agent.py ← Coding agent orchestration
│ │ ├── sandbox.py ← Sandboxed execution environment
│ │ └── logger.py ← Result recording
│ │
│ ├── analyzer/ ← Stage 3: analysis and reporting
│ │ ├── ablation.py ← Ablation study design and execution
│ │ ├── evaluator.py ← Multi-dataset evaluation
│ │ └── writer.py ← Paper synthesis
│ │
│ ├── memory/ ← Findings Memory management
│ │ ├── store.py ← CRUD operations on findings
│ │ ├── types.py ← Finding type definitions (Idea/Implement/Progress)
│ │ └── index.py ← Retrieval index management
│ │
│ ├── tools/ ← MCP tool definitions
│ │ └── lifecycle.py ← Experimental lifecycle tools
│ │
│ └── config/ ← Campaign configuration
│ ├── task.py ← Task definition (repo, metrics, baseline)
│ └── campaign.py ← Campaign parameters (duration, GPUs, κ)
│
├── tasks/ ← Task definitions for each frontier problem
│ ├── agent_failure/ ← Agent failure attribution task
│ ├── inference_accel/ ← LLM inference acceleration task
│ └── text_detection/ ← AI text detection task
│
└── results/ ← Campaign outputs
├── findings/ ← Findings Memory dumps
├── papers/ ← Generated research papers
└── logs/ ← Experimental logs
Code Generation Patterns
The implementer (Claude-4-Opus) generates task-specific code during Stage 2. The generation pattern follows a structured workflow:
Implementation Workflow (Claude-4-Opus)
│
├── 1. PLAN
│ ├── Read existing codebase structure
│ ├── Identify relevant files and functions
│ ├── Design modification strategy
│ └── Output: Implementation plan (natural language)
│
├── 2. READ
│ ├── Read specific files identified in plan
│ ├── Understand interfaces and dependencies
│ ├── Identify insertion points
│ └── Output: Codebase understanding
│
├── 3. IMPLEMENT
│ ├── Generate code changes
│ ├── May create new files or modify existing ones
│ ├── Repository-level modifications (not just single-file edits)
│ └── Output: Modified codebase
│
├── 4. EXECUTE
│ ├── Run experiments in sandboxed environment
│ ├── Monitor for errors and failures
│ ├── Debug and iterate if necessary
│ └── Output: Experimental results
│
└── 5. LOG
├── Record experimental results
├── Generate experimental logs
├── Compute metrics vs. baseline
└── Output: Result record f(I_{t+1})
13 Memory Management
Findings Memory: Architecture and Data Model
The Findings Memory is the central data structure of DeepScientist — a cumulative, structured knowledge base that grows throughout a campaign. Unlike ephemeral LLM context or conversation history, the Findings Memory is a persistent, typed database of scientific findings.
Finding Record Schema
Each finding in the memory follows a structured schema:
Finding Record
├── id: str ← Unique identifier
├── level: enum {Idea, Implement, Progress} ← Promotion level
├── hypothesis: str ← Natural language description of the idea
├── rationale: str ← Why this hypothesis might work
├── related_findings: list[str] ← IDs of findings that informed this one
├── valuation: ValuationVector ← V = <v_u, v_q, v_e>
│ ├── utility: float ← Expected performance improvement
│ ├── quality: float ← Methodological soundness
│ └── exploration: float ← Novelty relative to explored space
├── implementation: Optional[ImplementationRecord]
│ ├── code_changes: list[str] ← Description of modifications
│ ├── experimental_setup: str ← How experiments were configured
│ ├── result: float ← f(I) — actual performance
│ ├── baseline_delta: float ← Improvement over baseline
│ └── logs: str ← Experimental output logs
├── analysis: Optional[AnalysisRecord] ← Only for Progress Findings
│ ├── ablation_results: dict ← Ablation study outcomes
│ ├── cross_dataset_results: dict ← Performance on additional datasets
│ └── paper_draft: str ← Generated paper content
├── created_at: datetime
├── updated_at: datetime
└── source: enum {Human, System} ← Human knowledge or system-generated
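The schema transcribes directly into Python dataclasses. Field names follow the tree above; the defaults and the example values are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class Level(Enum):
    IDEA = "idea"
    IMPLEMENT = "implement"
    PROGRESS = "progress"

class Source(Enum):
    HUMAN = "human"
    SYSTEM = "system"

@dataclass
class ValuationVector:
    utility: float      # v_u — expected performance improvement
    quality: float      # v_q — methodological soundness
    exploration: float  # v_e — novelty relative to explored space

@dataclass
class ImplementationRecord:
    code_changes: list[str]
    experimental_setup: str
    result: float       # f(I) — actual performance
    baseline_delta: float
    logs: str

@dataclass
class FindingRecord:
    id: str
    level: Level
    hypothesis: str
    rationale: str
    valuation: ValuationVector
    related_findings: list[str] = field(default_factory=list)
    implementation: Optional[ImplementationRecord] = None
    source: Source = Source.SYSTEM
    created_at: datetime = field(default_factory=datetime.now)

# Illustrative record (hypothesis text is invented for the example):
idea = FindingRecord(
    id="f-001", level=Level.IDEA,
    hypothesis="Adaptive draft length for speculative decoding",
    rationale="Past findings show fixed draft lengths underperform",
    valuation=ValuationVector(utility=0.7, quality=0.8, exploration=0.4),
)
```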
Memory Growth Dynamics
Campaign Timeline (1 month)
│
├── Day 1-3: Seeding
│ ├── Human knowledge loaded (papers, baselines, known methods)
│ ├── Initial Idea Findings generated from seed knowledge
│ └── Memory size: ~50-200 findings (mostly human-sourced)
│
├── Day 3-10: Early Exploration
│ ├── High κ (exploration coefficient): system tries diverse hypotheses
│ ├── Many implementations fail (building negative knowledge)
│ ├── First successful implementations appear
│ └── Memory size: ~500-1,500 findings
│
├── Day 10-20: Focused Exploitation
│ ├── System identifies promising directions from early successes
│ ├── κ may decrease as promising regions are found
│ ├── Implementations become more targeted
│ ├── Multiple Progress Findings emerge
│ └── Memory size: ~2,000-3,500 findings
│
└── Day 20-30: Refinement
├── Deep exploitation of most promising directions
├── Ablation studies and cross-dataset evaluation
├── Paper synthesis for best results
└── Memory size: ~4,000-5,000+ findings
Retrieval System for Context Management
As the Findings Memory grows, it becomes too large for a single LLM context window. The retrieval system addresses this:
Retrieval Pipeline
│
├── Index Construction
│ ├── Each finding is embedded (hypothesis text + metadata)
│ ├── Index updated incrementally as new findings are added
│ └── Supports both semantic and keyword search
│
├── Query Construction
│ ├── Current task context
│ ├── Recent findings (last N)
│ ├── Identified gaps and opportunities
│ └── Combined into retrieval query
│
├── Top-K Selection
│ ├── Retrieve K most relevant findings
│ ├── K sized to fit within LLM context budget
│ ├── Balance: recent findings + historically important findings
│ └── Include both successes and failures for balanced context
│
└── Context Assembly
├── Task description (fixed)
├── Retrieved Top-K findings (variable)
├── Recent findings (sliding window)
└── Combined context → Strategist LLM
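A toy sketch of the Top-K step, with a bag-of-words counter standing in for the embedding model (the production system would use a learned embedding index; the finding texts are invented):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: token counts instead of a learned vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, findings: list[str], k: int) -> list[str]:
    # K is sized to fit the Strategist's LLM context budget.
    q = embed(query)
    ranked = sorted(findings, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:k]

findings = [
    "speculative decoding reduced latency by 1.8x",
    "cache quantization hurt accuracy on long contexts",
    "agent failure attribution improved with step-level logs",
]
print(top_k("latency reduction via speculative decoding", findings, k=2))
```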
Memory as Scientific Knowledge Graph
The Findings Memory implicitly forms a knowledge graph through the related_findings field. Each finding references the findings that informed it, creating a directed acyclic graph (DAG) of scientific reasoning:
Human Knowledge (papers, baselines)
├── Idea A (inspired by Paper X)
│ ├── Implement A (failed: -2.3% vs baseline)
│ └── Idea B (inspired by Idea A's failure)
│ ├── Implement B (succeeded: +4.1% vs baseline)
│ │ └── Progress B (ablation confirms contribution)
│ └── Idea C (combining Idea B with Paper Y)
│ └── Implement C (succeeded: +7.9% vs baseline) ← PA-Detect
│ └── Progress C → Paper
│
├── Idea D (independent of A)
│ └── Implement D (failed: -0.5% vs baseline)
│ └── [Negative result informs future hypotheses]
│
└── ... thousands more paths, most ending in failure
This graph structure enables the Strategist to understand not just what has been tried, but why it was tried and what it led to. The causal chain from human knowledge through failed attempts to eventual success is the intellectual history of the campaign — and it's fully recorded in the Findings Memory.
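Because each record points at its parents via related_findings, the intellectual history of a success can be recovered with a depth-first walk; a sketch using the illustrative IDs from the example above:

```python
# Parent links mirror the example DAG above (IDs are illustrative).
parents = {
    "progress-C": ["implement-C"],
    "implement-C": ["idea-C"],
    "idea-C": ["idea-B", "paper-Y"],
    "idea-B": ["implement-A"],      # inspired by Idea A's failure
    "implement-A": ["idea-A"],
    "idea-A": ["paper-X"],
}

def lineage(finding_id: str) -> list[str]:
    """Trace a finding back to its human-knowledge roots (post-order DFS)."""
    seen, order = set(), []
    def visit(fid: str) -> None:
        if fid in seen:
            return
        seen.add(fid)
        for parent in parents.get(fid, []):
            visit(parent)
        order.append(fid)           # parents appear before children
    visit(finding_id)
    return order

print(lineage("progress-C"))
# ['paper-X', 'idea-A', 'implement-A', 'idea-B', 'paper-Y',
#  'idea-C', 'implement-C', 'progress-C']
```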
Comparison to Other Systems' Memory
| System | Memory Type | Persistence | Structure | Growth |
|---|---|---|---|---|
| AI Scientist | None (stateless per paper) | Session only | Unstructured | No |
| Autoresearch (Karpathy) | Git history + results.tsv | Permanent | Flat log | Linear |
| AlphaEvolve | MAP-Elites archive | Per-run | Grid (behavior space) | Bounded |
| FunSearch | Island populations | Per-run | Best-shot per island | Bounded |
| OpenEvolve | Multi-island populations | Checkpointed | Population per island | Bounded |
| EurekaClaw | 4-tier memory system | Permanent | Tiered (RAM → disk → graph → insights) | Unbounded |
| DeepScientist | 3-level Findings Memory | Campaign duration | Hierarchical (Idea → Implement → Progress) | Unbounded |
DeepScientist's memory is unique in several ways:
1. Typed hierarchy: the three levels (Idea/Implement/Progress) reflect the actual scientific workflow
2. Valuation vectors: each finding carries quantitative assessments, not just text
3. Dual-source: human knowledge and system knowledge coexist in the same structure
4. Negative results preserved: failed implementations are valuable data, not discarded
5. Retrieval-backed scaling: memory can grow beyond context limits without losing access
14 Continued Learning
Intra-Campaign Learning
DeepScientist's primary learning mechanism operates within a single campaign. The Findings Memory accumulates knowledge that directly influences future hypothesis generation:
Learning Loop (within campaign)
│
├── t=0: Strategist has only human knowledge
│ └── Hypotheses are broad, exploratory, human-knowledge-biased
│
├── t=100: Memory contains ~100 findings
│ └── Strategist begins recognizing patterns in failures
│ └── Hypotheses become more targeted
│
├── t=500: Memory contains ~500 findings
│ └── Strategist has a model of "what works" for this task
│ └── Exploration focuses on variations of successful approaches
│
├── t=1000: Memory contains ~1000+ findings
│ └── Strategist's implicit model is refined
│ └── Hypotheses are highly focused, diminishing marginal returns
│
└── Qualitative shift: from exploration to exploitation as campaign progresses
This is not "learning" in the machine learning sense (no weights are updated). It is in-context learning at the campaign level — the LLM's hypothesis generation improves as it receives more information about what works and what doesn't. The Findings Memory serves as the "training set" for this in-context learning.
The Surrogate Model's Implicit Improvement
A subtlety of DeepScientist's BO formulation: the surrogate model (LLM Reviewer) implicitly improves over the course of a campaign, even though its weights are frozen:
Surrogate Accuracy Over Time
│
├── t=0: LLM Reviewer evaluates hypotheses based on general scientific knowledge
│ └── Accuracy: Low (no task-specific calibration)
│ └── The v_u, v_q, v_e scores are educated guesses
│
├── t=100: LLM Reviewer sees hypothesis + 100 past findings as context
│ └── Accuracy: Improving (can compare against known results)
│ └── Valuation becomes data-driven, not just prior-driven
│
├── t=500: LLM Reviewer sees hypothesis + Top-K from 500 findings
│ └── Accuracy: Moderate (has empirical calibration data)
│ └── Can estimate improvement magnitude based on similar past findings
│
└── t=1000+: LLM Reviewer sees hypothesis + Top-K from 1000+ findings
└── Accuracy: Highest (rich empirical basis for judgment)
└── Effectively calibrated against hundreds of real experiments
This mirrors classical BO: as more (x, y) pairs are observed, the Gaussian Process posterior becomes more accurate. In DeepScientist, as more (hypothesis, result) pairs accumulate in memory, the LLM Reviewer's in-context "posterior" becomes more accurate. The mechanism is entirely different (statistical vs. prompt-based), but the functional effect is similar.
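The UCB acquisition step implied by this analogy can be sketched as follows. Note the assumptions: how the per-hypothesis mean is derived from the valuation vector and how the uncertainty proxy is obtained from the LLM Reviewer are not specified here at this level of detail, so the dict keys and the annealing schedule are illustrative choices, not the paper's exact formulation.

```python
def ucb_select(candidates: list, kappa: float = 1.5) -> dict:
    """Pick the hypothesis maximizing mean + kappa * uncertainty (the UCB rule).

    'mean' is an LLM-estimated expected improvement; 'std' is an
    uncertainty proxy. High kappa favors uncertain, high-variance bets.
    """
    return max(candidates, key=lambda c: c["mean"] + kappa * c["std"])


def kappa_schedule(t: int, t_max: int, k0: float = 2.0, k1: float = 0.5) -> float:
    """Linearly anneal the exploration coefficient over the campaign.

    One way to realize the exploration-to-exploitation shift described
    above (an assumption for illustration, not confirmed by the paper).
    """
    return k0 + (k1 - k0) * (t / t_max)
```

Early in a campaign (large kappa), a risky hypothesis with high uncertainty can outrank a safe one; late in the campaign (small kappa), the same candidates flip order.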
Cross-Campaign Learning
The paper does not explicitly describe cross-campaign learning (transferring findings from one task's campaign to another). This is a notable gap:
| Learning Type | Within Campaign | Across Campaigns |
|---|---|---|
| Hypothesis quality improvement | Yes (via Findings Memory) | Not described |
| Surrogate calibration | Yes (via in-context learning) | Not described |
| Strategy evolution | Yes (via accumulated patterns) | Not described |
| Method transfer | N/A | Not described |
Potential for cross-campaign learning:
- Meta-strategies that work across tasks (e.g., "ablation-first approaches are reliable") could be extracted and reused
- Findings Memory from one task could seed another task's initial hypotheses
- The exploration coefficient κ could be calibrated based on past campaigns' innovation rates
Comparison to Evolutionary Learning
DeepScientist's within-campaign learning differs from evolutionary systems in important ways:
| Aspect | Evolutionary (FunSearch, AlphaEvolve) | Bayesian Optimization (DeepScientist) |
|---|---|---|
| What evolves | Population of programs | Memory of findings |
| Selection | Fitness-proportional | UCB acquisition function |
| Recombination | Crossover of programs | LLM synthesis of ideas from multiple findings |
| Mutation | LLM-based code perturbation | LLM-based hypothesis generation |
| Memory | Population (bounded) | Findings Memory (unbounded) |
| Learning signal | Fitness score (scalar) | Valuation vector (3D) + experimental result |
| Convergence | Population concentrates | Exploitation weight increases |
The key difference: evolutionary systems learn by maintaining a population of solutions that improves through selection and variation. DeepScientist learns by maintaining a memory of knowledge that informs hypothesis generation. The evolutionary approach is bottom-up (good solutions survive); the BO approach is top-down (principled selection guides exploration).
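The selection-rule contrast in the table above can be illustrated with two toy functions: stochastic fitness-proportional sampling versus a deterministic acquisition argmax. This is a sketch of the two selection mechanics only, not either system's actual code.

```python
import random


def fitness_proportional(population: list, rng: random.Random):
    """Evolutionary-style selection: pick a program with probability
    proportional to its fitness (roulette-wheel sampling)."""
    total = sum(fit for _, fit in population)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for prog, fit in population:
        acc += fit
        if r <= acc:
            return prog
    return population[-1][0]  # guard against floating-point edge cases


def ucb_argmax(findings: list, kappa: float = 1.0) -> str:
    """BO-style selection: deterministic argmax of the acquisition score."""
    return max(findings, key=lambda f: f["mean"] + kappa * f["std"])["id"]
```

The bottom-up/top-down distinction shows up directly: the evolutionary rule is stochastic and operates over surviving solutions, while the BO rule deterministically ranks every candidate against the full memory-informed score.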
Human-in-the-Loop Learning
The 3 human experts who verify DeepScientist's outputs represent a learning mechanism that the paper somewhat underplays:
Human Expert Contribution
│
├── Filter: Reject hallucinated or trivially flawed results
│ └── Prevents false Progress Findings from contaminating memory
│
├── Validate: Confirm genuine innovations are real
│ └── Provides ground truth that the system cannot generate alone
│
├── Guide (implicit): Expert attention patterns may influence priority
│ └── Unclear if experts can intervene during campaigns
│
└── Quality gate: Final barrier before claiming SOTA
└── Ensures published methods are genuinely novel and correct
The human experts serve as a high-quality but low-bandwidth "oracle" — they cannot evaluate thousands of findings, but they can verify the small number of Progress Findings. This hybrid autonomy (system explores broadly, humans verify narrowly) is a pragmatic architecture for current LLM capabilities.
15 Applications
Primary Application: Autonomous Frontier AI Research
DeepScientist is designed for a specific class of problems:
| Criterion | Requirement | Rationale |
|---|---|---|
| Codebase | Existing repository with baseline implementation | Agent needs a starting point to modify |
| Metrics | Well-defined quantitative evaluation metrics | UCB acquisition needs scalar scores |
| Search space | Large space of possible improvements | Justifies the BO overhead vs. manual search |
| Evaluation cost | Moderate (hours, not weeks per trial) | Month-long campaign needs ~1,100 evaluations |
| Domain | Frontier AI research | LLM reasoning is strongest in this domain |
| Baseline | Known SOTA for comparison | Progression requires a target to beat |
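A task configuration satisfying these criteria might look like the sketch below. Every key, path, and value here is a hypothetical placeholder invented for illustration; the paper does not publish a configuration schema.

```python
# Hypothetical campaign configuration mirroring the criteria table above.
campaign_config = {
    "codebase": "/workspace/baseline-repo",  # existing baseline implementation
    "metric": "auroc",                       # well-defined quantitative metric
    "metric_direction": "maximize",
    "baseline_score": 0.87,                  # known SOTA to beat
    "eval_cost_hours": 2.0,                  # hours per trial (moderate, not weeks)
    "budget_evaluations": 1100,              # ~1,100 implementations per campaign
    "campaign_days": 30,
}


def is_viable(cfg: dict) -> bool:
    """Rough feasibility check: can the evaluation budget fit in the
    campaign window? Assumes ~4 parallel GPU instances (an assumption)."""
    return (cfg["eval_cost_hours"] * cfg["budget_evaluations"]
            <= cfg["campaign_days"] * 24 * 4)
```

The check makes the "evaluation cost: moderate" criterion quantitative: at 2 GPU-hours per trial, ~1,100 trials fit in a month only with parallelism; week-long evaluations would not.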
Demonstrated Application Domains
| Domain | Task | DeepScientist Method | Result |
|---|---|---|---|
| AI Agents | Failure attribution | A2P (Abduction-Action-Prediction) | +183.7% accuracy |
| Systems/ML | Inference acceleration | ACRA | +1.9% tokens/s |
| NLP/Security | AI text detection | PA-Detect | +7.9% AUROC, +190% speed |
Potential Extension Domains
Based on the system's architecture, DeepScientist could be applied to any domain meeting the above criteria:
| Domain | Example Task | Feasibility | Notes |
|---|---|---|---|
| Computer Vision | Object detection on COCO | High | Well-defined metrics, existing codebases |
| NLP | Machine translation (BLEU) | High | Standard benchmarks, clear evaluation |
| Reinforcement Learning | Sample efficiency on Atari | Medium | Evaluation is expensive (many episodes) |
| Drug Discovery | Molecular property prediction | Medium | Requires domain-specific knowledge |
| Robotics | Control policy optimization | Low | Physical experiments not feasible |
| Theorem Proving | Proof success rate | Medium | Needs formal verification tooling |
| Code Generation | HumanEval/MBPP | High | Well-defined metrics, fast evaluation |
Integration Scenarios
Scenario 1: Corporate AI Research Lab
Research team identifies frontier task with stagnating progress
→ Configure DeepScientist with task repository and baselines
→ Allocate GPU cluster for month-long campaign
→ System generates thousands of hypotheses
→ UCB guides exploration/exploitation
→ ~20 innovations discovered
→ Human researchers review and validate top results
→ Publish methods that surpass SOTA
→ ROI: 3 SOTA methods per month per GPU cluster
Scenario 2: Academic Research Acceleration
PhD student working on a specific AI problem
→ Student provides existing codebase + baselines + metrics
→ Scaled-down campaign (1-2 GPUs, 1 week)
→ System explores hypothesis space student hasn't considered
→ Generates ~100 implementations, ~5-10 potential improvements
→ Student analyzes results, integrates best ideas
→ Accelerates research timeline from months to weeks
Scenario 3: Benchmark Competition
Competition organizer releases new benchmark
→ Configure DeepScientist with competition codebase and metrics
→ Run parallel campaigns with different task configurations
→ System explores solution space exhaustively
→ Submit best Progress Findings to leaderboard
→ Potential for autonomous competition entries
Limitations
| Limitation | Severity | Impact | Mitigation Path |
|---|---|---|---|
| Extreme compute cost | High | Restricts use to well-funded labs | Model cost decreases, more efficient search |
| API dependency | High | Relies on Gemini and Claude APIs | Open-weight model alternatives |
| Human expert requirement | Medium | 3 experts needed for verification | Better automated verification |
| Domain restriction | Medium | Currently limited to frontier AI tasks | Architecture is domain-agnostic |
| Low conversion rate | Medium | 0.42% ideas → innovations | Better surrogate models, smarter acquisition |
| Month-long campaigns | Medium | Slow iteration on system design | Shorter campaigns for prototyping |
| Surrogate calibration | Medium | LLM Reviewer may misjudge hypothesis value | Calibration against actual results |
| Reproducibility | Medium | Stochastic LLM outputs | Seed control, ensemble strategies |
| Negative result waste | Low | Most compute produces failures | Failures inform future search (by design) |
The Efficiency Question
The paper's most provocative framing concerns the efficiency of autonomous discovery:
"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"
This question has profound implications for the field:
Current state (DeepScientist):
5,000 ideas → 1,100 implementations → 21 innovations → 3 SOTA
Conversion rate: 0.06% (ideas to SOTA)
Cost: ~$200K-762K for 3 SOTA methods
Hypothetical 10x improvement:
5,000 ideas → 1,100 implementations → 210 innovations → 30 SOTA
Conversion rate: 0.6% (ideas to SOTA)
Cost: ~$7K-25K per SOTA method (same ~$200K-762K total budget, now spread over 30 SOTA methods)
Hypothetical 100x improvement:
500 ideas → 200 implementations → 50 innovations → 10 SOTA
Conversion rate: 2% (ideas to SOTA)
Cost: ~$2K-8K per SOTA method
At 100x improvement, autonomous research becomes economically accessible to individual researchers. The question is whether better surrogate models, smarter acquisition functions, or more efficient implementation agents can achieve this.
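The funnel arithmetic above can be checked directly. The cost figures reuse the ~$200K-762K total range reported earlier; the function name is illustrative.

```python
def funnel_stats(ideas: int, sota: int,
                 total_cost_lo: float, total_cost_hi: float):
    """Conversion rate (ideas -> SOTA) and per-SOTA cost range."""
    conv = sota / ideas
    return conv, (total_cost_lo / sota, total_cost_hi / sota)


# Current state: 5,000 ideas, 3 SOTA, ~$200K-762K total.
conv, (lo, hi) = funnel_stats(5_000, 3, 200_000, 762_000)
# conv = 0.0006 (0.06%); per-SOTA cost ranges from ~$66.7K to $254K
```

At 100x efficiency (2% conversion, a tenth of the ideas and compute), the same arithmetic gives roughly $2K-8K per SOTA method, which is the threshold at which the paper's "economically accessible" claim becomes plausible.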
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Mathematically principled BO framework for discovery | Extreme compute requirements (20,000+ GPU hours) |
| Verified SOTA-surpassing results on 3 tasks | Low conversion rate (0.42% of ideas lead to innovation) |
| 60% accept rate (3/5 papers accepted by DeepReviewer) | DeepReviewer is from the same team — potential bias |
| Human expert review confirms venue-quality papers | Only 3 human experts — limited statistical power |
| UCB provides principled exploration/exploitation | Surrogate model (LLM Reviewer) lacks calibrated uncertainty |
| Findings Memory preserves negative results | No cross-campaign learning mechanism |
| Dual-model architecture (reasoning + coding) | API dependency on frontier models |
| Scalable via parallel GPU instances | Month-long campaigns limit iteration speed |
| Clear improvement over all prior systems | Human supervision still required for verification |
Historical Significance
DeepScientist marks a watershed moment in autonomous research: the first system to demonstrate verified, quantitative improvements over human state-of-the-art on frontier AI tasks. Previous systems (AI Scientist, CycleResearcher, Zochi) demonstrated the ability to generate plausible research papers, but none demonstrated the ability to produce methods that actually work better than existing ones.
The progression from CycleResearcher to DeepScientist mirrors the broader trajectory of the field:
2024: Can AI write research papers? → Yes, but low quality (AI Scientist)
2024: Can AI improve paper quality? → Yes, via review-driven refinement (CycleResearcher)
2025: Can AI make real discoveries? → Yes, but at enormous cost (DeepScientist)
2026: Can AI do this efficiently? → Open question
DeepScientist answers the "Can AI make real discoveries?" question affirmatively, but its 0.06% conversion rate and $200K+ cost per task make clear that the efficiency problem is the next frontier. The Bayesian Optimization framework provides the right conceptual foundation for attacking this problem — better surrogate models, smarter acquisition functions, and more efficient implementation agents could dramatically reduce the cost of autonomous discovery.
The system's honest reporting of its funnel metrics (5,000 → 1,100 → 21 → 3) is itself a contribution. It quantifies what everyone suspected but no one had measured: autonomous scientific discovery is possible but profoundly inefficient with current technology. This sets a concrete benchmark for future systems to improve upon.
Analysis prepared April 2026. Based on arXiv:2509.26603 and publicly available materials from the ResearAI project.