
DeepScientist

Bayesian Optimization-guided autonomous scientific discovery system that surpassed human state-of-the-art on three frontier AI tasks through month-long continuous GPU campaigns

Organization: Westlake University (Engineering School)
Published: September 30, 2025
Type: paper (arXiv:2509.26603)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

  • arXiv: 2509.26603
  • Published: September 30, 2025
  • Code: github.com/ResearAI/DeepScientist
  • Project Page: ai-researcher.net
  • License: CC BY-NC-SA 4.0
  • Tagline: Progressive scientific discovery through Bayesian Optimization over an LLM-driven hypothesis–implementation–analysis cycle
  • Default models: Gemini-2.5-Pro (core logic / reviewer / strategist), Claude-4-Opus (code generation / implementation agent)
  • Input modes: Fully autonomous month-long research campaigns on frontier AI tasks, seeded with a codebase repository and initial findings memory

Naming and Lineage

"DeepScientist" signals two things: the "Deep" prefix invokes both deep learning and the depth of the system's search (thousands of hypotheses, hundreds of implementations, month-long campaigns), while "Scientist" positions the system as an autonomous researcher rather than merely an assistant or copilot. The name also places the system in deliberate contrast to Sakana AI's "AI Scientist" — claiming a more rigorous, results-driven approach to the same aspiration.

The system is a direct evolution of the CycleResearcher line of work from the same lead author (Yixuan Weng). CycleResearcher introduced the idea of review-driven iterative refinement of AI-generated research papers. DeepScientist extends this from paper refinement to full-stack scientific discovery — from hypothesis generation through experimental validation to paper synthesis.

Lineage Chain

CycleResearcher (Weng et al., 2024)
│   Review-driven iterative paper refinement
│   Introduced DeepReviewer for automated evaluation
│
└── DeepScientist (Weng et al., 2025)  ← this system
    Full Bayesian Optimization formulation
    Hypothesis → Implementation → Analysis cycle
    Findings Memory as cumulative knowledge base
    Surpassed human SOTA on 3 frontier tasks

The evolution from CycleResearcher to DeepScientist represents a fundamental shift: CycleResearcher operated on papers (text artifacts), while DeepScientist operates on methods (code + experiments + results). The review model from CycleResearcher becomes the surrogate function in DeepScientist's Bayesian Optimization loop.

Unique Position in the Ecosystem

DeepScientist is the first autonomous research system to demonstrate verified state-of-the-art surpassing results on frontier AI tasks. While other systems (AI Scientist, AI Researcher, CycleResearcher) generate research papers that are then evaluated by automated reviewers, DeepScientist produces working implementations that measurably outperform existing methods. This is the critical distinction: the output is not a paper that scores well on review metrics — it is a method that scores well on task metrics.

Ecosystem Positioning
│
├── AI Scientist (Sakana)      — breadth: generates ML experiment papers
├── AI Researcher (Alibaba)    — breadth: generates research papers from ideas
├── CycleResearcher (Westlake) — refinement: iterative paper improvement via review
├── AI Scientist v2 (Sakana)   — evolution: open-ended, multi-paper campaigns
├── Zochi                      — quality: high-quality paper generation
└── DeepScientist (Westlake)   — depth + results: BO-guided discovery with SOTA outcomes ← this system

2 Authors and Team

Author        Role (Inferred)               Note
Yixuan Weng*  Co-lead, system architect     Same lead author as CycleResearcher; * indicates equal contribution
Minjun Zhu*   Co-lead, implementation lead  * indicates equal contribution
Qiujie Xie    Core contributor
Qiyao Sun     Core contributor
Zhen Lin      Core contributor
Sifan Liu     Core contributor
Yue Zhang†    Corresponding author, PI      † indicates corresponding; senior researcher at Westlake

BibTeX Citation

@article{weng2025deepscientist,
  title     = {DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively},
  author    = {Weng, Yixuan and Zhu, Minjun and Xie, Qiujie and Sun, Qiyao
               and Lin, Zhen and Liu, Sifan and Zhang, Yue},
  journal   = {arXiv preprint arXiv:2509.26603},
  year      = {2025}
}

Team composition: Seven authors from Westlake University's Engineering School. The equal-contribution designation for the first two authors suggests a system architect / implementation lead split — Weng bringing the conceptual framework from CycleResearcher, Zhu leading the engineering implementation. Yue Zhang as corresponding author and PI indicates this is a focused research lab effort with strong senior supervision.

Institutional context: Westlake University is a private research-intensive university in Hangzhou, China, founded in 2018 with an explicit mandate for frontier research. The Engineering School's NLP/AI group (under Yue Zhang) has been productive in the AI-for-science space, with CycleResearcher and DeepScientist representing a coherent multi-year research program rather than a one-off contribution.

Human supervision: The paper acknowledges 3 human experts who verified outputs and filtered hallucinations. This is significant — DeepScientist is not fully autonomous in the way Karpathy's autoresearch is. Human experts serve as a final filter on the discovery pipeline, ensuring that claimed innovations are genuine. The paper is transparent about this, which strengthens its credibility.


3 Core Contribution

Key Novelty: DeepScientist formalizes autonomous scientific discovery as a Bayesian Optimization problem, where the search space is all possible scientific methods, the objective function is the true value of a method, and an LLM Reviewer serves as the surrogate model — enabling UCB-guided exploration/exploitation of hypothesis space that produced 21 genuine scientific innovations from ~5,000 generated ideas, including three methods that surpassed human state-of-the-art.

The Bayesian Optimization Formulation

This is DeepScientist's central intellectual contribution — not just a system, but a mathematical framework for autonomous discovery. The formulation:

Objective:

I* = argmax_{I ∈ 𝓘} f(I)

Where:
  • 𝓘 is the space of all possible scientific methods (hypotheses, algorithms, implementations)
  • f(I) is the true value function — the actual performance of method I when implemented and evaluated
  • I* is the globally optimal method (unknown, approximated through search)

The Problem: Evaluating f(I) is extremely expensive. Each evaluation requires:

  1. Generating a full implementation of method I
  2. Running experiments (potentially hours of GPU time)
  3. Analyzing results against baselines

This makes exhaustive search impossible. Bayesian Optimization provides the principled framework for deciding which hypothesis to evaluate next.

The Surrogate Model: DeepScientist uses an LLM Reviewer as the surrogate model that approximates f cheaply. For each candidate hypothesis, the reviewer produces a valuation vector:

V = <v_u, v_q, v_e>

Where:
  • v_u = utility value — estimated practical impact and performance improvement
  • v_q = quality value — estimated methodological soundness and rigor
  • v_e = exploration value — estimated novelty and distance from previously explored regions

This three-dimensional valuation is a significant design choice. Traditional Bayesian Optimization uses a scalar objective; DeepScientist explicitly decomposes the surrogate into orthogonal axes that capture different aspects of scientific value.

The Acquisition Function: UCB (Upper Confidence Bound) balances exploitation of promising hypotheses with exploration of novel directions:

I_{t+1} = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)

Where:
  • w_u = weight on utility (exploitation signal)
  • w_q = weight on quality (exploitation signal)
  • κ = exploration coefficient (controls the exploration–exploitation tradeoff)
  • v_e = exploration value (pure exploration signal)

The UCB formulation has a well-known theoretical foundation in the multi-armed bandit literature. By casting hypothesis selection as a bandit problem with an LLM-estimated reward model, DeepScientist inherits the regret guarantees and convergence properties of UCB — at least in principle, modulo the accuracy of the LLM surrogate.
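The acquisition rule can be sketched in a few lines. The weights w_u, w_q and κ below are illustrative placeholders (the paper does not publish its exact values), and the hard-coded scores stand in for LLM Reviewer output:

```python
from dataclasses import dataclass

@dataclass
class Valuation:
    """LLM Reviewer output V = <v_u, v_q, v_e>, here as 0-10 scores."""
    v_u: float  # utility: estimated practical impact
    v_q: float  # quality: estimated methodological soundness
    v_e: float  # exploration: estimated novelty vs. explored regions

def ucb_score(v: Valuation, w_u: float = 1.0, w_q: float = 0.5, kappa: float = 2.0) -> float:
    """Acquisition score: w_u*v_u + w_q*v_q + kappa*v_e."""
    return w_u * v.v_u + w_q * v.v_q + kappa * v.v_e

def select_next(candidates: dict, **weights) -> str:
    """I_{t+1} = the hypothesis with the highest UCB score."""
    return max(candidates, key=lambda h: ucb_score(candidates[h], **weights))

candidates = {
    "cache-aware batching":  Valuation(v_u=7.0, v_q=8.0, v_e=2.0),  # safe, well-trodden
    "abductive attribution": Valuation(v_u=5.0, v_q=6.0, v_e=9.0),  # risky, novel
}
# A large kappa favors the novel hypothesis despite its lower expected utility;
# shrinking kappa flips the choice back to the safe one.
print(select_next(candidates))             # abductive attribution
print(select_next(candidates, kappa=0.1))  # cache-aware batching
```

The two calls above show the tradeoff κ controls: with κ = 2.0 the exploration term dominates and the novel hypothesis is chosen; with κ = 0.1 the expected-utility terms dominate.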

Why This Formulation Matters

Most autonomous research systems use ad-hoc methods for deciding what to try next:

System                   Hypothesis Selection Strategy                       Theoretical Foundation
AI Scientist             LLM generates ideas, scores by novelty/feasibility  Heuristic
AlphaEvolve              MAP-Elites archive + LLM mutation                   Evolutionary (quality-diversity)
FunSearch                LLM mutation + island model + best-shot sampling    Evolutionary
Autoresearch (Karpathy)  LLM decides freely, greedy hill-climbing            None (LLM intuition)
OpenEvolve               Multi-island evolution + bandit-based selection     Partially principled
DeepScientist            UCB acquisition over LLM surrogate                  Bayesian Optimization

DeepScientist is unique in providing a mathematically principled framework for the explore/exploit tradeoff in scientific discovery. The UCB acquisition function is not just a heuristic — it is an instance of a well-studied algorithm with known optimality properties.

The Five Differentiating Claims

  1. Mathematical formulation: Scientific discovery as Bayesian Optimization, not ad-hoc search
  2. Surrogate model: LLM Reviewer as cheap approximation to expensive evaluation
  3. UCB acquisition: Principled exploration/exploitation balance, not greedy or random
  4. Hierarchical memory: Three-level Findings Memory that accumulates knowledge across campaigns
  5. Verified SOTA results: Three methods that measurably surpassed human state-of-the-art

System                            Origin    Paradigm                   Output            SOTA Claims    Scale
AI Scientist (Sakana, 2024)       Industry  LLM paper generation       Papers            No             ~10 papers
AI Researcher (Alibaba, 2024)     Industry  Paper generation pipeline  Papers            No             ~7 papers
CycleResearcher (Westlake, 2024)  Academic  Review-driven refinement   Papers            No             ~6 papers
AI Scientist v2 (Sakana, 2025)    Industry  Open-ended campaigns       Papers            No             ~3 papers
Zochi (2025)                      Unknown   High-quality generation    Papers            No             ~2 papers
DeepScientist (Westlake, 2025)    Academic  Bayesian Optimization      Methods + Papers  Yes (3 tasks)  ~5,000 ideas → 21 innovations

The scale differential is striking. While other systems produce single-digit papers, DeepScientist generates thousands of hypotheses and hundreds of implementations. The funnel ratio (~5,000 → ~1,100 → 21) reveals the true difficulty of autonomous discovery: only 0.42% of generated ideas lead to genuine innovations. This is a profoundly important empirical finding about the nature of LLM-driven research.


4 Supported Solutions

Primary Domain: Frontier AI Research Tasks

DeepScientist targets frontier AI research problems where:

  1. A codebase repository exists with a baseline implementation
  2. Quantitative evaluation metrics are well-defined
  3. The search space of possible improvements is large
  4. Experimental validation is computationally feasible (hours, not weeks per trial)

The Three Demonstrated Tasks

Task                        Domain        Baseline                      Metric           DeepScientist Method               Improvement
Agent Failure Attribution   AI agents     Existing attribution methods  Accuracy         A2P (Abduction-Action-Prediction)  +183.7%
LLM Inference Acceleration  Systems/ML    Standard inference pipelines  Tokens/second    ACRA                               +1.9%
AI Text Detection           NLP/Security  Detection classifiers         AUROC + Latency  PA-Detect                          +7.9% AUROC, -65.5% latency

Task 1: Agent Failure Attribution

Problem: Given a trace of an AI agent's actions and an observed failure, attribute the failure to the specific action(s) that caused it.

DeepScientist's Discovery — A2P (Abduction-Action-Prediction): The A2P method uses a three-phase reasoning approach:

  1. Abduction — infer possible causes of the failure from the observed outcome
  2. Action analysis — evaluate each action in the trace against the abduced causes
  3. Prediction — predict which action-cause pair best explains the failure

The +183.7% improvement over baseline is by far the largest gain among the three tasks. This suggests the baseline was particularly weak or the problem was particularly amenable to LLM-style reasoning improvements.

Task 2: LLM Inference Acceleration

Problem: Accelerate the inference speed of large language models without degrading output quality.

DeepScientist's Discovery — ACRA: The +1.9% improvement in tokens/second is modest in absolute terms but significant in context — inference optimization is a heavily-studied area where marginal gains are hard-won. The method name suggests it involves some form of adaptive or conditional computation routing.

Task 3: AI Text Detection

Problem: Distinguish AI-generated text from human-written text.

DeepScientist's Discovery — PA-Detect: This is arguably the most impressive result across two dimensions:

  • +7.9% AUROC — substantial improvement in detection accuracy
  • -65.5% latency — simultaneously making detection nearly 3x faster

Achieving accuracy and speed improvements simultaneously is rare — most optimizations trade one for the other. PA-Detect's dual improvement suggests a fundamentally better approach rather than incremental tuning.
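A Pareto improvement can be stated precisely: one method dominates another if it is at least as good on every objective and strictly better on at least one. The absolute AUROC and latency values below are illustrative stand-ins (the paper reports only relative deltas):

```python
def dominates(a, b):
    """True if point a Pareto-dominates point b: at least as good on both
    objectives and strictly better on at least one.
    Points are (auroc, latency); higher AUROC is better, lower latency is better."""
    auroc_a, lat_a = a
    auroc_b, lat_b = b
    no_worse = auroc_a >= auroc_b and lat_a <= lat_b
    strictly_better = auroc_a > auroc_b or lat_a < lat_b
    return no_worse and strictly_better

# Illustrative baseline; only the deltas (+7.9% AUROC, -65.5% latency) come from the paper.
prev_sota = (0.85, 100.0)                        # (AUROC, latency in ms)
pa_detect = (0.85 * 1.079, 100.0 * (1 - 0.655))  # better on both axes

print(dominates(pa_detect, prev_sota))  # True: strictly better on both axes
```

Most single-objective optimizations would improve one coordinate at the expense of the other, making `dominates` return False in both directions; PA-Detect's claimed result is dominance in one direction.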

Solution Space Characterization

For each task, DeepScientist operates within a defined solution space:

Per-Task Solution Space
│
├── Repository-Level Modifications
│   ├── Algorithm changes (new methods, modified pipelines)
│   ├── Architecture changes (model structure, layers, attention)
│   ├── Training modifications (loss functions, schedules, augmentation)
│   ├── Inference optimizations (caching, batching, pruning)
│   └── Evaluation protocol changes (metrics, preprocessing)
│
├── Hypothesis-Level Innovations
│   ├── Novel combinations of existing techniques
│   ├── Theoretical insights applied to implementation
│   ├── Cross-domain transfer of methods
│   └── Ablation-discovered simplifications
│
└── Constraints
    ├── Must produce runnable code (not just ideas)
    ├── Must improve quantitative metrics vs. baseline
    ├── Must be validated through controlled experiments
    └── Must survive human expert review

The Innovation Funnel

DeepScientist's most revealing metric is the conversion rate through its funnel:

~5,000 Idea Findings generated
    │
    │  UCB acquisition selects most promising
    │  (~22% selection rate)
    │
~1,100 Implement Findings validated
    │
    │  Only those surpassing baseline promoted
    │  (~1.9% of implementations succeed)
    │
   21 Progress Findings (genuine innovations)
    │
    │  Only ~0.42% of original ideas lead to innovation
    │
    3 SOTA-surpassing methods published

This funnel ratio is itself a scientific contribution. It quantifies the difficulty of autonomous discovery: even with a principled search strategy and powerful LLM models, the vast majority of hypotheses fail. The 0.42% success rate suggests that scientific innovation — even in well-defined AI tasks — remains extremely hard.
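The funnel arithmetic follows directly from the reported counts and can be verified in a few lines:

```python
# The reported funnel counts, stage by stage.
funnel = [
    ("Idea Findings",      5000),
    ("Implement Findings", 1100),
    ("Progress Findings",    21),
    ("SOTA methods",          3),
]

# Pair each stage with its predecessor to get stage-to-stage conversion rates.
for (stage, n), (_, prev) in zip(funnel[1:], funnel):
    print(f"{stage:>18}: {n:>5}  ({100 * n / prev:.1f}% of previous stage)")
print(f"End-to-end yield: {100 * funnel[-1][1] / funnel[0][1]:.2f}%")
```

This reproduces the figures quoted in the text: ~22% of ideas are selected for implementation, ~1.9% of implementations become innovations, and 0.06% of all ideas become SOTA-surpassing methods.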


5 LLM Integration

Dual-Model Architecture

DeepScientist employs a deliberate separation of concerns between two frontier LLM models:

Role             Model           Justification
Core logic       Gemini-2.5-Pro  Strategist, reviewer, hypothesis evaluator, report writer — requires broad reasoning, long context, and nuanced scientific judgment
Code generation  Claude-4-Opus   Implementation agent — requires precise, repository-level code generation with strong debugging capabilities

This is not a cost optimization (both are frontier models) but a capability optimization. The authors implicitly argue that the best reasoning model and the best coding model are different — and that the system benefits from using each where it excels.

LLM as Surrogate Model

The most novel use of the LLM is as the surrogate function in the Bayesian Optimization loop. In classical BO, the surrogate is a Gaussian Process or a neural network trained on past evaluations. In DeepScientist, the surrogate is an LLM Reviewer — a prompted Gemini-2.5-Pro that takes a hypothesis description and returns a valuation vector V = <v_u, v_q, v_e>.

Classical BO                          DeepScientist BO
┌─────────────────┐                  ┌─────────────────┐
│ Gaussian Process │                  │  LLM Reviewer   │
│ (trained on data)│                  │ (prompted, zero- │
│                  │                  │  or few-shot)    │
│ Input: x ∈ ℝ^d  │                  │ Input: I (text)  │
│ Output: μ(x),σ(x)│                 │ Output: V=<v_u,  │
│                  │                  │  v_q, v_e>       │
└─────────────────┘                  └─────────────────┘

Key differences from classical surrogates:

Property        Gaussian Process                    LLM Reviewer
Input space     Continuous ℝ^d                      Natural language (hypothesis descriptions)
Training        Fitted to (x, y) pairs              Pre-trained on scientific literature
Uncertainty     Calibrated posterior variance σ(x)  Heuristic exploration score v_e
Update          Bayesian posterior update           Prompt-based (context of past findings)
Cost            O(n³) for n observations            O(1) API call per evaluation
Expressiveness  Smooth functions                    Arbitrary scientific reasoning

The LLM surrogate sacrifices calibrated uncertainty (the core theoretical advantage of GPs) for expressiveness over a vastly richer input space. This is a pragmatic choice: you cannot represent "a new method for agent failure attribution based on abductive reasoning" as a point in ℝ^d, but an LLM can reason about it.
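A minimal sketch of what an LLM surrogate looks like in code. The prompt wording, JSON schema, and `call_llm` client below are hypothetical illustrations, not the paper's actual reviewer prompt:

```python
import json

# Hypothetical reviewer prompt and schema -- the paper does not publish its
# actual prompt; this only illustrates the surrogate's role in the BO loop.
REVIEW_PROMPT = """\
You are a scientific reviewer. Score the hypothesis below on three axes, 0-10:
- utility: expected practical improvement on the task metric
- quality: methodological soundness and feasibility
- exploration: novelty relative to these previously explored findings:
{past_findings}

Hypothesis: {hypothesis}

Respond with JSON: {{"utility": ..., "quality": ..., "exploration": ...}}"""

def llm_surrogate(hypothesis, past_findings, call_llm):
    """Approximate f(I) with one cheap LLM call instead of a full
    implement-and-run evaluation. `call_llm` is any text-in/text-out client."""
    prompt = REVIEW_PROMPT.format(
        hypothesis=hypothesis,
        past_findings="\n".join(f"- {f}" for f in past_findings),
    )
    scores = json.loads(call_llm(prompt))
    return (scores["utility"], scores["quality"], scores["exploration"])

# Stubbed model for a dry run; a real deployment would call the Gemini API here.
fake_llm = lambda prompt: '{"utility": 6, "quality": 7, "exploration": 9}'
print(llm_surrogate("abductive failure attribution", ["token caching"], fake_llm))
```

Note the update mechanism: past findings enter through the prompt context rather than through posterior refitting, which is exactly the "prompt-based update" row in the comparison table above.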

How Each LLM Is Used at Each Stage

Stage 1 — Strategize & Hypothesize (Gemini-2.5-Pro):

Input: Findings Memory (thousands of structured records)
       + Retrieved Top-K relevant findings (when memory exceeds context)
       + Task description and current SOTA baselines

LLM Operations:
1. Analyze patterns in past findings (successes, failures, trends)
2. Generate novel hypothesis based on gap analysis
3. Produce valuation vector V = <v_u, v_q, v_e> for each hypothesis
4. Store as "Idea Finding" in memory

Output: Ranked set of Idea Findings with valuation vectors

Stage 2 — Implement & Verify (Claude-4-Opus):

Input: Selected Idea Finding (highest UCB score)
       + Repository codebase (full access)
       + Experimental baselines and metrics

LLM Operations:
1. Plan implementation strategy (reading existing code structure)
2. Generate code changes (repository-level modifications)
3. Execute experiments in sandboxed environment
4. Debug failures, iterate on implementation
5. Produce experimental logs and results

Output: Implementation + experimental results + updated finding record

Stage 3 — Analyze & Report (Gemini-2.5-Pro):

Input: Successful implementation results
       + Baseline comparisons
       + Full experimental logs

LLM Operations:
1. Design deeper analytical experiments (ablations, new datasets)
2. Manage experimental lifecycle via MCP tools
3. Synthesize results into coherent narrative
4. Generate research paper with proper structure and citations

Output: Research paper + comprehensive experimental analysis

Agent Capabilities

The implementation agent (Stage 2) has notably broad permissions:

Permission        Scope                         Rationale
Read code         Full repository access        Must understand existing codebase to modify it
Write code        Full repository modification  Must implement novel methods
Execute code      Sandboxed environment         Must run experiments
Internet access   Yes                           May need to reference documentation, download dependencies
Install packages  Yes                           May need new libraries for implementation
GPU access        Dedicated H800 GPU            Experiments require accelerator compute

This is more permissive than most autonomous research systems. AI Scientist (Sakana) restricts agents to a predefined template. Karpathy's autoresearch limits modifications to a single file. DeepScientist gives the coding agent full repository-level access — reflecting the reality that genuine scientific innovation often requires structural changes, not just parameter tuning.

Prompt Engineering and Review Architecture

The LLM Reviewer (surrogate model) is a cornerstone of the system's effectiveness. It must:

  1. Assess utility — estimate how much a proposed method would improve task performance
  2. Assess quality — evaluate methodological soundness, potential pitfalls, feasibility
  3. Assess exploration value — determine how novel the hypothesis is relative to previously explored ideas

The three-dimensional output is critical. A single scalar score would collapse these orthogonal concerns, making it impossible for the UCB acquisition function to properly balance exploration and exploitation. By separating the signals, the system can independently control how much it values novelty (via κ) versus expected performance (via w_u, w_q).

Contrast with Single-Model Systems

System         Models Used                     Separation of Concerns
AI Scientist   GPT-4 / Claude for everything   None — same model ideates, codes, writes, reviews
Autoresearch   Single coding agent             None — LLM does all reasoning and coding
AlphaEvolve    Gemini Flash + Pro ensemble     Model hierarchy (fast for quantity, large for quality)
DeepScientist  Gemini-2.5-Pro + Claude-4-Opus  Functional (reasoning vs. coding)

DeepScientist's separation is functional (reasoning vs. coding) rather than hierarchical (cheap vs. expensive). This is a defensible architecture decision: the best available reasoning model need not be the best coder, and vice versa.


6 Key Results

SOTA-Surpassing Results

The headline results are the three methods that surpassed human state-of-the-art:

Result 1: Agent Failure Attribution — A2P Method

Metric    Baseline SOTA  DeepScientist (A2P)    Improvement
Accuracy  Not specified  +183.7% over baseline  +183.7%

The A2P (Abduction-Action-Prediction) method is a three-phase reasoning framework:

  1. Abduction: Given the observed failure, generate candidate causal explanations
  2. Action Analysis: For each action in the agent trace, evaluate its alignment with each candidate cause
  3. Prediction: Select the action-cause pair with highest explanatory power

The magnitude of improvement (+183.7%) is extraordinary. Such a large gain typically indicates one of three things:

  • The baseline was particularly weak (a common criticism)
  • The task was particularly amenable to the type of reasoning an LLM can bring
  • The method represents a genuinely transformative approach

Given that this is a relatively new task (agent failure attribution), the first explanation is most likely — but the result is still noteworthy as a demonstration of autonomous discovery.

Result 2: LLM Inference Acceleration — ACRA Method

Metric         Baseline SOTA  DeepScientist (ACRA)  Improvement
Tokens/second  Not specified  +1.9% over baseline   +1.9%

The +1.9% improvement is modest but meaningful in a domain where:

  • Inference optimization is heavily studied by well-funded teams (NVIDIA, Google, Meta)
  • Most easy optimizations have already been found
  • Even 1-2% improvements translate to significant cost savings at scale

The ACRA method name suggests Adaptive/Conditional computation with some form of Routing or Attention modification. The fact that an autonomous system found a genuine improvement in this heavily-optimized space is itself a significant result.

Result 3: AI Text Detection — PA-Detect Method

Metric   Baseline SOTA  DeepScientist (PA-Detect)  Improvement
AUROC    Not specified  +7.9% over baseline        +7.9%
Latency  Not specified  -65.5% (190% faster)       +190% speed

PA-Detect is the most compelling result because it achieves Pareto improvement — simultaneously better on two competing objectives:

           AUROC (higher is better)
              ▲
              │         ★ PA-Detect
              │        ╱
              │       ╱  Pareto frontier shift
              │      ╱
              │     ○ Previous SOTA
              │
              └──────────────────► Latency (lower is better)

Achieving +7.9% AUROC while simultaneously reducing latency by 65.5% means PA-Detect doesn't just find a better point on the existing accuracy-speed tradeoff curve — it shifts the entire Pareto frontier. This is rare in optimization and suggests a fundamentally better algorithmic approach rather than hyperparameter tuning.

Automated Review Evaluation (DeepReviewer)

Table 2 from the paper compares DeepScientist's papers against other AI research systems using DeepReviewer (the automated review model from CycleResearcher):

System           Papers  Soundness  Presentation  Contribution  Rating  Accept Rate
AI Scientist         10       2.08          1.80          1.75    3.35           0%
AI Researcher         7       1.75          1.46          1.57    2.57           0%
AI Scientist v2       3       1.67          1.50          1.50    2.33           0%
CycleResearcher       6       2.25          1.75          2.13    3.75           0%
Zochi                 2       2.38          2.38          2.25    4.63           0%
DeepScientist         5       2.90          2.90          2.90    5.90          60%

Analysis of the Review Scores

DeepScientist dominates every dimension. The gap is not marginal:

Dimension     DeepScientist  Next Best        Gap
Soundness     2.90           2.38 (Zochi)     +0.52
Presentation  2.90           2.38 (Zochi)     +0.52
Contribution  2.90           2.25 (Zochi)     +0.65
Rating        5.90           4.63 (Zochi)     +1.27
Accept Rate   60%            0% (all others)  +60pp

The 60% accept rate is the most striking number. Every other system — including sophisticated ones like AI Scientist v2 and Zochi — has a 0% accept rate under DeepReviewer. DeepScientist is the first to cross the acceptance threshold, and it does so with a substantial majority (3 of 5 papers accepted).

Potential confound: DeepReviewer was developed by the same team (from CycleResearcher). While the automated reviewer was validated against human judgments in the CycleResearcher paper, there is a risk of systemic bias — the reviewer may favor the same team's output style. The human expert review helps address this concern.

Human Expert Review

To address the automated reviewer concern, the paper reports human expert evaluation:

Metric                                      DeepScientist Papers  ICLR 2025 Human Papers
Average rating                              5.00                  5.08
Number of reviewers                         3 per paper           3-4 per paper
Inter-annotator agreement (Krippendorff α)  0.739                 N/A

The DeepScientist papers received ratings statistically indistinguishable from human-written papers at a top ML venue.

Key observations:

  1. Rating 5.00 vs 5.08: The 0.08 gap is well within noise. At ICLR, a rating of 5 corresponds roughly to "marginally below acceptance threshold" — meaning these papers are competitive with, but not clearly above, venue-quality human research.

  2. Krippendorff's α = 0.739: This indicates "substantial agreement" (>0.667 threshold). The reviewers were consistent in their assessments, suggesting the ratings are reliable rather than artifacts of reviewer disagreement.

  3. Caveat: Three reviewers across a handful of papers provides limited statistical power. The confidence interval around 5.00 is wide. But the directional finding — that AI-generated research papers can achieve parity with human submissions — is still noteworthy.

Scale Metrics

Metric                          Value              Interpretation
GPU hours consumed              20,000+            ~$40K-80K in GPU time alone at cloud rates (see Section 8)
Unique scientific ideas         ~5,000             Massive hypothesis generation capacity
Experimentally validated        ~1,100             22% of ideas selected for implementation
Scientific innovations          21                 1.9% of implementations, 0.42% of ideas
SOTA-surpassing methods         3                  14.3% of innovations, 0.06% of all ideas
Campaign duration               ~1 month per task  Continuous autonomous operation
Human experts for verification  3                  Necessary for hallucination filtering

The Dissipation Problem

The paper's own analysis reveals a critical challenge: the vast majority of compute is "wasted" on hypotheses that don't work. The funnel from 5,000 ideas to 3 SOTA methods represents a 0.06% conversion rate. The authors frame this honestly:

"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"

This framing shifts the research agenda from "can AI do science?" (answered: yes) to "can AI do science efficiently?" (answered: not yet). Spending 20,000+ GPU hours for 3 results is an expensive search by any standard — but the cost is expected to decrease as models improve and search becomes more efficient.


7 Reproducibility

Code Availability

Artifact                                   Available                       Location
Source code                                Yes                             github.com/ResearAI/DeepScientist
Project page                               Yes                             ai-researcher.net
Paper                                      Yes                             arXiv:2509.26603
Discovered methods (A2P, ACRA, PA-Detect)  Partial                         In paper descriptions
Findings Memory dumps                      Unknown                         Not explicitly released
Experimental logs                          Unknown                         Not explicitly released
DeepReviewer model                         Inherited from CycleResearcher  Separate release

Reproducibility Barriers

High barriers to exact reproduction:

Barrier         Severity  Notes
Hardware        Critical  2 servers × 8 NVIDIA H800 GPUs = 16 H800s required
Compute cost    Critical  20,000+ GPU hours ≈ $40K-80K at cloud rates, before API costs
Model access    High      Requires Gemini-2.5-Pro and Claude-4-Opus API access
API costs       High      Month-long continuous LLM API calls at frontier model rates
Time            High      Month-long campaigns per task — cannot reproduce quickly
Stochasticity   Medium    LLM outputs are stochastic; same prompts yield different hypotheses
Human experts   Medium    3 domain experts needed for output verification
Task baselines  Low       Frontier AI tasks with known baselines

The reproducibility challenge is primarily economic, not technical. The system architecture is documented, the code is released, and the methodology is clear. But actually running DeepScientist requires resources that few academic labs possess — essentially two full compute nodes with H800 GPUs running for a month, plus substantial API budgets.

Partial Reproduction Path

A realistic partial reproduction strategy:

Full reproduction (prohibitive for most):
├── 16 × H800 GPUs for 1 month per task
├── Gemini-2.5-Pro + Claude-4-Opus API budget
├── 3 domain experts
└── Total: ~$67K-254K per task (per the Section 8 estimates)

Scaled-down reproduction (feasible):
├── 1-2 × consumer GPUs (A100/H100)
├── Smaller hypothesis budget (100-500 instead of 5,000)
├── Shorter campaigns (days instead of months)
├── Smaller models (Gemini Flash, Claude Sonnet)
└── Total: ~$1K-10K per task

Concept validation (cheap):
├── Single GPU
├── Reproduce the BO formulation on a toy task
├── Verify UCB acquisition logic
├── Test Findings Memory data structures
└── Total: ~$100-500

What Would Strengthen Reproducibility

  1. Release Findings Memory dumps — allow researchers to analyze the discovery trajectory
  2. Release experimental logs — enable understanding of which hypotheses failed and why
  3. Release the discovered methods — full code for A2P, ACRA, PA-Detect (partially available in paper)
  4. Provide scaling curves — how do results degrade with fewer GPUs, shorter campaigns, cheaper models?
  5. Open-source the evaluation harness — standardized benchmarks for comparing autonomous research systems

8 Compute and API Costs

Hardware Configuration

Server 1                              Server 2
┌─────────────────────────────┐      ┌─────────────────────────────┐
│  8 × NVIDIA H800 (80GB)    │      │  8 × NVIDIA H800 (80GB)    │
│                             │      │                             │
│  GPU 0: DeepScientist       │      │  GPU 8:  DeepScientist     │
│  GPU 1: Instance            │      │  GPU 9:  Instance          │
│  GPU 2: Instance            │      │  GPU 10: Instance          │
│  GPU 3: Instance            │      │  GPU 11: Instance          │
│  GPU 4: Instance            │      │  GPU 12: Instance          │
│  GPU 5: Instance            │      │  GPU 13: Instance          │
│  GPU 6: Instance            │      │  GPU 14: Instance          │
│  GPU 7: Instance            │      │  GPU 15: Instance          │
└─────────────────────────────┘      └─────────────────────────────┘

Each GPU runs a separate DeepScientist instance
16 parallel exploration threads
Month-long continuous operation per task

Cost Estimation

Component            Quantity                        Unit Cost (est.)          Total (est.)
H800 GPU hours       20,000+                         $2-4/hr (cloud)           $40,000-80,000
Gemini-2.5-Pro API   ~millions of tokens             $0.00125-0.01/1K tokens   $5,000-50,000
Claude-4-Opus API    ~millions of tokens (code gen)  $0.015-0.075/1K tokens    $10,000-100,000
Human expert time    3 experts × ~40 hrs             $100-200/hr               $12,000-24,000
Total per task                                                                 $67,000-254,000
Total (3 tasks)                                                                $200,000-762,000

These are rough estimates based on publicly available pricing. The actual costs may be lower if the team had institutional discounts or higher if API usage was more intensive than estimated.
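As a quick arithmetic check, the per-task bounds in the table are simply the sums of the component bounds (the 3-task low end rounds $201K down to $200K):

```python
# Sanity-check the cost table: (low, high) USD bounds per component.
components = {
    "H800 GPU hours":     (40_000, 80_000),
    "Gemini-2.5-Pro API": (5_000, 50_000),
    "Claude-4-Opus API":  (10_000, 100_000),
    "Human expert time":  (12_000, 24_000),
}

low = sum(lo for lo, _ in components.values())    # 67,000
high = sum(hi for _, hi in components.values())   # 254,000
print(f"Per task: ${low:,}-${high:,}")
print(f"3 tasks:  ${3 * low:,}-${3 * high:,}")    # $201,000-$762,000
```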

Cost per Innovation

Metric                              Value             Interpretation
Cost per idea generated             ~$40-150          Cheap: LLM inference for hypothesis generation
Cost per implementation validated   ~$180-690         Moderate: GPU hours for experimentation
Cost per innovation discovered      ~$9,500-36,300    Expensive: reflects the low hit rate
Cost per SOTA-surpassing method     ~$67,000-254,000  Very expensive, but comparable to a postdoc-year

Comparison to Human Research Costs

Research Mode           Cost per SOTA Result    Time to Result   Scalability
PhD student             ~$60,000-120,000/year   1-5 years        Not scalable
Industry research team  ~$500,000-2M/year       6-18 months      Limited
DeepScientist           ~$67K-254K              ~1 month         Parallelizable

The comparison is imperfect — human researchers produce understanding, mentorship, and serendipitous discoveries alongside their primary results. But on the narrow dimension of "time and cost to produce a SOTA-surpassing method," DeepScientist is competitive with or faster than human researchers, albeit at similar cost.

Scaling Properties

DeepScientist exhibits a property that human research does not: near-linear scaling with compute. Adding more GPUs = more parallel exploration threads = more hypotheses evaluated = more innovations discovered (in expectation). Human research teams face diminishing returns as team size grows (coordination costs, communication overhead, duplication of effort).

Innovation yield vs. compute (conceptual)

Innovations │        DeepScientist
discovered  │       ╱ (near-linear with compute)
            │      ╱
            │     ╱
            │    ╱     Human team
            │   ╱     ╱ (diminishing returns)
            │  ╱     ╱
            │ ╱     ╱
            │╱    ╱
            └─────────────────── Compute / Team Size

This scaling property — if it holds as hypothesized — is the most consequential implication of the DeepScientist framework. It suggests that scientific discovery can be industrialized: throw more GPUs at the problem and get more results, linearly.


9 Architecture Solution

Three-Stage Hierarchical Discovery Cycle

DeepScientist's architecture is a three-stage cycle, where each stage builds on the outputs of the previous one. The cycle repeats continuously for the duration of a campaign (approximately one month per task).

┌──────────────────────────────────────────────────────────────────────┐
│                    DEEPSCIENTIST DISCOVERY CYCLE                     │
│                                                                      │
│  ┌─────────────────────┐    ┌─────────────────────┐                 │
│  │  STAGE 1            │    │  FINDINGS MEMORY     │                 │
│  │  Strategize &       │◄───┤                      │                 │
│  │  Hypothesize        │    │  ┌─────────────────┐ │                 │
│  │                     │    │  │ Idea Findings    │ │                 │
│  │  • Analyze memory   │───►│  │ (hypotheses)     │ │                 │
│  │  • Retrieve Top-K   │    │  ├─────────────────┤ │                 │
│  │  • LLM surrogate    │    │  │ Implement        │ │                 │
│  │    valuation V       │    │  │ Findings         │ │                 │
│  │  • Store Idea        │    │  │ (code + results) │ │                 │
│  │    Findings          │    │  ├─────────────────┤ │                 │
│  └────────┬────────────┘    │  │ Progress         │ │                 │
│           │                  │  │ Findings         │ │                 │
│           │ UCB selects      │  │ (innovations)    │ │                 │
│           │ best candidate   │  └─────────────────┘ │                 │
│           ▼                  │                       │                 │
│  ┌─────────────────────┐    │  Also contains:       │                 │
│  │  STAGE 2            │    │  • Human knowledge    │                 │
│  │  Implement &        │    │    (papers, code)     │                 │
│  │  Verify             │    │  • Historical results │                 │
│  │                     │    │  • Valuation vectors  │                 │
│  │  • Coding agent     │    └───────────────────────┘                 │
│  │  • Sandboxed env    │                                              │
│  │  • Full repo access │                                              │
│  │  • GPU experiments  │                                              │
│  │  • Result f(I_{t+1})│                                              │
│  └────────┬────────────┘                                              │
│           │                                                           │
│           │ If result surpasses baseline                              │
│           ▼                                                           │
│  ┌─────────────────────┐                                              │
│  │  STAGE 3            │                                              │
│  │  Analyze &          │                                              │
│  │  Report             │                                              │
│  │                     │                                              │
│  │  • Ablation studies │                                              │
│  │  • New dataset eval │                                              │
│  │  • MCP lifecycle    │                                              │
│  │  • Paper synthesis  │                                              │
│  └─────────────────────┘                                              │
│                                                                      │
│  ← Cycle repeats for ~1 month per task →                             │
└──────────────────────────────────────────────────────────────────────┘

Stage 1: Strategize & Hypothesize

Purpose: Generate and evaluate candidate hypotheses using the LLM as a surrogate model.

Process:

  1. Load the current Findings Memory (potentially thousands of structured records)
  2. When memory exceeds the context window, use the retrieval model to select the Top-K relevant findings
  3. The LLM Reviewer (Gemini-2.5-Pro) analyzes patterns in past findings
  4. Generate new hypotheses based on identified gaps and opportunities
  5. For each hypothesis, produce a valuation vector V = <v_u, v_q, v_e>
  6. Store the result as an "Idea Finding" in memory

Key design choice: The retrieval step is critical for scaling. As the Findings Memory grows to thousands of entries, it cannot fit in a single LLM context window. The retrieval model ensures that the most relevant past findings are presented to the strategist, providing continuity even as the memory grows beyond context limits.

Stage 2: Implement & Verify

Purpose: Select the most promising hypothesis via UCB, implement it, and evaluate experimentally.

Process:

  1. The UCB acquisition function selects the Idea Finding with the highest score: I_{t+1} = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)
  2. The selected finding is promoted to an "Implement Finding"
  3. The coding agent (Claude-4-Opus) receives the hypothesis and repository
  4. The agent plans an implementation strategy by reading the existing code
  5. The agent implements the changes (full repository-level modifications)
  6. Experiments are executed in a sandboxed environment with a dedicated GPU
  7. The result f(I_{t+1}) is recorded and the finding record updated

Key design choice: The coding agent has full permissions (read code, access internet, install packages). This is necessary for genuine innovation — constraining the agent to a template or a single file would preclude the kinds of structural changes that lead to breakthrough methods.

Stage 3: Analyze & Report

Purpose: Triggered only by successful implementations that surpass the baseline. Performs deeper analysis and generates a research paper.

Process:

  1. The successful implementation is promoted to a "Progress Finding"
  2. MCP tools manage the experimental lifecycle
  3. Deeper analytical experiments are run: ablation studies, evaluation on new datasets
  4. A synthesis agent (Gemini-2.5-Pro) produces a coherent research paper

Key design choice: Stage 3 is conditional — it only fires for successes. This prevents wasting compute on analyzing failed experiments. The asymmetry is important: hypothesis generation and implementation are cheap enough to run speculatively, but deep analysis and paper writing should only happen for results worth reporting.

Parallel Execution Architecture

16 GPU instances, each running independently:

GPU 0:  Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 1:  Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 2:  Stage 1 → Stage 2 (experiment) → Stage 3 (success!) → Stage 1 → ...
...
GPU 15: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...

Shared: Findings Memory (read by all, written by all)

Each GPU instance operates as an independent exploration thread, sharing the Findings Memory. This architecture enables:

  • Parallel exploration: 16 hypotheses can be evaluated simultaneously
  • Shared learning: discoveries on one GPU inform hypothesis generation on all others
  • Fault tolerance: failure of one instance doesn't affect others
  • Linear scaling: adding GPUs proportionally increases throughput

The shared Findings Memory is the key coordination mechanism. Without it, each GPU would explore independently, duplicating effort and missing opportunities to build on each other's findings.
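The coordination pattern can be sketched in a few lines. This is an illustrative stand-in, not the released code: threads stand in for GPU instances, and a lock-guarded list stands in for the shared Findings Memory.

```python
import threading

# Hypothetical sketch: N workers, each running its own Stage 1 -> Stage 2
# loop, share one findings store guarded by a lock.
class FindingsMemory:
    def __init__(self):
        self._lock = threading.Lock()
        self._findings = []

    def add(self, finding):
        with self._lock:
            self._findings.append(finding)

    def snapshot(self):
        with self._lock:
            return list(self._findings)  # every worker reads everyone's results

def worker(gpu_id, memory, n_cycles):
    for t in range(n_cycles):
        seen = len(memory.snapshot())           # Stage 1: read shared memory
        result = {"gpu": gpu_id, "cycle": t,    # Stage 2: run an "experiment"
                  "informed_by": seen}
        memory.add(result)                      # share the outcome with all workers

memory = FindingsMemory()
threads = [threading.Thread(target=worker, args=(g, memory, 5)) for g in range(16)]
for th in threads: th.start()
for th in threads: th.join()
print(len(memory.snapshot()))  # 80 findings: 16 workers x 5 cycles each
```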


10 Component Breakdown

Component Architecture

DeepScientist System Components
│
├── Strategist (Gemini-2.5-Pro)
│   ├── Memory Analyzer — reads and patterns over Findings Memory
│   ├── Hypothesis Generator — produces novel Idea Findings
│   ├── Surrogate Evaluator — LLM Reviewer producing V = <v_u, v_q, v_e>
│   └── Retrieval Model — Top-K finding selection when memory exceeds context
│
├── Selector (UCB Acquisition Function)
│   ├── Weight Manager — maintains w_u, w_q, κ parameters
│   ├── Score Calculator — computes UCB score per Idea Finding
│   └── Promotion Logic — Idea Finding → Implement Finding
│
├── Implementer (Claude-4-Opus)
│   ├── Code Reader — analyzes repository structure
│   ├── Plan Generator — designs implementation strategy
│   ├── Code Writer — repository-level modifications
│   ├── Experiment Runner — executes in sandboxed environment
│   └── Result Logger — records f(I_{t+1}) in finding record
│
├── Analyzer (Gemini-2.5-Pro)
│   ├── Ablation Designer — plans deeper experiments
│   ├── Dataset Evaluator — tests on additional benchmarks
│   ├── MCP Lifecycle Manager — experimental lifecycle tools
│   └── Paper Synthesizer — generates research paper
│
├── Findings Memory
│   ├── Idea Findings Store — hypotheses with valuation vectors
│   ├── Implement Findings Store — implementations with results
│   ├── Progress Findings Store — successful innovations
│   ├── Human Knowledge Base — papers, code repositories
│   └── Retrieval Index — for Top-K selection
│
└── Infrastructure
    ├── GPU Scheduler — assigns instances to GPUs
    ├── Sandbox Manager — isolated execution environments
    ├── API Client Pool — manages LLM API connections
    └── Logging & Monitoring — tracks campaign progress

Component Interaction Matrix

Component        Reads From                            Writes To                             LLM Model
Strategist       Findings Memory, Task Description     Idea Findings                         Gemini-2.5-Pro
Selector         Idea Findings (valuation vectors)     Implement Findings (promotion)        None (pure computation)
Implementer      Implement Findings, Repository        Implement Findings (results)          Claude-4-Opus
Analyzer         Progress Findings, Experimental Data  Progress Findings (analysis), Papers  Gemini-2.5-Pro
Retrieval Model  Findings Memory                       Top-K selections                      Embedding model

The Surrogate Model Component

The surrogate model deserves special attention as the most architecturally novel component:

Surrogate Model (LLM Reviewer)
│
├── Input Assembly
│   ├── Hypothesis description (natural language)
│   ├── Task context (problem definition, metrics, baseline)
│   ├── Relevant past findings (Top-K from memory)
│   └── Current state of knowledge (patterns, trends)
│
├── Evaluation Process
│   ├── Assess utility: "How much would this improve task performance?"
│   ├── Assess quality: "Is this methodologically sound and feasible?"
│   └── Assess exploration: "How novel is this vs. what we've tried?"
│
└── Output
    ├── v_u (utility value) — scalar estimate of expected improvement
    ├── v_q (quality value) — scalar estimate of methodological quality
    └── v_e (exploration value) — scalar estimate of novelty

The UCB Selector Component

UCB Acquisition Function
│
├── Input
│   ├── Set of Idea Findings with valuation vectors V_i = <v_u, v_q, v_e>
│   └── Parameters: w_u, w_q, κ
│
├── Score Computation (for each Idea Finding I_i)
│   └── UCB(I_i) = w_u · v_u(I_i) + w_q · v_q(I_i) + κ · v_e(I_i)
│
├── Selection
│   └── I_{t+1} = argmax_{I_i} UCB(I_i)
│
└── Promotion
    └── Selected I_{t+1}: Idea Finding → Implement Finding

The UCB selector is the only non-LLM component in the critical path. It is a deterministic function of the valuation vectors, ensuring that the exploration/exploitation tradeoff is principled rather than dependent on LLM stochasticity.
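A minimal sketch of this selector, using the formula above; the specific weight values (w_u = 1.0, w_q = 0.5, κ = 0.8) and the candidate findings are hypothetical:

```python
# Deterministic UCB selector: a pure function of each Idea Finding's
# valuation vector <v_u, v_q, v_e>. Weights are illustrative values.
def ucb_score(v, w_u=1.0, w_q=0.5, kappa=0.8):
    return w_u * v["v_u"] + w_q * v["v_q"] + kappa * v["v_e"]

def select_next(idea_findings, **weights):
    # I_{t+1} = argmax over Idea Findings of the UCB score
    return max(idea_findings, key=lambda f: ucb_score(f["valuation"], **weights))

ideas = [
    {"id": "idea-1", "valuation": {"v_u": 0.9, "v_q": 0.3, "v_e": 0.2}},  # bold
    {"id": "idea-2", "valuation": {"v_u": 0.4, "v_q": 0.9, "v_e": 0.1}},  # safe
    {"id": "idea-3", "valuation": {"v_u": 0.5, "v_q": 0.5, "v_e": 0.9}},  # novel
]
print(select_next(ideas)["id"])  # idea-3: novelty wins under this kappa
```

Raising w_q relative to w_u shifts selection toward the safe candidate; lowering κ suppresses the novel one, which is the tunable exploration/exploitation balance described above.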

MCP Tools for Experimental Lifecycle

Stage 3 uses MCP (Model Context Protocol) tools to manage the experimental lifecycle:

Tool Category Purpose Used In
Experiment Design Plan ablation studies, control experiments Stage 3
Dataset Management Load, preprocess, evaluate on new datasets Stage 3
Result Compilation Aggregate metrics, generate tables and figures Stage 3
Paper Formatting Structure sections, manage references, LaTeX Stage 3

The use of MCP tools rather than hardcoded logic allows the analysis pipeline to be flexible — the LLM can decide which tools to invoke based on the specific findings, rather than following a rigid template.


11 Core Mechanisms (Detailed)

Mechanism 1: Bayesian Optimization over Hypothesis Space

Formal setup:

Let 𝓘 denote the space of all possible scientific methods (a vast, combinatorial space of algorithms, architectures, and configurations). Let f: 𝓘 → ℝ be the true value function that maps each method to its actual performance. The goal:

I* = argmax_{I ∈ 𝓘} f(I)

The BO loop proceeds as follows:

t = 0: Initialize Findings Memory M_0 with human knowledge (papers, baselines)

For t = 1, 2, 3, ...:
    1. SURROGATE UPDATE
       Use LLM Reviewer to produce valuation vectors for new Idea Findings:
       V(I) = <v_u(I), v_q(I), v_e(I)>  for each I in candidates

    2. ACQUISITION
       Select next hypothesis to evaluate:
       I_{t+1} = argmax_{I} UCB(I) = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)

    3. EVALUATION
       Deploy coding agent to implement and test I_{t+1}:
       f(I_{t+1}) = ExperimentalResult(I_{t+1})

    4. MEMORY UPDATE
       Update M_t → M_{t+1} with result:
       - If f(I_{t+1}) > f(I_best): promote to Progress Finding
       - Otherwise: record result and update valuation model's context

    5. REPEAT
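The loop above can be made concrete as a runnable toy, with the LLM surrogate and the Stage 2 experiments replaced by random stubs; all names and numbers here are illustrative, not the authors' implementation.

```python
import random

random.seed(0)

def surrogate(hypothesis):   # stand-in for the LLM Reviewer's valuation
    return {"v_u": random.random(), "v_q": random.random(), "v_e": random.random()}

def evaluate(hypothesis):    # stand-in for running a GPU experiment
    return random.gauss(0.5, 0.2)

def ucb(v, w_u=1.0, w_q=0.5, kappa=0.8):
    return w_u * v["v_u"] + w_q * v["v_q"] + kappa * v["v_e"]

# t = 0: seed the memory with initial Idea Findings
memory = [{"hypothesis": f"seed-{i}", "level": "Idea", "V": surrogate(f"seed-{i}")}
          for i in range(20)]
best = 0.5                   # f(I_baseline)

for t in range(10):
    ideas = [f for f in memory if f["level"] == "Idea"]
    pick = max(ideas, key=lambda f: ucb(f["V"]))        # 2. ACQUISITION
    pick["level"] = "Implement"
    pick["result"] = evaluate(pick["hypothesis"])       # 3. EVALUATION
    if pick["result"] > best:                           # 4. MEMORY UPDATE
        pick["level"], best = "Progress", pick["result"]

print(sum(f["level"] == "Progress" for f in memory), "promotions; best =", best)
```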

Why UCB and not other acquisition functions?

Acquisition Function       Formula                Properties                          Why Not Used
UCB                        μ(x) + κσ(x)           Deterministic, tunable exploration  (Used)
Expected Improvement (EI)  E[max(f(x) - f*, 0)]   Greedy, low exploration             Too exploitative for open-ended search
Thompson Sampling          Sample from posterior  Stochastic, naturally explores      Hard to implement with LLM surrogate
Knowledge Gradient         Value of information   Optimal for finite budget           Computationally expensive

UCB is the natural choice because:

  1. It is deterministic given the valuation vector — no additional stochasticity beyond the LLM
  2. The exploration coefficient κ is explicitly tunable — the researchers can control the exploration/exploitation balance
  3. It works with any surrogate that produces mean + uncertainty — the three-component valuation vector is a natural fit

Mechanism 2: The Three-Component Valuation Vector

The valuation vector V = <v_u, v_q, v_e> decomposition is not standard in BO. Classical BO surrogates produce a mean prediction μ(x) and uncertainty σ(x). DeepScientist's decomposition maps to BO concepts as follows:

Classical BO:           DeepScientist:
├── μ(x) (mean)     ←→  w_u · v_u + w_q · v_q  (exploitation signal)
└── σ(x) (variance) ←→  κ · v_e                (exploration signal)

The separation of the exploitation signal into utility (v_u) and quality (v_q) is the key innovation. In classical BO, the mean is a single scalar. In DeepScientist, the exploitation signal is a weighted combination of two semantically distinct assessments:

  • Utility (v_u): "How much performance improvement would this method achieve?" — focuses on the magnitude of the expected gain
  • Quality (v_q): "How methodologically sound is this approach?" — focuses on the probability of the gain being real

This separation allows the system to distinguish between:

  • High utility, low quality: bold ideas that promise large gains but may be unsound (high-risk, high-reward)
  • Low utility, high quality: incremental improvements that are almost certain to work (low-risk, low-reward)
  • High utility, high quality: the most promising candidates (rare, highly selected)

By adjusting the weights w_u and w_q, the system can shift between risk-seeking and risk-averse strategies.

Mechanism 3: Findings Memory as Cumulative Knowledge Base

The Findings Memory is a structured, list-style database with thousands of records organized into three hierarchical levels:

Findings Memory
│
├── Level 1: IDEA FINDINGS
│   ├── Source: Generated by Strategist (LLM)
│   ├── Content: Hypothesis description, rationale, related work references
│   ├── Metadata: Valuation vector V, generation timestamp, source findings
│   ├── Lifecycle: Created → [Selected by UCB → Promoted to Level 2] or [Persists]
│   └── Volume: ~5,000 over a month-long campaign
│
├── Level 2: IMPLEMENT FINDINGS
│   ├── Source: Promoted from Level 1 after UCB selection
│   ├── Content: Implementation details, code changes, experimental setup
│   ├── Metadata: Result f(I), experimental logs, runtime, resource usage
│   ├── Lifecycle: Created → [Surpasses baseline → Promoted to Level 3] or [Persists]
│   └── Volume: ~1,100 over a month-long campaign
│
└── Level 3: PROGRESS FINDINGS
    ├── Source: Promoted from Level 2 after successful validation
    ├── Content: Full method description, ablation results, multi-dataset evaluation
    ├── Metadata: Improvement over baseline, paper draft, human review status
    ├── Lifecycle: Created → [Deeper analysis via Stage 3] → [Paper publication]
    └── Volume: 21 over a month-long campaign

Dual-source knowledge: The memory contains both:

  1. Human knowledge — structured records from existing papers, codebases, and known methods
  2. System-generated knowledge — the system's own hypotheses, implementations, and results

This creates a feedback loop: human knowledge seeds the initial exploration, the system generates new findings, which in turn inform future hypothesis generation. Over time, the system-generated knowledge dominates, as the Findings Memory becomes a comprehensive record of what has been tried and what works.

Mechanism 4: Retrieval for Context-Length Management

As the Findings Memory grows, it exceeds the context window of even the largest LLMs. DeepScientist addresses this with a retrieval model:

Findings Memory (thousands of records)
│
├── When memory fits in context:
│   └── Pass entire memory to Strategist LLM
│
└── When memory exceeds context:
    ├── Retrieval model indexes all findings
    ├── Query: current task context + recent findings + identified gaps
    ├── Top-K most relevant findings retrieved
    └── Top-K findings passed to Strategist LLM

    This ensures:
    ├── Relevant historical context is always available
    ├── The system doesn't "forget" important past findings
    ├── Context window is used efficiently
    └── The system can scale to arbitrary campaign lengths

The retrieval mechanism is what enables month-long campaigns. Without it, the system would be limited to the number of findings that fit in a single context window — probably a few hundred at most. With retrieval, it can maintain continuity over thousands of findings.
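A toy version of the retrieval step, substituting bag-of-words cosine similarity for a learned embedding model; the findings strings are invented examples, but the mechanics (embed everything, embed the query, keep the K nearest) are the same.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words term-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(findings, query, k=2):
    # Retrieve the K findings most similar to the query.
    q = embed(query)
    return sorted(findings, key=lambda f: cosine(embed(f), q), reverse=True)[:k]

findings = [
    "speculative decoding reduced latency on the inference task",
    "watermark features improved AI text detection recall",
    "kv-cache compression hurt accuracy at long context",
    "draft model size matters for speculative decoding speedup",
]
print(top_k(findings, "speculative decoding latency", k=2))
```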

Mechanism 5: Conditional Stage Triggering

Stage 3 (Analyze & Report) is not always triggered. It fires only when an implementation surpasses the baseline:

Stage 2 result: f(I_{t+1})

Decision logic:
├── f(I_{t+1}) > f(I_baseline)?
│   ├── YES → Promote to Progress Finding → Trigger Stage 3
│   └── NO  → Record result in memory → Return to Stage 1

Filtering ratios (from the paper):
├── ~1,100 implementations attempted
├── ~21 surpassed baseline (1.9%)
└── Only 21 triggered Stage 3

This asymmetry is crucial for efficiency. Stage 3 involves expensive operations (ablation studies, multi-dataset evaluation, paper synthesis). Running it for every implementation would be wasteful. By restricting it to successes, the system focuses its analytical budget on findings that matter.

Mechanism 6: Progressive Promotion Lifecycle

Each finding follows a lifecycle from idea to publication:

                    Generated by LLM
                         │
                    IDEA FINDING
                    (hypothesis + V)
                         │
                    Selected by UCB?
                    ├── No → Stays in memory
                    │        (available for future analysis)
                    └── Yes ↓
                         │
                    IMPLEMENT FINDING
                    (code + experiments)
                         │
                    Surpasses baseline?
                    ├── No → Stays in memory with result
                    │        (negative result is valuable data)
                    └── Yes ↓
                         │
                    PROGRESS FINDING
                    (innovation + analysis)
                         │
                    Deeper analysis (Stage 3)
                         │
                    Paper synthesis
                         │
                    Human expert review
                         │
                    Published method

Critically, negative results are retained in memory. An implementation that fails to surpass the baseline is not discarded — its result is recorded and available to the Strategist. This prevents the system from re-trying failed approaches and allows it to learn from its mistakes. The information content of "hypothesis H was implemented and produced result R which was below baseline" is valuable for guiding future hypothesis generation.
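The promotion rules can be expressed as a small state machine; the level names follow the paper, while the helper functions are illustrative, not from the released code.

```python
# Promotion lifecycle as a state machine: only two legal transitions,
# and failed implementations keep their result rather than being deleted.
ALLOWED = {("Idea", "Implement"),      # selected by UCB
           ("Implement", "Progress")}  # surpassed the baseline

def promote(finding, new_level):
    if (finding["level"], new_level) not in ALLOWED:
        raise ValueError(f"illegal promotion {finding['level']} -> {new_level}")
    finding["level"] = new_level
    return finding

def record_result(finding, result, baseline):
    finding["result"] = result         # negative results stay in memory
    if result > baseline:
        promote(finding, "Progress")
    return finding

f = {"id": "h-42", "level": "Idea"}
promote(f, "Implement")
record_result(f, result=0.71, baseline=0.65)
print(f["level"])  # Progress
```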


12 Programming Language

Implementation Stack

Component                       Language/Framework                        Notes
Core orchestration              Python                                    Campaign management, memory operations, retrieval
Coding agent (implementations)  Python (generated)                        Task-specific code for each hypothesis
LLM integration                 Python (API clients)                      Gemini API, Claude API
Experimental execution          Python + CUDA                             GPU-accelerated experiments
MCP tools                       Python                                    Experimental lifecycle management
Findings Memory                 Structured storage (list-style database)  JSON/database records
Retrieval model                 Python + embedding model                  For Top-K finding selection

Repository Structure (Inferred)

DeepScientist/
├── deepscientist/
│   ├── strategist/        ← Stage 1: hypothesis generation and evaluation
│   │   ├── analyzer.py    ← Findings Memory pattern analysis
│   │   ├── generator.py   ← Hypothesis generation
│   │   ├── evaluator.py   ← LLM Reviewer (surrogate model)
│   │   └── retriever.py   ← Top-K finding retrieval
│   │
│   ├── selector/          ← UCB acquisition function
│   │   ├── ucb.py         ← UCB score computation
│   │   └── promoter.py    ← Idea → Implement finding promotion
│   │
│   ├── implementer/       ← Stage 2: implementation and verification
│   │   ├── agent.py       ← Coding agent orchestration
│   │   ├── sandbox.py     ← Sandboxed execution environment
│   │   └── logger.py      ← Result recording
│   │
│   ├── analyzer/          ← Stage 3: analysis and reporting
│   │   ├── ablation.py    ← Ablation study design and execution
│   │   ├── evaluator.py   ← Multi-dataset evaluation
│   │   └── writer.py      ← Paper synthesis
│   │
│   ├── memory/            ← Findings Memory management
│   │   ├── store.py       ← CRUD operations on findings
│   │   ├── types.py       ← Finding type definitions (Idea/Implement/Progress)
│   │   └── index.py       ← Retrieval index management
│   │
│   ├── tools/             ← MCP tool definitions
│   │   └── lifecycle.py   ← Experimental lifecycle tools
│   │
│   └── config/            ← Campaign configuration
│       ├── task.py        ← Task definition (repo, metrics, baseline)
│       └── campaign.py    ← Campaign parameters (duration, GPUs, κ)
│
├── tasks/                 ← Task definitions for each frontier problem
│   ├── agent_failure/     ← Agent failure attribution task
│   ├── inference_accel/   ← LLM inference acceleration task
│   └── text_detection/    ← AI text detection task
│
└── results/               ← Campaign outputs
    ├── findings/          ← Findings Memory dumps
    ├── papers/            ← Generated research papers
    └── logs/              ← Experimental logs

Code Generation Patterns

The implementer (Claude-4-Opus) generates task-specific code during Stage 2. The generation pattern follows a structured workflow:

Implementation Workflow (Claude-4-Opus)
│
├── 1. PLAN
│   ├── Read existing codebase structure
│   ├── Identify relevant files and functions
│   ├── Design modification strategy
│   └── Output: Implementation plan (natural language)
│
├── 2. READ
│   ├── Read specific files identified in plan
│   ├── Understand interfaces and dependencies
│   ├── Identify insertion points
│   └── Output: Codebase understanding
│
├── 3. IMPLEMENT
│   ├── Generate code changes
│   ├── May create new files or modify existing ones
│   ├── Repository-level modifications (not just single-file edits)
│   └── Output: Modified codebase
│
├── 4. EXECUTE
│   ├── Run experiments in sandboxed environment
│   ├── Monitor for errors and failures
│   ├── Debug and iterate if necessary
│   └── Output: Experimental results
│
└── 5. LOG
    ├── Record experimental results
    ├── Generate experimental logs
    ├── Compute metrics vs. baseline
    └── Output: Result record f(I_{t+1})

13 Memory Management

Findings Memory: Architecture and Data Model

The Findings Memory is the central data structure of DeepScientist — a cumulative, structured knowledge base that grows throughout a campaign. Unlike ephemeral LLM context or conversation history, the Findings Memory is a persistent, typed database of scientific findings.

Finding Record Schema

Each finding in the memory follows a structured schema:

Finding Record
├── id: str                          ← Unique identifier
├── level: enum {Idea, Implement, Progress}  ← Promotion level
├── hypothesis: str                  ← Natural language description of the idea
├── rationale: str                   ← Why this hypothesis might work
├── related_findings: list[str]      ← IDs of findings that informed this one
├── valuation: ValuationVector       ← V = <v_u, v_q, v_e>
│   ├── utility: float              ← Expected performance improvement
│   ├── quality: float              ← Methodological soundness
│   └── exploration: float          ← Novelty relative to explored space
├── implementation: Optional[ImplementationRecord]
│   ├── code_changes: list[str]     ← Description of modifications
│   ├── experimental_setup: str     ← How experiments were configured
│   ├── result: float               ← f(I) — actual performance
│   ├── baseline_delta: float       ← Improvement over baseline
│   └── logs: str                   ← Experimental output logs
├── analysis: Optional[AnalysisRecord]   ← Only for Progress Findings
│   ├── ablation_results: dict      ← Ablation study outcomes
│   ├── cross_dataset_results: dict ← Performance on additional datasets
│   └── paper_draft: str            ← Generated paper content
├── created_at: datetime
├── updated_at: datetime
└── source: enum {Human, System}     ← Human knowledge or system-generated
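One plausible rendering of this schema as Python dataclasses; the field names follow the record layout above (a subset is shown for brevity), not any released code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ValuationVector:
    utility: float      # v_u: expected performance improvement
    quality: float      # v_q: methodological soundness
    exploration: float  # v_e: novelty relative to explored space

@dataclass
class ImplementationRecord:
    code_changes: list[str]
    experimental_setup: str
    result: float           # f(I): actual performance
    baseline_delta: float
    logs: str

@dataclass
class Finding:
    id: str
    level: str                       # "Idea" | "Implement" | "Progress"
    hypothesis: str
    rationale: str
    valuation: ValuationVector
    related_findings: list[str] = field(default_factory=list)
    implementation: Optional[ImplementationRecord] = None
    source: str = "System"           # "Human" | "System"
    created_at: datetime = field(default_factory=datetime.now)

f = Finding(id="f-001", level="Idea",
            hypothesis="reuse draft tokens across batches",
            rationale="amortizes drafting cost",
            valuation=ValuationVector(0.7, 0.6, 0.4))
print(f.level, f.valuation.utility)  # Idea 0.7
```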

Memory Growth Dynamics

Campaign Timeline (1 month)
│
├── Day 1-3: Seeding
│   ├── Human knowledge loaded (papers, baselines, known methods)
│   ├── Initial Idea Findings generated from seed knowledge
│   └── Memory size: ~50-200 findings (mostly human-sourced)
│
├── Day 3-10: Early Exploration
│   ├── High κ (exploration coefficient): system tries diverse hypotheses
│   ├── Many implementations fail (building negative knowledge)
│   ├── First successful implementations appear
│   └── Memory size: ~500-1,500 findings
│
├── Day 10-20: Focused Exploitation
│   ├── System identifies promising directions from early successes
│   ├── κ may decrease as promising regions are found
│   ├── Implementations become more targeted
│   ├── Multiple Progress Findings emerge
│   └── Memory size: ~2,000-3,500 findings
│
└── Day 20-30: Refinement
    ├── Deep exploitation of most promising directions
    ├── Ablation studies and cross-dataset evaluation
    ├── Paper synthesis for best results
    └── Memory size: ~4,000-5,000+ findings

Retrieval System for Context Management

As the Findings Memory grows, it becomes too large for a single LLM context window. The retrieval system addresses this:

Retrieval Pipeline
│
├── Index Construction
│   ├── Each finding is embedded (hypothesis text + metadata)
│   ├── Index updated incrementally as new findings are added
│   └── Supports both semantic and keyword search
│
├── Query Construction
│   ├── Current task context
│   ├── Recent findings (last N)
│   ├── Identified gaps and opportunities
│   └── Combined into retrieval query
│
├── Top-K Selection
│   ├── Retrieve K most relevant findings
│   ├── K sized to fit within LLM context budget
│   ├── Balance: recent findings + historically important findings
│   └── Include both successes and failures for balanced context
│
└── Context Assembly
    ├── Task description (fixed)
    ├── Retrieved Top-K findings (variable)
    ├── Recent findings (sliding window)
    └── Combined context → Strategist LLM

Memory as Scientific Knowledge Graph

The Findings Memory implicitly forms a knowledge graph through the related_findings field. Each finding references the findings that informed it, creating a directed acyclic graph (DAG) of scientific reasoning:

Human Knowledge (papers, baselines)
├── Idea A (inspired by Paper X)
│   ├── Implement A (failed: -2.3% vs baseline)
│   └── Idea B (inspired by Idea A's failure)
│       ├── Implement B (succeeded: +4.1% vs baseline)
│       │   └── Progress B (ablation confirms contribution)
│       └── Idea C (combining Idea B with Paper Y)
│           └── Implement C (succeeded: +7.9% vs baseline)  ← PA-Detect
│               └── Progress C → Paper
│
├── Idea D (independent of A)
│   └── Implement D (failed: -0.5% vs baseline)
│       └── [Negative result informs future hypotheses]
│
└── ... thousands more paths, most ending in failure

This graph structure enables the Strategist to understand not just what has been tried, but why it was tried and what it led to. The causal chain from human knowledge through failed attempts to eventual success is the intellectual history of the campaign — and it's fully recorded in the Findings Memory.
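Tracing such a lineage is a simple graph walk over the related_findings links; the finding IDs below are invented to mirror the example above.

```python
# Toy lineage trace: walk related_findings links backwards from a
# Progress Finding to its roots in human knowledge.
findings = {
    "paper-X":    {"source": "Human",  "related": []},
    "idea-A":     {"source": "System", "related": ["paper-X"]},
    "impl-A":     {"source": "System", "related": ["idea-A"]},  # failed attempt
    "idea-B":     {"source": "System", "related": ["impl-A"]},  # inspired by the failure
    "impl-B":     {"source": "System", "related": ["idea-B"]},
    "progress-B": {"source": "System", "related": ["impl-B"]},
}

def lineage(fid, graph):
    """Depth-first walk from a finding back to its human-knowledge roots."""
    chain, stack, seen = [], [fid], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        chain.append(node)
        stack.extend(graph[node]["related"])
    return chain

print(lineage("progress-B", findings))
# ['progress-B', 'impl-B', 'idea-B', 'impl-A', 'idea-A', 'paper-X']
```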

Comparison to Other Systems' Memory

System                   Memory Type                 Persistence        Structure                                   Growth
AI Scientist             None (stateless per paper)  Session only       Unstructured                                No
Autoresearch (Karpathy)  Git history + results.tsv   Permanent          Flat log                                    Linear
AlphaEvolve              MAP-Elites archive          Per-run            Grid (behavior space)                       Bounded
FunSearch                Island populations          Per-run            Best-shot per island                        Bounded
OpenEvolve               Multi-island populations    Checkpointed       Population per island                       Bounded
EurekaClaw               4-tier memory system        Permanent          Tiered (RAM → disk → graph → insights)      Unbounded
DeepScientist            3-level Findings Memory     Campaign duration  Hierarchical (Idea → Implement → Progress)  Unbounded

DeepScientist's memory is distinctive in several ways:

1. Typed hierarchy: the three levels (Idea/Implement/Progress) mirror the actual scientific workflow
2. Valuation vectors: each finding carries quantitative assessments, not just text
3. Dual-source: human knowledge and system knowledge coexist in the same structure
4. Negative results preserved: failed implementations are treated as valuable data, not discarded
5. Retrieval-backed scaling: memory can grow beyond context limits without losing access


14 Continued Learning

Intra-Campaign Learning

DeepScientist's primary learning mechanism operates within a single campaign. The Findings Memory accumulates knowledge that directly influences future hypothesis generation:

Learning Loop (within campaign)
│
├── t=0: Strategist has only human knowledge
│   └── Hypotheses are broad, exploratory, human-knowledge-biased
│
├── t=100: Memory contains ~100 findings
│   └── Strategist begins recognizing patterns in failures
│   └── Hypotheses become more targeted
│
├── t=500: Memory contains ~500 findings
│   └── Strategist has a model of "what works" for this task
│   └── Exploration focuses on variations of successful approaches
│
├── t=1000: Memory contains ~1000+ findings
│   └── Strategist's implicit model is refined
│   └── Hypotheses are highly focused, diminishing marginal returns
│
└── Qualitative shift: from exploration to exploitation as campaign progresses

This is not "learning" in the machine learning sense (no weights are updated). It is in-context learning at the campaign level — the LLM's hypothesis generation improves as it receives more information about what works and what doesn't. The Findings Memory serves as the "training set" for this in-context learning.
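The loop can be schematized as follows. `run_campaign` is a toy simulation, not the system: the "strategist" is a random stand-in whose exploitation probability grows with the size of the memory, which is the only dynamic being illustrated.

```python
import random

def run_campaign(steps: int, seed: int = 0) -> list:
    """Schematic campaign loop: every iteration conditions on the memory so far."""
    rng = random.Random(seed)
    memory = []                                   # t=0: human knowledge only
    for t in range(steps):
        context = memory[-50:]                    # sliding window fed back into the prompt
        # Toy "strategist": exploit more as evidence accumulates (capped at 0.9).
        exploit = min(len(memory) / steps, 0.9)
        hypothesis = "variation-of-best" if rng.random() < exploit else "broad-exploration"
        # Toy "experiment": exploitation has a slightly better expected outcome.
        result = rng.gauss(0.0, 1.0) + (0.5 if hypothesis == "variation-of-best" else 0.0)
        memory.append({"t": t, "hypothesis": hypothesis, "result": result,
                       "context_size": len(context)})
    return memory
```

The qualitative shift described above falls out of the single line computing `exploit`: early iterations are almost all exploratory, late iterations mostly refine what already worked.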

The Surrogate Model's Implicit Improvement

A subtlety of DeepScientist's BO formulation: the surrogate model (LLM Reviewer) implicitly improves over the course of a campaign, even though its weights are frozen:

Surrogate Accuracy Over Time
│
├── t=0: LLM Reviewer evaluates hypotheses based on general scientific knowledge
│   └── Accuracy: Low (no task-specific calibration)
│   └── The v_u, v_q, v_e scores are educated guesses
│
├── t=100: LLM Reviewer sees hypothesis + 100 past findings as context
│   └── Accuracy: Improving (can compare against known results)
│   └── Valuation becomes data-driven, not just prior-driven
│
├── t=500: LLM Reviewer sees hypothesis + Top-K from 500 findings
│   └── Accuracy: Moderate (has empirical calibration data)
│   └── Can estimate improvement magnitude based on similar past findings
│
└── t=1000+: LLM Reviewer sees hypothesis + Top-K from 1000+ findings
    └── Accuracy: Highest (rich empirical basis for judgment)
    └── Effectively calibrated against hundreds of real experiments

This mirrors classical BO: as more (x, y) pairs are observed, the Gaussian Process posterior becomes more accurate. In DeepScientist, as more (hypothesis, result) pairs accumulate in memory, the LLM Reviewer's in-context "posterior" becomes more accurate. The mechanism is entirely different (statistical vs. prompt-based), but the functional effect is similar.
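The analogy can be made concrete with a toy acquisition score: treat similar past results as observations and let the uncertainty bonus decay as they accumulate. This is a numeric stand-in for the LLM Reviewer's valuation, with an assumed `1/sqrt(n+1)` decay; it is not the paper's formula.

```python
import math

def ucb_score(similar_results: list, kappa: float = 1.0, prior: float = 0.0) -> float:
    """UCB-style valuation: empirical mean of similar past results plus an
    exploration bonus that shrinks as evidence accumulates."""
    n = len(similar_results)
    mean = (sum(similar_results) / n) if n else prior
    bonus = kappa / math.sqrt(n + 1)
    return mean + bonus

# Early in the campaign: no comparable findings, so the bonus dominates.
early = ucb_score([])                     # 0.0 + 1/sqrt(1) = 1.0
# Late in the campaign: many comparable findings, so the score tracks the evidence.
late = ucb_score([4.1, 3.8, 4.4, 3.9])    # mean 4.05 + 1/sqrt(5) ≈ 4.497
```

In classical BO the posterior variance shrinks statistically; here the shrinkage is hand-coded, but the functional effect on selection is the same kind.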

Cross-Campaign Learning

The paper does not explicitly describe cross-campaign learning (transferring findings from one task's campaign to another). This is a notable gap:

| Learning Type | Within Campaign | Across Campaigns |
|---|---|---|
| Hypothesis quality improvement | Yes (via Findings Memory) | Not described |
| Surrogate calibration | Yes (via in-context learning) | Not described |
| Strategy evolution | Yes (via accumulated patterns) | Not described |
| Method transfer | N/A | Not described |

Potential avenues for cross-campaign learning:

- Meta-strategies that work across tasks (e.g., "ablation-first approaches are reliable") could be extracted and reused
- Findings Memory from one task could seed another task's initial hypotheses
- The exploration coefficient κ could be calibrated based on past campaigns' innovation rates

Comparison to Evolutionary Learning

DeepScientist's within-campaign learning differs from evolutionary systems in important ways:

| Aspect | Evolutionary (FunSearch, AlphaEvolve) | Bayesian Optimization (DeepScientist) |
|---|---|---|
| What evolves | Population of programs | Memory of findings |
| Selection | Fitness-proportional | UCB acquisition function |
| Recombination | Crossover of programs | LLM synthesis of ideas from multiple findings |
| Mutation | LLM-based code perturbation | LLM-based hypothesis generation |
| Memory | Population (bounded) | Findings Memory (unbounded) |
| Learning signal | Fitness score (scalar) | Valuation vector (3D) + experimental result |
| Convergence | Population concentrates | Exploitation weight increases |

The key difference: evolutionary systems learn by maintaining a population of solutions that improves through selection and variation. DeepScientist learns by maintaining a memory of knowledge that informs hypothesis generation. The evolutionary approach is bottom-up (good solutions survive); the BO approach is top-down (principled selection guides exploration).
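The selection row of the table can be shown side by side. Both functions below are generic sketches of the two selection styles, not code from either family of systems:

```python
import math
import random

def fitness_proportional(population: list, rng: random.Random):
    """Evolutionary-style selection: sample a parent with probability
    proportional to fitness (population is a list of (item, fitness) pairs)."""
    weights = [max(fit, 1e-9) for _, fit in population]   # guard against zero fitness
    return rng.choices([item for item, _ in population], weights=weights, k=1)[0]

def ucb_select(candidates: list) -> dict:
    """BO-style selection: deterministic argmax of mean result plus a
    decaying exploration bonus (each candidate carries its past results)."""
    def score(c: dict) -> float:
        n = len(c["results"])
        mean = sum(c["results"]) / n if n else 0.0
        return mean + 1.0 / math.sqrt(n + 1)
    return max(candidates, key=score)
```

The contrast is visible in the signatures: evolutionary selection is stochastic over a population, while the UCB acquisition is a deterministic argmax over accumulated evidence.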

Human-in-the-Loop Learning

The 3 human experts who verify DeepScientist's outputs represent a learning mechanism that the paper somewhat underplays:

Human Expert Contribution
│
├── Filter: Reject hallucinated or trivially flawed results
│   └── Prevents false Progress Findings from contaminating memory
│
├── Validate: Confirm genuine innovations are real
│   └── Provides ground truth that the system cannot generate alone
│
├── Guide (implicit): Expert attention patterns may influence priority
│   └── Unclear if experts can intervene during campaigns
│
└── Quality gate: Final barrier before claiming SOTA
    └── Ensures published methods are genuinely novel and correct

The human experts serve as a high-quality but low-bandwidth "oracle" — they cannot evaluate thousands of findings, but they can verify the small number of Progress Findings. This hybrid autonomy (system explores broadly, humans verify narrowly) is a pragmatic architecture for current LLM capabilities.


15 Applications

Primary Application: Autonomous Frontier AI Research

DeepScientist is designed for a specific class of problems:

| Criterion | Requirement | Rationale |
|---|---|---|
| Codebase | Existing repository with baseline implementation | Agent needs a starting point to modify |
| Metrics | Well-defined quantitative evaluation metrics | UCB acquisition needs scalar scores |
| Search space | Large space of possible improvements | Justifies the BO overhead vs. manual search |
| Evaluation cost | Moderate (hours, not weeks, per trial) | Month-long campaign needs ~1,100 evaluations |
| Domain | Frontier AI research | LLM reasoning is strongest in this domain |
| Baseline | Known SOTA for comparison | Progression requires a target to beat |
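A task meeting these criteria might be specified roughly as follows. Every field name and value here is hypothetical (the repository URL is a placeholder); the sketch only shows what the criteria pin down and how a budget sanity check could use them.

```python
# Hypothetical task specification mirroring the criteria above.
# None of these field names come from the released code.
task_spec = {
    "codebase":  "https://github.com/example/ai-text-detection",  # placeholder URL
    "baseline":  {"method": "current SOTA detector", "auroc": 0.87},
    "metric":    {"name": "AUROC", "direction": "maximize"},
    "eval_cost": {"hours_per_trial": 2},
    "budget":    {"max_evaluations": 1100, "wall_clock_days": 30},
}

def feasible(spec: dict) -> bool:
    """Sanity check: the evaluation budget must fit the campaign wall clock,
    assuming (arbitrarily, for illustration) 4 parallel GPU workers."""
    gpu_hours = spec["eval_cost"]["hours_per_trial"] * spec["budget"]["max_evaluations"]
    return gpu_hours <= spec["budget"]["wall_clock_days"] * 24 * 4
```

With 2 hours per trial and ~1,100 evaluations, the example spec needs ~2,200 GPU-hours, which fits inside a 30-day campaign on the assumed 4 workers.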

Demonstrated Application Domains

| Domain | Task | DeepScientist Method | Result |
|---|---|---|---|
| AI Agents | Failure attribution | A2P (Abduction-Action-Prediction) | +183.7% accuracy |
| Systems/ML | Inference acceleration | ACRA | +1.9% tokens/s |
| NLP/Security | AI text detection | PA-Detect | +7.9% AUROC, +190% speed |

Potential Extension Domains

Based on the system's architecture, DeepScientist could be applied to any domain meeting the above criteria:

| Domain | Example Task | Feasibility | Notes |
|---|---|---|---|
| Computer Vision | Object detection on COCO | High | Well-defined metrics, existing codebases |
| NLP | Machine translation (BLEU) | High | Standard benchmarks, clear evaluation |
| Reinforcement Learning | Sample efficiency on Atari | Medium | Evaluation is expensive (many episodes) |
| Drug Discovery | Molecular property prediction | Medium | Requires domain-specific knowledge |
| Robotics | Control policy optimization | Low | Physical experiments not feasible |
| Theorem Proving | Proof success rate | Medium | Needs formal verification tooling |
| Code Generation | HumanEval/MBPP | High | Well-defined metrics, fast evaluation |

Integration Scenarios

Scenario 1: Corporate AI Research Lab

Research team identifies frontier task with stagnating progress
    → Configure DeepScientist with task repository and baselines
    → Allocate GPU cluster for month-long campaign
    → System generates thousands of hypotheses
    → UCB guides exploration/exploitation
    → ~20 innovations discovered
    → Human researchers review and validate top results
    → Publish methods that surpass SOTA
    → ROI: 3 SOTA methods per month per GPU cluster

Scenario 2: Academic Research Acceleration

PhD student working on a specific AI problem
    → Student provides existing codebase + baselines + metrics
    → Scaled-down campaign (1-2 GPUs, 1 week)
    → System explores hypothesis space student hasn't considered
    → Generates ~100 implementations, ~5-10 potential improvements
    → Student analyzes results, integrates best ideas
    → Accelerates research timeline from months to weeks

Scenario 3: Benchmark Competition

Competition organizer releases new benchmark
    → Configure DeepScientist with competition codebase and metrics
    → Run parallel campaigns with different task configurations
    → System explores solution space exhaustively
    → Submit best Progress Findings to leaderboard
    → Potential for autonomous competition entries

Limitations

| Limitation | Severity | Impact | Mitigation Path |
|---|---|---|---|
| Extreme compute cost | High | Restricts use to well-funded labs | Model cost decreases, more efficient search |
| API dependency | High | Relies on Gemini and Claude APIs | Open-weight model alternatives |
| Human expert requirement | Medium | 3 experts needed for verification | Better automated verification |
| Domain restriction | Medium | Currently limited to frontier AI tasks | Architecture is domain-agnostic |
| Low conversion rate | Medium | 0.42% ideas → innovations | Better surrogate models, smarter acquisition |
| Month-long campaigns | Medium | Slow iteration on system design | Shorter campaigns for prototyping |
| Surrogate calibration | Medium | LLM Reviewer may misjudge hypothesis value | Calibration against actual results |
| Reproducibility | Medium | Stochastic LLM outputs | Seed control, ensemble strategies |
| Negative result waste | Low | Most compute produces failures | Failures inform future search (by design) |

The Efficiency Question

The paper's most provocative framing concerns the efficiency of autonomous discovery:

"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"

This question has profound implications for the field:

Current state (DeepScientist):

5,000 ideas → 1,100 implementations → 21 innovations → 3 SOTA
Conversion rate: 0.06% (ideas to SOTA)
Cost: ~$200K-762K for 3 SOTA methods

Hypothetical 10x improvement:

5,000 ideas → 1,100 implementations → 210 innovations → 30 SOTA
Conversion rate: 0.6% (ideas to SOTA)
Cost: ~$7K-25K per SOTA method (same ~$200K-762K total spend, 10x the yield)

Hypothetical 100x improvement:

500 ideas → 200 implementations → 50 innovations → 10 SOTA
Conversion rate: 2% (ideas to SOTA)
Cost: ~$2K-8K per SOTA method

At 100x improvement, autonomous research becomes economically accessible to individual researchers. The question is whether better surrogate models, smarter acquisition functions, or more efficient implementation agents can achieve this.
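The funnel arithmetic behind these figures is easy to reproduce. The cost range below is the estimate quoted earlier in this section; the per-method figures simply follow by division.

```python
def funnel_stats(ideas: int, sota: int, total_cost_usd: float) -> tuple:
    """Conversion rate (ideas -> SOTA) and cost per SOTA method for a discovery funnel."""
    rate = sota / ideas
    per_method = total_cost_usd / sota
    return rate, per_method

# Reported campaign: 5,000 ideas -> 3 SOTA methods at roughly $200K-$762K total.
rate, per_method_low = funnel_stats(5_000, 3, 200_000)
_,    per_method_high = funnel_stats(5_000, 3, 762_000)
# rate == 0.0006 (0.06%); per-method cost spans roughly $67K-$254K
```

Re-running `funnel_stats` with the hypothetical improved funnels shows how quickly per-method cost falls as conversion rises, since the dominant cost (implementations evaluated) stays fixed or shrinks.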

Strengths vs. Weaknesses Summary

| Strength | Weakness |
|---|---|
| Mathematically principled BO framework for discovery | Extreme compute requirements (20,000+ GPU hours) |
| Verified SOTA-surpassing results on 3 tasks | Low conversion rate (0.42% of ideas lead to innovation) |
| 60% accept rate (3/5 papers accepted by DeepReviewer) | DeepReviewer is from the same team (potential bias) |
| Human expert review confirms venue-quality papers | Only 3 human experts (limited statistical power) |
| UCB provides principled exploration/exploitation | Surrogate model (LLM Reviewer) lacks calibrated uncertainty |
| Findings Memory preserves negative results | No cross-campaign learning mechanism |
| Dual-model architecture (reasoning + coding) | API dependency on frontier models |
| Scalable via parallel GPU instances | Month-long campaigns limit iteration speed |
| Clear improvement over all prior systems | Human supervision still required for verification |

Historical Significance

DeepScientist marks a watershed moment in autonomous research: the first system to demonstrate verified, quantitative improvements over human state-of-the-art on frontier AI tasks. Previous systems (AI Scientist, CycleResearcher, Zochi) demonstrated the ability to generate plausible research papers, but none demonstrated the ability to produce methods that actually work better than existing ones.

The progression from CycleResearcher to DeepScientist mirrors the broader trajectory of the field:

2024: Can AI write research papers?  → Yes, but low quality (AI Scientist)
2024: Can AI improve paper quality?  → Yes, via review-driven refinement (CycleResearcher)
2025: Can AI make real discoveries?  → Yes, but at enormous cost (DeepScientist)
2026: Can AI do this efficiently?    → Open question

DeepScientist answers the "Can AI make real discoveries?" question affirmatively, but its 0.06% conversion rate and $200K+ cost per task make clear that the efficiency problem is the next frontier. The Bayesian Optimization framework provides the right conceptual foundation for attacking this problem — better surrogate models, smarter acquisition functions, and more efficient implementation agents could dramatically reduce the cost of autonomous discovery.

The system's honest reporting of its funnel metrics (5,000 → 1,100 → 21 → 3) is itself a contribution. It quantifies what everyone suspected but no one had measured: autonomous scientific discovery is possible but profoundly inefficient with current technology. This sets a concrete benchmark for future systems to improve upon.


Analysis prepared April 2026. Based on arXiv:2509.26603 and publicly available materials from the ResearAI project.