DeepScientist
Bayesian Optimization-guided autonomous scientific discovery system that surpassed human state-of-the-art on three frontier AI tasks through month-long continuous GPU campaigns.
- Organization: Westlake University (Engineering School)
- Published: September 30, 2025
- Type: paper (arXiv:2509.26603)
- Report Type: PhD-Level Technical Analysis
- Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
- arXiv: 2509.26603
- Published: September 30, 2025
- Code: github.com/ResearAI/DeepScientist
- Project Page: ai-researcher.net
- License: CC BY-NC-SA 4.0
- Tagline: Progressive scientific discovery through Bayesian Optimization over an LLM-driven hypothesis–implementation–analysis cycle
- Default models: Gemini-2.5-Pro (core logic / reviewer / strategist), Claude-4-Opus (code generation / implementation agent)
- Input modes: Fully autonomous month-long research campaigns on frontier AI tasks, seeded with a codebase repository and initial findings memory
Naming and Lineage
"DeepScientist" signals two things: the "Deep" prefix invokes both deep learning and the depth of the system's search (thousands of hypotheses, hundreds of implementations, month-long campaigns), while "Scientist" positions the system as an autonomous researcher rather than merely an assistant or copilot. The name also places the system in deliberate contrast to Sakana AI's "AI Scientist" — claiming a more rigorous, results-driven approach to the same aspiration.
The system is a direct evolution of the CycleResearcher line of work from the same lead author (Yixuan Weng). CycleResearcher introduced the idea of review-driven iterative refinement of AI-generated research papers. DeepScientist extends this from paper refinement to full-stack scientific discovery — from hypothesis generation through experimental validation to paper synthesis.
Lineage Chain
CycleResearcher (Weng et al., 2024)
│ Review-driven iterative paper refinement
│ Introduced DeepReviewer for automated evaluation
│
└── DeepScientist (Weng et al., 2025) ← this system
Full Bayesian Optimization formulation
Hypothesis → Implementation → Analysis cycle
Findings Memory as cumulative knowledge base
Surpassed human SOTA on 3 frontier tasks
The evolution from CycleResearcher to DeepScientist represents a fundamental shift: CycleResearcher operated on papers (text artifacts), while DeepScientist operates on methods (code + experiments + results). The review model from CycleResearcher becomes the surrogate function in DeepScientist's Bayesian Optimization loop.
Unique Position in the Ecosystem
DeepScientist is the first autonomous research system to demonstrate verified state-of-the-art surpassing results on frontier AI tasks. While other systems (AI Scientist, AI Researcher, CycleResearcher) generate research papers that are then evaluated by automated reviewers, DeepScientist produces working implementations that measurably outperform existing methods. This is the critical distinction: the output is not a paper that scores well on review metrics — it is a method that scores well on task metrics.
Ecosystem Positioning
│
├── AI Scientist (Sakana) — breadth: generates ML experiment papers
├── AI Researcher (Alibaba) — breadth: generates research papers from ideas
├── CycleResearcher (Westlake) — refinement: iterative paper improvement via review
├── AI Scientist v2 (Sakana) — evolution: open-ended, multi-paper campaigns
├── Zochi — quality: high-quality paper generation
└── DeepScientist (Westlake) — depth + results: BO-guided discovery with SOTA outcomes ← this system
2 Authors and Team
| Author | Role (Inferred) | Note |
|---|---|---|
| Yixuan Weng* | Co-lead, system architect | Same lead author as CycleResearcher; * indicates equal contribution |
| Minjun Zhu* | Co-lead, implementation lead | * indicates equal contribution |
| Qiujie Xie | Core contributor | |
| Qiyao Sun | Core contributor | |
| Zhen Lin | Core contributor | |
| Sifan Liu | Core contributor | |
| Yue Zhang† | Corresponding author, PI | † indicates corresponding; senior researcher at Westlake |
BibTeX Citation
@article{weng2025deepscientist,
title = {DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively},
author = {Weng, Yixuan and Zhu, Minjun and Xie, Qiujie and Sun, Qiyao
and Lin, Zhen and Liu, Sifan and Zhang, Yue},
journal = {arXiv preprint arXiv:2509.26603},
year = {2025}
}
Team composition: Seven authors from Westlake University's Engineering School. The equal-contribution designation for the first two authors suggests a system architect / implementation lead split — Weng bringing the conceptual framework from CycleResearcher, Zhu leading the engineering implementation. Yue Zhang as corresponding author and PI indicates this is a focused research lab effort with strong senior supervision.
Institutional context: Westlake University is a private research-intensive university in Hangzhou, China, founded in 2018 with an explicit mandate for frontier research. The Engineering School's NLP/AI group (under Yue Zhang) has been productive in the AI-for-science space, with CycleResearcher and DeepScientist representing a coherent multi-year research program rather than a one-off contribution.
Human supervision: The paper acknowledges 3 human experts who verified outputs and filtered hallucinations. This is significant — DeepScientist is not fully autonomous in the way Karpathy's autoresearch is. Human experts serve as a final filter on the discovery pipeline, ensuring that claimed innovations are genuine. The paper is transparent about this, which strengthens its credibility.
3 Core Contribution
Key Novelty: DeepScientist formalizes autonomous scientific discovery as a Bayesian Optimization problem, where the search space is all possible scientific methods, the objective function is the true value of a method, and an LLM Reviewer serves as the surrogate model — enabling UCB-guided exploration/exploitation of hypothesis space that produced 21 genuine scientific innovations from ~5,000 generated ideas, including three methods that surpassed human state-of-the-art.
The Bayesian Optimization Formulation
This is DeepScientist's central intellectual contribution — not just a system, but a mathematical framework for autonomous discovery. The formulation:
Objective:
I* = argmax_{I ∈ 𝓘} f(I)
Where:
- 𝓘 is the space of all possible scientific methods (hypotheses, algorithms, implementations)
- f(I) is the true value function — the actual performance of method I when implemented and evaluated
- I* is the globally optimal method (unknown, approximated through search)
The Problem: Evaluating f(I) is extremely expensive. Each evaluation requires:
1. Generating a full implementation of method I
2. Running experiments (potentially hours of GPU time)
3. Analyzing results against baselines
This makes exhaustive search impossible. Bayesian Optimization provides the principled framework for deciding which hypothesis to evaluate next.
The Surrogate Model: DeepScientist uses an LLM Reviewer as the surrogate model that approximates f cheaply. For each candidate hypothesis, the reviewer produces a valuation vector:
V = <v_u, v_q, v_e>
Where:
- v_u = utility value — estimated practical impact and performance improvement
- v_q = quality value — estimated methodological soundness and rigor
- v_e = exploration value — estimated novelty and distance from previously explored regions
This three-dimensional valuation is a significant design choice. Traditional Bayesian Optimization uses a scalar objective; DeepScientist explicitly decomposes the surrogate into orthogonal axes that capture different aspects of scientific value.
The Acquisition Function: UCB (Upper Confidence Bound) balances exploitation of promising hypotheses with exploration of novel directions:
I_{t+1} = argmax_I ( w_u · v_u(I) + w_q · v_q(I) + κ · v_e(I) )
Where:
- w_u = weight on utility (exploitation signal)
- w_q = weight on quality (exploitation signal)
- κ = exploration coefficient (controls exploration–exploitation tradeoff)
- v_e = exploration value (pure exploration signal)
The UCB formulation has a well-known theoretical foundation in the multi-armed bandit literature. By casting hypothesis selection as a bandit problem with an LLM-estimated reward model, DeepScientist inherits the regret guarantees and convergence properties of UCB — at least in principle, modulo the accuracy of the LLM surrogate.
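The acquisition step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the weight values `w_u`, `w_q`, and `kappa` are illustrative placeholders, not numbers reported by the authors:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    v_u: float  # utility value from the LLM reviewer
    v_q: float  # quality value
    v_e: float  # exploration (novelty) value

def ucb_score(h: Hypothesis, w_u: float = 1.0, w_q: float = 0.5,
              kappa: float = 0.8) -> float:
    """UCB-style acquisition: weighted exploitation terms plus an exploration bonus."""
    return w_u * h.v_u + w_q * h.v_q + kappa * h.v_e

def select_next(candidates: list[Hypothesis], **weights) -> Hypothesis:
    """Pick the hypothesis with the highest acquisition score to implement next."""
    return max(candidates, key=lambda h: ucb_score(h, **weights))

candidates = [
    Hypothesis("incremental tweak", v_u=0.7, v_q=0.8, v_e=0.1),
    Hypothesis("novel cross-domain idea", v_u=0.5, v_q=0.6, v_e=0.9),
]
best = select_next(candidates)
```

With these illustrative weights the higher-novelty hypothesis wins despite its lower utility estimate, which is exactly the behavior κ is meant to control.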
Why This Formulation Matters
Most autonomous research systems use ad-hoc methods for deciding what to try next:
| System | Hypothesis Selection Strategy | Theoretical Foundation |
|---|---|---|
| AI Scientist | LLM generates ideas, scores by novelty/feasibility | Heuristic |
| AlphaEvolve | MAP-Elites archive + LLM mutation | Evolutionary (quality-diversity) |
| FunSearch | LLM mutation + island model + best-shot sampling | Evolutionary |
| Autoresearch (Karpathy) | LLM decides freely, greedy hill-climbing | None (LLM intuition) |
| OpenEvolve | Multi-island evolution + bandit-based selection | Partially principled |
| DeepScientist | UCB acquisition over LLM surrogate | Bayesian Optimization |
DeepScientist is unique in providing a mathematically principled framework for the explore/exploit tradeoff in scientific discovery. The UCB acquisition function is not just a heuristic — it is an instance of a well-studied algorithm with known optimality properties.
The Five Differentiating Claims
- Mathematical formulation: Scientific discovery as Bayesian Optimization, not ad-hoc search
- Surrogate model: LLM Reviewer as cheap approximation to expensive evaluation
- UCB acquisition: Principled exploration/exploitation balance, not greedy or random
- Hierarchical memory: Three-level Findings Memory that accumulates knowledge across campaigns
- Verified SOTA results: Three methods that measurably surpassed human state-of-the-art
Comparison to Related Systems
| System | Origin | Paradigm | Output | SOTA Claims | Scale |
|---|---|---|---|---|---|
| AI Scientist (Sakana, 2024) | Industry | LLM paper generation | Papers | No | ~10 papers |
| AI Researcher (Alibaba, 2024) | Industry | Paper generation pipeline | Papers | No | ~7 papers |
| CycleResearcher (Westlake, 2024) | Academic | Review-driven refinement | Papers | No | ~6 papers |
| AI Scientist v2 (Sakana, 2025) | Industry | Open-ended campaigns | Papers | No | ~3 papers |
| Zochi (2025) | Unknown | High-quality generation | Papers | No | ~2 papers |
| DeepScientist (Westlake, 2025) | Academic | Bayesian Optimization | Methods + Papers | Yes (3 tasks) | ~5,000 ideas → 21 innovations |
The scale differential is striking. While other systems produce single-digit papers, DeepScientist generates thousands of hypotheses and hundreds of implementations. The funnel ratio (~5,000 → ~1,100 → 21) reveals the true difficulty of autonomous discovery: only 0.42% of generated ideas lead to genuine innovations. This is a profoundly important empirical finding about the nature of LLM-driven research.
4 Supported Solutions
Primary Domain: Frontier AI Research Tasks
DeepScientist targets frontier AI research problems where: 1. A codebase repository exists with a baseline implementation 2. Quantitative evaluation metrics are well-defined 3. The search space of possible improvements is large 4. Experimental validation is computationally feasible (hours, not weeks per trial)
The Three Demonstrated Tasks
| Task | Domain | Baseline | Metric | DeepScientist Method | Improvement |
|---|---|---|---|---|---|
| Agent Failure Attribution | AI agents | Existing attribution methods | Accuracy | A2P (Abduction-Action-Prediction) | +183.7% |
| LLM Inference Acceleration | Systems/ML | Standard inference pipelines | Tokens/second | ACRA | +1.9% |
| AI Text Detection | NLP/Security | Detection classifiers | AUROC + Latency | PA-Detect | +7.9% AUROC, -65.5% latency (~2.9× faster) |
Task 1: Agent Failure Attribution
Problem: Given a trace of an AI agent's actions and an observed failure, attribute the failure to the specific action(s) that caused it.
DeepScientist's Discovery — A2P (Abduction-Action-Prediction): The A2P method uses a three-phase reasoning approach: 1. Abduction — infer possible causes of the failure from the observed outcome 2. Action analysis — evaluate each action in the trace against the abduced causes 3. Prediction — predict which action-cause pair best explains the failure
The +183.7% improvement over baseline is by far the largest gain among the three tasks. This suggests the baseline was particularly weak or the problem was particularly amenable to LLM-style reasoning improvements.
Task 2: LLM Inference Acceleration
Problem: Accelerate the inference speed of large language models without degrading output quality.
DeepScientist's Discovery — ACRA: The +1.9% improvement in tokens/second is modest in absolute terms but significant in context — inference optimization is a heavily-studied area where marginal gains are hard-won. The method name suggests it involves some form of adaptive or conditional computation routing.
Task 3: AI Text Detection
Problem: Distinguish AI-generated text from human-written text.
DeepScientist's Discovery — PA-Detect: This is arguably the most impressive result across two dimensions: - +7.9% AUROC — substantial improvement in detection accuracy - -65.5% latency — simultaneously making detection nearly 3× faster
Achieving accuracy and speed improvements simultaneously is rare — most optimizations trade one for the other. PA-Detect's dual improvement suggests a fundamentally better approach rather than incremental tuning.
Solution Space Characterization
For each task, DeepScientist operates within a defined solution space:
Per-Task Solution Space
│
├── Repository-Level Modifications
│ ├── Algorithm changes (new methods, modified pipelines)
│ ├── Architecture changes (model structure, layers, attention)
│ ├── Training modifications (loss functions, schedules, augmentation)
│ ├── Inference optimizations (caching, batching, pruning)
│ └── Evaluation protocol changes (metrics, preprocessing)
│
├── Hypothesis-Level Innovations
│ ├── Novel combinations of existing techniques
│ ├── Theoretical insights applied to implementation
│ ├── Cross-domain transfer of methods
│ └── Ablation-discovered simplifications
│
└── Constraints
├── Must produce runnable code (not just ideas)
├── Must improve quantitative metrics vs. baseline
├── Must be validated through controlled experiments
└── Must survive human expert review
The Innovation Funnel
DeepScientist's most revealing metric is the conversion rate through its funnel:
~5,000 Idea Findings generated
│
│ UCB acquisition selects most promising
│ (~22% selection rate)
│
~1,100 Implement Findings validated
│
│ Only those surpassing baseline promoted
│ (~1.9% of implementations succeed)
│
21 Progress Findings (genuine innovations)
│
│ Only ~0.42% of original ideas lead to innovation
│
3 SOTA-surpassing methods published
This funnel ratio is itself a scientific contribution. It quantifies the difficulty of autonomous discovery: even with a principled search strategy and powerful LLM models, the vast majority of hypotheses fail. The 0.42% success rate suggests that scientific innovation — even in well-defined AI tasks — remains extremely hard.
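The funnel percentages quoted above follow directly from the stage counts; a quick arithmetic check:

```python
# Stage counts reported for DeepScientist's discovery funnel
ideas, implemented, innovations, sota = 5000, 1100, 21, 3

selection_rate = implemented / ideas            # ideas promoted to implementation
implementation_yield = innovations / implemented  # implementations that became innovations
idea_to_innovation = innovations / ideas
idea_to_sota = sota / ideas

print(f"selection rate:       {selection_rate:.1%}")        # 22.0%
print(f"implementation yield: {implementation_yield:.1%}")  # 1.9%
print(f"idea -> innovation:   {idea_to_innovation:.2%}")    # 0.42%
print(f"idea -> SOTA:         {idea_to_sota:.2%}")          # 0.06%
```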
5 LLM Integration
Dual-Model Architecture
DeepScientist employs a deliberate separation of concerns between two frontier LLMs:
| Role | Model | Justification |
|---|---|---|
| Core logic | Gemini-2.5-Pro | Strategist, reviewer, hypothesis evaluator, report writer — requires broad reasoning, long context, and nuanced scientific judgment |
| Code generation | Claude-4-Opus | Implementation agent — requires precise, repository-level code generation with strong debugging capabilities |
This is not a cost optimization (both are frontier models) but a capability optimization. The authors implicitly argue that the best reasoning model and the best coding model are different — and that the system benefits from using each where it excels.
LLM as Surrogate Model
The most novel use of the LLM is as the surrogate function in the Bayesian Optimization loop. In classical BO, the surrogate is a Gaussian Process or a neural network trained on past evaluations. In DeepScientist, the surrogate is an LLM Reviewer — a prompted Gemini-2.5-Pro that takes a hypothesis description and returns a valuation vector V = <v_u, v_q, v_e>.
Classical BO DeepScientist BO
┌─────────────────┐ ┌─────────────────┐
│ Gaussian Process │ │ LLM Reviewer │
│ (trained on data)│ │ (prompted, zero- │
│ │ │ or few-shot) │
│ Input: x ∈ ℝ^d │ │ Input: I (text) │
│ Output: μ(x),σ(x)│ │ Output: V=<v_u, │
│ │ │ v_q, v_e> │
└─────────────────┘ └─────────────────┘
Key differences from classical surrogates:
| Property | Gaussian Process | LLM Reviewer |
|---|---|---|
| Input space | Continuous ℝ^d | Natural language (hypothesis descriptions) |
| Training | Fitted to (x, y) pairs | Pre-trained on scientific literature |
| Uncertainty | Calibrated posterior variance σ(x) | Heuristic exploration score v_e |
| Update | Bayesian posterior update | Prompt-based (context of past findings) |
| Cost | O(n³) for n observations | O(1) API call per evaluation |
| Expressiveness | Smooth functions | Arbitrary scientific reasoning |
The LLM surrogate sacrifices calibrated uncertainty (the core theoretical advantage of GPs) for expressiveness over a vastly richer input space. This is a pragmatic choice: you cannot represent "a new method for agent failure attribution based on abductive reasoning" as a point in ℝ^d, but an LLM can reason about it.
How Each LLM Is Used at Each Stage
Stage 1 — Strategize & Hypothesize (Gemini-2.5-Pro):
Input: Findings Memory (thousands of structured records)
+ Retrieved Top-K relevant findings (when memory exceeds context)
+ Task description and current SOTA baselines
LLM Operations:
1. Analyze patterns in past findings (successes, failures, trends)
2. Generate novel hypothesis based on gap analysis
3. Produce valuation vector V = <v_u, v_q, v_e> for each hypothesis
4. Store as "Idea Finding" in memory
Output: Ranked set of Idea Findings with valuation vectors
Stage 2 — Implement & Verify (Claude-4-Opus):
Input: Selected Idea Finding (highest UCB score)
+ Repository codebase (full access)
+ Experimental baselines and metrics
LLM Operations:
1. Plan implementation strategy (reading existing code structure)
2. Generate code changes (repository-level modifications)
3. Execute experiments in sandboxed environment
4. Debug failures, iterate on implementation
5. Produce experimental logs and results
Output: Implementation + experimental results + updated finding record
Stage 3 — Analyze & Report (Gemini-2.5-Pro):
Input: Successful implementation results
+ Baseline comparisons
+ Full experimental logs
LLM Operations:
1. Design deeper analytical experiments (ablations, new datasets)
2. Manage experimental lifecycle via MCP tools
3. Synthesize results into coherent narrative
4. Generate research paper with proper structure and citations
Output: Research paper + comprehensive experimental analysis
Agent Capabilities
The implementation agent (Stage 2) has notably broad permissions:
| Permission | Scope | Rationale |
|---|---|---|
| Read code | Full repository access | Must understand existing codebase to modify it |
| Write code | Full repository modification | Must implement novel methods |
| Execute code | Sandboxed environment | Must run experiments |
| Internet access | Yes | May need to reference documentation, download dependencies |
| Install packages | Yes | May need new libraries for implementation |
| GPU access | Dedicated H800 GPU | Experiments require accelerator compute |
This is more permissive than most autonomous research systems. AI Scientist (Sakana) restricts agents to a predefined template. Karpathy's autoresearch limits modifications to a single file. DeepScientist gives the coding agent full repository-level access — reflecting the reality that genuine scientific innovation often requires structural changes, not just parameter tuning.
Prompt Engineering and Review Architecture
The LLM Reviewer (surrogate model) is a cornerstone of the system's effectiveness. It must:
- Assess utility — estimate how much a proposed method would improve task performance
- Assess quality — evaluate methodological soundness, potential pitfalls, feasibility
- Assess exploration value — determine how novel the hypothesis is relative to previously explored ideas
The three-dimensional output is critical. A single scalar score would collapse these orthogonal concerns, making it impossible for the UCB acquisition function to properly balance exploration and exploitation. By separating the signals, the system can independently control how much it values novelty (via κ) versus expected performance (via w_u, w_q).
Contrast with Single-Model Systems
| System | Models Used | Separation of Concerns |
|---|---|---|
| AI Scientist | GPT-4 / Claude for everything | None — same model ideates, codes, writes, reviews |
| Autoresearch | Single coding agent | None — LLM does all reasoning and coding |
| AlphaEvolve | Gemini Flash + Pro ensemble | Model hierarchy (fast for quantity, large for quality) |
| DeepScientist | Gemini-2.5-Pro + Claude-4-Opus | Functional (reasoning vs. coding) |
DeepScientist's separation is functional (reasoning vs. coding) rather than hierarchical (cheap vs. expensive). This is a defensible architecture decision: the best available reasoning model need not be the best coder, and vice versa.
6 Key Results
SOTA-Surpassing Results
The headline results are the three methods that surpassed human state-of-the-art:
Result 1: Agent Failure Attribution — A2P Method
| Metric | Baseline SOTA | DeepScientist (A2P) | Improvement |
|---|---|---|---|
| Accuracy | Not specified | +183.7% over baseline | +183.7% |
The A2P (Abduction-Action-Prediction) method is a three-phase reasoning framework: 1. Abduction: Given the observed failure, generate candidate causal explanations 2. Action Analysis: For each action in the agent trace, evaluate its alignment with each candidate cause 3. Prediction: Select the action-cause pair with highest explanatory power
The magnitude of improvement (+183.7%) is extraordinary. Such a large gain typically indicates either: - The baseline was particularly weak (a common criticism) - The task was particularly amenable to the type of reasoning an LLM can bring - The method represents a genuinely transformative approach
Given that this is a relatively new task (agent failure attribution), the first explanation is most likely — but the result is still noteworthy as a demonstration of autonomous discovery.
Result 2: LLM Inference Acceleration — ACRA Method
| Metric | Baseline SOTA | DeepScientist (ACRA) | Improvement |
|---|---|---|---|
| Tokens/second | Not specified | +1.9% over baseline | +1.9% |
The +1.9% improvement is modest but meaningful in a domain where: - Inference optimization is heavily studied by well-funded teams (NVIDIA, Google, Meta) - Most easy optimizations have already been found - Even 1-2% improvements translate to significant cost savings at scale
The ACRA method name suggests Adaptive/Conditional computation with some form of Routing or Attention modification. The fact that an autonomous system found a genuine improvement in this heavily-optimized space is itself a significant result.
Result 3: AI Text Detection — PA-Detect Method
| Metric | Baseline SOTA | DeepScientist (PA-Detect) | Improvement |
|---|---|---|---|
| AUROC | Not specified | +7.9% over baseline | +7.9% |
| Latency | Not specified | -65.5% (190% faster) | +190% speed |
PA-Detect is the most compelling result because it achieves Pareto improvement — simultaneously better on two competing objectives:
AUROC (higher is better)
▲
│ ★ PA-Detect
│ ╱
│ ╱ Pareto frontier shift
│ ╱
│ ○ Previous SOTA
│
└──────────────────► Latency (lower is better)
Achieving +7.9% AUROC while simultaneously reducing latency by 65.5% means PA-Detect doesn't just find a better point on the existing accuracy-speed tradeoff curve — it shifts the entire Pareto frontier. This is rare in optimization and suggests a fundamentally better algorithmic approach rather than hyperparameter tuning.
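A Pareto improvement of this kind is easy to state precisely. The sketch below uses invented absolute numbers (the paper reports only relative deltas) to show that applying the reported deltas yields dominance on both axes:

```python
def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if method a Pareto-dominates b.
    Each point is (auroc, latency); higher AUROC and lower latency are better."""
    auroc_a, lat_a = a
    auroc_b, lat_b = b
    return (auroc_a >= auroc_b and lat_a <= lat_b
            and (auroc_a > auroc_b or lat_a < lat_b))

# Illustrative absolute baseline (not from the paper); deltas are the reported ones.
previous_sota = (0.85, 100.0)                        # AUROC, latency (ms)
pa_detect = (0.85 * 1.079, 100.0 * (1 - 0.655))      # +7.9% AUROC, -65.5% latency

assert dominates(pa_detect, previous_sota)
speedup = previous_sota[1] / pa_detect[1]            # ~2.9x, i.e. +190% speed
```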
Automated Review Evaluation (DeepReviewer)
Table 2 from the paper compares DeepScientist's papers against other AI research systems using DeepReviewer (the automated review model from CycleResearcher):
| System | Papers | Soundness | Presentation | Contribution | Rating | Accept Rate |
|---|---|---|---|---|---|---|
| AI Scientist | 10 | 2.08 | 1.80 | 1.75 | 3.35 | 0% |
| AI Researcher | 7 | 1.75 | 1.46 | 1.57 | 2.57 | 0% |
| AI Scientist v2 | 3 | 1.67 | 1.50 | 1.50 | 2.33 | 0% |
| CycleResearcher | 6 | 2.25 | 1.75 | 2.13 | 3.75 | 0% |
| Zochi | 2 | 2.38 | 2.38 | 2.25 | 4.63 | 0% |
| DeepScientist | 5 | 2.90 | 2.90 | 2.90 | 5.90 | 60% |
Analysis of the Review Scores
DeepScientist dominates every dimension. The gap is not marginal:
| Dimension | DeepScientist | Next Best | Gap |
|---|---|---|---|
| Soundness | 2.90 | 2.38 (Zochi) | +0.52 |
| Presentation | 2.90 | 2.38 (Zochi) | +0.52 |
| Contribution | 2.90 | 2.25 (Zochi) | +0.65 |
| Rating | 5.90 | 4.63 (Zochi) | +1.27 |
| Accept Rate | 60% | 0% (all others) | +60pp |
The 60% accept rate is the most striking number. Every other system — including sophisticated ones like AI Scientist v2 and Zochi — has a 0% accept rate under DeepReviewer. DeepScientist is the first to cross the acceptance threshold, and it does so with a substantial majority (3 of 5 papers accepted).
Potential confound: DeepReviewer was developed by the same team (from CycleResearcher). While the automated reviewer was validated against human judgments in the CycleResearcher paper, there is a risk of systemic bias — the reviewer may favor the same team's output style. The human expert review helps address this concern.
Human Expert Review
To address the automated reviewer concern, the paper reports human expert evaluation:
| Metric | DeepScientist Papers | ICLR 2025 Human Papers |
|---|---|---|
| Average rating | 5.00 | 5.08 |
| Number of reviewers | 3 per paper | 3-4 per paper |
| Inter-annotator agreement (Krippendorff α) | 0.739 | N/A |
The DeepScientist papers received ratings statistically indistinguishable from human-written papers at a top ML venue.
Key observations:
- Rating 5.00 vs 5.08: The 0.08 gap is well within noise. At ICLR, a rating of 5 corresponds roughly to "marginally below acceptance threshold" — meaning these papers are competitive with, but not clearly above, venue-quality human research.
- Krippendorff's α = 0.739: This indicates "substantial agreement" (>0.667 threshold). The reviewers were consistent in their assessments, suggesting the ratings are reliable rather than artifacts of reviewer disagreement.
- Caveat: Three reviewers across a handful of papers provides limited statistical power. The confidence interval around 5.00 is wide. But the directional finding — that AI-generated research papers can achieve parity with human submissions — is still noteworthy.
Scale Metrics
| Metric | Value | Interpretation |
|---|---|---|
| GPU hours consumed | 20,000+ | Equivalent to ~$200K-400K at cloud rates |
| Unique scientific ideas | ~5,000 | Massive hypothesis generation capacity |
| Experimentally validated | ~1,100 | 22% of ideas selected for implementation |
| Scientific innovations | 21 | 1.9% of implementations, 0.42% of ideas |
| SOTA-surpassing methods | 3 | 14.3% of innovations, 0.06% of all ideas |
| Campaign duration | ~1 month per task | Continuous autonomous operation |
| Human experts for verification | 3 | Necessary for hallucination filtering |
The Dissipation Problem
The paper's own analysis reveals a critical challenge: the vast majority of compute is "wasted" on hypotheses that don't work. The funnel from 5,000 ideas to 3 SOTA methods represents a 0.06% conversion rate. The authors frame this honestly:
"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"
This framing shifts the research agenda from "can AI do science?" (answered: yes) to "can AI do science efficiently?" (answered: not yet). The 20,000+ GPU hours for 3 results is orders of magnitude more expensive than a human research team producing comparable results — but the cost is expected to decrease as models improve and search becomes more efficient.
7 Reproducibility
Code Availability
| Artifact | Available | Location |
|---|---|---|
| Source code | Yes | github.com/ResearAI/DeepScientist |
| Project page | Yes | ai-researcher.net |
| Paper | Yes | arXiv:2509.26603 |
| Discovered methods (A2P, ACRA, PA-Detect) | Partial | In paper descriptions |
| Findings Memory dumps | Unknown | Not explicitly released |
| Experimental logs | Unknown | Not explicitly released |
| DeepReviewer model | Inherited from CycleResearcher | Separate release |
Reproducibility Barriers
High barriers to exact reproduction:
| Barrier | Severity | Notes |
|---|---|---|
| Hardware | Critical | 2 servers × 8 NVIDIA H800 GPUs = 16 H800s required |
| Compute cost | Critical | 20,000+ GPU hours ≈ $200K-400K at cloud rates |
| Model access | High | Requires Gemini-2.5-Pro and Claude-4-Opus API access |
| API costs | High | Month-long continuous LLM API calls at frontier model rates |
| Time | High | Month-long campaigns per task — cannot reproduce quickly |
| Stochasticity | Medium | LLM outputs are stochastic; same prompts yield different hypotheses |
| Human experts | Medium | 3 domain experts needed for output verification |
| Task baselines | Low | Frontier AI tasks with known baselines |
The reproducibility challenge is primarily economic, not technical. The system architecture is documented, the code is released, and the methodology is clear. But actually running DeepScientist requires resources that few academic labs possess — essentially two full compute nodes with H800 GPUs running for a month, plus substantial API budgets.
Partial Reproduction Path
A realistic partial reproduction strategy:
Full reproduction (prohibitive for most):
├── 16 × H800 GPUs for 1 month per task
├── Gemini-2.5-Pro + Claude-4-Opus API budget
├── 3 domain experts
└── Total: ~$200K-400K per task
Scaled-down reproduction (feasible):
├── 1-2 × consumer GPUs (A100/H100)
├── Smaller hypothesis budget (100-500 instead of 5,000)
├── Shorter campaigns (days instead of months)
├── Smaller models (Gemini Flash, Claude Sonnet)
└── Total: ~$1K-10K per task
Concept validation (cheap):
├── Single GPU
├── Reproduce the BO formulation on a toy task
├── Verify UCB acquisition logic
├── Test Findings Memory data structures
└── Total: ~$100-500
What Would Strengthen Reproducibility
- Release Findings Memory dumps — allow researchers to analyze the discovery trajectory
- Release experimental logs — enable understanding of which hypotheses failed and why
- Release the discovered methods — full code for A2P, ACRA, PA-Detect (partially available in paper)
- Provide scaling curves — how do results degrade with fewer GPUs, shorter campaigns, cheaper models?
- Open-source the evaluation harness — standardized benchmarks for comparing autonomous research systems
8 Compute and API Costs
Hardware Configuration
Server 1                              Server 2
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  8 × NVIDIA H800 (80GB)         │   │  8 × NVIDIA H800 (80GB)         │
│                                 │   │                                 │
│  GPU 0:  DeepScientist instance │   │  GPU 8:  DeepScientist instance │
│  GPU 1:  DeepScientist instance │   │  GPU 9:  DeepScientist instance │
│  GPU 2:  DeepScientist instance │   │  GPU 10: DeepScientist instance │
│  GPU 3:  DeepScientist instance │   │  GPU 11: DeepScientist instance │
│  GPU 4:  DeepScientist instance │   │  GPU 12: DeepScientist instance │
│  GPU 5:  DeepScientist instance │   │  GPU 13: DeepScientist instance │
│  GPU 6:  DeepScientist instance │   │  GPU 14: DeepScientist instance │
│  GPU 7:  DeepScientist instance │   │  GPU 15: DeepScientist instance │
└─────────────────────────────────┘   └─────────────────────────────────┘
Each GPU runs a separate DeepScientist instance
16 parallel exploration threads
Month-long continuous operation per task
Cost Estimation
| Component | Quantity | Unit Cost (est.) | Total (est.) |
|---|---|---|---|
| H800 GPU hours | 20,000+ | $2-4/hr (cloud) | $40,000-80,000 |
| Gemini-2.5-Pro API | ~millions of tokens | $0.00125-0.01/1K tokens | $5,000-50,000 |
| Claude-4-Opus API | ~millions of tokens (code gen) | $0.015-0.075/1K tokens | $10,000-100,000 |
| Human expert time | 3 experts × ~40 hrs | $100-200/hr | $12,000-24,000 |
| Total per task | | | $67,000-254,000 |
| Total (3 tasks) | | | $200,000-762,000 |
These are rough estimates based on publicly available pricing. The actual costs may be lower if the team had institutional discounts or higher if API usage was more intensive than estimated.
Cost per Innovation
| Metric | Value | Interpretation |
|---|---|---|
| Cost per idea generated | ~$40-150 | Cheap: LLM inference for hypothesis generation |
| Cost per implementation validated | ~$180-690 | Moderate: GPU hours for experimentation |
| Cost per innovation discovered | ~$9,500-36,300 | Expensive: reflects the low hit rate |
| Cost per SOTA-surpassing method | ~$67,000-254,000 | Very expensive: but comparable to a postdoc-year |
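The per-unit figures above follow from dividing the estimated three-task totals by the funnel counts reported in the paper (~5,000 ideas, ~1,100 implementations, 21 innovations, 3 SOTA methods). A quick sanity check of that arithmetic (the dollar totals are this report's estimates, not published numbers):

```python
# Divide the estimated three-task cost totals by the counts from the paper.
TOTAL_LOW, TOTAL_HIGH = 200_000, 762_000          # est. cost, 3 tasks (USD)
COUNTS = {"idea": 5_000, "implementation": 1_100,
          "innovation": 21, "SOTA method": 3}

for label, n in COUNTS.items():
    low, high = TOTAL_LOW / n, TOTAL_HIGH / n
    print(f"cost per {label}: ${low:,.0f}-${high:,.0f}")
# cost per idea: $40-$152
# cost per implementation: $182-$693
# cost per innovation: $9,524-$36,286
# cost per SOTA method: $66,667-$254,000
```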
Comparison to Human Research Costs
| Research Mode | Cost per SOTA Result | Time to Result | Scalability |
|---|---|---|---|
| PhD student | ~$60,000-120,000/year | 1-5 years | Not scalable |
| Industry research team | ~$500,000-2M/year | 6-18 months | Limited |
| DeepScientist | ~$67K-254K | ~1 month | Parallelizable |
The comparison is imperfect — human researchers produce understanding, mentorship, and serendipitous discoveries alongside their primary results. But on the narrow dimension of "time and cost to produce a SOTA-surpassing method," DeepScientist is competitive with or faster than human researchers, albeit at similar cost.
Scaling Properties
DeepScientist exhibits a property that human research does not: near-linear scaling with compute. More GPUs mean more parallel exploration threads, more hypotheses evaluated, and, in expectation, more innovations discovered. Human research teams face diminishing returns as team size grows (coordination costs, communication overhead, duplication of effort).
Innovation yield vs. compute (conceptual)
Innovations │ DeepScientist
discovered │ ╱ (near-linear with compute)
│ ╱
│ ╱
│ ╱ Human team
│ ╱ ╱ (diminishing returns)
│ ╱ ╱
│ ╱ ╱
│╱ ╱
└─────────────────── Compute / Team Size
This scaling property — if it holds as hypothesized — is the most consequential implication of the DeepScientist framework. It suggests that scientific discovery can be industrialized: throw more GPUs at the problem and get more results, linearly.
9 Architecture Solution
Three-Stage Hierarchical Discovery Cycle
DeepScientist's architecture is a three-stage cycle, where each stage builds on the outputs of the previous one. The cycle repeats continuously for the duration of a campaign (approximately one month per task).
┌──────────────────────────────────────────────────────────────────────┐
│ DEEPSCIENTIST DISCOVERY CYCLE │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ STAGE 1 │ │ FINDINGS MEMORY │ │
│ │ Strategize & │◄───┤ │ │
│ │ Hypothesize │ │ ┌─────────────────┐ │ │
│ │ │ │ │ Idea Findings │ │ │
│ │ • Analyze memory │───►│ │ (hypotheses) │ │ │
│ │ • Retrieve Top-K │ │ ├─────────────────┤ │ │
│ │ • LLM surrogate │ │ │ Implement │ │ │
│ │ valuation V │ │ │ Findings │ │ │
│ │ • Store Idea │ │ │ (code + results) │ │ │
│ │ Findings │ │ ├─────────────────┤ │ │
│ └────────┬────────────┘ │ │ Progress │ │ │
│ │ │ │ Findings │ │ │
│ │ UCB selects │ │ (innovations) │ │ │
│ │ best candidate │ └─────────────────┘ │ │
│ ▼ │ │ │
│ ┌─────────────────────┐ │ Also contains: │ │
│ │ STAGE 2 │ │ • Human knowledge │ │
│ │ Implement & │ │ (papers, code) │ │
│ │ Verify │ │ • Historical results │ │
│ │ │ │ • Valuation vectors │ │
│ │ • Coding agent │ └───────────────────────┘ │
│ │ • Sandboxed env │ │
│ │ • Full repo access │ │
│ │ • GPU experiments │ │
│ │ • Result f(I_{t+1})│ │
│ └────────┬────────────┘ │
│ │ │
│ │ If result surpasses baseline │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ STAGE 3 │ │
│ │ Analyze & │ │
│ │ Report │ │
│ │ │ │
│ │ • Ablation studies │ │
│ │ • New dataset eval │ │
│ │ • MCP lifecycle │ │
│ │ • Paper synthesis │ │
│ └─────────────────────┘ │
│ │
│ ← Cycle repeats for ~1 month per task → │
└──────────────────────────────────────────────────────────────────────┘
Stage 1: Strategize & Hypothesize
Purpose: Generate and evaluate candidate hypotheses using the LLM as a surrogate model.
Process:
1. Load current Findings Memory (potentially thousands of structured records)
2. When memory exceeds context window, use retrieval model for Top-K relevant findings
3. LLM Reviewer (Gemini-2.5-Pro) analyzes patterns in past findings
4. Generate new hypotheses based on identified gaps and opportunities
5. For each hypothesis, produce valuation vector V = <v_u, v_q, v_e>
6. Store as "Idea Finding" in memory
Key design choice: The retrieval step is critical for scaling. As the Findings Memory grows to thousands of entries, it cannot fit in a single LLM context window. The retrieval model ensures that the most relevant past findings are presented to the strategist, providing continuity even as the memory grows beyond context limits.
Stage 2: Implement & Verify
Purpose: Select the most promising hypothesis via UCB, implement it, and evaluate experimentally.
Process:
1. UCB acquisition function selects the Idea Finding with highest score:
I_{t+1} = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)
2. Selected finding promoted to "Implement Finding"
3. Coding agent (Claude-4-Opus) receives the hypothesis and repository
4. Agent plans implementation strategy by reading existing code
5. Agent implements changes (full repository-level modifications)
6. Experiments executed in sandboxed environment with dedicated GPU
7. Results f(I_{t+1}) recorded and finding record updated
Key design choice: The coding agent has full permissions (read code, access internet, install packages). This is necessary for genuine innovation — constraining the agent to a template or a single file would preclude the kinds of structural changes that lead to breakthrough methods.
Stage 3: Analyze & Report
Purpose: Triggered only by successful implementations that surpass the baseline. Performs deeper analysis and generates a research paper.
Process:
1. Successful implementation promoted to "Progress Finding"
2. MCP tools manage the experimental lifecycle
3. Deeper analytical experiments: ablation studies, evaluation on new datasets
4. Synthesis agent (Gemini-2.5-Pro) produces a coherent research paper
Key design choice: Stage 3 is conditional — it only fires for successes. This prevents wasting compute on analyzing failed experiments. The asymmetry is important: hypothesis generation and implementation are cheap enough to run speculatively, but deep analysis and paper writing should only happen for results worth reporting.
Parallel Execution Architecture
16 GPU instances, each running independently:
GPU 0: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 1: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
GPU 2: Stage 1 → Stage 2 (experiment) → Stage 3 (success!) → Stage 1 → ...
...
GPU 15: Stage 1 → Stage 2 (experiment) → Stage 1 → Stage 2 → ...
Shared: Findings Memory (read by all, written by all)
Each GPU instance operates as an independent exploration thread, sharing the Findings Memory. This architecture enables:
- Parallel exploration: 16 hypotheses can be evaluated simultaneously
- Shared learning: discoveries on one GPU inform hypothesis generation on all others
- Fault tolerance: failure of one instance doesn't affect others
- Linear scaling: adding GPUs proportionally increases throughput
The shared Findings Memory is the key coordination mechanism. Without it, each GPU would explore independently, duplicating effort and missing opportunities to build on each other's findings.
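The coordination pattern can be sketched with threads standing in for GPU instances (a toy model; the real system runs separate processes, one per GPU, against a persistent store):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class SharedFindingsMemory:
    """Shared read/write store, modeled as a lock-protected list."""
    _lock: threading.Lock = field(default_factory=threading.Lock)
    findings: list = field(default_factory=list)

    def snapshot(self) -> list:
        with self._lock:              # readers get a consistent copy
            return list(self.findings)

    def record(self, finding: dict) -> None:
        with self._lock:              # writers serialize their appends
            self.findings.append(finding)

def worker(gpu_id: int, memory: SharedFindingsMemory, cycles: int) -> None:
    for t in range(cycles):
        context = memory.snapshot()   # "shared learning": see others' results
        result = {"gpu": gpu_id, "cycle": t, "seen": len(context)}
        memory.record(result)         # discovery becomes visible to all

memory = SharedFindingsMemory()
threads = [threading.Thread(target=worker, args=(g, memory, 3))
           for g in range(16)]        # 16 parallel exploration threads
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(memory.findings))           # 48 findings from 16 instances × 3 cycles
```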
10 Component Breakdown
Component Architecture
DeepScientist System Components
│
├── Strategist (Gemini-2.5-Pro)
│ ├── Memory Analyzer — reads and patterns over Findings Memory
│ ├── Hypothesis Generator — produces novel Idea Findings
│ ├── Surrogate Evaluator — LLM Reviewer producing V = <v_u, v_q, v_e>
│ └── Retrieval Model — Top-K finding selection when memory exceeds context
│
├── Selector (UCB Acquisition Function)
│ ├── Weight Manager — maintains w_u, w_q, κ parameters
│ ├── Score Calculator — computes UCB score per Idea Finding
│ └── Promotion Logic — Idea Finding → Implement Finding
│
├── Implementer (Claude-4-Opus)
│ ├── Code Reader — analyzes repository structure
│ ├── Plan Generator — designs implementation strategy
│ ├── Code Writer — repository-level modifications
│ ├── Experiment Runner — executes in sandboxed environment
│ └── Result Logger — records f(I_{t+1}) in finding record
│
├── Analyzer (Gemini-2.5-Pro)
│ ├── Ablation Designer — plans deeper experiments
│ ├── Dataset Evaluator — tests on additional benchmarks
│ ├── MCP Lifecycle Manager — experimental lifecycle tools
│ └── Paper Synthesizer — generates research paper
│
├── Findings Memory
│ ├── Idea Findings Store — hypotheses with valuation vectors
│ ├── Implement Findings Store — implementations with results
│ ├── Progress Findings Store — successful innovations
│ ├── Human Knowledge Base — papers, code repositories
│ └── Retrieval Index — for Top-K selection
│
└── Infrastructure
├── GPU Scheduler — assigns instances to GPUs
├── Sandbox Manager — isolated execution environments
├── API Client Pool — manages LLM API connections
└── Logging & Monitoring — tracks campaign progress
Component Interaction Matrix
| Component | Reads From | Writes To | LLM Model |
|---|---|---|---|
| Strategist | Findings Memory, Task Description | Idea Findings | Gemini-2.5-Pro |
| Selector | Idea Findings (valuation vectors) | Implement Findings (promotion) | None (pure computation) |
| Implementer | Implement Findings, Repository | Implement Findings (results) | Claude-4-Opus |
| Analyzer | Progress Findings, Experimental Data | Progress Findings (analysis), Papers | Gemini-2.5-Pro |
| Retrieval Model | Findings Memory | Top-K selections | Embedding model |
The Surrogate Model Component
The surrogate model deserves special attention as the most architecturally novel component:
Surrogate Model (LLM Reviewer)
│
├── Input Assembly
│ ├── Hypothesis description (natural language)
│ ├── Task context (problem definition, metrics, baseline)
│ ├── Relevant past findings (Top-K from memory)
│ └── Current state of knowledge (patterns, trends)
│
├── Evaluation Process
│ ├── Assess utility: "How much would this improve task performance?"
│ ├── Assess quality: "Is this methodologically sound and feasible?"
│ └── Assess exploration: "How novel is this vs. what we've tried?"
│
└── Output
├── v_u (utility value) — scalar estimate of expected improvement
├── v_q (quality value) — scalar estimate of methodological quality
└── v_e (exploration value) — scalar estimate of novelty
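A minimal sketch of how the surrogate's input assembly and output parsing might be wired. The prompt wording and the JSON reply contract are assumptions; the real prompts for the Gemini-2.5-Pro reviewer are not published, so a stubbed reply stands in for the API call:

```python
import json
from typing import NamedTuple

class ValuationVector(NamedTuple):
    utility: float      # v_u: expected performance improvement
    quality: float      # v_q: methodological soundness
    exploration: float  # v_e: novelty vs. the already-explored space

def build_reviewer_prompt(hypothesis: str, task_context: str,
                          top_k_findings: list[str]) -> str:
    # Input assembly: task context + Top-K retrieved findings + hypothesis.
    findings = "\n".join(f"- {f}" for f in top_k_findings)
    return (
        f"Task: {task_context}\n"
        f"Relevant past findings:\n{findings}\n"
        f"Hypothesis: {hypothesis}\n"
        "Score utility, quality, and exploration in [0, 1]; reply as JSON: "
        '{"utility": ..., "quality": ..., "exploration": ...}'
    )

def parse_valuation(llm_reply: str) -> ValuationVector:
    scores = json.loads(llm_reply)
    return ValuationVector(scores["utility"], scores["quality"],
                           scores["exploration"])

# Stubbed reviewer reply in place of the real LLM call:
reply = '{"utility": 0.7, "quality": 0.8, "exploration": 0.4}'
print(parse_valuation(reply))
# ValuationVector(utility=0.7, quality=0.8, exploration=0.4)
```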
The UCB Selector Component
UCB Acquisition Function
│
├── Input
│ ├── Set of Idea Findings with valuation vectors V_i = <v_u, v_q, v_e>
│ └── Parameters: w_u, w_q, κ
│
├── Score Computation (for each Idea Finding I_i)
│ └── UCB(I_i) = w_u · v_u(I_i) + w_q · v_q(I_i) + κ · v_e(I_i)
│
├── Selection
│ └── I_{t+1} = argmax_{I_i} UCB(I_i)
│
└── Promotion
└── Selected I_{t+1}: Idea Finding → Implement Finding
The UCB selector is the only non-LLM component in the critical path. It is a deterministic function of the valuation vectors, ensuring that the exploration/exploitation tradeoff is principled rather than dependent on LLM stochasticity.
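The selector reduces to a few lines; a sketch with illustrative weight values, since the paper does not publish the actual w_u, w_q, and κ settings:

```python
from typing import NamedTuple

class IdeaFinding(NamedTuple):
    id: str
    v_u: float  # utility
    v_q: float  # quality
    v_e: float  # exploration

def ucb_score(f: IdeaFinding, w_u: float, w_q: float, kappa: float) -> float:
    # UCB(I) = w_u · v_u + w_q · v_q + κ · v_e
    return w_u * f.v_u + w_q * f.v_q + kappa * f.v_e

def select_next(candidates: list[IdeaFinding],
                w_u: float = 1.0, w_q: float = 1.0,
                kappa: float = 0.5) -> IdeaFinding:
    # Deterministic argmax: the only non-LLM step in the critical path.
    return max(candidates, key=lambda f: ucb_score(f, w_u, w_q, kappa))

candidates = [
    IdeaFinding("safe-tweak", v_u=0.3, v_q=0.9, v_e=0.1),
    IdeaFinding("bold-idea",  v_u=0.9, v_q=0.4, v_e=0.8),
]
print(select_next(candidates).id)                     # bold-idea (1.70 vs 1.25)
print(select_next(candidates, w_q=2.0, kappa=0.0).id) # safe-tweak (risk-averse)
```

Raising w_q and lowering κ shifts the same candidate pool toward the low-risk choice, which is the risk-averse/risk-seeking dial described above.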
MCP Tools for Experimental Lifecycle
Stage 3 uses MCP (Model Context Protocol) tools to manage the experimental lifecycle:
| Tool Category | Purpose | Used In |
|---|---|---|
| Experiment Design | Plan ablation studies, control experiments | Stage 3 |
| Dataset Management | Load, preprocess, evaluate on new datasets | Stage 3 |
| Result Compilation | Aggregate metrics, generate tables and figures | Stage 3 |
| Paper Formatting | Structure sections, manage references, LaTeX | Stage 3 |
The use of MCP tools rather than hardcoded logic allows the analysis pipeline to be flexible — the LLM can decide which tools to invoke based on the specific findings, rather than following a rigid template.
11 Core Mechanisms (Detailed)
Mechanism 1: Bayesian Optimization over Hypothesis Space
Formal setup:
Let 𝓘 denote the space of all possible scientific methods (a vast, combinatorial space of algorithms, architectures, and configurations). Let f: 𝓘 → ℝ be the true value function that maps each method to its actual performance. The goal:
I* = argmax_{I ∈ 𝓘} f(I)
The BO loop proceeds as follows:
t = 0: Initialize Findings Memory M_0 with human knowledge (papers, baselines)
For t = 1, 2, 3, ...:
1. SURROGATE UPDATE
Use LLM Reviewer to produce valuation vectors for new Idea Findings:
V(I) = <v_u(I), v_q(I), v_e(I)> for each I in candidates
2. ACQUISITION
Select next hypothesis to evaluate:
I_{t+1} = argmax_{I} UCB(I) = argmax_{I} (w_u · v_u + w_q · v_q + κ · v_e)
3. EVALUATION
Deploy coding agent to implement and test I_{t+1}:
f(I_{t+1}) = ExperimentalResult(I_{t+1})
4. MEMORY UPDATE
Update M_t → M_{t+1} with result:
- If f(I_{t+1}) > f(I_best): promote to Progress Finding
- Otherwise: record result and update valuation model's context
5. REPEAT
Why UCB and not other acquisition functions?
| Acquisition Function | Formula | Properties | Verdict |
|---|---|---|---|
| UCB | μ(x) + κσ(x) | Deterministic, tunable exploration | Used |
| Expected Improvement (EI) | E[max(f(x) − f*, 0)] | Greedy, low exploration | Too exploitative for open-ended search |
| Thompson Sampling | Sample from posterior | Stochastic, naturally explores | Hard to implement with LLM surrogate |
| Knowledge Gradient | Value of information | Optimal for finite budget | Computationally expensive |
UCB is the natural choice because:
1. It is deterministic given the valuation vector — no additional stochasticity beyond the LLM
2. The exploration coefficient κ is explicitly tunable — the researchers can control the exploration/exploitation balance
3. It works with any surrogate that produces mean + uncertainty — the three-component valuation vector is a natural fit
Mechanism 2: The Three-Component Valuation Vector
The valuation vector V = <v_u, v_q, v_e> decomposition is not standard in BO. Classical BO surrogates produce a mean prediction μ(x) and uncertainty σ(x). DeepScientist's decomposition maps to BO concepts as follows:
Classical BO: DeepScientist:
├── μ(x) (mean) ←→ w_u · v_u + w_q · v_q (exploitation signal)
└── σ(x) (variance) ←→ κ · v_e (exploration signal)
The separation of the exploitation signal into utility (v_u) and quality (v_q) is the key innovation. In classical BO, the mean is a single scalar. In DeepScientist, the exploitation signal is a weighted combination of two semantically distinct assessments:
- Utility (v_u): "How much performance improvement would this method achieve?" — focuses on the magnitude of the expected gain
- Quality (v_q): "How methodologically sound is this approach?" — focuses on the probability of the gain being real
This separation allows the system to distinguish between:
- High utility, low quality: bold ideas that promise large gains but may be unsound (high-risk, high-reward)
- Low utility, high quality: incremental improvements that are almost certain to work (low-risk, low-reward)
- High utility, high quality: the most promising candidates (rare, highly selected)
By adjusting the weights w_u and w_q, the system can shift between risk-seeking and risk-averse strategies.
Mechanism 3: Findings Memory as Cumulative Knowledge Base
The Findings Memory is a structured, list-style database with thousands of records organized into three hierarchical levels:
Findings Memory
│
├── Level 1: IDEA FINDINGS
│ ├── Source: Generated by Strategist (LLM)
│ ├── Content: Hypothesis description, rationale, related work references
│ ├── Metadata: Valuation vector V, generation timestamp, source findings
│ ├── Lifecycle: Created → [Selected by UCB → Promoted to Level 2] or [Persists]
│ └── Volume: ~5,000 over a month-long campaign
│
├── Level 2: IMPLEMENT FINDINGS
│ ├── Source: Promoted from Level 1 after UCB selection
│ ├── Content: Implementation details, code changes, experimental setup
│ ├── Metadata: Result f(I), experimental logs, runtime, resource usage
│ ├── Lifecycle: Created → [Surpasses baseline → Promoted to Level 3] or [Persists]
│ └── Volume: ~1,100 over a month-long campaign
│
└── Level 3: PROGRESS FINDINGS
├── Source: Promoted from Level 2 after successful validation
├── Content: Full method description, ablation results, multi-dataset evaluation
├── Metadata: Improvement over baseline, paper draft, human review status
├── Lifecycle: Created → [Deeper analysis via Stage 3] → [Paper publication]
└── Volume: 21 over a month-long campaign
Dual-source knowledge: The memory contains both:
1. Human knowledge — structured records from existing papers, codebases, and known methods
2. System-generated knowledge — the system's own hypotheses, implementations, and results
This creates a feedback loop: human knowledge seeds the initial exploration, the system generates new findings, which in turn inform future hypothesis generation. Over time, the system-generated knowledge dominates, as the Findings Memory becomes a comprehensive record of what has been tried and what works.
Mechanism 4: Retrieval for Context-Length Management
As the Findings Memory grows, it exceeds the context window of even the largest LLMs. DeepScientist addresses this with a retrieval model:
Findings Memory (thousands of records)
│
├── When memory fits in context:
│ └── Pass entire memory to Strategist LLM
│
└── When memory exceeds context:
├── Retrieval model indexes all findings
├── Query: current task context + recent findings + identified gaps
├── Top-K most relevant findings retrieved
└── Top-K findings passed to Strategist LLM
This ensures:
├── Relevant historical context is always available
├── The system doesn't "forget" important past findings
├── Context window is used efficiently
└── The system can scale to arbitrary campaign lengths
The retrieval mechanism is what enables month-long campaigns. Without it, the system would be limited to the number of findings that fit in a single context window — probably a few hundred at most. With retrieval, it can maintain continuity over thousands of findings.
Mechanism 5: Conditional Stage Triggering
Stage 3 (Analyze & Report) is not always triggered. It fires only when an implementation surpasses the baseline:
Stage 2 result: f(I_{t+1})
Decision logic:
├── f(I_{t+1}) > f(I_baseline)?
│ ├── YES → Promote to Progress Finding → Trigger Stage 3
│ └── NO → Record result in memory → Return to Stage 1
│
│ Filtering ratios (from the paper):
│ ├── ~1,100 implementations attempted
│ ├── ~21 surpassed baseline (1.9%)
│ └── Only 21 triggered Stage 3
This asymmetry is crucial for efficiency. Stage 3 involves expensive operations (ablation studies, multi-dataset evaluation, paper synthesis). Running it for every implementation would be wasteful. By restricting it to successes, the system focuses its analytical budget on findings that matter.
Mechanism 6: Progressive Promotion Lifecycle
Each finding follows a lifecycle from idea to publication:
Generated by LLM
│
IDEA FINDING
(hypothesis + V)
│
Selected by UCB?
├── No → Stays in memory
│ (available for future analysis)
└── Yes ↓
│
IMPLEMENT FINDING
(code + experiments)
│
Surpasses baseline?
├── No → Stays in memory with result
│ (negative result is valuable data)
└── Yes ↓
│
PROGRESS FINDING
(innovation + analysis)
│
Deeper analysis (Stage 3)
│
Paper synthesis
│
Human expert review
│
Published method
Critically, negative results are retained in memory. An implementation that fails to surpass the baseline is not discarded — its result is recorded and available to the Strategist. This prevents the system from re-trying failed approaches and allows it to learn from its mistakes. The information content of "hypothesis H was implemented and produced result R which was below baseline" is valuable for guiding future hypothesis generation.
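The lifecycle reduces to a small state machine; a sketch with hypothetical function and field names (note that a finding is only ever promoted, never deleted — a below-baseline result persists at its current level):

```python
from enum import Enum
from typing import Optional

class Level(Enum):
    IDEA = "idea"
    IMPLEMENT = "implement"
    PROGRESS = "progress"

def step_lifecycle(level: Level, selected_by_ucb: bool = False,
                   result: Optional[float] = None,
                   baseline: Optional[float] = None) -> Level:
    """One transition of the promotion lifecycle sketched above."""
    if level is Level.IDEA and selected_by_ucb:
        return Level.IMPLEMENT            # UCB selection promotes the idea
    if (level is Level.IMPLEMENT and result is not None
            and baseline is not None and result > baseline):
        return Level.PROGRESS             # surpassing baseline triggers Stage 3
    return level                          # otherwise the finding persists

assert step_lifecycle(Level.IDEA, selected_by_ucb=True) is Level.IMPLEMENT
assert step_lifecycle(Level.IMPLEMENT, result=0.81, baseline=0.79) is Level.PROGRESS
# A below-baseline result is retained at its level, not discarded:
assert step_lifecycle(Level.IMPLEMENT, result=0.75, baseline=0.79) is Level.IMPLEMENT
```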
12 Programming Language
Implementation Stack
| Component | Language/Framework | Notes |
|---|---|---|
| Core orchestration | Python | Campaign management, memory operations, retrieval |
| Coding agent (implementations) | Python (generated) | Task-specific code for each hypothesis |
| LLM integration | Python (API clients) | Gemini API, Claude API |
| Experimental execution | Python + CUDA | GPU-accelerated experiments |
| MCP tools | Python | Experimental lifecycle management |
| Findings Memory | Structured storage (list-style database) | JSON/database records |
| Retrieval model | Python + embedding model | For Top-K finding selection |
Repository Structure (Inferred)
DeepScientist/
├── deepscientist/
│ ├── strategist/ ← Stage 1: hypothesis generation and evaluation
│ │ ├── analyzer.py ← Findings Memory pattern analysis
│ │ ├── generator.py ← Hypothesis generation
│ │ ├── evaluator.py ← LLM Reviewer (surrogate model)
│ │ └── retriever.py ← Top-K finding retrieval
│ │
│ ├── selector/ ← UCB acquisition function
│ │ ├── ucb.py ← UCB score computation
│ │ └── promoter.py ← Idea → Implement finding promotion
│ │
│ ├── implementer/ ← Stage 2: implementation and verification
│ │ ├── agent.py ← Coding agent orchestration
│ │ ├── sandbox.py ← Sandboxed execution environment
│ │ └── logger.py ← Result recording
│ │
│ ├── analyzer/ ← Stage 3: analysis and reporting
│ │ ├── ablation.py ← Ablation study design and execution
│ │ ├── evaluator.py ← Multi-dataset evaluation
│ │ └── writer.py ← Paper synthesis
│ │
│ ├── memory/ ← Findings Memory management
│ │ ├── store.py ← CRUD operations on findings
│ │ ├── types.py ← Finding type definitions (Idea/Implement/Progress)
│ │ └── index.py ← Retrieval index management
│ │
│ ├── tools/ ← MCP tool definitions
│ │ └── lifecycle.py ← Experimental lifecycle tools
│ │
│ └── config/ ← Campaign configuration
│ ├── task.py ← Task definition (repo, metrics, baseline)
│ └── campaign.py ← Campaign parameters (duration, GPUs, κ)
│
├── tasks/ ← Task definitions for each frontier problem
│ ├── agent_failure/ ← Agent failure attribution task
│ ├── inference_accel/ ← LLM inference acceleration task
│ └── text_detection/ ← AI text detection task
│
└── results/ ← Campaign outputs
├── findings/ ← Findings Memory dumps
├── papers/ ← Generated research papers
└── logs/ ← Experimental logs
Code Generation Patterns
The implementer (Claude-4-Opus) generates task-specific code during Stage 2. The generation pattern follows a structured workflow:
Implementation Workflow (Claude-4-Opus)
│
├── 1. PLAN
│ ├── Read existing codebase structure
│ ├── Identify relevant files and functions
│ ├── Design modification strategy
│ └── Output: Implementation plan (natural language)
│
├── 2. READ
│ ├── Read specific files identified in plan
│ ├── Understand interfaces and dependencies
│ ├── Identify insertion points
│ └── Output: Codebase understanding
│
├── 3. IMPLEMENT
│ ├── Generate code changes
│ ├── May create new files or modify existing ones
│ ├── Repository-level modifications (not just single-file edits)
│ └── Output: Modified codebase
│
├── 4. EXECUTE
│ ├── Run experiments in sandboxed environment
│ ├── Monitor for errors and failures
│ ├── Debug and iterate if necessary
│ └── Output: Experimental results
│
└── 5. LOG
├── Record experimental results
├── Generate experimental logs
├── Compute metrics vs. baseline
└── Output: Result record f(I_{t+1})
13 Memory Management
Findings Memory: Architecture and Data Model
The Findings Memory is the central data structure of DeepScientist — a cumulative, structured knowledge base that grows throughout a campaign. Unlike ephemeral LLM context or conversation history, the Findings Memory is a persistent, typed database of scientific findings.
Finding Record Schema
Each finding in the memory follows a structured schema:
Finding Record
├── id: str ← Unique identifier
├── level: enum {Idea, Implement, Progress} ← Promotion level
├── hypothesis: str ← Natural language description of the idea
├── rationale: str ← Why this hypothesis might work
├── related_findings: list[str] ← IDs of findings that informed this one
├── valuation: ValuationVector ← V = <v_u, v_q, v_e>
│ ├── utility: float ← Expected performance improvement
│ ├── quality: float ← Methodological soundness
│ └── exploration: float ← Novelty relative to explored space
├── implementation: Optional[ImplementationRecord]
│ ├── code_changes: list[str] ← Description of modifications
│ ├── experimental_setup: str ← How experiments were configured
│ ├── result: float ← f(I) — actual performance
│ ├── baseline_delta: float ← Improvement over baseline
│ └── logs: str ← Experimental output logs
├── analysis: Optional[AnalysisRecord] ← Only for Progress Findings
│ ├── ablation_results: dict ← Ablation study outcomes
│ ├── cross_dataset_results: dict ← Performance on additional datasets
│ └── paper_draft: str ← Generated paper content
├── created_at: datetime
├── updated_at: datetime
└── source: enum {Human, System} ← Human knowledge or system-generated
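The schema transcribes directly into Python dataclasses. Field names follow the tree above; the defaults and the example values are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class Level(Enum):
    IDEA = "idea"
    IMPLEMENT = "implement"
    PROGRESS = "progress"

class Source(Enum):
    HUMAN = "human"
    SYSTEM = "system"

@dataclass
class ValuationVector:
    utility: float      # v_u — expected performance improvement
    quality: float      # v_q — methodological soundness
    exploration: float  # v_e — novelty relative to explored space

@dataclass
class ImplementationRecord:
    code_changes: list[str]
    experimental_setup: str
    result: float       # f(I) — actual performance
    baseline_delta: float
    logs: str

@dataclass
class FindingRecord:
    id: str
    level: Level
    hypothesis: str
    rationale: str
    valuation: ValuationVector
    related_findings: list[str] = field(default_factory=list)
    implementation: Optional[ImplementationRecord] = None
    source: Source = Source.SYSTEM
    created_at: datetime = field(default_factory=datetime.now)

# Illustrative record (hypothesis text is invented for the example):
idea = FindingRecord(
    id="f-001", level=Level.IDEA,
    hypothesis="Adaptive draft length for speculative decoding",
    rationale="Past findings show fixed draft lengths underperform",
    valuation=ValuationVector(utility=0.7, quality=0.8, exploration=0.4),
)
```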
Memory Growth Dynamics
Campaign Timeline (1 month)
│
├── Day 1-3: Seeding
│ ├── Human knowledge loaded (papers, baselines, known methods)
│ ├── Initial Idea Findings generated from seed knowledge
│ └── Memory size: ~50-200 findings (mostly human-sourced)
│
├── Day 3-10: Early Exploration
│ ├── High κ (exploration coefficient): system tries diverse hypotheses
│ ├── Many implementations fail (building negative knowledge)
│ ├── First successful implementations appear
│ └── Memory size: ~500-1,500 findings
│
├── Day 10-20: Focused Exploitation
│ ├── System identifies promising directions from early successes
│ ├── κ may decrease as promising regions are found
│ ├── Implementations become more targeted
│ ├── Multiple Progress Findings emerge
│ └── Memory size: ~2,000-3,500 findings
│
└── Day 20-30: Refinement
├── Deep exploitation of most promising directions
├── Ablation studies and cross-dataset evaluation
├── Paper synthesis for best results
└── Memory size: ~4,000-5,000+ findings
Retrieval System for Context Management
As the Findings Memory grows, it becomes too large for a single LLM context window. The retrieval system addresses this:
Retrieval Pipeline
│
├── Index Construction
│ ├── Each finding is embedded (hypothesis text + metadata)
│ ├── Index updated incrementally as new findings are added
│ └── Supports both semantic and keyword search
│
├── Query Construction
│ ├── Current task context
│ ├── Recent findings (last N)
│ ├── Identified gaps and opportunities
│ └── Combined into retrieval query
│
├── Top-K Selection
│ ├── Retrieve K most relevant findings
│ ├── K sized to fit within LLM context budget
│ ├── Balance: recent findings + historically important findings
│ └── Include both successes and failures for balanced context
│
└── Context Assembly
├── Task description (fixed)
├── Retrieved Top-K findings (variable)
├── Recent findings (sliding window)
└── Combined context → Strategist LLM
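A toy sketch of the Top-K step, with a bag-of-words counter standing in for the embedding model (the production system would use a learned embedding index; the finding texts are invented):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: token counts instead of a learned vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, findings: list[str], k: int) -> list[str]:
    # K is sized to fit the Strategist's LLM context budget.
    q = embed(query)
    ranked = sorted(findings, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:k]

findings = [
    "speculative decoding reduced latency by 1.8x",
    "cache quantization hurt accuracy on long contexts",
    "agent failure attribution improved with step-level logs",
]
print(top_k("latency reduction via speculative decoding", findings, k=2))
```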
Memory as Scientific Knowledge Graph
The Findings Memory implicitly forms a knowledge graph through the related_findings field. Each finding references the findings that informed it, creating a directed acyclic graph (DAG) of scientific reasoning:
Human Knowledge (papers, baselines)
├── Idea A (inspired by Paper X)
│ ├── Implement A (failed: -2.3% vs baseline)
│ └── Idea B (inspired by Idea A's failure)
│ ├── Implement B (succeeded: +4.1% vs baseline)
│ │ └── Progress B (ablation confirms contribution)
│ └── Idea C (combining Idea B with Paper Y)
│ └── Implement C (succeeded: +7.9% vs baseline) ← PA-Detect
│ └── Progress C → Paper
│
├── Idea D (independent of A)
│ └── Implement D (failed: -0.5% vs baseline)
│ └── [Negative result informs future hypotheses]
│
└── ... thousands more paths, most ending in failure
This graph structure enables the Strategist to understand not just what has been tried, but why it was tried and what it led to. The causal chain from human knowledge through failed attempts to eventual success is the intellectual history of the campaign — and it's fully recorded in the Findings Memory.
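Because each record points at its parents via related_findings, the intellectual history of a success can be recovered with a depth-first walk; a sketch using the illustrative IDs from the example above:

```python
# Parent links mirror the example DAG above (IDs are illustrative).
parents = {
    "progress-C": ["implement-C"],
    "implement-C": ["idea-C"],
    "idea-C": ["idea-B", "paper-Y"],
    "idea-B": ["implement-A"],      # inspired by Idea A's failure
    "implement-A": ["idea-A"],
    "idea-A": ["paper-X"],
}

def lineage(finding_id: str) -> list[str]:
    """Trace a finding back to its human-knowledge roots (post-order DFS)."""
    seen, order = set(), []
    def visit(fid: str) -> None:
        if fid in seen:
            return
        seen.add(fid)
        for parent in parents.get(fid, []):
            visit(parent)
        order.append(fid)           # parents appear before children
    visit(finding_id)
    return order

print(lineage("progress-C"))
# ['paper-X', 'idea-A', 'implement-A', 'idea-B', 'paper-Y',
#  'idea-C', 'implement-C', 'progress-C']
```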
Comparison to Other Systems' Memory
| System | Memory Type | Persistence | Structure | Growth |
|---|---|---|---|---|
| AI Scientist | None (stateless per paper) | Session only | Unstructured | No |
| Autoresearch (Karpathy) | Git history + results.tsv | Permanent | Flat log | Linear |
| AlphaEvolve | MAP-Elites archive | Per-run | Grid (behavior space) | Bounded |
| FunSearch | Island populations | Per-run | Best-shot per island | Bounded |
| OpenEvolve | Multi-island populations | Checkpointed | Population per island | Bounded |
| EurekaClaw | 4-tier memory system | Permanent | Tiered (RAM → disk → graph → insights) | Unbounded |
| DeepScientist | 3-level Findings Memory | Campaign duration | Hierarchical (Idea → Implement → Progress) | Unbounded |
DeepScientist's memory is unique in several ways:
1. Typed hierarchy: the three levels (Idea/Implement/Progress) reflect the actual scientific workflow
2. Valuation vectors: each finding carries quantitative assessments, not just text
3. Dual-source: human knowledge and system knowledge coexist in the same structure
4. Negative results preserved: failed implementations are valuable data, not discarded
5. Retrieval-backed scaling: memory can grow beyond context limits without losing access
14 Continued Learning
Intra-Campaign Learning
DeepScientist's primary learning mechanism operates within a single campaign. The Findings Memory accumulates knowledge that directly influences future hypothesis generation:
Learning Loop (within campaign)
│
├── t=0: Strategist has only human knowledge
│ └── Hypotheses are broad, exploratory, human-knowledge-biased
│
├── t=100: Memory contains ~100 findings
│ └── Strategist begins recognizing patterns in failures
│ └── Hypotheses become more targeted
│
├── t=500: Memory contains ~500 findings
│ └── Strategist has a model of "what works" for this task
│ └── Exploration focuses on variations of successful approaches
│
├── t=1000: Memory contains ~1000+ findings
│ └── Strategist's implicit model is refined
│ └── Hypotheses are highly focused, diminishing marginal returns
│
└── Qualitative shift: from exploration to exploitation as campaign progresses
This is not "learning" in the machine learning sense (no weights are updated). It is in-context learning at the campaign level — the LLM's hypothesis generation improves as it receives more information about what works and what doesn't. The Findings Memory serves as the "training set" for this in-context learning.
The Surrogate Model's Implicit Improvement
A subtlety of DeepScientist's BO formulation: the surrogate model (LLM Reviewer) implicitly improves over the course of a campaign, even though its weights are frozen:
Surrogate Accuracy Over Time
│
├── t=0: LLM Reviewer evaluates hypotheses based on general scientific knowledge
│ └── Accuracy: Low (no task-specific calibration)
│ └── The v_u, v_q, v_e scores are educated guesses
│
├── t=100: LLM Reviewer sees hypothesis + 100 past findings as context
│ └── Accuracy: Improving (can compare against known results)
│ └── Valuation becomes data-driven, not just prior-driven
│
├── t=500: LLM Reviewer sees hypothesis + Top-K from 500 findings
│ └── Accuracy: Moderate (has empirical calibration data)
│ └── Can estimate improvement magnitude based on similar past findings
│
└── t=1000+: LLM Reviewer sees hypothesis + Top-K from 1000+ findings
└── Accuracy: Highest (rich empirical basis for judgment)
└── Effectively calibrated against hundreds of real experiments
This mirrors classical BO: as more (x, y) pairs are observed, the Gaussian Process posterior becomes more accurate. In DeepScientist, as more (hypothesis, result) pairs accumulate in memory, the LLM Reviewer's in-context "posterior" becomes more accurate. The mechanism is entirely different (statistical vs. prompt-based), but the functional effect is similar.
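The UCB acquisition step implied by this analogy can be sketched as follows. Note the assumptions: how the per-hypothesis mean is derived from the valuation vector and how the uncertainty proxy is obtained from the LLM Reviewer are not specified here at this level of detail, so the dict keys and the annealing schedule are illustrative choices, not the paper's exact formulation.

```python
def ucb_select(candidates: list, kappa: float = 1.5) -> dict:
    """Pick the hypothesis maximizing mean + kappa * uncertainty (the UCB rule).

    'mean' is an LLM-estimated expected improvement; 'std' is an
    uncertainty proxy. High kappa favors uncertain, high-variance bets.
    """
    return max(candidates, key=lambda c: c["mean"] + kappa * c["std"])


def kappa_schedule(t: int, t_max: int, k0: float = 2.0, k1: float = 0.5) -> float:
    """Linearly anneal the exploration coefficient over the campaign.

    One way to realize the exploration-to-exploitation shift described
    above (an assumption for illustration, not confirmed by the paper).
    """
    return k0 + (k1 - k0) * (t / t_max)
```

Early in a campaign (large kappa), a risky hypothesis with high uncertainty can outrank a safe one; late in the campaign (small kappa), the same candidates flip order.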
Cross-Campaign Learning
The paper does not explicitly describe cross-campaign learning (transferring findings from one task's campaign to another). This is a notable gap:
| Learning Type | Within Campaign | Across Campaigns |
|---|---|---|
| Hypothesis quality improvement | Yes (via Findings Memory) | Not described |
| Surrogate calibration | Yes (via in-context learning) | Not described |
| Strategy evolution | Yes (via accumulated patterns) | Not described |
| Method transfer | N/A | Not described |
Potential for cross-campaign learning:
- Meta-strategies that work across tasks (e.g., "ablation-first approaches are reliable") could be extracted and reused
- Findings Memory from one task could seed another task's initial hypotheses
- The exploration coefficient κ could be calibrated based on past campaigns' innovation rates
Comparison to Evolutionary Learning
DeepScientist's within-campaign learning differs from evolutionary systems in important ways:
| Aspect | Evolutionary (FunSearch, AlphaEvolve) | Bayesian Optimization (DeepScientist) |
|---|---|---|
| What evolves | Population of programs | Memory of findings |
| Selection | Fitness-proportional | UCB acquisition function |
| Recombination | Crossover of programs | LLM synthesis of ideas from multiple findings |
| Mutation | LLM-based code perturbation | LLM-based hypothesis generation |
| Memory | Population (bounded) | Findings Memory (unbounded) |
| Learning signal | Fitness score (scalar) | Valuation vector (3D) + experimental result |
| Convergence | Population concentrates | Exploitation weight increases |
The key difference: evolutionary systems learn by maintaining a population of solutions that improves through selection and variation. DeepScientist learns by maintaining a memory of knowledge that informs hypothesis generation. The evolutionary approach is bottom-up (good solutions survive); the BO approach is top-down (principled selection guides exploration).
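The selection-rule contrast in the table above can be illustrated with two toy functions: stochastic fitness-proportional sampling versus a deterministic acquisition argmax. This is a sketch of the two selection mechanics only, not either system's actual code.

```python
import random


def fitness_proportional(population: list, rng: random.Random):
    """Evolutionary-style selection: pick a program with probability
    proportional to its fitness (roulette-wheel sampling)."""
    total = sum(fit for _, fit in population)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for prog, fit in population:
        acc += fit
        if r <= acc:
            return prog
    return population[-1][0]  # guard against floating-point edge cases


def ucb_argmax(findings: list, kappa: float = 1.0) -> str:
    """BO-style selection: deterministic argmax of the acquisition score."""
    return max(findings, key=lambda f: f["mean"] + kappa * f["std"])["id"]
```

The bottom-up/top-down distinction shows up directly: the evolutionary rule is stochastic and operates over surviving solutions, while the BO rule deterministically ranks every candidate against the full memory-informed score.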
Human-in-the-Loop Learning
The 3 human experts who verify DeepScientist's outputs represent a learning mechanism that the paper somewhat underplays:
Human Expert Contribution
│
├── Filter: Reject hallucinated or trivially flawed results
│ └── Prevents false Progress Findings from contaminating memory
│
├── Validate: Confirm genuine innovations are real
│ └── Provides ground truth that the system cannot generate alone
│
├── Guide (implicit): Expert attention patterns may influence priority
│ └── Unclear if experts can intervene during campaigns
│
└── Quality gate: Final barrier before claiming SOTA
└── Ensures published methods are genuinely novel and correct
The human experts serve as a high-quality but low-bandwidth "oracle" — they cannot evaluate thousands of findings, but they can verify the small number of Progress Findings. This hybrid autonomy (system explores broadly, humans verify narrowly) is a pragmatic architecture for current LLM capabilities.
15 Applications
Primary Application: Autonomous Frontier AI Research
DeepScientist is designed for a specific class of problems:
| Criterion | Requirement | Rationale |
|---|---|---|
| Codebase | Existing repository with baseline implementation | Agent needs a starting point to modify |
| Metrics | Well-defined quantitative evaluation metrics | UCB acquisition needs scalar scores |
| Search space | Large space of possible improvements | Justifies the BO overhead vs. manual search |
| Evaluation cost | Moderate (hours, not weeks per trial) | Month-long campaign needs ~1,100 evaluations |
| Domain | Frontier AI research | LLM reasoning is strongest in this domain |
| Baseline | Known SOTA for comparison | Progression requires a target to beat |
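A task configuration satisfying these criteria might look like the sketch below. Every key, path, and value here is a hypothetical placeholder invented for illustration; the paper does not publish a configuration schema.

```python
# Hypothetical campaign configuration mirroring the criteria table above.
campaign_config = {
    "codebase": "/workspace/baseline-repo",  # existing baseline implementation
    "metric": "auroc",                       # well-defined quantitative metric
    "metric_direction": "maximize",
    "baseline_score": 0.87,                  # known SOTA to beat
    "eval_cost_hours": 2.0,                  # hours per trial (moderate, not weeks)
    "budget_evaluations": 1100,              # ~1,100 implementations per campaign
    "campaign_days": 30,
}


def is_viable(cfg: dict) -> bool:
    """Rough feasibility check: can the evaluation budget fit in the
    campaign window? Assumes ~4 parallel GPU instances (an assumption)."""
    return (cfg["eval_cost_hours"] * cfg["budget_evaluations"]
            <= cfg["campaign_days"] * 24 * 4)
```

The check makes the "evaluation cost: moderate" criterion quantitative: at 2 GPU-hours per trial, ~1,100 trials fit in a month only with parallelism; week-long evaluations would not.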
Demonstrated Application Domains
| Domain | Task | DeepScientist Method | Result |
|---|---|---|---|
| AI Agents | Failure attribution | A2P (Abduction-Action-Prediction) | +183.7% accuracy |
| Systems/ML | Inference acceleration | ACRA | +1.9% tokens/s |
| NLP/Security | AI text detection | PA-Detect | +7.9% AUROC, +190% speed |
Potential Extension Domains
Based on the system's architecture, DeepScientist could be applied to any domain meeting the above criteria:
| Domain | Example Task | Feasibility | Notes |
|---|---|---|---|
| Computer Vision | Object detection on COCO | High | Well-defined metrics, existing codebases |
| NLP | Machine translation (BLEU) | High | Standard benchmarks, clear evaluation |
| Reinforcement Learning | Sample efficiency on Atari | Medium | Evaluation is expensive (many episodes) |
| Drug Discovery | Molecular property prediction | Medium | Requires domain-specific knowledge |
| Robotics | Control policy optimization | Low | Physical experiments not feasible |
| Theorem Proving | Proof success rate | Medium | Needs formal verification tooling |
| Code Generation | HumanEval/MBPP | High | Well-defined metrics, fast evaluation |
Integration Scenarios
Scenario 1: Corporate AI Research Lab
Research team identifies frontier task with stagnating progress
→ Configure DeepScientist with task repository and baselines
→ Allocate GPU cluster for month-long campaign
→ System generates thousands of hypotheses
→ UCB guides exploration/exploitation
→ ~20 innovations discovered
→ Human researchers review and validate top results
→ Publish methods that surpass SOTA
→ ROI: 3 SOTA methods per month per GPU cluster
Scenario 2: Academic Research Acceleration
PhD student working on a specific AI problem
→ Student provides existing codebase + baselines + metrics
→ Scaled-down campaign (1-2 GPUs, 1 week)
→ System explores hypothesis space student hasn't considered
→ Generates ~100 implementations, ~5-10 potential improvements
→ Student analyzes results, integrates best ideas
→ Accelerates research timeline from months to weeks
Scenario 3: Benchmark Competition
Competition organizer releases new benchmark
→ Configure DeepScientist with competition codebase and metrics
→ Run parallel campaigns with different task configurations
→ System explores solution space exhaustively
→ Submit best Progress Findings to leaderboard
→ Potential for autonomous competition entries
Limitations
| Limitation | Severity | Impact | Mitigation Path |
|---|---|---|---|
| Extreme compute cost | High | Restricts use to well-funded labs | Model cost decreases, more efficient search |
| API dependency | High | Relies on Gemini and Claude APIs | Open-weight model alternatives |
| Human expert requirement | Medium | 3 experts needed for verification | Better automated verification |
| Domain restriction | Medium | Currently limited to frontier AI tasks | Architecture is domain-agnostic |
| Low conversion rate | Medium | 0.42% ideas → innovations | Better surrogate models, smarter acquisition |
| Month-long campaigns | Medium | Slow iteration on system design | Shorter campaigns for prototyping |
| Surrogate calibration | Medium | LLM Reviewer may misjudge hypothesis value | Calibration against actual results |
| Reproducibility | Medium | Stochastic LLM outputs | Seed control, ensemble strategies |
| Negative result waste | Low | Most compute produces failures | Failures inform future search (by design) |
The Efficiency Question
The paper's most provocative framing concerns the efficiency of autonomous discovery:
"The central question is no longer 'Can AI innovate?' but rather 'How can we efficiently guide its powerful, yet highly dissipative, exploratory process to maximize scientific return?'"
This question has profound implications for the field:
Current state (DeepScientist):
5,000 ideas → 1,100 implementations → 21 innovations → 3 SOTA
Conversion rate: 0.06% (ideas to SOTA)
Cost: ~$200K-762K for 3 SOTA methods
Hypothetical 10x improvement:
5,000 ideas → 1,100 implementations → 210 innovations → 30 SOTA
Conversion rate: 0.6% (ideas to SOTA)
Cost: ~$7K-25K per SOTA method (same ~$200K-762K total budget, now spread over 30 SOTA methods)
Hypothetical 100x improvement:
500 ideas → 200 implementations → 50 innovations → 10 SOTA
Conversion rate: 2% (ideas to SOTA)
Cost: ~$2K-8K per SOTA method
At 100x improvement, autonomous research becomes economically accessible to individual researchers. The question is whether better surrogate models, smarter acquisition functions, or more efficient implementation agents can achieve this.
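The funnel arithmetic above can be checked directly. The cost figures reuse the ~$200K-762K total range reported earlier; the function name is illustrative.

```python
def funnel_stats(ideas: int, sota: int,
                 total_cost_lo: float, total_cost_hi: float):
    """Conversion rate (ideas -> SOTA) and per-SOTA cost range."""
    conv = sota / ideas
    return conv, (total_cost_lo / sota, total_cost_hi / sota)


# Current state: 5,000 ideas, 3 SOTA, ~$200K-762K total.
conv, (lo, hi) = funnel_stats(5_000, 3, 200_000, 762_000)
# conv = 0.0006 (0.06%); per-SOTA cost ranges from ~$66.7K to $254K
```

At 100x efficiency (2% conversion, a tenth of the ideas and compute), the same arithmetic gives roughly $2K-8K per SOTA method, which is the threshold at which the paper's "economically accessible" claim becomes plausible.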
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Mathematically principled BO framework for discovery | Extreme compute requirements (20,000+ GPU hours) |
| Verified SOTA-surpassing results on 3 tasks | Low conversion rate (0.42% of ideas lead to innovation) |
| 60% accept rate (3/5 papers accepted by DeepReviewer) | DeepReviewer is from the same team — potential bias |
| Human expert review confirms venue-quality papers | Only 3 human experts — limited statistical power |
| UCB provides principled exploration/exploitation | Surrogate model (LLM Reviewer) lacks calibrated uncertainty |
| Findings Memory preserves negative results | No cross-campaign learning mechanism |
| Dual-model architecture (reasoning + coding) | API dependency on frontier models |
| Scalable via parallel GPU instances | Month-long campaigns limit iteration speed |
| Clear improvement over all prior systems | Human supervision still required for verification |
Historical Significance
DeepScientist marks a watershed moment in autonomous research: the first system to demonstrate verified, quantitative improvements over human state-of-the-art on frontier AI tasks. Previous systems (AI Scientist, CycleResearcher, Zochi) demonstrated the ability to generate plausible research papers, but none demonstrated the ability to produce methods that actually work better than existing ones.
The progression from CycleResearcher to DeepScientist mirrors the broader trajectory of the field:
2024: Can AI write research papers? → Yes, but low quality (AI Scientist)
2024: Can AI improve paper quality? → Yes, via review-driven refinement (CycleResearcher)
2025: Can AI make real discoveries? → Yes, but at enormous cost (DeepScientist)
2026: Can AI do this efficiently? → Open question
DeepScientist answers the "Can AI make real discoveries?" question affirmatively, but its 0.06% conversion rate and $200K+ cost per task make clear that the efficiency problem is the next frontier. The Bayesian Optimization framework provides the right conceptual foundation for attacking this problem — better surrogate models, smarter acquisition functions, and more efficient implementation agents could dramatically reduce the cost of autonomous discovery.
The system's honest reporting of its funnel metrics (5,000 → 1,100 → 21 → 3) is itself a contribution. It quantifies what everyone suspected but no one had measured: autonomous scientific discovery is possible but profoundly inefficient with current technology. This sets a concrete benchmark for future systems to improve upon.
Analysis prepared April 2026. Based on arXiv:2509.26603 and publicly available materials from the ResearAI project.