
AI-Researcher

A fully autonomous multi-agent research system that orchestrates the complete scientific pipeline (from literature review and hypothesis generation through algorithm implementation to publication-ready manuscript preparation) with minimal human intervention, evaluated via the purpose-built Scientist-Bench benchmark across guided and open-ended discovery tasks.

Organization: The University of Hong Kong (HKUDS Lab)
Published: May 24, 2025
Type: paper (arXiv:2505.18705)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: AI-Researcher: Autonomous Scientific Innovation

ArXiv: arXiv:2505.18705 (cs.AI)

Repository: github.com/HKUDS/AI-Researcher

Project Page: autoresearcher.github.io

Leaderboard: Scientist-Bench Leaderboard

Production Version: novix.science/chat

License: CC BY 4.0

Venue: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)

Stars: ~5,011 (as of April 2026)

Status: Active open-source project with benchmark leaderboard, production deployment (Novix), and community submissions

AI-Researcher is among the first systems to demonstrate that an LLM-based multi-agent framework can autonomously execute the complete scientific research lifecycle—from reading papers and generating hypotheses to writing code and producing camera-ready manuscripts—achieving implementation success rates exceeding 93% and producing research that approaches human-level quality as measured by ICLR-standard review criteria.

Lineage and Positioning

AI-Researcher occupies a pivotal position in the rapidly developing ecosystem of autonomous research systems. It was one of the earliest complete end-to-end systems to gain significant traction (predating many 2026 systems) and has been cited as a direct influence by numerous successor projects:

| System | Relationship to AI-Researcher |
|---|---|
| AutoResearchClaw (AIMING Lab) | Explicitly acknowledges AI-Researcher as architectural inspiration |
| EurekaClaw | Lists AI-Researcher as predecessor in lineage |
| ClawTeam (HKUDS) | From same lab; extends collaborative agent patterns |
| Auto-Deep-Research (HKUDS) | Successor deep research system from same group |
| MetaClaw | Draws on AI-Researcher's multi-agent orchestration |
| ScienceClaw | Science-focused extension acknowledging AI-Researcher |

The system's influence extends beyond the "Claw" family—its Scientist-Bench benchmark has become a standard evaluation framework, and its divergent-convergent ideation paradigm has been adopted by multiple subsequent systems.


2 Authors and Team

| Author | Affiliation | Role |
|---|---|---|
| Jiabin Tang | PhD Student, The University of Hong Kong | Co-lead, equal contribution |
| Lianghao Xia | Research Assistant Professor, HKU | Co-lead, equal contribution |
| Zhonghang Li | PhD Student, South China University of Technology | Core contributor |
| Chao Huang | Assistant Professor, HKU | Corresponding author, PI |

BibTeX Citation

@inproceedings{tang2025airesearcher,
  title     = {AI-Researcher: Autonomous Scientific Innovation},
  author    = {Tang, Jiabin and Xia, Lianghao and Li, Zhonghang and Huang, Chao},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.18705}
}

Team composition: A compact four-person team from the HKUDS (HKU Data Science) lab. The small team size is notable given the system's breadth, which spans literature analysis, ideation, implementation, documentation, and a full benchmark. Chao Huang's lab (HKUDS) has produced a family of related systems (ClawTeam, Auto-Deep-Research), suggesting that AI-Researcher served as the foundational architecture for the lab's ongoing research in autonomous scientific discovery.

Institutional context: Unlike distributed multi-institutional efforts (e.g., OpenResearcher with 10 authors across 6 organizations), AI-Researcher emerged from a single tight-knit lab. This enables architectural coherence, but it also means that the benchmark construction, system design, and evaluation were all conducted by the same small group, a consideration that strengthens the case for independent validation.


3 Core Contribution

The Problem

Scientific research demands a synthesis of capabilities that no prior AI system had fully integrated:

  1. Literature comprehension at scale: Reading dozens of papers, extracting mathematical formulations, identifying code implementations, and building coherent mental models of the research landscape
  2. Creative hypothesis generation: Moving beyond recombination of known ideas to identify genuine conceptual gaps and propose novel directions
  3. Faithful implementation: Translating theoretical proposals into working code that correctly implements the proposed algorithms—not just syntactically valid code but semantically correct research prototypes
  4. Coherent documentation: Producing publication-quality manuscripts that maintain consistency across sections, accurately describe the implemented methods, and present results with scholarly rigor

Prior systems addressed fragments of this pipeline: literature analysis tools (Semantic Scholar, Elicit), code generation agents (SWE-Bench competitors), and writing assistants (GPT-based drafting). None orchestrated the complete lifecycle with bidirectional consistency guarantees between theory, code, and documentation.

The Solution

AI-Researcher introduces three architectural innovations that collectively enable end-to-end autonomous research:

┌──────────────────────────────────────────────────────────────────────────┐
│                     AI-Researcher Pipeline                                │
│                                                                          │
│  Phase 1: LITERATURE REVIEW                                              │
│  ┌─────────────────────┐   ┌───────────────────────────────────────────┐ │
│  │ Knowledge Acquisition│   │ Resource Analyst                          │ │
│  │ Agent                │   │ ┌─────────────┐  ┌────────────────────┐  │ │
│  │ • 10-15 ref papers   │──→│ │Paper Analyst │  │Code Analyst        │  │ │
│  │ • 5+ GitHub repos    │   │ │(LaTeX→math)  │  │(repo→implementation│  │ │
│  │ • arXiv supplements  │   │ └──────┬───────┘  └────────┬───────────┘  │ │
│  │ • Docker sandbox     │   │        └────────┬──────────┘              │ │
│  └─────────────────────┘   │           ▼                               │ │
│                             │  Bidirectional Theory↔Code Mappings       │ │
│                             └───────────────────────────────────────────┘ │
│                                          │                                │
│  Phase 2: IDEA GENERATION                ▼                                │
│  ┌───────────────────────────────────────────────────────────────────────┐│
│  │ Divergent-Convergent Discovery Framework                              ││
│  │                                                                       ││
│  │  Divergent Phase          Convergent Phase                            ││
│  │  ┌───────────────┐       ┌──────────────────────────┐                ││
│  │  │ 5 orthogonal   │──→   │ Evaluate against:         │               ││
│  │  │ research        │      │ • Scientific Novelty      │               ││
│  │  │ directions      │      │ • Technical Soundness     │               ││
│  │  └───────────────┘       │ • Transformative Potential │               ││
│  │                           └──────────┬───────────────┘                ││
│  │                                      ▼                                ││
│  │                           Best proposal selected                      ││
│  └───────────────────────────────────────────────────────────────────────┘│
│                                          │                                │
│  Phase 3: IMPLEMENTATION                 ▼                                │
│  ┌───────────────────────────────────────────────────────────────────────┐│
│  │ Multi-Stage Refinement (Mentor-Student Paradigm)                      ││
│  │                                                                       ││
│  │  ┌──────────┐    feedback    ┌──────────────┐                        ││
│  │  │ Mentor   │◄──────────────►│ Student      │                        ││
│  │  │ Agent    │    iteration   │ Agent        │                        ││
│  │  │ (review) │───────────────→│ (implement)  │                        ││
│  │  └──────────┘                └──────────────┘                        ││
│  │       Iterative refinement cycles until convergence                   ││
│  └───────────────────────────────────────────────────────────────────────┘│
│                                          │                                │
│  Phase 4: DOCUMENTATION                  ▼                                │
│  ┌───────────────────────────────────────────────────────────────────────┐│
│  │ Hierarchical Synthesis → Publication-Ready Manuscript                  ││
│  │ • Cross-document consistency maintenance                              ││
│  │ • Factual integrity verification                                      ││
│  │ • ICLR-standard scholarly format                                      ││
│  └───────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────┘
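The four-phase flow in the diagram above can be restated as a control-flow sketch. This is a hypothetical compression of the pipeline's structure only: in the real system each stub below is a multi-agent LLM workflow, and every name here is illustrative rather than taken from the AI-Researcher codebase.

```python
# Hypothetical restatement of the four-phase pipeline; each stub stands in
# for a multi-agent LLM workflow, not a pure function.

def literature_review(references):
    # Phase 1: build bidirectional theory-code mappings from papers + repos
    return {"mappings": len(references)}

def generate_idea(knowledge):
    # Phase 2: divergent-convergent ideation over the acquired knowledge
    return "best proposal"

def implement(proposal):
    # Phase 3: mentor-student refinement until the code converges
    return "validated code"

def document(proposal, code):
    # Phase 4: hierarchical synthesis into a manuscript
    return "manuscript"

def run_pipeline(references):
    knowledge = literature_review(references)
    proposal = generate_idea(knowledge)
    code = implement(proposal)
    return document(proposal, code)

paper = run_pipeline([f"paper-{i}" for i in range(12)])
print(paper)  # manuscript
```

The point of the sketch is the strict phase ordering: each phase consumes only the artifacts of the previous one, which is what makes the downstream consistency guarantees tractable.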

Three Key Innovations

  1. Bidirectional Theory-Code Mapping (Resource Analyst). The Resource Analyst agent decomposes complex research concepts into "atomic" components and establishes explicit mappings between mathematical formulations (extracted from LaTeX sources) and their code implementations (extracted from GitHub repositories). This dramatically reduces hallucination in downstream implementation because the system has concrete, verified reference pairs rather than relying on the LLM's parametric memory for implementation details.

  2. Divergent-Convergent Discovery Framework. Rather than generating a single research idea (prone to anchoring bias) or producing unconstrained brainstorms (prone to incoherence), the system first diverges—generating five conceptually distinct research directions exploring orthogonal perspectives—then converges by evaluating each against Scientific Novelty, Technical Soundness, and Transformative Potential criteria. This mirrors the creative-then-critical thinking cycle that characterizes productive human ideation.

  3. Mentor-Student Iterative Refinement. Implementation follows an explicit multi-round feedback cycle modeled on the academic advisor-student relationship. A Mentor Agent reviews code and provides structured feedback; a Student Agent revises the implementation accordingly. This replaces the fragile one-shot code generation paradigm with a self-correcting loop that enables test-time compute scaling—more refinement cycles yield higher-quality implementations.
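The mentor-student cycle described in item 3 can be sketched as a simple feedback loop. The `mentor_review` and `student_revise` functions below are toy stand-ins for LLM calls, and the convergence rule (an empty feedback list) is an assumption for illustration, not the project's actual API.

```python
# Toy mentor-student refinement loop; both agents are stand-ins for LLM calls.

def mentor_review(code):
    """Toy mentor: flags missing pieces it would ask the student to add."""
    issues = []
    if "def train" not in code:
        issues.append("add a training loop")
    if "def evaluate" not in code:
        issues.append("add an evaluation routine")
    return issues

def student_revise(code, issues):
    """Toy student: appends a stub addressing each piece of feedback."""
    for issue in issues:
        if "training" in issue:
            code += "\ndef train(model, data): ...\n"
        if "evaluation" in issue:
            code += "\ndef evaluate(model, data): ...\n"
    return code

def refine(code, max_cycles=5):
    """Iterate mentor feedback -> student revision until convergence."""
    for cycle in range(1, max_cycles + 1):
        issues = mentor_review(code)
        if not issues:                 # mentor has no remaining complaints
            return code, cycle - 1
        code = student_revise(code, issues)
    return code, max_cycles

final_code, cycles = refine("def model(): ...")
print(cycles)  # one revision cycle suffices in this toy example
```

The `max_cycles` cap is where test-time compute scaling enters: allowing more cycles buys more refinement at proportionally higher API cost.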

The Benchmark: Scientist-Bench

As a co-contribution, the paper introduces Scientist-Bench—a purpose-built benchmark for evaluating autonomous research systems. This is significant because prior work lacked standardized evaluation frameworks, making cross-system comparison impossible. Scientist-Bench provides:

  • 22 representative papers from 2022–2024 across 16 research domains
  • Two challenge levels testing fundamentally different capabilities
  • Rigorous anonymization to prevent memorization exploitation
  • Multi-model debiased evaluation protocol aligned with ICLR review standards
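One plausible shape for a single Scientist-Bench sample, inferred from the bullets above; the field names and validation rules are illustrative assumptions, not the released data format.

```python
# Hypothetical schema for one Scientist-Bench sample.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BenchmarkSample:
    paper_id: str                      # anonymized ground-truth paper
    domain: str                        # one of the 16 research domains
    level: int                         # 1 = guided, 2 = open-ended
    references: list = field(default_factory=list)  # reference paper set
    instruction: Optional[str] = None  # research instruction (Level-1 only)

    def __post_init__(self):
        assert self.level in (1, 2)
        if self.level == 2:
            # Level-2 (open-ended) samples supply references but no instruction
            assert self.instruction is None

sample = BenchmarkSample("gt-001", "graph learning", level=2,
                         references=[f"ref-{i}" for i in range(15)])
print(sample.level, len(sample.references))  # 2 15
```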

4 Supported Solutions

| Solution Type | Support Level | Details |
|---|---|---|
| End-to-end autonomous research | Primary target | Literature→idea→code→paper with minimal human input |
| Literature analysis and synthesis | Core phase | Multi-paper comprehension, concept extraction, theory-code mapping |
| Research idea generation | Core phase | Divergent-convergent framework for novel hypothesis formation |
| Algorithm implementation | Core phase | Iterative mentor-student refinement with feedback cycles |
| Scientific manuscript writing | Core phase | Hierarchical synthesis with cross-document consistency |
| Autonomous research evaluation | Co-contribution | Scientist-Bench benchmark with two challenge levels |
| Guided research execution | Evaluated | Level-1 tasks: follow provided research instructions |
| Open-ended research exploration | Evaluated | Level-2 tasks: formulate own research direction from references |

Task Complexity Spectrum

AI-Researcher targets a fundamentally different problem than code-generation benchmarks (SWE-Bench, HumanEval) or literature review tools. The system must navigate the entire creative-to-technical arc:

| Phase | Cognitive Demand | Prior Systems | AI-Researcher |
|---|---|---|---|
| Literature comprehension | Understanding multiple papers, extracting mathematical formulations | Semantic Scholar, Elicit | ✅ Knowledge Acquisition + Resource Analyst |
| Research gap identification | Creative reasoning about what's missing | Manual + ChatGPT assists | ✅ Divergent ideation phase |
| Novel hypothesis formation | Generating genuinely new research directions | Chain-of-Ideas, ResearchAgent | ✅ Convergent evaluation with multi-criteria |
| Faithful implementation | Correct code from theoretical specification | SWE-Bench agents | ✅ Mentor-Student iterative refinement |
| Experimental validation | Running experiments, collecting results | Manual | ✅ Automated within Docker |
| Scholarly documentation | Coherent multi-section manuscripts | GPT-based drafting | ✅ Hierarchical synthesis |

What AI-Researcher Does NOT Do

  • Peer review: The system generates papers but does not review external submissions (though Scientist-Bench includes an LLM-based review component for evaluation)
  • Physical experiments: Limited to computational research; no wet-lab or hardware integration
  • Real-time literature monitoring: Input is a fixed set of reference papers; no continuous arXiv scanning
  • Multi-run learning: Each research run is independent; no cross-run knowledge accumulation or skill learning
  • Human-in-the-loop collaboration: Designed for full autonomy; no mid-pipeline human feedback mechanism

5 LLM Integration

Model Usage Architecture

AI-Researcher's multi-agent architecture distributes different cognitive tasks across LLM calls, though the paper does not prescribe a single model—the system is designed as model-agnostic with different agents potentially using different LLMs.

| Agent | Function | Model Requirements |
|---|---|---|
| Knowledge Acquisition Agent | Repository filtering, literature retrieval | Long context for paper processing, tool use for GitHub/arXiv APIs |
| Paper Analyst (sub-agent) | LaTeX parsing, math extraction | Strong mathematical comprehension, precise extraction |
| Code Analyst (sub-agent) | Repository analysis, implementation mapping | Code understanding, dependency tracing |
| Plan Agent | Implementation roadmap generation | Structured reasoning, project planning |
| Idea Generator | Divergent-convergent ideation | Creative reasoning, multi-criteria evaluation |
| Mentor Agent | Code review, feedback generation | Code review capability, technical judgment |
| Student Agent | Implementation, revision | Strong code generation, instruction following |
| Documentation Agent | Manuscript writing | Long-form coherent writing, LaTeX generation |
| Code Review Agent (evaluation) | Technical execution validation | Static analysis understanding, runtime reasoning |
| Paper Review Agent (evaluation) | Scientific contribution evaluation | Scholarly judgment aligned with ICLR criteria |
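Given the model-agnostic design described above, the agent-to-model assignment might be expressed as a simple registry. The model identifiers and registry layout below are assumptions for illustration; the repository defines its own configuration format.

```python
# Hypothetical model-agnostic agent registry: each agent role resolves to a
# model class suited to its requirements from the table above.

AGENT_MODELS = {
    "knowledge_acquisition": "long-context-model",
    "paper_analyst":         "long-context-model",
    "code_analyst":          "code-model",
    "plan":                  "reasoning-model",
    "idea_generator":        "frontier-model",
    "mentor":                "code-model",
    "student":               "code-model",
    "documentation":         "frontier-model",
}

def model_for(agent: str) -> str:
    """Resolve the model assigned to an agent role, with a safe default."""
    return AGENT_MODELS.get(agent, "default-model")

print(model_for("mentor"))   # code-model
print(model_for("unknown"))  # default-model
```

A registry like this is what makes it cheap to swap the Claude-series for the 4o-series, as in the two leaderboard configurations below.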

Evaluated LLM Configurations

The Scientist-Bench leaderboard reports results for two primary configurations:

| Configuration | Completeness (L1) | Correctness (L1) | Completeness (L2) | Correctness (L2) |
|---|---|---|---|---|
| AI-Researcher (Claude-series) | 93.8% | 2.65/5.0 | 100% | 2.50/5.0 |
| AI-Researcher (4o-series) | 50.0% | 1.00/5.0 | 100% | 2.25/5.0 |

The dramatic difference between Claude-series (93.8%) and GPT-4o-series (50.0%) on Level-1 completeness highlights the system's sensitivity to the underlying LLM's code generation and instruction-following capabilities. Both achieve 100% completeness on Level-2 open-ended tasks—a finding discussed in detail in §6.

Evaluation Debiasing Protocol

For Scientist-Bench evaluation, the paper employs multiple LLM evaluators as a debiasing strategy:

  • Multiple model families: GPT models, Claude models, and Gemini models serve as independent reviewers
  • Temperature: Set to 1.0 for all evaluator calls to maximize diversity in judgments
  • Position bias mitigation: Random swapping of paper presentation order (AI-generated vs. ground truth) to eliminate ordering effects
  • Panel aggregation: Multiple independent evaluations aggregated for reliability

This multi-evaluator approach is particularly important given known biases in LLM-as-judge paradigms—models tend to favor their own outputs, prefer longer responses, and exhibit position bias. Using diverse model families with randomized presentation partially mitigates these systematic biases.
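The protocol above can be sketched in code. Here `call_evaluator` is a deterministic stub standing in for an LLM-as-judge API call (temperature 1.0 in the paper), and the family names, run counts, and score convention are illustrative assumptions.

```python
# Sketch of the multi-evaluator debiasing protocol: multiple model families,
# randomized presentation order, and panel aggregation.
import random
from statistics import mean

EVALUATOR_FAMILIES = ["gpt", "claude", "gemini"]  # independent reviewer families

def call_evaluator(family, first, second):
    """Stub: returns a preference score for `first` over `second` in [-3, 3]."""
    rng = random.Random(hash((family, first, second)) % 2**32)
    return rng.uniform(-3, 3)

def debiased_score(ai_paper, ground_truth, runs_per_family=2, seed=0):
    rng = random.Random(seed)
    scores = []
    for family in EVALUATOR_FAMILIES:
        for _ in range(runs_per_family):
            # Position-bias mitigation: randomly swap presentation order,
            # negating the score when the ground truth is shown first
            if rng.random() < 0.5:
                s = call_evaluator(family, ai_paper, ground_truth)
            else:
                s = -call_evaluator(family, ground_truth, ai_paper)
            scores.append(s)
    return mean(scores)  # panel aggregation over families and runs

score = debiased_score("ai_paper.tex", "ground_truth.tex")
assert -3 <= score <= 3
```

The order-swap with negation is the key move: any systematic first-position preference in an evaluator cancels in expectation across the panel.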


6 Key Results

Primary Finding: Open-Ended Exploration Outperforms Guided Tasks

The most surprising and scientifically significant finding of the paper is that AI-Researcher performs better on open-ended exploration (Level-2) than on guided implementation tasks (Level-1):

| Metric | Level-1 (Guided) | Level-2 (Open-Ended) | Delta |
|---|---|---|---|
| Completeness (Claude) | 93.8% | 100% | +6.2% |
| Completeness (4o) | 50.0% | 100% | +50.0% |
| Correctness (Claude) | 2.65 | 2.50 | −0.15 |
| Correctness (4o) | 1.00 | 2.25 | +1.25 |

Interpretation. This counterintuitive result admits several explanations, each with distinct implications:

  1. Internal coherence hypothesis: When the system generates its own research direction (Level-2), the idea, implementation plan, and code are all internally consistent because they originate from the same reasoning process. When following external instructions (Level-1), the system must faithfully interpret and implement someone else's idea, introducing potential misalignment between the instruction's intent and the system's interpretation.

  2. Complexity calibration hypothesis: In open-ended mode, the system naturally gravitates toward research directions it can implement well—ideas that align with the LLM's strengths in code generation and mathematical formulation. Guided tasks may specify approaches that are inherently harder to implement correctly or that require domain-specific knowledge the LLM lacks.

  3. Anchoring avoidance hypothesis: Prescriptive instructions may anchor the system on specific implementation approaches that are suboptimal for an LLM-based agent, while open-ended exploration allows the system to find implementation paths that play to its strengths.

  4. Evaluation alignment hypothesis: When the system generates both the idea and the paper, the documentation more accurately reflects the implemented method. In guided mode, discrepancies between the instruction's expected output and the system's actual implementation may lead to lower evaluation scores.

Broader significance. This finding challenges the assumption that autonomous research systems need detailed human guidance to produce quality work. It suggests that for computational research, the bottleneck may not be idea quality but rather the alignment between ideas and the implementing system's capabilities. This has profound implications for how autonomous research systems should be deployed—potentially with loose directional guidance rather than specific implementation instructions.

Implementation Success Rates

The completion ratio metric (measuring what fraction of required functionality is successfully implemented) is remarkably high:

| Configuration | Level-1 | Level-2 |
|---|---|---|
| Claude-series | 93.8% | 100% |
| 4o-series | 50.0% | 100% |

The 93.8% Level-1 completeness for Claude-series is exceptional given that these are complete research implementations, not isolated coding tasks. This includes training pipelines, evaluation code, data loading, model architectures, and experimental configurations.

The GPT-4o-series' dramatic improvement from 50% (Level-1) to 100% (Level-2) is particularly notable. It suggests that the 4o-series models struggle with faithful instruction following for complex research specifications but excel when allowed to formulate their own (likely simpler, more LLM-friendly) research directions.

Scientific Quality Assessment

The comparative rating r ∈ {−3, ..., 3} measures AI-generated papers against human ground-truth papers from top venues:

| Configuration | Avg. Rating (L1) | Avg. Rating (L2) | Comparable % |
|---|---|---|---|
| Claude-series | 2.0 | 2.0 | 99.9% |
| 4o-series | 2.0 | 2.0 | 99.9% |

A rating of 2.0 means AI-generated papers are consistently rated as having positive scientific contribution relative to the ground-truth papers. The 99.9% "comparable" rate indicates that virtually all generated papers are deemed substantive enough to warrant comparison—they are not dismissed as trivially low-quality.
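As a toy illustration of how the average rating and the "comparable" percentage might be aggregated from per-paper scores (the comparability criterion used here, any rating above the minimum, is an assumption for illustration; the paper defines its own):

```python
# Toy aggregation of comparative ratings r in {-3, ..., 3}.

def summarize(ratings):
    """Return (mean rating, fraction of papers deemed comparable)."""
    assert all(-3 <= r <= 3 for r in ratings)
    avg = sum(ratings) / len(ratings)
    comparable = sum(r > -3 for r in ratings) / len(ratings)
    return avg, comparable

avg, comp = summarize([2, 2, 3, 1, 2])
print(avg, comp)  # 2.0 1.0
```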

Benchmark Validation

The paper validates its LLM-based evaluation protocol against real ICLR review decisions:

  • 5 popular LLMs used as evaluators
  • 64 randomly sampled ICLR submissions (32 paper pairs)
  • LLM-based reviewer judgments "perfectly align" with ICLR's final acceptance/rejection decisions

This calibration is critical for establishing that the benchmark's automated evaluation is a meaningful proxy for human expert judgment, though "perfect alignment" on 32 pairs should be interpreted with appropriate caution regarding sample size.


7 Reproducibility

Open Artifacts

| Artifact | Status | Location |
|---|---|---|
| Source code | Open | github.com/HKUDS/AI-Researcher |
| Technical report | Open | Paper PDF |
| Scientist-Bench data | Open | Project page |
| Benchmark leaderboard | Open | Leaderboard |
| 22 ground-truth papers | Included | Part of Scientist-Bench |
| Reference paper sets | Included | 15–20 references per benchmark sample |
| Anonymization protocol | Described | Paper §2.2 |
| Evaluation prompts | Provided | Paper Appendix A.7 |
| License | CC BY 4.0 | Permissive academic use |

Reproducibility Strengths

  1. Full pipeline code: The complete multi-agent system is released, enabling end-to-end reproduction
  2. Benchmark data: All 22 benchmark papers, reference sets, and datasets are provided
  3. Evaluation protocol: Detailed prompts and scoring rubrics for both Stage 1 (technical validation) and Stage 2 (scientific contribution) evaluation
  4. Leaderboard with submission: Open leaderboard accepting community submissions, enabling independent validation

Reproducibility Challenges

  1. LLM API dependency: Results depend on specific LLM versions (Claude-series, GPT-4o-series) that evolve over time. Model updates may alter system behavior
  2. API cost barrier: Running the full pipeline requires significant LLM API spending (see §8), creating a financial barrier to reproduction
  3. Stochastic generation: LLM outputs are inherently stochastic; exact reproduction of specific papers is not possible
  4. Evaluation model drift: The LLM-based evaluation protocol is subject to the same API version drift as the generation pipeline
  5. Small benchmark size: 22 papers across 16 domains means some domains have only 1–2 samples, limiting statistical power for domain-specific claims
  6. Self-evaluation concern: The system is evaluated using LLMs from the same families used for generation, raising potential systematic biases despite the debiasing protocol

What Is NOT Reproducible

  • Exact generated papers (stochastic LLM generation)
  • Evaluation scores with deprecated model versions
  • Cost estimates (API pricing changes)

8 Compute and API Costs

Per-Run Cost Structure

AI-Researcher's cost is dominated by LLM API calls across its multi-agent pipeline. Each complete research run (one paper) involves:

| Phase | Agents Involved | Estimated Token Volume | Cost Driver |
|---|---|---|---|
| Literature Review | Knowledge Acquisition, Resource Analyst (Paper + Code Analysts) | High: reads 10–15 full papers + code repos | Input tokens (long documents) |
| Idea Generation | Idea Generator | Moderate: 5 divergent ideas + convergent evaluation | Output tokens (creative generation) |
| Implementation | Student + Mentor (multiple cycles) | Very high: iterative code generation + review | Both input and output (iterative) |
| Documentation | Documentation Agent | High: full manuscript generation | Output tokens (long-form writing) |
| Evaluation | Code Review + Paper Review agents | Moderate: structured assessment | Input tokens (reading generated paper) |

Estimated Per-Paper Costs

While the paper does not provide exact cost breakdowns, we can estimate based on the pipeline structure:

| Model Family | Estimated Cost Per Paper | Rationale |
|---|---|---|
| Claude-series (Sonnet) | $15–$50 | Multiple long-context calls, iterative refinement |
| Claude-series (Opus) | $50–$150 | Higher per-token cost for more capable model |
| GPT-4o-series | $10–$40 | Competitive pricing but lower completeness |
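A back-of-envelope check of these ranges, where the token volumes and per-million-token prices are illustrative assumptions rather than figures from the paper:

```python
# Illustrative per-paper cost arithmetic for an LLM API pipeline.

def run_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """API cost in USD from token counts and per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# Hypothetical full-pipeline volumes: literature review dominates input;
# iterative implementation and the manuscript dominate output.
cost = run_cost(input_tokens=4_000_000, output_tokens=1_000_000,
                in_price_per_m=3.0, out_price_per_m=15.0)
print(round(cost, 2))  # 27.0, inside the $15-$50 Sonnet-class range above
```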

Full Benchmark Costs

Running all 22 Scientist-Bench papers at both levels:

| Scenario | Papers | Estimated Total |
|---|---|---|
| Level-1 only (Claude) | 22 | $330–$1,100 |
| Level-2 only (Claude) | 22 | $330–$1,100 |
| Both levels, both models | 88 runs | $1,000–$5,000 |

Infrastructure Costs

| Component | Requirement | Cost |
|---|---|---|
| Docker environment | Local machine or cloud VM | Minimal (existing infrastructure) |
| GPU | Not required for pipeline; needed for some benchmark tasks | Variable |
| Storage | Reference papers, code repos, generated artifacts | Minimal |

Cost Comparison with Human Research

| Metric | Human Researcher | AI-Researcher |
|---|---|---|
| Time per paper | Weeks to months | Hours |
| Direct cost per paper | $0 (salary-amortized) | $15–$150 (API costs) |
| Iteration speed | Days per revision cycle | Minutes per revision cycle |
| Parallelism | 1 paper at a time | Unlimited concurrent runs |
| Quality (Scientist-Bench) | Ground truth | Approaching comparable (rating ~2.0) |

The economics are striking: at $15–$150 per complete research paper, AI-Researcher could generate hundreds of research explorations for the cost of a single month of a PhD student's stipend. The bottleneck shifts from human time to API cost and evaluation bandwidth.


9 Architecture Solution

System Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                        AI-RESEARCHER SYSTEM                                │
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                    INPUT LAYER                                       │  │
│  │                                                                      │  │
│  │  User Provides:                                                      │  │
│  │  ┌────────────────┐  ┌───────────────┐  ┌────────────────────────┐  │  │
│  │  │ 10-15 Reference │  │ Research      │  │ Datasets               │  │  │
│  │  │ Papers (ℛ)      │  │ Instruction   │  │ (𝒟)                    │  │  │
│  │  │                 │  │ (I) [L1 only] │  │                        │  │  │
│  │  └────────┬───────┘  └───────┬───────┘  └───────────┬────────────┘  │  │
│  └───────────┼──────────────────┼──────────────────────┼────────────────┘  │
│              │                  │                       │                    │
│  ┌───────────┼──────────────────┼──────────────────────┼────────────────┐  │
│  │           ▼                  │                       │                │  │
│  │  PHASE 1: LITERATURE REVIEW  │                       │                │  │
│  │                               │                       │                │  │
│  │  ┌─────────────────────────┐  │                       │                │  │
│  │  │ Knowledge Acquisition    │  │                       │                │  │
│  │  │ Agent                    │  │                       │                │  │
│  │  │                          │  │                       │                │  │
│  │  │ 1. Filter GitHub repos   │  │                       │                │  │
│  │  │    (5+ high-quality)     │  │                       │                │  │
│  │  │ 2. Retrieve arXiv papers │  │                       │                │  │
│  │  │    with LaTeX sources    │  │                       │                │  │
│  │  │ 3. All in Docker sandbox │  │                       │                │  │
│  │  └───────────┬─────────────┘  │                       │                │  │
│  │              ▼                │                       │                │  │
│  │  ┌─────────────────────────┐  │                       │                │  │
│  │  │ Resource Analyst Agent   │  │                       │                │  │
│  │  │                          │  │                       │                │  │
│  │  │  ┌──────────────────┐   │  │                       │                │  │
│  │  │  │ Paper Analyst     │   │  │                       │                │  │
│  │  │  │ • RAG over LaTeX  │   │  │                       │                │  │
│  │  │  │ • Math extraction │   │  │                       │                │  │
│  │  │  │ • Concept decomp. │   │  │                       │                │  │
│  │  │  └────────┬─────────┘   │  │                       │                │  │
│  │  │           │              │  │                       │                │  │
│  │  │  ┌────────▼─────────┐   │  │                       │                │  │
│  │  │  │ Code Analyst      │   │  │                       │                │  │
│  │  │  │ • Repo analysis   │   │  │                       │                │  │
│  │  │  │ • Impl. mapping   │   │  │                       │                │  │
│  │  │  │ • Dependency trace │   │  │                       │                │  │
│  │  │  └────────┬─────────┘   │  │                       │                │  │
│  │  │           ▼              │  │                       │                │  │
│  │  │  Bidirectional Mappings  │  │                       │                │  │
│  │  │  [Math ↔ Code]           │  │                       │                │  │
│  │  └───────────┬─────────────┘  │                       │                │  │
│  │              ▼                │                       │                │  │
│  │  ┌─────────────────────────┐  │                       │                │  │
│  │  │ Plan Agent               │  │                       │                │  │
│  │  │ • Implementation roadmap │  │                       │                │  │
│  │  │ • Training procedures    │  │                       │                │  │
│  │  │ • Testing methodology    │  │                       │                │  │
│  │  │ • Dataset requirements   │  │                       │                │  │
│  │  └───────────┬─────────────┘  │                       │                │  │
│  └──────────────┼────────────────┼───────────────────────┼────────────────┘  │
│                 │                │                       │                    │
│  ┌──────────────┼────────────────┼───────────────────────┼────────────────┐  │
│  │              ▼                ▼                       │                │  │
│  │  PHASE 2: IDEA GENERATION                             │                │  │
│  │                                                       │                │  │
│  │  ┌─────────────────────────────────────────────────┐  │                │  │
│  │  │ Idea Generator                                   │  │                │  │
│  │  │                                                   │  │                │  │
│  │  │ [L1: Guided by instruction I]                     │  │                │  │
│  │  │ [L2: Self-directed from references only]          │  │                │  │
│  │  │                                                   │  │                │  │
│  │  │ Divergent Phase:                                  │  │                │  │
│  │  │   → 5 conceptually distinct directions            │  │                │  │
│  │  │   → Orthogonal perspectives explored              │  │                │  │
│  │  │   → Cross-disciplinary connections sought          │  │                │  │
│  │  │                                                   │  │                │  │
│  │  │ Convergent Phase:                                 │  │                │  │
│  │  │   → Evaluate: Scientific Novelty                  │  │                │  │
│  │  │   → Evaluate: Technical Soundness                 │  │                │  │
│  │  │   → Evaluate: Transformative Potential            │  │                │  │
│  │  │   → Select best; develop comprehensive proposal   │  │                │  │
│  │  │                                                   │  │                │  │
│  │  │ Output Proposal Structure:                        │  │                │  │
│  │  │   • Challenges                                    │  │                │  │
│  │  │   • Existing Methods                              │  │                │  │
│  │  │   • Motivation                                    │  │                │  │
│  │  │   • Proposed Method                               │  │                │  │
│  │  │   • Technical Details                             │  │                │  │
│  │  │   • Expected Outcomes                             │  │                │  │
│  │  └─────────────────────────┬───────────────────────┘  │                │  │
│  └────────────────────────────┼──────────────────────────┼────────────────┘  │
│                               │                          │                    │
│  ┌────────────────────────────┼──────────────────────────┼────────────────┐  │
│  │                            ▼                          ▼                │  │
│  │  PHASE 3: IMPLEMENTATION + VALIDATION                                 │  │
│  │                                                                       │  │
│  │  ┌───────────────────────────────────────────────────────────────┐    │  │
│  │  │ Multi-Stage Refinement Architecture                           │    │  │
│  │  │                                                               │    │  │
│  │  │  Cycle n:                                                     │    │  │
│  │  │  ┌───────────┐                       ┌───────────────────┐    │    │  │
│  │  │  │ Student   │──→ implementation ──→ │ Mentor Agent      │    │    │  │
│  │  │  │ Agent     │                       │                   │    │    │  │
│  │  │  │           │←── feedback ──────────│ • Algorithm check │    │    │  │
│  │  │  │           │                       │ • Efficiency      │    │    │  │
│  │  │  │           │──→ revised code ─────→│ • Constraints     │    │    │  │
│  │  │  │           │                       │                   │    │    │  │
│  │  │  └───────────┘  ... repeat ...       └───────────────────┘    │    │  │
│  │  │                                                               │    │  │
│  │  │  Convergence: code passes all mentor checks                   │    │  │
│  │  └───────────────────────────────────────────────────────────────┘    │  │
│  │                                                                       │  │
│  │  Docker Sandbox: PyTorch pre-installed, dynamic dependency mgmt       │  │
│  └───────────────────────────────┬───────────────────────────────────────┘  │
│                                  │                                          │
│  ┌───────────────────────────────┼───────────────────────────────────────┐  │
│  │                               ▼                                       │  │
│  │  PHASE 4: DOCUMENTATION                                               │  │
│  │                                                                       │  │
│  │  ┌───────────────────────────────────────────────────────────────┐    │  │
│  │  │ Documentation Agent                                           │    │  │
│  │  │                                                               │    │  │
│  │  │ • Hierarchical synthesis across all research artifacts        │    │  │
│  │  │ • Cross-document consistency verification                     │    │  │
│  │  │ • Factual integrity maintenance                               │    │  │
│  │  │ • Background → Motivation → Method → Experiments → Results    │    │  │
│  │  │ • LaTeX output in conference format                           │    │  │
│  │  └───────────────────────────────────────────────────────────────┘    │  │
│  └───────────────────────────────┬───────────────────────────────────────┘  │
│                                  │                                          │
│  ┌───────────────────────────────┼───────────────────────────────────────┐  │
│  │                               ▼                                       │  │
│  │  OUTPUT: {𝒞 (code scripts), p (technical report)}                     │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────┘

Design Principles

  1. Minimal input, maximal output. The system requires only 10–15 reference papers (and optionally a research instruction and datasets). From this minimal seed, it produces working code and a complete manuscript. This low barrier is a deliberate design choice enabling broad applicability.

  2. Sandboxed execution. All operations run in Docker containers. This provides security (preventing host system modification), consistency (pre-configured ML environments with PyTorch), and flexibility (dynamic package installation as research needs evolve). The sandboxing is particularly critical for the implementation phase where generated code is executed.

  3. Bidirectional grounding. The Resource Analyst's theory-code mapping creates a verified knowledge base that subsequent agents draw upon, reducing hallucination risk. This is architecturally distinct from systems that rely solely on the LLM's parametric memory for implementation.

  4. Iterative refinement over one-shot generation. The mentor-student paradigm explicitly rejects the single-pass code generation approach. By investing compute in multiple refinement cycles, the system can recover from errors and converge on correct implementations.

  5. Evaluation co-designed with the system. Scientist-Bench was developed alongside AI-Researcher, ensuring the evaluation framework captures the specific capabilities the system targets. This tight coupling is both a strength (evaluation is well-suited to the task) and a potential weakness (benchmark may be inadvertently biased toward the system's strengths).


10 Component Breakdown

Component 1: Knowledge Acquisition Agent

Purpose: Bootstrap the research knowledge base from minimal user input.

Input: 10–15 reference papers provided by the user.

Process:

  1. GitHub Repository Filtering: The agent evaluates candidate repositories across five quality dimensions:

     - Code Recency: Prioritizes actively maintained, up-to-date implementations
     - GitHub Popularity: Star count as a proxy for community-validated quality
     - Documentation Quality: Completeness of README and inline documentation
     - Domain Relevance: Alignment with the research focus area
     - Citation Impact: Scholarly influence of the associated paper

     The filtering produces at least 5 high-quality repositories that serve as the implementation knowledge base.

  2. Supplementary Literature Retrieval: For each selected repository, the agent retrieves the corresponding arXiv paper, including its complete LaTeX source files. This is critical: LaTeX sources provide structured access to mathematical formulations that PDF-only retrieval cannot match.

Output: A curated collection of high-quality code repositories and their associated papers with LaTeX sources.

Security: All operations execute within Docker containers, preventing the agent from modifying the host system.
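
The five-dimension filter above can be sketched as a weighted ranking. The dimension names come from the paper; the weights, the 0–1 scoring scale, and the function names below are illustrative assumptions, not AI-Researcher's actual implementation.

```python
# Hypothetical sketch of the five-dimension repository filter.
# Weights and per-dimension scores in [0, 1] are assumptions.
WEIGHTS = {
    "code_recency": 0.2,
    "github_popularity": 0.2,
    "documentation_quality": 0.2,
    "domain_relevance": 0.25,
    "citation_impact": 0.15,
}

def score_repository(repo: dict) -> float:
    """Weighted sum over the five quality dimensions."""
    return sum(WEIGHTS[dim] * repo.get(dim, 0.0) for dim in WEIGHTS)

def filter_repositories(candidates: list[dict], k: int = 5) -> list[dict]:
    """Keep the top-k candidates by weighted quality score."""
    return sorted(candidates, key=score_repository, reverse=True)[:k]

candidates = [
    {"name": "repo-a", "code_recency": 0.9, "github_popularity": 0.8,
     "documentation_quality": 0.7, "domain_relevance": 0.9, "citation_impact": 0.6},
    {"name": "repo-b", "code_recency": 0.2, "github_popularity": 0.3,
     "documentation_quality": 0.4, "domain_relevance": 0.5, "citation_impact": 0.2},
]
top = filter_repositories(candidates, k=1)
```

In practice the raw signals (commit dates, star counts, citation counts) would need normalization before any such weighting is meaningful.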

Component 2: Resource Analyst Agent

Purpose: Create the bidirectional theory-code knowledge base that grounds all subsequent pipeline stages.

Sub-agents:

  - Paper Analyst: Operates via RAG over the downloaded LaTeX files
  - Code Analyst: Analyzes the downloaded code repositories

Process:

  1. Concept Decomposition: Using the initial research idea as guidance, the agent decomposes complex research objectives into atomic academic concepts—fundamental, indivisible research elements requiring individual investigation. This decomposition is the key insight: complex methods are made tractable by breaking them into their minimal constituents.

  2. Mathematical Formalization (Paper Analyst): For each atomic concept, the Paper Analyst examines LaTeX files through RAG-based retrieval, extracting the precise mathematical formulations. This creates a formal specification for each concept.

  3. Implementation Analysis (Code Analyst): For each mathematically formalized concept, the Code Analyst locates the corresponding implementation in the downloaded repositories, identifying the critical reference files and their dependencies.

  4. Knowledge Integration: Results from both analysts are synthesized into comprehensive concept profiles that establish explicit bidirectional connections: for every mathematical expression, there is a linked code implementation, and vice versa.

  5. Iterative Refinement: The decomposition-analysis cycle continues until all atomic concepts are thoroughly investigated.

Output: A detailed research report with complete concept profiles establishing theory↔code mappings.

Why this matters: This bidirectional mapping is AI-Researcher's most distinctive architectural contribution. By grounding mathematical abstractions in concrete code and linking implementations back to their theoretical foundations, the system creates a verified reference base. When the Student Agent later implements a new method, it can reference not just "the attention mechanism" (from parametric memory, prone to hallucination) but the specific mathematical definition and its verified code implementation from a real repository. This dramatically reduces implementation errors.

Component 3: Plan Agent

Purpose: Transform the Resource Analyst's findings into an actionable implementation roadmap.

Output: A comprehensive plan addressing:

  - Training procedures
  - Testing methodologies
  - Dataset requirements
  - Implementation order and dependencies
  - Expected experimental configurations

Component 4: Idea Generator

Purpose: Produce a novel, implementable research proposal.

Process (Divergent-Convergent Discovery Framework):

Divergent Phase:

  - Generates 5 conceptually distinct research directions
  - Each direction explores orthogonal perspectives
  - Cross-disciplinary connections are actively sought
  - Conceptual gaps, contradictory findings, and emerging patterns are identified

Convergent Phase:

  - Each direction is evaluated against three criteria:
    - Scientific Novelty: Does this represent a genuine advance?
    - Technical Soundness: Is the proposed approach feasible and rigorous?
    - Transformative Potential: Could this lead to significant impact?
  - The most promising direction is selected for comprehensive development

Output: A structured research proposal containing:

  - Challenges (fundamental limitations in current understanding)
  - Existing Methods (analysis revealing conceptual blind spots)
  - Motivation (scientific necessity for the proposed approach)
  - Proposed Method (novel theoretical framework or algorithmic innovation)
  - Technical Details (implementable specification)
  - Expected Outcomes (projected impact)

Component 5: Mentor Agent

Purpose: Provide structured review feedback during implementation.

Role: Mirrors the academic advisor in the advisor-student relationship. Reviews the Student Agent's code across:

  - Algorithm Correctness (does the code implement what the proposal specifies?)
  - Computational Efficiency (is the implementation reasonably performant?)
  - Adherence to Specified Constraints (does the code respect dataset, resource, and methodological constraints?)

Component 6: Student Agent

Purpose: Write and iteratively refine the implementation code.

Process: Receives the Mentor's feedback and revises the implementation. Multiple refinement cycles continue until the Mentor Agent's checks pass. This iterative loop enables test-time compute scaling—more cycles generally yield higher-quality implementations.

Component 7: Documentation Agent

Purpose: Produce a publication-quality manuscript from the research artifacts.

Process:

  - Hierarchical synthesis across all prior outputs (literature analysis, research proposal, implementation, experimental results)
  - Cross-document consistency maintenance (ensures the paper's method description matches the actual implementation)
  - Factual integrity verification (results described match actual experimental outputs)
  - Structured output in conference paper format (background, motivation, methodology, experiments, results, conclusion)


11 Core Mechanisms (Detailed)

Mechanism 1: Bidirectional Theory-Code Mapping

The Resource Analyst's bidirectional mapping between mathematical formulations and code implementations is the system's most novel and impactful mechanism. To understand why, consider the failure modes it prevents:

Without bidirectional mapping (typical LLM-based implementation):

Research idea: "Implement graph attention with spectral normalization"
    ↓
LLM generates code from parametric memory
    ↓
Risk: attention formula hallucinated, spectral norm applied incorrectly,
      dimensions mismatched, loss function wrong

With bidirectional mapping (AI-Researcher):

Research idea: "Implement graph attention with spectral normalization"
    ↓
Paper Analyst extracts from LaTeX:
    α_ij = softmax(LeakyReLU(a^T [Wh_i || Wh_j]))  [Eq. 3, paper X]
    W̃ = W / σ(W)  [Eq. 7, paper Y]
    ↓
Code Analyst locates implementations:
    α_ij → layers/gat.py:42-58, function compute_attention()
    W̃   → utils/spectral_norm.py:15-30, class SpectralNorm
    ↓
Concept profile: {
    concept: "spectrally-normalized graph attention",
    math: [Eq. 3, Eq. 7],
    code_refs: [gat.py:42-58, spectral_norm.py:15-30],
    dependencies: [torch.nn.functional.leaky_relu, torch.linalg.svd]
}
    ↓
Student Agent implements with verified reference, not hallucinated memory

Atomic decomposition depth. The system doesn't just map "method X" to "file Y." It decomposes methods into their minimal mathematical atoms and maps each atom independently. This means complex multi-component methods (e.g., a transformer with custom attention, position encoding, normalization, and loss function) have every component individually grounded.

RAG over LaTeX vs. PDF. The Paper Analyst specifically operates over LaTeX source files rather than parsed PDFs. This is architecturally significant because LaTeX preserves the semantic structure of mathematical expressions (macros, environments, equation numbering) that PDF parsing typically corrupts. A LaTeX \sum_{i=1}^{N} is unambiguous; its PDF rendering may be misparsed as plain text.
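
The structural advantage of LaTeX can be illustrated with a minimal extraction pass: equation environments and their ordering are recoverable with a simple scan, whereas the PDF rendering loses the delimiters entirely. The regex and sequential numbering scheme below are illustrative assumptions, not the Paper Analyst's actual pipeline.

```python
import re

# Minimal sketch: pull equation bodies and their order from LaTeX source.
EQ_ENV = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)

def extract_equations(latex: str) -> list[tuple[int, str]]:
    """Return (equation_number, body) pairs in order of appearance."""
    return [(i + 1, body.strip()) for i, body in enumerate(EQ_ENV.findall(latex))]

source = r"""
\begin{equation}
\alpha_{ij} = \mathrm{softmax}(\mathrm{LeakyReLU}(a^T [W h_i \| W h_j]))
\end{equation}
Some prose between equations.
\begin{equation}
\tilde{W} = W / \sigma(W)
\end{equation}
"""
equations = extract_equations(source)
```

A production version would also resolve `align` environments, macro definitions, and `\label`/`\ref` cross-references, which is exactly the structure a PDF parse destroys.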

Mechanism 2: Divergent-Convergent Discovery Framework

The ideation mechanism deliberately separates creative expansion from critical evaluation, reflecting established creativity research (e.g., Guilford's divergent-convergent thinking model):

Phase 1 — Divergent (Expansion):

Input: Research landscape analysis from Resource Analyst
    ↓
Generate Direction 1: [orthogonal approach A]
Generate Direction 2: [orthogonal approach B]
Generate Direction 3: [orthogonal approach C]
Generate Direction 4: [cross-disciplinary approach D]
Generate Direction 5: [contradictory-finding-based approach E]
    ↓
5 maximally diverse proposals

Phase 2 — Convergent (Selection):

For each direction d ∈ {1,...,5}:
    Score_novelty(d)     ∈ criteria space
    Score_soundness(d)   ∈ criteria space
    Score_potential(d)   ∈ criteria space
    ↓
d* = argmax weighted_score(d)
    ↓
Comprehensive development of d*

Why 5 directions? This is a design choice balancing exploration breadth against generation cost. Too few (1–2) risks anchoring; too many (10+) dilutes evaluation quality and increases cost. Five provides sufficient diversity to escape the LLM's default mode while remaining evaluable.

Evaluation criteria analysis:

| Criterion | What It Captures | Failure Mode If Missing |
|---|---|---|
| Scientific Novelty | Is this genuinely new? | System rehashes known approaches |
| Technical Soundness | Is this implementable and rigorous? | Proposes impossible/incorrect methods |
| Transformative Potential | Could this matter? | Generates trivially novel but useless ideas |

The three criteria form a minimal sufficient set: novelty without soundness produces fantasies; soundness without novelty produces incremental work; and both together, without transformative potential, produce technically correct irrelevancies.
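
The convergent phase reduces to a weighted argmax over the three criteria. In AI-Researcher the scores come from LLM evaluation; here they are hard-coded, and the equal weighting is an assumption (the paper does not publish the aggregation rule).

```python
# Sketch of d* = argmax weighted_score(d) over the five directions.
# Equal weights over the three criteria are an illustrative assumption.
WEIGHTS = {"novelty": 1 / 3, "soundness": 1 / 3, "potential": 1 / 3}

def weighted_score(direction: dict) -> float:
    """Combine the three criterion scores into one scalar."""
    return sum(WEIGHTS[c] * direction["scores"][c] for c in WEIGHTS)

def select_direction(directions: list[dict]) -> dict:
    """Convergent phase: pick the highest-scoring direction."""
    return max(directions, key=weighted_score)

directions = [
    {"id": "d1", "scores": {"novelty": 0.9, "soundness": 0.3, "potential": 0.8}},
    {"id": "d2", "scores": {"novelty": 0.6, "soundness": 0.8, "potential": 0.7}},
    {"id": "d3", "scores": {"novelty": 0.7, "soundness": 0.9, "potential": 0.9}},
]
best = select_direction(directions)
```

Note how d1 (high novelty, low soundness) loses to d3: exactly the "novelty without soundness produces fantasies" failure mode the criteria are designed to filter.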

Mechanism 3: Mentor-Student Iterative Refinement

The implementation phase's iterative refinement is modeled explicitly on the academic advisor-student relationship:

Cycle 1:
  Student: [initial implementation based on proposal + concept profiles]
  Mentor:  [review → feedback on algorithm correctness, efficiency, constraints]

Cycle 2:
  Student: [revised implementation incorporating mentor feedback]
  Mentor:  [review → feedback on remaining issues]

...

Cycle N:
  Student: [final implementation]
  Mentor:  [approval — all checks pass]

Test-time compute scaling. More refinement cycles generally improve output quality, analogous to how more advisor feedback rounds improve a student's code. This provides a direct mechanism for trading compute for quality—critical for research applications where correctness matters more than speed.

Feedback structure. The Mentor Agent evaluates across three specific dimensions:

  1. Algorithm Correctness: Does the code faithfully implement the mathematical specification from the research proposal?

  2. Computational Efficiency: Are there obvious performance issues (e.g., unnecessary recomputation, memory leaks)?

  3. Adherence to Specified Constraints: Does the implementation respect declared constraints (dataset formats, compute budgets, methodological requirements)?

This structured feedback prevents vague "make it better" cycles and gives the Student Agent specific, actionable improvement targets.
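
The control flow of the cycles above can be sketched as a loop that terminates on an empty feedback list. The check names mirror the paper's three review dimensions; the string-matching stubs and the `max_cycles` cap are illustrative assumptions standing in for LLM-based review and revision.

```python
# Skeleton of the mentor-student refinement loop (stubs replace LLM calls).
def mentor_review(code: str) -> list[str]:
    """Return structured feedback; an empty list means all checks pass."""
    feedback = []
    if "def train" not in code:
        feedback.append("algorithm correctness: missing training procedure")
    if "for epoch" not in code:
        feedback.append("constraint adherence: no epoch loop over the dataset")
    return feedback

def student_revise(code: str, feedback: list[str]) -> str:
    """Toy revision: append the pieces the mentor asked for."""
    if any("training procedure" in f for f in feedback):
        code += "\ndef train(model, data):\n    pass\n"
    if any("epoch loop" in f for f in feedback):
        code += "\nfor epoch in range(10):\n    pass\n"
    return code

def refine(code: str, max_cycles: int = 5) -> tuple[str, int]:
    """Run mentor-student cycles until convergence or the cycle budget."""
    for cycle in range(1, max_cycles + 1):
        feedback = mentor_review(code)
        if not feedback:          # convergence: all mentor checks pass
            return code, cycle
        code = student_revise(code, feedback)
    return code, max_cycles

final, cycles = refine("def model(): pass")
```

The `max_cycles` knob is where test-time compute scaling lives: raising it trades cost for more chances to converge.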

Mechanism 4: Scientist-Bench Anonymization Protocol

A critical innovation in Scientist-Bench is its anonymization protocol, designed to prevent LLMs from recognizing and regurgitating memorized papers:

| Anonymization Technique | What It Prevents | Implementation |
|---|---|---|
| Method name masking | Model name recall → copy | Replace algorithm/model names with generic identifiers |
| Technical detail abstraction | Architecture recall → copy | Remove implementation specifics while preserving core concepts |
| Dataset standardization | Dataset-based shortcutting | Normalize experimental contexts to prevent familiarity exploitation |
| Citation anonymization | Temporal/institutional recall | Eliminate date markers and institutional affiliations |

Why this matters. LLMs trained on arXiv have memorized significant portions of the ML literature. Without anonymization, a system given references to "diffusion models with classifier guidance" might simply regurgitate the known classifier-free guidance paper rather than generating a novel approach. The anonymization forces the system to engage with the underlying concepts rather than matching surface patterns.

Effectiveness concern. Despite these measures, sufficiently capable LLMs may still recognize research areas from structural and conceptual cues even after anonymization. The paper does not provide a formal analysis of how effectively the anonymization prevents memorization-based shortcuts—a significant gap in the evaluation methodology.
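
The method-name-masking row of the protocol can be illustrated with a toy substitution pass. The `Method-A`/`Method-B` aliasing scheme is an assumption for illustration; Scientist-Bench does not publish its exact masking procedure.

```python
import re

# Toy sketch of method-name masking for benchmark anonymization.
def mask_method_names(text: str, names: list[str]) -> tuple[str, dict]:
    """Replace each known method name with a generic identifier."""
    mapping = {}
    for i, name in enumerate(names):
        alias = f"Method-{chr(ord('A') + i)}"
        mapping[name] = alias
        text = re.sub(re.escape(name), alias, text)
    return text, mapping

abstract = "We compare DiffusionGAN against ScoreFlow on three benchmarks."
masked, mapping = mask_method_names(abstract, ["DiffusionGAN", "ScoreFlow"])
```

Even a correct masking pass leaves the structural cues intact, which is precisely the residual-memorization concern noted above.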

Mechanism 5: Two-Stage Evaluation Framework

Scientist-Bench's evaluation addresses the dual nature of research quality—technical implementation and scientific contribution:

Stage 1: Technical Execution Validation

Input: Generated code 𝒞
    ↓
Code Review Agent performs:
  • Static analysis (syntax, structure, completeness)
  • Runtime verification (does it execute?)
  • Algorithm correctness check (does it implement what's specified?)
  • Computational efficiency assessment
  • Constraint adherence verification
    ↓
Output: Completion ratio (% of required functionality implemented)

Stage 2: Scientific Contribution Evaluation

Input: Generated paper p, Ground-truth paper y, Review guidelines g
    ↓
RandomSwap(p, y)  → Randomize presentation order
    ↓
Paper Review Agent evaluates:
  • Technical novelty
  • Methodological rigor
  • Empirical validation
  • Impact potential
    ↓
Output: Comparative rating r ∈ {-3,...,3}
        Structured justifications J

Debiasing mechanisms:

  1. Position randomization: Papers are presented in random order to eliminate position bias (LLMs tend to favor the first or second option depending on the model).

  2. Multi-model panel: GPT, Claude, and Gemini models independently evaluate, creating a diverse review panel.

  3. Temperature 1.0: Maximum sampling diversity to avoid degenerate single-mode evaluations.
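
The first two debiasing mechanisms compose naturally: randomize presentation order per reviewer, flip the sign of the rating when the order was swapped, and average across the panel. The `review_fn` stubs below stand in for LLM reviewers returning a rating in {-3,...,3}; the aggregation-by-mean is an assumption.

```python
import random
import statistics

# Sketch of position randomization + multi-model panel aggregation.
def debiased_rating(paper, ground_truth, review_fns, rng=random.Random(0)):
    """Average panel ratings, sign-correcting for randomized order."""
    ratings = []
    for review in review_fns:          # one entry per panel model
        if rng.random() < 0.5:
            ratings.append(review(paper, ground_truth))
        else:                          # swapped order: flip the sign back
            ratings.append(-review(ground_truth, paper))
    return statistics.mean(ratings)

# Stub "models": each prefers whichever paper is longer.
panel = [lambda a, b: (len(a) > len(b)) - (len(a) < len(b))] * 3
score = debiased_rating("short paper", "a much longer ground truth", panel)
```

A reviewer with a pure position bias (always preferring the first option) averages to zero under this scheme, which is the point of the randomization.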

Validation against human judgment. The paper reports that on 32 ICLR paper pairs, the LLM-based review protocol achieves "perfect alignment" with ICLR acceptance decisions. While impressive, the small sample size (32 pairs) and the specific choice of using acceptance/rejection as the binary criterion (rather than fine-grained review scores) limit the strength of this validation claim.

Mechanism 6: Containerized Secure Execution

All pipeline phases execute within Docker containers, providing:

  1. Security isolation: Generated code cannot access or modify the host system. This is essential given that the Student Agent produces and executes arbitrary code—without sandboxing, a single hallucinated rm -rf or malicious library installation could compromise the host.

  2. Environment consistency: Containers come pre-configured with ML frameworks (PyTorch at minimum), ensuring reproducible execution environments. The system avoids the "works on my machine" problem.

  3. Dynamic dependency management: Agents can autonomously install additional Python packages as research needs evolve during a run. The container absorbs these installations without affecting the host.

  4. Scalable parallelism: Docker containers enable running multiple research pipelines simultaneously without interference, critical for benchmark-scale evaluation.
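
A minimal sketch of the sandbox pattern is to mount only the job directory into a container and execute the generated script there. The image name, mount layout, and flags below are assumptions, not AI-Researcher's actual sandbox configuration.

```python
import subprocess

# Hypothetical sketch of containerized execution of generated code.
def sandbox_command(script_path: str, workdir: str,
                    image: str = "pytorch/pytorch:latest") -> list[str]:
    """Build a `docker run` invocation that hides the host filesystem."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/workspace",   # only the job directory is visible
        "-w", "/workspace",
        image,
        "python", script_path,
    ]

def run_in_sandbox(script_path: str, workdir: str) -> subprocess.CompletedProcess:
    """Execute the generated script inside the container."""
    cmd = sandbox_command(script_path, workdir)
    return subprocess.run(cmd, capture_output=True, text=True)

cmd = sandbox_command("train.py", "/tmp/run-001")
```

Because the container keeps network access here, agents can still `pip install` packages mid-run (the dynamic dependency management above); a stricter profile would also drop capabilities and cap CPU/memory.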


12 Programming Language

| Component | Language | Framework/Library |
|---|---|---|
| Multi-agent orchestration | Python | Custom agent framework |
| Knowledge Acquisition | Python | GitHub API, arXiv API |
| Paper Analyst | Python | RAG pipeline over LaTeX |
| Code Analyst | Python | AST analysis, repository traversal |
| Idea Generator | Python | LLM prompting framework |
| Implementation (Student/Mentor) | Python | LLM-based code generation |
| Documentation Agent | Python | LLM-based LaTeX generation |
| Evaluation pipeline | Python | LLM-based review agents |
| Sandbox | Docker | Containerized Python runtime |
| Generated code output | Python (primarily) | PyTorch, domain-dependent |

The entire system is Python-native, consistent with the ML research ecosystem. Docker provides the execution sandbox but the orchestration and agent logic are pure Python.

Code Organization (Inferred from Repository)

AI-Researcher/
├── assets/
│   └── paper.pdf                  # Technical report
├── src/
│   ├── agents/
│   │   ├── knowledge_acquisition.py   # Literature + repo discovery
│   │   ├── paper_analyst.py           # LaTeX RAG, math extraction
│   │   ├── code_analyst.py            # Repository analysis
│   │   ├── plan_agent.py              # Implementation roadmap
│   │   ├── idea_generator.py          # Divergent-convergent ideation
│   │   ├── mentor.py                  # Code review feedback
│   │   ├── student.py                 # Implementation + revision
│   │   └── documentation.py           # Manuscript generation
│   ├── evaluation/
│   │   ├── code_review.py             # Stage 1: technical validation
│   │   ├── paper_review.py            # Stage 2: scientific contribution
│   │   └── debiasing.py               # Position swap, multi-model
│   ├── sandbox/
│   │   ├── docker_manager.py          # Container lifecycle
│   │   └── execution.py               # Code execution in sandbox
│   ├── utils/
│   │   ├── arxiv_client.py            # arXiv API integration
│   │   ├── github_client.py           # GitHub API integration
│   │   └── latex_parser.py            # LaTeX source processing
│   └── pipeline.py                    # End-to-end orchestration
├── scientist_bench/
│   ├── papers/                        # 22 ground-truth papers
│   ├── references/                    # Reference paper sets
│   ├── datasets/                      # Benchmark datasets
│   ├── instructions/                  # Level-1 research instructions
│   └── evaluation/                    # Evaluation scripts + prompts
├── config/                            # Model configs, prompts
├── docker/                            # Dockerfile for sandbox
└── README.md

Prompt Engineering

The system relies heavily on carefully crafted system prompts for each agent. The paper provides detailed prompt templates in the appendix, covering:

  - Knowledge Acquisition search strategies
  - Concept decomposition instructions for the Paper and Code Analysts
  - Divergent ideation instructions with explicit diversity requirements
  - Mentor review rubrics with structured feedback templates
  - Documentation agent writing guidelines aligned with conference format
  - Evaluation prompts aligned with ICLR review criteria


13 Memory Management

Within-Run Memory Architecture

AI-Researcher's memory management operates at two levels: the agent orchestration level and the individual agent context level.

Agent Orchestration Level

The pipeline is structured as a sequential handoff, with each phase producing artifacts consumed by the next:

Knowledge Acquisition → [papers, repos, LaTeX sources]
        ↓
Resource Analyst → [concept profiles with theory↔code mappings]
        ↓
Plan Agent → [implementation roadmap]
        ↓
Idea Generator → [structured research proposal]
        ↓
Student/Mentor → [implementation code, experimental results]
        ↓
Documentation Agent → [publication-ready manuscript]

Each handoff is a materialized artifact—written to the filesystem within the Docker container. This means the full context of prior phases doesn't need to fit in a single LLM context window; agents receive structured summaries and can access specific artifacts as needed.
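
The handoff pattern amounts to writing each phase's output to the run directory and letting the next phase read only what it needs. The file names and JSON serialization below are illustrative assumptions about how such materialization might look.

```python
import json
import pathlib
import tempfile

# Sketch of the materialized-artifact handoff between pipeline phases.
def write_artifact(run_dir: pathlib.Path, phase: str, payload: dict) -> pathlib.Path:
    """Persist a phase's output so later phases need not hold it in context."""
    path = run_dir / f"{phase}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path

def read_artifact(run_dir: pathlib.Path, phase: str) -> dict:
    """Load a prior phase's artifact on demand."""
    return json.loads((run_dir / f"{phase}.json").read_text())

run_dir = pathlib.Path(tempfile.mkdtemp())
write_artifact(run_dir, "resource_analyst", {"concepts": ["graph attention"]})
# The Idea Generator consumes the profile without re-reading the papers:
profile = read_artifact(run_dir, "resource_analyst")
```

The key property is that the full transcript of earlier phases never has to fit in one context window; only the structured artifact does.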

Individual Agent Context Level

Each agent operates within the LLM's context window. The critical pressure points are:

| Agent | Context Pressure | Mitigation |
|---|---|---|
| Paper Analyst | Must process full LaTeX papers | RAG-based retrieval (query-specific chunks, not full papers) |
| Code Analyst | Must understand multi-file repositories | Targeted file retrieval + dependency tracing |
| Idea Generator | Needs comprehensive research landscape | Structured concept profiles as condensed input |
| Student Agent | Must hold proposal + concept profiles + evolving code | Iterative cycles (each cycle focuses on specific feedback) |
| Documentation Agent | Must synthesize all prior artifacts | Hierarchical synthesis (section-by-section, not all-at-once) |

RAG as Memory Extension

The Paper Analyst's use of RAG over LaTeX files is a memory architecture decision as much as a retrieval one. By indexing LaTeX sources and retrieving relevant chunks per query, the system effectively extends the agent's accessible knowledge beyond the context window limit. A single paper's LaTeX source might be 30–50K tokens; with 15–20 papers, the total exceeds 500K tokens—well beyond any current context window. RAG allows the agent to access this knowledge base on-demand.
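
The mechanism can be reduced to two steps: split the corpus into chunks, then admit only the top-scoring chunks into the agent's context per query. The word-overlap scorer below is a stand-in assumption for a real embedding model; only the chunk-then-retrieve shape reflects the described architecture.

```python
# Minimal sketch of RAG as a context-window extension.
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (embedding stand-in)."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

corpus = chunk(
    "The attention coefficients are computed with a softmax over neighbors. "
    "Spectral normalization divides the weight matrix by its largest singular value. "
    "Training uses the Adam optimizer with a cosine schedule.",
    size=10,
)
hits = retrieve("spectral normalization of the weight matrix", corpus, k=1)
```

Scaled up, a 500K-token LaTeX index stays on disk while each query pulls only a few thousand tokens of relevant chunks into the window.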

No Cross-Run Memory

Like many autonomous research systems of its generation (May 2025), AI-Researcher does not implement cross-run learning:

  • No skill extraction from successful research runs
  • No meta-learning about which ideation strategies produce better results
  • No accumulated knowledge base that improves with use
  • Each run starts from scratch with only the user-provided references

This is a deliberate simplicity choice. Cross-run memory would introduce complex engineering challenges (skill representation, relevance determination, staleness management) without a clear theoretical framework for what constitutes useful research "experience." Later systems like MetaClaw and EurekaClaw address this gap with explicit continual learning mechanisms.

Concept Profile as Structured Memory

The Resource Analyst's concept profiles function as a structured external memory system within a single run:

ConceptProfile {
    concept_name: str
    atomic_components: [
        {
            name: str,
            math_formulation: LaTeX,
            source_paper: str,
            source_equation: int,
            code_implementation: {
                file: str,
                lines: (int, int),
                repository: str
            },
            dependencies: [str]
        },
        ...
    ]
    bidirectional_mappings: [
        {math_ref: str, code_ref: str},
        ...
    ]
}

This structured representation compresses the full content of papers and repositories into a query-friendly format that downstream agents (Idea Generator, Student Agent, Documentation Agent) can efficiently consume. It is, in effect, a domain-specific memory compression scheme optimized for research implementation tasks.


14 Continued Learning

Current Paradigm: Stateless Per-Run

AI-Researcher operates as a stateless system: each research run is independent, and no information persists across runs. The system's "knowledge" is entirely constituted by:

  1. The user-provided reference papers (external)
  2. The LLM's parametric knowledge (frozen)
  3. The concept profiles built during the current run (ephemeral)

Potential Continued Learning Extensions

The paper does not explicitly discuss continued learning, but the architecture admits several natural extensions:

1. Concept Profile Accumulation

The bidirectional theory-code mappings produced by the Resource Analyst could be accumulated across runs into a persistent knowledge base:

Run 1: Concept profiles for {graph attention, spectral methods}
Run 2: Concept profiles for {diffusion models, score matching}
Run 3: Concept profiles for {contrastive learning, InfoNCE}
...
Accumulated KB: Rich theory↔code mapping across ML

Over many runs, this would build a comprehensive, grounded knowledge base that reduces the per-run cost of literature analysis and improves implementation accuracy.
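
As a purely hypothetical extension (the paper describes no such mechanism), the accumulation could be as simple as merging each run's profiles into a persistent store keyed by concept name:

```python
import json
import pathlib
import tempfile

# Hypothetical cross-run knowledge base; NOT part of AI-Researcher.
def accumulate(kb_path: pathlib.Path, run_profiles: dict) -> dict:
    """Merge one run's concept profiles into the persistent KB."""
    kb = json.loads(kb_path.read_text()) if kb_path.exists() else {}
    kb.update(run_profiles)            # later runs overwrite stale entries
    kb_path.write_text(json.dumps(kb, indent=2))
    return kb

kb_path = pathlib.Path(tempfile.mkdtemp()) / "concept_kb.json"
accumulate(kb_path, {"graph attention": {"eq": "Eq. 3", "code": "gat.py:42-58"}})
kb = accumulate(kb_path, {"score matching": {"eq": "Eq. 1", "code": "sde.py:10-30"}})
```

The hard problems the paper sidesteps (staleness, relevance, conflicting mappings) live in the `update` line: blind overwriting is clearly too naive for a real system.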

2. Ideation Strategy Learning

The divergent-convergent framework generates 5 directions and selects 1. The 4 rejected directions and their evaluation scores constitute training data for improving the ideation strategy:

Run n:
  Generated directions: [d1, d2, d3, d4, d5]
  Evaluations: [novelty, soundness, potential] for each
  Selected: d3
  Final paper quality: r = 2.0

→ Learn: What properties of d3 made it successful?
→ Adjust: Generate more d3-like directions in future

This feedback loop could be implemented as few-shot examples in the Idea Generator's prompt or as fine-tuning data for a specialized ideation model.

3. Implementation Pattern Learning

The mentor-student refinement cycles produce rich data about common implementation errors and effective fixes:

Cycle 1: Student writes incorrect attention mask → Mentor identifies → Student fixes
Cycle 2: Student misses gradient detach → Mentor identifies → Student fixes
...

→ Extract patterns: "Common error: forgetting .detach() in contrastive loss"
→ Pre-prompt Student with learned patterns to reduce cycle count

This mirrors EurekaClaw's skill extraction mechanism, adapted for code implementation rather than theorem proving.

4. Evaluation Calibration

Across many benchmark runs, the system could calibrate its LLM-based evaluation against accumulated human judgments, progressively improving the reliability of automated assessment.

5. Meta-Research Learning

At a higher level, the system could learn meta-research patterns:

  - Which research areas produce more successful implementations?
  - What reference paper combinations lead to more novel ideas?
  - How does the number of refinement cycles correlate with final quality?
  - Which LLM models perform best for which research domains?

Comparison with Systems That Do Implement Continued Learning

| System | Continued Learning Mechanism |
|---|---|
| AI-Researcher | None (stateless per-run) |
| EurekaClaw | Skill extraction from proof strategies |
| MetaClaw | Cross-run meta-learning from research outcomes |
| SkyPilot Autoresearch | Accumulated experiment database |
| OpenResearcher | None (single-pass SFT) |

AI-Researcher's lack of continued learning is typical for first-generation autonomous research systems (2025) but represents a clear improvement opportunity for future versions.


15 Applications

Primary Application: Accelerating Computational Research

AI-Researcher's primary application is automating the complete cycle of computational research in AI/ML domains. The system is evaluated across 16 research areas:

Domain                        Example Topics
Diffusion Models              Score-based generative modeling, denoising
Vector Quantization           VQ-VAE variants, codebook learning
Graph Neural Networks         Message passing, spectral methods
Recommender Systems           Collaborative filtering, sequential recommendation
Computer Vision               Image classification, object detection
Self-Supervised Learning      Contrastive methods, masked prediction
Contrastive Learning          InfoNCE variants, hard negative mining
Image Processing              Super-resolution, denoising, restoration
Natural Language Processing   Text classification, generation
Multi-Modal Learning          Vision-language, audio-visual
Time Series                   Forecasting, anomaly detection
Reinforcement Learning        Policy optimization, reward shaping
Federated Learning            Privacy-preserving distributed training
Knowledge Graphs              Link prediction, entity alignment
Neural Architecture Search    Efficient architecture design
Optimization                  Learning rate scheduling, adaptive methods

Application Scenario 1: Research Exploration at Scale

A research group could use AI-Researcher to rapidly explore a solution space:

Given: 15 papers on graph contrastive learning
    ↓
Run AI-Researcher 20 times (different random seeds / temperature)
    ↓
20 distinct research directions explored
20 implementations produced
20 manuscripts generated
    ↓
Human researchers review top-ranked outputs
Select most promising for deeper investigation

At $15–$150 per run, 20 explorations cost $300–$3,000—comparable to a few days of a researcher's time but producing 20 distinct directions rather than the 1–2 a human might explore in the same period.
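A driver for this batch-exploration loop might look like the sketch below. `run_ai_researcher` is a hypothetical stub standing in for a real single-run invocation of the system (here it just returns a seeded random quality score); only the cost arithmetic and the ranking-for-human-review step reflect the scenario described above.

```python
import random

def run_ai_researcher(references, seed, temperature):
    """Hypothetical single-run entry point, stubbed for illustration:
    returns one run's metadata and a mock quality score."""
    rng = random.Random(seed)
    return {"seed": seed, "temperature": temperature,
            "quality": round(rng.uniform(1.0, 3.0), 2)}

def explore(references, n_runs=20, cost_per_run=(15, 150), top_k=5):
    """Launch n_runs independent explorations, rank by quality,
    and return the top_k candidates plus the total cost range."""
    runs = [run_ai_researcher(references, seed=s, temperature=0.7 + 0.02 * (s % 5))
            for s in range(n_runs)]
    runs.sort(key=lambda r: r["quality"], reverse=True)
    low, high = cost_per_run
    budget = (n_runs * low, n_runs * high)
    return runs[:top_k], budget
```

With the defaults, `explore` reproduces the scenario's arithmetic: 20 runs at $15–$150 each gives a $300–$3,000 budget, with the top-ranked outputs handed to human reviewers.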

Application Scenario 2: Research Benchmarking and Evaluation

Scientist-Bench enables standardized evaluation of autonomous research systems:

System A: Run on 22 Scientist-Bench papers → [completeness, correctness, rating]
System B: Run on 22 Scientist-Bench papers → [completeness, correctness, rating]
    ↓
Direct, standardized comparison

This application extends beyond AI-Researcher itself. The open leaderboard accepts community submissions, positioning Scientist-Bench as a shared benchmark for the growing field of autonomous research.

Application Scenario 3: Research Education and Training

The system's structured pipeline mirrors the research process, making it potentially useful as an educational tool:

  • Students can study the concept profiles to understand how research ideas map between theory and code
  • The divergent-convergent framework models creative research thinking explicitly
  • The mentor-student refinement cycle demonstrates iterative code improvement
  • Generated manuscripts provide examples of research writing structure

Application Scenario 4: Corporate R&D Acceleration

The production deployment at novix.science/chat suggests a commercial application path:

  • R&D teams provide domain references
  • AI-Researcher generates multiple research directions
  • Technical teams evaluate and refine the most promising outputs
  • Cycle time from "idea exploration" to "working prototype" can potentially shrink from weeks to hours

Limitations for Applications

  1. Domain restriction: Currently validated only on computational AI/ML research. Extension to experimental sciences (biology, chemistry, physics) would require fundamentally different implementation capabilities.

  2. Quality ceiling: Generated papers "approach" human quality but do not consistently match it. Results are most useful as starting points for human refinement, not final outputs.

  3. Novelty verification gap: The system generates "novel" ideas relative to its input references, but cannot verify novelty against the full literature. Generated ideas may unknowingly replicate existing unpublished or obscure work.

  4. No experimental feedback integration: The implementation phase executes code, but experimental results do not feed back into ideation or manuscript writing. The system cannot adapt its approach mid-run the way human researchers do when experiments produce unexpected outcomes.

  5. Benchmark size limitations: 22 papers across 16 domains means statistical claims about domain-specific performance have limited power. A domain with 1–2 benchmark samples cannot support robust conclusions about the system's capability in that area.

Impact Assessment

Dimension                Assessment
Technical novelty        High: among the first complete end-to-end systems with bidirectional grounding
Benchmark contribution   High: Scientist-Bench fills a critical evaluation gap
Reproducibility          Moderate: open code, but API-dependent and stochastic
Practical impact         Growing: 5K+ GitHub stars, NeurIPS acceptance, production deployment
Scientific rigor         Moderate: small benchmark, self-evaluation concerns
Influence on field       High: widely cited as inspiration by subsequent systems

Relationship to OmniEvolve

AI-Researcher's multi-agent pipeline architecture, benchmark-driven evaluation methodology, and iterative refinement mechanisms are directly relevant to OmniEvolve's design:

AI-Researcher Pattern                      OmniEvolve Analog
Resource Analyst (bidirectional mapping)   Knowledge module (theory↔code grounding for mutation operators)
Divergent-convergent ideation              Search backends (diverse candidate generation → fitness evaluation)
Mentor-student refinement                  Cascade evaluator (iterative quality gates)
Scientist-Bench (two-level evaluation)     Benchmark suites (fairness, parity, multi-level assessment)
Docker sandboxing                          Safety module (restricted subprocess/container execution)
Concept profiles (structured memory)       Learning logs (structured knowledge for prompt populations)
Plan Agent (implementation roadmap)        Orchestrator (experiment lifecycle planning)

The key transferable insight is the bidirectional grounding principle: linking abstract specifications to concrete implementations prevents hallucination drift, a problem equally relevant in evolutionary algorithm discovery where mutated candidates must faithfully implement their specified behavior.