
AutoResearchClaw

Fully autonomous 23-stage pipeline that transforms a research idea into a conference-ready paper with real literature, sandboxed experiments, multi-agent peer review, and self-evolving cross-run learning.

Organization: AIMING Lab (UC Santa Cruz, UNC Chapel Hill, Johns Hopkins, UC Davis, et al.)
Published: March 15, 2026 (v0.1.0); actively maintained through v0.3.2+
Type: repo (GitHub: aiming-lab/AutoResearchClaw)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

AutoResearchClaw: Fully Autonomous Research from Idea to Paper

  • Repository: github.com/aiming-lab/AutoResearchClaw
  • License: MIT
  • Stars: ~9,800+ (as of April 2026)
  • First release: v0.1.0, March 15, 2026
  • Current release: v0.3.2, March 22, 2026
  • Tagline: "Chat an Idea. Get a Paper."
  • Predecessor lineage: Inspired by AI Scientist (Sakana AI), AutoResearch (Karpathy), FARS (Analemma)
  • Companion systems:
      • MetaClaw — cross-run learning engine (skill extraction from failures)
      • OpenClaw — AI assistant platform (chat interface for pipeline orchestration)

The project name "Claw" references the lobster emoji used throughout the branding, suggesting the system's ability to "grasp" research problems and work through them autonomously.

Version History

| Version | Date | Key Features |
|---|---|---|
| v0.1.0 | Mar 15, 2026 | Initial release: 23-stage pipeline, end-to-end autonomous |
| v0.2.0 | Mar 16, 2026 | CodeAgent, BenchmarkAgent, FigureAgent; Docker sandbox hardening; 4-round quality audit |
| v0.3.0 | Mar 17, 2026 | MetaClaw integration (+18.3% robustness); cross-run learning |
| v0.3.1 | Mar 18, 2026 | OpenCode Beast Mode; Novita AI provider; thread-safety hardening |
| v0.3.2 | Mar 22, 2026 | Cross-platform ACP support; anti-fabrication system; 100+ bug fixes; --resume |
| v0.3.2+ | Mar 30, 2026 | Flexible skill loading; 20 pre-loaded skills; A-Evolve skill |

Notable pace: 6 significant releases in 15 days, suggesting rapid iteration under active development pressure.


2 Authors and Team

| Author | Affiliation |
|---|---|
| Jiaqi Liu |  |
| Peng Xia |  |
| Siwei Han |  |
| Shi Qiu |  |
| Letian Zhang |  |
| Guiming Chen |  |
| Haoqin Tu |  |
| Xinyu Yang |  |
| Jiawei Zhou |  |
| Hongtu Zhu | UNC Chapel Hill |
| Yun Li |  |
| Yuyin Zhou | UC Santa Cruz |
| Zeyu Zheng |  |
| Cihang Xie | UC Santa Cruz |
| Mingyu Ding | Johns Hopkins University |
| Huaxiu Yao | UC Davis (AIMING Lab lead) |

(Affiliations are shown only where the source lists them.)

Team composition: Academic research group (AIMING Lab) spanning multiple US universities. Unlike the industrial AIRA₂ team (25 authors at Meta FAIR), this is a more typical academic team producing open-source research infrastructure.

AIMING Lab context: The lab has produced related work including MetaClaw for cross-run learning, suggesting a broader research program around automated scientific discovery.


3 Core Contribution

AutoResearchClaw's core contribution is a complete end-to-end pipeline that autonomously transforms a text research topic into a conference-ready academic paper. Unlike systems that focus on a single phase (literature search, experiment execution, or writing), AutoResearchClaw integrates all phases into a single orchestrated workflow.

The 23-Stage Pipeline

Phase A: Research Scoping            Phase E: Experiment Execution
  1. TOPIC_INIT                        12. EXPERIMENT_RUN
  2. PROBLEM_DECOMPOSE                 13. ITERATIVE_REFINE  ← self-healing

Phase B: Literature Discovery        Phase F: Analysis & Decision
  3. SEARCH_STRATEGY                   14. RESULT_ANALYSIS    ← multi-agent
  4. LITERATURE_COLLECT  ← real APIs   15. RESEARCH_DECISION  ← PIVOT/REFINE
  5. LITERATURE_SCREEN   [GATE]
  6. KNOWLEDGE_EXTRACT                 Phase G: Paper Writing
                                       16. PAPER_OUTLINE
Phase C: Knowledge Synthesis           17. PAPER_DRAFT
  7. SYNTHESIS                         18. PEER_REVIEW        ← evidence check
  8. HYPOTHESIS_GEN    ← debate        19. PAPER_REVISION

Phase D: Experiment Design           Phase H: Finalization
  9. EXPERIMENT_DESIGN   [GATE]        20. QUALITY_GATE       [GATE]
 10. CODE_GENERATION                   21. KNOWLEDGE_ARCHIVE
 11. RESOURCE_PLANNING                 22. EXPORT_PUBLISH     ← LaTeX
                                       23. CITATION_VERIFY    ← relevance check

Five Differentiating Capabilities

| Capability | Description |
|---|---|
| PIVOT/REFINE Loop | Stage 15 autonomously decides: PROCEED (continue), REFINE (tweak parameters → Stage 13), or PIVOT (new research direction → Stage 8). Artifacts are auto-versioned across loops. |
| Multi-Agent Debate | Hypothesis generation, result analysis, and peer review each use structured multi-perspective LLM debate rather than single-pass generation. |
| Self-Learning (MetaClaw) | Lessons extracted per run (decision rationale, runtime warnings, metric anomalies) with 30-day time-decay. Future runs avoid past mistakes. |
| Anti-Fabrication System | VerifiedRegistry enforces ground-truth experiment data in papers. Unverified numbers are sanitized. Failed experiments are auto-diagnosed and repaired before writing. |
| Real Citation Verification | 4-layer verification: arXiv ID → CrossRef/DataCite DOI → Semantic Scholar title match → LLM relevance scoring. Hallucinated references are automatically removed. |
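
The 30-day time-decay on MetaClaw lessons is described only by its window length. One plausible reading, sketched below, treats it as an exponential half-life; the function name and the half-life interpretation are assumptions, not the documented mechanism.

```python
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 30  # assumption: "30-day time-decay" read as a half-life

def lesson_weight(recorded: datetime, now: datetime) -> float:
    """Exponentially down-weight a stored lesson by its age in days."""
    age_days = (now - recorded).total_seconds() / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

now = datetime(2026, 4, 1)
fresh = lesson_weight(now, now)                         # weight 1.0
month_old = lesson_weight(now - timedelta(days=30), now)  # weight 0.5
```

Under this reading, a lesson from a failed run a month ago still influences future runs, but at half the weight of a lesson learned today.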

What Makes It Novel in the Autonomous-Research Landscape

Compared to AI Scientist (Sakana AI) and AIRA₂ (Meta FAIR):

| Dimension | AI Scientist | AIRA₂ | AutoResearchClaw |
|---|---|---|---|
| Output | Paper (limited quality) | Competition solution | Conference-ready paper |
| Experiments | Simulated / toy | Real ML training (Kaggle) | Sandboxed Python (configurable fidelity) |
| Literature | No real retrieval | N/A | Real APIs (OpenAlex, Semantic Scholar, arXiv) |
| Citation integrity | Hallucinated refs common | N/A | 4-layer verification |
| Self-improvement | None | Within-task evolution | Cross-run MetaClaw learning |
| Human-in-loop | None | None | 3 quality gates (optional) |
| Target venue | Workshop-level | N/A (competition) | NeurIPS / ICML / ICLR |

4 Supported Solutions

Research Domains

AutoResearchClaw is domain-agnostic by design. The showcase demonstrates papers across 8 domains:

| Domain | Showcase Paper |
|---|---|
| Mathematics | Random matrix theory |
| Statistics | Weak IV estimators |
| Biology | SIR/SEIR identifiability |
| Computing | Krylov preconditioners |
| NLP | Token merging (FAME) |
| Reinforcement Learning | LACE exploration |
| Computer Vision | GARD-LoRA |
| Model Compression | CRAFT distillation |

Experiment Execution Modes

| Mode | Description | Use Case |
|---|---|---|
| simulated | LLM generates plausible results without execution | Prototyping, low-resource |
| sandbox | AST-validated Python in local subprocess | Default; most common |
| docker | Hardened Docker containers with network policy | Production; GPU experiments |
| ssh_remote | Execution on remote GPU server | Large-scale training |

Experiment Complexity Tiers

The system automatically assesses experiment complexity and routes accordingly:

Complexity Assessment:
┌─────────────────────────────────────────────────┐
│  Simple (score < 0.2)                           │
│  → Direct LLM code generation                  │
│                                                  │
│  Medium (0.2 ≤ score < threshold)               │
│  → CodeAgent v2 with architecture planning      │
│                                                  │
│  Complex (score ≥ threshold)                    │
│  → OpenCode Beast Mode                          │
│  → Multi-file projects with custom architectures│
│  → Training loops + ablation studies            │
└─────────────────────────────────────────────────┘
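
The routing above can be sketched as a small dispatcher. Note the config shown later sets complexity_threshold: 0.2, and a separate cutoff for the "Simple" tier is not documented, so both boundary values here are illustrative assumptions.

```python
def route_experiment(score: float,
                     simple_cutoff: float = 0.2,
                     complexity_threshold: float = 0.6) -> str:
    """Map a complexity score in [0, 1] to a code-generation backend.

    Both thresholds are assumed values for illustration only.
    """
    if score < simple_cutoff:
        return "direct_llm"           # Simple: single-prompt generation
    if score < complexity_threshold:
        return "code_agent_v2"        # Medium: architecture planning
    return "opencode_beast_mode"      # Complex: multi-file projects
```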

Output Artifacts

A complete run produces:

| Artifact | Format | Description |
|---|---|---|
| paper_draft.md | Markdown | Full academic paper (5,000–6,500 words) |
| paper.tex | LaTeX | Conference-ready (NeurIPS/ICML/ICLR templates) |
| references.bib | BibTeX | Real references, auto-pruned to match inline citations |
| verification_report.json | JSON | 4-layer citation integrity + relevance verification |
| experiment_runs/ | Python + JSON | Generated code + sandbox results + structured metrics |
| charts/ | PNG/PDF | Auto-generated comparison charts with error bars |
| reviews.md | Markdown | Multi-agent peer review with consistency checks |
| evolution/ | Markdown | Self-learning lessons extracted from the run |
| deliverables/ | Mixed | All final outputs, compile-ready for Overleaf |

5 LLM Integration

Provider Architecture

AutoResearchClaw supports a pluggable LLM backend through multiple provider types:

| Provider | Configuration | Notes |
|---|---|---|
| OpenAI-compatible | base_url + api_key_env | Default; works with any OpenAI-compatible API |
| OpenAI | Direct OpenAI API | GPT-4o, GPT-4o-mini |
| OpenRouter | Multi-model routing | Access to many models via single API |
| DeepSeek | DeepSeek API | DeepSeek V3/V3.2 |
| Minimax | Minimax API |  |
| Novita AI | Novita API | Added in v0.3.1 |
| ACP (Agent Client Protocol) | CLI agent delegation | Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI |

ACP: Agent Client Protocol

A distinctive feature is ACP support, where AutoResearchClaw delegates LLM calls to external CLI agents rather than calling APIs directly:

llm:
  provider: "acp"
  acp:
    agent: "claude"   # or: codex, gh, gemini, opencode, kimi
    cwd: "."

The ACP adapter communicates via acpx, maintaining a single persistent session across all 23 pipeline stages. This means the CLI agent accumulates context about the entire research process, potentially improving coherence across stages.

Model Usage Across Pipeline Stages

The LLM is used in qualitatively different ways across the pipeline:

| Stage Group | LLM Usage Pattern |
|---|---|
| Scoping (1-2) | Single-turn generation: topic decomposition, problem tree |
| Literature (3-6) | Query generation + relevance scoring + knowledge extraction |
| Synthesis (7-8) | Multi-agent debate: structured hypothesis generation |
| Design (9-11) | Architecture planning: implementation blueprint generation |
| Code Gen (10, 13) | Multi-file code generation (CodeAgent or OpenCode Beast Mode) |
| Analysis (14) | Multi-agent debate: structured result interpretation |
| Decision (15) | Reasoning: PROCEED/REFINE/PIVOT with rationale |
| Writing (16-19) | Section-by-section drafting + peer review + revision |
| Quality (20, 23) | Scoring: quality gates + citation relevance |

OpenCode Beast Mode

Complex experiments are automatically routed to OpenCode, an external code generation system:

opencode:
  enabled: true
  auto: true
  complexity_threshold: 0.2  # 0.0-1.0
  timeout_sec: 600
  max_retries: 1

OpenCode generates multi-file projects with custom architectures, training loops, and ablation studies — going beyond what single-prompt code generation can produce.

Anti-Fabrication Integration

The VerifiedRegistry system enforces that only experimentally verified results appear in the paper:

  1. Experiments produce structured JSON metrics
  2. VerifiedRegistry indexes all verified metrics
  3. During paper writing, only registry-verified numbers can be cited
  4. Unverified numbers are sanitized (removed or flagged)
  5. If experiments fail, a diagnosis-and-repair loop attempts to fix them before writing

6 Key Results

Showcase Papers

AutoResearchClaw has produced 8 papers across 8 domains, generated fully autonomously with zero human intervention:

| Paper | Domain | Key Method |
|---|---|---|
| Paper I | Random matrix theory | Mathematical analysis |
| Paper II | Weak IV estimators | Statistical methodology |
| Paper III | SIR/SEIR identifiability | Epidemiological modeling |
| Paper IV | Krylov preconditioners | Numerical computing |
| Paper V | GARD-LoRA | Parameter-efficient fine-tuning |
| Paper VI | LACE exploration | Reinforcement learning |
| Paper VII | FAME token merging | Vision transformer efficiency |
| Paper VIII | CRAFT distillation | Knowledge distillation |

MetaClaw Integration Results

Controlled A/B experiments (same topic, same LLM, same configuration):

| Metric | Baseline | With MetaClaw | Improvement |
|---|---|---|---|
| Stage retry rate | 10.5% | 7.9% | −24.8% |
| Refine cycle count | 2.0 | 1.2 | −40.0% |
| Pipeline stage completion | 18/19 | 19/19 | +5.3% |
| Overall robustness score | 0.714 | 0.845 | +18.3% |

Composite robustness score = weighted average of stage completion (40%), retry reduction (30%), refine cycle efficiency (30%).
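
The composite can be written as a simple weighted sum. How each component is normalized to [0, 1] is not reported, so the function below sketches only the stated 40/30/30 weighting; the argument names are assumptions.

```python
def robustness(stage_completion: float,
               retry_reduction: float,
               refine_efficiency: float) -> float:
    """Weighted composite per the stated 40/30/30 split.

    All three inputs are assumed to be pre-normalized to [0, 1];
    the normalization itself is not documented.
    """
    return (0.40 * stage_completion
            + 0.30 * retry_reduction
            + 0.30 * refine_efficiency)
```

A perfect run on all three components yields a score of 1.0; the reported baseline (0.714) and MetaClaw (0.845) scores would correspond to some intermediate component values.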

Quality Indicators

| Indicator | Value |
|---|---|
| Paper length | 5,000–6,500 words |
| Target venue quality | NeurIPS / ICML / ICLR format |
| Citation verification | 4-layer (arXiv, CrossRef, DataCite, LLM) |
| Real references | Yes (OpenAlex, Semantic Scholar, arXiv APIs) |
| Hallucinated reference rate | Auto-removed (exact rate not reported) |
| Experiment fidelity | Sandboxed Python with hardware-aware adaptation |
| Peer review | Multi-agent with methodology-evidence consistency checks |

Adoption Metrics

| Metric | Value (as of April 2026) |
|---|---|
| GitHub stars | ~9,800+ |
| Releases | 6 in 15 days |
| Test suite | 1,823 tests passing |
| Community skills | 20 built-in + extensible via SKILL.md |
| Localization | 9 languages (README translations) |

Limitations of Reported Results

  1. No blind evaluation: Papers not submitted to actual conferences; quality is self-assessed
  2. No human expert review: Showcase papers not evaluated by domain experts
  3. No comparison to human baselines: No metric comparing output quality to human-written papers
  4. Selection bias: Showcase papers may be cherry-picked from many runs
  5. Experiment fidelity: Sandbox experiments may not match full-scale reproducible research

7 Reproducibility

Strengths

| Aspect | Assessment |
|---|---|
| Open source | Fully MIT-licensed; complete codebase on GitHub |
| Installation | Single pip install -e . + researchclaw setup |
| Configuration | Comprehensive YAML config with documented defaults |
| Test suite | 1,823 tests passing |
| Documentation | Extensive README, integration guide, tester guide |
| Example config | config.researchclaw.example.yaml with all options |
| Multi-platform | Cross-platform via ACP; not locked to a single LLM provider |
| Deterministic pipeline | 23 stages execute in fixed order with checkpoint/resume |

Challenges

| Aspect | Concern |
|---|---|
| LLM non-determinism | Output quality varies with model, temperature, and API version |
| API dependencies | Requires OpenAlex, Semantic Scholar, arXiv APIs (external services) |
| Experiment quality | Sandbox experiments may not reproduce at full scale |
| Cost variability | LLM API costs depend on model choice and generation length |
| Docker setup | Docker mode requires additional infrastructure |
| OpenCode dependency | Beast Mode requires separate OpenCode installation |
| Rapid iteration | 6 releases in 15 days suggests the API may be unstable |

Running a Reproduction

# Minimal reproduction:
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
researchclaw setup
researchclaw init    # interactive config
export OPENAI_API_KEY="sk-..."
researchclaw run --topic "Your topic" --auto-approve

# Resume interrupted run:
researchclaw run --resume  # auto-detects last checkpoint
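
The --resume auto-detection is not documented in detail; a plausible sketch is a scan of a run directory for per-stage checkpoint files, restarting after the highest completed stage. The file layout, names, and JSON schema below are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def find_resume_stage(run_dir: Path) -> int:
    """Return the first stage (1-23) that still needs to run."""
    done = []
    for ckpt in run_dir.glob("checkpoint_stage_*.json"):
        state = json.loads(ckpt.read_text())
        if state.get("status") == "completed":
            done.append(int(state["stage"]))
    return max(done, default=0) + 1

# Demo with a fake run directory containing two completed stages.
with tempfile.TemporaryDirectory() as d:
    run = Path(d)
    for n in (1, 2):
        (run / f"checkpoint_stage_{n:02d}.json").write_text(
            json.dumps({"stage": n, "status": "completed"}))
    resume_at = find_resume_stage(run)  # resumes at stage 3
```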

8 Compute and API Costs

LLM API Costs (Estimated)

A full 23-stage pipeline run involves extensive LLM usage across all stages. Estimated costs by model:

| Model | Estimated Cost per Run | Notes |
|---|---|---|
| GPT-4o | $15–50 | Many stages, multi-agent debate, code generation |
| GPT-4o-mini | $3–10 | Budget option; lower quality |
| Claude 3.5 Sonnet (via ACP) | $10–30 | Using Claude Code as agent |
| DeepSeek V3 | $2–8 | Cost-effective alternative |
| Gemini Pro | $5–20 | Via OpenRouter or direct |

Costs vary significantly based on topic complexity, number of REFINE/PIVOT cycles, experiment iterations, and paper length.

Compute Resources

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU-only mode) | NVIDIA (CUDA) or Apple (MPS) |
| Disk | 5GB | 20GB+ (for experiment artifacts) |
| Network | Required (LLM APIs + literature search) | Required |

Time Budget

| Configuration | Estimated Duration |
|---|---|
| Simulated experiments, fast model | 30–60 minutes |
| Sandbox experiments, GPT-4o | 2–6 hours |
| Docker + OpenCode Beast Mode | 4–12 hours |
| With PIVOT/REFINE cycles | 6–24 hours |

Per-Stage Time Breakdown (Estimated)

Phase A: Scoping ................ 5-10 min
Phase B: Literature ............. 15-45 min (API-bound)
Phase C: Synthesis .............. 10-20 min
Phase D: Experiment Design ...... 10-30 min
Phase E: Experiment Execution ... 30-180 min (compute-bound)
Phase F: Analysis + Decision .... 10-20 min
Phase G: Paper Writing .......... 30-60 min
Phase H: Finalization ........... 15-30 min
─────────────────────────────────────────────
Total (no loops) ................ ~2-6 hours
+ REFINE cycles ................. +30-90 min each
+ PIVOT cycles .................. +2-4 hours each

9 Architecture Solution

High-Level Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   AutoResearchClaw Architecture                  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    PIPELINE ORCHESTRATOR                    │  │
│  │  (researchclaw/pipeline/runner.py)                         │  │
│  │                                                            │  │
│  │  ┌──────┐ ┌──────┐ ┌──────┐       ┌──────┐ ┌──────┐     │  │
│  │  │Stg 1 │→│Stg 2 │→│Stg 3 │→ ··· →│Stg 22│→│Stg 23│     │  │
│  │  └──────┘ └──────┘ └──┬───┘       └──────┘ └──────┘     │  │
│  │                       │                                    │  │
│  │              ┌────────┴────────┐                           │  │
│  │              │   GATE STAGES   │                           │  │
│  │              │  5, 9, 20       │                           │  │
│  │              │  (approve/reject│                           │  │
│  │              │   + rollback)   │                           │  │
│  │              └─────────────────┘                           │  │
│  │                                                            │  │
│  │         ┌──────────────────────────────┐                   │  │
│  │         │    DECISION ENGINE (Stg 15)  │                   │  │
│  │         │                              │                   │  │
│  │         │  PROCEED → Stage 16          │                   │  │
│  │         │  REFINE  → Stage 13 (loop)   │                   │  │
│  │         │  PIVOT   → Stage 8 (restart) │                   │  │
│  │         └──────────────────────────────┘                   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │
│  │  MULTI-AGENT     │  │  EXPERIMENT     │  │  KNOWLEDGE     │  │
│  │  SUBSYSTEMS      │  │  SANDBOX        │  │  BASE          │  │
│  │                   │  │                  │  │                │  │
│  │  • CodeAgent     │  │  • AST validate │  │  • Decisions   │  │
│  │  • BenchmarkAgt  │  │  • NaN/Inf trap │  │  • Experiments │  │
│  │  • FigureAgent   │  │  • Self-heal    │  │  • Findings    │  │
│  │  • Debate agents │  │  • Docker/local │  │  • Literature  │  │
│  │                   │  │  • GPU detect   │  │  • Questions   │  │
│  │                   │  │                  │  │  • Reviews     │  │
│  └─────────────────┘  └─────────────────┘  └────────────────┘  │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │
│  │  LLM PROVIDER    │  │  LITERATURE     │  │  METACLAW      │  │
│  │  LAYER            │  │  APIS           │  │  BRIDGE        │  │
│  │                   │  │                  │  │                │  │
│  │  • OpenAI-compat │  │  • OpenAlex     │  │  • Lessons     │  │
│  │  • ACP agents    │  │  • Semantic Sch │  │  • Skills      │  │
│  │  • Retry/fallback│  │  • arXiv        │  │  • Overlay     │  │
│  │  • Budget control│  │  • CrossRef     │  │  • Time-decay  │  │
│  └─────────────────┘  └─────────────────┘  └────────────────┘  │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │
│  │  SENTINEL        │  │  VERIFIED       │  │  EXPORT        │  │
│  │  WATCHDOG        │  │  REGISTRY       │  │  ENGINE        │  │
│  │                   │  │                  │  │                │  │
│  │  • NaN/Inf detect│  │  • Ground-truth │  │  • LaTeX       │  │
│  │  • Consistency   │  │  • Anti-fabr.   │  │  • BibTeX      │  │
│  │  • Citation score│  │  • Experiment   │  │  • NeurIPS/    │  │
│  │  • Anti-fabr.    │  │    diagnosis    │  │    ICML/ICLR   │  │
│  └─────────────────┘  └─────────────────┘  └────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Key Architectural Decisions

| Decision | Rationale |
|---|---|
| Sequential 23-stage pipeline | Deterministic ordering ensures reproducibility and checkpoint/resume capability |
| 3 quality gates | Human-in-the-loop at critical decision points (literature screening, experiment design, final quality) |
| PIVOT/REFINE loops | Allows the system to recover from dead-end research directions autonomously |
| Multi-agent debate | Multiple LLM perspectives reduce single-point-of-failure in reasoning |
| VerifiedRegistry | Prevents the most common failure mode: LLM-fabricated experimental results |
| Pluggable LLM backend | ACP support means the system isn't locked to any single provider |
| MetaClaw bridge | Cross-run learning addresses the "same mistakes" problem |

Control Flow: The PIVOT/REFINE Decision

The most architecturally interesting feature is Stage 15's autonomous research direction control:

                    ┌──────────────────┐
                    │  Stage 14:       │
                    │  Result Analysis │
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │  Stage 15:       │
                    │  RESEARCH        │
                    │  DECISION        │
                    └───────┬──────────┘
                            │
              ┌─────────────┼──────────────┐
              │             │              │
              ▼             ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ PROCEED  │  │ REFINE   │  │ PIVOT    │
        │          │  │          │  │          │
        │ → Stg 16 │  │ → Stg 13 │  │ → Stg 8 │
        │ (write)  │  │ (iterate)│  │ (restart)│
        └──────────┘  └──────────┘  └──────────┘

PROCEED: Results are satisfactory → move to paper writing
REFINE:  Results need parameter tuning → go back to iterative refinement
PIVOT:   Hypothesis is wrong → generate new hypotheses and restart experiments

Artifacts are auto-versioned across loops to prevent data loss.
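
The three-way branch above amounts to a small state machine. The stage numbers come from the pipeline diagram; in the real system, the decision itself is made by the LLM at Stage 15, so the mapping below is just the control-flow skeleton.

```python
# Decision -> next pipeline stage, per the control-flow diagram.
NEXT_STAGE = {
    "PROCEED": 16,  # paper outline (write)
    "REFINE": 13,   # iterative refinement (iterate)
    "PIVOT": 8,     # hypothesis generation (restart)
}

def next_stage(decision: str) -> int:
    """Resolve a Stage 15 decision to the stage that runs next."""
    try:
        return NEXT_STAGE[decision]
    except KeyError:
        raise ValueError(f"unknown decision: {decision!r}")
```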

10 Component Breakdown

Component 1: Pipeline Orchestrator

Location: researchclaw/pipeline/

The orchestrator manages the sequential execution of 23 stages with:

  • Checkpoint/resume: Pipeline state saved after each stage; --resume flag auto-detects
  • Gate management: Stages 5, 9, 20 pause for human approval (or auto-approve)
  • Loop control: REFINE → Stage 13, PIVOT → Stage 8
  • Error recovery: Stage failures trigger retry with configurable limits
  • Artifact versioning: Each loop iteration creates versioned artifact snapshots

Component 2: Literature Discovery Engine

Stages: 3–6

Multi-source literature search with real academic APIs:

Query Expansion (Stage 3):
  Research topic → multiple search queries
                → domain-specific terminology
                → synonym expansion

Literature Collection (Stage 4):
  ┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
  │  OpenAlex   │     │ Semantic Scholar │     │   arXiv     │
  │  API        │ ──▶ │  API             │ ──▶ │   API       │
  └─────────────┘     └──────────────────┘     └─────────────┘
         │                    │                       │
         └────────────────────┼───────────────────────┘
                              ▼
                    ┌──────────────────┐
                    │  Deduplication   │
                    │  Circuit Breaker │
                    │  Rate Limiting   │
                    └──────────────────┘

Literature Screening (Stage 5 — GATE):
  LLM relevance scoring → human approval → filtered set

Knowledge Extraction (Stage 6):
  Per-paper → structured knowledge cards
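
The deduplication step in the collection diagram can be sketched as merging the three API result sets on DOI when present, else on a normalized title. The field names and normalization rule below are assumptions, not the project's actual implementation.

```python
import re

def dedupe(papers: list[dict]) -> list[dict]:
    """Drop duplicate hits across OpenAlex/Semantic Scholar/arXiv.

    Keyed by DOI when available, otherwise by a punctuation- and
    case-insensitive form of the title (assumed schema).
    """
    seen, unique = set(), []
    for p in papers:
        key = (p.get("doi")
               or re.sub(r"[^a-z0-9]", "", p["title"].lower()))
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

hits = [
    {"title": "FlashAttention: Fast and Memory-Efficient Exact Attention", "doi": None},
    {"title": "flashattention, fast and memory-efficient exact attention", "doi": None},
]
merged = dedupe(hits)  # the two variant titles collapse to one entry
```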

Component 3: Multi-Agent Debate System

Used in: Stages 8 (hypothesis gen), 14 (result analysis), 18 (peer review)

The debate system uses multiple LLM "perspectives" to reduce single-point reasoning failures:

  • Hypothesis generation: Multiple agents propose competing hypotheses; structured debate narrows to testable set
  • Result analysis: Independent agents analyze experimental results from different angles; consensus synthesis
  • Peer review: Agents review the draft paper with explicit methodology-evidence consistency checks

Component 4: CodeAgent v2

Location: researchclaw/agents/code_agent/

Multi-phase code generation system replacing simple single-prompt code generation:

| Phase | Description |
|---|---|
| Architecture Planning | Deep implementation blueprint before coding |
| Sequential Generation | Files generated one-by-one following dependency DAG |
| Hard Validation | AST-based gates blocking identical ablations, hardcoded metrics |
| Execution-in-the-Loop | Fix attempts based on actual execution errors |

Configuration:
code_agent:
  enabled: true
  architecture_planning: true
  sequential_generation: true
  hard_validation: true
  hard_validation_max_repairs: 2
  exec_fix_max_iterations: 3
  exec_fix_timeout_sec: 60
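
One of the "hard validation" gates, detecting hardcoded metrics, can be sketched with the standard ast module: flag any literal assigned directly to a metric-like name, since a genuine metric should come from computation. The name list and rule are illustrative, not CodeAgent's actual checks.

```python
import ast

# Illustrative set of metric-like names to guard (assumption).
METRIC_NAMES = {"accuracy", "f1_score", "loss", "auc"}

def hardcoded_metrics(source: str) -> list[str]:
    """Return metric names assigned a bare literal constant."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for tgt in node.targets:
                if (isinstance(tgt, ast.Name)
                        and tgt.id in METRIC_NAMES
                        and isinstance(node.value, ast.Constant)):
                    flagged.append(tgt.id)
    return flagged

fabricated = hardcoded_metrics("accuracy = 0.95")        # flagged
computed = hardcoded_metrics("accuracy = evaluate(m)")   # not flagged
```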

Component 5: BenchmarkAgent

Location: researchclaw/agents/benchmark_agent/

4-agent pipeline for automated dataset and baseline selection:

Surveyor → Selector → Acquirer → Validator
   │           │          │          │
   ▼           ▼          ▼          ▼
 Search     Rank &      Download   Validate
 HuggingFace  select     datasets   integrity
 + Scholar   by tier     + cache    + format

Agents: surveyor.py, selector.py, acquirer.py, validator.py

Configuration:

benchmark_agent:
  enabled: true
  enable_hf_search: true
  enable_web_search: true
  tier_limit: 2         # 1=small, 2=medium, 3=large
  min_benchmarks: 1
  min_baselines: 2

Component 6: FigureAgent

Location: researchclaw/agents/figure_agent/

5-agent pipeline for academic figure generation:

Planner → CodeGen → Renderer → Critic → Integrator
   │          │         │          │          │
   ▼          ▼         ▼          ▼          ▼
 Plan       Generate  Execute    Critique   Place in
 figures    matplotlib render     quality    paper
 needed     code       to PNG    + iterate  + caption

Configuration:

figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 8
  max_iterations: 3     # Critic-driven refinement
  dpi: 300
  strict_mode: false

Component 7: Sentinel Watchdog

Purpose: Background quality monitor running continuously during pipeline execution.

Monitors for:

  • NaN/Inf detection: Catches numerical instabilities in experiment results
  • Paper-evidence consistency: Verifies claims match experimental data
  • Citation relevance scoring: Scores how relevant each citation is to the paper
  • Anti-fabrication guard: Flags numbers that don't appear in VerifiedRegistry

Component 8: Export Engine

Location: researchclaw/pipeline/ (stages 22-23)

Converts Markdown paper to conference-ready LaTeX:

| Template | Target |
|---|---|
| neurips_2025 | NeurIPS 2025 |
| iclr_2026 | ICLR 2026 |
| icml_2026 | ICML 2026 |

Handles: math expressions, tables, figures, cross-references, \cite{} commands, auto-pruned BibTeX.


11 Core Mechanisms (Detailed)

11.1 The PIVOT/REFINE Decision Engine

Stage 15 is the pipeline's most complex decision point. The LLM analyzes experimental results and makes a three-way decision:

PROCEED criteria:

  • Results support the hypothesis
  • Statistical significance achieved
  • Sufficient experimental coverage
  • Results are novel relative to baselines

REFINE criteria:

  • Results partially support hypothesis but need parameter tuning
  • Some experimental conditions failed
  • Metrics are close to significance threshold
  • Additional iterations likely to help

PIVOT criteria:

  • Results contradict the hypothesis
  • Fundamental methodology issue identified
  • Results are not distinguishable from baselines
  • A different research direction is more promising

When PIVOT is triggered:

  1. Current results are archived with version tag
  2. Pipeline jumps back to Stage 8 (hypothesis generation)
  3. Previous failed hypotheses are provided as negative context
  4. New hypotheses must differ from all previous attempts
  5. The cycle continues with fresh experiment design

This creates a closed-loop research process that can autonomously recover from dead-end research directions — a capability absent from most competing systems.

11.2 Anti-Fabrication System

The VerifiedRegistry is a defense against the most dangerous failure mode of LLM-generated research papers: fabricated experimental results.

Problem: LLMs can generate plausible-looking experimental results that have no basis in actual computation. This is the single biggest threat to the credibility of AI-generated research.

Solution architecture:

┌────────────────────────────────────────────────────────┐
│                  VerifiedRegistry                       │
│                                                         │
│  experiment_id → {                                      │
│    conditions: [...],                                   │
│    metrics: {                                           │
│      "accuracy": 0.847,    ← from actual execution     │
│      "f1_score": 0.812,    ← from actual execution     │
│      "train_time": 142.3   ← from actual execution     │
│    },                                                   │
│    execution_log: "...",                                │
│    timestamp: "2026-03-28T14:32:00Z"                   │
│  }                                                      │
│                                                         │
│  Enforcement:                                           │
│  • Paper writing stage queries registry                 │
│  • Only registry-verified numbers may appear in text    │
│  • Unverified claims → sanitized or flagged             │
│  • Tables must reference experiment_ids                 │
│                                                         │
│  Repair loop (if experiments failed):                   │
│  1. Diagnose failure cause                              │
│  2. Generate repair code (via OpenCode)                 │
│  3. Re-execute (up to max_cycles=3)                     │
│  4. If min_completion_rate not met → degrade gracefully │
└────────────────────────────────────────────────────────┘
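
The registry contract in the diagram can be sketched in a few lines: only values that were recorded from an actual execution may later be cited. The class and method names are assumptions; the real VerifiedRegistry presumably also stores conditions, logs, and timestamps.

```python
class VerifiedRegistry:
    """Minimal sketch: metrics must be recorded before being cited."""

    def __init__(self) -> None:
        self._metrics: dict[tuple[str, str], float] = {}

    def record(self, experiment_id: str, metrics: dict[str, float]) -> None:
        """Index metrics produced by an actual execution."""
        for name, value in metrics.items():
            self._metrics[(experiment_id, name)] = value

    def is_verified(self, experiment_id: str, name: str,
                    value: float, tol: float = 1e-9) -> bool:
        """May this number appear in the paper text?"""
        stored = self._metrics.get((experiment_id, name))
        return stored is not None and abs(stored - value) <= tol

reg = VerifiedRegistry()
reg.record("exp_001", {"accuracy": 0.847})
```

A paper-writing stage would then sanitize any number for which is_verified returns False.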

11.3 Four-Layer Citation Verification

Academic credibility requires real references. AutoResearchClaw implements 4 verification layers:

Layer 1: arXiv ID Check
  └─ If citation claims arXiv ID → verify it exists via arXiv API
  └─ If ID is invalid → remove citation

Layer 2: CrossRef / DataCite DOI
  └─ Verify DOI resolves to a real publication
  └─ Check metadata (title, authors, year) matches claim

Layer 3: Semantic Scholar Title Match
  └─ Search paper title in Semantic Scholar
  └─ Fuzzy match to handle minor title variations
  └─ Verify authors and venue match

Layer 4: LLM Relevance Scoring
  └─ Even if citation is real, is it relevant to the paper?
  └─ Score relevance (0-1)
  └─ Remove citations below threshold

Result: Only citations that are (a) real and (b) relevant survive.

This addresses a critical weakness of AI Scientist and similar systems where hallucinated references undermined paper credibility.
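
Layer 3's fuzzy title match can be sketched with difflib as a stand-in for whatever matcher the project actually uses; the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def titles_match(claimed: str, found: str,
                 threshold: float = 0.9) -> bool:
    """Fuzzy, case-insensitive title comparison (Layer 3 sketch)."""
    ratio = SequenceMatcher(None, claimed.lower(), found.lower()).ratio()
    return ratio >= threshold

same = titles_match("Attention Is All You Need",
                    "attention is all you need")
diff = titles_match("Attention Is All You Need",
                    "Deep Residual Learning for Image Recognition")
```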

11.4 Hardware-Aware Experiment Adaptation

The system auto-detects available hardware and adapts experiments accordingly:

Hardware Detection:
  ┌─────────────┐
  │ NVIDIA GPU  │ → CUDA mode → full-scale training
  │ (detected)  │   PyTorch CUDA, large batch sizes
  └─────────────┘

  ┌─────────────┐
  │ Apple MPS   │ → MPS mode → adapted scale
  │ (detected)  │   PyTorch MPS, reduced batch sizes
  └─────────────┘

  ┌─────────────┐
  │ CPU only    │ → CPU mode → minimal experiments
  │ (fallback)  │   Small models, few epochs, sklearn focus
  └─────────────┘

Code generation adapts: imports, model sizes, batch sizes, training epochs, and package selection are all adjusted based on detected hardware tier.
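
The detection tiering can be sketched with PyTorch's standard availability checks, falling back to CPU when torch is absent or reports no accelerator; only the function name is an assumption here.

```python
def detect_device() -> str:
    """Return the best available hardware tier: cuda > mps > cpu."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # no torch installed: CPU-only, sklearn-focused tier
    return "cpu"

device = detect_device()
```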

11.5 Experiment Sandbox Execution

The sandbox provides safe, reproducible experiment execution:

AST validation (pre-execution): - Parse generated code as Python AST - Check for prohibited constructs (network access, file system escape) - Verify import whitelist compliance - Block identical ablations (same code, different variable names) - Detect hardcoded metrics (fabrication attempt)

Execution guardrails:
  - Memory limit (configurable, default 4096 MB)
  - Time budget (configurable, default 300 s)
  - NaN/Inf fast-fail: detect and abort on numerical instabilities
  - Partial-result capture: save intermediate results even on failure
  - Self-healing: on failure, diagnose the error → generate a fix → retry (up to 10 rounds)

Docker mode adds:
  - Network policy (none, setup_only, pip_only, full)
  - Container isolation
  - Auto-installed dependencies (detect imports → generate requirements.txt)
  - GPU passthrough
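A minimal sketch of the import-whitelist and blocked-call checks using the standard-library ast module. The whitelist and blocked set here are invented for illustration; the project's actual rules are more extensive.

```python
import ast

ALLOWED_IMPORTS = {"numpy", "sklearn", "torch", "math", "json"}  # illustrative
BLOCKED_CALLS = {"eval", "exec", "open"}                          # illustrative

def validate_code(source: str):
    """Return a list of violations found in generated code, or []
    if the code passes the (sketched) pre-execution checks."""
    errors = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root not in ALLOWED_IMPORTS:
                    errors.append(f"import not whitelisted: {root}")
        elif isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root not in ALLOWED_IMPORTS:
                errors.append(f"import not whitelisted: {root}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                errors.append(f"blocked call: {node.func.id}")
    return errors
```

Because the check runs on the AST before execution, dangerous code is rejected without ever being run, which is what makes it usable as a pre-execution gate.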

11.6 Skills System

AutoResearchClaw implements a skills system inspired by Claude Code's SKILL.md format:

Skills Loading:
  1. Built-in skills (19 shipped) ← researchclaw package
  2. Project-local skills        ← .claude/skills/ directory
  3. User-installed skills       ← researchclaw skills install
  4. Team-shared skills          ← custom_dirs in config
  5. Community skills            ← K-Dense-AI/claude-scientific-skills (150+)

Each skill is a SKILL.md file with YAML frontmatter:

---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]

Skills are loaded and injected into LLM prompts automatically at applicable stages. This enables domain-specific expertise without modifying the core pipeline.

Notable built-in skills:
  - scientific-writing — IMRAD structure, citation formatting
  - chemistry-rdkit — molecular analysis, SMILES, drug discovery
  - literature-search — systematic review, PRISMA methodology
  - hypothesis-formulation — testable hypothesis construction
  - statistical-reporting — statistical analysis and reporting standards
  - a-evolve — agentic evolution (community-contributed from A-Evolve)
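Loading a skill amounts to splitting the frontmatter from the body and checking whether the skill applies at the current stage. The sketch below handles only the flat key: value subset of YAML shown in the example above; a real loader would use a YAML parser, and these function names are illustrative.

```python
def parse_skill(text: str):
    """Split a SKILL.md file into (metadata dict, instruction body)."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

def skill_applies(meta, stage: int) -> bool:
    """True if the skill is enabled and lists this pipeline stage
    in its applicable-stages field (stored as e.g. '[16, 17, 19]')."""
    stages = meta.get("applicable-stages", "[]").strip("[]")
    applicable = {int(s) for s in stages.split(",") if s.strip()}
    return meta.get("enabled", "true") == "true" and stage in applicable
```

At each applicable stage, the body of every matching skill would then be appended to the stage's LLM prompt.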


12 Programming Language

System Implementation

| Component             | Language         | Framework                        |
|-----------------------|------------------|----------------------------------|
| Pipeline orchestrator | Python           | Custom framework                 |
| Stage implementations | Python           |                                  |
| Agent subsystems      | Python           | Custom agent base class          |
| CLI interface         | Python           | Click (via researchclaw command) |
| Configuration         | YAML             | Parsed by Python                 |
| Prompts               | YAML             | prompts.default.yaml             |
| Generated experiments | Python           | PyTorch, scikit-learn, etc.      |
| Generated papers      | Markdown → LaTeX | Jinja2 templates                 |

Codebase Structure

AutoResearchClaw/
├── researchclaw/                  # Main package
│   ├── __init__.py
│   ├── __main__.py
│   ├── adapters.py                # LLM provider adapters
│   ├── agents/                    # Multi-agent subsystems
│   │   ├── base.py                # BaseAgent ABC
│   │   ├── benchmark_agent/       # 4-agent benchmark pipeline
│   │   │   ├── surveyor.py
│   │   │   ├── selector.py
│   │   │   ├── acquirer.py
│   │   │   └── validator.py
│   │   ├── code_agent/            # Multi-phase code generation
│   │   │   ├── architect.py
│   │   │   ├── builder.py
│   │   │   └── validator.py
│   │   ├── code_searcher/         # Code search agent
│   │   ├── debate/                # Multi-agent debate
│   │   └── figure_agent/          # 5-agent figure pipeline
│   │       ├── planner.py
│   │       ├── codegen.py
│   │       ├── renderer.py
│   │       ├── critic.py
│   │       └── integrator.py
│   ├── cli/                       # CLI entry points
│   ├── config/                    # Configuration management
│   ├── knowledge/                 # Knowledge base
│   ├── literature/                # Literature search APIs
│   ├── pipeline/                  # Pipeline orchestrator
│   │   ├── runner.py              # Main pipeline runner
│   │   ├── stages/                # Individual stage implementations
│   │   └── checkpoint.py          # State persistence
│   ├── sandbox/                   # Experiment execution
│   ├── sentinel/                  # Quality watchdog
│   ├── skills/                    # Skills management
│   ├── templates/                 # LaTeX templates
│   └── verification/              # Citation verification
├── .claude/skills/                # Built-in SKILL.md files
├── config.researchclaw.example.yaml
├── prompts.default.yaml           # Default LLM prompts
├── pyproject.toml
├── docs/                          # Documentation
└── tests/                         # Test suite (1,823 tests)

Dependencies (from pyproject.toml)

Key dependencies include:
  - LLM integration: openai, httpx (API calls)
  - Literature: requests (academic APIs)
  - LaTeX: jinja2 (template rendering)
  - Data handling: pyyaml; json from the standard library
  - CLI: standard-library argparse or click
  - AST validation: Python ast module (stdlib)


13 Memory Management

Run-Level Memory: Knowledge Base

Every pipeline run builds a structured knowledge base across 6 categories:

| Category    | Contents                                            | Persistence |
|-------------|-----------------------------------------------------|-------------|
| Decisions   | Research direction choices, PIVOT/REFINE rationale  | Per-run     |
| Experiments | Code, configurations, results, failure logs         | Per-run     |
| Findings    | Key results, statistical analyses, insights         | Per-run     |
| Literature  | Paper summaries, knowledge cards, citation metadata | Per-run     |
| Questions   | Open research questions, hypotheses tested          | Per-run     |
| Reviews     | Peer review feedback, revision history              | Per-run     |

Backend options:

knowledge_base:
  backend: "markdown"   # or "obsidian"
  root: "docs/kb"
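With the markdown backend, each entry is plausibly just a file under its category directory. The sketch below shows one way such a backend could write entries; the directory layout and function name are assumptions, not the project's actual file scheme.

```python
from pathlib import Path

def record_entry(root: str, category: str, title: str, body: str) -> Path:
    """Write a knowledge-base entry as a markdown file under its
    category directory (decisions/, experiments/, findings/, ...)."""
    cat_dir = Path(root) / category
    cat_dir.mkdir(parents=True, exist_ok=True)
    slug = title.lower().replace(" ", "-")
    path = cat_dir / f"{slug}.md"
    path.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
    return path
```

A plain-files layout like this is what makes the "obsidian" backend option natural: an Obsidian vault is itself a directory of markdown files.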

Cross-Run Memory: MetaClaw

MetaClaw provides persistent cross-run learning through a lesson → skill pipeline:

Run N:
  Pipeline executes → failures/warnings captured as Lessons
  Lesson structure:
    - stage: which pipeline stage
    - severity: warning | error | critical
    - category: code_gen | experiment | literature | writing
    - description: what went wrong
    - resolution: how it was fixed (if auto-resolved)
    - timestamp: when it occurred

                    ↓ MetaClaw Processing ↓

  Lesson → Skill conversion:
    - Filter by min_severity (default: warning)
    - Extract actionable pattern
    - Generate SKILL.md with prevention instructions
    - Store as arc-* skill in ~/.metaclaw/skills/
    - Max skills_per_run: 3

                    ↓ Next Run ↓

Run N+1:
  build_overlay() at pipeline start:
    - Load all arc-* skills from ~/.metaclaw/skills/
    - Apply 30-day time-decay weighting
    - Inject relevant skills into every stage's LLM prompt
    - LLM avoids known pitfalls → fewer retries
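The lesson-to-skill conversion step can be sketched as a filter, sort, and render pass. The Lesson fields mirror the structure listed above; the severity ranking, naming convention, and SKILL.md template here are illustrative assumptions.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"warning": 0, "error": 1, "critical": 2}

@dataclass
class Lesson:
    stage: int
    severity: str      # warning | error | critical
    category: str      # code_gen | experiment | literature | writing
    description: str
    resolution: str = ""

def lessons_to_skills(lessons, min_severity="warning", skills_per_run=3):
    """Filter lessons by severity, keep the most severe ones up to
    skills_per_run, and render each as a minimal arc-* SKILL.md string."""
    floor = SEVERITY_RANK[min_severity]
    eligible = [l for l in lessons if SEVERITY_RANK[l.severity] >= floor]
    eligible.sort(key=lambda l: SEVERITY_RANK[l.severity], reverse=True)
    skills = []
    for l in eligible[:skills_per_run]:
        name = f"arc-{l.category}-stage{l.stage}"
        skills.append(f"---\nname: {name}\n---\n"
                      f"Avoid: {l.description}\nFix: {l.resolution}")
    return skills
```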

Stage-Level Memory: Context Accumulation

Within a single run, each stage has access to:

  1. Pipeline state: All outputs from all completed stages
  2. Knowledge base: Accumulated findings, decisions, literature
  3. Experiment history: All previous experimental results
  4. Artifact versions: All versions from REFINE/PIVOT loops

When using ACP mode, the agent CLI maintains full conversation history across all 23 stages, providing maximum context continuity.

Memory Isolation

| Scope                | Persistence         | Sharing                |
|----------------------|---------------------|------------------------|
| Within-stage         | Ephemeral           | Stage-internal only    |
| Within-run           | Run directory       | All stages in same run |
| Cross-run (MetaClaw) | ~/.metaclaw/skills/ | All future runs        |
| Cross-project        | Not implemented     |                        |

14 Continued Learning

Self-Learning via MetaClaw Integration

AutoResearchClaw's most distinctive continued learning mechanism is the MetaClaw bridge:

Learning Loop:

  Run 1: First execution
    • Encounters a Stage 12 timeout (experiment too slow)
    • Auto-repairs by reducing the batch size
    • Lesson captured: "reduce batch size on timeout"
                        ↓
  MetaClaw converts the lesson → arc-experiment-timeout SKILL.md
                        ↓
  Run 2: Second execution (different topic)
    • Skill injected into the Stage 12 prompt
    • LLM proactively uses a smaller batch size
    • No timeout → no retry needed
    • Robustness improved

Time-decay: Skills have a 30-day decay period, preventing stale lessons from dominating. Recent lessons are weighted more heavily than old ones.
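One plausible reading of the 30-day decay rule is an exponential weight over skill age; the exact formula used by MetaClaw is not documented here, so this is an assumed scheme for illustration.

```python
import math

def skill_weight(age_days: float, decay_days: float = 30.0) -> float:
    """Exponentially down-weight a skill as it ages, so lessons
    from ~30 days ago carry far less influence than recent ones."""
    return math.exp(-age_days / decay_days)
```

Under this scheme a fresh skill has weight 1.0, a 30-day-old skill about 0.37, and a 90-day-old skill about 0.05, so stale lessons fade without being deleted outright.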

Measured impact (from controlled A/B experiments):
  - Stage retry rate: −24.8%
  - Refine cycle count: −40.0%
  - Pipeline completion: +5.3%
  - Overall robustness: +18.3%

Within-Run Iterative Refinement

The REFINE loop (Stage 15 → Stage 13) provides within-run learning:

  • Each REFINE cycle builds on previous results
  • Parameter adjustments are informed by all prior iterations
  • Up to max_iterations (default: 10) refinement cycles
  • Artifacts are versioned to track improvement trajectory
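The REFINE cycle reduces to a bounded loop in which a judge either accepts the results, requests another pass with feedback, or pivots. The sketch below uses stand-in callables for the experiment and decision stages; the control flow, not the names, is the point.

```python
def refine_loop(run_experiments, judge, max_iterations=10):
    """Iterate Stage 13→15: run experiments, judge the results, and
    stop on PROCEED or PIVOT. Every iteration's results are kept,
    mirroring artifact versioning across refinement cycles."""
    feedback, history = None, []
    for _ in range(max_iterations):
        results = run_experiments(feedback)
        history.append(results)                 # versioned artifacts
        decision, feedback = judge(results, history)
        if decision in ("PROCEED", "PIVOT"):
            return decision, history
    return "PROCEED", history                   # budget exhausted: move on
```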

Experiment Self-Healing

The sandbox executor implements a diagnosis-repair loop:

Execute code → failure detected
                     ↓
         Diagnose error type:
           • Import error → add dependency
           • Runtime error → modify code
           • Timeout → reduce scale
           • NaN/Inf → add numerical guards
                     ↓
         Generate repair (LLM or OpenCode)
                     ↓
         Re-execute (up to exec_fix_max_iterations=3)
                     ↓
         If still failing → capture partial results → degrade gracefully
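The diagnose→fix→retry loop above can be sketched as exception-type dispatch. The `execute`/`repair` callables are stand-ins (in the real system the repair comes from an LLM or OpenCode); the retry bound mirrors the exec_fix_max_iterations=3 setting.

```python
def self_heal(execute, repair, max_fix_iterations=3):
    """Run `execute`; on failure, classify the error, ask `repair`
    for a patched callable, and retry. After the last attempt,
    degrade gracefully instead of raising."""
    for attempt in range(max_fix_iterations + 1):
        try:
            return {"ok": True, "result": execute()}
        except Exception as exc:                     # diagnose error type
            kind = ("import" if isinstance(exc, ImportError) else
                    "timeout" if isinstance(exc, TimeoutError) else
                    "runtime")
            if attempt == max_fix_iterations:
                return {"ok": False, "error": kind}  # capture what we can
            execute = repair(kind, exc)              # generate a fix, retry
```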

Skills Library as Accumulated Knowledge

The skills system functions as a growing knowledge base:

  • Built-in skills (19): Curated by the development team
  • Community skills (150+ via K-Dense-AI): Crowdsourced scientific knowledge
  • MetaClaw-generated skills: Automatically created from pipeline failures
  • Custom skills: User/team-specific knowledge

Over time, a research group's skill library accumulates domain-specific knowledge that makes the pipeline increasingly effective for their particular research area.

Process Reward Model (PRM) — Optional Quality Gate

MetaClaw optionally integrates a Process Reward Model for quality gating:

metaclaw_bridge:
  prm:
    enabled: false           # Opt-in
    model: "gpt-5.4"        # PRM judge model
    votes: 3                 # Majority vote
    gate_stages: [5, 9, 15, 20]

When enabled, an LLM-as-judge evaluates stage outputs and blocks low-quality results from proceeding — adding another layer of quality control beyond the standard gate stages.
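The vote-and-gate logic reduces to a strict-majority check over repeated judge calls. This is a sketch under the config above; `judge` stands in for the LLM-as-judge call, and the function name is illustrative.

```python
def prm_gate(output, judge, votes=3, stage=15, gate_stages=(5, 9, 15, 20)):
    """Poll the judge `votes` times and pass the stage output only
    on a strict majority of approvals. Stages outside gate_stages
    pass through ungated."""
    if stage not in gate_stages:
        return True                       # no PRM gate at this stage
    approvals = sum(bool(judge(output)) for _ in range(votes))
    return approvals * 2 > votes          # strict majority
```

With votes=3, a 2-of-3 approval passes; an odd vote count avoids ties.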


15 Applications

Primary Application: Automated Paper Generation

The primary use case is generating conference-ready academic papers from a research topic:

researchclaw run --topic "Investigating the role of attention sparsity \
  in reducing transformer inference cost" --auto-approve

Output: Complete paper with real literature, executed experiments, verified results, peer review, and LaTeX export.

Research Workflow Integration

| Use Case            | Description                                               | Configuration           |
|---------------------|-----------------------------------------------------------|-------------------------|
| Literature review   | Run phases A-C only for a systematic literature review    | Stop at Stage 7         |
| Experiment design   | Run phases A-D for designed experiments without execution | Stop at Stage 11        |
| Full autonomy       | Complete pipeline with --auto-approve                     | Default                 |
| Supervised research | Pipeline pauses at 3 gates for human review               | Without --auto-approve  |
| Chat-driven         | Via OpenClaw: "Research X" in Discord/Telegram/Slack      | OpenClaw bridge enabled |

Platform Integration

AutoResearchClaw supports deployment across multiple interfaces:

┌──────────────────────────────────────────────────────┐
│                    USER INTERFACES                     │
│                                                        │
│  CLI              OpenClaw           Python API        │
│  researchclaw     "Research X"       Runner(config)   │
│  run --topic      via Discord/       .run()           │
│  "..."            Telegram/etc.                       │
│                                                        │
│  Claude Code      Copilot CLI       Any AI CLI        │
│  "Run research    researchclaw      Provide           │
│   on [topic]"     run --topic       AGENTS.md         │
│                   via ACP           as context         │
└──────────────────────────────────────────────────────┘

Target Users

| User Type               | Value Proposition                                                      |
|-------------------------|------------------------------------------------------------------------|
| PhD students            | Rapid prototyping of research directions; literature review automation |
| Research labs           | High-throughput hypothesis testing across multiple topics              |
| Industry R&D            | Quick feasibility studies and literature surveys                       |
| Interdisciplinary teams | Domain-agnostic pipeline works across fields                           |

Limitations

| Limitation            | Impact                                                    | Mitigation                                            |
|-----------------------|-----------------------------------------------------------|-------------------------------------------------------|
| Paper quality ceiling | Not yet at top-venue acceptance quality                   | Multi-agent review + MetaClaw improves over time      |
| Experiment scale      | Sandbox experiments are small-scale                       | Docker mode + SSH remote for larger experiments       |
| Domain expertise      | No deep domain knowledge beyond LLM training data         | Skills system adds domain knowledge; community skills |
| Fabrication risk      | Despite VerifiedRegistry, subtle fabrication possible     | Sentinel watchdog + human review at gates             |
| LLM cost              | Full runs cost $15–50 with GPT-4o                         | Fallback models, budget control in config             |
| Novelty assessment    | Cannot reliably assess if research is truly novel         | Human judgment required for novelty claims            |
| No peer acceptance    | No evidence of generated papers passing real peer review  | Showcase papers are self-evaluated only               |
Comparison with Related Systems

| System                  | Full Pipeline    | Real Literature       | Experiments    | Self-Learning         | Open Source  |
|-------------------------|------------------|-----------------------|----------------|-----------------------|--------------|
| AutoResearchClaw        | 23 stages        | OpenAlex + S2 + arXiv | Sandbox/Docker | MetaClaw              | MIT          |
| AI Scientist (Sakana)   | Partial          | No (hallucinated)     | Limited        | No                    | Apache 2.0   |
| AIRA₂ (Meta)            | Experiments only | N/A                   | Full-scale GPU | Within-task evolution | Not released |
| FARS (Analemma)         | Full pipeline    | Unknown               | Unknown        | Unknown               | Proprietary  |
| AutoResearch (Karpathy) | Partial          | Partial               | No             | No                    | Open         |

Connections to OmniEvolve

AutoResearchClaw's architecture maps to several OmniEvolve design patterns:

| AutoResearchClaw Component          | OmniEvolve Equivalent                            |
|-------------------------------------|--------------------------------------------------|
| 23-stage pipeline orchestrator      | omnievolve/orchestrator/ experiment lifecycle    |
| PIVOT/REFINE decision loop          | Adaptive search strategy in omnievolve/search/   |
| MetaClaw cross-run learning         | omnievolve/knowledge/ learning logs and skills   |
| VerifiedRegistry anti-fabrication   | omnievolve/evaluation/ cascade evaluator integrity |
| Multi-agent debate                  | Multi-operator mutation in omnievolve/mutation/  |
| Skills system                       | omnievolve/plugins/ plugin discovery             |
| Sentinel watchdog                   | omnievolve/safety/ audit and policy enforcement  |
| CodeAgent/BenchmarkAgent/FigureAgent | Specialized omnievolve/mutation/ operators      |
| Experiment sandbox                  | omnievolve/safety/ sandbox execution             |
| Knowledge base                      | omnievolve/knowledge/ structured knowledge storage |

The PIVOT/REFINE mechanism is particularly relevant to OmniEvolve as it represents a form of adaptive search where the system autonomously decides when to refine (exploit) vs. when to restart (explore) — a fundamental exploration-exploitation decision analogous to island-based search with restart policies.


References

  • Liu, J., et al. (2026). "AutoResearchClaw: Fully Autonomous Research from Idea to Paper." GitHub repository, aiming-lab/AutoResearchClaw.
  • Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI. arXiv:2408.06292.
  • Karpathy, A. (2025). "AutoResearch." GitHub repository.
  • Analemma (2025). "FARS: Fully Automated Research System." Blog post.
  • MetaClaw: github.com/aiming-lab/MetaClaw
  • OpenClaw: github.com/openclaw/openclaw
  • OpenCode: github.com/anomalyco/opencode
  • A-Evolve: github.com/A-EVO-Lab/a-evolve
  • K-Dense-AI Claude Scientific Skills: github.com/K-Dense-AI/claude-scientific-skills

Classification: Autoresearch — AutoResearchClaw is a fully autonomous AI system that conducts the complete research process from idea to paper, including literature review, hypothesis generation, experiment execution, result analysis, paper writing, and peer review. It is a prototypical autoresearch system — an agent harness that automates the end-to-end scientific research workflow. Its MetaClaw integration adds self-evolving capability, and its PIVOT/REFINE mechanism introduces adaptive search over research directions.