AutoResearchClaw
Fully autonomous 23-stage pipeline that transforms a research idea into a conference-ready paper with real literature, sandboxed experiments, multi-agent peer review, and self-evolving cross-run learning.

Organization: AIMING Lab (UC Santa Cruz, UNC Chapel Hill, Johns Hopkins, UC Davis, et al.)
Published: March 15, 2026 (v0.1.0); actively maintained through v0.3.2+
Type: repo (GitHub: aiming-lab/AutoResearchClaw)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
AutoResearchClaw: Fully Autonomous Research from Idea to Paper
- Repository: github.com/aiming-lab/AutoResearchClaw
- License: MIT
- Stars: ~9,800+ (as of April 2026)
- First release: v0.1.0, March 15, 2026
- Current release: v0.3.2, March 22, 2026
- Tagline: "Chat an Idea. Get a Paper."
- Predecessor lineage: Inspired by AI Scientist (Sakana AI), AutoResearch (Karpathy), FARS (Analemma)
- Companion systems:
- MetaClaw — cross-run learning engine (skill extraction from failures)
- OpenClaw — AI assistant platform (chat interface for pipeline orchestration)
The project name "Claw" references the lobster emoji used throughout the branding, suggesting the system's ability to "grasp" research problems and work through them autonomously.
Version History
| Version | Date | Key Features |
|---|---|---|
| v0.1.0 | Mar 15, 2026 | Initial release: 23-stage pipeline, end-to-end autonomous |
| v0.2.0 | Mar 16, 2026 | CodeAgent, BenchmarkAgent, FigureAgent; Docker sandbox hardening; 4-round quality audit |
| v0.3.0 | Mar 17, 2026 | MetaClaw integration (+18.3% robustness); cross-run learning |
| v0.3.1 | Mar 18, 2026 | OpenCode Beast Mode; Novita AI provider; thread-safety hardening |
| v0.3.2 | Mar 22, 2026 | Cross-platform ACP support; anti-fabrication system; 100+ bug fixes; --resume |
| v0.3.2+ | Mar 30, 2026 | Flexible skill loading; 20 pre-loaded skills; A-Evolve skill |
Notable pace: 6 significant releases in 15 days, suggesting rapid iteration under active development pressure.
2 Authors and Team
| Author | Affiliation |
|---|---|
| Jiaqi Liu | — |
| Peng Xia | — |
| Siwei Han | — |
| Shi Qiu | — |
| Letian Zhang | — |
| Guiming Chen | — |
| Haoqin Tu | — |
| Xinyu Yang | — |
| Jiawei Zhou | — |
| Hongtu Zhu | UNC Chapel Hill |
| Yun Li | — |
| Yuyin Zhou | UC Santa Cruz |
| Zeyu Zheng | — |
| Cihang Xie | UC Santa Cruz |
| Mingyu Ding | Johns Hopkins University |
| Huaxiu Yao | UC Davis (AIMING Lab lead) |
Team composition: Academic research group (AIMING Lab) spanning multiple US universities. Unlike the industrial AIRA₂ team (25 authors at Meta FAIR), this is a more typical academic team producing open-source research infrastructure.
AIMING Lab context: The lab has produced related work including MetaClaw for cross-run learning, suggesting a broader research program around automated scientific discovery.
3 Core Contribution
AutoResearchClaw's core contribution is a complete end-to-end pipeline that autonomously transforms a text research topic into a conference-ready academic paper. Unlike systems that focus on a single phase (literature search, experiment execution, or writing), AutoResearchClaw integrates all phases into a single orchestrated workflow.
The 23-Stage Pipeline
```
Phase A: Research Scoping
   1. TOPIC_INIT
   2. PROBLEM_DECOMPOSE
Phase B: Literature Discovery
   3. SEARCH_STRATEGY
   4. LITERATURE_COLLECT     ← real APIs
   5. LITERATURE_SCREEN      [GATE]
   6. KNOWLEDGE_EXTRACT
Phase C: Knowledge Synthesis
   7. SYNTHESIS
   8. HYPOTHESIS_GEN         ← debate
Phase D: Experiment Design
   9. EXPERIMENT_DESIGN      [GATE]
  10. CODE_GENERATION
  11. RESOURCE_PLANNING
Phase E: Experiment Execution
  12. EXPERIMENT_RUN
  13. ITERATIVE_REFINE       ← self-healing
Phase F: Analysis & Decision
  14. RESULT_ANALYSIS        ← multi-agent
  15. RESEARCH_DECISION      ← PIVOT/REFINE
Phase G: Paper Writing
  16. PAPER_OUTLINE
  17. PAPER_DRAFT
  18. PEER_REVIEW            ← evidence check
  19. PAPER_REVISION
Phase H: Finalization
  20. QUALITY_GATE           [GATE]
  21. KNOWLEDGE_ARCHIVE
  22. EXPORT_PUBLISH         ← LaTeX
  23. CITATION_VERIFY        ← relevance check
```
Five Differentiating Capabilities
| Capability | Description |
|---|---|
| PIVOT/REFINE Loop | Stage 15 autonomously decides: PROCEED (continue), REFINE (tweak parameters → Stage 13), or PIVOT (new research direction → Stage 8). Artifacts auto-versioned across loops. |
| Multi-Agent Debate | Hypothesis generation, result analysis, and peer review each use structured multi-perspective LLM debate rather than single-pass generation. |
| Self-Learning (MetaClaw) | Lessons extracted per run (decision rationale, runtime warnings, metric anomalies) with 30-day time-decay. Future runs avoid past mistakes. |
| Anti-Fabrication System | VerifiedRegistry enforces ground-truth experiment data in papers. Unverified numbers are sanitized. Failed experiments are auto-diagnosed and repaired before writing. |
| Real Citation Verification | 4-layer verification: arXiv ID → CrossRef/DataCite DOI → Semantic Scholar title match → LLM relevance scoring. Hallucinated references automatically removed. |
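The table's "30-day time-decay" on MetaClaw lessons can be sketched as an exponential down-weighting by age. The half-life interpretation, function name, and field layout below are assumptions for illustration, not the actual MetaClaw API.

```python
from datetime import datetime, timezone

# Illustrative sketch: weight a stored lesson by its age, assuming the
# "30-day time-decay" means a 30-day half-life. Names are hypothetical.
HALF_LIFE_DAYS = 30.0

def lesson_weight(recorded_at: datetime, now: datetime) -> float:
    """1.0 when fresh, 0.5 after 30 days, decaying exponentially."""
    age_days = (now - recorded_at).total_seconds() / 86400.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

now = datetime(2026, 4, 1, tzinfo=timezone.utc)
print(round(lesson_weight(datetime(2026, 4, 1, tzinfo=timezone.utc), now), 2))  # 1.0
print(round(lesson_weight(datetime(2026, 3, 2, tzinfo=timezone.utc), now), 2))  # 0.5
```

Under this scheme a month-old lesson contributes half as much as a fresh one when future runs rank which past mistakes to avoid.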
What Makes It Novel in the Autoresearch Landscape
Compared to AI Scientist (Sakana AI) and AIRA₂ (Meta FAIR):
| Dimension | AI Scientist | AIRA₂ | AutoResearchClaw |
|---|---|---|---|
| Output | Paper (limited quality) | Competition solution | Conference-ready paper |
| Experiments | Simulated / toy | Real ML training (Kaggle) | Sandboxed Python (configurable fidelity) |
| Literature | No real retrieval | N/A | Real APIs (OpenAlex, Semantic Scholar, arXiv) |
| Citation integrity | Hallucinated refs common | N/A | 4-layer verification |
| Self-improvement | None | Within-task evolution | Cross-run MetaClaw learning |
| Human-in-loop | None | None | 3 quality gates (optional) |
| Target venue | Workshop-level | N/A (competition) | NeurIPS / ICML / ICLR |
4 Supported Solutions
Research Domains
AutoResearchClaw is domain-agnostic by design. The showcase demonstrates papers across 8 domains:
| Domain | Showcase Paper |
|---|---|
| Mathematics | Random matrix theory |
| Statistics | Weak IV estimators |
| Biology | SIR/SEIR identifiability |
| Computing | Krylov preconditioners |
| NLP | (Token merging — FAME) |
| Reinforcement Learning | LACE exploration |
| Computer Vision | GARD-LoRA |
| Model Compression | CRAFT distillation |
Experiment Execution Modes
| Mode | Description | Use Case |
|---|---|---|
| `simulated` | LLM generates plausible results without execution | Prototyping, low-resource |
| `sandbox` | AST-validated Python in local subprocess | Default; most common |
| `docker` | Hardened Docker containers with network policy | Production; GPU experiments |
| `ssh_remote` | Execution on remote GPU server | Large-scale training |
Experiment Complexity Tiers
The system automatically assesses experiment complexity and routes accordingly:
```
Complexity Assessment:
┌─────────────────────────────────────────────────┐
│ Simple (score < 0.2)                            │
│   → Direct LLM code generation                  │
│                                                 │
│ Medium (0.2 ≤ score < threshold)                │
│   → CodeAgent v2 with architecture planning     │
│                                                 │
│ Complex (score ≥ threshold)                     │
│   → OpenCode Beast Mode                         │
│   → Multi-file projects, custom architectures   │
│   → Training loops + ablation studies           │
└─────────────────────────────────────────────────┘
```
Output Artifacts
A complete run produces:
| Artifact | Format | Description |
|---|---|---|
| `paper_draft.md` | Markdown | Full academic paper (5,000–6,500 words) |
| `paper.tex` | LaTeX | Conference-ready (NeurIPS/ICML/ICLR templates) |
| `references.bib` | BibTeX | Real references, auto-pruned to match inline citations |
| `verification_report.json` | JSON | 4-layer citation integrity + relevance verification |
| `experiment_runs/` | Python + JSON | Generated code + sandbox results + structured metrics |
| `charts/` | PNG/PDF | Auto-generated comparison charts with error bars |
| `reviews.md` | Markdown | Multi-agent peer review with consistency checks |
| `evolution/` | Markdown | Self-learning lessons extracted from the run |
| `deliverables/` | Mixed | All final outputs, compile-ready for Overleaf |
5 LLM Integration
Provider Architecture
AutoResearchClaw supports a pluggable LLM backend through multiple provider types:
| Provider | Configuration | Notes |
|---|---|---|
| OpenAI-compatible | `base_url` + `api_key_env` | Default; works with any OpenAI-compatible API |
| OpenAI | Direct OpenAI API | GPT-4o, GPT-4o-mini |
| OpenRouter | Multi-model routing | Access to many models via single API |
| DeepSeek | DeepSeek API | DeepSeek V3/V3.2 |
| Minimax | Minimax API | — |
| Novita AI | Novita API | Added in v0.3.1 |
| ACP (Agent Client Protocol) | CLI agent delegation | Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI |
ACP: Agent Client Protocol
A distinctive feature is ACP support, where AutoResearchClaw delegates LLM calls to external CLI agents rather than calling APIs directly:
```yaml
llm:
  provider: "acp"
  acp:
    agent: "claude"   # or: codex, gh, gemini, opencode, kimi
    cwd: "."
```
The ACP adapter communicates via acpx, maintaining a single persistent session across all 23 pipeline stages. This means the CLI agent accumulates context about the entire research process, potentially improving coherence across stages.
Model Usage Across Pipeline Stages
The LLM is used in qualitatively different ways across the pipeline:
| Stage Group | LLM Usage Pattern |
|---|---|
| Scoping (1-2) | Single-turn generation: topic decomposition, problem tree |
| Literature (3-6) | Query generation + relevance scoring + knowledge extraction |
| Synthesis (7-8) | Multi-agent debate: structured hypothesis generation |
| Design (9-11) | Architecture planning: implementation blueprint generation |
| Code Gen (10, 13) | Multi-file code generation (CodeAgent or OpenCode Beast Mode) |
| Analysis (14) | Multi-agent debate: structured result interpretation |
| Decision (15) | Reasoning: PROCEED/REFINE/PIVOT with rationale |
| Writing (16-19) | Section-by-section drafting + peer review + revision |
| Quality (20, 23) | Scoring: quality gates + citation relevance |
OpenCode Beast Mode
Complex experiments are automatically routed to OpenCode, an external code generation system:
```yaml
opencode:
  enabled: true
  auto: true
  complexity_threshold: 0.2   # 0.0-1.0
  timeout_sec: 600
  max_retries: 1
```
OpenCode generates multi-file projects with custom architectures, training loops, and ablation studies — going beyond what single-prompt code generation can produce.
Anti-Fabrication Integration
The VerifiedRegistry system enforces that only experimentally verified results appear in the paper:
- Experiments produce structured JSON metrics
- VerifiedRegistry indexes all verified metrics
- During paper writing, only registry-verified numbers can be cited
- Unverified numbers are sanitized (removed or flagged)
- If experiments fail, a diagnosis-and-repair loop attempts to fix them before writing
6 Key Results
Showcase Papers
AutoResearchClaw has produced 8 papers across 8 domains, generated fully autonomously with zero human intervention:
| Paper | Domain | Key Method |
|---|---|---|
| Paper I | Random matrix theory | Mathematical analysis |
| Paper II | Weak IV estimators | Statistical methodology |
| Paper III | SIR/SEIR identifiability | Epidemiological modeling |
| Paper IV | Krylov preconditioners | Numerical computing |
| Paper V | GARD-LoRA | Parameter-efficient fine-tuning |
| Paper VI | LACE exploration | Reinforcement learning |
| Paper VII | FAME token merging | Vision transformer efficiency |
| Paper VIII | CRAFT distillation | Knowledge distillation |
MetaClaw Integration Results
Controlled A/B experiments (same topic, same LLM, same configuration):
| Metric | Baseline | With MetaClaw | Improvement |
|---|---|---|---|
| Stage retry rate | 10.5% | 7.9% | −24.8% |
| Refine cycle count | 2.0 | 1.2 | −40.0% |
| Pipeline stage completion | 18/19 | 19/19 | +5.3% |
| Overall robustness score | 0.714 | 0.845 | +18.3% |
Composite robustness score = weighted average of stage completion (40%), retry reduction (30%), refine cycle efficiency (30%).
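The composite score defined above is a plain weighted average; a minimal sketch (how each component is normalized to [0, 1] is not reported, so the inputs here are illustrative):

```python
# Composite robustness = 40% stage completion + 30% retry reduction
# + 30% refine-cycle efficiency, each assumed pre-normalized to [0, 1].
def robustness(completion: float, retry: float, refine: float) -> float:
    return 0.4 * completion + 0.3 * retry + 0.3 * refine

print(robustness(1.0, 1.0, 1.0))  # 1.0 (all components perfect)
print(robustness(1.0, 0.5, 0.5))  # 0.7
```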
Quality Indicators
| Indicator | Value |
|---|---|
| Paper length | 5,000–6,500 words |
| Target venue quality | NeurIPS / ICML / ICLR format |
| Citation verification | 4-layer (arXiv, CrossRef, DataCite, LLM) |
| Real references | Yes (OpenAlex, Semantic Scholar, arXiv APIs) |
| Hallucinated reference rate | Auto-removed (exact rate not reported) |
| Experiment fidelity | Sandboxed Python with hardware-aware adaptation |
| Peer review | Multi-agent with methodology-evidence consistency checks |
Adoption Metrics
| Metric | Value (as of April 2026) |
|---|---|
| GitHub stars | ~9,800+ |
| Releases | 6 in 15 days |
| Test suite | 1,823 tests passing |
| Community skills | 20 built-in + extensible via SKILL.md |
| Localization | 9 languages (README translations) |
Limitations of Reported Results
- No blind evaluation: Papers not submitted to actual conferences; quality is self-assessed
- No human expert review: Showcase papers not evaluated by domain experts
- No comparison to human baselines: No metric comparing output quality to human-written papers
- Selection bias: Showcase papers may be cherry-picked from many runs
- Experiment fidelity: Sandbox experiments may not match full-scale reproducible research
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Open source | Fully MIT-licensed; complete codebase on GitHub |
| Installation | Single `pip install -e .` followed by `researchclaw setup` |
| Configuration | Comprehensive YAML config with documented defaults |
| Test suite | 1,823 tests passing |
| Documentation | Extensive README, integration guide, tester guide |
| Example config | config.researchclaw.example.yaml with all options |
| Multi-platform | Cross-platform via ACP; not locked to single LLM provider |
| Deterministic pipeline | 23 stages execute in fixed order with checkpoint/resume |
Challenges
| Aspect | Concern |
|---|---|
| LLM non-determinism | Output quality varies with model, temperature, and API version |
| API dependencies | Requires OpenAlex, Semantic Scholar, arXiv APIs (external services) |
| Experiment quality | Sandbox experiments may not reproduce at full scale |
| Cost variability | LLM API costs depend on model choice and generation length |
| Docker setup | Docker mode requires additional infrastructure |
| OpenCode dependency | Beast Mode requires separate OpenCode installation |
| Rapid iteration | 6 releases in 15 days suggests API may be unstable |
Running a Reproduction
```bash
# Minimal reproduction:
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
researchclaw setup
researchclaw init                  # interactive config
export OPENAI_API_KEY="sk-..."
researchclaw run --topic "Your topic" --auto-approve

# Resume interrupted run:
researchclaw run --resume          # auto-detects last checkpoint
```
8 Compute and API Costs
LLM API Costs (Estimated)
A full 23-stage pipeline run involves extensive LLM usage across all stages. Estimated costs by model:
| Model | Estimated Cost per Run | Notes |
|---|---|---|
| GPT-4o | $15–50 | Many stages, multi-agent debate, code generation |
| GPT-4o-mini | $3–10 | Budget option; lower quality |
| Claude 3.5 Sonnet (via ACP) | $10–30 | Using Claude Code as agent |
| DeepSeek V3 | $2–8 | Cost-effective alternative |
| Gemini Pro | $5–20 | Via OpenRouter or direct |
Costs vary significantly based on topic complexity, number of REFINE/PIVOT cycles, experiment iterations, and paper length.
Compute Resources
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU-only mode) | NVIDIA (CUDA) or Apple (MPS) |
| Disk | 5GB | 20GB+ (for experiment artifacts) |
| Network | Required for LLM APIs + literature search | — |
Time Budget
| Configuration | Estimated Duration |
|---|---|
| Simulated experiments, fast model | 30–60 minutes |
| Sandbox experiments, GPT-4o | 2–6 hours |
| Docker + OpenCode Beast Mode | 4–12 hours |
| With PIVOT/REFINE cycles | 6–24 hours |
Per-Stage Time Breakdown (Estimated)
```
Phase A: Scoping ................ 5-10 min
Phase B: Literature ............. 15-45 min (API-bound)
Phase C: Synthesis .............. 10-20 min
Phase D: Experiment Design ...... 10-30 min
Phase E: Experiment Execution ... 30-180 min (compute-bound)
Phase F: Analysis + Decision .... 10-20 min
Phase G: Paper Writing .......... 30-60 min
Phase H: Finalization ........... 15-30 min
─────────────────────────────────────────────
Total (no loops) ................ ~2-6 hours
+ REFINE cycles ................. +30-90 min each
+ PIVOT cycles .................. +2-4 hours each
```
9 Architecture Solution
High-Level Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                  AutoResearchClaw Architecture                  │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                  PIPELINE ORCHESTRATOR                     │ │
│  │             (researchclaw/pipeline/runner.py)              │ │
│  │                                                            │ │
│  │  ┌──────┐ ┌──────┐ ┌──────┐        ┌──────┐ ┌──────┐      │ │
│  │  │Stg 1 │→│Stg 2 │→│Stg 3 │→ ··· →│Stg 22│→│Stg 23│      │ │
│  │  └──────┘ └──────┘ └──┬───┘        └──────┘ └──────┘      │ │
│  │                       │                                    │ │
│  │              ┌────────┴────────┐                           │ │
│  │              │   GATE STAGES   │                           │ │
│  │              │    5, 9, 20     │                           │ │
│  │              │ (approve/reject │                           │ │
│  │              │  + rollback)    │                           │ │
│  │              └─────────────────┘                           │ │
│  │                                                            │ │
│  │  ┌──────────────────────────────┐                          │ │
│  │  │   DECISION ENGINE (Stg 15)   │                          │ │
│  │  │                              │                          │ │
│  │  │  PROCEED → Stage 16          │                          │ │
│  │  │  REFINE  → Stage 13 (loop)   │                          │ │
│  │  │  PIVOT   → Stage 8 (restart) │                          │ │
│  │  └──────────────────────────────┘                          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │  MULTI-AGENT    │ │   EXPERIMENT    │ │   KNOWLEDGE    │     │
│  │  SUBSYSTEMS     │ │    SANDBOX      │ │     BASE       │     │
│  │                 │ │                 │ │                │     │
│  │ • CodeAgent     │ │ • AST validate  │ │ • Decisions    │     │
│  │ • BenchmarkAgt  │ │ • NaN/Inf trap  │ │ • Experiments  │     │
│  │ • FigureAgent   │ │ • Self-heal     │ │ • Findings     │     │
│  │ • Debate agents │ │ • Docker/local  │ │ • Literature   │     │
│  │                 │ │ • GPU detect    │ │ • Questions    │     │
│  │                 │ │                 │ │ • Reviews      │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │  LLM PROVIDER   │ │   LITERATURE    │ │    METACLAW    │     │
│  │     LAYER       │ │      APIS       │ │     BRIDGE     │     │
│  │                 │ │                 │ │                │     │
│  │ • OpenAI-compat │ │ • OpenAlex      │ │ • Lessons      │     │
│  │ • ACP agents    │ │ • Semantic Sch  │ │ • Skills       │     │
│  │ • Retry/fallback│ │ • arXiv         │ │ • Overlay      │     │
│  │ • Budget control│ │ • CrossRef      │ │ • Time-decay   │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │    SENTINEL     │ │    VERIFIED     │ │     EXPORT     │     │
│  │    WATCHDOG     │ │    REGISTRY     │ │     ENGINE     │     │
│  │                 │ │                 │ │                │     │
│  │ • NaN/Inf detect│ │ • Ground-truth  │ │ • LaTeX        │     │
│  │ • Consistency   │ │ • Anti-fabr.    │ │ • BibTeX       │     │
│  │ • Citation score│ │ • Experiment    │ │ • NeurIPS/     │     │
│  │ • Anti-fabr.    │ │   diagnosis     │ │   ICML/ICLR    │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
```
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Sequential 23-stage pipeline | Deterministic ordering ensures reproducibility and checkpoint/resume capability |
| 3 quality gates | Human-in-the-loop at critical decision points (literature screening, experiment design, final quality) |
| PIVOT/REFINE loops | Allows the system to recover from dead-end research directions autonomously |
| Multi-agent debate | Multiple LLM perspectives reduce single-point-of-failure in reasoning |
| VerifiedRegistry | Prevents the most common failure mode: LLM-fabricated experimental results |
| Pluggable LLM backend | ACP support means the system isn't locked to any single provider |
| MetaClaw bridge | Cross-run learning addresses the "same mistakes" problem |
Control Flow: The PIVOT/REFINE Decision
The most architecturally interesting feature is Stage 15's autonomous research direction control:
```
        ┌──────────────────┐
        │    Stage 14:     │
        │ Result Analysis  │
        └────────┬─────────┘
                 │
                 ▼
        ┌──────────────────┐
        │    Stage 15:     │
        │     RESEARCH     │
        │     DECISION     │
        └───────┬──────────┘
                │
   ┌────────────┼────────────┐
   │            │            │
   ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ PROCEED  │ │  REFINE  │ │  PIVOT   │
│          │ │          │ │          │
│ → Stg 16 │ │ → Stg 13 │ │ → Stg 8  │
│ (write)  │ │ (iterate)│ │ (restart)│
└──────────┘ └──────────┘ └──────────┘
```
- PROCEED: Results are satisfactory → move to paper writing
- REFINE: Results need parameter tuning → go back to iterative refinement
- PIVOT: Hypothesis is wrong → generate new hypotheses and restart experiments
Artifacts are auto-versioned across loops to prevent data loss.
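The three-way routing above reduces to a small dispatch table; a minimal sketch (the enum and function names are illustrative, stage numbers come from the pipeline listing):

```python
from enum import Enum

# Sketch of Stage 15's PROCEED/REFINE/PIVOT routing. Names are
# hypothetical; only the stage numbers come from the pipeline spec.
class Decision(Enum):
    PROCEED = "proceed"
    REFINE = "refine"
    PIVOT = "pivot"

NEXT_STAGE = {
    Decision.PROCEED: 16,  # move on to paper writing
    Decision.REFINE: 13,   # loop back to iterative refinement
    Decision.PIVOT: 8,     # restart from hypothesis generation
}

def route(decision: Decision) -> int:
    return NEXT_STAGE[decision]

print(route(Decision.PIVOT))  # 8
```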
10 Component Breakdown
Component 1: Pipeline Orchestrator
Location: researchclaw/pipeline/
The orchestrator manages the sequential execution of 23 stages with:
- Checkpoint/resume: Pipeline state saved after each stage; --resume flag auto-detects
- Gate management: Stages 5, 9, 20 pause for human approval (or auto-approve)
- Loop control: REFINE → Stage 13, PIVOT → Stage 8
- Error recovery: Stage failures trigger retry with configurable limits
- Artifact versioning: Each loop iteration creates versioned artifact snapshots
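The checkpoint/resume behavior can be sketched with a simple JSON state file; the real runner's on-disk format is not documented here, so the file name and schema below are assumptions.

```python
import json
from pathlib import Path

# Minimal sketch of per-stage checkpointing with auto-resume, assuming
# a flat JSON state file. Names and schema are hypothetical.
STATE_FILE = Path("checkpoint.json")

def save_checkpoint(stage: int, artifacts: dict) -> None:
    STATE_FILE.write_text(json.dumps({"last_stage": stage, "artifacts": artifacts}))

def resume_stage() -> int:
    """Return the next stage to run, or 1 if no checkpoint exists."""
    if not STATE_FILE.exists():
        return 1
    return json.loads(STATE_FILE.read_text())["last_stage"] + 1

save_checkpoint(14, {"results": "experiment_runs/v1"})
print(resume_stage())  # 15
STATE_FILE.unlink()    # clean up
```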
Component 2: Literature Discovery Engine
Stages: 3–6
Multi-source literature search with real academic APIs:
```
Query Expansion (Stage 3):
  Research topic → multiple search queries
                 → domain-specific terminology
                 → synonym expansion

Literature Collection (Stage 4):
  ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐
  │  OpenAlex   │   │ Semantic Scholar │   │    arXiv    │
  │     API     │──▶│       API        │──▶│     API     │
  └─────────────┘   └──────────────────┘   └─────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                  ┌──────────────────┐
                  │  Deduplication   │
                  │ Circuit Breaker  │
                  │  Rate Limiting   │
                  └──────────────────┘

Literature Screening (Stage 5 — GATE):
  LLM relevance scoring → human approval → filtered set

Knowledge Extraction (Stage 6):
  Per-paper → structured knowledge cards
```
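Cross-source deduplication of the kind Stage 4 performs can be sketched by keying each record on its DOI when present and a normalized title otherwise; the record field names here are assumptions about the internal format.

```python
# Illustrative dedup for records merged from OpenAlex, Semantic Scholar,
# and arXiv. Field names ("doi", "title") are assumed, not confirmed.
def dedup_key(record: dict) -> str:
    doi = record.get("doi")
    if doi:
        return "doi:" + doi.lower()
    # Fall back to a punctuation/case-insensitive title key.
    return "title:" + "".join(c for c in record["title"].lower() if c.isalnum())

def deduplicate(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

papers = [
    {"title": "Attention Is All You Need", "doi": "10.5555/3295222"},
    {"title": "Attention is all you need!", "doi": "10.5555/3295222"},  # same DOI
    {"title": "Attention Is All You Need"},  # no DOI: title key, kept separately
]
print(len(deduplicate(papers)))  # 2
```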
Component 3: Multi-Agent Debate System
Used in: Stages 8 (hypothesis gen), 14 (result analysis), 18 (peer review)
The debate system uses multiple LLM "perspectives" to reduce single-point reasoning failures:
- Hypothesis generation: Multiple agents propose competing hypotheses; structured debate narrows to testable set
- Result analysis: Independent agents analyze experimental results from different angles; consensus synthesis
- Peer review: Agents review the draft paper with explicit methodology-evidence consistency checks
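Schematically, a debate round collects independent answers and keeps only claims that reach a quorum. The stub agents below stand in for LLM calls; the real protocol is more structured than this majority filter.

```python
from collections import Counter

# Toy consensus step for multi-agent debate. Agents are stub functions
# returning candidate claims; a real system would call an LLM per agent.
def debate(agents, question: str, quorum: int = 2) -> list[str]:
    votes = Counter()
    for agent in agents:
        for claim in agent(question):
            votes[claim] += 1
    return [claim for claim, n in votes.items() if n >= quorum]

a1 = lambda q: ["effect is significant", "baseline is weak"]
a2 = lambda q: ["effect is significant"]
a3 = lambda q: ["baseline is weak", "data is noisy"]
print(debate([a1, a2, a3], "interpret results"))
# ['effect is significant', 'baseline is weak']
```

The single-agent claim "data is noisy" is dropped, which is the sense in which debate reduces single-point reasoning failures.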
Component 4: CodeAgent v2
Location: researchclaw/agents/code_agent/
Multi-phase code generation system replacing simple single-prompt code generation:
| Phase | Description |
|---|---|
| Architecture Planning | Deep implementation blueprint before coding |
| Sequential Generation | Files generated one-by-one following dependency DAG |
| Hard Validation | AST-based gates blocking identical ablations, hardcoded metrics |
| Execution-in-the-Loop | Fix attempts based on actual execution errors |
```yaml
code_agent:
  enabled: true
  architecture_planning: true
  sequential_generation: true
  hard_validation: true
  hard_validation_max_repairs: 2
  exec_fix_max_iterations: 3
  exec_fix_timeout_sec: 60
```
Component 5: BenchmarkAgent
Location: researchclaw/agents/benchmark_agent/
4-agent pipeline for automated dataset and baseline selection:
```
Surveyor   →   Selector   →   Acquirer   →   Validator
   │              │              │              │
   ▼              ▼              ▼              ▼
Search         Rank &        Download       Validate
HuggingFace    select        datasets       integrity
+ Scholar      by tier       + cache        + format
```
Agents: surveyor.py, selector.py, acquirer.py, validator.py
Configuration:
```yaml
benchmark_agent:
  enabled: true
  enable_hf_search: true
  enable_web_search: true
  tier_limit: 2       # 1=small, 2=medium, 3=large
  min_benchmarks: 1
  min_baselines: 2
```
Component 6: FigureAgent
Location: researchclaw/agents/figure_agent/
5-agent pipeline for academic figure generation:
```
Planner  →  CodeGen   →  Renderer  →  Critic    →  Integrator
   │           │            │           │             │
   ▼           ▼            ▼           ▼             ▼
Plan        Generate     Execute     Critique     Place in
figures     matplotlib   render      quality      paper
needed      code         to PNG      + iterate    + caption
```
Configuration:
```yaml
figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 8
  max_iterations: 3   # Critic-driven refinement
  dpi: 300
  strict_mode: false
```
Component 7: Sentinel Watchdog
Purpose: Background quality monitor running continuously during pipeline execution.
Monitors for:
- NaN/Inf detection: Catches numerical instabilities in experiment results
- Paper-evidence consistency: Verifies claims match experimental data
- Citation relevance scoring: Scores how relevant each citation is to the paper
- Anti-fabrication guard: Flags numbers that don't appear in VerifiedRegistry
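The NaN/Inf trap is the simplest of these checks: scan structured metrics and flag any non-finite value before it can propagate into the paper. A minimal sketch, with illustrative function and metric names:

```python
import math

# Sketch of a sentinel-style NaN/Inf scan over structured metrics.
def non_finite_metrics(metrics: dict) -> list[str]:
    return [k for k, v in metrics.items()
            if isinstance(v, float) and not math.isfinite(v)]

print(non_finite_metrics({"loss": float("nan"), "acc": 0.91, "lr": float("inf")}))
# ['loss', 'lr']
```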
Component 8: Export Engine
Location: researchclaw/pipeline/ (stages 22-23)
Converts Markdown paper to conference-ready LaTeX:
| Template | Target |
|---|---|
| `neurips_2025` | NeurIPS 2025 |
| `iclr_2026` | ICLR 2026 |
| `icml_2026` | ICML 2026 |
Handles: math expressions, tables, figures, cross-references, \cite{} commands, auto-pruned BibTeX.
11 Core Mechanisms (Detailed)
11.1 The PIVOT/REFINE Decision Engine
Stage 15 is the pipeline's most complex decision point. The LLM analyzes experimental results and makes a three-way decision:
PROCEED criteria:
- Results support the hypothesis
- Statistical significance achieved
- Sufficient experimental coverage
- Results are novel relative to baselines

REFINE criteria:
- Results partially support hypothesis but need parameter tuning
- Some experimental conditions failed
- Metrics are close to significance threshold
- Additional iterations likely to help

PIVOT criteria:
- Results contradict the hypothesis
- Fundamental methodology issue identified
- Results are not distinguishable from baselines
- A different research direction is more promising

When PIVOT is triggered:
1. Current results are archived with version tag
2. Pipeline jumps back to Stage 8 (hypothesis generation)
3. Previous failed hypotheses are provided as negative context
4. New hypotheses must differ from all previous attempts
5. The cycle continues with fresh experiment design
This creates a closed-loop research process that can autonomously recover from dead-end research directions — a capability absent from most competing systems.
11.2 Anti-Fabrication System
The VerifiedRegistry is a defense against the most dangerous failure mode of LLM-generated research papers: fabricated experimental results.
Problem: LLMs can generate plausible-looking experimental results that have no basis in actual computation. This is the single biggest threat to the credibility of AI-generated research.
Solution architecture:
```
┌────────────────────────────────────────────────────────┐
│                    VerifiedRegistry                    │
│                                                        │
│  experiment_id → {                                     │
│    conditions: [...],                                  │
│    metrics: {                                          │
│      "accuracy": 0.847,   ← from actual execution      │
│      "f1_score": 0.812,   ← from actual execution      │
│      "train_time": 142.3  ← from actual execution      │
│    },                                                  │
│    execution_log: "...",                               │
│    timestamp: "2026-03-28T14:32:00Z"                   │
│  }                                                     │
│                                                        │
│  Enforcement:                                          │
│  • Paper writing stage queries registry                │
│  • Only registry-verified numbers may appear in text   │
│  • Unverified claims → sanitized or flagged            │
│  • Tables must reference experiment_ids                │
│                                                        │
│  Repair loop (if experiments failed):                  │
│  1. Diagnose failure cause                             │
│  2. Generate repair code (via OpenCode)                │
│  3. Re-execute (up to max_cycles=3)                    │
│  4. If min_completion_rate not met → degrade gracefully│
└────────────────────────────────────────────────────────┘
```
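The enforcement core amounts to checking every number in draft text against the registry. A minimal sketch of that idea (a real implementation would also track units, rounding, and table references; the metric values reuse the example above):

```python
import re

# Toy anti-fabrication pass: flag decimal numbers in draft text that do
# not match any registry-verified metric value. Illustrative only.
VERIFIED = {"accuracy": 0.847, "f1_score": 0.812, "train_time": 142.3}

def unverified_numbers(text: str) -> list[str]:
    verified_strs = {f"{v:g}" for v in VERIFIED.values()}
    found = re.findall(r"\d+\.\d+", text)
    return [n for n in found if n not in verified_strs]

draft = "Our method reaches 0.847 accuracy and 0.93 F1."
print(unverified_numbers(draft))  # ['0.93'] — flagged for sanitization
```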
11.3 Four-Layer Citation Verification
Academic credibility requires real references. AutoResearchClaw implements 4 verification layers:
```
Layer 1: arXiv ID Check
  └─ If citation claims arXiv ID → verify it exists via arXiv API
  └─ If ID is invalid → remove citation

Layer 2: CrossRef / DataCite DOI
  └─ Verify DOI resolves to a real publication
  └─ Check metadata (title, authors, year) matches claim

Layer 3: Semantic Scholar Title Match
  └─ Search paper title in Semantic Scholar
  └─ Fuzzy match to handle minor title variations
  └─ Verify authors and venue match

Layer 4: LLM Relevance Scoring
  └─ Even if citation is real, is it relevant to the paper?
  └─ Score relevance (0-1)
  └─ Remove citations below threshold
```
Result: Only citations that are (a) real and (b) relevant survive.
This addresses a critical weakness of AI Scientist and similar systems where hallucinated references undermined paper credibility.
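Structurally, the four layers form a short-circuit chain: a citation survives only if every layer passes. The predicates below are stubs (the real layers call arXiv, CrossRef/DataCite, Semantic Scholar, and an LLM relevance scorer); all names and heuristics here are assumptions.

```python
# Skeleton of 4-layer citation verification as an all-layers-must-pass
# chain. Each layer is a stub predicate standing in for a network call.
def verify_citation(cite: dict, layers) -> bool:
    return all(layer(cite) for layer in layers)

arxiv_ok = lambda c: "arxiv_id" not in c or c["arxiv_id"].count(".") == 1
doi_ok   = lambda c: "doi" not in c or c["doi"].startswith("10.")
title_ok = lambda c: bool(c.get("title"))
relevant = lambda c: c.get("relevance", 0.0) >= 0.5  # LLM score threshold

cite = {"arxiv_id": "1706.03762", "title": "Attention Is All You Need",
        "relevance": 0.9}
print(verify_citation(cite, [arxiv_ok, doi_ok, title_ok, relevant]))  # True
```

A citation with a missing title or a sub-threshold relevance score fails the chain and is removed, matching the "real and relevant" rule above.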
11.4 Hardware-Aware Experiment Adaptation
The system auto-detects available hardware and adapts experiments accordingly:
```
Hardware Detection:
┌─────────────┐
│ NVIDIA GPU  │ → CUDA mode → full-scale training
│ (detected)  │   PyTorch CUDA, large batch sizes
└─────────────┘
┌─────────────┐
│ Apple MPS   │ → MPS mode → adapted scale
│ (detected)  │   PyTorch MPS, reduced batch sizes
└─────────────┘
┌─────────────┐
│ CPU only    │ → CPU mode → minimal experiments
│ (fallback)  │   Small models, few epochs, sklearn focus
└─────────────┘
```
Code generation adapts: imports, model sizes, batch sizes, training epochs, and package selection are all adjusted based on detected hardware tier.
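A detection routine of this kind reduces to a cascade of availability checks with a CPU fallback; a hedged sketch (the function name is illustrative, and the real adapter additionally tunes batch sizes and epochs per tier):

```python
# Sketch of hardware-tier detection with graceful CPU fallback.
def detect_tier() -> str:
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"       # full-scale training
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"        # adapted scale
    except ImportError:
        pass                    # torch not installed: minimal experiments
    return "cpu"

print(detect_tier())
```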
11.5 Experiment Sandbox Execution
The sandbox provides safe, reproducible experiment execution:
AST validation (pre-execution):
- Parse generated code as a Python AST
- Check for prohibited constructs (network access, file system escape)
- Verify import whitelist compliance
- Block identical ablations (same code, different variable names)
- Detect hardcoded metrics (fabrication attempt)

Execution guardrails:
- Memory limit (configurable, default 4096 MB)
- Time budget (configurable, default 300 s)
- NaN/Inf fast-fail: detect and abort on numerical instabilities
- Partial result capture: save intermediate results even on failure
- Self-healing: on failure, diagnose error → generate fix → retry (up to 10 rounds)
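The import-whitelist check is the most mechanical of the AST gates; a minimal sketch using the stdlib `ast` module (the whitelist contents and function name are illustrative, and the real validator blocks more constructs than imports):

```python
import ast

# Sketch of pre-execution AST validation: reject generated code whose
# top-level imports fall outside an allowed set.
ALLOWED = {"math", "numpy", "sklearn", "json"}

def imports_allowed(source: str) -> bool:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(name not in ALLOWED for name in names):
            return False
    return True

print(imports_allowed("import numpy as np\nx = np.ones(3)"))        # True
print(imports_allowed("import socket\ns = socket.socket()"))        # False
```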
Docker mode adds:
- Network policy (none, setup_only, pip_only, full)
- Container isolation
- Auto-install dependencies (detect imports → generate requirements.txt)
- GPU passthrough
11.6 Skills System
AutoResearchClaw implements a skills system inspired by Claude Code's SKILL.md format:
```
Skills Loading:
1. Built-in skills (19 shipped)  ← researchclaw package
2. Project-local skills          ← .claude/skills/ directory
3. User-installed skills         ← researchclaw skills install
4. Team-shared skills            ← custom_dirs in config
5. Community skills              ← K-Dense-AI/claude-scientific-skills (150+)
```
Each skill is a SKILL.md file with YAML frontmatter:
```markdown
---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]
```
Skills are loaded and injected into LLM prompts automatically at applicable stages. This enables domain-specific expertise without modifying the core pipeline.
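Loading such a file reduces to splitting the frontmatter from the body and reading its fields. The sketch below avoids a YAML dependency by handling only flat `key: value` lines, which is enough to illustrate stage-based skill selection; the real loader presumably uses a proper YAML parser.

```python
# Minimal SKILL.md loader sketch: split "---"-delimited frontmatter from
# the body and parse flat key/value lines. Illustrative only.
def load_skill(text: str) -> tuple[dict, str]:
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

skill_md = """---
name: scientific-writing
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]"""

meta, body = load_skill(skill_md)
print(meta["name"])  # scientific-writing
```

A loader like this lets the pipeline inject `body` into prompts only at the stages named in `applicable-stages`.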
Notable built-in skills:
- scientific-writing — IMRAD structure, citation formatting
- chemistry-rdkit — Molecular analysis, SMILES, drug discovery
- literature-search — Systematic review, PRISMA methodology
- hypothesis-formulation — Testable hypothesis construction
- statistical-reporting — Statistical analysis and reporting standards
- a-evolve — Agentic evolution (community-contributed from A-Evolve)
12 Programming Language
System Implementation
| Component | Language | Framework |
|---|---|---|
| Pipeline orchestrator | Python | Custom framework |
| Stage implementations | Python | — |
| Agent subsystems | Python | Custom agent base class |
| CLI interface | Python | Click (via researchclaw command) |
| Configuration | YAML | Parsed by Python |
| Prompts | YAML | prompts.default.yaml |
| Generated experiments | Python | PyTorch, scikit-learn, etc. |
| Generated papers | Markdown → LaTeX | Jinja2 templates |
Codebase Structure
```
AutoResearchClaw/
├── researchclaw/                  # Main package
│   ├── __init__.py
│   ├── __main__.py
│   ├── adapters.py                # LLM provider adapters
│   ├── agents/                    # Multi-agent subsystems
│   │   ├── base.py                # BaseAgent ABC
│   │   ├── benchmark_agent/       # 4-agent benchmark pipeline
│   │   │   ├── surveyor.py
│   │   │   ├── selector.py
│   │   │   ├── acquirer.py
│   │   │   └── validator.py
│   │   ├── code_agent/            # Multi-phase code generation
│   │   │   ├── architect.py
│   │   │   ├── builder.py
│   │   │   └── validator.py
│   │   ├── code_searcher/         # Code search agent
│   │   ├── debate/                # Multi-agent debate
│   │   └── figure_agent/          # 5-agent figure pipeline
│   │       ├── planner.py
│   │       ├── codegen.py
│   │       ├── renderer.py
│   │       ├── critic.py
│   │       └── integrator.py
│   ├── cli/                       # CLI entry points
│   ├── config/                    # Configuration management
│   ├── knowledge/                 # Knowledge base
│   ├── literature/                # Literature search APIs
│   ├── pipeline/                  # Pipeline orchestrator
│   │   ├── runner.py              # Main pipeline runner
│   │   ├── stages/                # Individual stage implementations
│   │   └── checkpoint.py          # State persistence
│   ├── sandbox/                   # Experiment execution
│   ├── sentinel/                  # Quality watchdog
│   ├── skills/                    # Skills management
│   ├── templates/                 # LaTeX templates
│   └── verification/              # Citation verification
├── .claude/skills/                # Built-in SKILL.md files
├── config.researchclaw.example.yaml
├── prompts.default.yaml           # Default LLM prompts
├── pyproject.toml
├── docs/                          # Documentation
└── tests/                         # Test suite (1,823 tests)
```
Dependencies (from pyproject.toml)
Key dependencies include:
- LLM integration: openai, httpx (for API calls)
- Literature: requests (for academic APIs)
- LaTeX: jinja2 (for template rendering)
- Data processing: pyyaml (JSON via the stdlib json module)
- CLI: click (with stdlib argparse as a fallback)
- AST validation: Python ast module (stdlib)
13 Memory Management
Run-Level Memory: Knowledge Base
Every pipeline run builds a structured knowledge base across 6 categories:
| Category | Contents | Persistence |
|---|---|---|
| Decisions | Research direction choices, PIVOT/REFINE rationale | Per-run |
| Experiments | Code, configurations, results, failure logs | Per-run |
| Findings | Key results, statistical analyses, insights | Per-run |
| Literature | Paper summaries, knowledge cards, citation metadata | Per-run |
| Questions | Open research questions, hypotheses tested | Per-run |
| Reviews | Peer review feedback, revision history | Per-run |
Backend options:
knowledge_base:
  backend: "markdown"   # or "obsidian"
  root: "docs/kb"
Cross-Run Memory: MetaClaw
MetaClaw provides persistent cross-run learning through a lesson → skill pipeline:
Run N:
Pipeline executes → failures/warnings captured as Lessons
Lesson structure:
- stage: which pipeline stage
- severity: warning | error | critical
- category: code_gen | experiment | literature | writing
- description: what went wrong
- resolution: how it was fixed (if auto-resolved)
- timestamp: when it occurred
↓ MetaClaw Processing ↓
Lesson → Skill conversion:
- Filter by min_severity (default: warning)
- Extract actionable pattern
- Generate SKILL.md with prevention instructions
- Store as arc-* skill in ~/.metaclaw/skills/
- Max skills_per_run: 3
↓ Next Run ↓
Run N+1:
build_overlay() at pipeline start:
- Load all arc-* skills from ~/.metaclaw/skills/
- Apply 30-day time-decay weighting
- Inject relevant skills into every stage's LLM prompt
- LLM avoids known pitfalls → fewer retries
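The lesson → skill conversion above can be sketched in a few lines. This is a hedged illustration of the described behavior, not the MetaClaw implementation: the `Lesson` dataclass mirrors the lesson structure listed above, while `lessons_to_skills` and the generated skill body format are assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch of MetaClaw's lesson -> skill conversion;
# names and the rendered SKILL.md body are assumptions.
SEVERITY_RANK = {"warning": 0, "error": 1, "critical": 2}

@dataclass
class Lesson:
    stage: int
    severity: str        # warning | error | critical
    category: str        # code_gen | experiment | literature | writing
    description: str
    resolution: str = ""
    timestamp: float = 0.0

def lessons_to_skills(lessons, min_severity="warning", skills_per_run=3):
    """Filter by min_severity, keep the most severe lessons,
    and render each as an arc-* SKILL.md string."""
    threshold = SEVERITY_RANK[min_severity]
    kept = [l for l in lessons if SEVERITY_RANK[l.severity] >= threshold]
    kept.sort(key=lambda l: SEVERITY_RANK[l.severity], reverse=True)
    skills = []
    for lesson in kept[:skills_per_run]:  # max skills_per_run per run
        name = f"arc-{lesson.category}-stage{lesson.stage}"
        body = (
            f"---\nname: {name}\n"
            f"applicable-stages: [{lesson.stage}]\nenabled: true\n---\n"
            f"Known pitfall: {lesson.description}\n"
            f"Prevention: {lesson.resolution or 'avoid the pattern above'}\n"
        )
        skills.append((name, body))
    return skills
```

The resulting strings would be written to ~/.metaclaw/skills/ for the next run's overlay.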
Stage-Level Memory: Context Accumulation
Within a single run, each stage has access to:
- Pipeline state: All outputs from all completed stages
- Knowledge base: Accumulated findings, decisions, literature
- Experiment history: All previous experimental results
- Artifact versions: All versions from REFINE/PIVOT loops
When using ACP mode, the agent CLI maintains full conversation history across all 23 stages, providing maximum context continuity.
Memory Isolation
| Scope | Persistence | Sharing |
|---|---|---|
| Within-stage | Ephemeral | Stage-internal only |
| Within-run | Run directory | All stages in same run |
| Cross-run (MetaClaw) | ~/.metaclaw/skills/ | All future runs |
| Cross-project | Not implemented | — |
14 Continued Learning
Self-Learning via MetaClaw Integration
AutoResearchClaw's most distinctive continued learning mechanism is the MetaClaw bridge:
Learning Loop:
┌────────────────────────────────────────────────────┐
│ │
│ Run 1: First execution │
│ • Encounters Stage 12 timeout (experiment too slow)│
│ • Auto-repairs by reducing batch size │
│ • Lesson captured: "reduce batch size on timeout" │
│ ↓ │
│ MetaClaw converts → arc-experiment-timeout SKILL.md │
│ ↓ │
│ Run 2: Second execution (different topic) │
│ • SKILL injected into Stage 12 prompt │
│ • LLM proactively uses smaller batch size │
│ • No timeout → no retry needed │
│ • Robustness improved │
│ │
└────────────────────────────────────────────────────┘
Time-decay: Skills have a 30-day decay period, preventing stale lessons from dominating. Recent lessons are weighted more heavily than old ones.
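The 30-day window is stated above, but the exact decay curve is not documented; a minimal sketch, assuming a simple linear falloff to zero over 30 days:

```python
import time

# Assumption: linear decay over the documented 30-day window.
# The actual MetaClaw weighting function may differ.
DECAY_SECONDS = 30 * 24 * 3600  # 30-day decay period

def skill_weight(created_ts, now=None):
    """Weight in [0, 1]: 1.0 for a brand-new skill, 0.0 once 30 days old."""
    if now is None:
        now = time.time()
    age = max(0.0, now - created_ts)
    return max(0.0, 1.0 - age / DECAY_SECONDS)
```

Under this scheme a 15-day-old skill carries half the weight of a fresh one, so recent lessons dominate the injected overlay.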
Measured impact (from controlled A/B experiments):
- Stage retry rate: −24.8%
- Refine cycle count: −40.0%
- Pipeline completion: +5.3%
- Overall robustness: +18.3%
Within-Run Iterative Refinement
The REFINE loop (Stage 15 → Stage 13) provides within-run learning:
- Each REFINE cycle builds on previous results
- Parameter adjustments are informed by all prior iterations
- Up to max_iterations (default: 10) refinement cycles
- Artifacts are versioned to track improvement trajectory
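A toy sketch of the REFINE-vs-PIVOT decision implied by this loop. The score-plateau rule and the `min_gain` threshold are illustrative assumptions; the actual decision is made by the LLM at Stage 15.

```python
# Toy REFINE/PIVOT rule; the plateau test and min_gain threshold
# are assumptions for illustration, not the actual Stage 15 logic.
def refine_or_pivot(scores, min_gain=0.01, max_iterations=10):
    """REFINE while results keep improving; PIVOT once progress stalls
    or the iteration budget (max_iterations) is spent."""
    if len(scores) >= max_iterations:
        return "PIVOT"
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "PIVOT"
    return "REFINE"
```

This captures the exploit/explore trade-off: keep refining while each cycle pays off, restart in a new direction when it does not.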
Experiment Self-Healing
The sandbox executor implements a diagnosis-repair loop:
Execute code → failure detected
↓
Diagnose error type:
• Import error → add dependency
• Runtime error → modify code
• Timeout → reduce scale
• NaN/Inf → add numerical guards
↓
Generate repair (LLM or OpenCode)
↓
Re-execute (up to exec_fix_max_iterations=3)
↓
If still failing → capture partial results → degrade gracefully
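The diagnosis-repair loop above can be sketched as a small retry harness. This is a hedged outline, not the sandbox executor's code: `execute`, `diagnose`, and `repair` are placeholders for the real components, and the result-dict shape is an assumption.

```python
# Sketch of the diagnose-repair loop; execute/diagnose/repair are
# placeholders for the sandbox executor's actual components.
def run_with_self_healing(code, execute, diagnose, repair, max_fix_iterations=3):
    """Execute code, repairing and retrying up to exec_fix_max_iterations
    times, then degrade gracefully with any partial results."""
    for attempt in range(max_fix_iterations + 1):
        result = execute(code)
        if result.get("ok"):
            return result
        if attempt == max_fix_iterations:
            break
        error_kind = diagnose(result["error"])  # import / runtime / timeout / nan
        code = repair(code, error_kind)         # LLM- or rule-generated fix
    # Still failing: keep whatever partial output exists
    return {"ok": False, "partial": result.get("partial"), "error": result["error"]}
```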
Skills Library as Accumulated Knowledge
The skills system functions as a growing knowledge base:
- Built-in skills (19): Curated by the development team
- Community skills (150+ via K-Dense-AI): Crowdsourced scientific knowledge
- MetaClaw-generated skills: Automatically created from pipeline failures
- Custom skills: User/team-specific knowledge
Over time, a research group's skill library accumulates domain-specific knowledge that makes the pipeline increasingly effective for their particular research area.
Process Reward Model (PRM) — Optional Quality Gate
MetaClaw optionally integrates a Process Reward Model for quality gating:
metaclaw_bridge:
  prm:
    enabled: false              # Opt-in
    model: "gpt-5.4"            # PRM judge model
    votes: 3                    # Majority vote
    gate_stages: [5, 9, 15, 20]
When enabled, an LLM-as-judge evaluates stage outputs and blocks low-quality results from proceeding — adding another layer of quality control beyond the standard gate stages.
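The majority-vote gating configured above reduces to a few lines. A minimal sketch, where `judge_fn` stands in for the LLM-as-judge call and is an assumption:

```python
# Minimal majority-vote gate; judge_fn is a placeholder for the
# LLM-as-judge call, not an actual MetaClaw API.
def prm_gate(stage_output, judge_fn, votes=3):
    """Ask the judge model `votes` times; pass only on a strict majority."""
    approvals = sum(1 for _ in range(votes) if judge_fn(stage_output))
    return approvals * 2 > votes
```

With votes=3, at least two approvals are required before the stage output proceeds.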
15 Applications
Primary Application: Automated Paper Generation
The primary use case is generating conference-ready academic papers from a research topic:
researchclaw run --topic "Investigating the role of attention sparsity \
in reducing transformer inference cost" --auto-approve
Output: Complete paper with real literature, executed experiments, verified results, peer review, and LaTeX export.
Research Workflow Integration
| Use Case | Description | Configuration |
|---|---|---|
| Literature review | Run phases A-C only for systematic literature review | Stop at Stage 7 |
| Experiment design | Run phases A-D for designed experiments without execution | Stop at Stage 11 |
| Full autonomy | Complete pipeline with --auto-approve | Default |
| Supervised research | Pipeline pauses at 3 gates for human review | Without --auto-approve |
| Chat-driven | Via OpenClaw: "Research X" in Discord/Telegram/Slack | OpenClaw bridge enabled |
Platform Integration
AutoResearchClaw supports deployment across multiple interfaces:
┌──────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ │
│ CLI OpenClaw Python API │
│ researchclaw "Research X" Runner(config) │
│ run --topic via Discord/ .run() │
│ "..." Telegram/etc. │
│ │
│ Claude Code Copilot CLI Any AI CLI │
│ "Run research researchclaw Provide │
│ on [topic]" run --topic AGENTS.md │
│ via ACP as context │
└──────────────────────────────────────────────────────┘
Target Users
| User Type | Value Proposition |
|---|---|
| PhD students | Rapid prototyping of research directions; literature review automation |
| Research labs | High-throughput hypothesis testing across multiple topics |
| Industry R&D | Quick feasibility studies and literature surveys |
| Interdisciplinary teams | Domain-agnostic pipeline works across fields |
Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Paper quality ceiling | Not yet at top-venue acceptance quality | Multi-agent review + MetaClaw improves over time |
| Experiment scale | Sandbox experiments are small-scale | Docker mode + SSH remote for larger experiments |
| Domain expertise | No deep domain knowledge beyond LLM training data | Skills system adds domain knowledge; community skills |
| Fabrication risk | Despite VerifiedRegistry, subtle fabrication possible | Sentinel watchdog + human review at gates |
| LLM cost | Full runs cost $15–50 with GPT-4o | Fallback models, budget control in config |
| Novelty assessment | Cannot reliably assess if research is truly novel | Human judgment required for novelty claims |
| No peer acceptance | No evidence of generated papers passing real peer review | Showcase papers are self-evaluated only |
Comparison with Related Systems
| System | Full Pipeline | Real Literature | Experiments | Self-Learning | Open Source |
|---|---|---|---|---|---|
| AutoResearchClaw | 23 stages | OpenAlex + S2 + arXiv | Sandbox/Docker | MetaClaw | MIT |
| AI Scientist (Sakana) | Partial | No (hallucinated) | Limited | No | Apache 2.0 |
| AIRA₂ (Meta) | Experiments only | N/A | Full-scale GPU | Within-task evolution | Not released |
| FARS (Analemma) | Full pipeline | Unknown | Unknown | Unknown | Proprietary |
| AutoResearch (Karpathy) | Partial | Partial | No | No | Open |
Connections to OmniEvolve
AutoResearchClaw's architecture maps to several OmniEvolve design patterns:
| AutoResearchClaw Component | OmniEvolve Equivalent |
|---|---|
| 23-stage pipeline orchestrator | omnievolve/orchestrator/ experiment lifecycle |
| PIVOT/REFINE decision loop | Adaptive search strategy in omnievolve/search/ |
| MetaClaw cross-run learning | omnievolve/knowledge/ learning logs and skills |
| VerifiedRegistry anti-fabrication | omnievolve/evaluation/ cascade evaluator integrity |
| Multi-agent debate | Multi-operator mutation in omnievolve/mutation/ |
| Skills system | omnievolve/plugins/ plugin discovery |
| Sentinel watchdog | omnievolve/safety/ audit and policy enforcement |
| CodeAgent/BenchmarkAgent/FigureAgent | Specialized omnievolve/mutation/ operators |
| Experiment sandbox | omnievolve/safety/ sandbox execution |
| Knowledge base | omnievolve/knowledge/ structured knowledge storage |
The PIVOT/REFINE mechanism is particularly relevant to OmniEvolve as it represents a form of adaptive search where the system autonomously decides when to refine (exploit) vs. when to restart (explore) — a fundamental exploration-exploitation decision analogous to island-based search with restart policies.
References
- Liu, J., et al. (2026). "AutoResearchClaw: Fully Autonomous Research from Idea to Paper." GitHub repository, aiming-lab/AutoResearchClaw.
- Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI. arXiv:2408.06292.
- Karpathy, A. (2025). "AutoResearch." GitHub repository.
- Analemma (2025). "FARS: Fully Automated Research System." Blog post.
- MetaClaw: github.com/aiming-lab/MetaClaw
- OpenClaw: github.com/openclaw/openclaw
- OpenCode: github.com/anomalyco/opencode
- A-Evolve: github.com/A-EVO-Lab/a-evolve
- K-Dense-AI Claude Scientific Skills: github.com/K-Dense-AI/claude-scientific-skills
Classification: Autoresearch — AutoResearchClaw is a fully autonomous AI system that conducts the complete research process from idea to paper, including literature review, hypothesis generation, experiment execution, result analysis, paper writing, and peer review. It is a prototypical autoresearch system — an agent harness that automates the end-to-end scientific research workflow. Its MetaClaw integration adds self-evolving capability, and its PIVOT/REFINE mechanism introduces adaptive search over research directions.