AutoResearchClaw
Fully autonomous 23-stage pipeline that transforms a research idea into a conference-ready paper with real literature, sandboxed experiments, multi-agent peer review, and self-evolving cross-run learning.

Organization: AIMING Lab (UC Santa Cruz, UNC Chapel Hill, Johns Hopkins, UC Davis, et al.)
Published: March 15, 2026 (v0.1.0); actively maintained through v0.3.2+
Type: repo (GitHub: aiming-lab/AutoResearchClaw)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
AutoResearchClaw: Fully Autonomous Research from Idea to Paper
- Repository: github.com/aiming-lab/AutoResearchClaw
- License: MIT
- Stars: ~9,800+ (as of April 2026)
- First release: v0.1.0, March 15, 2026
- Current release: v0.3.2, March 22, 2026
- Tagline: "Chat an Idea. Get a Paper."
- Predecessor lineage: Inspired by AI Scientist (Sakana AI), AutoResearch (Karpathy), FARS (Analemma)
- Companion systems:
- MetaClaw — cross-run learning engine (skill extraction from failures)
- OpenClaw — AI assistant platform (chat interface for pipeline orchestration)
The project name "Claw" references the lobster emoji used throughout the branding, suggesting the system's ability to "grasp" research problems and work through them autonomously.
Version History
| Version | Date | Key Features |
|---|---|---|
| v0.1.0 | Mar 15, 2026 | Initial release: 23-stage pipeline, end-to-end autonomous |
| v0.2.0 | Mar 16, 2026 | CodeAgent, BenchmarkAgent, FigureAgent; Docker sandbox hardening; 4-round quality audit |
| v0.3.0 | Mar 17, 2026 | MetaClaw integration (+18.3% robustness); cross-run learning |
| v0.3.1 | Mar 18, 2026 | OpenCode Beast Mode; Novita AI provider; thread-safety hardening |
| v0.3.2 | Mar 22, 2026 | Cross-platform ACP support; anti-fabrication system; 100+ bug fixes; --resume |
| v0.3.2+ | Mar 30, 2026 | Flexible skill loading; 20 pre-loaded skills; A-Evolve skill |
Notable pace: 6 significant releases in 15 days, suggesting rapid iteration under active development pressure.
2 Authors and Team
| Author | Affiliation |
|---|---|
| Jiaqi Liu | — |
| Peng Xia | — |
| Siwei Han | — |
| Shi Qiu | — |
| Letian Zhang | — |
| Guiming Chen | — |
| Haoqin Tu | — |
| Xinyu Yang | — |
| Jiawei Zhou | — |
| Hongtu Zhu | UNC Chapel Hill |
| Yun Li | — |
| Yuyin Zhou | UC Santa Cruz |
| Zeyu Zheng | — |
| Cihang Xie | UC Santa Cruz |
| Mingyu Ding | Johns Hopkins University |
| Huaxiu Yao | UC Davis (AIMING Lab lead) |
Team composition: Academic research group (AIMING Lab) spanning multiple US universities. Unlike the industrial AIRA₂ team (25 authors at Meta FAIR), this is a more typical academic team producing open-source research infrastructure.
AIMING Lab context: The lab has produced related work including MetaClaw for cross-run learning, suggesting a broader research program around automated scientific discovery.
3 Core Contribution
AutoResearchClaw's core contribution is a complete end-to-end pipeline that autonomously transforms a text research topic into a conference-ready academic paper. Unlike systems that focus on a single phase (literature search, experiment execution, or writing), AutoResearchClaw integrates all phases into a single orchestrated workflow.
The 23-Stage Pipeline
```
Phase A: Research Scoping
   1. TOPIC_INIT
   2. PROBLEM_DECOMPOSE
Phase B: Literature Discovery
   3. SEARCH_STRATEGY
   4. LITERATURE_COLLECT     ← real APIs
   5. LITERATURE_SCREEN      [GATE]
   6. KNOWLEDGE_EXTRACT
Phase C: Knowledge Synthesis
   7. SYNTHESIS
   8. HYPOTHESIS_GEN         ← debate
Phase D: Experiment Design
   9. EXPERIMENT_DESIGN      [GATE]
  10. CODE_GENERATION
  11. RESOURCE_PLANNING
Phase E: Experiment Execution
  12. EXPERIMENT_RUN
  13. ITERATIVE_REFINE       ← self-healing
Phase F: Analysis & Decision
  14. RESULT_ANALYSIS        ← multi-agent
  15. RESEARCH_DECISION      ← PIVOT/REFINE
Phase G: Paper Writing
  16. PAPER_OUTLINE
  17. PAPER_DRAFT
  18. PEER_REVIEW            ← evidence check
  19. PAPER_REVISION
Phase H: Finalization
  20. QUALITY_GATE           [GATE]
  21. KNOWLEDGE_ARCHIVE
  22. EXPORT_PUBLISH         ← LaTeX
  23. CITATION_VERIFY        ← relevance check
```
Five Differentiating Capabilities
| Capability | Description |
|---|---|
| PIVOT/REFINE Loop | Stage 15 autonomously decides: PROCEED (continue), REFINE (tweak parameters → Stage 13), or PIVOT (new research direction → Stage 8). Artifacts auto-versioned across loops. |
| Multi-Agent Debate | Hypothesis generation, result analysis, and peer review each use structured multi-perspective LLM debate rather than single-pass generation. |
| Self-Learning (MetaClaw) | Lessons extracted per run (decision rationale, runtime warnings, metric anomalies) with 30-day time-decay. Future runs avoid past mistakes. |
| Anti-Fabrication System | VerifiedRegistry enforces ground-truth experiment data in papers. Unverified numbers are sanitized. Failed experiments are auto-diagnosed and repaired before writing. |
| Real Citation Verification | 4-layer verification: arXiv ID → CrossRef/DataCite DOI → Semantic Scholar title match → LLM relevance scoring. Hallucinated references automatically removed. |
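The table's "30-day time-decay" on MetaClaw lessons can be sketched as an exponential down-weighting by age. The half-life interpretation, function name, and field layout below are assumptions for illustration, not the actual MetaClaw API.

```python
from datetime import datetime, timezone

# Illustrative sketch: weight a stored lesson by its age, assuming the
# "30-day time-decay" means a 30-day half-life. Names are hypothetical.
HALF_LIFE_DAYS = 30.0

def lesson_weight(recorded_at: datetime, now: datetime) -> float:
    """1.0 when fresh, 0.5 after 30 days, decaying exponentially."""
    age_days = (now - recorded_at).total_seconds() / 86400.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

now = datetime(2026, 4, 1, tzinfo=timezone.utc)
print(round(lesson_weight(datetime(2026, 4, 1, tzinfo=timezone.utc), now), 2))  # 1.0
print(round(lesson_weight(datetime(2026, 3, 2, tzinfo=timezone.utc), now), 2))  # 0.5
```

Under this scheme a month-old lesson contributes half as much as a fresh one when future runs rank which past mistakes to avoid.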
What Makes It Novel in the Autoresearch Landscape
Compared to AI Scientist (Sakana AI) and AIRA₂ (Meta FAIR):
| Dimension | AI Scientist | AIRA₂ | AutoResearchClaw |
|---|---|---|---|
| Output | Paper (limited quality) | Competition solution | Conference-ready paper |
| Experiments | Simulated / toy | Real ML training (Kaggle) | Sandboxed Python (configurable fidelity) |
| Literature | No real retrieval | N/A | Real APIs (OpenAlex, Semantic Scholar, arXiv) |
| Citation integrity | Hallucinated refs common | N/A | 4-layer verification |
| Self-improvement | None | Within-task evolution | Cross-run MetaClaw learning |
| Human-in-loop | None | None | 3 quality gates (optional) |
| Target venue | Workshop-level | N/A (competition) | NeurIPS / ICML / ICLR |
4 Supported Solutions
Research Domains
AutoResearchClaw is domain-agnostic by design. The showcase demonstrates papers across 8 domains:
| Domain | Showcase Paper |
|---|---|
| Mathematics | Random matrix theory |
| Statistics | Weak IV estimators |
| Biology | SIR/SEIR identifiability |
| Computing | Krylov preconditioners |
| NLP | (Token merging — FAME) |
| Reinforcement Learning | LACE exploration |
| Computer Vision | GARD-LoRA |
| Model Compression | CRAFT distillation |
Experiment Execution Modes
| Mode | Description | Use Case |
|---|---|---|
| `simulated` | LLM generates plausible results without execution | Prototyping, low-resource |
| `sandbox` | AST-validated Python in local subprocess | Default; most common |
| `docker` | Hardened Docker containers with network policy | Production; GPU experiments |
| `ssh_remote` | Execution on remote GPU server | Large-scale training |
Experiment Complexity Tiers
The system automatically assesses experiment complexity and routes accordingly:
```
Complexity Assessment:
┌─────────────────────────────────────────────────┐
│ Simple (score < 0.2)                            │
│   → Direct LLM code generation                  │
│                                                 │
│ Medium (0.2 ≤ score < threshold)                │
│   → CodeAgent v2 with architecture planning     │
│                                                 │
│ Complex (score ≥ threshold)                     │
│   → OpenCode Beast Mode                         │
│   → Multi-file projects, custom architectures   │
│   → Training loops + ablation studies           │
└─────────────────────────────────────────────────┘
```
Output Artifacts
A complete run produces:
| Artifact | Format | Description |
|---|---|---|
| `paper_draft.md` | Markdown | Full academic paper (5,000–6,500 words) |
| `paper.tex` | LaTeX | Conference-ready (NeurIPS/ICML/ICLR templates) |
| `references.bib` | BibTeX | Real references, auto-pruned to match inline citations |
| `verification_report.json` | JSON | 4-layer citation integrity + relevance verification |
| `experiment_runs/` | Python + JSON | Generated code + sandbox results + structured metrics |
| `charts/` | PNG/PDF | Auto-generated comparison charts with error bars |
| `reviews.md` | Markdown | Multi-agent peer review with consistency checks |
| `evolution/` | Markdown | Self-learning lessons extracted from the run |
| `deliverables/` | Mixed | All final outputs, compile-ready for Overleaf |
5 LLM Integration
Provider Architecture
AutoResearchClaw supports a pluggable LLM backend through multiple provider types:
| Provider | Configuration | Notes |
|---|---|---|
| OpenAI-compatible | `base_url` + `api_key_env` | Default; works with any OpenAI-compatible API |
| OpenAI | Direct OpenAI API | GPT-4o, GPT-4o-mini |
| OpenRouter | Multi-model routing | Access to many models via single API |
| DeepSeek | DeepSeek API | DeepSeek V3/V3.2 |
| Minimax | Minimax API | — |
| Novita AI | Novita API | Added in v0.3.1 |
| ACP (Agent Client Protocol) | CLI agent delegation | Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI |
ACP: Agent Client Protocol
A distinctive feature is ACP support, where AutoResearchClaw delegates LLM calls to external CLI agents rather than calling APIs directly:
```yaml
llm:
  provider: "acp"
  acp:
    agent: "claude"   # or: codex, gh, gemini, opencode, kimi
    cwd: "."
```
The ACP adapter communicates via acpx, maintaining a single persistent session across all 23 pipeline stages. This means the CLI agent accumulates context about the entire research process, potentially improving coherence across stages.
Model Usage Across Pipeline Stages
The LLM is used in qualitatively different ways across the pipeline:
| Stage Group | LLM Usage Pattern |
|---|---|
| Scoping (1-2) | Single-turn generation: topic decomposition, problem tree |
| Literature (3-6) | Query generation + relevance scoring + knowledge extraction |
| Synthesis (7-8) | Multi-agent debate: structured hypothesis generation |
| Design (9-11) | Architecture planning: implementation blueprint generation |
| Code Gen (10, 13) | Multi-file code generation (CodeAgent or OpenCode Beast Mode) |
| Analysis (14) | Multi-agent debate: structured result interpretation |
| Decision (15) | Reasoning: PROCEED/REFINE/PIVOT with rationale |
| Writing (16-19) | Section-by-section drafting + peer review + revision |
| Quality (20, 23) | Scoring: quality gates + citation relevance |
OpenCode Beast Mode
Complex experiments are automatically routed to OpenCode, an external code generation system:
```yaml
opencode:
  enabled: true
  auto: true
  complexity_threshold: 0.2   # 0.0-1.0
  timeout_sec: 600
  max_retries: 1
```
OpenCode generates multi-file projects with custom architectures, training loops, and ablation studies — going beyond what single-prompt code generation can produce.
Anti-Fabrication Integration
The VerifiedRegistry system enforces that only experimentally verified results appear in the paper:
- Experiments produce structured JSON metrics
- VerifiedRegistry indexes all verified metrics
- During paper writing, only registry-verified numbers can be cited
- Unverified numbers are sanitized (removed or flagged)
- If experiments fail, a diagnosis-and-repair loop attempts to fix them before writing
6 Key Results
Showcase Papers
AutoResearchClaw has produced 8 papers across 8 domains, generated fully autonomously with zero human intervention:
| Paper | Domain | Key Method |
|---|---|---|
| Paper I | Random matrix theory | Mathematical analysis |
| Paper II | Weak IV estimators | Statistical methodology |
| Paper III | SIR/SEIR identifiability | Epidemiological modeling |
| Paper IV | Krylov preconditioners | Numerical computing |
| Paper V | GARD-LoRA | Parameter-efficient fine-tuning |
| Paper VI | LACE exploration | Reinforcement learning |
| Paper VII | FAME token merging | Vision transformer efficiency |
| Paper VIII | CRAFT distillation | Knowledge distillation |
MetaClaw Integration Results
Controlled A/B experiments (same topic, same LLM, same configuration):
| Metric | Baseline | With MetaClaw | Improvement |
|---|---|---|---|
| Stage retry rate | 10.5% | 7.9% | −24.8% |
| Refine cycle count | 2.0 | 1.2 | −40.0% |
| Pipeline stage completion | 18/19 | 19/19 | +5.3% |
| Overall robustness score | 0.714 | 0.845 | +18.3% |
Composite robustness score = weighted average of stage completion (40%), retry reduction (30%), refine cycle efficiency (30%).
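The composite score defined above is a plain weighted average; a minimal sketch (how each component is normalized to [0, 1] is not reported, so the inputs here are illustrative):

```python
# Composite robustness = 40% stage completion + 30% retry reduction
# + 30% refine-cycle efficiency, each assumed pre-normalized to [0, 1].
def robustness(completion: float, retry: float, refine: float) -> float:
    return 0.4 * completion + 0.3 * retry + 0.3 * refine

print(robustness(1.0, 1.0, 1.0))  # 1.0 (all components perfect)
print(robustness(1.0, 0.5, 0.5))  # 0.7
```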
Quality Indicators
| Indicator | Value |
|---|---|
| Paper length | 5,000–6,500 words |
| Target venue quality | NeurIPS / ICML / ICLR format |
| Citation verification | 4-layer (arXiv, CrossRef, DataCite, LLM) |
| Real references | Yes (OpenAlex, Semantic Scholar, arXiv APIs) |
| Hallucinated reference rate | Auto-removed (exact rate not reported) |
| Experiment fidelity | Sandboxed Python with hardware-aware adaptation |
| Peer review | Multi-agent with methodology-evidence consistency checks |
Adoption Metrics
| Metric | Value (as of April 2026) |
|---|---|
| GitHub stars | ~9,800+ |
| Releases | 6 in 15 days |
| Test suite | 1,823 tests passing |
| Community skills | 20 built-in + extensible via SKILL.md |
| Localization | 9 languages (README translations) |
Limitations of Reported Results
- No blind evaluation: Papers not submitted to actual conferences; quality is self-assessed
- No human expert review: Showcase papers not evaluated by domain experts
- No comparison to human baselines: No metric comparing output quality to human-written papers
- Selection bias: Showcase papers may be cherry-picked from many runs
- Experiment fidelity: Sandbox experiments may not match full-scale reproducible research
7 Reproducibility
Strengths
| Aspect | Assessment |
|---|---|
| Open source | Fully MIT-licensed; complete codebase on GitHub |
| Installation | Single `pip install -e .` followed by `researchclaw setup` |
| Configuration | Comprehensive YAML config with documented defaults |
| Test suite | 1,823 tests passing |
| Documentation | Extensive README, integration guide, tester guide |
| Example config | config.researchclaw.example.yaml with all options |
| Multi-platform | Cross-platform via ACP; not locked to single LLM provider |
| Deterministic pipeline | 23 stages execute in fixed order with checkpoint/resume |
Challenges
| Aspect | Concern |
|---|---|
| LLM non-determinism | Output quality varies with model, temperature, and API version |
| API dependencies | Requires OpenAlex, Semantic Scholar, arXiv APIs (external services) |
| Experiment quality | Sandbox experiments may not reproduce at full scale |
| Cost variability | LLM API costs depend on model choice and generation length |
| Docker setup | Docker mode requires additional infrastructure |
| OpenCode dependency | Beast Mode requires separate OpenCode installation |
| Rapid iteration | 6 releases in 15 days suggests API may be unstable |
Running a Reproduction
```bash
# Minimal reproduction:
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
researchclaw setup
researchclaw init                  # interactive config
export OPENAI_API_KEY="sk-..."
researchclaw run --topic "Your topic" --auto-approve

# Resume interrupted run:
researchclaw run --resume          # auto-detects last checkpoint
```
8 Compute and API Costs
LLM API Costs (Estimated)
A full 23-stage pipeline run involves extensive LLM usage across all stages. Estimated costs by model:
| Model | Estimated Cost per Run | Notes |
|---|---|---|
| GPT-4o | $15–50 | Many stages, multi-agent debate, code generation |
| GPT-4o-mini | $3–10 | Budget option; lower quality |
| Claude 3.5 Sonnet (via ACP) | $10–30 | Using Claude Code as agent |
| DeepSeek V3 | $2–8 | Cost-effective alternative |
| Gemini Pro | $5–20 | Via OpenRouter or direct |
Costs vary significantly based on topic complexity, number of REFINE/PIVOT cycles, experiment iterations, and paper length.
Compute Resources
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU-only mode) | NVIDIA (CUDA) or Apple (MPS) |
| Disk | 5GB | 20GB+ (for experiment artifacts) |
| Network | Required for LLM APIs + literature search | — |
Time Budget
| Configuration | Estimated Duration |
|---|---|
| Simulated experiments, fast model | 30–60 minutes |
| Sandbox experiments, GPT-4o | 2–6 hours |
| Docker + OpenCode Beast Mode | 4–12 hours |
| With PIVOT/REFINE cycles | 6–24 hours |
Per-Stage Time Breakdown (Estimated)
```
Phase A: Scoping ................ 5-10 min
Phase B: Literature ............. 15-45 min (API-bound)
Phase C: Synthesis .............. 10-20 min
Phase D: Experiment Design ...... 10-30 min
Phase E: Experiment Execution ... 30-180 min (compute-bound)
Phase F: Analysis + Decision .... 10-20 min
Phase G: Paper Writing .......... 30-60 min
Phase H: Finalization ........... 15-30 min
─────────────────────────────────────────────
Total (no loops) ................ ~2-6 hours
+ REFINE cycles ................. +30-90 min each
+ PIVOT cycles .................. +2-4 hours each
```
9 Architecture Solution
High-Level Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                  AutoResearchClaw Architecture                  │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                  PIPELINE ORCHESTRATOR                     │ │
│  │             (researchclaw/pipeline/runner.py)              │ │
│  │                                                            │ │
│  │  ┌──────┐ ┌──────┐ ┌──────┐        ┌──────┐ ┌──────┐      │ │
│  │  │Stg 1 │→│Stg 2 │→│Stg 3 │→ ··· →│Stg 22│→│Stg 23│      │ │
│  │  └──────┘ └──────┘ └──┬───┘        └──────┘ └──────┘      │ │
│  │                       │                                    │ │
│  │              ┌────────┴────────┐                           │ │
│  │              │   GATE STAGES   │                           │ │
│  │              │    5, 9, 20     │                           │ │
│  │              │ (approve/reject │                           │ │
│  │              │  + rollback)    │                           │ │
│  │              └─────────────────┘                           │ │
│  │                                                            │ │
│  │  ┌──────────────────────────────┐                          │ │
│  │  │   DECISION ENGINE (Stg 15)   │                          │ │
│  │  │                              │                          │ │
│  │  │  PROCEED → Stage 16          │                          │ │
│  │  │  REFINE  → Stage 13 (loop)   │                          │ │
│  │  │  PIVOT   → Stage 8 (restart) │                          │ │
│  │  └──────────────────────────────┘                          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │  MULTI-AGENT    │ │   EXPERIMENT    │ │   KNOWLEDGE    │     │
│  │  SUBSYSTEMS     │ │    SANDBOX      │ │     BASE       │     │
│  │                 │ │                 │ │                │     │
│  │ • CodeAgent     │ │ • AST validate  │ │ • Decisions    │     │
│  │ • BenchmarkAgt  │ │ • NaN/Inf trap  │ │ • Experiments  │     │
│  │ • FigureAgent   │ │ • Self-heal     │ │ • Findings     │     │
│  │ • Debate agents │ │ • Docker/local  │ │ • Literature   │     │
│  │                 │ │ • GPU detect    │ │ • Questions    │     │
│  │                 │ │                 │ │ • Reviews      │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │  LLM PROVIDER   │ │   LITERATURE    │ │    METACLAW    │     │
│  │     LAYER       │ │      APIS       │ │     BRIDGE     │     │
│  │                 │ │                 │ │                │     │
│  │ • OpenAI-compat │ │ • OpenAlex      │ │ • Lessons      │     │
│  │ • ACP agents    │ │ • Semantic Sch  │ │ • Skills       │     │
│  │ • Retry/fallback│ │ • arXiv         │ │ • Overlay      │     │
│  │ • Budget control│ │ • CrossRef      │ │ • Time-decay   │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐     │
│  │    SENTINEL     │ │    VERIFIED     │ │     EXPORT     │     │
│  │    WATCHDOG     │ │    REGISTRY     │ │     ENGINE     │     │
│  │                 │ │                 │ │                │     │
│  │ • NaN/Inf detect│ │ • Ground-truth  │ │ • LaTeX        │     │
│  │ • Consistency   │ │ • Anti-fabr.    │ │ • BibTeX       │     │
│  │ • Citation score│ │ • Experiment    │ │ • NeurIPS/     │     │
│  │ • Anti-fabr.    │ │   diagnosis     │ │   ICML/ICLR    │     │
│  └─────────────────┘ └─────────────────┘ └────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
```
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Sequential 23-stage pipeline | Deterministic ordering ensures reproducibility and checkpoint/resume capability |
| 3 quality gates | Human-in-the-loop at critical decision points (literature screening, experiment design, final quality) |
| PIVOT/REFINE loops | Allows the system to recover from dead-end research directions autonomously |
| Multi-agent debate | Multiple LLM perspectives reduce single-point-of-failure in reasoning |
| VerifiedRegistry | Prevents the most common failure mode: LLM-fabricated experimental results |
| Pluggable LLM backend | ACP support means the system isn't locked to any single provider |
| MetaClaw bridge | Cross-run learning addresses the "same mistakes" problem |
Control Flow: The PIVOT/REFINE Decision
The most architecturally interesting feature is Stage 15's autonomous research direction control:
```
        ┌──────────────────┐
        │    Stage 14:     │
        │ Result Analysis  │
        └────────┬─────────┘
                 │
                 ▼
        ┌──────────────────┐
        │    Stage 15:     │
        │     RESEARCH     │
        │     DECISION     │
        └───────┬──────────┘
                │
   ┌────────────┼────────────┐
   │            │            │
   ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ PROCEED  │ │  REFINE  │ │  PIVOT   │
│          │ │          │ │          │
│ → Stg 16 │ │ → Stg 13 │ │ → Stg 8  │
│ (write)  │ │ (iterate)│ │ (restart)│
└──────────┘ └──────────┘ └──────────┘
```
- PROCEED: Results are satisfactory → move to paper writing
- REFINE: Results need parameter tuning → go back to iterative refinement
- PIVOT: Hypothesis is wrong → generate new hypotheses and restart experiments
Artifacts are auto-versioned across loops to prevent data loss.
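The three-way routing above reduces to a small dispatch table; a minimal sketch (the enum and function names are illustrative, stage numbers come from the pipeline listing):

```python
from enum import Enum

# Sketch of Stage 15's PROCEED/REFINE/PIVOT routing. Names are
# hypothetical; only the stage numbers come from the pipeline spec.
class Decision(Enum):
    PROCEED = "proceed"
    REFINE = "refine"
    PIVOT = "pivot"

NEXT_STAGE = {
    Decision.PROCEED: 16,  # move on to paper writing
    Decision.REFINE: 13,   # loop back to iterative refinement
    Decision.PIVOT: 8,     # restart from hypothesis generation
}

def route(decision: Decision) -> int:
    return NEXT_STAGE[decision]

print(route(Decision.PIVOT))  # 8
```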
10 Component Breakdown
Component 1: Pipeline Orchestrator
Location: researchclaw/pipeline/
The orchestrator manages the sequential execution of 23 stages with:
- Checkpoint/resume: Pipeline state saved after each stage; --resume flag auto-detects
- Gate management: Stages 5, 9, 20 pause for human approval (or auto-approve)
- Loop control: REFINE → Stage 13, PIVOT → Stage 8
- Error recovery: Stage failures trigger retry with configurable limits
- Artifact versioning: Each loop iteration creates versioned artifact snapshots
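The checkpoint/resume behavior can be sketched with a simple JSON state file; the real runner's on-disk format is not documented here, so the file name and schema below are assumptions.

```python
import json
from pathlib import Path

# Minimal sketch of per-stage checkpointing with auto-resume, assuming
# a flat JSON state file. Names and schema are hypothetical.
STATE_FILE = Path("checkpoint.json")

def save_checkpoint(stage: int, artifacts: dict) -> None:
    STATE_FILE.write_text(json.dumps({"last_stage": stage, "artifacts": artifacts}))

def resume_stage() -> int:
    """Return the next stage to run, or 1 if no checkpoint exists."""
    if not STATE_FILE.exists():
        return 1
    return json.loads(STATE_FILE.read_text())["last_stage"] + 1

save_checkpoint(14, {"results": "experiment_runs/v1"})
print(resume_stage())  # 15
STATE_FILE.unlink()    # clean up
```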
Component 2: Literature Discovery Engine
Stages: 3–6
Multi-source literature search with real academic APIs:
```
Query Expansion (Stage 3):
  Research topic → multiple search queries
                 → domain-specific terminology
                 → synonym expansion

Literature Collection (Stage 4):
  ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐
  │  OpenAlex   │   │ Semantic Scholar │   │    arXiv    │
  │     API     │──▶│       API        │──▶│     API     │
  └─────────────┘   └──────────────────┘   └─────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                  ┌──────────────────┐
                  │  Deduplication   │
                  │ Circuit Breaker  │
                  │  Rate Limiting   │
                  └──────────────────┘

Literature Screening (Stage 5 — GATE):
  LLM relevance scoring → human approval → filtered set

Knowledge Extraction (Stage 6):
  Per-paper → structured knowledge cards
```
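Cross-source deduplication of the kind Stage 4 performs can be sketched by keying each record on its DOI when present and a normalized title otherwise; the record field names here are assumptions about the internal format.

```python
# Illustrative dedup for records merged from OpenAlex, Semantic Scholar,
# and arXiv. Field names ("doi", "title") are assumed, not confirmed.
def dedup_key(record: dict) -> str:
    doi = record.get("doi")
    if doi:
        return "doi:" + doi.lower()
    # Fall back to a punctuation/case-insensitive title key.
    return "title:" + "".join(c for c in record["title"].lower() if c.isalnum())

def deduplicate(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

papers = [
    {"title": "Attention Is All You Need", "doi": "10.5555/3295222"},
    {"title": "Attention is all you need!", "doi": "10.5555/3295222"},  # same DOI
    {"title": "Attention Is All You Need"},  # no DOI: title key, kept separately
]
print(len(deduplicate(papers)))  # 2
```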
Component 3: Multi-Agent Debate System
Used in: Stages 8 (hypothesis gen), 14 (result analysis), 18 (peer review)
The debate system uses multiple LLM "perspectives" to reduce single-point reasoning failures:
- Hypothesis generation: Multiple agents propose competing hypotheses; structured debate narrows to testable set
- Result analysis: Independent agents analyze experimental results from different angles; consensus synthesis
- Peer review: Agents review the draft paper with explicit methodology-evidence consistency checks
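Schematically, a debate round collects independent answers and keeps only claims that reach a quorum. The stub agents below stand in for LLM calls; the real protocol is more structured than this majority filter.

```python
from collections import Counter

# Toy consensus step for multi-agent debate. Agents are stub functions
# returning candidate claims; a real system would call an LLM per agent.
def debate(agents, question: str, quorum: int = 2) -> list[str]:
    votes = Counter()
    for agent in agents:
        for claim in agent(question):
            votes[claim] += 1
    return [claim for claim, n in votes.items() if n >= quorum]

a1 = lambda q: ["effect is significant", "baseline is weak"]
a2 = lambda q: ["effect is significant"]
a3 = lambda q: ["baseline is weak", "data is noisy"]
print(debate([a1, a2, a3], "interpret results"))
# ['effect is significant', 'baseline is weak']
```

The single-agent claim "data is noisy" is dropped, which is the sense in which debate reduces single-point reasoning failures.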
Component 4: CodeAgent v2
Location: researchclaw/agents/code_agent/
Multi-phase code generation system replacing simple single-prompt code generation:
| Phase | Description |
|---|---|
| Architecture Planning | Deep implementation blueprint before coding |
| Sequential Generation | Files generated one-by-one following dependency DAG |
| Hard Validation | AST-based gates blocking identical ablations, hardcoded metrics |
| Execution-in-the-Loop | Fix attempts based on actual execution errors |
```yaml
code_agent:
  enabled: true
  architecture_planning: true
  sequential_generation: true
  hard_validation: true
  hard_validation_max_repairs: 2
  exec_fix_max_iterations: 3
  exec_fix_timeout_sec: 60
```
Component 5: BenchmarkAgent
Location: researchclaw/agents/benchmark_agent/
4-agent pipeline for automated dataset and baseline selection:
```
Surveyor   →   Selector   →   Acquirer   →   Validator
   │              │              │              │
   ▼              ▼              ▼              ▼
Search         Rank &        Download       Validate
HuggingFace    select        datasets       integrity
+ Scholar      by tier       + cache        + format
```
Agents: surveyor.py, selector.py, acquirer.py, validator.py
Configuration:
```yaml
benchmark_agent:
  enabled: true
  enable_hf_search: true
  enable_web_search: true
  tier_limit: 2       # 1=small, 2=medium, 3=large
  min_benchmarks: 1
  min_baselines: 2
```
Component 6: FigureAgent
Location: researchclaw/agents/figure_agent/
5-agent pipeline for academic figure generation:
```
Planner  →  CodeGen   →  Renderer  →  Critic    →  Integrator
   │           │            │           │             │
   ▼           ▼            ▼           ▼             ▼
Plan        Generate     Execute     Critique     Place in
figures     matplotlib   render      quality      paper
needed      code         to PNG      + iterate    + caption
```
Configuration:
```yaml
figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 8
  max_iterations: 3   # Critic-driven refinement
  dpi: 300
  strict_mode: false
```
Component 7: Sentinel Watchdog
Purpose: Background quality monitor running continuously during pipeline execution.
Monitors for:
- NaN/Inf detection: Catches numerical instabilities in experiment results
- Paper-evidence consistency: Verifies claims match experimental data
- Citation relevance scoring: Scores how relevant each citation is to the paper
- Anti-fabrication guard: Flags numbers that don't appear in VerifiedRegistry
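The NaN/Inf trap is the simplest of these checks: scan structured metrics and flag any non-finite value before it can propagate into the paper. A minimal sketch, with illustrative function and metric names:

```python
import math

# Sketch of a sentinel-style NaN/Inf scan over structured metrics.
def non_finite_metrics(metrics: dict) -> list[str]:
    return [k for k, v in metrics.items()
            if isinstance(v, float) and not math.isfinite(v)]

print(non_finite_metrics({"loss": float("nan"), "acc": 0.91, "lr": float("inf")}))
# ['loss', 'lr']
```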
Component 8: Export Engine
Location: researchclaw/pipeline/ (stages 22-23)
Converts Markdown paper to conference-ready LaTeX:
| Template | Target |
|---|---|
| `neurips_2025` | NeurIPS 2025 |
| `iclr_2026` | ICLR 2026 |
| `icml_2026` | ICML 2026 |
Handles: math expressions, tables, figures, cross-references, \cite{} commands, auto-pruned BibTeX.
11 Core Mechanisms (Detailed)
11.1 The PIVOT/REFINE Decision Engine
Stage 15 is the pipeline's most complex decision point. The LLM analyzes experimental results and makes a three-way decision:
PROCEED criteria:
- Results support the hypothesis
- Statistical significance achieved
- Sufficient experimental coverage
- Results are novel relative to baselines

REFINE criteria:
- Results partially support hypothesis but need parameter tuning
- Some experimental conditions failed
- Metrics are close to significance threshold
- Additional iterations likely to help

PIVOT criteria:
- Results contradict the hypothesis
- Fundamental methodology issue identified
- Results are not distinguishable from baselines
- A different research direction is more promising

When PIVOT is triggered:
1. Current results are archived with version tag
2. Pipeline jumps back to Stage 8 (hypothesis generation)
3. Previous failed hypotheses are provided as negative context
4. New hypotheses must differ from all previous attempts
5. The cycle continues with fresh experiment design
This creates a closed-loop research process that can autonomously recover from dead-end research directions — a capability absent from most competing systems.
11.2 Anti-Fabrication System
The VerifiedRegistry is a defense against the most dangerous failure mode of LLM-generated research papers: fabricated experimental results.
Problem: LLMs can generate plausible-looking experimental results that have no basis in actual computation. This is the single biggest threat to the credibility of AI-generated research.
Solution architecture:
```
┌────────────────────────────────────────────────────────┐
│                    VerifiedRegistry                    │
│                                                        │
│  experiment_id → {                                     │
│    conditions: [...],                                  │
│    metrics: {                                          │
│      "accuracy": 0.847,   ← from actual execution      │
│      "f1_score": 0.812,   ← from actual execution      │
│      "train_time": 142.3  ← from actual execution      │
│    },                                                  │
│    execution_log: "...",                               │
│    timestamp: "2026-03-28T14:32:00Z"                   │
│  }                                                     │
│                                                        │
│  Enforcement:                                          │
│  • Paper writing stage queries registry                │
│  • Only registry-verified numbers may appear in text   │
│  • Unverified claims → sanitized or flagged            │
│  • Tables must reference experiment_ids                │
│                                                        │
│  Repair loop (if experiments failed):                  │
│  1. Diagnose failure cause                             │
│  2. Generate repair code (via OpenCode)                │
│  3. Re-execute (up to max_cycles=3)                    │
│  4. If min_completion_rate not met → degrade gracefully│
└────────────────────────────────────────────────────────┘
```
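The enforcement core amounts to checking every number in draft text against the registry. A minimal sketch of that idea (a real implementation would also track units, rounding, and table references; the metric values reuse the example above):

```python
import re

# Toy anti-fabrication pass: flag decimal numbers in draft text that do
# not match any registry-verified metric value. Illustrative only.
VERIFIED = {"accuracy": 0.847, "f1_score": 0.812, "train_time": 142.3}

def unverified_numbers(text: str) -> list[str]:
    verified_strs = {f"{v:g}" for v in VERIFIED.values()}
    found = re.findall(r"\d+\.\d+", text)
    return [n for n in found if n not in verified_strs]

draft = "Our method reaches 0.847 accuracy and 0.93 F1."
print(unverified_numbers(draft))  # ['0.93'] — flagged for sanitization
```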
11.3 Four-Layer Citation Verification
Academic credibility requires real references. AutoResearchClaw implements 4 verification layers:
```
Layer 1: arXiv ID Check
  └─ If citation claims arXiv ID → verify it exists via arXiv API
  └─ If ID is invalid → remove citation

Layer 2: CrossRef / DataCite DOI
  └─ Verify DOI resolves to a real publication
  └─ Check metadata (title, authors, year) matches claim

Layer 3: Semantic Scholar Title Match
  └─ Search paper title in Semantic Scholar
  └─ Fuzzy match to handle minor title variations
  └─ Verify authors and venue match

Layer 4: LLM Relevance Scoring
  └─ Even if citation is real, is it relevant to the paper?
  └─ Score relevance (0-1)
  └─ Remove citations below threshold
```
Result: Only citations that are (a) real and (b) relevant survive.
This addresses a critical weakness of AI Scientist and similar systems where hallucinated references undermined paper credibility.
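Structurally, the four layers form a short-circuit chain: a citation survives only if every layer passes. The predicates below are stubs (the real layers call arXiv, CrossRef/DataCite, Semantic Scholar, and an LLM relevance scorer); all names and heuristics here are assumptions.

```python
# Skeleton of 4-layer citation verification as an all-layers-must-pass
# chain. Each layer is a stub predicate standing in for a network call.
def verify_citation(cite: dict, layers) -> bool:
    return all(layer(cite) for layer in layers)

arxiv_ok = lambda c: "arxiv_id" not in c or c["arxiv_id"].count(".") == 1
doi_ok   = lambda c: "doi" not in c or c["doi"].startswith("10.")
title_ok = lambda c: bool(c.get("title"))
relevant = lambda c: c.get("relevance", 0.0) >= 0.5  # LLM score threshold

cite = {"arxiv_id": "1706.03762", "title": "Attention Is All You Need",
        "relevance": 0.9}
print(verify_citation(cite, [arxiv_ok, doi_ok, title_ok, relevant]))  # True
```

A citation with a missing title or a sub-threshold relevance score fails the chain and is removed, matching the "real and relevant" rule above.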
11.4 Hardware-Aware Experiment Adaptation
The system auto-detects available hardware and adapts experiments accordingly:
```
Hardware Detection:
┌─────────────┐
│ NVIDIA GPU  │ → CUDA mode → full-scale training
│ (detected)  │   PyTorch CUDA, large batch sizes
└─────────────┘
┌─────────────┐
│ Apple MPS   │ → MPS mode → adapted scale
│ (detected)  │   PyTorch MPS, reduced batch sizes
└─────────────┘
┌─────────────┐
│ CPU only    │ → CPU mode → minimal experiments
│ (fallback)  │   Small models, few epochs, sklearn focus
└─────────────┘
```
Code generation adapts: imports, model sizes, batch sizes, training epochs, and package selection are all adjusted based on detected hardware tier.
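A detection routine of this kind reduces to a cascade of availability checks with a CPU fallback; a hedged sketch (the function name is illustrative, and the real adapter additionally tunes batch sizes and epochs per tier):

```python
# Sketch of hardware-tier detection with graceful CPU fallback.
def detect_tier() -> str:
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"       # full-scale training
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"        # adapted scale
    except ImportError:
        pass                    # torch not installed: minimal experiments
    return "cpu"

print(detect_tier())
```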
11.5 Experiment Sandbox Execution
The sandbox provides safe, reproducible experiment execution:
AST validation (pre-execution):
- Parse generated code as a Python AST
- Check for prohibited constructs (network access, file system escape)
- Verify import whitelist compliance
- Block identical ablations (same code, different variable names)
- Detect hardcoded metrics (fabrication attempt)

Execution guardrails:
- Memory limit (configurable, default 4096 MB)
- Time budget (configurable, default 300 s)
- NaN/Inf fast-fail: detect and abort on numerical instabilities
- Partial result capture: save intermediate results even on failure
- Self-healing: on failure, diagnose error → generate fix → retry (up to 10 rounds)
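The import-whitelist check is the most mechanical of the AST gates; a minimal sketch using the stdlib `ast` module (the whitelist contents and function name are illustrative, and the real validator blocks more constructs than imports):

```python
import ast

# Sketch of pre-execution AST validation: reject generated code whose
# top-level imports fall outside an allowed set.
ALLOWED = {"math", "numpy", "sklearn", "json"}

def imports_allowed(source: str) -> bool:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(name not in ALLOWED for name in names):
            return False
    return True

print(imports_allowed("import numpy as np\nx = np.ones(3)"))        # True
print(imports_allowed("import socket\ns = socket.socket()"))        # False
```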
Docker mode adds:
- Network policy (none, setup_only, pip_only, full)
- Container isolation
- Auto-install dependencies (detect imports → generate requirements.txt)
- GPU passthrough
11.6 Skills System
AutoResearchClaw implements a skills system inspired by Claude Code's SKILL.md format:
```
Skills Loading:
1. Built-in skills (19 shipped)  ← researchclaw package
2. Project-local skills          ← .claude/skills/ directory
3. User-installed skills         ← researchclaw skills install
4. Team-shared skills            ← custom_dirs in config
5. Community skills              ← K-Dense-AI/claude-scientific-skills (150+)
```
Each skill is a SKILL.md file with YAML frontmatter:
```markdown
---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]
```
Skills are loaded and injected into LLM prompts automatically at applicable stages. This enables domain-specific expertise without modifying the core pipeline.
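Loading such a file reduces to splitting the frontmatter from the body and reading its fields. The sketch below avoids a YAML dependency by handling only flat `key: value` lines, which is enough to illustrate stage-based skill selection; the real loader presumably uses a proper YAML parser.

```python
# Minimal SKILL.md loader sketch: split "---"-delimited frontmatter from
# the body and parse flat key/value lines. Illustrative only.
def load_skill(text: str) -> tuple[dict, str]:
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

skill_md = """---
name: scientific-writing
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]"""

meta, body = load_skill(skill_md)
print(meta["name"])  # scientific-writing
```

A loader like this lets the pipeline inject `body` into prompts only at the stages named in `applicable-stages`.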
Notable built-in skills:
- scientific-writing — IMRAD structure, citation formatting
- chemistry-rdkit — Molecular analysis, SMILES, drug discovery
- literature-search — Systematic review, PRISMA methodology
- hypothesis-formulation — Testable hypothesis construction
- statistical-reporting — Statistical analysis and reporting standards
- a-evolve — Agentic evolution (community-contributed from A-Evolve)
12 Programming Language
System Implementation
| Component | Language | Framework |
|---|---|---|
| Pipeline orchestrator | Python | Custom framework |
| Stage implementations | Python | — |
| Agent subsystems | Python | Custom agent base class |
| CLI interface | Python | Click (via researchclaw command) |
| Configuration | YAML | Parsed by Python |
| Prompts | YAML | prompts.default.yaml |
| Generated experiments | Python | PyTorch, scikit-learn, etc. |
| Generated papers | Markdown → LaTeX | Jinja2 templates |
Codebase Structure
```
AutoResearchClaw/
├── researchclaw/                  # Main package
│   ├── __init__.py
│   ├── __main__.py
│   ├── adapters.py                # LLM provider adapters
│   ├── agents/                    # Multi-agent subsystems
│   │   ├── base.py                # BaseAgent ABC
│   │   ├── benchmark_agent/       # 4-agent benchmark pipeline
│   │   │   ├── surveyor.py
│   │   │   ├── selector.py
│   │   │   ├── acquirer.py
│   │   │   └── validator.py
│   │   ├── code_agent/            # Multi-phase code generation
│   │   │   ├── architect.py
│   │   │   ├── builder.py
│   │   │   └── validator.py
│   │   ├── code_searcher/         # Code search agent
│   │   ├── debate/                # Multi-agent debate
│   │   └── figure_agent/          # 5-agent figure pipeline
│   │       ├── planner.py
│   │       ├── codegen.py
│   │       ├── renderer.py
│   │       ├── critic.py
│   │       └── integrator.py
│   ├── cli/                       # CLI entry points
│   ├── config/                    # Configuration management
│   ├── knowledge/                 # Knowledge base
│   ├── literature/                # Literature search APIs
│   ├── pipeline/                  # Pipeline orchestrator
│   │   ├── runner.py              # Main pipeline runner
│   │   ├── stages/                # Individual stage implementations
│   │   └── checkpoint.py          # State persistence
│   ├── sandbox/                   # Experiment execution
│   ├── sentinel/                  # Quality watchdog
│   ├── skills/                    # Skills management
│   ├── templates/                 # LaTeX templates
│   └── verification/              # Citation verification
├── .claude/skills/                # Built-in SKILL.md files
├── config.researchclaw.example.yaml
├── prompts.default.yaml           # Default LLM prompts
├── pyproject.toml
├── docs/                          # Documentation
└── tests/                         # Test suite (1,823 tests)
```
Dependencies (from pyproject.toml)
Key dependencies include:
- LLM integration: openai, httpx (for API calls)
- Literature: requests (for academic APIs)
- LaTeX: jinja2 (for template rendering)
- Data processing: pyyaml (JSON via the stdlib json module)
- CLI: click (with stdlib argparse as a fallback)
- AST validation: Python ast module (stdlib)
13 Memory Management
Run-Level Memory: Knowledge Base
Every pipeline run builds a structured knowledge base across 6 categories:
| Category | Contents | Persistence |
|---|---|---|
| Decisions | Research direction choices, PIVOT/REFINE rationale | Per-run |
| Experiments | Code, configurations, results, failure logs | Per-run |
| Findings | Key results, statistical analyses, insights | Per-run |
| Literature | Paper summaries, knowledge cards, citation metadata | Per-run |
| Questions | Open research questions, hypotheses tested | Per-run |
| Reviews | Peer review feedback, revision history | Per-run |
Backend options:
knowledge_base:
  backend: "markdown"   # or "obsidian"
  root: "docs/kb"
Cross-Run Memory: MetaClaw
MetaClaw provides persistent cross-run learning through a lesson → skill pipeline:
Run N:
Pipeline executes → failures/warnings captured as Lessons
Lesson structure:
- stage: which pipeline stage
- severity: warning | error | critical
- category: code_gen | experiment | literature | writing
- description: what went wrong
- resolution: how it was fixed (if auto-resolved)
- timestamp: when it occurred
↓ MetaClaw Processing ↓
Lesson → Skill conversion:
- Filter by min_severity (default: warning)
- Extract actionable pattern
- Generate SKILL.md with prevention instructions
- Store as arc-* skill in ~/.metaclaw/skills/
- Max skills_per_run: 3
↓ Next Run ↓
Run N+1:
build_overlay() at pipeline start:
- Load all arc-* skills from ~/.metaclaw/skills/
- Apply 30-day time-decay weighting
- Inject relevant skills into every stage's LLM prompt
- LLM avoids known pitfalls → fewer retries
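The lesson → skill conversion above can be sketched in a few lines. This is a hedged illustration of the described behavior, not the MetaClaw implementation: the `Lesson` dataclass mirrors the lesson structure listed above, while `lessons_to_skills` and the generated skill body format are assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch of MetaClaw's lesson -> skill conversion;
# names and the rendered SKILL.md body are assumptions.
SEVERITY_RANK = {"warning": 0, "error": 1, "critical": 2}

@dataclass
class Lesson:
    stage: int
    severity: str        # warning | error | critical
    category: str        # code_gen | experiment | literature | writing
    description: str
    resolution: str = ""
    timestamp: float = 0.0

def lessons_to_skills(lessons, min_severity="warning", skills_per_run=3):
    """Filter by min_severity, keep the most severe lessons,
    and render each as an arc-* SKILL.md string."""
    threshold = SEVERITY_RANK[min_severity]
    kept = [l for l in lessons if SEVERITY_RANK[l.severity] >= threshold]
    kept.sort(key=lambda l: SEVERITY_RANK[l.severity], reverse=True)
    skills = []
    for lesson in kept[:skills_per_run]:  # max skills_per_run per run
        name = f"arc-{lesson.category}-stage{lesson.stage}"
        body = (
            f"---\nname: {name}\n"
            f"applicable-stages: [{lesson.stage}]\nenabled: true\n---\n"
            f"Known pitfall: {lesson.description}\n"
            f"Prevention: {lesson.resolution or 'avoid the pattern above'}\n"
        )
        skills.append((name, body))
    return skills
```

The resulting strings would be written to ~/.metaclaw/skills/ for the next run's overlay.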
Stage-Level Memory: Context Accumulation
Within a single run, each stage has access to:
- Pipeline state: All outputs from all completed stages
- Knowledge base: Accumulated findings, decisions, literature
- Experiment history: All previous experimental results
- Artifact versions: All versions from REFINE/PIVOT loops
When using ACP mode, the agent CLI maintains full conversation history across all 23 stages, providing maximum context continuity.
Memory Isolation
| Scope | Persistence | Sharing |
|---|---|---|
| Within-stage | Ephemeral | Stage-internal only |
| Within-run | Run directory | All stages in same run |
| Cross-run (MetaClaw) | ~/.metaclaw/skills/ | All future runs |
| Cross-project | Not implemented | — |
14 Continued Learning
Self-Learning via MetaClaw Integration
AutoResearchClaw's most distinctive continued learning mechanism is the MetaClaw bridge:
Learning Loop:
┌────────────────────────────────────────────────────┐
│ │
│ Run 1: First execution │
│ • Encounters Stage 12 timeout (experiment too slow)│
│ • Auto-repairs by reducing batch size │
│ • Lesson captured: "reduce batch size on timeout" │
│ ↓ │
│ MetaClaw converts → arc-experiment-timeout SKILL.md │
│ ↓ │
│ Run 2: Second execution (different topic) │
│ • SKILL injected into Stage 12 prompt │
│ • LLM proactively uses smaller batch size │
│ • No timeout → no retry needed │
│ • Robustness improved │
│ │
└────────────────────────────────────────────────────┘
Time-decay: Skills have a 30-day decay period, preventing stale lessons from dominating. Recent lessons are weighted more heavily than old ones.
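The 30-day window is stated above, but the exact decay curve is not documented; a minimal sketch, assuming a simple linear falloff to zero over 30 days:

```python
import time

# Assumption: linear decay over the documented 30-day window.
# The actual MetaClaw weighting function may differ.
DECAY_SECONDS = 30 * 24 * 3600  # 30-day decay period

def skill_weight(created_ts, now=None):
    """Weight in [0, 1]: 1.0 for a brand-new skill, 0.0 once 30 days old."""
    if now is None:
        now = time.time()
    age = max(0.0, now - created_ts)
    return max(0.0, 1.0 - age / DECAY_SECONDS)
```

Under this scheme a 15-day-old skill carries half the weight of a fresh one, so recent lessons dominate the injected overlay.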
Measured impact (from controlled A/B experiments):
- Stage retry rate: −24.8%
- Refine cycle count: −40.0%
- Pipeline completion: +5.3%
- Overall robustness: +18.3%
Within-Run Iterative Refinement
The REFINE loop (Stage 15 → Stage 13) provides within-run learning:
- Each REFINE cycle builds on previous results
- Parameter adjustments are informed by all prior iterations
- Up to max_iterations (default: 10) refinement cycles
- Artifacts are versioned to track improvement trajectory
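A toy sketch of the REFINE-vs-PIVOT decision implied by this loop. The score-plateau rule and the `min_gain` threshold are illustrative assumptions; the actual decision is made by the LLM at Stage 15.

```python
# Toy REFINE/PIVOT rule; the plateau test and min_gain threshold
# are assumptions for illustration, not the actual Stage 15 logic.
def refine_or_pivot(scores, min_gain=0.01, max_iterations=10):
    """REFINE while results keep improving; PIVOT once progress stalls
    or the iteration budget (max_iterations) is spent."""
    if len(scores) >= max_iterations:
        return "PIVOT"
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "PIVOT"
    return "REFINE"
```

This captures the exploit/explore trade-off: keep refining while each cycle pays off, restart in a new direction when it does not.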
Experiment Self-Healing
The sandbox executor implements a diagnosis-repair loop:
Execute code → failure detected
↓
Diagnose error type:
• Import error → add dependency
• Runtime error → modify code
• Timeout → reduce scale
• NaN/Inf → add numerical guards
↓
Generate repair (LLM or OpenCode)
↓
Re-execute (up to exec_fix_max_iterations=3)
↓
If still failing → capture partial results → degrade gracefully
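The diagnosis-repair loop above can be sketched as a small retry harness. This is a hedged outline, not the sandbox executor's code: `execute`, `diagnose`, and `repair` are placeholders for the real components, and the result-dict shape is an assumption.

```python
# Sketch of the diagnose-repair loop; execute/diagnose/repair are
# placeholders for the sandbox executor's actual components.
def run_with_self_healing(code, execute, diagnose, repair, max_fix_iterations=3):
    """Execute code, repairing and retrying up to exec_fix_max_iterations
    times, then degrade gracefully with any partial results."""
    for attempt in range(max_fix_iterations + 1):
        result = execute(code)
        if result.get("ok"):
            return result
        if attempt == max_fix_iterations:
            break
        error_kind = diagnose(result["error"])  # import / runtime / timeout / nan
        code = repair(code, error_kind)         # LLM- or rule-generated fix
    # Still failing: keep whatever partial output exists
    return {"ok": False, "partial": result.get("partial"), "error": result["error"]}
```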
Skills Library as Accumulated Knowledge
The skills system functions as a growing knowledge base:
- Built-in skills (19): Curated by the development team
- Community skills (150+ via K-Dense-AI): Crowdsourced scientific knowledge
- MetaClaw-generated skills: Automatically created from pipeline failures
- Custom skills: User/team-specific knowledge
Over time, a research group's skill library accumulates domain-specific knowledge that makes the pipeline increasingly effective for their particular research area.
Process Reward Model (PRM) — Optional Quality Gate
MetaClaw optionally integrates a Process Reward Model for quality gating:
metaclaw_bridge:
  prm:
    enabled: false              # Opt-in
    model: "gpt-5.4"            # PRM judge model
    votes: 3                    # Majority vote
    gate_stages: [5, 9, 15, 20]
When enabled, an LLM-as-judge evaluates stage outputs and blocks low-quality results from proceeding — adding another layer of quality control beyond the standard gate stages.
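The majority-vote gating configured above reduces to a few lines. A minimal sketch, where `judge_fn` stands in for the LLM-as-judge call and is an assumption:

```python
# Minimal majority-vote gate; judge_fn is a placeholder for the
# LLM-as-judge call, not an actual MetaClaw API.
def prm_gate(stage_output, judge_fn, votes=3):
    """Ask the judge model `votes` times; pass only on a strict majority."""
    approvals = sum(1 for _ in range(votes) if judge_fn(stage_output))
    return approvals * 2 > votes
```

With votes=3, at least two approvals are required before the stage output proceeds.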
15 Applications
Primary Application: Automated Paper Generation
The primary use case is generating conference-ready academic papers from a research topic:
researchclaw run --topic "Investigating the role of attention sparsity \
in reducing transformer inference cost" --auto-approve
Output: Complete paper with real literature, executed experiments, verified results, peer review, and LaTeX export.
Research Workflow Integration
| Use Case | Description | Configuration |
|---|---|---|
| Literature review | Run phases A-C only for systematic literature review | Stop at Stage 7 |
| Experiment design | Run phases A-D for designed experiments without execution | Stop at Stage 11 |
| Full autonomy | Complete pipeline with --auto-approve | Default |
| Supervised research | Pipeline pauses at 3 gates for human review | Without --auto-approve |
| Chat-driven | Via OpenClaw: "Research X" in Discord/Telegram/Slack | OpenClaw bridge enabled |
Platform Integration
AutoResearchClaw supports deployment across multiple interfaces:
┌──────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ │
│ CLI OpenClaw Python API │
│ researchclaw "Research X" Runner(config) │
│ run --topic via Discord/ .run() │
│ "..." Telegram/etc. │
│ │
│ Claude Code Copilot CLI Any AI CLI │
│ "Run research researchclaw Provide │
│ on [topic]" run --topic AGENTS.md │
│ via ACP as context │
└──────────────────────────────────────────────────────┘
Target Users
| User Type | Value Proposition |
|---|---|
| PhD students | Rapid prototyping of research directions; literature review automation |
| Research labs | High-throughput hypothesis testing across multiple topics |
| Industry R&D | Quick feasibility studies and literature surveys |
| Interdisciplinary teams | Domain-agnostic pipeline works across fields |
Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Paper quality ceiling | Not yet at top-venue acceptance quality | Multi-agent review + MetaClaw improves over time |
| Experiment scale | Sandbox experiments are small-scale | Docker mode + SSH remote for larger experiments |
| Domain expertise | No deep domain knowledge beyond LLM training data | Skills system adds domain knowledge; community skills |
| Fabrication risk | Despite VerifiedRegistry, subtle fabrication possible | Sentinel watchdog + human review at gates |
| LLM cost | Full runs cost $15–50 with GPT-4o | Fallback models, budget control in config |
| Novelty assessment | Cannot reliably assess if research is truly novel | Human judgment required for novelty claims |
| No peer acceptance | No evidence of generated papers passing real peer review | Showcase papers are self-evaluated only |
Comparison with Related Systems
| System | Full Pipeline | Real Literature | Experiments | Self-Learning | Open Source |
|---|---|---|---|---|---|
| AutoResearchClaw | 23 stages | OpenAlex + S2 + arXiv | Sandbox/Docker | MetaClaw | MIT |
| AI Scientist (Sakana) | Partial | No (hallucinated) | Limited | No | Apache 2.0 |
| AIRA₂ (Meta) | Experiments only | N/A | Full-scale GPU | Within-task evolution | Not released |
| FARS (Analemma) | Full pipeline | Unknown | Unknown | Unknown | Proprietary |
| AutoResearch (Karpathy) | Partial | Partial | No | No | Open |
Connections to OmniEvolve
AutoResearchClaw's architecture maps to several OmniEvolve design patterns:
| AutoResearchClaw Component | OmniEvolve Equivalent |
|---|---|
| 23-stage pipeline orchestrator | omnievolve/orchestrator/ experiment lifecycle |
| PIVOT/REFINE decision loop | Adaptive search strategy in omnievolve/search/ |
| MetaClaw cross-run learning | omnievolve/knowledge/ learning logs and skills |
| VerifiedRegistry anti-fabrication | omnievolve/evaluation/ cascade evaluator integrity |
| Multi-agent debate | Multi-operator mutation in omnievolve/mutation/ |
| Skills system | omnievolve/plugins/ plugin discovery |
| Sentinel watchdog | omnievolve/safety/ audit and policy enforcement |
| CodeAgent/BenchmarkAgent/FigureAgent | Specialized omnievolve/mutation/ operators |
| Experiment sandbox | omnievolve/safety/ sandbox execution |
| Knowledge base | omnievolve/knowledge/ structured knowledge storage |
The PIVOT/REFINE mechanism is particularly relevant to OmniEvolve as it represents a form of adaptive search where the system autonomously decides when to refine (exploit) vs. when to restart (explore) — a fundamental exploration-exploitation decision analogous to island-based search with restart policies.
References
- Liu, J., et al. (2026). "AutoResearchClaw: Fully Autonomous Research from Idea to Paper." GitHub repository, aiming-lab/AutoResearchClaw.
- Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI. arXiv:2408.06292.
- Karpathy, A. (2025). "AutoResearch." GitHub repository.
- Analemma (2025). "FARS: Fully Automated Research System." Blog post.
- MetaClaw: github.com/aiming-lab/MetaClaw
- OpenClaw: github.com/openclaw/openclaw
- OpenCode: github.com/anomalyco/opencode
- A-Evolve: github.com/A-EVO-Lab/a-evolve
- K-Dense-AI Claude Scientific Skills: github.com/K-Dense-AI/claude-scientific-skills
Classification: Autoresearch — AutoResearchClaw is a fully autonomous AI system that conducts the complete research process from idea to paper, including literature review, hypothesis generation, experiment execution, result analysis, paper writing, and peer review. It is a prototypical autoresearch system — an agent harness that automates the end-to-end scientific research workflow. Its MetaClaw integration adds self-evolving capability, and its PIVOT/REFINE mechanism introduces adaptive search over research directions.