AutoResearchClaw
Part P07: Autonomous Research Systems
35.1 Overview & Motivation
The automation of scientific research represents one of the most ambitious applications of large language models. While prior systems such as AI Scientist (Lu et al., 2024; Sakana AI) demonstrated that LLMs could draft research papers, and AIRA₂ (Meta FAIR) showed that LLM-guided evolution could solve competition-grade machine learning tasks, neither achieved a full closed-loop pipeline from research idea to conference-formatted manuscript with verified experiments and real citations. AutoResearchClaw, released in March 2026 by the AIMING Lab — a multi-university collaboration spanning UC Santa Cruz, UNC Chapel Hill, Johns Hopkins, and UC Davis — targets this gap.
The system's tagline, "Chat an Idea. Get a Paper," captures its ambition: a user provides a research topic as a text string, and the system autonomously produces a complete academic paper including literature review with real references, executed and verified experiments, multi-agent peer review, and LaTeX export in NeurIPS/ICML/ICLR format. The name "Claw" references the lobster emoji in the project's branding (source: repository README).
Provenance Note: Verified vs. Inferred Content
This chapter draws on the AutoResearchClaw GitHub repository (MIT license), its README, configuration examples, and linked companion repositories. The following conventions are used throughout:
- Repository-documented: facts directly stated in the README, YAML config examples, CLI documentation, or visible directory structure. These are marked with "(source: repository README)" or similar tags.
- Author formalization: mathematical equations and formal models constructed by the survey author to capture documented behavior in precise notation. These are explicitly labeled as analytical interpretations.
- Abstract pseudocode: code blocks written by the survey author to illustrate documented pipeline behavior. These are not excerpts from the repository and do not reflect actual class names, method signatures, or internal APIs. They are labeled accordingly.
Where the chapter describes internal mechanisms (e.g., how the PIVOT/REFINE decision is implemented, how the VerifiedRegistry enforces claims), the level of detail is bounded by what the repository README and configuration files disclose. Internal implementation details beyond public documentation are not available for verification.
Key Contribution
AutoResearchClaw's principal contribution is a 23-stage deterministic pipeline that integrates research scoping, real literature discovery (via OpenAlex, Semantic Scholar, and arXiv APIs), multi-agent hypothesis debate, sandboxed experiment execution with self-healing, a PIVOT/REFINE decision loop for autonomous research-direction control, a VerifiedRegistry anti-fabrication system that prevents LLM-generated numbers from entering the paper, four-layer citation verification, and cross-run learning via the companion MetaClaw system. To the best of our knowledge based on publicly available systems surveyed in this book, the combination of anti-fabrication enforcement with adaptive research-direction control and real citation verification has not been demonstrated in a single open-source autoresearch system prior to this release — though this claim is bounded by the survey's coverage and the system's own lack of external validation (see Section 35.6.1).
35.1.1 Predecessor Lineage
The repository README explicitly acknowledges inspiration from three systems: AI Scientist (Sakana AI, 2024), AutoResearch (Karpathy, 2025), and FARS (Analemma, 2025). AutoResearchClaw addresses specific weaknesses in each predecessor as documented by its authors:
| Predecessor | Weakness Addressed | AutoResearchClaw Solution | Criteria for Comparison |
|---|---|---|---|
| AI Scientist (Lu et al., 2024) | Hallucinated references; limited experiment fidelity | 4-layer citation verification; sandboxed execution with VerifiedRegistry | Lu et al. (2024) report hallucinated references as a known failure mode; no citation verification pipeline described in their paper |
| AutoResearch (Karpathy, 2025) | Partial pipeline (no experiment execution) | Full 23-stage pipeline with code generation and sandbox | AutoResearch repository scope limited to literature search and writing; no experiment execution stages documented |
| FARS (Analemma, 2025) | Proprietary; unknown internals | MIT-licensed; fully open source | FARS is a proprietary system with no public source code or detailed methodology disclosure |
Note: the comparisons above reflect AutoResearchClaw's authors' framing. The AI Scientist comparison is partially supported by Lu et al. (2024), who acknowledge citation quality issues. The AutoResearch and FARS comparisons rely on AutoResearchClaw README claims and cannot be independently verified against those systems' full capabilities.
The system also ships with two companion projects: MetaClaw, a cross-run learning engine that extracts skills from pipeline failures, and OpenClaw, an AI assistant platform providing a chat interface for pipeline orchestration via Discord, Telegram, and Slack (source: repository README and linked repositories).
35.1.2 Development Velocity
A notable characteristic is the project's rapid iteration: six significant releases in about one week (v0.1.0 on March 15, 2026 through v0.3.2 on March 22, 2026), with continued updates through at least March 30. This pace, while indicative of active development, also suggests potential API instability, a concern for reproducibility discussed further in Section 35.5.
35.2 Architecture
35.2.1 Pipeline Overview
AutoResearchClaw organizes research automation into eight phases containing 23 sequential stages. The pipeline is managed by an orchestrator (documented location: researchclaw/pipeline/runner.py) that provides checkpoint/resume capability, gate management, loop control, and artifact versioning (source: repository README and directory structure).
Figure 35.1: The 23-stage pipeline architecture of AutoResearchClaw. Quality gates (⊘) at stages 5, 9, and 20 optionally pause for human review. Multi-agent debate stages (◈) use structured multi-perspective LLM reasoning. The PIVOT/REFINE decision at Stage 15 enables autonomous research-direction control. Source: repository README and documented directory structure.
35.2.2 Architectural Decisions
Several design choices distinguish AutoResearchClaw's architecture from simpler autoresearch pipelines (all sourced from repository README unless otherwise noted):
Sequential determinism with loop escapes. The 23 stages execute in fixed order, providing checkpoint/resume capability via the --resume flag (added in v0.3.2). However, Stage 15's PIVOT/REFINE decision introduces controlled non-linearity: REFINE loops back to Stage 13 (iterative refinement), while PIVOT jumps back to Stage 8 (hypothesis generation). Artifacts are auto-versioned across loops to prevent data loss.
Three quality gates. Stages 5 (literature screening), 9 (experiment design), and 20 (final quality) pause for optional human approval. The --auto-approve flag bypasses all gates for fully autonomous operation. This design acknowledges that full autonomy is desirable for throughput but risky for quality — a tension that permeates autoresearch system design.
Pluggable LLM backend with ACP. Beyond standard OpenAI-compatible API calls, the system supports Agent Client Protocol (ACP) via the acpx library, which delegates LLM calls to external CLI agents (Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI). In ACP mode, a single persistent session accumulates context across all 23 stages, potentially improving coherence (source: repository README, v0.3.2 changelog).
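The control flow described above (fixed stage order, two loop escapes at Stage 15, and three optional gates) can be sketched as a small transition function. This is an illustrative sketch by the survey author, not repository code: the names `Decision`, `next_stage`, `needs_approval`, and `trace` are hypothetical, and only the stage numbers and the effect of `--auto-approve` come from the README.

```python
from enum import Enum
from typing import Optional

# ILLUSTRATIVE SKETCH by the survey author; not repository code.


class Decision(Enum):
    PROCEED = "PROCEED"
    REFINE = "REFINE"
    PIVOT = "PIVOT"


GATE_STAGES = {5, 9, 20}  # documented quality gates
DECISION_STAGE = 15       # PIVOT/REFINE decision
REFINE_TARGET = 13        # iterative refinement
PIVOT_TARGET = 8          # hypothesis generation
FINAL_STAGE = 23


def next_stage(stage: int, decision: Optional[Decision] = None) -> Optional[int]:
    """Return the stage that follows `stage`, or None once the pipeline is done."""
    if stage == DECISION_STAGE:
        if decision is Decision.REFINE:
            return REFINE_TARGET
        if decision is Decision.PIVOT:
            return PIVOT_TARGET
    if stage >= FINAL_STAGE:
        return None
    return stage + 1


def needs_approval(stage: int, auto_approve: bool) -> bool:
    """A gate pauses for human review unless auto-approve is set."""
    return stage in GATE_STAGES and not auto_approve


def trace(decisions: list) -> list:
    """List the stages visited when Stage 15 yields `decisions` in order.

    Once the supplied decisions are exhausted, Stage 15 is assumed to PROCEED.
    """
    pending = iter(decisions)
    visited, stage = [], 1
    while stage is not None:
        visited.append(stage)
        decision = next(pending, Decision.PROCEED) if stage == DECISION_STAGE else None
        stage = next_stage(stage, decision)
    return visited


print(trace([Decision.REFINE])[12:19])  # → [13, 14, 15, 13, 14, 15, 16]
```

Tracing a run with a single REFINE shows Stages 13 through 15 executing twice, which is the situation the documented artifact auto-versioning is meant to handle.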
35.2.3 Multi-Agent Subsystems
The system employs specialized agent subsystems for tasks that exceed single-prompt LLM generation. Three subsystems are documented in the repository's directory structure under researchclaw/agents/:
| Agent | Documented Location | Pipeline (from README) | Purpose |
|---|---|---|---|
| CodeAgent v2 | researchclaw/agents/code_agent/ | Architect → Builder → Validator | Multi-phase code generation with AST validation |
| BenchmarkAgent | researchclaw/agents/benchmark_agent/ | Surveyor → Selector → Acquirer → Validator | Automated dataset and baseline selection |
| FigureAgent | researchclaw/agents/figure_agent/ | Planner → CodeGen → Renderer → Critic → Integrator | Academic figure generation with iterative refinement |
Note: the directory structure and sub-agent names above are documented in the repository README and visible in the repository file tree. The internal class names, method signatures, and agent communication protocols are not disclosed in public documentation and are therefore not described here.
Additionally, a multi-agent debate system in researchclaw/agents/debate/ is used at stages 8 (hypothesis generation), 14 (result analysis), and 18 (peer review). Each debate uses multiple LLM "perspectives" to reduce single-point reasoning failures, though the exact debate protocol and number of agents per debate are not documented in public materials.
35.3 Core Algorithms
35.3.1 The PIVOT/REFINE Decision Engine
Stage 15 is the pipeline's most architecturally significant component. After result analysis in Stage 14, the system makes a three-way autonomous decision about research direction. This mechanism transforms AutoResearchClaw from a linear pipeline into an adaptive search process over the space of research directions — a direct analogue to the exploration-exploitation tradeoff in evolutionary search.
The decision criteria, as documented in the repository README:
| Decision | Target Stage | Trigger Conditions (repository-documented) |
|---|---|---|
| PROCEED | Stage 16 (paper writing) | Results support hypothesis; statistical significance achieved; sufficient coverage; novelty relative to baselines |
| REFINE | Stage 13 (iterative refinement) | Partial support; some conditions failed; metrics near significance; additional iterations likely to help |
| PIVOT | Stage 8 (hypothesis generation) | Results contradict hypothesis; fundamental methodology issue; results indistinguishable from baselines |
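Author formalization. We model the decision as a function over the experimental outcome space to make the documented criteria precise. Let $R = \{r_1, r_2, \ldots, r_m\}$ denote the set of experimental results from Stage 14, where each $r_i$ comprises a metric name, observed value, baseline value, and statistical significance indicator. Let $H$ denote the current hypothesis and $B$ the set of baseline results. The decision function $\delta$ maps these to one of three actions:

$$\delta(R, H, B, k) = \begin{cases} \text{PROCEED} & \text{if } \texttt{support}(R, H) \geq \tau_p \text{ and } \texttt{novelty}(R, B) \geq \tau_n \\ \text{REFINE} & \text{if } \tau_r \leq \texttt{support}(R, H) < \tau_p \text{ and } k < k_{\max} \\ \text{PIVOT} & \text{otherwise} \end{cases}$$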
| Symbol | Definition | Implementation Artifact | Status |
|---|---|---|---|
| $\texttt{support}(R, H)$ | Composite measure of how well results $R$ support hypothesis $H$ (effect size + statistical significance) | Stage 15 LLM prompt; criteria listed in README | Author formalization of documented criteria |
| $\texttt{novelty}(R, B)$ | Improvement of results over baselines $B$ | Stage 15 LLM prompt; "novel relative to baselines" criterion | Author formalization of documented criteria |
| $\tau_p, \tau_n$ | PROCEED thresholds for support and novelty | Not exposed in public config | Author-introduced notation; no documented config field |
| $\tau_r$ | REFINE lower threshold for support | Not exposed in public config | Author-introduced notation; no documented config field |
| $k$ | Current refinement iteration count | Pipeline loop counter | Repository-documented (README mentions loop tracking) |
| $k_{\max}$ | Maximum allowed refinement iterations | max_iterations; README documents default of 10 | Repository-documented |
Important caveat: This formalization is an analytical interpretation by the survey author of the documented decision criteria. The actual implementation almost certainly uses an LLM-based reasoning step rather than numeric threshold comparisons — the README describes the decision criteria in natural language and the system is LLM-driven throughout. The thresholds $\tau_p$, $\tau_r$, and $\tau_n$ are analytical constructs introduced here to make the decision boundaries precise; they do not correspond to known configuration fields.
When PIVOT is triggered, the system archives current results with a version tag, jumps back to Stage 8, provides previous failed hypotheses as negative context, and requires new hypotheses to differ from all previous attempts (source: repository README). This creates a closed-loop research process that can autonomously recover from dead-end research directions.
# ABSTRACT PSEUDOCODE: illustrative reconstruction of Stage 15 behavior.
# NOT extracted from the repository. Class names, method signatures, and
# internal APIs are invented by the survey author for pedagogical clarity.
# The actual implementation is not publicly documented at this level of detail.
class ResearchDecisionPseudocode:
    """Illustrates the Stage 15 autonomous research direction control."""

    PROCEED = "PROCEED"
    REFINE = "REFINE"
    PIVOT = "PIVOT"

    def __init__(self, llm, knowledge_base, artifact_store):
        # Collaborators are injected; their interfaces are illustrative only.
        self.llm = llm
        self.knowledge_base = knowledge_base
        self.artifact_store = artifact_store

    def decide(
        self,
        results: list,             # Experimental results from Stage 14
        hypothesis: str,           # Current hypothesis text
        baselines: list,           # Baseline comparison results
        iteration: int,            # Current refinement cycle count
        max_iterations: int = 10,  # Documented default (README)
    ) -> tuple[str, str]:
        """Returns (decision, rationale) via LLM reasoning.

        The LLM receives structured result data and the documented
        PROCEED/REFINE/PIVOT criteria, then outputs a decision
        with an explicit rationale.
        """
        prompt = self._build_decision_prompt(
            results=results,
            hypothesis=hypothesis,
            baselines=baselines,
            iteration=iteration,
            max_iterations=max_iterations,  # lets the prompt state the loop budget
            # README documents that failed hypotheses are provided
            # as negative context on PIVOT
            previous_failures=self.knowledge_base.get("failed_hypotheses", []),
        )
        response = self.llm.generate(prompt)
        decision, rationale = self._parse_decision(response)

        # README documents artifact versioning across loops
        if decision in (self.REFINE, self.PIVOT):
            self.artifact_store.version_snapshot(
                tag=f"iter-{iteration}-{decision.lower()}"
            )

        # README documents negative context injection on PIVOT.
        # Stored under the same category that is read back above.
        if decision == self.PIVOT:
            self.knowledge_base.add(
                category="failed_hypotheses",
                entry={
                    "hypothesis": hypothesis,
                    "outcome": "pivoted",
                    "rationale": rationale,
                },
            )
        return decision, rationale
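For readers who prefer the threshold view of the author formalization, the decision boundaries can also be written as a plain function over precomputed scores. This is a minimal sketch under strong assumptions: the support and novelty scores are taken as given numbers in $[0, 1]$, and the default values for `tau_p`, `tau_n`, and `tau_r` are hypothetical placeholders chosen only to make the example concrete. As the caveat above notes, the real system most likely reaches this decision through LLM reasoning rather than numeric comparison.

```python
# ILLUSTRATIVE SKETCH by the survey author; not repository code.
# Threshold values are hypothetical; only k_max = 10 is a documented default.


def decide(
    support: float,       # support(R, H), assumed precomputed
    novelty: float,       # novelty(R, B), assumed precomputed
    k: int,               # current refinement iteration
    tau_p: float = 0.8,   # hypothetical PROCEED threshold on support
    tau_n: float = 0.5,   # hypothetical PROCEED threshold on novelty
    tau_r: float = 0.4,   # hypothetical REFINE lower threshold on support
    k_max: int = 10,      # documented default for max_iterations
) -> str:
    """Three-way decision over numeric scores."""
    if support >= tau_p and novelty >= tau_n:
        return "PROCEED"
    if tau_r <= support < tau_p and k < k_max:
        return "REFINE"
    return "PIVOT"


print(decide(support=0.9, novelty=0.7, k=0))   # → PROCEED
print(decide(support=0.6, novelty=0.7, k=2))   # → REFINE
print(decide(support=0.6, novelty=0.7, k=10))  # → PIVOT
```

Under this reading, exhausting the refinement budget and failing to beat baselines both fall through to PIVOT, which matches the documented trigger "results indistinguishable from baselines."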
35.3.2 Anti-Fabrication System: VerifiedRegistry
The most dangerous failure mode of LLM-generated research papers is fabricated experimental results — numbers that look plausible but have no basis in actual computation. AutoResearchClaw addresses this with the VerifiedRegistry, a ground-truth enforcement layer that sits between experiment execution and paper writing (source: repository README; documented directory: researchclaw/verification/).
The mechanism works as follows, per the repository README:
- Registration: When experiments in Stage 12/13 produce results, structured JSON metrics are indexed in the VerifiedRegistry with experiment IDs, conditions, metric values, execution logs, and timestamps.
- Enforcement: During paper writing (Stages 16–19), the LLM may only cite metrics that exist in the registry. Unverified numbers are sanitized — either removed or flagged.
- Repair: If experiments fail (preventing registration of any results), a diagnosis-and-repair loop attempts to fix the code and re-execute. Configuration allows up to max_cycles=3 repair attempts. If the minimum completion rate is not met, the system degrades gracefully by reporting partial results.
# ABSTRACT PSEUDOCODE: illustrative reconstruction of VerifiedRegistry behavior.
# NOT extracted from the repository. The actual class hierarchy, method names,
# and internal data structures are not publicly documented.
class VerifiedRegistryPseudocode:
    """Illustrates the ground-truth enforcement concept.

    Core idea (from README): only experimentally verified numbers
    may appear in the generated paper. Unverified claims are sanitized.
    """

    def __init__(self):
        self._registry: dict[str, dict] = {}  # experiment_id -> record

    def register(
        self,
        experiment_id: str,
        conditions: list[str],
        metrics: dict[str, float],
        execution_log: str,
    ) -> None:
        """Register verified results from sandbox execution.

        Called after successful experiment runs in Stages 12/13.
        Metrics are structured JSON (documented in README).
        """
        self._registry[experiment_id] = {
            "conditions": conditions,
            "metrics": metrics,  # e.g., {"accuracy": 0.847, "f1": 0.812}
            "execution_log": execution_log,
            "timestamp": "...",  # execution timestamp
        }

    def verify_claim(self, metric_name: str, claimed_value: float) -> bool:
        """Check if a claimed number exists in verified results.

        README states: 'only registry-verified numbers can be cited.'
        """
        for record in self._registry.values():
            if metric_name in record["metrics"]:
                if abs(record["metrics"][metric_name] - claimed_value) < 1e-6:
                    return True
        return False

    def sanitize_paper(self, draft: str) -> str:
        """Remove or flag unverified numerical claims.

        README: 'Unverified numbers are sanitized (removed or flagged).'
        The Sentinel Watchdog performs complementary checks.
        """
        numbers = self._extract_numerical_claims(draft)
        for claim in numbers:
            if not self.verify_claim(claim.metric, claim.value):
                draft = self._replace_with_flag(draft, claim)
        return draft
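To make the "flag" branch of sanitization concrete, the following runnable sketch applies the same idea to a single sentence. It is an assumption-laden toy: the claim pattern (`<name> of <number>` or `<name> = <number>`) is deliberately naive, and the `[UNVERIFIED: ...]` marker, the `CLAIM` pattern, and the `sanitize` function are hypothetical names introduced here. How the repository actually extracts numerical claims from a draft is not documented.

```python
import math
import re

# ILLUSTRATIVE SKETCH by the survey author; not repository code.
# Naive claim pattern: an identifier, then "of" or "=", then a decimal number.
CLAIM = re.compile(
    r"(?P<metric>[A-Za-z_][A-Za-z0-9_]*)\s*(?:=|of)\s*(?P<value>\d+\.\d+)"
)


def sanitize(draft: str, verified: dict[str, list[float]], tol: float = 1e-6) -> str:
    """Wrap every numeric claim that has no matching verified value in a flag."""

    def check(match: re.Match) -> str:
        metric = match.group("metric").lower()
        value = float(match.group("value"))
        known = verified.get(metric, [])
        if any(math.isclose(value, v, abs_tol=tol) for v in known):
            return match.group(0)  # verified: leave the text untouched
        return f"[UNVERIFIED: {match.group(0)}]"

    return CLAIM.sub(check, draft)


verified = {"accuracy": [0.847], "f1": [0.812]}
draft = "We reach accuracy of 0.847 and F1 of 0.900."
print(sanitize(draft, verified))
# → We reach accuracy of 0.847 and [UNVERIFIED: F1 of 0.900].
```

A production extractor would need to handle percentages, tables, and rounding conventions; the point here is only that enforcement reduces to a lookup against registered values.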
The Sentinel Watchdog (researchclaw/sentinel/) operates as a complementary background quality monitor, performing continuous checks for NaN/Inf in results, paper-evidence consistency, citation relevance scoring, and anti-fabrication enforcement (source: repository README).
Figure 35.2: VerifiedRegistry data flow. Experiments register structured metrics; paper writing stages query the registry; unverified claims are sanitized. The repair loop attempts to fix failed experiments before paper writing begins. Source: repository README documentation.
35.3.3 Four-Layer Citation Verification
Hallucinated references were identified as a critical weakness in AI Scientist (Lu et al., 2024). AutoResearchClaw implements a four-layer verification pipeline in Stage 23 to ensure all citations are both real and relevant (source: repository README):
Figure 35.3: Four-layer citation verification pipeline. Each layer eliminates a different class of citation failure: nonexistent IDs (Layer 1), unresolvable DOIs (Layer 2), unverifiable titles/authors (Layer 3), and irrelevant-but-real references (Layer 4). Failures at any layer cause citation removal. Source: repository README.
The four layers target progressively subtler forms of citation failure:
- arXiv ID verification: If a citation claims an arXiv identifier, the system verifies it exists via the arXiv API. Invalid IDs trigger immediate removal.
- CrossRef/DataCite DOI resolution: DOIs are verified to resolve to real publications, with metadata (title, authors, year) cross-checked against the citation.
- Semantic Scholar title matching: Paper titles are searched in Semantic Scholar with fuzzy matching to handle minor variations. Authors and venue are also verified.
- LLM relevance scoring: Even if a citation is verifiably real, the system scores its relevance to the paper content on a 0–1 scale. Citations below a threshold are removed as "real but irrelevant" padding.
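The layered design above amounts to running each citation through an ordered list of checks and recording the first one it fails. The sketch below shows that shape with in-memory stand-ins for the external services. Everything in it is hypothetical: the `Citation` fields, the `verify` and `titles_match` helpers, the 0.9 title-similarity cutoff, the 0.5 relevance cutoff, and the placeholder identifiers and titles. The DOI layer is omitted because it has the same form as the arXiv layer. Fuzzy matching uses `difflib.SequenceMatcher` from the standard library as one plausible choice; the README does not say which matcher the system uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

# ILLUSTRATIVE SKETCH by the survey author; not repository code.


@dataclass
class Citation:
    key: str
    title: str
    arxiv_id: Optional[str] = None


Layer = Callable[[Citation], bool]


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def titles_match(a: str, b: str, min_ratio: float = 0.9) -> bool:
    """Fuzzy title comparison that tolerates case and spacing differences."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio() >= min_ratio


def verify(citations: list, layers: list) -> tuple:
    """Return (kept citations, {removed key: name of first failed layer})."""
    kept, removed = [], {}
    for citation in citations:
        failed = next((name for name, layer in layers if not layer(citation)), None)
        if failed is None:
            kept.append(citation)
        else:
            removed[citation.key] = failed
    return kept, removed


# Placeholder data standing in for API lookups and LLM relevance scores.
KNOWN_ARXIV = {"0000.00001"}
KNOWN_TITLES = ["Example Paper on Token Merging"]
RELEVANCE = {"good": 0.9, "padding": 0.2}

layers = [
    ("arxiv_id", lambda c: c.arxiv_id is None or c.arxiv_id in KNOWN_ARXIV),
    ("title", lambda c: any(titles_match(c.title, t) for t in KNOWN_TITLES)),
    ("relevance", lambda c: RELEVANCE.get(c.key, 0.0) >= 0.5),
]
citations = [
    Citation("good", "Example paper on  token merging", arxiv_id="0000.00001"),
    Citation("ghost", "Example Paper on Token Merging", arxiv_id="9999.99999"),
    Citation("padding", "Example Paper on Token Merging"),
]
kept, removed = verify(citations, layers)
print([c.key for c in kept])  # → ['good']
print(removed)                # → {'ghost': 'arxiv_id', 'padding': 'relevance'}
```

Ordering the cheap existence checks before the LLM relevance call means fabricated references are dropped without spending any model tokens on them.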
The BibTeX output is auto-pruned to match only inline \cite{} references that survive all four layers, preventing orphaned bibliography entries (source: repository README).
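Pruning of this kind can be sketched with two regular expressions: one to collect the keys actually cited in the LaTeX source, and one to read the key of each BibTeX entry. This is a simplified illustration, not the repository's method. The function names `cited_keys` and `prune_bibtex` are hypothetical, and the entry splitter assumes every entry begins with `@` at the start of a line, which a full BibTeX parser would not need to assume.

```python
import re

# ILLUSTRATIVE SKETCH by the survey author; not repository code.
# Matches \cite, \citep, \citet, starred forms, and optional [..] arguments.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
ENTRY_HEAD = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
ENTRY_START = re.compile(r"^(?=@)", re.MULTILINE)


def cited_keys(tex: str) -> set:
    """Collect every citation key used in the LaTeX source."""
    keys = set()
    for group in CITE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def prune_bibtex(bib: str, keep: set) -> str:
    """Keep only entries whose key is in `keep`."""
    kept = []
    for entry in ENTRY_START.split(bib):
        head = ENTRY_HEAD.match(entry)
        if head and head.group(2) in keep:
            kept.append(entry.strip())
    return "\n\n".join(kept) + ("\n" if kept else "")


tex = r"Prior work \cite{a, b} and \citep[see][]{c}."
bib = "@article{a,\n  title={A}\n}\n\n@book{x,\n  title={X}\n}\n\n@misc{c,\n  title={C}\n}\n"
verified = {"a", "c", "x"}             # keys that survived verification
keep = cited_keys(tex) & verified      # cited AND verified
print(sorted(keep))                    # → ['a', 'c']
print(prune_bibtex(bib, keep))
```

Intersecting the cited keys with the verified keys removes both orphaned entries (here `x`, verified but never cited) and citations that failed verification (here `b`); the inline `\cite` commands for removed keys would need separate handling.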
Implementation note: The repository README documents the four-layer design and names the external APIs used (arXiv, CrossRef, DataCite, Semantic Scholar). The internal implementation of the LLM relevance scoring — including the relevance threshold value, prompt design, and scoring mechanism — is not publicly documented. The hallucinated-reference removal rate is also not reported.
35.3.4 MetaClaw Cross-Run Learning
MetaClaw provides persistent cross-run learning through a lesson-to-skill pipeline. After each pipeline run, the system captures failures and warnings as structured lessons. MetaClaw then converts these into SKILL.md files stored in ~/.metaclaw/skills/, which are injected into future runs' LLM prompts at applicable stages (source: repository README, MetaClaw repository).
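Author formalization. The learning loop can be expressed as follows. Let $L_n = \{l_1, l_2, \ldots, l_j\}$ denote lessons extracted from run $n$, where each lesson $l_i$ includes a stage identifier, severity level, category, description, and optional resolution. MetaClaw converts lessons to skills via a filtering and transformation function:

$$S_n = \texttt{convert}\big(\{\, l \in L_n : \text{severity}(l) \geq \sigma_{\min} \,\}\big), \qquad |S_n| \leq \texttt{skills\_per\_run}$$

The cumulative skill library available at run $N$ is:

$$\mathcal{S}_N = \Big\{\, s \in \bigcup_{n=1}^{N-1} S_n \;:\; w(s, t_N) > 0 \,\Big\}$$

where $w(s, t_N)$ is a time-decay weighting function:

$$w(s, t_N) = \max\!\left(0,\; 1 - \frac{t_N - t_s}{T_{\text{decay}}}\right)$$

Here $t_s$ is the timestamp when skill $s$ was created, $t_N$ is the start time of run $N$, and $T_{\text{decay}}$ is the decay period.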
| Symbol | Definition | Config Field / Artifact | Documented Default | Status |
|---|---|---|---|---|
| $\sigma_{\min}$ | Minimum severity for lesson→skill conversion | metaclaw_bridge.min_severity | "warning" | Repository config example |
| $|S_n|$ cap | Maximum new skills per run | metaclaw_bridge.skills_per_run | 3 | Repository config example |
| $T_{\text{decay}}$ | Time-decay period for skill relevance | Described in README as "30-day decay" | 30 days | Repository README |
| $\texttt{convert}(\cdot)$ | Lesson-to-skill transformation | MetaClaw processing pipeline | — | Documented behavior; internal mechanism unknown |
| $w(s, t_N)$ | Time-decay weight for skill $s$ at time $t_N$ | Implemented in build_overlay() | Linear decay | Author formalization of documented "30-day decay" behavior |
Note: the linear time-decay form $w(s, t_N) = \max(0, 1 - (t_N - t_s)/30)$ is the survey author's analytical interpretation of the documented "30-day decay period" behavior. The actual decay function may use a different shape (exponential, step, etc.) — the README describes the 30-day period but not the functional form.
# ABSTRACT PSEUDOCODE: illustrative reconstruction of MetaClaw skill injection.
# NOT extracted from the repository. Illustrates the documented lesson→skill→overlay
# pipeline based on README descriptions and config fields.
from pathlib import Path


class MetaClawBridgePseudocode:
    """Cross-run learning via lesson extraction and skill injection."""

    DECAY_DAYS = 30         # README: "30-day decay period"
    MAX_SKILLS_PER_RUN = 3  # Config: metaclaw_bridge.skills_per_run

    def extract_lessons(self, run_log) -> list:
        """Extract actionable lessons from a completed pipeline run.

        README documents: lessons captured from 'decision rationale,
        runtime warnings, metric anomalies.'
        """
        lessons = []
        for stage_log in run_log.stages:
            if stage_log.retries > 0 or stage_log.warnings:
                lessons.append({
                    "stage": stage_log.stage_id,
                    "severity": stage_log.max_severity,
                    "category": stage_log.category,
                    "description": stage_log.failure_description,
                    "resolution": stage_log.auto_resolution,
                })
        return lessons

    def build_overlay(self, current_time) -> list:
        """Load time-weighted skills for injection into current run.

        README: skills stored in ~/.metaclaw/skills/ as SKILL.md files.
        """
        skills_dir = Path.home() / ".metaclaw" / "skills"
        active_skills = []
        for skill_path in skills_dir.glob("arc-*.md"):
            skill = self._parse_skill_file(skill_path)
            age_days = (current_time - skill["created_at"]).days
            weight = max(0.0, 1.0 - age_days / self.DECAY_DAYS)
            if weight > 0:
                skill["weight"] = weight
                active_skills.append(skill)
        return sorted(active_skills, key=lambda s: s["weight"], reverse=True)
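The two documented knobs that the pseudocode above does not exercise, `min_severity` and `skills_per_run`, together with the linear decay weight, can be checked in isolation with a few lines of runnable Python. This is a sketch under stated assumptions: the severity scale (`info` < `warning` < `error`) and the rule that the cap keeps the most severe lessons first are both hypothetical, since the README documents only the default `"warning"` floor and the cap of 3. The linear decay form is the survey author's interpretation, as noted above.

```python
from datetime import datetime, timedelta

# ILLUSTRATIVE SKETCH by the survey author; not repository code.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}  # hypothetical ordering


def decay_weight(created_at: datetime, now: datetime, decay_days: float = 30.0) -> float:
    """Linear decay from 1.0 at creation to 0.0 at `decay_days` and beyond."""
    age_days = (now - created_at).total_seconds() / 86400
    return max(0.0, 1.0 - age_days / decay_days)


def select_lessons(lessons: list, min_severity: str = "warning",
                   skills_per_run: int = 3) -> list:
    """Filter by severity floor, then keep at most `skills_per_run` lessons."""
    floor = SEVERITY_RANK[min_severity]
    eligible = [l for l in lessons if SEVERITY_RANK[l["severity"]] >= floor]
    # Assumption: when the cap binds, the most severe lessons win.
    eligible.sort(key=lambda l: SEVERITY_RANK[l["severity"]], reverse=True)
    return eligible[:skills_per_run]


now = datetime(2026, 4, 1)
print(decay_weight(now - timedelta(days=15), now))  # → 0.5
print(decay_weight(now - timedelta(days=45), now))  # → 0.0

lessons = [
    {"stage": 4, "severity": "info"},
    {"stage": 10, "severity": "warning"},
    {"stage": 12, "severity": "error"},
    {"stage": 13, "severity": "error"},
    {"stage": 17, "severity": "warning"},
]
print([l["stage"] for l in select_lessons(lessons)])  # → [12, 13, 10]
```

With a 30-day period, a skill loses half its weight after 15 days and drops out of the overlay entirely after 30, so lessons from an abandoned configuration stop influencing prompts within a month.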
35.3.5 Experiment Complexity Routing
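AutoResearchClaw automatically assesses experiment complexity and routes code generation accordingly. The repository README and configuration example document a three-tier routing system:

$$\texttt{route}(c) = \begin{cases} \text{direct LLM generation} & \text{if } c < c_{\text{low}} \\ \text{CodeAgent v2} & \text{if } c_{\text{low}} \leq c < \theta \\ \text{OpenCode Beast Mode} & \text{if } c \geq \theta \end{cases}$$

where $c \in [0, 1]$ is the assessed complexity score, $c_{\text{low}}$ is the lower routing boundary, and $\theta$ is the upper threshold.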
| Symbol | Definition | Config Field | Documented Default | Status |
|---|---|---|---|---|
| $c$ | Assessed complexity score | Computed internally at experiment design | — | Documented behavior; scoring method not disclosed |
| $c_{\text{low}}$ | Lower routing boundary (simple → medium) | Not separately configurable in public config | 0.2 (README description) | Repository README |
| $\theta$ | Upper threshold (medium → Beast Mode) | opencode.complexity_threshold | 0.2 | Repository config example |
Threshold inconsistency. The README describes the lower routing boundary as 0.2, and the configuration example sets opencode.complexity_threshold: 0.2 as well. With $c_{\text{low}} = \theta = 0.2$, the middle tier (CodeAgent v2 without OpenCode) has an empty domain: no complexity score satisfies $0.2 \leq c < 0.2$. This collapses the three-tier system into a two-tier system (direct LLM for $c < 0.2$; OpenCode Beast Mode for $c \geq 0.2$). Three possible explanations: (1) the documented default of 0.2 is a placeholder expecting user adjustment to a higher value; (2) the lower boundary $c_{\text{low}}$ is actually less than 0.2 in the implementation but not separately documented; or (3) CodeAgent v2 is always active and the "threshold" only controls whether OpenCode is additionally invoked. Without access to the internal routing logic, we cannot resolve this inconsistency. In practice, users who want three-tier routing should set opencode.complexity_threshold to a value strictly greater than 0.2.
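The empty middle tier is easy to confirm by writing the three-tier rule as code and sweeping the complexity score. The sketch below is the survey author's reading of the documented routing, with hypothetical tier labels and a hypothetical `route` function; it does not reflect the internal routing logic, which is not public.

```python
# ILLUSTRATIVE SKETCH by the survey author; not repository code.


def route(c: float, c_low: float = 0.2, theta: float = 0.2) -> str:
    """Three-tier routing over a complexity score c in [0, 1]."""
    if c < c_low:
        return "direct_llm"
    if c < theta:
        return "code_agent_v2"
    return "opencode_beast_mode"


# With the documented defaults, no score on a 0.01 grid reaches the middle tier.
scores = [i / 100 for i in range(101)]
print(any(route(c) == "code_agent_v2" for c in scores))  # → False

# Raising the upper threshold reopens the middle tier.
print(route(0.4, theta=0.6))  # → code_agent_v2
```

Any `theta` strictly greater than `c_low` gives the middle tier a non-empty interval; the choice of 0.6 here is arbitrary.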
CodeAgent v2 (researchclaw/agents/code_agent/) performs architecture planning, sequential file generation following a dependency DAG, AST-based hard validation (blocking identical ablations and hardcoded metrics), and execution-in-the-loop fixing (up to exec_fix_max_iterations=3 attempts with 60-second timeout per the config example). OpenCode Beast Mode delegates to the external OpenCode system for multi-file projects with custom architectures, training loops, and ablation studies (source: repository README and config example).
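One of the documented hard-validation targets, hardcoded metrics, lends itself to a short demonstration with Python's standard `ast` module. The sketch below flags assignments of a numeric literal to a metric-like variable name. It is a minimal illustration of the technique, not the repository's validator: the `METRIC_NAMES` watch-list and the function name are hypothetical, and a real check would also need to cover dictionary literals, return values, and formatted strings.

```python
import ast

# ILLUSTRATIVE SKETCH by the survey author; not repository code.
METRIC_NAMES = {"accuracy", "acc", "f1", "loss", "auc"}  # hypothetical watch-list


def find_hardcoded_metrics(source: str) -> list:
    """Return sorted (line, name) pairs where a metric-like name is
    assigned a bare numeric literal instead of a computed value."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets, value = node.targets, node.value
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets, value = [node.target], node.value
        else:
            continue
        is_number = (
            isinstance(value, ast.Constant)
            and isinstance(value.value, (int, float))
            and not isinstance(value.value, bool)
        )
        if not is_number:
            continue
        for target in targets:
            if isinstance(target, ast.Name) and target.id.lower() in METRIC_NAMES:
                findings.append((node.lineno, target.id))
    return sorted(findings)


generated = (
    "accuracy = 0.95\n"                 # suspicious: literal metric
    "loss = compute_loss(model)\n"      # fine: computed
    "epochs = 10\n"                     # fine: not a metric name
)
print(find_hardcoded_metrics(generated))  # → [(1, 'accuracy')]
```

Because `ast.parse` only parses and never executes the generated code, a check like this can run before the sandbox is involved, which is consistent with the README's description of validation as a blocking step.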
35.3.6 Hardware-Aware Experiment Adaptation
The system auto-detects available hardware and adapts generated experiment code accordingly (source: repository README):
| Detected Hardware | Mode | Adaptations (README-documented) |
|---|---|---|
| NVIDIA GPU (CUDA) | Full-scale | PyTorch CUDA, large batch sizes, full training epochs |
| Apple Silicon (MPS) | Adapted scale | PyTorch MPS backend, reduced batch sizes |
| CPU only | Minimal | Small models, few epochs, scikit-learn focus |
Code generation adjusts imports, model sizes, batch sizes, training epochs, and package selection based on the detected hardware tier. This is critical for reproducibility across heterogeneous hardware environments, though the specific heuristics for each adaptation are not documented beyond the summary above.
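The tier table can be read as a two-step procedure: detect what the machine offers, then pick a profile. The sketch below separates those steps so the selection logic can be tested without a GPU. The specific batch sizes, epoch counts, and the function names are hypothetical placeholders, since the README documents only the qualitative adaptations. Detection uses `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, falling back to CPU if PyTorch is not installed.

```python
# ILLUSTRATIVE SKETCH by the survey author; not repository code.
# Numeric values below are placeholders, not documented defaults.


def detect_hardware() -> tuple:
    """Return (cuda_available, mps_available); (False, False) without PyTorch."""
    try:
        import torch
    except ImportError:
        return False, False
    cuda = torch.cuda.is_available()
    mps_backend = getattr(torch.backends, "mps", None)  # absent in old releases
    mps = bool(mps_backend is not None and mps_backend.is_available())
    return cuda, mps


def select_tier(cuda_available: bool, mps_available: bool) -> dict:
    """Map detected hardware to an experiment profile (CUDA takes priority)."""
    if cuda_available:
        return {"device": "cuda", "mode": "full-scale", "batch_size": 128, "epochs": 50}
    if mps_available:
        return {"device": "mps", "mode": "adapted", "batch_size": 32, "epochs": 20}
    return {"device": "cpu", "mode": "minimal", "batch_size": 8, "epochs": 3}


print(select_tier(False, True)["mode"])  # → adapted
```

In use, the two functions compose as `select_tier(*detect_hardware())`, and the resulting profile would parameterize the generated experiment code.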
35.4 Key Results
35.4.1 Showcase Papers
The repository documents eight papers generated fully autonomously across eight research domains: random matrix theory, weak IV estimators, SIR/SEIR identifiability, Krylov preconditioners, GARD-LoRA (parameter-efficient fine-tuning), LACE exploration (reinforcement learning), FAME token merging (vision transformers), and CRAFT distillation (knowledge distillation). These papers are presented in the repository as showcase demonstrations of the pipeline's breadth (source: repository README).
35.4.2 MetaClaw Integration Results
The repository README reports controlled A/B experiments comparing pipeline performance with and without MetaClaw cross-run learning. The README states these used "same topic, same LLM, same configuration," but does not disclose further experimental details.
| Metric | Baseline (no MetaClaw) | With MetaClaw | Change | Source | Verifiable? |
|---|---|---|---|---|---|
| Stage retry rate | 10.5% | 7.9% | −24.8% | Repository README | No (raw data not available) |
| Refine cycle count | 2.0 | 1.2 | −40.0% | Repository README | No (raw data not available) |
| Pipeline stage completion | 18/19 | 19/19 | +5.3 pp (+5.6% relative) | Repository README | No (single pair implied) |
| Overall robustness score | 0.714 | 0.845 | +18.3% | Repository README | Derivable from above metrics |
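The composite robustness score formula is documented in the repository:

$$\text{robustness} = 0.4 \times \text{stage\_completion\_rate} + 0.3 \times (1 - \text{retry\_rate}) + 0.3 \times \text{refine\_efficiency}$$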
| Variable | Definition | Source |
|---|---|---|
| stage_completion_rate | Fraction of measured stages completing without failure | Repository README |
| retry_rate | Fraction of stages requiring at least one retry | Repository README |
| refine_efficiency | Reduction in REFINE cycles (definition not fully specified) | Repository README; exact formula undocumented |
| Weights (0.4, 0.3, 0.3) | Author-chosen weighting | Repository README |
Consistency check. Using the reported numbers: baseline robustness $= 0.4 \times (18/19) + 0.3 \times (1 - 0.105) + 0.3 \times \text{refine\_eff}_{\text{base}} = 0.379 + 0.269 + 0.3 \times \text{refine\_eff}_{\text{base}}$. For this to equal 0.714, we need $\text{refine\_eff}_{\text{base}} \approx 0.22$. Similarly, MetaClaw robustness $= 0.4 \times (19/19) + 0.3 \times (1 - 0.079) + 0.3 \times \text{refine\_eff}_{\text{meta}} = 0.400 + 0.276 + 0.3 \times \text{refine\_eff}_{\text{meta}}$. For 0.845, we need $\text{refine\_eff}_{\text{meta}} \approx 0.56$. The numbers are internally consistent given reasonable refine-efficiency values, but the refine_efficiency metric itself is not fully defined.
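The arithmetic in the consistency check can be reproduced directly. The short script below solves the weighted sum for the one undefined term, using only the weights and reported values quoted above; the function names are the survey author's own.

```python
# Reproduces the consistency check above; function names are illustrative.
W_COMPLETION, W_RETRY, W_REFINE = 0.4, 0.3, 0.3


def robustness(completion: float, retry_rate: float, refine_eff: float) -> float:
    """Weighted composite score as described in the README."""
    return (W_COMPLETION * completion
            + W_RETRY * (1 - retry_rate)
            + W_REFINE * refine_eff)


def implied_refine_efficiency(score: float, completion: float, retry_rate: float) -> float:
    """Solve the composite formula for the refine-efficiency term."""
    return (score - W_COMPLETION * completion - W_RETRY * (1 - retry_rate)) / W_REFINE


base = implied_refine_efficiency(0.714, 18 / 19, 0.105)
meta = implied_refine_efficiency(0.845, 19 / 19, 0.079)
print(round(base, 2), round(meta, 2))  # → 0.22 0.56
```

Both implied values fall inside $[0, 1]$, which is what "internally consistent" means here; it does not establish how refine_efficiency was actually computed.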
35.4.3 Adoption Metrics
As of April 2026, the repository reports more than 9,800 GitHub stars, a test suite of 1,823 passing tests, 20 built-in skills with an extensible community skill system, and README translations in 9 languages (source: repository README). These adoption numbers are notable for a system released only weeks prior, suggesting significant community interest in autoresearch tooling.
35.5 Implementation Details
35.5.1 Codebase and Dependencies
The system is implemented entirely in Python, organized under the researchclaw/ package. The directory structure below is visible in the public repository:
| Directory | Purpose (from README) | Notable Subdirectories |
|---|---|---|
| researchclaw/pipeline/ | Pipeline orchestrator, stage runner, checkpointing | runner.py, checkpoint.py, stages/ |
| researchclaw/agents/ | Multi-agent subsystems | base.py, code_agent/, benchmark_agent/, figure_agent/, debate/ |
| researchclaw/literature/ | Academic API clients | OpenAlex, Semantic Scholar, arXiv integration |
| researchclaw/sandbox/ | Experiment execution sandbox | AST validation, Docker mode, self-healing |
| researchclaw/sentinel/ | Quality watchdog | NaN/Inf detection, consistency checks |
| researchclaw/verification/ | Citation verification, VerifiedRegistry | 4-layer verification pipeline |
| researchclaw/skills/ | Skills management | Skill loading, injection, community skills |
| researchclaw/templates/ | LaTeX templates | NeurIPS 2025, ICLR 2026, ICML 2026 |
| researchclaw/config/ | Configuration management | YAML config parsing |
| researchclaw/knowledge/ | Run-level knowledge base | Markdown or Obsidian backend |
Dependencies include openai and httpx for LLM integration, requests for literature APIs, jinja2 for LaTeX template rendering, pyyaml for configuration, and Python's stdlib ast module for code validation (source: repository pyproject.toml). The CLI interface is built on click, exposed via the researchclaw command.
35.5.2 Configuration
The system is configured via a YAML file. The following is reproduced from the documented config.researchclaw.example.yaml (source: repository), showing key configuration blocks with documented defaults:
# Reproduced from config.researchclaw.example.yaml (repository documentation)
# This IS a direct representation of the documented configuration structure.
llm:
  provider: "openai"              # openai | openrouter | deepseek | acp | ...
  acp:
    agent: "claude"               # claude | codex | gh | gemini | opencode | kimi
sandbox:
  mode: "sandbox"                 # simulated | sandbox | docker | ssh_remote
  memory_limit_mb: 4096
  time_budget_sec: 300
code_agent:
  enabled: true
  architecture_planning: true
  hard_validation: true
  hard_validation_max_repairs: 2
  exec_fix_max_iterations: 3
  exec_fix_timeout_sec: 60
benchmark_agent:
  enabled: true
  enable_hf_search: true
  tier_limit: 2                   # 1=small, 2=medium, 3=large
figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 8
  max_iterations: 3               # Critic-driven refinement
  dpi: 300
opencode:
  enabled: true
  auto: true
  complexity_threshold: 0.2       # See Section 35.3.5 for threshold inconsistency
  timeout_sec: 600
metaclaw_bridge:
  enabled: false                  # Opt-in
  min_severity: "warning"
  skills_per_run: 3
  prm:
    enabled: false
    model: "gpt-5.4"
    votes: 3
    gate_stages: [5, 9, 15, 20]
knowledge_base:
  backend: "markdown"             # or "obsidian"
  root: "docs/kb"
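The enumerated values in the comments above lend themselves to a quick sanity check on a parsed configuration. The sketch below operates on a plain dict (for example, the result of loading the YAML file) and checks only constraints that the example file itself states. The function name validate_config and the key paths, which assume the nesting implied by the listing, are illustrative and not a repository API.

```python
# Allowed values as listed in the example file's comments.
ALLOWED_SANDBOX_MODES = {"simulated", "sandbox", "docker", "ssh_remote"}
ALLOWED_KB_BACKENDS = {"markdown", "obsidian"}
ALLOWED_TIERS = {1, 2, 3}


def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []

    mode = cfg.get("sandbox", {}).get("mode")
    if mode not in ALLOWED_SANDBOX_MODES:
        problems.append(f"sandbox.mode: unexpected value {mode!r}")

    backend = cfg.get("knowledge_base", {}).get("backend")
    if backend not in ALLOWED_KB_BACKENDS:
        problems.append(f"knowledge_base.backend: unexpected value {backend!r}")

    tier = cfg.get("benchmark_agent", {}).get("tier_limit")
    if tier not in ALLOWED_TIERS:
        problems.append(f"benchmark_agent.tier_limit: unexpected value {tier!r}")

    figures = cfg.get("figure_agent", {})
    low, high = figures.get("min_figures"), figures.get("max_figures")
    if low is not None and high is not None and low > high:
        problems.append("figure_agent: min_figures exceeds max_figures")

    return problems
```

With the documented defaults the function returns an empty list; changing sandbox.mode to an unlisted value such as "vm" yields a single problem entry.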
35.5.3 Reproducing a Run
The following CLI commands are documented in the repository README and represent the verified user-facing interface:
# From repository README — verified CLI commands
# Installation:
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
researchclaw setup
researchclaw init # Interactive config generation
# Set API key:
export OPENAI_API_KEY="sk-..."
# Full autonomous run:
researchclaw run --topic "Your research topic" --auto-approve
# Resume interrupted run (added in v0.3.2):
researchclaw run --resume # Auto-detects last checkpoint
35.5.4 Cost Analysis
A full 23-stage pipeline run involves extensive LLM usage across all stages. The repository README provides the following cost estimates per run:
| Model | Estimated Cost per Run | Notes | Source |
|---|---|---|---|
| GPT-4o | $15–50 | Full quality; many stages + multi-agent debate | Repository README |
| GPT-4o-mini | $3–10 | Budget option; lower expected quality | Repository README |
| Claude 3.5 Sonnet (via ACP) | $10–30 | Using Claude Code as persistent agent | Repository README |
| DeepSeek V3 | $2–8 | Cost-effective alternative | Repository README |
| Gemini Pro | $5–20 | Via OpenRouter or direct API | Repository README |
These are author estimates from the repository README, not independently verified cost measurements. Actual costs vary significantly based on topic complexity, number of REFINE/PIVOT cycles, experiment iterations, paper length, and LLM provider pricing at time of execution. No methodology for these estimates (e.g., token counting, averaged across runs) is provided.
35.5.5 Time Budget
Estimated pipeline duration varies by more than an order of magnitude depending on configuration, from about 30 minutes to 24 hours (source: repository README):
| Configuration | Estimated Duration | Source |
|---|---|---|
| Simulated experiments, fast model | 30–60 minutes | Repository README |
| Sandbox experiments, GPT-4o | 2–6 hours | Repository README |
| Docker + OpenCode Beast Mode | 4–12 hours | Repository README |
| With PIVOT/REFINE cycles | 6–24 hours | Repository README |
Experiment execution (Phase E) dominates wall-clock time at 30–180 minutes, while literature collection (Phase B) is typically API-bound at 15–45 minutes. Each REFINE cycle adds an estimated 30–90 minutes, while a PIVOT adds 2–4 hours by restarting from hypothesis generation. These are README-reported estimates without documented measurement methodology.
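The per-cycle figures above can be combined into a rough duration range. The sketch below assumes that REFINE and PIVOT overheads simply add to the base configuration range; that additive model, the dictionary keys, and the function name estimate_minutes are assumptions for illustration, and the inputs are the README-reported estimates rather than measurements.

```python
# README-reported ranges in minutes as (low, high). Keys are illustrative labels.
BASE_MINUTES = {
    "simulated_fast": (30, 60),       # Simulated experiments, fast model
    "sandbox_gpt4o": (120, 360),      # Sandbox experiments, GPT-4o
    "docker_beast_mode": (240, 720),  # Docker + OpenCode Beast Mode
}
REFINE_MINUTES = (30, 90)   # Added per REFINE cycle
PIVOT_MINUTES = (120, 240)  # Added per PIVOT


def estimate_minutes(config: str, refines: int = 0, pivots: int = 0) -> tuple[int, int]:
    """Return a (low, high) duration estimate in minutes.

    Assumes cycle overheads are additive, which the README does not state.
    """
    base_low, base_high = BASE_MINUTES[config]
    low = base_low + refines * REFINE_MINUTES[0] + pivots * PIVOT_MINUTES[0]
    high = base_high + refines * REFINE_MINUTES[1] + pivots * PIVOT_MINUTES[1]
    return low, high


print(estimate_minutes("sandbox_gpt4o", refines=2, pivots=1))  # (300, 780)
```

Under these assumptions, a sandbox run with two REFINE cycles and one PIVOT lands at 300 to 780 minutes, or 5 to 13 hours, which overlaps the 6–24 hour row reported for runs with PIVOT/REFINE cycles.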
35.5.6 Reproducibility Assessment
| Factor | Assessment | Evidence | Provenance |
|---|---|---|---|
| Source availability | Strong | MIT-licensed, complete codebase on GitHub | Verified: public repository |
| Installation | Strong | pip install -e . + researchclaw setup | Repository README |
| Configuration | Strong | Documented YAML with example file | Repository config example |
| Test suite | Strong | 1,823 tests passing (repository-reported) | Repository README (not independently run) |
| Checkpoint/resume | Strong | --resume flag auto-detects last checkpoint | Repository README, v0.3.2 changelog |
| LLM determinism | Weak | Output varies with model, temperature, API version | Inherent to LLM-based systems |
| API dependencies | Moderate | Requires OpenAlex, Semantic Scholar, arXiv (external services) | Repository README |
| API stability | Weak | Six releases in 15 days suggest rapid API churn | Repository version history |
| External tool deps | Moderate | OpenCode Beast Mode requires separate installation | Repository README |
35.5.7 Skills System
AutoResearchClaw implements a skills system inspired by Claude Code's SKILL.md format. Each skill is a Markdown file with YAML frontmatter specifying name, description, trigger keywords, applicable pipeline stages, and enabled status. Skills are loaded from five sources in priority order (source: repository README):
- Built-in skills (20 shipped): packaged with researchclaw
- Project-local skills: .claude/skills/ directory
- User-installed skills: via researchclaw skills install
- Team-shared skills: custom directories in config
- Community skills: 150+ via K-Dense-AI/claude-scientific-skills repository
Notable built-in skills documented in the README include scientific-writing (IMRAD structure, citation formatting), chemistry-rdkit (molecular analysis, SMILES), literature-search (systematic review, PRISMA methodology), hypothesis-formulation, statistical-reporting, and a-evolve (community-contributed from the A-Evolve project). At applicable stages, relevant skills are automatically injected into LLM prompts, enabling domain-specific expertise without modifying the core pipeline.
The skill file format, as documented in the repository, is:
# Skill file format (from repository documentation)
---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]
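As an illustration of how such a file could be consumed, the sketch below parses the frontmatter with the standard library only and decides whether a skill applies at a given stage. It handles just the flat key: value and [a, b] forms shown in the example. The rule of matching trigger keywords against the topic string is an assumption, since the sources do not document what the keywords are matched against, and the function names are hypothetical.

```python
def parse_skill(text: str) -> dict:
    """Parse a skill file into {"meta": {...}, "body": str}.

    Supports only the simple forms in the documented example; a real
    loader would more likely use a full YAML parser.
    """
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            items = [item.strip() for item in raw[1:-1].split(",") if item.strip()]
            value = [int(item) if item.isdigit() else item for item in items]
        elif raw in ("true", "false"):
            value = raw == "true"
        else:
            value = raw
        meta[key.strip()] = value
    return {"meta": meta, "body": body.strip()}


def is_applicable(skill: dict, stage: int, topic: str) -> bool:
    """Assumed rule: enabled, stage listed, and any keyword appears in the topic."""
    meta = skill["meta"]
    if not meta.get("enabled", False):
        return False
    if stage not in meta.get("applicable-stages", []):
        return False
    topic_lower = topic.lower()
    return any(kw.lower() in topic_lower for kw in meta.get("trigger-keywords", []))
```

Under these assumptions, the scientific-writing example above would apply at stage 17 for a topic containing "paper" and would be skipped at stage 5 regardless of topic.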
35.6 Limitations & Discussion
35.6.1 Quality Ceiling
The most significant limitation is the absence of external quality validation. No generated paper has been submitted to a real conference, evaluated by domain experts in blind review, or compared to human-written papers on matched topics. The system's quality assessment is entirely internal: multi-agent peer review and quality gates are implemented by the same LLM that generated the content, creating a potential echo chamber. Without external benchmarking against human baselines, claims of "conference-ready" quality remain aspirational.
35.6.2 Experiment Fidelity
Sandbox experiments, while reproducible, operate at small scale. The default execution mode uses a local subprocess with configurable memory (4096 MB default) and time (300s default) limits per the config example. Docker mode and SSH remote execution extend capacity but add infrastructure complexity. For research fields requiring large-scale training, multi-GPU experiments, or specialized hardware (e.g., TPU clusters), the gap between sandbox experiments and full-scale reproducible research remains substantial.
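A local subprocess with memory and time limits of this kind can be sketched with the Python standard library. The example below is not the repository's sandbox; it is a POSIX-only illustration (tested mentally against Linux semantics) in which the function name run_limited is hypothetical and the defaults mirror memory_limit_mb and time_budget_sec from the config example.

```python
import resource
import subprocess
import sys


def run_limited(code: str, memory_limit_mb: int = 4096, time_budget_sec: float = 300):
    """Run `code` in a child Python process under memory and time limits.

    Returns (returncode, stdout, timed_out). Relies on the POSIX-only
    `resource` module and on `preexec_fn`.
    """
    limit_bytes = memory_limit_mb * 1024 * 1024

    def apply_limits():
        # Cap the child's address space without exceeding the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=time_budget_sec,  # Child is killed when the budget expires
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return None, "", True
    return proc.returncode, proc.stdout, False
```

Limits of this kind bound a single process on one machine, which is why they do not by themselves close the gap to multi-GPU or cluster-scale experiments discussed above.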
35.6.3 Residual Fabrication Risk
Despite the VerifiedRegistry and Sentinel Watchdog, subtle fabrication remains possible. The anti-fabrication system verifies that numbers in the paper match actual experiment outputs, but it cannot verify that the interpretation of those numbers is correct, that the experimental setup is sound, or that the conclusions drawn are valid. An LLM can produce a paper with all verified numbers but misleading analysis — a form of "honest fabrication" that technical safeguards alone cannot prevent.
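The distinction between verified numbers and verified interpretation can be made concrete with a toy check. The sketch below is an illustration by the survey author, not the VerifiedRegistry implementation: it extracts numeric literals from a sentence and reports those that match no recorded experiment value. The function name, the regular expression, and the tolerance are all assumptions.

```python
import re

# Matches integers and simple decimals; deliberately simplistic.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def unverified_numbers(text: str, verified: set[float], tol: float = 1e-9) -> list[str]:
    """Return numeric literals in `text` that match no verified value."""
    missing = []
    for literal in NUMBER.findall(text):
        value = float(literal)
        if not any(abs(value - v) <= tol for v in verified):
            missing.append(literal)
    return missing


registry = {0.912, 0.874}  # Values recorded from (hypothetical) experiment outputs

honest = "Our method reaches 0.912 accuracy versus 0.874 for the baseline."
invented = "Our method reaches 0.951 accuracy versus 0.874 for the baseline."
misleading = "Our method reaches 0.912 versus 0.874, a decisive and general improvement."

print(unverified_numbers(honest, registry))      # []
print(unverified_numbers(invented, registry))    # ['0.951']
print(unverified_numbers(misleading, registry))  # []
```

The third sentence passes the check even though "decisive and general" may not be supported by the experiment, which is exactly the residual risk described above.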
35.6.4 Novelty Assessment
The system cannot reliably assess whether its research is genuinely novel. While the PIVOT decision considers novelty relative to baselines, this is a narrow operational definition. True novelty assessment requires deep understanding of the research landscape, ongoing conferences, parallel work, and conceptual contribution — capabilities that current LLMs approximate but do not reliably deliver. Human judgment remains essential for novelty claims.
35.6.5 LLM Non-Determinism and Reproducibility
Given identical inputs and configuration, two runs of the same topic may produce substantially different papers due to LLM non-determinism. Temperature settings, API version changes, and provider-specific behavior all introduce variability. The checkpoint/resume system ensures a given run is recoverable, but cross-run reproducibility (same topic producing comparable output) is not guaranteed. This is a fundamental limitation shared by all LLM-powered autoresearch systems.
35.6.6 Comparative Analysis
The following table compares AutoResearchClaw with the two most closely related systems covered in this survey. Each cell is annotated with evidence provenance.
| Capability | Comparison Criterion | AutoResearchClaw | AI Scientist (Sakana AI) | AIRA₂ (Meta FAIR) |
|---|---|---|---|---|
| Pipeline scope | Number of distinct automated stages from input to output | 23 stages, idea to paper (README) | Partial: idea → paper, limited experiment fidelity (Lu et al., 2024) | Experiment optimization only; no paper generation (Chapter 32) |
| Literature retrieval | Whether system queries real academic databases | Real APIs: OpenAlex, Semantic Scholar, arXiv (README) | No real-time retrieval documented; hallucinated references reported (Lu et al., 2024, §5) | N/A — not a paper-generation system |
| Citation integrity | Verification mechanism for reference authenticity | 4-layer verification pipeline (README) | No citation verification documented (Lu et al., 2024) | N/A |
| Experiment execution | Environment for running generated code | Sandboxed Python, Docker, SSH (README, config) | Limited sandbox with code generation (Lu et al., 2024) | Full-scale GPU training on Kaggle tasks (Chapter 32) |
| Cross-run learning | Whether knowledge persists across independent runs | MetaClaw skill extraction and injection (README, MetaClaw repo) | No cross-run learning documented | Within-task evolution only (Chapter 32) |
| Research-direction control | Ability to autonomously change hypothesis mid-run | PIVOT/REFINE loop at Stage 15 (README) | No documented direction-change mechanism | N/A (single-task optimization) |
| Human-in-loop | Optional human review gates | 3 optional quality gates (README) | No human-in-loop gates documented | No human-in-loop gates documented |
| Anti-fabrication | Mechanism preventing invented experimental numbers | VerifiedRegistry + Sentinel (README) | No anti-fabrication mechanism documented | N/A (real Kaggle evaluations) |
| License | Source availability | MIT (verified: public repo) | Apache 2.0 (verified: public repo) | Not publicly released |
| External validation | Third-party evaluation of output quality | None (self-assessed showcase) | Self-assessed; workshop-level papers claimed (Lu et al., 2024) | Kaggle competition rankings (Chapter 32) |
Sources: AutoResearchClaw claims from repository README. AI Scientist claims from Lu et al. (2024), "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," arXiv:2408.06292. AIRA₂ claims from Chapter 32 of this survey. "No documented mechanism" means the published paper or public repository does not describe such a feature; absence of documentation does not conclusively prove absence of the feature. Entries marked "N/A" indicate the system does not target that capability by design.
35.6.7 Connections to Evolutionary AI
While AutoResearchClaw is not an evolutionary algorithm system in the traditional sense, its architecture embodies several patterns central to this survey's theme:
PIVOT/REFINE as exploration-exploitation. The Stage 15 decision engine implements a form of adaptive search over the space of research directions. REFINE corresponds to exploitation (parameter tuning within the current hypothesis), while PIVOT corresponds to exploration (abandoning the current hypothesis for a new one). This mirrors the exploration-exploitation tradeoff in island-based evolutionary search with restart policies, where stagnation triggers population reinitialization.
MetaClaw as evolutionary memory. The cross-run skill accumulation with time-decay weighting resembles learning-rate schedules in evolutionary strategies: recent experience is weighted more heavily, but old experience is not immediately discarded. The 30-day decay period functions as a form of forgetting that prevents outdated lessons from constraining future search.
Multi-agent debate as population diversity. Using multiple LLM perspectives for hypothesis generation, result analysis, and peer review parallels the use of diverse populations in evolutionary algorithms to avoid premature convergence. Each "agent" represents a different point in reasoning space.
Self-healing as repair mutation. The sandbox executor's diagnosis-repair loop (detect error → diagnose → generate fix → re-execute) mirrors repair operators in genetic programming, where syntactically or semantically invalid programs are patched rather than discarded.
35.7 The Skills System as Accumulated Knowledge
The skills system deserves separate attention as a mechanism for knowledge accumulation that bridges individual runs and community contributions. Over time, a research group's skill library accumulates domain-specific knowledge that makes the pipeline increasingly effective for their particular research area. This creates a positive feedback loop that we can informally characterize as:
$$\text{Effectiveness}(N) = f\!\left(\,\cdot\,,\ \sum_{n=1}^{N-1} |\mathcal{S}_n|\,\bar{w}_n\right)$$
where $N$ is the current run number, $|\mathcal{S}_n|$ is the number of skills contributed by run $n$, $\bar{w}_n$ is their average time-decay weight at the current run, and $f$ is monotonically increasing in its second argument. This is an informal analytical model introduced by the survey author to express the documented behavior qualitatively. It does not correspond to any implementation formula. The actual relationship between skill count and pipeline effectiveness has not been formally characterized or empirically measured in the repository.
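The weighted quantity in this informal model, the sum over earlier runs of $|\mathcal{S}_n|$ times $\bar{w}_n$, can be computed in a few lines once a decay shape is chosen. The sketch below assumes a linear decay that reaches zero at 30 days; the sources mention a 30-day decay period but not the functional form, so the shape, the function names, and the sample data are all assumptions for illustration.

```python
def decay_weight(age_days: float, period_days: float = 30.0) -> float:
    """Linear decay from 1.0 at age 0 to 0.0 at `period_days` (assumed form)."""
    return max(0.0, 1.0 - age_days / period_days)


def weighted_skill_mass(runs: list[tuple[int, float]]) -> float:
    """Sum of skills_contributed * weight over earlier runs.

    Each entry is (skills_contributed, age_in_days). All skills from one
    run share one age, so the per-run average weight is a single value.
    """
    return sum(count * decay_weight(age) for count, age in runs)


# Hypothetical history: 4 skills from today, 2 from 15 days ago, 5 from 45 days ago.
history = [(4, 0.0), (2, 15.0), (5, 45.0)]
print(weighted_skill_mass(history))  # 5.0
```

In this example the oldest run contributes nothing despite having produced the most skills, which illustrates how decay keeps the effective library from growing without bound.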
The community dimension is also significant: 150+ skills via the K-Dense-AI repository represent crowdsourced scientific methodology knowledge that any AutoResearchClaw installation can leverage (source: repository README). This suggests a model where scientific methodology itself becomes a shareable, versionable artifact — an intriguing direction for the autoresearch field, though the actual impact of community skills on pipeline quality has not been empirically evaluated.
35.8 Process Reward Model Integration
MetaClaw optionally integrates a Process Reward Model (PRM) for quality gating at configurable stages. When enabled, an LLM-as-judge evaluates stage outputs using majority voting (source: repository configuration example, metaclaw_bridge.prm section):
| Config Field | Default | Description |
|---|---|---|
| metaclaw_bridge.prm.enabled | false | Opt-in activation |
| metaclaw_bridge.prm.model | "gpt-5.4" | LLM model used as judge |
| metaclaw_bridge.prm.votes | 3 | Number of independent judge evaluations (majority vote) |
| metaclaw_bridge.prm.gate_stages | [5, 9, 15, 20] | Pipeline stages where PRM gates are applied |
# ABSTRACT PSEUDOCODE: illustrative reconstruction of PRM gate behavior.
# NOT extracted from the repository. Based on documented config fields.
from typing import Callable


class ProcessRewardGatePseudocode:
    """Illustrates the optional LLM-as-judge quality gate concept.

    Config fields: metaclaw_bridge.prm.{enabled, model, votes, gate_stages}
    """

    def __init__(self, llm_generate: Callable[[str, str], str],
                 model: str, votes: int, gate_stages: list[int]):
        # llm_generate(prompt, model) -> response text. Hypothetical stand-in
        # for the pipeline's LLM client, which is not documented.
        self.llm_generate = llm_generate
        self.model = model              # Config: "gpt-5.4"
        self.votes = votes              # Config: 3 (majority vote)
        self.gate_stages = gate_stages  # Config: [5, 9, 15, 20]

    def evaluate(self, stage_id: int, stage_output: str) -> bool:
        """Returns True if stage output passes majority-vote quality check.

        At each gated stage, 'votes' independent LLM judge calls
        are made. Stage passes if a majority approve.
        """
        if stage_id not in self.gate_stages:
            return True  # Not a gated stage
        approvals = 0
        for _ in range(self.votes):
            score = self._single_judge_call(stage_output)
            if score >= 0.5:  # Threshold not documented; 0.5 assumed
                approvals += 1
        return approvals > self.votes // 2  # Majority approval

    def _single_judge_call(self, output: str) -> float:
        """Single LLM judge evaluation; returns quality score in [0, 1].

        Implementation details (prompt, scoring rubric) not documented.
        """
        prompt = self._build_judge_prompt(output)
        response = self.llm_generate(prompt, self.model)
        return self._parse_score(response)

    @staticmethod
    def _build_judge_prompt(output: str) -> str:
        # Placeholder prompt; the real rubric is not documented.
        return f"Rate the quality of this output from 0 to 1:\n{output}"

    @staticmethod
    def _parse_score(response: str) -> float:
        # Assumes the judge replies with a bare number; clamps to [0, 1]
        # and treats unparseable replies as 0.0 (a rejection).
        try:
            return min(1.0, max(0.0, float(response.strip())))
        except ValueError:
            return 0.0
The PRM adds another layer of quality control beyond the standard gate stages, though it also adds LLM cost proportional to the number of gated stages times the vote count. At 3 votes across 4 gate stages, this represents 12 additional LLM calls per run. The actual quality improvement from PRM gating is not reported in the repository.
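The call count follows directly from the two config fields. The short sketch below reproduces the arithmetic and adds one hypothetical parameter, visits, to show how the total would grow if a gated stage were evaluated more than once in a run. Whether the gate is re-evaluated when a stage is revisited is not documented, so that parameter is an assumption.

```python
from typing import Optional


def prm_judge_calls(votes: int, gate_stages: list[int],
                    visits: Optional[dict[int, int]] = None) -> int:
    """Total judge calls: votes per evaluation, summed over gated stages.

    `visits` maps a stage id to how many times its gate is evaluated;
    stages not listed are assumed to be evaluated once.
    """
    visits = visits or {}
    return sum(votes * visits.get(stage, 1) for stage in gate_stages)


print(prm_judge_calls(3, [5, 9, 15, 20]))           # 12
print(prm_judge_calls(3, [5, 9, 15, 20], {15: 3}))  # 18
```

With the documented defaults this gives the 12 calls stated above; under the revisit assumption, evaluating the Stage 15 gate three times would raise the total to 18.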
35.9 Summary
Key Takeaway
AutoResearchClaw demonstrates that fully autonomous research pipelines — from text topic to conference-formatted paper with real literature, executed experiments, and verified results — are technically achievable in 2026. Its most notable documented capabilities are the PIVOT/REFINE decision loop for autonomous research-direction control, the VerifiedRegistry anti-fabrication system for enforcing experimental ground truth in generated papers, and four-layer citation verification that addresses hallucinated references. The MetaClaw cross-run learning system with time-decaying skills provides a mechanism for cumulative improvement across runs.
Main contribution to the field: Among open-source autoresearch systems surveyed in this book, AutoResearchClaw is the first to integrate anti-fabrication enforcement, real citation verification via academic APIs, adaptive research-direction control, and cross-run learning into a single end-to-end pipeline. This assessment is based on comparison with AI Scientist (Lu et al., 2024), AIRA₂ (Chapter 32), AutoResearch (Karpathy, 2025), and FARS (Analemma, 2025) using the criteria in Section 35.6.6. However, FARS's proprietary nature means its full capabilities cannot be assessed, and the rapidly evolving autoresearch landscape may include systems not covered by this survey.
What a researcher should know: Despite its architectural sophistication, AutoResearchClaw's output quality remains unvalidated by external peer review. The system addresses documented failure modes of prior autoresearch systems (hallucinated references in AI Scientist, partial pipelines in AutoResearch), but introduces its own limitations: the quality ceiling is bounded by LLM capability, experiment fidelity is limited by sandbox constraints, novelty assessment requires human judgment, and the MetaClaw integration results lack statistical rigor (see Section 35.4.2). The 23-stage pipeline is best understood not as a replacement for human researchers, but as an automation of the mechanical aspects of research — literature gathering, experiment coding, result formatting — while leaving the most intellectually demanding aspects (true novelty, deep insight, conceptual contribution) as open challenges for future systems.
Evidence boundaries: This chapter's analysis is grounded in the public repository README, configuration examples, directory structure, and version history. Code blocks labeled as abstract pseudocode are written by the survey author to illustrate documented behavior and are not repository excerpts; the configuration, CLI, and skill-format listings in Section 35.5 are reproduced from repository documentation. Mathematical formalizations are analytical interpretations of documented criteria, not implementation descriptions. Readers should consult the repository directly for current implementation details.