Introduced: 2026-03

Score: 7.77/10 (Draft)

Chapter 35

AutoResearchClaw

Part P07: Autonomous Research Systems

35.1 Overview & Motivation

The automation of scientific research represents one of the most ambitious applications of large language models. While prior systems such as AI Scientist (Lu et al., 2024; Sakana AI) demonstrated that LLMs could draft research papers, and AIRA₂ (Meta FAIR) showed that LLM-guided evolution could solve competition-grade machine learning tasks, neither achieved a full closed-loop pipeline from research idea to conference-formatted manuscript with verified experiments and real citations. AutoResearchClaw, released in March 2026 by the AIMING Lab — a multi-university collaboration spanning UC Santa Cruz, UNC Chapel Hill, Johns Hopkins, and UC Davis — targets this gap.

The system's tagline, "Chat an Idea. Get a Paper," captures its ambition: a user provides a research topic as a text string, and the system autonomously produces a complete academic paper including literature review with real references, executed and verified experiments, multi-agent peer review, and LaTeX export in NeurIPS/ICML/ICLR format. The name "Claw" references the lobster emoji in the project's branding (source: repository README).

Provenance Note: Verified vs. Inferred Content

This chapter draws on the AutoResearchClaw GitHub repository (MIT license), its README, configuration examples, and linked companion repositories. The following conventions are used throughout:

Repository-documented: facts directly stated in the README, YAML config examples, CLI documentation, or visible directory structure. These are marked with "(source: repository README)" or similar tags.
Author formalization: mathematical equations and formal models constructed by the survey author to capture documented behavior in precise notation. These are explicitly labeled as analytical interpretations.
Abstract pseudocode: code blocks written by the survey author to illustrate documented pipeline behavior. These are not excerpts from the repository and do not reflect actual class names, method signatures, or internal APIs. They are labeled accordingly.

Where the chapter describes internal mechanisms (e.g., how the PIVOT/REFINE decision is implemented, how the VerifiedRegistry enforces claims), the level of detail is bounded by what the repository README and configuration files disclose. Internal implementation details beyond public documentation are not available for verification.

Key Contribution

AutoResearchClaw's principal contribution is a 23-stage deterministic pipeline that integrates research scoping, real literature discovery (via OpenAlex, Semantic Scholar, and arXiv APIs), multi-agent hypothesis debate, sandboxed experiment execution with self-healing, a PIVOT/REFINE decision loop for autonomous research-direction control, a VerifiedRegistry anti-fabrication system that prevents LLM-generated numbers from entering the paper, four-layer citation verification, and cross-run learning via the companion MetaClaw system. To the best of our knowledge based on publicly available systems surveyed in this book, the combination of anti-fabrication enforcement with adaptive research-direction control and real citation verification has not been demonstrated in a single open-source autoresearch system prior to this release — though this claim is bounded by the survey's coverage and the system's own lack of external validation (see Section 35.6.1).

35.1.1 Predecessor Lineage

The repository README explicitly acknowledges inspiration from three systems: AI Scientist (Sakana AI, 2024), AutoResearch (Karpathy, 2025), and FARS (Analemma, 2025). AutoResearchClaw addresses specific weaknesses in each predecessor as documented by its authors:

Predecessor	Weakness Addressed	AutoResearchClaw Solution	Criteria for Comparison
AI Scientist (Lu et al., 2024)	Hallucinated references; limited experiment fidelity	4-layer citation verification; sandboxed execution with VerifiedRegistry	Lu et al. (2024) report hallucinated references as a known failure mode; no citation verification pipeline described in their paper
AutoResearch (Karpathy, 2025)	Partial pipeline (no experiment execution)	Full 23-stage pipeline with code generation and sandbox	AutoResearch repository scope limited to literature search and writing; no experiment execution stages documented
FARS (Analemma, 2025)	Proprietary; unknown internals	MIT-licensed; fully open source	FARS is a proprietary system with no public source code or detailed methodology disclosure

Note: the comparisons above reflect AutoResearchClaw's authors' framing. The AI Scientist comparison is partially supported by Lu et al. (2024), who acknowledge citation quality issues. The AutoResearch and FARS comparisons rely on AutoResearchClaw README claims and cannot be independently verified against those systems' full capabilities.

The system also ships with two companion projects: MetaClaw, a cross-run learning engine that extracts skills from pipeline failures, and OpenClaw, an AI assistant platform providing a chat interface for pipeline orchestration via Discord, Telegram, and Slack (source: repository README and linked repositories).

35.1.2 Development Velocity

A notable characteristic is the project's rapid iteration: six significant releases in eight days (v0.1.0 on March 15, 2026 through v0.3.2 on March 22, 2026), with continued updates through at least March 30. This pace indicates active development but also suggests potential API instability, a reproducibility concern discussed further in Section 35.5.

35.2 Architecture

35.2.1 Pipeline Overview

AutoResearchClaw organizes research automation into eight phases containing 23 sequential stages. The pipeline is managed by an orchestrator (documented location: researchclaw/pipeline/runner.py) that provides checkpoint/resume capability, gate management, loop control, and artifact versioning (source: repository README and directory structure).

Figure 35.1: The 23-stage pipeline architecture of AutoResearchClaw. Quality gates (⊘) at stages 5, 9, and 20 optionally pause for human review. Multi-agent debate stages (◈) use structured multi-perspective LLM reasoning. The PIVOT/REFINE decision at Stage 15 enables autonomous research-direction control. Source: repository README and documented directory structure.

35.2.2 Architectural Decisions

Several design choices distinguish AutoResearchClaw's architecture from simpler autoresearch pipelines (all sourced from repository README unless otherwise noted):

Sequential determinism with loop escapes. The 23 stages execute in fixed order, providing checkpoint/resume capability via the --resume flag (added in v0.3.2). However, Stage 15's PIVOT/REFINE decision introduces controlled non-linearity: REFINE loops back to Stage 13 (iterative refinement), while PIVOT jumps back to Stage 8 (hypothesis generation). Artifacts are auto-versioned across loops to prevent data loss.

Three quality gates. Stages 5 (literature screening), 9 (experiment design), and 20 (final quality) pause for optional human approval. The --auto-approve flag bypasses all gates for fully autonomous operation. This design acknowledges that full autonomy is desirable for throughput but risky for quality — a tension that permeates autoresearch system design.

Pluggable LLM backend with ACP. Beyond standard OpenAI-compatible API calls, the system supports Agent Client Protocol (ACP) via the acpx library, which delegates LLM calls to external CLI agents (Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI). In ACP mode, a single persistent session accumulates context across all 23 stages, potentially improving coherence (source: repository README, v0.3.2 changelog).
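The control flow implied by the first two design choices can be sketched as a small transition function. This is a minimal illustration by the survey author, assuming only the stage numbers documented above (gates at 5, 9, and 20; decision at Stage 15; REFINE to 13; PIVOT to 8). The names `next_stage`, `needs_human_approval`, and `trace` are hypothetical and do not come from the repository.

```python
from typing import Optional

FIRST_STAGE, LAST_STAGE = 1, 23
GATE_STAGES = {5, 9, 20}   # documented quality gates
DECISION_STAGE = 15        # documented PIVOT/REFINE stage
LOOP_TARGETS = {"PROCEED": 16, "REFINE": 13, "PIVOT": 8}


def needs_human_approval(stage: int, auto_approve: bool) -> bool:
    """A gate pauses the run unless auto-approve is set."""
    return stage in GATE_STAGES and not auto_approve


def next_stage(stage: int, decision: Optional[str] = None) -> Optional[int]:
    """Return the stage to run after `stage`, or None when the run is done."""
    if stage == DECISION_STAGE:
        if decision not in LOOP_TARGETS:
            raise ValueError(f"Stage 15 needs a decision, got {decision!r}")
        return LOOP_TARGETS[decision]
    if stage >= LAST_STAGE:
        return None
    return stage + 1


def trace(decisions: list[str], max_steps: int = 200) -> list[int]:
    """Replay a run for a scripted sequence of Stage 15 decisions."""
    pending = list(decisions)
    path: list[int] = []
    stage: Optional[int] = FIRST_STAGE
    while stage is not None and len(path) < max_steps:
        path.append(stage)
        decision = pending.pop(0) if stage == DECISION_STAGE else None
        stage = next_stage(stage, decision)
    return path


if __name__ == "__main__":
    # One REFINE loop followed by PROCEED visits Stage 15 twice.
    print(trace(["REFINE", "PROCEED"]))
```

Everything outside Stage 15 stays strictly sequential in this sketch, which is what makes checkpoint/resume straightforward: a checkpoint only needs the current stage number and the loop history.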

35.2.3 Multi-Agent Subsystems

The system employs specialized agent subsystems for tasks that exceed single-prompt LLM generation. Three subsystems are documented in the repository's directory structure under researchclaw/agents/:

Agent	Documented Location	Pipeline (from README)	Purpose
CodeAgent v2	`researchclaw/agents/code_agent/`	Architect → Builder → Validator	Multi-phase code generation with AST validation
BenchmarkAgent	`researchclaw/agents/benchmark_agent/`	Surveyor → Selector → Acquirer → Validator	Automated dataset and baseline selection
FigureAgent	`researchclaw/agents/figure_agent/`	Planner → CodeGen → Renderer → Critic → Integrator	Academic figure generation with iterative refinement

Note: the directory structure and sub-agent names above are documented in the repository README and visible in the repository file tree. The internal class names, method signatures, and agent communication protocols are not disclosed in public documentation and are therefore not described here.

Additionally, a multi-agent debate system in researchclaw/agents/debate/ is used at stages 8 (hypothesis generation), 14 (result analysis), and 18 (peer review). Each debate uses multiple LLM "perspectives" to reduce single-point reasoning failures, though the exact debate protocol and number of agents per debate are not documented in public materials.

35.3 Core Algorithms

35.3.1 The PIVOT/REFINE Decision Engine

Stage 15 is the pipeline's most architecturally significant component. After result analysis in Stage 14, the system makes a three-way autonomous decision about research direction. This mechanism transforms AutoResearchClaw from a linear pipeline into an adaptive search process over the space of research directions — a direct analogue to the exploration-exploitation tradeoff in evolutionary search.

The decision criteria, as documented in the repository README:

Decision	Target Stage	Trigger Conditions (repository-documented)
PROCEED	Stage 16 (paper writing)	Results support hypothesis; statistical significance achieved; sufficient coverage; novelty relative to baselines
REFINE	Stage 13 (iterative refinement)	Partial support; some conditions failed; metrics near significance; additional iterations likely to help
PIVOT	Stage 8 (hypothesis generation)	Results contradict hypothesis; fundamental methodology issue; results indistinguishable from baselines

Author formalization. We model the decision as a function over the experimental outcome space to make the documented criteria precise. Let $R = \{r_1, r_2, \ldots, r_m\}$ denote the set of experimental results from Stage 14, where each $r_i$ comprises a metric name, observed value, baseline value, and statistical significance indicator. Let $H$ denote the current hypothesis and $B$ the set of baseline results. The decision function $\delta$ maps these to one of three actions:

$$\delta(R, H, B) = \begin{cases} \text{PROCEED} & \text{if } \texttt{support}(R, H) \geq \tau_p \;\land\; \texttt{novelty}(R, B) \geq \tau_n \\ \text{REFINE} & \text{if } \tau_r \leq \texttt{support}(R, H) < \tau_p \;\land\; k < k_{\max} \\ \text{PIVOT} & \text{otherwise} \end{cases}$$

*Symbol definitions and implementation mapping for the PIVOT/REFINE formalization*
Symbol	Definition	Implementation Artifact	Status
$\texttt{support}(R, H)$	Composite measure of how well results $R$ support hypothesis $H$ (effect size + statistical significance)	Stage 15 LLM prompt; criteria listed in README	Author formalization of documented criteria
$\texttt{novelty}(R, B)$	Improvement of results over baselines $B$	Stage 15 LLM prompt; "novel relative to baselines" criterion	Author formalization of documented criteria
$\tau_p, \tau_n$	PROCEED thresholds for support and novelty	Not exposed in public config	Author-introduced notation; no documented config field
$\tau_r$	REFINE lower threshold for support	Not exposed in public config	Author-introduced notation; no documented config field
$k$	Current refinement iteration count	Pipeline loop counter	Repository-documented (README mentions loop tracking)
$k_{\max}$	Maximum allowed refinement iterations	`max_iterations`; README documents default of 10	Repository-documented

Important caveat: This formalization is an analytical interpretation by the survey author of the documented decision criteria. The actual implementation almost certainly uses an LLM-based reasoning step rather than numeric threshold comparisons — the README describes the decision criteria in natural language and the system is LLM-driven throughout. The thresholds $\tau_p$, $\tau_r$, and $\tau_n$ are analytical constructs introduced here to make the decision boundaries precise; they do not correspond to known configuration fields.
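For readers who want to experiment with the decision boundaries, the formalization of $\delta$ translates directly into a few lines of Python. This is a sketch of the survey author's model only, not of the repository's LLM-based logic. The threshold values for `tau_p`, `tau_n`, and `tau_r` are arbitrary placeholders chosen for illustration; only `k_max = 10` reflects a documented default.

```python
def decide(
    support: float,
    novelty: float,
    k: int,
    tau_p: float = 0.8,   # hypothetical PROCEED threshold on support
    tau_n: float = 0.5,   # hypothetical PROCEED threshold on novelty
    tau_r: float = 0.4,   # hypothetical REFINE lower bound on support
    k_max: int = 10,      # documented default for max_iterations
) -> str:
    """Evaluate the piecewise decision function delta(R, H, B)."""
    if support >= tau_p and novelty >= tau_n:
        return "PROCEED"
    if tau_r <= support < tau_p and k < k_max:
        return "REFINE"
    return "PIVOT"


if __name__ == "__main__":
    print(decide(support=0.9, novelty=0.7, k=0))   # PROCEED
    print(decide(support=0.6, novelty=0.7, k=2))   # REFINE
    print(decide(support=0.6, novelty=0.7, k=10))  # PIVOT: budget spent
    print(decide(support=0.9, novelty=0.1, k=0))   # PIVOT: not novel
```

Note one consequence of the "otherwise" branch: strong support with low novelty falls through to PIVOT, which matches the documented trigger "results indistinguishable from baselines".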

When PIVOT is triggered, the system archives current results with a version tag, jumps back to Stage 8, provides previous failed hypotheses as negative context, and requires new hypotheses to differ from all previous attempts (source: repository README). This creates a closed-loop research process that can autonomously recover from dead-end research directions.

# ABSTRACT PSEUDOCODE — illustrative reconstruction of Stage 15 behavior.
# NOT extracted from the repository. Class names, method signatures, and
# internal APIs are invented by the survey author for pedagogical clarity.
# The actual implementation is not publicly documented at this level of detail.

class ResearchDecisionPseudocode:
    """Illustrates the Stage 15 autonomous research direction control."""

    PROCEED = "PROCEED"
    REFINE = "REFINE"
    PIVOT = "PIVOT"

    def decide(
        self,
        results: list,       # Experimental results from Stage 14
        hypothesis: str,     # Current hypothesis text
        baselines: list,     # Baseline comparison results
        iteration: int,      # Current refinement cycle count
        max_iterations: int = 10,  # Documented default (README)
    ) -> tuple[str, str]:
        """Returns (decision, rationale) via LLM reasoning.

        The LLM receives structured result data and the documented
        PROCEED/REFINE/PIVOT criteria, then outputs a decision
        with an explicit rationale.
        """
        prompt = self._build_decision_prompt(
            results=results,
            hypothesis=hypothesis,
            baselines=baselines,
            iteration=iteration,
            # README documents that failed hypotheses are provided
            # as negative context on PIVOT
            previous_failures=self.knowledge_base.get("failed_hypotheses", []),
        )

        response = self.llm.generate(prompt)
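        decision, rationale = self._parse_decision(response)

        # Enforce the refinement budget (k < k_max in the formalization):
        # once it is spent, a REFINE verdict falls through to PIVOT.
        if decision == self.REFINE and iteration >= max_iterations:
            decision = self.PIVOT
            rationale += " (refinement budget exhausted)"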

        # README documents artifact versioning across loops
        if decision in (self.REFINE, self.PIVOT):
            self.artifact_store.version_snapshot(
                tag=f"iter-{iteration}-{decision.lower()}"
            )

        # README documents negative context injection on PIVOT
        if decision == self.PIVOT:
            self.knowledge_base.add(
                category="failed_hypotheses",  # same key that decide() reads back
                entry={
                    "hypothesis": hypothesis,
                    "outcome": "pivoted",
                    "rationale": rationale,
                },
            )

        return decision, rationale

35.3.2 Anti-Fabrication System: VerifiedRegistry

The most dangerous failure mode of LLM-generated research papers is fabricated experimental results — numbers that look plausible but have no basis in actual computation. AutoResearchClaw addresses this with the VerifiedRegistry, a ground-truth enforcement layer that sits between experiment execution and paper writing (source: repository README; documented directory: researchclaw/verification/).

The mechanism works as follows, per the repository README:

Registration: When experiments in Stage 12/13 produce results, structured JSON metrics are indexed in the VerifiedRegistry with experiment IDs, conditions, metric values, execution logs, and timestamps.
Enforcement: During paper writing (Stages 16–19), the LLM may only cite metrics that exist in the registry. Unverified numbers are sanitized — either removed or flagged.
Repair: If experiments fail (preventing registration of any results), a diagnosis-and-repair loop attempts to fix the code and re-execute. Configuration allows up to max_cycles=3 repair attempts. If minimum completion rate is not met, the system degrades gracefully by reporting partial results.

# ABSTRACT PSEUDOCODE — illustrative reconstruction of VerifiedRegistry behavior.
# NOT extracted from the repository. The actual class hierarchy, method names,
# and internal data structures are not publicly documented.

class VerifiedRegistryPseudocode:
    """Illustrates the ground-truth enforcement concept.

    Core idea (from README): only experimentally verified numbers
    may appear in the generated paper. Unverified claims are sanitized.
    """

    def __init__(self):
        self._registry: dict[str, dict] = {}  # experiment_id -> record

    def register(
        self,
        experiment_id: str,
        conditions: list[str],
        metrics: dict[str, float],
        execution_log: str,
    ) -> None:
        """Register verified results from sandbox execution.

        Called after successful experiment runs in Stages 12/13.
        Metrics are structured JSON (documented in README).
        """
        self._registry[experiment_id] = {
            "conditions": conditions,
            "metrics": metrics,    # e.g., {"accuracy": 0.847, "f1": 0.812}
            "execution_log": execution_log,
            "timestamp": "...",    # execution timestamp
        }

    def verify_claim(self, metric_name: str, claimed_value: float) -> bool:
        """Check if a claimed number exists in verified results.

        README states: 'only registry-verified numbers can be cited.'
        """
        for record in self._registry.values():
            if metric_name in record["metrics"]:
                if abs(record["metrics"][metric_name] - claimed_value) < 1e-6:
                    return True
        return False

    def sanitize_paper(self, draft: str) -> str:
        """Remove or flag unverified numerical claims.

        README: 'Unverified numbers are sanitized (removed or flagged).'
        The Sentinel Watchdog performs complementary checks.
        """
        numbers = self._extract_numerical_claims(draft)
        for claim in numbers:
            if not self.verify_claim(claim.metric, claim.value):
                draft = self._replace_with_flag(draft, claim)
        return draft
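The pseudocode above leaves claim extraction abstract. One way to make the idea concrete is a regex pass that flags any decimal number in the draft that does not match a verified value at the precision printed. This is a simplified sketch by the survey author: it ignores metric names, assumes claims are printed in the same unit as the registry values, and uses a hypothetical `[UNVERIFIED: ...]` marker that is not documented in the repository.

```python
import re

# Decimal literals such as 0.847; version strings like v0.3.2 are skipped.
NUMBER = re.compile(r"(?<![\w.])\d+\.\d+(?!\.?\d)")


def is_verified(token: str, verified: list[float]) -> bool:
    """True if some verified value rounds to the token as printed."""
    decimals = len(token.split(".")[1])
    printed = f"{float(token):.{decimals}f}"
    return any(f"{value:.{decimals}f}" == printed for value in verified)


def sanitize(draft: str, verified: list[float]) -> str:
    """Wrap every unverified decimal in a visible marker."""

    def flag(match: re.Match) -> str:
        token = match.group(0)
        if is_verified(token, verified):
            return token
        return f"[UNVERIFIED: {token}]"

    return NUMBER.sub(flag, draft)


if __name__ == "__main__":
    registry_values = [0.8471, 0.8123]  # stand-in for registered metrics
    text = "Accuracy reached 0.847 and F1 was 0.91."
    print(sanitize(text, registry_values))
```

Matching at printed precision rather than with a fixed tolerance lets a paper report `0.847` for a registered `0.8471` while still rejecting numbers with no registered counterpart.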

The Sentinel Watchdog (researchclaw/sentinel/) operates as a complementary background quality monitor, performing continuous checks for NaN/Inf in results, paper-evidence consistency, citation relevance scoring, and anti-fabrication enforcement (source: repository README).

Figure 35.2: VerifiedRegistry data flow. Experiments register structured metrics; paper writing stages query the registry; unverified claims are sanitized. The repair loop attempts to fix failed experiments before paper writing begins. Source: repository README documentation.

35.3.3 Four-Layer Citation Verification

Hallucinated references were identified as a critical weakness in AI Scientist (Lu et al., 2024). AutoResearchClaw implements a four-layer verification pipeline in Stage 23 to ensure all citations are both real and relevant (source: repository README):

Figure 35.3: Four-layer citation verification pipeline. Each layer eliminates a different class of citation failure: nonexistent IDs (Layer 1), unresolvable DOIs (Layer 2), unverifiable titles/authors (Layer 3), and irrelevant-but-real references (Layer 4). Failures at any layer cause citation removal. Source: repository README.

The four layers target progressively subtler forms of citation failure:

arXiv ID verification: If a citation claims an arXiv identifier, the system verifies it exists via the arXiv API. Invalid IDs trigger immediate removal.
CrossRef/DataCite DOI resolution: DOIs are verified to resolve to real publications, with metadata (title, authors, year) cross-checked against the citation.
Semantic Scholar title matching: Paper titles are searched in Semantic Scholar with fuzzy matching to handle minor variations. Authors and venue are also verified.
LLM relevance scoring: Even if a citation is verifiably real, the system scores its relevance to the paper content on a 0–1 scale. Citations below a threshold are removed as "real but irrelevant" padding.
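The four layers above behave like a short-circuiting chain of checks: the first failing layer decides the citation's fate. The sketch below illustrates that structure with in-memory stand-ins instead of real API calls. All names (`Citation`, `build_layers`, `verify`), the 0.9 fuzzy-match ratio, and the 0.5 relevance threshold are the survey author's assumptions; the repository's actual values are not documented.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional


@dataclass
class Citation:
    key: str
    title: str
    arxiv_id: Optional[str] = None
    doi: Optional[str] = None


Check = Callable[[Citation], bool]


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def titles_match(a: str, b: str, min_ratio: float = 0.9) -> bool:
    """Fuzzy title comparison that tolerates case and spacing changes."""
    ratio = SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()
    return ratio >= min_ratio


def build_layers(
    known_arxiv: set[str],
    known_dois: set[str],
    indexed_titles: list[str],
    relevance: Callable[[Citation], float],
    min_relevance: float = 0.5,
) -> list[tuple[str, Check]]:
    """Sets and lists stand in for the arXiv, DOI, and title lookups."""
    return [
        ("arxiv", lambda c: c.arxiv_id is None or c.arxiv_id in known_arxiv),
        ("doi", lambda c: c.doi is None or c.doi in known_dois),
        ("title", lambda c: any(titles_match(c.title, t) for t in indexed_titles)),
        ("relevance", lambda c: relevance(c) >= min_relevance),
    ]


def verify(citation: Citation, layers: list[tuple[str, Check]]) -> Optional[str]:
    """Return the name of the first failing layer, or None if all pass."""
    for name, check in layers:
        if not check(citation):
            return name
    return None


if __name__ == "__main__":
    layers = build_layers(
        known_arxiv={"0000.00001"},          # placeholder identifier
        known_dois=set(),
        indexed_titles=["An Example Paper on Token Merging"],
        relevance=lambda c: 0.9,
    )
    fake = Citation("k2", "An Example Paper on Token Merging", arxiv_id="9999.99999")
    print(verify(fake, layers))  # fails at the first layer
```

Ordering the cheap existence checks before the LLM relevance call means most fabricated references are rejected without spending a model call.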

The BibTeX output is auto-pruned to match only inline \cite{} references that survive all four layers, preventing orphaned bibliography entries (source: repository README).
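Pruning a bibliography to the keys that are actually cited is simple enough to sketch. The version below is an illustration by the survey author rather than the repository's code; it assumes each BibTeX entry starts with `@` at the beginning of a line and handles only `\cite`-family commands with optional bracket arguments. In the full pipeline, the set of kept keys would also be intersected with the keys that survived verification.

```python
import re

# Matches \cite{a,b}, \citep[see][]{c}, \citet*{d}, and similar commands.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every citation key used inline in a LaTeX source."""
    keys: set[str] = set()
    for group in CITE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def prune_bibtex(bib: str, keep: set[str]) -> str:
    """Keep only entries whose key appears in `keep`."""
    entries = re.split(r"(?m)^(?=@)", bib)
    kept = []
    for entry in entries:
        match = ENTRY_KEY.match(entry)
        if match and match.group(1) in keep:
            kept.append(entry.strip())
    return "\n\n".join(kept)


if __name__ == "__main__":
    tex = r"As shown in \cite{smith2020, lee2021} and \citep[see][]{wu2022}."
    bib = (
        "@article{smith2020,\n  title={A},\n}\n\n"
        "@inproceedings{orphan1999,\n  title={B},\n}\n\n"
        "@misc{wu2022,\n  title={C},\n}\n"
    )
    print(prune_bibtex(bib, cited_keys(tex)))
```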

Implementation note: The repository README documents the four-layer design and names the external APIs used (arXiv, CrossRef, DataCite, Semantic Scholar). The internal implementation of the LLM relevance scoring — including the relevance threshold value, prompt design, and scoring mechanism — is not publicly documented. The hallucinated-reference removal rate is also not reported.

35.3.4 MetaClaw Cross-Run Learning

MetaClaw provides persistent cross-run learning through a lesson-to-skill pipeline. After each pipeline run, the system captures failures and warnings as structured lessons. MetaClaw then converts these into SKILL.md files stored in ~/.metaclaw/skills/, which are injected into future runs' LLM prompts at applicable stages (source: repository README, MetaClaw repository).

Author formalization. The learning loop can be expressed as follows. Let $L_n = \{l_1, l_2, \ldots, l_j\}$ denote lessons extracted from run $n$, where each lesson $l_i$ includes a stage identifier, severity level, category, description, and optional resolution. MetaClaw converts lessons to skills via a filtering and transformation function:

$$S_n = \texttt{convert}\!\bigl(\{l \in L_n \mid \texttt{severity}(l) \geq \sigma_{\min}\}\bigr)$$

The cumulative skill library available at run $N$ is:

$$\mathcal{S}_N = \bigcup_{n=1}^{N-1} \{ s \in S_n \mid w(s, t_N) > 0 \}$$

where $w(s, t_N)$ is a time-decay weighting function:

$$w(s, t_N) = \max\!\left(0,\; 1 - \frac{t_N - t_s}{T_{\text{decay}}}\right)$$

Here $t_s$ is the timestamp when skill $s$ was created, $t_N$ is the start time of run $N$, and $T_{\text{decay}}$ is the decay period.

*Symbol definitions and implementation mapping for MetaClaw formalization*
Symbol	Definition	Config Field / Artifact	Documented Default	Status
$\sigma_{\min}$	Minimum severity for lesson→skill conversion	`metaclaw_bridge.min_severity`	`"warning"`	Repository config example
$\lvert S_n \rvert$ cap	Maximum new skills per run	`metaclaw_bridge.skills_per_run`	`3`	Repository config example
$T_{\text{decay}}$	Time-decay period for skill relevance	Described in README as "30-day decay"	30 days	Repository README
$\texttt{convert}(\cdot)$	Lesson-to-skill transformation	MetaClaw processing pipeline	—	Documented behavior; internal mechanism unknown
$w(s, t_N)$	Time-decay weight for skill $s$ at time $t_N$	Implemented in `build_overlay()`	Linear decay	Author formalization of documented "30-day decay" behavior

Note: the linear time-decay form $w(s, t_N) = \max(0, 1 - (t_N - t_s)/30)$ is the survey author's analytical interpretation of the documented "30-day decay period" behavior. The actual decay function may use a different shape (exponential, step, etc.) — the README describes the 30-day period but not the functional form.
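The severity filter, the per-run cap, and the linear decay weight from the formalization can be combined in a short sketch. The severity ordering (`info < warning < error`) and the rule of keeping the most severe lessons when the cap binds are assumptions by the survey author; the README documents only the `min_severity` and `skills_per_run` config fields and the 30-day period.

```python
# Assumed ordering of severity levels; not documented in the repository.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def select_lessons(
    lessons: list[dict],
    min_severity: str = "warning",  # metaclaw_bridge.min_severity
    skills_per_run: int = 3,        # metaclaw_bridge.skills_per_run
) -> list[dict]:
    """Apply the sigma_min filter, then cap the number of new skills."""
    floor = SEVERITY_RANK[min_severity]
    eligible = [
        lesson for lesson in lessons
        if SEVERITY_RANK[lesson["severity"]] >= floor
    ]
    # Stable sort: most severe first, original order kept within a level.
    eligible.sort(key=lambda lesson: SEVERITY_RANK[lesson["severity"]], reverse=True)
    return eligible[:skills_per_run]


def decay_weight(age_days: float, decay_days: float = 30.0) -> float:
    """Linear form of w(s, t_N); the actual shape is undocumented."""
    return max(0.0, 1.0 - age_days / decay_days)


if __name__ == "__main__":
    run_lessons = [
        {"stage": 4, "severity": "info"},
        {"stage": 12, "severity": "warning"},
        {"stage": 13, "severity": "error"},
        {"stage": 16, "severity": "warning"},
        {"stage": 23, "severity": "error"},
    ]
    print([lesson["stage"] for lesson in select_lessons(run_lessons)])
    print(decay_weight(15))  # a 15-day-old skill carries half weight
```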

# ABSTRACT PSEUDOCODE — illustrative reconstruction of MetaClaw skill injection.
# NOT extracted from the repository. Illustrates the documented lesson→skill→overlay
# pipeline based on README descriptions and config fields.

class MetaClawBridgePseudocode:
    """Cross-run learning via lesson extraction and skill injection."""

    DECAY_DAYS = 30              # README: "30-day decay period"
    MAX_SKILLS_PER_RUN = 3       # Config: metaclaw_bridge.skills_per_run

    def extract_lessons(self, run_log) -> list:
        """Extract actionable lessons from a completed pipeline run.

        README documents: lessons captured from 'decision rationale,
        runtime warnings, metric anomalies.'
        """
        lessons = []
        for stage_log in run_log.stages:
            if stage_log.retries > 0 or stage_log.warnings:
                lessons.append({
                    "stage": stage_log.stage_id,
                    "severity": stage_log.max_severity,
                    "category": stage_log.category,
                    "description": stage_log.failure_description,
                    "resolution": stage_log.auto_resolution,
                })
        return lessons

    def build_overlay(self, current_time) -> list:
        """Load time-weighted skills for injection into current run.

        README: skills stored in ~/.metaclaw/skills/ as SKILL.md files.
        """
        from pathlib import Path
        skills_dir = Path.home() / ".metaclaw" / "skills"
        active_skills = []
        for skill_path in skills_dir.glob("arc-*.md"):
            skill = self._parse_skill_file(skill_path)
            age_days = (current_time - skill["created_at"]).days
            weight = max(0.0, 1.0 - age_days / self.DECAY_DAYS)
            if weight > 0:
                skill["weight"] = weight
                active_skills.append(skill)
        return sorted(active_skills, key=lambda s: s["weight"], reverse=True)

35.3.5 Experiment Complexity Routing

AutoResearchClaw automatically assesses experiment complexity and routes code generation accordingly. The repository README and configuration example document a three-tier routing system:

$$\texttt{route}(c) = \begin{cases} \text{Direct LLM generation} & \text{if } c < c_{\text{low}} \\ \text{CodeAgent v2 (architecture planning)} & \text{if } c_{\text{low}} \leq c < \theta \\ \text{OpenCode Beast Mode} & \text{if } c \geq \theta \end{cases}$$

where $c \in [0, 1]$ is the assessed complexity score, $c_{\text{low}}$ is the lower routing boundary, and $\theta$ is the upper threshold.

*Symbol definitions and implementation mapping for complexity routing*
Symbol	Definition	Config Field	Documented Default	Status
$c$	Assessed complexity score	Computed internally at experiment design	—	Documented behavior; scoring method not disclosed
$c_{\text{low}}$	Lower routing boundary (simple → medium)	Not separately configurable in public config	0.2 (README description)	Repository README
$\theta$	Upper threshold (medium → Beast Mode)	`opencode.complexity_threshold`	0.2	Repository config example

Threshold inconsistency in documentation. The README describes a three-tier system with the lower boundary at 0.2, but the example configuration sets opencode.complexity_threshold: 0.2 as well. With $c_{\text{low}} = \theta = 0.2$, the middle tier (CodeAgent v2 without OpenCode) has an empty domain: no complexity score satisfies $0.2 \leq c < 0.2$. This collapses the three-tier system into a two-tier system (direct LLM for $c < 0.2$; OpenCode Beast Mode for $c \geq 0.2$). Three possible explanations: (1) the documented default of 0.2 is a placeholder expecting user adjustment to a higher value; (2) the lower boundary $c_{\text{low}}$ is actually less than 0.2 in the implementation but not separately documented; or (3) CodeAgent v2 is always active and the "threshold" only controls whether OpenCode is additionally invoked. Without access to the internal routing logic, we cannot resolve this inconsistency. In practice, users who want three-tier routing should set opencode.complexity_threshold to a value strictly greater than 0.2.
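The empty middle tier is easy to confirm by writing the routing function out. In this sketch by the survey author, `route` and its tier labels are hypothetical names; the defaults mirror the documented values $c_{\text{low}} = \theta = 0.2$.

```python
def route(c: float, c_low: float = 0.2, theta: float = 0.2) -> str:
    """Three-tier routing as formalized above."""
    if not 0.0 <= c <= 1.0:
        raise ValueError(f"complexity score must lie in [0, 1], got {c}")
    if c < c_low:
        return "direct_llm"
    if c < theta:
        return "code_agent_v2"
    return "opencode_beast_mode"


if __name__ == "__main__":
    scores = [i / 100 for i in range(101)]
    # With the documented defaults, the middle tier is never selected.
    print({route(c) for c in scores})
    # Raising the upper threshold restores all three tiers.
    print({route(c, theta=0.6) for c in scores})
```

With the defaults, the first set contains only two labels; with `theta=0.6`, scores in $[0.2, 0.6)$ reach the middle tier.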

CodeAgent v2 (researchclaw/agents/code_agent/) performs architecture planning, sequential file generation following a dependency DAG, AST-based hard validation (blocking identical ablations and hardcoded metrics), and execution-in-the-loop fixing (up to exec_fix_max_iterations=3 attempts with 60-second timeout per the config example). OpenCode Beast Mode delegates to the external OpenCode system for multi-file projects with custom architectures, training loops, and ablation studies (source: repository README and config example).
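To give a sense of what AST-based blocking of hardcoded metrics might look like, the sketch below uses Python's standard `ast` module to flag assignments of numeric literals to metric-like variable names. It is an illustration by the survey author, not the repository's validator: the `METRIC_NAMES` set and the `find_hardcoded_metrics` function are hypothetical, and the check covers only simple `name = literal` assignments (not dictionary literals or augmented assignments).

```python
import ast

# Hypothetical list of names treated as reported metrics.
METRIC_NAMES = {"accuracy", "acc", "f1", "loss", "auc", "bleu"}


def find_hardcoded_metrics(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for every metric assigned a numeric literal."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        is_number = (
            isinstance(value, ast.Constant)
            and isinstance(value.value, (int, float))
            and not isinstance(value.value, bool)
        )
        if not is_number:
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.lower() in METRIC_NAMES:
                findings.append((node.lineno, target.id))
    return findings


if __name__ == "__main__":
    generated = (
        "accuracy = 0.95\n"
        "loss = compute_loss(model)\n"
        "epochs = 10\n"
    )
    print(find_hardcoded_metrics(generated))
```

A static check of this kind complements the VerifiedRegistry: it rejects suspicious experiment code before execution, whereas the registry filters numbers after execution.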

35.3.6 Hardware-Aware Experiment Adaptation

The system auto-detects available hardware and adapts generated experiment code accordingly (source: repository README):

Detected Hardware	Mode	Adaptations (README-documented)
NVIDIA GPU (CUDA)	Full-scale	PyTorch CUDA, large batch sizes, full training epochs
Apple Silicon (MPS)	Adapted scale	PyTorch MPS backend, reduced batch sizes
CPU only	Minimal	Small models, few epochs, scikit-learn focus

Code generation adjusts imports, model sizes, batch sizes, training epochs, and package selection based on the detected hardware tier. This is critical for reproducibility across heterogeneous hardware environments, though the specific heuristics for each adaptation are not documented beyond the summary above.
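A minimal sketch of the detection step, assuming PyTorch is the probe (the README does not say how detection is implemented). The profile values for batch size and epochs are placeholders invented for illustration; only the three tiers themselves are documented. The functions `tier_from_flags` and `detect_tier` are hypothetical names.

```python
def tier_from_flags(has_cuda: bool, has_mps: bool) -> str:
    """Map capability flags to one of the three documented tiers."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def detect_tier() -> str:
    """Probe the local machine; fall back to CPU if PyTorch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    has_cuda = torch.cuda.is_available()
    mps = getattr(torch.backends, "mps", None)  # missing in old PyTorch builds
    has_mps = bool(mps is not None and mps.is_available())
    return tier_from_flags(has_cuda, has_mps)


# Placeholder numbers: the README gives only qualitative adaptations.
PROFILES = {
    "cuda": {"device": "cuda", "batch_size": 128, "epochs": 50},
    "mps": {"device": "mps", "batch_size": 32, "epochs": 20},
    "cpu": {"device": "cpu", "batch_size": 8, "epochs": 3},
}


if __name__ == "__main__":
    tier = detect_tier()
    print(tier, PROFILES[tier])
```

Separating the pure mapping (`tier_from_flags`) from the probe (`detect_tier`) keeps the tier logic testable on machines without a GPU.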

35.4 Key Results

35.4.1 Showcase Papers

The repository documents eight papers generated fully autonomously across eight research domains: random matrix theory, weak IV estimators, SIR/SEIR identifiability, Krylov preconditioners, GARD-LoRA (parameter-efficient fine-tuning), LACE exploration (reinforcement learning), FAME token merging (vision transformers), and CRAFT distillation (knowledge distillation). These papers are presented in the repository as showcase demonstrations of the pipeline's breadth (source: repository README).

Evidence limitations. The showcase papers have not been submitted to actual conferences, evaluated by domain experts in blind review, or compared to human-written papers on matched topics. The repository reports no quantitative quality metrics for the papers themselves (e.g., reviewer scores, readability measures, or factual accuracy rates). Claims of "conference-ready" quality and "NeurIPS/ICML/ICLR" targeting in the README reflect system design intent rather than validated output quality. Selection bias is also possible — the eight showcased papers may not be representative of typical run quality.

35.4.2 MetaClaw Integration Results

The repository README reports A/B comparisons, which it describes as controlled, of pipeline performance with and without MetaClaw cross-run learning. The README states these used "same topic, same LLM, same configuration," but does not disclose further experimental details.

Repository-reported anecdotal evidence. The following table reproduces numbers from the repository README. Critical methodological details are absent: the number of trials, specific topics used, which LLM model and configuration were employed, variance or confidence intervals across runs, and whether these are single-run results or averages over multiple trials. The "18/19" and "19/19" stage-completion denominators suggest a single comparison pair with 19 measured stages (not the full 23, possibly excluding gate stages), though this is not confirmed. Without trial counts and statistical measures of variability, these numbers should be interpreted as illustrative anecdotes rather than statistically robust evidence.

Metric	Baseline (no MetaClaw)	With MetaClaw	Change	Source	Verifiable?
Stage retry rate	10.5%	7.9%	−24.8%	Repository README	No (raw data not available)
Refine cycle count	2.0	1.2	−40.0%	Repository README	No (raw data not available)
Pipeline stage completion	18/19	19/19	+5.3 pp (+5.6% relative)	Repository README	No (single pair implied)
Overall robustness score	0.714	0.845	+18.3%	Repository README	Partially (refine_efficiency is not fully defined)

The composite robustness score formula is documented in the repository:

$$\text{robustness} = 0.4 \cdot \text{stage\_completion\_rate} + 0.3 \cdot (1 - \text{retry\_rate}) + 0.3 \cdot \text{refine\_efficiency}$$

*Robustness score variable definitions*
Variable	Definition	Source
stage_completion_rate	Fraction of measured stages completing without failure	Repository README
retry_rate	Fraction of stages requiring at least one retry	Repository README
refine_efficiency	Reduction in REFINE cycles (definition not fully specified)	Repository README; exact formula undocumented
Weights (0.4, 0.3, 0.3)	Weighting chosen by the repository authors	Repository README

Consistency check. Using the reported numbers: baseline robustness $= 0.4 \times (18/19) + 0.3 \times (1 - 0.105) + 0.3 \times \text{refine\_eff}_{\text{base}} = 0.379 + 0.269 + 0.3 \times \text{refine\_eff}_{\text{base}}$. For this to equal 0.714, we need $\text{refine\_eff}_{\text{base}} \approx 0.22$. Similarly, MetaClaw robustness $= 0.4 \times (19/19) + 0.3 \times (1 - 0.079) + 0.3 \times \text{refine\_eff}_{\text{meta}} = 0.400 + 0.276 + 0.3 \times \text{refine\_eff}_{\text{meta}}$. For 0.845, we need $\text{refine\_eff}_{\text{meta}} \approx 0.56$. Both implied values lie in $[0, 1]$, so the reported scores are compatible with the formula. Because the values are back-solved from the scores rather than measured independently, this check cannot confirm the scores, and the refine_efficiency metric itself remains not fully defined.
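The back-solving step can be reproduced in a few lines. The function names are the survey author's; the weights and input numbers are those reported in the repository README.

```python
def robustness(completion: float, retry_rate: float, refine_eff: float) -> float:
    """Composite score with the documented 0.4 / 0.3 / 0.3 weights."""
    return 0.4 * completion + 0.3 * (1.0 - retry_rate) + 0.3 * refine_eff


def implied_refine_efficiency(score: float, completion: float, retry_rate: float) -> float:
    """Solve the robustness formula for the undocumented third term."""
    return (score - 0.4 * completion - 0.3 * (1.0 - retry_rate)) / 0.3


if __name__ == "__main__":
    base = implied_refine_efficiency(0.714, 18 / 19, 0.105)
    meta = implied_refine_efficiency(0.845, 19 / 19, 0.079)
    print(f"baseline: {base:.2f}, with MetaClaw: {meta:.2f}")
```

Running this yields roughly 0.22 and 0.56, matching the hand calculation.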

35.4.3 Adoption Metrics

As of April 2026, the repository reports more than 9,800 GitHub stars, a test suite of 1,823 passing tests, 20 built-in skills with an extensible community skill system, and README translations in 9 languages (source: repository README). These adoption numbers are notable for a system released only weeks earlier and suggest significant community interest in autoresearch tooling.

35.5 Implementation Details

35.5.1 Codebase and Dependencies

The system is implemented entirely in Python, organized under the researchclaw/ package. The directory structure below is visible in the public repository:

Directory	Purpose (from README)	Notable Subdirectories
`researchclaw/pipeline/`	Pipeline orchestrator, stage runner, checkpointing	`runner.py`, `checkpoint.py`, `stages/`
`researchclaw/agents/`	Multi-agent subsystems	`base.py`, `code_agent/`, `benchmark_agent/`, `figure_agent/`, `debate/`
`researchclaw/literature/`	Academic API clients	OpenAlex, Semantic Scholar, arXiv integration
`researchclaw/sandbox/`	Experiment execution sandbox	AST validation, Docker mode, self-healing
`researchclaw/sentinel/`	Quality watchdog	NaN/Inf detection, consistency checks
`researchclaw/verification/`	Citation verification, VerifiedRegistry	4-layer verification pipeline
`researchclaw/skills/`	Skills management	Skill loading, injection, community skills
`researchclaw/templates/`	LaTeX templates	NeurIPS 2025, ICLR 2026, ICML 2026
`researchclaw/config/`	Configuration management	YAML config parsing
`researchclaw/knowledge/`	Run-level knowledge base	Markdown or Obsidian backend

Dependencies include openai and httpx for LLM integration, requests for literature APIs, jinja2 for LaTeX template rendering, pyyaml for configuration, and Python's standard-library ast module for code validation (source: repository pyproject.toml). The CLI is built on click and exposed via the researchclaw command.

35.5.2 Configuration

The system is configured via a YAML file. The following is reproduced from the documented config.researchclaw.example.yaml (source: repository), showing key configuration blocks with documented defaults:

# Reproduced from config.researchclaw.example.yaml (repository documentation)
# This IS a direct representation of the documented configuration structure.

llm:
  provider: "openai"          # openai | openrouter | deepseek | acp | ...
  acp:
    agent: "claude"           # claude | codex | gh | gemini | opencode | kimi

sandbox:
  mode: "sandbox"             # simulated | sandbox | docker | ssh_remote
  memory_limit_mb: 4096
  time_budget_sec: 300

code_agent:
  enabled: true
  architecture_planning: true
  hard_validation: true
  hard_validation_max_repairs: 2
  exec_fix_max_iterations: 3
  exec_fix_timeout_sec: 60

benchmark_agent:
  enabled: true
  enable_hf_search: true
  tier_limit: 2               # 1=small, 2=medium, 3=large

figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 8
  max_iterations: 3           # Critic-driven refinement
  dpi: 300

opencode:
  enabled: true
  auto: true
  complexity_threshold: 0.2   # See Section 35.3.5 for threshold inconsistency
  timeout_sec: 600

metaclaw_bridge:
  enabled: false              # Opt-in
  min_severity: "warning"
  skills_per_run: 3
  prm:
    enabled: false
    model: "gpt-5.4"
    votes: 3
    gate_stages: [5, 9, 15, 20]

knowledge_base:
  backend: "markdown"         # or "obsidian"
  root: "docs/kb"
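
A configuration with this many enumerated options is easy to mistype. The sketch below checks an already-parsed configuration dictionary against the value sets that appear in the example file's comments. The function `check_config`, its messages, and the decision to validate at all are assumptions made for illustration; the repository's own validation logic is not described in the material reviewed here.

```python
# Hypothetical validator for a parsed config dict (e.g. from yaml.safe_load).
# Allowed values come from the comments in the example file; the function
# itself is illustrative and not part of the repository.

SANDBOX_MODES = {"simulated", "sandbox", "docker", "ssh_remote"}
KB_BACKENDS = {"markdown", "obsidian"}


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []

    mode = cfg.get("sandbox", {}).get("mode")
    if mode not in SANDBOX_MODES:
        problems.append(f"sandbox.mode: unknown value {mode!r}")

    tier = cfg.get("benchmark_agent", {}).get("tier_limit")
    if tier not in (1, 2, 3):
        problems.append(f"benchmark_agent.tier_limit: expected 1-3, got {tier!r}")

    fig = cfg.get("figure_agent", {})
    if fig.get("min_figures", 0) > fig.get("max_figures", 0):
        problems.append("figure_agent: min_figures exceeds max_figures")

    backend = cfg.get("knowledge_base", {}).get("backend")
    if backend not in KB_BACKENDS:
        problems.append(f"knowledge_base.backend: unknown value {backend!r}")

    return problems


example = {
    "sandbox": {"mode": "sandbox", "memory_limit_mb": 4096, "time_budget_sec": 300},
    "benchmark_agent": {"enabled": True, "tier_limit": 2},
    "figure_agent": {"min_figures": 3, "max_figures": 8},
    "knowledge_base": {"backend": "markdown", "root": "docs/kb"},
}
print(check_config(example))                      # prints []
bad = dict(example, sandbox={"mode": "dokcer"})   # deliberate typo
print(check_config(bad))
```

Note that in this sketch a missing section is reported the same way as an unknown value, which is a design choice of the example rather than documented behavior.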

35.5.3 Reproducing a Run

The following CLI commands are documented in the repository README and represent the verified user-facing interface:

# From repository README — verified CLI commands

# Installation:
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
researchclaw setup
researchclaw init    # Interactive config generation

# Set API key:
export OPENAI_API_KEY="sk-..."

# Full autonomous run:
researchclaw run --topic "Your research topic" --auto-approve

# Resume interrupted run (added in v0.3.2):
researchclaw run --resume  # Auto-detects last checkpoint

35.5.4 Cost Analysis

A full 23-stage pipeline run involves extensive LLM usage across all stages. The repository README provides the following cost estimates per run:

Model	Estimated Cost per Run	Notes	Source
GPT-4o	$15–50	Full quality; many stages + multi-agent debate	Repository README
GPT-4o-mini	$3–10	Budget option; lower expected quality	Repository README
Claude 3.5 Sonnet (via ACP)	$10–30	Using Claude Code as persistent agent	Repository README
DeepSeek V3	$2–8	Cost-effective alternative	Repository README
Gemini Pro	$5–20	Via OpenRouter or direct API	Repository README

These are estimates by the repository authors, taken from the README, not independently verified cost measurements. Actual costs vary significantly with topic complexity, the number of REFINE/PIVOT cycles, experiment iterations, paper length, and LLM provider pricing at the time of execution. No methodology for these estimates (e.g., token counting or averaging across runs) is provided.

35.5.5 Time Budget

Estimated pipeline duration varies by an order of magnitude depending on configuration (source: repository README):

Configuration	Estimated Duration	Source
Simulated experiments, fast model	30–60 minutes	Repository README
Sandbox experiments, GPT-4o	2–6 hours	Repository README
Docker + OpenCode Beast Mode	4–12 hours	Repository README
With PIVOT/REFINE cycles	6–24 hours	Repository README

Experiment execution (Phase E) dominates wall-clock time at 30–180 minutes, while literature collection (Phase B) is typically API-bound at 15–45 minutes. Each REFINE cycle adds an estimated 30–90 minutes, while a PIVOT adds 2–4 hours by restarting from hypothesis generation. These are README-reported estimates without documented measurement methodology.
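
The per-cycle figures can be combined into a rough wall-clock range. The sketch below assumes that REFINE and PIVOT overheads add linearly to a base duration, which the README does not state; `estimate_duration_hours` and its parameters are illustrative names.

```python
# Rough wall-clock range from README-reported per-cycle estimates.
# Assumption (not from the README): REFINE and PIVOT overheads add
# linearly on top of a base run duration.

REFINE_HOURS = (0.5, 1.5)   # 30-90 minutes per REFINE cycle
PIVOT_HOURS = (2.0, 4.0)    # 2-4 hours per PIVOT


def estimate_duration_hours(base: tuple[float, float],
                            refines: int = 0,
                            pivots: int = 0) -> tuple[float, float]:
    """Return (low, high) hours for a run with the given cycle counts."""
    low = base[0] + refines * REFINE_HOURS[0] + pivots * PIVOT_HOURS[0]
    high = base[1] + refines * REFINE_HOURS[1] + pivots * PIVOT_HOURS[1]
    return low, high


# Sandbox + GPT-4o base (2-6 h) with two REFINE cycles and one PIVOT.
print(estimate_duration_hours((2.0, 6.0), refines=2, pivots=1))
# prints (5.0, 13.0)
```

Under this additive assumption, two REFINE cycles and one PIVOT on a sandbox run give 5 to 13 hours, which overlaps the README's 6 to 24 hour row for runs with PIVOT/REFINE cycles.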

35.5.6 Reproducibility Assessment

Factor	Assessment	Evidence	Provenance
Source availability	Strong	MIT-licensed, complete codebase on GitHub	Verified: public repository
Installation	Strong	`pip install -e .` + `researchclaw setup`	Repository README
Configuration	Strong	Documented YAML with example file	Repository config example
Test suite	Strong	1,823 tests passing (repository-reported)	Repository README (not independently run)
Checkpoint/resume	Strong	`--resume` flag auto-detects last checkpoint	Repository README, v0.3.2 changelog
LLM determinism	Weak	Output varies with model, temperature, API version	Inherent to LLM-based systems
API dependencies	Moderate	Requires OpenAlex, Semantic Scholar, arXiv (external services)	Repository README
API stability	Weak	Six releases in 15 days suggest rapid API churn	Repository version history
External tool deps	Moderate	OpenCode Beast Mode requires separate installation	Repository README

35.5.7 Skills System

AutoResearchClaw implements a skills system inspired by Claude Code's SKILL.md format. Each skill is a Markdown file with YAML frontmatter specifying name, description, trigger keywords, applicable pipeline stages, and enabled status. Skills are loaded from five sources in priority order (source: repository README):

Built-in skills (20 shipped): packaged with researchclaw
Project-local skills: .claude/skills/ directory
User-installed skills: via researchclaw skills install
Team-shared skills: custom directories in config
Community skills: 150+ via K-Dense-AI/claude-scientific-skills repository

Notable built-in skills documented in the README include scientific-writing (IMRAD structure, citation formatting), chemistry-rdkit (molecular analysis, SMILES), literature-search (systematic review, PRISMA methodology), hypothesis-formulation, statistical-reporting, and a-evolve (community-contributed from the A-Evolve project). At applicable stages, relevant skills are automatically injected into LLM prompts, enabling domain-specific expertise without modifying the core pipeline.

The skill file format, documented in the repository:

# Skill file format (from repository documentation)
---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]
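
To make the format concrete, the following sketch reads a skill file of this shape and decides whether it applies at a given stage. It is a standard-library-only illustration that handles just the flat `key: value` lines and inline `[a, b]` lists shown in the example; real frontmatter should go through a YAML parser. The matching rule in `skill_applies` (enabled, stage listed, and a trigger keyword present in the topic text) is an assumption, since the repository documentation cited here does not specify what text the keywords are matched against.

```python
# Minimal stdlib-only reader for the frontmatter shape shown above.
# Handles only flat "key: value" lines and inline "[a, b]" lists.
# `parse_skill` and `skill_applies` are hypothetical names, and the
# matching rule is an assumption, not documented repository behavior.

def parse_skill(text: str) -> tuple[dict, str]:
    """Split a skill file into (metadata dict, instruction body)."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            items = [item.strip() for item in raw[1:-1].split(",") if item.strip()]
            value = [int(i) if i.isdigit() else i for i in items]
        elif raw in ("true", "false"):
            value = raw == "true"
        else:
            value = raw
        meta[key.strip()] = value
    return meta, body.strip()


def skill_applies(meta: dict, stage: int, topic: str) -> bool:
    """Assumed rule: enabled, stage listed, and a trigger keyword in the topic."""
    if not meta.get("enabled", False):
        return False
    if stage not in meta.get("applicable-stages", []):
        return False
    topic_lower = topic.lower()
    return any(kw.lower() in topic_lower for kw in meta.get("trigger-keywords", []))


SKILL = """---
name: scientific-writing
description: IMRAD structure, citation formatting, reporting guidelines
trigger-keywords: [paper, writing, draft, manuscript]
applicable-stages: [16, 17, 19]
enabled: true
---
[Skill instructions for the LLM...]
"""

meta, body = parse_skill(SKILL)
print(meta["applicable-stages"])                   # [16, 17, 19]
print(skill_applies(meta, 17, "Draft the paper"))  # True
print(skill_applies(meta, 5, "Draft the paper"))   # False
```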

35.6 Limitations & Discussion

35.6.1 Quality Ceiling

The most significant limitation is the absence of external quality validation. No generated paper has been submitted to a real conference, evaluated by domain experts in blind review, or compared to human-written papers on matched topics. The system's quality assessment is entirely internal: multi-agent peer review and quality gates are implemented by the same LLM that generated the content, creating a potential echo chamber. Without external benchmarking against human baselines, claims of "conference-ready" quality remain aspirational.

35.6.2 Experiment Fidelity

Sandbox experiments, while reproducible, operate at small scale. The default execution mode uses a local subprocess with configurable memory (4096 MB default) and time (300s default) limits per the config example. Docker mode and SSH remote execution extend capacity but add infrastructure complexity. For research fields requiring large-scale training, multi-GPU experiments, or specialized hardware (e.g., TPU clusters), the gap between sandbox experiments and full-scale reproducible research remains substantial.

35.6.3 Residual Fabrication Risk

Despite the VerifiedRegistry and Sentinel Watchdog, subtle fabrication remains possible. The anti-fabrication system verifies that numbers in the paper match actual experiment outputs, but it cannot verify that the interpretation of those numbers is correct, that the experimental setup is sound, or that the conclusions drawn are valid. An LLM can produce a paper with all verified numbers but misleading analysis — a form of "honest fabrication" that technical safeguards alone cannot prevent.

35.6.4 Novelty Assessment

The system cannot reliably assess whether its research is genuinely novel. While the PIVOT decision considers novelty relative to baselines, this is a narrow operational definition. True novelty assessment requires deep understanding of the research landscape, ongoing conferences, parallel work, and conceptual contribution — capabilities that current LLMs approximate but do not reliably deliver. Human judgment remains essential for novelty claims.

35.6.5 LLM Non-Determinism and Reproducibility

Given identical inputs and configuration, two runs of the same topic may produce substantially different papers due to LLM non-determinism. Temperature settings, API version changes, and provider-specific behavior all introduce variability. The checkpoint/resume system ensures a given run is recoverable, but cross-run reproducibility (same topic producing comparable output) is not guaranteed. This is a fundamental limitation shared by all LLM-powered autoresearch systems.

35.6.6 Comparative Analysis

The following table compares AutoResearchClaw with the two most closely related systems covered in this survey. Each cell is annotated with evidence provenance.

Capability	Comparison Criterion	AutoResearchClaw	AI Scientist (Sakana AI)	AIRA₂ (Meta FAIR)
Pipeline scope	Number of distinct automated stages from input to output	23 stages, idea to paper (README)	Partial: idea → paper, limited experiment fidelity (Lu et al., 2024)	Experiment optimization only; no paper generation (Chapter 32)
Literature retrieval	Whether system queries real academic databases	Real APIs: OpenAlex, Semantic Scholar, arXiv (README)	No real-time retrieval documented; hallucinated references reported (Lu et al., 2024, §5)	N/A — not a paper-generation system
Citation integrity	Verification mechanism for reference authenticity	4-layer verification pipeline (README)	No citation verification documented (Lu et al., 2024)	N/A
Experiment execution	Environment for running generated code	Sandboxed Python, Docker, SSH (README, config)	Limited sandbox with code generation (Lu et al., 2024)	Full-scale GPU training on Kaggle tasks (Chapter 32)
Cross-run learning	Whether knowledge persists across independent runs	MetaClaw skill extraction and injection (README, MetaClaw repo)	No cross-run learning documented	Within-task evolution only (Chapter 32)
Research-direction control	Ability to autonomously change hypothesis mid-run	PIVOT/REFINE loop at Stage 15 (README)	No documented direction-change mechanism	N/A (single-task optimization)
Human-in-loop	Optional human review gates	3 optional quality gates (README)	No human-in-loop gates documented	No human-in-loop gates documented
Anti-fabrication	Mechanism preventing invented experimental numbers	VerifiedRegistry + Sentinel (README)	No anti-fabrication mechanism documented	N/A (real Kaggle evaluations)
License	Source availability	MIT (verified: public repo)	Apache 2.0 (verified: public repo)	Not publicly released
External validation	Third-party evaluation of output quality	None (self-assessed showcase)	Self-assessed; workshop-level papers claimed (Lu et al., 2024)	Kaggle competition rankings (Chapter 32)

Sources: AutoResearchClaw claims from repository README. AI Scientist claims from Lu et al. (2024), "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," arXiv:2408.06292. AIRA₂ claims from Chapter 32 of this survey. "No documented mechanism" means the published paper or public repository does not describe such a feature; absence of documentation does not conclusively prove absence of the feature. Entries marked "N/A" indicate the system does not target that capability by design.

35.6.7 Connections to Evolutionary AI

While AutoResearchClaw is not an evolutionary algorithm system in the traditional sense, its architecture embodies several patterns central to this survey's theme:

PIVOT/REFINE as exploration-exploitation. The Stage 15 decision engine implements a form of adaptive search over the space of research directions. REFINE corresponds to exploitation (parameter tuning within the current hypothesis), while PIVOT corresponds to exploration (abandoning the current hypothesis for a new one). This mirrors the exploration-exploitation tradeoff in island-based evolutionary search with restart policies, where stagnation triggers population reinitialization.

MetaClaw as evolutionary memory. The cross-run skill accumulation with time-decay weighting resembles learning-rate schedules in evolution strategies: recent experience is weighted more heavily, but old experience is not immediately discarded. The 30-day decay period functions as a form of forgetting that prevents outdated lessons from constraining future search.

Multi-agent debate as population diversity. Using multiple LLM perspectives for hypothesis generation, result analysis, and peer review parallels the use of diverse populations in evolutionary algorithms to avoid premature convergence. Each "agent" represents a different point in reasoning space.

Self-healing as repair mutation. The sandbox executor's diagnosis-repair loop (detect error → diagnose → generate fix → re-execute) mirrors repair operators in genetic programming, where syntactically or semantically invalid programs are patched rather than discarded.
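
The repair-operator analogy can be made concrete with a bounded diagnose-repair loop. The sketch below is an illustration of the pattern only: `run_with_repair`, `propose_fix`, and `stub_fix` are hypothetical names, the stub stands in for the LLM call that would generate a patch, and `exec` is used purely for demonstration and provides no sandboxing. The default bound of 3 mirrors the `exec_fix_max_iterations` value in the example configuration.

```python
# Sketch of a bounded diagnose-repair loop (detect -> diagnose -> fix -> rerun).
# All names are hypothetical. `propose_fix` stands in for the LLM call that
# would generate a patch; exec() here is NOT a sandbox.
import traceback
from typing import Callable


def run_with_repair(source: str,
                    propose_fix: Callable[[str, str], str],
                    max_iterations: int = 3) -> tuple[bool, str, int]:
    """Run `source`; on failure ask `propose_fix(source, error)` for a patch.

    Returns (succeeded, final_source, repairs_applied).
    """
    repairs = 0
    while True:
        try:
            exec(compile(source, "<experiment>", "exec"), {})
            return True, source, repairs
        except Exception:
            error = traceback.format_exc(limit=1)   # detect and diagnose
        if repairs >= max_iterations:
            return False, source, repairs           # give up, keep last source
        source = propose_fix(source, error)         # generate a fix
        repairs += 1                                # next loop pass re-executes


def stub_fix(source: str, error: str) -> str:
    """Toy repair rule: define a missing variable named in a NameError."""
    if "NameError" in error and "'lr'" in error:
        return "lr = 0.01\n" + source
    return source


ok, final_source, repairs = run_with_repair("loss = 1.0 * lr", stub_fix)
print(ok, repairs)  # prints "True 1"
```

As in genetic programming, the failing program is patched and kept rather than discarded; the iteration bound plays the role of a budget on repair attempts.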

35.7 The Skills System as Accumulated Knowledge

The skills system deserves separate attention as a mechanism for knowledge accumulation that bridges individual runs and community contributions. Over time, a research group's skill library accumulates domain-specific knowledge that makes the pipeline increasingly effective for their particular research area. This creates a positive feedback loop that we can informally characterize as:

$$\text{effectiveness}(N) \approx f\!\left(\text{base\_capability},\; \sum_{n=1}^{N-1} |\mathcal{S}_n| \cdot \bar{w}_n\right)$$

where $N$ is the current run number, $|\mathcal{S}_n|$ is the number of skills contributed by run $n$, $\bar{w}_n$ is their average time-decay weight at the current run, and $f$ is monotonically increasing in its second argument. This is an informal analytical model introduced by the survey author to express the documented behavior qualitatively. It does not correspond to any implementation formula. The actual relationship between skill count and pipeline effectiveness has not been formally characterized or empirically measured in the repository.
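
The second argument of $f$, the decay-weighted skill mass, can be computed directly once a decay shape is fixed. The sketch below assumes weights that fall linearly from 1 to 0 over the 30-day decay period mentioned in this chapter; the linear shape is not documented and is chosen only for illustration, as are the names `decay_weight` and `weighted_skill_mass`. Because all skills from one run share the same age, the per-run average weight equals the weight at that age.

```python
# Decay-weighted skill mass: the second argument of f in the informal model.
# Assumption: weights fall linearly from 1.0 to 0.0 over a 30-day window.
# The 30-day period is mentioned in this chapter; the linear shape is NOT
# documented and is used here only for illustration.

DECAY_DAYS = 30.0


def decay_weight(age_days: float) -> float:
    """Linear time-decay weight in [0, 1] for a skill of the given age."""
    return max(0.0, 1.0 - age_days / DECAY_DAYS)


def weighted_skill_mass(prior_runs: list[tuple[int, float]]) -> float:
    """Sum |S_n| * w_n over prior runs given (skill_count, age_days) pairs."""
    return sum(count * decay_weight(age) for count, age in prior_runs)


# Three earlier runs: 3 skills from 45 days ago, 2 from 15 days ago, 3 from today.
history = [(3, 45.0), (2, 15.0), (3, 0.0)]
print(weighted_skill_mass(history))  # prints 4.0
```

In this example the oldest run contributes nothing, so eight accumulated skills count as an effective mass of 4.0.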

The community dimension is also significant: 150+ skills via the K-Dense-AI repository represent crowdsourced scientific methodology knowledge that any AutoResearchClaw installation can leverage (source: repository README). This suggests a model where scientific methodology itself becomes a shareable, versionable artifact — an intriguing direction for the autoresearch field, though the actual impact of community skills on pipeline quality has not been empirically evaluated.

35.8 Process Reward Model Integration

MetaClaw optionally integrates a Process Reward Model (PRM) for quality gating at configurable stages. When enabled, an LLM-as-judge evaluates stage outputs using majority voting (source: repository configuration example, metaclaw_bridge.prm section):

*PRM configuration fields (from config.researchclaw.example.yaml)*
Config Field	Default	Description
`metaclaw_bridge.prm.enabled`	`false`	Opt-in activation
`metaclaw_bridge.prm.model`	`"gpt-5.4"`	LLM model used as judge
`metaclaw_bridge.prm.votes`	`3`	Number of independent judge evaluations (majority vote)
`metaclaw_bridge.prm.gate_stages`	`[5, 9, 15, 20]`	Pipeline stages where PRM gates are applied

# ABSTRACT PSEUDOCODE: illustrative reconstruction of PRM gate behavior.
# NOT extracted from the repository. Based on documented config fields.

from typing import Callable


class ProcessRewardGatePseudocode:
    """Illustrates the optional LLM-as-judge quality gate concept.

    Config fields: metaclaw_bridge.prm.{enabled, model, votes, gate_stages}
    """

    def __init__(self, model: str, votes: int, gate_stages: list[int],
                 judge: Callable[[str, str], float]):
        self.model = model              # Config: "gpt-5.4"
        self.votes = votes              # Config: 3 (majority vote)
        self.gate_stages = gate_stages  # Config: [5, 9, 15, 20]
        # Hypothetical stand-in for the LLM call: judge(model, text) -> score.
        # The real prompt and scoring rubric are not documented.
        self.judge = judge

    def evaluate(self, stage_id: int, stage_output: str) -> bool:
        """Returns True if stage output passes majority-vote quality check.

        At each gated stage, 'votes' independent LLM judge calls
        are made. Stage passes if a strict majority approve.
        """
        if stage_id not in self.gate_stages:
            return True  # Not a gated stage

        approvals = 0
        for _ in range(self.votes):
            score = self._single_judge_call(stage_output)
            if score >= 0.5:  # Threshold not documented; 0.5 assumed
                approvals += 1

        return approvals > self.votes // 2  # Strict majority approval

    def _single_judge_call(self, output: str) -> float:
        """Single judge evaluation; returns quality score clamped to [0, 1]."""
        score = self.judge(self.model, output)
        return min(1.0, max(0.0, score))

The PRM adds another layer of quality control beyond the standard gate stages, at an LLM cost proportional to the number of gated stages times the vote count. With the default 3 votes across 4 gate stages, this amounts to 12 additional LLM calls per run, assuming each gated stage is evaluated once. The actual quality improvement from PRM gating is not reported in the repository.
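
The call-count arithmetic generalizes to other vote counts and gate lists. The helper below is a minimal sketch; `prm_judge_calls` is an illustrative name, and the `evaluations_per_gate` parameter, which models a gated stage being judged more than once (for example after a retry), is an assumption, since the repository material reviewed here does not document re-evaluation.

```python
# Judge-call overhead of PRM gating: votes x gated stages x evaluations.
# `evaluations_per_gate` (> 1 when a gated stage is re-run) is an
# assumption; re-evaluation behavior is not documented.

def prm_judge_calls(votes: int, gate_stages: list[int],
                    evaluations_per_gate: int = 1) -> int:
    """Number of additional LLM judge calls for one pipeline run."""
    return votes * len(gate_stages) * evaluations_per_gate


print(prm_judge_calls(3, [5, 9, 15, 20]))      # 12 with the documented defaults
print(prm_judge_calls(3, [5, 9, 15, 20], 2))   # 24 if every gate is judged twice
```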

35.9 Summary

Key Takeaway

AutoResearchClaw demonstrates that fully autonomous research pipelines — from text topic to conference-formatted paper with real literature, executed experiments, and verified results — are technically achievable in 2026. Its most notable documented capabilities are the PIVOT/REFINE decision loop for autonomous research-direction control, the VerifiedRegistry anti-fabrication system for enforcing experimental ground truth in generated papers, and four-layer citation verification that addresses hallucinated references. The MetaClaw cross-run learning system with time-decaying skills provides a mechanism for cumulative improvement across runs.

Main contribution to the field: Among open-source autoresearch systems surveyed in this book, AutoResearchClaw is the first to integrate anti-fabrication enforcement, real citation verification via academic APIs, adaptive research-direction control, and cross-run learning into a single end-to-end pipeline. This assessment is based on comparison with AI Scientist (Lu et al., 2024), AIRA₂ (Chapter 32), AutoResearch (Karpathy, 2025), and FARS (Analemma, 2025) using the criteria in Section 35.6.6. However, FARS's proprietary nature means its full capabilities cannot be assessed, and the rapidly evolving autoresearch landscape may include systems not covered by this survey.

What a researcher should know: Despite its architectural sophistication, AutoResearchClaw's output quality remains unvalidated by external peer review. The system addresses documented failure modes of prior autoresearch systems (hallucinated references in AI Scientist, partial pipelines in AutoResearch), but introduces its own limitations: the quality ceiling is bounded by LLM capability, experiment fidelity is limited by sandbox constraints, novelty assessment requires human judgment, and the MetaClaw integration results lack statistical rigor (see Section 35.4.2). The 23-stage pipeline is best understood not as a replacement for human researchers, but as an automation of the mechanical aspects of research — literature gathering, experiment coding, result formatting — while leaving the most intellectually demanding aspects (true novelty, deep insight, conceptual contribution) as open challenges for future systems.

Evidence boundaries: This chapter's analysis is grounded in the public repository README, configuration examples, directory structure, and version history. Code blocks labeled as abstract pseudocode are written by the survey author to illustrate documented behavior and are not repository excerpts; the configuration, CLI, and skill-file listings are reproduced from repository documentation, as marked. Mathematical formalizations, other than the repository-documented robustness score, are analytical interpretations of documented criteria, not implementation descriptions. Readers should consult the repository directly for current implementation details.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}