EurekaClaw: Autonomous Mathematical Discovery Through Bottom-Up Theorem Proving
Part P07: Autonomous Research Systems
40.1 Overview and Motivation
EurekaClaw is a multi-agent AI system for autonomous mathematical research, published in 2026 under the Apache 2.0 license by a team led by Quanquan Gu (github.com/EurekaClaw/EurekaClaw). The system autonomously crawls literature, generates and stress-tests mathematical hypotheses, proves theorems via a seven-stage bottom-up pipeline, runs numerical experiments, and writes camera-ready LaTeX papers. Its continual learning subsystem distills proof strategies into reusable skills across sessions, making it one of the most depth-focused autonomous research systems in the 2024–2026 landscape.
The name "EurekaClaw" combines the discovery exclamation "Eureka!" with "Claw" — a branding lineage shared with related systems including AutoResearchClaw, MetaClaw, and OpenClaw. The metaphor suggests the system's ability to grasp and hold onto breakthrough moments in mathematical reasoning. EurekaClaw should not be confused with NVIDIA's Eureka system for reward function evolution in robotics; despite the shared naming, the two systems address entirely different problems.
Key Contribution
EurekaClaw is, among the systems surveyed in this book (2024–2026), architecturally distinctive in combining bottom-up theorem proving (constructing proofs from atomic lemmas via a directed acyclic graph), formal verification (optional Lean4 proof checking), and continual skill learning (automatic distillation of proof strategies across sessions) within a single seven-stage pipeline. It occupies a unique niche: narrower in domain scope than general-purpose research agents, but substantially deeper in its target domain of mathematical theory than any competitor identified in this survey.
40.1.1 Positioning Within the Ecosystem
Within the "Claw" family of research agents, EurekaClaw specializes in mathematical depth. AutoResearchClaw provides breadth across research domains through a 23-stage paper pipeline. MetaClaw focuses on cross-run learning and skill extraction at the meta level. EurekaClaw trades domain breadth for proof-construction depth, making it the only system in the family with dedicated theorem proving, verification loops, and formal checking capabilities.
| System | Focus | Pipeline | Formal Proving | Memory Tiers | Continual Learning |
|---|---|---|---|---|---|
| EurekaClaw | Mathematical theory | 7-stage with verification loop | Yes (Lean4) | 4 | Skill distillation |
| AutoResearchClaw | General research | 23-stage end-to-end | No | 3 (via MetaClaw) | Cross-run skills |
| AI Scientist (Sakana) | ML experiments | Idea → experiment → paper | No | 0 | None |
| Google AI Co-Scientist | Hypothesis generation | Multi-agent debate | No | 2 (tournament) | Selection |
| AIRA₂ (Meta) | STEM research | 15+ specialized agents | No | 2 (session) | None |
40.1.2 Three Levels of Autonomy
EurekaClaw provides three escalating input modes that range from precise conjecture proving to open-ended domain exploration [Documented: README CLI examples]:
| Command | Level | Input | Output |
|---|---|---|---|
eurekaclaw prove "<statement>" | 1 (lowest autonomy) | Precise mathematical statement | Proof + LaTeX paper |
eurekaclaw from-papers <arxiv_ids> | 2 (medium) | Specific papers to extend | Gap analysis + new theorem + proof + paper |
eurekaclaw explore "<domain>" | 3 (highest) | Broad research area | Literature survey + conjecture + proof + paper |
This tiered autonomy design acknowledges a spectrum of mathematical research tasks. A researcher with a precise conjecture needs proving support (Level 1). A researcher reading papers and sensing an unexplored direction needs gap analysis and conjecture generation (Level 2). A researcher entering a new field needs the full pipeline from survey to paper (Level 3). The system adapts its pipeline entry point accordingly.
40.1.3 Evidence Provenance and Verification Scope
Repository Snapshot and Verification Scope
Repository: github.com/EurekaClaw/EurekaClaw
Review date: April 2026
Pinned commit: No specific commit hash was pinned for this review. Claims are based on the documented source tree, README, and project documentation site as of the review date.
Throughout this chapter, claims are labeled as follows:
- [Documented] — directly stated in the project README, documentation site, or published code excerpts.
- [Repository structure] — inferred from the documented file/directory layout.
- [Author reconstruction] — the chapter author's reconstruction of plausible behavior based on documented inputs/outputs; not confirmed in source code. All pseudocode and inferred mechanisms are explicitly marked with this label.
Code blocks in this chapter fall into two categories: (a) documented API surfaces, where method names and signatures are taken from project documentation but internal implementations have not been independently verified; and (b) author pseudocode, where the logic is the chapter author's reconstruction for pedagogical purposes. Each code block is labeled accordingly. Readers wishing to verify specific claims should consult the repository directly.
40.1.4 Source-Traceability Summary
Table 40.2 maps each major section of this chapter to its evidence sources, enabling readers to assess the grounding of each claim independently.
| Section | Topic | Primary Source | Evidence Tier |
|---|---|---|---|
| 40.2.1 | KnowledgeBus API | Doc + README | Documented interface; internals not verified |
| 40.2.2 | Agent architecture, BaseAgent/BaseTool | Doc + Structure | Documented class names; method bodies not verified |
| 40.3.1 | Seven-stage pipeline | README + Doc | Documented stages and agent names |
| 40.3.2 | Bottom-up proof loop (core) | Doc | Documented agents and flow; iteration limit documented |
| 40.3.2 | Stagnation detection | Recon | Author reconstruction; concept mentioned, mechanism inferred |
| 40.3.2 | DAG restructuring | Recon | Author reconstruction from documented counterexample handling |
| 40.3.4 | Skill injection | Doc + Structure | Documented API; selection strategies documented |
| 40.3.5 | UCB1 walkthrough | Recon | Hypothetical illustration; no session artifacts available |
| 40.4 | Scientist-Bench dimensions | Doc | Documented evaluator; weights and metrics documented |
| 40.5 | Memory tiers | Doc + Structure | Documented tier structure; storage paths documented |
| 40.5.2 | SkillEvolver API | Doc | Documented class; fast model usage documented |
| 40.6.2 | DomainPlugin ABC | Doc | Documented interface; registration mechanism partially inferred |
| 40.6.3 | Cost estimates | Doc | Self-reported ranges; not independently measured |
40.2 Architecture
EurekaClaw's architecture is organized around three central design elements: a MetaOrchestrator that sequences pipeline stages, a KnowledgeBus that serves as the typed artifact store connecting all components, and a collection of specialized agents that execute each pipeline stage. All agents share a single primary LLM configuration (defaulting to claude-sonnet-4-6 via the EUREKACLAW_MODEL environment variable [Documented]), with the exception of the post-session skill distillation path, which uses a separate "fast model" for efficiency (Section 40.5.2).
40.2.1 The KnowledgeBus
The KnowledgeBus is the architectural backbone — a typed artifact store with publish-subscribe semantics that all pipeline stages read from and write to [Documented: described in project architecture documentation as eurekaclaw/knowledge_bus/bus.py]. Unlike a generic shared-state dictionary, the KnowledgeBus enforces typed access through dedicated accessor methods for each artifact type.
# Documented API surface: eurekaclaw/knowledge_bus/bus.py
# ──────────────────────────────────────────────────────────────
# PROVENANCE: Method names and signatures below are taken from
# project documentation. Internal implementations, parameter
# defaults, and serialization logic have NOT been independently
# verified at a specific commit. Treat as documented interface,
# not audited source code.
# ──────────────────────────────────────────────────────────────
class KnowledgeBus:
"""Typed artifact store with pub/sub for inter-stage communication."""
def __init__(self, session_id: str) -> None: ...
# Typed artifact accessors [Documented]
def put_theory_state(self, state: TheoryState) -> None: ...
def get_theory_state(self) -> TheoryState | None: ...
def put_research_brief(self, brief: ResearchBrief) -> None: ...
def get_research_brief(self) -> ResearchBrief | None: ...
def put_experiment_result(self, result: ExperimentResult) -> None: ...
def get_experiment_result(self) -> ExperimentResult | None: ...
def append_citations(self, papers: list[Paper]) -> None: ...
# Generic key-value fallback [Documented]
def put(self, key: str, value: Any) -> None: ...
def get(self, key: str, default: Any = None) -> Any: ...
# Reactive subscriptions [Documented]
def subscribe(self, artifact_type: str, callback: Callable) -> None: ...
# Persistence [Documented: JSON serialization to session directory]
def persist(self, session_dir: Path) -> None: ...
@classmethod
def load(cls, session_id: str, session_dir: Path) -> "KnowledgeBus": ...
The bus supports reactive subscriptions: components can register callbacks for artifact updates, enabling loose coupling between pipeline stages [Documented]. Persistence is handled through JSON serialization, producing a structured artifact directory per session that enables post-hoc inspection and partial re-processing [Documented].
40.2.2 Agent Architecture
All agents inherit from a common BaseAgent abstract base class [Documented: eurekaclaw/agents/base.py]. This ABC standardizes the agent interface around an execute() method while providing shared access to the tool registry, skill injection, and episodic memory logging. Tools are exposed to agents via the Anthropic tool definition format [Documented], with each tool implementing a BaseTool ABC (eurekaclaw/tools/base.py) that provides input_schema(), call(), and to_anthropic_tool_def() methods [Documented].
| Tool | Documented Module | Purpose |
|---|---|---|
ArxivSearchTool | tools/arxiv.py | Search arXiv by query, return papers with metadata |
SemanticScholarTool | tools/semantic_scholar.py | Search with citation counts and venue data |
Lean4VerifyTool | tools/lean4.py | Formal proof verification via Lean4 |
WolframAlphaTool | tools/wolfram.py | Symbolic computation queries |
CodeExecutionTool | tools/code_exec.py | Python code execution in sandbox |
WebSearchTool | tools/web_search.py | General web search |
40.2.3 Implementation Status Summary
| Stage | Status | Evidence |
|---|---|---|
| 1. Survey | Implemented | CLI demos, documented agents |
| 2. Formalize | Implemented | Documented agent, TheoryState output |
| 3. Theory | Implemented | Documented agents, proof loop description |
| 4. Experiment | Under active development | Explicitly marked in project docs |
| 5. Write | Implemented | CLI demo, LaTeX output |
| 6. Evaluate (Scientist-Bench) | Implemented | Documented evaluator, CLI command |
| 7. Learn | Implemented | Documented extractors and SkillEvolver |
| Browser UI | Implemented (beta) | Frontend build system documented |
40.3 Core Algorithms
40.3.1 The Seven-Stage Pipeline
EurekaClaw's research pipeline consists of seven sequential stages, with the third stage (Theory) containing an iterative loop. The MetaOrchestrator sequences these stages, manages gate-mode enforcement (fully autonomous via GATE_MODE=none, semi-autonomous via auto, or human-in-the-loop via human [Documented: .env configuration]), and handles domain plugin resolution based on keyword matching against user input [Documented].
Stage 1 — Survey. A multi-agent literature review comprising four sub-agents [Documented: eurekaclaw/agents/survey/]: PaperFetcher (API calls to arXiv and Semantic Scholar), Summarizer (LLM-based paper summarization), GapAnalyst (cross-paper analysis identifying unexplored combinations), and DirectionProposer (conjecture generation with scoring). The output is a ResearchBrief containing scored research directions.
Stage 2 — Formalize. The Formalizer agent [Documented: eurekaclaw/agents/formalize/] converts the selected research direction into a formal LaTeX theorem statement, generates a proof plan with provenance annotations (known, adapted, or new), and constructs a lemma DAG skeleton — the dependency graph of sub-results needed for the main proof.
Stage 3 — Theory. The iterative proving loop, described in detail in Section 40.3.2 below. This is the dominant cost stage.
Stage 4 — Experiment. Numerical validation of theoretical bounds via the ExperimentDesigner and ExperimentRunner agents [Documented: eurekaclaw/agents/experiment/]. This stage is explicitly documented as under active development; the pipeline proceeds to Stage 5 without experiment results when unavailable.
Stage 5 — Write. The WriterAgent [Documented: eurekaclaw/agents/writer/] generates a camera-ready LaTeX paper from the KnowledgeBus artifacts, including theorem environments, proof environments, related work from bibliography summaries, and experiment sections where available.
Stage 6 — Evaluate. The Scientist-Bench framework scores the output along five weighted dimensions (Section 40.4). Invoked via eurekaclaw eval-session <session_id> [Documented].
Stage 7 — Learn. The ContinualLearningLoop extracts domain insights, tool patterns, and distilled skills from the session for use in future runs (Section 40.5.2). This stage uses a separate "fast model" with constrained output (max_tokens=1024) [Documented] rather than the primary EUREKACLAW_MODEL.
40.3.2 Bottom-Up Proof Construction
The Theory stage is EurekaClaw's most distinctive mechanism. Rather than attempting top-down proof decomposition (which requires knowing the proof structure in advance), EurekaClaw builds proofs bottom-up from atomic lemmas organized in a directed acyclic graph (DAG). This approach mirrors how human mathematicians often work: proving supporting results first, then assembling them into the main argument.
The proof state is tracked in a TheoryState object — the most complex artifact in the system [Documented: eurekaclaw/types/artifacts.py]:
# Documented data model: eurekaclaw/types/artifacts.py
# ──────────────────────────────────────────────────────────────
# PROVENANCE: Field names and types from project documentation.
# Exact Pydantic validators and default values not verified.
# ──────────────────────────────────────────────────────────────
class TheoryState(BaseModel):
informal_statement: str # Plain-English conjecture
formal_statement: str # LaTeX-formalized theorem
known_results: list[KnownResult] # Extracted from literature survey
research_gap: str # GapAnalyst's finding
proof_plan: list[ProofPlan] # Each with provenance: known | adapted | new
lemma_dag: dict[str, LemmaNode] # Dependency graph of sub-lemmas
proven_lemmas: dict[str, ProofRecord] # lemma_id → proof
open_goals: list[str] # Remaining lemma IDs to prove
failed_attempts: list[FailedAttempt] # History for learning
counterexamples: list[Counterexample] # Adversarial findings
assembled_proof: str | None # Final combined proof
status: Literal["pending", "in_progress", "proved", "refuted", "abandoned"]
The proving loop operates as follows. Algorithm 1 presents the documented core behavior with explicit provenance annotations on each step. Two inferred mechanisms — stagnation detection and DAG restructuring — are described in separate boxes below the algorithm.
Algorithm 1: Bottom-Up Proof Construction (Documented Core + Annotated Inferences)
Input: TheoryState T with lemma_dag, open_goals, proven_lemmas
Config: THEORY_MAX_ITERATIONS (default: 10) [Documented]
Output: Updated TheoryState with status ∈ {proved, refuted, abandoned, in_progress}
1 goals ← TopologicalSort(T.lemma_dag) [Documented]
// Process nodes in bottom-up order: leaves first,
// then nodes whose dependencies are all resolved.
2 for each goal g in goals (leaf-first order): [Documented]
3 if ∃ dep ∈ g.dependencies such that dep ∉ T.proven_lemmas:
4 continue // dependencies not yet resolved
5 for iter = 1 to THEORY_MAX_ITERATIONS: [Documented]
6 // ── SKILL INJECTION ── [Documented]
7 skills ← SkillInjector.top_k(g.statement, role="theory", k=5)
8 deps ← {T.proven_lemmas[d] | d ∈ g.dependencies}
9 hist ← T.failed_attempts filtered to goal g
10 // ── PROVE ── [Documented]
11 candidate ← Prover.prove(g, deps, skills, hist)
12 // ── COUNTEREXAMPLE SEARCH ── [Documented: agent exists]
13 cx ← CounterexampleSearcher.search(g)
14 if cx ≠ ∅: [Author reconstruction: §40.3.2a]
15 T.counterexamples.append(cx)
16 RestructureDAG(T, g, cx) // see Box 40.1
17 goto line 1
18 // ── VERIFY ── [Documented]
19 result ← Verifier.verify(candidate, g, deps)
20 // Checks: logical consistency, step completeness,
21 // assumption validity [Documented]
22 // Optional: Lean4VerifyTool for formal checking [Documented]
23 if result.verified: [Documented]
24 T.proven_lemmas[g.id] ← ProofRecord(candidate, result)
25 T.open_goals.remove(g.id)
26 break
27 // ── REFINE ── [Documented]
28 T.failed_attempts.append(FailedAttempt(g.id, candidate, result.error))
29 if IsStagnant(T.failed_attempts, g.id): [Author reconstruction: §40.3.2b]
30 candidate ← Refiner.broad_repair(candidate, result.error, hist)
31 else:
32 candidate ← Refiner.targeted_repair(candidate, result.error)
33 // Loop back to VERIFY with refined candidate
34 // ── ASSEMBLY ── [Documented]
35 if T.open_goals is empty:
36 T.assembled_proof ← AssembleProof(T.proven_lemmas, T.lemma_dag)
37 T.status ← "proved"
38 else if any counterexample refutes the main theorem:
39 T.status ← "refuted"
40 else if all per-lemma iterations exhausted:
41 T.status ← "abandoned"
42 else:
43 T.status ← "in_progress"
44 return T
Topological ordering [Documented]. The lemma DAG is processed leaf-first: lemmas with no dependencies are proven first, then lemmas whose dependencies have all been resolved, and so on upward. This is a standard topological sort (Kahn's algorithm) applied to the dependency graph, where an edge $u \to v$ means "lemma $v$ depends on lemma $u$." The ordering ensures that when the Prover attempts a lemma, all of its prerequisites are already available as proven results.
Box 40.1 — Author Reconstruction: DAG Restructuring After Counterexample
Evidence status: The existence of CounterexampleSearcher as a dedicated agent is documented. The documentation states that counterexamples can be found during the theory loop. However, the specific mechanism by which the lemma DAG is restructured after a counterexample is the chapter author's reconstruction — no documented description of this process has been identified.
Reconstructed behavior: When the CounterexampleSearcher finds an input violating a lemma, the lemma is marked as refuted. A plausible restructuring would: (1) identify all lemmas transitively dependent on the refuted lemma, (2) re-open those goals, (3) re-invoke the Formalizer to produce an alternative decomposition avoiding the refuted path, and (4) restart the proof loop with the revised DAG. The precise trigger conditions, re-invocation API, and failure modes of this process are unknown.
Box 40.2 — Author Reconstruction: Stagnation Detection
Evidence status: The project documentation mentions that the theory loop includes "stagnation → force refine" behavior (documented in the pipeline flow description), confirming that the concept of stagnation detection exists. However, the specific definition of "stagnation" — the recurrence threshold, error pattern categorization method, and the distinction between targeted and broad repair — are not documented in public materials.
Reconstructed mechanism: A reasonable implementation would extract error categories (e.g., "invalid assumption," "missing case," "bound too loose") from the Verifier's output and trigger broad repair after $N \geq 3$ consecutive occurrences of the same category for a given lemma. The broad repair strategy would restructure the proof approach rather than merely patching the specific error. The actual threshold $N$, the error categorization method, and the distinction between repair strategies are unknown and may differ from this reconstruction.
// AUTHOR PSEUDOCODE — not from repository
Function IsStagnant(failed_attempts, goal_id):
recent ← last N attempts for goal_id // N: unknown threshold
patterns ← [categorize(a.error) for a in recent]
return dominant_category_count(patterns) ≥ THRESHOLD
// categorize(): maps errors to types — method unknown
// THRESHOLD: not documented
40.3.3 Research Direction Scoring
During the Survey stage, the DirectionProposer agent scores each candidate research direction along three dimensions [Documented]. Given a set of candidate directions $\{d_1, d_2, \ldots, d_n\}$, each direction $d_i$ receives three scores:
where:
- $S_{\text{novelty}}(d_i) \in [0,1]$ — measures distance from existing work in the surveyed literature [Documented],
- $S_{\text{soundness}}(d_i) \in [0,1]$ — estimates the feasibility of constructing a proof for the direction [Documented],
- $S_{\text{transformative}}(d_i) \in [0,1]$ — assesses potential impact [Documented],
- $w_n, w_s, w_t \geq 0$ with $w_n + w_s + w_t = 1$ are weights. The specific default weight values are not documented in public materials; the composite formula above is the chapter author's formalization of the documented three-score ranking.
The direction with the highest composite score is selected as the selected_direction in the ResearchBrief [Documented]. All three component scores are generated by LLM judgment, not by formal metrics. This makes the scoring useful for relative comparison within a session but unsuitable as an absolute quality measure.
40.3.4 Skill Injection Mechanism
The skill system provides domain-specific proof strategies to agents at runtime [Documented: eurekaclaw/skills/]. Skills are stored as Markdown files with YAML frontmatter containing structured metadata. At each agent invocation, the SkillInjector selects the top-$k$ most relevant skills (default $k=5$) based on one of three strategies: tag matching, semantic embedding similarity, or a hybrid of both [Documented]. Selected skills are rendered into XML blocks and appended to the agent's system prompt [Documented].
# Documented API: eurekaclaw/skills/injector.py and registry.py
# ──────────────────────────────────────────────────────────────
# PROVENANCE: Class names, method signatures, and the SkillMeta
# schema below are from project documentation. The render_for_prompt
# output format (XML skill blocks) is documented. Internal selection
# logic and embedding implementation are not verified.
# ──────────────────────────────────────────────────────────────
class SkillInjector:
"""Retrieves and formats relevant skills for agent prompts."""
def top_k(
self,
task: str,
role: str,
k: int = 5, # [Documented]
strategy: Literal["tag", "semantic", "hybrid"] = "hybrid" # [Documented]
) -> list[SkillRecord]: ...
def render_for_prompt(self, skills: list[SkillRecord]) -> str:
"""Format skills as XML for prompt injection. [Documented output format]"""
# Produces:
# <skills>
# <skill name="..."> ... </skill>
# </skills>
...
class SkillMeta(BaseModel): # [Documented]
name: str
version: str = "1.0"
tags: list[str] = []
agent_roles: list[str] = [] # e.g., ["theory", "survey"]
pipeline_stages: list[str] = [] # e.g., ["theory", "experiment"]
description: str = ""
source: Literal["seed", "distilled", "manual"] = "seed"
created_at: datetime
usage_count: int = 0
success_rate: float | None = None
Skills come from three sources with descending priority [Documented]: user skills (manual or system-distilled, stored at ~/.eurekaclaw/skills/), domain plugin skills (from domains/<domain>/skills/), and built-in seed skills (from eurekaclaw/skills/seed_skills/). Higher-priority skills override lower-priority ones with the same name.
After each session, the SkillRegistry updates success statistics using an exponential moving average (EMA) [Documented: eurekaclaw/skills/registry.py]:
where $\hat{r}_t(s)$ is the estimated success rate of skill $s$ at time $t$, $r_t(s) \in \{0, 1\}$ is the binary outcome (whether the skill contributed to a successful proof in session $t$), and $\alpha = 0.3$ is the smoothing factor [Documented]. This EMA gives weight $\alpha(1-\alpha)^{t-\tau}$ to the outcome at time $\tau$, so the effective memory window is approximately $1/\alpha \approx 3.3$ sessions. How "contribution to a successful proof" is attributed — whether by injection presence alone or by some finer-grained signal — is not documented.
40.3.5 Illustrative Walkthrough: UCB1 Regret Bound (Level 1 Benchmark)
Provenance Notice
This walkthrough is entirely reconstructed by the chapter author from the documented pipeline stages, artifact schemas, and MAB plugin benchmark definitions. It illustrates the expected system flow rather than a specific recorded run. No actual session artifacts, verifier outputs, or theorem_state excerpts from a real EurekaClaw run are publicly available for independent verification. The walkthrough should be read as a pedagogical illustration, not as evidence of system capability.
Hypothetical input [Documented CLI format]:
# Documented CLI command format — exact behavior not independently verified
eurekaclaw prove \
"For K-armed stochastic bandits with sub-Gaussian rewards, \
UCB1 achieves expected regret R_T ≤ Σ_{i:Δ_i>0} (8 ln T)/Δ_i + (1+π²/3)Σ Δ_i" \
--domain "multi-armed bandit" --output ./results
Expected Stage 1 — Survey. Because this is a Level 1 invocation (prove), the survey is scoped to the specific conjecture. The PaperFetcher queries arXiv and Semantic Scholar for "UCB1 regret bound stochastic bandit." The GapAnalyst notes this is a known result (no gap). The DirectionProposer identifies the direction as "reproduce the UCB1 finite-time regret bound," with high soundness and low novelty.
Expected Stage 2 — Formalize. The Formalizer converts the conjecture into a formal theorem statement and constructs a lemma DAG skeleton. The diagram below shows a plausible decomposition:
Expected Stage 3 — Theory. Processing begins at the leaves. L4 (Hoeffding) and L3 (tail sum) are proven first with no dependencies. L5 (UCB index property) depends on L4. Then L2 (expected pull count) depends on L4 and L5. Finally, the main theorem is assembled.
A plausible refinement scenario: the first attempt at L2 fails verification because it uses a loose union bound where a tighter per-round counting argument is needed. The Refiner revises the proof to apply the Hoeffding bound per-round and sum over rounds to obtain the logarithmic term.
Expected final artifact directory [Documented structure]:
# Documented artifact structure — contents shown are hypothetical
results/<session_id>/
├── theory_state.json # Lemma DAG, proven lemmas, failed attempts
├── research_brief.json # Literature survey findings, scored directions
├── bibliography.json # Papers found during survey
├── experiment_result.json # Empty (experiment stage under development)
├── paper.tex # LaTeX with Theorem, Lemmas, proofs
├── paper.pdf # Compiled PDF (if LaTeX compiler available)
└── eval_report.json # Scientist-Bench evaluation scores
What a Real Session Trace Would Contain
To ground this walkthrough in verifiable evidence, a real session trace would need to include: (1) the exact CLI command with all flags; (2) the actual theory_state.json contents showing the generated lemma DAG, proven lemma count, and failed attempt records; (3) at least one verbatim Verifier error message; (4) the final eval_report.json with Scientist-Bench scores; and (5) the compiled paper.pdf. No such artifacts are publicly available from the project as of April 2026. If reproduced artifacts become available, they would substantially strengthen the claims in this chapter.
40.4 Evaluation: Scientist-Bench
EurekaClaw includes an internal evaluation framework called Scientist-Bench [Documented] that scores research outputs along five weighted dimensions:
| Dimension ($q_i$) | Weight ($w_i$) | Documented Measurement Method |
|---|---|---|
| Formal correctness | 0.35 | Lean4 formal verification or LLM peer review fallback |
| Novelty | 0.25 | Embedding distance from known results (embedding model and distance metric not specified) |
| Experimental alignment | 0.15 | Agreement between theoretical bounds and numerical experiments |
| Proof depth | 0.15 | Lemma count in DAG (normalization method not specified) |
| Citation coverage | 0.10 | Completeness of bibliography relative to surveyed literature |
Self-referential evaluation caveat: This evaluation is self-referential — the system evaluates its own outputs. The LLM peer review fallback for formal correctness (used when Lean4 is unavailable) introduces the well-documented weakness of LLMs evaluating LLM-generated mathematical proofs. Scientist-Bench is useful for comparing configurations and tracking skill-accumulation effects, but it should not be treated as an external validation of proof quality.
40.4.1 Domain Benchmark Problems
The Multi-Armed Bandit domain plugin includes tiered benchmark problems [Documented: domains/mab/benchmark/]:
| Level | Difficulty | Examples | Purpose |
|---|---|---|---|
| Level 1 | Reproduce known results | UCB1 regret bound, Lai-Robbins lower bound | Validate system can reproduce textbook theorems |
| Level 2 | Refine existing bounds | Bernstein-UCB, MOSS, KL-UCB | Test ability to improve on known techniques |
| Level 3 | Open problems | Heavy-tailed bandits, infinite-arm, batched | Probe frontier research capability |
These benchmarks enable a meaningful evaluation trajectory: if the system cannot reliably reproduce UCB1's $O(\sqrt{KT \log T})$ regret bound (Level 1), it is unlikely to produce novel results at Level 3. However, the benchmarks are stored as JSON problem descriptions, and the evaluation of solutions remains LLM-based unless Lean4 verification is available.
40.4.2 Independent Evaluation Attempt
To address the absence of published quantitative results, this section describes what an independent evaluation would require and the current state of available evidence.
| Metric | Status | Why It Matters |
|---|---|---|
| Success rate on Level 1 problems | Not reported | Baseline: can the system reproduce known theorems? |
| Fraction formally verified in Lean4 | Not reported | Proof reliability beyond LLM self-evaluation |
| Average refinement cycles per lemma | Not reported | Efficiency of the prove-verify-refine loop |
| Token/runtime distributions per stage | Order-of-magnitude ranges only (see Section 40.6.3) | Cost predictability |
Baseline vs. memory_guided comparison | Not reported | Whether continual learning improves performance |
| Success rate on Level 2/3 problems | Not reported | Frontier capability assessment |
| Average lemma DAG depth | Not reported | Proof complexity and bottom-up construction depth |
Minimum reproduction protocol. A credible independent evaluation would require: (1) installing EurekaClaw at a pinned commit with documented Python and dependency versions; (2) running at least 3 Level-1 benchmark problems from the MAB plugin with a fixed EUREKACLAW_MODEL and THEORY_MAX_ITERATIONS; (3) recording per-run artifacts including theory_state.json (lemma count, proven count, failed attempt count, counterexample count), eval_report.json (Scientist-Bench scores), and wall-clock time; (4) attempting Lean4 formal verification on each proven theorem; (5) reporting success/failure rates, refinement cycle counts, and token consumption by stage.
The only evidence of system capability comes from CLI demo output [Documented] showing messages like "Found 23 relevant papers," "Theorem 3.1 drafted. LaTeX ready," and "Proof complete." These demonstrate that the pipeline runs end-to-end but do not establish the mathematical quality of the outputs. All claims about EurekaClaw's mathematical capabilities in this chapter describe the designed architecture, not empirically validated performance.
40.5 Memory Architecture and Continual Learning
EurekaClaw implements the most granular memory system among the autonomous research tools surveyed in this book, with four distinct tiers that serve different temporal and semantic scopes [Documented: eurekaclaw/memory/].
40.5.1 Four-Tier Memory Architecture
Tier 1 — Episodic Memory [Documented: eurekaclaw/memory/episodic.py]. An in-RAM ring buffer with fixed capacity that stores agent events within a single session. Each entry records entry_id, session_id, agent_role, content, metadata, and timestamp [Documented]. Episodic memory is not persisted — it is lost when the session ends. Its purpose is to provide short-term context: the Prover can see the Verifier's latest error, and the Refiner can access recent proof attempts.
Tier 2 — Persistent Memory [Documented: eurekaclaw/memory/persistent.py]. A cross-session key-value store serialized to ~/.eurekaclaw/memory/persistent.json. Each record contains a namespaced key (e.g., "theory.failed_strategies.sample_complexity"), an arbitrary JSON-serializable value, tags, and provenance metadata [Documented]. This tier stores strategies that worked or failed across sessions.
Tier 3 — Knowledge Graph [Documented: eurekaclaw/memory/knowledge_graph.py]. A theorem dependency network stored at ~/.eurekaclaw/memory/knowledge_graph.json. Each node represents a proven theorem with its formal statement, domain, and session provenance, linked by "uses" relations [Documented]. This enables the system to recognize when a new conjecture relates to previously proven results and to reuse proven lemmas as building blocks.
Tier 4 — Domain Insights [Documented: eurekaclaw/learning/memory_extractor.py]. Per-domain Markdown files stored at ~/.eurekaclaw/memories/<domain>/, each containing LLM-generated analysis of proof successes and failures. An index file (_index.json) enables semantic search across insights [Documented]. At session startup, the MetaOrchestrator loads the top-4 most relevant Tier 4 insights and injects them into all agent system prompts [Documented].
40.5.2 Continual Learning Loop
After each session, the ContinualLearningLoop.post_run() method executes three extraction phases [Documented]:
- Session Memory Extraction: The
SessionMemoryExtractoranalyzes theTheoryState(proven lemmas, failed attempts, counterexamples) and produces a Tier 4 domain insight file [Documented]. - Tool Pattern Extraction: The
ToolPatternExtractoranalyzes tool-call sequences and generates new tool-usage skills [Documented]. - Skill Distillation: The
SkillEvolver[Documented:eurekaclaw/skills/evolver.py] takes deduplicated failed attempts and compressed successful proofs (trimmed to 300 characters) and calls a fast LLM model (withmax_tokens=1024) to distill reusable proof strategies [Documented]. Results are saved viaSkillRegistry.upsert().
# Documented API: eurekaclaw/skills/evolver.py
# ──────────────────────────────────────────────────────────────
# PROVENANCE: Class name, method signature, and the fast-model
# usage pattern are from project documentation. The 300-char
# compression limit and max_tokens=1024 are documented. The
# exact model identifier for the fast path, the distillation
# prompt template, and the skill parsing logic are NOT verified.
# ──────────────────────────────────────────────────────────────
class SkillEvolver:
"""Distills proof strategies from session experience into reusable skills."""
async def distill_from_session(
self,
failures: list[FailedAttempt],
successes: list[ProofRecord],
) -> list[SkillRecord]:
# Deduplicate failures, compress successes [Documented]
unique_failures = self._deduplicate(failures)
compressed = [self._compress(s, max_chars=300) for s in successes]
# LLM distillation call [Documented: uses fast model]
response = await self.client.create(
model=self.fast_model, # distinct from EUREKACLAW_MODEL
max_tokens=1024, # [Documented]
messages=[{
"role": "user",
"content": self._build_distillation_prompt(
unique_failures, compressed
)
}]
)
# Parse and register [Documented]
skills = self._parse_skill_response(response.content)
for skill in skills:
self.registry.upsert(skill)
return skills
This creates two learning pathways. Individual learning occurs as each user's system accumulates distilled skills from their own sessions. Community learning occurs through ClawHub [Documented], a mechanism for sharing skills between researchers:
# Documented CLI commands for skill management
eurekaclaw install-skills steipete/github # Install community skills from ClawHub
eurekaclaw install-skills # Install built-in seed skills
eurekaclaw install-skills --force # Overwrite existing skills
40.5.3 Memory-Guided Pipeline Mode
When THEORY_PIPELINE=memory_guided [Documented: .env configuration], the system actively uses accumulated memory during proof construction:
- At session startup, the MetaOrchestrator loads the top-4 Tier 4 insights into all agent system prompts [Documented].
- During proof planning, the Formalizer checks the Tier 3 knowledge graph to reuse proven lemmas [Documented].
- During proof attempts, the Prover checks Tier 2 persistent memory to avoid previously failed strategies [Documented].
The design is architecturally sound — providing relevant prior experience should help an LLM avoid known-bad proof strategies. However, no comparative evaluation between the default pipeline and the memory_guided configuration has been published. Whether memory guidance meaningfully reduces iteration count, improves success rates, or merely adds prompting overhead is an open empirical question.
40.6 Implementation Details
40.6.1 Technology Stack
The core engine is implemented in Python ≥ 3.11 with full async/await throughout the agent pipeline [Documented]. Data models use Pydantic v2 BaseModel for all artifacts [Documented]. The browser UI is a React/TypeScript application built with Vite [Documented: frontend/], featuring components for agent tracking, proof DAG visualization, lemma status chips, human guidance input, and skill management.
Model configuration: The primary model is set via EUREKACLAW_MODEL, defaulting to claude-sonnet-4-6 [Documented]. All pipeline agents (Survey, Formalize, Theory, Write, Evaluate) share this model. The post-session learning path (SkillEvolver) uses a separate "fast model" [Documented] whose identity is not specified. The project description states compatibility with "Every Major Model API" [Documented] but the architecture is documented as optimized for Claude's instruction following and mathematical reasoning.
40.6.2 Domain Plugin System
EurekaClaw supports pluggable research domains through a DomainPlugin ABC [Documented: eurekaclaw/domains/base.py]. Each domain provides specialized tools, workflow hints, skill directories, and benchmark problems. The MAB (Multi-Armed Bandit) domain is the reference implementation [Documented: eurekaclaw/domains/mab/], shipping with simulation environments (Gaussian and Bernoulli bandits), concentration inequality tools, regret decomposition tools, information-theoretic tools, and a bandit experiment runner.
# Documented ABC: eurekaclaw/domains/base.py
# ──────────────────────────────────────────────────────────────
# PROVENANCE: Interface fields and method signatures are from
# project documentation. The @register_domain decorator appears
# in documentation examples; whether it is a real decorator or
# whether registration uses a different mechanism (e.g., a config
# list or entry_points) has not been independently verified.
# ──────────────────────────────────────────────────────────────
class DomainPlugin(ABC):
name: str # Machine identifier [Documented]
display_name: str # Human-readable name [Documented]
keywords: list[str] # Auto-detection triggers [Documented]
@abstractmethod
def register_tools(self, registry: ToolRegistry) -> None: ...
@abstractmethod
def get_workflow_hint(self) -> str: ...
def get_skills_dirs(self) -> list[Path]: ...
def get_benchmark_problems(self, level: str) -> list[dict]: ...
Domain auto-detection works through keyword matching [Documented]: the MAB plugin triggers on terms like "bandit," "multi-armed," "ucb," "thompson," "regret," "exploration," and "exploitation." Custom domains require implementing the ABC. The documentation shows a @register_domain decorator in examples, but the internal registration mechanism (whether decorator-based, config-list-based, or entry-point-based) has not been independently confirmed.
40.6.3 Cost Analysis
The cost model is dominated by the Theory stage, where the iterative prove-verify-refine loop can consume substantial tokens. The following are from project documentation [Documented] and represent order-of-magnitude ranges, not measured benchmarks:
| Stage | Documented Token Range | Dominant Cost Driver |
|---|---|---|
| Survey | 20K–100K | Paper summarization and gap analysis |
| Formalize | 5K–20K | Statement formalization |
| Theory | 50K–500K+ | Iterative proving loop (× THEORY_MAX_ITERATIONS) |
| Experiment | 10K–50K | Experiment design + code generation |
| Write | 20K–80K | Full paper generation |
| Evaluate | 5K–15K | Quality assessment |
| Learn | 2K–10K | Skill distillation (fast model) |
Total cost per run ranges from approximately $1–$10 for Level 1 to $5–$50+ for Level 3 [Documented], depending on conjecture difficulty, literature breadth, and proof depth. The THEORY_MAX_ITERATIONS parameter (default: 10) [Documented] acts as a direct multiplier on theory-stage cost. Hardware requirements: Python ≥ 3.11, Node.js ≥ 18, 4+ GB RAM, no GPU required [Documented]. Lean4 is optional.
40.6.4 Reproducibility
EurekaClaw's reproducibility profile is mixed. Installation is well-documented with multiple methods (one-line installers, manual setup with uv or pip) [Documented]. Configuration is centralized in a single .env file [Documented]. Session artifacts are fully preserved in a structured directory, enabling post-hoc inspection.
However, determinism is low: LLM non-determinism dominates the generative process, and no temperature or seed controls are documented [Documented: absence noted]. Two runs with identical configurations will produce different proofs, different theorem statements, and potentially different research directions. Formal verification via Lean4 provides a deterministic checking step, but proof generation is stochastic. The memory_guided pipeline mode further reduces reproducibility, as results depend on accumulated state from prior sessions.
40.7 Limitations and Open Questions
40.7.1 Proof Reliability
The most significant limitation is the reliability of LLM-generated proofs. Mathematical proofs require absolute logical rigor — a single invalid step renders the entire argument unsound. LLMs, including Claude, are known to produce plausible-looking but incorrect mathematical reasoning. The Verifier agent provides a first defense, but it is itself an LLM call, creating a situation where one LLM checks another's work. Lean4 formal verification addresses this gap but requires successful translation of natural-language proofs into Lean4 syntax — itself an imperfect process. All EurekaClaw-generated proofs should be treated as drafts requiring expert human verification.
40.7.2 Domain Coverage and Maturity
While the plugin architecture enables domain extensibility, only one domain plugin (Multi-Armed Bandits) ships with the system [Documented]. Researchers in other areas must implement their own domain plugins. The core pipeline is domain-agnostic, but effectiveness is heavily domain-dependent due to the importance of domain-specific proof strategies in skill injection. The Experiment stage (Stage 4) is explicitly under active development [Documented], creating a gap where theoretical bounds are proved but not numerically validated.
40.7.3 Model Architecture Trade-offs
The pipeline uses a nearly uniform model configuration: all agents use the primary EUREKACLAW_MODEL, with only the post-session SkillEvolver using a distinct fast model. This simplifies deployment but prevents task-specific optimization. The survey sub-agents might benefit from a faster, cheaper model, while the Prover and Verifier might benefit from a more capable reasoning model. A hierarchical model-routing strategy — analogous to bandit-based model selection in systems like OmniEvolve — could improve both cost efficiency and proving quality.
40.7.4 "Eureka Moments" vs. Systematic Search
Despite its name, EurekaClaw's proof construction is more systematic than serendipitous. The bottom-up lemma DAG approach is methodical: prove dependencies first, then assemble. The "eureka" element, if any, lies in the GapAnalyst's identification of unexplored directions and the Prover's generation of novel proof steps. Whether LLMs genuinely produce mathematical insight (as opposed to pattern-matching from training data) remains an open question in the field, and EurekaClaw's architecture does not resolve it. The system is better understood as automating the labor-intensive aspects of mathematical research than as genuinely discovering new mathematics.
40.8 Comparative Analysis
EurekaClaw's position in the landscape of LLM-powered research systems reveals a clear specialization strategy. While systems like AutoResearchClaw and AI Scientist aim for breadth across research domains, EurekaClaw sacrifices generality for depth in mathematical theorem proving.
| Dimension | EurekaClaw | AutoResearchClaw | AI Scientist (Sakana) |
|---|---|---|---|
| Domain scope | Mathematical theory (narrow) | Any research domain (broad) | ML experiments (medium) |
| Pipeline stages | 7 (with iterative theory loop) | 23 (end-to-end paper) | 3 (idea → experiment → paper) |
| Formal verification | Lean4 (optional) | None | None |
| Memory tiers | 4 | 3 (via MetaClaw) | 0 |
| Proof construction | Bottom-up lemma DAG | Not specialized | Not applicable |
| Documented cost/run | $1–$50+ (self-reported) | $5–$30 (self-reported) | $10–$50+ (self-reported) |
| Community learning | ClawHub skill sharing | MetaClaw cross-run | None |
| Independent evaluation | None available | None available | Peer review simulation |
Two features stand out architecturally. First, the 4-tier memory architecture is the most granular among surveyed systems. The separation of episodic (in-RAM), persistent (cross-session key-value), knowledge graph (theorem dependencies), and domain insights (LLM-extracted strategies) allows each type of learned information to persist at its appropriate granularity. Most competing systems either have no persistent memory or conflate all types into a single store.
Second, the bottom-up proof construction paradigm appears unique in the surveyed landscape. Most LLM-based reasoning systems use top-down decomposition. EurekaClaw's approach of building from atomic lemmas upward, tracking dependencies in a DAG, and only assembling the main proof when all dependencies are satisfied mirrors the structure of actual mathematical papers and provides a natural checkpoint mechanism.
40.9 The Evolutionary Perspective
Although EurekaClaw is not an evolutionary algorithm system in the traditional sense — it does not maintain a population of programs, apply mutation operators, or use fitness-based selection — it shares several conceptual parallels with the evolutionary discovery systems covered elsewhere in this survey.
The Prover → Verifier → Refiner loop implements a form of generate-evaluate-repair that is structurally analogous to the mutation-evaluation-selection cycle in evolutionary program synthesis. The failed-attempt history that prevents the Prover from repeating known-bad approaches parallels novelty-based search in quality-diversity algorithms. The lemma DAG itself can be viewed as a modular building-block approach where proven sub-results serve as reusable components — analogous to the building blocks hypothesis in genetic programming.
More precisely, let $\mathcal{S} = \{s_1, \ldots, s_m\}$ denote the skill registry. Each skill $s_i$ has an estimated success rate $\hat{r}(s_i)$ maintained via EMA. The SkillInjector selects skills weighted by relevance and success rate — functionally equivalent to a fitness-proportional selection operator. After each session, the SkillEvolver generates new skills from successful proofs (exploitation) while the failed-attempt history drives exploration of alternative strategies (exploration). This skill-level dynamics forms an evolutionary process over proof strategies, even though the proof construction itself is single-trajectory.
The skill distillation and ClawHub sharing mechanisms parallel population-level knowledge transfer in island-model evolutionary systems. The boundary between "multi-agent LLM orchestration" and "evolutionary search" is becoming porous: both paradigms navigate vast search spaces by combining generation, evaluation, and accumulated knowledge. The primary distinction is that evolutionary systems maintain explicit populations with fitness-based selection, while EurekaClaw uses a single-trajectory search guided by accumulated memory and skills.
Summary
Key takeaway: EurekaClaw is, among the systems surveyed (2024–2026), architecturally distinctive in combining bottom-up theorem proving through lemma DAGs, optional Lean4 formal verification, and four-tier continual learning that distills proof strategies across sessions.
Main contribution: The bottom-up proof construction paradigm — building from atomic lemmas upward rather than top-down decomposition — combined with a skill distillation loop that converts session experience into reusable proof strategies. Algorithm 1 in Section 40.3.2 provides a clear specification of the documented core (topological goal ordering, prove-verify-refine loop, counterexample search) and explicitly labels author-reconstructed mechanisms (stagnation detection, DAG restructuring).
Evidence limitations: No independent evaluations, quantitative benchmarks, or reproduced results are publicly available as of April 2026. The system's actual mathematical capability — success rates on Level 1/2/3 problems, formal verification rates, and the impact of memory-guided search — remains undocumented. Section 40.4.2 describes the minimum reproduction protocol needed to assess these claims. All code blocks in this chapter are labeled as either documented API surfaces (method names from project documentation, implementations not verified) or author pseudocode (reconstructed for pedagogical purposes).
What researchers should know: EurekaClaw's mathematical outputs should be treated as drafts requiring expert verification. The system's strength lies in accelerating the mechanical aspects of research (literature review, proof construction, paper writing) rather than in producing guaranteed-correct mathematics. The domain plugin architecture enables extension beyond the shipped MAB domain, but meaningful use in new domains requires substantial investment in domain-specific tools, skills, and benchmarks.