EurekaClaw
Multi-agent AI research assistant that autonomously crawls literature, generates and stress-tests mathematical hypotheses, proves theorems via a 7-stage bottom-up pipeline, runs numerical experiments, and writes camera-ready LaTeX papers — with continual learning that distills proof strategies into reusable skills across sessions.

Organization: Multi-institutional team (led by Quanquan Gu et al.)
Published: 2026 (Apache 2.0)
Type: repo (GitHub: EurekaClaw/EurekaClaw)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
EurekaClaw: An AI Agent for Capturing Eureka Moments
- Repository: github.com/EurekaClaw/EurekaClaw
- Documentation: eurekaclaw.github.io
- License: Apache 2.0
- Stars: ~631 (as of April 2026)
- Tagline: "The AI that catches your Eureka moments. Crawls arXiv · Generates theorems · Proves lemmas · Writes LaTeX papers · Runs experiments"
- Default model: claude-sonnet-4-6 (Claude as primary reasoning engine)
- Input modes: Three escalating levels of autonomy — from precise conjecture proving to open-ended domain exploration
Naming and Branding
The name "EurekaClaw" combines the discovery exclamation "Eureka!" with "Claw" — a reference to the lobster emoji (🦞) used throughout the project's branding and CLI output. The metaphor suggests the system's ability to "grasp" and hold onto breakthrough moments in mathematical reasoning. The project shares the "Claw" branding lineage with related systems (AutoResearchClaw, MetaClaw, OpenClaw, Dr. Claw, ScienceClaw), suggesting a broader ecosystem of "Claw"-branded AI research tools.
Lineage and Acknowledgments
EurekaClaw explicitly acknowledges inspiration from several predecessor systems:
| System | Influence |
|---|---|
| MetaClaw (AIMING Lab) | Multi-agent research orchestration patterns |
| AutoResearchClaw (AIMING Lab) | Automated research pipeline architecture |
| EvoScientist | Evolutionary hypothesis generation |
| AI-Researcher (HKUDS) | Automated research pipeline design |
| Dr. Claw (OpenLAIR) | Open research agent framework |
| OpenClaw | Open-source research agent infrastructure |
| ClawTeam (HKUDS) | Collaborative research agent patterns |
| ScienceClaw | Science-focused agent design |
Unique Position in the Ecosystem
While AutoResearchClaw and MetaClaw focus on end-to-end paper generation across general research domains, EurekaClaw specializes in mathematical theory — theorem statement, proof construction, formal verification, and theoretical paper writing. This makes it the most depth-focused system in the "Claw" family.
Claw Family Positioning
│
├── AutoResearchClaw — breadth: any research domain, 23-stage paper pipeline
├── MetaClaw — meta: cross-run learning, skill extraction
├── OpenClaw — platform: chat-based research assistant
├── ClawTeam — collaboration: multi-agent team coordination
├── ScienceClaw — domain: science-focused agent framework
├── Dr. Claw — framework: open research agent base
└── EurekaClaw — depth: mathematical theorem proving + paper writing ← this system
2 Authors and Team
| Author | Role (Inferred) |
|---|---|
| Xuheng Li | Lead developer / architect |
| Qiwei Di | Core contributor |
| Chenggong Zhang | Core contributor |
| Kaixuan Ji | Core contributor |
| Qingyue Zhao | Core contributor |
| Yifeng Liu | Core contributor |
| Shiyuan Zhang | Core contributor |
| Quanquan Gu | Principal investigator / senior author |
BibTeX Citation
@misc{eurekaclaw2026,
title = {EurekaClaw: An AI Agent for Capturing Eureka Moments},
author = {Li, Xuheng and Di, Qiwei and Zhang, Chenggong and Ji, Kaixuan
and Zhao, Qingyue and Liu, Yifeng and Zhang, Shiyuan and Gu, Quanquan},
year = {2026},
url = {https://github.com/EurekaClaw/EurekaClaw}
}
Team composition: Academic research group with 8 contributors. The team size and structure suggest a focused research lab project, likely from a strong ML/AI theory group (given the emphasis on mathematical proof, concentration inequalities, regret bounds, and multi-armed bandits as the showcased domain). Quanquan Gu is a recognized researcher in ML theory and bandits, consistent with the MAB domain plugin being the first and most developed domain.
Institutional context: Unlike the AIMING Lab's AutoResearchClaw (16 contributors across 5 universities), EurekaClaw appears to originate from a single lab, enabling tighter architectural coherence and deeper domain specialization.
3 Core Contribution
EurekaClaw's core contribution is a complete autonomous pipeline for mathematical research that uniquely combines:
- Literature-grounded hypothesis generation — crawling arXiv and Semantic Scholar to identify research gaps
- Bottom-up theorem proving — a 7-stage pipeline that builds proofs from lemmas up to main theorems
- Formal verification integration — optional Lean4 proof checking for mathematical rigor
- Continual skill learning — automatic distillation of proof strategies into reusable skills across sessions
- Domain plugin extensibility — pluggable research domain support with domain-specific tools, skills, and benchmarks
The Seven Pipeline Stages
Stage 1: SURVEY
│ Agents: SurveyOrchestrator, PaperFetcher, Summarizer, GapAnalyst, DirectionProposer
│ Output: ResearchBrief (directions scored by novelty × soundness × transformative potential)
│
Stage 2: FORMALIZE
│ Agents: Formalizer
│ Output: TheoryState.formal_statement (LaTeX theorem), proof plan, lemma DAG skeleton
│
Stage 3: THEORY (iterative)
│ Agents: Prover, Verifier, Refiner, CounterexampleSearcher
│ Output: Proven lemmas, assembled proof (or refutation)
│ Loop: prover → verifier → (fail?) refiner → repeat; stagnation → force refine
│
Stage 4: EXPERIMENT
│ Agents: ExperimentDesigner, ExperimentRunner
│ Output: Numerical validation of theoretical bounds
│ Status: Under development
│
Stage 5: WRITE
│ Agents: WriterAgent
│ Output: Camera-ready LaTeX paper with theorem environments and citations
│
Stage 6: EVALUATE
│ Tool: Scientist-Bench
│ Output: Multi-dimensional quality score (correctness, novelty, depth, alignment, citations)
│
Stage 7: LEARN
│ Agents: ContinualLearningLoop, SkillEvolver, SessionMemoryExtractor
│ Output: New skill files, updated memory tiers, knowledge graph entries
Five Differentiating Capabilities
| Capability | Description | Comparison |
|---|---|---|
| Bottom-up proof construction | Builds proofs from atomic lemmas via a DAG, not top-down decomposition | Unique among autoresearch systems; most use top-down planning |
| 7-stage theory pipeline with verification | Prover → Verifier → Refiner loop with counterexample search | AutoResearchClaw's theory support is part of general experiment, not dedicated |
| Formal verification (Lean4) | Optional Lean4 proof checking for mathematical rigor | No other "Claw" system integrates formal verification |
| 4-tier memory system | Episodic → persistent → knowledge graph → domain insights | Most sophisticated memory of any Claw-family system |
| Domain plugin architecture | Pluggable research domains with tools, skills, workflows, and benchmarks | Enables specialization without forking the core system |
Comparison to Related Systems
| System | Focus | Pipeline | Proving | Memory | Learning |
|---|---|---|---|---|---|
| EurekaClaw | Mathematical theory | 7-stage with verification loop | Yes (Lean4) | 4-tier | Skill distillation |
| AutoResearchClaw | General research | 23-stage end-to-end | No | MetaClaw | Cross-run skills |
| AI Scientist (Sakana) | ML experiments | Idea → experiment → paper | No | None | None |
| Google AI Co-Scientist | Hypothesis generation | Multi-agent debate | No | Tournament | Selection |
| K-Dense BYOK | Multi-discipline assistant | Chat → expert delegation | No | Session | None |
| AIRA₂ (Meta) | STEM research | 15+ specialized agents | No | Session | None |
EurekaClaw occupies a distinctive niche: it is the only system that combines deep mathematical proving capability with continual learning and a formal verification pathway. While narrower in domain than systems like K-Dense BYOK or AutoResearchClaw, it goes deeper in its target domain than any competitor.
4 Supported Solutions
Three Input Modes
EurekaClaw provides three escalating levels of research autonomy:
| Command | Level | Autonomy | Input | Output |
|---|---|---|---|---|
| eurekaclaw prove "<statement>" | 1 | Lowest | Precise mathematical statement | Proof + LaTeX paper |
| eurekaclaw from-papers <arxiv_ids> | 2 | Medium | Specific papers to extend | Gap analysis + new theorem + proof + paper |
| eurekaclaw explore "<domain>" | 3 | Highest | Broad research area | Literature survey + conjecture + proof + paper |
Level 1: Prove (Conjecture → Proof)
eurekaclaw prove "The sample complexity of transformers is O(L·d·log(d)/ε²)" \
--domain "ML theory" --output ./results
The system:
1. Surveys literature for related results on the conjecture
2. Formalizes the statement into a LaTeX theorem environment
3. Decomposes it into a lemma DAG
4. Proves each lemma bottom-up
5. Assembles the full proof
6. Writes a camera-ready paper
Level 2: From Papers (Papers → Novel Results)
eurekaclaw from-papers 1706.03762 2005.14165 --domain "attention mechanisms"
The system:
1. Fetches and summarizes the specified papers
2. Identifies gaps and extension opportunities
3. Generates a novel conjecture based on the gaps
4. Proceeds through the full prove pipeline
Level 3: Explore (Domain → Discovery)
eurekaclaw explore "multi-armed bandit theory"
The system:
1. Conducts a broad literature survey of the domain
2. Identifies open problems and promising directions
3. Scores directions by novelty, soundness, and transformative potential
4. Selects the most promising direction
5. Generates a conjecture
6. Proceeds through the full prove pipeline
Domain Plugin System
EurekaClaw supports pluggable research domains that provide specialized tools, skills, workflows, and benchmarks:
DomainPlugin (ABC)
│
├── name: str ← machine identifier
├── display_name: str ← human-readable name
├── keywords: list[str] ← auto-detection triggers
├── description: str
│
├── register_tools(registry) ← inject domain-specific tools
├── get_workflow_hint() → str ← research guidance for agent prompts
├── get_skills_dirs() → list[Path] ← extra skill directories
└── get_benchmark_problems(level) → list[dict] ← evaluation problems
MAB Domain Plugin (Reference Implementation)
The Multi-Armed Bandit domain is the first and most developed plugin:
domains/mab/
├── __init__.py MABDomainPlugin
│ ├── name = "mab"
│ ├── display_name = "Stochastic Multi-Armed Bandits"
│ └── keywords = ["bandit", "multi-armed", "mab", "ucb", "thompson",
│ "regret", "exploration", "exploitation"]
│
├── workflow.py WORKFLOW_HINT (research guidance text)
│
├── envs/ Simulation environments
│ ├── stochastic.py GaussianBandit, BernoulliBandit
│ └── runner.py run_experiment(), sweep_T()
│ (UCB1 & Thompson Sampling implementations)
│
├── tools/ LLM-callable domain tools
│ ├── concentration.py Hoeffding, Bernstein, sub-Gaussian, UCB radius
│ ├── regret.py Regret decomposition, Lai-Robbins lower bound
│ ├── information.py KL(Bernoulli), KL(Gaussian), Fano's inequality
│ └── bandit_tool.py BanditExperimentTool (experiment runner)
│
├── skills/ Domain-specific proof strategies
│ ├── ucb_regret_analysis.md
│ ├── thompson_sampling_analysis.md
│ ├── lower_bound_construction.md
│ └── bandit_simulation.md
│
└── benchmark/ Tiered evaluation problems
├── level1.json Reproduce: UCB1, Lai-Robbins (known bounds)
├── level2.json Refine: Bernstein-UCB, MOSS, KL-UCB
└── level3.json Open: heavy tails, infinite-arm, batched bandits
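The envs/ directory ships reference implementations of UCB1 and Thompson Sampling. As a minimal, self-contained sketch of the UCB1 logic such an environment exercises (function names here are illustrative, not the plugin's actual API):

```python
import math
import random

def ucb1_index(mean: float, pulls: int, t: int) -> float:
    # Empirical mean plus the Hoeffding-style exploration bonus sqrt(2 ln t / n).
    return mean + math.sqrt(2 * math.log(t) / pulls)

def run_ucb1(true_means: list[float], horizon: int, seed: int = 0) -> float:
    # Play UCB1 on a Bernoulli bandit; return total collected reward.
    rng = random.Random(seed)
    k = len(true_means)
    pulls, means, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            arm = max(range(k), key=lambda a: ucb1_index(means[a], pulls[a], t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean update
        total += reward
    return total
```

The sqrt(2 ln t / n) bonus is the same UCB radius that the plugin's concentration.py tool exposes to the LLM.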
Adding Custom Domains
import json
from pathlib import Path

from eurekaclaw.domains.base import DomainPlugin
from eurekaclaw.domains import register_domain
from eurekaclaw.tools.base import ToolRegistry

@register_domain
class MyDomainPlugin(DomainPlugin):
    name = "my_domain"
    display_name = "My Research Domain"
    keywords = ["keyword1", "keyword2"]

    def register_tools(self, registry: ToolRegistry) -> None:
        registry.register(MySpecialTool())

    def get_workflow_hint(self) -> str:
        return """When researching my_domain:
- Always start by checking known results X and Y
- Use technique Z for the main proof step"""

    def get_skills_dirs(self) -> list[Path]:
        return [Path(__file__).parent / "skills"]

    def get_benchmark_problems(self, level: str) -> list[dict]:
        bm_file = Path(__file__).parent / "benchmark" / f"{level}.json"
        return json.loads(bm_file.read_text()) if bm_file.exists() else []
Registration requires adding the module path to _DOMAIN_PACKAGES in the domain resolver. Domains are auto-detected from user input via keyword matching.
5 LLM Integration
Model Configuration
| Variable | Default | Purpose |
|---|---|---|
| ANTHROPIC_API_KEY | — | API key for Claude access |
| EUREKACLAW_MODEL | claude-sonnet-4-6 | Main reasoning model for all agents |
EurekaClaw is designed around Claude as the primary reasoning engine, unlike K-Dense BYOK's provider-agnostic approach. The system description states compatibility with "Every Major Model API" but is architecturally optimized for Claude's strengths in mathematical reasoning and instruction following.
Authentication Options
| Method | Provider | Use Case |
|---|---|---|
| API key (ANTHROPIC_API_KEY) | Anthropic | Standard programmatic access |
| OAuth (Claude Pro/Max subscription) | Anthropic | Personal use without API billing |
Agent-Model Architecture
MetaOrchestrator
│ Model: EUREKACLAW_MODEL (claude-sonnet-4-6)
│
├── SurveyOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── PaperFetcher ← tool calls (arXiv, Semantic Scholar)
│ ├── Summarizer ← LLM reasoning
│ ├── GapAnalyst ← LLM reasoning
│ └── DirectionProposer ← LLM reasoning + scoring
│
├── Formalizer
│ │ Model: EUREKACLAW_MODEL
│ └── Statement formalization, proof planning, lemma DAG construction
│
├── TheoryOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── Prover ← LLM reasoning (proof generation)
│ ├── Verifier ← LLM reasoning + Lean4 tool
│ ├── Refiner ← LLM reasoning (proof repair)
│ └── CounterexampleSearcher ← LLM reasoning (adversarial)
│
├── ExperimentOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── ExperimentDesigner ← LLM reasoning
│ └── ExperimentRunner ← Code execution tool
│
├── WriterAgent
│ │ Model: EUREKACLAW_MODEL
│ └── LaTeX paper generation with theorem environments
│
└── ContinualLearningLoop
│ Model: fast model (max_tokens=1024)
├── SessionMemoryExtractor ← LLM analysis of session
├── ToolPatternExtractor ← LLM analysis of tool usage
└── SkillEvolver ← LLM skill distillation
All agents share the same model configuration, creating architectural simplicity but limiting model-specific optimization per task. The continual learning loop uses a "fast model" for efficiency during post-session skill distillation.
Tool-LLM Interface
Tools are registered with the ToolRegistry and exposed to agents via the Anthropic tool definition format:
class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    def input_schema(self) -> dict: ...
    async def call(self, **kwargs) -> str: ...
    def to_anthropic_tool_def(self) -> dict: ...
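To make the interface concrete, here is a hypothetical minimal tool written against a stand-in of this ABC (the stub base class and EchoTool are illustrative sketches, not repository code):

```python
from abc import ABC, abstractmethod
from typing import ClassVar

class BaseTool(ABC):
    # Stand-in for EurekaClaw's BaseTool, mirroring the interface above.
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs) -> str: ...

    def to_anthropic_tool_def(self) -> dict:
        # Anthropic's tool definition shape: name, description, JSON-Schema input.
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": self.input_schema(),
        }

class EchoTool(BaseTool):
    name = "echo"
    description = "Return the input text unchanged (demo tool)."

    def input_schema(self) -> dict:
        return {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        }

    async def call(self, text: str = "", **kwargs) -> str:
        return text
```

An agent loop would pass `EchoTool().to_anthropic_tool_def()` in the API request and dispatch the model's tool-use blocks to `call()`.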
Built-in Tools
| Tool | Purpose | Output |
|---|---|---|
| ArxivSearchTool | Search arXiv for papers by query | List of papers with metadata |
| SemanticScholarTool | Search Semantic Scholar with citation counts | List of papers with venue/citation data |
| WebSearchTool | General web search | Search result snippets |
| Lean4VerifyTool | Formal proof verification via Lean4 | Verified/failed with output |
| WolframAlphaTool | Symbolic computation queries | Computation results |
| CodeExecutionTool | Python code execution in sandbox | stdout + stderr |
| Domain-specific tools | Per-plugin (e.g., BanditExperimentTool) | Domain-dependent |
6 Key Results
Scientist-Bench Evaluator
EurekaClaw includes an internal evaluation framework called Scientist-Bench that scores research outputs along five dimensions:
| Dimension | Weight | Measurement |
|---|---|---|
| Formal correctness | 0.35 | Lean4 formal verification or LLM peer review |
| Novelty | 0.25 | Embedding distance from known results in literature |
| Experimental alignment | 0.15 | Numerical experiments validate theoretical bounds |
| Proof depth | 0.15 | Lemma count (complexity of proof DAG) |
| Citation coverage | 0.10 | Completeness of related work citations |
eurekaclaw eval-session <session_id>
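Under these weights, the overall score is a weighted sum of the five dimensions. A sketch (the dimension keys are illustrative; the linear combination follows the weight column above):

```python
# Weights from the Scientist-Bench table above (they sum to 1.0).
WEIGHTS = {
    "formal_correctness": 0.35,
    "novelty": 0.25,
    "experimental_alignment": 0.15,
    "proof_depth": 0.15,
    "citation_coverage": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    # Linear combination of per-dimension scores, each assumed in [0, 1].
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```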
Evaluation Methodology
The evaluation is self-referential — the system evaluates its own outputs — which limits its reliability as a benchmark but provides useful signal for:
- Comparing runs with different configurations
- Identifying which domains/conjectures the system handles well
- Tracking improvement from skill accumulation over time
Benchmark Problems (MAB Domain)
The MAB domain plugin includes three levels of benchmark problems:
| Level | Difficulty | Examples | Purpose |
|---|---|---|---|
| Level 1 | Reproduce known | UCB1 regret bound, Lai-Robbins lower bound | Validate the system can reproduce textbook results |
| Level 2 | Refine existing | Bernstein-UCB, MOSS, KL-UCB | Test ability to improve on known techniques |
| Level 3 | Open problems | Heavy-tailed bandits, infinite-arm, batched | Probe frontier research capability |
Demonstrated Capabilities
Based on documentation and CLI examples:
| Capability | Example | Evidence Level |
|---|---|---|
| Literature crawling | "Found 23 relevant papers" from arXiv | CLI demo output |
| Hypothesis generation | "O(n log n) via topological filtration" | CLI demo output |
| Theorem drafting | "Theorem 3.1 drafted. LaTeX ready." | CLI demo output |
| Proof completion | "Proof complete." | CLI demo output |
| Paper generation | "Paper draft saved to ./results/" | CLI demo output |
| Skill distillation | UCB regret analysis skill | Seed skill documentation |
| Formal verification | Lean4 integration | Tool documentation |
Caveat: No independent benchmarks or peer-reviewed evaluations are available. All evidence comes from the project's own documentation and demonstrations.
Gate Modes
EurekaClaw supports three gate modes that control the level of human oversight:
| Mode | Behavior | Use Case |
|---|---|---|
| none | Fully autonomous — no pauses | Batch processing, overnight runs |
| auto | System pauses at critical junctures | Default — balanced autonomy |
| human | Human approval required at every gate | Maximum oversight for critical research |
7 Reproducibility
Installation Methods
| Method | Complexity | Platforms |
|---|---|---|
| One-line installer | Minimal | macOS, Linux |
| One-line installer | Minimal | Windows (PowerShell) |
| Manual with uv | Low | macOS, Linux (recommended) |
| Manual with pip | Low | macOS, Linux |
| Manual | Low | Windows |
Detailed Setup (uv method)
# Prerequisites: Python ≥ 3.11, Node.js ≥ 18, Git, uv
git clone https://github.com/EurekaClaw/EurekaClaw
cd EurekaClaw
uv venv --python 3.11 .venv
source .venv/bin/activate
uv pip install -e "."
cd frontend && npm install && cd ..
eurekaclaw install-skills # install built-in proof skills
eurekaclaw onboard # configure API key and settings
Configuration for Reproducibility
cp .env.example .env
| Variable | Default | Reproducibility Impact |
|---|---|---|
| EUREKACLAW_MODEL | claude-sonnet-4-6 | High — different models produce different proofs |
| GATE_MODE | auto | Low — affects human interaction, not output quality |
| THEORY_PIPELINE | default | High — memory_guided injects prior session knowledge |
| OUTPUT_FORMAT | latex | Low — format only |
| EXPERIMENT_MODE | auto | Medium — controls numerical validation |
| THEORY_MAX_ITERATIONS | 10 | High — limits proof search depth |
Reproducibility Assessment
| Factor | Rating | Details |
|---|---|---|
| Installation | High | Multiple well-documented methods, automated setup |
| Configuration | High | Single .env file, clear defaults, interactive onboard |
| Determinism | Low | LLM non-determinism dominates; no temperature/seed controls documented |
| Session persistence | High | KnowledgeBus.persist() saves full session state as JSON artifacts |
| Artifact preservation | High | Theory state, bibliography, experiment results all serialized |
| Formal verification | Medium | Lean4 provides deterministic proof checking, but proof generation is stochastic |
Session Artifact Structure
Each run produces a structured artifact directory:
results/<session_id>/
├── theory_state.json ← Full proof state machine (lemma DAG, proofs, failures)
├── research_brief.json ← Literature survey findings, scored directions
├── bibliography.json ← All papers found during survey
├── experiment_result.json ← Numerical validation results
├── paper.tex ← Generated LaTeX paper
├── paper.pdf ← Compiled PDF (if LaTeX available)
└── eval_report.json ← Scientist-Bench evaluation scores
This artifact preservation enables partial reproducibility: while the generative process is stochastic, the outputs are fully captured and can be inspected, compared, and re-processed (e.g., re-running the writer agent on preserved theory state).
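A sketch of such re-processing: loading every JSON artifact from a session directory for inspection (load_session is a hypothetical helper, not a repository function):

```python
import json
from pathlib import Path

def load_session(session_dir: Path) -> dict:
    # Map artifact stem (e.g. "theory_state") to its parsed JSON content.
    return {p.stem: json.loads(p.read_text())
            for p in sorted(session_dir.glob("*.json"))}
```

From the returned dict one can, for example, diff `theory_state` across two runs or feed a preserved state back into a writer step.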
8 Compute and API Costs
Cost Model
Cost = Σ(agent_calls) × token_price(EUREKACLAW_MODEL)
+ Σ(tool_calls) × tool_cost(tool_type)
+ learning_calls × token_price(fast_model)
Per-Stage Cost Estimates
| Stage | Typical Token Usage | Dominant Cost Driver |
|---|---|---|
| Survey | 20K–100K tokens | Paper summarization and gap analysis |
| Formalize | 5K–20K tokens | Statement formalization |
| Theory | 50K–500K+ tokens | Iterative proving loop (up to THEORY_MAX_ITERATIONS) |
| Experiment | 10K–50K tokens | Experiment design + code generation |
| Write | 20K–80K tokens | Full paper generation |
| Evaluate | 5K–15K tokens | Quality assessment |
| Learn | 2K–10K tokens | Skill distillation (fast model) |
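Plugging rough midpoints of these ranges into the cost model above gives a back-of-envelope estimator (the token figures are illustrative midpoints, and the blended $/million-token price is an assumption the caller supplies):

```python
# Rough midpoints of the per-stage token ranges in the table above.
STAGE_TOKENS = {
    "survey": 60_000, "formalize": 12_000, "theory": 275_000,
    "experiment": 30_000, "write": 50_000, "evaluate": 10_000, "learn": 6_000,
}

def estimate_cost(price_per_mtok: float, stages: dict[str, int] = STAGE_TOKENS) -> float:
    # Total tokens across stages times a blended $/1M-token price.
    return sum(stages.values()) / 1_000_000 * price_per_mtok
```

At an assumed blended price of $10 per million tokens, this midpoint run lands in the single-digit dollar range, consistent with the comparison table below.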
Cost Scaling Factors
| Factor | Impact |
|---|---|
| Conjecture difficulty | Hard proofs → more theory iterations → more tokens |
| Literature breadth | Broad domains → more papers to summarize |
| Proof depth | Deep lemma DAGs → more prover/verifier cycles |
| Experiment complexity | Complex numerical validation → more code generation |
| THEORY_MAX_ITERATIONS | Direct multiplier on theory stage cost (default: 10) |
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | ≥ 3.11 | 3.11+ |
| Node.js | ≥ 18 | 18+ |
| RAM | 4 GB | 8+ GB |
| Storage | 2 GB | 5+ GB (with skills and results) |
| Network | Required | Stable broadband |
| GPU | Not required | Not required |
| Lean4 | Optional | For formal verification |
Cost Comparison
| System | Estimated Cost per Run | Primary Model |
|---|---|---|
| EurekaClaw (Level 1, prove) | $1–$10 | Claude Sonnet |
| EurekaClaw (Level 3, explore) | $5–$50+ | Claude Sonnet |
| AutoResearchClaw | $5–$30 | Configurable |
| AI Scientist (Sakana) | $10–$50+ | Claude/GPT-4 |
| K-Dense BYOK (per session) | $0.05–$5 | User-selected |
EurekaClaw's theory proving loop is the dominant cost factor. A difficult conjecture that requires maximum iterations with failed attempts and refinement can consume significantly more tokens than a straightforward proof.
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ EurekaClaw System │
│ │
│ ┌─────────────┐ ┌──────────────────────────────────────────────────┐ │
│ │ CLI Entry │ │ MetaOrchestrator │ │
│ │ Points │───►│ • Pipeline stage sequencing │ │
│ │ │ │ • KnowledgeBus management │ │
│ │ prove │ │ • Memory injection (top-4 domain insights) │ │
│ │ from-papers │ │ • Domain plugin resolution │ │
│ │ explore │ │ • Gate mode enforcement │ │
│ └─────────────┘ └───────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌──────────────▼──────────────────────────────────┐ │
│ │ Browser UI │ │ KnowledgeBus │ │
│ │ (React/TS) │───►│ Typed artifact store + pub/sub │ │
│ │ │ │ ┌─────────────┐ ┌────────────┐ ┌───────────┐ │ │
│ │ • Agent track│ │ │ResearchBrief│ │TheoryState │ │Bibliography│ │ │
│ │ • Proof view │ │ └─────────────┘ └────────────┘ └───────────┘ │ │
│ │ • Pause/ │ │ ┌────────────────┐ ┌──────────────┐ │ │
│ │ resume │ │ │ExperimentResult│ │TaskPipeline │ │ │
│ │ • Skills mgr │ │ └────────────────┘ └──────────────┘ │ │
│ └──────────────┘ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Pipeline Stages │ │
│ │ │ │
│ │ Survey ──► Formalize ──► Theory ──► Experiment ──► Write ──► │ │
│ │ ▲ │ Evaluate │ │
│ │ │ │ │ │ │
│ │ └──┘ Learn │ │
│ │ (iterative │ │
│ │ proof loop) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Tool Registry │ │Skill Registry│ │Memory Manager│ │ Domain │ │
│ │ │ │ │ │ │ │ Plugins │ │
│ │ arXiv │ │ Seed skills │ │ Episodic (T1)│ │ │ │
│ │ Sem.Scholar │ │ Distilled │ │ Persist (T2) │ │ MAB │ │
│ │ WebSearch │ │ Manual │ │ KG (T3) │ │ [custom] │ │
│ │ Lean4 │ │ Domain │ │ Domain (T4) │ │ │ │
│ │ Wolfram │ │ │ │ │ │ │ │
│ │ CodeExec │ │ Injector │ │ Extractor │ │ │ │
│ │ [domain] │ │ Evolver │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
KnowledgeBus — Central State Management
The KnowledgeBus is the architectural backbone — a typed artifact store with pub/sub that all pipeline stages read from and write to:
class KnowledgeBus:
    def __init__(self, session_id: str) -> None: ...

    # Typed artifact accessors
    def put_research_brief(self, brief: ResearchBrief) -> None: ...
    def get_research_brief(self) -> ResearchBrief | None: ...
    def put_theory_state(self, state: TheoryState) -> None: ...
    def get_theory_state(self) -> TheoryState | None: ...
    def put_experiment_result(self, result: ExperimentResult) -> None: ...
    def get_experiment_result(self) -> ExperimentResult | None: ...
    def put_bibliography(self, bib: Bibliography) -> None: ...
    def get_bibliography(self) -> Bibliography | None: ...
    def append_citations(self, papers: list[Paper]) -> None: ...
    def put_pipeline(self, pipeline: TaskPipeline) -> None: ...
    def get_pipeline(self) -> TaskPipeline | None: ...

    # Generic key-value store
    def put(self, key: str, value: Any) -> None: ...
    def get(self, key: str, default: Any = None) -> Any: ...

    # Reactive subscriptions
    def subscribe(self, artifact_type: str, callback: Callable) -> None: ...

    # Persistence
    def persist(self, session_dir: Path) -> None: ...

    @classmethod
    def load(cls, session_id: str, session_dir: Path) -> "KnowledgeBus": ...
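The store-plus-pub/sub pattern can be illustrated with a toy stand-in (MiniBus is a sketch of the pattern, not the repository's implementation):

```python
from collections import defaultdict
from typing import Any, Callable

class MiniBus:
    # Toy version of the KnowledgeBus pattern: key-value store + pub/sub.
    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value
        for cb in self._subs[key]:  # notify subscribers on every write
            cb(value)

    def get(self, key: str, default: Any = None) -> Any:
        return self._store.get(key, default)

    def subscribe(self, key: str, callback: Callable[[Any], None]) -> None:
        self._subs[key].append(callback)
```

Subscribers registered on an artifact key fire synchronously on each put, which is how downstream stages can react the moment a new TheoryState lands.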
TheoryState — Proof State Machine
The TheoryState is the most complex artifact, tracking the entire proof construction process:
TheoryState
├── informal_statement ← Plain-English conjecture
├── formal_statement ← LaTeX-formalized theorem
├── known_results[] ← KnownResult extracted from literature
├── research_gap ← GapAnalyst's finding
├── proof_plan[] ← ProofPlan with provenance (known/adapted/new)
├── lemma_dag{} ← LemmaNode graph with dependencies
├── proven_lemmas{} ← lemma_id → ProofRecord
├── open_goals[] ← Remaining lemma_ids to prove
├── failed_attempts[] ← FailedAttempt history (used for learning)
├── counterexamples[] ← Counterexample discoveries
├── assembled_proof ← Final combined proof text
└── status ← pending | in_progress | proved | refuted | abandoned
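Bottom-up construction means a lemma becomes an open goal only once its dependencies are proven. With the DAG stored as lemma_id → dependency list, Python's standard-library graphlib yields that ordering directly (the example DAG is illustrative):

```python
from graphlib import TopologicalSorter

def bottom_up_order(lemma_dag: dict[str, list[str]]) -> list[str]:
    # lemma_dag maps lemma_id -> the lemma_ids it depends on; the returned
    # order lists every lemma after all of its dependencies.
    return list(TopologicalSorter(lemma_dag).static_order())

# Illustrative DAG: the main theorem rests on two lemmas, one of which
# itself uses the concentration lemma.
EXAMPLE_DAG = {
    "main_theorem": ["lemma_decomposition", "lemma_concentration"],
    "lemma_decomposition": ["lemma_concentration"],
    "lemma_concentration": [],
}
```

TopologicalSorter also raises CycleError on a cyclic "DAG", which is a useful sanity check before the proof loop starts.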
ResearchBrief — Scored Research Directions
ResearchBrief
├── domain, query, conjecture
├── directions[] ← ResearchDirection objects
│ ├── novelty_score (0-1)
│ ├── soundness_score (0-1)
│ ├── transformative_score (0-1)
│ └── composite_score ← weighted average
├── selected_direction ← Chosen after convergence
└── open_problems[], key_mathematical_objects[]
10 Component Breakdown
Source Code Structure
eurekaclaw/
├── main.py ← EurekaSession, run_research(), save_artifacts()
├── cli/ ← CLI entry points (prove, from-papers, explore, ui)
│
├── orchestrator/ ← MetaOrchestrator (pipeline sequencing)
│
├── agents/ ← All agent implementations
│ ├── base.py ← BaseAgent ABC
│ ├── survey/ ← Survey pipeline agents
│ │ ├── orchestrator.py SurveyOrchestrator
│ │ ├── paper_fetcher.py PaperFetcher
│ │ ├── summarizer.py Summarizer
│ │ ├── gap_analyst.py GapAnalyst
│ │ └── direction.py DirectionProposer
│ ├── formalize/ ← Formalizer agent
│ ├── theory/ ← Theory pipeline agents
│ │ ├── orchestrator.py TheoryOrchestrator
│ │ ├── prover.py Prover
│ │ ├── verifier.py Verifier
│ │ ├── refiner.py Refiner
│ │ └── counterexample.py CounterexampleSearcher
│ ├── experiment/ ← Experiment pipeline agents
│ │ ├── designer.py ExperimentDesigner
│ │ └── runner.py ExperimentRunner
│ └── writer/ ← WriterAgent
│ └── agent.py
│
├── knowledge_bus/ ← KnowledgeBus (typed artifact store + pub/sub)
│ └── bus.py
│
├── types/ ← Pydantic data models
│ └── artifacts.py ← TheoryState, ResearchBrief, Bibliography, etc.
│
├── tools/ ← Tool implementations
│ ├── base.py ← BaseTool ABC, ToolRegistry
│ ├── arxiv.py ← ArxivSearchTool
│ ├── semantic_scholar.py ← SemanticScholarTool
│ ├── web_search.py ← WebSearchTool
│ ├── lean4.py ← Lean4VerifyTool
│ ├── wolfram.py ← WolframAlphaTool
│ └── code_exec.py ← CodeExecutionTool
│
├── skills/ ← Skill system
│ ├── registry.py ← SkillRegistry (load, search, upsert)
│ ├── injector.py ← SkillInjector (retrieve + format for prompts)
│ ├── install.py ← SkillInstaller (seed + ClawHub)
│ ├── evolver.py ← SkillEvolver (distill from session)
│ └── seed_skills/ ← Built-in skill markdown files
│ ├── theory/
│ ├── survey/
│ └── [domain]/
│
├── memory/ ← Memory system
│ ├── manager.py ← MemoryManager (unified interface)
│ ├── episodic.py ← EpisodicMemory (in-RAM ring buffer)
│ ├── persistent.py ← PersistentMemory (cross-run JSON)
│ └── knowledge_graph.py ← KnowledgeGraph (theorem dependency network)
│
├── learning/ ← Continual learning
│ └── memory_extractor.py ← SessionMemoryExtractor (Tier 4 insights)
│
├── domains/ ← Domain plugin system
│ ├── base.py ← DomainPlugin ABC, @register_domain
│ └── mab/ ← Multi-Armed Bandit domain
│ ├── __init__.py
│ ├── workflow.py
│ ├── envs/
│ ├── tools/
│ ├── skills/
│ └── benchmark/
│
├── evaluation/ ← Scientist-Bench evaluator
│
├── ui/ ← Static assets for browser UI
│ └── static/
│
└── frontend/ ← React + TypeScript browser UI
├── src/
├── package.json
└── tsconfig.json
Component Dependencies
CLI / Browser UI
│
MetaOrchestrator
/ | \
/ | \
SurveyOrch TheoryOrch WriterAgent
/ | \ / | \ |
Fetch Sum Gap Prove Verify LaTeX
| | | generation
Direction Refine Counter-
Proposer example
|
─────────────────────
│ KnowledgeBus │ ← all agents read/write
─────────────────────
/ | \
ToolReg SkillReg MemoryMgr
/ | \ | / | \
arXiv S2 Lean4 Skills T1 T2 T3/T4
|
DomainPlugins
Key Abstractions
| Abstraction | Type | Purpose |
|---|---|---|
| BaseAgent | ABC | Common agent interface (execute, tool access, memory logging) |
| BaseTool | ABC | Tool interface (input_schema, call, to_anthropic_tool_def) |
| DomainPlugin | ABC | Domain specialization interface |
| KnowledgeBus | Concrete | Typed artifact store with pub/sub |
| SkillRegistry | Concrete | Skill storage, search, and lifecycle |
| MemoryManager | Concrete | Unified interface across 4 memory tiers |
| ToolRegistry | Concrete | Tool registration and dispatch |
| EurekaSession | Concrete | Top-level session management |
11 Core Mechanisms (Detailed)
11.1 Survey Pipeline — Literature-Grounded Research
The survey pipeline implements a multi-agent literature review that produces scored research directions:
Survey Pipeline Flow
│
├── 1. PaperFetcher
│ ├── Input: domain string, conjecture (if available)
│ ├── Tools: ArxivSearchTool, SemanticScholarTool
│ ├── Process: Query construction → API calls → deduplication
│ └── Output: Raw paper list (titles, abstracts, citations, venues)
│
├── 2. Summarizer
│ ├── Input: Raw paper list
│ ├── Process: LLM summarization of each paper's contributions
│ └── Output: Structured summaries with key techniques and results
│
├── 3. GapAnalyst
│ ├── Input: Summarized papers + domain context
│ ├── Process: Cross-paper analysis to find unexplored combinations
│ └── Output: research_gap field in TheoryState
│
├── 4. DirectionProposer
│ ├── Input: Summaries + gaps + domain workflow hint
│ ├── Process: Generate candidate directions, score each
│ ├── Scoring:
│ │ ├── novelty_score (0-1): distance from existing work
│ │ ├── soundness_score (0-1): feasibility of proof approach
│ │ ├── transformative_score (0-1): potential impact
│ │ └── composite_score: weighted average
│ └── Output: ResearchBrief with selected_direction
│
└── 5. KnowledgeBus Update
├── put_research_brief(brief)
├── put_bibliography(bibliography)
└── Subscribers notified
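The DirectionProposer's scoring step reduces to a weighted average over the three sub-scores, with the highest-scoring candidate selected. A minimal sketch; the specific weights and the Direction container are illustrative assumptions, since the repo only documents "weighted average":

```python
from dataclasses import dataclass

@dataclass
class Direction:
    title: str
    novelty_score: float         # 0-1: distance from existing work
    soundness_score: float       # 0-1: feasibility of proof approach
    transformative_score: float  # 0-1: potential impact

# Weights are hypothetical, not the repo's actual values.
def composite_score(d: Direction, weights=(0.4, 0.35, 0.25)) -> float:
    w_n, w_s, w_t = weights
    return (w_n * d.novelty_score
            + w_s * d.soundness_score
            + w_t * d.transformative_score)

def select_direction(candidates: list[Direction]) -> Direction:
    """Pick the candidate direction with the highest composite score."""
    return max(candidates, key=composite_score)
```

With these weights a sound-but-modest direction can beat a flashier one: (0.6, 0.8, 0.7) scores 0.695 and wins over (0.9, 0.4, 0.5) at 0.625.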
11.2 Theory Pipeline — Bottom-Up Proof Construction
The theory pipeline is EurekaClaw's most distinctive mechanism — a bottom-up proof construction system that builds proofs from atomic lemmas:
Theory Pipeline State Machine
│
├── INITIALIZATION
│ ├── Formalizer receives conjecture + research brief
│ ├── Creates formal_statement (LaTeX theorem)
│ ├── Generates proof_plan[] with provenance:
│ │ ├── "known" — directly from literature (cite)
│ │ ├── "adapted" — modified from known results
│ │ └── "new" — novel contribution required
│ └── Constructs lemma_dag{} — dependency graph of lemmas
│
├── PROOF LOOP (up to THEORY_MAX_ITERATIONS)
│ │
│ ├── For each open_goal in lemma_dag (bottom-up order):
│ │ │
│ │ ├── PROVE
│ │ │ ├── Prover agent receives:
│ │ │ │ ├── Lemma statement
│ │ │ │ ├── Available proven lemmas (dependencies)
│ │ │ │ ├── Injected skills (top-k relevant)
│ │ │ │ ├── Failed attempts history (avoid repetition)
│ │ │ │ └── Domain workflow hint
│ │ │ └── Output: Candidate proof text
│ │ │
│ │ ├── VERIFY
│ │ │ ├── Verifier agent checks proof:
│ │ │ │ ├── Logical consistency
│ │ │ │ ├── Step completeness
│ │ │ │ ├── Assumption validity
│ │ │ │ └── Optional: Lean4 formal check
│ │ │ └── Output: verified (→ proven) or error message (→ refine)
│ │ │
│ │ ├── REFINE (if verification failed)
│ │ │ ├── Refiner receives: proof + error message + failed history
│ │ │ ├── Repairs specific issues identified by verifier
│ │ │ └── Output: Revised proof → back to VERIFY
│ │ │
│ │ └── COUNTEREXAMPLE SEARCH (parallel)
│ │ ├── CounterexampleSearcher runs adversarially
│ │ ├── Attempts to find inputs that violate the lemma
│ │ └── If found: lemma refuted → DAG restructuring
│ │
│ ├── STAGNATION DETECTION
│ │ ├── If same error appears N times
│ │ └── Force Refiner with broader repair strategy
│ │
│ └── DAG ADVANCEMENT
│ ├── Move proven lemma from open_goals to proven_lemmas
│ ├── Unlock dependent lemmas
│ └── Check if main theorem's dependencies are all proven
│
├── ASSEMBLY
│ ├── Combine proven lemmas into assembled_proof
│ ├── Order by dependency chain
│ └── Add connecting reasoning between lemmas
│
└── STATUS DETERMINATION
├── All lemmas proven → status = "proved"
├── Counterexample found for theorem → status = "refuted"
├── Max iterations exhausted → status = "abandoned"
└── Partial progress → status = "in_progress" (for human gate)
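The inner PROVE → VERIFY → REFINE cycle above can be sketched as follows; prove, verify, and refine stand in for the LLM-backed agents, and the max_attempts cutoff is an illustrative stand-in for the THEORY_MAX_ITERATIONS handling:

```python
def prove_lemma(lemma, prove, verify, refine, max_attempts: int = 3):
    """Run one lemma through the prove/verify/refine cycle (sketch).

    prove(lemma, failed_history)         -> candidate proof text
    verify(lemma, proof)                 -> (ok, error_message)
    refine(proof, error, failed_history) -> revised proof
    """
    failed_history: list[str] = []          # shown to agents to avoid repetition
    proof = prove(lemma, failed_history)
    for _ in range(max_attempts):
        ok, error = verify(lemma, proof)
        if ok:
            return "proven", proof          # DAG advancement unlocks dependents
        failed_history.append(error)        # stagnation detection keys off repeats
        proof = refine(proof, error, failed_history)
    return "open", proof                    # remains an open goal
```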
11.3 Skill Injection — Context-Aware Knowledge Loading
The skill system provides domain-specific proof strategies to agents at runtime:
Skill Injection Pipeline
│
├── 1. REGISTRATION (startup)
│ ├── Load seed skills: eurekaclaw/skills/seed_skills/**/*.md
│ ├── Load domain skills: domains/<domain>/skills/*.md
│ └── Load user skills: ~/.eurekaclaw/skills/**/*.md
│ Priority: user > domain > seed (higher overrides lower)
│
├── 2. SELECTION (per agent call)
│ ├── SkillInjector.top_k(task, role, k=5, strategy)
│ ├── Strategies:
│ │ ├── "tag" — match skill tags to task keywords
│ │ ├── "semantic" — embedding similarity to task description
│ │ └── "hybrid" — combine tag and semantic scores
│ └── Also accepts manual selection via InputSpec.selected_skills
│
├── 3. RENDERING (prompt construction)
│ ├── SkillInjector.render_for_prompt(skills)
│ └── Output format:
│ <skills>
│ <skill name="ucb_regret_analysis">
│ # UCB Regret Analysis
│ When bounding UCB1 regret, decompose into:
│ 1. Suboptimal arm pulls where confidence bound held
│ 2. Pulls where the bound failed
│ ...
│ </skill>
│ </skills>
│
├── 4. INJECTION (agent system prompt)
│ └── Selected skills appended to agent's system prompt
│
└── 5. TRACKING (post-execution)
├── SkillRegistry.update_stats()
├── EMA α=0.3 update on success_rate for each injected skill
└── Usage count incremented
Skill Metadata Schema
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class SkillMeta(BaseModel):
    name: str
    version: str = "1.0"
    tags: list[str] = []
    agent_roles: list[str] = []      # e.g., ["theory", "survey"]
    pipeline_stages: list[str] = []  # e.g., ["theory", "experiment"]
    description: str = ""
    source: Literal["seed", "distilled", "manual"] = "seed"
    created_at: datetime
    usage_count: int = 0
    success_rate: float | None = None
11.4 Formal Verification — Lean4 Integration
EurekaClaw optionally integrates with Lean4 for formal proof verification:
Proof Verification Flow
│
├── LLM-Generated Proof
│ └── Natural language + mathematical notation
│
├── Lean4 Translation (if available)
│ ├── Convert to Lean4 syntax
│ └── Submit to Lean4VerifyTool
│
├── Lean4VerifyTool.call()
│ ├── Success:
│ │ └── {"verified": true, "theorem": "my_theorem",
│ │ "message": "Proof checked successfully"}
│ │
│ └── Failure:
│ └── {"verified": false, "lean4_output": "error: ...",
│ "message": "Verification failed"}
│
└── Integration with Theory Loop
├── If verified → high confidence, mark proven
├── If failed → Lean4 error fed to Refiner for repair
└── Lean4 errors are more specific than LLM peer review
11.5 Paper Generation — LaTeX with Theorem Environments
The WriterAgent generates camera-ready LaTeX papers from the assembled proof state:
Writer Pipeline
│
├── Input (from KnowledgeBus):
│ ├── TheoryState (proof, lemmas, theorem statement)
│ ├── ResearchBrief (motivation, related work, gap analysis)
│ ├── Bibliography (citations)
│ └── ExperimentResult (if available)
│
├── LaTeX Generation:
│ ├── Title and abstract from conjecture + results
│ ├── Introduction from research brief
│ ├── Related work from bibliography summaries
│ ├── Main results section:
│ │ ├── Theorem environment for formal statement
│ │ ├── Proof environment with assembled proof
│ │ └── Lemma environments for supporting results
│ ├── Experiments section (if experiment results available)
│ ├── Conclusion from proof outcomes
│ └── Bibliography from citation list
│
└── Output:
├── paper.tex (full LaTeX source)
└── paper.pdf (if LaTeX compiler available)
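The main-results step assembles standard LaTeX theorem environments, emitting supporting lemmas before the theorem they feed, matching the dependency order. A minimal sketch with a hypothetical function name:

```python
def render_theorem_section(theorem: str, proof: str,
                           lemmas: list[tuple[str, str]]) -> str:
    """Render lemmas as (statement, proof) pairs, then the main theorem."""
    parts: list[str] = []
    for statement, lemma_proof in lemmas:   # dependency order: lemmas first
        parts.append(f"\\begin{{lemma}}\n{statement}\n\\end{{lemma}}")
        parts.append(f"\\begin{{proof}}\n{lemma_proof}\n\\end{{proof}}")
    parts.append(f"\\begin{{theorem}}\n{theorem}\n\\end{{theorem}}")
    parts.append(f"\\begin{{proof}}\n{proof}\n\\end{{proof}}")
    return "\n\n".join(parts)
```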
11.6 Scientist-Bench Evaluation
The evaluation framework scores the quality of research outputs:
Evaluation Pipeline
│
├── Formal Correctness (weight: 0.35)
│ ├── Primary: Lean4 verification (if available)
│ └── Fallback: LLM peer review (structured critique)
│
├── Novelty (weight: 0.25)
│ ├── Compute embedding of theorem statement
│ ├── Compare to embeddings of known_results from literature
│ └── Score = mean distance from nearest known results
│
├── Experimental Alignment (weight: 0.15)
│ ├── Compare theoretical bounds to numerical experiments
│ └── Score = degree of alignment between theory and experiments
│
├── Proof Depth (weight: 0.15)
│ ├── Count lemmas in lemma_dag
│ └── Score = normalized lemma count (deeper = better)
│
├── Citation Coverage (weight: 0.10)
│ ├── Check bibliography completeness
│ └── Score = fraction of key papers cited
│
└── Composite Score = Σ(weight_i × score_i)
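The composite is a plain weighted sum using the weights listed above (the dimension keys below are shorthand):

```python
# Scientist-Bench weights as documented above.
WEIGHTS = {
    "formal_correctness": 0.35,
    "novelty": 0.25,
    "experimental_alignment": 0.15,
    "proof_depth": 0.15,
    "citation_coverage": 0.10,
}

def composite(scores: dict[str, float]) -> float:
    """Composite score = Σ(weight_i × score_i), each score in [0, 1]."""
    return sum(w * scores[dim] for dim, w in WEIGHTS.items())
```

Since the weights sum to 1.0, the composite also stays in [0, 1].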
12 Programming Language
Technology Stack
| Layer | Language/Framework | Purpose |
|---|---|---|
| Core engine | Python ≥ 3.11 | Agents, orchestration, tools, memory, skills |
| Type system | Pydantic v2 | Data models (BaseModel for all artifacts) |
| CLI | Python (click/argparse) | Command-line interface |
| Browser UI | React + TypeScript | Interactive research interface |
| Frontend build | Vite | Development server and production builds |
| Skills | Markdown (with YAML frontmatter) | Domain knowledge documents |
| Configuration | .env files | API keys and settings |
| Benchmarks | JSON | Domain-specific evaluation problems |
| Build system | Makefile | Frontend build, dev server, type checking |
Python Architecture Patterns
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Literal

from pydantic import BaseModel

# ABC pattern for agents
class BaseAgent(ABC):
    @abstractmethod
    async def execute(self, *args: Any, **kwargs: Any) -> Any: ...

# ABC pattern for tools
class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs: Any) -> str: ...

# ABC pattern for domain plugins
class DomainPlugin(ABC):
    @abstractmethod
    def register_tools(self, registry: "ToolRegistry") -> None: ...

    @abstractmethod
    def get_workflow_hint(self) -> str: ...

# Pydantic for all data models
class TheoryState(BaseModel):
    informal_statement: str
    formal_statement: str
    status: Literal["pending", "in_progress", "proved", "refuted", "abandoned"]
    ...

class SkillRecord(BaseModel):
    meta: SkillMeta
    content: str
    embedding: list[float] | None = None
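To make the BaseTool contract concrete, here is a hypothetical tool. The ABC is restated so the sketch is self-contained; EchoTool and its behaviour are illustrative, and to_anthropic_tool_def is assumed to assemble the Anthropic tool-use definition from the class attributes:

```python
import json
from abc import ABC, abstractmethod
from typing import Any, ClassVar

class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs: Any) -> str: ...

    def to_anthropic_tool_def(self) -> dict:
        # Shape expected by the Anthropic Messages API "tools" parameter.
        return {"name": self.name,
                "description": self.description,
                "input_schema": self.input_schema()}

# Hypothetical concrete tool; echoing stands in for a real API call.
class EchoTool(BaseTool):
    name = "echo"
    description = "Return the query unchanged (stand-in for a real API)."

    def input_schema(self) -> dict:
        return {"type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]}

    async def call(self, **kwargs: Any) -> str:
        return json.dumps({"result": kwargs["query"]})
```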
Frontend Architecture
frontend/
├── src/
│ ├── components/
│ │ ├── AgentTrack ← Live pipeline stage visualization
│ │ ├── ProofSketch ← Proof DAG visualization
│ │ ├── LemmaChips ← Interactive lemma status chips
│ │ ├── GuidanceTextarea ← Human guidance input for gates
│ │ └── SkillsManager ← Skill registry browser
│ ├── hooks/ ← React hooks for state management
│ └── App.tsx ← Main application component
├── package.json
└── tsconfig.json
Build Commands
| Command | Purpose |
|---|---|
| make open | Production build + open browser (port 8080) |
| make start | Production build + serve (no browser) |
| make dev | Hot-reload development (Vite :5173 + Python :7860) |
| make build | TypeScript compile + Vite build |
| make typecheck | TypeScript type checking only |
Testing Infrastructure
# Unit tests (no API key needed)
pytest tests/unit/ -v
# Integration tests (requires API key)
ANTHROPIC_API_KEY=sk-... pytest tests/integration/ -v
# Frontend type checking
make typecheck
Code Quality Assessment
| Indicator | Assessment |
|---|---|
| Type safety | Strong — Pydantic models for all data, ABC patterns for abstractions |
| Async | Full async/await throughout agent pipeline |
| Testing | Unit + integration test separation; unit tests offline-capable |
| Documentation | Comprehensive — dedicated docs site with architecture, API, and user guides |
| Extensibility | High — plugin system for domains, skill system for knowledge |
13 Memory Management
Four-Tier Memory Architecture
EurekaClaw implements the most sophisticated memory system among autoresearch tools, with four distinct tiers:
┌─────────────────────────────────────────────────────────────────┐
│ EurekaClaw Memory Architecture │
│ │
│ Tier 1: EPISODIC (in-RAM ring buffer) │
│ ├── Scope: Within single session │
│ ├── Lifetime: Session only (not persisted) │
│ ├── Content: Agent events with role, content, metadata │
│ ├── Access: memory.log_event() / memory.recent_events() │
│ └── Purpose: Short-term context for multi-turn agent reasoning │
│ │
│ Tier 2: PERSISTENT (cross-run JSON file) │
│ ├── Scope: Across all sessions │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memory/persistent.json) │
│ ├── Content: Key-value records with tags and source session │
│ ├── Access: memory.remember() / memory.recall() │
│ └── Purpose: Successful strategies, common patterns, preferences│
│ │
│ Tier 3: KNOWLEDGE GRAPH (theorem dependency network) │
│ ├── Scope: Across all sessions │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memory/knowledge_graph.json) │
│ ├── Content: KnowledgeNode objects linked by "uses" relations │
│ ├── Access: memory.add_theorem() / memory.find_related_theorems() │
│ └── Purpose: Theorem provenance, dependency tracking, reuse │
│ │
│ Tier 4: DOMAIN INSIGHTS (per-domain markdown files) │
│ ├── Scope: Per research domain │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memories/<domain>/) │
│ ├── Content: LLM-generated analysis of proof successes/failures │
│ ├── Access: memory.load_for_injection(domain, k=4) │
│ └── Purpose: Domain-specific proof strategies for future sessions│
│ │
└─────────────────────────────────────────────────────────────────┘
Tier 1: Episodic Memory
from datetime import datetime

from pydantic import BaseModel

class EpisodicEntry(BaseModel):
    entry_id: str
    session_id: str
    agent_role: str       # "survey", "theory", "writer", etc.
    content: str          # free-text event description
    metadata: dict = {}   # structured data (tool name, paper_id, etc.)
    timestamp: datetime
- Implementation: In-RAM ring buffer with fixed capacity
- Write path: BaseAgent.execute() → memory.log_event(agent_role, content, metadata)
- Read path: memory.recent_events(n=20, agent_role=None)
- Persistence: None — lost when session ends
- Purpose: Enables agents to access recent events from other agents within the same session (e.g., the Prover seeing the Verifier's latest error message)
Tier 2: Persistent Memory
from datetime import datetime
from typing import Any

from pydantic import BaseModel

class CrossRunRecord(BaseModel):
    record_id: str
    key: str              # namespaced, e.g., "theory.failed_strategies.sample_complexity"
    value: Any            # arbitrary JSON-serializable value
    tags: list[str] = []
    source_session: str = ""
    created_at: datetime
    updated_at: datetime
- Storage: ~/.eurekaclaw/memory/persistent.json
- Write path: memory.remember(key, value, tags, source_session)
- Read path: memory.recall(key) or memory.recall_by_tag(tag)
- Purpose: Cross-session key-value store for strategies that worked or failed
Tier 3: Knowledge Graph
from datetime import datetime

from pydantic import BaseModel

class KnowledgeNode(BaseModel):
    node_id: str
    theorem_name: str
    formal_statement: str
    domain: str = ""
    session_id: str = ""
    tags: list[str] = []
    created_at: datetime
- Storage: ~/.eurekaclaw/memory/knowledge_graph.json
- Write path: memory.add_theorem(name, statement, domain, session_id)
- Link path: memory.link_theorems(from_id, to_id, relation="uses")
- Read path: memory.find_related_theorems(node_id, depth=2)
- Purpose: Tracks proven theorems and their dependency relationships. Enables the system to:
- Recognize when a new conjecture relates to previously proven results
- Reuse proven lemmas as building blocks
- Understand the dependency structure of accumulated knowledge
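find_related_theorems is naturally a bounded breadth-first walk over "uses" edges. A sketch under the assumption that the graph is held as an adjacency dict of node_ids (the repo's internal representation may differ):

```python
from collections import deque

def find_related_theorems(edges: dict[str, list[str]],
                          start: str, depth: int = 2) -> set[str]:
    """Return node_ids reachable from start within `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:                      # do not expand beyond the bound
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {start}
```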
Tier 4: Domain Insights
- Storage: ~/.eurekaclaw/memories/<domain>/YYYYMMDD_<slug>.md
- Index: ~/.eurekaclaw/memories/<domain>/_index.json (semantic search index)
- Write path: Post-session extraction by SessionMemoryExtractor
- Read path: memory.load_for_injection(domain, k=4, query=task_description)
- Purpose: LLM-generated insights about proof strategies, extracted from completed sessions. These are injected into agent system prompts at session start, giving the system accumulated domain expertise.
Memory Data Flow
During session:
BaseAgent.execute()
└── memory.log_event() ──────────────────────────► Tier 1 (RAM only)
After session (ContinualLearningLoop.post_run()):
SessionMemoryExtractor.extract_and_save()
└── LLM analysis of TheoryState
└── write YYYYMMDD_<slug>.md ─────────────────► Tier 4 (domain insights)
ToolPatternExtractor.extract_and_save()
└── analyse tool-call patterns
└── generate new Skill files ─────────────────► Skill system
SkillRegistry.update_stats()
└── EMA α=0.3 update on success_rate ─────────────► Skill metadata
Next session startup:
MetaOrchestrator
└── MemoryManager.load_for_injection(domain)
└── top-4 Tier 4 files ───────────────────────► Agent system prompts
Filesystem Layout
~/.eurekaclaw/
├── memory/
│ ├── persistent.json ← Tier 2: cross-run key-value store
│ └── knowledge_graph.json ← Tier 3: theorem dependency graph
├── memories/
│ └── <domain>/
│ ├── YYYYMMDD_<slug>.md ← Tier 4: per-domain insight files
│ └── _index.json ← Tier 4: index for semantic search
└── skills/ ← Skill files updated by ContinualLearningLoop
Comparison to Other Systems
| System | Tiers | Episodic | Persistent | Knowledge Graph | Domain Insights |
|---|---|---|---|---|---|
| EurekaClaw | 4 | RAM ring buffer | JSON key-value | Theorem DAG | LLM-extracted .md |
| AutoResearchClaw | 3 | Session state | MetaClaw skills | No | MetaClaw cross-run |
| K-Dense BYOK | 1 | Conversation | No | No | No |
| AIRA₂ | 2 | Session | Tournament | No | No |
| Google Co-Scientist | 2 | Session | Tournament ranking | No | No |
14 Continued Learning
Continual Learning Pipeline
EurekaClaw implements a post-session learning loop that automatically distills proof strategies and domain insights from completed sessions:
Session Completion
│
└── ContinualLearningLoop.post_run()
│
├── 1. SESSION MEMORY EXTRACTION (Tier 4)
│ ├── SessionMemoryExtractor.extract_and_save()
│ ├── Input: TheoryState (proven lemmas, failed attempts, counterexamples)
│ ├── Process: LLM analysis of what worked and what didn't
│ └── Output: ~/.eurekaclaw/memories/<domain>/YYYYMMDD_<slug>.md
│
├── 2. TOOL PATTERN EXTRACTION
│ ├── ToolPatternExtractor.extract_and_save()
│ ├── Input: Tool-call sequences from session
│ ├── Process: LLM analysis of effective tool usage patterns
│ └── Output: New tool-usage skills
│
├── 3. SKILL DISTILLATION
│ ├── SkillEvolver.distill_from_session(failures, successes)
│ ├── Input:
│ │ ├── FailedAttempt[] from TheoryState (deduplicated)
│ │ └── Successful proofs (trimmed to 300 chars)
│ ├── Process:
│ │ ├── LLM call (fast model, max_tokens=1024)
│ │ ├── _parse_skill_response()
│ │ └── SkillRegistry.upsert()
│ └── Output: ~/.eurekaclaw/skills/<name>.md
│
└── 4. SKILL STATISTICS UPDATE
├── SkillRegistry.update_stats()
├── EMA α=0.3 update on success_rate for injected skills
└── Usage count incremented
Skill Distillation Mechanics
The SkillEvolver is the key learning component. It transforms raw session experience into reusable skill documents:
Input to SkillEvolver
├── failures: FailedAttempt[]
│ ├── Deduplicated (only unique failure patterns)
│ └── Each contains: attempted proof, error message, context
├── successes: ProofRecord[]
│ └── Compressed (proof text trimmed to 300 chars)
│
LLM Distillation (fast model)
├── Prompt: "Given these proof attempts, what reusable strategy can be extracted?"
├── Output format: Markdown with YAML frontmatter (skill format)
└── Constraints: max_tokens=1024 (keeps skills concise)
│
Skill Registration
├── _parse_skill_response() → SkillRecord
├── SkillRegistry.upsert() → writes to ~/.eurekaclaw/skills/<name>.md
└── Available for injection in future sessions
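Skill files are Markdown with YAML frontmatter, so the parsing half of _parse_skill_response can be sketched for the flat key: value subset shown in this report (a simplification; the actual parser may use a YAML library):

```python
def parse_skill_file(text: str) -> tuple[dict, str]:
    """Split '---'-delimited frontmatter from the Markdown body (flat keys only)."""
    _, _, rest = text.partition("---\n")      # drop the opening delimiter
    head, _, body = rest.partition("\n---\n") # frontmatter vs. Markdown body
    meta: dict[str, str] = {}
    for line in head.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.strip()
```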
Skill Lifecycle
Created
│
┌───────┼───────┐
│ │ │
seed distilled manual
│ │ │
└───────┼───────┘
│
Registered
(SkillRegistry)
│
Selected for session
(SkillInjector.top_k())
│
Injected into agent prompt
│
Used (or not) during proving
│
┌───────┴───────┐
│ │
Contributed Not effective
to success │
│ (no update)
│
success_rate updated
(EMA α=0.3)
│
Higher priority in future selection
Skill Sources and Priority
Priority (high to low):
│
├── 3. User skills: ~/.eurekaclaw/skills/**/*.md
│ ├── Manual: user-written
│ └── Distilled: system-generated from sessions
│
├── 2. Domain plugin skills: domains/<domain>/skills/*.md
│ └── Plugin-specific strategies
│
└── 1. Seed skills: eurekaclaw/skills/seed_skills/**/*.md
└── Built-in proof strategies (shipped with system)
ClawHub — Community Skill Sharing
EurekaClaw supports installing skills from external authors:
# Install from ClawHub
eurekaclaw install-skills steipete/github
# Install built-in seeds
eurekaclaw install-skills
eurekaclaw install-skills --force # overwrite existing
This enables a community learning loop: researchers who develop effective proof strategies can package them as skills and share them. Combined with automatic distillation, this yields two learning pathways:
- Individual learning: Each user's system accumulates distilled skills from their own sessions
- Community learning: Effective skills are shared via ClawHub for collective benefit
Memory-Guided Pipeline
When THEORY_PIPELINE=memory_guided, the system actively uses accumulated memory during proof construction:
memory_guided Pipeline Differences
│
├── Session startup:
│ └── MetaOrchestrator loads top-4 Tier 4 insights
│ → injected into all agent system prompts
│
├── Proof planning:
│ └── Formalizer checks Tier 3 knowledge graph
│ → reuses proven lemmas when applicable
│
├── Proof attempt:
│ └── Prover checks Tier 2 persistent memory
│ → avoids strategies that previously failed
│
└── Result:
└── Proof attempts guided by accumulated experience
→ fewer wasted iterations on known-bad approaches
Learning Comparison
| System | Learning Type | What is Learned | Persistence | Sharing |
|---|---|---|---|---|
| EurekaClaw | Post-session distillation | Proof strategies, tool patterns | Permanent files | ClawHub |
| AutoResearchClaw | MetaClaw cross-run | Research strategies from failures | Persistent skills | Within team |
| EvoScientist | Evolutionary | Hypothesis quality (fitness) | Population state | No |
| K-Dense BYOK | None | N/A | N/A | Community workflows |
| AI Scientist | None | N/A | N/A | N/A |
EurekaClaw's learning system is the most comprehensive among autoresearch tools, combining individual session-level distillation with community-level skill sharing. The 4-tier memory architecture ensures that insights are captured at appropriate granularity levels and persist across sessions.
15 Applications
Primary Application: Automated Mathematical Research
EurekaClaw is purpose-built for mathematical theory research, with a complete pipeline from literature survey to published paper:
| Phase | Capability | Output |
|---|---|---|
| Literature Survey | Crawl arXiv + Semantic Scholar, summarize, identify gaps | ResearchBrief with scored directions |
| Hypothesis Generation | Generate novel conjectures from identified gaps | Formal theorem statement |
| Proof Construction | Bottom-up lemma proving with verification loop | Assembled proof |
| Formal Verification | Optional Lean4 checking | Verified/failed with detailed errors |
| Numerical Validation | Experiment design and execution | Numerical evidence for bounds |
| Paper Writing | Camera-ready LaTeX generation | Complete paper with theorems, proofs, citations |
| Quality Assessment | Scientist-Bench evaluation | Multi-dimensional quality score |
Research Domain Applications
| Domain | Available Support | Maturity |
|---|---|---|
| Multi-Armed Bandits | Full plugin (tools, skills, envs, benchmarks) | High — reference implementation |
| ML Theory | General theory pipeline | Medium — demonstrated in examples |
| Attention Mechanisms | From-papers mode demonstrated | Medium — via Level 2 input |
| Concentration Inequalities | Seed skills available | Medium — via skill injection |
| General Mathematics | Core pipeline applicable | Low — no domain plugin |
| Custom Domains | Plugin architecture available | Framework — requires user implementation |
Target Users
| User | Use Case | EurekaClaw Mode |
|---|---|---|
| Mathematics PhD students | Explore open problems in their field | Level 3: Explore |
| ML theory researchers | Extend results from specific papers | Level 2: From Papers |
| Proof assistants | Verify and formalize specific conjectures | Level 1: Prove |
| Research labs | Systematic exploration of theory landscape | Level 3 + batch processing |
| Course instructors | Generate exercise proofs for teaching | Level 1 with known results |
Integration Scenarios
Scenario 1: PhD Research Acceleration
Researcher identifies gap in bandit literature
→ eurekaclaw explore "finite-time regret bounds for contextual bandits"
→ System surveys 50+ papers, identifies 3 promising directions
→ Selects direction with highest composite score
→ Generates and proves novel theorem
→ Produces LaTeX paper draft
→ Researcher reviews, refines, submits
Scenario 2: Paper Extension
Researcher reads two key papers
→ eurekaclaw from-papers 1706.03762 2005.14165 --domain "attention"
→ System identifies gaps between the papers
→ Generates conjecture combining insights
→ Proves conjecture (or identifies counterexample)
→ Writes paper extending both works
Scenario 3: Proof Verification
Researcher has a conjecture with informal proof sketch
→ eurekaclaw prove "Statement..." --domain "ML theory"
→ System formalizes and verifies via proof pipeline
→ Optional Lean4 formal verification
→ Identifies gaps in informal proof
→ Completes or refutes the conjecture
Limitations for Research Applications
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM proof reliability | Generated proofs may contain subtle errors | Lean4 verification, human review |
| Domain coverage | Only MAB has full plugin support | Plugin architecture for custom domains |
| Novelty assessment | Embedding-based novelty may miss subtle innovations | Human judgment on selected directions |
| Experiment runner | Under development — not fully functional | Manual numerical validation |
| Formal verification | Lean4 translation is imperfect | LLM peer review as fallback |
| Paper quality | Generated papers need substantial human editing | Use as draft, not final submission |
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Deepest theorem-proving capability among autoresearch systems | Narrow domain focus (mathematical theory only) |
| 4-tier memory with continual learning | Single-model architecture limits optimization |
| Formal verification pathway via Lean4 | Experiment runner still under development |
| Plugin architecture for domain extensibility | Only one domain plugin (MAB) shipped |
| Community skill sharing via ClawHub | No independent benchmarks or evaluations |
| Both CLI and browser UI | Beta maturity |
Future Roadmap (Inferred)
Based on documentation and "under development" markers:
| Feature | Status | Impact |
|---|---|---|
| Experiment Runner | Under development | Complete the theory-experiment loop |
| Additional domain plugins | Architecture ready | Expand beyond MAB |
| Process Reward Model | Referenced in code | RL-style training on proof quality |
| MADMAX mode | Referenced in code | Advanced proof search strategies |
| Lean4 integration hardening | Ongoing | More reliable formal verification |