EurekaClaw
Multi-agent AI research assistant that autonomously crawls literature, generates and stress-tests mathematical hypotheses, proves theorems via a 7-stage bottom-up pipeline, runs numerical experiments, and writes camera-ready LaTeX papers — with continual learning that distills proof strategies into reusable skills across sessions.

Organization: Multi-institutional team (led by Quanquan Gu et al.)
Published: 2026 (Apache 2.0)
Type: repo (GitHub: EurekaClaw/EurekaClaw)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
EurekaClaw: An AI Agent for Capturing Eureka Moments
- Repository: github.com/EurekaClaw/EurekaClaw
- Documentation: eurekaclaw.github.io
- License: Apache 2.0
- Stars: ~631 (as of April 2026)
- Tagline: "The AI that catches your Eureka moments. Crawls arXiv · Generates theorems · Proves lemmas · Writes LaTeX papers · Runs experiments"
- Default model: claude-sonnet-4-6 (Claude as primary reasoning engine)
- Input modes: Three escalating levels of autonomy — from precise conjecture proving to open-ended domain exploration
Naming and Branding
The name "EurekaClaw" combines the discovery exclamation "Eureka!" with "Claw" — a reference to the lobster emoji (🦞) used throughout the project's branding and CLI output. The metaphor suggests the system's ability to "grasp" and hold onto breakthrough moments in mathematical reasoning. The project shares the "Claw" branding lineage with related systems (AutoResearchClaw, MetaClaw, OpenClaw, Dr. Claw, ScienceClaw), suggesting a broader ecosystem of "Claw"-branded AI research tools.
Lineage and Acknowledgments
EurekaClaw explicitly acknowledges inspiration from several predecessor systems:
| System | Influence |
|---|---|
| MetaClaw (AIMING Lab) | Multi-agent research orchestration patterns |
| AutoResearchClaw (AIMING Lab) | Automated research pipeline architecture |
| EvoScientist | Evolutionary hypothesis generation |
| AI-Researcher (HKUDS) | Automated research pipeline design |
| Dr. Claw (OpenLAIR) | Open research agent framework |
| OpenClaw | Open-source research agent infrastructure |
| ClawTeam (HKUDS) | Collaborative research agent patterns |
| ScienceClaw | Science-focused agent design |
Unique Position in the Ecosystem
While AutoResearchClaw and MetaClaw focus on end-to-end paper generation across general research domains, EurekaClaw specializes in mathematical theory — theorem statement, proof construction, formal verification, and theoretical paper writing. This makes it the most depth-focused system in the "Claw" family.
Claw Family Positioning
│
├── AutoResearchClaw — breadth: any research domain, 23-stage paper pipeline
├── MetaClaw — meta: cross-run learning, skill extraction
├── OpenClaw — platform: chat-based research assistant
├── ClawTeam — collaboration: multi-agent team coordination
├── ScienceClaw — domain: science-focused agent framework
├── Dr. Claw — framework: open research agent base
└── EurekaClaw — depth: mathematical theorem proving + paper writing ← this system
2 Authors and Team
| Author | Role (Inferred) |
|---|---|
| Xuheng Li | Lead developer / architect |
| Qiwei Di | Core contributor |
| Chenggong Zhang | Core contributor |
| Kaixuan Ji | Core contributor |
| Qingyue Zhao | Core contributor |
| Yifeng Liu | Core contributor |
| Shiyuan Zhang | Core contributor |
| Quanquan Gu | Principal investigator / senior author |
BibTeX Citation
@misc{eurekaclaw2026,
title = {EurekaClaw: An AI Agent for Capturing Eureka Moments},
author = {Li, Xuheng and Di, Qiwei and Zhang, Chenggong and Ji, Kaixuan
and Zhao, Qingyue and Liu, Yifeng and Zhang, Shiyuan and Gu, Quanquan},
year = {2026},
url = {https://github.com/EurekaClaw/EurekaClaw}
}
Team composition: Academic research group with 8 contributors. The team size and structure suggest a focused research lab project, likely from a strong ML/AI theory group (given the emphasis on mathematical proof, concentration inequalities, regret bounds, and multi-armed bandits as the showcased domain). Quanquan Gu is a recognized researcher in ML theory and bandits, consistent with the MAB domain plugin being the first and most developed domain.
Institutional context: Unlike the AIMING Lab's AutoResearchClaw (16 contributors across 5 universities), EurekaClaw appears to originate from a single lab, enabling tighter architectural coherence and deeper domain specialization.
3 Core Contribution
EurekaClaw's core contribution is a complete autonomous pipeline for mathematical research that uniquely combines:
- Literature-grounded hypothesis generation — crawling arXiv and Semantic Scholar to identify research gaps
- Bottom-up theorem proving — a 7-stage pipeline that builds proofs from lemmas up to main theorems
- Formal verification integration — optional Lean4 proof checking for mathematical rigor
- Continual skill learning — automatic distillation of proof strategies into reusable skills across sessions
- Domain plugin extensibility — pluggable research domain support with domain-specific tools, skills, and benchmarks
The Seven Pipeline Stages
Stage 1: SURVEY
│ Agents: SurveyOrchestrator, PaperFetcher, Summarizer, GapAnalyst, DirectionProposer
│ Output: ResearchBrief (directions scored by novelty × soundness × transformative potential)
│
Stage 2: FORMALIZE
│ Agents: Formalizer
│ Output: TheoryState.formal_statement (LaTeX theorem), proof plan, lemma DAG skeleton
│
Stage 3: THEORY (iterative)
│ Agents: Prover, Verifier, Refiner, CounterexampleSearcher
│ Output: Proven lemmas, assembled proof (or refutation)
│ Loop: prover → verifier → (fail?) refiner → repeat; stagnation → force refine
│
Stage 4: EXPERIMENT
│ Agents: ExperimentDesigner, ExperimentRunner
│ Output: Numerical validation of theoretical bounds
│ Status: Under development
│
Stage 5: WRITE
│ Agents: WriterAgent
│ Output: Camera-ready LaTeX paper with theorem environments and citations
│
Stage 6: EVALUATE
│ Tool: Scientist-Bench
│ Output: Multi-dimensional quality score (correctness, novelty, depth, alignment, citations)
│
Stage 7: LEARN
│ Agents: ContinualLearningLoop, SkillEvolver, SessionMemoryExtractor
│ Output: New skill files, updated memory tiers, knowledge graph entries
Five Differentiating Capabilities
| Capability | Description | Comparison |
|---|---|---|
| Bottom-up proof construction | Builds proofs from atomic lemmas via a DAG, not top-down decomposition | Unique among autoresearch systems; most use top-down planning |
| 7-stage theory pipeline with verification | Prover → Verifier → Refiner loop with counterexample search | AutoResearchClaw's theory support is part of general experiment, not dedicated |
| Formal verification (Lean4) | Optional Lean4 proof checking for mathematical rigor | No other "Claw" system integrates formal verification |
| 4-tier memory system | Episodic → persistent → knowledge graph → domain insights | Most sophisticated memory of any Claw-family system |
| Domain plugin architecture | Pluggable research domains with tools, skills, workflows, and benchmarks | Enables specialization without forking the core system |
Comparison to Related Systems
| System | Focus | Pipeline | Proving | Memory | Learning |
|---|---|---|---|---|---|
| EurekaClaw | Mathematical theory | 7-stage with verification loop | Yes (Lean4) | 4-tier | Skill distillation |
| AutoResearchClaw | General research | 23-stage end-to-end | No | MetaClaw | Cross-run skills |
| AI Scientist (Sakana) | ML experiments | Idea → experiment → paper | No | None | None |
| Google AI Co-Scientist | Hypothesis generation | Multi-agent debate | No | Tournament | Selection |
| K-Dense BYOK | Multi-discipline assistant | Chat → expert delegation | No | Session | None |
| AIRA₂ (Meta) | STEM research | 15+ specialized agents | No | Session | None |
EurekaClaw occupies a distinctive niche: it is the only system that combines deep mathematical proving capability with continual learning and a formal verification pathway. While narrower in domain than systems like K-Dense BYOK or AutoResearchClaw, it goes deeper in its target domain than any competitor.
4 Supported Solutions
Three Input Modes
EurekaClaw provides three escalating levels of research autonomy:
| Command | Level | Autonomy | Input | Output |
|---|---|---|---|---|
| eurekaclaw prove "<statement>" | 1 | Lowest | Precise mathematical statement | Proof + LaTeX paper |
| eurekaclaw from-papers <arxiv_ids> | 2 | Medium | Specific papers to extend | Gap analysis + new theorem + proof + paper |
| eurekaclaw explore "<domain>" | 3 | Highest | Broad research area | Literature survey + conjecture + proof + paper |
Level 1: Prove (Conjecture → Proof)
eurekaclaw prove "The sample complexity of transformers is O(L·d·log(d)/ε²)" \
--domain "ML theory" --output ./results
The system:
1. Surveys literature for related results on the conjecture
2. Formalizes the statement into a LaTeX theorem environment
3. Decomposes it into a lemma DAG
4. Proves each lemma bottom-up
5. Assembles the full proof
6. Writes a camera-ready paper
Level 2: From Papers (Papers → Novel Results)
eurekaclaw from-papers 1706.03762 2005.14165 --domain "attention mechanisms"
The system:
1. Fetches and summarizes the specified papers
2. Identifies gaps and extension opportunities
3. Generates a novel conjecture based on the gaps
4. Proceeds through the full prove pipeline
Level 3: Explore (Domain → Discovery)
eurekaclaw explore "multi-armed bandit theory"
The system:
1. Conducts a broad literature survey of the domain
2. Identifies open problems and promising directions
3. Scores directions by novelty, soundness, and transformative potential
4. Selects the most promising direction
5. Generates a conjecture
6. Proceeds through the full prove pipeline
Domain Plugin System
EurekaClaw supports pluggable research domains that provide specialized tools, skills, workflows, and benchmarks:
DomainPlugin (ABC)
│
├── name: str ← machine identifier
├── display_name: str ← human-readable name
├── keywords: list[str] ← auto-detection triggers
├── description: str
│
├── register_tools(registry) ← inject domain-specific tools
├── get_workflow_hint() → str ← research guidance for agent prompts
├── get_skills_dirs() → list[Path] ← extra skill directories
└── get_benchmark_problems(level) → list[dict] ← evaluation problems
MAB Domain Plugin (Reference Implementation)
The Multi-Armed Bandit domain is the first and most developed plugin:
domains/mab/
├── __init__.py MABDomainPlugin
│ ├── name = "mab"
│ ├── display_name = "Stochastic Multi-Armed Bandits"
│ └── keywords = ["bandit", "multi-armed", "mab", "ucb", "thompson",
│ "regret", "exploration", "exploitation"]
│
├── workflow.py WORKFLOW_HINT (research guidance text)
│
├── envs/ Simulation environments
│ ├── stochastic.py GaussianBandit, BernoulliBandit
│ └── runner.py run_experiment(), sweep_T()
│ (UCB1 & Thompson Sampling implementations)
│
├── tools/ LLM-callable domain tools
│ ├── concentration.py Hoeffding, Bernstein, sub-Gaussian, UCB radius
│ ├── regret.py Regret decomposition, Lai-Robbins lower bound
│ ├── information.py KL(Bernoulli), KL(Gaussian), Fano's inequality
│ └── bandit_tool.py BanditExperimentTool (experiment runner)
│
├── skills/ Domain-specific proof strategies
│ ├── ucb_regret_analysis.md
│ ├── thompson_sampling_analysis.md
│ ├── lower_bound_construction.md
│ └── bandit_simulation.md
│
└── benchmark/ Tiered evaluation problems
├── level1.json Reproduce: UCB1, Lai-Robbins (known bounds)
├── level2.json Refine: Bernstein-UCB, MOSS, KL-UCB
└── level3.json Open: heavy tails, infinite-arm, batched bandits
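The envs/ directory ships reference implementations of UCB1 and Thompson Sampling. As a minimal, self-contained sketch of the UCB1 logic such an environment exercises (function names here are illustrative, not the plugin's actual API):

```python
import math
import random

def ucb1_index(mean: float, pulls: int, t: int) -> float:
    # Empirical mean plus the Hoeffding-style exploration bonus sqrt(2 ln t / n).
    return mean + math.sqrt(2 * math.log(t) / pulls)

def run_ucb1(true_means: list[float], horizon: int, seed: int = 0) -> float:
    # Play UCB1 on a Bernoulli bandit; return total collected reward.
    rng = random.Random(seed)
    k = len(true_means)
    pulls, means, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            arm = max(range(k), key=lambda a: ucb1_index(means[a], pulls[a], t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean update
        total += reward
    return total
```

The sqrt(2 ln t / n) bonus is the same UCB radius that the plugin's concentration.py tool exposes to the LLM.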
Adding Custom Domains
import json
from pathlib import Path

from eurekaclaw.domains.base import DomainPlugin
from eurekaclaw.domains import register_domain
from eurekaclaw.tools.base import ToolRegistry

@register_domain
class MyDomainPlugin(DomainPlugin):
    name = "my_domain"
    display_name = "My Research Domain"
    keywords = ["keyword1", "keyword2"]

    def register_tools(self, registry: ToolRegistry) -> None:
        registry.register(MySpecialTool())

    def get_workflow_hint(self) -> str:
        return """When researching my_domain:
- Always start by checking known results X and Y
- Use technique Z for the main proof step"""

    def get_skills_dirs(self) -> list[Path]:
        return [Path(__file__).parent / "skills"]

    def get_benchmark_problems(self, level: str) -> list[dict]:
        bm_file = Path(__file__).parent / "benchmark" / f"{level}.json"
        return json.loads(bm_file.read_text()) if bm_file.exists() else []
Registration requires adding the module path to _DOMAIN_PACKAGES in the domain resolver. Domains are auto-detected from user input via keyword matching.
5 LLM Integration
Model Configuration
| Variable | Default | Purpose |
|---|---|---|
| ANTHROPIC_API_KEY | — | API key for Claude access |
| EUREKACLAW_MODEL | claude-sonnet-4-6 | Main reasoning model for all agents |
EurekaClaw is designed around Claude as the primary reasoning engine, unlike K-Dense BYOK's provider-agnostic approach. The system description states compatibility with "Every Major Model API" but is architecturally optimized for Claude's strengths in mathematical reasoning and instruction following.
Authentication Options
| Method | Provider | Use Case |
|---|---|---|
| API key (ANTHROPIC_API_KEY) | Anthropic | Standard programmatic access |
| OAuth (Claude Pro/Max subscription) | Anthropic | Personal use without API billing |
Agent-Model Architecture
MetaOrchestrator
│ Model: EUREKACLAW_MODEL (claude-sonnet-4-6)
│
├── SurveyOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── PaperFetcher ← tool calls (arXiv, Semantic Scholar)
│ ├── Summarizer ← LLM reasoning
│ ├── GapAnalyst ← LLM reasoning
│ └── DirectionProposer ← LLM reasoning + scoring
│
├── Formalizer
│ │ Model: EUREKACLAW_MODEL
│ └── Statement formalization, proof planning, lemma DAG construction
│
├── TheoryOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── Prover ← LLM reasoning (proof generation)
│ ├── Verifier ← LLM reasoning + Lean4 tool
│ ├── Refiner ← LLM reasoning (proof repair)
│ └── CounterexampleSearcher ← LLM reasoning (adversarial)
│
├── ExperimentOrchestrator
│ │ Model: EUREKACLAW_MODEL
│ ├── ExperimentDesigner ← LLM reasoning
│ └── ExperimentRunner ← Code execution tool
│
├── WriterAgent
│ │ Model: EUREKACLAW_MODEL
│ └── LaTeX paper generation with theorem environments
│
└── ContinualLearningLoop
│ Model: fast model (max_tokens=1024)
├── SessionMemoryExtractor ← LLM analysis of session
├── ToolPatternExtractor ← LLM analysis of tool usage
└── SkillEvolver ← LLM skill distillation
All agents share the same model configuration, creating architectural simplicity but limiting model-specific optimization per task. The continual learning loop uses a "fast model" for efficiency during post-session skill distillation.
Tool-LLM Interface
Tools are registered with the ToolRegistry and exposed to agents via the Anthropic tool definition format:
class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    def input_schema(self) -> dict: ...
    async def call(self, **kwargs) -> str: ...
    def to_anthropic_tool_def(self) -> dict: ...
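To make the interface concrete, here is a hypothetical minimal tool written against a stand-in of this ABC (the stub base class and EchoTool are illustrative sketches, not repository code):

```python
from abc import ABC, abstractmethod
from typing import ClassVar

class BaseTool(ABC):
    # Stand-in for EurekaClaw's BaseTool, mirroring the interface above.
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs) -> str: ...

    def to_anthropic_tool_def(self) -> dict:
        # Anthropic's tool definition shape: name, description, JSON-Schema input.
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": self.input_schema(),
        }

class EchoTool(BaseTool):
    name = "echo"
    description = "Return the input text unchanged (demo tool)."

    def input_schema(self) -> dict:
        return {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        }

    async def call(self, text: str = "", **kwargs) -> str:
        return text
```

An agent loop would pass `EchoTool().to_anthropic_tool_def()` in the API request and dispatch the model's tool-use blocks to `call()`.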
Built-in Tools
| Tool | Purpose | Output |
|---|---|---|
| ArxivSearchTool | Search arXiv for papers by query | List of papers with metadata |
| SemanticScholarTool | Search Semantic Scholar with citation counts | List of papers with venue/citation data |
| WebSearchTool | General web search | Search result snippets |
| Lean4VerifyTool | Formal proof verification via Lean4 | Verified/failed with output |
| WolframAlphaTool | Symbolic computation queries | Computation results |
| CodeExecutionTool | Python code execution in sandbox | stdout + stderr |
| Domain-specific tools | Per-plugin (e.g., BanditExperimentTool) | Domain-dependent |
6 Key Results
Scientist-Bench Evaluator
EurekaClaw includes an internal evaluation framework called Scientist-Bench that scores research outputs along five dimensions:
| Dimension | Weight | Measurement |
|---|---|---|
| Formal correctness | 0.35 | Lean4 formal verification or LLM peer review |
| Novelty | 0.25 | Embedding distance from known results in literature |
| Experimental alignment | 0.15 | Numerical experiments validate theoretical bounds |
| Proof depth | 0.15 | Lemma count (complexity of proof DAG) |
| Citation coverage | 0.10 | Completeness of related work citations |
eurekaclaw eval-session <session_id>
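Under these weights, the overall score is a weighted sum of the five dimensions. A sketch (the dimension keys are illustrative; the linear combination follows the weight column above):

```python
# Weights from the Scientist-Bench table above (they sum to 1.0).
WEIGHTS = {
    "formal_correctness": 0.35,
    "novelty": 0.25,
    "experimental_alignment": 0.15,
    "proof_depth": 0.15,
    "citation_coverage": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    # Linear combination of per-dimension scores, each assumed in [0, 1].
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```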
Evaluation Methodology
The evaluation is self-referential — the system evaluates its own outputs — which limits its reliability as a benchmark but provides useful signal for:
- Comparing runs with different configurations
- Identifying which domains/conjectures the system handles well
- Tracking improvement from skill accumulation over time
Benchmark Problems (MAB Domain)
The MAB domain plugin includes three levels of benchmark problems:
| Level | Difficulty | Examples | Purpose |
|---|---|---|---|
| Level 1 | Reproduce known | UCB1 regret bound, Lai-Robbins lower bound | Validate the system can reproduce textbook results |
| Level 2 | Refine existing | Bernstein-UCB, MOSS, KL-UCB | Test ability to improve on known techniques |
| Level 3 | Open problems | Heavy-tailed bandits, infinite-arm, batched | Probe frontier research capability |
Demonstrated Capabilities
Based on documentation and CLI examples:
| Capability | Example | Evidence Level |
|---|---|---|
| Literature crawling | "Found 23 relevant papers" from arXiv | CLI demo output |
| Hypothesis generation | "O(n log n) via topological filtration" | CLI demo output |
| Theorem drafting | "Theorem 3.1 drafted. LaTeX ready." | CLI demo output |
| Proof completion | "Proof complete." | CLI demo output |
| Paper generation | "Paper draft saved to ./results/" | CLI demo output |
| Skill distillation | UCB regret analysis skill | Seed skill documentation |
| Formal verification | Lean4 integration | Tool documentation |
Caveat: No independent benchmarks or peer-reviewed evaluations are available. All evidence comes from the project's own documentation and demonstrations.
Gate Modes
EurekaClaw supports three gate modes that control the level of human oversight:
| Mode | Behavior | Use Case |
|---|---|---|
| none | Fully autonomous — no pauses | Batch processing, overnight runs |
| auto | System pauses at critical junctures | Default — balanced autonomy |
| human | Human approval required at every gate | Maximum oversight for critical research |
7 Reproducibility
Installation Methods
| Method | Complexity | Platforms |
|---|---|---|
| One-line installer | Minimal | macOS, Linux |
| One-line installer | Minimal | Windows (PowerShell) |
| Manual with uv | Low | macOS, Linux (recommended) |
| Manual with pip | Low | macOS, Linux |
| Manual | Low | Windows |
Detailed Setup (uv method)
# Prerequisites: Python ≥ 3.11, Node.js ≥ 18, Git, uv
git clone https://github.com/EurekaClaw/EurekaClaw
cd EurekaClaw
uv venv --python 3.11 .venv
source .venv/bin/activate
uv pip install -e "."
cd frontend && npm install && cd ..
eurekaclaw install-skills # install built-in proof skills
eurekaclaw onboard # configure API key and settings
Configuration for Reproducibility
cp .env.example .env
| Variable | Default | Reproducibility Impact |
|---|---|---|
| EUREKACLAW_MODEL | claude-sonnet-4-6 | High — different models produce different proofs |
| GATE_MODE | auto | Low — affects human interaction, not output quality |
| THEORY_PIPELINE | default | High — memory_guided injects prior session knowledge |
| OUTPUT_FORMAT | latex | Low — format only |
| EXPERIMENT_MODE | auto | Medium — controls numerical validation |
| THEORY_MAX_ITERATIONS | 10 | High — limits proof search depth |
Reproducibility Assessment
| Factor | Rating | Details |
|---|---|---|
| Installation | High | Multiple well-documented methods, automated setup |
| Configuration | High | Single .env file, clear defaults, interactive onboard |
| Determinism | Low | LLM non-determinism dominates; no temperature/seed controls documented |
| Session persistence | High | KnowledgeBus.persist() saves full session state as JSON artifacts |
| Artifact preservation | High | Theory state, bibliography, experiment results all serialized |
| Formal verification | Medium | Lean4 provides deterministic proof checking, but proof generation is stochastic |
Session Artifact Structure
Each run produces a structured artifact directory:
results/<session_id>/
├── theory_state.json ← Full proof state machine (lemma DAG, proofs, failures)
├── research_brief.json ← Literature survey findings, scored directions
├── bibliography.json ← All papers found during survey
├── experiment_result.json ← Numerical validation results
├── paper.tex ← Generated LaTeX paper
├── paper.pdf ← Compiled PDF (if LaTeX available)
└── eval_report.json ← Scientist-Bench evaluation scores
This artifact preservation enables partial reproducibility: while the generative process is stochastic, the outputs are fully captured and can be inspected, compared, and re-processed (e.g., re-running the writer agent on preserved theory state).
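A sketch of such re-processing: loading every JSON artifact from a session directory for inspection (load_session is a hypothetical helper, not a repository function):

```python
import json
from pathlib import Path

def load_session(session_dir: Path) -> dict:
    # Map artifact stem (e.g. "theory_state") to its parsed JSON content.
    return {p.stem: json.loads(p.read_text())
            for p in sorted(session_dir.glob("*.json"))}
```

From the returned dict one can, for example, diff `theory_state` across two runs or feed a preserved state back into a writer step.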
8 Compute and API Costs
Cost Model
Cost = Σ(agent_calls) × token_price(EUREKACLAW_MODEL)
+ Σ(tool_calls) × tool_cost(tool_type)
+ learning_calls × token_price(fast_model)
Per-Stage Cost Estimates
| Stage | Typical Token Usage | Dominant Cost Driver |
|---|---|---|
| Survey | 20K–100K tokens | Paper summarization and gap analysis |
| Formalize | 5K–20K tokens | Statement formalization |
| Theory | 50K–500K+ tokens | Iterative proving loop (up to THEORY_MAX_ITERATIONS) |
| Experiment | 10K–50K tokens | Experiment design + code generation |
| Write | 20K–80K tokens | Full paper generation |
| Evaluate | 5K–15K tokens | Quality assessment |
| Learn | 2K–10K tokens | Skill distillation (fast model) |
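Plugging rough midpoints of these ranges into the cost model above gives a back-of-envelope estimator (the token figures are illustrative midpoints, and the blended $/million-token price is an assumption the caller supplies):

```python
# Rough midpoints of the per-stage token ranges in the table above.
STAGE_TOKENS = {
    "survey": 60_000, "formalize": 12_000, "theory": 275_000,
    "experiment": 30_000, "write": 50_000, "evaluate": 10_000, "learn": 6_000,
}

def estimate_cost(price_per_mtok: float, stages: dict[str, int] = STAGE_TOKENS) -> float:
    # Total tokens across stages times a blended $/1M-token price.
    return sum(stages.values()) / 1_000_000 * price_per_mtok
```

At an assumed blended price of $10 per million tokens, this midpoint run lands in the single-digit dollar range, consistent with the comparison table below.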
Cost Scaling Factors
| Factor | Impact |
|---|---|
| Conjecture difficulty | Hard proofs → more theory iterations → more tokens |
| Literature breadth | Broad domains → more papers to summarize |
| Proof depth | Deep lemma DAGs → more prover/verifier cycles |
| Experiment complexity | Complex numerical validation → more code generation |
| THEORY_MAX_ITERATIONS | Direct multiplier on theory stage cost (default: 10) |
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | ≥ 3.11 | 3.11+ |
| Node.js | ≥ 18 | 18+ |
| RAM | 4 GB | 8+ GB |
| Storage | 2 GB | 5+ GB (with skills and results) |
| Network | Required | Stable broadband |
| GPU | Not required | Not required |
| Lean4 | Optional | For formal verification |
Cost Comparison
| System | Estimated Cost per Run | Primary Model |
|---|---|---|
| EurekaClaw (Level 1, prove) | $1–$10 | Claude Sonnet |
| EurekaClaw (Level 3, explore) | $5–$50+ | Claude Sonnet |
| AutoResearchClaw | $5–$30 | Configurable |
| AI Scientist (Sakana) | $10–$50+ | Claude/GPT-4 |
| K-Dense BYOK (per session) | $0.05–$5 | User-selected |
EurekaClaw's theory proving loop is the dominant cost factor. A difficult conjecture that requires maximum iterations with failed attempts and refinement can consume significantly more tokens than a straightforward proof.
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ EurekaClaw System │
│ │
│ ┌─────────────┐ ┌──────────────────────────────────────────────────┐ │
│ │ CLI Entry │ │ MetaOrchestrator │ │
│ │ Points │───►│ • Pipeline stage sequencing │ │
│ │ │ │ • KnowledgeBus management │ │
│ │ prove │ │ • Memory injection (top-4 domain insights) │ │
│ │ from-papers │ │ • Domain plugin resolution │ │
│ │ explore │ │ • Gate mode enforcement │ │
│ └─────────────┘ └───────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌──────────────▼──────────────────────────────────┐ │
│ │ Browser UI │ │ KnowledgeBus │ │
│ │ (React/TS) │───►│ Typed artifact store + pub/sub │ │
│ │ │ │ ┌─────────────┐ ┌────────────┐ ┌───────────┐ │ │
│ │ • Agent track│ │ │ResearchBrief│ │TheoryState │ │Bibliography│ │ │
│ │ • Proof view │ │ └─────────────┘ └────────────┘ └───────────┘ │ │
│ │ • Pause/ │ │ ┌────────────────┐ ┌──────────────┐ │ │
│ │ resume │ │ │ExperimentResult│ │TaskPipeline │ │ │
│ │ • Skills mgr │ │ └────────────────┘ └──────────────┘ │ │
│ └──────────────┘ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Pipeline Stages │ │
│ │ │ │
│ │ Survey ──► Formalize ──► Theory ──► Experiment ──► Write ──► │ │
│ │ ▲ │ Evaluate │ │
│ │ │ │ │ │ │
│ │ └──┘ Learn │ │
│ │ (iterative │ │
│ │ proof loop) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Tool Registry │ │Skill Registry│ │Memory Manager│ │ Domain │ │
│ │ │ │ │ │ │ │ Plugins │ │
│ │ arXiv │ │ Seed skills │ │ Episodic (T1)│ │ │ │
│ │ Sem.Scholar │ │ Distilled │ │ Persist (T2) │ │ MAB │ │
│ │ WebSearch │ │ Manual │ │ KG (T3) │ │ [custom] │ │
│ │ Lean4 │ │ Domain │ │ Domain (T4) │ │ │ │
│ │ Wolfram │ │ │ │ │ │ │ │
│ │ CodeExec │ │ Injector │ │ Extractor │ │ │ │
│ │ [domain] │ │ Evolver │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
KnowledgeBus — Central State Management
The KnowledgeBus is the architectural backbone — a typed artifact store with pub/sub that all pipeline stages read from and write to:
class KnowledgeBus:
    def __init__(self, session_id: str) -> None: ...

    # Typed artifact accessors
    def put_research_brief(self, brief: ResearchBrief) -> None: ...
    def get_research_brief(self) -> ResearchBrief | None: ...
    def put_theory_state(self, state: TheoryState) -> None: ...
    def get_theory_state(self) -> TheoryState | None: ...
    def put_experiment_result(self, result: ExperimentResult) -> None: ...
    def get_experiment_result(self) -> ExperimentResult | None: ...
    def put_bibliography(self, bib: Bibliography) -> None: ...
    def get_bibliography(self) -> Bibliography | None: ...
    def append_citations(self, papers: list[Paper]) -> None: ...
    def put_pipeline(self, pipeline: TaskPipeline) -> None: ...
    def get_pipeline(self) -> TaskPipeline | None: ...

    # Generic key-value store
    def put(self, key: str, value: Any) -> None: ...
    def get(self, key: str, default: Any = None) -> Any: ...

    # Reactive subscriptions
    def subscribe(self, artifact_type: str, callback: Callable) -> None: ...

    # Persistence
    def persist(self, session_dir: Path) -> None: ...

    @classmethod
    def load(cls, session_id: str, session_dir: Path) -> "KnowledgeBus": ...
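The store-plus-pub/sub pattern can be illustrated with a toy stand-in (MiniBus is a sketch of the pattern, not the repository's implementation):

```python
from collections import defaultdict
from typing import Any, Callable

class MiniBus:
    # Toy version of the KnowledgeBus pattern: key-value store + pub/sub.
    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value
        for cb in self._subs[key]:  # notify subscribers on every write
            cb(value)

    def get(self, key: str, default: Any = None) -> Any:
        return self._store.get(key, default)

    def subscribe(self, key: str, callback: Callable[[Any], None]) -> None:
        self._subs[key].append(callback)
```

Subscribers registered on an artifact key fire synchronously on each put, which is how downstream stages can react the moment a new TheoryState lands.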
TheoryState — Proof State Machine
The TheoryState is the most complex artifact, tracking the entire proof construction process:
TheoryState
├── informal_statement ← Plain-English conjecture
├── formal_statement ← LaTeX-formalized theorem
├── known_results[] ← KnownResult extracted from literature
├── research_gap ← GapAnalyst's finding
├── proof_plan[] ← ProofPlan with provenance (known/adapted/new)
├── lemma_dag{} ← LemmaNode graph with dependencies
├── proven_lemmas{} ← lemma_id → ProofRecord
├── open_goals[] ← Remaining lemma_ids to prove
├── failed_attempts[] ← FailedAttempt history (used for learning)
├── counterexamples[] ← Counterexample discoveries
├── assembled_proof ← Final combined proof text
└── status ← pending | in_progress | proved | refuted | abandoned
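Bottom-up construction means a lemma becomes an open goal only once its dependencies are proven. With the DAG stored as lemma_id → dependency list, Python's standard-library graphlib yields that ordering directly (the example DAG is illustrative):

```python
from graphlib import TopologicalSorter

def bottom_up_order(lemma_dag: dict[str, list[str]]) -> list[str]:
    # lemma_dag maps lemma_id -> the lemma_ids it depends on; the returned
    # order lists every lemma after all of its dependencies.
    return list(TopologicalSorter(lemma_dag).static_order())

# Illustrative DAG: the main theorem rests on two lemmas, one of which
# itself uses the concentration lemma.
EXAMPLE_DAG = {
    "main_theorem": ["lemma_decomposition", "lemma_concentration"],
    "lemma_decomposition": ["lemma_concentration"],
    "lemma_concentration": [],
}
```

TopologicalSorter also raises CycleError on a cyclic "DAG", which is a useful sanity check before the proof loop starts.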
ResearchBrief — Scored Research Directions
ResearchBrief
├── domain, query, conjecture
├── directions[] ← ResearchDirection objects
│ ├── novelty_score (0-1)
│ ├── soundness_score (0-1)
│ ├── transformative_score (0-1)
│ └── composite_score ← weighted average
├── selected_direction ← Chosen after convergence
└── open_problems[], key_mathematical_objects[]
10 Component Breakdown
Source Code Structure
eurekaclaw/
├── main.py ← EurekaSession, run_research(), save_artifacts()
├── cli/ ← CLI entry points (prove, from-papers, explore, ui)
│
├── orchestrator/ ← MetaOrchestrator (pipeline sequencing)
│
├── agents/ ← All agent implementations
│ ├── base.py ← BaseAgent ABC
│ ├── survey/ ← Survey pipeline agents
│ │ ├── orchestrator.py SurveyOrchestrator
│ │ ├── paper_fetcher.py PaperFetcher
│ │ ├── summarizer.py Summarizer
│ │ ├── gap_analyst.py GapAnalyst
│ │ └── direction.py DirectionProposer
│ ├── formalize/ ← Formalizer agent
│ ├── theory/ ← Theory pipeline agents
│ │ ├── orchestrator.py TheoryOrchestrator
│ │ ├── prover.py Prover
│ │ ├── verifier.py Verifier
│ │ ├── refiner.py Refiner
│ │ └── counterexample.py CounterexampleSearcher
│ ├── experiment/ ← Experiment pipeline agents
│ │ ├── designer.py ExperimentDesigner
│ │ └── runner.py ExperimentRunner
│ └── writer/ ← WriterAgent
│ └── agent.py
│
├── knowledge_bus/ ← KnowledgeBus (typed artifact store + pub/sub)
│ └── bus.py
│
├── types/ ← Pydantic data models
│ └── artifacts.py ← TheoryState, ResearchBrief, Bibliography, etc.
│
├── tools/ ← Tool implementations
│ ├── base.py ← BaseTool ABC, ToolRegistry
│ ├── arxiv.py ← ArxivSearchTool
│ ├── semantic_scholar.py ← SemanticScholarTool
│ ├── web_search.py ← WebSearchTool
│ ├── lean4.py ← Lean4VerifyTool
│ ├── wolfram.py ← WolframAlphaTool
│ └── code_exec.py ← CodeExecutionTool
│
├── skills/ ← Skill system
│ ├── registry.py ← SkillRegistry (load, search, upsert)
│ ├── injector.py ← SkillInjector (retrieve + format for prompts)
│ ├── install.py ← SkillInstaller (seed + ClawHub)
│ ├── evolver.py ← SkillEvolver (distill from session)
│ └── seed_skills/ ← Built-in skill markdown files
│ ├── theory/
│ ├── survey/
│ └── [domain]/
│
├── memory/ ← Memory system
│ ├── manager.py ← MemoryManager (unified interface)
│ ├── episodic.py ← EpisodicMemory (in-RAM ring buffer)
│ ├── persistent.py ← PersistentMemory (cross-run JSON)
│ └── knowledge_graph.py ← KnowledgeGraph (theorem dependency network)
│
├── learning/ ← Continual learning
│ └── memory_extractor.py ← SessionMemoryExtractor (Tier 4 insights)
│
├── domains/ ← Domain plugin system
│ ├── base.py ← DomainPlugin ABC, @register_domain
│ └── mab/ ← Multi-Armed Bandit domain
│ ├── __init__.py
│ ├── workflow.py
│ ├── envs/
│ ├── tools/
│ ├── skills/
│ └── benchmark/
│
├── evaluation/ ← Scientist-Bench evaluator
│
├── ui/ ← Static assets for browser UI
│ └── static/
│
└── frontend/ ← React + TypeScript browser UI
├── src/
├── package.json
└── tsconfig.json
Component Dependencies
CLI / Browser UI
│
MetaOrchestrator
/ | \
/ | \
SurveyOrch TheoryOrch WriterAgent
/ | \ / | \ |
Fetch Sum Gap Prove Verify LaTeX
| | | generation
Direction Refine Counter-
Proposer example
|
─────────────────────
│ KnowledgeBus │ ← all agents read/write
─────────────────────
/ | \
ToolReg SkillReg MemoryMgr
/ | \ | / | \
arXiv S2 Lean4 Skills T1 T2 T3/T4
|
DomainPlugins
Key Abstractions
| Abstraction | Type | Purpose |
|---|---|---|
| BaseAgent | ABC | Common agent interface (execute, tool access, memory logging) |
| BaseTool | ABC | Tool interface (input_schema, call, to_anthropic_tool_def) |
| DomainPlugin | ABC | Domain specialization interface |
| KnowledgeBus | Concrete | Typed artifact store with pub/sub |
| SkillRegistry | Concrete | Skill storage, search, and lifecycle |
| MemoryManager | Concrete | Unified interface across 4 memory tiers |
| ToolRegistry | Concrete | Tool registration and dispatch |
| EurekaSession | Concrete | Top-level session management |
11 Core Mechanisms (Detailed)
11.1 Survey Pipeline — Literature-Grounded Research
The survey pipeline implements a multi-agent literature review that produces scored research directions:
Survey Pipeline Flow
│
├── 1. PaperFetcher
│ ├── Input: domain string, conjecture (if available)
│ ├── Tools: ArxivSearchTool, SemanticScholarTool
│ ├── Process: Query construction → API calls → deduplication
│ └── Output: Raw paper list (titles, abstracts, citations, venues)
│
├── 2. Summarizer
│ ├── Input: Raw paper list
│ ├── Process: LLM summarization of each paper's contributions
│ └── Output: Structured summaries with key techniques and results
│
├── 3. GapAnalyst
│ ├── Input: Summarized papers + domain context
│ ├── Process: Cross-paper analysis to find unexplored combinations
│ └── Output: research_gap field in TheoryState
│
├── 4. DirectionProposer
│ ├── Input: Summaries + gaps + domain workflow hint
│ ├── Process: Generate candidate directions, score each
│ ├── Scoring:
│ │ ├── novelty_score (0-1): distance from existing work
│ │ ├── soundness_score (0-1): feasibility of proof approach
│ │ ├── transformative_score (0-1): potential impact
│ │ └── composite_score: weighted average
│ └── Output: ResearchBrief with selected_direction
│
└── 5. KnowledgeBus Update
├── put_research_brief(brief)
├── put_bibliography(bibliography)
└── Subscribers notified
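The DirectionProposer's scoring step reduces to a weighted average over the three sub-scores, with the highest-scoring candidate selected. A minimal sketch; the specific weights and the Direction container are illustrative assumptions, since the repo only documents "weighted average":

```python
from dataclasses import dataclass

@dataclass
class Direction:
    title: str
    novelty_score: float         # 0-1: distance from existing work
    soundness_score: float       # 0-1: feasibility of proof approach
    transformative_score: float  # 0-1: potential impact

# Weights are hypothetical, not the repo's actual values.
def composite_score(d: Direction, weights=(0.4, 0.35, 0.25)) -> float:
    w_n, w_s, w_t = weights
    return (w_n * d.novelty_score
            + w_s * d.soundness_score
            + w_t * d.transformative_score)

def select_direction(candidates: list[Direction]) -> Direction:
    """Pick the candidate direction with the highest composite score."""
    return max(candidates, key=composite_score)
```

With these weights a sound-but-modest direction can beat a flashier one: (0.6, 0.8, 0.7) scores 0.695 and wins over (0.9, 0.4, 0.5) at 0.625.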
11.2 Theory Pipeline — Bottom-Up Proof Construction
The theory pipeline is EurekaClaw's most distinctive mechanism — a bottom-up proof construction system that builds proofs from atomic lemmas:
Theory Pipeline State Machine
│
├── INITIALIZATION
│ ├── Formalizer receives conjecture + research brief
│ ├── Creates formal_statement (LaTeX theorem)
│ ├── Generates proof_plan[] with provenance:
│ │ ├── "known" — directly from literature (cite)
│ │ ├── "adapted" — modified from known results
│ │ └── "new" — novel contribution required
│ └── Constructs lemma_dag{} — dependency graph of lemmas
│
├── PROOF LOOP (up to THEORY_MAX_ITERATIONS)
│ │
│ ├── For each open_goal in lemma_dag (bottom-up order):
│ │ │
│ │ ├── PROVE
│ │ │ ├── Prover agent receives:
│ │ │ │ ├── Lemma statement
│ │ │ │ ├── Available proven lemmas (dependencies)
│ │ │ │ ├── Injected skills (top-k relevant)
│ │ │ │ ├── Failed attempts history (avoid repetition)
│ │ │ │ └── Domain workflow hint
│ │ │ └── Output: Candidate proof text
│ │ │
│ │ ├── VERIFY
│ │ │ ├── Verifier agent checks proof:
│ │ │ │ ├── Logical consistency
│ │ │ │ ├── Step completeness
│ │ │ │ ├── Assumption validity
│ │ │ │ └── Optional: Lean4 formal check
│ │ │ └── Output: verified (→ proven) or error message (→ refine)
│ │ │
│ │ ├── REFINE (if verification failed)
│ │ │ ├── Refiner receives: proof + error message + failed history
│ │ │ ├── Repairs specific issues identified by verifier
│ │ │ └── Output: Revised proof → back to VERIFY
│ │ │
│ │ └── COUNTEREXAMPLE SEARCH (parallel)
│ │ ├── CounterexampleSearcher runs adversarially
│ │ ├── Attempts to find inputs that violate the lemma
│ │ └── If found: lemma refuted → DAG restructuring
│ │
│ ├── STAGNATION DETECTION
│ │ ├── If same error appears N times
│ │ └── Force Refiner with broader repair strategy
│ │
│ └── DAG ADVANCEMENT
│ ├── Move proven lemma from open_goals to proven_lemmas
│ ├── Unlock dependent lemmas
│ └── Check if main theorem's dependencies are all proven
│
├── ASSEMBLY
│ ├── Combine proven lemmas into assembled_proof
│ ├── Order by dependency chain
│ └── Add connecting reasoning between lemmas
│
└── STATUS DETERMINATION
├── All lemmas proven → status = "proved"
├── Counterexample found for theorem → status = "refuted"
├── Max iterations exhausted → status = "abandoned"
└── Partial progress → status = "in_progress" (for human gate)
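The inner PROVE → VERIFY → REFINE cycle above can be sketched as follows; prove, verify, and refine stand in for the LLM-backed agents, and the max_attempts cutoff is an illustrative stand-in for the THEORY_MAX_ITERATIONS handling:

```python
def prove_lemma(lemma, prove, verify, refine, max_attempts: int = 3):
    """Run one lemma through the prove/verify/refine cycle (sketch).

    prove(lemma, failed_history)         -> candidate proof text
    verify(lemma, proof)                 -> (ok, error_message)
    refine(proof, error, failed_history) -> revised proof
    """
    failed_history: list[str] = []          # shown to agents to avoid repetition
    proof = prove(lemma, failed_history)
    for _ in range(max_attempts):
        ok, error = verify(lemma, proof)
        if ok:
            return "proven", proof          # DAG advancement unlocks dependents
        failed_history.append(error)        # stagnation detection keys off repeats
        proof = refine(proof, error, failed_history)
    return "open", proof                    # remains an open goal
```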
11.3 Skill Injection — Context-Aware Knowledge Loading
The skill system provides domain-specific proof strategies to agents at runtime:
Skill Injection Pipeline
│
├── 1. REGISTRATION (startup)
│ ├── Load seed skills: eurekaclaw/skills/seed_skills/**/*.md
│ ├── Load domain skills: domains/<domain>/skills/*.md
│ └── Load user skills: ~/.eurekaclaw/skills/**/*.md
│ Priority: user > domain > seed (higher overrides lower)
│
├── 2. SELECTION (per agent call)
│ ├── SkillInjector.top_k(task, role, k=5, strategy)
│ ├── Strategies:
│ │ ├── "tag" — match skill tags to task keywords
│ │ ├── "semantic" — embedding similarity to task description
│ │ └── "hybrid" — combine tag and semantic scores
│ └── Also accepts manual selection via InputSpec.selected_skills
│
├── 3. RENDERING (prompt construction)
│ ├── SkillInjector.render_for_prompt(skills)
│ └── Output format:
│ <skills>
│ <skill name="ucb_regret_analysis">
│ # UCB Regret Analysis
│ When bounding UCB1 regret, decompose into:
│ 1. Suboptimal arm pulls where confidence bound held
│ 2. Pulls where the bound failed
│ ...
│ </skill>
│ </skills>
│
├── 4. INJECTION (agent system prompt)
│ └── Selected skills appended to agent's system prompt
│
└── 5. TRACKING (post-execution)
├── SkillRegistry.update_stats()
├── EMA α=0.3 update on success_rate for each injected skill
└── Usage count incremented
Skill Metadata Schema
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class SkillMeta(BaseModel):
    name: str
    version: str = "1.0"
    tags: list[str] = []
    agent_roles: list[str] = []      # e.g., ["theory", "survey"]
    pipeline_stages: list[str] = []  # e.g., ["theory", "experiment"]
    description: str = ""
    source: Literal["seed", "distilled", "manual"] = "seed"
    created_at: datetime
    usage_count: int = 0
    success_rate: float | None = None
11.4 Formal Verification — Lean4 Integration
EurekaClaw optionally integrates with Lean4 for formal proof verification:
Proof Verification Flow
│
├── LLM-Generated Proof
│ └── Natural language + mathematical notation
│
├── Lean4 Translation (if available)
│ ├── Convert to Lean4 syntax
│ └── Submit to Lean4VerifyTool
│
├── Lean4VerifyTool.call()
│ ├── Success:
│ │ └── {"verified": true, "theorem": "my_theorem",
│ │ "message": "Proof checked successfully"}
│ │
│ └── Failure:
│ └── {"verified": false, "lean4_output": "error: ...",
│ "message": "Verification failed"}
│
└── Integration with Theory Loop
├── If verified → high confidence, mark proven
├── If failed → Lean4 error fed to Refiner for repair
└── Lean4 errors are more specific than LLM peer review
11.5 Paper Generation — LaTeX with Theorem Environments
The WriterAgent generates camera-ready LaTeX papers from the assembled proof state:
Writer Pipeline
│
├── Input (from KnowledgeBus):
│ ├── TheoryState (proof, lemmas, theorem statement)
│ ├── ResearchBrief (motivation, related work, gap analysis)
│ ├── Bibliography (citations)
│ └── ExperimentResult (if available)
│
├── LaTeX Generation:
│ ├── Title and abstract from conjecture + results
│ ├── Introduction from research brief
│ ├── Related work from bibliography summaries
│ ├── Main results section:
│ │ ├── Theorem environment for formal statement
│ │ ├── Proof environment with assembled proof
│ │ └── Lemma environments for supporting results
│ ├── Experiments section (if experiment results available)
│ ├── Conclusion from proof outcomes
│ └── Bibliography from citation list
│
└── Output:
├── paper.tex (full LaTeX source)
└── paper.pdf (if LaTeX compiler available)
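The main-results step assembles standard LaTeX theorem environments, emitting supporting lemmas before the theorem they feed, matching the dependency order. A minimal sketch with a hypothetical function name:

```python
def render_theorem_section(theorem: str, proof: str,
                           lemmas: list[tuple[str, str]]) -> str:
    """Render lemmas as (statement, proof) pairs, then the main theorem."""
    parts: list[str] = []
    for statement, lemma_proof in lemmas:   # dependency order: lemmas first
        parts.append(f"\\begin{{lemma}}\n{statement}\n\\end{{lemma}}")
        parts.append(f"\\begin{{proof}}\n{lemma_proof}\n\\end{{proof}}")
    parts.append(f"\\begin{{theorem}}\n{theorem}\n\\end{{theorem}}")
    parts.append(f"\\begin{{proof}}\n{proof}\n\\end{{proof}}")
    return "\n\n".join(parts)
```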
11.6 Scientist-Bench Evaluation
The evaluation framework scores the quality of research outputs:
Evaluation Pipeline
│
├── Formal Correctness (weight: 0.35)
│ ├── Primary: Lean4 verification (if available)
│ └── Fallback: LLM peer review (structured critique)
│
├── Novelty (weight: 0.25)
│ ├── Compute embedding of theorem statement
│ ├── Compare to embeddings of known_results from literature
│ └── Score = mean distance from nearest known results
│
├── Experimental Alignment (weight: 0.15)
│ ├── Compare theoretical bounds to numerical experiments
│ └── Score = degree of alignment between theory and experiments
│
├── Proof Depth (weight: 0.15)
│ ├── Count lemmas in lemma_dag
│ └── Score = normalized lemma count (deeper = better)
│
├── Citation Coverage (weight: 0.10)
│ ├── Check bibliography completeness
│ └── Score = fraction of key papers cited
│
└── Composite Score = Σ(weight_i × score_i)
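The composite is a plain weighted sum using the weights listed above (the dimension keys below are shorthand):

```python
# Scientist-Bench weights as documented above.
WEIGHTS = {
    "formal_correctness": 0.35,
    "novelty": 0.25,
    "experimental_alignment": 0.15,
    "proof_depth": 0.15,
    "citation_coverage": 0.10,
}

def composite(scores: dict[str, float]) -> float:
    """Composite score = Σ(weight_i × score_i), each score in [0, 1]."""
    return sum(w * scores[dim] for dim, w in WEIGHTS.items())
```

Since the weights sum to 1.0, the composite also stays in [0, 1].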
12 Programming Language
Technology Stack
| Layer | Language/Framework | Purpose |
|---|---|---|
| Core engine | Python ≥ 3.11 | Agents, orchestration, tools, memory, skills |
| Type system | Pydantic v2 | Data models (BaseModel for all artifacts) |
| CLI | Python (click/argparse) | Command-line interface |
| Browser UI | React + TypeScript | Interactive research interface |
| Frontend build | Vite | Development server and production builds |
| Skills | Markdown (with YAML frontmatter) | Domain knowledge documents |
| Configuration | .env files | API keys and settings |
| Benchmarks | JSON | Domain-specific evaluation problems |
| Build system | Makefile | Frontend build, dev server, type checking |
Python Architecture Patterns
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Literal

from pydantic import BaseModel

# ABC pattern for agents
class BaseAgent(ABC):
    @abstractmethod
    async def execute(self, *args: Any, **kwargs: Any) -> Any: ...

# ABC pattern for tools
class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs: Any) -> str: ...

# ABC pattern for domain plugins
class DomainPlugin(ABC):
    @abstractmethod
    def register_tools(self, registry: "ToolRegistry") -> None: ...

    @abstractmethod
    def get_workflow_hint(self) -> str: ...

# Pydantic for all data models
class TheoryState(BaseModel):
    informal_statement: str
    formal_statement: str
    status: Literal["pending", "in_progress", "proved", "refuted", "abandoned"]
    ...

class SkillRecord(BaseModel):
    meta: SkillMeta
    content: str
    embedding: list[float] | None = None
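To make the BaseTool contract concrete, here is a hypothetical tool. The ABC is restated so the sketch is self-contained; EchoTool and its behaviour are illustrative, and to_anthropic_tool_def is assumed to assemble the Anthropic tool-use definition from the class attributes:

```python
import json
from abc import ABC, abstractmethod
from typing import Any, ClassVar

class BaseTool(ABC):
    name: ClassVar[str]
    description: ClassVar[str]

    @abstractmethod
    def input_schema(self) -> dict: ...

    @abstractmethod
    async def call(self, **kwargs: Any) -> str: ...

    def to_anthropic_tool_def(self) -> dict:
        # Shape expected by the Anthropic Messages API "tools" parameter.
        return {"name": self.name,
                "description": self.description,
                "input_schema": self.input_schema()}

# Hypothetical concrete tool; echoing stands in for a real API call.
class EchoTool(BaseTool):
    name = "echo"
    description = "Return the query unchanged (stand-in for a real API)."

    def input_schema(self) -> dict:
        return {"type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]}

    async def call(self, **kwargs: Any) -> str:
        return json.dumps({"result": kwargs["query"]})
```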
Frontend Architecture
frontend/
├── src/
│ ├── components/
│ │ ├── AgentTrack ← Live pipeline stage visualization
│ │ ├── ProofSketch ← Proof DAG visualization
│ │ ├── LemmaChips ← Interactive lemma status chips
│ │ ├── GuidanceTextarea ← Human guidance input for gates
│ │ └── SkillsManager ← Skill registry browser
│ ├── hooks/ ← React hooks for state management
│ └── App.tsx ← Main application component
├── package.json
└── tsconfig.json
Build Commands
| Command | Purpose |
|---|---|
| make open | Production build + open browser (port 8080) |
| make start | Production build + serve (no browser) |
| make dev | Hot-reload development (Vite :5173 + Python :7860) |
| make build | TypeScript compile + Vite build |
| make typecheck | TypeScript type checking only |
Testing Infrastructure
# Unit tests (no API key needed)
pytest tests/unit/ -v
# Integration tests (requires API key)
ANTHROPIC_API_KEY=sk-... pytest tests/integration/ -v
# Frontend type checking
make typecheck
Code Quality Assessment
| Indicator | Assessment |
|---|---|
| Type safety | Strong — Pydantic models for all data, ABC patterns for abstractions |
| Async | Full async/await throughout agent pipeline |
| Testing | Unit + integration test separation; unit tests offline-capable |
| Documentation | Comprehensive — dedicated docs site with architecture, API, and user guides |
| Extensibility | High — plugin system for domains, skill system for knowledge |
13 Memory Management
Four-Tier Memory Architecture
EurekaClaw implements the most sophisticated memory system among autoresearch tools, with four distinct tiers:
┌─────────────────────────────────────────────────────────────────┐
│ EurekaClaw Memory Architecture │
│ │
│ Tier 1: EPISODIC (in-RAM ring buffer) │
│ ├── Scope: Within single session │
│ ├── Lifetime: Session only (not persisted) │
│ ├── Content: Agent events with role, content, metadata │
│ ├── Access: memory.log_event() / memory.recent_events() │
│ └── Purpose: Short-term context for multi-turn agent reasoning │
│ │
│ Tier 2: PERSISTENT (cross-run JSON file) │
│ ├── Scope: Across all sessions │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memory/persistent.json) │
│ ├── Content: Key-value records with tags and source session │
│ ├── Access: memory.remember() / memory.recall() │
│ └── Purpose: Successful strategies, common patterns, preferences│
│ │
│ Tier 3: KNOWLEDGE GRAPH (theorem dependency network) │
│ ├── Scope: Across all sessions │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memory/knowledge_graph.json) │
│ ├── Content: KnowledgeNode objects linked by "uses" relations │
│ ├── Access: memory.add_theorem() / memory.find_related_theorems() │
│ └── Purpose: Theorem provenance, dependency tracking, reuse │
│ │
│ Tier 4: DOMAIN INSIGHTS (per-domain markdown files) │
│ ├── Scope: Per research domain │
│ ├── Lifetime: Permanent (~/.eurekaclaw/memories/<domain>/) │
│ ├── Content: LLM-generated analysis of proof successes/failures │
│ ├── Access: memory.load_for_injection(domain, k=4) │
│ └── Purpose: Domain-specific proof strategies for future sessions│
│ │
└─────────────────────────────────────────────────────────────────┘
Tier 1: Episodic Memory
from datetime import datetime

from pydantic import BaseModel

class EpisodicEntry(BaseModel):
    entry_id: str
    session_id: str
    agent_role: str       # "survey", "theory", "writer", etc.
    content: str          # free-text event description
    metadata: dict = {}   # structured data (tool name, paper_id, etc.)
    timestamp: datetime
- Implementation: In-RAM ring buffer with fixed capacity
- Write path: BaseAgent.execute() → memory.log_event(agent_role, content, metadata)
- Read path: memory.recent_events(n=20, agent_role=None)
- Persistence: None — lost when session ends
- Purpose: Enables agents to access recent events from other agents within the same session (e.g., the Prover seeing the Verifier's latest error message)
Tier 2: Persistent Memory
from datetime import datetime
from typing import Any

from pydantic import BaseModel

class CrossRunRecord(BaseModel):
    record_id: str
    key: str              # namespaced, e.g., "theory.failed_strategies.sample_complexity"
    value: Any            # arbitrary JSON-serializable value
    tags: list[str] = []
    source_session: str = ""
    created_at: datetime
    updated_at: datetime
- Storage: ~/.eurekaclaw/memory/persistent.json
- Write path: memory.remember(key, value, tags, source_session)
- Read path: memory.recall(key) or memory.recall_by_tag(tag)
- Purpose: Cross-session key-value store for strategies that worked or failed
Tier 3: Knowledge Graph
from datetime import datetime

from pydantic import BaseModel

class KnowledgeNode(BaseModel):
    node_id: str
    theorem_name: str
    formal_statement: str
    domain: str = ""
    session_id: str = ""
    tags: list[str] = []
    created_at: datetime
- Storage: ~/.eurekaclaw/memory/knowledge_graph.json
- Write path: memory.add_theorem(name, statement, domain, session_id)
- Link path: memory.link_theorems(from_id, to_id, relation="uses")
- Read path: memory.find_related_theorems(node_id, depth=2)
- Purpose: Tracks proven theorems and their dependency relationships. Enables the system to:
- Recognize when a new conjecture relates to previously proven results
- Reuse proven lemmas as building blocks
- Understand the dependency structure of accumulated knowledge
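find_related_theorems is naturally a bounded breadth-first walk over "uses" edges. A sketch under the assumption that the graph is held as an adjacency dict of node_ids (the repo's internal representation may differ):

```python
from collections import deque

def find_related_theorems(edges: dict[str, list[str]],
                          start: str, depth: int = 2) -> set[str]:
    """Return node_ids reachable from start within `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:                      # do not expand beyond the bound
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {start}
```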
Tier 4: Domain Insights
- Storage: ~/.eurekaclaw/memories/<domain>/YYYYMMDD_<slug>.md
- Index: ~/.eurekaclaw/memories/<domain>/_index.json (semantic search index)
- Write path: Post-session extraction by SessionMemoryExtractor
- Read path: memory.load_for_injection(domain, k=4, query=task_description)
- Purpose: LLM-generated insights about proof strategies, extracted from completed sessions. These are injected into agent system prompts at session start, giving the system accumulated domain expertise.
Memory Data Flow
During session:
BaseAgent.execute()
└── memory.log_event() ──────────────────────────► Tier 1 (RAM only)
After session (ContinualLearningLoop.post_run()):
SessionMemoryExtractor.extract_and_save()
└── LLM analysis of TheoryState
└── write YYYYMMDD_<slug>.md ─────────────────► Tier 4 (domain insights)
ToolPatternExtractor.extract_and_save()
└── analyse tool-call patterns
└── generate new Skill files ─────────────────► Skill system
SkillRegistry.update_stats()
└── EMA α=0.3 update on success_rate ─────────────► Skill metadata
Next session startup:
MetaOrchestrator
└── MemoryManager.load_for_injection(domain)
└── top-4 Tier 4 files ───────────────────────► Agent system prompts
Filesystem Layout
~/.eurekaclaw/
├── memory/
│ ├── persistent.json ← Tier 2: cross-run key-value store
│ └── knowledge_graph.json ← Tier 3: theorem dependency graph
├── memories/
│ └── <domain>/
│ ├── YYYYMMDD_<slug>.md ← Tier 4: per-domain insight files
│ └── _index.json ← Tier 4: index for semantic search
└── skills/ ← Skill files updated by ContinualLearningLoop
Comparison to Other Systems
| System | Tiers | Episodic | Persistent | Knowledge Graph | Domain Insights |
|---|---|---|---|---|---|
| EurekaClaw | 4 | RAM ring buffer | JSON key-value | Theorem DAG | LLM-extracted .md |
| AutoResearchClaw | 3 | Session state | MetaClaw skills | No | MetaClaw cross-run |
| K-Dense BYOK | 1 | Conversation | No | No | No |
| AIRA₂ | 2 | Session | Tournament | No | No |
| Google Co-Scientist | 2 | Session | Tournament ranking | No | No |
14 Continued Learning
Continual Learning Pipeline
EurekaClaw implements a post-session learning loop that automatically distills proof strategies and domain insights from completed sessions:
Session Completion
│
└── ContinualLearningLoop.post_run()
│
├── 1. SESSION MEMORY EXTRACTION (Tier 4)
│ ├── SessionMemoryExtractor.extract_and_save()
│ ├── Input: TheoryState (proven lemmas, failed attempts, counterexamples)
│ ├── Process: LLM analysis of what worked and what didn't
│ └── Output: ~/.eurekaclaw/memories/<domain>/YYYYMMDD_<slug>.md
│
├── 2. TOOL PATTERN EXTRACTION
│ ├── ToolPatternExtractor.extract_and_save()
│ ├── Input: Tool-call sequences from session
│ ├── Process: LLM analysis of effective tool usage patterns
│ └── Output: New tool-usage skills
│
├── 3. SKILL DISTILLATION
│ ├── SkillEvolver.distill_from_session(failures, successes)
│ ├── Input:
│ │ ├── FailedAttempt[] from TheoryState (deduplicated)
│ │ └── Successful proofs (trimmed to 300 chars)
│ ├── Process:
│ │ ├── LLM call (fast model, max_tokens=1024)
│ │ ├── _parse_skill_response()
│ │ └── SkillRegistry.upsert()
│ └── Output: ~/.eurekaclaw/skills/<name>.md
│
└── 4. SKILL STATISTICS UPDATE
├── SkillRegistry.update_stats()
├── EMA α=0.3 update on success_rate for injected skills
└── Usage count incremented
Skill Distillation Mechanics
The SkillEvolver is the key learning component. It transforms raw session experience into reusable skill documents:
Input to SkillEvolver
├── failures: FailedAttempt[]
│ ├── Deduplicated (only unique failure patterns)
│ └── Each contains: attempted proof, error message, context
├── successes: ProofRecord[]
│ └── Compressed (proof text trimmed to 300 chars)
│
LLM Distillation (fast model)
├── Prompt: "Given these proof attempts, what reusable strategy can be extracted?"
├── Output format: Markdown with YAML frontmatter (skill format)
└── Constraints: max_tokens=1024 (keeps skills concise)
│
Skill Registration
├── _parse_skill_response() → SkillRecord
├── SkillRegistry.upsert() → writes to ~/.eurekaclaw/skills/<name>.md
└── Available for injection in future sessions
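Skill files are Markdown with YAML frontmatter, so the parsing half of _parse_skill_response can be sketched for the flat key: value subset shown in this report (a simplification; the actual parser may use a YAML library):

```python
def parse_skill_file(text: str) -> tuple[dict, str]:
    """Split '---'-delimited frontmatter from the Markdown body (flat keys only)."""
    _, _, rest = text.partition("---\n")      # drop the opening delimiter
    head, _, body = rest.partition("\n---\n") # frontmatter vs. Markdown body
    meta: dict[str, str] = {}
    for line in head.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.strip()
```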
Skill Lifecycle
Created
│
┌───────┼───────┐
│ │ │
seed distilled manual
│ │ │
└───────┼───────┘
│
Registered
(SkillRegistry)
│
Selected for session
(SkillInjector.top_k())
│
Injected into agent prompt
│
Used (or not) during proving
│
┌───────┴───────┐
│ │
Contributed Not effective
to success │
│ (no update)
│
success_rate updated
(EMA α=0.3)
│
Higher priority in future selection
Skill Sources and Priority
Priority (high to low):
│
├── 3. User skills: ~/.eurekaclaw/skills/**/*.md
│ ├── Manual: user-written
│ └── Distilled: system-generated from sessions
│
├── 2. Domain plugin skills: domains/<domain>/skills/*.md
│ └── Plugin-specific strategies
│
└── 1. Seed skills: eurekaclaw/skills/seed_skills/**/*.md
└── Built-in proof strategies (shipped with system)
ClawHub — Community Skill Sharing
EurekaClaw supports installing skills from external authors:
# Install from ClawHub
eurekaclaw install-skills steipete/github
# Install built-in seeds
eurekaclaw install-skills
eurekaclaw install-skills --force # overwrite existing
This enables a community learning loop: researchers who develop effective proof strategies can package them as skills and share them. Combined with automatic distillation, this yields two learning pathways:
- Individual learning: Each user's system accumulates distilled skills from their own sessions
- Community learning: Effective skills are shared via ClawHub for collective benefit
Memory-Guided Pipeline
When THEORY_PIPELINE=memory_guided, the system actively uses accumulated memory during proof construction:
memory_guided Pipeline Differences
│
├── Session startup:
│ └── MetaOrchestrator loads top-4 Tier 4 insights
│ → injected into all agent system prompts
│
├── Proof planning:
│ └── Formalizer checks Tier 3 knowledge graph
│ → reuses proven lemmas when applicable
│
├── Proof attempt:
│ └── Prover checks Tier 2 persistent memory
│ → avoids strategies that previously failed
│
└── Result:
└── Proof attempts guided by accumulated experience
→ fewer wasted iterations on known-bad approaches
Learning Comparison
| System | Learning Type | What is Learned | Persistence | Sharing |
|---|---|---|---|---|
| EurekaClaw | Post-session distillation | Proof strategies, tool patterns | Permanent files | ClawHub |
| AutoResearchClaw | MetaClaw cross-run | Research strategies from failures | Persistent skills | Within team |
| EvoScientist | Evolutionary | Hypothesis quality (fitness) | Population state | No |
| K-Dense BYOK | None | N/A | N/A | Community workflows |
| AI Scientist | None | N/A | N/A | N/A |
EurekaClaw's learning system is the most comprehensive among autoresearch tools, combining individual session-level distillation with community-level skill sharing. The 4-tier memory architecture ensures that insights are captured at appropriate granularity levels and persist across sessions.
15 Applications
Primary Application: Automated Mathematical Research
EurekaClaw is purpose-built for mathematical theory research, with a complete pipeline from literature survey to published paper:
| Phase | Capability | Output |
|---|---|---|
| Literature Survey | Crawl arXiv + Semantic Scholar, summarize, identify gaps | ResearchBrief with scored directions |
| Hypothesis Generation | Generate novel conjectures from identified gaps | Formal theorem statement |
| Proof Construction | Bottom-up lemma proving with verification loop | Assembled proof |
| Formal Verification | Optional Lean4 checking | Verified/failed with detailed errors |
| Numerical Validation | Experiment design and execution | Numerical evidence for bounds |
| Paper Writing | Camera-ready LaTeX generation | Complete paper with theorems, proofs, citations |
| Quality Assessment | Scientist-Bench evaluation | Multi-dimensional quality score |
Research Domain Applications
| Domain | Available Support | Maturity |
|---|---|---|
| Multi-Armed Bandits | Full plugin (tools, skills, envs, benchmarks) | High — reference implementation |
| ML Theory | General theory pipeline | Medium — demonstrated in examples |
| Attention Mechanisms | From-papers mode demonstrated | Medium — via Level 2 input |
| Concentration Inequalities | Seed skills available | Medium — via skill injection |
| General Mathematics | Core pipeline applicable | Low — no domain plugin |
| Custom Domains | Plugin architecture available | Framework — requires user implementation |
Target Users
| User | Use Case | EurekaClaw Mode |
|---|---|---|
| Mathematics PhD students | Explore open problems in their field | Level 3: Explore |
| ML theory researchers | Extend results from specific papers | Level 2: From Papers |
| Proof assistants | Verify and formalize specific conjectures | Level 1: Prove |
| Research labs | Systematic exploration of theory landscape | Level 3 + batch processing |
| Course instructors | Generate exercise proofs for teaching | Level 1 with known results |
Integration Scenarios
Scenario 1: PhD Research Acceleration
Researcher identifies gap in bandit literature
→ eurekaclaw explore "finite-time regret bounds for contextual bandits"
→ System surveys 50+ papers, identifies 3 promising directions
→ Selects direction with highest composite score
→ Generates and proves novel theorem
→ Produces LaTeX paper draft
→ Researcher reviews, refines, submits
Scenario 2: Paper Extension
Researcher reads two key papers
→ eurekaclaw from-papers 1706.03762 2005.14165 --domain "attention"
→ System identifies gaps between the papers
→ Generates conjecture combining insights
→ Proves conjecture (or identifies counterexample)
→ Writes paper extending both works
Scenario 3: Proof Verification
Researcher has a conjecture with informal proof sketch
→ eurekaclaw prove "Statement..." --domain "ML theory"
→ System formalizes and verifies via proof pipeline
→ Optional Lean4 formal verification
→ Identifies gaps in informal proof
→ Completes or refutes the conjecture
Limitations for Research Applications
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM proof reliability | Generated proofs may contain subtle errors | Lean4 verification, human review |
| Domain coverage | Only MAB has full plugin support | Plugin architecture for custom domains |
| Novelty assessment | Embedding-based novelty may miss subtle innovations | Human judgment on selected directions |
| Experiment runner | Under development — not fully functional | Manual numerical validation |
| Formal verification | Lean4 translation is imperfect | LLM peer review as fallback |
| Paper quality | Generated papers need substantial human editing | Use as draft, not final submission |
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Deepest theorem-proving capability among autoresearch systems | Narrow domain focus (mathematical theory only) |
| 4-tier memory with continual learning | Single-model architecture limits optimization |
| Formal verification pathway via Lean4 | Experiment runner still under development |
| Plugin architecture for domain extensibility | Only one domain plugin (MAB) shipped |
| Community skill sharing via ClawHub | No independent benchmarks or evaluations |
| Both CLI and browser UI | Beta maturity |
Future Roadmap (Inferred)
Based on documentation and "under development" markers:
| Feature | Status | Impact |
|---|---|---|
| Experiment Runner | Under development | Complete the theory-experiment loop |
| Additional domain plugins | Architecture ready | Expand beyond MAB |
| Process Reward Model | Referenced in code | RL-style training on proof quality |
| MADMAX mode | Referenced in code | Advanced proof search strategies |
| Lean4 integration hardening | Ongoing | More reliable formal verification |