OpenProver
Automated theorem prover using a planner-worker agentic architecture with frontier LLMs, inspired by Google DeepMind's Aletheia.

Organization: Independent (Matěj Kripner / Charles University, Prague)
Published: February 2025 (initial); v1.0.1, March 2026 (current)
Type: Open-source repository + PyPI package
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: OpenProver — Automated Theorem Prover Inspired by Aletheia; Claude Code for Mathematicians
Repository: github.com/Kripner/openprover
PyPI Package: openprover v1.0.1
Tagline: "Theorem prover powered by language models."
Lineage: Directly inspired by Google DeepMind's Aletheia — the fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6/10 open research problems from the FirstProof challenge (February 2025). OpenProver replicates and open-sources the planner-worker-whiteboard architecture that Aletheia demonstrated at the frontier of automated mathematical reasoning.
Related Work by Author: Matěj Kripner is also a contributor to LeanTree — a white-box approach to proof search that factorizes complex proof states in Lean 4 into simpler branches for more efficient automated theorem proving (published at UFAL, Charles University, Prague).
Citation:
@misc{openprover,
author = {Matěj Kripner},
title = {OpenProver},
year = {2025},
publisher = {GitHub},
url = {https://github.com/kripner/openprover}
}
2 Authors and Team
Primary Author: Matěj Kripner (kripner@ufal.mff.cuni.cz)
Affiliation: Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Team Size: Solo developer with 388 commits on the master branch (as of April 2026). Community contributions via 11 forks.
Background: Kripner's work sits at the intersection of formal methods and machine learning applied to mathematics. His LeanTree research addresses the complementary problem of making proof search more efficient through state factorization, while OpenProver tackles the higher-level orchestration of LLM-guided proving.
Repository Activity:

| Metric | Value |
|---|---|
| Commits | 388 |
| Stars | 67+ |
| Forks | 11 |
| Primary Language | Python (92.9%) |
| Secondary Languages | JavaScript (4.2%), CSS (2.4%) |
| License | Not explicitly specified |
3 Core Contribution
Key Novelty: OpenProver is the first open-source implementation of the planner-worker-whiteboard architecture for automated theorem proving that Aletheia demonstrated at Google DeepMind. It bridges the gap between closed frontier proof agents and the open research community by providing a fully inspectable, resumable, multi-LLM orchestration framework for both informal and formal mathematical reasoning.
What Makes OpenProver Distinctive
- Planner-Worker Separation with Persistent State: Unlike single-agent proving systems that maintain one long context window, OpenProver separates the high-level strategic reasoning (planner) from focused execution (workers). The planner maintains a whiteboard and repository across steps, while workers are stateless and parallelizable. This mirrors how human mathematicians work — a strategist coordinates specialists.
- Whiteboard as External Mathematical Memory: The whiteboard is a terse LaTeX scratchpad that the planner reads and updates every step. This creates an external memory mechanism that persists across context window boundaries, allowing proof strategies to evolve over hours without losing coherence.
- Repository as Structured Knowledge Base: The repo/ directory stores named items — lemmas, observations, literature findings, failed approaches — as individual markdown files. Items are cross-referenced via [[wikilink]] syntax, creating a navigable knowledge graph that both planner and workers can query.
- Formal + Informal Dual-Track Proving: OpenProver operates in three modes — pure informal proving (PROOF.md), prove-and-formalize (PROOF.md + PROOF.lean), and formalize-only (given an existing informal proof, produce verified Lean 4 code). This acknowledges the reality that most mathematical reasoning happens informally first.
- Multi-Provider LLM Support: Unlike closed systems locked to one model provider, OpenProver supports Claude (via Claude Code CLI), Mistral's Leanstral (specialized for Lean), and any OpenAI-compatible model via vLLM. Different models can be assigned to planner vs. worker roles.
- Full Disk Persistence and Resumability: Every run creates a structured directory with all state on disk. Runs can be interrupted (Ctrl-C, machine crash) and resumed from the last completed step. This is critical for multi-hour proving sessions with expensive API calls.
Relationship to Prior Systems
| System | Year | Architecture | Formal Verification | Open Source |
|---|---|---|---|---|
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Extraction + verification protocols | No (closed) |
| LeanCopilot | 2023 | Single-agent tactic suggestion | Lean 4 via LeanDojo | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Lean via LeanDojo | Partial |
| ReProver | 2023 | Encoder retrieval + generation | Lean 4 | Yes |
| OpenProver | 2025 | Planner-worker-whiteboard | Lean 4 via lake env lean + MCP tools | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Lean 4 via LeanDojo/CLI/LSP | Yes |
4 Supported Solutions
OpenProver targets mathematical proofs at varying levels of formality and difficulty:
| Solution Type | Description | Example |
|---|---|---|
| Informal Proof | Natural-language mathematical proofs written as structured markdown | Irrationality of √2, infinitely many primes |
| Formal Lean 4 Proof | Machine-checked proofs compiled against Mathlib | Cauchy-Schwarz inequality in Lean 4 |
| Formalization | Converting an existing informal proof to verified Lean 4 | Given PROOF.md → produce PROOF.lean |
| Literature Survey | Web-enabled search for relevant mathematical results and papers | Finding prior work on Erdős problem 838 |
| Proof Strategy | Whiteboard-based high-level proof planning before execution | Decomposing a hard problem into lemmas |
Difficulty Spectrum Covered
| Problem | Difficulty | Status |
|---|---|---|
| √2 is irrational | Easy (textbook) | Solved |
| Infinitely many primes | Easy (textbook) | Solved |
| e is irrational | Medium (undergraduate) | Addressed |
| Cauchy-Schwarz inequality | Medium (undergraduate) | Addressed |
| Erdős problem 205 | Medium (research-adjacent) | Addressed |
| Erdős problem 838 | Hard (open problem) | Attempted |
| Collatz conjecture | Open (famously unsolved) | Attempted |
Operating Modes
| Mode | Input | Goal | Termination Condition |
|---|---|---|---|
| prove | THEOREM.md | Informal proof | submit_proof → PROOF.md |
| prove_and_formalize | THEOREM.md + THEOREM.lean | Both proofs | PROOF.md + PROOF.lean |
| formalize_only | THEOREM.md + THEOREM.lean + PROOF.md | Formal proof | submit_proof succeeds → PROOF.lean |
5 LLM Integration
Multi-Provider Architecture
OpenProver implements a factory pattern for LLM clients, supporting three distinct provider backends:
| Provider | Client Class | Interface | Tool Support | Cost Tracking |
|---|---|---|---|---|
| Claude (Anthropic) | LLMClient | Claude Code CLI subprocess | MCP server for lean_verify/lean_search | Yes (from Claude CLI response) |
| Leanstral (Mistral) | via OpenAI-compatible | Mistral API with MISTRAL_API_KEY | Native | Via API |
| Local/vLLM | HFClient | OpenAI-compatible HTTP at --provider-url | Native tool calling | No (local = free) |
Hierarchical Model Assignment
A distinctive feature is the ability to assign different models to different roles:
┌──────────────────────────────────────────┐
│ MODEL HIERARCHY │
│ │
│ --planner-model opus │
│ ↓ Strategic reasoning, whiteboard │
│ ↓ management, action selection │
│ │
│ --worker-model sonnet │
│ ↓ Focused task execution, proof │
│ ↓ exploration, verification │
│ │
│ (Literature search gets web-enabled │
│ Claude with WebSearch + WebFetch) │
└──────────────────────────────────────────┘
Rationale: The planner sees a compressed view of the proof state (whiteboard + repo index) and must make strategic decisions — a task suited to more capable but expensive models (Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (Sonnet).
Claude CLI Integration Details
The Claude integration is notable for its depth. OpenProver wraps the claude CLI binary via subprocess.Popen:
Non-streaming mode:
claude -p --model <model> --system-prompt <...> \
--output-format json --tools ""
Streaming mode:
claude -p --model <model> --system-prompt <...> \
--output-format stream-json --verbose \
--include-partial-messages --tools ""
Key implementation detail: streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.
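The readline()-based streaming loop can be sketched as follows. This is an illustrative reconstruction, not OpenProver's actual code; `iter_ndjson` and `stream_claude` are hypothetical names, and the CLI flags mirror the streaming invocation shown above.

```python
import io
import json
import subprocess

def iter_ndjson(pipe):
    """Yield parsed events from an NDJSON stream, one line at a time.

    readline() is used instead of `for line in pipe`: the file
    iterator's read-ahead buffering would delay real-time display.
    """
    while True:
        line = pipe.readline()
        if not line:            # EOF: stream closed
            break
        line = line.strip()
        if line:
            yield json.loads(line)

def stream_claude(prompt, model="sonnet"):
    """Spawn the claude CLI in streaming mode and yield its events."""
    proc = subprocess.Popen(
        ["claude", "-p", "--model", model,
         "--output-format", "stream-json", "--verbose",
         "--include-partial-messages", "--tools", ""],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    proc.stdin.write(prompt)
    proc.stdin.close()
    yield from iter_ndjson(proc.stdout)
    proc.wait()
```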
MCP Tool Calling: When --lean-project is set, the Claude CLI spawns an MCP server subprocess (lean/mcp_server.py) with:
--mcp-config <config> --strict-mcp-config \
--permission-mode bypassPermissions \
--allowedTools mcp__lean_tools__lean_verify mcp__lean_tools__lean_search
Tool events are detected from the NDJSON stream:
- tool_use content blocks carry tool name and input (streamed incrementally)
- Tool results appear as top-level {"type": "user"} messages containing tool_result entries matched by tool_use_id
- MCP prefixes are stripped: mcp__lean_tools__lean_verify → lean_verify
- Status is inferred from result text (e.g., results starting with "OK" = success)
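A minimal event classifier along these lines could look as follows. The exact event shapes are assumptions inferred from the description above (tool_use blocks with name/input, tool_result entries matched by tool_use_id); `classify_event` is a hypothetical name.

```python
MCP_PREFIX = "mcp__lean_tools__"

def classify_event(event: dict):
    """Sketch of tool-event detection from the NDJSON stream.

    Returns ("call", ...) or ("result", ...) for tool traffic,
    None for events that carry neither.
    """
    for block in event.get("message", {}).get("content", []):
        if block.get("type") == "tool_use":
            name = block["name"]
            if name.startswith(MCP_PREFIX):      # strip MCP prefix
                name = name[len(MCP_PREFIX):]    # e.g. lean_verify
            return ("call", {"id": block["id"], "name": name,
                             "input": block.get("input")})
        if block.get("type") == "tool_result":
            text = "".join(c.get("text", "")
                           for c in block.get("content", [])
                           if isinstance(c, dict))
            # status inferred from result text: leading "OK" = success
            return ("result", {"id": block["tool_use_id"],
                               "ok": text.startswith("OK"),
                               "text": text})
    return None
```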
vLLM Integration for Local Models
For local models via vLLM, OpenProver uses the HFClient:
- Health check on init via the /health endpoint
- Streaming via chunked transfer encoding
- Multi-turn tool calling loop: LLM requests tool calls → _execute_worker_tool() dispatches → results appended to conversation → LLM continues
- Supports MiniMax-M2.5 reasoning models with thinking tokens
- Configurable --answer-reserve (default 4096 tokens) for local models that need token budget partitioning
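The multi-turn tool-calling loop can be sketched as below. This is a hedged reconstruction: `post(payload)` abstracts one chat-completion request to the OpenAI-compatible endpoint, and `execute_tool(name, args)` stands in for the dispatcher the text calls `_execute_worker_tool()`; neither name is OpenProver's actual API.

```python
import json

def chat_with_tools(messages, tools, execute_tool, post, max_rounds=8):
    """Sketch of a multi-turn tool-calling loop for an
    OpenAI-compatible backend (e.g. vLLM).

    Each round: ask the model, run any tools it requested, feed the
    results back, and repeat until the model answers without tools.
    """
    for _ in range(max_rounds):
        msg = post({"messages": messages, "tools": tools})
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:                        # no tool requests: done
            return msg.get("content", "")
        for call in calls:                   # run each requested tool
            args = json.loads(call["function"]["arguments"])
            result = execute_tool(call["function"]["name"], args)
            messages.append({"role": "tool",  # feed result back
                             "tool_call_id": call["id"],
                             "content": result})
    raise RuntimeError("tool loop did not converge")
```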
Web Search Integration
When --no-isolation is set, the planner can dispatch literature_search actions. These spawn a special worker with Claude CLI configured for web access:
--permission-mode bypassPermissions \
--allowedTools WebSearch WebFetch
This enables the system to find relevant mathematical papers and results online during proof search — a capability that mirrors how human mathematicians consult the literature when stuck.
6 Key Results
Demonstrated Capabilities
OpenProver is a tool-focused repository rather than a benchmark-reporting research paper. Specific quantitative results are limited, but the system demonstrates:
- Easy Theorems (Solved):
  - Irrationality of √2 — complete informal and formal proofs
  - Infinitely many primes — complete informal proofs
- Medium Theorems (Addressed):
  - Irrationality of e — proof strategies explored
  - Cauchy-Schwarz inequality — informal and formal proof attempts
  - Erdős problem 205 — multi-step proof exploration
- Hard/Open Problems (Attempted):
  - Erdős problem 838 — extended autonomous sessions with literature search
  - Collatz conjecture — proof exploration (unsolved, as expected)
Architectural Validation
The system validates that the Aletheia-style planner-worker architecture can be replicated with open-source components:
| Aletheia Feature | OpenProver Equivalent | Status |
|---|---|---|
| Gemini Deep Think planner | Claude Opus / any LLM planner | Implemented |
| Parallel worker execution | ThreadPoolExecutor workers | Implemented |
| Whiteboard scratchpad | WHITEBOARD.md with LaTeX | Implemented |
| Repository of findings | repo/ directory with wikilinks | Implemented |
| Extraction + verification | submit_proof + Lean 4 verification | Implemented |
| Literature search | Web-enabled Claude workers | Implemented |
| Formal proof compilation | lake env lean + MCP tools | Implemented |
Lean Explore Search Pipeline
One of OpenProver's most technically sophisticated components is the lean_search tool, which searches ~400k Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std:
| Stage | Technique | Latency | Details |
|---|---|---|---|
| BM25 Retrieval | Keyword search over LLM-generated natural language descriptions | ~0.01s | Two indices with different tokenization, results merged |
| Semantic Retrieval | Qwen3-Embedding-0.6B (1024-dim), FAISS index, cosine similarity | ~1s | Pre-built index, sentence-transformer encoding |
| Score Fusion | Normalized combination: 0.3 × BM25 + 0.7 × semantic | — | Dependency boost for related declarations |
| Reranking | Qwen3-Reranker-0.6B cross-encoder | ~1-2s (GPU) | Skipped on CPU (~50s); GPU recommended |
| Hydration | Full metadata lookup from SQLite | — | Signature, docstring, source location |
First call takes ~10s (model loading). Subsequent calls ~1s.
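The score-fusion stage (0.3 × BM25 + 0.7 × semantic over normalized scores) can be sketched as below. Min-max normalization is an assumption; the repository may normalize differently, and the dependency boost is omitted here.

```python
def fuse_scores(bm25, semantic, w_bm25=0.3, w_sem=0.7):
    """Sketch of score fusion: normalize each retriever's scores to
    [0, 1], then combine as w_bm25 * BM25 + w_sem * semantic.

    `bm25` and `semantic` map declaration names to raw scores.
    Returns (name, fused_score) pairs, best first.
    """
    def norm(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0              # avoid divide-by-zero
        return {k: (v - lo) / span for k, v in scores.items()}

    nb, ns = norm(bm25), norm(semantic)
    fused = {}
    for decl in set(nb) | set(ns):           # union of both candidate sets
        fused[decl] = w_bm25 * nb.get(decl, 0.0) + w_sem * ns.get(decl, 0.0)
    return sorted(fused.items(), key=lambda kv: -kv[1])
```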
7 Reproducibility
Run Directory Structure
Every OpenProver run creates a fully self-contained, resumable directory:
runs/<slug>-<timestamp>/
├── THEOREM.md # Immutable copy of input theorem
├── THEOREM.lean # Formal Lean statement (if provided)
├── WHITEBOARD.md # Current whiteboard state (enables resume)
├── PROOF.md # Final informal proof (if found)
├── PROOF.lean # Final formal proof (if lean mode)
├── DISCUSSION.md # Post-session LLM analysis
├── run_config.toml # Saved run configuration
├── repo/ # Repository items
│ ├── lemma-foo.md # "Summary: ...\n\nFull content"
│ └── observation-bar.md
├── steps/
│ ├── step_001/
│ │ ├── planner.toml # Planner's TOML decision
│ │ └── workers/
│ │ ├── task_0.md # Worker task description
│ │ ├── result_0.md # Worker output
│ │ └── worker_0_call.json
│ └── step_002/...
└── archive/
└── calls/
└── call_001.json # Full LLM call record
Resumability Protocol
Resume detection: check for existing WHITEBOARD.md + THEOREM.md. Step count inferred from step_NNN directories. Settings restored from run_config.toml; CLI flags override saved values.
This means a proving session can survive:
- Machine crashes / power loss
- Ctrl-C interruptions
- Context window exhaustion
- Network failures during API calls
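The resume check described above can be sketched as follows; `detect_resume` is a hypothetical name, and the exact file checks are inferred from the protocol description (WHITEBOARD.md + THEOREM.md present, step count from step_NNN directories).

```python
import pathlib
import re

def detect_resume(run_dir):
    """Sketch of resume detection for a run directory.

    Returns (resumable, next_step): resumable if the whiteboard and
    theorem files exist, next_step inferred from step_NNN dirs.
    """
    run = pathlib.Path(run_dir)
    resumable = ((run / "WHITEBOARD.md").exists()
                 and (run / "THEOREM.md").exists())
    steps = []
    steps_dir = run / "steps"
    if steps_dir.is_dir():
        for d in steps_dir.iterdir():
            m = re.fullmatch(r"step_(\d+)", d.name)
            if m and d.is_dir():
                steps.append(int(m.group(1)))
    next_step = max(steps, default=0) + 1    # resume after last step
    return resumable, next_step
```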
Artifact Completeness
Each archived LLM call (call_NNN.json) records:
| Field | Content |
|---|---|
| prompt | Full input prompt |
| system_prompt | System prompt used |
| schema | JSON schema (if structured output) |
| response | Complete LLM response |
| cost | Cost in USD |
| timing | Wall-clock duration |
| errors | Any errors encountered |
Inspection Mode
openprover inspect runs/<run-dir>
Browse any historical run in the TUI in read-only mode, reviewing every step's planner decisions, worker tasks, and results.
8 Compute and API Costs
Budget System
OpenProver implements explicit budget tracking with two modes:
| Budget Mode | Flag | Default | Description |
|---|---|---|---|
| Time budget | --max-time | 4 hours | Wall-clock time limit |
| Token budget | --max-tokens | — | Cumulative output token limit |
The Budget class exposes three checkpoints:
- is_exhausted: budget fully consumed
- should_conclude (at --conclude-after, default 0.99): planner enters conclusion phase
- can_give_up (at --give-up-after, default 0.50): give_up action becomes available
The planner prompt receives budget status every step so it can pace its exploration.
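A time-based Budget with the three checkpoints might look like this. The thresholds mirror the documented defaults (--conclude-after 0.99, --give-up-after 0.50); the implementation details are a sketch, not OpenProver's actual class.

```python
import time

class Budget:
    """Sketch of the three budget checkpoints (time-based variant;
    a token budget would track a cumulative counter instead)."""

    def __init__(self, max_seconds, conclude_after=0.99, give_up_after=0.50):
        self.start = time.monotonic()
        self.max_seconds = max_seconds
        self.conclude_after = conclude_after
        self.give_up_after = give_up_after

    @property
    def fraction_used(self):
        return (time.monotonic() - self.start) / self.max_seconds

    @property
    def is_exhausted(self):        # budget fully consumed
        return self.fraction_used >= 1.0

    @property
    def should_conclude(self):     # planner enters conclusion phase
        return self.fraction_used >= self.conclude_after

    @property
    def can_give_up(self):         # give_up action becomes available
        return self.fraction_used >= self.give_up_after
```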
Cost Estimation
| Scenario | Model | Parallelism | Time | Estimated Cost |
|---|---|---|---|---|
| Easy theorem (√2 irrational) | Sonnet | 1 | ~10-30 min | ~$1-5 |
| Medium theorem (Cauchy-Schwarz) | Sonnet | 1 | ~1-2 hours | ~$10-30 |
| Hard exploration (Erdős 838) | Opus planner + Sonnet workers | 3 | 2-4 hours | ~$50-200 |
| Local model run | vLLM (MiniMax-M2.5) | 1 | Variable | $0 (compute only) |
Note: These are rough estimates. Actual costs depend heavily on model pricing, theorem difficulty, and search efficiency. Every call's cost is logged in archive/calls/ for precise post-hoc analysis.
Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.11+ (for tomllib) |
| RAM | 4 GB | 8+ GB (for Lean Explore models) |
| GPU | None (CPU works) | CUDA GPU (for lean_search reranking) |
| Disk | ~500 MB (base) | ~2 GB (with Lean Explore data) |
| Network | Required (for API models) | — |
| Lean 4 | Optional | Required for formal verification |
9 Architecture Solution
System Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ OpenProver System │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ CLI (cli.py) │ │
│ │ Parse args → Setup TUI → Create Prover → Run → Print cost │ │
│ └──────────┬───────────────────────────────────┬───────────────┘ │
│ │ │ │
│ ┌──────────▼──────────┐ ┌──────────────▼──────────────┐ │
│ │ Prover Loop │ │ TUI (tui.py) │ │
│ │ (prover.py) │◄───────►│ Full-screen terminal UI │ │
│ │ │ display │ Tabs: planner + workers │ │
│ │ ┌───────────────┐ │ │ Streaming + key handling │ │
│ │ │ Whiteboard │ │ │ ANSI escape codes │ │
│ │ │ (LaTeX pad) │ │ └─────────────────────────────┘ │
│ │ └───────────────┘ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Repository │ │ │
│ │ │ (repo/*.md) │ │ │
│ │ │ + wikilinks │ │ │
│ │ └───────────────┘ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Output Window │ │ │
│ │ │ (last 3 outs) │ │ │
│ │ └───────────────┘ │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────────────────────────────────────────────┐ │
│ │ Action Dispatch │ │
│ │ │ │
│ │ spawn ──────► Workers (ThreadPoolExecutor, up to -P) │ │
│ │ literature_search ──► Web-enabled Claude worker │ │
│ │ read_items ──► Fetch repo/*.md content │ │
│ │ write_items ──► Create/update/delete repo items │ │
│ │ write_whiteboard ──► Update WHITEBOARD.md │ │
│ │ read_theorem ──► Return THEOREM.md + .lean + PROOF.md │ │
│ │ submit_proof ──► Save PROOF.md, verify PROOF.lean │ │
│ │ give_up ──► Terminate (after threshold) │ │
│ └─────────────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────▼───────────────────────────┐ │
│ │ LLM Layer (llm.py) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ LLMClient │ │ HFClient │ │ │
│ │ │ (Claude CLI) │ │ (vLLM/OpenAI) │ │ │
│ │ │ │ │ │ │ │
│ │ │ • Popen+readline│ │ • HTTP streaming │ │ │
│ │ │ • NDJSON parsing│ │ • Tool calling │ │ │
│ │ │ • MCP config │ │ • Health check │ │ │
│ │ │ • Cost tracking │ │ • Chat() loop │ │ │
│ │ └──────────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Lean Layer (lean/) │ │
│ │ │ │
│ │ core.py │ Parse THEOREM.lean, assemble proofs, │ │
│ │ │ run `lake env lean`, LeanWorkDir │ │
│ │ mcp_server.py │ MCP server: lean_verify + lean_search │ │
│ │ data.py │ LeanExplore data management │ │
│ │ tools.py │ Tool defs for vLLM native calling │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Support Modules │ │
│ │ │ │
│ │ prompts.py │ All prompt templates, TOML parser, actions │ │
│ │ budget.py │ Time/token budget tracking │ │
│ │ inspect.py │ Read-only run browser │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Data Flow Per Step
Step N begins
│
▼
┌────────────────────────────────────────┐
│ Build planner prompt: │
│ • Current whiteboard │
│ • Repository index (one-line summaries)│
│ • Last 3 outputs (rolling window) │
│ • Budget status │
│ • Mode-specific status indicators │
└───────────────┬────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ Planner LLM call (streaming) │
│ Response must contain ```toml block │
└───────────────┬────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ Parse TOML: │
│ • action (one of 8 actions) │
│ • summary │
│ • whiteboard update │
│ • tasks, items, search_query │
└───────────────┬────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ Dispatch to action handler │
│ (one of _handle_* methods) │
└───────────────┬────────────────────────┘
│
▼
┌────────────────────────────────────────┐
│ Save step to disk: │
│ • steps/step_NNN/planner.toml │
│ • steps/step_NNN/workers/ │
│ • archive/calls/ │
│ • Update WHITEBOARD.md │
└────────────────────────────────────────┘
│
▼
Step N+1
10 Component Breakdown
Module Architecture
| Module | File | Lines (est.) | Responsibility |
|---|---|---|---|
| CLI | cli.py | ~200 | Argument parsing, TUI setup, signal handlers, cost summary |
| Prover | prover.py | ~800 | Core loop, action dispatch, Repo class, wikilink resolution, operating mode logic |
| LLM | llm.py | ~500 | LLMClient (Claude CLI), HFClient (vLLM), streaming, archiving, cost tracking |
| Prompts | prompts.py | ~600 | System prompts, prompt formatters, TOML parser, actions enum |
| Budget | budget.py | ~100 | Time/token tracking, threshold calculations |
| TUI | tui.py | ~800 | Full-screen terminal UI, tab system, streaming display, key handling |
| Inspect | inspect.py | ~150 | Read-only run browser |
| Lean Core | lean/core.py | ~300 | LeanTheorem parser, proof assembly, run_lean_check, LeanWorkDir |
| Lean MCP | lean/mcp_server.py | ~200 | MCP server exposing lean_verify and lean_search |
| Lean Data | lean/data.py | ~150 | LeanExplore data management, dependency installation |
| Lean Tools | lean/tools.py | ~100 | Tool definitions for vLLM native tool calling |
Dependency Graph
cli.py
├── prover.py
│ ├── llm.py
│ ├── prompts.py
│ ├── budget.py
│ └── lean/
│ ├── core.py
│ ├── data.py
│ ├── mcp_server.py (subprocess)
│ └── tools.py
├── tui.py
└── inspect.py
The dependency graph is clean and acyclic. lean/ is a self-contained subpackage. prompts.py has no code dependencies beyond the standard library.
External Dependencies
OpenProver has a remarkably minimal dependency footprint:
[project]
requires-python = ">=3.10"
dependencies = [
"mcp>=1.0",
]
Only the MCP library is a hard dependency. All other functionality comes from:
- Python standard library (subprocess, json, tomllib, threading, os, select)
- Claude CLI (system binary)
- Optional: Lean 4, LeanDojo, lean-explore, torch, sentence-transformers
11 Core Mechanisms (Detailed)
11.1 Planner-Worker Protocol
The planner operates in a strict loop. Each step:
- Prompt Construction: The planner receives a structured prompt containing:
  - The current whiteboard (mathematical scratchpad)
  - A one-line index of all repository items
  - The last 3 outputs from previous steps (rolling window)
  - Budget status (time/tokens remaining, can_conclude, can_give_up)
  - Mode-specific status indicators (theorem read, proof found, formal proof status)
- TOML Decision: The planner responds with a fenced TOML block:

```toml
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = """
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed)
"""

[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference parity-lemma."

[[tasks]]
description = "Verify the infinite descent argument terminates."
```

- TOML Parsing: parse_planner_toml() extracts the TOML from fenced code blocks using tomllib (Python 3.11+), tomli (3.10 fallback), or a minimal regex-based parser. It handles both [[tasks]] array-of-tables syntax and lean_block_N numbered keys.
- Action Dispatch: Based on the action field, one of 8 handlers executes.
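A minimal sketch of the extraction step could look like this. It covers only the tomllib path; the real parser also falls back to a regex-based parser, and the function body here is illustrative, not OpenProver's actual implementation.

```python
import re

try:
    import tomllib            # Python 3.11+
except ModuleNotFoundError:   # Python 3.10 fallback
    import tomli as tomllib

def parse_planner_toml(response: str) -> dict:
    """Extract and parse the first fenced ```toml block from an
    LLM response, returning the planner's decision as a dict."""
    m = re.search(r"```toml\s*\n(.*?)```", response, re.DOTALL)
    if not m:
        raise ValueError("planner response contains no ```toml block")
    return tomllib.loads(m.group(1))
```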
11.2 Worker Execution
Workers are stateless, parallel, and focused:
Planner spawns task
│
▼
Resolve [[wikilinks]] in task description
│ (fetch referenced repo items)
│
▼
Build worker prompt:
│ • Task description + resolved references
│ • System prompt (mathematician role)
│
▼
LLM call (streaming to worker tab)
│
├── [If lean_worker_tools enabled]
│ ├── lean_verify(code) → compile in LeanWorkDir
│ └── lean_search(query) → search ~400k declarations
│
▼
Return result to planner
Parallel Execution: Workers run in a ThreadPoolExecutor with concurrency bounded by --parallelism (default 1). Each worker is completely independent — they share no state, only read from the repository.
Verification Workers: The planner can specifically task workers with verification. A verifier worker sees only the proof text (not the reasoning) and must end with VERDICT: CORRECT or VERDICT: INCORRECT.
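The parallel dispatch can be sketched in a few lines; `run_workers` is a hypothetical name, and `run_worker(task)` stands in for one focused worker LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workers(tasks, run_worker, parallelism=1):
    """Sketch of the spawn action's parallel dispatch.

    Workers are stateless, so each task runs independently in a
    thread pool bounded by --parallelism (default 1).
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map preserves input order, so result_i pairs with task_i
        return list(pool.map(run_worker, tasks))
```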
11.3 Whiteboard Management
The whiteboard is the planner's external mathematical memory:
┌─────────────────────────────────────────┐
│ WHITEBOARD.md │
│ │
│ # Goal: [theorem statement] │
│ ## Strategy: [current approach] │
│ ## Status: [where we are] │
│ ## Tried: [approaches attempted] │
│ ## Key Observations: │
│ - [observation 1] │
│ - [observation 2] │
│ ## Open Questions: │
│ - [question 1] │
└─────────────────────────────────────────┘
Design Principles:
- Terse LaTeX-friendly format (abbreviations encouraged)
- Updated by the planner every step
- Persists to disk (enables resume)
- Read by the planner at the start of every step
- Functions as external memory beyond context window limits
11.4 Repository and Wikilinks
The repository is a file-based knowledge graph:
repo/
├── sqrt2-parity-lemma.md
│ Summary: p² = 2q² implies p even.
│
│ Proof: If p is odd, p² is odd. But 2q² is even. Contradiction.
│
├── descent-argument.md
│ Summary: Infinite descent shows no reduced fraction exists.
│
│ [Full argument with LaTeX]
│
└── algebraic-number-context.md
Summary: Literature review on algebraic irrationality proofs.
[Web search results and analysis]
Wikilink Resolution: Before a worker receives its task, repo.resolve_wikilinks():
1. Finds all [[slug]] references in the task description
2. Fetches the content of each referenced item
3. Appends a "Referenced Materials" section to the worker's input
This creates a lightweight knowledge management system where the planner can share proven lemmas, observations, or literature findings with workers without duplicating content.
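The three resolution steps can be sketched as below. The `[[slug]]` pattern and the "Referenced Materials" wording are assumptions based on the description above, and the function is illustrative rather than OpenProver's actual `resolve_wikilinks()`.

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([A-Za-z0-9-]+)\]\]")

def resolve_wikilinks(task: str, repo_dir) -> str:
    """Sketch of wikilink resolution: find [[slug]] references in a
    task and append the referenced repo items to the worker's input."""
    slugs = dict.fromkeys(WIKILINK.findall(task))   # de-dupe, keep order
    sections = []
    for slug in slugs:
        item = Path(repo_dir) / f"{slug}.md"
        if item.exists():                            # skip dangling links
            sections.append(f"### {slug}\n{item.read_text()}")
    if not sections:
        return task
    return task + "\n\n## Referenced Materials\n\n" + "\n\n".join(sections)
```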
11.5 Lean 4 Formal Verification
When --lean-project is set, formal verification integrates at two levels:
Level 1: Repository Item Verification
When the planner writes an item with format="lean", it is:
1. Written to <lean-project>/OpenProver-<id>/<slug>-<random>.lean
2. Compiled via lake env lean <file> from the project directory
3. Pass/fail feedback (with compiler errors) returned to planner
Level 2: Final Proof Assembly
When the planner submits a proof in lean mode:
1. LeanTheorem parses THEOREM.lean, locating all sorry positions
2. The planner provides N replacement blocks (one per sorry) + optional context
3. assemble_proof() replaces each sorry with its corresponding block (reverse order for offset preservation)
4. Validates: correct replacement count, no import in injected code
5. Runs lake env lean on the assembled file
6. Success → writes PROOF.lean; failure → compiler errors fed back
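Steps 1-4 of the assembly can be sketched as follows; this is an illustrative reconstruction of the described behavior (reverse-order replacement, count check, no `import` in injected code), not OpenProver's actual `assemble_proof()`.

```python
import re

def assemble_proof(theorem_src: str, blocks: list) -> str:
    """Sketch of final proof assembly: replace each `sorry` in the
    theorem source with its corresponding block.

    Replacement runs in reverse order so character offsets recorded
    for earlier `sorry`s stay valid after later ones are replaced.
    """
    positions = [m.span() for m in re.finditer(r"\bsorry\b", theorem_src)]
    if len(positions) != len(blocks):
        raise ValueError(
            f"expected {len(positions)} blocks, got {len(blocks)}")
    if any(re.search(r"^\s*import\b", b, re.MULTILINE) for b in blocks):
        raise ValueError("injected code must not contain imports")
    out = theorem_src
    for (start, end), block in zip(reversed(positions), reversed(blocks)):
        out = out[:start] + block + out[end:]   # earlier offsets unchanged
    return out
```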
Level 3: Worker Tools via MCP
When --lean-worker-tools is enabled:
- Claude workers: MCP server subprocess provides lean_verify and lean_search
- vLLM workers: LeanExplore search service initialized in-process; native OpenAI tool calling
11.6 Literature Search
When --no-isolation is set:
Planner action: literature_search
│
▼
Spawn web-enabled Claude worker with:
• WebSearch tool (web queries)
• WebFetch tool (page retrieval)
│
▼
Worker searches for relevant:
• Published papers
• Known results
• Prior proof attempts
• Related theorems
│
▼
Results stored as repo items
│
▼
Available to all future workers via [[wikilinks]]
This is one of OpenProver's most distinctive features — it enables the prover to consult the mathematical literature mid-proof, just as a human mathematician would.
11.7 Budget-Aware Planning
The planner receives budget status every step:
Budget: 45% remaining (1h 48m of 4h)
Conclusion phase: not yet (triggers at 99%)
Give-up threshold: passed (50%)
This enables adaptive behavior:
- Early steps: aggressive exploration, spawn many workers
- Middle steps: focus on most promising avenues
- Late steps: consolidate findings, attempt proof submission
- Conclusion phase: wrap up, write DISCUSSION.md
12 Programming Language
Language Breakdown
| Language | Percentage | Usage |
|---|---|---|
| Python | 92.9% | Core system (all modules) |
| JavaScript | 4.2% | TUI components |
| CSS | 2.4% | TUI styling |
| Other | 0.5% | Configuration, shell scripts |
Python Version Requirements
- Minimum: Python 3.10 (for match statements, type union syntax)
- Recommended: Python 3.11+ (for tomllib in the standard library; 3.10 uses the tomli fallback)
- No async: The system uses threading and subprocess, not asyncio
Notable Implementation Choices
- No Heavy ML Frameworks Required: The core system depends only on mcp>=1.0. Lean search dependencies (torch, sentence-transformers) are optional and installed on demand.
- Subprocess-Based LLM Integration: Rather than using SDK libraries, OpenProver calls the Claude CLI binary directly via subprocess.Popen. This is unconventional but provides maximum control over streaming behavior and tool configuration.
- TOML as Structured Output Format: The planner's decisions are structured as TOML blocks within natural language responses. This is a pragmatic choice — TOML is simpler than JSON for LLMs to generate (no escaping issues) and supports multi-line strings naturally.
- File-Based State Management: All state lives on disk as plain files (markdown, TOML, JSON). No database, no in-memory state that survives process death. This makes the system trivially inspectable and resumable.
- Thread Safety via Write Lock: All TUI stdout writes are protected by a _write_lock. The background key-reading thread doesn't hold the lock during I/O, preventing deadlocks while maintaining display consistency.
13 Memory Management
External Memory Architecture
OpenProver implements a multi-layered external memory system that compensates for LLM context window limitations:
┌─────────────────────────────────────────────────┐
│ MEMORY HIERARCHY │
│ │
│ Layer 1: Whiteboard (WHITEBOARD.md) │
│ ├── Capacity: ~2-4 KB (terse LaTeX) │
│ ├── Persistence: Per-step to disk │
│ ├── Visibility: Planner only │
│ ├── Purpose: Strategic state, current approach │
│ └── Update: Every step by planner │
│ │
│ Layer 2: Repository (repo/*.md) │
│ ├── Capacity: Unlimited (one file per item) │
│ ├── Persistence: Permanent on disk │
│ ├── Visibility: Planner (index), Workers (via │
│ │ wikilinks) │
│ ├── Purpose: Proven lemmas, observations, │
│ │ literature, failed approaches │
│ └── Update: Planner CRUD via write_items │
│ │
│ Layer 3: Output Window (last 3 outputs) │
│ ├── Capacity: 3 items (rolling) │
│ ├── Persistence: In-memory only │
│ ├── Visibility: Planner only │
│ ├── Purpose: Recent worker results, feedback │
│ └── Update: Automatic (FIFO) │
│ │
│ Layer 4: Step Archive (steps/step_NNN/) │
│ ├── Capacity: Unlimited │
│ ├── Persistence: Permanent on disk │
│ ├── Visibility: Inspect mode only │
│ ├── Purpose: Full audit trail │
│ └── Update: Append-only │
│ │
│ Layer 5: LLM Call Archive (archive/calls/) │
│ ├── Capacity: Unlimited │
│ ├── Persistence: Permanent on disk │
│ ├── Visibility: Post-hoc analysis │
│ ├── Purpose: Cost tracking, debugging, │
│ │ reproducibility │
│ └── Update: Append-only │
└─────────────────────────────────────────────────┘
Wikilink-Based Knowledge Retrieval
The [[wikilink]] system functions as a lightweight retrieval mechanism:
- The planner stores a lemma as repo/parity-lemma.md
- A later task references [[parity-lemma]]
- resolve_wikilinks() fetches the full content
- The worker receives the task plus the appended reference materials
This is superior to dumping all context into every prompt because:
- Tokens are spent only when content is referenced
- Workers see only what's relevant to their specific task
- The planner controls what knowledge flows to which worker
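The retrieval step can be sketched as a single regex pass over the task text. The function name resolve_wikilinks() appears above, but the [[...]] link syntax and the exact repo layout are assumptions for illustration:

```python
import re
import tempfile
from pathlib import Path

WIKILINK = re.compile(r"\[\[([A-Za-z0-9_-]+)\]\]")

def resolve_wikilinks(task: str, repo_dir: Path) -> str:
    """Append the full text of every repo item referenced in the task.

    Tokens are spent only on items the planner explicitly linked;
    unreferenced repo content never reaches the worker prompt.
    """
    sections = []
    for slug in WIKILINK.findall(task):
        item = repo_dir / f"{slug}.md"
        if item.exists():  # dangling links are silently skipped
            sections.append(f"## Reference: {slug}\n{item.read_text()}")
    if not sections:
        return task
    return task + "\n\n# Reference materials\n\n" + "\n\n".join(sections)

# Demonstration with a throwaway repo directory.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "parity-lemma.md").write_text("If n is odd, n^2 is odd.")
    resolved = resolve_wikilinks("Prove the claim using [[parity-lemma]].", repo)
print("n^2 is odd" in resolved)  # → True
```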
Context Window Management
The planner sees a compressed view of the proof state:
| Component | Typical Size | Purpose |
|---|---|---|
| Whiteboard | ~500-2000 tokens | Current strategy and status |
| Repo index | ~50-200 tokens | One-line summary per item |
| Last 3 outputs | ~1000-3000 tokens | Recent worker results |
| Budget status | ~50 tokens | Pacing information |
| System prompt | ~1000-2000 tokens | Role and action definitions |
Total planner context: ~3000-7000 tokens per step — well within any frontier model's capacity. This means the planner never hits context limits, even in multi-hour sessions.
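Assembling that compressed view amounts to re-reading a few small files each step. A hypothetical sketch, assuming the file names from the memory hierarchy above; the function and its parameters are invented:

```python
import tempfile
from pathlib import Path

def build_planner_context(run_dir: Path, recent_outputs: list[str],
                          budget_line: str) -> str:
    """Assemble the planner's compressed view of the proof state.

    The whiteboard and repo index are re-read from disk every step,
    so the context stays bounded no matter how long the run has been.
    """
    whiteboard = (run_dir / "WHITEBOARD.md").read_text()
    # Repo index: one line per item (here just the slug).
    index = "\n".join(f"- {p.stem}"
                      for p in sorted((run_dir / "repo").glob("*.md")))
    outputs = "\n\n".join(recent_outputs[-3:])  # rolling window of 3
    return (f"# Whiteboard\n{whiteboard}\n\n"
            f"# Repository index\n{index}\n\n"
            f"# Recent outputs\n{outputs}\n\n"
            f"# Budget\n{budget_line}")

with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    (run / "repo").mkdir()
    (run / "WHITEBOARD.md").write_text("Current approach: induction on n.")
    (run / "repo" / "parity-lemma.md").write_text("...")
    ctx = build_planner_context(run, ["worker: base case done"], "step 4/40")
print(ctx)
```

The design point is that every component is either small by construction (whiteboard, index, budget line) or bounded by the rolling window, which is what keeps multi-hour runs inside a standard context window.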
No Embedding-Based Memory
Unlike systems that use vector databases for semantic retrieval over past experiences, OpenProver relies entirely on:
- Structured naming (repo item slugs)
- Explicit references (wikilinks)
- Planner intelligence (the LLM decides what to store and reference)
This is simpler and more transparent but relies on the planner's ability to curate the repository effectively.
14 Continued Learning
Learning Within a Run
OpenProver demonstrates within-session learning through its repository mechanism:
Step 1: Worker explores approach A, fails
→ Planner stores "approach-a-failure.md" in repo
→ Whiteboard updated: "Tried: approach A (failed due to X)"
Step 5: Worker attempts related approach
→ Task includes [[approach-a-failure]] reference
→ Worker avoids repeating the same mistake
Step 10: Planner recognizes pattern across failures
→ Stores "key-insight.md" with synthesized observation
→ Redirects search based on accumulated evidence
Cross-Run Learning
OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement.
What's Missing (and Why It Matters):
| Capability | Status | Impact |
|---|---|---|
| Proof strategy transfer | Not implemented | Can't reuse strategies from similar theorems |
| Tactic preference learning | Not implemented | Can't learn which tactics work for which goal types |
| Failure pattern recognition | Not implemented | Can't avoid known dead ends from prior runs |
| Repository bootstrapping | Not implemented | Can't seed a new run with relevant prior knowledge |
| Fine-tuning from traces | Not implemented | Can't train better models from successful proofs |
Potential for Evolutionary Integration
OpenProver's architecture has natural extension points for evolutionary methods:
- Repository Seeding: Pre-populate repo/ with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports this — the planner just needs a way to discover relevant prior knowledge.
- Strategy Evolution: The whiteboard format could be used as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress toward a proof.
- Prompt Population: The planner system prompt could be evolved, with different versions competing on benchmark problems.
- Model Selection: The planner/worker model assignment could be treated as a hyperparameter optimized across runs.
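Repository seeding in particular would need little new machinery: copy selected items from a prior run's repo into the new run before the first planner step. A speculative sketch with a crude keyword filter; all names here are hypothetical, and a real system might rank items by similarity to the new theorem statement instead:

```python
import shutil
import tempfile
from pathlib import Path

def seed_repository(prior_repo: Path, new_repo: Path,
                    keywords: list[str]) -> int:
    """Copy prior-run repo items mentioning any keyword into a new
    run's repo. Returns the number of items copied."""
    new_repo.mkdir(parents=True, exist_ok=True)
    copied = 0
    for item in sorted(prior_repo.glob("*.md")):
        if any(kw.lower() in item.read_text().lower() for kw in keywords):
            shutil.copy(item, new_repo / item.name)
            copied += 1
    return copied

with tempfile.TemporaryDirectory() as tmp:
    prior, new = Path(tmp, "prior"), Path(tmp, "new")
    prior.mkdir()
    (prior / "parity-lemma.md").write_text("Squares preserve parity.")
    (prior / "unrelated.md").write_text("A note about graph colorings.")
    n = seed_repository(prior, new, keywords=["parity"])
print(n)  # → 1
```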
Post-Session Analysis
Every completed run generates a DISCUSSION.md via a dedicated LLM call that analyzes:
- What worked and what didn't
- Key insights and observations
- Suggestions for future attempts
- Assessment of the proof's strength (for informal proofs)
This creates a self-reflective artifact that, while not currently fed into future runs, could serve as input for a cross-run learning system.
15 Applications
Primary Application: Mathematical Proof Discovery
OpenProver's primary use case is assisting mathematicians with proof discovery at various difficulty levels:
| Application | Mode | Description |
|---|---|---|
| Homework assistance | prove | Generate informal proofs for undergraduate-level theorems |
| Research exploration | prove + literature_search | Explore proof strategies for research-level problems with web search |
| Formalization | formalize_only | Convert an existing informal proof to verified Lean 4 |
| Proof + Formalization | prove_and_formalize | Find a proof and produce verified Lean 4 code simultaneously |
| Problem exploration | prove (autonomous, long budget) | Extended autonomous exploration of hard/open problems |
Research Applications
- Benchmarking LLMs for Mathematical Reasoning: OpenProver provides a controlled environment for comparing different LLMs' mathematical reasoning abilities. By logging every LLM call with full prompts and responses, researchers can analyze:
  - Which models propose better tactics
  - How reasoning depth correlates with proof success
  - Cost-quality tradeoffs between model tiers
- Architecture Research: The planner-worker architecture is configurable enough to study:
  - The effect of parallelism on proof search efficiency
  - The value of literature search for hard problems
  - Whiteboard strategies and their impact on proof direction
- Formal Verification Pipeline Research: The Lean integration enables studying:
  - Informal-to-formal translation accuracy
  - Error patterns in LLM-generated Lean code
  - Effectiveness of compiler feedback for repair
Comparison with Alternative Approaches
| Approach | OpenProver Strength | OpenProver Limitation |
|---|---|---|
| vs. Single-agent provers (LeanCopilot) | Richer strategic reasoning, literature search, persistent memory | Higher token cost, more complex setup |
| vs. RL-trained provers (AlphaProof) | No training required, works with any LLM, open source | No learned value function, no policy improvement |
| vs. Search-based provers (UlamAI) | Richer proof state (whiteboard + repo), informal reasoning | Less sophisticated tactic-level search, no transposition table |
| vs. Closed systems (Aletheia) | Open source, inspectable, resumable, multi-provider | Smaller scale, no proprietary model access |
Integration Scenarios
For Individual Mathematicians:
# Quick exploration of a conjecture
openprover my_conjecture.md --autonomous --max-time 1h
# Deep dive with literature search
openprover hard_problem.md --no-isolation --model opus -P 3 --max-time 4h
# Formalize a proof for a paper
openprover theorem.md --lean-project ~/mathlib4 \
--lean-theorem theorem.lean --proof proof.md
For Research Groups:
- Batch proving across theorem suites
- Comparative evaluation of models/strategies
- Training data generation from proof traces
For Education:
- Step-by-step proof exploration in interactive mode
- Students can give feedback at each step, guiding the proof
- The TUI provides visibility into the reasoning process
Limitations and Honest Assessment
| Limitation | Details |
|---|---|
| No benchmark results | No systematic evaluation on standard benchmarks (miniF2F, etc.) |
| No cross-run learning | Each run starts from scratch |
| Claude-centric | Default configuration assumes Claude Code CLI on PATH |
| No RL/SFT pipeline | Pure prompting — no model improvement from traces |
| No value model | Search relies on LLM intuition, not learned value functions |
| Solo development | Bus factor of 1; sustainability risk |
| Informal-first bias | Strongest at informal proofs; formal verification is secondary |
Significance for the Harness Systems Landscape
OpenProver occupies a notable point in the harness systems taxonomy: it demonstrates that frontier proof agents can be built with open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. Its planner-worker-whiteboard architecture, adapted from Aletheia's design and released as open source, provides a reusable pattern for any agentic system that needs to coordinate strategic reasoning with parallel execution over extended time horizons.
The key insight is that persistent structured state (whiteboard + repository + wikilinks) can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models. This architectural pattern generalizes beyond theorem proving to any discovery process that requires strategic coordination, literature consultation, and incremental knowledge accumulation.