
OpenProver

Automated theorem prover using a planner-worker agentic architecture with frontier LLMs, inspired by Google DeepMind's Aletheia

Organization: Independent (Matěj Kripner / Charles University, Prague)
Published: February 2025 (initial); v1.0.1, March 2026 (current)
Type: Open-source repository + PyPI package
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: OpenProver — Automated Theorem Prover Inspired by Aletheia; Claude Code for Mathematicians

Repository: github.com/Kripner/openprover

PyPI Package: openprover v1.0.1

Tagline: "Theorem prover powered by language models."

Lineage: Directly inspired by Google DeepMind's Aletheia — the fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6/10 open research problems from the FirstProof challenge (February 2025). OpenProver replicates and open-sources the planner-worker-whiteboard architecture that Aletheia demonstrated at the frontier of automated mathematical reasoning.

Related Work by Author: Matěj Kripner is also a contributor to LeanTree — a white-box approach to proof search that factorizes complex proof states in Lean 4 into simpler branches for more efficient automated theorem proving (published at UFAL, Charles University, Prague).

Citation:

@misc{openprover,
  author = {Matěj Kripner},
  title = {OpenProver},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kripner/openprover}
}

2 Authors and Team

Primary Author: Matěj Kripner (kripner@ufal.mff.cuni.cz)

Affiliation: Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Team Size: Solo developer with 388 commits on the master branch (as of April 2026). Community contributions via 11 forks.

Background: Kripner's work sits at the intersection of formal methods and machine learning applied to mathematics. His LeanTree research addresses the complementary problem of making proof search more efficient through state factorization, while OpenProver tackles the higher-level orchestration of LLM-guided proving.

Repository Activity:

| Metric | Value |
| --- | --- |
| Commits | 388 |
| Stars | 67+ |
| Forks | 11 |
| Primary Language | Python (92.9%) |
| Secondary Languages | JavaScript (4.2%), CSS (2.4%) |
| License | Not explicitly specified |

3 Core Contribution

Key Novelty: OpenProver is the first open-source implementation of the planner-worker-whiteboard architecture for automated theorem proving that Aletheia demonstrated at Google DeepMind. It bridges the gap between closed frontier proof agents and the open research community by providing a fully inspectable, resumable, multi-LLM orchestration framework for both informal and formal mathematical reasoning.

What Makes OpenProver Distinctive

  1. Planner-Worker Separation with Persistent State: Unlike single-agent proving systems that maintain one long context window, OpenProver separates the high-level strategic reasoning (planner) from focused execution (workers). The planner maintains a whiteboard and repository across steps, while workers are stateless and parallelizable. This mirrors how human mathematicians work — a strategist coordinates specialists.

  2. Whiteboard as External Mathematical Memory: The whiteboard is a terse LaTeX scratchpad that the planner reads and updates every step. This creates an external memory mechanism that persists across context window boundaries, allowing proof strategies to evolve over hours without losing coherence.

  3. Repository as Structured Knowledge Base: The repo/ directory stores named items — lemmas, observations, literature findings, failed approaches — as individual markdown files. Items are cross-referenced via `[[wikilink]]` syntax, creating a navigable knowledge graph that both planner and workers can query.

  4. Formal + Informal Dual-Track Proving: OpenProver operates in three modes — pure informal proving (PROOF.md), prove-and-formalize (PROOF.md + PROOF.lean), and formalize-only (given an existing informal proof, produce verified Lean 4 code). This acknowledges the reality that most mathematical reasoning happens informally first.

  5. Multi-Provider LLM Support: Unlike closed systems locked to one model provider, OpenProver supports Claude (via Claude Code CLI), Mistral's Leanstral (specialized for Lean), and any OpenAI-compatible model via vLLM. Different models can be assigned to planner vs. worker roles.

  6. Full Disk Persistence and Resumability: Every run creates a structured directory with all state on disk. Runs can be interrupted (Ctrl-C, machine crash) and resumed from the last completed step. This is critical for multi-hour proving sessions with expensive API calls.

Relationship to Prior Systems

| System | Year | Architecture | Formal Verification | Open Source |
| --- | --- | --- | --- | --- |
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Extraction + verification protocols | No (closed) |
| LeanCopilot | 2023 | Single-agent tactic suggestion | Lean 4 via LeanDojo | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Lean via LeanDojo | Partial |
| ReProver | 2023 | Encoder retrieval + generation | Lean 4 | Yes |
| OpenProver | 2025 | Planner-worker-whiteboard | Lean 4 via `lake env lean` + MCP tools | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Lean 4 via LeanDojo/CLI/LSP | Yes |

4 Supported Solutions

OpenProver targets mathematical proofs at varying levels of formality and difficulty:

| Solution Type | Description | Example |
| --- | --- | --- |
| Informal Proof | Natural-language mathematical proofs written as structured markdown | Irrationality of √2, infinitely many primes |
| Formal Lean 4 Proof | Machine-checked proofs compiled against Mathlib | Cauchy-Schwarz inequality in Lean 4 |
| Formalization | Converting an existing informal proof to verified Lean 4 | Given PROOF.md → produce PROOF.lean |
| Literature Survey | Web-enabled search for relevant mathematical results and papers | Finding prior work on Erdős problem 838 |
| Proof Strategy | Whiteboard-based high-level proof planning before execution | Decomposing a hard problem into lemmas |

Difficulty Spectrum Covered

| Problem | Difficulty | Status |
| --- | --- | --- |
| √2 is irrational | Easy (textbook) | Solved |
| Infinitely many primes | Easy (textbook) | Solved |
| e is irrational | Medium (undergraduate) | Addressed |
| Cauchy-Schwarz inequality | Medium (undergraduate) | Addressed |
| Erdős problem 205 | Medium (research-adjacent) | Addressed |
| Erdős problem 838 | Hard (open problem) | Attempted |
| Collatz conjecture | Open (famously unsolved) | Attempted |

Operating Modes

| Mode | Input | Goal | Termination Condition |
| --- | --- | --- | --- |
| prove | THEOREM.md | Informal proof | submit_proof → PROOF.md |
| prove_and_formalize | THEOREM.md + THEOREM.lean | Both proofs | PROOF.md + PROOF.lean |
| formalize_only | THEOREM.md + THEOREM.lean + PROOF.md | Formal proof | submit_proof succeeds → PROOF.lean |

5 LLM Integration

Multi-Provider Architecture

OpenProver implements a factory pattern for LLM clients, supporting three distinct provider backends:

| Provider | Client Class | Interface | Tool Support | Cost Tracking |
| --- | --- | --- | --- | --- |
| Claude (Anthropic) | LLMClient | Claude Code CLI subprocess | MCP server for lean_verify/lean_search | Yes (from Claude CLI response) |
| Leanstral (Mistral) | via OpenAI-compatible | Mistral API with MISTRAL_API_KEY | Native | Via API |
| Local/vLLM | HFClient | OpenAI-compatible HTTP at --provider-url | Native tool calling | No (local = free) |

Hierarchical Model Assignment

A distinctive feature is the ability to assign different models to different roles:

┌──────────────────────────────────────────┐
│           MODEL HIERARCHY                │
│                                          │
│  --planner-model opus                    │
│    ↓ Strategic reasoning, whiteboard     │
│    ↓ management, action selection        │
│                                          │
│  --worker-model sonnet                   │
│    ↓ Focused task execution, proof       │
│    ↓ exploration, verification           │
│                                          │
│  (Literature search gets web-enabled     │
│   Claude with WebSearch + WebFetch)      │
└──────────────────────────────────────────┘

Rationale: The planner sees a compressed view of the proof state (whiteboard + repo index) and must make strategic decisions — a task suited to more capable but expensive models (Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (Sonnet).

Claude CLI Integration Details

The Claude integration is notable for its depth. OpenProver wraps the claude CLI binary via subprocess.Popen:

Non-streaming mode:

claude -p --model <model> --system-prompt <...> \
  --output-format json --tools ""

Streaming mode:

claude -p --model <model> --system-prompt <...> \
  --output-format stream-json --verbose \
  --include-partial-messages --tools ""

Key implementation detail: streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.
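A minimal sketch of this readline-based streaming pattern, using a stand-in child process instead of the real claude binary (the CHILD event payloads are hypothetical, not the actual CLI schema):

```python
import json
import subprocess
import sys

# Stand-in child that emits NDJSON events, mimicking the shape (but not
# the exact schema) of `claude --output-format stream-json` output.
CHILD = """
import json, sys
for ev in [{"type": "assistant", "text": "step 1"},
           {"type": "assistant", "text": "step 2"},
           {"type": "result", "cost_usd": 0.01}]:
    sys.stdout.write(json.dumps(ev) + "\\n")
    sys.stdout.flush()
"""

def stream_ndjson(cmd):
    """Yield parsed events as each line arrives.

    readline() returns as soon as one full line is available; iterating
    the pipe object directly uses read-ahead buffering and delays display.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        while True:
            line = proc.stdout.readline()  # blocks until one line or EOF
            if not line:
                break
            yield json.loads(line)
    finally:
        proc.wait()

events = list(stream_ndjson([sys.executable, "-c", CHILD]))
```

Because each event is consumed the moment its line is flushed, the TUI can render planner output token-group by token-group rather than all at once at process exit.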

MCP Tool Calling: When --lean-project is set, the Claude CLI spawns an MCP server subprocess (lean/mcp_server.py) with:

--mcp-config <config> --strict-mcp-config \
--permission-mode bypassPermissions \
--allowedTools mcp__lean_tools__lean_verify mcp__lean_tools__lean_search

Tool events are detected from the NDJSON stream:

  • tool_use content blocks carry tool name and input (streamed incrementally)
  • Tool results appear as top-level {"type": "user"} messages containing tool_result entries matched by tool_use_id
  • MCP prefixes are stripped: mcp__lean_tools__lean_verify → lean_verify
  • Status is inferred from result text (e.g., results starting with "OK" = success)
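The detection steps above can be sketched as follows; the event shape is an assumption reconstructed from the description, not the exact Claude CLI schema:

```python
def strip_mcp_prefix(name: str) -> str:
    """mcp__lean_tools__lean_verify -> lean_verify."""
    return name.rsplit("__", 1)[-1]

def parse_tool_event(event: dict):
    """Pull a tool result out of one stream-json event (shape assumed).

    Results arrive as top-level {"type": "user"} messages whose content
    holds tool_result blocks matched by tool_use_id; status is inferred
    from whether the result text starts with "OK".
    """
    for block in event.get("message", {}).get("content", []):
        if block.get("type") == "tool_result":
            text = block.get("content", "")
            ok = isinstance(text, str) and text.startswith("OK")
            return {"id": block["tool_use_id"], "ok": ok}
    return None

event = {"type": "user",
         "message": {"content": [{"type": "tool_result",
                                  "tool_use_id": "t1",
                                  "content": "OK: compiled"}]}}
parsed = parse_tool_event(event)
```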

vLLM Integration for Local Models

For local models via vLLM, OpenProver uses the HFClient:

  • Health check on init via /health endpoint
  • Streaming via chunked transfer encoding
  • Multi-turn tool calling loop: LLM requests tool calls → _execute_worker_tool() dispatches → results appended to conversation → LLM continues
  • Supports MiniMax-M2.5 reasoning models with thinking tokens
  • Configurable --answer-reserve (default 4096 tokens) for local models that need token budget partitioning
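The multi-turn tool-calling loop can be sketched as below; fake_llm stands in for the HTTP call to the vLLM endpoint and execute_tool for the dispatch step, so both stubs are hypothetical:

```python
def execute_tool(name: str, args: dict) -> str:
    """Stand-in for the worker tool dispatcher (hypothetical stub)."""
    if name == "lean_verify":
        return "OK" if "sorry" not in args["code"] else "error: sorry found"
    raise ValueError(f"unknown tool: {name}")

def fake_llm(messages: list) -> dict:
    """Stand-in for a POST to an OpenAI-compatible chat endpoint."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "lean_verify",
                                "args": {"code": "theorem t : 1 = 1 := rfl"}}]}
    return {"content": "Verified."}

def chat(messages: list) -> str:
    """Loop: model requests tools -> execute -> append results -> continue."""
    while True:
        reply = fake_llm(messages)
        calls = reply.get("tool_calls")
        if not calls:  # no more tool requests: final answer
            return reply["content"]
        for call in calls:
            result = execute_tool(call["name"], call["args"])
            messages.append({"role": "tool", "content": result})
```

The loop terminates naturally once the model responds with plain content instead of tool calls.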

Web Search Integration

When --no-isolation is set, the planner can dispatch literature_search actions. These spawn a special worker with Claude CLI configured for web access:

--permission-mode bypassPermissions \
--allowedTools WebSearch WebFetch

This enables the system to find relevant mathematical papers and results online during proof search — a capability that mirrors how human mathematicians consult the literature when stuck.

6 Key Results

Demonstrated Capabilities

OpenProver is a tool-focused repository rather than a benchmark-reporting research paper. Specific quantitative results are limited, but the system demonstrates:

  1. Easy Theorems (Solved):
     • Irrationality of √2 — complete informal and formal proofs
     • Infinitely many primes — complete informal proofs

  2. Medium Theorems (Addressed):
     • Irrationality of e — proof strategies explored
     • Cauchy-Schwarz inequality — informal and formal proof attempts
     • Erdős problem 205 — multi-step proof exploration

  3. Hard/Open Problems (Attempted):
     • Erdős problem 838 — extended autonomous sessions with literature search
     • Collatz conjecture — proof exploration (unsolved, as expected)

Architectural Validation

The system validates that the Aletheia-style planner-worker architecture can be replicated with open-source components:

| Aletheia Feature | OpenProver Equivalent | Status |
| --- | --- | --- |
| Gemini Deep Think planner | Claude Opus / any LLM planner | Implemented |
| Parallel worker execution | ThreadPoolExecutor workers | Implemented |
| Whiteboard scratchpad | WHITEBOARD.md with LaTeX | Implemented |
| Repository of findings | repo/ directory with wikilinks | Implemented |
| Extraction + verification | submit_proof + Lean 4 verification | Implemented |
| Literature search | Web-enabled Claude workers | Implemented |
| Formal proof compilation | lake env lean + MCP tools | Implemented |

Lean Explore Search Pipeline

One of OpenProver's most technically sophisticated components is the lean_search tool, which searches ~400k Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std:

| Stage | Technique | Latency | Details |
| --- | --- | --- | --- |
| BM25 Retrieval | Keyword search over LLM-generated natural language descriptions | ~0.01s | Two indices with different tokenization, results merged |
| Semantic Retrieval | Qwen3-Embedding-0.6B (1024-dim), FAISS index, cosine similarity | ~1s | Pre-built index, sentence-transformer encoding |
| Score Fusion | Normalized combination: 0.3 × BM25 + 0.7 × semantic | | Dependency boost for related declarations |
| Reranking | Qwen3-Reranker-0.6B cross-encoder | ~1-2s (GPU) | Skipped on CPU (~50s); GPU recommended |
| Hydration | Full metadata lookup from SQLite | | Signature, docstring, source location |
First call takes ~10s (model loading). Subsequent calls ~1s.
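The fusion stage can be sketched as below, under the assumption that each stage's scores are min-max normalized before the 0.3/0.7 weighting (the exact normalization used is not stated in the source):

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize a {declaration: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse(bm25: dict, semantic: dict, w_bm25: float = 0.3,
         w_sem: float = 0.7) -> dict:
    """Combine the two retrieval stages with the 0.3/0.7 weighting."""
    nb, ns = normalize(bm25), normalize(semantic)
    return {k: w_bm25 * nb.get(k, 0.0) + w_sem * ns.get(k, 0.0)
            for k in set(nb) | set(ns)}

# Illustrative scores only; declaration names are real Mathlib-style names.
fused = fuse(bm25={"Nat.sqrt": 9.1, "Real.sqrt": 7.0},
             semantic={"Real.sqrt": 0.92, "Irrational": 0.85})
```

With the semantic weight dominating, a declaration that only the embedding stage ranks highly can still win the fused ranking.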

7 Reproducibility

Run Directory Structure

Every OpenProver run creates a fully self-contained, resumable directory:

runs/<slug>-<timestamp>/
├── THEOREM.md             # Immutable copy of input theorem
├── THEOREM.lean           # Formal Lean statement (if provided)
├── WHITEBOARD.md          # Current whiteboard state (enables resume)
├── PROOF.md               # Final informal proof (if found)
├── PROOF.lean             # Final formal proof (if lean mode)
├── DISCUSSION.md          # Post-session LLM analysis
├── run_config.toml        # Saved run configuration
├── repo/                  # Repository items
│   ├── lemma-foo.md       # "Summary: ...\n\nFull content"
│   └── observation-bar.md
├── steps/
│   ├── step_001/
│   │   ├── planner.toml   # Planner's TOML decision
│   │   └── workers/
│   │       ├── task_0.md   # Worker task description
│   │       ├── result_0.md # Worker output
│   │       └── worker_0_call.json
│   └── step_002/...
└── archive/
    └── calls/
        └── call_001.json  # Full LLM call record

Resumability Protocol

Resume detection: check for existing WHITEBOARD.md + THEOREM.md. Step count inferred from step_NNN directories. Settings restored from run_config.toml; CLI flags override saved values.

This means a proving session can survive:

  • Machine crashes / power loss
  • Ctrl-C interruptions
  • Context window exhaustion
  • Network failures during API calls
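The resume protocol can be sketched as below; detect_resume is an illustrative stand-in, not OpenProver's actual API:

```python
import re
import tempfile
from pathlib import Path

def detect_resume(run_dir) -> int:
    """Return the next step number for an existing run, or 1 for a fresh one.

    Mirrors the protocol above: a run is resumable when WHITEBOARD.md and
    THEOREM.md both exist, and the step count is inferred from the
    steps/step_NNN directories.
    """
    run = Path(run_dir)
    if not ((run / "WHITEBOARD.md").exists() and (run / "THEOREM.md").exists()):
        return 1
    steps = [int(m.group(1)) for p in (run / "steps").glob("step_*")
             if (m := re.fullmatch(r"step_(\d+)", p.name))]
    return max(steps, default=0) + 1

# Demo: a run directory with two completed steps resumes at step 3.
demo = Path(tempfile.mkdtemp())
(demo / "steps" / "step_001").mkdir(parents=True)
(demo / "steps" / "step_002").mkdir()
(demo / "WHITEBOARD.md").write_text("# scratch")
(demo / "THEOREM.md").write_text("theorem")
next_step = detect_resume(demo)
```

Settings would then be restored from run_config.toml, with CLI flags overriding the saved values.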

Artifact Completeness

Each archived LLM call (call_NNN.json) records:

| Field | Content |
| --- | --- |
| prompt | Full input prompt |
| system_prompt | System prompt used |
| schema | JSON schema (if structured output) |
| response | Complete LLM response |
| cost | Cost in USD |
| timing | Wall-clock duration |
| errors | Any errors encountered |

Inspection Mode

openprover inspect runs/<run-dir>

Browse any historical run in the TUI in read-only mode, reviewing every step's planner decisions, worker tasks, and results.

8 Compute and API Costs

Budget System

OpenProver implements explicit budget tracking with two modes:

| Budget Mode | Flag | Default | Description |
| --- | --- | --- | --- |
| Time budget | --max-time | 4 hours | Wall-clock time limit |
| Token budget | --max-tokens | | Cumulative output token limit |

The Budget class exposes three checkpoints:

  • is_exhausted: budget fully consumed
  • should_conclude (at --conclude-after, default 0.99): planner enters conclusion phase
  • can_give_up (at --give-up-after, default 0.50): give_up action becomes available

The planner prompt receives budget status every step so it can pace its exploration.
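A time-budget sketch of those three checkpoints follows; the property and flag names track the description above, but the internals are assumed:

```python
import time

class Budget:
    """Minimal time-budget sketch (token budgets would track output
    tokens the same way)."""

    def __init__(self, max_seconds: float, conclude_after: float = 0.99,
                 give_up_after: float = 0.50):
        self.start = time.monotonic()
        self.max_seconds = max_seconds
        self.conclude_after = conclude_after
        self.give_up_after = give_up_after

    @property
    def used_fraction(self) -> float:
        return (time.monotonic() - self.start) / self.max_seconds

    @property
    def is_exhausted(self) -> bool:
        return self.used_fraction >= 1.0

    @property
    def should_conclude(self) -> bool:
        return self.used_fraction >= self.conclude_after

    @property
    def can_give_up(self) -> bool:
        return self.used_fraction >= self.give_up_after

budget = Budget(max_seconds=4 * 3600)  # the 4-hour default
```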

Cost Estimation

| Scenario | Model | Parallelism | Time | Estimated Cost |
| --- | --- | --- | --- | --- |
| Easy theorem (√2 irrational) | Sonnet | 1 | ~10-30 min | ~$1-5 |
| Medium theorem (Cauchy-Schwarz) | Sonnet | 1 | ~1-2 hours | ~$10-30 |
| Hard exploration (Erdős 838) | Opus planner + Sonnet workers | 3 | 2-4 hours | ~$50-200 |
| Local model run | vLLM (MiniMax-M2.5) | 1 | Variable | $0 (compute only) |

Note: These are rough estimates. Actual costs depend heavily on model pricing, theorem difficulty, and search efficiency. Every call's cost is logged in archive/calls/ for precise post-hoc analysis.

Resource Requirements

| Resource | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10+ | 3.11+ (for tomllib) |
| RAM | 4 GB | 8+ GB (for Lean Explore models) |
| GPU | None (CPU works) | CUDA GPU (for lean_search reranking) |
| Disk | ~500 MB (base) | ~2 GB (with Lean Explore data) |
| Network | Required (for API models) | |
| Lean 4 | Optional | Required for formal verification |

9 Architecture Solution

System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        OpenProver System                           │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      CLI (cli.py)                            │  │
│  │  Parse args → Setup TUI → Create Prover → Run → Print cost  │  │
│  └──────────┬───────────────────────────────────┬───────────────┘  │
│             │                                   │                  │
│  ┌──────────▼──────────┐         ┌──────────────▼──────────────┐  │
│  │     Prover Loop     │         │         TUI (tui.py)        │  │
│  │    (prover.py)      │◄───────►│  Full-screen terminal UI    │  │
│  │                     │ display │  Tabs: planner + workers    │  │
│  │  ┌───────────────┐  │         │  Streaming + key handling   │  │
│  │  │  Whiteboard   │  │         │  ANSI escape codes          │  │
│  │  │ (LaTeX pad)   │  │         └─────────────────────────────┘  │
│  │  └───────────────┘  │                                          │
│  │  ┌───────────────┐  │                                          │
│  │  │  Repository   │  │                                          │
│  │  │ (repo/*.md)   │  │                                          │
│  │  │ + wikilinks   │  │                                          │
│  │  └───────────────┘  │                                          │
│  │  ┌───────────────┐  │                                          │
│  │  │ Output Window │  │                                          │
│  │  │ (last 3 outs) │  │                                          │
│  │  └───────────────┘  │                                          │
│  └──────────┬──────────┘                                          │
│             │                                                      │
│  ┌──────────▼──────────────────────────────────────────────────┐  │
│  │                    Action Dispatch                          │  │
│  │                                                             │  │
│  │  spawn ──────► Workers (ThreadPoolExecutor, up to -P)      │  │
│  │  literature_search ──► Web-enabled Claude worker           │  │
│  │  read_items ──► Fetch repo/*.md content                    │  │
│  │  write_items ──► Create/update/delete repo items           │  │
│  │  write_whiteboard ──► Update WHITEBOARD.md                 │  │
│  │  read_theorem ──► Return THEOREM.md + .lean + PROOF.md    │  │
│  │  submit_proof ──► Save PROOF.md, verify PROOF.lean        │  │
│  │  give_up ──► Terminate (after threshold)                   │  │
│  └─────────────────────────────────┬───────────────────────────┘  │
│                                    │                               │
│  ┌─────────────────────────────────▼───────────────────────────┐  │
│  │                    LLM Layer (llm.py)                       │  │
│  │                                                             │  │
│  │  ┌──────────────────┐   ┌──────────────────┐               │  │
│  │  │  LLMClient       │   │  HFClient        │               │  │
│  │  │  (Claude CLI)    │   │  (vLLM/OpenAI)   │               │  │
│  │  │                  │   │                   │               │  │
│  │  │  • Popen+readline│   │  • HTTP streaming │               │  │
│  │  │  • NDJSON parsing│   │  • Tool calling   │               │  │
│  │  │  • MCP config    │   │  • Health check   │               │  │
│  │  │  • Cost tracking │   │  • Chat() loop    │               │  │
│  │  └──────────────────┘   └──────────────────┘               │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                    Lean Layer (lean/)                        │  │
│  │                                                             │  │
│  │  core.py         │ Parse THEOREM.lean, assemble proofs,    │  │
│  │                  │ run `lake env lean`, LeanWorkDir         │  │
│  │  mcp_server.py   │ MCP server: lean_verify + lean_search   │  │
│  │  data.py         │ LeanExplore data management             │  │
│  │  tools.py        │ Tool defs for vLLM native calling       │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                    Support Modules                          │  │
│  │                                                             │  │
│  │  prompts.py  │ All prompt templates, TOML parser, actions  │  │
│  │  budget.py   │ Time/token budget tracking                  │  │
│  │  inspect.py  │ Read-only run browser                       │  │
│  └─────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Data Flow Per Step

Step N begins
    │
    ▼
┌────────────────────────────────────────┐
│ Build planner prompt:                  │
│  • Current whiteboard                  │
│  • Repository index (one-line summaries)│
│  • Last 3 outputs (rolling window)     │
│  • Budget status                       │
│  • Mode-specific status indicators     │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Planner LLM call (streaming)          │
│  Response must contain ```toml block  │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Parse TOML:                            │
│  • action (one of 8 actions)          │
│  • summary                            │
│  • whiteboard update                   │
│  • tasks, items, search_query          │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Dispatch to action handler             │
│  (one of _handle_* methods)           │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Save step to disk:                     │
│  • steps/step_NNN/planner.toml        │
│  • steps/step_NNN/workers/            │
│  • archive/calls/                     │
│  • Update WHITEBOARD.md               │
└────────────────────────────────────────┘
                │
                ▼
          Step N+1

10 Component Breakdown

Module Architecture

| Module | File | Lines (est.) | Responsibility |
| --- | --- | --- | --- |
| CLI | cli.py | ~200 | Argument parsing, TUI setup, signal handlers, cost summary |
| Prover | prover.py | ~800 | Core loop, action dispatch, Repo class, wikilink resolution, operating mode logic |
| LLM | llm.py | ~500 | LLMClient (Claude CLI), HFClient (vLLM), streaming, archiving, cost tracking |
| Prompts | prompts.py | ~600 | System prompts, prompt formatters, TOML parser, actions enum |
| Budget | budget.py | ~100 | Time/token tracking, threshold calculations |
| TUI | tui.py | ~800 | Full-screen terminal UI, tab system, streaming display, key handling |
| Inspect | inspect.py | ~150 | Read-only run browser |
| Lean Core | lean/core.py | ~300 | LeanTheorem parser, proof assembly, run_lean_check, LeanWorkDir |
| Lean MCP | lean/mcp_server.py | ~200 | MCP server exposing lean_verify and lean_search |
| Lean Data | lean/data.py | ~150 | LeanExplore data management, dependency installation |
| Lean Tools | lean/tools.py | ~100 | Tool definitions for vLLM native tool calling |

Dependency Graph

cli.py
  ├── prover.py
  │     ├── llm.py
  │     ├── prompts.py
  │     ├── budget.py
  │     └── lean/
  │           ├── core.py
  │           ├── data.py
  │           ├── mcp_server.py (subprocess)
  │           └── tools.py
  ├── tui.py
  └── inspect.py

The dependency graph is clean and acyclic. lean/ is a self-contained subpackage. prompts.py has no code dependencies beyond the standard library.

External Dependencies

OpenProver has a remarkably minimal dependency footprint:

[project]
requires-python = ">=3.10"
dependencies = [
  "mcp>=1.0",
]

Only the MCP library is a hard dependency. All other functionality comes from:

  • Python standard library (subprocess, json, tomllib, threading, os, select)
  • Claude CLI (system binary)
  • Optional: Lean 4, LeanDojo, lean-explore, torch, sentence-transformers

11 Core Mechanisms (Detailed)

11.1 Planner-Worker Protocol

The planner operates in a strict loop. Each step:

  1. Prompt Construction: The planner receives a structured prompt containing:
     • The current whiteboard (mathematical scratchpad)
     • A one-line index of all repository items
     • The last 3 outputs from previous steps (rolling window)
     • Budget status (time/tokens remaining, can_conclude, can_give_up)
     • Mode-specific status indicators (theorem read, proof found, formal proof status)

  2. TOML Decision: The planner responds with a fenced TOML block:

```toml
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = """
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed)
"""

[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference [[parity-lemma]]."

[[tasks]]
description = "Verify the infinite descent argument terminates."
```

  3. TOML Parsing: parse_planner_toml() extracts the TOML from fenced code blocks using tomllib (Python 3.11+), tomli (3.10 fallback), or a minimal regex-based parser. It handles both [[tasks]] array-of-tables syntax and lean_block_N numbered keys.

  4. Action Dispatch: Based on the action field, one of 8 handlers executes.
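The extraction-and-parse step can be sketched as below; this is a simplified take on parse_planner_toml() that only handles the fenced-block case:

```python
import re

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # 3.10 fallback, as in the project

FENCE = "```"

def parse_planner_toml(response: str) -> dict:
    """Extract and parse the first fenced toml block from an LLM reply.

    A sketch of the extraction step only; the real parser also handles
    lean_block_N numbered keys and a regex-based fallback.
    """
    m = re.search(FENCE + r"toml\s*\n(.*?)" + FENCE, response, re.DOTALL)
    if m is None:
        raise ValueError("planner response contains no toml block")
    return tomllib.loads(m.group(1))

# A hypothetical planner reply mixing prose with the TOML decision.
reply = "\n".join([
    "Reasoning about strategy...",
    FENCE + "toml",
    'action = "spawn"',
    'summary = "Try the parity argument"',
    "",
    "[[tasks]]",
    'description = "Show the parity lemma."',
    FENCE,
])
decision = parse_planner_toml(reply)
```

TOML's multi-line strings are what make the whiteboard field easy for the LLM to emit without escaping.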

11.2 Worker Execution

Workers are stateless, parallel, and focused:

Planner spawns task
    │
    ▼
Resolve [[wikilinks]] in task description
    │ (fetch referenced repo items)
    │
    ▼
Build worker prompt:
    │ • Task description + resolved references
    │ • System prompt (mathematician role)
    │
    ▼
LLM call (streaming to worker tab)
    │
    ├── [If lean_worker_tools enabled]
    │   ├── lean_verify(code) → compile in LeanWorkDir
    │   └── lean_search(query) → search ~400k declarations
    │
    ▼
Return result to planner

Parallel Execution: Workers run in a ThreadPoolExecutor with concurrency bounded by --parallelism (default 1). Each worker is completely independent — they share no state, only read from the repository.
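The spawn step can be sketched with the same ThreadPoolExecutor pattern; run_worker is a hypothetical stub standing in for one worker LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_worker(task: str) -> str:
    # Stand-in for one stateless worker LLM call (hypothetical stub).
    return f"result for: {task}"

def spawn_workers(tasks, parallelism: int = 1):
    """Run worker tasks concurrently; map() returns results in task order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_worker, tasks))

results = spawn_workers(["prove parity lemma", "check descent step"],
                        parallelism=2)
```

Because workers share no mutable state, no locking is needed beyond what the executor provides.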

Verification Workers: The planner can specifically task workers with verification. A verifier worker sees only the proof text (not the reasoning) and must end with VERDICT: CORRECT or VERDICT: INCORRECT.

11.3 Whiteboard Management

The whiteboard is the planner's external mathematical memory:

┌─────────────────────────────────────────┐
│ WHITEBOARD.md                            │
│                                          │
│ # Goal: [theorem statement]              │
│ ## Strategy: [current approach]          │
│ ## Status: [where we are]               │
│ ## Tried: [approaches attempted]        │
│ ## Key Observations:                    │
│   - [observation 1]                     │
│   - [observation 2]                     │
│ ## Open Questions:                      │
│   - [question 1]                        │
└─────────────────────────────────────────┘

Design Principles:

  • Terse LaTeX-friendly format (abbreviations encouraged)
  • Updated by the planner every step
  • Persists to disk (enables resume)
  • Read by the planner at the start of every step
  • Functions as external memory beyond context window limits

11.4 Repository as Structured Knowledge Base

The repository is a file-based knowledge graph:

repo/
├── sqrt2-parity-lemma.md
│   Summary: p² = 2q² implies p even.
│
│   Proof: If p is odd, p² is odd. But 2q² is even. Contradiction.
│
├── descent-argument.md
│   Summary: Infinite descent shows no reduced fraction exists.
│
│   [Full argument with LaTeX]
│
└── algebraic-number-context.md
    Summary: Literature review on algebraic irrationality proofs.

    [Web search results and analysis]

Wikilink Resolution: Before a worker receives its task, repo.resolve_wikilinks():

  1. Finds all [[slug]] references in the task description
  2. Fetches the content of each referenced item
  3. Appends a "Referenced Materials" section to the worker's input

This creates a lightweight knowledge management system where the planner can share proven lemmas, observations, or literature findings with workers without duplicating content.
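A minimal sketch of this resolution step, with an in-memory dict standing in for reading repo/<slug>.md files:

```python
import re

def resolve_wikilinks(task: str, repo: dict) -> str:
    """Append a Referenced Materials section for each [[slug]] in the task.

    `repo` maps slug -> item content; the section layout here is an
    assumption, not OpenProver's exact output format.
    """
    slugs = re.findall(r"\[\[([^\]]+)\]\]", task)
    refs = [f"### {slug}\n{repo[slug]}" for slug in slugs if slug in repo]
    if not refs:
        return task
    return task + "\n\n## Referenced Materials\n" + "\n".join(refs)

repo = {"parity-lemma": "Summary: p² = 2q² implies p even."}
out = resolve_wikilinks("Use [[parity-lemma]] to finish.", repo)
```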

11.5 Lean 4 Formal Verification

When --lean-project is set, formal verification integrates at two levels:

Level 1: Repository Item Verification

When the planner writes an item with format="lean", it is:

  1. Written to <lean-project>/OpenProver-<id>/<slug>-<random>.lean
  2. Compiled via lake env lean <file> from the project directory
  3. Pass/fail feedback (with compiler errors) returned to planner

Level 2: Final Proof Assembly

When the planner submits a proof in lean mode:

  1. LeanTheorem parses THEOREM.lean, locating all sorry positions
  2. The planner provides N replacement blocks (one per sorry) + optional context
  3. assemble_proof() replaces each sorry with its corresponding block (reverse order for offset preservation)
  4. Validates: correct replacement count, no import in injected code
  5. Runs lake env lean on the assembled file
  6. Success → writes PROOF.lean; failure → compiler errors fed back
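The assembly step can be sketched as below; a simplified take on assemble_proof() showing why replacement runs right-to-left (earlier character offsets stay valid):

```python
def assemble_proof(statement: str, blocks: list) -> str:
    """Replace each `sorry` in the statement with its proof block.

    Replacing from the last occurrence backward means earlier offsets
    are not shifted by earlier substitutions.
    """
    positions = []
    i = statement.find("sorry")
    while i != -1:
        positions.append(i)
        i = statement.find("sorry", i + 1)
    if len(positions) != len(blocks):
        raise ValueError(f"expected {len(positions)} blocks, got {len(blocks)}")
    out = statement
    for pos, block in zip(reversed(positions), reversed(blocks)):
        if "import" in block:  # injected code must not add imports
            raise ValueError("import not allowed in injected code")
        out = out[:pos] + block + out[pos + len("sorry"):]
    return out

thm = "theorem two_add : 1 + 1 = 2 := by sorry"
assembled = assemble_proof(thm, ["rfl"])
```

The assembled file would then be compiled with lake env lean; this sketch stops before that step.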

Level 3: Worker Tools via MCP

When --lean-worker-tools is enabled:

  • Claude workers: MCP server subprocess provides lean_verify and lean_search
  • vLLM workers: LeanExplore search service initialized in-process; native OpenAI tool calling

11.6 Web-Enabled Literature Search

When --no-isolation is set:

Planner action: literature_search
    │
    ▼
Spawn web-enabled Claude worker with:
    • WebSearch tool (web queries)
    • WebFetch tool (page retrieval)
    │
    ▼
Worker searches for relevant:
    • Published papers
    • Known results
    • Prior proof attempts
    • Related theorems
    │
    ▼
Results stored as repo items
    │
    ▼
Available to all future workers via [[wikilinks]]

This is one of OpenProver's most distinctive features — it enables the prover to consult the mathematical literature mid-proof, just as a human mathematician would.

11.7 Budget-Aware Planning

The planner receives budget status every step:

Budget: 45% remaining (1h 48m of 4h)
Conclusion phase: not yet (triggers at 99%)
Give-up threshold: passed (50%)

This enables adaptive behavior:

  • Early steps: aggressive exploration, spawn many workers
  • Middle steps: focus on most promising avenues
  • Late steps: consolidate findings, attempt proof submission
  • Conclusion phase: wrap up, write DISCUSSION.md

12 Programming Language

Language Breakdown

| Language | Percentage | Usage |
| --- | --- | --- |
| Python | 92.9% | Core system (all modules) |
| JavaScript | 4.2% | TUI components |
| CSS | 2.4% | TUI styling |
| Other | 0.5% | Configuration, shell scripts |

Python Version Requirements

  • Minimum: Python 3.10 (for match statements, type union syntax)
  • Recommended: Python 3.11+ (for tomllib in standard library; 3.10 uses tomli fallback)
  • No async: The system uses threading and subprocess, not asyncio

Notable Implementation Choices

  1. No Heavy ML Frameworks Required: The core system depends only on mcp>=1.0. Lean search dependencies (torch, sentence-transformers) are optional and installed on demand.

  2. Subprocess-Based LLM Integration: Rather than using SDK libraries, OpenProver calls the Claude CLI binary directly via subprocess.Popen. This is unconventional but provides maximum control over streaming behavior and tool configuration.

  3. TOML as Structured Output Format: The planner's decisions are structured as TOML blocks within natural language responses. This is a pragmatic choice — TOML is simpler than JSON for LLMs to generate (no escaping issues) and supports multi-line strings naturally.

  4. File-Based State Management: All state lives on disk as plain files (markdown, TOML, JSON). No database, no in-memory state that survives process death. This makes the system trivially inspectable and resumable.

  5. Thread Safety via Write Lock: All TUI stdout writes are protected by a _write_lock. The background key-reading thread doesn't hold the lock during I/O, preventing deadlocks while maintaining display consistency.

13 Memory Management

External Memory Architecture

OpenProver implements a multi-layered external memory system that compensates for LLM context window limitations:

┌─────────────────────────────────────────────────┐
│              MEMORY HIERARCHY                    │
│                                                  │
│  Layer 1: Whiteboard (WHITEBOARD.md)             │
│  ├── Capacity: ~2-4 KB (terse LaTeX)             │
│  ├── Persistence: Per-step to disk               │
│  ├── Visibility: Planner only                    │
│  ├── Purpose: Strategic state, current approach  │
│  └── Update: Every step by planner               │
│                                                  │
│  Layer 2: Repository (repo/*.md)                 │
│  ├── Capacity: Unlimited (one file per item)     │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Planner (index), Workers (via   │
│  │               wikilinks)                      │
│  ├── Purpose: Proven lemmas, observations,       │
│  │            literature, failed approaches       │
│  └── Update: Planner CRUD via write_items        │
│                                                  │
│  Layer 3: Output Window (last 3 outputs)         │
│  ├── Capacity: 3 items (rolling)                 │
│  ├── Persistence: In-memory only                 │
│  ├── Visibility: Planner only                    │
│  ├── Purpose: Recent worker results, feedback    │
│  └── Update: Automatic (FIFO)                    │
│                                                  │
│  Layer 4: Step Archive (steps/step_NNN/)         │
│  ├── Capacity: Unlimited                         │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Inspect mode only               │
│  ├── Purpose: Full audit trail                   │
│  └── Update: Append-only                         │
│                                                  │
│  Layer 5: LLM Call Archive (archive/calls/)      │
│  ├── Capacity: Unlimited                         │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Post-hoc analysis               │
│  ├── Purpose: Cost tracking, debugging,          │
│  │            reproducibility                    │
│  └── Update: Append-only                         │
└─────────────────────────────────────────────────┘

The wikilink system functions as a lightweight retrieval mechanism:

  1. Planner stores a lemma as repo/parity-lemma.md
  2. Later task references [[parity-lemma]]
  3. resolve_wikilinks() fetches the full content
  4. Worker receives task + appended reference materials

This is superior to dumping all context into every prompt because:

  • Tokens are spent only when content is referenced
  • Workers see only what's relevant to their specific task
  • The planner controls what knowledge flows to which worker
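The four resolution steps above can be sketched in a few lines. The `[[slug]]` syntax and the function shape are assumptions based on the description, not OpenProver's exact implementation:

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([a-z0-9-]+)\]\]")

def resolve_wikilinks(task: str, repo_dir: Path) -> str:
    """Append the full content of every [[slug]] referenced in a task,
    so the worker receives only the repo items relevant to it."""
    sections = []
    for slug in dict.fromkeys(WIKILINK.findall(task)):  # dedupe, keep order
        item = repo_dir / f"{slug}.md"
        if item.exists():
            sections.append(f"## Reference: {slug}\n\n"
                            f"{item.read_text(encoding='utf-8')}")
    if not sections:
        return task
    return task + "\n\n" + "\n\n".join(sections)
```

Unresolvable slugs are silently skipped here; a production harness would more likely surface them to the planner as feedback.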

Context Window Management

The planner sees a compressed view of the proof state:

| Component      | Typical Size      | Purpose                     |
|----------------|-------------------|-----------------------------|
| Whiteboard     | ~500-2000 tokens  | Current strategy and status |
| Repo index     | ~50-200 tokens    | One-line summary per item   |
| Last 3 outputs | ~1000-3000 tokens | Recent worker results       |
| Budget status  | ~50 tokens        | Pacing information          |
| System prompt  | ~1000-2000 tokens | Role and action definitions |

Total planner context: ~3000-7000 tokens per step — well within any frontier model's capacity. This means the planner never hits context limits, even in multi-hour sessions.
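Assembling this compressed view is plain string concatenation over the components in the table. A sketch assuming each component is already rendered as text (names and section headers are illustrative):

```python
def build_planner_context(whiteboard: str, repo_index: list[str],
                          recent_outputs: list[str], budget_line: str) -> str:
    """Assemble the compressed per-step planner prompt: whiteboard,
    one-line repo index, the last three worker outputs, budget status."""
    parts = [
        "# Whiteboard\n" + whiteboard,
        "# Repository index\n" + "\n".join(repo_index),
        "# Recent worker outputs\n" + "\n---\n".join(recent_outputs[-3:]),
        "# Budget\n" + budget_line,
    ]
    return "\n\n".join(parts)
```

The slice `recent_outputs[-3:]` is what keeps the context bounded regardless of run length: older worker outputs survive only if the planner promoted them into the whiteboard or repository.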

No Embedding-Based Memory

Unlike systems that use vector databases for semantic retrieval over past experiences, OpenProver relies entirely on:

  • Structured naming (repo item slugs)
  • Explicit references (wikilinks)
  • Planner intelligence (the LLM decides what to store and reference)

This is simpler and more transparent but relies on the planner's ability to curate the repository effectively.

14 Continued Learning

Learning Within a Run

OpenProver demonstrates within-session learning through its repository mechanism:

Step 1: Worker explores approach A, fails
    → Planner stores "approach-a-failure.md" in repo
    → Whiteboard updated: "Tried: approach A (failed due to X)"

Step 5: Worker attempts related approach
    → Task includes [[approach-a-failure]] reference
    → Worker avoids repeating the same mistake

Step 10: Planner recognizes pattern across failures
    → Stores "key-insight.md" with synthesized observation
    → Redirects search based on accumulated evidence

Cross-Run Learning

OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement.

What's Missing (and Why It Matters):

| Capability                  | Status          | Impact                                              |
|-----------------------------|-----------------|-----------------------------------------------------|
| Proof strategy transfer     | Not implemented | Can't reuse strategies from similar theorems        |
| Tactic preference learning  | Not implemented | Can't learn which tactics work for which goal types |
| Failure pattern recognition | Not implemented | Can't avoid known dead ends from prior runs         |
| Repository bootstrapping    | Not implemented | Can't seed a new run with relevant prior knowledge  |
| Fine-tuning from traces     | Not implemented | Can't train better models from successful proofs    |

Potential for Evolutionary Integration

OpenProver's architecture has natural extension points for evolutionary methods:

  1. Repository Seeding: Pre-populate repo/ with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports this — the planner just needs a way to discover relevant prior knowledge.

  2. Strategy Evolution: The whiteboard format could be used as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress toward proof.

  3. Prompt Population: The planner system prompt could be evolved, with different versions competing on benchmark problems.

  4. Model Selection: The planner/worker model assignment could be treated as a hyperparameter optimized across runs.

Post-Session Analysis

Every completed run generates a DISCUSSION.md via a dedicated LLM call that analyzes:

  • What worked and what didn't
  • Key insights and observations
  • Suggestions for future attempts
  • Assessment of the proof's strength (for informal proofs)

This creates a self-reflective artifact that, while not currently fed into future runs, could serve as input for a cross-run learning system.

15 Applications

Primary Application: Mathematical Proof Discovery

OpenProver's primary use case is assisting mathematicians with proof discovery at various difficulty levels:

| Application           | Mode                            | Description                                                          |
|-----------------------|---------------------------------|----------------------------------------------------------------------|
| Homework assistance   | prove                           | Generate informal proofs for undergraduate-level theorems            |
| Research exploration  | prove + literature_search       | Explore proof strategies for research-level problems with web search |
| Formalization         | formalize_only                  | Convert an existing informal proof to verified Lean 4                |
| Proof + Formalization | prove_and_formalize             | Find a proof and produce verified Lean 4 code simultaneously         |
| Problem exploration   | prove (autonomous, long budget) | Extended autonomous exploration of hard/open problems                |

Research Applications

  1. Benchmarking LLMs for Mathematical Reasoning: OpenProver provides a controlled environment for comparing different LLMs' mathematical reasoning abilities. By logging every LLM call with full prompts and responses, researchers can analyze:
     • Which models propose better tactics
     • How reasoning depth correlates with proof success
     • Cost-quality tradeoffs between model tiers

  2. Architecture Research: The planner-worker architecture is configurable enough to study:
     • Effect of parallelism on proof search efficiency
     • Value of literature search for hard problems
     • Whiteboard strategies and their impact on proof direction

  3. Formal Verification Pipeline Research: The Lean integration enables studying:
     • Informal-to-formal translation accuracy
     • Error patterns in LLM-generated Lean code
     • Effectiveness of compiler feedback for repair

Comparison with Alternative Approaches

| Approach                               | OpenProver Strength                                              | OpenProver Limitation                                          |
|----------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------|
| vs. Single-agent provers (LeanCopilot) | Richer strategic reasoning, literature search, persistent memory | Higher token cost, more complex setup                          |
| vs. RL-trained provers (AlphaProof)    | No training required, works with any LLM, open source            | No learned value function, no policy improvement               |
| vs. Search-based provers (UlamAI)      | Richer proof state (whiteboard + repo), informal reasoning       | Less sophisticated tactic-level search, no transposition table |
| vs. Closed systems (Aletheia)          | Open source, inspectable, resumable, multi-provider              | Smaller scale, no proprietary model access                     |

Integration Scenarios

For Individual Mathematicians:

# Quick exploration of a conjecture
openprover my_conjecture.md --autonomous --max-time 1h

# Deep dive with literature search
openprover hard_problem.md --no-isolation --model opus -P 3 --max-time 4h

# Formalize a proof for a paper
openprover theorem.md --lean-project ~/mathlib4 \
  --lean-theorem theorem.lean --proof proof.md

For Research Groups:

  • Batch proving across theorem suites
  • Comparative evaluation of models/strategies
  • Training data generation from proof traces

For Education:

  • Step-by-step proof exploration in interactive mode
  • Students can give feedback at each step, guiding the proof
  • TUI provides visibility into the reasoning process

Limitations and Honest Assessment

| Limitation            | Details                                                         |
|-----------------------|-----------------------------------------------------------------|
| No benchmark results  | No systematic evaluation on standard benchmarks (miniF2F, etc.) |
| No cross-run learning | Each run starts from scratch                                    |
| Claude-centric        | Default configuration assumes Claude Code CLI on PATH           |
| No RL/SFT pipeline    | Pure prompting; no model improvement from traces                |
| No value model        | Search relies on LLM intuition, not learned value functions     |
| Solo development      | Bus factor of 1; sustainability risk                            |
| Informal-first bias   | Strongest at informal proofs; formal verification is secondary  |

Significance for the Harness Systems Landscape

OpenProver occupies a critical point in the harness systems taxonomy: it demonstrates that frontier proof agents can be built with open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. Its planner-worker-whiteboard architecture, replicated and open-sourced from Aletheia's design, provides a reusable pattern for any agentic system that needs to coordinate strategic reasoning with parallel execution over extended time horizons.

The key insight is that persistent structured state (whiteboard + repository + wikilinks) can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models. This architectural pattern generalizes beyond theorem proving to any discovery process that requires strategic coordination, literature consultation, and incremental knowledge accumulation.