
OpenProver

Automated theorem prover using a planner-worker agentic architecture with frontier LLMs, inspired by Google DeepMind's Aletheia

Organization: Independent (Matěj Kripner / Charles University, Prague)
Published: February 2025 (initial); v1.0.1, March 2026 (current)
Type: Open-source repository + PyPI package
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: OpenProver — Automated Theorem Prover Inspired by Aletheia; Claude Code for Mathematicians

Repository: github.com/Kripner/openprover

PyPI Package: openprover v1.0.1

Tagline: "Theorem prover powered by language models."

Lineage: Directly inspired by Google DeepMind's Aletheia — the fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6/10 open research problems from the FirstProof challenge (February 2025). OpenProver replicates and open-sources the planner-worker-whiteboard architecture that Aletheia demonstrated at the frontier of automated mathematical reasoning.

Related Work by Author: Matěj Kripner is also a contributor to LeanTree — a white-box approach to proof search that factorizes complex proof states in Lean 4 into simpler branches for more efficient automated theorem proving (published at UFAL, Charles University, Prague).

Citation:

@misc{openprover,
  author = {Matěj Kripner},
  title = {OpenProver},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kripner/openprover}
}

2 Authors and Team

Primary Author: Matěj Kripner (kripner@ufal.mff.cuni.cz)

Affiliation: Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Team Size: Solo developer with 388 commits on the master branch (as of April 2026). Community contributions via 11 forks.

Background: Kripner's work sits at the intersection of formal methods and machine learning applied to mathematics. His LeanTree research addresses the complementary problem of making proof search more efficient through state factorization, while OpenProver tackles the higher-level orchestration of LLM-guided proving.

Repository Activity:

| Metric | Value |
| --- | --- |
| Commits | 388 |
| Stars | 67+ |
| Forks | 11 |
| Primary Language | Python (92.9%) |
| Secondary Languages | JavaScript (4.2%), CSS (2.4%) |
| License | Not explicitly specified |

3 Core Contribution

Key Novelty: OpenProver is the first open-source implementation of the planner-worker-whiteboard architecture for automated theorem proving that Aletheia demonstrated at Google DeepMind. It bridges the gap between closed frontier proof agents and the open research community by providing a fully inspectable, resumable, multi-LLM orchestration framework for both informal and formal mathematical reasoning.

What Makes OpenProver Distinctive

  1. Planner-Worker Separation with Persistent State: Unlike single-agent proving systems that maintain one long context window, OpenProver separates the high-level strategic reasoning (planner) from focused execution (workers). The planner maintains a whiteboard and repository across steps, while workers are stateless and parallelizable. This mirrors how human mathematicians work — a strategist coordinates specialists.

  2. Whiteboard as External Mathematical Memory: The whiteboard is a terse LaTeX scratchpad that the planner reads and updates every step. This creates an external memory mechanism that persists across context window boundaries, allowing proof strategies to evolve over hours without losing coherence.

  3. Repository as Structured Knowledge Base: The repo/ directory stores named items — lemmas, observations, literature findings, failed approaches — as individual markdown files. Items are cross-referenced via `[[wikilink]]` syntax, creating a navigable knowledge graph that both planner and workers can query.

  4. Formal + Informal Dual-Track Proving: OpenProver operates in three modes — pure informal proving (PROOF.md), prove-and-formalize (PROOF.md + PROOF.lean), and formalize-only (given an existing informal proof, produce verified Lean 4 code). This acknowledges the reality that most mathematical reasoning happens informally first.

  5. Multi-Provider LLM Support: Unlike closed systems locked to one model provider, OpenProver supports Claude (via Claude Code CLI), Mistral's Leanstral (specialized for Lean), and any OpenAI-compatible model via vLLM. Different models can be assigned to planner vs. worker roles.

  6. Full Disk Persistence and Resumability: Every run creates a structured directory with all state on disk. Runs can be interrupted (Ctrl-C, machine crash) and resumed from the last completed step. This is critical for multi-hour proving sessions with expensive API calls.

Relationship to Prior Systems

| System | Year | Architecture | Formal Verification | Open Source |
| --- | --- | --- | --- | --- |
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Extraction + verification protocols | No (closed) |
| LeanCopilot | 2023 | Single-agent tactic suggestion | Lean 4 via LeanDojo | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Lean via LeanDojo | Partial |
| ReProver | 2023 | Encoder retrieval + generation | Lean 4 | Yes |
| OpenProver | 2025 | Planner-worker-whiteboard | Lean 4 via `lake env lean` + MCP tools | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Lean 4 via LeanDojo/CLI/LSP | Yes |

4 Supported Solutions

OpenProver targets mathematical proofs at varying levels of formality and difficulty:

| Solution Type | Description | Example |
| --- | --- | --- |
| Informal Proof | Natural-language mathematical proofs written as structured markdown | Irrationality of √2, infinitely many primes |
| Formal Lean 4 Proof | Machine-checked proofs compiled against Mathlib | Cauchy-Schwarz inequality in Lean 4 |
| Formalization | Converting an existing informal proof to verified Lean 4 | Given PROOF.md → produce PROOF.lean |
| Literature Survey | Web-enabled search for relevant mathematical results and papers | Finding prior work on Erdős problem 838 |
| Proof Strategy | Whiteboard-based high-level proof planning before execution | Decomposing a hard problem into lemmas |

Difficulty Spectrum Covered

| Problem | Difficulty | Status |
| --- | --- | --- |
| √2 is irrational | Easy (textbook) | Solved |
| Infinitely many primes | Easy (textbook) | Solved |
| e is irrational | Medium (undergraduate) | Addressed |
| Cauchy-Schwarz inequality | Medium (undergraduate) | Addressed |
| Erdős problem 205 | Medium (research-adjacent) | Addressed |
| Erdős problem 838 | Hard (open problem) | Attempted |
| Collatz conjecture | Open (famously unsolved) | Attempted |

Operating Modes

| Mode | Input | Goal | Termination Condition |
| --- | --- | --- | --- |
| prove | THEOREM.md | Informal proof | submit_proof → PROOF.md |
| prove_and_formalize | THEOREM.md + THEOREM.lean | Both proofs | PROOF.md + PROOF.lean |
| formalize_only | THEOREM.md + THEOREM.lean + PROOF.md | Formal proof | submit_proof succeeds → PROOF.lean |

5 LLM Integration

Multi-Provider Architecture

OpenProver implements a factory pattern for LLM clients, supporting three distinct provider backends:

| Provider | Client Class | Interface | Tool Support | Cost Tracking |
| --- | --- | --- | --- | --- |
| Claude (Anthropic) | LLMClient | Claude Code CLI subprocess | MCP server for lean_verify/lean_search | Yes (from Claude CLI response) |
| Leanstral (Mistral) | via OpenAI-compatible | Mistral API with MISTRAL_API_KEY | Native | Via API |
| Local/vLLM | HFClient | OpenAI-compatible HTTP at --provider-url | Native tool calling | No (local = free) |

Hierarchical Model Assignment

A distinctive feature is the ability to assign different models to different roles:

┌──────────────────────────────────────────┐
│           MODEL HIERARCHY                │
│                                          │
│  --planner-model opus                    │
│    ↓ Strategic reasoning, whiteboard     │
│    ↓ management, action selection        │
│                                          │
│  --worker-model sonnet                   │
│    ↓ Focused task execution, proof       │
│    ↓ exploration, verification           │
│                                          │
│  (Literature search gets web-enabled     │
│   Claude with WebSearch + WebFetch)      │
└──────────────────────────────────────────┘

Rationale: The planner sees a compressed view of the proof state (whiteboard + repo index) and must make strategic decisions — a task suited to more capable but expensive models (Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (Sonnet).

Claude CLI Integration Details

The Claude integration is notable for its depth. OpenProver wraps the claude CLI binary via subprocess.Popen:

Non-streaming mode:

claude -p --model <model> --system-prompt <...> \
  --output-format json --tools ""

Streaming mode:

claude -p --model <model> --system-prompt <...> \
  --output-format stream-json --verbose \
  --include-partial-messages --tools ""

Key implementation detail: streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.
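A minimal sketch of this readline-based streaming pattern, using a stand-in child process instead of the real claude binary (the CHILD event payloads are hypothetical, not the actual CLI schema):

```python
import json
import subprocess
import sys

# Stand-in child that emits NDJSON events, mimicking the shape (but not
# the exact schema) of `claude --output-format stream-json` output.
CHILD = """
import json, sys
for ev in [{"type": "assistant", "text": "step 1"},
           {"type": "assistant", "text": "step 2"},
           {"type": "result", "cost_usd": 0.01}]:
    sys.stdout.write(json.dumps(ev) + "\\n")
    sys.stdout.flush()
"""

def stream_ndjson(cmd):
    """Yield parsed events as each line arrives.

    readline() returns as soon as one full line is available; iterating
    the pipe object directly uses read-ahead buffering and delays display.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        while True:
            line = proc.stdout.readline()  # blocks until one line or EOF
            if not line:
                break
            yield json.loads(line)
    finally:
        proc.wait()

events = list(stream_ndjson([sys.executable, "-c", CHILD]))
```

Because each event is consumed the moment its line is flushed, the TUI can render planner output token-group by token-group rather than all at once at process exit.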

MCP Tool Calling: When --lean-project is set, the Claude CLI spawns an MCP server subprocess (lean/mcp_server.py) with:

--mcp-config <config> --strict-mcp-config \
--permission-mode bypassPermissions \
--allowedTools mcp__lean_tools__lean_verify mcp__lean_tools__lean_search

Tool events are detected from the NDJSON stream:

  • tool_use content blocks carry tool name and input (streamed incrementally)
  • Tool results appear as top-level {"type": "user"} messages containing tool_result entries matched by tool_use_id
  • MCP prefixes are stripped: mcp__lean_tools__lean_verify → lean_verify
  • Status is inferred from result text (e.g., results starting with "OK" = success)
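The detection steps above can be sketched as follows; the event shape is an assumption reconstructed from the description, not the exact Claude CLI schema:

```python
def strip_mcp_prefix(name: str) -> str:
    """mcp__lean_tools__lean_verify -> lean_verify."""
    return name.rsplit("__", 1)[-1]

def parse_tool_event(event: dict):
    """Pull a tool result out of one stream-json event (shape assumed).

    Results arrive as top-level {"type": "user"} messages whose content
    holds tool_result blocks matched by tool_use_id; status is inferred
    from whether the result text starts with "OK".
    """
    for block in event.get("message", {}).get("content", []):
        if block.get("type") == "tool_result":
            text = block.get("content", "")
            ok = isinstance(text, str) and text.startswith("OK")
            return {"id": block["tool_use_id"], "ok": ok}
    return None

event = {"type": "user",
         "message": {"content": [{"type": "tool_result",
                                  "tool_use_id": "t1",
                                  "content": "OK: compiled"}]}}
parsed = parse_tool_event(event)
```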

vLLM Integration for Local Models

For local models via vLLM, OpenProver uses the HFClient:

  • Health check on init via /health endpoint
  • Streaming via chunked transfer encoding
  • Multi-turn tool calling loop: LLM requests tool calls → _execute_worker_tool() dispatches → results appended to conversation → LLM continues
  • Supports MiniMax-M2.5 reasoning models with thinking tokens
  • Configurable --answer-reserve (default 4096 tokens) for local models that need token budget partitioning
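The multi-turn tool-calling loop can be sketched as below; fake_llm stands in for the HTTP call to the vLLM endpoint and execute_tool for the dispatch step, so both stubs are hypothetical:

```python
def execute_tool(name: str, args: dict) -> str:
    """Stand-in for the worker tool dispatcher (hypothetical stub)."""
    if name == "lean_verify":
        return "OK" if "sorry" not in args["code"] else "error: sorry found"
    raise ValueError(f"unknown tool: {name}")

def fake_llm(messages: list) -> dict:
    """Stand-in for a POST to an OpenAI-compatible chat endpoint."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "lean_verify",
                                "args": {"code": "theorem t : 1 = 1 := rfl"}}]}
    return {"content": "Verified."}

def chat(messages: list) -> str:
    """Loop: model requests tools -> execute -> append results -> continue."""
    while True:
        reply = fake_llm(messages)
        calls = reply.get("tool_calls")
        if not calls:  # no more tool requests: final answer
            return reply["content"]
        for call in calls:
            result = execute_tool(call["name"], call["args"])
            messages.append({"role": "tool", "content": result})
```

The loop terminates naturally once the model responds with plain content instead of tool calls.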

Web Search Integration

When --no-isolation is set, the planner can dispatch literature_search actions. These spawn a special worker with Claude CLI configured for web access:

--permission-mode bypassPermissions \
--allowedTools WebSearch WebFetch

This enables the system to find relevant mathematical papers and results online during proof search — a capability that mirrors how human mathematicians consult the literature when stuck.

6 Key Results

Demonstrated Capabilities

OpenProver is a tool-focused repository rather than a benchmark-reporting research paper. Specific quantitative results are limited, but the system demonstrates:

  1. Easy Theorems (Solved):
     • Irrationality of √2 — complete informal and formal proofs
     • Infinitely many primes — complete informal proofs

  2. Medium Theorems (Addressed):
     • Irrationality of e — proof strategies explored
     • Cauchy-Schwarz inequality — informal and formal proof attempts
     • Erdős problem 205 — multi-step proof exploration

  3. Hard/Open Problems (Attempted):
     • Erdős problem 838 — extended autonomous sessions with literature search
     • Collatz conjecture — proof exploration (unsolved, as expected)

Architectural Validation

The system validates that the Aletheia-style planner-worker architecture can be replicated with open-source components:

| Aletheia Feature | OpenProver Equivalent | Status |
| --- | --- | --- |
| Gemini Deep Think planner | Claude Opus / any LLM planner | Implemented |
| Parallel worker execution | ThreadPoolExecutor workers | Implemented |
| Whiteboard scratchpad | WHITEBOARD.md with LaTeX | Implemented |
| Repository of findings | repo/ directory with wikilinks | Implemented |
| Extraction + verification | submit_proof + Lean 4 verification | Implemented |
| Literature search | Web-enabled Claude workers | Implemented |
| Formal proof compilation | lake env lean + MCP tools | Implemented |

Lean Explore Search Pipeline

One of OpenProver's most technically sophisticated components is the lean_search tool, which searches ~400k Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std:

| Stage | Technique | Latency | Details |
| --- | --- | --- | --- |
| BM25 Retrieval | Keyword search over LLM-generated natural language descriptions | ~0.01s | Two indices with different tokenization, results merged |
| Semantic Retrieval | Qwen3-Embedding-0.6B (1024-dim), FAISS index, cosine similarity | ~1s | Pre-built index, sentence-transformer encoding |
| Score Fusion | Normalized combination: 0.3 × BM25 + 0.7 × semantic | | Dependency boost for related declarations |
| Reranking | Qwen3-Reranker-0.6B cross-encoder | ~1-2s (GPU) | Skipped on CPU (~50s); GPU recommended |
| Hydration | Full metadata lookup from SQLite | | Signature, docstring, source location |
First call takes ~10s (model loading). Subsequent calls ~1s.
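The fusion stage can be sketched as below, under the assumption that each stage's scores are min-max normalized before the 0.3/0.7 weighting (the exact normalization used is not stated in the source):

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize a {declaration: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse(bm25: dict, semantic: dict, w_bm25: float = 0.3,
         w_sem: float = 0.7) -> dict:
    """Combine the two retrieval stages with the 0.3/0.7 weighting."""
    nb, ns = normalize(bm25), normalize(semantic)
    return {k: w_bm25 * nb.get(k, 0.0) + w_sem * ns.get(k, 0.0)
            for k in set(nb) | set(ns)}

# Illustrative scores only; declaration names are real Mathlib-style names.
fused = fuse(bm25={"Nat.sqrt": 9.1, "Real.sqrt": 7.0},
             semantic={"Real.sqrt": 0.92, "Irrational": 0.85})
```

With the semantic weight dominating, a declaration that only the embedding stage ranks highly can still win the fused ranking.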

7 Reproducibility

Run Directory Structure

Every OpenProver run creates a fully self-contained, resumable directory:

runs/<slug>-<timestamp>/
├── THEOREM.md             # Immutable copy of input theorem
├── THEOREM.lean           # Formal Lean statement (if provided)
├── WHITEBOARD.md          # Current whiteboard state (enables resume)
├── PROOF.md               # Final informal proof (if found)
├── PROOF.lean             # Final formal proof (if lean mode)
├── DISCUSSION.md          # Post-session LLM analysis
├── run_config.toml        # Saved run configuration
├── repo/                  # Repository items
│   ├── lemma-foo.md       # "Summary: ...\n\nFull content"
│   └── observation-bar.md
├── steps/
│   ├── step_001/
│   │   ├── planner.toml   # Planner's TOML decision
│   │   └── workers/
│   │       ├── task_0.md   # Worker task description
│   │       ├── result_0.md # Worker output
│   │       └── worker_0_call.json
│   └── step_002/...
└── archive/
    └── calls/
        └── call_001.json  # Full LLM call record

Resumability Protocol

Resume detection: check for existing WHITEBOARD.md + THEOREM.md. Step count inferred from step_NNN directories. Settings restored from run_config.toml; CLI flags override saved values.

This means a proving session can survive:

  • Machine crashes / power loss
  • Ctrl-C interruptions
  • Context window exhaustion
  • Network failures during API calls
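The resume protocol can be sketched as below; detect_resume is an illustrative stand-in, not OpenProver's actual API:

```python
import re
import tempfile
from pathlib import Path

def detect_resume(run_dir) -> int:
    """Return the next step number for an existing run, or 1 for a fresh one.

    Mirrors the protocol above: a run is resumable when WHITEBOARD.md and
    THEOREM.md both exist, and the step count is inferred from the
    steps/step_NNN directories.
    """
    run = Path(run_dir)
    if not ((run / "WHITEBOARD.md").exists() and (run / "THEOREM.md").exists()):
        return 1
    steps = [int(m.group(1)) for p in (run / "steps").glob("step_*")
             if (m := re.fullmatch(r"step_(\d+)", p.name))]
    return max(steps, default=0) + 1

# Demo: a run directory with two completed steps resumes at step 3.
demo = Path(tempfile.mkdtemp())
(demo / "steps" / "step_001").mkdir(parents=True)
(demo / "steps" / "step_002").mkdir()
(demo / "WHITEBOARD.md").write_text("# scratch")
(demo / "THEOREM.md").write_text("theorem")
next_step = detect_resume(demo)
```

Settings would then be restored from run_config.toml, with CLI flags overriding the saved values.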

Artifact Completeness

Each archived LLM call (call_NNN.json) records:

| Field | Content |
| --- | --- |
| prompt | Full input prompt |
| system_prompt | System prompt used |
| schema | JSON schema (if structured output) |
| response | Complete LLM response |
| cost | Cost in USD |
| timing | Wall-clock duration |
| errors | Any errors encountered |

Inspection Mode

openprover inspect runs/<run-dir>

Browse any historical run in the TUI in read-only mode, reviewing every step's planner decisions, worker tasks, and results.

8 Compute and API Costs

Budget System

OpenProver implements explicit budget tracking with two modes:

| Budget Mode | Flag | Default | Description |
| --- | --- | --- | --- |
| Time budget | --max-time | 4 hours | Wall-clock time limit |
| Token budget | --max-tokens | | Cumulative output token limit |

The Budget class exposes three checkpoints:

  • is_exhausted: budget fully consumed
  • should_conclude (at --conclude-after, default 0.99): planner enters conclusion phase
  • can_give_up (at --give-up-after, default 0.50): give_up action becomes available

The planner prompt receives budget status every step so it can pace its exploration.
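A time-budget sketch of those three checkpoints follows; the property and flag names track the description above, but the internals are assumed:

```python
import time

class Budget:
    """Minimal time-budget sketch (token budgets would track output
    tokens the same way)."""

    def __init__(self, max_seconds: float, conclude_after: float = 0.99,
                 give_up_after: float = 0.50):
        self.start = time.monotonic()
        self.max_seconds = max_seconds
        self.conclude_after = conclude_after
        self.give_up_after = give_up_after

    @property
    def used_fraction(self) -> float:
        return (time.monotonic() - self.start) / self.max_seconds

    @property
    def is_exhausted(self) -> bool:
        return self.used_fraction >= 1.0

    @property
    def should_conclude(self) -> bool:
        return self.used_fraction >= self.conclude_after

    @property
    def can_give_up(self) -> bool:
        return self.used_fraction >= self.give_up_after

budget = Budget(max_seconds=4 * 3600)  # the 4-hour default
```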

Cost Estimation

| Scenario | Model | Parallelism | Time | Estimated Cost |
| --- | --- | --- | --- | --- |
| Easy theorem (√2 irrational) | Sonnet | 1 | ~10-30 min | ~$1-5 |
| Medium theorem (Cauchy-Schwarz) | Sonnet | 1 | ~1-2 hours | ~$10-30 |
| Hard exploration (Erdős 838) | Opus planner + Sonnet workers | 3 | 2-4 hours | ~$50-200 |
| Local model run | vLLM (MiniMax-M2.5) | 1 | Variable | $0 (compute only) |

Note: These are rough estimates. Actual costs depend heavily on model pricing, theorem difficulty, and search efficiency. Every call's cost is logged in archive/calls/ for precise post-hoc analysis.

Resource Requirements

| Resource | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10+ | 3.11+ (for tomllib) |
| RAM | 4 GB | 8+ GB (for Lean Explore models) |
| GPU | None (CPU works) | CUDA GPU (for lean_search reranking) |
| Disk | ~500 MB (base) | ~2 GB (with Lean Explore data) |
| Network | Required (for API models) | |
| Lean 4 | Optional | Required for formal verification |

9 Architecture Solution

System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        OpenProver System                           │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      CLI (cli.py)                            │  │
│  │  Parse args → Setup TUI → Create Prover → Run → Print cost  │  │
│  └──────────┬───────────────────────────────────┬───────────────┘  │
│             │                                   │                  │
│  ┌──────────▼──────────┐         ┌──────────────▼──────────────┐  │
│  │     Prover Loop     │         │         TUI (tui.py)        │  │
│  │    (prover.py)      │◄───────►│  Full-screen terminal UI    │  │
│  │                     │ display │  Tabs: planner + workers    │  │
│  │  ┌───────────────┐  │         │  Streaming + key handling   │  │
│  │  │  Whiteboard   │  │         │  ANSI escape codes          │  │
│  │  │ (LaTeX pad)   │  │         └─────────────────────────────┘  │
│  │  └───────────────┘  │                                          │
│  │  ┌───────────────┐  │                                          │
│  │  │  Repository   │  │                                          │
│  │  │ (repo/*.md)   │  │                                          │
│  │  │ + wikilinks   │  │                                          │
│  │  └───────────────┘  │                                          │
│  │  ┌───────────────┐  │                                          │
│  │  │ Output Window │  │                                          │
│  │  │ (last 3 outs) │  │                                          │
│  │  └───────────────┘  │                                          │
│  └──────────┬──────────┘                                          │
│             │                                                      │
│  ┌──────────▼──────────────────────────────────────────────────┐  │
│  │                    Action Dispatch                          │  │
│  │                                                             │  │
│  │  spawn ──────► Workers (ThreadPoolExecutor, up to -P)      │  │
│  │  literature_search ──► Web-enabled Claude worker           │  │
│  │  read_items ──► Fetch repo/*.md content                    │  │
│  │  write_items ──► Create/update/delete repo items           │  │
│  │  write_whiteboard ──► Update WHITEBOARD.md                 │  │
│  │  read_theorem ──► Return THEOREM.md + .lean + PROOF.md    │  │
│  │  submit_proof ──► Save PROOF.md, verify PROOF.lean        │  │
│  │  give_up ──► Terminate (after threshold)                   │  │
│  └─────────────────────────────────┬───────────────────────────┘  │
│                                    │                               │
│  ┌─────────────────────────────────▼───────────────────────────┐  │
│  │                    LLM Layer (llm.py)                       │  │
│  │                                                             │  │
│  │  ┌──────────────────┐   ┌──────────────────┐               │  │
│  │  │  LLMClient       │   │  HFClient        │               │  │
│  │  │  (Claude CLI)    │   │  (vLLM/OpenAI)   │               │  │
│  │  │                  │   │                   │               │  │
│  │  │  • Popen+readline│   │  • HTTP streaming │               │  │
│  │  │  • NDJSON parsing│   │  • Tool calling   │               │  │
│  │  │  • MCP config    │   │  • Health check   │               │  │
│  │  │  • Cost tracking │   │  • Chat() loop    │               │  │
│  │  └──────────────────┘   └──────────────────┘               │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                    Lean Layer (lean/)                        │  │
│  │                                                             │  │
│  │  core.py         │ Parse THEOREM.lean, assemble proofs,    │  │
│  │                  │ run `lake env lean`, LeanWorkDir         │  │
│  │  mcp_server.py   │ MCP server: lean_verify + lean_search   │  │
│  │  data.py         │ LeanExplore data management             │  │
│  │  tools.py        │ Tool defs for vLLM native calling       │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                    Support Modules                          │  │
│  │                                                             │  │
│  │  prompts.py  │ All prompt templates, TOML parser, actions  │  │
│  │  budget.py   │ Time/token budget tracking                  │  │
│  │  inspect.py  │ Read-only run browser                       │  │
│  └─────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Data Flow Per Step

Step N begins
    │
    ▼
┌────────────────────────────────────────┐
│ Build planner prompt:                  │
│  • Current whiteboard                  │
│  • Repository index (one-line summaries)│
│  • Last 3 outputs (rolling window)     │
│  • Budget status                       │
│  • Mode-specific status indicators     │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Planner LLM call (streaming)          │
│  Response must contain ```toml block  │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Parse TOML:                            │
│  • action (one of 8 actions)          │
│  • summary                            │
│  • whiteboard update                   │
│  • tasks, items, search_query          │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Dispatch to action handler             │
│  (one of _handle_* methods)           │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Save step to disk:                     │
│  • steps/step_NNN/planner.toml        │
│  • steps/step_NNN/workers/            │
│  • archive/calls/                     │
│  • Update WHITEBOARD.md               │
└────────────────────────────────────────┘
                │
                ▼
          Step N+1

10 Component Breakdown

Module Architecture

| Module | File | Lines (est.) | Responsibility |
| --- | --- | --- | --- |
| CLI | cli.py | ~200 | Argument parsing, TUI setup, signal handlers, cost summary |
| Prover | prover.py | ~800 | Core loop, action dispatch, Repo class, wikilink resolution, operating mode logic |
| LLM | llm.py | ~500 | LLMClient (Claude CLI), HFClient (vLLM), streaming, archiving, cost tracking |
| Prompts | prompts.py | ~600 | System prompts, prompt formatters, TOML parser, actions enum |
| Budget | budget.py | ~100 | Time/token tracking, threshold calculations |
| TUI | tui.py | ~800 | Full-screen terminal UI, tab system, streaming display, key handling |
| Inspect | inspect.py | ~150 | Read-only run browser |
| Lean Core | lean/core.py | ~300 | LeanTheorem parser, proof assembly, run_lean_check, LeanWorkDir |
| Lean MCP | lean/mcp_server.py | ~200 | MCP server exposing lean_verify and lean_search |
| Lean Data | lean/data.py | ~150 | LeanExplore data management, dependency installation |
| Lean Tools | lean/tools.py | ~100 | Tool definitions for vLLM native tool calling |

Dependency Graph

cli.py
  ├── prover.py
  │     ├── llm.py
  │     ├── prompts.py
  │     ├── budget.py
  │     └── lean/
  │           ├── core.py
  │           ├── data.py
  │           ├── mcp_server.py (subprocess)
  │           └── tools.py
  ├── tui.py
  └── inspect.py

The dependency graph is clean and acyclic. lean/ is a self-contained subpackage. prompts.py has no code dependencies beyond the standard library.

External Dependencies

OpenProver has a remarkably minimal dependency footprint:

[project]
requires-python = ">=3.10"
dependencies = [
  "mcp>=1.0",
]

Only the MCP library is a hard dependency. All other functionality comes from:

  • Python standard library (subprocess, json, tomllib, threading, os, select)
  • Claude CLI (system binary)
  • Optional: Lean 4, LeanDojo, lean-explore, torch, sentence-transformers

11 Core Mechanisms (Detailed)

11.1 Planner-Worker Protocol

The planner operates in a strict loop. Each step:

  1. Prompt Construction: The planner receives a structured prompt containing:
     • The current whiteboard (mathematical scratchpad)
     • A one-line index of all repository items
     • The last 3 outputs from previous steps (rolling window)
     • Budget status (time/tokens remaining, can_conclude, can_give_up)
     • Mode-specific status indicators (theorem read, proof found, formal proof status)

  2. TOML Decision: The planner responds with a fenced TOML block:

```toml
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = """
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed)
"""

[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference [[parity-lemma]]."

[[tasks]]
description = "Verify the infinite descent argument terminates."
```

  3. TOML Parsing: parse_planner_toml() extracts the TOML from fenced code blocks using tomllib (Python 3.11+), tomli (3.10 fallback), or a minimal regex-based parser. It handles both [[tasks]] array-of-tables syntax and lean_block_N numbered keys.

  4. Action Dispatch: Based on the action field, one of 8 handlers executes.
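The extraction-and-parse step can be sketched as below; this is a simplified take on parse_planner_toml() that only handles the fenced-block case:

```python
import re

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # 3.10 fallback, as in the project

FENCE = "```"

def parse_planner_toml(response: str) -> dict:
    """Extract and parse the first fenced toml block from an LLM reply.

    A sketch of the extraction step only; the real parser also handles
    lean_block_N numbered keys and a regex-based fallback.
    """
    m = re.search(FENCE + r"toml\s*\n(.*?)" + FENCE, response, re.DOTALL)
    if m is None:
        raise ValueError("planner response contains no toml block")
    return tomllib.loads(m.group(1))

# A hypothetical planner reply mixing prose with the TOML decision.
reply = "\n".join([
    "Reasoning about strategy...",
    FENCE + "toml",
    'action = "spawn"',
    'summary = "Try the parity argument"',
    "",
    "[[tasks]]",
    'description = "Show the parity lemma."',
    FENCE,
])
decision = parse_planner_toml(reply)
```

TOML's multi-line strings are what make the whiteboard field easy for the LLM to emit without escaping.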

11.2 Worker Execution

Workers are stateless, parallel, and focused:

Planner spawns task
    │
    ▼
Resolve [[wikilinks]] in task description
    │ (fetch referenced repo items)
    │
    ▼
Build worker prompt:
    │ • Task description + resolved references
    │ • System prompt (mathematician role)
    │
    ▼
LLM call (streaming to worker tab)
    │
    ├── [If lean_worker_tools enabled]
    │   ├── lean_verify(code) → compile in LeanWorkDir
    │   └── lean_search(query) → search ~400k declarations
    │
    ▼
Return result to planner

Parallel Execution: Workers run in a ThreadPoolExecutor with concurrency bounded by --parallelism (default 1). Each worker is completely independent — they share no state, only read from the repository.
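The spawn step can be sketched with the same ThreadPoolExecutor pattern; run_worker is a hypothetical stub standing in for one worker LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_worker(task: str) -> str:
    # Stand-in for one stateless worker LLM call (hypothetical stub).
    return f"result for: {task}"

def spawn_workers(tasks, parallelism: int = 1):
    """Run worker tasks concurrently; map() returns results in task order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_worker, tasks))

results = spawn_workers(["prove parity lemma", "check descent step"],
                        parallelism=2)
```

Because workers share no mutable state, no locking is needed beyond what the executor provides.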

Verification Workers: The planner can specifically task workers with verification. A verifier worker sees only the proof text (not the reasoning) and must end with VERDICT: CORRECT or VERDICT: INCORRECT.

11.3 Whiteboard Management

The whiteboard is the planner's external mathematical memory:

┌─────────────────────────────────────────┐
│ WHITEBOARD.md                            │
│                                          │
│ # Goal: [theorem statement]              │
│ ## Strategy: [current approach]          │
│ ## Status: [where we are]               │
│ ## Tried: [approaches attempted]        │
│ ## Key Observations:                    │
│   - [observation 1]                     │
│   - [observation 2]                     │
│ ## Open Questions:                      │
│   - [question 1]                        │
└─────────────────────────────────────────┘

Design Principles:

  • Terse LaTeX-friendly format (abbreviations encouraged)
  • Updated by the planner every step
  • Persists to disk (enables resume)
  • Read by the planner at the start of every step
  • Functions as external memory beyond context window limits

11.4 Repository as Structured Knowledge Base

The repository is a file-based knowledge graph:

repo/
├── sqrt2-parity-lemma.md
│   Summary: p² = 2q² implies p even.
│
│   Proof: If p is odd, p² is odd. But 2q² is even. Contradiction.
│
├── descent-argument.md
│   Summary: Infinite descent shows no reduced fraction exists.
│
│   [Full argument with LaTeX]
│
└── algebraic-number-context.md
    Summary: Literature review on algebraic irrationality proofs.

    [Web search results and analysis]

Wikilink Resolution: Before a worker receives its task, repo.resolve_wikilinks():

  1. Finds all [[slug]] references in the task description
  2. Fetches the content of each referenced item
  3. Appends a "Referenced Materials" section to the worker's input

This creates a lightweight knowledge management system where the planner can share proven lemmas, observations, or literature findings with workers without duplicating content.
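A minimal sketch of this resolution step, with an in-memory dict standing in for reading repo/<slug>.md files:

```python
import re

def resolve_wikilinks(task: str, repo: dict) -> str:
    """Append a Referenced Materials section for each [[slug]] in the task.

    `repo` maps slug -> item content; the section layout here is an
    assumption, not OpenProver's exact output format.
    """
    slugs = re.findall(r"\[\[([^\]]+)\]\]", task)
    refs = [f"### {slug}\n{repo[slug]}" for slug in slugs if slug in repo]
    if not refs:
        return task
    return task + "\n\n## Referenced Materials\n" + "\n".join(refs)

repo = {"parity-lemma": "Summary: p² = 2q² implies p even."}
out = resolve_wikilinks("Use [[parity-lemma]] to finish.", repo)
```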

11.5 Lean 4 Formal Verification

When --lean-project is set, formal verification integrates at two levels:

Level 1: Repository Item Verification

When the planner writes an item with format="lean", it is:

  1. Written to <lean-project>/OpenProver-<id>/<slug>-<random>.lean
  2. Compiled via lake env lean <file> from the project directory
  3. Pass/fail feedback (with compiler errors) returned to planner

Level 2: Final Proof Assembly

When the planner submits a proof in lean mode:

  1. LeanTheorem parses THEOREM.lean, locating all sorry positions
  2. The planner provides N replacement blocks (one per sorry) + optional context
  3. assemble_proof() replaces each sorry with its corresponding block (reverse order for offset preservation)
  4. Validates: correct replacement count, no import in injected code
  5. Runs lake env lean on the assembled file
  6. Success → writes PROOF.lean; failure → compiler errors fed back
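The assembly step can be sketched as below; a simplified take on assemble_proof() showing why replacement runs right-to-left (earlier character offsets stay valid):

```python
def assemble_proof(statement: str, blocks: list) -> str:
    """Replace each `sorry` in the statement with its proof block.

    Replacing from the last occurrence backward means earlier offsets
    are not shifted by earlier substitutions.
    """
    positions = []
    i = statement.find("sorry")
    while i != -1:
        positions.append(i)
        i = statement.find("sorry", i + 1)
    if len(positions) != len(blocks):
        raise ValueError(f"expected {len(positions)} blocks, got {len(blocks)}")
    out = statement
    for pos, block in zip(reversed(positions), reversed(blocks)):
        if "import" in block:  # injected code must not add imports
            raise ValueError("import not allowed in injected code")
        out = out[:pos] + block + out[pos + len("sorry"):]
    return out

thm = "theorem two_add : 1 + 1 = 2 := by sorry"
assembled = assemble_proof(thm, ["rfl"])
```

The assembled file would then be compiled with lake env lean; this sketch stops before that step.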

Level 3: Worker Tools via MCP

When --lean-worker-tools is enabled:

  • Claude workers: MCP server subprocess provides lean_verify and lean_search
  • vLLM workers: LeanExplore search service initialized in-process; native OpenAI tool calling

11.6 Web-Enabled Literature Search

When --no-isolation is set:

Planner action: literature_search
    │
    ▼
Spawn web-enabled Claude worker with:
    • WebSearch tool (web queries)
    • WebFetch tool (page retrieval)
    │
    ▼
Worker searches for relevant:
    • Published papers
    • Known results
    • Prior proof attempts
    • Related theorems
    │
    ▼
Results stored as repo items
    │
    ▼
Available to all future workers via [[wikilinks]]

This is one of OpenProver's most distinctive features — it enables the prover to consult the mathematical literature mid-proof, just as a human mathematician would.

11.7 Budget-Aware Planning

The planner receives budget status every step:

Budget: 45% remaining (1h 48m of 4h)
Conclusion phase: not yet (triggers at 99%)
Give-up threshold: passed (50%)

This enables adaptive behavior:

  • Early steps: aggressive exploration, spawn many workers
  • Middle steps: focus on most promising avenues
  • Late steps: consolidate findings, attempt proof submission
  • Conclusion phase: wrap up, write DISCUSSION.md

12 Programming Language

Language Breakdown

| Language | Percentage | Usage |
| --- | --- | --- |
| Python | 92.9% | Core system (all modules) |
| JavaScript | 4.2% | TUI components |
| CSS | 2.4% | TUI styling |
| Other | 0.5% | Configuration, shell scripts |

Python Version Requirements

  • Minimum: Python 3.10 (for match statements, type union syntax)
  • Recommended: Python 3.11+ (for tomllib in standard library; 3.10 uses tomli fallback)
  • No async: The system uses threading and subprocess, not asyncio

Notable Implementation Choices

  1. No Heavy ML Frameworks Required: The core system depends only on mcp>=1.0. Lean search dependencies (torch, sentence-transformers) are optional and installed on demand.

  2. Subprocess-Based LLM Integration: Rather than using SDK libraries, OpenProver calls the Claude CLI binary directly via subprocess.Popen. This is unconventional but provides maximum control over streaming behavior and tool configuration.

  3. TOML as Structured Output Format: The planner's decisions are structured as TOML blocks within natural language responses. This is a pragmatic choice — TOML is simpler than JSON for LLMs to generate (no escaping issues) and supports multi-line strings naturally.

  4. File-Based State Management: All state lives on disk as plain files (markdown, TOML, JSON). No database, no in-memory state that survives process death. This makes the system trivially inspectable and resumable.

  5. Thread Safety via Write Lock: All TUI stdout writes are protected by a _write_lock. The background key-reading thread doesn't hold the lock during I/O, preventing deadlocks while maintaining display consistency.

13 Memory Management

External Memory Architecture

OpenProver implements a multi-layered external memory system that compensates for LLM context window limitations:

┌─────────────────────────────────────────────────┐
│              MEMORY HIERARCHY                    │
│                                                  │
│  Layer 1: Whiteboard (WHITEBOARD.md)             │
│  ├── Capacity: ~2-4 KB (terse LaTeX)             │
│  ├── Persistence: Per-step to disk               │
│  ├── Visibility: Planner only                    │
│  ├── Purpose: Strategic state, current approach  │
│  └── Update: Every step by planner               │
│                                                  │
│  Layer 2: Repository (repo/*.md)                 │
│  ├── Capacity: Unlimited (one file per item)     │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Planner (index), Workers (via   │
│  │               wikilinks)                      │
│  ├── Purpose: Proven lemmas, observations,       │
│  │            literature, failed approaches       │
│  └── Update: Planner CRUD via write_items        │
│                                                  │
│  Layer 3: Output Window (last 3 outputs)         │
│  ├── Capacity: 3 items (rolling)                 │
│  ├── Persistence: In-memory only                 │
│  ├── Visibility: Planner only                    │
│  ├── Purpose: Recent worker results, feedback    │
│  └── Update: Automatic (FIFO)                    │
│                                                  │
│  Layer 4: Step Archive (steps/step_NNN/)         │
│  ├── Capacity: Unlimited                         │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Inspect mode only               │
│  ├── Purpose: Full audit trail                   │
│  └── Update: Append-only                         │
│                                                  │
│  Layer 5: LLM Call Archive (archive/calls/)      │
│  ├── Capacity: Unlimited                         │
│  ├── Persistence: Permanent on disk              │
│  ├── Visibility: Post-hoc analysis               │
│  ├── Purpose: Cost tracking, debugging,          │
│  │            reproducibility                    │
│  └── Update: Append-only                         │
└─────────────────────────────────────────────────┘

The wikilink system functions as a lightweight retrieval mechanism:

  1. Planner stores a lemma as repo/parity-lemma.md
  2. Later task references [[parity-lemma]]
  3. resolve_wikilinks() fetches the full content
  4. Worker receives task + appended reference materials

This is superior to dumping all context into every prompt because:

  • Tokens are spent only when content is referenced
  • Workers see only what's relevant to their specific task
  • The planner controls what knowledge flows to which worker
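The four resolution steps above can be sketched in a few lines. The `[[slug]]` syntax and the function shape are assumptions based on the description, not OpenProver's exact implementation:

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([a-z0-9-]+)\]\]")

def resolve_wikilinks(task: str, repo_dir: Path) -> str:
    """Append the full content of every [[slug]] referenced in a task,
    so the worker receives only the repo items relevant to it."""
    sections = []
    for slug in dict.fromkeys(WIKILINK.findall(task)):  # dedupe, keep order
        item = repo_dir / f"{slug}.md"
        if item.exists():
            sections.append(f"## Reference: {slug}\n\n"
                            f"{item.read_text(encoding='utf-8')}")
    if not sections:
        return task
    return task + "\n\n" + "\n\n".join(sections)
```

Unresolvable slugs are silently skipped here; a production harness would more likely surface them to the planner as feedback.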

Context Window Management

The planner sees a compressed view of the proof state:

| Component      | Typical Size      | Purpose                     |
|----------------|-------------------|-----------------------------|
| Whiteboard     | ~500-2000 tokens  | Current strategy and status |
| Repo index     | ~50-200 tokens    | One-line summary per item   |
| Last 3 outputs | ~1000-3000 tokens | Recent worker results       |
| Budget status  | ~50 tokens        | Pacing information          |
| System prompt  | ~1000-2000 tokens | Role and action definitions |

Total planner context: ~3000-7000 tokens per step — well within any frontier model's capacity. This means the planner never hits context limits, even in multi-hour sessions.
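Assembling this compressed view is plain string concatenation over the components in the table. A sketch assuming each component is already rendered as text (names and section headers are illustrative):

```python
def build_planner_context(whiteboard: str, repo_index: list[str],
                          recent_outputs: list[str], budget_line: str) -> str:
    """Assemble the compressed per-step planner prompt: whiteboard,
    one-line repo index, the last three worker outputs, budget status."""
    parts = [
        "# Whiteboard\n" + whiteboard,
        "# Repository index\n" + "\n".join(repo_index),
        "# Recent worker outputs\n" + "\n---\n".join(recent_outputs[-3:]),
        "# Budget\n" + budget_line,
    ]
    return "\n\n".join(parts)
```

The slice `recent_outputs[-3:]` is what keeps the context bounded regardless of run length: older worker outputs survive only if the planner promoted them into the whiteboard or repository.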

No Embedding-Based Memory

Unlike systems that use vector databases for semantic retrieval over past experiences, OpenProver relies entirely on:

  • Structured naming (repo item slugs)
  • Explicit references (wikilinks)
  • Planner intelligence (the LLM decides what to store and reference)

This is simpler and more transparent but relies on the planner's ability to curate the repository effectively.

14 Continued Learning

Learning Within a Run

OpenProver demonstrates within-session learning through its repository mechanism:

Step 1: Worker explores approach A, fails
    → Planner stores "approach-a-failure.md" in repo
    → Whiteboard updated: "Tried: approach A (failed due to X)"

Step 5: Worker attempts related approach
    → Task includes [[approach-a-failure]] reference
    → Worker avoids repeating the same mistake

Step 10: Planner recognizes pattern across failures
    → Stores "key-insight.md" with synthesized observation
    → Redirects search based on accumulated evidence

Cross-Run Learning

OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement.

What's Missing (and Why It Matters):

| Capability                  | Status          | Impact                                              |
|-----------------------------|-----------------|-----------------------------------------------------|
| Proof strategy transfer     | Not implemented | Can't reuse strategies from similar theorems        |
| Tactic preference learning  | Not implemented | Can't learn which tactics work for which goal types |
| Failure pattern recognition | Not implemented | Can't avoid known dead ends from prior runs         |
| Repository bootstrapping    | Not implemented | Can't seed a new run with relevant prior knowledge  |
| Fine-tuning from traces     | Not implemented | Can't train better models from successful proofs    |

Potential for Evolutionary Integration

OpenProver's architecture has natural extension points for evolutionary methods:

  1. Repository Seeding: Pre-populate repo/ with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports this — the planner just needs a way to discover relevant prior knowledge.

  2. Strategy Evolution: The whiteboard format could be used as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress toward proof.

  3. Prompt Population: The planner system prompt could be evolved, with different versions competing on benchmark problems.

  4. Model Selection: The planner/worker model assignment could be treated as a hyperparameter optimized across runs.

Post-Session Analysis

Every completed run generates a DISCUSSION.md via a dedicated LLM call that analyzes:

  • What worked and what didn't
  • Key insights and observations
  • Suggestions for future attempts
  • Assessment of the proof's strength (for informal proofs)

This creates a self-reflective artifact that, while not currently fed into future runs, could serve as input for a cross-run learning system.

15 Applications

Primary Application: Mathematical Proof Discovery

OpenProver's primary use case is assisting mathematicians with proof discovery at various difficulty levels:

| Application           | Mode                            | Description                                                          |
|-----------------------|---------------------------------|----------------------------------------------------------------------|
| Homework assistance   | prove                           | Generate informal proofs for undergraduate-level theorems            |
| Research exploration  | prove + literature_search       | Explore proof strategies for research-level problems with web search |
| Formalization         | formalize_only                  | Convert an existing informal proof to verified Lean 4                |
| Proof + Formalization | prove_and_formalize             | Find a proof and produce verified Lean 4 code simultaneously         |
| Problem exploration   | prove (autonomous, long budget) | Extended autonomous exploration of hard/open problems                |

Research Applications

  1. Benchmarking LLMs for Mathematical Reasoning: OpenProver provides a controlled environment for comparing different LLMs' mathematical reasoning abilities. By logging every LLM call with full prompts and responses, researchers can analyze:
     • Which models propose better tactics
     • How reasoning depth correlates with proof success
     • Cost-quality tradeoffs between model tiers

  2. Architecture Research: The planner-worker architecture is configurable enough to study:
     • Effect of parallelism on proof search efficiency
     • Value of literature search for hard problems
     • Whiteboard strategies and their impact on proof direction

  3. Formal Verification Pipeline Research: The Lean integration enables studying:
     • Informal-to-formal translation accuracy
     • Error patterns in LLM-generated Lean code
     • Effectiveness of compiler feedback for repair

Comparison with Alternative Approaches

| Approach                               | OpenProver Strength                                              | OpenProver Limitation                                          |
|----------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------|
| vs. Single-agent provers (LeanCopilot) | Richer strategic reasoning, literature search, persistent memory | Higher token cost, more complex setup                          |
| vs. RL-trained provers (AlphaProof)    | No training required, works with any LLM, open source            | No learned value function, no policy improvement               |
| vs. Search-based provers (UlamAI)      | Richer proof state (whiteboard + repo), informal reasoning       | Less sophisticated tactic-level search, no transposition table |
| vs. Closed systems (Aletheia)          | Open source, inspectable, resumable, multi-provider              | Smaller scale, no proprietary model access                     |

Integration Scenarios

For Individual Mathematicians:

# Quick exploration of a conjecture
openprover my_conjecture.md --autonomous --max-time 1h

# Deep dive with literature search
openprover hard_problem.md --no-isolation --model opus -P 3 --max-time 4h

# Formalize a proof for a paper
openprover theorem.md --lean-project ~/mathlib4 \
  --lean-theorem theorem.lean --proof proof.md

For Research Groups:

  • Batch proving across theorem suites
  • Comparative evaluation of models/strategies
  • Training data generation from proof traces

For Education:

  • Step-by-step proof exploration in interactive mode
  • Students can give feedback at each step, guiding the proof
  • TUI provides visibility into the reasoning process

Limitations and Honest Assessment

| Limitation            | Details                                                         |
|-----------------------|-----------------------------------------------------------------|
| No benchmark results  | No systematic evaluation on standard benchmarks (miniF2F, etc.) |
| No cross-run learning | Each run starts from scratch                                    |
| Claude-centric        | Default configuration assumes Claude Code CLI on PATH           |
| No RL/SFT pipeline    | Pure prompting; no model improvement from traces                |
| No value model        | Search relies on LLM intuition, not learned value functions     |
| Solo development      | Bus factor of 1; sustainability risk                            |
| Informal-first bias   | Strongest at informal proofs; formal verification is secondary  |

Significance for the Harness Systems Landscape

OpenProver occupies a critical point in the harness systems taxonomy: it demonstrates that frontier proof agents can be built with open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. Its planner-worker-whiteboard architecture, replicated and open-sourced from Aletheia's design, provides a reusable pattern for any agentic system that needs to coordinate strategic reasoning with parallel execution over extended time horizons.

The key insight is that persistent structured state (whiteboard + repository + wikilinks) can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models. This architectural pattern generalizes beyond theorem proving to any discovery process that requires strategic coordination, literature consultation, and incremental knowledge accumulation.