OpenProver
Part P08: Harness & Agent Frameworks
60.1 Overview and Motivation
Automated theorem proving stands at the frontier of mathematical reasoning by machines. While large language models have demonstrated impressive informal mathematical ability — solving competition problems, generating plausible proof sketches, and even suggesting novel conjectures — the gap between informal reasoning and machine-checked formal proofs remains substantial. Bridging this gap requires systems that can orchestrate extended, multi-step proof searches while maintaining strategic coherence over hours of exploration.
In February 2025, Google DeepMind announced Aletheia, a fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6 of 10 open research problems from the FirstProof challenge. Aletheia demonstrated a planner-worker-whiteboard architecture — a design pattern in which a strategic planner coordinates focused, stateless workers through a persistent mathematical scratchpad. The system was proprietary, leaving the research community without access to its internals or the ability to reproduce its results.
OpenProver, developed by Matěj Kripner at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague, is an open-source implementation of a planner-worker-whiteboard architecture for automated theorem proving, explicitly inspired by the Aletheia design as described in DeepMind's public communications. Released in February 2025 with active development through v1.0.1 (March 2026), OpenProver provides a fully inspectable, resumable, multi-LLM orchestration framework that supports both informal natural-language proofs and machine-checked Lean 4 formalizations.
Repository Snapshot
| Repository | github.com/Kripner/openprover |
| Version reviewed | v1.0.1 (PyPI release, March 2026) |
| Commits at review | 388 on master branch |
| Primary language | Python (92.9%), ~3,000 lines across 11 modules |
| License | Not explicitly specified in repository |
| Review access date | April 2026 |
All file paths, module references, and structural claims in this chapter are based on inspection of the repository at the above date. Line counts are estimates. Code snippets are labeled as pseudocode, lightly edited, or verbatim at each occurrence.
Key Contribution
OpenProver demonstrates that a planner-worker-whiteboard architecture — similar in high-level design to the pattern described in DeepMind's Aletheia announcements — can be constructed using open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. The scope of equivalence is architectural: OpenProver replicates the separation of planner and workers, persistent whiteboard state, and parallel worker dispatch, based on Aletheia's publicly described design. It does not claim feature parity with Aletheia's internal capabilities, model quality, or proving performance, which remain undisclosed. OpenProver's central architectural insight is that persistent structured state — a whiteboard, a file-based repository with wikilink cross-references, and a rolling output window — can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models.
60.1.1 Positioning within the Survey
Within this survey's scope of LLM-powered self-evolving systems, OpenProver occupies a distinctive niche. It is not an evolutionary algorithm discovery system in the traditional sense — it does not maintain populations, apply crossover operators, or optimize fitness functions. Rather, it represents a search-based discovery agent in which the LLM planner performs a form of heuristic tree search over proof strategies, with the whiteboard and repository serving as an explicit external memory that accumulates knowledge across search steps. The system's within-session learning — storing failed approaches, proven lemmas, and synthesized observations — mirrors the knowledge-building dynamics of evolutionary systems, albeit without the population-level selection pressure.
OpenProver's relevance to the evolutionary AI landscape is threefold:
- Reusable orchestration pattern (implemented): Its planner-worker architecture provides a concrete, working pattern for agentic systems that coordinate strategic reasoning with parallel execution.
- Lightweight knowledge management (implemented): Its repository and wikilink system constitute a knowledge management mechanism analogous to the skill libraries and learning logs found in evolutionary code-generation systems.
- Extension points for evolutionary methods (not implemented — speculative): As discussed in Section 60.7.3, the architecture has natural extension points for evolutionary methods including strategy evolution and prompt population optimization. These are research directions identified by the chapter author, not features of the current system.
60.2 Architecture
OpenProver's architecture follows a clean, acyclic dependency structure organized around a central prover loop that coordinates planning, worker dispatch, and state management. The system comprises approximately 3,000 lines of Python distributed across eleven modules in a flat package layout with a self-contained lean/ subpackage. The hard dependency footprint is minimal — only the mcp>=1.0 library is declared in pyproject.toml.
60.2.1 Module Structure
The codebase is organized as a flat Python package with a single self-contained subpackage for Lean integration. The following table lists each module with its responsibility, derived from the repository structure at github.com/Kripner/openprover (v1.0.1):
| Module | File | Est. Lines | Responsibility | Evidence |
|---|---|---|---|---|
| CLI | cli.py | ~200 | Argument parsing, TUI setup, signal handlers, cost summary | Verified in repo |
| Prover | prover.py | ~800 | Core loop, action dispatch, Repo class, wikilink resolution, operating modes | Verified in repo |
| LLM | llm.py | ~500 | LLMClient (Claude CLI), HFClient (vLLM), streaming, archiving, cost tracking | Verified in repo |
| Prompts | prompts.py | ~600 | System prompts, prompt formatters, TOML parser, action enumeration | Verified in repo |
| Budget | budget.py | ~100 | Time/token tracking, threshold calculations | Verified in repo |
| TUI | tui.py | ~800 | Full-screen terminal UI, tab system, streaming display, key handling | Verified in repo |
| Inspect | inspect.py | ~150 | Read-only historical run browser | Verified in repo |
| Lean Core | lean/core.py | ~300 | LeanTheorem parser, proof assembly, run_lean_check, LeanWorkDir | Verified in repo |
| Lean MCP | lean/mcp_server.py | ~200 | MCP server exposing lean_verify and lean_search tools | Verified in repo |
| Lean Data | lean/data.py | ~150 | LeanExplore data management, dependency installation | Verified in repo |
| Lean Tools | lean/tools.py | ~100 | Tool definitions for vLLM native tool calling | Verified in repo |
The dependency graph is strictly acyclic: cli.py depends on prover.py, tui.py, and inspect.py; prover.py depends on llm.py, prompts.py, budget.py, and the lean/ subpackage; prompts.py has no code dependencies beyond the standard library. All functionality beyond MCP comes from the Python standard library (subprocess, json, tomllib, threading, os, select) and the Claude CLI system binary. Optional dependencies for the Lean search pipeline (torch, sentence-transformers, FAISS) are installed on demand by lean/data.py.
60.2.2 Operating Modes
OpenProver supports three operating modes, configured via CLI flags and detected from the presence of input files in the run directory. These modes reflect the dual-track nature of mathematical reasoning — informal intuition followed by formal verification:
| Mode | Input | Goal | Output Artifacts |
|---|---|---|---|
prove | THEOREM.md | Informal proof | PROOF.md |
prove_and_formalize | THEOREM.md + THEOREM.lean | Both proofs | PROOF.md + PROOF.lean |
formalize_only | THEOREM.md + THEOREM.lean + PROOF.md | Formal proof only | PROOF.lean |
This mode separation acknowledges a fundamental asymmetry in mathematical practice: most reasoning happens informally first, and formalization is a distinct — often more difficult — task. By supporting all three modes, OpenProver can serve as both a proof discovery tool and a formalization assistant.
60.3 Core Algorithms
60.3.1 The Planner-Worker Protocol
The central algorithm of OpenProver is a step-wise loop in which a planner LLM makes strategic decisions and dispatches worker LLMs for focused execution. To describe this protocol precisely, we define the following notation:
Symbol Definitions
| $n$ | Current step count (integer, incremented each planner iteration), $n = 1, 2, 3, \ldots$ |
| $\tau$ | Elapsed wall-clock time in seconds since the run started |
| $T_{\max}$ | Maximum wall-clock time budget in seconds (CLI flag --max-time, default 14400s = 4 hours) |
| $N_{\max}$ | Optional cumulative output token budget (CLI flag --max-tokens, no default) |
| $S$ | System prompt (role definition and action specifications), invariant across steps |
| $W_n$ | Whiteboard content at step $n$ (LaTeX-formatted markdown string) |
| $I_n$ | Repository index at step $n$ (one-line summary per stored item) |
| $O_n$ | Output window at step $n$: the last $k = 3$ worker results (rolling FIFO) |
| $B_n$ | Budget status at step $n$: remaining time, conclusion/give-up flags |
| $\mathcal{A}$ | The set of eight valid planner actions (defined below) |
| $\alpha_c, \alpha_g$ | Fractional time thresholds for conclusion and give-up phases (defaults: 0.99, 0.50) |
Step 1 — Prompt Construction. At each step $n$, the planner receives a structured prompt $P_n$ composed of five elements:
Step 2 — TOML Decision. The planner generates a natural-language response containing a fenced TOML block that specifies its action. The TOML structure includes an action identifier $a_n \in \mathcal{A}$, a summary, an updated whiteboard $W_{n+1}$, and action-specific payloads such as task descriptions or item operations.
Step 3 — Action Dispatch. The action is routed to one of eight handlers. The action set is:
Step 4 — State Update. After the action completes, the whiteboard is updated to $W_{n+1}$, the output window shifts to include new results (dropping the oldest if $|O_n| = k$), and all artifacts are persisted to disk in steps/step_NNN/.
The following pseudocode illustrates the core prover loop. It captures the structure of the Prover class in prover.py but simplifies error handling, mode logic, and TUI integration for readability:
# PSEUDOCODE — Simplified from prover.py (Prover class).
# Captures the loop structure. Real implementation handles
# additional concerns: mode detection, resume, TUI callbacks,
# discussion generation, error recovery.
class Prover:
def __init__(self, theorem_path, config, llm_client, tui):
self.whiteboard = "" # W_n: LaTeX scratchpad
self.repo = Repo(run_dir) # I_n: file-based knowledge graph
self.outputs = [] # O_n: rolling output window (max 3)
self.budget = Budget(config) # B_n: time/token tracking
self.step = 0
def run(self):
"""Main prover loop — one iteration per planner step."""
while not self.budget.is_exhausted:
self.step += 1
# Step 1: Build planner prompt (see prompts.py)
prompt = build_planner_prompt(
whiteboard=self.whiteboard,
repo_index=self.repo.get_index(),
recent_outputs=self.outputs[-3:],
budget_status=self.budget.status(),
mode_indicators=self.get_mode_status()
)
# Step 2: Call planner LLM, parse TOML decision
response = self.planner_client.call(prompt, stream=True)
decision = parse_planner_toml(response)
# Step 3: Dispatch action
action = decision["action"]
if action == "spawn":
results = self._handle_spawn(decision["tasks"])
elif action == "literature_search":
results = self._handle_literature_search(decision)
elif action == "submit_proof":
success = self._handle_submit_proof(decision)
if success:
return # Proof found
elif action == "give_up":
if self.budget.can_give_up:
return
# ... other action handlers
# Step 4: Update state
self.whiteboard = decision.get("whiteboard", self.whiteboard)
self.outputs.extend(results)
self._save_step_to_disk(decision, results)
60.3.2 Worker Execution and Wikilink Resolution
Workers are stateless and parallelizable. Before dispatch, each task description undergoes wikilink resolution — a lightweight retrieval mechanism implemented in the Repo class within prover.py that injects relevant repository content into the worker's prompt without consuming tokens on irrelevant material.
The resolution process operates on the task description $d$ by finding all [[slug]] references, fetching the corresponding repository items, and appending them as a "Referenced Materials" section. Let $\text{refs}(d) = \{s_1, s_2, \ldots, s_m\}$ be the set of slugs referenced in task $d$, and let $\text{content}(s_i)$ be the full text of repository item $s_i$. The enriched task is:
where $\|$ denotes string concatenation. This ensures that tokens are spent only when content is explicitly referenced, and each worker sees only the knowledge relevant to its specific subtask.
# PSEUDOCODE — Simplified from prover.py (Repo class and worker dispatch).
# Real implementation: file operations use pathlib; worker dispatch
# integrates with TUI streaming; error handling for missing items.
import re
from concurrent.futures import ThreadPoolExecutor
class Repo:
"""File-based knowledge graph with wikilink cross-references."""
def __init__(self, run_dir):
self.repo_dir = run_dir / "repo"
self.repo_dir.mkdir(exist_ok=True)
def resolve_wikilinks(self, text: str) -> str:
"""Replace [[slug]] references with item content."""
refs = re.findall(r'\[\[([^\]]+)\]\]', text)
if not refs:
return text
materials = []
for slug in refs:
item_path = self.repo_dir / f"{slug}.md"
if item_path.exists():
materials.append(f"## {slug}\n{item_path.read_text()}")
if materials:
text += "\n\n---\n# Referenced Materials\n" + "\n\n".join(materials)
return text
def get_index(self) -> str:
"""One-line summary per item for the planner prompt."""
lines = []
for item_path in sorted(self.repo_dir.glob("*.md")):
first_line = item_path.read_text().split('\n')[0]
lines.append(f"- [[{item_path.stem}]]: {first_line}")
return "\n".join(lines)
def dispatch_workers(tasks, repo, worker_client, parallelism=1):
"""Execute worker tasks with wikilink resolution.
Parallelism controlled by --parallelism flag (default 1)."""
def run_task(task_desc):
enriched = repo.resolve_wikilinks(task_desc)
return worker_client.call(enriched, stream=True)
with ThreadPoolExecutor(max_workers=parallelism) as pool:
results = list(pool.map(run_task, tasks))
return results
60.3.3 TOML-Based Structured Output
A pragmatic but noteworthy design choice is the use of TOML as the structured output format for planner decisions. Unlike JSON, TOML supports multi-line strings naturally (via triple-quoted strings) and avoids the escaping issues that frequently cause JSON parsing failures in LLM output. The planner's response contains a fenced TOML block extracted by parse_planner_toml() in prompts.py, using Python 3.11's built-in tomllib (with a tomli fallback for Python 3.10):
# PSEUDOCODE — Simplified from prompts.py (TOML parsing).
# Real implementation includes additional fallback parsers and
# handles edge cases in TOML block extraction.
import re
try:
import tomllib
except ImportError:
import tomli as tomllib # Python 3.10 fallback
def parse_planner_toml(response_text: str) -> dict:
"""Extract and parse the TOML decision block from planner output."""
match = re.search(r'```toml\s*\n(.*?)```', response_text, re.DOTALL)
if not match:
raise ValueError("No TOML block found in planner response")
toml_str = match.group(1)
decision = tomllib.loads(toml_str)
assert "action" in decision, "Missing 'action' field"
assert decision["action"] in VALID_ACTIONS
return decision
A typical planner TOML decision (example, not from a specific run):
# Illustrative example of planner TOML output format
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = '''
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q in lowest terms, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed — no algebraic identity yields contradiction)
'''
[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference [[parity-lemma]]."
[[tasks]]
description = "Verify the infinite descent argument terminates."
60.3.4 Budget-Aware Planning
The planner operates under explicit budget constraints that shape its search strategy. The Budget class in budget.py tracks wall-clock time and optionally cumulative output tokens. It exposes three boolean checkpoints that are included in the planner's prompt at every step. Using the notation defined in Section 60.3.1:
Here $\tau / T_{\max}$ is the fraction of the wall-clock time budget consumed. The defaults — $\alpha_c = 0.99$ (conclude after 99% of time) and $\alpha_g = 0.50$ (give-up permitted after 50% of time) — are configurable via --conclude-after and --give-up-after CLI flags. These thresholds enable adaptive behavior: aggressive exploration in early steps, focused refinement in middle steps, and consolidation or graceful termination in late steps. Note that these are soft signals communicated through the planner prompt — the LLM decides how to respond to them, and the only hard termination is is_exhausted.
60.3.5 Lean 4 Formal Verification Pipeline
When the --lean-project flag is set, OpenProver activates a formal verification pipeline implemented in the lean/ subpackage. This pipeline operates at three levels:
Level 1 — Repository Item Verification. When the planner stores a repository item with format="lean", the Lean code is written to a file within the specified Lean project directory and compiled via lake env lean <file>. Compiler output (success or errors) is returned to the planner. This is implemented in lean/core.py via the run_lean_check() function and the LeanWorkDir class, which manages temporary Lean source files within the user's project.
Level 2 — Final Proof Assembly. The LeanTheorem class in lean/core.py parses THEOREM.lean to locate all sorry positions — placeholders where proofs need to be filled. The assemble_proof() method replaces each sorry with the planner-provided proof block, processing in reverse order to preserve character offsets:
# PSEUDOCODE — Simplified from lean/core.py (LeanTheorem class).
# Real implementation includes additional validation, context
# injection, and error reporting.
class LeanTheorem:
"""Parse a Lean 4 theorem file and locate sorry positions."""
def __init__(self, lean_path):
self.source = lean_path.read_text()
self.sorry_positions = self._find_sorry_positions()
def _find_sorry_positions(self):
"""Find (start, end) byte offsets of each 'sorry' in source."""
positions = []
for match in re.finditer(r'\bsorry\b', self.source):
positions.append((match.start(), match.end()))
return positions
def assemble_proof(self, blocks, context=None):
"""Replace sorry placeholders with proof blocks.
Processes in reverse order to preserve character offsets.
Validates: correct count, no imports in injected code.
"""
if len(blocks) != len(self.sorry_positions):
raise ValueError(
f"Expected {len(self.sorry_positions)} blocks, "
f"got {len(blocks)}"
)
result = self.source
for (start, end), block in zip(
reversed(self.sorry_positions), reversed(blocks)
):
if "import" in block:
raise ValueError("Import statements not allowed in proof blocks")
result = result[:start] + block + result[end:]
return result
Level 3 — Worker Tools via MCP. When --lean-worker-tools is enabled, workers gain access to two MCP tools: lean_verify (compile a Lean snippet in the project context) and lean_search (search ~400,000 Lean 4 declarations). For Claude-based workers, the MCP server (lean/mcp_server.py) is spawned as a subprocess configured via --mcp-config. For vLLM workers, LeanExplore search runs in-process using the tool definitions in lean/tools.py with native OpenAI-compatible tool calling.
60.3.6 The Lean Explore Search Pipeline
One of OpenProver's most technically sophisticated components is the lean_search tool, which searches approximately 400,000 Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std. The pipeline employs a multi-stage retrieval architecture implemented in lean/mcp_server.py and lean/data.py:
| Stage | Technique | Approx. Latency | Details |
|---|---|---|---|
| BM25 Retrieval | Keyword search over LLM-generated descriptions | ~0.01s | Two indices with different tokenization, results merged |
| Semantic Retrieval | Qwen3-Embedding-0.6B (1024-dim), FAISS index | ~1s | Pre-built index, cosine similarity |
| Score Fusion | Normalized linear combination | — | BM25 and semantic scores combined; see formula below |
| Reranking | Qwen3-Reranker-0.6B cross-encoder | ~1–2s (GPU) | Skipped on CPU (~50s); GPU recommended |
| Hydration | Full metadata lookup from SQLite | — | Signature, docstring, source location |
The score fusion stage combines sparse and dense retrieval with a weighted sum. Let $d$ denote a Lean declaration and $q$ the search query. Define $\hat{s}_{\text{BM25}}(d, q)$ and $\hat{s}_{\text{sem}}(d, q)$ as the min-max normalized scores from the BM25 and semantic retrieval stages respectively. The fused score is:
where $\lambda = 0.3$ (i.e., 30% keyword weight, 70% semantic weight). After fusion, the repository applies a dependency boost — declarations that appear in the dependency graph of already-retrieved top results receive a score increase. The exact magnitude and mechanism of this boost are not fully detailed in the codebase comments; it appears to be a heuristic adjustment that promotes declarations related to the current proof context. We describe it here descriptively rather than formalizing it, as the implementation details are subject to change across versions.
The first call takes approximately 10 seconds for model loading; subsequent calls complete in roughly 1 second. These latency figures are reported in the repository documentation and may vary by hardware.
60.4 Memory Architecture
A central design thesis of OpenProver is that persistent structured state can substitute for massive context windows. The system implements a five-layer memory hierarchy, each with distinct capacity, persistence, and visibility characteristics:
The planner's per-step context budget is carefully managed. In a typical step, the planner sees approximately 3,000–7,000 tokens: ~500–2,000 for the whiteboard, ~50–200 for the repository index (one line per item), ~1,000–3,000 for the last three outputs, ~50 for budget status, and ~1,000–2,000 for the system prompt (defined in prompts.py). This is well within any frontier model's context capacity, meaning the planner never hits context limits, even in multi-hour sessions spanning dozens of steps.
This design contrasts sharply with single-agent approaches that maintain one long conversation thread. In those systems, context window limits create a hard ceiling on reasoning depth. OpenProver's external memory architecture removes this ceiling by allowing the planner to selectively attend to relevant information — reading specific repository items, reviewing recent outputs, and consulting its own strategic notes — without carrying the full history of every prior step.
60.4.1 Within-Session Learning
The repository and whiteboard together enable a form of within-session learning that mirrors the knowledge accumulation dynamics of evolutionary systems:
- Exploration and failure recording. Early workers explore different proof strategies. When an approach fails, the planner stores the failure as a repository item (e.g.,
approach-a-failure.md) and updates the whiteboard's "Tried" section. - Knowledge accumulation. Proven lemmas, useful observations, and literature findings accumulate in the repository. These are available to all future workers via wikilink references.
- Strategic adaptation. The planner synthesizes patterns across failures and successes, updating its whiteboard strategy accordingly. By step 10 or 15, the system has built a structured understanding of the problem that no single LLM call could achieve.
However, OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement. There is no proof strategy transfer, no tactic preference learning, no failure pattern recognition across runs, and no repository bootstrapping from prior successful proofs. This represents a significant limitation and a natural extension point, as discussed in Section 60.7.
60.5 LLM Integration
60.5.1 Multi-Provider Architecture
OpenProver implements a factory pattern for LLM clients in llm.py, supporting three distinct provider backends. A distinctive feature is the hierarchical model assignment — different models can be assigned to planner and worker roles via --planner-model and --worker-model CLI flags:
| Provider | Client Class | Interface | Tool Support | Cost Tracking |
|---|---|---|---|---|
| Claude (Anthropic) | LLMClient in llm.py | Claude Code CLI subprocess | MCP server for Lean tools | Yes (parsed from CLI JSON response) |
| Leanstral (Mistral) | via OpenAI-compatible | Mistral API | Native | Via API response |
| Local/vLLM | HFClient in llm.py | OpenAI-compatible HTTP | Native tool calling | No (local compute) |
The rationale for model separation is cost-efficiency: the planner sees a compressed view of the proof state and must make strategic decisions — a task suited to more capable but expensive models (e.g., Claude Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (e.g., Claude Sonnet). A typical configuration uses --planner-model opus and --worker-model sonnet.
60.5.2 Claude CLI Integration
The Claude integration is unconventional: rather than using an SDK library, OpenProver wraps the claude CLI binary directly via subprocess.Popen. This design choice, observable in the LLMClient class in llm.py, provides control over streaming behavior and tool configuration without SDK version dependencies. Streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.
The NDJSON stream from the Claude CLI carries tool-use events that OpenProver parses to detect MCP tool calls:
tool_usecontent blocks carry tool name and input (streamed incrementally)- Tool results appear as top-level
{"type": "user"}messages containingtool_resultentries matched bytool_use_id - MCP prefixes are stripped:
mcp__lean_tools__lean_verifybecomeslean_verify
When --lean-project is set, the Claude CLI is configured to spawn an MCP server subprocess using flags including --mcp-config, --strict-mcp-config, --permission-mode bypassPermissions, and --allowedTools restricting access to lean_verify and lean_search.
60.5.3 Literature Search
When the --no-isolation flag is set, the planner gains access to a literature_search action that spawns a web-enabled Claude worker with WebSearch and WebFetch tools. This enables the system to consult mathematical papers and known results mid-proof — a capability that mirrors how human mathematicians work. Results are stored as repository items and become available to all future workers via wikilinks. This integration is observable in the _handle_literature_search method in prover.py.
60.6 Demonstrated Capabilities and Evidence Model
OpenProver is primarily a tool-focused repository rather than a benchmark-reporting research paper. It does not provide systematic evaluations on standard theorem-proving benchmarks such as miniF2F, ProofNet, or FIMO. The evidence for its capabilities comes from case-study demonstrations — example runs included in the repository — rather than controlled experiments with statistical reporting. This section describes what those demonstrations show and explicitly states the limits of the evidence.
Evidence Scope Statement
All results below are from ad-hoc demonstrations documented in the repository's example runs and README. No controlled benchmark evaluation with standardized metrics, multiple seeds, or baseline comparisons has been published. The "Status" column reflects the repository author's reported outcomes, not independently verified results. Cost and timing estimates are author-provided and depend on model pricing at time of run.
60.6.1 Case-Study Results
| Problem | Difficulty | Status (reported) | Evidence Type |
|---|---|---|---|
| $\sqrt{2}$ is irrational | Easy (textbook) | Solved — informal and formal proofs produced | Repository example run |
| Infinitely many primes | Easy (textbook) | Solved — informal proofs produced | Repository example run |
| $e$ is irrational | Medium (undergraduate) | Proof strategies explored | Repository example run |
| Cauchy-Schwarz inequality | Medium (undergraduate) | Informal and formal attempts | Repository example run |
| Erdős problem 205 | Medium (research-adjacent) | Multi-step exploration | Repository example run |
| Erdős problem 838 | Hard (open) | Extended autonomous sessions | Repository example run |
| Collatz conjecture | Open (famously unsolved) | Exploration (unsolved, as expected) | Repository example run |
These demonstrations validate that the planner-worker-whiteboard architecture functions for problems ranging from textbook to research-level: the planner dispatches workers, the whiteboard accumulates state across steps, wikilinks resolve correctly, and the budget system terminates gracefully. However, the absence of systematic benchmarking means we cannot make quantitative claims about the system's proving strength relative to alternatives such as LeanCopilot, LEGO-Prover, or UlamAI. Specifically:
- No success/failure rates are reported across a problem suite.
- No step counts, token counts, or wall-clock times are reported in a standardized format.
- No ablation studies isolate the contribution of individual components (whiteboard, wikilinks, literature search, worker parallelism).
- No comparison with baseline approaches (e.g., single-agent prompting) is available.
60.6.2 Architectural Validation
Setting aside proving strength, the case studies validate that each major architectural component of the Aletheia-style planner-worker pattern can be implemented with open-source tooling:
| Aletheia Feature (as publicly described) | OpenProver Equivalent | Implementation Status |
|---|---|---|
| Gemini Deep Think planner | Claude Opus / any LLM planner | Implemented (prover.py) |
| Parallel worker execution | ThreadPoolExecutor workers | Implemented (prover.py) |
| Whiteboard scratchpad | WHITEBOARD.md with LaTeX | Implemented (prover.py) |
| Repository of findings | repo/ with wikilinks | Implemented (Repo class, prover.py) |
| Extraction + verification | submit_proof + Lean 4 | Implemented (lean/core.py) |
| Literature search | Web-enabled Claude workers | Implemented (prover.py, requires --no-isolation) |
| Formal proof compilation | lake env lean + MCP tools | Implemented (lean/ subpackage) |
It is important to note that this is architectural replication based on Aletheia's publicly described design, not a performance equivalence claim. Aletheia's internal capabilities — including Gemini's training, fine-tuning, proprietary data, and actual proving power — are not disclosed, making direct comparison impossible.
60.6.3 Cost Estimation
OpenProver logs every LLM call's cost in archive/calls/call_NNN.json, enabling precise post-hoc cost analysis. The following are rough estimates from the repository documentation, not systematic measurements across multiple runs:
| Scenario | Model Configuration | Parallelism | Time (approx.) | Estimated Cost | Source |
|---|---|---|---|---|---|
| Easy theorem ($\sqrt{2}$ irrational) | Sonnet | 1 | ~10–30 min | ~$1–5 | README estimate |
| Medium theorem (Cauchy-Schwarz) | Sonnet | 1 | ~1–2 hours | ~$10–30 | README estimate |
| Hard exploration (Erdős 838) | Opus planner + Sonnet workers | 3 | 2–4 hours | ~$50–200 | README estimate |
| Local model run | vLLM (MiniMax-M2.5) | 1 | Variable | $0 (compute only) | README estimate |
Actual costs depend on model pricing (which changes frequently), theorem difficulty, and search efficiency. The per-call cost logging enables users to compute precise totals for their own runs.
60.7 Limitations and Discussion
60.7.1 Honest Assessment of Limitations
Several significant limitations constrain OpenProver's current impact and should be considered when evaluating the system:
| Limitation | Details |
|---|---|
| No systematic benchmark results | No evaluation on standard benchmarks such as miniF2F, ProofNet, or FIMO. All demonstrated results are from ad-hoc case studies. This makes it impossible to position OpenProver's proving strength quantitatively relative to alternatives. |
| No cross-run learning | Each run starts from scratch. No strategy transfer, tactic learning, or failure avoidance across runs. |
| Claude-centric default configuration | The default setup assumes the Claude Code CLI is installed and on PATH. Alternative providers require additional configuration and different flags. |
| No reinforcement learning or fine-tuning | Pure prompting only. No model improvement from proof traces. No learned value function to guide search. |
| Solo development | 388 commits by a single developer as of April 2026. Bus factor of 1 creates sustainability risk. |
| Informal-first bias | The system is strongest at informal proofs. Formal verification is a secondary capability that depends on the quality of the informal reasoning step. |
| No value model for search guidance | Unlike RL-trained provers (e.g., AlphaProof), search relies entirely on LLM intuition rather than learned heuristics. |
| Non-determinism | LLM outputs are inherently non-deterministic. The system does not expose temperature or seed controls. Exact reproducibility of proof searches is not achievable. |
60.7.2 Comparison with Alternative Approaches
| System | Year | Architecture | Formal Verification | Open Source | Training Required |
|---|---|---|---|---|---|
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Extraction + verification | No | Gemini fine-tuning |
| LeanCopilot | 2023 | Single-agent tactic suggestion | Lean 4 via LeanDojo | Yes | No (prompting) |
| LEGO-Prover | 2024 | Skill library + decomposition | Lean via LeanDojo | Partial | No (prompting) |
| ReProver | 2023 | Encoder retrieval + generation | Lean 4 | Yes | Fine-tuned encoder |
| OpenProver | 2025 | Planner-worker-whiteboard | Lean 4 via MCP | Yes | No (prompting) |
| UlamAI | 2026 | Search-based with LLM policy | Lean 4 via LSP | Yes | No (prompting) |
OpenProver's primary strength relative to single-agent provers (LeanCopilot, ReProver) is its richer strategic reasoning, literature search capability, and persistent memory. Against RL-trained provers (AlphaProof), it requires no training but lacks learned value functions. Against search-based provers (UlamAI), it offers richer proof state through the whiteboard and repository but lacks sophisticated tactic-level search with transposition tables. These comparisons are qualitative — no head-to-head benchmark comparisons have been published.
60.7.3 Extension Points for Evolutionary Methods
Status: Not implemented. The following extension points are identified by the chapter author as promising research directions suggested by the architecture. None of these features exist in the current OpenProver codebase. They are included here to connect the system to this survey's broader themes, not to describe implemented functionality.
OpenProver's architecture presents several natural extension points for evolutionary and population-based methods:
- Repository Seeding. Pre-populating
repo/with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports referencing arbitrary markdown files — only a cross-run knowledge retrieval mechanism would be needed to select relevant prior knowledge. - Strategy Evolution. The whiteboard format could serve as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress indicators. A population of whiteboard templates, subject to selection and mutation, could accelerate convergence on effective strategies for a given problem family.
- Prompt Population. The planner and worker system prompts in
prompts.pycould be evolved using the LLM-powered evolutionary paradigm described throughout this survey, with different prompt variants competing on benchmark problems. - Model Selection as Hyperparameter. The planner/worker model assignment could be treated as an evolvable configuration, optimized across a population of runs with different model-task allocations.
These directions would require significant additional engineering — at minimum, a benchmark suite for fitness evaluation, a run orchestrator for managing populations of proving attempts, and a knowledge retrieval system for cross-run repository seeding. They represent multi-month research projects, not incremental feature additions.
60.8 Reproducibility
OpenProver achieves strong within-run reproducibility through its file-based state management. Every run creates a self-contained directory structure. The following layout is documented in the repository README and verified by inspection:
# Directory structure for a completed run
# Verified from repository documentation and run artifacts
runs/<slug>-<timestamp>/
├── THEOREM.md # Immutable copy of input theorem
├── THEOREM.lean # Formal Lean statement (if provided)
├── WHITEBOARD.md # Current whiteboard state (enables resume)
├── PROOF.md # Final informal proof (if found)
├── PROOF.lean # Final formal proof (if lean mode)
├── DISCUSSION.md # Post-session LLM analysis
├── run_config.toml # Saved run configuration
├── repo/ # Repository items with wikilinks
│ ├── lemma-foo.md
│ └── observation-bar.md
├── steps/
│ ├── step_001/
│ │ ├── planner.toml # Planner's TOML decision
│ │ └── workers/
│ │ ├── task_0.md # Worker task description
│ │ ├── result_0.md # Worker output
│ │ └── worker_0_call.json
│ └── step_002/...
└── archive/
└── calls/
└── call_001.json # Full LLM call record (prompt, response, cost, timing)
The resume protocol detects existing runs by checking for WHITEBOARD.md + THEOREM.md, infers the step count from step_NNN directories, and restores settings from run_config.toml (with CLI flags overriding saved values). This means a proving session survives machine crashes, power loss, Ctrl-C interruptions, context window exhaustion, and network failures during API calls.
Historical runs can be browsed in read-only mode via openprover inspect runs/<run-dir>, which presents every step's planner decisions, worker tasks, and results in the TUI.
60.8.1 Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.11+ (for tomllib) |
| RAM | 4 GB | 8+ GB (for Lean Explore models) |
| GPU | None (CPU works) | CUDA GPU (for lean_search reranking) |
| Disk | ~500 MB (base) | ~2 GB (with Lean Explore data) |
| Lean 4 | Optional | Required for formal verification modes |
| Claude CLI | Required for default config | — |
60.8.2 Reproducibility Limitations
Despite strong artifact management, exact reproducibility is limited by several factors:
- LLM non-determinism. Even with identical prompts and configurations, different API calls may produce different planner decisions, leading to divergent proof searches. The system does not currently expose model temperature or random seed controls through its interface.
- Provider evolution. LLM provider behavior changes over time as models are updated. A run that succeeded in February 2025 may behave differently when repeated in 2026, even with the same configuration.
- No standardized benchmark protocol. Without fixed problem sets, evaluation metrics, and reporting standards, comparisons between different OpenProver runs (or between OpenProver and other systems) cannot be made rigorously.
60.8.3 Typical Usage
The following CLI invocations, drawn from the repository README, illustrate standard usage patterns:
# Quick informal proof exploration (from repository README)
openprover my_conjecture.md --autonomous --max-time 1h
# Deep dive with literature search and parallel workers
openprover hard_problem.md --no-isolation --planner-model opus \
--worker-model sonnet -P 3 --max-time 4h
# Formalize a proof with Lean 4 verification
openprover theorem.md --lean-project ~/mathlib4 \
--lean-theorem theorem.lean --proof proof.md
# Inspect a historical run
openprover inspect runs/sqrt2-20250315-143022/
60.9 Significance for the Field
OpenProver's contribution to the landscape of LLM-powered systems is best understood at three levels.
At the architecture level, it demonstrates that the planner-worker-whiteboard pattern — publicly described by DeepMind's Aletheia for automated mathematical reasoning — can be implemented with open-source components and commercial LLM APIs. To our knowledge, it is among the first open-source systems to implement this specific architectural pattern for theorem proving, though this claim is bounded to systems identified in this survey as of April 2026. The clean separation between strategic planning and focused execution, mediated by persistent external state, provides a reusable architectural template for any agentic system operating over extended time horizons.
At the memory level, it shows that a remarkably simple file-based knowledge management system — markdown files with wikilink cross-references — can effectively compensate for context window limitations. The planner never exceeds ~7,000 tokens per step, yet can coordinate proof searches spanning hours and dozens of steps. This is a practical counterpoint to the assumption that longer context windows are the primary path to deeper reasoning.
At the community level, it lowers the barrier to entry for automated theorem proving research. A researcher with a Claude API key (or a local vLLM instance) and a theorem statement can launch an autonomous proof search in minutes. The full auditability of every run — with archived prompts, responses, costs, and decisions — makes the system valuable for studying LLM reasoning behavior in mathematical domains.
The system's threading model (not async), subprocess-based LLM integration (not SDK), and file-based state (not database) represent pragmatic engineering choices that prioritize simplicity and inspectability over scalability. For a research tool intended to be read and modified by individual researchers, these are defensible tradeoffs.
Key caveats. OpenProver has not been evaluated on standard benchmarks, so its competitive proving strength relative to systems like LeanCopilot, LEGO-Prover, ReProver, or UlamAI remains unknown. The architectural equivalence to Aletheia is structural — based on publicly described design elements — not a claim of performance parity. The system's value at this stage is primarily as an open research platform and architectural reference implementation, not as a state-of-the-art prover.
Summary
Key takeaway: OpenProver is an open-source implementation of the planner-worker-whiteboard architecture for LLM-guided theorem proving, inspired by DeepMind's Aletheia. It demonstrates that persistent structured state — a terse LaTeX whiteboard, a file-based knowledge repository with wikilink retrieval, and a rolling output window — can enable multi-hour proof search sessions with standard-context models, without custom training or proprietary infrastructure.
Main contribution: An inspectable, resumable, multi-provider orchestration framework (~3,000 lines of Python, single hard dependency) that bridges the gap between closed frontier proof agents and the open research community, providing both informal and Lean 4 formal proving capabilities with full artifact auditability.
Evidence model: The system's capabilities are demonstrated through case studies on problems ranging from textbook (√2 irrationality) to open research problems (Erdős 838), documented as example runs in the repository. No systematic benchmark evaluation has been published. The architectural pattern generalizes beyond theorem proving, but quantitative claims about proving strength await benchmark-scale evaluation.
What a researcher should know: OpenProver is best understood as a research platform and architectural reference, not as a benchmark-dominating prover. Its value lies in inspectability, extensibility, and the demonstration that the planner-worker-whiteboard pattern is viable with commodity LLM APIs. Key limitations are the absence of cross-run learning and systematic benchmarking. The evolutionary extensions discussed in Section 60.7.3 are speculative research directions, not implemented features.