Introduced2026-03
Score8.58/10 — Final
Chapter 60

OpenProver

Part P08: Harness & Agent Frameworks

60.1 Overview and Motivation

Automated theorem proving stands at the frontier of mathematical reasoning by machines. While large language models have demonstrated impressive informal mathematical ability — solving competition problems, generating plausible proof sketches, and even suggesting novel conjectures — the gap between informal reasoning and machine-checked formal proofs remains substantial. Bridging this gap requires systems that can orchestrate extended, multi-step proof searches while maintaining strategic coherence over hours of exploration.

In February 2025, Google DeepMind announced Aletheia, a fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6 of 10 open research problems from the FirstProof challenge. Aletheia demonstrated a planner-worker-whiteboard architecture — a design pattern in which a strategic planner coordinates focused, stateless workers through a persistent mathematical scratchpad. The system was proprietary, leaving the research community without access to its internals or the ability to reproduce its results.

OpenProver, developed by Matěj Kripner at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague, is an open-source implementation of a planner-worker-whiteboard architecture for automated theorem proving, explicitly inspired by the Aletheia design as described in DeepMind's public communications. Released in February 2025 with active development through v1.0.1 (March 2026), OpenProver provides a fully inspectable, resumable, multi-LLM orchestration framework that supports both informal natural-language proofs and machine-checked Lean 4 formalizations.

Repository Snapshot

Repositorygithub.com/Kripner/openprover
Version reviewedv1.0.1 (PyPI release, March 2026)
Commits at review388 on master branch
Primary languagePython (92.9%), ~3,000 lines across 11 modules
LicenseNot explicitly specified in repository
Review access dateApril 2026

All file paths, module references, and structural claims in this chapter are based on inspection of the repository at the above date. Line counts are estimates. Code snippets are labeled as pseudocode, lightly edited, or verbatim at each occurrence.

Key Contribution

OpenProver demonstrates that a planner-worker-whiteboard architecture — similar in high-level design to the pattern described in DeepMind's Aletheia announcements — can be constructed using open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. The scope of equivalence is architectural: OpenProver replicates the separation of planner and workers, persistent whiteboard state, and parallel worker dispatch, based on Aletheia's publicly described design. It does not claim feature parity with Aletheia's internal capabilities, model quality, or proving performance, which remain undisclosed. OpenProver's central architectural insight is that persistent structured state — a whiteboard, a file-based repository with wikilink cross-references, and a rolling output window — can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models.

60.1.1 Positioning within the Survey

Within this survey's scope of LLM-powered self-evolving systems, OpenProver occupies a distinctive niche. It is not an evolutionary algorithm discovery system in the traditional sense — it does not maintain populations, apply crossover operators, or optimize fitness functions. Rather, it represents a search-based discovery agent in which the LLM planner performs a form of heuristic tree search over proof strategies, with the whiteboard and repository serving as an explicit external memory that accumulates knowledge across search steps. The system's within-session learning — storing failed approaches, proven lemmas, and synthesized observations — mirrors the knowledge-building dynamics of evolutionary systems, albeit without the population-level selection pressure.

OpenProver's relevance to the evolutionary AI landscape is threefold:

  1. Reusable orchestration pattern (implemented): Its planner-worker architecture provides a concrete, working pattern for agentic systems that coordinate strategic reasoning with parallel execution.
  2. Lightweight knowledge management (implemented): Its repository and wikilink system constitute a knowledge management mechanism analogous to the skill libraries and learning logs found in evolutionary code-generation systems.
  3. Extension points for evolutionary methods (not implemented — speculative): As discussed in Section 60.7.3, the architecture has natural extension points for evolutionary methods including strategy evolution and prompt population optimization. These are research directions identified by the chapter author, not features of the current system.

60.2 Architecture

OpenProver's architecture follows a clean, acyclic dependency structure organized around a central prover loop that coordinates planning, worker dispatch, and state management. The system comprises approximately 3,000 lines of Python distributed across eleven modules in a flat package layout with a self-contained lean/ subpackage. The hard dependency footprint is minimal — only the mcp>=1.0 library is declared in pyproject.toml.

CLI (cli.py) Parse args → Setup TUI → Create Prover → Run → Print cost Prover Loop (prover.py) Whiteboard (LaTeX) Repository (repo/*.md) Output Window (last 3) Budget Tracker spawn → Workers literature_search read_items / write_items write_whiteboard read_theorem submit_proof give_up TUI (tui.py) Full-screen terminal interface Tabs: planner + workers Streaming + ANSI rendering display LLM Layer (llm.py) LLMClient (Claude CLI) Popen + readline, NDJSON MCP config, cost tracking HFClient (vLLM/OpenAI) HTTP streaming Native tool calling Leanstral (Mistral) OpenAI-compatible API Lean-specialized model Lean Layer (lean/) core.py — Parse, assemble, verify mcp_server.py — lean_verify + lean_search data.py — LeanExplore index (~400k decls) Disk Persistence: runs/<slug>/ — WHITEBOARD.md, repo/, steps/, archive/calls/, PROOF.md, PROOF.lean

60.2.1 Module Structure

The codebase is organized as a flat Python package with a single self-contained subpackage for Lean integration. The following table lists each module with its responsibility, derived from the repository structure at github.com/Kripner/openprover (v1.0.1):

ModuleFileEst. LinesResponsibilityEvidence
CLIcli.py~200Argument parsing, TUI setup, signal handlers, cost summaryVerified in repo
Proverprover.py~800Core loop, action dispatch, Repo class, wikilink resolution, operating modesVerified in repo
LLMllm.py~500LLMClient (Claude CLI), HFClient (vLLM), streaming, archiving, cost trackingVerified in repo
Promptsprompts.py~600System prompts, prompt formatters, TOML parser, action enumerationVerified in repo
Budgetbudget.py~100Time/token tracking, threshold calculationsVerified in repo
TUItui.py~800Full-screen terminal UI, tab system, streaming display, key handlingVerified in repo
Inspectinspect.py~150Read-only historical run browserVerified in repo
Lean Corelean/core.py~300LeanTheorem parser, proof assembly, run_lean_check, LeanWorkDirVerified in repo
Lean MCPlean/mcp_server.py~200MCP server exposing lean_verify and lean_search toolsVerified in repo
Lean Datalean/data.py~150LeanExplore data management, dependency installationVerified in repo
Lean Toolslean/tools.py~100Tool definitions for vLLM native tool callingVerified in repo

The dependency graph is strictly acyclic: cli.py depends on prover.py, tui.py, and inspect.py; prover.py depends on llm.py, prompts.py, budget.py, and the lean/ subpackage; prompts.py has no code dependencies beyond the standard library. All functionality beyond MCP comes from the Python standard library (subprocess, json, tomllib, threading, os, select) and the Claude CLI system binary. Optional dependencies for the Lean search pipeline (torch, sentence-transformers, FAISS) are installed on demand by lean/data.py.

60.2.2 Operating Modes

OpenProver supports three operating modes, configured via CLI flags and detected from the presence of input files in the run directory. These modes reflect the dual-track nature of mathematical reasoning — informal intuition followed by formal verification:

ModeInputGoalOutput Artifacts
proveTHEOREM.mdInformal proofPROOF.md
prove_and_formalizeTHEOREM.md + THEOREM.leanBoth proofsPROOF.md + PROOF.lean
formalize_onlyTHEOREM.md + THEOREM.lean + PROOF.mdFormal proof onlyPROOF.lean

This mode separation acknowledges a fundamental asymmetry in mathematical practice: most reasoning happens informally first, and formalization is a distinct — often more difficult — task. By supporting all three modes, OpenProver can serve as both a proof discovery tool and a formalization assistant.

60.3 Core Algorithms

60.3.1 The Planner-Worker Protocol

The central algorithm of OpenProver is a step-wise loop in which a planner LLM makes strategic decisions and dispatches worker LLMs for focused execution. To describe this protocol precisely, we define the following notation:

Symbol Definitions

$n$Current step count (integer, incremented each planner iteration), $n = 1, 2, 3, \ldots$
$\tau$Elapsed wall-clock time in seconds since the run started
$T_{\max}$Maximum wall-clock time budget in seconds (CLI flag --max-time, default 14400s = 4 hours)
$N_{\max}$Optional cumulative output token budget (CLI flag --max-tokens, no default)
$S$System prompt (role definition and action specifications), invariant across steps
$W_n$Whiteboard content at step $n$ (LaTeX-formatted markdown string)
$I_n$Repository index at step $n$ (one-line summary per stored item)
$O_n$Output window at step $n$: the last $k = 3$ worker results (rolling FIFO)
$B_n$Budget status at step $n$: remaining time, conclusion/give-up flags
$\mathcal{A}$The set of eight valid planner actions (defined below)
$\alpha_c, \alpha_g$Fractional time thresholds for conclusion and give-up phases (defaults: 0.99, 0.50)

Step 1 — Prompt Construction. At each step $n$, the planner receives a structured prompt $P_n$ composed of five elements:

$$P_n = \langle S,\; W_n,\; I_n,\; O_n,\; B_n \rangle$$

Step 2 — TOML Decision. The planner generates a natural-language response containing a fenced TOML block that specifies its action. The TOML structure includes an action identifier $a_n \in \mathcal{A}$, a summary, an updated whiteboard $W_{n+1}$, and action-specific payloads such as task descriptions or item operations.

Step 3 — Action Dispatch. The action is routed to one of eight handlers. The action set is:

$$\mathcal{A} = \{\texttt{spawn},\; \texttt{literature\_search},\; \texttt{read\_items},\; \texttt{write\_items},\; \texttt{write\_whiteboard},\; \texttt{read\_theorem},\; \texttt{submit\_proof},\; \texttt{give\_up}\}$$

Step 4 — State Update. After the action completes, the whiteboard is updated to $W_{n+1}$, the output window shifts to include new results (dropping the oldest if $|O_n| = k$), and all artifacts are persisted to disk in steps/step_NNN/.

The following pseudocode illustrates the core prover loop. It captures the structure of the Prover class in prover.py but simplifies error handling, mode logic, and TUI integration for readability:

# PSEUDOCODE — Simplified from prover.py (Prover class).
# Captures the loop structure. Real implementation handles
# additional concerns: mode detection, resume, TUI callbacks,
# discussion generation, error recovery.

class Prover:
    def __init__(self, theorem_path, config, llm_client, tui):
        self.whiteboard = ""          # W_n: LaTeX scratchpad
        self.repo = Repo(run_dir)     # I_n: file-based knowledge graph
        self.outputs = []             # O_n: rolling output window (max 3)
        self.budget = Budget(config)  # B_n: time/token tracking
        self.step = 0

    def run(self):
        """Main prover loop — one iteration per planner step."""
        while not self.budget.is_exhausted:
            self.step += 1

            # Step 1: Build planner prompt (see prompts.py)
            prompt = build_planner_prompt(
                whiteboard=self.whiteboard,
                repo_index=self.repo.get_index(),
                recent_outputs=self.outputs[-3:],
                budget_status=self.budget.status(),
                mode_indicators=self.get_mode_status()
            )

            # Step 2: Call planner LLM, parse TOML decision
            response = self.planner_client.call(prompt, stream=True)
            decision = parse_planner_toml(response)

            # Step 3: Dispatch action
            action = decision["action"]
            if action == "spawn":
                results = self._handle_spawn(decision["tasks"])
            elif action == "literature_search":
                results = self._handle_literature_search(decision)
            elif action == "submit_proof":
                success = self._handle_submit_proof(decision)
                if success:
                    return  # Proof found
            elif action == "give_up":
                if self.budget.can_give_up:
                    return
            # ... other action handlers

            # Step 4: Update state
            self.whiteboard = decision.get("whiteboard", self.whiteboard)
            self.outputs.extend(results)
            self._save_step_to_disk(decision, results)

60.3.2 Worker Execution and Wikilink Resolution

Workers are stateless and parallelizable. Before dispatch, each task description undergoes wikilink resolution — a lightweight retrieval mechanism implemented in the Repo class within prover.py that injects relevant repository content into the worker's prompt without consuming tokens on irrelevant material.

The resolution process operates on the task description $d$ by finding all [[slug]] references, fetching the corresponding repository items, and appending them as a "Referenced Materials" section. Let $\text{refs}(d) = \{s_1, s_2, \ldots, s_m\}$ be the set of slugs referenced in task $d$, and let $\text{content}(s_i)$ be the full text of repository item $s_i$. The enriched task is:

$$d' = d \;\|\; \text{separator} \;\|\; \text{content}(s_1) \;\|\; \cdots \;\|\; \text{content}(s_m)$$

where $\|$ denotes string concatenation. This ensures that tokens are spent only when content is explicitly referenced, and each worker sees only the knowledge relevant to its specific subtask.

# PSEUDOCODE — Simplified from prover.py (Repo class and worker dispatch).
# Real implementation: file operations use pathlib; worker dispatch
# integrates with TUI streaming; error handling for missing items.

import re
from concurrent.futures import ThreadPoolExecutor

class Repo:
    """File-based knowledge graph with wikilink cross-references."""

    def __init__(self, run_dir):
        self.repo_dir = run_dir / "repo"
        self.repo_dir.mkdir(exist_ok=True)

    def resolve_wikilinks(self, text: str) -> str:
        """Replace [[slug]] references with item content."""
        refs = re.findall(r'\[\[([^\]]+)\]\]', text)
        if not refs:
            return text
        materials = []
        for slug in refs:
            item_path = self.repo_dir / f"{slug}.md"
            if item_path.exists():
                materials.append(f"## {slug}\n{item_path.read_text()}")
        if materials:
            text += "\n\n---\n# Referenced Materials\n" + "\n\n".join(materials)
        return text

    def get_index(self) -> str:
        """One-line summary per item for the planner prompt."""
        lines = []
        for item_path in sorted(self.repo_dir.glob("*.md")):
            first_line = item_path.read_text().split('\n')[0]
            lines.append(f"- [[{item_path.stem}]]: {first_line}")
        return "\n".join(lines)

def dispatch_workers(tasks, repo, worker_client, parallelism=1):
    """Execute worker tasks with wikilink resolution.
    Parallelism controlled by --parallelism flag (default 1)."""
    def run_task(task_desc):
        enriched = repo.resolve_wikilinks(task_desc)
        return worker_client.call(enriched, stream=True)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(run_task, tasks))
    return results

60.3.3 TOML-Based Structured Output

A pragmatic but noteworthy design choice is the use of TOML as the structured output format for planner decisions. Unlike JSON, TOML supports multi-line strings naturally (via triple-quoted strings) and avoids the escaping issues that frequently cause JSON parsing failures in LLM output. The planner's response contains a fenced TOML block extracted by parse_planner_toml() in prompts.py, using Python 3.11's built-in tomllib (with a tomli fallback for Python 3.10):

# PSEUDOCODE — Simplified from prompts.py (TOML parsing).
# Real implementation includes additional fallback parsers and
# handles edge cases in TOML block extraction.

import re
try:
    import tomllib
except ImportError:
    import tomli as tomllib  # Python 3.10 fallback

def parse_planner_toml(response_text: str) -> dict:
    """Extract and parse the TOML decision block from planner output."""
    match = re.search(r'```toml\s*\n(.*?)```', response_text, re.DOTALL)
    if not match:
        raise ValueError("No TOML block found in planner response")

    toml_str = match.group(1)
    decision = tomllib.loads(toml_str)

    assert "action" in decision, "Missing 'action' field"
    assert decision["action"] in VALID_ACTIONS

    return decision

A typical planner TOML decision (example, not from a specific run):

# Illustrative example of planner TOML output format
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = '''
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q in lowest terms, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed — no algebraic identity yields contradiction)
'''

[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference [[parity-lemma]]."

[[tasks]]
description = "Verify the infinite descent argument terminates."

60.3.4 Budget-Aware Planning

The planner operates under explicit budget constraints that shape its search strategy. The Budget class in budget.py tracks wall-clock time and optionally cumulative output tokens. It exposes three boolean checkpoints that are included in the planner's prompt at every step. Using the notation defined in Section 60.3.1:

$$\text{is\_exhausted}(\tau, n) = \begin{cases} \texttt{true} & \text{if } \tau \geq T_{\max} \text{ or (if set) tokens consumed} \geq N_{\max} \\ \texttt{false} & \text{otherwise} \end{cases}$$
$$\text{should\_conclude}(\tau) = \begin{cases} \texttt{true} & \text{if } \frac{\tau}{T_{\max}} \geq \alpha_c \\ \texttt{false} & \text{otherwise} \end{cases}$$
$$\text{can\_give\_up}(\tau) = \begin{cases} \texttt{true} & \text{if } \frac{\tau}{T_{\max}} \geq \alpha_g \\ \texttt{false} & \text{otherwise} \end{cases}$$

Here $\tau / T_{\max}$ is the fraction of the wall-clock time budget consumed. The defaults — $\alpha_c = 0.99$ (conclude after 99% of time) and $\alpha_g = 0.50$ (give-up permitted after 50% of time) — are configurable via --conclude-after and --give-up-after CLI flags. These thresholds enable adaptive behavior: aggressive exploration in early steps, focused refinement in middle steps, and consolidation or graceful termination in late steps. Note that these are soft signals communicated through the planner prompt — the LLM decides how to respond to them, and the only hard termination is is_exhausted.

60.3.5 Lean 4 Formal Verification Pipeline

When the --lean-project flag is set, OpenProver activates a formal verification pipeline implemented in the lean/ subpackage. This pipeline operates at three levels:

Level 1 — Repository Item Verification. When the planner stores a repository item with format="lean", the Lean code is written to a file within the specified Lean project directory and compiled via lake env lean <file>. Compiler output (success or errors) is returned to the planner. This is implemented in lean/core.py via the run_lean_check() function and the LeanWorkDir class, which manages temporary Lean source files within the user's project.

Level 2 — Final Proof Assembly. The LeanTheorem class in lean/core.py parses THEOREM.lean to locate all sorry positions — placeholders where proofs need to be filled. The assemble_proof() method replaces each sorry with the planner-provided proof block, processing in reverse order to preserve character offsets:

# PSEUDOCODE — Simplified from lean/core.py (LeanTheorem class).
# Real implementation includes additional validation, context
# injection, and error reporting.

class LeanTheorem:
    """Parse a Lean 4 theorem file and locate sorry positions."""

    def __init__(self, lean_path):
        self.source = lean_path.read_text()
        self.sorry_positions = self._find_sorry_positions()

    def _find_sorry_positions(self):
        """Find (start, end) byte offsets of each 'sorry' in source."""
        positions = []
        for match in re.finditer(r'\bsorry\b', self.source):
            positions.append((match.start(), match.end()))
        return positions

    def assemble_proof(self, blocks, context=None):
        """Replace sorry placeholders with proof blocks.

        Processes in reverse order to preserve character offsets.
        Validates: correct count, no imports in injected code.
        """
        if len(blocks) != len(self.sorry_positions):
            raise ValueError(
                f"Expected {len(self.sorry_positions)} blocks, "
                f"got {len(blocks)}"
            )
        result = self.source
        for (start, end), block in zip(
            reversed(self.sorry_positions), reversed(blocks)
        ):
            if "import" in block:
                raise ValueError("Import statements not allowed in proof blocks")
            result = result[:start] + block + result[end:]
        return result

Level 3 — Worker Tools via MCP. When --lean-worker-tools is enabled, workers gain access to two MCP tools: lean_verify (compile a Lean snippet in the project context) and lean_search (search ~400,000 Lean 4 declarations). For Claude-based workers, the MCP server (lean/mcp_server.py) is spawned as a subprocess configured via --mcp-config. For vLLM workers, LeanExplore search runs in-process using the tool definitions in lean/tools.py with native OpenAI-compatible tool calling.

60.3.6 The Lean Explore Search Pipeline

One of OpenProver's most technically sophisticated components is the lean_search tool, which searches approximately 400,000 Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std. The pipeline employs a multi-stage retrieval architecture implemented in lean/mcp_server.py and lean/data.py:

StageTechniqueApprox. LatencyDetails
BM25 RetrievalKeyword search over LLM-generated descriptions~0.01sTwo indices with different tokenization, results merged
Semantic RetrievalQwen3-Embedding-0.6B (1024-dim), FAISS index~1sPre-built index, cosine similarity
Score FusionNormalized linear combinationBM25 and semantic scores combined; see formula below
RerankingQwen3-Reranker-0.6B cross-encoder~1–2s (GPU)Skipped on CPU (~50s); GPU recommended
HydrationFull metadata lookup from SQLiteSignature, docstring, source location

The score fusion stage combines sparse and dense retrieval with a weighted sum. Let $d$ denote a Lean declaration and $q$ the search query. Define $\hat{s}_{\text{BM25}}(d, q)$ and $\hat{s}_{\text{sem}}(d, q)$ as the min-max normalized scores from the BM25 and semantic retrieval stages respectively. The fused score is:

$$\text{score}(d, q) = \lambda \cdot \hat{s}_{\text{BM25}}(d, q) + (1 - \lambda) \cdot \hat{s}_{\text{sem}}(d, q)$$

where $\lambda = 0.3$ (i.e., 30% keyword weight, 70% semantic weight). After fusion, the repository applies a dependency boost — declarations that appear in the dependency graph of already-retrieved top results receive a score increase. The exact magnitude and mechanism of this boost are not fully detailed in the codebase comments; it appears to be a heuristic adjustment that promotes declarations related to the current proof context. We describe it here descriptively rather than formalizing it, as the implementation details are subject to change across versions.

The first call takes approximately 10 seconds for model loading; subsequent calls complete in roughly 1 second. These latency figures are reported in the repository documentation and may vary by hardware.

60.4 Memory Architecture

A central design thesis of OpenProver is that persistent structured state can substitute for massive context windows. The system implements a five-layer memory hierarchy, each with distinct capacity, persistence, and visibility characteristics:

Layer 1: Whiteboard ~500–2000 tokens • Planner-only • Updated every step • Strategic state, current approach WHITEBOARD.md Layer 2: Repository Unlimited items • Planner sees index, Workers via [[wikilinks]] • Lemmas, observations, lit repo/*.md Layer 3: Output Window 3 items (rolling FIFO) • In-memory • Planner-only • Recent worker results In-memory Layer 4: Step Archive Unlimited • Append-only • Inspect mode • Planner decisions + worker tasks/results steps/step_NNN/ Layer 5: LLM Call Archive Unlimited • Append-only • Post-hoc analysis • Full prompt/response/cost/timing archive/calls/ ↑ Decreasing visibility to active reasoning • Increasing archival persistence ↓

The planner's per-step context budget is carefully managed. In a typical step, the planner sees approximately 3,000–7,000 tokens: ~500–2,000 for the whiteboard, ~50–200 for the repository index (one line per item), ~1,000–3,000 for the last three outputs, ~50 for budget status, and ~1,000–2,000 for the system prompt (defined in prompts.py). This is well within any frontier model's context capacity, meaning the planner never hits context limits, even in multi-hour sessions spanning dozens of steps.

This design contrasts sharply with single-agent approaches that maintain one long conversation thread. In those systems, context window limits create a hard ceiling on reasoning depth. OpenProver's external memory architecture removes this ceiling by allowing the planner to selectively attend to relevant information — reading specific repository items, reviewing recent outputs, and consulting its own strategic notes — without carrying the full history of every prior step.

60.4.1 Within-Session Learning

The repository and whiteboard together enable a form of within-session learning that mirrors the knowledge accumulation dynamics of evolutionary systems:

  1. Exploration and failure recording. Early workers explore different proof strategies. When an approach fails, the planner stores the failure as a repository item (e.g., approach-a-failure.md) and updates the whiteboard's "Tried" section.
  2. Knowledge accumulation. Proven lemmas, useful observations, and literature findings accumulate in the repository. These are available to all future workers via wikilink references.
  3. Strategic adaptation. The planner synthesizes patterns across failures and successes, updating its whiteboard strategy accordingly. By step 10 or 15, the system has built a structured understanding of the problem that no single LLM call could achieve.

However, OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement. There is no proof strategy transfer, no tactic preference learning, no failure pattern recognition across runs, and no repository bootstrapping from prior successful proofs. This represents a significant limitation and a natural extension point, as discussed in Section 60.7.

60.5 LLM Integration

60.5.1 Multi-Provider Architecture

OpenProver implements a factory pattern for LLM clients in llm.py, supporting three distinct provider backends. A distinctive feature is the hierarchical model assignment — different models can be assigned to planner and worker roles via --planner-model and --worker-model CLI flags:

ProviderClient ClassInterfaceTool SupportCost Tracking
Claude (Anthropic)LLMClient in llm.pyClaude Code CLI subprocessMCP server for Lean toolsYes (parsed from CLI JSON response)
Leanstral (Mistral)via OpenAI-compatibleMistral APINativeVia API response
Local/vLLMHFClient in llm.pyOpenAI-compatible HTTPNative tool callingNo (local compute)

The rationale for model separation is cost-efficiency: the planner sees a compressed view of the proof state and must make strategic decisions — a task suited to more capable but expensive models (e.g., Claude Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (e.g., Claude Sonnet). A typical configuration uses --planner-model opus and --worker-model sonnet.

60.5.2 Claude CLI Integration

The Claude integration is unconventional: rather than using an SDK library, OpenProver wraps the claude CLI binary directly via subprocess.Popen. This design choice, observable in the LLMClient class in llm.py, provides control over streaming behavior and tool configuration without SDK version dependencies. Streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.

The NDJSON stream from the Claude CLI carries tool-use events that OpenProver parses to detect MCP tool calls:

  • tool_use content blocks carry tool name and input (streamed incrementally)
  • Tool results appear as top-level {"type": "user"} messages containing tool_result entries matched by tool_use_id
  • MCP prefixes are stripped: mcp__lean_tools__lean_verify becomes lean_verify

When --lean-project is set, the Claude CLI is configured to spawn an MCP server subprocess using flags including --mcp-config, --strict-mcp-config, --permission-mode bypassPermissions, and --allowedTools restricting access to lean_verify and lean_search.

60.5.3 Literature Search

When the --no-isolation flag is set, the planner gains access to a literature_search action that spawns a web-enabled Claude worker with WebSearch and WebFetch tools. This enables the system to consult mathematical papers and known results mid-proof — a capability that mirrors how human mathematicians work. Results are stored as repository items and become available to all future workers via wikilinks. This integration is observable in the _handle_literature_search method in prover.py.

60.6 Demonstrated Capabilities and Evidence Model

OpenProver is primarily a tool-focused repository rather than a benchmark-reporting research paper. It does not provide systematic evaluations on standard theorem-proving benchmarks such as miniF2F, ProofNet, or FIMO. The evidence for its capabilities comes from case-study demonstrations — example runs included in the repository — rather than controlled experiments with statistical reporting. This section describes what those demonstrations show and explicitly states the limits of the evidence.

Evidence Scope Statement

All results below are from ad-hoc demonstrations documented in the repository's example runs and README. No controlled benchmark evaluation with standardized metrics, multiple seeds, or baseline comparisons has been published. The "Status" column reflects the repository author's reported outcomes, not independently verified results. Cost and timing estimates are author-provided and depend on model pricing at time of run.

60.6.1 Case-Study Results

ProblemDifficultyStatus (reported)Evidence Type
$\sqrt{2}$ is irrationalEasy (textbook)Solved — informal and formal proofs producedRepository example run
Infinitely many primesEasy (textbook)Solved — informal proofs producedRepository example run
$e$ is irrationalMedium (undergraduate)Proof strategies exploredRepository example run
Cauchy-Schwarz inequalityMedium (undergraduate)Informal and formal attemptsRepository example run
Erdős problem 205Medium (research-adjacent)Multi-step explorationRepository example run
Erdős problem 838Hard (open)Extended autonomous sessionsRepository example run
Collatz conjectureOpen (famously unsolved)Exploration (unsolved, as expected)Repository example run

These demonstrations validate that the planner-worker-whiteboard architecture functions for problems ranging from textbook to research-level: the planner dispatches workers, the whiteboard accumulates state across steps, wikilinks resolve correctly, and the budget system terminates gracefully. However, the absence of systematic benchmarking means we cannot make quantitative claims about the system's proving strength relative to alternatives such as LeanCopilot, LEGO-Prover, or UlamAI. Specifically:

  • No success/failure rates are reported across a problem suite.
  • No step counts, token counts, or wall-clock times are reported in a standardized format.
  • No ablation studies isolate the contribution of individual components (whiteboard, wikilinks, literature search, worker parallelism).
  • No comparison with baseline approaches (e.g., single-agent prompting) is available.

60.6.2 Architectural Validation

Setting aside proving strength, the case studies validate that each major architectural component of the Aletheia-style planner-worker pattern can be implemented with open-source tooling:

Aletheia Feature (as publicly described)OpenProver EquivalentImplementation Status
Gemini Deep Think plannerClaude Opus / any LLM plannerImplemented (prover.py)
Parallel worker executionThreadPoolExecutor workersImplemented (prover.py)
Whiteboard scratchpadWHITEBOARD.md with LaTeXImplemented (prover.py)
Repository of findingsrepo/ with wikilinksImplemented (Repo class, prover.py)
Extraction + verificationsubmit_proof + Lean 4Implemented (lean/core.py)
Literature searchWeb-enabled Claude workersImplemented (prover.py, requires --no-isolation)
Formal proof compilationlake env lean + MCP toolsImplemented (lean/ subpackage)

It is important to note that this is architectural replication based on Aletheia's publicly described design, not a performance equivalence claim. Aletheia's internal capabilities — including Gemini's training, fine-tuning, proprietary data, and actual proving power — are not disclosed, making direct comparison impossible.

60.6.3 Cost Estimation

OpenProver logs every LLM call's cost in archive/calls/call_NNN.json, enabling precise post-hoc cost analysis. The following are rough estimates from the repository documentation, not systematic measurements across multiple runs:

ScenarioModel ConfigurationParallelismTime (approx.)Estimated CostSource
Easy theorem ($\sqrt{2}$ irrational)Sonnet1~10–30 min~$1–5README estimate
Medium theorem (Cauchy-Schwarz)Sonnet1~1–2 hours~$10–30README estimate
Hard exploration (Erdős 838)Opus planner + Sonnet workers32–4 hours~$50–200README estimate
Local model runvLLM (MiniMax-M2.5)1Variable$0 (compute only)README estimate

Actual costs depend on model pricing (which changes frequently), theorem difficulty, and search efficiency. The per-call cost logging enables users to compute precise totals for their own runs.

60.7 Limitations and Discussion

60.7.1 Honest Assessment of Limitations

Several significant limitations constrain OpenProver's current impact and should be considered when evaluating the system:

LimitationDetails
No systematic benchmark resultsNo evaluation on standard benchmarks such as miniF2F, ProofNet, or FIMO. All demonstrated results are from ad-hoc case studies. This makes it impossible to position OpenProver's proving strength quantitatively relative to alternatives.
No cross-run learningEach run starts from scratch. No strategy transfer, tactic learning, or failure avoidance across runs.
Claude-centric default configurationThe default setup assumes the Claude Code CLI is installed and on PATH. Alternative providers require additional configuration and different flags.
No reinforcement learning or fine-tuningPure prompting only. No model improvement from proof traces. No learned value function to guide search.
Solo development388 commits by a single developer as of April 2026. Bus factor of 1 creates sustainability risk.
Informal-first biasThe system is strongest at informal proofs. Formal verification is a secondary capability that depends on the quality of the informal reasoning step.
No value model for search guidanceUnlike RL-trained provers (e.g., AlphaProof), search relies entirely on LLM intuition rather than learned heuristics.
Non-determinismLLM outputs are inherently non-deterministic. The system does not expose temperature or seed controls. Exact reproducibility of proof searches is not achievable.

60.7.2 Comparison with Alternative Approaches

SystemYearArchitectureFormal VerificationOpen SourceTraining Required
Aletheia (DeepMind)2025Planner-worker-whiteboardExtraction + verificationNoGemini fine-tuning
LeanCopilot2023Single-agent tactic suggestionLean 4 via LeanDojoYesNo (prompting)
LEGO-Prover2024Skill library + decompositionLean via LeanDojoPartialNo (prompting)
ReProver2023Encoder retrieval + generationLean 4YesFine-tuned encoder
OpenProver2025Planner-worker-whiteboardLean 4 via MCPYesNo (prompting)
UlamAI2026Search-based with LLM policyLean 4 via LSPYesNo (prompting)

OpenProver's primary strength relative to single-agent provers (LeanCopilot, ReProver) is its richer strategic reasoning, literature search capability, and persistent memory. Against RL-trained provers (AlphaProof), it requires no training but lacks learned value functions. Against search-based provers (UlamAI), it offers richer proof state through the whiteboard and repository but lacks sophisticated tactic-level search with transposition tables. These comparisons are qualitative — no head-to-head benchmark comparisons have been published.

60.7.3 Extension Points for Evolutionary Methods

Status: Not implemented. The following extension points are identified by the chapter author as promising research directions suggested by the architecture. None of these features exist in the current OpenProver codebase. They are included here to connect the system to this survey's broader themes, not to describe implemented functionality.

OpenProver's architecture presents several natural extension points for evolutionary and population-based methods:

  1. Repository Seeding. Pre-populating repo/ with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports referencing arbitrary markdown files — only a cross-run knowledge retrieval mechanism would be needed to select relevant prior knowledge.
  2. Strategy Evolution. The whiteboard format could serve as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress indicators. A population of whiteboard templates, subject to selection and mutation, could accelerate convergence on effective strategies for a given problem family.
  3. Prompt Population. The planner and worker system prompts in prompts.py could be evolved using the LLM-powered evolutionary paradigm described throughout this survey, with different prompt variants competing on benchmark problems.
  4. Model Selection as Hyperparameter. The planner/worker model assignment could be treated as an evolvable configuration, optimized across a population of runs with different model-task allocations.

These directions would require significant additional engineering — at minimum, a benchmark suite for fitness evaluation, a run orchestrator for managing populations of proving attempts, and a knowledge retrieval system for cross-run repository seeding. They represent multi-month research projects, not incremental feature additions.

60.8 Reproducibility

OpenProver achieves strong within-run reproducibility through its file-based state management. Every run creates a self-contained directory structure. The following layout is documented in the repository README and verified by inspection:

# Directory structure for a completed run
# Verified from repository documentation and run artifacts

runs/<slug>-<timestamp>/
├── THEOREM.md             # Immutable copy of input theorem
├── THEOREM.lean           # Formal Lean statement (if provided)
├── WHITEBOARD.md          # Current whiteboard state (enables resume)
├── PROOF.md               # Final informal proof (if found)
├── PROOF.lean             # Final formal proof (if lean mode)
├── DISCUSSION.md          # Post-session LLM analysis
├── run_config.toml        # Saved run configuration
├── repo/                  # Repository items with wikilinks
│   ├── lemma-foo.md
│   └── observation-bar.md
├── steps/
│   ├── step_001/
│   │   ├── planner.toml   # Planner's TOML decision
│   │   └── workers/
│   │       ├── task_0.md   # Worker task description
│   │       ├── result_0.md # Worker output
│   │       └── worker_0_call.json
│   └── step_002/...
└── archive/
    └── calls/
        └── call_001.json  # Full LLM call record (prompt, response, cost, timing)

The resume protocol detects existing runs by checking for WHITEBOARD.md + THEOREM.md, infers the step count from step_NNN directories, and restores settings from run_config.toml (with CLI flags overriding saved values). This means a proving session survives machine crashes, power loss, Ctrl-C interruptions, context window exhaustion, and network failures during API calls.

Historical runs can be browsed in read-only mode via openprover inspect runs/<run-dir>, which presents every step's planner decisions, worker tasks, and results in the TUI.

60.8.1 Resource Requirements

ResourceMinimumRecommended
Python3.10+3.11+ (for tomllib)
RAM4 GB8+ GB (for Lean Explore models)
GPUNone (CPU works)CUDA GPU (for lean_search reranking)
Disk~500 MB (base)~2 GB (with Lean Explore data)
Lean 4OptionalRequired for formal verification modes
Claude CLIRequired for default config

60.8.2 Reproducibility Limitations

Despite strong artifact management, exact reproducibility is limited by several factors:

  • LLM non-determinism. Even with identical prompts and configurations, different API calls may produce different planner decisions, leading to divergent proof searches. The system does not currently expose model temperature or random seed controls through its interface.
  • Provider evolution. LLM provider behavior changes over time as models are updated. A run that succeeded in February 2025 may behave differently when repeated in 2026, even with the same configuration.
  • No standardized benchmark protocol. Without fixed problem sets, evaluation metrics, and reporting standards, comparisons between different OpenProver runs (or between OpenProver and other systems) cannot be made rigorously.

60.8.3 Typical Usage

The following CLI invocations, drawn from the repository README, illustrate standard usage patterns:

# Quick informal proof exploration (from repository README)
openprover my_conjecture.md --autonomous --max-time 1h

# Deep dive with literature search and parallel workers
openprover hard_problem.md --no-isolation --planner-model opus \
  --worker-model sonnet -P 3 --max-time 4h

# Formalize a proof with Lean 4 verification
openprover theorem.md --lean-project ~/mathlib4 \
  --lean-theorem theorem.lean --proof proof.md

# Inspect a historical run
openprover inspect runs/sqrt2-20250315-143022/

60.9 Significance for the Field

OpenProver's contribution to the landscape of LLM-powered systems is best understood at three levels.

At the architecture level, it demonstrates that the planner-worker-whiteboard pattern — publicly described by DeepMind's Aletheia for automated mathematical reasoning — can be implemented with open-source components and commercial LLM APIs. To our knowledge, it is among the first open-source systems to implement this specific architectural pattern for theorem proving, though this claim is bounded to systems identified in this survey as of April 2026. The clean separation between strategic planning and focused execution, mediated by persistent external state, provides a reusable architectural template for any agentic system operating over extended time horizons.

At the memory level, it shows that a remarkably simple file-based knowledge management system — markdown files with wikilink cross-references — can effectively compensate for context window limitations. The planner never exceeds ~7,000 tokens per step, yet can coordinate proof searches spanning hours and dozens of steps. This is a practical counterpoint to the assumption that longer context windows are the primary path to deeper reasoning.

At the community level, it lowers the barrier to entry for automated theorem proving research. A researcher with a Claude API key (or a local vLLM instance) and a theorem statement can launch an autonomous proof search in minutes. The full auditability of every run — with archived prompts, responses, costs, and decisions — makes the system valuable for studying LLM reasoning behavior in mathematical domains.

The system's threading model (not async), subprocess-based LLM integration (not SDK), and file-based state (not database) represent pragmatic engineering choices that prioritize simplicity and inspectability over scalability. For a research tool intended to be read and modified by individual researchers, these are defensible tradeoffs.

Key caveats. OpenProver has not been evaluated on standard benchmarks, so its competitive proving strength relative to systems like LeanCopilot, LEGO-Prover, ReProver, or UlamAI remains unknown. The architectural equivalence to Aletheia is structural — based on publicly described design elements — not a claim of performance parity. The system's value at this stage is primarily as an open research platform and architectural reference implementation, not as a state-of-the-art prover.

Summary

Key takeaway: OpenProver is an open-source implementation of the planner-worker-whiteboard architecture for LLM-guided theorem proving, inspired by DeepMind's Aletheia. It demonstrates that persistent structured state — a terse LaTeX whiteboard, a file-based knowledge repository with wikilink retrieval, and a rolling output window — can enable multi-hour proof search sessions with standard-context models, without custom training or proprietary infrastructure.

Main contribution: An inspectable, resumable, multi-provider orchestration framework (~3,000 lines of Python, single hard dependency) that bridges the gap between closed frontier proof agents and the open research community, providing both informal and Lean 4 formal proving capabilities with full artifact auditability.

Evidence model: The system's capabilities are demonstrated through case studies on problems ranging from textbook (√2 irrationality) to open research problems (Erdős 838), documented as example runs in the repository. No systematic benchmark evaluation has been published. The architectural pattern generalizes beyond theorem proving, but quantitative claims about proving strength await benchmark-scale evaluation.

What a researcher should know: OpenProver is best understood as a research platform and architectural reference, not as a benchmark-dominating prover. Its value lies in inspectability, extensibility, and the demonstration that the planner-worker-whiteboard pattern is viable with commodity LLM APIs. Key limitations are the absence of cross-run learning and systematic benchmarking. The evolutionary extensions discussed in Section 60.7.3 are speculative research directions, not implemented features.