Introduced2026-03

Score8.58/10 — Final

Chapter 60

OpenProver

Part P08: Harness & Agent Frameworks

60.1 Overview and Motivation

Automated theorem proving stands at the frontier of mathematical reasoning by machines. While large language models have demonstrated impressive informal mathematical ability — solving competition problems, generating plausible proof sketches, and even suggesting novel conjectures — the gap between informal reasoning and machine-checked formal proofs remains substantial. Bridging this gap requires systems that can orchestrate extended, multi-step proof searches while maintaining strategic coherence over hours of exploration.

In February 2025, Google DeepMind announced Aletheia, a fully autonomous mathematics research agent powered by Gemini 3 Deep Think that solved 6 of 10 open research problems from the FirstProof challenge. Aletheia demonstrated a planner-worker-whiteboard architecture — a design pattern in which a strategic planner coordinates focused, stateless workers through a persistent mathematical scratchpad. The system was proprietary, leaving the research community without access to its internals or the ability to reproduce its results.

OpenProver, developed by Matěj Kripner at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague, is an open-source implementation of a planner-worker-whiteboard architecture for automated theorem proving, explicitly inspired by the Aletheia design as described in DeepMind's public communications. Released in February 2025 with active development through v1.0.1 (March 2026), OpenProver provides a fully inspectable, resumable, multi-LLM orchestration framework that supports both informal natural-language proofs and machine-checked Lean 4 formalizations.

Repository Snapshot

Repository	github.com/Kripner/openprover
Version reviewed	v1.0.1 (PyPI release, March 2026)
Commits at review	388 on master branch
Primary language	Python (92.9%), ~3,000 lines across 11 modules
License	Not explicitly specified in repository
Review access date	April 2026

All file paths, module references, and structural claims in this chapter are based on inspection of the repository at the above date. Line counts are estimates. Code snippets are labeled as pseudocode, lightly edited, or verbatim at each occurrence.

Key Contribution

OpenProver demonstrates that a planner-worker-whiteboard architecture — similar in high-level design to the pattern described in DeepMind's Aletheia announcements — can be constructed using open-source tools and commercial LLM APIs, without proprietary training data or custom-trained models. The scope of equivalence is architectural: OpenProver replicates the separation of planner and workers, persistent whiteboard state, and parallel worker dispatch, based on Aletheia's publicly described design. It does not claim feature parity with Aletheia's internal capabilities, model quality, or proving performance, which remain undisclosed. OpenProver's central architectural insight is that persistent structured state — a whiteboard, a file-based repository with wikilink cross-references, and a rolling output window — can substitute for massive context windows, enabling multi-hour reasoning sessions with standard-context models.

60.1.1 Positioning within the Survey

Within this survey's scope of LLM-powered self-evolving systems, OpenProver occupies a distinctive niche. It is not an evolutionary algorithm discovery system in the traditional sense — it does not maintain populations, apply crossover operators, or optimize fitness functions. Rather, it represents a search-based discovery agent in which the LLM planner performs a form of heuristic tree search over proof strategies, with the whiteboard and repository serving as an explicit external memory that accumulates knowledge across search steps. The system's within-session learning — storing failed approaches, proven lemmas, and synthesized observations — mirrors the knowledge-building dynamics of evolutionary systems, albeit without the population-level selection pressure.

OpenProver's relevance to the evolutionary AI landscape is threefold:

Reusable orchestration pattern (implemented): Its planner-worker architecture provides a concrete, working pattern for agentic systems that coordinate strategic reasoning with parallel execution.
Lightweight knowledge management (implemented): Its repository and wikilink system constitute a knowledge management mechanism analogous to the skill libraries and learning logs found in evolutionary code-generation systems.
Extension points for evolutionary methods (not implemented — speculative): As discussed in Section 60.7.3, the architecture has natural extension points for evolutionary methods including strategy evolution and prompt population optimization. These are research directions identified by the chapter author, not features of the current system.

60.2 Architecture

OpenProver's architecture follows a clean, acyclic dependency structure organized around a central prover loop that coordinates planning, worker dispatch, and state management. The system comprises approximately 3,000 lines of Python distributed across eleven modules in a flat package layout with a self-contained lean/ subpackage. The hard dependency footprint is minimal — only the mcp>=1.0 library is declared in pyproject.toml.

60.2.1 Module Structure

The codebase is organized as a flat Python package with a single self-contained subpackage for Lean integration. The following table lists each module with its responsibility, derived from the repository structure at github.com/Kripner/openprover (v1.0.1):

Module	File	Est. Lines	Responsibility	Evidence
CLI	`cli.py`	~200	Argument parsing, TUI setup, signal handlers, cost summary	Verified in repo
Prover	`prover.py`	~800	Core loop, action dispatch, `Repo` class, wikilink resolution, operating modes	Verified in repo
LLM	`llm.py`	~500	`LLMClient` (Claude CLI), `HFClient` (vLLM), streaming, archiving, cost tracking	Verified in repo
Prompts	`prompts.py`	~600	System prompts, prompt formatters, TOML parser, action enumeration	Verified in repo
Budget	`budget.py`	~100	Time/token tracking, threshold calculations	Verified in repo
TUI	`tui.py`	~800	Full-screen terminal UI, tab system, streaming display, key handling	Verified in repo
Inspect	`inspect.py`	~150	Read-only historical run browser	Verified in repo
Lean Core	`lean/core.py`	~300	`LeanTheorem` parser, proof assembly, `run_lean_check`, `LeanWorkDir`	Verified in repo
Lean MCP	`lean/mcp_server.py`	~200	MCP server exposing `lean_verify` and `lean_search` tools	Verified in repo
Lean Data	`lean/data.py`	~150	LeanExplore data management, dependency installation	Verified in repo
Lean Tools	`lean/tools.py`	~100	Tool definitions for vLLM native tool calling	Verified in repo

The dependency graph is strictly acyclic: cli.py depends on prover.py, tui.py, and inspect.py; prover.py depends on llm.py, prompts.py, budget.py, and the lean/ subpackage; prompts.py has no code dependencies beyond the standard library. All functionality beyond MCP comes from the Python standard library (subprocess, json, tomllib, threading, os, select) and the Claude CLI system binary. Optional dependencies for the Lean search pipeline (torch, sentence-transformers, FAISS) are installed on demand by lean/data.py.

60.2.2 Operating Modes

OpenProver supports three operating modes, configured via CLI flags and detected from the presence of input files in the run directory. These modes reflect the dual-track nature of mathematical reasoning — informal intuition followed by formal verification:

Mode	Input	Goal	Output Artifacts
`prove`	`THEOREM.md`	Informal proof	`PROOF.md`
`prove_and_formalize`	`THEOREM.md` + `THEOREM.lean`	Both proofs	`PROOF.md` + `PROOF.lean`
`formalize_only`	`THEOREM.md` + `THEOREM.lean` + `PROOF.md`	Formal proof only	`PROOF.lean`

This mode separation acknowledges a fundamental asymmetry in mathematical practice: most reasoning happens informally first, and formalization is a distinct — often more difficult — task. By supporting all three modes, OpenProver can serve as both a proof discovery tool and a formalization assistant.

60.3 Core Algorithms

60.3.1 The Planner-Worker Protocol

The central algorithm of OpenProver is a step-wise loop in which a planner LLM makes strategic decisions and dispatches worker LLMs for focused execution. To describe this protocol precisely, we define the following notation:

Symbol Definitions

$n$	Current step count (integer, incremented each planner iteration), $n = 1, 2, 3, \ldots$
$\tau$	Elapsed wall-clock time in seconds since the run started
$T_{\max}$	Maximum wall-clock time budget in seconds (CLI flag `--max-time`, default 14400s = 4 hours)
$N_{\max}$	Optional cumulative output token budget (CLI flag `--max-tokens`, no default)
$S$	System prompt (role definition and action specifications), invariant across steps
$W_n$	Whiteboard content at step $n$ (LaTeX-formatted markdown string)
$I_n$	Repository index at step $n$ (one-line summary per stored item)
$O_n$	Output window at step $n$: the last $k = 3$ worker results (rolling FIFO)
$B_n$	Budget status at step $n$: remaining time, conclusion/give-up flags
$\mathcal{A}$	The set of eight valid planner actions (defined below)
$\alpha_c, \alpha_g$	Fractional time thresholds for conclusion and give-up phases (defaults: 0.99, 0.50)

Step 1 — Prompt Construction. At each step $n$, the planner receives a structured prompt $P_n$ composed of five elements:

$$P_n = \langle S,\; W_n,\; I_n,\; O_n,\; B_n \rangle$$

Step 2 — TOML Decision. The planner generates a natural-language response containing a fenced TOML block that specifies its action. The TOML structure includes an action identifier $a_n \in \mathcal{A}$, a summary, an updated whiteboard $W_{n+1}$, and action-specific payloads such as task descriptions or item operations.

Step 3 — Action Dispatch. The action is routed to one of eight handlers. The action set is:

$$\mathcal{A} = \{\texttt{spawn},\; \texttt{literature\_search},\; \texttt{read\_items},\; \texttt{write\_items},\; \texttt{write\_whiteboard},\; \texttt{read\_theorem},\; \texttt{submit\_proof},\; \texttt{give\_up}\}$$

Step 4 — State Update. After the action completes, the whiteboard is updated to $W_{n+1}$, the output window shifts to include new results (dropping the oldest if $|O_n| = k$), and all artifacts are persisted to disk in steps/step_NNN/.

The following pseudocode illustrates the core prover loop. It captures the structure of the Prover class in prover.py but simplifies error handling, mode logic, and TUI integration for readability:

# PSEUDOCODE — Simplified from prover.py (Prover class).
# Captures the loop structure. Real implementation handles
# additional concerns: mode detection, resume, TUI callbacks,
# discussion generation, error recovery.

class Prover:
    def __init__(self, theorem_path, config, llm_client, tui):
        self.whiteboard = ""          # W_n: LaTeX scratchpad
        self.repo = Repo(run_dir)     # I_n: file-based knowledge graph
        self.outputs = []             # O_n: rolling output window (max 3)
        self.budget = Budget(config)  # B_n: time/token tracking
        self.step = 0

    def run(self):
        """Main prover loop — one iteration per planner step."""
        while not self.budget.is_exhausted:
            self.step += 1

            # Step 1: Build planner prompt (see prompts.py)
            prompt = build_planner_prompt(
                whiteboard=self.whiteboard,
                repo_index=self.repo.get_index(),
                recent_outputs=self.outputs[-3:],
                budget_status=self.budget.status(),
                mode_indicators=self.get_mode_status()
            )

            # Step 2: Call planner LLM, parse TOML decision
            response = self.planner_client.call(prompt, stream=True)
            decision = parse_planner_toml(response)

            # Step 3: Dispatch action
            action = decision["action"]
            if action == "spawn":
                results = self._handle_spawn(decision["tasks"])
            elif action == "literature_search":
                results = self._handle_literature_search(decision)
            elif action == "submit_proof":
                success = self._handle_submit_proof(decision)
                if success:
                    return  # Proof found
            elif action == "give_up":
                if self.budget.can_give_up:
                    return
            # ... other action handlers

            # Step 4: Update state
            self.whiteboard = decision.get("whiteboard", self.whiteboard)
            self.outputs.extend(results)
            self._save_step_to_disk(decision, results)

60.3.2 Worker Execution and Wikilink Resolution

Workers are stateless and parallelizable. Before dispatch, each task description undergoes wikilink resolution — a lightweight retrieval mechanism implemented in the Repo class within prover.py that injects relevant repository content into the worker's prompt without consuming tokens on irrelevant material.

The resolution process operates on the task description $d$ by finding all [[slug]] references, fetching the corresponding repository items, and appending them as a "Referenced Materials" section. Let $\text{refs}(d) = \{s_1, s_2, \ldots, s_m\}$ be the set of slugs referenced in task $d$, and let $\text{content}(s_i)$ be the full text of repository item $s_i$. The enriched task is:

$$d' = d \;\|\; \text{separator} \;\|\; \text{content}(s_1) \;\|\; \cdots \;\|\; \text{content}(s_m)$$

where $\|$ denotes string concatenation. This ensures that tokens are spent only when content is explicitly referenced, and each worker sees only the knowledge relevant to its specific subtask.

# PSEUDOCODE — Simplified from prover.py (Repo class and worker dispatch).
# Real implementation: file operations use pathlib; worker dispatch
# integrates with TUI streaming; error handling for missing items.

import re
from concurrent.futures import ThreadPoolExecutor

class Repo:
    """File-based knowledge graph with wikilink cross-references."""

    def __init__(self, run_dir):
        self.repo_dir = run_dir / "repo"
        self.repo_dir.mkdir(exist_ok=True)

    def resolve_wikilinks(self, text: str) -> str:
        """Replace [[slug]] references with item content."""
        refs = re.findall(r'\[\[([^\]]+)\]\]', text)
        if not refs:
            return text
        materials = []
        for slug in refs:
            item_path = self.repo_dir / f"{slug}.md"
            if item_path.exists():
                materials.append(f"## {slug}\n{item_path.read_text()}")
        if materials:
            text += "\n\n---\n# Referenced Materials\n" + "\n\n".join(materials)
        return text

    def get_index(self) -> str:
        """One-line summary per item for the planner prompt."""
        lines = []
        for item_path in sorted(self.repo_dir.glob("*.md")):
            first_line = item_path.read_text().split('\n')[0]
            lines.append(f"- [[{item_path.stem}]]: {first_line}")
        return "\n".join(lines)

def dispatch_workers(tasks, repo, worker_client, parallelism=1):
    """Execute worker tasks with wikilink resolution.
    Parallelism controlled by --parallelism flag (default 1)."""
    def run_task(task_desc):
        enriched = repo.resolve_wikilinks(task_desc)
        return worker_client.call(enriched, stream=True)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(run_task, tasks))
    return results

60.3.3 TOML-Based Structured Output

A pragmatic but noteworthy design choice is the use of TOML as the structured output format for planner decisions. Unlike JSON, TOML supports multi-line strings naturally (via triple-quoted strings) and avoids the escaping issues that frequently cause JSON parsing failures in LLM output. The planner's response contains a fenced TOML block extracted by parse_planner_toml() in prompts.py, using Python 3.11's built-in tomllib (with a tomli fallback for Python 3.10):

# PSEUDOCODE — Simplified from prompts.py (TOML parsing).
# Real implementation includes additional fallback parsers and
# handles edge cases in TOML block extraction.

import re
try:
    import tomllib
except ImportError:
    import tomli as tomllib  # Python 3.10 fallback

def parse_planner_toml(response_text: str) -> dict:
    """Extract and parse the TOML decision block from planner output."""
    match = re.search(r'```toml\s*\n(.*?)```', response_text, re.DOTALL)
    if not match:
        raise ValueError("No TOML block found in planner response")

    toml_str = match.group(1)
    decision = tomllib.loads(toml_str)

    assert "action" in decision, "Missing 'action' field"
    assert decision["action"] in VALID_ACTIONS

    return decision

A typical planner TOML decision (example, not from a specific run):

# Illustrative example of planner TOML output format
action = "spawn"
summary = "Explore the contradiction approach via modular arithmetic"
whiteboard = '''
# Goal: Prove √2 is irrational
## Strategy: Assume √2 = p/q in lowest terms, derive contradiction
## Status: Working on the parity argument
## Tried: Direct computation (failed — no algebraic identity yields contradiction)
'''

[[tasks]]
description = "Prove that if p² = 2q², then p must be even. Reference [[parity-lemma]]."

[[tasks]]
description = "Verify the infinite descent argument terminates."

60.3.4 Budget-Aware Planning

The planner operates under explicit budget constraints that shape its search strategy. The Budget class in budget.py tracks wall-clock time and optionally cumulative output tokens. It exposes three boolean checkpoints that are included in the planner's prompt at every step. Using the notation defined in Section 60.3.1:

$$\text{is\_exhausted}(\tau, n) = \begin{cases} \texttt{true} & \text{if } \tau \geq T_{\max} \text{ or (if set) tokens consumed} \geq N_{\max} \\ \texttt{false} & \text{otherwise} \end{cases}$$

$$\text{should\_conclude}(\tau) = \begin{cases} \texttt{true} & \text{if } \frac{\tau}{T_{\max}} \geq \alpha_c \\ \texttt{false} & \text{otherwise} \end{cases}$$

$$\text{can\_give\_up}(\tau) = \begin{cases} \texttt{true} & \text{if } \frac{\tau}{T_{\max}} \geq \alpha_g \\ \texttt{false} & \text{otherwise} \end{cases}$$

Here $\tau / T_{\max}$ is the fraction of the wall-clock time budget consumed. The defaults — $\alpha_c = 0.99$ (conclude after 99% of time) and $\alpha_g = 0.50$ (give-up permitted after 50% of time) — are configurable via --conclude-after and --give-up-after CLI flags. These thresholds enable adaptive behavior: aggressive exploration in early steps, focused refinement in middle steps, and consolidation or graceful termination in late steps. Note that these are soft signals communicated through the planner prompt — the LLM decides how to respond to them, and the only hard termination is is_exhausted.

60.3.5 Lean 4 Formal Verification Pipeline

When the --lean-project flag is set, OpenProver activates a formal verification pipeline implemented in the lean/ subpackage. This pipeline operates at three levels:

Level 1 — Repository Item Verification. When the planner stores a repository item with format="lean", the Lean code is written to a file within the specified Lean project directory and compiled via lake env lean <file>. Compiler output (success or errors) is returned to the planner. This is implemented in lean/core.py via the run_lean_check() function and the LeanWorkDir class, which manages temporary Lean source files within the user's project.

Level 2 — Final Proof Assembly. The LeanTheorem class in lean/core.py parses THEOREM.lean to locate all sorry positions — placeholders where proofs need to be filled. The assemble_proof() method replaces each sorry with the planner-provided proof block, processing in reverse order to preserve character offsets:

# PSEUDOCODE — Simplified from lean/core.py (LeanTheorem class).
# Real implementation includes additional validation, context
# injection, and error reporting.

class LeanTheorem:
    """Parse a Lean 4 theorem file and locate sorry positions."""

    def __init__(self, lean_path):
        self.source = lean_path.read_text()
        self.sorry_positions = self._find_sorry_positions()

    def _find_sorry_positions(self):
        """Find (start, end) byte offsets of each 'sorry' in source."""
        positions = []
        for match in re.finditer(r'\bsorry\b', self.source):
            positions.append((match.start(), match.end()))
        return positions

    def assemble_proof(self, blocks, context=None):
        """Replace sorry placeholders with proof blocks.

        Processes in reverse order to preserve character offsets.
        Validates: correct count, no imports in injected code.
        """
        if len(blocks) != len(self.sorry_positions):
            raise ValueError(
                f"Expected {len(self.sorry_positions)} blocks, "
                f"got {len(blocks)}"
            )
        result = self.source
        for (start, end), block in zip(
            reversed(self.sorry_positions), reversed(blocks)
        ):
            if "import" in block:
                raise ValueError("Import statements not allowed in proof blocks")
            result = result[:start] + block + result[end:]
        return result

Level 3 — Worker Tools via MCP. When --lean-worker-tools is enabled, workers gain access to two MCP tools: lean_verify (compile a Lean snippet in the project context) and lean_search (search ~400,000 Lean 4 declarations). For Claude-based workers, the MCP server (lean/mcp_server.py) is spawned as a subprocess configured via --mcp-config. For vLLM workers, LeanExplore search runs in-process using the tool definitions in lean/tools.py with native OpenAI-compatible tool calling.

60.3.6 The Lean Explore Search Pipeline

One of OpenProver's most technically sophisticated components is the lean_search tool, which searches approximately 400,000 Lean 4 declarations across Init, Batteries, Lean, Mathlib, and Std. The pipeline employs a multi-stage retrieval architecture implemented in lean/mcp_server.py and lean/data.py:

Stage	Technique	Approx. Latency	Details
BM25 Retrieval	Keyword search over LLM-generated descriptions	~0.01s	Two indices with different tokenization, results merged
Semantic Retrieval	Qwen3-Embedding-0.6B (1024-dim), FAISS index	~1s	Pre-built index, cosine similarity
Score Fusion	Normalized linear combination	—	BM25 and semantic scores combined; see formula below
Reranking	Qwen3-Reranker-0.6B cross-encoder	~1–2s (GPU)	Skipped on CPU (~50s); GPU recommended
Hydration	Full metadata lookup from SQLite	—	Signature, docstring, source location

The score fusion stage combines sparse and dense retrieval with a weighted sum. Let $d$ denote a Lean declaration and $q$ the search query. Define $\hat{s}_{\text{BM25}}(d, q)$ and $\hat{s}_{\text{sem}}(d, q)$ as the min-max normalized scores from the BM25 and semantic retrieval stages respectively. The fused score is:

$$\text{score}(d, q) = \lambda \cdot \hat{s}_{\text{BM25}}(d, q) + (1 - \lambda) \cdot \hat{s}_{\text{sem}}(d, q)$$

where $\lambda = 0.3$ (i.e., 30% keyword weight, 70% semantic weight). After fusion, the repository applies a dependency boost — declarations that appear in the dependency graph of already-retrieved top results receive a score increase. The exact magnitude and mechanism of this boost are not fully detailed in the codebase comments; it appears to be a heuristic adjustment that promotes declarations related to the current proof context. We describe it here descriptively rather than formalizing it, as the implementation details are subject to change across versions.

The first call takes approximately 10 seconds for model loading; subsequent calls complete in roughly 1 second. These latency figures are reported in the repository documentation and may vary by hardware.

60.4 Memory Architecture

A central design thesis of OpenProver is that persistent structured state can substitute for massive context windows. The system implements a five-layer memory hierarchy, each with distinct capacity, persistence, and visibility characteristics:

The planner's per-step context budget is carefully managed. In a typical step, the planner sees approximately 3,000–7,000 tokens: ~500–2,000 for the whiteboard, ~50–200 for the repository index (one line per item), ~1,000–3,000 for the last three outputs, ~50 for budget status, and ~1,000–2,000 for the system prompt (defined in prompts.py). This is well within any frontier model's context capacity, meaning the planner never hits context limits, even in multi-hour sessions spanning dozens of steps.

This design contrasts sharply with single-agent approaches that maintain one long conversation thread. In those systems, context window limits create a hard ceiling on reasoning depth. OpenProver's external memory architecture removes this ceiling by allowing the planner to selectively attend to relevant information — reading specific repository items, reviewing recent outputs, and consulting its own strategic notes — without carrying the full history of every prior step.

60.4.1 Within-Session Learning

The repository and whiteboard together enable a form of within-session learning that mirrors the knowledge accumulation dynamics of evolutionary systems:

Exploration and failure recording. Early workers explore different proof strategies. When an approach fails, the planner stores the failure as a repository item (e.g., approach-a-failure.md) and updates the whiteboard's "Tried" section.
Knowledge accumulation. Proven lemmas, useful observations, and literature findings accumulate in the repository. These are available to all future workers via wikilink references.
Strategic adaptation. The planner synthesizes patterns across failures and successes, updating its whiteboard strategy accordingly. By step 10 or 15, the system has built a structured understanding of the problem that no single LLM call could achieve.

However, OpenProver currently has no cross-run learning mechanism. Each run starts from scratch with only the theorem statement. There is no proof strategy transfer, no tactic preference learning, no failure pattern recognition across runs, and no repository bootstrapping from prior successful proofs. This represents a significant limitation and a natural extension point, as discussed in Section 60.7.

60.5 LLM Integration

60.5.1 Multi-Provider Architecture

OpenProver implements a factory pattern for LLM clients in llm.py, supporting three distinct provider backends. A distinctive feature is the hierarchical model assignment — different models can be assigned to planner and worker roles via --planner-model and --worker-model CLI flags:

Provider	Client Class	Interface	Tool Support	Cost Tracking
Claude (Anthropic)	`LLMClient` in `llm.py`	Claude Code CLI subprocess	MCP server for Lean tools	Yes (parsed from CLI JSON response)
Leanstral (Mistral)	via OpenAI-compatible	Mistral API	Native	Via API response
Local/vLLM	`HFClient` in `llm.py`	OpenAI-compatible HTTP	Native tool calling	No (local compute)

The rationale for model separation is cost-efficiency: the planner sees a compressed view of the proof state and must make strategic decisions — a task suited to more capable but expensive models (e.g., Claude Opus). Workers receive focused tasks and run in parallel — suited to faster, cheaper models (e.g., Claude Sonnet). A typical configuration uses --planner-model opus and --worker-model sonnet.

60.5.2 Claude CLI Integration

The Claude integration is unconventional: rather than using an SDK library, OpenProver wraps the claude CLI binary directly via subprocess.Popen. This design choice, observable in the LLMClient class in llm.py, provides control over streaming behavior and tool configuration without SDK version dependencies. Streaming uses readline() on the subprocess pipe rather than Python's line iterator, because the line iterator has read-ahead buffering that defeats real-time display.

The NDJSON stream from the Claude CLI carries tool-use events that OpenProver parses to detect MCP tool calls:

tool_use content blocks carry tool name and input (streamed incrementally)
Tool results appear as top-level {"type": "user"} messages containing tool_result entries matched by tool_use_id
MCP prefixes are stripped: mcp__lean_tools__lean_verify becomes lean_verify

When --lean-project is set, the Claude CLI is configured to spawn an MCP server subprocess using flags including --mcp-config, --strict-mcp-config, --permission-mode bypassPermissions, and --allowedTools restricting access to lean_verify and lean_search.

60.5.3 Literature Search

When the --no-isolation flag is set, the planner gains access to a literature_search action that spawns a web-enabled Claude worker with WebSearch and WebFetch tools. This enables the system to consult mathematical papers and known results mid-proof — a capability that mirrors how human mathematicians work. Results are stored as repository items and become available to all future workers via wikilinks. This integration is observable in the _handle_literature_search method in prover.py.

60.6 Demonstrated Capabilities and Evidence Model

OpenProver is primarily a tool-focused repository rather than a benchmark-reporting research paper. It does not provide systematic evaluations on standard theorem-proving benchmarks such as miniF2F, ProofNet, or FIMO. The evidence for its capabilities comes from case-study demonstrations — example runs included in the repository — rather than controlled experiments with statistical reporting. This section describes what those demonstrations show and explicitly states the limits of the evidence.

Evidence Scope Statement

All results below are from ad-hoc demonstrations documented in the repository's example runs and README. No controlled benchmark evaluation with standardized metrics, multiple seeds, or baseline comparisons has been published. The "Status" column reflects the repository author's reported outcomes, not independently verified results. Cost and timing estimates are author-provided and depend on model pricing at time of run.

60.6.1 Case-Study Results

Problem	Difficulty	Status (reported)	Evidence Type
$\sqrt{2}$ is irrational	Easy (textbook)	Solved — informal and formal proofs produced	Repository example run
Infinitely many primes	Easy (textbook)	Solved — informal proofs produced	Repository example run
$e$ is irrational	Medium (undergraduate)	Proof strategies explored	Repository example run
Cauchy-Schwarz inequality	Medium (undergraduate)	Informal and formal attempts	Repository example run
Erdős problem 205	Medium (research-adjacent)	Multi-step exploration	Repository example run
Erdős problem 838	Hard (open)	Extended autonomous sessions	Repository example run
Collatz conjecture	Open (famously unsolved)	Exploration (unsolved, as expected)	Repository example run

These demonstrations validate that the planner-worker-whiteboard architecture functions for problems ranging from textbook to research-level: the planner dispatches workers, the whiteboard accumulates state across steps, wikilinks resolve correctly, and the budget system terminates gracefully. However, the absence of systematic benchmarking means we cannot make quantitative claims about the system's proving strength relative to alternatives such as LeanCopilot, LEGO-Prover, or UlamAI. Specifically:

No success/failure rates are reported across a problem suite.
No step counts, token counts, or wall-clock times are reported in a standardized format.
No ablation studies isolate the contribution of individual components (whiteboard, wikilinks, literature search, worker parallelism).
No comparison with baseline approaches (e.g., single-agent prompting) is available.

60.6.2 Architectural Validation

Setting aside proving strength, the case studies validate that each major architectural component of the Aletheia-style planner-worker pattern can be implemented with open-source tooling:

Aletheia Feature (as publicly described)	OpenProver Equivalent	Implementation Status
Gemini Deep Think planner	Claude Opus / any LLM planner	Implemented (`prover.py`)
Parallel worker execution	`ThreadPoolExecutor` workers	Implemented (`prover.py`)
Whiteboard scratchpad	`WHITEBOARD.md` with LaTeX	Implemented (`prover.py`)
Repository of findings	`repo/` with wikilinks	Implemented (`Repo` class, `prover.py`)
Extraction + verification	`submit_proof` + Lean 4	Implemented (`lean/core.py`)
Literature search	Web-enabled Claude workers	Implemented (`prover.py`, requires `--no-isolation`)
Formal proof compilation	`lake env lean` + MCP tools	Implemented (`lean/` subpackage)

It is important to note that this is architectural replication based on Aletheia's publicly described design, not a performance equivalence claim. Aletheia's internal capabilities — including Gemini's training, fine-tuning, proprietary data, and actual proving power — are not disclosed, making direct comparison impossible.

60.6.3 Cost Estimation

OpenProver logs every LLM call's cost in archive/calls/call_NNN.json, enabling precise post-hoc cost analysis. The following are rough estimates from the repository documentation, not systematic measurements across multiple runs:

Scenario	Model Configuration	Parallelism	Time (approx.)	Estimated Cost	Source
Easy theorem ($\sqrt{2}$ irrational)	Sonnet	1	~10–30 min	~$1–5	README estimate
Medium theorem (Cauchy-Schwarz)	Sonnet	1	~1–2 hours	~$10–30	README estimate
Hard exploration (Erdős 838)	Opus planner + Sonnet workers	3	2–4 hours	~$50–200	README estimate
Local model run	vLLM (MiniMax-M2.5)	1	Variable	$0 (compute only)	README estimate

Actual costs depend on model pricing (which changes frequently), theorem difficulty, and search efficiency. The per-call cost logging enables users to compute precise totals for their own runs.

60.7 Limitations and Discussion

60.7.1 Honest Assessment of Limitations

Several significant limitations constrain OpenProver's current impact and should be considered when evaluating the system:

Limitation	Details
No systematic benchmark results	No evaluation on standard benchmarks such as miniF2F, ProofNet, or FIMO. All demonstrated results are from ad-hoc case studies. This makes it impossible to position OpenProver's proving strength quantitatively relative to alternatives.
No cross-run learning	Each run starts from scratch. No strategy transfer, tactic learning, or failure avoidance across runs.
Claude-centric default configuration	The default setup assumes the Claude Code CLI is installed and on PATH. Alternative providers require additional configuration and different flags.
No reinforcement learning or fine-tuning	Pure prompting only. No model improvement from proof traces. No learned value function to guide search.
Solo development	388 commits by a single developer as of April 2026. Bus factor of 1 creates sustainability risk.
Informal-first bias	The system is strongest at informal proofs. Formal verification is a secondary capability that depends on the quality of the informal reasoning step.
No value model for search guidance	Unlike RL-trained provers (e.g., AlphaProof), search relies entirely on LLM intuition rather than learned heuristics.
Non-determinism	LLM outputs are inherently non-deterministic. The system does not expose temperature or seed controls. Exact reproducibility of proof searches is not achievable.

60.7.2 Comparison with Alternative Approaches

System	Year	Architecture	Formal Verification	Open Source	Training Required
Aletheia (DeepMind)	2025	Planner-worker-whiteboard	Extraction + verification	No	Gemini fine-tuning
LeanCopilot	2023	Single-agent tactic suggestion	Lean 4 via LeanDojo	Yes	No (prompting)
LEGO-Prover	2024	Skill library + decomposition	Lean via LeanDojo	Partial	No (prompting)
ReProver	2023	Encoder retrieval + generation	Lean 4	Yes	Fine-tuned encoder
OpenProver	2025	Planner-worker-whiteboard	Lean 4 via MCP	Yes	No (prompting)
UlamAI	2026	Search-based with LLM policy	Lean 4 via LSP	Yes	No (prompting)

OpenProver's primary strength relative to single-agent provers (LeanCopilot, ReProver) is its richer strategic reasoning, literature search capability, and persistent memory. Against RL-trained provers (AlphaProof), it requires no training but lacks learned value functions. Against search-based provers (UlamAI), it offers richer proof state through the whiteboard and repository but lacks sophisticated tactic-level search with transposition tables. These comparisons are qualitative — no head-to-head benchmark comparisons have been published.

60.7.3 Extension Points for Evolutionary Methods

Status: Not implemented. The following extension points are identified by the chapter author as promising research directions suggested by the architecture. None of these features exist in the current OpenProver codebase. They are included here to connect the system to this survey's broader themes, not to describe implemented functionality.

OpenProver's architecture presents several natural extension points for evolutionary and population-based methods:

Repository Seeding. Pre-populating repo/ with lemmas and strategies from prior successful runs on related theorems. The wikilink system already supports referencing arbitrary markdown files — only a cross-run knowledge retrieval mechanism would be needed to select relevant prior knowledge.
Strategy Evolution. The whiteboard format could serve as a "chromosome" — evolving proof strategies across runs, with fitness measured by proof success or progress indicators. A population of whiteboard templates, subject to selection and mutation, could accelerate convergence on effective strategies for a given problem family.
Prompt Population. The planner and worker system prompts in prompts.py could be evolved using the LLM-powered evolutionary paradigm described throughout this survey, with different prompt variants competing on benchmark problems.
Model Selection as Hyperparameter. The planner/worker model assignment could be treated as an evolvable configuration, optimized across a population of runs with different model-task allocations.

These directions would require significant additional engineering — at minimum, a benchmark suite for fitness evaluation, a run orchestrator for managing populations of proving attempts, and a knowledge retrieval system for cross-run repository seeding. They represent multi-month research projects, not incremental feature additions.

60.8 Reproducibility

OpenProver achieves strong within-run reproducibility through its file-based state management. Every run creates a self-contained directory structure. The following layout is documented in the repository README and verified by inspection:

# Directory structure for a completed run
# Verified from repository documentation and run artifacts

runs/<slug>-<timestamp>/
├── THEOREM.md             # Immutable copy of input theorem
├── THEOREM.lean           # Formal Lean statement (if provided)
├── WHITEBOARD.md          # Current whiteboard state (enables resume)
├── PROOF.md               # Final informal proof (if found)
├── PROOF.lean             # Final formal proof (if lean mode)
├── DISCUSSION.md          # Post-session LLM analysis
├── run_config.toml        # Saved run configuration
├── repo/                  # Repository items with wikilinks
│   ├── lemma-foo.md
│   └── observation-bar.md
├── steps/
│   ├── step_001/
│   │   ├── planner.toml   # Planner's TOML decision
│   │   └── workers/
│   │       ├── task_0.md   # Worker task description
│   │       ├── result_0.md # Worker output
│   │       └── worker_0_call.json
│   └── step_002/...
└── archive/
    └── calls/
        └── call_001.json  # Full LLM call record (prompt, response, cost, timing)

The resume protocol detects existing runs by checking for WHITEBOARD.md + THEOREM.md, infers the step count from step_NNN directories, and restores settings from run_config.toml (with CLI flags overriding saved values). This means a proving session survives machine crashes, power loss, Ctrl-C interruptions, context window exhaustion, and network failures during API calls.

Historical runs can be browsed in read-only mode via openprover inspect runs/<run-dir>, which presents every step's planner decisions, worker tasks, and results in the TUI.

60.8.1 Resource Requirements

Resource	Minimum	Recommended
Python	3.10+	3.11+ (for `tomllib`)
RAM	4 GB	8+ GB (for Lean Explore models)
GPU	None (CPU works)	CUDA GPU (for `lean_search` reranking)
Disk	~500 MB (base)	~2 GB (with Lean Explore data)
Lean 4	Optional	Required for formal verification modes
Claude CLI	Required for default config	—

60.8.2 Reproducibility Limitations

Despite strong artifact management, exact reproducibility is limited by several factors:

LLM non-determinism. Even with identical prompts and configurations, different API calls may produce different planner decisions, leading to divergent proof searches. The system does not currently expose model temperature or random seed controls through its interface.
Provider evolution. LLM provider behavior changes over time as models are updated. A run that succeeded in February 2025 may behave differently when repeated in 2026, even with the same configuration.
No standardized benchmark protocol. Without fixed problem sets, evaluation metrics, and reporting standards, comparisons between different OpenProver runs (or between OpenProver and other systems) cannot be made rigorously.

60.8.3 Typical Usage

The following CLI invocations, drawn from the repository README, illustrate standard usage patterns:

# Quick informal proof exploration (from repository README)
openprover my_conjecture.md --autonomous --max-time 1h

# Deep dive with literature search and parallel workers
openprover hard_problem.md --no-isolation --planner-model opus \
  --worker-model sonnet -P 3 --max-time 4h

# Formalize a proof with Lean 4 verification
openprover theorem.md --lean-project ~/mathlib4 \
  --lean-theorem theorem.lean --proof proof.md

# Inspect a historical run
openprover inspect runs/sqrt2-20250315-143022/

60.9 Significance for the Field

OpenProver's contribution to the landscape of LLM-powered systems is best understood at three levels.

At the architecture level, it demonstrates that the planner-worker-whiteboard pattern — publicly described by DeepMind's Aletheia for automated mathematical reasoning — can be implemented with open-source components and commercial LLM APIs. To our knowledge, it is among the first open-source systems to implement this specific architectural pattern for theorem proving, though this claim is bounded to systems identified in this survey as of April 2026. The clean separation between strategic planning and focused execution, mediated by persistent external state, provides a reusable architectural template for any agentic system operating over extended time horizons.

At the memory level, it shows that a remarkably simple file-based knowledge management system — markdown files with wikilink cross-references — can effectively compensate for context window limitations. The planner never exceeds ~7,000 tokens per step, yet can coordinate proof searches spanning hours and dozens of steps. This is a practical counterpoint to the assumption that longer context windows are the primary path to deeper reasoning.

At the community level, it lowers the barrier to entry for automated theorem proving research. A researcher with a Claude API key (or a local vLLM instance) and a theorem statement can launch an autonomous proof search in minutes. The full auditability of every run — with archived prompts, responses, costs, and decisions — makes the system valuable for studying LLM reasoning behavior in mathematical domains.

The system's threading model (not async), subprocess-based LLM integration (not SDK), and file-based state (not database) represent pragmatic engineering choices that prioritize simplicity and inspectability over scalability. For a research tool intended to be read and modified by individual researchers, these are defensible tradeoffs.

Key caveats. OpenProver has not been evaluated on standard benchmarks, so its competitive proving strength relative to systems like LeanCopilot, LEGO-Prover, ReProver, or UlamAI remains unknown. The architectural equivalence to Aletheia is structural — based on publicly described design elements — not a claim of performance parity. The system's value at this stage is primarily as an open research platform and architectural reference implementation, not as a state-of-the-art prover.

Summary

Key takeaway: OpenProver is an open-source implementation of the planner-worker-whiteboard architecture for LLM-guided theorem proving, inspired by DeepMind's Aletheia. It demonstrates that persistent structured state — a terse LaTeX whiteboard, a file-based knowledge repository with wikilink retrieval, and a rolling output window — can enable multi-hour proof search sessions with standard-context models, without custom training or proprietary infrastructure.

Main contribution: An inspectable, resumable, multi-provider orchestration framework (~3,000 lines of Python, single hard dependency) that bridges the gap between closed frontier proof agents and the open research community, providing both informal and Lean 4 formal proving capabilities with full artifact auditability.

Evidence model: The system's capabilities are demonstrated through case studies on problems ranging from textbook (√2 irrationality) to open research problems (Erdős 838), documented as example runs in the repository. No systematic benchmark evaluation has been published. The architectural pattern generalizes beyond theorem proving, but quantitative claims about proving strength await benchmark-scale evaluation.

What a researcher should know: OpenProver is best understood as a research platform and architectural reference, not as a benchmark-dominating prover. Its value lies in inspectability, extensibility, and the demonstration that the planner-worker-whiteboard pattern is viable with commodity LLM APIs. Key limitations are the absence of cross-run learning and systematic benchmarking. The evolutionary extensions discussed in Section 60.7.3 are speculative research directions, not implemented features.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}