UlamAI: Search-Based Theorem Proving as Computational Evolution
Part: Harness & Agent Frameworks
63.1 Overview & Motivation
UlamAI Prover is an open-source CLI tool that combines best-first search over Lean 4 proof states with LLM-guided tactic generation, retrieval-augmented premise selection, automated repair loops, and a full LaTeX-to-Lean autoformalization pipeline [README]. Named after Stanisław Ulam (1909–1984)—the Polish-American mathematician who co-invented the Monte Carlo method, contributed to the hydrogen bomb design, and proposed the concept of cellular automata with John von Neumann—the system embodies a philosophy that positions formal verification as the immovable ground truth while treating language models as heuristic, fallible search policies [README].
The system was released in February 2026 (v0.1) by Ulam Labs, led by Przemek Chojecki, and iterated rapidly to v0.2.11 by March 2026 [README]. It is distributed via PyPI (ulam-prover), Homebrew, and direct installation—an unusual breadth of distribution channels for a research tool [README]. The repository is licensed under MIT and, as of the source analysis date, had accumulated 91 commits and 111+ GitHub stars [README].
Within this survey's taxonomy of harness and agent frameworks, UlamAI occupies a distinctive niche. Where planner-worker systems such as Aletheia (DeepMind, 2025) and OpenProver (2025) emphasize strategic coordination through persistent whiteboards and delegated task execution, UlamAI emphasizes systematic search—best-first exploration of a well-defined proof-state space using LLMs as the policy network [README]. This architectural divergence reflects two complementary framings of mathematical proof discovery: proof as planning versus proof as search. UlamAI's dual-track architecture—formal Lean search plus informal TeX planning—acknowledges this duality explicitly [README].
For the broader evolutionary AI narrative of this survey, UlamAI is significant for three reasons. First, its trace-logging infrastructure generates exactly the (state, action, outcome) tuples required for supervised fine-tuning, creating a potential self-reinforcing data flywheel [README]. Second, its planned progression from heuristic scoring through supervised fine-tuning to reinforcement learning (v0.3–v0.5) maps directly onto an evolutionary self-improvement trajectory [README]. Third, its benchmark parity-gate system provides a formal mechanism for evaluating whether system changes constitute genuine improvement—a fitness evaluation function for the system's own evolution [README].
Repository access note: The analysis in this chapter is based on the technical documentation, blog posts, Medium articles, README, and PyPI metadata for UlamAI Prover as described in the source analysis document (dated April 2026). Direct repository audit at a pinned commit was not performed by this survey author. Consequently, all implementation claims are tagged [README] (from documentation sources) or [INFERRED] (survey author reconstruction) rather than [REPO]. File paths and package structures referenced below are drawn from the source analysis document's description of the repository, not from independent verification.
63.2 Architecture
Repository inspected via documentation and source analysis dated April 2026; no pinned commit hash independently verified by this survey author. All [README] claims reference the source analysis document's description of the repository and its documentation.
63.2.1 System Overview
UlamAI is organized around five architectural layers: a CLI entry point, two independent proving pipelines (Lean formal and TeX informal), a Lean backend abstraction layer, an LLM adapter layer, and a benchmark/trace infrastructure layer [README]. The CLI dispatches commands to specialized subsystems: prove, formalize, bench, replay, review, checkpoint, and index [README].
Figure 63.1. UlamAI system architecture. Solid-bordered components: [README]. Dashed-bordered component: [INFERRED] planned capability not yet implemented.
63.2.2 Package Structure
The source analysis describes the following package layout [README]:
# Described package structure — from source analysis document, not independently audited
ulam/
├── cli.py # CLI entry point, TUI menu
├── prover.py # Core proof search loop
├── search.py # Best-first / beam search engine
├── retriever.py # Premise retrieval
├── llm/ # LLM adapter layer (openai, anthropic, codex_cli, claude_cli, gemini, ollama, base)
├── lean/ # Lean backend layer (dojo, cli_runner, lsp, parser)
├── formalize/ # Autoformalization pipeline (segmenter, drafter, prover, equivalence, pipeline)
├── tex/ # Informal TeX proving (planner, worker, judge, whiteboard, composer)
├── bench/ # Benchmark infrastructure (runner, report, compare, suites/)
├── index/ # Retrieval index management
├── replay.py # Deterministic replay
├── review.py # Next-step suggestion
├── checkpoint.py # Read-only health check
└── config.py # Configuration management
This structure separates concerns cleanly: the search engine (search.py) is decoupled from the proving orchestration (prover.py), the LLM adapters are independent of the Lean backends, and the two proving tracks (formal Lean, informal TeX) occupy distinct subpackages [README]. The benchmark infrastructure is co-located rather than relegated to test utilities, reflecting its status as a first-class product feature [README].
63.2.3 Sequence Diagram: Proof Search End-to-End
Figure 63.2. Sequence diagram for proof search. All interactions are [README]-sourced. Latencies are not measured; the documentation notes that Lean execution time dominates wall-clock cost.
63.2.4 Artifact Inventory
Since direct repository audit was not performed, the following table lists observable artifacts as described in the source analysis document:
| Artifact | Location / Name | Evidence Source |
|---|---|---|
| PyPI package | ulam-prover v0.2.11 | [README] |
| Homebrew tap | ulamai/ulamai | [README] |
| CLI entry point | ulam = ulam.cli:main | [README] |
| Configuration file | .ulam/config.json | [README] |
| Proof trace | run.jsonl | [README] |
| Replay metadata sidecar | *.meta.json | [README] |
| Formalization artifacts | runs/formalize_YYYYMMDD_HHMMSS/ | [README] |
| TeX proving artifacts | runs/prove_tex/tex_YYYYMMDD_HHMMSS/ | [README] |
| Premise index | .ulam/premises_*.jsonl | [README] |
| Benchmark reports | JSON + Markdown files | [README] |
| Build system | setuptools ≥68 + wheel | [README] |
| License | MIT | [README] |
63.3 Core Algorithms
63.3.0 Verification Matrix
| Algorithm / Mechanism | Claim | Evidence Source | Artifact | Confidence |
|---|---|---|---|---|
| Best-first search with beam capping | Priority queue with configurable beam width | [README] | search.py (described) | High |
| Transposition table | State deduplication via canonical hashing | [README] | search.py (described) | High |
| Automation tactic fallbacks (autop) | aesop, simp, linarith, ring, norm_num tried before LLM | [README] | prover.py (described) | High |
| Repair loop | Error feedback → LLM fix → re-execute (bounded) | [README] | prover.py (described) | High |
| Retrieval-augmented premise selection | Token-overlap + embedding-based retrieval | [README] | retriever.py (described) | High |
| TeX planner-action loop | Claim decomposition, worker drafts, judge/verifier gates | [README] | tex/ subpackage (described) | Medium-High |
| Autoformalization pipeline | Five-phase TeX→Lean with fixpoint repair | [README] | formalize/ subpackage (described) | Medium-High |
| Parity gate system | Non-regression gates for benchmark comparison | [README] | bench/compare.py (described) | High |
| Step + state caching | Two-level cache to avoid redundant Lean calls | [README] | prover.py, search.py (described) | High |
| SFT/RL training loops | Planned v0.3–v0.5, not implemented | [README] | Roadmap only | Aspirational |
63.3.1 Best-First Search with Beam Capping and Transposition Table
The core proving mechanism is a best-first search over the space of Lean 4 proof states [README]. At each step, the search engine selects the highest-priority state from a bounded frontier (beam-capped), expands it by generating candidate tactics via the LLM, executes each tactic through the Lean backend, and enqueues resulting successor states [README]. A transposition table prevents re-exploration of states already visited via different tactic paths [README].
The documentation explicitly motivates this choice over Monte Carlo Tree Search (MCTS): "Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later" [README]. This is a pragmatic design decision—the bottleneck is the formal verification step, not the search branching factor.
| Symbol | Meaning | Domain |
|---|---|---|
| \(s^*\) | Next state selected for expansion | Proof state |
| \(\mathcal{F}\) | Search frontier (priority queue) | Set of proof states |
| \(h(s)\) | Heuristic score for state \(s\) | \(\mathbb{R}\) |
| \(B\) | Beam width (maximum frontier size) | \(\mathbb{Z}^+\) |
[Author-derived formalization of documented search behavior — the exact scoring function \(h(s)\) is described as heuristic and not learned; specific details of the heuristic are not publicly documented.]
Analogy to classical beam search: UlamAI's search preserves beam search's bounded-memory property and greedy expansion order, but adds a transposition table—an element drawn from game-tree search (e.g., chess engines) rather than NLP beam decoding. The analogy breaks down in that proof states lack the sequential left-to-right structure of text generation; states can be reached via many tactic orderings, making deduplication both more valuable and harder to implement correctly [README: "Better state canonicalization (stable hashing)" listed as roadmap item].
# Pseudocode — reconstructed from README documentation, not repo-verified
# Proof search main loop
def search_loop(initial_state, lean_backend, llm, retriever, config):
frontier = PriorityQueue(max_size=config.beam_width)
frontier.push(initial_state, score=heuristic(initial_state))
transposition_table = {} # state_key → best_seen_score
step_cache = {} # (state_key, tactic) → result
trace = []
while not frontier.empty() and not budget_exhausted():
state = frontier.pop_best()
# Phase 1: Try automation tactics before LLM
for auto_tactic in ["aesop", "simp", "linarith", "ring", "norm_num"]:
result = lean_backend.execute(state, auto_tactic)
if result.is_proof_complete:
return ProofFound(result, trace)
# Phase 2: Retrieve premises and query LLM
premises = retriever.retrieve(state, top_k=config.retriever_k)
candidate_tactics = llm.propose_tactics(state, premises, k=config.gen_k)
for tactic in candidate_tactics:
cache_key = (state.canonical_key(), tactic)
if cache_key in step_cache:
result = step_cache[cache_key]
else:
result = lean_backend.execute(state, tactic)
step_cache[cache_key] = result
if result.success:
new_state = result.new_state
new_score = heuristic(new_state)
state_key = new_state.canonical_key()
# Transposition check
if state_key in transposition_table:
if new_score <= transposition_table[state_key]:
continue # Prune: already seen with better score
transposition_table[state_key] = new_score
if result.is_proof_complete:
return ProofFound(result, trace)
frontier.push(new_state, score=new_score)
else:
# Repair loop (bounded)
repaired = llm.repair(tactic, result.error, state, retries=config.repair_retries)
if repaired and repaired.success:
frontier.push(repaired.new_state, score=heuristic(repaired.new_state))
trace.append(StepRecord(state, candidate_tactics, results))
return ProofNotFound(trace)
63.3.2 Automation Tactic Fallbacks (autop)
Before every LLM query, UlamAI tries a cascade of five Lean built-in automation tactics: aesop, simp, linarith, ring, and norm_num [README]. The rationale is that many subgoals that appear difficult to a language model are trivial for Lean's decision procedures. This saves LLM API tokens, search budget, and wall-clock time [README].
The automation-first design has a significant implication for cost: if a substantial fraction of subgoals in a typical proof can be closed by built-in tactics, the LLM is only queried for genuinely novel reasoning steps. This is analogous to a hybrid strategy in evolutionary search where cheap local optimization is applied before invoking expensive global operators.
63.3.3 Retrieval-Augmented Premise Selection
Two retrieval modes are documented [README]:
| Mode | Mechanism | Speed | Dependencies |
|---|---|---|---|
| Token-overlap | BM25-style keyword matching | Fast | None |
| Embedding-based | Vector similarity search | Moderate | Embedding API |
The retrieval index is built via ulam index build, which scans local project declarations and/or Mathlib (100,000+ lemmas) to produce a JSONL premises file [README]. The source documentation emphasizes that retrieval is critical because "without retrieval, the model must rely on memorized knowledge—knowledge that degrades with model size, quantization, and the ever-growing surface area of Mathlib itself" [README].
63.3.4 Repair Loop
When a tactic fails, UlamAI feeds Lean's error message back to the LLM for a targeted repair attempt [README]. Lean's type error messages are precise (missing argument, type mismatch, unknown identifier), enabling the model to make focused corrections rather than regenerating from scratch. The repair loop is bounded to prevent infinite retry cycles [README]. This error-feedback mechanism is a form of guided local search—each failed attempt constrains the search space for the next attempt.
63.3.5 TeX Proving Pipeline (Informal Proofs)
The --output-format tex pipeline implements a planner-action loop that is architecturally distinct from the formal search track [README]. The planner selects actions from a fixed vocabulary: plan (decompose into claim graph), solve (workers draft claims), write_memory (update WHITEBOARD or repo), compose (assemble final proof), and give_up (exhaust budget) [README].
Workers draft each claim and submit it to a judge plus adversarial verifier and domain checker [README]. The system maintains persistent memory via WHITEBOARD.md and named repo/ items, accessible via [[wikilink]] expansion in prompts [README]. This pattern closely mirrors the planner-worker-whiteboard architecture described in DeepMind's Aletheia and replicated in OpenProver [README].
The v0.2.11 flow begins with a single deep whole-theorem attempt using the primary model; if this fails, the system falls back to the planner/worker claim decomposition [README]. Configuration knobs include --tex-rounds, --tex-worker-drafts, --tex-concurrency, --tex-judge-repairs, --tex-replan-passes, --tex-action-steps, and separate model overrides for planner and worker roles [README].
63.3.6 Autoformalization Pipeline (TeX → Lean)
The formalization pipeline is a five-phase process [README]:
- Phase A — Ingest + Segment: Parse LaTeX into logical chunks (definitions, lemmas, theorems, proof sketches); extract named entities; build dependency graph.
- Phase B — Statement Drafting: Generate Lean skeleton with
sorryproofs; iteratively repair until the file typechecks. - Phase C — Proof Reconstruction: Open each
sorrygoal; run proof search; if failing, retrieve similar lemmas, reformulate, or generate auxiliary lemmas. - Phase D — Semantic Alignment: A fixpoint loop—LLM judges equivalence between Lean statement and informal statement; mismatches enqueue repair tasks; statement changes invalidate downstream proofs recursively.
- Phase E — Consolidation: Produce final
.leanfile; record success rates, remainingsorrys, error traces.
This pipeline is configurable via --max-rounds and --max-proof-rounds [README]. The fixpoint structure in Phase D is notable: it represents a recursive repair process where semantic drift between informal and formal representations is iteratively corrected.
63.3.7 Benchmark Parity Gate System
The bench-compare command implements a non-regression gate that can be used in CI or development workflows [README]:
# Pseudocode — reconstructed from README documentation
# Benchmark comparison with parity gate
python3 -m ulam bench-compare \
--a runs/bench/a.json --b runs/bench/b.json \
--gate \
--max-solved-drop 0 \
--max-success-rate-drop 0 \
--max-semantic-pass-rate-drop 0 \
--max-regression-rejection-rate-increase 0 \
--max-median-time-increase-pct 25
The gate checks comparability (suite SHA256 matching, inference profile matching) before evaluating regression thresholds [README]. This implements a formal fitness evaluation for system changes: a commit passes the gate if and only if it does not degrade performance beyond configured tolerances. Available threshold parameters include solved-drop, success-rate-drop, semantic-pass-rate-drop, regression-rejection-rate-increase, median-time-increase-pct, planner-replan-triggers-increase, and planner-cached-tactic-tries-drop [README].
63.4 Key Results
63.4.1 Evaluation Caveats
No public benchmark scores have been reported for UlamAI as of the source analysis date (April 2026) [README]. The system has invested heavily in benchmarking infrastructure—suite management, report generation, cross-report comparison with parity gates, campaign scripts, and miniF2F integration—but specific numerical results (solve rates, timing, token usage) are not publicly available. The following gaps are notable:
- No absolute baseline comparisons: No published solve rate on miniF2F or any standard theorem-proving benchmark.
- No seed/run counts: Neither the number of independent runs nor variance across runs is reported.
- No compute budget reporting: Wall-clock time, token usage, and API costs per benchmark run are not disclosed.
- No independent reproduction: No third-party evaluation of UlamAI's proving capabilities has been identified.
- No matched-budget comparisons: It is not possible to compare UlamAI against other systems under controlled conditions.
63.4.2 Capability Claims
In the absence of quantitative benchmarks, the following capability claims are drawn from the documentation:
| Capability | Claim | Evidence | Status |
|---|---|---|---|
| End-to-end formal proving | Produces machine-checked Lean 4 proofs for textbook theorems (e.g., irrationality of √2) | [README] | Described |
| Multi-backend operation | Works with LeanDojo, CLI, and experimental LSP backends | [README] | Described |
| Multi-provider LLM support | Operates with 7+ LLM providers including Codex, Claude, Gemini, Ollama | [README] | Described |
| Autoformalization | TeX-to-Lean pipeline with iterative repair and semantic alignment | [README] | Described |
| Informal TeX proving | Planner-action loop with claim decomposition, judge/verifier, whiteboard memory | [README] | Described |
| Benchmark infrastructure | Suite management, parity gates, campaign automation, miniF2F integration | [README] | Described |
| Deterministic replay | Re-execution of recorded proof traces with metadata validation | [README] | Described |
No independent quantitative evidence is available for any of these claims. The system's documentation is detailed and internally consistent, but all capability statements should be treated as self-reported.
63.5 Implementation & Cost
63.5.1 Resource Requirements
| Resource | Minimum | Recommended | Source |
|---|---|---|---|
| Python | 3.10+ | 3.11+ | [README] |
| RAM | 4 GB | 8+ GB | [README] |
| Disk | ~200 MB (base) | ~2 GB (with Lean + Mathlib) | [README] |
| Network | Required for API models | — | [README] |
| Lean 4 + Mathlib | Required for formal proofs | Via ulam -lean | [README] |
63.5.2 Cost Drivers
| Cost Component | Driver | Control Mechanism | Source |
|---|---|---|---|
| LLM tactic proposals | \(k\) candidates per state × states explored | gen_k, beam cap, timeout | [README] |
| LLM repair attempts | Failed tactics × repair retries | Bounded repair loop | [README] |
| Embedding calls | Premise retrieval queries | Retriever mode selection | [README] |
| Lean execution | Tactic compilation per candidate | exec_k, step cache | [README] |
| Autoformalization | Multiple rounds × LLM drafts + repairs | --max-rounds, --max-proof-rounds | [README] |
| TeX proving | Planner steps × worker drafts | --tex-rounds, --tex-worker-drafts | [README] |
63.5.3 Cost Optimization Strategies
Six documented cost optimization mechanisms [README]:
- Automation-first: Five built-in Lean tactics tried before LLM query on every state, eliminating API calls for trivially solvable subgoals.
- Step cache:
(state_key, tactic) → resultmapping avoids duplicate Lean compilation. - Transposition table:
state_key → best_seen_scoreprevents re-exploring equivalent states. - Beam capping: Bounded frontier size prevents unbounded search expansion.
- Budget tracking: Time or token budgets with automatic conclusion phase.
- Inference profiles: Pre-configured trade-offs (
fast,balanced,strict) between cost and thoroughness.
gen_k=8 and a beam width of 100 would require at most 800 LLM tactic-generation calls before frontier exhaustion, but actual costs depend heavily on the theorem's difficulty, the fraction of subgoals closed by automation fallbacks, and the cache hit rate—none of which are publicly reported.
63.5.4 Inference Profiles
| Profile | gen_k | exec_k | verify_level | Use Case | Source |
|---|---|---|---|---|---|
fast | Low | Low | Minimal | CI smoke tests | [README] |
balanced | Medium | Medium | Standard | Default benchmarking | [README] |
strict | High | High | Thorough | Publication-quality | [README] |
Exact numeric values for gen_k, exec_k, and verify_level within each profile are not publicly documented [README]. The documentation states that gen_k ranges from 8–16 in general usage [README].
63.6 Reproducibility
63.6.1 Reproducibility Assessment
| Requirement | Status | Notes | Source |
|---|---|---|---|
| Code publicly released | ✓ | MIT license, GitHub + PyPI + Homebrew | [README] |
| Config files available | ✓ | .ulam/config.json auto-generated | [README] |
| Pretrained weights / checkpoints | N/A | Uses external LLM APIs, no custom model weights | [README] |
| Documented entry point or run command | ✓ | Multiple CLI commands documented with examples | [README] |
| Compute requirements stated | Partial | Hardware minimums stated; no per-run costs | [README] |
| Seeds and run counts reported | ✗ | Random seed is configurable but no reported experiments | [README] |
| Independent reproduction attempted | ✗ | No third-party evaluation identified | [README] |
63.6.2 Trace and Replay Infrastructure
UlamAI implements comprehensive trace logging and deterministic replay [README]. Every proof search run produces a JSONL trace file logging proof state, proposed tactic, Lean response, scoring decision, and wall-clock timing per step [README]. Replay is supported via ulam replay run.jsonl for inspection and ulam replay run.jsonl --execute --strict for deterministic re-execution [README].
Strict replay mode validates against a metadata sidecar (*.meta.json) containing Lean backend type, project path, toolchain version, Mathlib commit/revision, file hash, and run configuration [README]. This level of environment pinning is among the most thorough identified in this survey's coverage of harness systems.
63.6.3 Step-by-Step Verification Path
Based on the documentation, an external researcher would need to:
- Install UlamAI:
pip install ulam-proverorbrew tap ulamai/ulamai && brew install ulamai[README]. - Set up Lean 4 environment:
ulam -lean(installs elan, Lean toolchain, LeanDojo/Pantograph) [README]. - Configure an LLM provider:
ulam auth codexor setULAM_OPENAI_API_KEY[README]. - Run a smoke test:
ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke --prove-mode tactic --lean dojo[README]. - Inspect artifacts: check
run.jsonlfor trace, verify the theorem typechecks in Lean. - For benchmarking:
ulam bench --suite internal_regression --inference-profile balanced[README].
What would constitute successful reproduction: A complete JSONL trace showing the search tree, with the final tactic sequence producing a Lean 4 proof that typechecks without sorry or additional axioms. The ulam checkpoint command provides a read-only health check of the resulting file [README].
Key uncertainty: The LeanDojo/Pantograph version alignment is described as fragile [README]. Toolchain mismatches between Lean 4, Mathlib, and Pantograph are a known source of reproduction failures in the broader Lean ecosystem. The documentation acknowledges this and provides --use-mathlib-toolchain for explicit control [README].
63.7 Threats to Validity
This section consolidates validity concerns for UlamAI's documented capabilities and the analysis presented in this chapter.
| Threat | Description | Severity |
|---|---|---|
| Absence of quantitative evaluation | No public benchmark scores exist despite extensive benchmarking infrastructure. All capability claims are self-reported from documentation. It is not possible to assess the system's actual theorem-proving performance or compare it to alternatives. | High |
| No independent reproduction | No third-party evaluation has been identified. The system's claims rest entirely on developer documentation. | High |
| Repository not independently audited | This chapter's analysis is based on a source analysis document rather than direct commit-pinned repository inspection. File paths, module names, and architectural claims could not be independently verified against actual code. | Medium |
| Early-stage maturity | At v0.2.11 with 91 commits over ~2 months, the system is in rapid early development. APIs, behaviors, and architectural decisions may change significantly. | Medium |
| Lean toolchain sensitivity | LeanDojo/Pantograph version alignment is described as fragile. Reproduction across different environments may fail due to toolchain mismatches rather than system issues. | Medium |
| Reviewer circularity | The TeX proving pipeline uses LLM judges and adversarial verifiers—models evaluating model output. The formal Lean track avoids this (Lean kernel is ground truth), but informal TeX proofs are not machine-checked. | Medium (TeX track only) |
| Aspirational features in roadmap | The v0.3–v0.5 learning roadmap (SFT, value model, RL) is explicitly planned but not implemented. These should not be conflated with current capabilities. | Low (clearly labeled) |
| Comparison budget mismatch | No controlled comparisons against other systems exist. Any comparison table in this chapter (or the source analysis) reflects documented feature sets, not matched-budget empirical evaluation. | Medium |
63.8 Limitations & Open Questions
63.8.1 Documented Limitations
The source analysis identifies the following limitations, most of which are acknowledged in the documentation [README]:
- No cross-run learning: Each proof attempt starts from scratch. The planned SFT pipeline (v0.3) would address this, but it is not implemented.
- No literature search: Unlike OpenProver, UlamAI cannot consult web resources during proving.
- No learned value model: Search scoring is heuristic, not trained. The documentation lists "better scoring heuristics (non-RL)" as a v0.2 goal.
- Single-threaded proof search: No parallel rollouts are documented; parallelization is planned for v0.4.
- Experimental LSP backend: The Lean language server backend is not at parity with LeanDojo.
- State canonicalization: "Better state canonicalization (stable hashing)" is listed as a roadmap item, implying the current implementation may have false negatives in the transposition table.
63.8.2 Open Questions
- What is the actual solve rate? The most critical unknown. Without benchmark numbers, the system's practical utility for mathematical research cannot be assessed.
- How effective are automation fallbacks? What fraction of subgoals in typical proofs are closed by autop versus LLM? This ratio has significant cost implications.
- Does the transposition table provide meaningful savings? If proof-state canonicalization is imperfect, the table may miss many equivalent states.
- How does the system scale with proof difficulty? The search budget (beam width × depth × gen_k) grows multiplicatively. For hard theorems, is the system budget-constrained before it is capability-constrained?
- Can the planned SFT loop close the cross-run learning gap? The trace format contains (state, tactic, outcome) tuples suitable for supervised learning, but the quality of this training signal depends on the distribution of successful proofs.
63.9 Survey Positioning
63.9.1 Comparison with Related Systems
UlamAI's positioning within the theorem-proving landscape is clarified by comparison with three related systems. Note that all comparisons below are based on documented feature sets, not matched-budget empirical evaluation.
| Capability | UlamAI | OpenProver | LeanCopilot | Evidence Basis |
|---|---|---|---|---|
| Search strategy | Best-first/beam with transposition | LLM-directed planning | One-shot tactic suggestion | [README] for all |
| Formal verification | Lean 4 (3 backends) | Lean 4 (CLI only) | Lean 4 (LeanDojo) | [README] |
| Informal proving | TeX planner-action loop | Whiteboard-based | Not supported | [README] |
| Autoformalization | TeX → Lean pipeline | Markdown → Lean | Not supported | [README] |
| Literature search | Not supported | Web-enabled | Not supported | [README] |
| Multi-model support | 7+ providers | 3 providers | 1 provider | [README] |
| Benchmark infrastructure | Extensive (parity gates, campaigns) | None documented | Basic | [README] |
| Automation fallbacks | 5 Lean tactics | Not documented | Partial | [README] |
| Open source | Yes (MIT) | Yes | Yes | [README] |
| Published benchmark results | None | Limited | Yes | [README] |
63.9.2 Proof-as-Search vs. Proof-as-Planning: A Correspondence Table
UlamAI's dual-track architecture embodies two distinct paradigms for AI-driven mathematical proof discovery:
| Conceptual Element | Proof-as-Search (Lean Track) | Proof-as-Planning (TeX Track) |
|---|---|---|
| State space | Lean proof states (goals + hypotheses) | Claim graph + whiteboard memory |
| Action space | Lean tactics | Planner actions (plan, solve, compose, ...) |
| Policy | LLM tactic generator | LLM planner + worker drafts |
| Evaluation | Lean kernel (binary: typechecks or not) | LLM judge + adversarial verifier |
| Memory | Step cache + transposition table | WHITEBOARD.md + repo/ items |
| Termination | All goals closed / budget exhausted | Proof composed / budget exhausted |
| Ground truth | Formal: Lean kernel verification | Informal: no machine-checkable guarantee |
Where the analogy breaks down: The search framing works well for the formal Lean track because proof states form a well-defined graph with a binary success criterion (typechecks vs. does not). The planning framing applies to the TeX track because informal proof construction is open-ended—there is no formal verifier, so evaluation is necessarily approximate and multi-faceted (judge, verifier, domain checker). The two tracks share LLM infrastructure but operate on fundamentally different state spaces with different evaluation guarantees. Conflating them would obscure the critical difference: the Lean track cannot produce a false proof; the TeX track can.
63.9.3 Evolutionary Self-Improvement Correspondence
Although UlamAI does not currently implement self-improvement, its architecture and roadmap map onto evolutionary computation concepts:
| Evolutionary Concept | UlamAI Component | Status |
|---|---|---|
| Individual / candidate | Tactic sequence (proof attempt) | Implemented |
| Population | Search frontier (beam-capped) | Implemented |
| Fitness function | Lean verification (binary) + heuristic score | Implemented |
| Selection | Best-first priority selection | Implemented |
| Mutation | LLM tactic proposal + repair loop | Implemented |
| Environmental pressure | Benchmark parity gates | Implemented |
| Accumulated experience | JSONL trace archive | Implemented (data only) |
| Self-improvement (SFT) | Train on successful traces | Planned (v0.3) |
| Learned evaluation (value model) | Replace heuristic scoring with learned model | Planned (v0.4) |
| On-policy RL | PPO/GRPO-style fine-tuning | Planned (v0.5) |
Where the evolutionary analogy breaks down: In a true evolutionary system, the population persists and improves across generations. UlamAI currently has no cross-run learning—each proof attempt starts from scratch with the same base LLM. The search frontier within a single run is analogous to a single-generation population, not a multi-generational evolutionary process. The planned SFT pipeline would partially close this gap by training improved models on accumulated traces, but even then, the "evolution" would occur at the model level (via fine-tuning) rather than at the program level (via code mutation). The system is better described as architecturally prepared for evolutionary dynamics than as currently exhibiting them.
Key Contribution
UlamAI's primary contribution to the harness systems landscape is its combination of systematic search-based proof discovery with extensive benchmarking infrastructure [README]. The parity gate system, inference profiles, campaign scripts, and cross-report comparison tools represent a level of engineering discipline around evaluation that is uncommon among research theorem provers. Its secondary contribution is the dual-track architecture (formal search + informal planning) that explicitly acknowledges the complementary strengths of systematic search at the tactic level and strategic planning at the proof-structure level [README].
The connection to Stanisław Ulam is more than nominal: Ulam's Monte Carlo method introduced the idea of using randomized sampling to solve deterministic problems—exactly the paradigm that UlamAI applies when using stochastic LLM outputs as the policy in a deterministic proof-state search [README]. The system's planned progression toward self-reinforcing dynamics (SFT → value model → RL) also echoes the broader theme of computational systems that improve through accumulated experience, a theme central to Ulam and von Neumann's work on self-reproducing automata.
63.10 Summary
Key takeaway: UlamAI Prover is an architecturally distinctive theorem-proving system that combines best-first search over Lean 4 proof states with LLM tactic generation, automation fallbacks, retrieval-augmented premise selection, and a dual-track (formal + informal) proving architecture. Its benchmarking infrastructure—including parity gates, inference profiles, and campaign automation—is among the most developed identified in this survey's coverage of open-source theorem provers.
Main contribution: The integration of search-based proof discovery with first-class benchmarking infrastructure in a single open-source CLI tool, demonstrating that systematic evaluation methodology can be co-developed with proving capability rather than added as an afterthought.
Most important gap for future researchers: The absence of published benchmark results is the single largest barrier to assessing UlamAI's significance. A future researcher should reproduce the system's performance on miniF2F or a comparable standard benchmark, report solve rates with seed counts and confidence intervals, and compare against at least one baseline system under matched compute budgets. Until such evidence exists, UlamAI's contribution is primarily architectural and methodological rather than empirically demonstrated.