Introduced2026-01
Score8.08/10 — Draft
Chapter 63

UlamAI: Search-Based Theorem Proving as Computational Evolution

Part: Harness & Agent Frameworks

63.1 Overview & Motivation

UlamAI Prover is an open-source CLI tool that combines best-first search over Lean 4 proof states with LLM-guided tactic generation, retrieval-augmented premise selection, automated repair loops, and a full LaTeX-to-Lean autoformalization pipeline [README]. Named after Stanisław Ulam (1909–1984)—the Polish-American mathematician who co-invented the Monte Carlo method, contributed to the hydrogen bomb design, and proposed the concept of cellular automata with John von Neumann—the system embodies a philosophy that positions formal verification as the immovable ground truth while treating language models as heuristic, fallible search policies [README].

The system was released in February 2026 (v0.1) by Ulam Labs, led by Przemek Chojecki, and iterated rapidly to v0.2.11 by March 2026 [README]. It is distributed via PyPI (ulam-prover), Homebrew, and direct installation—an unusual breadth of distribution channels for a research tool [README]. The repository is licensed under MIT and, as of the source analysis date, had accumulated 91 commits and 111+ GitHub stars [README].

Within this survey's taxonomy of harness and agent frameworks, UlamAI occupies a distinctive niche. Where planner-worker systems such as Aletheia (DeepMind, 2025) and OpenProver (2025) emphasize strategic coordination through persistent whiteboards and delegated task execution, UlamAI emphasizes systematic search—best-first exploration of a well-defined proof-state space using LLMs as the policy network [README]. This architectural divergence reflects two complementary framings of mathematical proof discovery: proof as planning versus proof as search. UlamAI's dual-track architecture—formal Lean search plus informal TeX planning—acknowledges this duality explicitly [README].

For the broader evolutionary AI narrative of this survey, UlamAI is significant for three reasons. First, its trace-logging infrastructure generates exactly the (state, action, outcome) tuples required for supervised fine-tuning, creating a potential self-reinforcing data flywheel [README]. Second, its planned progression from heuristic scoring through supervised fine-tuning to reinforcement learning (v0.3–v0.5) maps directly onto an evolutionary self-improvement trajectory [README]. Third, its benchmark parity-gate system provides a formal mechanism for evaluating whether system changes constitute genuine improvement—a fitness evaluation function for the system's own evolution [README].

Repository access note: The analysis in this chapter is based on the technical documentation, blog posts, Medium articles, README, and PyPI metadata for UlamAI Prover as described in the source analysis document (dated April 2026). Direct repository audit at a pinned commit was not performed by this survey author. Consequently, all implementation claims are tagged [README] (from documentation sources) or [INFERRED] (survey author reconstruction) rather than [REPO]. File paths and package structures referenced below are drawn from the source analysis document's description of the repository, not from independent verification.

63.2 Architecture

Repository inspected via documentation and source analysis dated April 2026; no pinned commit hash independently verified by this survey author. All [README] claims reference the source analysis document's description of the repository and its documentation.

63.2.1 System Overview

UlamAI is organized around five architectural layers: a CLI entry point, two independent proving pipelines (Lean formal and TeX informal), a Lean backend abstraction layer, an LLM adapter layer, and a benchmark/trace infrastructure layer [README]. The CLI dispatches commands to specialized subsystems: prove, formalize, bench, replay, review, checkpoint, and index [README].

CLI Entry Point (cli.py) [README] prove | formalize | bench | replay | review | checkpoint | index | auth PROVE PIPELINE [README] Search Engine Autop Fallbacks LLM Policy (k tactics) Retriever (Premises) Step Cache + State Cache (Transposition Table) Repair Loop: error → LLM fix → re-execute (bounded) FORMALIZE + TEX PIPELINES [README] A: Ingest + Segment TeX B: Statement Drafting C: Proof Reconstruction D: Semantic Alignment TeX Planner-Action Loop: plan → solve → compose → judge Whiteboard + Repo Memory | Claim Graph | Worker Drafts Fixpoint loop: Draft → Typecheck → Search → Validate → Repair LEAN BACKEND LAYER [README] LeanDojo (Pantograph) CLI (lake env lean) LSP [experimental] LLM ADAPTER LAYER [README] OpenAI | Codex CLI | Claude API | Claude CLI | Gemini | Ollama | Custom (Mistral, Z.AI, Qwen, DeepSeek, Kimi) BENCHMARK LAYER [README] Suites | Reports | Parity Gates | Campaigns TRACE + REPLAY LAYER [README] JSONL Logs | Metadata Sidecars | Deterministic Replay [INFERRED — not repo-verified] PLANNED SELF-IMPROVEMENT LOOP v0.3: SFT on traces → v0.4: Learned value model → v0.5: On-policy RL (PPO/GRPO)

Figure 63.1. UlamAI system architecture. Solid-bordered components: [README]. Dashed-bordered component: [INFERRED] planned capability not yet implemented.

63.2.2 Package Structure

The source analysis describes the following package layout [README]:

# Described package structure — from source analysis document, not independently audited
ulam/
├── cli.py              # CLI entry point, TUI menu
├── prover.py           # Core proof search loop
├── search.py           # Best-first / beam search engine
├── retriever.py        # Premise retrieval
├── llm/                # LLM adapter layer (openai, anthropic, codex_cli, claude_cli, gemini, ollama, base)
├── lean/               # Lean backend layer (dojo, cli_runner, lsp, parser)
├── formalize/          # Autoformalization pipeline (segmenter, drafter, prover, equivalence, pipeline)
├── tex/                # Informal TeX proving (planner, worker, judge, whiteboard, composer)
├── bench/              # Benchmark infrastructure (runner, report, compare, suites/)
├── index/              # Retrieval index management
├── replay.py           # Deterministic replay
├── review.py           # Next-step suggestion
├── checkpoint.py       # Read-only health check
└── config.py           # Configuration management

This structure separates concerns cleanly: the search engine (search.py) is decoupled from the proving orchestration (prover.py), the LLM adapters are independent of the Lean backends, and the two proving tracks (formal Lean, informal TeX) occupy distinct subpackages [README]. The benchmark infrastructure is co-located rather than relegated to test utilities, reflecting its status as a first-class product feature [README].

63.2.3 Sequence Diagram: Proof Search End-to-End

CLI Prover Search Autop/LLM Lean Backend Trace Log prove(file, theorem) load file, get initial proof state root state (goals) init frontier loop [until solved / budget exhausted / frontier empty] try autop (aesop, simp, ...) execute tactic result (success/fail) [if autop fails] retrieve premises (top-k) LLM: propose k tactics execute each tactic results (success/fail per tactic) [if fail → repair loop, bounded] enqueue new states (beam capped) log step to JSONL trace check: solved? budget? frontier? proof / failure + trace path

Figure 63.2. Sequence diagram for proof search. All interactions are [README]-sourced. Latencies are not measured; the documentation notes that Lean execution time dominates wall-clock cost.

63.2.4 Artifact Inventory

Since direct repository audit was not performed, the following table lists observable artifacts as described in the source analysis document:

ArtifactLocation / NameEvidence Source
PyPI packageulam-prover v0.2.11[README]
Homebrew tapulamai/ulamai[README]
CLI entry pointulam = ulam.cli:main[README]
Configuration file.ulam/config.json[README]
Proof tracerun.jsonl[README]
Replay metadata sidecar*.meta.json[README]
Formalization artifactsruns/formalize_YYYYMMDD_HHMMSS/[README]
TeX proving artifactsruns/prove_tex/tex_YYYYMMDD_HHMMSS/[README]
Premise index.ulam/premises_*.jsonl[README]
Benchmark reportsJSON + Markdown files[README]
Build systemsetuptools ≥68 + wheel[README]
LicenseMIT[README]

63.3 Core Algorithms

63.3.0 Verification Matrix

Algorithm / MechanismClaimEvidence SourceArtifactConfidence
Best-first search with beam cappingPriority queue with configurable beam width[README]search.py (described)High
Transposition tableState deduplication via canonical hashing[README]search.py (described)High
Automation tactic fallbacks (autop)aesop, simp, linarith, ring, norm_num tried before LLM[README]prover.py (described)High
Repair loopError feedback → LLM fix → re-execute (bounded)[README]prover.py (described)High
Retrieval-augmented premise selectionToken-overlap + embedding-based retrieval[README]retriever.py (described)High
TeX planner-action loopClaim decomposition, worker drafts, judge/verifier gates[README]tex/ subpackage (described)Medium-High
Autoformalization pipelineFive-phase TeX→Lean with fixpoint repair[README]formalize/ subpackage (described)Medium-High
Parity gate systemNon-regression gates for benchmark comparison[README]bench/compare.py (described)High
Step + state cachingTwo-level cache to avoid redundant Lean calls[README]prover.py, search.py (described)High
SFT/RL training loopsPlanned v0.3–v0.5, not implemented[README]Roadmap onlyAspirational

63.3.1 Best-First Search with Beam Capping and Transposition Table

The core proving mechanism is a best-first search over the space of Lean 4 proof states [README]. At each step, the search engine selects the highest-priority state from a bounded frontier (beam-capped), expands it by generating candidate tactics via the LLM, executes each tactic through the Lean backend, and enqueues resulting successor states [README]. A transposition table prevents re-exploration of states already visited via different tactic paths [README].

The documentation explicitly motivates this choice over Monte Carlo Tree Search (MCTS): "Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later" [README]. This is a pragmatic design decision—the bottleneck is the formal verification step, not the search branching factor.

$$s^* = \arg\max_{s \in \mathcal{F}} \; h(s) \quad \text{subject to} \quad |\mathcal{F}| \leq B$$
SymbolMeaningDomain
\(s^*\)Next state selected for expansionProof state
\(\mathcal{F}\)Search frontier (priority queue)Set of proof states
\(h(s)\)Heuristic score for state \(s\)\(\mathbb{R}\)
\(B\)Beam width (maximum frontier size)\(\mathbb{Z}^+\)

[Author-derived formalization of documented search behavior — the exact scoring function \(h(s)\) is described as heuristic and not learned; specific details of the heuristic are not publicly documented.]

Worked Example. Suppose beam width \(B = 100\), and the frontier contains 100 states with scores ranging from 0.65 to 0.95. The search selects \(s^*\) with \(h(s^*) = 0.95\). The LLM proposes \(k = 8\) tactics. Suppose 3 succeed in Lean, producing states with scores 0.91, 0.88, and 0.72. Before insertion, the transposition table is consulted: if state with score 0.91 has key matching an existing entry with score 0.93, it is pruned (no improvement). The remaining 2 new states are inserted; the lowest-scoring state in the frontier (0.65) is evicted to maintain \(|\mathcal{F}| \leq 100\).

Analogy to classical beam search: UlamAI's search preserves beam search's bounded-memory property and greedy expansion order, but adds a transposition table—an element drawn from game-tree search (e.g., chess engines) rather than NLP beam decoding. The analogy breaks down in that proof states lack the sequential left-to-right structure of text generation; states can be reached via many tactic orderings, making deduplication both more valuable and harder to implement correctly [README: "Better state canonicalization (stable hashing)" listed as roadmap item].

# Pseudocode — reconstructed from README documentation, not repo-verified
# Proof search main loop

def search_loop(initial_state, lean_backend, llm, retriever, config):
    frontier = PriorityQueue(max_size=config.beam_width)
    frontier.push(initial_state, score=heuristic(initial_state))
    transposition_table = {}  # state_key → best_seen_score
    step_cache = {}           # (state_key, tactic) → result
    trace = []

    while not frontier.empty() and not budget_exhausted():
        state = frontier.pop_best()

        # Phase 1: Try automation tactics before LLM
        for auto_tactic in ["aesop", "simp", "linarith", "ring", "norm_num"]:
            result = lean_backend.execute(state, auto_tactic)
            if result.is_proof_complete:
                return ProofFound(result, trace)

        # Phase 2: Retrieve premises and query LLM
        premises = retriever.retrieve(state, top_k=config.retriever_k)
        candidate_tactics = llm.propose_tactics(state, premises, k=config.gen_k)

        for tactic in candidate_tactics:
            cache_key = (state.canonical_key(), tactic)
            if cache_key in step_cache:
                result = step_cache[cache_key]
            else:
                result = lean_backend.execute(state, tactic)
                step_cache[cache_key] = result

            if result.success:
                new_state = result.new_state
                new_score = heuristic(new_state)
                state_key = new_state.canonical_key()

                # Transposition check
                if state_key in transposition_table:
                    if new_score <= transposition_table[state_key]:
                        continue  # Prune: already seen with better score
                transposition_table[state_key] = new_score

                if result.is_proof_complete:
                    return ProofFound(result, trace)
                frontier.push(new_state, score=new_score)
            else:
                # Repair loop (bounded)
                repaired = llm.repair(tactic, result.error, state, retries=config.repair_retries)
                if repaired and repaired.success:
                    frontier.push(repaired.new_state, score=heuristic(repaired.new_state))

        trace.append(StepRecord(state, candidate_tactics, results))

    return ProofNotFound(trace)

63.3.2 Automation Tactic Fallbacks (autop)

Before every LLM query, UlamAI tries a cascade of five Lean built-in automation tactics: aesop, simp, linarith, ring, and norm_num [README]. The rationale is that many subgoals that appear difficult to a language model are trivial for Lean's decision procedures. This saves LLM API tokens, search budget, and wall-clock time [README].

The automation-first design has a significant implication for cost: if a substantial fraction of subgoals in a typical proof can be closed by built-in tactics, the LLM is only queried for genuinely novel reasoning steps. This is analogous to a hybrid strategy in evolutionary search where cheap local optimization is applied before invoking expensive global operators.

63.3.3 Retrieval-Augmented Premise Selection

Two retrieval modes are documented [README]:

ModeMechanismSpeedDependencies
Token-overlapBM25-style keyword matchingFastNone
Embedding-basedVector similarity searchModerateEmbedding API

The retrieval index is built via ulam index build, which scans local project declarations and/or Mathlib (100,000+ lemmas) to produce a JSONL premises file [README]. The source documentation emphasizes that retrieval is critical because "without retrieval, the model must rely on memorized knowledge—knowledge that degrades with model size, quantization, and the ever-growing surface area of Mathlib itself" [README].

63.3.4 Repair Loop

When a tactic fails, UlamAI feeds Lean's error message back to the LLM for a targeted repair attempt [README]. Lean's type error messages are precise (missing argument, type mismatch, unknown identifier), enabling the model to make focused corrections rather than regenerating from scratch. The repair loop is bounded to prevent infinite retry cycles [README]. This error-feedback mechanism is a form of guided local search—each failed attempt constrains the search space for the next attempt.

63.3.5 TeX Proving Pipeline (Informal Proofs)

The --output-format tex pipeline implements a planner-action loop that is architecturally distinct from the formal search track [README]. The planner selects actions from a fixed vocabulary: plan (decompose into claim graph), solve (workers draft claims), write_memory (update WHITEBOARD or repo), compose (assemble final proof), and give_up (exhaust budget) [README].

Workers draft each claim and submit it to a judge plus adversarial verifier and domain checker [README]. The system maintains persistent memory via WHITEBOARD.md and named repo/ items, accessible via [[wikilink]] expansion in prompts [README]. This pattern closely mirrors the planner-worker-whiteboard architecture described in DeepMind's Aletheia and replicated in OpenProver [README].

The v0.2.11 flow begins with a single deep whole-theorem attempt using the primary model; if this fails, the system falls back to the planner/worker claim decomposition [README]. Configuration knobs include --tex-rounds, --tex-worker-drafts, --tex-concurrency, --tex-judge-repairs, --tex-replan-passes, --tex-action-steps, and separate model overrides for planner and worker roles [README].

63.3.6 Autoformalization Pipeline (TeX → Lean)

The formalization pipeline is a five-phase process [README]:

  1. Phase A — Ingest + Segment: Parse LaTeX into logical chunks (definitions, lemmas, theorems, proof sketches); extract named entities; build dependency graph.
  2. Phase B — Statement Drafting: Generate Lean skeleton with sorry proofs; iteratively repair until the file typechecks.
  3. Phase C — Proof Reconstruction: Open each sorry goal; run proof search; if failing, retrieve similar lemmas, reformulate, or generate auxiliary lemmas.
  4. Phase D — Semantic Alignment: A fixpoint loop—LLM judges equivalence between Lean statement and informal statement; mismatches enqueue repair tasks; statement changes invalidate downstream proofs recursively.
  5. Phase E — Consolidation: Produce final .lean file; record success rates, remaining sorrys, error traces.

This pipeline is configurable via --max-rounds and --max-proof-rounds [README]. The fixpoint structure in Phase D is notable: it represents a recursive repair process where semantic drift between informal and formal representations is iteratively corrected.

63.3.7 Benchmark Parity Gate System

The bench-compare command implements a non-regression gate that can be used in CI or development workflows [README]:

# Pseudocode — reconstructed from README documentation
# Benchmark comparison with parity gate

python3 -m ulam bench-compare \
  --a runs/bench/a.json --b runs/bench/b.json \
  --gate \
  --max-solved-drop 0 \
  --max-success-rate-drop 0 \
  --max-semantic-pass-rate-drop 0 \
  --max-regression-rejection-rate-increase 0 \
  --max-median-time-increase-pct 25

The gate checks comparability (suite SHA256 matching, inference profile matching) before evaluating regression thresholds [README]. This implements a formal fitness evaluation for system changes: a commit passes the gate if and only if it does not degrade performance beyond configured tolerances. Available threshold parameters include solved-drop, success-rate-drop, semantic-pass-rate-drop, regression-rejection-rate-increase, median-time-increase-pct, planner-replan-triggers-increase, and planner-cached-tactic-tries-drop [README].

63.4 Key Results

63.4.1 Evaluation Caveats

No public benchmark scores have been reported for UlamAI as of the source analysis date (April 2026) [README]. The system has invested heavily in benchmarking infrastructure—suite management, report generation, cross-report comparison with parity gates, campaign scripts, and miniF2F integration—but specific numerical results (solve rates, timing, token usage) are not publicly available. The following gaps are notable:

  • No absolute baseline comparisons: No published solve rate on miniF2F or any standard theorem-proving benchmark.
  • No seed/run counts: Neither the number of independent runs nor variance across runs is reported.
  • No compute budget reporting: Wall-clock time, token usage, and API costs per benchmark run are not disclosed.
  • No independent reproduction: No third-party evaluation of UlamAI's proving capabilities has been identified.
  • No matched-budget comparisons: It is not possible to compare UlamAI against other systems under controlled conditions.

63.4.2 Capability Claims

In the absence of quantitative benchmarks, the following capability claims are drawn from the documentation:

CapabilityClaimEvidenceStatus
End-to-end formal proving Produces machine-checked Lean 4 proofs for textbook theorems (e.g., irrationality of √2) [README] Described
Multi-backend operation Works with LeanDojo, CLI, and experimental LSP backends [README] Described
Multi-provider LLM support Operates with 7+ LLM providers including Codex, Claude, Gemini, Ollama [README] Described
Autoformalization TeX-to-Lean pipeline with iterative repair and semantic alignment [README] Described
Informal TeX proving Planner-action loop with claim decomposition, judge/verifier, whiteboard memory [README] Described
Benchmark infrastructure Suite management, parity gates, campaign automation, miniF2F integration [README] Described
Deterministic replay Re-execution of recorded proof traces with metadata validation [README] Described

No independent quantitative evidence is available for any of these claims. The system's documentation is detailed and internally consistent, but all capability statements should be treated as self-reported.

63.5 Implementation & Cost

63.5.1 Resource Requirements

ResourceMinimumRecommendedSource
Python3.10+3.11+[README]
RAM4 GB8+ GB[README]
Disk~200 MB (base)~2 GB (with Lean + Mathlib)[README]
NetworkRequired for API models[README]
Lean 4 + MathlibRequired for formal proofsVia ulam -lean[README]

63.5.2 Cost Drivers

Cost ComponentDriverControl MechanismSource
LLM tactic proposals\(k\) candidates per state × states exploredgen_k, beam cap, timeout[README]
LLM repair attemptsFailed tactics × repair retriesBounded repair loop[README]
Embedding callsPremise retrieval queriesRetriever mode selection[README]
Lean executionTactic compilation per candidateexec_k, step cache[README]
AutoformalizationMultiple rounds × LLM drafts + repairs--max-rounds, --max-proof-rounds[README]
TeX provingPlanner steps × worker drafts--tex-rounds, --tex-worker-drafts[README]

63.5.3 Cost Optimization Strategies

Six documented cost optimization mechanisms [README]:

  1. Automation-first: Five built-in Lean tactics tried before LLM query on every state, eliminating API calls for trivially solvable subgoals.
  2. Step cache: (state_key, tactic) → result mapping avoids duplicate Lean compilation.
  3. Transposition table: state_key → best_seen_score prevents re-exploring equivalent states.
  4. Beam capping: Bounded frontier size prevents unbounded search expansion.
  5. Budget tracking: Time or token budgets with automatic conclusion phase.
  6. Inference profiles: Pre-configured trade-offs (fast, balanced, strict) between cost and thoroughness.
[INFERRED — cost estimate not repo-verified] No actual cost figures (tokens consumed, API dollars spent, wall-clock runtime per proof) have been published. The cost optimization mechanisms are documented but their empirical impact is not quantified. A typical proof search with gen_k=8 and a beam width of 100 would require at most 800 LLM tactic-generation calls before frontier exhaustion, but actual costs depend heavily on the theorem's difficulty, the fraction of subgoals closed by automation fallbacks, and the cache hit rate—none of which are publicly reported.

63.5.4 Inference Profiles

Profilegen_kexec_kverify_levelUse CaseSource
fastLowLowMinimalCI smoke tests[README]
balancedMediumMediumStandardDefault benchmarking[README]
strictHighHighThoroughPublication-quality[README]

Exact numeric values for gen_k, exec_k, and verify_level within each profile are not publicly documented [README]. The documentation states that gen_k ranges from 8–16 in general usage [README].

63.6 Reproducibility

63.6.1 Reproducibility Assessment

RequirementStatusNotesSource
Code publicly releasedMIT license, GitHub + PyPI + Homebrew[README]
Config files available.ulam/config.json auto-generated[README]
Pretrained weights / checkpointsN/AUses external LLM APIs, no custom model weights[README]
Documented entry point or run commandMultiple CLI commands documented with examples[README]
Compute requirements statedPartialHardware minimums stated; no per-run costs[README]
Seeds and run counts reportedRandom seed is configurable but no reported experiments[README]
Independent reproduction attemptedNo third-party evaluation identified[README]

63.6.2 Trace and Replay Infrastructure

UlamAI implements comprehensive trace logging and deterministic replay [README]. Every proof search run produces a JSONL trace file logging proof state, proposed tactic, Lean response, scoring decision, and wall-clock timing per step [README]. Replay is supported via ulam replay run.jsonl for inspection and ulam replay run.jsonl --execute --strict for deterministic re-execution [README].

Strict replay mode validates against a metadata sidecar (*.meta.json) containing Lean backend type, project path, toolchain version, Mathlib commit/revision, file hash, and run configuration [README]. This level of environment pinning is among the most thorough identified in this survey's coverage of harness systems.

63.6.3 Step-by-Step Verification Path

Based on the documentation, an external researcher would need to:

  1. Install UlamAI: pip install ulam-prover or brew tap ulamai/ulamai && brew install ulamai [README].
  2. Set up Lean 4 environment: ulam -lean (installs elan, Lean toolchain, LeanDojo/Pantograph) [README].
  3. Configure an LLM provider: ulam auth codex or set ULAM_OPENAI_API_KEY [README].
  4. Run a smoke test: ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke --prove-mode tactic --lean dojo [README].
  5. Inspect artifacts: check run.jsonl for trace, verify the theorem typechecks in Lean.
  6. For benchmarking: ulam bench --suite internal_regression --inference-profile balanced [README].

What would constitute successful reproduction: A complete JSONL trace showing the search tree, with the final tactic sequence producing a Lean 4 proof that typechecks without sorry or additional axioms. The ulam checkpoint command provides a read-only health check of the resulting file [README].

Key uncertainty: The LeanDojo/Pantograph version alignment is described as fragile [README]. Toolchain mismatches between Lean 4, Mathlib, and Pantograph are a known source of reproduction failures in the broader Lean ecosystem. The documentation acknowledges this and provides --use-mathlib-toolchain for explicit control [README].

63.7 Threats to Validity

This section consolidates validity concerns for UlamAI's documented capabilities and the analysis presented in this chapter.

ThreatDescriptionSeverity
Absence of quantitative evaluation No public benchmark scores exist despite extensive benchmarking infrastructure. All capability claims are self-reported from documentation. It is not possible to assess the system's actual theorem-proving performance or compare it to alternatives. High
No independent reproduction No third-party evaluation has been identified. The system's claims rest entirely on developer documentation. High
Repository not independently audited This chapter's analysis is based on a source analysis document rather than direct commit-pinned repository inspection. File paths, module names, and architectural claims could not be independently verified against actual code. Medium
Early-stage maturity At v0.2.11 with 91 commits over ~2 months, the system is in rapid early development. APIs, behaviors, and architectural decisions may change significantly. Medium
Lean toolchain sensitivity LeanDojo/Pantograph version alignment is described as fragile. Reproduction across different environments may fail due to toolchain mismatches rather than system issues. Medium
Reviewer circularity The TeX proving pipeline uses LLM judges and adversarial verifiers—models evaluating model output. The formal Lean track avoids this (Lean kernel is ground truth), but informal TeX proofs are not machine-checked. Medium (TeX track only)
Aspirational features in roadmap The v0.3–v0.5 learning roadmap (SFT, value model, RL) is explicitly planned but not implemented. These should not be conflated with current capabilities. Low (clearly labeled)
Comparison budget mismatch No controlled comparisons against other systems exist. Any comparison table in this chapter (or the source analysis) reflects documented feature sets, not matched-budget empirical evaluation. Medium

63.8 Limitations & Open Questions

63.8.1 Documented Limitations

The source analysis identifies the following limitations, most of which are acknowledged in the documentation [README]:

  • No cross-run learning: Each proof attempt starts from scratch. The planned SFT pipeline (v0.3) would address this, but it is not implemented.
  • No literature search: Unlike OpenProver, UlamAI cannot consult web resources during proving.
  • No learned value model: Search scoring is heuristic, not trained. The documentation lists "better scoring heuristics (non-RL)" as a v0.2 goal.
  • Single-threaded proof search: No parallel rollouts are documented; parallelization is planned for v0.4.
  • Experimental LSP backend: The Lean language server backend is not at parity with LeanDojo.
  • State canonicalization: "Better state canonicalization (stable hashing)" is listed as a roadmap item, implying the current implementation may have false negatives in the transposition table.

63.8.2 Open Questions

  1. What is the actual solve rate? The most critical unknown. Without benchmark numbers, the system's practical utility for mathematical research cannot be assessed.
  2. How effective are automation fallbacks? What fraction of subgoals in typical proofs are closed by autop versus LLM? This ratio has significant cost implications.
  3. Does the transposition table provide meaningful savings? If proof-state canonicalization is imperfect, the table may miss many equivalent states.
  4. How does the system scale with proof difficulty? The search budget (beam width × depth × gen_k) grows multiplicatively. For hard theorems, is the system budget-constrained before it is capability-constrained?
  5. Can the planned SFT loop close the cross-run learning gap? The trace format contains (state, tactic, outcome) tuples suitable for supervised learning, but the quality of this training signal depends on the distribution of successful proofs.
[INFERRED — survey author speculation] The gap between UlamAI's engineering maturity (extensive benchmarking infrastructure, multiple distribution channels, professional documentation) and its empirical maturity (no published results) is striking. One plausible interpretation is that the team is building evaluation infrastructure first and plans to release results once the system reaches a competitive threshold. Another is that performance is still below publishable levels. Without data, both interpretations remain speculative.

63.9 Survey Positioning

63.9.1 Comparison with Related Systems

UlamAI's positioning within the theorem-proving landscape is clarified by comparison with three related systems. Note that all comparisons below are based on documented feature sets, not matched-budget empirical evaluation.

CapabilityUlamAIOpenProverLeanCopilotEvidence Basis
Search strategyBest-first/beam with transpositionLLM-directed planningOne-shot tactic suggestion[README] for all
Formal verificationLean 4 (3 backends)Lean 4 (CLI only)Lean 4 (LeanDojo)[README]
Informal provingTeX planner-action loopWhiteboard-basedNot supported[README]
AutoformalizationTeX → Lean pipelineMarkdown → LeanNot supported[README]
Literature searchNot supportedWeb-enabledNot supported[README]
Multi-model support7+ providers3 providers1 provider[README]
Benchmark infrastructureExtensive (parity gates, campaigns)None documentedBasic[README]
Automation fallbacks5 Lean tacticsNot documentedPartial[README]
Open sourceYes (MIT)YesYes[README]
Published benchmark resultsNoneLimitedYes[README]

63.9.2 Proof-as-Search vs. Proof-as-Planning: A Correspondence Table

UlamAI's dual-track architecture embodies two distinct paradigms for AI-driven mathematical proof discovery:

Conceptual ElementProof-as-Search (Lean Track)Proof-as-Planning (TeX Track)
State spaceLean proof states (goals + hypotheses)Claim graph + whiteboard memory
Action spaceLean tacticsPlanner actions (plan, solve, compose, ...)
PolicyLLM tactic generatorLLM planner + worker drafts
EvaluationLean kernel (binary: typechecks or not)LLM judge + adversarial verifier
MemoryStep cache + transposition tableWHITEBOARD.md + repo/ items
TerminationAll goals closed / budget exhaustedProof composed / budget exhausted
Ground truthFormal: Lean kernel verificationInformal: no machine-checkable guarantee

Where the analogy breaks down: The search framing works well for the formal Lean track because proof states form a well-defined graph with a binary success criterion (typechecks vs. does not). The planning framing applies to the TeX track because informal proof construction is open-ended—there is no formal verifier, so evaluation is necessarily approximate and multi-faceted (judge, verifier, domain checker). The two tracks share LLM infrastructure but operate on fundamentally different state spaces with different evaluation guarantees. Conflating them would obscure the critical difference: the Lean track cannot produce a false proof; the TeX track can.

63.9.3 Evolutionary Self-Improvement Correspondence

Although UlamAI does not currently implement self-improvement, its architecture and roadmap map onto evolutionary computation concepts:

Evolutionary ConceptUlamAI ComponentStatus
Individual / candidateTactic sequence (proof attempt)Implemented
PopulationSearch frontier (beam-capped)Implemented
Fitness functionLean verification (binary) + heuristic scoreImplemented
SelectionBest-first priority selectionImplemented
MutationLLM tactic proposal + repair loopImplemented
Environmental pressureBenchmark parity gatesImplemented
Accumulated experienceJSONL trace archiveImplemented (data only)
Self-improvement (SFT)Train on successful tracesPlanned (v0.3)
Learned evaluation (value model)Replace heuristic scoring with learned modelPlanned (v0.4)
On-policy RLPPO/GRPO-style fine-tuningPlanned (v0.5)

Where the evolutionary analogy breaks down: In a true evolutionary system, the population persists and improves across generations. UlamAI currently has no cross-run learning—each proof attempt starts from scratch with the same base LLM. The search frontier within a single run is analogous to a single-generation population, not a multi-generational evolutionary process. The planned SFT pipeline would partially close this gap by training improved models on accumulated traces, but even then, the "evolution" would occur at the model level (via fine-tuning) rather than at the program level (via code mutation). The system is better described as architecturally prepared for evolutionary dynamics than as currently exhibiting them.

Key Contribution

UlamAI's primary contribution to the harness systems landscape is its combination of systematic search-based proof discovery with extensive benchmarking infrastructure [README]. The parity gate system, inference profiles, campaign scripts, and cross-report comparison tools represent a level of engineering discipline around evaluation that is uncommon among research theorem provers. Its secondary contribution is the dual-track architecture (formal search + informal planning) that explicitly acknowledges the complementary strengths of systematic search at the tactic level and strategic planning at the proof-structure level [README].

The connection to Stanisław Ulam is more than nominal: Ulam's Monte Carlo method introduced the idea of using randomized sampling to solve deterministic problems—exactly the paradigm that UlamAI applies when using stochastic LLM outputs as the policy in a deterministic proof-state search [README]. The system's planned progression toward self-reinforcing dynamics (SFT → value model → RL) also echoes the broader theme of computational systems that improve through accumulated experience, a theme central to Ulam and von Neumann's work on self-reproducing automata.

63.10 Summary

Key takeaway: UlamAI Prover is an architecturally distinctive theorem-proving system that combines best-first search over Lean 4 proof states with LLM tactic generation, automation fallbacks, retrieval-augmented premise selection, and a dual-track (formal + informal) proving architecture. Its benchmarking infrastructure—including parity gates, inference profiles, and campaign automation—is among the most developed identified in this survey's coverage of open-source theorem provers.

Main contribution: The integration of search-based proof discovery with first-class benchmarking infrastructure in a single open-source CLI tool, demonstrating that systematic evaluation methodology can be co-developed with proving capability rather than added as an afterthought.

Most important gap for future researchers: The absence of published benchmark results is the single largest barrier to assessing UlamAI's significance. A future researcher should reproduce the system's performance on miniF2F or a comparable standard benchmark, report solve rates with seed counts and confidence intervals, and compare against at least one baseline system under matched compute budgets. Until such evidence exists, UlamAI's contribution is primarily architectural and methodological rather than empirically demonstrated.