UlamAI

Truth-first, reproducible, open-source Lean 4 theorem prover CLI combining LLM-guided reasoning with formal verification and best-first search

Organization: Ulam Labs (Przemek Chojecki et al.)
Published: February 2026 (v0.1); v0.2.11, March 2026 (current)
Type: Open-source repository + PyPI package + Homebrew tap
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: UlamAI Prover — A Truth-First, Reproducible, Open-Source Lean 4 Theorem Prover CLI

Repository: github.com/ulamai/ulamai

PyPI Package: ulam-prover v0.2.11

Homebrew: brew tap ulamai/ulamai && brew install ulamai

Blog Post: UlamAI Prover: An Open Lean 4 Theorem Prover (Ulam Blog)

Medium Articles:
- UlamAI — An Open-Source Theorem Prover and Formalized in Lean 4 (Feb 2026)
- UlamAI Prover: Teaching AI to Prove Math Theorems (Mar 2026)

Related Project: UnsolvedMath — a benchmark of 1,146 open mathematical problems, designed as the natural complement to UlamAI Prover (problems to attempt ↔ machinery for attempting them).

Naming: Named after Stanisław Ulam (1909–1984), the Polish-American mathematician who co-invented the Monte Carlo method, contributed to the hydrogen bomb design, and proposed the concept of cellular automata with John von Neumann.

License: MIT

2 Authors and Team

Lead / Founder: Przemek Chojecki — CEO of Ulam Labs, author of the Medium articles and blog posts. PhD in mathematics with research background in number theory and algebraic geometry.

Organization: Ulam Labs — a company focused on mathematical intelligence and AI for mathematics.

Repository Activity:

| Metric | Value |
| --- | --- |
| Commits | 91 |
| Stars | 111+ |
| Forks | 6 |
| Primary Language | Python |
| License | MIT |
| Release Cadence | Rapid iteration (v0.1 → v0.2.11 in ~2 months) |

Team Context: Ulam Labs positions itself at the intersection of AI and pure mathematics. The company maintains both the UlamAI Prover (tool for proving) and UnsolvedMath (benchmark of 1,146 open problems), creating a closed loop between challenge definition and solving infrastructure.

Key Differences from Solo Projects

Unlike many open-source theorem provers that are single-developer efforts, UlamAI is backed by a company with a clear product vision. This is reflected in:
- Professional documentation and tutorials
- Homebrew distribution (unusual for research tools)
- Colab notebooks for easy onboarding
- Automated CI/CD with Homebrew tap updates
- Connection to a commercial benchmark product (UnsolvedMath)

3 Core Contribution

Key Novelty: UlamAI is the first open-source system to combine best-first search over Lean 4 proof states, multi-provider LLM tactic generation, retrieval-augmented premise selection, automated repair loops, and a full LaTeX-to-Lean autoformalization pipeline in a single CLI tool — with reproducible benchmarking and trace logging built in from day one.

What Makes UlamAI Distinctive

  1. Search-Based Proof Discovery: Unlike one-shot LLM proof generation, UlamAI implements best-first search with beam capping over the space of proof states. At each state, the LLM proposes 8–16 candidate tactics. Successful applications expand the search tree; failures are pruned. A transposition table prevents re-exploring visited states. This turns theorem proving from "hope the LLM gets it right" into systematic exploration.

  2. Three Proof Modes: Tactic mode (LLM proposes next tactic, Lean checks each step), Lemma mode (LLM drafts a lemma plan, each proved sequentially), and LLM-only mode (LLM rewrites entire Lean file, CLI typechecks). Each mode represents a different trade-off between granularity and autonomy.

  3. Three Lean Backends: LeanDojo (Pantograph server for goal-state access), CLI (lake env lean for batch typechecking), and experimental LSP (Lean language server for interactive diagnostics). This separation of concerns lets the system work with varying levels of Lean toolchain sophistication.

  4. Dual-Track Proving — Lean + TeX: UlamAI maintains two independent proving pipelines:

     - Lean track: Formal, machine-checked proofs via tactic search
     - TeX track: Informal natural-language proofs via a bounded planner-action loop with claim graphs, whiteboard memory, judge/verifier gates, and composable worker drafts. The TeX track is not just "ask the LLM to write a proof" — it is a structured orchestration system with its own planning, judging, and backtracking logic.

  5. Full Autoformalization Pipeline (TeX → Lean): A multi-phase pipeline that segments LaTeX documents, drafts Lean skeletons, iteratively repairs type errors, fills proofs via search, and validates semantic equivalence between informal and formal statements. This is treated as a fixpoint loop, not a one-shot translation.

  6. Benchmark Infrastructure: Built-in suite management (ulam bench), report generation, cross-report comparison with parity gates, regression suite builders, and campaign scripts. Benchmarking is not an afterthought — it's a first-class product feature.

  7. Automation Tactic Fallbacks (autop): For every subgoal, UlamAI tries aesop, simp, linarith, ring, and norm_num before querying the LLM. Many subgoals that look hard to a language model are trivial for Lean's built-in automation, saving tokens and search budget.

Relationship to Prior Systems

| System | Year | Architecture | Proof Strategy | Formal Backend | Open Source |
| --- | --- | --- | --- | --- | --- |
| LeanCopilot | 2023 | Single-agent tactic suggestion | One-shot | Lean 4 via LeanDojo | Yes |
| ReProver | 2023 | Retrieval + generation | Single-step | Lean 4 | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Multi-step | Lean via LeanDojo | Partial |
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Multi-step, informal-first | Extraction protocols | No |
| OpenProver | 2025 | Planner-worker-whiteboard | Informal + formal dual-track | Lean 4 via lake env lean | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Best-first/beam search | Lean 4 (LeanDojo/CLI/LSP) | Yes |

Core Design Philosophy

"Trust Lean, not the model."

This single principle drives the entire architecture. The LLM is a heuristic guide — creative, fast, often wrong. Lean is the ground truth. UlamAI cannot produce a false proof. It can fail to find a proof, but it cannot claim to have found one that doesn't typecheck.

4 Supported Solutions

Solution Types

| Solution Type | Description | Pipeline |
| --- | --- | --- |
| Formal Lean Proof | Machine-checked tactic proofs compiled against Mathlib | Prove pipeline (tactic/lemma mode) |
| LLM-Rewritten Lean File | Full Lean file rewritten by LLM, batch typechecked | Prove pipeline (LLM mode) |
| Informal TeX Proof | Structured natural-language proof with claim decomposition | Prove pipeline (TeX output format) |
| Autoformalization | LaTeX → Lean translation with iterative repair | Formalize pipeline |
| Proof Replay | Deterministic re-execution of a recorded proof trace | Replay command |
| Proof Review | Next-step suggestions for a partially completed proof | Review command |
| Benchmark Report | Solve-rate analysis across theorem suites | Bench command |

Prove Output Formats

| Format | Description | Verification |
| --- | --- | --- |
| lean (default) | Machine-checked Lean 4 proof via tactic search | Lean kernel verification at each step |
| tex | Informal proof via bounded planner-action loop | Judge + adversarial verifier + domain checks |

Proof Modes (for Lean output)

| Mode | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| tactic | LLM proposes next tactic; Lean checks each step | Fast feedback, goal-state aware | Brittle for long chains |
| lemma | LLM drafts a lemma plan; each lemma proved sequentially | Decomposes big proofs | Depends on plan quality |
| llm | LLM rewrites full Lean file; CLI typechecks | Handles multi-step edits, no Dojo required | Slower, less goal-state guidance |

Lean Backends

| Backend | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| dojo | Pantograph/LeanDojo server | Goal-state access, tactic execution | Extra install, toolchain sensitivity |
| cli | lake env lean typecheck | Simple, works with any toolchain | No goal-state feedback |
| lsp | Lean language server | Goal/diagnostic loop without Pantograph | Experimental, search parity pending |

Solver Strategies

| Strategy | Behavior |
| --- | --- |
| search | Best-first tactic search (default) |
| script | Sequential script mode (deterministic tactic sequence) |
| portfolio | Script-first, then best-first fallback |
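The portfolio strategy's control flow can be sketched in a few lines; `run_script` and `run_search` are hypothetical callables standing in for the two solver modes, not UlamAI's actual interfaces:

```python
from typing import Callable, Optional

def portfolio_solve(
    run_script: Callable[[str], Optional[str]],
    run_search: Callable[[str], Optional[str]],
    theorem: str,
) -> Optional[str]:
    """Portfolio sketch: try the cheap deterministic script mode first,
    then fall back to best-first search if no proof was found."""
    proof = run_script(theorem)   # deterministic tactic sequence
    if proof is not None:
        return proof
    return run_search(theorem)    # expensive search fallback
```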

5 LLM Integration

Multi-Provider Support

UlamAI supports a broad range of LLM providers through a unified adapter layer:

| Provider | Auth Mechanism | Default Model | Notes |
| --- | --- | --- | --- |
| OpenAI-compatible | ULAM_OPENAI_API_KEY | gpt-4.1 | Base URL configurable for custom endpoints |
| Codex CLI | codex login / ulam auth codex | gpt-5.2-codex | No API key needed; subscription-based |
| Anthropic Claude | ULAM_ANTHROPIC_API_KEY | claude-3-5-sonnet | API or Claude Code CLI |
| Claude Code CLI | ulam auth claude / claude setup-token | | No API key; subscription-based |
| Gemini API | ULAM_GEMINI_API_KEY | gemini-3.1-pro-preview | OpenAI-compatible endpoint format |
| Gemini CLI | OAuth2 auto-discovery | | Subscription/login based |
| Ollama | Local server | llama3.1 | Fully local, no data leaving machine |
| Custom API | TUI configuration | | Mistral, Z.AI, Qwen, DeepSeek, Kimi |

The documentation recommends gpt-5.2-codex or gpt-5.3-codex as the primary proving models, suggesting these Codex-series models have been found effective for tactic generation in practice.

Embedding Support (for Retrieval)

| Setting | Variable | Default |
| --- | --- | --- |
| API Key | ULAM_EMBED_API_KEY | Falls back to ULAM_OPENAI_API_KEY |
| Base URL | ULAM_EMBED_BASE_URL | Falls back to ULAM_OPENAI_BASE_URL |
| Model | ULAM_EMBED_MODEL | text-embedding-3-small |
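The fallback chain above can be sketched with nothing beyond the standard library; `resolve_embed_settings` is an illustrative helper, not part of the package:

```python
import os

def resolve_embed_settings() -> dict:
    """Sketch of the documented fallback chain for embedding settings:
    embed-specific variables win, otherwise the OpenAI variables are
    reused, and the model name has a fixed default."""
    return {
        "api_key": os.getenv("ULAM_EMBED_API_KEY") or os.getenv("ULAM_OPENAI_API_KEY"),
        "base_url": os.getenv("ULAM_EMBED_BASE_URL") or os.getenv("ULAM_OPENAI_BASE_URL"),
        "model": os.getenv("ULAM_EMBED_MODEL", "text-embedding-3-small"),
    }
```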

TeX Proving: Split Planner/Worker Models

The informal TeX proving pipeline supports independent model assignment for different roles:

┌─────────────────────────────────────────┐
│      TeX Proving Model Hierarchy        │
│                                         │
│  --tex-planner-model gpt-5.4           │
│    ↓ Planning, judging, verification,   │
│    ↓ final composition                  │
│                                         │
│  --tex-worker-model gpt-5.4-mini       │
│    ↓ Claim-draft workers (serial or    │
│    ↓ concurrent evaluation)            │
└─────────────────────────────────────────┘

This mirrors the planner/worker model split seen in OpenProver but applied to the informal proving pipeline rather than the overall system architecture.

Per-State Search Loop

Within the prove pipeline, each state popped from the search frontier goes through the same sequence:

For each proof state in frontier:
    │
    ├── 1. Retrieve premises (top-k from Mathlib/local)
    │
    ├── 2. Try automation tactics first (aesop, simp, linarith, ring, norm_num)
    │      └── If solved → emit proof, stop
    │
    ├── 3. Prompt LLM for k candidate tactics (k = 8-16)
    │
    ├── 4. Execute each tactic in Lean via selected backend
    │      ├── Success → enqueue new state (beam capped)
    │      └── Failure → request LLM repair step, bounded retries
    │
    ├── 5. Update caches (step cache + state cache)
    │
    └── 6. Log everything to JSONL trace

6 Key Results

Reported Capabilities (Blog + README)

UlamAI's published materials focus on architectural capabilities rather than benchmark scores. Key claims:

  1. Working End-to-End Pipeline: The system produces machine-checked Lean 4 proofs for standard textbook theorems (irrationality of √2, etc.) and can attempt harder problems with extended budgets.

  2. Benchmark-Ready Infrastructure: Suite management, report generation, cross-report comparison with parity gates, and campaign scripts are all implemented. The system includes:

     - ulam bench for running suites
     - ulam bench-compare for cross-report analysis
     - ulam bench-validate for suite validation
     - ulam bench-make-minif2f for building miniF2F suites
     - ulam bench-make-regression100 for fixed-size regression suites
     - Campaign scripts with gated comparisons

  3. Formalization Pipeline: The TeX-to-Lean pipeline handles segmentation, statement drafting, proof reconstruction, and semantic alignment with iterative repair loops.

  4. Multi-Backend Flexibility: Proven to work with OpenAI, Anthropic, Gemini, and Ollama backends.

Benchmark Infrastructure (Not Results)

UlamAI has invested heavily in benchmarking infrastructure, even though specific numerical results are not publicly reported:

| Capability | Implementation |
| --- | --- |
| Suite management | bench-list-suites, bench-validate, fixed-suite builder |
| miniF2F integration | bench-make-minif2f from local checkout, split selection |
| Fixed regression suite | bench-make-regression100 with deterministic seed |
| Report generation | JSON + Markdown with dataset/split breakdowns, top tags |
| Cross-report comparison | bench-compare with automated parity gate |
| Parity gate thresholds | --max-solved-drop, --max-success-rate-drop, --max-semantic-pass-rate-drop, --max-regression-rejection-rate-increase, --max-median-time-increase-pct, --max-planner-replan-triggers-increase, --max-planner-cached-tactic-tries-drop |
| Comparability checks | Suite SHA256 matching, inference profile matching |
| Campaign automation | run_bench_campaign.sh with timestamped artifacts + env snapshots |
| Inference profiles | fast, balanced, strict with configurable gen_k, exec_k, verify_level |

Parity Gate System

The benchmark comparison system deserves special attention. When comparing two runs:

python3 -m ulam bench-compare \
  --a runs/bench/a.json --b runs/bench/b.json \
  --gate \
  --max-solved-drop 0 \
  --max-success-rate-drop 0 \
  --max-semantic-pass-rate-drop 0 \
  --max-regression-rejection-rate-increase 0 \
  --max-median-time-increase-pct 25

This implements a non-regression gate — ensuring that system changes don't degrade performance. The gate also checks comparability (same suite SHA256 and same inference profile/budgets), with explicit opt-out for intentional cross-profile comparisons.
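A minimal sketch of such a non-regression gate over three of the documented thresholds; the report field names (`solved`, `success_rate`, `median_time_s`) are assumptions for illustration, not UlamAI's actual report schema:

```python
def check_parity_gate(
    baseline: dict,
    candidate: dict,
    max_solved_drop: int = 0,
    max_success_rate_drop: float = 0.0,
    max_median_time_increase_pct: float = 25.0,
) -> list:
    """Compare a candidate benchmark report against a baseline.
    Returns a list of gate violations (empty list means the gate passes)."""
    violations = []
    if baseline["solved"] - candidate["solved"] > max_solved_drop:
        violations.append("solved-drop exceeded")
    if baseline["success_rate"] - candidate["success_rate"] > max_success_rate_drop:
        violations.append("success-rate-drop exceeded")
    base_time = baseline["median_time_s"]
    if base_time > 0:
        increase_pct = 100.0 * (candidate["median_time_s"] - base_time) / base_time
        if increase_pct > max_median_time_increase_pct:
            violations.append("median-time-increase exceeded")
    return violations
```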

Connection to UnsolvedMath

Ulam Labs released UnsolvedMath, a benchmark of 1,146 open mathematical problems. The stated plan is to run UlamAI Prover against formalized versions of these problems, starting with restricted cases. This would represent one of the most ambitious applications of automated theorem proving — applying search-based LLM proving to genuinely open questions.

7 Reproducibility

Trace Logging

Every proof search run produces a JSONL trace file (run.jsonl by default) logging each search step:

| Field | Content |
| --- | --- |
| Proof state | Current Lean goal state |
| Proposed tactic | LLM-generated tactic attempt |
| Lean response | Success/failure with error messages |
| Scoring decision | Best-first priority assignment |
| Timing | Wall-clock duration per step |
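Because the trace is plain JSONL, it can be consumed with the standard library alone. This illustrative `load_trace` helper assumes only one JSON object per line, not any particular field schema:

```python
import json
from pathlib import Path

def load_trace(path: str) -> list:
    """Read a JSONL search trace: one JSON object per search step.
    Blank lines are skipped; field names are left to the producer."""
    steps = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            steps.append(json.loads(line))
    return steps
```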

Replay System

# Basic replay (show the trace)
python3 -m ulam replay run.jsonl

# Deterministic replay (re-execute every tactic)
python3 -m ulam replay run.jsonl --execute --strict

Strict mode: Checks alignment against a replay metadata sidecar (*.meta.json) containing:
- Lean backend type
- Project path
- Toolchain version
- Mathlib commit/revision
- File hash
- Run configuration
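A sketch of what a strict-mode alignment check could look like; the sidecar key names below are illustrative stand-ins for the documented contents, not the real schema:

```python
import json

def check_replay_alignment(meta_path: str, current_env: dict) -> list:
    """Compare a recorded sidecar (*.meta.json) against the current
    environment; return the list of mismatched keys (empty = aligned).
    Key names are hypothetical, mirroring the documented sidecar fields."""
    with open(meta_path) as f:
        recorded = json.load(f)
    mismatches = []
    for key in ("backend", "project_path", "toolchain", "mathlib_rev", "file_hash"):
        if recorded.get(key) != current_env.get(key):
            mismatches.append(key)
    return mismatches
```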

Formalization Artifacts

Each formalization run creates timestamped artifact directories:

runs/formalize_YYYYMMDD_HHMMSS/
├── state.json        # Current pipeline state (resumable)
├── events.jsonl      # Step-by-step event log
├── summary.json      # Final summary with metrics
├── WHITEBOARD.md     # Planner scratchpad (TeX mode)
├── repo/             # Mutable memory items (TeX mode)
├── round_001/
│   ├── draft.lean    # Generated Lean code
│   ├── errors.txt    # Compilation errors
│   └── equivalence_report.json
├── round_002/...
└── final.lean        # Best output

TeX Proving Artifacts

runs/prove_tex/tex_YYYYMMDD_HHMMSS/
├── state.json        # Resumable state
├── events.jsonl      # All events
├── summary.json      # Run summary
├── WHITEBOARD.md     # Persistent planner scratchpad
└── repo/             # Named memory items

Checkpoint Command

# Read-only health check of a Lean file
python3 -m ulam checkpoint examples/Smoke.lean \
  --theorem irrational_sqrt_two_smoke --strict

Checks typecheck status, placeholder (sorry) counts, and axiom drift without modifying anything.

Review Command

# Get next-step suggestions from a trace
python3 -m ulam review --trace run.jsonl \
  --file examples/Smoke.lean --theorem irrational_sqrt_two_smoke

Configuration Pinning

UlamAI pins environment state for reproducibility:

| Pinned Element | Storage |
| --- | --- |
| Lean toolchain version | .ulam/config.json |
| Mathlib commit | Trace metadata |
| LLM provider + model | Run configuration |
| Random seed | Configurable |
| Search parameters | Run configuration |
| Retrieval index | Versioned in project |
| Timeout settings | Run configuration |

8 Compute and API Costs

Inference Profiles

UlamAI implements configurable inference profiles that trade off speed vs. verification depth:

| Profile | gen_k | exec_k | verify_level | Use Case |
| --- | --- | --- | --- | --- |
| fast | Low | Low | Minimal | Quick exploration, CI smoke tests |
| balanced | Medium | Medium | Standard | Default proving, benchmarking |
| strict | High | High | Thorough | High-confidence results, publication |

Profiles can be specified per-run:

python3 -m ulam bench --suite internal_regression \
  --inference-profile balanced --gen-k 8 --exec-k 3 --verify-level strict

Cost Drivers

| Cost Component | Driver | Control |
| --- | --- | --- |
| LLM tactic proposals | k candidates per state × states explored | gen_k, beam cap, timeout |
| LLM repair attempts | Failed tactics × repair retries | Bounded repair loop |
| Embedding calls | Premise retrieval queries | Retriever mode selection |
| Lean execution | Tactic compilation per candidate | exec_k, caching |
| Autoformalization | Multiple rounds × LLM drafts + repairs | max-rounds, max-proof-rounds |
| TeX proving | Planner steps × worker drafts | tex-rounds, tex-worker-drafts, tex-action-steps |

Resource Requirements

| Resource | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8+ GB |
| Disk | ~200 MB (base) | ~2 GB (with Lean project + Mathlib) |
| Network | Required for API models | |
| Lean 4 + Mathlib | Required for real proofs | Via ulam -lean setup |
| LeanDojo + Pantograph | Required for dojo backend | Via ulam -lean |

Cost Optimization Strategies

  1. Automation-first: aesop, simp, linarith, ring, norm_num tried before LLM on every state
  2. Step cache: (state_key, tactic) → result avoids duplicate Lean calls
  3. State cache (transposition table): state_key → best_seen_score prevents re-exploring
  4. Beam capping: Limits frontier size to prevent unbounded search
  5. Budget tracking: Time or token budgets with automatic conclusion phase
  6. Inference profiles: Pre-configured trade-offs between cost and thoroughness
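Strategies 2 and 3 amount to memoization keyed on state and tactic. A minimal sketch of the step cache; the `StepCache` class and the `execute` callback are illustrative, not UlamAI's internals:

```python
class StepCache:
    """Sketch of a (state_key, tactic) -> result step cache: repeated
    applications of the same tactic to the same state hit the cache
    instead of triggering another Lean call."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def run(self, state_key: str, tactic: str, execute):
        key = (state_key, tactic)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = execute(state_key, tactic)  # the expensive Lean call
        self._cache[key] = result
        return result
```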

9 Architecture Solution

System Diagram

┌──────────────────────────────────────────────────────────────────────────┐
│                            UlamAI System                                │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     CLI Entry Point (cli.py)                      │  │
│  │                                                                   │  │
│  │  ulam prove    │ ulam formalize │ ulam bench     │ ulam replay   │  │
│  │  ulam review   │ ulam checkpoint│ ulam bench-*   │ ulam index    │  │
│  │  ulam -lean    │ ulam auth      │ ulam (TUI menu)│               │  │
│  └───────────┬───────────────┬──────────────┬───────────────────────┘  │
│              │               │              │                          │
│  ┌───────────▼───────────┐  │  ┌───────────▼─────────────────────┐   │
│  │   PROVE PIPELINE      │  │  │   FORMALIZE PIPELINE             │   │
│  │                       │  │  │                                   │   │
│  │  ┌─────────────────┐  │  │  │  Phase A: Ingest + Segment TeX   │   │
│  │  │ Search Engine    │  │  │  │  Phase B: Statement Drafting     │   │
│  │  │ (best-first/beam)│  │  │  │  Phase C: Proof Reconstruction  │   │
│  │  │ + transposition  │  │  │  │  Phase D: Semantic Alignment     │   │
│  │  │   table          │  │  │  │  Phase E: Consolidation          │   │
│  │  └────────┬────────┘  │  │  │                                   │   │
│  │           │            │  │  │  ┌─────────────────────────┐     │   │
│  │  ┌────────▼────────┐  │  │  │  │ Fixpoint Loop:          │     │   │
│  │  │ LLM Policy      │  │  │  │  │ Draft → Typecheck →     │     │   │
│  │  │ (k=8-16 tactics)│  │  │  │  │ Proof Search → Validate │     │   │
│  │  └────────┬────────┘  │  │  │  │ ← Repair / Rewrite      │     │   │
│  │           │            │  │  │  └─────────────────────────┘     │   │
│  │  ┌────────▼────────┐  │  │  └───────────────────────────────────┘   │
│  │  │ Retriever       │  │  │                                          │
│  │  │ (premises)      │  │  │                                          │
│  │  └────────┬────────┘  │  │                                          │
│  │           │            │  │                                          │
│  │  ┌────────▼────────┐  │  │                                          │
│  │  │ Autop Fallback  │  │  │                                          │
│  │  │ aesop/simp/     │  │  │                                          │
│  │  │ linarith/ring/  │  │  │                                          │
│  │  │ norm_num        │  │  │                                          │
│  │  └─────────────────┘  │  │                                          │
│  └───────────────────────┘  │                                          │
│              │               │                                          │
│  ┌───────────▼───────────────▼──────────────────────────────────────┐  │
│  │                   LEAN BACKEND LAYER                             │  │
│  │                                                                  │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │  │
│  │  │  LeanDojo     │  │  Lean CLI    │  │  Lean LSP            │  │  │
│  │  │  (Pantograph) │  │  (lake env   │  │  (Language Server)   │  │  │
│  │  │              │  │   lean)       │  │  [experimental]      │  │  │
│  │  │  Goal-state  │  │  Batch       │  │  Interactive          │  │  │
│  │  │  access      │  │  typecheck   │  │  goal/diagnostic     │  │  │
│  │  └──────────────┘  └──────────────┘  └──────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    LLM ADAPTER LAYER                             │  │
│  │                                                                  │  │
│  │  OpenAI │ Codex CLI │ Claude API │ Claude CLI │ Gemini │ Ollama │  │
│  │  Custom API (Mistral, Z.AI, Qwen, DeepSeek, Kimi)              │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    BENCHMARK LAYER                               │  │
│  │                                                                  │  │
│  │  Suite Management │ Report Generation │ Comparison + Gates      │  │
│  │  Campaign Scripts │ miniF2F Builder  │ Regression100 Builder   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    TRACE + REPLAY LAYER                          │  │
│  │                                                                  │  │
│  │  JSONL Logging │ Metadata Sidecars │ Deterministic Replay       │  │
│  │  Review + Checkpoint │ Artifact Management                      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘

Proof Search Data Flow

Input: Lean file + theorem name (with sorry)
    │
    ▼
┌──────────────────────────────┐
│ 1. Initialize                │
│    Load file via Lean backend│
│    Get target proof state    │
│    Create root search node   │
└───────────┬──────────────────┘
            │
            ▼
┌──────────────────────────────┐
│ 2. Search Loop               │◄─────────────────────────┐
│    Select best state from    │                           │
│    frontier (best-first)     │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 3. Try autop tactics first   │                           │
│    aesop → simp → linarith   │                           │
│    → ring → norm_num         │                           │
│    If solved → emit proof ─────── EXIT (success)         │
└───────────┬──────────────────┘                           │
            │ (none solved)                                │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 4. Retrieve premises (top-k) │                           │
│    Token-overlap or embedding│                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 5. LLM proposes k tactics    │                           │
│    (k = 8-16 per state)      │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 6. Execute each in Lean      │                           │
│    ├── Success → new state ──│── Check cache ──► Enqueue │
│    └── Failure → repair loop │     (beam capped)         │
│         └── LLM fix attempt  │                           │
│              └── Re-execute  │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 7. Update caches + log trace │                           │
│    Step cache + state cache  │                           │
│    JSONL trace entry         │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 8. Check termination         │                           │
│    Solved? → EXIT (success)  │                           │
│    Budget exhausted? → EXIT  │                           │
│    Frontier empty? → EXIT    │                           │
│    Otherwise → CONTINUE ─────────────────────────────────┘
└──────────────────────────────┘

10 Component Breakdown

Package Structure

ulam/
├── cli.py              # CLI entry point, TUI menu, argument parsing
├── prover.py           # Core proof search loop
├── search.py           # Best-first / beam search engine
├── retriever.py        # Premise retrieval (token-overlap + embedding)
├── llm/                # LLM adapter layer
│   ├── openai.py       # OpenAI-compatible adapter
│   ├── anthropic.py    # Anthropic Claude adapter
│   ├── codex_cli.py    # Codex CLI wrapper
│   ├── claude_cli.py   # Claude Code CLI wrapper
│   ├── gemini.py       # Gemini adapter
│   ├── ollama.py       # Ollama adapter
│   └── base.py         # Base adapter interface
├── lean/               # Lean backend layer
│   ├── dojo.py         # LeanDojo/Pantograph runner
│   ├── cli_runner.py   # lake env lean CLI runner
│   ├── lsp.py          # Lean language server (experimental)
│   └── parser.py       # Lean file parsing
├── formalize/          # Autoformalization pipeline
│   ├── segmenter.py    # TeX → structured chunks
│   ├── drafter.py      # Statement generation
│   ├── prover.py       # Per-lemma proof loops
│   ├── equivalence.py  # Semantic alignment checks
│   └── pipeline.py     # Multi-phase orchestration
├── tex/                # Informal TeX proving pipeline
│   ├── planner.py      # Planner-action loop
│   ├── worker.py       # Claim-draft workers
│   ├── judge.py        # Judge + adversarial verifier
│   ├── whiteboard.py   # Persistent scratchpad
│   └── composer.py     # Final proof assembly
├── bench/              # Benchmark infrastructure
│   ├── runner.py       # Suite execution
│   ├── report.py       # Report generation
│   ├── compare.py      # Cross-report comparison
│   └── suites/         # Suite definitions
├── index/              # Retrieval index management
│   └── builder.py      # Index construction
├── replay.py           # Deterministic replay
├── review.py           # Next-step suggestion
├── checkpoint.py       # Read-only health check
└── config.py           # Configuration management

Core Module Descriptions

| Module | Responsibility | Key Abstraction |
| --- | --- | --- |
| cli.py | Entry point, TUI menu, interactive configuration, auth flows | Command dispatch |
| prover.py | Core proving loop: state management, tactic execution, repair loops | Prover class |
| search.py | Best-first / beam search with transposition table | SearchEngine |
| retriever.py | Premise selection from Mathlib or local repo | Retriever (token-overlap / embedding) |
| llm/ | Unified interface for all LLM providers | LLMAdapter base |
| lean/ | Backend abstraction for Lean verification | LeanRunner ABC |
| formalize/ | Multi-phase TeX → Lean pipeline | FormalizePipeline |
| tex/ | Informal planner-action proving loop | TexProver |
| bench/ | Suite management, reporting, comparison | BenchRunner, BenchCompare |
| replay.py | Trace re-execution with metadata validation | Replayer |

Dependencies

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "ulam-prover"
version = "0.2.11"
requires-python = ">=3.10"
license = { text = "MIT" }

The base package has minimal hard dependencies. Heavy dependencies (LeanDojo, Pantograph, torch, sentence-transformers) are installed on demand via the ulam -lean setup flow.

11 Core Mechanisms (Detailed)

11.1 Best-First Search with Beam Capping

The search engine maintains a priority queue of proof states:

Priority Queue (beam-capped)
┌────────────────────────────────────────────────┐
│ State A (score: 0.95) ← next to expand         │
│ State B (score: 0.87)                          │
│ State C (score: 0.82)                          │
│ State D (score: 0.71)                          │
│ ... (capped at beam_width)                     │
│                                                │
│ Transposition Table:                           │
│ state_key_A → best_seen_score: 0.95            │
│ state_key_B → best_seen_score: 0.87            │
│ state_key_E → best_seen_score: 0.65 (pruned)  │
└────────────────────────────────────────────────┘

Why best-first/beam before MCTS: As the project documentation explains, "Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later." The search algorithm is explicitly chosen for practical reasons — debuggability and the fact that the bottleneck is Lean compilation, not search branching.

Transposition Table: The state cache maps state_key → best_seen_score. When the search encounters a state it has already seen (via a different tactic path), it checks whether the new score improves on the cached one. If not, the state is pruned. This prevents the combinatorial explosion of re-exploring equivalent states reached by different tactic sequences.
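The interplay of best-first ordering, beam capping, and transposition pruning can be sketched as follows; `expand`, `score`, and `is_solved` are assumed interfaces standing in for tactic generation plus Lean execution, LLM-derived scoring, and the goal check:

```python
import heapq

def best_first_search(root, score, expand, is_solved,
                      beam_width=32, max_steps=1000):
    """Best-first search over hashable state keys with a transposition
    table and beam capping (a sketch, not UlamAI's implementation)."""
    best_seen = {root: score(root)}        # transposition table
    frontier = [(-best_seen[root], root)]  # negate: heapq is a min-heap
    for _ in range(max_steps):
        if not frontier:
            return None                    # frontier exhausted
        _, state = heapq.heappop(frontier)
        if is_solved(state):
            return state                   # proof found
        for child in expand(state):
            s = score(child)
            if best_seen.get(child, float("-inf")) >= s:
                continue                   # seen at least as good: prune
            best_seen[child] = s
            heapq.heappush(frontier, (-s, child))
        if len(frontier) > beam_width:     # beam cap: keep top scorers
            frontier = heapq.nsmallest(beam_width, frontier)
    return None
```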

11.2 Retrieval-Augmented Premise Selection

Two retrieval modes are available:

| Mode | Mechanism | Speed | Quality | Dependencies |
| --- | --- | --- | --- | --- |
| Token-overlap | BM25-style keyword matching | Fast | Adequate | None |
| Embedding-based | Vector similarity search | Moderate | Higher | Embedding API |

Premise Format:

lemma_name : statement

One premise per line, fetched from a --premises file or auto-indexed:

# Auto-index from local project + Mathlib
python3 -m ulam index build \
  --project /path/to/lean-project \
  --scope both \
  --out .ulam/premises_both.jsonl

Retrieval Sources:

| Source | Flag | Content |
| --- | --- | --- |
| Local declarations | --retriever-source local | Project-specific lemmas and definitions |
| Mathlib declarations | --retriever-source mathlib | Full Mathlib library (~100k+ lemmas) |
| Both | --retriever-source both | Combined index |

Why Retrieval Matters: Mathlib contains over 100,000 lemmas. Without retrieval, the model must rely on memorized knowledge — knowledge that degrades with smaller models and quantization, and that inevitably lags the ever-growing surface area of Mathlib itself. Retrieval turns "do you remember this lemma?" into "here is the lemma; use it."
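A token-overlap ranker in this BM25-flavoured spirit can be sketched in a few lines. This is an illustration, not UlamAI's actual retriever: `rank_premises` and its scoring are assumptions, with premises given as `lemma_name : statement` lines as described above.

```python
import re
from collections import Counter

def rank_premises(goal, premises, k=5):
    """Rank `lemma_name : statement` lines by token overlap with the goal.

    BM25-flavoured sketch: premises sharing rarer tokens with the goal
    score higher. Not UlamAI's actual retriever implementation.
    """
    def tokenize(text):
        # Keep identifier-like tokens; mathematical symbols are dropped.
        return set(re.findall(r"[A-Za-z_]+", text.lower()))

    goal_tokens = tokenize(goal)
    # Document frequency down-weights tokens common across the premise set.
    df = Counter(t for p in premises for t in tokenize(p))

    def overlap_score(premise):
        return sum(1.0 / (1 + df[t]) for t in tokenize(premise) & goal_tokens)

    return sorted(premises, key=overlap_score, reverse=True)[:k]
```

An embedding-based mode would swap `overlap_score` for cosine similarity over precomputed premise vectors, trading speed for quality as in the table above.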

11.3 Automation Tactic Fallbacks (autop)

Before every LLM query, UlamAI tries a cascade of built-in Lean automation:

Proof state arrived
    │
    ├── Try: aesop
    │     └── Success? → SOLVED (no LLM needed)
    ├── Try: simp
    │     └── Success? → SOLVED
    ├── Try: linarith
    │     └── Success? → SOLVED
    ├── Try: ring
    │     └── Success? → SOLVED
    └── Try: norm_num
          └── Success? → SOLVED
          │
          └── All failed → Query LLM for k tactics

Rationale: Many subgoals that look hard to a language model are trivial for Lean's built-in automation. This saves:

  • LLM API tokens (no call needed)
  • Search budget (one-step solution vs. multi-step exploration)
  • Wall-clock time (local computation vs. API round-trip)
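The cascade reduces to a short loop. In this sketch, `run_tactic(state, tactic)` is an assumed stand-in for a call into the Lean backend returning `(solved, new_state)`; only on full cascade failure does the caller fall through to the LLM.

```python
# The five built-in tactics tried, in order, before any LLM query.
AUTOMATION_CASCADE = ("aesop", "simp", "linarith", "ring", "norm_num")

def try_automation(state, run_tactic, cascade=AUTOMATION_CASCADE):
    """Try Lean's built-in automation before spending an LLM call (a sketch).

    run_tactic(state, tactic) -> (solved, new_state) stands in for the
    Lean backend.
    """
    for tactic in cascade:
        solved, new_state = run_tactic(state, tactic)
        if solved:
            return tactic, new_state  # SOLVED: no LLM tokens spent
    return None, state  # all failed: caller now queries the LLM for k tactics
```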

11.4 Repair Loop

When a tactic fails, UlamAI doesn't just discard it — it feeds the error back to the LLM:

LLM proposes tactic T
    │
    ▼
Execute T in Lean
    │
    ├── Success → accept, enqueue new state
    │
    └── Failure (error message E)
          │
          ▼
        Ask LLM to repair T given error E
          │
          ▼
        Execute repaired T' in Lean
          │
          ├── Success → accept
          └── Failure → bounded retries exhausted → discard

This error-feedback loop is a key differentiator from pure one-shot generation. Lean's type error messages are precise and informative — they tell the model exactly what went wrong (missing argument, type mismatch, unknown identifier). This structured feedback enables targeted repairs.
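The bounded repair loop can be sketched as follows. Here `run` and `repair` are assumed stand-ins for the Lean backend and the LLM repair prompt respectively; the function names and signatures are illustrative, not UlamAI's API.

```python
def execute_with_repair(tactic, run, repair, max_repairs=2):
    """Sketch of the bounded error-feedback repair loop.

    run(tactic) -> (ok, error_message) stands in for Lean execution;
    repair(tactic, error) -> repaired tactic stands in for the LLM repair
    prompt. Returns the accepted tactic, or None once retries are exhausted.
    """
    ok, error = run(tactic)
    for _ in range(max_repairs):
        if ok:
            break
        # Lean's error text (missing argument, type mismatch, unknown
        # identifier) is fed back verbatim for a targeted fix.
        tactic = repair(tactic, error)
        ok, error = run(tactic)
    return tactic if ok else None
```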

11.5 TeX Proving Pipeline (Informal Proofs)

The --output-format tex pipeline is a sophisticated orchestration system:

┌────────────────────────────────────────────────────┐
│              TeX Proving Architecture               │
│                                                    │
│  v0.2.11 Flow:                                     │
│  1. Try one deep whole-theorem attempt              │
│     using single primary model                      │
│  2. If fails → fall back to planner/worker          │
│     claim decomposition                             │
│                                                    │
│  Planner Action Loop:                               │
│  ┌──────────────────────────────────────────────┐  │
│  │ Planner chooses action:                      │  │
│  │  • plan — decompose into claim graph         │  │
│  │  • solve — workers draft claims              │  │
│  │  • write_memory — update WHITEBOARD/repo     │  │
│  │  • compose — assemble final proof            │  │
│  │  • give_up — exhaust budget                  │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Claim Graph:           │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ claims + dependencies + assumptions +         │  │
│  │ required facts + worker guidance +            │  │
│  │ repo reads                                    │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Workers:               │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ Draft each claim (serial or concurrent)      │  │
│  │  → Judge + Adversarial Verifier +            │  │
│  │    Domain Checker                            │  │
│  │  → Accept / Reject                           │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Persistent Memory:     │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ WHITEBOARD.md + repo/ items                  │  │
│  │ CRUD operations by planner                   │  │
│  │ wikilink expansion in prompts                │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  Replan/Backtrack:                                 │
│  - Pass-based: reset decomposition if stalled      │
│  - Bounded: --tex-replan-passes limit              │
│                                                    │
│  Output: composed .tex proof draft                 │
└────────────────────────────────────────────────────┘

Configuration Knobs:

| Flag | Default | Description |
| --- | --- | --- |
| --tex-rounds | | Max fixed orchestration rounds |
| --tex-worker-drafts | 1 | Worker candidates per claim per round |
| --tex-concurrency | off | Toggle concurrent evaluation of drafts |
| --tex-judge-repairs | | Max rounds without accepted claim |
| --tex-replan-passes | | Max decomposition resets |
| --tex-action-steps | | Bounded planner-action steps |
| --tex-planner-model | | Override model for planning/judging |
| --tex-worker-model | | Override model for claim workers |
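The planner-action loop above can be sketched as a bounded dispatch over the documented action set. Everything here is illustrative: `planner`, `workers`, and the memory dictionary are assumptions standing in for the planner model, the worker drafting (with its judge/verifier gating elided), and the WHITEBOARD/repo memory.

```python
def planner_loop(planner, workers, memory, max_steps=8):
    """Minimal sketch of the TeX planner-action loop.

    planner(memory) returns (action, payload); the action names mirror the
    documented set: plan, solve, write_memory, compose, give_up.
    """
    for _ in range(max_steps):  # --tex-action-steps style bound
        action, payload = planner(memory)
        if action == "plan":
            memory["claim_graph"] = payload        # claims + dependencies
        elif action == "solve":
            for claim in payload:
                draft = workers(claim, memory)     # judge/verifier gating elided
                memory.setdefault("accepted", []).append(draft)
        elif action == "write_memory":
            memory["whiteboard"] = payload         # WHITEBOARD.md-style update
        elif action == "compose":
            return payload                         # assembled .tex proof draft
        elif action == "give_up":
            return None
    return None  # action budget exhausted
```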

11.6 Autoformalization Pipeline (TeX → Lean)

The formalization pipeline is a five-phase process with recursive loops:

Phase A — Ingest + Segment:

LaTeX document
    │
    ▼
Parse into logical chunks:
    ├── Definitions
    ├── Lemmas
    ├── Theorems
    └── Proof sketches
    │
    ▼
Extract named entities + symbols
    │
    ▼
Build dependency graph

Phase B — Statement Drafting (Lean Skeleton):

For each definition/lemma/theorem:
    │
    ├── Generate Lean signature with sorry proofs
    ├── Insert placeholders for unknown types
    │
    ▼
Run Lean to typecheck skeleton
    │
    ├── Missing imports → fix
    ├── Undeclared constants → fix
    ├── Type mismatches → fix
    ├── Notation issues → fix
    │
    └── Loop until file typechecks with sorry proofs

Phase C — Proof Reconstruction:

For each sorry:
    │
    ├── Open goal in Lean backend
    ├── Run proof search pipeline
    │
    ├── If proof fails:
    │   ├── Retrieve similar lemmas
    │   ├── Reformulate statement
    │   └── Generate auxiliary lemmas
    │
    └── If still failing: leave sorry, log TODO

Phase D — Semantic Alignment (Recursive):

┌──────────────────────────────────────────────┐
│  Draft → Typecheck → Proof Search →          │
│  Validate Equivalence                        │
│       ↑                      ↓               │
│       └── Repair / Rewrite ──┘               │
│                                              │
│  Statement equivalence check:                │
│  LLM judges Lean stmt vs informal stmt       │
│  Mismatch → enqueue repair task              │
│                                              │
│  Gap detection:                              │
│  Missing assumptions → add hypotheses        │
│  Hidden lemmas → generate intermediates      │
│                                              │
│  Re-run proofs:                              │
│  Statement change → invalidate downstream    │
│  → recursive repair                          │
└──────────────────────────────────────────────┘

Phase E — Consolidation:

  • Produce final .lean file
  • Record proof success rate, remaining sorrys, error traces
  • Save all artifacts
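The dependency graph built in Phase A drives the order in which Phase B drafts statements, so that each declaration is emitted after everything it depends on. A minimal sketch, in which the chunk names and Lean signatures are hypothetical:

```python
from graphlib import TopologicalSorter

def draft_skeleton(chunks, deps):
    """Phase A/B sketch: emit a sorry-skeleton in dependency-first order.

    chunks maps a name to its drafted Lean signature; deps maps a name to the
    set of names it depends on. Names and signatures here are illustrative.
    """
    lines = []
    # static_order() yields every node only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        # Every proof body starts as `sorry`; Phase C replaces them via search.
        lines.append(f"{chunks[name]} := by sorry")
    return "\n".join(lines)
```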

11.7 Axiom Control

UlamAI includes axiom safety controls:

# Default: axioms/constants allowed
python3 -m ulam prove File.lean --theorem T

# Strict: no axioms allowed
python3 -m ulam prove File.lean --theorem T --no-allow-axioms

This is important for proof integrity — a proof that introduces new axioms or constants may not be valid in the intended mathematical framework.
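The strict mode corresponds to inspecting a proof's axiom footprint, which Lean 4 itself exposes via the #print axioms command. A small illustration (the theorem here is our own example, not from UlamAI):

```lean
-- Classical.em is derived from choice in Lean 4, so a proof using it
-- carries a nonempty axiom footprint.
theorem em_example (p : Prop) : p ∨ ¬p := Classical.em p

-- Reports every axiom the proof transitively depends on; an empty report
-- is what a strict --no-allow-axioms style check would demand.
#print axioms em_example
```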

11.8 Scripted Solver (Sequential Tactics)

In addition to search-based proving, UlamAI supports sequential script mode:

┌─────────────────────────────────────┐
│  Script Solver                       │
│                                     │
│  1. LLM generates tactic sequence  │
│  2. Execute tactics sequentially    │
│  3. Each step validated by Lean     │
│  4. Backtrack on failure            │
│                                     │
│  Portfolio strategy:                │
│  1. Try script solver first         │
│  2. If fails → best-first search   │
└─────────────────────────────────────┘

This is enabled by default and works well for proofs that follow a natural sequential structure.
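Step-by-step validation with a backtrack point can be sketched as below. `apply_tactic` is an assumed stand-in for the Lean backend, returning the next proof state or None on failure; the returned prefix is what a portfolio caller would hand to best-first search.

```python
def run_script(init_state, tactics, apply_tactic):
    """Sequential script mode, sketched: each step is validated by Lean and
    execution stops at the first failure, so the caller can fall back to
    best-first search from the last good state.

    apply_tactic(state, tactic) -> next proof state, or None on failure.
    """
    state, executed = init_state, []
    for tactic in tactics:
        nxt = apply_tactic(state, tactic)
        if nxt is None:
            return executed, state  # validated prefix + backtrack point
        executed.append(tactic)
        state = nxt
    return executed, state  # whole script validated
```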

12 Programming Language

Language and Ecosystem

| Aspect | Choice |
| --- | --- |
| Primary Language | Python |
| Python Version | >=3.10 |
| Build System | setuptools (>=68) + wheel |
| Package Name | ulam-prover |
| Entry Point | ulam = ulam.cli:main |
| Formal Verification Target | Lean 4 (with Mathlib) |
| Distribution | PyPI + Homebrew |

Distribution Channels

UlamAI is unusually well-distributed for a research tool:

| Channel | Command | Notes |
| --- | --- | --- |
| Homebrew | brew tap ulamai/ulamai && brew install ulamai | Auto-updated on release |
| PyPI | pip install ulam-prover | Standard Python install |
| Clone + install | ./install.sh | Editable install |
| One-liner | curl -fsSL ... \| bash | Quick install script |
| Colab | Notebook in examples/ | Zero-install exploration |

Integration Language: Lean 4

While UlamAI is written in Python, it operates on Lean 4 code:

| Lean Component | Purpose |
| --- | --- |
| Lean 4 | Proof kernel (ground truth) |
| Mathlib | Mathematical library (100k+ lemmas) |
| LeanDojo | Python interface to Lean proof states |
| Pantograph (PyPantograph) | Server for tactic-level interaction |
| lake | Build tool for Lean projects |
| elan | Lean toolchain manager |

Toolchain Alignment

UlamAI carefully manages Lean toolchain alignment:

  • Pantograph is anchored to a specific Lean toolchain (from its src/lean-toolchain)
  • UlamAI aligns the Mathlib project to that toolchain when possible
  • --use-mathlib-toolchain flag available for explicit control
  • Toolchain pinning is part of reproducibility metadata

13 Memory Management

Cache Architecture

UlamAI implements a multi-level caching system to minimize redundant Lean execution:

┌────────────────────────────────────────────────────┐
│                CACHE HIERARCHY                      │
│                                                    │
│  Level 1: Step Cache                               │
│  ├── Key: (state_key, tactic)                      │
│  ├── Value: execution result                       │
│  ├── Purpose: Avoid re-running same tactic on      │
│  │            same state                           │
│  └── Eviction: None (grows with search)            │
│                                                    │
│  Level 2: State Cache (Transposition Table)        │
│  ├── Key: state_key (canonical hash)               │
│  ├── Value: best_seen_score                        │
│  ├── Purpose: Avoid re-exploring states reached    │
│  │            via different paths                  │
│  └── Eviction: None (grows with search)            │
│                                                    │
│  Level 3: Retrieval Index                          │
│  ├── Key: project + scope                          │
│  ├── Value: premises JSONL                         │
│  ├── Purpose: Pre-computed premise index           │
│  └── Location: .ulam/premises_*.jsonl              │
│                                                    │
│  Level 4: LLM Response Cache                       │
│  ├── Not implemented (each call is logged but      │
│  │   not cached for reuse)                         │
│  └── Potential optimization point                  │
└────────────────────────────────────────────────────┘
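The Level-1 step cache amounts to a memoizing wrapper around the Lean backend. A sketch with illustrative names (`LeanStepCache` and `execute` are ours, not UlamAI's API):

```python
class LeanStepCache:
    """Level-1 step cache sketch: memoize (state_key, tactic) -> result so
    the same tactic is never re-executed on the same state.

    `execute` stands in for a call into the Lean backend.
    """

    def __init__(self, execute):
        self._execute = execute
        self._cache = {}   # (state_key, tactic) -> result; never evicted
        self.hits = 0

    def run(self, state_key, tactic):
        key = (state_key, tactic)
        if key in self._cache:
            # Lean execution dominates wall-clock time, so every hit is a
            # large saving relative to the dictionary lookup.
            self.hits += 1
            return self._cache[key]
        result = self._execute(state_key, tactic)
        self._cache[key] = result
        return result
```

The no-eviction policy matches the documented design: both Level-1 and Level-2 caches simply grow for the lifetime of a search run.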

State Canonicalization

A critical challenge in proof search is determining when two proof states are "the same." UlamAI uses state hashing for the transposition table, but acknowledges this as an area for improvement:

"Better state canonicalization (stable hashing)" is listed as a v0.2 roadmap item.

The difficulty: Lean proof states contain goal hypotheses, types, and metavariables. Two states might differ in hypothesis naming but be semantically equivalent. Imperfect canonicalization leads to:

  • False negatives: missing equivalent states → redundant exploration
  • False positives (rare): colliding different states → incorrect pruning
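One common canonicalization tactic is to rename hypotheses positionally before hashing, so that alpha-equivalent states collide on purpose. A toy sketch under that assumption (real Lean states also carry metavariables and universe data, which this ignores):

```python
import hashlib
import re

def state_key(hypotheses, goal):
    """Canonicalization sketch: positional hypothesis renaming + stable hash.

    `hypotheses` is a list of (name, type-as-string) pairs; states that
    differ only in hypothesis names hash identically.
    """
    renaming = {name: f"h{i}" for i, (name, _) in enumerate(hypotheses)}

    def subst(s):
        # Replace hypothesis names wherever they occur in types or the goal.
        return re.sub(r"\b\w+\b", lambda m: renaming.get(m.group(0), m.group(0)), s)

    payload = "|".join(subst(ty) for _, ty in hypotheses) + " ⊢ " + subst(goal)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```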

TeX Mode Memory

The TeX proving pipeline maintains its own persistent memory:

| Component | Persistence | Mutability | Scope |
| --- | --- | --- | --- |
| WHITEBOARD.md | Across rounds/passes | Read/write by planner | Planning state |
| repo/ items | Across rounds/passes | CRUD by planner | Named knowledge items |
| Claim graph | Per pass | Created by planner | Proof decomposition |
| Worker drafts | Per round | Generated by workers | Claim solutions |
| state.json | Across runs | Updated per event | Full resumable state |
| events.jsonl | Append-only | Immutable | Audit trail |

The wikilink expansion mechanism works identically to OpenProver's — repo items are referenced by name and expanded into prompts. This is not coincidental; both systems likely draw from the same Aletheia-inspired architectural pattern.

Configuration Memory

UlamAI persists configuration in .ulam/config.json:

{
  "lean_project": "/path/to/lean-project",
  "llm_provider": "openai",
  "llm_model": "gpt-5.2-codex",
  "lean_backend": "dojo",
  "prove_mode": "tactic"
}

This file is auto-generated by setup flows (ulam -lean, TUI configuration) and loaded on every invocation.

14 Continued Learning

Planned Learning Pipeline

UlamAI has an explicit, public roadmap for progressive learning capabilities:

| Version | Learning Capability | Status |
| --- | --- | --- |
| v0.1 (shipped) | Trace logging to JSONL | Implemented |
| v0.2 (current) | Better scoring heuristics (non-RL) | In progress |
| v0.3 (planned) | SFT training loop on traces | Not started |
| v0.4 (planned) | Learned value model for search guidance | Not started |
| v0.5 (planned) | On-policy RL (PPO/GRPO-style) | Not started |

v0.3 — Supervised Fine-Tuning from Traces

The most concrete near-term learning plan:

Successful proof traces (JSONL)
    │
    ▼
Extract (state, tactic) pairs
    │
    ▼
Build "proofstep dataset" for model family
    │
    ▼
SFT on tactic policy
    │
    ▼
Evaluate on miniF2F slice + internal suite

Key Insight: Every successful UlamAI proof run generates training data. The JSONL traces contain exactly the (state, tactic) pairs needed for supervised fine-tuning, and their outcome labels supply the reward signal a later RL stage would need. Failed proof attempts provide hard negatives.
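Extracting supervision pairs from such traces is a one-pass filter. The event schema below (`"event": "step"` records with `ok`, `state`, and `tactic` fields) is our assumption for illustration; the real UlamAI trace format may differ.

```python
import json

def extract_proofstep_pairs(trace_lines):
    """Sketch: pull (state, tactic) SFT pairs out of JSONL trace lines.

    Assumes each successful step event looks like
    {"event": "step", "ok": true, "state": ..., "tactic": ...};
    the actual UlamAI trace schema may differ.
    """
    pairs = []
    for line in trace_lines:
        ev = json.loads(line)
        # Keep only validated steps; failed steps would become hard negatives.
        if ev.get("event") == "step" and ev.get("ok"):
            pairs.append((ev["state"], ev["tactic"]))
    return pairs
```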

v0.4 — Learned Value Model

Proof traces (successful + failed)
    │
    ▼
Train value model:
    state → predicted proof distance
    │
    ▼
Replace heuristic scoring in search:
    best_first_priority = value_model(state)
    │
    ▼
MCTS/MCGS option with learned value

v0.5 — Reinforcement Learning

On-policy fine-tuning (PPO/GRPO-style)
    │
    ├── Successful trajectories → positive reward
    ├── Near-miss trajectories → shaped reward
    └── Curriculum scheduling: easy → harder

v0.3 Additional Plans (Structural)

Both proving tracks (TeX and Lean formalization) are planned to receive:

  1. Dependency-first proving: Decompose into concept/claim or declaration DAG first, then solve bottom-up
  2. Live grounding retrieval: Force term/definition grounding from current Mathlib/local sources before drafting
  3. Compiler/typecheck-in-the-loop reflection: At each node, not only at the end
  4. Semantic gate beyond syntax: Check whether generated claims match intended meanings
  5. Abstain capability: Explicit no-output or partial-safe outcome when gates fail
  6. Autonomy protocol artifacts: Fixed verification/extraction prompts, full transcripts, autonomy card

Cross-Run Learning (Current Gap)

Like OpenProver, UlamAI currently has no cross-run learning mechanism. Each proof attempt starts from scratch. However, the planned SFT pipeline (v0.3) would close this gap by training improved models from accumulated traces.

UnsolvedMath Integration

The planned integration with UnsolvedMath (1,146 open problems) creates a natural curriculum:

Easy problems (textbook) → build traces
    → SFT model on traces
    → attempt harder problems
    → accumulate more traces
    → further SFT/RL
    → attempt open problems from UnsolvedMath

This is a concrete path toward evolutionary improvement through accumulated experience.

15 Applications

Primary Application: Verified Mathematical Proof Discovery

UlamAI serves multiple user profiles with distinct workflows:

| User Profile | Workflow | Key Commands |
| --- | --- | --- |
| Mathematician | Prove a conjecture in Lean | ulam prove File.lean --theorem T |
| Formalization engineer | Convert a paper to Lean | ulam formalize paper.tex --out paper.lean |
| Student | Explore proof strategies | ulam prove --output-format tex --statement "..." |
| Researcher | Benchmark proving models | ulam bench --suite regression |
| Developer | Non-regression testing | ulam bench-compare --gate --a base.json --b pr.json |

Concrete Workflow Examples

Workflow A: Lean Formal Proving

ulam prove examples/Smoke.lean \
  --theorem irrational_sqrt_two_smoke \
  --prove-mode llm --lean lsp --llm codex_cli

Workflow B: Informal TeX Proving

ulam prove --theorem infinitely_many_primes \
  --output-format tex \
  --statement "Prove that there are infinitely many prime numbers." \
  --llm codex_cli --tex-rounds 3 --tex-worker-drafts 2

Workflow C: Autoformalization

ulam formalize examples/pol25.tex --out examples/pol25.lean \
  --proof-backend llm --lean-backend dojo \
  --max-rounds 5 --max-proof-rounds 2

Workflow D: Benchmarking

ulam bench --suite internal_regression --llm codex_cli \
  --openai-model gpt-5.3-codex --lean dojo \
  --inference-profile balanced

Research Applications

  1. Benchmark-Driven Model Evaluation: UlamAI's benchmark infrastructure enables systematic comparison of:
     • Different LLM providers on the same theorem suite
     • Different inference profiles (speed vs. thoroughness)
     • Different proof modes (tactic vs. lemma vs. LLM-only)
     • Different Lean backends (LeanDojo vs. CLI vs. LSP)

  2. Training Data Generation: Every proof run produces JSONL traces that can serve as:
     • Supervised fine-tuning datasets for tactic prediction
     • Hard negatives from failed proof attempts
     • Curriculum data ordered by difficulty

  3. Autoformalization Research: The TeX → Lean pipeline enables studying:
     • Semantic drift between informal and formal statements
     • Multi-round repair effectiveness
     • Gap detection in informal proofs

  4. Agentic Architecture Research: The TeX proving pipeline's planner-action loop enables studying:
     • Claim decomposition strategies
     • Judge/verifier effectiveness
     • Whiteboard + repo memory impact on proof quality
     • Replan/backtrack trigger conditions

Comparison with Alternative Systems

| Capability | UlamAI | OpenProver | LeanCopilot | Aletheia |
| --- | --- | --- | --- | --- |
| Search-based proving | Best-first/beam | No (LLM-directed) | No | Unknown |
| Transposition table | Yes | No | No | Unknown |
| Automation fallbacks | Yes (5 tactics) | No | Partial | Unknown |
| Formal verification | Lean 4 (3 backends) | Lean 4 (CLI only) | Lean 4 (LeanDojo) | Extraction protocols |
| Informal proving | TeX pipeline | Whiteboard-based | No | Whiteboard-based |
| Autoformalization | TeX → Lean | Markdown → Lean | No | No |
| Literature search | No | Yes (web-enabled) | No | Unknown |
| Multi-model support | 7+ providers | 3 providers | 1 provider | 1 (Gemini) |
| Benchmark infra | Extensive | None | Basic | Internal |
| Open source | Yes (MIT) | Yes | Yes | No |
| Planned RL/SFT | Yes (v0.3-0.5) | No | No | Unknown |

Limitations and Honest Assessment

| Limitation | Details |
| --- | --- |
| No published benchmark numbers | Extensive infra but no public performance reports |
| No cross-run learning yet | SFT pipeline planned (v0.3) but not implemented |
| Lean toolchain sensitivity | LeanDojo/Pantograph version alignment is fragile |
| No literature search | Unlike OpenProver, cannot consult web during proving |
| No value model | Search scoring is heuristic, not learned |
| Experimental LSP backend | Lean LSP track not yet at parity with LeanDojo |
| Early-stage TeX pipeline | Informal proving is functional but maturing rapidly |
| No parallel rollouts | Single-threaded proof search (planned for v0.4) |

Significance for the Harness Systems Landscape

UlamAI represents a distinct architectural philosophy from the planner-worker systems (Aletheia, OpenProver). Where those systems emphasize strategic coordination — a planner managing workers through a whiteboard — UlamAI emphasizes systematic search — best-first exploration of the proof state space with LLMs as the policy network.

The key insight is that mathematical proof discovery admits two complementary framings:

  1. Proof as Planning (OpenProver/Aletheia): A strategic agent coordinates exploration of a large, uncertain space, maintaining persistent context and delegating tasks.

  2. Proof as Search (UlamAI): A search algorithm explores a well-defined state space (proof states), using LLMs as heuristic evaluators and move generators.

Neither is strictly better. Planning excels at the macro level (choosing proof strategies, decomposing problems, consulting literature). Search excels at the micro level (finding the right tactic sequence to close a specific goal). UlamAI's dual-track architecture (Lean search + TeX planner) acknowledges this duality — the formal track uses search, the informal track uses planning.

For the broader harness systems taxonomy, UlamAI's benchmark infrastructure is particularly significant. The parity gate system, inference profiles, campaign scripts, and cross-report comparison tools represent a level of engineering discipline around evaluation that most research tools lack. This makes UlamAI not just a prover but a platform for systematic evaluation of AI-driven mathematical reasoning.

The planned SFT → value model → RL progression (v0.3 → v0.5) also positions UlamAI as a future bridge between harness systems and evolutionary/self-improving systems. Once the system can train on its own traces, it begins to exhibit the self-reinforcing dynamics that characterize the most impactful AI research systems.