UlamAI
Truth-first, reproducible, open-source Lean 4 theorem prover CLI combining LLM-guided reasoning with formal verification and best-first search
Organization: Ulam Labs (Przemek Chojecki et al.)
Published: February 2026 (v0.1); current: v0.2.11, March 2026
Type: Open-source repository + PyPI package + Homebrew tap
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: UlamAI Prover — A Truth-First, Reproducible, Open-Source Lean 4 Theorem Prover CLI
Repository: github.com/ulamai/ulamai
PyPI Package: ulam-prover v0.2.11
Homebrew: brew tap ulamai/ulamai && brew install ulamai
Blog Post: UlamAI Prover: An Open Lean 4 Theorem Prover (Ulam Blog)
Medium Articles:
- UlamAI — An Open-Source Theorem Prover and Formalizer in Lean 4 (Feb 2026)
- UlamAI Prover: Teaching AI to Prove Math Theorems (Mar 2026)
Related Project: UnsolvedMath — a benchmark of 1,146 open mathematical problems, designed as the natural complement to UlamAI Prover (problems to attempt ↔ machinery for attempting them).
Naming: Named after Stanisław Ulam (1909–1984), the Polish-American mathematician who co-invented the Monte Carlo method, contributed to the hydrogen bomb design, and proposed the concept of cellular automata with John von Neumann.
License: MIT
2 Authors and Team
Lead / Founder: Przemek Chojecki — CEO of Ulam Labs, author of the Medium articles and blog posts. PhD in mathematics with research background in number theory and algebraic geometry.
Organization: Ulam Labs — a company focused on mathematical intelligence and AI for mathematics.
Repository Activity:

| Metric | Value |
| --- | --- |
| Commits | 91 |
| Stars | 111+ |
| Forks | 6 |
| Primary Language | Python |
| License | MIT |
| Release Cadence | Rapid iteration (v0.1 → v0.2.11 in ~2 months) |
Team Context: Ulam Labs positions itself at the intersection of AI and pure mathematics. The company maintains both the UlamAI Prover (tool for proving) and UnsolvedMath (benchmark of 1,146 open problems), creating a closed loop between challenge definition and solving infrastructure.
Key Differences from Solo Projects
Unlike many open-source theorem provers that are single-developer efforts, UlamAI is backed by a company with a clear product vision. This is reflected in:
- Professional documentation and tutorials
- Homebrew distribution (unusual for research tools)
- Colab notebooks for easy onboarding
- Automated CI/CD with Homebrew tap updates
- Connection to a commercial benchmark product (UnsolvedMath)
3 Core Contribution
Key Novelty: UlamAI is the first open-source system to combine best-first search over Lean 4 proof states, multi-provider LLM tactic generation, retrieval-augmented premise selection, automated repair loops, and a full LaTeX-to-Lean autoformalization pipeline in a single CLI tool — with reproducible benchmarking and trace logging built in from day one.
What Makes UlamAI Distinctive
- Search-Based Proof Discovery: Unlike one-shot LLM proof generation, UlamAI implements best-first search with beam capping over the space of proof states. At each state, the LLM proposes 8–16 candidate tactics. Successful applications expand the search tree; failures are pruned. A transposition table prevents re-exploring visited states. This turns theorem proving from "hope the LLM gets it right" into systematic exploration.
- Three Proof Modes: Tactic mode (LLM proposes the next tactic, Lean checks each step), Lemma mode (LLM drafts a lemma plan, each lemma proved sequentially), and LLM-only mode (LLM rewrites the entire Lean file, CLI typechecks). Each mode represents a different trade-off between granularity and autonomy.
- Three Lean Backends: LeanDojo (Pantograph server for goal-state access), CLI (`lake env lean` for batch typechecking), and experimental LSP (Lean language server for interactive diagnostics). This separation of concerns allows the system to be used with varying levels of Lean toolchain sophistication.
- Dual-Track Proving — Lean + TeX: UlamAI maintains two independent proving pipelines:
  - Lean track: Formal, machine-checked proofs via tactic search
  - TeX track: Informal natural-language proofs via a bounded planner-action loop with claim graphs, whiteboard memory, judge/verifier gates, and composable worker drafts. The TeX track is not just "ask the LLM to write a proof" — it is a structured orchestration system with its own planning, judging, and backtracking logic.
- Full Autoformalization Pipeline (TeX → Lean): A multi-phase pipeline that segments LaTeX documents, drafts Lean skeletons, iteratively repairs type errors, fills proofs via search, and validates semantic equivalence between informal and formal statements. This is treated as a fixpoint loop, not a one-shot translation.
- Benchmark Infrastructure: Built-in suite management (`ulam bench`), report generation, cross-report comparison with parity gates, regression suite builders, and campaign scripts. Benchmarking is not an afterthought — it's a first-class product feature.
- Automation Tactic Fallbacks (autop): Before querying the LLM for every subgoal, UlamAI tries `aesop`, `simp`, `linarith`, `ring`, and `norm_num`. Many subgoals that look hard to a language model are trivial for Lean's built-in automation, saving tokens and search budget.
Relationship to Prior Systems
| System | Year | Architecture | Proof Strategy | Formal Backend | Open Source |
|---|---|---|---|---|---|
| LeanCopilot | 2023 | Single-agent tactic suggestion | One-shot | Lean 4 via LeanDojo | Yes |
| ReProver | 2023 | Retrieval + generation | Single-step | Lean 4 | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Multi-step | Lean via LeanDojo | Partial |
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Multi-step, informal-first | Extraction protocols | No |
| OpenProver | 2025 | Planner-worker-whiteboard | Informal + formal dual-track | Lean 4 via lake env lean | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Best-first/beam search | Lean 4 (LeanDojo/CLI/LSP) | Yes |
Core Design Philosophy
"Trust Lean, not the model."
This single principle drives the entire architecture. The LLM is a heuristic guide — creative, fast, often wrong. Lean is the ground truth. UlamAI cannot produce a false proof. It can fail to find a proof, but it cannot claim to have found one that doesn't typecheck.
4 Supported Solutions
Solution Types
| Solution Type | Description | Pipeline |
|---|---|---|
| Formal Lean Proof | Machine-checked tactic proofs compiled against Mathlib | Prove pipeline (tactic/lemma mode) |
| LLM-Rewritten Lean File | Full Lean file rewritten by LLM, batch typechecked | Prove pipeline (LLM mode) |
| Informal TeX Proof | Structured natural-language proof with claim decomposition | Prove pipeline (TeX output format) |
| Autoformalization | LaTeX → Lean translation with iterative repair | Formalize pipeline |
| Proof Replay | Deterministic re-execution of a recorded proof trace | Replay command |
| Proof Review | Next-step suggestions for a partially completed proof | Review command |
| Benchmark Report | Solve-rate analysis across theorem suites | Bench command |
Prove Output Formats
| Format | Description | Verification |
|---|---|---|
| lean (default) | Machine-checked Lean 4 proof via tactic search | Lean kernel verification at each step |
| tex | Informal proof via bounded planner-action loop | Judge + adversarial verifier + domain checks |
Proof Modes (for Lean output)
| Mode | Mechanism | Pros | Cons |
|---|---|---|---|
| tactic | LLM proposes next tactic; Lean checks each step | Fast feedback, goal-state aware | Brittle for long chains |
| lemma | LLM drafts a lemma plan; each lemma proved sequentially | Decomposes big proofs | Depends on plan quality |
| llm | LLM rewrites full Lean file; CLI typechecks | Handles multi-step edits, no Dojo required | Slower, less goal-state guidance |
Lean Backends
| Backend | Mechanism | Pros | Cons |
|---|---|---|---|
| dojo | Pantograph/LeanDojo server | Goal-state access, tactic execution | Extra install, toolchain sensitivity |
| cli | lake env lean typecheck | Simple, works with any toolchain | No goal-state feedback |
| lsp | Lean language server | Goal/diagnostic loop without Pantograph | Experimental, search parity pending |
Solver Strategies
| Strategy | Behavior |
|---|---|
| search | Best-first tactic search (default) |
| script | Sequential script mode (deterministic tactic sequence) |
| portfolio | Script-first, then best-first fallback |
5 LLM Integration
Multi-Provider Support
UlamAI supports a broad range of LLM providers through a unified adapter layer:
| Provider | Auth Mechanism | Default Model | Notes |
|---|---|---|---|
| OpenAI-compatible | ULAM_OPENAI_API_KEY | gpt-4.1 | Base URL configurable for custom endpoints |
| Codex CLI | codex login / ulam auth codex | gpt-5.2-codex | No API key needed; subscription-based |
| Anthropic Claude | ULAM_ANTHROPIC_API_KEY | claude-3-5-sonnet | API or Claude Code CLI |
| Claude Code CLI | ulam auth claude / claude setup-token | — | No API key; subscription-based |
| Gemini API | ULAM_GEMINI_API_KEY | gemini-3.1-pro-preview | OpenAI-compatible endpoint format |
| Gemini CLI | OAuth2 auto-discovery | — | Subscription/login based |
| Ollama | Local server | llama3.1 | Fully local, no data leaving machine |
| Custom API | TUI configuration | — | Mistral, Z.AI, Qwen, DeepSeek, Kimi |
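The adapter layer can be pictured as a small abstract base class that each provider implements. The sketch below is hypothetical — the class and method names (`LLMAdapter`, `propose_tactics`, `EchoAdapter`) are illustrative, not UlamAI's actual internals:

```python
from abc import ABC, abstractmethod

class LLMAdapter(ABC):
    """Minimal sketch of a unified provider interface (hypothetical API)."""

    def __init__(self, model: str):
        self.model = model

    @abstractmethod
    def propose_tactics(self, goal_state: str, premises: list[str], k: int) -> list[str]:
        """Return up to k candidate tactic strings for the given goal."""

class EchoAdapter(LLMAdapter):
    """Stub adapter for offline testing: always proposes built-in automation."""

    def propose_tactics(self, goal_state, premises, k):
        fallbacks = ["aesop", "simp", "linarith", "ring", "norm_num"]
        return fallbacks[:k]

adapter = EchoAdapter(model="stub")
print(adapter.propose_tactics("⊢ 1 + 1 = 2", [], 3))  # → ['aesop', 'simp', 'linarith']
```

A design like this lets the search engine stay provider-agnostic: swapping OpenAI for Ollama changes only which adapter is constructed, not the proving loop.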
Recommended Model Configuration
The documentation recommends gpt-5.2-codex or gpt-5.3-codex as the primary proving models, suggesting these Codex-series models have been found effective for tactic generation in practice.
Embedding Support (for Retrieval)
| Setting | Variable | Default |
|---|---|---|
| API Key | ULAM_EMBED_API_KEY | Falls back to ULAM_OPENAI_API_KEY |
| Base URL | ULAM_EMBED_BASE_URL | Falls back to ULAM_OPENAI_BASE_URL |
| Model | ULAM_EMBED_MODEL | text-embedding-3-small |
TeX Proving: Split Planner/Worker Models
The informal TeX proving pipeline supports independent model assignment for different roles:
┌─────────────────────────────────────────┐
│ TeX Proving Model Hierarchy │
│ │
│ --tex-planner-model gpt-5.4 │
│ ↓ Planning, judging, verification, │
│ ↓ final composition │
│ │
│ --tex-worker-model gpt-5.4-mini │
│ ↓ Claim-draft workers (serial or │
│ ↓ concurrent evaluation) │
└─────────────────────────────────────────┘
This mirrors the planner/worker model split seen in OpenProver but applied to the informal proving pipeline rather than the overall system architecture.
LLM Usage Pattern in Proof Search
For each proof state in frontier:
│
├── 1. Retrieve premises (top-k from Mathlib/local)
│
├── 2. Try automation tactics first (aesop, simp, linarith, ring, norm_num)
│ └── If solved → emit proof, stop
│
├── 3. Prompt LLM for k candidate tactics (k = 8-16)
│
├── 4. Execute each tactic in Lean via selected backend
│ ├── Success → enqueue new state (beam capped)
│ └── Failure → request LLM repair step, bounded retries
│
├── 5. Update caches (step cache + state cache)
│
└── 6. Log everything to JSONL trace
6 Key Results
Reported Capabilities (Blog + README)
UlamAI's published materials focus on architectural capabilities rather than benchmark scores. Key claims:
- Working End-to-End Pipeline: The system produces machine-checked Lean 4 proofs for standard textbook theorems (irrationality of √2, etc.) and can attempt harder problems with extended budgets.
- Benchmark-Ready Infrastructure: Suite management, report generation, cross-report comparison with parity gates, and campaign scripts are all implemented. The system includes:
  - `ulam bench` for running suites
  - `ulam bench-compare` for cross-report analysis
  - `ulam bench-validate` for suite validation
  - `ulam bench-make-minif2f` for building miniF2F suites
  - `ulam bench-make-regression100` for fixed-size regression suites
  - Campaign scripts with gated comparisons
- Formalization Pipeline: The TeX-to-Lean pipeline handles segmentation, statement drafting, proof reconstruction, and semantic alignment with iterative repair loops.
- Multi-Backend Flexibility: Proven to work with OpenAI, Anthropic, Gemini, and Ollama backends.
Benchmark Infrastructure (Not Results)
UlamAI has invested heavily in benchmarking infrastructure, even though specific numerical results are not publicly reported:
| Capability | Implementation |
|---|---|
| Suite management | bench-list-suites, bench-validate, fixed-suite builder |
| miniF2F integration | bench-make-minif2f from local checkout, split selection |
| Fixed regression suite | bench-make-regression100 with deterministic seed |
| Report generation | JSON + Markdown with dataset/split breakdowns, top tags |
| Cross-report comparison | bench-compare with automated parity gate |
| Parity gate thresholds | --max-solved-drop, --max-success-rate-drop, --max-semantic-pass-rate-drop, --max-regression-rejection-rate-increase, --max-median-time-increase-pct, --max-planner-replan-triggers-increase, --max-planner-cached-tactic-tries-drop |
| Comparability checks | Suite SHA256 matching, inference profile matching |
| Campaign automation | run_bench_campaign.sh with timestamped artifacts + env snapshots |
| Inference profiles | fast, balanced, strict with configurable gen_k, exec_k, verify_level |
Parity Gate System
The benchmark comparison system deserves special attention. When comparing two runs:
python3 -m ulam bench-compare \
--a runs/bench/a.json --b runs/bench/b.json \
--gate \
--max-solved-drop 0 \
--max-success-rate-drop 0 \
--max-semantic-pass-rate-drop 0 \
--max-regression-rejection-rate-increase 0 \
--max-median-time-increase-pct 25
This implements a non-regression gate — ensuring that system changes don't degrade performance. The gate also checks comparability (same suite SHA256 and same inference profile/budgets), with explicit opt-out for intentional cross-profile comparisons.
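The gate's logic reduces to comparing summary fields between two reports and collecting threshold violations. The sketch below illustrates the idea under assumed report field names (`suite_sha256`, `solved`, `median_time`); the real report schema may differ:

```python
def parity_gate(a: dict, b: dict, max_solved_drop: int = 0,
                max_median_time_increase_pct: float = 25.0) -> list[str]:
    """Sketch of a non-regression gate: `a` is the baseline report,
    `b` the candidate. Returns a list of failures (empty → gate passes)."""
    # Comparability check: refuse to compare runs over different suites.
    if a["suite_sha256"] != b["suite_sha256"]:
        raise ValueError("reports are not comparable: different suites")
    failures = []
    if a["solved"] - b["solved"] > max_solved_drop:
        failures.append("solved count dropped")
    time_increase = 100.0 * (b["median_time"] - a["median_time"]) / a["median_time"]
    if time_increase > max_median_time_increase_pct:
        failures.append("median time regressed")
    return failures

baseline = {"suite_sha256": "abc", "solved": 40, "median_time": 10.0}
candidate = {"suite_sha256": "abc", "solved": 40, "median_time": 11.0}
print(parity_gate(baseline, candidate))  # → [] (10% slower is within the 25% budget)
```

In CI, a non-empty failure list would translate to a non-zero exit code, blocking the change.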
Connection to UnsolvedMath
Ulam Labs released UnsolvedMath, a benchmark of 1,146 open mathematical problems. The stated plan is to run UlamAI Prover against formalized versions of these problems, starting with restricted cases. This would represent one of the most ambitious applications of automated theorem proving — applying search-based LLM proving to genuinely open questions.
7 Reproducibility
Trace Logging
Every proof search run produces a JSONL trace file (run.jsonl by default) logging each search step:
| Field | Content |
|---|---|
| Proof state | Current Lean goal state |
| Proposed tactic | LLM-generated tactic attempt |
| Lean response | Success/failure with error messages |
| Scoring decision | Best-first priority assignment |
| Timing | Wall-clock duration per step |
Replay System
# Basic replay (show the trace)
python3 -m ulam replay run.jsonl
# Deterministic replay (re-execute every tactic)
python3 -m ulam replay run.jsonl --execute --strict
Strict mode: Checks alignment against replay metadata sidecar (*.meta.json) containing:
- Lean backend type
- Project path
- Toolchain version
- Mathlib commit/revision
- File hash
- Run configuration
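Conceptually, strict replay is an alignment check: recompute the current environment's fingerprint and compare it field by field against the recorded sidecar. This is a hypothetical sketch — the field names (`file_hash`, `backend`, `toolchain`) are assumptions based on the documented sidecar contents, not the actual schema:

```python
import hashlib

def check_replay_metadata(meta: dict, lean_source: bytes,
                          backend: str, toolchain: str) -> list[str]:
    """Return the list of sidecar fields that no longer match the
    current environment (empty list → safe to replay strictly)."""
    mismatches = []
    if meta.get("file_hash") != hashlib.sha256(lean_source).hexdigest():
        mismatches.append("file_hash")
    if meta.get("backend") != backend:
        mismatches.append("backend")
    if meta.get("toolchain") != toolchain:
        mismatches.append("toolchain")
    return mismatches

src = b"theorem t : 1 = 1 := rfl"
meta = {"file_hash": hashlib.sha256(src).hexdigest(),
        "backend": "cli", "toolchain": "leanprover/lean4:v4.8.0"}
print(check_replay_metadata(meta, src, "cli", "leanprover/lean4:v4.8.0"))  # → []
```

Any mismatch means the recorded trace may not re-execute identically, which is exactly what `--strict` is designed to surface before re-running tactics.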
Formalization Artifacts
Each formalization run creates timestamped artifact directories:
runs/formalize_YYYYMMDD_HHMMSS/
├── state.json # Current pipeline state (resumable)
├── events.jsonl # Step-by-step event log
├── summary.json # Final summary with metrics
├── WHITEBOARD.md # Planner scratchpad (TeX mode)
├── repo/ # Mutable memory items (TeX mode)
├── round_001/
│ ├── draft.lean # Generated Lean code
│ ├── errors.txt # Compilation errors
│ └── equivalence_report.json
├── round_002/...
└── final.lean # Best output
TeX Proving Artifacts
runs/prove_tex/tex_YYYYMMDD_HHMMSS/
├── state.json # Resumable state
├── events.jsonl # All events
├── summary.json # Run summary
├── WHITEBOARD.md # Persistent planner scratchpad
└── repo/ # Named memory items
Checkpoint Command
# Read-only health check of a Lean file
python3 -m ulam checkpoint examples/Smoke.lean \
--theorem irrational_sqrt_two_smoke --strict
Checks typecheck status, placeholder (sorry) counts, and axiom drift without modifying anything.
Review Command
# Get next-step suggestions from a trace
python3 -m ulam review --trace run.jsonl \
--file examples/Smoke.lean --theorem irrational_sqrt_two_smoke
Configuration Pinning
UlamAI pins environment state for reproducibility:
| Pinned Element | Storage |
|---|---|
| Lean toolchain version | .ulam/config.json |
| Mathlib commit | Trace metadata |
| LLM provider + model | Run configuration |
| Random seed | Configurable |
| Search parameters | Run configuration |
| Retrieval index | Versioned in project |
| Timeout settings | Run configuration |
8 Compute and API Costs
Inference Profiles
UlamAI implements configurable inference profiles that trade off speed vs. verification depth:
| Profile | gen_k | exec_k | verify_level | Use Case |
|---|---|---|---|---|
| fast | Low | Low | Minimal | Quick exploration, CI smoke tests |
| balanced | Medium | Medium | Standard | Default proving, benchmarking |
| strict | High | High | Thorough | High-confidence results, publication |
Profiles can be specified per-run:
python3 -m ulam bench --suite internal_regression \
--inference-profile balanced --gen-k 8 --exec-k 3 --verify-level strict
Cost Drivers
| Cost Component | Driver | Control |
|---|---|---|
| LLM tactic proposals | k candidates per state × states explored | gen_k, beam cap, timeout |
| LLM repair attempts | Failed tactics × repair retries | Bounded repair loop |
| Embedding calls | Premise retrieval queries | Retriever mode selection |
| Lean execution | Tactic compilation per candidate | exec_k, caching |
| Autoformalization | Multiple rounds × LLM drafts + repairs | max-rounds, max-proof-rounds |
| TeX proving | Planner steps × worker drafts | tex-rounds, tex-worker-drafts, tex-action-steps |
Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8+ GB |
| Disk | ~200 MB (base) | ~2 GB (with Lean project + Mathlib) |
| Network | Required for API models | — |
| Lean 4 + Mathlib | Required for real proofs | Via ulam -lean setup |
| LeanDojo + Pantograph | Required for dojo backend | Via ulam -lean |
Cost Optimization Strategies
- Automation-first: `aesop`, `simp`, `linarith`, `ring`, `norm_num` tried before the LLM on every state
- Step cache: `(state_key, tactic) → result` avoids duplicate Lean calls
- State cache (transposition table): `state_key → best_seen_score` prevents re-exploring
- Beam capping: Limits frontier size to prevent unbounded search
- Budget tracking: Time or token budgets with automatic conclusion phase
- Inference profiles: Pre-configured trade-offs between cost and thoroughness
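The step cache is the simplest of these savings to illustrate: memoize each `(state_key, tactic)` execution so a tactic is never re-run against the same goal. A minimal sketch, assuming a `run_tactic` callback standing in for the real Lean backend call:

```python
class StepCache:
    """Sketch of a (state_key, tactic) → result cache that avoids
    duplicate Lean calls during search (hypothetical names)."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def execute(self, state_key: str, tactic: str, run_tactic):
        key = (state_key, tactic)
        if key in self._cache:
            self.hits += 1          # served from cache, no Lean call
            return self._cache[key]
        self.misses += 1
        result = run_tactic(state_key, tactic)
        self._cache[key] = result
        return result

calls = []
def fake_lean(state, tactic):
    calls.append((state, tactic))   # record each "real" backend call
    return tactic == "ring"

cache = StepCache()
cache.execute("s0", "ring", fake_lean)
cache.execute("s0", "ring", fake_lean)  # second call is a cache hit
print(len(calls), cache.hits)           # → 1 1
```

Since Lean compilation dominates runtime, even a modest hit rate here translates directly into wall-clock savings.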
9 Architecture Solution
System Diagram
┌──────────────────────────────────────────────────────────────────────────┐
│ UlamAI System │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ CLI Entry Point (cli.py) │ │
│ │ │ │
│ │ ulam prove │ ulam formalize │ ulam bench │ ulam replay │ │
│ │ ulam review │ ulam checkpoint│ ulam bench-* │ ulam index │ │
│ │ ulam -lean │ ulam auth │ ulam (TUI menu)│ │ │
│ └───────────┬───────────────┬──────────────┬───────────────────────┘ │
│ │ │ │ │
│ ┌───────────▼───────────┐ │ ┌───────────▼─────────────────────┐ │
│ │ PROVE PIPELINE │ │ │ FORMALIZE PIPELINE │ │
│ │ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ │ Phase A: Ingest + Segment TeX │ │
│ │ │ Search Engine │ │ │ │ Phase B: Statement Drafting │ │
│ │ │ (best-first/beam)│ │ │ │ Phase C: Proof Reconstruction │ │
│ │ │ + transposition │ │ │ │ Phase D: Semantic Alignment │ │
│ │ │ table │ │ │ │ Phase E: Consolidation │ │
│ │ └────────┬────────┘ │ │ │ │ │
│ │ │ │ │ │ ┌─────────────────────────┐ │ │
│ │ ┌────────▼────────┐ │ │ │ │ Fixpoint Loop: │ │ │
│ │ │ LLM Policy │ │ │ │ │ Draft → Typecheck → │ │ │
│ │ │ (k=8-16 tactics)│ │ │ │ │ Proof Search → Validate │ │ │
│ │ └────────┬────────┘ │ │ │ │ ← Repair / Rewrite │ │ │
│ │ │ │ │ │ └─────────────────────────┘ │ │
│ │ ┌────────▼────────┐ │ │ └───────────────────────────────────┘ │
│ │ │ Retriever │ │ │ │
│ │ │ (premises) │ │ │ │
│ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │
│ │ ┌────────▼────────┐ │ │ │
│ │ │ Autop Fallback │ │ │ │
│ │ │ aesop/simp/ │ │ │ │
│ │ │ linarith/ring/ │ │ │ │
│ │ │ norm_num │ │ │ │
│ │ └─────────────────┘ │ │ │
│ └───────────────────────┘ │ │
│ │ │ │
│ ┌───────────▼───────────────▼──────────────────────────────────────┐ │
│ │ LEAN BACKEND LAYER │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ LeanDojo │ │ Lean CLI │ │ Lean LSP │ │ │
│ │ │ (Pantograph) │ │ (lake env │ │ (Language Server) │ │ │
│ │ │ │ │ lean) │ │ [experimental] │ │ │
│ │ │ Goal-state │ │ Batch │ │ Interactive │ │ │
│ │ │ access │ │ typecheck │ │ goal/diagnostic │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ LLM ADAPTER LAYER │ │
│ │ │ │
│ │ OpenAI │ Codex CLI │ Claude API │ Claude CLI │ Gemini │ Ollama │ │
│ │ Custom API (Mistral, Z.AI, Qwen, DeepSeek, Kimi) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ BENCHMARK LAYER │ │
│ │ │ │
│ │ Suite Management │ Report Generation │ Comparison + Gates │ │
│ │ Campaign Scripts │ miniF2F Builder │ Regression100 Builder │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ TRACE + REPLAY LAYER │ │
│ │ │ │
│ │ JSONL Logging │ Metadata Sidecars │ Deterministic Replay │ │
│ │ Review + Checkpoint │ Artifact Management │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Proof Search Data Flow
Input: Lean file + theorem name (with sorry)
│
▼
┌──────────────────────────────┐
│ 1. Initialize │
│ Load file via Lean backend│
│ Get target proof state │
│ Create root search node │
└───────────┬──────────────────┘
│
▼
┌──────────────────────────────┐
│ 2. Search Loop │◄─────────────────────────┐
│ Select best state from │ │
│ frontier (best-first) │ │
└───────────┬──────────────────┘ │
│ │
▼ │
┌──────────────────────────────┐ │
│ 3. Try autop tactics first │ │
│ aesop → simp → linarith │ │
│ → ring → norm_num │ │
│ If solved → emit proof ─────── EXIT (success) │
└───────────┬──────────────────┘ │
│ (none solved) │
▼ │
┌──────────────────────────────┐ │
│ 4. Retrieve premises (top-k) │ │
│ Token-overlap or embedding│ │
└───────────┬──────────────────┘ │
│ │
▼ │
┌──────────────────────────────┐ │
│ 5. LLM proposes k tactics │ │
│ (k = 8-16 per state) │ │
└───────────┬──────────────────┘ │
│ │
▼ │
┌──────────────────────────────┐ │
│ 6. Execute each in Lean │ │
│ ├── Success → new state ──│── Check cache ──► Enqueue │
│ └── Failure → repair loop │ (beam capped) │
│ └── LLM fix attempt │ │
│ └── Re-execute │ │
└───────────┬──────────────────┘ │
│ │
▼ │
┌──────────────────────────────┐ │
│ 7. Update caches + log trace │ │
│ Step cache + state cache │ │
│ JSONL trace entry │ │
└───────────┬──────────────────┘ │
│ │
▼ │
┌──────────────────────────────┐ │
│ 8. Check termination │ │
│ Solved? → EXIT (success) │ │
│ Budget exhausted? → EXIT │ │
│ Frontier empty? → EXIT │ │
│ Otherwise → CONTINUE ─────────────────────────────────┘
└──────────────────────────────┘
10 Component Breakdown
Package Structure
ulam/
├── cli.py # CLI entry point, TUI menu, argument parsing
├── prover.py # Core proof search loop
├── search.py # Best-first / beam search engine
├── retriever.py # Premise retrieval (token-overlap + embedding)
├── llm/ # LLM adapter layer
│ ├── openai.py # OpenAI-compatible adapter
│ ├── anthropic.py # Anthropic Claude adapter
│ ├── codex_cli.py # Codex CLI wrapper
│ ├── claude_cli.py # Claude Code CLI wrapper
│ ├── gemini.py # Gemini adapter
│ ├── ollama.py # Ollama adapter
│ └── base.py # Base adapter interface
├── lean/ # Lean backend layer
│ ├── dojo.py # LeanDojo/Pantograph runner
│ ├── cli_runner.py # lake env lean CLI runner
│ ├── lsp.py # Lean language server (experimental)
│ └── parser.py # Lean file parsing
├── formalize/ # Autoformalization pipeline
│ ├── segmenter.py # TeX → structured chunks
│ ├── drafter.py # Statement generation
│ ├── prover.py # Per-lemma proof loops
│ ├── equivalence.py # Semantic alignment checks
│ └── pipeline.py # Multi-phase orchestration
├── tex/ # Informal TeX proving pipeline
│ ├── planner.py # Planner-action loop
│ ├── worker.py # Claim-draft workers
│ ├── judge.py # Judge + adversarial verifier
│ ├── whiteboard.py # Persistent scratchpad
│ └── composer.py # Final proof assembly
├── bench/ # Benchmark infrastructure
│ ├── runner.py # Suite execution
│ ├── report.py # Report generation
│ ├── compare.py # Cross-report comparison
│ └── suites/ # Suite definitions
├── index/ # Retrieval index management
│ └── builder.py # Index construction
├── replay.py # Deterministic replay
├── review.py # Next-step suggestion
├── checkpoint.py # Read-only health check
└── config.py # Configuration management
Core Module Descriptions
| Module | Responsibility | Key Abstraction |
|---|---|---|
| cli.py | Entry point, TUI menu, interactive configuration, auth flows | Command dispatch |
| prover.py | Core proving loop: state management, tactic execution, repair loops | Prover class |
| search.py | Best-first / beam search with transposition table | SearchEngine |
| retriever.py | Premise selection from Mathlib or local repo | Retriever (token-overlap / embedding) |
| llm/ | Unified interface for all LLM providers | LLMAdapter base |
| lean/ | Backend abstraction for Lean verification | LeanRunner ABC |
| formalize/ | Multi-phase TeX → Lean pipeline | FormalizePipeline |
| tex/ | Informal planner-action proving loop | TexProver |
| bench/ | Suite management, reporting, comparison | BenchRunner, BenchCompare |
| replay.py | Trace re-execution with metadata validation | Replayer |
Dependencies
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "ulam-prover"
version = "0.2.11"
requires-python = ">=3.10"
license = { text = "MIT" }
The base package has minimal hard dependencies. Heavy dependencies (LeanDojo, Pantograph, torch, sentence-transformers) are installed on demand via the ulam -lean setup flow.
11 Core Mechanisms (Detailed)
11.1 Best-First Search with Beam Capping
The search engine maintains a priority queue of proof states:
Priority Queue (beam-capped)
┌────────────────────────────────────────────────┐
│ State A (score: 0.95) ← next to expand │
│ State B (score: 0.87) │
│ State C (score: 0.82) │
│ State D (score: 0.71) │
│ ... (capped at beam_width) │
│ │
│ Transposition Table: │
│ state_key_A → best_seen_score: 0.95 │
│ state_key_B → best_seen_score: 0.87 │
│ state_key_E → best_seen_score: 0.65 (pruned) │
└────────────────────────────────────────────────┘
Why best-first/beam before MCTS: As the project documentation explains, "Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later." The search algorithm is explicitly chosen for practical reasons — debuggability and the fact that the bottleneck is Lean compilation, not search branching.
Transposition Table: The state cache maps state_key → best_seen_score. When the search encounters a state it has already seen (reached via a different tactic path), it checks whether the new score improves on the cached one. If not, the state is pruned. This prevents the combinatorial explosion of re-exploring equivalent states reached by different tactic sequences.
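The interplay of the priority queue, beam cap, and transposition table can be sketched in a few lines. This is an illustrative skeleton, not UlamAI's actual engine: `expand(state)` stands in for "propose tactics and run them in Lean", and `score(state)` for the best-first heuristic:

```python
import heapq

def best_first(root, expand, score, beam_width=32, max_steps=100):
    """Sketch of beam-capped best-first search with a transposition table.
    States must be hashable; `expand` and `score` are stand-ins for the
    LLM-plus-Lean machinery described above."""
    frontier = [(-score(root), root)]       # max-heap via negated scores
    best_seen = {root: score(root)}         # transposition table
    for _ in range(max_steps):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)  # select best state to expand
        for child in expand(state):
            s = score(child)
            if best_seen.get(child, float("-inf")) >= s:
                continue                    # already reached at least as well: prune
            best_seen[child] = s
            heapq.heappush(frontier, (-s, child))
        # Beam cap: keep only the top beam_width entries of the frontier.
        frontier = heapq.nsmallest(beam_width, frontier)
        heapq.heapify(frontier)
    return best_seen

# Toy domain: states are integers, children are n+1 and n+2, larger is better.
seen = best_first(0, lambda n: [n + 1, n + 2] if n < 10 else [], lambda n: n)
print(10 in seen)  # → True
```

In the toy run, state 11 is reachable both via 9 and via 10; the transposition table ensures it is enqueued only once, which is the pruning behavior described above.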
11.2 Retrieval-Augmented Premise Selection
Two retrieval modes are available:
| Mode | Mechanism | Speed | Quality | Dependencies |
|---|---|---|---|---|
| Token-overlap | BM25-style keyword matching | Fast | Adequate | None |
| Embedding-based | Vector similarity search | Moderate | Higher | Embedding API |
Premise Format:
lemma_name : statement
One premise per line, fetched from a --premises file or auto-indexed:
# Auto-index from local project + Mathlib
python3 -m ulam index build \
--project /path/to/lean-project \
--scope both \
--out .ulam/premises_both.jsonl
Retrieval Sources:
| Source | Flag | Content |
| --- | --- | --- |
| Local declarations | --retriever-source local | Project-specific lemmas and definitions |
| Mathlib declarations | --retriever-source mathlib | Full Mathlib library (~100k+ lemmas) |
| Both | --retriever-source both | Combined index |
Why Retrieval Matters: Mathlib contains over 100,000 lemmas. Without retrieval, the model must rely on memorized knowledge — knowledge that degrades with model size, quantization, and the ever-growing surface area of Mathlib itself. Retrieval turns "do you remember this lemma?" into "here is the lemma; use it."
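The token-overlap mode can be approximated with plain set intersection over tokens. The sketch below is a deliberately simple keyword score, not the actual weighting UlamAI uses; premises follow the documented `lemma_name : statement` line format:

```python
def token_overlap_retrieve(goal: str, premises: list[str], top_k: int = 3) -> list[str]:
    """Rank premises by how many whitespace tokens they share with the goal
    (a crude stand-in for BM25-style keyword matching)."""
    goal_tokens = set(goal.lower().split())

    def overlap(premise: str) -> int:
        return len(goal_tokens & set(premise.lower().split()))

    return sorted(premises, key=overlap, reverse=True)[:top_k]

premises = [
    "Nat.add_comm : ∀ (a b : ℕ), a + b = b + a",
    "Nat.mul_comm : ∀ (a b : ℕ), a * b = b * a",
    "List.length_append : (l ++ m).length = l.length + m.length",
]
# For an additive commutativity goal, the add_comm lemma ranks first.
print(token_overlap_retrieve("⊢ a + b = b + a", premises, top_k=1))
```

Embedding-based retrieval replaces the `overlap` scorer with cosine similarity over vectors, trading an API dependency for better semantic matching.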
11.3 Automation Tactic Fallbacks (autop)
Before every LLM query, UlamAI tries a cascade of built-in Lean automation:
Proof state arrived
│
├── Try: aesop
│ └── Success? → SOLVED (no LLM needed)
├── Try: simp
│ └── Success? → SOLVED
├── Try: linarith
│ └── Success? → SOLVED
├── Try: ring
│ └── Success? → SOLVED
└── Try: norm_num
└── Success? → SOLVED
│
└── All failed → Query LLM for k tactics
Rationale: Many subgoals that look hard to a language model are trivial for Lean's built-in automation. This saves: - LLM API tokens (no call needed) - Search budget (one-step solution vs. multi-step exploration) - Wall-clock time (local computation vs. API round-trip)
11.4 Repair Loop
When a tactic fails, UlamAI doesn't just discard it — it feeds the error back to the LLM:
LLM proposes tactic T
│
▼
Execute T in Lean
│
├── Success → accept, enqueue new state
│
└── Failure (error message E)
│
▼
Ask LLM to repair T given error E
│
▼
Execute repaired T' in Lean
│
├── Success → accept
└── Failure → bounded retries exhausted → discard
This error-feedback loop is a key differentiator from pure one-shot generation. Lean's type error messages are precise and informative — they tell the model exactly what went wrong (missing argument, type mismatch, unknown identifier). This structured feedback enables targeted repairs.
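The bounded repair loop amounts to a retry wrapper around tactic execution. In the sketch below, `run` and `repair` are stand-ins for the Lean backend and the LLM repair prompt respectively; the names and signatures are assumptions for illustration:

```python
def execute_with_repair(tactic: str, goal: str, run, repair, max_repairs: int = 2):
    """Run a tactic; on failure, feed the Lean error back for a revised
    attempt, up to max_repairs times. Returns the working tactic or None.
    `run(goal, tactic)` → (ok, error); `repair(tactic, error)` → new tactic."""
    current = tactic
    for _ in range(max_repairs + 1):     # initial attempt + bounded retries
        ok, error = run(goal, current)
        if ok:
            return current
        current = repair(current, error)  # targeted fix from the error message
    return None                           # retries exhausted → discard candidate

# Toy example: the backend wants `ring_nf`; the "repair" appends `_nf`.
run = lambda goal, tac: (tac == "ring_nf", "" if tac == "ring_nf" else "unknown tactic")
repair = lambda tac, err: tac + "_nf"
print(execute_with_repair("ring", "⊢ a * b = b * a", run, repair))  # → ring_nf
```

Keeping the retry count small matters: each repair attempt costs an LLM call plus a Lean execution, so unbounded repair would defeat the budget tracking described earlier.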
11.5 TeX Proving Pipeline (Informal Proofs)
The --output-format tex pipeline is a sophisticated orchestration system:
┌────────────────────────────────────────────────────┐
│ TeX Proving Architecture │
│ │
│ v0.2.11 Flow: │
│ 1. Try one deep whole-theorem attempt │
│ using single primary model │
│ 2. If fails → fall back to planner/worker │
│ claim decomposition │
│ │
│ Planner Action Loop: │
│ ┌──────────────────────────────────────────────┐ │
│ │ Planner chooses action: │ │
│ │ • plan — decompose into claim graph │ │
│ │ • solve — workers draft claims │ │
│ │ • write_memory — update WHITEBOARD/repo │ │
│ │ • compose — assemble final proof │ │
│ │ • give_up — exhaust budget │ │
│ └──────────────────────┬───────────────────────┘ │
│ │ │
│ Claim Graph: │ │
│ ┌──────────────────────▼───────────────────────┐ │
│ │ claims + dependencies + assumptions + │ │
│ │ required facts + worker guidance + │ │
│ │ repo reads │ │
│ └──────────────────────┬───────────────────────┘ │
│ │ │
│ Workers: │ │
│ ┌──────────────────────▼───────────────────────┐ │
│ │ Draft each claim (serial or concurrent) │ │
│ │ → Judge + Adversarial Verifier + │ │
│ │ Domain Checker │ │
│ │ → Accept / Reject │ │
│ └──────────────────────┬───────────────────────┘ │
│ │ │
│ Persistent Memory: │ │
│ ┌──────────────────────▼───────────────────────┐ │
│ │ WHITEBOARD.md + repo/ items │ │
│ │ CRUD operations by planner │ │
│ │ [[wikilink]] expansion in prompts            │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Replan/Backtrack: │
│ - Pass-based: reset decomposition if stalled │
│ - Bounded: --tex-replan-passes limit │
│ │
│ Output: composed .tex proof draft │
└────────────────────────────────────────────────────┘
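The planner action loop reduces to a small dispatch cycle. A minimal sketch, where only the action names come from the diagram above; `choose_action` and the per-action handlers are hypothetical stand-ins:

```python
# Illustrative planner action loop (action names from the diagram above).
# choose_action and handlers are hypothetical stand-ins, not the real API.
def planner_loop(choose_action, handlers, max_steps=10):
    """Dispatch planner actions until compose/give_up or the step budget."""
    for step in range(max_steps):
        action = choose_action(step)
        if action in ("compose", "give_up"):
            return action          # terminal: final proof or budget exhausted
        handlers[action]()         # plan / solve / write_memory side effects
    return "give_up"               # bounded planner-action steps exhausted
```

The `--tex-action-steps` budget corresponds to `max_steps` here: the planner is forced to a terminal action rather than looping indefinitely.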
Configuration Knobs:
| Flag | Default | Description |
|---|---|---|
| --tex-rounds | — | Max fixed orchestration rounds |
| --tex-worker-drafts | 1 | Worker candidates per claim per round |
| --tex-concurrency | off | Toggle concurrent evaluation of drafts |
| --tex-judge-repairs | — | Max rounds without accepted claim |
| --tex-replan-passes | — | Max decomposition resets |
| --tex-action-steps | — | Bounded planner-action steps |
| --tex-planner-model | — | Override model for planning/judging |
| --tex-worker-model | — | Override model for claim workers |
11.6 Autoformalization Pipeline (TeX → Lean)
The formalization pipeline is a five-phase process with recursive loops:
Phase A — Ingest + Segment:
LaTeX document
│
▼
Parse into logical chunks:
├── Definitions
├── Lemmas
├── Theorems
└── Proof sketches
│
▼
Extract named entities + symbols
│
▼
Build dependency graph
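The dependency graph from Phase A yields a safe drafting order via topological sort. An illustrative sketch, assuming chunks are represented as a name-to-references mapping (this representation is an assumption, not the repository's):

```python
# Hypothetical Phase-A sketch: order parsed chunks so that definitions
# come before the lemmas/theorems that reference them.
from graphlib import TopologicalSorter

def order_chunks(chunks):
    """chunks: {name: set of names it references}. Returns a safe draft order."""
    return list(TopologicalSorter(chunks).static_order())
```

With this ordering, Phase B can draft each Lean declaration knowing its prerequisites already exist in the skeleton.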
Phase B — Statement Drafting (Lean Skeleton):
For each definition/lemma/theorem:
│
├── Generate Lean signature with sorry proofs
├── Insert placeholders for unknown types
│
▼
Run Lean to typecheck skeleton
│
├── Missing imports → fix
├── Undeclared constants → fix
├── Type mismatches → fix
├── Notation issues → fix
│
└── Loop until file typechecks with sorry proofs
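A Phase-B skeleton might look like the following illustrative Lean file (not taken from the repository): statements are fully stated, while every proof is deferred with `sorry` so the file typechecks end to end.

```lean
-- Illustrative Phase-B skeleton: statements typecheck, proofs are deferred.
import Mathlib

def IsEven (n : ℕ) : Prop := ∃ k, n = 2 * k

theorem even_add_even (m n : ℕ) (hm : IsEven m) (hn : IsEven n) :
    IsEven (m + n) := sorry
```

Phase C then replaces each `sorry` with a found proof, one goal at a time.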
Phase C — Proof Reconstruction:
For each sorry:
│
├── Open goal in Lean backend
├── Run proof search pipeline
│
├── If proof fails:
│ ├── Retrieve similar lemmas
│ ├── Reformulate statement
│ └── Generate auxiliary lemmas
│
└── If still failing: leave sorry, log TODO
Phase D — Semantic Alignment (Recursive):
┌──────────────────────────────────────────────┐
│ Draft → Typecheck → Proof Search → │
│ Validate Equivalence │
│ ↑ ↓ │
│ └── Repair / Rewrite ──┘ │
│ │
│ Statement equivalence check: │
│ LLM judges Lean stmt vs informal stmt │
│ Mismatch → enqueue repair task │
│ │
│ Gap detection: │
│ Missing assumptions → add hypotheses │
│ Hidden lemmas → generate intermediates │
│ │
│ Re-run proofs: │
│ Statement change → invalidate downstream │
│ → recursive repair │
└──────────────────────────────────────────────┘
Phase E — Consolidation:
- Produce final .lean file
- Record proof success rate, remaining sorrys, error traces
- Save all artifacts
11.7 Axiom Control
UlamAI includes axiom safety controls:
# Default: axioms/constants allowed
python3 -m ulam prove File.lean --theorem T
# Strict: no axioms allowed
python3 -m ulam prove File.lean --theorem T --no-allow-axioms
This is important for proof integrity — a proof that introduces new axioms or constants may not be valid in the intended mathematical framework.
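To see why, consider this illustrative Lean fragment (not from the repository): a single unsound axiom lets any statement be "proved", and only an axiom audit catches it.

```lean
-- Illustrative only: a "proof" that silently depends on an unsound axiom.
axiom bogus : False

theorem anything : 1 = 2 := bogus.elim

-- #print axioms anything   -- would report that `anything` depends on `bogus`
```

A strict mode that rejects new axioms rules out exactly this failure class.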
11.8 Scripted Solver (Sequential Tactics)
In addition to search-based proving, UlamAI supports sequential script mode:
┌─────────────────────────────────────┐
│ Script Solver │
│ │
│ 1. LLM generates tactic sequence │
│ 2. Execute tactics sequentially │
│ 3. Each step validated by Lean │
│ 4. Backtrack on failure │
│ │
│ Portfolio strategy: │
│ 1. Try script solver first │
│ 2. If fails → best-first search │
└─────────────────────────────────────┘
This is enabled by default and works well for proofs that follow a natural sequential structure.
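The portfolio strategy is a two-stage fallback. A minimal sketch, with `script_solve` and `best_first_search` as hypothetical stand-ins for the two solvers:

```python
# Sketch of the portfolio strategy described above; script_solve and
# best_first_search are hypothetical stand-ins for the two solvers.
def prove(goal, script_solve, best_first_search):
    """Try the cheap sequential script first, fall back to search."""
    proof = script_solve(goal)
    if proof is not None:
        return proof                # sequential tactic script sufficed
    return best_first_search(goal)  # otherwise pay for best-first search
```

The ordering matters for cost: a successful script avoids the many Lean executions that search would incur.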
12 Programming Language
Language and Ecosystem
| Aspect | Choice |
|---|---|
| Primary Language | Python |
| Python Version | >=3.10 |
| Build System | setuptools (>=68) + wheel |
| Package Name | ulam-prover |
| Entry Point | ulam = ulam.cli:main |
| Formal Verification Target | Lean 4 (with Mathlib) |
| Distribution | PyPI + Homebrew |
Distribution Channels
UlamAI is unusually well-distributed for a research tool:
| Channel | Command | Notes |
|---|---|---|
| Homebrew | brew tap ulamai/ulamai && brew install ulamai | Auto-updated on release |
| PyPI | pip install ulam-prover | Standard Python install |
| Clone + install | ./install.sh | Editable install |
| One-liner | curl -fsSL ... \| bash | Quick install script |
| Colab | Notebook in examples/ | Zero-install exploration |
Integration Language: Lean 4
While UlamAI is written in Python, it operates on Lean 4 code:
| Lean Component | Purpose |
|---|---|
| Lean 4 | Proof kernel (ground truth) |
| Mathlib | Mathematical library (100k+ lemmas) |
| LeanDojo | Python interface to Lean proof states |
| Pantograph (PyPantograph) | Server for tactic-level interaction |
| lake | Build tool for Lean projects |
| elan | Lean toolchain manager |
Toolchain Alignment
UlamAI carefully manages Lean toolchain alignment:
- Pantograph is anchored to a specific Lean toolchain (from its src/lean-toolchain)
- UlamAI aligns the Mathlib project to that toolchain when possible
- A --use-mathlib-toolchain flag is available for explicit control
- Toolchain pinning is part of the reproducibility metadata
13 Memory Management
Cache Architecture
UlamAI implements a multi-level caching system to minimize redundant Lean execution:
┌────────────────────────────────────────────────────┐
│ CACHE HIERARCHY │
│ │
│ Level 1: Step Cache │
│ ├── Key: (state_key, tactic) │
│ ├── Value: execution result │
│ ├── Purpose: Avoid re-running same tactic on │
│ │ same state │
│ └── Eviction: None (grows with search) │
│ │
│ Level 2: State Cache (Transposition Table) │
│ ├── Key: state_key (canonical hash) │
│ ├── Value: best_seen_score │
│ ├── Purpose: Avoid re-exploring states reached │
│ │ via different paths │
│ └── Eviction: None (grows with search) │
│ │
│ Level 3: Retrieval Index │
│ ├── Key: project + scope │
│ ├── Value: premises JSONL │
│ ├── Purpose: Pre-computed premise index │
│ └── Location: .ulam/premises_*.jsonl │
│ │
│ Level 4: LLM Response Cache │
│ ├── Not implemented (each call is logged but │
│ │ not cached for reuse) │
│ └── Potential optimization point │
└────────────────────────────────────────────────────┘
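Levels 1 and 2 can be sketched in a few lines; the class and method names below are assumptions, not UlamAI's actual API:

```python
# Minimal sketch of cache levels 1-2 (names assumed, not the real API):
# a step cache keyed by (state_key, tactic) and a transposition table
# keyed by state_key alone, both unbounded as described above.
class SearchCaches:
    def __init__(self):
        self.step_cache = {}      # (state_key, tactic) -> execution result
        self.transpositions = {}  # state_key -> best_seen_score

    def run_tactic(self, state_key, tactic, execute):
        """Return a cached result, or run Lean once and memoize it."""
        key = (state_key, tactic)
        if key not in self.step_cache:
            self.step_cache[key] = execute(state_key, tactic)
        return self.step_cache[key]

    def should_expand(self, state_key, score):
        """Skip states already reached via a path scoring at least as well."""
        best = self.transpositions.get(state_key)
        if best is not None and best >= score:
            return False
        self.transpositions[state_key] = score
        return True
```

Because neither cache evicts, memory grows with the search; that matches the "Eviction: None" entries in the hierarchy above.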
State Canonicalization
A critical challenge in proof search is determining when two proof states are "the same." UlamAI uses state hashing for the transposition table, but acknowledges this as an area for improvement:
"Better state canonicalization (stable hashing)" is listed as a v0.2 roadmap item.
The difficulty: Lean proof states contain goal hypotheses, types, and metavariables. Two states might differ in hypothesis naming but be semantically equivalent. Imperfect canonicalization leads to: - False negatives: Missing equivalent states → redundant exploration - False positives (rare): Colliding different states → incorrect pruning
TeX Mode Memory
The TeX proving pipeline maintains its own persistent memory:
| Component | Persistence | Mutability | Scope |
|---|---|---|---|
| WHITEBOARD.md | Across rounds/passes | Read/write by planner | Planning state |
| repo/ items | Across rounds/passes | CRUD by planner | Named knowledge items |
| Claim graph | Per pass | Created by planner | Proof decomposition |
| Worker drafts | Per round | Generated by workers | Claim solutions |
| state.json | Across runs | Updated per event | Full resumable state |
| events.jsonl | Append-only | Immutable | Audit trail |
The [[wikilink]] expansion mechanism works identically to OpenProver's — repo items are referenced by name and expanded into prompts. This is not coincidental; both systems likely draw from the same Aletheia-inspired architectural pattern.
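Assuming a `[[name]]` reference syntax (an assumption; the report does not pin down the exact syntax), wikilink expansion is a small substitution pass over the prompt:

```python
import re

# Plausible sketch of wikilink expansion; the [[name]] syntax is assumed.
# repo maps item names to their stored content.
def expand_wikilinks(prompt, repo):
    """Replace [[name]] references with stored repo item content."""
    return re.sub(r"\[\[([^\]]+)\]\]",
                  lambda m: repo.get(m.group(1), m.group(0)),
                  prompt)
```

Unresolved references are left verbatim rather than dropped, so a dangling link stays visible in the prompt.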
Configuration Memory
UlamAI persists configuration in .ulam/config.json:
{
"lean_project": "/path/to/lean-project",
"llm_provider": "openai",
"llm_model": "gpt-5.2-codex",
"lean_backend": "dojo",
"prove_mode": "tactic"
}
This file is auto-generated by setup flows (ulam -lean, TUI configuration) and loaded on every invocation.
14 Continued Learning
Planned Learning Pipeline
UlamAI has an explicit, public roadmap for progressive learning capabilities:
| Version | Learning Capability | Status |
|---|---|---|
| v0.1 (shipped) | Trace logging to JSONL | Implemented |
| v0.2 (current) | Better scoring heuristics (non-RL) | In progress |
| v0.3 (planned) | SFT training loop on traces | Not started |
| v0.4 (planned) | Learned value model for search guidance | Not started |
| v0.5 (planned) | On-policy RL (PPO/GRPO-style) | Not started |
v0.3 — Supervised Fine-Tuning from Traces
The most concrete near-term learning plan:
Successful proof traces (JSONL)
│
▼
Extract (state, tactic) pairs
│
▼
Build "proofstep dataset" for model family
│
▼
SFT on tactic policy
│
▼
Evaluate on miniF2F slice + internal suite
Key Insight: Every successful UlamAI proof run generates training data. The JSONL traces contain exactly the (state, action, reward) tuples needed for supervised learning. Failed proof attempts provide hard negatives.
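Extracting the proofstep dataset from traces is mechanically simple. A sketch, assuming a JSONL event shape with state, tactic, and success fields (the real trace schema may differ):

```python
import json

# Assumed JSONL event shape: {"state": ..., "tactic": ..., "success": ...}.
# The real UlamAI trace schema may differ.
def proofstep_pairs(jsonl_lines):
    """Extract (state, tactic) pairs from successful steps for SFT."""
    pairs = []
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("success"):
            pairs.append((event["state"], event["tactic"]))
    return pairs
```

The same pass over failed events would yield the hard negatives mentioned above.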
v0.4 — Learned Value Model
Proof traces (successful + failed)
│
▼
Train value model:
state → predicted proof distance
│
▼
Replace heuristic scoring in search:
best_first_priority = value_model(state)
│
▼
MCTS/MCGS option with learned value
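Swapping a learned value model into best-first search only changes the priority function. An illustrative sketch, with `value_model`, `expand`, and `is_proved` as hypothetical stand-ins (lower score means the model predicts the state is closer to a finished proof):

```python
import heapq

# Illustrative best-first search driven by a learned value model.
# value_model, expand, and is_proved are hypothetical stand-ins.
def best_first(initial_state, value_model, expand, is_proved, budget=1000):
    """Always pop the state the value model scores best (lowest)."""
    frontier = [(value_model(initial_state), 0, initial_state)]
    tie = 1  # monotone tiebreaker so heapq never compares states directly
    for _ in range(budget):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for child in expand(state):
            heapq.heappush(frontier, (value_model(child), tie, child))
            tie += 1
    return None
```

Replacing `value_model` with a hand-written heuristic recovers the current v0.2 behavior; training it on traces is exactly the v0.4 step.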
v0.5 — Reinforcement Learning
On-policy fine-tuning (PPO/GRPO-style)
│
├── Successful trajectories → positive reward
├── Near-miss trajectories → shaped reward
└── Curriculum scheduling: easy → harder
v0.3 Additional Plans (Structural)
Both proving tracks (TeX and Lean formalization) are planned to receive:
- Dependency-first proving: Decompose into concept/claim or declaration DAG first, then solve bottom-up
- Live grounding retrieval: Force term/definition grounding from current Mathlib/local sources before drafting
- Compiler/typecheck-in-the-loop reflection: At each node, not only at the end
- Semantic gate beyond syntax: Check whether generated claims match intended meanings
- Abstain capability: Explicit no-output or partial-safe outcome when gates fail
- Autonomy protocol artifacts: Fixed verification/extraction prompts, full transcripts, autonomy card
Cross-Run Learning (Current Gap)
Like OpenProver, UlamAI currently has no cross-run learning mechanism. Each proof attempt starts from scratch. However, the planned SFT pipeline (v0.3) would close this gap by training improved models from accumulated traces.
UnsolvedMath Integration
The planned integration with UnsolvedMath (1,146 open problems) creates a natural curriculum:
Easy problems (textbook) → build traces
→ SFT model on traces
→ attempt harder problems
→ accumulate more traces
→ further SFT/RL
→ attempt open problems from UnsolvedMath
This is a concrete path toward evolutionary improvement through accumulated experience.
15 Applications
Primary Application: Verified Mathematical Proof Discovery
UlamAI serves multiple user profiles with distinct workflows:
| User Profile | Workflow | Key Commands |
|---|---|---|
| Mathematician | Prove a conjecture in Lean | ulam prove File.lean --theorem T |
| Formalization engineer | Convert a paper to Lean | ulam formalize paper.tex --out paper.lean |
| Student | Explore proof strategies | ulam prove --output-format tex --statement "..." |
| Researcher | Benchmark proving models | ulam bench --suite regression |
| Developer | Non-regression testing | ulam bench-compare --gate --a base.json --b pr.json |
Concrete Workflow Examples
Workflow A: Lean Formal Proving
ulam prove examples/Smoke.lean \
--theorem irrational_sqrt_two_smoke \
--prove-mode llm --lean lsp --llm codex_cli
Workflow B: Informal TeX Proving
ulam prove --theorem infinitely_many_primes \
--output-format tex \
--statement "Prove that there are infinitely many prime numbers." \
--llm codex_cli --tex-rounds 3 --tex-worker-drafts 2
Workflow C: Autoformalization
ulam formalize examples/pol25.tex --out examples/pol25.lean \
--proof-backend llm --lean-backend dojo \
--max-rounds 5 --max-proof-rounds 2
Workflow D: Benchmarking
ulam bench --suite internal_regression --llm codex_cli \
--openai-model gpt-5.3-codex --lean dojo \
--inference-profile balanced
Research Applications
1. Benchmark-Driven Model Evaluation: UlamAI's benchmark infrastructure enables systematic comparison of:
   - Different LLM providers on the same theorem suite
   - Different inference profiles (speed vs. thoroughness)
   - Different proof modes (tactic vs. lemma vs. LLM-only)
   - Different Lean backends (LeanDojo vs. CLI vs. LSP)
2. Training Data Generation: Every proof run produces JSONL traces that can serve as:
   - Supervised fine-tuning datasets for tactic prediction
   - Hard negatives from failed proof attempts
   - Curriculum data ordered by difficulty
3. Autoformalization Research: The TeX → Lean pipeline enables studying:
   - Semantic drift between informal and formal statements
   - Multi-round repair effectiveness
   - Gap detection in informal proofs
4. Agentic Architecture Research: The TeX proving pipeline's planner-action loop enables studying:
   - Claim decomposition strategies
   - Judge/verifier effectiveness
   - Whiteboard + repo memory impact on proof quality
   - Replan/backtrack trigger conditions
Comparison with Alternative Systems
| Capability | UlamAI | OpenProver | LeanCopilot | Aletheia |
|---|---|---|---|---|
| Search-based proving | Best-first/beam | No (LLM-directed) | No | Unknown |
| Transposition table | Yes | No | No | Unknown |
| Automation fallbacks | Yes (5 tactics) | No | Partial | Unknown |
| Formal verification | Lean 4 (3 backends) | Lean 4 (CLI only) | Lean 4 (LeanDojo) | Extraction protocols |
| Informal proving | TeX pipeline | Whiteboard-based | No | Whiteboard-based |
| Autoformalization | TeX → Lean | Markdown → Lean | No | No |
| Literature search | No | Yes (web-enabled) | No | Unknown |
| Multi-model support | 7+ providers | 3 providers | 1 provider | 1 (Gemini) |
| Benchmark infra | Extensive | None | Basic | Internal |
| Open source | Yes (MIT) | Yes | Yes | No |
| Planned RL/SFT | Yes (v0.3-0.5) | No | No | Unknown |
Limitations and Honest Assessment
| Limitation | Details |
|---|---|
| No published benchmark numbers | Extensive infra but no public performance reports |
| No cross-run learning yet | SFT pipeline planned (v0.3) but not implemented |
| Lean toolchain sensitivity | LeanDojo/Pantograph version alignment is fragile |
| No literature search | Unlike OpenProver, cannot consult web during proving |
| No value model | Search scoring is heuristic, not learned |
| Experimental LSP backend | Lean LSP track not yet at parity with LeanDojo |
| Early-stage TeX pipeline | Informal proving is functional but maturing rapidly |
| No parallel rollouts | Single-threaded proof search (planned for v0.4) |
Significance for the Harness Systems Landscape
UlamAI represents a distinct architectural philosophy from the planner-worker systems (Aletheia, OpenProver). Where those systems emphasize strategic coordination — a planner managing workers through a whiteboard — UlamAI emphasizes systematic search — best-first exploration of the proof state space with LLMs as the policy network.
The key insight is that mathematical proof discovery admits two complementary framings:
1. Proof as Planning (OpenProver/Aletheia): A strategic agent coordinates exploration of a large, uncertain space, maintaining persistent context and delegating tasks.
2. Proof as Search (UlamAI): A search algorithm explores a well-defined state space (proof states), using LLMs as heuristic evaluators and move generators.
Neither is strictly better. Planning excels at the macro level (choosing proof strategies, decomposing problems, consulting literature). Search excels at the micro level (finding the right tactic sequence to close a specific goal). UlamAI's dual-track architecture (Lean search + TeX planner) acknowledges this duality — the formal track uses search, the informal track uses planning.
For the broader harness systems taxonomy, UlamAI's benchmark infrastructure is particularly significant. The parity gate system, inference profiles, campaign scripts, and cross-report comparison tools represent a level of engineering discipline around evaluation that most research tools lack. This makes UlamAI not just a prover but a platform for systematic evaluation of AI-driven mathematical reasoning.
The planned SFT → value model → RL progression (v0.3 → v0.5) also positions UlamAI as a future bridge between harness systems and evolutionary/self-improving systems. Once the system can train on its own traces, it begins to exhibit the self-reinforcing dynamics that characterize the most impactful AI research systems.