UlamAI

Truth-first, reproducible, open-source Lean 4 theorem prover CLI combining LLM-guided reasoning with formal verification and best-first search

Organization: Ulam Labs (Przemek Chojecki et al.)
Published: February 2026 (v0.1); v0.2.11, March 2026 (current)
Type: Open-source repository + PyPI package + Homebrew tap
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: UlamAI Prover — A Truth-First, Reproducible, Open-Source Lean 4 Theorem Prover CLI

Repository: github.com/ulamai/ulamai

PyPI Package: ulam-prover v0.2.11

Homebrew: brew tap ulamai/ulamai && brew install ulamai

Blog Post: UlamAI Prover: An Open Lean 4 Theorem Prover (Ulam Blog)

Medium Articles:
- UlamAI — An Open-Source Theorem Prover and Formalized in Lean 4 (Feb 2026)
- UlamAI Prover: Teaching AI to Prove Math Theorems (Mar 2026)

Related Project: UnsolvedMath — a benchmark of 1,146 open mathematical problems, designed as the natural complement to UlamAI Prover (problems to attempt ↔ machinery for attempting them).

Naming: Named after Stanisław Ulam (1909–1984), the Polish-American mathematician who co-invented the Monte Carlo method, contributed to the hydrogen bomb design, and proposed the concept of cellular automata with John von Neumann.

License: MIT

2 Authors and Team

Lead / Founder: Przemek Chojecki — CEO of Ulam Labs, author of the Medium articles and blog posts. PhD in mathematics with research background in number theory and algebraic geometry.

Organization: Ulam Labs — a company focused on mathematical intelligence and AI for mathematics.

Repository Activity:

| Metric | Value |
| --- | --- |
| Commits | 91 |
| Stars | 111+ |
| Forks | 6 |
| Primary Language | Python |
| License | MIT |
| Release Cadence | Rapid iteration (v0.1 → v0.2.11 in ~2 months) |

Team Context: Ulam Labs positions itself at the intersection of AI and pure mathematics. The company maintains both the UlamAI Prover (tool for proving) and UnsolvedMath (benchmark of 1,146 open problems), creating a closed loop between challenge definition and solving infrastructure.

Key Differences from Solo Projects

Unlike many open-source theorem provers that are single-developer efforts, UlamAI is backed by a company with a clear product vision. This is reflected in:
- Professional documentation and tutorials
- Homebrew distribution (unusual for research tools)
- Colab notebooks for easy onboarding
- Automated CI/CD with Homebrew tap updates
- Connection to a commercial benchmark product (UnsolvedMath)

3 Core Contribution

Key Novelty: UlamAI is the first open-source system to combine best-first search over Lean 4 proof states, multi-provider LLM tactic generation, retrieval-augmented premise selection, automated repair loops, and a full LaTeX-to-Lean autoformalization pipeline in a single CLI tool — with reproducible benchmarking and trace logging built in from day one.

What Makes UlamAI Distinctive

  1. Search-Based Proof Discovery: Unlike one-shot LLM proof generation, UlamAI implements best-first search with beam capping over the space of proof states. At each state, the LLM proposes 8–16 candidate tactics. Successful applications expand the search tree; failures are pruned. A transposition table prevents re-exploring visited states. This turns theorem proving from "hope the LLM gets it right" into systematic exploration.

  2. Three Proof Modes: Tactic mode (LLM proposes next tactic, Lean checks each step), Lemma mode (LLM drafts a lemma plan, each proved sequentially), and LLM-only mode (LLM rewrites entire Lean file, CLI typechecks). Each mode represents a different trade-off between granularity and autonomy.

  3. Three Lean Backends: LeanDojo (Pantograph server for goal-state access), CLI (lake env lean for batch typechecking), and experimental LSP (Lean language server for interactive diagnostics). This separation of concerns lets the system work with varying levels of Lean toolchain sophistication.

  4. Dual-Track Proving — Lean + TeX: UlamAI maintains two independent proving pipelines:

     - Lean track: Formal, machine-checked proofs via tactic search
     - TeX track: Informal natural-language proofs via a bounded planner-action loop with claim graphs, whiteboard memory, judge/verifier gates, and composable worker drafts. The TeX track is not just "ask the LLM to write a proof" — it is a structured orchestration system with its own planning, judging, and backtracking logic.

  5. Full Autoformalization Pipeline (TeX → Lean): A multi-phase pipeline that segments LaTeX documents, drafts Lean skeletons, iteratively repairs type errors, fills proofs via search, and validates semantic equivalence between informal and formal statements. This is treated as a fixpoint loop, not a one-shot translation.

  6. Benchmark Infrastructure: Built-in suite management (ulam bench), report generation, cross-report comparison with parity gates, regression suite builders, and campaign scripts. Benchmarking is not an afterthought — it's a first-class product feature.

  7. Automation Tactic Fallbacks (autop): For every subgoal, UlamAI tries aesop, simp, linarith, ring, and norm_num before querying the LLM. Many subgoals that look hard to a language model are trivial for Lean's built-in automation, saving tokens and search budget.

Relationship to Prior Systems

| System | Year | Architecture | Proof Strategy | Formal Backend | Open Source |
| --- | --- | --- | --- | --- | --- |
| LeanCopilot | 2023 | Single-agent tactic suggestion | One-shot | Lean 4 via LeanDojo | Yes |
| ReProver | 2023 | Retrieval + generation | Single-step | Lean 4 | Yes |
| LEGO-Prover | 2024 | Skill library + decomposition | Multi-step | Lean via LeanDojo | Partial |
| Aletheia (DeepMind) | 2025 | Planner-worker-whiteboard | Multi-step, informal-first | Extraction protocols | No |
| OpenProver | 2025 | Planner-worker-whiteboard | Informal + formal dual-track | Lean 4 via lake env lean | Yes |
| UlamAI | 2026 | Search-based with LLM policy | Best-first/beam search | Lean 4 (LeanDojo/CLI/LSP) | Yes |

Core Design Philosophy

"Trust Lean, not the model."

This single principle drives the entire architecture. The LLM is a heuristic guide — creative, fast, often wrong. Lean is the ground truth. UlamAI cannot produce a false proof. It can fail to find a proof, but it cannot claim to have found one that doesn't typecheck.

4 Supported Solutions

Solution Types

| Solution Type | Description | Pipeline |
| --- | --- | --- |
| Formal Lean Proof | Machine-checked tactic proofs compiled against Mathlib | Prove pipeline (tactic/lemma mode) |
| LLM-Rewritten Lean File | Full Lean file rewritten by LLM, batch typechecked | Prove pipeline (LLM mode) |
| Informal TeX Proof | Structured natural-language proof with claim decomposition | Prove pipeline (TeX output format) |
| Autoformalization | LaTeX → Lean translation with iterative repair | Formalize pipeline |
| Proof Replay | Deterministic re-execution of a recorded proof trace | Replay command |
| Proof Review | Next-step suggestions for a partially completed proof | Review command |
| Benchmark Report | Solve-rate analysis across theorem suites | Bench command |

Prove Output Formats

| Format | Description | Verification |
| --- | --- | --- |
| lean (default) | Machine-checked Lean 4 proof via tactic search | Lean kernel verification at each step |
| tex | Informal proof via bounded planner-action loop | Judge + adversarial verifier + domain checks |

Proof Modes (for Lean output)

| Mode | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| tactic | LLM proposes next tactic; Lean checks each step | Fast feedback, goal-state aware | Brittle for long chains |
| lemma | LLM drafts a lemma plan; each lemma proved sequentially | Decomposes big proofs | Depends on plan quality |
| llm | LLM rewrites full Lean file; CLI typechecks | Handles multi-step edits, no Dojo required | Slower, less goal-state guidance |

Lean Backends

| Backend | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| dojo | Pantograph/LeanDojo server | Goal-state access, tactic execution | Extra install, toolchain sensitivity |
| cli | lake env lean typecheck | Simple, works with any toolchain | No goal-state feedback |
| lsp | Lean language server | Goal/diagnostic loop without Pantograph | Experimental, search parity pending |

Solver Strategies

| Strategy | Behavior |
| --- | --- |
| search | Best-first tactic search (default) |
| script | Sequential script mode (deterministic tactic sequence) |
| portfolio | Script-first, then best-first fallback |
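The portfolio strategy's control flow can be sketched in a few lines; `run_script` and `run_search` are hypothetical callables standing in for the two solver modes, not UlamAI's actual interfaces:

```python
from typing import Callable, Optional

def portfolio_solve(
    run_script: Callable[[str], Optional[str]],
    run_search: Callable[[str], Optional[str]],
    theorem: str,
) -> Optional[str]:
    """Portfolio sketch: try the cheap deterministic script mode first,
    then fall back to best-first search if no proof was found."""
    proof = run_script(theorem)   # deterministic tactic sequence
    if proof is not None:
        return proof
    return run_search(theorem)    # expensive search fallback
```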

5 LLM Integration

Multi-Provider Support

UlamAI supports a broad range of LLM providers through a unified adapter layer:

| Provider | Auth Mechanism | Default Model | Notes |
| --- | --- | --- | --- |
| OpenAI-compatible | ULAM_OPENAI_API_KEY | gpt-4.1 | Base URL configurable for custom endpoints |
| Codex CLI | codex login / ulam auth codex | gpt-5.2-codex | No API key needed; subscription-based |
| Anthropic Claude | ULAM_ANTHROPIC_API_KEY | claude-3-5-sonnet | API or Claude Code CLI |
| Claude Code CLI | ulam auth claude / claude setup-token | | No API key; subscription-based |
| Gemini API | ULAM_GEMINI_API_KEY | gemini-3.1-pro-preview | OpenAI-compatible endpoint format |
| Gemini CLI | OAuth2 auto-discovery | | Subscription/login based |
| Ollama | Local server | llama3.1 | Fully local, no data leaving machine |
| Custom API | TUI configuration | | Mistral, Z.AI, Qwen, DeepSeek, Kimi |

The documentation recommends gpt-5.2-codex or gpt-5.3-codex as the primary proving models, suggesting these Codex-series models have been found effective for tactic generation in practice.

Embedding Support (for Retrieval)

| Setting | Variable | Default |
| --- | --- | --- |
| API Key | ULAM_EMBED_API_KEY | Falls back to ULAM_OPENAI_API_KEY |
| Base URL | ULAM_EMBED_BASE_URL | Falls back to ULAM_OPENAI_BASE_URL |
| Model | ULAM_EMBED_MODEL | text-embedding-3-small |
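The fallback chain above can be sketched with nothing beyond the standard library; `resolve_embed_settings` is an illustrative helper, not part of the package:

```python
import os

def resolve_embed_settings() -> dict:
    """Sketch of the documented fallback chain for embedding settings:
    embed-specific variables win, otherwise the OpenAI variables are
    reused, and the model name has a fixed default."""
    return {
        "api_key": os.getenv("ULAM_EMBED_API_KEY") or os.getenv("ULAM_OPENAI_API_KEY"),
        "base_url": os.getenv("ULAM_EMBED_BASE_URL") or os.getenv("ULAM_OPENAI_BASE_URL"),
        "model": os.getenv("ULAM_EMBED_MODEL", "text-embedding-3-small"),
    }
```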

TeX Proving: Split Planner/Worker Models

The informal TeX proving pipeline supports independent model assignment for different roles:

┌─────────────────────────────────────────┐
│      TeX Proving Model Hierarchy        │
│                                         │
│  --tex-planner-model gpt-5.4           │
│    ↓ Planning, judging, verification,   │
│    ↓ final composition                  │
│                                         │
│  --tex-worker-model gpt-5.4-mini       │
│    ↓ Claim-draft workers (serial or    │
│    ↓ concurrent evaluation)            │
└─────────────────────────────────────────┘

This mirrors the planner/worker model split seen in OpenProver but applied to the informal proving pipeline rather than the overall system architecture.

Per-State Search Loop

Within the prove pipeline, each state popped from the search frontier goes through the same sequence:

For each proof state in frontier:
    │
    ├── 1. Retrieve premises (top-k from Mathlib/local)
    │
    ├── 2. Try automation tactics first (aesop, simp, linarith, ring, norm_num)
    │      └── If solved → emit proof, stop
    │
    ├── 3. Prompt LLM for k candidate tactics (k = 8-16)
    │
    ├── 4. Execute each tactic in Lean via selected backend
    │      ├── Success → enqueue new state (beam capped)
    │      └── Failure → request LLM repair step, bounded retries
    │
    ├── 5. Update caches (step cache + state cache)
    │
    └── 6. Log everything to JSONL trace

6 Key Results

Reported Capabilities (Blog + README)

UlamAI's published materials focus on architectural capabilities rather than benchmark scores. Key claims:

  1. Working End-to-End Pipeline: The system produces machine-checked Lean 4 proofs for standard textbook theorems (irrationality of √2, etc.) and can attempt harder problems with extended budgets.

  2. Benchmark-Ready Infrastructure: Suite management, report generation, cross-report comparison with parity gates, and campaign scripts are all implemented. The system includes:

     - ulam bench for running suites
     - ulam bench-compare for cross-report analysis
     - ulam bench-validate for suite validation
     - ulam bench-make-minif2f for building miniF2F suites
     - ulam bench-make-regression100 for fixed-size regression suites
     - Campaign scripts with gated comparisons

  3. Formalization Pipeline: The TeX-to-Lean pipeline handles segmentation, statement drafting, proof reconstruction, and semantic alignment with iterative repair loops.

  4. Multi-Backend Flexibility: Proven to work with OpenAI, Anthropic, Gemini, and Ollama backends.

Benchmark Infrastructure (Not Results)

UlamAI has invested heavily in benchmarking infrastructure, even though specific numerical results are not publicly reported:

| Capability | Implementation |
| --- | --- |
| Suite management | bench-list-suites, bench-validate, fixed-suite builder |
| miniF2F integration | bench-make-minif2f from local checkout, split selection |
| Fixed regression suite | bench-make-regression100 with deterministic seed |
| Report generation | JSON + Markdown with dataset/split breakdowns, top tags |
| Cross-report comparison | bench-compare with automated parity gate |
| Parity gate thresholds | --max-solved-drop, --max-success-rate-drop, --max-semantic-pass-rate-drop, --max-regression-rejection-rate-increase, --max-median-time-increase-pct, --max-planner-replan-triggers-increase, --max-planner-cached-tactic-tries-drop |
| Comparability checks | Suite SHA256 matching, inference profile matching |
| Campaign automation | run_bench_campaign.sh with timestamped artifacts + env snapshots |
| Inference profiles | fast, balanced, strict with configurable gen_k, exec_k, verify_level |

Parity Gate System

The benchmark comparison system deserves special attention. When comparing two runs:

python3 -m ulam bench-compare \
  --a runs/bench/a.json --b runs/bench/b.json \
  --gate \
  --max-solved-drop 0 \
  --max-success-rate-drop 0 \
  --max-semantic-pass-rate-drop 0 \
  --max-regression-rejection-rate-increase 0 \
  --max-median-time-increase-pct 25

This implements a non-regression gate — ensuring that system changes don't degrade performance. The gate also checks comparability (same suite SHA256 and same inference profile/budgets), with explicit opt-out for intentional cross-profile comparisons.
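A minimal sketch of such a non-regression gate over three of the documented thresholds; the report field names (`solved`, `success_rate`, `median_time_s`) are assumptions for illustration, not UlamAI's actual report schema:

```python
def check_parity_gate(
    baseline: dict,
    candidate: dict,
    max_solved_drop: int = 0,
    max_success_rate_drop: float = 0.0,
    max_median_time_increase_pct: float = 25.0,
) -> list:
    """Compare a candidate benchmark report against a baseline.
    Returns a list of gate violations (empty list means the gate passes)."""
    violations = []
    if baseline["solved"] - candidate["solved"] > max_solved_drop:
        violations.append("solved-drop exceeded")
    if baseline["success_rate"] - candidate["success_rate"] > max_success_rate_drop:
        violations.append("success-rate-drop exceeded")
    base_time = baseline["median_time_s"]
    if base_time > 0:
        increase_pct = 100.0 * (candidate["median_time_s"] - base_time) / base_time
        if increase_pct > max_median_time_increase_pct:
            violations.append("median-time-increase exceeded")
    return violations
```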

Connection to UnsolvedMath

Ulam Labs released UnsolvedMath, a benchmark of 1,146 open mathematical problems. The stated plan is to run UlamAI Prover against formalized versions of these problems, starting with restricted cases. This would represent one of the most ambitious applications of automated theorem proving — applying search-based LLM proving to genuinely open questions.

7 Reproducibility

Trace Logging

Every proof search run produces a JSONL trace file (run.jsonl by default) logging each search step:

| Field | Content |
| --- | --- |
| Proof state | Current Lean goal state |
| Proposed tactic | LLM-generated tactic attempt |
| Lean response | Success/failure with error messages |
| Scoring decision | Best-first priority assignment |
| Timing | Wall-clock duration per step |
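Because the trace is plain JSONL, it can be consumed with the standard library alone. This illustrative `load_trace` helper assumes only one JSON object per line, not any particular field schema:

```python
import json
from pathlib import Path

def load_trace(path: str) -> list:
    """Read a JSONL search trace: one JSON object per search step.
    Blank lines are skipped; field names are left to the producer."""
    steps = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            steps.append(json.loads(line))
    return steps
```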

Replay System

# Basic replay (show the trace)
python3 -m ulam replay run.jsonl

# Deterministic replay (re-execute every tactic)
python3 -m ulam replay run.jsonl --execute --strict

Strict mode: Checks alignment against a replay metadata sidecar (*.meta.json) containing:
- Lean backend type
- Project path
- Toolchain version
- Mathlib commit/revision
- File hash
- Run configuration
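A sketch of what a strict-mode alignment check could look like; the sidecar key names below are illustrative stand-ins for the documented contents, not the real schema:

```python
import json

def check_replay_alignment(meta_path: str, current_env: dict) -> list:
    """Compare a recorded sidecar (*.meta.json) against the current
    environment; return the list of mismatched keys (empty = aligned).
    Key names are hypothetical, mirroring the documented sidecar fields."""
    with open(meta_path) as f:
        recorded = json.load(f)
    mismatches = []
    for key in ("backend", "project_path", "toolchain", "mathlib_rev", "file_hash"):
        if recorded.get(key) != current_env.get(key):
            mismatches.append(key)
    return mismatches
```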

Formalization Artifacts

Each formalization run creates timestamped artifact directories:

runs/formalize_YYYYMMDD_HHMMSS/
├── state.json        # Current pipeline state (resumable)
├── events.jsonl      # Step-by-step event log
├── summary.json      # Final summary with metrics
├── WHITEBOARD.md     # Planner scratchpad (TeX mode)
├── repo/             # Mutable memory items (TeX mode)
├── round_001/
│   ├── draft.lean    # Generated Lean code
│   ├── errors.txt    # Compilation errors
│   └── equivalence_report.json
├── round_002/...
└── final.lean        # Best output

TeX Proving Artifacts

runs/prove_tex/tex_YYYYMMDD_HHMMSS/
├── state.json        # Resumable state
├── events.jsonl      # All events
├── summary.json      # Run summary
├── WHITEBOARD.md     # Persistent planner scratchpad
└── repo/             # Named memory items

Checkpoint Command

# Read-only health check of a Lean file
python3 -m ulam checkpoint examples/Smoke.lean \
  --theorem irrational_sqrt_two_smoke --strict

Checks typecheck status, placeholder (sorry) counts, and axiom drift without modifying anything.

Review Command

# Get next-step suggestions from a trace
python3 -m ulam review --trace run.jsonl \
  --file examples/Smoke.lean --theorem irrational_sqrt_two_smoke

Configuration Pinning

UlamAI pins environment state for reproducibility:

| Pinned Element | Storage |
| --- | --- |
| Lean toolchain version | .ulam/config.json |
| Mathlib commit | Trace metadata |
| LLM provider + model | Run configuration |
| Random seed | Configurable |
| Search parameters | Run configuration |
| Retrieval index | Versioned in project |
| Timeout settings | Run configuration |

8 Compute and API Costs

Inference Profiles

UlamAI implements configurable inference profiles that trade off speed vs. verification depth:

| Profile | gen_k | exec_k | verify_level | Use Case |
| --- | --- | --- | --- | --- |
| fast | Low | Low | Minimal | Quick exploration, CI smoke tests |
| balanced | Medium | Medium | Standard | Default proving, benchmarking |
| strict | High | High | Thorough | High-confidence results, publication |

Profiles can be specified per-run:

python3 -m ulam bench --suite internal_regression \
  --inference-profile balanced --gen-k 8 --exec-k 3 --verify-level strict

Cost Drivers

| Cost Component | Driver | Control |
| --- | --- | --- |
| LLM tactic proposals | k candidates per state × states explored | gen_k, beam cap, timeout |
| LLM repair attempts | Failed tactics × repair retries | Bounded repair loop |
| Embedding calls | Premise retrieval queries | Retriever mode selection |
| Lean execution | Tactic compilation per candidate | exec_k, caching |
| Autoformalization | Multiple rounds × LLM drafts + repairs | max-rounds, max-proof-rounds |
| TeX proving | Planner steps × worker drafts | tex-rounds, tex-worker-drafts, tex-action-steps |

Resource Requirements

| Resource | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8+ GB |
| Disk | ~200 MB (base) | ~2 GB (with Lean project + Mathlib) |
| Network | Required for API models | |
| Lean 4 + Mathlib | Required for real proofs | Via ulam -lean setup |
| LeanDojo + Pantograph | Required for dojo backend | Via ulam -lean |

Cost Optimization Strategies

  1. Automation-first: aesop, simp, linarith, ring, norm_num tried before LLM on every state
  2. Step cache: (state_key, tactic) → result avoids duplicate Lean calls
  3. State cache (transposition table): state_key → best_seen_score prevents re-exploring
  4. Beam capping: Limits frontier size to prevent unbounded search
  5. Budget tracking: Time or token budgets with automatic conclusion phase
  6. Inference profiles: Pre-configured trade-offs between cost and thoroughness
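Strategies 2 and 3 amount to memoization keyed on state and tactic. A minimal sketch of the step cache; the `StepCache` class and the `execute` callback are illustrative, not UlamAI's internals:

```python
class StepCache:
    """Sketch of a (state_key, tactic) -> result step cache: repeated
    applications of the same tactic to the same state hit the cache
    instead of triggering another Lean call."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def run(self, state_key: str, tactic: str, execute):
        key = (state_key, tactic)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = execute(state_key, tactic)  # the expensive Lean call
        self._cache[key] = result
        return result
```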

9 Architecture Solution

System Diagram

┌──────────────────────────────────────────────────────────────────────────┐
│                            UlamAI System                                │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     CLI Entry Point (cli.py)                      │  │
│  │                                                                   │  │
│  │  ulam prove    │ ulam formalize │ ulam bench     │ ulam replay   │  │
│  │  ulam review   │ ulam checkpoint│ ulam bench-*   │ ulam index    │  │
│  │  ulam -lean    │ ulam auth      │ ulam (TUI menu)│               │  │
│  └───────────┬───────────────┬──────────────┬───────────────────────┘  │
│              │               │              │                          │
│  ┌───────────▼───────────┐  │  ┌───────────▼─────────────────────┐   │
│  │   PROVE PIPELINE      │  │  │   FORMALIZE PIPELINE             │   │
│  │                       │  │  │                                   │   │
│  │  ┌─────────────────┐  │  │  │  Phase A: Ingest + Segment TeX   │   │
│  │  │ Search Engine    │  │  │  │  Phase B: Statement Drafting     │   │
│  │  │ (best-first/beam)│  │  │  │  Phase C: Proof Reconstruction  │   │
│  │  │ + transposition  │  │  │  │  Phase D: Semantic Alignment     │   │
│  │  │   table          │  │  │  │  Phase E: Consolidation          │   │
│  │  └────────┬────────┘  │  │  │                                   │   │
│  │           │            │  │  │  ┌─────────────────────────┐     │   │
│  │  ┌────────▼────────┐  │  │  │  │ Fixpoint Loop:          │     │   │
│  │  │ LLM Policy      │  │  │  │  │ Draft → Typecheck →     │     │   │
│  │  │ (k=8-16 tactics)│  │  │  │  │ Proof Search → Validate │     │   │
│  │  └────────┬────────┘  │  │  │  │ ← Repair / Rewrite      │     │   │
│  │           │            │  │  │  └─────────────────────────┘     │   │
│  │  ┌────────▼────────┐  │  │  └───────────────────────────────────┘   │
│  │  │ Retriever       │  │  │                                          │
│  │  │ (premises)      │  │  │                                          │
│  │  └────────┬────────┘  │  │                                          │
│  │           │            │  │                                          │
│  │  ┌────────▼────────┐  │  │                                          │
│  │  │ Autop Fallback  │  │  │                                          │
│  │  │ aesop/simp/     │  │  │                                          │
│  │  │ linarith/ring/  │  │  │                                          │
│  │  │ norm_num        │  │  │                                          │
│  │  └─────────────────┘  │  │                                          │
│  └───────────────────────┘  │                                          │
│              │               │                                          │
│  ┌───────────▼───────────────▼──────────────────────────────────────┐  │
│  │                   LEAN BACKEND LAYER                             │  │
│  │                                                                  │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │  │
│  │  │  LeanDojo     │  │  Lean CLI    │  │  Lean LSP            │  │  │
│  │  │  (Pantograph) │  │  (lake env   │  │  (Language Server)   │  │  │
│  │  │              │  │   lean)       │  │  [experimental]      │  │  │
│  │  │  Goal-state  │  │  Batch       │  │  Interactive          │  │  │
│  │  │  access      │  │  typecheck   │  │  goal/diagnostic     │  │  │
│  │  └──────────────┘  └──────────────┘  └──────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    LLM ADAPTER LAYER                             │  │
│  │                                                                  │  │
│  │  OpenAI │ Codex CLI │ Claude API │ Claude CLI │ Gemini │ Ollama │  │
│  │  Custom API (Mistral, Z.AI, Qwen, DeepSeek, Kimi)              │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    BENCHMARK LAYER                               │  │
│  │                                                                  │  │
│  │  Suite Management │ Report Generation │ Comparison + Gates      │  │
│  │  Campaign Scripts │ miniF2F Builder  │ Regression100 Builder   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    TRACE + REPLAY LAYER                          │  │
│  │                                                                  │  │
│  │  JSONL Logging │ Metadata Sidecars │ Deterministic Replay       │  │
│  │  Review + Checkpoint │ Artifact Management                      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘

Proof Search Data Flow

Input: Lean file + theorem name (with sorry)
    │
    ▼
┌──────────────────────────────┐
│ 1. Initialize                │
│    Load file via Lean backend│
│    Get target proof state    │
│    Create root search node   │
└───────────┬──────────────────┘
            │
            ▼
┌──────────────────────────────┐
│ 2. Search Loop               │◄─────────────────────────┐
│    Select best state from    │                           │
│    frontier (best-first)     │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 3. Try autop tactics first   │                           │
│    aesop → simp → linarith   │                           │
│    → ring → norm_num         │                           │
│    If solved → emit proof ─────── EXIT (success)         │
└───────────┬──────────────────┘                           │
            │ (none solved)                                │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 4. Retrieve premises (top-k) │                           │
│    Token-overlap or embedding│                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 5. LLM proposes k tactics    │                           │
│    (k = 8-16 per state)      │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 6. Execute each in Lean      │                           │
│    ├── Success → new state ──│── Check cache ──► Enqueue │
│    └── Failure → repair loop │     (beam capped)         │
│         └── LLM fix attempt  │                           │
│              └── Re-execute  │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 7. Update caches + log trace │                           │
│    Step cache + state cache  │                           │
│    JSONL trace entry         │                           │
└───────────┬──────────────────┘                           │
            │                                              │
            ▼                                              │
┌──────────────────────────────┐                           │
│ 8. Check termination         │                           │
│    Solved? → EXIT (success)  │                           │
│    Budget exhausted? → EXIT  │                           │
│    Frontier empty? → EXIT    │                           │
│    Otherwise → CONTINUE ─────────────────────────────────┘
└──────────────────────────────┘

10 Component Breakdown

Package Structure

ulam/
├── cli.py              # CLI entry point, TUI menu, argument parsing
├── prover.py           # Core proof search loop
├── search.py           # Best-first / beam search engine
├── retriever.py        # Premise retrieval (token-overlap + embedding)
├── llm/                # LLM adapter layer
│   ├── openai.py       # OpenAI-compatible adapter
│   ├── anthropic.py    # Anthropic Claude adapter
│   ├── codex_cli.py    # Codex CLI wrapper
│   ├── claude_cli.py   # Claude Code CLI wrapper
│   ├── gemini.py       # Gemini adapter
│   ├── ollama.py       # Ollama adapter
│   └── base.py         # Base adapter interface
├── lean/               # Lean backend layer
│   ├── dojo.py         # LeanDojo/Pantograph runner
│   ├── cli_runner.py   # lake env lean CLI runner
│   ├── lsp.py          # Lean language server (experimental)
│   └── parser.py       # Lean file parsing
├── formalize/          # Autoformalization pipeline
│   ├── segmenter.py    # TeX → structured chunks
│   ├── drafter.py      # Statement generation
│   ├── prover.py       # Per-lemma proof loops
│   ├── equivalence.py  # Semantic alignment checks
│   └── pipeline.py     # Multi-phase orchestration
├── tex/                # Informal TeX proving pipeline
│   ├── planner.py      # Planner-action loop
│   ├── worker.py       # Claim-draft workers
│   ├── judge.py        # Judge + adversarial verifier
│   ├── whiteboard.py   # Persistent scratchpad
│   └── composer.py     # Final proof assembly
├── bench/              # Benchmark infrastructure
│   ├── runner.py       # Suite execution
│   ├── report.py       # Report generation
│   ├── compare.py      # Cross-report comparison
│   └── suites/         # Suite definitions
├── index/              # Retrieval index management
│   └── builder.py      # Index construction
├── replay.py           # Deterministic replay
├── review.py           # Next-step suggestion
├── checkpoint.py       # Read-only health check
└── config.py           # Configuration management

Core Module Descriptions

| Module | Responsibility | Key Abstraction |
| --- | --- | --- |
| cli.py | Entry point, TUI menu, interactive configuration, auth flows | Command dispatch |
| prover.py | Core proving loop: state management, tactic execution, repair loops | Prover class |
| search.py | Best-first / beam search with transposition table | SearchEngine |
| retriever.py | Premise selection from Mathlib or local repo | Retriever (token-overlap / embedding) |
| llm/ | Unified interface for all LLM providers | LLMAdapter base |
| lean/ | Backend abstraction for Lean verification | LeanRunner ABC |
| formalize/ | Multi-phase TeX → Lean pipeline | FormalizePipeline |
| tex/ | Informal planner-action proving loop | TexProver |
| bench/ | Suite management, reporting, comparison | BenchRunner, BenchCompare |
| replay.py | Trace re-execution with metadata validation | Replayer |

Dependencies

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "ulam-prover"
version = "0.2.11"
requires-python = ">=3.10"
license = { text = "MIT" }

The base package has minimal hard dependencies. Heavy dependencies (LeanDojo, Pantograph, torch, sentence-transformers) are installed on demand via the ulam -lean setup flow.

11 Core Mechanisms (Detailed)

11.1 Best-First Search with Beam Capping

The search engine maintains a priority queue of proof states:

Priority Queue (beam-capped)
┌────────────────────────────────────────────────┐
│ State A (score: 0.95) ← next to expand         │
│ State B (score: 0.87)                          │
│ State C (score: 0.82)                          │
│ State D (score: 0.71)                          │
│ ... (capped at beam_width)                     │
│                                                │
│ Transposition Table:                           │
│ state_key_A → best_seen_score: 0.95            │
│ state_key_B → best_seen_score: 0.87            │
│ state_key_E → best_seen_score: 0.65 (pruned)  │
└────────────────────────────────────────────────┘

Why best-first/beam before MCTS: As the project documentation explains, "Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later." The search algorithm is explicitly chosen for practical reasons — debuggability and the fact that the bottleneck is Lean compilation, not search branching.

Transposition Table: The state cache maps state_key → best_seen_score. When the search encounters a state it has already seen (via a different tactic path), it checks whether the new score improves on the cached one. If not, the state is pruned. This prevents the combinatorial explosion of re-exploring equivalent states reached by different tactic sequences.
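The interplay of best-first ordering, beam capping, and transposition pruning can be sketched as follows; `expand`, `score`, and `is_solved` are assumed interfaces standing in for tactic generation plus Lean execution, LLM-derived scoring, and the goal check:

```python
import heapq

def best_first_search(root, score, expand, is_solved,
                      beam_width=32, max_steps=1000):
    """Best-first search over hashable state keys with a transposition
    table and beam capping (a sketch, not UlamAI's implementation)."""
    best_seen = {root: score(root)}        # transposition table
    frontier = [(-best_seen[root], root)]  # negate: heapq is a min-heap
    for _ in range(max_steps):
        if not frontier:
            return None                    # frontier exhausted
        _, state = heapq.heappop(frontier)
        if is_solved(state):
            return state                   # proof found
        for child in expand(state):
            s = score(child)
            if best_seen.get(child, float("-inf")) >= s:
                continue                   # seen at least as good: prune
            best_seen[child] = s
            heapq.heappush(frontier, (-s, child))
        if len(frontier) > beam_width:     # beam cap: keep top scorers
            frontier = heapq.nsmallest(beam_width, frontier)
    return None
```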

11.2 Retrieval-Augmented Premise Selection

Two retrieval modes are available:

| Mode | Mechanism | Speed | Quality | Dependencies |
| --- | --- | --- | --- | --- |
| Token-overlap | BM25-style keyword matching | Fast | Adequate | None |
| Embedding-based | Vector similarity search | Moderate | Higher | Embedding API |

Premise Format:

lemma_name : statement

One premise per line, fetched from a --premises file or auto-indexed:

# Auto-index from local project + Mathlib
python3 -m ulam index build \
  --project /path/to/lean-project \
  --scope both \
  --out .ulam/premises_both.jsonl

Retrieval Sources:

| Source | Flag | Content |
| --- | --- | --- |
| Local declarations | --retriever-source local | Project-specific lemmas and definitions |
| Mathlib declarations | --retriever-source mathlib | Full Mathlib library (~100k+ lemmas) |
| Both | --retriever-source both | Combined index |

Why Retrieval Matters: Mathlib contains over 100,000 lemmas. Without retrieval, the model must rely on memorized knowledge — knowledge that degrades with smaller models and quantization, and that inevitably lags the ever-growing surface area of Mathlib itself. Retrieval turns "do you remember this lemma?" into "here is the lemma; use it."
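A token-overlap ranker in this BM25-flavoured spirit can be sketched in a few lines. This is an illustration, not UlamAI's actual retriever: `rank_premises` and its scoring are assumptions, with premises given as `lemma_name : statement` lines as described above.

```python
import re
from collections import Counter

def rank_premises(goal, premises, k=5):
    """Rank `lemma_name : statement` lines by token overlap with the goal.

    BM25-flavoured sketch: premises sharing rarer tokens with the goal
    score higher. Not UlamAI's actual retriever implementation.
    """
    def tokenize(text):
        # Keep identifier-like tokens; mathematical symbols are dropped.
        return set(re.findall(r"[A-Za-z_]+", text.lower()))

    goal_tokens = tokenize(goal)
    # Document frequency down-weights tokens common across the premise set.
    df = Counter(t for p in premises for t in tokenize(p))

    def overlap_score(premise):
        return sum(1.0 / (1 + df[t]) for t in tokenize(premise) & goal_tokens)

    return sorted(premises, key=overlap_score, reverse=True)[:k]
```

An embedding-based mode would swap `overlap_score` for cosine similarity over precomputed premise vectors, trading speed for quality as in the table above.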

11.3 Automation Tactic Fallbacks (autop)

Before every LLM query, UlamAI tries a cascade of built-in Lean automation:

Proof state arrived
    │
    ├── Try: aesop
    │     └── Success? → SOLVED (no LLM needed)
    ├── Try: simp
    │     └── Success? → SOLVED
    ├── Try: linarith
    │     └── Success? → SOLVED
    ├── Try: ring
    │     └── Success? → SOLVED
    └── Try: norm_num
          └── Success? → SOLVED
          │
          └── All failed → Query LLM for k tactics

Rationale: Many subgoals that look hard to a language model are trivial for Lean's built-in automation. This saves:

  • LLM API tokens (no call needed)
  • Search budget (one-step solution vs. multi-step exploration)
  • Wall-clock time (local computation vs. API round-trip)
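The cascade reduces to a short loop. In this sketch, `run_tactic(state, tactic)` is an assumed stand-in for a call into the Lean backend returning `(solved, new_state)`; only on full cascade failure does the caller fall through to the LLM.

```python
# The five built-in tactics tried, in order, before any LLM query.
AUTOMATION_CASCADE = ("aesop", "simp", "linarith", "ring", "norm_num")

def try_automation(state, run_tactic, cascade=AUTOMATION_CASCADE):
    """Try Lean's built-in automation before spending an LLM call (a sketch).

    run_tactic(state, tactic) -> (solved, new_state) stands in for the
    Lean backend.
    """
    for tactic in cascade:
        solved, new_state = run_tactic(state, tactic)
        if solved:
            return tactic, new_state  # SOLVED: no LLM tokens spent
    return None, state  # all failed: caller now queries the LLM for k tactics
```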

11.4 Repair Loop

When a tactic fails, UlamAI doesn't just discard it — it feeds the error back to the LLM:

LLM proposes tactic T
    │
    ▼
Execute T in Lean
    │
    ├── Success → accept, enqueue new state
    │
    └── Failure (error message E)
          │
          ▼
        Ask LLM to repair T given error E
          │
          ▼
        Execute repaired T' in Lean
          │
          ├── Success → accept
          └── Failure → bounded retries exhausted → discard

This error-feedback loop is a key differentiator from pure one-shot generation. Lean's type error messages are precise and informative — they tell the model exactly what went wrong (missing argument, type mismatch, unknown identifier). This structured feedback enables targeted repairs.
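The bounded repair loop can be sketched as follows. Here `run` and `repair` are assumed stand-ins for the Lean backend and the LLM repair prompt respectively; the function names and signatures are illustrative, not UlamAI's API.

```python
def execute_with_repair(tactic, run, repair, max_repairs=2):
    """Sketch of the bounded error-feedback repair loop.

    run(tactic) -> (ok, error_message) stands in for Lean execution;
    repair(tactic, error) -> repaired tactic stands in for the LLM repair
    prompt. Returns the accepted tactic, or None once retries are exhausted.
    """
    ok, error = run(tactic)
    for _ in range(max_repairs):
        if ok:
            break
        # Lean's error text (missing argument, type mismatch, unknown
        # identifier) is fed back verbatim for a targeted fix.
        tactic = repair(tactic, error)
        ok, error = run(tactic)
    return tactic if ok else None
```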

11.5 TeX Proving Pipeline (Informal Proofs)

The --output-format tex pipeline is a sophisticated orchestration system:

┌────────────────────────────────────────────────────┐
│              TeX Proving Architecture               │
│                                                    │
│  v0.2.11 Flow:                                     │
│  1. Try one deep whole-theorem attempt              │
│     using single primary model                      │
│  2. If fails → fall back to planner/worker          │
│     claim decomposition                             │
│                                                    │
│  Planner Action Loop:                               │
│  ┌──────────────────────────────────────────────┐  │
│  │ Planner chooses action:                      │  │
│  │  • plan — decompose into claim graph         │  │
│  │  • solve — workers draft claims              │  │
│  │  • write_memory — update WHITEBOARD/repo     │  │
│  │  • compose — assemble final proof            │  │
│  │  • give_up — exhaust budget                  │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Claim Graph:           │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ claims + dependencies + assumptions +         │  │
│  │ required facts + worker guidance +            │  │
│  │ repo reads                                    │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Workers:               │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ Draft each claim (serial or concurrent)      │  │
│  │  → Judge + Adversarial Verifier +            │  │
│  │    Domain Checker                            │  │
│  │  → Accept / Reject                           │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│  Persistent Memory:     │                          │
│  ┌──────────────────────▼───────────────────────┐  │
│  │ WHITEBOARD.md + repo/ items                  │  │
│  │ CRUD operations by planner                   │  │
│  │ wikilink expansion in prompts                │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  Replan/Backtrack:                                 │
│  - Pass-based: reset decomposition if stalled      │
│  - Bounded: --tex-replan-passes limit              │
│                                                    │
│  Output: composed .tex proof draft                 │
└────────────────────────────────────────────────────┘

Configuration Knobs:

| Flag | Default | Description |
| --- | --- | --- |
| --tex-rounds | | Max fixed orchestration rounds |
| --tex-worker-drafts | 1 | Worker candidates per claim per round |
| --tex-concurrency | off | Toggle concurrent evaluation of drafts |
| --tex-judge-repairs | | Max rounds without accepted claim |
| --tex-replan-passes | | Max decomposition resets |
| --tex-action-steps | | Bounded planner-action steps |
| --tex-planner-model | | Override model for planning/judging |
| --tex-worker-model | | Override model for claim workers |
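The planner-action loop above can be sketched as a bounded dispatch over the documented action set. Everything here is illustrative: `planner`, `workers`, and the memory dictionary are assumptions standing in for the planner model, the worker drafting (with its judge/verifier gating elided), and the WHITEBOARD/repo memory.

```python
def planner_loop(planner, workers, memory, max_steps=8):
    """Minimal sketch of the TeX planner-action loop.

    planner(memory) returns (action, payload); the action names mirror the
    documented set: plan, solve, write_memory, compose, give_up.
    """
    for _ in range(max_steps):  # --tex-action-steps style bound
        action, payload = planner(memory)
        if action == "plan":
            memory["claim_graph"] = payload        # claims + dependencies
        elif action == "solve":
            for claim in payload:
                draft = workers(claim, memory)     # judge/verifier gating elided
                memory.setdefault("accepted", []).append(draft)
        elif action == "write_memory":
            memory["whiteboard"] = payload         # WHITEBOARD.md-style update
        elif action == "compose":
            return payload                         # assembled .tex proof draft
        elif action == "give_up":
            return None
    return None  # action budget exhausted
```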

11.6 Autoformalization Pipeline (TeX → Lean)

The formalization pipeline is a five-phase process with recursive loops:

Phase A — Ingest + Segment:

LaTeX document
    │
    ▼
Parse into logical chunks:
    ├── Definitions
    ├── Lemmas
    ├── Theorems
    └── Proof sketches
    │
    ▼
Extract named entities + symbols
    │
    ▼
Build dependency graph

Phase B — Statement Drafting (Lean Skeleton):

For each definition/lemma/theorem:
    │
    ├── Generate Lean signature with sorry proofs
    ├── Insert placeholders for unknown types
    │
    ▼
Run Lean to typecheck skeleton
    │
    ├── Missing imports → fix
    ├── Undeclared constants → fix
    ├── Type mismatches → fix
    ├── Notation issues → fix
    │
    └── Loop until file typechecks with sorry proofs

Phase C — Proof Reconstruction:

For each sorry:
    │
    ├── Open goal in Lean backend
    ├── Run proof search pipeline
    │
    ├── If proof fails:
    │   ├── Retrieve similar lemmas
    │   ├── Reformulate statement
    │   └── Generate auxiliary lemmas
    │
    └── If still failing: leave sorry, log TODO

Phase D — Semantic Alignment (Recursive):

┌──────────────────────────────────────────────┐
│  Draft → Typecheck → Proof Search →          │
│  Validate Equivalence                        │
│       ↑                      ↓               │
│       └── Repair / Rewrite ──┘               │
│                                              │
│  Statement equivalence check:                │
│  LLM judges Lean stmt vs informal stmt       │
│  Mismatch → enqueue repair task              │
│                                              │
│  Gap detection:                              │
│  Missing assumptions → add hypotheses        │
│  Hidden lemmas → generate intermediates      │
│                                              │
│  Re-run proofs:                              │
│  Statement change → invalidate downstream    │
│  → recursive repair                          │
└──────────────────────────────────────────────┘

Phase E — Consolidation:

  • Produce final .lean file
  • Record proof success rate, remaining sorrys, error traces
  • Save all artifacts
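The dependency graph built in Phase A drives the order in which Phase B drafts statements, so that each declaration is emitted after everything it depends on. A minimal sketch, in which the chunk names and Lean signatures are hypothetical:

```python
from graphlib import TopologicalSorter

def draft_skeleton(chunks, deps):
    """Phase A/B sketch: emit a sorry-skeleton in dependency-first order.

    chunks maps a name to its drafted Lean signature; deps maps a name to the
    set of names it depends on. Names and signatures here are illustrative.
    """
    lines = []
    # static_order() yields every node only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        # Every proof body starts as `sorry`; Phase C replaces them via search.
        lines.append(f"{chunks[name]} := by sorry")
    return "\n".join(lines)
```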

11.7 Axiom Control

UlamAI includes axiom safety controls:

# Default: axioms/constants allowed
python3 -m ulam prove File.lean --theorem T

# Strict: no axioms allowed
python3 -m ulam prove File.lean --theorem T --no-allow-axioms

This is important for proof integrity — a proof that introduces new axioms or constants may not be valid in the intended mathematical framework.
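The strict mode corresponds to inspecting a proof's axiom footprint, which Lean 4 itself exposes via the #print axioms command. A small illustration (the theorem here is our own example, not from UlamAI):

```lean
-- Classical.em is derived from choice in Lean 4, so a proof using it
-- carries a nonempty axiom footprint.
theorem em_example (p : Prop) : p ∨ ¬p := Classical.em p

-- Reports every axiom the proof transitively depends on; an empty report
-- is what a strict --no-allow-axioms style check would demand.
#print axioms em_example
```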

11.8 Scripted Solver (Sequential Tactics)

In addition to search-based proving, UlamAI supports sequential script mode:

┌─────────────────────────────────────┐
│  Script Solver                       │
│                                     │
│  1. LLM generates tactic sequence  │
│  2. Execute tactics sequentially    │
│  3. Each step validated by Lean     │
│  4. Backtrack on failure            │
│                                     │
│  Portfolio strategy:                │
│  1. Try script solver first         │
│  2. If fails → best-first search   │
└─────────────────────────────────────┘

This is enabled by default and works well for proofs that follow a natural sequential structure.
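Step-by-step validation with a backtrack point can be sketched as below. `apply_tactic` is an assumed stand-in for the Lean backend, returning the next proof state or None on failure; the returned prefix is what a portfolio caller would hand to best-first search.

```python
def run_script(init_state, tactics, apply_tactic):
    """Sequential script mode, sketched: each step is validated by Lean and
    execution stops at the first failure, so the caller can fall back to
    best-first search from the last good state.

    apply_tactic(state, tactic) -> next proof state, or None on failure.
    """
    state, executed = init_state, []
    for tactic in tactics:
        nxt = apply_tactic(state, tactic)
        if nxt is None:
            return executed, state  # validated prefix + backtrack point
        executed.append(tactic)
        state = nxt
    return executed, state  # whole script validated
```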

12 Programming Language

Language and Ecosystem

| Aspect | Choice |
| --- | --- |
| Primary Language | Python |
| Python Version | >=3.10 |
| Build System | setuptools (>=68) + wheel |
| Package Name | ulam-prover |
| Entry Point | ulam = ulam.cli:main |
| Formal Verification Target | Lean 4 (with Mathlib) |
| Distribution | PyPI + Homebrew |

Distribution Channels

UlamAI is unusually well-distributed for a research tool:

| Channel | Command | Notes |
| --- | --- | --- |
| Homebrew | brew tap ulamai/ulamai && brew install ulamai | Auto-updated on release |
| PyPI | pip install ulam-prover | Standard Python install |
| Clone + install | ./install.sh | Editable install |
| One-liner | curl -fsSL ... \| bash | Quick install script |
| Colab | Notebook in examples/ | Zero-install exploration |

Integration Language: Lean 4

While UlamAI is written in Python, it operates on Lean 4 code:

| Lean Component | Purpose |
| --- | --- |
| Lean 4 | Proof kernel (ground truth) |
| Mathlib | Mathematical library (100k+ lemmas) |
| LeanDojo | Python interface to Lean proof states |
| Pantograph (PyPantograph) | Server for tactic-level interaction |
| lake | Build tool for Lean projects |
| elan | Lean toolchain manager |

Toolchain Alignment

UlamAI carefully manages Lean toolchain alignment:

  • Pantograph is anchored to a specific Lean toolchain (from its src/lean-toolchain)
  • UlamAI aligns the Mathlib project to that toolchain when possible
  • --use-mathlib-toolchain flag available for explicit control
  • Toolchain pinning is part of reproducibility metadata

13 Memory Management

Cache Architecture

UlamAI implements a multi-level caching system to minimize redundant Lean execution:

┌────────────────────────────────────────────────────┐
│                CACHE HIERARCHY                      │
│                                                    │
│  Level 1: Step Cache                               │
│  ├── Key: (state_key, tactic)                      │
│  ├── Value: execution result                       │
│  ├── Purpose: Avoid re-running same tactic on      │
│  │            same state                           │
│  └── Eviction: None (grows with search)            │
│                                                    │
│  Level 2: State Cache (Transposition Table)        │
│  ├── Key: state_key (canonical hash)               │
│  ├── Value: best_seen_score                        │
│  ├── Purpose: Avoid re-exploring states reached    │
│  │            via different paths                  │
│  └── Eviction: None (grows with search)            │
│                                                    │
│  Level 3: Retrieval Index                          │
│  ├── Key: project + scope                          │
│  ├── Value: premises JSONL                         │
│  ├── Purpose: Pre-computed premise index           │
│  └── Location: .ulam/premises_*.jsonl              │
│                                                    │
│  Level 4: LLM Response Cache                       │
│  ├── Not implemented (each call is logged but      │
│  │   not cached for reuse)                         │
│  └── Potential optimization point                  │
└────────────────────────────────────────────────────┘
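The Level-1 step cache amounts to a memoizing wrapper around the Lean backend. A sketch with illustrative names (`LeanStepCache` and `execute` are ours, not UlamAI's API):

```python
class LeanStepCache:
    """Level-1 step cache sketch: memoize (state_key, tactic) -> result so
    the same tactic is never re-executed on the same state.

    `execute` stands in for a call into the Lean backend.
    """

    def __init__(self, execute):
        self._execute = execute
        self._cache = {}   # (state_key, tactic) -> result; never evicted
        self.hits = 0

    def run(self, state_key, tactic):
        key = (state_key, tactic)
        if key in self._cache:
            # Lean execution dominates wall-clock time, so every hit is a
            # large saving relative to the dictionary lookup.
            self.hits += 1
            return self._cache[key]
        result = self._execute(state_key, tactic)
        self._cache[key] = result
        return result
```

The no-eviction policy matches the documented design: both Level-1 and Level-2 caches simply grow for the lifetime of a search run.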

State Canonicalization

A critical challenge in proof search is determining when two proof states are "the same." UlamAI uses state hashing for the transposition table, but acknowledges this as an area for improvement:

"Better state canonicalization (stable hashing)" is listed as a v0.2 roadmap item.

The difficulty: Lean proof states contain goal hypotheses, types, and metavariables. Two states might differ in hypothesis naming but be semantically equivalent. Imperfect canonicalization leads to:

  • False negatives: missing equivalent states → redundant exploration
  • False positives (rare): colliding different states → incorrect pruning
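One common canonicalization tactic is to rename hypotheses positionally before hashing, so that alpha-equivalent states collide on purpose. A toy sketch under that assumption (real Lean states also carry metavariables and universe data, which this ignores):

```python
import hashlib
import re

def state_key(hypotheses, goal):
    """Canonicalization sketch: positional hypothesis renaming + stable hash.

    `hypotheses` is a list of (name, type-as-string) pairs; states that
    differ only in hypothesis names hash identically.
    """
    renaming = {name: f"h{i}" for i, (name, _) in enumerate(hypotheses)}

    def subst(s):
        # Replace hypothesis names wherever they occur in types or the goal.
        return re.sub(r"\b\w+\b", lambda m: renaming.get(m.group(0), m.group(0)), s)

    payload = "|".join(subst(ty) for _, ty in hypotheses) + " ⊢ " + subst(goal)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```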

TeX Mode Memory

The TeX proving pipeline maintains its own persistent memory:

| Component | Persistence | Mutability | Scope |
| --- | --- | --- | --- |
| WHITEBOARD.md | Across rounds/passes | Read/write by planner | Planning state |
| repo/ items | Across rounds/passes | CRUD by planner | Named knowledge items |
| Claim graph | Per pass | Created by planner | Proof decomposition |
| Worker drafts | Per round | Generated by workers | Claim solutions |
| state.json | Across runs | Updated per event | Full resumable state |
| events.jsonl | Append-only | Immutable | Audit trail |

The wikilink expansion mechanism works identically to OpenProver's — repo items are referenced by name and expanded into prompts. This is not coincidental; both systems likely draw from the same Aletheia-inspired architectural pattern.

Configuration Memory

UlamAI persists configuration in .ulam/config.json:

{
  "lean_project": "/path/to/lean-project",
  "llm_provider": "openai",
  "llm_model": "gpt-5.2-codex",
  "lean_backend": "dojo",
  "prove_mode": "tactic"
}

This file is auto-generated by setup flows (ulam -lean, TUI configuration) and loaded on every invocation.

14 Continued Learning

Planned Learning Pipeline

UlamAI has an explicit, public roadmap for progressive learning capabilities:

| Version | Learning Capability | Status |
| --- | --- | --- |
| v0.1 (shipped) | Trace logging to JSONL | Implemented |
| v0.2 (current) | Better scoring heuristics (non-RL) | In progress |
| v0.3 (planned) | SFT training loop on traces | Not started |
| v0.4 (planned) | Learned value model for search guidance | Not started |
| v0.5 (planned) | On-policy RL (PPO/GRPO-style) | Not started |

v0.3 — Supervised Fine-Tuning from Traces

The most concrete near-term learning plan:

Successful proof traces (JSONL)
    │
    ▼
Extract (state, tactic) pairs
    │
    ▼
Build "proofstep dataset" for model family
    │
    ▼
SFT on tactic policy
    │
    ▼
Evaluate on miniF2F slice + internal suite

Key Insight: Every successful UlamAI proof run generates training data. The JSONL traces contain exactly the (state, tactic) pairs needed for supervised fine-tuning, and their outcome labels supply the reward signal a later RL stage would need. Failed proof attempts provide hard negatives.
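Extracting supervision pairs from such traces is a one-pass filter. The event schema below (`"event": "step"` records with `ok`, `state`, and `tactic` fields) is our assumption for illustration; the real UlamAI trace format may differ.

```python
import json

def extract_proofstep_pairs(trace_lines):
    """Sketch: pull (state, tactic) SFT pairs out of JSONL trace lines.

    Assumes each successful step event looks like
    {"event": "step", "ok": true, "state": ..., "tactic": ...};
    the actual UlamAI trace schema may differ.
    """
    pairs = []
    for line in trace_lines:
        ev = json.loads(line)
        # Keep only validated steps; failed steps would become hard negatives.
        if ev.get("event") == "step" and ev.get("ok"):
            pairs.append((ev["state"], ev["tactic"]))
    return pairs
```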

v0.4 — Learned Value Model

Proof traces (successful + failed)
    │
    ▼
Train value model:
    state → predicted proof distance
    │
    ▼
Replace heuristic scoring in search:
    best_first_priority = value_model(state)
    │
    ▼
MCTS/MCGS option with learned value

v0.5 — Reinforcement Learning

On-policy fine-tuning (PPO/GRPO-style)
    │
    ├── Successful trajectories → positive reward
    ├── Near-miss trajectories → shaped reward
    └── Curriculum scheduling: easy → harder

v0.3 Additional Plans (Structural)

Both proving tracks (TeX and Lean formalization) are planned to receive:

  1. Dependency-first proving: Decompose into concept/claim or declaration DAG first, then solve bottom-up
  2. Live grounding retrieval: Force term/definition grounding from current Mathlib/local sources before drafting
  3. Compiler/typecheck-in-the-loop reflection: At each node, not only at the end
  4. Semantic gate beyond syntax: Check whether generated claims match intended meanings
  5. Abstain capability: Explicit no-output or partial-safe outcome when gates fail
  6. Autonomy protocol artifacts: Fixed verification/extraction prompts, full transcripts, autonomy card

Cross-Run Learning (Current Gap)

Like OpenProver, UlamAI currently has no cross-run learning mechanism. Each proof attempt starts from scratch. However, the planned SFT pipeline (v0.3) would close this gap by training improved models from accumulated traces.

UnsolvedMath Integration

The planned integration with UnsolvedMath (1,146 open problems) creates a natural curriculum:

Easy problems (textbook) → build traces
    → SFT model on traces
    → attempt harder problems
    → accumulate more traces
    → further SFT/RL
    → attempt open problems from UnsolvedMath

This is a concrete path toward evolutionary improvement through accumulated experience.

15 Applications

Primary Application: Verified Mathematical Proof Discovery

UlamAI serves multiple user profiles with distinct workflows:

| User Profile | Workflow | Key Commands |
| --- | --- | --- |
| Mathematician | Prove a conjecture in Lean | ulam prove File.lean --theorem T |
| Formalization engineer | Convert a paper to Lean | ulam formalize paper.tex --out paper.lean |
| Student | Explore proof strategies | ulam prove --output-format tex --statement "..." |
| Researcher | Benchmark proving models | ulam bench --suite regression |
| Developer | Non-regression testing | ulam bench-compare --gate --a base.json --b pr.json |

Concrete Workflow Examples

Workflow A: Lean Formal Proving

ulam prove examples/Smoke.lean \
  --theorem irrational_sqrt_two_smoke \
  --prove-mode llm --lean lsp --llm codex_cli

Workflow B: Informal TeX Proving

ulam prove --theorem infinitely_many_primes \
  --output-format tex \
  --statement "Prove that there are infinitely many prime numbers." \
  --llm codex_cli --tex-rounds 3 --tex-worker-drafts 2

Workflow C: Autoformalization

ulam formalize examples/pol25.tex --out examples/pol25.lean \
  --proof-backend llm --lean-backend dojo \
  --max-rounds 5 --max-proof-rounds 2

Workflow D: Benchmarking

ulam bench --suite internal_regression --llm codex_cli \
  --openai-model gpt-5.3-codex --lean dojo \
  --inference-profile balanced

Research Applications

  1. Benchmark-Driven Model Evaluation: UlamAI's benchmark infrastructure enables systematic comparison of:
     • Different LLM providers on the same theorem suite
     • Different inference profiles (speed vs. thoroughness)
     • Different proof modes (tactic vs. lemma vs. LLM-only)
     • Different Lean backends (LeanDojo vs. CLI vs. LSP)

  2. Training Data Generation: Every proof run produces JSONL traces that can serve as:
     • Supervised fine-tuning datasets for tactic prediction
     • Hard negatives from failed proof attempts
     • Curriculum data ordered by difficulty

  3. Autoformalization Research: The TeX → Lean pipeline enables studying:
     • Semantic drift between informal and formal statements
     • Multi-round repair effectiveness
     • Gap detection in informal proofs

  4. Agentic Architecture Research: The TeX proving pipeline's planner-action loop enables studying:
     • Claim decomposition strategies
     • Judge/verifier effectiveness
     • Whiteboard + repo memory impact on proof quality
     • Replan/backtrack trigger conditions

Comparison with Alternative Systems

| Capability | UlamAI | OpenProver | LeanCopilot | Aletheia |
| --- | --- | --- | --- | --- |
| Search-based proving | Best-first/beam | No (LLM-directed) | No | Unknown |
| Transposition table | Yes | No | No | Unknown |
| Automation fallbacks | Yes (5 tactics) | No | Partial | Unknown |
| Formal verification | Lean 4 (3 backends) | Lean 4 (CLI only) | Lean 4 (LeanDojo) | Extraction protocols |
| Informal proving | TeX pipeline | Whiteboard-based | No | Whiteboard-based |
| Autoformalization | TeX → Lean | Markdown → Lean | No | No |
| Literature search | No | Yes (web-enabled) | No | Unknown |
| Multi-model support | 7+ providers | 3 providers | 1 provider | 1 (Gemini) |
| Benchmark infra | Extensive | None | Basic | Internal |
| Open source | Yes (MIT) | Yes | Yes | No |
| Planned RL/SFT | Yes (v0.3-0.5) | No | No | Unknown |

Limitations and Honest Assessment

| Limitation | Details |
| --- | --- |
| No published benchmark numbers | Extensive infra but no public performance reports |
| No cross-run learning yet | SFT pipeline planned (v0.3) but not implemented |
| Lean toolchain sensitivity | LeanDojo/Pantograph version alignment is fragile |
| No literature search | Unlike OpenProver, cannot consult web during proving |
| No value model | Search scoring is heuristic, not learned |
| Experimental LSP backend | Lean LSP track not yet at parity with LeanDojo |
| Early-stage TeX pipeline | Informal proving is functional but maturing rapidly |
| No parallel rollouts | Single-threaded proof search (planned for v0.4) |

Significance for the Harness Systems Landscape

UlamAI represents a distinct architectural philosophy from the planner-worker systems (Aletheia, OpenProver). Where those systems emphasize strategic coordination — a planner managing workers through a whiteboard — UlamAI emphasizes systematic search — best-first exploration of the proof state space with LLMs as the policy network.

The key insight is that mathematical proof discovery admits two complementary framings:

  1. Proof as Planning (OpenProver/Aletheia): A strategic agent coordinates exploration of a large, uncertain space, maintaining persistent context and delegating tasks.

  2. Proof as Search (UlamAI): A search algorithm explores a well-defined state space (proof states), using LLMs as heuristic evaluators and move generators.

Neither is strictly better. Planning excels at the macro level (choosing proof strategies, decomposing problems, consulting literature). Search excels at the micro level (finding the right tactic sequence to close a specific goal). UlamAI's dual-track architecture (Lean search + TeX planner) acknowledges this duality — the formal track uses search, the informal track uses planning.

For the broader harness systems taxonomy, UlamAI's benchmark infrastructure is particularly significant. The parity gate system, inference profiles, campaign scripts, and cross-report comparison tools represent a level of engineering discipline around evaluation that most research tools lack. This makes UlamAI not just a prover but a platform for systematic evaluation of AI-driven mathematical reasoning.

The planned SFT → value model → RL progression (v0.3 → v0.5) also positions UlamAI as a future bridge between harness systems and evolutionary/self-improving systems. Once the system can train on its own traces, it begins to exhibit the self-reinforcing dynamics that characterize the most impactful AI research systems.