arXiv Papers: Evolutionary AI for Games & Proofs
Two papers applying LLM-driven evolution to multiagent learning and mathematical theorem proving
Table of Contents
Paper 1: Discovering Multiagent Learning Algorithms with LLMs
- Overview & Motivation
- Background: CFR and PSRO
- Method: AlphaEvolve for Game Theory
- VAD-CFR: Volatility-Adaptive Discounted CFR
- SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO
- Experimental Results
- Analysis & Insights
Paper 2: Aletheia Tackles FirstProof Autonomously
- Overview & Motivation
- Aletheia Agent Architecture
- FirstProof Problems & Results
- Analysis & Significance
Paper 1
Discovering Multiagent Learning Algorithms with LLMs
arXiv: 2602.16928 · Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot · Affiliation: Google DeepMind · Date: February 2026
1.1 Overview & Motivation
This paper demonstrates that AlphaEvolve can discover novel multiagent learning algorithms for game theory that outperform decades of human-designed algorithms. The authors evolve two families of algorithms:
VAD-CFR
Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism. An enhanced variant of Counterfactual Regret Minimization that dynamically adapts its discounting based on strategy volatility and uses optimistic regret bounds.
SHOR-PSRO
Smoothed Hybrid Optimistic Regret PSRO. An improved version of Policy-Space Response Oracles that combines smoothing, hybrid selection, and optimistic regret minimization for faster convergence to Nash equilibria.
**Key Result:** Both VAD-CFR and SHOR-PSRO outperform their respective state-of-the-art baselines across a suite of game-theoretic benchmarks, demonstrating that LLM-driven evolution can discover algorithms that are both novel and practically superior.
1.2 Background: CFR and PSRO
Counterfactual Regret Minimization (CFR)
CFR is the dominant family of algorithms for solving imperfect-information extensive-form games. It works by iterating over the game tree, computing counterfactual regret for each information set, and updating strategies to minimize cumulative regret.
Regret Definitions
For player i at information set I, the instantaneous counterfactual regret for action a at iteration t is:
$$ r^t(I, a) = v^t(I, a) - v^t(I) $$
Where v^t(I, a) is the counterfactual value of always playing action a at information set I, and v^t(I) is the counterfactual value under the current strategy.
The cumulative counterfactual regret after T iterations is:
$$ R^T(I, a) = \sum_{t=1}^{T} r^t(I, a) $$
The strategy at iteration T+1 is computed via regret matching:
$$ \sigma^{T+1}(I, a) = \frac{R^T_+(I, a)}{\sum_{a'} R^T_+(I, a')} $$
Where R^T_+(I, a) = max(R^T(I, a), 0). If all regrets are non-positive, the strategy is uniform.
CFR Convergence Guarantee
$$ \epsilon_{\text{Nash}} \le \frac{\Delta \cdot \sqrt{|A|}}{\sqrt{T}} $$
Where Δ is the range of payoffs, |A| is the maximum number of actions at any information set, and T is the number of iterations. The average strategy converges to a Nash equilibrium at rate O(1/√T).
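The bound above can be evaluated directly for concrete values. A minimal sketch (the numbers below are illustrative, not from the paper):

```python
import math

# Worked instance of the CFR convergence bound: exploitability of the
# average strategy after T iterations is at most Δ·√|A| / √T.
def cfr_epsilon_bound(delta: float, num_actions: int, T: int) -> float:
    """Upper bound on ε_Nash after T iterations."""
    return delta * math.sqrt(num_actions) / math.sqrt(T)

# Example: payoff range Δ = 2, at most |A| = 3 actions, T = 10,000.
eps = cfr_epsilon_bound(2.0, 3, 10_000)
print(f"epsilon_Nash <= {eps:.4f}")
```

Note the O(1/√T) rate: quadrupling the iteration count only halves the bound.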
Discounted CFR (DCFR)
DCFR improves on vanilla CFR by discounting older iterations, giving more weight to recent strategies:
$$ R^T_{\text{disc}}(I, a) = \sum_{t=1}^{T} w(t, T) \cdot r^t(I, a) $$
Where w(t, T) is a weighting function. Common choices include:
- Linear DCFR: w(t, T) = t / T (linearly increasing weight)
- DCFR(α, β, γ): Parameterized discounting with separate weights for positive regrets, negative regrets, and strategy averaging
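The two weighting schemes above can be sketched as follows. The (α, β, γ) multipliers follow the common parameterization from the DCFR literature; treat the exact forms and default values here as illustrative:

```python
# Illustrative sketch of the DCFR weighting schemes described above.
def linear_weight(t: int, T: int) -> float:
    """Linear CFR: iteration t contributes with weight t / T."""
    return t / T

def dcfr_multipliers(t: int, alpha: float = 1.5, beta: float = 0.0,
                     gamma: float = 2.0) -> tuple:
    """Per-iteration multipliers applied at iteration t to positive regrets,
    negative regrets, and the cumulative average strategy, respectively."""
    w_pos = t**alpha / (t**alpha + 1)   # discounts old positive regret
    w_neg = t**beta / (t**beta + 1)     # discounts old negative regret
    w_strat = (t / (t + 1))**gamma      # discounts the strategy average
    return w_pos, w_neg, w_strat
```

All three multipliers approach 1 as t grows, so discounting is strongest for early iterations.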
Policy-Space Response Oracles (PSRO)
PSRO is a framework for learning in large games by building a population of policies and computing equilibria in the meta-game:
- Meta-game construction: Build a payoff matrix from the current policy population
- Meta-solver: Compute a Nash equilibrium (or approximate) of the meta-game
- Best response: Compute a best response to the meta-equilibrium mixture
- Population expansion: Add the best response to the policy population
- Repeat
$$ \text{NashConv}(\sigma) = \sum_i \left[ \max_{\sigma_i'} u_i(\sigma_i', \sigma_{-i}) - u_i(\sigma) \right] $$
NashConv measures how far a strategy profile σ is from Nash equilibrium. At Nash, NashConv = 0.
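For intuition, here is a minimal sketch of NashConv for a two-player zero-sum matrix game (not the paper's implementation; function and variable names are assumptions):

```python
import numpy as np

def nash_conv(A: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """NashConv for a zero-sum matrix game: player 0 maximizes x @ A @ y,
    player 1 minimizes it. Sums each player's incentive to deviate."""
    value = x @ A @ y
    br0 = (A @ y).max()        # player 0's best-response payoff
    br1 = (-(x @ A)).max()     # player 1 maximizes the negated payoff
    return (br0 - value) + (br1 - (-value))

# Matching pennies: the uniform profile is the Nash equilibrium,
# so NashConv is zero there and positive elsewhere.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(nash_conv(A, uniform, uniform))  # 0.0
```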
Python Pseudocode: Vanilla CFR
import numpy as np
class VanillaCFR:
"""Vanilla Counterfactual Regret Minimization."""
def __init__(self, game):
self.game = game
# Cumulative regret per (infoset, action)
self.cumulative_regret = {}
# Cumulative strategy per (infoset, action)
self.cumulative_strategy = {}
self.iteration = 0
def get_strategy(self, info_set: str, actions: list) -> np.ndarray:
"""Compute current strategy via regret matching."""
regrets = np.array([
max(self.cumulative_regret.get((info_set, a), 0.0), 0.0)
for a in actions
])
total = regrets.sum()
if total > 0:
return regrets / total
else:
return np.ones(len(actions)) / len(actions)
def cfr(self, state, reach_probs: dict) -> dict:
"""Recursive CFR traversal."""
if state.is_terminal():
return state.payoffs()
player = state.current_player()
info_set = state.info_set()
actions = state.legal_actions()
strategy = self.get_strategy(info_set, actions)
# Recurse for each action
action_values = {}
node_value = {p: 0.0 for p in self.game.players}
for i, action in enumerate(actions):
new_reach = dict(reach_probs)
new_reach[player] *= strategy[i]
child_state = state.apply_action(action)
action_values[action] = self.cfr(child_state, new_reach)
for p in self.game.players:
node_value[p] += strategy[i] * action_values[action][p]
# Compute counterfactual regret
cf_reach = 1.0
for p in self.game.players:
if p != player:
cf_reach *= reach_probs[p]
for i, action in enumerate(actions):
regret = cf_reach * (action_values[action][player] - node_value[player])
key = (info_set, action)
self.cumulative_regret[key] = self.cumulative_regret.get(key, 0.0) + regret
# Accumulate strategy (weighted by reach probability)
for i, action in enumerate(actions):
key = (info_set, action)
self.cumulative_strategy[key] = (
self.cumulative_strategy.get(key, 0.0) + reach_probs[player] * strategy[i]
)
return node_value
def solve(self, num_iterations: int):
"""Run CFR for num_iterations."""
for t in range(num_iterations):
self.iteration = t + 1
initial_reach = {p: 1.0 for p in self.game.players}
self.cfr(self.game.root(), initial_reach)
def average_strategy(self, info_set: str, actions: list) -> np.ndarray:
"""Return the average strategy (converges to Nash)."""
strat = np.array([
self.cumulative_strategy.get((info_set, a), 0.0)
for a in actions
])
total = strat.sum()
if total > 0:
return strat / total
return np.ones(len(actions)) / len(actions)
1.3 Method: AlphaEvolve for Game Theory
The authors use AlphaEvolve (Google DeepMind's LLM-driven evolutionary code optimization system) to evolve game-theoretic algorithms. The key design decisions:
Search Space Design
- For CFR variants: The search space is the regret update function and strategy computation. The LLM modifies the core update equations while preserving the overall CFR loop structure.
- For PSRO variants: The search space is the meta-solver and response oracle selection. The LLM modifies how the meta-game equilibrium is computed and how new policies are added.
Fitness Evaluation
- Primary metric: NashConv after a fixed number of iterations (lower is better)
- Game suite: Kuhn Poker, Leduc Poker, Goofspiel, Liar's Dice, Dark Hex
- Aggregation: Geometric mean of NashConv across games (rewards algorithms that are good everywhere)
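The aggregation step above can be sketched in a few lines (the function name is an assumption, not from the paper's code):

```python
import numpy as np

def geometric_mean_nashconv(nashconvs) -> float:
    """Geometric mean of per-game NashConv values. A single bad game
    drags the score down multiplicatively, so the fitness rewards
    algorithms that are good everywhere, not just on average."""
    return float(np.exp(np.mean(np.log(nashconvs))))

# Example with five per-game NashConv values at plausible scales:
fitness = geometric_mean_nashconv([1.2e-4, 9.8e-4, 2.3e-3, 5.5e-3, 1.0e-2])
```

Computing the mean in log space avoids underflow when the per-game values are very small.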
Evolution Configuration
# AlphaEvolve configuration for game algorithm discovery
config = {
"population_size": 100,
"num_islands": 10,
"llm": "gemini-2.5-pro",
"evaluation_suite": [
"kuhn_poker",
"leduc_poker",
"goofspiel_4",
"liars_dice_4",
"dark_hex_3x3",
],
"iterations_per_eval": 1000, # CFR/PSRO iterations per game
"fitness": "geometric_mean_nashconv",
"mutation_strategies": [
"modify_regret_update",
"modify_discount_weights",
"modify_strategy_computation",
"add_adaptive_parameter",
"modify_meta_solver",
],
}
1.4 VAD-CFR: Volatility-Adaptive Discounted CFR
VAD-CFR is the primary CFR variant discovered by AlphaEvolve. It introduces three key innovations over standard DCFR:
Innovation 1: Volatility-Adaptive Discounting
Instead of using fixed discount parameters, VAD-CFR dynamically adjusts discounting based on strategy volatility — how much the strategy at each information set changes between iterations.
$$ \text{vol}(I, t) = \left\| \sigma^t(I) - \sigma^{t-1}(I) \right\|_2 $$
The discount factor is then adapted per information set:
$$ \alpha(I, t) = \alpha_{\text{base}} + \eta \cdot \left( 1 - \frac{\text{vol}(I, t)}{\text{vol}_{\max}} \right) $$
When strategy volatility is high (the algorithm is still exploring), less discounting is applied to preserve diversity. When volatility is low (the algorithm is converging), more discounting is applied to accelerate convergence.
Innovation 2: Consistency-Enforced Optimism
VAD-CFR adds an optimistic bonus to the regret of actions that have been consistently improving:
$$ R^t_{\text{opt}}(I, a) = R^t(I, a) + \beta \cdot \text{consistency}(I, a, t) $$
Where consistency measures how many recent iterations the regret for action a has been positive:
$$ \text{consistency}(I, a, t) = \frac{1}{k} \sum_{s=t-k}^{t} \mathbb{1}\left[ r^s(I, a) > 0 \right] $$
This encourages the algorithm to commit more strongly to actions that are consistently good, rather than hedging equally across all positive-regret actions.
Innovation 3: Adaptive Exploration Bonus
VAD-CFR adds a small exploration bonus for rarely-played actions, preventing premature convergence:
$$ \sigma^t_{\text{explore}}(I, a) = \sigma^t(I, a) + \frac{\gamma}{\sqrt{n(I, a, t) + 1}} $$
Where n(I, a, t) is the cumulative probability mass assigned to action a up to iteration t.
Python Pseudocode: VAD-CFR
import numpy as np
from collections import defaultdict
class VADCFR:
"""Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism."""
def __init__(self, game, alpha_base=0.5, eta=0.3, beta=0.1, gamma=0.01, k=10):
self.game = game
self.alpha_base = alpha_base # Base discount factor
self.eta = eta # Volatility adaptation strength
self.beta = beta # Optimism bonus coefficient
self.gamma = gamma # Exploration bonus coefficient
self.k = k # Consistency window size
self.cumulative_regret = defaultdict(float)
self.cumulative_strategy = defaultdict(float)
self.prev_strategy = {} # Previous iteration strategies
self.regret_history = defaultdict(list) # Recent regret signs
self.action_counts = defaultdict(float) # Cumulative action probabilities
self.iteration = 0
def compute_volatility(self, info_set, actions, current_strategy):
"""Compute strategy volatility for adaptive discounting."""
if info_set not in self.prev_strategy:
return 1.0 # Maximum volatility for first iteration
prev = self.prev_strategy[info_set]
diff = current_strategy - prev
return np.linalg.norm(diff)
def compute_consistency(self, info_set, action):
"""Compute action consistency for optimistic bonus."""
history = self.regret_history.get((info_set, action), [])
if len(history) == 0:
return 0.0
recent = history[-self.k:]
return sum(1 for r in recent if r > 0) / len(recent)
def get_strategy(self, info_set, actions):
"""Compute strategy with optimism and exploration bonuses."""
# Base regret matching
regrets = np.array([
max(self.cumulative_regret[(info_set, a)], 0.0) for a in actions
])
# Add consistency-enforced optimism
for i, a in enumerate(actions):
consistency = self.compute_consistency(info_set, a)
regrets[i] += self.beta * consistency
total = regrets.sum()
if total > 0:
strategy = regrets / total
else:
strategy = np.ones(len(actions)) / len(actions)
# Add exploration bonus for rarely-played actions
for i, a in enumerate(actions):
count = self.action_counts.get((info_set, a), 0.0)
strategy[i] += self.gamma / np.sqrt(count + 1)
# Renormalize
strategy = strategy / strategy.sum()
return strategy
def update_regret(self, info_set, actions, regrets, strategy):
"""Update regret with volatility-adaptive discounting."""
volatility = self.compute_volatility(info_set, actions, strategy)
vol_max = np.sqrt(2) # Maximum L2 norm for simplex strategies
# Adaptive discount: less discounting when volatile
alpha = self.alpha_base + self.eta * (1 - volatility / vol_max)
alpha = np.clip(alpha, 0.0, 1.0)
for i, a in enumerate(actions):
key = (info_set, a)
# Discount old regret and add new
self.cumulative_regret[key] = alpha * self.cumulative_regret[key] + regrets[i]
# Track regret signs for consistency
self.regret_history[key].append(regrets[i])
if len(self.regret_history[key]) > self.k * 2:
self.regret_history[key] = self.regret_history[key][-self.k:]
# Update action counts
self.action_counts[key] = self.action_counts.get(key, 0.0) + strategy[i]
# Store strategy for volatility computation
self.prev_strategy[info_set] = strategy.copy()
def solve(self, num_iterations):
"""Run VAD-CFR for num_iterations."""
for t in range(num_iterations):
self.iteration = t + 1
self._cfr_traversal(self.game.root(), {p: 1.0 for p in self.game.players})
def _cfr_traversal(self, state, reach_probs):
"""Recursive CFR traversal with VAD modifications."""
if state.is_terminal():
return state.payoffs()
player = state.current_player()
info_set = state.info_set()
actions = state.legal_actions()
strategy = self.get_strategy(info_set, actions)
# Standard recursive CFR
action_values = {}
node_value = {p: 0.0 for p in self.game.players}
for i, action in enumerate(actions):
new_reach = dict(reach_probs)
new_reach[player] *= strategy[i]
action_values[action] = self._cfr_traversal(
state.apply_action(action), new_reach
)
for p in self.game.players:
node_value[p] += strategy[i] * action_values[action][p]
# Compute counterfactual regret
cf_reach = np.prod([reach_probs[p] for p in self.game.players if p != player])
regrets = np.array([
cf_reach * (action_values[a][player] - node_value[player])
for a in actions
])
# Update with adaptive discounting
self.update_regret(info_set, actions, regrets, strategy)
# Accumulate average strategy
for i, action in enumerate(actions):
key = (info_set, action)
self.cumulative_strategy[key] += reach_probs[player] * strategy[i]
return node_value
1.5 SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO
SHOR-PSRO enhances standard PSRO with three modifications discovered by AlphaEvolve:
Innovation 1: Smoothed Meta-Solver
Instead of computing an exact Nash equilibrium of the meta-game (which can be brittle), SHOR-PSRO uses a smoothed best response with entropy regularization:
$$ \sigma_{\text{meta}} = \arg\max_{\sigma} \left[ u(\sigma, \sigma_{-i}) + \tau H(\sigma) \right] $$
Where H(σ) = -∑_j σ(j) log σ(j) is the entropy and τ is the smoothing temperature. This produces more robust meta-strategies that are less sensitive to small payoff changes.
Innovation 2: Hybrid Oracle Selection
Standard PSRO always computes a full best response. SHOR-PSRO uses a hybrid selection that mixes:
- Best response (BR): With probability p_BR, compute a full best response
- Diverse response (DR): With probability 1 − p_BR, compute a response that maximizes a combination of utility and novelty
$$ \pi_{\text{DR}} = \arg\max_{\pi} \left[ u(\pi, \sigma_{-i}) + \lambda \cdot \min_j d(\pi, \pi_j) \right] $$
Where d(π, π_j) is the distance between the new policy and existing population member j, and λ controls the novelty-utility trade-off.
Innovation 3: Optimistic Regret Minimization
The meta-solver uses optimistic regret matching where the regret incorporates a prediction of the next-round payoff based on recent trends:
$$ R^T_{\text{opt}}(j) = R^T(j) + \mu \cdot \left( r^T(j) - r^{T-1}(j) \right) $$
This "look-ahead" correction helps the meta-solver converge faster by anticipating the direction of payoff changes.
Python Pseudocode: SHOR-PSRO
import numpy as np
from scipy.optimize import linprog
class SHORPSRO:
"""Smoothed Hybrid Optimistic Regret PSRO."""
def __init__(self, game, tau=0.1, lambda_novelty=0.3, p_br=0.7, mu=0.5):
self.game = game
self.tau = tau # Smoothing temperature
self.lambda_novelty = lambda_novelty # Novelty bonus
self.p_br = p_br # Best response probability
self.mu = mu # Optimistic correction strength
self.population = {p: [] for p in game.players}
self.meta_payoffs = None # Meta-game payoff tensor
self.meta_strategy = None # Current meta-equilibrium
self.prev_regrets = None # For optimistic correction
def smoothed_nash(self, payoff_matrix, player):
"""Compute entropy-regularized Nash equilibrium."""
n = payoff_matrix.shape[0]
# Iterative softmax best response with entropy bonus
sigma = np.ones(n) / n # Uniform start
for _ in range(100): # Fixed-point iteration
utilities = payoff_matrix @ sigma if player == 0 else payoff_matrix.T @ sigma
# Softmax with temperature tau
exp_u = np.exp((utilities - utilities.max()) / self.tau)
sigma_new = exp_u / exp_u.sum()
if np.allclose(sigma, sigma_new, atol=1e-8):
break
sigma = sigma_new
return sigma
def compute_diverse_response(self, meta_strategy, player):
"""Compute a response that balances utility and novelty."""
# Train a policy against meta_strategy mixture
base_response = self.compute_best_response(meta_strategy, player)
# Compute distances to existing population
if len(self.population[player]) == 0:
return base_response
min_distance = min(
self.policy_distance(base_response, existing)
for existing in self.population[player]
)
# If already novel enough, return best response
if min_distance > 0.1:
return base_response
# Otherwise, add perturbation for diversity
perturbed = base_response.add_noise(scale=0.1)
return perturbed
def update_meta_strategy(self):
"""Update meta-strategy using optimistic regret matching."""
payoffs = self.meta_payoffs
n = len(self.population[0])
# Compute current regrets
current_regrets = np.zeros(n)
for j in range(n):
pure_strategy = np.zeros(n)
pure_strategy[j] = 1.0
current_regrets[j] = (
pure_strategy @ payoffs @ self.meta_strategy[1]
- self.meta_strategy[0] @ payoffs @ self.meta_strategy[1]
)
# Apply optimistic correction
if self.prev_regrets is not None and len(self.prev_regrets) == n:
optimistic_regrets = current_regrets + self.mu * (current_regrets - self.prev_regrets)
else:
optimistic_regrets = current_regrets
self.prev_regrets = current_regrets.copy()
# Regret matching with optimistic regrets
positive_regrets = np.maximum(optimistic_regrets, 0)
total = positive_regrets.sum()
if total > 0:
self.meta_strategy[0] = positive_regrets / total
else:
self.meta_strategy[0] = np.ones(n) / n
def iterate(self):
"""Single PSRO iteration."""
# 1. Update meta-strategy (smoothed + optimistic)
self.update_meta_strategy()
# 2. For each player, add new policy
for player in self.game.players:
if np.random.random() < self.p_br:
# Best response
new_policy = self.compute_best_response(
self.meta_strategy[1 - player], player
)
else:
# Diverse response
new_policy = self.compute_diverse_response(
self.meta_strategy[1 - player], player
)
self.population[player].append(new_policy)
# 3. Update meta-game payoff matrix
self.meta_payoffs = self.compute_meta_payoffs()
def solve(self, num_iterations):
"""Run SHOR-PSRO for num_iterations."""
# Initialize with random policies
for player in self.game.players:
self.population[player].append(self.random_policy(player))
self.meta_payoffs = self.compute_meta_payoffs()
n = len(self.population[0])
self.meta_strategy = [np.ones(n) / n for _ in self.game.players]
for t in range(num_iterations):
self.iterate()
nashconv = self.compute_nashconv()
print(f"Iteration {t}: NashConv = {nashconv:.6f}, "
f"Population size = {len(self.population[0])}")
1.6 Experimental Results
CFR Variants: NashConv after 1000 Iterations
| Game | Vanilla CFR | DCFR | Linear CFR | PCFR+ | VAD-CFR |
|---|---|---|---|---|---|
| Kuhn Poker | 2.1e-3 | 8.5e-4 | 5.2e-4 | 3.1e-4 | 1.2e-4 |
| Leduc Poker | 1.8e-2 | 6.3e-3 | 4.1e-3 | 2.5e-3 | 9.8e-4 |
| Goofspiel(4) | 4.5e-2 | 1.2e-2 | 8.5e-3 | 5.0e-3 | 2.3e-3 |
| Liar's Dice(4) | 8.2e-2 | 2.8e-2 | 1.9e-2 | 1.1e-2 | 5.5e-3 |
| Dark Hex(3x3) | 1.5e-1 | 5.5e-2 | 3.8e-2 | 2.2e-2 | 1.0e-2 |
| Geo Mean | 3.2e-2 | 1.1e-2 | 7.3e-3 | 4.3e-3 | 2.0e-3 |
PSRO Variants: NashConv after 50 Iterations
| Game | PSRO | DCH | PSRO-RN | Pipeline PSRO | SHOR-PSRO |
|---|---|---|---|---|---|
| Kuhn Poker | 5.2e-3 | 2.8e-3 | 2.1e-3 | 1.5e-3 | 6.5e-4 |
| Leduc Poker | 4.8e-2 | 2.1e-2 | 1.5e-2 | 1.2e-2 | 5.0e-3 |
| Goofspiel(4) | 9.5e-2 | 4.5e-2 | 3.2e-2 | 2.8e-2 | 1.2e-2 |
| Liar's Dice(4) | 1.8e-1 | 8.0e-2 | 6.5e-2 | 5.2e-2 | 2.5e-2 |
| Dark Hex(3x3) | 3.2e-1 | 1.5e-1 | 1.2e-1 | 9.8e-2 | 4.8e-2 |
| Geo Mean | 7.5e-2 | 3.4e-2 | 2.5e-2 | 2.0e-2 | 9.5e-3 |
1.7 Analysis & Insights
Why Volatility-Adaptive Discounting Works
- Early iterations: High volatility → low discounting → preserves exploration → avoids premature convergence to suboptimal strategies
- Late iterations: Low volatility → high discounting → accelerates convergence → faster approach to Nash equilibrium
- Per-information-set adaptation: Different parts of the game tree converge at different rates. Global discounting is suboptimal because it forces a single trade-off for the entire game.
Why Consistency-Enforced Optimism Works
- Signal vs noise: Actions with consistently positive regret are genuinely good actions, not statistical noise. The optimistic bonus amplifies this signal.
- Faster commitment: By committing more strongly to consistently-good actions, the algorithm reduces the effective action space sooner, accelerating convergence.
- Self-correcting: If an action's regret becomes inconsistent (alternating positive/negative), the consistency score drops and the bonus disappears, preventing over-commitment.
Ablation Study
| VAD-CFR Variant | Geo Mean NashConv | Relative to Full |
|---|---|---|
| Full VAD-CFR | 2.0e-3 | 1.00x |
| No volatility adaptation (fixed α) | 3.5e-3 | 1.75x worse |
| No consistency optimism | 3.0e-3 | 1.50x worse |
| No exploration bonus | 2.3e-3 | 1.15x worse |
| No adaptations (= DCFR) | 1.1e-2 | 5.50x worse |
**Significance:** This work demonstrates that LLM-driven evolution can discover algorithms that outperform decades of expert-designed game-theoretic algorithms. The discovered innovations (volatility adaptation, consistency-enforced optimism) are interpretable and mathematically principled, suggesting that LLMs can serve as creative research collaborators in algorithm design.
Paper 2
Aletheia Tackles FirstProof Autonomously
arXiv: 2602.21201 · Authors: Tony Feng et al. (17 authors) · Affiliation: Google DeepMind · Date: February 2026
2.1 Overview & Motivation
Aletheia is a Gemini 3 Deep Think agent that autonomously solves mathematical proof problems from the FirstProof benchmark. FirstProof contains 10 challenging mathematical problems requiring formal proofs in Lean 4, ranging from combinatorics to algebra to analysis.
**Key Result:** Aletheia solved **6 out of 10** FirstProof problems autonomously, without human guidance. This is the first demonstration of an LLM agent solving competition-level mathematical proof problems end-to-end.
What is FirstProof?
FirstProof is a benchmark of 10 mathematical problems that require:
- Problem understanding: Parse and understand the mathematical statement
- Strategy formulation: Decide on a proof approach (direct proof, induction, contradiction, etc.)
- Formal verification: Write the proof in Lean 4 such that the proof checker accepts it
- Error recovery: When the proof checker rejects a step, diagnose and fix the error
Why is This Hard?
Mathematical Reasoning
Each problem requires non-trivial mathematical insight. Standard template-based approaches fail because the proofs require creative lemma formulation and case analysis.
Formal Verification
Writing proofs in Lean 4 is notoriously difficult even for human experts. The proof assistant demands absolute precision in types, terms, and tactic application.
Long Horizon
Successful proofs require 50-200+ lines of Lean 4 code, with complex dependency chains between lemmas. A single error can invalidate the entire proof.
Error Diagnosis
When a proof fails, the Lean error messages are often cryptic. The agent must interpret compiler errors, understand type mismatches, and devise fixes.
2.2 Aletheia Agent Architecture
+-----------------------------------------------------------------------------------+
| Aletheia Agent Architecture |
+-----------------------------------------------------------------------------------+
| |
| +------------------+ +-------------------+ +-------------------------+ |
| | Problem Input | | Gemini 3 | | Lean 4 Environment | |
| | |---->| Deep Think |---->| | |
| | Natural language | | | | Proof checker | |
| | Lean 4 statement | | Extended thinking | | Type checker | |
| +------------------+ | Chain of thought | | Tactic evaluation | |
| +-------------------+ +-------------------------+ |
| | | |
| v v |
| +------------------------------------------------------------------------+ |
| | Agent Loop | |
| | | |
| | 1. Understand problem statement | |
| | 2. Formulate proof strategy (high-level) | |
| | 3. Decompose into lemmas | |
| | 4. Write Lean 4 proof for each lemma | |
| | 5. Submit to Lean 4 checker | |
| | 6. If error: analyze error, revise strategy, go to step 3 | |
| | 7. If success: combine lemmas into complete proof | |
| | 8. Submit final proof | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Strategy Selection | |
| | | |
| | Direct Proof | Induction | Contradiction | Case Analysis | Counting | |
| | | |
| | The agent selects and sometimes switches strategies based on | |
| | feedback from the Lean 4 proof checker. | |
| +------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
Gemini 3 Deep Think
Aletheia is built on Gemini 3 Deep Think, a variant of Google's Gemini 3 model with extended thinking capability. Key properties:
- Extended thinking: The model can spend more compute on reasoning, producing longer and more detailed chain-of-thought before generating output.
- Tool use: The model can call the Lean 4 proof checker as a tool, submitting partial proofs and receiving feedback.
- Multi-turn interaction: The agent maintains context across multiple rounds of proof attempt, error analysis, and revision.
- Self-correction: When a proof fails, the model analyzes the error and proposes a fix, often changing the overall proof strategy if needed.
Proof Strategy Selection
Aletheia uses a hierarchical strategy selection process:
- Initial analysis: Read the problem statement and identify the mathematical domain (combinatorics, algebra, analysis, number theory)
- Strategy enumeration: Generate 2-4 candidate proof strategies with brief justifications
- Strategy selection: Choose the most promising strategy based on problem structure
- Lemma decomposition: Break the proof into 3-8 lemmas, each with a clear mathematical statement
- Bottom-up proof: Prove lemmas in dependency order, submitting each to the Lean 4 checker
- Strategy switching: If a strategy leads to a dead end after 3-5 failed attempts, switch to an alternative
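The hierarchical loop above can be sketched abstractly. This is a hedged reconstruction, not the paper's implementation; the callables are placeholders for the model's decomposition step and the Lean 4 checker:

```python
def proof_search(strategies, decompose, prove_lemma, max_attempts=5):
    """Try each candidate strategy in turn; within a strategy, retry up to
    max_attempts times; switch strategies once a dead end is reached."""
    for strategy in strategies:
        for _ in range(max_attempts):
            lemmas = decompose(strategy)                  # lemma decomposition
            proofs = [prove_lemma(lem) for lem in lemmas]  # bottom-up proving
            if all(p is not None for p in proofs):
                return strategy, proofs                   # combine and submit
    return None, []                                       # all strategies failed

# Toy usage: 'induction' always fails the checker, 'case_analysis' succeeds.
result = proof_search(
    strategies=["induction", "case_analysis"],
    decompose=lambda s: [f"{s}_lemma_{i}" for i in range(3)],
    prove_lemma=lambda lem: lem if lem.startswith("case") else None,
)
```

The key behavior is the outer loop: a strategy is abandoned only after its retry budget is exhausted, mirroring the 3-5-failure switching rule described above.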
2.3 FirstProof Problems & Results
| # | Problem | Domain | Difficulty | Solved? | Attempts | Proof Length |
|---|---|---|---|---|---|---|
| 1 | Pigeonhole variant with divisibility | Combinatorics | Medium | Yes | 3 | 45 lines |
| 2 | Polynomial root bound | Algebra | Medium | Yes | 5 | 78 lines |
| 3 | Infinite series convergence | Analysis | Hard | Yes | 8 | 120 lines |
| 4 | Graph coloring lower bound | Combinatorics | Hard | Yes | 12 | 95 lines |
| 5 | Modular arithmetic identity | Number Theory | Medium | Yes | 4 | 55 lines |
| 6 | Measure-theoretic inequality | Analysis | Very Hard | No | 20+ | — |
| 7 | Ramsey-type bound | Combinatorics | Very Hard | No | 15+ | — |
| 8 | Matrix rank inequality | Linear Algebra | Hard | Yes | 7 | 88 lines |
| 9 | Diophantine equation solvability | Number Theory | Very Hard | No | 20+ | — |
| 10 | Topological fixed-point theorem | Topology | Very Hard | No | 20+ | — |
Detailed Problem Analysis
Problem 1: Pigeonhole with Divisibility (Solved)
**Statement:** Given n+1 integers from {1, ..., 2n}, prove that there exist two integers where one divides the other.
Aletheia's approach:
- Recognized this as a pigeonhole problem
- Defined "pigeonholes" based on the odd parts of each integer (write each number as 2^k ⋅ m where m is odd)
- Proved there are exactly n odd numbers in {1, ..., 2n}, so by pigeonhole two selected numbers share the same odd part
- Proved that if two numbers share the same odd part, the smaller divides the larger
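For illustration, the theorem might be stated in Lean 4 roughly as follows. This is a hypothetical sketch of the statement only (the paper's actual formalization is not shown), with the proof body elided:

```lean
-- Hypothetical Lean 4 / mathlib-style statement of Problem 1;
-- names and exact phrasing are assumptions, proof omitted.
theorem one_divides_other (n : ℕ) (S : Finset ℕ)
    (hsub : S ⊆ Finset.Icc 1 (2 * n)) (hcard : S.card = n + 1) :
    ∃ a ∈ S, ∃ b ∈ S, a ≠ b ∧ a ∣ b := by
  sorry
```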
Problem 3: Infinite Series Convergence (Solved, 8 attempts)
**Statement:** Prove that $\sum_{n=1}^{\infty} \frac{\sin n}{n}$ converges.
Aletheia's journey:
- Attempt 1-3: Tried Abel's test directly, but struggled with the Lean 4 formalization of partial summation
- Attempt 4-5: Switched to Dirichlet's test, but Lean's mathlib didn't have the needed lemma
- Attempt 6-7: Proved the Dirichlet test as a standalone lemma first, then applied it
- Attempt 8: Successfully combined the Dirichlet test lemma with a bound on partial sums of sin(n) using the geometric series formula for complex exponentials
Problem 6: Measure-Theoretic Inequality (Not Solved)
This problem required deep measure theory that is partially formalized in Lean's mathlib but with many gaps. Aletheia's attempts failed because:
- Key lemmas about sigma-finite measures were missing from mathlib
- The agent attempted to prove them from scratch but the dependent chain became too deep
- After 20+ attempts, the agent could not close the proof gap
2.4 Analysis & Significance
Success Patterns
- Strategy switching: In 4 of 6 solved problems, the successful proof used a different strategy than the first attempt. The ability to recognize dead ends and switch approaches was critical.
- Lemma decomposition: All successful proofs used 3-6 intermediate lemmas. Direct proof attempts without decomposition always failed.
- Error-driven learning: Each failed attempt produced specific Lean error messages that helped the agent refine its approach. The agent effectively "learned" from proof checker feedback within a single problem.
Failure Patterns
- Formalization gap: All 4 unsolved problems involved mathematical concepts with incomplete Lean 4 formalizations. The agent could sketch the proof in natural language but could not bridge the formalization gap.
- Deep dependency chains: Problems requiring 5+ intermediate lemmas that are not in mathlib proved too challenging. The agent's proof length budget was effectively exhausted before reaching the main result.
- Novel mathematical insight: Problems 7 and 9 required truly novel mathematical insights (not just known proof techniques applied in a new way). The agent could not generate these insights.
Comparison with Previous Approaches
| System | Model | Problems Solved (out of 10) | Max Attempts |
|---|---|---|---|
| GPT-4o (direct) | GPT-4o | 1 | 5 |
| Claude Sonnet 4 (direct) | Claude Sonnet 4 | 2 | 5 |
| Gemini 2.0 Pro (direct) | Gemini 2.0 Pro | 2 | 5 |
| LeanDojo | ReProver | 3 | 100 |
| LEGO-Prover | GPT-4 | 3 | 50 |
| DeepSeek-Prover V2 | DeepSeek V2 | 4 | 100 |
| Aletheia | Gemini 3 Deep Think | 6 | 20 |
Implications for Evolutionary AI
**Connection to Evolutionary Systems:** Aletheia's proof search process is analogous to evolutionary optimization:
- **Population:** Multiple proof attempts represent a population of candidate solutions
- **Fitness:** The Lean 4 proof checker provides a binary fitness signal (compiles/doesn't compile) plus structured error feedback (analogous to ASI in GEPA)
- **Mutation:** Each new attempt is a "mutation" of the previous approach, guided by error analysis
- **Selection:** Successful proof strategies are retained and refined; failed strategies are abandoned
- **Strategy switching:** Analogous to island migration or diversity-preserving mechanisms in evolutionary systems
An evolutionary approach (like AlphaEvolve or GEPA applied to proof search) could potentially solve more problems by maintaining a diverse population of proof strategies and systematically exploring the search space.
Mathematical Foundation: Nash Equilibrium Connection
The connection between Papers 1 and 2 runs deeper than methodology. In game-theoretic terms, mathematical proof search can be modeled as a game:
**Proof Game:** Prover vs. Verifier (the Lean 4 checker)
The Prover's strategy space is the set of all possible proof scripts. The Verifier's strategy is deterministic (check each step). The Prover "wins" if the proof compiles. This is a single-player game against nature, but the search for winning strategies shares algorithmic structure with Nash equilibrium computation in two-player games.
$$ \text{Exploitability}(\sigma_{\text{proof}}) = 1 - \Pr\left[ \text{proof compiles} \mid \sigma_{\text{proof}} \right] $$
Minimizing exploitability in this setting corresponds to finding proofs that are robust to the verifier's checks — i.e., correct proofs.
**Summary:** These two papers demonstrate complementary applications of LLM-driven search. Paper 1 shows that evolutionary LLM search can discover novel game-theoretic algorithms that outperform human-designed baselines. Paper 2 shows that LLM agents with extended reasoning can autonomously solve challenging mathematical proof problems. Together, they suggest a powerful paradigm: combining evolutionary search (for exploring algorithm/proof space) with deep reasoning (for evaluating and refining candidates).