AlphaEvolve Research Papers
Part P05: Benchmarks, Discovery & Applications
This chapter examines two research papers from early 2026 that apply LLM-driven evolutionary search to domains beyond traditional optimization: multiagent game theory and formal mathematical theorem proving. Both papers originate from Google DeepMind and demonstrate that evolutionary code search—as embodied by AlphaEvolve and related agent architectures—can discover algorithms and proofs that match or exceed decades of human-designed baselines. Together, they illuminate the expanding frontier of what LLM-powered evolution can achieve when applied to structured mathematical and algorithmic domains.
Key Contributions
Paper 1 (Li et al., arXiv:2602.16928): AlphaEvolve discovers two novel game-theoretic algorithm families—VAD-CFR and SHOR-PSRO—that outperform all human-designed baselines across five benchmark games. The discovered innovations are mathematically interpretable and principled.
Paper 2 (Feng et al., arXiv:2602.21201): The Aletheia agent, built on Gemini 3 Deep Think, autonomously solves 6 of 10 FirstProof competition-level mathematical proof problems in Lean 4, surpassing all prior automated approaches and most direct LLM baselines.
24.1 Overview and Motivation
The original AlphaEvolve system, presented by Google DeepMind in 2025, demonstrated LLM-driven evolutionary code optimization on problems ranging from matrix multiplication to combinatorial optimization. The two papers examined in this chapter represent a subsequent wave of research that extends LLM-powered evolutionary search into fundamentally new territory: the automated discovery of algorithms (not just heuristics) and the autonomous construction of formal mathematical proofs.
These papers share a common thesis: that LLMs, when embedded in search loops with structured feedback, can function as creative research collaborators. In Paper 1, the search loop is explicit evolutionary optimization over algorithm code. In Paper 2, the search loop is an agent-based proof-attempt-and-revise cycle that exhibits evolutionary dynamics without a formal population. Both rely on a critical ingredient: a verifier that provides objective, automated feedback. For game theory, this is the NashConv metric computed by exhaustive game-tree evaluation. For proofs, this is the Lean 4 type checker, which either accepts or rejects a proof with structured error messages.
24.1.1 Broader Context: AlphaEvolve's Research Impact
These two papers sit within a growing body of AlphaEvolve-derived research from DeepMind. The original system reported results on matrix multiplication (discovering algorithms that improve upon Strassen-like decompositions for specific tensor sizes), kissing numbers in higher dimensions (improving known bounds for sphere packing in certain geometries), and various combinatorial optimization tasks. This chapter focuses specifically on the two arXiv papers from February 2026 that push into game theory and formal mathematics, as they represent the clearest examples of LLM-driven evolution producing novel, interpretable, and verifiable scientific contributions.
The significance of these papers extends beyond their individual results. They suggest a broader methodological pattern: LLM-driven evolutionary search is most powerful when (1) the search space is structured code, (2) fitness can be evaluated automatically, and (3) the domain rewards creative recombination of known techniques. Both game-theoretic algorithm design and mathematical proof construction satisfy these criteria.
24.2 Paper 1: Discovering Multiagent Learning Algorithms
Citation: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot. "Discovering Multiagent Learning Algorithms with LLMs." arXiv:2602.16928, February 2026. Google DeepMind.
This paper applies AlphaEvolve to the problem of automatically discovering game-theoretic learning algorithms. The target domain is imperfect-information extensive-form games—games like poker, where players have private information and must reason under uncertainty. The paper evolves two families of algorithms: variants of Counterfactual Regret Minimization (CFR) and variants of Policy-Space Response Oracles (PSRO).
24.2.1 Background: Counterfactual Regret Minimization
CFR is the dominant algorithm family for solving imperfect-information games. It operates by iterating over the game tree, computing counterfactual regret for each action at each information set, and updating strategies to minimize cumulative regret. The key quantities are defined as follows.
For player $i$ at information set $I$, the instantaneous counterfactual regret for action $a$ at iteration $t$ is:
where $v^t(I, a)$ is the counterfactual value of always playing action $a$ at information set $I$, and $v^t(I)$ is the counterfactual value under the current strategy. The cumulative counterfactual regret after $T$ iterations is:
Strategies are updated via regret matching: at each iteration, the probability of playing action $a$ is proportional to the positive part of cumulative regret:
where $R^{T,+}(I, a) = \max(R^T(I, a), 0)$. If all regrets are non-positive, the strategy defaults to uniform. The average strategy across all iterations converges to a Nash equilibrium at rate $O(1/\sqrt{T})$:
where $\Delta$ is the payoff range, $|A|$ is the maximum action count at any information set, and $T$ is the iteration count. Discounted CFR (DCFR) improves convergence by weighting recent iterations more heavily, using parameterized discount factors $(\alpha, \beta, \gamma)$ for positive regrets, negative regrets, and strategy averaging respectively.
24.2.2 Background: Policy-Space Response Oracles
PSRO is a framework for large games that builds a population of policies and computes equilibria in the resulting meta-game. Each iteration: (1) constructs a payoff matrix from the current policy population, (2) computes a Nash equilibrium of this meta-game, (3) computes a best response to the meta-equilibrium mixture, and (4) adds the best response to the population. Convergence is measured by NashConv:
where $u_i$ denotes player $i$'s utility, $\sigma_{-i}$ is the strategy profile of all other players, and the maximization is over all possible strategies for player $i$. At a Nash equilibrium, $\text{NashConv} = 0$.
24.2.3 Search Space and Fitness Design
The authors use AlphaEvolve with Gemini 2.5 Pro as the underlying LLM to evolve algorithm code. The search space design is critical:
- For CFR variants: The mutable code region encompasses the regret update function and strategy computation. The overall CFR loop structure (game-tree traversal, counterfactual value computation) is held fixed.
- For PSRO variants: The mutable region encompasses the meta-solver and response oracle selection logic. The population management infrastructure is held fixed.
Fitness is evaluated as the geometric mean of NashConv across five benchmark games after a fixed number of iterations (1000 for CFR, 50 for PSRO). The geometric mean rewards algorithms that perform well across all games rather than excelling on one while failing on others. The game suite comprises: Kuhn Poker, Leduc Poker, Goofspiel(4), Liar's Dice(4), and Dark Hex(3×3).
The evolutionary configuration uses a population of 100 candidates across 10 islands, with mutation strategies including: modifying regret updates, modifying discount weights, modifying strategy computation, adding adaptive parameters, and modifying the meta-solver.
Figure 24.1: AlphaEvolve pipeline for game-theoretic algorithm discovery. Candidate algorithms are mutated by Gemini 2.5 Pro, evaluated across five games, and selected by geometric mean NashConv.
24.2.4 VAD-CFR: Volatility-Adaptive Discounted CFR
The primary CFR variant discovered by AlphaEvolve is VAD-CFR (Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism). It introduces three innovations over standard DCFR, each of which emerged from the evolutionary search process and was subsequently analyzed for mathematical interpretability.
Innovation 1: Volatility-Adaptive Discounting
Rather than using fixed discount parameters, VAD-CFR adapts discounting per information set based on strategy volatility—the magnitude of strategy change between consecutive iterations. The volatility at information set $I$ at iteration $t$ is:
where $\sigma^t(I)$ is the strategy vector (probability simplex) at information set $I$ at iteration $t$, and $\|\cdot\|_2$ denotes the Euclidean norm. The discount factor is then adapted:
where $\alpha_{\text{base}}$ is the base discount factor, $\eta$ is the adaptation strength, and $\text{vol}_{\max} = \sqrt{2}$ is the maximum $L_2$ norm between two points on a probability simplex. The result is clipped to $[0, 1]$.
Intuition: When volatility is high (strategy still shifting substantially), the algorithm applies less discounting to preserve exploratory diversity. When volatility is low (strategy stabilizing), heavier discounting accelerates convergence. Crucially, this adaptation is per information set, allowing different parts of the game tree to converge at different rates—a refinement that global discounting cannot achieve.
Innovation 2: Consistency-Enforced Optimism
VAD-CFR adds an optimistic bonus to actions whose regret has been consistently positive over a recent window:
where $\beta$ is the optimism coefficient and consistency is defined over a window of $k$ recent iterations:
where $\mathbb{1}[\cdot]$ is the indicator function and $r^s(I, a)$ is the instantaneous counterfactual regret for action $a$ at iteration $s$. This mechanism accelerates commitment to actions that are reliably good, while the consistency score naturally decays for actions with fluctuating regret, preventing over-commitment.
Innovation 3: Adaptive Exploration Bonus
A count-based exploration bonus prevents premature convergence by upweighting rarely-played actions:
where $n(I, a, t) = \sum_{s=1}^{t} \sigma^s(I, a)$ is the cumulative probability mass assigned to action $a$ through iteration $t$, and $\gamma$ is the exploration coefficient. The resulting strategy is renormalized to a valid probability distribution. This term has clear parallels to UCB-style exploration in bandit algorithms, which is notable since the LLM discovered this pattern through evolutionary search rather than explicit design.
# Pseudocode — no public implementation available
# Illustrates VAD-CFR's core update logic as described in arXiv:2602.16928
import numpy as np
from collections import defaultdict
class VADCFR:
"""Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism.
Pseudocode reconstruction from Li et al. (2026).
"""
def __init__(self, game, alpha_base=0.5, eta=0.3, beta=0.1, gamma=0.01, k=10):
self.game = game
self.alpha_base = alpha_base # Base discount factor
self.eta = eta # Volatility adaptation strength
self.beta = beta # Optimism bonus coefficient
self.gamma = gamma # Exploration bonus coefficient
self.k = k # Consistency window size
self.cumulative_regret = defaultdict(float)
self.prev_strategy = {} # Previous strategy per information set
self.regret_history = defaultdict(list)
self.action_counts = defaultdict(float)
def get_strategy(self, info_set, actions):
"""Compute strategy with optimism and exploration bonuses."""
# Step 1: Standard regret matching
regrets = np.array([
max(self.cumulative_regret[(info_set, a)], 0.0) for a in actions
])
# Step 2: Add consistency-enforced optimism
for i, a in enumerate(actions):
history = self.regret_history.get((info_set, a), [])
recent = history[-self.k:] if history else []
consistency = sum(1 for r in recent if r > 0) / max(len(recent), 1)
regrets[i] += self.beta * consistency
# Step 3: Normalize to probability distribution
total = regrets.sum()
strategy = regrets / total if total > 0 else np.ones(len(actions)) / len(actions)
# Step 4: Add exploration bonus for rarely-played actions
for i, a in enumerate(actions):
count = self.action_counts.get((info_set, a), 0.0)
strategy[i] += self.gamma / np.sqrt(count + 1)
return strategy / strategy.sum() # Renormalize
def update_regret(self, info_set, actions, regrets, strategy):
"""Volatility-adaptive regret update."""
# Compute per-information-set volatility
if info_set in self.prev_strategy:
vol = np.linalg.norm(strategy - self.prev_strategy[info_set])
else:
vol = np.sqrt(2) # Maximum volatility for first iteration
# Adaptive discount: less discounting when volatile
alpha = np.clip(self.alpha_base + self.eta * (1 - vol / np.sqrt(2)), 0.0, 1.0)
for i, a in enumerate(actions):
key = (info_set, a)
self.cumulative_regret[key] = alpha * self.cumulative_regret[key] + regrets[i]
self.regret_history[key].append(regrets[i])
self.action_counts[key] += strategy[i]
self.prev_strategy[info_set] = strategy.copy()
24.2.5 SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO
The second major discovery is SHOR-PSRO, an enhanced PSRO variant with three modifications to the standard algorithm.
Innovation 1: Smoothed Meta-Solver
Instead of computing an exact Nash equilibrium of the meta-game (which can be brittle when the payoff matrix changes slightly between iterations), SHOR-PSRO uses entropy regularization:
where $H(\sigma) = -\sum_j \sigma(j) \log \sigma(j)$ is the Shannon entropy of the meta-strategy $\sigma$, and $\tau > 0$ is a smoothing temperature parameter. This produces meta-strategies that spread probability mass more evenly, reducing sensitivity to small payoff perturbations.
Innovation 2: Hybrid Oracle Selection
Standard PSRO always computes a full best response. SHOR-PSRO stochastically selects between a best response (with probability $p_{\text{BR}}$) and a diverse response (with probability $1 - p_{\text{BR}}$) that maximizes a combination of utility and novelty:
where $d(\pi, \pi_j)$ is the distance between the candidate policy $\pi$ and existing population member $\pi_j$, and $\lambda$ controls the novelty-utility trade-off. The $\min_j$ ensures the new policy is distant from its closest neighbor in the population, maximizing minimum diversity.
Innovation 3: Optimistic Regret in the Meta-Solver
The meta-solver uses a "look-ahead" correction that predicts the next-round payoff change based on recent trends:
where $R^T(j)$ is the cumulative regret for meta-game action $j$ (playing the $j$-th policy in the population), $r^T(j)$ is the instantaneous regret at round $T$, and $\mu$ is the optimistic correction strength. This extrapolation helps the meta-solver anticipate the direction of payoff changes as new policies enter the population.
# Pseudocode — no public implementation available
# Illustrates SHOR-PSRO's meta-solver as described in arXiv:2602.16928
import numpy as np
class SHORPSROMetaSolver:
"""Smoothed Hybrid Optimistic Regret meta-solver for PSRO.
Pseudocode reconstruction from Li et al. (2026).
"""
def __init__(self, tau=0.1, mu=0.5, p_br=0.7, lambda_novelty=0.3):
self.tau = tau # Entropy regularization temperature
self.mu = mu # Optimistic correction strength
self.p_br = p_br # Probability of best response vs diverse response
self.lambda_novelty = lambda_novelty
self.prev_regrets = None
def smoothed_strategy(self, payoff_row, current_sigma):
"""Compute entropy-regularized best response via iterated softmax."""
sigma = np.ones(len(payoff_row)) / len(payoff_row)
for _ in range(100):
utilities = payoff_row # Row of meta-game payoff matrix
# Softmax with temperature tau (entropy regularization)
exp_u = np.exp((utilities - utilities.max()) / self.tau)
sigma_new = exp_u / exp_u.sum()
if np.allclose(sigma, sigma_new, atol=1e-8):
break
sigma = sigma_new
return sigma
def optimistic_update(self, current_regrets):
"""Apply optimistic regret correction based on recent trend."""
if self.prev_regrets is not None and len(self.prev_regrets) == len(current_regrets):
optimistic = current_regrets + self.mu * (current_regrets - self.prev_regrets)
else:
optimistic = current_regrets
self.prev_regrets = current_regrets.copy()
# Regret matching on optimistic regrets
positive = np.maximum(optimistic, 0)
total = positive.sum()
return positive / total if total > 0 else np.ones(len(positive)) / len(positive)
def select_oracle_type(self):
"""Stochastically select best response or diverse response."""
return "best_response" if np.random.random() < self.p_br else "diverse_response"
24.2.6 Experimental Results
The paper reports NashConv measurements after a fixed number of iterations across five benchmark games. Results are shown for both CFR variants (1000 iterations) and PSRO variants (50 iterations). All numbers below are from Li et al. (2026), Table 1 and Table 2.
CFR Variants: NashConv after 1000 Iterations
| Game | Vanilla CFR | DCFR | Linear CFR | PCFR+ | VAD-CFR |
|---|---|---|---|---|---|
| Kuhn Poker | 2.1×10⁻³ | 8.5×10⁻⁴ | 5.2×10⁻⁴ | 3.1×10⁻⁴ | 1.2×10⁻⁴ |
| Leduc Poker | 1.8×10⁻² | 6.3×10⁻³ | 4.1×10⁻³ | 2.5×10⁻³ | 9.8×10⁻⁴ |
| Goofspiel(4) | 4.5×10⁻² | 1.2×10⁻² | 8.5×10⁻³ | 5.0×10⁻³ | 2.3×10⁻³ |
| Liar's Dice(4) | 8.2×10⁻² | 2.8×10⁻² | 1.9×10⁻² | 1.1×10⁻² | 5.5×10⁻³ |
| Dark Hex(3×3) | 1.5×10⁻¹ | 5.5×10⁻² | 3.8×10⁻² | 2.2×10⁻² | 1.0×10⁻² |
| Geometric Mean | 3.2×10⁻² | 1.1×10⁻² | 7.3×10⁻³ | 4.3×10⁻³ | 2.0×10⁻³ |
VAD-CFR achieves a geometric mean NashConv of $2.0 \times 10^{-3}$, representing a 2.15× improvement over the next-best baseline (PCFR+ at $4.3 \times 10^{-3}$) and a 16× improvement over vanilla CFR. The improvement is consistent across all five games, with the largest relative gains on the harder games (Liar's Dice and Dark Hex).
PSRO Variants: NashConv after 50 Iterations
| Game | PSRO | DCH | PSRO-RN | Pipeline PSRO | SHOR-PSRO |
|---|---|---|---|---|---|
| Kuhn Poker | 5.2×10⁻³ | 2.8×10⁻³ | 2.1×10⁻³ | 1.5×10⁻³ | 6.5×10⁻⁴ |
| Leduc Poker | 4.8×10⁻² | 2.1×10⁻² | 1.5×10⁻² | 1.2×10⁻² | 5.0×10⁻³ |
| Goofspiel(4) | 9.5×10⁻² | 4.5×10⁻² | 3.2×10⁻² | 2.8×10⁻² | 1.2×10⁻² |
| Liar's Dice(4) | 1.8×10⁻¹ | 8.0×10⁻² | 6.5×10⁻² | 5.2×10⁻² | 2.5×10⁻² |
| Dark Hex(3×3) | 3.2×10⁻¹ | 1.5×10⁻¹ | 1.2×10⁻¹ | 9.8×10⁻² | 4.8×10⁻² |
| Geometric Mean | 7.5×10⁻² | 3.4×10⁻² | 2.5×10⁻² | 2.0×10⁻² | 9.5×10⁻³ |
SHOR-PSRO achieves a geometric mean NashConv of $9.5 \times 10^{-3}$, a 2.1× improvement over Pipeline PSRO and a 7.9× improvement over standard PSRO.
Ablation Study
The paper provides an ablation study isolating the contribution of each VAD-CFR innovation:
| VAD-CFR Variant | Geo Mean NashConv | Relative to Full |
|---|---|---|
| Full VAD-CFR | 2.0×10⁻³ | 1.00× |
| No volatility adaptation (fixed α) | 3.5×10⁻³ | 1.75× worse |
| No consistency optimism | 3.0×10⁻³ | 1.50× worse |
| No exploration bonus | 2.3×10⁻³ | 1.15× worse |
| No adaptations (≈ DCFR) | 1.1×10⁻² | 5.50× worse |
Volatility adaptation contributes the most (1.75× degradation when removed), followed by consistency optimism (1.50×). The exploration bonus has a smaller but still measurable effect (1.15×). All three innovations contribute to the full system's performance, and they compose super-additively: the combined degradation when all are removed (5.50×) exceeds the product of individual degradations.
24.3 Paper 2: Aletheia and Autonomous Theorem Proving
Citation: Tony Feng et al. (17 authors). "Aletheia Tackles FirstProof Autonomously." arXiv:2602.21201, February 2026. Google DeepMind.
This paper presents Aletheia, an agent built on Gemini 3 Deep Think that autonomously solves mathematical proof problems from the FirstProof benchmark. FirstProof contains 10 competition-level mathematical problems requiring formal proofs in Lean 4, spanning combinatorics, algebra, analysis, number theory, linear algebra, and topology.
24.3.1 The FirstProof Benchmark
FirstProof is designed to test end-to-end mathematical reasoning: the agent must understand the problem statement, formulate a proof strategy, write the proof in Lean 4, submit it to the type checker, and iterate on errors. This pipeline requires four capabilities that individually challenge current AI systems and in combination represent a frontier problem:
- Mathematical reasoning: Non-trivial insights beyond template application
- Formal verification: Precise type-correct Lean 4 code with correct tactic application
- Long-horizon planning: Proofs spanning 50–200+ lines with complex lemma dependencies
- Error diagnosis: Interpreting Lean's often cryptic error messages and devising targeted fixes
24.3.2 Aletheia Agent Architecture
Figure 24.2: Aletheia agent architecture. The model iterates through strategy selection, lemma decomposition, proof writing, and verification, with error-driven strategy switching after repeated failures.
Aletheia's architecture is built on Gemini 3 Deep Think, a variant of Google's Gemini 3 with extended thinking capability. The model can allocate additional compute to reasoning before generating output, produce structured chain-of-thought, call the Lean 4 proof checker as a tool, and maintain context across multiple proof-attempt rounds. The agent loop follows a hierarchical strategy selection process:
- Initial analysis: Identify the mathematical domain and generate 2–4 candidate proof strategies with brief justifications.
- Strategy selection: Choose the most promising strategy based on problem structure.
- Lemma decomposition: Break the proof into 3–8 lemmas with clear mathematical statements and dependency ordering.
- Bottom-up proof: Prove lemmas in dependency order, submitting each to the Lean 4 checker.
- Strategy switching: If a strategy leads to a dead end after 3–5 failed attempts, abandon it and switch to an alternative.
24.3.3 Results
| # | Problem | Domain | Difficulty | Solved? | Attempts | Proof Length |
|---|---|---|---|---|---|---|
| 1 | Pigeonhole variant with divisibility | Combinatorics | Medium | Yes | 3 | 45 lines |
| 2 | Polynomial root bound | Algebra | Medium | Yes | 5 | 78 lines |
| 3 | Infinite series convergence | Analysis | Hard | Yes | 8 | 120 lines |
| 4 | Graph coloring lower bound | Combinatorics | Hard | Yes | 12 | 95 lines |
| 5 | Modular arithmetic identity | Number Theory | Medium | Yes | 4 | 55 lines |
| 6 | Measure-theoretic inequality | Analysis | Very Hard | No | 20+ | — |
| 7 | Ramsey-type bound | Combinatorics | Very Hard | No | 20+ | — |
| 8 | Matrix rank inequality | Linear Algebra | Hard | Yes | 7 | 88 lines |
| 9 | Diophantine equation solvability | Number Theory | Very Hard | No | 20+ | — |
| 10 | Topological fixed-point theorem | Topology | Very Hard | No | 20+ | — |
Aletheia solved 6 of 10 problems, with proof lengths ranging from 45 to 120 lines of Lean 4. The number of attempts varies from 3 (Problem 1, a standard pigeonhole application) to 12 (Problem 4, graph coloring), indicating that error-driven refinement and strategy switching were essential for the harder solved problems.
Case Study: Problem 3 — Convergence of $\sum \sin(n)/n$
Problem 3 illustrates the agent's iterative refinement process. The task is to prove that $\sum_{n=1}^{\infty} \sin(n)/n$ converges. Aletheia's trajectory across 8 attempts reveals strategy switching in action:
- Attempts 1–3: Applied Abel's test directly but could not formalize partial summation in Lean 4.
- Attempts 4–5: Switched to Dirichlet's test, but discovered that Lean's mathlib library lacked the required lemma.
- Attempts 6–7: Proved Dirichlet's test as a standalone lemma, then attempted to apply it.
- Attempt 8: Successfully combined the standalone Dirichlet test lemma with a bound on partial sums of $\sin(n)$ derived from the geometric series formula for complex exponentials.
This case demonstrates two critical capabilities: (1) the ability to recognize when a library gap blocks progress and to work around it by proving the missing lemma from scratch, and (2) the ability to combine independently-proven components into a complete proof.
24.3.4 Comparison with Prior Systems
| System | Model | Problems Solved (/10) | Max Attempts |
|---|---|---|---|
| GPT-4o (direct) | GPT-4o | 1 | 5 |
| Claude Sonnet 4 (direct) | Claude Sonnet 4 | 2 | 5 |
| Gemini 2.0 Pro (direct) | Gemini 2.0 Pro | 2 | 5 |
| LeanDojo | ReProver | 3 | 100 |
| LEGO-Prover | GPT-4 | 3 | 50 |
| DeepSeek-Prover V2 | DeepSeek V2 | 4 | 100 |
| Aletheia | Gemini 3 Deep Think | 6 | 20 |
Aletheia solves 50% more problems than the next-best system (DeepSeek-Prover V2) while using significantly fewer attempts (20 vs. 100). The direct LLM baselines (GPT-4o, Claude Sonnet 4, Gemini 2.0 Pro) solve at most 2 problems with 5 attempts, confirming that the agent architecture—strategy selection, lemma decomposition, and iterative refinement—contributes substantially beyond raw model capability.
Important caveat: These comparisons are reported by the paper authors and should be interpreted with care. The attempt budgets differ across systems (5 for direct baselines vs. 100 for LeanDojo/DeepSeek-Prover), and the direct baselines may not have been optimized with prompting techniques. The comparison demonstrates Aletheia's overall capability but does not constitute a controlled ablation study of the agent architecture vs. the underlying model.
24.4 Cross-Paper Analysis: Evolutionary Dynamics in Both Approaches
Although Paper 1 uses explicit evolutionary search (AlphaEvolve's population-based optimization) while Paper 2 uses an agent-based proof search loop, the two approaches share deep structural similarities that illuminate the nature of LLM-driven discovery.
24.4.1 Shared Algorithmic Structure
| Concept | Paper 1 (AlphaEvolve for Games) | Paper 2 (Aletheia for Proofs) |
|---|---|---|
| Population | 100 candidate algorithms across 10 islands | Multiple proof attempts (up to 20) |
| Fitness function | Geometric mean NashConv across games | Binary: Lean 4 checker accepts/rejects |
| Feedback richness | Continuous NashConv score per game | Structured error messages from Lean 4 |
| Mutation | LLM-generated code modifications to regret/strategy functions | Strategy-guided revision of proof code based on error analysis |
| Selection | Fitness-proportional selection within islands | Retain successful strategies; abandon failing approaches |
| Diversity mechanism | Island model with migration | Strategy switching after 3–5 failures |
| Verifier | Exact game-tree computation | Lean 4 type/proof checker |
The critical shared ingredient is a deterministic verifier that provides objective feedback without requiring human judgment. In game theory, the NashConv metric can be computed exactly for the benchmark games. In theorem proving, the Lean 4 checker provides a binary accept/reject signal with structured error diagnostics. This verifier-in-the-loop pattern is what enables both systems to iterate effectively without human supervision.
24.4.2 The Game-Theoretic View of Proof Search
The paper by Feng et al. draws an explicit connection between proof search and game theory. Mathematical proof search can be modeled as a single-player game against a deterministic verifier:
where $\sigma_{\text{proof}}$ represents the prover's strategy (choice of proof approach, lemma decomposition, and tactic sequence). Minimizing exploitability in this setting corresponds to finding correct proofs. While this is formally a single-player optimization rather than a two-player game (the verifier has no strategic choices), the search landscape shares key features with Nash equilibrium computation: high dimensionality, many local optima (near-correct proofs that fail on one step), and the need for diverse exploration.
# Pseudocode — illustrative analogy, not a real implementation
# Shows the structural parallel between evolutionary search and proof search
class EvolutionaryProofSearch:
"""Illustrates the evolutionary dynamics implicit in Aletheia's approach.
Not implemented in the paper — shown to highlight structural parallels
with explicit evolutionary systems like AlphaEvolve.
"""
def __init__(self, problem, model, lean_checker, max_attempts=20):
self.problem = problem
self.model = model
self.lean_checker = lean_checker
self.max_attempts = max_attempts
self.proof_population = [] # "Population" of proof attempts
self.strategy_archive = [] # Retained successful strategies
def search(self):
"""Main proof search loop with evolutionary dynamics."""
strategies = self.model.enumerate_strategies(self.problem, k=4)
for strategy in strategies:
consecutive_failures = 0
for attempt in range(self.max_attempts // len(strategies)):
# "Mutation": generate proof variant guided by error feedback
lemmas = self.model.decompose_into_lemmas(self.problem, strategy)
proof_code = self.model.write_lean_proof(lemmas)
# "Fitness evaluation": deterministic verifier
result = self.lean_checker.check(proof_code)
if result.success:
return proof_code # "Selection": accept winning strategy
# "Error-driven mutation": use structured feedback
error_analysis = self.model.analyze_error(result.error_messages)
strategy = self.model.revise_strategy(strategy, error_analysis)
consecutive_failures += 1
# "Diversity mechanism": switch strategy after repeated failure
if consecutive_failures >= 4:
break # Try next strategy (analogous to island migration)
return None # All strategies exhausted
24.5 Research Impact and Significance
24.5.1 Novelty Assessment
Paper 1 — genuinely novel contributions:
- Volatility-adaptive per-information-set discounting is, to the best of this survey's knowledge, the first application of local convergence-rate adaptation within CFR. Prior DCFR variants use global discount parameters.
- Consistency-enforced optimism transfers ideas from optimistic bandit algorithms to regret matching, a connection not previously explored in the game-theory literature.
- The discovery methodology itself: demonstrating that LLM-driven evolution can outperform algorithms refined over two decades of expert research in a well-established field.
Paper 1 — adapted from prior work:
- The overall CFR and PSRO frameworks are standard.
- Entropy-regularized equilibria are well-studied in the game-theory literature.
- Count-based exploration bonuses (Innovation 3) are a well-known technique from multi-armed bandits and reinforcement learning.
Paper 2 — genuinely novel contributions:
- Autonomous end-to-end proof solving at competition level: Aletheia is the first reported system to solve 6/10 FirstProof problems without human guidance.
- Strategy switching as a key architectural decision: The paper provides evidence that the ability to abandon failing strategies and switch approaches is critical for proof search, a finding with implications for broader AI agent design.
Paper 2 — builds on prior work:
- The agent-loop architecture (attempt → verify → revise) follows established patterns from LeanDojo, LEGO-Prover, and other proof assistants.
- The underlying capability relies heavily on the Gemini 3 Deep Think model, making it difficult to separate architectural from model contributions.
24.5.2 Interpretability of Discovered Algorithms
A particularly significant aspect of Paper 1 is that the discovered algorithms are mathematically interpretable. VAD-CFR's innovations can be understood in terms of well-known principles: adapting exploration to convergence rate, committing to consistently-good options, and maintaining diversity through count-based bonuses. This contrasts with many machine-learning-discovered algorithms that are opaque or resist human understanding. The interpretability suggests that LLMs, when evolving algorithms, tend to recombine known mathematical concepts in novel ways rather than discovering entirely alien mechanisms—a property that makes the outputs more trustworthy and extensible by human researchers.
24.5.3 Implications for the Broader Field
Taken together, these papers suggest a methodological pattern for LLM-driven scientific discovery:
- Define a structured search space (algorithm code, proof scripts) where candidates can be automatically generated and modified.
- Provide a deterministic verifier (NashConv computation, Lean 4 checker) that gives objective feedback without human judgment.
- Use LLMs for search guidance rather than direct solution generation—the LLM proposes modifications, not final answers.
- Maintain diversity through explicit mechanisms (islands, strategy switching) to avoid premature convergence.
- Analyze the results for interpretability to extract transferable insights from discovered solutions.
This pattern is applicable to any domain satisfying the prerequisites: structured code-like search space, automated evaluation, and a rich enough landscape that creative recombination yields improvements.
24.6 Limitations and Open Questions
24.6.1 Paper 1 Limitations
- Benchmark scope: All five evaluation games are relatively small imperfect-information games. Whether VAD-CFR and SHOR-PSRO scale to large games (e.g., full Texas Hold'em with billions of information sets) remains untested.
- Computational cost: The paper does not report the total compute required for the evolutionary search, making it impossible to assess whether the discovered algorithms justify the search cost versus manual algorithm design.
- Generalization: The algorithms were evolved specifically to minimize geometric mean NashConv across the five benchmark games. Performance on out-of-distribution games (different sizes, different information structures) is not evaluated.
- Theoretical analysis: While the innovations are interpretable, the paper does not provide formal convergence proofs for VAD-CFR. It is unclear whether the volatility adaptation preserves CFR's $O(1/\sqrt{T})$ convergence guarantee or whether it converges at all in the worst case.
24.6.2 Paper 2 Limitations
- Formalization gap: All four unsolved problems involved mathematical concepts with incomplete Lean 4 formalizations in mathlib. The agent could sketch proofs in natural language but could not bridge the formalization gap. This is a fundamental limitation of the current Lean ecosystem, not of the agent architecture per se.
- Deep dependency chains: Problems requiring more than 5 intermediate lemmas not already in mathlib proved intractable. The agent's effective proof-length budget was exhausted before reaching the main result.
- Novel insight: Problems 7 (Ramsey-type bound) and 9 (Diophantine equation) required genuinely novel mathematical insights, not just known techniques applied in new ways. The agent could not generate these insights, suggesting a ceiling on current LLM mathematical creativity.
- Model confounding: Aletheia's performance improvement over prior systems could be attributed to the Gemini 3 Deep Think model rather than the agent architecture. Without an ablation using the same model in a simpler architecture (e.g., direct prompting with 20 attempts), the contribution of the agent framework remains partially confounded.
24.6.3 Reproducibility
Neither paper provides public code or model access. Gemini 2.5 Pro (Paper 1) and Gemini 3 Deep Think (Paper 2) are proprietary models. The AlphaEvolve system itself is not publicly available. This means the specific results cannot be independently reproduced, although the algorithmic descriptions (particularly for VAD-CFR and SHOR-PSRO) are detailed enough that the discovered algorithms could be reimplemented and tested on open-source game-solving frameworks such as OpenSpiel.
24.7 Connection to Broader AlphaEvolve Research
These two papers are part of a broader wave of AlphaEvolve-derived research from Google DeepMind. The original AlphaEvolve system, described in Chapter 4 of this survey, demonstrated results on mathematical discovery (improved matrix multiplication algorithms for specific tensor sizes, better bounds on kissing numbers in higher dimensions), combinatorial optimization (bin packing, job scheduling), and algorithmic improvements to Google's internal infrastructure (data center optimization, chip design). The papers examined in this chapter extend the methodology to game theory and formal mathematics, demonstrating that the evolutionary code search paradigm generalizes across a wide range of mathematical and algorithmic domains.
A common thread across all AlphaEvolve research is the reliance on automated, objective evaluation. In matrix multiplication, correctness is verified by checking the tensor decomposition. In kissing numbers, geometric validity is computationally verifiable. In game theory, NashConv provides an exact metric. In formal proofs, the type checker is absolute. This pattern suggests that LLM-driven evolution is most powerful in domains where fitness is cheap and precise—a principle that guides the selection of target domains for future research.
24.8 Summary
Key Takeaway
Two February 2026 papers from Google DeepMind demonstrate that LLM-driven evolutionary search can discover novel, interpretable, and practically superior algorithms in game theory (outperforming two decades of expert-designed CFR/PSRO variants) and can autonomously solve competition-level mathematical proofs (6/10 FirstProof problems, surpassing all prior automated approaches).
Main Contribution to the Field
These papers provide the strongest evidence to date that LLM-driven evolution can serve as a creative research methodology, not merely an optimization tool. The discovery of VAD-CFR's volatility-adaptive discounting and consistency-enforced optimism—innovations that are both novel and mathematically principled—demonstrates that evolutionary LLM search can produce insights that advance established research fields. Aletheia's proof search further shows that agent-based LLM systems with structured feedback loops exhibit evolutionary dynamics even without explicit population management.
What Researchers Should Know
The critical enabler in both papers is a deterministic, automated verifier that provides rich feedback (NashConv scores, Lean 4 error messages) without human involvement. Researchers seeking to apply LLM-driven evolution to new domains should prioritize domains where such verifiers exist or can be constructed. The interpretability of VAD-CFR's discoveries further suggests that the LLM search process tends to recombine known concepts in novel ways—making the outputs amenable to human analysis and extension—rather than producing opaque "black-box" solutions.