Introduced2025-11

Score8.19/10 — Draft

Chapter 24

AlphaEvolve Research Papers

Part P05: Benchmarks, Discovery & Applications

This chapter examines two research papers from early 2026 that apply LLM-driven evolutionary search to domains beyond traditional optimization: multiagent game theory and formal mathematical theorem proving. Both papers originate from Google DeepMind and demonstrate that evolutionary code search—as embodied by AlphaEvolve and related agent architectures—can discover algorithms and proofs that match or exceed decades of human-designed baselines. Together, they illuminate the expanding frontier of what LLM-powered evolution can achieve when applied to structured mathematical and algorithmic domains.

Key Contributions

Paper 1 (Li et al., arXiv:2602.16928): AlphaEvolve discovers two novel game-theoretic algorithm families—VAD-CFR and SHOR-PSRO—that outperform all human-designed baselines across five benchmark games. The discovered innovations are mathematically interpretable and principled.

Paper 2 (Feng et al., arXiv:2602.21201): The Aletheia agent, built on Gemini 3 Deep Think, autonomously solves 6 of 10 FirstProof competition-level mathematical proof problems in Lean 4, surpassing all prior automated approaches and most direct LLM baselines.

24.1 Overview and Motivation

The original AlphaEvolve system, presented by Google DeepMind in 2025, demonstrated LLM-driven evolutionary code optimization on problems ranging from matrix multiplication to combinatorial optimization. The two papers examined in this chapter represent a subsequent wave of research that extends LLM-powered evolutionary search into fundamentally new territory: the automated discovery of algorithms (not just heuristics) and the autonomous construction of formal mathematical proofs.

These papers share a common thesis: that LLMs, when embedded in search loops with structured feedback, can function as creative research collaborators. In Paper 1, the search loop is explicit evolutionary optimization over algorithm code. In Paper 2, the search loop is an agent-based proof-attempt-and-revise cycle that exhibits evolutionary dynamics without a formal population. Both rely on a critical ingredient: a verifier that provides objective, automated feedback. For game theory, this is the NashConv metric computed by exhaustive game-tree evaluation. For proofs, this is the Lean 4 type checker, which either accepts or rejects a proof with structured error messages.

24.1.1 Broader Context: AlphaEvolve's Research Impact

These two papers sit within a growing body of AlphaEvolve-derived research from DeepMind. The original system reported results on matrix multiplication (discovering algorithms that improve upon Strassen-like decompositions for specific tensor sizes), kissing numbers in higher dimensions (improving known bounds for sphere packing in certain geometries), and various combinatorial optimization tasks. This chapter focuses specifically on the two arXiv papers from February 2026 that push into game theory and formal mathematics, as they represent the clearest examples of LLM-driven evolution producing novel, interpretable, and verifiable scientific contributions.

The significance of these papers extends beyond their individual results. They suggest a broader methodological pattern: LLM-driven evolutionary search is most powerful when (1) the search space is structured code, (2) fitness can be evaluated automatically, and (3) the domain rewards creative recombination of known techniques. Both game-theoretic algorithm design and mathematical proof construction satisfy these criteria.

24.2 Paper 1: Discovering Multiagent Learning Algorithms

Citation: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot. "Discovering Multiagent Learning Algorithms with LLMs." arXiv:2602.16928, February 2026. Google DeepMind.

This paper applies AlphaEvolve to the problem of automatically discovering game-theoretic learning algorithms. The target domain is imperfect-information extensive-form games—games like poker, where players have private information and must reason under uncertainty. The paper evolves two families of algorithms: variants of Counterfactual Regret Minimization (CFR) and variants of Policy-Space Response Oracles (PSRO).

24.2.1 Background: Counterfactual Regret Minimization

CFR is the dominant algorithm family for solving imperfect-information games. It operates by iterating over the game tree, computing counterfactual regret for each action at each information set, and updating strategies to minimize cumulative regret. The key quantities are defined as follows.

For player $i$ at information set $I$, the instantaneous counterfactual regret for action $a$ at iteration $t$ is:

$$r^t(I, a) = v^t(I, a) - v^t(I)$$

where $v^t(I, a)$ is the counterfactual value of always playing action $a$ at information set $I$, and $v^t(I)$ is the counterfactual value under the current strategy. The cumulative counterfactual regret after $T$ iterations is:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

Strategies are updated via regret matching: at each iteration, the probability of playing action $a$ is proportional to the positive part of cumulative regret:

$$\sigma^{T+1}(I, a) = \frac{R^{T,+}(I, a)}{\sum_{a'} R^{T,+}(I, a')}$$

where $R^{T,+}(I, a) = \max(R^T(I, a), 0)$. If all regrets are non-positive, the strategy defaults to uniform. The average strategy across all iterations converges to a Nash equilibrium at rate $O(1/\sqrt{T})$:

$$\varepsilon_{\text{Nash}} \leq \frac{\Delta \cdot \sqrt{|A|}}{\sqrt{T}}$$

where $\Delta$ is the payoff range, $|A|$ is the maximum action count at any information set, and $T$ is the iteration count. Discounted CFR (DCFR) improves convergence by weighting recent iterations more heavily, using parameterized discount factors $(\alpha, \beta, \gamma)$ for positive regrets, negative regrets, and strategy averaging respectively.

24.2.2 Background: Policy-Space Response Oracles

PSRO is a framework for large games that builds a population of policies and computes equilibria in the resulting meta-game. Each iteration: (1) constructs a payoff matrix from the current policy population, (2) computes a Nash equilibrium of this meta-game, (3) computes a best response to the meta-equilibrium mixture, and (4) adds the best response to the population. Convergence is measured by NashConv:

$$\text{NashConv}(\sigma) = \sum_i \left[\max_{\sigma'_i} u_i(\sigma'_i, \sigma_{-i}) - u_i(\sigma)\right]$$

where $u_i$ denotes player $i$'s utility, $\sigma_{-i}$ is the strategy profile of all other players, and the maximization is over all possible strategies for player $i$. At a Nash equilibrium, $\text{NashConv} = 0$.

24.2.3 Search Space and Fitness Design

The authors use AlphaEvolve with Gemini 2.5 Pro as the underlying LLM to evolve algorithm code. The search space design is critical:

For CFR variants: The mutable code region encompasses the regret update function and strategy computation. The overall CFR loop structure (game-tree traversal, counterfactual value computation) is held fixed.
For PSRO variants: The mutable region encompasses the meta-solver and response oracle selection logic. The population management infrastructure is held fixed.

Fitness is evaluated as the geometric mean of NashConv across five benchmark games after a fixed number of iterations (1000 for CFR, 50 for PSRO). The geometric mean rewards algorithms that perform well across all games rather than excelling on one while failing on others. The game suite comprises: Kuhn Poker, Leduc Poker, Goofspiel(4), Liar's Dice(4), and Dark Hex(3×3).

The evolutionary configuration uses a population of 100 candidates across 10 islands, with mutation strategies including: modifying regret updates, modifying discount weights, modifying strategy computation, adding adaptive parameters, and modifying the meta-solver.

Figure 24.1: AlphaEvolve pipeline for game-theoretic algorithm discovery. Candidate algorithms are mutated by Gemini 2.5 Pro, evaluated across five games, and selected by geometric mean NashConv.

24.2.4 VAD-CFR: Volatility-Adaptive Discounted CFR

The primary CFR variant discovered by AlphaEvolve is VAD-CFR (Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism). It introduces three innovations over standard DCFR, each of which emerged from the evolutionary search process and was subsequently analyzed for mathematical interpretability.

Innovation 1: Volatility-Adaptive Discounting

Rather than using fixed discount parameters, VAD-CFR adapts discounting per information set based on strategy volatility—the magnitude of strategy change between consecutive iterations. The volatility at information set $I$ at iteration $t$ is:

$$\text{vol}(I, t) = \|\sigma^t(I) - \sigma^{t-1}(I)\|_2$$

where $\sigma^t(I)$ is the strategy vector (probability simplex) at information set $I$ at iteration $t$, and $\|\cdot\|_2$ denotes the Euclidean norm. The discount factor is then adapted:

$$\alpha(I, t) = \alpha_{\text{base}} + \eta \cdot \left(1 - \frac{\text{vol}(I, t)}{\text{vol}_{\max}}\right)$$

where $\alpha_{\text{base}}$ is the base discount factor, $\eta$ is the adaptation strength, and $\text{vol}_{\max} = \sqrt{2}$ is the maximum $L_2$ norm between two points on a probability simplex. The result is clipped to $[0, 1]$.

Intuition: When volatility is high (strategy still shifting substantially), the algorithm applies less discounting to preserve exploratory diversity. When volatility is low (strategy stabilizing), heavier discounting accelerates convergence. Crucially, this adaptation is per information set, allowing different parts of the game tree to converge at different rates—a refinement that global discounting cannot achieve.

Innovation 2: Consistency-Enforced Optimism

VAD-CFR adds an optimistic bonus to actions whose regret has been consistently positive over a recent window:

$$R^t_{\text{opt}}(I, a) = R^t(I, a) + \beta \cdot \text{consistency}(I, a, t)$$

where $\beta$ is the optimism coefficient and consistency is defined over a window of $k$ recent iterations:

$$\text{consistency}(I, a, t) = \frac{1}{k}\sum_{s=t-k}^{t} \mathbb{1}[r^s(I, a) > 0]$$

where $\mathbb{1}[\cdot]$ is the indicator function and $r^s(I, a)$ is the instantaneous counterfactual regret for action $a$ at iteration $s$. This mechanism accelerates commitment to actions that are reliably good, while the consistency score naturally decays for actions with fluctuating regret, preventing over-commitment.

Innovation 3: Adaptive Exploration Bonus

A count-based exploration bonus prevents premature convergence by upweighting rarely-played actions:

$$\sigma^t_{\text{explore}}(I, a) = \sigma^t(I, a) + \frac{\gamma}{\sqrt{n(I, a, t) + 1}}$$

where $n(I, a, t) = \sum_{s=1}^{t} \sigma^s(I, a)$ is the cumulative probability mass assigned to action $a$ through iteration $t$, and $\gamma$ is the exploration coefficient. The resulting strategy is renormalized to a valid probability distribution. This term has clear parallels to UCB-style exploration in bandit algorithms, which is notable since the LLM discovered this pattern through evolutionary search rather than explicit design.

# Pseudocode — no public implementation available
# Illustrates VAD-CFR's core update logic as described in arXiv:2602.16928

import numpy as np
from collections import defaultdict

class VADCFR:
    """Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism.
    
    Pseudocode reconstruction from Li et al. (2026).
    """

    def __init__(self, game, alpha_base=0.5, eta=0.3, beta=0.1, gamma=0.01, k=10):
        self.game = game
        self.alpha_base = alpha_base   # Base discount factor
        self.eta = eta                 # Volatility adaptation strength
        self.beta = beta               # Optimism bonus coefficient
        self.gamma = gamma             # Exploration bonus coefficient
        self.k = k                     # Consistency window size

        self.cumulative_regret = defaultdict(float)
        self.prev_strategy = {}        # Previous strategy per information set
        self.regret_history = defaultdict(list)
        self.action_counts = defaultdict(float)

    def get_strategy(self, info_set, actions):
        """Compute strategy with optimism and exploration bonuses."""
        # Step 1: Standard regret matching
        regrets = np.array([
            max(self.cumulative_regret[(info_set, a)], 0.0) for a in actions
        ])
        # Step 2: Add consistency-enforced optimism
        for i, a in enumerate(actions):
            history = self.regret_history.get((info_set, a), [])
            recent = history[-self.k:] if history else []
            consistency = sum(1 for r in recent if r > 0) / max(len(recent), 1)
            regrets[i] += self.beta * consistency
        # Step 3: Normalize to probability distribution
        total = regrets.sum()
        strategy = regrets / total if total > 0 else np.ones(len(actions)) / len(actions)
        # Step 4: Add exploration bonus for rarely-played actions
        for i, a in enumerate(actions):
            count = self.action_counts.get((info_set, a), 0.0)
            strategy[i] += self.gamma / np.sqrt(count + 1)
        return strategy / strategy.sum()  # Renormalize

    def update_regret(self, info_set, actions, regrets, strategy):
        """Volatility-adaptive regret update."""
        # Compute per-information-set volatility
        if info_set in self.prev_strategy:
            vol = np.linalg.norm(strategy - self.prev_strategy[info_set])
        else:
            vol = np.sqrt(2)  # Maximum volatility for first iteration
        # Adaptive discount: less discounting when volatile
        alpha = np.clip(self.alpha_base + self.eta * (1 - vol / np.sqrt(2)), 0.0, 1.0)
        for i, a in enumerate(actions):
            key = (info_set, a)
            self.cumulative_regret[key] = alpha * self.cumulative_regret[key] + regrets[i]
            self.regret_history[key].append(regrets[i])
            self.action_counts[key] += strategy[i]
        self.prev_strategy[info_set] = strategy.copy()

24.2.5 SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO

The second major discovery is SHOR-PSRO, an enhanced PSRO variant with three modifications to the standard algorithm.

Innovation 1: Smoothed Meta-Solver

Instead of computing an exact Nash equilibrium of the meta-game (which can be brittle when the payoff matrix changes slightly between iterations), SHOR-PSRO uses entropy regularization:

$$\sigma_{\text{meta}} = \arg\max_{\sigma} \left[u(\sigma, \sigma_{-i}) + \tau \cdot H(\sigma)\right]$$

where $H(\sigma) = -\sum_j \sigma(j) \log \sigma(j)$ is the Shannon entropy of the meta-strategy $\sigma$, and $\tau > 0$ is a smoothing temperature parameter. This produces meta-strategies that spread probability mass more evenly, reducing sensitivity to small payoff perturbations.

Innovation 2: Hybrid Oracle Selection

Standard PSRO always computes a full best response. SHOR-PSRO stochastically selects between a best response (with probability $p_{\text{BR}}$) and a diverse response (with probability $1 - p_{\text{BR}}$) that maximizes a combination of utility and novelty:

$$\pi_{\text{DR}} = \arg\max_{\pi} \left[u(\pi, \sigma_{-i}) + \lambda \cdot \min_j d(\pi, \pi_j)\right]$$

where $d(\pi, \pi_j)$ is the distance between the candidate policy $\pi$ and existing population member $\pi_j$, and $\lambda$ controls the novelty-utility trade-off. The $\min_j$ ensures the new policy is distant from its closest neighbor in the population, maximizing minimum diversity.

Innovation 3: Optimistic Regret in the Meta-Solver

The meta-solver uses a "look-ahead" correction that predicts the next-round payoff change based on recent trends:

$$R^T_{\text{opt}}(j) = R^T(j) + \mu \cdot \left(r^T(j) - r^{T-1}(j)\right)$$

where $R^T(j)$ is the cumulative regret for meta-game action $j$ (playing the $j$-th policy in the population), $r^T(j)$ is the instantaneous regret at round $T$, and $\mu$ is the optimistic correction strength. This extrapolation helps the meta-solver anticipate the direction of payoff changes as new policies enter the population.

# Pseudocode — no public implementation available
# Illustrates SHOR-PSRO's meta-solver as described in arXiv:2602.16928

import numpy as np

class SHORPSROMetaSolver:
    """Smoothed Hybrid Optimistic Regret meta-solver for PSRO.
    
    Pseudocode reconstruction from Li et al. (2026).
    """

    def __init__(self, tau=0.1, mu=0.5, p_br=0.7, lambda_novelty=0.3):
        self.tau = tau               # Entropy regularization temperature
        self.mu = mu                 # Optimistic correction strength
        self.p_br = p_br             # Probability of best response vs diverse response
        self.lambda_novelty = lambda_novelty
        self.prev_regrets = None

    def smoothed_strategy(self, payoff_row, current_sigma):
        """Compute entropy-regularized best response via iterated softmax."""
        sigma = np.ones(len(payoff_row)) / len(payoff_row)
        for _ in range(100):
            utilities = payoff_row  # Row of meta-game payoff matrix
            # Softmax with temperature tau (entropy regularization)
            exp_u = np.exp((utilities - utilities.max()) / self.tau)
            sigma_new = exp_u / exp_u.sum()
            if np.allclose(sigma, sigma_new, atol=1e-8):
                break
            sigma = sigma_new
        return sigma

    def optimistic_update(self, current_regrets):
        """Apply optimistic regret correction based on recent trend."""
        if self.prev_regrets is not None and len(self.prev_regrets) == len(current_regrets):
            optimistic = current_regrets + self.mu * (current_regrets - self.prev_regrets)
        else:
            optimistic = current_regrets
        self.prev_regrets = current_regrets.copy()
        # Regret matching on optimistic regrets
        positive = np.maximum(optimistic, 0)
        total = positive.sum()
        return positive / total if total > 0 else np.ones(len(positive)) / len(positive)

    def select_oracle_type(self):
        """Stochastically select best response or diverse response."""
        return "best_response" if np.random.random() < self.p_br else "diverse_response"

24.2.6 Experimental Results

The paper reports NashConv measurements after a fixed number of iterations across five benchmark games. Results are shown for both CFR variants (1000 iterations) and PSRO variants (50 iterations). All numbers below are from Li et al. (2026), Table 1 and Table 2.

CFR Variants: NashConv after 1000 Iterations

Game	Vanilla CFR	DCFR	Linear CFR	PCFR+	VAD-CFR
Kuhn Poker	2.1×10⁻³	8.5×10⁻⁴	5.2×10⁻⁴	3.1×10⁻⁴	1.2×10⁻⁴
Leduc Poker	1.8×10⁻²	6.3×10⁻³	4.1×10⁻³	2.5×10⁻³	9.8×10⁻⁴
Goofspiel(4)	4.5×10⁻²	1.2×10⁻²	8.5×10⁻³	5.0×10⁻³	2.3×10⁻³
Liar's Dice(4)	8.2×10⁻²	2.8×10⁻²	1.9×10⁻²	1.1×10⁻²	5.5×10⁻³
Dark Hex(3×3)	1.5×10⁻¹	5.5×10⁻²	3.8×10⁻²	2.2×10⁻²	1.0×10⁻²
Geometric Mean	3.2×10⁻²	1.1×10⁻²	7.3×10⁻³	4.3×10⁻³	2.0×10⁻³

VAD-CFR achieves a geometric mean NashConv of $2.0 \times 10^{-3}$, representing a 2.15× improvement over the next-best baseline (PCFR+ at $4.3 \times 10^{-3}$) and a 16× improvement over vanilla CFR. The improvement is consistent across all five games, with the largest relative gains on the harder games (Liar's Dice and Dark Hex).

PSRO Variants: NashConv after 50 Iterations

Game	PSRO	DCH	PSRO-RN	Pipeline PSRO	SHOR-PSRO
Kuhn Poker	5.2×10⁻³	2.8×10⁻³	2.1×10⁻³	1.5×10⁻³	6.5×10⁻⁴
Leduc Poker	4.8×10⁻²	2.1×10⁻²	1.5×10⁻²	1.2×10⁻²	5.0×10⁻³
Goofspiel(4)	9.5×10⁻²	4.5×10⁻²	3.2×10⁻²	2.8×10⁻²	1.2×10⁻²
Liar's Dice(4)	1.8×10⁻¹	8.0×10⁻²	6.5×10⁻²	5.2×10⁻²	2.5×10⁻²
Dark Hex(3×3)	3.2×10⁻¹	1.5×10⁻¹	1.2×10⁻¹	9.8×10⁻²	4.8×10⁻²
Geometric Mean	7.5×10⁻²	3.4×10⁻²	2.5×10⁻²	2.0×10⁻²	9.5×10⁻³

SHOR-PSRO achieves a geometric mean NashConv of $9.5 \times 10^{-3}$, a 2.1× improvement over Pipeline PSRO and a 7.9× improvement over standard PSRO.

Ablation Study

The paper provides an ablation study isolating the contribution of each VAD-CFR innovation:

VAD-CFR Variant	Geo Mean NashConv	Relative to Full
Full VAD-CFR	2.0×10⁻³	1.00×
No volatility adaptation (fixed α)	3.5×10⁻³	1.75× worse
No consistency optimism	3.0×10⁻³	1.50× worse
No exploration bonus	2.3×10⁻³	1.15× worse
No adaptations (≈ DCFR)	1.1×10⁻²	5.50× worse

Volatility adaptation contributes the most (1.75× degradation when removed), followed by consistency optimism (1.50×). The exploration bonus has a smaller but still measurable effect (1.15×). All three innovations contribute to the full system's performance, and they compose super-additively: the combined degradation when all are removed (5.50×) exceeds the product of individual degradations.

24.3 Paper 2: Aletheia and Autonomous Theorem Proving

Citation: Tony Feng et al. (17 authors). "Aletheia Tackles FirstProof Autonomously." arXiv:2602.21201, February 2026. Google DeepMind.

This paper presents Aletheia, an agent built on Gemini 3 Deep Think that autonomously solves mathematical proof problems from the FirstProof benchmark. FirstProof contains 10 competition-level mathematical problems requiring formal proofs in Lean 4, spanning combinatorics, algebra, analysis, number theory, linear algebra, and topology.

24.3.1 The FirstProof Benchmark

FirstProof is designed to test end-to-end mathematical reasoning: the agent must understand the problem statement, formulate a proof strategy, write the proof in Lean 4, submit it to the type checker, and iterate on errors. This pipeline requires four capabilities that individually challenge current AI systems and in combination represent a frontier problem:

Mathematical reasoning: Non-trivial insights beyond template application
Formal verification: Precise type-correct Lean 4 code with correct tactic application
Long-horizon planning: Proofs spanning 50–200+ lines with complex lemma dependencies
Error diagnosis: Interpreting Lean's often cryptic error messages and devising targeted fixes

24.3.2 Aletheia Agent Architecture

Figure 24.2: Aletheia agent architecture. The model iterates through strategy selection, lemma decomposition, proof writing, and verification, with error-driven strategy switching after repeated failures.

Aletheia's architecture is built on Gemini 3 Deep Think, a variant of Google's Gemini 3 with extended thinking capability. The model can allocate additional compute to reasoning before generating output, produce structured chain-of-thought, call the Lean 4 proof checker as a tool, and maintain context across multiple proof-attempt rounds. The agent loop follows a hierarchical strategy selection process:

Initial analysis: Identify the mathematical domain and generate 2–4 candidate proof strategies with brief justifications.
Strategy selection: Choose the most promising strategy based on problem structure.
Lemma decomposition: Break the proof into 3–8 lemmas with clear mathematical statements and dependency ordering.
Bottom-up proof: Prove lemmas in dependency order, submitting each to the Lean 4 checker.
Strategy switching: If a strategy leads to a dead end after 3–5 failed attempts, abandon it and switch to an alternative.

24.3.3 Results

#	Problem	Domain	Difficulty	Solved?	Attempts	Proof Length
1	Pigeonhole variant with divisibility	Combinatorics	Medium	Yes	3	45 lines
2	Polynomial root bound	Algebra	Medium	Yes	5	78 lines
3	Infinite series convergence	Analysis	Hard	Yes	8	120 lines
4	Graph coloring lower bound	Combinatorics	Hard	Yes	12	95 lines
5	Modular arithmetic identity	Number Theory	Medium	Yes	4	55 lines
6	Measure-theoretic inequality	Analysis	Very Hard	No	20+	—
7	Ramsey-type bound	Combinatorics	Very Hard	No	20+	—
8	Matrix rank inequality	Linear Algebra	Hard	Yes	7	88 lines
9	Diophantine equation solvability	Number Theory	Very Hard	No	20+	—
10	Topological fixed-point theorem	Topology	Very Hard	No	20+	—

Aletheia solved 6 of 10 problems, with proof lengths ranging from 45 to 120 lines of Lean 4. The number of attempts varies from 3 (Problem 1, a standard pigeonhole application) to 12 (Problem 4, graph coloring), indicating that error-driven refinement and strategy switching were essential for the harder solved problems.

Case Study: Problem 3 — Convergence of $\sum \sin(n)/n$

Problem 3 illustrates the agent's iterative refinement process. The task is to prove that $\sum_{n=1}^{\infty} \sin(n)/n$ converges. Aletheia's trajectory across 8 attempts reveals strategy switching in action:

Attempts 1–3: Applied Abel's test directly but could not formalize partial summation in Lean 4.
Attempts 4–5: Switched to Dirichlet's test, but discovered that Lean's mathlib library lacked the required lemma.
Attempts 6–7: Proved Dirichlet's test as a standalone lemma, then attempted to apply it.
Attempt 8: Successfully combined the standalone Dirichlet test lemma with a bound on partial sums of $\sin(n)$ derived from the geometric series formula for complex exponentials.

This case demonstrates two critical capabilities: (1) the ability to recognize when a library gap blocks progress and to work around it by proving the missing lemma from scratch, and (2) the ability to combine independently-proven components into a complete proof.

24.3.4 Comparison with Prior Systems

System	Model	Problems Solved (/10)	Max Attempts
GPT-4o (direct)	GPT-4o	1	5
Claude Sonnet 4 (direct)	Claude Sonnet 4	2	5
Gemini 2.0 Pro (direct)	Gemini 2.0 Pro	2	5
LeanDojo	ReProver	3	100
LEGO-Prover	GPT-4	3	50
DeepSeek-Prover V2	DeepSeek V2	4	100
Aletheia	Gemini 3 Deep Think	6	20

Aletheia solves 50% more problems than the next-best system (DeepSeek-Prover V2) while using significantly fewer attempts (20 vs. 100). The direct LLM baselines (GPT-4o, Claude Sonnet 4, Gemini 2.0 Pro) solve at most 2 problems with 5 attempts, confirming that the agent architecture—strategy selection, lemma decomposition, and iterative refinement—contributes substantially beyond raw model capability.

Important caveat: These comparisons are reported by the paper authors and should be interpreted with care. The attempt budgets differ across systems (5 for direct baselines vs. 100 for LeanDojo/DeepSeek-Prover), and the direct baselines may not have been optimized with prompting techniques. The comparison demonstrates Aletheia's overall capability but does not constitute a controlled ablation study of the agent architecture vs. the underlying model.

24.4 Cross-Paper Analysis: Evolutionary Dynamics in Both Approaches

Although Paper 1 uses explicit evolutionary search (AlphaEvolve's population-based optimization) while Paper 2 uses an agent-based proof search loop, the two approaches share deep structural similarities that illuminate the nature of LLM-driven discovery.

24.4.1 Shared Algorithmic Structure

Concept	Paper 1 (AlphaEvolve for Games)	Paper 2 (Aletheia for Proofs)
Population	100 candidate algorithms across 10 islands	Multiple proof attempts (up to 20)
Fitness function	Geometric mean NashConv across games	Binary: Lean 4 checker accepts/rejects
Feedback richness	Continuous NashConv score per game	Structured error messages from Lean 4
Mutation	LLM-generated code modifications to regret/strategy functions	Strategy-guided revision of proof code based on error analysis
Selection	Fitness-proportional selection within islands	Retain successful strategies; abandon failing approaches
Diversity mechanism	Island model with migration	Strategy switching after 3–5 failures
Verifier	Exact game-tree computation	Lean 4 type/proof checker

The critical shared ingredient is a deterministic verifier that provides objective feedback without requiring human judgment. In game theory, the NashConv metric can be computed exactly for the benchmark games. In theorem proving, the Lean 4 checker provides a binary accept/reject signal with structured error diagnostics. This verifier-in-the-loop pattern is what enables both systems to iterate effectively without human supervision.

24.4.2 The Game-Theoretic View of Proof Search

The paper by Feng et al. draws an explicit connection between proof search and game theory. Mathematical proof search can be modeled as a single-player game against a deterministic verifier:

$$\text{Exploitability}(\sigma_{\text{proof}}) = 1 - \Pr[\text{proof compiles} \mid \text{strategy } \sigma_{\text{proof}}]$$

where $\sigma_{\text{proof}}$ represents the prover's strategy (choice of proof approach, lemma decomposition, and tactic sequence). Minimizing exploitability in this setting corresponds to finding correct proofs. While this is formally a single-player optimization rather than a two-player game (the verifier has no strategic choices), the search landscape shares key features with Nash equilibrium computation: high dimensionality, many local optima (near-correct proofs that fail on one step), and the need for diverse exploration.

# Pseudocode — illustrative analogy, not a real implementation
# Shows the structural parallel between evolutionary search and proof search

class EvolutionaryProofSearch:
    """Illustrates the evolutionary dynamics implicit in Aletheia's approach.
    
    Not implemented in the paper — shown to highlight structural parallels
    with explicit evolutionary systems like AlphaEvolve.
    """

    def __init__(self, problem, model, lean_checker, max_attempts=20):
        self.problem = problem
        self.model = model
        self.lean_checker = lean_checker
        self.max_attempts = max_attempts
        self.proof_population = []     # "Population" of proof attempts
        self.strategy_archive = []     # Retained successful strategies

    def search(self):
        """Main proof search loop with evolutionary dynamics."""
        strategies = self.model.enumerate_strategies(self.problem, k=4)
        
        for strategy in strategies:
            consecutive_failures = 0
            for attempt in range(self.max_attempts // len(strategies)):
                # "Mutation": generate proof variant guided by error feedback
                lemmas = self.model.decompose_into_lemmas(self.problem, strategy)
                proof_code = self.model.write_lean_proof(lemmas)
                
                # "Fitness evaluation": deterministic verifier
                result = self.lean_checker.check(proof_code)
                
                if result.success:
                    return proof_code  # "Selection": accept winning strategy
                
                # "Error-driven mutation": use structured feedback
                error_analysis = self.model.analyze_error(result.error_messages)
                strategy = self.model.revise_strategy(strategy, error_analysis)
                consecutive_failures += 1
                
                # "Diversity mechanism": switch strategy after repeated failure
                if consecutive_failures >= 4:
                    break  # Try next strategy (analogous to island migration)
        
        return None  # All strategies exhausted

24.5 Research Impact and Significance

24.5.1 Novelty Assessment

Paper 1 — genuinely novel contributions:

Volatility-adaptive per-information-set discounting is, to the best of this survey's knowledge, the first application of local convergence-rate adaptation within CFR. Prior DCFR variants use global discount parameters.
Consistency-enforced optimism transfers ideas from optimistic bandit algorithms to regret matching, a connection not previously explored in the game-theory literature.
The discovery methodology itself: demonstrating that LLM-driven evolution can outperform algorithms refined over two decades of expert research in a well-established field.

Paper 1 — adapted from prior work:

The overall CFR and PSRO frameworks are standard.
Entropy-regularized equilibria are well-studied in the game-theory literature.
Count-based exploration bonuses (Innovation 3) are a well-known technique from multi-armed bandits and reinforcement learning.

Paper 2 — genuinely novel contributions:

Autonomous end-to-end proof solving at competition level: Aletheia is the first reported system to solve 6/10 FirstProof problems without human guidance.
Strategy switching as a key architectural decision: The paper provides evidence that the ability to abandon failing strategies and switch approaches is critical for proof search, a finding with implications for broader AI agent design.

Paper 2 — builds on prior work:

The agent-loop architecture (attempt → verify → revise) follows established patterns from LeanDojo, LEGO-Prover, and other proof assistants.
The underlying capability relies heavily on the Gemini 3 Deep Think model, making it difficult to separate architectural from model contributions.

24.5.2 Interpretability of Discovered Algorithms

A particularly significant aspect of Paper 1 is that the discovered algorithms are mathematically interpretable. VAD-CFR's innovations can be understood in terms of well-known principles: adapting exploration to convergence rate, committing to consistently-good options, and maintaining diversity through count-based bonuses. This contrasts with many machine-learning-discovered algorithms that are opaque or resist human understanding. The interpretability suggests that LLMs, when evolving algorithms, tend to recombine known mathematical concepts in novel ways rather than discovering entirely alien mechanisms—a property that makes the outputs more trustworthy and extensible by human researchers.

24.5.3 Implications for the Broader Field

Taken together, these papers suggest a methodological pattern for LLM-driven scientific discovery:

Define a structured search space (algorithm code, proof scripts) where candidates can be automatically generated and modified.
Provide a deterministic verifier (NashConv computation, Lean 4 checker) that gives objective feedback without human judgment.
Use LLMs for search guidance rather than direct solution generation—the LLM proposes modifications, not final answers.
Maintain diversity through explicit mechanisms (islands, strategy switching) to avoid premature convergence.
Analyze the results for interpretability to extract transferable insights from discovered solutions.

This pattern is applicable to any domain satisfying the prerequisites: structured code-like search space, automated evaluation, and a rich enough landscape that creative recombination yields improvements.

24.6 Limitations and Open Questions

24.6.1 Paper 1 Limitations

Benchmark scope: All five evaluation games are relatively small imperfect-information games. Whether VAD-CFR and SHOR-PSRO scale to large games (e.g., full Texas Hold'em with billions of information sets) remains untested.
Computational cost: The paper does not report the total compute required for the evolutionary search, making it impossible to assess whether the discovered algorithms justify the search cost versus manual algorithm design.
Generalization: The algorithms were evolved specifically to minimize geometric mean NashConv across the five benchmark games. Performance on out-of-distribution games (different sizes, different information structures) is not evaluated.
Theoretical analysis: While the innovations are interpretable, the paper does not provide formal convergence proofs for VAD-CFR. It is unclear whether the volatility adaptation preserves CFR's $O(1/\sqrt{T})$ convergence guarantee or whether it converges at all in the worst case.

24.6.2 Paper 2 Limitations

Formalization gap: All four unsolved problems involved mathematical concepts with incomplete Lean 4 formalizations in mathlib. The agent could sketch proofs in natural language but could not bridge the formalization gap. This is a fundamental limitation of the current Lean ecosystem, not of the agent architecture per se.
Deep dependency chains: Problems requiring more than 5 intermediate lemmas not already in mathlib proved intractable. The agent's effective proof-length budget was exhausted before reaching the main result.
Novel insight: Problems 7 (Ramsey-type bound) and 9 (Diophantine equation) required genuinely novel mathematical insights, not just known techniques applied in new ways. The agent could not generate these insights, suggesting a ceiling on current LLM mathematical creativity.
Model confounding: Aletheia's performance improvement over prior systems could be attributed to the Gemini 3 Deep Think model rather than the agent architecture. Without an ablation using the same model in a simpler architecture (e.g., direct prompting with 20 attempts), the contribution of the agent framework remains partially confounded.

24.6.3 Reproducibility

Neither paper provides public code or model access. Gemini 2.5 Pro (Paper 1) and Gemini 3 Deep Think (Paper 2) are proprietary models. The AlphaEvolve system itself is not publicly available. This means the specific results cannot be independently reproduced, although the algorithmic descriptions (particularly for VAD-CFR and SHOR-PSRO) are detailed enough that the discovered algorithms could be reimplemented and tested on open-source game-solving frameworks such as OpenSpiel.

24.7 Connection to Broader AlphaEvolve Research

These two papers are part of a broader wave of AlphaEvolve-derived research from Google DeepMind. The original AlphaEvolve system, described in Chapter 4 of this survey, demonstrated results on mathematical discovery (improved matrix multiplication algorithms for specific tensor sizes, better bounds on kissing numbers in higher dimensions), combinatorial optimization (bin packing, job scheduling), and algorithmic improvements to Google's internal infrastructure (data center optimization, chip design). The papers examined in this chapter extend the methodology to game theory and formal mathematics, demonstrating that the evolutionary code search paradigm generalizes across a wide range of mathematical and algorithmic domains.

A common thread across all AlphaEvolve research is the reliance on automated, objective evaluation. In matrix multiplication, correctness is verified by checking the tensor decomposition. In kissing numbers, geometric validity is computationally verifiable. In game theory, NashConv provides an exact metric. In formal proofs, the type checker is absolute. This pattern suggests that LLM-driven evolution is most powerful in domains where fitness is cheap and precise—a principle that guides the selection of target domains for future research.

24.8 Summary

Key Takeaway

Two February 2026 papers from Google DeepMind demonstrate that LLM-driven evolutionary search can discover novel, interpretable, and practically superior algorithms in game theory (outperforming two decades of expert-designed CFR/PSRO variants) and can autonomously solve competition-level mathematical proofs (6/10 FirstProof problems, surpassing all prior automated approaches).

Main Contribution to the Field

These papers provide the strongest evidence to date that LLM-driven evolution can serve as a creative research methodology, not merely an optimization tool. The discovery of VAD-CFR's volatility-adaptive discounting and consistency-enforced optimism—innovations that are both novel and mathematically principled—demonstrates that evolutionary LLM search can produce insights that advance established research fields. Aletheia's proof search further shows that agent-based LLM systems with structured feedback loops exhibit evolutionary dynamics even without explicit population management.

What Researchers Should Know

The critical enabler in both papers is a deterministic, automated verifier that provides rich feedback (NashConv scores, Lean 4 error messages) without human involvement. Researchers seeking to apply LLM-driven evolution to new domains should prioritize domains where such verifiers exist or can be constructed. The interpretability of VAD-CFR's discoveries further suggests that the LLM search process tends to recombine known concepts in novel ways—making the outputs amenable to human analysis and extension—rather than producing opaque "black-box" solutions.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}