arXiv Papers: Evolutionary AI for Games & Proofs
Two papers applying LLM-driven evolution to multiagent learning and mathematical theorem proving
Table of Contents
Paper 1: Discovering Multiagent Learning Algorithms with LLMs
- Overview & Motivation
- Background: CFR and PSRO
- Method: AlphaEvolve for Game Theory
- VAD-CFR: Volatility-Adaptive Discounted CFR
- SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO
- Experimental Results
- Analysis & Insights
Paper 2: Aletheia Tackles FirstProof Autonomously
- Overview & Motivation
- Aletheia Agent Architecture
- FirstProof Problems & Results
- Analysis & Significance
Paper 1
Discovering Multiagent Learning Algorithms with LLMs
arXiv: 2602.16928 · Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot · Affiliation: Google DeepMind · Date: February 2026
1.1 Overview & Motivation
This paper demonstrates that AlphaEvolve can discover novel multiagent learning algorithms for game theory that outperform decades of human-designed algorithms. The authors evolve two families of algorithms:
VAD-CFR
Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism. An enhanced variant of Counterfactual Regret Minimization that dynamically adapts its discounting based on strategy volatility and uses optimistic regret bounds.
SHOR-PSRO
Smoothed Hybrid Optimistic Regret PSRO. An improved version of Policy-Space Response Oracles that combines smoothing, hybrid selection, and optimistic regret minimization for faster convergence to Nash equilibria.
**Key Result:** Both VAD-CFR and SHOR-PSRO outperform their respective state-of-the-art baselines across a suite of game-theoretic benchmarks, demonstrating that LLM-driven evolution can discover algorithms that are both novel and practically superior.
1.2 Background: CFR and PSRO
Counterfactual Regret Minimization (CFR)
CFR is the dominant family of algorithms for solving imperfect-information extensive-form games. It works by iterating over the game tree, computing counterfactual regret for each information set, and updating strategies to minimize cumulative regret.
Regret Definitions
For player i at information set I, the instantaneous counterfactual regret for action a at iteration t is:
$$ r^t(I, a) = v^t(I, a) - v^t(I) $$
Where v^t(I, a) is the counterfactual value of always playing action a at information set I, and v^t(I) is the counterfactual value under the current strategy.
The cumulative counterfactual regret after T iterations is:
$$ R^T(I, a) = \sum_{t=1}^{T} r^t(I, a) $$
The strategy at iteration T+1 is computed via regret matching:
$$ \sigma^{T+1}(I, a) = \frac{R^T_+(I, a)}{\sum_{a'} R^T_+(I, a')} $$
Where R^T_+(I, a) = max(R^T(I, a), 0). If all regrets are non-positive, the strategy is uniform.
CFR Convergence Guarantee
$$ \epsilon_{\text{Nash}} \le \frac{\Delta \cdot \sqrt{|A|}}{\sqrt{T}} $$
Where Δ is the range of payoffs, |A| is the maximum number of actions at any information set, and T is the number of iterations. The average strategy converges to a Nash equilibrium at rate O(1/√T).
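The bound above can be evaluated directly for concrete values. A minimal sketch (the numbers below are illustrative, not from the paper):

```python
import math

# Worked instance of the CFR convergence bound: exploitability of the
# average strategy after T iterations is at most Δ·√|A| / √T.
def cfr_epsilon_bound(delta: float, num_actions: int, T: int) -> float:
    """Upper bound on ε_Nash after T iterations."""
    return delta * math.sqrt(num_actions) / math.sqrt(T)

# Example: payoff range Δ = 2, at most |A| = 3 actions, T = 10,000.
eps = cfr_epsilon_bound(2.0, 3, 10_000)
print(f"epsilon_Nash <= {eps:.4f}")
```

Note the O(1/√T) rate: quadrupling the iteration count only halves the bound.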
Discounted CFR (DCFR)
DCFR improves on vanilla CFR by discounting older iterations, giving more weight to recent strategies:
$$ R^T_{\text{disc}}(I, a) = \sum_{t=1}^{T} w(t, T) \cdot r^t(I, a) $$
Where w(t, T) is a weighting function. Common choices include:
- Linear DCFR: w(t, T) = t / T (linearly increasing weight)
- DCFR(α, β, γ): Parameterized discounting with separate weights for positive regrets, negative regrets, and strategy averaging
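The two weighting schemes above can be sketched as follows. The (α, β, γ) multipliers follow the common parameterization from the DCFR literature; treat the exact forms and default values here as illustrative:

```python
# Illustrative sketch of the DCFR weighting schemes described above.
def linear_weight(t: int, T: int) -> float:
    """Linear CFR: iteration t contributes with weight t / T."""
    return t / T

def dcfr_multipliers(t: int, alpha: float = 1.5, beta: float = 0.0,
                     gamma: float = 2.0) -> tuple:
    """Per-iteration multipliers applied at iteration t to positive regrets,
    negative regrets, and the cumulative average strategy, respectively."""
    w_pos = t**alpha / (t**alpha + 1)   # discounts old positive regret
    w_neg = t**beta / (t**beta + 1)     # discounts old negative regret
    w_strat = (t / (t + 1))**gamma      # discounts the strategy average
    return w_pos, w_neg, w_strat
```

All three multipliers approach 1 as t grows, so discounting is strongest for early iterations.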
Policy-Space Response Oracles (PSRO)
PSRO is a framework for learning in large games by building a population of policies and computing equilibria in the meta-game:
- Meta-game construction: Build a payoff matrix from the current policy population
- Meta-solver: Compute a Nash equilibrium (or approximate) of the meta-game
- Best response: Compute a best response to the meta-equilibrium mixture
- Population expansion: Add the best response to the policy population
- Repeat
$$ \text{NashConv}(\sigma) = \sum_i \left[ \max_{\sigma_i'} u_i(\sigma_i', \sigma_{-i}) - u_i(\sigma) \right] $$
NashConv measures how far a strategy profile σ is from Nash equilibrium. At Nash, NashConv = 0.
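For intuition, here is a minimal sketch of NashConv for a two-player zero-sum matrix game (not the paper's implementation; function and variable names are assumptions):

```python
import numpy as np

def nash_conv(A: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """NashConv for a zero-sum matrix game: player 0 maximizes x @ A @ y,
    player 1 minimizes it. Sums each player's incentive to deviate."""
    value = x @ A @ y
    br0 = (A @ y).max()        # player 0's best-response payoff
    br1 = (-(x @ A)).max()     # player 1 maximizes the negated payoff
    return (br0 - value) + (br1 - (-value))

# Matching pennies: the uniform profile is the Nash equilibrium,
# so NashConv is zero there and positive elsewhere.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(nash_conv(A, uniform, uniform))  # 0.0
```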
Python Pseudocode: Vanilla CFR
import numpy as np
class VanillaCFR:
"""Vanilla Counterfactual Regret Minimization."""
def __init__(self, game):
self.game = game
# Cumulative regret per (infoset, action)
self.cumulative_regret = {}
# Cumulative strategy per (infoset, action)
self.cumulative_strategy = {}
self.iteration = 0
def get_strategy(self, info_set: str, actions: list) -> np.ndarray:
"""Compute current strategy via regret matching."""
regrets = np.array([
max(self.cumulative_regret.get((info_set, a), 0.0), 0.0)
for a in actions
])
total = regrets.sum()
if total > 0:
return regrets / total
else:
return np.ones(len(actions)) / len(actions)
def cfr(self, state, reach_probs: dict) -> dict:
"""Recursive CFR traversal."""
if state.is_terminal():
return state.payoffs()
player = state.current_player()
info_set = state.info_set()
actions = state.legal_actions()
strategy = self.get_strategy(info_set, actions)
# Recurse for each action
action_values = {}
node_value = {p: 0.0 for p in self.game.players}
for i, action in enumerate(actions):
new_reach = dict(reach_probs)
new_reach[player] *= strategy[i]
child_state = state.apply_action(action)
action_values[action] = self.cfr(child_state, new_reach)
for p in self.game.players:
node_value[p] += strategy[i] * action_values[action][p]
# Compute counterfactual regret
cf_reach = 1.0
for p in self.game.players:
if p != player:
cf_reach *= reach_probs[p]
for i, action in enumerate(actions):
regret = cf_reach * (action_values[action][player] - node_value[player])
key = (info_set, action)
self.cumulative_regret[key] = self.cumulative_regret.get(key, 0.0) + regret
# Accumulate strategy (weighted by reach probability)
for i, action in enumerate(actions):
key = (info_set, action)
self.cumulative_strategy[key] = (
self.cumulative_strategy.get(key, 0.0) + reach_probs[player] * strategy[i]
)
return node_value
def solve(self, num_iterations: int):
"""Run CFR for num_iterations."""
for t in range(num_iterations):
self.iteration = t + 1
initial_reach = {p: 1.0 for p in self.game.players}
self.cfr(self.game.root(), initial_reach)
def average_strategy(self, info_set: str, actions: list) -> np.ndarray:
"""Return the average strategy (converges to Nash)."""
strat = np.array([
self.cumulative_strategy.get((info_set, a), 0.0)
for a in actions
])
total = strat.sum()
if total > 0:
return strat / total
return np.ones(len(actions)) / len(actions)
1.3 Method: AlphaEvolve for Game Theory
The authors use AlphaEvolve (Google DeepMind's LLM-driven evolutionary code optimization system) to evolve game-theoretic algorithms. The key design decisions:
Search Space Design
- For CFR variants: The search space is the regret update function and strategy computation. The LLM modifies the core update equations while preserving the overall CFR loop structure.
- For PSRO variants: The search space is the meta-solver and response oracle selection. The LLM modifies how the meta-game equilibrium is computed and how new policies are added.
Fitness Evaluation
- Primary metric: NashConv after a fixed number of iterations (lower is better)
- Game suite: Kuhn Poker, Leduc Poker, Goofspiel, Liar's Dice, Dark Hex
- Aggregation: Geometric mean of NashConv across games (rewards algorithms that are good everywhere)
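The aggregation step above can be sketched in a few lines (the function name is an assumption, not from the paper's code):

```python
import numpy as np

def geometric_mean_nashconv(nashconvs) -> float:
    """Geometric mean of per-game NashConv values. A single bad game
    drags the score down multiplicatively, so the fitness rewards
    algorithms that are good everywhere, not just on average."""
    return float(np.exp(np.mean(np.log(nashconvs))))

# Example with five per-game NashConv values at plausible scales:
fitness = geometric_mean_nashconv([1.2e-4, 9.8e-4, 2.3e-3, 5.5e-3, 1.0e-2])
```

Computing the mean in log space avoids underflow when the per-game values are very small.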
Evolution Configuration
# AlphaEvolve configuration for game algorithm discovery
config = {
"population_size": 100,
"num_islands": 10,
"llm": "gemini-2.5-pro",
"evaluation_suite": [
"kuhn_poker",
"leduc_poker",
"goofspiel_4",
"liars_dice_4",
"dark_hex_3x3",
],
"iterations_per_eval": 1000, # CFR/PSRO iterations per game
"fitness": "geometric_mean_nashconv",
"mutation_strategies": [
"modify_regret_update",
"modify_discount_weights",
"modify_strategy_computation",
"add_adaptive_parameter",
"modify_meta_solver",
],
}
1.4 VAD-CFR: Volatility-Adaptive Discounted CFR
VAD-CFR is the primary CFR variant discovered by AlphaEvolve. It introduces three key innovations over standard DCFR:
Innovation 1: Volatility-Adaptive Discounting
Instead of using fixed discount parameters, VAD-CFR dynamically adjusts discounting based on strategy volatility — how much the strategy at each information set changes between iterations.
$$ \text{vol}(I, t) = \left\| \sigma^t(I) - \sigma^{t-1}(I) \right\|_2 $$
The discount factor is then adapted per information set:
$$ \alpha(I, t) = \alpha_{\text{base}} + \eta \cdot \left( 1 - \frac{\text{vol}(I, t)}{\text{vol}_{\max}} \right) $$
When strategy volatility is high (the algorithm is still exploring), less discounting is applied to preserve diversity. When volatility is low (the algorithm is converging), more discounting is applied to accelerate convergence.
Innovation 2: Consistency-Enforced Optimism
VAD-CFR adds an optimistic bonus to the regret of actions that have been consistently improving:
$$ R^t_{\text{opt}}(I, a) = R^t(I, a) + \beta \cdot \text{consistency}(I, a, t) $$
Where consistency measures how many recent iterations the regret for action a has been positive:
$$ \text{consistency}(I, a, t) = \frac{1}{k} \sum_{s=t-k}^{t} \mathbb{1}\left[ r^s(I, a) > 0 \right] $$
This encourages the algorithm to commit more strongly to actions that are consistently good, rather than hedging equally across all positive-regret actions.
Innovation 3: Adaptive Exploration Bonus
VAD-CFR adds a small exploration bonus for rarely-played actions, preventing premature convergence:
$$ \sigma^t_{\text{explore}}(I, a) = \sigma^t(I, a) + \frac{\gamma}{\sqrt{n(I, a, t) + 1}} $$
Where n(I, a, t) is the cumulative probability mass assigned to action a up to iteration t.
Python Pseudocode: VAD-CFR
import numpy as np
from collections import defaultdict
class VADCFR:
"""Volatility-Adaptive Discounted CFR with Consistency-Enforced Optimism."""
def __init__(self, game, alpha_base=0.5, eta=0.3, beta=0.1, gamma=0.01, k=10):
self.game = game
self.alpha_base = alpha_base # Base discount factor
self.eta = eta # Volatility adaptation strength
self.beta = beta # Optimism bonus coefficient
self.gamma = gamma # Exploration bonus coefficient
self.k = k # Consistency window size
self.cumulative_regret = defaultdict(float)
self.cumulative_strategy = defaultdict(float)
self.prev_strategy = {} # Previous iteration strategies
self.regret_history = defaultdict(list) # Recent regret signs
self.action_counts = defaultdict(float) # Cumulative action probabilities
self.iteration = 0
def compute_volatility(self, info_set, actions, current_strategy):
"""Compute strategy volatility for adaptive discounting."""
if info_set not in self.prev_strategy:
return 1.0 # Maximum volatility for first iteration
prev = self.prev_strategy[info_set]
diff = current_strategy - prev
return np.linalg.norm(diff)
def compute_consistency(self, info_set, action):
"""Compute action consistency for optimistic bonus."""
history = self.regret_history.get((info_set, action), [])
if len(history) == 0:
return 0.0
recent = history[-self.k:]
return sum(1 for r in recent if r > 0) / len(recent)
def get_strategy(self, info_set, actions):
"""Compute strategy with optimism and exploration bonuses."""
# Base regret matching
regrets = np.array([
max(self.cumulative_regret[(info_set, a)], 0.0) for a in actions
])
# Add consistency-enforced optimism
for i, a in enumerate(actions):
consistency = self.compute_consistency(info_set, a)
regrets[i] += self.beta * consistency
total = regrets.sum()
if total > 0:
strategy = regrets / total
else:
strategy = np.ones(len(actions)) / len(actions)
# Add exploration bonus for rarely-played actions
for i, a in enumerate(actions):
count = self.action_counts.get((info_set, a), 0.0)
strategy[i] += self.gamma / np.sqrt(count + 1)
# Renormalize
strategy = strategy / strategy.sum()
return strategy
def update_regret(self, info_set, actions, regrets, strategy):
"""Update regret with volatility-adaptive discounting."""
volatility = self.compute_volatility(info_set, actions, strategy)
vol_max = np.sqrt(2) # Maximum L2 norm for simplex strategies
# Adaptive discount: less discounting when volatile
alpha = self.alpha_base + self.eta * (1 - volatility / vol_max)
alpha = np.clip(alpha, 0.0, 1.0)
for i, a in enumerate(actions):
key = (info_set, a)
# Discount old regret and add new
self.cumulative_regret[key] = alpha * self.cumulative_regret[key] + regrets[i]
# Track regret signs for consistency
self.regret_history[key].append(regrets[i])
if len(self.regret_history[key]) > self.k * 2:
self.regret_history[key] = self.regret_history[key][-self.k:]
# Update action counts
self.action_counts[key] = self.action_counts.get(key, 0.0) + strategy[i]
# Store strategy for volatility computation
self.prev_strategy[info_set] = strategy.copy()
def solve(self, num_iterations):
"""Run VAD-CFR for num_iterations."""
for t in range(num_iterations):
self.iteration = t + 1
self._cfr_traversal(self.game.root(), {p: 1.0 for p in self.game.players})
def _cfr_traversal(self, state, reach_probs):
"""Recursive CFR traversal with VAD modifications."""
if state.is_terminal():
return state.payoffs()
player = state.current_player()
info_set = state.info_set()
actions = state.legal_actions()
strategy = self.get_strategy(info_set, actions)
# Standard recursive CFR
action_values = {}
node_value = {p: 0.0 for p in self.game.players}
for i, action in enumerate(actions):
new_reach = dict(reach_probs)
new_reach[player] *= strategy[i]
action_values[action] = self._cfr_traversal(
state.apply_action(action), new_reach
)
for p in self.game.players:
node_value[p] += strategy[i] * action_values[action][p]
# Compute counterfactual regret
cf_reach = np.prod([reach_probs[p] for p in self.game.players if p != player])
regrets = np.array([
cf_reach * (action_values[a][player] - node_value[player])
for a in actions
])
# Update with adaptive discounting
self.update_regret(info_set, actions, regrets, strategy)
# Accumulate average strategy
for i, action in enumerate(actions):
key = (info_set, action)
self.cumulative_strategy[key] += reach_probs[player] * strategy[i]
return node_value
1.5 SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO
SHOR-PSRO enhances standard PSRO with three modifications discovered by AlphaEvolve:
Innovation 1: Smoothed Meta-Solver
Instead of computing an exact Nash equilibrium of the meta-game (which can be brittle), SHOR-PSRO uses a smoothed best response with entropy regularization:
$$ \sigma_{\text{meta}} = \arg\max_{\sigma} \left[ u(\sigma, \sigma_{-i}) + \tau H(\sigma) \right] $$
Where H(σ) = -∑_j σ(j) log σ(j) is the entropy and τ is the smoothing temperature. This produces more robust meta-strategies that are less sensitive to small payoff changes.
Innovation 2: Hybrid Oracle Selection
Standard PSRO always computes a full best response. SHOR-PSRO uses a hybrid selection that mixes:
- Best response (BR): With probability p_BR, compute a full best response
- Diverse response (DR): With probability 1 − p_BR, compute a response that maximizes a combination of utility and novelty
$$ \pi_{\text{DR}} = \arg\max_{\pi} \left[ u(\pi, \sigma_{-i}) + \lambda \cdot \min_j d(\pi, \pi_j) \right] $$
Where d(π, π_j) is the distance between the new policy and existing population member j, and λ controls the novelty-utility trade-off.
Innovation 3: Optimistic Regret Minimization
The meta-solver uses optimistic regret matching where the regret incorporates a prediction of the next-round payoff based on recent trends:
$$ R^T_{\text{opt}}(j) = R^T(j) + \mu \cdot \left( r^T(j) - r^{T-1}(j) \right) $$
This "look-ahead" correction helps the meta-solver converge faster by anticipating the direction of payoff changes.
Python Pseudocode: SHOR-PSRO
import numpy as np
from scipy.optimize import linprog
class SHORPSRO:
"""Smoothed Hybrid Optimistic Regret PSRO."""
def __init__(self, game, tau=0.1, lambda_novelty=0.3, p_br=0.7, mu=0.5):
self.game = game
self.tau = tau # Smoothing temperature
self.lambda_novelty = lambda_novelty # Novelty bonus
self.p_br = p_br # Best response probability
self.mu = mu # Optimistic correction strength
self.population = {p: [] for p in game.players}
self.meta_payoffs = None # Meta-game payoff tensor
self.meta_strategy = None # Current meta-equilibrium
self.prev_regrets = None # For optimistic correction
def smoothed_nash(self, payoff_matrix, player):
"""Compute entropy-regularized Nash equilibrium."""
n = payoff_matrix.shape[0]
# Iterative softmax best response with entropy bonus
sigma = np.ones(n) / n # Uniform start
for _ in range(100): # Fixed-point iteration
utilities = payoff_matrix @ sigma if player == 0 else payoff_matrix.T @ sigma
# Softmax with temperature tau
exp_u = np.exp((utilities - utilities.max()) / self.tau)
sigma_new = exp_u / exp_u.sum()
if np.allclose(sigma, sigma_new, atol=1e-8):
break
sigma = sigma_new
return sigma
def compute_diverse_response(self, meta_strategy, player):
"""Compute a response that balances utility and novelty."""
# Train a policy against meta_strategy mixture
base_response = self.compute_best_response(meta_strategy, player)
# Compute distances to existing population
if len(self.population[player]) == 0:
return base_response
min_distance = min(
self.policy_distance(base_response, existing)
for existing in self.population[player]
)
# If already novel enough, return best response
if min_distance > 0.1:
return base_response
# Otherwise, add perturbation for diversity
perturbed = base_response.add_noise(scale=0.1)
return perturbed
def update_meta_strategy(self):
"""Update meta-strategy using optimistic regret matching."""
payoffs = self.meta_payoffs
n = len(self.population[0])
# Compute current regrets
current_regrets = np.zeros(n)
for j in range(n):
pure_strategy = np.zeros(n)
pure_strategy[j] = 1.0
current_regrets[j] = (
pure_strategy @ payoffs @ self.meta_strategy[1]
- self.meta_strategy[0] @ payoffs @ self.meta_strategy[1]
)
# Apply optimistic correction
if self.prev_regrets is not None and len(self.prev_regrets) == n:
optimistic_regrets = current_regrets + self.mu * (current_regrets - self.prev_regrets)
else:
optimistic_regrets = current_regrets
self.prev_regrets = current_regrets.copy()
# Regret matching with optimistic regrets
positive_regrets = np.maximum(optimistic_regrets, 0)
total = positive_regrets.sum()
if total > 0:
self.meta_strategy[0] = positive_regrets / total
else:
self.meta_strategy[0] = np.ones(n) / n
def iterate(self):
"""Single PSRO iteration."""
# 1. Update meta-strategy (smoothed + optimistic)
self.update_meta_strategy()
# 2. For each player, add new policy
for player in self.game.players:
if np.random.random() < self.p_br:
# Best response
new_policy = self.compute_best_response(
self.meta_strategy[1 - player], player
)
else:
# Diverse response
new_policy = self.compute_diverse_response(
self.meta_strategy[1 - player], player
)
self.population[player].append(new_policy)
# 3. Update meta-game payoff matrix
self.meta_payoffs = self.compute_meta_payoffs()
def solve(self, num_iterations):
"""Run SHOR-PSRO for num_iterations."""
# Initialize with random policies
for player in self.game.players:
self.population[player].append(self.random_policy(player))
self.meta_payoffs = self.compute_meta_payoffs()
n = len(self.population[0])
self.meta_strategy = [np.ones(n) / n for _ in self.game.players]
for t in range(num_iterations):
self.iterate()
nashconv = self.compute_nashconv()
print(f"Iteration {t}: NashConv = {nashconv:.6f}, "
f"Population size = {len(self.population[0])}")
1.6 Experimental Results
CFR Variants: NashConv after 1000 Iterations
| Game | Vanilla CFR | DCFR | Linear CFR | PCFR+ | VAD-CFR |
|---|---|---|---|---|---|
| Kuhn Poker | 2.1e-3 | 8.5e-4 | 5.2e-4 | 3.1e-4 | 1.2e-4 |
| Leduc Poker | 1.8e-2 | 6.3e-3 | 4.1e-3 | 2.5e-3 | 9.8e-4 |
| Goofspiel(4) | 4.5e-2 | 1.2e-2 | 8.5e-3 | 5.0e-3 | 2.3e-3 |
| Liar's Dice(4) | 8.2e-2 | 2.8e-2 | 1.9e-2 | 1.1e-2 | 5.5e-3 |
| Dark Hex(3x3) | 1.5e-1 | 5.5e-2 | 3.8e-2 | 2.2e-2 | 1.0e-2 |
| Geo Mean | 3.2e-2 | 1.1e-2 | 7.3e-3 | 4.3e-3 | 2.0e-3 |
PSRO Variants: NashConv after 50 Iterations
| Game | PSRO | DCH | PSRO-RN | Pipeline PSRO | SHOR-PSRO |
|---|---|---|---|---|---|
| Kuhn Poker | 5.2e-3 | 2.8e-3 | 2.1e-3 | 1.5e-3 | 6.5e-4 |
| Leduc Poker | 4.8e-2 | 2.1e-2 | 1.5e-2 | 1.2e-2 | 5.0e-3 |
| Goofspiel(4) | 9.5e-2 | 4.5e-2 | 3.2e-2 | 2.8e-2 | 1.2e-2 |
| Liar's Dice(4) | 1.8e-1 | 8.0e-2 | 6.5e-2 | 5.2e-2 | 2.5e-2 |
| Dark Hex(3x3) | 3.2e-1 | 1.5e-1 | 1.2e-1 | 9.8e-2 | 4.8e-2 |
| Geo Mean | 7.5e-2 | 3.4e-2 | 2.5e-2 | 2.0e-2 | 9.5e-3 |
1.7 Analysis & Insights
Why Volatility-Adaptive Discounting Works
- Early iterations: High volatility → low discounting → preserves exploration → avoids premature convergence to suboptimal strategies
- Late iterations: Low volatility → high discounting → accelerates convergence → faster approach to Nash equilibrium
- Per-information-set adaptation: Different parts of the game tree converge at different rates. Global discounting is suboptimal because it forces a single trade-off for the entire game.
Why Consistency-Enforced Optimism Works
- Signal vs noise: Actions with consistently positive regret are genuinely good actions, not statistical noise. The optimistic bonus amplifies this signal.
- Faster commitment: By committing more strongly to consistently-good actions, the algorithm reduces the effective action space sooner, accelerating convergence.
- Self-correcting: If an action's regret becomes inconsistent (alternating positive/negative), the consistency score drops and the bonus disappears, preventing over-commitment.
Ablation Study
| VAD-CFR Variant | Geo Mean NashConv | Relative to Full |
|---|---|---|
| Full VAD-CFR | 2.0e-3 | 1.00x |
| No volatility adaptation (fixed α) | 3.5e-3 | 1.75x worse |
| No consistency optimism | 3.0e-3 | 1.50x worse |
| No exploration bonus | 2.3e-3 | 1.15x worse |
| No adaptations (= DCFR) | 1.1e-2 | 5.50x worse |
**Significance:** This work demonstrates that LLM-driven evolution can discover algorithms that outperform decades of expert-designed game-theoretic algorithms. The discovered innovations (volatility adaptation, consistency-enforced optimism) are interpretable and mathematically principled, suggesting that LLMs can serve as creative research collaborators in algorithm design.
Paper 2
Aletheia Tackles FirstProof Autonomously
arXiv: 2602.21201 · Authors: Tony Feng et al. (17 authors) · Affiliation: Google DeepMind · Date: February 2026
2.1 Overview & Motivation
Aletheia is a Gemini 3 Deep Think agent that autonomously solves mathematical proof problems from the FirstProof benchmark. FirstProof contains 10 challenging mathematical problems requiring formal proofs in Lean 4, ranging from combinatorics to algebra to analysis.
**Key Result:** Aletheia solved **6 out of 10** FirstProof problems autonomously, without human guidance. This is the first demonstration of an LLM agent solving competition-level mathematical proof problems end-to-end.
What is FirstProof?
FirstProof is a benchmark of 10 mathematical problems that require:
- Problem understanding: Parse and understand the mathematical statement
- Strategy formulation: Decide on a proof approach (direct proof, induction, contradiction, etc.)
- Formal verification: Write the proof in Lean 4 such that the proof checker accepts it
- Error recovery: When the proof checker rejects a step, diagnose and fix the error
Why is This Hard?
Mathematical Reasoning
Each problem requires non-trivial mathematical insight. Standard template-based approaches fail because the proofs require creative lemma formulation and case analysis.
Formal Verification
Writing proofs in Lean 4 is notoriously difficult even for human experts. The proof assistant demands absolute precision in types, terms, and tactic application.
Long Horizon
Successful proofs require 50-200+ lines of Lean 4 code, with complex dependency chains between lemmas. A single error can invalidate the entire proof.
Error Diagnosis
When a proof fails, the Lean error messages are often cryptic. The agent must interpret compiler errors, understand type mismatches, and devise fixes.
2.2 Aletheia Agent Architecture
+-----------------------------------------------------------------------------------+
| Aletheia Agent Architecture |
+-----------------------------------------------------------------------------------+
| |
| +------------------+ +-------------------+ +-------------------------+ |
| | Problem Input | | Gemini 3 | | Lean 4 Environment | |
| | |---->| Deep Think |---->| | |
| | Natural language | | | | Proof checker | |
| | Lean 4 statement | | Extended thinking | | Type checker | |
| +------------------+ | Chain of thought | | Tactic evaluation | |
| +-------------------+ +-------------------------+ |
| | | |
| v v |
| +------------------------------------------------------------------------+ |
| | Agent Loop | |
| | | |
| | 1. Understand problem statement | |
| | 2. Formulate proof strategy (high-level) | |
| | 3. Decompose into lemmas | |
| | 4. Write Lean 4 proof for each lemma | |
| | 5. Submit to Lean 4 checker | |
| | 6. If error: analyze error, revise strategy, go to step 3 | |
| | 7. If success: combine lemmas into complete proof | |
| | 8. Submit final proof | |
| +------------------------------------------------------------------------+ |
| | |
| v |
| +------------------------------------------------------------------------+ |
| | Strategy Selection | |
| | | |
| | Direct Proof | Induction | Contradiction | Case Analysis | Counting | |
| | | |
| | The agent selects and sometimes switches strategies based on | |
| | feedback from the Lean 4 proof checker. | |
| +------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
Gemini 3 Deep Think
Aletheia is built on Gemini 3 Deep Think, a variant of Google's Gemini 3 model with extended thinking capability. Key properties:
- Extended thinking: The model can spend more compute on reasoning, producing longer and more detailed chain-of-thought before generating output.
- Tool use: The model can call the Lean 4 proof checker as a tool, submitting partial proofs and receiving feedback.
- Multi-turn interaction: The agent maintains context across multiple rounds of proof attempt, error analysis, and revision.
- Self-correction: When a proof fails, the model analyzes the error and proposes a fix, often changing the overall proof strategy if needed.
Proof Strategy Selection
Aletheia uses a hierarchical strategy selection process:
- Initial analysis: Read the problem statement and identify the mathematical domain (combinatorics, algebra, analysis, number theory)
- Strategy enumeration: Generate 2-4 candidate proof strategies with brief justifications
- Strategy selection: Choose the most promising strategy based on problem structure
- Lemma decomposition: Break the proof into 3-8 lemmas, each with a clear mathematical statement
- Bottom-up proof: Prove lemmas in dependency order, submitting each to the Lean 4 checker
- Strategy switching: If a strategy leads to a dead end after 3-5 failed attempts, switch to an alternative
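The hierarchical loop above can be sketched abstractly. This is a hedged reconstruction, not the paper's implementation; the callables are placeholders for the model's decomposition step and the Lean 4 checker:

```python
def proof_search(strategies, decompose, prove_lemma, max_attempts=5):
    """Try each candidate strategy in turn; within a strategy, retry up to
    max_attempts times; switch strategies once a dead end is reached."""
    for strategy in strategies:
        for _ in range(max_attempts):
            lemmas = decompose(strategy)                  # lemma decomposition
            proofs = [prove_lemma(lem) for lem in lemmas]  # bottom-up proving
            if all(p is not None for p in proofs):
                return strategy, proofs                   # combine and submit
    return None, []                                       # all strategies failed

# Toy usage: 'induction' always fails the checker, 'case_analysis' succeeds.
result = proof_search(
    strategies=["induction", "case_analysis"],
    decompose=lambda s: [f"{s}_lemma_{i}" for i in range(3)],
    prove_lemma=lambda lem: lem if lem.startswith("case") else None,
)
```

The key behavior is the outer loop: a strategy is abandoned only after its retry budget is exhausted, mirroring the 3-5-failure switching rule described above.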
2.3 FirstProof Problems & Results
| # | Problem | Domain | Difficulty | Solved? | Attempts | Proof Length |
|---|---|---|---|---|---|---|
| 1 | Pigeonhole variant with divisibility | Combinatorics | Medium | Yes | 3 | 45 lines |
| 2 | Polynomial root bound | Algebra | Medium | Yes | 5 | 78 lines |
| 3 | Infinite series convergence | Analysis | Hard | Yes | 8 | 120 lines |
| 4 | Graph coloring lower bound | Combinatorics | Hard | Yes | 12 | 95 lines |
| 5 | Modular arithmetic identity | Number Theory | Medium | Yes | 4 | 55 lines |
| 6 | Measure-theoretic inequality | Analysis | Very Hard | No | 20+ | — |
| 7 | Ramsey-type bound | Combinatorics | Very Hard | No | 15+ | — |
| 8 | Matrix rank inequality | Linear Algebra | Hard | Yes | 7 | 88 lines |
| 9 | Diophantine equation solvability | Number Theory | Very Hard | No | 20+ | — |
| 10 | Topological fixed-point theorem | Topology | Very Hard | No | 20+ | — |
Detailed Problem Analysis
Problem 1: Pigeonhole with Divisibility (Solved)
**Statement:** Given n+1 integers from {1, ..., 2n}, prove that there exist two integers where one divides the other.
Aletheia's approach:
- Recognized this as a pigeonhole problem
- Defined "pigeonholes" based on the odd parts of each integer (write each number as 2^k ⋅ m where m is odd)
- Proved there are exactly n odd numbers in {1, ..., 2n}, so by pigeonhole two selected numbers share the same odd part
- Proved that if two numbers share the same odd part, the smaller divides the larger
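For illustration, the theorem might be stated in Lean 4 roughly as follows. This is a hypothetical sketch of the statement only (the paper's actual formalization is not shown), with the proof body elided:

```lean
-- Hypothetical Lean 4 / mathlib-style statement of Problem 1;
-- names and exact phrasing are assumptions, proof omitted.
theorem one_divides_other (n : ℕ) (S : Finset ℕ)
    (hsub : S ⊆ Finset.Icc 1 (2 * n)) (hcard : S.card = n + 1) :
    ∃ a ∈ S, ∃ b ∈ S, a ≠ b ∧ a ∣ b := by
  sorry
```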
Problem 3: Infinite Series Convergence (Solved, 8 attempts)
**Statement:** Prove that $\sum_{n=1}^{\infty} \frac{\sin n}{n}$ converges.
Aletheia's journey:
- Attempt 1-3: Tried Abel's test directly, but struggled with the Lean 4 formalization of partial summation
- Attempt 4-5: Switched to Dirichlet's test, but Lean's mathlib didn't have the needed lemma
- Attempt 6-7: Proved the Dirichlet test as a standalone lemma first, then applied it
- Attempt 8: Successfully combined the Dirichlet test lemma with a bound on partial sums of sin(n) using the geometric series formula for complex exponentials
Problem 6: Measure-Theoretic Inequality (Not Solved)
This problem required deep measure theory that is partially formalized in Lean's mathlib but with many gaps. Aletheia's attempts failed because:
- Key lemmas about sigma-finite measures were missing from mathlib
- The agent attempted to prove them from scratch but the dependent chain became too deep
- After 20+ attempts, the agent could not close the proof gap
2.4 Analysis & Significance
Success Patterns
- Strategy switching: In 4 of 6 solved problems, the successful proof used a different strategy than the first attempt. The ability to recognize dead ends and switch approaches was critical.
- Lemma decomposition: All successful proofs used 3-6 intermediate lemmas. Direct proof attempts without decomposition always failed.
- Error-driven learning: Each failed attempt produced specific Lean error messages that helped the agent refine its approach. The agent effectively "learned" from proof checker feedback within a single problem.
Failure Patterns
- Formalization gap: All 4 unsolved problems involved mathematical concepts with incomplete Lean 4 formalizations. The agent could sketch the proof in natural language but could not bridge the formalization gap.
- Deep dependency chains: Problems requiring 5+ intermediate lemmas that are not in mathlib proved too challenging. The agent's proof length budget was effectively exhausted before reaching the main result.
- Novel mathematical insight: Problems 7 and 9 required truly novel mathematical insights (not just known proof techniques applied in a new way). The agent could not generate these insights.
Comparison with Previous Approaches
| System | Model | Problems Solved (out of 10) | Max Attempts |
|---|---|---|---|
| GPT-4o (direct) | GPT-4o | 1 | 5 |
| Claude Sonnet 4 (direct) | Claude Sonnet 4 | 2 | 5 |
| Gemini 2.0 Pro (direct) | Gemini 2.0 Pro | 2 | 5 |
| LeanDojo | ReProver | 3 | 100 |
| LEGO-Prover | GPT-4 | 3 | 50 |
| DeepSeek-Prover V2 | DeepSeek V2 | 4 | 100 |
| Aletheia | Gemini 3 Deep Think | 6 | 20 |
Implications for Evolutionary AI
**Connection to Evolutionary Systems:** Aletheia's proof search process is analogous to evolutionary optimization:
- **Population:** Multiple proof attempts represent a population of candidate solutions
- **Fitness:** The Lean 4 proof checker provides a binary fitness signal (compiles/doesn't compile) plus structured error feedback (analogous to ASI in GEPA)
- **Mutation:** Each new attempt is a "mutation" of the previous approach, guided by error analysis
- **Selection:** Successful proof strategies are retained and refined; failed strategies are abandoned
- **Strategy switching:** Analogous to island migration or diversity-preserving mechanisms in evolutionary systems
An evolutionary approach (like AlphaEvolve or GEPA applied to proof search) could potentially solve more problems by maintaining a diverse population of proof strategies and systematically exploring the search space.
Mathematical Foundation: Nash Equilibrium Connection
The connection between Papers 1 and 2 runs deeper than methodology. In game-theoretic terms, mathematical proof search can be modeled as a game:
**Proof Game:** Prover vs. Verifier (the Lean 4 checker)
The Prover's strategy space is the set of all possible proof scripts. The Verifier's strategy is deterministic (check each step). The Prover "wins" if the proof compiles. This is a single-player game against nature, but the search for winning strategies shares algorithmic structure with Nash equilibrium computation in two-player games.
$$ \text{Exploitability}(\sigma_{\text{proof}}) = 1 - \Pr\left[ \text{proof compiles} \mid \sigma_{\text{proof}} \right] $$
Minimizing exploitability in this setting corresponds to finding proofs that are robust to the verifier's checks — i.e., correct proofs.
**Summary:** These two papers demonstrate complementary applications of LLM-driven search. Paper 1 shows that evolutionary LLM search can discover novel game-theoretic algorithms that outperform human-designed baselines. Paper 2 shows that LLM agents with extended reasoning can autonomously solve challenging mathematical proof problems. Together, they suggest a powerful paradigm: combining evolutionary search (for exploring algorithm/proof space) with deep reasoning (for evaluating and refining candidates).