# AlphaEvolve

**A Gemini-Powered Coding Agent for Designing Advanced Algorithms**

- Organization: Google DeepMind
- Published: May 14, 2025
- Type: White Paper + Blog Post
- Report Type: PhD-Level Technical Analysis
- Report Date: March 2026

## Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
## 1 Full Title and Attribution
Full Title: AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms
White Paper URL: AlphaEvolve.pdf (Google DeepMind)
Blog Post: DeepMind Blog (May 14, 2025)
Results Colab: Mathematical Results Notebook
Lineage: Successor to FunSearch (Nature, 2023) and AlphaTensor (Nature, 2022)
Publication Date: May 14, 2025
## 2 Authors and Team
AlphaEvolve was developed by the AlphaEvolve team at Google DeepMind. The white paper lists the following primary authors:
Lead Authors: Matej Balog, Alexander Novikov, Ngoc-Hieu Nguyen, Amin Ghiasi, Tran Le Anh, Emilien Dupont, Alhussein Fawzi, Bernardino Romera-Paredes, Yusuf Hamied, et al. The team draws from the mathematical and algorithmic foundations group at DeepMind, with contributions from the Gemini model team. Many of these researchers previously worked on FunSearch and AlphaTensor.
The team is situated within Google DeepMind's broader research program on AI for science and mathematical discovery, building directly on the group's prior work that demonstrated LLMs could discover new mathematical knowledge (FunSearch) and novel algorithms for matrix multiplication (AlphaTensor).
## 3 Core Contribution
Key Novelty: AlphaEvolve is the first system to combine an ensemble of frontier LLMs (Gemini Flash and Gemini Pro) as mutation operators within an evolutionary framework to discover and optimize entire codebases -- not just individual functions -- for general-purpose algorithm design.
### What Makes AlphaEvolve Novel
- Beyond single-function evolution: Unlike its predecessor FunSearch, which evolved a single function, AlphaEvolve can modify multiple files, functions, and components simultaneously across an entire codebase. It proposes changes to optimizers, loss functions, hyperparameters, and weight initialization -- all at once.
- LLM ensemble as mutation operator: Uses both Gemini Flash (fast, breadth) and Gemini Pro (deep, insightful) in an ensemble, leveraging the complementary strengths of speed and reasoning depth.
- Diff-based and full-program modifications: The system can propose both granular diff-style patches and full program rewrites, enabling both fine-grained tweaking and radical restructuring.
- Automated evaluation pipeline: All proposed solutions are verified by automated evaluators that objectively score correctness and quality, removing human subjectivity from the loop.
- Real-world deployment at Google scale: Unlike most research systems, AlphaEvolve's discoveries are deployed in production -- optimizing Borg scheduling, TPU chip design, and Gemini model training itself.
- Self-improving capability: AlphaEvolve contributes to the training of the very LLMs (Gemini) that power it, creating a positive feedback loop.
### Relationship to Prior Work
| System | Year | Scope | Domain | LLM Used |
|---|---|---|---|---|
| AlphaTensor | 2022 | RL-based tensor decomposition | Matrix multiplication only | None (RL agent) |
| FunSearch | 2023 | Single function evolution | Math (cap set, bin packing) | PaLM 2 / Codey |
| AlphaEvolve | 2025 | Full codebase evolution | General-purpose | Gemini Flash + Pro ensemble |
## 4 Supported Solutions
AlphaEvolve targets any problem where solutions can be expressed as algorithms (code) and automatically verified. The types of solutions it has produced include:
| Solution Type | Description | Example |
| --- | --- | --- |
| Algorithm Design | Novel algorithms for fundamental computational problems | Matrix multiplication algorithms using 48 scalar multiplications for 4x4 complex matrices |
| Optimization Heuristics | Scheduling and resource allocation heuristics | Borg data center scheduling heuristic recovering 0.7% of Google's worldwide compute |
| Kernel Optimization | Low-level GPU/TPU kernel code optimization | 23% speedup in Gemini's key matrix multiplication kernel; 32.5% FlashAttention speedup |
| Hardware Design | Verilog circuit optimizations | Removing unnecessary bits from TPU arithmetic circuits |
| Mathematical Constructions | Solutions to open mathematical problems | Kissing number lower bound of 593 in 11 dimensions |
| Gradient-based Procedures | Complete optimization pipelines | Novel gradient-based procedure for matrix multiplication discovery |
| Complete Codebases | Multi-file, multi-component systems | Training pipelines with optimizer, loss, initialization, and hyperparameters |
## 5 LLM Integration
### Dual-Model Ensemble Architecture
AlphaEvolve uses an ensemble of Gemini models with distinct roles:
| Model | Role | Characteristic | Usage Pattern |
|---|---|---|---|
| Gemini Flash | Breadth explorer | Fastest, most efficient model | High-volume sampling; generates many candidate mutations quickly; responsible for most of the exploration budget |
| Gemini Pro | Depth explorer | Most powerful, most capable model | Provides critical depth with insightful suggestions; used for complex, multi-step mutations; higher quality but lower throughput |
### How the LLMs are Used
The LLMs serve as intelligent mutation operators in the evolutionary loop. Unlike traditional evolutionary programming where mutations are random (bit flips, crossover), AlphaEvolve uses the LLMs' understanding of code semantics, algorithms, and mathematics to propose meaningful code changes.
### Prompt Construction Pipeline
- Context Assembly: The prompt sampler assembles context including the current best programs, their scores, the problem description, and any relevant code from the codebase.
- Parent Selection: One or more parent programs are selected from the program database based on the evolutionary strategy (fitness-proportional, tournament, or MAP-Elites niche-based selection).
- Diff vs. Full-Program: The system decides whether to request a diff-based patch (for incremental improvements) or a full program rewrite (for radical restructuring).
- LLM Invocation: The assembled prompt is sent to either Gemini Flash or Gemini Pro (or both), and the response is parsed to extract the proposed code modification.
- Code Application: The diff/code is applied to create a new candidate program.
### Prompt Format Example (Diff-Based, Conceptual)

````text
# System context provided to the LLM:
You are an expert algorithm designer. Your goal is to improve the following
program to maximize the evaluation score.

## Current Best Program (Parent):
```python
{parent_program_code}
```

Current Score: {parent_score}
Problem Description: {problem_description}

Previous Improvement History:
{history_of_successful_mutations}

Instructions:
Propose a modification to improve this program. Return your changes as a
unified diff. Focus on algorithmic improvements, not just cosmetic changes.
Consider:
- Different mathematical formulations
- Alternative data structures
- Asymptotic complexity improvements
- Numerical stability
- Edge case handling
````
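For concreteness, a template like the one above could be filled with a plain `str.format` call. This is only a sketch: the field names and the `build_prompt` helper below are hypothetical, not taken from the white paper.

```python
# Hypothetical sketch of assembling a diff-based mutation prompt.
# Field names mirror the conceptual template above; none come from the paper.
PROMPT_TEMPLATE = (
    "You are an expert algorithm designer. Improve the following program\n"
    "to maximize the evaluation score.\n\n"
    "## Current Best Program (Parent):\n{parent_program_code}\n\n"
    "Current Score: {parent_score}\n"
    "Problem Description: {problem_description}\n\n"
    "Previous Improvement History:\n{history}\n\n"
    "Return your changes as a unified diff."
)

def build_prompt(parent_code, parent_score, problem, history):
    """Fill the template; a real system would also truncate long fields."""
    return PROMPT_TEMPLATE.format(
        parent_program_code=parent_code,
        parent_score=parent_score,
        problem_description=problem,
        history="\n".join(history) or "(none)",
    )

prompt = build_prompt("def f(x):\n    return x * x", 0.42,
                      "Minimize runtime of f", ["gen 3: replaced loop with *"])
print("unified diff" in prompt)  # the instruction survives formatting
```

A production prompt sampler would additionally budget tokens across the parent code, history, and problem description rather than concatenating them verbatim.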
### Model Selection Strategy
The system dynamically allocates between Flash and Pro based on:
- **Exploration phase:** Early in evolution, Gemini Flash dominates (high volume, diverse ideas)
- **Exploitation phase:** As the population matures, Gemini Pro is used more frequently for refined improvements
- **Complexity of required change:** Multi-file, multi-component changes are routed to Pro
- **Budget constraints:** Flash is significantly cheaper per token, so most calls go to Flash
## 6 Key Results
### 6.1 Matrix Multiplication
> **Breakthrough Result:** AlphaEvolve found an algorithm to multiply 4x4 complex-valued matrices using **48 scalar multiplications**, improving upon Strassen's 1969 algorithm (49 multiplications when applied recursively), which was previously the best known in this setting. This advances beyond AlphaTensor, whose 4x4 improvement applied only to arithmetic modulo 2 (binary arithmetic).
The discovery required 15 non-trivial mutations over the course of evolution, simultaneously modifying the optimizer, weight initialization, loss function, and hyperparameters. This demonstrates AlphaEvolve's ability to orchestrate complex, multi-component changes that no single LLM call could produce.
### 6.2 Data Center Scheduling (Borg)
> AlphaEvolve discovered a scheduling heuristic for Google's Borg cluster management system that **continuously recovers 0.7% of Google's worldwide compute resources**. This has been in production for over a year. The solution is human-readable code, offering interpretability, debuggability, and ease of deployment -- a significant operational advantage over black-box optimization.
### 6.3 TPU Hardware Design
AlphaEvolve proposed a Verilog rewrite that removed unnecessary bits in a highly optimized arithmetic circuit for matrix multiplication. This modification passed rigorous formal verification and was integrated into an upcoming Tensor Processing Unit (TPU).
### 6.4 AI Training Acceleration
| Application | Speedup | Impact |
| --- | --- | --- |
| Gemini matrix multiplication kernel (Pallas) | 23% | 1% reduction in overall Gemini training time |
| FlashAttention kernel optimization | 32.5% | Direct inference/training speedup for Transformer models |
| Low-level GPU instruction optimization | Up to 32.5% | Optimization beyond what compilers typically achieve |
### 6.5 Mathematical Discoveries
| Problem Domain | Problems Tested | Rediscovered SOTA | Improved Beyond SOTA |
| --- | --- | --- | --- |
| Open mathematical problems | 50+ | ~75% | ~20% |
Notable mathematical advances include:
- **Kissing number problem (11D):** Found a configuration of 593 outer spheres, establishing a new lower bound. The kissing number problem has fascinated mathematicians for 300+ years.
- **Problems in combinatorics, number theory, geometry, and analysis** were all tackled using the same unified framework.
- Most experiments could be set up in a matter of hours due to AlphaEvolve's general-purpose nature.
### 6.6 Self-Improvement
> **Meta-result:** AlphaEvolve contributed to the training of the Gemini models that power AlphaEvolve itself. This creates a virtuous cycle where improvements to Gemini's coding ability directly enhance AlphaEvolve's mutation quality, which in turn discovers better training algorithms for Gemini.
## 7 Reproducibility
| Aspect | Status | Details |
| --- | --- | --- |
| Source Code | `Not Released` | AlphaEvolve is a proprietary Google DeepMind system; no public code release |
| White Paper | `Available` | Technical white paper with architectural details available as PDF |
| Results | `Available` | Mathematical results published via Google Colab notebook on GitHub (google-deepmind/alphaevolve_results) |
| Model Access | `Limited` | Requires Gemini API access; Gemini Flash and Pro are available commercially but the exact prompts and pipeline are not public |
| Early Access Program | `Planned` | Google announced an Early Access Program for selected academic users |
| Independent Reimplementation | `Available` | OpenEvolve (community project) reimplements the core architecture |
> **Reproducibility Challenge:** While the white paper provides substantial architectural detail, full reproduction requires: (1) access to Gemini Flash and Pro models at scale, (2) the exact prompt templates and evolutionary hyperparameters, (3) significant compute budget, and (4) domain-specific evaluation functions. The mathematical results are independently verifiable through the published Colab notebook.
## 8 Compute and API Costs
### Cost Structure
AlphaEvolve's costs are dominated by two factors: LLM API calls and evaluation compute.
| Cost Component | Description | Estimated Scale |
| --- | --- | --- |
| Gemini Flash calls | High-volume exploration; majority of mutation budget | Tens of thousands to millions of calls per experiment |
| Gemini Pro calls | Targeted deep reasoning; minority of mutation budget | Thousands of calls per experiment (higher per-call cost) |
| Evaluation compute | Sandboxed code execution, automated verification | Varies by domain; hardware verification requires formal methods |
| Population storage | Program database maintenance | Negligible relative to LLM/eval costs |
### Cost-Efficiency Strategy
> AlphaEvolve's dual-model strategy is explicitly designed for cost efficiency:
>
> - **Gemini Flash** handles ~80-90% of mutations at low cost per token
> - **Gemini Pro** is reserved for high-value, complex mutations (~10-20%)
> - The evolutionary framework ensures only promising directions are explored further, preventing wasteful LLM calls
> - Evaluation cascading: cheap, fast checks filter out obviously bad candidates before expensive evaluations
### Estimated API Costs (Current Gemini Pricing, 2025-2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| Gemini 2.0 Flash | ~$0.10 | ~$0.40 | Workhorse for exploration |
| Gemini 2.5 Pro | ~$1.25-$2.50 | ~$10-$15 | Used sparingly for depth |
For a typical mathematical discovery experiment running thousands of evolutionary cycles, the total LLM API cost at retail prices would likely fall in the range of **$100 to $10,000+**, depending on problem complexity, population size, and number of generations. Google DeepMind runs the system internally, so its actual marginal cost is well below retail API pricing.
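As a back-of-the-envelope check on that range, cost can be estimated from call volume and the pricing table above. All call counts and token sizes below are purely illustrative assumptions, not figures from the paper.

```python
# Illustrative cost estimate; call counts and token sizes are assumptions.
PRICING = {  # USD per 1M tokens (approximate 2025 retail rates from the table)
    "flash": {"in": 0.10, "out": 0.40},
    "pro":   {"in": 1.25, "out": 10.00},
}

def experiment_cost(calls, in_tokens, out_tokens):
    """calls: {model: n_calls}; token counts are per call."""
    total = 0.0
    for model, n in calls.items():
        p = PRICING[model]
        total += n * (in_tokens * p["in"] + out_tokens * p["out"]) / 1e6
    return total

# e.g. 100k Flash calls + 15k Pro calls, 4k-token prompts, 1k-token replies
cost = experiment_cost({"flash": 100_000, "pro": 15_000}, 4_000, 1_000)
print(round(cost, 2))  # → 305.0
```

Even this fairly large hypothetical experiment lands comfortably inside the quoted $100-$10,000+ band.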
## 9 Architecture Solution
### High-Level Architecture Diagram
```text
 +-----------------+      +-------------------+      +---------------------+
 | PROMPT SAMPLER  |----->|   LLM ENSEMBLE    |----->|     CODE PARSER     |
 |                 |      |                   |      |                     |
 | - Context       |      | +---------------+ |      | - Diff extraction   |
 |   assembly      |      | | Gemini Flash  | |      | - Full program      |
 | - Parent        |      | |   (Breadth)   | |      |   extraction        |
 |   selection     |      | +---------------+ |      | - Syntax validation |
 | - Problem desc  |      | +---------------+ |      +----------+----------+
 | - History       |      | |  Gemini Pro   | |                 |
 +--------+--------+      | |    (Depth)    | |                 v
          ^               | +---------------+ |      +---------------------+
          |               +-------------------+      |   CODE APPLICATOR   |
          |                                          | - Apply diff/patch  |
          |                                          | - Create candidate  |
          |                                          +----------+----------+
          |                                                     |
          |                                                     v
 +--------+-------------+                            +---------------------+
 |   PROGRAM DATABASE   |<---------------------------|  EVALUATOR PIPELINE |
 |                      |      scored programs       |                     |
 | +------------------+ |                            | - Sandboxed exec    |
 | | MAP-Elites Grid  | |                            | - Correctness check |
 | | (behavioral      | |                            | - Performance score |
 | |  descriptors x   | |                            | - Multi-objective   |
 | |  fitness scores) | |                            |   scoring           |
 | +------------------+ |                            | - Cascaded filters  |
 | +------------------+ |                            +---------------------+
 | | Island Model     | |
 | | (sub-populations)| |
 | +------------------+ |
 | +------------------+ |
 | | Elite Archive    | |
 | | (Hall of Fame)   | |
 | +------------------+ |
 +----------------------+
```
### Data Flow
1. **Prompt Sampler** selects parent programs from the Program Database and assembles a prompt with context.
2. **LLM Ensemble** (Gemini Flash or Pro) generates proposed code modifications (diffs or full programs).
3. **Code Parser** extracts and validates the proposed changes.
4. **Code Applicator** applies the modifications to create a new candidate program.
5. **Evaluator Pipeline** runs the candidate in a sandbox, scores it, and checks correctness.
6. **Program Database** stores the scored program using MAP-Elites (behavioral niche mapping) and the island model.
7. The cycle repeats, with the prompt sampler drawing from the updated database for the next generation.
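The seven steps above can be sketched as a single driver loop. Every component interface here (sampler, parser, applicator, etc.) is a hypothetical stand-in, not AlphaEvolve's published API.

```python
# Minimal driver loop over the data flow above. All component interfaces
# are hypothetical stand-ins, not AlphaEvolve's actual API.
def evolve(database, sampler, llm, parser, applicator, evaluator, generations):
    for _ in range(generations):
        parent, prompt = sampler.sample(database)     # steps 1-2: select + assemble
        response = llm.generate(prompt)               # step 2: LLM mutation
        change = parser.parse(response)               # step 3: extract/validate
        if change is None:                            # malformed LLM output
            continue
        candidate = applicator.apply(parent, change)  # step 4: build candidate
        score = evaluator.evaluate(candidate)         # step 5: sandboxed scoring
        database.insert(candidate, score)             # step 6: store in niches
    return database.best()                            # after step 7's repeats
```

The loop is embarrassingly parallel: many samples can be in flight at once, with the database acting as the only shared state.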
### Evolutionary Loop Detail
```text
            +----------------------------+
            |   Initialize Population    |
            |  (seed programs or empty)  |
            +-------------+--------------+
                          |
                          v
+-------------------------------------------------+
|             Main Evolutionary Loop              |
|                                                 |
|  1. SELECT parents from database                |
|     (MAP-Elites / tournament /                  |
|      fitness-proportional)                      |
|                      |                          |
|                      v                          |
|  2. CONSTRUCT prompt                            |
|     - Parent code + score                       |
|     - Problem specification                     |
|     - Mutation history                          |
|     - Diff vs full-program mode                 |
|                      |                          |
|                      v                          |
|  3. GENERATE mutation via LLM                   |
|     - Gemini Flash (80-90%)                     |
|     - Gemini Pro (10-20%)                       |
|                      |                          |
|                      v                          |
|  4. PARSE and APPLY modification                |
|     - Extract diff/code                         |
|     - Validate syntax                           |
|     - Create candidate program                  |
|                      |                          |
|                      v                          |
|  5. EVALUATE candidate                          |
|     - Sandboxed execution                       |
|     - Fast filters first                        |
|     - Full evaluation if passes                 |
|     - Multi-objective scoring                   |
|                      |                          |
|                      v                          |
|  6. UPDATE program database                     |
|     - MAP-Elites niche placement                |
|     - Island migration                          |
|     - Elite archive update                      |
+------------------------+------------------------+
                         |
                         v
            +----------------------------+
            |  Return Best Solution(s)   |
            +----------------------------+
```
## 10 Component Breakdown
### 10.1 Prompt Sampler
The prompt sampler is responsible for assembling the input context that drives LLM mutation. Its key responsibilities:
- **Parent selection:** Chooses one or more parent programs from the program database using evolutionary selection strategies
- **Context construction:** Includes the problem description, parent code, parent scores, and relevant historical information about what mutations have been tried
- **Mode selection:** Decides between diff-based prompts (incremental changes) and full-program prompts (radical rewrites)
- **Model routing:** Determines whether to send the prompt to Gemini Flash or Gemini Pro
### 10.2 LLM Ensemble
The ensemble of Gemini Flash and Gemini Pro models serves as the mutation engine. They receive assembled prompts and return proposed code modifications. The ensemble approach allows the system to balance exploration (Flash: many cheap, diverse ideas) with exploitation (Pro: fewer, deeper, higher-quality suggestions).
### 10.3 Evaluator Pipeline
The evaluator is the fitness function of the evolutionary system. It:
- Runs candidate programs in a **sandboxed execution environment** for safety
- Applies **cascaded evaluation**: cheap filters first (syntax check, basic tests), then expensive full evaluation only for promising candidates
- Produces **quantifiable, objective scores** that enable comparison across the population
- Supports **multi-objective evaluation** (correctness, performance, code size, etc.)
- For hardware (Verilog), integrates with **formal verification methods** to ensure functional correctness
### 10.4 Program Database
The program database stores the population of evolved programs and implements the evolutionary strategy. It has three key sub-components:
#### MAP-Elites Grid
Programs are organized by behavioral descriptors (features that characterize what a program does, not just how well it does it). This ensures diversity: the database maintains the best program found for each behavioral niche.
#### Island Model
The population may be divided into semi-isolated sub-populations (islands) that evolve independently with occasional migration. This prevents premature convergence and maintains genetic diversity.
#### Elite Archive
A "Hall of Fame" of the best solutions ever found, used as reference points and as potential parents for future mutations.
### 10.5 Code Parser and Applicator
Extracts code modifications from LLM outputs (parsing diffs, code blocks, or inline changes) and applies them to create new candidate programs. Handles edge cases like malformed diffs, syntax errors, and merge conflicts.
### Component Interaction Matrix
| Component | Inputs From | Outputs To | Key Dependency |
| --- | --- | --- | --- |
| Prompt Sampler | Program Database, Problem Spec | LLM Ensemble | Selection strategy |
| LLM Ensemble | Prompt Sampler | Code Parser | Gemini API |
| Code Parser | LLM Ensemble | Code Applicator | Diff/code format |
| Code Applicator | Code Parser, Parent Code | Evaluator | Valid program generation |
| Evaluator | Code Applicator | Program Database | Domain-specific eval function |
| Program Database | Evaluator | Prompt Sampler | MAP-Elites, islands |
## 11 Core Mechanisms (Detailed)
### 11.1 Code Modification and Mutation (LLM-Guided Mutations)
AlphaEvolve's central innovation is using LLMs as *semantically-aware mutation operators*. Traditional evolutionary programming uses random perturbations (bit flips, crossover, random code insertion). AlphaEvolve replaces these with LLM-generated modifications that understand code structure, algorithmic patterns, and mathematical concepts.
#### Mutation Types
| Mutation Type | Description | When Used |
| --- | --- | --- |
| Diff-based patch | Small, targeted changes to specific lines/functions | Incremental improvements; fine-tuning |
| Full program rewrite | Complete replacement of the evolved code section | Radical restructuring; escaping local optima |
| Multi-file modification | Coordinated changes across multiple files/components | Complex algorithmic changes (optimizer + loss + init) |
| Hyperparameter sweep | Changes to numerical parameters | Fine-tuning discovered algorithms |
#### Mutation Quality vs. Random Perturbation
The key advantage of LLM-guided mutations is their **semantic validity rate**. A random mutation to source code almost certainly produces non-compiling or semantically meaningless code. LLM-guided mutations produce syntactically valid, semantically meaningful changes at a much higher rate.
$$
P(\text{valid} \mid \text{LLM mutation}) \gg P(\text{valid} \mid \text{random mutation})
$$
This dramatically improves the efficiency of the evolutionary search, as fewer mutations are wasted on non-viable candidates.
Python (Conceptual Mutation Process)
def generate_mutation(parent_program, problem_spec, model_selector): """Generate an LLM-guided mutation of a parent program."""
# 1. Select which LLM to use
model = model_selector.choose_model() # Flash (~85%) or Pro (~15%)
# 2. Decide mutation mode
mode = choose_mutation_mode(parent_program)
# mode in {"diff", "full_rewrite", "multi_file"}
# 3. Construct the prompt
prompt = assemble_prompt(
parent_code=parent_program.code,
parent_score=parent_program.fitness,
problem_description=problem_spec,
mutation_history=parent_program.lineage,
mode=mode
)
# 4. Call the LLM
response = model.generate(prompt, temperature=0.8)
# 5. Parse the response to extract code changes
if mode == "diff":
changes = parse_unified_diff(response)
elif mode == "full_rewrite":
changes = parse_full_program(response)
else:
changes = parse_multi_file_changes(response)
# 6. Apply changes to create child program
child_code = apply_changes(parent_program.code, changes)
return CandidateProgram(code=child_code, parent=parent_program)
### 11.2 Parent Selection
AlphaEvolve uses multiple parent selection strategies, with MAP-Elites being the primary mechanism for maintaining diversity while exploiting high-fitness regions.
#### MAP-Elites Selection
MAP-Elites (Multi-dimensional Archive of Phenotypic Elites) maps programs to a grid defined by **behavioral descriptors** -- features that characterize what a program does beyond its raw fitness score. For each cell in the grid, only the highest-fitness program is retained.
$$
\text{Grid}: B_1 \times B_2 \times \cdots \times B_k \;\to\; \text{Program} \cup \{\varnothing\}
$$

For each cell $(b_1, \ldots, b_k)$, store $\arg\max_p \, \text{fitness}(p)$ subject to $\text{descriptor}(p) = (b_1, \ldots, b_k)$.
Python (MAP-Elites Selection)

```python
import numpy as np

class MAPElitesGrid:
    def __init__(self, descriptor_dims, bins_per_dim):
        """
        descriptor_dims: number of behavioral descriptor dimensions
        bins_per_dim: list of bin counts for each dimension
        """
        self.grid = {}  # bin_idx_tuple -> (program, fitness)
        self.bins_per_dim = bins_per_dim
        self.descriptor_ranges = []  # (min, max) for each descriptor

    def compute_bin_index(self, descriptors):
        """Map continuous descriptors to grid cell indices."""
        indices = []
        for val, (lo, hi), n_bins in zip(
            descriptors, self.descriptor_ranges, self.bins_per_dim
        ):
            idx = int((val - lo) / (hi - lo) * n_bins)
            idx = max(0, min(n_bins - 1, idx))
            indices.append(idx)
        return tuple(indices)

    def insert(self, program, fitness, descriptors):
        """Insert program if it's the best in its niche."""
        cell = self.compute_bin_index(descriptors)
        if cell not in self.grid or fitness > self.grid[cell][1]:
            self.grid[cell] = (program, fitness)
            return True   # inserted
        return False      # rejected

    def select_parent(self, strategy="uniform"):
        """Select a parent from occupied cells."""
        occupied = list(self.grid.values())
        if strategy == "uniform":
            # Uniform random over all occupied cells
            prog, fit = occupied[np.random.randint(len(occupied))]
        elif strategy == "fitness_proportional":
            # Probability weighted by (shifted) fitness
            fitnesses = np.array([f for _, f in occupied])
            fitnesses = fitnesses - fitnesses.min() + 1e-8
            probs = fitnesses / fitnesses.sum()
            idx = np.random.choice(len(occupied), p=probs)
            prog, fit = occupied[idx]
        elif strategy == "tournament":
            # Tournament selection with k = 3
            k = min(3, len(occupied))
            tournament = [occupied[i] for i in
                          np.random.choice(len(occupied), k, replace=False)]
            prog, fit = max(tournament, key=lambda x: x[1])
        return prog
```
#### Island Model Migration
Periodically, high-fitness programs migrate between islands, introducing genetic diversity while maintaining the benefits of isolated evolution.
$$
\text{Migration: every } T \text{ generations, the top-}k \text{ programs of island } i \text{ are copied to island } (i+1) \bmod N
$$
### 11.3 Population Management
The program database manages the population through several mechanisms:
#### Insertion Policy
- **MAP-Elites:** A new program replaces the current occupant of its behavioral niche only if it has higher fitness
- **Size limits:** Maximum population size prevents unbounded memory growth
- **Duplicate detection:** Semantically identical programs (same behavior on test cases) are deduplicated
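Behavioral deduplication, as described in the last bullet, can be approximated by hashing a program's outputs on a fixed probe set. This is a minimal sketch assuming programs are plain callables; the probe inputs and hashing scheme are illustrative, not the paper's.

```python
import hashlib

def behavior_signature(program, probe_inputs):
    """Hash a program's outputs on fixed probes; equal signatures flag
    behaviorally identical candidates (illustrative scheme)."""
    outputs = [repr(program(x)) for x in probe_inputs]
    return hashlib.sha256("|".join(outputs).encode()).hexdigest()

# Two syntactically different programs with identical behavior collide:
probes = list(range(10))
sig_a = behavior_signature(lambda x: x * 2, probes)
sig_b = behavior_signature(lambda x: x + x, probes)
print(sig_a == sig_b)  # → True: deduplicate one of them
```

The database would keep only one candidate per signature, so LLM mutations that merely rename variables or reorder statements do not inflate the population.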
#### Population Diversity Metrics
$$
\mathrm{Diversity}(P) = \frac{|\{\text{occupied cells}\}|}{|\{\text{cells in the MAP-Elites grid}\}|}
\qquad
\mathrm{Coverage}(P) = \text{fraction of behavioral space explored}
$$
Python (Population Management)

```python
class ProgramDatabase:
    def __init__(self, config):
        self.map_elites = MAPElitesGrid(
            descriptor_dims=config.n_descriptors,
            bins_per_dim=config.bins_per_dim,
        )
        self.elite_archive = []  # best-ever solutions
        self.max_archive_size = config.max_archive_size
        self.islands = [
            MAPElitesGrid(descriptor_dims=config.n_descriptors,
                          bins_per_dim=config.bins_per_dim)
            for _ in range(config.n_islands)
        ]
        self.generation = 0
        self.migration_interval = config.migration_interval

    def add_program(self, program, fitness, descriptors, island_id=0):
        """Add a scored program to the database."""
        # Insert into the appropriate island's MAP-Elites grid
        inserted = self.islands[island_id].insert(program, fitness, descriptors)
        # Update the global elite archive
        # (_archive_threshold() is assumed to be defined elsewhere)
        if fitness > self._archive_threshold():
            self.elite_archive.append((program, fitness))
            self.elite_archive.sort(key=lambda x: x[1], reverse=True)
            self.elite_archive = self.elite_archive[:self.max_archive_size]
        return inserted

    def maybe_migrate(self):
        """Periodic migration between islands."""
        self.generation += 1
        if self.generation % self.migration_interval == 0:
            for i in range(len(self.islands)):
                source = self.islands[i]
                target = self.islands[(i + 1) % len(self.islands)]
                # get_top_k() is assumed to be defined on the grid
                for prog, fit, desc in source.get_top_k(k=3):
                    target.insert(prog, fit, desc)

    def get_diversity_stats(self):
        """Compute population diversity metrics."""
        total_cells = 1
        for b in self.map_elites.bins_per_dim:
            total_cells *= b
        fitnesses = [f for _, f in self.map_elites.grid.values()]
        return {
            "coverage": len(self.map_elites.grid) / total_cells,
            "occupied_cells": len(self.map_elites.grid),
            "best_fitness": max(fitnesses),
            "mean_fitness": float(np.mean(fitnesses)),
        }
```
### 11.4 Solution Diversity and Novelty
Maintaining diversity is critical in evolutionary search to prevent premature convergence. AlphaEvolve uses multiple mechanisms:
#### Behavioral Descriptors
Instead of measuring diversity by syntactic code differences, AlphaEvolve uses **behavioral descriptors** -- features extracted from how a program behaves on test inputs. This ensures that the population covers the space of *possible behaviors*, not just the space of possible code strings.
Python (Behavioral Descriptor Extraction)
def extract_behavioral_descriptors(program, test_inputs): """Extract behavioral features from program execution.""" descriptors = []
# Example descriptors for a sorting algorithm:
results = [program.run(inp) for inp in test_inputs]
# 1. Number of comparisons used (algorithmic complexity proxy)
descriptors.append(np.mean([r.comparison_count for r in results]))
# 2. Memory usage pattern
descriptors.append(np.mean([r.peak_memory for r in results]))
# 3. Code size (lines of code)
descriptors.append(len(program.code.split('\n')))
# 4. Branching pattern hash
descriptors.append(compute_branch_pattern_hash(results))
return np.array(descriptors)
#### Novelty Search Component
In addition to fitness-based selection, AlphaEvolve can incorporate novelty-seeking behavior where programs that exhibit *novel behaviors* (far from existing solutions in descriptor space) are rewarded, even if their fitness is not immediately superior.
$$
\mathrm{novelty}(p) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{dist}\big(\mathrm{descriptor}(p),\, \mathrm{descriptor}(n_i)\big)
$$

where $n_1, \ldots, n_k$ are the $k$ nearest neighbors of $p$ in descriptor space.
### 11.5 LLM Orchestration and Model Selection
#### Dynamic Model Allocation
Python (Model Selection Strategy)
class ModelSelector: def init(self, config): self.flash_ratio = config.flash_ratio # e.g., 0.85 self.pro_ratio = 1.0 - self.flash_ratio self.generation = 0 self.adaptive = config.adaptive_selection self.flash_success_rate = 0.0 self.pro_success_rate = 0.0
def choose_model(self, mutation_complexity="normal"):
"""Select which model to use for this mutation."""
if mutation_complexity == "complex":
# Multi-file changes always use Pro
return "gemini-pro"
if self.adaptive:
# Adjust ratios based on observed success rates
# (Multi-armed bandit approach)
total = self.flash_success_rate + self.pro_success_rate + 1e-8
adaptive_flash_ratio = self.flash_success_rate / total
# Blend with base ratio for exploration
effective_ratio = 0.7 * adaptive_flash_ratio + 0.3 * self.flash_ratio
else:
effective_ratio = self.flash_ratio
if np.random.random() < effective_ratio:
return "gemini-flash"
else:
return "gemini-pro"
def update_stats(self, model, was_improvement):
"""Track which model produces improvements."""
alpha = 0.01 # Exponential moving average decay
if model == "gemini-flash":
self.flash_success_rate = (1 - alpha) * self.flash_success_rate + alpha * was_improvement
else:
self.pro_success_rate = (1 - alpha) * self.pro_success_rate + alpha * was_improvement
### 11.6 Search Strategies
AlphaEvolve combines several search strategies to balance exploration and exploitation:
| Strategy | Description | Role |
| --- | --- | --- |
| MAP-Elites | Maintains diverse archive across behavioral dimensions | Primary diversity mechanism |
| Island Model | Parallel sub-populations with occasional migration | Prevents premature convergence |
| Tournament Selection | Select best of k random candidates | Selection pressure control |
| Fitness-Proportional | Probability proportional to fitness | Exploitation of good solutions |
| Novelty Search | Reward behavioral novelty | Escape local optima |
| Epsilon-greedy | Mostly exploit best, occasionally explore random | Balancing mechanism |
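Two of the strategies in the table, tournament selection and epsilon-greedy, can be sketched in a few lines (function names are illustrative, not from the paper):

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick the fittest of k uniformly sampled candidates.

    Larger k increases selection pressure; k=1 is random selection.
    """
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

def epsilon_greedy_select(population, fitness, epsilon=0.1):
    """Mostly exploit the current best; explore at random with prob. epsilon."""
    if random.random() < epsilon:
        return random.choice(population)
    return max(population, key=fitness)
```

The tournament size `k` is the single knob controlling how greedy the search is, which is why the table lists it under "selection pressure control".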
#### Multi-Objective Pareto Optimization
For problems with multiple objectives (e.g., algorithm speed vs. code size vs. correctness), AlphaEvolve uses Pareto dominance to maintain a frontier of non-dominated solutions.
$$
p_1 \text{ dominates } p_2 \;\iff\; \forall i:\ f_i(p_1) \ge f_i(p_2) \;\;\land\;\; \exists j:\ f_j(p_1) > f_j(p_2)
$$
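A minimal sketch of this dominance test and the resulting non-dominated frontier, assuming all objectives are maximized (function names are illustrative):

```python
def dominates(f1, f2):
    """f1 dominates f2: no worse in every objective, strictly better in one."""
    return all(a >= b for a, b in zip(f1, f2)) and any(a > b for a, b in zip(f1, f2))

def pareto_front(scores):
    """Return the non-dominated subset of a list of objective vectors."""
    return [s for s in scores
            if not any(dominates(other, s) for other in scores if other != s)]
```

For example, with objectives (speed score, correctness), `(1.0, 0.9)` and `(0.8, 1.0)` are both kept on the frontier while `(0.5, 0.5)` is discarded.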
### 11.7 Evaluation and Fitness
#### Sandboxed Execution
All candidate programs are executed in a **sandboxed environment** that prevents:
- File system access beyond the working directory
- Network access
- Excessive resource consumption (CPU time limits, memory limits)
- Interference with other candidates or the host system
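The white paper does not describe the internal sandbox implementation. As a rough sketch of the resource-limit portion only, on POSIX systems one can cap CPU time and address space before launching the candidate (the function name `run_limited` is hypothetical; real isolation of filesystem and network requires OS-level mechanisms such as containers):

```python
import resource
import subprocess

def run_limited(cmd, timeout_s=5.0, mem_bytes=1024 * 1024 * 1024):
    """Run a command with CPU-time and address-space limits (POSIX only).

    `preexec_fn` applies the rlimits in the child process just before exec;
    the wall-clock `timeout` is a backstop on top of the CPU limit.
    """
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (int(timeout_s), int(timeout_s)))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(cmd, preexec_fn=set_limits, capture_output=True,
                          text=True, timeout=timeout_s * 2)
```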
#### Cascaded Evaluation Pipeline
Python (Cascaded Evaluation)
```python
class EvaluationPipeline:
    def evaluate(self, candidate_program):
        """Multi-stage evaluation with early termination."""
        # Stage 1: Syntax check (near-instant)
        if not self.check_syntax(candidate_program.code):
            return EvalResult(fitness=-float('inf'), stage="syntax")
        # Stage 2: Quick smoke test (milliseconds)
        try:
            result = self.run_sandboxed(candidate_program, test_set="quick",
                                        timeout=5.0)
            if not result.all_correct:
                return EvalResult(fitness=-float('inf'), stage="smoke")
        except (TimeoutError, RuntimeError):
            return EvalResult(fitness=-float('inf'), stage="smoke")
        # Stage 3: Full correctness test (seconds)
        result = self.run_sandboxed(candidate_program, test_set="full",
                                    timeout=60.0)
        if not result.all_correct:
            return EvalResult(fitness=0.0, stage="correctness",
                              partial_score=result.correct_fraction)
        # Stage 4: Performance benchmarking (seconds to minutes)
        perf = self.benchmark(candidate_program, n_runs=10)
        # Stage 5: Multi-objective scoring
        fitness = self.compute_fitness(
            correctness=result.correct_fraction,
            performance=perf.median_time,
            code_size=len(candidate_program.code),
            memory_usage=perf.peak_memory,
        )
        return EvalResult(fitness=fitness, stage="complete", details=perf)

    def compute_fitness(self, correctness, performance, code_size, memory_usage):
        """Multi-objective fitness function."""
        if correctness < 1.0:
            return correctness * 0.01  # heavily penalize incorrect programs
        # Normalized performance score (lower time = higher fitness)
        perf_score = 1.0 / (1.0 + performance)
        # Optional: penalize very large programs
        size_penalty = max(0, (code_size - 1000) * 0.0001)
        return perf_score - size_penalty
```
### 11.8 Prompt Engineering
Prompt engineering in AlphaEvolve is a critical component that directly affects mutation quality. The system uses several prompt strategies:
#### Diff-Based Prompts
The LLM is asked to produce a unified diff that modifies specific parts of the code. This is preferred for incremental improvements.
Prompt Template (Diff Mode)
```
You are an expert algorithm designer. Below is a program that scores
{parent_score} on the evaluation metric. Your goal is to modify it to
achieve a higher score.

Current Program:
{parent_code}

Evaluation Metric:
{metric_description}

Previous Successful Modifications:
{history}

Please provide your modification as a unified diff (--- a/file +++ b/file).
Focus on algorithmic improvements. Think step by step about what could
improve the score, then provide the diff.
```
#### Full-Program Prompts
For radical restructuring, the LLM generates a completely new version of the program.
#### Multi-File Prompts
For complex changes spanning multiple components (like the matrix multiplication discovery that modified optimizer, loss function, initialization, and hyperparameters), the prompt includes all relevant files and asks for coordinated changes.
#### Temperature and Sampling
- Higher temperature (~0.8-1.0) for exploration phases to encourage diverse mutations
- Lower temperature (~0.3-0.5) for exploitation phases to refine promising solutions
- Multiple samples per prompt to increase the chance of finding improvements
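A simple linear annealing schedule consistent with the ranges above might look as follows (a sketch; the paper does not specify the actual schedule, and the function name is hypothetical):

```python
def sampling_temperature(generation, total_generations,
                         explore_t=0.9, exploit_t=0.4):
    """Linearly anneal from an exploratory to an exploitative temperature.

    Early generations sample near explore_t (diverse mutations); late
    generations sample near exploit_t (refinement of promising solutions).
    """
    frac = min(generation / max(total_generations - 1, 1), 1.0)
    return explore_t + frac * (exploit_t - explore_t)
```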
### 11.9 Cost Control
AlphaEvolve implements several mechanisms to control computational cost:
- **Cascaded evaluation:** Cheap filters eliminate bad candidates before expensive evaluations
- **Flash-first strategy:** Use the cheaper model for most mutations
- **Early stopping:** Terminate evaluation if a candidate fails basic tests
- **Budget allocation:** Fixed compute budget per experiment, dynamically allocated between exploration and exploitation
- **Caching:** Previously evaluated programs are cached to avoid re-evaluation
- **Batch processing:** Multiple mutations and evaluations can be batched for efficiency
$$
\text{TotalCost} = N_{\text{flash}}\, C_{\text{flash}} + N_{\text{pro}}\, C_{\text{pro}} + N_{\text{eval}}\, C_{\text{eval}},
\qquad N_{\text{flash}} \gg N_{\text{pro}}, \;\; C_{\text{flash}} \ll C_{\text{pro}}
$$
### 11.10 Meta-Level Improvement
> **Self-Improving Loop:** AlphaEvolve's most remarkable meta-level property is that it contributes to training the Gemini models that power it. Specifically:
>
> 1. AlphaEvolve discovers better kernel optimizations for Gemini's matrix multiplication operations
> 2. These optimizations speed up Gemini's training (a 1% reduction in training time)
> 3. Faster training enables more efficient Gemini model development
> 4. Better Gemini models produce higher-quality mutations in AlphaEvolve
> 5. Higher-quality mutations lead to better algorithm discoveries
>
> This creates a **positive feedback loop** -- a form of recursive self-improvement bounded by the diminishing returns of each optimization cycle.
```
+-------------------+       +------------------+
|    AlphaEvolve    |------>|  Better Kernels  |
|    (Algorithm     |       |    for Gemini    |
|     Discovery)    |       |     Training     |
+--------+----------+       +--------+---------+
         ^                           |
         |                           v
+--------+----------+       +--------+---------+
|   Better Gemini   |<------|  Faster Gemini   |
|     Mutations     |       |     Training     |
+-------------------+       +------------------+
```
This is not unbounded recursive self-improvement (each cycle yields diminishing marginal gains), but it does represent a meaningful and deployed instance of AI systems contributing to their own improvement.
### 11.11 Mathematical Framework
#### Evolutionary Optimization Formulation
$$
\max_{p \in \mathcal{P}} f(p)
$$
where $\mathcal{P}$ is the space of all valid programs and $f: \mathcal{P} \to \mathbb{R}$ is the fitness function (the evaluation metric).
#### LLM as Mutation Operator
$$
M_{\mathrm{LLM}}: \mathcal{P} \times \mathcal{C} \to \mathcal{P}, \qquad p' = M_{\mathrm{LLM}}(p, c)
$$
where $p$ is the parent program, $c \in \mathcal{C}$ is the context (problem specification, mutation history), and $p'$ is the mutated child program.
#### MAP-Elites Update Rule
For a new program $p$ with fitness $f(p)$ and descriptor $d(p)$, let $\text{cell} = \mathrm{bin}(d(p))$:
$$
\text{archive}[\text{cell}] \leftarrow p \quad \text{if cell is empty or } f(p) > f(\text{archive}[\text{cell}])
$$
#### Expected Improvement per Generation
$$
\mathbb{E}[\Delta f] = \sum_{m} w_m \; \mathbb{E}\big[f(M_m(p, c)) - f(p) \,\big|\, f(M_m(p, c)) > f(p)\big] \; P(\text{improvement} \mid m)
$$
where $w_m$ is the sampling weight for model $m$ (Flash vs. Pro).
#### Cost-Efficiency Ratio
$$
\text{Efficiency} = \frac{\Delta f}{C_{\mathrm{LLM}} + C_{\mathrm{eval}}} = \frac{\text{fitness improvement}}{\text{LLM cost} + \text{evaluation cost}}
$$
#### Island Migration Topology
$$
G = (V, E), \quad V = \{\text{islands}\}, \quad E_{\text{ring}} = \{(i,\, (i+1) \bmod N) \mid i = 0, \dots, N-1\}
$$
with migration rate $\mu = k / |\text{island}|$ applied every $T$ generations.
12 Programming Language
| Aspect | Language/Technology | Notes |
|---|---|---|
| AlphaEvolve system implementation | Python (primary), likely with C++ internals | Google DeepMind standard stack |
| Evolved programs (typical) | Python | Most experiments evolve Python code |
| Evolved programs (hardware) | Verilog | TPU circuit optimization |
| Evolved programs (kernels) | Pallas (JAX kernel DSL), CUDA-like | GPU/TPU kernel optimization |
| LLM backend | Gemini API (internal Google) | Likely using internal APIs, not public Gemini API |
| Evaluation infrastructure | Google internal tooling (Borg, etc.) | Sandboxed execution on Google infrastructure |
A key strength of AlphaEvolve is its language agnosticism: since the LLMs can generate code in any programming language, AlphaEvolve can evolve solutions in Python, Verilog, C++, CUDA, or any other language as needed for the specific domain.
13 Memory Management
Program Database Memory
The program database must efficiently store and retrieve potentially thousands of programs. Key memory management strategies include:
- MAP-Elites grid sparsity: Only occupied cells consume memory; the grid structure itself is sparse (dictionary/hash-map based)
- Program deduplication: Programs with identical behavior on test cases are merged, with only one copy stored
- Incremental storage: Programs can be stored as diffs from their parent, reducing storage for closely related programs
- Archive pruning: The elite archive has a fixed maximum size; lower-fitness programs are evicted when the archive is full
- Evaluation caching: Results of previous evaluations are cached (keyed by program hash) to avoid redundant computation
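The evaluation-caching strategy in the last bullet might look like the following sketch (the class name `EvalCache` is hypothetical; the real system's cache keys and storage are not described in the paper):

```python
import hashlib

class EvalCache:
    """Cache evaluation results keyed by a hash of the program text."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(code):
        # Content-addressed key: identical programs hash to the same entry
        return hashlib.sha256(code.encode("utf-8")).hexdigest()

    def get_or_evaluate(self, code, evaluate_fn):
        k = self.key(code)
        if k not in self._store:
            self._store[k] = evaluate_fn(code)  # evaluate only on cache miss
        return self._store[k]
```

Hashing the program text catches exact duplicates; the behavioral deduplication mentioned above would additionally require comparing outputs on the test cases.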
LLM Context Window Management
For large codebases, fitting all relevant context into the LLM's context window requires careful management:
- Selective context: Only the most relevant code sections are included in the prompt
- Summarization: Long histories of mutations are summarized rather than included verbatim
- Hierarchical context: File-level summaries are provided, with full code only for the sections being modified
14 Continued Learning
AlphaEvolve exhibits multiple forms of continued learning:
Within-Run Learning
- Population improvement: The program database accumulates better solutions over generations
- Mutation history: Successful mutations are recorded and inform future prompt construction
- Adaptive model selection: The system learns which model (Flash vs Pro) is more effective for the current problem
Cross-Run Learning
- Transfer of insights: Solutions discovered in one domain can seed evolution in related domains
- Prompt template refinement: Prompt engineering is iteratively improved based on experiment outcomes
Meta-Level Learning (Self-Improvement Loop)
- Gemini model improvement: AlphaEvolve's kernel optimizations improve Gemini training, which improves AlphaEvolve's mutation quality
- Infrastructure improvement: Borg scheduling improvements free up compute for running more AlphaEvolve experiments
Note that the LLMs themselves (Gemini Flash/Pro) are not fine-tuned during AlphaEvolve runs. The "learning" happens at the population level (better programs in the database) and at the prompt level (better context provided to the LLM), not at the model weight level.
15 Applications
15.1 Sorting Algorithms
While AlphaEvolve's predecessor AlphaDev (2023) focused specifically on sorting algorithms, AlphaEvolve's general framework subsumes this capability. The system can evolve sorting algorithms by defining appropriate fitness functions (number of comparisons, swaps, or wall-clock time) and behavioral descriptors (algorithm strategy type, memory usage pattern).
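As an illustration of such a fitness function, a comparison-counting scorer for candidate sorting routines might look like this (a sketch, not AlphaDev's or AlphaEvolve's actual code; names are hypothetical):

```python
def comparison_count_fitness(sort_fn, test_lists):
    """Score a sorting routine by total comparisons made (fewer is better).

    `sort_fn(lst, cmp)` must sort via the supplied three-way comparator;
    instrumenting `cmp` counts comparisons. Returns -total_comparisons,
    or -inf if any output is not correctly sorted.
    """
    total = 0
    for lst in test_lists:
        count = 0
        def cmp(a, b):
            nonlocal count
            count += 1
            return (a > b) - (a < b)  # three-way comparison
        out = sort_fn(list(lst), cmp)
        if out != sorted(lst):
            return float("-inf")  # correctness gate before efficiency
        total += count
    return -total
```

The hard correctness gate mirrors the cascaded-evaluation philosophy of Section 11.7: an incorrect program never competes on efficiency.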
15.2 Matrix Multiplication
Flagship Result: AlphaEvolve discovered an algorithm for multiplying 4x4 complex-valued matrices using 48 scalar multiplications, improving on the 49 required by recursively applying Strassen's 1969 algorithm. The evolutionary process required 15 non-trivial mutations that modified multiple components simultaneously:
- Optimizer architecture
- Weight initialization scheme
- Loss function formulation
- Hyperparameter sweep configuration
This demonstrates AlphaEvolve's ability to co-evolve multiple interdependent components, something beyond the capability of single-function evolution systems like FunSearch.
15.3 Mathematical Conjectures and Open Problems
| Problem | Domain | Result |
|---|---|---|
| Kissing number (11D) | Geometry | Improved lower bound: 593 spheres |
| 50+ open problems | Analysis, geometry, combinatorics, number theory | ~75% rediscovered SOTA, ~20% improved beyond SOTA |
15.4 Kernel Optimization
| Kernel | Target Hardware | Speedup | Technique |
|---|---|---|---|
| Matrix multiplication (Pallas) | TPU | 23% | Better subproblem decomposition |
| FlashAttention | GPU | 32.5% | Low-level GPU instruction optimization |
15.5 Data Center Scheduling
The Borg scheduling heuristic discovered by AlphaEvolve has been in production at Google for over a year, recovering 0.7% of worldwide compute resources. This translates to significant cost savings and energy efficiency improvements at Google's scale.
15.6 Hardware Design (Verilog)
AlphaEvolve proposed Verilog circuit modifications for TPU arithmetic units, removing unnecessary bits while maintaining formal verification guarantees. This represents a novel application of evolutionary program synthesis to hardware design.
15.7 Future Applications (Stated by DeepMind)
- Material science
- Drug discovery
- Sustainability optimization
- General technological and business applications
- Any problem where the solution can be described as an algorithm and automatically verified
AlphaEvolve Technical Research Report | Generated March 2026
Sources: Google DeepMind Blog Post (May 14, 2025), AlphaEvolve White Paper, google-deepmind/alphaevolve_results
This report is intended for PhD-level academic research and technical analysis.