OpenEvolve
Open-Source Reimplementation of AlphaEvolve for LLM-Guided Evolutionary Coding
Organization: Algorithmic Superintelligence Inc. (community)
Repository: github.com/algorithmicsuperintelligence/openevolve
License: Apache 2.0
Language: Python
Report Date: March 2026
Table of Contents
- Full Title and Attribution
- Authors and Contributors
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results and Examples
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications and Examples
- Comparison: OpenEvolve vs AlphaEvolve
1 Full Title and Attribution
Full Title: OpenEvolve: An Open-Source Implementation of AlphaEvolve
Repository: github.com/algorithmicsuperintelligence/openevolve
License: Apache License 2.0
Status: Active open-source project
Inspiration: Directly inspired by Google DeepMind's AlphaEvolve (May 2025)
Stars: 8,000+ (as of early 2026)
2 Authors and Contributors
OpenEvolve is developed by Algorithmic Superintelligence Inc., an open-source community organization. The project is community-driven, with contributions from multiple developers.
The project was created shortly after Google DeepMind published the AlphaEvolve white paper in May 2025. It aims to democratize access to evolutionary LLM-guided program synthesis by providing a fully open-source implementation that works with any LLM provider, not just Google's Gemini.
3 Core Contribution
Key Contribution: OpenEvolve is the first comprehensive open-source reimplementation of AlphaEvolve's core architecture, making LLM-guided evolutionary program synthesis accessible to researchers and practitioners who do not have access to Google's internal infrastructure.
What OpenEvolve Provides
- Multi-provider LLM support: Works with Gemini, OpenAI (GPT-4, GPT-4o), Anthropic (Claude), and any OpenAI-compatible API (local models via Ollama, vLLM, etc.)
- Faithful architectural reproduction: Implements the key components described in the AlphaEvolve white paper -- prompt sampler, program database (MAP-Elites), LLM ensemble, evaluation pipeline
- Configurable evolutionary strategies: YAML-based configuration for all hyperparameters, selection strategies, and LLM settings
- Diff-based and full-program mutations: Both mutation modes as described in AlphaEvolve
- Sandboxed evaluation: Secure code execution with timeout and resource limits
- Ready-to-use examples: Pre-built examples for sorting algorithms, symbolic regression, mathematical optimization, and more
- Cost control: Built-in token tracking, budget limits, and model routing to manage API costs
Key Differences from AlphaEvolve
| Aspect | AlphaEvolve | OpenEvolve |
|---|---|---|
| Source availability | Closed (proprietary) | Fully open-source (Apache 2.0) |
| LLM provider | Gemini only (internal) | Any provider (Gemini, OpenAI, Anthropic, local) |
| Scale | Google-scale infrastructure | Single machine / small cluster |
| Evaluation | Google internal sandboxing | Docker/subprocess sandboxing |
| Production deployment | Deployed in Google infra | Research/experimentation tool |
| Hardware targets | Verilog, TPU kernels, GPU | Python-focused (extensible) |
4 Supported Solutions
| Solution Type | Support Level | Example |
|---|---|---|
| Python Functions | Primary target | Sorting algorithms, mathematical functions |
| Complete Programs | Supported | Multi-file Python programs |
| Symbolic Regression | Example provided | Discovering mathematical formulas from data |
| Algorithm Optimization | Example provided | Optimizing algorithm efficiency |
| Mathematical Discovery | Supported | Exploring conjectures and constructions |
| Custom Domains | Extensible | Any domain with a programmatic evaluator |
OpenEvolve is designed to be domain-agnostic: any problem expressible as "write code, evaluate code, score code" can be tackled by defining an appropriate evaluation function.
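In practice this contract reduces to a script that reads a candidate file and prints a numeric score. The sketch below is illustrative only, not OpenEvolve's actual evaluator API; the toy scoring rule (shorter code scores higher) stands in for real domain logic:

```python
# Hypothetical minimal evaluator illustrating the "write code, evaluate
# code, score code" contract. OpenEvolve invokes such a script with the
# candidate's path and reads the score from stdout (see Section 11.7).
import sys

def evaluate(candidate_path):
    with open(candidate_path) as f:
        code = f.read()
    # Domain-specific scoring goes here; as a placeholder, reward
    # shorter programs (a toy stand-in for a real fitness function).
    return 1.0 / (1.0 + len(code))

if __name__ == "__main__" and len(sys.argv) > 1:
    print(evaluate(sys.argv[1]))
```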
5 LLM Integration
Supported LLM Providers
| Provider | Models | Configuration Key | Notes |
|---|---|---|---|
| Google Gemini | Gemini 2.0 Flash, 2.5 Pro, etc. | gemini | Closest to original AlphaEvolve |
| OpenAI | GPT-4o, GPT-4-turbo, GPT-4o-mini, o1, o3 | openai | Strong coding capability |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude Opus 4 | anthropic | Strong reasoning |
| OpenAI-Compatible | Any model (Ollama, vLLM, LM Studio) | openai_compatible | Enables free local models |
Multi-Model Ensemble Configuration
OpenEvolve supports an ensemble of models, mirroring AlphaEvolve's Flash+Pro strategy. The user configures multiple models with weights that determine how often each is sampled.
YAML (config.yaml -- LLM Configuration)
```yaml
# Multi-model ensemble configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8        # 80% of mutations use this model (breadth)
      role: "explorer"
    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2        # 20% of mutations use this model (depth)
      role: "refiner"

  # Alternative: Use OpenAI models
  # models:
  #   - provider: "openai"
  #     model_name: "gpt-4o-mini"
  #     weight: 0.85
  #   - provider: "openai"
  #     model_name: "gpt-4o"
  #     weight: 0.15

  # Cost control
  max_tokens_per_run: 10000000   # Total token budget
  max_cost_per_run: 50.0         # USD budget limit
```
LLM as Mutation Operator
OpenEvolve uses LLMs identically to AlphaEvolve: as semantically-aware mutation operators. The LLM receives a parent program with its score and is asked to produce improvements, either as a diff or a full rewrite.
Prompt Construction
The PromptSampler class constructs prompts dynamically based on:
- Parent program(s) selected from the program database
- Problem specification (provided by the user)
- Mutation mode (diff vs. full program)
- Evolutionary context (generation number, best score seen, improvement history)
Python (openevolve/prompt_sampler.py -- Conceptual)
````python
import random

class PromptSampler:
    def __init__(self, config, program_db):
        self.config = config
        self.program_db = program_db
        self.diff_ratio = config.get("diff_ratio", 0.5)

    def sample_prompt(self):
        """Assemble a prompt for the LLM mutation operator."""
        # 1. Select parent(s) from the program database
        parents = self.program_db.select_parents(
            n=self.config.get("n_parents", 1),
            strategy=self.config.get("selection_strategy", "map_elites")
        )
        # 2. Choose mutation mode
        use_diff = random.random() < self.diff_ratio
        # 3. Build prompt
        prompt = self._build_prompt(parents, use_diff)
        # 4. Select model from ensemble
        model = self._select_model()
        return prompt, model, use_diff

    def _build_prompt(self, parents, use_diff):
        sections = []
        sections.append(self.config.get("system_prompt",
            "You are an expert algorithm designer..."))
        sections.append(f"## Problem:\n{self.config['problem_description']}")
        for i, parent in enumerate(parents):
            sections.append(
                f"## Parent Program {i+1} (score: {parent.fitness:.4f}):\n"
                f"```python\n{parent.code}\n```"
            )
        if use_diff:
            sections.append("Provide your improvement as a unified diff.")
        else:
            sections.append("Provide the complete improved program.")
        return "\n\n".join(sections)
````
6 Key Results and Examples
OpenEvolve, being a reimplementation tool rather than a research paper, provides example benchmarks rather than novel discoveries. The included examples demonstrate that the architecture works and can evolve solutions:
Included Example Problems
| Example | Description | Evaluation Metric |
|---|---|---|
| Sorting Algorithm | Evolve a sorting function that minimizes comparisons/time | Wall-clock time on benchmark arrays |
| Symbolic Regression | Discover mathematical formulas from data points | Mean squared error on held-out data |
| Function Optimization | Find optimal solutions to mathematical optimization problems | Objective function value |
| Algorithm Design | Design algorithms for combinatorial problems | Solution quality score |
Community Results
As an open-source project, OpenEvolve enables community experimentation. Users have reported using it for:
- Evolving competitive programming solutions
- Discovering more efficient data structure implementations
- Optimizing machine learning hyperparameters through code evolution
- Exploring mathematical conjectures
Note: OpenEvolve has not replicated AlphaEvolve's headline results (Borg scheduling, TPU Verilog, kissing number improvements). Those results require Google-scale infrastructure, internal APIs, and domain-specific evaluation harnesses that are not publicly available. OpenEvolve demonstrates that the architecture is sound and reproducible, not that all results transfer.
7 Reproducibility
| Aspect | Status | Details |
|---|---|---|
| Source Code | Fully Open | Complete source code on GitHub under Apache 2.0 |
| Documentation | Comprehensive | README, code comments, configuration examples |
| Installation | Straightforward | pip install or clone + install |
| Examples | Included | Multiple ready-to-run examples with evaluation functions |
| LLM Access | User-provided | Requires API keys for chosen provider (or local models) |
| Configuration | YAML-based | All hyperparameters configurable via YAML files |
| Determinism | Stochastic | LLM outputs are non-deterministic; random seeds control selection only |
Installation
Bash (Installation)
```bash
# Clone the repository
git clone https://github.com/algorithmicsuperintelligence/openevolve.git
cd openevolve

# Install dependencies
pip install -e .

# Set API keys (choose one)
export GEMINI_API_KEY="your-key-here"
# OR
export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"

# Run an example
python -m openevolve.run --config examples/sorting/config.yaml
```
8 Compute and API Costs
Cost Structure
OpenEvolve's costs are primarily driven by LLM API calls. The system includes built-in cost tracking and budget limits.
| Configuration | Estimated Cost per Run | Typical Duration |
|---|---|---|
| Local models (Ollama) | $0 (electricity only) | Hours to days (depending on model size) |
| GPT-4o-mini (explorer) + GPT-4o (refiner) | $5-50 | Hours |
| Gemini Flash (explorer) + Pro (refiner) | $2-30 | Hours |
| Claude Sonnet (explorer) + Opus (refiner) | $10-100 | Hours |
Built-in Cost Control
YAML (Cost Control Configuration)
```yaml
cost_control:
  # Hard budget limits
  max_total_cost_usd: 50.0
  max_tokens_input: 5000000
  max_tokens_output: 2000000

  # Evaluation limits
  max_eval_time_seconds: 60
  max_concurrent_evals: 4

  # Tracking
  log_token_usage: true
  log_cost_per_generation: true
```
Cost Optimization Strategies
- Cheap model first: Use a cheap model (Flash/mini) for 80%+ of mutations
- Local models: Use Ollama with open-weights models for zero API cost
- Cascaded evaluation: Quick syntax/smoke tests before expensive evaluations
- Caching: Avoid re-evaluating identical or equivalent programs
- Budget limits: Hard stops when cost thresholds are reached
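The caching strategy can be sketched as a content-addressed score cache. The `ScoreCache` class below is a hypothetical helper, not a class from the OpenEvolve codebase; it shows one reasonable way to avoid re-evaluating identical (or whitespace-identical) programs:

```python
import hashlib

class ScoreCache:
    """Content-addressed cache: identical programs are never re-evaluated.

    Hypothetical helper for illustration; OpenEvolve's actual caching
    lives inside its evaluator.
    """
    def __init__(self):
        self._scores = {}

    def key(self, code):
        # Normalize trivial trailing-whitespace differences before hashing
        # so superficially different but identical programs share a key.
        normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_evaluate(self, code, evaluate_fn):
        k = self.key(code)
        if k not in self._scores:
            self._scores[k] = evaluate_fn(code)  # Expensive call happens once
        return self._scores[k]
```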
9 Architecture Solution
Repository Structure
```
openevolve/
    __init__.py
    run.py                  # Main entry point / CLI runner
    controller.py           # Orchestrates the evolutionary loop
    prompt_sampler.py       # Constructs prompts, selects parents
    program_database.py     # MAP-Elites program database
    evaluator.py            # Sandboxed code evaluation
    llm_provider.py         # Multi-provider LLM abstraction
    diff_engine.py          # Diff parsing and application
    config.py               # Configuration management
    utils.py                # Utility functions
examples/
    sorting/                # Sorting algorithm evolution
        config.yaml
        evaluate.py
        seed_program.py
    symbolic_regression/    # Mathematical formula discovery
        config.yaml
        evaluate.py
        seed_program.py
    optimization/           # Function optimization
        config.yaml
        evaluate.py
```
High-Level Architecture Diagram
+=====================================================================================+
| OpenEvolve System Architecture |
+=====================================================================================+
| |
| +----------------------------------------------------------------------+ |
| | CONTROLLER (controller.py) | |
| | Orchestrates the evolutionary loop: | |
| | for generation in range(max_generations): | |
| | prompt, model = prompt_sampler.sample() | |
| | response = llm_provider.generate(prompt, model) | |
| | candidate = diff_engine.apply(response, parent) | |
| | score = evaluator.evaluate(candidate) | |
| | program_database.insert(candidate, score) | |
| +----------------------------------------------------------------------+ |
| | | | | | |
| v v v v v |
| +-----------+ +----------+ +---------------+ +---------+ +----------+ |
| | PROMPT | | LLM | | DIFF ENGINE | | EVAL | | PROGRAM | |
| | SAMPLER | | PROVIDER | | | | UATOR | | DATABASE | |
| | | | | | - Parse diffs | | | | | |
| | - Select | | - Gemini | | - Parse code | | - Sand- | | - MAP- | |
| | parents | | - OpenAI | | - Apply patch | | boxed | | Elites | |
| | - Build | | - Claude | | - Validate | | exec | | - Islands| |
| | prompt | | - Local | | | | - Score | | - Elite | |
| | - Choose | | | | | | - Cache | | archive| |
| | model | | | | | | | | | |
| +-----------+ +----------+ +---------------+ +---------+ +----------+ |
| |
| +----------------------------------------------------------------------+ |
| | CONFIGURATION (config.yaml) | |
| | - LLM models, weights, temperatures | |
| | - Evolution: population size, generations, selection strategy | |
| | - Evaluation: timeout, sandbox settings | |
| | - Cost control: budget limits, token tracking | |
| +----------------------------------------------------------------------+ |
| |
+=====================================================================================+
Evolutionary Loop Data Flow
+------------------------+
| config.yaml |
| seed_program.py |
| evaluate.py |
+-----------+------------+
|
v
+------------------------+
| Controller.run() |
+-----------+------------+
|
+----------------+----------------+
| |
v |
+----------------+ |
| PromptSampler | |
| .sample() | |
+-------+--------+ |
| |
+----------+-----------+ |
| | |
v v |
+------------+ +-------------+ |
| Program | | Model | |
| Database | | Selection | |
| .select() | | (weighted) | |
+------+-----+ +------+------+ |
| | |
v v |
+-----------------------------------------+ |
| LLM Provider.generate() | |
| (Gemini / OpenAI / Claude / | |
| Local via Ollama) | |
+-------------------+---------------------+ |
| |
v |
+------------------+ |
| DiffEngine | |
| .parse_and_apply | |
+--------+---------+ |
| |
v |
+------------------+ |
| Evaluator | |
| .evaluate() | |
| (sandboxed) | |
+--------+---------+ |
| |
v |
+------------------+ |
| ProgramDatabase |<----------------------+
| .insert() | (next generation)
+------------------+
10 Component Breakdown
10.1 Controller (controller.py)
The central orchestrator that runs the evolutionary loop. Responsibilities:
- Initialize all components from configuration
- Run the main evolution loop for N generations
- Coordinate between prompt sampling, LLM generation, diff application, evaluation, and database updates
- Track progress, log metrics, and enforce budget limits
- Handle errors gracefully (LLM failures, evaluation timeouts, etc.)
Python (controller.py -- Core Loop)
```python
class Controller:
    def __init__(self, config):
        self.config = config
        self.prompt_sampler = PromptSampler(config)
        self.llm_provider = LLMProvider(config)
        self.diff_engine = DiffEngine()
        self.evaluator = Evaluator(config)
        self.program_db = ProgramDatabase(config)
        self.cost_tracker = CostTracker(config)

    def run(self, seed_program, evaluation_fn):
        """Run the full evolutionary loop."""
        # Initialize with the seed program
        seed_score = self.evaluator.evaluate(seed_program, evaluation_fn)
        self.program_db.insert(seed_program, seed_score)

        for gen in range(self.config["max_generations"]):
            # Stop if the cost budget is exhausted
            if self.cost_tracker.budget_exceeded():
                break

            # Sample prompt and model
            prompt, model_name, is_diff = self.prompt_sampler.sample_prompt()

            # Generate mutation via LLM
            response = self.llm_provider.generate(prompt, model=model_name)
            self.cost_tracker.update(response.usage)

            # Apply mutation
            parent = self.prompt_sampler.last_selected_parent
            if is_diff:
                candidate = self.diff_engine.apply_diff(
                    parent.code, response.text)
            else:
                candidate = self.diff_engine.extract_code(response.text)
            if candidate is None:
                continue  # Failed to parse the LLM output

            # Evaluate
            score = self.evaluator.evaluate(candidate, evaluation_fn)

            # Update database
            self.program_db.insert(candidate, score)

            # Log progress
            self._log_generation(gen, score, self.program_db.best_score())

        return self.program_db.get_best_program()
```
10.2 Program Database (program_database.py)
Implements MAP-Elites with optional island model. Key features:
- MAP-Elites grid with configurable behavioral descriptors
- Multiple selection strategies (uniform, fitness-proportional, tournament)
- Elite archive for best-ever solutions
- Optional island sub-populations with migration
- Persistence (save/load to disk for resuming experiments)
Python (program_database.py -- Core Structure)
```python
import random

class ProgramDatabase:
    def __init__(self, config):
        self.config = config
        self.population_size = config.get("population_size", 100)
        self.selection_strategy = config.get("selection_strategy", "map_elites")
        # MAP-Elites grid
        self.n_descriptor_dims = config.get("n_descriptor_dims", 2)
        self.bins_per_dim = config.get("bins_per_dim", 10)
        self.grid = {}  # (bin_indices) -> Program
        # Island model
        self.n_islands = config.get("n_islands", 1)
        self.migration_interval = config.get("migration_interval", 50)
        self.islands = [{} for _ in range(self.n_islands)]
        # Elite archive
        self.elite_archive = []
        self.max_elite_size = config.get("max_elite_size", 20)
        # Generation counter
        self.generation = 0

    def insert(self, program, fitness, descriptors=None):
        """Insert a program into the database."""
        if descriptors is None:
            descriptors = self._auto_compute_descriptors(program)
        cell = self._discretize(descriptors)
        island_id = self.generation % self.n_islands
        # MAP-Elites insertion: keep only the best program per cell
        if cell not in self.islands[island_id] or \
                fitness > self.islands[island_id][cell].fitness:
            self.islands[island_id][cell] = Program(
                code=program, fitness=fitness,
                descriptors=descriptors, generation=self.generation
            )
        # Update elite archive
        self._update_elite(program, fitness)
        self.generation += 1
        # Periodic migration between islands
        if self.generation % self.migration_interval == 0:
            self._migrate()

    def select_parents(self, n=1, strategy=None):
        """Select parent programs for mutation."""
        strategy = strategy or self.selection_strategy
        all_programs = self._get_all_programs()
        if strategy == "map_elites":
            # Uniform random from occupied cells
            return random.sample(all_programs, min(n, len(all_programs)))
        elif strategy == "tournament":
            return [self._tournament_select(all_programs) for _ in range(n)]
        elif strategy == "fitness_proportional":
            return self._fitness_proportional_select(all_programs, n)

    def _migrate(self):
        """Migrate top programs between islands (ring topology)."""
        for i in range(self.n_islands):
            source = self.islands[i]
            target = self.islands[(i + 1) % self.n_islands]
            # Get the top-k programs from the source island
            top_k = sorted(source.values(), key=lambda p: p.fitness,
                           reverse=True)[:3]
            for prog in top_k:
                cell = self._discretize(prog.descriptors)
                if cell not in target or prog.fitness > target[cell].fitness:
                    target[cell] = prog
```
10.3 LLM Provider (llm_provider.py)
A unified interface abstracting over multiple LLM providers:
Python (llm_provider.py -- Multi-Provider Abstraction)
```python
import random

class LLMProvider:
    def __init__(self, config):
        self.models = []
        self.weights = []
        for model_config in config["llm"]["models"]:
            provider = model_config["provider"]
            if provider == "gemini":
                client = GeminiClient(model_config)
            elif provider == "openai":
                client = OpenAIClient(model_config)
            elif provider == "anthropic":
                client = AnthropicClient(model_config)
            elif provider == "openai_compatible":
                client = OpenAICompatibleClient(model_config)
            else:
                raise ValueError(f"Unknown provider: {provider}")
            self.models.append(client)
            self.weights.append(model_config.get("weight", 1.0))
        # Normalize weights so they sum to 1
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def generate(self, prompt, model=None):
        """Generate a response using the selected or randomly chosen model."""
        if model is None:
            # Weighted random selection
            model_idx = random.choices(
                range(len(self.models)), weights=self.weights, k=1
            )[0]
        else:
            model_idx = self._find_model_by_name(model)
        return self.models[model_idx].generate(prompt)
```
10.4 Evaluator (evaluator.py)
Runs candidate programs in a sandboxed environment with resource limits:
Python (evaluator.py -- Sandboxed Evaluation)
```python
import os
import subprocess
import tempfile

class Evaluator:
    def __init__(self, config):
        self.timeout = config.get("eval_timeout", 60)
        self.cache = {}  # hash(code) -> score (valid within one run)

    def evaluate(self, code, evaluation_fn_path):
        """Evaluate a candidate program in a sandbox."""
        # Check cache
        code_hash = hash(code)
        if code_hash in self.cache:
            return self.cache[code_hash]

        # Stage 1: Syntax check
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError:
            return -float('inf')

        # Stage 2: Sandboxed execution
        with tempfile.NamedTemporaryFile(
            mode='w', suffix='.py', delete=False
        ) as f:
            f.write(code)
            candidate_path = f.name
        try:
            result = subprocess.run(
                ["python", evaluation_fn_path, candidate_path],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                env={**os.environ, "SANDBOX": "1"}
            )
            if result.returncode != 0:
                score = -float('inf')
            else:
                score = float(result.stdout.strip())
        except (subprocess.TimeoutExpired, ValueError):
            score = -float('inf')
        finally:
            os.unlink(candidate_path)

        self.cache[code_hash] = score
        return score
```
10.5 Diff Engine (diff_engine.py)
Parses and applies code modifications from LLM responses:
- Unified diff parsing and application
- Full code block extraction from markdown-formatted responses
- Error recovery for malformed diffs
- Syntax validation of resulting code
10.6 Configuration (config.py + YAML files)
YAML (Full Configuration Example)
```yaml
# OpenEvolve Full Configuration

# Problem specification
problem:
  description: "Evolve a sorting algorithm that minimizes comparisons"
  seed_program: "examples/sorting/seed_program.py"
  evaluation_fn: "examples/sorting/evaluate.py"

# Evolution parameters
evolution:
  max_generations: 1000
  population_size: 100
  selection_strategy: "map_elites"   # or "tournament", "fitness_proportional"
  n_islands: 3
  migration_interval: 50
  tournament_size: 3

# MAP-Elites configuration
map_elites:
  n_descriptor_dims: 2
  bins_per_dim: 10
  descriptor_names:
    - "code_length"
    - "avg_comparisons"

# Mutation settings
mutation:
  diff_ratio: 0.5          # 50% diff, 50% full rewrite
  n_parents: 1             # Number of parents per mutation
  include_history: true    # Include mutation history in prompt
  max_history_length: 5    # Last N successful mutations

# LLM configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8
    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2

# Evaluation settings
evaluation:
  timeout_seconds: 60
  n_test_cases: 100
  cache_results: true
  sandbox_mode: "subprocess"   # or "docker"

# Cost control
cost_control:
  max_total_cost_usd: 50.0
  log_usage: true

# Logging
logging:
  level: "INFO"
  save_all_programs: false
  save_improvements_only: true
  output_dir: "./results"
```
11 Core Mechanisms (Detailed)
11.1 Code Modification and Mutation (LLM-Guided Mutations)
OpenEvolve implements the same LLM-as-mutation-operator paradigm as AlphaEvolve. The diff engine supports two primary modes:
Diff-Based Mutations
The LLM generates a unified diff that the diff engine parses and applies to the parent program. This is effective for incremental improvements.
Python (diff_engine.py -- Diff Application)
````python
import re

class DiffEngine:
    def apply_diff(self, original_code, llm_response):
        """Parse and apply a unified diff from LLM output."""
        # Extract the diff from markdown code blocks or raw text
        diff_text = self._extract_diff(llm_response)
        if diff_text is None:
            return None
        # Parse unified diff hunks
        hunks = self._parse_hunks(diff_text)
        # Apply hunks to the original code
        lines = original_code.split('\n')
        offset = 0
        for hunk in hunks:
            start = hunk.original_start + offset - 1
            # Remove old lines
            for _ in range(hunk.original_count):
                if start < len(lines):
                    lines.pop(start)
            # Insert new lines
            for i, line in enumerate(hunk.new_lines):
                lines.insert(start + i, line)
            offset += len(hunk.new_lines) - hunk.original_count
        result = '\n'.join(lines)
        # Validate syntax
        try:
            compile(result, "<modified>", "exec")
        except SyntaxError:
            return None  # Reject invalid code
        return result

    def extract_code(self, llm_response):
        """Extract a full program from an LLM response."""
        # Look for ```python ... ``` blocks
        pattern = r'```(?:python)?\s*\n(.*?)```'
        matches = re.findall(pattern, llm_response, re.DOTALL)
        if matches:
            code = matches[-1].strip()  # Take the last code block
            try:
                compile(code, "<extracted>", "exec")
                return code
            except SyntaxError:
                return None
        return None

    def _extract_diff(self, text):
        """Extract a diff block from an LLM response."""
        # Try a markdown code block first
        pattern = r'```(?:diff)?\s*\n(.*?)```'
        matches = re.findall(pattern, text, re.DOTALL)
        if matches:
            return matches[0]
        # Fall back to a raw diff (contains --- and +++ headers)
        if '---' in text and '+++' in text:
            return text
        return None
````
Mutation Quality Metrics
$$\text{Compilation Rate} = \frac{|\{\text{mutations producing valid syntax}\}|}{|\{\text{total mutations attempted}\}|}$$

$$\text{Improvement Rate} = \frac{|\{\text{mutations that improve fitness}\}|}{|\{\text{mutations that compile and run}\}|}$$

$$\text{Efficiency} = \frac{\text{Improvement Rate} \times \overline{\text{fitness improvement}}}{\overline{\text{API cost per mutation}}}$$
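These metrics can be computed directly from a log of mutation attempts. The record format below (`compiled`, `improved`, `fitness_delta`, `cost_usd` keys) is an assumption for illustration, not OpenEvolve's actual log schema:

```python
def mutation_quality(attempts):
    """Compute the three mutation-quality metrics defined above.

    Each attempt is a dict with keys 'compiled' (bool), 'improved' (bool),
    'fitness_delta' (float), and 'cost_usd' (float); this record shape is
    hypothetical, chosen only to make the formulas concrete.
    """
    total = len(attempts)
    compiled = [a for a in attempts if a["compiled"]]
    improved = [a for a in compiled if a["improved"]]
    compilation_rate = len(compiled) / total if total else 0.0
    improvement_rate = len(improved) / len(compiled) if compiled else 0.0
    avg_gain = (sum(a["fitness_delta"] for a in improved) / len(improved)
                if improved else 0.0)
    avg_cost = (sum(a["cost_usd"] for a in attempts) / total) if total else 0.0
    efficiency = improvement_rate * avg_gain / avg_cost if avg_cost else 0.0
    return compilation_rate, improvement_rate, efficiency
```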
11.2 Parent Selection
OpenEvolve implements three primary selection strategies, configurable via YAML:
| Strategy | Config Value | Description | Bias |
|---|---|---|---|
| MAP-Elites uniform | map_elites | Uniform random from occupied cells | Diversity-favoring (no fitness bias within niches) |
| Tournament | tournament | Best of k random candidates | Adjustable pressure via k |
| Fitness-proportional | fitness_proportional | Probability proportional to fitness | Exploitation-favoring |
Python (Selection Strategies)
```python
import random

def tournament_select(programs, k=3):
    """Tournament selection: pick the best of k random candidates."""
    tournament = random.sample(programs, min(k, len(programs)))
    return max(tournament, key=lambda p: p.fitness)

def fitness_proportional_select(programs, n=1):
    """Roulette-wheel selection weighted by fitness."""
    fitnesses = [p.fitness for p in programs]
    min_f = min(fitnesses)
    # Shift so all weights are positive (epsilon avoids zero weights)
    weights = [f - min_f + 1e-8 for f in fitnesses]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(programs, weights=probs, k=n)

def map_elites_select(grid):
    """Uniform random from occupied MAP-Elites cells."""
    occupied = list(grid.values())
    return random.choice(occupied)
```
11.3 Population Management
MAP-Elites Grid
The program database organizes solutions in a MAP-Elites grid defined by behavioral descriptors. The grid is a sparse dictionary (only occupied cells are stored).
$$\text{Grid}: D_1 \times D_2 \times \cdots \times D_k \to \text{Program}$$

where $D_i$ is discretized behavioral descriptor dimension $i$.

Insertion rule: $\text{grid}[\text{cell}] \leftarrow \arg\max_{\text{fitness}}$ over all programs mapping to that cell.
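The discretization step that maps descriptor values onto grid cells can be sketched as follows; the function name and the explicit per-descriptor bounds are assumptions, since OpenEvolve's internal `_discretize` is not shown in this report:

```python
def discretize(descriptors, bounds, bins_per_dim=10):
    """Map continuous descriptor values to a MAP-Elites grid cell.

    Hypothetical sketch: descriptors is a dict name -> value, bounds is a
    dict name -> (lo, hi). Returns a tuple of bin indices usable as a key
    into the sparse grid dictionary.
    """
    cell = []
    for name, value in sorted(descriptors.items()):
        lo, hi = bounds[name]
        # Clamp into [lo, hi], then scale into bin indices 0..bins_per_dim-1
        frac = (min(max(value, lo), hi) - lo) / (hi - lo)
        cell.append(min(int(frac * bins_per_dim), bins_per_dim - 1))
    return tuple(cell)
```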
Island Model
Configurable via n_islands and migration_interval. Each island maintains its own MAP-Elites grid, with periodic migration of top-performing programs between islands in a ring topology.
Python (Island Migration)
```python
def migrate_ring(islands, k=3):
    """Ring migration: the top-k from each island go to the next island."""
    n = len(islands)
    migrants = []
    # Collect migrants from each island
    for i in range(n):
        top_k = sorted(
            islands[i].values(),
            key=lambda p: p.fitness,
            reverse=True
        )[:k]
        migrants.append(top_k)
    # Insert migrants into the next island (standard MAP-Elites rule)
    for i in range(n):
        target = (i + 1) % n
        for prog in migrants[i]:
            cell = discretize(prog.descriptors)
            if cell not in islands[target] or \
                    prog.fitness > islands[target][cell].fitness:
                islands[target][cell] = prog
```
11.4 Solution Diversity and Novelty
OpenEvolve maintains diversity through:
- MAP-Elites grid structure: By definition, MAP-Elites maintains one solution per behavioral niche, ensuring the population covers diverse regions of behavior space
- Island model: Semi-isolated populations evolve differently, preventing the entire population from converging to a single solution
- Behavioral descriptors: Users define problem-specific descriptors that characterize how a program works, not just how well
- Temperature control: Higher LLM temperatures encourage more diverse mutations
- Diff vs. full-rewrite balance: Full rewrites can produce dramatically different programs, maintaining structural diversity
Configurable Behavioral Descriptors
Python (Custom Descriptor Definition)
# In evaluate.py, users define behavioral descriptors:
def evaluate_and_describe(candidate_code):
"""Return both fitness score and behavioral descriptors."""
# Run the candidate
results = run_on_test_cases(candidate_code)
# Fitness: how good is the solution
fitness = compute_score(results)
# Behavioral descriptors: what kind of solution is it
descriptors = {
"code_length": len(candidate_code.split('\n')),
"time_complexity_proxy": measure_scaling(results),
"uses_recursion": 1.0 if "def " in candidate_code
and is_recursive(candidate_code) else 0.0,
"memory_usage": measure_peak_memory(results)
}
return fitness, descriptors
11.5 LLM Orchestration and Model Selection
OpenEvolve's LLM orchestration closely mirrors AlphaEvolve's dual-model strategy but generalizes it to any number of models from any provider.
Weighted Model Selection
$$P(\text{model}_i) = \frac{w_i}{\sum_j w_j}$$
Adaptive Selection (Optional)
OpenEvolve can optionally track which models produce improvements and adjust weights dynamically using a multi-armed bandit approach.
Python (Adaptive Model Selection)
```python
import math

class AdaptiveModelSelector:
    def __init__(self, models, initial_weights, alpha=0.05):
        self.models = models
        self.weights = initial_weights.copy()
        self.success_counts = [0] * len(models)
        self.total_counts = [0] * len(models)
        self.alpha = alpha

    def select(self):
        # UCB1-inspired selection: balance success rate and exploration
        total = sum(self.total_counts) + 1
        scores = []
        for i in range(len(self.models)):
            if self.total_counts[i] == 0:
                scores.append(float('inf'))  # Always explore untried models
            else:
                exploit = self.success_counts[i] / self.total_counts[i]
                explore = ((2 * math.log(total)) / self.total_counts[i]) ** 0.5
                scores.append(exploit + self.alpha * explore)
        return self.models[scores.index(max(scores))]

    def update(self, model_idx, was_improvement):
        self.total_counts[model_idx] += 1
        if was_improvement:
            self.success_counts[model_idx] += 1
```
11.6 Search Strategies
| Strategy | Implementation Status | Configuration |
|---|---|---|
| MAP-Elites | Implemented | selection_strategy: "map_elites" |
| Island Model | Implemented | n_islands: 3 |
| Tournament Selection | Implemented | selection_strategy: "tournament" |
| Fitness-Proportional | Implemented | selection_strategy: "fitness_proportional" |
| Adaptive Model Selection | Optional | adaptive_model_selection: true |
| Novelty Search | Planned/Partial | Via custom behavioral descriptors |
11.7 Evaluation and Fitness
User-Defined Evaluation Functions
The evaluation function is provided by the user as a Python script. OpenEvolve calls this script with the candidate program path and expects a numeric score on stdout.
Python (Example: evaluate.py for Sorting)
```python
import importlib.util
import random
import sys
import time

def evaluate(candidate_path):
    """Evaluate a candidate sorting algorithm."""
    # Load the candidate as a module
    spec = importlib.util.spec_from_file_location("candidate", candidate_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Generate test cases of varying sizes
    test_cases = [
        random.sample(range(1000), k=n)
        for n in [10, 50, 100, 500, 1000]
        for _ in range(10)
    ]

    total_time = 0.0
    correct = 0
    for tc in test_cases:
        arr = tc.copy()
        expected = sorted(tc)
        start = time.perf_counter()
        result = module.sort(arr)
        elapsed = time.perf_counter() - start
        if result == expected:
            correct += 1
        total_time += elapsed

    # Score: correctness is critical, then speed
    correctness_ratio = correct / len(test_cases)
    if correctness_ratio < 1.0:
        score = correctness_ratio * 0.01   # Heavy penalty for wrong output
    else:
        score = 1.0 / (1.0 + total_time)   # Faster = higher score
    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])
```
#### Cascaded Evaluation

OpenEvolve implements the same cascaded evaluation pattern as AlphaEvolve:

1. **Syntax check:** Python `compile()` -- near instant
2. **Import check:** attempt to import the module -- catches import-time errors
3. **Smoke test:** quick run on small inputs with a short timeout
4. **Full evaluation:** complete benchmark on all test cases
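The four stages can be chained so each stage runs only if the previous one passed, which keeps expensive benchmarking off obviously broken candidates. The following is a minimal in-process sketch of that pattern -- OpenEvolve isolates stages in sandboxed subprocesses with timeouts, and `smoke_test`/`full_benchmark` here are caller-supplied stand-ins, not OpenEvolve API:

```python
def cascaded_evaluate(source, smoke_test, full_benchmark):
    """Run cheap checks first; bail out with score 0.0 at the first failure."""
    # Stage 1: syntax check via compile() -- near instant
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    # Stage 2: import/exec check -- catches import-time errors
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0
    # Stage 3: smoke test on tiny inputs using the candidate's functions
    try:
        if not smoke_test(namespace):
            return 0.0
    except Exception:
        return 0.0
    # Stage 4: full benchmark, only for candidates that survive stages 1-3
    return full_benchmark(namespace)

# A correct candidate passes all four stages
candidate = "def sort(a):\n    return sorted(a)"
score = cascaded_evaluate(
    candidate,
    smoke_test=lambda ns: ns["sort"]([2, 1]) == [1, 2],
    full_benchmark=lambda ns: 1.0,
)
```

Because most LLM-generated mutations fail at stages 1-3, this ordering spends the full benchmark budget only on plausible candidates.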
### 11.8 Prompt Engineering

OpenEvolve uses configurable prompt templates. The prompt sampler assembles prompts dynamically.

#### Default Prompt Template (Diff Mode)

Prompt Template
````text
You are an expert programmer tasked with improving algorithms.

## Problem Description:
{problem_description}

## Current Program (fitness score: {parent_score}):
```python
{parent_code}
```

## Your Task:
Analyze this program and propose an improvement that will increase the
fitness score. Think about:
1. Algorithmic complexity improvements
2. Better data structures
3. Mathematical optimizations
4. Edge case handling

Provide your changes in unified diff format:
--- a/program.py
+++ b/program.py
@@ line numbers @@
 context
-removed
+added
````
#### Prompt Customization
Users can fully customize prompts via the YAML configuration, including:
- System prompt (defining the LLM's role)
- Problem description
- Output format instructions (diff vs. full program)
- Whether to include mutation history
- Whether to include multiple parents (crossover-like)
### 11.9 Cost Control
Python (Cost Tracking)
```python
class CostTracker:
    def __init__(self, config):
        self.max_cost = config.get("max_total_cost_usd", float('inf'))
        self.max_input_tokens = config.get("max_tokens_input", float('inf'))
        self.max_output_tokens = config.get("max_tokens_output", float('inf'))
        self.total_cost = 0.0
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.calls_per_model = {}
        # Pricing per 1M tokens (approximate)
        self.pricing = {
            "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
            "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        }

    def update(self, model_name, input_tokens, output_tokens):
        self.calls_per_model[model_name] = self.calls_per_model.get(model_name, 0) + 1
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        if model_name in self.pricing:
            p = self.pricing[model_name]
            cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6
            self.total_cost += cost

    def budget_exceeded(self):
        return (
            self.total_cost >= self.max_cost or
            self.total_input_tokens >= self.max_input_tokens or
            self.total_output_tokens >= self.max_output_tokens
        )
```
### 11.10 Meta-Level Improvement
Unlike AlphaEvolve (which has a self-improvement loop via Gemini training), OpenEvolve does not have a built-in meta-level improvement mechanism. However, it supports several forms of cross-run learning:
- **Seed program transfer:** The best program from one run can be used as the seed for a subsequent run
- **Database persistence:** The program database can be saved and loaded, allowing experiments to be resumed
- **Configuration tuning:** Results from one run inform hyperparameter choices for the next
- **Prompt refinement:** Observing what kinds of prompts produce improvements can guide prompt engineering
> **No model fine-tuning:** Like AlphaEvolve, OpenEvolve does not fine-tune the LLMs during evolution. The LLMs are used as frozen inference-time mutation operators. All "learning" happens at the population level.
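A minimal sketch of what database persistence and seed transfer can look like, assuming the database is a dict mapping cells to `{"code", "score"}` entries (an illustrative structure, not OpenEvolve's exact schema):

```python
import json

def save_database(db, path):
    """Serialize the grid (cell -> {"code": ..., "score": ...}) to JSON.
    Cells are stringified because JSON keys must be strings."""
    with open(path, "w") as fh:
        json.dump({str(cell): entry for cell, entry in db.items()}, fh)

def load_database(path):
    """Load a previously saved database to resume an experiment."""
    with open(path) as fh:
        return json.load(fh)

def best_seed(db):
    """Pick the highest-scoring program as the seed for the next run."""
    return max(db.values(), key=lambda entry: entry["score"])["code"]
```

Resuming then amounts to `load_database()` followed by re-entering the evolutionary loop, or starting a fresh run whose seed program is `best_seed()` of the old database.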
### 11.11 Mathematical Framework
#### Evolutionary Loop Formalization
$$
\begin{aligned}
&\textbf{Given: } \text{problem specification } S,\ \text{seed program } p_0,\ \text{evaluation function } f: P \to \mathbb{R} \\
&\textbf{For } t = 1, 2, \dots, T: \\
&\quad 1.\ \text{Select parent: } p_{\text{parent}} \sim \mathrm{Select}(DB_{t-1}) \\
&\quad 2.\ \text{Choose model: } m \sim \mathrm{Categorical}(w_1, \dots, w_K) \\
&\quad 3.\ \text{Generate mutation: } p_{\text{candidate}} = \mathrm{LLM}_m(p_{\text{parent}}, S) \\
&\quad 4.\ \text{Evaluate: } s = f(p_{\text{candidate}}) \\
&\quad 5.\ \text{Update: } DB_t = \mathrm{MapElitesInsert}\big(DB_{t-1},\, p_{\text{candidate}},\, s,\, \mathrm{desc}(p_{\text{candidate}})\big) \\
&\textbf{Return: } \arg\max_{p \in DB_T} f(p)
\end{aligned}
$$
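The loop above maps almost line-for-line onto code. A toy sketch, with integers standing in for programs, a dict standing in for a one-dimensional MAP-Elites grid, and a random step standing in for the LLM mutation (all names here are illustrative, not OpenEvolve's API):

```python
import random

def evolve(seed, f, descriptor, llm_mutate, models, weights, T=100):
    """Toy MAP-Elites loop: db maps a behavioral cell to the best
    program found in that cell so far."""
    db = {descriptor(seed): seed}
    for _ in range(T):
        parent = random.choice(list(db.values()))           # 1. Select(DB_{t-1})
        model = random.choices(models, weights=weights)[0]  # 2. Categorical(w)
        child = llm_mutate(model, parent)                   # 3. LLM_m(parent, S)
        score = f(child)                                    # 4. Evaluate
        cell = descriptor(child)
        if cell not in db or score > f(db[cell]):           # 5. MapElitesInsert
            db[cell] = child
    return max(db.values(), key=f)                          # argmax over DB_T

# Toy run: "programs" are integers, fitness peaks at 10,
# and the "LLM" is a random increment
best = evolve(seed=0, f=lambda p: -abs(p - 10), descriptor=lambda p: p % 4,
              llm_mutate=lambda m, p: p + random.choice([-1, 1, 2]),
              models=["flash", "pro"], weights=[0.8, 0.2], T=500)
```

The real system differs mainly in scale and in the mutation operator: `llm_mutate` is a prompted LLM call, and the descriptor is a vector of program features rather than a single integer.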
#### MAP-Elites Coverage
$$
\mathrm{Coverage}(DB) = \frac{\big|\{\text{non-empty cells}\}\big|}{\prod_i \text{bins}_i},
\qquad
\mathrm{Quality}(DB) = \frac{1}{\big|\{\text{non-empty cells}\}\big|} \sum_{\text{non-empty cells } c} f\big(DB[c]\big)
$$
#### Expected Mutation Improvement
$$
\mathbb{E}[\text{improvement}] = \sum_{m=1}^{K} w_m \cdot P\big(f(p_{\text{child}}) > f(p_{\text{parent}}) \mid m\big) \cdot \mathbb{E}\big[f(p_{\text{child}}) - f(p_{\text{parent}}) \mid \text{improvement},\, m\big]
$$
#### Cost-Per-Improvement
$$
\mathrm{CPI} = \frac{\text{total API cost}}{\text{number of fitness improvements found}},
\qquad
\mathrm{CPI}_t = \frac{\mathrm{cost}(t)}{\max\big(n_{\text{improvements}}(t),\, \epsilon\big)}
$$

where the $\epsilon$ in the marginal form guards against division by zero in generations with no improvements.
## 12 Programming Language
| Aspect | Language | Notes |
| --- | --- | --- |
| OpenEvolve framework | **Python 3.8+** | Pure Python implementation |
| Evolved programs | **Python** (primarily) | Framework is Python-centric but extensible to other languages |
| Configuration | **YAML** | Human-readable configuration files |
| Dependencies | Standard Python ecosystem | google-generativeai, openai, anthropic, pyyaml, etc. |
### Extensibility to Other Languages
While the default evaluation pipeline assumes Python programs, the evaluation function is user-defined and can invoke any external process. Users could evolve programs in any language by:
1. Defining the seed program in the target language
2. Writing an evaluation function that compiles and runs the target language
3. Adjusting the prompt templates to request output in the target language
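For step 2, the evaluation script can hand the candidate to any external toolchain (for example, compile a C candidate with a C compiler, then run the resulting binary) and parse the score the process prints. A minimal sketch of the run-and-parse half, using a Python subprocess as a stand-in for a compiled binary (`evaluate_external` is an illustrative helper, not OpenEvolve API):

```python
import subprocess
import sys

def evaluate_external(command, timeout=30):
    """Run a candidate as an external process; read its score from stdout.
    Any crash, timeout, or unparseable output scores 0.0."""
    try:
        run = subprocess.run(command, capture_output=True, text=True,
                             timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    if run.returncode != 0:
        return 0.0
    try:
        # The candidate's final stdout line is taken as the numeric score
        return float(run.stdout.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return 0.0

# Stand-in for a compiled candidate that prints its benchmark score
score = evaluate_external([sys.executable, "-c", "print(0.75)"])
```

Because everything language-specific lives in the user's evaluation script and prompt templates, the evolutionary loop itself never needs to know what language the candidate is written in.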
## 13 Memory Management
### Program Database
- **Sparse MAP-Elites grid:** Implemented as a Python dictionary; only occupied cells consume memory
- **In-memory storage:** The entire population is kept in RAM (typical populations of 100-1000 programs are easily manageable)
- **Disk persistence:** Database can be serialized to disk (pickle/JSON) for resuming experiments
- **Elite archive:** Fixed-size list of best programs (configurable max size, default 20)
### Evaluation Cache
- **Hash-based cache:** Programs are hashed and their scores cached to avoid re-evaluation
- **LRU eviction:** For long runs, the cache can be bounded with LRU eviction policy
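A sketch of what such a cache can look like, combining SHA-256 program hashing with `OrderedDict`-based LRU eviction (illustrative; not necessarily the repository's exact implementation):

```python
import hashlib
from collections import OrderedDict

class EvalCache:
    """Score cache keyed by a hash of the program source, with LRU eviction."""
    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._store = OrderedDict()

    @staticmethod
    def key(program_source):
        # Hashing the source means byte-identical programs share one entry
        return hashlib.sha256(program_source.encode()).hexdigest()

    def get(self, program_source):
        k = self.key(program_source)
        if k in self._store:
            self._store.move_to_end(k)  # Mark as recently used
            return self._store[k]
        return None  # Cache miss: caller must evaluate

    def put(self, program_source, score):
        k = self.key(program_source)
        self._store[k] = score
        self._store.move_to_end(k)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # Evict least recently used
```

Since LLM mutations frequently regenerate identical programs, even a small cache avoids a meaningful fraction of re-evaluations.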
### LLM Context Management
- **Token counting:** Prompts are checked against the model's context window limit before sending
- **Truncation:** Long programs or histories are truncated to fit within context limits
- **Selective inclusion:** Only the most relevant parent programs and recent history are included
### Memory Estimation
$$
\text{Memory} = O(P \cdot S + C \cdot S + E \cdot S)
$$

where $P$ = population size, $S$ = average program size (bytes), $C$ = cache size, and $E$ = elite archive size. Typical: $1000$ programs $\times$ $10\,\text{KB}$ each $\approx 10\,\text{MB}$ (negligible).
## 14 Continued Learning
### Within-Run
- **Population evolution:** The program database continuously improves over generations
- **Mutation history:** Successful mutations are recorded and included in future prompts, allowing the LLM to learn from what has worked
- **Adaptive model selection:** If enabled, the system learns which models are more effective and routes more traffic to them
### Cross-Run
- **Database persistence:** Save and load the program database to resume evolution
- **Seed transfer:** Best programs from one run can seed subsequent runs
- **Configuration refinement:** Observing run metrics guides future configuration choices
### Limitations vs. AlphaEvolve
- No self-improvement loop (OpenEvolve does not train its own LLMs)
- No automatic infrastructure optimization (no Borg, no TPU, no Google-scale feedback)
- Learning is bounded to the population and prompt level, not the model weight level
## 15 Applications and Examples
### 15.1 Sorting Algorithm Evolution
> The sorting example evolves a Python sorting function starting from a basic implementation (e.g., bubble sort) toward faster algorithms. The evaluation function measures wall-clock time on benchmark arrays of varying sizes, with correctness as a hard constraint.
Python (examples/sorting/seed_program.py)
```python
def sort(arr):
    """Initial seed: simple bubble sort."""
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr
```
Through evolution, OpenEvolve can discover quicksort, mergesort, timsort-like algorithms, or novel hybrid approaches that outperform the seed on the specific benchmark.
### 15.2 Symbolic Regression
Given data points (x, y), evolve a mathematical function f such that f(x) closely approximates y. The evaluation function computes mean squared error on held-out data, rewarding both accuracy and simplicity.
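The accuracy/simplicity trade-off can be collapsed into a single scalar fitness. A sketch, assuming each candidate exposes a callable and a node-count proxy for expression length (the names and the 0.01 complexity weight are illustrative choices, not values from the repository):

```python
def score_candidate(f, xs, ys, expr_len, complexity_weight=0.01):
    """Fitness = accuracy term (inverse MSE on held-out points)
    minus a penalty proportional to expression length."""
    mse = sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return 1.0 / (1.0 + mse) - complexity_weight * expr_len

# Held-out data generated from y = x^2
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]
exact = score_candidate(lambda x: x * x, xs, ys, expr_len=3)  # zero MSE
linear = score_candidate(lambda x: x, xs, ys, expr_len=1)     # large MSE
```

The accuracy term saturates at 1.0 for a perfect fit, so the length penalty acts as a tie-breaker that steers evolution toward the simplest expression among equally accurate ones.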
### 15.3 Mathematical Optimization
Evolve algorithms for combinatorial optimization problems. The user provides an evaluation function that scores the quality of the solution found by the candidate algorithm.
### 15.4 Custom Applications
OpenEvolve's modular design enables any application where:
1. Solutions can be expressed as Python code (or code callable from Python)
2. An automatic evaluation function can score solution quality
3. The search space is rich enough to benefit from evolutionary exploration
### How to Create a Custom Application
Bash
```bash
# 1. Create problem directory
mkdir my_problem/

# 2. Write seed program
cat > my_problem/seed_program.py << 'EOF'
def solve(input_data):
    # Initial naive solution
    return baseline_solution(input_data)
EOF

# 3. Write evaluation function
cat > my_problem/evaluate.py << 'EOF'
import sys
import importlib.util

def evaluate(candidate_path):
    spec = importlib.util.spec_from_file_location("c", candidate_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    score = benchmark(mod.solve)
    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])
EOF

# 4. Write configuration
cat > my_problem/config.yaml << 'EOF'
problem:
  description: "Optimize the solve() function for maximum score"
  seed_program: "my_problem/seed_program.py"
  evaluation_fn: "my_problem/evaluate.py"
evolution:
  max_generations: 500
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      weight: 1.0
EOF

# 5. Run OpenEvolve
python -m openevolve.run --config my_problem/config.yaml
```
## 16 Detailed Comparison: OpenEvolve vs. AlphaEvolve
| Dimension | AlphaEvolve (Google DeepMind) | OpenEvolve (Open Source) |
|---|---|---|
| Source Code | Proprietary; not released | Fully open-source (Apache 2.0) |
| LLM Models | Gemini Flash + Gemini Pro (internal APIs) | Any: Gemini, OpenAI, Anthropic, local (Ollama/vLLM) |
| LLM Ensemble | Dual-model (Flash for breadth, Pro for depth) | Configurable N-model ensemble with weighted selection |
| Mutation Modes | Diff-based, full-program, multi-file | Diff-based, full-program (multi-file support limited) |
| Program Database | MAP-Elites with island model and elite archive | MAP-Elites with island model and elite archive (faithful reproduction) |
| Parent Selection | MAP-Elites, tournament, fitness-proportional | MAP-Elites, tournament, fitness-proportional |
| Evaluation | Google internal sandboxing; formal verification for hardware | Subprocess/Docker sandboxing; user-defined evaluation functions |
| Scale | Google-scale (millions of LLM calls, massive parallel evaluation) | Single machine / small cluster |
| Evolved Language | Python, Verilog, Pallas/CUDA, any | Python primarily (extensible to others) |
| Self-Improvement | Yes (improves Gemini training, which improves AlphaEvolve) | No (uses frozen LLMs) |
| Configuration | Internal configuration system | YAML-based, user-friendly |
| Cost Control | Internal resource management | Built-in token tracking, USD budget limits |
| Results: Matrix Mult | Improved Strassen for 4x4 complex (48 scalar mults) | Not reproduced (requires specialized evaluation) |
| Results: Data Centers | 0.7% of Google's worldwide compute recovered | Not applicable (no access to Borg) |
| Results: Hardware | TPU Verilog optimization deployed | Not applicable (no access to TPU design tools) |
| Results: Kernels | 23% matrix mult speedup, 32.5% FlashAttention | Not reproduced (requires GPU kernel benchmarking) |
| Results: Math | Kissing number (11D), 50+ open problems | Example framework provided; community experiments |
| Community | Closed team at Google DeepMind | Open community; 8000+ GitHub stars |
| Documentation | White paper + blog post | README, code comments, examples, YAML configs |
### Architectural Fidelity Assessment

**High-fidelity components:** OpenEvolve faithfully reproduces the core architecture described in the AlphaEvolve white paper:

- Prompt sampler with parent selection and context assembly
- LLM ensemble with weighted model selection
- MAP-Elites program database with behavioral descriptors
- Island model with periodic migration
- Diff-based and full-program mutation modes
- Cascaded evaluation with sandboxed execution

**Missing or reduced components:**

- **Multi-file evolution:** AlphaEvolve can modify entire codebases across multiple files simultaneously; OpenEvolve primarily targets single-file programs
- **Hardware targets:** No built-in Verilog or Pallas/CUDA kernel evolution
- **Formal verification:** No integration with hardware verification tools
- **Self-improvement loop:** No mechanism to improve the underlying LLMs
- **Scale:** No massively parallel evaluation infrastructure
- **Production deployment:** A research tool, not production infrastructure
### When to Use Each System
| Scenario | Recommended | Reason |
|---|---|---|
| Academic research on evolutionary LLM coding | OpenEvolve | Open source, reproducible, configurable |
| Google-scale infrastructure optimization | AlphaEvolve | Only available internally, requires Google infrastructure |
| Exploring novel algorithms for your domain | OpenEvolve | Customizable evaluation functions, any LLM provider |
| Hardware (Verilog) optimization | AlphaEvolve | Built-in hardware targets and verification |
| Using local/free models | OpenEvolve | Supports Ollama, vLLM, any OpenAI-compatible API |
| Minimal API cost experimentation | OpenEvolve | Built-in cost control, supports free local models |
OpenEvolve Technical Research Report | Generated March 2026
Source: github.com/algorithmicsuperintelligence/openevolve
This report is intended for PhD-level academic research and technical analysis.