
OpenEvolve

Open-Source Reimplementation of AlphaEvolve for LLM-Guided Evolutionary Coding

Organization: Algorithmic Superintelligence Inc. (community)
Repository: github.com/algorithmicsuperintelligence/openevolve
License: Apache 2.0
Language: Python
Report Date: March 2026

Table of Contents

  1 Full Title and Attribution
  2 Authors and Contributors
  3 Core Contribution
  4 Supported Solutions
  5 LLM Integration
  6 Key Results and Examples
  7 Reproducibility
  8 Compute and API Costs
  9 Architecture Solution
  10 Component Breakdown
  11 Core Mechanisms (Detailed)

1 Full Title and Attribution

Full Title: OpenEvolve: An Open-Source Implementation of AlphaEvolve

Repository: github.com/algorithmicsuperintelligence/openevolve

License: Apache License 2.0

Status: Active open-source project

Inspiration: Directly inspired by Google DeepMind's AlphaEvolve (May 2025)

Stars: 8,000+ (as of early 2026)

2 Authors and Contributors

OpenEvolve is developed by Algorithmic Superintelligence Inc., an open-source community organization. The project is community-driven, with contributions from multiple developers.

The project was created shortly after Google DeepMind published the AlphaEvolve white paper in May 2025. It aims to democratize access to evolutionary LLM-guided program synthesis by providing a fully open-source implementation that works with any LLM provider, not just Google's Gemini.

3 Core Contribution

Key Contribution: OpenEvolve is the first comprehensive open-source reimplementation of AlphaEvolve's core architecture, making LLM-guided evolutionary program synthesis accessible to researchers and practitioners who do not have access to Google's internal infrastructure.

What OpenEvolve Provides

  1. Multi-provider LLM support: Works with Gemini, OpenAI (GPT-4, GPT-4o), Anthropic (Claude), and any OpenAI-compatible API (local models via Ollama, vLLM, etc.)
  2. Faithful architectural reproduction: Implements the key components described in the AlphaEvolve white paper -- prompt sampler, program database (MAP-Elites), LLM ensemble, evaluation pipeline
  3. Configurable evolutionary strategies: YAML-based configuration for all hyperparameters, selection strategies, and LLM settings
  4. Diff-based and full-program mutations: Both mutation modes as described in AlphaEvolve
  5. Sandboxed evaluation: Secure code execution with timeout and resource limits
  6. Ready-to-use examples: Pre-built examples for sorting algorithms, symbolic regression, mathematical optimization, and more
  7. Cost control: Built-in token tracking, budget limits, and model routing to manage API costs

Key Differences from AlphaEvolve

Aspect                | AlphaEvolve                 | OpenEvolve
----------------------|-----------------------------|------------------------------------------------
Source availability   | Closed (proprietary)        | Fully open-source (Apache 2.0)
LLM provider          | Gemini only (internal)      | Any provider (Gemini, OpenAI, Anthropic, local)
Scale                 | Google-scale infrastructure | Single machine / small cluster
Evaluation            | Google internal sandboxing  | Docker/subprocess sandboxing
Production deployment | Deployed in Google infra    | Research/experimentation tool
Hardware targets      | Verilog, TPU kernels, GPU   | Python-focused (extensible)

4 Supported Solutions

Solution Type          | Support Level    | Example
-----------------------|------------------|--------------------------------------------
Python Functions       | Primary target   | Sorting algorithms, mathematical functions
Complete Programs      | Supported        | Multi-file Python programs
Symbolic Regression    | Example provided | Discovering mathematical formulas from data
Algorithm Optimization | Example provided | Optimizing algorithm efficiency
Mathematical Discovery | Supported        | Exploring conjectures and constructions
Custom Domains         | Extensible       | Any domain with a programmatic evaluator

OpenEvolve is designed to be domain-agnostic: any problem expressible as "write code, evaluate code, score code" can be tackled by defining an appropriate evaluation function.
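The "write code, evaluate code, score code" contract amounts to supplying a scoring function. Below is a minimal sketch of such an evaluator for a hypothetical sorting domain; the function name `evaluate` and the candidate's `sort` entry point are illustrative assumptions, not the exact OpenEvolve API.

```python
import random


def evaluate(candidate_source: str) -> float:
    """Score a candidate program expected to define sort(); higher is better."""
    namespace = {}
    try:
        exec(candidate_source, namespace)       # load the candidate
        sort_fn = namespace["sort"]             # assumed entry-point name
    except Exception:
        return float("-inf")                    # unusable code scores worst

    rng = random.Random(0)                      # fixed data for fair comparison
    passed = 0
    for _ in range(20):
        data = [rng.randint(0, 100) for _ in range(50)]
        try:
            if sort_fn(list(data)) == sorted(data):
                passed += 1
        except Exception:
            pass
    return passed / 20                          # fraction of passing test cases


score = evaluate("def sort(xs):\n    return sorted(xs)")
```

Any domain fits this shape as long as the score is a single comparable number: the evolutionary loop never needs to understand the domain, only to maximize the returned value.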

5 LLM Integration

Supported LLM Providers

Provider          | Models                                          | Configuration Key | Notes
------------------|-------------------------------------------------|-------------------|--------------------------------
Google Gemini     | Gemini 2.0 Flash, 2.5 Pro, etc.                 | gemini            | Closest to original AlphaEvolve
OpenAI            | GPT-4o, GPT-4-turbo, GPT-4o-mini, o1, o3        | openai            | Strong coding capability
Anthropic         | Claude 3.5 Sonnet, Claude 3 Opus, Claude Opus 4 | anthropic         | Strong reasoning
OpenAI-Compatible | Any model (Ollama, vLLM, LM Studio)             | openai_compatible | Enables free local models

Multi-Model Ensemble Configuration

OpenEvolve supports an ensemble of models, mirroring AlphaEvolve's Flash+Pro strategy. The user configures multiple models with weights that determine how often each is sampled.

YAML (config.yaml -- LLM Configuration)

# Multi-model ensemble configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8           # 80% of mutations use this model (breadth)
      role: "explorer"

    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2           # 20% of mutations use this model (depth)
      role: "refiner"

  # Alternative: Use OpenAI models
  # models:
  #   - provider: "openai"
  #     model_name: "gpt-4o-mini"
  #     weight: 0.85
  #   - provider: "openai"
  #     model_name: "gpt-4o"
  #     weight: 0.15

  # Cost control
  max_tokens_per_run: 10000000     # Total token budget
  max_cost_per_run: 50.0          # USD budget limit

LLM as Mutation Operator

OpenEvolve uses LLMs in the same role as AlphaEvolve: as semantically aware mutation operators. The LLM receives a parent program along with its score and is asked to produce an improvement, either as a diff or as a full rewrite.

Prompt Construction

The PromptSampler class constructs prompts dynamically based on:

  • Parent program(s) selected from the program database
  • Problem specification (provided by the user)
  • Mutation mode (diff vs. full program)
  • Evolutionary context (generation number, best score seen, improvement history)

Python (openevolve/prompt_sampler.py -- Conceptual)

import random

class PromptSampler:
    def __init__(self, config, program_db):
        self.config = config
        self.program_db = program_db
        self.diff_ratio = config.get("diff_ratio", 0.5)

    def sample_prompt(self):
        """Assemble a prompt for the LLM mutation operator."""

        # 1. Select parent(s) from program database
        parents = self.program_db.select_parents(
            n=self.config.get("n_parents", 1),
            strategy=self.config.get("selection_strategy", "map_elites")
        )
        self.last_selected_parent = parents[0]  # kept for later diff application

        # 2. Choose mutation mode
        use_diff = random.random() < self.diff_ratio

        # 3. Build prompt
        prompt = self._build_prompt(parents, use_diff)

        # 4. Select model from ensemble
        model = self._select_model()

        return prompt, model, use_diff

    def _build_prompt(self, parents, use_diff):
        sections = []
        sections.append(self.config.get("system_prompt",
            "You are an expert algorithm designer..."))
        sections.append(f"## Problem:\n{self.config['problem_description']}")

        for i, parent in enumerate(parents):
            sections.append(
                f"## Parent Program {i+1} (score: {parent.fitness:.4f}):\n"
                f"```python\n{parent.code}\n```"
            )

        if use_diff:
            sections.append(
                "Provide your improvement as a unified diff.")
        else:
            sections.append(
                "Provide the complete improved program.")

        return "\n\n".join(sections)

6 Key Results and Examples

Because OpenEvolve is a reimplementation rather than a research publication, it ships example benchmarks rather than novel discoveries. The included examples demonstrate that the architecture works and can evolve solutions:

Included Example Problems

Example               | Description                                                   | Evaluation Metric
----------------------|---------------------------------------------------------------|------------------------------------
Sorting Algorithm     | Evolve a sorting function that minimizes comparisons/time     | Wall-clock time on benchmark arrays
Symbolic Regression   | Discover mathematical formulas from data points               | Mean squared error on held-out data
Function Optimization | Find optimal solutions to mathematical optimization problems  | Objective function value
Algorithm Design      | Design algorithms for combinatorial problems                  | Solution quality score

Community Results

As an open-source project, OpenEvolve enables community experimentation. Users have reported using it for:

  • Evolving competitive programming solutions
  • Discovering more efficient data structure implementations
  • Optimizing machine learning hyperparameters through code evolution
  • Exploring mathematical conjectures

Note: OpenEvolve has not replicated AlphaEvolve's headline results (Borg scheduling, TPU Verilog, kissing number improvements). Those results require Google-scale infrastructure, internal APIs, and domain-specific evaluation harnesses that are not publicly available. OpenEvolve demonstrates that the architecture is sound and reproducible, not that all results transfer.

7 Reproducibility

Aspect        | Status          | Details
--------------|-----------------|----------------------------------------------------------
Source Code   | Fully Open      | Complete source code on GitHub under Apache 2.0
Documentation | Comprehensive   | README, code comments, configuration examples
Installation  | Straightforward | pip install or clone + install
Examples      | Included        | Multiple ready-to-run examples with evaluation functions
LLM Access    | User-provided   | Requires API keys for chosen provider (or local models)
Configuration | YAML-based      | All hyperparameters configurable via YAML files
Determinism   | Stochastic      | LLM outputs are non-deterministic; random seeds for selection
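The determinism caveat can be made concrete: randomness on the selection side is seedable even though LLM outputs are not. A minimal illustration, where `pick_parents` is a hypothetical helper rather than an OpenEvolve API:

```python
import random


def pick_parents(pool, n, seed):
    """Seeded parent selection; a local RNG keeps runs reproducible."""
    rng = random.Random(seed)       # local RNG avoids global-state interference
    return rng.sample(pool, n)


a = pick_parents(list(range(100)), 5, seed=42)
b = pick_parents(list(range(100)), 5, seed=42)
# Identical seeds yield identical selections; the LLM side stays stochastic.
```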

Installation

Bash (Installation)

# Clone the repository
git clone https://github.com/algorithmicsuperintelligence/openevolve.git
cd openevolve

# Install dependencies
pip install -e .

# Set API keys
export GEMINI_API_KEY="your-key-here"
# OR
export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"

# Run an example
python -m openevolve.run --config examples/sorting/config.yaml

8 Compute and API Costs

Cost Structure

OpenEvolve's costs are primarily driven by LLM API calls. The system includes built-in cost tracking and budget limits.

Configuration                             | Estimated Cost per Run | Typical Duration
------------------------------------------|------------------------|-----------------------------------------
Local models (Ollama)                     | $0 (electricity only)  | Hours to days (depending on model size)
GPT-4o-mini (explorer) + GPT-4o (refiner) | $5-50                  | Hours
Gemini Flash (explorer) + Pro (refiner)   | $2-30                  | Hours
Claude Sonnet (explorer) + Opus (refiner) | $10-100                | Hours

Built-in Cost Control

YAML (Cost Control Configuration)

cost_control:
  # Hard budget limits
  max_total_cost_usd: 50.0
  max_tokens_input: 5000000
  max_tokens_output: 2000000

  # Evaluation limits
  max_eval_time_seconds: 60
  max_concurrent_evals: 4

  # Tracking
  log_token_usage: true
  log_cost_per_generation: true

Cost Optimization Strategies

  • Cheap model first: Use a cheap model (Flash/mini) for 80%+ of mutations
  • Local models: Use Ollama with open-weights models for zero API cost
  • Cascaded evaluation: Quick syntax/smoke tests before expensive evaluations
  • Caching: Avoid re-evaluating identical or equivalent programs
  • Budget limits: Hard stops when cost thresholds are reached
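The controller shown later consults a CostTracker for these hard stops. A conceptual sketch of such a tracker follows; the usage-dict field names and the per-1k-token prices are illustrative assumptions, not OpenEvolve's actual interface or real pricing.

```python
class CostTracker:
    """Track token usage and enforce a hard USD budget (illustrative prices)."""

    def __init__(self, max_cost_usd, price_in_per_1k, price_out_per_1k):
        self.max_cost_usd = max_cost_usd
        self.price_in = price_in_per_1k / 1000.0     # USD per input token
        self.price_out = price_out_per_1k / 1000.0   # USD per output token
        self.cost_usd = 0.0

    def update(self, usage):
        # `usage` is assumed to expose input/output token counts per call
        self.cost_usd += (usage["input_tokens"] * self.price_in +
                          usage["output_tokens"] * self.price_out)

    def budget_exceeded(self):
        return self.cost_usd >= self.max_cost_usd


tracker = CostTracker(max_cost_usd=50.0,
                      price_in_per_1k=0.00015, price_out_per_1k=0.0006)
tracker.update({"input_tokens": 2000, "output_tokens": 1000})
```

Checking `budget_exceeded()` once per generation (as the controller loop does) means a run stops within one mutation of crossing the limit, which keeps worst-case overshoot bounded by a single LLM call.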

9 Architecture Solution

Repository Structure

openevolve/
    __init__.py
    run.py                   # Main entry point / CLI runner
    controller.py            # Orchestrates the evolutionary loop
    prompt_sampler.py        # Constructs prompts, selects parents
    program_database.py      # MAP-Elites program database
    evaluator.py             # Sandboxed code evaluation
    llm_provider.py          # Multi-provider LLM abstraction
    diff_engine.py           # Diff parsing and application
    config.py                # Configuration management
    utils.py                 # Utility functions
examples/
    sorting/                 # Sorting algorithm evolution
        config.yaml
        evaluate.py
        seed_program.py
    symbolic_regression/     # Mathematical formula discovery
        config.yaml
        evaluate.py
        seed_program.py
    optimization/            # Function optimization
        config.yaml
        evaluate.py

High-Level Architecture Diagram

+=====================================================================================+
|                          OpenEvolve System Architecture                              |
+=====================================================================================+
|                                                                                     |
|   +----------------------------------------------------------------------+          |
|   |                      CONTROLLER (controller.py)                      |          |
|   |  Orchestrates the evolutionary loop:                                 |          |
|   |  for generation in range(max_generations):                           |          |
|   |      prompt, model = prompt_sampler.sample()                         |          |
|   |      response = llm_provider.generate(prompt, model)                 |          |
|   |      candidate = diff_engine.apply(response, parent)                 |          |
|   |      score = evaluator.evaluate(candidate)                           |          |
|   |      program_database.insert(candidate, score)                       |          |
|   +----------------------------------------------------------------------+          |
|         |              |                |              |              |              |
|         v              v                v              v              v              |
|   +-----------+  +----------+  +---------------+  +---------+  +----------+         |
|   | PROMPT    |  | LLM      |  | DIFF ENGINE   |  | EVAL    |  | PROGRAM  |         |
|   | SAMPLER   |  | PROVIDER |  |               |  | UATOR   |  | DATABASE |         |
|   |           |  |          |  | - Parse diffs |  |         |  |          |         |
|   | - Select  |  | - Gemini |  | - Parse code  |  | - Sand- |  | - MAP-   |         |
|   |   parents |  | - OpenAI |  | - Apply patch |  |   boxed |  |   Elites |         |
|   | - Build   |  | - Claude |  | - Validate    |  |   exec  |  | - Islands|         |
|   |   prompt  |  | - Local  |  |               |  | - Score |  | - Elite  |         |
|   | - Choose  |  |          |  |               |  | - Cache |  |   archive|         |
|   |   model   |  |          |  |               |  |         |  |          |         |
|   +-----------+  +----------+  +---------------+  +---------+  +----------+         |
|                                                                                     |
|   +----------------------------------------------------------------------+          |
|   |                    CONFIGURATION (config.yaml)                        |          |
|   |  - LLM models, weights, temperatures                                 |          |
|   |  - Evolution: population size, generations, selection strategy        |          |
|   |  - Evaluation: timeout, sandbox settings                             |          |
|   |  - Cost control: budget limits, token tracking                       |          |
|   +----------------------------------------------------------------------+          |
|                                                                                     |
+=====================================================================================+

Evolutionary Loop Data Flow

                        +------------------------+
                        |    config.yaml         |
                        |    seed_program.py     |
                        |    evaluate.py         |
                        +-----------+------------+
                                    |
                                    v
                        +------------------------+
                        |    Controller.run()    |
                        +-----------+------------+
                                    |
                   +----------------+----------------+
                   |                                 |
                   v                                 |
          +----------------+                         |
          | PromptSampler  |                         |
          | .sample()      |                         |
          +-------+--------+                         |
                  |                                  |
       +----------+-----------+                      |
       |                      |                      |
       v                      v                      |
+------------+        +-------------+                |
| Program    |        | Model       |                |
| Database   |        | Selection   |                |
| .select()  |        | (weighted)  |                |
+------+-----+        +------+------+                |
       |                      |                      |
       v                      v                      |
+-----------------------------------------+          |
|        LLM Provider.generate()          |          |
|        (Gemini / OpenAI / Claude /      |          |
|         Local via Ollama)               |          |
+-------------------+---------------------+          |
                    |                                |
                    v                                |
          +------------------+                       |
          | DiffEngine       |                       |
          | .parse_and_apply |                       |
          +--------+---------+                       |
                   |                                 |
                   v                                 |
          +------------------+                       |
          | Evaluator        |                       |
          | .evaluate()      |                       |
          | (sandboxed)      |                       |
          +--------+---------+                       |
                   |                                 |
                   v                                 |
          +------------------+                       |
          | ProgramDatabase  |<----------------------+
          | .insert()        |    (next generation)
          +------------------+

10 Component Breakdown

10.1 Controller (controller.py)

The central orchestrator that runs the evolutionary loop. Responsibilities:

  • Initialize all components from configuration
  • Run the main evolution loop for N generations
  • Coordinate between prompt sampling, LLM generation, diff application, evaluation, and database updates
  • Track progress, log metrics, and enforce budget limits
  • Handle errors gracefully (LLM failures, evaluation timeouts, etc.)

Python (controller.py -- Core Loop)

class Controller:
    def __init__(self, config):
        self.config = config
        self.program_db = ProgramDatabase(config)
        self.prompt_sampler = PromptSampler(config, self.program_db)
        self.llm_provider = LLMProvider(config)
        self.diff_engine = DiffEngine()
        self.evaluator = Evaluator(config)
        self.cost_tracker = CostTracker(config)

    def run(self, seed_program, evaluation_fn):
        """Run the full evolutionary loop."""
        # Initialize with seed program
        seed_score = self.evaluator.evaluate(seed_program, evaluation_fn)
        self.program_db.insert(seed_program, seed_score)

        for gen in range(self.config["max_generations"]):
            # Check budget
            if self.cost_tracker.budget_exceeded():
                break

            # Sample prompt and model
            prompt, model_name, is_diff = self.prompt_sampler.sample_prompt()

            # Generate mutation via LLM
            response = self.llm_provider.generate(
                prompt, model=model_name
            )
            self.cost_tracker.update(response.usage)

            # Apply mutation
            parent = self.prompt_sampler.last_selected_parent
            if is_diff:
                candidate = self.diff_engine.apply_diff(
                    parent.code, response.text)
            else:
                candidate = self.diff_engine.extract_code(response.text)

            if candidate is None:
                continue  # Failed to parse

            # Evaluate
            score = self.evaluator.evaluate(candidate, evaluation_fn)

            # Update database
            self.program_db.insert(candidate, score)

            # Log progress
            self._log_generation(gen, score, self.program_db.best_score())

        return self.program_db.get_best_program()

10.2 Program Database (program_database.py)

Implements MAP-Elites with optional island model. Key features:

  • MAP-Elites grid with configurable behavioral descriptors
  • Multiple selection strategies (uniform, fitness-proportional, tournament)
  • Elite archive for best-ever solutions
  • Optional island sub-populations with migration
  • Persistence (save/load to disk for resuming experiments)

Python (program_database.py -- Core Structure)

import random

class ProgramDatabase:
    def __init__(self, config):
        self.config = config
        self.population_size = config.get("population_size", 100)
        self.selection_strategy = config.get("selection_strategy", "map_elites")

        # MAP-Elites grid
        self.n_descriptor_dims = config.get("n_descriptor_dims", 2)
        self.bins_per_dim = config.get("bins_per_dim", 10)
        self.grid = {}  # (bin_indices) -> Program

        # Island model
        self.n_islands = config.get("n_islands", 1)
        self.migration_interval = config.get("migration_interval", 50)
        self.islands = [{} for _ in range(self.n_islands)]

        # Elite archive
        self.elite_archive = []
        self.max_elite_size = config.get("max_elite_size", 20)

        # Generation counter
        self.generation = 0

    def insert(self, program, fitness, descriptors=None):
        """Insert a program into the database."""
        if descriptors is None:
            descriptors = self._auto_compute_descriptors(program)

        cell = self._discretize(descriptors)
        island_id = self.generation % self.n_islands

        # MAP-Elites insertion
        if cell not in self.islands[island_id] or \
           fitness > self.islands[island_id][cell].fitness:
            self.islands[island_id][cell] = Program(
                code=program, fitness=fitness,
                descriptors=descriptors, generation=self.generation
            )

        # Update elite archive
        self._update_elite(program, fitness)
        self.generation += 1

        # Periodic migration
        if self.generation % self.migration_interval == 0:
            self._migrate()

    def select_parents(self, n=1, strategy=None):
        """Select parent programs for mutation."""
        strategy = strategy or self.selection_strategy
        all_programs = self._get_all_programs()

        if strategy == "map_elites":
            # Uniform random from occupied cells
            return random.sample(all_programs, min(n, len(all_programs)))
        elif strategy == "tournament":
            return [self._tournament_select(all_programs) for _ in range(n)]
        elif strategy == "fitness_proportional":
            return self._fitness_proportional_select(all_programs, n)

    def _migrate(self):
        """Migrate top programs between islands."""
        for i in range(self.n_islands):
            source = self.islands[i]
            target = self.islands[(i + 1) % self.n_islands]
            # Get top-k from source
            top_k = sorted(source.values(), key=lambda p: p.fitness,
                          reverse=True)[:3]
            for prog in top_k:
                cell = self._discretize(prog.descriptors)
                if cell not in target or prog.fitness > target[cell].fitness:
                    target[cell] = prog
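The `_discretize` helper referenced above is not shown. A minimal sketch, assuming each behavioral descriptor has been normalized into [0, 1] before insertion:

```python
def discretize(descriptors, bins_per_dim=10):
    """Map descriptors normalized to [0, 1] onto a tuple of grid bin indices."""
    cell = []
    for d in descriptors:
        d = min(max(d, 0.0), 1.0)                 # clamp out-of-range values
        # d == 1.0 would index past the grid, so cap at the last bin
        cell.append(min(int(d * bins_per_dim), bins_per_dim - 1))
    return tuple(cell)                            # hashable key for the grid dict


cell = discretize([0.25, 0.99])
```

Returning a tuple matters: it makes the cell a valid dictionary key, which is what lets the sparse `islands[i]` dicts stand in for a dense k-dimensional grid.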

10.3 LLM Provider (llm_provider.py)

A unified interface abstracting over multiple LLM providers:

Python (llm_provider.py -- Multi-Provider Abstraction)

import random

class LLMProvider:
    def __init__(self, config):
        self.models = []
        self.weights = []

        for model_config in config["llm"]["models"]:
            provider = model_config["provider"]
            if provider == "gemini":
                client = GeminiClient(model_config)
            elif provider == "openai":
                client = OpenAIClient(model_config)
            elif provider == "anthropic":
                client = AnthropicClient(model_config)
            elif provider == "openai_compatible":
                client = OpenAICompatibleClient(model_config)
            else:
                raise ValueError(f"Unknown provider: {provider}")

            self.models.append(client)
            self.weights.append(model_config.get("weight", 1.0))

        # Normalize weights
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def generate(self, prompt, model=None):
        """Generate a response using the selected or randomly chosen model."""
        if model is None:
            # Weighted random selection
            model_idx = random.choices(
                range(len(self.models)), weights=self.weights, k=1
            )[0]
        else:
            model_idx = self._find_model_by_name(model)

        return self.models[model_idx].generate(prompt)

10.4 Evaluator (evaluator.py)

Runs candidate programs in a sandboxed environment with resource limits:

Python (evaluator.py -- Sandboxed Evaluation)

import subprocess
import tempfile
import os

class Evaluator:
    def __init__(self, config):
        self.timeout = config.get("eval_timeout", 60)
        self.cache = {}  # hash(code) -> score

    def evaluate(self, code, evaluation_fn_path):
        """Evaluate a candidate program in a sandbox."""

        # Check cache
        code_hash = hash(code)
        if code_hash in self.cache:
            return self.cache[code_hash]

        # Stage 1: Syntax check
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError:
            return -float('inf')

        # Stage 2: Sandboxed execution
        with tempfile.NamedTemporaryFile(
            mode='w', suffix='.py', delete=False
        ) as f:
            f.write(code)
            candidate_path = f.name

        try:
            result = subprocess.run(
                ["python", evaluation_fn_path, candidate_path],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                env={**os.environ, "SANDBOX": "1"}
            )
            if result.returncode != 0:
                score = -float('inf')
            else:
                score = float(result.stdout.strip())
        except (subprocess.TimeoutExpired, ValueError):
            score = -float('inf')
        finally:
            os.unlink(candidate_path)

        self.cache[code_hash] = score
        return score
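The subprocess call above implies a user-side contract: the evaluation script is invoked with the candidate's path as an argument and prints a single float score to stdout. Below is a hypothetical `evaluate.py` for a sorting task written against that assumed contract; the `sort` entry-point name is illustrative.

```python
# evaluate.py -- hypothetical evaluation script for the subprocess protocol:
# invoked as `python evaluate.py <candidate_path>`, prints one float score.
import sys
import time


def score_candidate(path):
    """Load the candidate file, time its sort() on fixed data, return a score."""
    namespace = {}
    with open(path) as f:
        exec(f.read(), namespace)            # load the candidate program
    sort_fn = namespace["sort"]              # assumed entry-point name
    data = list(range(1000, 0, -1))          # worst case: reverse-sorted input
    start = time.perf_counter()
    result = sort_fn(list(data))             # pass a copy; candidates may mutate
    elapsed = time.perf_counter() - start
    if result != sorted(data):
        return float("-inf")                 # incorrect output scores worst
    return -elapsed                          # faster is better (higher score)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(score_candidate(sys.argv[1]))
```

Because the Evaluator parses stdout with `float(result.stdout.strip())`, the script must print nothing else; diagnostic output belongs on stderr.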

10.5 Diff Engine (diff_engine.py)

Parses and applies code modifications from LLM responses:

  • Unified diff parsing and application
  • Full code block extraction from markdown-formatted responses
  • Error recovery for malformed diffs
  • Syntax validation of resulting code

10.6 Configuration (config.py + YAML files)

YAML (Full Configuration Example)

# OpenEvolve Full Configuration

# Problem specification
problem:
  description: "Evolve a sorting algorithm that minimizes comparisons"
  seed_program: "examples/sorting/seed_program.py"
  evaluation_fn: "examples/sorting/evaluate.py"

# Evolution parameters
evolution:
  max_generations: 1000
  population_size: 100
  selection_strategy: "map_elites"    # or "tournament", "fitness_proportional"
  n_islands: 3
  migration_interval: 50
  tournament_size: 3

# MAP-Elites configuration
map_elites:
  n_descriptor_dims: 2
  bins_per_dim: 10
  descriptor_names:
    - "code_length"
    - "avg_comparisons"

# Mutation settings
mutation:
  diff_ratio: 0.5                  # 50% diff, 50% full rewrite
  n_parents: 1                     # Number of parents per mutation
  include_history: true            # Include mutation history in prompt
  max_history_length: 5            # Last N successful mutations

# LLM configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8
    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2

# Evaluation settings
evaluation:
  timeout_seconds: 60
  n_test_cases: 100
  cache_results: true
  sandbox_mode: "subprocess"      # or "docker"

# Cost control
cost_control:
  max_total_cost_usd: 50.0
  log_usage: true

# Logging
logging:
  level: "INFO"
  save_all_programs: false
  save_improvements_only: true
  output_dir: "./results"
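config.py itself is not shown. Assuming the YAML has already been parsed into a nested dict (e.g. with PyYAML's safe_load), a minimal dotted-path accessor can serve the `config.get(...)` calls seen throughout the codebase; `get_config` here is an illustrative sketch, not the project's API.

```python
def get_config(config: dict, dotted_key: str, default=None):
    """Look up 'evolution.max_generations'-style keys in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default                   # fall back for missing sections
        node = node[part]
    return node


# Parsed form of a fragment of the YAML above
config = {
    "evolution": {"max_generations": 1000, "n_islands": 3},
    "mutation": {"diff_ratio": 0.5},
}
generations = get_config(config, "evolution.max_generations")
timeout = get_config(config, "evaluation.timeout_seconds", default=60)
```

Defaults at the lookup site (as in the snippets above) keep every configuration key optional, so a minimal YAML file still produces a runnable system.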

11 Core Mechanisms (Detailed)

11.1 Code Modification and Mutation (LLM-Guided Mutations)

OpenEvolve implements the same LLM-as-mutation-operator paradigm as AlphaEvolve. The diff engine supports two primary modes:

Diff-Based Mutations

The LLM generates a unified diff that the diff engine parses and applies to the parent program. This is effective for incremental improvements.

Python (diff_engine.py -- Diff Application)

import re

class DiffEngine:
    def apply_diff(self, original_code, llm_response):
        """Parse and apply a unified diff from LLM output."""

        # Extract diff from markdown code blocks or raw text
        diff_text = self._extract_diff(llm_response)
        if diff_text is None:
            return None

        # Parse unified diff hunks
        hunks = self._parse_hunks(diff_text)

        # Apply hunks to original code
        lines = original_code.split('\n')
        offset = 0

        for hunk in hunks:
            start = hunk.original_start + offset - 1
            # Remove old lines
            for _ in range(hunk.original_count):
                if start < len(lines):
                    lines.pop(start)
            # Insert new lines
            for i, line in enumerate(hunk.new_lines):
                lines.insert(start + i, line)
            offset += len(hunk.new_lines) - hunk.original_count

        result = '\n'.join(lines)

        # Validate syntax
        try:
            compile(result, "<modified>", "exec")
        except SyntaxError:
            return None  # Reject invalid code

        return result

    def extract_code(self, llm_response):
        """Extract full program from LLM response."""
        # Look for ```python ... ``` blocks
        pattern = r'```(?:python)?\s*\n(.*?)```'
        matches = re.findall(pattern, llm_response, re.DOTALL)
        if matches:
            code = matches[-1].strip()  # Take last code block
            try:
                compile(code, "<extracted>", "exec")
                return code
            except SyntaxError:
                return None
        return None

    def _extract_diff(self, text):
        """Extract diff block from LLM response."""
        # Try markdown code block first
        pattern = r'```(?:diff)?\s*\n(.*?)```'
        matches = re.findall(pattern, text, re.DOTALL)
        if matches:
            return matches[0]
        # Try raw diff (starts with --- or @@)
        if '---' in text and '+++' in text:
            return text
        return None
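The markdown code-block extraction can be exercised on its own. The demonstration below uses the same regex as extract_code, with the triple-backtick fence built dynamically (`FENCE`) so the example nests cleanly inside this document; that construction is the only deviation from the pattern above.

```python
import re

FENCE = "`" * 3   # literal triple backtick, built so this example nests cleanly


def extract_last_python_block(llm_response):
    """Pull the last fenced python block out of an LLM response."""
    pattern = FENCE + r'(?:python)?\s*\n(.*?)' + FENCE
    matches = re.findall(pattern, llm_response, re.DOTALL)
    return matches[-1].strip() if matches else None


response = (
    "Here is an improved version:\n"
    + FENCE + "python\n"
      "def sort(xs):\n"
      "    return sorted(xs)\n"
    + FENCE + "\n"
      "This swaps the custom loop for the built-in sort."
)
code = extract_last_python_block(response)
```

Taking the last block (rather than the first) matters in practice: models often restate the parent program before presenting the improved version.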

Mutation Quality Metrics

$$ \text{Compilation Rate} = \frac{|\{\text{mutations that produce valid syntax}\}|}{|\{\text{total mutations attempted}\}|} $$

$$ \text{Improvement Rate} = \frac{|\{\text{mutations that improve fitness}\}|}{|\{\text{mutations that compile and run}\}|} $$

$$ \text{Efficiency} = \frac{\text{Improvement Rate} \times \text{avg(fitness improvement)}}{\text{avg(API cost per mutation)}} $$

11.2 Parent Selection

OpenEvolve implements three primary selection strategies, configurable via YAML:

Strategy             | Config Value         | Description                         | Bias
---------------------|----------------------|-------------------------------------|---------------------------------------------------
MAP-Elites uniform   | map_elites           | Uniform random from occupied cells  | Diversity-favoring (no fitness bias within niches)
Tournament           | tournament           | Best of k random candidates         | Adjustable pressure via k
Fitness-proportional | fitness_proportional | Probability proportional to fitness | Exploitation-favoring

Python (Selection Strategies)

import random

def tournament_select(programs, k=3):
    """Tournament selection: pick best of k random candidates."""
    tournament = random.sample(programs, min(k, len(programs)))
    return max(tournament, key=lambda p: p.fitness)

def fitness_proportional_select(programs, n=1):
    """Roulette wheel selection weighted by fitness."""
    fitnesses = [p.fitness for p in programs]
    min_f = min(fitnesses)
    weights = [f - min_f + 1e-8 for f in fitnesses]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(programs, weights=probs, k=n)

def map_elites_select(grid):
    """Uniform random from occupied MAP-Elites cells."""
    occupied = list(grid.values())
    return random.choice(occupied)

11.3 Population Management

MAP-Elites Grid

The program database organizes solutions in a MAP-Elites grid defined by behavioral descriptors. The grid is a sparse dictionary (only occupied cells are stored).

$$
\text{Grid}: D_1 \times D_2 \times \cdots \times D_k \to \text{Program}
$$

where $D_i$ is the discretized behavioral descriptor dimension $i$.

Insertion rule: each cell holds the fittest program mapping to it:

$$
\text{grid}[\text{cell}] = \operatorname*{argmax}_{p \,:\, \text{desc}(p) \,\mapsto\, \text{cell}} f(p)
$$
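The insertion rule translates directly into a minimal sketch; `discretize` is an assumed helper, shown here as simple fixed-width binning rather than OpenEvolve's actual implementation.

```python
def discretize(descriptors, bin_width=10.0):
    """Map raw descriptor values to integer bin indices (illustrative)."""
    return tuple(int(v // bin_width) for v in descriptors)

def map_elites_insert(grid, program, fitness, descriptors):
    """Keep only the fittest program seen so far per behavioral cell."""
    cell = discretize(descriptors)
    incumbent = grid.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        grid[cell] = (fitness, program)
    return grid
```

A weaker program mapping to an already-occupied cell is simply discarded, which is what keeps the archive elitist per niche while remaining diverse across niches.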

Island Model

Configurable via n_islands and migration_interval. Each island maintains its own MAP-Elites grid, with periodic migration of top-performing programs between islands in a ring topology.

Python (Island Migration)

def migrate_ring(islands, k=3):
    """Ring migration: top-k from each island go to the next."""
    n = len(islands)
    migrants = []

    # Collect migrants from each island
    for i in range(n):
        top_k = sorted(
            islands[i].values(),
            key=lambda p: p.fitness,
            reverse=True
        )[:k]
        migrants.append(top_k)

    # Insert migrants into next island
    for i in range(n):
        target = (i + 1) % n
        for prog in migrants[i]:
            cell = discretize(prog.descriptors)
            if cell not in islands[target] or \
               prog.fitness > islands[target][cell].fitness:
                islands[target][cell] = prog

11.4 Solution Diversity and Novelty

OpenEvolve maintains diversity through:

  1. MAP-Elites grid structure: By definition, MAP-Elites maintains one solution per behavioral niche, ensuring the population covers diverse regions of behavior space
  2. Island model: Semi-isolated populations evolve differently, preventing the entire population from converging to a single solution
  3. Behavioral descriptors: Users define problem-specific descriptors that characterize how a program works, not just how well
  4. Temperature control: Higher LLM temperatures encourage more diverse mutations
  5. Diff vs. full-rewrite balance: Full rewrites can produce dramatically different programs, maintaining structural diversity

Configurable Behavioral Descriptors

Python (Custom Descriptor Definition)

# In evaluate.py, users define behavioral descriptors:

def evaluate_and_describe(candidate_code):
    """Return both fitness score and behavioral descriptors."""

    # Run the candidate
    results = run_on_test_cases(candidate_code)

    # Fitness: how good is the solution
    fitness = compute_score(results)

    # Behavioral descriptors: what kind of solution is it
    descriptors = {
        "code_length": len(candidate_code.split('\n')),
        "time_complexity_proxy": measure_scaling(results),
        "uses_recursion": 1.0 if "def " in candidate_code
                            and is_recursive(candidate_code) else 0.0,
        "memory_usage": measure_peak_memory(results)
    }

    return fitness, descriptors

11.5 LLM Orchestration and Model Selection

OpenEvolve's LLM orchestration closely mirrors AlphaEvolve's dual-model strategy but generalizes it to any number of models from any provider.

Weighted Model Selection

$$ P(\text{model}_i) = \frac{w_i}{\sum_j w_j} $$
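With the standard library, this weighted draw is a single call; the model names below are placeholders.

```python
import random

def select_model(models, weights):
    """Sample one model with probability w_i / sum(w_j)."""
    return random.choices(models, weights=weights, k=1)[0]

# e.g. route ~80% of mutation requests to the cheap model:
model = select_model(["flash", "pro"], [0.8, 0.2])
```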

Adaptive Selection (Optional)

OpenEvolve can optionally track which models produce improvements and adjust weights dynamically using a multi-armed bandit approach.

Python (Adaptive Model Selection)

import math

class AdaptiveModelSelector:
    def __init__(self, models, initial_weights, alpha=0.05):
        self.models = models
        self.weights = initial_weights.copy()
        self.success_counts = [0] * len(models)
        self.total_counts = [0] * len(models)
        self.alpha = alpha

    def select(self):
        # UCB1-inspired selection
        total = sum(self.total_counts) + 1
        scores = []
        for i in range(len(self.models)):
            if self.total_counts[i] == 0:
                scores.append(float('inf'))  # Explore untried
            else:
                exploit = self.success_counts[i] / self.total_counts[i]
                explore = ((2 * math.log(total)) / self.total_counts[i]) ** 0.5
                scores.append(exploit + self.alpha * explore)

        return self.models[scores.index(max(scores))]

    def update(self, model_idx, was_improvement):
        self.total_counts[model_idx] += 1
        if was_improvement:
            self.success_counts[model_idx] += 1

11.6 Search Strategies

| Strategy | Implementation Status | Configuration |
| --- | --- | --- |
| MAP-Elites | Implemented | `selection_strategy: "map_elites"` |
| Island Model | Implemented | `n_islands: 3` |
| Tournament Selection | Implemented | `selection_strategy: "tournament"` |
| Fitness-Proportional | Implemented | `selection_strategy: "fitness_proportional"` |
| Adaptive Model Selection | Optional | `adaptive_model_selection: true` |
| Novelty Search | Planned/Partial | Via custom behavioral descriptors |

11.7 Evaluation and Fitness

User-Defined Evaluation Functions

The evaluation function is provided by the user as a Python script. OpenEvolve calls this script with the candidate program path and expects a numeric score on stdout.

Python (Example: evaluate.py for Sorting)

import sys
import time
import importlib.util
import random

def evaluate(candidate_path):
    """Evaluate a candidate sorting algorithm."""

    # Load candidate module
    spec = importlib.util.spec_from_file_location("candidate", candidate_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Generate test cases
    test_cases = [
        random.sample(range(1000), k=n)
        for n in [10, 50, 100, 500, 1000]
        for _ in range(10)
    ]

    total_time = 0.0
    correct = 0

    for tc in test_cases:
        arr = tc.copy()
        expected = sorted(tc)

        start = time.perf_counter()
        result = module.sort(arr)
        elapsed = time.perf_counter() - start

        if result == expected:
            correct += 1
            total_time += elapsed

    # Score: correctness is critical, then speed
    correctness_ratio = correct / len(test_cases)
    if correctness_ratio < 1.0:
        score = correctness_ratio * 0.01  # Heavy penalty
    else:
        score = 1.0 / (1.0 + total_time)  # Faster = higher score

    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])

Cascaded Evaluation

OpenEvolve implements the same cascaded evaluation pattern as AlphaEvolve:

  1. Syntax check: Python compile() -- near instant
  2. Import check: Attempt to import the module -- catches import errors
  3. Smoke test: Quick run on small inputs with short timeout
  4. Full evaluation: Complete benchmark on all test cases
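The four stages above form a short-circuiting pipeline, which can be sketched as follows; `smoke_test` and `full_benchmark` are caller-supplied stand-ins, not OpenEvolve's actual interface.

```python
def cascaded_evaluate(code_str, smoke_test, full_benchmark):
    """Score a candidate, bailing out at the first failed stage."""
    # Stage 1: syntax check -- near instant
    try:
        compiled = compile(code_str, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    # Stage 2: import/exec check -- catches import and top-level errors
    namespace = {}
    try:
        exec(compiled, namespace)
    except Exception:
        return 0.0
    # Stage 3: smoke test -- quick run on small inputs
    try:
        if not smoke_test(namespace):
            return 0.0
    except Exception:
        return 0.0
    # Stage 4: full benchmark -- only survivors pay the full cost
    return full_benchmark(namespace)
```

The point of the cascade is economic: most LLM mutations fail cheaply at stages 1-3, so the expensive stage 4 runs only on plausible candidates.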

11.8 Prompt Engineering

OpenEvolve uses configurable prompt templates. The prompt sampler assembles prompts dynamically.

Default Prompt Template (Diff Mode)

Prompt Template

You are an expert programmer tasked with improving algorithms.

## Problem Description:
{problem_description}

## Current Program (fitness score: {parent_score}):
```python
{parent_code}
```

Your Task:

Analyze this program and propose an improvement that will increase the fitness score. Think about:
1. Algorithmic complexity improvements
2. Better data structures
3. Mathematical optimizations
4. Edge case handling

Provide your changes in unified diff format:

--- a/program.py
+++ b/program.py
@@ line numbers @@
 context
-removed
+added
#### Prompt Customization

Users can fully customize prompts via the YAML configuration, including:

- System prompt (defining the LLM's role)
- Problem description
- Output format instructions (diff vs. full program)
- Whether to include mutation history
- Whether to include multiple parents (crossover-like)
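An illustrative sketch of how such a template might be filled with `str.format`; the field names mirror the default template above, but the exact sampler API is OpenEvolve's own.

```python
# Hypothetical template assembly -- field names are illustrative.
TEMPLATE = (
    "## Problem Description:\n{problem_description}\n\n"
    "## Current Program (fitness score: {parent_score}):\n{parent_code}\n"
)

prompt = TEMPLATE.format(
    problem_description="Sort integer arrays as fast as possible.",
    parent_score=0.42,
    parent_code="def sort(arr):\n    return sorted(arr)",
)
```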

### 11.9 Cost Control

Python (Cost Tracking)

class CostTracker:
    def __init__(self, config):
        self.max_cost = config.get("max_total_cost_usd", float('inf'))
        self.max_input_tokens = config.get("max_tokens_input", float('inf'))
        self.max_output_tokens = config.get("max_tokens_output", float('inf'))

        self.total_cost = 0.0
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.calls_per_model = {}

        # Pricing per 1M tokens (approximate)
        self.pricing = {
            "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
            "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        }

    def update(self, model_name, input_tokens, output_tokens):
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.calls_per_model[model_name] = self.calls_per_model.get(model_name, 0) + 1

        if model_name in self.pricing:
            p = self.pricing[model_name]
            cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6
            self.total_cost += cost

    def budget_exceeded(self):
        return (
            self.total_cost >= self.max_cost or
            self.total_input_tokens >= self.max_input_tokens or
            self.total_output_tokens >= self.max_output_tokens
        )
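As a concrete instance of the per-call pricing arithmetic (cost scales with per-1M-token prices):

```python
def call_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """USD cost of one LLM call, given per-1M-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1e6

# 2,000 input + 500 output tokens at $0.10 / $0.40 per 1M tokens:
cost = call_cost(2000, 500, 0.10, 0.40)  # 0.0004 USD
```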
### 11.10 Meta-Level Improvement

Unlike AlphaEvolve (which has a self-improvement loop via Gemini training), OpenEvolve does not have a built-in meta-level improvement mechanism. However, it supports several forms of cross-run learning:

- **Seed program transfer:** The best program from one run can be used as the seed for a subsequent run
- **Database persistence:** The program database can be saved and loaded, allowing experiments to be resumed
- **Configuration tuning:** Results from one run inform hyperparameter choices for the next
- **Prompt refinement:** Observing what kinds of prompts produce improvements can guide prompt engineering

> **No model fine-tuning:** Like AlphaEvolve, OpenEvolve does not fine-tune the LLMs during evolution. The LLMs are used as frozen inference-time mutation operators. All "learning" happens at the population level.

### 11.11 Mathematical Framework

#### Evolutionary Loop Formalization


$$
\begin{aligned}
&\text{Given: problem specification } S,\ \text{seed program } p_0,\ \text{evaluation function } f: P \to \mathbb{R} \\[4pt]
&\text{For } t = 1, 2, \ldots, T: \\
&\quad 1.\ \text{Select parent: } p_{\text{parent}} \sim \text{Select}(DB_{t-1}) \\
&\quad 2.\ \text{Choose model: } m \sim \text{Categorical}(w_1, \ldots, w_K) \\
&\quad 3.\ \text{Generate mutation: } p_{\text{candidate}} = \text{LLM}_m(p_{\text{parent}}, S) \\
&\quad 4.\ \text{Evaluate: } s = f(p_{\text{candidate}}) \\
&\quad 5.\ \text{Update: } DB_t = \text{MapElitesInsert}(DB_{t-1}, p_{\text{candidate}}, s, \text{desc}(p_{\text{candidate}})) \\[4pt]
&\text{Return: } \operatorname*{argmax}_{p \in DB_T} f(p)
\end{aligned}
$$
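A toy, self-contained instantiation of this loop shows the control flow: "programs" are integers, the mutation operator is a random perturbation, and the database is a single cell. Real runs substitute LLM mutation and a user-defined evaluator for these stand-ins.

```python
import random

def evolve(seed, fitness, generations=200, rng=None):
    """Toy evolutionary loop over integer 'programs'."""
    rng = rng or random.Random(0)
    best = seed
    for _ in range(generations):
        candidate = best + rng.randint(-5, 5)   # steps 1-3: select + mutate
        if fitness(candidate) > fitness(best):  # step 4: evaluate
            best = candidate                    # step 5: update database
    return best

# Hill-climb toward the optimum of f(x) = -|x - 42|:
best = evolve(0, lambda x: -abs(x - 42))
```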

#### MAP-Elites Coverage


$$
\text{Coverage}(DB) = \frac{|\{\text{non-empty cells}\}|}{\prod_i \text{bins}_i}
$$

$$
\text{Quality}(DB) = \frac{1}{|\{\text{non-empty cells}\}|} \sum_{c \,\text{non-empty}} f(DB[c])
$$
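Both archive metrics translate directly to code for a sparse grid dict mapping cells to `(fitness, program)` pairs, as in the MAP-Elites description above.

```python
def coverage(grid, bins_per_dim):
    """Fraction of possible cells that are occupied."""
    total_cells = 1
    for b in bins_per_dim:
        total_cells *= b
    return len(grid) / total_cells

def quality(grid):
    """Mean fitness over occupied cells."""
    if not grid:
        return 0.0
    return sum(fit for fit, _ in grid.values()) / len(grid)
```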

#### Expected Mutation Improvement


$$
E[\text{improvement}] = \sum_m w_m \cdot P\big(f(\text{child}) > f(\text{parent}) \mid \text{model}=m\big) \cdot E\big[f(\text{child}) - f(\text{parent}) \mid \text{improvement}, \text{model}=m\big]
$$

#### Cost-Per-Improvement


$$
\text{CPI} = \frac{\text{Total API Cost}}{\text{Number of Fitness Improvements Found}}
$$

Marginal CPI at generation $t$:

$$
\text{CPI}_t = \frac{\text{cost}(\text{generation } t)}{\max(n_t, \epsilon)}
$$

where $n_t$ is the number of improvements found at generation $t$ and $\epsilon > 0$ guards against division by zero.

## 12 Programming Language

| Aspect | Language | Notes |
| --- | --- | --- |
| OpenEvolve framework | **Python 3.8+** | Pure Python implementation |
| Evolved programs | **Python** (primarily) | Framework is Python-centric but extensible to other languages |
| Configuration | **YAML** | Human-readable configuration files |
| Dependencies | Standard Python ecosystem | google-generativeai, openai, anthropic, pyyaml, etc. |

### Extensibility to Other Languages

While the default evaluation pipeline assumes Python programs, the evaluation function is user-defined and can invoke any external process. Users could evolve programs in any language by:

1. Defining the seed program in the target language
2. Writing an evaluation function that compiles and runs the target language
3. Adjusting the prompt templates to request output in the target language
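Step 2 for a non-Python target might look like the following sketch, which shells out to a C compiler; the compiler choice, flags, and the print-your-own-score convention are illustrative assumptions, not OpenEvolve defaults.

```python
import os
import subprocess
import tempfile

def evaluate_external(source_path, compiler="gcc"):
    """Compile and run a non-Python candidate; the candidate is expected
    to print its benchmark score on stdout (illustrative convention)."""
    with tempfile.TemporaryDirectory() as tmp:
        binary = os.path.join(tmp, "candidate")
        try:
            build = subprocess.run([compiler, "-O2", source_path, "-o", binary],
                                   capture_output=True)
        except FileNotFoundError:
            return 0.0  # toolchain unavailable
        if build.returncode != 0:
            return 0.0  # failed to compile: worst score
        try:
            run = subprocess.run([binary], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return 0.0  # hung at runtime
        if run.returncode != 0:
            return 0.0  # crashed at runtime
        try:
            return float(run.stdout.strip())
        except ValueError:
            return 0.0  # did not print a numeric score
```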

## 13 Memory Management

### Program Database

- **Sparse MAP-Elites grid:** Implemented as a Python dictionary; only occupied cells consume memory
- **In-memory storage:** The entire population is kept in RAM (typical populations of 100-1000 programs are easily manageable)
- **Disk persistence:** Database can be serialized to disk (pickle/JSON) for resuming experiments
- **Elite archive:** Fixed-size list of best programs (configurable max size, default 20)

### Evaluation Cache

- **Hash-based cache:** Programs are hashed and their scores cached to avoid re-evaluation
- **LRU eviction:** For long runs, the cache can be bounded with LRU eviction policy
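A minimal sketch of such a cache, combining content-hash keys with `OrderedDict`-based LRU eviction; the capacity and hashing scheme are illustrative.

```python
import hashlib
from collections import OrderedDict

class EvalCache:
    """Score cache keyed by program content hash, with bounded LRU eviction."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def key(code):
        return hashlib.sha256(code.encode()).hexdigest()

    def get(self, code):
        k = self.key(code)
        if k in self._store:
            self._store.move_to_end(k)  # mark as recently used
            return self._store[k]
        return None

    def put(self, code, score):
        k = self.key(code)
        self._store[k] = score
        self._store.move_to_end(k)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```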

### LLM Context Management

- **Token counting:** Prompts are checked against the model's context window limit before sending
- **Truncation:** Long programs or histories are truncated to fit within context limits
- **Selective inclusion:** Only the most relevant parent programs and recent history are included

### Memory Estimation


$$
\text{Memory} = O(P \cdot S + C \cdot S + E \cdot S)
$$

where $P$ = population size, $S$ = average program size (bytes), $C$ = evaluation cache size, $E$ = elite archive size.

Typical: 1000 programs $\times$ 10 KB each $\approx$ 10 MB (negligible).

## 14 Continued Learning

### Within-Run

- **Population evolution:** The program database continuously improves over generations
- **Mutation history:** Successful mutations are recorded and included in future prompts, allowing the LLM to learn from what has worked
- **Adaptive model selection:** If enabled, the system learns which models are more effective and routes more traffic to them

### Cross-Run

- **Database persistence:** Save and load the program database to resume evolution
- **Seed transfer:** Best programs from one run can seed subsequent runs
- **Configuration refinement:** Observing run metrics guides future configuration choices

### Limitations vs. AlphaEvolve

- No self-improvement loop (OpenEvolve does not train its own LLMs)
- No automatic infrastructure optimization (no Borg, no TPU, no Google-scale feedback)
- Learning is bounded to the population and prompt level, not the model weight level

## 15 Applications and Examples

### 15.1 Sorting Algorithm Evolution

> The sorting example evolves a Python sorting function starting from a basic implementation (e.g., bubble sort) toward faster algorithms. The evaluation function measures wall-clock time on benchmark arrays of varying sizes, with correctness as a hard constraint.

Python (examples/sorting/seed_program.py)

def sort(arr):
    """Initial seed: simple bubble sort."""
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

Through evolution, OpenEvolve can discover quicksort, mergesort, timsort-like algorithms, or novel hybrid approaches that outperform the seed on the specific benchmark.

### 15.2 Symbolic Regression

Given data points (x, y), evolve a mathematical function f such that f(x) closely approximates y. The evaluation function computes mean squared error on held-out data, rewarding both accuracy and simplicity.
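A plausible fitness function in this spirit, trading off held-out mean squared error against expression length; the complexity weighting is an illustrative choice, not OpenEvolve's bundled evaluator.

```python
def sr_fitness(candidate_fn, holdout, expr_len, complexity_weight=0.001):
    """Reward low MSE on held-out (x, y) pairs, penalize long expressions."""
    mse = sum((candidate_fn(x) - y) ** 2 for x, y in holdout) / len(holdout)
    return 1.0 / (1.0 + mse) - complexity_weight * expr_len

# The true function f(x) = 2x + 1 scores near 1.0; a wrong guess scores lower:
points = [(x, 2 * x + 1) for x in range(10)]
good = sr_fitness(lambda x: 2 * x + 1, points, expr_len=7)
bad = sr_fitness(lambda x: x, points, expr_len=3)
```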

### 15.3 Mathematical Optimization

Evolve algorithms for combinatorial optimization problems. The user provides an evaluation function that scores the quality of the solution found by the candidate algorithm.

### 15.4 Custom Applications

OpenEvolve's modular design enables any application where:

1. Solutions can be expressed as Python code (or code callable from Python)
2. An automatic evaluation function can score solution quality
3. The search space is rich enough to benefit from evolutionary exploration

### How to Create a Custom Application

Bash

# 1. Create problem directory
mkdir my_problem/

# 2. Write seed program
cat > my_problem/seed_program.py << 'EOF'
def solve(input_data):
    # Initial naive solution
    return baseline_solution(input_data)
EOF

# 3. Write evaluation function
cat > my_problem/evaluate.py << 'EOF'
import sys
import importlib.util

def evaluate(candidate_path):
    spec = importlib.util.spec_from_file_location("c", candidate_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    score = benchmark(mod.solve)
    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])
EOF

# 4. Write configuration
cat > my_problem/config.yaml << 'EOF'
problem:
  description: "Optimize the solve() function for maximum score"
  seed_program: "my_problem/seed_program.py"
  evaluation_fn: "my_problem/evaluate.py"
evolution:
  max_generations: 500
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      weight: 1.0
EOF

# 5. Run OpenEvolve
python -m openevolve.run --config my_problem/config.yaml

16 Detailed Comparison: OpenEvolve vs. AlphaEvolve

| Dimension | AlphaEvolve (Google DeepMind) | OpenEvolve (Open Source) |
| --- | --- | --- |
| Source Code | Proprietary; not released | Fully open-source (Apache 2.0) |
| LLM Models | Gemini Flash + Gemini Pro (internal APIs) | Any: Gemini, OpenAI, Anthropic, local (Ollama/vLLM) |
| LLM Ensemble | Dual-model (Flash for breadth, Pro for depth) | Configurable N-model ensemble with weighted selection |
| Mutation Modes | Diff-based, full-program, multi-file | Diff-based, full-program (multi-file support limited) |
| Program Database | MAP-Elites with island model and elite archive | MAP-Elites with island model and elite archive (faithful reproduction) |
| Parent Selection | MAP-Elites, tournament, fitness-proportional | MAP-Elites, tournament, fitness-proportional |
| Evaluation | Google internal sandboxing; formal verification for hardware | Subprocess/Docker sandboxing; user-defined evaluation functions |
| Scale | Google-scale (millions of LLM calls, massive parallel evaluation) | Single machine / small cluster |
| Evolved Language | Python, Verilog, Pallas/CUDA, any | Python primarily (extensible to others) |
| Self-Improvement | Yes (improves Gemini training, which improves AlphaEvolve) | No (uses frozen LLMs) |
| Configuration | Internal configuration system | YAML-based, user-friendly |
| Cost Control | Internal resource management | Built-in token tracking, USD budget limits |
| Results: Matrix Mult | Improved Strassen for 4x4 complex (48 scalar mults) | Not reproduced (requires specialized evaluation) |
| Results: Data Centers | 0.7% of Google's worldwide compute recovered | Not applicable (no access to Borg) |
| Results: Hardware | TPU Verilog optimization deployed | Not applicable (no access to TPU design tools) |
| Results: Kernels | 23% matrix mult speedup, 32.5% FlashAttention | Not reproduced (requires GPU kernel benchmarking) |
| Results: Math | Kissing number (11D), 50+ open problems | Example framework provided; community experiments |
| Community | Closed team at Google DeepMind | Open community; 8000+ GitHub stars |
| Documentation | White paper + blog post | README, code comments, examples, YAML configs |

Architectural Fidelity Assessment

High Fidelity Components: OpenEvolve faithfully reproduces the core architecture described in the AlphaEvolve white paper:

- Prompt Sampler with parent selection and context assembly
- LLM ensemble with weighted model selection
- MAP-Elites program database with behavioral descriptors
- Island model with periodic migration
- Diff-based and full-program mutation modes
- Cascaded evaluation with sandboxed execution

Missing / Reduced Components:

- **Multi-file evolution:** AlphaEvolve can modify entire codebases across multiple files simultaneously; OpenEvolve primarily targets single-file programs
- **Hardware targets:** No Verilog or Pallas/CUDA kernel evolution built-in
- **Formal verification:** No integration with hardware verification tools
- **Self-improvement loop:** No mechanism to improve the underlying LLMs
- **Scale:** No massively parallel evaluation infrastructure
- **Production deployment:** Research tool, not production infrastructure

When to Use Each System

| Scenario | Recommended | Reason |
| --- | --- | --- |
| Academic research on evolutionary LLM coding | OpenEvolve | Open source, reproducible, configurable |
| Google-scale infrastructure optimization | AlphaEvolve | Only available internally; requires Google infrastructure |
| Exploring novel algorithms for your domain | OpenEvolve | Customizable evaluation functions, any LLM provider |
| Hardware (Verilog) optimization | AlphaEvolve | Built-in hardware targets and verification |
| Using local/free models | OpenEvolve | Supports Ollama, vLLM, any OpenAI-compatible API |
| Minimal API cost experimentation | OpenEvolve | Built-in cost control, supports free local models |

OpenEvolve Technical Research Report | Generated March 2026

Source: github.com/algorithmicsuperintelligence/openevolve

This report is intended for PhD-level academic research and technical analysis.