
OpenEvolve

Open-Source Reimplementation of AlphaEvolve for LLM-Guided Evolutionary Coding

Organization: Algorithmic Superintelligence Inc. (community)
Repository: github.com/algorithmicsuperintelligence/openevolve
License: Apache 2.0
Language: Python
Report Date: March 2026

Table of Contents

  1 Full Title and Attribution
  2 Authors and Contributors
  3 Core Contribution
  4 Supported Solutions
  5 LLM Integration
  6 Key Results and Examples
  7 Reproducibility
  8 Compute and API Costs
  9 Architecture Solution
  10 Component Breakdown
  11 Core Mechanisms (Detailed)

1 Full Title and Attribution

Full Title: OpenEvolve: An Open-Source Implementation of AlphaEvolve

Repository: github.com/algorithmicsuperintelligence/openevolve

License: Apache License 2.0

Status: Active open-source project

Inspiration: Directly inspired by Google DeepMind's AlphaEvolve (May 2025)

Stars: 8,000+ (as of early 2026)

2 Authors and Contributors

OpenEvolve is developed by Algorithmic Superintelligence Inc., an open-source community organization. The project is community-driven, with contributions from multiple developers.

The project was created shortly after Google DeepMind published the AlphaEvolve white paper in May 2025. It aims to democratize access to evolutionary LLM-guided program synthesis by providing a fully open-source implementation that works with any LLM provider, not just Google's Gemini.

3 Core Contribution

Key Contribution: OpenEvolve is the first comprehensive open-source reimplementation of AlphaEvolve's core architecture, making LLM-guided evolutionary program synthesis accessible to researchers and practitioners who do not have access to Google's internal infrastructure.

What OpenEvolve Provides

  1. Multi-provider LLM support: Works with Gemini, OpenAI (GPT-4, GPT-4o), Anthropic (Claude), and any OpenAI-compatible API (local models via Ollama, vLLM, etc.)
  2. Faithful architectural reproduction: Implements the key components described in the AlphaEvolve white paper -- prompt sampler, program database (MAP-Elites), LLM ensemble, evaluation pipeline
  3. Configurable evolutionary strategies: YAML-based configuration for all hyperparameters, selection strategies, and LLM settings
  4. Diff-based and full-program mutations: Both mutation modes as described in AlphaEvolve
  5. Sandboxed evaluation: Secure code execution with timeout and resource limits
  6. Ready-to-use examples: Pre-built examples for sorting algorithms, symbolic regression, mathematical optimization, and more
  7. Cost control: Built-in token tracking, budget limits, and model routing to manage API costs

Key Differences from AlphaEvolve

Aspect                | AlphaEvolve                 | OpenEvolve
----------------------|-----------------------------|------------------------------------------------
Source availability   | Closed (proprietary)        | Fully open-source (Apache 2.0)
LLM provider          | Gemini only (internal)      | Any provider (Gemini, OpenAI, Anthropic, local)
Scale                 | Google-scale infrastructure | Single machine / small cluster
Evaluation            | Google internal sandboxing  | Docker/subprocess sandboxing
Production deployment | Deployed in Google infra    | Research/experimentation tool
Hardware targets      | Verilog, TPU kernels, GPU   | Python-focused (extensible)

4 Supported Solutions

Solution Type          | Support Level    | Example
-----------------------|------------------|--------------------------------------------
Python Functions       | Primary target   | Sorting algorithms, mathematical functions
Complete Programs      | Supported        | Multi-file Python programs
Symbolic Regression    | Example provided | Discovering mathematical formulas from data
Algorithm Optimization | Example provided | Optimizing algorithm efficiency
Mathematical Discovery | Supported        | Exploring conjectures and constructions
Custom Domains         | Extensible       | Any domain with a programmatic evaluator

OpenEvolve is designed to be domain-agnostic: any problem expressible as "write code, evaluate code, score code" can be tackled by defining an appropriate evaluation function.
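The "write code, evaluate code, score code" contract amounts to supplying a scoring function. Below is a minimal sketch of such an evaluator for a hypothetical sorting domain; the function name `evaluate` and the candidate's `sort` entry point are illustrative assumptions, not the exact OpenEvolve API.

```python
import random


def evaluate(candidate_source: str) -> float:
    """Score a candidate program expected to define sort(); higher is better."""
    namespace = {}
    try:
        exec(candidate_source, namespace)       # load the candidate
        sort_fn = namespace["sort"]             # assumed entry-point name
    except Exception:
        return float("-inf")                    # unusable code scores worst

    rng = random.Random(0)                      # fixed data for fair comparison
    passed = 0
    for _ in range(20):
        data = [rng.randint(0, 100) for _ in range(50)]
        try:
            if sort_fn(list(data)) == sorted(data):
                passed += 1
        except Exception:
            pass
    return passed / 20                          # fraction of passing test cases


score = evaluate("def sort(xs):\n    return sorted(xs)")
```

Any domain fits this shape as long as the score is a single comparable number: the evolutionary loop never needs to understand the domain, only to maximize the returned value.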

5 LLM Integration

Supported LLM Providers

Provider          | Models                                          | Configuration Key | Notes
------------------|-------------------------------------------------|-------------------|--------------------------------
Google Gemini     | Gemini 2.0 Flash, 2.5 Pro, etc.                 | gemini            | Closest to original AlphaEvolve
OpenAI            | GPT-4o, GPT-4-turbo, GPT-4o-mini, o1, o3        | openai            | Strong coding capability
Anthropic         | Claude 3.5 Sonnet, Claude 3 Opus, Claude Opus 4 | anthropic         | Strong reasoning
OpenAI-Compatible | Any model (Ollama, vLLM, LM Studio)             | openai_compatible | Enables free local models

Multi-Model Ensemble Configuration

OpenEvolve supports an ensemble of models, mirroring AlphaEvolve's Flash+Pro strategy. The user configures multiple models with weights that determine how often each is sampled.

YAML (config.yaml -- LLM Configuration)

# Multi-model ensemble configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8           # 80% of mutations use this model (breadth)
      role: "explorer"

    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2           # 20% of mutations use this model (depth)
      role: "refiner"

  # Alternative: Use OpenAI models
  # models:
  #   - provider: "openai"
  #     model_name: "gpt-4o-mini"
  #     weight: 0.85
  #   - provider: "openai"
  #     model_name: "gpt-4o"
  #     weight: 0.15

  # Cost control
  max_tokens_per_run: 10000000     # Total token budget
  max_cost_per_run: 50.0          # USD budget limit

LLM as Mutation Operator

OpenEvolve uses LLMs in the same role as AlphaEvolve: as semantically aware mutation operators. The LLM receives a parent program along with its score and is asked to produce an improvement, either as a diff or as a full rewrite.

Prompt Construction

The PromptSampler class constructs prompts dynamically based on:

  • Parent program(s) selected from the program database
  • Problem specification (provided by the user)
  • Mutation mode (diff vs. full program)
  • Evolutionary context (generation number, best score seen, improvement history)

Python (openevolve/prompt_sampler.py -- Conceptual)

import random

class PromptSampler:
    def __init__(self, config, program_db):
        self.config = config
        self.program_db = program_db
        self.diff_ratio = config.get("diff_ratio", 0.5)

    def sample_prompt(self):
        """Assemble a prompt for the LLM mutation operator."""

        # 1. Select parent(s) from program database
        parents = self.program_db.select_parents(
            n=self.config.get("n_parents", 1),
            strategy=self.config.get("selection_strategy", "map_elites")
        )
        self.last_selected_parent = parents[0]  # kept for later diff application

        # 2. Choose mutation mode
        use_diff = random.random() < self.diff_ratio

        # 3. Build prompt
        prompt = self._build_prompt(parents, use_diff)

        # 4. Select model from ensemble
        model = self._select_model()

        return prompt, model, use_diff

    def _build_prompt(self, parents, use_diff):
        sections = []
        sections.append(self.config.get("system_prompt",
            "You are an expert algorithm designer..."))
        sections.append(f"## Problem:\n{self.config['problem_description']}")

        for i, parent in enumerate(parents):
            sections.append(
                f"## Parent Program {i+1} (score: {parent.fitness:.4f}):\n"
                f"```python\n{parent.code}\n```"
            )

        if use_diff:
            sections.append(
                "Provide your improvement as a unified diff.")
        else:
            sections.append(
                "Provide the complete improved program.")

        return "\n\n".join(sections)

6 Key Results and Examples

Because OpenEvolve is a reimplementation rather than a research publication, it ships example benchmarks rather than novel discoveries. The included examples demonstrate that the architecture works and can evolve solutions:

Included Example Problems

Example               | Description                                                   | Evaluation Metric
----------------------|---------------------------------------------------------------|------------------------------------
Sorting Algorithm     | Evolve a sorting function that minimizes comparisons/time     | Wall-clock time on benchmark arrays
Symbolic Regression   | Discover mathematical formulas from data points               | Mean squared error on held-out data
Function Optimization | Find optimal solutions to mathematical optimization problems  | Objective function value
Algorithm Design      | Design algorithms for combinatorial problems                  | Solution quality score

Community Results

As an open-source project, OpenEvolve enables community experimentation. Users have reported using it for:

  • Evolving competitive programming solutions
  • Discovering more efficient data structure implementations
  • Optimizing machine learning hyperparameters through code evolution
  • Exploring mathematical conjectures

Note: OpenEvolve has not replicated AlphaEvolve's headline results (Borg scheduling, TPU Verilog, kissing number improvements). Those results require Google-scale infrastructure, internal APIs, and domain-specific evaluation harnesses that are not publicly available. OpenEvolve demonstrates that the architecture is sound and reproducible, not that all results transfer.

7 Reproducibility

Aspect        | Status          | Details
--------------|-----------------|----------------------------------------------------------
Source Code   | Fully Open      | Complete source code on GitHub under Apache 2.0
Documentation | Comprehensive   | README, code comments, configuration examples
Installation  | Straightforward | pip install or clone + install
Examples      | Included        | Multiple ready-to-run examples with evaluation functions
LLM Access    | User-provided   | Requires API keys for chosen provider (or local models)
Configuration | YAML-based      | All hyperparameters configurable via YAML files
Determinism   | Stochastic      | LLM outputs are non-deterministic; random seeds for selection
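The determinism caveat can be made concrete: randomness on the selection side is seedable even though LLM outputs are not. A minimal illustration, where `pick_parents` is a hypothetical helper rather than an OpenEvolve API:

```python
import random


def pick_parents(pool, n, seed):
    """Seeded parent selection; a local RNG keeps runs reproducible."""
    rng = random.Random(seed)       # local RNG avoids global-state interference
    return rng.sample(pool, n)


a = pick_parents(list(range(100)), 5, seed=42)
b = pick_parents(list(range(100)), 5, seed=42)
# Identical seeds yield identical selections; the LLM side stays stochastic.
```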

Installation

Bash (Installation)

# Clone the repository
git clone https://github.com/algorithmicsuperintelligence/openevolve.git
cd openevolve

# Install dependencies
pip install -e .

# Set API keys
export GEMINI_API_KEY="your-key-here"
# OR
export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"

# Run an example
python -m openevolve.run --config examples/sorting/config.yaml

8 Compute and API Costs

Cost Structure

OpenEvolve's costs are primarily driven by LLM API calls. The system includes built-in cost tracking and budget limits.

Configuration                             | Estimated Cost per Run | Typical Duration
------------------------------------------|------------------------|-----------------------------------------
Local models (Ollama)                     | $0 (electricity only)  | Hours to days (depending on model size)
GPT-4o-mini (explorer) + GPT-4o (refiner) | $5-50                  | Hours
Gemini Flash (explorer) + Pro (refiner)   | $2-30                  | Hours
Claude Sonnet (explorer) + Opus (refiner) | $10-100                | Hours

Built-in Cost Control

YAML (Cost Control Configuration)

cost_control:
  # Hard budget limits
  max_total_cost_usd: 50.0
  max_tokens_input: 5000000
  max_tokens_output: 2000000

  # Evaluation limits
  max_eval_time_seconds: 60
  max_concurrent_evals: 4

  # Tracking
  log_token_usage: true
  log_cost_per_generation: true

Cost Optimization Strategies

  • Cheap model first: Use a cheap model (Flash/mini) for 80%+ of mutations
  • Local models: Use Ollama with open-weights models for zero API cost
  • Cascaded evaluation: Quick syntax/smoke tests before expensive evaluations
  • Caching: Avoid re-evaluating identical or equivalent programs
  • Budget limits: Hard stops when cost thresholds are reached
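The controller shown later consults a CostTracker for these hard stops. A conceptual sketch of such a tracker follows; the usage-dict field names and the per-1k-token prices are illustrative assumptions, not OpenEvolve's actual interface or real pricing.

```python
class CostTracker:
    """Track token usage and enforce a hard USD budget (illustrative prices)."""

    def __init__(self, max_cost_usd, price_in_per_1k, price_out_per_1k):
        self.max_cost_usd = max_cost_usd
        self.price_in = price_in_per_1k / 1000.0     # USD per input token
        self.price_out = price_out_per_1k / 1000.0   # USD per output token
        self.cost_usd = 0.0

    def update(self, usage):
        # `usage` is assumed to expose input/output token counts per call
        self.cost_usd += (usage["input_tokens"] * self.price_in +
                          usage["output_tokens"] * self.price_out)

    def budget_exceeded(self):
        return self.cost_usd >= self.max_cost_usd


tracker = CostTracker(max_cost_usd=50.0,
                      price_in_per_1k=0.00015, price_out_per_1k=0.0006)
tracker.update({"input_tokens": 2000, "output_tokens": 1000})
```

Checking `budget_exceeded()` once per generation (as the controller loop does) means a run stops within one mutation of crossing the limit, which keeps worst-case overshoot bounded by a single LLM call.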

9 Architecture Solution

Repository Structure

openevolve/
    __init__.py
    run.py                   # Main entry point / CLI runner
    controller.py            # Orchestrates the evolutionary loop
    prompt_sampler.py        # Constructs prompts, selects parents
    program_database.py      # MAP-Elites program database
    evaluator.py             # Sandboxed code evaluation
    llm_provider.py          # Multi-provider LLM abstraction
    diff_engine.py           # Diff parsing and application
    config.py                # Configuration management
    utils.py                 # Utility functions
examples/
    sorting/                 # Sorting algorithm evolution
        config.yaml
        evaluate.py
        seed_program.py
    symbolic_regression/     # Mathematical formula discovery
        config.yaml
        evaluate.py
        seed_program.py
    optimization/            # Function optimization
        config.yaml
        evaluate.py

High-Level Architecture Diagram

+=====================================================================================+
|                          OpenEvolve System Architecture                              |
+=====================================================================================+
|                                                                                     |
|   +----------------------------------------------------------------------+          |
|   |                      CONTROLLER (controller.py)                      |          |
|   |  Orchestrates the evolutionary loop:                                 |          |
|   |  for generation in range(max_generations):                           |          |
|   |      prompt, model = prompt_sampler.sample()                         |          |
|   |      response = llm_provider.generate(prompt, model)                 |          |
|   |      candidate = diff_engine.apply(response, parent)                 |          |
|   |      score = evaluator.evaluate(candidate)                           |          |
|   |      program_database.insert(candidate, score)                       |          |
|   +----------------------------------------------------------------------+          |
|         |              |                |              |              |              |
|         v              v                v              v              v              |
|   +-----------+  +----------+  +---------------+  +---------+  +----------+         |
|   | PROMPT    |  | LLM      |  | DIFF ENGINE   |  | EVAL    |  | PROGRAM  |         |
|   | SAMPLER   |  | PROVIDER |  |               |  | UATOR   |  | DATABASE |         |
|   |           |  |          |  | - Parse diffs |  |         |  |          |         |
|   | - Select  |  | - Gemini |  | - Parse code  |  | - Sand- |  | - MAP-   |         |
|   |   parents |  | - OpenAI |  | - Apply patch |  |   boxed |  |   Elites |         |
|   | - Build   |  | - Claude |  | - Validate    |  |   exec  |  | - Islands|         |
|   |   prompt  |  | - Local  |  |               |  | - Score |  | - Elite  |         |
|   | - Choose  |  |          |  |               |  | - Cache |  |   archive|         |
|   |   model   |  |          |  |               |  |         |  |          |         |
|   +-----------+  +----------+  +---------------+  +---------+  +----------+         |
|                                                                                     |
|   +----------------------------------------------------------------------+          |
|   |                    CONFIGURATION (config.yaml)                        |          |
|   |  - LLM models, weights, temperatures                                 |          |
|   |  - Evolution: population size, generations, selection strategy        |          |
|   |  - Evaluation: timeout, sandbox settings                             |          |
|   |  - Cost control: budget limits, token tracking                       |          |
|   +----------------------------------------------------------------------+          |
|                                                                                     |
+=====================================================================================+

Evolutionary Loop Data Flow

                        +------------------------+
                        |    config.yaml         |
                        |    seed_program.py     |
                        |    evaluate.py         |
                        +-----------+------------+
                                    |
                                    v
                        +------------------------+
                        |    Controller.run()    |
                        +-----------+------------+
                                    |
                   +----------------+----------------+
                   |                                 |
                   v                                 |
          +----------------+                         |
          | PromptSampler  |                         |
          | .sample()      |                         |
          +-------+--------+                         |
                  |                                  |
       +----------+-----------+                      |
       |                      |                      |
       v                      v                      |
+------------+        +-------------+                |
| Program    |        | Model       |                |
| Database   |        | Selection   |                |
| .select()  |        | (weighted)  |                |
+------+-----+        +------+------+                |
       |                      |                      |
       v                      v                      |
+-----------------------------------------+          |
|        LLM Provider.generate()          |          |
|        (Gemini / OpenAI / Claude /      |          |
|         Local via Ollama)               |          |
+-------------------+---------------------+          |
                    |                                |
                    v                                |
          +------------------+                       |
          | DiffEngine       |                       |
          | .parse_and_apply |                       |
          +--------+---------+                       |
                   |                                 |
                   v                                 |
          +------------------+                       |
          | Evaluator        |                       |
          | .evaluate()      |                       |
          | (sandboxed)      |                       |
          +--------+---------+                       |
                   |                                 |
                   v                                 |
          +------------------+                       |
          | ProgramDatabase  |<----------------------+
          | .insert()        |    (next generation)
          +------------------+

10 Component Breakdown

10.1 Controller (controller.py)

The central orchestrator that runs the evolutionary loop. Responsibilities:

  • Initialize all components from configuration
  • Run the main evolution loop for N generations
  • Coordinate between prompt sampling, LLM generation, diff application, evaluation, and database updates
  • Track progress, log metrics, and enforce budget limits
  • Handle errors gracefully (LLM failures, evaluation timeouts, etc.)

Python (controller.py -- Core Loop)

class Controller:
    def __init__(self, config):
        self.config = config
        self.program_db = ProgramDatabase(config)
        self.prompt_sampler = PromptSampler(config, self.program_db)
        self.llm_provider = LLMProvider(config)
        self.diff_engine = DiffEngine()
        self.evaluator = Evaluator(config)
        self.cost_tracker = CostTracker(config)

    def run(self, seed_program, evaluation_fn):
        """Run the full evolutionary loop."""
        # Initialize with seed program
        seed_score = self.evaluator.evaluate(seed_program, evaluation_fn)
        self.program_db.insert(seed_program, seed_score)

        for gen in range(self.config["max_generations"]):
            # Check budget
            if self.cost_tracker.budget_exceeded():
                break

            # Sample prompt and model
            prompt, model_name, is_diff = self.prompt_sampler.sample_prompt()

            # Generate mutation via LLM
            response = self.llm_provider.generate(
                prompt, model=model_name
            )
            self.cost_tracker.update(response.usage)

            # Apply mutation
            parent = self.prompt_sampler.last_selected_parent
            if is_diff:
                candidate = self.diff_engine.apply_diff(
                    parent.code, response.text)
            else:
                candidate = self.diff_engine.extract_code(response.text)

            if candidate is None:
                continue  # Failed to parse

            # Evaluate
            score = self.evaluator.evaluate(candidate, evaluation_fn)

            # Update database
            self.program_db.insert(candidate, score)

            # Log progress
            self._log_generation(gen, score, self.program_db.best_score())

        return self.program_db.get_best_program()

10.2 Program Database (program_database.py)

Implements MAP-Elites with optional island model. Key features:

  • MAP-Elites grid with configurable behavioral descriptors
  • Multiple selection strategies (uniform, fitness-proportional, tournament)
  • Elite archive for best-ever solutions
  • Optional island sub-populations with migration
  • Persistence (save/load to disk for resuming experiments)

Python (program_database.py -- Core Structure)

import random

class ProgramDatabase:
    def __init__(self, config):
        self.config = config
        self.population_size = config.get("population_size", 100)
        self.selection_strategy = config.get("selection_strategy", "map_elites")

        # MAP-Elites grid
        self.n_descriptor_dims = config.get("n_descriptor_dims", 2)
        self.bins_per_dim = config.get("bins_per_dim", 10)
        self.grid = {}  # (bin_indices) -> Program

        # Island model
        self.n_islands = config.get("n_islands", 1)
        self.migration_interval = config.get("migration_interval", 50)
        self.islands = [{} for _ in range(self.n_islands)]

        # Elite archive
        self.elite_archive = []
        self.max_elite_size = config.get("max_elite_size", 20)

        # Generation counter
        self.generation = 0

    def insert(self, program, fitness, descriptors=None):
        """Insert a program into the database."""
        if descriptors is None:
            descriptors = self._auto_compute_descriptors(program)

        cell = self._discretize(descriptors)
        island_id = self.generation % self.n_islands

        # MAP-Elites insertion
        if cell not in self.islands[island_id] or \
           fitness > self.islands[island_id][cell].fitness:
            self.islands[island_id][cell] = Program(
                code=program, fitness=fitness,
                descriptors=descriptors, generation=self.generation
            )

        # Update elite archive
        self._update_elite(program, fitness)
        self.generation += 1

        # Periodic migration
        if self.generation % self.migration_interval == 0:
            self._migrate()

    def select_parents(self, n=1, strategy=None):
        """Select parent programs for mutation."""
        strategy = strategy or self.selection_strategy
        all_programs = self._get_all_programs()

        if strategy == "map_elites":
            # Uniform random from occupied cells
            return random.sample(all_programs, min(n, len(all_programs)))
        elif strategy == "tournament":
            return [self._tournament_select(all_programs) for _ in range(n)]
        elif strategy == "fitness_proportional":
            return self._fitness_proportional_select(all_programs, n)

    def _migrate(self):
        """Migrate top programs between islands."""
        for i in range(self.n_islands):
            source = self.islands[i]
            target = self.islands[(i + 1) % self.n_islands]
            # Get top-k from source
            top_k = sorted(source.values(), key=lambda p: p.fitness,
                          reverse=True)[:3]
            for prog in top_k:
                cell = self._discretize(prog.descriptors)
                if cell not in target or prog.fitness > target[cell].fitness:
                    target[cell] = prog
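The `_discretize` helper referenced above is not shown. A minimal sketch, assuming each behavioral descriptor has been normalized into [0, 1] before insertion:

```python
def discretize(descriptors, bins_per_dim=10):
    """Map descriptors normalized to [0, 1] onto a tuple of grid bin indices."""
    cell = []
    for d in descriptors:
        d = min(max(d, 0.0), 1.0)                 # clamp out-of-range values
        # d == 1.0 would index past the grid, so cap at the last bin
        cell.append(min(int(d * bins_per_dim), bins_per_dim - 1))
    return tuple(cell)                            # hashable key for the grid dict


cell = discretize([0.25, 0.99])
```

Returning a tuple matters: it makes the cell a valid dictionary key, which is what lets the sparse `islands[i]` dicts stand in for a dense k-dimensional grid.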

10.3 LLM Provider (llm_provider.py)

A unified interface abstracting over multiple LLM providers:

Python (llm_provider.py -- Multi-Provider Abstraction)

import random

class LLMProvider:
    def __init__(self, config):
        self.models = []
        self.weights = []

        for model_config in config["llm"]["models"]:
            provider = model_config["provider"]
            if provider == "gemini":
                client = GeminiClient(model_config)
            elif provider == "openai":
                client = OpenAIClient(model_config)
            elif provider == "anthropic":
                client = AnthropicClient(model_config)
            elif provider == "openai_compatible":
                client = OpenAICompatibleClient(model_config)
            else:
                raise ValueError(f"Unknown provider: {provider}")

            self.models.append(client)
            self.weights.append(model_config.get("weight", 1.0))

        # Normalize weights
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def generate(self, prompt, model=None):
        """Generate a response using the selected or randomly chosen model."""
        if model is None:
            # Weighted random selection
            model_idx = random.choices(
                range(len(self.models)), weights=self.weights, k=1
            )[0]
        else:
            model_idx = self._find_model_by_name(model)

        return self.models[model_idx].generate(prompt)

10.4 Evaluator (evaluator.py)

Runs candidate programs in a sandboxed environment with resource limits:

Python (evaluator.py -- Sandboxed Evaluation)

import subprocess
import tempfile
import os

class Evaluator:
    def __init__(self, config):
        self.timeout = config.get("eval_timeout", 60)
        self.cache = {}  # hash(code) -> score

    def evaluate(self, code, evaluation_fn_path):
        """Evaluate a candidate program in a sandbox."""

        # Check cache
        code_hash = hash(code)
        if code_hash in self.cache:
            return self.cache[code_hash]

        # Stage 1: Syntax check
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError:
            return -float('inf')

        # Stage 2: Sandboxed execution
        with tempfile.NamedTemporaryFile(
            mode='w', suffix='.py', delete=False
        ) as f:
            f.write(code)
            candidate_path = f.name

        try:
            result = subprocess.run(
                ["python", evaluation_fn_path, candidate_path],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                env={**os.environ, "SANDBOX": "1"}
            )
            if result.returncode != 0:
                score = -float('inf')
            else:
                score = float(result.stdout.strip())
        except (subprocess.TimeoutExpired, ValueError):
            score = -float('inf')
        finally:
            os.unlink(candidate_path)

        self.cache[code_hash] = score
        return score
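The subprocess call above implies a user-side contract: the evaluation script is invoked with the candidate's path as an argument and prints a single float score to stdout. Below is a hypothetical `evaluate.py` for a sorting task written against that assumed contract; the `sort` entry-point name is illustrative.

```python
# evaluate.py -- hypothetical evaluation script for the subprocess protocol:
# invoked as `python evaluate.py <candidate_path>`, prints one float score.
import sys
import time


def score_candidate(path):
    """Load the candidate file, time its sort() on fixed data, return a score."""
    namespace = {}
    with open(path) as f:
        exec(f.read(), namespace)            # load the candidate program
    sort_fn = namespace["sort"]              # assumed entry-point name
    data = list(range(1000, 0, -1))          # worst case: reverse-sorted input
    start = time.perf_counter()
    result = sort_fn(list(data))             # pass a copy; candidates may mutate
    elapsed = time.perf_counter() - start
    if result != sorted(data):
        return float("-inf")                 # incorrect output scores worst
    return -elapsed                          # faster is better (higher score)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(score_candidate(sys.argv[1]))
```

Because the Evaluator parses stdout with `float(result.stdout.strip())`, the script must print nothing else; diagnostic output belongs on stderr.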

10.5 Diff Engine (diff_engine.py)

Parses and applies code modifications from LLM responses:

  • Unified diff parsing and application
  • Full code block extraction from markdown-formatted responses
  • Error recovery for malformed diffs
  • Syntax validation of resulting code

10.6 Configuration (config.py + YAML files)

YAML (Full Configuration Example)

# OpenEvolve Full Configuration

# Problem specification
problem:
  description: "Evolve a sorting algorithm that minimizes comparisons"
  seed_program: "examples/sorting/seed_program.py"
  evaluation_fn: "examples/sorting/evaluate.py"

# Evolution parameters
evolution:
  max_generations: 1000
  population_size: 100
  selection_strategy: "map_elites"    # or "tournament", "fitness_proportional"
  n_islands: 3
  migration_interval: 50
  tournament_size: 3

# MAP-Elites configuration
map_elites:
  n_descriptor_dims: 2
  bins_per_dim: 10
  descriptor_names:
    - "code_length"
    - "avg_comparisons"

# Mutation settings
mutation:
  diff_ratio: 0.5                  # 50% diff, 50% full rewrite
  n_parents: 1                     # Number of parents per mutation
  include_history: true            # Include mutation history in prompt
  max_history_length: 5            # Last N successful mutations

# LLM configuration
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      temperature: 0.8
      max_tokens: 4096
      weight: 0.8
    - provider: "gemini"
      model_name: "gemini-2.5-pro"
      temperature: 0.5
      max_tokens: 8192
      weight: 0.2

# Evaluation settings
evaluation:
  timeout_seconds: 60
  n_test_cases: 100
  cache_results: true
  sandbox_mode: "subprocess"      # or "docker"

# Cost control
cost_control:
  max_total_cost_usd: 50.0
  log_usage: true

# Logging
logging:
  level: "INFO"
  save_all_programs: false
  save_improvements_only: true
  output_dir: "./results"
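config.py itself is not shown. Assuming the YAML has already been parsed into a nested dict (e.g. with PyYAML's safe_load), a minimal dotted-path accessor can serve the `config.get(...)` calls seen throughout the codebase; `get_config` here is an illustrative sketch, not the project's API.

```python
def get_config(config: dict, dotted_key: str, default=None):
    """Look up 'evolution.max_generations'-style keys in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default                   # fall back for missing sections
        node = node[part]
    return node


# Parsed form of a fragment of the YAML above
config = {
    "evolution": {"max_generations": 1000, "n_islands": 3},
    "mutation": {"diff_ratio": 0.5},
}
generations = get_config(config, "evolution.max_generations")
timeout = get_config(config, "evaluation.timeout_seconds", default=60)
```

Defaults at the lookup site (as in the snippets above) keep every configuration key optional, so a minimal YAML file still produces a runnable system.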

11 Core Mechanisms (Detailed)

11.1 Code Modification and Mutation (LLM-Guided Mutations)

OpenEvolve implements the same LLM-as-mutation-operator paradigm as AlphaEvolve. The diff engine supports two primary modes:

Diff-Based Mutations

The LLM generates a unified diff that the diff engine parses and applies to the parent program. This is effective for incremental improvements.

Python (diff_engine.py -- Diff Application)

import re

class DiffEngine:
    def apply_diff(self, original_code, llm_response):
        """Parse and apply a unified diff from LLM output."""

        # Extract diff from markdown code blocks or raw text
        diff_text = self._extract_diff(llm_response)
        if diff_text is None:
            return None

        # Parse unified diff hunks
        hunks = self._parse_hunks(diff_text)

        # Apply hunks to original code
        lines = original_code.split('\n')
        offset = 0

        for hunk in hunks:
            start = hunk.original_start + offset - 1
            # Remove old lines
            for _ in range(hunk.original_count):
                if start < len(lines):
                    lines.pop(start)
            # Insert new lines
            for i, line in enumerate(hunk.new_lines):
                lines.insert(start + i, line)
            offset += len(hunk.new_lines) - hunk.original_count

        result = '\n'.join(lines)

        # Validate syntax
        try:
            compile(result, "<modified>", "exec")
        except SyntaxError:
            return None  # Reject invalid code

        return result

    def extract_code(self, llm_response):
        """Extract full program from LLM response."""
        # Look for ```python ... ``` blocks
        pattern = r'```(?:python)?\s*\n(.*?)```'
        matches = re.findall(pattern, llm_response, re.DOTALL)
        if matches:
            code = matches[-1].strip()  # Take last code block
            try:
                compile(code, "<extracted>", "exec")
                return code
            except SyntaxError:
                return None
        return None

    def _extract_diff(self, text):
        """Extract diff block from LLM response."""
        # Try markdown code block first
        pattern = r'```(?:diff)?\s*\n(.*?)```'
        matches = re.findall(pattern, text, re.DOTALL)
        if matches:
            return matches[0]
        # Try raw diff (starts with --- or @@)
        if '---' in text and '+++' in text:
            return text
        return None
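The markdown code-block extraction can be exercised on its own. The demonstration below uses the same regex as extract_code, with the triple-backtick fence built dynamically (`FENCE`) so the example nests cleanly inside this document; that construction is the only deviation from the pattern above.

```python
import re

FENCE = "`" * 3   # literal triple backtick, built so this example nests cleanly


def extract_last_python_block(llm_response):
    """Pull the last fenced python block out of an LLM response."""
    pattern = FENCE + r'(?:python)?\s*\n(.*?)' + FENCE
    matches = re.findall(pattern, llm_response, re.DOTALL)
    return matches[-1].strip() if matches else None


response = (
    "Here is an improved version:\n"
    + FENCE + "python\n"
      "def sort(xs):\n"
      "    return sorted(xs)\n"
    + FENCE + "\n"
      "This swaps the custom loop for the built-in sort."
)
code = extract_last_python_block(response)
```

Taking the last block (rather than the first) matters in practice: models often restate the parent program before presenting the improved version.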

Mutation Quality Metrics

$$ \text{Compilation Rate} = \frac{|\{\text{mutations that produce valid syntax}\}|}{|\{\text{total mutations attempted}\}|} $$

$$ \text{Improvement Rate} = \frac{|\{\text{mutations that improve fitness}\}|}{|\{\text{mutations that compile and run}\}|} $$

$$ \text{Efficiency} = \frac{\text{Improvement Rate} \times \text{avg(fitness improvement)}}{\text{avg(API cost per mutation)}} $$

11.2 Parent Selection

OpenEvolve implements three primary selection strategies, configurable via YAML:

Strategy             | Config Value         | Description                         | Bias
---------------------|----------------------|-------------------------------------|---------------------------------------------------
MAP-Elites uniform   | map_elites           | Uniform random from occupied cells  | Diversity-favoring (no fitness bias within niches)
Tournament           | tournament           | Best of k random candidates         | Adjustable pressure via k
Fitness-proportional | fitness_proportional | Probability proportional to fitness | Exploitation-favoring

Python (Selection Strategies)

import random

def tournament_select(programs, k=3):
    """Tournament selection: pick best of k random candidates."""
    tournament = random.sample(programs, min(k, len(programs)))
    return max(tournament, key=lambda p: p.fitness)

def fitness_proportional_select(programs, n=1):
    """Roulette wheel selection weighted by fitness."""
    fitnesses = [p.fitness for p in programs]
    min_f = min(fitnesses)
    weights = [f - min_f + 1e-8 for f in fitnesses]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(programs, weights=probs, k=n)

def map_elites_select(grid):
    """Uniform random from occupied MAP-Elites cells."""
    occupied = list(grid.values())
    return random.choice(occupied)

11.3 Population Management

MAP-Elites Grid

The program database organizes solutions in a MAP-Elites grid defined by behavioral descriptors. The grid is a sparse dictionary (only occupied cells are stored).

$$
\text{Grid}: D_1 \times D_2 \times \cdots \times D_k \to \text{Program}
$$

where $D_i$ is the discretized behavioral descriptor dimension $i$.

Insertion rule: each cell holds the fittest program mapping to it:

$$
\text{grid}[\text{cell}] = \operatorname*{argmax}_{p \,:\, \text{desc}(p) \,\mapsto\, \text{cell}} f(p)
$$
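The insertion rule translates directly into a minimal sketch; `discretize` is an assumed helper, shown here as simple fixed-width binning rather than OpenEvolve's actual implementation.

```python
def discretize(descriptors, bin_width=10.0):
    """Map raw descriptor values to integer bin indices (illustrative)."""
    return tuple(int(v // bin_width) for v in descriptors)

def map_elites_insert(grid, program, fitness, descriptors):
    """Keep only the fittest program seen so far per behavioral cell."""
    cell = discretize(descriptors)
    incumbent = grid.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        grid[cell] = (fitness, program)
    return grid
```

A weaker program mapping to an already-occupied cell is simply discarded, which is what keeps the archive elitist per niche while remaining diverse across niches.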

Island Model

Configurable via n_islands and migration_interval. Each island maintains its own MAP-Elites grid, with periodic migration of top-performing programs between islands in a ring topology.

Python (Island Migration)

def migrate_ring(islands, k=3):
    """Ring migration: top-k from each island go to the next."""
    n = len(islands)
    migrants = []

    # Collect migrants from each island
    for i in range(n):
        top_k = sorted(
            islands[i].values(),
            key=lambda p: p.fitness,
            reverse=True
        )[:k]
        migrants.append(top_k)

    # Insert migrants into next island
    for i in range(n):
        target = (i + 1) % n
        for prog in migrants[i]:
            cell = discretize(prog.descriptors)
            if cell not in islands[target] or \
               prog.fitness > islands[target][cell].fitness:
                islands[target][cell] = prog

11.4 Solution Diversity and Novelty

OpenEvolve maintains diversity through:

  1. MAP-Elites grid structure: By definition, MAP-Elites maintains one solution per behavioral niche, ensuring the population covers diverse regions of behavior space
  2. Island model: Semi-isolated populations evolve differently, preventing the entire population from converging to a single solution
  3. Behavioral descriptors: Users define problem-specific descriptors that characterize how a program works, not just how well
  4. Temperature control: Higher LLM temperatures encourage more diverse mutations
  5. Diff vs. full-rewrite balance: Full rewrites can produce dramatically different programs, maintaining structural diversity

Configurable Behavioral Descriptors

Python (Custom Descriptor Definition)

# In evaluate.py, users define behavioral descriptors:

def evaluate_and_describe(candidate_code):
    """Return both fitness score and behavioral descriptors."""

    # Run the candidate
    results = run_on_test_cases(candidate_code)

    # Fitness: how good is the solution
    fitness = compute_score(results)

    # Behavioral descriptors: what kind of solution is it
    descriptors = {
        "code_length": len(candidate_code.split('\n')),
        "time_complexity_proxy": measure_scaling(results),
        "uses_recursion": 1.0 if "def " in candidate_code
                            and is_recursive(candidate_code) else 0.0,
        "memory_usage": measure_peak_memory(results)
    }

    return fitness, descriptors

11.5 LLM Orchestration and Model Selection

OpenEvolve's LLM orchestration closely mirrors AlphaEvolve's dual-model strategy but generalizes it to any number of models from any provider.

Weighted Model Selection

$$ P(\text{model}_i) = \frac{w_i}{\sum_j w_j} $$
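With the standard library, this weighted draw is a single call; the model names below are placeholders.

```python
import random

def select_model(models, weights):
    """Sample one model with probability w_i / sum(w_j)."""
    return random.choices(models, weights=weights, k=1)[0]

# e.g. route ~80% of mutation requests to the cheap model:
model = select_model(["flash", "pro"], [0.8, 0.2])
```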

Adaptive Selection (Optional)

OpenEvolve can optionally track which models produce improvements and adjust weights dynamically using a multi-armed bandit approach.

Python (Adaptive Model Selection)

import math

class AdaptiveModelSelector:
    def __init__(self, models, initial_weights, alpha=0.05):
        self.models = models
        self.weights = initial_weights.copy()
        self.success_counts = [0] * len(models)
        self.total_counts = [0] * len(models)
        self.alpha = alpha

    def select(self):
        # UCB1-inspired selection
        total = sum(self.total_counts) + 1
        scores = []
        for i in range(len(self.models)):
            if self.total_counts[i] == 0:
                scores.append(float('inf'))  # Explore untried
            else:
                exploit = self.success_counts[i] / self.total_counts[i]
                explore = ((2 * math.log(total)) / self.total_counts[i]) ** 0.5
                scores.append(exploit + self.alpha * explore)

        return self.models[scores.index(max(scores))]

    def update(self, model_idx, was_improvement):
        self.total_counts[model_idx] += 1
        if was_improvement:
            self.success_counts[model_idx] += 1

11.6 Search Strategies

| Strategy | Implementation Status | Configuration |
| --- | --- | --- |
| MAP-Elites | Implemented | `selection_strategy: "map_elites"` |
| Island Model | Implemented | `n_islands: 3` |
| Tournament Selection | Implemented | `selection_strategy: "tournament"` |
| Fitness-Proportional | Implemented | `selection_strategy: "fitness_proportional"` |
| Adaptive Model Selection | Optional | `adaptive_model_selection: true` |
| Novelty Search | Planned/Partial | Via custom behavioral descriptors |

11.7 Evaluation and Fitness

User-Defined Evaluation Functions

The evaluation function is provided by the user as a Python script. OpenEvolve calls this script with the candidate program path and expects a numeric score on stdout.

Python (Example: evaluate.py for Sorting)

import sys
import time
import importlib.util
import random

def evaluate(candidate_path):
    """Evaluate a candidate sorting algorithm."""

    # Load candidate module
    spec = importlib.util.spec_from_file_location("candidate", candidate_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Generate test cases
    test_cases = [
        random.sample(range(1000), k=n)
        for n in [10, 50, 100, 500, 1000]
        for _ in range(10)
    ]

    total_time = 0.0
    correct = 0

    for tc in test_cases:
        arr = tc.copy()
        expected = sorted(tc)

        start = time.perf_counter()
        result = module.sort(arr)
        elapsed = time.perf_counter() - start

        if result == expected:
            correct += 1
            total_time += elapsed

    # Score: correctness is critical, then speed
    correctness_ratio = correct / len(test_cases)
    if correctness_ratio < 1.0:
        score = correctness_ratio * 0.01  # Heavy penalty
    else:
        score = 1.0 / (1.0 + total_time)  # Faster = higher score

    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])

Cascaded Evaluation

OpenEvolve implements the same cascaded evaluation pattern as AlphaEvolve:

  1. Syntax check: Python compile() -- near instant
  2. Import check: Attempt to import the module -- catches import errors
  3. Smoke test: Quick run on small inputs with short timeout
  4. Full evaluation: Complete benchmark on all test cases
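The four stages above form a short-circuiting pipeline, which can be sketched as follows; `smoke_test` and `full_benchmark` are caller-supplied stand-ins, not OpenEvolve's actual interface.

```python
def cascaded_evaluate(code_str, smoke_test, full_benchmark):
    """Score a candidate, bailing out at the first failed stage."""
    # Stage 1: syntax check -- near instant
    try:
        compiled = compile(code_str, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    # Stage 2: import/exec check -- catches import and top-level errors
    namespace = {}
    try:
        exec(compiled, namespace)
    except Exception:
        return 0.0
    # Stage 3: smoke test -- quick run on small inputs
    try:
        if not smoke_test(namespace):
            return 0.0
    except Exception:
        return 0.0
    # Stage 4: full benchmark -- only survivors pay the full cost
    return full_benchmark(namespace)
```

The point of the cascade is economic: most LLM mutations fail cheaply at stages 1-3, so the expensive stage 4 runs only on plausible candidates.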

11.8 Prompt Engineering

OpenEvolve uses configurable prompt templates. The prompt sampler assembles prompts dynamically.

Default Prompt Template (Diff Mode)

Prompt Template

You are an expert programmer tasked with improving algorithms.

## Problem Description:
{problem_description}

## Current Program (fitness score: {parent_score}):
```python
{parent_code}
```

Your Task:

Analyze this program and propose an improvement that will increase the fitness score. Think about:
1. Algorithmic complexity improvements
2. Better data structures
3. Mathematical optimizations
4. Edge case handling

Provide your changes in unified diff format:

--- a/program.py
+++ b/program.py
@@ line numbers @@
 context
-removed
+added
#### Prompt Customization

Users can fully customize prompts via the YAML configuration, including:

- System prompt (defining the LLM's role)
- Problem description
- Output format instructions (diff vs. full program)
- Whether to include mutation history
- Whether to include multiple parents (crossover-like)
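An illustrative sketch of how such a template might be filled with `str.format`; the field names mirror the default template above, but the exact sampler API is OpenEvolve's own.

```python
# Hypothetical template assembly -- field names are illustrative.
TEMPLATE = (
    "## Problem Description:\n{problem_description}\n\n"
    "## Current Program (fitness score: {parent_score}):\n{parent_code}\n"
)

prompt = TEMPLATE.format(
    problem_description="Sort integer arrays as fast as possible.",
    parent_score=0.42,
    parent_code="def sort(arr):\n    return sorted(arr)",
)
```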

### 11.9 Cost Control

Python (Cost Tracking)

class CostTracker:
    def __init__(self, config):
        self.max_cost = config.get("max_total_cost_usd", float('inf'))
        self.max_input_tokens = config.get("max_tokens_input", float('inf'))
        self.max_output_tokens = config.get("max_tokens_output", float('inf'))

        self.total_cost = 0.0
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.calls_per_model = {}

        # Pricing per 1M tokens (approximate)
        self.pricing = {
            "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
            "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        }

    def update(self, model_name, input_tokens, output_tokens):
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.calls_per_model[model_name] = self.calls_per_model.get(model_name, 0) + 1

        if model_name in self.pricing:
            p = self.pricing[model_name]
            cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6
            self.total_cost += cost

    def budget_exceeded(self):
        return (
            self.total_cost >= self.max_cost or
            self.total_input_tokens >= self.max_input_tokens or
            self.total_output_tokens >= self.max_output_tokens
        )
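As a concrete instance of the per-call pricing arithmetic (cost scales with per-1M-token prices):

```python
def call_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """USD cost of one LLM call, given per-1M-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1e6

# 2,000 input + 500 output tokens at $0.10 / $0.40 per 1M tokens:
cost = call_cost(2000, 500, 0.10, 0.40)  # 0.0004 USD
```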
### 11.10 Meta-Level Improvement

Unlike AlphaEvolve (which has a self-improvement loop via Gemini training), OpenEvolve does not have a built-in meta-level improvement mechanism. However, it supports several forms of cross-run learning:

- **Seed program transfer:** The best program from one run can be used as the seed for a subsequent run
- **Database persistence:** The program database can be saved and loaded, allowing experiments to be resumed
- **Configuration tuning:** Results from one run inform hyperparameter choices for the next
- **Prompt refinement:** Observing what kinds of prompts produce improvements can guide prompt engineering

> **No model fine-tuning:** Like AlphaEvolve, OpenEvolve does not fine-tune the LLMs during evolution. The LLMs are used as frozen inference-time mutation operators. All "learning" happens at the population level.

### 11.11 Mathematical Framework

#### Evolutionary Loop Formalization


$$
\begin{aligned}
&\text{Given: problem specification } S,\ \text{seed program } p_0,\ \text{evaluation function } f: P \to \mathbb{R} \\[4pt]
&\text{For } t = 1, 2, \ldots, T: \\
&\quad 1.\ \text{Select parent: } p_{\text{parent}} \sim \text{Select}(DB_{t-1}) \\
&\quad 2.\ \text{Choose model: } m \sim \text{Categorical}(w_1, \ldots, w_K) \\
&\quad 3.\ \text{Generate mutation: } p_{\text{candidate}} = \text{LLM}_m(p_{\text{parent}}, S) \\
&\quad 4.\ \text{Evaluate: } s = f(p_{\text{candidate}}) \\
&\quad 5.\ \text{Update: } DB_t = \text{MapElitesInsert}(DB_{t-1}, p_{\text{candidate}}, s, \text{desc}(p_{\text{candidate}})) \\[4pt]
&\text{Return: } \operatorname*{argmax}_{p \in DB_T} f(p)
\end{aligned}
$$
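A toy, self-contained instantiation of this loop shows the control flow: "programs" are integers, the mutation operator is a random perturbation, and the database is a single cell. Real runs substitute LLM mutation and a user-defined evaluator for these stand-ins.

```python
import random

def evolve(seed, fitness, generations=200, rng=None):
    """Toy evolutionary loop over integer 'programs'."""
    rng = rng or random.Random(0)
    best = seed
    for _ in range(generations):
        candidate = best + rng.randint(-5, 5)   # steps 1-3: select + mutate
        if fitness(candidate) > fitness(best):  # step 4: evaluate
            best = candidate                    # step 5: update database
    return best

# Hill-climb toward the optimum of f(x) = -|x - 42|:
best = evolve(0, lambda x: -abs(x - 42))
```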

#### MAP-Elites Coverage


$$
\text{Coverage}(DB) = \frac{|\{\text{non-empty cells}\}|}{\prod_i \text{bins}_i}
$$

$$
\text{Quality}(DB) = \frac{1}{|\{\text{non-empty cells}\}|} \sum_{c \,\text{non-empty}} f(DB[c])
$$
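Both archive metrics translate directly to code for a sparse grid dict mapping cells to `(fitness, program)` pairs, as in the MAP-Elites description above.

```python
def coverage(grid, bins_per_dim):
    """Fraction of possible cells that are occupied."""
    total_cells = 1
    for b in bins_per_dim:
        total_cells *= b
    return len(grid) / total_cells

def quality(grid):
    """Mean fitness over occupied cells."""
    if not grid:
        return 0.0
    return sum(fit for fit, _ in grid.values()) / len(grid)
```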

#### Expected Mutation Improvement


$$
E[\text{improvement}] = \sum_m w_m \cdot P\big(f(\text{child}) > f(\text{parent}) \mid \text{model}=m\big) \cdot E\big[f(\text{child}) - f(\text{parent}) \mid \text{improvement}, \text{model}=m\big]
$$

#### Cost-Per-Improvement


$$
\text{CPI} = \frac{\text{Total API Cost}}{\text{Number of Fitness Improvements Found}}
$$

Marginal CPI at generation $t$:

$$
\text{CPI}_t = \frac{\text{cost}(\text{generation } t)}{\max(n_t, \epsilon)}
$$

where $n_t$ is the number of improvements found at generation $t$ and $\epsilon > 0$ guards against division by zero.

## 12 Programming Language

| Aspect | Language | Notes |
| --- | --- | --- |
| OpenEvolve framework | **Python 3.8+** | Pure Python implementation |
| Evolved programs | **Python** (primarily) | Framework is Python-centric but extensible to other languages |
| Configuration | **YAML** | Human-readable configuration files |
| Dependencies | Standard Python ecosystem | google-generativeai, openai, anthropic, pyyaml, etc. |

### Extensibility to Other Languages

While the default evaluation pipeline assumes Python programs, the evaluation function is user-defined and can invoke any external process. Users could evolve programs in any language by:

1. Defining the seed program in the target language
2. Writing an evaluation function that compiles and runs the target language
3. Adjusting the prompt templates to request output in the target language
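Step 2 for a non-Python target might look like the following sketch, which shells out to a C compiler; the compiler choice, flags, and the print-your-own-score convention are illustrative assumptions, not OpenEvolve defaults.

```python
import os
import subprocess
import tempfile

def evaluate_external(source_path, compiler="gcc"):
    """Compile and run a non-Python candidate; the candidate is expected
    to print its benchmark score on stdout (illustrative convention)."""
    with tempfile.TemporaryDirectory() as tmp:
        binary = os.path.join(tmp, "candidate")
        try:
            build = subprocess.run([compiler, "-O2", source_path, "-o", binary],
                                   capture_output=True)
        except FileNotFoundError:
            return 0.0  # toolchain unavailable
        if build.returncode != 0:
            return 0.0  # failed to compile: worst score
        try:
            run = subprocess.run([binary], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return 0.0  # hung at runtime
        if run.returncode != 0:
            return 0.0  # crashed at runtime
        try:
            return float(run.stdout.strip())
        except ValueError:
            return 0.0  # did not print a numeric score
```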

## 13 Memory Management

### Program Database

- **Sparse MAP-Elites grid:** Implemented as a Python dictionary; only occupied cells consume memory
- **In-memory storage:** The entire population is kept in RAM (typical populations of 100-1000 programs are easily manageable)
- **Disk persistence:** Database can be serialized to disk (pickle/JSON) for resuming experiments
- **Elite archive:** Fixed-size list of best programs (configurable max size, default 20)

### Evaluation Cache

- **Hash-based cache:** Programs are hashed and their scores cached to avoid re-evaluation
- **LRU eviction:** For long runs, the cache can be bounded with LRU eviction policy
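A minimal sketch of such a cache, combining content-hash keys with `OrderedDict`-based LRU eviction; the capacity and hashing scheme are illustrative.

```python
import hashlib
from collections import OrderedDict

class EvalCache:
    """Score cache keyed by program content hash, with bounded LRU eviction."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def key(code):
        return hashlib.sha256(code.encode()).hexdigest()

    def get(self, code):
        k = self.key(code)
        if k in self._store:
            self._store.move_to_end(k)  # mark as recently used
            return self._store[k]
        return None

    def put(self, code, score):
        k = self.key(code)
        self._store[k] = score
        self._store.move_to_end(k)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```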

### LLM Context Management

- **Token counting:** Prompts are checked against the model's context window limit before sending
- **Truncation:** Long programs or histories are truncated to fit within context limits
- **Selective inclusion:** Only the most relevant parent programs and recent history are included

### Memory Estimation


$$
\text{Memory} = O(P \cdot S + C \cdot S + E \cdot S)
$$

where $P$ = population size, $S$ = average program size (bytes), $C$ = evaluation cache size, $E$ = elite archive size.

Typical: 1000 programs $\times$ 10 KB each $\approx$ 10 MB (negligible).

## 14 Continued Learning

### Within-Run

- **Population evolution:** The program database continuously improves over generations
- **Mutation history:** Successful mutations are recorded and included in future prompts, allowing the LLM to learn from what has worked
- **Adaptive model selection:** If enabled, the system learns which models are more effective and routes more traffic to them

### Cross-Run

- **Database persistence:** Save and load the program database to resume evolution
- **Seed transfer:** Best programs from one run can seed subsequent runs
- **Configuration refinement:** Observing run metrics guides future configuration choices

### Limitations vs. AlphaEvolve

- No self-improvement loop (OpenEvolve does not train its own LLMs)
- No automatic infrastructure optimization (no Borg, no TPU, no Google-scale feedback)
- Learning is bounded to the population and prompt level, not the model weight level

## 15 Applications and Examples

### 15.1 Sorting Algorithm Evolution

> The sorting example evolves a Python sorting function starting from a basic implementation (e.g., bubble sort) toward faster algorithms. The evaluation function measures wall-clock time on benchmark arrays of varying sizes, with correctness as a hard constraint.

Python (examples/sorting/seed_program.py)

def sort(arr):
    """Initial seed: simple bubble sort."""
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

Through evolution, OpenEvolve can discover quicksort, mergesort, timsort-like algorithms, or novel hybrid approaches that outperform the seed on the specific benchmark.

### 15.2 Symbolic Regression

Given data points (x, y), evolve a mathematical function f such that f(x) closely approximates y. The evaluation function computes mean squared error on held-out data, rewarding both accuracy and simplicity.
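A plausible fitness function in this spirit, trading off held-out mean squared error against expression length; the complexity weighting is an illustrative choice, not OpenEvolve's bundled evaluator.

```python
def sr_fitness(candidate_fn, holdout, expr_len, complexity_weight=0.001):
    """Reward low MSE on held-out (x, y) pairs, penalize long expressions."""
    mse = sum((candidate_fn(x) - y) ** 2 for x, y in holdout) / len(holdout)
    return 1.0 / (1.0 + mse) - complexity_weight * expr_len

# The true function f(x) = 2x + 1 scores near 1.0; a wrong guess scores lower:
points = [(x, 2 * x + 1) for x in range(10)]
good = sr_fitness(lambda x: 2 * x + 1, points, expr_len=7)
bad = sr_fitness(lambda x: x, points, expr_len=3)
```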

### 15.3 Mathematical Optimization

Evolve algorithms for combinatorial optimization problems. The user provides an evaluation function that scores the quality of the solution found by the candidate algorithm.

### 15.4 Custom Applications

OpenEvolve's modular design enables any application where:

1. Solutions can be expressed as Python code (or code callable from Python)
2. An automatic evaluation function can score solution quality
3. The search space is rich enough to benefit from evolutionary exploration

### How to Create a Custom Application

Bash

# 1. Create problem directory
mkdir my_problem/

# 2. Write seed program
cat > my_problem/seed_program.py << 'EOF'
def solve(input_data):
    # Initial naive solution
    return baseline_solution(input_data)
EOF

# 3. Write evaluation function
cat > my_problem/evaluate.py << 'EOF'
import sys
import importlib.util

def evaluate(candidate_path):
    spec = importlib.util.spec_from_file_location("c", candidate_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    score = benchmark(mod.solve)
    print(score)

if __name__ == "__main__":
    evaluate(sys.argv[1])
EOF

# 4. Write configuration
cat > my_problem/config.yaml << 'EOF'
problem:
  description: "Optimize the solve() function for maximum score"
  seed_program: "my_problem/seed_program.py"
  evaluation_fn: "my_problem/evaluate.py"
evolution:
  max_generations: 500
llm:
  models:
    - provider: "gemini"
      model_name: "gemini-2.0-flash"
      weight: 1.0
EOF

# 5. Run OpenEvolve
python -m openevolve.run --config my_problem/config.yaml

16 Detailed Comparison: OpenEvolve vs. AlphaEvolve

| Dimension | AlphaEvolve (Google DeepMind) | OpenEvolve (Open Source) |
| --- | --- | --- |
| Source Code | Proprietary; not released | Fully open-source (Apache 2.0) |
| LLM Models | Gemini Flash + Gemini Pro (internal APIs) | Any: Gemini, OpenAI, Anthropic, local (Ollama/vLLM) |
| LLM Ensemble | Dual-model (Flash for breadth, Pro for depth) | Configurable N-model ensemble with weighted selection |
| Mutation Modes | Diff-based, full-program, multi-file | Diff-based, full-program (multi-file support limited) |
| Program Database | MAP-Elites with island model and elite archive | MAP-Elites with island model and elite archive (faithful reproduction) |
| Parent Selection | MAP-Elites, tournament, fitness-proportional | MAP-Elites, tournament, fitness-proportional |
| Evaluation | Google internal sandboxing; formal verification for hardware | Subprocess/Docker sandboxing; user-defined evaluation functions |
| Scale | Google-scale (millions of LLM calls, massive parallel evaluation) | Single machine / small cluster |
| Evolved Language | Python, Verilog, Pallas/CUDA, any | Python primarily (extensible to others) |
| Self-Improvement | Yes (improves Gemini training, which improves AlphaEvolve) | No (uses frozen LLMs) |
| Configuration | Internal configuration system | YAML-based, user-friendly |
| Cost Control | Internal resource management | Built-in token tracking, USD budget limits |
| Results: Matrix Mult | Improved Strassen for 4x4 complex (48 scalar mults) | Not reproduced (requires specialized evaluation) |
| Results: Data Centers | 0.7% of Google's worldwide compute recovered | Not applicable (no access to Borg) |
| Results: Hardware | TPU Verilog optimization deployed | Not applicable (no access to TPU design tools) |
| Results: Kernels | 23% matrix mult speedup, 32.5% FlashAttention | Not reproduced (requires GPU kernel benchmarking) |
| Results: Math | Kissing number (11D), 50+ open problems | Example framework provided; community experiments |
| Community | Closed team at Google DeepMind | Open community; 8000+ GitHub stars |
| Documentation | White paper + blog post | README, code comments, examples, YAML configs |

Architectural Fidelity Assessment

High Fidelity Components: OpenEvolve faithfully reproduces the core architecture described in the AlphaEvolve white paper:

- Prompt Sampler with parent selection and context assembly
- LLM ensemble with weighted model selection
- MAP-Elites program database with behavioral descriptors
- Island model with periodic migration
- Diff-based and full-program mutation modes
- Cascaded evaluation with sandboxed execution

Missing / Reduced Components:

- **Multi-file evolution:** AlphaEvolve can modify entire codebases across multiple files simultaneously; OpenEvolve primarily targets single-file programs
- **Hardware targets:** No Verilog or Pallas/CUDA kernel evolution built-in
- **Formal verification:** No integration with hardware verification tools
- **Self-improvement loop:** No mechanism to improve the underlying LLMs
- **Scale:** No massively parallel evaluation infrastructure
- **Production deployment:** Research tool, not production infrastructure

When to Use Each System

| Scenario | Recommended | Reason |
| --- | --- | --- |
| Academic research on evolutionary LLM coding | OpenEvolve | Open source, reproducible, configurable |
| Google-scale infrastructure optimization | AlphaEvolve | Only available internally; requires Google infrastructure |
| Exploring novel algorithms for your domain | OpenEvolve | Customizable evaluation functions, any LLM provider |
| Hardware (Verilog) optimization | AlphaEvolve | Built-in hardware targets and verification |
| Using local/free models | OpenEvolve | Supports Ollama, vLLM, any OpenAI-compatible API |
| Minimal API cost experimentation | OpenEvolve | Built-in cost control, supports free local models |

OpenEvolve Technical Research Report | Generated March 2026

Source: github.com/algorithmicsuperintelligence/openevolve

This report is intended for PhD-level academic research and technical analysis.