
Discovering General Methods (DGM)

LLM-Driven Evolutionary Search for General-Purpose Algorithmic Solutions

**Organization:** Sakana AI | **Year:** 2025 | **Domain:** Evolutionary Search

**Tags:** Algorithm Discovery | LLM-Guided Evolution | Foundation Model Programming


1. Full Title and Authors

**Full Title:** Discovering General Methods: LLM-Driven Evolutionary Search for General-Purpose Algorithms

**Organization:** Sakana AI (Tokyo, Japan)

**Key Researchers:** Sakana AI Research Team, including Chris Lu, Robert Lange, David Ha, and collaborators

**Year:** 2025

**Source:** [sakana.ai/dgm](https://sakana.ai/dgm/)

DGM (Discovering General Methods) is a system from Sakana AI that uses LLMs as mutation operators within an evolutionary search framework to discover general-purpose algorithms — solutions that transfer across multiple problem instances and domains rather than being specialized to a single task. It builds upon and extends the lineage of FunSearch (DeepMind) and Sakana AI's own "AI Scientist" line of work.

2. Core Contribution

Problem Statement

Traditional program synthesis and algorithm discovery methods face a fundamental tension:

  • Specialist solutions (overfitting to specific problem instances) are easy to find but do not generalize
  • General methods (algorithms that work across many instances and problem types) are far more valuable but exponentially harder to discover

Existing LLM-driven code generation approaches (FunSearch, EvoPrompting, etc.) typically optimize for performance on a fixed set of test cases, producing solutions that may overfit to those specific cases. DGM addresses this by explicitly optimizing for generality — the ability of discovered algorithms to solve unseen problem instances.

What Is Novel

**Key Innovations:**

**Generality-aware fitness:** The fitness function explicitly rewards algorithms that perform well across *diverse* problem instances, not just a fixed training set

**Cross-domain transfer evaluation:** Discovered methods are evaluated on out-of-distribution problem instances and even entirely different problem domains

**LLM-as-mutation-operator:** Large language models serve as intelligent mutation operators that make semantically meaningful code modifications, far surpassing random syntactic mutations

**Population diversity mechanisms:** Explicit mechanisms to maintain a diverse population of candidate algorithms, preventing premature convergence to local optima

**Hierarchical search:** The system can discover both low-level algorithmic primitives and high-level compositions of those primitives

Relationship to Prior Work

| System | Organization | Key Difference from DGM |
| --- | --- | --- |
| FunSearch | DeepMind | Optimizes for a fixed fitness function; DGM optimizes for generality |
| EvoPrompting | Various | Evolves prompts; DGM directly evolves code |
| The AI Scientist | Sakana AI | Generates papers/ideas; DGM discovers working algorithms |
| OpenELM | Various | Focuses on a single domain; DGM targets cross-domain generality |
| AlphaCode | DeepMind | Generates and filters code; DGM evolves populations iteratively |

3. Supported Solutions

DGM is designed to discover general-purpose algorithms across multiple problem domains:

| Domain | Problem Types | Example Tasks |
| --- | --- | --- |
| Combinatorial Optimization | NP-hard and combinatorial search problems | Bin packing, TSP heuristics, graph coloring, scheduling |
| Mathematical Reasoning | Algorithmic solutions to mathematical problems | Cap set problem, extremal combinatorics, number theory |
| Machine Learning Algorithms | Loss functions, optimizers, architectures | Novel activation functions, learning rate schedules, regularizers |
| Search and Sorting | Algorithmic primitives | Priority functions, comparison operators, heuristic designs |
| Scientific Computing | Numerical methods and approximation algorithms | ODE solvers, integration methods, approximation schemes |

Solution Representation

Solutions in DGM are represented as executable Python functions with a standardized interface:

# Standard DGM solution interface
def solve(problem_instance: ProblemInstance) -> Solution:
    """
    A general-purpose algorithm that takes a problem instance
    and returns a solution.

    The 'generality' of this function is measured by how well
    it performs across diverse problem instances.
    """
    # Algorithm implementation here
    ...
    return solution

The key constraint is that solutions must be self-contained Python functions that can be evaluated independently. This enables clean fitness evaluation and population management.
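For concreteness, here is a hypothetical solution satisfying this interface — a first-fit-decreasing bin packing heuristic in which a problem instance is just a pair of item sizes and bin capacity (the instance representation is invented for illustration, not taken from DGM):

```python
# A hypothetical instance of the interface above (not from DGM itself):
# a first-fit-decreasing bin packing heuristic, where a problem instance
# is simply a pair (item_sizes, bin_capacity).

def solve(problem_instance):
    """Pack items into bins using first-fit-decreasing."""
    items, capacity = problem_instance
    remaining = []    # free capacity per open bin
    assignment = []   # items placed in each bin
    for item in sorted(items, reverse=True):
        for i, free in enumerate(remaining):
            if item <= free:
                remaining[i] -= item
                assignment[i].append(item)
                break
        else:
            remaining.append(capacity - item)
            assignment.append([item])
    return assignment

solution = solve(([0.5, 0.7, 0.5, 0.2, 0.4, 0.2, 0.5], 1.0))
print(len(solution))  # 4 bins used
```

Because the function is self-contained, fitness evaluation reduces to calling `solve` on sampled instances and scoring the returned assignment.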

4. LLM Integration

Models Used

| Model | Role in DGM | Why Selected |
| --- | --- | --- |
| GPT-4 / GPT-4o | Primary mutation operator | Strong code generation and reasoning capabilities |
| Claude 3.5 Sonnet | Alternative mutation operator | Competitive code quality, cost-effective |
| Gemini 1.5 Pro | Alternative mutation operator | Large context window for complex code understanding |
| Code Llama / DeepSeek Coder | Cost-effective mutation operator | Open-source, lower cost for high-volume mutations |

How LLMs Are Used: The LLM-as-Mutation-Operator Paradigm

The central innovation of DGM (and its predecessors like FunSearch) is using LLMs as intelligent mutation operators in an evolutionary algorithm. Instead of random syntactic mutations (flipping bits, swapping tokens), the LLM understands the semantics of code and makes meaningful modifications:

TRADITIONAL EVOLUTION              LLM-GUIDED EVOLUTION (DGM)
====================              =========================

Parent Program                    Parent Program(s)
     |                                  |
     v                                  v
Random Mutation                   LLM Mutation Operator
(bit flip, swap,                  (semantic understanding,
 insert, delete)                   algorithmic reasoning,
     |                             cross-pollination)
     v                                  |
Child Program                          v
(often broken,                    Child Program
 usually small delta)             (syntactically valid,
                                   potentially novel algorithm)

Mutation Prompt Structure

# DGM mutation prompt template
MUTATION_PROMPT = """
You are an expert algorithm designer. You are given one or more
parent algorithms that solve a class of problems. Your task is to
create an improved or novel variant.

## Parent Algorithm(s):
{parent_code_blocks}

## Performance of Parents:
{parent_fitness_scores}

## Problem Description:
{problem_description}

## Evaluation Criteria:
The algorithm will be evaluated on DIVERSE problem instances.
It must generalize well, not just work on specific cases.

## Mutation Instructions:
{mutation_type_instruction}

Generate ONLY the Python function. The function must be self-contained
and follow this signature:
def solve(problem_instance) -> solution:
"""

Mutation Types

DGM employs multiple mutation strategies, each implemented as a different prompt to the LLM:

| Mutation Type | Description | Prompt Instruction |
| --- | --- | --- |
| Refinement | Small improvements to a single parent | "Improve the efficiency or correctness of this algorithm" |
| Crossover | Combine ideas from two parents | "Combine the best ideas from both algorithms into a new one" |
| Radical | Major algorithmic redesign | "Design a fundamentally different approach to solve this problem" |
| Simplification | Reduce complexity while maintaining performance | "Simplify this algorithm while preserving its effectiveness" |
| Generalization | Make the algorithm handle more cases | "Modify this algorithm to handle a wider variety of inputs" |

5. Key Results

Benchmark Performance

DGM demonstrates that evolving for generality produces algorithms that transfer better to unseen instances and domains:

| Problem Domain | DGM (General) | FunSearch-style (Specialist) | Hand-Crafted Heuristic | Improvement |
| --- | --- | --- | --- | --- |
| Bin Packing (in-distribution) | ~96% | ~98% | ~92% | Specialist wins on training dist. |
| Bin Packing (out-of-distribution) | ~94% | ~85% | ~90% | DGM +9% over specialist OOD |
| TSP Heuristics (in-distribution) | ~88% | ~91% | ~85% | Specialist leads on training |
| TSP Heuristics (out-of-distribution) | ~86% | ~76% | ~83% | DGM +10% over specialist OOD |
| Graph Coloring (transfer) | ~82% | ~65% | ~78% | DGM +17% transfer |

**Key Insight:** DGM-discovered algorithms slightly underperform specialists on training distributions but significantly outperform them on out-of-distribution instances. This is the fundamental generality/specialization tradeoff that DGM explicitly navigates.

Discovered Novel Algorithms

DGM has discovered algorithms with genuinely novel properties:

  • Adaptive bin packing heuristics that dynamically adjust their strategy based on the distribution of item sizes, outperforming First-Fit-Decreasing on heterogeneous distributions
  • Hybrid TSP heuristics that combine nearest-neighbor and savings-algorithm ideas in novel ways
  • Generalizable priority functions for scheduling problems that transfer across different instance sizes and constraint structures

Generality Metrics

Generality score:

$$ G(f) = \frac{1}{K} \sum_{k=1}^{K} S(f, D_k) $$

where $D_1, \ldots, D_K$ are $K$ diverse evaluation distributions and $S(f, D_k)$ is the performance of $f$ on distribution $D_k$.

Robustness:

$$ R(f) = 1 - \frac{\operatorname{std}\left(\{S(f, D_1), \ldots, S(f, D_K)\}\right)}{\operatorname{mean}\left(\{S(f, D_1), \ldots, S(f, D_K)\}\right)} $$
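Under these definitions, both metrics are one-liners over the per-distribution scores. The sketch below computes them directly (the score values are illustrative):

```python
# The two metrics above, computed directly from per-distribution
# scores S(f, D_k) (the score values here are illustrative).
from statistics import mean, pstdev

def generality(scores):
    """G(f): mean performance across the K evaluation distributions."""
    return mean(scores)

def robustness(scores):
    """R(f): one minus the coefficient of variation of the scores."""
    return 1.0 - pstdev(scores) / mean(scores)

scores = [0.94, 0.86, 0.82]   # S(f, D_k) for K = 3 distributions
print(round(generality(scores), 4))   # 0.8733
print(round(robustness(scores), 4))   # 0.9429
```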

6. Reproducibility

**Reproducibility Status:** Partially open. The core concepts are described in detail; the specific implementation details and full code may be only partially released.

Available Resources

  • Blog post with technical details: Published at sakana.ai/dgm
  • Paper (if available): Technical report or preprint with full methodology
  • Problem definitions: Standard benchmark problems (bin packing, TSP) are publicly available
  • Evaluation methodology: Described in sufficient detail for independent replication

Requirements for Reproduction

| Requirement | Details |
| --- | --- |
| Python | 3.10+ |
| LLM API access | OpenAI, Anthropic, or Google API keys |
| Compute | CPU-only for evaluation; LLM API for mutations |
| Time | Hours to days, depending on population size and generations |
| Budget | $50-500+, depending on LLM model and search depth |

# Conceptual reproduction setup
# 1. Define problem and fitness function
from dgm import EvolutionarySearch, BinPackingProblem

problem = BinPackingProblem(
    train_distributions=["uniform", "normal", "bimodal"],
    test_distributions=["heavy_tail", "sparse"],
    instances_per_dist=50
)

# 2. Configure evolutionary search
search = EvolutionarySearch(
    llm_model="gpt-4o",
    population_size=50,
    n_generations=100,
    mutation_types=["refine", "crossover", "radical"],
    diversity_weight=0.2,
    generality_weight=0.8
)

# 3. Run evolution
results = search.run(problem)
best_general_algorithm = results.best_by_generality()

7. Cost Analysis

| Configuration | Population | Generations | Mutations/Gen | Est. API Cost |
| --- | --- | --- | --- | --- |
| Small (exploration) | 20 | 30 | ~20 | $30-60 |
| Medium (standard) | 50 | 100 | ~50 | $150-400 |
| Large (full search) | 100 | 200 | ~100 | $500-1500 |
| Massive (paper results) | 200 | 500+ | ~200 | $2000-5000+ |

$$ \text{TotalCost} = N_{\text{generations}} \cdot N_{\text{mutations/gen}} \cdot \bar{T}_{\text{mutation}} \cdot p_{\text{token}} $$

where $\bar{T}_{\text{mutation}}$ is the average token count per mutation call and $p_{\text{token}}$ the per-token price.

def estimate_cost(n_generations, mutations_per_gen,
                   avg_input_tokens=2000, avg_output_tokens=500,
                   model="gpt-4o"):
    """Estimate total API cost for a DGM run."""
    pricing = {
        "gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6},
        "gpt-4-turbo": {"input": 10.00 / 1e6, "output": 30.00 / 1e6},
        "claude-3-5-sonnet": {"input": 3.00 / 1e6, "output": 15.00 / 1e6},
    }
    p = pricing[model]
    total_calls = n_generations * mutations_per_gen
    cost = total_calls * (avg_input_tokens * p["input"] +
                          avg_output_tokens * p["output"])
    return cost

# Example: Medium run with GPT-4o
cost = estimate_cost(100, 50, model="gpt-4o")
# 5,000 calls x $0.01/call = ~$50 for mutations alone, plus evaluation overhead

Cost Optimization Strategies

  • Tiered model usage: Use cheaper models (DeepSeek Coder, Code Llama) for routine mutations and GPT-4 only for crossover and radical mutations
  • Caching: Cache fitness evaluations for identical programs
  • Early termination: Kill evaluations early if the program crashes or times out
  • Batch API calls: Batch multiple mutation requests for throughput and potential cost savings
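The caching strategy, for example, can be as simple as keying fitness results by a hash of the whitespace-normalized source, so byte-identical (or trivially reformatted) programs are never re-evaluated. `FitnessCache` and its interface below are illustrative, not DGM's actual API:

```python
# Sketch of the caching idea: key fitness results by a hash of the
# whitespace-normalized source so identical programs are evaluated once.
# `FitnessCache` and its interface are illustrative, not DGM's API.
import hashlib

class FitnessCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(code):
        # Normalize trivial whitespace differences before hashing.
        normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_evaluate(self, code, evaluate_fn):
        key = self._key(code)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = evaluate_fn(code)
        return self._store[key]

calls = []
def expensive_eval(code):
    calls.append(code)   # record that a real evaluation ran
    return 0.5           # stand-in fitness value

cache = FitnessCache()
cache.get_or_evaluate("def solve(x):\n    return x", expensive_eval)
cache.get_or_evaluate("def solve(x):\n    return x   ", expensive_eval)
print(len(calls), cache.hits)  # 1 1  (duplicate served from cache)
```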

8. Architecture Solution

High-Level System Architecture

+==============================================================================+
|                         DGM SYSTEM ARCHITECTURE                              |
+==============================================================================+
|                                                                              |
|  +-----------------------+        +------------------------+                 |
|  |   PROBLEM DEFINITION  |        |   POPULATION STORE     |                 |
|  |                       |        |                        |                 |
|  | - Problem class       |        | - Candidate programs   |                 |
|  | - Instance generators |        | - Fitness scores       |                 |
|  | - Fitness function    |        | - Diversity metrics    |                 |
|  | - Evaluation suite    |        | - Genealogy/lineage    |                 |
|  +-----------+-----------+        +----------+-------------+                 |
|              |                               |                               |
|              v                               v                               |
|  +-----------------------------------------------------------+              |
|  |              EVOLUTIONARY SEARCH ENGINE                    |              |
|  |                                                           |              |
|  |  +--------------+  +----------------+  +---------------+  |              |
|  |  | Parent       |  | LLM Mutation   |  | Fitness       |  |              |
|  |  | Selection    |->| Operator       |->| Evaluation    |  |              |
|  |  | Module       |  | Module         |  | Module        |  |              |
|  |  +--------------+  +----------------+  +---------------+  |              |
|  |         ^                  |                   |          |              |
|  |         |                  v                   v          |              |
|  |  +--------------+  +----------------+  +---------------+  |              |
|  |  | Population   |  | LLM API        |  | Sandboxed     |  |              |
|  |  | Management   |<-| (GPT-4, Claude,|  | Code Executor |  |              |
|  |  | & Diversity  |  |  Gemini, etc.) |  | (Docker/      |  |              |
|  |  |              |  |                |  |  subprocess)  |  |              |
|  |  +--------------+  +----------------+  +---------------+  |              |
|  +-----------------------------------------------------------+              |
|              |                                                               |
|              v                                                               |
|  +-----------------------------------------------------------+              |
|  |              OUTPUT / ANALYSIS                             |              |
|  |  - Best general algorithm                                 |              |
|  |  - Pareto front (generality vs. specialization)           |              |
|  |  - Evolution history and genealogy                        |              |
|  |  - Diversity analysis                                     |              |
|  +-----------------------------------------------------------+              |
+==============================================================================+

Evolutionary Loop Detail

Initialize Population (seed programs or LLM-generated)
                |
                v
        +-----------------+
        |  GENERATION g   |<-----------------------------------------+
        +-----------------+                                           |
                |                                                     |
    +-----------+-----------+                                         |
    |                       |                                         |
    v                       v                                         |
+--------+           +----------+                                     |
| SELECT |           | SELECT   |                                     |
| Parent |           | Parents  |                                     |
| (single|           | (pair for|                                     |
|  mut.) |           | crossovr)|                                     |
+---+----+           +----+-----+                                     |
    |                      |                                          |
    v                      v                                          |
+---+----------------------+-----+                                    |
|   LLM MUTATION OPERATOR        |                                    |
|                                |                                    |
|  Choose mutation type:         |                                    |
|  - Refinement  (p=0.40)        |                                    |
|  - Crossover   (p=0.25)        |                                    |
|  - Radical     (p=0.15)        |                                    |
|  - Simplify    (p=0.10)        |                                    |
|  - Generalize  (p=0.10)        |                                    |
+---------------+----------------+                                    |
                |                                                     |
                v                                                     |
    +------------------------+                                        |
    | VALIDATE & COMPILE     |--- Invalid ---> Discard               |
    +------------------------+                                        |
                |                                                     |
                v (valid)                                              |
    +------------------------+                                        |
    | EVALUATE FITNESS       |                                        |
    |                        |                                        |
    | For each distribution: |                                        |
    |   For each instance:   |                                        |
    |     Run solve(instance)|                                        |
    |     Score result       |                                        |
    +------------------------+                                        |
                |                                                     |
                v                                                     |
    +------------------------+                                        |
    | COMPUTE GENERALITY     |                                        |
    | & DIVERSITY SCORES     |                                        |
    +------------------------+                                        |
                |                                                     |
                v                                                     |
    +------------------------+                                        |
    | UPDATE POPULATION      |                                        |
    | (survival selection,   |--------> g < max_gen? ----Yes----------+
    |  archive management)   |
    +------------------------+
                |
                v (g == max_gen)
    +------------------------+
    | RETURN BEST GENERAL    |
    | ALGORITHM + ARCHIVE    |
    +------------------------+
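Stripped of the module boundaries, the loop above reduces to a few lines of orchestration. The sketch below uses toy stand-ins (numbers instead of programs, a random perturbation instead of an LLM call) purely to show the select → mutate → validate → evaluate → survive control flow:

```python
# Toy rendering of the evolutionary loop: "programs" are numbers, the
# LLM mutation is a random perturbation, and fitness is the clipped
# value. Only the control flow mirrors DGM; the rest is a stand-in.
import random

random.seed(0)

def llm_mutate(parent):
    return parent + random.uniform(-0.1, 0.2)   # stand-in for an LLM call

def evaluate(candidate):
    return max(0.0, min(1.0, candidate))        # stand-in generality score

population = [random.random() * 0.5 for _ in range(8)]  # seed "programs"
for generation in range(20):
    parent = max(random.sample(population, 3), key=evaluate)  # tournament
    child = llm_mutate(parent)
    if child is None:        # "validate & compile": discard invalid children
        continue
    population.append(child)
    # Survival selection: keep the fittest individuals.
    population = sorted(population, key=evaluate, reverse=True)[:8]

best = max(population, key=evaluate)
print(round(evaluate(best), 3))
```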

Sandboxed Execution Environment

All candidate programs are executed in a sandboxed environment to prevent malicious or erroneous code from affecting the host system:

+----------------------------------+
| SANDBOX (Docker / subprocess)    |
|                                  |
| - Resource limits (CPU, memory)  |
| - Timeout enforcement (e.g. 30s) |
| - No network access              |
| - No filesystem access           |
| - Restricted imports             |
| - Output capture (stdout/stderr) |
+----------------------------------+
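A minimal sketch of the subprocess variant of this sandbox: run the candidate in a separate interpreter with a wall-clock timeout and capture stdout/stderr. The resource limits, import restrictions, and Docker isolation listed above would be layered on top; `sandbox_run` is an assumed name, not DGM's actual API:

```python
# Minimal sketch of the subprocess sandbox: child interpreter,
# wall-clock timeout, captured output. (Memory/CPU limits and import
# restrictions would be layered on top, e.g. resource.setrlimit or Docker.)
import subprocess
import sys

def sandbox_run(candidate_code, timeout=5):
    """Execute candidate code in a child process; return (ok, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", candidate_code],  # -I: isolated mode
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

ok, out = sandbox_run("print(sum(range(10)))")
print(ok, out.strip())  # True 45
bad_ok, _ = sandbox_run("while True: pass", timeout=1)
print(bad_ok)  # False
```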

9. Components

Problem Definition Module

  • Problem class specification (interface for solve())
  • Instance generators for multiple distributions
  • Fitness/scoring function per instance
  • Train/test distribution split
  • Difficulty calibration

Population Store

  • Program storage (source code as strings)
  • Fitness vector per program (score per distribution)
  • Diversity embeddings (code similarity metrics)
  • Genealogy tree (parent-child relationships)
  • Archive of elite solutions

Parent Selection Module

  • Tournament selection
  • Fitness-proportional sampling
  • Diversity-aware selection (MAP-Elites-style)
  • Age-based weighting
  • Multi-objective selection (NSGA-II-style)

LLM Mutation Module

  • Prompt construction engine
  • Mutation type sampler
  • Multi-model support
  • Response parsing and validation
  • Retry logic for malformed outputs

Fitness Evaluation Module

  • Sandboxed code execution
  • Multi-instance evaluation
  • Generality score computation
  • Timeout and error handling
  • Caching for duplicate programs

Population Management Module

  • Survival selection (elitism + diversity)
  • Population size control
  • Niching and speciation
  • Stagnation detection
  • Archive maintenance

10. Core Mechanisms

10.1 Code Modification and Mutation Operators

DGM's mutation operators are the heart of the system. Each mutation is an LLM call that transforms one or more parent programs into a child program:

import random  # used by the weighted mutation-type sampler below

class MutationOperator:
    """LLM-based mutation operator for code evolution."""

    MUTATION_TYPES = {
        "refine": {
            "weight": 0.40,
            "n_parents": 1,
            "instruction": (
                "Improve this algorithm. Fix bugs, optimize performance, "
                "or enhance its handling of edge cases. Make small, "
                "targeted improvements."
            )
        },
        "crossover": {
            "weight": 0.25,
            "n_parents": 2,
            "instruction": (
                "You are given two algorithms that solve the same problem. "
                "Create a new algorithm that combines the best ideas from "
                "both. The result should be better than either parent."
            )
        },
        "radical": {
            "weight": 0.15,
            "n_parents": 1,
            "instruction": (
                "Completely redesign this algorithm using a fundamentally "
                "different approach. Think outside the box. The new "
                "algorithm should use a different strategy entirely."
            )
        },
        "simplify": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Simplify this algorithm. Remove unnecessary complexity "
                "while preserving or improving performance."
            )
        },
        "generalize": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Make this algorithm more general. It should handle "
                "a wider variety of input distributions and edge cases."
            )
        }
    }

    def mutate(self, parents, problem_desc, mutation_type=None):
        """Apply LLM-based mutation to produce a child program."""
        if mutation_type is None:
            mutation_type = self._sample_mutation_type()

        config = self.MUTATION_TYPES[mutation_type]
        selected_parents = parents[:config["n_parents"]]

        prompt = self._build_prompt(
            selected_parents, problem_desc, config["instruction"]
        )

        response = self.llm_client.generate(
            prompt=prompt,
            temperature=0.8,  # Higher temp for diversity
            max_tokens=2048
        )

        child_code = self._extract_function(response)
        return child_code

    def _sample_mutation_type(self):
        """Sample mutation type according to weights."""
        types = list(self.MUTATION_TYPES.keys())
        weights = [self.MUTATION_TYPES[t]["weight"] for t in types]
        return random.choices(types, weights=weights, k=1)[0]

10.2 Parent Selection and Sampling

DGM uses a multi-objective parent selection strategy that balances fitness and diversity:

Selection probability:

$$ P(\text{select } i) \propto f(i)^{\alpha} \cdot d(i)^{\beta} $$

where $f(i)$ is the fitness rank of individual $i$, $d(i)$ its diversity contribution, and $\alpha + \beta = 1$.

class ParentSelector:
    """Multi-objective parent selection."""

    def __init__(self, alpha=0.7, beta=0.3, tournament_size=5):
        self.alpha = alpha   # fitness weight
        self.beta = beta     # diversity weight
        self.tournament_size = tournament_size

    def select(self, population, n_parents=1):
        """Select parents using fitness-diversity tournament."""
        selected = []
        for _ in range(n_parents):
            # Tournament selection
            candidates = random.sample(population, self.tournament_size)

            # Score each candidate
            scores = []
            for c in candidates:
                fitness_rank = c.generality_score  # [0, 1]
                diversity = self._compute_diversity(c, population)
                combined = (fitness_rank ** self.alpha) * (diversity ** self.beta)
                scores.append(combined)

            # Select best in tournament
            best_idx = max(range(len(candidates)), key=lambda i: scores[i])
            selected.append(candidates[best_idx])

        return selected

    def _compute_diversity(self, individual, population):
        """Compute diversity contribution of an individual."""
        # Average code distance to all other individuals
        distances = [
            self._code_distance(individual.code, other.code)
            for other in population if other != individual
        ]
        return sum(distances) / len(distances) if distances else 1.0

    def _code_distance(self, code_a, code_b):
        """Normalized edit distance between code strings."""
        from difflib import SequenceMatcher
        ratio = SequenceMatcher(None, code_a, code_b).ratio()
        return 1.0 - ratio  # distance = 1 - similarity

10.3 Population Management

DGM maintains a structured population with elitism and diversity guarantees:

from difflib import SequenceMatcher

class PopulationManager:
    """Manage the evolving population of candidate algorithms."""

    def __init__(self, max_size=50, elite_fraction=0.2,
                 diversity_threshold=0.1):
        self.max_size = max_size
        self.elite_fraction = elite_fraction
        self.diversity_threshold = diversity_threshold
        self.population = []
        self.archive = []  # Hall of fame

    def update(self, children):
        """Add children and trim population."""
        # Add valid children
        for child in children:
            if self._is_sufficiently_novel(child):
                self.population.append(child)

        # Sort by generality score
        self.population.sort(
            key=lambda x: x.generality_score, reverse=True
        )

        # Preserve elites
        n_elite = int(self.max_size * self.elite_fraction)
        elites = self.population[:n_elite]

        # Fill remaining slots with diversity-aware selection
        remaining = self.population[n_elite:]
        diverse_fill = self._diversity_fill(
            remaining, self.max_size - n_elite
        )

        self.population = elites + diverse_fill

        # Update archive with best-ever individuals
        self._update_archive()

    def _is_sufficiently_novel(self, candidate):
        """Check if candidate is sufficiently different from population."""
        for existing in self.population:
            similarity = SequenceMatcher(
                None, candidate.code, existing.code
            ).ratio()
            if similarity > (1.0 - self.diversity_threshold):
                return False  # Too similar to existing
        return True

    def _diversity_fill(self, candidates, n_slots):
        """Select diverse individuals to fill remaining slots."""
        selected = []
        for _ in range(min(n_slots, len(candidates))):
            # Greedily select most diverse candidate
            best_div, best_idx = -1, 0
            for i, c in enumerate(candidates):
                div = min(
                    [self._code_dist(c, s) for s in selected]
                    or [1.0]
                )
                if div > best_div:
                    best_div, best_idx = div, i
            selected.append(candidates.pop(best_idx))
        return selected

    def _code_dist(self, a, b):
        """Normalized edit distance between two individuals' code."""
        return 1.0 - SequenceMatcher(None, a.code, b.code).ratio()

10.4 Solution Diversity and Novelty

DGM uses multiple diversity metrics to prevent convergence:

| Metric | Description | Formula |
| --- | --- | --- |
| Syntactic Diversity | Edit distance between code strings | `1 - SequenceMatcher.ratio()` |
| Behavioral Diversity | Difference in outputs on a reference set of instances | `\|{i : f_a(i) != f_b(i)}\| / \|instances\|` |
| Structural Diversity | AST-level comparison of algorithmic structure | Tree edit distance on ASTs |
| Performance Profile Diversity | Difference in per-distribution performance vectors | `cosine_distance(perf_vec_a, perf_vec_b)` |

Behavioral distance:

$$ D_{\text{behav}}(f_a, f_b) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[f_a(x_i) \neq f_b(x_i)\right] $$

Population diversity:

$$ PD = \frac{2}{M(M-1)} \sum_{a < b} D(a, b) $$
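Both formulas translate directly into code. In this sketch the "programs" are plain Python callables and the reference instances are integers:

```python
# Direct implementation of behavioral distance and population diversity;
# "programs" here are plain callables, reference instances are integers.
from itertools import combinations

def behavioral_distance(f_a, f_b, instances):
    """Fraction of reference instances on which the outputs disagree."""
    return sum(f_a(x) != f_b(x) for x in instances) / len(instances)

def population_diversity(programs, instances):
    """Mean pairwise behavioral distance over all M(M-1)/2 pairs."""
    pairs = list(combinations(programs, 2))
    return sum(behavioral_distance(a, b, instances) for a, b in pairs) / len(pairs)

reference = list(range(10))
programs = [lambda x: x % 2, lambda x: x % 3, lambda x: 0]
pd = population_diversity(programs, reference)
print(round(pd, 3))  # 0.567
```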

10.5 LLM Orchestration and Model Selection

DGM can use multiple LLMs with different roles and cost profiles:

class LLMOrchestrator:
    """Orchestrate multiple LLMs for different mutation types."""

    def __init__(self):
        self.model_configs = {
            "high_quality": {
                "model": "gpt-4o",
                "temperature": 0.7,
                "use_for": ["crossover", "radical"]
            },
            "cost_effective": {
                "model": "deepseek-coder",
                "temperature": 0.8,
                "use_for": ["refine", "simplify"]
            },
            "diverse": {
                "model": "claude-3-5-sonnet",
                "temperature": 0.9,
                "use_for": ["generalize"]
            }
        }

    def select_model(self, mutation_type):
        """Select the best model for a given mutation type."""
        for config in self.model_configs.values():
            if mutation_type in config["use_for"]:
                return config["model"], config["temperature"]
        return "gpt-4o", 0.7  # default

10.6 Search Strategies

DGM employs a multi-faceted search strategy:

  • Evolutionary search (primary): Population-based search with selection, mutation, and survival
  • Island model parallelism: Multiple sub-populations evolve independently with occasional migration of elite individuals between islands
  • Adaptive mutation rates: When population diversity drops below a threshold, radical mutation rates increase; when fitness stagnates, crossover rates increase
  • Restart mechanism: If all islands stagnate for N generations, partial population reinitialization occurs

class AdaptiveMutationScheduler:
    """Dynamically adjust mutation type probabilities."""

    def __init__(self):
        self.base_weights = {
            "refine": 0.4, "crossover": 0.25,
            "radical": 0.15, "simplify": 0.1,
            "generalize": 0.1
        }
        self.stagnation_counter = 0

    def get_weights(self, pop_diversity, fitness_improving):
        """Adapt mutation weights based on search state."""
        weights = dict(self.base_weights)

        # Low diversity -> more radical mutations
        if pop_diversity < 0.3:
            weights["radical"] *= 2.0
            weights["generalize"] *= 1.5

        # Fitness stagnating -> more crossover
        if not fitness_improving:
            self.stagnation_counter += 1
            if self.stagnation_counter > 5:
                weights["crossover"] *= 2.0
                weights["radical"] *= 1.5
        else:
            self.stagnation_counter = 0

        # Normalize
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
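The island model mentioned above can be sketched as follows, assuming a ring topology and representing each individual by its fitness score alone (an illustration of the migration mechanism, not DGM's implementation):

```python
# Sketch of the island model: sub-populations evolve independently and
# periodically exchange elites around a ring. Individuals are reduced
# to their fitness value; topology and migration size are assumptions.
import random

random.seed(1)

def migrate(islands, n_migrants=1):
    """Copy each island's elites to the next island in the ring."""
    elites = [sorted(isl, reverse=True)[:n_migrants] for isl in islands]
    for i, isl in enumerate(islands):
        isl.sort()                                         # worst first
        isl[:n_migrants] = elites[(i - 1) % len(islands)]  # replace worst

islands = [[random.random() for _ in range(5)] for _ in range(3)]
best_before = max(max(isl) for isl in islands)
migrate(islands)
best_after = max(max(isl) for isl in islands)
print(best_before == best_after)  # True: elites are copied, never lost
```

Because elites are copied (not moved) and only each island's worst individuals are overwritten, migration can spread good solutions without ever discarding the global best.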

10.7 Evaluation and Fitness Assessment

The fitness evaluation is the most distinctive component of DGM — it explicitly measures generality:

$$ \text{Fitness}(f) = \alpha \cdot \text{Generality}(f) + \beta \cdot \text{BestCase}(f) + \gamma \cdot \text{Robustness}(f) $$

$$ \text{Generality}(f) = \frac{1}{K} \sum_{k=1}^{K} \operatorname{mean}_{x \in D_k}\big[\text{score}(f(x), \text{optimal}(x))\big] $$

$$ \text{BestCase}(f) = \max_{k} \operatorname{mean}_{x \in D_k}\big[\text{score}(f(x), \text{optimal}(x))\big] $$

$$ \text{Robustness}(f) = 1 - \mathrm{CV}(\{S_1, \ldots, S_K\}), \qquad \mathrm{CV} = \sigma / \mu $$

where $S_k$ is the mean score on distribution $D_k$.

import numpy as np

class GeneralityFitness:
    """Evaluate fitness with emphasis on generality.

    Assumes an external `sandbox_execute(code, instance, timeout)`
    helper that runs candidate code in an isolated environment.
    """

    def __init__(self, distributions, instances_per_dist=50,
                 alpha=0.6, beta=0.2, gamma=0.2):
        self.distributions = distributions
        self.instances_per_dist = instances_per_dist
        self.alpha = alpha  # generality weight
        self.beta = beta    # best-case weight
        self.gamma = gamma  # robustness weight

    def evaluate(self, candidate_code, timeout=30):
        """Evaluate a candidate algorithm on all distributions."""
        per_dist_scores = {}

        for dist_name, dist in self.distributions.items():
            instances = dist.generate(self.instances_per_dist)
            scores = []

            for instance in instances:
                try:
                    result = sandbox_execute(
                        candidate_code, instance, timeout=timeout
                    )
                    score = self._score_result(result, instance)
                    scores.append(score)
                except (TimeoutError, RuntimeError):
                    scores.append(0.0)

            per_dist_scores[dist_name] = np.mean(scores)

        # Compute composite fitness
        all_scores = list(per_dist_scores.values())
        generality = np.mean(all_scores)
        best_case = np.max(all_scores)
        robustness = 1.0 - (np.std(all_scores) /
                     (np.mean(all_scores) + 1e-8))

        fitness = (self.alpha * generality +
                   self.beta * best_case +
                   self.gamma * robustness)

        return {
            "fitness": fitness,
            "generality": generality,
            "best_case": best_case,
            "robustness": robustness,
            "per_distribution": per_dist_scores
        }
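
As a worked example of the composite above, using the class's default weights and three hypothetical per-distribution scores:

```python
import numpy as np

# Hypothetical per-distribution mean scores for one candidate
scores = np.array([0.8, 0.6, 0.7])
alpha, beta, gamma = 0.6, 0.2, 0.2  # default weights

generality = scores.mean()                       # 0.70
best_case = scores.max()                         # 0.80
robustness = 1.0 - scores.std() / scores.mean()  # ~0.883 (1 - CV)

fitness = alpha * generality + beta * best_case + gamma * robustness
# fitness ~= 0.757: high average, decent peak, low variance across distributions
```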

10.8 Prompt Engineering and Co-Evolution

DGM incorporates several sophisticated prompt engineering techniques:

  • Performance-aware prompts: Parent fitness scores are included in the mutation prompt so the LLM understands which aspects to improve
  • Failure-case injection: When a parent fails on specific instances, those failure cases are described in the prompt to guide improvement
  • Cross-pollination context: For crossover mutations, the prompt explicitly describes the strengths of each parent
  • Generality emphasis: All prompts emphasize that the algorithm must work across diverse inputs, not just specific cases

def build_mutation_prompt(parent, mutation_type, problem_desc, failures=None):
    """Build a context-rich mutation prompt."""
    prompt = f"""You are an expert algorithm designer.

## Problem
{problem_desc}

## Current Algorithm (Parent)
```python
{parent.code}
```

## Parent Performance
- Overall generality score: {parent.generality_score:.3f}
- Per-distribution scores:
"""
    for dist, score in parent.per_dist_scores.items():
        prompt += f"  - {dist}: {score:.3f}\n"

    if failures:
        prompt += "\n## Failure Cases (algorithm performs poorly on these):\n"
        for f in failures[:3]:
            prompt += (f"  - Input: {f['input']}, "
                       f"Expected: {f['expected']}, Got: {f['got']}\n")

    prompt += f"""
## Mutation Instruction
{MutationOperator.MUTATION_TYPES[mutation_type]['instruction']}

IMPORTANT: The algorithm will be tested on MANY DIVERSE inputs. It must be
general, not specialized to any particular input type.

Generate only the Python function:

```python
def solve(problem_instance):
    ...
```"""
    return prompt

10.9 Cost Control

DGM implements several cost control mechanisms:

  • Tiered model allocation: Expensive models (GPT-4) used only for high-value mutations (crossover, radical); cheaper models for routine refinements
  • Evaluation budgeting: Each generation has a fixed evaluation budget; if exceeded, remaining mutations are deferred
  • Early stopping: Program evaluation is terminated early if the program crashes, times out, or clearly underperforms
  • Duplicate detection: Syntactically identical or near-identical programs skip evaluation
  • Adaptive evaluation depth: Initial fitness uses fewer instances; promising candidates get full evaluation

class CostController:
    """Control API and compute costs across the search."""

    def __init__(self, total_budget_usd=200.0,
                 per_generation_budget=5.0):
        self.total_budget = total_budget_usd
        self.per_gen_budget = per_generation_budget
        self.spent = 0.0
        self.gen_spent = 0.0

    def can_mutate(self):
        return (self.spent < self.total_budget and
                self.gen_spent < self.per_gen_budget)

    def log_cost(self, cost):
        self.spent += cost
        self.gen_spent += cost

    def new_generation(self):
        self.gen_spent = 0.0

    def adaptive_eval_depth(self, candidate_fitness_estimate):
        """More instances for promising candidates."""
        if candidate_fitness_estimate > 0.8:
            return 100  # full evaluation
        elif candidate_fitness_estimate > 0.5:
            return 30   # moderate evaluation
        else:
            return 10   # quick screening
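
The duplicate-detection mechanism listed above can be sketched with a content hash over normalized source; the normalization here is a simplifying assumption (real near-duplicate detection would more likely use AST comparison or edit distance):

```python
import hashlib
import re

def normalize(code):
    """Strip comments and collapse whitespace so trivially different
    programs hash identically (illustrative normalization only)."""
    code = re.sub(r"#.*", "", code)
    return re.sub(r"\s+", " ", code).strip()

seen = set()

def is_duplicate(code):
    """Return True if an equivalent program was already evaluated."""
    h = hashlib.sha256(normalize(code).encode()).hexdigest()
    if h in seen:
        return True
    seen.add(h)
    return False

is_duplicate("def solve(x):\n    return x  # v1")     # False: first sighting
is_duplicate("def solve(x):\n    return x  # tweak")  # True: same after normalization
```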

10.10 Meta-Level and Self-Improvement

DGM incorporates meta-level adaptation mechanisms:

  • Adaptive mutation distribution: The probabilities of different mutation types are adjusted based on which types have historically produced the best improvements
  • Success tracking: Each mutation type's success rate (fraction of children that improve on parents) is tracked and used to adjust future probabilities
  • Prompt evolution: The mutation prompts themselves can be refined based on which prompt styles produce better offspring (a form of prompt co-evolution)
  • Hyperparameter adaptation: Population size, selection pressure, and diversity thresholds can be adjusted based on search progress

class MetaLearner:
    """Track and adapt search hyperparameters."""

    def __init__(self):
        self.mutation_success_rates = {
            "refine": [], "crossover": [],
            "radical": [], "simplify": [],
            "generalize": []
        }

    def record_outcome(self, mutation_type, parent_fitness, child_fitness):
        improved = child_fitness > parent_fitness
        self.mutation_success_rates[mutation_type].append(improved)

    def get_adapted_weights(self, window=50):
        """Get mutation weights adapted by recent success rates."""
        weights = {}
        for mt, outcomes in self.mutation_success_rates.items():
            recent = outcomes[-window:]
            if recent:
                # Success rate + small baseline to never fully disable
                weights[mt] = sum(recent) / len(recent) + 0.05
            else:
                weights[mt] = 0.2  # default
        # Normalize
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
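
A small standalone demonstration of the success-rate weighting in `get_adapted_weights` (the outcome data is made up for illustration):

```python
# Recent outcomes per mutation type (True = child improved on parent)
outcomes = {
    "refine": [True, True, False, True],  # 75% success
    "crossover": [False, False, True],    # 33% success
    "radical": [],                        # no data yet
}

weights = {}
for mt, recent in outcomes.items():
    # success rate plus a small floor so no operator is fully disabled
    weights[mt] = (sum(recent) / len(recent) + 0.05) if recent else 0.2

total = sum(weights.values())
weights = {k: v / total for k, v in weights.items()}
# "refine" gets the largest share; "radical" keeps the 0.2 default
```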

10.11 Mathematical Framework

Formal Problem Definition

$$ f^{*} = \arg\max_{f \in \mathcal{F}} G(f) $$

$$ G(f) = \mathbb{E}_{D \sim P(\mathcal{D})}\Big[\, \mathbb{E}_{x \sim D}\big[\text{score}(f(x), \text{optimal}(x))\big] \,\Big] $$

where $\mathcal{F}$ is the space of all valid Python functions with signature `solve(instance) -> solution`.

Evolutionary Dynamics

$$ P_{t+1} = \text{Survive}\big(P_t \cup \text{Mutate}(\text{Select}(P_t))\big) $$

$$ \text{Select}(P_t) = \big\{ p_i \in P_t : \text{sampled with probability} \propto f(p_i)^{\alpha} \cdot d(p_i)^{\beta} \big\} $$

$$ \text{Mutate}(S) = \big\{ \text{LLM}(\text{prompt}(s, \text{mutation type})) : s \in S \big\} $$

$$ \text{Survive}(Q) = \text{Elite}(Q, k) \cup \text{DiverseSelect}\big(Q \setminus \text{Elite}(Q, k),\; |P| - k\big) $$
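
These dynamics can be exercised with a minimal, self-contained sketch; a random perturbation stands in for the LLM mutation operator, and the diversity term $d(p_i)^\beta$ in selection is omitted for brevity (all names hypothetical):

```python
import random

def select(pop, n, alpha=2.0):
    """Fitness-proportionate sampling with selection pressure alpha."""
    return random.choices(pop, weights=[f ** alpha for f, _ in pop], k=n)

def mutate(parents):
    """Toy stand-in for the LLM operator: perturb fitness, tag the code."""
    return [(min(1.0, max(0.0, f + random.uniform(-0.1, 0.1))), code + "'")
            for f, code in parents]

def survive(pop, size, n_elite=2):
    """Elitism plus a random (diversity-preserving) sample of the rest."""
    ranked = sorted(pop, key=lambda ind: ind[0], reverse=True)
    return ranked[:n_elite] + random.sample(ranked[n_elite:], size - n_elite)

def step(pop):
    """P_{t+1} = Survive(P_t U Mutate(Select(P_t)))"""
    children = mutate(select(pop, len(pop) // 2))
    return survive(pop + children, len(pop))

random.seed(0)
pop = [(random.random(), f"algo_{i}") for i in range(10)]
for _ in range(5):
    pop = step(pop)
```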

Convergence Analysis

import numpy as np

def track_convergence(evolution_history):
    """Analyze convergence of the evolutionary search."""
    generations = []
    best_fitness = []
    mean_fitness = []
    diversity = []

    for gen, population in enumerate(evolution_history):
        fitnesses = [ind.fitness for ind in population]
        generations.append(gen)
        best_fitness.append(max(fitnesses))
        mean_fitness.append(np.mean(fitnesses))

        # Pairwise diversity
        codes = [ind.code for ind in population]
        pair_dists = []
        for i in range(len(codes)):
            for j in range(i + 1, len(codes)):
                pair_dists.append(code_distance(codes[i], codes[j]))
        diversity.append(np.mean(pair_dists) if pair_dists else 0)

    # Convergence rate: slope of best fitness over the last 20 generations
    coeffs = np.polyfit(generations[-20:], best_fitness[-20:], 1)
    convergence_rate = coeffs[0]

    return {
        "generations": generations,
        "best_fitness": best_fitness,
        "mean_fitness": mean_fitness,
        "diversity": diversity,
        "convergence_rate": convergence_rate,
        "converged": abs(convergence_rate) < 1e-4
    }

Generality vs. Specialization Tradeoff

$$ \text{ParetoFront} = \big\{ f \in P : \nexists\, g \in P \text{ such that } G(g) \ge G(f) \text{ and } S(g) > S(f) \big\} $$

where $G(f)$ is the generality score and $S(f)$ is the best single-distribution score.

def compute_pareto_front(population):
    """Compute the Pareto front of generality vs. specialization."""
    # Each individual has (generality, best_case) scores
    points = [(ind.generality_score, ind.best_case_score, ind)
              for ind in population]

    # Sort by generality descending
    points.sort(key=lambda x: -x[0])

    pareto = []
    max_special = -1

    for gen, spec, ind in points:
        if spec > max_special:
            pareto.append(ind)
            max_special = spec

    return pareto

11. Programming Language

| Component | Language | Purpose |
|---|---|---|
| Search engine | Python 3.10+ | Evolutionary loop, population management |
| Candidate solutions | Python | All evolved algorithms are Python functions |
| Fitness evaluation | Python | Sandboxed execution of candidate code |
| LLM integration | Python | API clients for OpenAI, Anthropic, Google |
| Analysis & visualization | Python (matplotlib, seaborn) | Evolution tracking, Pareto analysis |
| Sandbox | Docker / subprocess | Isolated execution environment |
| Configuration | YAML | Search hyperparameters, model configs |

Python is the natural choice because (a) LLMs are most fluent in Python code generation, (b) the scientific Python ecosystem provides evaluation infrastructure, and (c) the candidate solutions themselves are Python functions, creating a unified stack.

12. Memory Management

Population Memory

  • Active population: Stored in memory as a list of Individual objects (code string + metadata). Typically 50-200 individuals, each consuming a few KB.
  • Archive: Hall-of-fame of historically best individuals, bounded in size (e.g., top 50 ever seen)
  • Genealogy tree: Parent-child relationships tracked for analysis (can be persisted to disk)

LLM Context Management

  • Prompt size: Each mutation prompt typically 1,500-3,000 tokens (parent code + instructions). Kept well within context limits.
  • No cross-generation context: Each LLM call is independent — the LLM does not accumulate state across generations. This is by design: statefulness is managed by the evolutionary algorithm, not the LLM.
  • Multi-parent prompts: Crossover prompts include 2 parent programs, roughly doubling context size. Still well within limits for modern LLMs.

Evaluation Memory

import hashlib

class EvaluationCache:
    """Cache fitness evaluations to avoid redundant computation."""

    def __init__(self, max_cache_size=10000):
        self.cache = {}
        self.max_size = max_cache_size

    def get_or_evaluate(self, code, evaluator):
        code_hash = hashlib.sha256(code.encode()).hexdigest()
        if code_hash in self.cache:
            return self.cache[code_hash]

        result = evaluator.evaluate(code)
        self.cache[code_hash] = result

        # Evict oldest entries if cache is full
        if len(self.cache) > self.max_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]

        return result

Disk Persistence

  • Population checkpointed to disk every N generations for crash recovery
  • Evolution history (fitness curves, diversity metrics) logged to disk in real-time
  • Best discovered algorithms saved as standalone Python files

13. Continued Learning

Warm-Starting

DGM supports warm-starting evolution from previously discovered algorithms:

  • Seed population: Initialize with algorithms from a previous run on the same or related problem
  • Transfer across domains: Use algorithms discovered for bin packing as seeds for scheduling problems
  • Human-seeded evolution: Include human-written heuristics (e.g., First-Fit-Decreasing) in the initial population
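
Human-seeded evolution can be illustrated by mixing a classic heuristic such as First-Fit-Decreasing into the initial population; the instance format (`items`, `capacity` keys) and helper names are assumptions for this sketch:

```python
HUMAN_SEEDS = [
    # First-Fit-Decreasing for bin packing, as a seed individual's code
    '''
def solve(instance):
    items = sorted(instance["items"], reverse=True)
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= instance["capacity"]:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins
''',
]

def init_population(llm_samples, human_seeds=HUMAN_SEEDS):
    """Mix human-written heuristics into the LLM-sampled initial population."""
    return list(human_seeds) + list(llm_samples)
```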

Cross-Problem Learning

One of DGM's most ambitious goals is discovering algorithmic primitives that transfer across entirely different problem domains. For example, a sorting-based heuristic discovered for bin packing might contain a priority function that is useful for job scheduling.

import ast

class TransferLearning:
    """Transfer evolved algorithms across problem domains."""

    def extract_primitives(self, elite_population):
        """Extract reusable algorithmic primitives from elites."""
        primitives = []
        for individual in elite_population:
            # Parse AST to find reusable sub-functions
            tree = ast.parse(individual.code)
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef):
                    if node.name != "solve":  # helper functions
                        primitives.append({
                            "name": node.name,
                            "code": ast.unparse(node),
                            "source_fitness": individual.fitness
                        })
        return primitives

    def seed_new_problem(self, primitives, new_problem):
        """Create seed population for new problem using primitives."""
        seeds = []
        for prim in primitives:
            # Use LLM to adapt primitive to new problem
            prompt = f"""Adapt this algorithmic primitive for a new problem.

Primitive:
```python
{prim['code']}
```

New problem: {new_problem.description}

Create a complete solve() function that uses this primitive."""
            adapted = self.llm.generate(prompt)
            seeds.append(adapted)
        return seeds

Incremental Evolution

  • New problem instances can be added to the evaluation suite without restarting evolution
  • As new distributions are added, the fitness landscape changes and evolution adapts
  • This enables a form of continual learning where the algorithm population becomes more general over time

14. Applications and Use Cases

Primary Applications

| Application | Description | Impact |
|---|---|---|
| Combinatorial Optimization | Discovering heuristics for NP-hard problems (bin packing, TSP, scheduling) | General heuristics that work across real-world distributions without manual tuning |
| ML Algorithm Design | Evolving loss functions, optimizers, regularization strategies | Data-driven discovery of ML components that generalize across datasets |
| Scientific Computing | Discovering numerical methods and approximation algorithms | Novel numerical algorithms that handle diverse equation types |
| Program Synthesis | Automatically generating correct and efficient programs from specifications | Generalizable solution strategies rather than per-instance code |
| Automated Mathematics | Discovering constructions for extremal combinatorics and related fields | Potential for novel mathematical results (following FunSearch tradition) |

Industrial Use Cases

Logistics and Supply Chain

DGM-discovered bin packing and scheduling algorithms could replace hand-tuned heuristics in warehouse operations, vehicle routing, and resource allocation, automatically adapting to changing demand distributions.


Chip Design / EDA

Placement and routing heuristics for VLSI design that generalize across different chip architectures rather than requiring per-chip tuning.


Drug Discovery

General molecular optimization algorithms that work across chemical families, rather than being specialized to specific target proteins.


Financial Optimization

Portfolio optimization and risk management algorithms that remain robust across different market regimes.


Research Implications

  • Automated algorithm design: DGM represents a step toward fully automated algorithm design, where the role of the human shifts from writing algorithms to defining problems and fitness functions
  • LLM as creative search: Demonstrates that LLMs can serve as effective search operators in high-dimensional program spaces, combining syntactic validity with semantic understanding
  • Generality as an objective: Establishes a framework for explicitly optimizing for generality, which has broad implications for machine learning and AI research
  • Open-ended evolution: The combination of LLM mutation, diversity maintenance, and cross-domain transfer points toward open-ended evolutionary systems that continuously discover novel algorithms

**Significance:** DGM represents a paradigm shift in algorithm design. Rather than humans crafting heuristics by hand, DGM uses LLMs as intelligent search operators to explore the space of possible algorithms, with explicit selection pressure for generality. The resulting algorithms are not just competitive with human-designed solutions — they transfer to unseen problem instances and distributions in ways that specialized solutions cannot.


DGM (Discovering General Methods) Technical Report — Generated March 2026.
Based on: Sakana AI, "Discovering General Methods" (2025). Source: sakana.ai/dgm