Discovering General Methods (DGM)
LLM-Driven Evolutionary Search for General-Purpose Algorithmic Solutions
Organization: Sakana AI
Year: 2025
Domain: Evolutionary Search, Algorithm Discovery, LLM-Guided Evolution, Foundation Model Programming
Table of Contents
- Full Title and Authors
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Cost Analysis
- Architecture Solution
- Components
- Core Mechanisms
- Programming Language
- Memory Management
- Continued Learning
- Applications and Use Cases
1. Full Title and Authors
**Full Title:** Discovering General Methods: LLM-Driven Evolutionary Search for General-Purpose Algorithms
**Organization:** Sakana AI (Tokyo, Japan)
**Key Researchers:** Sakana AI Research Team, including Chris Lu, Robert Lange, David Ha, and collaborators
**Year:** 2025
**Source:** [sakana.ai/dgm](https://sakana.ai/dgm/)
DGM (Discovering General Methods) is a system from Sakana AI that uses LLMs as mutation operators within an evolutionary search framework to discover general-purpose algorithms — solutions that transfer across multiple problem instances and domains rather than being specialized to a single task. It builds upon and extends the lineage of FunSearch (DeepMind) and Sakana AI's own "AI Scientist" line of work.
2. Core Contribution
Problem Statement
Traditional program synthesis and algorithm discovery methods face a fundamental tension:
- Specialist solutions (overfitting to specific problem instances) are easy to find but do not generalize
- General methods (algorithms that work across many instances and problem types) are far more valuable but exponentially harder to discover
Existing LLM-driven code generation approaches (FunSearch, EvoPrompting, etc.) typically optimize for performance on a fixed set of test cases, producing solutions that may overfit to those specific cases. DGM addresses this by explicitly optimizing for generality — the ability of discovered algorithms to solve unseen problem instances.
What Is Novel
**Key Innovations:**
**Generality-aware fitness:** The fitness function explicitly rewards algorithms that perform well across *diverse* problem instances, not just a fixed training set
**Cross-domain transfer evaluation:** Discovered methods are evaluated on out-of-distribution problem instances and even entirely different problem domains
**LLM-as-mutation-operator:** Large language models serve as intelligent mutation operators that make semantically meaningful code modifications, far surpassing random syntactic mutations
**Population diversity mechanisms:** Explicit mechanisms to maintain a diverse population of candidate algorithms, preventing premature convergence to local optima
**Hierarchical search:** The system can discover both low-level algorithmic primitives and high-level compositions of those primitives
Relationship to Prior Work
| System | Organization | Key Difference from DGM |
|---|---|---|
| FunSearch | DeepMind | Optimizes for a fixed fitness function; DGM optimizes for generality |
| EvoPrompting | Various | Evolves prompts; DGM directly evolves code |
| The AI Scientist | Sakana AI | Generates papers/ideas; DGM discovers working algorithms |
| OpenELM | Various | Focuses on single-domain; DGM targets cross-domain generality |
| AlphaCode | DeepMind | Generates + filters code; DGM evolves populations iteratively |
3. Supported Solutions
DGM is designed to discover general-purpose algorithms across multiple problem domains:
| Domain | Problem Types | Example Tasks |
|---|---|---|
| Combinatorial Optimization | NP-hard and combinatorial search problems | Bin packing, TSP heuristics, graph coloring, scheduling |
| Mathematical Reasoning | Algorithmic solutions to mathematical problems | Cap set problem, extremal combinatorics, number theory |
| Machine Learning Algorithms | Loss functions, optimizers, architectures | Novel activation functions, learning rate schedules, regularizers |
| Search and Sorting | Algorithmic primitives | Priority functions, comparison operators, heuristic designs |
| Scientific Computing | Numerical methods and approximation algorithms | ODE solvers, integration methods, approximation schemes |
Solution Representation
Solutions in DGM are represented as executable Python functions with a standardized interface:
```python
# Standard DGM solution interface
def solve(problem_instance: ProblemInstance) -> Solution:
    """
    A general-purpose algorithm that takes a problem instance
    and returns a solution.

    The 'generality' of this function is measured by how well
    it performs across diverse problem instances.
    """
    # Algorithm implementation here
    ...
    return solution
```
The key constraint is that solutions must be self-contained Python functions that can be evaluated independently. This enables clean fitness evaluation and population management.
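As an illustration (not taken from the paper), a hypothetical candidate conforming to this interface might be a first-fit bin-packing heuristic, assuming `problem_instance` is a list of item sizes in (0, 1] and the bin capacity is 1.0:

```python
# Hypothetical solve() candidate: first-fit bin packing.
# Assumes problem_instance is a list of item sizes in (0, 1]
# and each bin has capacity 1.0; returns a list of bins.
def solve(problem_instance):
    bins = []
    for item in problem_instance:
        # Place the item in the first bin with enough remaining capacity
        for b in bins:
            if sum(b) + item <= 1.0:
                b.append(item)
                break
        else:
            bins.append([item])  # no bin fits: open a new one
    return bins
```

A self-contained function like this can be evaluated in isolation, which is exactly what the population manager needs.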
4. LLM Integration
Models Used
| Model | Role in DGM | Why Selected |
|---|---|---|
| GPT-4 / GPT-4o | Primary mutation operator | Strong code generation and reasoning capabilities |
| Claude 3.5 Sonnet | Alternative mutation operator | Competitive code quality, cost-effective |
| Gemini 1.5 Pro | Alternative mutation operator | Large context window for complex code understanding |
| Code Llama / DeepSeek Coder | Cost-effective mutation operator | Open-source, lower cost for high-volume mutations |
How LLMs Are Used: The LLM-as-Mutation-Operator Paradigm
The central innovation of DGM (and its predecessors like FunSearch) is using LLMs as intelligent mutation operators in an evolutionary algorithm. Instead of random syntactic mutations (flipping bits, swapping tokens), the LLM understands the semantics of code and makes meaningful modifications:
TRADITIONAL EVOLUTION              LLM-GUIDED EVOLUTION (DGM)
====================               ==========================
Parent Program                     Parent Program(s)
      |                                    |
      v                                    v
Random Mutation                    LLM Mutation Operator
(bit flip, swap,                   (semantic understanding,
 insert, delete)                    algorithmic reasoning,
      |                             cross-pollination)
      v                                    |
Child Program                              v
(often broken,                     Child Program
 usually small delta)              (syntactically valid,
                                    potentially novel algorithm)
Mutation Prompt Structure
```python
# DGM mutation prompt template
MUTATION_PROMPT = """
You are an expert algorithm designer. You are given one or more
parent algorithms that solve a class of problems. Your task is to
create an improved or novel variant.

## Parent Algorithm(s):
{parent_code_blocks}

## Performance of Parents:
{parent_fitness_scores}

## Problem Description:
{problem_description}

## Evaluation Criteria:
The algorithm will be evaluated on DIVERSE problem instances.
It must generalize well, not just work on specific cases.

## Mutation Instructions:
{mutation_type_instruction}

Generate ONLY the Python function. The function must be self-contained
and follow this signature:

def solve(problem_instance) -> solution:
"""
```
Mutation Types
DGM employs multiple mutation strategies, each implemented as a different prompt to the LLM:
| Mutation Type | Description | Prompt Instruction |
|---|---|---|
| Refinement | Small improvements to a single parent | "Improve the efficiency or correctness of this algorithm" |
| Crossover | Combine ideas from two parents | "Combine the best ideas from both algorithms into a new one" |
| Radical | Major algorithmic redesign | "Design a fundamentally different approach to solve this problem" |
| Simplification | Reduce complexity while maintaining performance | "Simplify this algorithm while preserving its effectiveness" |
| Generalization | Make the algorithm handle more cases | "Modify this algorithm to handle a wider variety of inputs" |
5. Key Results
Benchmark Performance
DGM demonstrates that evolving for generality produces algorithms that transfer better to unseen instances and domains:
| Problem Domain | DGM (General) | FunSearch-style (Specialist) | Hand-Crafted Heuristic | Improvement |
|---|---|---|---|---|
| Bin Packing (in-distribution) | ~96% | ~98% | ~92% | Specialist wins on training dist. |
| Bin Packing (out-of-distribution) | ~94% | ~85% | ~90% | DGM +9% over specialist OOD |
| TSP Heuristics (in-distribution) | ~88% | ~91% | ~85% | Specialist leads on training |
| TSP Heuristics (out-of-distribution) | ~86% | ~76% | ~83% | DGM +10% over specialist OOD |
| Graph Coloring (transfer) | ~82% | ~65% | ~78% | DGM +17% transfer |
**Key Insight:** DGM-discovered algorithms slightly underperform specialists on training distributions but significantly outperform them on out-of-distribution instances. This is the fundamental generality/specialization tradeoff that DGM explicitly navigates.
Discovered Novel Algorithms
DGM has discovered algorithms with genuinely novel properties:
- Adaptive bin packing heuristics that dynamically adjust their strategy based on the distribution of item sizes, outperforming First-Fit-Decreasing on heterogeneous distributions
- Hybrid TSP heuristics that combine nearest-neighbor and savings-algorithm ideas in novel ways
- Generalizable priority functions for scheduling problems that transfer across different instance sizes and constraint structures
Generality Metrics
Generality score:

$$ G(f) = \frac{1}{K} \sum_{k=1}^{K} S(f, D_k) $$

where the $D_k$ are $K$ diverse evaluation distributions and $S(f, D_k)$ is the performance of $f$ on distribution $k$.

Robustness:

$$ R(f) = 1 - \frac{\operatorname{std}\{S(f, D_1), \ldots, S(f, D_K)\}}{\operatorname{mean}\{S(f, D_1), \ldots, S(f, D_K)\}} $$
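Both metrics reduce to simple statistics over the per-distribution scores. A minimal sketch (the score values below are illustrative, not from the paper):

```python
import statistics

def generality_score(per_dist_scores):
    """G(f): mean performance across the K evaluation distributions."""
    return statistics.mean(per_dist_scores)

def robustness_score(per_dist_scores):
    """R(f) = 1 - std/mean (coefficient of variation) of per-distribution scores."""
    mean = statistics.mean(per_dist_scores)
    std = statistics.pstdev(per_dist_scores)  # population std, as in the formula
    return 1.0 - std / mean

scores = [0.94, 0.86, 0.82]  # illustrative S(f, D_k) values
G = generality_score(scores)
R = robustness_score(scores)
```

An algorithm with identical scores on every distribution gets R = 1; the more its performance varies across distributions, the lower R falls.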
6. Reproducibility
**Reproducibility Status:** Partially open. The core concepts are described in detail, but the full implementation may be only partially released.
Available Resources
- Blog post with technical details: Published at sakana.ai/dgm
- Paper (if available): Technical report or preprint with full methodology
- Problem definitions: Standard benchmark problems (bin packing, TSP) are publicly available
- Evaluation methodology: Described in sufficient detail for independent replication
Requirements for Reproduction
| Requirement | Details |
|---|---|
| Python | 3.10+ |
| LLM API access | OpenAI, Anthropic, or Google API keys |
| Compute | CPU-only for evaluation; LLM API for mutations |
| Time | Hours to days depending on population size and generations |
| Budget | $50-500+ depending on LLM model and search depth |
```python
# Conceptual reproduction setup
# 1. Define problem and fitness function
from dgm import EvolutionarySearch, BinPackingProblem

problem = BinPackingProblem(
    train_distributions=["uniform", "normal", "bimodal"],
    test_distributions=["heavy_tail", "sparse"],
    instances_per_dist=50
)

# 2. Configure evolutionary search
search = EvolutionarySearch(
    llm_model="gpt-4o",
    population_size=50,
    n_generations=100,
    mutation_types=["refine", "crossover", "radical"],
    diversity_weight=0.2,
    generality_weight=0.8
)

# 3. Run evolution
results = search.run(problem)
best_general_algorithm = results.best_by_generality()
```
7. Cost Analysis
| Configuration | Population | Generations | Mutations/Gen | Est. API Cost |
|---|---|---|---|---|
| Small (exploration) | 20 | 30 | ~20 | $30-60 |
| Medium (standard) | 50 | 100 | ~50 | $150-400 |
| Large (full search) | 100 | 200 | ~100 | $500-1500 |
| Massive (paper results) | 200 | 500+ | ~200 | $2000-5000+ |
$$ \text{TotalCost} = N_{\text{generations}} \times N_{\text{mutations/gen}} \times (\text{avg tokens per mutation}) \times (\text{price per token}) $$
```python
def estimate_cost(n_generations, mutations_per_gen,
                  avg_input_tokens=2000, avg_output_tokens=500,
                  model="gpt-4o"):
    """Estimate total API cost for a DGM run."""
    pricing = {  # USD per token
        "gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6},
        "gpt-4-turbo": {"input": 10.00 / 1e6, "output": 30.00 / 1e6},
        "claude-3-5-sonnet": {"input": 3.00 / 1e6, "output": 15.00 / 1e6},
    }
    p = pricing[model]
    total_calls = n_generations * mutations_per_gen
    cost = total_calls * (avg_input_tokens * p["input"] +
                          avg_output_tokens * p["output"])
    return cost

# Example: Medium run with GPT-4o
cost = estimate_cost(100, 50, model="gpt-4o")
# 5,000 calls x $0.01/call = ~$50 for mutations alone, plus evaluation overhead
```
Cost Optimization Strategies
- Tiered model usage: Use cheaper models (DeepSeek Coder, Code Llama) for routine mutations and GPT-4 only for crossover and radical mutations
- Caching: Cache fitness evaluations for identical programs
- Early termination: Kill evaluations early if the program crashes or times out
- Batch API calls: Batch multiple mutation requests for throughput and potential cost savings
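The caching strategy above can be as simple as keying fitness evaluations on a hash of the normalized source code. A minimal sketch (the `FitnessCache` name and normalization scheme are illustrative, not from the paper):

```python
import hashlib

class FitnessCache:
    """Skip re-evaluating programs whose source has already been scored."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def _key(self, code):
        # Normalize trailing whitespace so trivially different copies collide
        normalized = "\n".join(
            line.rstrip() for line in code.strip().splitlines()
        )
        return hashlib.sha256(normalized.encode()).hexdigest()

    def evaluate(self, code, fitness_fn):
        key = self._key(code)
        if key in self._cache:
            self.hits += 1          # cache hit: no evaluation cost
        else:
            self._cache[key] = fitness_fn(code)
        return self._cache[key]
```

Since LLM mutation operators frequently regenerate near-identical children, even this crude deduplication can save a meaningful share of sandbox executions.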
8. Architecture Solution
High-Level System Architecture
+==============================================================================+
| DGM SYSTEM ARCHITECTURE |
+==============================================================================+
| |
| +-----------------------+ +------------------------+ |
| | PROBLEM DEFINITION | | POPULATION STORE | |
| | | | | |
| | - Problem class | | - Candidate programs | |
| | - Instance generators | | - Fitness scores | |
| | - Fitness function | | - Diversity metrics | |
| | - Evaluation suite | | - Genealogy/lineage | |
| +-----------+-----------+ +----------+-------------+ |
| | | |
| v v |
| +-----------------------------------------------------------+ |
| | EVOLUTIONARY SEARCH ENGINE | |
| | | |
| | +--------------+ +----------------+ +---------------+ | |
| | | Parent | | LLM Mutation | | Fitness | | |
| | | Selection |->| Operator |->| Evaluation | | |
| | | Module | | Module | | Module | | |
| | +--------------+ +----------------+ +---------------+ | |
| | ^ | | | |
| | | v v | |
| | +--------------+ +----------------+ +---------------+ | |
| | | Population | | LLM API | | Sandboxed | | |
| | | Management |<-| (GPT-4, Claude,| | Code Executor | | |
| | | & Diversity | | Gemini, etc.) | | (Docker/subprocess)| |
| | +--------------+ +----------------+ +---------------+ | |
| +-----------------------------------------------------------+ |
| | |
| v |
| +-----------------------------------------------------------+ |
| | OUTPUT / ANALYSIS | |
| | - Best general algorithm | |
| | - Pareto front (generality vs. specialization) | |
| | - Evolution history and genealogy | |
| | - Diversity analysis | |
| +-----------------------------------------------------------+ |
+==============================================================================+
Evolutionary Loop Detail
Initialize Population (seed programs or LLM-generated)
|
v
+-----------------+
| GENERATION g |<-----------------------------------------+
+-----------------+ |
| |
+-----------+-----------+ |
| | |
v v |
+--------+ +----------+ |
| SELECT | | SELECT | |
| Parent | | Parents | |
| (single| | (pair for| |
| mut.) | | crossovr)| |
+---+----+ +----+-----+ |
| | |
v v |
+---+----------------------+-----+ |
| LLM MUTATION OPERATOR | |
| | |
| Choose mutation type: | |
| - Refinement (p=0.4) | |
| - Crossover (p=0.25) | |
| - Radical (p=0.15) | |
| - Simplify (p=0.1) | |
| - Generalize (p=0.1) | |
+---------------+----------------+ |
| |
v |
+------------------------+ |
| VALIDATE & COMPILE |--- Invalid ---> Discard |
+------------------------+ |
| |
v (valid) |
+------------------------+ |
| EVALUATE FITNESS | |
| | |
| For each distribution: | |
| For each instance: | |
| Run solve(instance)| |
| Score result | |
+------------------------+ |
| |
v |
+------------------------+ |
| COMPUTE GENERALITY | |
| & DIVERSITY SCORES | |
+------------------------+ |
| |
v |
+------------------------+ |
| UPDATE POPULATION | |
| (survival selection, |--------> g < max_gen? ----Yes----------+
| archive management) |
+------------------------+
|
v (g == max_gen)
+------------------------+
| RETURN BEST GENERAL |
| ALGORITHM + ARCHIVE |
+------------------------+
Sandboxed Execution Environment
All candidate programs are executed in a sandboxed environment to prevent malicious or erroneous code from affecting the host system:
+----------------------------------+
| SANDBOX (Docker / subprocess) |
| |
| - Resource limits (CPU, memory) |
| - Timeout enforcement (e.g. 30s) |
| - No network access |
| - No filesystem access |
| - Restricted imports |
| - Output capture (stdout/stderr) |
+----------------------------------+
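A minimal subprocess-based version of such a sandbox is sketched below (resource limits and import restrictions are omitted for brevity; a Docker-based setup would add filesystem and network isolation on top):

```python
import json
import subprocess
import sys

def sandbox_run(candidate_code, instance, timeout=30):
    """Run solve(instance) for a candidate program in a child process,
    enforcing a wall-clock timeout and capturing stdout/stderr."""
    # Append a driver that calls solve() on a JSON-encoded instance
    driver = (
        candidate_code
        + "\nimport json, sys"
        + "\nprint(json.dumps(solve(json.loads(sys.argv[1]))))"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", driver, json.dumps(instance)],
            capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr.strip()}
    return {"ok": True, "result": json.loads(proc.stdout)}
```

Crashes and timeouts surface as `{"ok": False, ...}` results, which the fitness module can map to a score of zero.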
9. Components
Problem Definition Module
- Problem class specification (interface for solve())
- Instance generators for multiple distributions
- Fitness/scoring function per instance
- Train/test distribution split
- Difficulty calibration
Population Store
- Program storage (source code as strings)
- Fitness vector per program (score per distribution)
- Diversity embeddings (code similarity metrics)
- Genealogy tree (parent-child relationships)
- Archive of elite solutions
Parent Selection Module
- Tournament selection
- Fitness-proportional sampling
- Diversity-aware selection (MAP-Elites-style)
- Age-based weighting
- Multi-objective selection (NSGA-II-style)
LLM Mutation Module
- Prompt construction engine
- Mutation type sampler
- Multi-model support
- Response parsing and validation
- Retry logic for malformed outputs
Fitness Evaluation Module
- Sandboxed code execution
- Multi-instance evaluation
- Generality score computation
- Timeout and error handling
- Caching for duplicate programs
Population Management Module
- Survival selection (elitism + diversity)
- Population size control
- Niching and speciation
- Stagnation detection
- Archive maintenance
10. Core Mechanisms
10.1 Code Modification and Mutation Operators
DGM's mutation operators are the heart of the system. Each mutation is an LLM call that transforms one or more parent programs into a child program:
```python
import random

class MutationOperator:
    """LLM-based mutation operator for code evolution."""

    MUTATION_TYPES = {
        "refine": {
            "weight": 0.40,
            "n_parents": 1,
            "instruction": (
                "Improve this algorithm. Fix bugs, optimize performance, "
                "or enhance its handling of edge cases. Make small, "
                "targeted improvements."
            )
        },
        "crossover": {
            "weight": 0.25,
            "n_parents": 2,
            "instruction": (
                "You are given two algorithms that solve the same problem. "
                "Create a new algorithm that combines the best ideas from "
                "both. The result should be better than either parent."
            )
        },
        "radical": {
            "weight": 0.15,
            "n_parents": 1,
            "instruction": (
                "Completely redesign this algorithm using a fundamentally "
                "different approach. Think outside the box. The new "
                "algorithm should use a different strategy entirely."
            )
        },
        "simplify": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Simplify this algorithm. Remove unnecessary complexity "
                "while preserving or improving performance."
            )
        },
        "generalize": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Make this algorithm more general. It should handle "
                "a wider variety of input distributions and edge cases."
            )
        }
    }

    def mutate(self, parents, problem_desc, mutation_type=None):
        """Apply LLM-based mutation to produce a child program."""
        if mutation_type is None:
            mutation_type = self._sample_mutation_type()
        config = self.MUTATION_TYPES[mutation_type]
        selected_parents = parents[:config["n_parents"]]
        prompt = self._build_prompt(
            selected_parents, problem_desc, config["instruction"]
        )
        response = self.llm_client.generate(
            prompt=prompt,
            temperature=0.8,  # higher temperature for diversity
            max_tokens=2048
        )
        child_code = self._extract_function(response)
        return child_code

    def _sample_mutation_type(self):
        """Sample a mutation type according to the configured weights."""
        types = list(self.MUTATION_TYPES.keys())
        weights = [self.MUTATION_TYPES[t]["weight"] for t in types]
        return random.choices(types, weights=weights, k=1)[0]
```
10.2 Parent Selection and Sampling
DGM uses a multi-objective parent selection strategy that balances fitness and diversity:
$$ P(\text{select } i) \propto f(i)^{\alpha} \cdot d(i)^{\beta}, \qquad \alpha + \beta = 1 $$

where $f(i)$ is the fitness rank of individual $i$ and $d(i)$ its diversity contribution.
```python
import random
from difflib import SequenceMatcher

class ParentSelector:
    """Multi-objective parent selection."""

    def __init__(self, alpha=0.7, beta=0.3, tournament_size=5):
        self.alpha = alpha    # fitness weight
        self.beta = beta      # diversity weight
        self.tournament_size = tournament_size

    def select(self, population, n_parents=1):
        """Select parents using a fitness-diversity tournament."""
        selected = []
        for _ in range(n_parents):
            # Tournament selection
            candidates = random.sample(population, self.tournament_size)
            # Score each candidate
            scores = []
            for c in candidates:
                fitness_rank = c.generality_score  # in [0, 1]
                diversity = self._compute_diversity(c, population)
                combined = (fitness_rank ** self.alpha) * (diversity ** self.beta)
                scores.append(combined)
            # Select the best in the tournament
            best_idx = max(range(len(candidates)), key=lambda i: scores[i])
            selected.append(candidates[best_idx])
        return selected

    def _compute_diversity(self, individual, population):
        """Compute the diversity contribution of an individual."""
        # Average code distance to all other individuals
        distances = [
            self._code_distance(individual.code, other.code)
            for other in population if other != individual
        ]
        return sum(distances) / len(distances) if distances else 1.0

    def _code_distance(self, code_a, code_b):
        """Normalized edit distance between code strings."""
        ratio = SequenceMatcher(None, code_a, code_b).ratio()
        return 1.0 - ratio  # distance = 1 - similarity
```
10.3 Population Management
DGM maintains a structured population with elitism and diversity guarantees:
```python
from difflib import SequenceMatcher

class PopulationManager:
    """Manage the evolving population of candidate algorithms."""

    def __init__(self, max_size=50, elite_fraction=0.2,
                 diversity_threshold=0.1):
        self.max_size = max_size
        self.elite_fraction = elite_fraction
        self.diversity_threshold = diversity_threshold
        self.population = []
        self.archive = []  # hall of fame

    def update(self, children):
        """Add children and trim the population back to max_size."""
        # Add only children that are sufficiently novel
        for child in children:
            if self._is_sufficiently_novel(child):
                self.population.append(child)
        # Sort by generality score
        self.population.sort(
            key=lambda x: x.generality_score, reverse=True
        )
        # Preserve elites
        n_elite = int(self.max_size * self.elite_fraction)
        elites = self.population[:n_elite]
        # Fill remaining slots with diversity-aware selection
        remaining = self.population[n_elite:]
        diverse_fill = self._diversity_fill(
            remaining, self.max_size - n_elite
        )
        self.population = elites + diverse_fill
        # Update the archive with the best-ever individuals
        self._update_archive()

    def _is_sufficiently_novel(self, candidate):
        """Check that the candidate differs enough from the population."""
        for existing in self.population:
            similarity = SequenceMatcher(
                None, candidate.code, existing.code
            ).ratio()
            if similarity > (1.0 - self.diversity_threshold):
                return False  # too similar to an existing individual
        return True

    def _diversity_fill(self, candidates, n_slots):
        """Greedily select diverse individuals to fill remaining slots."""
        selected = []
        for _ in range(min(n_slots, len(candidates))):
            # Pick the candidate farthest from those already selected
            best_div, best_idx = -1.0, 0
            for i, c in enumerate(candidates):
                div = min(
                    [self._code_dist(c, s) for s in selected]
                    or [1.0]
                )
                if div > best_div:
                    best_div, best_idx = div, i
            selected.append(candidates.pop(best_idx))
        return selected

    def _code_dist(self, a, b):
        """Normalized edit distance between two individuals' code."""
        return 1.0 - SequenceMatcher(None, a.code, b.code).ratio()

    def _update_archive(self):
        """Record the current best individual in the hall of fame."""
        if self.population:
            self.archive.append(self.population[0])
```
10.4 Solution Diversity and Novelty
DGM uses multiple diversity metrics to prevent convergence:
| Metric | Description | Formula |
|---|---|---|
| Syntactic Diversity | Edit distance between code strings | 1 - SequenceMatcher.ratio() |
| Behavioral Diversity | Difference in outputs on a reference set of instances | #{i : f_a(i) ≠ f_b(i)} / #instances |
| Structural Diversity | AST-level comparison of algorithmic structure | Tree edit distance on ASTs |
| Performance Profile Diversity | Difference in per-distribution performance vectors | cosine_distance(perf_vec_a, perf_vec_b) |
Behavioral distance:

$$ D_{\mathrm{behav}}(f_a, f_b) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[f_a(x_i) \neq f_b(x_i)] $$

Population diversity:

$$ PD = \frac{2}{M(M-1)} \sum_{a < b} D(a, b) $$
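These two quantities can be estimated by running every candidate on a shared reference set. A sketch under the assumption that candidates are plain callables:

```python
def behavioral_distance(f_a, f_b, reference_instances):
    """D_behav: fraction of reference instances on which the two
    algorithms produce different outputs."""
    disagreements = sum(
        1 for x in reference_instances if f_a(x) != f_b(x)
    )
    return disagreements / len(reference_instances)

def population_diversity(candidates, reference_instances):
    """PD: mean pairwise behavioral distance over the population."""
    m = len(candidates)
    total = sum(
        behavioral_distance(candidates[a], candidates[b], reference_instances)
        for a in range(m) for b in range(a + 1, m)
    )
    return 2.0 * total / (m * (m - 1))
```

Behavioral metrics catch a failure mode that syntactic distance misses: two programs with very different source text can still implement the same algorithm.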
10.5 LLM Orchestration and Model Selection
DGM can use multiple LLMs with different roles and cost profiles:
```python
class LLMOrchestrator:
    """Orchestrate multiple LLMs for different mutation types."""

    def __init__(self):
        self.model_configs = {
            "high_quality": {
                "model": "gpt-4o",
                "temperature": 0.7,
                "use_for": ["crossover", "radical"]
            },
            "cost_effective": {
                "model": "deepseek-coder",
                "temperature": 0.8,
                "use_for": ["refine", "simplify"]
            },
            "diverse": {
                "model": "claude-3-5-sonnet",
                "temperature": 0.9,
                "use_for": ["generalize"]
            }
        }

    def select_model(self, mutation_type):
        """Select the best model for a given mutation type."""
        for config in self.model_configs.values():
            if mutation_type in config["use_for"]:
                return config["model"], config["temperature"]
        return "gpt-4o", 0.7  # default
```
10.6 Search Strategies
DGM employs a multi-faceted search strategy:
- Evolutionary search (primary): Population-based search with selection, mutation, and survival
- Island model parallelism: Multiple sub-populations evolve independently with occasional migration of elite individuals between islands
- Adaptive mutation rates: When population diversity drops below a threshold, radical mutation rates increase; when fitness stagnates, crossover rates increase
- Restart mechanism: If all islands stagnate for N generations, partial population reinitialization occurs
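The island model above can be sketched as independent sub-populations with periodic elite migration in a ring topology (a simplification; the migration count and the `Ind` stand-in type are illustrative, not from the paper):

```python
from collections import namedtuple

Ind = namedtuple("Ind", "fitness")  # stand-in for a candidate program

def migrate(islands, n_migrants=2):
    """Copy each island's top individuals to the next island in the ring.
    Individuals are assumed to carry a .fitness attribute."""
    # Snapshot the elites first so migration order does not matter
    elites = [
        sorted(isl, key=lambda ind: ind.fitness, reverse=True)[:n_migrants]
        for isl in islands
    ]
    for i, isl in enumerate(islands):
        # Each island receives the elites of its predecessor in the ring
        isl.extend(elites[(i - 1) % len(islands)])
    return islands
```

Because islands only exchange a few elites occasionally, each sub-population explores its own region of program space, which helps preserve global diversity.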
```python
class AdaptiveMutationScheduler:
    """Dynamically adjust mutation type probabilities."""

    def __init__(self):
        self.base_weights = {
            "refine": 0.4, "crossover": 0.25,
            "radical": 0.15, "simplify": 0.1,
            "generalize": 0.1
        }
        self.stagnation_counter = 0

    def get_weights(self, pop_diversity, fitness_improving):
        """Adapt mutation weights based on the search state."""
        weights = dict(self.base_weights)
        # Low diversity -> more radical mutations
        if pop_diversity < 0.3:
            weights["radical"] *= 2.0
            weights["generalize"] *= 1.5
        # Fitness stagnating -> more crossover
        if not fitness_improving:
            self.stagnation_counter += 1
            if self.stagnation_counter > 5:
                weights["crossover"] *= 2.0
                weights["radical"] *= 1.5
        else:
            self.stagnation_counter = 0
        # Normalize to a probability distribution
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
```
10.7 Evaluation and Fitness Assessment
The fitness evaluation is the most distinctive component of DGM — it explicitly measures generality:
$$ \text{Fitness}(f) = \alpha \cdot \text{Generality}(f) + \beta \cdot \text{BestCase}(f) + \gamma \cdot \text{Robustness}(f) $$

$$ \text{Generality}(f) = \frac{1}{K} \sum_{k=1}^{K} \operatorname*{mean}_{x \in D_k} \big[ \text{score}(f(x), \text{optimal}(x)) \big] $$

$$ \text{BestCase}(f) = \max_{k} \; \operatorname*{mean}_{x \in D_k} \big[ \text{score}(f(x), \text{optimal}(x)) \big] $$

$$ \text{Robustness}(f) = 1 - \mathrm{CV}(\{S_1, \ldots, S_K\}), \qquad \mathrm{CV} = \frac{\text{std}}{\text{mean}} $$
```python
import numpy as np

class GeneralityFitness:
    """Evaluate fitness with emphasis on generality."""

    def __init__(self, distributions, instances_per_dist=50,
                 alpha=0.6, beta=0.2, gamma=0.2):
        self.distributions = distributions
        self.instances_per_dist = instances_per_dist
        self.alpha = alpha    # generality weight
        self.beta = beta      # best-case weight
        self.gamma = gamma    # robustness weight

    def evaluate(self, candidate_code, timeout=30):
        """Evaluate a candidate algorithm on all distributions."""
        per_dist_scores = {}
        for dist_name, dist in self.distributions.items():
            instances = dist.generate(self.instances_per_dist)
            scores = []
            for instance in instances:
                try:
                    # Run the candidate in the sandboxed executor
                    result = sandbox_execute(
                        candidate_code, instance, timeout=timeout
                    )
                    score = self._score_result(result, instance)
                    scores.append(score)
                except (TimeoutError, RuntimeError):
                    scores.append(0.0)  # crashes/timeouts score zero
            per_dist_scores[dist_name] = np.mean(scores)
        # Compute the composite fitness
        all_scores = list(per_dist_scores.values())
        generality = np.mean(all_scores)
        best_case = np.max(all_scores)
        robustness = 1.0 - (np.std(all_scores) /
                            (np.mean(all_scores) + 1e-8))
        fitness = (self.alpha * generality +
                   self.beta * best_case +
                   self.gamma * robustness)
        return {
            "fitness": fitness,
            "generality": generality,
            "best_case": best_case,
            "robustness": robustness,
            "per_distribution": per_dist_scores
        }
```
10.8 Prompt Engineering and Co-Evolution
DGM incorporates several sophisticated prompt engineering techniques:
- Performance-aware prompts: Parent fitness scores are included in the mutation prompt so the LLM understands which aspects to improve
- Failure-case injection: When a parent fails on specific instances, those failure cases are described in the prompt to guide improvement
- Cross-pollination context: For crossover mutations, the prompt explicitly describes the strengths of each parent
- Generality emphasis: All prompts emphasize that the algorithm must work across diverse inputs, not just specific cases
````python
def build_mutation_prompt(parent, mutation_type, problem_desc, failures=None):
    """Build a context-rich mutation prompt."""
    prompt = f"""You are an expert algorithm designer.

## Problem
{problem_desc}

## Current Algorithm (Parent)
```python
{parent.code}
```

## Parent Performance
- Overall generality score: {parent.generality_score:.3f}
- Per-distribution scores:
"""
    for dist, score in parent.per_dist_scores.items():
        prompt += f"  - {dist}: {score:.3f}\n"
    if failures:
        prompt += "\n## Failure Cases (algorithm performs poorly on these):\n"
        for f in failures[:3]:
            prompt += f"  - Input: {f['input']}, Expected: {f['expected']}, Got: {f['got']}\n"
    prompt += f"""
## Mutation Instruction
{MutationOperator.MUTATION_TYPES[mutation_type]['instruction']}

IMPORTANT: The algorithm will be tested on MANY DIVERSE inputs.
It must be general, not specialized to any particular input type.

Generate only the Python function:

def solve(problem_instance):
    ...
"""
    return prompt
````
10.9 Cost Control
DGM implements several cost control mechanisms:
- Tiered model allocation: Expensive models (GPT-4) used only for high-value mutations (crossover, radical); cheaper models for routine refinements
- Evaluation budgeting: Each generation has a fixed evaluation budget; if exceeded, remaining mutations are deferred
- Early stopping: Program evaluation is terminated early if the program crashes, times out, or clearly underperforms
- Duplicate detection: Syntactically identical or near-identical programs skip evaluation
- Adaptive evaluation depth: Initial fitness uses fewer instances; promising candidates get full evaluation
```python
class CostController:
    """Control API and compute costs across the search."""

    def __init__(self, total_budget_usd=200.0,
                 per_generation_budget=5.0):
        self.total_budget = total_budget_usd
        self.per_gen_budget = per_generation_budget
        self.spent = 0.0
        self.gen_spent = 0.0

    def can_mutate(self):
        return (self.spent < self.total_budget and
                self.gen_spent < self.per_gen_budget)

    def log_cost(self, cost):
        self.spent += cost
        self.gen_spent += cost

    def new_generation(self):
        self.gen_spent = 0.0

    def adaptive_eval_depth(self, candidate_fitness_estimate):
        """Use more evaluation instances for promising candidates."""
        if candidate_fitness_estimate > 0.8:
            return 100   # full evaluation
        elif candidate_fitness_estimate > 0.5:
            return 30    # moderate evaluation
        else:
            return 10    # quick screening
```
10.10 Meta-Level and Self-Improvement
DGM incorporates meta-level adaptation mechanisms:
- Adaptive mutation distribution: The probabilities of different mutation types are adjusted based on which types have historically produced the best improvements
- Success tracking: Each mutation type's success rate (fraction of children that improve on parents) is tracked and used to adjust future probabilities
- Prompt evolution: The mutation prompts themselves can be refined based on which prompt styles produce better offspring (a form of prompt co-evolution)
- Hyperparameter adaptation: Population size, selection pressure, and diversity thresholds can be adjusted based on search progress
```python
class MetaLearner:
    """Track and adapt search hyperparameters."""

    def __init__(self):
        self.mutation_success_rates = {
            "refine": [], "crossover": [],
            "radical": [], "simplify": [],
            "generalize": []
        }

    def record_outcome(self, mutation_type, parent_fitness, child_fitness):
        improved = child_fitness > parent_fitness
        self.mutation_success_rates[mutation_type].append(improved)

    def get_adapted_weights(self, window=50):
        """Get mutation weights adapted by recent success rates."""
        weights = {}
        for mt, outcomes in self.mutation_success_rates.items():
            recent = outcomes[-window:]
            if recent:
                # Success rate plus a small baseline so no type is fully disabled
                weights[mt] = sum(recent) / len(recent) + 0.05
            else:
                weights[mt] = 0.2  # default before any data
        # Normalize
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
```
10.11 Mathematical Framework
Formal Problem Definition
$$ f^{*} = \arg\max_{f \in \mathcal{F}} G(f) $$
$$ G(f) = \mathbb{E}_{D \sim P(\mathcal{D})}\!\left[ \mathbb{E}_{x \sim D}\!\left[ \mathrm{score}\big(f(x), \mathrm{optimal}(x)\big) \right] \right] $$
where $\mathcal{F}$ is the space of all valid Python functions with signature `solve(instance) -> solution`.
Evolutionary Dynamics
$$ P_{t+1} = \mathrm{Survive}\big(P_t \cup \mathrm{Mutate}(\mathrm{Select}(P_t))\big) $$
$$ \mathrm{Select}(P_t) = \left\{ p_i \in P_t : \text{sampled with probability} \propto f(p_i)^{\alpha} \cdot d(p_i)^{\beta} \right\} $$
$$ \mathrm{Mutate}(S) = \left\{ \mathrm{LLM}\big(\mathrm{prompt}(s, \text{mutation type})\big) : s \in S \right\} $$
$$ \mathrm{Survive}(Q) = \mathrm{Elite}(Q, k) \cup \mathrm{DiverseSelect}\big(Q \setminus \mathrm{Elite}(Q, k),\; |P| - k\big) $$
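The Select step above can be sketched directly. This is a minimal illustration of fitness-and-diversity-weighted sampling, assuming each individual exposes `fitness` and a precomputed `diversity` score (the dict-based representation is an assumption for the sketch, not the paper's data model):

```python
import numpy as np

def select(population, n, alpha=2.0, beta=1.0, rng=None):
    """Sample parents with probability proportional to fitness^alpha * diversity^beta."""
    rng = rng or np.random.default_rng()
    fit = np.array([ind["fitness"] for ind in population], dtype=float)
    div = np.array([ind["diversity"] for ind in population], dtype=float)
    scores = fit ** alpha * div ** beta
    probs = scores / scores.sum()
    idx = rng.choice(len(population), size=n, p=probs)
    return [population[i] for i in idx]

pop = [{"fitness": 0.9, "diversity": 0.5},
       {"fitness": 0.1, "diversity": 0.5}]
parents = select(pop, 100, rng=np.random.default_rng(0))
n_fit = sum(p["fitness"] == 0.9 for p in parents)  # the fit individual dominates
```

Raising `alpha` sharpens selection pressure toward high-fitness parents; raising `beta` rewards behaviorally distinct candidates and counteracts premature convergence.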
Convergence Analysis
import numpy as np
def track_convergence(evolution_history):
"""Analyze convergence of the evolutionary search."""
generations = []
best_fitness = []
mean_fitness = []
diversity = []
for gen, population in enumerate(evolution_history):
fitnesses = [ind.fitness for ind in population]
generations.append(gen)
best_fitness.append(max(fitnesses))
mean_fitness.append(np.mean(fitnesses))
        # Pairwise syntactic diversity (code_distance is assumed defined
        # elsewhere, e.g. a normalized string or AST edit distance)
codes = [ind.code for ind in population]
pair_dists = []
for i in range(len(codes)):
for j in range(i + 1, len(codes)):
pair_dists.append(code_distance(codes[i], codes[j]))
diversity.append(np.mean(pair_dists) if pair_dists else 0)
    # Convergence rate: slope of a linear fit to the recent best-fitness curve
    coeffs = np.polyfit(generations[-20:], best_fitness[-20:], 1)
    convergence_rate = coeffs[0]
return {
"generations": generations,
"best_fitness": best_fitness,
"mean_fitness": mean_fitness,
"diversity": diversity,
"convergence_rate": convergence_rate,
"converged": abs(convergence_rate) < 1e-4
}
Generality vs. Specialization Tradeoff
$$ \mathrm{Pareto}(P) = \left\{ f \in P : \nexists\, g \in P \;\text{s.t.}\; G(g) \ge G(f) \wedge S(g) > S(f) \right\} $$
where $G(f)$ is the generality score and $S(f)$ is the best single-distribution score.
def compute_pareto_front(population):
"""Compute the Pareto front of generality vs. specialization."""
# Each individual has (generality, best_case) scores
points = [(ind.generality_score, ind.best_case_score, ind)
for ind in population]
# Sort by generality descending
points.sort(key=lambda x: -x[0])
pareto = []
max_special = -1
for gen, spec, ind in points:
if spec > max_special:
pareto.append(ind)
max_special = spec
return pareto
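A quick sanity check of the sweep on a toy population (the function is reproduced so the sketch runs standalone; the names and scores are made up for illustration):

```python
from types import SimpleNamespace

def compute_pareto_front(population):
    """Same descending-generality sweep as above."""
    points = [(ind.generality_score, ind.best_case_score, ind)
              for ind in population]
    points.sort(key=lambda x: -x[0])
    pareto, max_special = [], -1
    for gen, spec, ind in points:
        if spec > max_special:  # non-dominated: nothing more general beats it
            pareto.append(ind)
            max_special = spec
    return pareto

pop = [
    SimpleNamespace(name="a", generality_score=0.9, best_case_score=0.6),
    SimpleNamespace(name="b", generality_score=0.7, best_case_score=0.9),
    SimpleNamespace(name="c", generality_score=0.6, best_case_score=0.8),  # dominated by b
]
front = compute_pareto_front(pop)
print([ind.name for ind in front])  # ['a', 'b']
```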
11. Programming Language
| Component | Language | Purpose |
|---|---|---|
| Search engine | Python 3.10+ | Evolutionary loop, population management |
| Candidate solutions | Python | All evolved algorithms are Python functions |
| Fitness evaluation | Python | Sandboxed execution of candidate code |
| LLM integration | Python | API clients for OpenAI, Anthropic, Google |
| Analysis & visualization | Python (matplotlib, seaborn) | Evolution tracking, Pareto analysis |
| Sandbox | Docker / subprocess | Isolated execution environment |
| Configuration | YAML | Search hyperparameters, model configs |
Python is the natural choice because (a) LLMs are most fluent in Python code generation, (b) the scientific Python ecosystem provides evaluation infrastructure, and (c) the candidate solutions themselves are Python functions, creating a unified stack.
12. Memory Management
Population Memory
- Active population: Stored in memory as a list of Individual objects (code string + metadata). Typically 50-200 individuals, each consuming a few KB.
- Archive: Hall-of-fame of historically best individuals, bounded in size (e.g., top 50 ever seen)
- Genealogy tree: Parent-child relationships tracked for analysis (can be persisted to disk)
LLM Context Management
- Prompt size: Each mutation prompt typically 1,500-3,000 tokens (parent code + instructions). Kept well within context limits.
- No cross-generation context: Each LLM call is independent — the LLM does not accumulate state across generations. This is by design: statefulness is managed by the evolutionary algorithm, not the LLM.
- Multi-parent prompts: Crossover prompts include 2 parent programs, roughly doubling context size. Still well within limits for modern LLMs.
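A simple guard before dispatching a two-parent crossover prompt, using the common rough heuristic of ~4 characters per token. The budget value and function names are assumptions for the sketch, not part of the published system:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose and code."""
    return len(text) // 4

def build_crossover_prompt(parent_a: str, parent_b: str,
                           max_prompt_tokens: int = 8000) -> str:
    """Build a two-parent crossover prompt, refusing oversized inputs."""
    prompt = (
        "Combine the strengths of these two programs into a single solve().\n\n"
        f"Program A:\n{parent_a}\n\n"
        f"Program B:\n{parent_b}\n"
    )
    if estimate_tokens(prompt) > max_prompt_tokens:
        raise ValueError("crossover prompt exceeds the token budget")
    return prompt

prompt = build_crossover_prompt("def solve(x): ...", "def solve(x): ...")
```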
Evaluation Memory
import hashlib

class EvaluationCache:
    """Cache fitness evaluations to avoid redundant computation."""
    def __init__(self, max_cache_size=10000):
        self.cache = {}  # insertion-ordered (Python 3.7+), evicted FIFO
        self.max_size = max_cache_size

    def get_or_evaluate(self, code, evaluator):
        code_hash = hashlib.sha256(code.encode()).hexdigest()
        if code_hash in self.cache:
            return self.cache[code_hash]
        result = evaluator.evaluate(code)
        # Evict the oldest entry first so the cache never exceeds max_size
        if len(self.cache) >= self.max_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        self.cache[code_hash] = result
        return result
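A small self-contained demonstration of the content-hash caching idea, with a stub evaluator that counts calls (the evaluator and its fitness formula are illustrative stand-ins):

```python
import hashlib

class CountingEvaluator:
    """Stub evaluator: counts how many times real evaluation runs."""
    def __init__(self):
        self.calls = 0

    def evaluate(self, code):
        self.calls += 1
        return {"fitness": len(code) % 10}  # stand-in for real benchmarking

cache, evaluator = {}, CountingEvaluator()

def get_or_evaluate(code):
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in cache:  # only unseen code triggers a real evaluation
        cache[key] = evaluator.evaluate(code)
    return cache[key]

get_or_evaluate("def solve(x): return x")
get_or_evaluate("def solve(x): return x")  # cache hit: no re-evaluation
print(evaluator.calls)  # 1
```

Keying on a hash of the source text means that when the LLM regenerates an identical program (which happens often under low-temperature sampling), the expensive sandboxed evaluation is skipped entirely.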
Disk Persistence
- Population checkpointed to disk every N generations for crash recovery
- Evolution history (fitness curves, diversity metrics) logged to disk in real-time
- Best discovered algorithms saved as standalone Python files
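A minimal checkpoint/restore sketch for the population using JSON. The schema and file naming are assumptions; a real implementation would also persist the archive, genealogy, and RNG state:

```python
import json
import os
import tempfile

def save_checkpoint(path, generation, population):
    """Atomically persist the population (code + fitness) for crash recovery."""
    state = {"generation": generation,
             "population": [{"code": ind["code"], "fitness": ind["fitness"]}
                            for ind in population]}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a torn file

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "gen_0042.json")
    save_checkpoint(path, 42, [{"code": "def solve(x): return x",
                                "fitness": 0.8}])
    state = load_checkpoint(path)
print(state["generation"])  # 42
```

The write-to-temp-then-rename pattern matters here: a crash mid-write leaves the previous checkpoint intact rather than a half-written file.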
13. Continued Learning
Warm-Starting
DGM supports warm-starting evolution from previously discovered algorithms:
- Seed population: Initialize with algorithms from a previous run on the same or related problem
- Transfer across domains: Use algorithms discovered for bin packing as seeds for scheduling problems
- Human-seeded evolution: Include human-written heuristics (e.g., First-Fit-Decreasing) in the initial population
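For example, a human-written First-Fit-Decreasing heuristic can be dropped into the seed population as an ordinary candidate. The `solve(instance)` signature follows the problem definition above; the instance dictionary format is an assumption for this sketch:

```python
def solve(instance):
    """First-Fit-Decreasing: sort items largest-first, place each into the
    first open bin with room, opening a new bin when none fits.
    Assumes instance = {"items": [sizes...], "capacity": c} (illustrative)."""
    items = sorted(instance["items"], reverse=True)
    capacity = instance["capacity"]
    remaining = []  # free space left in each open bin
    for item in items:
        for i, space in enumerate(remaining):
            if item <= space:
                remaining[i] -= item
                break
        else:
            remaining.append(capacity - item)  # open a new bin
    return {"num_bins": len(remaining)}

result = solve({"items": [0.5, 0.7, 0.5, 0.3], "capacity": 1.0})
print(result["num_bins"])  # 2
```

Seeding with such a well-understood baseline gives evolution a competent starting point while leaving it free to mutate the priority ordering or bin-choice rule.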
Cross-Problem Learning
One of DGM's most ambitious goals is discovering algorithmic primitives that transfer across entirely different problem domains. For example, a sorting-based heuristic discovered for bin packing might contain a priority function that is useful for job scheduling.
import ast

class TransferLearning:
    """Transfer evolved algorithms across problem domains."""
    def __init__(self, llm_client):
        self.llm = llm_client
def extract_primitives(self, elite_population):
"""Extract reusable algorithmic primitives from elites."""
primitives = []
for individual in elite_population:
# Parse AST to find reusable sub-functions
tree = ast.parse(individual.code)
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef):
if node.name != "solve": # helper functions
primitives.append({
"name": node.name,
"code": ast.unparse(node),
"source_fitness": individual.fitness
})
return primitives
    def seed_new_problem(self, primitives, new_problem):
        """Create seed population for new problem using primitives."""
        seeds = []
        for prim in primitives:
            # Use LLM to adapt each primitive to the new problem
            prompt = (
                "Adapt this algorithmic primitive for a new problem.\n\n"
                f"Primitive:\n{prim['code']}\n\n"
                f"New problem: {new_problem.description}\n\n"
                "Create a complete solve() function that uses this primitive."
            )
            adapted = self.llm.generate(prompt)
            seeds.append(adapted)
        return seeds
Incremental Evolution
- New problem instances can be added to the evaluation suite without restarting evolution
- As new distributions are added, the fitness landscape changes and evolution adapts
- This enables a form of continual learning where the algorithm population becomes more general over time
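The mechanism above can be sketched as an evaluation suite that grows mid-run, with generality recomputed as the mean score across all current distributions. The interface and the fraction-solved scoring are illustrative assumptions:

```python
class EvaluationSuite:
    """Evaluation suite that can grow mid-run (illustrative interface)."""
    def __init__(self):
        self.distributions = {}  # name -> list of problem instances

    def add_distribution(self, name, instances):
        self.distributions[name] = instances

    def generality(self, solve):
        # Score each distribution as the fraction of its instances solved
        per_dist = [sum(solve(x) is not None for x in xs) / len(xs)
                    for xs in self.distributions.values()]
        return sum(per_dist) / len(per_dist)

suite = EvaluationSuite()
suite.add_distribution("small", [1, 2, 3])
candidate = lambda x: x if x < 3 else None  # toy candidate: fails on x >= 3
g1 = suite.generality(candidate)            # 2/3 on "small" alone
suite.add_distribution("large", [10, 20])   # the fitness landscape shifts
g2 = suite.generality(candidate)            # mean of 2/3 and 0
print(g1 > g2)  # True
```

Because generality averages over whatever distributions currently exist, adding a harder distribution immediately penalizes narrow candidates and redirects selection pressure without restarting the run.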
14. Applications and Use Cases
Primary Applications
| Application | Description | Impact |
|---|---|---|
| Combinatorial Optimization | Discovering heuristics for NP-hard problems (bin packing, TSP, scheduling) | General heuristics that work across real-world distributions without manual tuning |
| ML Algorithm Design | Evolving loss functions, optimizers, regularization strategies | Data-driven discovery of ML components that generalize across datasets |
| Scientific Computing | Discovering numerical methods and approximation algorithms | Novel numerical algorithms that handle diverse equation types |
| Program Synthesis | Automatically generating correct and efficient programs from specifications | Generalizable solution strategies rather than per-instance code |
| Automated Mathematics | Discovering constructions for extremal combinatorics and related fields | Potential for novel mathematical results (following FunSearch tradition) |
Industrial Use Cases
Logistics and Supply Chain
DGM-discovered bin packing and scheduling algorithms could replace hand-tuned heuristics in warehouse operations, vehicle routing, and resource allocation, automatically adapting to changing demand distributions.
Chip Design / EDA
Placement and routing heuristics for VLSI design that generalize across different chip architectures rather than requiring per-chip tuning.
Drug Discovery
General molecular optimization algorithms that work across chemical families, rather than being specialized to specific target proteins.
Financial Optimization
Portfolio optimization and risk management algorithms that remain robust across different market regimes.
Research Implications
- Automated algorithm design: DGM represents a step toward fully automated algorithm design, where the role of the human shifts from writing algorithms to defining problems and fitness functions
- LLM as creative search: Demonstrates that LLMs can serve as effective search operators in high-dimensional program spaces, combining syntactic validity with semantic understanding
- Generality as an objective: Establishes a framework for explicitly optimizing for generality, which has broad implications for machine learning and AI research
- Open-ended evolution: The combination of LLM mutation, diversity maintenance, and cross-domain transfer points toward open-ended evolutionary systems that continuously discover novel algorithms
**Significance:** DGM represents a paradigm shift in algorithm design. Rather than humans crafting heuristics by hand, DGM uses LLMs as intelligent search operators to explore the space of possible algorithms, with explicit selection pressure for generality. The resulting algorithms are not just competitive with human-designed solutions — they transfer to unseen problem instances and distributions in ways that specialized solutions cannot.
DGM (Discovering General Methods) Technical Report. Generated March 2026. Based on: Sakana AI, "Discovering General Methods" (2025). Source: sakana.ai/dgm