Discovering General Methods (DGM)
LLM-Driven Evolutionary Search for General-Purpose Algorithmic Solutions
Organization: Sakana AI
Year: 2025
Domain: Evolutionary Search, Algorithm Discovery, LLM-Guided Evolution, Foundation Model Programming
Table of Contents
- Full Title and Authors
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Cost Analysis
- Architecture Solution
- Components
- Core Mechanisms
- Programming Language
- Memory Management
- Continued Learning
- Applications and Use Cases
1. Full Title and Authors
**Full Title:** Discovering General Methods: LLM-Driven Evolutionary Search for General-Purpose Algorithms
**Organization:** Sakana AI (Tokyo, Japan)
**Key Researchers:** Sakana AI Research Team, including Chris Lu, Robert Lange, David Ha, and collaborators
**Year:** 2025
**Source:** [sakana.ai/dgm](https://sakana.ai/dgm/)
DGM (Discovering General Methods) is a system from Sakana AI that uses LLMs as mutation operators within an evolutionary search framework to discover general-purpose algorithms — solutions that transfer across multiple problem instances and domains rather than being specialized to a single task. It builds upon and extends the lineage of FunSearch (DeepMind) and Sakana AI's own "AI Scientist" line of work.
2. Core Contribution
Problem Statement
Traditional program synthesis and algorithm discovery methods face a fundamental tension:
- Specialist solutions (overfitting to specific problem instances) are easy to find but do not generalize
- General methods (algorithms that work across many instances and problem types) are far more valuable but exponentially harder to discover
Existing LLM-driven code generation approaches (FunSearch, EvoPrompting, etc.) typically optimize for performance on a fixed set of test cases, producing solutions that may overfit to those specific cases. DGM addresses this by explicitly optimizing for generality — the ability of discovered algorithms to solve unseen problem instances.
What Is Novel
**Key Innovations:**
**Generality-aware fitness:** The fitness function explicitly rewards algorithms that perform well across *diverse* problem instances, not just a fixed training set
**Cross-domain transfer evaluation:** Discovered methods are evaluated on out-of-distribution problem instances and even entirely different problem domains
**LLM-as-mutation-operator:** Large language models serve as intelligent mutation operators that make semantically meaningful code modifications, far surpassing random syntactic mutations
**Population diversity mechanisms:** Explicit mechanisms to maintain a diverse population of candidate algorithms, preventing premature convergence to local optima
**Hierarchical search:** The system can discover both low-level algorithmic primitives and high-level compositions of those primitives
Relationship to Prior Work
| System | Organization | Key Difference from DGM |
|---|---|---|
| FunSearch | DeepMind | Optimizes for a fixed fitness function; DGM optimizes for generality |
| EvoPrompting | Various | Evolves prompts; DGM directly evolves code |
| The AI Scientist | Sakana AI | Generates papers/ideas; DGM discovers working algorithms |
| OpenELM | Various | Focuses on single-domain; DGM targets cross-domain generality |
| AlphaCode | DeepMind | Generates + filters code; DGM evolves populations iteratively |
3. Supported Solutions
DGM is designed to discover general-purpose algorithms across multiple problem domains:
| Domain | Problem Types | Example Tasks |
|---|---|---|
| Combinatorial Optimization | NP-hard and combinatorial search problems | Bin packing, TSP heuristics, graph coloring, scheduling |
| Mathematical Reasoning | Algorithmic solutions to mathematical problems | Cap set problem, extremal combinatorics, number theory |
| Machine Learning Algorithms | Loss functions, optimizers, architectures | Novel activation functions, learning rate schedules, regularizers |
| Search and Sorting | Algorithmic primitives | Priority functions, comparison operators, heuristic designs |
| Scientific Computing | Numerical methods and approximation algorithms | ODE solvers, integration methods, approximation schemes |
Solution Representation
Solutions in DGM are represented as executable Python functions with a standardized interface:
```python
# Standard DGM solution interface
def solve(problem_instance: ProblemInstance) -> Solution:
    """
    A general-purpose algorithm that takes a problem instance
    and returns a solution.

    The 'generality' of this function is measured by how well
    it performs across diverse problem instances.
    """
    # Algorithm implementation here
    ...
    return solution
```
The key constraint is that solutions must be self-contained Python functions that can be evaluated independently. This enables clean fitness evaluation and population management.
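As an illustration (not taken from the paper), a hypothetical candidate conforming to this interface might be a first-fit bin-packing heuristic, assuming `problem_instance` is a list of item sizes in (0, 1] and the bin capacity is 1.0:

```python
# Hypothetical solve() candidate: first-fit bin packing.
# Assumes problem_instance is a list of item sizes in (0, 1]
# and each bin has capacity 1.0; returns a list of bins.
def solve(problem_instance):
    bins = []
    for item in problem_instance:
        # Place the item in the first bin with enough remaining capacity
        for b in bins:
            if sum(b) + item <= 1.0:
                b.append(item)
                break
        else:
            bins.append([item])  # no bin fits: open a new one
    return bins
```

A self-contained function like this can be evaluated in isolation, which is exactly what the population manager needs.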
4. LLM Integration
Models Used
| Model | Role in DGM | Why Selected |
|---|---|---|
| GPT-4 / GPT-4o | Primary mutation operator | Strong code generation and reasoning capabilities |
| Claude 3.5 Sonnet | Alternative mutation operator | Competitive code quality, cost-effective |
| Gemini 1.5 Pro | Alternative mutation operator | Large context window for complex code understanding |
| Code Llama / DeepSeek Coder | Cost-effective mutation operator | Open-source, lower cost for high-volume mutations |
How LLMs Are Used: The LLM-as-Mutation-Operator Paradigm
The central innovation of DGM (and its predecessors like FunSearch) is using LLMs as intelligent mutation operators in an evolutionary algorithm. Instead of random syntactic mutations (flipping bits, swapping tokens), the LLM understands the semantics of code and makes meaningful modifications:
TRADITIONAL EVOLUTION              LLM-GUIDED EVOLUTION (DGM)
====================               ==========================
Parent Program                     Parent Program(s)
      |                                    |
      v                                    v
Random Mutation                    LLM Mutation Operator
(bit flip, swap,                   (semantic understanding,
 insert, delete)                    algorithmic reasoning,
      |                             cross-pollination)
      v                                    |
Child Program                              v
(often broken,                     Child Program
 usually small delta)              (syntactically valid,
                                    potentially novel algorithm)
Mutation Prompt Structure
```python
# DGM mutation prompt template
MUTATION_PROMPT = """
You are an expert algorithm designer. You are given one or more
parent algorithms that solve a class of problems. Your task is to
create an improved or novel variant.

## Parent Algorithm(s):
{parent_code_blocks}

## Performance of Parents:
{parent_fitness_scores}

## Problem Description:
{problem_description}

## Evaluation Criteria:
The algorithm will be evaluated on DIVERSE problem instances.
It must generalize well, not just work on specific cases.

## Mutation Instructions:
{mutation_type_instruction}

Generate ONLY the Python function. The function must be self-contained
and follow this signature:

def solve(problem_instance) -> solution:
"""
```
Mutation Types
DGM employs multiple mutation strategies, each implemented as a different prompt to the LLM:
| Mutation Type | Description | Prompt Instruction |
|---|---|---|
| Refinement | Small improvements to a single parent | "Improve the efficiency or correctness of this algorithm" |
| Crossover | Combine ideas from two parents | "Combine the best ideas from both algorithms into a new one" |
| Radical | Major algorithmic redesign | "Design a fundamentally different approach to solve this problem" |
| Simplification | Reduce complexity while maintaining performance | "Simplify this algorithm while preserving its effectiveness" |
| Generalization | Make the algorithm handle more cases | "Modify this algorithm to handle a wider variety of inputs" |
5. Key Results
Benchmark Performance
DGM demonstrates that evolving for generality produces algorithms that transfer better to unseen instances and domains:
| Problem Domain | DGM (General) | FunSearch-style (Specialist) | Hand-Crafted Heuristic | Improvement |
|---|---|---|---|---|
| Bin Packing (in-distribution) | ~96% | ~98% | ~92% | Specialist wins on training dist. |
| Bin Packing (out-of-distribution) | ~94% | ~85% | ~90% | DGM +9% over specialist OOD |
| TSP Heuristics (in-distribution) | ~88% | ~91% | ~85% | Specialist leads on training |
| TSP Heuristics (out-of-distribution) | ~86% | ~76% | ~83% | DGM +10% over specialist OOD |
| Graph Coloring (transfer) | ~82% | ~65% | ~78% | DGM +17% transfer |
**Key Insight:** DGM-discovered algorithms slightly underperform specialists on training distributions but significantly outperform them on out-of-distribution instances. This is the fundamental generality/specialization tradeoff that DGM explicitly navigates.
Discovered Novel Algorithms
DGM has discovered algorithms with genuinely novel properties:
- Adaptive bin packing heuristics that dynamically adjust their strategy based on the distribution of item sizes, outperforming First-Fit-Decreasing on heterogeneous distributions
- Hybrid TSP heuristics that combine nearest-neighbor and savings-algorithm ideas in novel ways
- Generalizable priority functions for scheduling problems that transfer across different instance sizes and constraint structures
Generality Metrics
Generality score:

$$ G(f) = \frac{1}{K} \sum_{k=1}^{K} S(f, D_k) $$

where the $D_k$ are $K$ diverse evaluation distributions and $S(f, D_k)$ is the performance of $f$ on distribution $k$.

Robustness:

$$ R(f) = 1 - \frac{\operatorname{std}\{S(f, D_1), \ldots, S(f, D_K)\}}{\operatorname{mean}\{S(f, D_1), \ldots, S(f, D_K)\}} $$
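Both metrics reduce to simple statistics over the per-distribution scores. A minimal sketch (the score values below are illustrative, not from the paper):

```python
import statistics

def generality_score(per_dist_scores):
    """G(f): mean performance across the K evaluation distributions."""
    return statistics.mean(per_dist_scores)

def robustness_score(per_dist_scores):
    """R(f) = 1 - std/mean (coefficient of variation) of per-distribution scores."""
    mean = statistics.mean(per_dist_scores)
    std = statistics.pstdev(per_dist_scores)  # population std, as in the formula
    return 1.0 - std / mean

scores = [0.94, 0.86, 0.82]  # illustrative S(f, D_k) values
G = generality_score(scores)
R = robustness_score(scores)
```

An algorithm with identical scores on every distribution gets R = 1; the more its performance varies across distributions, the lower R falls.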
6. Reproducibility
**Reproducibility Status:** Partially open. The core concepts are described in detail, but the full implementation may be only partially released.
Available Resources
- Blog post with technical details: Published at sakana.ai/dgm
- Paper (if available): Technical report or preprint with full methodology
- Problem definitions: Standard benchmark problems (bin packing, TSP) are publicly available
- Evaluation methodology: Described in sufficient detail for independent replication
Requirements for Reproduction
| Requirement | Details |
|---|---|
| Python | 3.10+ |
| LLM API access | OpenAI, Anthropic, or Google API keys |
| Compute | CPU-only for evaluation; LLM API for mutations |
| Time | Hours to days depending on population size and generations |
| Budget | $50-500+ depending on LLM model and search depth |
```python
# Conceptual reproduction setup
# 1. Define problem and fitness function
from dgm import EvolutionarySearch, BinPackingProblem

problem = BinPackingProblem(
    train_distributions=["uniform", "normal", "bimodal"],
    test_distributions=["heavy_tail", "sparse"],
    instances_per_dist=50
)

# 2. Configure evolutionary search
search = EvolutionarySearch(
    llm_model="gpt-4o",
    population_size=50,
    n_generations=100,
    mutation_types=["refine", "crossover", "radical"],
    diversity_weight=0.2,
    generality_weight=0.8
)

# 3. Run evolution
results = search.run(problem)
best_general_algorithm = results.best_by_generality()
```
7. Cost Analysis
| Configuration | Population | Generations | Mutations/Gen | Est. API Cost |
|---|---|---|---|---|
| Small (exploration) | 20 | 30 | ~20 | $30-60 |
| Medium (standard) | 50 | 100 | ~50 | $150-400 |
| Large (full search) | 100 | 200 | ~100 | $500-1500 |
| Massive (paper results) | 200 | 500+ | ~200 | $2000-5000+ |
$$ \text{TotalCost} = N_{\text{generations}} \times N_{\text{mutations/gen}} \times (\text{avg tokens per mutation}) \times (\text{price per token}) $$
```python
def estimate_cost(n_generations, mutations_per_gen,
                  avg_input_tokens=2000, avg_output_tokens=500,
                  model="gpt-4o"):
    """Estimate total API cost for a DGM run."""
    pricing = {  # USD per token
        "gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6},
        "gpt-4-turbo": {"input": 10.00 / 1e6, "output": 30.00 / 1e6},
        "claude-3-5-sonnet": {"input": 3.00 / 1e6, "output": 15.00 / 1e6},
    }
    p = pricing[model]
    total_calls = n_generations * mutations_per_gen
    cost = total_calls * (avg_input_tokens * p["input"] +
                          avg_output_tokens * p["output"])
    return cost

# Example: Medium run with GPT-4o
cost = estimate_cost(100, 50, model="gpt-4o")
# 5,000 calls x $0.01/call = ~$50 for mutations alone, plus evaluation overhead
```
Cost Optimization Strategies
- Tiered model usage: Use cheaper models (DeepSeek Coder, Code Llama) for routine mutations and GPT-4 only for crossover and radical mutations
- Caching: Cache fitness evaluations for identical programs
- Early termination: Kill evaluations early if the program crashes or times out
- Batch API calls: Batch multiple mutation requests for throughput and potential cost savings
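The caching strategy above can be as simple as keying fitness evaluations on a hash of the normalized source code. A minimal sketch (the `FitnessCache` name and normalization scheme are illustrative, not from the paper):

```python
import hashlib

class FitnessCache:
    """Skip re-evaluating programs whose source has already been scored."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def _key(self, code):
        # Normalize trailing whitespace so trivially different copies collide
        normalized = "\n".join(
            line.rstrip() for line in code.strip().splitlines()
        )
        return hashlib.sha256(normalized.encode()).hexdigest()

    def evaluate(self, code, fitness_fn):
        key = self._key(code)
        if key in self._cache:
            self.hits += 1          # cache hit: no evaluation cost
        else:
            self._cache[key] = fitness_fn(code)
        return self._cache[key]
```

Since LLM mutation operators frequently regenerate near-identical children, even this crude deduplication can save a meaningful share of sandbox executions.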
8. Architecture Solution
High-Level System Architecture
+==============================================================================+
| DGM SYSTEM ARCHITECTURE |
+==============================================================================+
| |
| +-----------------------+ +------------------------+ |
| | PROBLEM DEFINITION | | POPULATION STORE | |
| | | | | |
| | - Problem class | | - Candidate programs | |
| | - Instance generators | | - Fitness scores | |
| | - Fitness function | | - Diversity metrics | |
| | - Evaluation suite | | - Genealogy/lineage | |
| +-----------+-----------+ +----------+-------------+ |
| | | |
| v v |
| +-----------------------------------------------------------+ |
| | EVOLUTIONARY SEARCH ENGINE | |
| | | |
| | +--------------+ +----------------+ +---------------+ | |
| | | Parent | | LLM Mutation | | Fitness | | |
| | | Selection |->| Operator |->| Evaluation | | |
| | | Module | | Module | | Module | | |
| | +--------------+ +----------------+ +---------------+ | |
| | ^ | | | |
| | | v v | |
| | +--------------+ +----------------+ +---------------+ | |
| | | Population | | LLM API | | Sandboxed | | |
| | | Management |<-| (GPT-4, Claude,| | Code Executor | | |
| | | & Diversity | | Gemini, etc.) | | (Docker/subprocess)| |
| | +--------------+ +----------------+ +---------------+ | |
| +-----------------------------------------------------------+ |
| | |
| v |
| +-----------------------------------------------------------+ |
| | OUTPUT / ANALYSIS | |
| | - Best general algorithm | |
| | - Pareto front (generality vs. specialization) | |
| | - Evolution history and genealogy | |
| | - Diversity analysis | |
| +-----------------------------------------------------------+ |
+==============================================================================+
Evolutionary Loop Detail
Initialize Population (seed programs or LLM-generated)
|
v
+-----------------+
| GENERATION g |<-----------------------------------------+
+-----------------+ |
| |
+-----------+-----------+ |
| | |
v v |
+--------+ +----------+ |
| SELECT | | SELECT | |
| Parent | | Parents | |
| (single| | (pair for| |
| mut.) | | crossovr)| |
+---+----+ +----+-----+ |
| | |
v v |
+---+----------------------+-----+ |
| LLM MUTATION OPERATOR | |
| | |
| Choose mutation type: | |
| - Refinement (p=0.4) | |
| - Crossover (p=0.25) | |
| - Radical (p=0.15) | |
| - Simplify (p=0.1) | |
| - Generalize (p=0.1) | |
+---------------+----------------+ |
| |
v |
+------------------------+ |
| VALIDATE & COMPILE |--- Invalid ---> Discard |
+------------------------+ |
| |
v (valid) |
+------------------------+ |
| EVALUATE FITNESS | |
| | |
| For each distribution: | |
| For each instance: | |
| Run solve(instance)| |
| Score result | |
+------------------------+ |
| |
v |
+------------------------+ |
| COMPUTE GENERALITY | |
| & DIVERSITY SCORES | |
+------------------------+ |
| |
v |
+------------------------+ |
| UPDATE POPULATION | |
| (survival selection, |--------> g < max_gen? ----Yes----------+
| archive management) |
+------------------------+
|
v (g == max_gen)
+------------------------+
| RETURN BEST GENERAL |
| ALGORITHM + ARCHIVE |
+------------------------+
Sandboxed Execution Environment
All candidate programs are executed in a sandboxed environment to prevent malicious or erroneous code from affecting the host system:
+----------------------------------+
| SANDBOX (Docker / subprocess) |
| |
| - Resource limits (CPU, memory) |
| - Timeout enforcement (e.g. 30s) |
| - No network access |
| - No filesystem access |
| - Restricted imports |
| - Output capture (stdout/stderr) |
+----------------------------------+
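A minimal subprocess-based version of such a sandbox is sketched below (resource limits and import restrictions are omitted for brevity; a Docker-based setup would add filesystem and network isolation on top):

```python
import json
import subprocess
import sys

def sandbox_run(candidate_code, instance, timeout=30):
    """Run solve(instance) for a candidate program in a child process,
    enforcing a wall-clock timeout and capturing stdout/stderr."""
    # Append a driver that calls solve() on a JSON-encoded instance
    driver = (
        candidate_code
        + "\nimport json, sys"
        + "\nprint(json.dumps(solve(json.loads(sys.argv[1]))))"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", driver, json.dumps(instance)],
            capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr.strip()}
    return {"ok": True, "result": json.loads(proc.stdout)}
```

Crashes and timeouts surface as `{"ok": False, ...}` results, which the fitness module can map to a score of zero.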
9. Components
Problem Definition Module
- Problem class specification (interface for solve())
- Instance generators for multiple distributions
- Fitness/scoring function per instance
- Train/test distribution split
- Difficulty calibration
Population Store
- Program storage (source code as strings)
- Fitness vector per program (score per distribution)
- Diversity embeddings (code similarity metrics)
- Genealogy tree (parent-child relationships)
- Archive of elite solutions
Parent Selection Module
- Tournament selection
- Fitness-proportional sampling
- Diversity-aware selection (MAP-Elites-style)
- Age-based weighting
- Multi-objective selection (NSGA-II-style)
LLM Mutation Module
- Prompt construction engine
- Mutation type sampler
- Multi-model support
- Response parsing and validation
- Retry logic for malformed outputs
Fitness Evaluation Module
- Sandboxed code execution
- Multi-instance evaluation
- Generality score computation
- Timeout and error handling
- Caching for duplicate programs
Population Management Module
- Survival selection (elitism + diversity)
- Population size control
- Niching and speciation
- Stagnation detection
- Archive maintenance
10. Core Mechanisms
10.1 Code Modification and Mutation Operators
DGM's mutation operators are the heart of the system. Each mutation is an LLM call that transforms one or more parent programs into a child program:
```python
import random

class MutationOperator:
    """LLM-based mutation operator for code evolution."""

    MUTATION_TYPES = {
        "refine": {
            "weight": 0.40,
            "n_parents": 1,
            "instruction": (
                "Improve this algorithm. Fix bugs, optimize performance, "
                "or enhance its handling of edge cases. Make small, "
                "targeted improvements."
            )
        },
        "crossover": {
            "weight": 0.25,
            "n_parents": 2,
            "instruction": (
                "You are given two algorithms that solve the same problem. "
                "Create a new algorithm that combines the best ideas from "
                "both. The result should be better than either parent."
            )
        },
        "radical": {
            "weight": 0.15,
            "n_parents": 1,
            "instruction": (
                "Completely redesign this algorithm using a fundamentally "
                "different approach. Think outside the box. The new "
                "algorithm should use a different strategy entirely."
            )
        },
        "simplify": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Simplify this algorithm. Remove unnecessary complexity "
                "while preserving or improving performance."
            )
        },
        "generalize": {
            "weight": 0.10,
            "n_parents": 1,
            "instruction": (
                "Make this algorithm more general. It should handle "
                "a wider variety of input distributions and edge cases."
            )
        }
    }

    def mutate(self, parents, problem_desc, mutation_type=None):
        """Apply LLM-based mutation to produce a child program."""
        if mutation_type is None:
            mutation_type = self._sample_mutation_type()
        config = self.MUTATION_TYPES[mutation_type]
        selected_parents = parents[:config["n_parents"]]
        prompt = self._build_prompt(
            selected_parents, problem_desc, config["instruction"]
        )
        response = self.llm_client.generate(
            prompt=prompt,
            temperature=0.8,  # higher temperature for diversity
            max_tokens=2048
        )
        child_code = self._extract_function(response)
        return child_code

    def _sample_mutation_type(self):
        """Sample a mutation type according to the configured weights."""
        types = list(self.MUTATION_TYPES.keys())
        weights = [self.MUTATION_TYPES[t]["weight"] for t in types]
        return random.choices(types, weights=weights, k=1)[0]
```
10.2 Parent Selection and Sampling
DGM uses a multi-objective parent selection strategy that balances fitness and diversity:
$$ P(\text{select } i) \propto f(i)^{\alpha} \cdot d(i)^{\beta}, \qquad \alpha + \beta = 1 $$

where $f(i)$ is the fitness rank of individual $i$ and $d(i)$ its diversity contribution.
```python
import random
from difflib import SequenceMatcher

class ParentSelector:
    """Multi-objective parent selection."""

    def __init__(self, alpha=0.7, beta=0.3, tournament_size=5):
        self.alpha = alpha    # fitness weight
        self.beta = beta      # diversity weight
        self.tournament_size = tournament_size

    def select(self, population, n_parents=1):
        """Select parents using a fitness-diversity tournament."""
        selected = []
        for _ in range(n_parents):
            # Tournament selection
            candidates = random.sample(population, self.tournament_size)
            # Score each candidate
            scores = []
            for c in candidates:
                fitness_rank = c.generality_score  # in [0, 1]
                diversity = self._compute_diversity(c, population)
                combined = (fitness_rank ** self.alpha) * (diversity ** self.beta)
                scores.append(combined)
            # Select the best in the tournament
            best_idx = max(range(len(candidates)), key=lambda i: scores[i])
            selected.append(candidates[best_idx])
        return selected

    def _compute_diversity(self, individual, population):
        """Compute the diversity contribution of an individual."""
        # Average code distance to all other individuals
        distances = [
            self._code_distance(individual.code, other.code)
            for other in population if other != individual
        ]
        return sum(distances) / len(distances) if distances else 1.0

    def _code_distance(self, code_a, code_b):
        """Normalized edit distance between code strings."""
        ratio = SequenceMatcher(None, code_a, code_b).ratio()
        return 1.0 - ratio  # distance = 1 - similarity
```
10.3 Population Management
DGM maintains a structured population with elitism and diversity guarantees:
```python
from difflib import SequenceMatcher

class PopulationManager:
    """Manage the evolving population of candidate algorithms."""

    def __init__(self, max_size=50, elite_fraction=0.2,
                 diversity_threshold=0.1):
        self.max_size = max_size
        self.elite_fraction = elite_fraction
        self.diversity_threshold = diversity_threshold
        self.population = []
        self.archive = []  # hall of fame

    def update(self, children):
        """Add children and trim the population back to max_size."""
        # Add only children that are sufficiently novel
        for child in children:
            if self._is_sufficiently_novel(child):
                self.population.append(child)
        # Sort by generality score
        self.population.sort(
            key=lambda x: x.generality_score, reverse=True
        )
        # Preserve elites
        n_elite = int(self.max_size * self.elite_fraction)
        elites = self.population[:n_elite]
        # Fill remaining slots with diversity-aware selection
        remaining = self.population[n_elite:]
        diverse_fill = self._diversity_fill(
            remaining, self.max_size - n_elite
        )
        self.population = elites + diverse_fill
        # Update the archive with the best-ever individuals
        self._update_archive()

    def _is_sufficiently_novel(self, candidate):
        """Check that the candidate differs enough from the population."""
        for existing in self.population:
            similarity = SequenceMatcher(
                None, candidate.code, existing.code
            ).ratio()
            if similarity > (1.0 - self.diversity_threshold):
                return False  # too similar to an existing individual
        return True

    def _diversity_fill(self, candidates, n_slots):
        """Greedily select diverse individuals to fill remaining slots."""
        selected = []
        for _ in range(min(n_slots, len(candidates))):
            # Pick the candidate farthest from those already selected
            best_div, best_idx = -1.0, 0
            for i, c in enumerate(candidates):
                div = min(
                    [self._code_dist(c, s) for s in selected]
                    or [1.0]
                )
                if div > best_div:
                    best_div, best_idx = div, i
            selected.append(candidates.pop(best_idx))
        return selected

    def _code_dist(self, a, b):
        """Normalized edit distance between two individuals' code."""
        return 1.0 - SequenceMatcher(None, a.code, b.code).ratio()

    def _update_archive(self):
        """Record the current best individual in the hall of fame."""
        if self.population:
            self.archive.append(self.population[0])
```
10.4 Solution Diversity and Novelty
DGM uses multiple diversity metrics to prevent convergence:
| Metric | Description | Formula |
|---|---|---|
| Syntactic Diversity | Edit distance between code strings | 1 - SequenceMatcher.ratio() |
| Behavioral Diversity | Difference in outputs on a reference set of instances | #{i : f_a(i) ≠ f_b(i)} / #instances |
| Structural Diversity | AST-level comparison of algorithmic structure | Tree edit distance on ASTs |
| Performance Profile Diversity | Difference in per-distribution performance vectors | cosine_distance(perf_vec_a, perf_vec_b) |
Behavioral distance:

$$ D_{\mathrm{behav}}(f_a, f_b) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[f_a(x_i) \neq f_b(x_i)] $$

Population diversity:

$$ PD = \frac{2}{M(M-1)} \sum_{a < b} D(a, b) $$
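These two quantities can be estimated by running every candidate on a shared reference set. A sketch under the assumption that candidates are plain callables:

```python
def behavioral_distance(f_a, f_b, reference_instances):
    """D_behav: fraction of reference instances on which the two
    algorithms produce different outputs."""
    disagreements = sum(
        1 for x in reference_instances if f_a(x) != f_b(x)
    )
    return disagreements / len(reference_instances)

def population_diversity(candidates, reference_instances):
    """PD: mean pairwise behavioral distance over the population."""
    m = len(candidates)
    total = sum(
        behavioral_distance(candidates[a], candidates[b], reference_instances)
        for a in range(m) for b in range(a + 1, m)
    )
    return 2.0 * total / (m * (m - 1))
```

Behavioral metrics catch a failure mode that syntactic distance misses: two programs with very different source text can still implement the same algorithm.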
10.5 LLM Orchestration and Model Selection
DGM can use multiple LLMs with different roles and cost profiles:
```python
class LLMOrchestrator:
    """Orchestrate multiple LLMs for different mutation types."""

    def __init__(self):
        self.model_configs = {
            "high_quality": {
                "model": "gpt-4o",
                "temperature": 0.7,
                "use_for": ["crossover", "radical"]
            },
            "cost_effective": {
                "model": "deepseek-coder",
                "temperature": 0.8,
                "use_for": ["refine", "simplify"]
            },
            "diverse": {
                "model": "claude-3-5-sonnet",
                "temperature": 0.9,
                "use_for": ["generalize"]
            }
        }

    def select_model(self, mutation_type):
        """Select the best model for a given mutation type."""
        for config in self.model_configs.values():
            if mutation_type in config["use_for"]:
                return config["model"], config["temperature"]
        return "gpt-4o", 0.7  # default
```
10.6 Search Strategies
DGM employs a multi-faceted search strategy:
- Evolutionary search (primary): Population-based search with selection, mutation, and survival
- Island model parallelism: Multiple sub-populations evolve independently with occasional migration of elite individuals between islands
- Adaptive mutation rates: When population diversity drops below a threshold, radical mutation rates increase; when fitness stagnates, crossover rates increase
- Restart mechanism: If all islands stagnate for N generations, partial population reinitialization occurs
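The island model above can be sketched as independent sub-populations with periodic elite migration in a ring topology (a simplification; the migration count and the `Ind` stand-in type are illustrative, not from the paper):

```python
from collections import namedtuple

Ind = namedtuple("Ind", "fitness")  # stand-in for a candidate program

def migrate(islands, n_migrants=2):
    """Copy each island's top individuals to the next island in the ring.
    Individuals are assumed to carry a .fitness attribute."""
    # Snapshot the elites first so migration order does not matter
    elites = [
        sorted(isl, key=lambda ind: ind.fitness, reverse=True)[:n_migrants]
        for isl in islands
    ]
    for i, isl in enumerate(islands):
        # Each island receives the elites of its predecessor in the ring
        isl.extend(elites[(i - 1) % len(islands)])
    return islands
```

Because islands only exchange a few elites occasionally, each sub-population explores its own region of program space, which helps preserve global diversity.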
```python
class AdaptiveMutationScheduler:
    """Dynamically adjust mutation type probabilities."""

    def __init__(self):
        self.base_weights = {
            "refine": 0.4, "crossover": 0.25,
            "radical": 0.15, "simplify": 0.1,
            "generalize": 0.1
        }
        self.stagnation_counter = 0

    def get_weights(self, pop_diversity, fitness_improving):
        """Adapt mutation weights based on the search state."""
        weights = dict(self.base_weights)
        # Low diversity -> more radical mutations
        if pop_diversity < 0.3:
            weights["radical"] *= 2.0
            weights["generalize"] *= 1.5
        # Fitness stagnating -> more crossover
        if not fitness_improving:
            self.stagnation_counter += 1
            if self.stagnation_counter > 5:
                weights["crossover"] *= 2.0
                weights["radical"] *= 1.5
        else:
            self.stagnation_counter = 0
        # Normalize to a probability distribution
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
```
10.7 Evaluation and Fitness Assessment
The fitness evaluation is the most distinctive component of DGM — it explicitly measures generality:
$$ \text{Fitness}(f) = \alpha \cdot \text{Generality}(f) + \beta \cdot \text{BestCase}(f) + \gamma \cdot \text{Robustness}(f) $$

$$ \text{Generality}(f) = \frac{1}{K} \sum_{k=1}^{K} \operatorname*{mean}_{x \in D_k} \big[ \text{score}(f(x), \text{optimal}(x)) \big] $$

$$ \text{BestCase}(f) = \max_{k} \; \operatorname*{mean}_{x \in D_k} \big[ \text{score}(f(x), \text{optimal}(x)) \big] $$

$$ \text{Robustness}(f) = 1 - \mathrm{CV}(\{S_1, \ldots, S_K\}), \qquad \mathrm{CV} = \frac{\text{std}}{\text{mean}} $$
```python
import numpy as np

class GeneralityFitness:
    """Evaluate fitness with emphasis on generality."""

    def __init__(self, distributions, instances_per_dist=50,
                 alpha=0.6, beta=0.2, gamma=0.2):
        self.distributions = distributions
        self.instances_per_dist = instances_per_dist
        self.alpha = alpha    # generality weight
        self.beta = beta      # best-case weight
        self.gamma = gamma    # robustness weight

    def evaluate(self, candidate_code, timeout=30):
        """Evaluate a candidate algorithm on all distributions."""
        per_dist_scores = {}
        for dist_name, dist in self.distributions.items():
            instances = dist.generate(self.instances_per_dist)
            scores = []
            for instance in instances:
                try:
                    # Run the candidate in the sandboxed executor
                    result = sandbox_execute(
                        candidate_code, instance, timeout=timeout
                    )
                    score = self._score_result(result, instance)
                    scores.append(score)
                except (TimeoutError, RuntimeError):
                    scores.append(0.0)  # crashes/timeouts score zero
            per_dist_scores[dist_name] = np.mean(scores)
        # Compute the composite fitness
        all_scores = list(per_dist_scores.values())
        generality = np.mean(all_scores)
        best_case = np.max(all_scores)
        robustness = 1.0 - (np.std(all_scores) /
                            (np.mean(all_scores) + 1e-8))
        fitness = (self.alpha * generality +
                   self.beta * best_case +
                   self.gamma * robustness)
        return {
            "fitness": fitness,
            "generality": generality,
            "best_case": best_case,
            "robustness": robustness,
            "per_distribution": per_dist_scores
        }
```
10.8 Prompt Engineering and Co-Evolution
DGM incorporates several sophisticated prompt engineering techniques:
- Performance-aware prompts: Parent fitness scores are included in the mutation prompt so the LLM understands which aspects to improve
- Failure-case injection: When a parent fails on specific instances, those failure cases are described in the prompt to guide improvement
- Cross-pollination context: For crossover mutations, the prompt explicitly describes the strengths of each parent
- Generality emphasis: All prompts emphasize that the algorithm must work across diverse inputs, not just specific cases
````python
def build_mutation_prompt(parent, mutation_type, problem_desc, failures=None):
    """Build a context-rich mutation prompt."""
    prompt = f"""You are an expert algorithm designer.

## Problem
{problem_desc}

## Current Algorithm (Parent)
```python
{parent.code}
```

## Parent Performance
- Overall generality score: {parent.generality_score:.3f}
- Per-distribution scores:
"""
    for dist, score in parent.per_dist_scores.items():
        prompt += f"  - {dist}: {score:.3f}\n"
    if failures:
        prompt += "\n## Failure Cases (algorithm performs poorly on these):\n"
        for f in failures[:3]:
            prompt += f"  - Input: {f['input']}, Expected: {f['expected']}, Got: {f['got']}\n"
    prompt += f"""
## Mutation Instruction
{MutationOperator.MUTATION_TYPES[mutation_type]['instruction']}

IMPORTANT: The algorithm will be tested on MANY DIVERSE inputs.
It must be general, not specialized to any particular input type.

Generate only the Python function:

def solve(problem_instance):
    ...
"""
    return prompt
````
10.9 Cost Control
DGM implements several cost control mechanisms:
- Tiered model allocation: Expensive models (GPT-4) used only for high-value mutations (crossover, radical); cheaper models for routine refinements
- Evaluation budgeting: Each generation has a fixed evaluation budget; if exceeded, remaining mutations are deferred
- Early stopping: Program evaluation is terminated early if the program crashes, times out, or clearly underperforms
- Duplicate detection: Syntactically identical or near-identical programs skip evaluation
- Adaptive evaluation depth: Initial fitness uses fewer instances; promising candidates get full evaluation
```python
class CostController:
    """Control API and compute costs across the search."""

    def __init__(self, total_budget_usd=200.0,
                 per_generation_budget=5.0):
        self.total_budget = total_budget_usd
        self.per_gen_budget = per_generation_budget
        self.spent = 0.0
        self.gen_spent = 0.0

    def can_mutate(self):
        return (self.spent < self.total_budget and
                self.gen_spent < self.per_gen_budget)

    def log_cost(self, cost):
        self.spent += cost
        self.gen_spent += cost

    def new_generation(self):
        self.gen_spent = 0.0

    def adaptive_eval_depth(self, candidate_fitness_estimate):
        """Use more evaluation instances for promising candidates."""
        if candidate_fitness_estimate > 0.8:
            return 100   # full evaluation
        elif candidate_fitness_estimate > 0.5:
            return 30    # moderate evaluation
        else:
            return 10    # quick screening
```
10.10 Meta-Level and Self-Improvement
DGM incorporates meta-level adaptation mechanisms:
- Adaptive mutation distribution: The probabilities of different mutation types are adjusted based on which types have historically produced the best improvements
- Success tracking: Each mutation type's success rate (fraction of children that improve on parents) is tracked and used to adjust future probabilities
- Prompt evolution: The mutation prompts themselves can be refined based on which prompt styles produce better offspring (a form of prompt co-evolution)
- Hyperparameter adaptation: Population size, selection pressure, and diversity thresholds can be adjusted based on search progress
```python
class MetaLearner:
    """Track and adapt search hyperparameters."""

    def __init__(self):
        self.mutation_success_rates = {
            "refine": [], "crossover": [],
            "radical": [], "simplify": [],
            "generalize": []
        }

    def record_outcome(self, mutation_type, parent_fitness, child_fitness):
        improved = child_fitness > parent_fitness
        self.mutation_success_rates[mutation_type].append(improved)

    def get_adapted_weights(self, window=50):
        """Get mutation weights adapted by recent success rates."""
        weights = {}
        for mt, outcomes in self.mutation_success_rates.items():
            recent = outcomes[-window:]
            if recent:
                # Success rate plus a small baseline so no type is fully disabled
                weights[mt] = sum(recent) / len(recent) + 0.05
            else:
                weights[mt] = 0.2  # default before any data
        # Normalize
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}
```
10.11 Mathematical Framework
Formal Problem Definition
$$ f^{*} = \arg\max_{f \in \mathcal{F}} G(f) $$
$$ G(f) = \mathbb{E}_{D \sim P(\mathcal{D})}\!\left[ \mathbb{E}_{x \sim D}\!\left[ \mathrm{score}\big(f(x), \mathrm{optimal}(x)\big) \right] \right] $$
where $\mathcal{F}$ is the space of all valid Python functions with signature `solve(instance) -> solution`.
Evolutionary Dynamics
$$ P_{t+1} = \mathrm{Survive}\big(P_t \cup \mathrm{Mutate}(\mathrm{Select}(P_t))\big) $$
$$ \mathrm{Select}(P_t) = \left\{ p_i \in P_t : \text{sampled with probability} \propto f(p_i)^{\alpha} \cdot d(p_i)^{\beta} \right\} $$
$$ \mathrm{Mutate}(S) = \left\{ \mathrm{LLM}\big(\mathrm{prompt}(s, \text{mutation type})\big) : s \in S \right\} $$
$$ \mathrm{Survive}(Q) = \mathrm{Elite}(Q, k) \cup \mathrm{DiverseSelect}\big(Q \setminus \mathrm{Elite}(Q, k),\; |P| - k\big) $$
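The Select step above can be sketched directly. This is a minimal illustration of fitness-and-diversity-weighted sampling, assuming each individual exposes `fitness` and a precomputed `diversity` score (the dict-based representation is an assumption for the sketch, not the paper's data model):

```python
import numpy as np

def select(population, n, alpha=2.0, beta=1.0, rng=None):
    """Sample parents with probability proportional to fitness^alpha * diversity^beta."""
    rng = rng or np.random.default_rng()
    fit = np.array([ind["fitness"] for ind in population], dtype=float)
    div = np.array([ind["diversity"] for ind in population], dtype=float)
    scores = fit ** alpha * div ** beta
    probs = scores / scores.sum()
    idx = rng.choice(len(population), size=n, p=probs)
    return [population[i] for i in idx]

pop = [{"fitness": 0.9, "diversity": 0.5},
       {"fitness": 0.1, "diversity": 0.5}]
parents = select(pop, 100, rng=np.random.default_rng(0))
n_fit = sum(p["fitness"] == 0.9 for p in parents)  # the fit individual dominates
```

Raising `alpha` sharpens selection pressure toward high-fitness parents; raising `beta` rewards behaviorally distinct candidates and counteracts premature convergence.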
Convergence Analysis
import numpy as np
def track_convergence(evolution_history):
"""Analyze convergence of the evolutionary search."""
generations = []
best_fitness = []
mean_fitness = []
diversity = []
for gen, population in enumerate(evolution_history):
fitnesses = [ind.fitness for ind in population]
generations.append(gen)
best_fitness.append(max(fitnesses))
mean_fitness.append(np.mean(fitnesses))
        # Pairwise syntactic diversity (code_distance is assumed defined
        # elsewhere, e.g. a normalized string or AST edit distance)
codes = [ind.code for ind in population]
pair_dists = []
for i in range(len(codes)):
for j in range(i + 1, len(codes)):
pair_dists.append(code_distance(codes[i], codes[j]))
diversity.append(np.mean(pair_dists) if pair_dists else 0)
    # Convergence rate: slope of a linear fit to the recent best-fitness curve
    coeffs = np.polyfit(generations[-20:], best_fitness[-20:], 1)
    convergence_rate = coeffs[0]
return {
"generations": generations,
"best_fitness": best_fitness,
"mean_fitness": mean_fitness,
"diversity": diversity,
"convergence_rate": convergence_rate,
"converged": abs(convergence_rate) < 1e-4
}
Generality vs. Specialization Tradeoff
$$ \mathrm{Pareto}(P) = \left\{ f \in P : \nexists\, g \in P \;\text{s.t.}\; G(g) \ge G(f) \wedge S(g) > S(f) \right\} $$
where $G(f)$ is the generality score and $S(f)$ is the best single-distribution score.
def compute_pareto_front(population):
"""Compute the Pareto front of generality vs. specialization."""
# Each individual has (generality, best_case) scores
points = [(ind.generality_score, ind.best_case_score, ind)
for ind in population]
# Sort by generality descending
points.sort(key=lambda x: -x[0])
pareto = []
max_special = -1
for gen, spec, ind in points:
if spec > max_special:
pareto.append(ind)
max_special = spec
return pareto
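A quick sanity check of the sweep on a toy population (the function is reproduced so the sketch runs standalone; the names and scores are made up for illustration):

```python
from types import SimpleNamespace

def compute_pareto_front(population):
    """Same descending-generality sweep as above."""
    points = [(ind.generality_score, ind.best_case_score, ind)
              for ind in population]
    points.sort(key=lambda x: -x[0])
    pareto, max_special = [], -1
    for gen, spec, ind in points:
        if spec > max_special:  # non-dominated: nothing more general beats it
            pareto.append(ind)
            max_special = spec
    return pareto

pop = [
    SimpleNamespace(name="a", generality_score=0.9, best_case_score=0.6),
    SimpleNamespace(name="b", generality_score=0.7, best_case_score=0.9),
    SimpleNamespace(name="c", generality_score=0.6, best_case_score=0.8),  # dominated by b
]
front = compute_pareto_front(pop)
print([ind.name for ind in front])  # ['a', 'b']
```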
11. Programming Language
| Component | Language | Purpose |
|---|---|---|
| Search engine | Python 3.10+ | Evolutionary loop, population management |
| Candidate solutions | Python | All evolved algorithms are Python functions |
| Fitness evaluation | Python | Sandboxed execution of candidate code |
| LLM integration | Python | API clients for OpenAI, Anthropic, Google |
| Analysis & visualization | Python (matplotlib, seaborn) | Evolution tracking, Pareto analysis |
| Sandbox | Docker / subprocess | Isolated execution environment |
| Configuration | YAML | Search hyperparameters, model configs |
Python is the natural choice because (a) LLMs are most fluent in Python code generation, (b) the scientific Python ecosystem provides evaluation infrastructure, and (c) the candidate solutions themselves are Python functions, creating a unified stack.
12. Memory Management
Population Memory
- Active population: Stored in memory as a list of Individual objects (code string + metadata). Typically 50-200 individuals, each consuming a few KB.
- Archive: Hall-of-fame of historically best individuals, bounded in size (e.g., top 50 ever seen)
- Genealogy tree: Parent-child relationships tracked for analysis (can be persisted to disk)
LLM Context Management
- Prompt size: Each mutation prompt typically 1,500-3,000 tokens (parent code + instructions). Kept well within context limits.
- No cross-generation context: Each LLM call is independent — the LLM does not accumulate state across generations. This is by design: statefulness is managed by the evolutionary algorithm, not the LLM.
- Multi-parent prompts: Crossover prompts include 2 parent programs, roughly doubling context size. Still well within limits for modern LLMs.
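A simple guard before dispatching a two-parent crossover prompt, using the common rough heuristic of ~4 characters per token. The budget value and function names are assumptions for the sketch, not part of the published system:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose and code."""
    return len(text) // 4

def build_crossover_prompt(parent_a: str, parent_b: str,
                           max_prompt_tokens: int = 8000) -> str:
    """Build a two-parent crossover prompt, refusing oversized inputs."""
    prompt = (
        "Combine the strengths of these two programs into a single solve().\n\n"
        f"Program A:\n{parent_a}\n\n"
        f"Program B:\n{parent_b}\n"
    )
    if estimate_tokens(prompt) > max_prompt_tokens:
        raise ValueError("crossover prompt exceeds the token budget")
    return prompt

prompt = build_crossover_prompt("def solve(x): ...", "def solve(x): ...")
```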
Evaluation Memory
import hashlib

class EvaluationCache:
    """Cache fitness evaluations to avoid redundant computation."""
    def __init__(self, max_cache_size=10000):
        self.cache = {}  # insertion-ordered (Python 3.7+), evicted FIFO
        self.max_size = max_cache_size

    def get_or_evaluate(self, code, evaluator):
        code_hash = hashlib.sha256(code.encode()).hexdigest()
        if code_hash in self.cache:
            return self.cache[code_hash]
        result = evaluator.evaluate(code)
        # Evict the oldest entry first so the cache never exceeds max_size
        if len(self.cache) >= self.max_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        self.cache[code_hash] = result
        return result
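A small self-contained demonstration of the content-hash caching idea, with a stub evaluator that counts calls (the evaluator and its fitness formula are illustrative stand-ins):

```python
import hashlib

class CountingEvaluator:
    """Stub evaluator: counts how many times real evaluation runs."""
    def __init__(self):
        self.calls = 0

    def evaluate(self, code):
        self.calls += 1
        return {"fitness": len(code) % 10}  # stand-in for real benchmarking

cache, evaluator = {}, CountingEvaluator()

def get_or_evaluate(code):
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in cache:  # only unseen code triggers a real evaluation
        cache[key] = evaluator.evaluate(code)
    return cache[key]

get_or_evaluate("def solve(x): return x")
get_or_evaluate("def solve(x): return x")  # cache hit: no re-evaluation
print(evaluator.calls)  # 1
```

Keying on a hash of the source text means that when the LLM regenerates an identical program (which happens often under low-temperature sampling), the expensive sandboxed evaluation is skipped entirely.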
Disk Persistence
- Population checkpointed to disk every N generations for crash recovery
- Evolution history (fitness curves, diversity metrics) logged to disk in real-time
- Best discovered algorithms saved as standalone Python files
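A minimal checkpoint/restore sketch for the population using JSON. The schema and file naming are assumptions; a real implementation would also persist the archive, genealogy, and RNG state:

```python
import json
import os
import tempfile

def save_checkpoint(path, generation, population):
    """Atomically persist the population (code + fitness) for crash recovery."""
    state = {"generation": generation,
             "population": [{"code": ind["code"], "fitness": ind["fitness"]}
                            for ind in population]}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a torn file

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "gen_0042.json")
    save_checkpoint(path, 42, [{"code": "def solve(x): return x",
                                "fitness": 0.8}])
    state = load_checkpoint(path)
print(state["generation"])  # 42
```

The write-to-temp-then-rename pattern matters here: a crash mid-write leaves the previous checkpoint intact rather than a half-written file.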
13. Continued Learning
Warm-Starting
DGM supports warm-starting evolution from previously discovered algorithms:
- Seed population: Initialize with algorithms from a previous run on the same or related problem
- Transfer across domains: Use algorithms discovered for bin packing as seeds for scheduling problems
- Human-seeded evolution: Include human-written heuristics (e.g., First-Fit-Decreasing) in the initial population
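For example, a human-written First-Fit-Decreasing heuristic can be dropped into the seed population as an ordinary candidate. The `solve(instance)` signature follows the problem definition above; the instance dictionary format is an assumption for this sketch:

```python
def solve(instance):
    """First-Fit-Decreasing: sort items largest-first, place each into the
    first open bin with room, opening a new bin when none fits.
    Assumes instance = {"items": [sizes...], "capacity": c} (illustrative)."""
    items = sorted(instance["items"], reverse=True)
    capacity = instance["capacity"]
    remaining = []  # free space left in each open bin
    for item in items:
        for i, space in enumerate(remaining):
            if item <= space:
                remaining[i] -= item
                break
        else:
            remaining.append(capacity - item)  # open a new bin
    return {"num_bins": len(remaining)}

result = solve({"items": [0.5, 0.7, 0.5, 0.3], "capacity": 1.0})
print(result["num_bins"])  # 2
```

Seeding with such a well-understood baseline gives evolution a competent starting point while leaving it free to mutate the priority ordering or bin-choice rule.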
Cross-Problem Learning
One of DGM's most ambitious goals is discovering algorithmic primitives that transfer across entirely different problem domains. For example, a sorting-based heuristic discovered for bin packing might contain a priority function that is useful for job scheduling.
import ast

class TransferLearning:
    """Transfer evolved algorithms across problem domains."""
    def __init__(self, llm_client):
        self.llm = llm_client
def extract_primitives(self, elite_population):
"""Extract reusable algorithmic primitives from elites."""
primitives = []
for individual in elite_population:
# Parse AST to find reusable sub-functions
tree = ast.parse(individual.code)
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef):
if node.name != "solve": # helper functions
primitives.append({
"name": node.name,
"code": ast.unparse(node),
"source_fitness": individual.fitness
})
return primitives
    def seed_new_problem(self, primitives, new_problem):
        """Create seed population for new problem using primitives."""
        seeds = []
        for prim in primitives:
            # Use LLM to adapt each primitive to the new problem
            prompt = (
                "Adapt this algorithmic primitive for a new problem.\n\n"
                f"Primitive:\n{prim['code']}\n\n"
                f"New problem: {new_problem.description}\n\n"
                "Create a complete solve() function that uses this primitive."
            )
            adapted = self.llm.generate(prompt)
            seeds.append(adapted)
        return seeds
Incremental Evolution
- New problem instances can be added to the evaluation suite without restarting evolution
- As new distributions are added, the fitness landscape changes and evolution adapts
- This enables a form of continual learning where the algorithm population becomes more general over time
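The mechanism above can be sketched as an evaluation suite that grows mid-run, with generality recomputed as the mean score across all current distributions. The interface and the fraction-solved scoring are illustrative assumptions:

```python
class EvaluationSuite:
    """Evaluation suite that can grow mid-run (illustrative interface)."""
    def __init__(self):
        self.distributions = {}  # name -> list of problem instances

    def add_distribution(self, name, instances):
        self.distributions[name] = instances

    def generality(self, solve):
        # Score each distribution as the fraction of its instances solved
        per_dist = [sum(solve(x) is not None for x in xs) / len(xs)
                    for xs in self.distributions.values()]
        return sum(per_dist) / len(per_dist)

suite = EvaluationSuite()
suite.add_distribution("small", [1, 2, 3])
candidate = lambda x: x if x < 3 else None  # toy candidate: fails on x >= 3
g1 = suite.generality(candidate)            # 2/3 on "small" alone
suite.add_distribution("large", [10, 20])   # the fitness landscape shifts
g2 = suite.generality(candidate)            # mean of 2/3 and 0
print(g1 > g2)  # True
```

Because generality averages over whatever distributions currently exist, adding a harder distribution immediately penalizes narrow candidates and redirects selection pressure without restarting the run.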
14. Applications and Use Cases
Primary Applications
| Application | Description | Impact |
|---|---|---|
| Combinatorial Optimization | Discovering heuristics for NP-hard problems (bin packing, TSP, scheduling) | General heuristics that work across real-world distributions without manual tuning |
| ML Algorithm Design | Evolving loss functions, optimizers, regularization strategies | Data-driven discovery of ML components that generalize across datasets |
| Scientific Computing | Discovering numerical methods and approximation algorithms | Novel numerical algorithms that handle diverse equation types |
| Program Synthesis | Automatically generating correct and efficient programs from specifications | Generalizable solution strategies rather than per-instance code |
| Automated Mathematics | Discovering constructions for extremal combinatorics and related fields | Potential for novel mathematical results (following FunSearch tradition) |
Industrial Use Cases
Logistics and Supply Chain
DGM-discovered bin packing and scheduling algorithms could replace hand-tuned heuristics in warehouse operations, vehicle routing, and resource allocation, automatically adapting to changing demand distributions.
Chip Design / EDA
Placement and routing heuristics for VLSI design that generalize across different chip architectures rather than requiring per-chip tuning.
Drug Discovery
General molecular optimization algorithms that work across chemical families, rather than being specialized to specific target proteins.
Financial Optimization
Portfolio optimization and risk management algorithms that remain robust across different market regimes.
Research Implications
- Automated algorithm design: DGM represents a step toward fully automated algorithm design, where the role of the human shifts from writing algorithms to defining problems and fitness functions
- LLM as creative search: Demonstrates that LLMs can serve as effective search operators in high-dimensional program spaces, combining syntactic validity with semantic understanding
- Generality as an objective: Establishes a framework for explicitly optimizing for generality, which has broad implications for machine learning and AI research
- Open-ended evolution: The combination of LLM mutation, diversity maintenance, and cross-domain transfer points toward open-ended evolutionary systems that continuously discover novel algorithms
**Significance:** DGM represents a paradigm shift in algorithm design. Rather than humans crafting heuristics by hand, DGM uses LLMs as intelligent search operators to explore the space of possible algorithms, with explicit selection pressure for generality. The resulting algorithms are not just competitive with human-designed solutions — they transfer to unseen problem instances and distributions in ways that specialized solutions cannot.
DGM (Discovering General Methods) Technical Report. Generated March 2026. Based on: Sakana AI, "Discovering General Methods" (2025). Source: sakana.ai/dgm