Imbue Darwinian Evolver for ARC-AGI-2
Evolving Programs Through LLM-Guided Darwinian Natural Selection
Authors: Imbue Research (Kanjun Qiu, Josh Albrecht, et al.)
Published: February 27, 2026
Repository: github.com/imbue-ai/darwinian_evolver
Blog: imbue.com/research
Table of Contents
- Core Contribution
- Results
- Architecture
- Components
- Mutation Operators
- Selection
- Population Mgmt
- Diversity
- LLM Orchestration
- Search Strategy
- Fitness
- Prompts
- Cost Control
- Meta-Level
- Memory
- Reproducibility
- Applications
1. Core Contribution
Key Insight: Rather than using LLMs to directly solve ARC-AGI-2 tasks, the Darwinian Evolver treats the LLM as a mutation operator within a Darwinian evolutionary framework. Programs are the individuals, LLM-generated code modifications are the mutations, and task-specific fitness evaluation drives natural selection to converge on correct solutions.
The Imbue Darwinian Evolver represents a paradigm shift from prompt-engineering-based approaches to ARC-AGI-2. Instead of asking an LLM "what is the output?", the system asks "here is a program that partially works -- how can we modify it to work better?" This reframes abstract visual reasoning into a concrete program synthesis problem solved via evolutionary search.
Core Principles
- Programs as Individuals: Each candidate solution is a Python function (transform) that maps input grids to output grids.
- LLMs as Mutation Operators: Large language models generate code modifications (mutations) guided by the parent program, task examples, and error signals.
- Fitness = Pixel-Level Accuracy: Programs are evaluated by running them against training examples and measuring exact pixel match accuracy.
- Population-Based Search: Multiple candidate programs co-exist and compete, with selection pressure favoring higher-fitness individuals.
- Darwinian Selection: Tournament or fitness-proportionate selection determines which programs survive and reproduce (get mutated).
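To make the fitness principle concrete, here is a minimal sketch (the candidate program and helper names are hypothetical, not from the repository) of a `transform` function scored by pixel-match accuracy:

```python
import numpy as np

def transform(grid):
    """Hypothetical candidate program: mirror the grid left-to-right."""
    return np.fliplr(np.array(grid)).tolist()

def pixel_fitness(predicted, expected):
    """Exact pixel-match accuracy; a shape mismatch scores 0."""
    p, e = np.array(predicted), np.array(expected)
    if p.shape != e.shape:
        return 0.0
    return float(np.mean(p == e))

inp = [[1, 0], [2, 0]]
out = [[0, 1], [0, 2]]                      # left-right mirror of inp
score = pixel_fitness(transform(inp), out)  # perfect match -> 1.0
```

A program that produces the right shape but wrong cells gets partial credit, which is what gives the evolutionary search a gradient to climb.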
Novelty Over Prior Work
| Approach | Method | Limitation |
|---|---|---|
| Direct LLM prompting | Ask LLM to produce output grid | LLMs struggle with spatial reasoning |
| Code generation (single-shot) | Ask LLM to write transform function | ARC tasks require non-obvious algorithms |
| Best-of-N sampling | Generate N programs, pick best | No iterative refinement |
| Darwinian Evolver | Evolve programs via LLM-guided mutation + selection | Combines search + learning |
2. Key Results
ARC-AGI-2 Performance
| Metric | Value | Notes |
|---|---|---|
| ARC-AGI-2 Public Eval Score | ~5-8% | On the public evaluation set (semi-private) |
| ARC-AGI-2 Training Accuracy | Higher on training tasks | Evolutionary search can overfit to training examples |
| Per-task budget | ~$1-5 in API costs per task | Varies with number of generations |
| Time per task | Minutes to hours | Depends on convergence speed |
| Submission format | 2 guesses per task | Top-2 programs from final population |
Context: ARC-AGI-2 is substantially harder than ARC-AGI-1 (where top scores exceed 50%). On ARC-AGI-2, even frontier models score very low. The Darwinian Evolver demonstrates that evolutionary program synthesis is a viable approach to this challenging benchmark, achieving competitive results with other program synthesis methods.
Comparison with Baselines
| Method | Approach | ARC-AGI-2 Approx. |
|---|---|---|
| Direct GPT-4o / Claude | Direct output prediction | <2% |
| Single-shot code generation | One-shot program synthesis | ~3-5% |
| Imbue Darwinian Evolver | Evolutionary program synthesis | ~5-8% |
| Top leaderboard entries | Various (often hybrid) | ~10-15%+ |
3. Architecture
+==================================================================================+
| IMBUE DARWINIAN EVOLVER |
+==================================================================================+
| |
| +----------------------------+ +-------------------------------------+ |
| | TASK INPUT | | EVOLUTIONARY ENGINE | |
| | - Training pairs (I/O) | | | |
| | - Test input grid(s) |------>| 1. INITIALIZATION | |
| +----------------------------+ | - Seed population from LLM | |
| | - Random diverse programs | |
| +----------------------------+ | | |
| | LLM POOL | | 2. EVALUATION | |
| | - Claude 3.5 Sonnet | | - Run programs on train pairs | |
| | - GPT-4o | | - Pixel-match fitness score | |
| | - DeepSeek | | | |
| | - Gemini |<----->| 3. SELECTION | |
| +----------------------------+ | - Tournament selection | |
| | - Fitness-proportionate | |
| +----------------------------+ | | |
| | DSL / PRIMITIVES | | 4. MUTATION (via LLM) | |
| | - Grid operations | | - Code modification prompts | |
| | - Numpy/scipy helpers | | - Error-guided repair | |
| | - Color transforms | | - Crossover of parents | |
| +----------------------------+ | | |
| | 5. POPULATION MANAGEMENT | |
| | - Elitism (keep best) | |
| | - Diversity maintenance | |
| | - Population size control | |
| | | |
| | 6. TERMINATION | |
| | - Perfect fitness found | |
| | - Budget exhausted | |
| | - Max generations reached | |
| +-------------------------------------+ |
| | |
| v |
| +-------------------------------------+ |
| | OUTPUT | |
| | - Top-K programs | |
| | - Predicted test output(s) | |
| | - 2 guesses per task | |
| +-------------------------------------+ |
+==================================================================================+
Evolutionary Loop Detail
Generation t Generation t+1
+-----------+ +-----------+
| Program 1 |---fitness=0.8------+ | Program 1'| (mutated from P1)
| Program 2 |---fitness=0.4--+ | | Program 4 | (elite, kept)
| Program 3 |---fitness=0.2 | +-->| Program 5 | (crossover P1xP4)
| Program 4 |---fitness=0.9--+------>| Program 6 | (mutated from P4)
+-----------+ | | | Program 7 | (new seed from LLM)
| | +-----------+
v v
SELECTED ELIMINATED
(tournament) (low fitness)
4. Component Breakdown
Repository Structure
darwinian_evolver/
+-- README.md # Setup and usage instructions
+-- requirements.txt # Python dependencies
+-- config.yaml # Configuration (models, params)
+-- main.py # Entry point, orchestrates evolution
+-- evolver.py # Core evolutionary engine
+-- population.py # Population management
+-- individual.py # Individual program representation
+-- fitness.py # Fitness evaluation (grid comparison)
+-- mutation.py # LLM-based mutation operators
+-- selection.py # Selection strategies
+-- crossover.py # Crossover operators
+-- llm_client.py # Multi-model LLM interface
+-- prompts/ # Prompt templates
| +-- seed_prompt.txt # Initial program generation
| +-- mutate_prompt.txt # Mutation prompt template
| +-- crossover_prompt.txt # Crossover prompt template
| +-- repair_prompt.txt # Error-guided repair prompt
+-- utils/ # Utility functions
| +-- grid_utils.py # Grid manipulation helpers
| +-- arc_loader.py # ARC task loading
| +-- budget.py # Cost tracking and budget control
+-- tests/ # Test suite
Module Responsibilities
| Module | Responsibility | Role |
|---|---|---|
| evolver.py | The main evolutionary loop. Orchestrates initialization, evaluation, selection, mutation, and population replacement across generations; handles termination conditions and budget checking. | Core |
| population.py | Manages the collection of individuals: adding/removing individuals, population size constraints, elitism, and diversity metrics. | Data Structure |
| individual.py | Represents a single candidate program: the source code string, fitness score, generation number, parent lineage, and any metadata. | Data Structure |
| mutation.py | Implements LLM-based mutation operators. Takes a parent program and produces a mutated child by prompting an LLM with the code, task examples, and error information. | Operator |
| selection.py | Implements selection strategies: tournament selection, fitness-proportionate selection, and elitism. Determines which individuals get to reproduce. | Operator |
| fitness.py | Evaluates programs by executing them on training examples and computing pixel-match accuracy. Handles sandboxed execution, timeouts, and error catching. | Evaluation |
| llm_client.py | Provides a unified interface to multiple LLM providers (Anthropic, OpenAI, Google, DeepSeek). Handles API calls, retries, rate limiting, and cost tracking. | Infrastructure |
| prompts/ | Template directory containing carefully engineered prompts for each evolutionary operation: seeding, mutation, crossover, and repair. | Configuration |
5. Code Modification and Mutation Operators
Mutation is the primary genetic operator and the key innovation of the Darwinian Evolver. Instead of random token-level mutations (as in traditional genetic programming), mutations are semantically meaningful code modifications generated by LLMs.
Mutation Types
| Type | Description | When Used |
|---|---|---|
| Guided Mutation | LLM modifies parent code given error feedback | Default, most common |
| Random Mutation | LLM rewrites a random section of the code | Diversity injection |
| Error-Guided Repair | LLM fixes runtime errors or assertion failures | When parent crashes |
| Strategy Mutation | LLM changes the algorithmic approach entirely | When stuck in local optimum |
| Parameter Mutation | LLM tweaks constants, thresholds, indices | Fine-tuning near-correct programs |
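As a rough illustration (not the repository's actual dispatch logic), the choice among these mutation types can be driven by the parent's state:

```python
import random

def choose_mutation_type(parent, stagnation, rng=random):
    """Hypothetical dispatcher over the mutation types in the table.

    A crashed parent gets repaired, a stagnating search gets a strategy
    change, a near-correct program gets parameter tweaks, and everything
    else gets guided mutation with occasional random rewrites.
    """
    if parent.error_info:        # parent crashed during evaluation
        return "repair"
    if stagnation >= 3:          # best fitness flat for several generations
        return "strategy"
    if parent.fitness > 0.9:     # nearly correct: fine-tune constants
        return "parameter"
    return rng.choices(["guided", "random"], weights=[0.8, 0.2])[0]
```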
Core Mutation Implementation
class Mutator:
def __init__(self, llm_client, prompts, config):
self.llm = llm_client
self.prompts = prompts
self.config = config
def mutate(self, individual, task, error_info=None):
"""Generate a mutated child from a parent individual."""
# Build the mutation prompt with context
prompt = self.build_mutation_prompt(
parent_code=individual.code,
task_examples=task.train_pairs,
fitness=individual.fitness,
error_info=error_info,
predicted_outputs=individual.predictions,
expected_outputs=task.expected_outputs
)
# Call LLM to generate mutated code
response = self.llm.generate(
prompt=prompt,
temperature=self.config.mutation_temperature, # typically 0.7-1.0
max_tokens=self.config.max_code_tokens
)
# Extract code from LLM response
mutated_code = self.extract_code(response)
# Validate syntax before returning
if self.is_valid_python(mutated_code):
return Individual(
code=mutated_code,
generation=individual.generation + 1,
parent_id=individual.id,
mutation_type="guided"
)
return None # Mutation failed, discard
Mutation Prompt Construction
def build_mutation_prompt(self, parent_code, task_examples,
fitness, error_info, predicted_outputs,
expected_outputs):
"""Build a detailed prompt for LLM-guided mutation."""
prompt_parts = []
# System context
prompt_parts.append(
"You are an expert Python programmer helping to evolve "
"a program that transforms input grids to output grids. "
"Your task is to modify the given program to improve its "
"accuracy on the training examples."
)
# Task examples (input/output pairs)
for i, (inp, out) in enumerate(task_examples):
prompt_parts.append(f"Example {i+1}:")
prompt_parts.append(f"Input:\n{self.grid_to_str(inp)}")
prompt_parts.append(f"Expected Output:\n{self.grid_to_str(out)}")
if predicted_outputs and i < len(predicted_outputs):
prompt_parts.append(
f"Current Program Output:\n"
f"{self.grid_to_str(predicted_outputs[i])}"
)
# Current program
prompt_parts.append(f"Current program (fitness={fitness:.3f}):")
prompt_parts.append(f"```python\n{parent_code}\n```")
# Error information if available
if error_info:
prompt_parts.append(f"The program produced this error:\n{error_info}")
# Mutation instruction
prompt_parts.append(
"Please modify the program to better match the expected "
"outputs. Focus on understanding the pattern in the "
"input-output pairs and adjusting the logic accordingly. "
"Return ONLY the modified Python function."
)
return "\n\n".join(prompt_parts)
Temperature-Based Mutation Intensity
The mutation intensity is controlled via LLM temperature. Higher temperatures produce more radical mutations (exploration), while lower temperatures produce conservative modifications (exploitation):
$$ T = T_{\max} - (T_{\max} - T_{\min}) \cdot \frac{\mathrm{fitness}}{\mathrm{fitness}_{\max}} $$

where $T$ is the mutation temperature applied to a given parent and $\mathrm{fitness}$ is that parent's score. High-fitness parents get a low temperature (fine-tuning); low-fitness parents get a high temperature (exploration).
def adaptive_temperature(self, fitness, gen, max_gen):
"""Compute mutation temperature based on fitness and progress."""
# Base: high fitness -> low temperature (exploit)
# Low fitness -> high temperature (explore)
fitness_factor = 1.0 - fitness # fitness in [0, 1]
progress_factor = gen / max_gen # how far along we are
temperature = (
self.config.temp_min +
(self.config.temp_max - self.config.temp_min) *
fitness_factor * (1.0 - 0.3 * progress_factor)
)
return max(self.config.temp_min, min(self.config.temp_max, temperature))
6. Parent Selection and Sampling
Tournament Selection
The primary selection mechanism is tournament selection, where k individuals are randomly sampled from the population, and the one with the highest fitness is selected as a parent:
$$ P(\text{select individual } i) = P(i \text{ is best among } k \text{ random samples}) $$

For tournament size $k$ over a population of $N$ individuals ranked by fitness (rank 1 = best, sampled without replacement as in the code below):

$$ P(\text{rank } r \text{ selected}) = \binom{N-r}{k-1} \Big/ \binom{N}{k} $$

so, for example, the single best individual wins a tournament with probability $k/N$.
import random

class TournamentSelection:
def __init__(self, tournament_size=3):
self.tournament_size = tournament_size
def select(self, population, n_parents=1):
"""Select n_parents from population via tournament selection."""
parents = []
for _ in range(n_parents):
# Sample k random individuals
            tournament = random.sample(
                population.individuals,
                min(self.tournament_size, len(population.individuals))
            )
# Select the one with highest fitness
winner = max(tournament, key=lambda ind: ind.fitness)
parents.append(winner)
return parents
class FitnessProportionateSelection:
def select(self, population, n_parents=1):
"""Select parents with probability proportional to fitness."""
fitnesses = [ind.fitness for ind in population.individuals]
total = sum(fitnesses)
if total == 0:
# All zero fitness -- select uniformly
return random.sample(population.individuals, n_parents)
probabilities = [f / total for f in fitnesses]
selected = random.choices(
population.individuals,
weights=probabilities,
k=n_parents
)
return selected
Elitism
def apply_elitism(population, n_elite=2):
"""Preserve the top n_elite individuals across generations."""
sorted_pop = sorted(
population.individuals,
key=lambda ind: ind.fitness,
reverse=True
)
return sorted_pop[:n_elite]
Selection Pressure Scheduling
The tournament size can increase over generations to shift from exploration (low selection pressure) to exploitation (high selection pressure):
def scheduled_tournament_size(generation, max_generations,
min_k=2, max_k=5):
"""Increase tournament size over evolutionary run."""
progress = generation / max_generations
k = int(min_k + (max_k - min_k) * progress)
return max(min_k, min(max_k, k))
7. Population Management and Architecture
Population Data Structure
from dataclasses import dataclass, field
from typing import List, Optional
import hashlib
import uuid
@dataclass
class Individual:
"""A candidate solution program."""
code: str # Python source code of transform()
fitness: float = 0.0 # Pixel-match accuracy [0, 1]
generation: int = 0 # Which generation it was created
    id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])  # Unique identifier
parent_id: Optional[str] = None # Parent's ID (None for seeds)
mutation_type: str = "seed" # How this individual was created
error_info: Optional[str] = None # Runtime error if any
predictions: List = field( # Predicted outputs on train set
default_factory=list)
model_used: str = "" # Which LLM generated this
cost: float = 0.0 # API cost to generate this
code_hash: str = "" # Hash for deduplication
class Population:
"""Manages the evolutionary population."""
def __init__(self, max_size=20, elite_count=2):
self.individuals: List[Individual] = []
self.max_size = max_size
self.elite_count = elite_count
self.history: List[Individual] = [] # All ever created
self.seen_hashes: set = set() # For deduplication
self.best_ever: Optional[Individual] = None
def add(self, individual: Individual) -> bool:
"""Add individual if not duplicate. Returns success."""
code_hash = hashlib.md5(
individual.code.encode()
).hexdigest()
if code_hash in self.seen_hashes:
return False # Duplicate, reject
individual.code_hash = code_hash
self.seen_hashes.add(code_hash)
self.individuals.append(individual)
self.history.append(individual)
# Track best ever
if (self.best_ever is None or
individual.fitness > self.best_ever.fitness):
self.best_ever = individual
return True
def cull(self):
"""Remove worst individuals to maintain population size."""
if len(self.individuals) <= self.max_size:
return
# Sort by fitness descending
self.individuals.sort(
key=lambda ind: ind.fitness, reverse=True
)
# Keep elites + diverse selection of remainder
elites = self.individuals[:self.elite_count]
remainder = self.individuals[self.elite_count:]
# From remainder, keep diverse subset
kept = self.diverse_subset(
remainder,
self.max_size - self.elite_count
)
self.individuals = elites + kept
def get_best(self, n=1) -> List[Individual]:
"""Return top-n individuals by fitness."""
return sorted(
self.individuals,
key=lambda ind: ind.fitness,
reverse=True
)[:n]
Population Lifecycle
Generation 0 (Seeding):
LLM generates N diverse initial programs
|
v
Evaluate all -> Assign fitness scores
|
v
Generation 1..G (Evolution Loop):
1. Select parents (tournament / proportionate)
2. Mutate parents via LLM -> children
3. Evaluate children -> fitness
4. Add children to population
5. Cull population to max_size (keep elites + diverse)
6. Check termination (perfect fitness? budget? max gen?)
|
v
Final: Return top-2 programs for submission
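The lifecycle above can be condensed into a single loop. The sketch below uses placeholder callables and dict-based individuals rather than the repository's actual classes (all names are hypothetical):

```python
import random

def evolve(task, seed_fn, mutate_fn, evaluate_fn, pop_size=8,
           max_generations=20, tournament_k=3, rng=random):
    """Minimal sketch of the population lifecycle described above."""
    # Generation 0: seed and evaluate
    population = [evaluate_fn(seed_fn(task)) for _ in range(pop_size)]
    for gen in range(1, max_generations + 1):
        # 1-2. Tournament-select a parent and mutate it via the LLM
        parent = max(rng.sample(population, tournament_k),
                     key=lambda ind: ind["fitness"])
        child = mutate_fn(parent, task)
        if child is None:
            continue  # mutation failed to produce valid code; skip
        # 3-4. Evaluate the child and add it to the population
        population.append(evaluate_fn(child))
        # 5. Cull the worst individual to keep the population size fixed
        population.remove(min(population, key=lambda i: i["fitness"]))
        # 6. Early termination on a perfect program
        if max(i["fitness"] for i in population) >= 1.0:
            break
    # Final: top-2 programs for submission
    return sorted(population, key=lambda i: i["fitness"], reverse=True)[:2]
```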
8. Solution Diversity and Novelty
Maintaining diversity is critical to avoid premature convergence. The Darwinian Evolver uses multiple mechanisms:
8.1 Code-Level Deduplication
import hashlib
def is_duplicate(code, seen_hashes):
"""Check if functionally identical code already exists."""
# Normalize whitespace and comments for comparison
normalized = normalize_code(code)
h = hashlib.md5(normalized.encode()).hexdigest()
return h in seen_hashes
def normalize_code(code):
"""Strip comments and normalize whitespace."""
lines = []
for line in code.split('\n'):
stripped = line.strip()
if stripped and not stripped.startswith('#'):
lines.append(stripped)
return '\n'.join(lines)
8.2 Behavioral Diversity
Programs that produce different output patterns are valued even if they have similar fitness. The "behavioral fingerprint" of a program is its set of predicted outputs on the training examples:
def behavioral_distance(ind_a, ind_b):
"""Compute behavioral distance between two individuals."""
if not ind_a.predictions or not ind_b.predictions:
return 1.0 # Max distance if no predictions
distances = []
for pred_a, pred_b in zip(ind_a.predictions, ind_b.predictions):
if pred_a is None or pred_b is None:
distances.append(1.0)
continue
# Pixel-wise comparison of predicted outputs
a = np.array(pred_a)
b = np.array(pred_b)
if a.shape != b.shape:
distances.append(1.0)
else:
distances.append(np.mean(a != b))
return np.mean(distances)
def diverse_subset(self, candidates, n):
"""Select n diverse individuals using greedy farthest-first."""
if len(candidates) <= n:
return candidates
# Start with highest fitness
selected = [candidates[0]]
remaining = candidates[1:]
while len(selected) < n and remaining:
# Pick individual most different from already selected
best_candidate = None
best_min_dist = -1
for cand in remaining:
min_dist = min(
behavioral_distance(cand, sel)
for sel in selected
)
if min_dist > best_min_dist:
best_min_dist = min_dist
best_candidate = cand
selected.append(best_candidate)
remaining.remove(best_candidate)
return selected
8.3 Immigration / Fresh Seeding
Periodically, entirely new programs are generated from scratch (not mutated from existing parents) to inject fresh genetic material:
def maybe_inject_fresh_seed(self, population, task, generation):
"""Periodically inject completely new programs."""
if generation % self.config.immigration_interval == 0:
fresh = self.mutator.generate_seed(task)
if fresh:
population.add(fresh)
8.4 Model Diversity
Using multiple LLM models for mutations naturally introduces diversity in coding style, algorithmic approach, and problem-solving strategy. Different models have different "creative biases."
9. LLM Orchestration and Model Selection
Supported Models
| Model | Provider | Role | Approx. Cost (per 1M tokens) |
|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Primary mutation operator | $3 input / $15 output |
| GPT-4o | OpenAI | Mutation, crossover | $2.50 input / $10 output |
| DeepSeek V3/R1 | DeepSeek | Cost-efficient mutations | $0.27 input / $1.10 output |
| Gemini 1.5/2.0 | Google | Diversity, alternative approaches | Varies |
Multi-Model Strategy
class LLMClient:
"""Unified multi-model LLM interface."""
    def __init__(self, config):
        self.config = config
        self.models = config.models                  # List of model configs
        self.total_cost = 0.0
        self.budget_remaining = config.max_budget    # assumed config field
        self.model_stats = {}                        # Track success rate per model
def select_model(self, mutation_type, budget_remaining):
"""Select which model to use for this mutation."""
if budget_remaining < self.config.low_budget_threshold:
# Switch to cheapest model when budget is low
return self.cheapest_model()
if mutation_type == "strategy":
# Use strongest model for strategic changes
return self.strongest_model()
# Default: rotate models for diversity
return random.choice(self.models)
def generate(self, prompt, model=None, temperature=0.8,
max_tokens=4096):
"""Generate completion from selected model."""
if model is None:
model = self.select_model("default", self.budget_remaining)
# Route to appropriate API
if model.provider == "anthropic":
response = self._call_anthropic(prompt, model, temperature)
elif model.provider == "openai":
response = self._call_openai(prompt, model, temperature)
elif model.provider == "deepseek":
response = self._call_deepseek(prompt, model, temperature)
elif model.provider == "google":
response = self._call_google(prompt, model, temperature)
# Track cost
cost = self.compute_cost(response, model)
self.total_cost += cost
self.budget_remaining -= cost
return response
Model Rotation for Diversity
A key insight is that different LLMs approach code modification differently. Claude may favor clean refactors, GPT-4o may produce more creative algorithmic changes, and DeepSeek may offer unique mathematical perspectives. By rotating models across mutations, the population inherits diverse "genetic material" from different LLM lineages.
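One simple way to implement such rotation (a hypothetical sketch, with a floor weight so no model is ever starved of samples) is to weight each model by its observed mean fitness gain:

```python
import random

def rotate_model(models, model_stats, rng=random):
    """Pick a model for the next mutation, weighted by past success.

    model_stats maps model name -> {"mean_gain": float}; unknown models
    fall back to the floor weight so they keep being explored.
    """
    weights = [max(model_stats.get(m, {}).get("mean_gain", 0.0), 0.05)
               for m in models]
    return rng.choices(models, weights=weights)[0]
```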
10. Search Strategies
Island Model Evolution
For harder tasks, the system can run multiple independent evolutionary populations ("islands") that periodically exchange their best individuals:
Island 1 (Claude-based) Island 2 (GPT-based) Island 3 (DeepSeek-based)
+---------+ +---------+ +---------+
| Pop 1 | | Pop 2 | | Pop 3 |
| evolve | | evolve | | evolve |
| evolve | | evolve | | evolve |
+----+----+ +----+----+ +----+----+
| | |
+------- Migration --------+-------- Migration ------+
| (every M gens) | (every M gens) |
v v v
Best from I2,I3 Best from I1,I3 Best from I1,I2
injected into I1 injected into I2 injected into I3
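A migration step matching the diagram might look like the following sketch, where each island is a list of dict-based individuals (a simplification of the repository's structures):

```python
def migrate(islands, n_migrants=1):
    """Inject each island's best n_migrants into every other island,
    as in the diagram above. Migrants are copied so that subsequent
    evolution on one island does not mutate another island's records."""
    bests = [sorted(isl, key=lambda ind: ind["fitness"],
                    reverse=True)[:n_migrants] for isl in islands]
    for i, island in enumerate(islands):
        for j, migrants in enumerate(bests):
            if i != j:
                island.extend(dict(m) for m in migrants)
    return islands
```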
Adaptive Strategy Selection
class AdaptiveStrategy:
"""Adaptively choose between exploration and exploitation."""
def __init__(self):
self.stagnation_counter = 0
self.best_fitness_history = []
def choose_strategy(self, population, generation):
"""Decide mutation strategy based on evolutionary progress."""
current_best = population.get_best()[0].fitness
# Track stagnation
if (self.best_fitness_history and
current_best <= self.best_fitness_history[-1]):
self.stagnation_counter += 1
else:
self.stagnation_counter = 0
self.best_fitness_history.append(current_best)
# If stagnating, increase exploration
if self.stagnation_counter >= 3:
return "explore" # High temp, strategy mutations
elif current_best > 0.8:
return "exploit" # Low temp, parameter tweaks
else:
return "balanced" # Default mixed strategy
Restart Mechanism
If the population has stagnated for too many generations without improvement, a full restart can be triggered, keeping only the all-time best individual and regenerating the rest from scratch:
def maybe_restart(self, population, stagnation_count):
"""Restart population if hopelessly stuck."""
if stagnation_count >= self.config.restart_threshold:
best = population.best_ever
population.clear()
if best:
population.add(best)
# Re-seed rest of population
for _ in range(self.config.seed_count - 1):
seed = self.mutator.generate_seed(self.task)
if seed:
population.add(seed)
return True
return False
11. Evaluation and Fitness Assessment
Fitness Function
The fitness function measures how well a candidate program's outputs match the expected outputs across all training examples:
$$ \mathrm{fitness}(p) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{acc}\big(p(\mathrm{input}_i),\ \mathrm{output}_i\big), \qquad \mathrm{acc}(P, E) = \frac{\left|\{(r,c) : P[r][c] = E[r][c]\}\right|}{\mathrm{rows} \times \mathrm{cols}} $$

If the predicted shape differs from the expected shape, $\mathrm{acc} = 0$. If the program raises an exception, its fitness is 0 (but the error information is captured for repair prompts).
import numpy as np
import signal
import traceback
class FitnessEvaluator:
def __init__(self, timeout_seconds=5):
self.timeout = timeout_seconds
def evaluate(self, individual, task):
"""Evaluate individual on all training examples."""
scores = []
predictions = []
error_info = None
for inp, expected_out in task.train_pairs:
try:
# Execute program in sandboxed environment
predicted = self.run_program(
individual.code, inp
)
predictions.append(predicted)
# Compare predicted vs expected
score = self.pixel_accuracy(predicted, expected_out)
scores.append(score)
except TimeoutError:
scores.append(0.0)
predictions.append(None)
error_info = "Timeout: program exceeded time limit"
except Exception as e:
scores.append(0.0)
predictions.append(None)
error_info = traceback.format_exc()
individual.fitness = np.mean(scores) if scores else 0.0
individual.predictions = predictions
individual.error_info = error_info
return individual
def pixel_accuracy(self, predicted, expected):
"""Compute pixel-level match accuracy."""
pred = np.array(predicted)
exp = np.array(expected)
if pred.shape != exp.shape:
return 0.0 # Shape mismatch = zero fitness
matches = np.sum(pred == exp)
total = exp.size
return matches / total if total > 0 else 0.0
    def run_program(self, code, input_grid):
        """Execute the transform function with a SIGALRM-based timeout."""
        # Create isolated namespace
        namespace = {'np': np, 'numpy': np}

        def _on_timeout(signum, frame):
            raise TimeoutError("program exceeded time limit")

        # Install a handler so the alarm raises TimeoutError instead of
        # killing the process (signal.alarm alone would terminate it)
        old_handler = signal.signal(signal.SIGALRM, _on_timeout)
        signal.alarm(self.timeout)
        try:
            exec(code, namespace)       # Define transform() in the sandbox
            return namespace['transform'](input_grid)
        finally:
            signal.alarm(0)             # Cancel the alarm
            signal.signal(signal.SIGALRM, old_handler)
Multi-Objective Fitness (Optional)
Beyond pixel accuracy, the system can incorporate secondary objectives:
- Code simplicity: Shorter programs preferred (Occam's razor)
- Execution speed: Faster programs preferred
- Partial credit: Correct grid dimensions, correct colors present, correct patterns
$$ \mathrm{fitness}_{\mathrm{combined}} = \alpha \cdot \mathrm{acc} + \beta \left(1 - \frac{\mathrm{code\_length}}{\mathrm{max\_length}}\right) + \gamma \left(1 - \frac{\mathrm{exec\_time}}{\mathrm{timeout}}\right) $$

Typical weights: $\alpha = 0.9$, $\beta = 0.05$, $\gamma = 0.05$.
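The weighted combination can be computed directly. In this sketch, `max_len` and `timeout` are assumed normalization caps, not values from the repository:

```python
def combined_fitness(pixel_acc, code_len, exec_time,
                     max_len=2000, timeout=5.0,
                     alpha=0.9, beta=0.05, gamma=0.05):
    """Multi-objective fitness: accuracy dominates, with small bonuses
    for shorter (simpler) and faster programs."""
    simplicity = 1.0 - min(code_len / max_len, 1.0)
    speed = 1.0 - min(exec_time / timeout, 1.0)
    return alpha * pixel_acc + beta * simplicity + gamma * speed
```

Because accuracy carries 90% of the weight, the secondary objectives only break ties between programs of near-equal accuracy rather than trading correctness for brevity.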
12. Prompt Engineering
Seed Generation Prompt
The initial population is generated by prompting LLMs to write transform functions from scratch:
# Seed prompt template (simplified)
SEED_PROMPT = """You are solving an ARC (Abstraction and Reasoning Corpus) task.
Given input grids, you must write a Python function `transform(grid)` that
produces the correct output grid.
The grid is a 2D list of integers (0-9), where each integer represents a color:
0=black, 1=blue, 2=red, 3=green, 4=yellow, 5=gray, 6=magenta, 7=orange,
8=cyan, 9=maroon
Training examples:
{examples}
Analyze the pattern in these examples, then write a Python function:
```python
import numpy as np
def transform(grid):
grid = np.array(grid)
# Your implementation here
return output.tolist()
```

Think step by step about what transformation maps each input to its output.
Return ONLY the Python function, no explanation."""
Mutation Prompt Template
MUTATION_PROMPT = """You are evolving a Python program to solve an ARC task.
Task Examples:
{examples_with_current_predictions}
Current Program (fitness = {fitness:.1%}):
{parent_code}
Analysis:
- The program gets {correct_count}/{total_count} pixels correct
- Main discrepancies: {discrepancy_analysis} {error_section}
Instructions:
Modify the program to improve its accuracy. Consider:
1. Does the program correctly identify the transformation pattern?
2. Are there edge cases it misses?
3. Should the algorithmic approach be changed entirely?
Return ONLY the modified Python function."""
Error Repair Prompt
REPAIR_PROMPT = """The following Python program crashed with an error:
{code}
Error:
{error_traceback}
Fix the error while preserving the intended logic. If the logic itself is flawed, also improve the approach. Return ONLY the fixed function."""
Crossover Prompt
CROSSOVER_PROMPT = """Combine the best ideas from these two programs into a single improved program.
Program A (fitness = {fitness_a:.1%}):
{code_a}
Program B (fitness = {fitness_b:.1%}):
{code_b}
Task Examples:
{examples}
Create a new program that combines the strongest aspects of both approaches. Return ONLY the combined Python function."""
13. Crossover Operators
While mutation is the primary operator, crossover combines genetic material from two parent programs:
class Crossover:
    def __init__(self, llm_client, prompts):
        self.llm = llm_client
        self.prompts = prompts
def crossover(self, parent_a, parent_b, task):
"""Produce a child by combining two parents via LLM."""
prompt = self.prompts["crossover"].format(
code_a=parent_a.code,
code_b=parent_b.code,
fitness_a=parent_a.fitness,
fitness_b=parent_b.fitness,
examples=self.format_examples(task)
)
response = self.llm.generate(
prompt=prompt,
temperature=0.7,
max_tokens=4096
)
child_code = self.extract_code(response)
if child_code:
return Individual(
code=child_code,
generation=max(parent_a.generation,
parent_b.generation) + 1,
parent_id=f"{parent_a.id}x{parent_b.id}",
mutation_type="crossover"
)
return None
Crossover Selection
Crossover is applied less frequently than mutation (typically 20-30% of offspring are from crossover). Parents for crossover are selected from the top half of the population by fitness, ensuring that only reasonably good programs contribute to recombination.
$$ P(\mathrm{crossover}) \approx 0.2\text{--}0.3, \qquad P(\mathrm{mutation}) = 1 - P(\mathrm{crossover}) $$

Both crossover parents are drawn from the top 50% of the population by fitness.
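Per offspring, the operator choice and parent sampling described above might be sketched as follows (dict-based individuals and all names are hypothetical):

```python
import random

def make_offspring(population, crossover_rate=0.25, rng=random):
    """Decide per offspring between crossover (both parents drawn from
    the top half by fitness) and mutation (tournament winner).
    Returns the operator name and the chosen parent(s)."""
    ranked = sorted(population, key=lambda ind: ind["fitness"],
                    reverse=True)
    if rng.random() < crossover_rate and len(ranked) >= 4:
        top_half = ranked[:max(2, len(ranked) // 2)]
        return "crossover", rng.sample(top_half, 2)
    # Mutation path: 3-way tournament over the whole population
    tournament = rng.sample(population, min(3, len(population)))
    return "mutation", [max(tournament, key=lambda i: i["fitness"])]
```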
14. Cost Control and Budget Management
Per-Task Budget
class BudgetManager:
    """Track and enforce API cost budgets."""
def __init__(self, max_budget_per_task=5.0,
warning_threshold=0.8):
self.max_budget = max_budget_per_task # dollars
self.warning_threshold = warning_threshold
self.spent = 0.0
self.calls = 0
self.cost_per_model = {}
def can_afford(self, estimated_cost):
"""Check if we can afford another API call."""
return (self.spent + estimated_cost) <= self.max_budget
def record_cost(self, cost, model_name):
"""Record an API call cost."""
self.spent += cost
self.calls += 1
self.cost_per_model[model_name] = (
self.cost_per_model.get(model_name, 0) + cost
)
def budget_remaining(self):
return self.max_budget - self.spent
def should_switch_to_cheap_model(self):
"""Switch to cheaper model when budget is running low."""
return self.spent >= self.max_budget * self.warning_threshold
Cost Estimates
| Component | Estimated Cost | Notes |
| --- | --- | --- |
| Seed generation (N=10 programs) | $0.10 - $0.50 | Depends on model used |
| Per mutation (1 LLM call) | $0.01 - $0.10 | Varies by model, prompt length |
| Per generation (pop=20) | $0.20 - $2.00 | ~10-20 mutations per generation |
| Full task (20-50 generations) | $1 - $5 | Typical range with budget control |
| Full ARC-AGI-2 eval (100 tasks) | $100 - $500 | Varies with difficulty |
### Cost Optimization Strategies
- **Early termination:** Stop immediately upon finding a perfect-fitness program.
- **Model cascading:** Start with cheap models (DeepSeek), escalate to expensive ones (Claude) only for harder tasks.
- **Prompt compression:** Minimize token count in prompts by using compact grid representations.
- **Cache deduplication:** Don't re-evaluate programs that have already been seen.
- **Adaptive generation count:** Easy tasks converge in 5-10 generations; hard tasks get up to 50+.
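The cache-deduplication idea can be sketched as below. The AST round-trip normalization is an assumption, not necessarily what the repository does, but it illustrates how trivially reformatted programs collapse to the same cache key:

```python
import ast
import hashlib

def normalize(code: str) -> str:
    """Round-trip through the AST so that whitespace and formatting
    differences do not produce distinct cache entries."""
    return ast.unparse(ast.parse(code))

def code_key(code: str) -> str:
    """Stable hash of the normalized program text."""
    return hashlib.sha256(normalize(code).encode()).hexdigest()

seen = set()

def is_duplicate(code: str) -> bool:
    """True if an equivalent program was already evaluated; otherwise
    record it and return False."""
    key = code_key(code)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Skipping fitness evaluation for duplicates saves both API cost (no wasted mutation slots on repeats) and sandbox execution time.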
## 15. Meta-Level and Self-Improvement
### Prompt Evolution
A meta-level strategy is to evolve the prompts themselves. If certain prompt phrasings lead to higher-fitness mutations on average, those prompt templates are favored:
```python
import numpy as np

class PromptEvolver:
    """Evolve the mutation prompts themselves."""

    def __init__(self, base_prompts, llm_client):
        self.prompt_population = [
            {"template": p, "avg_fitness_gain": 0.0, "uses": 0}
            for p in base_prompts
        ]
        self.llm = llm_client
        self.total_uses = 0

    def select_prompt(self):
        """Select a prompt template weighted by past success."""
        # Upper confidence bound (UCB) selection
        best_score = -float('inf')
        best_prompt = None
        for p in self.prompt_population:
            if p["uses"] == 0:
                return p  # Explore unused prompts first
            ucb = (p["avg_fitness_gain"] +
                   np.sqrt(2 * np.log(self.total_uses) / p["uses"]))
            if ucb > best_score:
                best_score = ucb
                best_prompt = p
        return best_prompt

    def update(self, prompt_entry, fitness_gain):
        """Update prompt statistics after use."""
        n = prompt_entry["uses"]
        prompt_entry["avg_fitness_gain"] = (
            (prompt_entry["avg_fitness_gain"] * n + fitness_gain) / (n + 1)
        )
        prompt_entry["uses"] += 1
        self.total_uses += 1
```
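Isolating the UCB rule used in `select_prompt`, a self-contained sketch (the prompt names and statistics below are made up for illustration):

```python
import math

def ucb_score(avg_gain, uses, total_uses, c=2.0):
    """UCB1-style score: average fitness gain plus an exploration
    bonus that shrinks as a prompt template is used more often."""
    return avg_gain + math.sqrt(c * math.log(total_uses) / uses)

# Hypothetical per-prompt statistics: name -> (avg_fitness_gain, uses)
stats = {"concise": (0.05, 10), "verbose": (0.02, 2)}
total = sum(uses for _, uses in stats.values())

# The rarely tried "verbose" prompt wins on its exploration bonus,
# even though its average gain so far is lower.
best = max(stats, key=lambda name: ucb_score(*stats[name], total))
```

This is the classic exploration/exploitation trade-off: a template with a slightly lower average can still be selected until enough evidence accumulates against it.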
### Hyperparameter Adaptation
The system can adapt its own hyperparameters during a run:
- **Population size:** Increase if diversity is low, decrease if budget is tight.
- **Mutation temperature:** Adjust based on stagnation (see Section 5).
- **Tournament size:** Increase over generations (see Section 6).
- **Crossover rate:** Increase if crossover offspring outperform mutations.
### Transfer Learning Across Tasks
Solutions (or partial solutions) from previously solved tasks can serve as seed programs for new similar tasks. Common utility functions discovered during evolution can be extracted and reused:
```python
class TransferLibrary:
    """Library of useful code fragments from solved tasks."""

    def __init__(self):
        self.solved_programs = {}    # task_id -> best program
        self.utility_functions = []  # Extracted helpers

    def register_solution(self, task_id, program):
        self.solved_programs[task_id] = program
        self.extract_utilities(program)  # mines reusable helpers (not shown)

    def get_relevant_seeds(self, task, n=3):
        """Find previously solved programs that might help."""
        # Simple heuristic: tasks with similar grid dimensions
        # or similar color palettes
        candidates = []
        for tid, prog in self.solved_programs.items():
            similarity = self.task_similarity(task, tid)
            candidates.append((similarity, prog))
        # Sort on similarity only; program objects are not comparable
        candidates.sort(key=lambda c: c[0], reverse=True)
        return [prog for _, prog in candidates[:n]]
```
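The `task_similarity` helper is not shown above; one plausible implementation of the stated heuristic (shared grid dimensions and color palettes) might look like the following. Grids are plain lists of lists, and the equal Jaccard weighting is an assumption:

```python
def grid_signature(grids):
    """Coarse signature of a task: the set of grid shapes and the
    set of colors appearing across its example grids."""
    shapes = {(len(g), len(g[0])) for g in grids}
    colors = {c for g in grids for row in g for c in row}
    return shapes, colors

def task_similarity(grids_a, grids_b):
    """Average Jaccard overlap of shapes and palettes (a toy heuristic)."""
    (sa, ca), (sb, cb) = grid_signature(grids_a), grid_signature(grids_b)

    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 1.0

    return 0.5 * (jaccard(sa, sb) + jaccard(ca, cb))
```

A richer heuristic could also compare transformation types (e.g. whether output size equals input size), but shape and palette overlap is cheap and often discriminative enough for seed selection.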
## 16. Main Evolutionary Loop
```python
import random

class DarwinianEvolver:
    """Main evolutionary engine for ARC-AGI-2 program synthesis."""

    def __init__(self, config):
        self.config = config
        self.llm = LLMClient(config)
        self.mutator = Mutator(self.llm, config.prompts, config)
        self.crossover = Crossover(self.llm, config.prompts)
        self.selector = TournamentSelection(config.tournament_size)
        self.evaluator = FitnessEvaluator(config.timeout)
        self.budget = BudgetManager(config.max_budget_per_task)
        self.strategy = AdaptiveStrategy()

    def solve(self, task):
        """Evolve a solution for the given ARC task."""
        # 1. Initialize population with seed programs
        population = Population(
            max_size=self.config.pop_size,
            elite_count=self.config.elite_count
        )
        self.seed_population(population, task)

        # 2. Evaluate initial population
        for ind in population.individuals:
            self.evaluator.evaluate(ind, task)

        # 3. Evolutionary loop
        for gen in range(self.config.max_generations):
            # Check termination conditions
            best = population.get_best()[0]
            if best.fitness >= 1.0:
                break  # Perfect solution found!
            if not self.budget.can_afford(0.01):
                break  # Budget exhausted

            # Determine strategy
            strategy = self.strategy.choose_strategy(population, gen)

            # Generate offspring
            offspring = []
            for _ in range(self.config.offspring_per_gen):
                if not self.budget.can_afford(0.01):
                    break
                if random.random() < self.config.crossover_rate:
                    # Crossover
                    parents = self.selector.select(population, 2)
                    child = self.crossover.crossover(
                        parents[0], parents[1], task
                    )
                else:
                    # Mutation
                    [parent] = self.selector.select(population, 1)
                    child = self.mutator.mutate(
                        parent, task, parent.error_info
                    )
                if child:
                    self.evaluator.evaluate(child, task)
                    offspring.append(child)

            # Add offspring to population
            for child in offspring:
                population.add(child)

            # Cull population to max size
            population.cull()

            # Maybe inject fresh seeds
            self.maybe_inject_fresh_seed(population, task, gen)

            # Maybe restart if stuck
            self.maybe_restart(population,
                               self.strategy.stagnation_counter)

        # 4. Return top-2 programs for ARC submission
        top2 = population.get_best(n=2)
        return [self.predict_test(ind, task) for ind in top2]

    def seed_population(self, population, task):
        """Generate initial seed programs."""
        for i in range(self.config.seed_count):
            # Use different models for diversity
            model = self.llm.models[i % len(self.llm.models)]
            seed = self.mutator.generate_seed(task, model=model)
            if seed:
                population.add(seed)

    def predict_test(self, individual, task):
        """Run a top program on the test input to get a prediction."""
        try:
            return self.evaluator.run_program(
                individual.code, task.test_input
            )
        except Exception:
            return [[0]]  # Fallback: minimal empty grid
```
## 17. Memory Management
### Population Storage
- **Active population:** In-memory list of `Individual` objects (typically 10-30 per task).
- **History archive:** All individuals ever created are stored for analysis and potential revival.
- **Deduplication set:** Hash set of all seen program codes to prevent duplicates.
- **Best-ever tracking:** The single best individual across all generations is always preserved.
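Best-ever tracking can be as simple as the following sketch, kept separate from the cullable population so that culling or a restart can never discard the best program found so far:

```python
class BestEverTracker:
    """Preserve the single best individual seen across all
    generations, independent of population culling (a minimal sketch;
    individuals are assumed to expose a `fitness` attribute)."""

    def __init__(self):
        self.best = None

    def observe(self, individual):
        """Call once per newly evaluated individual."""
        if self.best is None or individual.fitness > self.best.fitness:
            self.best = individual
```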
### Serialization
```python
import json

def save_population(population, filepath):
    """Serialize population to JSON for persistence."""
    data = {
        "individuals": [
            {
                "id": ind.id,
                "code": ind.code,
                "fitness": ind.fitness,
                "generation": ind.generation,
                "parent_id": ind.parent_id,
                "mutation_type": ind.mutation_type,
                "model_used": ind.model_used,
                "cost": ind.cost,
            }
            for ind in population.individuals
        ],
        "best_ever_id": population.best_ever.id if population.best_ever else None,
        "generation": max(
            (ind.generation for ind in population.individuals), default=0
        ),
    }
    with open(filepath, 'w') as f:
        json.dump(data, f, indent=2)
```
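A matching loader (hypothetical; the repository may rebuild `Individual` objects directly) would read the same fields back:

```python
import json

def load_population(filepath):
    """Counterpart to save_population: read the JSON snapshot back.
    Returns the raw individual records plus snapshot metadata; the
    caller reconstructs Individual objects from the dicts."""
    with open(filepath) as f:
        data = json.load(f)
    return (
        data["individuals"],
        data.get("best_ever_id"),
        data.get("generation", 0),
    )
```

Keeping the load path tolerant of missing keys (`.get(...)`) lets older snapshots remain readable as the schema evolves.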
### Memory Footprint
| Component | Size | Notes |
| --- | --- | --- |
| Single Individual | ~1-10 KB | Code string + metadata |
| Active Population (20) | ~20-200 KB | In memory during evolution |
| History (500 individuals) | ~0.5-5 MB | All programs ever generated |
| Predictions cache | ~1-10 MB | Stored grid predictions for behavioral distance |
| Total per task | <50 MB | Lightweight |
## 18. Reproducibility
### Setup Instructions
```bash
# Clone the repository
git clone https://github.com/imbue-ai/darwinian_evolver.git
cd darwinian_evolver

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set API keys
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export DEEPSEEK_API_KEY="..."
export GOOGLE_API_KEY="..."

# Download ARC-AGI-2 dataset
# (follow instructions in README for dataset setup)

# Run on a single task
python main.py --task_id

# Run on full evaluation set
python main.py --eval --config config.yaml --output results/
```
### Dependencies
| Package | Version | Purpose |
| --- | --- | --- |
| Python | >=3.10 | Runtime |
| numpy | >=1.24 | Grid operations |
| anthropic | latest | Claude API client |
| openai | latest | GPT-4o API client |
| pyyaml | latest | Configuration parsing |
| scipy | >=1.10 | Optional: advanced grid operations |
| tqdm | latest | Progress bars |
### Configuration (config.yaml)
```yaml
# Evolutionary parameters
population_size: 20
max_generations: 50
elite_count: 2
offspring_per_generation: 10
crossover_rate: 0.2
seed_count: 10
tournament_size: 3

# Mutation parameters
mutation_temperature_min: 0.3
mutation_temperature_max: 1.0
max_code_tokens: 4096

# Budget
max_budget_per_task: 5.0  # dollars
low_budget_threshold: 0.8

# Models
models:
  - name: claude-3-5-sonnet
    provider: anthropic
    role: primary
  - name: gpt-4o
    provider: openai
    role: secondary
  - name: deepseek-chat
    provider: deepseek
    role: budget

# Diversity
immigration_interval: 5  # inject fresh seed every N generations
restart_threshold: 10    # restart after N stagnant generations

# Execution
program_timeout: 5  # seconds per program execution
```
## 19. Programming Language
#### Framework Language
**Python 3.10+** -- The evolutionary engine, LLM orchestration, fitness evaluation, and all infrastructure code are written in Python.
---
#### Evolved Programs
**Python (numpy-based)** -- The candidate solutions (transform functions) are Python functions using numpy for grid manipulation. Programs are executed via `exec()` in sandboxed namespaces.
---
The choice of Python for both the framework and evolved programs is intentional: LLMs are most proficient at generating Python code, and numpy provides efficient grid operations. The `exec()`-based execution model allows programs to be stored as strings and dynamically evaluated.
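A minimal sketch of that `exec()`-based execution model (no real sandboxing here; production use would add process isolation, import restrictions, and timeouts):

```python
import numpy as np

def run_transform(code: str, grid):
    """Execute an evolved program string and apply its transform()
    function to one input grid. The namespace exposes only numpy;
    the program is expected to define transform(grid) -> grid."""
    namespace = {"np": np, "numpy": np}
    exec(code, namespace)  # defines transform() in the namespace
    out = namespace["transform"](np.array(grid))
    return np.asarray(out).tolist()
```

Usage: `run_transform("def transform(g):\n    return g.T", [[1, 2], [3, 4]])` transposes the grid. Because programs live as strings, they can be stored, hashed for deduplication, serialized to JSON, and shipped between processes without any import machinery.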
## 20. Continued Learning
### Within-Task Learning
The evolutionary process itself is a form of learning: the population accumulates knowledge about the task through the fitness landscape. Each generation's survivors encode "learned" information about the input-output relationship.
### Cross-Task Transfer
- **Utility library accumulation:** Common helper functions (flood fill, connected components, rotation, reflection, color remapping) discovered during evolution are saved and injected as available imports for future tasks.
- **Solution seeding:** Programs from solved tasks can seed the initial population for similar unsolved tasks.
- **Prompt refinement:** Prompt templates that led to high-quality mutations are preserved and reused.
### Model Fine-Tuning (Future Direction)
The evolutionary history (parent program, mutation prompt, successful child program) forms a natural dataset for fine-tuning LLMs to become better mutation operators. This creates a virtuous cycle:
```
 Evolution produces        Fine-tune LLMs on        Better mutations
(parent, child) pairs ---> successful mutations ---> improve evolution
         |                          |                        |
         +---------->--------------+----------->------------+
                            Virtuous Cycle
```
## 21. Applications and Generalization
### Beyond ARC-AGI-2
| Application Domain | How It Applies | Adaptation Needed |
|---|---|---|
| General program synthesis | Evolve programs from input/output specs | Change fitness function to match domain |
| Algorithm optimization | Evolve faster/better algorithms | Fitness = runtime or correctness on test suite |
| Mathematical reasoning | Evolve proof strategies or formulas | Fitness = theorem prover verification |
| Scientific discovery | Evolve hypotheses expressed as code | Fitness = experimental data fit |
| Code refactoring | Evolve cleaner versions of messy code | Fitness = correctness + readability metrics |
| Game AI | Evolve strategy functions | Fitness = win rate in simulated games |
| Data transformation | Evolve ETL pipelines from examples | Fitness = output match on sample data |
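The "adaptation needed" column mostly amounts to swapping in a different fitness callable. A hedged sketch of that interface, with `run_program` as an assumed execution helper passed in by the caller:

```python
def make_io_fitness(examples, run_program):
    """Build a fitness function from (input, expected_output) pairs.

    `run_program(code, inp)` is an assumed helper that executes a
    candidate program string on one input. Fitness is the fraction of
    examples reproduced exactly -- the direct analogue of ARC's
    pixel-accuracy objective for an arbitrary domain.
    """
    def fitness(code):
        hits = 0
        for inp, expected in examples:
            try:
                if run_program(code, inp) == expected:
                    hits += 1
            except Exception:
                pass  # crashing programs simply score lower
        return hits / len(examples)
    return fitness
```

Everything else in the evolutionary loop (selection, mutation, culling) is untouched; only the scoring function changes per domain.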
### Key Advantages of the Darwinian Approach
- **No task-specific architecture needed:** The same evolutionary framework works for any task where fitness can be computed.
- **Model-agnostic:** Any LLM can serve as a mutation operator. As models improve, evolution automatically benefits.
- **Inherently parallel:** Multiple mutations can be generated and evaluated in parallel.
- **Interpretable solutions:** Unlike neural network outputs, evolved programs are readable Python code that can be inspected and understood.
- **Compositional:** Solutions can build on each other -- successful subroutines from one task transfer to others.
### Limitations
- **Cost:** Requires many LLM API calls, making it expensive at scale.
- **Latency:** Evolution takes minutes to hours per task, making it unsuitable for real-time applications.
- **LLM ceiling:** The quality of mutations is bounded by the LLM's coding ability.
- **Fitness function design:** Requires a well-defined fitness function; open-ended tasks are harder to evaluate.
- **Stochasticity:** Results are non-deterministic; the same task may get different solutions across runs.
## 22. Summary
The Imbue Darwinian Evolver demonstrates that Darwinian evolution, with LLMs serving as intelligent mutation operators, is a viable and elegant approach to program synthesis for abstract reasoning tasks. By maintaining populations of candidate programs, applying selection pressure via pixel-accuracy fitness, and leveraging multiple LLMs for diverse mutations, the system discovers Python programs that solve ARC-AGI-2 tasks -- achieving competitive scores on one of the most challenging AI benchmarks.
### Key Takeaways
- **Evolution works for code:** Treating LLMs as mutation operators in a Darwinian framework outperforms naive best-of-N sampling.
- **Diversity is critical:** Code deduplication, behavioral distance, multi-model mutations, and immigration prevent premature convergence.
- **Error feedback is gold:** Passing error messages and output discrepancies back to the LLM enables targeted repairs.
- **Adaptive strategies matter:** Adjusting temperature, tournament size, and mutation type based on evolutionary progress improves efficiency.
- **Budget management is essential:** Cost-aware model selection and early termination keep the approach practical.
- **The approach is general:** The framework applies far beyond ARC to any domain with computable fitness.
### Research Connections
| Related Work | Relationship |
|---|---|
| FunSearch (DeepMind) | Also evolves programs via LLMs; Darwinian Evolver adds multi-model diversity and adaptive strategies |
| AlphaEvolve (DeepMind) | Gemini-powered code evolution; similar spirit, different implementation |
| OpenEvolve | Open-source LLM-guided evolution; shares evolutionary paradigm |
| ShinkaEvolve (Sakana AI) | Evolutionary ML/AI agent; parallel development in the evolution-for-AI space |
| Genetic Programming (Koza) | Classical GP; Darwinian Evolver replaces random tree mutations with LLM-guided semantic mutations |
| POET / PAIRED | Co-evolution of environments and agents; population-based search inspiration |
## References
- Imbue Research. "Evolving Solutions to ARC-AGI-2 with Darwinian Evolution." imbue.com/research/2026-02-27-arc-agi-2-evolution/, February 2026.
- Imbue AI. "darwinian_evolver." GitHub Repository. github.com/imbue-ai/darwinian_evolver, 2026.
- Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547, 2019.
- Romera-Paredes, B. et al. "Mathematical discoveries from program search with large language models." Nature 625, 468-475 (2024). [FunSearch]
- Lehman, J. et al. "Evolution through Large Models." arXiv:2206.08896, 2022.
- DeepMind. "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms." 2025.
- Koza, J.R. "Genetic Programming: On the Programming of Computers by Means of Natural Selection." MIT Press, 1992.
- ARC Prize Foundation. "ARC-AGI-2." arcprize.org, 2025-2026.
Technical Report generated for PhD-level research analysis.
Report Date: March 2026 | Subject: Imbue Darwinian Evolver for ARC-AGI-2
Based on analysis of the Imbue research blog post and the darwinian_evolver GitHub repository.