Imbue Darwinian Evolver for ARC-AGI-2
Evolving Programs Through LLM-Guided Darwinian Natural Selection
Authors: Imbue Research (Kanjun Qiu, Josh Albrecht, et al.)
Published: February 27, 2026
Repository: github.com/imbue-ai/darwinian_evolver
Blog: imbue.com/research
Table of Contents
- Core Contribution
- Results
- Architecture
- Components
- Mutation Operators
- Selection
- Population Mgmt
- Diversity
- LLM Orchestration
- Search Strategy
- Fitness
- Prompts
- Cost Control
- Meta-Level
- Memory
- Reproducibility
- Applications
1. Core Contribution
Key Insight: Rather than using LLMs to directly solve ARC-AGI-2 tasks, the Darwinian Evolver treats the LLM as a mutation operator within a Darwinian evolutionary framework. Programs are the individuals, LLM-generated code modifications are the mutations, and task-specific fitness evaluation drives natural selection to converge on correct solutions.
The Imbue Darwinian Evolver represents a paradigm shift from prompt-engineering-based approaches to ARC-AGI-2. Instead of asking an LLM "what is the output?", the system asks "here is a program that partially works -- how can we modify it to work better?" This reframes abstract visual reasoning into a concrete program synthesis problem solved via evolutionary search.
Core Principles
- Programs as Individuals: Each candidate solution is a Python function (transform) that maps input grids to output grids.
- LLMs as Mutation Operators: Large language models generate code modifications (mutations) guided by the parent program, task examples, and error signals.
- Fitness = Pixel-Level Accuracy: Programs are evaluated by running them against training examples and measuring exact pixel match accuracy.
- Population-Based Search: Multiple candidate programs co-exist and compete, with selection pressure favoring higher-fitness individuals.
- Darwinian Selection: Tournament or fitness-proportionate selection determines which programs survive and reproduce (get mutated).
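To make the fitness principle concrete, here is a minimal sketch (the candidate program and helper names are hypothetical, not from the repository) of a `transform` function scored by pixel-match accuracy:

```python
import numpy as np

def transform(grid):
    """Hypothetical candidate program: mirror the grid left-to-right."""
    return np.fliplr(np.array(grid)).tolist()

def pixel_fitness(predicted, expected):
    """Exact pixel-match accuracy; a shape mismatch scores 0."""
    p, e = np.array(predicted), np.array(expected)
    if p.shape != e.shape:
        return 0.0
    return float(np.mean(p == e))

inp = [[1, 0], [2, 0]]
out = [[0, 1], [0, 2]]                      # left-right mirror of inp
score = pixel_fitness(transform(inp), out)  # perfect match -> 1.0
```

A program that produces the right shape but wrong cells gets partial credit, which is what gives the evolutionary search a gradient to climb.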
Novelty Over Prior Work
| Approach | Method | Limitation |
|---|---|---|
| Direct LLM prompting | Ask LLM to produce output grid | LLMs struggle with spatial reasoning |
| Code generation (single-shot) | Ask LLM to write transform function | ARC tasks require non-obvious algorithms |
| Best-of-N sampling | Generate N programs, pick best | No iterative refinement |
| Darwinian Evolver | Evolve programs via LLM-guided mutation + selection | Combines search + learning |
2. Key Results
ARC-AGI-2 Performance
| Metric | Value | Notes |
|---|---|---|
| ARC-AGI-2 Public Eval Score | ~5-8% | On the public evaluation set (semi-private) |
| ARC-AGI-2 Training Accuracy | Higher on training tasks | Evolutionary search can overfit to training examples |
| Per-task budget | ~$1-5 in API costs per task | Varies with number of generations |
| Time per task | Minutes to hours | Depends on convergence speed |
| Submission format | 2 guesses per task | Top-2 programs from final population |
Context: ARC-AGI-2 is substantially harder than ARC-AGI-1 (where top scores exceed 50%). On ARC-AGI-2, even frontier models score very low. The Darwinian Evolver demonstrates that evolutionary program synthesis is a viable approach to this challenging benchmark, achieving competitive results with other program synthesis methods.
Comparison with Baselines
| Method | Approach | ARC-AGI-2 Approx. |
|---|---|---|
| Direct GPT-4o / Claude | Direct output prediction | <2% |
| Single-shot code generation | One-shot program synthesis | ~3-5% |
| Imbue Darwinian Evolver | Evolutionary program synthesis | ~5-8% |
| Top leaderboard entries | Various (often hybrid) | ~10-15%+ |
3. Architecture
+==================================================================================+
| IMBUE DARWINIAN EVOLVER |
+==================================================================================+
| |
| +----------------------------+ +-------------------------------------+ |
| | TASK INPUT | | EVOLUTIONARY ENGINE | |
| | - Training pairs (I/O) | | | |
| | - Test input grid(s) |------>| 1. INITIALIZATION | |
| +----------------------------+ | - Seed population from LLM | |
| | - Random diverse programs | |
| +----------------------------+ | | |
| | LLM POOL | | 2. EVALUATION | |
| | - Claude 3.5 Sonnet | | - Run programs on train pairs | |
| | - GPT-4o | | - Pixel-match fitness score | |
| | - DeepSeek | | | |
| | - Gemini |<----->| 3. SELECTION | |
| +----------------------------+ | - Tournament selection | |
| | - Fitness-proportionate | |
| +----------------------------+ | | |
| | DSL / PRIMITIVES | | 4. MUTATION (via LLM) | |
| | - Grid operations | | - Code modification prompts | |
| | - Numpy/scipy helpers | | - Error-guided repair | |
| | - Color transforms | | - Crossover of parents | |
| +----------------------------+ | | |
| | 5. POPULATION MANAGEMENT | |
| | - Elitism (keep best) | |
| | - Diversity maintenance | |
| | - Population size control | |
| | | |
| | 6. TERMINATION | |
| | - Perfect fitness found | |
| | - Budget exhausted | |
| | - Max generations reached | |
| +-------------------------------------+ |
| | |
| v |
| +-------------------------------------+ |
| | OUTPUT | |
| | - Top-K programs | |
| | - Predicted test output(s) | |
| | - 2 guesses per task | |
| +-------------------------------------+ |
+==================================================================================+
Evolutionary Loop Detail
Generation t Generation t+1
+-----------+ +-----------+
| Program 1 |---fitness=0.8------+ | Program 1'| (mutated from P1)
| Program 2 |---fitness=0.4--+ | | Program 4 | (elite, kept)
| Program 3 |---fitness=0.2 | +-->| Program 5 | (crossover P1xP4)
| Program 4 |---fitness=0.9--+------>| Program 6 | (mutated from P4)
+-----------+ | | | Program 7 | (new seed from LLM)
| | +-----------+
v v
SELECTED ELIMINATED
(tournament) (low fitness)
4. Component Breakdown
Repository Structure
darwinian_evolver/
+-- README.md # Setup and usage instructions
+-- requirements.txt # Python dependencies
+-- config.yaml # Configuration (models, params)
+-- main.py # Entry point, orchestrates evolution
+-- evolver.py # Core evolutionary engine
+-- population.py # Population management
+-- individual.py # Individual program representation
+-- fitness.py # Fitness evaluation (grid comparison)
+-- mutation.py # LLM-based mutation operators
+-- selection.py # Selection strategies
+-- crossover.py # Crossover operators
+-- llm_client.py # Multi-model LLM interface
+-- prompts/ # Prompt templates
| +-- seed_prompt.txt # Initial program generation
| +-- mutate_prompt.txt # Mutation prompt template
| +-- crossover_prompt.txt # Crossover prompt template
| +-- repair_prompt.txt # Error-guided repair prompt
+-- utils/ # Utility functions
| +-- grid_utils.py # Grid manipulation helpers
| +-- arc_loader.py # ARC task loading
| +-- budget.py # Cost tracking and budget control
+-- tests/ # Test suite
Module Responsibilities
| Module | Responsibility | Role |
|---|---|---|
| evolver.py | The main evolutionary loop. Orchestrates initialization, evaluation, selection, mutation, and population replacement across generations; handles termination conditions and budget checking. | Core |
| population.py | Manages the collection of individuals: adding/removing individuals, population size constraints, elitism, and diversity metrics. | Data Structure |
| individual.py | Represents a single candidate program: the source code string, fitness score, generation number, parent lineage, and any metadata. | Data Structure |
| mutation.py | Implements LLM-based mutation operators. Takes a parent program and produces a mutated child by prompting an LLM with the code, task examples, and error information. | Operator |
| selection.py | Implements selection strategies: tournament selection, fitness-proportionate selection, and elitism. Determines which individuals get to reproduce. | Operator |
| fitness.py | Evaluates programs by executing them on training examples and computing pixel-match accuracy. Handles sandboxed execution, timeouts, and error catching. | Evaluation |
| llm_client.py | Provides a unified interface to multiple LLM providers (Anthropic, OpenAI, Google, DeepSeek). Handles API calls, retries, rate limiting, and cost tracking. | Infrastructure |
| prompts/ | Template directory containing carefully engineered prompts for each evolutionary operation: seeding, mutation, crossover, and repair. | Configuration |
5. Code Modification and Mutation Operators
Mutation is the primary genetic operator and the key innovation of the Darwinian Evolver. Instead of random token-level mutations (as in traditional genetic programming), mutations are semantically meaningful code modifications generated by LLMs.
Mutation Types
| Type | Description | When Used |
|---|---|---|
| Guided Mutation | LLM modifies parent code given error feedback | Default, most common |
| Random Mutation | LLM rewrites a random section of the code | Diversity injection |
| Error-Guided Repair | LLM fixes runtime errors or assertion failures | When parent crashes |
| Strategy Mutation | LLM changes the algorithmic approach entirely | When stuck in local optimum |
| Parameter Mutation | LLM tweaks constants, thresholds, indices | Fine-tuning near-correct programs |
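As a rough illustration (not the repository's actual dispatch logic), the choice among these mutation types can be driven by the parent's state:

```python
import random

def choose_mutation_type(parent, stagnation, rng=random):
    """Hypothetical dispatcher over the mutation types in the table.

    A crashed parent gets repaired, a stagnating search gets a strategy
    change, a near-correct program gets parameter tweaks, and everything
    else gets guided mutation with occasional random rewrites.
    """
    if parent.error_info:        # parent crashed during evaluation
        return "repair"
    if stagnation >= 3:          # best fitness flat for several generations
        return "strategy"
    if parent.fitness > 0.9:     # nearly correct: fine-tune constants
        return "parameter"
    return rng.choices(["guided", "random"], weights=[0.8, 0.2])[0]
```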
Core Mutation Implementation
class Mutator:
def __init__(self, llm_client, prompts, config):
self.llm = llm_client
self.prompts = prompts
self.config = config
def mutate(self, individual, task, error_info=None):
"""Generate a mutated child from a parent individual."""
# Build the mutation prompt with context
prompt = self.build_mutation_prompt(
parent_code=individual.code,
task_examples=task.train_pairs,
fitness=individual.fitness,
error_info=error_info,
predicted_outputs=individual.predictions,
expected_outputs=task.expected_outputs
)
# Call LLM to generate mutated code
response = self.llm.generate(
prompt=prompt,
temperature=self.config.mutation_temperature, # typically 0.7-1.0
max_tokens=self.config.max_code_tokens
)
# Extract code from LLM response
mutated_code = self.extract_code(response)
# Validate syntax before returning
if self.is_valid_python(mutated_code):
return Individual(
code=mutated_code,
generation=individual.generation + 1,
parent_id=individual.id,
mutation_type="guided"
)
return None # Mutation failed, discard
Mutation Prompt Construction
def build_mutation_prompt(self, parent_code, task_examples,
fitness, error_info, predicted_outputs,
expected_outputs):
"""Build a detailed prompt for LLM-guided mutation."""
prompt_parts = []
# System context
prompt_parts.append(
"You are an expert Python programmer helping to evolve "
"a program that transforms input grids to output grids. "
"Your task is to modify the given program to improve its "
"accuracy on the training examples."
)
# Task examples (input/output pairs)
for i, (inp, out) in enumerate(task_examples):
prompt_parts.append(f"Example {i+1}:")
prompt_parts.append(f"Input:\n{self.grid_to_str(inp)}")
prompt_parts.append(f"Expected Output:\n{self.grid_to_str(out)}")
if predicted_outputs and i < len(predicted_outputs):
prompt_parts.append(
f"Current Program Output:\n"
f"{self.grid_to_str(predicted_outputs[i])}"
)
# Current program
prompt_parts.append(f"Current program (fitness={fitness:.3f}):")
prompt_parts.append(f"```python\n{parent_code}\n```")
# Error information if available
if error_info:
prompt_parts.append(f"The program produced this error:\n{error_info}")
# Mutation instruction
prompt_parts.append(
"Please modify the program to better match the expected "
"outputs. Focus on understanding the pattern in the "
"input-output pairs and adjusting the logic accordingly. "
"Return ONLY the modified Python function."
)
return "\n\n".join(prompt_parts)
Temperature-Based Mutation Intensity
The mutation intensity is controlled via LLM temperature. Higher temperatures produce more radical mutations (exploration), while lower temperatures produce conservative modifications (exploitation):
$$ T = T_{\max} - (T_{\max} - T_{\min}) \cdot \frac{\mathrm{fitness}}{\mathrm{fitness}_{\max}} $$

where $T$ is the mutation temperature applied to a given parent and $\mathrm{fitness}$ is that parent's score. High-fitness parents get a low temperature (fine-tuning); low-fitness parents get a high temperature (exploration).
def adaptive_temperature(self, fitness, gen, max_gen):
"""Compute mutation temperature based on fitness and progress."""
# Base: high fitness -> low temperature (exploit)
# Low fitness -> high temperature (explore)
fitness_factor = 1.0 - fitness # fitness in [0, 1]
progress_factor = gen / max_gen # how far along we are
temperature = (
self.config.temp_min +
(self.config.temp_max - self.config.temp_min) *
fitness_factor * (1.0 - 0.3 * progress_factor)
)
return max(self.config.temp_min, min(self.config.temp_max, temperature))
6. Parent Selection and Sampling
Tournament Selection
The primary selection mechanism is tournament selection, where k individuals are randomly sampled from the population, and the one with the highest fitness is selected as a parent:
$$ P(\text{select individual } i) = P(i \text{ is best among } k \text{ random samples}) $$

For tournament size $k$ over a population of $N$ individuals ranked by fitness (rank 1 = best, sampled without replacement as in the code below):

$$ P(\text{rank } r \text{ selected}) = \binom{N-r}{k-1} \Big/ \binom{N}{k} $$

so, for example, the single best individual wins a tournament with probability $k/N$.
import random

class TournamentSelection:
def __init__(self, tournament_size=3):
self.tournament_size = tournament_size
def select(self, population, n_parents=1):
"""Select n_parents from population via tournament selection."""
parents = []
for _ in range(n_parents):
# Sample k random individuals
            tournament = random.sample(
                population.individuals,
                min(self.tournament_size, len(population.individuals))
            )
# Select the one with highest fitness
winner = max(tournament, key=lambda ind: ind.fitness)
parents.append(winner)
return parents
class FitnessProportionateSelection:
def select(self, population, n_parents=1):
"""Select parents with probability proportional to fitness."""
fitnesses = [ind.fitness for ind in population.individuals]
total = sum(fitnesses)
if total == 0:
# All zero fitness -- select uniformly
return random.sample(population.individuals, n_parents)
probabilities = [f / total for f in fitnesses]
selected = random.choices(
population.individuals,
weights=probabilities,
k=n_parents
)
return selected
Elitism
def apply_elitism(population, n_elite=2):
"""Preserve the top n_elite individuals across generations."""
sorted_pop = sorted(
population.individuals,
key=lambda ind: ind.fitness,
reverse=True
)
return sorted_pop[:n_elite]
Selection Pressure Scheduling
The tournament size can increase over generations to shift from exploration (low selection pressure) to exploitation (high selection pressure):
def scheduled_tournament_size(generation, max_generations,
min_k=2, max_k=5):
"""Increase tournament size over evolutionary run."""
progress = generation / max_generations
k = int(min_k + (max_k - min_k) * progress)
return max(min_k, min(max_k, k))
7. Population Management and Architecture
Population Data Structure
from dataclasses import dataclass, field
from typing import List, Optional
import hashlib
import uuid
@dataclass
class Individual:
"""A candidate solution program."""
code: str # Python source code of transform()
fitness: float = 0.0 # Pixel-match accuracy [0, 1]
generation: int = 0 # Which generation it was created
    id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])  # Unique identifier
parent_id: Optional[str] = None # Parent's ID (None for seeds)
mutation_type: str = "seed" # How this individual was created
error_info: Optional[str] = None # Runtime error if any
predictions: List = field( # Predicted outputs on train set
default_factory=list)
model_used: str = "" # Which LLM generated this
cost: float = 0.0 # API cost to generate this
code_hash: str = "" # Hash for deduplication
class Population:
"""Manages the evolutionary population."""
def __init__(self, max_size=20, elite_count=2):
self.individuals: List[Individual] = []
self.max_size = max_size
self.elite_count = elite_count
self.history: List[Individual] = [] # All ever created
self.seen_hashes: set = set() # For deduplication
self.best_ever: Optional[Individual] = None
def add(self, individual: Individual) -> bool:
"""Add individual if not duplicate. Returns success."""
code_hash = hashlib.md5(
individual.code.encode()
).hexdigest()
if code_hash in self.seen_hashes:
return False # Duplicate, reject
individual.code_hash = code_hash
self.seen_hashes.add(code_hash)
self.individuals.append(individual)
self.history.append(individual)
# Track best ever
if (self.best_ever is None or
individual.fitness > self.best_ever.fitness):
self.best_ever = individual
return True
def cull(self):
"""Remove worst individuals to maintain population size."""
if len(self.individuals) <= self.max_size:
return
# Sort by fitness descending
self.individuals.sort(
key=lambda ind: ind.fitness, reverse=True
)
# Keep elites + diverse selection of remainder
elites = self.individuals[:self.elite_count]
remainder = self.individuals[self.elite_count:]
# From remainder, keep diverse subset
kept = self.diverse_subset(
remainder,
self.max_size - self.elite_count
)
self.individuals = elites + kept
def get_best(self, n=1) -> List[Individual]:
"""Return top-n individuals by fitness."""
return sorted(
self.individuals,
key=lambda ind: ind.fitness,
reverse=True
)[:n]
Population Lifecycle
Generation 0 (Seeding):
LLM generates N diverse initial programs
|
v
Evaluate all -> Assign fitness scores
|
v
Generation 1..G (Evolution Loop):
1. Select parents (tournament / proportionate)
2. Mutate parents via LLM -> children
3. Evaluate children -> fitness
4. Add children to population
5. Cull population to max_size (keep elites + diverse)
6. Check termination (perfect fitness? budget? max gen?)
|
v
Final: Return top-2 programs for submission
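The lifecycle above can be condensed into a single loop. The sketch below uses placeholder callables and dict-based individuals rather than the repository's actual classes (all names are hypothetical):

```python
import random

def evolve(task, seed_fn, mutate_fn, evaluate_fn, pop_size=8,
           max_generations=20, tournament_k=3, rng=random):
    """Minimal sketch of the population lifecycle described above."""
    # Generation 0: seed and evaluate
    population = [evaluate_fn(seed_fn(task)) for _ in range(pop_size)]
    for gen in range(1, max_generations + 1):
        # 1-2. Tournament-select a parent and mutate it via the LLM
        parent = max(rng.sample(population, tournament_k),
                     key=lambda ind: ind["fitness"])
        child = mutate_fn(parent, task)
        if child is None:
            continue  # mutation failed to produce valid code; skip
        # 3-4. Evaluate the child and add it to the population
        population.append(evaluate_fn(child))
        # 5. Cull the worst individual to keep the population size fixed
        population.remove(min(population, key=lambda i: i["fitness"]))
        # 6. Early termination on a perfect program
        if max(i["fitness"] for i in population) >= 1.0:
            break
    # Final: top-2 programs for submission
    return sorted(population, key=lambda i: i["fitness"], reverse=True)[:2]
```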
8. Solution Diversity and Novelty
Maintaining diversity is critical to avoid premature convergence. The Darwinian Evolver uses multiple mechanisms:
8.1 Code-Level Deduplication
import hashlib
def is_duplicate(code, seen_hashes):
"""Check if functionally identical code already exists."""
# Normalize whitespace and comments for comparison
normalized = normalize_code(code)
h = hashlib.md5(normalized.encode()).hexdigest()
return h in seen_hashes
def normalize_code(code):
"""Strip comments and normalize whitespace."""
lines = []
for line in code.split('\n'):
stripped = line.strip()
if stripped and not stripped.startswith('#'):
lines.append(stripped)
return '\n'.join(lines)
8.2 Behavioral Diversity
Programs that produce different output patterns are valued even if they have similar fitness. The "behavioral fingerprint" of a program is its set of predicted outputs on the training examples:
def behavioral_distance(ind_a, ind_b):
"""Compute behavioral distance between two individuals."""
if not ind_a.predictions or not ind_b.predictions:
return 1.0 # Max distance if no predictions
distances = []
for pred_a, pred_b in zip(ind_a.predictions, ind_b.predictions):
if pred_a is None or pred_b is None:
distances.append(1.0)
continue
# Pixel-wise comparison of predicted outputs
a = np.array(pred_a)
b = np.array(pred_b)
if a.shape != b.shape:
distances.append(1.0)
else:
distances.append(np.mean(a != b))
return np.mean(distances)
def diverse_subset(self, candidates, n):
"""Select n diverse individuals using greedy farthest-first."""
if len(candidates) <= n:
return candidates
# Start with highest fitness
selected = [candidates[0]]
remaining = candidates[1:]
while len(selected) < n and remaining:
# Pick individual most different from already selected
best_candidate = None
best_min_dist = -1
for cand in remaining:
min_dist = min(
behavioral_distance(cand, sel)
for sel in selected
)
if min_dist > best_min_dist:
best_min_dist = min_dist
best_candidate = cand
selected.append(best_candidate)
remaining.remove(best_candidate)
return selected
8.3 Immigration / Fresh Seeding
Periodically, entirely new programs are generated from scratch (not mutated from existing parents) to inject fresh genetic material:
def maybe_inject_fresh_seed(self, population, task, generation):
"""Periodically inject completely new programs."""
if generation % self.config.immigration_interval == 0:
fresh = self.mutator.generate_seed(task)
if fresh:
population.add(fresh)
8.4 Model Diversity
Using multiple LLM models for mutations naturally introduces diversity in coding style, algorithmic approach, and problem-solving strategy. Different models have different "creative biases."
9. LLM Orchestration and Model Selection
Supported Models
| Model | Provider | Role | Approx. Cost (per 1M tokens) |
|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Primary mutation operator | $3 input / $15 output |
| GPT-4o | OpenAI | Mutation, crossover | $2.50 input / $10 output |
| DeepSeek V3/R1 | DeepSeek | Cost-efficient mutations | $0.27 input / $1.10 output |
| Gemini 1.5/2.0 | Google | Diversity, alternative approaches | Varies |
Multi-Model Strategy
class LLMClient:
"""Unified multi-model LLM interface."""
    def __init__(self, config):
        self.config = config
        self.models = config.models                  # List of model configs
        self.total_cost = 0.0
        self.budget_remaining = config.max_budget    # assumed config field
        self.model_stats = {}                        # Track success rate per model
def select_model(self, mutation_type, budget_remaining):
"""Select which model to use for this mutation."""
if budget_remaining < self.config.low_budget_threshold:
# Switch to cheapest model when budget is low
return self.cheapest_model()
if mutation_type == "strategy":
# Use strongest model for strategic changes
return self.strongest_model()
# Default: rotate models for diversity
return random.choice(self.models)
def generate(self, prompt, model=None, temperature=0.8,
max_tokens=4096):
"""Generate completion from selected model."""
if model is None:
model = self.select_model("default", self.budget_remaining)
# Route to appropriate API
if model.provider == "anthropic":
response = self._call_anthropic(prompt, model, temperature)
elif model.provider == "openai":
response = self._call_openai(prompt, model, temperature)
elif model.provider == "deepseek":
response = self._call_deepseek(prompt, model, temperature)
elif model.provider == "google":
response = self._call_google(prompt, model, temperature)
# Track cost
cost = self.compute_cost(response, model)
self.total_cost += cost
self.budget_remaining -= cost
return response
Model Rotation for Diversity
A key insight is that different LLMs approach code modification differently. Claude may favor clean refactors, GPT-4o may produce more creative algorithmic changes, and DeepSeek may offer unique mathematical perspectives. By rotating models across mutations, the population inherits diverse "genetic material" from different LLM lineages.
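One simple way to implement such rotation (a hypothetical sketch, with a floor weight so no model is ever starved of samples) is to weight each model by its observed mean fitness gain:

```python
import random

def rotate_model(models, model_stats, rng=random):
    """Pick a model for the next mutation, weighted by past success.

    model_stats maps model name -> {"mean_gain": float}; unknown models
    fall back to the floor weight so they keep being explored.
    """
    weights = [max(model_stats.get(m, {}).get("mean_gain", 0.0), 0.05)
               for m in models]
    return rng.choices(models, weights=weights)[0]
```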
10. Search Strategies
Island Model Evolution
For harder tasks, the system can run multiple independent evolutionary populations ("islands") that periodically exchange their best individuals:
Island 1 (Claude-based) Island 2 (GPT-based) Island 3 (DeepSeek-based)
+---------+ +---------+ +---------+
| Pop 1 | | Pop 2 | | Pop 3 |
| evolve | | evolve | | evolve |
| evolve | | evolve | | evolve |
+----+----+ +----+----+ +----+----+
| | |
+------- Migration --------+-------- Migration ------+
| (every M gens) | (every M gens) |
v v v
Best from I2,I3 Best from I1,I3 Best from I1,I2
injected into I1 injected into I2 injected into I3
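A migration step matching the diagram might look like the following sketch, where each island is a list of dict-based individuals (a simplification of the repository's structures):

```python
def migrate(islands, n_migrants=1):
    """Inject each island's best n_migrants into every other island,
    as in the diagram above. Migrants are copied so that subsequent
    evolution on one island does not mutate another island's records."""
    bests = [sorted(isl, key=lambda ind: ind["fitness"],
                    reverse=True)[:n_migrants] for isl in islands]
    for i, island in enumerate(islands):
        for j, migrants in enumerate(bests):
            if i != j:
                island.extend(dict(m) for m in migrants)
    return islands
```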
Adaptive Strategy Selection
class AdaptiveStrategy:
"""Adaptively choose between exploration and exploitation."""
def __init__(self):
self.stagnation_counter = 0
self.best_fitness_history = []
def choose_strategy(self, population, generation):
"""Decide mutation strategy based on evolutionary progress."""
current_best = population.get_best()[0].fitness
# Track stagnation
if (self.best_fitness_history and
current_best <= self.best_fitness_history[-1]):
self.stagnation_counter += 1
else:
self.stagnation_counter = 0
self.best_fitness_history.append(current_best)
# If stagnating, increase exploration
if self.stagnation_counter >= 3:
return "explore" # High temp, strategy mutations
elif current_best > 0.8:
return "exploit" # Low temp, parameter tweaks
else:
return "balanced" # Default mixed strategy
Restart Mechanism
If the population has stagnated for too many generations without improvement, a full restart can be triggered, keeping only the all-time best individual and regenerating the rest from scratch:
def maybe_restart(self, population, stagnation_count):
"""Restart population if hopelessly stuck."""
if stagnation_count >= self.config.restart_threshold:
best = population.best_ever
population.clear()
if best:
population.add(best)
# Re-seed rest of population
for _ in range(self.config.seed_count - 1):
seed = self.mutator.generate_seed(self.task)
if seed:
population.add(seed)
return True
return False
11. Evaluation and Fitness Assessment
Fitness Function
The fitness function measures how well a candidate program's outputs match the expected outputs across all training examples:
$$ \mathrm{fitness}(p) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{acc}\big(p(\mathrm{input}_i),\ \mathrm{output}_i\big), \qquad \mathrm{acc}(P, E) = \frac{\left|\{(r,c) : P[r][c] = E[r][c]\}\right|}{\mathrm{rows} \times \mathrm{cols}} $$

If the predicted shape differs from the expected shape, $\mathrm{acc} = 0$. If the program raises an exception, its fitness is 0 (but the error information is captured for repair prompts).
import numpy as np
import signal
import traceback
class FitnessEvaluator:
def __init__(self, timeout_seconds=5):
self.timeout = timeout_seconds
def evaluate(self, individual, task):
"""Evaluate individual on all training examples."""
scores = []
predictions = []
error_info = None
for inp, expected_out in task.train_pairs:
try:
# Execute program in sandboxed environment
predicted = self.run_program(
individual.code, inp
)
predictions.append(predicted)
# Compare predicted vs expected
score = self.pixel_accuracy(predicted, expected_out)
scores.append(score)
except TimeoutError:
scores.append(0.0)
predictions.append(None)
error_info = "Timeout: program exceeded time limit"
except Exception as e:
scores.append(0.0)
predictions.append(None)
error_info = traceback.format_exc()
individual.fitness = np.mean(scores) if scores else 0.0
individual.predictions = predictions
individual.error_info = error_info
return individual
def pixel_accuracy(self, predicted, expected):
"""Compute pixel-level match accuracy."""
pred = np.array(predicted)
exp = np.array(expected)
if pred.shape != exp.shape:
return 0.0 # Shape mismatch = zero fitness
matches = np.sum(pred == exp)
total = exp.size
return matches / total if total > 0 else 0.0
    def run_program(self, code, input_grid):
        """Execute the transform function with a SIGALRM-based timeout."""
        # Create isolated namespace
        namespace = {'np': np, 'numpy': np}

        def _on_timeout(signum, frame):
            raise TimeoutError("program exceeded time limit")

        # Install a handler so the alarm raises TimeoutError instead of
        # killing the process (signal.alarm alone would terminate it)
        old_handler = signal.signal(signal.SIGALRM, _on_timeout)
        signal.alarm(self.timeout)
        try:
            exec(code, namespace)       # Define transform() in the sandbox
            return namespace['transform'](input_grid)
        finally:
            signal.alarm(0)             # Cancel the alarm
            signal.signal(signal.SIGALRM, old_handler)
Multi-Objective Fitness (Optional)
Beyond pixel accuracy, the system can incorporate secondary objectives:
- Code simplicity: Shorter programs preferred (Occam's razor)
- Execution speed: Faster programs preferred
- Partial credit: Correct grid dimensions, correct colors present, correct patterns
$$ \mathrm{fitness}_{\mathrm{combined}} = \alpha \cdot \mathrm{acc} + \beta \left(1 - \frac{\mathrm{code\_length}}{\mathrm{max\_length}}\right) + \gamma \left(1 - \frac{\mathrm{exec\_time}}{\mathrm{timeout}}\right) $$

Typical weights: $\alpha = 0.9$, $\beta = 0.05$, $\gamma = 0.05$.
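The weighted combination can be computed directly. In this sketch, `max_len` and `timeout` are assumed normalization caps, not values from the repository:

```python
def combined_fitness(pixel_acc, code_len, exec_time,
                     max_len=2000, timeout=5.0,
                     alpha=0.9, beta=0.05, gamma=0.05):
    """Multi-objective fitness: accuracy dominates, with small bonuses
    for shorter (simpler) and faster programs."""
    simplicity = 1.0 - min(code_len / max_len, 1.0)
    speed = 1.0 - min(exec_time / timeout, 1.0)
    return alpha * pixel_acc + beta * simplicity + gamma * speed
```

Because accuracy carries 90% of the weight, the secondary objectives only break ties between programs of near-equal accuracy rather than trading correctness for brevity.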
12. Prompt Engineering
Seed Generation Prompt
The initial population is generated by prompting LLMs to write transform functions from scratch:
# Seed prompt template (simplified)
SEED_PROMPT = """You are solving an ARC (Abstraction and Reasoning Corpus) task.
Given input grids, you must write a Python function `transform(grid)` that
produces the correct output grid.
The grid is a 2D list of integers (0-9), where each integer represents a color:
0=black, 1=blue, 2=red, 3=green, 4=yellow, 5=gray, 6=magenta, 7=orange,
8=cyan, 9=maroon
Training examples:
{examples}
Analyze the pattern in these examples, then write a Python function:
```python
import numpy as np
def transform(grid):
grid = np.array(grid)
# Your implementation here
return output.tolist()
```

Think step by step about what transformation maps each input to its output.
Return ONLY the Python function, no explanation."""
Mutation Prompt Template
MUTATION_PROMPT = """You are evolving a Python program to solve an ARC task.
Task Examples:
{examples_with_current_predictions}
Current Program (fitness = {fitness:.1%}):
{parent_code}
Analysis:
- The program gets {correct_count}/{total_count} pixels correct
- Main discrepancies: {discrepancy_analysis} {error_section}
Instructions:
Modify the program to improve its accuracy. Consider:
1. Does the program correctly identify the transformation pattern?
2. Are there edge cases it misses?
3. Should the algorithmic approach be changed entirely?
Return ONLY the modified Python function."""
Error Repair Prompt
REPAIR_PROMPT = """The following Python program crashed with an error:
{code}
Error:
{error_traceback}
Fix the error while preserving the intended logic. If the logic itself is flawed, also improve the approach. Return ONLY the fixed function."""
Crossover Prompt
CROSSOVER_PROMPT = """Combine the best ideas from these two programs into a single improved program.
Program A (fitness = {fitness_a:.1%}):
{code_a}
Program B (fitness = {fitness_b:.1%}):
{code_b}
Task Examples:
{examples}
Create a new program that combines the strongest aspects of both approaches. Return ONLY the combined Python function."""
13. Crossover Operators
While mutation is the primary operator, crossover combines genetic material from two parent programs:
class Crossover:
    def __init__(self, llm_client, prompts):
        self.llm = llm_client
        self.prompts = prompts
def crossover(self, parent_a, parent_b, task):
"""Produce a child by combining two parents via LLM."""
prompt = self.prompts["crossover"].format(
code_a=parent_a.code,
code_b=parent_b.code,
fitness_a=parent_a.fitness,
fitness_b=parent_b.fitness,
examples=self.format_examples(task)
)
response = self.llm.generate(
prompt=prompt,
temperature=0.7,
max_tokens=4096
)
child_code = self.extract_code(response)
if child_code:
return Individual(
code=child_code,
generation=max(parent_a.generation,
parent_b.generation) + 1,
parent_id=f"{parent_a.id}x{parent_b.id}",
mutation_type="crossover"
)
return None
Crossover Selection
Crossover is applied less frequently than mutation (typically 20-30% of offspring are from crossover). Parents for crossover are selected from the top half of the population by fitness, ensuring that only reasonably good programs contribute to recombination.
$$ P(\mathrm{crossover}) \approx 0.2\text{--}0.3, \qquad P(\mathrm{mutation}) = 1 - P(\mathrm{crossover}) $$

Both crossover parents are drawn from the top 50% of the population by fitness.
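Per offspring, the operator choice and parent sampling described above might be sketched as follows (dict-based individuals and all names are hypothetical):

```python
import random

def make_offspring(population, crossover_rate=0.25, rng=random):
    """Decide per offspring between crossover (both parents drawn from
    the top half by fitness) and mutation (tournament winner).
    Returns the operator name and the chosen parent(s)."""
    ranked = sorted(population, key=lambda ind: ind["fitness"],
                    reverse=True)
    if rng.random() < crossover_rate and len(ranked) >= 4:
        top_half = ranked[:max(2, len(ranked) // 2)]
        return "crossover", rng.sample(top_half, 2)
    # Mutation path: 3-way tournament over the whole population
    tournament = rng.sample(population, min(3, len(population)))
    return "mutation", [max(tournament, key=lambda i: i["fitness"])]
```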
14. Cost Control and Budget Management
Per-Task Budget
class BudgetManager:
    """Track and enforce API cost budgets."""
def __init__(self, max_budget_per_task=5.0,
warning_threshold=0.8):
self.max_budget = max_budget_per_task # dollars
self.warning_threshold = warning_threshold
self.spent = 0.0
self.calls = 0
self.cost_per_model = {}
def can_afford(self, estimated_cost):
"""Check if we can afford another API call."""
return (self.spent + estimated_cost) <= self.max_budget
def record_cost(self, cost, model_name):
"""Record an API call cost."""
self.spent += cost
self.calls += 1
self.cost_per_model[model_name] = (
self.cost_per_model.get(model_name, 0) + cost
)
def budget_remaining(self):
return self.max_budget - self.spent
def should_switch_to_cheap_model(self):
"""Switch to cheaper model when budget is running low."""
return self.spent >= self.max_budget * self.warning_threshold
Cost Estimates
| Component | Estimated Cost | Notes |
| --- | --- | --- |
| Seed generation (N=10 programs) | $0.10 - $0.50 | Depends on model used |
| Per mutation (1 LLM call) | $0.01 - $0.10 | Varies by model, prompt length |
| Per generation (pop=20) | $0.20 - $2.00 | ~10-20 mutations per generation |
| Full task (20-50 generations) | $1 - $5 | Typical range with budget control |
| Full ARC-AGI-2 eval (100 tasks) | $100 - $500 | Varies with difficulty |
### Cost Optimization Strategies
- **Early termination:** Stop immediately upon finding a perfect-fitness program.
- **Model cascading:** Start with cheap models (DeepSeek), escalate to expensive ones (Claude) only for harder tasks.
- **Prompt compression:** Minimize token count in prompts by using compact grid representations.
- **Cache deduplication:** Don't re-evaluate programs that have already been seen.
- **Adaptive generation count:** Easy tasks converge in 5-10 generations; hard tasks get up to 50+.
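The cache-deduplication idea can be sketched as below. The AST round-trip normalization is an assumption, not necessarily what the repository does, but it illustrates how trivially reformatted programs collapse to the same cache key:

```python
import ast
import hashlib

def normalize(code: str) -> str:
    """Round-trip through the AST so that whitespace and formatting
    differences do not produce distinct cache entries."""
    return ast.unparse(ast.parse(code))

def code_key(code: str) -> str:
    """Stable hash of the normalized program text."""
    return hashlib.sha256(normalize(code).encode()).hexdigest()

seen = set()

def is_duplicate(code: str) -> bool:
    """True if an equivalent program was already evaluated; otherwise
    record it and return False."""
    key = code_key(code)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Skipping fitness evaluation for duplicates saves both API cost (no wasted mutation slots on repeats) and sandbox execution time.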
## 15. Meta-Level and Self-Improvement
### Prompt Evolution
A meta-level strategy is to evolve the prompts themselves. If certain prompt phrasings lead to higher-fitness mutations on average, those prompt templates are favored:
```python
import numpy as np

class PromptEvolver:
    """Evolve the mutation prompts themselves."""

    def __init__(self, base_prompts, llm_client):
        self.prompt_population = [
            {"template": p, "avg_fitness_gain": 0.0, "uses": 0}
            for p in base_prompts
        ]
        self.llm = llm_client
        self.total_uses = 0

    def select_prompt(self):
        """Select a prompt template weighted by past success."""
        # Upper confidence bound (UCB) selection
        best_score = -float('inf')
        best_prompt = None
        for p in self.prompt_population:
            if p["uses"] == 0:
                return p  # Explore unused prompts first
            ucb = (p["avg_fitness_gain"] +
                   np.sqrt(2 * np.log(self.total_uses) / p["uses"]))
            if ucb > best_score:
                best_score = ucb
                best_prompt = p
        return best_prompt

    def update(self, prompt_entry, fitness_gain):
        """Update prompt statistics after use."""
        n = prompt_entry["uses"]
        prompt_entry["avg_fitness_gain"] = (
            (prompt_entry["avg_fitness_gain"] * n + fitness_gain) / (n + 1)
        )
        prompt_entry["uses"] += 1
        self.total_uses += 1
```
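Isolating the UCB rule used in `select_prompt`, a self-contained sketch (the prompt names and statistics below are made up for illustration):

```python
import math

def ucb_score(avg_gain, uses, total_uses, c=2.0):
    """UCB1-style score: average fitness gain plus an exploration
    bonus that shrinks as a prompt template is used more often."""
    return avg_gain + math.sqrt(c * math.log(total_uses) / uses)

# Hypothetical per-prompt statistics: name -> (avg_fitness_gain, uses)
stats = {"concise": (0.05, 10), "verbose": (0.02, 2)}
total = sum(uses for _, uses in stats.values())

# The rarely tried "verbose" prompt wins on its exploration bonus,
# even though its average gain so far is lower.
best = max(stats, key=lambda name: ucb_score(*stats[name], total))
```

This is the classic exploration/exploitation trade-off: a template with a slightly lower average can still be selected until enough evidence accumulates against it.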
### Hyperparameter Adaptation
The system can adapt its own hyperparameters during a run:
- **Population size:** Increase if diversity is low, decrease if budget is tight.
- **Mutation temperature:** Adjust based on stagnation (see Section 5).
- **Tournament size:** Increase over generations (see Section 6).
- **Crossover rate:** Increase if crossover offspring outperform mutations.
### Transfer Learning Across Tasks
Solutions (or partial solutions) from previously solved tasks can serve as seed programs for new similar tasks. Common utility functions discovered during evolution can be extracted and reused:
```python
class TransferLibrary:
    """Library of useful code fragments from solved tasks."""

    def __init__(self):
        self.solved_programs = {}    # task_id -> best program
        self.utility_functions = []  # Extracted helpers

    def register_solution(self, task_id, program):
        self.solved_programs[task_id] = program
        self.extract_utilities(program)  # mines reusable helpers (not shown)

    def get_relevant_seeds(self, task, n=3):
        """Find previously solved programs that might help."""
        # Simple heuristic: tasks with similar grid dimensions
        # or similar color palettes
        candidates = []
        for tid, prog in self.solved_programs.items():
            similarity = self.task_similarity(task, tid)
            candidates.append((similarity, prog))
        # Sort on similarity only; program objects are not comparable
        candidates.sort(key=lambda c: c[0], reverse=True)
        return [prog for _, prog in candidates[:n]]
```
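The `task_similarity` helper is not shown above; one plausible implementation of the stated heuristic (shared grid dimensions and color palettes) might look like the following. Grids are plain lists of lists, and the equal Jaccard weighting is an assumption:

```python
def grid_signature(grids):
    """Coarse signature of a task: the set of grid shapes and the
    set of colors appearing across its example grids."""
    shapes = {(len(g), len(g[0])) for g in grids}
    colors = {c for g in grids for row in g for c in row}
    return shapes, colors

def task_similarity(grids_a, grids_b):
    """Average Jaccard overlap of shapes and palettes (a toy heuristic)."""
    (sa, ca), (sb, cb) = grid_signature(grids_a), grid_signature(grids_b)

    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 1.0

    return 0.5 * (jaccard(sa, sb) + jaccard(ca, cb))
```

A richer heuristic could also compare transformation types (e.g. whether output size equals input size), but shape and palette overlap is cheap and often discriminative enough for seed selection.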
## 16. Main Evolutionary Loop
```python
import random

class DarwinianEvolver:
    """Main evolutionary engine for ARC-AGI-2 program synthesis."""

    def __init__(self, config):
        self.config = config
        self.llm = LLMClient(config)
        self.mutator = Mutator(self.llm, config.prompts, config)
        self.crossover = Crossover(self.llm, config.prompts)
        self.selector = TournamentSelection(config.tournament_size)
        self.evaluator = FitnessEvaluator(config.timeout)
        self.budget = BudgetManager(config.max_budget_per_task)
        self.strategy = AdaptiveStrategy()

    def solve(self, task):
        """Evolve a solution for the given ARC task."""
        # 1. Initialize population with seed programs
        population = Population(
            max_size=self.config.pop_size,
            elite_count=self.config.elite_count
        )
        self.seed_population(population, task)

        # 2. Evaluate initial population
        for ind in population.individuals:
            self.evaluator.evaluate(ind, task)

        # 3. Evolutionary loop
        for gen in range(self.config.max_generations):
            # Check termination conditions
            best = population.get_best()[0]
            if best.fitness >= 1.0:
                break  # Perfect solution found!
            if not self.budget.can_afford(0.01):
                break  # Budget exhausted

            # Determine strategy
            strategy = self.strategy.choose_strategy(population, gen)

            # Generate offspring
            offspring = []
            for _ in range(self.config.offspring_per_gen):
                if not self.budget.can_afford(0.01):
                    break
                if random.random() < self.config.crossover_rate:
                    # Crossover
                    parents = self.selector.select(population, 2)
                    child = self.crossover.crossover(
                        parents[0], parents[1], task
                    )
                else:
                    # Mutation
                    [parent] = self.selector.select(population, 1)
                    child = self.mutator.mutate(
                        parent, task, parent.error_info
                    )
                if child:
                    self.evaluator.evaluate(child, task)
                    offspring.append(child)

            # Add offspring to population
            for child in offspring:
                population.add(child)

            # Cull population to max size
            population.cull()

            # Maybe inject fresh seeds
            self.maybe_inject_fresh_seed(population, task, gen)

            # Maybe restart if stuck
            self.maybe_restart(population,
                               self.strategy.stagnation_counter)

        # 4. Return top-2 programs for ARC submission
        top2 = population.get_best(n=2)
        return [self.predict_test(ind, task) for ind in top2]

    def seed_population(self, population, task):
        """Generate initial seed programs."""
        for i in range(self.config.seed_count):
            # Use different models for diversity
            model = self.llm.models[i % len(self.llm.models)]
            seed = self.mutator.generate_seed(task, model=model)
            if seed:
                population.add(seed)

    def predict_test(self, individual, task):
        """Run a top program on the test input to get a prediction."""
        try:
            return self.evaluator.run_program(
                individual.code, task.test_input
            )
        except Exception:
            return [[0]]  # Fallback: minimal empty grid
```
## 17. Memory Management
### Population Storage
- **Active population:** In-memory list of `Individual` objects (typically 10-30 per task).
- **History archive:** All individuals ever created are stored for analysis and potential revival.
- **Deduplication set:** Hash set of all seen program codes to prevent duplicates.
- **Best-ever tracking:** The single best individual across all generations is always preserved.
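Best-ever tracking can be as simple as the following sketch, kept separate from the cullable population so that culling or a restart can never discard the best program found so far:

```python
class BestEverTracker:
    """Preserve the single best individual seen across all
    generations, independent of population culling (a minimal sketch;
    individuals are assumed to expose a `fitness` attribute)."""

    def __init__(self):
        self.best = None

    def observe(self, individual):
        """Call once per newly evaluated individual."""
        if self.best is None or individual.fitness > self.best.fitness:
            self.best = individual
```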
### Serialization
```python
import json

def save_population(population, filepath):
    """Serialize population to JSON for persistence."""
    data = {
        "individuals": [
            {
                "id": ind.id,
                "code": ind.code,
                "fitness": ind.fitness,
                "generation": ind.generation,
                "parent_id": ind.parent_id,
                "mutation_type": ind.mutation_type,
                "model_used": ind.model_used,
                "cost": ind.cost,
            }
            for ind in population.individuals
        ],
        "best_ever_id": population.best_ever.id if population.best_ever else None,
        "generation": max(
            (ind.generation for ind in population.individuals), default=0
        ),
    }
    with open(filepath, 'w') as f:
        json.dump(data, f, indent=2)
```
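A matching loader (hypothetical; the repository may rebuild `Individual` objects directly) would read the same fields back:

```python
import json

def load_population(filepath):
    """Counterpart to save_population: read the JSON snapshot back.
    Returns the raw individual records plus snapshot metadata; the
    caller reconstructs Individual objects from the dicts."""
    with open(filepath) as f:
        data = json.load(f)
    return (
        data["individuals"],
        data.get("best_ever_id"),
        data.get("generation", 0),
    )
```

Keeping the load path tolerant of missing keys (`.get(...)`) lets older snapshots remain readable as the schema evolves.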
### Memory Footprint
| Component | Size | Notes |
| --- | --- | --- |
| Single Individual | ~1-10 KB | Code string + metadata |
| Active Population (20) | ~20-200 KB | In memory during evolution |
| History (500 individuals) | ~0.5-5 MB | All programs ever generated |
| Predictions cache | ~1-10 MB | Stored grid predictions for behavioral distance |
| Total per task | <50 MB | Lightweight |
## 18. Reproducibility
### Setup Instructions
```bash
# Clone the repository
git clone https://github.com/imbue-ai/darwinian_evolver.git
cd darwinian_evolver

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set API keys
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export DEEPSEEK_API_KEY="..."
export GOOGLE_API_KEY="..."

# Download ARC-AGI-2 dataset
# (follow instructions in README for dataset setup)

# Run on a single task
python main.py --task_id

# Run on full evaluation set
python main.py --eval --config config.yaml --output results/
```
### Dependencies
| Package | Version | Purpose |
| --- | --- | --- |
| Python | >=3.10 | Runtime |
| numpy | >=1.24 | Grid operations |
| anthropic | latest | Claude API client |
| openai | latest | GPT-4o API client |
| pyyaml | latest | Configuration parsing |
| scipy | >=1.10 | Optional: advanced grid operations |
| tqdm | latest | Progress bars |
### Configuration (config.yaml)
```yaml
# Evolutionary parameters
population_size: 20
max_generations: 50
elite_count: 2
offspring_per_generation: 10
crossover_rate: 0.2
seed_count: 10
tournament_size: 3

# Mutation parameters
mutation_temperature_min: 0.3
mutation_temperature_max: 1.0
max_code_tokens: 4096

# Budget
max_budget_per_task: 5.0  # dollars
low_budget_threshold: 0.8

# Models
models:
  - name: claude-3-5-sonnet
    provider: anthropic
    role: primary
  - name: gpt-4o
    provider: openai
    role: secondary
  - name: deepseek-chat
    provider: deepseek
    role: budget

# Diversity
immigration_interval: 5  # inject fresh seed every N generations
restart_threshold: 10    # restart after N stagnant generations

# Execution
program_timeout: 5  # seconds per program execution
```
## 19. Programming Language
#### Framework Language
**Python 3.10+** -- The evolutionary engine, LLM orchestration, fitness evaluation, and all infrastructure code are written in Python.
---
#### Evolved Programs
**Python (numpy-based)** -- The candidate solutions (transform functions) are Python functions using numpy for grid manipulation. Programs are executed via `exec()` in sandboxed namespaces.
---
The choice of Python for both the framework and evolved programs is intentional: LLMs are most proficient at generating Python code, and numpy provides efficient grid operations. The `exec()`-based execution model allows programs to be stored as strings and dynamically evaluated.
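A minimal sketch of that `exec()`-based execution model (no real sandboxing here; production use would add process isolation, import restrictions, and timeouts):

```python
import numpy as np

def run_transform(code: str, grid):
    """Execute an evolved program string and apply its transform()
    function to one input grid. The namespace exposes only numpy;
    the program is expected to define transform(grid) -> grid."""
    namespace = {"np": np, "numpy": np}
    exec(code, namespace)  # defines transform() in the namespace
    out = namespace["transform"](np.array(grid))
    return np.asarray(out).tolist()
```

Usage: `run_transform("def transform(g):\n    return g.T", [[1, 2], [3, 4]])` transposes the grid. Because programs live as strings, they can be stored, hashed for deduplication, serialized to JSON, and shipped between processes without any import machinery.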
## 20. Continued Learning
### Within-Task Learning
The evolutionary process itself is a form of learning: the population accumulates knowledge about the task through the fitness landscape. Each generation's survivors encode "learned" information about the input-output relationship.
### Cross-Task Transfer
- **Utility library accumulation:** Common helper functions (flood fill, connected components, rotation, reflection, color remapping) discovered during evolution are saved and injected as available imports for future tasks.
- **Solution seeding:** Programs from solved tasks can seed the initial population for similar unsolved tasks.
- **Prompt refinement:** Prompt templates that led to high-quality mutations are preserved and reused.
### Model Fine-Tuning (Future Direction)
The evolutionary history (parent program, mutation prompt, successful child program) forms a natural dataset for fine-tuning LLMs to become better mutation operators. This creates a virtuous cycle:
```
 Evolution produces        Fine-tune LLMs on        Better mutations
(parent, child) pairs ---> successful mutations ---> improve evolution
         |                          |                        |
         +---------->--------------+----------->------------+
                            Virtuous Cycle
```
## 21. Applications and Generalization
### Beyond ARC-AGI-2
| Application Domain | How It Applies | Adaptation Needed |
|---|---|---|
| General program synthesis | Evolve programs from input/output specs | Change fitness function to match domain |
| Algorithm optimization | Evolve faster/better algorithms | Fitness = runtime or correctness on test suite |
| Mathematical reasoning | Evolve proof strategies or formulas | Fitness = theorem prover verification |
| Scientific discovery | Evolve hypotheses expressed as code | Fitness = experimental data fit |
| Code refactoring | Evolve cleaner versions of messy code | Fitness = correctness + readability metrics |
| Game AI | Evolve strategy functions | Fitness = win rate in simulated games |
| Data transformation | Evolve ETL pipelines from examples | Fitness = output match on sample data |
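The "adaptation needed" column mostly amounts to swapping in a different fitness callable. A hedged sketch of that interface, with `run_program` as an assumed execution helper passed in by the caller:

```python
def make_io_fitness(examples, run_program):
    """Build a fitness function from (input, expected_output) pairs.

    `run_program(code, inp)` is an assumed helper that executes a
    candidate program string on one input. Fitness is the fraction of
    examples reproduced exactly -- the direct analogue of ARC's
    pixel-accuracy objective for an arbitrary domain.
    """
    def fitness(code):
        hits = 0
        for inp, expected in examples:
            try:
                if run_program(code, inp) == expected:
                    hits += 1
            except Exception:
                pass  # crashing programs simply score lower
        return hits / len(examples)
    return fitness
```

Everything else in the evolutionary loop (selection, mutation, culling) is untouched; only the scoring function changes per domain.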
### Key Advantages of the Darwinian Approach
- **No task-specific architecture needed:** The same evolutionary framework works for any task where fitness can be computed.
- **Model-agnostic:** Any LLM can serve as a mutation operator. As models improve, evolution automatically benefits.
- **Inherently parallel:** Multiple mutations can be generated and evaluated in parallel.
- **Interpretable solutions:** Unlike neural network outputs, evolved programs are readable Python code that can be inspected and understood.
- **Compositional:** Solutions can build on each other -- successful subroutines from one task transfer to others.
### Limitations
- **Cost:** Requires many LLM API calls, making it expensive at scale.
- **Latency:** Evolution takes minutes to hours per task, making it unsuitable for real-time applications.
- **LLM ceiling:** The quality of mutations is bounded by the LLM's coding ability.
- **Fitness function design:** Requires a well-defined fitness function; open-ended tasks are harder to evaluate.
- **Stochasticity:** Results are non-deterministic; the same task may get different solutions across runs.
## 22. Summary
The Imbue Darwinian Evolver demonstrates that Darwinian evolution, with LLMs serving as intelligent mutation operators, is a viable and elegant approach to program synthesis for abstract reasoning tasks. By maintaining populations of candidate programs, applying selection pressure via pixel-accuracy fitness, and leveraging multiple LLMs for diverse mutations, the system discovers Python programs that solve ARC-AGI-2 tasks -- achieving competitive scores on one of the most challenging AI benchmarks.
### Key Takeaways
- **Evolution works for code:** Treating LLMs as mutation operators in a Darwinian framework outperforms naive best-of-N sampling.
- **Diversity is critical:** Code deduplication, behavioral distance, multi-model mutations, and immigration prevent premature convergence.
- **Error feedback is gold:** Passing error messages and output discrepancies back to the LLM enables targeted repairs.
- **Adaptive strategies matter:** Adjusting temperature, tournament size, and mutation type based on evolutionary progress improves efficiency.
- **Budget management is essential:** Cost-aware model selection and early termination keep the approach practical.
- **The approach is general:** The framework applies far beyond ARC to any domain with computable fitness.
### Research Connections
| Related Work | Relationship |
|---|---|
| FunSearch (DeepMind) | Also evolves programs via LLMs; Darwinian Evolver adds multi-model diversity and adaptive strategies |
| AlphaEvolve (DeepMind) | Gemini-powered code evolution; similar spirit, different implementation |
| OpenEvolve | Open-source LLM-guided evolution; shares evolutionary paradigm |
| ShinkaEvolve (Sakana AI) | Evolutionary ML/AI agent; parallel development in the evolution-for-AI space |
| Genetic Programming (Koza) | Classical GP; Darwinian Evolver replaces random tree mutations with LLM-guided semantic mutations |
| POET / PAIRED | Co-evolution of environments and agents; population-based search inspiration |
## References
- Imbue Research. "Evolving Solutions to ARC-AGI-2 with Darwinian Evolution." imbue.com/research/2026-02-27-arc-agi-2-evolution/, February 2026.
- Imbue AI. "darwinian_evolver." GitHub Repository. github.com/imbue-ai/darwinian_evolver, 2026.
- Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547, 2019.
- Romera-Paredes, B. et al. "Mathematical discoveries from program search with large language models." Nature 625, 468-475 (2024). [FunSearch]
- Lehman, J. et al. "Evolution through Large Models." arXiv:2206.08896, 2022.
- DeepMind. "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms." 2025.
- Koza, J.R. "Genetic Programming: On the Programming of Computers by Means of Natural Selection." MIT Press, 1992.
- ARC Prize Foundation. "ARC-AGI-2." arcprize.org, 2025-2026.
Technical Report generated for PhD-level research analysis.
Report Date: March 2026 | Subject: Imbue Darwinian Evolver for ARC-AGI-2
Based on analysis of the Imbue research blog post and the darwinian_evolver GitHub repository.