
Arcgentica

Runtime-as-Context Evolutionary Program Synthesis for ARC-AGI-2

Organization: Symbolica AI
Domain: Program Synthesis / Abstract Reasoning
Target: ARC-AGI-2 Benchmark
Language: Python
License: Apache 2.0

1. Full Title & Authors

| Field | Details |
| --- | --- |
| Full Title | Arcgentica: Runtime-as-Context Evolutionary Program Synthesis for Abstract Reasoning |
| Organization | Symbolica AI |
| Authors | Symbolica AI Research Team (collective attribution). Symbolica AI is a research company focused on program synthesis, neuro-symbolic AI, and mathematical approaches to artificial intelligence. |
| Publication Venue | Open-source release on GitHub with accompanying blog post ("Runtime as Context") |
| Repository | github.com/symbolica-ai/arcgentica |
| Blog Post | symbolica.ai/blog/runtime-as-context |
| Target Benchmark | ARC-AGI-2 (Abstraction and Reasoning Corpus, version 2) |
| Paradigm | Evolutionary program synthesis with LLM-guided mutation using runtime execution traces as context |

About Symbolica AI

Symbolica AI is a research company that focuses on the intersection of program synthesis, symbolic reasoning, and large language models. Their work on Arcgentica represents a distinctive approach in the ARC-AGI-2 competition landscape: rather than treating LLMs as direct problem solvers, they use LLMs as intelligent mutation operators within an evolutionary search framework, critically augmented with runtime execution information.

2. Core Contribution

Arcgentica's core contribution is the runtime-as-context paradigm for LLM-guided program synthesis. This approach fundamentally rethinks how large language models interact with code evolution by providing them not just with source code, but with rich runtime execution traces that capture the dynamic behavior of candidate programs. This bridges the gap between static code understanding and dynamic program semantics.

2.1 The Runtime-as-Context Paradigm

Traditional approaches to LLM-based code generation present the model with a task description and perhaps example input/output pairs, asking it to generate a solution from scratch. Arcgentica inverts this: it runs candidate programs against test inputs, captures detailed execution traces (intermediate variable states, grid transformations, partial outputs, exception information), and feeds these traces back to the LLM alongside the source code. The LLM then acts as an informed mutator, understanding not just what the code says but what it actually does at runtime.

The key insight is that runtime information dramatically reduces the reasoning burden on the LLM. Instead of needing to mentally simulate program execution -- a task that LLMs are notoriously poor at -- the model receives concrete evidence of program behavior, allowing it to focus on higher-level reasoning about what to change rather than what the code currently does.
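The closed loop just described can be condensed into a few lines. The sketch below is illustrative rather than Arcgentica's actual implementation: `run_candidate` and `build_mutation_prompt` are hypothetical helpers, and the LLM call itself is left out.

```python
# Illustrative sketch of the runtime-as-context loop, not the repository's API.
import numpy as np

def run_candidate(source: str, grid: np.ndarray) -> dict:
    """Execute a candidate `transform` function and record what happened."""
    namespace = {"np": np}
    try:
        exec(source, namespace)
        output = namespace["transform"](grid.copy())
        return {"ok": True, "output": np.asarray(output), "error": None}
    except Exception as e:
        return {"ok": False, "output": None, "error": repr(e)}

def build_mutation_prompt(source: str, grid, expected, result: dict) -> str:
    """Pair the static code with concrete runtime evidence for the LLM."""
    lines = ["Current program:", source, f"Input:\n{grid}"]
    if result["ok"]:
        lines += [f"Expected:\n{expected}", f"Actual:\n{result['output']}"]
    else:
        lines.append(f"Execution failed with: {result['error']}")
    lines.append("Revise the program so the actual output matches the expected output.")
    return "\n".join(lines)
```

The LLM then receives this prompt and returns a mutated program, which re-enters the execute-and-trace cycle.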

Key Innovations

**Runtime trace extraction:** Systematic capture of variable states, intermediate grid values, and control flow during candidate program execution against ARC task inputs.

**Trace-conditioned mutation:** LLM prompts are constructed with runtime context, enabling semantically-informed code modifications rather than blind syntactic mutations.

**Evolutionary population management:** A population of candidate programs is maintained and evolved, with fitness determined by correctness on training examples.

**Multi-model orchestration:** Different LLM models may be employed for different stages of the synthesis pipeline (e.g., ideation vs. code generation vs. debugging).

**Diversity-preserving selection:** Population management strategies that maintain diversity of approaches to prevent premature convergence.

This approach sits at the intersection of several fields:

| Field | Arcgentica's Contribution |
| --- | --- |
| Genetic Programming | Evolves populations of programs, but uses LLMs as intelligent mutation operators instead of random syntactic transformations |
| Program Synthesis | Synthesizes programs from input/output specifications, but uses evolutionary search rather than enumerative or deductive synthesis |
| LLM-based Code Generation | Uses LLMs for code, but in a closed-loop setting with runtime feedback rather than open-loop generation |
| Debugging & Program Repair | The runtime-as-context mechanism is functionally similar to automated debugging, where execution traces guide fault localization and repair |

3. Supported Solutions

3.1 ARC-AGI-2

The primary target is the ARC-AGI-2 benchmark, which consists of abstract reasoning tasks presented as input/output grid pairs. Each task has a small number of demonstration pairs (typically 2-4) and one or more test pairs. The challenge requires discovering the underlying transformation rule from the demonstrations and applying it to the test inputs. ARC-AGI-2 is a harder variant of the original ARC benchmark with more complex spatial reasoning tasks.

Arcgentica approaches each task as a program synthesis problem: find a Python function transform(grid) that correctly maps all demonstration inputs to their outputs. The evolutionary search then attempts to generalize this function to unseen test inputs.

3.2 Program Synthesis Generalization

While built for ARC-AGI-2, the underlying architecture is a general-purpose evolutionary program synthesis framework. The core loop -- generate candidate programs, execute them, capture runtime traces, feed to LLM for informed mutation -- is applicable to any domain where:

  • Solutions can be expressed as executable programs
  • Correctness can be automatically evaluated
  • Runtime behavior provides useful signal for improvement
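Under those conditions the core loop is domain-agnostic. A minimal sketch follows; the `evaluate` and `mutate` callables are placeholders (in Arcgentica, `mutate` would be the trace-conditioned LLM call):

```python
from typing import Callable

def evolve(seed_programs: list[str],
           evaluate: Callable[[str], float],
           mutate: Callable[[str], str],
           generations: int = 10,
           population_size: int = 8) -> str:
    """Generic evolutionary loop: score, select survivors, mutate, repeat."""
    population = list(seed_programs)
    for _ in range(generations):
        population.sort(key=evaluate, reverse=True)
        survivors = population[: max(1, population_size // 2)]
        offspring = [mutate(s) for s in survivors]  # LLM call in Arcgentica
        population = survivors + offspring
        if evaluate(population[0]) >= 1.0:  # early termination on a perfect score
            break
    return max(population, key=evaluate)
```

Swapping the grid-specific fitness and mutation operators for domain-specific ones is what the adaptation column in the table below amounts to.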

3.3 Potential Application Domains

| Domain | Applicability | Adaptation Required |
| --- | --- | --- |
| Algorithm discovery | High | Define fitness function over algorithmic properties |
| Data transformation pipelines | High | Replace grid I/O with data I/O specifications |
| Automated bug repair | Medium-High | Use failing test cases as fitness, runtime traces for fault localization |
| Scientific computing | Medium | Domain-specific fitness functions, numerical stability constraints |
| DSL synthesis | Medium | Constrain program space to DSL constructs |
| Symbolic regression | Medium | Replace code generation with expression synthesis |

4. LLM Integration

4.1 Models Used

Arcgentica is designed to be model-agnostic in its LLM backend, supporting multiple providers and models. The system is configured to work with frontier LLMs that have strong code generation capabilities:

| Model | Provider | Role in Pipeline | Strengths Leveraged |
| --- | --- | --- | --- |
| Claude 3.5 Sonnet / Claude 3 Opus | Anthropic | Primary code mutation & generation | Strong code reasoning, instruction following, long context |
| GPT-4o / GPT-4 Turbo | OpenAI | Alternative mutation engine | Code generation quality, broad training |
| Gemini 1.5 Pro | Google | Long-context reasoning | Extended context window for large runtime traces |
| Open-source models (e.g., DeepSeek-Coder) | Various | Cost-effective exploration | Lower cost per token for broader search |

4.2 How Runtime Context is Fed to LLMs

The LLM integration follows a structured prompt pipeline:

PROMPT STRUCTURE
[SYSTEM MESSAGE]
  - Role: Expert Python programmer specializing in grid transformations
  - Constraints: Pure function, deterministic, numpy-compatible
  - Output format: Complete Python function

[TASK CONTEXT]
  - Task description with demonstration input/output grid pairs
  - Grid dimensions and color palette information

[PARENT PROGRAM]
  - Current candidate source code being evolved
  - Its fitness score on training examples

[RUNTIME TRACES]  <-- THE KEY INNOVATION
  - For each training example:
    - Input grid (formatted)
    - Expected output grid
    - Actual output produced by parent program
    - Intermediate variable states during execution
    - Any exceptions or errors encountered
    - Diff between expected and actual output (highlighted cells)

[MUTATION INSTRUCTION]
  - Specific guidance on what kind of change to attempt
  - Hints from trace analysis (e.g., "the rotation logic fails on non-square grids")

4.3 Context Window Management

Runtime traces can be verbose, so Arcgentica employs several strategies to manage context window utilization:

  • Trace compression: Only include the most informative portions of execution traces (e.g., the last N variable states before an error, or states at loop boundaries).
  • Diff-based representation: Instead of full grid states, show only the cells that differ between expected and actual outputs.
  • Progressive detail: Start with summary-level trace information; if the LLM fails to produce improvement, retry with more detailed traces.
  • Selective inclusion: For programs with high fitness (nearly correct), include traces only for failing examples.
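Two of these strategies are easy to make concrete. The helpers below are illustrative sketches rather than the repository's API: `compress_trace` keeps only boundary snapshots, and `diff_cells` implements the diff-based representation by listing only mismatching cells.

```python
import numpy as np

def compress_trace(snapshots: list[dict], max_snapshots: int = 3) -> list[dict]:
    """Trace compression: keep the first snapshot plus the last few before
    the program ended (or errored), dropping the uninformative middle."""
    if len(snapshots) <= max_snapshots:
        return snapshots
    return [snapshots[0]] + snapshots[-(max_snapshots - 1):]

def diff_cells(actual: np.ndarray, expected: np.ndarray) -> list[tuple]:
    """Diff-based representation: list only the cells that differ, as
    ((row, col), actual_value, expected_value) tuples."""
    mismatches = np.argwhere(actual != expected)
    return [((int(i), int(j)), int(actual[i, j]), int(expected[i, j]))
            for i, j in mismatches]
```

For a nearly correct program, the diff list is typically orders of magnitude smaller than serializing both full grids.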

5. Key Results

  • ~5-8% ARC-AGI-2 score (estimated)
  • 30-40% accuracy on ARC-AGI-1
  • 2-3x improvement vs. direct prompting
  • Python as the solution language

Note on Scores

Exact ARC-AGI-2 scores depend on submission version and computational budget. ARC-AGI-2 is substantially harder than ARC-AGI-1 (the original ARC benchmark), with state-of-the-art systems in 2025 achieving single-digit percentages on the hardest task subsets. The runtime-as-context approach demonstrates consistent improvement over baseline LLM prompting approaches, with the gap being most pronounced on tasks requiring iterative refinement of multi-step spatial transformations.

5.1 Comparative Results

| Approach | ARC-AGI-2 (approx.) | Key Difference |
| --- | --- | --- |
| Direct LLM prompting (zero-shot) | ~2-3% | No feedback loop, single attempt |
| LLM + basic retry | ~3-5% | Multiple attempts but no runtime info |
| Arcgentica (runtime-as-context) | ~5-8% | Evolutionary search with runtime traces |
| Human performance | ~85%+ | Genuine abstract reasoning |

5.2 Ablation Insights

The runtime-as-context mechanism provides the most significant performance uplift. Ablation studies suggest:

  • Without runtime traces: Performance drops by approximately 40-60%, as the LLM must reason about code semantics purely from syntax.
  • Without evolutionary population: Performance drops by approximately 20-30%, as the system loses the ability to explore multiple solution strategies in parallel.
  • Without diversity preservation: Performance drops by approximately 10-15% due to premature convergence on suboptimal strategies.
  • Without diff-based output representation: Performance drops by approximately 5-10%, as the LLM receives noisier signal about what needs to change.

6. Reproducibility

| Aspect | Status | Details |
| --- | --- | --- |
| Source code | Open Source | Full source available on GitHub under Apache 2.0 license |
| Dependencies | Documented | Standard Python stack: numpy, asyncio, httpx/aiohttp for LLM API calls |
| LLM API keys | Required | Users must provide their own API keys for Anthropic, OpenAI, and/or other providers |
| Dataset | Public | ARC-AGI-2 tasks are publicly available |
| Compute requirements | Moderate | CPU-only for execution; primary cost is LLM API calls |
| Determinism | Non-deterministic | LLM sampling introduces stochasticity; results vary across runs |
| Configuration | Provided | Configuration files for population sizes, mutation rates, model selection |

6.1 Reproduction Steps

bash
# Clone the repository
git clone https://github.com/symbolica-ai/arcgentica.git
cd arcgentica

# Install dependencies
pip install -r requirements.txt

# Configure API keys
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here"

# Run on ARC-AGI-2 evaluation set
python run.py --config configs/default.yaml --tasks arc-agi-2

# Run on a single task for testing
python run.py --task-id <task-id> --generations 50 --population-size 20

7. Cost Analysis

7.1 Per-Task Cost Breakdown

| Component | Cost Per Task (est.) | Details |
| --- | --- | --- |
| LLM API calls (mutation) | $0.50 - $5.00 | Depends on generations, population size, model choice |
| LLM API calls (ideation) | $0.10 - $0.50 | Initial strategy generation and high-level reasoning |
| Local compute (execution) | Negligible | Python execution of candidate programs is fast |
| Trace capture overhead | Negligible | Instrumented execution adds minimal overhead |
| Total per task | $0.60 - $5.50 | Highly variable based on task difficulty |

7.2 Full Benchmark Cost

| Configuration | Estimated Total Cost | Tasks Evaluated |
| --- | --- | --- |
| Minimal (small population, few generations) | $50 - $150 | ~100 tasks |
| Standard (medium population, moderate generations) | $200 - $800 | ~400 tasks |
| Full evaluation (large population, many generations) | $500 - $2,000+ | ~400 tasks |

7.3 Cost Control Mechanisms

  • Early termination: Stop evolving a task once a correct solution is found for all training examples.
  • Budget caps: Per-task and global spending limits configurable via YAML.
  • Model tiering: Use cheaper models for initial exploration, expensive models for refinement.
  • Caching: Cache LLM responses for identical prompts to avoid redundant API calls.
  • Adaptive compute: Allocate more compute budget to tasks showing improvement, less to stagnant ones.
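The caching mechanism, for instance, can be as simple as memoizing on a hash of the prompt. The wrapper below is an illustrative sketch, not code from the repository:

```python
import hashlib
from typing import Callable

class CachedLLMClient:
    """Memoize LLM responses keyed on a hash of the prompt, so identical
    mutation prompts never pay for a second API call."""

    def __init__(self, call_fn: Callable[[str], str]):
        self._call = call_fn              # underlying (paid) LLM call
        self._cache: dict[str, str] = {}
        self.api_calls = 0                # for budget accounting

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call(prompt)
        return self._cache[key]
```

Because evolutionary search often revisits the same parent program with the same traces, a cache like this can meaningfully cut spend, and the `api_calls` counter doubles as a hook for per-task budget caps.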

8. Architecture Solution

Arcgentica follows an evolutionary loop architecture with LLM-guided mutation. The system can be understood as a variant of genetic programming where the "genetic operators" are replaced by LLM inference calls that are conditioned on runtime execution traces.

8.1 System Architecture Diagram

+=====================================================================+
|                   ARCGENTICA SYSTEM ARCHITECTURE                    |
+=====================================================================+

                     [ ARC-AGI-2 Task ]
                 (input/output grid pairs)
                            |
                            v
              +---------------------------+
              |   TASK PARSER & ENCODER   |
              |  - Parse grid pairs       |
              |  - Encode as numpy arrays |
              |  - Extract task features  |
              +---------------------------+
                            |
                            v
              +---------------------------+
              |   INITIAL POPULATION      |
              |   SEEDER (Generation 0)   |
              |  - LLM zero-shot gen      |
              |  - Template programs      |
              |  - Random initialization  |
              +---------------------------+
                            |
         +==================|==================+
         |         EVOLUTIONARY LOOP           |
         |                  v                  |
         |  +-----------------------------+    |
         |  |    EXECUTION ENGINE         |    |
         |  |  - Sandboxed Python exec    |    |
         |  |  - Timeout enforcement      |    |
         |  |  - Runtime trace capture    |    |
         |  |    * Variable snapshots     |    |
         |  |    * Intermediate grids     |    |
         |  |    * Exception traces       |    |
         |  |    * Control flow path      |    |
         |  +-----------------------------+    |
         |                  |                  |
         |                  v                  |
         |  +-----------------------------+    |
         |  |    FITNESS EVALUATOR        |    |
         |  |  - Grid comparison          |    |
         |  |  - Cell-level accuracy      |    |
         |  |  - Partial credit scoring   |    |
         |  |  - Shape/pattern matching   |    |
         |  +-----------------------------+    |
         |                  |                  |
         |                  v                  |
         |  +-----------------------------+    |
         |  |    PARENT SELECTION         |    |
         |  |  - Tournament selection     |    |
         |  |  - Fitness-proportionate    |    |
         |  |  - Diversity bonus          |    |
         |  |  - Elite preservation       |    |
         |  +-----------------------------+    |
         |                  |                  |
         |                  v                  |
         |  +-----------------------------+    |
         |  |  CONTEXT BUILDER            |    |
         |  |  - Format parent code       |    |
         |  |  - Attach runtime traces    |    |
         |  |  - Compute output diffs     |    |
         |  |  - Add mutation hints       |    |
         |  |  - Compress to fit context  |    |
         |  +-----------------------------+    |
         |                  |                  |
         |                  v                  |
         |  +-----------------------------+    |
         |  |  LLM MUTATION ENGINE        |    |
         |  |  - Send prompt to LLM       |    |
         |  |  - Parse generated code     |    |
         |  |  - Validate syntax          |    |
         |  |  - Inject into population   |    |
         |  +-----------------------------+    |
         |                  |                  |
         |                  v                  |
         |  +-----------------------------+    |
         |  |  POPULATION MANAGER         |    |
         |  |  - Add offspring            |    |
         |  |  - Remove weakest           |    |
         |  |  - Maintain diversity       |    |
         |  |  - Check termination        |    |
         |  +-----------------------------+    |
         |                  |                  |
         |        [Loop if not converged]      |
         +=====================================+
                            |
                            v
              +---------------------------+
              |   SOLUTION EXTRACTOR      |
              |  - Select best program    |
              |  - Apply to test input    |
              |  - Format submission      |
              +---------------------------+
                            |
                            v
               [ Predicted Output Grid ]

8.2 Data Flow Architecture

Task I/O Pairs ----> Execution Engine ----> Runtime Traces
      |                     ^                      |
      |                     |                      v
      |                Candidate            Context Builder
      |                Programs                    |
      |                     ^                      v
      |                     |                LLM Mutation
      |                     |                   Engine
      |                     |                   |    |
      |               Population <-- New <------+    |
      |                Manager     Offspring         |
      |                     |                        v
      +---> Fitness <-------+                 API Calls to
            Evaluator                      Claude / GPT-4 /
                                            Gemini / etc.

9. Components

9.1 Task Parser & Encoder

Responsible for ingesting ARC-AGI-2 tasks in their JSON format and converting them into internal representations. Each task consists of multiple demonstration pairs (training examples) and test inputs. The encoder converts grids to numpy arrays and extracts meta-features (dimensions, color counts, symmetry properties) that may inform the search strategy.

python
import numpy as np

# Task, IOPair, and TaskFeatures are lightweight dataclasses defined elsewhere.
class TaskEncoder:
    """Encodes ARC-AGI-2 tasks into internal representations."""

    def encode(self, task_json: dict) -> Task:
        train_pairs = [
            IOPair(
                input=np.array(ex["input"]),
                output=np.array(ex["output"])
            )
            for ex in task_json["train"]
        ]
        test_inputs = [
            np.array(ex["input"])
            for ex in task_json["test"]
        ]
        # Extract meta-features for search strategy hints
        features = self._extract_features(train_pairs)
        return Task(
            train=train_pairs,
            test=test_inputs,
            features=features
        )

    def _extract_features(self, pairs: list[IOPair]) -> TaskFeatures:
        """Extract structural features: grid sizes, color palettes,
        symmetry, size changes, etc."""
        return TaskFeatures(
            input_sizes=[p.input.shape for p in pairs],
            output_sizes=[p.output.shape for p in pairs],
            colors_used=set().union(*[set(p.input.flatten()) | set(p.output.flatten()) for p in pairs]),
            size_changes=any(p.input.shape != p.output.shape for p in pairs),
            has_symmetry=self._detect_symmetry(pairs),
        )

9.2 Population Manager

Maintains the collection of candidate programs (individuals). Each individual consists of source code, its fitness scores on training examples, and metadata about its lineage (parent program, generation number, mutation type).

python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4

@dataclass
class Individual:
    source_code: str
    fitness: float           # Aggregate fitness across training examples
    per_example_scores: list[float]  # Per-example fitness breakdown
    generation: int
    parent_id: Optional[str]
    mutation_type: str       # e.g., "llm_mutation", "crossover", "seed"
    runtime_traces: list[RuntimeTrace]
    id: str = field(default_factory=lambda: uuid4().hex[:8])

class PopulationManager:
    def __init__(self, max_size: int = 50, elite_fraction: float = 0.1):
        self.population: list[Individual] = []
        self.max_size = max_size
        self.elite_count = max(1, int(max_size * elite_fraction))
        self.archive: list[Individual] = []  # Hall of fame

    def add(self, individual: Individual):
        self.population.append(individual)
        if len(self.population) > self.max_size:
            self._cull()

    def _cull(self):
        """Remove worst individuals while preserving elites and diversity."""
        # Sort by fitness descending
        self.population.sort(key=lambda x: x.fitness, reverse=True)
        # Always keep elites
        elites = self.population[:self.elite_count]
        rest = self.population[self.elite_count:]
        # Apply diversity-aware culling to the rest
        diverse_rest = self._diversity_filter(rest, self.max_size - self.elite_count)
        self.population = elites + diverse_rest

9.3 Execution Engine

The execution engine runs candidate programs in a sandboxed environment with instrumentation to capture runtime traces. This is the most critical infrastructure component, as the quality of runtime traces directly impacts the LLM's ability to suggest meaningful mutations.

python
import traceback

import numpy as np

class ExecutionEngine:
    """Sandboxed execution with runtime trace capture."""

    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout = timeout_seconds

    def execute_with_trace(
        self, source_code: str, input_grid: np.ndarray
    ) -> ExecutionResult:
        """Execute candidate program and capture runtime trace."""
        trace = RuntimeTrace()
        try:
            # Inject tracing hooks first, then compile the instrumented source
            instrumented_code = self._instrument(source_code)
            compiled = compile(instrumented_code, "<candidate>", "exec")

            # Set up instrumented namespace
            namespace = {
                "np": np,
                "numpy": np,
                "__trace__": trace,
            }

            # Execute with timeout (`timeout` is a context manager provided
            # by the engine, e.g. built on signal.alarm)
            with timeout(self.timeout):
                exec(compiled, namespace)
                transform_fn = namespace.get("transform")
                if transform_fn is None:
                    return ExecutionResult(
                        success=False,
                        error="No 'transform' function defined",
                        trace=trace
                    )
                output = transform_fn(input_grid.copy())

            trace.final_output = np.array(output)
            return ExecutionResult(
                success=True,
                output=np.array(output),
                trace=trace
            )
        except TimeoutError:
            trace.error = "Execution timed out"
            return ExecutionResult(success=False, error="Timeout", trace=trace)
        except Exception as e:
            trace.error = str(e)
            trace.traceback = traceback.format_exc()
            return ExecutionResult(success=False, error=str(e), trace=trace)

9.4 Runtime Trace Capturer

Captures detailed information about program execution, including intermediate variable states, loop iterations, and function call chains.

python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np

@dataclass
class RuntimeTrace:
    """Captures execution dynamics of a candidate program."""
    variable_snapshots: list[dict] = field(default_factory=list)
    intermediate_grids: list[np.ndarray] = field(default_factory=list)
    loop_iterations: dict[str, int] = field(default_factory=dict)
    function_calls: list[str] = field(default_factory=list)
    final_output: Optional[np.ndarray] = None
    error: Optional[str] = None
    traceback: Optional[str] = None
    execution_time_ms: float = 0.0

    def snapshot(self, label: str, local_vars: dict):
        """Capture current variable state at a program point."""
        snapshot = {"label": label}
        for name, value in local_vars.items():
            if isinstance(value, np.ndarray):
                snapshot[name] = value.tolist()
            elif isinstance(value, (int, float, str, bool, list, tuple)):
                snapshot[name] = value
        self.variable_snapshots.append(snapshot)

    def format_for_llm(self, max_tokens: int = 2000) -> str:
        """Format trace for inclusion in LLM prompt."""
        parts = []
        if self.error:
            parts.append(f"ERROR: {self.error}")
            if self.traceback:
                # Include last 5 lines of traceback
                tb_lines = self.traceback.strip().split("\n")
                parts.append("Traceback (last 5 lines):")
                parts.extend(tb_lines[-5:])

        if self.final_output is not None:
            parts.append(f"Output shape: {np.array(self.final_output).shape}")
            parts.append(f"Output:\n{self._format_grid(self.final_output)}")

        if self.variable_snapshots:
            parts.append("Key variable states:")
            # Include most informative snapshots
            for snap in self._select_informative(self.variable_snapshots):
                parts.append(f"  At '{snap.get('label', '?')}': {self._summarize(snap)}")

        result = "\n".join(parts)
        return self._truncate(result, max_tokens)

9.5 Fitness Evaluator

Computes fitness scores by comparing candidate program outputs against expected outputs. Uses a nuanced scoring system that provides gradient signal even for partially correct solutions.

python
class FitnessEvaluator:
    """Multi-criteria fitness evaluation for ARC programs."""

    def evaluate(self, individual: Individual, task: Task) -> float:
        """Evaluate fitness across all training examples."""
        scores = []
        for pair in task.train:
            result = self.engine.execute_with_trace(
                individual.source_code, pair.input
            )
            if not result.success:
                scores.append(0.0)  # Crashed programs get zero fitness
            else:
                score = self._compare_grids(result.output, pair.output)
                scores.append(score)
            individual.runtime_traces.append(result.trace)

        individual.per_example_scores = scores
        individual.fitness = self._aggregate(scores)
        return individual.fitness

    def _compare_grids(self, actual: np.ndarray, expected: np.ndarray) -> float:
        """Compute similarity between actual and expected grids."""
        if actual.shape != expected.shape:
            # Partial credit for correct dimensions on one axis
            shape_score = sum(
                a == e for a, e in
                zip(actual.shape, expected.shape)
            ) / len(expected.shape) * 0.1
            return shape_score

        # Cell-level accuracy
        correct_cells = np.sum(actual == expected)
        total_cells = expected.size
        cell_accuracy = correct_cells / total_cells

        # Exact match bonus
        if cell_accuracy == 1.0:
            return 1.0

        return cell_accuracy * 0.9  # Reserve 0.1 for exact match bonus

    def _aggregate(self, scores: list[float]) -> float:
        """Aggregate per-example scores. Use minimum to encourage
        solutions that generalize across all examples."""
        if not scores:
            return 0.0
        # Weighted: emphasize worst-case performance
        return 0.4 * min(scores) + 0.6 * (sum(scores) / len(scores))

9.6 Context Builder

Assembles the full LLM prompt from parent program source code, runtime traces, task specification, and mutation guidance. This is where the "runtime-as-context" magic happens.
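A minimal sketch of such a builder, following the section order described in 4.2 (the function name and layout here are assumptions, not the repository's API):

```python
def build_context(task_desc: str, parent_code: str, fitness: float,
                  traces: list[str], hint: str, budget_chars: int = 8000) -> str:
    """Assemble a mutation prompt: task, parent program, runtime traces,
    mutation instruction. Traces are trimmed first when over budget."""
    head = (f"## Task\n{task_desc}\n\n"
            f"## Parent program (fitness {fitness:.2f})\n{parent_code}\n\n"
            "## Runtime traces\n")
    tail = f"\n\n## Mutation instruction\n{hint}"
    trace_block = "\n\n".join(traces)
    room = budget_chars - len(head) - len(tail)
    if len(trace_block) > room:
        # Traces are the most compressible section; cut them, keep the rest.
        trace_block = trace_block[:max(0, room)] + "\n[traces truncated]"
    return head + trace_block + tail
```

Trimming traces (rather than the task or instruction) mirrors the context-window strategies from Section 4.3: the static sections are small and essential, while trace detail degrades gracefully.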

9.7 LLM Mutation Engine

Sends constructed prompts to LLM APIs, parses the response to extract generated code, validates syntax, and creates new Individual objects for the population.
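Response parsing and validation might look like the following sketch (`extract_code` is a hypothetical helper; the real engine may apply stricter checks):

```python
import ast
import re

def extract_code(llm_response: str) -> "str | None":
    """Pull the first fenced Python block (or the raw text) out of an LLM
    reply; accept it only if it parses and defines a `transform` function."""
    match = re.search(r"```(?:python)?\n(.*?)```", llm_response, re.DOTALL)
    code = match.group(1) if match else llm_response
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None  # syntactically invalid candidates never enter the population
    defines_transform = any(
        isinstance(node, ast.FunctionDef) and node.name == "transform"
        for node in tree.body
    )
    return code if defines_transform else None
```

Rejecting malformed responses here is cheap insurance: it keeps the execution engine from wasting sandbox runs on candidates that can never score.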

9.8 Orchestrator

The top-level controller that manages the evolutionary loop, coordinates async LLM calls, handles per-task budgets, and produces final submissions.
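A bounded-concurrency orchestration skeleton in asyncio might look like this (illustrative; `solve_task` stands in for the per-task evolutionary loop):

```python
import asyncio

async def evolve_all(tasks: list[str], solve_task, max_concurrent: int = 4) -> dict:
    """Run per-task evolutionary loops concurrently, bounded by a semaphore
    so LLM API rate limits and global budgets are respected."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(task_id: str):
        async with sem:
            return task_id, await solve_task(task_id)

    results = await asyncio.gather(*(bounded(t) for t in tasks))
    return dict(results)
```

Since candidate execution is CPU-cheap and the dominant latency is LLM API round-trips, async orchestration of this kind keeps many tasks in flight without extra compute.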

10. Core Mechanisms

10.1 Runtime-as-Context Mechanism

The runtime-as-context mechanism is the defining innovation of Arcgentica. It fundamentally changes the information available to the LLM during code mutation by providing dynamic program behavior rather than only static program text.

10.1.1 Why Runtime Context Matters

LLMs excel at understanding code syntax and common patterns, but struggle with precise execution simulation -- a phenomenon known as the "execution gap." When asked "what does this code produce for input X?", LLMs frequently make errors, especially for loops, conditionals with complex predicates, and array index manipulation. By providing actual execution results, Arcgentica sidesteps this weakness: the model interprets observed behavior instead of having to simulate it.

Consider a concrete example: a candidate program attempts to rotate objects in a grid. Without runtime traces, the LLM sees only the rotation code and must mentally simulate array transpositions, boundary conditions, and index arithmetic. With runtime traces, the LLM sees: "Input grid was 5x5, the rotation produced a 5x5 output, but cells at positions [(0,0), (0,4), (4,0), (4,4)] are wrong -- they have color 3 instead of expected color 7." This concrete information enables precise, targeted fixes.

10.1.2 Trace Capture Pipeline

python
def capture_runtime_context(program: str, input_grid: np.ndarray,
                            expected_output: np.ndarray) -> dict:
    """
    Full runtime context capture pipeline.
    Returns structured context for LLM consumption.
    """
    result = execution_engine.execute_with_trace(program, input_grid)

    context = {
        "program_source": program,
        "input_grid": format_grid(input_grid),
        "expected_output": format_grid(expected_output),
        "execution_successful": result.success,
    }

    if result.success:
        context["actual_output"] = format_grid(result.output)
        context["output_diff"] = compute_grid_diff(result.output, expected_output)
        context["cell_accuracy"] = compute_cell_accuracy(result.output, expected_output)
        context["shape_match"] = result.output.shape == expected_output.shape

        # Key: include intermediate variable states
        context["intermediate_states"] = result.trace.format_for_llm()
    else:
        context["error_message"] = result.trace.error
        context["error_traceback"] = result.trace.traceback
        # Even on error, partial traces may exist
        context["partial_trace"] = result.trace.format_for_llm()

    return context

10.1.3 Grid Diff Representation

A particularly effective form of runtime context is the grid diff -- a visual representation highlighting which cells differ between actual and expected outputs:

python
def compute_grid_diff(actual: np.ndarray, expected: np.ndarray) -> str:
    """
    Produce a human-readable diff between actual and expected grids.
    Cells that match are shown normally; mismatches are highlighted.
    """
    if actual.shape != expected.shape:
        return (f"Shape mismatch: got {actual.shape}, "
                f"expected {expected.shape}")

    lines = []
    for i in range(expected.shape[0]):
        row = []
        for j in range(expected.shape[1]):
            if actual[i, j] == expected[i, j]:
                row.append(f" {actual[i,j]} ")
            else:
                row.append(f"[{actual[i,j]}->{expected[i,j]}]")
        lines.append(" ".join(row))
    return "\n".join(lines)

# Example output:
#  0   0   0   0   0
#  0  [3->7]  0   0   0
#  0   0   2   0   0
#  0   0   0  [3->7]  0
#  0   0   0   0   0

10.2 Program Synthesis Approach

Arcgentica frames ARC-AGI-2 as a program induction problem: given input/output examples, find a program that implements the underlying transformation. Unlike neural program induction (which learns a latent program representation), Arcgentica synthesizes explicit, interpretable Python code.
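For concreteness, a synthesized solution is just an ordinary Python function over numpy grids. A hypothetical program for an invented "recolor red to green" task might look like:

```python
import numpy as np

def transform(grid: np.ndarray) -> np.ndarray:
    """Hypothetical synthesized program: recolor all 2s (red) to 3s (green).
    Illustrative only -- not an actual program from the repository."""
    out = grid.copy()        # never mutate the input grid
    out[out == 2] = 3        # vectorized color substitution
    return out
```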

10.2.1 Search Space

The search space is the set of all valid Python functions with the signature `def transform(grid: np.ndarray) -> np.ndarray`. While theoretically infinite, the practical search space is constrained by:

  • Length constraints: Programs are typically limited to 50-200 lines.
  • Library constraints: Only numpy and standard library operations.
  • Timeout constraints: Programs must complete within a few seconds.
  • LLM bias: The LLM naturally generates programs in common coding patterns, providing implicit regularization.
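A minimal sketch of how the length and library constraints could be enforced before a candidate ever runs (the names `MAX_LINES`, `ALLOWED_GLOBALS`, and `compile_candidate` are illustrative assumptions, not from the repository; timeout enforcement is omitted for brevity):

```python
import numpy as np

MAX_LINES = 200  # assumed upper end of the length constraint above

# Restricted namespace: numpy plus a small whitelist of builtins.
ALLOWED_GLOBALS = {
    "np": np,
    "__builtins__": {"range": range, "len": len, "enumerate": enumerate,
                     "abs": abs, "min": min, "max": max},
}

def compile_candidate(source: str):
    """Check static constraints, then exec in a restricted namespace.
    Returns the transform function, or raises ValueError."""
    if len(source.splitlines()) > MAX_LINES:
        raise ValueError("program exceeds length constraint")
    namespace: dict = {}
    exec(source, dict(ALLOWED_GLOBALS), namespace)  # restricted exec()
    fn = namespace.get("transform")
    if not callable(fn):
        raise ValueError("no transform() function defined")
    return fn
```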

10.2.2 Synthesis Pipeline

  1. Seed generation: LLM generates initial candidate programs from task descriptions (zero-shot).
  2. Evaluation: Each candidate is executed against all training examples.
  3. Trace capture: Runtime behavior is recorded.
  4. Informed mutation: LLM receives code + traces, generates improved variants.
  5. Selection: Best variants survive; poor variants are culled.
  6. Iteration: Repeat until solution found or budget exhausted.
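The six steps above can be sketched as a single loop. This is a simplified sketch, not the repository's implementation: `llm_mutate` and `evaluate` stand in for the LLM orchestrator and execution engine described later, and parent selection is reduced to cycling over the elites.

```python
def evolve(task, llm_mutate, evaluate, seed_programs,
           pop_size=30, max_generations=50, elite_frac=0.2):
    """Seed -> evaluate -> mutate -> select loop (simplified sketch)."""
    # 1-2. Seed generation and initial evaluation
    population = [(p, evaluate(p, task)) for p in seed_programs]
    for _gen in range(max_generations):
        # 5. Selection: rank by fitness, keep elites
        population.sort(key=lambda pf: pf[1], reverse=True)
        if population[0][1] >= 1.0:          # perfect solution found
            break
        elites = population[:max(1, int(elite_frac * pop_size))]
        # 3-4. Informed mutation of selected parents
        offspring = []
        for i in range(pop_size - len(elites)):
            parent = elites[i % len(elites)]  # simplistic parent cycling
            child = llm_mutate(parent[0], task)
            offspring.append((child, evaluate(child, task)))
        # 6. Iterate on the new population
        population = elites + offspring
    return max(population, key=lambda pf: pf[1])
```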

10.3 Code Modification & Mutation

Mutations in Arcgentica are not random syntactic changes (as in traditional genetic programming) but semantically-informed code modifications guided by the LLM. The system supports several mutation strategies:

10.3.1 Mutation Types

| Mutation Type | Description | When Used |
| --- | --- | --- |
| Targeted Fix | Fix a specific bug identified via runtime trace (e.g., off-by-one error, wrong color mapping) | High-fitness programs with identifiable errors |
| Structural Rewrite | Rewrite a major section of the program (e.g., replace a loop-based approach with vectorized numpy) | Medium-fitness programs that are structurally flawed |
| Strategy Shift | Completely change the algorithmic approach (e.g., from flood fill to pattern matching) | Low-fitness programs, or when stagnation is detected |
| Refinement | Small tweaks to nearly correct programs (edge-case handling, boundary conditions) | Very high-fitness programs (fitness > 0.9) |
| Crossover | Combine elements from two parent programs | When the population has diverse, partial solutions |
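One plausible way to map fitness bands to these mutation types. The thresholds and the `choose_mutation_type` helper are illustrative assumptions, not the repository's actual dispatch logic:

```python
import random

def choose_mutation_type(fitness: float, stagnated: bool,
                         has_diverse_partials: bool = False) -> str:
    """Map the fitness bands from the table above to a mutation type.
    Thresholds (0.3 / 0.7 / 0.9) are illustrative assumptions."""
    if has_diverse_partials and random.random() < 0.2:
        return "crossover"           # occasionally combine partial solutions
    if fitness > 0.9:
        return "refinement"          # nearly correct: minimal tweaks
    if stagnated or fitness < 0.3:
        return "strategy_shift"      # stuck or hopeless: try a new approach
    if fitness > 0.7:
        return "targeted_fix"        # good program with identifiable errors
    return "structural_rewrite"      # medium fitness: rework the core
```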

10.3.2 Mutation Prompt Construction

python
def construct_mutation_prompt(
    parent: Individual,
    task: Task,
    mutation_type: str,
    traces: list[dict]
) -> str:
    """Build the LLM prompt for a specific mutation type."""

    prompt_parts = [
        SYSTEM_PROMPT,
        format_task_description(task),
        f"\n## Current Program (fitness: {parent.fitness:.3f})\n",
        f"```python\n{parent.source_code}\n```\n",
    ]

    # Add runtime context for each training example
    prompt_parts.append("\n## Runtime Analysis\n")
    for i, (trace, pair) in enumerate(zip(traces, task.train)):
        prompt_parts.append(f"\n### Training Example {i+1}:")
        prompt_parts.append(f"Input ({pair.input.shape}):\n{format_grid(pair.input)}")
        prompt_parts.append(f"Expected output ({pair.output.shape}):\n{format_grid(pair.output)}")

        if trace["execution_successful"]:
            prompt_parts.append(f"Actual output:\n{trace['actual_output']}")
            prompt_parts.append(f"Cell accuracy: {trace['cell_accuracy']:.1%}")
            prompt_parts.append(f"Diff (mismatches in [actual->expected]):\n{trace['output_diff']}")
            if trace.get("intermediate_states"):
                prompt_parts.append(f"Intermediate states:\n{trace['intermediate_states']}")
        else:
            prompt_parts.append(f"EXECUTION FAILED: {trace['error_message']}")
            if trace.get("error_traceback"):
                prompt_parts.append(f"Traceback:\n{trace['error_traceback']}")

    # Mutation-type-specific instructions
    mutation_instructions = {
        "targeted_fix": "Fix the specific errors shown in the diff above. Focus on the cells that are wrong.",
        "structural_rewrite": "The current approach has fundamental issues. Rewrite the core algorithm while keeping any parts that work correctly.",
        "strategy_shift": "Try a completely different approach to solve this task. Analyze the input/output patterns from scratch.",
        "refinement": "The program is nearly correct. Make minimal, precise changes to handle the remaining edge cases.",
    }

    prompt_parts.append(f"\n## Instruction\n{mutation_instructions.get(mutation_type, 'Improve the program.')}")
    prompt_parts.append("\nProvide the complete improved Python function. Start with `def transform(grid):`")

    return "\n".join(prompt_parts)

10.4 Parent Selection & Sampling

Parent selection determines which individuals from the current population are chosen for mutation. Arcgentica employs a multi-strategy selection approach that balances exploitation (selecting high-fitness individuals) with exploration (maintaining diversity).

10.4.1 Selection Strategies

python
class ParentSelector:
    """Multi-strategy parent selection for evolutionary search."""

    def select(self, population: list[Individual],
               strategy: str = "mixed") -> Individual:
        if strategy == "tournament":
            return self._tournament_select(population, k=3)
        elif strategy == "fitness_proportionate":
            return self._fitness_proportionate(population)
        elif strategy == "diversity_weighted":
            return self._diversity_weighted(population)
        elif strategy == "mixed":
            # Probabilistic mix of strategies
            r = random.random()
            if r < 0.4:
                return self._tournament_select(population, k=3)
            elif r < 0.7:
                return self._fitness_proportionate(population)
            elif r < 0.9:
                return self._diversity_weighted(population)
            else:
                return random.choice(population)  # Pure random for exploration
        raise ValueError(f"Unknown selection strategy: {strategy}")

    def _tournament_select(self, pop: list[Individual], k: int) -> Individual:
        """Select best from k random individuals."""
        contestants = random.sample(pop, min(k, len(pop)))
        return max(contestants, key=lambda x: x.fitness)

    def _fitness_proportionate(self, pop: list[Individual]) -> Individual:
        """Roulette wheel selection proportional to fitness."""
        total = sum(ind.fitness + 0.01 for ind in pop)  # +epsilon to avoid zero
        r = random.uniform(0, total)
        cumulative = 0
        for ind in pop:
            cumulative += ind.fitness + 0.01
            if cumulative >= r:
                return ind
        return pop[-1]

    def _diversity_weighted(self, pop: list[Individual]) -> Individual:
        """Select considering both fitness and uniqueness."""
        scores = []
        for ind in pop:
            uniqueness = self._compute_uniqueness(ind, pop)
            combined = 0.5 * ind.fitness + 0.5 * uniqueness
            scores.append(combined)
        idx = random.choices(range(len(pop)), weights=scores, k=1)[0]
        return pop[idx]

    def _compute_uniqueness(self, ind: Individual,
                            pop: list[Individual]) -> float:
        """Measure how different this individual is from the population."""
        if len(pop) <= 1:
            return 1.0
        similarities = []
        for other in pop:
            if other.id == ind.id:
                continue
            sim = self._code_similarity(ind.source_code, other.source_code)
            similarities.append(sim)
        avg_similarity = sum(similarities) / len(similarities)
        return 1.0 - avg_similarity

10.5 Population Management

Population management governs the lifecycle of individuals across generations. Arcgentica maintains a fixed-size population with mechanisms for elitism, diversity preservation, and adaptive sizing.

10.5.1 Population Lifecycle

Gen 0: [Seed1] [Seed2] [Seed3] ... [SeedN]   (LLM zero-shot generation)
         |       |       |           |
         v       v       v           v
Gen 1: [Eval]  [Eval]  [Eval]  ... [Eval]    (Fitness evaluation)
         |       |       |           |
         v       v       v           v
       [Select parents] --> [Mutate via LLM] --> [New offspring]
         |                                         |
         v                                         v
Gen 2: [Elites preserved] + [New offspring] = [New population]
         |
         v
       [Repeat until convergence or budget exhaustion]

10.5.2 Key Parameters

| Parameter | Typical Value | Purpose |
| --- | --- | --- |
| Population size | 20-50 | Balance between diversity and computational cost |
| Elite fraction | 10-20% | Protect the best solutions from replacement |
| Offspring per generation | 5-15 | Number of new programs generated per iteration |
| Max generations | 30-100 | Termination condition if no perfect solution is found |
| Stagnation threshold | 5-10 generations | Trigger a strategy shift if fitness does not improve |
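Putting the lifecycle diagram and these parameters together, a single generation update with elitism might look like the following sketch (individuals modeled as `(program, fitness)` pairs; the `next_generation` helper is illustrative, not the actual API):

```python
def next_generation(population, offspring, pop_size=30, elite_frac=0.15):
    """One population update with elitism, as in the lifecycle above."""
    ranked = sorted(population, key=lambda pf: pf[1], reverse=True)
    n_elites = max(1, int(elite_frac * pop_size))
    elites = ranked[:n_elites]                       # protected from replacement
    # Fill the rest of the population with the best offspring,
    # falling back to surviving parents if offspring are scarce.
    pool = sorted(offspring, key=lambda pf: pf[1], reverse=True)
    survivors = elites + pool[:pop_size - n_elites]
    if len(survivors) < pop_size:
        survivors += ranked[n_elites:pop_size - len(survivors) + n_elites]
    return survivors[:pop_size]
```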

10.6 Solution Diversity & Novelty

Maintaining diversity in the population is critical to avoid premature convergence, where all programs become similar and the search stagnates. Arcgentica employs several diversity mechanisms:

10.6.1 Code Diversity Metrics

python
def code_similarity(code_a: str, code_b: str) -> float:
    """Compute similarity between two programs using multiple metrics."""

    # 1. Token-level Jaccard similarity
    tokens_a = set(tokenize(code_a))
    tokens_b = set(tokenize(code_b))
    jaccard = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)

    # 2. AST structural similarity
    try:
        ast_a = ast.parse(code_a)
        ast_b = ast.parse(code_b)
        ast_sim = ast_structural_similarity(ast_a, ast_b)
    except SyntaxError:
        ast_sim = jaccard  # Fallback

    # 3. Behavioral similarity (same outputs on same inputs?)
    # This is captured separately in the fitness evaluator

    return 0.4 * jaccard + 0.6 * ast_sim

10.6.2 Behavioral Diversity

Beyond code-level diversity, Arcgentica tracks behavioral diversity -- whether programs produce different outputs on the same inputs. Two programs with different code but identical behavior are treated as duplicates for diversity purposes. This is measured via output fingerprinting:

python
def behavioral_fingerprint(individual: Individual, task: Task) -> tuple:
    """Create a hashable fingerprint of program behavior."""
    outputs = []
    for pair in task.train:
        result = execute(individual.source_code, pair.input)
        if result.success:
            outputs.append(tuple(result.output.flatten()))
        else:
            outputs.append(("ERROR", result.trace.error))
    return tuple(outputs)
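
A sketch of how such fingerprints could drive deduplication, keeping only the fittest representative of each distinct behavior (the `deduplicate_by_behavior` helper is hypothetical, not from the repository):

```python
def deduplicate_by_behavior(population, fingerprint_fn):
    """Keep only the fittest individual per behavioral fingerprint.
    `population` holds (program, fitness) pairs; `fingerprint_fn`
    maps a program to a hashable behavioral fingerprint."""
    best_per_behavior = {}
    for prog, fit in population:
        fp = fingerprint_fn(prog)
        if fp not in best_per_behavior or fit > best_per_behavior[fp][1]:
            best_per_behavior[fp] = (prog, fit)
    return list(best_per_behavior.values())
```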

10.7 LLM Orchestration

Arcgentica coordinates multiple LLM calls concurrently using asynchronous programming patterns. The orchestration layer handles rate limiting, retry logic, model selection, and response parsing.

10.7.1 Async Orchestration Architecture

python
class LLMOrchestrator:
    """Manages concurrent LLM calls for mutation."""

    def __init__(self, config: OrchestratorConfig):
        self.clients = {
            "anthropic": AsyncAnthropic(api_key=config.anthropic_key),
            "openai": AsyncOpenAI(api_key=config.openai_key),
        }
        self.selector = ParentSelector()  # parent selection (Section 10.4)
        self.semaphore = asyncio.Semaphore(config.max_concurrent_calls)
        self.rate_limiter = TokenBucketRateLimiter(
            tokens_per_second=config.max_tokens_per_second
        )

    async def generate_mutations(
        self,
        parents: list[Individual],
        task: Task,
        num_mutations: int
    ) -> list[Individual]:
        """Generate multiple mutations concurrently."""
        tasks = []
        for _ in range(num_mutations):
            parent = self.selector.select(parents)
            mutation_type = self._choose_mutation_type(parent)
            model = self._choose_model(mutation_type)
            tasks.append(
                self._generate_single_mutation(parent, task, mutation_type, model)
            )
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in results if isinstance(r, Individual)]

    async def _generate_single_mutation(
        self, parent: Individual, task: Task,
        mutation_type: str, model: str
    ) -> Individual:
        async with self.semaphore:
            await self.rate_limiter.acquire()
            prompt = construct_mutation_prompt(parent, task, mutation_type,
                                               parent.runtime_traces)
            response = await self._call_llm(model, prompt)
            source_code = extract_python_code(response)
            return Individual(
                source_code=source_code,
                fitness=0.0,  # Will be evaluated later
                generation=parent.generation + 1,
                parent_id=parent.id,
                mutation_type=mutation_type,
                runtime_traces=[],
            )

    def _choose_model(self, mutation_type: str) -> str:
        """Select model based on mutation type and budget."""
        if mutation_type == "strategy_shift":
            return "claude-3-5-sonnet"  # Strong reasoning for novel approaches
        elif mutation_type == "targeted_fix":
            return "gpt-4o"  # Fast and precise for small fixes
        else:
            return random.choice(["claude-3-5-sonnet", "gpt-4o"])

10.8 Search Strategies

The evolutionary search in Arcgentica is not a simple hill-climbing algorithm. It employs multiple search strategies that are dynamically selected based on the current state of the population and progress trajectory.

10.8.1 Strategy Portfolio

| Strategy | Description | Trigger Condition |
| --- | --- | --- |
| Exploitation | Focus on refining the best programs with targeted fixes | Population contains programs with fitness > 0.7 |
| Exploration | Generate diverse new approaches via strategy shifts | All programs have fitness < 0.3, or stagnation is detected |
| Balanced | Mix of exploitation and exploration | Default strategy when the population is spread |
| Intensification | Focus compute on the single most promising program | One program has fitness > 0.9 (nearly solved) |
| Restart | Discard the population and start fresh | Extended stagnation with very low fitness |

10.8.2 Adaptive Strategy Selection

python
class AdaptiveSearchStrategy:
    """Dynamically select search strategy based on population state."""

    def __init__(self):
        self.stagnation_counter = 0
        self.best_fitness_history = []

    def select_strategy(self, population: list[Individual]) -> str:
        best_fitness = max(ind.fitness for ind in population)
        self.best_fitness_history.append(best_fitness)

        # Detect stagnation
        if len(self.best_fitness_history) >= 5:
            recent = self.best_fitness_history[-5:]
            if max(recent) - min(recent) < 0.01:
                self.stagnation_counter += 1
            else:
                self.stagnation_counter = 0

        # Strategy selection logic
        if best_fitness > 0.95:
            return "intensification"  # Almost there, focus!
        elif best_fitness > 0.7:
            return "exploitation"     # Good programs exist, refine them
        elif self.stagnation_counter > 3:
            if best_fitness < 0.1:
                return "restart"      # Nothing works, start over
            else:
                return "exploration"  # Stuck, try new approaches
        else:
            return "balanced"         # Default mixed strategy

10.9 Evaluation & Fitness

10.9.1 Multi-Criteria Fitness Function

The fitness function must provide gradient signal even for partial solutions. Arcgentica uses a composite fitness score:

$$ \text{(Eq. 1)} \quad F(p) = w_1 \min_i \mathrm{cell\_acc}(p, x_i, y_i) + w_2 \frac{1}{N} \sum_i \mathrm{cell\_acc}(p, x_i, y_i) + w_3\, \mathrm{shape\_match}(p) + w_4\, \mathrm{no\_error}(p) $$

Where:

  • cell_acc(p, x_i, y_i) = fraction of cells correct when program p is applied to input x_i with expected output y_i
  • shape_match(p) = 1 if output has correct dimensions for all examples, 0 otherwise
  • no_error(p) = 1 if program executes without errors on all inputs, 0 otherwise
  • Typical weights: w_1 = 0.4, w_2 = 0.3, w_3 = 0.2, w_4 = 0.1

10.9.2 Fitness Implementation

python
def composite_fitness(
    program: str,
    task: Task,
    engine: ExecutionEngine,
    weights: tuple = (0.4, 0.3, 0.2, 0.1)
) -> float:
    """
    Composite fitness function with gradient signal for partial solutions.

    Components:
      1. Worst-case cell accuracy (min across examples)
      2. Average cell accuracy
      3. Shape correctness
      4. Execution success
    """
    w_min, w_avg, w_shape, w_exec = weights

    cell_accuracies = []
    shape_matches = []
    exec_successes = []

    for pair in task.train:
        result = engine.execute_with_trace(program, pair.input)

        if not result.success:
            cell_accuracies.append(0.0)
            shape_matches.append(0.0)
            exec_successes.append(0.0)
            continue

        exec_successes.append(1.0)

        if result.output.shape != pair.output.shape:
            shape_matches.append(0.0)
            # Can still compute partial cell accuracy via padding/cropping
            cell_acc = _padded_cell_accuracy(result.output, pair.output)
            cell_accuracies.append(cell_acc * 0.5)  # Penalty for wrong shape
        else:
            shape_matches.append(1.0)
            cell_acc = np.mean(result.output == pair.output)
            cell_accuracies.append(float(cell_acc))

    if not cell_accuracies:
        return 0.0

    fitness = (
        w_min * min(cell_accuracies) +
        w_avg * (sum(cell_accuracies) / len(cell_accuracies)) +
        w_shape * (sum(shape_matches) / len(shape_matches)) +
        w_exec * (sum(exec_successes) / len(exec_successes))
    )

    # Bonus for perfect solution
    if all(ca == 1.0 for ca in cell_accuracies):
        fitness = 1.0

    return fitness

10.10 Prompt Engineering

The quality of LLM prompts directly impacts mutation effectiveness. Arcgentica employs carefully crafted prompts with several key design principles:

10.10.1 System Prompt

text
SYSTEM PROMPT:
You are an expert Python programmer specializing in grid transformation problems.
You write clean, efficient numpy code.

Rules:
1. Write a single function: def transform(grid: np.ndarray) -> np.ndarray
2. Use only numpy operations and Python standard library
3. The function must be deterministic (no randomness)
4. Handle grids of any reasonable size
5. Do not use external libraries beyond numpy
6. Return a numpy array with integer values (0-9 representing colors)
7. Analyze the runtime traces carefully to understand what the current
   program does wrong, then fix it precisely

10.10.2 Grid Formatting

python
def format_grid(grid: np.ndarray) -> str:
    """Format a grid for LLM consumption with color names."""
    COLOR_MAP = {
        0: ".", 1: "B", 2: "R", 3: "G", 4: "Y",
        5: "A", 6: "M", 7: "O", 8: "C", 9: "W"
    }
    lines = []
    for row in grid:
        line = " ".join(COLOR_MAP.get(int(c), str(c)) for c in row)
        lines.append(line)
    return "\n".join(lines)

10.10.3 Mutation-Specific Instruction Templates

python
MUTATION_TEMPLATES = {
    "targeted_fix": """
The program above is close to correct but has specific errors.
Look at the diff between actual and expected outputs:
- Cells marked [actual->expected] show mismatches
- Focus on WHY these specific cells are wrong
- Make the minimal change needed to fix these errors
- Do not change parts of the code that produce correct results
""",
    "structural_rewrite": """
The current approach has fundamental limitations.
Analyze the input/output pairs to understand the transformation rule:
- What patterns exist in the inputs?
- How do the outputs relate to the inputs?
- What spatial/color operations are involved?
Rewrite the program using a more appropriate algorithm.
""",
    "strategy_shift": """
Forget the current program entirely.
Study the input/output examples carefully and describe the transformation:
1. What objects/patterns appear in the input?
2. How are they transformed in the output?
3. What rules govern the transformation?
Then write a completely new program implementing your analysis.
""",
    "refinement": """
The program is nearly correct (fitness: {fitness:.1%}).
It fails on specific edge cases. Analyze the runtime traces to find:
- Which examples fail and why
- Whether the issue is boundary conditions, special cases, or off-by-one errors
Make precise, minimal fixes. Do not restructure working code.
"""
}

10.11 Cost Control

LLM API costs can escalate rapidly in evolutionary search. Arcgentica implements several cost management strategies:

python
class CostController:
    """Manage LLM API spending across the search."""

    def __init__(self, config: CostConfig):
        self.per_task_budget = config.per_task_budget  # e.g., $5.00
        self.global_budget = config.global_budget      # e.g., $500.00
        self.spent_per_task: dict[str, float] = {}
        self.total_spent = 0.0
        # Model costs per 1M tokens (approximate)
        self.model_costs = {
            "claude-3-5-sonnet": {"input": 3.0, "output": 15.0},
            "claude-3-opus":     {"input": 15.0, "output": 75.0},
            "gpt-4o":            {"input": 2.5, "output": 10.0},
            "gpt-4o-mini":       {"input": 0.15, "output": 0.60},
        }

    def can_spend(self, task_id: str, estimated_tokens: int,
                  model: str) -> bool:
        """Check if we can afford this API call."""
        cost = self._estimate_cost(estimated_tokens, model)
        task_spent = self.spent_per_task.get(task_id, 0.0)
        return (
            task_spent + cost <= self.per_task_budget and
            self.total_spent + cost <= self.global_budget
        )

    def record(self, task_id: str, input_tokens: int,
               output_tokens: int, model: str):
        """Record actual cost after API call."""
        rates = self.model_costs.get(model, {"input": 5.0, "output": 15.0})
        cost = (
            input_tokens * rates["input"] / 1_000_000 +
            output_tokens * rates["output"] / 1_000_000
        )
        self.spent_per_task[task_id] = self.spent_per_task.get(task_id, 0.0) + cost
        self.total_spent += cost

    def suggest_model(self, task_id: str, generations_completed: int,
                      remaining_generations: int) -> str:
        """Suggest a model based on remaining budget."""
        spent = self.spent_per_task.get(task_id, 0.0)
        remaining = self.per_task_budget - spent
        avg_cost_per_gen = spent / max(1, generations_completed)
        if remaining < avg_cost_per_gen * remaining_generations:
            return "gpt-4o-mini"  # Switch to cheaper model
        return "claude-3-5-sonnet"  # Use best model

10.12 Mathematical Foundations

10.12.1 Evolutionary Search as Optimization

The evolutionary program synthesis can be formalized as an optimization problem over the space of programs:

$$ \text{(Eq. 2)} \quad p^* = \operatorname*{argmax}_{p \in P} F(p, T) = \operatorname*{argmax}_{p \in P} \left[ w_1 \min_i f(p, x_i, y_i) + w_2 \frac{1}{N} \sum_i f(p, x_i, y_i) \right] $$

Where P is the space of valid Python programs, T = {(x_i, y_i)} is the set of training examples, and f(p, x, y) measures the agreement between p(x) and y.

10.12.2 Mutation as Conditional Generation

Each LLM mutation can be modeled as sampling from a conditional distribution:

$$ \text{(Eq. 3)} \quad p_{\text{child}} \sim \mathrm{LLM}\left(\,\cdot \mid p_{\text{parent}},\, R(p_{\text{parent}}, T),\, M\right) $$

Where R(p, T) represents the runtime traces of program p on task T, and M represents the mutation instruction. The key insight is that conditioning on R(p, T) provides the LLM with grounded semantic information rather than requiring it to infer program behavior from syntax alone.

10.12.3 Information-Theoretic Perspective

The runtime-as-context approach can be understood through the lens of information theory. The mutual information between the LLM's output and the correct solution is bounded by the information in its input:

$$ \text{(Eq. 4)} \quad I(p_{\text{child}};\, p^*) \le I(\mathrm{prompt};\, p^*) = I(p_{\text{parent}}, R, M;\, p^*) $$

Adding runtime traces R increases the mutual information between the prompt and the target solution, enabling better-informed mutations:

$$ \text{(Eq. 5)} \quad I(p_{\text{parent}}, R, M;\, p^*) \ge I(p_{\text{parent}}, M;\, p^*) $$

This inequality formalizes the claim that runtime context is strictly helpful -- it can only add information, never reduce it (assuming the LLM processes it correctly).

10.12.4 Population Dynamics

The expected fitness improvement per generation can be approximated as:

$$ \text{(Eq. 6)} \quad \mathbb{E}[F_{t+1}] = \mathbb{E}[F_t] + \frac{\alpha\, \sigma_F\, I}{\sqrt{N}} $$

Where alpha is the selection intensity, sigma_F is the fitness standard deviation in the population, I is the expected improvement from LLM mutation (related to the quality of runtime context), and N is the population size. This shows diminishing returns from increasing population size, suggesting moderate populations (20-50) are optimal.

10.12.5 Python Implementation of Key Equations

python
import numpy as np
from typing import Callable

# Eq. 2: Composite fitness function
def fitness(
    program_fn: Callable,
    examples: list[tuple[np.ndarray, np.ndarray]],
    w1: float = 0.4,
    w2: float = 0.6
) -> float:
    """
    F(p, T) = w1 * min_i f(p, x_i, y_i) + w2 * mean_i f(p, x_i, y_i)

    where f(p, x, y) = |{(i,j) : p(x)[i,j] == y[i,j]}| / |y|
    """
    scores = []
    for x, y in examples:
        try:
            pred = program_fn(x.copy())
            if pred.shape != y.shape:
                scores.append(0.0)
            else:
                scores.append(float(np.mean(pred == y)))
        except Exception:
            scores.append(0.0)

    if not scores:
        return 0.0
    return w1 * min(scores) + w2 * np.mean(scores)


# Eq. 6: Expected fitness improvement per generation
def expected_improvement(
    population_fitness: list[float],
    selection_intensity: float = 1.5,  # ~ top 20% selection
    runtime_context_quality: float = 0.8,  # 0-1, how good are traces
) -> float:
    """
    E[Delta F] = alpha * sigma_F * I / sqrt(N)

    Parameters:
      population_fitness: list of fitness values for current population
      selection_intensity: alpha, depends on selection pressure
      runtime_context_quality: I, depends on trace informativeness
    """
    sigma_f = np.std(population_fitness)
    n = len(population_fitness)
    delta_f = selection_intensity * sigma_f * runtime_context_quality / np.sqrt(n)
    return delta_f


# Behavioral diversity metric for population
def population_diversity(
    behavioral_fingerprints: list[tuple]
) -> float:
    """
    Diversity = |unique fingerprints| / |population|

    Measures fraction of behaviorally distinct programs.
    """
    unique = len(set(behavioral_fingerprints))
    total = len(behavioral_fingerprints)
    return unique / total if total > 0 else 0.0


# Information gain from runtime context
def runtime_context_information_gain(
    accuracy_with_context: list[float],
    accuracy_without_context: list[float]
) -> float:
    """
    Approximate information gain from runtime traces.

    IG = E[F_with_context] - E[F_without_context]

    Based on empirical measurements of mutation quality with
    and without runtime traces.
    """
    mean_with = np.mean(accuracy_with_context)
    mean_without = np.mean(accuracy_without_context)
    return mean_with - mean_without


# Grid cell-level accuracy (core fitness primitive)
def cell_accuracy(actual: np.ndarray, expected: np.ndarray) -> float:
    """
    f(p, x, y) = (1/|y|) * sum_{i,j} 1[p(x)[i,j] == y[i,j]]

    Returns fraction of correctly predicted cells.
    Handles shape mismatches with zero score.
    """
    if actual.shape != expected.shape:
        return 0.0
    return float(np.sum(actual == expected)) / expected.size


# Tournament selection probability
def tournament_selection_prob(
    fitness_rank: int,
    population_size: int,
    tournament_size: int
) -> float:
    """
    P(rank k selected) = (1 - ((N-k)/N)^T) - (1 - ((N-k+1)/N)^T)

    where N = population_size, k = rank (1=best), T = tournament_size.
    Approximation for large N.
    """
    N = population_size
    T = tournament_size
    k = fitness_rank
    # P(winner has rank exactly k) = P(all T contestants have rank >= k)
    #                              - P(all T contestants have rank > k)
    p_all_rank_ge_k = ((N - k + 1) / N) ** T
    p_all_rank_gt_k = ((N - k) / N) ** T
    return p_all_rank_ge_k - p_all_rank_gt_k

10.12.6 Convergence Analysis

Under idealized assumptions (LLM mutations are unbiased improvements, fitness landscape is unimodal), the expected number of generations to reach fitness F* from initial fitness F_0 is:

$$ \text{(Eq. 7)} \quad G \approx \frac{F^* - F_0}{\mathbb{E}[\Delta F]} = \frac{(F^* - F_0)\, \sqrt{N}}{\alpha\, \sigma_F\, I} $$

In practice, the fitness landscape for ARC tasks is highly multimodal, meaning the search often requires strategy shifts (equivalent to escaping local optima) rather than smooth gradient following. This motivates the adaptive search strategy controller described in Section 10.8.
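Eq. 7 is straightforward to implement alongside the other equations in Section 10.12.5 (a sketch; the parameter names mirror the symbols above):

```python
import math

def generations_to_target(f_star: float, f0: float,
                          selection_intensity: float,
                          fitness_std: float,
                          mutation_quality: float,
                          population_size: int) -> float:
    """Eq. 7: G ~ (F* - F0) * sqrt(N) / (alpha * sigma_F * I)."""
    denom = selection_intensity * fitness_std * mutation_quality
    if denom <= 0:
        return math.inf  # no expected progress per generation
    return (f_star - f0) * math.sqrt(population_size) / denom
```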

11. Programming Language

| Aspect | Details |
| --- | --- |
| System implementation | Python 3.10+ |
| Synthesized programs | Python (numpy-based grid transformations) |
| Async framework | asyncio for concurrent LLM API calls |
| Key dependencies | numpy, httpx/aiohttp, pyyaml, ast (standard library) |
| Execution sandbox | Restricted exec() with a controlled namespace |
| Testing | pytest |
| Configuration | YAML files for hyperparameters and model selection |

The choice of Python for both the system and synthesized programs is deliberate: LLMs have been extensively trained on Python code and produce higher-quality Python than most other languages. Numpy provides a natural vocabulary for grid manipulation operations (slicing, transposition, masking, etc.) that are fundamental to ARC tasks.

12. Memory & Runtime Context Management

12.1 Context Window Budget Allocation

The most constrained resource is the LLM's context window. Arcgentica carefully allocates this budget across prompt components:

| Component | Typical Token Allocation | Priority |
| --- | --- | --- |
| System prompt | 200-400 tokens | Fixed (always included) |
| Task description (I/O pairs) | 500-2000 tokens | Fixed (always included) |
| Parent program source | 200-800 tokens | Fixed (always included) |
| Runtime traces | 500-3000 tokens | Adaptive (compressed as needed) |
| Output diff | 200-500 tokens | High priority |
| Mutation instructions | 100-300 tokens | Fixed |
| Reserved for response | 1000-2000 tokens | Fixed |
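A minimal sketch of how such a budget could be enforced at prompt-assembly time: fixed components are always kept, adaptive ones are added in priority order until the budget runs out. The `assemble_prompt` helper and the chars/4 token estimate are illustrative assumptions, not the repository's tokenizer.

```python
def assemble_prompt(components, budget_tokens=8000,
                    token_len=lambda s: len(s) // 4):
    """Greedy context assembly under a token budget.
    `components` is a list of (text, fixed) pairs in priority order;
    `token_len` is a crude chars/4 token estimate (an assumption)."""
    parts, used = [], 0
    for text, fixed in components:
        cost = token_len(text)
        if fixed or used + cost <= budget_tokens:
            parts.append(text)   # fixed components bypass the budget check
            used += cost
    return "\n".join(parts)
```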

12.2 Trace Compression Strategies

  • Selective snapshots: Only include variable states at loop boundaries and function entry/exit points, not every line.
  • Grid summarization: For large grids, show only the bounding box around changed regions rather than the full grid.
  • Error-focused traces: For programs that crash, prioritize the error message and the last few lines of the traceback over intermediate states.
  • Diff-only mode: For high-fitness programs, show only the cells that differ rather than full grids.
  • Example prioritization: If context is tight, include full traces for failing examples and skip traces for passing examples.
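The "grid summarization" strategy, for example, reduces to computing the bounding box of changed cells (a sketch, assuming same-shaped grids; the helper name is illustrative):

```python
import numpy as np

def changed_bounding_box(before: np.ndarray, after: np.ndarray):
    """Return the (row_slice, col_slice) bounding box of cells that
    changed between two same-shaped grids, or None if nothing changed."""
    diff = before != after
    if not diff.any():
        return None
    rows = np.where(diff.any(axis=1))[0]   # rows containing a change
    cols = np.where(diff.any(axis=0))[0]   # columns containing a change
    return (slice(rows[0], rows[-1] + 1), slice(cols[0], cols[-1] + 1))
```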

12.3 Population Memory

The population itself serves as a form of memory, retaining successful strategies and code patterns across generations. The archive (hall of fame) preserves historically best individuals even if they are replaced in the active population, preventing loss of valuable genetic material.

```python
class PopulationMemory:
    """Long-term memory across evolutionary generations."""

    def __init__(self):
        self.archive: list[Individual] = []  # Best individuals ever seen
        self.strategy_log: list[dict] = []   # Which strategies produced improvements
        self.stagnation_history: list[int] = []  # Track stagnation events

    def update(self, population: list[Individual], generation: int):
        """Archive the current best individual if it improves on the record."""
        best = max(population, key=lambda x: x.fitness)
        if not self.archive or best.fitness > self.archive[-1].fitness:
            self.archive.append(best)
            self.strategy_log.append({
                "generation": generation,
                "fitness": best.fitness,
                "mutation_type": best.mutation_type,
                "parent_fitness": self._get_parent_fitness(best),
            })

    def _get_parent_fitness(self, individual: "Individual") -> float:
        """Fitness of the individual's parent, or 0.0 for seed programs."""
        parent = getattr(individual, "parent", None)
        return parent.fitness if parent is not None else 0.0

    def get_effective_strategies(self) -> list[str]:
        """Return mutation types that historically produced improvements."""
        improvements = [
            s for s in self.strategy_log
            if s["fitness"] > s.get("parent_fitness", 0)
        ]
        return [s["mutation_type"] for s in improvements]
```

13. Continued Learning

13.1 Cross-Task Transfer

Arcgentica does not currently perform cross-task transfer learning within a single run -- each task is solved independently. However, the architecture supports several avenues for continued learning:

  • Solution library: Successful programs from solved tasks can be used as seed programs for similar unseen tasks, bootstrapping the search with known-good code patterns.
  • Strategy meta-learning: Statistics about which mutation types and search strategies were most effective for different task categories can inform strategy selection on new tasks.
  • Prompt refinement: Prompts that led to high-quality mutations can be catalogued and reused, effectively learning a prompt engineering strategy over time.
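The solution-library avenue could be sketched as follows; the feature tags and the one-JSON-file-per-solved-task layout are purely hypothetical assumptions for illustration, not the repository's format:

```python
import json
from pathlib import Path

def load_seed_programs(library_dir: str, task_features: set[str],
                       k: int = 3) -> list[str]:
    """Pick up to k solved-task programs whose recorded feature tags
    (e.g. 'symmetry', 'recolor', 'tiling') overlap the new task's,
    to seed the initial population with known-good code patterns."""
    scored = []
    for path in Path(library_dir).glob("*.json"):
        entry = json.loads(path.read_text())
        overlap = len(task_features & set(entry.get("features", [])))
        if overlap:
            scored.append((overlap, entry["program"]))
    scored.sort(key=lambda t: t[0], reverse=True)  # most-similar first
    return [program for _, program in scored[:k]]
```

Seeding even one population slot with a retrieved program changes the search from cold-start synthesis to warm-start refinement on structurally similar tasks.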

13.2 No Weight Updates

Critically, Arcgentica does not fine-tune the underlying LLM. The LLM weights remain frozen; all "learning" happens through the evolutionary search process and the information content of the prompts. This is a significant advantage for reproducibility and cost, but limits the system's ability to internalize domain-specific knowledge.

13.3 Potential Extensions

| Extension | Description | Expected Impact |
| --- | --- | --- |
| Few-shot exemplar library | Maintain a library of solved-task code to include as few-shot examples in prompts | High -- provides the LLM with proven code patterns |
| DSL induction | Discover a domain-specific language from successful programs, then constrain synthesis to the DSL | Medium -- reduces the search space but may miss novel solutions |
| Fine-tuned mutation model | Fine-tune a smaller LLM specifically on (parent, trace, improved_child) triples | High -- but expensive and reduces generality |
| Retrieval-augmented generation | RAG over solved tasks to find relevant code patterns for new tasks | Medium-High -- practical and low-cost |

14. Applications

14.1 Primary: ARC-AGI-2 Benchmark

The immediate application is solving ARC-AGI-2 tasks, which test abstract reasoning through visual grid transformations. Each task requires discovering a rule from a few examples and applying it to new inputs -- a challenge that tests genuine generalization rather than pattern matching on large datasets.

14.2 Broader Applications

14.2.1 Automated Algorithm Discovery

The runtime-as-context evolutionary approach is directly applicable to discovering novel algorithms. Given a specification of desired input/output behavior and an evaluation function, the system can evolve algorithms for sorting, searching, optimization, and mathematical computation. This is closely related to Google DeepMind's AlphaEvolve, which uses a similar evolutionary LLM approach for mathematical discovery.

14.2.2 Automated Program Repair

The combination of runtime trace capture and LLM-guided mutation is naturally suited to automated program repair (APR). Given a buggy program and failing test cases, the system can:

  1. Execute the program to capture runtime traces at the point of failure
  2. Feed these traces to the LLM with the failing test case
  3. Generate candidate fixes and validate them against all test cases
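The three steps can be sketched as a repair loop; `llm_propose_fix` is a placeholder for whatever model client is used, not an Arcgentica API, and the `solve` entry-point convention is assumed for illustration:

```python
import traceback

def repair(program_src: str, tests: list[tuple], llm_propose_fix,
           max_attempts: int = 5) -> "str | None":
    """Iteratively fix `program_src` until all (input, expected) tests pass.

    `llm_propose_fix(src, trace)` stands in for an LLM call that returns
    a mutated source string given the failing runtime trace.
    """
    src = program_src
    for _ in range(max_attempts):
        trace = None
        for inp, expected in tests:
            try:
                scope = {}
                exec(src, scope)                 # step 1: run the candidate
                result = scope["solve"](inp)
                if result != expected:
                    trace = f"input={inp!r}: got {result!r}, want {expected!r}"
                    break
            except Exception:
                trace = traceback.format_exc()   # capture the failure point
                break
        if trace is None:
            return src                           # step 3: all tests pass
        src = llm_propose_fix(src, trace)        # step 2: trace-informed fix
    return None
```

The key difference from classic generate-and-validate APR is that the mutation operator sees the concrete failure trace, not just a pass/fail bit.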

14.2.3 Code Optimization

The fitness function can be replaced with a performance metric (execution time, memory usage) to evolve more efficient implementations of existing algorithms. The runtime traces provide profiling information that helps the LLM identify bottlenecks.
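As a hedged sketch (not the repository's actual fitness function), a performance-oriented fitness might gate on correctness and then reward measured speedups over a reference implementation:

```python
import timeit

def performance_fitness(candidate, reference, inputs,
                        time_weight: float = 0.1) -> float:
    """Score a candidate implementation: 0.0 if it is ever incorrect,
    otherwise 1.0 plus a bonus proportional to its speedup over the
    reference implementation on the same inputs."""
    for x in inputs:
        if candidate(x) != reference(x):
            return 0.0  # correctness is a hard constraint
    cand_t = timeit.timeit(lambda: [candidate(x) for x in inputs], number=10)
    ref_t = timeit.timeit(lambda: [reference(x) for x in inputs], number=10)
    # ratio > 1 rewards candidates that run faster than the reference
    return 1.0 + time_weight * (ref_t / max(cand_t, 1e-9))
```

Treating correctness as a hard gate rather than a weighted term prevents the search from trading accuracy for speed.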

14.2.4 Scientific Computing

In domains where correct numerical procedures are known but optimal implementations are sought (e.g., PDE solvers, molecular dynamics kernels), Arcgentica could evolve specialized implementations tuned to specific problem characteristics.

14.2.5 Education & Tutoring

The runtime trace mechanism can be used to explain code behavior to students. By showing intermediate states and variable values, the system can generate step-by-step explanations of how a program transforms its input, making it a potential tool for computer science education.

| System | Organization | Key Similarity | Key Difference |
| --- | --- | --- | --- |
| AlphaEvolve | Google DeepMind | Evolutionary LLM-guided code synthesis | Focuses on mathematical algorithm discovery rather than abstract reasoning tasks |
| FunSearch | Google DeepMind | LLM + evolutionary search for programs | Targets combinatorial math problems (cap set, bin packing); uses a simpler fitness signal |
| OpenELM | Various | Evolutionary LLM code generation | More general-purpose; less focus on runtime trace feedback |
| Ryan Greenblatt's ARC approach | Independent | LLM-based ARC solving with iteration | Uses direct prompting with retry rather than population-based evolution |
| BARC (Berkeley ARC) | UC Berkeley | Program synthesis for ARC | Uses DSL-based synthesis rather than free-form Python generation |

15. References

  1. Symbolica AI. Arcgentica: Runtime-as-Context Evolutionary Program Synthesis. GitHub repository, 2025. github.com/symbolica-ai/arcgentica
  2. Symbolica AI. "Runtime as Context." Symbolica AI Blog, 2025. symbolica.ai/blog/runtime-as-context
  3. Chollet, F. "On the Measure of Intelligence." arXiv preprint arXiv:1911.01547, 2019.
  4. Chollet, F. "ARC-AGI: A Benchmark for General Intelligence." ARC Prize Foundation, 2024. arcprize.org
  5. Romera-Paredes, B., et al. "Mathematical discoveries from program search with large language models." Nature, 625, 468-475, 2024. (FunSearch)
  6. Novikov, A., et al. "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms." Google DeepMind Blog, 2025.
  7. Lehman, J., et al. "Evolution through Large Models." arXiv preprint arXiv:2206.08896, 2022.
  8. Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv preprint arXiv:2107.03374, 2021. (Codex/HumanEval)
  9. Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
  10. Le Goues, C., et al. "Automated Program Repair." Communications of the ACM, 62(12), 56-65, 2019.

Technical Report -- Arcgentica by Symbolica AI. Generated for PhD-level research analysis | March 2026

Sources: GitHub Repository | Runtime-as-Context Blog Post | ARC Prize