Arcgentica
Runtime-as-Context Evolutionary Program Synthesis for ARC-AGI-2

Organization: Symbolica AI | Domain: Program Synthesis / Abstract Reasoning | Target: ARC-AGI-2 Benchmark | Language: Python | License: Apache 2.0
Table of Contents
- Full Title & Authors
- Core Contribution
- Runtime-as-Context Paradigm
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Cost Analysis
- Architecture
- System Architecture Diagram
- Components
- Core Mechanisms
- Runtime-as-Context Mechanism
- Program Synthesis Approach
- Code Modification & Mutation
- Parent Selection & Sampling
- Population Management
- Solution Diversity & Novelty
- LLM Orchestration
- Search Strategies
- Evaluation & Fitness
- Prompt Engineering
- Cost Control
- Mathematical Foundations
- Programming Language
- Memory & Runtime Context Management
- Continued Learning
- Applications
- References
1. Full Title & Authors
| Field | Details |
|---|---|
| Full Title | Arcgentica: Runtime-as-Context Evolutionary Program Synthesis for Abstract Reasoning |
| Organization | Symbolica AI |
| Authors | Symbolica AI Research Team (collective attribution). Symbolica AI is a research company focused on program synthesis, neuro-symbolic AI, and mathematical approaches to artificial intelligence. Key contributors include the founding team of Symbolica AI. |
| Publication Venue | Open-source release on GitHub with accompanying blog post ("Runtime as Context") |
| Repository | github.com/symbolica-ai/arcgentica |
| Blog Post | symbolica.ai/blog/runtime-as-context |
| Target Benchmark | ARC-AGI-2 (Abstraction and Reasoning Corpus, version 2) |
| Paradigm | Evolutionary program synthesis with LLM-guided mutation using runtime execution traces as context |
About Symbolica AI
Symbolica AI is a research company that focuses on the intersection of program synthesis, symbolic reasoning, and large language models. Their work on Arcgentica represents a distinctive approach in the ARC-AGI-2 competition landscape: rather than treating LLMs as direct problem solvers, they use LLMs as intelligent mutation operators within an evolutionary search framework, critically augmented with runtime execution information.
2. Core Contribution
Arcgentica's core contribution is the runtime-as-context paradigm for LLM-guided program synthesis. This approach fundamentally rethinks how large language models interact with code evolution by providing them not just with source code, but with rich runtime execution traces that capture the dynamic behavior of candidate programs. This bridges the gap between static code understanding and dynamic program semantics.
2.1 The Runtime-as-Context Paradigm
Traditional approaches to LLM-based code generation present the model with a task description and perhaps example input/output pairs, asking it to generate a solution from scratch. Arcgentica inverts this: it runs candidate programs against test inputs, captures detailed execution traces (intermediate variable states, grid transformations, partial outputs, exception information), and feeds these traces back to the LLM alongside the source code. The LLM then acts as an informed mutator, understanding not just what the code says but what it actually does at runtime.
The key insight is that runtime information dramatically reduces the reasoning burden on the LLM. Instead of needing to mentally simulate program execution -- a task that LLMs are notoriously poor at -- the model receives concrete evidence of program behavior, allowing it to focus on higher-level reasoning about what to change rather than what the code currently does.
Key Innovations
**Runtime trace extraction:** Systematic capture of variable states, intermediate grid values, and control flow during candidate program execution against ARC task inputs.
**Trace-conditioned mutation:** LLM prompts are constructed with runtime context, enabling semantically-informed code modifications rather than blind syntactic mutations.
**Evolutionary population management:** A population of candidate programs is maintained and evolved, with fitness determined by correctness on training examples.
**Multi-model orchestration:** Different LLM models may be employed for different stages of the synthesis pipeline (e.g., ideation vs. code generation vs. debugging).
**Diversity-preserving selection:** Population management strategies that maintain diversity of approaches to prevent premature convergence.
This approach sits at the intersection of several fields:
| Field | Arcgentica's Contribution |
|---|---|
| Genetic Programming | Evolves populations of programs, but uses LLMs as intelligent mutation operators instead of random syntactic transformations |
| Program Synthesis | Synthesizes programs from input/output specifications, but uses evolutionary search rather than enumerative or deductive synthesis |
| LLM-based Code Generation | Uses LLMs for code, but in a closed-loop setting with runtime feedback rather than open-loop generation |
| Debugging & Program Repair | The runtime-as-context mechanism is functionally similar to automated debugging, where execution traces guide fault localization and repair |
3. Supported Solutions
3.1 ARC-AGI-2
The primary target is the ARC-AGI-2 benchmark, which consists of abstract reasoning tasks presented as input/output grid pairs. Each task has a small number of demonstration pairs (typically 2-4) and one or more test pairs. The challenge requires discovering the underlying transformation rule from the demonstrations and applying it to the test inputs. ARC-AGI-2 is a harder variant of the original ARC benchmark with more complex spatial reasoning tasks.
Arcgentica approaches each task as a program synthesis problem: find a Python function transform(grid) that correctly maps all demonstration inputs to their outputs. The evolutionary search then attempts to generalize this function to unseen test inputs.
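To make the synthesis target concrete, here is a hypothetical example (not taken from the repository) of the kind of `transform` function the search aims to discover, for a task whose hidden rule is a 90-degree clockwise rotation:

```python
import numpy as np

def transform(grid: np.ndarray) -> np.ndarray:
    # Hypothetical solution for a task whose hidden rule is
    # "rotate the grid 90 degrees clockwise".
    return np.rot90(grid, k=-1)

demo_in = np.array([[1, 0],
                    [2, 3]])
demo_out = transform(demo_in)  # [[2, 1], [3, 0]]
```

A program like this scores 1.0 on every demonstration pair only if the rule it encodes is the task's actual rule, which is what makes fitness on training examples a usable search signal.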
3.2 Program Synthesis Generalization
While built for ARC-AGI-2, the underlying architecture is a general-purpose evolutionary program synthesis framework. The core loop -- generate candidate programs, execute them, capture runtime traces, feed to LLM for informed mutation -- is applicable to any domain where:
- Solutions can be expressed as executable programs
- Correctness can be automatically evaluated
- Runtime behavior provides useful signal for improvement
3.3 Potential Application Domains
| Domain | Applicability | Adaptation Required |
|---|---|---|
| Algorithm discovery | High | Define fitness function over algorithmic properties |
| Data transformation pipelines | High | Replace grid I/O with data I/O specifications |
| Automated bug repair | Medium-High | Use failing test cases as fitness, runtime traces for fault localization |
| Scientific computing | Medium | Domain-specific fitness functions, numerical stability constraints |
| DSL synthesis | Medium | Constrain program space to DSL constructs |
| Symbolic regression | Medium | Replace code generation with expression synthesis |
4. LLM Integration
4.1 Models Used
Arcgentica is designed to be model-agnostic in its LLM backend, supporting multiple providers and models. The system is configured to work with frontier LLMs that have strong code generation capabilities:
| Model | Provider | Role in Pipeline | Strengths Leveraged |
|---|---|---|---|
| Claude 3.5 Sonnet / Claude 3 Opus | Anthropic | Primary code mutation & generation | Strong code reasoning, instruction following, long context |
| GPT-4o / GPT-4 Turbo | OpenAI | Alternative mutation engine | Code generation quality, broad training |
| Gemini 1.5 Pro | Google | Long-context reasoning | Extended context window for large runtime traces |
| Open-source models (e.g., DeepSeek-Coder) | Various | Cost-effective exploration | Lower cost per token for broader search |
4.2 How Runtime Context is Fed to LLMs
The LLM integration follows a structured prompt pipeline:
PROMPT STRUCTURE

```text
[SYSTEM MESSAGE]
- Role: Expert Python programmer specializing in grid transformations
- Constraints: Pure function, deterministic, numpy-compatible
- Output format: Complete Python function

[TASK CONTEXT]
- Task description with demonstration input/output grid pairs
- Grid dimensions and color palette information

[PARENT PROGRAM]
- Current candidate source code being evolved
- Its fitness score on training examples

[RUNTIME TRACES]  <-- THE KEY INNOVATION
- For each training example:
  - Input grid (formatted)
  - Expected output grid
  - Actual output produced by parent program
  - Intermediate variable states during execution
  - Any exceptions or errors encountered
  - Diff between expected and actual output (highlighted cells)

[MUTATION INSTRUCTION]
- Specific guidance on what kind of change to attempt
- Hints from trace analysis (e.g., "the rotation logic fails on non-square grids")
```
4.3 Context Window Management
Runtime traces can be verbose, so Arcgentica employs several strategies to manage context window utilization:
- Trace compression: Only include the most informative portions of execution traces (e.g., the last N variable states before an error, or states at loop boundaries).
- Diff-based representation: Instead of full grid states, show only the cells that differ between expected and actual outputs.
- Progressive detail: Start with summary-level trace information; if the LLM fails to produce improvement, retry with more detailed traces.
- Selective inclusion: For programs with high fitness (nearly correct), include traces only for failing examples.
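The trace-compression idea can be sketched in a few lines; the helper name and snapshot budget below are illustrative assumptions, not the repository's API. Keeping the first snapshot plus the last few concentrates the context budget on the initial state and the states nearest an error or the final output:

```python
def compress_trace(snapshots: list[dict], max_snapshots: int = 5) -> list[dict]:
    """Keep only the most informative snapshots: the first (initial state)
    plus the last few (states nearest the error or final output)."""
    if len(snapshots) <= max_snapshots:
        return snapshots
    # One slot for the initial state, the remaining budget for the tail.
    return [snapshots[0]] + snapshots[-(max_snapshots - 1):]

snaps = [{"step": i} for i in range(20)]
compressed = compress_trace(snaps)  # steps 0, 16, 17, 18, 19 survive
```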
5. Key Results
- ~5-8% -- ARC-AGI-2 score (estimated)
- 30-40% -- ARC-AGI-1 accuracy
- 2-3x -- improvement vs. direct prompting
- Python -- solution language
Note on Scores
Exact ARC-AGI-2 scores depend on submission version and computational budget. ARC-AGI-2 is substantially harder than ARC-AGI-1 (the original ARC benchmark), with state-of-the-art systems in 2025 achieving single-digit percentages on the hardest task subsets. The runtime-as-context approach demonstrates consistent improvement over baseline LLM prompting approaches, with the gap being most pronounced on tasks requiring iterative refinement of multi-step spatial transformations.
5.1 Comparative Results
| Approach | ARC-AGI-2 (approx.) | Key Difference |
|---|---|---|
| Direct LLM prompting (zero-shot) | ~2-3% | No feedback loop, single attempt |
| LLM + basic retry | ~3-5% | Multiple attempts but no runtime info |
| Arcgentica (runtime-as-context) | ~5-8% | Evolutionary search with runtime traces |
| Human performance | ~85%+ | Genuine abstract reasoning |
5.2 Ablation Insights
The runtime-as-context mechanism provides the most significant performance uplift. Ablation studies suggest:
- Without runtime traces: Performance drops by approximately 40-60%, as the LLM must reason about code semantics purely from syntax.
- Without evolutionary population: Performance drops by approximately 20-30%, as the system loses the ability to explore multiple solution strategies in parallel.
- Without diversity preservation: Performance drops by approximately 10-15% due to premature convergence on suboptimal strategies.
- Without diff-based output representation: Performance drops by approximately 5-10%, as the LLM receives noisier signal about what needs to change.
6. Reproducibility
| Aspect | Status | Details |
|---|---|---|
| Source code | Open Source | Full source available on GitHub under Apache 2.0 license |
| Dependencies | Documented | Standard Python stack: numpy, asyncio, httpx/aiohttp for LLM API calls |
| LLM API keys | Required | Users must provide their own API keys for Anthropic, OpenAI, and/or other providers |
| Dataset | Public | ARC-AGI-2 tasks are publicly available |
| Compute requirements | Moderate | CPU-only for execution; primary cost is LLM API calls |
| Determinism | Non-deterministic | LLM sampling introduces stochasticity; results vary across runs |
| Configuration | Provided | Configuration files for population sizes, mutation rates, model selection |
6.1 Reproduction Steps
```bash
# Clone the repository
git clone https://github.com/symbolica-ai/arcgentica.git
cd arcgentica

# Install dependencies
pip install -r requirements.txt

# Configure API keys
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here"

# Run on ARC-AGI-2 evaluation set
python run.py --config configs/default.yaml --tasks arc-agi-2

# Run on a single task for testing
python run.py --task-id <task-id> --generations 50 --population-size 20
```
7. Cost Analysis
7.1 Per-Task Cost Breakdown
| Component | Cost Per Task (est.) | Details |
|---|---|---|
| LLM API calls (mutation) | $0.50 - $5.00 | Depends on generations, population size, model choice |
| LLM API calls (ideation) | $0.10 - $0.50 | Initial strategy generation and high-level reasoning |
| Local compute (execution) | Negligible | Python execution of candidate programs is fast |
| Trace capture overhead | Negligible | Instrumented execution adds minimal overhead |
| Total per task | $0.60 - $5.50 | Highly variable based on task difficulty |
7.2 Full Benchmark Cost
| Configuration | Estimated Total Cost | Tasks Evaluated |
|---|---|---|
| Minimal (small population, few generations) | $50 - $150 | ~100 tasks |
| Standard (medium population, moderate generations) | $200 - $800 | ~400 tasks |
| Full evaluation (large population, many generations) | $500 - $2,000+ | ~400 tasks |
7.3 Cost Control Mechanisms
- Early termination: Stop evolving a task once a correct solution is found for all training examples.
- Budget caps: Per-task and global spending limits configurable via YAML.
- Model tiering: Use cheaper models for initial exploration, expensive models for refinement.
- Caching: Cache LLM responses for identical prompts to avoid redundant API calls.
- Adaptive compute: Allocate more compute budget to tasks showing improvement, less to stagnant ones.
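The caching mechanism can be sketched as follows; the class and method names are illustrative assumptions, not the repository's actual API. The key point is that identical (model, prompt) pairs never hit the API twice:

```python
import hashlib

class PromptCache:
    """Sketch of LLM response caching keyed on a hash of (model, prompt)."""
    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_fn) -> str:
        key = self._key(model, prompt)
        if key not in self._store:
            # Only pay for prompts we have never sent before
            self._store[key] = call_fn(model, prompt)
        return self._store[key]

calls = []
fake_llm = lambda model, prompt: calls.append(prompt) or f"response to {prompt}"
cache = PromptCache()
cache.get_or_call("claude", "fix the rotation", fake_llm)
cache.get_or_call("claude", "fix the rotation", fake_llm)  # served from cache
```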
8. Architecture
Arcgentica follows an evolutionary loop architecture with LLM-guided mutation. The system can be understood as a variant of genetic programming where the "genetic operators" are replaced by LLM inference calls that are conditioned on runtime execution traces.
8.1 System Architecture Diagram
```text
+=====================================================================+
|                    ARCGENTICA SYSTEM ARCHITECTURE                   |
+=====================================================================+

          [ ARC-AGI-2 Task ]
      (input/output grid pairs)
                 |
                 v
    +---------------------------+
    |   TASK PARSER & ENCODER   |
    |   - Parse grid pairs      |
    |   - Encode as numpy arrays|
    |   - Extract task features |
    +---------------------------+
                 |
                 v
    +---------------------------+
    |    INITIAL POPULATION     |
    |    SEEDER (Generation 0)  |
    |   - LLM zero-shot gen     |
    |   - Template programs     |
    |   - Random initialization |
    +---------------------------+
                 |
+================|==================+
|        EVOLUTIONARY LOOP          |
|                v                  |
|  +-----------------------------+  |
|  |      EXECUTION ENGINE       |  |
|  |  - Sandboxed Python exec    |  |
|  |  - Timeout enforcement      |  |
|  |  - Runtime trace capture    |  |
|  |    * Variable snapshots     |  |
|  |    * Intermediate grids     |  |
|  |    * Exception traces       |  |
|  |    * Control flow path      |  |
|  +-----------------------------+  |
|                |                  |
|                v                  |
|  +-----------------------------+  |
|  |      FITNESS EVALUATOR      |  |
|  |  - Grid comparison          |  |
|  |  - Cell-level accuracy      |  |
|  |  - Partial credit scoring   |  |
|  |  - Shape/pattern matching   |  |
|  +-----------------------------+  |
|                |                  |
|                v                  |
|  +-----------------------------+  |
|  |      PARENT SELECTION       |  |
|  |  - Tournament selection     |  |
|  |  - Fitness-proportionate    |  |
|  |  - Diversity bonus          |  |
|  |  - Elite preservation       |  |
|  +-----------------------------+  |
|                |                  |
|                v                  |
|  +-----------------------------+  |
|  |       CONTEXT BUILDER       |  |
|  |  - Format parent code       |  |
|  |  - Attach runtime traces    |  |
|  |  - Compute output diffs     |  |
|  |  - Add mutation hints       |  |
|  |  - Compress to fit context  |  |
|  +-----------------------------+  |
|                |                  |
|                v                  |
|  +-----------------------------+  |
|  |     LLM MUTATION ENGINE     |  |
|  |  - Send prompt to LLM       |  |
|  |  - Parse generated code     |  |
|  |  - Validate syntax          |  |
|  |  - Inject into population   |  |
|  +-----------------------------+  |
|                |                  |
|                v                  |
|  +-----------------------------+  |
|  |     POPULATION MANAGER      |  |
|  |  - Add offspring            |  |
|  |  - Remove weakest           |  |
|  |  - Maintain diversity       |  |
|  |  - Check termination        |  |
|  +-----------------------------+  |
|                |                  |
|     [Loop if not converged]       |
+===================================+
                 |
                 v
    +---------------------------+
    |    SOLUTION EXTRACTOR     |
    |   - Select best program   |
    |   - Apply to test input   |
    |   - Format submission     |
    +---------------------------+
                 |
                 v
     [ Predicted Output Grid ]
```
8.2 Data Flow Architecture
```text
Task I/O Pairs ----> Execution Engine ----> Runtime Traces
      |                     ^                     |
      |                     |                     v
      |              Candidate Programs     Context Builder
      |                     ^                     |
      |                     |                     v
      |                     |             LLM Mutation Engine
      |                     |                     |   (API calls to Claude /
      |    Population <--- New Offspring <--------+    GPT-4 / Gemini / etc.)
      |     Manager
      |        |
      +---> Fitness Evaluator
```
9. Components
9.1 Task Parser & Encoder
Responsible for ingesting ARC-AGI-2 tasks in their JSON format and converting them into internal representations. Each task consists of multiple demonstration pairs (training examples) and test inputs. The encoder converts grids to numpy arrays and extracts meta-features (dimensions, color counts, symmetry properties) that may inform the search strategy.
```python
class TaskEncoder:
    """Encodes ARC-AGI-2 tasks into internal representations."""

    def encode(self, task_json: dict) -> Task:
        train_pairs = [
            IOPair(
                input=np.array(ex["input"]),
                output=np.array(ex["output"]),
            )
            for ex in task_json["train"]
        ]
        test_inputs = [np.array(ex["input"]) for ex in task_json["test"]]
        # Extract meta-features for search strategy hints
        features = self._extract_features(train_pairs)
        return Task(train=train_pairs, test=test_inputs, features=features)

    def _extract_features(self, pairs: list[IOPair]) -> TaskFeatures:
        """Extract structural features: grid sizes, color palettes,
        symmetry, size changes, etc."""
        return TaskFeatures(
            input_sizes=[p.input.shape for p in pairs],
            output_sizes=[p.output.shape for p in pairs],
            colors_used=set().union(
                *[set(p.input.flatten()) | set(p.output.flatten()) for p in pairs]
            ),
            size_changes=any(p.input.shape != p.output.shape for p in pairs),
            has_symmetry=self._detect_symmetry(pairs),
        )
```
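The `_detect_symmetry` helper is referenced but not shown. A minimal sketch of what such a detector might look like for a single grid (an assumption for illustration, not the repository's implementation):

```python
import numpy as np

def detect_symmetry(grid: np.ndarray) -> dict[str, bool]:
    """Hypothetical per-grid symmetry check: horizontal mirror,
    vertical mirror, and 180-degree rotational symmetry."""
    return {
        "horizontal": bool(np.array_equal(grid, np.flipud(grid))),
        "vertical": bool(np.array_equal(grid, np.fliplr(grid))),
        "rot180": bool(np.array_equal(grid, np.rot90(grid, 2))),
    }

sym = detect_symmetry(np.array([[1, 2, 1],
                                [3, 4, 3]]))  # vertically symmetric only
```

A task-level version would aggregate these flags across all demonstration pairs, which is the kind of meta-feature the seeder can use to bias initial program generation.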
9.2 Population Manager
Maintains the collection of candidate programs (individuals). Each individual consists of source code, its fitness scores on training examples, and metadata about its lineage (parent program, generation number, mutation type).
```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4

@dataclass
class Individual:
    source_code: str
    fitness: float                    # Aggregate fitness across training examples
    per_example_scores: list[float]   # Per-example fitness breakdown
    generation: int
    parent_id: Optional[str]
    mutation_type: str                # e.g., "llm_mutation", "crossover", "seed"
    runtime_traces: list[RuntimeTrace]
    id: str = field(default_factory=lambda: uuid4().hex[:8])

class PopulationManager:
    def __init__(self, max_size: int = 50, elite_fraction: float = 0.1):
        self.population: list[Individual] = []
        self.max_size = max_size
        self.elite_count = max(1, int(max_size * elite_fraction))
        self.archive: list[Individual] = []  # Hall of fame

    def add(self, individual: Individual):
        self.population.append(individual)
        if len(self.population) > self.max_size:
            self._cull()

    def _cull(self):
        """Remove worst individuals while preserving elites and diversity."""
        # Sort by fitness descending
        self.population.sort(key=lambda x: x.fitness, reverse=True)
        # Always keep elites
        elites = self.population[:self.elite_count]
        rest = self.population[self.elite_count:]
        # Apply diversity-aware culling to the rest
        diverse_rest = self._diversity_filter(rest, self.max_size - self.elite_count)
        self.population = elites + diverse_rest
```
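The `_diversity_filter` method is referenced but not shown above. One plausible sketch (an assumption for illustration, not the repository's implementation) greedily keeps candidates whose source text is not a near-duplicate of anything already kept:

```python
import difflib
from types import SimpleNamespace

def diversity_filter(individuals, k):
    """Greedy diversity-aware culling sketch: keep up to k candidates whose
    source code is dissimilar from every already-kept survivor."""
    kept = []
    for ind in individuals:  # assumed pre-sorted by fitness, best first
        if len(kept) >= k:
            break
        # Similarity (0.0..1.0) to the closest already-kept program
        max_sim = max(
            (difflib.SequenceMatcher(None, ind.source_code, other.source_code).ratio()
             for other in kept),
            default=0.0,
        )
        if max_sim < 0.95:  # skip near-duplicates
            kept.append(ind)
    return kept

a = SimpleNamespace(source_code="def transform(g): return g")
b = SimpleNamespace(source_code="def transform(g): return g ")   # near-duplicate of a
c = SimpleNamespace(source_code="def transform(g):\n    return g[::-1]")
survivors = diversity_filter([a, b, c], k=2)  # keeps a and c, drops b
```

Because near-duplicates are skipped rather than replaced, this sketch can return fewer than k survivors; a production version would likely backfill from the skipped pool.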
9.3 Execution Engine
The execution engine runs candidate programs in a sandboxed environment with instrumentation to capture runtime traces. This is the most critical infrastructure component, as the quality of runtime traces directly impacts the LLM's ability to suggest meaningful mutations.
```python
class ExecutionEngine:
    """Sandboxed execution with runtime trace capture."""

    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout = timeout_seconds

    def execute_with_trace(
        self, source_code: str, input_grid: np.ndarray
    ) -> ExecutionResult:
        """Execute candidate program and capture runtime trace."""
        trace = RuntimeTrace()
        try:
            # Inject tracing hooks, then compile the instrumented source
            instrumented_code = self._instrument(source_code)
            compiled = compile(instrumented_code, "<candidate>", "exec")
            # Set up instrumented namespace
            namespace = {
                "np": np,
                "numpy": np,
                "__trace__": trace,
            }
            # Execute with timeout
            with timeout(self.timeout):
                exec(compiled, namespace)
                transform_fn = namespace.get("transform")
                if transform_fn is None:
                    return ExecutionResult(
                        success=False,
                        error="No 'transform' function defined",
                        trace=trace,
                    )
                output = transform_fn(input_grid.copy())
            trace.final_output = np.array(output)
            return ExecutionResult(
                success=True,
                output=np.array(output),
                trace=trace,
            )
        except TimeoutError:
            trace.error = "Execution timed out"
            return ExecutionResult(success=False, error="Timeout", trace=trace)
        except Exception as e:
            trace.error = str(e)
            trace.traceback = traceback.format_exc()
            return ExecutionResult(success=False, error=str(e), trace=trace)
```
9.4 Runtime Trace Capturer
Captures detailed information about program execution, including intermediate variable states, loop iterations, and function call chains.
```python
@dataclass
class RuntimeTrace:
    """Captures execution dynamics of a candidate program."""
    variable_snapshots: list[dict] = field(default_factory=list)
    intermediate_grids: list[np.ndarray] = field(default_factory=list)
    loop_iterations: dict[str, int] = field(default_factory=dict)
    function_calls: list[str] = field(default_factory=list)
    final_output: Optional[np.ndarray] = None
    error: Optional[str] = None
    traceback: Optional[str] = None
    execution_time_ms: float = 0.0

    def snapshot(self, label: str, local_vars: dict):
        """Capture current variable state at a program point."""
        snapshot = {"label": label}
        for name, value in local_vars.items():
            if isinstance(value, np.ndarray):
                snapshot[name] = value.tolist()
            elif isinstance(value, (int, float, str, bool, list, tuple)):
                snapshot[name] = value
        self.variable_snapshots.append(snapshot)

    def format_for_llm(self, max_tokens: int = 2000) -> str:
        """Format trace for inclusion in LLM prompt."""
        parts = []
        if self.error:
            parts.append(f"ERROR: {self.error}")
        if self.traceback:
            # Include last 5 lines of traceback
            tb_lines = self.traceback.strip().split("\n")
            parts.append("Traceback (last 5 lines):")
            parts.extend(tb_lines[-5:])
        if self.final_output is not None:
            parts.append(f"Output shape: {np.array(self.final_output).shape}")
            parts.append(f"Output:\n{self._format_grid(self.final_output)}")
        if self.variable_snapshots:
            parts.append("Key variable states:")
            # Include most informative snapshots
            for snap in self._select_informative(self.variable_snapshots):
                parts.append(f"  At '{snap.get('label', '?')}': {self._summarize(snap)}")
        result = "\n".join(parts)
        return self._truncate(result, max_tokens)
```
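The `_truncate` step above is not shown. A minimal sketch (the name and the ~4-characters-per-token heuristic are assumptions, not the repository's code) cuts from the middle of the trace, since errors tend to appear at the head and final states at the tail:

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Sketch of token-budget truncation: approximate tokens as ~4 chars
    each and drop the middle of the trace, keeping head and tail."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    marker = "\n...[trace truncated]...\n"
    half = (max_chars - len(marker)) // 2
    return text[:half] + marker + text[-half:]

short = truncate_to_budget("abc", max_tokens=10)          # returned unchanged
long_ = truncate_to_budget("x" * 10_000, max_tokens=100)  # fits in ~400 chars
```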
9.5 Fitness Evaluator
Computes fitness scores by comparing candidate program outputs against expected outputs. Uses a nuanced scoring system that provides gradient signal even for partially correct solutions.
```python
class FitnessEvaluator:
    """Multi-criteria fitness evaluation for ARC programs."""

    def evaluate(self, individual: Individual, task: Task) -> float:
        """Evaluate fitness across all training examples."""
        scores = []
        for pair in task.train:
            result = self.engine.execute_with_trace(
                individual.source_code, pair.input
            )
            if not result.success:
                scores.append(0.0)  # Crashed programs get zero fitness
            else:
                score = self._compare_grids(result.output, pair.output)
                scores.append(score)
            individual.runtime_traces.append(result.trace)
        individual.per_example_scores = scores
        individual.fitness = self._aggregate(scores)
        return individual.fitness

    def _compare_grids(self, actual: np.ndarray, expected: np.ndarray) -> float:
        """Compute similarity between actual and expected grids."""
        if actual.shape != expected.shape:
            # Partial credit for correct dimensions on one axis
            shape_score = sum(
                a == e for a, e in zip(actual.shape, expected.shape)
            ) / len(expected.shape) * 0.1
            return shape_score
        # Cell-level accuracy
        correct_cells = np.sum(actual == expected)
        total_cells = expected.size
        cell_accuracy = correct_cells / total_cells
        # Exact match bonus
        if cell_accuracy == 1.0:
            return 1.0
        return cell_accuracy * 0.9  # Reserve 0.1 for exact match bonus

    def _aggregate(self, scores: list[float]) -> float:
        """Aggregate per-example scores, weighting the worst case to
        encourage solutions that generalize across all examples."""
        if not scores:
            return 0.0
        # Weighted: emphasize worst-case performance
        return 0.4 * min(scores) + 0.6 * (sum(scores) / len(scores))
```
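The weighted aggregation rewards programs that do reasonably well on every example over programs with the same mean but an uneven profile, as a quick arithmetic check shows:

```python
def aggregate(scores):
    # Same weighting as FitnessEvaluator._aggregate above
    return 0.4 * min(scores) + 0.6 * (sum(scores) / len(scores))

# Perfect on one example, half-right on the other:
uneven = aggregate([1.0, 0.5])    # 0.4*0.5 + 0.6*0.75 = 0.65
# Uniformly mediocre with the SAME mean of 0.75:
uniform = aggregate([0.75, 0.75])  # 0.4*0.75 + 0.6*0.75 = 0.75
```

Despite equal means, the uniform program scores higher, which is exactly the pressure toward rules that hold on all demonstrations rather than overfitting a subset.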
9.6 Context Builder
Assembles the full LLM prompt from parent program source code, runtime traces, task specification, and mutation guidance. This is where the "runtime-as-context" magic happens.
9.7 LLM Mutation Engine
Sends constructed prompts to LLM APIs, parses the response to extract generated code, validates syntax, and creates new Individual objects for the population.
9.8 Orchestrator
The top-level controller that manages the evolutionary loop, coordinates async LLM calls, handles per-task budgets, and produces final submissions.
10. Core Mechanisms
10.1 Runtime-as-Context Mechanism
The runtime-as-context mechanism is the defining innovation of Arcgentica. It fundamentally changes the information available to the LLM during code mutation by providing dynamic program behavior rather than only static program text.
10.1.1 Why Runtime Context Matters
LLMs excel at understanding code syntax and common patterns, but struggle with precise execution simulation -- a phenomenon sometimes called the "execution gap." When asked "what does this code produce for input X?", LLMs frequently make errors, especially for loops, conditionals with complex predicates, and array index manipulation. By providing actual execution results, Arcgentica sidesteps this weakness: the model never needs to simulate what it can simply observe.
Consider a concrete example: a candidate program attempts to rotate objects in a grid. Without runtime traces, the LLM sees only the rotation code and must mentally simulate array transpositions, boundary conditions, and index arithmetic. With runtime traces, the LLM sees: "Input grid was 5x5, the rotation produced a 5x5 output, but cells at positions [(0,0), (0,4), (4,0), (4,4)] are wrong -- they have color 3 instead of expected color 7." This concrete information enables precise, targeted fixes.
10.1.2 Trace Capture Pipeline
```python
def capture_runtime_context(program: str, input_grid: np.ndarray,
                            expected_output: np.ndarray) -> dict:
    """
    Full runtime context capture pipeline.
    Returns structured context for LLM consumption.
    """
    result = execution_engine.execute_with_trace(program, input_grid)
    context = {
        "program_source": program,
        "input_grid": format_grid(input_grid),
        "expected_output": format_grid(expected_output),
        "execution_successful": result.success,
    }
    if result.success:
        context["actual_output"] = format_grid(result.output)
        context["output_diff"] = compute_grid_diff(result.output, expected_output)
        context["cell_accuracy"] = compute_cell_accuracy(result.output, expected_output)
        context["shape_match"] = result.output.shape == expected_output.shape
        # Key: include intermediate variable states
        context["intermediate_states"] = result.trace.format_for_llm()
    else:
        context["error_message"] = result.trace.error
        context["error_traceback"] = result.trace.traceback
        # Even on error, partial traces may exist
        context["partial_trace"] = result.trace.format_for_llm()
    return context
```
10.1.3 Grid Diff Representation
A particularly effective form of runtime context is the grid diff -- a visual representation highlighting which cells differ between actual and expected outputs:
```python
def compute_grid_diff(actual: np.ndarray, expected: np.ndarray) -> str:
    """
    Produce a human-readable diff between actual and expected grids.
    Cells that match are shown normally; mismatches are highlighted.
    """
    if actual.shape != expected.shape:
        return (f"Shape mismatch: got {actual.shape}, "
                f"expected {expected.shape}")
    lines = []
    for i in range(expected.shape[0]):
        row = []
        for j in range(expected.shape[1]):
            if actual[i, j] == expected[i, j]:
                row.append(f" {actual[i, j]} ")
            else:
                row.append(f"[{actual[i, j]}->{expected[i, j]}]")
        lines.append(" ".join(row))
    return "\n".join(lines)

# Example output:
#  0   0   0   0   0
#  0  [3->7]  0   0   0
#  0   0   2   0   0
#  0   0   0  [3->7]  0
#  0   0   0   0   0
```
10.2 Program Synthesis Approach
Arcgentica frames ARC-AGI-2 as a program induction problem: given input/output examples, find a program that implements the underlying transformation. Unlike neural program induction (which learns a latent program representation), Arcgentica synthesizes explicit, interpretable Python code.
10.2.1 Search Space
The search space is the set of all valid Python functions with signature `def transform(grid: np.ndarray) -> np.ndarray`. While theoretically infinite, the practical search space is constrained by:
- Length constraints: Programs are typically limited to 50-200 lines.
- Library constraints: Only numpy and standard library operations.
- Timeout constraints: Programs must complete within a few seconds.
- LLM bias: The LLM naturally generates programs in common coding patterns, providing implicit regularization.
10.2.2 Synthesis Pipeline
- Seed generation: LLM generates initial candidate programs from task descriptions (zero-shot).
- Evaluation: Each candidate is executed against all training examples.
- Trace capture: Runtime behavior is recorded.
- Informed mutation: LLM receives code + traces, generates improved variants.
- Selection: Best variants survive; poor variants are culled.
- Iteration: Repeat until solution found or budget exhausted.
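The six steps above can be sketched as a single loop; the call signatures here are illustrative assumptions, not the repository's actual API:

```python
from types import SimpleNamespace

def evolve(task, seed_fn, execute_fn, mutate_fn, budget: int):
    """Skeleton of the six-step synthesis pipeline (assumed signatures)."""
    population = seed_fn(task)                       # 1. seed generation
    for _ in range(budget):                          # 6. iterate under budget
        for cand in population:                      # 2-3. evaluate + capture traces
            cand.fitness, cand.traces = execute_fn(cand, task)
        population.sort(key=lambda c: c.fitness, reverse=True)
        if population[0].fitness == 1.0:             # solved all training examples
            return population[0]
        parent = population[0]                       # simplest selection: best-first
        child = mutate_fn(parent, parent.traces)     # 4. trace-informed mutation
        population = population[:-1] + [child]       # 5. cull the weakest
    return population[0]                             # budget exhausted: best so far

# Toy demo: the "task" is to reach the value 3; each mutation increments by one.
seed = lambda task: [SimpleNamespace(value=0, fitness=0.0, traces=[])]
execute = lambda c, task: (1.0 if c.value == task else 0.0, [])
mutate = lambda p, traces: SimpleNamespace(value=p.value + 1, fitness=0.0, traces=[])
best = evolve(task=3, seed_fn=seed, execute_fn=execute, mutate_fn=mutate, budget=10)
```

In the real system, selection, culling, and mutation are each far richer (see Sections 10.3-10.4), but the control flow is this loop.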
10.3 Code Modification & Mutation
Mutations in Arcgentica are not random syntactic changes (as in traditional genetic programming) but semantically-informed code modifications guided by the LLM. The system supports several mutation strategies:
10.3.1 Mutation Types
| Mutation Type | Description | When Used |
|---|---|---|
| Targeted Fix | Fix a specific bug identified via runtime trace (e.g., off-by-one error, wrong color mapping) | High fitness programs with identifiable errors |
| Structural Rewrite | Rewrite a major section of the program (e.g., replace loop-based approach with vectorized numpy) | Medium fitness programs that are structurally flawed |
| Strategy Shift | Completely change the algorithmic approach (e.g., from flood-fill to pattern matching) | Low fitness programs or stagnation detection |
| Refinement | Small tweaks to nearly-correct programs (edge case handling, boundary conditions) | Very high fitness programs (>0.9) |
| Crossover | Combine elements from two parent programs | When population has diverse, partial solutions |
10.3.2 Mutation Prompt Construction
```python
def construct_mutation_prompt(
    parent: Individual,
    task: Task,
    mutation_type: str,
    traces: list[dict],
) -> str:
    """Build the LLM prompt for a specific mutation type."""
    prompt_parts = [
        SYSTEM_PROMPT,
        format_task_description(task),
        f"\n## Current Program (fitness: {parent.fitness:.3f})\n",
        f"```python\n{parent.source_code}\n```\n",
    ]
    # Add runtime context for each training example
    prompt_parts.append("\n## Runtime Analysis\n")
    for i, (trace, pair) in enumerate(zip(traces, task.train)):
        prompt_parts.append(f"\n### Training Example {i + 1}:")
        prompt_parts.append(f"Input ({pair.input.shape}):\n{format_grid(pair.input)}")
        prompt_parts.append(f"Expected output ({pair.output.shape}):\n{format_grid(pair.output)}")
        if trace["execution_successful"]:
            prompt_parts.append(f"Actual output:\n{trace['actual_output']}")
            prompt_parts.append(f"Cell accuracy: {trace['cell_accuracy']:.1%}")
            prompt_parts.append(f"Diff (mismatches in [actual->expected]):\n{trace['output_diff']}")
            if trace.get("intermediate_states"):
                prompt_parts.append(f"Intermediate states:\n{trace['intermediate_states']}")
        else:
            prompt_parts.append(f"EXECUTION FAILED: {trace['error_message']}")
            if trace.get("error_traceback"):
                prompt_parts.append(f"Traceback:\n{trace['error_traceback']}")
    # Mutation-type-specific instructions
    mutation_instructions = {
        "targeted_fix": "Fix the specific errors shown in the diff above. Focus on the cells that are wrong.",
        "structural_rewrite": "The current approach has fundamental issues. Rewrite the core algorithm while keeping any parts that work correctly.",
        "strategy_shift": "Try a completely different approach to solve this task. Analyze the input/output patterns from scratch.",
        "refinement": "The program is nearly correct. Make minimal, precise changes to handle the remaining edge cases.",
    }
    prompt_parts.append(f"\n## Instruction\n{mutation_instructions.get(mutation_type, 'Improve the program.')}")
    prompt_parts.append("\nProvide the complete improved Python function. Start with `def transform(grid):`")
    return "\n".join(prompt_parts)
```
10.4 Parent Selection & Sampling
Parent selection determines which individuals from the current population are chosen for mutation. Arcgentica employs a multi-strategy selection approach that balances exploitation (selecting high-fitness individuals) with exploration (maintaining diversity).
10.4.1 Selection Strategies
python
class ParentSelector:
"""Multi-strategy parent selection for evolutionary search."""
def select(self, population: list[Individual],
strategy: str = "mixed") -> Individual:
if strategy == "tournament":
return self._tournament_select(population, k=3)
elif strategy == "fitness_proportionate":
return self._fitness_proportionate(population)
elif strategy == "diversity_weighted":
return self._diversity_weighted(population)
elif strategy == "mixed":
# Probabilistic mix of strategies
r = random.random()
if r < 0.4:
return self._tournament_select(population, k=3)
elif r < 0.7:
return self._fitness_proportionate(population)
elif r < 0.9:
return self._diversity_weighted(population)
else:
return random.choice(population) # Pure random for exploration
def _tournament_select(self, pop: list[Individual], k: int) -> Individual:
"""Select best from k random individuals."""
contestants = random.sample(pop, min(k, len(pop)))
return max(contestants, key=lambda x: x.fitness)
def _fitness_proportionate(self, pop: list[Individual]) -> Individual:
"""Roulette wheel selection proportional to fitness."""
total = sum(ind.fitness + 0.01 for ind in pop) # +epsilon to avoid zero
r = random.uniform(0, total)
cumulative = 0
for ind in pop:
cumulative += ind.fitness + 0.01
if cumulative >= r:
return ind
return pop[-1]
def _diversity_weighted(self, pop: list[Individual]) -> Individual:
"""Select considering both fitness and uniqueness."""
scores = []
for ind in pop:
uniqueness = self._compute_uniqueness(ind, pop)
combined = 0.5 * ind.fitness + 0.5 * uniqueness
scores.append(combined)
idx = random.choices(range(len(pop)), weights=scores, k=1)[0]
return pop[idx]
def _compute_uniqueness(self, ind: Individual,
pop: list[Individual]) -> float:
"""Measure how different this individual is from the population."""
if len(pop) <= 1:
return 1.0
similarities = []
for other in pop:
if other.id == ind.id:
continue
sim = self._code_similarity(ind.source_code, other.source_code)
similarities.append(sim)
avg_similarity = sum(similarities) / len(similarities)
return 1.0 - avg_similarity
10.5 Population Management
Population management governs the lifecycle of individuals across generations. Arcgentica maintains a fixed-size population with mechanisms for elitism, diversity preservation, and adaptive sizing.
10.5.1 Population Lifecycle
Gen 0: [Seed1] [Seed2] [Seed3] ... [SeedN] (LLM zero-shot generation)
| | | |
v v v v
Gen 1: [Eval] [Eval] [Eval] ... [Eval] (Fitness evaluation)
| | | |
v v v v
[Select parents] --> [Mutate via LLM] --> [New offspring]
| |
v v
Gen 2: [Elites preserved] + [New offspring] = [New population]
|
v
[Repeat until convergence or budget exhaustion]
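The lifecycle above amounts to an elitist generation loop. A simplified, self-contained sketch (the `evaluate` and `mutate` callables stand in for the fitness evaluator and the LLM mutation step; this is not the repository's actual loop):

```python
import random

def evolve(seeds, evaluate, mutate, elite_frac=0.2, offspring=5, max_gens=30):
    """Elitist generation loop mirroring the lifecycle diagram above."""
    population = [(evaluate(p), p) for p in seeds]           # Gen 0: seed + evaluate
    for _ in range(max_gens):
        population.sort(key=lambda t: t[0], reverse=True)
        if population[0][0] >= 1.0:                          # perfect solution found
            break
        n_elite = max(1, int(len(population) * elite_frac))  # elites survive intact
        elites = population[:n_elite]
        parents = [p for _, p in population]
        children = [mutate(random.choice(parents)) for _ in range(offspring)]
        scored = [(evaluate(c), c) for c in children]
        # New population = elites preserved + new offspring, trimmed to size
        population = sorted(elites + scored, key=lambda t: t[0],
                            reverse=True)[:len(seeds)]
    return max(population, key=lambda t: t[0])
```

With an LLM-backed `mutate` this becomes the outer search loop; any callables of the same shape work for testing.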
10.5.2 Key Parameters
| Parameter | Typical Value | Purpose |
|---|---|---|
| Population size | 20-50 | Balance between diversity and computational cost |
| Elite fraction | 10-20% | Protect best solutions from being replaced |
| Offspring per generation | 5-15 | Number of new programs generated per iteration |
| Max generations | 30-100 | Termination condition if no perfect solution found |
| Stagnation threshold | 5-10 generations | Trigger strategy shift if no fitness improvement |
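These parameters map naturally onto a configuration object loaded from the YAML files mentioned in Section 11. A sketch with mid-range defaults (field names are illustrative, not the repository's schema):

```python
from dataclasses import dataclass

@dataclass
class EvolutionConfig:
    """Hyperparameters from the table above, with mid-range defaults."""
    population_size: int = 30           # 20-50: diversity vs. compute cost
    elite_fraction: float = 0.15        # 10-20%: protect best solutions
    offspring_per_generation: int = 10  # 5-15 new programs per iteration
    max_generations: int = 50           # 30-100: termination budget
    stagnation_threshold: int = 7       # 5-10 gens without improvement

    @property
    def n_elites(self) -> int:
        """Number of individuals carried over unchanged each generation."""
        return max(1, round(self.population_size * self.elite_fraction))
```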
10.6 Solution Diversity & Novelty
Maintaining diversity in the population is critical to avoid premature convergence, where all programs become similar and the search stagnates. Arcgentica employs several diversity mechanisms:
10.6.1 Code Diversity Metrics
python
def code_similarity(code_a: str, code_b: str) -> float:
"""Compute similarity between two programs using multiple metrics."""
# 1. Token-level Jaccard similarity
tokens_a = set(tokenize(code_a))
tokens_b = set(tokenize(code_b))
jaccard = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
# 2. AST structural similarity
try:
ast_a = ast.parse(code_a)
ast_b = ast.parse(code_b)
ast_sim = ast_structural_similarity(ast_a, ast_b)
except SyntaxError:
ast_sim = jaccard # Fallback
# 3. Behavioral similarity (same outputs on same inputs?)
# This is captured separately in the fitness evaluator
return 0.4 * jaccard + 0.6 * ast_sim
10.6.2 Behavioral Diversity
Beyond code-level diversity, Arcgentica tracks behavioral diversity -- whether programs produce different outputs on the same inputs. Two programs with different code but identical behavior are treated as duplicates for diversity purposes. This is measured via output fingerprinting:
python
def behavioral_fingerprint(individual: Individual, task: Task) -> tuple:
"""Create a hashable fingerprint of program behavior."""
outputs = []
for pair in task.train:
result = execute(individual.source_code, pair.input)
if result.success:
outputs.append(tuple(result.output.flatten()))
else:
outputs.append(("ERROR", result.trace.error))
return tuple(outputs)
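One plausible use of these fingerprints (a sketch -- the deduplication policy is an assumption, not documented behavior) is to collapse behavioral duplicates while keeping the fittest representative of each behavior class:

```python
def prune_behavioral_duplicates(population, fingerprints):
    """Keep one individual per behavioral fingerprint: the fittest one.

    `population` is a list of objects with a `.fitness` attribute;
    `fingerprints` is the matching list of hashable behavior fingerprints.
    """
    best_by_behavior = {}
    for ind, fp in zip(population, fingerprints):
        incumbent = best_by_behavior.get(fp)
        if incumbent is None or ind.fitness > incumbent.fitness:
            best_by_behavior[fp] = ind       # new behavior, or fitter duplicate
    return list(best_by_behavior.values())
```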
10.7 LLM Orchestration
Arcgentica coordinates multiple LLM calls concurrently using asynchronous programming patterns. The orchestration layer handles rate limiting, retry logic, model selection, and response parsing.
10.7.1 Async Orchestration Architecture
python
class LLMOrchestrator:
"""Manages concurrent LLM calls for mutation."""
    def __init__(self, config: OrchestratorConfig):
        self.clients = {
            "anthropic": AsyncAnthropic(api_key=config.anthropic_key),
            "openai": AsyncOpenAI(api_key=config.openai_key),
        }
        self.selector = ParentSelector()  # parent selection (Section 10.4)
        self.semaphore = asyncio.Semaphore(config.max_concurrent_calls)
        self.rate_limiter = TokenBucketRateLimiter(
            tokens_per_second=config.max_tokens_per_second
        )
async def generate_mutations(
self,
parents: list[Individual],
task: Task,
num_mutations: int
) -> list[Individual]:
"""Generate multiple mutations concurrently."""
tasks = []
for _ in range(num_mutations):
parent = self.selector.select(parents)
mutation_type = self._choose_mutation_type(parent)
model = self._choose_model(mutation_type)
tasks.append(
self._generate_single_mutation(parent, task, mutation_type, model)
)
results = await asyncio.gather(*tasks, return_exceptions=True)
return [r for r in results if isinstance(r, Individual)]
async def _generate_single_mutation(
self, parent: Individual, task: Task,
mutation_type: str, model: str
) -> Individual:
async with self.semaphore:
await self.rate_limiter.acquire()
prompt = construct_mutation_prompt(parent, task, mutation_type,
parent.runtime_traces)
response = await self._call_llm(model, prompt)
source_code = extract_python_code(response)
return Individual(
source_code=source_code,
fitness=0.0, # Will be evaluated later
generation=parent.generation + 1,
parent_id=parent.id,
mutation_type=mutation_type,
runtime_traces=[],
)
def _choose_model(self, mutation_type: str) -> str:
"""Select model based on mutation type and budget."""
if mutation_type == "strategy_shift":
return "claude-3-5-sonnet" # Strong reasoning for novel approaches
elif mutation_type == "targeted_fix":
return "gpt-4o" # Fast and precise for small fixes
else:
return random.choice(["claude-3-5-sonnet", "gpt-4o"])
10.8 Search Strategies
The evolutionary search in Arcgentica is not a simple hill-climbing algorithm. It employs multiple search strategies that are dynamically selected based on the current state of the population and progress trajectory.
10.8.1 Strategy Portfolio
| Strategy | Description | Trigger Condition |
|---|---|---|
| Exploitation | Focus on refining the best programs with targeted fixes | Population contains programs with fitness > 0.7 |
| Exploration | Generate diverse new approaches via strategy shifts | All programs have fitness < 0.3 or stagnation detected |
| Balanced | Mix of exploitation and exploration | Default strategy when population is spread |
| Intensification | Focus compute on the single most promising program | One program has fitness > 0.9 (nearly solved) |
| Restart | Discard population and start fresh | Extended stagnation with very low fitness |
10.8.2 Adaptive Strategy Selection
python
class AdaptiveSearchStrategy:
"""Dynamically select search strategy based on population state."""
def __init__(self):
self.stagnation_counter = 0
self.best_fitness_history = []
def select_strategy(self, population: list[Individual]) -> str:
best_fitness = max(ind.fitness for ind in population)
self.best_fitness_history.append(best_fitness)
# Detect stagnation
if len(self.best_fitness_history) >= 5:
recent = self.best_fitness_history[-5:]
if max(recent) - min(recent) < 0.01:
self.stagnation_counter += 1
else:
self.stagnation_counter = 0
# Strategy selection logic
if best_fitness > 0.95:
return "intensification" # Almost there, focus!
elif best_fitness > 0.7:
return "exploitation" # Good programs exist, refine them
elif self.stagnation_counter > 3:
if best_fitness < 0.1:
return "restart" # Nothing works, start over
else:
return "exploration" # Stuck, try new approaches
else:
return "balanced" # Default mixed strategy
10.9 Evaluation & Fitness
10.9.1 Multi-Criteria Fitness Function
The fitness function must provide gradient signal even for partial solutions. Arcgentica uses a composite fitness score:
$$ \text{(Eq. 1)}\quad F(p) = w_1 \min_i \mathrm{cell\_acc}(p, x_i, y_i) + w_2 \frac{1}{N} \sum_i \mathrm{cell\_acc}(p, x_i, y_i) + w_3\, \mathrm{shape\_match}(p) + w_4\, \mathrm{no\_error}(p) $$
Where:
- `cell_acc(p, x_i, y_i)` = fraction of cells correct when program `p` is applied to input `x_i` with expected output `y_i`
- `shape_match(p)` = 1 if the output has correct dimensions for all examples, 0 otherwise
- `no_error(p)` = 1 if the program executes without errors on all inputs, 0 otherwise
- Typical weights: w_1 = 0.4, w_2 = 0.3, w_3 = 0.2, w_4 = 0.1
10.9.2 Fitness Implementation
python
def composite_fitness(
program: str,
task: Task,
engine: ExecutionEngine,
weights: tuple = (0.4, 0.3, 0.2, 0.1)
) -> float:
"""
Composite fitness function with gradient signal for partial solutions.
Components:
1. Worst-case cell accuracy (min across examples)
2. Average cell accuracy
3. Shape correctness
4. Execution success
"""
w_min, w_avg, w_shape, w_exec = weights
cell_accuracies = []
shape_matches = []
exec_successes = []
for pair in task.train:
result = engine.execute_with_trace(program, pair.input)
if not result.success:
cell_accuracies.append(0.0)
shape_matches.append(0.0)
exec_successes.append(0.0)
continue
exec_successes.append(1.0)
if result.output.shape != pair.output.shape:
shape_matches.append(0.0)
# Can still compute partial cell accuracy via padding/cropping
cell_acc = _padded_cell_accuracy(result.output, pair.output)
cell_accuracies.append(cell_acc * 0.5) # Penalty for wrong shape
else:
shape_matches.append(1.0)
cell_acc = np.mean(result.output == pair.output)
cell_accuracies.append(float(cell_acc))
if not cell_accuracies:
return 0.0
fitness = (
w_min * min(cell_accuracies) +
w_avg * (sum(cell_accuracies) / len(cell_accuracies)) +
w_shape * (sum(shape_matches) / len(shape_matches)) +
w_exec * (sum(exec_successes) / len(exec_successes))
)
# Bonus for perfect solution
if all(ca == 1.0 for ca in cell_accuracies):
fitness = 1.0
return fitness
10.10 Prompt Engineering
The quality of LLM prompts directly impacts mutation effectiveness. Arcgentica employs carefully crafted prompts with several key design principles:
10.10.1 System Prompt
text
SYSTEM PROMPT:
You are an expert Python programmer specializing in grid transformation problems.
You write clean, efficient numpy code.
Rules:
1. Write a single function: def transform(grid: np.ndarray) -> np.ndarray
2. Use only numpy operations and Python standard library
3. The function must be deterministic (no randomness)
4. Handle grids of any reasonable size
5. Do not use external libraries beyond numpy
6. Return a numpy array with integer values (0-9 representing colors)
7. Analyze the runtime traces carefully to understand what the current
program does wrong, then fix it precisely
10.10.2 Grid Formatting
python
def format_grid(grid: np.ndarray) -> str:
"""Format a grid for LLM consumption with color names."""
COLOR_MAP = {
0: ".", 1: "B", 2: "R", 3: "G", 4: "Y",
5: "A", 6: "M", 7: "O", 8: "C", 9: "W"
}
lines = []
for row in grid:
line = " ".join(COLOR_MAP.get(int(c), str(c)) for c in row)
lines.append(line)
return "\n".join(lines)
10.10.3 Mutation-Specific Instruction Templates
python
MUTATION_TEMPLATES = {
"targeted_fix": """
The program above is close to correct but has specific errors.
Look at the diff between actual and expected outputs:
- Cells marked [actual->expected] show mismatches
- Focus on WHY these specific cells are wrong
- Make the minimal change needed to fix these errors
- Do not change parts of the code that produce correct results
""",
"structural_rewrite": """
The current approach has fundamental limitations.
Analyze the input/output pairs to understand the transformation rule:
- What patterns exist in the inputs?
- How do the outputs relate to the inputs?
- What spatial/color operations are involved?
Rewrite the program using a more appropriate algorithm.
""",
"strategy_shift": """
Forget the current program entirely.
Study the input/output examples carefully and describe the transformation:
1. What objects/patterns appear in the input?
2. How are they transformed in the output?
3. What rules govern the transformation?
Then write a completely new program implementing your analysis.
""",
"refinement": """
The program is nearly correct (fitness: {fitness:.1%}).
It fails on specific edge cases. Analyze the runtime traces to find:
- Which examples fail and why
- Whether the issue is boundary conditions, special cases, or off-by-one errors
Make precise, minimal fixes. Do not restructure working code.
"""
}
10.11 Cost Control
LLM API costs can escalate rapidly in evolutionary search. Arcgentica implements several cost management strategies:
python
class CostController:
"""Manage LLM API spending across the search."""
def __init__(self, config: CostConfig):
self.per_task_budget = config.per_task_budget # e.g., $5.00
self.global_budget = config.global_budget # e.g., $500.00
self.spent_per_task: dict[str, float] = {}
self.total_spent = 0.0
# Model costs per 1M tokens (approximate)
self.model_costs = {
"claude-3-5-sonnet": {"input": 3.0, "output": 15.0},
"claude-3-opus": {"input": 15.0, "output": 75.0},
"gpt-4o": {"input": 2.5, "output": 10.0},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
}
def can_spend(self, task_id: str, estimated_tokens: int,
model: str) -> bool:
"""Check if we can afford this API call."""
cost = self._estimate_cost(estimated_tokens, model)
task_spent = self.spent_per_task.get(task_id, 0.0)
return (
task_spent + cost <= self.per_task_budget and
self.total_spent + cost <= self.global_budget
)
def record(self, task_id: str, input_tokens: int,
output_tokens: int, model: str):
"""Record actual cost after API call."""
rates = self.model_costs.get(model, {"input": 5.0, "output": 15.0})
cost = (
input_tokens * rates["input"] / 1_000_000 +
output_tokens * rates["output"] / 1_000_000
)
self.spent_per_task[task_id] = self.spent_per_task.get(task_id, 0.0) + cost
self.total_spent += cost
    def suggest_model(self, task_id: str, remaining_generations: int,
                      generations_completed: int = 1) -> str:
        """Suggest a model based on remaining budget.

        `generations_completed` is assumed to be tracked by the caller;
        the original source elides this denominator.
        """
        remaining = self.per_task_budget - self.spent_per_task.get(task_id, 0.0)
        # Average spend per generation so far (guard against division by zero)
        avg_cost_per_gen = (self.spent_per_task.get(task_id, 0.0)
                            / max(1, generations_completed))
        if remaining < avg_cost_per_gen * remaining_generations:
            return "gpt-4o-mini"  # Switch to cheaper model
        return "claude-3-5-sonnet"  # Use best model
10.12 Mathematical Foundations
10.12.1 Evolutionary Search as Optimization
The evolutionary program synthesis can be formalized as an optimization problem over the space of programs:
$$ \text{(Eq. 2)}\quad p^* = \arg\max_{p \in \mathcal{P}} F(p, T) = \arg\max_{p \in \mathcal{P}} \left[\, w_1 \min_i f(p, x_i, y_i) + w_2 \frac{1}{N} \sum_i f(p, x_i, y_i) \,\right] $$
Where P is the space of valid Python programs, T = {(x_i, y_i)} is the set of training examples, and f(p, x, y) measures the agreement between p(x) and y.
10.12.2 Mutation as Conditional Generation
Each LLM mutation can be modeled as sampling from a conditional distribution:
$$ \text{(Eq. 3)}\quad p_{\mathrm{child}} \sim \mathrm{LLM}\big(\,\cdot \mid p_{\mathrm{parent}},\, R(p_{\mathrm{parent}}, T),\, M\big) $$
Where R(p, T) represents the runtime traces of program p on task T, and M represents the mutation instruction. The key insight is that conditioning on R(p, T) provides the LLM with grounded semantic information rather than requiring it to infer program behavior from syntax alone.
10.12.3 Information-Theoretic Perspective
The runtime-as-context approach can be understood through the lens of information theory. The mutual information between the LLM's output and the correct solution is bounded by the information in its input:
$$ \text{(Eq. 4)}\quad I(p_{\mathrm{child}};\, p^*) \le I(\mathrm{prompt};\, p^*) = I(p_{\mathrm{parent}}, R, M;\, p^*) $$
Adding runtime traces R increases the mutual information between the prompt and the target solution, enabling better-informed mutations:
$$ \text{(Eq. 5)}\quad I(p_{\mathrm{parent}}, R, M;\, p^*) \ge I(p_{\mathrm{parent}}, M;\, p^*) $$
This inequality formalizes the claim that runtime context is at worst neutral and typically helpful -- conditioning on the traces can only add information about the target solution, never remove it (assuming the LLM processes the traces correctly).
10.12.4 Population Dynamics
The expected fitness improvement per generation can be approximated as:
$$ \text{(Eq. 6)}\quad \mathbb{E}[F_{t+1}] = \mathbb{E}[F_t] + \frac{\alpha\, \sigma_F\, I}{\sqrt{N}} $$
Where alpha is the selection intensity, sigma_F is the fitness standard deviation in the population, I is the expected improvement from LLM mutation (related to the quality of runtime context), and N is the population size. This shows diminishing returns from increasing population size, suggesting moderate populations (20-50) are optimal.
10.12.5 Python Implementation of Key Equations
python
import numpy as np
from typing import Callable
# Eq. 2: Composite fitness function
def fitness(
program_fn: Callable,
examples: list[tuple[np.ndarray, np.ndarray]],
w1: float = 0.4,
w2: float = 0.6
) -> float:
"""
F(p, T) = w1 * min_i f(p, x_i, y_i) + w2 * mean_i f(p, x_i, y_i)
where f(p, x, y) = |{(i,j) : p(x)[i,j] == y[i,j]}| / |y|
"""
scores = []
for x, y in examples:
try:
pred = program_fn(x.copy())
if pred.shape != y.shape:
scores.append(0.0)
else:
scores.append(float(np.mean(pred == y)))
except Exception:
scores.append(0.0)
if not scores:
return 0.0
return w1 * min(scores) + w2 * np.mean(scores)
# Eq. 6: Expected fitness improvement per generation
def expected_improvement(
population_fitness: list[float],
selection_intensity: float = 1.5, # ~ top 20% selection
runtime_context_quality: float = 0.8, # 0-1, how good are traces
) -> float:
"""
E[Delta F] = alpha * sigma_F * I / sqrt(N)
Parameters:
population_fitness: list of fitness values for current population
selection_intensity: alpha, depends on selection pressure
runtime_context_quality: I, depends on trace informativeness
"""
sigma_f = np.std(population_fitness)
n = len(population_fitness)
delta_f = selection_intensity * sigma_f * runtime_context_quality / np.sqrt(n)
return delta_f
# Behavioral diversity metric for population
def population_diversity(
behavioral_fingerprints: list[tuple]
) -> float:
"""
Diversity = |unique fingerprints| / |population|
Measures fraction of behaviorally distinct programs.
"""
unique = len(set(behavioral_fingerprints))
total = len(behavioral_fingerprints)
return unique / total if total > 0 else 0.0
# Information gain from runtime context
def runtime_context_information_gain(
accuracy_with_context: list[float],
accuracy_without_context: list[float]
) -> float:
"""
Approximate information gain from runtime traces.
IG = E[F_with_context] - E[F_without_context]
Based on empirical measurements of mutation quality with
and without runtime traces.
"""
mean_with = np.mean(accuracy_with_context)
mean_without = np.mean(accuracy_without_context)
return mean_with - mean_without
# Grid cell-level accuracy (core fitness primitive)
def cell_accuracy(actual: np.ndarray, expected: np.ndarray) -> float:
"""
f(p, x, y) = (1/|y|) * sum_{i,j} 1[p(x)[i,j] == y[i,j]]
Returns fraction of correctly predicted cells.
Handles shape mismatches with zero score.
"""
if actual.shape != expected.shape:
return 0.0
return float(np.sum(actual == expected)) / expected.size
# Tournament selection probability
def tournament_selection_prob(
fitness_rank: int,
population_size: int,
tournament_size: int
) -> float:
"""
P(rank k selected) = (1 - ((N-k)/N)^T) - (1 - ((N-k+1)/N)^T)
where N = population_size, k = rank (1=best), T = tournament_size.
Approximation for large N.
"""
N = population_size
T = tournament_size
k = fitness_rank
    # P(all T draws have rank >= k) minus P(all T draws have rank > k)
    # equals P(the best contestant in the tournament has rank exactly k),
    # matching the docstring formula
    p_all_rank_geq_k = ((N - k + 1) / N) ** T
    p_all_rank_gt_k = ((N - k) / N) ** T
    return p_all_rank_geq_k - p_all_rank_gt_k
10.12.6 Convergence Analysis
Under idealized assumptions (LLM mutations are unbiased improvements, fitness landscape is unimodal), the expected number of generations to reach fitness F* from initial fitness F_0 is:
$$ \text{(Eq. 7)}\quad G \approx \frac{F^* - F_0}{\mathbb{E}[\Delta F]} = \frac{(F^* - F_0)\,\sqrt{N}}{\alpha\, \sigma_F\, I} $$
In practice, the fitness landscape for ARC tasks is highly multimodal, meaning the search often requires strategy shifts (equivalent to escaping local optima) rather than smooth gradient following. This motivates the adaptive search strategy controller described in Section 10.8.
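Under those idealized assumptions, Eq. 7 is easy to evaluate numerically. A sketch (the parameter values below are illustrative, taken from the mid-range of the tables in this section):

```python
import math

def generations_to_target(f_target: float, f_0: float, n: int,
                          alpha: float, sigma_f: float, i_quality: float) -> float:
    """Eq. 7: G ~= (F* - F_0) * sqrt(N) / (alpha * sigma_F * I)."""
    return (f_target - f_0) * math.sqrt(n) / (alpha * sigma_f * i_quality)

# F* = 1.0 from F_0 = 0.4 with N = 30, alpha = 1.5, sigma_F = 0.1, I = 0.8
g = generations_to_target(1.0, 0.4, 30, 1.5, 0.1, 0.8)
```

With these mid-range parameters the estimate comes out around 27 generations, which at least lands inside the 30-100 max-generation budget from Section 10.5.2.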
11. Programming Language
| Aspect | Details |
|---|---|
| System implementation | Python 3.10+ |
| Synthesized programs | Python (numpy-based grid transformations) |
| Async framework | asyncio for concurrent LLM API calls |
| Key dependencies | numpy, httpx/aiohttp, pyyaml, ast (standard library) |
| Execution sandbox | Restricted exec() with controlled namespace |
| Testing | pytest |
| Configuration | YAML files for hyperparameters and model selection |
The choice of Python for both the system and synthesized programs is deliberate: LLMs have been extensively trained on Python code and produce higher-quality Python than most other languages. Numpy provides a natural vocabulary for grid manipulation operations (slicing, transposition, masking, etc.) that are fundamental to ARC tasks.
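The restricted `exec()` sandbox might look like the sketch below (an illustration of the idea, not the repository's implementation, and not a real security boundary -- production isolation needs process-level sandboxing with time and memory limits):

```python
import numpy as np

# Whitelisted builtins; everything else (open, __import__, eval, ...) is absent
_SAFE_BUILTINS = {
    "range": range, "len": len, "enumerate": enumerate, "zip": zip,
    "min": min, "max": max, "abs": abs, "sum": sum, "sorted": sorted,
    "list": list, "tuple": tuple, "set": set, "dict": dict, "int": int,
}

def run_sandboxed(source: str, grid: np.ndarray) -> np.ndarray:
    """Execute candidate code in a controlled namespace and call transform()."""
    namespace = {"np": np, "__builtins__": _SAFE_BUILTINS}
    exec(source, namespace)                      # defines transform(grid)
    return namespace["transform"](grid.copy())   # copy protects the caller's grid
```

A candidate that only uses numpy (e.g. `return grid.T`) runs normally, while one that calls `open()` fails with a `NameError` because the builtin is not in the namespace.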
12. Memory & Runtime Context Management
12.1 Context Window Budget Allocation
The most constrained resource is the LLM's context window. Arcgentica carefully allocates this budget across prompt components:
| Component | Typical Token Allocation | Priority |
|---|---|---|
| System prompt | 200-400 tokens | Fixed (always included) |
| Task description (I/O pairs) | 500-2000 tokens | Fixed (always included) |
| Parent program source | 200-800 tokens | Fixed (always included) |
| Runtime traces | 500-3000 tokens | Adaptive (compressed as needed) |
| Output diff | 200-500 tokens | High priority |
| Mutation instructions | 100-300 tokens | Fixed |
| Reserved for response | 1000-2000 tokens | Fixed |
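Enforcing the adaptive row can be sketched as a greedy assembler: fixed components always go in, adaptive ones are added in priority order while they still fit. (The whitespace-based token estimate and all names here are illustrative assumptions, not the repository's allocator.)

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4/3 tokens per whitespace-delimited word."""
    return max(1, (len(text.split()) * 4) // 3)

def assemble_prompt(fixed_parts: list[str], adaptive_parts: list[str],
                    context_budget: int, reserve: int = 1500) -> str:
    """Include fixed parts unconditionally; add adaptive parts (runtime traces,
    intermediate states) in priority order while they fit under the budget
    minus the tokens reserved for the model's response."""
    used = sum(estimate_tokens(p) for p in fixed_parts)
    chosen = list(fixed_parts)
    for part in adaptive_parts:               # assumed pre-sorted by priority
        cost = estimate_tokens(part)
        if used + cost + reserve <= context_budget:
            chosen.append(part)
            used += cost
    return "\n".join(chosen)
```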
12.2 Trace Compression Strategies
- Selective snapshots: Only include variable states at loop boundaries and function entry/exit points, not every line.
- Grid summarization: For large grids, show only the bounding box around changed regions rather than the full grid.
- Error-focused traces: For programs that crash, prioritize the error message and the last few lines of the traceback over intermediate states.
- Diff-only mode: For high-fitness programs, show only the cells that differ rather than full grids.
- Example prioritization: If context is tight, include full traces for failing examples and skip traces for passing examples.
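Two of these strategies -- diff-only mode and bounding-box summarization -- can be sketched as follows (the output formatting is an illustrative assumption; both helpers assume equal shapes):

```python
import numpy as np

def diff_only(actual: np.ndarray, expected: np.ndarray) -> str:
    """List only mismatched cells as 'row,col: actual->expected' lines."""
    rows, cols = np.nonzero(actual != expected)
    return "\n".join(f"{r},{c}: {actual[r, c]}->{expected[r, c]}"
                     for r, c in zip(rows, cols))

def changed_bounding_box(actual: np.ndarray, expected: np.ndarray) -> np.ndarray:
    """Crop to the smallest box containing every mismatched cell."""
    rows, cols = np.nonzero(actual != expected)
    if rows.size == 0:
        return actual[:0, :0]                  # no mismatches: empty crop
    return actual[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```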
12.3 Population Memory
The population itself serves as a form of memory, retaining successful strategies and code patterns across generations. The archive (hall of fame) preserves historically best individuals even if they are replaced in the active population, preventing loss of valuable genetic material.
python
class PopulationMemory:
"""Long-term memory across evolutionary generations."""
def __init__(self):
self.archive: list[Individual] = [] # Best individuals ever seen
self.strategy_log: list[dict] = [] # What strategies worked
self.stagnation_history: list[int] = [] # Track stagnation events
def update(self, population: list[Individual], generation: int):
best = max(population, key=lambda x: x.fitness)
if not self.archive or best.fitness > self.archive[-1].fitness:
self.archive.append(best)
self.strategy_log.append({
"generation": generation,
"fitness": best.fitness,
"mutation_type": best.mutation_type,
"parent_fitness": self._get_parent_fitness(best),
})
def get_effective_strategies(self) -> list[str]:
"""Return mutation types that historically produced improvements."""
improvements = [
s for s in self.strategy_log
if s["fitness"] > s.get("parent_fitness", 0)
]
return [s["mutation_type"] for s in improvements]
13. Continued Learning
13.1 Cross-Task Transfer
Arcgentica does not currently perform cross-task transfer learning within a single run -- each task is solved independently. However, the architecture supports several avenues for continued learning:
- Solution library: Successful programs from solved tasks can be used as seed programs for similar unseen tasks, bootstrapping the search with known-good code patterns.
- Strategy meta-learning: Statistics about which mutation types and search strategies were most effective for different task categories can inform strategy selection on new tasks.
- Prompt refinement: Prompts that led to high-quality mutations can be catalogued and reused, effectively learning a prompt engineering strategy over time.
13.2 No Weight Updates
Critically, Arcgentica does not fine-tune the underlying LLM. The LLM weights remain frozen; all "learning" happens through the evolutionary search process and the information content of the prompts. This is a significant advantage for reproducibility and cost, but limits the system's ability to internalize domain-specific knowledge.
13.3 Potential Extensions
| Extension | Description | Expected Impact |
|---|---|---|
| Few-shot exemplar library | Maintain a library of solved task code to include as few-shot examples in prompts | High -- provides the LLM with proven code patterns |
| DSL induction | Discover a domain-specific language from successful programs, then constrain synthesis to DSL | Medium -- reduces search space but may miss novel solutions |
| Fine-tuned mutation model | Fine-tune a smaller LLM specifically on (parent, trace, improved_child) triples | High -- but expensive and reduces generality |
| Retrieval-augmented generation | RAG over solved tasks to find relevant code patterns for new tasks | Medium-High -- practical and low-cost |
14. Applications
14.1 Primary: ARC-AGI-2 Benchmark
The immediate application is solving ARC-AGI-2 tasks, which test abstract reasoning through visual grid transformations. Each task requires discovering a rule from a few examples and applying it to new inputs -- a challenge that tests genuine generalization rather than pattern matching on large datasets.
14.2 Broader Applications
14.2.1 Automated Algorithm Discovery
The runtime-as-context evolutionary approach is directly applicable to discovering novel algorithms. Given a specification of desired input/output behavior and an evaluation function, the system can evolve algorithms for sorting, searching, optimization, and mathematical computation. This is closely related to Google DeepMind's AlphaEvolve, which uses a similar evolutionary LLM approach for mathematical discovery.
14.2.2 Automated Program Repair
The combination of runtime trace capture and LLM-guided mutation is naturally suited to automated program repair (APR). Given a buggy program and failing test cases, the system can:
- Execute the program to capture runtime traces at the point of failure
- Feed these traces to the LLM with the failing test case
- Generate candidate fixes and validate them against all test cases
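The three steps compose into a generate-and-validate loop. A domain-agnostic sketch (the `run_with_trace` and `propose_fix` callables stand in for a trace-capturing executor and an LLM-backed fixer; all names are assumptions):

```python
def repair(buggy_source, tests, run_with_trace, propose_fix, max_attempts=5):
    """Trace-guided repair loop: execute, feed the failure trace to a fixer,
    validate the candidate against all tests.

    run_with_trace(source, test) -> (passed: bool, trace)
    propose_fix(source, trace)   -> candidate patched source
    """
    source = buggy_source
    for _ in range(max_attempts):
        failures = [(t, tr) for t in tests
                    for ok, tr in [run_with_trace(source, t)] if not ok]
        if not failures:
            return source                      # all tests pass: repaired
        _, trace = failures[0]                 # focus on the first failure
        source = propose_fix(source, trace)    # candidate fix from the trace
    return None                                # attempt budget exhausted
```

Any representation of "source" works; in Arcgentica's setting it would be Python text and the fixer an LLM call conditioned on the runtime trace.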
14.2.3 Code Optimization
The fitness function can be replaced with a performance metric (execution time, memory usage) to evolve more efficient implementations of existing algorithms. The runtime traces provide profiling information that helps the LLM identify bottlenecks.
14.2.4 Scientific Computing
In domains where correct numerical procedures are known but optimal implementations are sought (e.g., PDE solvers, molecular dynamics kernels), Arcgentica could evolve specialized implementations tuned to specific problem characteristics.
14.2.5 Education & Tutoring
The runtime trace mechanism can be used to explain code behavior to students. By showing intermediate states and variable values, the system can generate step-by-step explanations of how a program transforms its input, making it a potential tool for computer science education.
14.3 Comparison with Related Systems
| System | Organization | Key Similarity | Key Difference |
|---|---|---|---|
| AlphaEvolve | Google DeepMind | Evolutionary LLM-guided code synthesis | Focuses on mathematical algorithm discovery rather than abstract reasoning tasks |
| FunSearch | Google DeepMind | LLM + evolutionary search for programs | Targets combinatorial math problems (cap set, bin packing); uses a simpler fitness signal |
| OpenELM | Various | Evolutionary LLM code generation | More general purpose; less focus on runtime trace feedback |
| Ryan Greenblatt's ARC approach | Independent | LLM-based ARC solving with iteration | Uses direct prompting with retry rather than population-based evolution |
| BARC (Berkeley ARC) | UC Berkeley | Program synthesis for ARC | Uses DSL-based synthesis rather than free-form Python generation |
15. References
- Symbolica AI. Arcgentica: Runtime-as-Context Evolutionary Program Synthesis. GitHub repository, 2025. github.com/symbolica-ai/arcgentica
- Symbolica AI. "Runtime as Context." Symbolica AI Blog, 2025. symbolica.ai/blog/runtime-as-context
- Chollet, F. "On the Measure of Intelligence." arXiv preprint arXiv:1911.01547, 2019.
- Chollet, F. "ARC-AGI: A Benchmark for General Intelligence." ARC Prize Foundation, 2024. arcprize.org
- Romera-Paredes, B., et al. "Mathematical discoveries from program search with large language models." Nature, 625, 468-475, 2024. (FunSearch)
- Novikov, A., et al. "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms." Google DeepMind Blog, 2025.
- Lehman, J., et al. "Evolution through Large Models." arXiv preprint arXiv:2206.08896, 2022.
- Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv preprint arXiv:2107.03374, 2021. (Codex/HumanEval)
- Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
- Le Goues, C., et al. "Automated Program Repair." Communications of the ACM, 62(12), 56-65, 2019.
Technical Report -- Arcgentica by Symbolica AI | March 2026
Sources: GitHub Repository | Runtime-as-Context Blog Post | ARC Prize