Confluence Labs
Part P04: Program Synthesis & ARC-AGI Solvers
16.1 Overview & Motivation
The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, was designed as a benchmark for fluid intelligence—the capacity to solve genuinely novel problems that resist memorization and pattern matching. ARC tasks present a small number of input-output grid pairs (typically 2–5) demonstrating an unknown transformation rule, and the solver must infer this rule and apply it to unseen test inputs. By 2025, ARC-AGI-2 had emerged as the harder second iteration of this benchmark, with curated tasks demanding more complex compositional reasoning, a reduced guess limit (two attempts instead of three), and a 12-hour wall-clock constraint for the full evaluation.
Confluence Labs, an AI research lab founded by Brent and Niranjan with backing from Y Combinator, approached ARC-AGI-2 through a distinctive lens: learning efficiency—the ability to acquire new capabilities from minimal data. Rather than training end-to-end neural networks to predict output grids directly, Confluence Labs recast each ARC task as a program induction problem: find executable Python code that maps input grids to output grids. Their system achieved a reported 97.92% accuracy on the ARC-AGI-2 public evaluation set (approximately 392 of 400 tasks) at a reported cost of $11.77 per task. These figures are self-reported in the project README (github.com/confluence-labs/arc-agi-2); no independent third-party verification is documented in the available sources.
Key Contribution
Confluence Labs demonstrates that large language models, when used as program synthesizers rather than direct predictors, can achieve near-ceiling performance on abstract reasoning benchmarks designed to resist statistical learning. The system's three design principles—structural alignment with LLM training distributions, extended reasoning horizons via iterative refinement, and measurable feedback through executable verification—provide a reusable framework for applying LLM code generation to hypothesis-driven reasoning tasks beyond grid puzzles.
The system is open-source under the MIT license (repository: github.com/confluence-labs/arc-agi-2), built in Python 3.11+ with uv as the package manager, Google Gemini as the LLM backend, and E2B for sandboxed code execution. This chapter examines the architecture, algorithms, cost structure, and limitations of the Confluence Labs solver in detail.
Source Provenance Note
This chapter draws on three categories of evidence: (A) Repository-verified—details confirmed by examining the public GitHub repository structure, README, configuration files, and any source code; (B) README/documentation-reported—claims stated in the project README or associated documentation but not independently verified; and (C) Author analysis—the survey author's own formalization, interpretation, or reconstruction of mechanisms described at a high level in the source material. Each major claim is tagged with its provenance category. All code examples in this chapter are illustrative pseudocode faithful to the documented architecture unless explicitly noted as verbatim repository excerpts.
16.1.1 The Program Synthesis Paradigm
Program synthesis—the automatic generation of programs from specifications—has deep roots in computer science, from Summers (1977) and inductive logic programming through modern neural-guided search. Traditional approaches use domain-specific languages (DSLs) and enumerative or constraint-based search. Confluence Labs instead leverages LLMs as neural program synthesizers: the model's training on millions of code examples provides an implicit prior over likely programs, dramatically narrowing the search space compared to brute-force enumeration.
The insight that distinguishes this approach from earlier ARC solvers is the recognition that modern LLMs possess extraordinary code-generation capabilities. When an LLM writes a Python function to solve an ARC task, it must understand spatial relationships in the grid, identify the transformation rule, and express that rule as executable logic. The generated code serves as an explicit, verifiable, and interpretable representation of the inferred rule—properties that direct neural prediction approaches lack.
16.1.2 Theoretical Framing
The approach can be loosely understood through an analogy to Solomonoff induction (Solomonoff, 1964). Each candidate program represents a hypothesis about the underlying generating process. The LLM serves as an approximation to a universal prior, biased toward simple, human-readable programs. The iterative refinement loop implements a form of posterior updating, where programs that fail on training examples are revised to incorporate the observed evidence.
Analogy Caveat
Provenance: Author analysis (C). The Solomonoff framing is offered as conceptual intuition, not as a formal theoretical result. Classical Solomonoff induction is incomputable and defined over all computable hypotheses; the LLM-based system operates over a heavily restricted, empirically-shaped subset of Python programs. The analogy is useful for understanding why the approach works—program length serves as an implicit complexity prior, and iterative refinement approximates Bayesian updating—but it should not be read as a claim that the system implements or approximates Solomonoff induction in any formal sense. No citation in the Confluence Labs documentation makes this connection; it is the survey author's interpretive framing.
16.2 Architecture
The Confluence Labs solver is organized as a modular pipeline centered on a core solver engine that orchestrates multiple parallel agents, each operating within isolated sandbox environments. The architecture reflects three levels of parallelism: multiple agents per task, multiple iterations per agent, and multiple concurrent sandboxes across the task set.
16.2.1 Repository Structure
The repository (arc-agi-2/) is organized around the core solver engine, with configuration managed through environment variables and execution controlled by a shell entry point. The following layout is described in the source documentation. Provenance: (B) README/documentation-reported. The exact internal file names (e.g., solver.py, agent.py, ensemble.py) appear in the source material's directory listing, but without independent verification of the repository contents at a specific commit hash, they should be treated as documentation-stated rather than code-verified.
| Path | Purpose | Verification Status |
|---|---|---|
| gemini-cli-solver/src/ | Core solver engine source code | Documented in README |
| gemini-cli-solver/src/solver.py | Main solver orchestrator: task distribution, agent lifecycle, output collection | Stated in source material directory listing |
| gemini-cli-solver/src/agent.py | Individual agent implementation with iterative refinement loop | Stated in source material directory listing |
| gemini-cli-solver/src/sandbox.py | E2B sandbox management: pool, execution, timeout enforcement | Stated in source material directory listing |
| gemini-cli-solver/src/ensemble.py | Ensemble aggregation and output selection (majority voting) | Stated in source material directory listing |
| gemini-cli-solver/src/prompts/ | Prompt templates for task formatting and refinement | Stated in source material directory listing |
| .env | Runtime configuration (agent count, iterations, concurrency, API keys) | Documented in README |
| run.sh | Entry point with --smoke and --full modes | Documented in README |
Verification note. To confirm these paths, a reader should clone the repository at github.com/confluence-labs/arc-agi-2 and inspect the directory tree. The source material's description is consistent and internally coherent, but the survey author has not independently verified every file name against a specific commit hash. Throughout this chapter, where the text says a module "handles" or "implements" a function, the attribution is based on the documented purpose of that file, not a line-by-line code audit.
16.2.2 Technology Stack
| Component | Technology | Role | Source |
|---|---|---|---|
| Language | Python 3.11+ | Primary implementation | README (B) |
| Package Manager | uv | Dependency resolution, lockfile (uv.lock) | README (B) |
| LLM Backend | Google Gemini API (gemini-2.5-pro) | Code generation and iterative refinement | README + .env template (B) |
| Sandbox | E2B | Isolated, ephemeral code execution environments | README (B) |
| Configuration | .env files | Runtime parameter management | README (B) |
| Orchestration | Shell (run.sh) + asyncio | Entry point and concurrency control | README (B) |
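The runtime parameters named in this chapter can be collected into a single typed object. The following is a minimal sketch, assuming the environment variable names quoted elsewhere in this chapter (GEMINI_CLI_AGENTS, GEMINI_CLI_MAX_ITERATIONS, GEMINI_CLI_CONCURRENCY, SANDBOX_TIMEOUT, GEMINI_TEMPERATURE, GEMINI_MAX_TOKENS) and the values reported for them. The `SolverConfig` class and `load_config` helper are hypothetical names introduced for illustration and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SolverConfig:
    """Hypothetical container for the documented .env parameters."""
    agents: int
    max_iterations: int
    concurrency: int
    sandbox_timeout_s: int
    temperature: float
    max_tokens: int


def load_config(env: Mapping[str, str] = os.environ) -> SolverConfig:
    """Read settings from an environment-style mapping.

    The defaults mirror the values reported in this chapter; they are
    assumptions about the template, not verified repository contents.
    """
    return SolverConfig(
        agents=int(env.get("GEMINI_CLI_AGENTS", "12")),
        max_iterations=int(env.get("GEMINI_CLI_MAX_ITERATIONS", "10")),
        concurrency=int(env.get("GEMINI_CLI_CONCURRENCY", "132")),
        sandbox_timeout_s=int(env.get("SANDBOX_TIMEOUT", "30")),
        temperature=float(env.get("GEMINI_TEMPERATURE", "0.7")),
        max_tokens=int(env.get("GEMINI_MAX_TOKENS", "8192")),
    )
```

Passing a plain dictionary instead of `os.environ` (for example `load_config({"GEMINI_CLI_AGENTS": "6"})`) makes it easy to experiment with smaller configurations without touching the shell environment.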
16.3 Core Principles and Algorithms
The system is built on three foundational design principles that collectively address the question: how should abstract reasoning problems be presented to and solved by LLMs? These principles are explicitly articulated in the source documentation (B) and directly determine architectural choices in the solver pipeline.
16.3.1 Principle 1: Structural Alignment with Training Distributions
The first principle recognizes that the format in which a problem is presented to an LLM substantially affects solution quality. LLMs are not general-purpose reasoning engines but pattern matchers trained on specific data distributions. By reformulating ARC tasks to resemble the coding challenges, code review sessions, and technical documentation that dominate LLM training corpora, the system maximizes the probability that learned representations will transfer to the task.
Concretely, this means representing grids as Python nested lists, framing the task as "write a Python function transform(input_grid)," and priming the model with standard programming idioms (numpy operations, list comprehensions, coordinate manipulations). The prompt structure mirrors coding challenge formats that the model has encountered millions of times during pre-training.
```python
# Illustrative pseudocode: faithful to the documented architecture
# but NOT a verbatim excerpt from the repository.
# The actual prompt construction likely resides in
# gemini-cli-solver/src/prompts/task_prompt.py (per source docs).

def format_task_for_prompt(train_pairs, test_inputs):
    """Structure ARC task as a coding challenge prompt.

    Key design choice: grids rendered as Python lists, not custom formats.
    The prompt mimics competitive programming problem statements.
    Note: test_inputs is accepted but not rendered in this sketch.
    """
    lines = ["# ARC Task: Transform input grids to output grids\n"]
    lines.append("## Training Examples\n")
    for i, pair in enumerate(train_pairs):
        lines.append(f"### Example {i + 1}")
        lines.append(f"Input:\n{_format_grid(pair['input'])}")
        lines.append(f"Output:\n{_format_grid(pair['output'])}")
    lines.append("## Task")
    lines.append("Write a Python function `transform(input_grid: list[list[int]])")
    lines.append("-> list[list[int]]` that correctly transforms ALL training")
    lines.append("inputs to their corresponding outputs.")
    lines.append("\n## Guidelines")
    lines.append("1. Analyze training examples carefully")
    lines.append("2. Identify what changes between input and output")
    lines.append("3. Look for spatial patterns, symmetries, color rules")
    lines.append("4. Write clean, readable Python (numpy allowed)")
    lines.append("5. Do NOT hardcode for specific examples; generalize")
    return "\n".join(lines)


def _format_grid(grid):
    """Render a grid as one bracketed row of integers per line."""
    return "\n".join(
        "[" + ", ".join(str(c) for c in row) + "]"
        for row in grid
    )
```
16.3.2 Principle 2: Extended Reasoning Horizons
Complex ARC tasks require chains of reasoning that exceed what an LLM can accomplish in a single forward pass. The system addresses this through a multi-iteration architecture: each of 12 agents can refine its solution across up to 10 iterations, and the overall system runs for up to 12 hours. This converts a single-shot inference problem into a multi-step search problem with feedback.
The iterative approach encourages progressive complexity: agents start with simple hypotheses and add complexity only when simpler approaches fail, implementing an informal Occam's razor. The source documentation (B) reports that approximately 60% of eventually-solved tasks are resolved within the first 3 iterations, with the remaining 40% requiring deeper reasoning across iterations 4–10.
16.3.3 Principle 3: Measurable Feedback Loops
The program synthesis framing provides an unusually clean feedback signal: a generated program either produces the correct output for a given input or it does not. This binary correctness signal, augmented with structured error diagnostics (stack traces, cell-level diffs between expected and actual outputs), guides the iterative refinement process. Provenance: (B) documented principle; (C) error-category taxonomy below is the survey author's systematization of the described feedback types.
| Error Category | Information Provided to LLM | Typical Resolution |
|---|---|---|
| Syntax Error | Python traceback with line number | Fix syntax; usually resolved in one iteration |
| Runtime Error | Exception type, message, traceback | Fix logic errors (index bounds, type mismatches) |
| Wrong Output | Expected vs. actual grids, cell-level diff | Revise transformation rule hypothesis |
| Partial Match | Percentage correct, mismatched regions | Adjust edge cases while preserving core logic |
| Timeout | Execution exceeded 30s limit | Optimize algorithm or fix infinite loop |
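The taxonomy above can be turned into a small dispatch function that labels each execution result before it is fed back to the model. This is a sketch under stated assumptions: the result dictionary schema (`passed`, `error`, `expected`, `actual`, `elapsed_s`) is hypothetical, and the 0.5 threshold separating "Partial Match" from "Wrong Output" is an arbitrary illustrative choice, since the source material does not define one.

```python
def cell_match_fraction(expected, actual) -> float:
    """Fraction of expected cells reproduced at the same position."""
    total = sum(len(row) for row in expected)
    if total == 0 or not isinstance(actual, list):
        return 0.0
    hits = 0
    for r, row in enumerate(expected):
        for c, value in enumerate(row):
            if r < len(actual) and c < len(actual[r]) and actual[r][c] == value:
                hits += 1
    return hits / total


def classify_result(result: dict, timeout_s: float = 30.0,
                    partial_threshold: float = 0.5) -> str:
    """Map one per-pair result to a feedback category from the table.

    The schema of `result` and the partial-match threshold are
    assumptions made for this sketch.
    """
    if result.get("elapsed_s", 0.0) >= timeout_s:
        return "Timeout"
    error = result.get("error")
    if error:
        return "Syntax Error" if "SyntaxError" in error else "Runtime Error"
    if result.get("passed"):
        return "Pass"
    fraction = cell_match_fraction(result["expected"], result.get("actual"))
    return "Partial Match" if fraction >= partial_threshold else "Wrong Output"
```

Keeping the classification separate from prompt construction lets the refinement step choose different wording per category, for example asking for a targeted edge-case fix on a partial match and a fresh hypothesis on a wrong output.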
16.3.4 The Multi-Agent Ensemble
The system deploys $n = 12$ independent agents per task, each running up to $k = 10$ refinement iterations (these parameters are documented in the .env configuration; provenance B). The ensemble is motivated by a standard probabilistic argument. Let $p$ denote the probability that a single agent solves a given task. Assuming agent independence, the probability that at least one succeeds is:

$$P(\geq 1 \text{ success}) = 1 - (1 - p)^n$$

This is a standard result from probability theory (not specific to this system). For $n = 12$ and $p = 0.30$ (an illustrative value; the actual per-agent success rate is not reported in the source material):

$$P(\geq 1 \text{ success}) = 1 - 0.70^{12} \approx 1 - 0.0138 \approx 0.986$$

The table below shows how ensemble probability scales with agent count for the illustrative case of $p = 0.30$:
| Agents ($n$) | $P(\geq 1 \text{ success})$ | Marginal Gain per Agent |
|---|---|---|
| 1 | 30.0% | — |
| 2 | 51.0% | +21.0% |
| 4 | 76.0% | +12.5% |
| 6 | 88.2% | +6.1% |
| 8 | 94.2% | +3.0% |
| 10 | 97.2% | +1.5% |
| 12 | 98.6% | +0.7% |
| 16 | 99.7% | +0.3% |
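The scaling of ensemble success with agent count can be recomputed directly from the independence formula $1 - (1-p)^n$. The sketch below uses the same illustrative $p = 0.30$; the function name `p_at_least_one` is introduced here for illustration only.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent agents succeeds."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    previous_n, previous_prob = None, None
    for n in (1, 2, 4, 6, 8, 10, 12, 16):
        prob = p_at_least_one(0.30, n)
        if previous_n is None:
            gain = "n/a"
        else:
            # Average gain per added agent since the previous row.
            gain = f"{(prob - previous_prob) / (n - previous_n):+.1%}"
        print(f"n={n:2d}  P(>=1 success)={prob:.1%}  gain per added agent={gain}")
        previous_n, previous_prob = n, prob
```

The marginal gain shrinks geometrically because each added agent can only help on the fraction $(1-p)^{n}$ of tasks that all previous agents missed.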
Independence Assumption and Correlated Agents
Provenance: Author analysis (C). The calculation above assumes statistical independence between agents, which is not satisfied in practice. All agents use the same LLM (Gemini 2.5 Pro), receive the same task description, and operate under the same prompt template. This introduces positive correlation between agent outcomes: if one agent fails because the task requires a concept outside the model's training distribution, other agents are likely to fail for the same reason.
Under a simple equicorrelated model where each pair of agents shares correlation $\rho$, the effective number of independent trials becomes approximately $n_{\text{eff}} = n / (1 + (n-1)\rho)$. Even modest correlation substantially degrades the ensemble benefit:
| Pairwise $\rho$ | $n_{\text{eff}}$ (of 12) | $1 - (1-p)^{n_{\text{eff}}}$ at $p = 0.30$ |
|---|---|---|
| 0.0 (independent) | 12.0 | 98.6% |
| 0.1 | 5.7 | 87.0% |
| 0.3 | 2.8 | 63.0% |
| 0.5 | 1.8 | 48.2% |
The source documentation does not report empirical diversity statistics (e.g., inter-agent agreement rates, unique solution counts per task, or correlation measurements). LLM output stochasticity (temperature > 0) and divergent error paths during iterative refinement introduce variance across agents, but the magnitude of effective independence remains an open empirical question. The actual per-agent success rate $p$ is not published—the $p = 0.30$ value used throughout this section is illustrative only. The true ensemble dynamics depend on the joint distribution of agent outcomes, which is not characterized in the available sources. Different agents are not reported to use different temperature settings, system prompts, or model variants to increase diversity—this remains an obvious optimization avenue noted in the source material.
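The effect of correlation can be explored numerically. The sketch below applies the equicorrelated approximation $n_{\text{eff}} = n / (1 + (n-1)\rho)$ and substitutes $n_{\text{eff}}$ into the independent-agent formula. That substitution is a heuristic adopted for illustration, not a derivation from a joint distribution, and the function names are hypothetical.

```python
def effective_agents(n: int, rho: float) -> float:
    """Effective number of independent agents under pairwise correlation rho."""
    return n / (1.0 + (n - 1) * rho)


def p_at_least_one_correlated(p: float, n: int, rho: float) -> float:
    """Heuristic ensemble success probability: 1 - (1 - p) ** n_eff."""
    return 1.0 - (1.0 - p) ** effective_agents(n, rho)


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.3, 0.5):
        n_eff = effective_agents(12, rho)
        prob = p_at_least_one_correlated(0.30, 12, rho)
        print(f"rho={rho:.1f}  n_eff={n_eff:.1f}  P(>=1)={prob:.1%}")
```

Because $n_{\text{eff}}$ approaches $1/\rho$ as $n$ grows, adding agents beyond a certain point buys little under this model; increasing diversity (lowering $\rho$) is the more effective lever.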
16.3.5 Ensemble Aggregation
ARC-AGI-2 allows two guesses per test input. Given up to 12 agent solutions, the system selects two outputs through majority voting (documented principle B; specific mechanism described below follows the documented approach). For each test input $t$, let $o_1^t, o_2^t, \ldots, o_m^t$ be the $m \leq 12$ candidate outputs from successful agents, with repeats allowed. The system converts each output grid to a hashable representation (tuple of tuples) and counts how often each distinct grid $o$ occurs:

$$c^t(o) = \sum_{i=1}^{m} \mathbb{1}\left[o_i^t = o\right]$$
where $\mathbb{1}[\cdot]$ is the indicator function. The first guess is the output with the most agent agreement; the second guess is the next most common distinct output. If only one unique output exists among all agents, both guesses are identical.
```python
# Illustrative pseudocode: represents the documented majority-voting
# approach described in the source material. NOT a verbatim excerpt
# from the repository. The actual implementation likely resides in
# gemini-cli-solver/src/ensemble.py (per source docs).
from collections import Counter


def aggregate_solutions(results, num_test_inputs):
    """Select final 2 guesses per test input via majority voting.

    Args:
        results: List of agent results from successful agents
        num_test_inputs: Number of test inputs in the task

    Returns:
        List of [guess_1, guess_2] per test input
        ([None, None] when no agent produced an output for that input)
    """
    per_test_outputs = []
    for test_idx in range(num_test_inputs):
        candidates = []
        for result in results:
            if result.success and test_idx < len(result.test_outputs):
                output = result.test_outputs[test_idx]
                # Convert to hashable for counting
                candidates.append(grid_to_hashable(output))
        if candidates:
            counter = Counter(candidates)
            top = counter.most_common(2)
            guess_1 = hashable_to_grid(top[0][0])
            guess_2 = hashable_to_grid(top[1][0]) if len(top) > 1 else guess_1
            per_test_outputs.append([guess_1, guess_2])
        else:
            # Keep the result list aligned with the test-input indices.
            per_test_outputs.append([None, None])
    return per_test_outputs


def grid_to_hashable(grid):
    return tuple(tuple(row) for row in grid)


def hashable_to_grid(h):
    return [list(row) for row in h]
```
The source documentation also mentions alternative aggregation strategies (weighted consensus, diversity selection) but indicates that majority voting is the primary method used in the reported results. Whether more sophisticated aggregation was tested and rejected is not documented.
16.4 Iterative Refinement Loop
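The iterative refinement loop is where the extended reasoning horizon and measurable feedback principles converge. Each agent follows a structured lifecycle for every task: generate a candidate transform() function, execute it against the training pairs in a sandbox, and, if any pair fails, feed the resulting diagnostics into a refinement prompt. The cycle repeats until every training pair passes or the iteration budget is exhausted.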
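The per-agent lifecycle can be sketched as a generate, execute, refine loop. This is a minimal illustration under stated assumptions: `generate` stands in for the LLM call and `evaluate` for sandboxed execution, both hypothetical callables supplied by the caller, and the result dictionaries are assumed to carry a boolean `passed` key. It is not the repository's agent implementation.

```python
from typing import Callable, Optional


def run_agent(
    train_pairs: list,
    generate: Callable[[Optional[str], list], str],
    evaluate: Callable[[str, list], list],
    max_iterations: int = 10,
) -> tuple:
    """Generate, test, and refine until all training pairs pass.

    generate(previous_code, failures) returns new candidate code;
    evaluate(code, train_pairs) returns one result dict per pair.
    Returns (code, iterations_used); code is None if the budget runs out.
    """
    previous_code: Optional[str] = None
    failures: list = []
    for iteration in range(1, max_iterations + 1):
        code = generate(previous_code, failures)
        results = evaluate(code, train_pairs)
        failures = [r for r in results if not r.get("passed")]
        if not failures:
            return code, iteration  # every training pair passed
        previous_code = code  # carry the failed attempt into the next prompt
    return None, max_iterations
```

Passing the failures back into `generate` is what distinguishes refinement from independent resampling: each attempt is conditioned on concrete evidence about why the previous one was wrong.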
16.4.1 Refinement Prompt Construction
When a generated program fails on one or more training pairs, the system constructs a refinement prompt that includes: (1) the previous code, (2) a detailed error summary with cell-level diffs for wrong outputs, and (3) the original task description. This rich feedback allows the LLM to diagnose specific failures and revise its hypothesis about the transformation rule. Provenance: (B) documented design; (C) the specific prompt structure below is the survey author's reconstruction.
```python
# Illustrative pseudocode: reconstructs the documented refinement
# approach. NOT a verbatim excerpt from the repository.
# The actual refinement prompt construction likely resides in
# gemini-cli-solver/src/prompts/refinement_prompt.py (per source docs).

def build_refinement_prompt(task, previous_code, iteration, errors):
    """Construct a targeted refinement prompt from execution feedback.

    The prompt provides the LLM with:
    - The code that failed
    - Precise error diagnostics (cell-level diffs for wrong outputs)
    - The original task for re-analysis
    """
    sections = [
        f"## Refinement Iteration {iteration + 1}",
        f"Your previous attempt failed on {len(errors)} training examples.\n",
        f"## Previous Code\n```python\n{previous_code}\n```\n",
        "## Error Analysis",
    ]
    for error in errors:
        pair_idx = error['pair']
        if 'error' in error:
            # Runtime error: include full traceback
            sections.append(
                f"### Pair {pair_idx}: Runtime Error\n"
                f"```\n{error['error']}\n```"
            )
        elif not error['passed']:
            # Wrong output: show cell-level diff
            expected = error.get('expected', [])
            actual = error.get('actual') or []
            diff_lines = []
            for r in range(len(expected)):
                for c in range(len(expected[r])):
                    exp_val = expected[r][c]
                    in_bounds = r < len(actual) and c < len(actual[r])
                    act_val = actual[r][c] if in_bounds else "MISSING"
                    if exp_val != act_val:
                        diff_lines.append(
                            f"  Cell ({r},{c}): expected {exp_val}, "
                            f"got {act_val}"
                        )
            sections.append(
                f"### Pair {pair_idx}: Wrong Output\n" +
                "\n".join(diff_lines[:20])  # cap at 20 diffs
            )
    # Assumes the standard ARC task layout: task['train'] holds the pairs
    # and each entry of task['test'] is a dict with an 'input' grid.
    test_inputs = [t['input'] for t in task['test']]
    sections.append(
        f"## Original Task\n{format_task_for_prompt(task['train'], test_inputs)}"
    )
    sections.append(
        "## Instructions\n"
        "1. Analyze why the previous code failed\n"
        "2. Consider alternative transformation rules\n"
        "3. Write an improved transform() function\n"
        "4. Ensure it handles ALL training examples"
    )
    return "\n\n".join(sections)
```
16.4.2 Convergence Characteristics
The source documentation (B) reports characteristic convergence patterns across iterations, with the following approximate distribution of solves:
| Iteration Range | Fraction of Solves | Characterization | Source |
|---|---|---|---|
| 1–3 | ~60% | Quick fixes: syntax, off-by-one, wrong axis | Source docs (B) |
| 4–7 | ~30% | Deeper reasoning: reconsidering the transformation hypothesis | Source docs (B) |
| 8–10 | ~10% | Diminishing returns, but on the hardest tasks | Source docs (B) |
The average iterations to solve, across all successfully solved tasks, is reported as approximately 3.2 iterations (B).
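The reported bucket shares and the reported mean of about 3.2 iterations can be checked against each other. The sketch below computes the lowest and highest mean iteration count compatible with the three buckets (placing every solve at the start or the end of its bucket, respectively). The bucket shares are taken from the table above; the helper name is hypothetical.

```python
# Reported share of solves per iteration range: (first_iter, last_iter, share).
BUCKETS = [(1, 3, 0.60), (4, 7, 0.30), (8, 10, 0.10)]


def mean_iteration_bounds(buckets):
    """Return (lowest, midpoint, highest) mean iterations implied by the buckets."""
    low = sum(share * first for first, _, share in buckets)
    high = sum(share * last for _, last, share in buckets)
    mid = sum(share * (first + last) / 2 for first, last, share in buckets)
    return low, mid, high


if __name__ == "__main__":
    low, mid, high = mean_iteration_bounds(BUCKETS)
    print(f"mean iterations in [{low:.2f}, {high:.2f}], midpoint {mid:.2f}")
```

The bounds work out to 2.6 and 4.9 with a midpoint of 3.75, so a mean near 3.2 is compatible with the buckets provided that solves cluster toward the early end of each range.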
Convergence Model Caveat
Provenance: Author analysis (C). The source material describes the convergence pattern as "approximate exponential decay" with a "decay rate of ~0.5 per iteration." If taken literally, this would suggest that the conditional probability of solving at iteration $k$ given failure through $k{-}1$ follows:

$$P(\text{solve at } k \mid \text{no solve through } k{-}1) = p_0 \, e^{-\lambda (k-1)}$$
where $p_0$ is the first-iteration solve probability and $\lambda \approx 0.5$. However, this model should be treated as a rough heuristic description of the observed pattern, not as a validated statistical fit. The source material does not provide:
- Per-iteration solve histograms or raw data
- A fitted parameter estimate with confidence intervals
- A goodness-of-fit test or comparison to alternative models
- Whether $p_0$ and $\lambda$ vary by task difficulty tier
The geometric decay is plausible as an intuition—easier failure modes are fixed first, leaving progressively harder conceptual mismatches for later iterations—but it should not be cited as a quantitative law of this system without stronger empirical backing.
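One way to probe the heuristic is to simulate it. The sketch below assumes the conditional solve probability at iteration $k$ is $p_0 e^{-\lambda(k-1)}$, which is one literal reading of "exponential decay with rate ~0.5", and computes the implied share of solves in each iteration range. The value of $p_0$ is not reported in the source material, so any value passed in is illustrative.

```python
import math


def solve_distribution(p0: float, lam: float, max_iter: int = 10) -> list:
    """Probability of first solving at each iteration 1..max_iter.

    Assumes hazard h_k = p0 * exp(-lam * (k - 1)); this functional form
    is an interpretation, not a fitted model.
    """
    probs = []
    surviving = 1.0  # probability the task is still unsolved before iteration k
    for k in range(1, max_iter + 1):
        hazard = p0 * math.exp(-lam * (k - 1))
        probs.append(surviving * hazard)
        surviving *= 1.0 - hazard
    return probs


def bucket_shares(probs: list) -> tuple:
    """Share of eventual solves in iterations 1-3, 4-7 and 8-10."""
    total = sum(probs)
    return (
        sum(probs[0:3]) / total,
        sum(probs[3:7]) / total,
        sum(probs[7:10]) / total,
    )


if __name__ == "__main__":
    for p0 in (0.2, 0.3, 0.5):  # illustrative values only
        early, middle, late = bucket_shares(solve_distribution(p0, 0.5))
        print(f"p0={p0:.1f}  1-3: {early:.0%}  4-7: {middle:.0%}  8-10: {late:.0%}")
```

Comparing the simulated shares against the reported ~60/30/10 split gives a quick sense of whether a single decay rate can reproduce the observed pattern or whether the rate would have to vary across iterations or difficulty tiers.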
16.5 Sandbox Infrastructure
Executing LLM-generated code safely requires strong isolation. Confluence Labs uses E2B, a cloud service providing ephemeral, isolated execution environments (B). The system maintains a pool of 132 concurrent sandboxes (GEMINI_CLI_CONCURRENCY=132, documented in .env template; B), sized to support the throughput requirements of 12 agents each potentially running multiple iterations simultaneously across the full task set. The source material describes the concurrency value of 132 as representing 12 agents × 10 maximum iterations plus a ~10% overhead buffer for retries (B).
Concurrency is controlled by an asyncio semaphore that limits simultaneous sandbox executions (documented design B; specific implementation detail C). Each execution is wrapped in a 30-second timeout (SANDBOX_TIMEOUT=30, from .env template; B)—ARC transformations should complete in milliseconds, so this threshold catches infinite loops and degenerate solutions. Sandbox instances are closed in finally blocks to prevent resource leaks, and transient E2B API failures trigger automatic retries with exponential backoff (B).
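The combination of a concurrency cap, a per-execution timeout, and retries with exponential backoff can be sketched with the standard library alone. In this illustration `job` is a hypothetical coroutine factory standing in for a sandbox call, and `ConnectionError` stands in for a transient API failure; neither reflects the E2B SDK's actual interface or exception types.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_limits(
    job: Callable[[], Awaitable[str]],
    semaphore: asyncio.Semaphore,
    timeout_s: float = 30.0,
    max_retries: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """Run one job under a concurrency cap, a timeout, and a retry policy."""
    for attempt in range(max_retries + 1):
        async with semaphore:  # at most N jobs hold a slot at once
            try:
                return await asyncio.wait_for(job(), timeout=timeout_s)
            except asyncio.TimeoutError:
                # Treated as a final result (likely an infinite loop), not retried.
                return "timeout"
            except ConnectionError:
                if attempt == max_retries:
                    raise
        # Back off after releasing the slot so other jobs can proceed.
        await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")


async def main() -> None:
    semaphore = asyncio.Semaphore(132)  # the documented concurrency value

    async def fake_sandbox_job() -> str:
        await asyncio.sleep(0.01)  # placeholder for remote execution
        return "ok"

    results = await asyncio.gather(
        *(run_with_limits(fake_sandbox_job, semaphore) for _ in range(5))
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

Timeouts are returned rather than retried because rerunning the same generated code would time out again; only failures of the execution service itself are worth a second attempt.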
The test harness wraps the generated transform() function with assertions against all training pairs, capturing results as structured JSON:
```python
# Illustrative pseudocode: shows the documented test-harness approach
# for executing generated code against training pairs. NOT a verbatim
# excerpt from the repository. The actual sandbox logic likely resides
# in gemini-cli-solver/src/sandbox.py (per source docs).
import json


def build_test_harness(user_code, train_pairs):
    """Wrap generated code with training-pair assertions.

    The harness executes transform() on each training input,
    compares against expected output, and reports structured results.
    """
    # The template body stays at column 0 so the generated script has
    # valid top-level indentation. json.dumps output is a valid Python
    # literal here because grids contain only lists and integers.
    harness = f"""
{user_code}

import json
results = []
train_pairs = {json.dumps(train_pairs)}
for i, pair in enumerate(train_pairs):
    try:
        actual = transform(pair['input'])
        expected = pair['output']
        passed = (actual == expected)
        results.append({{
            'pair': i,
            'passed': passed,
            'expected': expected,
            'actual': actual if not passed else None
        }})
    except Exception as e:
        results.append({{
            'pair': i,
            'passed': False,
            'error': f"{{type(e).__name__}}: {{e}}"
        }})
print(json.dumps(results))
"""
    return harness
```
16.6 Key Results
The following results are reported by Confluence Labs. All figures in this section originate from the project README and documentation (provenance B) unless otherwise noted. No independent third-party verification of the 97.92% score or the cost figures is documented in the available source material.
| Metric | Value | Exact Source | Independently Verified? |
|---|---|---|---|
| Public eval accuracy | 97.92% (~392/400 tasks) | Repository README | Not documented |
| Cost per task | $11.77 | Repository README | Not documented |
| Total cost (400 tasks) | ~$4,708 | Derived: $11.77 × 400 | N/A (arithmetic) |
| Wall clock time | ~12 hours (full evaluation) | Configuration constraint (B) | N/A (config limit) |
| Average iterations to solve | ~3.2 | Source documentation | Not documented |
| Agents per task | 12 | .env config (GEMINI_CLI_AGENTS) | Config file (B) |
| Concurrent sandboxes | 132 | .env config (GEMINI_CLI_CONCURRENCY) | Config file (B) |
| Max iterations per agent | 10 | .env config (GEMINI_CLI_MAX_ITERATIONS) | Config file (B) |
Evaluation Protocol — What Is and Isn't Documented
Provenance: Author analysis (C) based on available source material.
The source material specifies that the 97.92% score is on the ARC-AGI-2 public evaluation set (400 tasks). The following evaluation details are not documented in the available sources:
- Model version pinning: The .env template specifies gemini-2.5-pro, but whether a specific model snapshot/version was used (and whether the API model has since been updated) is not stated.
- Temperature and sampling settings: The .env template shows GEMINI_TEMPERATURE=0.7 and GEMINI_MAX_TOKENS=8192, but whether these were the exact settings used for the reported result is not confirmed.
- Number of reruns: Whether the 97.92% figure represents a single run, the best of multiple runs, or an average across runs is not stated.
- Submission procedure: Whether the score was obtained via the official ARC-AGI-2 evaluation platform or computed locally against the public task set is not specified.
- Private evaluation set: No results on the private (held-out) 100-task set are reported.
- Variance: No confidence interval, standard deviation, or range across multiple runs is provided.
These gaps are common in repository-based benchmark claims and do not imply the result is incorrect, but they limit the strength of conclusions that can be drawn from the number.
16.6.1 Performance by Task Difficulty
The source material provides an approximate breakdown of performance across difficulty tiers. Provenance: (B) source documentation, with the following important caveats.
| Difficulty Tier | Approx. Tasks | Solve Rate | Avg. Iterations | Source |
|---|---|---|---|---|
| Easy (simple spatial transforms) | ~120 | ~100% | 1.2 | Source docs (B) |
| Medium (compositional rules) | ~160 | ~99% | 3.1 | Source docs (B) |
| Hard (multi-step, abstract) | ~80 | ~96% | 5.8 | Source docs (B) |
| Very Hard (novel concepts) | ~40 | ~90% | 8.2 | Source docs (B) |
Methodology caveat. These difficulty-tier numbers are reported in the source documentation but without detailed methodology for tier assignment. The approximate task counts sum to 400, consistent with the ARC-AGI-2 public evaluation set size. Critical unknowns include: (1) whether these tiers were assigned by the Confluence Labs team through manual review, automated clustering, or correspondence to an external classification; (2) what criteria define "easy" vs. "medium" vs. "hard" vs. "very hard"; (3) whether the solve rates and average iterations are per-tier averages or medians; and (4) whether these figures come from the same run that produced the 97.92% headline number. Without this methodology, these figures should be treated as self-reported approximate observations, not rigorous empirical measurements.
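A quick cross-check shows how the tier figures relate to the headline numbers. The sketch below weights each tier's reported solve rate and average iteration count by its approximate task count; the tier values are copied from the table above, and treating "Avg. Iterations" as a per-solved-task mean is an assumption.

```python
# (tier, approx. tasks, reported solve rate, reported avg. iterations)
TIERS = [
    ("Easy", 120, 1.00, 1.2),
    ("Medium", 160, 0.99, 3.1),
    ("Hard", 80, 0.96, 5.8),
    ("Very Hard", 40, 0.90, 8.2),
]


def aggregate(tiers):
    """Return (total tasks, overall solve rate, mean iterations over solved tasks)."""
    total = sum(count for _, count, _, _ in tiers)
    solved = sum(count * rate for _, count, rate, _ in tiers)
    mean_iters = sum(count * rate * iters for _, count, rate, iters in tiers) / solved
    return total, solved / total, mean_iters


if __name__ == "__main__":
    total, rate, mean_iters = aggregate(TIERS)
    print(f"tasks={total}  overall solve rate={rate:.2%}  mean iterations={mean_iters:.2f}")
```

The weighted solve rate comes out at about 97.8%, close to the 97.92% headline, while the weighted mean iteration count is about 3.5 rather than the reported ~3.2. Discrepancies of this size are unsurprising for figures labeled approximate, but they reinforce the point that the tier table should not be read as exact.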
16.6.2 Failure Mode Analysis
The approximately 2% of unsolved tasks (roughly 8 of 400) share common characteristics according to the source material (B):
- Highly novel spatial concepts — transformations involving spatial relationships with few analogues in typical programming tasks, outside the LLM's training distribution.
- Ambiguous rules — tasks where training examples are consistent with multiple transformation rules, and the correct rule is not the most "natural" one from a programming perspective.
- Large-scale counting or arithmetic — tasks requiring precise counting of complex features across large grids, where off-by-one errors compound.
- Recursive or self-referential patterns — transformations requiring meta-reasoning about the transformation itself.
These failure modes are consistent with known limitations of LLM code generation: the model struggles when the required program has no close analogues in its training data, or when the specification (training examples) is ambiguous.
16.7 Cost Analysis
At a reported $11.77 per task (B), the system represents a significant compute investment. The cost structure below is presented in the source documentation (B). However, the breakdown into sub-components is not accompanied by a detailed accounting methodology. Whether these figures come from API billing logs, estimates based on token counts, or back-of-the-envelope calculations is not stated.
| Cost Component | Est. per Task | Percentage | Source |
|---|---|---|---|
| Gemini API calls (12 agents × avg 3.2 iters) | $8.50 | 72.2% | Source docs (B) |
| E2B sandbox execution | $2.20 | 18.7% | Source docs (B) |
| Infrastructure overhead (networking, logging) | $0.80 | 6.8% | Source docs (B) |
| Miscellaneous (retries, failed executions) | $0.27 | 2.3% | Source docs (B) |
Provenance note: The sub-total ($8.50 + $2.20 + $0.80 + $0.27 = $11.77) is arithmetically consistent with the headline per-task cost, which adds internal coherence. However, the Gemini API component depends on exact pricing at the time of the evaluation run, which is not documented (Gemini API pricing has changed multiple times). The E2B cost depends on the pricing tier and instance type used. These figures should be understood as approximate cost allocations reported by the team, not independently auditable line items.
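The arithmetic behind the cost table is easy to reproduce. The following sketch sums the reported components, derives each component's share, and scales to a 400-task run; the component values are copied from the table above and the function name is hypothetical.

```python
# Reported per-task cost components in USD.
COST_COMPONENTS = {
    "Gemini API calls": 8.50,
    "E2B sandbox execution": 2.20,
    "Infrastructure overhead": 0.80,
    "Miscellaneous": 0.27,
}


def cost_summary(components: dict, tasks: int = 400):
    """Return (per-task total, share per component, full-run total)."""
    total = sum(components.values())
    shares = {name: value / total for name, value in components.items()}
    return total, shares, total * tasks


if __name__ == "__main__":
    total, shares, run_cost = cost_summary(COST_COMPONENTS)
    print(f"per task: ${total:.2f}   full run: ${run_cost:,.0f}")
    for name, share in shares.items():
        print(f"  {name}: {share:.1%}")
```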
The total cost for a full 400-task evaluation run is approximately $4,708 (arithmetic derivation from the per-task cost). The source material suggests this price point sits near the "knee" of the cost-performance curve (B): reducing cost below $5/task would likely drop accuracy below 95%, while increasing cost above $20/task would yield diminishing accuracy gains toward 98.5–99%. This cost-performance characterization is presented as the team's judgment (B); no ablation study systematically varying cost is documented.
Several cost reduction strategies are identified in the source material (B):
| Strategy | Estimated Savings | Accuracy Impact | Evidence Level |
|---|---|---|---|
| Reduce agents from 12 to 6 | ~50% LLM cost reduction | Reduced ensemble coverage | Theoretical (C) |
| Reduce max iterations from 10 to 5 | ~30% LLM cost reduction | Loss of hard-task solves | Source estimate (B) |
| Adaptive agent allocation (fewer for easy tasks) | 50%+ overall cost reduction | Minimal accuracy loss (per source) | Source estimate (B) |
| Use cheaper model variant | Variable | Degraded code quality | Source speculation (B) |
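To see what the LLM-cost reductions in the table mean for the overall per-task figure, the sketch below applies them to the reported breakdown. It assumes, purely for illustration, that the non-LLM components (sandbox, infrastructure, miscellaneous) stay fixed unless a separate reduction is passed in; the source does not say how those components scale with agent count. The function names are hypothetical.

```python
# Reported per-task cost components (USD) from the cost table.
LLM_COST = 8.50
OTHER_COST = 2.20 + 0.80 + 0.27  # sandbox + infrastructure + miscellaneous


def per_task_cost(llm_reduction, other_reduction=0.0):
    """Per-task cost after cutting LLM spend (and optionally other spend) by a fraction.

    ASSUMPTION: non-LLM costs are unchanged unless other_reduction is given.
    """
    if not (0.0 <= llm_reduction <= 1.0 and 0.0 <= other_reduction <= 1.0):
        raise ValueError("reductions must be fractions between 0 and 1")
    return LLM_COST * (1.0 - llm_reduction) + OTHER_COST * (1.0 - other_reduction)


def overall_saving(llm_reduction, other_reduction=0.0):
    """Fraction of the baseline per-task cost that the scenario saves."""
    baseline = per_task_cost(0.0)
    return 1.0 - per_task_cost(llm_reduction, other_reduction) / baseline


SCENARIOS = {
    "baseline (12 agents, 10 iterations)": 0.0,
    "6 agents instead of 12": 0.50,
    "5 max iterations instead of 10": 0.30,
}

if __name__ == "__main__":
    for label, reduction in SCENARIOS.items():
        cost = per_task_cost(reduction)
        print(f"{label}: ${cost:.2f}/task, {overall_saving(reduction):.1%} saved")
```

Under that fixed-overhead assumption, a 50% LLM cost reduction lowers the per-task cost to about $7.52 (roughly 36% overall), and a 30% reduction lowers it to about $9.22 (roughly 22% overall). If sandbox costs also scale with the number of agents, the overall savings would be larger.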
16.8 Comparison with Related Approaches
The ARC benchmark has attracted a diverse range of approaches. Confluence Labs' program synthesis approach is distinguished within this landscape by its scale of parallelism, its reliance on general-purpose code generation rather than DSL search, and its use of iterative feedback-driven refinement.
| Approach | Method Type | Key Strength | Key Weakness |
|---|---|---|---|
| Confluence Labs | LLM Program Synthesis | Expressive, interpretable, verifiable solutions | High cost, LLM-capability-bounded |
| DreamCoder-style | Neural-guided DSL Search | Efficient search with learned library | Limited by DSL expressiveness |
| End-to-end Neural | Direct Grid Prediction | Fast inference | Poor generalization to novel tasks |
| Brute-force DSL | Enumerative Search | Complete within DSL scope | Exponential complexity |
| Evolutionary (GP) | Genetic Programming | Flexible, no LLM dependency | Slow convergence |
| LLM-guided Evolution | Hybrid (e.g., Imbue) | Systematic, adaptive mutations | Fitness design complexity |
16.8.1 Relation to Evolutionary Systems Surveyed in This Book
Within the broader landscape of LLM-powered evolutionary systems covered in this survey, Confluence Labs occupies an interesting boundary position. Systems such as FunSearch (Chapter 6), AlphaEvolve (Chapter 4), and OpenEvolve (Chapter 5) use LLMs as mutation operators within an explicit evolutionary loop—maintaining populations, applying selection pressure, and tracking fitness across generations. Confluence Labs, by contrast, uses the LLM as a direct solver with iterative self-refinement but without population-based search.
To make this comparison precise, we evaluate the system against the five defining characteristics of evolutionary program synthesis:
| Evolutionary Characteristic | FunSearch / AlphaEvolve | Confluence Labs |
|---|---|---|
| Population | Maintained across generations; programs compete and persist | No population. Each agent maintains a single candidate at a time; no cross-generation memory |
| Selection | Fitness-proportional or tournament-based parent selection | Binary: pass all training pairs or fail. No graded fitness, no selection among partial solutions |
| Variation (mutation) | LLM-generated edits to parents; crossover between candidates | LLM-generated revision conditioned on error feedback. No crossover between agents |
| Fitness landscape | Continuous or ordinal fitness guiding incremental improvement | Binary correctness. A partially correct program receives the same "fail" signal as a completely wrong one (error diagnostics provide qualitative feedback but not a scalar fitness) |
| Archive / memory | MAP-Elites grid, island databases, program libraries | No persistent archive. Each agent's history is its own conversation context; no cross-agent knowledge sharing |
The system can be viewed as a degenerate case of evolutionary search where population size equals 1 per agent, selection is binary, and mutation is performed entirely by the LLM conditioned on error feedback. Despite the absence of explicit evolutionary mechanisms, the hypothesis-test-refine cycle mirrors the generate-evaluate-select loop, the multi-agent ensemble provides diversity analogous (though not equivalent) to population-based search, and iterative refinement with feedback implements a form of directed variation.
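The "population of one with binary selection" view can be made concrete with a minimal sketch. This is not repository code: the `Proposer` interface is a hypothetical stand-in for the LLM call, and the toy proposer simply tries the identity transformation first and a left-right flip once it has received feedback.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]
# Hypothetical proposer interface: takes feedback text (None on the first call)
# and returns a candidate program. In the described system an LLM plays this role.
Proposer = Callable[[Optional[str]], Program]


def evaluate(program: Program, pairs: List[Tuple[Grid, Grid]]) -> Tuple[bool, Optional[str]]:
    """Binary fitness: True only if every training pair is reproduced exactly."""
    for index, (grid_in, expected) in enumerate(pairs):
        try:
            actual = program(grid_in)
        except Exception as exc:  # a crashing candidate is simply a failing candidate
            return False, f"pair {index}: raised {type(exc).__name__}: {exc}"
        if actual != expected:
            return False, f"pair {index}: expected {expected}, got {actual}"
    return True, None


def refine(propose: Proposer, pairs: List[Tuple[Grid, Grid]], max_iters: int = 10) -> Optional[Program]:
    """Population-of-one search: hold a single candidate and revise it from feedback."""
    feedback = None
    for _ in range(max_iters):
        candidate = propose(feedback)                   # variation: revision from feedback
        passed, feedback = evaluate(candidate, pairs)   # selection: pass or fail, no score
        if passed:
            return candidate
    return None


def make_toy_proposer() -> Proposer:
    """Stand-in for the LLM: identity first, then a left-right flip after any feedback."""
    def propose(feedback: Optional[str]) -> Program:
        if feedback is None:
            return lambda grid: grid
        return lambda grid: [list(reversed(row)) for row in grid]
    return propose


TRAIN_PAIRS = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]

if __name__ == "__main__":
    solution = refine(make_toy_proposer(), TRAIN_PAIRS)
    print("solved:", solution is not None)
```

Note that the only thing carried between iterations is the feedback string. Nothing is ranked or archived, which is exactly the sense in which the loop lacks a population and a graded fitness.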
16.8.2 What the Comparison Reveals
The Confluence Labs result suggests that for well-structured reasoning tasks with clear verification, brute-force parallelization of LLM inference—enough agents, enough iterations, enough compute—can match or exceed more sophisticated search strategies. This raises a question for the evolutionary AI community: when is population-based search necessary, and when is parallel independent restart sufficient?
The answer likely depends on the per-attempt success probability $p$. For tasks where $p$ is roughly 0.2–0.3 or higher, ensemble amplification alone may suffice: with 12 fully independent agents, the probability that at least one succeeds is $1 - (1 - p)^{12}$, about 0.93 at $p = 0.2$ and 0.99 at $p = 0.3$, which leaves some margin for correlation between agents. For tasks with very low per-attempt probability (open-ended optimization, algorithm discovery where the search space is vast and solutions are sparse), evolutionary mechanisms providing population-level memory, crossover, and guided exploration of the fitness landscape likely become essential. The Confluence Labs system does not address this regime; its binary feedback signal and lack of graded fitness would provide no gradient toward distant solutions.
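The dependence on $p$ is easy to explore numerically. The sketch below assumes fully independent agents, which the source does not establish, so the results are an optimistic bound for any correlated ensemble. Both function names are illustrative.

```python
import math


def ensemble_success(p: float, n_agents: int) -> float:
    """P(at least one of n independent agents succeeds) = 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (1.0 - p) ** n_agents


def agents_needed(p: float, target: float) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, for 0 < p < 1 and 0 < target < 1."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    # Guard against floating-point error in the logarithm ratio.
    while n > 1 and ensemble_success(p, n - 1) >= target:
        n -= 1
    while ensemble_success(p, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.3):
        print(f"p={p}: 12 agents -> {ensemble_success(p, 12):.3f}, "
              f"agents for 95% -> {agents_needed(p, 0.95)}")
```

The required agent count grows quickly as $p$ falls: reaching 95% takes 9 independent agents at $p = 0.3$ and 14 at $p = 0.2$, and far more in the low-$p$ regime. That regime is where independent restarts stop being a practical substitute for guided search.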
16.9 Limitations & Discussion
16.9.1 LLM Capability Ceiling
The system's performance is fundamentally bounded by the capabilities of the underlying LLM (Google Gemini). If the model cannot conceive of a particular transformation rule—because it has no close analogue in its training distribution—no amount of iterative refinement or ensemble scaling will produce the correct solution. This is not a limitation that can be addressed architecturally; it requires improvements in the base model.
16.9.2 Cost and Scalability
At a reported $11.77 per task and ~$4,708 per full evaluation (B), the approach is viable for benchmark competition and research but would be prohibitively expensive for high-volume production applications. LLM API costs are trending downward, but the linear scaling with agents and iterations means cost reduction requires either algorithmic improvements (adaptive agent allocation) or cheaper models.
16.9.3 Generalization to Private Evaluation
The 97.92% figure is reported on the public evaluation set (400 tasks). Performance on the private evaluation set (100 held-out tasks with potentially different difficulty distribution) is not reported in the available sources. As with any benchmark result, there is a risk that the public set's difficulty distribution does not represent the full range of ARC-AGI-2 challenges.
16.9.4 Reproducibility
LLM outputs can be non-deterministic even at temperature 0, due to implementation details of floating-point arithmetic and batching, and the documented .env template sets GEMINI_TEMPERATURE=0.7, so sampling is stochastic by design. Exact reproduction of the 97.92% score is therefore not guaranteed across runs, although statistical consistency is expected. The MIT license and public repository enable full reproduction of the system architecture and configuration, making the approach structurally reproducible even if individual run outcomes vary.
Reproducibility and Verification Checklist
What a third party would need to reproduce the public evaluation number:
| Requirement | Status | Notes |
|---|---|---|
| Source code | Available | MIT license, github.com/confluence-labs/arc-agi-2 |
| Dependencies | Available | uv.lock provides pinned dependency versions |
| ARC-AGI-2 public tasks | Available | 400-task public evaluation set is publicly distributed |
| Gemini API access | Required | Requires Google API key with access to gemini-2.5-pro |
| E2B API access | Required | Requires E2B API key (paid service) |
| API cost | ~$4,700 | Based on reported per-task cost; actual cost depends on current API pricing |
| Model version pinning | Not specified | Gemini model may have been updated since the reported run; exact snapshot unknown |
| Temperature / sampling | Documented in .env | GEMINI_TEMPERATURE=0.7, GEMINI_MAX_TOKENS=8192 (from .env template) |
| Number of runs for reported score | Not specified | Unknown whether 97.92% is single-run, best-of-N, or average |
| Expected variance | Not reported | LLM non-determinism means exact reproduction unlikely; range unknown |
| Hardware requirements | Modest (API-based) | No GPU needed; requires stable internet for API calls |
| Wall clock time | ~12 hours | Constrained by WALL_CLOCK_LIMIT=43200 seconds |
Key stochasticity caveats: Even with identical code and configuration, exact result reproduction is unlikely because: (1) the Gemini API may return different completions across runs due to non-deterministic sampling; (2) the model checkpoint behind the gemini-2.5-pro endpoint may have been updated since the original evaluation; (3) E2B sandbox execution timing may vary, potentially affecting timeout behavior on edge cases; (4) asyncio task scheduling order is not deterministic, which could affect which agents complete first and how the wall-clock budget is allocated. A reproducer should expect results within a statistical neighborhood of 97.92% but should not treat exact match as expected.
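A rough sense of what "statistical neighborhood" could mean comes from a simple binomial model. The sketch below treats each of the 400 tasks as an independent trial with the same success probability, which is a simplifying assumption made here, not something the source reports. Correlated failures or a changed model checkpoint could move results outside this range, so the output indicates scale only.

```python
import math


def binomial_spread(accuracy: float, n_tasks: int, z: float = 1.96):
    """Normal-approximation standard error and z-interval for a proportion.

    ASSUMPTION: tasks are independent trials with a common success probability.
    Returns (standard_error, lower_bound, upper_bound), bounds clipped to [0, 1].
    """
    if not 0.0 <= accuracy <= 1.0 or n_tasks <= 0:
        raise ValueError("accuracy must be in [0, 1] and n_tasks must be positive")
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)
    lower = max(0.0, accuracy - z * std_err)
    upper = min(1.0, accuracy + z * std_err)
    return std_err, lower, upper


if __name__ == "__main__":
    se, lower, upper = binomial_spread(0.9792, 400)
    print(f"standard error: {se:.4f}")
    print(f"approx. 95% range: {lower:.4f} to {upper:.4f}")
    print(f"in tasks: +/- {1.96 * se * 400:.1f} of 400")
```

Under these assumptions the standard error is about 0.7 percentage points, giving a range of roughly 96.5% to 99.3%, or a handful of tasks either way.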
16.9.5 Philosophical Considerations
A deeper question raised by this work: does the system exhibit genuine "abstract reasoning" in the sense intended by the ARC benchmark? Chollet designed ARC to measure fluid intelligence—the ability to solve genuinely novel problems. The LLM-based approach converts novel reasoning problems into coding problems, which the model solves using crystallized knowledge from its training data. Each "novel" transformation rule is recognized and expressed through recombination of programming patterns the model has already seen.
This is effective on the reported numbers, but it challenges whether ARC is measuring what it was designed to measure. If LLMs can reach roughly 98% accuracy by casting abstract reasoning as program synthesis, then either (a) program synthesis is a valid form of abstract reasoning, or (b) the benchmark is measuring a different capability than intended. This remains an open question in the AI research community.
16.10 Strategic Vision
Confluence Labs positions its ARC solver as a proof of concept for a broader research agenda in data-efficient scientific modeling. The source material (B) describes three target domains for extending the program synthesis framework:
- Hardware engineering: Automated synthesis of digital logic circuits or HDL programs from behavioral specifications.
- Biology: Generating computational models of biological processes from limited experimental observations (rare diseases, novel organisms).
- Materials science: Discovering structure-property relationships in novel materials through automated hypothesis generation and testing.
The team is also exploring hybrid approaches combining LLM-based program synthesis with Bayesian model selection (B). In this paradigm, the LLM generates candidate programs (the hypothesis space), while Bayesian methods handle uncertainty quantification and active learning (selecting the most informative next experiment). This would extend the system beyond passive inference toward active experimental design.
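The division of labor described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the candidate programs are hand-written integer functions rather than LLM-generated grid programs, the prior is a simple $2^{-\text{length}}$ description-length penalty, and the next experiment is chosen by maximizing predictive entropy. The source does not specify any of these choices.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# (name, program, description length) -- hypothetical hand-written candidates.
Hypothesis = Tuple[str, Callable[[int], int], int]

HYPOTHESES: List[Hypothesis] = [
    ("double", lambda x: x * 2, 3),
    ("square", lambda x: x * x, 3),
    ("add_two", lambda x: x + 2, 3),
]


def posterior(hypotheses: Sequence[Hypothesis],
              observations: Sequence[Tuple[int, int]],
              noise: float = 1e-3) -> Dict[str, float]:
    """Posterior over candidates: prior 2^-length, near-deterministic likelihood."""
    weights = {}
    for name, program, length in hypotheses:
        weight = 2.0 ** (-length)
        for x, y in observations:
            weight *= (1.0 - noise) if program(x) == y else noise
        weights[name] = weight
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def predictive_entropy(hypotheses: Sequence[Hypothesis],
                       post: Dict[str, float], x: int) -> float:
    """Entropy (bits) of the posterior-weighted distribution over outputs at x."""
    mass = defaultdict(float)
    for name, program, _ in hypotheses:
        mass[program(x)] += post[name]
    return -sum(m * math.log2(m) for m in mass.values() if m > 0.0)


def next_query(hypotheses: Sequence[Hypothesis],
               post: Dict[str, float], candidates: Sequence[int]) -> int:
    """Active learning step: pick the input whose outcome is currently least certain."""
    return max(candidates, key=lambda x: predictive_entropy(hypotheses, post, x))


if __name__ == "__main__":
    post = posterior(HYPOTHESES, [(2, 4)])       # all three candidates fit (2, 4)
    query = next_query(HYPOTHESES, post, [0, 2, 3])
    print("posterior:", post, "next query:", query)
    print("after observing (3, 6):", posterior(HYPOTHESES, [(2, 4), (3, 6)]))
```

In this toy setup a single observation leaves the three candidates tied, and the entropy criterion selects the one input on which all three disagree, so one further observation is enough to identify the rule.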
| Future Direction | Approach | Expected Impact | Source |
|---|---|---|---|
| Agent diversity | Different prompts, temperatures, models per agent | Improved ensemble diversity | Source docs (B) |
| Cross-agent communication | Share partial solutions between agents | Faster convergence on hard tasks | Source docs (B) |
| Meta-learning | Learn from solved tasks to improve prompts | Transfer across task categories | Source docs (B) |
| Hybrid search | Combine LLM synthesis with symbolic search | Better coverage of unusual transformations | Source docs (B) |
| Adaptive allocation | Fewer agents for easy tasks, more for hard | 50%+ cost reduction (per source) | Source docs (B) |
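The adaptive allocation row can be made concrete with a two-stage schedule: run a few agents first and launch the rest only if none of them passes. The sketch below is a hypothetical scheme, not the team's design, and it assumes independent agents with a common per-agent success probability `p`. Real tasks vary in difficulty, so actual savings would concentrate on easy tasks.

```python
def expected_agents_staged(p: float, first_stage: int, second_stage: int) -> float:
    """Expected agent-runs per task when stage two runs only if stage one fails.

    ASSUMPTION: agents are independent with per-agent success probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if first_stage < 0 or second_stage < 0:
        raise ValueError("stage sizes must be non-negative")
    p_first_stage_fails = (1.0 - p) ** first_stage
    return first_stage + p_first_stage_fails * second_stage


def staged_saving(p: float, first_stage: int = 3, total_agents: int = 12) -> float:
    """Fraction of agent-runs saved relative to always launching total_agents."""
    expected = expected_agents_staged(p, first_stage, total_agents - first_stage)
    return 1.0 - expected / total_agents


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(f"p={p}: expected agents {expected_agents_staged(p, 3, 9):.2f}, "
              f"saving {staged_saving(p):.1%}")
```

With a 3-then-9 schedule the expected saving is about 49% of agent-runs at `p = 0.3` but only about 20% at `p = 0.1`. Under the independence assumption the probability of eventually solving a task is unchanged, since all 12 agents still run when needed; the price is added latency on hard tasks.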
16.11 Summary
Chapter Summary
Key takeaway: Confluence Labs reports 97.92% accuracy on ARC-AGI-2 (public evaluation set, self-reported) by reframing abstract reasoning tasks as program induction problems and applying massive parallel LLM code generation with iterative feedback-driven refinement. If it holds up under independent verification, the result suggests that brute-force scaling of well-structured LLM inference can reach near-ceiling performance on benchmarks designed to resist statistical learning.
Main contribution to the field: A principled framework (three design principles: structural alignment, extended reasoning, measurable feedback) for maximizing LLM performance on abstract reasoning through program synthesis, supported by a complete open-source implementation (MIT license) and a self-reported cost figure ($11.77/task) whose sub-component methodology is undocumented. The ensemble amplification argument, which converts a modest per-agent success rate into high ensemble accuracy through 12 parallel agents, provides a broadly applicable scaling strategy, though its effectiveness in this system depends on the unmeasured degree of inter-agent correlation.
What a researcher should know: This system operates at the boundary between evolutionary program synthesis and parallel LLM inference. It achieves its reported ARC results without population-based search, crossover, or fitness landscapes—raising the question of when explicit evolutionary mechanisms are necessary versus when parallel independent restart with feedback is sufficient. The system is architecturally simple and extensible (MIT license), making it a strong baseline for LLM-driven abstract reasoning research.
Evidence caveats: All reported performance numbers are self-reported from the repository README with no documented independent verification. The evaluation protocol is incompletely specified: model version pinning, the number of reruns, and the submission procedure are not documented, and the sampling temperature is known only from the .env template, not confirmed for the reported run. The difficulty-tier breakdown and cost sub-component allocation lack detailed methodology. The ensemble-success calculation assumes agent independence, but empirical correlation statistics are not reported. Code examples in this chapter are illustrative pseudocode faithful to the documented architecture, not verbatim repository excerpts. Readers seeking to verify specific implementation claims should consult the repository directly at github.com/confluence-labs/arc-agi-2.