Introduced: 2025-12

Score: 7.92/10 (Draft)

Chapter 16

Confluence Labs

Part P04: Program Synthesis & ARC-AGI Solvers

16.1 Overview & Motivation

The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, was designed as a benchmark for fluid intelligence—the capacity to solve genuinely novel problems that resist memorization and pattern matching. ARC tasks present a small number of input-output grid pairs (typically 2–5) demonstrating an unknown transformation rule, and the solver must infer this rule and apply it to unseen test inputs. By 2025, ARC-AGI-2 had emerged as the harder second iteration of this benchmark, with curated tasks demanding more complex compositional reasoning, a reduced guess limit (two attempts instead of three), and a 12-hour wall-clock constraint for the full evaluation.

Confluence Labs, an AI research lab founded by Brent and Niranjan with backing from Y Combinator, approached ARC-AGI-2 through a distinctive lens: learning efficiency—the ability to acquire new capabilities from minimal data. Rather than training end-to-end neural networks to predict output grids directly, Confluence Labs recast each ARC task as a program induction problem: find executable Python code that maps input grids to output grids. Their system achieved a reported 97.92% accuracy on the ARC-AGI-2 public evaluation set (approximately 392 of 400 tasks) at a reported cost of $11.77 per task. These figures are self-reported in the project README (github.com/confluence-labs/arc-agi-2); no independent third-party verification is documented in the available sources.

Key Contribution

Confluence Labs demonstrates that large language models, when used as program synthesizers rather than direct predictors, can achieve near-ceiling performance on abstract reasoning benchmarks designed to resist statistical learning. The system's three design principles—structural alignment with LLM training distributions, extended reasoning horizons via iterative refinement, and measurable feedback through executable verification—provide a reusable framework for applying LLM code generation to hypothesis-driven reasoning tasks beyond grid puzzles.

The system is open-source under the MIT license (repository: github.com/confluence-labs/arc-agi-2), built in Python 3.11+ with uv as the package manager, Google Gemini as the LLM backend, and E2B for sandboxed code execution. This chapter examines the architecture, algorithms, cost structure, and limitations of the Confluence Labs solver in detail.

Source Provenance Note

This chapter draws on three categories of evidence: (A) Repository-verified—details confirmed by examining the public GitHub repository structure, README, configuration files, and any source code; (B) README/documentation-reported—claims stated in the project README or associated documentation but not independently verified; and (C) Author analysis—the survey author's own formalization, interpretation, or reconstruction of mechanisms described at a high level in the source material. Each major claim is tagged with its provenance category. All code examples in this chapter are illustrative pseudocode faithful to the documented architecture unless explicitly noted as verbatim repository excerpts.

16.1.1 The Program Synthesis Paradigm

Program synthesis—the automatic generation of programs from specifications—has deep roots in computer science, from Summers (1977) and inductive logic programming through modern neural-guided search. Traditional approaches use domain-specific languages (DSLs) and enumerative or constraint-based search. Confluence Labs instead leverages LLMs as neural program synthesizers: the model's training on millions of code examples provides an implicit prior over likely programs, dramatically narrowing the search space compared to brute-force enumeration.

What distinguishes this approach from earlier ARC solvers is its reliance on the code-generation ability of modern LLMs. To write a Python function that solves an ARC task, the model must capture the spatial relationships in the grid, identify the transformation rule, and express that rule as executable logic. The generated code is then an explicit representation of the inferred rule: it can be executed against the training pairs (verifiable) and read by a human (interpretable), which a directly predicted output grid cannot offer.

16.1.2 Theoretical Framing

The approach can be loosely understood through an analogy to Solomonoff induction (Solomonoff, 1964). Each candidate program represents a hypothesis about the underlying generating process. The LLM serves as an approximation to a universal prior, biased toward simple, human-readable programs. The iterative refinement loop implements a form of posterior updating, where programs that fail on training examples are revised to incorporate the observed evidence.

Analogy Caveat

Provenance: Author analysis (C). The Solomonoff framing is offered as conceptual intuition, not as a formal theoretical result. Classical Solomonoff induction is incomputable and defined over all computable hypotheses; the LLM-based system operates over a heavily restricted, empirically-shaped subset of Python programs. The analogy is useful for understanding why the approach works—program length serves as an implicit complexity prior, and iterative refinement approximates Bayesian updating—but it should not be read as a claim that the system implements or approximates Solomonoff induction in any formal sense. No citation in the Confluence Labs documentation makes this connection; it is the survey author's interpretive framing.

16.2 Architecture

The Confluence Labs solver is organized as a modular pipeline centered on a core solver engine that orchestrates multiple parallel agents, each operating within isolated sandbox environments. The architecture scales along three axes: multiple agents per task, multiple refinement iterations per agent (sequential within an agent), and multiple concurrent sandboxes across the task set.

16.2.1 Repository Structure

The repository (arc-agi-2/) is organized around the core solver engine, with configuration managed through environment variables and execution controlled by a shell entry point. The following layout is described in the source documentation. Provenance: (B) README/documentation-reported. The exact internal file names (e.g., solver.py, agent.py, ensemble.py) appear in the source material's directory listing, but without independent verification of the repository contents at a specific commit hash, they should be treated as documentation-stated rather than code-verified.

Path	Purpose	Verification Status
`gemini-cli-solver/src/`	Core solver engine source code	Documented in README
`gemini-cli-solver/src/solver.py`	Main solver orchestrator: task distribution, agent lifecycle, output collection	Stated in source material directory listing
`gemini-cli-solver/src/agent.py`	Individual agent implementation with iterative refinement loop	Stated in source material directory listing
`gemini-cli-solver/src/sandbox.py`	E2B sandbox management: pool, execution, timeout enforcement	Stated in source material directory listing
`gemini-cli-solver/src/ensemble.py`	Ensemble aggregation and output selection (majority voting)	Stated in source material directory listing
`gemini-cli-solver/src/prompts/`	Prompt templates for task formatting and refinement	Stated in source material directory listing
`.env`	Runtime configuration (agent count, iterations, concurrency, API keys)	Documented in README
`run.sh`	Entry point with `--smoke` and `--full` modes	Documented in README

Verification note. To confirm these paths, a reader should clone the repository at github.com/confluence-labs/arc-agi-2 and inspect the directory tree. The source material's description is consistent and internally coherent, but the survey author has not independently verified every file name against a specific commit hash. Throughout this chapter, where the text says a module "handles" or "implements" a function, the attribution is based on the documented purpose of that file, not a line-by-line code audit.

16.2.2 Technology Stack

Component	Technology	Role	Source
Language	Python 3.11+	Primary implementation	README (B)
Package Manager	uv	Dependency resolution, lockfile (`uv.lock`)	README (B)
LLM Backend	Google Gemini API (gemini-2.5-pro)	Code generation and iterative refinement	README + .env template (B)
Sandbox	E2B	Isolated, ephemeral code execution environments	README (B)
Configuration	.env files	Runtime parameter management	README (B)
Orchestration	Shell (run.sh) + asyncio	Entry point and concurrency control	README (B)

16.3 Core Principles and Algorithms

The system is built on three foundational design principles that collectively address the question: how should abstract reasoning problems be presented to and solved by LLMs? These principles are explicitly articulated in the source documentation (B) and directly determine architectural choices in the solver pipeline.

16.3.1 Principle 1: Structural Alignment with Training Distributions

The first principle recognizes that the format in which a problem is presented to an LLM substantially affects solution quality. LLMs are not general-purpose reasoning engines but pattern matchers trained on specific data distributions. By reformulating ARC tasks to resemble the coding challenges, code review sessions, and technical documentation that dominate LLM training corpora, the system maximizes the probability that learned representations will transfer to the task.

Concretely, this means representing grids as Python nested lists, framing the task as "write a Python function transform(input_grid)," and priming the model with standard programming idioms (numpy operations, list comprehensions, coordinate manipulations). The prompt structure mirrors coding challenge formats that the model has encountered millions of times during pre-training.

# Illustrative pseudocode — faithful to the documented architecture
# but NOT a verbatim excerpt from the repository.
# The actual prompt construction likely resides in
# gemini-cli-solver/src/prompts/task_prompt.py (per source docs).

def format_task_for_prompt(train_pairs, test_inputs):
    """Structure an ARC task as a coding-challenge prompt.

    Key design choice: grids are rendered as Python lists, not custom
    formats, so the prompt resembles a competitive programming statement.
    """
    lines = ["# ARC Task: Transform input grids to output grids\n"]
    lines.append("## Training Examples\n")
    for i, pair in enumerate(train_pairs):
        lines.append(f"### Example {i + 1}")
        lines.append(f"Input:\n{_format_grid(pair['input'])}")
        lines.append(f"Output:\n{_format_grid(pair['output'])}")

    # Assumption: the test inputs are shown to the model as well;
    # the source docs do not say whether the real prompt includes them.
    lines.append("\n## Test Inputs\n")
    for i, grid in enumerate(test_inputs):
        lines.append(f"### Test {i + 1}\n{_format_grid(grid)}")

    lines.append("\n## Task")
    # Keep the signature on one line so the inline code span stays intact.
    lines.append(
        "Write a Python function "
        "`transform(input_grid: list[list[int]]) -> list[list[int]]` "
        "that correctly transforms ALL training inputs to their "
        "corresponding outputs."
    )
    lines.append("\n## Guidelines")
    lines.append("1. Analyze training examples carefully")
    lines.append("2. Identify what changes between input and output")
    lines.append("3. Look for spatial patterns, symmetries, color rules")
    lines.append("4. Write clean, readable Python (numpy allowed)")
    lines.append("5. Do NOT hardcode for specific examples; generalize")
    return "\n".join(lines)

def _format_grid(grid):
    """Render one grid as one bracketed row per line."""
    return "\n".join(
        "[" + ", ".join(str(c) for c in row) + "]"
        for row in grid
    )

16.3.2 Principle 2: Extended Reasoning Horizons

Complex ARC tasks require chains of reasoning that exceed what an LLM can accomplish in a single forward pass. The system addresses this through a multi-iteration architecture: each of 12 agents can refine its solution across up to 10 iterations, and the overall system runs for up to 12 hours. This converts a single-shot inference problem into a multi-step search problem with feedback.

The iterative approach encourages progressive complexity: agents start with simple hypotheses and add complexity only when simpler approaches fail, implementing an informal Occam's razor. The source documentation (B) reports that approximately 60% of eventually-solved tasks are resolved within the first 3 iterations, with the remaining 40% requiring deeper reasoning across iterations 4–10.

16.3.3 Principle 3: Measurable Feedback Loops

The program synthesis framing provides an unusually clean feedback signal: a generated program either produces the correct output for a given input or it does not. This binary correctness signal, augmented with structured error diagnostics (stack traces, cell-level diffs between expected and actual outputs), guides the iterative refinement process. Provenance: (B) documented principle; (C) error-category taxonomy below is the survey author's systematization of the described feedback types.

Error Category	Information Provided to LLM	Typical Resolution
Syntax Error	Python traceback with line number	Fix syntax; usually resolved in one iteration
Runtime Error	Exception type, message, traceback	Fix logic errors (index bounds, type mismatches)
Wrong Output	Expected vs. actual grids, cell-level diff	Revise transformation rule hypothesis
Partial Match	Percentage correct, mismatched regions	Adjust edge cases while preserving core logic
Timeout	Execution exceeded 30s limit	Optimize algorithm or fix infinite loop
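The taxonomy above could be applied mechanically to each per-pair execution result. The following is a minimal sketch, not the repository's logic: the function name `classify_feedback`, the result keys (`timed_out`, `error`, `passed`, `expected`, `actual`), and the 50% cut-off separating a partial match from a wrong output are all assumptions made for illustration.

```python
from typing import Any


def classify_feedback(result: dict[str, Any]) -> str:
    """Map one per-pair execution result to a feedback category.

    Hypothetical result keys: 'timed_out', 'error', 'passed',
    'expected', 'actual'. None of these names come from the repository.
    """
    if result.get("timed_out"):
        return "timeout"
    error = result.get("error")
    if error:
        # A SyntaxError surfaces before any training pair can run.
        return "syntax_error" if "SyntaxError" in error else "runtime_error"
    if result.get("passed"):
        return "pass"

    expected, actual = result["expected"], result["actual"]
    same_shape = (
        isinstance(actual, list)
        and len(actual) == len(expected)
        and all(
            isinstance(a, list) and len(a) == len(e)
            for a, e in zip(actual, expected)
        )
    )
    if not same_shape:
        return "wrong_output"

    total = sum(len(row) for row in expected)
    correct = sum(
        a == e
        for a_row, e_row in zip(actual, expected)
        for a, e in zip(a_row, e_row)
    )
    # The 50% threshold is an arbitrary illustrative cut-off.
    if total and correct / total >= 0.5:
        return "partial_match"
    return "wrong_output"
```

Ordering matters in a dispatcher like this: timeouts and exceptions are checked before any grid comparison, because in those cases there is no `actual` grid to diff.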

16.3.4 The Multi-Agent Ensemble

The system deploys $n = 12$ independent agents per task, each running up to $k = 10$ refinement iterations (these parameters are documented in the .env configuration; provenance B). The ensemble is motivated by a standard probabilistic argument. Let $p$ denote the probability that a single agent solves a given task. Assuming agent independence, the probability that at least one succeeds is:

$$P(\text{at least one success}) = 1 - (1 - p)^{n}$$

This is a standard result from probability theory (not specific to this system). For $n = 12$ and $p = 0.30$ (an illustrative value; the actual per-agent success rate is not reported in the source material):

$$P = 1 - (1 - 0.30)^{12} = 1 - 0.70^{12} \approx 1 - 0.0138 \approx 0.986$$

The table below shows how ensemble probability scales with agent count for the illustrative case of $p = 0.30$:

Agents ($n$)	$P(\geq 1 \text{ success})$	Avg. Gain per Added Agent (percentage points)
1	30.0%	n/a
2	51.0%	+21.0
4	76.0%	+12.5
6	88.2%	+6.1
8	94.2%	+3.0
10	97.2%	+1.5
12	98.6%	+0.7
16	99.7%	+0.3
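The scaling calculation can be reproduced in a few lines. This sketch evaluates $1 - (1 - p)^n$ for the illustrative $p = 0.30$ and reports the average gain per added agent between consecutive rows; the function name `p_at_least_one` is chosen here for illustration.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent agents succeeds."""
    return 1.0 - (1.0 - p) ** n


p = 0.30  # illustrative per-agent success rate, not a reported figure
agent_counts = [1, 2, 4, 6, 8, 10, 12, 16]

previous = None
for n in agent_counts:
    prob = p_at_least_one(p, n)
    if previous is None:
        gain = "n/a"
    else:
        prev_n, prev_prob = previous
        # Average gain per added agent since the previous row,
        # expressed in percentage points.
        gain = f"{100 * (prob - prev_prob) / (n - prev_n):+.1f} pp"
    print(f"n={n:2d}  P={100 * prob:5.1f}%  gain/agent={gain}")
    previous = (n, prob)
```

Because the failure probability shrinks geometrically, each additional agent buys less than the one before it, which is why the curve flattens well before $n = 16$.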

Independence Assumption and Correlated Agents

Provenance: Author analysis (C). The calculation above assumes statistical independence between agents, which is not satisfied in practice. All agents use the same LLM (Gemini 2.5 Pro), receive the same task description, and operate under the same prompt template. This introduces positive correlation between agent outcomes: if one agent fails because the task requires a concept outside the model's training distribution, other agents are likely to fail for the same reason.

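Under a simple equicorrelated model where each pair of agents shares correlation $\rho$, the effective number of independent trials is approximately $n_{\text{eff}} = n / (1 + (n-1)\rho)$. This expression is borrowed from the variance of a mean of correlated variables, so substituting it into $1 - (1 - p)^{n_{\text{eff}}}$ is a heuristic rather than an exact result. Even so, it shows that modest correlation substantially degrades the ensemble benefit:

Pairwise $\rho$	$n_{\text{eff}}$ (of 12)	$P(\geq 1) = 1 - (1-p)^{n_{\text{eff}}}$ at $p = 0.30$
0.0 (independent)	12.0	98.6%
0.1	5.7	87.0%
0.3	2.8	63.0%
0.5	1.8	48.2%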
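The heuristic above can be evaluated directly. The sketch below plugs a non-integer $n_{\text{eff}}$ into $1 - (1 - p)^{n_{\text{eff}}}$; the function names are illustrative, and the calculation inherits every limitation of the equicorrelated approximation.

```python
def effective_agents(n: int, rho: float) -> float:
    """Effective number of independent agents under pairwise correlation rho."""
    return n / (1.0 + (n - 1) * rho)


def p_at_least_one_correlated(p: float, n: int, rho: float) -> float:
    """Heuristic ensemble success probability using n_eff in place of n."""
    return 1.0 - (1.0 - p) ** effective_agents(n, rho)


p, n = 0.30, 12  # illustrative per-agent rate and the documented agent count
for rho in (0.0, 0.1, 0.3, 0.5):
    n_eff = effective_agents(n, rho)
    prob = p_at_least_one_correlated(p, n, rho)
    print(f"rho={rho:.1f}  n_eff={n_eff:4.1f}  P(>=1)={100 * prob:5.1f}%")
```

A more faithful treatment would model the joint distribution of agent outcomes explicitly (for example with a shared per-task difficulty variable), but that requires data the sources do not provide.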

The source documentation does not report empirical diversity statistics (e.g., inter-agent agreement rates, unique solution counts per task, or correlation measurements). LLM output stochasticity (temperature > 0) and divergent error paths during iterative refinement introduce variance across agents, but the magnitude of effective independence remains an open empirical question. The actual per-agent success rate $p$ is not published—the $p = 0.30$ value used throughout this section is illustrative only. The true ensemble dynamics depend on the joint distribution of agent outcomes, which is not characterized in the available sources. Different agents are not reported to use different temperature settings, system prompts, or model variants to increase diversity—this remains an obvious optimization avenue noted in the source material.

16.3.5 Ensemble Aggregation

ARC-AGI-2 allows two guesses per test input. Given up to 12 agent solutions, the system selects two outputs through majority voting (documented principle B; specific mechanism described below follows the documented approach). For each test input $t$, let $\{o_1^t, o_2^t, \ldots, o_m^t\}$ be the set of $m \leq 12$ candidate outputs from successful agents. The system converts each output grid to a hashable representation (tuple of tuples) and counts occurrences:

$$\text{guess}_1^t = \arg\max_{o} \sum_{i=1}^{m} \mathbb{1}[o_i^t = o], \qquad \text{guess}_2^t = \arg\max_{o \neq \text{guess}_1^t} \sum_{i=1}^{m} \mathbb{1}[o_i^t = o]$$

where $\mathbb{1}[\cdot]$ is the indicator function. The first guess is the output with the most agent agreement; the second guess is the next most common distinct output. If only one unique output exists among all agents, both guesses are identical.

# Illustrative pseudocode — represents the documented majority-voting
# approach described in the source material. NOT a verbatim excerpt
# from the repository. The actual implementation likely resides in
# gemini-cli-solver/src/ensemble.py (per source docs).

from collections import Counter

def aggregate_solutions(results, num_test_inputs):
    """Select the final two guesses per test input via majority voting.

    Args:
        results: List of agent results (unsuccessful agents are skipped)
        num_test_inputs: Number of test inputs in the task
    Returns:
        One [guess_1, guess_2] pair per test input, or [None, None]
        when no successful agent produced an output for that input
    """
    per_test_outputs = []
    for test_idx in range(num_test_inputs):
        # Convert each candidate grid to a hashable form for counting
        candidates = [
            grid_to_hashable(result.test_outputs[test_idx])
            for result in results
            if result.success and test_idx < len(result.test_outputs)
        ]

        if not candidates:
            # Keep the returned list aligned with the test indices
            per_test_outputs.append([None, None])
            continue

        # most_common() orders ties by first occurrence
        top = Counter(candidates).most_common(2)
        guess_1 = hashable_to_grid(top[0][0])
        guess_2 = hashable_to_grid(top[1][0]) if len(top) > 1 else guess_1
        per_test_outputs.append([guess_1, guess_2])
    return per_test_outputs

def grid_to_hashable(grid):
    return tuple(tuple(row) for row in grid)

def hashable_to_grid(h):
    return [list(row) for row in h]

The source documentation also mentions alternative aggregation strategies (weighted consensus, diversity selection) but indicates that majority voting is the primary method used in the reported results. Whether more sophisticated aggregation was tested and rejected is not documented.

16.4 Iterative Refinement Loop

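The iterative refinement loop is where the extended reasoning horizon and measurable feedback principles converge. Each agent follows the same lifecycle for every task: it generates a candidate transform() function, executes it against all training pairs in a sandbox, and, if any pair fails, builds a refinement prompt from the failure diagnostics and tries again. The loop stops when every training pair passes or the iteration budget (10 by default) is exhausted.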
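A minimal sketch of that lifecycle follows, under stated assumptions: `generate` stands in for the LLM call and `execute` for the sandbox, both passed in as plain callables. The names `run_agent` and `AgentResult` and the inline prompt strings are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    success: bool
    code: str
    iterations: int
    history: list = field(default_factory=list)


def run_agent(
    task: dict,
    generate: Callable[[str], str],
    execute: Callable[[str, list], list],
    max_iterations: int = 10,
) -> AgentResult:
    """Generate, execute, and refine until all training pairs pass.

    `generate` (prompt -> code) and `execute` (code, train_pairs ->
    per-pair result dicts) are hypothetical stand-ins for the LLM
    backend and the sandbox.
    """
    prompt = f"Solve this ARC task:\n{task['train']}"
    code = ""
    history = []
    for iteration in range(1, max_iterations + 1):
        code = generate(prompt)
        results = execute(code, task["train"])
        history.append(results)
        failures = [r for r in results if not r.get("passed")]
        if not failures:
            return AgentResult(True, code, iteration, history)
        # Feed the failing code and its diagnostics into the next prompt.
        prompt = (
            f"Previous code:\n{code}\n"
            f"Failures:\n{failures}\n"
            f"Task:\n{task['train']}"
        )
    return AgentResult(False, code, max_iterations, history)
```

Passing the LLM and sandbox in as callables keeps the loop testable offline: a scripted generator and a local `exec`-based executor are enough to exercise the control flow.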

16.4.1 Refinement Prompt Construction

When a generated program fails on one or more training pairs, the system constructs a refinement prompt that includes: (1) the previous code, (2) a detailed error summary with cell-level diffs for wrong outputs, and (3) the original task description. This rich feedback allows the LLM to diagnose specific failures and revise its hypothesis about the transformation rule. Provenance: (B) documented design; (C) the specific prompt structure below is the survey author's reconstruction.

# Illustrative pseudocode — reconstructs the documented refinement
# approach. NOT a verbatim excerpt from the repository.
# The actual refinement prompt construction likely resides in
# gemini-cli-solver/src/prompts/refinement_prompt.py (per source docs).

def build_refinement_prompt(task, previous_code, iteration, errors):
    """Construct a targeted refinement prompt from execution feedback.

    The prompt provides the LLM with:
    - The code that failed
    - Precise error diagnostics (cell-level diffs for wrong outputs)
    - The original task for re-analysis

    Assumes `task` follows the standard ARC JSON layout, with 'train'
    and 'test' lists of pairs.
    """
    sections = [
        f"## Refinement Iteration {iteration + 1}",
        f"Your previous attempt failed on {len(errors)} training examples.\n",
        f"## Previous Code\n```python\n{previous_code}\n```\n",
        "## Error Analysis"
    ]

    for error in errors:
        pair_idx = error['pair']
        if 'error' in error:
            # Runtime error: include the captured error text
            sections.append(
                f"### Pair {pair_idx}: Runtime Error\n"
                f"```\n{error['error']}\n```"
            )
        elif not error['passed']:
            # Wrong output: show a cell-level diff
            expected = error.get('expected') or []
            actual = error.get('actual') or []
            diff_lines = []

            # Report a shape mismatch first; extra rows or columns in
            # `actual` would otherwise be invisible to the cell loop.
            exp_shape = (len(expected), len(expected[0]) if expected else 0)
            act_shape = (len(actual), len(actual[0]) if actual else 0)
            if exp_shape != act_shape:
                diff_lines.append(
                    f"  Shape: expected {exp_shape}, got {act_shape}"
                )

            for r, exp_row in enumerate(expected):
                for c, exp_val in enumerate(exp_row):
                    in_bounds = r < len(actual) and c < len(actual[r])
                    act_val = actual[r][c] if in_bounds else "MISSING"
                    if exp_val != act_val:
                        diff_lines.append(
                            f"  Cell ({r},{c}): expected {exp_val}, "
                            f"got {act_val}"
                        )
            sections.append(
                f"### Pair {pair_idx}: Wrong Output\n" +
                "\n".join(diff_lines[:20])  # cap at 20 diff lines
            )

    # format_task_for_prompt takes the training pairs and test inputs
    test_inputs = [pair['input'] for pair in task['test']]
    task_text = format_task_for_prompt(task['train'], test_inputs)
    sections.append(f"## Original Task\n{task_text}")
    sections.append(
        "## Instructions\n"
        "1. Analyze why the previous code failed\n"
        "2. Consider alternative transformation rules\n"
        "3. Write an improved transform() function\n"
        "4. Ensure it handles ALL training examples"
    )
    return "\n\n".join(sections)

16.4.2 Convergence Characteristics

The source documentation (B) reports characteristic convergence patterns across iterations, with the following approximate distribution of solves:

Iteration Range	Fraction of Solves	Characterization	Source
1–3	~60%	Quick fixes: syntax, off-by-one, wrong axis	Source docs (B)
4–7	~30%	Deeper reasoning: reconsidering the transformation hypothesis	Source docs (B)
8–10	~10%	Diminishing returns, but on the hardest tasks	Source docs (B)

The average iterations to solve, across all successfully solved tasks, is reported as approximately 3.2 iterations (B).
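The bucketed distribution and the reported mean can be checked against each other. The sketch below derives the lowest and highest mean that the three buckets allow (every solve at the start or at the end of its bucket) and tests whether the reported 3.2 falls inside that range. Bucket fractions are the approximate values from the table; variable names are illustrative.

```python
# (first iteration, last iteration) -> approximate fraction of solves
buckets = [((1, 3), 0.60), ((4, 7), 0.30), ((8, 10), 0.10)]

# Bounds on the mean: all solves at the start / end of each bucket.
lower = sum(frac * lo for (lo, hi), frac in buckets)
upper = sum(frac * hi for (lo, hi), frac in buckets)
# Mean if solves were spread evenly within each bucket.
midpoint = sum(frac * (lo + hi) / 2 for (lo, hi), frac in buckets)

reported_mean = 3.2
print(f"feasible mean range: [{lower:.2f}, {upper:.2f}]")
print(f"uniform-within-bucket mean: {midpoint:.2f}")
print(f"reported mean inside range: {lower <= reported_mean <= upper}")
```

The buckets permit a mean between 2.6 and 4.9, so the reported 3.2 is feasible, but it sits below the 3.75 implied by an even spread within buckets. If both figures are accurate, solves must cluster toward the early end of each range.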

Convergence Model Caveat

Provenance: Author analysis (C). The source material describes the convergence pattern as "approximate exponential decay" with a "decay rate of ~0.5 per iteration." If taken literally, this would suggest that the conditional probability of solving at iteration $k$ given failure through $k{-}1$ follows:

$$P(\text{solve at } k \mid \text{unsolved at } k{-}1) \approx p_0 \cdot \lambda^{k-1}$$

where $p_0$ is the first-iteration solve probability and $\lambda \approx 0.5$. However, this model should be treated as a rough heuristic description of the observed pattern, not as a validated statistical fit. The source material does not provide:

Per-iteration solve histograms or raw data
A fitted parameter estimate with confidence intervals
A goodness-of-fit test or comparison to alternative models
Whether $p_0$ and $\lambda$ vary by task difficulty tier

The geometric decay is plausible as an intuition—easier failure modes are fixed first, leaving progressively harder conceptual mismatches for later iterations—but it should not be cited as a quantitative law of this system without stronger empirical backing.
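To see what the decaying-hazard description would imply if taken literally, the per-iteration solve distribution can be computed from $p_0$ and $\lambda$. This is a sketch under assumptions: $p_0 = 0.30$ is the same illustrative value used earlier, not a reported figure, and `solve_distribution` is a name chosen here.

```python
def solve_distribution(p0: float, decay: float, max_iter: int = 10) -> list[float]:
    """P(first solve at iteration k), k = 1..max_iter, for one agent.

    The conditional solve probability at iteration k is p0 * decay**(k - 1).
    """
    probs = []
    unsolved = 1.0  # probability the agent has not solved the task yet
    for k in range(1, max_iter + 1):
        hazard = p0 * decay ** (k - 1)
        probs.append(unsolved * hazard)
        unsolved *= 1.0 - hazard
    return probs


dist = solve_distribution(p0=0.30, decay=0.5)
solved = sum(dist)                       # P(agent solves within 10 iterations)
early_share = sum(dist[:3]) / solved     # share of solves in iterations 1-3

print(f"P(solved within 10 iterations) = {solved:.3f}")
print(f"share of solves in iterations 1-3 = {early_share:.1%}")
```

With these illustrative inputs the model places roughly nine-tenths of all solves in iterations 1 to 3, noticeably more front-loaded than the ~60% reported for that range. That gap is one more reason to read the "decay rate of ~0.5" as a loose description rather than a fitted parameter.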

16.5 Sandbox Infrastructure

Executing LLM-generated code safely requires strong isolation. Confluence Labs uses E2B, a cloud service providing ephemeral, isolated execution environments (B). The system maintains a pool of 132 concurrent sandboxes (GEMINI_CLI_CONCURRENCY=132, documented in .env template; B), sized to support the throughput requirements of 12 agents each potentially running multiple iterations simultaneously across the full task set. The source material describes the concurrency value of 132 as representing 12 agents × 10 maximum iterations plus a ~10% overhead buffer for retries (B).

Concurrency is controlled by an asyncio semaphore that limits simultaneous sandbox executions (documented design B; specific implementation detail C). Each execution is wrapped in a 30-second timeout (SANDBOX_TIMEOUT=30, from .env template; B)—ARC transformations should complete in milliseconds, so this threshold catches infinite loops and degenerate solutions. Sandbox instances are closed in finally blocks to prevent resource leaks, and transient E2B API failures trigger automatic retries with exponential backoff (B).
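The combination of a semaphore, a per-execution timeout, and retries with exponential backoff can be sketched with the standard library alone. This is not the repository's code: `job` is a hypothetical coroutine factory standing in for a sandbox call, `ConnectionError` stands in for a transient API failure, and the retry count and base delay are assumed values.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def run_with_limits(
    job: Callable[[], Awaitable[str]],
    semaphore: asyncio.Semaphore,
    timeout_s: float = 30.0,
    max_retries: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """Run one job under a concurrency cap, a timeout, and retries."""
    for attempt in range(max_retries + 1):
        try:
            async with semaphore:
                return await asyncio.wait_for(job(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # A timeout is a verdict on the generated code, so no retry.
            return "timeout"
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter; the semaphore slot has
            # already been released at this point.
            delay = base_delay_s * 2 ** attempt * (1 + random.random())
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")


async def main() -> list[str]:
    semaphore = asyncio.Semaphore(132)  # mirrors GEMINI_CLI_CONCURRENCY

    async def fast_job() -> str:
        await asyncio.sleep(0.01)
        return "ok"

    async def stuck_job() -> str:
        await asyncio.sleep(10)  # simulates an infinite loop
        return "never"

    return await asyncio.gather(
        run_with_limits(fast_job, semaphore),
        run_with_limits(stuck_job, semaphore, timeout_s=0.05),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Sleeping outside the `async with` block matters: a job waiting to retry should not occupy one of the limited sandbox slots.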

The test harness wraps the generated transform() function with a check against every training pair and emits the results as structured JSON:

# Illustrative pseudocode — shows the documented test-harness approach
# for executing generated code against training pairs. NOT a verbatim
# excerpt from the repository. The actual sandbox logic likely resides
# in gemini-cli-solver/src/sandbox.py (per source docs).

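def build_test_harness(user_code, train_pairs):
    """Wrap generated code with per-pair checks against the training data.

    The harness executes transform() on each training input,
    compares against expected output, and reports structured results.
    """
    # repr() of nested lists and dicts of ints is valid Python source;
    # raw JSON is not guaranteed to be (null, true, false).
    harness = f"""
{user_code}

import json
import traceback

results = []
train_pairs = {train_pairs!r}

for i, pair in enumerate(train_pairs):
    try:
        actual = transform(pair['input'])
        # numpy is allowed in solutions: normalize arrays to nested lists
        if hasattr(actual, 'tolist'):
            actual = actual.tolist()
        expected = pair['output']
        passed = (actual == expected)
        results.append({{
            'pair': i,
            'passed': passed,
            'expected': expected,
            'actual': None if passed else actual
        }})
    except Exception:
        # Full traceback, so the refinement prompt can show line numbers
        results.append({{
            'pair': i,
            'passed': False,
            'error': traceback.format_exc()
        }})

# default=str keeps the report printable for non-JSON return values
print(json.dumps(results, default=str))
"""
    return harness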

16.6 Key Results

The following results are reported by Confluence Labs. All figures in this section originate from the project README and documentation (provenance B) unless otherwise noted. No independent third-party verification of the 97.92% score or the cost figures is documented in the available source material.

Metric	Value	Exact Source	Independently Verified?
Public eval accuracy	97.92% (~392/400 tasks)	Repository README	Not documented
Cost per task	$11.77	Repository README	Not documented
Total cost (400 tasks)	~$4,708	Derived: $11.77 × 400	N/A (arithmetic)
Wall clock time	~12 hours (full evaluation)	Configuration constraint (B)	N/A (config limit)
Average iterations to solve	~3.2	Source documentation	Not documented
Agents per task	12	.env config (`GEMINI_CLI_AGENTS`)	Config file (B)
Concurrent sandboxes	132	.env config (`GEMINI_CLI_CONCURRENCY`)	Config file (B)
Max iterations per agent	10	.env config (`GEMINI_CLI_MAX_ITERATIONS`)	Config file (B)
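The configuration values in the table relate to one another in a simple way. The sketch below reads them from environment variables, falling back to the documented defaults, and recomputes the concurrency figure from the stated rule (agents × maximum iterations plus a 10% buffer). The variable names are those quoted in this chapter; the `SolverConfig` class and `load_config` function are hypothetical and do not describe how the repository parses its .env file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SolverConfig:
    agents: int
    max_iterations: int
    concurrency: int
    sandbox_timeout_s: int


def load_config(env: Optional[Mapping[str, str]] = None) -> SolverConfig:
    """Read solver settings, defaulting to the documented values."""
    env = os.environ if env is None else env
    return SolverConfig(
        agents=int(env.get("GEMINI_CLI_AGENTS", "12")),
        max_iterations=int(env.get("GEMINI_CLI_MAX_ITERATIONS", "10")),
        concurrency=int(env.get("GEMINI_CLI_CONCURRENCY", "132")),
        sandbox_timeout_s=int(env.get("SANDBOX_TIMEOUT", "30")),
    )


def documented_concurrency(cfg: SolverConfig, buffer: float = 0.10) -> int:
    """Agents x max iterations, plus the ~10% retry buffer."""
    return round(cfg.agents * cfg.max_iterations * (1 + buffer))


cfg = load_config({})  # empty mapping: use the documented defaults
print(cfg)
print("derived concurrency:", documented_concurrency(cfg))
```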

Evaluation Protocol — What Is and Isn't Documented

Provenance: Author analysis (C) based on available source material.

The source material specifies that the 97.92% score is on the ARC-AGI-2 public evaluation set (400 tasks). The following evaluation details are not documented in the available sources:

Model version pinning: The .env template specifies gemini-2.5-pro, but whether a specific model snapshot/version was used (and whether the API model has since been updated) is not stated.
Temperature and sampling settings: The .env template shows GEMINI_TEMPERATURE=0.7 and GEMINI_MAX_TOKENS=8192, but whether these were the exact settings used for the reported result is not confirmed.
Number of reruns: Whether the 97.92% figure represents a single run, the best of multiple runs, or an average across runs is not stated.
Submission procedure: Whether the score was obtained via the official ARC-AGI-2 evaluation platform or computed locally against the public task set is not specified.
Private evaluation set: No results on the private (held-out) 100-task set are reported.
Variance: No confidence interval, standard deviation, or range across multiple runs is provided.

These gaps are common in repository-based benchmark claims and do not imply the result is incorrect, but they limit the strength of conclusions that can be drawn from the number.

16.6.1 Performance by Task Difficulty

The source material provides an approximate breakdown of performance across difficulty tiers. Provenance: (B) source documentation, with the following important caveats.

Difficulty Tier	Approx. Tasks	Solve Rate	Avg. Iterations	Source
Easy (simple spatial transforms)	~120	~100%	1.2	Source docs (B)
Medium (compositional rules)	~160	~99%	3.1	Source docs (B)
Hard (multi-step, abstract)	~80	~96%	5.8	Source docs (B)
Very Hard (novel concepts)	~40	~90%	8.2	Source docs (B)

Methodology caveat. These difficulty-tier numbers are reported in the source documentation but without detailed methodology for tier assignment. The approximate task counts sum to 400, consistent with the ARC-AGI-2 public evaluation set size. Critical unknowns include: (1) whether these tiers were assigned by the Confluence Labs team through manual review, automated clustering, or correspondence to an external classification; (2) what criteria define "easy" vs. "medium" vs. "hard" vs. "very hard"; (3) whether the solve rates and average iterations are per-tier averages or medians; and (4) whether these figures come from the same run that produced the 97.92% headline number. Without this methodology, these figures should be treated as self-reported approximate observations, not rigorous empirical measurements.
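One check that needs no extra methodology is whether the tier figures are consistent with the headline numbers. The sketch below aggregates the approximate tier table into an overall solve rate and a solve-weighted mean iteration count; since every input is approximate, the outputs are rough indications only.

```python
# tier -> (approx. task count, approx. solve rate, avg. iterations)
tiers = {
    "easy": (120, 1.00, 1.2),
    "medium": (160, 0.99, 3.1),
    "hard": (80, 0.96, 5.8),
    "very_hard": (40, 0.90, 8.2),
}

total_tasks = sum(n for n, _, _ in tiers.values())
solved = sum(n * rate for n, rate, _ in tiers.values())
overall_rate = solved / total_tasks

# Weight each tier's average iterations by its number of solved tasks.
mean_iterations = sum(n * rate * it for n, rate, it in tiers.values()) / solved

print(f"tasks: {total_tasks}, implied solved: {solved:.1f}")
print(f"implied overall solve rate: {overall_rate:.1%}")
print(f"implied mean iterations to solve: {mean_iterations:.2f}")
```

The implied overall rate (about 97.8%) is close to the 97.92% headline, while the implied mean of about 3.5 iterations sits somewhat above the reported ~3.2. Given the rounding in the tier table, the gap is suggestive rather than conclusive.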

16.6.2 Failure Mode Analysis

The roughly 2% of tasks left unsolved (about 8 of 400) share common characteristics according to the source material (B):

Highly novel spatial concepts — transformations involving spatial relationships with few analogues in typical programming tasks, outside the LLM's training distribution.
Ambiguous rules — tasks where training examples are consistent with multiple transformation rules, and the correct rule is not the most "natural" one from a programming perspective.
Large-scale counting or arithmetic — tasks requiring precise counting of complex features across large grids, where off-by-one errors compound.
Recursive or self-referential patterns — transformations requiring meta-reasoning about the transformation itself.

These failure modes are consistent with known limitations of LLM code generation: the model struggles when the required program has no close analogues in its training data, or when the specification (training examples) is ambiguous.

16.7 Cost Analysis

At a reported $11.77 per task (B), the system represents a significant compute investment. The cost structure below is presented in the source documentation (B). However, the breakdown into sub-components is not accompanied by a detailed accounting methodology. Whether these figures come from API billing logs, estimates based on token counts, or back-of-the-envelope calculations is not stated.

Cost Component	Est. per Task	Percentage	Source
Gemini API calls (12 agents × avg 3.2 iters)	$8.50	72.2%	Source docs (B)
E2B sandbox execution	$2.20	18.7%	Source docs (B)
Infrastructure overhead (networking, logging)	$0.80	6.8%	Source docs (B)
Miscellaneous (retries, failed executions)	$0.27	2.3%	Source docs (B)

Provenance note: The sub-total ($8.50 + $2.20 + $0.80 + $0.27 = $11.77) is arithmetically consistent with the headline per-task cost, which adds internal coherence. However, the Gemini API component depends on exact pricing at the time of the evaluation run, which is not documented (Gemini API pricing has changed multiple times). The E2B cost depends on the pricing tier and instance type used. These figures should be understood as approximate cost allocations reported by the team, not independently auditable line items.
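The arithmetic behind the breakdown is easy to reproduce. The sketch below works in integer cents to avoid floating-point drift, then recomputes the per-task total, each component's share, and the cost of a 400-task run. The dictionary keys are labels chosen here for illustration.

```python
# Reported per-task cost components, in cents.
components_cents = {
    "gemini_api": 850,
    "e2b_sandbox": 220,
    "infrastructure": 80,
    "miscellaneous": 27,
}

per_task_cents = sum(components_cents.values())
shares = {
    name: 100 * cents / per_task_cents
    for name, cents in components_cents.items()
}
total_run_dollars = per_task_cents * 400 / 100

print(f"per task: ${per_task_cents / 100:.2f}")
for name, share in shares.items():
    print(f"  {name:15s} {share:5.1f}%")
print(f"400-task run: ${total_run_dollars:,.0f}")
```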

The total cost for a full 400-task evaluation run is approximately $4,708 (arithmetic derivation from the per-task cost). The source material suggests this price point sits near the "knee" of the cost-performance curve (B): reducing cost below $5/task would likely drop accuracy below 95%, while increasing cost above $20/task would yield diminishing accuracy gains toward 98.5–99%. This cost-performance characterization is presented as the team's judgment (B); no ablation study systematically varying cost is documented.
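The arithmetic behind these figures is simple enough to check directly. The following is a minimal sketch that recomputes the per-task total, the percentage shares, and the full-run cost from the reported component values; the variable names are illustrative, and the 400-task count is the one used in this chapter.

```python
# Reported per-task cost components in USD (values from the cost table above).
components = {
    "Gemini API calls": 8.50,
    "E2B sandbox execution": 2.20,
    "Infrastructure overhead": 0.80,
    "Miscellaneous": 0.27,
}

NUM_TASKS = 400  # public evaluation set size as stated in this chapter

# Sum of the components, rounded to cents to avoid float noise.
per_task_total = round(sum(components.values()), 2)

# Share of each component as a percentage of the per-task total.
shares = {
    name: round(100 * cost / per_task_total, 1)
    for name, cost in components.items()
}

# Cost of one full evaluation run at the reported per-task price.
full_run_total = round(per_task_total * NUM_TASKS, 2)

for name, cost in components.items():
    print(f"{name:<26} ${cost:5.2f}  {shares[name]:5.1f}%")
print(f"Per-task total: ${per_task_total:.2f}")
print(f"Full run ({NUM_TASKS} tasks): ${full_run_total:,.2f}")
```

This confirms only that the reported numbers agree with each other; it says nothing about how the team measured them.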

Several cost reduction strategies are identified in the source material (B):

Strategy	Estimated Savings	Accuracy Impact	Evidence Level
Reduce agents from 12 to 6	~50% LLM cost reduction	Reduced ensemble coverage	Theoretical (C)
Reduce max iterations from 10 to 5	~30% LLM cost reduction	Loss of hard-task solves	Source estimate (B)
Adaptive agent allocation (fewer for easy tasks)	50%+ overall cost reduction	Minimal accuracy loss (per source)	Source estimate (B)
Use cheaper model variant	Variable	Degraded code quality	Source speculation (B)
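The first two strategies are quoted as reductions in LLM cost, not in total cost. The sketch below translates them into per-task totals under one explicit assumption that the source does not state: only the Gemini component shrinks, while sandbox, infrastructure, and miscellaneous costs stay fixed. The function name and scenario labels are hypothetical.

```python
# Reported per-task baseline in USD (from the cost table in this section).
BASELINE = {"llm": 8.50, "sandbox": 2.20, "infra": 0.80, "misc": 0.27}


def per_task_cost(llm_reduction: float, baseline: dict = BASELINE) -> float:
    """Per-task cost if only the LLM component shrinks by `llm_reduction`.

    Assumption (not from the source): non-LLM components are unchanged.
    """
    if not 0.0 <= llm_reduction <= 1.0:
        raise ValueError("llm_reduction must be between 0 and 1")
    llm = baseline["llm"] * (1.0 - llm_reduction)
    others = sum(v for k, v in baseline.items() if k != "llm")
    return round(llm + others, 2)


# LLM-cost reductions as quoted in the strategy table.
scenarios = {"baseline": 0.0, "6 agents": 0.5, "max 5 iterations": 0.3}

baseline_total = per_task_cost(0.0)
for label, reduction in scenarios.items():
    cost = per_task_cost(reduction)
    saving = 100 * (1 - cost / baseline_total)
    print(f"{label:<18} ${cost:5.2f} per task  ({saving:4.1f}% overall saving)")
```

Under this assumption, halving the agent count saves roughly 36% overall, not 50%. In practice sandbox usage would probably also fall with fewer agents, so the true saving may sit between the two figures.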

16.8 Comparison with Related Approaches

The ARC benchmark has attracted a diverse range of approaches. Confluence Labs' program synthesis approach occupies a distinctive position in this landscape, distinguished by its scale of parallelism, its reliance on general-purpose code generation rather than DSL search, and its use of iterative feedback-driven refinement.

Approach	Method Type	Key Strength	Key Weakness
Confluence Labs	LLM Program Synthesis	Expressive, interpretable, verifiable solutions	High cost, LLM-capability-bounded
DreamCoder-style	Neural-guided DSL Search	Efficient search with learned library	Limited by DSL expressiveness
End-to-end Neural	Direct Grid Prediction	Fast inference	Poor generalization to novel tasks
Brute-force DSL	Enumerative Search	Complete within DSL scope	Exponential complexity
Evolutionary (GP)	Genetic Programming	Flexible, no LLM dependency	Slow convergence
LLM-guided Evolution	Hybrid (e.g., Imbue)	Systematic, adaptive mutations	Fitness design complexity

16.8.1 Relation to Evolutionary Systems Surveyed in This Book

Within the broader landscape of LLM-powered evolutionary systems covered in this survey, Confluence Labs occupies an interesting boundary position. Systems such as FunSearch (Chapter 6), AlphaEvolve (Chapter 4), and OpenEvolve (Chapter 5) use LLMs as mutation operators within an explicit evolutionary loop—maintaining populations, applying selection pressure, and tracking fitness across generations. Confluence Labs, by contrast, uses the LLM as a direct solver with iterative self-refinement but without population-based search.

To make this comparison precise, we evaluate the system against the five defining characteristics of evolutionary program synthesis:

Evolutionary Characteristic	FunSearch / AlphaEvolve	Confluence Labs
Population	Maintained across generations; programs compete and persist	No population. Each agent maintains a single candidate at a time; no cross-generation memory
Selection	Fitness-proportional or tournament-based parent selection	Binary: pass all training pairs or fail. No graded fitness, no selection among partial solutions
Variation (mutation)	LLM-generated edits to parents; crossover between candidates	LLM-generated revision conditioned on error feedback. No crossover between agents
Fitness landscape	Continuous or ordinal fitness guiding incremental improvement	Binary correctness. A partially correct program receives the same "fail" signal as a completely wrong one (error diagnostics provide qualitative feedback but not a scalar fitness)
Archive / memory	MAP-Elites grid, island databases, program libraries	No persistent archive. Each agent's history is its own conversation context; no cross-agent knowledge sharing
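The fitness-landscape row can be made concrete with a small comparison. The sketch below contrasts the binary pass/fail signal described for Confluence Labs with a hypothetical graded alternative (per-cell accuracy); the graded metric is an illustration of what an evolutionary system might use, not something the source attributes to either system.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def binary_fitness(program: Program, pairs: List[Pair]) -> bool:
    """Pass only if every training output is reproduced exactly."""
    return all(program(x) == y for x, y in pairs)


def cell_accuracy(program: Program, pairs: List[Pair]) -> float:
    """Hypothetical graded fitness: fraction of correctly predicted cells.

    A prediction whose shape differs from the target scores zero for that pair.
    """
    correct = total = 0
    for x, y in pairs:
        pred = program(x)
        total += sum(len(row) for row in y)
        same_shape = len(pred) == len(y) and all(
            len(p) == len(r) for p, r in zip(pred, y)
        )
        if same_shape:
            correct += sum(a == b for p, r in zip(pred, y) for a, b in zip(p, r))
    return correct / total if total else 0.0


# Toy task: mirror each row of the grid.
pairs = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]

candidates = {
    "identity": lambda g: g,
    "all_zeros": lambda g: [[0 for _ in row] for row in g],
    "mirror_rows": lambda g: [row[::-1] for row in g],
}

for name, prog in candidates.items():
    print(name, binary_fitness(prog, pairs), cell_accuracy(prog, pairs))
```

Both incorrect candidates receive the same binary "fail" even though their cell accuracy differs, which is the distinction the table draws.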

The system can be viewed as a degenerate case of evolutionary search where population size equals 1 per agent, selection is binary, and mutation is performed entirely by the LLM conditioned on error feedback. Despite the absence of explicit evolutionary mechanisms, the hypothesis-test-refine cycle mirrors the generate-evaluate-select loop, the multi-agent ensemble provides diversity analogous (though not equivalent) to population-based search, and iterative refinement with feedback implements a form of directed variation.
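The degenerate-case reading can be sketched as a loop with a single candidate, a binary acceptance test, and feedback-conditioned variation. This is an illustration of the idea, not the repository's code: `propose` stands in for an LLM call, and `scripted_proposer` is a hypothetical stub that replays fixed attempts so the example runs without any API.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def evaluate(program: Program, pairs: List[Pair]) -> Optional[str]:
    """Return None if all pairs pass, else a diagnostic for the first failure."""
    for i, (x, y) in enumerate(pairs):
        try:
            pred = program(x)
        except Exception as exc:  # runtime errors become feedback, not crashes
            return f"pair {i}: raised {type(exc).__name__}: {exc}"
        if pred != y:
            return f"pair {i}: expected {y}, got {pred}"
    return None


def refine(
    propose: Callable[[Optional[str]], Program],
    pairs: List[Pair],
    max_iters: int = 10,
) -> Tuple[Optional[Program], int]:
    """Population of one: propose, test, and revise until all pairs pass."""
    feedback: Optional[str] = None
    for iteration in range(1, max_iters + 1):
        candidate = propose(feedback)           # variation, conditioned on feedback
        feedback = evaluate(candidate, pairs)   # evaluation, binary outcome
        if feedback is None:
            return candidate, iteration         # selection: accept first full pass
    return None, max_iters


def scripted_proposer() -> Callable[[Optional[str]], Program]:
    """Hypothetical stand-in for the LLM: replays a fixed list of attempts."""
    attempts = iter([
        lambda g: g,                         # first guess: identity (wrong)
        lambda g: [row[::-1] for row in g],  # revision: mirror each row
    ])
    return lambda feedback: next(attempts)


train_pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
solution, iterations_used = refine(scripted_proposer(), train_pairs)
print("solved:", solution is not None, "after", iterations_used, "iterations")
```

Running several such loops in parallel with no shared state gives the multi-agent ensemble; nothing in the loop ranks or recombines failed candidates.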

16.8.2 What the Comparison Reveals

The Confluence Labs result suggests that for well-structured reasoning tasks with clear verification, brute-force parallelization of LLM inference—enough agents, enough iterations, enough compute—can match or exceed more sophisticated search strategies. This raises a question for the evolutionary AI community: when is population-based search necessary, and when is parallel independent restart sufficient?

The answer likely depends on the per-attempt success probability $p$. With 12 independent agents, the probability that at least one succeeds is $1 - (1 - p)^{12}$, which is about 0.93 at $p = 0.2$ and 0.99 at $p = 0.3$. Correlation between agents lowers these figures, but for tasks where $p \gtrsim 0.2$–$0.3$ ensemble amplification alone may still suffice. For tasks with very low per-attempt probability (open-ended optimization, algorithm discovery where the search space is vast and solutions are sparse), evolutionary mechanisms providing population-level memory, crossover, and guided exploration of the fitness landscape likely become essential. The Confluence Labs system does not address this regime: its binary feedback signal and lack of graded fitness would provide no gradient toward distant solutions.
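The dependence on $p$ can be tabulated with a few lines of code. This sketch assumes fully independent agents, which the chapter notes is unverified, so the values are best read as upper bounds on what a correlated ensemble would achieve. The function name is illustrative.

```python
def ensemble_success(p: float, n_agents: int = 12) -> float:
    """Probability that at least one of n independent agents succeeds.

    Assumes independence between agents; correlation would lower the result.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (1.0 - p) ** n_agents


# Sweep from the sparse-solution regime up to the range discussed in the text.
for p in (0.01, 0.05, 0.1, 0.2, 0.3):
    print(f"p = {p:<4}  P(at least one of 12 succeeds) = {ensemble_success(p):.3f}")
```

At very small $p$ the ensemble probability stays low for a fixed agent count, which is the regime where parallel restarts alone stop being sufficient.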

16.9 Limitations & Discussion

16.9.1 LLM Capability Ceiling

The system's performance is fundamentally bounded by the capabilities of the underlying LLM (Google Gemini). If the model cannot conceive of a particular transformation rule—because it has no close analogue in its training distribution—no amount of iterative refinement or ensemble scaling will produce the correct solution. This is not a limitation that can be addressed architecturally; it requires improvements in the base model.

16.9.2 Cost and Scalability

At a reported $11.77 per task and ~$4,708 per full evaluation (B), the approach is viable for benchmark competition and research but would be prohibitively expensive for high-volume production applications. LLM API costs are trending downward, but the linear scaling with agents and iterations means cost reduction requires either algorithmic improvements (adaptive agent allocation) or cheaper models.

16.9.3 Generalization to Private Evaluation

The 97.92% figure is reported on the public evaluation set (400 tasks). Performance on the private evaluation set (100 held-out tasks with potentially different difficulty distribution) is not reported in the available sources. As with any benchmark result, there is a risk that the public set's difficulty distribution does not represent the full range of ARC-AGI-2 challenges.

16.9.4 Reproducibility

LLM outputs can differ between runs even at temperature 0, because of floating-point non-associativity and batching effects in the serving infrastructure, and the documented configuration samples at temperature 0.7, so completions are stochastic by design. Exact reproduction of the 97.92% score is therefore not guaranteed across runs, although statistical consistency is expected. The MIT license and public repository make the system architecture and configuration fully available, so the approach is structurally reproducible even if individual run outcomes vary.

Reproducibility and Verification Checklist

What a third party would need to reproduce the public evaluation number:

Requirement	Status	Notes
Source code	Available	MIT license, `github.com/confluence-labs/arc-agi-2`
Dependencies	Available	`uv.lock` provides pinned dependency versions
ARC-AGI-2 public tasks	Available	400-task public evaluation set is publicly distributed
Gemini API access	Required	Requires Google API key with access to `gemini-2.5-pro`
E2B API access	Required	Requires E2B API key (paid service)
API cost	~$4,700	Based on reported per-task cost; actual cost depends on current API pricing
Model version pinning	Not specified	Gemini model may have been updated since the reported run; exact snapshot unknown
Temperature / sampling	Documented in .env	`GEMINI_TEMPERATURE=0.7`, `GEMINI_MAX_TOKENS=8192` (from .env template)
Number of runs for reported score	Not specified	Unknown whether 97.92% is single-run, best-of-N, or average
Expected variance	Not reported	LLM non-determinism means exact reproduction unlikely; range unknown
Hardware requirements	Modest (API-based)	No GPU needed; requires stable internet for API calls
Wall clock time	~12 hours	Constrained by `WALL_CLOCK_LIMIT=43200` seconds
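The configuration values in the checklist can be read and sanity-checked as follows. This is a minimal sketch: the variable names and defaults are those listed above from the `.env` template, while the `load_run_config` helper is hypothetical and the repository's own configuration loading may differ.

```python
import os
from typing import Dict, Mapping

# Defaults taken from the checklist above (values from the .env template).
DEFAULTS = {
    "GEMINI_TEMPERATURE": "0.7",
    "GEMINI_MAX_TOKENS": "8192",
    "WALL_CLOCK_LIMIT": "43200",  # seconds
}


def load_run_config(env: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Hypothetical helper: parse the documented settings into typed values."""
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "temperature": float(raw["GEMINI_TEMPERATURE"]),
        "max_tokens": int(raw["GEMINI_MAX_TOKENS"]),
        "wall_clock_hours": int(raw["WALL_CLOCK_LIMIT"]) / 3600,
    }


config = load_run_config({})  # empty mapping: fall back to template defaults
print(config)
```

With the template defaults, the wall-clock limit works out to 12 hours, matching the evaluation constraint.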

Key stochasticity caveats: Even with identical code and configuration, exact result reproduction is unlikely because: (1) the Gemini API may return different completions across runs due to non-deterministic sampling; (2) the model checkpoint behind the gemini-2.5-pro endpoint may have been updated since the original evaluation; (3) E2B sandbox execution timing may vary, potentially affecting timeout behavior on edge cases; (4) asyncio task scheduling order is not deterministic, which could affect which agents complete first and how the wall-clock budget is allocated. A reproducer should expect results within a statistical neighborhood of 97.92% but should not treat exact match as expected.
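A rough size for this "statistical neighborhood" can be estimated from the binomial standard error. The sketch below assumes that task outcomes are independent of one another across runs, which is not established by the source; under that assumption the binomial value is an upper bound on the run-to-run standard deviation, because per-task success probabilities that differ from the mean can only reduce the variance. A change of model checkpoint would affect all tasks together and is not covered.

```python
import math


def run_to_run_std_bound(accuracy: float, n_tasks: int) -> float:
    """Binomial standard error of the mean accuracy over n_tasks.

    Upper-bounds the run-to-run standard deviation if task outcomes are
    independent (an assumption, not a documented property of the system).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


REPORTED_ACCURACY = 0.9792  # self-reported public evaluation accuracy
N_TASKS = 400               # task count used in this chapter

std = run_to_run_std_bound(REPORTED_ACCURACY, N_TASKS)
print(f"std bound: {100 * std:.2f} percentage points "
      f"(about {std * N_TASKS:.1f} tasks)")
```

Under these assumptions the bound is roughly 0.7 percentage points, or about three tasks, so a rerun landing a few tasks above or below the reported score would be unsurprising.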

16.9.5 Philosophical Considerations

A deeper question raised by this work: does the system exhibit genuine "abstract reasoning" in the sense intended by the ARC benchmark? Chollet designed ARC to measure fluid intelligence—the ability to solve genuinely novel problems. The LLM-based approach converts novel reasoning problems into coding problems, which the model solves using crystallized knowledge from its training data. Each "novel" transformation rule is recognized and expressed through recombination of programming patterns the model has already seen.

The approach is effective on the reported numbers, but it calls into question whether ARC is measuring what it was designed to measure. If LLMs can reach roughly 98% accuracy by casting abstract reasoning as program synthesis, then either (a) program synthesis is a valid form of abstract reasoning, or (b) the benchmark is measuring a different capability than intended. This remains an open question in the AI research community.

16.10 Strategic Vision

Confluence Labs positions their ARC solver as a proof of concept for a broader research agenda in data-efficient scientific modeling. The source material (B) describes three target domains for extending the program synthesis framework:

Hardware engineering: Automated synthesis of digital logic circuits or HDL programs from behavioral specifications.
Biology: Generating computational models of biological processes from limited experimental observations (rare diseases, novel organisms).
Materials science: Discovering structure-property relationships in novel materials through automated hypothesis generation and testing.

The team is also exploring hybrid approaches combining LLM-based program synthesis with Bayesian model selection (B). In this paradigm, the LLM generates candidate programs (the hypothesis space), while Bayesian methods handle uncertainty quantification and active learning (selecting the most informative next experiment). This would extend the system beyond passive inference toward active experimental design.
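The division of labor described here can be illustrated with a toy example. In the sketch below, a handful of hand-written functions stand in for LLM-generated candidate programs, a simple Bayesian update weights them against observations, and the next experiment is the input on which the surviving hypotheses disagree most (predictive entropy, used as a simple proxy for information gain). All names, the noise-based likelihood, and the selection rule are assumptions for illustration; the source does not describe an implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Callable[[int], int]


def posterior(
    hypotheses: Dict[str, Hypothesis],
    prior: Dict[str, float],
    observations: List[Tuple[int, int]],
    noise: float = 1e-3,
) -> Dict[str, float]:
    """Bayes' rule with likelihood 1 - noise for a match, noise for a mismatch."""
    weights = {}
    for name, h in hypotheses.items():
        likelihood = 1.0
        for x, y in observations:
            likelihood *= (1.0 - noise) if h(x) == y else noise
        weights[name] = prior[name] * likelihood
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


def predictive_entropy(
    hypotheses: Dict[str, Hypothesis], post: Dict[str, float], x: int
) -> float:
    """Entropy (nats) of the posterior-weighted predictions at input x."""
    dist: Dict[int, float] = {}
    for name, h in hypotheses.items():
        dist[h(x)] = dist.get(h(x), 0.0) + post[name]
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def next_experiment(
    hypotheses: Dict[str, Hypothesis], post: Dict[str, float], inputs: List[int]
) -> int:
    """Active learning step: pick the input with the most disagreement."""
    return max(inputs, key=lambda x: predictive_entropy(hypotheses, post, x))


# Stand-ins for LLM-generated candidate programs.
hypotheses = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "add_two": lambda x: x + 2,
}
prior = {name: 1 / len(hypotheses) for name in hypotheses}

post = posterior(hypotheses, prior, [(2, 4)])   # all three candidates fit (2, 4)
query = next_experiment(hypotheses, post, [0, 1, 2])
post = posterior(hypotheses, prior, [(2, 4), (query, 2 * query)])  # true rule: double
print("queried input:", query, "posterior:", post)
```

The single observation (2, 4) cannot separate the three candidates, so the selection step picks the input where all of them predict different outputs, and one further observation concentrates the posterior on one program.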

Future Direction	Approach	Expected Impact	Source
Agent diversity	Different prompts, temperatures, models per agent	Improved ensemble diversity	Source docs (B)
Cross-agent communication	Share partial solutions between agents	Faster convergence on hard tasks	Source docs (B)
Meta-learning	Learn from solved tasks to improve prompts	Transfer across task categories	Source docs (B)
Hybrid search	Combine LLM synthesis with symbolic search	Better coverage of unusual transformations	Source docs (B)
Adaptive allocation	Fewer agents for easy tasks, more for hard	50%+ cost reduction (per source)	Source docs (B)

16.11 Summary

Chapter Summary

Key takeaway: Confluence Labs reports 97.92% accuracy on ARC-AGI-2 (public evaluation set, self-reported) by reframing abstract reasoning tasks as program induction problems and applying massive parallel LLM code generation with iterative feedback-driven refinement. If the reported figure holds, the result suggests that brute-force scaling of well-structured LLM inference can reach near-ceiling performance on benchmarks designed to resist statistical learning.

Main contribution to the field: A principled framework (three design principles: structural alignment, extended reasoning, measurable feedback) for maximizing LLM performance on abstract reasoning through program synthesis, supported by a complete open-source implementation (MIT license) with transparent cost analysis ($11.77/task, self-reported). The ensemble amplification argument—converting a modest per-agent success rate into high ensemble accuracy through 12 independent agents—provides a broadly applicable scaling strategy, though its effectiveness in this system depends on the unmeasured degree of inter-agent correlation.

What a researcher should know: This system operates at the boundary between evolutionary program synthesis and parallel LLM inference. It achieves its reported ARC results without population-based search, crossover, or fitness landscapes—raising the question of when explicit evolutionary mechanisms are necessary versus when parallel independent restart with feedback is sufficient. The system is architecturally simple and extensible (MIT license), making it a strong baseline for LLM-driven abstract reasoning research.

Evidence caveats: All reported performance numbers are self-reported from the repository README with no documented independent verification. The evaluation protocol is incompletely specified: the model snapshot, the number of reruns, and the submission procedure are not documented, and the sampling temperature is known only from the `.env` template, not confirmed for the reported run. The difficulty-tier breakdown and cost sub-component allocation lack detailed methodology. The ensemble-success calculation assumes agent independence, but empirical correlation statistics are not reported. Code examples in this chapter are illustrative pseudocode faithful to the documented architecture, not verbatim repository excerpts. Readers seeking to verify specific implementation claims should consult the repository directly at github.com/confluence-labs/arc-agi-2.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}