Introduced2025-03
Score7.85/10 — Draft
Chapter 19

AB-MCTS & TreeQuest

Part: Program Synthesis & ARC-AGI Solvers

19.1 Overview & Motivation

Inference-time scaling has emerged as a complementary axis to training-time scaling for improving large language model performance on hard reasoning tasks. Rather than training larger models, one can allocate additional compute at inference time by generating multiple candidate solutions and selecting among them. The simplest such strategy — repeated sampling (Best-of-N) — generates N independent solutions and returns the best, but makes no use of information gathered during the search process: attempt N+1 is no more informed than attempt 1 [PAPER — arXiv:2503.04412, §1].

AB-MCTS (Adaptive Branching Monte Carlo Tree Search), introduced by Sakana AI in arXiv:2503.04412 (March 2025), addresses this limitation by framing inference-time scaling as a tree search problem with an adaptive depth-versus-width trade-off [PAPER — §1]. The algorithm decides at each iteration whether to go deeper (refine an existing candidate solution through sequential iteration) or go wider (generate an entirely new solution from scratch), using Thompson Sampling to learn the optimal allocation online. A Multi-LLM extension adds a third decision dimension: which frontier model to use at each generation step [PAPER — §1].

The accompanying open-source library, TreeQuest (Apache 2.0, github.com/SakanaAI/treequest), implements this algorithm with a stateless, functional design supporting checkpointing, resumption, and batch-parallel execution [PAPER — §1; README]. On the ARC-AGI-2 benchmark (120 public evaluation tasks, max 250 iterations), Multi-LLM AB-MCTS is reported to achieve >30% Pass@250, compared to 23% for naive repeated sampling with a single model (o4-mini) [PAPER — §1].

Within this survey's taxonomy, AB-MCTS occupies a distinctive position at the intersection of Monte Carlo tree search, multi-armed bandits, and LLM-guided program synthesis. Unlike evolutionary approaches that maintain fixed populations with crossover and mutation (Chapters 9–11), AB-MCTS structures its search as a tree with Bayesian action selection. Unlike Tree-of-Thought (Yao et al., 2023), which uses fixed branching factors, AB-MCTS adapts the branching ratio dynamically through Thompson Sampling [PAPER — §1, Table comparing to related work]. Unlike ensemble or mixture-of-experts approaches, the Multi-LLM variant learns model routing online rather than through a pre-trained router [PAPER — §7].

19.2 Architecture

19.2.1 Pre-Writing Audit

Repository Audit Statement: The repository at https://github.com/SakanaAI/treequest was not directly inspected via file-system access for this chapter. All implementation claims are derived from the paper (arXiv:2503.04412) and the source material analysis document, which describes the repository's design. No pinned commit hash or release tag was verified by the chapter author.

Top-level modules/directories confirmed: [UNVERIFIED] — The source material describes a Python library named treequest with functions init_tree, ask_batch, tell, best, serialize, deserialize, and a treequest.models submodule containing MultiLLMConfig. These names appear consistently across the paper's code examples but have not been verified against actual source files.

Entry point: The paper describes a functional API (no CLI entry point documented). The primary interface is init_tree()ask_batch() / tell() loop → best() [PAPER — §8–9].

Config file paths and key fields: [UNVERIFIED] — No standalone config file format (YAML/TOML) is documented. Configuration is done via function arguments: method ("ab_mcts_a", "ab_mcts_m", "multi_llm"), batch_size, max_iterations, model_config [PAPER — §8, §12].

Observable output artifacts: [UNVERIFIED] — The paper describes serializable TreeState objects that can be checkpointed to JSON. No specific log file paths, output directories, or artifact naming conventions are documented.

Consequence for evidence tiers: All implementation claims in this chapter default to [PAPER] or [INFERRED] unless otherwise noted. Code examples use names from the paper's documented API but cannot be confirmed as exact repository identifiers.

19.2.2 System Architecture

TreeQuest is designed around a stateless, functional architecture where the search tree is an immutable data structure passed through a sequence of pure functions [PAPER — §8, §10]. The library separates three concerns: (1) search decisions (which action to take, which node to refine, which model to use), (2) solution generation (delegated entirely to user-provided callbacks), and (3) solution evaluation (also user-provided). This separation allows TreeQuest to remain agnostic to the LLM provider, prompt format, and evaluation protocol [PAPER — §8.1].

The core data flow follows the ask-tell pattern: the library asks the user what to evaluate next (returning Trial objects specifying action type, parent node, and model hint), and the user tells the library the results (solution and score). Between ask and tell, the user executes LLM calls however they choose — synchronously, asynchronously, with caching or rate limiting [PAPER — §9.1].

TreeQuest AB-MCTS Architecture User Code [PAPER — §8, §9] generate_fn(trial) LLM API calls evaluate_fn(solution) Score → [0, 1] MultiLLMConfig Model pool definition TreeQuest Core [PAPER — §8–10] init_tree() → TreeState ask_batch() → List[Trial] tell() → TreeState (new) best(k) → Top solutions serialize/deserialize Checkpoint I/O Thompson Sampling Engine Beta-Bernoulli (AB-MCTS-A) Hierarchical Bayesian (AB-MCTS-M) Multi-dim Bandit (Multi-LLM) Immutable TreeState [PAPER — §10] nodes: tuple[Node, ...] edges: tuple[Edge, ...] bandit_state: BanditState pending_trials: tuple[Trial, ...] rng_state [INFERRED — not repo-verified] Possible internal modules: node_selection, prompt_construction, score_normalization These components are implied by the paper's algorithm description but not confirmed in repository source files [PAPER] [INFERRED]

19.2.3 Artifact Inventory

Since direct repository inspection was not performed, the following table lists artifacts described in the paper and source material, with evidence sources.

ArtifactTypeDescriptionEvidence Source
treequest Python packageLibraryCore AB-MCTS implementation[PAPER — §8; README]
init_tree()FunctionInitialize search tree state[PAPER — §8.3]
ask_batch()FunctionRequest batch of trials[PAPER — §9.1]
tell()FunctionReport results, return new state[PAPER — §9.3]
best()FunctionRetrieve top-k solutions[PAPER — §8.3]
serialize() / deserialize()FunctionsCheckpoint/resume support[PAPER — §10.2]
Trial dataclassData structureImmutable trial descriptor[PAPER — §9.2]
TreeState dataclassData structureComplete immutable search state[PAPER — §10.3]
MultiLLMConfigConfig classMulti-model pool configuration[PAPER — §13.1]
JSON checkpoint filesOutput artifactSerialized tree state for resumption[PAPER — §10.2]
Apache 2.0 licenseLicenseOpen-source license[PAPER — §1; README]

19.2.4 Execution Trace

No CLI entry point is documented. The paper describes a purely programmatic interface. The following is the documented execution pattern [PAPER — §8.3, §10.2]:

# Pseudocode — reconstructed from paper §8.3, §10.2, not repo-verified
# No CLI command documented; library is used programmatically

from treequest import init_tree, ask_batch, tell, best, serialize, deserialize

# Initialization
tree_state = init_tree(
    method="multi_llm",       # Variant: "ab_mcts_a" | "ab_mcts_m" | "multi_llm"
    batch_size=5,              # Parallel trials per iteration
    max_iterations=250,        # Total compute budget
    model_config=config,       # MultiLLMConfig for multi-model mode
)

# Main loop: ask → generate → evaluate → tell
for _ in range(50):            # 50 batches × 5 = 250 trials
    trials = ask_batch(tree_state)
    results = [(t.trial_id, generate(t), score(t)) for t in trials]
    tree_state = tell(tree_state, results)

# Retrieve results
solution = best(tree_state, k=2)

# Checkpoint: serialize to JSON
import json
with open("checkpoint.json", "w") as f:
    json.dump(serialize(tree_state), f)

# Expected output: JSON file containing full TreeState
# (nodes, edges, bandit_state, iteration count, pending trials, RNG state)

Note: No specific output directory structure, log file format, or terminal output is documented in available materials. The primary observable artifact is the serialized TreeState JSON checkpoint [PAPER — §10.2].

19.3 Core Algorithms

19.3.1 Verification Matrix

Algorithm / MechanismClaimEvidence SourceArtifactConfidence
Adaptive Branching (depth vs. width)Thompson Sampling selects between depth and width actions at each iteration[PAPER — §3.3, §4]arXiv:2503.04412 §4High
Beta-Bernoulli Thompson SamplingEach action maintains Beta(α, β) posterior updated with binary rewards[PAPER — §5.1–5.2]arXiv:2503.04412 §5High
Node Aggregation (AB-MCTS-A)Binary success/failure based on whether child exceeds subtree best[PAPER — §6.1]arXiv:2503.04412 §6.1High
Mixed Models (AB-MCTS-M)PyMC-based hierarchical Bayesian inference for continuous rewards[PAPER — §5.3, §6.2]arXiv:2503.04412 §6.2High
Multi-LLM model selectionThompson Sampling over (direction, model) Cartesian product[PAPER — §6.3, §7]arXiv:2503.04412 §7High
Immutable TreeStateFrozen dataclasses, functional state transitions[PAPER — §10]arXiv:2503.04412 §10High
Ask-tell interfaceDecoupled search decisions from generation execution[PAPER — §9]arXiv:2503.04412 §9High
Node selection for refinementSecondary bandit problem for which node to refine[PAPER — §3.1]arXiv:2503.04412 §3.1Medium — mechanism described but exact policy not specified
Emergent model specializationDifferent models develop different roles during search[PAPER — §7.3]arXiv:2503.04412 §7.3Medium — empirical observation, not formally measured
Crossover actionCombining elements from two subtrees[PAPER — §16.2]Future direction, not implementedN/A — proposed only

19.3.2 The Depth-Width Trade-off

The central algorithmic contribution of AB-MCTS is the adaptive selection between two search actions at each iteration [PAPER — §2.2, §3.3]:

  • Depth (Go Deeper): Select an existing node in the search tree and generate a child solution by feeding the node's solution as context to the LLM, along with its score and improvement instructions. This creates chains of progressively refined solutions — analogous to gradient-descent-like steps in continuous optimization [PAPER — §3.1].
  • Width (Go Wider): Generate an entirely new solution from scratch, without conditioning on any previous attempt. This is equivalent to adding a new root-level node to the search tree, providing diversity when existing branches are stuck in local optima [PAPER — §3.2].

The paper's key insight is that the optimal depth-width ratio is problem-dependent and node-dependent within the search tree, and should be learned online rather than fixed as a hyperparameter [PAPER — §3.3]. AB-MCTS accomplishes this via Thompson Sampling over a Beta-Bernoulli model.

19.3.3 Thompson Sampling for Action Selection

Thompson Sampling (Thompson, 1933) is a Bayesian approach to the multi-armed bandit problem. Each arm maintains a prior distribution over its expected reward; after observing a reward, the prior is updated to a posterior via Bayes' rule. At each decision point, a value is sampled from each arm's posterior, and the arm with the highest sample is selected [PAPER — §5.1].

For binary rewards, the natural conjugate prior is the Beta distribution. If arm i has observed αi successes and βi failures, its posterior is:

$$P(\text{reward}_i) \sim \text{Beta}(\alpha_i + 1,\; \beta_i + 1)$$
Symbol Table:
  • αi: cumulative success count for arm i (depth or width)
  • βi: cumulative failure count for arm i
  • The +1 terms encode a uniform prior Beta(1,1)

Provenance: [Standard definition applied here] — Beta-Bernoulli Thompson Sampling is a well-established technique (Agrawal & Goyal, 2012). The application to depth/width selection is the paper's contribution [PAPER — §5.2].

Reward definition: A reward of 1 (success) is recorded when the newly generated solution's score exceeds the best score among the parent node's descendants. A reward of 0 (failure) is recorded otherwise [PAPER — §4.5]. Formally:

$$r_t = \mathbb{1}[\text{score}(s_{\text{child}}) > \text{subtree\_best}(\text{parent})]$$
Symbol Table:
  • rt: binary reward at iteration t
  • schild: the newly generated solution
  • score(·): user-defined evaluation function mapping solutions to [0, 1]
  • subtree_best(parent): maximum score among all descendants of the parent node
  • 𝟙[·]: indicator function (1 if condition holds, 0 otherwise)

Provenance: [Published formula — paper §4.5]. This binary thresholding is the specific choice made by AB-MCTS-A; AB-MCTS-M uses continuous rewards instead.

Worked Example — Thompson Sampling Action Selection:

Suppose after 20 iterations: depth has had αd = 6 successes, βd = 8 failures; width has had αw = 4 successes, βw = 2 failures.

  • Depth posterior: Beta(7, 9). Mean = 7/16 = 0.4375. The posterior is relatively concentrated.
  • Width posterior: Beta(5, 3). Mean = 5/8 = 0.625. The posterior is broader (fewer observations).
  • Sample from Beta(7, 9) → e.g., 0.38. Sample from Beta(5, 3) → e.g., 0.71.
  • Width wins (0.71 > 0.38), so this iteration generates a fresh solution.

If width then produces a solution scoring 0.65 against a current root-level best of 0.60, the reward is 1 (success), updating width to αw = 5, βw = 2. If the score had been 0.55, the reward would be 0, updating to αw = 4, βw = 3.

19.3.4 AB-MCTS Main Loop

The algorithm operates in four phases per iteration: selection, expansion, evaluation, and backpropagation — mirroring classical MCTS but adapted for LLM inference-time scaling [PAPER — §4.1].

# Pseudocode — reconstructed from paper §4.1, not repo-verified
# Uses paper-described function names; actual implementation may differ

def ab_mcts(problem, budget_T, models, scorer):
    """AB-MCTS main loop with four phases per iteration."""
    tree = init_tree()
    thompson = ThompsonSampler(actions=["depth", "width"])

    for t in range(budget_T):
        # Phase 1 — Selection: Choose depth vs. width via Thompson Sampling
        action = thompson.sample()  # Sample from Beta posteriors

        if action == "width":
            # Phase 2a — Expansion (width): New root-level solution
            prompt = make_fresh_prompt(problem)
            parent_node = tree.root
        else:
            # Phase 2b — Expansion (depth): Refine existing node
            parent_node = select_node(tree)  # Secondary bandit/UCB
            prompt = make_refinement_prompt(
                problem, parent_node.solution, parent_node.score
            )

        # Model selection (Multi-LLM extension only)
        model = select_model(models)  # Thompson Sampling over model pool

        # Phase 3 — Evaluation: Generate solution and score it
        solution = model.generate(prompt)
        score = scorer.evaluate(solution)  # Normalized to [0, 1]

        # Phase 4 — Backpropagation: Update tree and Thompson model
        child = tree.add_child(parent_node, solution, score)
        reward = 1 if score > parent_node.subtree_best else 0
        thompson.update(action, reward)

    return tree.best_solution()

19.3.5 AB-MCTS Variants

AB-MCTS-A (Node Aggregation). The simplest variant reduces all scores to binary success/failure outcomes. Each refinement attempt is treated as an independent Bernoulli trial: did the child improve upon the subtree's best score? The Beta posterior is well-calibrated under this assumption, but the magnitude of improvement is discarded — a score increase from 0.5 to 0.51 counts identically to an increase from 0.5 to 0.99 [PAPER — §6.1]. Dependencies: NumPy only; computational overhead is negligible [PAPER — Table in §6].

AB-MCTS-M (Mixed Models). This variant uses PyMC-based hierarchical Bayesian inference to maintain richer posteriors over continuous score distributions, rather than reducing to binary outcomes [PAPER — §5.3, §6.2]. The hierarchical structure allows information sharing across tree nodes: if refinements at one node tend to produce high scores, this informs expectations for other nodes. The trade-off is increased computational cost (MCMC sampling per decision) and a PyMC/ArviZ dependency [PAPER — §6.2].

[INFERRED — Implementation Detail] The exact PyMC model specification (number of MCMC samples, hyperprior parameterization, inference backend) is described conceptually in the paper but the precise implementation constants have not been verified in the repository. The paper mentions "hierarchical statistical modeling" with population-level partial pooling [PAPER — §6.2], but the specific model architecture (e.g., normal vs. beta likelihood, exact hyperprior values) may differ in the actual implementation.

Multi-LLM AB-MCTS. This variant extends the action space to the Cartesian product of {depth, width} × {model₁, model₂, …, modelk}. Each (direction, model) combination maintains its own Beta posterior, enabling the algorithm to learn that, for example, one model excels at refinement while another produces better diverse initial solutions [PAPER — §6.3, §7.1].

VariantBranching StrategyScore HandlingDependenciesOverheadProvenance
AB-MCTS-ABeta-Bernoulli TSBinary (improve / not)NumPy onlyNegligible[PAPER — §6.1]
AB-MCTS-MHierarchical BayesianContinuous [0,1]PyMC, ArviZMCMC per decision[PAPER — §6.2]
Multi-LLMMulti-dim banditBinary per (action, model)NumPy + API keysNegligible per decision[PAPER — §6.3]

19.3.6 Multi-LLM Model Selection

The Multi-LLM extension introduces a three-dimensional decision space at each iteration [PAPER — §7.1]: (1) direction (depth vs. width), (2) model selection (which LLM from the pool), (3) generation (execute with chosen model). Each (direction, model) pair maintains an independent Beta posterior, updated based on whether the generation produced a score improvement.

# Pseudocode — reconstructed from paper §7.2, not repo-verified

class MultiLLMThompsonSampler:
    """Thompson Sampling across (direction, model) combinations."""

    def __init__(self, models: list[str]):
        self.arms = {}
        for direction in ["depth", "width"]:
            for model in models:
                key = (direction, model)
                self.arms[key] = {"alpha": 1.0, "beta": 1.0}

    def select(self) -> tuple[str, str]:
        """Sample from all arm posteriors; return (direction, model)."""
        best_key, best_sample = None, -1.0
        for key, params in self.arms.items():
            sample = np.random.beta(params["alpha"], params["beta"])
            if sample > best_sample:
                best_sample = sample
                best_key = key
        return best_key  # (direction, model_name)

    def update(self, direction: str, model: str, reward: float):
        key = (direction, model)
        self.arms[key]["alpha"] += reward
        self.arms[key]["beta"] += (1.0 - reward)
Worked Example — Multi-LLM Selection (3 models × 2 directions = 6 arms):

After 30 iterations with o4-mini, Gemini-2.5-Pro, and DeepSeek R1-0528:

Arm (direction, model)αβPosterior Mean
(depth, o4-mini)580.385
(depth, Gemini)740.636
(depth, DeepSeek)350.375
(width, o4-mini)430.571
(width, Gemini)240.333
(width, DeepSeek)330.500

The algorithm has learned that Gemini tends to be strongest at refinement (depth) while o4-mini provides good diversity (width). The next sample might draw 0.72 from (depth, Gemini) and 0.65 from (width, o4-mini), selecting Gemini for a depth action.

19.3.7 Stateless Design and Immutability

A distinctive architectural choice is that all tree state is represented via immutable (frozen) data structures. Every operation — adding a node, updating scores, recording Thompson Sampling observations — returns a new TreeState rather than mutating the existing one [PAPER — §10.1]. This design enables:

  • Checkpointing: Any state can be serialized to JSON and restored later [PAPER — §10.2].
  • Reproducibility: Given the same initial state and random seed, the search trajectory is fully deterministic [PAPER — §10.1].
  • Concurrency: Immutable structures are inherently thread-safe — multiple threads can read the same state without locks [PAPER — §10.1].
  • Debugging: The full history of tree states can be retained for post-hoc analysis [PAPER — §10.1].

19.3.8 Node Selection for Refinement

When the algorithm decides to go deeper, it must select which existing node to refine. The paper describes this as a secondary multi-armed bandit problem where each node is an arm, with selection balancing exploitation (refining high-scoring nodes) with exploration (trying nodes with low visit counts) [PAPER — §3.1]. The exact selection policy is not formally specified in the paper.

[INFERRED — Node Selection Policy] The paper states that node selection "balances exploitation (refining high-scoring nodes) with exploration (trying nodes that haven't been refined much)" [PAPER — §3.1], which is consistent with UCB-style selection as used in classical MCTS. However, the specific formula (UCB1, PUCT, Thompson Sampling at the node level, or a simpler heuristic) is not documented. The comparison table in §14.1 of the paper lists "Thompson Sampling from Beta posteriors" as the selection method, but this refers to the depth/width selection, not within-depth node selection.

19.3.9 Regret Analysis

Thompson Sampling for the K-armed Bernoulli bandit has established regret bounds [PAPER — §14.2]:

$$\mathbb{E}[R(T)] \leq O\!\left(\sqrt{K \cdot T \cdot \log T}\right)$$
Symbol Table:
  • R(T): cumulative regret after T rounds
  • K: number of arms — 2 for single-model (depth/width), 2|M| for Multi-LLM with |M| models
  • T: total number of iterations (budget)

Provenance: [Standard definition applied here] — This bound is from Agrawal & Goyal (2012) for the stationary case. The paper notes that AB-MCTS reward distributions are non-stationary (the value of depth vs. width changes as the tree grows and best known score improves), which the standard bound does not address [PAPER — §14.2].

Worked Example — Regret Bound:

For Multi-LLM AB-MCTS with K = 6 arms (2 directions × 3 models) and T = 250 iterations:

E[R(250)] ≤ O(√(6 × 250 × log 250)) ≈ O(√(6 × 250 × 5.52)) ≈ O(√8300) ≈ O(91)

This suggests that the algorithm's cumulative loss from suboptimal arm selection grows sublinearly, with expected regret of roughly 91 over 250 rounds — meaning the vast majority of iterations are allocated to near-optimal actions. Note this is a theoretical upper bound under stationarity; actual performance may differ due to non-stationary rewards.

19.4 Key Results

19.4.1 Evaluation Caveats

  • Self-reported results: All benchmark numbers are reported by the system authors (Sakana AI). No independent reproduction by a third party has been documented in available materials.
  • Exact model versions: The paper specifies model names (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528) but does not provide exact model version identifiers, API snapshot dates, or whether model behavior may have changed between paper preparation and publication [PAPER — §11.2].
  • Seeds and run counts: The number of independent runs per configuration, random seeds used, and variance across runs are not reported. All numbers appear to be from single runs or are not specified [PAPER — §11.3].
  • Compute budget matching: The repeated sampling baseline uses 250 independent generations with o4-mini. The Multi-LLM AB-MCTS also uses 250 iterations but with three different models (o4-mini, Gemini, DeepSeek), which may have different per-call costs. Whether total token or dollar budgets are matched is not stated [PAPER — §11.2–11.3].
  • Prompt engineering: The paper acknowledges that the effectiveness of depth actions depends on prompt quality [PAPER — §16.1], but does not release the exact prompts used, making reproduction dependent on prompt design choices.
  • Approximate figures: Some results (AB-MCTS-M "~28%", Multi-LLM ">30%") are reported with approximate notation rather than exact values [PAPER — §11.3].
  • Task subset: Results are on the 120 public evaluation tasks of ARC-AGI-2, not the full held-out test set. Public set performance may not generalize to the private evaluation [PAPER — §11.1].

19.4.2 ARC-AGI-2 Results

MethodModel(s)Task SetPass@250Pass@2Seeds / RunsCompute BudgetResult TypeEvidence Source
Repeated Sampling o4-mini 120 public eval 23.0% — (not reported) — (not reported) 250 generations Self-reported [PAPER — §11.3]
AB-MCTS-A o4-mini 120 public eval 27.5% — (not reported) — (not reported) 250 iterations Self-reported [PAPER — §11.3]
AB-MCTS-M o4-mini 120 public eval ~28% — (not reported) — (not reported) 250 iterations Self-reported (approximate) [PAPER — §11.3]
Multi-LLM AB-MCTS o4-mini + Gemini-2.5-Pro + DeepSeek R1-0528 120 public eval >30% 19.2% — (not reported) 250 iterations (3-model pool) Self-reported (approximate) [PAPER — §11.3]

19.4.3 Analysis of Results

The reported improvement from 23.0% (repeated sampling) to >30% (Multi-LLM AB-MCTS) represents a >30% relative improvement in Pass@250 on the same 120-task evaluation set [PAPER — §11.3]. The paper emphasizes three patterns observed in search tree analysis [PAPER — §11.4]:

  1. Complementary reasoning: One model identifies the correct transformation type while another implements it correctly at the pixel level.
  2. Cross-model error correction: A different model can sometimes fix systematic errors during refinement that the originating model reproduces.
  3. Diverse initialization: Different models produce different distributions of initial solutions, improving coverage of the solution space.

The Pass@2 metric (19.2%) for Multi-LLM AB-MCTS indicates that when selecting only the top-2 solutions from 250 candidates for final submission, the algorithm captures approximately two-thirds of the solvable tasks. This demonstrates effective ranking, not just search [PAPER — §11.4].

19.4.4 Scaling Behavior

The paper reports that repeated sampling's Pass@N curve flattens quickly (most easily solvable problems found within the first ~50 attempts), while AB-MCTS maintains a steeper slope because search directs compute toward promising regions [PAPER — §11.5].

The paper defines search efficiency at iteration t as the probability that iteration t solves a previously unsolved problem [PAPER — §11.5]:

$$\text{efficiency}(t) = P(\text{solved at } t \mid \text{not solved at } t{-}1)$$
Symbol Table:
  • efficiency(t): marginal probability of solving a new problem at iteration t
  • For repeated sampling, this decreases monotonically
  • For AB-MCTS, this can increase over time as information accumulates

Provenance: [Published formula — paper §11.5]. This is a conceptual framing for interpreting scaling curves, not a directly measured quantity.

Worked Example — Search Efficiency:

Consider a problem solvable by 2% of random samples. Under repeated sampling, efficiency(t) ≈ 0.02 at every iteration (assuming independence). After 50 iterations, P(unsolved) = 0.9850 ≈ 0.364, so Pass@50 ≈ 63.6%.

Under AB-MCTS, if the first few attempts reveal that a particular refinement direction is promising, subsequent iterations concentrate on that direction. The efficiency might start at 0.02 but increase to, say, 0.05 as the algorithm focuses. After 50 directed iterations, the cumulative solve probability could exceed the repeated-sampling baseline despite starting from the same per-sample rate.

19.4.5 Non-LLM Baseline Comparison

The paper compares AB-MCTS variants against repeated sampling (Best-of-N) as the primary baseline [PAPER — §11.3]. No comparison against non-LLM baselines (e.g., brute-force program enumeration, DSL-based solvers, or classical ARC solvers) is reported in the available materials. The ARC-AGI-2 benchmark itself has been evaluated by various approaches in the ARC Prize competition, but the paper does not position its results against non-LLM competition entries.

[INFERRED — Baseline Context] The absence of non-LLM baselines limits the ability to assess whether the observed improvements are attributable to the search algorithm (AB-MCTS) or to the capability of the underlying frontier models. A matched-budget comparison against a DSL-based ARC solver or classical search algorithm would strengthen the evaluation. The paper's contribution is specifically about orchestrating LLM calls more efficiently, so the repeated-sampling baseline is the most relevant comparator for that claim.

19.5 Implementation & Cost

19.5.1 Implementation Details

AspectDetailEvidence Source
LanguagePython[PAPER — §8; README]
LicenseApache 2.0[PAPER — §1; README]
Core dependencies (AB-MCTS-A)NumPy[PAPER — §6, variant table]
Additional dependencies (AB-MCTS-M)PyMC, ArviZ[PAPER — §6.2]
Design patternStateless ask-tell, immutable dataclasses[PAPER — §8.1, §10]
Parallelism modelBatch-parallel trials; user manages async execution[PAPER — §9.3]
Checkpoint formatJSON-serializable TreeState[PAPER — §10.2]
Models used in reported experimentso4-mini, Gemini-2.5-Pro, DeepSeek R1-0528[PAPER — §11.2]

19.5.2 Cost Analysis

Cost DimensionDetailProvenance
Iterations per task250 (max budget in reported experiments)[PAPER — §11.2]
Batch size5 (recommended)[PAPER — §12.1]
LLM calls per task250 (1 per iteration)[PAPER — §4, §11.2]
Computational overhead of AB-MCTS-ANegligible (NumPy Beta sampling)[PAPER — §6, variant table]
Computational overhead of AB-MCTS-MMCMC sampling per decision (PyMC)[PAPER — §6, variant table]
Hardware configuration— (not reported)
Wall-clock time per task— (not reported)
Dollar cost per task— (not reported)
Total cost for 120-task evaluation— (not reported)
[INFERRED — Cost Projection] The paper does not report dollar costs or wall-clock times. A rough projection: at 250 iterations per task and 120 tasks, the full evaluation requires 30,000 LLM API calls across three frontier models. At typical 2025–2026 API pricing (o4-mini: ~$0.01–0.03 per call depending on prompt length; Gemini-2.5-Pro and DeepSeek R1-0528 at comparable rates), the total cost for the 120-task evaluation likely falls in the range of $300–$3,000 depending on prompt complexity and token counts. The paper describes ~30 seconds per LLM call as a reference point [PAPER — §10.2], which would put wall-clock time at approximately 2 hours per task with sequential execution. These are author estimates, not paper-reported figures.

19.6 Reproducibility

19.6.1 Reproducibility Assessment

RequirementStatusNotes
Code publicly releasedgithub.com/SakanaAI/treequest, Apache 2.0 [PAPER — §1]
Config files availablePartialConfiguration via function arguments documented in paper; no standalone config files described
Pretrained weights / checkpointsN/ASystem uses third-party LLM APIs; no training involved
Documented entry point or run commandPartialProgrammatic API documented [PAPER — §8]; no CLI; README may contain usage examples [UNVERIFIED]
Compute requirements statedHardware, runtime, and cost not reported
Seeds and run counts reportedNumber of runs and random seeds not specified
Exact model versions specifiedPartialModel names given (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528); exact API versions not pinned
Prompt templates releasedPrompts not published; essential for reproduction
Evaluation function released✗ (unclear)ARC-AGI-2 scoring is well-defined (pixel accuracy), but the exact implementation is not confirmed in the repo
Independent reproduction attemptedNo independent reproductions documented in available materials

19.6.2 Step-by-Step Verification Path

To attempt validation of AB-MCTS on ARC-AGI-2, an external researcher would need to:

  1. Clone the repository: git clone https://github.com/SakanaAI/treequest
  2. Install dependencies: The paper mentions NumPy as the only dependency for AB-MCTS-A, with PyMC/ArviZ for AB-MCTS-M [PAPER — §6]. The exact install command is not documented in the paper. The researcher should check the repository for pyproject.toml, setup.py, or requirements.txt.
  3. Obtain API keys: OpenAI (o4-mini), Google (Gemini-2.5-Pro), DeepSeek (R1-0528). Exact model version strings may need to be determined from the repository or API documentation.
  4. Obtain ARC-AGI-2 tasks: The 120 public evaluation tasks are available from arcprize.org [PAPER — §11.1]. The exact task loading format expected by the code is not documented in the paper.
  5. Implement prompt templates: The paper describes depth and width prompts conceptually [PAPER — §4.3] but does not release exact templates. This is a significant barrier — the researcher must design their own prompts, which may not match the paper's performance.
  6. Implement scoring function: Pixel-level accuracy is well-defined for ARC tasks [PAPER — §12.3], but the exact normalization (e.g., handling of wrong grid dimensions) must be implemented.
  7. Run search: Using the documented API pattern: init_tree(method="multi_llm", batch_size=5, max_iterations=250) → ask/tell loop → best(k=2).
  8. Expected output: A Pass@250 score of >30% and Pass@2 of ~19.2% on the 120 public tasks.

Critical gaps for reproduction: The absence of released prompt templates is the primary barrier. Since the paper acknowledges that "the effectiveness of depth (refinement) actions depends heavily on the quality of the refinement prompt" [PAPER — §16.1], prompt design is a first-order factor in reproducing results. The lack of reported seeds, run counts, and variance estimates further limits the ability to determine whether a reproduction attempt has succeeded or failed.

19.7 Threats to Validity

  1. Self-reported results without independent verification. All benchmark numbers are reported by the system's authors (Sakana AI). No independent reproduction or third-party evaluation has been documented. The improvement from 23% to >30% should be interpreted as an author-reported claim until independently validated.
  2. Unmatched compute budgets across configurations. The Multi-LLM variant uses three different frontier models in its 250-iteration budget. The single-model baselines use only o4-mini. If the three models have different costs per call (in tokens, latency, or dollars), the budgets are not dollar-matched. The paper uses iteration count as the budget metric, which treats all model calls as equivalent [PAPER — §11.2].
  3. Model version non-stationarity. The paper specifies model names but not exact version identifiers or API snapshot dates. Frontier model APIs are updated regularly; results obtained with one API version may not be reproducible with a later version. This is a systemic issue for all inference-time scaling research, not specific to AB-MCTS.
  4. Unreleased prompt templates. Prompt engineering is acknowledged as a critical factor [PAPER — §16.1], but the exact prompts are not published. This creates a gap between the described algorithm (which is fully specified) and the reported results (which depend on undisclosed implementation details).
  5. Absence of variance estimates. No error bars, confidence intervals, or multi-run statistics are reported. Without knowing the variance across runs, it is impossible to determine whether the differences between AB-MCTS variants and the repeated-sampling baseline are statistically significant.
  6. Public evaluation set only. Results are on ARC-AGI-2's 120 public evaluation tasks, not the held-out private test set. Overfitting to public task characteristics (through prompt tuning or hyperparameter selection) cannot be ruled out.
  7. Non-stationarity of reward distributions. The Thompson Sampling regret bounds cited in the paper [PAPER — §14.2] assume stationary reward distributions. The paper acknowledges that this assumption is violated in AB-MCTS (reward distributions shift as the tree grows) but does not quantify the impact of non-stationarity on the regret guarantee.
  8. Node selection policy underspecified. The secondary bandit problem of selecting which node to refine (when depth is chosen) is described only in general terms [PAPER — §3.1]. The exact selection policy — which substantially affects depth-action quality — is not formally specified.
  9. Emergent specialization claim. The reported observation that "different models develop different roles during search" [PAPER — §7.3] is a qualitative claim without formal measurement or statistical testing. The degree of specialization and whether it is robust across tasks and runs is not quantified.

19.8 Limitations & Open Questions

19.8.1 Documented Limitations

The paper explicitly identifies five limitations [PAPER — §16.1]:

  1. Scoring function requirement: AB-MCTS requires a scalar scoring function in [0, 1] that provides meaningful partial-credit signal. Tasks without clear partial credit (open-ended creative writing, subjective evaluation) are poorly suited [PAPER — §16.1].
  2. LLM call overhead: Each iteration requires at least one LLM API call. At 250 iterations per task, the cost may be prohibitive for expensive models or large task sets [PAPER — §16.1].
  3. Prompt engineering dependency: Refinement (depth) effectiveness depends heavily on prompt quality. Poor prompts may cause the LLM to ignore or verbatim-copy the parent solution [PAPER — §16.1].
  4. Non-stationarity: The true reward distributions shift during search, violating the stationarity assumption of standard Thompson Sampling. While empirically robust to mild non-stationarity, extreme distribution shift could degrade performance [PAPER — §16.1].
  5. No crossover: Unlike evolutionary algorithms, AB-MCTS does not combine elements from different solution branches. Each refinement acts on a single parent [PAPER — §16.1].

19.8.2 Open Questions

[INFERRED — Open Research Questions]
  • Crossover action: The paper proposes adding a "crossover" operation that combines elements from two subtrees via LLM-mediated recombination [PAPER — §16.2]. This would expand the action space to {depth, width, crossover} × models, potentially enabling richer search dynamics at the cost of increased bandit dimensionality.
  • Hierarchical decomposition: For multi-step problems, AB-MCTS could be applied hierarchically — a top-level search decomposes the problem, and nested searches solve subproblems [PAPER — §16.2]. The interaction between hierarchical decomposition and Thompson Sampling is unexplored.
  • Learned scoring functions: Replacing external evaluators with LLM-based reward models would create a fully self-contained system but introduces reward hacking risks [PAPER — §16.2].
  • Non-stationary Thompson Sampling: Discounted Thompson Sampling with geometric down-weighting of older observations (αt = γ·αt−1 + rewardt, γ ∈ (0,1)) could address reward distribution shift [PAPER — §16.2].
  • Integration with code evolution: The paper suggests AB-MCTS as a search backbone for systems like AlphaEvolve, replacing MAP-Elites or simple evolutionary strategies for program search [PAPER — §16.2]. The Multi-LLM variant is particularly relevant for code generation tasks where different models may excel at algorithmic design versus implementation.

19.9 Survey Positioning

19.9.1 Comparison with Related Systems

SystemSearch StrategyAdaptationMulti-ModelPrimary DomainBudget Context
AB-MCTS / TreeQuest (this chapter) Adaptive tree search (depth + width) Thompson Sampling (online) Yes — learned routing via TS ARC-AGI abstract reasoning 250 iterations / task
FunSearch (Chapter 11) Population-based evolutionary Island model with program database No — single model (PaLM 2 / Gemini) Mathematical optimization (cap sets, bin packing) Millions of evaluations
AlphaEvolve (Chapter 9) Evolutionary with MAP-Elites archive Adaptive operator selection Yes — Gemini model cascade Algorithm discovery (broad) Large-scale (hours to days)

Note on budget comparisons: The systems above operate at fundamentally different scales. FunSearch runs millions of cheap evaluations of short programs; AlphaEvolve runs large-scale evolutionary campaigns over hours to days; AB-MCTS runs 250 iterations per task with expensive frontier model calls. Direct performance comparison across these systems is not meaningful without budget normalization, which none of the source papers provide across system boundaries.

19.9.2 Evolutionary Correspondence

AB-MCTS shares conceptual structure with evolutionary algorithms, as the paper itself notes [PAPER — §15.3]. The following table maps the correspondence:

AB-MCTS ComponentEvolutionary AnalogueCorrespondence Quality
Width action (new root solution)Random initialization / immigrationStrong — both introduce independent diversity
Depth action (refine existing node)Mutation + local search / hill climbingModerate — refinement is directed by prompt, not random perturbation
Thompson Sampling adaptive branchingAdaptive operator selectionStrong — both learn action allocation online
Multi-LLM model poolMultiple mutation operatorsModerate — models are more complex than operators
Search treePopulation with genealogyWeak — tree is append-only, no population culling
No crossoverAsexual reproduction onlyStrong limitation — explicitly acknowledged [PAPER — §16.1]

Where the analogy breaks down: AB-MCTS does not maintain a fixed population size, does not perform selection pressure (all nodes are retained, none are culled), and does not use crossover or recombination [PAPER — §15.3]. The tree is append-only: nodes are never removed, and the "population" grows monotonically. This means AB-MCTS cannot exhibit the diversity-maintenance dynamics of island models or MAP-Elites archives used by FunSearch and AlphaEvolve. The tree structure also lacks the multi-objective quality-diversity characteristics that drive these systems' ability to maintain behavioral variety. In essence, AB-MCTS is a search algorithm that shares surface-level structure with evolutionary algorithms but operates under fundamentally different population dynamics.

Key Contribution

AB-MCTS introduces adaptive depth-width branching via Thompson Sampling as a principled alternative to fixed search strategies for LLM inference-time scaling. The core insight — that the optimal ratio of solution refinement to fresh generation is problem-dependent and learnable online — is formalized as a multi-armed bandit and extended to multi-model selection. The TreeQuest library provides a clean stateless implementation with checkpoint/resume support. Among inference-time scaling methods surveyed (Best-of-N, Chain-of-Thought, Tree-of-Thought, Self-Consistency), AB-MCTS is, based on available evidence, the approach that explicitly treats the depth-width trade-off as a learnable parameter rather than a fixed architectural choice [PAPER — §15.1].

19.10 Summary

Key Takeaway: AB-MCTS demonstrates that adaptive allocation between solution refinement (depth) and fresh generation (width), governed by Thompson Sampling, can meaningfully improve inference-time scaling over naive repeated sampling. On ARC-AGI-2, the Multi-LLM variant reportedly achieves >30% Pass@250 compared to 23% for repeated sampling — a relative improvement exceeding 30% — with the additional finding that multi-model collaboration solves problems no single model can solve individually.

Main Contribution: A Bayesian framework for online learning of the depth-width trade-off in LLM-based search, implemented in the TreeQuest library with a stateless, immutable, checkpoint-friendly design. The Multi-LLM extension demonstrates that model selection can be jointly optimized alongside search direction, enabling emergent model specialization.

Most Important Gap for Future Research: The absence of released prompt templates, variance estimates, and independent reproduction limits the ability to distinguish the contribution of the search algorithm from the contribution of prompt engineering and model selection. A researcher attempting reproduction should prioritize obtaining or reverse-engineering the prompt templates, as the paper explicitly acknowledges prompt quality as a critical factor in depth-action effectiveness. Adding crossover-like recombination operations — explicitly identified as a limitation — is the most natural algorithmic extension, potentially bridging AB-MCTS toward the richer population dynamics exhibited by evolutionary approaches like FunSearch and AlphaEvolve.