Introduced2025-03

Score7.85/10 — Draft

Chapter 19

AB-MCTS & TreeQuest

Part: Program Synthesis & ARC-AGI Solvers

19.1 Overview & Motivation

Inference-time scaling has emerged as a complementary axis to training-time scaling for improving large language model performance on hard reasoning tasks. Rather than training larger models, one can allocate additional compute at inference time by generating multiple candidate solutions and selecting among them. The simplest such strategy — repeated sampling (Best-of-N) — generates N independent solutions and returns the best, but makes no use of information gathered during the search process: attempt N+1 is no more informed than attempt 1 [PAPER — arXiv:2503.04412, §1].

AB-MCTS (Adaptive Branching Monte Carlo Tree Search), introduced by Sakana AI in arXiv:2503.04412 (March 2025), addresses this limitation by framing inference-time scaling as a tree search problem with an adaptive depth-versus-width trade-off [PAPER — §1]. The algorithm decides at each iteration whether to go deeper (refine an existing candidate solution through sequential iteration) or go wider (generate an entirely new solution from scratch), using Thompson Sampling to learn the optimal allocation online. A Multi-LLM extension adds a third decision dimension: which frontier model to use at each generation step [PAPER — §1].

The accompanying open-source library, TreeQuest (Apache 2.0, github.com/SakanaAI/treequest), implements this algorithm with a stateless, functional design supporting checkpointing, resumption, and batch-parallel execution [PAPER — §1; README]. On the ARC-AGI-2 benchmark (120 public evaluation tasks, max 250 iterations), Multi-LLM AB-MCTS is reported to achieve >30% Pass@250, compared to 23% for naive repeated sampling with a single model (o4-mini) [PAPER — §1].

Within this survey's taxonomy, AB-MCTS occupies a distinctive position at the intersection of Monte Carlo tree search, multi-armed bandits, and LLM-guided program synthesis. Unlike evolutionary approaches that maintain fixed populations with crossover and mutation (Chapters 9–11), AB-MCTS structures its search as a tree with Bayesian action selection. Unlike Tree-of-Thought (Yao et al., 2023), which uses fixed branching factors, AB-MCTS adapts the branching ratio dynamically through Thompson Sampling [PAPER — §1, Table comparing to related work]. Unlike ensemble or mixture-of-experts approaches, the Multi-LLM variant learns model routing online rather than through a pre-trained router [PAPER — §7].

19.2 Architecture

19.2.1 Pre-Writing Audit

Repository Audit Statement: The repository at https://github.com/SakanaAI/treequest was not directly inspected via file-system access for this chapter. All implementation claims are derived from the paper (arXiv:2503.04412) and the source material analysis document, which describes the repository's design. No pinned commit hash or release tag was verified by the chapter author.

Top-level modules/directories confirmed: [UNVERIFIED] — The source material describes a Python library named treequest with functions init_tree, ask_batch, tell, best, serialize, deserialize, and a treequest.models submodule containing MultiLLMConfig. These names appear consistently across the paper's code examples but have not been verified against actual source files.

Entry point: The paper describes a functional API (no CLI entry point documented). The primary interface is init_tree() → ask_batch() / tell() loop → best() [PAPER — §8–9].

Config file paths and key fields: [UNVERIFIED] — No standalone config file format (YAML/TOML) is documented. Configuration is done via function arguments: method ("ab_mcts_a", "ab_mcts_m", "multi_llm"), batch_size, max_iterations, model_config [PAPER — §8, §12].

Observable output artifacts: [UNVERIFIED] — The paper describes serializable TreeState objects that can be checkpointed to JSON. No specific log file paths, output directories, or artifact naming conventions are documented.

Consequence for evidence tiers: All implementation claims in this chapter default to [PAPER] or [INFERRED] unless otherwise noted. Code examples use names from the paper's documented API but cannot be confirmed as exact repository identifiers.

19.2.2 System Architecture

TreeQuest is designed around a stateless, functional architecture where the search tree is an immutable data structure passed through a sequence of pure functions [PAPER — §8, §10]. The library separates three concerns: (1) search decisions (which action to take, which node to refine, which model to use), (2) solution generation (delegated entirely to user-provided callbacks), and (3) solution evaluation (also user-provided). This separation allows TreeQuest to remain agnostic to the LLM provider, prompt format, and evaluation protocol [PAPER — §8.1].

The core data flow follows the ask-tell pattern: the library asks the user what to evaluate next (returning Trial objects specifying action type, parent node, and model hint), and the user tells the library the results (solution and score). Between ask and tell, the user executes LLM calls however they choose — synchronously, asynchronously, with caching or rate limiting [PAPER — §9.1].

19.2.3 Artifact Inventory

Since direct repository inspection was not performed, the following table lists artifacts described in the paper and source material, with evidence sources.

Artifact	Type	Description	Evidence Source
`treequest` Python package	Library	Core AB-MCTS implementation	[PAPER — §8; README]
`init_tree()`	Function	Initialize search tree state	[PAPER — §8.3]
`ask_batch()`	Function	Request batch of trials	[PAPER — §9.1]
`tell()`	Function	Report results, return new state	[PAPER — §9.3]
`best()`	Function	Retrieve top-k solutions	[PAPER — §8.3]
`serialize()` / `deserialize()`	Functions	Checkpoint/resume support	[PAPER — §10.2]
`Trial` dataclass	Data structure	Immutable trial descriptor	[PAPER — §9.2]
`TreeState` dataclass	Data structure	Complete immutable search state	[PAPER — §10.3]
`MultiLLMConfig`	Config class	Multi-model pool configuration	[PAPER — §13.1]
JSON checkpoint files	Output artifact	Serialized tree state for resumption	[PAPER — §10.2]
Apache 2.0 license	License	Open-source license	[PAPER — §1; README]

19.2.4 Execution Trace

No CLI entry point is documented. The paper describes a purely programmatic interface. The following is the documented execution pattern [PAPER — §8.3, §10.2]:

# Pseudocode — reconstructed from paper §8.3, §10.2, not repo-verified
# No CLI command documented; library is used programmatically

from treequest import init_tree, ask_batch, tell, best, serialize, deserialize

# Initialization
tree_state = init_tree(
    method="multi_llm",       # Variant: "ab_mcts_a" | "ab_mcts_m" | "multi_llm"
    batch_size=5,              # Parallel trials per iteration
    max_iterations=250,        # Total compute budget
    model_config=config,       # MultiLLMConfig for multi-model mode
)

# Main loop: ask → generate → evaluate → tell
for _ in range(50):            # 50 batches × 5 = 250 trials
    trials = ask_batch(tree_state)
    results = [(t.trial_id, generate(t), score(t)) for t in trials]
    tree_state = tell(tree_state, results)

# Retrieve results
solution = best(tree_state, k=2)

# Checkpoint: serialize to JSON
import json
with open("checkpoint.json", "w") as f:
    json.dump(serialize(tree_state), f)

# Expected output: JSON file containing full TreeState
# (nodes, edges, bandit_state, iteration count, pending trials, RNG state)

Note: No specific output directory structure, log file format, or terminal output is documented in available materials. The primary observable artifact is the serialized TreeState JSON checkpoint [PAPER — §10.2].

19.3 Core Algorithms

19.3.1 Verification Matrix

Algorithm / Mechanism	Claim	Evidence Source	Artifact	Confidence
Adaptive Branching (depth vs. width)	Thompson Sampling selects between depth and width actions at each iteration	[PAPER — §3.3, §4]	arXiv:2503.04412 §4	High
Beta-Bernoulli Thompson Sampling	Each action maintains Beta(α, β) posterior updated with binary rewards	[PAPER — §5.1–5.2]	arXiv:2503.04412 §5	High
Node Aggregation (AB-MCTS-A)	Binary success/failure based on whether child exceeds subtree best	[PAPER — §6.1]	arXiv:2503.04412 §6.1	High
Mixed Models (AB-MCTS-M)	PyMC-based hierarchical Bayesian inference for continuous rewards	[PAPER — §5.3, §6.2]	arXiv:2503.04412 §6.2	High
Multi-LLM model selection	Thompson Sampling over (direction, model) Cartesian product	[PAPER — §6.3, §7]	arXiv:2503.04412 §7	High
Immutable TreeState	Frozen dataclasses, functional state transitions	[PAPER — §10]	arXiv:2503.04412 §10	High
Ask-tell interface	Decoupled search decisions from generation execution	[PAPER — §9]	arXiv:2503.04412 §9	High
Node selection for refinement	Secondary bandit problem for which node to refine	[PAPER — §3.1]	arXiv:2503.04412 §3.1	Medium — mechanism described but exact policy not specified
Emergent model specialization	Different models develop different roles during search	[PAPER — §7.3]	arXiv:2503.04412 §7.3	Medium — empirical observation, not formally measured
Crossover action	Combining elements from two subtrees	[PAPER — §16.2]	Future direction, not implemented	N/A — proposed only

19.3.2 The Depth-Width Trade-off

The central algorithmic contribution of AB-MCTS is the adaptive selection between two search actions at each iteration [PAPER — §2.2, §3.3]:

Depth (Go Deeper): Select an existing node in the search tree and generate a child solution by feeding the node's solution as context to the LLM, along with its score and improvement instructions. This creates chains of progressively refined solutions — analogous to gradient-descent-like steps in continuous optimization [PAPER — §3.1].
Width (Go Wider): Generate an entirely new solution from scratch, without conditioning on any previous attempt. This is equivalent to adding a new root-level node to the search tree, providing diversity when existing branches are stuck in local optima [PAPER — §3.2].

The paper's key insight is that the optimal depth-width ratio is problem-dependent and node-dependent within the search tree, and should be learned online rather than fixed as a hyperparameter [PAPER — §3.3]. AB-MCTS accomplishes this via Thompson Sampling over a Beta-Bernoulli model.

19.3.3 Thompson Sampling for Action Selection

Thompson Sampling (Thompson, 1933) is a Bayesian approach to the multi-armed bandit problem. Each arm maintains a prior distribution over its expected reward; after observing a reward, the prior is updated to a posterior via Bayes' rule. At each decision point, a value is sampled from each arm's posterior, and the arm with the highest sample is selected [PAPER — §5.1].

For binary rewards, the natural conjugate prior is the Beta distribution. If arm i has observed α_i successes and β_i failures, its posterior is:

$$P(\text{reward}_i) \sim \text{Beta}(\alpha_i + 1,\; \beta_i + 1)$$

Symbol Table:

α_i: cumulative success count for arm i (depth or width)
β_i: cumulative failure count for arm i
The +1 terms encode a uniform prior Beta(1,1)

Provenance: [Standard definition applied here] — Beta-Bernoulli Thompson Sampling is a well-established technique (Agrawal & Goyal, 2012). The application to depth/width selection is the paper's contribution [PAPER — §5.2].

Reward definition: A reward of 1 (success) is recorded when the newly generated solution's score exceeds the best score among the parent node's descendants. A reward of 0 (failure) is recorded otherwise [PAPER — §4.5]. Formally:

$$r_t = \mathbb{1}[\text{score}(s_{\text{child}}) > \text{subtree\_best}(\text{parent})]$$

Symbol Table:

r_t: binary reward at iteration t
s_child: the newly generated solution
score(·): user-defined evaluation function mapping solutions to [0, 1]
subtree_best(parent): maximum score among all descendants of the parent node
𝟙[·]: indicator function (1 if condition holds, 0 otherwise)

Provenance: [Published formula — paper §4.5]. This binary thresholding is the specific choice made by AB-MCTS-A; AB-MCTS-M uses continuous rewards instead.

Worked Example — Thompson Sampling Action Selection:

Suppose after 20 iterations: depth has had α_d = 6 successes, β_d = 8 failures; width has had α_w = 4 successes, β_w = 2 failures.

Depth posterior: Beta(7, 9). Mean = 7/16 = 0.4375. The posterior is relatively concentrated.
Width posterior: Beta(5, 3). Mean = 5/8 = 0.625. The posterior is broader (fewer observations).
Sample from Beta(7, 9) → e.g., 0.38. Sample from Beta(5, 3) → e.g., 0.71.
Width wins (0.71 > 0.38), so this iteration generates a fresh solution.

If width then produces a solution scoring 0.65 against a current root-level best of 0.60, the reward is 1 (success), updating width to α_w = 5, β_w = 2. If the score had been 0.55, the reward would be 0, updating to α_w = 4, β_w = 3.

19.3.4 AB-MCTS Main Loop

The algorithm operates in four phases per iteration: selection, expansion, evaluation, and backpropagation — mirroring classical MCTS but adapted for LLM inference-time scaling [PAPER — §4.1].

# Pseudocode — reconstructed from paper §4.1, not repo-verified
# Uses paper-described function names; actual implementation may differ

def ab_mcts(problem, budget_T, models, scorer):
    """AB-MCTS main loop with four phases per iteration."""
    tree = init_tree()
    thompson = ThompsonSampler(actions=["depth", "width"])

    for t in range(budget_T):
        # Phase 1 — Selection: Choose depth vs. width via Thompson Sampling
        action = thompson.sample()  # Sample from Beta posteriors

        if action == "width":
            # Phase 2a — Expansion (width): New root-level solution
            prompt = make_fresh_prompt(problem)
            parent_node = tree.root
        else:
            # Phase 2b — Expansion (depth): Refine existing node
            parent_node = select_node(tree)  # Secondary bandit/UCB
            prompt = make_refinement_prompt(
                problem, parent_node.solution, parent_node.score
            )

        # Model selection (Multi-LLM extension only)
        model = select_model(models)  # Thompson Sampling over model pool

        # Phase 3 — Evaluation: Generate solution and score it
        solution = model.generate(prompt)
        score = scorer.evaluate(solution)  # Normalized to [0, 1]

        # Phase 4 — Backpropagation: Update tree and Thompson model
        child = tree.add_child(parent_node, solution, score)
        reward = 1 if score > parent_node.subtree_best else 0
        thompson.update(action, reward)

    return tree.best_solution()

19.3.5 AB-MCTS Variants

AB-MCTS-A (Node Aggregation). The simplest variant reduces all scores to binary success/failure outcomes. Each refinement attempt is treated as an independent Bernoulli trial: did the child improve upon the subtree's best score? The Beta posterior is well-calibrated under this assumption, but the magnitude of improvement is discarded — a score increase from 0.5 to 0.51 counts identically to an increase from 0.5 to 0.99 [PAPER — §6.1]. Dependencies: NumPy only; computational overhead is negligible [PAPER — Table in §6].

AB-MCTS-M (Mixed Models). This variant uses PyMC-based hierarchical Bayesian inference to maintain richer posteriors over continuous score distributions, rather than reducing to binary outcomes [PAPER — §5.3, §6.2]. The hierarchical structure allows information sharing across tree nodes: if refinements at one node tend to produce high scores, this informs expectations for other nodes. The trade-off is increased computational cost (MCMC sampling per decision) and a PyMC/ArviZ dependency [PAPER — §6.2].

[INFERRED — Implementation Detail] The exact PyMC model specification (number of MCMC samples, hyperprior parameterization, inference backend) is described conceptually in the paper but the precise implementation constants have not been verified in the repository. The paper mentions "hierarchical statistical modeling" with population-level partial pooling [PAPER — §6.2], but the specific model architecture (e.g., normal vs. beta likelihood, exact hyperprior values) may differ in the actual implementation.

Multi-LLM AB-MCTS. This variant extends the action space to the Cartesian product of {depth, width} × {model₁, model₂, …, model_k}. Each (direction, model) combination maintains its own Beta posterior, enabling the algorithm to learn that, for example, one model excels at refinement while another produces better diverse initial solutions [PAPER — §6.3, §7.1].

Variant	Branching Strategy	Score Handling	Dependencies	Overhead	Provenance
AB-MCTS-A	Beta-Bernoulli TS	Binary (improve / not)	NumPy only	Negligible	[PAPER — §6.1]
AB-MCTS-M	Hierarchical Bayesian	Continuous [0,1]	PyMC, ArviZ	MCMC per decision	[PAPER — §6.2]
Multi-LLM	Multi-dim bandit	Binary per (action, model)	NumPy + API keys	Negligible per decision	[PAPER — §6.3]

19.3.6 Multi-LLM Model Selection

The Multi-LLM extension introduces a three-dimensional decision space at each iteration [PAPER — §7.1]: (1) direction (depth vs. width), (2) model selection (which LLM from the pool), (3) generation (execute with chosen model). Each (direction, model) pair maintains an independent Beta posterior, updated based on whether the generation produced a score improvement.

# Pseudocode — reconstructed from paper §7.2, not repo-verified

class MultiLLMThompsonSampler:
    """Thompson Sampling across (direction, model) combinations."""

    def __init__(self, models: list[str]):
        self.arms = {}
        for direction in ["depth", "width"]:
            for model in models:
                key = (direction, model)
                self.arms[key] = {"alpha": 1.0, "beta": 1.0}

    def select(self) -> tuple[str, str]:
        """Sample from all arm posteriors; return (direction, model)."""
        best_key, best_sample = None, -1.0
        for key, params in self.arms.items():
            sample = np.random.beta(params["alpha"], params["beta"])
            if sample > best_sample:
                best_sample = sample
                best_key = key
        return best_key  # (direction, model_name)

    def update(self, direction: str, model: str, reward: float):
        key = (direction, model)
        self.arms[key]["alpha"] += reward
        self.arms[key]["beta"] += (1.0 - reward)

Worked Example — Multi-LLM Selection (3 models × 2 directions = 6 arms):

After 30 iterations with o4-mini, Gemini-2.5-Pro, and DeepSeek R1-0528:

Arm (direction, model)	α	β	Posterior Mean
(depth, o4-mini)	5	8	0.385
(depth, Gemini)	7	4	0.636
(depth, DeepSeek)	3	5	0.375
(width, o4-mini)	4	3	0.571
(width, Gemini)	2	4	0.333
(width, DeepSeek)	3	3	0.500

The algorithm has learned that Gemini tends to be strongest at refinement (depth) while o4-mini provides good diversity (width). The next sample might draw 0.72 from (depth, Gemini) and 0.65 from (width, o4-mini), selecting Gemini for a depth action.

19.3.7 Stateless Design and Immutability

A distinctive architectural choice is that all tree state is represented via immutable (frozen) data structures. Every operation — adding a node, updating scores, recording Thompson Sampling observations — returns a new TreeState rather than mutating the existing one [PAPER — §10.1]. This design enables:

Checkpointing: Any state can be serialized to JSON and restored later [PAPER — §10.2].
Reproducibility: Given the same initial state and random seed, the search trajectory is fully deterministic [PAPER — §10.1].
Concurrency: Immutable structures are inherently thread-safe — multiple threads can read the same state without locks [PAPER — §10.1].
Debugging: The full history of tree states can be retained for post-hoc analysis [PAPER — §10.1].

19.3.8 Node Selection for Refinement

When the algorithm decides to go deeper, it must select which existing node to refine. The paper describes this as a secondary multi-armed bandit problem where each node is an arm, with selection balancing exploitation (refining high-scoring nodes) with exploration (trying nodes with low visit counts) [PAPER — §3.1]. The exact selection policy is not formally specified in the paper.

[INFERRED — Node Selection Policy] The paper states that node selection "balances exploitation (refining high-scoring nodes) with exploration (trying nodes that haven't been refined much)" [PAPER — §3.1], which is consistent with UCB-style selection as used in classical MCTS. However, the specific formula (UCB1, PUCT, Thompson Sampling at the node level, or a simpler heuristic) is not documented. The comparison table in §14.1 of the paper lists "Thompson Sampling from Beta posteriors" as the selection method, but this refers to the depth/width selection, not within-depth node selection.

19.3.9 Regret Analysis

Thompson Sampling for the K-armed Bernoulli bandit has established regret bounds [PAPER — §14.2]:

$$\mathbb{E}[R(T)] \leq O\!\left(\sqrt{K \cdot T \cdot \log T}\right)$$

Symbol Table:

R(T): cumulative regret after T rounds
K: number of arms — 2 for single-model (depth/width), 2|M| for Multi-LLM with |M| models
T: total number of iterations (budget)

Provenance: [Standard definition applied here] — This bound is from Agrawal & Goyal (2012) for the stationary case. The paper notes that AB-MCTS reward distributions are non-stationary (the value of depth vs. width changes as the tree grows and best known score improves), which the standard bound does not address [PAPER — §14.2].

Worked Example — Regret Bound:

For Multi-LLM AB-MCTS with K = 6 arms (2 directions × 3 models) and T = 250 iterations:

E[R(250)] ≤ O(√(6 × 250 × log 250)) ≈ O(√(6 × 250 × 5.52)) ≈ O(√8300) ≈ O(91)

This suggests that the algorithm's cumulative loss from suboptimal arm selection grows sublinearly, with expected regret of roughly 91 over 250 rounds — meaning the vast majority of iterations are allocated to near-optimal actions. Note this is a theoretical upper bound under stationarity; actual performance may differ due to non-stationary rewards.

19.4 Key Results

19.4.1 Evaluation Caveats

Self-reported results: All benchmark numbers are reported by the system authors (Sakana AI). No independent reproduction by a third party has been documented in available materials.
Exact model versions: The paper specifies model names (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528) but does not provide exact model version identifiers, API snapshot dates, or whether model behavior may have changed between paper preparation and publication [PAPER — §11.2].
Seeds and run counts: The number of independent runs per configuration, random seeds used, and variance across runs are not reported. All numbers appear to be from single runs or are not specified [PAPER — §11.3].
Compute budget matching: The repeated sampling baseline uses 250 independent generations with o4-mini. The Multi-LLM AB-MCTS also uses 250 iterations but with three different models (o4-mini, Gemini, DeepSeek), which may have different per-call costs. Whether total token or dollar budgets are matched is not stated [PAPER — §11.2–11.3].
Prompt engineering: The paper acknowledges that the effectiveness of depth actions depends on prompt quality [PAPER — §16.1], but does not release the exact prompts used, making reproduction dependent on prompt design choices.
Approximate figures: Some results (AB-MCTS-M "~28%", Multi-LLM ">30%") are reported with approximate notation rather than exact values [PAPER — §11.3].
Task subset: Results are on the 120 public evaluation tasks of ARC-AGI-2, not the full held-out test set. Public set performance may not generalize to the private evaluation [PAPER — §11.1].

19.4.2 ARC-AGI-2 Results

Method	Model(s)	Task Set	Pass@250	Pass@2	Seeds / Runs	Compute Budget	Result Type	Evidence Source
Repeated Sampling	o4-mini	120 public eval	23.0%	— (not reported)	— (not reported)	250 generations	Self-reported	[PAPER — §11.3]
AB-MCTS-A	o4-mini	120 public eval	27.5%	— (not reported)	— (not reported)	250 iterations	Self-reported	[PAPER — §11.3]
AB-MCTS-M	o4-mini	120 public eval	~28%	— (not reported)	— (not reported)	250 iterations	Self-reported (approximate)	[PAPER — §11.3]
Multi-LLM AB-MCTS	o4-mini + Gemini-2.5-Pro + DeepSeek R1-0528	120 public eval	>30%	19.2%	— (not reported)	250 iterations (3-model pool)	Self-reported (approximate)	[PAPER — §11.3]

19.4.3 Analysis of Results

The reported improvement from 23.0% (repeated sampling) to >30% (Multi-LLM AB-MCTS) represents a >30% relative improvement in Pass@250 on the same 120-task evaluation set [PAPER — §11.3]. The paper emphasizes three patterns observed in search tree analysis [PAPER — §11.4]:

Complementary reasoning: One model identifies the correct transformation type while another implements it correctly at the pixel level.
Cross-model error correction: A different model can sometimes fix systematic errors during refinement that the originating model reproduces.
Diverse initialization: Different models produce different distributions of initial solutions, improving coverage of the solution space.

The Pass@2 metric (19.2%) for Multi-LLM AB-MCTS indicates that when selecting only the top-2 solutions from 250 candidates for final submission, the algorithm captures approximately two-thirds of the solvable tasks. This demonstrates effective ranking, not just search [PAPER — §11.4].

19.4.4 Scaling Behavior

The paper reports that repeated sampling's Pass@N curve flattens quickly (most easily solvable problems found within the first ~50 attempts), while AB-MCTS maintains a steeper slope because search directs compute toward promising regions [PAPER — §11.5].

The paper defines search efficiency at iteration t as the probability that iteration t solves a previously unsolved problem [PAPER — §11.5]:

$$\text{efficiency}(t) = P(\text{solved at } t \mid \text{not solved at } t{-}1)$$

Symbol Table:

efficiency(t): marginal probability of solving a new problem at iteration t
For repeated sampling, this decreases monotonically
For AB-MCTS, this can increase over time as information accumulates

Provenance: [Published formula — paper §11.5]. This is a conceptual framing for interpreting scaling curves, not a directly measured quantity.

Worked Example — Search Efficiency:

Consider a problem solvable by 2% of random samples. Under repeated sampling, efficiency(t) ≈ 0.02 at every iteration (assuming independence). After 50 iterations, P(unsolved) = 0.98⁵⁰ ≈ 0.364, so Pass@50 ≈ 63.6%.

Under AB-MCTS, if the first few attempts reveal that a particular refinement direction is promising, subsequent iterations concentrate on that direction. The efficiency might start at 0.02 but increase to, say, 0.05 as the algorithm focuses. After 50 directed iterations, the cumulative solve probability could exceed the repeated-sampling baseline despite starting from the same per-sample rate.

19.4.5 Non-LLM Baseline Comparison

The paper compares AB-MCTS variants against repeated sampling (Best-of-N) as the primary baseline [PAPER — §11.3]. No comparison against non-LLM baselines (e.g., brute-force program enumeration, DSL-based solvers, or classical ARC solvers) is reported in the available materials. The ARC-AGI-2 benchmark itself has been evaluated by various approaches in the ARC Prize competition, but the paper does not position its results against non-LLM competition entries.

[INFERRED — Baseline Context] The absence of non-LLM baselines limits the ability to assess whether the observed improvements are attributable to the search algorithm (AB-MCTS) or to the capability of the underlying frontier models. A matched-budget comparison against a DSL-based ARC solver or classical search algorithm would strengthen the evaluation. The paper's contribution is specifically about orchestrating LLM calls more efficiently, so the repeated-sampling baseline is the most relevant comparator for that claim.

19.5 Implementation & Cost

19.5.1 Implementation Details

Aspect	Detail	Evidence Source
Language	Python	[PAPER — §8; README]
License	Apache 2.0	[PAPER — §1; README]
Core dependencies (AB-MCTS-A)	NumPy	[PAPER — §6, variant table]
Additional dependencies (AB-MCTS-M)	PyMC, ArviZ	[PAPER — §6.2]
Design pattern	Stateless ask-tell, immutable dataclasses	[PAPER — §8.1, §10]
Parallelism model	Batch-parallel trials; user manages async execution	[PAPER — §9.3]
Checkpoint format	JSON-serializable TreeState	[PAPER — §10.2]
Models used in reported experiments	o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528	[PAPER — §11.2]

19.5.2 Cost Analysis

Cost Dimension	Detail	Provenance
Iterations per task	250 (max budget in reported experiments)	[PAPER — §11.2]
Batch size	5 (recommended)	[PAPER — §12.1]
LLM calls per task	250 (1 per iteration)	[PAPER — §4, §11.2]
Computational overhead of AB-MCTS-A	Negligible (NumPy Beta sampling)	[PAPER — §6, variant table]
Computational overhead of AB-MCTS-M	MCMC sampling per decision (PyMC)	[PAPER — §6, variant table]
Hardware configuration	— (not reported)	—
Wall-clock time per task	— (not reported)	—
Dollar cost per task	— (not reported)	—
Total cost for 120-task evaluation	— (not reported)	—

[INFERRED — Cost Projection] The paper does not report dollar costs or wall-clock times. A rough projection: at 250 iterations per task and 120 tasks, the full evaluation requires 30,000 LLM API calls across three frontier models. At typical 2025–2026 API pricing (o4-mini: ~$0.01–0.03 per call depending on prompt length; Gemini-2.5-Pro and DeepSeek R1-0528 at comparable rates), the total cost for the 120-task evaluation likely falls in the range of $300–$3,000 depending on prompt complexity and token counts. The paper describes ~30 seconds per LLM call as a reference point [PAPER — §10.2], which would put wall-clock time at approximately 2 hours per task with sequential execution. These are author estimates, not paper-reported figures.

19.6 Reproducibility

19.6.1 Reproducibility Assessment

Requirement	Status	Notes
Code publicly released	✓	github.com/SakanaAI/treequest, Apache 2.0 [PAPER — §1]
Config files available	Partial	Configuration via function arguments documented in paper; no standalone config files described
Pretrained weights / checkpoints	N/A	System uses third-party LLM APIs; no training involved
Documented entry point or run command	Partial	Programmatic API documented [PAPER — §8]; no CLI; README may contain usage examples [UNVERIFIED]
Compute requirements stated	✗	Hardware, runtime, and cost not reported
Seeds and run counts reported	✗	Number of runs and random seeds not specified
Exact model versions specified	Partial	Model names given (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528); exact API versions not pinned
Prompt templates released	✗	Prompts not published; essential for reproduction
Evaluation function released	✗ (unclear)	ARC-AGI-2 scoring is well-defined (pixel accuracy), but the exact implementation is not confirmed in the repo
Independent reproduction attempted	✗	No independent reproductions documented in available materials

19.6.2 Step-by-Step Verification Path

To attempt validation of AB-MCTS on ARC-AGI-2, an external researcher would need to:

Clone the repository: git clone https://github.com/SakanaAI/treequest
Install dependencies: The paper mentions NumPy as the only dependency for AB-MCTS-A, with PyMC/ArviZ for AB-MCTS-M [PAPER — §6]. The exact install command is not documented in the paper. The researcher should check the repository for pyproject.toml, setup.py, or requirements.txt.
Obtain API keys: OpenAI (o4-mini), Google (Gemini-2.5-Pro), DeepSeek (R1-0528). Exact model version strings may need to be determined from the repository or API documentation.
Obtain ARC-AGI-2 tasks: The 120 public evaluation tasks are available from arcprize.org [PAPER — §11.1]. The exact task loading format expected by the code is not documented in the paper.
Implement prompt templates: The paper describes depth and width prompts conceptually [PAPER — §4.3] but does not release exact templates. This is a significant barrier — the researcher must design their own prompts, which may not match the paper's performance.
Implement scoring function: Pixel-level accuracy is well-defined for ARC tasks [PAPER — §12.3], but the exact normalization (e.g., handling of wrong grid dimensions) must be implemented.
Run search: Using the documented API pattern: init_tree(method="multi_llm", batch_size=5, max_iterations=250) → ask/tell loop → best(k=2).
Expected output: A Pass@250 score of >30% and Pass@2 of ~19.2% on the 120 public tasks.

Critical gaps for reproduction: The absence of released prompt templates is the primary barrier. Since the paper acknowledges that "the effectiveness of depth (refinement) actions depends heavily on the quality of the refinement prompt" [PAPER — §16.1], prompt design is a first-order factor in reproducing results. The lack of reported seeds, run counts, and variance estimates further limits the ability to determine whether a reproduction attempt has succeeded or failed.

19.7 Threats to Validity

Self-reported results without independent verification. All benchmark numbers are reported by the system's authors (Sakana AI). No independent reproduction or third-party evaluation has been documented. The improvement from 23% to >30% should be interpreted as an author-reported claim until independently validated.
Unmatched compute budgets across configurations. The Multi-LLM variant uses three different frontier models in its 250-iteration budget. The single-model baselines use only o4-mini. If the three models have different costs per call (in tokens, latency, or dollars), the budgets are not dollar-matched. The paper uses iteration count as the budget metric, which treats all model calls as equivalent [PAPER — §11.2].
Model version non-stationarity. The paper specifies model names but not exact version identifiers or API snapshot dates. Frontier model APIs are updated regularly; results obtained with one API version may not be reproducible with a later version. This is a systemic issue for all inference-time scaling research, not specific to AB-MCTS.
Unreleased prompt templates. Prompt engineering is acknowledged as a critical factor [PAPER — §16.1], but the exact prompts are not published. This creates a gap between the described algorithm (which is fully specified) and the reported results (which depend on undisclosed implementation details).
Absence of variance estimates. No error bars, confidence intervals, or multi-run statistics are reported. Without knowing the variance across runs, it is impossible to determine whether the differences between AB-MCTS variants and the repeated-sampling baseline are statistically significant.
Public evaluation set only. Results are on ARC-AGI-2's 120 public evaluation tasks, not the held-out private test set. Overfitting to public task characteristics (through prompt tuning or hyperparameter selection) cannot be ruled out.
Non-stationarity of reward distributions. The Thompson Sampling regret bounds cited in the paper [PAPER — §14.2] assume stationary reward distributions. The paper acknowledges that this assumption is violated in AB-MCTS (reward distributions shift as the tree grows) but does not quantify the impact of non-stationarity on the regret guarantee.
Node selection policy underspecified. The secondary bandit problem of selecting which node to refine (when depth is chosen) is described only in general terms [PAPER — §3.1]. The exact selection policy — which substantially affects depth-action quality — is not formally specified.
Emergent specialization claim. The reported observation that "different models develop different roles during search" [PAPER — §7.3] is a qualitative claim without formal measurement or statistical testing. The degree of specialization and whether it is robust across tasks and runs is not quantified.

19.8 Limitations & Open Questions

19.8.1 Documented Limitations

The paper explicitly identifies five limitations [PAPER — §16.1]:

Scoring function requirement: AB-MCTS requires a scalar scoring function in [0, 1] that provides meaningful partial-credit signal. Tasks without clear partial credit (open-ended creative writing, subjective evaluation) are poorly suited [PAPER — §16.1].
LLM call overhead: Each iteration requires at least one LLM API call. At 250 iterations per task, the cost may be prohibitive for expensive models or large task sets [PAPER — §16.1].
Prompt engineering dependency: Refinement (depth) effectiveness depends heavily on prompt quality. Poor prompts may cause the LLM to ignore or verbatim-copy the parent solution [PAPER — §16.1].
Non-stationarity: The true reward distributions shift during search, violating the stationarity assumption of standard Thompson Sampling. While empirically robust to mild non-stationarity, extreme distribution shift could degrade performance [PAPER — §16.1].
No crossover: Unlike evolutionary algorithms, AB-MCTS does not combine elements from different solution branches. Each refinement acts on a single parent [PAPER — §16.1].

19.8.2 Open Questions

[INFERRED — Open Research Questions]

Crossover action: The paper proposes adding a "crossover" operation that combines elements from two subtrees via LLM-mediated recombination [PAPER — §16.2]. This would expand the action space to {depth, width, crossover} × models, potentially enabling richer search dynamics at the cost of increased bandit dimensionality.
Hierarchical decomposition: For multi-step problems, AB-MCTS could be applied hierarchically — a top-level search decomposes the problem, and nested searches solve subproblems [PAPER — §16.2]. The interaction between hierarchical decomposition and Thompson Sampling is unexplored.
Learned scoring functions: Replacing external evaluators with LLM-based reward models would create a fully self-contained system but introduces reward hacking risks [PAPER — §16.2].
Non-stationary Thompson Sampling: Discounted Thompson Sampling with geometric down-weighting of older observations (α_t = γ·α_t−1 + reward_t, γ ∈ (0,1)) could address reward distribution shift [PAPER — §16.2].
Integration with code evolution: The paper suggests AB-MCTS as a search backbone for systems like AlphaEvolve, replacing MAP-Elites or simple evolutionary strategies for program search [PAPER — §16.2]. The Multi-LLM variant is particularly relevant for code generation tasks where different models may excel at algorithmic design versus implementation.

19.9 Survey Positioning

19.9.1 Comparison with Related Systems

System	Search Strategy	Adaptation	Multi-Model	Primary Domain	Budget Context
AB-MCTS / TreeQuest (this chapter)	Adaptive tree search (depth + width)	Thompson Sampling (online)	Yes — learned routing via TS	ARC-AGI abstract reasoning	250 iterations / task
FunSearch (Chapter 11)	Population-based evolutionary	Island model with program database	No — single model (PaLM 2 / Gemini)	Mathematical optimization (cap sets, bin packing)	Millions of evaluations
AlphaEvolve (Chapter 9)	Evolutionary with MAP-Elites archive	Adaptive operator selection	Yes — Gemini model cascade	Algorithm discovery (broad)	Large-scale (hours to days)

Note on budget comparisons: The systems above operate at fundamentally different scales. FunSearch runs millions of cheap evaluations of short programs; AlphaEvolve runs large-scale evolutionary campaigns over hours to days; AB-MCTS runs 250 iterations per task with expensive frontier model calls. Direct performance comparison across these systems is not meaningful without budget normalization, which none of the source papers provide across system boundaries.

19.9.2 Evolutionary Correspondence

AB-MCTS shares conceptual structure with evolutionary algorithms, as the paper itself notes [PAPER — §15.3]. The following table maps the correspondence:

AB-MCTS Component	Evolutionary Analogue	Correspondence Quality
Width action (new root solution)	Random initialization / immigration	Strong — both introduce independent diversity
Depth action (refine existing node)	Mutation + local search / hill climbing	Moderate — refinement is directed by prompt, not random perturbation
Thompson Sampling adaptive branching	Adaptive operator selection	Strong — both learn action allocation online
Multi-LLM model pool	Multiple mutation operators	Moderate — models are more complex than operators
Search tree	Population with genealogy	Weak — tree is append-only, no population culling
No crossover	Asexual reproduction only	Strong limitation — explicitly acknowledged [PAPER — §16.1]

Where the analogy breaks down: AB-MCTS does not maintain a fixed population size, does not perform selection pressure (all nodes are retained, none are culled), and does not use crossover or recombination [PAPER — §15.3]. The tree is append-only: nodes are never removed, and the "population" grows monotonically. This means AB-MCTS cannot exhibit the diversity-maintenance dynamics of island models or MAP-Elites archives used by FunSearch and AlphaEvolve. The tree structure also lacks the multi-objective quality-diversity characteristics that drive these systems' ability to maintain behavioral variety. In essence, AB-MCTS is a search algorithm that shares surface-level structure with evolutionary algorithms but operates under fundamentally different population dynamics.

Key Contribution

AB-MCTS introduces adaptive depth-width branching via Thompson Sampling as a principled alternative to fixed search strategies for LLM inference-time scaling. The core insight — that the optimal ratio of solution refinement to fresh generation is problem-dependent and learnable online — is formalized as a multi-armed bandit and extended to multi-model selection. The TreeQuest library provides a clean stateless implementation with checkpoint/resume support. Among inference-time scaling methods surveyed (Best-of-N, Chain-of-Thought, Tree-of-Thought, Self-Consistency), AB-MCTS is, based on available evidence, the approach that explicitly treats the depth-width trade-off as a learnable parameter rather than a fixed architectural choice [PAPER — §15.1].

19.10 Summary

Key Takeaway: AB-MCTS demonstrates that adaptive allocation between solution refinement (depth) and fresh generation (width), governed by Thompson Sampling, can meaningfully improve inference-time scaling over naive repeated sampling. On ARC-AGI-2, the Multi-LLM variant reportedly achieves >30% Pass@250 compared to 23% for repeated sampling — a relative improvement exceeding 30% — with the additional finding that multi-model collaboration solves problems no single model can solve individually.

Main Contribution: A Bayesian framework for online learning of the depth-width trade-off in LLM-based search, implemented in the TreeQuest library with a stateless, immutable, checkpoint-friendly design. The Multi-LLM extension demonstrates that model selection can be jointly optimized alongside search direction, enabling emergent model specialization.

Most Important Gap for Future Research: The absence of released prompt templates, variance estimates, and independent reproduction limits the ability to distinguish the contribution of the search algorithm from the contribution of prompt engineering and model selection. A researcher attempting reproduction should prioritize obtaining or reverse-engineering the prompt templates, as the paper explicitly acknowledges prompt quality as a critical factor in depth-action effectiveness. Adding crossover-like recombination operations — explicitly identified as a limitation — is the most natural algorithmic extension, potentially bridging AB-MCTS toward the richer population dynamics exhibited by evolutionary approaches like FunSearch and AlphaEvolve.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}