AB-MCTS & TreeQuest
Part: Program Synthesis & ARC-AGI Solvers
19.1 Overview & Motivation
Inference-time scaling has emerged as a complementary axis to training-time scaling for improving large language model performance on hard reasoning tasks. Rather than training larger models, one can allocate additional compute at inference time by generating multiple candidate solutions and selecting among them. The simplest such strategy — repeated sampling (Best-of-N) — generates N independent solutions and returns the best, but makes no use of information gathered during the search process: attempt N+1 is no more informed than attempt 1 [PAPER — arXiv:2503.04412, §1].
AB-MCTS (Adaptive Branching Monte Carlo Tree Search), introduced by Sakana AI in arXiv:2503.04412 (March 2025), addresses this limitation by framing inference-time scaling as a tree search problem with an adaptive depth-versus-width trade-off [PAPER — §1]. The algorithm decides at each iteration whether to go deeper (refine an existing candidate solution through sequential iteration) or go wider (generate an entirely new solution from scratch), using Thompson Sampling to learn the optimal allocation online. A Multi-LLM extension adds a third decision dimension: which frontier model to use at each generation step [PAPER — §1].
The accompanying open-source library, TreeQuest (Apache 2.0, github.com/SakanaAI/treequest), implements this algorithm with a stateless, functional design supporting checkpointing, resumption, and batch-parallel execution [PAPER — §1; README]. On the ARC-AGI-2 benchmark (120 public evaluation tasks, max 250 iterations), Multi-LLM AB-MCTS is reported to achieve >30% Pass@250, compared to 23% for naive repeated sampling with a single model (o4-mini) [PAPER — §1].
Within this survey's taxonomy, AB-MCTS occupies a distinctive position at the intersection of Monte Carlo tree search, multi-armed bandits, and LLM-guided program synthesis. Unlike evolutionary approaches that maintain fixed populations with crossover and mutation (Chapters 9–11), AB-MCTS structures its search as a tree with Bayesian action selection. Unlike Tree-of-Thought (Yao et al., 2023), which uses fixed branching factors, AB-MCTS adapts the branching ratio dynamically through Thompson Sampling [PAPER — §1, Table comparing to related work]. Unlike ensemble or mixture-of-experts approaches, the Multi-LLM variant learns model routing online rather than through a pre-trained router [PAPER — §7].
19.2 Architecture
19.2.1 Pre-Writing Audit
Repository Audit Statement: The repository at https://github.com/SakanaAI/treequest was not directly inspected via file-system access for this chapter. All implementation claims are derived from the paper (arXiv:2503.04412) and the source material analysis document, which describes the repository's design. No pinned commit hash or release tag was verified by the chapter author.
Top-level modules/directories confirmed: [UNVERIFIED] — The source material describes a Python library named treequest with functions init_tree, ask_batch, tell, best, serialize, deserialize, and a treequest.models submodule containing MultiLLMConfig. These names appear consistently across the paper's code examples but have not been verified against actual source files.
Entry point: The paper describes a functional API (no CLI entry point documented). The primary interface is init_tree() → ask_batch() / tell() loop → best() [PAPER — §8–9].
Config file paths and key fields: [UNVERIFIED] — No standalone config file format (YAML/TOML) is documented. Configuration is done via function arguments: method ("ab_mcts_a", "ab_mcts_m", "multi_llm"), batch_size, max_iterations, model_config [PAPER — §8, §12].
Observable output artifacts: [UNVERIFIED] — The paper describes serializable TreeState objects that can be checkpointed to JSON. No specific log file paths, output directories, or artifact naming conventions are documented.
Consequence for evidence tiers: All implementation claims in this chapter default to [PAPER] or [INFERRED] unless otherwise noted. Code examples use names from the paper's documented API but cannot be confirmed as exact repository identifiers.
19.2.2 System Architecture
TreeQuest is designed around a stateless, functional architecture where the search tree is an immutable data structure passed through a sequence of pure functions [PAPER — §8, §10]. The library separates three concerns: (1) search decisions (which action to take, which node to refine, which model to use), (2) solution generation (delegated entirely to user-provided callbacks), and (3) solution evaluation (also user-provided). This separation allows TreeQuest to remain agnostic to the LLM provider, prompt format, and evaluation protocol [PAPER — §8.1].
The core data flow follows the ask-tell pattern: the library asks the user what to evaluate next (returning Trial objects specifying action type, parent node, and model hint), and the user tells the library the results (solution and score). Between ask and tell, the user executes LLM calls however they choose — synchronously, asynchronously, with caching or rate limiting [PAPER — §9.1].
19.2.3 Artifact Inventory
Since direct repository inspection was not performed, the following table lists artifacts described in the paper and source material, with evidence sources.
| Artifact | Type | Description | Evidence Source |
|---|---|---|---|
treequest Python package | Library | Core AB-MCTS implementation | [PAPER — §8; README] |
init_tree() | Function | Initialize search tree state | [PAPER — §8.3] |
ask_batch() | Function | Request batch of trials | [PAPER — §9.1] |
tell() | Function | Report results, return new state | [PAPER — §9.3] |
best() | Function | Retrieve top-k solutions | [PAPER — §8.3] |
serialize() / deserialize() | Functions | Checkpoint/resume support | [PAPER — §10.2] |
Trial dataclass | Data structure | Immutable trial descriptor | [PAPER — §9.2] |
TreeState dataclass | Data structure | Complete immutable search state | [PAPER — §10.3] |
MultiLLMConfig | Config class | Multi-model pool configuration | [PAPER — §13.1] |
| JSON checkpoint files | Output artifact | Serialized tree state for resumption | [PAPER — §10.2] |
| Apache 2.0 license | License | Open-source license | [PAPER — §1; README] |
19.2.4 Execution Trace
No CLI entry point is documented. The paper describes a purely programmatic interface. The following is the documented execution pattern [PAPER — §8.3, §10.2]:
# Pseudocode — reconstructed from paper §8.3, §10.2, not repo-verified
# No CLI command documented; library is used programmatically
from treequest import init_tree, ask_batch, tell, best, serialize, deserialize
# Initialization
tree_state = init_tree(
method="multi_llm", # Variant: "ab_mcts_a" | "ab_mcts_m" | "multi_llm"
batch_size=5, # Parallel trials per iteration
max_iterations=250, # Total compute budget
model_config=config, # MultiLLMConfig for multi-model mode
)
# Main loop: ask → generate → evaluate → tell
for _ in range(50): # 50 batches × 5 = 250 trials
trials = ask_batch(tree_state)
results = [(t.trial_id, generate(t), score(t)) for t in trials]
tree_state = tell(tree_state, results)
# Retrieve results
solution = best(tree_state, k=2)
# Checkpoint: serialize to JSON
import json
with open("checkpoint.json", "w") as f:
json.dump(serialize(tree_state), f)
# Expected output: JSON file containing full TreeState
# (nodes, edges, bandit_state, iteration count, pending trials, RNG state)
Note: No specific output directory structure, log file format, or terminal output is documented in available materials. The primary observable artifact is the serialized TreeState JSON checkpoint [PAPER — §10.2].
19.3 Core Algorithms
19.3.1 Verification Matrix
| Algorithm / Mechanism | Claim | Evidence Source | Artifact | Confidence |
|---|---|---|---|---|
| Adaptive Branching (depth vs. width) | Thompson Sampling selects between depth and width actions at each iteration | [PAPER — §3.3, §4] | arXiv:2503.04412 §4 | High |
| Beta-Bernoulli Thompson Sampling | Each action maintains Beta(α, β) posterior updated with binary rewards | [PAPER — §5.1–5.2] | arXiv:2503.04412 §5 | High |
| Node Aggregation (AB-MCTS-A) | Binary success/failure based on whether child exceeds subtree best | [PAPER — §6.1] | arXiv:2503.04412 §6.1 | High |
| Mixed Models (AB-MCTS-M) | PyMC-based hierarchical Bayesian inference for continuous rewards | [PAPER — §5.3, §6.2] | arXiv:2503.04412 §6.2 | High |
| Multi-LLM model selection | Thompson Sampling over (direction, model) Cartesian product | [PAPER — §6.3, §7] | arXiv:2503.04412 §7 | High |
| Immutable TreeState | Frozen dataclasses, functional state transitions | [PAPER — §10] | arXiv:2503.04412 §10 | High |
| Ask-tell interface | Decoupled search decisions from generation execution | [PAPER — §9] | arXiv:2503.04412 §9 | High |
| Node selection for refinement | Secondary bandit problem for which node to refine | [PAPER — §3.1] | arXiv:2503.04412 §3.1 | Medium — mechanism described but exact policy not specified |
| Emergent model specialization | Different models develop different roles during search | [PAPER — §7.3] | arXiv:2503.04412 §7.3 | Medium — empirical observation, not formally measured |
| Crossover action | Combining elements from two subtrees | [PAPER — §16.2] | Future direction, not implemented | N/A — proposed only |
19.3.2 The Depth-Width Trade-off
The central algorithmic contribution of AB-MCTS is the adaptive selection between two search actions at each iteration [PAPER — §2.2, §3.3]:
- Depth (Go Deeper): Select an existing node in the search tree and generate a child solution by feeding the node's solution as context to the LLM, along with its score and improvement instructions. This creates chains of progressively refined solutions — analogous to gradient-descent-like steps in continuous optimization [PAPER — §3.1].
- Width (Go Wider): Generate an entirely new solution from scratch, without conditioning on any previous attempt. This is equivalent to adding a new root-level node to the search tree, providing diversity when existing branches are stuck in local optima [PAPER — §3.2].
The paper's key insight is that the optimal depth-width ratio is problem-dependent and node-dependent within the search tree, and should be learned online rather than fixed as a hyperparameter [PAPER — §3.3]. AB-MCTS accomplishes this via Thompson Sampling over a Beta-Bernoulli model.
19.3.3 Thompson Sampling for Action Selection
Thompson Sampling (Thompson, 1933) is a Bayesian approach to the multi-armed bandit problem. Each arm maintains a prior distribution over its expected reward; after observing a reward, the prior is updated to a posterior via Bayes' rule. At each decision point, a value is sampled from each arm's posterior, and the arm with the highest sample is selected [PAPER — §5.1].
For binary rewards, the natural conjugate prior is the Beta distribution. If arm i has observed αi successes and βi failures, its posterior is:
- αi: cumulative success count for arm i (depth or width)
- βi: cumulative failure count for arm i
- The +1 terms encode a uniform prior Beta(1,1)
Provenance: [Standard definition applied here] — Beta-Bernoulli Thompson Sampling is a well-established technique (Agrawal & Goyal, 2012). The application to depth/width selection is the paper's contribution [PAPER — §5.2].
Reward definition: A reward of 1 (success) is recorded when the newly generated solution's score exceeds the best score among the parent node's descendants. A reward of 0 (failure) is recorded otherwise [PAPER — §4.5]. Formally:
- rt: binary reward at iteration t
- schild: the newly generated solution
- score(·): user-defined evaluation function mapping solutions to [0, 1]
- subtree_best(parent): maximum score among all descendants of the parent node
- 𝟙[·]: indicator function (1 if condition holds, 0 otherwise)
Provenance: [Published formula — paper §4.5]. This binary thresholding is the specific choice made by AB-MCTS-A; AB-MCTS-M uses continuous rewards instead.
Suppose after 20 iterations: depth has had αd = 6 successes, βd = 8 failures; width has had αw = 4 successes, βw = 2 failures.
- Depth posterior: Beta(7, 9). Mean = 7/16 = 0.4375. The posterior is relatively concentrated.
- Width posterior: Beta(5, 3). Mean = 5/8 = 0.625. The posterior is broader (fewer observations).
- Sample from Beta(7, 9) → e.g., 0.38. Sample from Beta(5, 3) → e.g., 0.71.
- Width wins (0.71 > 0.38), so this iteration generates a fresh solution.
If width then produces a solution scoring 0.65 against a current root-level best of 0.60, the reward is 1 (success), updating width to αw = 5, βw = 2. If the score had been 0.55, the reward would be 0, updating to αw = 4, βw = 3.
19.3.4 AB-MCTS Main Loop
The algorithm operates in four phases per iteration: selection, expansion, evaluation, and backpropagation — mirroring classical MCTS but adapted for LLM inference-time scaling [PAPER — §4.1].
# Pseudocode — reconstructed from paper §4.1, not repo-verified
# Uses paper-described function names; actual implementation may differ
def ab_mcts(problem, budget_T, models, scorer):
"""AB-MCTS main loop with four phases per iteration."""
tree = init_tree()
thompson = ThompsonSampler(actions=["depth", "width"])
for t in range(budget_T):
# Phase 1 — Selection: Choose depth vs. width via Thompson Sampling
action = thompson.sample() # Sample from Beta posteriors
if action == "width":
# Phase 2a — Expansion (width): New root-level solution
prompt = make_fresh_prompt(problem)
parent_node = tree.root
else:
# Phase 2b — Expansion (depth): Refine existing node
parent_node = select_node(tree) # Secondary bandit/UCB
prompt = make_refinement_prompt(
problem, parent_node.solution, parent_node.score
)
# Model selection (Multi-LLM extension only)
model = select_model(models) # Thompson Sampling over model pool
# Phase 3 — Evaluation: Generate solution and score it
solution = model.generate(prompt)
score = scorer.evaluate(solution) # Normalized to [0, 1]
# Phase 4 — Backpropagation: Update tree and Thompson model
child = tree.add_child(parent_node, solution, score)
reward = 1 if score > parent_node.subtree_best else 0
thompson.update(action, reward)
return tree.best_solution()
19.3.5 AB-MCTS Variants
AB-MCTS-A (Node Aggregation). The simplest variant reduces all scores to binary success/failure outcomes. Each refinement attempt is treated as an independent Bernoulli trial: did the child improve upon the subtree's best score? The Beta posterior is well-calibrated under this assumption, but the magnitude of improvement is discarded — a score increase from 0.5 to 0.51 counts identically to an increase from 0.5 to 0.99 [PAPER — §6.1]. Dependencies: NumPy only; computational overhead is negligible [PAPER — Table in §6].
AB-MCTS-M (Mixed Models). This variant uses PyMC-based hierarchical Bayesian inference to maintain richer posteriors over continuous score distributions, rather than reducing to binary outcomes [PAPER — §5.3, §6.2]. The hierarchical structure allows information sharing across tree nodes: if refinements at one node tend to produce high scores, this informs expectations for other nodes. The trade-off is increased computational cost (MCMC sampling per decision) and a PyMC/ArviZ dependency [PAPER — §6.2].
Multi-LLM AB-MCTS. This variant extends the action space to the Cartesian product of {depth, width} × {model₁, model₂, …, modelk}. Each (direction, model) combination maintains its own Beta posterior, enabling the algorithm to learn that, for example, one model excels at refinement while another produces better diverse initial solutions [PAPER — §6.3, §7.1].
| Variant | Branching Strategy | Score Handling | Dependencies | Overhead | Provenance |
|---|---|---|---|---|---|
| AB-MCTS-A | Beta-Bernoulli TS | Binary (improve / not) | NumPy only | Negligible | [PAPER — §6.1] |
| AB-MCTS-M | Hierarchical Bayesian | Continuous [0,1] | PyMC, ArviZ | MCMC per decision | [PAPER — §6.2] |
| Multi-LLM | Multi-dim bandit | Binary per (action, model) | NumPy + API keys | Negligible per decision | [PAPER — §6.3] |
19.3.6 Multi-LLM Model Selection
The Multi-LLM extension introduces a three-dimensional decision space at each iteration [PAPER — §7.1]: (1) direction (depth vs. width), (2) model selection (which LLM from the pool), (3) generation (execute with chosen model). Each (direction, model) pair maintains an independent Beta posterior, updated based on whether the generation produced a score improvement.
# Pseudocode — reconstructed from paper §7.2, not repo-verified
class MultiLLMThompsonSampler:
"""Thompson Sampling across (direction, model) combinations."""
def __init__(self, models: list[str]):
self.arms = {}
for direction in ["depth", "width"]:
for model in models:
key = (direction, model)
self.arms[key] = {"alpha": 1.0, "beta": 1.0}
def select(self) -> tuple[str, str]:
"""Sample from all arm posteriors; return (direction, model)."""
best_key, best_sample = None, -1.0
for key, params in self.arms.items():
sample = np.random.beta(params["alpha"], params["beta"])
if sample > best_sample:
best_sample = sample
best_key = key
return best_key # (direction, model_name)
def update(self, direction: str, model: str, reward: float):
key = (direction, model)
self.arms[key]["alpha"] += reward
self.arms[key]["beta"] += (1.0 - reward)
After 30 iterations with o4-mini, Gemini-2.5-Pro, and DeepSeek R1-0528:
| Arm (direction, model) | α | β | Posterior Mean |
|---|---|---|---|
| (depth, o4-mini) | 5 | 8 | 0.385 |
| (depth, Gemini) | 7 | 4 | 0.636 |
| (depth, DeepSeek) | 3 | 5 | 0.375 |
| (width, o4-mini) | 4 | 3 | 0.571 |
| (width, Gemini) | 2 | 4 | 0.333 |
| (width, DeepSeek) | 3 | 3 | 0.500 |
The algorithm has learned that Gemini tends to be strongest at refinement (depth) while o4-mini provides good diversity (width). The next sample might draw 0.72 from (depth, Gemini) and 0.65 from (width, o4-mini), selecting Gemini for a depth action.
19.3.7 Stateless Design and Immutability
A distinctive architectural choice is that all tree state is represented via immutable (frozen) data structures. Every operation — adding a node, updating scores, recording Thompson Sampling observations — returns a new TreeState rather than mutating the existing one [PAPER — §10.1]. This design enables:
- Checkpointing: Any state can be serialized to JSON and restored later [PAPER — §10.2].
- Reproducibility: Given the same initial state and random seed, the search trajectory is fully deterministic [PAPER — §10.1].
- Concurrency: Immutable structures are inherently thread-safe — multiple threads can read the same state without locks [PAPER — §10.1].
- Debugging: The full history of tree states can be retained for post-hoc analysis [PAPER — §10.1].
19.3.8 Node Selection for Refinement
When the algorithm decides to go deeper, it must select which existing node to refine. The paper describes this as a secondary multi-armed bandit problem where each node is an arm, with selection balancing exploitation (refining high-scoring nodes) with exploration (trying nodes with low visit counts) [PAPER — §3.1]. The exact selection policy is not formally specified in the paper.
19.3.9 Regret Analysis
Thompson Sampling for the K-armed Bernoulli bandit has established regret bounds [PAPER — §14.2]:
- R(T): cumulative regret after T rounds
- K: number of arms — 2 for single-model (depth/width), 2|M| for Multi-LLM with |M| models
- T: total number of iterations (budget)
Provenance: [Standard definition applied here] — This bound is from Agrawal & Goyal (2012) for the stationary case. The paper notes that AB-MCTS reward distributions are non-stationary (the value of depth vs. width changes as the tree grows and best known score improves), which the standard bound does not address [PAPER — §14.2].
For Multi-LLM AB-MCTS with K = 6 arms (2 directions × 3 models) and T = 250 iterations:
E[R(250)] ≤ O(√(6 × 250 × log 250)) ≈ O(√(6 × 250 × 5.52)) ≈ O(√8300) ≈ O(91)
This suggests that the algorithm's cumulative loss from suboptimal arm selection grows sublinearly, with expected regret of roughly 91 over 250 rounds — meaning the vast majority of iterations are allocated to near-optimal actions. Note this is a theoretical upper bound under stationarity; actual performance may differ due to non-stationary rewards.
19.4 Key Results
19.4.1 Evaluation Caveats
- Self-reported results: All benchmark numbers are reported by the system authors (Sakana AI). No independent reproduction by a third party has been documented in available materials.
- Exact model versions: The paper specifies model names (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528) but does not provide exact model version identifiers, API snapshot dates, or whether model behavior may have changed between paper preparation and publication [PAPER — §11.2].
- Seeds and run counts: The number of independent runs per configuration, random seeds used, and variance across runs are not reported. All numbers appear to be from single runs or are not specified [PAPER — §11.3].
- Compute budget matching: The repeated sampling baseline uses 250 independent generations with o4-mini. The Multi-LLM AB-MCTS also uses 250 iterations but with three different models (o4-mini, Gemini, DeepSeek), which may have different per-call costs. Whether total token or dollar budgets are matched is not stated [PAPER — §11.2–11.3].
- Prompt engineering: The paper acknowledges that the effectiveness of depth actions depends on prompt quality [PAPER — §16.1], but does not release the exact prompts used, making reproduction dependent on prompt design choices.
- Approximate figures: Some results (AB-MCTS-M "~28%", Multi-LLM ">30%") are reported with approximate notation rather than exact values [PAPER — §11.3].
- Task subset: Results are on the 120 public evaluation tasks of ARC-AGI-2, not the full held-out test set. Public set performance may not generalize to the private evaluation [PAPER — §11.1].
19.4.2 ARC-AGI-2 Results
| Method | Model(s) | Task Set | Pass@250 | Pass@2 | Seeds / Runs | Compute Budget | Result Type | Evidence Source |
|---|---|---|---|---|---|---|---|---|
| Repeated Sampling | o4-mini | 120 public eval | 23.0% | — (not reported) | — (not reported) | 250 generations | Self-reported | [PAPER — §11.3] |
| AB-MCTS-A | o4-mini | 120 public eval | 27.5% | — (not reported) | — (not reported) | 250 iterations | Self-reported | [PAPER — §11.3] |
| AB-MCTS-M | o4-mini | 120 public eval | ~28% | — (not reported) | — (not reported) | 250 iterations | Self-reported (approximate) | [PAPER — §11.3] |
| Multi-LLM AB-MCTS | o4-mini + Gemini-2.5-Pro + DeepSeek R1-0528 | 120 public eval | >30% | 19.2% | — (not reported) | 250 iterations (3-model pool) | Self-reported (approximate) | [PAPER — §11.3] |
19.4.3 Analysis of Results
The reported improvement from 23.0% (repeated sampling) to >30% (Multi-LLM AB-MCTS) represents a >30% relative improvement in Pass@250 on the same 120-task evaluation set [PAPER — §11.3]. The paper emphasizes three patterns observed in search tree analysis [PAPER — §11.4]:
- Complementary reasoning: One model identifies the correct transformation type while another implements it correctly at the pixel level.
- Cross-model error correction: A different model can sometimes fix systematic errors during refinement that the originating model reproduces.
- Diverse initialization: Different models produce different distributions of initial solutions, improving coverage of the solution space.
The Pass@2 metric (19.2%) for Multi-LLM AB-MCTS indicates that when selecting only the top-2 solutions from 250 candidates for final submission, the algorithm captures approximately two-thirds of the solvable tasks. This demonstrates effective ranking, not just search [PAPER — §11.4].
19.4.4 Scaling Behavior
The paper reports that repeated sampling's Pass@N curve flattens quickly (most easily solvable problems found within the first ~50 attempts), while AB-MCTS maintains a steeper slope because search directs compute toward promising regions [PAPER — §11.5].
The paper defines search efficiency at iteration t as the probability that iteration t solves a previously unsolved problem [PAPER — §11.5]:
- efficiency(t): marginal probability of solving a new problem at iteration t
- For repeated sampling, this decreases monotonically
- For AB-MCTS, this can increase over time as information accumulates
Provenance: [Published formula — paper §11.5]. This is a conceptual framing for interpreting scaling curves, not a directly measured quantity.
Consider a problem solvable by 2% of random samples. Under repeated sampling, efficiency(t) ≈ 0.02 at every iteration (assuming independence). After 50 iterations, P(unsolved) = 0.9850 ≈ 0.364, so Pass@50 ≈ 63.6%.
Under AB-MCTS, if the first few attempts reveal that a particular refinement direction is promising, subsequent iterations concentrate on that direction. The efficiency might start at 0.02 but increase to, say, 0.05 as the algorithm focuses. After 50 directed iterations, the cumulative solve probability could exceed the repeated-sampling baseline despite starting from the same per-sample rate.
19.4.5 Non-LLM Baseline Comparison
The paper compares AB-MCTS variants against repeated sampling (Best-of-N) as the primary baseline [PAPER — §11.3]. No comparison against non-LLM baselines (e.g., brute-force program enumeration, DSL-based solvers, or classical ARC solvers) is reported in the available materials. The ARC-AGI-2 benchmark itself has been evaluated by various approaches in the ARC Prize competition, but the paper does not position its results against non-LLM competition entries.
19.5 Implementation & Cost
19.5.1 Implementation Details
| Aspect | Detail | Evidence Source |
|---|---|---|
| Language | Python | [PAPER — §8; README] |
| License | Apache 2.0 | [PAPER — §1; README] |
| Core dependencies (AB-MCTS-A) | NumPy | [PAPER — §6, variant table] |
| Additional dependencies (AB-MCTS-M) | PyMC, ArviZ | [PAPER — §6.2] |
| Design pattern | Stateless ask-tell, immutable dataclasses | [PAPER — §8.1, §10] |
| Parallelism model | Batch-parallel trials; user manages async execution | [PAPER — §9.3] |
| Checkpoint format | JSON-serializable TreeState | [PAPER — §10.2] |
| Models used in reported experiments | o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528 | [PAPER — §11.2] |
19.5.2 Cost Analysis
| Cost Dimension | Detail | Provenance |
|---|---|---|
| Iterations per task | 250 (max budget in reported experiments) | [PAPER — §11.2] |
| Batch size | 5 (recommended) | [PAPER — §12.1] |
| LLM calls per task | 250 (1 per iteration) | [PAPER — §4, §11.2] |
| Computational overhead of AB-MCTS-A | Negligible (NumPy Beta sampling) | [PAPER — §6, variant table] |
| Computational overhead of AB-MCTS-M | MCMC sampling per decision (PyMC) | [PAPER — §6, variant table] |
| Hardware configuration | — (not reported) | — |
| Wall-clock time per task | — (not reported) | — |
| Dollar cost per task | — (not reported) | — |
| Total cost for 120-task evaluation | — (not reported) | — |
19.6 Reproducibility
19.6.1 Reproducibility Assessment
| Requirement | Status | Notes |
|---|---|---|
| Code publicly released | ✓ | github.com/SakanaAI/treequest, Apache 2.0 [PAPER — §1] |
| Config files available | Partial | Configuration via function arguments documented in paper; no standalone config files described |
| Pretrained weights / checkpoints | N/A | System uses third-party LLM APIs; no training involved |
| Documented entry point or run command | Partial | Programmatic API documented [PAPER — §8]; no CLI; README may contain usage examples [UNVERIFIED] |
| Compute requirements stated | ✗ | Hardware, runtime, and cost not reported |
| Seeds and run counts reported | ✗ | Number of runs and random seeds not specified |
| Exact model versions specified | Partial | Model names given (o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528); exact API versions not pinned |
| Prompt templates released | ✗ | Prompts not published; essential for reproduction |
| Evaluation function released | ✗ (unclear) | ARC-AGI-2 scoring is well-defined (pixel accuracy), but the exact implementation is not confirmed in the repo |
| Independent reproduction attempted | ✗ | No independent reproductions documented in available materials |
19.6.2 Step-by-Step Verification Path
To attempt validation of AB-MCTS on ARC-AGI-2, an external researcher would need to:
- Clone the repository:
git clone https://github.com/SakanaAI/treequest - Install dependencies: The paper mentions NumPy as the only dependency for AB-MCTS-A, with PyMC/ArviZ for AB-MCTS-M [PAPER — §6]. The exact install command is not documented in the paper. The researcher should check the repository for
pyproject.toml,setup.py, orrequirements.txt. - Obtain API keys: OpenAI (o4-mini), Google (Gemini-2.5-Pro), DeepSeek (R1-0528). Exact model version strings may need to be determined from the repository or API documentation.
- Obtain ARC-AGI-2 tasks: The 120 public evaluation tasks are available from arcprize.org [PAPER — §11.1]. The exact task loading format expected by the code is not documented in the paper.
- Implement prompt templates: The paper describes depth and width prompts conceptually [PAPER — §4.3] but does not release exact templates. This is a significant barrier — the researcher must design their own prompts, which may not match the paper's performance.
- Implement scoring function: Pixel-level accuracy is well-defined for ARC tasks [PAPER — §12.3], but the exact normalization (e.g., handling of wrong grid dimensions) must be implemented.
- Run search: Using the documented API pattern:
init_tree(method="multi_llm", batch_size=5, max_iterations=250)→ ask/tell loop →best(k=2). - Expected output: A Pass@250 score of >30% and Pass@2 of ~19.2% on the 120 public tasks.
Critical gaps for reproduction: The absence of released prompt templates is the primary barrier. Since the paper acknowledges that "the effectiveness of depth (refinement) actions depends heavily on the quality of the refinement prompt" [PAPER — §16.1], prompt design is a first-order factor in reproducing results. The lack of reported seeds, run counts, and variance estimates further limits the ability to determine whether a reproduction attempt has succeeded or failed.
19.7 Threats to Validity
- Self-reported results without independent verification. All benchmark numbers are reported by the system's authors (Sakana AI). No independent reproduction or third-party evaluation has been documented. The improvement from 23% to >30% should be interpreted as an author-reported claim until independently validated.
- Unmatched compute budgets across configurations. The Multi-LLM variant uses three different frontier models in its 250-iteration budget. The single-model baselines use only o4-mini. If the three models have different costs per call (in tokens, latency, or dollars), the budgets are not dollar-matched. The paper uses iteration count as the budget metric, which treats all model calls as equivalent [PAPER — §11.2].
- Model version non-stationarity. The paper specifies model names but not exact version identifiers or API snapshot dates. Frontier model APIs are updated regularly; results obtained with one API version may not be reproducible with a later version. This is a systemic issue for all inference-time scaling research, not specific to AB-MCTS.
- Unreleased prompt templates. Prompt engineering is acknowledged as a critical factor [PAPER — §16.1], but the exact prompts are not published. This creates a gap between the described algorithm (which is fully specified) and the reported results (which depend on undisclosed implementation details).
- Absence of variance estimates. No error bars, confidence intervals, or multi-run statistics are reported. Without knowing the variance across runs, it is impossible to determine whether the differences between AB-MCTS variants and the repeated-sampling baseline are statistically significant.
- Public evaluation set only. Results are on ARC-AGI-2's 120 public evaluation tasks, not the held-out private test set. Overfitting to public task characteristics (through prompt tuning or hyperparameter selection) cannot be ruled out.
- Non-stationarity of reward distributions. The Thompson Sampling regret bounds cited in the paper [PAPER — §14.2] assume stationary reward distributions. The paper acknowledges that this assumption is violated in AB-MCTS (reward distributions shift as the tree grows) but does not quantify the impact of non-stationarity on the regret guarantee.
- Node selection policy underspecified. The secondary bandit problem of selecting which node to refine (when depth is chosen) is described only in general terms [PAPER — §3.1]. The exact selection policy — which substantially affects depth-action quality — is not formally specified.
- Emergent specialization claim. The reported observation that "different models develop different roles during search" [PAPER — §7.3] is a qualitative claim without formal measurement or statistical testing. The degree of specialization and whether it is robust across tasks and runs is not quantified.
19.8 Limitations & Open Questions
19.8.1 Documented Limitations
The paper explicitly identifies five limitations [PAPER — §16.1]:
- Scoring function requirement: AB-MCTS requires a scalar scoring function in [0, 1] that provides meaningful partial-credit signal. Tasks without clear partial credit (open-ended creative writing, subjective evaluation) are poorly suited [PAPER — §16.1].
- LLM call overhead: Each iteration requires at least one LLM API call. At 250 iterations per task, the cost may be prohibitive for expensive models or large task sets [PAPER — §16.1].
- Prompt engineering dependency: Refinement (depth) effectiveness depends heavily on prompt quality. Poor prompts may cause the LLM to ignore or verbatim-copy the parent solution [PAPER — §16.1].
- Non-stationarity: The true reward distributions shift during search, violating the stationarity assumption of standard Thompson Sampling. While empirically robust to mild non-stationarity, extreme distribution shift could degrade performance [PAPER — §16.1].
- No crossover: Unlike evolutionary algorithms, AB-MCTS does not combine elements from different solution branches. Each refinement acts on a single parent [PAPER — §16.1].
19.8.2 Open Questions
- Crossover action: The paper proposes adding a "crossover" operation that combines elements from two subtrees via LLM-mediated recombination [PAPER — §16.2]. This would expand the action space to {depth, width, crossover} × models, potentially enabling richer search dynamics at the cost of increased bandit dimensionality.
- Hierarchical decomposition: For multi-step problems, AB-MCTS could be applied hierarchically — a top-level search decomposes the problem, and nested searches solve subproblems [PAPER — §16.2]. The interaction between hierarchical decomposition and Thompson Sampling is unexplored.
- Learned scoring functions: Replacing external evaluators with LLM-based reward models would create a fully self-contained system but introduces reward hacking risks [PAPER — §16.2].
- Non-stationary Thompson Sampling: Discounted Thompson Sampling with geometric down-weighting of older observations (αt = γ·αt−1 + rewardt, γ ∈ (0,1)) could address reward distribution shift [PAPER — §16.2].
- Integration with code evolution: The paper suggests AB-MCTS as a search backbone for systems like AlphaEvolve, replacing MAP-Elites or simple evolutionary strategies for program search [PAPER — §16.2]. The Multi-LLM variant is particularly relevant for code generation tasks where different models may excel at algorithmic design versus implementation.
19.9 Survey Positioning
19.9.1 Comparison with Related Systems
| System | Search Strategy | Adaptation | Multi-Model | Primary Domain | Budget Context |
|---|---|---|---|---|---|
| AB-MCTS / TreeQuest (this chapter) | Adaptive tree search (depth + width) | Thompson Sampling (online) | Yes — learned routing via TS | ARC-AGI abstract reasoning | 250 iterations / task |
| FunSearch (Chapter 11) | Population-based evolutionary | Island model with program database | No — single model (PaLM 2 / Gemini) | Mathematical optimization (cap sets, bin packing) | Millions of evaluations |
| AlphaEvolve (Chapter 9) | Evolutionary with MAP-Elites archive | Adaptive operator selection | Yes — Gemini model cascade | Algorithm discovery (broad) | Large-scale (hours to days) |
Note on budget comparisons: The systems above operate at fundamentally different scales. FunSearch runs millions of cheap evaluations of short programs; AlphaEvolve runs large-scale evolutionary campaigns over hours to days; AB-MCTS runs 250 iterations per task with expensive frontier model calls. Direct performance comparison across these systems is not meaningful without budget normalization, which none of the source papers provide across system boundaries.
19.9.2 Evolutionary Correspondence
AB-MCTS shares conceptual structure with evolutionary algorithms, as the paper itself notes [PAPER — §15.3]. The following table maps the correspondence:
| AB-MCTS Component | Evolutionary Analogue | Correspondence Quality |
|---|---|---|
| Width action (new root solution) | Random initialization / immigration | Strong — both introduce independent diversity |
| Depth action (refine existing node) | Mutation + local search / hill climbing | Moderate — refinement is directed by prompt, not random perturbation |
| Thompson Sampling adaptive branching | Adaptive operator selection | Strong — both learn action allocation online |
| Multi-LLM model pool | Multiple mutation operators | Moderate — models are more complex than operators |
| Search tree | Population with genealogy | Weak — tree is append-only, no population culling |
| No crossover | Asexual reproduction only | Strong limitation — explicitly acknowledged [PAPER — §16.1] |
Where the analogy breaks down: AB-MCTS does not maintain a fixed population size, does not perform selection pressure (all nodes are retained, none are culled), and does not use crossover or recombination [PAPER — §15.3]. The tree is append-only: nodes are never removed, and the "population" grows monotonically. This means AB-MCTS cannot exhibit the diversity-maintenance dynamics of island models or MAP-Elites archives used by FunSearch and AlphaEvolve. The tree structure also lacks the multi-objective quality-diversity characteristics that drive these systems' ability to maintain behavioral variety. In essence, AB-MCTS is a search algorithm that shares surface-level structure with evolutionary algorithms but operates under fundamentally different population dynamics.
Key Contribution
AB-MCTS introduces adaptive depth-width branching via Thompson Sampling as a principled alternative to fixed search strategies for LLM inference-time scaling. The core insight — that the optimal ratio of solution refinement to fresh generation is problem-dependent and learnable online — is formalized as a multi-armed bandit and extended to multi-model selection. The TreeQuest library provides a clean stateless implementation with checkpoint/resume support. Among inference-time scaling methods surveyed (Best-of-N, Chain-of-Thought, Tree-of-Thought, Self-Consistency), AB-MCTS is, based on available evidence, the approach that explicitly treats the depth-width trade-off as a learnable parameter rather than a fixed architectural choice [PAPER — §15.1].
19.10 Summary
Key Takeaway: AB-MCTS demonstrates that adaptive allocation between solution refinement (depth) and fresh generation (width), governed by Thompson Sampling, can meaningfully improve inference-time scaling over naive repeated sampling. On ARC-AGI-2, the Multi-LLM variant reportedly achieves >30% Pass@250 compared to 23% for repeated sampling — a relative improvement exceeding 30% — with the additional finding that multi-model collaboration solves problems no single model can solve individually.
Main Contribution: A Bayesian framework for online learning of the depth-width trade-off in LLM-based search, implemented in the TreeQuest library with a stateless, immutable, checkpoint-friendly design. The Multi-LLM extension demonstrates that model selection can be jointly optimized alongside search direction, enabling emergent model specialization.
Most Important Gap for Future Research: The absence of released prompt templates, variance estimates, and independent reproduction limits the ability to distinguish the contribution of the search algorithm from the contribution of prompt engineering and model selection. A researcher attempting reproduction should prioritize obtaining or reverse-engineering the prompt templates, as the paper explicitly acknowledges prompt quality as a critical factor in depth-action effectiveness. Adding crossover-like recombination operations — explicitly identified as a limitation — is the most natural algorithmic extension, potentially bridging AB-MCTS toward the richer population dynamics exhibited by evolutionary approaches like FunSearch and AlphaEvolve.