AB-MCTS / TreeQuest
Adaptive Branching Monte Carlo Tree Search for Multi-LLM Inference-Time Scaling

Organization: Sakana AI
Paper: arXiv 2503.04412
License: Apache 2.0
Repository: github.com/SakanaAI/treequest
Report Type: PhD-Level Technical Analysis
Report Date: March 2026
Table of Contents
- Executive Summary
- Motivation and Problem Formulation
- Search Dimensions: Depth vs. Width
- Core Algorithm: AB-MCTS
- Thompson Sampling Integration
- AB-MCTS Variants
- Multi-LLM Extension
- TreeQuest Library Architecture
- API Design and Ask-Tell Interface
- Stateless Design and Immutability
- Benchmark: ARC-AGI-2 Results
- Configuration and Hyperparameters
- Implementation Walkthrough
- Theoretical Foundations
- Comparison with Related Work
- Limitations and Future Directions
- References
1 Executive Summary
AB-MCTS (Adaptive Branching Monte Carlo Tree Search) represents a principled approach to inference-time scaling for large language models. Rather than relying on a single model generating a single answer, AB-MCTS orchestrates a search process that adaptively decides between two fundamental strategies: going deeper (refining an existing solution through sequential iteration) and going wider (generating entirely new solutions from scratch). This adaptive branching is the core novelty, as prior methods typically commit to one strategy or the other.
The accompanying open-source library, TreeQuest, implements this algorithm with a stateless, functional design that supports checkpointing, resumption, and batch-parallel execution. The Multi-LLM extension adds a third search dimension: selecting which frontier model to use at each step, treating model selection as an additional bandit arm optimized via Thompson Sampling.
Key Result: On the ARC-AGI-2 benchmark (120 public evaluation tasks, max 250 iterations), Multi-LLM AB-MCTS achieves >30% Pass@250, compared to 23% for naive repeated sampling with a single model (o4-mini). Problems that no single LLM can solve in isolation become tractable through collaborative multi-model search.
At a Glance
Core Innovation: Adaptive depth vs. width branching via Thompson Sampling
Search Framework: Monte Carlo Tree Search with Bayesian action selection
Models Supported: o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528
Library: TreeQuest (Python, Apache 2.0)
Design Pattern: Stateless ask-tell interface with immutable data structures
Primary Benchmark: ARC-AGI-2 (abstraction and reasoning)
2 Motivation and Problem Formulation
2.1 The Inference-Time Scaling Paradigm
Traditional scaling laws for LLMs focus on training-time compute: larger models trained on more data yield better performance. Inference-time scaling offers a complementary axis -- rather than training a bigger model, one can spend more compute at inference time by generating multiple candidate solutions and selecting among them. This paradigm is particularly relevant for tasks where verification is cheaper than generation, such as mathematical reasoning, code synthesis, and abstract pattern matching.
The simplest inference-time scaling strategy is repeated sampling (also called Best-of-N): generate N independent solutions and select the best one. While effective, this approach makes no use of information gathered during search. If the first 50 attempts all fail in a similar way, attempt 51 is no more informed than attempt 1.
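The baseline is easy to state in code. The sketch below is illustrative only; `generate` and `score` stand in for the user's model call and evaluator and are not part of any library:

```python
import random

def best_of_n(generate, score, n):
    """Naive repeated sampling (Best-of-N): n independent draws, keep the best.

    No information flows between attempts -- attempt n is exactly as
    uninformed as attempt 1.
    """
    best_solution, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate()   # independent, unconditioned draw
        s = score(candidate)     # quality in [0, 1]
        if s > best_score:
            best_solution, best_score = candidate, s
    return best_solution, best_score

# Toy demo: "generation" is a uniform random draw, "scoring" is identity.
random.seed(0)
solution, value = best_of_n(lambda: random.random(), lambda x: x, n=50)
```

Even in this toy form the weakness is visible: the loop body never inspects previous candidates, which is precisely the information AB-MCTS puts to use.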
2.2 The Depth-Width Trade-off
Consider a problem that admits multiple solution strategies. Two natural approaches exist:
- Depth-first (Go Deeper): Take a promising partial solution and iteratively refine it. Feed the current best attempt back to the model with instructions to improve, fix bugs, or extend the approach. This leverages accumulated context but risks getting stuck in a local optimum.
- Width-first (Go Wider): Generate entirely new solutions from scratch, independent of previous attempts. This provides diversity but wastes information gained from prior attempts.
The central insight of AB-MCTS is that the optimal trade-off between depth and width is problem-dependent and node-dependent within the search tree. Some problems benefit from deep refinement chains; others benefit from broad exploration. Even within a single search tree, different subtrees may warrant different strategies. AB-MCTS uses Bayesian methods to learn this trade-off adaptively during search.
2.3 Formal Problem Statement
Let f: S → [0, 1] be an evaluation function mapping candidate solutions to quality scores. Given a computational budget of T iterations, the goal is to find:
$$ s^* = \arg\max_{s \in S_T} f(s) $$
where $S_T$ is the set of all solutions explored within the budget. The search must decide, at each iteration, whether to:
- Expand: Create a new root-level solution (width).
- Refine: Select an existing node and generate a child solution (depth).
In the multi-LLM setting, a third decision is introduced: which model m ∈ M to use for generation. The search algorithm must balance exploration and exploitation across all three dimensions simultaneously.
Search Decision Space
================================================
Iteration t:
+-------------------+ +-------------------+
| GO WIDER | | GO DEEPER |
| (New Solution) | vs | (Refine Node) |
+-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| Select Model m | | Select Model m |
| from pool M | | from pool M |
+-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| Generate & Score | | Generate & Score |
| s_new = m(prompt) | | s_child = m(s,p) |
+-------------------+ +-------------------+
| |
+----------+ +-----------+
| |
v v
+-------------------+
| Update Thompson |
| Sampling Models |
+-------------------+
3 Search Dimensions: Depth vs. Width
3.1 Going Deeper: Sequential Refinement
The "depth" action selects an existing node in the search tree and generates a child solution by feeding the node's solution as context to the LLM. The refinement prompt typically includes the previous solution, its score, and instructions for improvement. This creates chains of progressively refined solutions.
The depth action is analogous to iterative refinement in optimization: gradient-descent-like steps that improve a current solution. Its effectiveness depends on the smoothness of the solution landscape -- if small changes to a solution reliably improve its score, depth search converges quickly. However, if the solution landscape has many local optima, depth search may get trapped.
Node Selection for Refinement
When the algorithm decides to go deeper, it must choose which existing node to refine. This is itself a multi-armed bandit problem. AB-MCTS uses the accumulated evidence at each node (the scores of its descendants) to select promising nodes for further refinement. The selection balances exploitation (refining high-scoring nodes) with exploration (trying nodes that haven't been refined much).
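One way to realize this per-node bandit -- a sketch under the same Beta-Bernoulli assumptions used elsewhere in this report, not TreeQuest's actual implementation -- is to give each node its own success/failure counts and Thompson-sample across nodes. The `refine_successes`/`refine_failures` fields are hypothetical:

```python
import numpy as np

def select_node_for_refinement(nodes, rng):
    """Pick a node to refine via Thompson Sampling over per-node Beta posteriors.

    Each node carries `refine_successes` / `refine_failures`: how often
    refining it produced a child beating its subtree's best score.
    Unexplored nodes keep the uniform Beta(1, 1) prior, so they are
    still sampled occasionally (exploration).
    """
    best_node, best_sample = None, -1.0
    for node in nodes:
        sample = rng.beta(node["refine_successes"] + 1,
                          node["refine_failures"] + 1)
        if sample > best_sample:
            best_node, best_sample = node, sample
    return best_node

rng = np.random.default_rng(42)
nodes = [
    {"id": "A", "refine_successes": 5, "refine_failures": 1},  # often improves
    {"id": "B", "refine_successes": 0, "refine_failures": 6},  # rarely improves
    {"id": "C", "refine_successes": 0, "refine_failures": 0},  # unexplored
]
# Over many draws, A dominates, yet C is still tried occasionally.
picks = [select_node_for_refinement(nodes, rng)["id"] for _ in range(1000)]
```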
3.2 Going Wider: Fresh Generation
The "width" action generates an entirely new solution from scratch, without conditioning on any previous attempt. This is equivalent to adding a new root-level node to the search tree. Width provides diversity: even if all existing solution branches are stuck in local optima, a fresh generation might discover an entirely different approach.
The width action is most valuable early in search (when the tree is sparse and the algorithm hasn't yet found promising regions) and whenever the algorithm detects that refinement is yielding diminishing returns.
3.3 Adaptive Branching
The key innovation is that AB-MCTS does not fix the depth-width ratio a priori. Instead, it treats the choice as a two-armed bandit and uses Thompson Sampling to adaptively allocate compute between the two strategies. Early in search, when few solutions exist, width tends to dominate. As the tree grows and promising branches emerge, depth becomes more attractive. If depth produces diminishing returns in a particular subtree, the algorithm naturally shifts back toward width.
Adaptive Branching Principle: The ratio of depth-to-width actions is not a hyperparameter but an emergent property of the search dynamics. It is learned online through the Thompson Sampling mechanism, adapting to each problem's characteristics.
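This emergence can be seen in a toy simulation. The sketch below is illustrative only: the per-arm success probabilities are invented, standing in for whether a depth or width step improves the subtree best on a given problem.

```python
import random

def run_bandit(reward_prob, steps, seed=0):
    """Toy Beta-Bernoulli bandit over {depth, width}.

    `reward_prob` maps each arm to its (invented) success probability;
    returns how often each arm was pulled over `steps` iterations.
    """
    random.seed(seed)
    alpha = {"depth": 1.0, "width": 1.0}   # prior successes
    beta = {"depth": 1.0, "width": 1.0}    # prior failures
    pulls = {"depth": 0, "width": 0}
    for _ in range(steps):
        # Thompson Sampling: draw from each posterior, pick the larger.
        samples = {a: random.betavariate(alpha[a], beta[a]) for a in alpha}
        arm = max(samples, key=samples.get)
        pulls[arm] += 1
        reward = 1.0 if random.random() < reward_prob[arm] else 0.0
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
    return pulls

# When refinement pays off more often, depth dominates -- and vice versa.
depth_heavy = run_bandit({"depth": 0.6, "width": 0.2}, steps=500)
width_heavy = run_bandit({"depth": 0.1, "width": 0.5}, steps=500)
```

No depth-to-width ratio is set anywhere; the allocation falls out of the posterior updates, which is the sense in which the ratio is "emergent" rather than a hyperparameter.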
4 Core Algorithm: AB-MCTS
4.1 Algorithm Overview
AB-MCTS operates in a loop of four phases: selection, expansion, evaluation, and backpropagation. This structure mirrors classical MCTS but adapts each phase for the LLM inference-time scaling setting.
Python
# AB-MCTS main loop (simplified pseudocode)
def ab_mcts(problem, budget_T, models, scorer):
tree = init_tree()
thompson_model = ThompsonSampler(actions=["depth", "width"])
for t in range(budget_T):
# 1. Selection: Choose depth vs. width via Thompson Sampling
action = thompson_model.sample()
if action == "width":
# 2a. Expansion (width): Generate new root-level solution
prompt = make_fresh_prompt(problem)
node = tree.root
else:
# 2b. Expansion (depth): Select node and refine
node = select_node(tree)
prompt = make_refinement_prompt(problem, node.solution, node.score)
# 3. Evaluation: Generate solution and score it
model = select_model(models) # Multi-LLM extension
solution = model.generate(prompt)
score = scorer.evaluate(solution) # Normalized to [0, 1]
# 4. Backpropagation: Add child and update Thompson model
child = tree.add_child(node, solution, score)
reward = 1 if score > node.best_descendant_score else 0
thompson_model.update(action, reward)
return tree.best_solution()
4.2 Selection Phase
The selection phase decides between depth and width. AB-MCTS uses Thompson Sampling: each action (depth, width) has an associated probability model that estimates the expected reward of taking that action. At each iteration, a value is sampled from each action's posterior distribution, and the action with the higher sample is chosen. This naturally balances exploration and exploitation.
4.3 Expansion Phase
For width expansion, a fresh prompt is constructed from the problem description alone. For depth expansion, the prompt includes the selected node's solution and score, along with instructions to improve upon it. The expansion phase is where the LLM does its actual generative work.
4.4 Evaluation Phase
The generated solution is evaluated by a scoring function that returns a value in [0, 1]. This scoring function is user-defined and problem-specific. For ARC-AGI tasks, the score might be the fraction of test pixels correctly predicted. For code generation, it might be the fraction of test cases passed. The normalization to [0, 1] is important for the Thompson Sampling model.
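For grid tasks, such a pixel-accuracy scorer might look like the following illustrative sketch (not the paper's actual evaluation code; grids are assumed to be lists of lists of ints):

```python
def grid_score(predicted, target):
    """Fraction of cells matching the target grid, in [0, 1].

    A shape mismatch scores 0.0, so malformed outputs are
    penalized rather than crashing the search loop.
    """
    if len(predicted) != len(target) or any(
        len(pr) != len(tr) for pr, tr in zip(predicted, target)
    ):
        return 0.0
    total = sum(len(row) for row in target)
    correct = sum(
        p == t
        for pr, tr in zip(predicted, target)
        for p, t in zip(pr, tr)
    )
    return correct / total

target = [[1, 2], [3, 4]]
```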
4.5 Backpropagation Phase
The reward signal is computed by comparing the new solution's score to the best score among the parent node's descendants. If the new solution improves upon the best known descendant, the reward is 1 (success); otherwise, it is 0 (failure). This binary reward is used to update the Thompson Sampling model's posterior for the chosen action.
AB-MCTS Search Tree Evolution
==============================
Iteration 1 (Width): Iteration 3 (Width): Iteration 5 (Depth):
[R] [R] [R]
| / \ / \
[A] [A] [B] [A] [B]
s=0.3 s=0.3 s=0.5 s=0.3 s=0.5
|
Iteration 2 (Depth): [C]
[R] s=0.7
|
[A] Iteration 4 (Depth):
s=0.3 [R]
| / \
[A'] [A] [B]
s=0.4 s=0.3 s=0.5
|
[B']
s=0.6
Legend: [R] = Root, s = score, Width = new branch, Depth = refine existing
5 Thompson Sampling Integration
5.1 Thompson Sampling Fundamentals
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. Each arm (action) has an associated prior distribution over its expected reward. After observing a reward from an arm, the prior is updated to a posterior using Bayes' rule. At each decision point, a value is sampled from each arm's posterior, and the arm with the highest sample is chosen.
For binary rewards (success/failure), the natural choice is a Beta distribution. If arm $i$ has had $\alpha_i$ successes and $\beta_i$ failures, its posterior over the expected reward $r_i$ is:
$$ r_i \sim \mathrm{Beta}(\alpha_i + 1, \beta_i + 1) $$
Thompson Sampling provides an elegant exploration-exploitation trade-off: arms with high expected reward are chosen frequently (exploitation), while arms with high uncertainty are occasionally chosen due to random sampling from broad posteriors (exploration).
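As a small worked example: an arm with 7 observed successes and 3 failures has posterior $\mathrm{Beta}(8, 4)$ with mean $8/(8+4) \approx 0.67$, while an untried arm keeps the uniform $\mathrm{Beta}(1, 1)$ prior, so its samples range over all of $[0, 1]$ and it still wins the comparison occasionally. As evidence accumulates, the posteriors concentrate and selection becomes increasingly greedy.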
5.2 Application to Depth-Width Selection
In AB-MCTS, the two arms are "depth" and "width." Each maintains its own Beta distribution. When a depth action produces a child that improves upon its subtree's best score, the depth arm's success count increments. Otherwise, the failure count increments. The same logic applies to width actions.
Python
import numpy as np

class ThompsonSampler:
"""Thompson Sampling with Beta-Bernoulli model for action selection."""
def __init__(self, actions: list[str]):
self.alpha = {a: 1.0 for a in actions} # Prior successes
self.beta = {a: 1.0 for a in actions} # Prior failures
def sample(self) -> str:
"""Sample from each arm's posterior; return arm with highest sample."""
samples = {}
for action in self.alpha:
samples[action] = np.random.beta(
self.alpha[action], self.beta[action]
)
return max(samples, key=samples.get)
def update(self, action: str, reward: float):
"""Update posterior with observed reward (0 or 1)."""
self.alpha[action] += reward
self.beta[action] += (1.0 - reward)
5.3 Mixed Models for Continuous Rewards
The basic Beta-Bernoulli model treats rewards as binary. However, real scores are continuous in [0, 1]. The AB-MCTS-M variant uses mixed models -- specifically, PyMC-based Bayesian inference with hierarchical statistical modeling -- to maintain richer posterior distributions over expected scores. This provides more nuanced exploration-exploitation trade-offs when scores carry graded information rather than simple pass/fail signals.
Design Decision: The choice between Beta-Bernoulli (AB-MCTS-A) and mixed models (AB-MCTS-M) reflects a bias-variance trade-off. The simpler model is more robust and computationally cheaper, while the mixed model can extract more signal from continuous scores at the cost of greater complexity and PyMC dependency.
6 AB-MCTS Variants
6.1 AB-MCTS-A: Node Aggregation
The Aggregation variant (AB-MCTS-A) makes branching decisions based on accumulated evidence at each node. For every node in the tree, it tracks the number of successful and unsuccessful refinements. "Success" is defined as producing a child whose score exceeds the current subtree maximum. This evidence is aggregated across the tree to estimate the global value of depth vs. width actions.
Node aggregation provides a simple, robust estimator. Its strength is that it treats every refinement attempt as an independent Bernoulli trial, making the Beta posterior well-calibrated. Its weakness is that it discards the magnitude of improvement -- a refinement that improves the score from 0.5 to 0.51 counts the same as one improving from 0.5 to 0.99.
[!info]- AB-MCTS-A Implementation Details

```python
import numpy as np

class ABMCTSAggregation:
    """AB-MCTS with node aggregation for branching decisions."""

    def __init__(self):
        # Global counters for depth vs. width
        self.depth_successes = 0
        self.depth_failures = 0
        self.width_successes = 0
        self.width_failures = 0

    def should_go_deeper(self) -> bool:
        """Thompson Sampling: sample from both Beta posteriors."""
        depth_sample = np.random.beta(
            self.depth_successes + 1, self.depth_failures + 1
        )
        width_sample = np.random.beta(
            self.width_successes + 1, self.width_failures + 1
        )
        return depth_sample > width_sample

    def record_outcome(self, action: str, child_score: float, subtree_best: float):
        success = child_score > subtree_best
        if action == "depth":
            self.depth_successes += int(success)
            self.depth_failures += int(not success)
        else:
            self.width_successes += int(success)
            self.width_failures += int(not success)
```
6.2 AB-MCTS-M: Mixed Models
The Mixed Models variant (AB-MCTS-M) uses PyMC-based Bayesian inference to maintain richer posterior distributions. Rather than reducing scores to binary outcomes, AB-MCTS-M models the full score distribution for each action type using hierarchical statistical models.
The hierarchical model structure allows AB-MCTS-M to share information across nodes in the tree. If refinements at one node tend to produce high scores, this information propagates to inform expectations about refinements at other nodes with similar characteristics. This is particularly useful early in search when per-node evidence is sparse.
[!info]- PyMC Model Specification

```python
import pymc as pm

def build_mixed_model(depth_scores, width_scores):
    """Build hierarchical Bayesian model for action selection.

    Uses PyMC to estimate the expected improvement for depth vs. width
    actions with proper uncertainty quantification.
    """
    with pm.Model() as model:
        # Hyperpriors for the population-level parameters
        mu_global = pm.Normal("mu_global", mu=0.5, sigma=0.3)
        sigma_global = pm.HalfNormal("sigma_global", sigma=0.2)

        # Action-specific means (partial pooling)
        mu_depth = pm.Normal("mu_depth", mu=mu_global, sigma=sigma_global)
        mu_width = pm.Normal("mu_width", mu=mu_global, sigma=sigma_global)

        # Action-specific variances
        sigma_depth = pm.HalfNormal("sigma_depth", sigma=0.1)
        sigma_width = pm.HalfNormal("sigma_width", sigma=0.1)

        # Likelihood for observed scores
        if len(depth_scores) > 0:
            pm.Normal("obs_depth", mu=mu_depth, sigma=sigma_depth,
                      observed=depth_scores)
        if len(width_scores) > 0:
            pm.Normal("obs_width", mu=mu_width, sigma=sigma_width,
                      observed=width_scores)

        # Sample from posterior
        trace = pm.sample(500, return_inferencedata=True, cores=1)
    return trace
```
6.3 Multi-LLM: Model Selection as a Third Dimension
The Multi-LLM variant extends AB-MCTS by treating model selection as an additional action dimension. In the basic algorithm, the search decides between depth and width. In the Multi-LLM variant, it additionally decides which LLM to use for generation. Each LLM gets its own Thompson Sampling model, updated based on the quality of solutions it produces.
This is formalized as a multi-dimensional multi-armed bandit. The action space becomes the Cartesian product of {depth, width} and {model_1, model_2, ..., model_k}. Each combination maintains its own Beta posterior, allowing the algorithm to learn, for example, that model A is better at refinement while model B is better at generating diverse new solutions.
| Variant | Branching Strategy | Score Handling | Dependencies | Computational Overhead |
|---|---|---|---|---|
| AB-MCTS-A | Beta-Bernoulli Thompson Sampling | Binary (improve/not) | NumPy only | Negligible |
| AB-MCTS-M | Hierarchical Bayesian (PyMC) | Continuous [0,1] | PyMC, ArviZ | MCMC sampling per decision |
| Multi-LLM | Multi-dimensional bandit | Binary per (action, model) | NumPy + multiple API keys | Negligible (per decision) |
7 Multi-LLM Extension
7.1 Three-Dimensional Decision Space
The Multi-LLM extension expands the single-step decision from two dimensions to three:
- Dimension 1 -- Direction: Decide whether to go deeper (refine) or go wider (fresh generation).
- Dimension 2 -- Model Selection: Choose which LLM from the available pool (e.g., o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528).
- Dimension 3 -- Generation: Execute the generation with the chosen model and evaluate the result.
Multi-LLM AB-MCTS Decision Flow
=================================
Step 1: Direction Step 2: Model Step 3: Generate
+---------+ +-------------+ +---------------+
| Depth |---+ +---->| o4-mini |---+ +-->| LLM(prompt) |
+---------+ | | +-------------+ | | +---------------+
+---->+ +-------------+ +->+ |
+---------+ | +---->| Gemini-2.5 |---+ | +------v--------+
| Width |---+ | +-------------+ | | | Evaluate(sol) |
+---------+ | +-------------+ | | +------+--------+
+---->| DeepSeek R1 |---+ | |
+-------------+ | +------v--------+
+-->| Update Models |
+---------------+
Each (direction, model) pair has its own Thompson Sampling posterior.
7.2 Independent Probability Models per LLM
Each LLM maintains its own probability model, updated independently based on its performance. This allows the algorithm to learn the strengths and weaknesses of each model. For example, if Gemini-2.5-Pro consistently produces better refinements (depth) while o4-mini is better at generating diverse initial solutions (width), the algorithm will allocate accordingly.
Python
import numpy as np

class MultiLLMThompsonSampler:
"""Thompson Sampling across (direction, model) combinations."""
def __init__(self, models: list[str]):
self.arms = {}
for direction in ["depth", "width"]:
for model in models:
key = (direction, model)
self.arms[key] = {"alpha": 1.0, "beta": 1.0}
def select(self) -> tuple[str, str]:
"""Sample from all arm posteriors; return (direction, model)."""
best_key = None
best_sample = -1.0
for key, params in self.arms.items():
sample = np.random.beta(params["alpha"], params["beta"])
if sample > best_sample:
best_sample = sample
best_key = key
return best_key # (direction, model_name)
def update(self, direction: str, model: str, reward: float):
key = (direction, model)
self.arms[key]["alpha"] += reward
self.arms[key]["beta"] += (1.0 - reward)
7.3 Emergent Model Specialization
A key empirical finding is that different models develop different roles during search. The Thompson Sampling mechanism naturally discovers these specializations without explicit programming. This emergent division of labor is one of the most compelling aspects of the Multi-LLM approach.
The synergy between models extends beyond simple specialization. When one model produces a solution with a novel structural insight but low overall score, another model may take that insight during a refinement step and combine it with its own strengths to produce a high-scoring solution. This form of "cross-pollination" between models is unique to the Multi-LLM setting and cannot be achieved by running models independently.
Empirical Finding: On ARC-AGI-2, the Multi-LLM configuration solves problems that no single LLM can solve individually, even given the same total compute budget. This is not merely a statistical artifact of increased diversity -- the models genuinely complement each other through collaborative refinement chains.
8 TreeQuest Library Architecture
8.1 Design Philosophy
TreeQuest is the open-source Python library implementing AB-MCTS. Its architecture follows several key design principles:
- Statelessness: All tree state is returned from functions rather than stored in mutable objects. This enables checkpointing and resumption.
- Immutability: Data structures are immutable, preventing accidental state corruption and enabling safe concurrent access.
- Separation of concerns: The search algorithm is decoupled from the LLM interface and the evaluation function. Users provide generation and scoring callbacks.
- Batch parallelism: The ask-tell interface supports batched operations, enabling parallel generation across multiple LLM calls.
8.2 Core Components
TreeQuest Architecture
=======================
+-----------------------------------------------------------+
| User Code |
| +------------------+ +------------------+ |
| | generate_fn() | | evaluate_fn() | |
| | (state, score) | | score in [0, 1] | |
| +--------+---------+ +--------+---------+ |
| | | |
+-----------+---------------------+--------------------------+
| |
+-----------v---------------------v--------------------------+
| TreeQuest Core |
| |
| +--------------+ +----------------+ +---------------+ |
| | init_tree() | | ask_batch() | | tell() | |
| | Returns: | | Returns: | | Updates: | |
| | TreeState | | List[Trial] | | TreeState | |
| +--------------+ +----------------+ +---------------+ |
| |
| +--------------+ +----------------+ |
| | step() | | best() | |
| | Single iter | | Best solution | |
| +--------------+ +----------------+ |
| |
| +--------------------------------------------------+ |
| | Thompson Sampling Engine | |
| | Beta posteriors | Mixed models | Multi-LLM | |
| +--------------------------------------------------+ |
+-----------------------------------------------------------+
8.3 Data Flow
The data flow through TreeQuest follows a strict functional pattern. Each function takes the current tree state as input and returns a new (immutable) tree state as output. No global state or side effects are involved in the core search logic.
Python
from treequest import init_tree, ask_batch, tell, best
# 1. Initialize the search tree
tree_state = init_tree(
method="ab_mcts_a", # AB-MCTS with node aggregation
batch_size=5, # Number of parallel trials
max_iterations=250, # Total budget
)
# 2. Main search loop
for iteration in range(50):  # 50 batches of 5 = 250 total trials
# Ask: get batch of trials to evaluate
trials = ask_batch(tree_state)
# User generates solutions and computes scores
results = []
for trial in trials:
solution = my_llm.generate(trial.prompt)
score = my_evaluator.score(solution) # [0, 1]
results.append((trial.trial_id, solution, score))
# Tell: reflect outcomes back into the tree
tree_state = tell(tree_state, results)
# 3. Retrieve best solution found
best_solution = best(tree_state)
9 API Design and Ask-Tell Interface
9.1 The Ask-Tell Pattern
The ask-tell interface separates the decision of what to evaluate from the execution of that evaluation. This separation is critical for practical LLM-based search:
- ask_batch() returns a list of Trial objects, each specifying what the user should evaluate (the prompt, the parent node, and the action type).
- tell() receives the results of those evaluations (solutions and scores) and incorporates them into the search tree.
This pattern allows users to execute LLM calls however they wish -- synchronously, asynchronously, across different API providers, with rate limiting, with caching, etc. The search algorithm is agnostic to the execution strategy.
9.2 Trial Objects
Each Trial returned by ask_batch() contains all the information needed to generate and evaluate a solution:
Python
from dataclasses import dataclass
from typing import Optional
@dataclass(frozen=True)
class Trial:
"""An immutable trial descriptor returned by ask_batch()."""
    trial_id: str                  # Unique identifier for this trial
    prompt: str                    # Prompt the user should evaluate
action: str # "depth" or "width"
parent_node_id: Optional[str] # None for width (new root)
parent_state: Optional[Any] # Parent solution state for refinement
parent_score: Optional[float] # Parent's score for context
model_hint: Optional[str] # Suggested model (Multi-LLM mode)
depth: int # Depth in the search tree
metadata: dict # Additional context for the user
9.3 Result Reporting
Results can be reported in any order via tell(). This is important for asynchronous execution: if one LLM call finishes before another, the result can be incorporated immediately without waiting for the entire batch.
Python
import asyncio
async def run_search_async(tree_state, llm_pool, evaluator):
"""Asynchronous search with out-of-order result reporting."""
for _ in range(50): # 50 batches of 5 = 250 total
trials = ask_batch(tree_state)
# Launch all LLM calls concurrently
tasks = []
for trial in trials:
model = llm_pool.get_model(trial.model_hint)
task = asyncio.create_task(
generate_and_score(model, trial, evaluator)
)
tasks.append(task)
# Gather results (may complete in any order)
results = await asyncio.gather(*tasks)
# Tell: incorporate all results at once
tree_state = tell(tree_state, results)
return best(tree_state)
async def generate_and_score(model, trial, evaluator):
"""Generate a solution and score it."""
    solution = await model.agenerate(trial.prompt)
score = evaluator.score(solution)
return (trial.trial_id, solution, score)
9.4 User-Defined Generation Functions
The generation and evaluation functions are entirely user-defined. TreeQuest only requires that evaluation functions return scores normalized to [0, 1]. The generation function receives the trial context (parent state, action type) and returns a (state, score) tuple.
Score Normalization: All scores must be in [0, 1] for the Thompson Sampling model to function correctly. For tasks with discrete metrics (e.g., number of test cases passed), normalize by dividing by the total number of test cases. For tasks with unbounded metrics, apply a sigmoid or similar saturating transformation.
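Two such normalizations might look like this (illustrative helpers, not part of TreeQuest; the `scale` parameter is an invented tuning knob controlling how quickly the logistic curve saturates):

```python
import math

def normalize_test_cases(passed, total):
    """Discrete metric: fraction of test cases passed, exactly in [0, 1]."""
    return passed / total

def normalize_unbounded(raw, scale=1.0):
    """Unbounded metric squashed into (0, 1) with a logistic curve.

    raw = 0 maps to 0.5; large positive values approach 1,
    large negative values approach 0.
    """
    return 1.0 / (1.0 + math.exp(-raw / scale))
```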
10 Stateless Design and Immutability
10.1 Immutable Data Structures
TreeQuest represents the search tree using immutable (frozen) data structures. Every operation that modifies the tree -- adding a node, updating a score, recording a Thompson Sampling observation -- returns a new tree state rather than mutating the existing one. This design has several advantages:
- Checkpointing: Any tree state can be serialized and stored. To resume search, deserialize the state and continue calling ask/tell.
- Debugging: The full history of tree states can be retained, enabling post-hoc analysis of search behavior.
- Concurrency: Immutable structures are inherently thread-safe. Multiple threads can read the same tree state without locks.
- Reproducibility: Given the same initial state and random seed, the search trajectory is fully deterministic.
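The pattern can be illustrated with a minimal stand-in state (not TreeQuest's actual `TreeState`): every update constructs a fresh frozen object via `dataclasses.replace`, leaving the input untouched.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SearchState:
    """Minimal stand-in for an immutable search state (illustrative only)."""
    iteration: int
    best_score: float

def record_result(state: SearchState, score: float) -> SearchState:
    """Return a *new* state with updated fields; the input is never mutated."""
    return replace(
        state,
        iteration=state.iteration + 1,
        best_score=max(state.best_score, score),
    )

s0 = SearchState(iteration=0, best_score=0.0)
s1 = record_result(s0, 0.7)
# s0 is unchanged; s1 is a distinct object one iteration ahead.
```

Because `s0` still exists after the update, checkpointing reduces to serializing whichever state you choose, and retaining a list of states gives the full search history for free.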
10.2 Checkpoint and Resume
Long-running searches (e.g., 250 iterations at ~30 seconds per LLM call = ~2 hours) benefit from the ability to checkpoint and resume. If a search is interrupted (machine failure, API quota exhaustion, etc.), it can be resumed from the last checkpoint without losing any work.
Python
import json
from treequest import init_tree, ask_batch, tell, serialize, deserialize
def run_with_checkpoints(problem, checkpoint_path="checkpoint.json"):
"""Run search with periodic checkpointing."""
# Resume from checkpoint if available
try:
with open(checkpoint_path) as f:
tree_state = deserialize(json.load(f))
print(f"Resumed from iteration {tree_state.iteration}")
except FileNotFoundError:
tree_state = init_tree(method="ab_mcts_a", batch_size=5)
print("Starting fresh search")
while tree_state.iteration < 250:
trials = ask_batch(tree_state)
results = evaluate_batch(trials, problem)
tree_state = tell(tree_state, results)
# Checkpoint every 10 iterations
if tree_state.iteration % 10 == 0:
with open(checkpoint_path, "w") as f:
json.dump(serialize(tree_state), f)
print(f"Checkpoint saved at iteration {tree_state.iteration}")
return tree_state
10.3 State Representation
The tree state encapsulates all information needed to continue the search:
[!info]- TreeState Data Structure

```python
@dataclass(frozen=True)
class TreeState:
    """Complete, immutable state of an AB-MCTS search."""
    # Tree structure
    nodes: tuple[Node, ...]            # All nodes in the tree
    edges: tuple[Edge, ...]            # Parent-child relationships
    root_id: str                       # Root node identifier
    # Search state
    iteration: int                     # Current iteration number
    best_score: float                  # Best score found so far
    best_node_id: str                  # Node with best score
    # Thompson Sampling state
    bandit_state: BanditState          # Alpha/beta counts per arm
    # Configuration
    method: str                        # "ab_mcts_a", "ab_mcts_m", "multi_llm"
    batch_size: int                    # Parallel trial count
    max_iterations: int                # Total budget
    random_state: int                  # RNG state for reproducibility
    # Pending trials (outstanding ask_batch calls)
    pending_trials: tuple[Trial, ...]  # Trials awaiting tell()

@dataclass(frozen=True)
class Node:
    node_id: str
    state: Any                         # User-defined solution state
    score: float                       # Evaluation score in [0, 1]
    depth: int                         # Depth in tree (0 = root children)
    action: str                        # "depth" or "width"
    model: Optional[str]               # Which LLM generated this
    visit_count: int                   # Number of times refined
    subtree_best: float                # Best score in subtree
```
11 Benchmark: ARC-AGI-2 Results
11.1 ARC-AGI-2 Overview
ARC-AGI-2 (Abstraction and Reasoning Corpus, version 2) is a benchmark of abstract visual reasoning tasks. Each task presents a set of input-output grid examples, and the solver must infer the transformation rule and apply it to a test input. The tasks require forms of reasoning (symmetry detection, counting, geometric transformations, relational reasoning) that are challenging for current LLMs.
The benchmark's 120 public evaluation tasks provide a rigorous test of inference-time scaling, as many tasks are solvable only through extensive search and the combination of multiple reasoning strategies.
11.2 Experimental Configuration
| Parameter | Value | Notes |
|---|---|---|
| Task set | 120 public eval tasks | ARC-AGI-2 public evaluation set |
| Max iterations | 250 | Total generation budget per task |
| Batch size | 5 | Parallel trials per iteration |
| Score normalization | [0, 1] | Fraction of correct pixels |
| Pass criterion | Score = 1.0 | Perfect solution required |
| Rule selection | Top 2 solutions | Pass@2 for final submission |
11.3 Results Summary
| Method | Model(s) | Pass@250 | Pass@2 (rule selection) |
|---|---|---|---|
| Repeated Sampling | o4-mini | 23.0% | -- |
| AB-MCTS-A | o4-mini | 27.5% | -- |
| AB-MCTS-M | o4-mini | ~28% | -- |
| Multi-LLM AB-MCTS | o4-mini + Gemini + DeepSeek | >30% | 19.2% |
11.4 Analysis: Where Multi-LLM Wins
The most striking result is that Multi-LLM AB-MCTS solves problems that no single LLM can solve in isolation, even with the same total compute budget. This is not simply a matter of increased diversity. Detailed analysis of the search trees reveals several patterns:
- Complementary reasoning: One model identifies the correct transformation type (e.g., "this is a rotation"), while another model implements it correctly (e.g., producing the correct pixel-level output).
- Error correction: When one model produces a nearly-correct solution with a systematic error, a different model can sometimes identify and fix the error during a refinement step, whereas the original model tends to reproduce the same error.
- Diverse initialization: Different models, trained on different data with different architectures, produce different distributions of initial solutions. This provides better coverage of the solution space.
Key Insight: The improvement from 23% (repeated sampling) to >30% (Multi-LLM AB-MCTS) represents a 30%+ relative improvement. More importantly, the Pass@2 metric (19.2%) demonstrates that the search not only finds correct solutions but also reliably ranks them -- the top-2 solutions selected from 250 candidates capture nearly two-thirds of the solvable tasks.
11.5 Scaling Behavior
The Pass@N curves reveal diminishing but persistent returns to additional compute. Repeated sampling's Pass@N curve flattens quickly (most easily solvable problems are found within the first 50 attempts). AB-MCTS maintains a steeper slope because the search algorithm directs compute toward promising regions, making each additional iteration more productive than a random sample.
[!info]- Scaling Curve Analysis
The scaling behavior can be understood through the lens of search efficiency. Define search efficiency at iteration t as the probability that iteration t solves a previously unsolved problem:
$$ \mathrm{efficiency}(t) = P(\text{solved at } t \mid \text{unsolved through } t-1) $$
For repeated sampling, this probability decreases monotonically: the easy problems are found first, leaving only harder problems that have low per-sample solve probability. For AB-MCTS, the efficiency can increase over time as the algorithm accumulates information and focuses on promising directions. This non-monotonic efficiency is what enables the steeper scaling curve.
In the Multi-LLM setting, the efficiency benefits from an additional effect: as the algorithm learns which models are effective for which action types, it allocates compute more efficiently, further improving the scaling slope relative to the single-model variants.
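The Pass@N and efficiency quantities above can be estimated empirically from per-task solve records. The toy computation below (hypothetical data and helper names, not TreeQuest code) derives both from the iteration at which each task was first solved:

```python
# Toy illustration: for each task, record the iteration at which it
# was first solved (None = never solved within the budget).
def pass_at_n(solve_iters, n):
    """Fraction of tasks solved within the first n iterations."""
    return sum(1 for t in solve_iters if t is not None and t <= n) / len(solve_iters)

def efficiency(solve_iters, t):
    """Empirical P(solved at t | unsolved through t-1)."""
    at_risk = [s for s in solve_iters if s is None or s >= t]
    if not at_risk:
        return 0.0
    return sum(1 for s in at_risk if s == t) / len(at_risk)

# Four tasks: solved at iterations 1, 2, 2; one never solved.
solves = [1, 2, 2, None]
print(pass_at_n(solves, 1))   # 0.25
print(pass_at_n(solves, 2))   # 0.75
print(efficiency(solves, 2))  # 2 of the 3 still-unsolved tasks fall at t=2
```

Plotting `efficiency(t)` over t is a quick diagnostic: a monotonically decreasing curve is the signature of repeated sampling, while a flat or locally increasing curve indicates that the search is converting accumulated information into solves.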
12 Configuration and Hyperparameters
12.1 Recommended Settings
| Parameter | Recommended | Range | Notes |
|---|---|---|---|
| batch_size | 5 | 1-10 | ≤5 recommended for best adaptive behavior |
| max_iterations | 250 | 50-1000 | Task-dependent; diminishing returns beyond 500 |
| Score range | [0, 1] | [0, 1] | Mandatory normalization |
| Prior (alpha, beta) | (1, 1) | (0.5, 0.5) - (2, 2) | Uniform prior; (0.5, 0.5) = Jeffreys prior |
| Depth temperature | 0.7-1.0 | 0.0-2.0 | LLM temperature for refinement prompts |
| Width temperature | 1.0-1.2 | 0.0-2.0 | Higher temperature for diversity in new solutions |
12.2 Batch Size Trade-offs
The batch_size parameter controls how many trials are generated in parallel at each step. Smaller batches let the Thompson Sampling posteriors update more frequently (after each small batch), yielding more adaptive behavior; larger batches amortize API call overhead and latency but delay the feedback loop.
The recommended value of batch_size ≤ 5 reflects empirical findings
on ARC-AGI-2. With batch sizes larger than 5, the algorithm generates too many trials
between Thompson Sampling updates, reducing the adaptive advantage over repeated
sampling.
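The feedback-loop effect can be illustrated with a toy two-armed Bernoulli bandit, a stand-in for the depth/width decision (this is an illustration, not TreeQuest code, and the payoff probabilities are invented). Each trial in a batch draws its own Thompson sample, but the Beta posteriors update only once per batch, so larger batches act on staler beliefs:

```python
import random

def run(batch_size, total_trials, p=(0.7, 0.3), seed=0):
    """Toy two-armed Thompson Sampling loop with batched posterior updates."""
    rng = random.Random(seed)
    alpha = [1.0, 1.0]  # Beta(1, 1) uniform prior per arm
    beta = [1.0, 1.0]
    pulls = [0, 0]
    for _ in range(total_trials // batch_size):
        batch_wins = [0, 0]
        batch_losses = [0, 0]
        for _ in range(batch_size):
            # Each trial samples from the *current* (possibly stale) posterior
            samples = [rng.betavariate(alpha[k], beta[k]) for k in range(2)]
            arm = samples.index(max(samples))
            if rng.random() < p[arm]:
                batch_wins[arm] += 1
            else:
                batch_losses[arm] += 1
            pulls[arm] += 1
        # Posteriors update only after the whole batch completes
        for k in range(2):
            alpha[k] += batch_wins[k]
            beta[k] += batch_losses[k]
    return pulls

pulls = run(batch_size=5, total_trials=250)
print(pulls)  # the better arm (index 0) typically receives most pulls
```

With batch_size=50 the same budget allows only five posterior updates, so a large fraction of the trials is spent before the algorithm has learned anything, which is exactly the degradation the ≤5 recommendation guards against.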
12.3 Score Normalization Strategies
Proper score normalization is critical for Thompson Sampling to function correctly. Several normalization strategies are applicable depending on the task:
Python
import numpy as np

def normalize_discrete(correct: int, total: int) -> float:
    """Normalize discrete metrics (e.g., test cases passed)."""
    return correct / total if total > 0 else 0.0

def normalize_pixel(predicted_grid, target_grid) -> float:
    """Normalize pixel-level accuracy for ARC-AGI tasks."""
    if predicted_grid.shape != target_grid.shape:
        return 0.0  # Wrong grid dimensions
    matches = (predicted_grid == target_grid).sum()
    total = target_grid.size
    return matches / total

def normalize_unbounded(raw_score: float, midpoint: float = 100.0) -> float:
    """Sigmoid normalization for unbounded metrics."""
    return 1.0 / (1.0 + np.exp(-(raw_score - midpoint) / midpoint))
13 Implementation Walkthrough
13.1 Complete ARC-AGI-2 Example
The following example demonstrates a complete AB-MCTS search on an ARC-AGI-2 task, using the Multi-LLM variant with three models.
Python
import os

import numpy as np
from treequest import init_tree, ask_batch, tell, best
from treequest.models import MultiLLMConfig

# Define the model pool
model_config = MultiLLMConfig(
    models={
        "o4-mini": {
            "provider": "openai",
            "api_key": os.environ["OPENAI_API_KEY"],
            "temperature": 0.8,
        },
        "gemini-2.5-pro": {
            "provider": "google",
            "api_key": os.environ["GOOGLE_API_KEY"],
            "temperature": 1.0,
        },
        "deepseek-r1-0528": {
            "provider": "deepseek",
            "api_key": os.environ["DEEPSEEK_API_KEY"],
            "temperature": 0.9,
        },
    }
)

# Define the ARC-AGI-2 scoring function
def arc_scorer(predicted_grid, target_grid):
    """Score an ARC-AGI-2 solution by pixel-level accuracy."""
    if predicted_grid is None:
        return 0.0
    if predicted_grid.shape != target_grid.shape:
        return 0.0
    return float((predicted_grid == target_grid).sum()) / target_grid.size

# Initialize the search tree
tree = init_tree(
    method="multi_llm",
    batch_size=5,
    max_iterations=250,
    model_config=model_config,
)

# Load the ARC task
task = load_arc_task("task_001")

# Run the search
solved = False
for iteration in range(50):  # 50 batches x 5 = 250 trials
    trials = ask_batch(tree)
    results = []
    for trial in trials:
        # Generate a solution using the assigned model
        if trial.action == "width":
            prompt = make_fresh_arc_prompt(task)
        else:
            prompt = make_refinement_arc_prompt(
                task, trial.parent_state, trial.parent_score
            )
        model = model_config.get_client(trial.model_hint)
        response = model.generate(prompt)
        predicted_grid = parse_arc_output(response)
        score = arc_scorer(predicted_grid, task.test_output)
        results.append((trial.trial_id, predicted_grid, score))
        if score == 1.0:
            print(f"Perfect solution found at iteration {iteration}!")
            solved = True
    # Report the full batch before stopping, so no trial is left pending
    tree = tell(tree, results)
    if solved:
        break

# Get the best solution(s) for submission
top_solutions = best(tree, k=2)  # Top 2 for Pass@2 submission
13.2 Custom Problem Integration
TreeQuest can be applied to any problem where solutions can be generated by LLMs and evaluated with a scalar score. The following template shows how to integrate a custom problem:
[!info]- Custom Problem Template
Python
```
from treequest import init_tree, step

def solve_custom_problem(problem_description: str):
    """Solve a custom problem using AB-MCTS."""

    # Define generation function
    def generate(trial):
        if trial.action == "width":
            prompt = f"Solve this problem:\n{problem_description}"
        else:
            prompt = (
                f"Here is a previous attempt (score: {trial.parent_score}):\n"
                f"{trial.parent_state}\n\n"
                f"Improve this solution for:\n{problem_description}"
            )
        response = llm.generate(prompt)
        return response

    # Define scoring function
    def evaluate(solution):
        # Return a score in [0, 1]
        tests_passed = run_tests(solution)
        return tests_passed / total_tests

    # Initialize and run
    tree = init_tree(method="ab_mcts_a", batch_size=5)
    for _ in range(250):
        tree = step(tree, generate, evaluate)

    return tree.best_solution
```
14 Theoretical Foundations
14.1 Connection to Classical MCTS
AB-MCTS inherits the basic structure of Monte Carlo Tree Search but adapts it significantly for the LLM inference-time scaling setting. In classical MCTS (as used in game-playing, e.g., AlphaGo), the tree represents game states and actions, and the search discovers promising action sequences. In AB-MCTS, the tree represents solutions and refinement operations, and the search discovers promising solution-refinement sequences.
| Aspect | Classical MCTS | AB-MCTS |
|---|---|---|
| Nodes | Game states | Candidate solutions |
| Actions | Game moves | Depth (refine) or Width (new) |
| Rollout | Random playout to terminal state | LLM generation + scoring |
| Backpropagation | Win/loss statistics up the tree | Score improvement signals to Thompson Sampling |
| Selection | UCB1 or PUCT formula | Thompson Sampling from Beta posteriors |
| Expansion | One child per legal move | One child per LLM generation |
14.2 Regret Bounds
Thompson Sampling for the Bernoulli bandit problem has well-established regret bounds. For K arms over T rounds, the expected regret of Thompson Sampling satisfies:
$$ \mathbb{E}[R(T)] \le O\!\left(\sqrt{K T \log T}\right) $$
In the AB-MCTS context, K = 2 (depth and width) for the single-model variant, or K = 2|M| for the multi-model variant, where |M| is the number of models. Because regret grows only sublinearly in T, the average per-round regret vanishes, and the algorithm converges toward the optimal action distribution.
However, the standard regret analysis assumes stationary reward distributions. In AB-MCTS, the reward distributions are non-stationary: the value of depth vs. width changes as the tree grows and the best known solution improves. This non-stationarity can be partially addressed by using windowed Thompson Sampling (only counting recent observations) or by applying a discount factor to older observations.
14.3 Exploration-Exploitation in the Search Tree
The exploration-exploitation trade-off in AB-MCTS operates at two levels:
- Action level: Thompson Sampling balances depth vs. width (and model selection in the multi-LLM variant). This is the primary adaptive mechanism.
- Node level: When the algorithm decides to go deeper, it must select which node to refine. This is a secondary bandit problem where each node is an arm. Nodes with high scores and low visit counts are preferred (analogous to UCB-based selection in classical MCTS).
Theoretical Observation: The two-level bandit structure of AB-MCTS creates a hierarchical exploration-exploitation problem. The outer level (depth vs. width) determines the type of exploration, while the inner level (node selection) determines the location. This hierarchical structure is a key architectural advantage over flat search strategies like repeated sampling.
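The node-level preference just described (high score, low visit count) can be made concrete with a UCB-style rule that adds an exploration bonus shrinking with visit count. The function and constant below are hypothetical, chosen to illustrate the trade-off, and are not TreeQuest's actual selection formula:

```python
import math

def select_node_to_refine(nodes, c=0.5):
    """Pick the index of the node to refine.

    nodes: list of (score, visit_count) pairs. A UCB-style bonus favors
    rarely visited nodes; c (hypothetical constant) sets its weight.
    """
    total_visits = sum(v for _, v in nodes) + 1

    def ucb(score, visits):
        # Exploration bonus shrinks as a node accumulates visits
        bonus = c * math.sqrt(math.log(total_visits) / (visits + 1))
        return score + bonus

    scored = [ucb(s, v) for s, v in nodes]
    return scored.index(max(scored))

# A slightly weaker but rarely visited node can beat a stronger,
# heavily visited one once the exploration bonus is factored in:
nodes = [(0.80, 20), (0.75, 1), (0.40, 3)]
print(select_node_to_refine(nodes))  # index 1 wins here
```

Raising the second node's visit count (say, to 50) flips the choice back to the first node, which is exactly the two-level behavior described above: the outer bandit picks "depth," and the inner rule decides where that depth is spent.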
14.4 Convergence Properties
Given a sufficiently informative scoring function and enough compute budget, AB-MCTS converges to the optimal solution in the following sense: if the optimal solution is reachable via a finite refinement chain from any initial solution, and the scoring function provides monotonically improving signal along this chain, then the probability of finding the optimal solution approaches 1 as the budget T increases.
This convergence guarantee is stronger than for repeated sampling (which requires the optimal solution to be directly reachable in a single generation) but weaker than for exhaustive search (which guarantees finding the optimal solution in finite time regardless of the scoring function's properties).
15 Comparison with Related Work
15.1 Inference-Time Scaling Methods
| Method | Search Strategy | Adaptive? | Multi-Model? | Key Limitation |
|---|---|---|---|---|
| Best-of-N | Width only (independent samples) | No | No | No information reuse |
| Chain-of-Thought | Depth only (sequential reasoning) | No | No | Single reasoning path |
| Tree-of-Thought | Breadth-first tree search | No (fixed branching) | No | Fixed branching factor |
| Self-Consistency | Width + majority voting | No | No | Requires consensus |
| STaR / Self-Taught Reasoner | Iterative refinement with filtering | Partially | No | Requires correct answer for filtering |
| AB-MCTS | Adaptive depth + width | Yes (Thompson Sampling) | Yes | Requires scalar scoring function |
15.2 Multi-Agent LLM Systems
Several recent systems use multiple LLMs in collaborative or competitive arrangements. AB-MCTS differs from these approaches in that it treats model selection as an optimization problem rather than a fixed architecture:
- Debate / Negotiation: Models argue for different positions. AB-MCTS instead routes problems to models based on learned competence profiles.
- Mixture of Experts: A router selects which expert to use for each input. AB-MCTS learns the routing policy online via Thompson Sampling rather than training a separate router.
- Ensemble methods: Multiple models' outputs are aggregated (e.g., majority vote). AB-MCTS instead uses sequential refinement where one model can build on another's work.
15.3 Evolutionary and Population-Based Search
AB-MCTS shares conceptual similarities with evolutionary algorithms (EAs). The width action is analogous to mutation (generating new individuals), while the depth action is analogous to local search or hill climbing. The Thompson Sampling mechanism serves a similar role to adaptive operator selection in EAs, adjusting the mutation-to-local-search ratio based on observed fitness improvements.
However, AB-MCTS differs from traditional EAs in that it does not maintain a fixed population size, does not use crossover (combining parts of different solutions), and uses LLMs as the generation engine rather than domain-specific mutation operators. The tree structure also provides a richer representation of the search history than a flat population.
16 Limitations and Future Directions
16.1 Current Limitations
- Scoring function requirement: AB-MCTS requires a scalar scoring function that provides meaningful signal in [0, 1]. For tasks without clear partial credit (e.g., open-ended creative writing), designing such a function is non-trivial.
- LLM call overhead: Each iteration requires at least one LLM API call. For models with high latency or cost (e.g., GPT-4-level models), the 250-iteration budget may be prohibitively expensive for some applications.
- Prompt engineering dependency: The effectiveness of depth (refinement) actions depends heavily on the quality of the refinement prompt. Poor prompts may cause the LLM to ignore the parent solution or copy it verbatim.
- Non-stationarity: The theoretical regret bounds assume stationary reward distributions, but the true distributions shift as the tree grows. While Thompson Sampling is empirically robust to mild non-stationarity, extreme distribution shift could degrade performance.
- No crossover: Unlike evolutionary algorithms, AB-MCTS does not combine elements from different solution branches. Adding a crossover-like operation (e.g., asking the LLM to merge the best aspects of two different solutions) is a natural extension.
16.2 Future Directions
[!info]- Crossover and Recombination
A natural extension is to add a "crossover" action that combines elements from two different subtrees. The LLM would receive two solutions and be asked to combine their best aspects. This would add a fourth action to the Thompson Sampling model (depth, width, crossover, possibly model selection), increasing the action space but potentially enabling richer search dynamics.
Python
```
def crossover_action(tree, model):
    """Generate a solution by combining two existing ones."""
    node_a = select_node(tree, strategy="best")
    node_b = select_node(tree, strategy="diverse")

    prompt = (
        f"Here are two solution approaches:\n\n"
        f"Approach A (score {node_a.score:.2f}):\n{node_a.state}\n\n"
        f"Approach B (score {node_b.score:.2f}):\n{node_b.state}\n\n"
        f"Combine the best aspects of both approaches."
    )
    return model.generate(prompt)
```
[!info]- Hierarchical and Nested Search
For problems with natural decomposition (e.g., multi-step reasoning), AB-MCTS could be applied hierarchically: a top-level search decomposes the problem into subproblems, and each subproblem is solved by a nested AB-MCTS search. The top-level tree's scoring function would aggregate sub-problem scores.

[!info]- Learned Scoring Functions
When ground-truth evaluation is expensive or unavailable, an LLM-based scoring function (reward model) could replace the external evaluator. This creates a fully self-contained system but introduces the risk of reward hacking. Careful calibration and periodic ground-truth validation would be needed.
[!info]- Non-Stationary Thompson Sampling
Discounted Thompson Sampling or sliding-window Thompson Sampling could address the non-stationarity issue. In discounted TS, older observations are geometrically down-weighted:
$$ \alpha_t = \gamma \, \alpha_{t-1} + r_t, \qquad \gamma \in (0, 1) $$
This allows the algorithm to adapt more quickly when the reward distribution shifts (e.g., when the search enters a new region of the solution space).
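Assuming 0/1 rewards, a discounted Beta update for a single arm can be sketched as follows (illustrative only, not a TreeQuest feature; with gamma = 0.95 the effective memory is roughly 1/(1-gamma) = 20 observations):

```python
import random

def discounted_update(alpha, beta, reward, gamma=0.95):
    """One discounted Beta-posterior update for a 0/1 reward.

    Both pseudo-counts decay geometrically, so the posterior tracks
    a moving estimate of the recent success rate.
    """
    alpha = gamma * alpha + reward
    beta = gamma * beta + (1 - reward)
    return alpha, beta

rng = random.Random(0)
alpha, beta = 1.0, 1.0
# Phase 1: the arm pays off 80% of the time...
for _ in range(200):
    alpha, beta = discounted_update(alpha, beta, int(rng.random() < 0.8))
mean_phase1 = alpha / (alpha + beta)
# Phase 2: ...then the distribution shifts to a 20% payoff rate.
for _ in range(200):
    alpha, beta = discounted_update(alpha, beta, int(rng.random() < 0.2))
mean_phase2 = alpha / (alpha + beta)
print(round(mean_phase1, 2), round(mean_phase2, 2))  # high, then low
```

An undiscounted posterior would average the two phases together and sit near 0.5 after the shift; the discounted version forgets the stale evidence within a few dozen observations, which is the behavior needed as the search moves into new regions of the solution space.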
[!info]- Integration with AlphaEvolve-Style Code Evolution
AB-MCTS could serve as the search backbone for code evolution systems like AlphaEvolve. Rather than using MAP-Elites or simple evolutionary strategies for program search, AB-MCTS's adaptive branching could provide more efficient exploration of the program space. The Multi-LLM variant is particularly relevant here, as different models may excel at different aspects of code generation (e.g., algorithmic design vs. implementation details vs. optimization).
17 References
- Sakana AI. "AB-MCTS: Adaptive Branching Monte Carlo Tree Search for Multi-LLM Inference-Time Scaling." arXiv:2503.04412, March 2025.
- Sakana AI. TreeQuest: Open-source library for AB-MCTS. github.com/SakanaAI/treequest. Apache 2.0 License.
- Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547, 2019. (ARC benchmark definition.)
- Chollet, F. et al. "ARC Prize." arcprize.org, 2024-2025. (ARC-AGI-2 evaluation framework.)
- Thompson, W.R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 25(3-4):285-294, 1933.
- Agrawal, S. and Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012.
- Browne, C. et al. "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.
- Silver, D. et al. "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587):484-489, 2016.
- Yao, S. et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS, 2023.
- Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022.
- Wang, X. et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR, 2023.
- Zelikman, E. et al. "STaR: Bootstrapping Reasoning With Reasoning." NeurIPS, 2022.
- Snell, C. et al. "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314, 2024.
- Google DeepMind. "AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms." Technical Report, May 2025.
AB-MCTS / TreeQuest -- PhD-Level Technical Report | Generated March 2026 | Based on arXiv:2503.04412 and the TreeQuest open-source library