
AB-MCTS / TreeQuest

Adaptive Branching Monte Carlo Tree Search for Multi-LLM Inference-Time Scaling

Organization: Sakana AI
Paper: arXiv 2503.04412
License: Apache 2.0
Repository: github.com/SakanaAI/treequest
Report Type: PhD-Level Technical Analysis
Report Date: March 2026

Table of Contents

1 Executive Summary

AB-MCTS (Adaptive Branching Monte Carlo Tree Search) represents a principled approach to inference-time scaling for large language models. Rather than relying on a single model generating a single answer, AB-MCTS orchestrates a search process that adaptively decides between two fundamental strategies: going deeper (refining an existing solution through sequential iteration) and going wider (generating entirely new solutions from scratch). This adaptive branching is the core novelty, as prior methods typically commit to one strategy or the other.

The accompanying open-source library, TreeQuest, implements this algorithm with a stateless, functional design that supports checkpointing, resumption, and batch-parallel execution. The Multi-LLM extension adds a third search dimension: selecting which frontier model to use at each step, treating model selection as an additional bandit arm optimized via Thompson Sampling.

Key Result: On the ARC-AGI-2 benchmark (120 public evaluation tasks, max 250 iterations), Multi-LLM AB-MCTS achieves >30% Pass@250, compared to 23% for naive repeated sampling with a single model (o4-mini). Problems that no single LLM can solve in isolation become tractable through collaborative multi-model search.

At a Glance

Core Innovation: Adaptive depth vs. width branching via Thompson Sampling

Search Framework: Monte Carlo Tree Search with Bayesian action selection

Models Supported: o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528

Library: TreeQuest (Python, Apache 2.0)

Design Pattern: Stateless ask-tell interface with immutable data structures

Primary Benchmark: ARC-AGI-2 (abstraction and reasoning)

2 Motivation and Problem Formulation

2.1 The Inference-Time Scaling Paradigm

Traditional scaling laws for LLMs focus on training-time compute: larger models trained on more data yield better performance. Inference-time scaling offers a complementary axis -- rather than training a bigger model, one can spend more compute at inference time by generating multiple candidate solutions and selecting among them. This paradigm is particularly relevant for tasks where verification is cheaper than generation, such as mathematical reasoning, code synthesis, and abstract pattern matching.

The simplest inference-time scaling strategy is repeated sampling (also called Best-of-N): generate N independent solutions and select the best one. While effective, this approach makes no use of information gathered during search. If the first 50 attempts all fail in a similar way, attempt 51 is no more informed than attempt 1.
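To make the baseline concrete, here is a minimal sketch of repeated sampling; `best_of_n`, `generate`, and `score` are illustrative names, not part of any library discussed here:

```python
import random

def best_of_n(generate, score, n):
    """Naive repeated sampling (Best-of-N): n independent draws, keep the best.

    No information flows between attempts -- attempt 51 is exactly as
    blind as attempt 1, which is the weakness adaptive search addresses.
    """
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate()       # each draw ignores all prior attempts
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Toy demo: candidates are random floats in [0, 1]; score is identity.
random.seed(0)
sol, s = best_of_n(generate=random.random, score=lambda x: x, n=50)
```

In a real deployment, `generate` would be an LLM call and `score` a task-specific verifier; the structure of the loop is unchanged.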

2.2 The Depth-Width Trade-off

Consider a problem that admits multiple solution strategies. Two natural approaches exist:

  • Depth-first (Go Deeper): Take a promising partial solution and iteratively refine it. Feed the current best attempt back to the model with instructions to improve, fix bugs, or extend the approach. This leverages accumulated context but risks getting stuck in a local optimum.
  • Width-first (Go Wider): Generate entirely new solutions from scratch, independent of previous attempts. This provides diversity but wastes information gained from prior attempts.

The central insight of AB-MCTS is that the optimal trade-off between depth and width is problem-dependent and node-dependent within the search tree. Some problems benefit from deep refinement chains; others benefit from broad exploration. Even within a single search tree, different subtrees may warrant different strategies. AB-MCTS uses Bayesian methods to learn this trade-off adaptively during search.

2.3 Formal Problem Statement

Let f: S → [0, 1] be an evaluation function mapping candidate solutions to quality scores. Given a computational budget of T iterations, the goal is to find:

$$ s^* = \arg\max_{s \in S_T} f(s) $$

where $S_T$ is the set of all solutions explored within the budget of $T$ iterations. The search must decide, at each iteration, whether to:

  1. Expand: Create a new root-level solution (width).
  2. Refine: Select an existing node and generate a child solution (depth).

In the multi-LLM setting, a third decision is introduced: which model m ∈ M to use for generation. The search algorithm must balance exploration and exploitation across all three dimensions simultaneously.

Search Decision Space
    ================================================

      Iteration t:
      +-------------------+     +-------------------+
      |   GO WIDER        |     |   GO DEEPER       |
      |   (New Solution)  | vs  |   (Refine Node)   |
      +-------------------+     +-------------------+
              |                         |
              v                         v
      +-------------------+     +-------------------+
      |  Select Model m   |     |  Select Model m   |
      |  from pool M      |     |  from pool M      |
      +-------------------+     +-------------------+
              |                         |
              v                         v
      +-------------------+     +-------------------+
      |  Generate & Score |     |  Generate & Score |
      | s_new = m(prompt) |     | s_child = m(s,p)  |
      +-------------------+     +-------------------+
              |                         |
              +----------+  +-----------+
                         |  |
                         v  v
               +-------------------+
               |  Update Thompson  |
               |  Sampling Models  |
               +-------------------+

3 Search Dimensions: Depth vs. Width

3.1 Going Deeper: Sequential Refinement

The "depth" action selects an existing node in the search tree and generates a child solution by feeding the node's solution as context to the LLM. The refinement prompt typically includes the previous solution, its score, and instructions for improvement. This creates chains of progressively refined solutions.

The depth action is analogous to iterative refinement in optimization: gradient-descent-like steps that improve a current solution. Its effectiveness depends on the smoothness of the solution landscape -- if small changes to a solution reliably improve its score, depth search converges quickly. However, if the solution landscape has many local optima, depth search may get trapped.

Node Selection for Refinement

When the algorithm decides to go deeper, it must choose which existing node to refine. This is itself a multi-armed bandit problem. AB-MCTS uses the accumulated evidence at each node (the scores of its descendants) to select promising nodes for further refinement. The selection balances exploitation (refining high-scoring nodes) with exploration (trying nodes that haven't been refined much).
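The paper does not tie this subsection to a specific formula, but the selection can be sketched in the same Thompson Sampling style, assuming (hypothetically) that each node carries per-node success/failure counts over its refinements; TreeQuest's internal representation differs:

```python
import random

def select_node(nodes):
    """Pick a node to refine by sampling each node's Beta posterior.

    `nodes` is a list of dicts {"id", "successes", "failures"}, where
    the counts record how many refinements of that node did / did not
    improve the subtree best. Illustrative structure only.
    """
    best_node, best_sample = None, -1.0
    for node in nodes:
        # Rarely-refined nodes have broad posteriors, so they still
        # draw occasional high samples and get explored.
        sample = random.betavariate(node["successes"] + 1,
                                    node["failures"] + 1)
        if sample > best_sample:
            best_node, best_sample = node, sample
    return best_node

random.seed(1)
nodes = [{"id": "A", "successes": 8, "failures": 2},
         {"id": "B", "successes": 1, "failures": 9}]
chosen = select_node(nodes)
```

Node A, with a strong refinement track record, is chosen far more often than node B, yet B is never starved entirely -- exactly the exploitation/exploration balance described above.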

3.2 Going Wider: Fresh Generation

The "width" action generates an entirely new solution from scratch, without conditioning on any previous attempt. This is equivalent to adding a new root-level node to the search tree. Width provides diversity: even if all existing solution branches are stuck in local optima, a fresh generation might discover an entirely different approach.

The width action is most valuable early in search (when the tree is sparse and the algorithm hasn't yet found promising regions) and whenever the algorithm detects that refinement is yielding diminishing returns.

3.3 Adaptive Branching

The key innovation is that AB-MCTS does not fix the depth-width ratio a priori. Instead, it treats the choice as a two-armed bandit and uses Thompson Sampling to adaptively allocate compute between the two strategies. Early in search, when few solutions exist, width tends to dominate. As the tree grows and promising branches emerge, depth becomes more attractive. If depth produces diminishing returns in a particular subtree, the algorithm naturally shifts back toward width.

Adaptive Branching Principle: The ratio of depth-to-width actions is not a hyperparameter but an emergent property of the search dynamics. It is learned online through the Thompson Sampling mechanism, adapting to each problem's characteristics.

4 Core Algorithm: AB-MCTS

4.1 Algorithm Overview

AB-MCTS operates in a loop of four phases: selection, expansion, evaluation, and backpropagation. This structure mirrors classical MCTS but adapts each phase for the LLM inference-time scaling setting.

Python

# AB-MCTS main loop (simplified pseudocode)
def ab_mcts(problem, budget_T, models, scorer):
    tree = init_tree()
    thompson_model = ThompsonSampler(actions=["depth", "width"])

    for t in range(budget_T):
        # 1. Selection: Choose depth vs. width via Thompson Sampling
        action = thompson_model.sample()

        if action == "width":
            # 2a. Expansion (width): Generate new root-level solution
            prompt = make_fresh_prompt(problem)
            node = tree.root
        else:
            # 2b. Expansion (depth): Select node and refine
            node = select_node(tree)
            prompt = make_refinement_prompt(problem, node.solution, node.score)

        # 3. Evaluation: Generate solution and score it
        model = select_model(models)  # Multi-LLM extension
        solution = model.generate(prompt)
        score = scorer.evaluate(solution)  # Normalized to [0, 1]

        # 4. Backpropagation: Add child and update Thompson model
        child = tree.add_child(node, solution, score)
        reward = 1 if score > node.best_descendant_score else 0
        thompson_model.update(action, reward)

    return tree.best_solution()

4.2 Selection Phase

The selection phase decides between depth and width. AB-MCTS uses Thompson Sampling: each action (depth, width) has an associated probability model that estimates the expected reward of taking that action. At each iteration, a value is sampled from each action's posterior distribution, and the action with the higher sample is chosen. This naturally balances exploration and exploitation.

4.3 Expansion Phase

For width expansion, a fresh prompt is constructed from the problem description alone. For depth expansion, the prompt includes the selected node's solution and score, along with instructions to improve upon it. The expansion phase is where the LLM does its actual generative work.

4.4 Evaluation Phase

The generated solution is evaluated by a scoring function that returns a value in [0, 1]. This scoring function is user-defined and problem-specific. For ARC-AGI tasks, the score might be the fraction of test pixels correctly predicted. For code generation, it might be the fraction of test cases passed. The normalization to [0, 1] is important for the Thompson Sampling model.

4.5 Backpropagation Phase

The reward signal is computed by comparing the new solution's score to the best score among the parent node's descendants. If the new solution improves upon the best known descendant, the reward is 1 (success); otherwise, it is 0 (failure). This binary reward is used to update the Thompson Sampling model's posterior for the chosen action.

AB-MCTS Search Tree Evolution
    ==============================

    Iteration 1 (Width):     Iteration 3 (Width):     Iteration 5 (Depth):
          [R]                      [R]                      [R]
           |                      / \                      / \
          [A]                   [A] [B]                 [A] [B]
         s=0.3                s=0.3 s=0.5             s=0.3 s=0.5
                                                             |
    Iteration 2 (Depth):                                    [C]
          [R]                                             s=0.7
           |
          [A]                  Iteration 4 (Depth):
         s=0.3                      [R]
           |                       / \
          [A']                   [A] [B]
         s=0.4                 s=0.3 s=0.5
                                      |
                                     [B']
                                    s=0.6

    Legend: [R] = Root,  s = score,  Width = new branch,  Depth = refine existing

5 Thompson Sampling Integration

5.1 Thompson Sampling Fundamentals

Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. Each arm (action) has an associated prior distribution over its expected reward. After observing a reward from an arm, the prior is updated to a posterior using Bayes' rule. At each decision point, a value is sampled from each arm's posterior, and the arm with the highest sample is chosen.

For binary rewards (success/failure), the natural choice is a Beta distribution. If arm $i$ has recorded $\alpha_i$ successes and $\beta_i$ failures, its posterior over the expected reward $\theta_i$ is:

$$ \theta_i \sim \mathrm{Beta}(\alpha_i + 1, \beta_i + 1) $$

Thompson Sampling provides an elegant exploration-exploitation trade-off: arms with high expected reward are chosen frequently (exploitation), while arms with high uncertainty are occasionally chosen due to random sampling from broad posteriors (exploration).
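A small numerical illustration of this trade-off (the counts here are arbitrary, chosen only for the demo): a well-explored arm with a decent reward rate still loses a meaningful share of decisions to a barely-tried arm, because the latter's posterior is broad.

```python
import random

random.seed(42)

# Arm 1: well explored -- 30 successes, 20 failures (narrow posterior).
# Arm 2: barely tried -- 1 success, 1 failure (broad posterior).
def sample_arm1():
    return random.betavariate(30 + 1, 20 + 1)

def sample_arm2():
    return random.betavariate(1 + 1, 1 + 1)

# Over many decisions, the uncertain arm still wins a nontrivial share,
# so it keeps accumulating evidence instead of being abandoned.
wins_arm2 = sum(sample_arm2() > sample_arm1() for _ in range(1000))
share = wins_arm2 / 1000
```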

5.2 Application to Depth-Width Selection

In AB-MCTS, the two arms are "depth" and "width." Each maintains its own Beta distribution. When a depth action produces a child that improves upon its subtree's best score, the depth arm's success count increments. Otherwise, the failure count increments. The same logic applies to width actions.

Python

class ThompsonSampler:
    """Thompson Sampling with Beta-Bernoulli model for action selection."""

    def __init__(self, actions: list[str]):
        self.alpha = {a: 1.0 for a in actions}  # Prior successes
        self.beta = {a: 1.0 for a in actions}   # Prior failures

    def sample(self) -> str:
        """Sample from each arm's posterior; return arm with highest sample."""
        samples = {}
        for action in self.alpha:
            samples[action] = np.random.beta(
                self.alpha[action], self.beta[action]
            )
        return max(samples, key=samples.get)

    def update(self, action: str, reward: float):
        """Update posterior with observed reward (0 or 1)."""
        self.alpha[action] += reward
        self.beta[action] += (1.0 - reward)

5.3 Mixed Models for Continuous Rewards

The basic Beta-Bernoulli model treats rewards as binary. However, real scores are continuous in [0, 1]. The AB-MCTS-M variant uses mixed models -- specifically, PyMC-based Bayesian inference with hierarchical statistical modeling -- to maintain richer posterior distributions over expected scores. This provides more nuanced exploration-exploitation trade-offs when scores carry graded information rather than simple pass/fail signals.

Design Decision: The choice between Beta-Bernoulli (AB-MCTS-A) and mixed models (AB-MCTS-M) reflects a bias-variance trade-off. The simpler model is more robust and computationally cheaper, while the mixed model can extract more signal from continuous scores at the cost of greater complexity and PyMC dependency.

6 AB-MCTS Variants

6.1 AB-MCTS-A: Node Aggregation

The Aggregation variant (AB-MCTS-A) makes branching decisions based on accumulated evidence at each node. For every node in the tree, it tracks the number of successful and unsuccessful refinements. "Success" is defined as producing a child whose score exceeds the current subtree maximum. This evidence is aggregated across the tree to estimate the global value of depth vs. width actions.

Node aggregation provides a simple, robust estimator. Its strength is that it treats every refinement attempt as an independent Bernoulli trial, making the Beta posterior well-calibrated. Its weakness is that it discards the magnitude of improvement -- a refinement that improves the score from 0.5 to 0.51 counts the same as one improving from 0.5 to 0.99.

[!info]- AB-MCTS-A Implementation Details

Python

```python
import numpy as np

class ABMCTSAggregation:
    """AB-MCTS with node aggregation for branching decisions."""

    def __init__(self):
        # Global counters for depth vs. width outcomes
        self.depth_successes = 0
        self.depth_failures = 0
        self.width_successes = 0
        self.width_failures = 0

    def should_go_deeper(self) -> bool:
        """Thompson Sampling: sample from both Beta posteriors."""
        depth_sample = np.random.beta(
            self.depth_successes + 1,
            self.depth_failures + 1
        )
        width_sample = np.random.beta(
            self.width_successes + 1,
            self.width_failures + 1
        )
        return depth_sample > width_sample

    def record_outcome(self, action: str, child_score: float,
                       subtree_best: float):
        """Record whether the new child improved on the subtree best."""
        success = child_score > subtree_best
        if action == "depth":
            self.depth_successes += int(success)
            self.depth_failures += int(not success)
        else:
            self.width_successes += int(success)
            self.width_failures += int(not success)
```

6.2 AB-MCTS-M: Mixed Models

The Mixed Models variant (AB-MCTS-M) uses PyMC-based Bayesian inference to maintain richer posterior distributions. Rather than reducing scores to binary outcomes, AB-MCTS-M models the full score distribution for each action type using hierarchical statistical models.

The hierarchical model structure allows AB-MCTS-M to share information across nodes in the tree. If refinements at one node tend to produce high scores, this information propagates to inform expectations about refinements at other nodes with similar characteristics. This is particularly useful early in search when per-node evidence is sparse.

[!info]- PyMC Model Specification

Python

```python
import pymc as pm

def build_mixed_model(depth_scores, width_scores):
    """Build hierarchical Bayesian model for action selection.

    Uses PyMC to estimate the expected improvement for
    depth vs. width actions with proper uncertainty quantification.
    """
    with pm.Model() as model:
        # Hyperpriors for the population-level parameters
        mu_global = pm.Normal("mu_global", mu=0.5, sigma=0.3)
        sigma_global = pm.HalfNormal("sigma_global", sigma=0.2)

        # Action-specific means (partial pooling)
        mu_depth = pm.Normal("mu_depth", mu=mu_global, sigma=sigma_global)
        mu_width = pm.Normal("mu_width", mu=mu_global, sigma=sigma_global)

        # Action-specific variances
        sigma_depth = pm.HalfNormal("sigma_depth", sigma=0.1)
        sigma_width = pm.HalfNormal("sigma_width", sigma=0.1)

        # Likelihood for observed scores
        if len(depth_scores) > 0:
            pm.Normal("obs_depth", mu=mu_depth, sigma=sigma_depth,
                      observed=depth_scores)
        if len(width_scores) > 0:
            pm.Normal("obs_width", mu=mu_width, sigma=sigma_width,
                      observed=width_scores)

        # Sample from posterior
        trace = pm.sample(500, return_inferencedata=True, cores=1)

    return trace
```

6.3 Multi-LLM: Model Selection as a Third Dimension

The Multi-LLM variant extends AB-MCTS by treating model selection as an additional action dimension. In the basic algorithm, the search decides between depth and width. In the Multi-LLM variant, it additionally decides which LLM to use for generation. Each LLM gets its own Thompson Sampling model, updated based on the quality of solutions it produces.

This is formalized as a multi-dimensional multi-armed bandit. The action space becomes the Cartesian product of {depth, width} and {model_1, model_2, ..., model_k}. Each combination maintains its own Beta posterior, allowing the algorithm to learn, for example, that model A is better at refinement while model B is better at generating diverse new solutions.

Variant     Branching Strategy                 Score Handling               Dependencies                Computational Overhead
---------   --------------------------------   --------------------------   -------------------------   --------------------------
AB-MCTS-A   Beta-Bernoulli Thompson Sampling   Binary (improve/not)         NumPy only                  Negligible
AB-MCTS-M   Hierarchical Bayesian (PyMC)       Continuous [0, 1]            PyMC, ArviZ                 MCMC sampling per decision
Multi-LLM   Multi-dimensional bandit           Binary per (action, model)   NumPy + multiple API keys   Negligible (per decision)

7 Multi-LLM Extension

7.1 Three-Dimensional Decision Space

The Multi-LLM extension expands the single-step decision from two dimensions to three:

  1. Dimension 1 -- Direction: Decide whether to go deeper (refine) or go wider (fresh generation).
  2. Dimension 2 -- Model Selection: Choose which LLM from the available pool (e.g., o4-mini, Gemini-2.5-Pro, DeepSeek R1-0528).
  3. Dimension 3 -- Generation: Execute the generation with the chosen model and evaluate the result.
Multi-LLM AB-MCTS Decision Flow
    =================================

    Step 1: Direction          Step 2: Model            Step 3: Generate
    +---------+               +-------------+          +---------------+
    | Depth   |---+     +---->| o4-mini     |---+  +-->| LLM(prompt)   |
    +---------+   |     |     +-------------+   |  |   +---------------+
                  +---->+     +-------------+   +->+          |
    +---------+   |     +---->| Gemini-2.5  |---+  |   +------v--------+
    | Width   |---+     |     +-------------+   |  |   | Evaluate(sol) |
    +---------+         |     +-------------+   |  |   +------+--------+
                        +---->| DeepSeek R1 |---+  |          |
                              +-------------+      |   +------v--------+
                                                   +-->| Update Models |
                                                       +---------------+

    Each (direction, model) pair has its own Thompson Sampling posterior.

7.2 Independent Probability Models per LLM

Each LLM maintains its own probability model, updated independently based on its performance. This allows the algorithm to learn the strengths and weaknesses of each model. For example, if Gemini-2.5-Pro consistently produces better refinements (depth) while o4-mini is better at generating diverse initial solutions (width), the algorithm will allocate accordingly.

Python

class MultiLLMThompsonSampler:
    """Thompson Sampling across (direction, model) combinations."""

    def __init__(self, models: list[str]):
        self.arms = {}
        for direction in ["depth", "width"]:
            for model in models:
                key = (direction, model)
                self.arms[key] = {"alpha": 1.0, "beta": 1.0}

    def select(self) -> tuple[str, str]:
        """Sample from all arm posteriors; return (direction, model)."""
        best_key = None
        best_sample = -1.0

        for key, params in self.arms.items():
            sample = np.random.beta(params["alpha"], params["beta"])
            if sample > best_sample:
                best_sample = sample
                best_key = key

        return best_key  # (direction, model_name)

    def update(self, direction: str, model: str, reward: float):
        key = (direction, model)
        self.arms[key]["alpha"] += reward
        self.arms[key]["beta"] += (1.0 - reward)

7.3 Emergent Model Specialization

A key empirical finding is that different models develop different roles during search. The Thompson Sampling mechanism naturally discovers these specializations without explicit programming. This emergent division of labor is one of the most compelling aspects of the Multi-LLM approach.

The synergy between models extends beyond simple specialization. When one model produces a solution with a novel structural insight but low overall score, another model may take that insight during a refinement step and combine it with its own strengths to produce a high-scoring solution. This form of "cross-pollination" between models is unique to the Multi-LLM setting and cannot be achieved by running models independently.

Empirical Finding: On ARC-AGI-2, the Multi-LLM configuration solves problems that no single LLM can solve individually, even given the same total compute budget. This is not merely a statistical artifact of increased diversity -- the models genuinely complement each other through collaborative refinement chains.

8 TreeQuest Library Architecture

8.1 Design Philosophy

TreeQuest is the open-source Python library implementing AB-MCTS. Its architecture follows several key design principles:

  • Statelessness: All tree state is returned from functions rather than stored in mutable objects. This enables checkpointing and resumption.
  • Immutability: Data structures are immutable, preventing accidental state corruption and enabling safe concurrent access.
  • Separation of concerns: The search algorithm is decoupled from the LLM interface and the evaluation function. Users provide generation and scoring callbacks.
  • Batch parallelism: The ask-tell interface supports batched operations, enabling parallel generation across multiple LLM calls.

8.2 Core Components

TreeQuest Architecture
    =======================

    +-----------------------------------------------------------+
    |                      User Code                            |
    |  +------------------+  +------------------+               |
    |  | generate_fn()    |  | evaluate_fn()    |               |
    |  | (state, score)   |  | score in [0, 1]  |               |
    |  +--------+---------+  +--------+---------+               |
    |           |                     |                          |
    +-----------+---------------------+--------------------------+
                |                     |
    +-----------v---------------------v--------------------------+
    |                   TreeQuest Core                           |
    |                                                           |
    |  +--------------+  +----------------+  +---------------+  |
    |  | init_tree()  |  | ask_batch()    |  | tell()        |  |
    |  | Returns:     |  | Returns:       |  | Updates:      |  |
    |  |  TreeState   |  |  List[Trial]   |  |  TreeState    |  |
    |  +--------------+  +----------------+  +---------------+  |
    |                                                           |
    |  +--------------+  +----------------+                     |
    |  | step()       |  | best()         |                     |
    |  | Single iter  |  | Best solution  |                     |
    |  +--------------+  +----------------+                     |
    |                                                           |
    |  +--------------------------------------------------+    |
    |  |          Thompson Sampling Engine                  |    |
    |  |  Beta posteriors | Mixed models | Multi-LLM       |    |
    |  +--------------------------------------------------+    |
    +-----------------------------------------------------------+

8.3 Data Flow

The data flow through TreeQuest follows a strict functional pattern. Each function takes the current tree state as input and returns a new (immutable) tree state as output. No global state or side effects are involved in the core search logic.

Python

from treequest import init_tree, ask_batch, tell, best

# 1. Initialize the search tree
tree_state = init_tree(
    method="ab_mcts_a",     # AB-MCTS with node aggregation
    batch_size=5,            # Number of parallel trials
    max_iterations=250,      # Total budget
)

# 2. Main search loop
for iteration in range(250):
    # Ask: get batch of trials to evaluate
    trials = ask_batch(tree_state)

    # User generates solutions and computes scores
    results = []
    for trial in trials:
        solution = my_llm.generate(trial.prompt)
        score = my_evaluator.score(solution)  # [0, 1]
        results.append((trial.trial_id, solution, score))

    # Tell: reflect outcomes back into the tree
    tree_state = tell(tree_state, results)

# 3. Retrieve best solution found
best_solution = best(tree_state)

9 API Design and Ask-Tell Interface

9.1 The Ask-Tell Pattern

The ask-tell interface separates the decision of what to evaluate from the execution of that evaluation. This separation is critical for practical LLM-based search:

  • ask_batch() returns a list of Trial objects, each specifying what the user should evaluate (the prompt, the parent node, and the action type).
  • tell() receives the results of those evaluations (solutions and scores) and incorporates them into the search tree.

This pattern allows users to execute LLM calls however they wish -- synchronously, asynchronously, across different API providers, with rate limiting, with caching, etc. The search algorithm is agnostic to the execution strategy.
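One consequence of this agnosticism is that concerns like caching live entirely in user code. The sketch below shows a hypothetical memoizing wrapper; `llm_call` and `CachedGenerator` are illustrative names, not TreeQuest APIs:

```python
import hashlib

class CachedGenerator:
    """Memoize generation calls by prompt hash.

    Because ask/tell only exchange trials and results, the search
    never sees (and never needs to see) this wrapper.
    """

    def __init__(self, llm_call):
        self.llm_call = llm_call
        self.cache = {}

    def generate(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.llm_call(prompt)
        return self.cache[key]

# Demo with a fake "LLM" that records how often it is actually invoked.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return prompt.upper()

gen = CachedGenerator(fake_llm)
gen.generate("solve task 1")
gen.generate("solve task 1")   # served from cache; no second call
```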

9.2 Trial Objects

Each Trial returned by ask_batch() contains all the information needed to generate and evaluate a solution:

Python

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Trial:
    """An immutable trial descriptor returned by ask_batch()."""

    trial_id: str                    # Unique identifier for this trial
    action: str                      # "depth" or "width"
    parent_node_id: Optional[str]   # None for width (new root)
    parent_state: Optional[Any]     # Parent solution state for refinement
    parent_score: Optional[float]   # Parent's score for context
    model_hint: Optional[str]       # Suggested model (Multi-LLM mode)
    depth: int                       # Depth in the search tree
    metadata: dict                   # Additional context for the user

9.3 Result Reporting

Results can be reported in any order via tell(). This is important for asynchronous execution: if one LLM call finishes before another, the result can be incorporated immediately without waiting for the entire batch.

Python

import asyncio

async def run_search_async(tree_state, llm_pool, evaluator):
    """Asynchronous search with out-of-order result reporting."""
    for _ in range(50):  # 50 batches of 5 = 250 total
        trials = ask_batch(tree_state)

        # Launch all LLM calls concurrently
        tasks = []
        for trial in trials:
            model = llm_pool.get_model(trial.model_hint)
            task = asyncio.create_task(
                generate_and_score(model, trial, evaluator)
            )
            tasks.append(task)

        # Gather results (may complete in any order)
        results = await asyncio.gather(*tasks)

        # Tell: incorporate all results at once
        tree_state = tell(tree_state, results)

    return best(tree_state)

async def generate_and_score(model, trial, evaluator):
    """Generate a solution and score it."""
    solution = await model.agenerate(trial.parent_state)
    score = evaluator.score(solution)
    return (trial.trial_id, solution, score)

9.4 User-Defined Generation Functions

The generation and evaluation functions are entirely user-defined. TreeQuest only requires that evaluation functions return scores normalized to [0, 1]. The generation function receives the trial context (parent state, action type) and returns a (state, score) tuple.

Score Normalization: All scores must be in [0, 1] for the Thompson Sampling model to function correctly. For tasks with discrete metrics (e.g., number of test cases passed), normalize by dividing by the total number of test cases. For tasks with unbounded metrics, apply a sigmoid or similar saturating transformation.
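The two normalizations described above can be sketched as follows; the function names and the `scale` parameter are illustrative choices, not part of TreeQuest:

```python
import math

def normalize_discrete(passed: int, total: int) -> float:
    """Discrete metric -> [0, 1]: fraction of test cases passed."""
    return passed / total

def normalize_unbounded(x: float, scale: float = 1.0) -> float:
    """Unbounded metric -> (0, 1) via a logistic squash.

    `scale` controls how quickly the score saturates; pick it so that
    typical raw values land in the responsive middle of the curve.
    """
    return 1.0 / (1.0 + math.exp(-x / scale))
```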

10 Stateless Design and Immutability

10.1 Immutable Data Structures

TreeQuest represents the search tree using immutable (frozen) data structures. Every operation that modifies the tree -- adding a node, updating a score, recording a Thompson Sampling observation -- returns a new tree state rather than mutating the existing one. This design has several advantages:

  • Checkpointing: Any tree state can be serialized and stored. To resume search, deserialize the state and continue calling ask/tell.
  • Debugging: The full history of tree states can be retained, enabling post-hoc analysis of search behavior.
  • Concurrency: Immutable structures are inherently thread-safe. Multiple threads can read the same tree state without locks.
  • Reproducibility: Given the same initial state and random seed, the search trajectory is fully deterministic.
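The pattern behind all four properties can be shown with a toy state; `SearchState` and `record_result` are a minimal sketch in the spirit of the design, not TreeQuest's actual types:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SearchState:
    """Toy immutable search state (illustrative, not TreeQuest's)."""
    iteration: int
    best_score: float

def record_result(state: SearchState, score: float) -> SearchState:
    """Every 'mutation' returns a fresh state; the old one is untouched,
    so any snapshot can be retained, serialized, or resumed from."""
    return replace(
        state,
        iteration=state.iteration + 1,
        best_score=max(state.best_score, score),
    )

s0 = SearchState(iteration=0, best_score=0.0)
s1 = record_result(s0, 0.4)
s2 = record_result(s1, 0.2)
# s0 is unchanged -- it is still a valid checkpoint of iteration 0.
```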

10.2 Checkpoint and Resume

Long-running searches (e.g., 250 iterations at ~30 seconds per LLM call = ~2 hours) benefit from the ability to checkpoint and resume. If a search is interrupted (machine failure, API quota exhaustion, etc.), it can be resumed from the last checkpoint without losing any work.

Python

import json
from treequest import init_tree, ask_batch, tell, serialize, deserialize

def run_with_checkpoints(problem, checkpoint_path="checkpoint.json"):
    """Run search with periodic checkpointing."""

    # Resume from checkpoint if available
    try:
        with open(checkpoint_path) as f:
            tree_state = deserialize(json.load(f))
            print(f"Resumed from iteration {tree_state.iteration}")
    except FileNotFoundError:
        tree_state = init_tree(method="ab_mcts_a", batch_size=5)
        print("Starting fresh search")

    while tree_state.iteration < 250:
        trials = ask_batch(tree_state)
        results = evaluate_batch(trials, problem)  # user-defined: generate + score each trial
        tree_state = tell(tree_state, results)

        # Checkpoint every 10 iterations
        if tree_state.iteration % 10 == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(serialize(tree_state), f)
            print(f"Checkpoint saved at iteration {tree_state.iteration}")

    return tree_state

10.3 State Representation

The tree state encapsulates all information needed to continue the search:

[!info]- TreeState Data Structure

Python

```
@dataclass(frozen=True)
class TreeState:
    """Complete, immutable state of an AB-MCTS search."""

    # Tree structure
    nodes: tuple[Node, ...]            # All nodes in the tree
    edges: tuple[Edge, ...]            # Parent-child relationships
    root_id: str                       # Root node identifier

    # Search state
    iteration: int                     # Current iteration number
    best_score: float                  # Best score found so far
    best_node_id: str                  # Node with best score

    # Thompson Sampling state
    bandit_state: BanditState          # Alpha/beta counts per arm

    # Configuration
    method: str                        # "ab_mcts_a", "ab_mcts_m", "multi_llm"
    batch_size: int                    # Parallel trial count
    max_iterations: int                # Total budget
    random_state: int                  # RNG state for reproducibility

    # Pending trials (outstanding ask_batch calls)
    pending_trials: tuple[Trial, ...]  # Trials awaiting tell()


@dataclass(frozen=True)
class Node:
    node_id: str
    state: Any                # User-defined solution state
    score: float              # Evaluation score in [0, 1]
    depth: int                # Depth in tree (0 = root children)
    action: str               # "depth" or "width"
    model: Optional[str]      # Which LLM generated this
    visit_count: int          # Number of times refined
    subtree_best: float       # Best score in subtree
```

11 Benchmark: ARC-AGI-2 Results

11.1 ARC-AGI-2 Overview

ARC-AGI-2 (Abstraction and Reasoning Corpus, version 2) is a benchmark of abstract visual reasoning tasks. Each task presents a set of input-output grid examples, and the solver must infer the transformation rule and apply it to a test input. The tasks require forms of reasoning (symmetry detection, counting, geometric transformations, relational reasoning) that are challenging for current LLMs.

The benchmark's 120 public evaluation tasks provide a rigorous test of inference-time scaling, as many tasks are solvable only through extensive search and the combination of multiple reasoning strategies.

11.2 Experimental Configuration

| Parameter | Value | Notes |
|---|---|---|
| Task set | 120 public eval tasks | ARC-AGI-2 public evaluation set |
| Max iterations | 250 | Total generation budget per task |
| Batch size | 5 | Parallel trials per iteration |
| Score normalization | [0, 1] | Fraction of correct pixels |
| Pass criterion | Score = 1.0 | Perfect solution required |
| Rule selection | Top 2 solutions | Pass@2 for final submission |

11.3 Results Summary

| Method | Model(s) | Pass@250 | Pass@2 (rule selection) |
|---|---|---|---|
| Repeated Sampling | o4-mini | 23.0% | -- |
| AB-MCTS-A | o4-mini | 27.5% | -- |
| AB-MCTS-M | o4-mini | ~28% | -- |
| Multi-LLM AB-MCTS | o4-mini + Gemini + DeepSeek | >30% | 19.2% |

11.4 Analysis: Where Multi-LLM Wins

The most striking result is that Multi-LLM AB-MCTS solves problems that no single LLM can solve in isolation, even with the same total compute budget. This is not simply a matter of increased diversity. Detailed analysis of the search trees reveals several patterns:

  • Complementary reasoning: One model identifies the correct transformation type (e.g., "this is a rotation"), while another model implements it correctly (e.g., producing the correct pixel-level output).
  • Error correction: When one model produces a nearly-correct solution with a systematic error, a different model can sometimes identify and fix the error during a refinement step, whereas the original model tends to reproduce the same error.
  • Diverse initialization: Different models, trained on different data with different architectures, produce different distributions of initial solutions. This provides better coverage of the solution space.

Key Insight: The improvement from 23% (repeated sampling) to >30% (Multi-LLM AB-MCTS) represents a 30%+ relative improvement. More importantly, the Pass@2 metric (19.2%) demonstrates that the search not only finds correct solutions but also reliably ranks them -- the top-2 solutions selected from 250 candidates capture nearly two-thirds of the solvable tasks.

11.5 Scaling Behavior

The Pass@N curves reveal diminishing but persistent returns to additional compute. Repeated sampling's Pass@N curve flattens quickly (most easily solvable problems are found within the first 50 attempts). AB-MCTS maintains a steeper slope because the search algorithm directs compute toward promising regions, making each additional iteration more productive than a random sample.

[!info]- Scaling Curve Analysis The scaling behavior can be understood through the lens of search efficiency. Define search efficiency at iteration t as the probability that iteration t solves a previously unsolved problem:

$$ \mathrm{efficiency}(t) = P(\text{solved at iteration } t \mid \text{not solved before } t) $$

For repeated sampling, this probability decreases monotonically: the easy problems are found first, leaving only harder problems that have low per-sample solve probability. For AB-MCTS, the efficiency can increase over time as the algorithm accumulates information and focuses on promising directions. This non-monotonic efficiency is what enables the steeper scaling curve.

In the Multi-LLM setting, the efficiency benefits from an additional effect: as the algorithm learns which models are effective for which action types, it allocates compute more efficiently, further improving the scaling slope relative to the single-model variants.
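The hazard-rate argument can be made concrete with a toy Pass@N computation; the per-trial success rates below are illustrative only, not taken from the paper:

```python
# Toy Pass@N model: with independent sampling the per-trial success
# probability is constant, while an adaptive searcher's can grow as it
# accumulates information. Rates here are illustrative only.
def pass_at_n(p_fn, n):
    """P(solved within n trials), given per-trial success prob p_fn(t)."""
    fail = 1.0
    for t in range(1, n + 1):
        fail *= 1.0 - p_fn(t)
    return 1.0 - fail

flat = pass_at_n(lambda t: 0.01, 250)                    # repeated sampling
rising = pass_at_n(lambda t: min(0.05, 0.002 * t), 250)  # efficiency grows
assert rising > flat
```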

12 Configuration and Hyperparameters

12.1 Hyperparameter Reference

| Parameter | Recommended | Range | Notes |
|---|---|---|---|
| batch_size | 5 | 1-10 | ≤5 recommended for best adaptive behavior |
| max_iterations | 250 | 50-1000 | Task-dependent; diminishing returns beyond 500 |
| Score range | [0, 1] | [0, 1] | Mandatory normalization |
| Prior (alpha, beta) | (1, 1) | (0.5, 0.5) - (2, 2) | Uniform prior; (0.5, 0.5) = Jeffreys prior |
| Depth temperature | 0.7-1.0 | 0.0-2.0 | LLM temperature for refinement prompts |
| Width temperature | 1.0-1.2 | 0.0-2.0 | Higher temperature for diversity in new solutions |

12.2 Batch Size Trade-offs

The batch_size parameter controls how many trials are generated in parallel at each step. Smaller batches let the Thompson Sampling posterior update more frequently (after each small batch), yielding more adaptive behavior. Larger batches amortize API call overhead and latency but delay the feedback loop.

The recommended value of batch_size ≤ 5 reflects empirical findings on ARC-AGI-2. With batch sizes larger than 5, the algorithm generates too many trials between Thompson Sampling updates, reducing the adaptive advantage over repeated sampling.
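The feedback-frequency trade-off is easy to quantify: with a fixed trial budget, the number of Thompson Sampling updates is simply budget // batch_size, since each batch triggers one tell():

```python
# One tell() call -- hence one posterior update -- per batch: smaller
# batches buy a tighter feedback loop from the same trial budget.
def posterior_updates(budget: int, batch_size: int) -> int:
    return budget // batch_size

for bs in (1, 5, 25):
    print(f"batch_size={bs}: {posterior_updates(250, bs)} updates over 250 trials")
```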

12.3 Score Normalization Strategies

Proper score normalization is critical for Thompson Sampling to function correctly. Several normalization strategies are applicable depending on the task:

Python

import numpy as np

def normalize_discrete(correct: int, total: int) -> float:
    """Normalize discrete metrics (e.g., test cases passed)."""
    return correct / total if total > 0 else 0.0

def normalize_pixel(predicted_grid, target_grid) -> float:
    """Normalize pixel-level accuracy for ARC-AGI tasks."""
    if predicted_grid.shape != target_grid.shape:
        return 0.0  # Wrong grid dimensions
    matches = (predicted_grid == target_grid).sum()
    total = target_grid.size
    return matches / total

def normalize_unbounded(raw_score: float, midpoint: float = 100.0) -> float:
    """Sigmoid normalization for unbounded metrics."""
    return 1.0 / (1.0 + np.exp(-(raw_score - midpoint) / midpoint))

13 Implementation Walkthrough

13.1 Complete ARC-AGI-2 Example

The following example demonstrates a complete AB-MCTS search on an ARC-AGI-2 task, using the Multi-LLM variant with three models.

Python

import os

import numpy as np
from treequest import init_tree, ask_batch, tell, best
from treequest.models import MultiLLMConfig

# Define the model pool
model_config = MultiLLMConfig(
    models={
        "o4-mini": {
            "provider": "openai",
            "api_key": os.environ["OPENAI_API_KEY"],
            "temperature": 0.8,
        },
        "gemini-2.5-pro": {
            "provider": "google",
            "api_key": os.environ["GOOGLE_API_KEY"],
            "temperature": 1.0,
        },
        "deepseek-r1-0528": {
            "provider": "deepseek",
            "api_key": os.environ["DEEPSEEK_API_KEY"],
            "temperature": 0.9,
        },
    }
)

# Define the ARC-AGI-2 scoring function
def arc_scorer(predicted_grid, target_grid):
    """Score an ARC-AGI-2 solution by pixel-level accuracy."""
    if predicted_grid is None:
        return 0.0
    if predicted_grid.shape != target_grid.shape:
        return 0.0
    return float((predicted_grid == target_grid).sum()) / target_grid.size

# Initialize the search tree
tree = init_tree(
    method="multi_llm",
    batch_size=5,
    max_iterations=250,
    model_config=model_config,
)

# Load the ARC task
task = load_arc_task("task_001")

# Run the search (make_fresh_arc_prompt, make_refinement_arc_prompt, and
# parse_arc_output are user-defined helpers)
solved = False
for iteration in range(50):  # 50 batches x 5 = 250 trials
    trials = ask_batch(tree)

    results = []
    for trial in trials:
        # Generate solution using assigned model
        if trial.action == "width":
            prompt = make_fresh_arc_prompt(task)
        else:
            prompt = make_refinement_arc_prompt(
                task, trial.parent_state, trial.parent_score
            )

        model = model_config.get_client(trial.model_hint)
        response = model.generate(prompt)
        predicted_grid = parse_arc_output(response)

        score = arc_scorer(predicted_grid, task.test_output)
        results.append((trial.trial_id, predicted_grid, score))

        if score == 1.0:
            print(f"Perfect solution found at iteration {iteration}!")
            solved = True
            break

    # Record the batch (including a final partial one), then stop if solved
    tree = tell(tree, results)
    if solved:
        break

# Get the best solution(s) for submission
top_solutions = best(tree, k=2)  # Top 2 for Pass@2 submission

13.2 Custom Problem Integration

TreeQuest can be applied to any problem where solutions can be generated by LLMs and evaluated with a scalar score. The following template shows how to integrate a custom problem:

[!info]- Custom Problem Template

Python

```
from treequest import init_tree, step

def solve_custom_problem(problem_description: str):
    """Solve a custom problem using AB-MCTS."""

    # Define generation function
    def generate(trial):
        if trial.action == "width":
            prompt = f"Solve this problem:\n{problem_description}"
        else:
            prompt = (
                f"Here is a previous attempt (score: {trial.parent_score}):\n"
                f"{trial.parent_state}\n\n"
                f"Improve this solution for:\n{problem_description}"
            )
        response = llm.generate(prompt)
        return response

    # Define scoring function
    def evaluate(solution):
        # Return a score in [0, 1]
        tests_passed = run_tests(solution)
        return tests_passed / total_tests

    # Initialize and run
    tree = init_tree(method="ab_mcts_a", batch_size=5)
    for _ in range(250):
        tree = step(tree, generate, evaluate)

    return tree.best_solution
```

14 Theoretical Foundations

14.1 Connection to Classical MCTS

AB-MCTS inherits the basic structure of Monte Carlo Tree Search but adapts it significantly for the LLM inference-time scaling setting. In classical MCTS (as used in game-playing, e.g., AlphaGo), the tree represents game states and actions, and the search discovers promising action sequences. In AB-MCTS, the tree represents solutions and refinement operations, and the search discovers promising solution-refinement sequences.

| Aspect | Classical MCTS | AB-MCTS |
|---|---|---|
| Nodes | Game states | Candidate solutions |
| Actions | Game moves | Depth (refine) or Width (new) |
| Rollout | Random playout to terminal state | LLM generation + scoring |
| Backpropagation | Win/loss statistics up the tree | Score improvement signals to Thompson Sampling |
| Selection | UCB1 or PUCT formula | Thompson Sampling from Beta posteriors |
| Expansion | One child per legal move | One child per LLM generation |

14.2 Regret Bounds

Thompson Sampling for the Bernoulli bandit problem has well-established regret bounds. For K arms over T rounds, the expected regret of Thompson Sampling satisfies:

$$ \mathbb{E}[R(T)] \le O\!\left(\sqrt{KT \log T}\right) $$

In the AB-MCTS context, K = 2 (depth and width) for the single-model variant, or K = 2|M| for the multi-model variant, where |M| is the number of models. Because this bound is sublinear in T, the average per-round regret vanishes, and the algorithm converges quickly to the optimal action distribution.
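The sublinear growth can be observed empirically with a small Bernoulli Thompson Sampling simulation (a generic bandit sketch, not TreeQuest code; arm probabilities are illustrative):

```python
# Generic Bernoulli Thompson Sampling simulation (illustrative; not
# TreeQuest code). Cumulative regret grows sublinearly, so the
# per-round average regret shrinks as the horizon T grows.
import random

def thompson_regret(probs, T, seed=0):
    """Run Thompson Sampling for T rounds; return cumulative regret."""
    rng = random.Random(seed)
    K = len(probs)
    alpha = [1.0] * K   # Beta posterior success counts (+1 prior)
    beta = [1.0] * K    # Beta posterior failure counts (+1 prior)
    best = max(probs)
    regret = 0.0
    for _ in range(T):
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        k = samples.index(max(samples))           # play the sampled-best arm
        reward = 1 if rng.random() < probs[k] else 0
        alpha[k] += reward
        beta[k] += 1 - reward
        regret += best - probs[k]
    return regret

avg_500 = thompson_regret([0.3, 0.6], 500) / 500
avg_2000 = thompson_regret([0.3, 0.6], 2000) / 2000
assert avg_2000 < avg_500  # per-round average regret shrinks with T
```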

However, the standard regret analysis assumes stationary reward distributions. In AB-MCTS, the reward distributions are non-stationary: the value of depth vs. width changes as the tree grows and the best known solution improves. This non-stationarity can be partially addressed by using windowed Thompson Sampling (only counting recent observations) or by applying a discount factor to older observations.

14.3 Exploration-Exploitation in the Search Tree

The exploration-exploitation trade-off in AB-MCTS operates at two levels:

  1. Action level: Thompson Sampling balances depth vs. width (and model selection in the multi-LLM variant). This is the primary adaptive mechanism.
  2. Node level: When the algorithm decides to go deeper, it must select which node to refine. This is a secondary bandit problem where each node is an arm. Nodes with high scores and low visit counts are preferred (analogous to UCB-based selection in classical MCTS).

Theoretical Observation: The two-level bandit structure of AB-MCTS creates a hierarchical exploration-exploitation problem. The outer level (depth vs. width) determines the type of exploration, while the inner level (node selection) determines the location. This hierarchical structure is a key architectural advantage over flat search strategies like repeated sampling.
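A UCB-style sketch of the inner, node-level selection follows (illustrative only; this is not TreeQuest's actual selection rule):

```python
# UCB-style node-level selection sketch: prefer nodes with high scores
# and low visit counts (illustrative; not TreeQuest's actual rule).
import math

def select_refinement_node(nodes, c=0.5):
    """nodes: list of (score, visit_count) pairs; return index to refine."""
    total_visits = sum(v for _, v in nodes) + 1
    def ucb(item):
        score, visits = item[1]
        return score + c * math.sqrt(math.log(total_visits) / (visits + 1))
    return max(enumerate(nodes), key=ucb)[0]

# A promising, barely-visited node beats a slightly better, well-explored one.
assert select_refinement_node([(0.8, 20), (0.75, 1), (0.3, 2)]) == 1
```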

14.4 Convergence Properties

Given a sufficiently informative scoring function and enough compute budget, AB-MCTS converges to the optimal solution in the following sense: if the optimal solution is reachable via a finite refinement chain from any initial solution, and the scoring function provides monotonically improving signal along this chain, then the probability of finding the optimal solution approaches 1 as the budget T increases.

This convergence guarantee is stronger than for repeated sampling (which requires the optimal solution to be directly reachable in a single generation) but weaker than for exhaustive search (which guarantees finding the optimal solution in finite time regardless of the scoring function's properties).

15 Related Work

15.1 Inference-Time Scaling Methods

| Method | Search Strategy | Adaptive? | Multi-Model? | Key Limitation |
|---|---|---|---|---|
| Best-of-N | Width only (independent samples) | No | No | No information reuse |
| Chain-of-Thought | Depth only (sequential reasoning) | No | No | Single reasoning path |
| Tree-of-Thought | Breadth-first tree search | No (fixed branching) | No | Fixed branching factor |
| Self-Consistency | Width + majority voting | No | No | Requires consensus |
| STaR / Self-Taught Reasoner | Iterative refinement with filtering | Partially | No | Requires correct answer for filtering |
| AB-MCTS | Adaptive depth + width | Yes (Thompson Sampling) | Yes | Requires scalar scoring function |

15.2 Multi-Agent LLM Systems

Several recent systems use multiple LLMs in collaborative or competitive arrangements. AB-MCTS differs from these approaches in that it treats model selection as an optimization problem rather than a fixed architecture:

  • Debate / Negotiation: Models argue for different positions. AB-MCTS instead routes problems to models based on learned competence profiles.
  • Mixture of Experts: A router selects which expert to use for each input. AB-MCTS learns the routing policy online via Thompson Sampling rather than training a separate router.
  • Ensemble methods: Multiple models' outputs are aggregated (e.g., majority vote). AB-MCTS instead uses sequential refinement where one model can build on another's work.

15.3 Relation to Evolutionary Algorithms

AB-MCTS shares conceptual similarities with evolutionary algorithms (EAs). The width action is analogous to mutation (generating new individuals), while the depth action is analogous to local search or hill climbing. The Thompson Sampling mechanism serves a similar role to adaptive operator selection in EAs, adjusting the mutation-to-local-search ratio based on observed fitness improvements.

However, AB-MCTS differs from traditional EAs in that it does not maintain a fixed population size, does not use crossover (combining parts of different solutions), and uses LLMs as the generation engine rather than domain-specific mutation operators. The tree structure also provides a richer representation of the search history than a flat population.

16 Limitations and Future Directions

16.1 Current Limitations

  • Scoring function requirement: AB-MCTS requires a scalar scoring function that provides meaningful signal in [0, 1]. For tasks without clear partial credit (e.g., open-ended creative writing), designing such a function is non-trivial.
  • LLM call overhead: Each iteration requires at least one LLM API call. For models with high latency or cost (e.g., GPT-4-level models), the 250-iteration budget may be prohibitively expensive for some applications.
  • Prompt engineering dependency: The effectiveness of depth (refinement) actions depends heavily on the quality of the refinement prompt. Poor prompts may cause the LLM to ignore the parent solution or copy it verbatim.
  • Non-stationarity: The theoretical regret bounds assume stationary reward distributions, but the true distributions shift as the tree grows. While Thompson Sampling is empirically robust to mild non-stationarity, extreme distribution shift could degrade performance.
  • No crossover: Unlike evolutionary algorithms, AB-MCTS does not combine elements from different solution branches. Adding a crossover-like operation (e.g., asking the LLM to merge the best aspects of two different solutions) is a natural extension.

16.2 Future Directions

[!info]- Crossover and Recombination A natural extension is to add a "crossover" action that combines elements from two different subtrees. The LLM would receive two solutions and be asked to combine their best aspects. This would add a fourth action to the Thompson Sampling model (depth, width, crossover, possibly model selection), increasing the action space but potentially enabling richer search dynamics.

Python

```
def crossover_action(tree, model):
    """Generate a solution by combining two existing ones."""
    node_a = select_node(tree, strategy="best")
    node_b = select_node(tree, strategy="diverse")

    prompt = (
        f"Here are two solution approaches:\n\n"
        f"Approach A (score {node_a.score:.2f}):\n{node_a.state}\n\n"
        f"Approach B (score {node_b.score:.2f}):\n{node_b.state}\n\n"
        f"Combine the best aspects of both approaches."
    )
    return model.generate(prompt)
```

[!info]- Hierarchical and Nested Search For problems with natural decomposition (e.g., multi-step reasoning), AB-MCTS could be applied hierarchically: a top-level search decomposes the problem into subproblems, and each subproblem is solved by a nested AB-MCTS search. The top-level tree's scoring function would aggregate sub-problem scores.

[!info]- Learned Scoring Functions When ground-truth evaluation is expensive or unavailable, an LLM-based scoring function (reward model) could replace the external evaluator. This creates a fully self-contained system but introduces the risk of reward hacking. Careful calibration and periodic ground-truth validation would be needed.

[!info]- Non-Stationary Thompson Sampling Discounted Thompson Sampling or sliding-window Thompson Sampling could address the non-stationarity issue. In discounted TS, older observations are geometrically down-weighted:

$$ \alpha_t = \gamma \, \alpha_{t-1} + \mathrm{reward}_t, \qquad \gamma \in (0, 1) $$

This allows the algorithm to adapt more quickly when the reward distribution shifts (e.g., when the search enters a new region of the solution space).
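A minimal sketch of that update in code (decaying both Beta counts, a common variant; the exact scheme used in practice may differ):

```python
# Discounted Beta-Bernoulli update: geometric down-weighting of old
# observations lets the posterior track a drifting reward distribution.
def discounted_update(alpha, beta, reward, gamma=0.9):
    """One discounted Thompson Sampling update for a Bernoulli arm."""
    alpha = gamma * alpha + reward
    beta = gamma * beta + (1 - reward)
    return alpha, beta

alpha, beta = 1.0, 1.0
for r in [1, 1, 1, 0, 0, 0, 0, 0]:   # the arm degrades over time
    alpha, beta = discounted_update(alpha, beta, r)
assert beta > alpha                   # posterior now favors failure
```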

[!info]- Integration with AlphaEvolve-Style Code Evolution AB-MCTS could serve as the search backbone for code evolution systems like AlphaEvolve. Rather than using MAP-Elites or simple evolutionary strategies for program search, AB-MCTS's adaptive branching could provide more efficient exploration of the program space. The Multi-LLM variant is particularly relevant here, as different models may excel at different aspects of code generation (e.g., algorithmic design vs. implementation details vs. optimization).

17 References

  1. Sakana AI. "AB-MCTS: Adaptive Branching Monte Carlo Tree Search for Multi-LLM Inference-Time Scaling." arXiv:2503.04412, March 2025.
  2. Sakana AI. TreeQuest: Open-source library for AB-MCTS. github.com/SakanaAI/treequest. Apache 2.0 License.
  3. Chollet, F. "On the Measure of Intelligence." arXiv:1911.01547, 2019. (ARC benchmark definition.)
  4. Chollet, F. et al. "ARC Prize." arcprize.org, 2024-2025. (ARC-AGI-2 evaluation framework.)
  5. Thompson, W.R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 25(3-4):285-294, 1933.
  6. Agrawal, S. and Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012.
  7. Browne, C. et al. "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.
  8. Silver, D. et al. "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587):484-489, 2016.
  9. Yao, S. et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS, 2023.
  10. Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022.
  11. Wang, X. et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR, 2023.
  12. Zelikman, E. et al. "STaR: Bootstrapping Reasoning With Reasoning." NeurIPS, 2022.
  13. Snell, C. et al. "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314, 2024.
  14. Google DeepMind. "AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms." Technical Report, May 2025.

AB-MCTS / TreeQuest -- PhD-Level Technical Report | Generated March 2026 | Based on arXiv:2503.04412 and the TreeQuest open-source library

Back to Index