Paper: arXiv:2603.20278

Introduced: 2026-03

Score: 8.12/10 (Draft)

Chapter 43

OpenResearcher

Part P07: Autonomous Research Systems

43.1 Overview & Motivation

Training autonomous deep research agents—systems that iteratively search, browse documents, and reason across many steps—demands a fundamentally different class of training data than traditional question-answering or retrieval-augmented generation. Such agents must learn to formulate queries, navigate information hierarchies, locate evidence within documents, and synthesize findings over trajectories that may span dozens or even hundreds of tool calls. The central bottleneck is not model architecture or training compute but the availability of high-quality, long-horizon research trajectories from which models can learn these behaviors.

Prior to OpenResearcher, three interrelated problems made trajectory generation at scale impractical. First, cost: live web search APIs charge per query, and at the scale needed for comprehensive trajectory generation, API fees become substantial (see Section 43.6.1 for a detailed cost derivation). Second, instability: the live web changes constantly, making trajectory synthesis non-reproducible and undermining controlled scientific analysis. Third, analytical opacity: without a fixed corpus and stable gold-document annotations, it is impossible to determine when relevant evidence is surfaced, opened, or missed—rendering ablation studies and causal analysis infeasible.

Key Contribution

OpenResearcher introduces a fully open pipeline for deep research trajectory synthesis that decouples one-time online corpus bootstrapping from fully offline, reproducible trajectory generation. According to the authors (Li et al., arXiv:2603.20278), a 30B-parameter Mixture-of-Experts student model trained via supervised fine-tuning on ~55K offline trajectories achieves 54.8% on BrowseComp-Plus—reported as surpassing GPT-4.1 (+18.4 points), Claude-4-Opus (+18.0 points), and all other evaluated systems—while generalizing to live-web benchmarks (64.1% GAIA, 26.3% BrowseComp) despite never having been exposed to live search during training. The entire pipeline—code, corpus, trajectories, and model checkpoints—is released as open-source.

OpenResearcher was published in March 2026 by the TIGER-AI Lab, a collaboration spanning Texas A&M University, the University of Waterloo, UC San Diego, Verdent AI, NetMind AI, and Lambda (Li et al., arXiv:2603.20278). The project releases all artifacts: pipeline code on GitHub (TIGER-AI-Lab/OpenResearcher), synthesized trajectories on HuggingFace, model checkpoints, and an interactive demo.

Provenance convention. Throughout this chapter, all benchmark numbers and ablation results are reported by the authors (Li et al., arXiv:2603.20278) unless explicitly marked otherwise. Code examples are algorithmic pseudocode derived from the paper's method descriptions, not excerpts from the repository. Tables and figures are individually labeled with their source: paper-reported (directly from the paper), adapted (restructured from paper data), or survey-author analysis (our own derivation or commentary). Interpretive commentary beyond the paper's analysis is marked with "We note" or "Our interpretation."

43.2 Architecture

The system architecture is organized into four cleanly separated layers: an offline corpus layer, a search engine layer, an agent layer, and a training layer. This separation is the central design principle: the expensive online phase (corpus bootstrapping) runs once, after which all subsequent trajectory generation operates entirely offline at zero marginal search cost.

43.2.1 Design Principles

Decoupling. The most consequential design decision is the clean separation between corpus construction (one-time, requires internet access) and trajectory synthesis (repeatable, fully offline). Once the corpus and FAISS index are built, unlimited trajectory synthesis runs can be performed at zero marginal search cost. This enables experimentation with different teacher models, prompt strategies, and sampling parameters without incurring API fees.

Explicit browsing structure. Rather than treating search as a monolithic retrieval operation, the system exposes three primitives that mirror hierarchical human information-seeking behavior: broad corpus search, focused document reading, and targeted evidence localization. This is a deliberate contrast to systems that rely solely on search-and-snippet retrieval.

Teacher–student distillation. A large teacher model (GPT-OSS-120B) generates high-quality trajectories that are distilled into a much smaller student (30B total parameters, 3B active) via supervised fine-tuning. No reinforcement learning, no online interaction, no curriculum—pure SFT on answer-verified demonstrations.

43.2.2 Released Artifacts and Repository

The open-source release (github.com/TIGER-AI-Lab/OpenResearcher) provides the pipeline code, data, and evaluation resources described in the paper. The following table summarizes the released artifacts as described in arXiv:2603.20278 and linked from the repository README:

Table 43.1 — Released artifacts as described in arXiv:2603.20278 §1 and the repository README. Locations are paper-reported.
Artifact	Location	Description
Pipeline code	GitHub: `TIGER-AI-Lab/OpenResearcher`	Complete synthesis pipeline, retrieval server, evaluation scripts
Offline search engine	GitHub (within repo)	FAISS-indexed corpus and local retrieval server implementation
Synthesized trajectories	HuggingFace: `OpenResearcher/OpenResearcher-Dataset`	97K+ trajectories with full metadata
Model checkpoint	HuggingFace: `OpenResearcher/OpenResearcher-30B-A3B`	Trained student model weights
Interactive demo	HuggingFace Spaces	Live demonstration of the trained agent
QA data	Included in GitHub release	6K processed question-answer pairs from MiroVerse-v0.1
Gold documents	Included in release	~10K bootstrapped gold documents

The paper describes the pipeline as consisting of four functional stages — corpus construction, search engine setup, trajectory synthesis, and student training — each corresponding to dedicated code within the repository. The specific module paths, script names, class definitions, configuration files, and entry points have not been independently audited for this chapter. We describe the algorithms below based on the paper's method sections (§§3.1–3.4 of arXiv:2603.20278); readers seeking actual implementation details — including real function names, command-line interfaces, configuration schemas, and execution instructions — should consult the repository directly.

The technology stack reported in the paper includes: Python (pipeline language), FAISS (dense retrieval indexing), Qwen3-Embedding-8B (document embeddings), the Serper API (one-time gold-document bootstrapping), GPT-OSS-120B (teacher model for trajectory generation), and Megatron-LM (distributed training framework for the student model).

43.3 Core Algorithms

This section describes the four stages of the OpenResearcher pipeline as presented in the paper. All algorithmic descriptions are based on arXiv:2603.20278 §§3.1–3.4. Code blocks are algorithmic pseudocode that illustrate the paper's described methods; they are not sourced from the repository, and actual implementation details (function names, APIs, error handling, parallelization) will differ.

43.3.1 Stage 1: Question Collection from MiroVerse

According to the paper (§3.1), OpenResearcher draws its question set from a 10% random sample of the MiroVerse-v0.1 dataset, yielding approximately 6,000 QA instances. The authors state that the selection criterion requires questions demanding long-horizon, multi-hop reasoning over heterogeneous evidence—standard benchmarks such as 2WikiMultiHopQA or Natural Questions are explicitly rejected as too shallow. The paper reports that even a strong teacher model requires dozens of tool calls for these questions, with a substantial tail exceeding 100 calls.

Critically, partial trajectories from MiroVerse are discarded. The authors found that trajectory quality in the source dataset was inconsistent, so all trajectories are regenerated from scratch using only the cleaned QA pairs as seeds.

43.3.2 Stage 2: Offline Corpus Construction

The corpus construction stage is the only phase that requires internet access. Its purpose is to ensure the offline corpus contains sufficient evidence to answer each question. The method uses answer-guided bootstrapping (paper §3.2): for each question-answer pair, a query is formed that combines the question and reference answer to maximize recall of relevant documents.

$$q_{\text{bootstrap}} = \text{concat}(q_{\text{question}}, a_{\text{reference}})$$

Equation 43.1 — Standard definition applied here; the concatenation approach is described in arXiv:2603.20278 §3.2.

where $q_{\text{question}}$ is the original question and $a_{\text{reference}}$ is the reference answer. This concatenation biases search toward documents that contain both the question context and answer-relevant terms. The resulting documents are retrieved via the Serper API, cleaned, and deduplicated, producing approximately 10,000 gold documents for 6,000 questions (~1.67 gold documents per question on average).

These gold documents are then merged with 15 million distractor documents sampled from FineWeb, creating a corpus of ~15.01 million documents containing approximately 10 trillion tokens (paper-reported). The gold-to-distractor ratio is approximately 1:1,500, which the authors argue approximates web-scale search difficulty: the teacher model must locate relevant documents among vastly more irrelevant ones.

The entire corpus is embedded using Qwen3-Embedding-8B and indexed in FAISS for dense retrieval. A critical design constraint stated in the paper is that gold documents are used only for corpus construction, never during trajectory synthesis—the teacher model must independently find evidence through its own search queries.
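To make the dense-retrieval step concrete, the sketch below ranks a toy corpus by cosine similarity between a query vector and document vectors. It is a minimal illustration under stated assumptions: the three-dimensional vectors are invented stand-ins for Qwen3-Embedding-8B outputs, the brute-force scan stands in for a FAISS index, and `normalize` and `top_k` are hypothetical helpers rather than functions from the repository.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query (brute force)."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, doc_id))
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy embeddings (invented): one gold document among two distractors.
DOC_VECS = {
    "gold": [0.9, 0.1, 0.0],
    "distractor_a": [0.0, 1.0, 0.0],
    "distractor_b": [0.1, 0.2, 0.9],
}

ranked = top_k([1.0, 0.0, 0.0], DOC_VECS, k=2)
print(ranked)
```

At the scale described in the paper (roughly one gold document per 1,500 distractors), the same ranking logic applies, but an index structure replaces the linear scan.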

Bootstrap query count and cost. There is an unresolved discrepancy between the paper's algorithm description and its reported cost. The algorithm as described in §3.2 constructs a single concatenated query per QA pair, implying 6,000 Serper API calls. At Serper's rate of $1 per 1,000 queries, this would cost $6. However, the paper reports the bootstrapping cost as approximately $60 (§4), implying approximately 60,000 API calls—roughly 10 per question. Three possible explanations, none confirmed in the paper:

Multiple query variants per question: The implementation may decompose each answer into sub-components and issue multiple queries (e.g., different answer fragments, paraphrased questions) to maximize gold-document recall.
Pagination: Each logical query may require multiple API calls if results span several pages.
Additional retrieval passes: The pipeline may include secondary retrieval rounds (e.g., using initially retrieved documents to formulate follow-up queries) not fully described in the simplified algorithm.

The 10× discrepancy between the described algorithm (1 query/question) and the reported cost (~10 queries/question) is a concrete detail that the repository implementation would resolve. See Section 43.6.1 for the cost implications.
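The arithmetic behind this discrepancy can be reproduced in a few lines. The sketch uses only the figures quoted above (about 6,000 questions, $1 per 1,000 queries, about $60 reported) and assumes a flat per-query price with no free tier or volume discount; the constant and function names are illustrative.

```python
SERPER_USD_PER_QUERY = 1 / 1000   # rate quoted above: $1 per 1,000 queries
NUM_QUESTIONS = 6_000             # approximate, paper-reported
REPORTED_COST_USD = 60            # approximate, paper-reported

def implied_queries(cost_usd, usd_per_query=SERPER_USD_PER_QUERY):
    """Number of API calls a given spend would buy at a flat per-query rate."""
    return round(cost_usd / usd_per_query)

# Cost if the pipeline issued exactly one query per question, as described in §3.2.
described_cost = NUM_QUESTIONS * SERPER_USD_PER_QUERY

# Query volume implied by the reported spend.
total_queries = implied_queries(REPORTED_COST_USD)
queries_per_question = total_queries / NUM_QUESTIONS

print(f"cost at 1 query/question: ${described_cost:.2f}")
print(f"queries implied by ${REPORTED_COST_USD}: {total_queries:,}")
print(f"implied queries per question: {queries_per_question:.1f}")
```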

# ALGORITHMIC PSEUDOCODE — illustrates the corpus construction method
# described in arXiv:2603.20278, §3.2. NOT sourced from the repository.
# For the actual implementation, see github.com/TIGER-AI-Lab/OpenResearcher.

import faiss  # assumed installed (faiss-cpu or faiss-gpu)
import numpy as np


def bootstrap_corpus(qa_pairs, serper_client, fineweb_sampler, embedder):
    """One-time online phase: build offline corpus with answer-guided retrieval.

    Paper §3.2 describes concatenating question + answer for retrieval.
    The exact number of queries per question is ambiguous; the reported
    ~$60 cost implies ~10 queries/question rather than the 1 described.

    serper_client, fineweb_sampler, embedder (standing in for
    Qwen3-Embedding-8B), and clean_and_deduplicate are hypothetical
    interfaces, not names taken from the repository.

    Paper-reported statistics:
      Gold documents: ~10K | Distractors: 15M | Ratio: ~1:1,500
      Total tokens: ~10 trillion | Bootstrapping cost: ~$60
    """
    # Step 1: Answer-guided retrieval of gold documents (~6,000 QA pairs)
    gold_docs = []
    for question, answer in qa_pairs:
        bootstrap_query = f"{question} {answer}"  # Equation 43.1
        # Paper: Serper API for one-time gold document retrieval
        # Actual implementation may issue multiple query variants per question
        results = serper_client.search(bootstrap_query, num_results=10)
        gold_docs.extend(clean_and_deduplicate(results))

    # Step 2: Sample distractor corpus from FineWeb (publicly available dataset)
    distractor_docs = fineweb_sampler.sample(n=15_000_000)

    # Step 3: Merge, embed, and index
    full_corpus = gold_docs + distractor_docs  # ~15.01M documents
    embeddings = np.asarray(embedder.encode(full_corpus), dtype="float32")
    # Flat inner-product index chosen for illustration only; the paper
    # does not state which FAISS index type is used.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    return full_corpus, index

43.3.3 Stage 3: Trajectory Synthesis via ReAct Agent

The trajectory synthesis stage is the heart of the pipeline. A large teacher model (GPT-OSS-120B; Agarwal et al., 2025) operates in a ReAct-style loop (Yao et al., 2022), interleaving chain-of-thought reasoning with tool calls against the offline search engine. Each trajectory is a sequence of reasoning-action-observation triples (paper §3.3):

$$H_T = \{(q, s, m), (r_1, a_1, o_1), (r_2, a_2, o_2), \ldots, (r_T, a_{\text{final}})\}$$

Equation 43.2 — Trajectory structure as described in arXiv:2603.20278 §3.3. Standard ReAct formulation (Yao et al., 2022) applied to the browsing agent.

where $q$ is the original question, $s$ is the system prompt, $m$ is tool metadata, and at each step $t$: $r_t$ is a reasoning chain-of-thought, $a_t$ is a tool action selected from $\{\texttt{search}, \texttt{open}, \texttt{find}\}$, and $o_t$ is the observation returned by the environment. The final step produces a conclusive answer $a_{\text{final}}$ rather than a tool call.

The policy at each step is conditioned on the full trajectory history:

$$r_t, a_t \sim \pi(\cdot \mid H_{t-1})$$

$$o_t = \mathcal{E}(a_t)$$

$$H_t = H_{t-1} \cup \{(r_t, a_t, o_t)\}$$

Equations 43.3–43.5 — Standard ReAct policy notation. $\pi$ is the teacher model (GPT-OSS-120B), $\mathcal{E}$ is the offline environment.

where $\pi$ is the teacher model's policy (GPT-OSS-120B) and $\mathcal{E}$ is the offline environment (the local retrieval server). The teacher has no access to the reference answer during synthesis—it must recover the answer through its own search and reasoning.
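One way to hold the trajectory structure of Equation 43.2 in code is a pair of small record types. The sketch below is an assumed representation, not the schema of the released dataset: the class names, field names, and the `tool_call_counts` helper are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    reasoning: str      # r_t: chain-of-thought before the action
    tool: str           # a_t: one of "search", "open", "find"
    argument: str       # query, URL, or exact string passed to the tool
    observation: str    # o_t: what the environment returned

@dataclass
class Trajectory:
    question: str       # q
    system_prompt: str  # s
    tool_metadata: str  # m
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None  # a_final; None if the budget ran out

    def tool_call_counts(self) -> Counter:
        """Count tool calls by type, e.g. to separate search calls from the total."""
        return Counter(step.tool for step in self.steps)

traj = Trajectory("Who founded X?", "You are a research agent.", "search/open/find")
traj.steps.append(Step("Start broad.", "search", "X founder", "3 results"))
traj.steps.append(Step("Read the top hit.", "open", "doc_1", "full text"))
traj.steps.append(Step("Locate the name.", "find", "founded by", "1 match"))
traj.final_answer = "Jane Doe"

counts = traj.tool_call_counts()
print(counts["search"], sum(counts.values()))
```

Keeping the history as an append-only list mirrors the update $H_t = H_{t-1} \cup \{(r_t, a_t, o_t)\}$: each step adds one triple and never rewrites earlier ones.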

The Three Browser Primitives

The tool space consists of exactly three primitives, designed to model the hierarchical nature of human browsing (paper §3.3):

Table 43.2 — Browser primitives as described in arXiv:2603.20278 §3.3. The exact value of Top-K is not specified in the paper.
Tool	Input	Output	Information Scale
`search(query)`	Natural language query	Top-K results (title, URL, snippet)	Corpus → Document candidates
`open(url)`	Document URL	Full document text	Document → Full content
`find(string)`	Exact string	Matching passages + context	Content → Evidence

Each tool narrows the information scope by one level. The find tool is particularly critical for named-entity lookup and factual verification—tasks where scanning long documents entirely in-context is unreliable even for large language models. The design rationale is that explicit evidence localization reduces the model's reliance on implicit reasoning over long contexts. The paper does not specify the exact value of $K$ for search retrieval results; the pseudocode below uses a placeholder, but the actual value is an implementation detail available in the repository.
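The narrowing from corpus to document to evidence can be illustrated with a toy in-memory environment. This is a sketch under several assumptions: word overlap replaces dense retrieval, `find` operates on the most recently opened document (the paper's signature takes only a string, but its exact scope is not described here), and `ToyBrowser`, its `top_k` default, and the two sample documents are invented for illustration.

```python
class ToyBrowser:
    """In-memory stand-in for the offline retrieval server (illustrative only)."""

    def __init__(self, corpus, top_k=2):
        self.corpus = corpus       # url -> (title, text)
        self.top_k = top_k         # placeholder; the paper does not state K
        self.current_text = None   # document most recently opened

    def search(self, query):
        """Rank documents by shared lowercase words; return (title, url, snippet)."""
        terms = set(query.lower().split())
        scored = []
        for url, (title, text) in self.corpus.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, url, title, text[:60]))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(title, url, snippet) for _, url, title, snippet in scored[: self.top_k]]

    def open(self, url):
        """Return the full text of a document and remember it for find()."""
        self.current_text = self.corpus[url][1]
        return self.current_text

    def find(self, string):
        """Return sentences of the opened document that contain the exact string."""
        if self.current_text is None:
            return []
        return [s.strip() for s in self.current_text.split(".") if string in s]

corpus = {
    "doc://a": ("Bridge history", "The bridge opened in 1932. It was designed by Ada Reyes."),
    "doc://b": ("Cooking notes", "Simmer the stock gently. Season at the end."),
}
browser = ToyBrowser(corpus)
hits = browser.search("who designed the bridge")   # corpus -> candidates
browser.open(hits[0][1])                           # candidate -> full text
evidence = browser.find("designed by")             # full text -> evidence
print(hits[0][1], evidence)
```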

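For each of the approximately 6,000 questions, 16 independent trajectory samples are generated, producing 97,000+ raw trajectories (paper §3.3). Because 16 × 6,000 = 96,000, the reported total implies that the question set is slightly larger than the rounded 6,000 figure. The paper reports the following per-trajectory statistics (see Section 43.5.4 for the full breakdown):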

Average total tool calls (search + open + find): 52.8 per trajectory
Average search calls specifically: 33.6 per trajectory
Maximum total tool calls: 185

The distinction between total tool calls and search calls is critical for cost estimation (Section 43.6.1): in a live-web scenario only search invocations correspond to paid search-API queries, whereas open is a page fetch and find is a local text search over an already fetched document.

# ALGORITHMIC PSEUDOCODE — illustrates the trajectory synthesis loop
# described in arXiv:2603.20278, §3.3. NOT sourced from the repository.
# For the actual agent implementation, see github.com/TIGER-AI-Lab/OpenResearcher.

def generate_trajectory(question, teacher_model, search_engine, max_turns=100):
    """Generate a single deep research trajectory using ReAct loop.

    The teacher model (GPT-OSS-120B) generates reasoning + tool calls.
    The search engine is the offline FAISS-based retrieval server.
    Temperature, sampling parameters, and prompt templates are not
    specified in the paper; they are implementation details in the repo.
    """
    system_prompt = build_system_prompt(tool_metadata=BROWSER_TOOLS)
    trajectory = [(question, system_prompt, BROWSER_TOOLS)]  # (q, s, m) in Equation 43.2

    for turn in range(max_turns):
        reasoning, action = teacher_model.generate(
            context=trajectory,
            tools=["search", "open", "find"]
        )

        if action.type == "final_answer":
            trajectory.append((reasoning, action.answer))
            return trajectory

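        # Execute tool against offline search engine (zero API cost)
        if action.type == "search":
            observation = search_engine.search(action.query)  # top-K; K unspecified
        elif action.type == "open":
            observation = search_engine.open(action.url)
        elif action.type == "find":
            observation = search_engine.find(action.string)
        else:
            # Assumption: a malformed action yields an error observation so that
            # `observation` is always bound; such trajectories are filtered later.
            observation = f"Error: unknown tool {action.type!r}"

        trajectory.append((reasoning, action, observation))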

    return None  # Budget exhausted without conclusive answer


def synthesize_all(qa_pairs, teacher_model, search_engine, samples_per_question=16):
    """Full synthesis pass: 16 samples × ~6K questions → 97K+ trajectories."""
    all_trajectories = []
    for question, _answer in qa_pairs:  # reference answer is never shown to the teacher
        for _ in range(samples_per_question):
            traj = generate_trajectory(question, teacher_model, search_engine)
            if traj is not None:
                all_trajectories.append(traj)
    return all_trajectories  # Paper reports 97K+ raw trajectories

43.3.4 Stage 4: Rejection Sampling and Student Training

The raw trajectories are filtered through multiple quality gates before use in training (paper §3.4). According to the paper, trajectories exceeding the maximum context length, containing malformed tool calls, or failing to reach a conclusive answer are removed. For the primary training set, rejection sampling retains only trajectories whose final answer is correct, reducing the set from 97,000+ to approximately 55,000 trajectories.
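The paper does not state how a final answer is judged correct, so the sketch below substitutes one common choice, normalized exact match, purely for illustration. An LLM judge or fuzzy matcher would slot into `is_correct` in the same way. The function names and the sample answers are hypothetical.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(predicted, reference):
    """Normalized exact match: an assumed criterion, not the paper's."""
    return normalize_answer(predicted) == normalize_answer(reference)

def rejection_sample(candidates, reference):
    """Keep the ids of (trajectory_id, answer) pairs whose answer matches."""
    return [tid for tid, answer in candidates if is_correct(answer, reference)]

kept = rejection_sample(
    [("t1", "The Eiffel Tower."), ("t2", "Tokyo Tower"), ("t3", "eiffel tower")],
    reference="Eiffel Tower",
)
print(kept)
```

The choice of matcher directly changes which trajectories survive, so the ~55K figure depends on a criterion that this chapter cannot confirm.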

# ALGORITHMIC PSEUDOCODE — illustrates the filtering and training pipeline
# described in arXiv:2603.20278, §§3.4 and 4. NOT sourced from the repository.

def filter_trajectories(trajectories, qa_pairs, max_context_tokens=256_000):
    """Multi-stage trajectory filtering as described in paper §3.4.

    qa_pairs is an iterable of (question, answer) tuples, as in the other
    stages. token_count, has_valid_format, reaches_conclusion,
    extract_question, extract_final_answer, and is_correct are
    hypothetical helpers.
    """
    reference_answers = dict(qa_pairs)  # question -> reference answer lookup
    filtered = []
    for traj in trajectories:
        # Gate 1: Context length check
        if token_count(traj) > max_context_tokens:
            continue
        # Gate 2: Format validity (well-formed tool calls, proper structure)
        if not has_valid_format(traj):
            continue
        # Gate 3: Must reach a conclusive final answer
        if not reaches_conclusion(traj):
            continue
        # Gate 4 (rejection sampling): Final answer must be correct
        question = extract_question(traj)
        reference_answer = reference_answers[question]
        predicted_answer = extract_final_answer(traj)
        if is_correct(predicted_answer, reference_answer):  # matching method unspecified
            filtered.append(traj)
    return filtered  # Paper: 97K+ → ~55K after all gates


# Student training configuration (paper §4, all values paper-reported):
TRAINING_CONFIG = {
    "base_model": "NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
    "framework": "Megatron-LM",
    "context_length": 256_000,  # tokens, pre-packed, no truncation
    "learning_rate": 5e-5,      # no warmup, no decay
    "global_batch_size": 64,
    "training_steps": 347,
    "precision": "BF16",
    "hardware": "8x NVIDIA H100",
    "training_time": "~8 hours",
}

The student model (NVIDIA Nemotron-3-Nano-30B-A3B, a Mixture-of-Experts architecture with 30B total parameters but only 3B active per token) is trained via supervised fine-tuning using Megatron-LM. Sequences are pre-packed to 256K tokens with no truncation. This is a deliberate design choice, since truncation would break reasoning chains and teach incomplete patterns.

Table 43.3 — Training hyperparameters. All values paper-reported from arXiv:2603.20278 §4.
Training Parameter	Value
Base model	NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Architecture	30B total parameters, 3B active (MoE)
Training data	~55K trajectories (rejection-sampled)
Context length	256K tokens (pre-packed, no truncation)
Hardware	8× NVIDIA H100 GPUs
Training time	~8 hours
Learning rate	$5 \times 10^{-5}$ (no warmup/decay)
Global batch size	64
Training steps	347
Precision	BF16
RL / online interaction	None

The remarkably short training duration (~8 hours on 8× H100s, only 347 gradient steps) is cited by the authors as evidence that the bottleneck in deep research agent training is data quality, not model training compute.
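The scale of this training run can be bounded with simple arithmetic on the reported hyperparameters. The figures below are survey-author derivations, not paper-reported values. They assume every step consumes a full global batch of fully packed sequences, read 256K as 256,000 tokens (matching the configuration above), and treat the run as a single pass over the packed data.

```python
TRAINING_STEPS = 347
GLOBAL_BATCH_SIZE = 64      # packed sequences per step
CONTEXT_LENGTH = 256_000    # tokens per packed sequence
NUM_TRAJECTORIES = 55_000   # approximate, after rejection sampling
GPU_COUNT = 8
HOURS = 8                   # approximate

# Total packed sequences seen during training.
packed_sequences = TRAINING_STEPS * GLOBAL_BATCH_SIZE

# Upper bound on tokens processed (packing is rarely 100% full).
tokens_upper_bound = packed_sequences * CONTEXT_LENGTH

# Rough compute budget and packing density under the single-pass assumption.
gpu_hours = GPU_COUNT * HOURS
trajectories_per_sequence = NUM_TRAJECTORIES / packed_sequences

print(f"packed sequences: {packed_sequences:,}")
print(f"token upper bound: {tokens_upper_bound / 1e9:.2f}B")
print(f"GPU-hours: {gpu_hours}")
print(f"trajectories per packed sequence: {trajectories_per_sequence:.2f}")
```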

43.4 Empirical Findings and Research Questions

A distinguishing feature of OpenResearcher is its systematic investigation of five targeted research questions (RQ1–RQ5), enabled by the controlled offline environment. The authors describe these as the field's first controlled study of deep research trajectory design. All results in this section are reported by the authors in arXiv:2603.20278 unless otherwise noted. Where we offer interpretive commentary beyond the paper's own analysis, this is marked explicitly.

43.4.1 RQ1: Does Trajectory Correctness Matter for Training?

The conventional assumption is that training data should consist only of successful demonstrations. The authors test this directly by training separate models on correct-only, incorrect-only, and all trajectories:

Table 43.4 — RQ1 ablation results. Paper-reported from arXiv:2603.20278, RQ1. Single training run per condition; no confidence intervals or multiple seeds reported.
Training Data	BrowseComp-Plus (%)
Correct trajectories only	54.81
Incorrect trajectories only	55.06
All trajectories	54.46

The authors report that training on incorrect trajectories alone slightly outperforms correct-only training. They interpret this as evidence that the structural patterns of search—query formulation, document selection strategy, evidence inspection, and stopping behavior—are more important for learning than whether the specific final answer happens to be correct. The authors suggest that even failed trajectories provide useful supervision about search structure and tool-use ordering.

Important caveats. The paper does not report confidence intervals, standard deviations, or results from multiple training seeds for this ablation. The pairwise differences between the three conditions are small (0.25–0.60 percentage points), and it is our observation that these may fall within the natural variance of a single training run. The paper presents this as a single-run comparison, so the finding, while suggestive, should be treated as a preliminary observation rather than a statistically robust conclusion. Whether incorrect trajectories are genuinely beneficial or merely non-harmful relative to correct ones remains an open question requiring replication with multiple seeds and statistical testing.
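A rough sense of scale for this caveat comes from comparing the pairwise gaps in Table 43.4 with the sampling error of an accuracy estimate. The sketch below is our own back-of-the-envelope check. The evaluation-set size of 1,000 is a placeholder chosen for illustration, not the benchmark's actual size, and the binomial formula captures only question-sampling noise, not variance across training seeds.

```python
import math
from itertools import combinations

RQ1_SCORES = {"correct_only": 54.81, "incorrect_only": 55.06, "all": 54.46}

def pairwise_gaps(scores):
    """Absolute difference, in percentage points, for every pair of conditions."""
    return {
        (a, b): round(abs(scores[a] - scores[b]), 2)
        for a, b in combinations(scores, 2)
    }

def binomial_std_error_points(accuracy_pct, num_questions):
    """Standard error of an accuracy estimate, expressed in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / num_questions)

gaps = pairwise_gaps(RQ1_SCORES)
se = binomial_std_error_points(54.81, num_questions=1000)  # 1000 is illustrative only

for pair, gap in gaps.items():
    print(pair, gap)
print(f"standard error at n=1000: {se:.2f} points")
```

Under this placeholder size, every gap is smaller than one standard error, which is consistent with treating the three conditions as statistically indistinguishable until replicated.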

43.4.2 RQ2: Does Gold-Document Bootstrapping Matter?

The corpus coverage ablation, as reported by the authors, demonstrates why answer-guided bootstrapping is critical:

Table 43.5 — RQ2 corpus ablation results. Paper-reported from arXiv:2603.20278, RQ2.
Corpus Setting	Gold Hit Rate	Trajectory Accuracy	BrowseComp-Plus (%)
With gold docs	29.54%	56.86%	54.81
Without gold docs	1.73%	43.81%	6.35

Without gold documents, the authors report a catastrophic drop from 54.81% to 6.35% on BrowseComp-Plus, a decrease of 48.46 points. The gold hit rate drops from 29.54% to 1.73%, which the authors interpret as confirming that answer-guided bootstrapping is essential for seeding the corpus with retrievable evidence. They also note that even with bootstrapping, roughly 70% of trajectories never retrieve a gold document, suggesting significant room for improvement in retrieval strategies.
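A metric like the gold hit rate is only computable because the corpus is fixed and gold documents are annotated. The sketch below shows one plausible definition: the fraction of trajectories in which at least one surfaced document belongs to that question's gold set. Whether the paper counts documents returned by search, documents opened, or both is not stated here, so this definition and the sample records are assumptions.

```python
def gold_hit_rate(records):
    """Fraction of trajectories that surface at least one gold document.

    records: iterable of (seen_urls, gold_urls) pairs, one per trajectory.
    """
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for seen, gold in records if set(seen) & set(gold))
    return hits / len(records)

# Invented example: four trajectories, two of which reach a gold document.
records = [
    (["doc://x", "doc://g1"], ["doc://g1"]),
    (["doc://y"], ["doc://g2"]),
    (["doc://z", "doc://w"], ["doc://g3", "doc://g4"]),
    (["doc://g5"], ["doc://g5", "doc://g6"]),
]
rate = gold_hit_rate(records)
print(f"gold hit rate: {rate:.0%}")
```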

43.4.3 RQ3: Turn Budget Analysis

The relationship between inference-time turn budget and performance, as reported by the authors, shows monotonic improvement with broadly diminishing returns:

Table 43.6 — RQ3 turn budget results. Paper-reported from arXiv:2603.20278, RQ3. Marginal-gain column is survey-author computation from paper-reported values. The paper does not specify whether the same model was evaluated at different turn limits or separate models were trained per budget.
Max Turns at Inference	BrowseComp-Plus (%)	Marginal Gain (survey-computed)
15	47.12	—
30	50.10	+2.98
50	52.70	+2.60
75	53.48	+0.78
100	54.81	+1.33

The largest marginal gain occurs between 15 and 30 turns. We note that the returns are not strictly diminishing: the gain from 75 to 100 turns (+1.33) exceeds the gain from 50 to 75 turns (+0.78). The authors conclude that most questions can be solved within 50 turns, but a long tail benefits from extended exploration. They argue this supports the need for both concise and complex reasoning patterns in the training data.
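Because the budget increments in Table 43.6 are unequal (15, 20, 25, and 25 turns), it helps to normalize each gain by the number of added turns. The sketch below recomputes the marginal-gain column and adds a per-turn figure; both are survey-author computations from the paper-reported accuracies.

```python
TURN_BUDGET_SCORES = [(15, 47.12), (30, 50.10), (50, 52.70), (75, 53.48), (100, 54.81)]

def marginal_gains(points):
    """Return (from_turns, to_turns, gain, gain_per_added_turn) for adjacent budgets."""
    rows = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        gain = round(s1 - s0, 2)
        rows.append((t0, t1, gain, round(gain / (t1 - t0), 3)))
    return rows

rows = marginal_gains(TURN_BUDGET_SCORES)
for t0, t1, gain, per_turn in rows:
    print(f"{t0:>3} -> {t1:<3}  gain {gain:+.2f}  per added turn {per_turn:+.3f}")
```

On a per-turn basis the final interval (75 to 100) still yields more than the preceding one (50 to 75), so the long tail of hard questions continues to benefit from extra budget.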

43.4.4 RQ4: Tool-Space Ablation

Each browser primitive contributes measurably to performance, as reported by the authors:

Table 43.7 — RQ4 tool-space ablation results. Paper-reported from arXiv:2603.20278, RQ4. The paper does not specify whether each configuration was trained on correspondingly restricted trajectories (training-time ablation) or evaluated with restricted tool access on the same model (inference-time ablation).
Tool Configuration	BrowseComp-Plus (%)	Δ vs. Previous (survey-computed)
`search` only	44.10	— (baseline)
`search` + `open`	52.02	+7.92
`search` + `open` + `find`	54.81	+2.79

Adding open provides a +7.92 point improvement by enabling full-document reading rather than relying on snippets alone. Adding find provides a further +2.79 points by enabling targeted evidence localization. The authors interpret this as confirming that explicit browsing structure, even with just three primitives, meaningfully improves deep research performance compared to monolithic search-and-snippet approaches.

43.4.5 RQ5: Retrieval–Accuracy Relationship

The controlled corpus enables what the authors describe as the first analysis of how retrieval success relates to answer accuracy in deep research agents. The paper reports that questions where at least one gold document is retrieved have significantly higher trajectory accuracy than questions where no gold document is found. However, the authors note that the relationship is not deterministic: some questions are solved without retrieving any gold document (through indirect evidence), and some fail despite gold-document retrieval (through reasoning errors post-retrieval).

Our interpretation: This finding is consistent with the intuition that retrieval and reasoning are jointly necessary capabilities, but the paper does not provide formal statistical tests (e.g., controlled regression, matched-pair analysis) to establish a causal relationship. The correlation between gold-document retrieval and accuracy could be confounded by question difficulty—easier questions may be more likely to both retrieve gold documents and yield correct answers. The authors' framing as "causal analysis" should be understood as correlational analysis within a controlled corpus environment, which is more rigorous than live-web analysis but does not constitute causal inference in the strict statistical sense.
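The kind of comparison discussed here reduces to a two-by-two table of retrieval outcome against answer correctness. The sketch below computes accuracy conditioned on whether a gold document was retrieved. The records are synthetic and invented for illustration; they are not data from the paper, and the function name is hypothetical.

```python
def accuracy_by_retrieval(records):
    """Accuracy split by whether a gold document was retrieved.

    records: iterable of (gold_retrieved, correct) boolean pairs.
    Returns {True: accuracy_with_gold, False: accuracy_without_gold}.
    """
    totals = {True: [0, 0], False: [0, 0]}  # retrieved? -> [num_correct, num_total]
    for gold_retrieved, correct in records:
        totals[gold_retrieved][0] += int(correct)
        totals[gold_retrieved][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in totals.items()}

# Synthetic records: all four cells of the table are populated, including
# correct answers without gold retrieval and failures despite it.
records = (
    [(True, True)] * 6 + [(True, False)] * 2
    + [(False, True)] * 3 + [(False, False)] * 9
)
result = accuracy_by_retrieval(records)
print(result)
```

A gap between the two conditional accuracies shows association only. Addressing the difficulty confound raised above would require repeating the split within strata of comparable questions.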

43.5 Key Results

All results in this section are reported by the authors in arXiv:2603.20278. We reproduce them here for reference and add interpretive commentary where noted.

43.5.1 Evaluation Protocol and Baseline Conditions

Before presenting the results, we summarize what is known and unknown about the evaluation methodology. This context is essential for interpreting cross-system comparisons.

Table 43.8 — Evaluation protocol summary. "Specified" means described in the paper; "Unspecified" means not detailed in the available paper description. Survey-author compilation.
Protocol Detail	BrowseComp-Plus (Closed-Web)	Open-Web Benchmarks
Corpus / search source	Offline FAISS corpus (same for all models)	Live Serper API (model-specific behavior)
Answer-matching method	Unspecified (exact match, fuzzy, or LLM-judge?)	Benchmark-defined (GAIA, BrowseComp, xbench)
Scoring metric	Accuracy (% correct answers)	Accuracy (% correct answers)
Retrieval top-K	Unspecified	N/A (live search)
Baseline tool access	Unspecified — same corpus, but tool set per model unknown	Each model uses its native tools/APIs
Number of runs/seeds	Unspecified (appears to be single run)	Unspecified
Confidence intervals	Not reported	Not reported
Max turns (inference)	100 for OpenResearcher; unspecified for baselines	Unspecified for most baselines
Teacher temperature	Not reported	N/A

Key interpretive constraints. For BrowseComp-Plus, the authors state this is a "closed-web" evaluation, meaning all models query the same offline FAISS corpus. However, whether the proprietary baselines (GPT-4.1, Claude-4-Opus) were given access to the same three browser primitives (search/open/find) or used a different tool interface is not fully specified. The comparison across models with different tool-access modalities should be interpreted with this caveat. For open-web benchmarks, each model uses its own search infrastructure, making the comparison more ecologically valid but less controlled.

43.5.2 BrowseComp-Plus (Closed-Web)

Table 43.9 — BrowseComp-Plus results. Paper-reported from arXiv:2603.20278. Model categories as assigned by the paper. Answer-matching protocol unspecified; single evaluation run, no confidence intervals reported.
Method	Category	BrowseComp-Plus (%)
OpenResearcher-30B-A3B	Open-source SFT	54.8
Tongyi DeepResearch	Deep Research Agent	44.5
Claude-4-Opus	Foundation + Tools	36.8
GPT-4.1	Foundation + Tools	36.4
Kimi-K2	Foundation + Tools	35.4
CutBill-30B-A3B	Deep Research Agent	30.3
Gemini-2.5-Pro	Foundation + Tools	29.5
Nemotron-3-Nano (base)	Foundation + Tools	20.8
DeepSeek-R1	Foundation + Tools	16.4

The authors report a +34.0 absolute improvement over the base Nemotron-3-Nano model, achieved via SFT alone, with no reinforcement learning or online interaction. The reported gap of +18.4 points over GPT-4.1 and +18.0 points over Claude-4-Opus is notable given that OpenResearcher activates only 3B parameters per token, whereas the proprietary systems do not disclose their parameter counts and are generally presumed to be far larger.

We note several limitations to this comparison: (1) The answer-matching method (exact match, substring, normalized, or LLM-as-judge) is not specified, and different methods can produce meaningfully different accuracy figures. (2) The tool-access modality for baselines—whether GPT-4.1 and Claude-4-Opus were given the same search/open/find primitives or used a more limited interface—affects the fairness of the comparison. (3) OpenResearcher was specifically trained on the offline corpus underlying BrowseComp-Plus, while the proprietary models were not; the comparison is thus between a specialist system and generalist models, which should be weighted accordingly. (4) No confidence intervals or variance estimates are reported.

43.5.3 Open-Web Benchmarks

A critical question is whether a model trained exclusively on offline trajectories can generalize to live-web environments. The following results, reported by the authors, demonstrate strong transfer:

Table 43.10 — Open-web benchmark results. Paper-reported from arXiv:2603.20278. All models use live Serper API search. Results may not be reproducible across time due to web content changes and API behavior variation.
Method	BrowseComp (%)	GAIA (%)	xbench-DeepSearch (%)
OpenResearcher	26.3	64.1	65.0
OpenAI o4-mini	28.3	55.8	67.0
Claude-4-Sonnet	12.2	57.6	64.0
Kimi-K2	14.1	57.7	50.0
DeepMiner-32B	21.2	54.4	53.0
WebSailor-72B	12.0	55.4	55.0
DeepSeek-R1	8.9	30.3	55.0

On GAIA (64.1%), the authors report that OpenResearcher outperforms all listed baselines including OpenAI o4-mini (55.8%) and Claude-4-Sonnet (57.6%). On xbench-DeepSearch (65.0%), it is competitive with o4-mini (67.0%). Only on the original BrowseComp benchmark does it trail o4-mini by 2 points (26.3% vs. 28.3%), while exceeding all other listed systems. The authors interpret these results as validating the "train offline, deploy online" paradigm.

We note that live-web evaluation is inherently non-reproducible: the Serper API's retrieval behavior varies over time as web content changes, different models may have been evaluated at different dates, and the specific search-API configurations (result count, filtering, etc.) for each baseline are not specified. These results are best treated as a snapshot of comparative performance at the time of evaluation rather than as stable, reproducible measurements.

43.5.4 Trajectory Statistics

Table 43.11 — Trajectory statistics. Paper-reported from arXiv:2603.20278. "Total tool calls" includes search + open + find; "search calls" counts only search invocations.
Metric	Successful	Failed	All
Rate	56.7%	43.3%	100%
Avg. total tool calls (search + open + find)	38.4	71.7	52.8
Avg. search calls only	22.1	48.8	33.6
Max total tool calls	172	185	185
Max search calls	109	119	119

A key observation from the paper: failed trajectories use nearly 2× as many total tool calls as successful ones (71.7 vs. 38.4), primarily driven by excessive search operations (48.8 vs. 22.1). The authors interpret this as showing that failure stems from inefficient search rather than insufficient exploration—successful trajectories converge on relevant documents earlier rather than searching more broadly.

The non-search tool calls per trajectory can be computed from these statistics: for successful trajectories, 38.4 − 22.1 = 16.3 open/find calls; for failed trajectories, 71.7 − 48.8 = 22.9 open/find calls. This indicates that failed trajectories not only search more but also read and inspect more documents, consistent with the interpretation that they struggle to identify relevant evidence rather than failing to use tools at all.
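As a consistency check on Table 43.11, the following sketch recomputes the "All" column as a success-rate-weighted mixture of the successful and failed columns, and derives the open/find counts by subtraction. The numbers are copied from the table; the variable and function names are ours.

```python
# Values copied from Table 43.11 (paper-reported).
success_rate = 0.567
stats = {
    "total": {"successful": 38.4, "failed": 71.7, "all": 52.8},
    "search": {"successful": 22.1, "failed": 48.8, "all": 33.6},
}


def weighted_mean(successful: float, failed: float, p_success: float) -> float:
    """Mixture mean of a per-trajectory count over successful and failed runs."""
    return p_success * successful + (1.0 - p_success) * failed


for name, row in stats.items():
    implied = weighted_mean(row["successful"], row["failed"], success_rate)
    print(f"{name}: implied all-trajectory mean {implied:.2f}, reported {row['all']}")

# Non-search (open + find) calls are the difference between the two rows.
non_search = {
    outcome: round(stats["total"][outcome] - stats["search"][outcome], 1)
    for outcome in ("successful", "failed", "all")
}
print(non_search)
```

The implied means (about 52.82 total and 33.66 search calls) agree with the reported "All" column to within rounding, so the three columns are internally consistent.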

43.5.5 Pass@k Analysis

The Pass@k curve, computed over 16 samples per question, uses the standard unbiased estimator introduced by Chen et al. (2021) for evaluating code generation and adapted here for research trajectories:

$$\text{Pass@}k = \mathbb{E}_{q}\left[1 - \frac{\binom{n - c(q)}{k}}{\binom{n}{k}}\right]$$

Equation 43.6 — Standard unbiased Pass@k estimator from Chen et al. (2021, "Evaluating Large Language Models Trained on Code," §A.1). Applied here to research trajectories rather than code generation.

where $n = 16$ is the number of independent trajectory samples per question, $c(q) \in \{0, 1, \ldots, n\}$ is the number of correct samples for question $q$, $k$ is the number of attempts considered, and the expectation is taken over all questions in the evaluation set. The combinatorial expression $\binom{n-c}{k}/\binom{n}{k}$ is the probability that all $k$ samples drawn without replacement from $n$ total samples are incorrect; subtracting from 1 gives the probability that at least one of $k$ samples is correct. This is an unbiased estimator of Pass@$k$, unlike the naive plug-in estimator $\mathbb{E}[1 - (1 - c/n)^k]$, which is biased and tends to underestimate it (see Chen et al., 2021, §A.1 for derivation). Edge cases: when $c(q) = 0$, the contribution is 0; when $n - c(q) < k$ (fewer than $k$ incorrect samples exist), the binomial coefficient $\binom{n-c}{k}$ is 0 and the contribution is 1. The estimator requires $k \leq n$.
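A minimal sketch of Equation 43.6 in Python follows. It relies on `math.comb`, which returns 0 when the lower argument exceeds the upper one, so the edge cases need no special handling. The function names and the example counts are hypothetical; this is not code from the OpenResearcher repository.

```python
from math import comb
from typing import Sequence


def pass_at_k_single(n: int, c: int, k: int) -> float:
    """Unbiased per-question estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k > n - c, which makes the estimate exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-question estimates over the evaluation set."""
    return sum(pass_at_k_single(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts for four questions, n = 16 samples each.
counts = [0, 2, 9, 16]
for k in (1, 2, 4, 8, 16):
    print(k, round(pass_at_k(counts, n=16, k=k), 4))
```

For $k = 1$ the estimator reduces to the mean of $c(q)/n$, and for $k = n$ it reduces to the fraction of questions with at least one correct sample.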

Table 43.12 — Pass@k results for the teacher model (GPT-OSS-120B) trajectories. Paper-reported from arXiv:2603.20278, computed over 6K questions × 16 samples each.
$k$	Pass@$k$
1	0.567
2	0.638
4	0.710
8	0.766
16	0.792

The 22.5-point gap between Pass@1 and Pass@16 indicates high solution diversity: many questions are solvable but only along certain reasoning paths. The authors report that the per-question distribution is bimodal—approximately 20% of questions have near-0% pass rate (extremely hard) and ~30% reach near-100% (robustly solvable). We note that Pass@16 equals the empirical fraction of questions solved at least once, which at 79.2% is considerably higher than the single-sample accuracy of 56.7%, suggesting that best-of-N selection or majority voting could significantly improve deployment performance.
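One way to exploit multiple samples at deployment time is majority voting over the final answers. The sketch below is a generic self-consistency vote under our own assumptions (string answers, light normalization); the paper does not describe or evaluate such a selector.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(answers: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer after trimming and lowercasing.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps equal counts in first-encountered order.
    """
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical final answers from six sampled trajectories for one question.
samples = ["Paris", "paris ", None, "Lyon", "Paris", ""]
print(majority_vote(samples))
```

Voting helps only on questions where the correct answer is also the most common one. Pass@16 is an upper bound that assumes a perfect selector, so realized gains from voting would fall somewhere between the Pass@1 and Pass@16 figures.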

43.6 Implementation Details and Cost Analysis

43.6.1 Cost Comparison for Trajectory Synthesis

The economic argument for offline synthesis rests on the cost of live-web search API calls that would be required if trajectories were generated against the live web. This analysis requires careful distinction between three different count metrics reported or derivable from the paper.

Metric definitions.

Total tool calls = search + open + find invocations. Paper-reported average: 52.8 per trajectory.
Search calls = only search invocations. Paper-reported average: 33.6 per trajectory.
Non-search tool calls = open + find. Derivable: 52.8 − 33.6 = 19.2 per trajectory.

Only search calls correspond to paid search-API queries in a live-web scenario: open fetches a webpage (typically free or negligible cost), and find is a local text operation within an already-loaded document.

Cost derivation. Using the paper-reported statistics:

$$N_{\text{search}} = N_{\text{traj}} \times \overline{s} \approx 97{,}000 \times 33.6 \approx 3.26\text{M search calls}$$

Equation 43.7 — Survey-author derivation from paper-reported values (97K+ trajectories, 33.6 avg. search calls).

$$N_{\text{total}} = N_{\text{traj}} \times \overline{t} \approx 97{,}000 \times 52.8 \approx 5.12\text{M total tool calls}$$

Equation 43.8 — Survey-author derivation from paper-reported values (97K+ trajectories, 52.8 avg. total tool calls).

Reconciliation with the paper's "5.76M" figure. The source material references approximately 5.76 million "search requests." This figure does not cleanly match either of our derivations (5.12M total tool calls or 3.26M search-only calls). The 5.76M figure is consistent with approximately 96,000 trajectories × 60 tool calls per trajectory, suggesting it may use a rounded trajectory count and a higher per-trajectory average than the 52.8 reported in the trajectory statistics table, or it may count a different set of operations (e.g., including sub-calls or retries). For cost comparison purposes, we present the methodologically cleaner search-only count alongside the paper-referenced figure:

Table 43.13 — Hypothetical live-web costs avoided by offline synthesis. Survey-author computation. Row 1 uses only search calls (methodologically cleanest); Row 2 uses the paper-referenced figure of ~5.76M. Actual costs would vary by API plan and usage patterns.
Count Basis	API Calls	Serper ($1/1K)	SerpAPI ($5/1K)
Search calls only (97K × 33.6)	~3.26M	$3,260	$16,300
Paper-referenced total (~5.76M)	~5.76M	$5,760	$28,800
Offline retriever (OpenResearcher)	Any	$0	$0

The conservative (search-only) estimate of $3,260–$16,300 is methodologically preferable since open and find do not map to paid search-API queries even in a live-web scenario. The upper bound may be relevant if one assumes a live-web pipeline incurring per-call costs for webpage fetching as well.
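The figures in Table 43.13 can be reproduced with a few lines of arithmetic. The sketch below assumes flat per-1,000-query pricing at the two rates used in the table and a trajectory count of exactly 97,000 (the paper says "97K+"); real API plans have tiers and minimums that this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountBasis:
    label: str
    api_calls: int


# Per-1,000-query prices as used in Table 43.13; actual plans vary.
PRICE_PER_1K = {"Serper": 1.00, "SerpAPI": 5.00}

N_TRAJ = 97_000          # paper-reported "97K+" trajectories, rounded down
AVG_SEARCH_CALLS = 33.6  # paper-reported average search calls per trajectory

bases = [
    CountBasis("search calls only", round(N_TRAJ * AVG_SEARCH_CALLS)),
    CountBasis("paper-referenced total", 5_760_000),
]


def api_cost(calls: int, price_per_1k: float) -> float:
    """Cost in USD for `calls` queries billed at a flat per-1,000 rate."""
    return calls / 1000 * price_per_1k


for basis in bases:
    costs = {name: api_cost(basis.api_calls, p) for name, p in PRICE_PER_1K.items()}
    print(basis.label, basis.api_calls, costs)
```

The search-only row works out to 3,259,200 calls, or $3,259.20 and $16,296 at the two rates, which the table rounds to $3,260 and $16,300.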

One-time bootstrapping cost. The paper reports bootstrapping cost as approximately $60 via the Serper API. As discussed in Section 43.3.2, this implies approximately 60,000 API calls at $1/1,000 queries — roughly 10 queries per question for 6,000 questions. The paper's algorithm description (a single concatenated query per question) accounts for only $6 (6,000 calls), leaving a 10× discrepancy. The most likely explanation is that the implementation issues multiple query variants per question to maximize gold-document recall, but this detail is not resolved in the paper text.

43.6.2 Training Cost

The authors report that student model training requires 64 H100-hours (8 GPUs × 8 hours). At cloud rates of $3–$5 per H100-hour, this translates to approximately $192–$320 (survey-author estimate based on typical 2025–2026 cloud pricing). The total pipeline cost is likely dominated by teacher-model inference (GPT-OSS-120B generating 97K+ trajectories) rather than student training, but this cannot be confirmed: the paper does not report the total token consumption or API cost for teacher inference.

43.6.3 Reproducibility

The offline design provides three properties, relevant to both reproducibility and scale, that are not available to systems relying on live-web search:

No rate limits: Parallel synthesis at scale without API throttling.
Deterministic retrieval: Same corpus + same queries produce identical retrieval results across runs (assuming deterministic FAISS search, which holds for exact-search indices but may vary for approximate indices depending on configuration).
Zero external dependencies: No proprietary APIs needed after one-time bootstrapping.
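The determinism claim can be illustrated with a brute-force exact search. The sketch below is a pure-Python stand-in for an exact (flat) index, not FAISS itself, and all names in it are hypothetical. It adds an explicit tie-break on document ID so that the ranking depends only on the query and the corpus contents, not on insertion order.

```python
from typing import Dict, List, Sequence, Tuple


def exact_top_k(
    query: Sequence[float],
    doc_vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Brute-force inner-product search over every document vector.

    Sorting by (-score, doc_id) makes the ranking fully determined by the
    inputs, including the relative order of documents with equal scores.
    """
    scored = [
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in doc_vectors.items()
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Toy three-document corpus; doc_a and doc_b share a vector to force a tie.
corpus = {
    "doc_b": [0.0, 1.0, 0.0],
    "doc_a": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 0.0, 0.0],
}
query = [0.2, 0.9, 0.0]

first = exact_top_k(query, corpus, k=2)
second = exact_top_k(query, dict(reversed(list(corpus.items()))), k=2)
print(first == second, first)
```

Approximate indices trade this guarantee for speed, which is why the caveat about index configuration matters when reproducing trajectories exactly.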

Limitations on reproducibility: The one-time Serper API bootstrapping may return different results over time (though the collected gold documents are included in the release). Teacher model generation is stochastic (temperature/sampling settings are not specified in the paper), so exact trajectories will differ across runs. The released 97K+ trajectory set is the canonical reference. Live-web benchmark evaluations use live search APIs and inherently produce non-reproducible results.

43.7 Multi-Scale Browsing Hierarchy: A Closer Analysis

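The three-primitive browsing model is worth examining in detail because it represents a specific hypothesis about how to structure information discovery for AI agents. The hierarchy operates at three scales: search maps the corpus to candidate documents, open maps a document to its full content, and find maps that content to specific evidence.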

The ablation results reported by the authors in Section 43.4.4 provide empirical support for this hierarchy: each additional primitive contributes measurable gains. The search-only configuration achieves 44.10% — limited to reasoning over short snippets. Adding open raises this to 52.02% by enabling full-context reading but requiring the model to scan potentially long documents. Adding find reaches 54.81% by enabling targeted evidence localization, which is particularly valuable for named-entity lookup and factual verification tasks.

The authors note that this three-level hierarchy—corpus-to-documents, documents-to-content, content-to-evidence—mirrors how human researchers interact with information. We observe that this hierarchical decomposition is a design choice rather than a proven optimum; alternative decompositions (e.g., including a scroll or summarize primitive) remain unexplored in this work. We also note that the protocol for the tool-space ablation — whether the same model was tested with restricted tool access at inference time, or separate models were trained on tool-restricted trajectories — is not fully specified and affects interpretation of the gains.
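The three scales can be made concrete with a toy in-memory browser. This is a minimal sketch under our own assumptions: token overlap stands in for the dense retriever, snippets are fixed-length prefixes, and the class and method signatures are hypothetical rather than the interface shipped in the OpenResearcher release.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyBrowser:
    """In-memory stand-in for the search / open / find primitives."""

    docs: Dict[str, str]
    current: str = field(default="", init=False)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, str]]:
        """Corpus -> documents: rank by shared lowercase tokens, return snippets."""
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), doc_id)
            for doc_id, text in self.docs.items()
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [(doc_id, self.docs[doc_id][:60]) for _, doc_id in ranked[:k]]

    def open(self, doc_id: str) -> str:
        """Document -> content: load the full text and remember it."""
        self.current = self.docs[doc_id]
        return self.current

    def find(self, pattern: str, window: int = 30) -> List[str]:
        """Content -> evidence: return a text window around each literal match."""
        if not pattern:
            return []
        return [
            self.current[max(0, m.start() - window): m.end() + window]
            for m in re.finditer(re.escape(pattern), self.current, flags=re.IGNORECASE)
        ]


# Two made-up documents, for illustration only.
browser = ToyBrowser({
    "d1": "The Curie family won several Nobel Prizes. Marie Curie won in 1903 and 1911.",
    "d2": "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
})
results = browser.search("Marie Curie Nobel")
browser.open(results[0][0])
evidence = browser.find("1911")
print(results[0][0], evidence)
```

Each call narrows the scope: search returns only short snippets, open exposes the whole document, and find returns just the windows around the matched string, which is the step the ablation associates with the final gain.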

43.8 Comparison with Related Systems

OpenResearcher occupies a distinctive position in the landscape of deep research agents. The comparison below is based on the paper's own contextualization of related work. Where information for comparison systems is taken from the paper rather than independently verified, this is noted.

Table 43.14 — System comparison. Paper-reported from arXiv:2603.20278, Table adapted by survey author. "Not specified" indicates the paper does not provide this detail for the comparison system; these systems may support longer trajectories than implied. Search-R1 trajectory depth of 2–5 is paper-reported.
System	Trajectory Source	Search Type	Fully Open	Max Tool Calls
OpenResearcher	Offline synthesis	Offline (FAISS)	Yes (all artifacts)	185
Search-R1	Online synthesis	Live API	Partial	2–5
WebExplorer	Online synthesis	Live API	Partial	Not specified
MiroThinker	Online synthesis	Live API	Partial	Not specified
DeepMiner-32B	Online synthesis	Live API	Partial	Not specified
ASearcher-QwQ-32B	Online synthesis	Live API	Partial	Not specified
WebDancer-QwQ-32B	Online synthesis	Live API	Partial	Not specified

According to the authors, the novelty lies in the combination of four properties: (1) fully offline synthesis, (2) complete artifact release (code, corpus, trajectories, model checkpoints), (3) support for 100+ tool-call trajectories, and (4) competitive performance against proprietary systems. The authors claim no prior open system achieves all four simultaneously. The task complexity spectrum ranges from shallow retrieval (2–5 tool calls) through deep research (20–50) to ultra-deep research (50–100+), with OpenResearcher being the first open system the authors identify as targeting the ultra-deep tail with a substantial portion of trajectories exceeding 100 tool calls.

43.9 Limitations & Discussion

43.9.1 Corpus Currency and Domain Specificity

The offline corpus is a snapshot: it cannot answer questions about events occurring after corpus construction. The FineWeb-based corpus is general-purpose; domain-specific applications (legal research, biomedical literature, patent analysis) would require domain-appropriate corpora and likely domain-specific bootstrapping. The system is currently optimized for English-language research only.

43.9.2 Modality and Interaction Limitations

Documents are text-only; the current system has no capability for image, table, or chart understanding. The three browser primitives do not cover interactive web elements (forms, dynamic content, JavaScript-rendered pages), limiting deployment in environments where evidence requires interactive navigation.

43.9.3 Unexplored Questions

Several important questions remain unaddressed in the paper. There is no discussion of catastrophic forgetting during SFT—whether the student model loses general capabilities while acquiring deep research skills. There is no exploration of whether search strategies degrade on out-of-distribution questions. The impact of corpus drift (when the offline corpus becomes outdated relative to the real web) is not analyzed. Multi-task and multi-domain continued learning are not investigated.

43.9.4 The Surprising Correctness Finding: Interpretation and Caveats

The RQ1 result—that training on incorrect trajectories slightly outperforms training on correct ones—deserves careful interpretation. The authors attribute this to the importance of structural search patterns over final-answer correctness. However, several caveats apply:

Statistical significance is not established. The differences between conditions (54.81%, 55.06%, 54.46%) are small—within 0.60 percentage points—and the paper reports only single training runs without confidence intervals, error bars, or multiple seeds. In our assessment, these differences may well be within natural variance.
Confounding with trajectory length. Incorrect trajectories are substantially longer on average (71.7 vs. 38.4 total tool calls), providing more supervision signal per example. The near-equivalence of outcomes could reflect a quantity-of-supervision effect rather than a genuine advantage of incorrect demonstrations.
Confounding with training set size. With a 56.7% success rate, there are approximately 55K correct trajectories and 42K incorrect trajectories (from 97K+ total). The correct-only and incorrect-only training sets differ in size, which is a confound for single-run comparisons without matched sample sizes.
Alternative explanations. Longer trajectories may expose more diverse search strategies, tool-call patterns, and failure-recovery behaviors, independent of correctness. The richness of exploration in failed trajectories may compensate for incorrect final answers.

The result is best interpreted as showing that trajectory correctness filtering is not strictly necessary for SFT data curation in this setting, rather than as strong evidence that incorrect trajectories are superior. A definitive conclusion would require controlled experiments with matched trajectory lengths, matched set sizes, multiple seeds, and statistical testing—none of which are provided in the current paper.
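A rough sense of scale for the statistical-significance caveat comes from the binomial standard error of an accuracy estimate. The sketch below compares the largest and smallest of the three reported accuracies. The evaluation-set size is an illustrative assumption, not a value taken from the paper, and the calculation ignores both question pairing and seed-to-seed training variance.

```python
from math import sqrt


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return sqrt(accuracy * (1.0 - accuracy) / n_questions)


def difference_z_score(acc_a: float, acc_b: float, n_questions: int) -> float:
    """z-score for the gap between two accuracies, treating the runs as independent.

    This ignores the pairing induced by a shared question set and ignores
    training-seed variance, so it is only a rough sanity check.
    """
    se = sqrt(
        accuracy_standard_error(acc_a, n_questions) ** 2
        + accuracy_standard_error(acc_b, n_questions) ** 2
    )
    return (acc_a - acc_b) / se


N_QUESTIONS = 1_000  # ASSUMPTION: illustrative size, not taken from the paper
z = difference_z_score(0.5506, 0.5446, N_QUESTIONS)
print(f"SE ~ {accuracy_standard_error(0.5481, N_QUESTIONS):.4f}, z ~ {z:.2f}")
```

Under these simplifying assumptions, a 0.60-point gap would only approach the conventional |z| ≈ 1.96 threshold for evaluation sets of tens of thousands of questions, which supports treating the three conditions as statistically indistinguishable.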

43.9.5 Retrieval Coverage Gap

Even with answer-guided bootstrapping, the authors report that roughly 70% of trajectories never retrieve a gold document (a 29.54% gold hit rate). If this rate and the 56.7% success rate are measured over the same trajectory set, then at least 56.7% − 29.54% ≈ 27% of all trajectories, or roughly half of the successful ones, reach a correct answer without ever retrieving a gold document, relying instead on indirect evidence or on reasoning from partial information. While impressive, this also means the training data may teach suboptimal search strategies for the subset of questions where direct evidence exists but is not found. The low gold hit rate suggests significant room for improvement in corpus construction and retrieval strategies.

43.9.6 Evaluation Protocol Gaps

Several evaluation details are not fully specified in the available paper description and would be needed for complete reproducibility or fair cross-system comparison (see also Table 43.8):

Answer matching: Whether BrowseComp-Plus uses exact-match, normalized-match, fuzzy-match, substring-match, or LLM-judge evaluation is not detailed. The choice of matching method can substantially affect reported accuracy, particularly for open-ended research questions where answers may be phrased differently.
Retrieval top-K: The exact value of $K$ for the search primitive is not specified in the paper. This parameter directly affects both trajectory behavior (more results = more browsing options) and the difficulty of the retrieval task.
Baseline tool access: Whether proprietary baselines (GPT-4.1, Claude-4-Opus) used the same three browser primitives, a subset, or different tools entirely is not always clear. On BrowseComp-Plus (closed-web), all models query the same corpus, but the interface modality may differ.
Temperature and sampling: The teacher model's temperature and sampling parameters during trajectory generation are not reported. These settings affect trajectory diversity and the composition of the rejection-sampled training set.
RQ ablation protocol: For RQ3 (turn budget) and RQ4 (tool-space), whether the same model is evaluated with restricted inference settings or separate models are trained for each configuration is not fully specified. This distinction is critical: a training-time ablation measures the value of data diversity, while an inference-time ablation measures the value of tool access.

43.9.7 Potential Extensions

The paper identifies several directions for continued learning that the offline environment naturally enables: reinforcement learning from search feedback, self-play trajectory refinement, curriculum strategies over trajectory length, and active learning for corpus expansion. The pseudocode below illustrates one such direction as described conceptually in the paper:

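# ALGORITHMIC PSEUDOCODE: hypothetical self-play refinement loop.
# Described as FUTURE WORK in arXiv:2603.20278; NOT implemented in the current release.
# generate_trajectory, is_correct, sft_train, and offline_engine are placeholder
# callables supplied by the caller; they are not functions from the repository.

from typing import Any, Callable, Iterable, List, Tuple


def iterative_self_play(
    base_model: Any,
    corpus: Any,  # kept for interface parity; unused in this sketch
    index: Any,
    qa_pairs: Iterable[Tuple[str, str]],
    *,
    generate_trajectory: Callable[..., Any],
    is_correct: Callable[[Any, str], bool],
    sft_train: Callable[[Any, List[Any]], Any],
    offline_engine: Callable[[Any], Any],
    rounds: int = 3,
    samples_per_question: int = 16,
) -> Any:
    """Hypothetical self-play trajectory refinement.

    The student generates its own trajectories, which are filtered
    and used for further SFT. The RQ1 finding (incorrect trajectories
    have comparable training value) suggests filtering thresholds
    could be relaxed, though this requires experimental validation.
    """
    student = base_model
    qa_pairs = list(qa_pairs)  # materialize so every round sees all questions
    engine = offline_engine(index)  # build the offline search engine once

    for _round in range(rounds):
        trajectories = []
        for question, answer in qa_pairs:
            for _ in range(samples_per_question):
                traj = generate_trajectory(
                    question=question,
                    teacher_model=student,  # student acts as its own teacher
                    search_engine=engine,
                )
                # Rejection sampling: keep only non-empty, correct trajectories.
                if traj and is_correct(traj, answer):
                    trajectories.append(traj)

        if trajectories:  # skip the SFT step if nothing passed the filter
            student = sft_train(student, trajectories)

    return student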

43.10 Relationship to Evolutionary AI Systems

While OpenResearcher is not itself an evolutionary system, it intersects with the themes of this survey in several important ways. The analogies drawn in this section are our own analytical commentary, not claims made by the original paper.

The trajectory synthesis pipeline implements a form of generate-and-filter that is structurally analogous to evolutionary search: a population of candidate trajectories is generated (16 per question), evaluated against a fitness criterion (answer correctness), and the fittest are selected for training (rejection sampling). The Pass@k analysis reveals the kind of solution-diversity landscape that evolutionary methods are designed to exploit.

The RQ1 finding—that incorrect trajectories have training value—resonates with evolutionary computation's insight that failed individuals contribute to search-space exploration even when they do not survive selection. The paper's observation that structural search patterns may matter more than outcome correctness echoes the distinction between genotype and phenotype: the behavioral patterns (genotype) transfer more robustly than specific outcomes (phenotype). However, as discussed in Section 43.9.4, this analogy should be treated as suggestive rather than conclusive given the statistical limitations of the RQ1 experiment.

From a practical standpoint, OpenResearcher's pipeline could serve as a trajectory-generation substrate for evolutionary approaches to research agent optimization. The offline environment's determinism and zero marginal cost enable the kind of large-scale, repeated evaluation that evolutionary search demands. A population of search agents with different prompt strategies, tool preferences, or query formulation heuristics could be evolved against this environment, with trajectory success as the fitness function.
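As an illustration of what such an evolutionary loop might look like, the sketch below evolves a two-parameter agent configuration with truncation selection and integer mutation. Everything in it is hypothetical: the genome fields, the mutation ranges, and especially the toy fitness function, which stands in for the trajectory success rate that a real run would measure against the offline environment.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical knobs: (max_search_calls, results_per_query).
Genome = Tuple[int, int]


def evolve(
    fitness: Callable[[Genome], float],
    population: List[Genome],
    generations: int,
    rng: random.Random,
) -> Genome:
    """Truncation selection plus integer mutation; returns the best genome seen."""
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(1, len(ranked) // 2)]  # keep the top half
        children = [
            (max(1, p[0] + rng.randint(-5, 5)), max(1, p[1] + rng.randint(-1, 1)))
            for p in parents
        ]
        population = parents + children
        best = max(population + [best], key=fitness)
    return best


def toy_fitness(genome: Genome) -> float:
    """Placeholder objective; a real run would measure trajectory success rate."""
    max_search_calls, results_per_query = genome
    return -((max_search_calls - 30) ** 2) - (results_per_query - 5) ** 2


rng = random.Random(0)
initial = [(rng.randint(5, 100), rng.randint(1, 10)) for _ in range(8)]
best = evolve(toy_fitness, initial, generations=20, rng=rng)
print(best, toy_fitness(best))
```

Because parents survive into the next generation, the best fitness never decreases. In a real setting each fitness evaluation would require generating and scoring many trajectories, which is where the offline environment's zero marginal search cost becomes relevant.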

43.11 Summary

Chapter Summary

Key takeaway: OpenResearcher demonstrates that high-quality deep research trajectories can be synthesized entirely offline. The authors report that a small student model trained on these trajectories via SFT alone exceeds every evaluated proprietary system on the closed-corpus BrowseComp-Plus benchmark and is competitive on live-web benchmarks, leading on GAIA and trailing only o4-mini on BrowseComp and xbench-DeepSearch.

Main contribution to the field: The three-stage pipeline (answer-guided corpus bootstrapping, offline trajectory synthesis with explicit browser primitives, and rejection-sampled SFT) establishes a fully open and reproducible pathway for training deep research agents. The five targeted research questions, particularly the observation that incorrect trajectories have comparable training value and the critical role of gold-document bootstrapping, provide what appears to be one of the first controlled empirical analyses of deep research data design within a fixed-corpus environment.

Most important thing for researchers: The "train offline, deploy online" paradigm appears to work. A model trained on a fixed, offline corpus generalizes effectively to live-web search environments according to the reported results, decoupling the expensive data-generation phase from deployment. This means any research group with access to a teacher model and modest compute can build competitive deep research agents without ongoing API costs—and OpenResearcher's full artifact release (code, corpus, 97K+ trajectories, model checkpoint) makes this immediately actionable.

Evidential status: All benchmark numbers and ablation results in this chapter are reported by the authors (Li et al., arXiv:2603.20278). Code examples are algorithmic pseudocode derived from the paper's method descriptions, not excerpts from the repository; readers seeking actual implementation details should consult github.com/TIGER-AI-Lab/OpenResearcher. Cost estimates in Section 43.6.1 include both paper-reported figures and our own derivations, with discrepancies noted. Several evaluation protocol details—including the answer-matching method, retrieval top-K, baseline tool access, and RQ ablation protocol—are not fully specified in the paper (see Table 43.8 and Section 43.9.6). Interpretive commentary beyond the paper's own analysis is marked throughout.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}