
OpenResearcher

A fully open pipeline for synthesizing long-horizon deep research trajectories via offline corpus bootstrapping and browser-primitive-based browsing

Organization: TIGER-AI Lab (Texas A&M University, University of Waterloo, UC San Diego, Verdent AI, NetMind AI, Lambda)
Published: March 17, 2026
Type: paper/repo/data/model
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

ArXiv: arXiv:2603.20278 (cs.IR / cs.AI / cs.CL)

Repository: github.com/TIGER-AI-Lab/OpenResearcher

Demo: HuggingFace Space

Dataset: OpenResearcher-Dataset

Model: OpenResearcher-30B-A3B

License: Open release (code, data, model checkpoints, offline search environment)

Status: Active open-source project with full artifact release

OpenResearcher is, to the authors' knowledge, the first fully open-source pipeline for deep research trajectory synthesis that produces a model rivaling proprietary systems on long-horizon search and reasoning tasks.


2 Authors and Team

Author Affiliation Role
Zhuofeng Li Texas A&M University Project Lead, Corresponding Author
Dongfu Jiang University of Waterloo Project Lead
Xueguang Ma University of Waterloo Core Contributor
Haoxiang Zhang UC San Diego Core Contributor
Ping Nie University of Waterloo Core Contributor
Yuyu Zhang Verdent AI Contributor
Kai Zou NetMind AI Contributor
Jianwen Xie Lambda Contributor
Yu Zhang Texas A&M University Corresponding Author
Wenhu Chen University of Waterloo Corresponding Author

The team is distributed across three universities and three industry labs. The TIGER-AI Lab (Texas A&M / Waterloo collaboration) has prior work in multi-modal reasoning, tool use, and benchmark construction. Wenhu Chen's group at Waterloo has produced influential work on table-based reasoning, web-scale QA, and multi-modal benchmarks. The industrial partners (Verdent AI, NetMind AI, Lambda) contributed infrastructure and model access.


3 Core Contribution

The Problem

Training deep research agents—systems that iteratively search, aggregate evidence, and reason over many steps—requires long-horizon trajectories that interleave search, browsing, and multi-step reasoning. However, existing data collection pipelines suffer from three critical limitations:

  1. Cost: Live web search APIs charge per query. At the scale of 97K+ trajectories with an average of 52.8 tool calls each (≈5.76M search requests), API costs become prohibitive ($5,760–$28,800 for Serper/SerpAPI alone)
  2. Instability: The live web changes constantly, making trajectory synthesis non-reproducible over time
  3. Analytical opacity: Without stable gold-document annotations, it is impossible to conduct controlled analyses of when relevant evidence is surfaced, opened, or missed

The Solution

OpenResearcher introduces a three-stage synthesis pipeline (question collection, offline corpus construction, and trajectory synthesis), followed by a student-training stage, that decouples one-time online bootstrapping from fully offline trajectory synthesis:

┌─────────────────────────────────────────────────────────────────┐
│                  OpenResearcher Pipeline                        │
│                                                                 │
│  Stage 1: Question Collection                                   │
│  ┌─────────────────┐                                            │
│  │  MiroVerse v0.1  │──→ 6K complex QA pairs                    │
│  │  (10% sample)    │    (requires deep, multi-hop reasoning)   │
│  └─────────────────┘                                            │
│                                                                 │
│  Stage 2: Offline Corpus Construction (one-time online cost)    │
│  ┌──────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │ Answer-guided │   │  FineWeb 15M     │   │  Qwen3-Emb-8B  │  │
│  │ bootstrapping │──→│  docs merged     │──→│  FAISS index    │  │
│  │ → 10K gold    │   │  with gold docs  │   │  (dense retr.)  │  │
│  └──────────────┘   └──────────────────┘   └────────────────┘  │
│                                                                 │
│  Stage 3: Trajectory Synthesis (fully offline)                  │
│  ┌───────────────┐   ┌───────────────┐   ┌─────────────────┐   │
│  │ GPT-OSS-120B  │──→│ Browser tools: │──→│ 97K+ trajectories│  │
│  │ (teacher)     │   │ search/open/  │   │ (55K after       │  │
│  │               │   │ find          │   │  rejection samp.) │  │
│  └───────────────┘   └───────────────┘   └─────────────────┘   │
│                                                                 │
│  Stage 4: Student Training (SFT)                                │
│  ┌───────────────┐   ┌───────────────────────────────────────┐  │
│  │ Nemotron-3    │──→│ OpenResearcher-30B-A3B                 │  │
│  │ Nano 30B-A3B  │   │ 54.8% BrowseComp-Plus (+34.0 pts)    │  │
│  │ (base)        │   │ 64.1% GAIA, 26.3% BrowseComp         │  │
│  └───────────────┘   └───────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Three Key Contributions

  1. Offline and reproducible synthesis. The expensive search-and-browse loop runs entirely offline after a one-time corpus bootstrapping stage. The resulting model outperforms larger-backbone deep research agents in both offline and live-web settings

  2. Explicit browser structure for deep research. Three minimal browser primitives—search, open, find—model the full hierarchy of information discovery, from corpus-level retrieval to document-level inspection to evidence-level localization

  3. Empirical insights into search-data and agent design. Five targeted research questions (RQ1–RQ5) provide the field's first controlled analysis of trajectory filtering, corpus construction, turn budgets, tool-space design, and the relationship between retrieval success and answer accuracy


4 Supported Solutions

Solution Type Support Level Details
Long-horizon deep research QA Primary target Multi-step search + evidence aggregation over 10–185 tool calls
Offline trajectory synthesis Core pipeline Reproducible generation of search-browse-reason trajectories
Open-web deep research Transfer target Trained on offline data, generalizes to live-web search (Serper API)
Closed-corpus research Evaluated BrowseComp-Plus benchmark with fixed FAISS-indexed corpus
Multi-hop QA Supported Subsumes simpler multi-hop tasks (2–5 hops) as special cases
Evidence localization Supported find primitive enables exact string matching within documents
SFT data generation Supported Pipeline produces teacher trajectories for distillation to smaller models

Task Complexity Spectrum

OpenResearcher explicitly targets the long-horizon tail of research tasks that prior systems cannot address:

Horizon Tool Calls Example Prior System OpenResearcher Coverage
Shallow retrieval 2–5 Search-R1 ✅ (subset)
Medium multi-hop 5–20 Standard RAG agents ✅ (well-covered)
Deep research 20–50 WebExplorer, MiroThinker ✅ (primary target)
Ultra-deep research 50–100+ No prior open system ✅ (substantial tail)
Maximum horizon 100–185 None ✅ (captured in data)

The trajectory distribution spans the full spectrum, with successful trajectories concentrated in the 10–40 range but a non-trivial portion exceeding 100 tool calls. This ensures downstream models learn both concise and complex reasoning patterns.


5 LLM Integration

Teacher Model: GPT-OSS-120B

The teacher model used for trajectory synthesis is GPT-OSS-120B (Agarwal et al., 2025), a large-scale open-source reasoning model. Key properties:

Property Value
Model GPT-OSS-120B
Role Teacher for trajectory generation
Access to reference answer ❌ (must recover answer through search)
Tool integration ReAct-style with 3 browser primitives
Context window Sufficient for 185-turn trajectories
Temperature/sampling Not specified; trajectories sampled 16× per question

The teacher operates in a ReAct-style loop (Yao et al., 2022), interleaving reasoning (chain-of-thought) with tool calls:

Trajectory H_T = {
  (query, system_prompt, tool_metadata),      # Initial context
  (reasoning_1, action_1, observation_1),     # Step 1
  (reasoning_2, action_2, observation_2),     # Step 2
  ...
  (reasoning_T, final_answer)                 # Termination
}

At each step t:
  r_t, a_t ~ π(· | H_{t-1})                  # Policy generates thought + action
  o_t = E(a_t)                                # Environment returns observation
  H_t = H_{t-1} ∪ {(r_t, a_t, o_t)}          # Trajectory grows
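The loop above can be sketched as a short Python function; here `policy` and `execute_tool` are hypothetical stand-ins for the teacher model and the offline search environment, not part of the released pipeline:

```python
def run_react_episode(question, policy, execute_tool, max_turns=100):
    """Minimal ReAct-style loop: the policy emits (reasoning, action);
    the environment returns an observation until a final answer is emitted."""
    history = [("query", question)]
    for _ in range(max_turns):
        reasoning, action = policy(history)               # r_t, a_t ~ pi(. | H_{t-1})
        if action["tool"] == "final_answer":
            history.append((reasoning, action, None))     # termination step
            return action["answer"], history
        observation = execute_tool(action)                # o_t = E(a_t)
        history.append((reasoning, action, observation))  # H_t = H_{t-1} + step t
    return None, history  # turn budget exhausted without a conclusive answer
```

In the real pipeline the environment is the local retrieval server and the policy is GPT-OSS-120B; the structure of the loop is the same.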

Student Model: Nemotron-3-Nano-30B-A3B

Property Value
Base model NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Architecture 30B total parameters, 3B active (Mixture-of-Experts)
Training framework Megatron-LM
Training data ~55K trajectories (rejection-sampled from 97K+)
Context length 256K tokens (pre-packed, no truncation)
Hardware 8× NVIDIA H100 GPUs
Training time ~8 hours
Learning rate 5×10⁻⁵ (no decay)
Batch size 64 (global)
Training steps 347

LLM as Browser-Augmented Research Agent

The LLM integration pattern is fundamentally different from prior agentic search systems. Rather than treating search as simple document retrieval, OpenResearcher models explicit browsing behavior through three tool primitives:

Tool Function Returns Analogy
search(query) Dense retrieval over FAISS index Top-K results with title, URL, snippet Typing a query into Google
open(url) Fetch full document content Complete document text Clicking a search result
find(string) Exact string match in current doc Matching passages with context Ctrl+F on a webpage

This three-level hierarchy enables multi-scale information discovery:

  • Corpus → Documents via search
  • Documents → Content via open
  • Content → Evidence via find
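A minimal in-memory sketch of the three primitives (the `Browser` class and its keyword scoring are illustrative assumptions; the real system serves search from a FAISS dense index over 15M documents):

```python
class Browser:
    """Toy search/open/find over an in-memory corpus: {url: (title, text)}."""

    def __init__(self, corpus):
        self.corpus = corpus
        self.current = None  # text of the most recently opened document

    def search(self, query, k=3):
        # Stand-in scoring: count of query terms in the document text.
        def score(text):
            return sum(text.lower().count(t) for t in query.lower().split())
        ranked = sorted(self.corpus.items(), key=lambda kv: -score(kv[1][1]))
        return [(url, title, text[:80]) for url, (title, text) in ranked[:k]]

    def open(self, url):
        self.current = self.corpus[url][1]
        return self.current

    def find(self, needle, window=40):
        # Exact string match in the current document, returned with context.
        i = self.current.find(needle)
        if i < 0:
            return None
        return self.current[max(0, i - window): i + len(needle) + window]
```

Each method narrows the scope exactly as in the table: search returns candidates with snippets, open commits to one document, find localizes evidence within it.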

Comparison: Tool-Space Ablation (RQ4)

Tool Configuration BrowseComp-Plus Accuracy
search only 44.10
search + open 52.02
search + open + find 54.81

Each additional primitive provides measurable gains, confirming that explicit browsing structure improves deep research performance.


6 Key Results

Primary Benchmark: BrowseComp-Plus (Closed-Web)

Method Category BrowseComp-Plus (%)
OpenResearcher-30B-A3B Ours 54.8
Tongyi DeepResearch Deep Research Agent 44.5
GPT-4.1 Foundation + Tools 36.4
Claude-4-Opus Foundation + Tools 36.8
Kimi-K2 Foundation + Tools 35.4
CutBill-30B-A3B Deep Research Agent 30.3
Gemini-2.5-Pro Foundation + Tools 29.5
Nemotron-3-Nano (base) Foundation + Tools 20.8
DeepSeek-R1 Foundation + Tools 16.4

+34.0 absolute improvement over the base Nemotron-3-Nano model. +18.4 points over GPT-4.1. +18.0 points over Claude-4-Opus. These gains are achieved via SFT alone—no reinforcement learning or online interaction.

Open-Web Deep Research Benchmarks

Method BrowseComp (%) GAIA (%) xbench-DeepSearch (%)
OpenResearcher 26.3 64.1 65.0
OpenAI o4-mini 28.3 55.8 67.0
Claude-4-Sonnet 12.2 57.6 64.0
Kimi-K2 14.1 57.7 50.0
DeepMiner-32B 21.2 54.4 53.0
WebSailor-72B 12.0 55.4 55.0
DeepSeek-R1 8.9 30.3 55.0
ASearcher-QwQ-32B 5.2 52.8 42.0
WebDancer-QwQ-32B 3.8 51.5 39.0

Crucially, OpenResearcher is trained solely on offline trajectories yet achieves competitive performance on live-web benchmarks. On GAIA (64.1%), it outperforms all listed baselines including OpenAI o4-mini (55.8%) and Claude-4-Sonnet (57.6%). This demonstrates that high-quality offline synthesis generalizes to dynamic, real-world search environments.

Trajectory Statistics

Metric Successful Failed All
Rate 56.7% 43.3% 100%
Avg. tool calls 38.4 71.7 52.8
Avg. searches 22.1 48.8 33.6
Max tool calls 172 185 185
Max searches 109 119 119

Key insight: failure stems from inefficient search, not insufficient exploration. Failed trajectories use nearly 2× as many tool calls, primarily driven by excess search operations (48.8 vs. 22.1). Successful trajectories converge on relevant documents earlier.

Pass@k Analysis

k Pass@k
1 0.567
2 0.638
3 0.681
4 0.710
8 0.766
16 0.792

The 20%+ gap between Pass@1 and Pass@16 indicates high solution diversity—many questions are solvable but only along certain reasoning paths. The per-question solve-rate distribution is bimodal: ~20% of questions have near-0% pass rate (extremely hard) and ~30% reach near-100% (robustly solvable).
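The Pass@k values above can be reproduced from per-question success counts with the standard unbiased estimator (Chen et al., 2021); a sketch, where `n` is the number of samples per question (16 here) and `c` the number that answered correctly:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c correct) succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k(16, c_q, k)` over all questions q yields the table; the bimodal solve-rate distribution shows up as many questions with c_q near 0 or near 16.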


7 Reproducibility

Fully Open Artifact Release

OpenResearcher releases every component necessary for reproduction:

Artifact Location Description
Pipeline code GitHub Complete synthesis pipeline
Offline search engine GitHub FAISS-indexed corpus + retrieval server
Synthesized trajectories HuggingFace 97K+ trajectories with metadata
Model checkpoint HuggingFace OpenResearcher-30B-A3B weights
Demo HuggingFace Spaces Interactive demo
QA data Included 6K processed question-answer pairs
Embedding model Public Qwen3-Embedding-8B (publicly available)

Deterministic Offline Environment

The offline design provides three reproducibility guarantees:

  1. No rate limits: Parallel synthesis at scale without API throttling
  2. Fully deterministic behavior: Same corpus + same queries → same retrieval results across runs
  3. Zero external dependencies: No proprietary APIs needed after one-time bootstrapping

Controlled Analysis Capability

Because the corpus, search backend, and browser actions are fixed, the system enables analysis impossible with live-web pipelines:

  • Gold-document tracking: For each question, the system knows which documents contain supporting evidence, enabling measurement of when gold documents are retrieved vs. opened vs. missed
  • Retrieval event tracing: Every search, open, and find operation is logged with the exact documents accessed
  • Causal analysis: RQ5 demonstrates that gold-document retrieval rate correlates with answer accuracy (29.54% hit rate with gold docs → 56.86% trajectory accuracy vs. 1.73% hit rate without → 43.81%)

What Is NOT Reproducible

  • One-time bootstrapping: The initial gold document collection uses the Serper API, which may return different results over time. However, the collected gold documents are included in the release
  • Teacher model behavior: GPT-OSS-120B's generation is stochastic; exact trajectories will differ across runs. The released trajectories are the canonical set
  • Live-web evaluation: BrowseComp, GAIA, and xbench-DeepSearch benchmarks use live search APIs, so evaluation results may vary

8 Compute and API Costs

Trajectory Synthesis Cost Comparison

Method Price per 1K Searches Total Cost (5.76M searches)
Serper API $1 $5,760
SerpAPI $5 $28,800
Offline retriever (OpenResearcher) $0 $0

The offline retriever eliminates all per-query search costs, making large-scale synthesis economically feasible.

One-Time Bootstrapping Costs

Component Estimated Cost Notes
Gold document retrieval ~$60 6K questions × ~10 Serper queries each
FineWeb corpus download Free Public dataset
Embedding generation Compute only Qwen3-Embedding-8B over 15M documents
FAISS index construction Compute only One-time indexing

Training Costs

Component Value
Hardware 8× NVIDIA H100 GPUs
Training duration ~8 hours
Estimated GPU-hours 64 H100-hours
Estimated cost (cloud) ~$200–$320 (at $3–5/H100-hour)
Framework Megatron-LM (distributed training)
Precision BF16
Context length 256K tokens

Teacher Model Inference Costs

Component Value
Model GPT-OSS-120B
Trajectories generated 97K+ (16 samples × 6K questions)
Avg. trajectory length 52.8 tool calls
Max trajectory length 185 tool calls
Total inference ~97K × [variable context] tokens

The total cost of the OpenResearcher pipeline is dominated by teacher-model inference (GPT-OSS-120B) and corpus embedding/indexing. The actual training of the student model is remarkably cheap (~64 H100-hours), demonstrating that the bottleneck in deep research is data quality, not model training.


9 Architecture Solution

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                     OpenResearcher Architecture                      │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    OFFLINE CORPUS LAYER                         │  │
│  │                                                                │  │
│  │  ┌──────────────┐   ┌────────────────────────┐                │  │
│  │  │  Gold Docs    │   │     FineWeb 15M         │                │  │
│  │  │  (10K docs)   │──→│     Documents           │                │  │
│  │  │  from Serper  │   │     (~10T tokens)        │                │  │
│  │  └──────────────┘   └────────────────────────┘                │  │
│  │           │                      │                              │  │
│  │           └──────┬───────────────┘                              │  │
│  │                  ▼                                              │  │
│  │        ┌──────────────────┐                                    │  │
│  │        │  Qwen3-Emb-8B    │                                    │  │
│  │        │  Dense Embeddings │                                    │  │
│  │        └────────┬─────────┘                                    │  │
│  │                 ▼                                              │  │
│  │        ┌──────────────────┐                                    │  │
│  │        │  FAISS Index      │                                    │  │
│  │        │  (15M+ vectors)   │                                    │  │
│  │        └────────┬─────────┘                                    │  │
│  └─────────────────┼──────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        SEARCH ENGINE LAYER                    │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │              Local Retrieval Server                      │  │  │
│  │  │                                                         │  │  │
│  │  │  search(query) ──→ Top-K (title, URL, snippet)          │  │  │
│  │  │  open(url)     ──→ Full document content                │  │  │
│  │  │  find(string)  ──→ Exact matches in current document    │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        AGENT LAYER                            │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │            GPT-OSS-120B Teacher Agent                    │  │  │
│  │  │                                                         │  │  │
│  │  │  Loop:                                                  │  │  │
│  │  │    1. Reason (chain-of-thought)                         │  │  │
│  │  │    2. Select tool (search | open | find)                │  │  │
│  │  │    3. Execute tool → receive observation                │  │  │
│  │  │    4. Update trajectory H_t                             │  │  │
│  │  │    5. Repeat until confident → emit final answer        │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        TRAINING LAYER                         │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────┐   ┌──────────────────────────────────┐   │  │
│  │  │  Trajectory      │   │  Nemotron-3-Nano-30B-A3B          │   │  │
│  │  │  Filtering       │──→│  Supervised Fine-Tuning           │   │  │
│  │  │  (reject wrong   │   │  (Megatron-LM, 256K ctx, 8×H100) │   │  │
│  │  │   answers)       │   │                                    │   │  │
│  │  └─────────────────┘   └──────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Design Principles

  1. Decoupling. Corpus construction (one-time, online) is cleanly separated from trajectory synthesis (repeatable, offline). This means the expensive bootstrapping pays off across unlimited synthesis runs.

  2. Explicit browsing. Rather than treating search as a monolithic retrieval operation, the system exposes three primitives that mirror human browsing: broad search → focused reading → targeted evidence finding.

  3. Teacher-student distillation. A large teacher model (GPT-OSS-120B) generates high-quality trajectories that are then distilled into a much smaller student (30B active parameters) via SFT.

  4. Rejection sampling. Only trajectories yielding correct final answers are retained for training (55K out of 97K+), ensuring the student learns from success patterns.


10 Component Breakdown

Component 1: QA Question Collection

Source: MiroVerse-v0.1 dataset (10% random sample → ~6K QA instances)

Selection Criteria:

  • Questions must require long-horizon, multi-hop reasoning over heterogeneous evidence
  • Standard benchmarks (2WikiMultiHopQA, Natural Questions) are explicitly rejected as too shallow
  • Even a strong teacher model needs dozens of tool calls for these questions, with a substantial tail exceeding 100 calls

Post-processing:

  • Answers normalized into concise, verifiable form
  • Partial trajectories from MiroVerse are discarded (unsuitable for direct supervision due to inconsistent quality)
  • All trajectories regenerated from scratch using only clean QA pairs

Component 2: Gold Document Retrieval

Purpose: Ensure the offline corpus contains evidence sufficient to answer each question

Method:

  1. Construct query = concatenation(question, reference_answer) for improved recall
  2. Retrieve via Serper API (one-time online step)
  3. Clean and deduplicate documents

Output: 10K gold documents for 6K questions (~1.67 gold docs per question average)

Critical Design Choice: Gold documents are used only for corpus construction, never during trajectory synthesis. The teacher model must independently find evidence.
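The bootstrapping step can be sketched as follows; `search_fn` stands in for the one-time Serper API call, and the content-hash deduplication is an assumed implementation detail:

```python
import hashlib

def bootstrap_gold_docs(qa_pairs, search_fn, per_question=10):
    """Answer-guided bootstrapping sketch: for each QA pair, query with the
    question concatenated to the reference answer, then dedup by content."""
    seen, gold = set(), []
    for question, answer in qa_pairs:
        query = f"{question} {answer}"  # answer-guided query for better recall
        for doc in search_fn(query)[:per_question]:
            h = hashlib.sha1(doc["text"].encode()).hexdigest()
            if h not in seen:  # drop exact duplicates across questions
                seen.add(h)
                gold.append(doc)
    return gold
```

The reference answer appears only in this offline bootstrapping query; the teacher never sees it during trajectory synthesis.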

Component 3: Offline Corpus

Property Value
Distractor documents 15M (from FineWeb)
Gold documents 10K (from online bootstrapping)
Total size ~15.01M documents
Token count ~10 trillion tokens
Gold-to-distractor ratio ~1:1,500
Purpose Approximate web-scale complexity

The extreme gold-to-distractor ratio ensures realistic search difficulty. The teacher model must locate relevant documents among 1,500× as many irrelevant ones.

Component 4: Dense Retrieval Engine

Property Value
Embedding model Qwen3-Embedding-8B
Index FAISS
Retrieval type Dense (vector similarity)
Query input Natural language (from agent search calls)
Output Ranked documents with title, URL, snippet

Component 5: Browser Tool Suite

Tool Input Output Scale
search(query) Natural language query Top-K results (title, URL, snippet) Corpus → Document candidates
open(url) Document URL Full document text Document → Full content
find(string) Exact string Matching passages + context Content → Evidence

Design rationale: Each tool narrows the information scope by one level. The find tool is critical for named-entity lookup and factual verification—tasks where scanning long documents in-context is unreliable.

Component 6: Trajectory Filtering Pipeline

Filtering criteria:

  1. Trajectories exceeding maximum context length → removed
  2. Trajectories with malformed tool calls → removed
  3. Trajectories failing to reach a conclusive answer within budget → removed
  4. (For training) Trajectories with incorrect final answers → removed via rejection sampling

Yield: 97K+ raw trajectories → ~55K training trajectories after rejection sampling
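The four filters compose into a single predicate; a sketch, with the trajectory field names (`token_count`, `steps`, `answer`, `gold_answer`) assumed rather than taken from the released schema:

```python
def keep_for_training(traj, max_tokens=262_144, max_turns=100):
    """Sketch of the filtering pipeline: returns True only for trajectories
    that fit in context, are well-formed, conclude in budget, and are correct."""
    if traj["token_count"] > max_tokens:                     # 1. context overflow
        return False
    if any(s["action"] is None for s in traj["steps"]):      # 2. malformed tool call
        return False
    if traj["answer"] is None or len(traj["steps"]) > max_turns:  # 3. no conclusion
        return False
    return traj["answer"] == traj["gold_answer"]             # 4. rejection sampling
```

Filters 1–3 are applied to all 97K+ trajectories; filter 4 is applied only when selecting the ~55K training set.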

Component 7: SFT Training Pipeline

Framework: Megatron-LM (distributed training)

Key design choices:

  • Sequences pre-packed to 256K tokens (no truncation)
  • Complete reasoning chains preserved end-to-end
  • Fixed configuration: LR=5×10⁻⁵, no warmup/decay, 347 steps, batch=64
  • No RL, no online interaction, no curriculum—pure SFT on answer-verified demonstrations
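Pre-packing can be sketched as greedy first-fit binning of trajectory token counts; the exact packing algorithm is not specified in the source, so this is one plausible implementation:

```python
def pack_sequences(lengths, max_len=262_144):
    """Greedy first-fit packing: group trajectory token counts into bins of
    at most max_len; sequences longer than max_len are excluded, never cut."""
    bins = []
    for n in sorted((n for n in lengths if n <= max_len), reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)  # fits in an existing bin
                break
        else:
            bins.append([n])  # open a new bin
    return bins
```

Packing whole trajectories (rather than truncating) preserves complete reasoning chains, at the cost of dropping the few trajectories that exceed 256K tokens on their own.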


11 Core Mechanisms (Detailed)

Mechanism 1: Offline-Online Decoupling

The central architectural innovation is the clean separation between online and offline phases:

ONLINE PHASE (one-time):
  for each (question, answer) in QA_pairs:
    query = concat(question, answer)
    gold_docs = serper_api.search(query)
    gold_docs = clean_and_dedup(gold_docs)
    corpus.add(gold_docs)

  distractor_docs = fineweb.sample(15_000_000)
  corpus.add(distractor_docs)

  embeddings = qwen3_embedding(corpus)
  index = faiss.build_index(embeddings)

OFFLINE PHASE (repeatable, scalable):
  for each question in QA_pairs:
    for seed in range(16):  # 16 samples per question
      trajectory = teacher.generate_trajectory(
        question=question,
        tools=[search, open, find],
        environment=offline_search_engine(index)
      )
      trajectories.append(trajectory)

Why this matters: Once the corpus and index are built, unlimited trajectory synthesis runs can be performed at zero marginal search cost. This enables:

  • Experimenting with different teacher models
  • Varying prompt strategies
  • Generating multiple samples per question (Pass@k analysis)
  • Ablation studies on tool configurations
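The offline retrieval core is dense nearest-neighbor search; a small numpy stand-in for the FAISS inner-product index (at the 15M-document scale, FAISS replaces the brute-force matmul below, but the semantics are the same):

```python
import numpy as np

def build_index(doc_embeddings):
    """Normalize document vectors so inner product equals cosine similarity."""
    norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return doc_embeddings / norms

def dense_search(index, query_embedding, k=5):
    """Return the top-k (doc_id, score) pairs for a query vector."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q                  # brute-force similarity to every doc
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))
```

In the pipeline, query and document vectors come from Qwen3-Embedding-8B, and `dense_search` backs the agent-facing search primitive.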

Mechanism 2: Multi-Scale Browsing Hierarchy

The three-primitive browsing model captures the hierarchical nature of human research:

Level 1: CORPUS SEARCH
  Agent: "What university did [person] attend?"
  search("person X education background") → 10 results

Level 2: DOCUMENT INSPECTION  
  Agent: "Result 3 looks promising, let me read the full article"
  open("https://corpus/doc_12345") → full text (may be 5K+ words)

Level 3: EVIDENCE LOCALIZATION
  Agent: "The article mentions education. Let me find the specific section"
  find("graduated from") → "...graduated from MIT in 1983..."

Ablation evidence (RQ4):

  • search only: Agent relies on incomplete snippets → 44.1%
  • search + open: Agent reads full docs but must scan long context → 52.0%
  • search + open + find: Agent explicitly localizes evidence → 54.8%

Adding open contributes +7.9 points and find a further +2.8, each gain coming from reducing the model's reliance on implicit reasoning over long contexts.

Mechanism 3: Rejection Sampling for Quality Control

Rather than training on all generated trajectories, OpenResearcher applies rejection sampling: only trajectories yielding correct final answers are retained.

Surprising finding (RQ1): Final-answer correctness is NOT the dominant indicator of training value.

Training Data BrowseComp-Plus
Correct trajectories only 54.81
Incorrect trajectories only 55.06
All trajectories 54.46

Training on incorrect trajectories alone slightly outperforms correct-only training. This suggests that even failed trajectories provide useful supervision about search structure, tool-use ordering, evidence inspection patterns, and stopping behavior.

Interpretation: The structural patterns of how to search—query formulation, document selection, evidence verification—are more important for learning than whether the final answer happens to be correct. This is a counterintuitive but practically important finding.

Mechanism 4: Corpus Coverage Bootstrapping

RQ2 ablation demonstrates why gold-document bootstrapping is critical:

Setting Gold Hit Rate Trajectory Accuracy BrowseComp-Plus
With gold docs 29.54% 56.86% 54.81
Without gold docs 1.73% 43.81% 6.35

Without gold documents, the model achieves only 6.35% on BrowseComp-Plus—a catastrophic 48-point drop. The 29.54% gold hit rate (compared to 1.73% without) confirms that answer-guided bootstrapping successfully seeds the corpus with retrievable evidence. However, even with bootstrapping, 70% of trajectories never retrieve a gold document, suggesting room for improvement in retrieval strategies.

Mechanism 5: Turn Budget Analysis

RQ3 investigates whether long-horizon capability requires a large turn budget at inference time:

Max Turns at Inference BrowseComp-Plus
15 47.12
30 50.10
50 52.70
75 53.48
100 54.81

Performance improves monotonically with the turn budget but with diminishing returns. The jump from 15 to 30 turns provides the largest marginal gain (+3.0 points), while the improvement from 75 to 100 turns is only +1.3 points. This suggests that most questions can be solved within 50 turns but a long tail benefits from extended exploration.

Mechanism 6: Retrieval-Accuracy Relationship

RQ5 provides the first controlled analysis of how retrieval success relates to final-answer accuracy in deep research:

Questions where at least one gold document is retrieved have significantly higher answer accuracy than questions where no gold document is found. However, the relationship is not deterministic—some questions are solved without retrieving any gold document (through indirect evidence), and some fail despite gold-document retrieval (through reasoning errors post-retrieval).

This nuanced finding challenges the simplistic assumption that "better retrieval = better answers" and highlights the importance of both retrieval and reasoning capabilities.


12 Programming Language

Component Language Framework/Library
Pipeline orchestration Python Custom
Trajectory synthesis Python GPT-OSS-120B API
Dense retrieval Python FAISS, Qwen3-Embedding-8B
Corpus processing Python Custom (cleaning, dedup)
Model training Python Megatron-LM
Evaluation Python Custom eval scripts
Web search (bootstrap) Python Serper API client

The entire system is Python-native, consistent with the ML research ecosystem. No multi-language complexity.

Code Organization (Inferred from Release)

OpenResearcher/
├── data/                    # QA data, processed questions
├── corpus/                  # Offline corpus construction
│   ├── bootstrap.py         # Gold document retrieval
│   ├── fineweb_sampler.py   # FineWeb document sampling
│   └── indexer.py           # FAISS index construction
├── search_engine/           # Local retrieval server
│   ├── server.py            # Retrieval API
│   └── browser_tools.py     # search/open/find implementations
├── synthesis/               # Trajectory generation
│   ├── agent.py             # Teacher agent (ReAct loop)
│   ├── prompts.py           # System prompts, tool metadata
│   └── filtering.py         # Trajectory quality filters
├── training/                # SFT pipeline
│   ├── data_prep.py         # Trajectory → training format
│   └── megatron_config.py   # Megatron-LM configuration
├── evaluation/              # Benchmark evaluation
│   ├── browsecomp_plus.py   # Closed-web eval
│   ├── browsecomp.py        # Open-web eval
│   ├── gaia.py              # GAIA benchmark
│   └── xbench.py            # xbench-DeepSearch
└── analysis/                # Research question analyses
    ├── rq1_filtering.py     # Correctness filtering ablation
    ├── rq2_corpus.py        # Corpus coverage ablation
    ├── rq3_turns.py         # Turn budget analysis
    ├── rq4_tools.py         # Tool-space ablation
    └── rq5_retrieval.py     # Retrieval-accuracy relationship

13 Memory Management

Context Window Management

The most critical memory challenge in OpenResearcher is managing the 256K-token context window during both trajectory synthesis and student training.

During Synthesis (Teacher)

The teacher model must maintain a growing trajectory context across up to 185 tool calls. Each step adds:

  • A reasoning chain-of-thought (variable length)
  • A tool call (structured JSON)
  • An observation (variable, especially for open which returns full documents)

Mitigation strategies:

  1. Search snippets are truncated to limit context growth
  2. open returns full document content, but the model learns to use find for targeted inspection rather than reading entire documents
  3. Trajectories exceeding context limits are filtered out
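Per-tool observation capping (strategy 1) might look like the following sketch; the specific character limits are assumptions, not values from the paper:

```python
def truncate_observation(tool, text, snippet_chars=400, doc_chars=20_000):
    """Clip a tool observation before appending it to the trajectory:
    search snippets get a tight budget, opened documents a larger one."""
    limit = snippet_chars if tool == "search" else doc_chars
    if len(text) <= limit:
        return text
    return text[:limit] + " [truncated]"
```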

During Training (Student)

Sequences are pre-packed to 256K tokens with no truncation. This is a deliberate design choice: truncation would break the reasoning chain and teach the model incomplete patterns.

| Training Memory Property | Value |
|---|---|
| Max context length | 256K tokens |
| Packing strategy | Pre-packed (no padding waste) |
| Truncation | None (trajectories that don't fit are excluded) |
| Framework | Megatron-LM (distributed across 8× H100) |
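The no-truncation packing policy can be sketched as a first-fit-decreasing bin packer; this is an illustrative choice, as the paper does not specify the exact packing algorithm:

```python
# Illustrative greedy packing of trajectories into 256K-token sequences.
# Trajectories longer than the budget are excluded, never truncated,
# matching the policy described above.
MAX_LEN = 256_000  # token budget per packed sequence (illustrative)

def pack_sequences(lengths, max_len=MAX_LEN):
    """First-fit-decreasing packing; returns (bins, dropped_indices)."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, dropped = [], []
    for i in order:
        if lengths[i] > max_len:
            dropped.append(i)  # does not fit: exclude rather than break the chain
            continue
        for b in bins:
            if b["used"] + lengths[i] <= max_len:
                b["items"].append(i)
                b["used"] += lengths[i]
                break
        else:
            bins.append({"items": [i], "used": lengths[i]})
    return bins, dropped
```

Packing short trajectories together this way avoids padding waste while preserving every reasoning chain intact.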

Corpus Memory

| Component | Memory Requirement |
|---|---|
| Raw corpus | ~15M documents × avg size → multi-TB on disk |
| FAISS index | Dense vectors for 15M documents (significant GPU/RAM) |
| Embeddings | 15M × embedding_dim (Qwen3-Embedding-8B: 4096-dim) |
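The FAISS row invites a quick back-of-envelope check, assuming unquantized float32 vectors in a flat index (the paper does not specify the index type; quantized indexes such as IVF-PQ would be far smaller):

```python
# Back-of-envelope footprint for a flat float32 index over the corpus.
# 15M docs and 4096 dims come from the table above.
NUM_DOCS = 15_000_000
DIM = 4096          # Qwen3-Embedding-8B output dimension
BYTES_PER_FLOAT = 4

index_bytes = NUM_DOCS * DIM * BYTES_PER_FLOAT
index_gib = index_bytes / 2**30
print(f"{index_gib:.0f} GiB")  # ≈ 229 GiB for the raw float32 vectors alone
```

At this scale the vectors alone exceed single-GPU memory, which is why the retrieval server runs as a separate service rather than inside the synthesis process.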

Trajectory Storage

| Property | Value |
|---|---|
| Total trajectories | 97K+ |
| Average tool calls per trajectory | 52.8 |
| Storage format | Structured (question, reasoning, action, observation) tuples |
| Availability | Released on HuggingFace Datasets |
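An illustrative trajectory record mirroring the (question, reasoning, action, observation) tuple structure above; the field names are assumptions, not the released dataset schema:

```python
# Hypothetical record layout for one stored trajectory.
trajectory = {
    "question": "Which paper introduced the transformer architecture?",
    "steps": [
        {
            "reasoning": "I should search for the origin of the transformer.",
            "action": {"tool": "search",
                       "args": {"query": "transformer architecture original paper"}},
            "observation": "[1] Attention Is All You Need ...",
        },
    ],
    "final_answer": "Attention Is All You Need (Vaswani et al., 2017)",
}

num_tool_calls = len(trajectory["steps"])  # averages 52.8 in the released set
```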

No External Memory System

OpenResearcher does not use any form of external memory, knowledge graph, or cross-trajectory learning. Each trajectory is generated independently. The system relies entirely on:

  1. The teacher model's parametric knowledge for search strategy
  2. The offline corpus for factual evidence
  3. The trajectory history (within-episode context) for coherent reasoning

This is a deliberate choice in favor of simplicity. Cross-trajectory or cross-question memory could improve search-strategy diversity, but it would complicate the pipeline and break the independence assumption that enables parallel synthesis.


14 Continued Learning

Current Learning Paradigm

OpenResearcher uses a single-pass SFT approach: the student model is trained once on the curated trajectory set. There is no iterative self-improvement, no RL phase, and no online adaptation.

Potential Continued Learning Extensions

The paper explicitly identifies several directions for continued learning:

1. Reinforcement Learning from Search Feedback

The offline environment provides exact reward signals: trajectory accuracy can be computed against gold answers, and retrieval success (gold-document hit rate) can serve as an intermediate reward. This enables:

  • RLHF-style optimization with search-specific rewards
  • Process reward models trained on step-level retrieval success
  • Outcome reward models trained on final-answer correctness
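A minimal sketch of how these two signals could be blended into a scalar reward; `compute_reward` and the exact-match outcome check are illustrative assumptions, not the paper's formulation:

```python
# Sketch: blend an outcome reward (answer correctness) with a process
# reward (gold-document hit rate), as the offline environment permits.
def compute_reward(predicted, gold_answer, opened_docs, gold_docs, alpha=0.5):
    """alpha weights the outcome term; (1 - alpha) weights retrieval success."""
    outcome = 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0
    hit_rate = len(set(opened_docs) & set(gold_docs)) / max(len(gold_docs), 1)
    return alpha * outcome + (1 - alpha) * hit_rate
```

The hit-rate term gives partial credit to trajectories that find the right evidence but fail the final synthesis, which is exactly the step-level signal a process reward model would train on.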

2. Self-Play Trajectory Refinement

The student model (OpenResearcher-30B-A3B) could generate its own trajectories, which are then filtered and used for further SFT rounds:

Round 1: Teacher (GPT-OSS-120B) → trajectories → SFT → Student v1
Round 2: Student v1 → trajectories → filter → SFT → Student v2
Round 3: Student v2 → trajectories → filter → SFT → Student v3
...

The RQ1 finding (incorrect trajectories have training value) suggests this could work even with moderate student-model accuracy.

3. Curriculum Over Trajectory Complexity

The trajectory distribution spans 10–185 tool calls. A curriculum strategy could:

  • Start training on short trajectories (10–30 tool calls)
  • Progressively introduce longer trajectories (50–100+)
  • Allocate more training weight to the long-tail trajectories
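The schedule above can be sketched as a step curriculum over trajectory length; the epoch thresholds here are assumptions for illustration, not values from the paper:

```python
# Illustrative curriculum: early epochs admit only short trajectories,
# later epochs admit the long tail up to the 185-call maximum.
def curriculum_threshold(epoch, schedule=((0, 30), (1, 100), (2, 185))):
    """Max tool calls admitted at a given epoch (step schedule)."""
    limit = schedule[0][1]
    for start, max_calls in schedule:
        if epoch >= start:
            limit = max_calls
    return limit

def select_trajectories(tool_call_counts, epoch):
    """Indices of trajectories admitted at this epoch."""
    limit = curriculum_threshold(epoch)
    return [i for i, n in enumerate(tool_call_counts) if n <= limit]
```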

4. Active Learning for Corpus Expansion

Questions where the teacher consistently fails (0/16 Pass@k) likely indicate corpus coverage gaps. An active learning loop could:

  • Identify consistently failed questions
  • Perform targeted online bootstrapping for those questions
  • Expand the corpus and re-synthesize trajectories
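The first step of that loop, flagging questions with zero successes across all attempts, can be sketched as follows; `coverage_gap_questions` is an illustrative name:

```python
# Sketch: flag questions the teacher fails on all k attempts (Pass@16 = 0),
# treating them as candidates for targeted corpus bootstrapping.
def coverage_gap_questions(attempts, k=16):
    """attempts: {question_id: [bool, ...]} of per-attempt correctness."""
    return [q for q, results in attempts.items()
            if len(results) >= k and not any(results[:k])]
```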

5. Multi-Teacher Ensemble

Different teacher models may produce complementary search strategies. Training on trajectories from multiple teachers could improve strategy diversity:

  • GPT-OSS-120B (primary teacher)
  • Claude-4-Opus (different search patterns)
  • DeepSeek-R1 (different reasoning style)
  • Open-source LRMs (cost-effective scaling)

What the Paper Does NOT Address

  • No discussion of catastrophic forgetting during SFT
  • No exploration of whether the student model's search strategies degrade on out-of-distribution questions
  • No analysis of how the model's performance changes with corpus drift (when the offline corpus becomes outdated)
  • No investigation of multi-task or multi-domain continued learning

15 Applications

Direct Applications

1. Autonomous Literature Review

The deep research agent can be deployed for automated literature surveys:

  • Issue complex research questions
  • Let the agent search, read, and synthesize findings across many documents
  • The agent produces evidence-grounded answers with source tracing

2. Fact-Checking and Verification

The search → open → find browsing hierarchy is naturally suited to fact-checking:

  • A broad search identifies relevant sources
  • Opening a document enables full-context reading
  • find localizes specific claims for verification

3. Competitive Intelligence

Multi-source evidence aggregation across business documents, news, and filings:

  • Long-horizon reasoning chains connect disparate data points
  • The system handles heterogeneous evidence quality and potential contradictions

4. Research Data Synthesis for SFT

The pipeline itself is an application: generating high-quality research trajectories for training smaller models. Any organization can:

  1. Define domain-specific QA pairs
  2. Bootstrap a corpus with answer-guided retrieval
  3. Run the offline synthesis pipeline
  4. SFT a model for domain-specific deep research
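The four-step recipe composes into a simple pipeline; the sketch below uses placeholder callables, not the released API, to make the data flow explicit:

```python
# Minimal orchestration sketch of the domain-adaptation recipe above.
# bootstrap, synthesize, and finetune are placeholders for the pipeline's
# corpus-bootstrapping, trajectory-synthesis, and SFT stages.
def build_domain_pipeline(qa_pairs, bootstrap, synthesize, finetune):
    corpus = bootstrap(qa_pairs)                 # answer-guided corpus bootstrapping
    trajectories = synthesize(qa_pairs, corpus)  # offline trajectory synthesis
    return finetune(trajectories)                # SFT for domain-specific research
```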

Benchmark Applications

| Benchmark | Type | OpenResearcher Performance | Application |
|---|---|---|---|
| BrowseComp-Plus | Closed-web deep research | 54.8% (SOTA) | Corpus-bound research |
| BrowseComp | Open-web deep research | 26.3% | Live web research |
| GAIA | General AI assistant | 64.1% | Multi-modal task solving |
| xbench-DeepSearch | Deep search evaluation | 65.0% | Search-intensive QA |

Broader Impact: The "Offline Training, Online Deployment" Pattern

OpenResearcher demonstrates a powerful paradigm for deep research agents:

Train offline, deploy online. A model trained exclusively on offline trajectories (no live-web exposure) transfers effectively to live-web search environments. This decouples the expensive data-generation phase from the deployment environment, enabling:

  • Cost-effective data generation at scale
  • Reproducible training pipelines
  • Controlled analysis of agent behavior
  • Domain-specific customization without API lock-in

Limitations for Applications

  1. Corpus currency: The offline corpus is a snapshot; it cannot answer questions about events after corpus construction
  2. Domain specificity: The current corpus (FineWeb) is general-purpose; domain-specific applications need domain corpora
  3. Single-language: The system is optimized for English-language research
  4. No multi-modal evidence: Documents are text-only; no image, table, or chart understanding
  5. Static browsing model: The three primitives do not cover interactive web elements (forms, dynamic content, JavaScript-rendered pages)

| System | Trajectory Source | Search Type | Open-Source | Max Tool Calls | Benchmark SOTA |
|---|---|---|---|---|---|
| OpenResearcher | Offline synthesis | Offline (FAISS) | ✅ Full | 185 | BrowseComp-Plus: 54.8% |
| Search-R1 | Online synthesis | Live API | Partial | 2–5 | N/A |
| WebExplorer | Online synthesis | Live API | Partial | Variable | N/A |
| MiroThinker | Online synthesis | Live API | Partial | Variable | N/A |
| DeepMiner-32B | Online synthesis | Live API | Partial | Variable | BrowseComp: 21.2% |
| ASearcher-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 52.8% |
| WebDancer-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 51.5% |

OpenResearcher is the only system that combines (1) fully offline synthesis, (2) complete artifact release, (3) 100+ tool call support, and (4) competitive performance against proprietary systems.