
OpenResearcher

A fully open pipeline for synthesizing long-horizon deep research trajectories via offline corpus bootstrapping and browser-primitive-based browsing

Organization: TIGER-AI Lab (Texas A&M University, University of Waterloo, UC San Diego, Verdent AI, NetMind AI, Lambda)
Published: March 17, 2026
Type: paper/repo/data/model
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

ArXiv: arXiv:2603.20278 (cs.IR / cs.AI / cs.CL)

Repository: github.com/TIGER-AI-Lab/OpenResearcher

Demo: HuggingFace Space

Dataset: OpenResearcher-Dataset

Model: OpenResearcher-30B-A3B

License: Open release (code, data, model checkpoints, offline search environment)

Status: Active open-source project with full artifact release

OpenResearcher is, to the authors' knowledge, the first fully open-source pipeline for deep research trajectory synthesis that produces a model rivaling proprietary systems on long-horizon search and reasoning tasks.


2 Authors and Team

Author Affiliation Role
Zhuofeng Li Texas A&M University Project Lead, Corresponding Author
Dongfu Jiang University of Waterloo Project Lead
Xueguang Ma University of Waterloo Core Contributor
Haoxiang Zhang UC San Diego Core Contributor
Ping Nie University of Waterloo Core Contributor
Yuyu Zhang Verdent AI Contributor
Kai Zou NetMind AI Contributor
Jianwen Xie Lambda Contributor
Yu Zhang Texas A&M University Corresponding Author
Wenhu Chen University of Waterloo Corresponding Author

The team is distributed across three universities and three industry labs. The TIGER-AI Lab (Texas A&M / Waterloo collaboration) has prior work in multi-modal reasoning, tool use, and benchmark construction. Wenhu Chen's group at Waterloo has produced influential work on table-based reasoning, web-scale QA, and multi-modal benchmarks. The industrial partners (Verdent AI, NetMind AI, Lambda) contributed infrastructure and model access.


3 Core Contribution

The Problem

Training deep research agents—systems that iteratively search, aggregate evidence, and reason over many steps—requires long-horizon trajectories that interleave search, browsing, and multi-step reasoning. However, existing data collection pipelines suffer from three critical limitations:

  1. Cost: Live web search APIs charge per query. At the scale of 97K+ trajectories with an average of 52.8 tool calls each (≈5.76M search requests), API costs become prohibitive ($5,760–$28,800 for Serper/SerpAPI alone)
  2. Instability: The live web changes constantly, making trajectory synthesis non-reproducible over time
  3. Analytical opacity: Without stable gold-document annotations, it is impossible to conduct controlled analyses of when relevant evidence is surfaced, opened, or missed

The Solution

OpenResearcher introduces a three-stage synthesis pipeline (question collection, offline corpus construction, and trajectory synthesis), followed by a student-training stage, that decouples one-time online bootstrapping from fully offline trajectory synthesis:

┌─────────────────────────────────────────────────────────────────┐
│                  OpenResearcher Pipeline                        │
│                                                                 │
│  Stage 1: Question Collection                                   │
│  ┌─────────────────┐                                            │
│  │  MiroVerse v0.1  │──→ 6K complex QA pairs                    │
│  │  (10% sample)    │    (requires deep, multi-hop reasoning)   │
│  └─────────────────┘                                            │
│                                                                 │
│  Stage 2: Offline Corpus Construction (one-time online cost)    │
│  ┌──────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │ Answer-guided │   │  FineWeb 15M     │   │  Qwen3-Emb-8B  │  │
│  │ bootstrapping │──→│  docs merged     │──→│  FAISS index    │  │
│  │ → 10K gold    │   │  with gold docs  │   │  (dense retr.)  │  │
│  └──────────────┘   └──────────────────┘   └────────────────┘  │
│                                                                 │
│  Stage 3: Trajectory Synthesis (fully offline)                  │
│  ┌───────────────┐   ┌───────────────┐   ┌─────────────────┐   │
│  │ GPT-OSS-120B  │──→│ Browser tools: │──→│ 97K+ trajectories│  │
│  │ (teacher)     │   │ search/open/  │   │ (55K after       │  │
│  │               │   │ find          │   │  rejection samp.) │  │
│  └───────────────┘   └───────────────┘   └─────────────────┘   │
│                                                                 │
│  Stage 4: Student Training (SFT)                                │
│  ┌───────────────┐   ┌───────────────────────────────────────┐  │
│  │ Nemotron-3    │──→│ OpenResearcher-30B-A3B                 │  │
│  │ Nano 30B-A3B  │   │ 54.8% BrowseComp-Plus (+34.0 pts)    │  │
│  │ (base)        │   │ 64.1% GAIA, 26.3% BrowseComp         │  │
│  └───────────────┘   └───────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Three Key Contributions

  1. Offline and reproducible synthesis. The expensive search-and-browse loop runs entirely offline after a one-time corpus bootstrapping stage. The resulting model outperforms larger-backbone deep research agents in both offline and live-web settings

  2. Explicit browser structure for deep research. Three minimal browser primitives—search, open, find—model the full hierarchy of information discovery, from corpus-level retrieval to document-level inspection to evidence-level localization

  3. Empirical insights into search-data and agent design. Five targeted research questions (RQ1–RQ5) provide the field's first controlled analysis of trajectory filtering, corpus construction, turn budgets, tool-space design, and the relationship between retrieval success and answer accuracy


4 Supported Solutions

Solution Type Support Level Details
Long-horizon deep research QA Primary target Multi-step search + evidence aggregation over 10–185 tool calls
Offline trajectory synthesis Core pipeline Reproducible generation of search-browse-reason trajectories
Open-web deep research Transfer target Trained on offline data, generalizes to live-web search (Serper API)
Closed-corpus research Evaluated BrowseComp-Plus benchmark with fixed FAISS-indexed corpus
Multi-hop QA Supported Subsumes simpler multi-hop tasks (2–5 hops) as special cases
Evidence localization Supported find primitive enables exact string matching within documents
SFT data generation Supported Pipeline produces teacher trajectories for distillation to smaller models

Task Complexity Spectrum

OpenResearcher explicitly targets the long-horizon tail of research tasks that prior systems cannot address:

Horizon Tool Calls Example Prior System OpenResearcher Coverage
Shallow retrieval 2–5 Search-R1 ✅ (subset)
Medium multi-hop 5–20 Standard RAG agents ✅ (well-covered)
Deep research 20–50 WebExplorer, MiroThinker ✅ (primary target)
Ultra-deep research 50–100+ No prior open system ✅ (substantial tail)
Maximum horizon 100–185 None ✅ (captured in data)

The trajectory distribution spans the full spectrum, with successful trajectories concentrated in the 10–40 range but a non-trivial portion exceeding 100 tool calls. This ensures downstream models learn both concise and complex reasoning patterns.


5 LLM Integration

Teacher Model: GPT-OSS-120B

The teacher model used for trajectory synthesis is GPT-OSS-120B (Agarwal et al., 2025), a large-scale open-source reasoning model. Key properties:

Property Value
Model GPT-OSS-120B
Role Teacher for trajectory generation
Access to reference answer ❌ (must recover answer through search)
Tool integration ReAct-style with 3 browser primitives
Context window Sufficient for 185-turn trajectories
Temperature/sampling Not specified; trajectories sampled 16× per question

The teacher operates in a ReAct-style loop (Yao et al., 2022), interleaving reasoning (chain-of-thought) with tool calls:

Trajectory H_T = {
  (query, system_prompt, tool_metadata),      # Initial context
  (reasoning_1, action_1, observation_1),     # Step 1
  (reasoning_2, action_2, observation_2),     # Step 2
  ...
  (reasoning_T, final_answer)                 # Termination
}

At each step t:
  r_t, a_t ~ π(· | H_{t-1})                  # Policy generates thought + action
  o_t = E(a_t)                                # Environment returns observation
  H_t = H_{t-1} ∪ {(r_t, a_t, o_t)}          # Trajectory grows
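The loop above can be sketched as a short Python function; here `policy` and `execute_tool` are hypothetical stand-ins for the teacher model and the offline search environment, not part of the released pipeline:

```python
def run_react_episode(question, policy, execute_tool, max_turns=100):
    """Minimal ReAct-style loop: the policy emits (reasoning, action);
    the environment returns an observation until a final answer is emitted."""
    history = [("query", question)]
    for _ in range(max_turns):
        reasoning, action = policy(history)               # r_t, a_t ~ pi(. | H_{t-1})
        if action["tool"] == "final_answer":
            history.append((reasoning, action, None))     # termination step
            return action["answer"], history
        observation = execute_tool(action)                # o_t = E(a_t)
        history.append((reasoning, action, observation))  # H_t = H_{t-1} + step t
    return None, history  # turn budget exhausted without a conclusive answer
```

In the real pipeline the environment is the local retrieval server and the policy is GPT-OSS-120B; the structure of the loop is the same.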

Student Model: Nemotron-3-Nano-30B-A3B

Property Value
Base model NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Architecture 30B total parameters, 3B active (Mixture-of-Experts)
Training framework Megatron-LM
Training data ~55K trajectories (rejection-sampled from 97K+)
Context length 256K tokens (pre-packed, no truncation)
Hardware 8× NVIDIA H100 GPUs
Training time ~8 hours
Learning rate 5×10⁻⁵ (no decay)
Batch size 64 (global)
Training steps 347

LLM as Browser-Augmented Research Agent

The LLM integration pattern is fundamentally different from prior agentic search systems. Rather than treating search as simple document retrieval, OpenResearcher models explicit browsing behavior through three tool primitives:

Tool Function Returns Analogy
search(query) Dense retrieval over FAISS index Top-K results with title, URL, snippet Typing a query into Google
open(url) Fetch full document content Complete document text Clicking a search result
find(string) Exact string match in current doc Matching passages with context Ctrl+F on a webpage

This three-level hierarchy enables multi-scale information discovery:

  • Corpus → Documents via search
  • Documents → Content via open
  • Content → Evidence via find
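A minimal in-memory sketch of the three primitives (the `Browser` class and its keyword scoring are illustrative assumptions; the real system serves search from a FAISS dense index over 15M documents):

```python
class Browser:
    """Toy search/open/find over an in-memory corpus: {url: (title, text)}."""

    def __init__(self, corpus):
        self.corpus = corpus
        self.current = None  # text of the most recently opened document

    def search(self, query, k=3):
        # Stand-in scoring: count of query terms in the document text.
        def score(text):
            return sum(text.lower().count(t) for t in query.lower().split())
        ranked = sorted(self.corpus.items(), key=lambda kv: -score(kv[1][1]))
        return [(url, title, text[:80]) for url, (title, text) in ranked[:k]]

    def open(self, url):
        self.current = self.corpus[url][1]
        return self.current

    def find(self, needle, window=40):
        # Exact string match in the current document, returned with context.
        i = self.current.find(needle)
        if i < 0:
            return None
        return self.current[max(0, i - window): i + len(needle) + window]
```

Each method narrows the scope exactly as in the table: search returns candidates with snippets, open commits to one document, find localizes evidence within it.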

Comparison: Tool-Space Ablation (RQ4)

Tool Configuration BrowseComp-Plus Accuracy
search only 44.10
search + open 52.02
search + open + find 54.81

Each additional primitive provides measurable gains, confirming that explicit browsing structure improves deep research performance.


6 Key Results

Primary Benchmark: BrowseComp-Plus (Closed-Web)

Method Category BrowseComp-Plus (%)
OpenResearcher-30B-A3B Ours 54.8
Tongyi DeepResearch Deep Research Agent 44.5
GPT-4.1 Foundation + Tools 36.4
Claude-4-Opus Foundation + Tools 36.8
Kimi-K2 Foundation + Tools 35.4
CutBill-30B-A3B Deep Research Agent 30.3
Gemini-2.5-Pro Foundation + Tools 29.5
Nemotron-3-Nano (base) Foundation + Tools 20.8
DeepSeek-R1 Foundation + Tools 16.4

+34.0 absolute improvement over the base Nemotron-3-Nano model. +18.4 points over GPT-4.1. +18.0 points over Claude-4-Opus. These gains are achieved via SFT alone—no reinforcement learning or online interaction.

Open-Web Deep Research Benchmarks

Method BrowseComp (%) GAIA (%) xbench-DeepSearch (%)
OpenResearcher 26.3 64.1 65.0
OpenAI o4-mini 28.3 55.8 67.0
Claude-4-Sonnet 12.2 57.6 64.0
Kimi-K2 14.1 57.7 50.0
DeepMiner-32B 21.2 54.4 53.0
WebSailor-72B 12.0 55.4 55.0
DeepSeek-R1 8.9 30.3 55.0
ASearcher-QwQ-32B 5.2 52.8 42.0
WebDancer-QwQ-32B 3.8 51.5 39.0

Crucially, OpenResearcher is trained solely on offline trajectories yet achieves competitive performance on live-web benchmarks. On GAIA (64.1%), it outperforms all listed baselines including OpenAI o4-mini (55.8%) and Claude-4-Sonnet (57.6%). This demonstrates that high-quality offline synthesis generalizes to dynamic, real-world search environments.

Trajectory Statistics

Metric Successful Failed All
Rate 56.7% 43.3% 100%
Avg. tool calls 38.4 71.7 52.8
Avg. searches 22.1 48.8 33.6
Max tool calls 172 185 185
Max searches 109 119 119

Key insight: failure stems from inefficient search, not insufficient exploration. Failed trajectories use nearly 2× as many tool calls, primarily driven by excess search operations (48.8 vs. 22.1). Successful trajectories converge on relevant documents earlier.

Pass@k Analysis

k Pass@k
1 0.567
2 0.638
3 0.681
4 0.710
8 0.766
16 0.792

The 20%+ gap between Pass@1 and Pass@16 indicates high solution diversity—many questions are solvable but only along certain reasoning paths. The per-question solve-rate distribution is bimodal: ~20% of questions have near-0% pass rate (extremely hard) and ~30% reach near-100% (robustly solvable).
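The Pass@k values above can be reproduced from per-question success counts with the standard unbiased estimator (Chen et al., 2021); a sketch, where `n` is the number of samples per question (16 here) and `c` the number that answered correctly:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c correct) succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k(16, c_q, k)` over all questions q yields the table; the bimodal solve-rate distribution shows up as many questions with c_q near 0 or near 16.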


7 Reproducibility

Fully Open Artifact Release

OpenResearcher releases every component necessary for reproduction:

Artifact Location Description
Pipeline code GitHub Complete synthesis pipeline
Offline search engine GitHub FAISS-indexed corpus + retrieval server
Synthesized trajectories HuggingFace 97K+ trajectories with metadata
Model checkpoint HuggingFace OpenResearcher-30B-A3B weights
Demo HuggingFace Spaces Interactive demo
QA data Included 6K processed question-answer pairs
Embedding model Public Qwen3-Embedding-8B (publicly available)

Deterministic Offline Environment

The offline design provides three reproducibility guarantees:

  1. No rate limits: Parallel synthesis at scale without API throttling
  2. Fully deterministic behavior: Same corpus + same queries → same retrieval results across runs
  3. Zero external dependencies: No proprietary APIs needed after one-time bootstrapping

Controlled Analysis Capability

Because the corpus, search backend, and browser actions are fixed, the system enables analysis impossible with live-web pipelines:

  • Gold-document tracking: For each question, the system knows which documents contain supporting evidence, enabling measurement of when gold documents are retrieved vs. opened vs. missed
  • Retrieval event tracing: Every search, open, and find operation is logged with the exact documents accessed
  • Causal analysis: RQ5 demonstrates that gold-document retrieval rate correlates with answer accuracy (29.54% hit rate with gold docs → 56.86% trajectory accuracy vs. 1.73% hit rate without → 43.81%)

What Is NOT Reproducible

  • One-time bootstrapping: The initial gold document collection uses the Serper API, which may return different results over time. However, the collected gold documents are included in the release
  • Teacher model behavior: GPT-OSS-120B's generation is stochastic; exact trajectories will differ across runs. The released trajectories are the canonical set
  • Live-web evaluation: BrowseComp, GAIA, and xbench-DeepSearch benchmarks use live search APIs, so evaluation results may vary

8 Compute and API Costs

Trajectory Synthesis Cost Comparison

Method Price per 1K Searches Total Cost (5.76M searches)
Serper API $1 $5,760
SerpAPI $5 $28,800
Offline retriever (OpenResearcher) $0 $0

The offline retriever eliminates all per-query search costs, making large-scale synthesis economically feasible.

One-Time Bootstrapping Costs

Component Estimated Cost Notes
Gold document retrieval ~$60 6K questions × ~10 Serper queries each
FineWeb corpus download Free Public dataset
Embedding generation Compute only Qwen3-Embedding-8B over 15M documents
FAISS index construction Compute only One-time indexing

Training Costs

Component Value
Hardware 8× NVIDIA H100 GPUs
Training duration ~8 hours
Estimated GPU-hours 64 H100-hours
Estimated cost (cloud) ~$200–$320 (at $3–5/H100-hour)
Framework Megatron-LM (distributed training)
Precision BF16
Context length 256K tokens

Teacher Model Inference Costs

Component Value
Model GPT-OSS-120B
Trajectories generated 97K+ (16 samples × 6K questions)
Avg. trajectory length 52.8 tool calls
Max trajectory length 185 tool calls
Total inference ~97K × [variable context] tokens

The total cost of the OpenResearcher pipeline is dominated by teacher-model inference (GPT-OSS-120B) and corpus embedding/indexing. The actual training of the student model is remarkably cheap (~64 H100-hours), demonstrating that the bottleneck in deep research is data quality, not model training.


9 Architecture Solution

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                     OpenResearcher Architecture                      │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    OFFLINE CORPUS LAYER                         │  │
│  │                                                                │  │
│  │  ┌──────────────┐   ┌────────────────────────┐                │  │
│  │  │  Gold Docs    │   │     FineWeb 15M         │                │  │
│  │  │  (10K docs)   │──→│     Documents           │                │  │
│  │  │  from Serper  │   │     (~10T tokens)        │                │  │
│  │  └──────────────┘   └────────────────────────┘                │  │
│  │           │                      │                              │  │
│  │           └──────┬───────────────┘                              │  │
│  │                  ▼                                              │  │
│  │        ┌──────────────────┐                                    │  │
│  │        │  Qwen3-Emb-8B    │                                    │  │
│  │        │  Dense Embeddings │                                    │  │
│  │        └────────┬─────────┘                                    │  │
│  │                 ▼                                              │  │
│  │        ┌──────────────────┐                                    │  │
│  │        │  FAISS Index      │                                    │  │
│  │        │  (15M+ vectors)   │                                    │  │
│  │        └────────┬─────────┘                                    │  │
│  └─────────────────┼──────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        SEARCH ENGINE LAYER                    │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │              Local Retrieval Server                      │  │  │
│  │  │                                                         │  │  │
│  │  │  search(query) ──→ Top-K (title, URL, snippet)          │  │  │
│  │  │  open(url)     ──→ Full document content                │  │  │
│  │  │  find(string)  ──→ Exact matches in current document    │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        AGENT LAYER                            │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │            GPT-OSS-120B Teacher Agent                    │  │  │
│  │  │                                                         │  │  │
│  │  │  Loop:                                                  │  │  │
│  │  │    1. Reason (chain-of-thought)                         │  │  │
│  │  │    2. Select tool (search | open | find)                │  │  │
│  │  │    3. Execute tool → receive observation                │  │  │
│  │  │    4. Update trajectory H_t                             │  │  │
│  │  │    5. Repeat until confident → emit final answer        │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                    │                                                  │
│  ┌─────────────────┼──────────────────────────────────────────────┐  │
│  │                 │        TRAINING LAYER                         │  │
│  │                 ▼                                              │  │
│  │  ┌─────────────────┐   ┌──────────────────────────────────┐   │  │
│  │  │  Trajectory      │   │  Nemotron-3-Nano-30B-A3B          │   │  │
│  │  │  Filtering       │──→│  Supervised Fine-Tuning           │   │  │
│  │  │  (reject wrong   │   │  (Megatron-LM, 256K ctx, 8×H100) │   │  │
│  │  │   answers)       │   │                                    │   │  │
│  │  └─────────────────┘   └──────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Design Principles

  1. Decoupling. Corpus construction (one-time, online) is cleanly separated from trajectory synthesis (repeatable, offline). This means the expensive bootstrapping pays off across unlimited synthesis runs.

  2. Explicit browsing. Rather than treating search as a monolithic retrieval operation, the system exposes three primitives that mirror human browsing: broad search → focused reading → targeted evidence finding.

  3. Teacher-student distillation. A large teacher model (GPT-OSS-120B) generates high-quality trajectories that are then distilled into a much smaller student (30B active parameters) via SFT.

  4. Rejection sampling. Only trajectories yielding correct final answers are retained for training (55K out of 97K+), ensuring the student learns from success patterns.


10 Component Breakdown

Component 1: QA Question Collection

Source: MiroVerse-v0.1 dataset (10% random sample → ~6K QA instances)

Selection Criteria:

  • Questions must require long-horizon, multi-hop reasoning over heterogeneous evidence
  • Standard benchmarks (2WikiMultiHopQA, Natural Questions) are explicitly rejected as too shallow
  • Even a strong teacher model needs dozens of tool calls for these questions, with a substantial tail exceeding 100 calls

Post-processing:

  • Answers normalized into concise, verifiable form
  • Partial trajectories from MiroVerse are discarded (unsuitable for direct supervision due to inconsistent quality)
  • All trajectories regenerated from scratch using only clean QA pairs

Component 2: Gold Document Retrieval

Purpose: Ensure the offline corpus contains evidence sufficient to answer each question

Method:

  1. Construct query = concatenation(question, reference_answer) for improved recall
  2. Retrieve via Serper API (one-time online step)
  3. Clean and deduplicate documents

Output: 10K gold documents for 6K questions (~1.67 gold docs per question average)

Critical Design Choice: Gold documents are used only for corpus construction, never during trajectory synthesis. The teacher model must independently find evidence.
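The bootstrapping step can be sketched as follows; `search_fn` stands in for the one-time Serper API call, and the content-hash deduplication is an assumed implementation detail:

```python
import hashlib

def bootstrap_gold_docs(qa_pairs, search_fn, per_question=10):
    """Answer-guided bootstrapping sketch: for each QA pair, query with the
    question concatenated to the reference answer, then dedup by content."""
    seen, gold = set(), []
    for question, answer in qa_pairs:
        query = f"{question} {answer}"  # answer-guided query for better recall
        for doc in search_fn(query)[:per_question]:
            h = hashlib.sha1(doc["text"].encode()).hexdigest()
            if h not in seen:  # drop exact duplicates across questions
                seen.add(h)
                gold.append(doc)
    return gold
```

The reference answer appears only in this offline bootstrapping query; the teacher never sees it during trajectory synthesis.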

Component 3: Offline Corpus

Property Value
Distractor documents 15M (from FineWeb)
Gold documents 10K (from online bootstrapping)
Total size ~15.01M documents
Token count ~10 trillion tokens
Gold-to-distractor ratio ~1:1,500
Purpose Approximate web-scale complexity

The extreme gold-to-distractor ratio ensures realistic search difficulty. The teacher model must locate relevant documents among 1,500× as many irrelevant ones.

Component 4: Dense Retrieval Engine

Property Value
Embedding model Qwen3-Embedding-8B
Index FAISS
Retrieval type Dense (vector similarity)
Query input Natural language (from agent search calls)
Output Ranked documents with title, URL, snippet

Component 5: Browser Tool Suite

Tool Input Output Scale
search(query) Natural language query Top-K results (title, URL, snippet) Corpus → Document candidates
open(url) Document URL Full document text Document → Full content
find(string) Exact string Matching passages + context Content → Evidence

Design rationale: Each tool narrows the information scope by one level. The find tool is critical for named-entity lookup and factual verification—tasks where scanning long documents in-context is unreliable.

Component 6: Trajectory Filtering Pipeline

Filtering criteria:

  1. Trajectories exceeding maximum context length → removed
  2. Trajectories with malformed tool calls → removed
  3. Trajectories failing to reach a conclusive answer within budget → removed
  4. (For training) Trajectories with incorrect final answers → removed via rejection sampling

Yield: 97K+ raw trajectories → ~55K training trajectories after rejection sampling
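The four filters compose into a single predicate; a sketch, with the trajectory field names (`token_count`, `steps`, `answer`, `gold_answer`) assumed rather than taken from the released schema:

```python
def keep_for_training(traj, max_tokens=262_144, max_turns=100):
    """Sketch of the filtering pipeline: returns True only for trajectories
    that fit in context, are well-formed, conclude in budget, and are correct."""
    if traj["token_count"] > max_tokens:                     # 1. context overflow
        return False
    if any(s["action"] is None for s in traj["steps"]):      # 2. malformed tool call
        return False
    if traj["answer"] is None or len(traj["steps"]) > max_turns:  # 3. no conclusion
        return False
    return traj["answer"] == traj["gold_answer"]             # 4. rejection sampling
```

Filters 1–3 are applied to all 97K+ trajectories; filter 4 is applied only when selecting the ~55K training set.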

Component 7: SFT Training Pipeline

Framework: Megatron-LM (distributed training)

Key design choices:

  • Sequences pre-packed to 256K tokens (no truncation)
  • Complete reasoning chains preserved end-to-end
  • Fixed configuration: LR=5×10⁻⁵, no warmup/decay, 347 steps, batch=64
  • No RL, no online interaction, no curriculum—pure SFT on answer-verified demonstrations
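Pre-packing can be sketched as greedy first-fit binning of trajectory token counts; the exact packing algorithm is not specified in the source, so this is one plausible implementation:

```python
def pack_sequences(lengths, max_len=262_144):
    """Greedy first-fit packing: group trajectory token counts into bins of
    at most max_len; sequences longer than max_len are excluded, never cut."""
    bins = []
    for n in sorted((n for n in lengths if n <= max_len), reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)  # fits in an existing bin
                break
        else:
            bins.append([n])  # open a new bin
    return bins
```

Packing whole trajectories (rather than truncating) preserves complete reasoning chains, at the cost of dropping the few trajectories that exceed 256K tokens on their own.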


11 Core Mechanisms (Detailed)

Mechanism 1: Offline-Online Decoupling

The central architectural innovation is the clean separation between online and offline phases:

ONLINE PHASE (one-time):
  for each (question, answer) in QA_pairs:
    query = concat(question, answer)
    gold_docs = serper_api.search(query)
    gold_docs = clean_and_dedup(gold_docs)
    corpus.add(gold_docs)

  distractor_docs = fineweb.sample(15_000_000)
  corpus.add(distractor_docs)

  embeddings = qwen3_embedding(corpus)
  index = faiss.build_index(embeddings)

OFFLINE PHASE (repeatable, scalable):
  for each question in QA_pairs:
    for seed in range(16):  # 16 samples per question
      trajectory = teacher.generate_trajectory(
        question=question,
        tools=[search, open, find],
        environment=offline_search_engine(index)
      )
      trajectories.append(trajectory)

Why this matters: Once the corpus and index are built, unlimited trajectory synthesis runs can be performed at zero marginal search cost. This enables:

  • Experimenting with different teacher models
  • Varying prompt strategies
  • Generating multiple samples per question (Pass@k analysis)
  • Ablation studies on tool configurations
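The offline retrieval core is dense nearest-neighbor search; a small numpy stand-in for the FAISS inner-product index (at the 15M-document scale, FAISS replaces the brute-force matmul below, but the semantics are the same):

```python
import numpy as np

def build_index(doc_embeddings):
    """Normalize document vectors so inner product equals cosine similarity."""
    norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return doc_embeddings / norms

def dense_search(index, query_embedding, k=5):
    """Return the top-k (doc_id, score) pairs for a query vector."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q                  # brute-force similarity to every doc
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))
```

In the pipeline, query and document vectors come from Qwen3-Embedding-8B, and `dense_search` backs the agent-facing search primitive.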

Mechanism 2: Multi-Scale Browsing Hierarchy

The three-primitive browsing model captures the hierarchical nature of human research:

Level 1: CORPUS SEARCH
  Agent: "What university did [person] attend?"
  search("person X education background") → 10 results

Level 2: DOCUMENT INSPECTION  
  Agent: "Result 3 looks promising, let me read the full article"
  open("https://corpus/doc_12345") → full text (may be 5K+ words)

Level 3: EVIDENCE LOCALIZATION
  Agent: "The article mentions education. Let me find the specific section"
  find("graduated from") → "...graduated from MIT in 1983..."

Ablation evidence (RQ4):

  • search only: Agent relies on incomplete snippets → 44.1%
  • search + open: Agent reads full docs but must scan long context → 52.0%
  • search + open + find: Agent explicitly localizes evidence → 54.8%

Adding open contributes +7.9 points and find a further +2.8, each gain coming from reducing the model's reliance on implicit reasoning over long contexts.

Mechanism 3: Rejection Sampling for Quality Control

Rather than training on all generated trajectories, OpenResearcher applies rejection sampling: only trajectories yielding correct final answers are retained.

Surprising finding (RQ1): Final-answer correctness is NOT the dominant indicator of training value.

Training Data BrowseComp-Plus
Correct trajectories only 54.81
Incorrect trajectories only 55.06
All trajectories 54.46

Training on incorrect trajectories alone slightly outperforms correct-only training. This suggests that even failed trajectories provide useful supervision about search structure, tool-use ordering, evidence inspection patterns, and stopping behavior.

Interpretation: The structural patterns of how to search—query formulation, document selection, evidence verification—are more important for learning than whether the final answer happens to be correct. This is a counterintuitive but practically important finding.

Mechanism 4: Corpus Coverage Bootstrapping

RQ2 ablation demonstrates why gold-document bootstrapping is critical:

Setting Gold Hit Rate Trajectory Accuracy BrowseComp-Plus
With gold docs 29.54% 56.86% 54.81
Without gold docs 1.73% 43.81% 6.35

Without gold documents, the model achieves only 6.35% on BrowseComp-Plus—a catastrophic 48-point drop. The 29.54% gold hit rate (compared to 1.73% without) confirms that answer-guided bootstrapping successfully seeds the corpus with retrievable evidence. However, even with bootstrapping, 70% of trajectories never retrieve a gold document, suggesting room for improvement in retrieval strategies.

Mechanism 5: Turn Budget Analysis

RQ3 investigates whether long-horizon capability requires a large turn budget at inference time:

Max Turns at Inference BrowseComp-Plus
15 47.12
30 50.10
50 52.70
75 53.48
100 54.81

Performance improves monotonically with the turn budget but with diminishing returns. The jump from 15 to 30 turns provides the largest marginal gain (+3.0 points), while the improvement from 75 to 100 turns is only +1.3 points. This suggests that most questions can be solved within 50 turns but a long tail benefits from extended exploration.

Mechanism 6: Retrieval-Accuracy Relationship

RQ5 provides the first controlled analysis of how retrieval success relates to final-answer accuracy in deep research:

Questions where at least one gold document is retrieved have significantly higher answer accuracy than questions where no gold document is found. However, the relationship is not deterministic—some questions are solved without retrieving any gold document (through indirect evidence), and some fail despite gold-document retrieval (through reasoning errors post-retrieval).

This nuanced finding challenges the simplistic assumption that "better retrieval = better answers" and highlights the importance of both retrieval and reasoning capabilities.


12 Programming Language

Component Language Framework/Library
Pipeline orchestration Python Custom
Trajectory synthesis Python GPT-OSS-120B API
Dense retrieval Python FAISS, Qwen3-Embedding-8B
Corpus processing Python Custom (cleaning, dedup)
Model training Python Megatron-LM
Evaluation Python Custom eval scripts
Web search (bootstrap) Python Serper API client

The entire system is Python-native, consistent with the ML research ecosystem. No multi-language complexity.

Code Organization (Inferred from Release)

OpenResearcher/
├── data/                    # QA data, processed questions
├── corpus/                  # Offline corpus construction
│   ├── bootstrap.py         # Gold document retrieval
│   ├── fineweb_sampler.py   # FineWeb document sampling
│   └── indexer.py           # FAISS index construction
├── search_engine/           # Local retrieval server
│   ├── server.py            # Retrieval API
│   └── browser_tools.py     # search/open/find implementations
├── synthesis/               # Trajectory generation
│   ├── agent.py             # Teacher agent (ReAct loop)
│   ├── prompts.py           # System prompts, tool metadata
│   └── filtering.py         # Trajectory quality filters
├── training/                # SFT pipeline
│   ├── data_prep.py         # Trajectory → training format
│   └── megatron_config.py   # Megatron-LM configuration
├── evaluation/              # Benchmark evaluation
│   ├── browsecomp_plus.py   # Closed-web eval
│   ├── browsecomp.py        # Open-web eval
│   ├── gaia.py              # GAIA benchmark
│   └── xbench.py            # xbench-DeepSearch
└── analysis/                # Research question analyses
    ├── rq1_filtering.py     # Correctness filtering ablation
    ├── rq2_corpus.py        # Corpus coverage ablation
    ├── rq3_turns.py         # Turn budget analysis
    ├── rq4_tools.py         # Tool-space ablation
    └── rq5_retrieval.py     # Retrieval-accuracy relationship

13 Memory Management

Context Window Management

The most critical memory challenge in OpenResearcher is managing the 256K-token context window during both trajectory synthesis and student training.

During Synthesis (Teacher)

The teacher model must maintain a growing trajectory context across up to 185 tool calls. Each step adds:

  • A reasoning chain-of-thought (variable length)
  • A tool call (structured JSON)
  • An observation (variable, especially for open which returns full documents)

Mitigation strategies:

  1. Search snippets are truncated to limit context growth
  2. open returns full document content, but the model learns to use find for targeted inspection rather than reading entire documents
  3. Trajectories exceeding context limits are filtered out
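Per-tool observation capping (strategy 1) might look like the following sketch; the specific character limits are assumptions, not values from the paper:

```python
def truncate_observation(tool, text, snippet_chars=400, doc_chars=20_000):
    """Clip a tool observation before appending it to the trajectory:
    search snippets get a tight budget, opened documents a larger one."""
    limit = snippet_chars if tool == "search" else doc_chars
    if len(text) <= limit:
        return text
    return text[:limit] + " [truncated]"
```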

During Training (Student)

Sequences are pre-packed to 256K tokens with no truncation. This is a deliberate design choice: truncation would break the reasoning chain and teach the model incomplete patterns.

| Training Memory Property | Value |
|---|---|
| Max context length | 256K tokens |
| Packing strategy | Pre-packed (no padding waste) |
| Truncation | None (trajectories that don't fit are excluded) |
| Framework | Megatron-LM (distributed across 8× H100) |
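The no-truncation packing policy can be sketched as a first-fit-decreasing bin packer; this is an illustrative choice, as the paper does not specify the exact packing algorithm:

```python
# Illustrative greedy packing of trajectories into 256K-token sequences.
# Trajectories longer than the budget are excluded, never truncated,
# matching the policy described above.
MAX_LEN = 256_000  # token budget per packed sequence (illustrative)

def pack_sequences(lengths, max_len=MAX_LEN):
    """First-fit-decreasing packing; returns (bins, dropped_indices)."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, dropped = [], []
    for i in order:
        if lengths[i] > max_len:
            dropped.append(i)  # does not fit: exclude rather than break the chain
            continue
        for b in bins:
            if b["used"] + lengths[i] <= max_len:
                b["items"].append(i)
                b["used"] += lengths[i]
                break
        else:
            bins.append({"items": [i], "used": lengths[i]})
    return bins, dropped
```

Packing short trajectories together this way avoids padding waste while preserving every reasoning chain intact.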

Corpus Memory

| Component | Memory Requirement |
|---|---|
| Raw corpus | ~15M documents × avg size → multi-TB on disk |
| FAISS index | Dense vectors for 15M documents (significant GPU/RAM) |
| Embeddings | 15M × embedding_dim (Qwen3-Embedding-8B: 4096-dim) |
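The FAISS row invites a quick back-of-envelope check, assuming unquantized float32 vectors in a flat index (the paper does not specify the index type; quantized indexes such as IVF-PQ would be far smaller):

```python
# Back-of-envelope footprint for a flat float32 index over the corpus.
# 15M docs and 4096 dims come from the table above.
NUM_DOCS = 15_000_000
DIM = 4096          # Qwen3-Embedding-8B output dimension
BYTES_PER_FLOAT = 4

index_bytes = NUM_DOCS * DIM * BYTES_PER_FLOAT
index_gib = index_bytes / 2**30
print(f"{index_gib:.0f} GiB")  # ≈ 229 GiB for the raw float32 vectors alone
```

At this scale the vectors alone exceed single-GPU memory, which is why the retrieval server runs as a separate service rather than inside the synthesis process.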

Trajectory Storage

| Property | Value |
|---|---|
| Total trajectories | 97K+ |
| Average tool calls per trajectory | 52.8 |
| Storage format | Structured (question, reasoning, action, observation) tuples |
| Availability | Released on HuggingFace Datasets |
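An illustrative trajectory record mirroring the (question, reasoning, action, observation) tuple structure above; the field names are assumptions, not the released dataset schema:

```python
# Hypothetical record layout for one stored trajectory.
trajectory = {
    "question": "Which paper introduced the transformer architecture?",
    "steps": [
        {
            "reasoning": "I should search for the origin of the transformer.",
            "action": {"tool": "search",
                       "args": {"query": "transformer architecture original paper"}},
            "observation": "[1] Attention Is All You Need ...",
        },
    ],
    "final_answer": "Attention Is All You Need (Vaswani et al., 2017)",
}

num_tool_calls = len(trajectory["steps"])  # averages 52.8 in the released set
```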

No External Memory System

OpenResearcher does not use any form of external memory, knowledge graph, or cross-trajectory learning. Each trajectory is generated independently. The system relies entirely on:

  1. The teacher model's parametric knowledge for search strategy
  2. The offline corpus for factual evidence
  3. The trajectory history (within-episode context) for coherent reasoning

This is a deliberate choice in favor of simplicity. Cross-trajectory or cross-question memory could improve search-strategy diversity, but it would complicate the pipeline and break the independence assumption that enables parallel synthesis.


14 Continued Learning

Current Learning Paradigm

OpenResearcher uses a single-pass SFT approach: the student model is trained once on the curated trajectory set. There is no iterative self-improvement, no RL phase, and no online adaptation.

Potential Continued Learning Extensions

The paper explicitly identifies several directions for continued learning:

1. Reinforcement Learning from Search Feedback

The offline environment provides exact reward signals: trajectory accuracy can be computed against gold answers, and retrieval success (gold-document hit rate) can serve as an intermediate reward. This enables:

  • RLHF-style optimization with search-specific rewards
  • Process reward models trained on step-level retrieval success
  • Outcome reward models trained on final-answer correctness
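A minimal sketch of how these two signals could be blended into a scalar reward; `compute_reward` and the exact-match outcome check are illustrative assumptions, not the paper's formulation:

```python
# Sketch: blend an outcome reward (answer correctness) with a process
# reward (gold-document hit rate), as the offline environment permits.
def compute_reward(predicted, gold_answer, opened_docs, gold_docs, alpha=0.5):
    """alpha weights the outcome term; (1 - alpha) weights retrieval success."""
    outcome = 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0
    hit_rate = len(set(opened_docs) & set(gold_docs)) / max(len(gold_docs), 1)
    return alpha * outcome + (1 - alpha) * hit_rate
```

The hit-rate term gives partial credit to trajectories that find the right evidence but fail the final synthesis, which is exactly the step-level signal a process reward model would train on.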

2. Self-Play Trajectory Refinement

The student model (OpenResearcher-30B-A3B) could generate its own trajectories, which are then filtered and used for further SFT rounds:

Round 1: Teacher (GPT-OSS-120B) → trajectories → SFT → Student v1
Round 2: Student v1 → trajectories → filter → SFT → Student v2
Round 3: Student v2 → trajectories → filter → SFT → Student v3
...

The RQ1 finding (incorrect trajectories have training value) suggests this could work even with moderate student-model accuracy.

3. Curriculum Over Trajectory Complexity

The trajectory distribution spans 10–185 tool calls. A curriculum strategy could:

  • Start training on short trajectories (10–30 tool calls)
  • Progressively introduce longer trajectories (50–100+)
  • Allocate more training weight to the long-tail trajectories
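The schedule above can be sketched as a step curriculum over trajectory length; the epoch thresholds here are assumptions for illustration, not values from the paper:

```python
# Illustrative curriculum: early epochs admit only short trajectories,
# later epochs admit the long tail up to the 185-call maximum.
def curriculum_threshold(epoch, schedule=((0, 30), (1, 100), (2, 185))):
    """Max tool calls admitted at a given epoch (step schedule)."""
    limit = schedule[0][1]
    for start, max_calls in schedule:
        if epoch >= start:
            limit = max_calls
    return limit

def select_trajectories(tool_call_counts, epoch):
    """Indices of trajectories admitted at this epoch."""
    limit = curriculum_threshold(epoch)
    return [i for i, n in enumerate(tool_call_counts) if n <= limit]
```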

4. Active Learning for Corpus Expansion

Questions where the teacher consistently fails (0/16 Pass@k) likely indicate corpus coverage gaps. An active learning loop could:

  • Identify consistently failed questions
  • Perform targeted online bootstrapping for those questions
  • Expand the corpus and re-synthesize trajectories
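The first step of that loop, flagging questions with zero successes across all attempts, can be sketched as follows; `coverage_gap_questions` is an illustrative name:

```python
# Sketch: flag questions the teacher fails on all k attempts (Pass@16 = 0),
# treating them as candidates for targeted corpus bootstrapping.
def coverage_gap_questions(attempts, k=16):
    """attempts: {question_id: [bool, ...]} of per-attempt correctness."""
    return [q for q, results in attempts.items()
            if len(results) >= k and not any(results[:k])]
```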

5. Multi-Teacher Ensemble

Different teacher models may produce complementary search strategies. Training on trajectories from multiple teachers could improve strategy diversity:

  • GPT-OSS-120B (primary teacher)
  • Claude-4-Opus (different search patterns)
  • DeepSeek-R1 (different reasoning style)
  • Open-source LRMs (cost-effective scaling)

What the Paper Does NOT Address

  • No discussion of catastrophic forgetting during SFT
  • No exploration of whether the student model's search strategies degrade on out-of-distribution questions
  • No analysis of how the model's performance changes with corpus drift (when the offline corpus becomes outdated)
  • No investigation of multi-task or multi-domain continued learning

15 Applications

Direct Applications

1. Autonomous Literature Review

The deep research agent can be deployed for automated literature surveys:

  • Issue complex research questions
  • Let the agent search, read, and synthesize findings across many documents
  • The agent produces evidence-grounded answers with source tracing

2. Fact-Checking and Verification

The search → open → find browsing hierarchy is naturally suited to fact-checking:

  • A broad search identifies relevant sources
  • Opening a document enables full-context reading
  • find localizes specific claims for verification

3. Competitive Intelligence

Multi-source evidence aggregation across business documents, news, and filings:

  • Long-horizon reasoning chains connect disparate data points
  • The system handles heterogeneous evidence quality and potential contradictions

4. Research Data Synthesis for SFT

The pipeline itself is an application: generating high-quality research trajectories for training smaller models. Any organization can:

  1. Define domain-specific QA pairs
  2. Bootstrap a corpus with answer-guided retrieval
  3. Run the offline synthesis pipeline
  4. SFT a model for domain-specific deep research
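The four-step recipe composes into a simple pipeline; the sketch below uses placeholder callables, not the released API, to make the data flow explicit:

```python
# Minimal orchestration sketch of the domain-adaptation recipe above.
# bootstrap, synthesize, and finetune are placeholders for the pipeline's
# corpus-bootstrapping, trajectory-synthesis, and SFT stages.
def build_domain_pipeline(qa_pairs, bootstrap, synthesize, finetune):
    corpus = bootstrap(qa_pairs)                 # answer-guided corpus bootstrapping
    trajectories = synthesize(qa_pairs, corpus)  # offline trajectory synthesis
    return finetune(trajectories)                # SFT for domain-specific research
```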

Benchmark Applications

| Benchmark | Type | OpenResearcher Performance | Application |
|---|---|---|---|
| BrowseComp-Plus | Closed-web deep research | 54.8% (SOTA) | Corpus-bound research |
| BrowseComp | Open-web deep research | 26.3% | Live web research |
| GAIA | General AI assistant | 64.1% | Multi-modal task solving |
| xbench-DeepSearch | Deep search evaluation | 65.0% | Search-intensive QA |

Broader Impact: The "Offline Training, Online Deployment" Pattern

OpenResearcher demonstrates a powerful paradigm for deep research agents:

Train offline, deploy online. A model trained exclusively on offline trajectories (no live-web exposure) transfers effectively to live-web search environments. This decouples the expensive data-generation phase from the deployment environment, enabling:

  • Cost-effective data generation at scale
  • Reproducible training pipelines
  • Controlled analysis of agent behavior
  • Domain-specific customization without API lock-in

Limitations for Applications

  1. Corpus currency: The offline corpus is a snapshot; it cannot answer questions about events after corpus construction
  2. Domain specificity: The current corpus (FineWeb) is general-purpose; domain-specific applications need domain corpora
  3. Single-language: The system is optimized for English-language research
  4. No multi-modal evidence: Documents are text-only; no image, table, or chart understanding
  5. Static browsing model: The three primitives do not cover interactive web elements (forms, dynamic content, JavaScript-rendered pages)

| System | Trajectory Source | Search Type | Open-Source | Max Tool Calls | Benchmark SOTA |
|---|---|---|---|---|---|
| OpenResearcher | Offline synthesis | Offline (FAISS) | ✅ Full | 185 | BrowseComp-Plus: 54.8% |
| Search-R1 | Online synthesis | Live API | Partial | 2–5 | N/A |
| WebExplorer | Online synthesis | Live API | Partial | Variable | N/A |
| MiroThinker | Online synthesis | Live API | Partial | Variable | N/A |
| DeepMiner-32B | Online synthesis | Live API | Partial | Variable | BrowseComp: 21.2% |
| ASearcher-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 52.8% |
| WebDancer-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 51.5% |

OpenResearcher is the only system that combines (1) fully offline synthesis, (2) complete artifact release, (3) 100+ tool call support, and (4) competitive performance against proprietary systems.