OpenResearcher
A fully open pipeline for synthesizing long-horizon deep research trajectories via offline corpus bootstrapping and browser-primitive-based browsing
Organization: TIGER-AI Lab (Texas A&M University, University of Waterloo, UC San Diego, Verdent AI, NetMind AI, Lambda)
Published: March 17, 2026 | Type: paper/repo/data/model
Report Type: PhD-Level Technical Analysis | Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
ArXiv: arXiv:2603.20278 (cs.IR / cs.AI / cs.CL)
Repository: github.com/TIGER-AI-Lab/OpenResearcher
Demo: HuggingFace Space
Dataset: OpenResearcher-Dataset
Model: OpenResearcher-30B-A3B
License: Open release (code, data, model checkpoints, offline search environment)
Status: Active open-source project with full artifact release
OpenResearcher is, to the authors' knowledge, the first fully open-source pipeline for deep research trajectory synthesis that produces a model rivaling proprietary systems on long-horizon search and reasoning tasks.
2 Authors and Team
| Author | Affiliation | Role |
|---|---|---|
| Zhuofeng Li | Texas A&M University | Project Lead, Corresponding Author |
| Dongfu Jiang | University of Waterloo | Project Lead |
| Xueguang Ma | University of Waterloo | Core Contributor |
| Haoxiang Zhang | UC San Diego | Core Contributor |
| Ping Nie | University of Waterloo | Core Contributor |
| Yuyu Zhang | Verdent AI | Contributor |
| Kai Zou | NetMind AI | Contributor |
| Jianwen Xie | Lambda | Contributor |
| Yu Zhang | Texas A&M University | Corresponding Author |
| Wenhu Chen | University of Waterloo | Corresponding Author |
The team is distributed across three universities and three industry labs. The TIGER-AI Lab (Texas A&M / Waterloo collaboration) has prior work in multi-modal reasoning, tool use, and benchmark construction. Wenhu Chen's group at Waterloo has produced influential work on table-based reasoning, web-scale QA, and multi-modal benchmarks. The industrial partners (Verdent AI, NetMind AI, Lambda) contributed infrastructure and model access.
3 Core Contribution
The Problem
Training deep research agents—systems that iteratively search, aggregate evidence, and reason over many steps—requires long-horizon trajectories that interleave search, browsing, and multi-step reasoning. However, existing data collection pipelines suffer from three critical limitations:
- Cost: Live web search APIs charge per query. At the scale of 97K+ trajectories with an average of 52.8 tool calls each (≈5.76M search requests), API costs become prohibitive ($5,760–$28,800 for Serper/SerpAPI alone)
- Instability: The live web changes constantly, making trajectory synthesis non-reproducible over time
- Analytical opacity: Without stable gold-document annotations, it is impossible to conduct controlled analyses of when relevant evidence is surfaced, opened, or missed
The Solution
OpenResearcher introduces a three-stage pipeline that decouples one-time online bootstrapping from fully offline trajectory synthesis:
┌─────────────────────────────────────────────────────────────────┐
│ OpenResearcher Pipeline │
│ │
│ Stage 1: Question Collection │
│ ┌─────────────────┐ │
│ │ MiroVerse v0.1 │──→ 6K complex QA pairs │
│ │ (10% sample) │ (requires deep, multi-hop reasoning) │
│ └─────────────────┘ │
│ │
│ Stage 2: Offline Corpus Construction (one-time online cost) │
│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────┐ │
│ │ Answer-guided │ │ FineWeb 15M │ │ Qwen3-Emb-8B │ │
│ │ bootstrapping │──→│ docs merged │──→│ FAISS index │ │
│ │ → 10K gold │ │ with gold docs │ │ (dense retr.) │ │
│ └──────────────┘ └──────────────────┘ └────────────────┘ │
│ │
│ Stage 3: Trajectory Synthesis (fully offline) │
│ ┌───────────────┐ ┌───────────────┐ ┌─────────────────┐ │
│ │ GPT-OSS-120B │──→│ Browser tools: │──→│ 97K+ trajectories│ │
│ │ (teacher) │ │ search/open/ │ │ (55K after │ │
│ │ │ │ find │ │ rejection samp.) │ │
│ └───────────────┘ └───────────────┘ └─────────────────┘ │
│ │
│ Stage 4: Student Training (SFT) │
│ ┌───────────────┐ ┌───────────────────────────────────────┐ │
│ │ Nemotron-3 │──→│ OpenResearcher-30B-A3B │ │
│ │ Nano 30B-A3B │ │ 54.8% BrowseComp-Plus (+34.0 pts) │ │
│ │ (base) │ │ 64.1% GAIA, 26.3% BrowseComp │ │
│ └───────────────┘ └───────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Three Key Contributions
1. Offline and reproducible synthesis. The expensive search-and-browse loop runs entirely offline after a one-time corpus bootstrapping stage. The resulting model outperforms larger-backbone deep research agents in both offline and live-web settings.
2. Explicit browser structure for deep research. Three minimal browser primitives (search, open, find) model the full hierarchy of information discovery, from corpus-level retrieval to document-level inspection to evidence-level localization.
3. Empirical insights into search-data and agent design. Five targeted research questions (RQ1–RQ5) provide the field's first controlled analysis of trajectory filtering, corpus construction, turn budgets, tool-space design, and the relationship between retrieval success and answer accuracy.
4 Supported Solutions
| Solution Type | Support Level | Details |
|---|---|---|
| Long-horizon deep research QA | Primary target | Multi-step search + evidence aggregation over 10–185 tool calls |
| Offline trajectory synthesis | Core pipeline | Reproducible generation of search-browse-reason trajectories |
| Open-web deep research | Transfer target | Trained on offline data, generalizes to live-web search (Serper API) |
| Closed-corpus research | Evaluated | BrowseComp-Plus benchmark with fixed FAISS-indexed corpus |
| Multi-hop QA | Supported | Subsumes simpler multi-hop tasks (2–5 hops) as special cases |
| Evidence localization | Supported | find primitive enables exact string matching within documents |
| SFT data generation | Supported | Pipeline produces teacher trajectories for distillation to smaller models |
Task Complexity Spectrum
OpenResearcher explicitly targets the long-horizon tail of research tasks that prior systems cannot address:
| Horizon | Tool Calls | Example Prior System | OpenResearcher Coverage |
|---|---|---|---|
| Shallow retrieval | 2–5 | Search-R1 | ✅ (subset) |
| Medium multi-hop | 5–20 | Standard RAG agents | ✅ (well-covered) |
| Deep research | 20–50 | WebExplorer, MiroThinker | ✅ (primary target) |
| Ultra-deep research | 50–100+ | No prior open system | ✅ (substantial tail) |
| Maximum horizon | 100–185 | None | ✅ (captured in data) |
The trajectory distribution spans the full spectrum, with successful trajectories concentrated in the 10–40 range but a non-trivial portion exceeding 100 tool calls. This ensures downstream models learn both concise and complex reasoning patterns.
5 LLM Integration
Teacher Model: GPT-OSS-120B
The teacher model used for trajectory synthesis is GPT-OSS-120B (Agarwal et al., 2025), a large-scale open-source reasoning model. Key properties:
| Property | Value |
|---|---|
| Model | GPT-OSS-120B |
| Role | Teacher for trajectory generation |
| Access to reference answer | ❌ (must recover answer through search) |
| Tool integration | ReAct-style with 3 browser primitives |
| Context window | Sufficient for 185-turn trajectories |
| Temperature/sampling | Not specified; trajectories sampled 16× per question |
The teacher operates in a ReAct-style loop (Yao et al., 2022), interleaving reasoning (chain-of-thought) with tool calls:
Trajectory H_T = {
(query, system_prompt, tool_metadata), # Initial context
(reasoning_1, action_1, observation_1), # Step 1
(reasoning_2, action_2, observation_2), # Step 2
...
(reasoning_T, final_answer) # Termination
}
At each step t:
r_t, a_t ~ π(· | H_{t-1}) # Policy generates thought + action
o_t = E(a_t) # Environment returns observation
H_t = H_{t-1} ∪ {(r_t, a_t, o_t)} # Trajectory grows
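The loop above can be sketched as runnable Python. The toy policy and environment below are illustrative stand-ins; the real system drives GPT-OSS-120B against the offline search engine:

```python
from typing import Callable, List, Tuple

def react_loop(
    question: str,
    policy: Callable[[list], Tuple[str, str]],   # H_{t-1} -> (reasoning, action)
    environment: Callable[[str], str],           # action -> observation
    max_turns: int = 100,
) -> list:
    """ReAct-style loop: the trajectory grows one (reasoning, action,
    observation) triple per step until the policy emits a final answer
    or the turn budget is exhausted."""
    history = [("question", question)]           # initial context
    for _ in range(max_turns):
        reasoning, action = policy(history)      # r_t, a_t ~ pi(. | H_{t-1})
        if action.startswith("final_answer:"):
            history.append((reasoning, action, None))   # termination step
            return history
        observation = environment(action)        # o_t = E(a_t)
        history.append((reasoning, action, observation))  # H_t = H_{t-1} + step
    return history

# Toy policy: search once, then answer with whatever was observed.
def toy_policy(history):
    if len(history) == 1:
        return ("I should search.", "search: capital of France")
    return ("Found it.", f"final_answer: {history[-1][2]}")

def toy_environment(action):
    return "Paris"  # stub retrieval result

traj = react_loop("What is the capital of France?", toy_policy, toy_environment)
print(traj[-1][1])  # → final_answer: Paris
```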
Student Model: Nemotron-3-Nano-30B-A3B
| Property | Value |
|---|---|
| Base model | NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 |
| Architecture | 30B total parameters, 3B active (Mixture-of-Experts) |
| Training framework | Megatron-LM |
| Training data | ~55K trajectories (rejection-sampled from 97K+) |
| Context length | 256K tokens (pre-packed, no truncation) |
| Hardware | 8× NVIDIA H100 GPUs |
| Training time | ~8 hours |
| Learning rate | 5×10⁻⁵ (no decay) |
| Batch size | 64 (global) |
| Training steps | 347 |
LLM as Browser-Augmented Research Agent
The LLM integration pattern is fundamentally different from prior agentic search systems. Rather than treating search as simple document retrieval, OpenResearcher models explicit browsing behavior through three tool primitives:
| Tool | Function | Returns | Analogy |
|---|---|---|---|
| search(query) | Dense retrieval over FAISS index | Top-K results with title, URL, snippet | Typing a query into Google |
| open(url) | Fetch full document content | Complete document text | Clicking a search result |
| find(string) | Exact string match in current doc | Matching passages with context | Ctrl+F on a webpage |
This three-level hierarchy enables multi-scale information discovery:
- Corpus → Documents via search
- Documents → Content via open
- Content → Evidence via find
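A minimal in-memory sketch of the three primitives. The toy corpus and keyword-overlap scoring are assumptions for illustration; the released server scores search() with Qwen3-Embedding-8B over a FAISS index:

```python
# Toy in-memory stand-in for the three browser primitives.
CORPUS = {
    "doc://1": ("Ada Lovelace", "Ada Lovelace wrote notes on the Analytical "
                "Engine. She graduated from private tutoring in mathematics."),
    "doc://2": ("Analytical Engine", "The Analytical Engine was designed by "
                "Charles Babbage."),
}

_current_doc = {"text": ""}  # find() operates on the last opened document

def search(query: str, k: int = 2):
    """Corpus -> document candidates: rank docs by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: -len(terms & set(kv[1][1].lower().split())))
    return [(url, title, body[:60]) for url, (title, body) in scored[:k]]

def open_doc(url: str) -> str:
    """Document -> full content: fetch and remember the current document."""
    _current_doc["text"] = CORPUS[url][1]
    return _current_doc["text"]

def find(string: str, window: int = 30) -> list:
    """Content -> evidence: exact string match with surrounding context."""
    text, hits, start = _current_doc["text"], [], 0
    while (i := text.find(string, start)) != -1:
        hits.append(text[max(0, i - window): i + len(string) + window])
        start = i + 1
    return hits

results = search("Ada Lovelace notes")
open_doc(results[0][0])
print(find("graduated"))
```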
Comparison: Tool-Space Ablation (RQ4)
| Tool Configuration | BrowseComp-Plus Accuracy |
|---|---|
| search only | 44.10 |
| search + open | 52.02 |
| search + open + find | 54.81 |
Each additional primitive provides measurable gains, confirming that explicit browsing structure improves deep research performance.
6 Key Results
Primary Benchmark: BrowseComp-Plus (Closed-Web)
| Method | Category | BrowseComp-Plus (%) |
|---|---|---|
| OpenResearcher-30B-A3B | Ours | 54.8 |
| Tongyi DeepResearch | Deep Research Agent | 44.5 |
| GPT-4.1 | Foundation + Tools | 36.4 |
| Claude-4-Opus | Foundation + Tools | 36.8 |
| Kimi-K2 | Foundation + Tools | 35.4 |
| CutBill-30B-A3B | Deep Research Agent | 30.3 |
| Gemini-2.5-Pro | Foundation + Tools | 29.5 |
| Nemotron-3-Nano (base) | Foundation + Tools | 20.8 |
| DeepSeek-R1 | Foundation + Tools | 16.4 |
+34.0 absolute improvement over the base Nemotron-3-Nano model. +18.4 points over GPT-4.1. +18.0 points over Claude-4-Opus. These gains are achieved via SFT alone—no reinforcement learning or online interaction.
Open-Web Deep Research Benchmarks
| Method | BrowseComp (%) | GAIA (%) | xbench-DeepSearch (%) |
|---|---|---|---|
| OpenResearcher | 26.3 | 64.1 | 65.0 |
| OpenAI o4-mini | 28.3 | 55.8 | 67.0 |
| Claude-4-Sonnet | 12.2 | 57.6 | 64.0 |
| Kimi-K2 | 14.1 | 57.7 | 50.0 |
| DeepMiner-32B | 21.2 | 54.4 | 53.0 |
| WebSailor-72B | 12.0 | 55.4 | 55.0 |
| DeepSeek-R1 | 8.9 | 30.3 | 55.0 |
| ASearcher-QwQ-32B | 5.2 | 52.8 | 42.0 |
| WebDancer-QwQ-32B | 3.8 | 51.5 | 39.0 |
Crucially, OpenResearcher is trained solely on offline trajectories yet achieves competitive performance on live-web benchmarks. On GAIA (64.1%), it outperforms all listed baselines including OpenAI o4-mini (55.8%) and Claude-4-Sonnet (57.6%). This demonstrates that high-quality offline synthesis generalizes to dynamic, real-world search environments.
Trajectory Statistics
| Metric | Successful | Failed | All |
|---|---|---|---|
| Rate | 56.7% | 43.3% | 100% |
| Avg. tool calls | 38.4 | 71.7 | 52.8 |
| Avg. searches | 22.1 | 48.8 | 33.6 |
| Max tool calls | 172 | 185 | 185 |
| Max searches | 109 | 119 | 119 |
Key insight: failure stems from inefficient search, not insufficient exploration. Failed trajectories use nearly 2× as many tool calls, primarily driven by excess search operations (48.8 vs. 22.1 searches on average). Successful trajectories converge on relevant documents earlier.
Pass@k Analysis
| k | Pass@k |
|---|---|
| 1 | 0.567 |
| 2 | 0.638 |
| 3 | 0.681 |
| 4 | 0.710 |
| 8 | 0.766 |
| 16 | 0.792 |
The 20%+ gap between Pass@1 and Pass@16 indicates high solution diversity—many questions are solvable but only along certain reasoning paths. The per-question solve-rate distribution is bimodal: ~20% of questions have near-0% pass rate (extremely hard) and ~30% reach near-100% (robustly solvable).
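The paper does not specify its estimator; a common choice is the unbiased Pass@k estimator of Chen et al. (2021), computed per question over the n=16 samples and then averaged across questions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k draws
    (without replacement) from n samples with c correct succeeds.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Aggregate over questions: mean of per-question estimates."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# e.g. three questions with 0, 8, and 16 correct out of 16 samples:
print(mean_pass_at_k([0, 8, 16], 16, 1))  # → 0.5
```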
7 Reproducibility
Fully Open Artifact Release
OpenResearcher releases every component necessary for reproduction:
| Artifact | Location | Description |
|---|---|---|
| Pipeline code | GitHub | Complete synthesis pipeline |
| Offline search engine | GitHub | FAISS-indexed corpus + retrieval server |
| Synthesized trajectories | HuggingFace | 97K+ trajectories with metadata |
| Model checkpoint | HuggingFace | OpenResearcher-30B-A3B weights |
| Demo | HuggingFace Spaces | Interactive demo |
| QA data | Included | 6K processed question-answer pairs |
| Embedding model | Public | Qwen3-Embedding-8B (publicly available) |
Deterministic Offline Environment
The offline design provides three reproducibility guarantees:
- No rate limits: Parallel synthesis at scale without API throttling
- Fully deterministic behavior: Same corpus + same queries → same retrieval results across runs
- Zero external dependencies: No proprietary APIs needed after one-time bootstrapping
Controlled Analysis Capability
Because the corpus, search backend, and browser actions are fixed, the system enables analysis impossible with live-web pipelines:
- Gold-document tracking: For each question, the system knows which documents contain supporting evidence, enabling measurement of when gold documents are retrieved vs. opened vs. missed
- Retrieval event tracing: Every search, open, and find operation is logged with the exact documents accessed
- Causal analysis: RQ5 demonstrates that gold-document retrieval rate correlates with answer accuracy (29.54% hit rate with gold docs → 56.86% trajectory accuracy vs. 1.73% hit rate without → 43.81%)
What Is NOT Reproducible
- One-time bootstrapping: The initial gold document collection uses the Serper API, which may return different results over time. However, the collected gold documents are included in the release
- Teacher model behavior: GPT-OSS-120B's generation is stochastic; exact trajectories will differ across runs. The released trajectories are the canonical set
- Live-web evaluation: BrowseComp, GAIA, and xbench-DeepSearch benchmarks use live search APIs, so evaluation results may vary
8 Compute and API Costs
Trajectory Synthesis Cost Comparison
| Method | Price per 1K Searches | Total Cost (5.76M searches) |
|---|---|---|
| Serper API | $1 | $5,760 |
| SerpAPI | $5 | $28,800 |
| Offline retriever (OpenResearcher) | $0 | $0 |
The offline retriever eliminates all per-query search costs, making large-scale synthesis economically feasible.
One-Time Bootstrapping Costs
| Component | Estimated Cost | Notes |
|---|---|---|
| Gold document retrieval | ~$60 | 6K questions × ~10 Serper queries each |
| FineWeb corpus download | Free | Public dataset |
| Embedding generation | Compute only | Qwen3-Embedding-8B over 15M documents |
| FAISS index construction | Compute only | One-time indexing |
Training Costs
| Component | Value |
|---|---|
| Hardware | 8× NVIDIA H100 GPUs |
| Training duration | ~8 hours |
| Estimated GPU-hours | 64 H100-hours |
| Estimated cost (cloud) | ~$200–$320 (at $3–5/H100-hour) |
| Framework | Megatron-LM (distributed training) |
| Precision | BF16 |
| Context length | 256K tokens |
Teacher Model Inference Costs
| Component | Value |
|---|---|
| Model | GPT-OSS-120B |
| Trajectories generated | 97K+ (16 samples × 6K questions) |
| Avg. trajectory length | 52.8 tool calls |
| Max trajectory length | 185 tool calls |
| Total inference | ~97K × [variable context] tokens |
The total cost of the OpenResearcher pipeline is dominated by teacher-model inference (GPT-OSS-120B) and corpus embedding/indexing. The actual training of the student model is remarkably cheap (~64 H100-hours), demonstrating that the bottleneck in deep research is data quality, not model training.
9 Architecture Solution
System Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ OpenResearcher Architecture │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ OFFLINE CORPUS LAYER │ │
│ │ │ │
│ │ ┌──────────────┐ ┌────────────────────────┐ │ │
│ │ │ Gold Docs │ │ FineWeb 15M │ │ │
│ │ │ (10K docs) │──→│ Documents │ │ │
│ │ │ from Serper │ │ (~10T tokens) │ │ │
│ │ └──────────────┘ └────────────────────────┘ │ │
│ │ │ │ │ │
│ │ └──────┬───────────────┘ │ │
│ │ ▼ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Qwen3-Emb-8B │ │ │
│ │ │ Dense Embeddings │ │ │
│ │ └────────┬─────────┘ │ │
│ │ ▼ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ FAISS Index │ │ │
│ │ │ (15M+ vectors) │ │ │
│ │ └────────┬─────────┘ │ │
│ └─────────────────┼──────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┼──────────────────────────────────────────────┐ │
│ │ │ SEARCH ENGINE LAYER │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Local Retrieval Server │ │ │
│ │ │ │ │ │
│ │ │ search(query) ──→ Top-K (title, URL, snippet) │ │ │
│ │ │ open(url) ──→ Full document content │ │ │
│ │ │ find(string) ──→ Exact matches in current document │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┼──────────────────────────────────────────────┐ │
│ │ │ AGENT LAYER │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ GPT-OSS-120B Teacher Agent │ │ │
│ │ │ │ │ │
│ │ │ Loop: │ │ │
│ │ │ 1. Reason (chain-of-thought) │ │ │
│ │ │ 2. Select tool (search | open | find) │ │ │
│ │ │ 3. Execute tool → receive observation │ │ │
│ │ │ 4. Update trajectory H_t │ │ │
│ │ │ 5. Repeat until confident → emit final answer │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┼──────────────────────────────────────────────┐ │
│ │ │ TRAINING LAYER │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ ┌──────────────────────────────────┐ │ │
│ │ │ Trajectory │ │ Nemotron-3-Nano-30B-A3B │ │ │
│ │ │ Filtering │──→│ Supervised Fine-Tuning │ │ │
│ │ │ (reject wrong │ │ (Megatron-LM, 256K ctx, 8×H100) │ │ │
│ │ │ answers) │ │ │ │ │
│ │ └─────────────────┘ └──────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Design Principles
1. Decoupling. Corpus construction (one-time, online) is cleanly separated from trajectory synthesis (repeatable, offline). This means the expensive bootstrapping pays off across unlimited synthesis runs.
2. Explicit browsing. Rather than treating search as a monolithic retrieval operation, the system exposes three primitives that mirror human browsing: broad search → focused reading → targeted evidence finding.
3. Teacher-student distillation. A large teacher model (GPT-OSS-120B) generates high-quality trajectories that are then distilled into a much smaller student (30B total, 3B active parameters) via SFT.
4. Rejection sampling. Only trajectories yielding correct final answers are retained for training (55K out of 97K+), ensuring the student learns from success patterns.
10 Component Breakdown
Component 1: QA Question Collection
Source: MiroVerse-v0.1 dataset (10% random sample → ~6K QA instances)
Selection Criteria:
- Questions must require long-horizon, multi-hop reasoning over heterogeneous evidence
- Standard benchmarks (2WikiMultiHopQA, Natural Questions) are explicitly rejected as too shallow
- Even a strong teacher model needs dozens of tool calls for these questions, with a substantial tail exceeding 100 calls
Post-processing:
- Answers normalized into concise, verifiable form
- Partial trajectories from MiroVerse are discarded (unsuitable for direct supervision due to inconsistent quality)
- All trajectories regenerated from scratch using only clean QA pairs
Component 2: Gold Document Retrieval
Purpose: Ensure the offline corpus contains evidence sufficient to answer each question
Method:
1. Construct query = concatenation(question, reference_answer) for improved recall
2. Retrieve via Serper API (one-time online step)
3. Clean and deduplicate documents
Output: 10K gold documents for 6K questions (~1.67 gold docs per question average)
Critical Design Choice: Gold documents are used only for corpus construction, never during trajectory synthesis. The teacher model must independently find evidence.
Component 3: Offline Corpus
| Property | Value |
|---|---|
| Distractor documents | 15M (from FineWeb) |
| Gold documents | 10K (from online bootstrapping) |
| Total size | ~15.01M documents |
| Token count | ~10 trillion tokens |
| Gold-to-distractor ratio | ~1:1,500 |
| Purpose | Approximate web-scale complexity |
The extreme gold-to-distractor ratio ensures realistic search difficulty. The teacher model must locate relevant documents among 1,500× as many irrelevant ones.
Component 4: Dense Retrieval Engine
| Property | Value |
|---|---|
| Embedding model | Qwen3-Embedding-8B |
| Index | FAISS |
| Retrieval type | Dense (vector similarity) |
| Query input | Natural language (from agent search calls) |
| Output | Ranked documents with title, URL, snippet |
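A NumPy stand-in illustrating the retrieval contract: normalized embeddings, inner-product scoring, top-K results. The real index holds ~15M Qwen3-Embedding-8B vectors in FAISS; the random vectors and dimensions here are placeholders:

```python
import numpy as np

# Random stand-in corpus embeddings, L2-normalized so that inner
# product equals cosine similarity (as in an IP-metric FAISS index).
rng = np.random.default_rng(0)
dim, n_docs = 64, 1000
doc_vecs = rng.normal(size=(n_docs, dim)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def dense_search(query_vec: np.ndarray, k: int = 5):
    """Return (indices, scores) of the top-k documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q                      # one inner product per document
    top = np.argpartition(-scores, k)[:k]      # top-k, unordered
    top = top[np.argsort(-scores[top])]        # sort the k by score
    return top, scores[top]

idx, scores = dense_search(doc_vecs[42])       # query with doc 42's own vector
print(idx[0])  # the nearest neighbour of a stored vector is itself → 42
```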
Component 5: Browser Tool Suite
| Tool | Input | Output | Scale |
|---|---|---|---|
| search(query) | Natural language query | Top-K results (title, URL, snippet) | Corpus → Document candidates |
| open(url) | Document URL | Full document text | Document → Full content |
| find(string) | Exact string | Matching passages + context | Content → Evidence |
Design rationale: Each tool narrows the information scope by one level. The find tool is critical for named-entity lookup and factual verification—tasks where scanning long documents in-context is unreliable.
Component 6: Trajectory Filtering Pipeline
Filtering criteria:
1. Trajectories exceeding maximum context length → removed
2. Trajectories with malformed tool calls → removed
3. Trajectories failing to reach a conclusive answer within budget → removed
4. (For training) Trajectories with incorrect final answers → removed via rejection sampling
Yield: 97K+ raw trajectories → ~55K training trajectories after rejection sampling
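A hedged sketch of the four-stage filter. The trajectory schema below ("answer", "tool_calls", "n_tokens") is illustrative, not the released format, and the turn budget is an assumed constant:

```python
# Illustrative filter mirroring the four criteria above; field names
# and MAX_TURNS are assumptions, not the released schema.
MAX_CONTEXT_TOKENS = 256_000
MAX_TURNS = 200

def is_well_formed(call: dict) -> bool:
    return call.get("tool") in {"search", "open", "find"} and "args" in call

def keep_for_training(traj: dict, gold_answer: str) -> bool:
    if traj["n_tokens"] > MAX_CONTEXT_TOKENS:                   # 1. fits in context
        return False
    if not all(is_well_formed(c) for c in traj["tool_calls"]):  # 2. valid tool calls
        return False
    if traj["answer"] is None or len(traj["tool_calls"]) > MAX_TURNS:  # 3. concluded in budget
        return False
    # 4. rejection sampling: keep only answer-verified trajectories
    return traj["answer"].strip().lower() == gold_answer.strip().lower()

good = {"n_tokens": 40_000, "answer": "Paris",
        "tool_calls": [{"tool": "search", "args": {"query": "capital of France"}}]}
bad = dict(good, answer="Lyon")
print([keep_for_training(t, "Paris") for t in (good, bad)])  # → [True, False]
```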
Component 7: SFT Training Pipeline
Framework: Megatron-LM (distributed training)
Key design choices:
- Sequences pre-packed to 256K tokens (no truncation)
- Complete reasoning chains preserved end-to-end
- Fixed configuration: LR=5×10⁻⁵, no warmup/decay, 347 steps, batch=64
- No RL, no online interaction, no curriculum; pure SFT on answer-verified demonstrations
11 Core Mechanisms (Detailed)
Mechanism 1: Offline-Online Decoupling
The central architectural innovation is the clean separation between online and offline phases:
ONLINE PHASE (one-time):
for each (question, answer) in QA_pairs:
query = concat(question, answer)
gold_docs = serper_api.search(query)
gold_docs = clean_and_dedup(gold_docs)
corpus.add(gold_docs)
distractor_docs = fineweb.sample(15_000_000)
corpus.add(distractor_docs)
embeddings = qwen3_embedding(corpus)
index = faiss.build_index(embeddings)
OFFLINE PHASE (repeatable, scalable):
for each question in QA_pairs:
for seed in range(16): # 16 samples per question
trajectory = teacher.generate_trajectory(
question=question,
tools=[search, open, find],
environment=offline_search_engine(index)
)
trajectories.append(trajectory)
Why this matters: Once the corpus and index are built, unlimited trajectory synthesis runs can be performed at zero marginal search cost. This enables:
- Experimenting with different teacher models
- Varying prompt strategies
- Generating multiple samples per question (Pass@k analysis)
- Ablation studies on tool configurations
Mechanism 2: Multi-Scale Browsing Hierarchy
The three-primitive browsing model captures the hierarchical nature of human research:
Level 1: CORPUS SEARCH
Agent: "What university did [person] attend?"
search("person X education background") → 10 results
Level 2: DOCUMENT INSPECTION
Agent: "Result 3 looks promising, let me read the full article"
open("https://corpus/doc_12345") → full text (may be 5K+ words)
Level 3: EVIDENCE LOCALIZATION
Agent: "The article mentions education. Let me find the specific section"
find("graduated from") → "...graduated from MIT in 1983..."
Ablation evidence (RQ4):
- search only: Agent relies on incomplete snippets → 44.1%
- search + open: Agent reads full docs but must scan long context → 52.0%
- search + open + find: Agent explicitly localizes evidence → 54.8%
Adding open contributes +7.9 points and adding find a further +2.8, each level reducing the model's reliance on implicit reasoning over long contexts.
Mechanism 3: Rejection Sampling for Quality Control
Rather than training on all generated trajectories, OpenResearcher applies rejection sampling: only trajectories yielding correct final answers are retained.
Surprising finding (RQ1): Final-answer correctness is NOT the dominant indicator of training value.
| Training Data | BrowseComp-Plus |
|---|---|
| Correct trajectories only | 54.81 |
| Incorrect trajectories only | 55.06 |
| All trajectories | 54.46 |
Training on incorrect trajectories alone slightly outperforms correct-only training. This suggests that even failed trajectories provide useful supervision about search structure, tool-use ordering, evidence inspection patterns, and stopping behavior.
Interpretation: The structural patterns of how to search—query formulation, document selection, evidence verification—are more important for learning than whether the final answer happens to be correct. This is a counterintuitive but practically important finding.
Mechanism 4: Corpus Coverage Bootstrapping
RQ2 ablation demonstrates why gold-document bootstrapping is critical:
| Setting | Gold Hit Rate | Trajectory Accuracy | BrowseComp-Plus |
|---|---|---|---|
| With gold docs | 29.54% | 56.86% | 54.81 |
| Without gold docs | 1.73% | 43.81% | 6.35 |
Without gold documents, the model achieves only 6.35% on BrowseComp-Plus—a catastrophic 48-point drop. The 29.54% gold hit rate (compared to 1.73% without) confirms that answer-guided bootstrapping successfully seeds the corpus with retrievable evidence. However, even with bootstrapping, 70% of trajectories never retrieve a gold document, suggesting room for improvement in retrieval strategies.
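The kind of controlled analysis the fixed environment enables can be sketched as follows, assuming an illustrative per-trajectory log of retrieved document IDs (the log schema is an assumption, not the released format):

```python
from collections import defaultdict

# Illustrative log schema: each trajectory records the question id,
# every document id it touched, and whether its final answer was correct.
logs = [
    {"qid": 1, "docs_seen": {"g1", "d7"}, "correct": True},
    {"qid": 1, "docs_seen": {"d2"},       "correct": False},
    {"qid": 2, "docs_seen": {"d9"},       "correct": True},   # solved without gold
]
gold = {1: {"g1"}, 2: {"g2"}}  # gold-document annotations per question

def conditional_accuracy(logs, gold):
    """Accuracy of trajectories that did vs. did not touch a gold doc."""
    buckets = defaultdict(list)
    for t in logs:
        hit = bool(t["docs_seen"] & gold[t["qid"]])
        buckets[hit].append(t["correct"])
    return {hit: sum(v) / len(v) for hit, v in buckets.items()}

print(conditional_accuracy(logs, gold))  # → {True: 1.0, False: 0.5}
```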
Mechanism 5: Turn Budget Analysis
RQ3 investigates whether long-horizon capability requires a large turn budget at inference time:
| Max Turns at Inference | BrowseComp-Plus |
|---|---|
| 15 | 47.12 |
| 30 | 50.10 |
| 50 | 52.70 |
| 75 | 53.48 |
| 100 | 54.81 |
Performance improves monotonically with the turn budget but with diminishing returns. The jump from 15 to 30 turns provides the largest marginal gain (+3.0 points), while the improvement from 75 to 100 turns is only +1.3 points. This suggests that most questions can be solved within 50 turns but a long tail benefits from extended exploration.
Mechanism 6: Retrieval-Accuracy Relationship
RQ5 provides the first controlled analysis of how retrieval success relates to final-answer accuracy in deep research:
Questions where at least one gold document is retrieved have significantly higher answer accuracy than questions where no gold document is found. However, the relationship is not deterministic—some questions are solved without retrieving any gold document (through indirect evidence), and some fail despite gold-document retrieval (through reasoning errors post-retrieval).
This nuanced finding challenges the simplistic assumption that "better retrieval = better answers" and highlights the importance of both retrieval and reasoning capabilities.
12 Programming Language
| Component | Language | Framework/Library |
|---|---|---|
| Pipeline orchestration | Python | Custom |
| Trajectory synthesis | Python | GPT-OSS-120B API |
| Dense retrieval | Python | FAISS, Qwen3-Embedding-8B |
| Corpus processing | Python | Custom (cleaning, dedup) |
| Model training | Python | Megatron-LM |
| Evaluation | Python | Custom eval scripts |
| Web search (bootstrap) | Python | Serper API client |
The entire system is Python-native, consistent with the ML research ecosystem. No multi-language complexity.
Code Organization (Inferred from Release)
OpenResearcher/
├── data/ # QA data, processed questions
├── corpus/ # Offline corpus construction
│ ├── bootstrap.py # Gold document retrieval
│ ├── fineweb_sampler.py # FineWeb document sampling
│ └── indexer.py # FAISS index construction
├── search_engine/ # Local retrieval server
│ ├── server.py # Retrieval API
│ └── browser_tools.py # search/open/find implementations
├── synthesis/ # Trajectory generation
│ ├── agent.py # Teacher agent (ReAct loop)
│ ├── prompts.py # System prompts, tool metadata
│ └── filtering.py # Trajectory quality filters
├── training/ # SFT pipeline
│ ├── data_prep.py # Trajectory → training format
│ └── megatron_config.py # Megatron-LM configuration
├── evaluation/ # Benchmark evaluation
│ ├── browsecomp_plus.py # Closed-web eval
│ ├── browsecomp.py # Open-web eval
│ ├── gaia.py # GAIA benchmark
│ └── xbench.py # xbench-DeepSearch
└── analysis/ # Research question analyses
├── rq1_filtering.py # Correctness filtering ablation
├── rq2_corpus.py # Corpus coverage ablation
├── rq3_turns.py # Turn budget analysis
├── rq4_tools.py # Tool-space ablation
└── rq5_retrieval.py # Retrieval-accuracy relationship
13 Memory Management
Context Window Management
The most critical memory challenge in OpenResearcher is managing the 256K-token context window during both trajectory synthesis and student training.
During Synthesis (Teacher)
The teacher model must maintain a growing trajectory context across up to 185 tool calls. Each step adds:
- A reasoning chain-of-thought (variable length)
- A tool call (structured JSON)
- An observation (variable, especially for open which returns full documents)
Mitigation strategies:
1. Search snippets are truncated to limit context growth
2. open returns full document content but the model learns to use find for targeted inspection rather than reading entire documents
3. Trajectories exceeding context limits are filtered out
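Strategy 1 might look like the following sketch; the per-tool caps are pure assumptions, since the paper states only that snippets are truncated:

```python
# Illustrative per-tool observation caps (assumed values, not from the paper).
OBS_LIMITS = {"search": 400, "open": 8000, "find": 1000}

def truncate_observation(obs: str, tool: str) -> str:
    """Cap each tool observation before appending it to the trajectory,
    keeping context growth bounded across up to 185 turns."""
    cap = OBS_LIMITS[tool]
    return obs if len(obs) <= cap else obs[:cap] + "…[truncated]"

print(truncate_observation("x" * 10_000, "search").endswith("…[truncated]"))  # → True
```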
During Training (Student)
Sequences are pre-packed to 256K tokens with no truncation. This is a deliberate design choice: truncation would break the reasoning chain and teach the model incomplete patterns.
| Training Memory Property | Value |
|---|---|
| Max context length | 256K tokens |
| Packing strategy | Pre-packed (no padding waste) |
| Truncation | None (trajectories that don't fit are excluded) |
| Framework | Megatron-LM (distributed across 8× H100) |
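Pre-packing without truncation can be sketched as greedy first-fit-decreasing bin packing over trajectory token counts. This is a sketch under that assumption; the paper does not describe its packing algorithm:

```python
# Concatenate whole trajectories into 256K-token sequences, never
# splitting one; trajectories that cannot fit at all are excluded,
# matching the no-truncation rule.
MAX_LEN = 256_000

def pack(traj_lengths: list, max_len: int = MAX_LEN) -> list:
    bins = []                                     # (used_tokens, member lengths)
    for n in sorted(traj_lengths, reverse=True):  # first-fit decreasing
        if n > max_len:
            continue                              # excluded, never truncated
        for i, (used, members) in enumerate(bins):
            if used + n <= max_len:
                bins[i] = (used + n, members + [n])
                break
        else:
            bins.append((n, [n]))
    return [members for _, members in bins]

packed = pack([200_000, 100_000, 60_000, 50_000, 300_000])
print(packed)  # → [[200000, 50000], [100000, 60000]]
```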
Corpus Memory
| Component | Memory Requirement |
|---|---|
| Raw corpus | ~15M documents × avg size → multi-TB on disk |
| FAISS index | Dense vectors for 15M documents (significant GPU/RAM) |
| Embeddings | 15M × embedding_dim (Qwen3-Embedding-8B: 4096-dim) |
Trajectory Storage
| Property | Value |
|---|---|
| Total trajectories | 97K+ |
| Average tool calls per trajectory | 52.8 |
| Storage format | Structured (question, reasoning, action, observation tuples) |
| Released on | HuggingFace Datasets |
No External Memory System
OpenResearcher does not use any form of external memory, knowledge graph, or cross-trajectory learning. Each trajectory is independently generated. The system relies entirely on:
1. The teacher model's parametric knowledge for search strategy
2. The offline corpus for factual evidence
3. The trajectory history (within-episode context) for coherent reasoning
This is a deliberate simplicity choice. Cross-trajectory or cross-question memory could improve search strategy diversity but would complicate the pipeline and break the independence assumption that enables parallel synthesis.
14 Continued Learning
Current Learning Paradigm
OpenResearcher uses a single-pass SFT approach: the student model is trained once on the curated trajectory set. There is no iterative self-improvement, no RL phase, and no online adaptation.
Potential Continued Learning Extensions
The paper explicitly identifies several directions for continued learning:
1. Reinforcement Learning from Search Feedback
The offline environment provides perfect reward signals: trajectory accuracy can be computed against gold answers, and retrieval success (gold-document hit rate) can serve as an intermediate reward. This enables:
- RLHF-style optimization with search-specific rewards
- Process reward models trained on step-level retrieval success
- Outcome reward models trained on final-answer correctness
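A minimal sketch of how these two signals could be blended into a scalar reward. The equal 0.5/0.5 weighting and exact-match outcome check are illustrative assumptions, not from the paper:

```python
def trajectory_reward(predicted_answer: str,
                      gold_answer: str,
                      opened_doc_ids: set,
                      gold_doc_ids: set,
                      w_outcome: float = 0.5,
                      w_process: float = 0.5) -> float:
    """Blend final-answer correctness with the gold-document hit rate."""
    # Outcome reward: exact (case-insensitive) match against the gold answer
    outcome = 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    # Process reward: fraction of gold documents the agent actually opened
    process = len(opened_doc_ids & gold_doc_ids) / max(len(gold_doc_ids), 1)
    return w_outcome * outcome + w_process * process
```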
2. Self-Play Trajectory Refinement
The student model (OpenResearcher-30B-A3B) could generate its own trajectories, which are then filtered and used for further SFT rounds:
Round 1: Teacher (GPT-OSS-120B) → trajectories → SFT → Student v1
Round 2: Student v1 → trajectories → filter → SFT → Student v2
Round 3: Student v2 → trajectories → filter → SFT → Student v3
...
The RQ1 finding (incorrect trajectories have training value) suggests this could work even with moderate student-model accuracy.
3. Curriculum Over Trajectory Complexity
The trajectory distribution spans 10–185 tool calls. A curriculum strategy could:
- Start training on short trajectories (10–30 tool calls)
- Progressively introduce longer trajectories (50–100+)
- Allocate more training weight to the long-tail trajectories
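Such a curriculum could be sketched as a simple bucketing by tool-call count; the phase boundaries below follow the 10–30 / 50–100+ ranges mentioned and are otherwise arbitrary:

```python
def curriculum_order(trajectories, boundaries=(30, 100)):
    """Order trajectories into short -> medium -> long training phases.

    `trajectories` is a list of (id, num_tool_calls) pairs; boundaries are
    illustrative phase cutoffs on the tool-call count.
    """
    short  = [t for t in trajectories if t[1] <= boundaries[0]]
    medium = [t for t in trajectories if boundaries[0] < t[1] <= boundaries[1]]
    long_  = [t for t in trajectories if t[1] > boundaries[1]]
    return short + medium + long_

phases = curriculum_order([("a", 120), ("b", 15), ("c", 60)])
# shortest-first ordering: b (15 calls), then c (60), then a (120)
```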
4. Active Learning for Corpus Expansion
Questions where the teacher consistently fails (0/16 Pass@k) likely indicate corpus coverage gaps. An active learning loop could:
- Identify consistently-failed questions
- Perform targeted online bootstrapping for those questions
- Expand the corpus and re-synthesize trajectories
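The first step of that loop, flagging questions with Pass@16 = 0, can be sketched as follows (input format is a hypothetical mapping from question id to per-attempt outcomes):

```python
def coverage_gap_questions(passk_results, k=16):
    """Flag questions the teacher fails on all k attempts (Pass@k = 0).

    `passk_results` maps question id -> list of k boolean attempt outcomes.
    Questions with fewer than k recorded attempts are skipped.
    """
    return [qid for qid, attempts in passk_results.items()
            if len(attempts) == k and not any(attempts)]

gaps = coverage_gap_questions({"q1": [False] * 16, "q2": [False] * 15 + [True]})
# only q1 fails all 16 attempts and is flagged as a likely coverage gap
```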
5. Multi-Teacher Ensemble
Different teacher models may produce complementary search strategies. Training on trajectories from multiple teachers could improve strategy diversity:
- GPT-OSS-120B (primary teacher)
- Claude-4-Opus (different search patterns)
- DeepSeek-R1 (different reasoning style)
- Open-source LRMs (cost-effective scaling)
What the Paper Does NOT Address
- No discussion of catastrophic forgetting during SFT
- No exploration of whether the student model's search strategies degrade on out-of-distribution questions
- No analysis of how the model's performance changes with corpus drift (when the offline corpus becomes outdated)
- No investigation of multi-task or multi-domain continued learning
15 Applications
Direct Applications
1. Autonomous Literature Review
The deep research agent can be deployed for automated literature surveys:
- Issue complex research questions
- Let the agent search, read, and synthesize findings across many documents
- The agent produces evidence-grounded answers with source tracing
2. Fact-Checking and Verification
The search → open → find browsing hierarchy is naturally suited to fact-checking:
- Broad search identifies relevant sources
- Document opening enables full-context reading
- find localizes specific claims for verification
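The search → open → find hierarchy can be illustrated with a toy in-memory browser. The method names mirror the paper's three primitives, but the signatures and the substring-based retrieval (standing in for FAISS dense retrieval) are assumptions for illustration:

```python
class OfflineBrowser:
    """Toy sketch of the three browsing primitives over an in-memory corpus."""

    def __init__(self, corpus):
        self.corpus = corpus  # corpus: {doc_id: full_text}

    def search(self, query):
        """Return ids of documents matching the query (stand-in for dense retrieval)."""
        q = query.lower()
        return [doc_id for doc_id, text in self.corpus.items() if q in text.lower()]

    def open(self, doc_id):
        """Return the full document content for full-context reading."""
        return self.corpus[doc_id]

    def find(self, doc_id, needle):
        """Localize a specific claim: return character offsets of needle in the document."""
        text, hits, start = self.corpus[doc_id], [], 0
        while (pos := text.find(needle, start)) != -1:
            hits.append(pos)
            start = pos + 1
        return hits
```

A fact-check pass then reads naturally as `search` for candidate sources, `open` the top hit, and `find` the exact span supporting or contradicting the claim.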
3. Competitive Intelligence
Multi-source evidence aggregation across business documents, news, and filings:
- Long-horizon reasoning chains connect disparate data points
- The system handles heterogeneous evidence quality and potential contradictions
4. Research Data Synthesis for SFT
The pipeline itself is an application: generating high-quality research trajectories for training smaller models. Any organization can:
1. Define their domain-specific QA pairs
2. Bootstrap a corpus with answer-guided retrieval
3. Run the offline synthesis pipeline
4. SFT a model for domain-specific deep research
Benchmark Applications
| Benchmark | Type | OpenResearcher Performance | Application |
|---|---|---|---|
| BrowseComp-Plus | Closed-web deep research | 54.8% (SOTA) | Corpus-bound research |
| BrowseComp | Open-web deep research | 26.3% | Live web research |
| GAIA | General AI assistant | 64.1% | Multi-modal task solving |
| xbench-DeepSearch | Deep search evaluation | 65.0% | Search-intensive QA |
Broader Impact: The "Offline Training, Online Deployment" Pattern
OpenResearcher demonstrates a powerful paradigm for deep research agents:
Train offline, deploy online. A model trained exclusively on offline trajectories (no live-web exposure) transfers effectively to live-web search environments. This decouples the expensive data-generation phase from the deployment environment, enabling:
- Cost-effective data generation at scale
- Reproducible training pipelines
- Controlled analysis of agent behavior
- Domain-specific customization without API lock-in
Limitations for Applications
- Corpus currency: The offline corpus is a snapshot; it cannot answer questions about events after corpus construction
- Domain specificity: The current corpus (FineWeb) is general-purpose; domain-specific applications need domain corpora
- Single-language: The system is optimized for English-language research
- No multi-modal evidence: Documents are text-only; no image, table, or chart understanding
- Static browsing model: The three primitives do not cover interactive web elements (forms, dynamic content, JavaScript-rendered pages)
Appendix: Comparison with Related Systems
| System | Trajectory Source | Search Type | Open-Source | Max Tool Calls | Benchmark SOTA |
|---|---|---|---|---|---|
| OpenResearcher | Offline synthesis | Offline (FAISS) | ✅ Full | 185 | BrowseComp-Plus: 54.8% |
| Search-R1 | Online synthesis | Live API | Partial | 2–5 | N/A |
| WebExplorer | Online synthesis | Live API | Partial | Variable | N/A |
| MiroThinker | Online synthesis | Live API | Partial | Variable | N/A |
| DeepMiner-32B | Online synthesis | Live API | Partial | Variable | BrowseComp: 21.2% |
| ASearcher-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 52.8% |
| WebDancer-QwQ-32B | Online synthesis | Live API | Partial | Variable | GAIA: 51.5% |
OpenResearcher is the only system that combines (1) fully offline synthesis, (2) complete artifact release, (3) 100+ tool call support, and (4) competitive performance against proprietary systems.