DeepResearch: Alibaba Autonomous Research Agent
Part: Autonomous Research Systems
Key Contribution
DeepResearch, developed by Alibaba's Tongyi NLP group, presents an end-to-end framework for training autonomous research agents via reinforcement learning over real search-engine interactions. Its principal contribution is a synthetic data pipeline that generates diverse, multi-step research trajectories from a teacher model (QwQ-32B-Preview), followed by agentic RL training (using Group Relative Policy Optimization) that teaches a smaller student model (Qwen2.5-7B-Instruct) to autonomously plan searches, synthesize findings, and produce cited answers. The system demonstrates that RL over agentic trajectories — rather than supervised fine-tuning alone — is critical for robust multi-hop research performance, achieving approximately 10 percentage-point accuracy gains over SFT-only on GAIA validation (paper-reported, Zheng et al. 2025, Table 1). Code and model weights are released at github.com/Alibaba-NLP/DeepResearch.
This chapter draws on three sources: (1) the technical report; (2) the repository at github.com/Alibaba-NLP/DeepResearch; and (3) HuggingFace model cards. Every claim carries one of the following provenance tags:
- Paper-reported — stated directly in the technical report.
- Repo-documented — confirmed from the repository's README, dependency files, or documented interfaces. This does not imply source-code-level audit of internal implementation.
- Code-verified [file:function] — confirmed by direct inspection of a named source file. Used sparingly; only where the chapter author has verified a specific identifier.
- Model-card — confirmed by the HuggingFace model card.
- Canonical formulation — taken from the cited originating paper (e.g., GRPO from Guo et al. 2025).
- Author formalization — the chapter author's analytical construction or illustrative reconstruction, not directly traceable to a primary source.
- Implementation unknown — a plausible detail that could not be confirmed from public materials.
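Readers who need to reproduce repository-level claims should record the commit they cloned, for example with git log --oneline -1, and verify all claims tagged repo-documented against that snapshot.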
51.1 Overview and Motivation
The task of autonomous research — gathering, synthesizing, and reasoning over information from heterogeneous sources to answer complex, open-ended questions — has emerged as a central challenge for large language model (LLM) agents. While chain-of-thought reasoning and retrieval-augmented generation (RAG) address parts of this problem, they typically operate in a single-turn or shallow-retrieval paradigm. Real-world research requires iterative refinement: formulating hypotheses, searching for evidence, revising understanding, identifying gaps, searching again, and eventually synthesizing a coherent answer with proper attribution.
DeepResearch, released by Alibaba's Tongyi NLP group (the team behind the Qwen model family), addresses this gap by training LLMs to perform multi-step, tool-augmented research autonomously. The system's design rests on three pillars:
- Agentic trajectory generation. A synthetic data pipeline uses a capable teacher model (QwQ-32B-Preview) to produce diverse research trajectories — sequences of search queries, document reads, intermediate reasoning steps, and final answers — across thousands of research questions (paper-reported).
- Reinforcement learning from agentic experience. Rather than relying solely on supervised fine-tuning (SFT) on teacher trajectories, the system applies Group Relative Policy Optimization (GRPO) to train the student model (Qwen2.5-7B-Instruct) by rewarding successful research outcomes, enabling the agent to discover its own effective research strategies (paper-reported).
- Open-weight release. The trained model weights and training pipeline code are publicly released, enabling reproducibility and downstream adaptation (repo-documented: README includes a HuggingFace model link for Alibaba-NLP/DeepResearch-7B and training instructions).
The motivation is both practical and scientific. Practically, autonomous research agents could accelerate literature review, competitive intelligence, fact-checking, and scientific discovery. Scientifically, the question of whether RL can teach models to plan and execute multi-step information-gathering strategies — rather than merely generating fluent text — is a fundamental test of agentic capability.
51.1.1 Positioning Within Autonomous Research Systems
DeepResearch enters a landscape that includes OpenAI's Deep Research (integrated into ChatGPT Pro), Google DeepMind's research agents, and open-source efforts such as Search-R1, AutoAgent-R1, and WebThinker. Compared to proprietary systems, DeepResearch distinguishes itself by open-sourcing the training pipeline and model weights. Compared to prior open-source work, the principal advance claimed in the technical report is the combination of synthetic trajectory generation with GRPO-based policy optimization, rather than SFT alone (paper-reported).
51.2 Architecture
DeepResearch follows a modular agentic architecture in which a language model operates as a controller that selects actions, processes observations, and maintains a running research state. The technical report describes four principal components: the agent controller (the LLM policy), the search environment (web search integration), the trajectory manager (state tracking and action formatting), and the training pipeline (synthetic data generation, filtering, and RL optimization) (paper-reported).
51.2.1 Agent Controller
The agent controller is the core LLM that decides which action to take at each step. During inference, it receives the accumulated research context (prior searches, retrieved passages, reasoning steps) and produces one of four action types: search, click, think, or finish. Actions are emitted as structured tokens using XML-like delimiters (paper-reported) that are parsed into executable operations. The action format uses tagged regions as follows (paper-reported action schema):
<!-- Action format: XML-like delimiters for structured agent output (paper-reported) -->
<search>multi-hop reasoning techniques in LLM agents 2024</search>
<click>https://example.com/relevant-paper.html</click>
<think>The retrieved document discusses chain-of-thought prompting but not
multi-step search. I need to search more specifically for iterative
retrieval methods...</think>
<finish>Based on my research, multi-hop reasoning in LLM agents involves...
[cited synthesis with source attributions]</finish>
The controller is based on the Qwen2.5-7B-Instruct model (paper-reported; model-card: Qwen/Qwen2.5-7B-Instruct), a dense transformer with 7.6 billion parameters and a context window of up to 128K tokens (model-card). The teacher model used for trajectory generation is QwQ-32B-Preview (paper-reported; model-card: Qwen/QwQ-32B-Preview), a 32-billion-parameter reasoning model. Specific architecture details are given in §51.4.
51.2.2 Search Environment
The search environment wraps a web search API to provide the agent with real-time web information. When the agent issues a search(query) action, the environment returns a set of search results including titles, URLs, and snippets. The click(url) action fetches and processes the content of a specific page, returning a text representation that fits within the model's context window (paper-reported). The repository documents a configurable search backend, with search engine selection specified via configuration or environment variables (repo-documented: README describes search API setup).
51.2.3 Trajectory Manager
The trajectory manager maintains the running conversation state as a sequence of (action, observation) pairs (paper-reported). The context grows with each step, and context-window management is necessary for long research sessions. The specific context management strategy (sliding-window truncation, summarization, or importance-based pruning) is not detailed in the reviewed public materials (implementation unknown).
51.3 Core Algorithms
51.3.1 The Research Agent Loop
The agent operates as a sequential decision-making process. At each time step $t$, the agent observes its accumulated context $c_t$ and selects an action $a_t$ from the action space $\mathcal{A} = \{\texttt{search}, \texttt{click}, \texttt{think}, \texttt{finish}\}$ according to its policy $\pi_\theta(a_t \mid c_t)$. An episode terminates when the agent selects the finish action or when a maximum step limit $T_{\max}$ is reached.
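A complete trajectory $\tau$ has the following structure (paper-reported structure; notation is author formalization):
$$\tau = (s_0,\; a_0,\; e_0,\; a_1,\; e_1,\; \ldots,\; a_{T-1},\; e_{T-1},\; a_T)$$
where $a_T$ is the terminating action and each component occupies a distinct role: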
- $s_0$ — the system prompt and question tokens (fixed, not generated by the policy).
- $a_t = (a_{t,1}, a_{t,2}, \ldots, a_{t,L_t})$ — the $t$-th action, a sequence of $L_t$ tokens generated by the policy $\pi_\theta$. This includes the XML delimiter tokens and the action content.
- $e_t = (e_{t,1}, e_{t,2}, \ldots, e_{t,K_t})$ — the $t$-th environment observation, a sequence of $K_t$ tokens returned by the environment. These tokens are not generated by the policy.
The context at step $t$ is the concatenation of all preceding components: $c_t = s_0 \oplus a_0 \oplus e_0 \oplus \cdots \oplus a_{t-1} \oplus e_{t-1}$. This distinction between policy-generated tokens ($a_t$) and environment-returned tokens ($e_t$) is critical for the RL training formulation in §51.3.3, where only policy-generated tokens receive gradient signal.
Algorithm 1: RESEARCH-AGENT-LOOP
Provenance: paper-reported methodology; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: question q, policy model π_θ, search environment E, max steps T_max
Output: answer string with citations
1 context ← FORMAT_SYSTEM_PROMPT(q)
2 for t = 0 to T_max − 1 do
3 │ action_tokens ← π_θ.generate(context) // policy-generated tokens a_t
4 │ (type, arg) ← PARSE_XML_TAGS(action_tokens) // extract <tag>...</tag>
5 │
6 │ if type = "finish" then
7 │ │ return arg // final answer with citations
8 │ end
9 │
10 │ if type = "search" then
11 │ │ env_obs ← E.web_search(arg) // environment observation e_t
12 │ else if type = "click" then
13 │ │ env_obs ← E.fetch_page(arg) // environment observation e_t
14 │ else if type = "think" then
15 │ │ env_obs ← ∅ // no environment feedback
16 │ end
17 │
18 │ context ← context ⊕ action_tokens ⊕ env_obs // a_t and e_t appended
19 end
20 return FORCE_FINISH(context) // max steps reached
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
51.3.2 Synthetic Data Pipeline
The synthetic data pipeline is the first stage of training. Its purpose is to generate a large corpus of research trajectories that demonstrate effective multi-step research behavior. The pipeline operates as follows (paper-reported):
- Question collection. Research questions are collected from diverse sources: existing QA benchmarks, web-harvested complex questions, and synthetically generated questions designed to require multi-hop reasoning.
- Teacher trajectory generation. The teacher model (QwQ-32B-Preview; paper-reported) executes the research agent loop on each question, interacting with live search APIs. Each execution produces a complete trajectory $\tau_i$.
- Answer verification. The final answer in each trajectory is evaluated against ground-truth labels (where available) or assessed by a judge model for correctness and completeness.
- Quality filtering. Trajectories are filtered based on answer correctness, trajectory characteristics, and diversity criteria. The filtering ensures the training corpus contains a mix of easy and hard questions with varied search strategies.
The quality filtering stage is critical: naive SFT on all teacher trajectories would bias the student toward the teacher's particular search patterns, including its mistakes. Filtering on outcome quality creates a curated dataset of successful research strategies (paper-reported rationale).
Algorithm 2: SYNTHETIC-TRAJECTORY-GENERATION
Provenance: paper-reported methodology; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: question set Q, teacher model M_teacher (QwQ-32B-Preview),
search environment E, quality threshold θ_min
Output: filtered trajectory dataset D_filtered
1 D_raw ← ∅
2 for each question q ∈ Q do // parallel across questions
3 │ τ ← RESEARCH-AGENT-LOOP(q, M_teacher, E) // teacher executes search loop
4 │ answer ← EXTRACT_ANSWER(τ)
5 │ D_raw ← D_raw ∪ {(q, τ, answer)}
6 end
7
8 D_filtered ← ∅
9 for each (q, τ, answer) ∈ D_raw do
10 │ score ← SCORE_TRAJECTORY(q, answer, ground_truth(q))
11 │ if score ≥ θ_min then
12 │ │ D_filtered ← D_filtered ∪ {(q, τ, answer, score)}
13 │ end
14 end
15 return D_filtered
Subroutine SCORE_TRAJECTORY:
- If ground truth available: exact match → 1.0; partial match via judge → [0, 1]
- If no ground truth: judge model assesses answer quality → [0, 1]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
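The filtering stage of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the `Trajectory` record, the `exact_match_score` scorer, and the threshold value are hypothetical, and the judge-model branch of SCORE_TRAJECTORY is omitted, so only the exact-match case is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    """Hypothetical record for one teacher rollout."""
    question: str
    answer: str
    ground_truth: Optional[str]  # None when no label is available


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match_score(traj: Trajectory) -> float:
    """Return 1.0 on a normalized exact match, else 0.0.

    Unlabeled questions score 0.0 in this sketch; Algorithm 2 would
    route them to a judge model instead.
    """
    if traj.ground_truth is None:
        return 0.0
    return 1.0 if normalize(traj.answer) == normalize(traj.ground_truth) else 0.0


def filter_trajectories(raw: list[Trajectory], theta_min: float) -> list[tuple[Trajectory, float]]:
    """Keep (trajectory, score) pairs whose score reaches theta_min."""
    scored = ((t, exact_match_score(t)) for t in raw)
    return [(t, s) for t, s in scored if s >= theta_min]


raw = [
    Trajectory("Capital of France?", " Paris ", "paris"),
    Trajectory("Capital of Spain?", "Lisbon", "Madrid"),
    Trajectory("Open question", "Some answer", None),
]
kept = filter_trajectories(raw, theta_min=0.5)
print([t.question for t, _ in kept])
```

Keeping the score alongside each retained trajectory mirrors line 12 of Algorithm 2 and would allow later stages to balance easy and hard questions.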
51.3.3 Agentic Reinforcement Learning with GRPO
The central algorithmic contribution is the application of Group Relative Policy Optimization (GRPO) to train the research agent. GRPO, introduced for mathematical reasoning in DeepSeekMath (Shao et al., 2024) and subsequently used in DeepSeek-R1 (Guo et al., 2025), is adapted here for agentic trajectories. To maintain evidentiary precision, this section separates the canonical GRPO formulation (from Guo et al. 2025) from the agentic masking adaptation applied in DeepResearch, with per-term provenance labeling.
Layer 1: Canonical GRPO (Guo et al., 2025)
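In the standard (non-agentic) GRPO formulation, all output tokens are policy-generated. Given a prompt $q$, the current policy $\pi_{\theta_{\text{old}}}$ generates a group of $G$ completions $\{o_1, o_2, \ldots, o_G\}$. Each receives a scalar reward $r_i$, and advantages are computed via group-relative normalization (canonical formulation, Guo et al. 2025, §3.2):
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$
The canonical GRPO objective is (canonical formulation, Guo et al. 2025, Eq. 3):
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{k=1}^{|o_i|} \left( \min\!\left( \rho_k^{(i)} \hat{A}_i,\; \operatorname{clip}\!\left(\rho_k^{(i)},\, 1-\varepsilon_c,\, 1+\varepsilon_c\right) \hat{A}_i \right) - \beta\, D_{\text{KL}}^{(i,k)} \right) \right]$$
where: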
- $\rho_k^{(i)} = \pi_\theta(o_k^{(i)} \mid o_{<k}^{(i)}) \,/\, \pi_{\theta_{\text{old}}}(o_k^{(i)} \mid o_{<k}^{(i)})$ — the per-token importance-sampling ratio (canonical formulation).
- $\varepsilon_c$ — the clipping parameter, following the PPO mechanism (Schulman et al., 2017). Standard value: 0.2 (canonical formulation).
- $D_{\text{KL}}^{(i,k)}$ — per-token KL divergence penalty against a reference policy $\pi_{\text{ref}}$, using the Schulman (2020) unbiased estimator (canonical formulation, Guo et al. 2025, Eq. 4): $$D_{\text{KL}}^{(i,k)} = \frac{\pi_{\text{ref}}(o_k^{(i)} \mid o_{<k}^{(i)})}{\pi_\theta(o_k^{(i)} \mid o_{<k}^{(i)})} - \log \frac{\pi_{\text{ref}}(o_k^{(i)} \mid o_{<k}^{(i)})}{\pi_\theta(o_k^{(i)} \mid o_{<k}^{(i)})} - 1$$
- $\beta$ — KL coefficient controlling divergence from the reference model (canonical formulation).
The key property of canonical GRPO is that it eliminates the need for a separate value function: advantages are computed by comparing outcomes within the group rather than estimating $V(s)$.
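As a worked illustration of the group-relative normalization, the following sketch computes advantages for one group of rewards. It is a minimal example rather than any library's implementation: the function name is hypothetical, the small `eps` stabilizer follows line 11 of Algorithm 3 rather than the canonical formula, and the use of the population standard deviation is an assumption.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the G samples (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]  # G = 4 completions, two of them correct
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# When every completion in the group earns the same reward, all advantages
# are zero and the group contributes no policy-gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The second call shows why reward sparsity matters for this estimator: a group in which every rollout fails (or every rollout succeeds) yields no learning signal, because there is no within-group contrast to exploit.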
Layer 2: Agentic Masking Adaptation for DeepResearch
In the agentic research setting, trajectories interleave policy-generated tokens (actions) with environment-returned tokens (search results, page content). The canonical GRPO formulation assumes all output tokens are policy-generated; the agentic setting requires an additional masking mechanism to exclude environment tokens from the loss. The technical report describes this masking qualitatively: "only the agent's generated tokens enter the loss" (paper-reported). The following explicit notation is the chapter author's formalization of this described practice.
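Let the full token sequence of trajectory $i$ be $\mathbf{x}^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_{M_i}^{(i)})$, which interleaves system-prompt tokens, policy-generated action tokens, and environment-returned observation tokens. Define the generation mask (author formalization):
$$m_k^{(i)} = \begin{cases} 1 & \text{if } x_k^{(i)} \text{ is a policy-generated action token} \\ 0 & \text{if } x_k^{(i)} \text{ is a system-prompt or environment-returned token} \end{cases}$$
The adapted GRPO objective applies this mask to restrict gradient computation to policy-generated tokens only (paper-reported algorithm choice; masked formulation is author formalization):
$$\mathcal{J}_{\text{agentic}}(\theta) = \mathbb{E}_{q,\, \{\mathbf{x}^{(i)}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}},\, E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{N_{\text{gen}}^{(i)}} \sum_{k=1}^{M_i} m_k^{(i)} \left( \min\!\left( \rho_k^{(i)} \hat{A}_i,\; \operatorname{clip}\!\left(\rho_k^{(i)},\, 1-\varepsilon_c,\, 1+\varepsilon_c\right) \hat{A}_i \right) - \beta\, D_{\text{KL}}^{(i,k)} \right) \right]$$
Here $\rho_k^{(i)} = \pi_\theta(x_k^{(i)} \mid \mathbf{x}_{<k}^{(i)}) \,/\, \pi_{\theta_{\text{old}}}(x_k^{(i)} \mid \mathbf{x}_{<k}^{(i)})$ and $N_{\text{gen}}^{(i)} = \sum_{k=1}^{M_i} m_k^{(i)}$.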
The differences from canonical GRPO, with per-term provenance:
| Term | Adaptation | Provenance |
|---|---|---|
| $m_k^{(i)} \in \{0, 1\}$ | Generation mask zeroing out environment and prompt tokens from the loss | Paper-reported qualitatively; explicit notation is author formalization |
| $N_{\text{gen}}^{(i)} = \sum_k m_k^{(i)}$ | Normalization by count of policy-generated tokens only (replacing $|o_i|$) | Author formalization — standard practice in masked-loss implementations; not explicitly specified in paper |
| $\rho_k^{(i)}$ conditioning | The conditioning context $\mathbf{x}_{<k}^{(i)}$ includes both action and environment tokens, but the ratio is only computed for $m_k^{(i)}=1$ | Author formalization of the standard autoregressive property |
| $D_{\text{KL}}^{(i,k)}$ | KL penalty against reference policy $\pi_{\text{ref}}$ (SFT checkpoint), computed only for generated tokens ($m_k^{(i)}=1$) | KL form: canonical (Schulman 2020). Masking: author formalization of paper-reported practice |
| $\hat{A}_i$ (trajectory-level) | Same scalar advantage broadcast to all generated tokens in trajectory $i$ | Canonical GRPO (Guo et al. 2025). Applied unchanged |
| $\varepsilon_c$ (clipping) | PPO-style clipping, unchanged from canonical formulation | Canonical (Schulman et al. 2017). Exact value used: implementation unknown |
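One way to construct the generation mask $m_k^{(i)}$ is to record, for each span of the trajectory, who produced it. The sketch below assumes a hypothetical `Segment` record with a `source` label; the paper does not specify how the mask is built, and the string tokens stand in for tokenizer IDs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical trajectory segment: a token span plus who produced it."""
    source: str        # "prompt", "policy", or "env"
    tokens: list[str]


def build_generation_mask(segments: list[Segment]) -> tuple[list[str], list[int]]:
    """Flatten segments into one token sequence and a parallel 0/1 mask."""
    tokens: list[str] = []
    mask: list[int] = []
    for seg in segments:
        flag = 1 if seg.source == "policy" else 0
        tokens.extend(seg.tokens)
        mask.extend([flag] * len(seg.tokens))
    return tokens, mask


trajectory = [
    Segment("prompt", ["<sys>", "question"]),
    Segment("policy", ["<search>", "query", "</search>"]),
    Segment("env", ["result", "snippet"]),
    Segment("policy", ["<finish>", "answer", "</finish>"]),
]
tokens, mask = build_generation_mask(trajectory)
n_gen = sum(mask)  # N_gen: number of policy-generated tokens
print(mask, n_gen)  # [0, 0, 1, 1, 1, 0, 0, 1, 1, 1] 6
```

Environment tokens remain in the sequence, so they still condition later policy tokens; only their contribution to the loss is zeroed.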
Token-Level vs. Action-Level Decomposition
Although the agent operates at the level of discrete actions (search, click, think, finish), the GRPO objective as formalized above operates at the level of individual tokens. Each action $a_t$ is a multi-token generation consisting of $L_t$ tokens, and the importance-sampling ratio $\rho_k^{(i)}$ is computed per token. The trajectory-level advantage $\hat{A}_i$ is then broadcast to all generated tokens within the same trajectory. An alternative would be action-level probabilities $\pi_\theta(a_t \mid c_t) = \prod_{j=1}^{L_t} \pi_\theta(a_{t,j} \mid c_t, a_{t,<j})$; the token-level formulation avoids numerical issues with products of many probabilities and aligns with standard autoregressive training (author analysis).
51.3.4 Reward Design
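The reward function $r(\tau)$ for a complete trajectory $\tau$ combines answer correctness with format adherence (paper-reported). The reported structure is:
$$r(\tau) = \lambda_{\text{acc}} \cdot r_{\text{accuracy}}(\tau) + \lambda_{\text{fmt}} \cdot r_{\text{format}}(\tau)$$
where: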
- $r_{\text{accuracy}} \in \{0, 1\}$ or $[0, 1]$ — reward for answer correctness against ground truth (paper-reported: accuracy is the primary signal).
- $r_{\text{format}} \in \{0, 1\}$ — reward for properly formatted output with correct XML action delimiters (paper-reported).
- $\lambda_{\text{acc}}, \lambda_{\text{fmt}}$ — weighting coefficients. Exact values are implementation unknown; the paper states accuracy is the dominant signal.
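A reward of this shape might be computed as in the following sketch. The weights 0.9 and 0.1 are hypothetical placeholders (the paper states only that accuracy dominates), the exact-match scorer stands in for whatever grading the authors use, and the function names are illustrative rather than taken from the repository.

```python
import re

# Hypothetical weights: the paper only states that accuracy is the dominant signal.
LAMBDA_ACC = 0.9
LAMBDA_FMT = 0.1

FINISH_RE = re.compile(r"<finish>(.*?)</finish>", re.DOTALL)


def format_reward(final_action: str) -> float:
    """1.0 if the final action is wrapped in <finish>...</finish>, else 0.0."""
    return 1.0 if FINISH_RE.fullmatch(final_action.strip()) else 0.0


def accuracy_reward(final_action: str, ground_truth: str) -> float:
    """1.0 if the extracted answer equals the ground truth after normalization."""
    match = FINISH_RE.search(final_action)
    if match is None:
        return 0.0
    answer = " ".join(match.group(1).lower().split())
    return 1.0 if answer == " ".join(ground_truth.lower().split()) else 0.0


def trajectory_reward(final_action: str, ground_truth: str) -> float:
    """Weighted sum of the accuracy and format components."""
    return (LAMBDA_ACC * accuracy_reward(final_action, ground_truth)
            + LAMBDA_FMT * format_reward(final_action))


print(trajectory_reward("<finish>Paris</finish>", "paris"))  # correct and well-formed
print(trajectory_reward("<finish>Lyon</finish>", "paris"))   # well-formed but wrong
print(trajectory_reward("Paris", "paris"))                   # no <finish> tags
```

In this sketch an answer without the expected delimiters earns nothing even when its text is correct, because the answer cannot be extracted; whether the actual reward behaves this way is not stated in the reviewed materials.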
51.3.5 Training Stages
The complete training pipeline proceeds in two stages (paper-reported):
| Stage | Method | Purpose | Data Source |
|---|---|---|---|
| 1. Cold Start | SFT | Initialize agent with basic research behavior and action formatting | Filtered teacher trajectories from QwQ-32B-Preview |
| 2. RL Training | GRPO | Optimize research strategy via trajectory-level rewards | Online rollouts with search-environment interaction |
The SFT cold-start stage is critical (paper-reported finding). Without SFT initialization, the base model does not know how to format actions or issue coherent search queries, making the RL reward signal too sparse for effective learning. SFT provides a "warm" initial policy that GRPO can then refine. The RL training phase uses the verl framework (Volcano Engine Reinforcement Learning) for distributed GRPO training (repo-documented: verl is listed as a dependency in the repository's requirements).
Algorithm 3: GRPO-AGENTIC-TRAINING-STEP
Provenance: canonical GRPO structure from Guo et al. (2025), §3.2; agentic masking
adaptation from paper-reported description; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: question batch Q = {q_1,...,q_B}, group size G, current policy π_θ,
old policy π_θ_old, reference policy π_ref (SFT model),
search environment E, clip ε_c, KL weight β
Output: gradient update Δθ
1 for each question q_j ∈ Q do
2 │ // Sample G trajectories from OLD policy (canonical GRPO)
3 │ for g = 1 to G do
4 │ │ τ_g ← RESEARCH-AGENT-LOOP(q_j, π_θ_old, E)
5 │ │ r_g ← COMPUTE_REWARD(τ_g) // paper-reported reward
6 │ │ m_g ← BUILD_GENERATION_MASK(τ_g) // author formalization
7 │ end
8 │
9 │ // Trajectory-level advantage (canonical GRPO, Guo et al. 2025)
10 │ for g = 1 to G do
11 │ │ Â_g ← (r_g − mean({r_1,...,r_G})) / (std({r_1,...,r_G}) + ε)
12 │ end
13 │
14 │ // Masked token-level policy gradient (author formalization of paper-reported practice)
15 │ for g = 1 to G do
16 │ │ for each token position k in τ_g do
17 │ │ │ if m_g[k] = 0 then continue // skip prompt and environment tokens
18 │ │ │ ρ_k ← exp(logπ_θ[k] − logπ_θ_old[k]) // canonical importance ratio
19 │ │ │ ρ_clip ← clip(ρ_k, 1−ε_c, 1+ε_c) // canonical PPO clipping
20 │ │ │ L_pg ← −min(ρ_k · Â_g, ρ_clip · Â_g) // canonical clipped objective
21 │ │ │ ratio_ref ← exp(logπ_ref[k] − logπ_θ[k])
22 │ │ │ L_kl ← β · (ratio_ref − log(ratio_ref) − 1) // Schulman (2020) KL estimator
23 │ │ │ accumulate (L_pg + L_kl) / N_gen_g // normalize by generated tokens
24 │ │ end
25 │ end
26 end
27 Δθ ← gradient of accumulated loss
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Per-line provenance:
Lines 2-7, 14-25: agentic masking = author formalization of paper-reported practice
Lines 9-12: canonical GRPO advantage (Guo et al. 2025)
Lines 18-20: canonical PPO clipping (Schulman et al. 2017)
Line 22: canonical KL estimator (Schulman 2020)
Line 6: generation mask construction = author formalization (not paper-specified)
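The inner loop of Algorithm 3 (lines 16-24) can be written out for a single trajectory as below. This is a plain-Python sketch that only evaluates the loss value; a real training step would compute the same quantity on tensors inside an autodiff framework. The function name is hypothetical, `clip_eps=0.2` is the standard value noted earlier, and `beta=0.01` is a placeholder, not a reported setting.

```python
import math


def masked_grpo_loss(logp_new, logp_old, logp_ref, mask, advantage,
                     clip_eps=0.2, beta=0.01):
    """Per-trajectory loss following Algorithm 3, lines 16-24.

    All list arguments are aligned per token; mask[k] = 1 marks policy tokens.
    """
    n_gen = sum(mask)
    if n_gen == 0:
        return 0.0
    total = 0.0
    for lp, lo, lr, m in zip(logp_new, logp_old, logp_ref, mask):
        if m == 0:
            continue  # prompt and environment tokens carry no loss
        ratio = math.exp(lp - lo)                       # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        pg_loss = -min(ratio * advantage, clipped * advantage)
        ratio_ref = math.exp(lr - lp)
        kl = ratio_ref - math.log(ratio_ref) - 1.0      # non-negative KL estimator
        total += pg_loss + beta * kl
    return total / n_gen                                # normalize by generated tokens


mask = [0, 1, 1, 0, 1]
logp = [-1.0, -0.5, -0.7, -2.0, -0.3]
# On-policy and equal to the reference: ratio = 1 and KL = 0 on every token.
print(masked_grpo_loss(logp, logp, logp, mask, advantage=1.0))  # -1.0
```

Because masked positions are skipped entirely, changing the log-probabilities assigned to environment tokens leaves the loss unchanged.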
51.4 Model Architecture
Both the student and teacher models are dense transformers — neither uses Mixture-of-Experts (MoE) architecture (paper-reported; model-card confirmed).
51.4.1 Student Model: Qwen2.5-7B-Instruct
| Parameter | Value | Source |
|---|---|---|
| Architecture | Dense transformer (decoder-only) | Model-card |
| Total parameters | ~7.6B | Model-card |
| Hidden dimension | 3,584 | Model-card |
| Layers | 28 | Model-card |
| Attention heads | 28 (with GQA) | Model-card |
| Context window | Up to 128K tokens | Model-card |
| Vocabulary size | ~152K | Model-card |
| Attention mechanism | Grouped Query Attention (GQA) | Model-card |
The released trained checkpoint is available as Alibaba-NLP/DeepResearch-7B on HuggingFace (repo-documented: download link in repository README).
51.4.2 Teacher Model: QwQ-32B-Preview
The teacher model used for generating synthetic research trajectories is QwQ-32B-Preview (paper-reported; model-card: Qwen/QwQ-32B-Preview), a 32-billion-parameter dense reasoning model designed for extended chain-of-thought reasoning. The teacher is used only during the data generation phase (§51.3.2) and is not required at inference time.
51.4.3 Relevance to Agentic Research
Two aspects of the model choices are particularly relevant (author analysis):
- Long context window. The 128K-token context window accommodates multi-step research trajectories. However, even 128K tokens may be insufficient for very long sessions with multiple full web pages.
- Teacher–student scale gap. The roughly 4.2× parameter ratio (32B teacher vs. ~7.6B student) is a deliberate design choice: the teacher produces high-quality trajectories, while the student is small enough for efficient RL training (which requires generating $G$ trajectories per question) and deployment.
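A back-of-the-envelope budget shows how quickly full pages can exhaust the context window. Every per-step token cost below is a hypothetical round number chosen for illustration, not a measurement from the paper or repository; only the 128K window comes from the model card.

```python
CONTEXT_WINDOW = 128 * 1024  # 131,072 tokens

# Hypothetical per-step token costs (illustrative, not measured).
SYSTEM_PROMPT = 1_000
ACTION = 300            # one search, click, or think action
SEARCH_RESULTS = 1_500  # titles, URLs, and snippets for one query
FULL_PAGE = 12_000      # one fetched page without truncation


def max_search_click_rounds(window: int = CONTEXT_WINDOW) -> int:
    """Number of (search, click) rounds that fit before the window is full."""
    per_round = (ACTION + SEARCH_RESULTS) + (ACTION + FULL_PAGE)
    return (window - SYSTEM_PROMPT) // per_round


print(max_search_click_rounds())  # 9
```

Under these assumptions only nine search-and-read rounds fit, which is why some form of page truncation or context management is needed for longer sessions.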
51.5 Key Results
51.5.1 Evaluation Metadata and Reproducibility Protocol
The following table specifies all evaluation metadata needed for reproducibility, with explicit coverage of available and unavailable information:
| Metadata Field | Value | Availability |
|---|---|---|
| Student model checkpoint | Alibaba-NLP/DeepResearch-7B (HuggingFace) | Available (repo-documented) |
| Base model | Qwen2.5-7B-Instruct, dense, ~7.6B params | Available (model-card) |
| Teacher model | QwQ-32B-Preview (data generation only) | Available (paper-reported) |
| Repository commit hash | Not pinned in this chapter | Not recorded — readers must pin at clone time |
| Search provider | Web search API (configurable) | Provider-specific: exact API not stated in paper; repo documents setup |
| Evaluation date window | Early 2025 | Approximate (paper-reported) |
| Search result caching | Not documented | Unknown |
| Max agent steps ($T_{\max}$) | Configurable | Documented as parameter; default value not stated in paper |
| Decoding settings (temperature, top-p) | Not specified in paper | Not reported |
| Number of evaluation runs / seeds | Not specified | Not reported — results are single-run point estimates |
| Confidence intervals / variance | Not reported | Not reported |
| GRPO group size $G$ | Specified in training config | Paper-reported (training detail section) |
| RL training framework | verl (Volcano Engine RL) | Available (repo-documented) |
| Inference engine | vLLM | Available (repo-documented) |
| Primary metric | Answer accuracy (exact match and graded) | Available (paper-reported) |
51.5.2 Benchmark Performance
The following table reports the main results across six benchmarks for three training configurations (paper-reported, Zheng et al. 2025, Tables 1–2). All three configurations use the same agent loop and search environment, ensuring budget-matched comparison (paper-reported).
| Benchmark | Split | Metric | Base + Search | SFT | SFT + GRPO | Δ(RL−SFT) |
|---|---|---|---|---|---|---|
| GAIA | Validation, Level 1 | Accuracy (%) | 27.4 | 45.2 | 56.5 | +11.3 |
| GAIA | Validation, Level 2 | Accuracy (%) | 15.3 | 27.1 | 37.3 | +10.2 |
| GAIA | Validation, Level 3 | Accuracy (%) | 5.9 | 11.8 | 17.6 | +5.8 |
| GAIA | Validation, Average | Accuracy (%) | 18.3 | 31.3 | 40.9 | +9.6 |
| Bamboogle | Full | Accuracy (%) | 24.0 | 44.0 | 56.0 | +12.0 |
| BrowseComp | Full test | Accuracy (%) | 3.6 | 14.8 | 22.5 | +7.7 |
| HotpotQA | Test subset | EM (%) | 36.2 | 55.4 | 63.1 | +7.7 |
| 2WikiMultiHopQA | Test subset | EM (%) | 31.5 | 49.8 | 58.6 | +8.8 |
| MuSiQue | Test subset | EM (%) | 14.2 | 27.3 | 35.1 | +7.8 |
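The Δ(RL−SFT) column can be recomputed from the SFT and SFT + GRPO columns as a consistency check. The snippet below copies the paper-reported values from the table above and introduces no new data.

```python
# (benchmark, SFT score, SFT + GRPO score), copied from the table above.
rows = [
    ("GAIA Level 1", 45.2, 56.5),
    ("GAIA Level 2", 27.1, 37.3),
    ("GAIA Level 3", 11.8, 17.6),
    ("GAIA Average", 31.3, 40.9),
    ("Bamboogle", 44.0, 56.0),
    ("BrowseComp", 14.8, 22.5),
    ("HotpotQA", 55.4, 63.1),
    ("2WikiMultiHopQA", 49.8, 58.6),
    ("MuSiQue", 27.3, 35.1),
]

# Percentage-point gain of SFT + GRPO over SFT-only, rounded to one decimal.
deltas = {name: round(rl - sft, 1) for name, sft, rl in rows}

smallest = min(deltas, key=deltas.get)
largest = max(deltas, key=deltas.get)
print(smallest, deltas[smallest])  # GAIA Level 3 5.8
print(largest, deltas[largest])    # Bamboogle 12.0
```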
51.5.3 RL vs. SFT Ablation
The most important empirical finding is the consistent superiority of SFT+GRPO over SFT-only across all six benchmarks (paper-reported), with improvements ranging from +5.8pp (GAIA Level 3) to +12.0pp (Bamboogle):
| Training Configuration | GAIA Val Avg | Bamboogle | BrowseComp | Interpretation (paper-reported) |
|---|---|---|---|---|
| Base Qwen2.5-7B-Instruct + Search | 18.3 | 24.0 | 3.6 | Single-turn QA with search but no research training |
| SFT on filtered QwQ-32B trajectories | 31.3 | 44.0 | 14.8 | Imitates teacher search patterns |
| SFT + GRPO (DeepResearch-7B) | 40.9 | 56.0 | 22.5 | Develops own strategies beyond teacher imitation |
| GRPO without SFT cold start | Substantially degraded (paper-reported: unstable training); not tabulated | not tabulated | not tabulated | Sparse reward + poor action formatting |
Key interpretation (paper-reported): RL enables the student to discover research strategies that differ from — and improve upon — the teacher's approach. The student is not limited to imitating the teacher but can learn to search differently when the teacher's strategy is suboptimal for the student's smaller capacity. The necessity of SFT cold start is attributed to reward sparsity: without SFT, the base model produces malformed actions and incoherent queries, providing no learning signal (paper-reported rationale). Exact scores for the GRPO-from-scratch condition are described qualitatively, not tabulated (paper-reported).
51.6 Relationship to Evolutionary AI
DeepResearch is not, strictly speaking, an evolutionary system. It does not maintain a population of candidate programs, apply mutation and crossover operators, or use fitness-based selection in the traditional evolutionary computation sense. However, several connections to the evolutionary AI paradigm warrant discussion within this survey:
51.6.1 Population-Based Training via GRPO
GRPO generates a group of $G$ trajectories per question and computes advantages relative to the group. This is structurally analogous to a population-based evaluation: $G$ candidate strategies are "born" from the same policy, evaluated against an environment, and ranked relative to each other. The policy update reinforces trajectories that score above the group mean and discourages those below it, a soft analogue of truncation selection.
The connection can be formalized as follows (author analysis). In a $(\mu, \lambda)$ evolution strategy, $\lambda$ offspring are generated and the top $\mu$ are selected to inform the next generation. In GRPO with group size $G$, the advantage normalization creates a continuous-valued selection pressure: trajectories above the group mean receive positive advantage (reinforced), those below receive negative advantage (discouraged). Ignoring clipping and the KL penalty, the gradient update is:
$$\nabla_\theta \mathcal{J} \approx \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(\tau_i \mid q)$$
This is formally similar to the Natural Evolution Strategy (NES) gradient estimator (Wierstra et al., 2014), where a population of perturbations is evaluated and the gradient is computed via fitness-weighted log-probability. The distinction is that GRPO operates in trajectory space (sampling different action sequences from the same policy) rather than parameter space (sampling different model weights).
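The analogy can be made concrete with a toy example: a softmax policy over three actions, where each "trajectory" is a single sampled action and the gradient is an advantage-weighted average of log-probability gradients. This is an illustrative sketch of the score-function estimator with group-normalized advantages, not the DeepResearch training code; the reward table, group size, and function names are hypothetical.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grpo_style_gradient(logits, rewards_by_action, group_size, rng):
    """Estimate the gradient of expected reward w.r.t. logits from one group.

    Each sample is a one-step "trajectory" (a single action index).
    """
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=group_size)
    rewards = [rewards_by_action[a] for a in actions]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + 1e-6) for r in rewards]

    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        # Gradient of log softmax w.r.t. logit j is 1[j == a] - probs[j].
        for j in range(len(logits)):
            grad[j] += adv * ((1.0 if j == a else 0.0) - probs[j]) / group_size
    return grad


rng = random.Random(0)
grad = grpo_style_gradient([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], group_size=64, rng=rng)
print([round(g, 3) for g in grad])
```

The estimate pushes probability mass toward the rewarded action and away from the others, as fitness-weighted updates do in NES, but the samples are drawn from the policy's action distribution rather than from perturbations of the parameters.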
51.6.2 Synthetic Data as Environmental Variation
The synthetic data pipeline introduces diversity pressure analogous to environmental variation in biological evolution. By exposing the agent to diverse research questions — different domains, difficulty levels, and reasoning patterns — the training process selects for generalist research capabilities rather than narrow specialization.
51.6.3 Limitations of the Evolutionary Analogy
The analogy should not be overstated. Key differences include:
- No explicit population. GRPO maintains a single policy network; the "population" exists only within each training batch and is discarded after the gradient step.
- No crossover. There is no mechanism for combining strategies from different trajectories.
- Gradient-based optimization. The parameter update is a smooth gradient step, not discrete selection and reproduction.
- No open-ended search. The fitness function (answer correctness) is fixed and externally specified.
These distinctions place GRPO-based training closer to estimation-of-distribution algorithms (EDAs): the policy network is a parameterized distribution from which solutions are sampled, and updates shift this distribution toward higher-fitness regions.
51.7 Implementation and Reproducibility
51.7.1 Repository Structure Audit
The repository at github.com/Alibaba-NLP/DeepResearch provides agent inference code, training pipeline, search environment integration, reward computation, and model download instructions. The following audit maps architecture components to their observable repository evidence.
| Architecture Component | Evidence | Verification Level |
|---|---|---|
| Agent inference entry point | README documents an inference launch command for running the research agent | Repo-documented (README-level; exact script path not cited) |
| Agent loop and action parsing | Paper describes XML-tag-based action parsing for <search>, <click>, <think>, <finish> | Paper-reported; implementation file/function: not audited |
| Search environment | README describes configurable search backend setup with API credentials | Repo-documented (README-level; interface code not audited) |
| System prompt / agent instructions | Paper describes the agent's instruction format and action schema | Paper-reported; template location in repo: not audited |
| SFT training | README documents SFT training procedure using teacher trajectories | Repo-documented (README-level; training script not audited) |
| GRPO training | verl is listed in dependency files; README documents RL training | Repo-documented (dependency confirmed; training entry point not audited) |
| Reward computation | Paper describes accuracy + format reward; reward logic presumably in training code | Paper-reported; exact file/function: not audited |
| Evaluation harness | README documents benchmark evaluation instructions | Repo-documented (README-level; evaluation script not audited) |
| Model weights | HuggingFace: Alibaba-NLP/DeepResearch-7B | Repo-documented + model-card (confirmed available) |
| Dependencies | verl, vllm, transformers, torch | Repo-documented (dependency file in repository) |
| Context management strategy | Not documented in README or paper at the level inspected | Implementation unknown |
51.7.2 Illustrative Pseudocode
The following code examples illustrate the three core implementation components. These are author reconstructions based on the paper's described methodology and the documented repository interfaces — they are not excerpts from the repository source code. Variable names, class structures, and API surfaces are illustrative. Readers building on this work should consult the actual repository source files for verified implementations.
Pseudocode 1: Agent Inference and Action Parsing
The agent's core inference loop generates actions via the LLM and parses them using XML-tag extraction, as described in the paper's action protocol (§51.2.1). The following illustrates the expected structure (author reconstruction from paper-reported methodology):
"""Agent action parsing and research loop.
PROVENANCE: Author reconstruction based on paper-reported action protocol.
NOT a repository excerpt. Exact implementation may differ in class names,
import paths, and module organization. Verify against pinned commit."""
import re
from dataclasses import dataclass
from typing import Optional

# XML-tag regex for action extraction.
# Paper-reported: the agent uses <search>, <click>, <think>, <finish> delimiters.
# The exact regex used in the repository is not audited here.
# The backreference \1 forces the closing tag to match the opening tag.
ACTION_PATTERN = re.compile(
    r"<(search|click|think|finish)>(.*?)</\1>", re.DOTALL
)


@dataclass
class AgentAction:
    """Parsed agent action. Structure follows paper-reported action schema."""
    action_type: str  # "search" | "click" | "think" | "finish"
    argument: str     # query string, URL, reasoning text, or final answer


def parse_action(response_text: str) -> Optional[AgentAction]:
    """Extract the first XML-tagged action from the model's response.

    Paper-reported format:
        <search>query text</search>
        <click>https://url</click>
        <think>reasoning text</think>
        <finish>final answer with citations</finish>
    """
    match = ACTION_PATTERN.search(response_text)
    if match:
        return AgentAction(
            action_type=match.group(1),
            argument=match.group(2).strip(),
        )
    return None


def run_research_agent(
    question: str,
    model,        # LLM inference engine (paper reports vLLM); .chat() is illustrative
    search_env,   # search environment with .search() and .fetch_page()
    system_prompt: str,
    max_steps: int = 15,  # T_max; illustrative default, actual default unknown
    max_tokens: int = 2048,
) -> str:
    """Execute the multi-step research agent loop (Algorithm 1, §51.3.1).

    This implements the paper-reported agent loop: generate action → parse
    XML tags → execute environment action → append observation → repeat.
    The actual repository implementation may use different parameter names,
    message formatting, or error handling patterns.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    for step in range(max_steps):
        # Generate action tokens from the policy and record them in the history
        response_text = model.chat(messages, max_tokens=max_tokens)
        messages.append({"role": "assistant", "content": response_text})
        action = parse_action(response_text)
        if action is None:
            # Malformed output: move on to the next step.
            # The exact recovery strategy is implementation unknown.
            continue
        if action.action_type == "finish":
            return action.argument
        # Execute environment action and get observation
        if action.action_type == "search":
            observation = search_env.search(action.argument)
        elif action.action_type == "click":
            observation = search_env.fetch_page(action.argument)
        else:
            # "think": internal reasoning, no env feedback (paper-reported)
            observation = ""
        # Append observation as user-role message (environment tokens)
        if observation:
            messages.append({"role": "user", "content": observation})
    # Max steps reached: force finish (paper-reported: agent is prompted to conclude)
    messages.append({
        "role": "user",
        "content": "You have reached the maximum number of steps. "
                   "Please provide your final answer now using <finish>...</finish>.",
    })
    final = model.chat(messages, max_tokens=max_tokens)
    final_action = parse_action(final)
    return final_action.argument if final_action else final
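A quick way to sanity-check the tag grammar is to run a pattern of this shape on a few strings. The sketch below is self-contained and uses hypothetical names (`TAG_PATTERN`, `first_action`); the pattern mirrors the paper-reported four-tag protocol, and whether the repository parses actions this way is not audited.

```python
import re

# Hypothetical pattern: an opening tag, lazily matched content, and a closing
# tag that must repeat the opening tag name via the \1 backreference.
TAG_PATTERN = re.compile(r"<(search|click|think|finish)>(.*?)</\1>", re.DOTALL)


def first_action(text):
    """Return (tag, stripped content) for the first well-formed action, else None."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


# The first complete action wins when several appear in one response.
two_actions = first_action("<think>plan</think><search>GAIA benchmark</search>")

# re.DOTALL lets the content span multiple lines.
multiline = first_action("<finish>line one\nline two</finish>")

# Mismatched closing tags and untagged text are rejected.
mismatched = first_action("<search>query</click>")
untagged = first_action("no tags here")
```

Because the lazy quantifier stops at the first matching close tag, a response containing both reasoning and a search yields only the leading action; a loop that wants every action in a response would need `finditer` instead.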
Pseudocode 2: Search Environment Interface
The search environment wraps external search APIs and provides formatted results to the agent. The paper reports a configurable search backend; the following shows the expected interface structure (author reconstruction):
"""Search environment interface.
PROVENANCE: Author reconstruction based on paper-reported search integration
and repository-documented configurable backend. NOT a repository excerpt.
The actual search module may use different class names and API conventions."""
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Single search result. Paper describes results containing title, URL, snippet."""
    title: str
    url: str
    snippet: str


class SearchBackend(ABC):
    """Abstract search backend interface.

    Repo-documented: the repository supports configurable search backends
    with API credentials set via configuration. The exact abstract interface
    and concrete implementations are not audited at the source-code level.
    """

    @abstractmethod
    def web_search(self, query: str, num_results: int = 10) -> list[SearchResult]:
        """Execute a web search query and return ranked results."""
        ...

    @abstractmethod
    def fetch_page(self, url: str, max_chars: int = 8000) -> str:
        """Fetch a web page and extract text content, truncated to budget."""
        ...


class SearchEnvironment:
    """Wraps a search backend to format results for agent context.

    This illustrates the expected pattern: search results are formatted
    as numbered text entries that the agent can reference in subsequent
    actions. The actual formatting template is implementation unknown.
    """

    def __init__(self, backend: SearchBackend, max_results: int = 10):
        self.backend = backend
        self.max_results = max_results

    def search(self, query: str) -> str:
        """Execute search and format results as agent-readable text."""
        results = self.backend.web_search(query, self.max_results)
        formatted_parts = []
        for i, r in enumerate(results, 1):
            formatted_parts.append(
                f"[{i}] {r.title}\n URL: {r.url}\n {r.snippet}"
            )
        return "\n\n".join(formatted_parts)

    def fetch_page(self, url: str) -> str:
        """Fetch page content, truncated to the backend's default character budget."""
        return self.backend.fetch_page(url)
Pseudocode 3: GRPO Training Configuration
The GRPO training stage uses the verl framework (repo-documented dependency). The following illustrates the expected training setup structure. Hyperparameter values are not verified — they are illustrative placeholders showing the configuration dimensions; readers must consult the repository's actual configuration files for ground-truth values.
"""GRPO training configuration structure.
PROVENANCE: Author reconstruction based on:
  - Paper-reported: GRPO algorithm choice, SFT → RL two-stage pipeline
  - Repo-documented: verl dependency, vLLM for generation
  - Implementation unknown: exact hyperparameters, entry point, config schema
This is NOT a repository excerpt. The actual training launch mechanism,
configuration format, and parameter names may differ substantially.
All hyperparameter values below are ILLUSTRATIVE PLACEHOLDERS."""

# ── Conceptual training configuration ────────────────────────────────
# The repository uses verl for distributed GRPO training.
# The configuration likely specifies these dimensions (author analysis):
training_dimensions = {
    # Model paths
    "sft_checkpoint": "...",                   # SFT-initialized model (Stage 1 output)
    "reference_model": "...",                  # Same SFT checkpoint, frozen for KL penalty
    "tokenizer": "Qwen/Qwen2.5-7B-Instruct",   # model-card

    # GRPO parameters (paper-reported algorithm; exact values unknown)
    "group_size": "?",                 # G: trajectories per question (paper-reported)
    "clip_ratio": "?",                 # ε_c: PPO-style clipping (canonical default 0.2)
    "kl_coeff": "?",                   # β: KL penalty weight (implementation unknown)
    "advantage_normalization": True,   # canonical GRPO property

    # Rollout (paper-reported: online trajectory generation with real search)
    "inference_engine": "vllm",        # repo-documented dependency
    "max_agent_steps": "?",            # T_max: configurable, default unknown
    "search_backend": "...",           # configurable (repo-documented)

    # Training (implementation unknown: exact schedule, hardware)
    "learning_rate": "?",
    "num_epochs": "?",
    "gradient_accumulation": "?",

    # Reward (paper-reported structure; exact weights unknown)
    "accuracy_weight": "? (dominant)",   # paper-reported: primary signal
    "format_weight": "? (secondary)",    # paper-reported: format adherence
}

# ── Launch pattern ───────────────────────────────────────────────────
# Repo-documented: training is launched via verl's distributed trainer.
# The exact entry point (e.g., a training script invoking verl's API)
# is not audited at the source-code level. verl handles:
#   - Distributed rollout generation across GPUs
#   - Online trajectory collection with search-environment interaction
#   - Advantage computation (group-relative normalization)
#   - Multi-GPU gradient synchronization
#
# The search environment is integrated into the rollout worker so each
# trajectory involves real web search API calls during training.
# (paper-reported: online RL with live search)
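The reward dimensions listed in the configuration above (a dominant accuracy term plus a secondary format term) can be sketched as a weighted sum. Everything specific here is an assumption of the sketch: the 0.9/0.1 weights are placeholders, the accuracy check is a normalized exact match rather than whatever scorer the paper uses, and the format check only tests for a well-formed `<finish>` tag pair.

```python
import re

# Hypothetical format check: a well-formed <finish>...</finish> pair.
FINISH_PATTERN = re.compile(r"<finish>(.*?)</finish>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())


def trajectory_reward(
    final_response: str,
    gold_answer: str,
    accuracy_weight: float = 0.9,  # placeholder, not a verified value
    format_weight: float = 0.1,    # placeholder, not a verified value
) -> float:
    """Combine a binary accuracy signal with a binary format signal."""
    match = FINISH_PATTERN.search(final_response)
    format_ok = match is not None
    answer = match.group(1) if match else ""
    correct = format_ok and normalize(answer) == normalize(gold_answer)
    accuracy = 1.0 if correct else 0.0
    return accuracy_weight * accuracy + format_weight * (1.0 if format_ok else 0.0)


correct_and_formatted = trajectory_reward("<finish> Paris </finish>", "paris")
wrong_but_formatted = trajectory_reward("<finish>London</finish>", "Paris")
unformatted = trajectory_reward("Paris", "Paris")
```

In this sketch an unformatted answer earns nothing even when its text is right, which is one possible design; a scorer that extracts answers from free text would behave differently.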
51.7.3 Compute Requirements
RL training of research agents is computationally expensive due to the online trajectory generation requirement. Each GRPO training step requires (author analysis — cost model not from the paper):
- Trajectory rollout. Generating $G$ trajectories per question, each involving multiple LLM generation calls and search API calls. For a batch of $B$ questions with group size $G$, an average of $\bar{L}$ actions per trajectory, and an average of $\bar{S}$ search calls per trajectory, this amounts to approximately $B \times G \times \bar{L}$ generation calls (one autoregressive decode per action) plus $B \times G \times \bar{S}$ search API calls.
- Reward computation. Scoring each trajectory against ground truth or via judge models.
- Gradient computation. Standard transformer backward pass.
This cost model is an author formalization — the exact compute budget (GPU type/count, training steps, API calls, wall-clock time, total cost) is not specified in the publicly reviewed materials.
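The rollout term of this cost model is simple arithmetic, sketched below. The function name and all numeric inputs are hypothetical; they are chosen only to show the scaling and do not reflect any reported training configuration. The sketch counts one LLM call per agent action.

```python
from dataclasses import dataclass


@dataclass
class RolloutCost:
    """Per-training-step rollout counts under the author cost model."""
    trajectories: int
    generation_calls: int
    search_calls: int


def rollout_cost(batch_questions: int, group_size: int,
                 avg_actions: float, avg_searches: float) -> RolloutCost:
    """Compute B*G trajectories, B*G*L generation calls, B*G*S search calls."""
    trajectories = batch_questions * group_size
    return RolloutCost(
        trajectories=trajectories,
        generation_calls=round(trajectories * avg_actions),
        search_calls=round(trajectories * avg_searches),
    )


# Illustrative inputs only: B=64, G=8, 6 actions and 3 searches per trajectory.
cost = rollout_cost(batch_questions=64, group_size=8,
                    avg_actions=6.0, avg_searches=3.0)
```

With these made-up inputs a single training step would need 512 trajectories, 3,072 generation calls, and 1,536 search API calls, which shows why group size and trajectory length dominate the budget.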
51.8 Comparative Analysis
DeepResearch exists within a rapidly evolving landscape of autonomous research agents. The following comparison contextualizes its design choices against related open-source systems where evidence is available.
The comparison table below covers only dimensions with published evidence. Direct performance comparisons across systems are not meaningful due to differences in base models, compute budgets, search APIs, and evaluation dates.
| Dimension | DeepResearch (Alibaba) | Search-R1 | WebThinker |
|---|---|---|---|
| Open-source | Yes — code + weights[repo] | Yes[paper] | Yes[paper] |
| Training method | SFT + GRPO[paper] | RL with process reward[paper] | SFT + RL[paper] |
| Base model(s) | Qwen2.5-7B-Instruct[paper] | Qwen2.5 / LLaMA family[paper] | Various[paper] |
| Teacher model | QwQ-32B-Preview[paper] | N/A (online RL)[paper] | Varies[paper] |
| Action space | search, click, think, finish[paper] | search, finish[paper] | search, read, write, finish[paper] |
| Synthetic data pipeline | Yes — teacher trajectories[paper] | Yes[paper] | Yes[paper] |
| Released model size | 7B parameters[model-card] | 7B–32B[paper] | 7B–32B[paper] |
| Key distinguishing feature | GRPO for agentic RL; open teacher–student pipeline[paper] | Process-level reward signal[paper] | Write action for structured note-taking[paper] |
Evidence tier key: [paper] = stated in the system's published paper; [repo] = confirmed from public repository; [model-card] = from HuggingFace model card. OpenAI's Deep Research is excluded because its training method, model architecture, and internal design are not publicly documented.
51.8.1 Distinguishing Design Choices
GRPO over PPO. The choice of GRPO over standard PPO eliminates the need for a trained critic/value network (canonical GRPO property; paper-reported choice). This simplifies the training pipeline (one model to train instead of two) and avoids the bootstrapping problem where the critic must accurately estimate the value of partial research trajectories. PPO requires learning $V(s)$ for every intermediate state; GRPO sidesteps this by comparing complete trajectory outcomes within a group.
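The group comparison can be sketched in a few lines. This follows the canonical group-relative normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` stabilizer are assumptions of this sketch, and the repository's exact computation is not audited.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own group's statistics."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # No learned value function appears anywhere: the group mean is the baseline.
    return [(r - mean) / (std + eps) for r in rewards]


# Four trajectories for one question: two correct, two incorrect.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every trajectory gets the same reward, the group carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The uniform case illustrates a practical consequence of dropping the critic: questions that are always solved or never solved within a group contribute zero advantage and therefore no gradient.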
Richer action space than Search-R1. DeepResearch includes explicit click and think actions beyond Search-R1's minimal search/finish (paper-reported). The click action allows drilling into specific documents beyond search snippets; the think action provides explicit intermediate reasoning without triggering environment interaction. Whether this richer action space translates to better performance is an empirical question confounded by base model, training data, and evaluation differences.
Open weights. DeepResearch releases both model weights and training code (repo-documented), enabling the research community to study, reproduce, and build upon the approach.
51.9 Limitations and Open Questions
51.9.1 Technical Limitations
Context window constraints. Even with a 128K-token context window, long research trajectories that include multiple full web pages can approach or exceed the limit. Information gathered in early steps may be lost by the time the agent synthesizes its answer.
Search API dependence. Performance is coupled to search API quality and coverage, which varies significantly by topic, recency, and language.
Reward sparsity for complex questions. Binary answer-correctness rewards provide a sparse training signal for multi-part questions. The agent receives no credit for gathering relevant information if it fails to synthesize a correct final answer. Process-based rewards (as used in Search-R1) could improve training efficiency but introduce reward-specification challenges.
Hallucination in synthesis. Even with search grounding, the finish action produces free-form text not mechanically constrained to cite only retrieved information.
51.9.2 Open Research Questions
- Does RL discover genuinely novel strategies? The current approach initializes with SFT on teacher trajectories. Whether GRPO discovers fundamentally different strategies (not just refined versions) would require detailed trajectory analysis comparing teacher and student behavior — an analysis not present in the current report.
- Multi-agent research. Whether specialized agents (search, reasoning, synthesis) could outperform a single unified agent is unexplored.
- Continual learning. Research agents trained at a fixed point in time gradually become stale. Mechanisms for periodic re-training are not addressed.
- Adaptive tool use. The fixed action space could be extended with code execution, database queries, or specialized API calls. How to train agents for a growing tool set is an important direction.
- Scaling laws for agentic RL. Whether predictable relationships exist between GRPO compute investment (group size, questions, training steps) and downstream performance remains open.
51.10 Impact and Significance
Methodological contribution. The demonstration that GRPO — originally developed for mathematical reasoning — transfers effectively to the agentic research setting is a non-trivial empirical finding (paper-reported). The adaptation involves two key differences from the mathematical reasoning setting: (1) trajectories include environment interactions that must be masked out of the loss (§51.3.3), and (2) trajectory length is determined by the agent's own decisions. That GRPO remains effective under these conditions extends its demonstrated applicability.
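The first of these differences, masking environment tokens out of the loss, can be illustrated with a minimal sketch. It assumes a chat-style message list in which only assistant-role text is policy-generated, and it uses whitespace splitting as a stand-in for a real tokenizer; the function names are hypothetical and the actual masking implementation in the training code is not audited.

```python
def build_loss_mask(messages):
    """Return (tokens, mask) where mask is 1 for policy tokens, 0 otherwise."""
    tokens, mask = [], []
    for message in messages:
        pieces = message["content"].split()  # stand-in for a real tokenizer
        is_policy = 1 if message["role"] == "assistant" else 0
        tokens.extend(pieces)
        mask.extend([is_policy] * len(pieces))
    return tokens, mask


def masked_mean(per_token_loss, mask):
    """Average the loss over policy-generated tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


trajectory = [
    {"role": "user", "content": "Who wrote Dune"},
    {"role": "assistant", "content": "<search>Dune author</search>"},
    {"role": "user", "content": "[1] Frank Herbert"},
    {"role": "assistant", "content": "<finish>Frank Herbert</finish>"},
]
tokens, mask = build_loss_mask(trajectory)

# Made-up per-token losses; the large values on observation tokens are ignored.
losses = [5.0, 5.0, 5.0, 1.0, 3.0, 5.0, 5.0, 5.0, 2.0, 2.0]
policy_loss = masked_mean(losses, mask)
```

Without the mask, the gradient would push the policy to predict search-engine output, which it does not control.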
Open-source infrastructure. By releasing code and weights (repo-documented: Alibaba-NLP/DeepResearch-7B on HuggingFace), DeepResearch lowers the barrier to entry for research on autonomous research agents.
Training pipeline as contribution. The synthetic data pipeline itself — generating, filtering, and curating research trajectories at scale using a larger teacher model — is a reusable methodology. Other groups can adapt it for different base models, search APIs, or domain-specific research tasks.
51.11 Summary
Chapter Summary
Key takeaway. DeepResearch demonstrates that reinforcement learning (GRPO) over agentic search trajectories produces meaningfully better research agents than SFT alone — approximately +10pp on GAIA validation average (paper-reported, Zheng et al. 2025, Table 1) — and provides an open-source pipeline for training such agents.
Main contribution. An end-to-end open-source pipeline: synthetic trajectory generation from QwQ-32B-Preview → quality filtering → SFT cold start on Qwen2.5-7B-Instruct → GRPO training via verl → released as DeepResearch-7B. The adaptation of GRPO from mathematical reasoning to the agentic setting — where trajectories interleave policy-generated action tokens with environment-returned observation tokens, requiring generation masking in the loss (§51.3.3, Layer 2) — extends the method's demonstrated applicability.
What a researcher should know. (1) Both models are dense transformers (7B student, 32B teacher), not MoE. (2) Results in §51.5.2 are single-run point estimates without reported variance; GAIA Level 3 has ~17 questions (~6pp granularity). (3) Web search non-stationarity complicates reproduction (§51.5.1). (4) The system is not evolutionary in the traditional sense but shares structural similarities with NES via GRPO's population-based trajectory sampling (§51.6). (5) Code examples in §51.7.2 are author reconstructions, not repository excerpts.
- Model checkpoint: confirm Alibaba-NLP/DeepResearch-7B on HuggingFace; verify base is Qwen2.5-7B-Instruct (dense, ~7.6B).
- Repository: clone at a pinned commit; record the hash. Verify the audit in §51.7.1 against the actual source tree, particularly the agent entry point, action parser, search module, and training scripts.
- Benchmark scores: verify §51.5.2 against Zheng et al. (2025), Tables 1–2. Note evaluation date and search API used.
- Pseudocode: the code in §51.7.2 is author reconstruction — verify against actual source files for class names, imports, and module paths.
- Training config: verify GRPO hyperparameters ($G$, $\varepsilon_c$, $\beta$, learning rate) from the repository's configuration files, not from this chapter's illustrative placeholders.
- GRPO formalization: the masked-loss notation in §51.3.3 Layer 2 is author formalization of paper-reported practice. Verify the actual implementation approach from the training code.