Introduced: 2025-10
Score: 7.85/10 (Draft)
Chapter 51

DeepResearch: Alibaba Autonomous Research Agent

Part: Autonomous Research Systems

Key Contribution

DeepResearch, developed by Alibaba's Tongyi NLP group, presents an end-to-end framework for training autonomous research agents via reinforcement learning over real search-engine interactions. Its principal contribution is a synthetic data pipeline that generates diverse, multi-step research trajectories from a teacher model (QwQ-32B-Preview), followed by agentic RL training (using Group Relative Policy Optimization) that teaches a smaller student model (Qwen2.5-7B-Instruct) to autonomously plan searches, synthesize findings, and produce cited answers. The system demonstrates that RL over agentic trajectories — rather than supervised fine-tuning alone — is critical for robust multi-hop research performance, achieving approximately 10 percentage-point accuracy gains over SFT-only on GAIA validation (paper-reported, Zheng et al. 2025, Table 1). Code and model weights are released at github.com/Alibaba-NLP/DeepResearch.

Evidentiary Framework. This chapter draws on three primary sources: (1) the technical report (Zheng et al., 2025, Alibaba Tongyi NLP); (2) the public repository at github.com/Alibaba-NLP/DeepResearch; and (3) HuggingFace model cards. Every claim carries one of the following provenance tags:
  • Paper-reported — stated directly in the technical report.
  • Repo-documented — confirmed from the repository's README, dependency files, or documented interfaces. This does not imply source-code-level audit of internal implementation.
  • Code-verified [file:function] — confirmed by direct inspection of a named source file. Used sparingly; only where the chapter author has verified a specific identifier.
  • Model-card — confirmed by the HuggingFace model card.
  • Canonical formulation — taken from the cited originating paper (e.g., GRPO from Guo et al. 2025).
  • Author formalization — the chapter author's analytical construction or illustrative reconstruction, not directly traceable to a primary source.
  • Implementation unknown — a plausible detail that could not be confirmed from public materials.
Repository pinning. This chapter does not pin a specific commit hash because the audit was conducted at the README and dependency-file level, not via full source-code walkthrough. Readers performing implementation study should clone the repository, record a commit hash via git log --oneline -1, and verify all claims tagged repo-documented against that snapshot.

51.1 Overview and Motivation

The task of autonomous research — gathering, synthesizing, and reasoning over information from heterogeneous sources to answer complex, open-ended questions — has emerged as a central challenge for large language model (LLM) agents. While chain-of-thought reasoning and retrieval-augmented generation (RAG) address parts of this problem, they typically operate in a single-turn or shallow-retrieval paradigm. Real-world research requires iterative refinement: formulating hypotheses, searching for evidence, revising understanding, identifying gaps, searching again, and eventually synthesizing a coherent answer with proper attribution.

DeepResearch, released by Alibaba's Tongyi NLP group (the team behind the Qwen model family), addresses this gap by training LLMs to perform multi-step, tool-augmented research autonomously. The system's design rests on three pillars:

  1. Agentic trajectory generation. A synthetic data pipeline uses a capable teacher model (QwQ-32B-Preview) to produce diverse research trajectories — sequences of search queries, document reads, intermediate reasoning steps, and final answers — across thousands of research questions (paper-reported).
  2. Reinforcement learning from agentic experience. Rather than relying solely on supervised fine-tuning (SFT) on teacher trajectories, the system applies Group Relative Policy Optimization (GRPO) to train the student model (Qwen2.5-7B-Instruct) by rewarding successful research outcomes, enabling the agent to discover its own effective research strategies (paper-reported).
  3. Open-weight release. The trained model weights and training pipeline code are publicly released, enabling reproducibility and downstream adaptation (repo-documented: README includes HuggingFace model link for Alibaba-NLP/DeepResearch-7B and training instructions).

The motivation is both practical and scientific. Practically, autonomous research agents could accelerate literature review, competitive intelligence, fact-checking, and scientific discovery. Scientifically, the question of whether RL can teach models to plan and execute multi-step information-gathering strategies — rather than merely generating fluent text — is a fundamental test of agentic capability.

51.1.1 Positioning Within Autonomous Research Systems

DeepResearch enters a landscape that includes OpenAI's Deep Research (integrated into ChatGPT Pro), Google DeepMind's research agents, and open-source efforts such as Search-R1, AutoAgent-R1, and WebThinker. Compared to proprietary systems, DeepResearch distinguishes itself by open-sourcing the training pipeline and model weights. Compared to prior open-source work, the principal advance claimed in the technical report is the combination of synthetic trajectory generation with GRPO-based policy optimization, rather than SFT alone (paper-reported).

Scope note. Several concurrent systems pursued similar RL-for-research-agents directions during early 2025. This chapter does not adjudicate priority but examines DeepResearch's technical approach on its merits. Claims about relative novelty reflect the authors' framing in the technical report unless otherwise noted.

51.2 Architecture

DeepResearch follows a modular agentic architecture in which a language model operates as a controller that selects actions, processes observations, and maintains a running research state. The technical report describes four principal components: the agent controller (the LLM policy), the search environment (web search integration), the trajectory manager (state tracking and action formatting), and the training pipeline (synthetic data generation, filtering, and RL optimization) (paper-reported).

[Figure: DeepResearch architecture. The Agent Controller (Qwen2.5-7B-Instruct student, dense, 7B params) emits search(query), click(url), think(reasoning), and finish(answer) actions. Surrounding components: Search Environment, Trajectory Manager, and Training Pipeline (Synthetic Data Gen, Trajectory Filtering, GRPO / RL Training). Labeled edges: policy update, observations.]

51.2.1 Agent Controller

The agent controller is the core LLM that decides which action to take at each step. During inference, it receives the accumulated research context (prior searches, retrieved passages, reasoning steps) and produces one of four action types: search, click, think, or finish. Actions are emitted as structured tokens using XML-like delimiters (paper-reported) that are parsed into executable operations. The action format uses tagged regions as follows (paper-reported action schema):

<!-- Action format: XML-like delimiters for structured agent output (paper-reported) -->

<search>multi-hop reasoning techniques in LLM agents 2024</search>

<click>https://example.com/relevant-paper.html</click>

<think>The retrieved document discusses chain-of-thought prompting but not
multi-step search. I need to search more specifically for iterative
retrieval methods...</think>

<finish>Based on my research, multi-hop reasoning in LLM agents involves...
[cited synthesis with source attributions]</finish>
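The tagged format above can be parsed with a small regular expression. The following is a minimal sketch, not the repository's parser: the names `ACTION_TYPES`, `Action`, and `parse_action` are hypothetical, and treating the first well-formed tag as the action is an assumption.

```python
import re
from typing import NamedTuple, Optional

# The four action types named in the chapter's action schema.
ACTION_TYPES = ("search", "click", "think", "finish")

# Matches the first well-formed <tag>...</tag> region. DOTALL lets the
# argument span several lines, as in the <think> example.
_ACTION_RE = re.compile(
    r"<(?P<tag>" + "|".join(ACTION_TYPES) + r")>(?P<arg>.*?)</(?P=tag)>",
    re.DOTALL,
)


class Action(NamedTuple):
    type: str
    arg: str


def parse_action(text: str) -> Optional[Action]:
    """Return the first well-formed action in `text`, or None if malformed."""
    match = _ACTION_RE.search(text)
    if match is None:
        return None
    return Action(match.group("tag"), match.group("arg").strip())


example = parse_action("<search>iterative retrieval methods</search>")
print(example)
```

Returning `None` for malformed output, rather than raising, leaves the caller free to decide how to handle a bad generation (for instance by assigning a zero format reward during training).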

The controller is based on the Qwen2.5-7B-Instruct model (paper-reported; model-card: Qwen/Qwen2.5-7B-Instruct), a dense transformer with 7.6 billion parameters and a context window of up to 128K tokens (model-card). The teacher model used for trajectory generation is QwQ-32B-Preview (paper-reported; model-card: Qwen/QwQ-32B-Preview), a 32-billion-parameter reasoning model. Specific architecture details are given in §51.4.

51.2.2 Search Environment

The search environment wraps a web search API to provide the agent with real-time web information. When the agent issues a search(query) action, the environment returns a set of search results including titles, URLs, and snippets. The click(url) action fetches and processes the content of a specific page, returning a text representation that fits within the model's context window (paper-reported). The repository documents a configurable search backend, with search engine selection specified via configuration or environment variables (repo-documented: README describes search API setup).
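One way to picture the configurable backend is an environment class that receives its search and fetch functions by injection. This is an illustrative sketch only: `SearchEnvironment`, `SearchResult`, `fake_search`, `fake_fetch`, and the `top_k` and `max_page_chars` parameters are hypothetical names, and character truncation stands in for whatever token-aware processing the real system applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SearchResult:
    title: str
    url: str
    snippet: str


class SearchEnvironment:
    """Hypothetical wrapper: backends are injected so any search API fits."""

    def __init__(self, search_fn: Callable[[str], List[SearchResult]],
                 fetch_fn: Callable[[str], str],
                 top_k: int = 5, max_page_chars: int = 4000) -> None:
        self.search_fn = search_fn
        self.fetch_fn = fetch_fn
        self.top_k = top_k
        self.max_page_chars = max_page_chars

    def web_search(self, query: str) -> str:
        """Render titles, URLs, and snippets as one text observation."""
        results = self.search_fn(query)[: self.top_k]
        lines = [f"[{i}] {r.title} | {r.url} | {r.snippet}"
                 for i, r in enumerate(results, 1)]
        return "\n".join(lines) if lines else "No results."

    def fetch_page(self, url: str) -> str:
        # Character truncation is a stand-in for token-aware truncation.
        return self.fetch_fn(url)[: self.max_page_chars]


# Offline stand-in backend so the sketch runs without network access.
_PAGES: Dict[str, str] = {
    "https://example.com/grpo": "GRPO compares completions within a group. " * 50,
}


def fake_search(query: str) -> List[SearchResult]:
    terms = query.lower().split()
    return [SearchResult(title=url.rsplit("/", 1)[-1], url=url, snippet=text[:60])
            for url, text in _PAGES.items()
            if any(term in text.lower() for term in terms)]


def fake_fetch(url: str) -> str:
    return _PAGES.get(url, "")


env = SearchEnvironment(fake_search, fake_fetch, max_page_chars=100)
print(env.web_search("group"))
```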

51.2.3 Trajectory Manager

The trajectory manager maintains the running conversation state as a sequence of (action, observation) pairs (paper-reported). The context grows with each step, and context-window management is necessary for long research sessions. The specific context management strategy (sliding-window truncation, summarization, or importance-based pruning) is not detailed in the reviewed public materials (implementation unknown).
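Because the actual strategy is unknown, the sketch below illustrates only one of the candidates named above, sliding-window truncation, and should not be read as the system's behavior. `TrajectoryManager`, `count_tokens`, and the whitespace token proxy are all assumptions made for illustration.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude whitespace proxy; a real system would use the model tokenizer."""
    return len(text.split())


class TrajectoryManager:
    """Hypothetical state tracker holding (action, observation) pairs."""

    def __init__(self, system_prompt: str, max_tokens: int) -> None:
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens
        self.steps: List[Tuple[str, str]] = []

    def append(self, action: str, observation: str) -> None:
        self.steps.append((action, observation))

    def render(self) -> str:
        """Keep the prompt plus the most recent steps that fit the budget."""
        budget = self.max_tokens - count_tokens(self.system_prompt)
        kept: List[str] = []
        for action, observation in reversed(self.steps):
            block = f"{action}\n{observation}".strip()
            cost = count_tokens(block)
            if cost > budget:
                break  # everything older than this step is dropped
            kept.append(block)
            budget -= cost
        return "\n".join([self.system_prompt] + kept[::-1])


tm = TrajectoryManager("Q: test", max_tokens=8)
tm.append("<search>a</search>", "r1 r2 r3")
tm.append("<search>b</search>", "r4 r5")
print(tm.render())
```

With a budget of 8 proxy tokens, the older search step no longer fits and only the most recent step is rendered after the prompt.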

51.3 Core Algorithms

51.3.1 The Research Agent Loop

The agent operates as a sequential decision-making process. At each time step $t$, the agent observes its accumulated context $c_t$ and selects an action $a_t$ from the action space $\mathcal{A} = \{\texttt{search}, \texttt{click}, \texttt{think}, \texttt{finish}\}$ according to its policy $\pi_\theta(a_t \mid c_t)$. An episode terminates when the agent selects the finish action or when a maximum step limit $T_{\max}$ is reached.

A complete trajectory $\tau$ has the following structure (paper-reported structure; notation is author formalization):

$$\tau = \big(\,\underbrace{s_0}_{\text{system prompt}},\;\underbrace{a_0}_{\text{action}_0},\;\underbrace{e_0}_{\text{env obs}_0},\;\underbrace{a_1}_{\text{action}_1},\;\underbrace{e_1}_{\text{env obs}_1},\;\ldots,\;\underbrace{a_T}_{\texttt{finish}}\,\big)$$

where each component occupies a distinct role:

  • $s_0$ — the system prompt and question tokens (fixed, not generated by the policy).
  • $a_t = (a_{t,1}, a_{t,2}, \ldots, a_{t,L_t})$ — the $t$-th action, a sequence of $L_t$ tokens generated by the policy $\pi_\theta$. This includes the XML delimiter tokens and the action content.
  • $e_t = (e_{t,1}, e_{t,2}, \ldots, e_{t,K_t})$ — the $t$-th environment observation, a sequence of $K_t$ tokens returned by the environment. These tokens are not generated by the policy.

The context at step $t$ is the concatenation of all preceding components: $c_t = s_0 \oplus a_0 \oplus e_0 \oplus \cdots \oplus a_{t-1} \oplus e_{t-1}$. This distinction between policy-generated tokens ($a_t$) and environment-returned tokens ($e_t$) is critical for the RL training formulation in §51.3.3, where only policy-generated tokens receive gradient signal.

Algorithm 1: RESEARCH-AGENT-LOOP
Provenance: paper-reported methodology; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input:  question q, policy model π_θ, search environment E, max steps T_max
Output: trajectory τ whose final finish action carries the cited answer

 1  context ← FORMAT_SYSTEM_PROMPT(q)
 2  for t = 0 to T_max − 1 do
 3  │  action_tokens ← π_θ.generate(context)        // policy-generated tokens a_t
 4  │  (type, arg) ← PARSE_XML_TAGS(action_tokens)   // extract <tag>...</tag>
 5  │
 6  │  if type = "finish" then
 7  │  │  return context ⊕ action_tokens             // τ complete; arg is the cited answer
 8  │  end
 9  │
10  │  if type = "search" then
11  │  │  env_obs ← E.web_search(arg)                // environment observation e_t
12  │  else if type = "click" then
13  │  │  env_obs ← E.fetch_page(arg)                // environment observation e_t
14  │  else if type = "think" then
15  │  │  env_obs ← ∅                                // no environment feedback
16  │  end
17  │
18  │  context ← context ⊕ action_tokens ⊕ env_obs   // a_t and e_t appended
19  end
20  return FORCE_FINISH(context)                      // max steps reached
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
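The loop in Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the released implementation: `run_agent` and its callable parameters are hypothetical, a scripted stand-in replaces the policy model, malformed output is answered with an error observation, and the step-limit case returns `None` instead of forcing a final answer.

```python
import re
from typing import Callable, List, Optional, Tuple

ACTION_RE = re.compile(r"<(search|click|think|finish)>(.*?)</\1>", re.DOTALL)


def run_agent(question: str,
              generate: Callable[[str], str],
              web_search: Callable[[str], str],
              fetch_page: Callable[[str], str],
              max_steps: int = 8) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Return (answer, steps); steps holds (action_text, observation) pairs."""
    context = f"Question: {question}\n"
    steps: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action_text = generate(context)           # policy-generated tokens
        match = ACTION_RE.search(action_text)
        if match is None:
            # Assumption: malformed output yields an error observation.
            observation = "Error: no well-formed action tag found."
        else:
            kind, arg = match.group(1), match.group(2).strip()
            if kind == "finish":
                steps.append((action_text, ""))
                return arg, steps
            if kind == "search":
                observation = web_search(arg)     # environment tokens
            elif kind == "click":
                observation = fetch_page(arg)     # environment tokens
            else:                                 # think: no feedback
                observation = ""
        steps.append((action_text, observation))
        context += action_text + "\n" + observation + "\n"
    return None, steps                            # step limit reached


# Scripted stand-in for the policy: three fixed actions.
script = iter([
    "<search>capital of France</search>",
    "<think>The snippet says Paris.</think>",
    "<finish>Paris [1]</finish>",
])
answer, steps = run_agent(
    "What is the capital of France?",
    generate=lambda ctx: next(script),
    web_search=lambda q: "[1] Paris is the capital of France.",
    fetch_page=lambda url: "",
)
print(answer, len(steps))
```

Keeping action text and observation as separate fields in each step preserves the policy-token versus environment-token distinction that the RL objective later depends on.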

51.3.2 Synthetic Data Pipeline

The synthetic data pipeline is the first stage of training. Its purpose is to generate a large corpus of research trajectories that demonstrate effective multi-step research behavior. The pipeline operates as follows (paper-reported):

  1. Question collection. Research questions are collected from diverse sources: existing QA benchmarks, web-harvested complex questions, and synthetically generated questions designed to require multi-hop reasoning.
  2. Teacher trajectory generation. The teacher model (QwQ-32B-Preview; paper-reported) executes the research agent loop on each question, interacting with live search APIs. Each execution produces a complete trajectory $\tau_i$.
  3. Answer verification. The final answer in each trajectory is evaluated against ground-truth labels (where available) or assessed by a judge model for correctness and completeness.
  4. Quality filtering. Trajectories are filtered based on answer correctness, trajectory characteristics, and diversity criteria. The filtering ensures the training corpus contains a mix of easy and hard questions with varied search strategies.

The quality filtering stage is critical: naive SFT on all teacher trajectories would bias the student toward the teacher's particular search patterns, including its mistakes. Filtering on outcome quality creates a curated dataset of successful research strategies (paper-reported rationale).

Algorithm 2: SYNTHETIC-TRAJECTORY-GENERATION
Provenance: paper-reported methodology; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input:  question set Q, teacher model M_teacher (QwQ-32B-Preview),
        search environment E, quality threshold θ_min
Output: filtered trajectory dataset D_filtered

 1  D_raw ← ∅
 2  for each question q ∈ Q do                      // parallel across questions
 3  │  τ ← RESEARCH-AGENT-LOOP(q, M_teacher, E)     // teacher executes search loop
 4  │  answer ← EXTRACT_ANSWER(τ)
 5  │  D_raw ← D_raw ∪ {(q, τ, answer)}
 6  end
 7
 8  D_filtered ← ∅
 9  for each (q, τ, answer) ∈ D_raw do
10  │  score ← SCORE_TRAJECTORY(q, answer, ground_truth(q))
11  │  if score ≥ θ_min then
12  │  │  D_filtered ← D_filtered ∪ {(q, τ, answer, score)}
13  │  end
14  end
15  return D_filtered

Subroutine SCORE_TRAJECTORY:
  - If ground truth available: exact match → 1.0; partial match via judge → [0, 1]
  - If no ground truth: judge model assesses answer quality → [0, 1]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
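The scoring and filtering half of Algorithm 2 might look like the sketch below. The SQuAD-style answer normalization, the record layout, the `judge` callable, and the default threshold of 0.5 are assumptions for illustration; the paper's exact matching rules and threshold are not reproduced here.

```python
import re
import string
from typing import Callable, Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def score_trajectory(answer: str, ground_truth: Optional[str],
                     judge: Callable[[str, Optional[str]], float]) -> float:
    """Exact match scores 1.0; otherwise defer to a judge returning [0, 1]."""
    if ground_truth is not None and normalize(answer) == normalize(ground_truth):
        return 1.0
    return judge(answer, ground_truth)


def filter_trajectories(records: List[Dict], judge: Callable[[str, Optional[str]], float],
                        threshold: float = 0.5) -> List[Dict]:
    """Keep records whose score meets the threshold, attaching the score."""
    kept = []
    for rec in records:
        score = score_trajectory(rec["answer"], rec.get("ground_truth"), judge)
        if score >= threshold:
            kept.append({**rec, "score": score})
    return kept


records = [
    {"question": "q1", "trajectory": [], "answer": "The Eiffel Tower.",
     "ground_truth": "eiffel tower"},
    {"question": "q2", "trajectory": [], "answer": "Berlin",
     "ground_truth": "Paris"},
]
# Stand-in judge that rejects everything not matched exactly.
kept = filter_trajectories(records, judge=lambda ans, gt: 0.0)
print([r["question"] for r in kept])
```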

51.3.3 Agentic Reinforcement Learning with GRPO

The central algorithmic contribution is the application of Group Relative Policy Optimization (GRPO) to train the research agent. GRPO, introduced in DeepSeekMath (Shao et al., 2024) for mathematical reasoning and subsequently used to train DeepSeek-R1 (Guo et al., 2025), is adapted here for agentic trajectories. To maintain evidentiary precision, this section separates the canonical GRPO formulation (cited here from Guo et al. 2025) from the agentic masking adaptation applied in DeepResearch, with per-term provenance labeling.

Layer 1: Canonical GRPO (Guo et al., 2025)

In the standard (non-agentic) GRPO formulation, all output tokens are policy-generated. Given a prompt $q$, the current policy $\pi_{\theta_{\text{old}}}$ generates a group of $G$ completions $\{o_1, o_2, \ldots, o_G\}$. Each receives a scalar reward $r_i$, and advantages are computed via group-relative normalization (canonical formulation, Guo et al. 2025, §3.2):

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\}) + \epsilon}$$
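A short numeric sketch of the group-relative normalization follows. The function name `group_advantages` is hypothetical, and using the population standard deviation (rather than the sample one) is an assumption; the formula itself does not say which.

```python
from statistics import fmean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])    # two successes, two failures
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])  # every rollout succeeded
print(mixed, uniform)
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no policy-gradient signal, which is one reason questions that are uniformly too easy or too hard add little to training.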

The canonical GRPO objective is (canonical formulation, Guo et al. 2025, Eq. 3):

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{k=1}^{|o_i|} \left[\min\!\Big(\rho_k^{(i)}\,\hat{A}_i,\;\operatorname{clip}\!\big(\rho_k^{(i)},\,1{-}\varepsilon_c,\,1{+}\varepsilon_c\big)\,\hat{A}_i\Big) - \beta \, D_{\text{KL}}^{(i,k)}\right]$$

where:

  • $\rho_k^{(i)} = \pi_\theta(o_k^{(i)} \mid q, o_{<k}^{(i)}) \,/\, \pi_{\theta_{\text{old}}}(o_k^{(i)} \mid q, o_{<k}^{(i)})$: the per-token importance-sampling ratio, conditioned on the prompt $q$ and the preceding output tokens (canonical formulation).
  • $\varepsilon_c$: the clipping parameter, following the PPO mechanism (Schulman et al., 2017). Standard value: 0.2 (canonical formulation).
  • $D_{\text{KL}}^{(i,k)}$: per-token KL divergence penalty against a reference policy $\pi_{\text{ref}}$, using the Schulman (2020) unbiased estimator (canonical formulation, Guo et al. 2025, Eq. 4): $$D_{\text{KL}}^{(i,k)} = \frac{\pi_{\text{ref}}(o_k^{(i)} \mid q, o_{<k}^{(i)})}{\pi_\theta(o_k^{(i)} \mid q, o_{<k}^{(i)})} - \log \frac{\pi_{\text{ref}}(o_k^{(i)} \mid q, o_{<k}^{(i)})}{\pi_\theta(o_k^{(i)} \mid q, o_{<k}^{(i)})} - 1$$
  • $\beta$: KL coefficient controlling divergence from the reference model (canonical formulation).

The key property of canonical GRPO is that it eliminates the need for a separate value function: advantages are computed by comparing outcomes within the group rather than estimating $V(s)$.
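The two per-token quantities inside the objective can be examined in isolation. The sketch below works directly on log-probabilities; the function names are hypothetical and the default clip value of 0.2 is the standard value quoted above, not a confirmed DeepResearch setting.

```python
import math


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """Per-token estimator: ref/new - log(ref/new) - 1, always >= 0."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


# The ratio is 3.0 here. With a positive advantage the gain is capped at
# 1 + clip_eps; with a negative advantage the unclipped penalty is kept.
gain = clipped_term(math.log(0.9), math.log(0.3), advantage=1.0)
penalty = clipped_term(math.log(0.9), math.log(0.3), advantage=-1.0)
print(gain, penalty, kl_estimate(-1.0, -2.0))
```

The asymmetry is the point of the `min`: clipping removes the incentive to push the ratio further once a good token is already much more likely, but never hides the cost of making a bad token more likely.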

Layer 2: Agentic Masking Adaptation for DeepResearch

In the agentic research setting, trajectories interleave policy-generated tokens (actions) with environment-returned tokens (search results, page content). The canonical GRPO formulation assumes all output tokens are policy-generated; the agentic setting requires an additional masking mechanism to exclude environment tokens from the loss. The technical report describes this masking qualitatively: "only the agent's generated tokens enter the loss" (paper-reported). The following explicit notation is the chapter author's formalization of this described practice.

Let the full token sequence of trajectory $i$ be $\mathbf{x}^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_{M_i}^{(i)})$, which interleaves system-prompt tokens, policy-generated action tokens, and environment-returned observation tokens. Define the generation mask (author formalization):

$$m_k^{(i)} = \begin{cases} 1 & \text{if token } x_k^{(i)} \text{ was generated by } \pi_\theta \text{ (part of action } a_t\text{)} \\ 0 & \text{if token } x_k^{(i)} \text{ is from system prompt } s_0 \text{ or environment observation } e_t \end{cases}$$

The adapted GRPO objective applies this mask to restrict gradient computation to policy-generated tokens only (paper-reported algorithm choice; masked formulation is author formalization):

$$\mathcal{L}_{\text{GRPO-agentic}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{N_{\text{gen}}^{(i)}} \sum_{k=1}^{M_i} m_k^{(i)} \left[\min\!\Big(\rho_k^{(i)}\,\hat{A}_i,\;\operatorname{clip}\!\big(\rho_k^{(i)},\,1{-}\varepsilon_c,\,1{+}\varepsilon_c\big)\,\hat{A}_i\Big) - \beta \, D_{\text{KL}}^{(i,k)}\right]$$

The differences from canonical GRPO, with per-term provenance:

| Term | Adaptation | Provenance |
| --- | --- | --- |
| $m_k^{(i)} \in \{0, 1\}$ | Generation mask zeroing out environment and prompt tokens from the loss | Paper-reported qualitatively; explicit notation is author formalization |
| $N_{\text{gen}}^{(i)} = \sum_k m_k^{(i)}$ | Normalization by count of policy-generated tokens only (replacing $\lvert o_i \rvert$) | Author formalization (standard practice in masked-loss implementations; not explicitly specified in paper) |
| $\rho_k^{(i)}$ conditioning | The conditioning context $\mathbf{x}_{<k}^{(i)}$ includes both action and environment tokens, but the ratio is only computed for $m_k^{(i)}=1$ | Author formalization of the standard autoregressive property |
| $D_{\text{KL}}^{(i,k)}$ | KL penalty against reference policy $\pi_{\text{ref}}$ (SFT checkpoint), computed only for generated tokens ($m_k^{(i)}=1$) | KL form: canonical (Schulman 2020). Masking: author formalization of paper-reported practice |
| $\hat{A}_i$ (trajectory-level) | Same scalar advantage broadcast to all generated tokens in trajectory $i$ | Canonical GRPO (Guo et al. 2025). Applied unchanged |
| $\varepsilon_c$ (clipping) | PPO-style clipping, unchanged from canonical formulation | Canonical (Schulman et al. 2017). Exact value used: implementation unknown |

Provenance summary for §51.3.3. The GRPO algorithm choice and its application to agentic research trajectories are paper-reported. The canonical GRPO objective and KL estimator are from Guo et al. (2025). The generation mask $m_k^{(i)}$, its token-level application, and the normalization by $N_{\text{gen}}^{(i)}$ are the chapter author's formalization of the paper's qualitative description. Whether the actual implementation uses exactly this formulation, or an equivalent variant (e.g., loss masking via attention masks, or action-level rather than token-level bookkeeping), is implementation unknown.
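Constructing the generation mask is straightforward if each segment of the trajectory is recorded together with its role. The sketch below assumes such role-tagged bookkeeping; the `Segment` layout, the role names, and `build_generation_mask` are hypothetical and are not taken from the repository.

```python
from typing import List, Tuple

Segment = Tuple[str, List[int]]  # (role, token ids)
ROLES = ("system", "action", "observation")


def build_generation_mask(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Flatten segments into token ids plus a 0/1 mask (1 = policy-generated)."""
    tokens: List[int] = []
    mask: List[int] = []
    for role, ids in segments:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        tokens.extend(ids)
        mask.extend([1 if role == "action" else 0] * len(ids))
    return tokens, mask


segments = [
    ("system", [1, 2, 3]),          # s_0: prompt and question
    ("action", [4, 5]),             # a_0: generated by the policy
    ("observation", [6, 7, 8, 9]),  # e_0: returned by the environment
    ("action", [10]),               # a_1: generated by the policy
]
tokens, mask = build_generation_mask(segments)
n_gen = sum(mask)  # the normalizer N_gen for this trajectory
print(mask, n_gen)
```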

Token-Level vs. Action-Level Decomposition

Although the agent operates at the level of discrete actions (search, click, think, finish), the GRPO objective as formalized above operates at the level of individual tokens. Each action $a_t$ is a multi-token generation consisting of $L_t$ tokens, and the importance-sampling ratio $\rho_k^{(i)}$ is computed per token. The trajectory-level advantage $\hat{A}_i$ is then broadcast to all generated tokens within the same trajectory. An alternative would be action-level probabilities $\pi_\theta(a_t \mid c_t) = \prod_{j=1}^{L_t} \pi_\theta(a_{t,j} \mid c_t, a_{t,<j})$; the token-level formulation avoids numerical issues with products of many probabilities and aligns with standard autoregressive training (author analysis).
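The numerical point can be checked directly. In this small sketch the token probabilities are made-up values chosen only to show the effect: an action-level probability is the product of its token probabilities, which equals the exponential of the summed token log-probabilities, and the product form underflows for long generations while the log form does not.

```python
import math

# Short action: product and summed log-probabilities agree.
short_probs = [0.9, 0.8, 0.5]
short_product = math.prod(short_probs)
short_logp = sum(math.log(p) for p in short_probs)

# Long generation: the product underflows float64, the log-sum stays finite.
long_probs = [0.5] * 2000
long_product = math.prod(long_probs)
long_logp = sum(math.log(p) for p in long_probs)

print(short_product, math.exp(short_logp))
print(long_product, long_logp)
```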

51.3.4 Reward Design

The reward function $r(\tau)$ for a complete trajectory $\tau$ combines answer correctness with format adherence (paper-reported). The reported structure is:

$$r(\tau) = \lambda_{\text{acc}} \cdot r_{\text{accuracy}} + \lambda_{\text{fmt}} \cdot r_{\text{format}}$$

where:

  • $r_{\text{accuracy}} \in \{0, 1\}$ or $[0, 1]$ — reward for answer correctness against ground truth (paper-reported: accuracy is the primary signal).
  • $r_{\text{format}} \in \{0, 1\}$ — reward for properly formatted output with correct XML action delimiters (paper-reported).
  • $\lambda_{\text{acc}}, \lambda_{\text{fmt}}$ — weighting coefficients. Exact values are implementation unknown; the paper states accuracy is the dominant signal.
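The weighted-sum structure can be sketched as below. The weights 0.9 and 0.1 are placeholders chosen only so that accuracy dominates, since the real values are unknown, and the format check (every action wrapped in a matching tag pair, final action is finish) is an assumed reading of "properly formatted".

```python
import re
from typing import List

TAG_RE = re.compile(r"<(search|click|think|finish)>.*?</\1>", re.DOTALL)


def format_reward(actions: List[str]) -> float:
    """1.0 if every action is wrapped in a matching tag pair and the last is finish."""
    if not actions:
        return 0.0
    well_formed = all(TAG_RE.fullmatch(a.strip()) for a in actions)
    ends_with_finish = actions[-1].strip().startswith("<finish>")
    return 1.0 if well_formed and ends_with_finish else 0.0


def trajectory_reward(actions: List[str], correct: bool,
                      w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """r = w_acc * r_accuracy + w_fmt * r_format (weights are placeholders)."""
    return w_acc * float(correct) + w_fmt * format_reward(actions)


good = ["<search>q</search>", "<finish>a</finish>"]
print(trajectory_reward(good, correct=True),
      trajectory_reward(good, correct=False),
      trajectory_reward(["<search>q"], correct=True))
```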

51.3.5 Training Stages

The complete training pipeline proceeds in two stages (paper-reported):

| Stage | Method | Purpose | Data Source |
| --- | --- | --- | --- |
| 1. Cold Start | SFT | Initialize agent with basic research behavior and action formatting | Filtered teacher trajectories from QwQ-32B-Preview |
| 2. RL Training | GRPO | Optimize research strategy via trajectory-level rewards | Online rollouts with search-environment interaction |

The SFT cold-start stage is critical (paper-reported finding). Without SFT initialization, the base model does not know how to format actions or issue coherent search queries, making the RL reward signal too sparse for effective learning. SFT provides a "warm" initial policy that GRPO can then refine. The RL training phase uses the verl framework (Volcano Engine Reinforcement Learning) for distributed GRPO training (repo-documented: verl is listed as a dependency in the repository's requirements).

Algorithm 3: GRPO-AGENTIC-TRAINING-STEP
Provenance: canonical GRPO structure from Guo et al. (2025), §3.2; agentic masking
adaptation from paper-reported description; pseudocode notation is author formalization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input:  question batch Q = {q_1,...,q_B}, group size G, current policy π_θ,
        old policy π_θ_old, reference policy π_ref (SFT model),
        search environment E, clip ε_c, KL weight β
Output: gradient update Δθ

 1  for each question q_j ∈ Q do
 2  │  // Sample G trajectories from OLD policy (canonical GRPO)
 3  │  for g = 1 to G do
 4  │  │  τ_g ← RESEARCH-AGENT-LOOP(q_j, π_θ_old, E)
 5  │  │  r_g ← COMPUTE_REWARD(τ_g)                    // paper-reported reward
 6  │  │  m_g ← BUILD_GENERATION_MASK(τ_g)              // author formalization
 7  │  end
 8  │
 9  │  // Trajectory-level advantage (canonical GRPO, Guo et al. 2025)
10  │  for g = 1 to G do
11  │  │  Â_g ← (r_g − mean({r_1,...,r_G})) / (std({r_1,...,r_G}) + ε)
12  │  end
13  │
14  │  // Masked token-level policy gradient (author formalization of paper-reported practice)
15  │  for g = 1 to G do
16  │  │  for each token position k in τ_g do
17  │  │  │  if m_g[k] = 0 then continue                // skip environment tokens
18  │  │  │  ρ_k ← exp(logπ_θ[k] − logπ_θ_old[k])      // canonical importance ratio
19  │  │  │  ρ_clip ← clip(ρ_k, 1−ε_c, 1+ε_c)          // canonical PPO clipping
20  │  │  │  L_pg ← −min(ρ_k · Â_g, ρ_clip · Â_g)      // canonical clipped objective
21  │  │  │  ratio_ref ← exp(logπ_ref[k] − logπ_θ[k])
22  │  │  │  L_kl ← β · (ratio_ref − log(ratio_ref) − 1) // Schulman (2020) KL estimator
23  │  │  │  accumulate (L_pg + L_kl) / N_gen_g          // normalize by generated tokens
24  │  │  end
25  │  end
26  end
27  Δθ ← gradient of accumulated loss
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Per-line provenance:
  Lines 2-7, 14-25: agentic masking = author formalization of paper-reported practice
  Lines 9-12: canonical GRPO advantage (Guo et al. 2025)
  Lines 18-20: canonical PPO clipping (Schulman et al. 2017)
  Line 22: canonical KL estimator (Schulman 2020)
  Line 6: generation mask construction = author formalization (not paper-specified)
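The loss accumulated in Algorithm 3 can be written as a scalar computation over precomputed log-probabilities. This sketch evaluates the loss value only (a real training step would compute it inside an autodiff framework), assumes population standard deviation for the advantage, and uses placeholder values for `clip_eps` and `beta`; `masked_grpo_loss` is a hypothetical name.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def masked_grpo_loss(logp_new: Sequence[Sequence[float]],
                     logp_old: Sequence[Sequence[float]],
                     logp_ref: Sequence[Sequence[float]],
                     masks: Sequence[Sequence[int]],
                     rewards: List[float],
                     clip_eps: float = 0.2, beta: float = 0.01,
                     eps: float = 1e-8) -> float:
    """Masked, token-level GRPO loss for one group of G trajectories."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, mask, r in zip(logp_new, logp_old, logp_ref,
                                                masks, rewards):
        adv = (r - mu) / (sigma + eps)       # trajectory-level advantage
        n_gen = sum(mask)                    # policy-generated token count
        if n_gen == 0:
            continue
        traj = 0.0
        for k, m in enumerate(mask):
            if m == 0:
                continue                     # skip prompt/environment tokens
            ratio = math.exp(lp_new[k] - lp_old[k])
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * adv, clipped * adv)
            log_r = lp_ref[k] - lp_new[k]
            kl = math.exp(log_r) - log_r - 1.0
            traj += -(surrogate - beta * kl)
        total += traj / n_gen
    return total / len(rewards)


old = [[-1.0, -2.1, -3.0], [-1.0, -1.9, -3.0]]
new = [[-1.0, -2.0, -3.0], [-1.0, -2.0, -3.0]]
masks = [[0, 1, 1], [0, 1, 0]]
loss = masked_grpo_loss(new, old, old, masks, rewards=[1.0, 0.0])
print(loss)
```

Changing a log-probability at a masked position leaves the loss untouched, which is the property the generation mask exists to guarantee.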

51.4 Model Architecture

Both the student and teacher models are dense transformers — neither uses Mixture-of-Experts (MoE) architecture (paper-reported; model-card confirmed).

51.4.1 Student Model: Qwen2.5-7B-Instruct

| Parameter | Value | Source |
| --- | --- | --- |
| Architecture | Dense transformer (decoder-only) | Model-card |
| Total parameters | ~7.6B | Model-card |
| Hidden dimension | 3,584 | Model-card |
| Layers | 28 | Model-card |
| Attention heads | 28 (with GQA) | Model-card |
| Context window | Up to 128K tokens | Model-card |
| Vocabulary size | ~152K | Model-card |
| Attention mechanism | Grouped Query Attention (GQA) | Model-card |

The released trained checkpoint is available as Alibaba-NLP/DeepResearch-7B on HuggingFace (repo-documented: download link in repository README).

51.4.2 Teacher Model: QwQ-32B-Preview

The teacher model used for generating synthetic research trajectories is QwQ-32B-Preview (paper-reported; model-card: Qwen/QwQ-32B-Preview), a 32-billion-parameter dense reasoning model designed for extended chain-of-thought reasoning. The teacher is used only during the data generation phase (§51.3.2) and is not required at inference time.

51.4.3 Relevance to Agentic Research

Two aspects of the model choices are particularly relevant (author analysis):

  • Long context window. The 128K-token context window accommodates multi-step research trajectories. However, even 128K tokens may be insufficient for very long sessions with multiple full web pages.
  • Teacher–student scale gap. The roughly 4.2× parameter ratio (32B teacher vs. ~7.6B student) is a deliberate design choice: the teacher produces high-quality trajectories, while the student is small enough for efficient RL training (which requires generating $G$ trajectories per question) and deployment.

51.5 Key Results

51.5.1 Evaluation Metadata and Reproducibility Protocol

The following table specifies all evaluation metadata needed for reproducibility, with explicit coverage of available and unavailable information:

| Metadata Field | Value | Availability |
| --- | --- | --- |
| Student model checkpoint | Alibaba-NLP/DeepResearch-7B (HuggingFace) | Available (repo-documented) |
| Base model | Qwen2.5-7B-Instruct, dense, ~7.6B params | Available (model-card) |
| Teacher model | QwQ-32B-Preview (data generation only) | Available (paper-reported) |
| Repository commit hash | Not pinned in this chapter | Not recorded; readers must pin at clone time |
| Search provider | Web search API (configurable) | Provider-specific: exact API not stated in paper; repo documents setup |
| Evaluation date window | Early 2025 | Approximate (paper-reported) |
| Search result caching | Not documented | Unknown |
| Max agent steps ($T_{\max}$) | Configurable | Documented as parameter; default value not stated in paper |
| Decoding settings (temperature, top-p) | Not specified in paper | Not reported |
| Number of evaluation runs / seeds | Not specified | Not reported; results are single-run point estimates |
| Confidence intervals / variance | Not reported | Not reported |
| GRPO group size $G$ | Specified in training config | Paper-reported (training detail section) |
| RL training framework | verl (Volcano Engine RL) | Available (repo-documented) |
| Inference engine | vLLM | Available (repo-documented) |
| Primary metric | Answer accuracy (exact match and graded) | Available (paper-reported) |

Reproducibility assessment. The most critical gaps for exact reproduction are: (1) search result non-stationarity — web results change daily, and no caching mechanism is documented; (2) missing seed/run count — single-run estimates on small evaluation sets (GAIA Level 3: ~17 questions) carry ±6pp granularity; (3) unspecified decoding parameters. The model checkpoint itself is available and sufficient for inference-only reproduction.
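The granularity concern can be made concrete with a little arithmetic. The sketch below takes the chapter's approximate count of 17 Level 3 questions as given and adds a Wilson score interval, which is this chapter's illustrative choice of interval and is not something the paper reports.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_questions = 17                # approximate count quoted in the text
step_pp = 100.0 / n_questions   # accuracy change from a single question
low, high = wilson_interval(3, n_questions)  # 3 of 17 is about 17.6%
print(f"one question = {step_pp:.2f}pp; interval = [{low:.3f}, {high:.3f}]")
```

An interval this wide on a single run means that differences of one or two questions between configurations on such a small split should be read as suggestive rather than conclusive.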

51.5.2 Benchmark Performance

The following table reports the main results across six benchmarks for three training configurations (paper-reported, Zheng et al. 2025, Tables 1–2). All three configurations use the same agent loop and search environment, ensuring budget-matched comparison (paper-reported).

| Benchmark | Split | Metric | Base + Search | SFT | SFT + GRPO | Δ(RL−SFT) |
| --- | --- | --- | --- | --- | --- | --- |
| GAIA | Validation, Level 1 | Accuracy (%) | 27.4 | 45.2 | 56.5 | +11.3 |
| GAIA | Validation, Level 2 | Accuracy (%) | 15.3 | 27.1 | 37.3 | +10.2 |
| GAIA | Validation, Level 3 | Accuracy (%) | 5.9 | 11.8 | 17.6 | +5.8 |
| GAIA | Validation, Average | Accuracy (%) | 18.3 | 31.3 | 40.9 | +9.6 |
| Bamboogle | Full | Accuracy (%) | 24.0 | 44.0 | 56.0 | +12.0 |
| BrowseComp | Full test | Accuracy (%) | 3.6 | 14.8 | 22.5 | +7.7 |
| HotpotQA | Test subset | EM (%) | 36.2 | 55.4 | 63.1 | +7.7 |
| 2WikiMultiHopQA | Test subset | EM (%) | 31.5 | 49.8 | 58.6 | +8.8 |
| MuSiQue | Test subset | EM (%) | 14.2 | 27.3 | 35.1 | +7.8 |

Score provenance. Values are from Zheng et al. (2025), Tables 1–2. Minor transcription discrepancies are possible; readers should verify against the primary source. Multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) use exact-match (EM) scoring after answer normalization (paper-reported). GAIA Level 3 contains ~17 validation questions — a single correct/incorrect answer shifts accuracy by ~6pp.
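As a guard against transcription slips inside this chapter, the Δ(RL−SFT) column can be recomputed from the SFT and SFT + GRPO columns. The numbers below are copied from the table in this section; the check confirms internal consistency only and says nothing about agreement with the primary source.

```python
from typing import Dict, List, Tuple

# (SFT, SFT + GRPO, reported delta), transcribed from the table above.
ROWS: Dict[str, Tuple[float, float, float]] = {
    "GAIA Level 1": (45.2, 56.5, 11.3),
    "GAIA Level 2": (27.1, 37.3, 10.2),
    "GAIA Level 3": (11.8, 17.6, 5.8),
    "GAIA Average": (31.3, 40.9, 9.6),
    "Bamboogle": (44.0, 56.0, 12.0),
    "BrowseComp": (14.8, 22.5, 7.7),
    "HotpotQA": (55.4, 63.1, 7.7),
    "2WikiMultiHopQA": (49.8, 58.6, 8.8),
    "MuSiQue": (27.3, 35.1, 7.8),
}


def find_mismatches(rows: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """Return row names whose reported delta differs from the recomputed one."""
    return [name for name, (sft, rl, reported) in rows.items()
            if round(rl - sft, 1) != reported]


mismatches = find_mismatches(ROWS)
smallest = min(ROWS, key=lambda name: ROWS[name][2])
largest = max(ROWS, key=lambda name: ROWS[name][2])
print(mismatches, smallest, largest)
```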

51.5.3 RL vs. SFT Ablation

The most important empirical finding is the consistent superiority of SFT+GRPO over SFT-only across all six benchmarks (paper-reported), with improvements ranging from +5.8pp (GAIA Level 3) to +12.0pp (Bamboogle):

| Training Configuration | GAIA Val Avg | Bamboogle | BrowseComp | Interpretation (paper-reported) |
| --- | --- | --- | --- | --- |
| Base Qwen2.5-7B-Instruct + Search | 18.3 | 24.0 | 3.6 | Single-turn QA with search but no research training |
| SFT on filtered QwQ-32B trajectories | 31.3 | 44.0 | 14.8 | Imitates teacher search patterns |
| SFT + GRPO (DeepResearch-7B) | 40.9 | 56.0 | 22.5 | Develops own strategies beyond teacher imitation |

GRPO without SFT cold start: substantially degraded (paper-reported: unstable training). Interpretation: sparse reward + poor action formatting.

Key interpretation (paper-reported): RL enables the student to discover research strategies that differ from — and improve upon — the teacher's approach. The student is not limited to imitating the teacher but can learn to search differently when the teacher's strategy is suboptimal for the student's smaller capacity. The necessity of SFT cold start is attributed to reward sparsity: without SFT, the base model produces malformed actions and incoherent queries, providing no learning signal (paper-reported rationale). Exact scores for the GRPO-from-scratch condition are described qualitatively, not tabulated (paper-reported).

51.6 Relationship to Evolutionary AI

DeepResearch is not, strictly speaking, an evolutionary system. It does not maintain a population of candidate programs, apply mutation and crossover operators, or use fitness-based selection in the traditional evolutionary computation sense. However, several connections to the evolutionary AI paradigm warrant discussion within this survey:

51.6.1 Population-Based Training via GRPO

GRPO generates a group of $G$ trajectories per question and computes advantages relative to the group. This is structurally analogous to a population-based evaluation: $G$ candidate strategies are "born" from the same policy, evaluated against an environment, and ranked relative to each other. The policy update preferentially reinforces strategies from the upper portion of the performance distribution. This resembles truncation selection, but with a soft, advantage-weighted cutoff rather than a hard top-$\mu$ threshold.
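A minimal sketch of the group-relative scoring described above, contrasted with hard truncation weights. It assumes the canonical GRPO normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` term, and both function names are assumptions of this sketch rather than details confirmed from the DeepResearch training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def truncation_weights(rewards, mu_keep):
    """Hard (mu, lambda)-style selection: weight 1 for the top mu_keep, else 0."""
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    keep = set(ranked[:mu_keep])
    return [1.0 if i in keep else 0.0 for i in range(len(rewards))]


# G = 8 rollouts for one question with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])
print(truncation_weights(rewards, mu_keep=2))
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, and the advantages sum to zero within the group. When every rollout in a group earns the same reward, all advantages are zero and that question contributes no gradient.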

The connection can be formalized as follows (author analysis). In a $(\mu, \lambda)$ evolution strategy, $\lambda$ offspring are generated and the top $\mu$ are selected to inform the next generation. In GRPO with group size $G$, the advantage normalization creates a continuous-valued selection pressure: trajectories above the group mean receive positive advantage (reinforced), those below receive negative advantage (discouraged). The gradient update is:

$$\Delta\theta \propto \sum_{i=1}^{G} \hat{A}_i \cdot \nabla_\theta \log \pi_\theta(\tau_i)$$

This is formally similar to the Natural Evolution Strategy (NES) gradient estimator (Wierstra et al., 2014), where a population of perturbations is evaluated and the gradient is computed via fitness-weighted log-probability. The distinction is that GRPO operates in trajectory space (sampling different action sequences from the same policy) rather than parameter space (sampling different model weights).
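The parallel can be written side by side. The first line is the canonical NES search-gradient estimator (Wierstra et al., 2014); the second restates the GRPO update with the normalized advantage written out, omitting the clipping and KL terms. The symbols $f$, $z_k$, $\lambda$, $r_i$, and $q$ are notation assumed here for the comparison.

$$\nabla_\theta J_{\text{NES}} \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} f(z_k) \, \nabla_\theta \log \pi_\theta(z_k), \qquad z_k \sim \pi_\theta$$

$$\Delta\theta_{\text{GRPO}} \propto \sum_{i=1}^{G} \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})} \, \nabla_\theta \log \pi_\theta(\tau_i), \qquad \tau_i \sim \pi_\theta(\cdot \mid q)$$

In both cases a sampled population is scored and each sample's log-probability gradient is weighted by its (shaped) fitness. In NES the samples $z_k$ are parameter vectors drawn from a search distribution; in GRPO the samples $\tau_i$ are trajectories drawn from the policy for a fixed question $q$.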

51.6.2 Synthetic Data as Environmental Variation

The synthetic data pipeline introduces diversity pressure analogous to environmental variation in biological evolution. By exposing the agent to diverse research questions — different domains, difficulty levels, and reasoning patterns — the training process selects for generalist research capabilities rather than narrow specialization.

51.6.3 Limitations of the Evolutionary Analogy

The analogy should not be overstated. Key differences include:

  • No explicit population. GRPO maintains a single policy network; the "population" exists only within each training batch and is discarded after the gradient step.
  • No crossover. There is no mechanism for combining strategies from different trajectories.
  • Gradient-based optimization. The parameter update is a smooth gradient step, not discrete selection and reproduction.
  • No open-ended search. The fitness function (answer correctness) is fixed and externally specified.

These distinctions place GRPO-based training closer to estimation-of-distribution algorithms (EDAs): the policy network is a parameterized distribution from which solutions are sampled, and updates shift this distribution toward higher-fitness regions.
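The EDA reading can be illustrated with a toy problem: a categorical distribution over three hypothetical strategies with fixed fitness values, updated with group-normalized advantages. Everything here (the fitness values, the softmax parameterization, the learning rate) is an assumption chosen for illustration and is unrelated to DeepResearch's actual policy.

```python
import math
import random
from statistics import mean, pstdev

FITNESS = [0.2, 0.5, 0.9]  # hypothetical fixed reward of each strategy


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=200, group_size=8, lr=0.5, seed=0):
    """Shift a categorical distribution toward higher-fitness strategies."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        picks = rng.choices(range(3), weights=probs, k=group_size)
        rewards = [FITNESS[a] for a in picks]
        sigma = pstdev(rewards)
        if sigma == 0.0:
            continue  # identical rewards: zero advantage, no update
        mu = mean(rewards)
        for a in picks:
            advantage = (FITNESS[a] - mu) / sigma
            for j in range(3):
                # Gradient of log softmax w.r.t. logit j: one_hot(a)[j] - probs[j]
                grad = (1.0 if j == a else 0.0) - probs[j]
                logits[j] += lr * advantage * grad / group_size
    return logits


final_probs = softmax(train())
print([round(p, 3) for p in final_probs])
```

The sampled group is discarded after each step and only the distribution parameters persist, which is the defining pattern of an estimation-of-distribution algorithm.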

51.7 Implementation and Reproducibility

51.7.1 Repository Structure Audit

The repository at github.com/Alibaba-NLP/DeepResearch provides agent inference code, training pipeline, search environment integration, reward computation, and model download instructions. The following audit maps architecture components to their observable repository evidence.

Audit scope and limitations. This audit was conducted at the README-and-dependency level: it confirms which components the repository documents and which dependencies it declares, but does not constitute a line-by-line source code audit. Exact file paths, class names, and function signatures are not cited because the author has not performed a commit-pinned code walkthrough. Every entry below states its evidence tier explicitly. Readers performing implementation study should clone the repository, record the commit hash, and verify all claims below against the actual source tree.

| Architecture Component | Evidence | Verification Level |
|---|---|---|
| Agent inference entry point | README documents an inference launch command for running the research agent | Repo-documented (README-level; exact script path not cited) |
| Agent loop and action parsing | Paper describes XML-tag-based action parsing for <search>, <click>, <think>, <finish> | Paper-reported; implementation file/function: not audited |
| Search environment | README describes configurable search backend setup with API credentials | Repo-documented (README-level; interface code not audited) |
| System prompt / agent instructions | Paper describes the agent's instruction format and action schema | Paper-reported; template location in repo: not audited |
| SFT training | README documents SFT training procedure using teacher trajectories | Repo-documented (README-level; training script not audited) |
| GRPO training | verl is listed in dependency files; README documents RL training | Repo-documented (dependency confirmed; training entry point not audited) |
| Reward computation | Paper describes accuracy + format reward; reward logic presumably in training code | Paper-reported; exact file/function: not audited |
| Evaluation harness | README documents benchmark evaluation instructions | Repo-documented (README-level; evaluation script not audited) |
| Model weights | HuggingFace: Alibaba-NLP/DeepResearch-7B | Repo-documented + model-card (confirmed available) |
| Dependencies | verl, vllm, transformers, torch | Repo-documented (dependency file in repository) |
| Context management strategy | Not documented in README or paper at the level inspected | Implementation unknown |
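A reader repeating the dependency-level part of this audit could automate the check with something like the sketch below. It assumes the dependencies are declared in a requirements-style text file; the actual file name and format in the repository are not confirmed here, and the sample contents are made up for the example.

```python
import re

EXPECTED = {"verl", "vllm", "transformers", "torch"}


def declared_packages(requirements_text: str) -> set[str]:
    """Extract lowercase package names from requirements-style text."""
    names = set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower().replace("_", "-"))
    return names


def missing_dependencies(requirements_text: str, expected=EXPECTED) -> set[str]:
    """Return expected packages that the text does not declare."""
    return set(expected) - declared_packages(requirements_text)


# Hypothetical file contents, for illustration only.
sample = """
torch>=2.0
transformers
vllm  # inference engine
"""
print(sorted(missing_dependencies(sample)))
```

Run against the real dependency file at a recorded commit, an empty result would confirm the "Dependencies" row above at the declaration level only; it says nothing about how the packages are used.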

51.7.2 Illustrative Pseudocode

The following code examples illustrate the three core implementation components. These are author reconstructions based on the paper's described methodology and the documented repository interfaces — they are not excerpts from the repository source code. Variable names, class structures, and API surfaces are illustrative. Readers building on this work should consult the actual repository source files for verified implementations.

Pseudocode 1: Agent Inference and Action Parsing

The agent's core inference loop generates actions via the LLM and parses them using XML-tag extraction, as described in the paper's action protocol (§51.2.1). The following illustrates the expected structure (author reconstruction from paper-reported methodology):

"""Agent action parsing and research loop.
PROVENANCE: Author reconstruction based on paper-reported action protocol.
NOT a repository excerpt. Exact implementation may differ in class names,
import paths, and module organization. Verify against pinned commit."""

import re
from dataclasses import dataclass

# XML-tag regex for action extraction
# Paper-reported: the agent uses <search>, <click>, <think>, <finish> delimiters.
# The exact regex used in the repository is not audited here.
# The backreference \1 requires the closing tag to match the opening tag.
ACTION_PATTERN = re.compile(
    r"<(search|click|think|finish)>(.*?)</\1>", re.DOTALL
)

@dataclass
class AgentAction:
    """Parsed agent action. Structure follows paper-reported action schema."""
    action_type: str   # "search" | "click" | "think" | "finish"
    argument: str      # query string, URL, reasoning text, or final answer


def parse_action(response_text: str) -> AgentAction | None:
    """Extract the first XML-tagged action from the model's response.

    Paper-reported format:
        <search>query text</search>
        <click>https://url</click>
        <think>reasoning text</think>
        <finish>final answer with citations</finish>
    """
    match = ACTION_PATTERN.search(response_text)
    if match:
        return AgentAction(
            action_type=match.group(1),
            argument=match.group(2).strip(),
        )
    return None


def run_research_agent(
    question: str,
    model,                  # LLM inference engine (paper reports vLLM)
    search_env,             # search environment with .search() and .fetch_page()
    system_prompt: str,
    max_steps: int = 15,    # T_max — illustrative default; actual default unknown
    max_tokens: int = 2048,
) -> str:
    """Execute the multi-step research agent loop (Algorithm 1, §51.3.1).

    This implements the paper-reported agent loop: generate action → parse
    XML tags → execute environment action → append observation → repeat.
    The actual repository implementation may use different parameter names,
    message formatting, or error handling patterns.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

    for step in range(max_steps):
        # Generate action tokens from the policy
        response_text = model.chat(messages, max_tokens=max_tokens)
        action = parse_action(response_text)

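        messages.append({"role": "assistant", "content": response_text})

        if action is None:
            # Malformed output handling: exact strategy is implementation unknown.
            # Assumption in this sketch: send a short format reminder so the next
            # generation does not follow an unanswered assistant turn.
            messages.append({
                "role": "user",
                "content": "No valid action tag was found. Respond with one action tag.",
            })
            continue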

        if action.action_type == "finish":
            return action.argument

        # Execute environment action and get observation
        if action.action_type == "search":
            observation = search_env.search(action.argument)
        elif action.action_type == "click":
            observation = search_env.fetch_page(action.argument)
        elif action.action_type == "think":
            observation = ""  # Internal reasoning — no env feedback (paper-reported)

        # Append observation as user-role message (environment tokens)
        if observation:
            messages.append({"role": "user", "content": observation})

    # Max steps reached — force finish (paper-reported: agent is prompted to conclude)
    messages.append({
        "role": "user",
        "content": "You have reached the maximum number of steps. "
                   "Please provide your final answer now using <finish>...</finish>.",
    })
    final = model.chat(messages, max_tokens=max_tokens)
    final_action = parse_action(final)
    return final_action.argument if final_action else final
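The parsing behavior is easiest to check in isolation. The standalone sketch below uses a closing-tag backreference so that only well-formed actions are accepted; the pattern and the helper name `first_action` are author illustrations, not the repository's parser.

```python
import re

# \1 forces the closing tag to repeat whichever tag name opened the action.
ACTION_RE = re.compile(r"<(search|click|think|finish)>(.*?)</\1>", re.DOTALL)


def first_action(text: str):
    """Return (action_type, argument) for the first well-formed action, else None."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


samples = [
    "<search>GRPO group size</search>",
    "<think>plan first</think><search>later query</search>",
    "<finish>Line one.\nLine two.</finish>",
    "<search>mismatched</click>",
    "<search>never closed",
]
for text in samples:
    print(repr(text), "->", first_action(text))
```

Mismatched or unclosed tags yield `None`, which is the malformed-output case the loop has to handle. Note that only the first action in a response is returned, so a response that thinks and then searches is parsed as a `think` action.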

Pseudocode 2: Search Environment Interface

The search environment wraps external search APIs and provides formatted results to the agent. The paper reports a configurable search backend; the following shows the expected interface structure (author reconstruction):

"""Search environment interface.
PROVENANCE: Author reconstruction based on paper-reported search integration
and repository-documented configurable backend. NOT a repository excerpt.
The actual search module may use different class names and API conventions."""

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Single search result. Paper describes results containing title, URL, snippet."""
    title: str
    url: str
    snippet: str


class SearchBackend(ABC):
    """Abstract search backend interface.

    Repo-documented: the repository supports configurable search backends
    with API credentials set via configuration. The exact abstract interface
    and concrete implementations are not audited at the source-code level.
    """

    @abstractmethod
    def web_search(self, query: str, num_results: int = 10) -> list[SearchResult]:
        """Execute a web search query and return ranked results."""
        ...

    @abstractmethod
    def fetch_page(self, url: str, max_chars: int = 8000) -> str:
        """Fetch a web page and extract text content, truncated to budget."""
        ...


class SearchEnvironment:
    """Wraps a search backend to format results for agent context.

    This illustrates the expected pattern: search results are formatted
    as numbered text entries that the agent can reference in subsequent
    actions. The actual formatting template is implementation unknown.
    """

    def __init__(self, backend: SearchBackend, max_results: int = 10):
        self.backend = backend
        self.max_results = max_results

    def search(self, query: str) -> str:
        """Execute search and format results as agent-readable text."""
        results = self.backend.web_search(query, self.max_results)
        formatted_parts = []
        for i, r in enumerate(results, 1):
            formatted_parts.append(
                f"[{i}] {r.title}\n    URL: {r.url}\n    {r.snippet}"
            )
        return "\n\n".join(formatted_parts)

    def fetch_page(self, url: str) -> str:
        """Fetch page content, truncated to fit context budget."""
        return self.backend.fetch_page(url)
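Because live search results change over time, tests of the agent loop are easier to repeat against a deterministic stand-in that exposes the same two methods. The sketch below is a hypothetical in-memory backend with naive keyword scoring; the class names, the scoring rule, and the 160-character snippet length are all assumptions of this example and do not come from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Minimal result record with the title, URL, snippet fields described above."""
    title: str
    url: str
    snippet: str


class InMemoryBackend:
    """Offline backend: keyword matching over a fixed corpus."""

    def __init__(self, pages: dict[str, tuple[str, str]]):
        self.pages = pages  # url -> (title, body)

    def web_search(self, query: str, num_results: int = 10) -> list[Hit]:
        terms = query.lower().split()
        scored = []
        for url, (title, body) in self.pages.items():
            text = f"{title} {body}".lower()
            score = sum(text.count(term) for term in terms)
            if score > 0:
                scored.append((score, url, title, body))
        # Highest score first; ties broken by URL for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [Hit(title, url, body[:160]) for _, url, title, body in scored[:num_results]]

    def fetch_page(self, url: str, max_chars: int = 8000) -> str:
        _title, body = self.pages.get(url, ("", ""))
        return body[:max_chars]


backend = InMemoryBackend({
    "https://example.org/grpo": ("GRPO notes", "GRPO compares rollouts within a group."),
    "https://example.org/ppo": ("PPO notes", "PPO trains a critic."),
})
print(backend.web_search("grpo group"))
```

A backend like this only exercises control flow and formatting. It cannot reproduce the retrieval quality of a real search API, so it is unsuitable for benchmark numbers.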

Pseudocode 3: GRPO Training Configuration

The GRPO training stage uses the verl framework (repo-documented dependency). The following illustrates the expected training setup structure. Hyperparameter values are not verified — they are illustrative placeholders showing the configuration dimensions; readers must consult the repository's actual configuration files for ground-truth values.

"""GRPO training configuration structure.
PROVENANCE: Author reconstruction based on:
  - Paper-reported: GRPO algorithm choice, SFT → RL two-stage pipeline
  - Repo-documented: verl dependency, vLLM for generation
  - Implementation unknown: exact hyperparameters, entry point, config schema

This is NOT a repository excerpt. The actual training launch mechanism,
configuration format, and parameter names may differ substantially.
All hyperparameter values below are ILLUSTRATIVE PLACEHOLDERS."""

# ── Conceptual training configuration ────────────────────────────────
# The repository uses verl for distributed GRPO training.
# The configuration likely specifies these dimensions (author analysis):

training_dimensions = {
    # Model paths
    "sft_checkpoint": "...",       # SFT-initialized model (Stage 1 output)
    "reference_model": "...",      # Same SFT checkpoint, frozen for KL penalty
    "tokenizer": "Qwen/Qwen2.5-7B-Instruct",  # model-card

    # GRPO parameters (paper-reported algorithm; exact values unknown)
    "group_size": "?",             # G: trajectories per question — paper-reported
    "clip_ratio": "?",             # ε_c: PPO-style clipping — canonical default 0.2
    "kl_coeff": "?",               # β: KL penalty weight — implementation unknown
    "advantage_normalization": True,  # canonical GRPO property

    # Rollout (paper-reported: online trajectory generation with real search)
    "inference_engine": "vllm",    # repo-documented dependency
    "max_agent_steps": "?",        # T_max — configurable, default unknown
    "search_backend": "...",       # configurable — repo-documented

    # Training (implementation unknown: exact schedule, hardware)
    "learning_rate": "?",
    "num_epochs": "?",
    "gradient_accumulation": "?",

    # Reward (paper-reported structure; exact weights unknown)
    "accuracy_weight": "? (dominant)",  # paper-reported: primary signal
    "format_weight": "? (secondary)",   # paper-reported: format adherence
}

# ── Launch pattern ───────────────────────────────────────────────────
# Repo-documented: training is launched via verl's distributed trainer.
# The exact entry point (e.g., a training script invoking verl's API)
# is not audited at the source-code level. verl handles:
#   - Distributed rollout generation across GPUs
#   - Online trajectory collection with search-environment interaction
#   - Advantage computation (group-relative normalization)
#   - Multi-GPU gradient synchronization
#
# The search environment is integrated into the rollout worker so each
# trajectory involves real web search API calls during training.
# (paper-reported: online RL with live search)
Pseudocode provenance summary. All three examples above are author reconstructions based on the paper's described methodology and the repository's documented interfaces. They are not excerpts from repository source code and should not be treated as implementation documentation. The purpose is pedagogical: to illustrate the architectural patterns described in §§51.2–51.3 in executable notation. Readers building on this work must verify the actual implementation against the repository's source files at a pinned commit.

51.7.3 Compute Requirements

RL training of research agents is computationally expensive due to the online trajectory generation requirement. Each GRPO training step requires (author analysis — cost model not from the paper):

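  1. Trajectory rollout. Generating $G$ trajectories per question, each involving multiple LLM generation calls and search API calls. For a batch of $B$ questions with group size $G$, an average of $\bar{L}$ actions per trajectory, and an average of $\bar{S}$ search calls per trajectory ($\bar{S} \le \bar{L}$): approximately $B \times G \times \bar{L}$ generation calls (each a multi-token decode) plus $B \times G \times \bar{S}$ search API calls.
  2. Reward computation. Scoring each trajectory against ground truth or via judge models.
  3. Gradient computation. Standard transformer backward pass.

$$C_{\text{step}} = B \cdot G \cdot (\bar{L} \cdot c_{\text{inference}} + \bar{S} \cdot c_{\text{search}}) + B \cdot G \cdot c_{\text{reward}} + c_{\text{gradient}}$$

Here $c_{\text{inference}}$ is the cost of one generation call, $c_{\text{search}}$ the cost of one search API call, $c_{\text{reward}}$ the cost of scoring one trajectory, and $c_{\text{gradient}}$ the cost of one backward pass and optimizer update.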

This cost model is an author formalization — the exact compute budget (GPU type/count, training steps, API calls, wall-clock time, total cost) is not specified in the publicly reviewed materials.
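The cost model can be turned into a small calculator for what-if comparisons. The sketch follows the counts in item 1, charging inference cost per action and search cost per search call; every number in the example is a made-up placeholder in arbitrary cost units, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Per-step cost of online GRPO training, in arbitrary units."""
    batch_questions: int   # B
    group_size: int        # G
    avg_actions: float     # L-bar: generation calls per trajectory
    avg_searches: float    # S-bar: search API calls per trajectory
    c_inference: float     # cost of one generation call
    c_search: float        # cost of one search API call
    c_reward: float        # cost of scoring one trajectory
    c_gradient: float      # cost of one backward pass and update

    def rollouts(self) -> int:
        return self.batch_questions * self.group_size

    def total(self) -> float:
        per_trajectory = (
            self.avg_actions * self.c_inference
            + self.avg_searches * self.c_search
            + self.c_reward
        )
        return self.rollouts() * per_trajectory + self.c_gradient


# Placeholder values chosen only to exercise the formula.
cost = StepCost(64, 8, 6.0, 3.0, 0.5, 0.2, 0.1, 30.0)
print(cost.rollouts(), round(cost.total(), 1))
```

Because the rollout term scales with $B \cdot G$, doubling the group size roughly doubles everything except the gradient cost, which is the main reason group size is an expensive hyperparameter to sweep.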

51.8 Comparative Analysis

DeepResearch exists within a rapidly evolving landscape of autonomous research agents. The following comparison contextualizes its design choices against related open-source systems where evidence is available.

[Figure: systems positioned by action-space richness (search only; search + read; search + read + code) against training sophistication (SFT only; SFT + RL; undisclosed). Plotted: Search-R1, DeepResearch, and WebThinker (open-source, verified) and OpenAI Deep Research (proprietary, position approximate).]

The comparison table below covers only dimensions with published evidence. Direct performance comparisons across systems are not meaningful due to differences in base models, compute budgets, search APIs, and evaluation dates.

| Dimension | DeepResearch (Alibaba) | Search-R1 | WebThinker |
|---|---|---|---|
| Open-source | Yes (code + weights) [repo] | Yes [paper] | Yes [paper] |
| Training method | SFT + GRPO [paper] | RL with outcome-based reward [paper] | SFT + RL [paper] |
| Base model(s) | Qwen2.5-7B-Instruct [paper] | Qwen2.5 / LLaMA family [paper] | Various [paper] |
| Teacher model | QwQ-32B-Preview [paper] | N/A (online RL) [paper] | Varies [paper] |
| Action space | search, click, think, finish [paper] | search, finish [paper] | search, read, write, finish [paper] |
| Synthetic data pipeline | Yes (teacher trajectories) [paper] | Yes [paper] | Yes [paper] |
| Released model size | 7B parameters [model-card] | 7B–32B [paper] | 7B–32B [paper] |
| Key distinguishing feature | GRPO for agentic RL; open teacher–student pipeline [paper] | Retrieved-token loss masking with outcome-based reward [paper] | Write action for structured note-taking [paper] |

Evidence tier key: [paper] = stated in the system's published paper; [repo] = confirmed from public repository; [model-card] = from HuggingFace model card. OpenAI's Deep Research is excluded because its training method, model architecture, and internal design are not publicly documented.

51.8.1 Distinguishing Design Choices

GRPO over PPO. The choice of GRPO over standard PPO eliminates the need for a trained critic/value network (canonical GRPO property; paper-reported choice). This simplifies the training pipeline (one model to train instead of two) and avoids the bootstrapping problem where the critic must accurately estimate the value of partial research trajectories. PPO requires learning $V(s)$ for every intermediate state; GRPO sidesteps this by comparing complete trajectory outcomes within a group.

Richer action space than Search-R1. DeepResearch includes explicit click and think actions beyond Search-R1's minimal search/finish (paper-reported). The click action allows drilling into specific documents beyond search snippets; the think action provides explicit intermediate reasoning without triggering environment interaction. Whether this richer action space translates to better performance is an empirical question confounded by base model, training data, and evaluation differences.

Open weights. DeepResearch releases both model weights and training code (repo-documented), enabling the research community to study, reproduce, and build upon the approach.

51.9 Limitations and Open Questions

51.9.1 Technical Limitations

Context window constraints. Even with 128K tokens, long research trajectories including multiple full web pages can approach or exceed this limit. Information from early steps may be lost by synthesis time.
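A rough budget calculation shows how quickly fetched pages consume the window. The sketch assumes about 4 characters per token (a common heuristic for English text, not a property of the Qwen tokenizer) and a fixed 2,000-token overhead for prompt and reasoning; both numbers and the page sizes are illustrative assumptions.

```python
def estimated_context_tokens(n_pages, chars_per_page,
                             chars_per_token=4.0, overhead_tokens=2000):
    """Approximate tokens used after appending n_pages of fetched text."""
    return int(n_pages * chars_per_page / chars_per_token) + overhead_tokens


def pages_until_limit(limit_tokens, chars_per_page,
                      chars_per_token=4.0, overhead_tokens=2000):
    """Largest number of pages that still fits within the token limit."""
    tokens_per_page = chars_per_page / chars_per_token
    return int((limit_tokens - overhead_tokens) // tokens_per_page)


LIMIT = 128_000
print(pages_until_limit(LIMIT, chars_per_page=8_000))    # pages truncated to 8,000 chars
print(pages_until_limit(LIMIT, chars_per_page=100_000))  # long untruncated pages
```

Under these assumptions, pages truncated to 8,000 characters leave room for dozens of fetches, whereas untruncated pages of 100,000 characters exhaust the window after about five.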

Search API dependence. Performance is coupled to search API quality and coverage, which varies significantly by topic, recency, and language.

Reward sparsity for complex questions. Binary answer-correctness rewards provide a sparse training signal for multi-part questions. The agent receives no credit for gathering relevant information if it fails to synthesize a correct final answer. Process-based rewards, which score intermediate steps rather than only the final answer, could improve training efficiency but introduce reward-specification challenges.
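The sparsity is visible in a minimal outcome reward. The sketch below combines exact match after SQuAD-style normalization with a small format bonus; the normalization steps, the weights `w_acc` and `w_fmt`, and all function names are assumptions of this example, since the chapter's sources describe the reward structure but not its exact weights.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def outcome_reward(final_answer, gold_answers, well_formed,
                   w_acc=1.0, w_fmt=0.1):
    """Accuracy-dominant reward with a secondary format term (weights hypothetical)."""
    correct = final_answer is not None and exact_match(final_answer, gold_answers)
    return w_acc * float(correct) + w_fmt * float(well_formed)


gold = ["Eiffel Tower"]
print(outcome_reward("The Eiffel Tower.", gold, well_formed=True))
print(outcome_reward("Tour Montparnasse", gold, well_formed=True))
print(outcome_reward(None, gold, well_formed=False))
```

A trajectory that retrieved the right page but answered incorrectly earns only the format term, exactly like one that searched for nothing relevant.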

Hallucination in synthesis. Even with search grounding, the finish action produces free-form text not mechanically constrained to cite only retrieved information.

51.9.2 Open Research Questions

  • Does RL discover genuinely novel strategies? The current approach initializes with SFT on teacher trajectories. Whether GRPO discovers fundamentally different strategies (not just refined versions) would require detailed trajectory analysis comparing teacher and student behavior — an analysis not present in the current report.
  • Multi-agent research. Whether specialized agents (search, reasoning, synthesis) could outperform a single unified agent is unexplored.
  • Continual learning. Research agents trained at a fixed point gradually become stale. Mechanisms for periodic re-training are not addressed.
  • Adaptive tool use. The fixed action space could be extended with code execution, database queries, or specialized API calls. How to train agents for a growing tool set is an important direction.
  • Scaling laws for agentic RL. Whether predictable relationships exist between GRPO compute investment (group size, questions, training steps) and downstream performance remains open.

51.10 Impact and Significance

Methodological contribution. The demonstration that GRPO — originally developed for mathematical reasoning — transfers effectively to the agentic research setting is a non-trivial empirical finding (paper-reported). The adaptation involves two key differences from the mathematical reasoning setting: (1) trajectories include environment interactions that must be masked out of the loss (§51.3.3), and (2) trajectory length is determined by the agent's own decisions. That GRPO remains effective under these conditions extends its demonstrated applicability.
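The first difference can be sketched as a token-level mask over the trajectory. In the example below, per-token losses are averaged over policy-generated tokens only; the segment representation, the function names, and the choice to normalize by the number of policy tokens are assumptions of this sketch, not the audited training code.

```python
def build_policy_mask(segments):
    """Expand (source, n_tokens) segments into a per-token 0/1 mask.

    source is "policy" for tokens the model generated and "env" for
    observation tokens returned by the environment.
    """
    mask = []
    for source, n_tokens in segments:
        if source not in ("policy", "env"):
            raise ValueError(f"unknown segment source: {source!r}")
        mask.extend([1 if source == "policy" else 0] * n_tokens)
    return mask


def masked_mean_loss(token_losses, mask):
    """Average per-token losses over policy tokens only."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    n_policy = sum(mask)
    if n_policy == 0:
        raise ValueError("trajectory contains no policy tokens")
    return sum(loss * m for loss, m in zip(token_losses, mask)) / n_policy


# One action (3 tokens), one observation (4 tokens), one final action (2 tokens).
segments = [("policy", 3), ("env", 4), ("policy", 2)]
mask = build_policy_mask(segments)
losses = [0.5, 0.5, 0.5, 9.0, 9.0, 9.0, 9.0, 1.0, 1.0]
print(mask)
print(masked_mean_loss(losses, mask))
```

The large losses on observation tokens do not affect the result, so the policy is never trained to predict search results it did not write. How the loss is normalized across trajectories of different lengths is a separate design choice that this sketch does not settle.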

Open-source infrastructure. By releasing code and weights (repo-documented: Alibaba-NLP/DeepResearch-7B on HuggingFace), DeepResearch lowers the barrier to entry for research on autonomous research agents.

Training pipeline as contribution. The synthetic data pipeline itself — generating, filtering, and curating research trajectories at scale using a larger teacher model — is a reusable methodology. Other groups can adapt it for different base models, search APIs, or domain-specific research tasks.

51.11 Summary

Chapter Summary

Key takeaway. DeepResearch demonstrates that reinforcement learning (GRPO) over agentic search trajectories produces meaningfully better research agents than SFT alone — approximately +10pp on GAIA validation average (paper-reported, Zheng et al. 2025, Table 1) — and provides an open-source pipeline for training such agents.

Main contribution. An end-to-end open-source pipeline: synthetic trajectory generation from QwQ-32B-Preview → quality filtering → SFT cold start on Qwen2.5-7B-Instruct → GRPO training via verl → released as DeepResearch-7B. The adaptation of GRPO from mathematical reasoning to the agentic setting extends the method's demonstrated applicability: trajectories interleave policy-generated action tokens with environment-returned observation tokens, so the observation tokens must be masked out of the loss (§51.3.3, Layer 2).

What a researcher should know. (1) Both models are dense transformers (7B student, 32B teacher), not MoE. (2) Results in §51.5.2 are single-run point estimates without reported variance; GAIA Level 3 has ~17 questions (~6pp granularity). (3) Web search non-stationarity complicates reproduction (§51.5.1). (4) The system is not evolutionary in the traditional sense but shares structural similarities with NES via GRPO's population-based trajectory sampling (§51.6). (5) Code examples in §51.7.2 are author reconstructions, not repository excerpts.

Verification checklist for readers building on this chapter.
  • Model checkpoint: confirm Alibaba-NLP/DeepResearch-7B on HuggingFace; verify base is Qwen2.5-7B-Instruct (dense, ~7.6B).
  • Repository: clone at a pinned commit; record the hash. Verify the audit in §51.7.1 against actual source tree — particularly agent entry point, action parser, search module, and training scripts.
  • Benchmark scores: verify §51.5.2 against Zheng et al. (2025), Tables 1–2. Note evaluation date and search API used.
  • Pseudocode: the code in §51.7.2 is author reconstruction — verify against actual source files for class names, imports, and module paths.
  • Training config: verify GRPO hyperparameters ($G$, $\varepsilon_c$, $\beta$, learning rate) from the repository's configuration files, not from this chapter's illustrative placeholders.
  • GRPO formalization: the masked-loss notation in §51.3.3 Layer 2 is author formalization of paper-reported practice. Verify the actual implementation approach from the training code.