← Back to Index

CycleResearcher

Iterative preference-trained open-source LLM agent pair for full-cycle automated research and peer review via reinforcement learning from reviewer feedback Organization: Westlake University / William & Mary / Microsoft Research Asia / Zhejiang University / Soochow University Published: October 28, 2024 (v1); March 8, 2025 (v3) Type: paper/repo/data/model Report Type: PhD-Level Technical Analysis Report Date: April 2026

Full Title and Attribution
Authors and Team
Core Contribution
Supported Solutions
LLM Integration
Key Results
Reproducibility
Compute and API Costs
Architecture Solution
Component Breakdown
Core Mechanisms (Detailed)
Programming Language
Memory Management
Continued Learning
Applications

1 Full Title and Attribution

Full Title: CycleResearcher: Improving Automated Research via Automated Review

ArXiv: arXiv:2411.00816 (cs.CL)
DOI: 10.48550/arXiv.2411.00816
Version history: v1 (October 28, 2024), v2 (November 5, 2024), v3 (March 8, 2025)
Website: wengsyx.github.io/Researcher
Repository: github.com/minjun-zhu/CycleResearcher (code, training scripts, evaluation pipelines)
Datasets: Review-5k and Research-14k (released with model weights)
License: Open release (code, data, model checkpoints)
Status: Open-source with full artifact release; active project

CycleResearcher is, to the authors' knowledge, the first system to demonstrate that open-source post-trained LLMs can serve as autonomous agents capable of performing the full cycle of automated research—from literature review and manuscript preparation through peer review and iterative paper refinement—using a self-improving training loop that co-evolves both the researcher and reviewer models.

Significance within the autoresearch landscape: While prior systems like AI-Scientist (Lu et al., 2024) demonstrated automated research using proprietary models (GPT-4, Claude), CycleResearcher is the first to achieve competitive results with open-weight models through iterative preference optimization. This represents a fundamental shift from API-dependent research automation toward reproducible, modifiable, self-improving open systems.

2 Authors and Team

Author	Affiliation	Role
Yixuan Weng	William & Mary / Westlake University	Lead researcher, system architecture
Minjun Zhu	Westlake University	Core implementation, training pipeline
Guangsheng Bao	Westlake University	Dataset construction, evaluation
Hongbo Zhang	Soochow University	Reviewer model training
Jindong Wang	Microsoft Research Asia (MSRA)	Advisory, RL methodology
Yue Zhang	Westlake University	Senior advisor, NLP
Linyi Yang	Westlake University / Zhejiang University	Corresponding author, project lead

Team composition: A compact academic team of seven, spanning three Chinese universities and one Microsoft Research lab. This is notably smaller than comparable industrial efforts (e.g., AIRA₂'s 25 authors at FAIR/Meta) yet produces a system with competitive output quality.

Research context: - Yixuan Weng has prior work on reasoning agents and LLM evaluation - Linyi Yang's group at Westlake focuses on NLP robustness, evaluation, and generative AI safety - Jindong Wang at MSRA contributes expertise in transfer learning and robust machine learning - Yue Zhang leads the NLP group at Westlake, with extensive publications in generation, parsing, and evaluation

The team's composition reflects a deliberate combination of NLP generation expertise (Weng, Zhang), evaluation/review methodology (Yang, Wang), and systems implementation capability (Zhu, Bao, Zhang).

3 Core Contribution

The Problem

Automated scientific research faces a chicken-and-egg quality problem:

Research quality bottleneck: LLM-generated research papers typically exhibit shallow analysis, lack of novelty, and poor methodology—problems that human peer review catches but automated systems cannot self-diagnose
Proprietary model dependency: Prior systems (AI-Scientist, The AI Scientist) rely exclusively on proprietary APIs (GPT-4, Claude 3.5), making them non-reproducible, expensive, and fundamentally unmodifiable at the model level
Static generation paradigm: Existing approaches treat paper generation as a single forward pass—generate, maybe self-refine, submit. There is no mechanism for the generator to learn from feedback across iterations
Missing review signal: Without a reliable automated reviewer, there is no gradient signal to improve the research agent. Human review is expensive, slow, and unscalable

The Solution

CycleResearcher introduces a dual-agent cyclic training framework where a research agent and a review agent co-evolve through iterative preference optimization:

┌──────────────────────────────────────────────────────────────────────┐
│                    CycleResearcher Framework                         │
│                                                                      │
│  Phase 1: Dataset Construction                                       │
│  ┌──────────────┐    ┌──────────────┐                                │
│  │ Research-14k  │    │  Review-5k   │                                │
│  │ 14K ML papers │    │  5K review   │                                │
│  │ with metadata │    │  instances   │                                │
│  └──────┬───────┘    └──────┬───────┘                                │
│         │                    │                                        │
│  Phase 2: Base Model Training (SFT)                                  │
│  ┌──────▼───────┐    ┌──────▼───────┐                                │
│  │ CycleResearcherₒ│ │ CycleReviewerₒ│                              │
│  │ (base researcher)│ │ (base reviewer)│                              │
│  └──────┬───────┘    └──────┬───────┘                                │
│         │                    │                                        │
│  Phase 3: Iterative Preference Training (RLHF-style)                │
│  ┌──────▼───────────────────▼───────────────────────────────┐        │
│  │                                                           │        │
│  │   ┌─────────────────┐     ┌─────────────────┐            │        │
│  │   │ CycleResearcher  │────▶│ Generated Paper  │            │        │
│  │   │ (iteration t)    │     │                  │            │        │
│  │   └────────▲────────┘     └────────┬─────────┘            │        │
│  │            │                        │                      │        │
│  │   Preference                 ┌──────▼─────────┐           │        │
│  │   Optimization               │ CycleReviewer   │           │        │
│  │   (DPO/RLHF)                │ (iteration t)   │           │        │
│  │            │                  └──────┬─────────┘           │        │
│  │            │                         │                     │        │
│  │   ┌────────┴────────┐      ┌────────▼────────┐           │        │
│  │   │ Preference Pairs │◀─────│ Review Scores   │           │        │
│  │   │ (chosen/rejected)│      │ & Feedback       │           │        │
│  │   └─────────────────┘      └─────────────────┘           │        │
│  │                                                           │        │
│  │   → Repeat for T iterations                               │        │
│  └───────────────────────────────────────────────────────────┘        │
│                                                                      │
│  Output: Co-evolved CycleResearcher_T + CycleReviewer_T              │
└──────────────────────────────────────────────────────────────────────┘

Three Key Contributions

Cyclic co-training paradigm. The first framework where a research agent and a review agent are jointly trained in an iterative loop, with the reviewer providing preference signal for the researcher and the researcher producing increasingly challenging material for the reviewer. This creates a self-improving dynamic analogous to GANs, but in the text generation domain with RL-based optimization rather than adversarial loss
Open-source full-cycle research. The first demonstration that open-weight LLMs (not proprietary APIs) can perform the complete research cycle—literature survey, hypothesis formulation, experimental design, manuscript writing, peer review, and paper revision—at competitive quality levels
Two purpose-built datasets. Review-5k (5,000 curated review instances from ML conferences) and Research-14k (14,000 ML research papers with structured metadata) enable reproducible training of both agents, filling a critical gap in training data for automated research systems

Relationship to Prior Work

Timeline of automated research systems:
────────────────────────────────────────────────────────────
AI-Scientist v1 (2024)   → GPT-4 / Claude, single-pass generation
                            No iterative improvement mechanism
                            Proprietary models only

CycleResearcher (2024)   → Open-source LLMs, iterative RL training
                            Dual-agent cycle (researcher + reviewer)
                            Preference optimization across iterations
                            Code + data + weights released

AI-Scientist v2 (2025)   → Agentic review, conference submissions
                            Still proprietary-model dependent

SciAgents (2025)          → Multi-agent, ontology-driven
                            No self-improving training loop

AIRA₂ (2026)             → Evolutionary search for ML solutions
                            Different focus: code solutions, not papers
────────────────────────────────────────────────────────────

CycleResearcher occupies a unique position: it is the only system that (a) uses open-source models, (b) implements iterative self-improvement through co-training, and (c) targets full paper generation rather than just code solutions.

4 Supported Solutions

Solution Type	Support Level	Details
Full research paper generation	Primary target	Literature review → methodology → experiments → manuscript
Automated peer review	Primary target	Structured multi-aspect review with numerical scoring
Iterative paper refinement	Core mechanism	Review feedback drives preference optimization
Literature survey generation	Supported	Contextual retrieval and synthesis of related work
Review score prediction	Supported	Numerical scores calibrated against human reviewers
Experimental design	Supported	Hypothesis formulation and experimental methodology
Research idea generation	Supported	Novel research direction identification from literature

Task Decomposition

The full research cycle is decomposed into discrete, trainable sub-tasks:

Research Pipeline (CycleResearcher):
  1. Topic Understanding
     └─ Parse research topic → identify key concepts → scope boundaries

  2. Literature Review  
     └─ Retrieve relevant papers → extract key findings → identify gaps

  3. Hypothesis Formation
     └─ Based on gaps → formulate testable hypotheses → novelty check

  4. Experimental Design
     └─ Design experiments → specify metrics → plan ablations

  5. Paper Writing
     └─ Structure → draft sections → integrate results → bibliography

  6. Revision (post-review)
     └─ Parse review feedback → address criticisms → strengthen paper

Review Pipeline (CycleReviewer):
  1. Paper Comprehension
     └─ Parse manuscript → extract claims → identify methodology

  2. Multi-Aspect Evaluation
     └─ Soundness → novelty → significance → clarity → reproducibility

  3. Score Assignment
     └─ Calibrated numerical score (1-10 scale, conference-standard)

  4. Constructive Feedback
     └─ Specific critiques → improvement suggestions → questions

What CycleResearcher Does NOT Do

No code execution: Generated papers describe methods but the system does not run experiments (unlike AI-Scientist which executes code)
No real-time web search: Literature review is based on the training corpus, not live retrieval
No multi-modal generation: Text-only papers; no figure generation or data visualization
No collaborative multi-agent research: Single researcher agent, not a team of specialists
No open-ended exploration: Generates papers for given topics, not autonomous ideation from scratch

5 LLM Integration

Base Models

CycleResearcher is built on open-source foundation models and fine-tuned through supervised learning followed by iterative preference optimization:

Component	Base Model	Parameters	Context Length	Training
CycleResearcher	Qwen2-7B / Qwen2-72B	7B / 72B	32K tokens	SFT → Iterative DPO
CycleReviewer	Qwen2-7B / Qwen2-72B	7B / 72B	32K tokens	SFT → Iterative DPO
Baseline comparison	GPT-4o	~1.8T (est.)	128K tokens	Proprietary
Baseline comparison	Claude 3.5 Sonnet	Unknown	200K tokens	Proprietary

Why Open-Source Models

The choice of open-source models is not merely an accessibility decision—it is architecturally necessary for the cyclic training loop:

Why proprietary models cannot support this architecture:

  Proprietary (GPT-4, Claude):
    ┌──────────────────┐
    │   API Access Only  │
    │   ─────────────── │
    │   ✗ No weight access    → Cannot compute gradients
    │   ✗ No fine-tuning      → Cannot do DPO/RLHF
    │   ✗ No preference data  → Cannot build training loop
    │   ✗ Fixed behavior      → Cannot iteratively improve
    └──────────────────┘

  Open-source (Qwen2):
    ┌──────────────────┐
    │  Full Weight Access │
    │   ─────────────── │
    │   ✓ Gradient computation → DPO loss computation
    │   ✓ Fine-tuning          → SFT + preference training
    │   ✓ Preference pairs     → Researcher generates, reviewer ranks
    │   ✓ Iterative update     → Weights evolve across cycles
    └──────────────────┘

This architectural requirement means CycleResearcher is fundamentally different from API-wrapper systems—the LLM is not a black-box service but a trainable component within the optimization loop.

Model Selection Rationale

Qwen2 was selected for several reasons: 1. Strong multilingual performance: Competitive with LLaMA-3 and Mixtral at the time of publication 2. Multiple scale points: 7B and 72B variants enable studying scaling behavior within the framework 3. Long context support: 32K token context is sufficient for full paper generation and review 4. Permissive license: Apache 2.0 enables training and redistribution of derivative models 5. Chinese-English bilingual strength: Relevant given the team's research context and dataset construction

Prompt Architecture

The system uses structured prompts for both agents. The researcher prompt includes:

System prompt (CycleResearcher):
  You are a machine learning researcher. Your task is to write a 
  complete research paper on the given topic.

  You should:
  1. Conduct a thorough literature review
  2. Identify research gaps and formulate hypotheses
  3. Design experiments with clear methodology
  4. Write a well-structured paper following conference format
  5. Include: Abstract, Introduction, Related Work, Method, 
     Experiments, Discussion, Conclusion, References

  Topic: {topic}
  Related papers: {retrieved_context}

System prompt (CycleReviewer):
  You are an expert reviewer for a top-tier ML conference.
  Evaluate the following paper across multiple dimensions:

  1. Soundness (1-4): Are the claims well-supported?
  2. Presentation (1-4): Is the paper clearly written?
  3. Contribution (1-4): Does it advance the field?
  4. Overall Score (1-10): Would you accept this paper?

  Provide detailed feedback including:
  - Summary of contributions
  - Strengths (at least 3)
  - Weaknesses (at least 3)
  - Questions for the authors
  - Suggestions for improvement

  Paper: {paper_text}

6 Key Results

CycleReviewer Performance

The automated reviewer is evaluated by comparing its score predictions against ground-truth human reviewer scores from ML conferences:

Metric	CycleReviewer (72B)	Individual Human Reviewer	Improvement
MAE (Mean Absolute Error)	—	Baseline	−26.89%
Score prediction correlation	High	Reference	Competitive
Multi-aspect agreement	Strong	Reference	Near-parity

The 26.89% MAE reduction over individual human reviewers is a striking result. Individual human reviewers are notoriously noisy—inter-reviewer agreement at top ML conferences is typically low (κ ≈ 0.2–0.3). CycleReviewer achieves better calibration than an individual human reviewer when predicting the consensus score.

Interpretation: CycleReviewer does not replace the human review process but demonstrates that a well-trained open-source model can serve as a reliable automated first-pass reviewer or as an additional signal in the review pipeline.

CycleResearcher Paper Quality

Generated papers are evaluated via simulated peer review, comparing against human-written papers at different quality levels:

Source	Mean Review Score	Std Dev	Interpretation
CycleResearcher (72B, iterative)	5.36	~0.8	Competitive with human preprints
Human preprints (arXiv, not peer-reviewed)	5.24	~1.2	Baseline unreviewed quality
Human accepted papers (conference)	5.69	~0.9	Post-review, publication quality
AI-Scientist (GPT-4)	~4.5–5.0	~1.5	Prior SOTA for automated papers

Key finding: CycleResearcher-generated papers score 0.12 points above human preprints and 0.33 points below accepted conference papers on the simulated review scale. This places automated research output in the "borderline accept" range—not yet at acceptance quality but substantially above random generation.

Scaling Behavior

Model Size	CycleResearcher Score	CycleReviewer MAE Reduction
7B (base, SFT only)	~4.5	~15%
7B (iterative DPO)	~4.9	~20%
72B (base, SFT only)	~5.0	~22%
72B (iterative DPO)	5.36	26.89%

Both model scale and iterative preference training contribute to quality. The 72B iterative model achieves a ~0.36-point improvement over the 72B SFT-only model, demonstrating that the cyclic training mechanism provides value beyond simple supervised fine-tuning.

Iteration Dynamics

Quality improvement across training iterations:

Score
 5.4 │                                          ●──── Iteration 3
     │                                    ●
 5.2 │                              ●
     │                        ●
 5.0 │                  ●──── Iteration 2
     │            ●
 4.8 │      ●──── Iteration 1
     │●
 4.6 │ SFT baseline
     │
 4.4 │
     └─────────────────────────────────────────────
       SFT    Iter1    Iter2    Iter3    Iter4

Each iteration of the preference training cycle produces
measurable improvement, with diminishing returns after
iteration 3.

Comparative Analysis with Proprietary Systems

System	Model	Paper Score	Review Quality	Open-Source	Self-Improving
CycleResearcher	Qwen2-72B	5.36	26.89% MAE↓	Yes	Yes
AI-Scientist v1	GPT-4	~4.5–5.0	Heuristic only	No	No
AI-Scientist v1	Claude 3.5	~4.8–5.2	Heuristic only	No	No
Human preprint	Human	5.24	Human baseline	N/A	N/A
Human accepted	Human	5.69	Human baseline	N/A	N/A

CycleResearcher achieves the highest automated paper scores while being the only open-source, self-improving system. The gap to human accepted papers (0.33 points) represents the remaining challenge.

7 Reproducibility

Artifact Release

Artifact	Released	Format	Size (est.)
Training code	Yes	Python	—
Evaluation code	Yes	Python	—
CycleResearcher weights (7B)	Yes	HuggingFace safetensors	~14 GB
CycleResearcher weights (72B)	Yes	HuggingFace safetensors	~144 GB
CycleReviewer weights (7B)	Yes	HuggingFace safetensors	~14 GB
CycleReviewer weights (72B)	Yes	HuggingFace safetensors	~144 GB
Review-5k dataset	Yes	JSON	~50 MB
Research-14k dataset	Yes	JSON	~2 GB
Training configuration	Yes	YAML/JSON	—
Paper prompts/templates	Yes	Text	—

Reproducibility Assessment

Criterion	Score (1-5)	Notes
Code availability	5	Full training and evaluation pipeline released
Data availability	5	Both datasets released with preprocessing scripts
Model availability	5	All model weights (4 checkpoints) released
Hardware specification	3	GPU requirements stated; exact cluster config partially specified
Hyperparameter documentation	4	Key hyperparameters documented; some DPO details require code inspection
Random seed control	3	Seeds mentioned but full seed sweep not documented
End-to-end reproduction script	3	Training scripts provided; orchestration requires manual assembly

Overall reproducibility: High. The combination of released code, data, model weights, and training scripts makes this one of the most reproducible automated research systems published. The main barrier to exact reproduction is compute cost (see §8), not missing artifacts.

Known Reproduction Challenges

Compute requirements: The 72B model training requires significant GPU resources that may not be available to all researchers
Review-5k construction: The dataset includes reviews from ML conferences whose exact selection criteria require careful examination of the data preprocessing pipeline
Iteration sensitivity: The number of preference training iterations and the stopping criterion involve some manual tuning
Evaluation subjectivity: Simulated review scores depend on the reviewer model, creating a potential circularity when the reviewer is also part of the trained system

8 Compute and API Costs

Training Compute

Phase	Model	Hardware	GPU-Hours (est.)	Wall Time (est.)
SFT CycleResearcher (7B)	Qwen2-7B	8× A100 80GB	~64	~8 hours
SFT CycleReviewer (7B)	Qwen2-7B	8× A100 80GB	~64	~8 hours
SFT CycleResearcher (72B)	Qwen2-72B	8× A100 80GB	~512	~64 hours
SFT CycleReviewer (72B)	Qwen2-72B	8× A100 80GB	~512	~64 hours
Iterative DPO (7B, per iter)	Qwen2-7B	8× A100 80GB	~96	~12 hours
Iterative DPO (72B, per iter)	Qwen2-72B	8× A100 80GB	~768	~96 hours
Preference data generation (per iter)	Both models	8× A100 80GB	~256	~32 hours

Total estimated compute for 72B system (3 iterations):

SFT (researcher + reviewer):    ~1,024 GPU-hours
Iterative DPO (3 iters × both): ~4,608 GPU-hours
Preference data generation:      ~768 GPU-hours
Evaluation and misc:             ~200 GPU-hours
───────────────────────────────────────────────
Total:                          ~6,600 GPU-hours (A100)

Cost Estimation

Resource	Unit Cost	Quantity	Total
A100 80GB (cloud spot)	~$1.50/GPU-hr	~6,600 hrs	~$9,900
A100 80GB (cloud on-demand)	~$3.00/GPU-hr	~6,600 hrs	~$19,800
Storage (models + data)	~$0.02/GB-mo	~500 GB	~$10/month

Comparison with API-Based Systems

System	Cost per Paper	Model	Scalability
CycleResearcher (after training)	~$2–5 (inference)	Open-source 72B	Unlimited local inference
AI-Scientist (GPT-4)	~$15–30 (API calls)	Proprietary	Rate-limited, variable pricing
AI-Scientist (Claude 3.5)	~$10–20 (API calls)	Proprietary	Rate-limited, variable pricing

Key economic insight: CycleResearcher has high upfront training cost (~$10K–20K) but near-zero marginal cost per paper generation. API-based systems have zero training cost but unbounded operational cost. At approximately 500–1000 papers, CycleResearcher becomes more cost-effective than API-based alternatives—a crossover that matters for research labs running extensive experiments.

9 Architecture Solution

High-Level Architecture

The system architecture consists of five interconnected layers:

┌─────────────────────────────────────────────────────────────────────┐
│                        INFERENCE LAYER                               │
│                                                                      │
│   Topic Input ──▶ CycleResearcher ──▶ Generated Paper                │
│                      (72B)                  │                        │
│                                             ▼                        │
│                                      CycleReviewer ──▶ Review + Score│
│                                         (72B)                        │
│                                             │                        │
│                                   (optional: revision loop)          │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────────────────┐
│                     TRAINING LAYER                                   │
│                                                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │              Iterative Preference Training Loop               │   │
│   │                                                               │   │
│   │   1. Generate N papers with CycleResearcher_t                 │   │
│   │   2. Score each with CycleReviewer_t                          │   │
│   │   3. Construct preference pairs: (high-score, low-score)      │   │
│   │   4. Train CycleResearcher_{t+1} via DPO on pairs            │   │
│   │   5. (Optional) Update CycleReviewer_{t+1} similarly          │   │
│   │   6. Repeat until convergence or budget exhaustion            │   │
│   │                                                               │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│   ┌──────────────────┐    ┌──────────────────┐                      │
│   │ DPO Optimizer     │    │ SFT Trainer       │                     │
│   │ (preference loss) │    │ (cross-entropy)   │                     │
│   └──────────────────┘    └──────────────────┘                      │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────────────────┐
│                      DATA LAYER                                      │
│                                                                      │
│   ┌──────────────────┐    ┌──────────────────┐                      │
│   │   Research-14k    │    │    Review-5k      │                     │
│   │   14K ML papers   │    │   5K review       │                     │
│   │   + metadata      │    │   instances       │                     │
│   └──────────────────┘    └──────────────────┘                      │
│                                                                      │
│   ┌──────────────────────────────────────────────┐                  │
│   │   Dynamically Generated Preference Data       │                  │
│   │   (created each training iteration)           │                  │
│   └──────────────────────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────┘

Dual-Agent Interaction Pattern

The core architectural innovation is the dual-agent cycle, which creates a feedback loop between generation and evaluation:

                    ┌──────────────┐
         ┌────────▶│CycleResearcher│────────┐
         │         │   (writer)    │        │
         │         └──────────────┘        │
         │                                  │ generates
    improves                                │ paper
    (DPO)                                   │
         │                                  ▼
    ┌────┴─────┐                    ┌──────────────┐
    │Preference │◀───────────────────│   Generated   │
    │  Pairs    │                    │    Paper      │
    └────┬─────┘                    └──────┬───────┘
         ▲                                  │
         │                                  │ evaluated
    constructs                              │ by
    pairs                                   │
         │                                  ▼
         │         ┌──────────────┐        │
         └─────────│CycleReviewer │◀───────┘
                   │  (reviewer)  │
                   └──────────────┘
                         │
                    provides scores
                    + feedback

This cycle is reminiscent of Generative Adversarial Networks (GANs) but with critical differences:

Property	GAN	CycleResearcher
Training signal	Adversarial loss (min-max)	Preference optimization (DPO)
Discriminator/Reviewer role	Binary (real/fake)	Multi-aspect scoring + text feedback
Generator/Researcher role	Produce realistic samples	Produce high-quality research
Training stability	Notoriously unstable	More stable (DPO avoids reward hacking)
Output space	Continuous (images/text)	Structured text (research papers)
Co-evolution	Implicit (adversarial dynamics)	Explicit (iterative preference pairs)

10 Component Breakdown

Component 1: Research-14k Dataset

Purpose: Provide supervised training data for the research agent.

Property	Value
Size	~14,000 ML research papers
Source	arXiv (cs.AI, cs.CL, cs.LG, cs.CV)
Time range	2020–2024
Format	Structured JSON: title, abstract, sections, references
Metadata	Topics, venue, acceptance status (where available)
Filtering	Quality-filtered based on citation count and venue

Construction pipeline:

arXiv bulk download
    │
    ▼
LaTeX → structured text conversion
    │
    ▼
Section segmentation (intro, method, experiments, ...)
    │
    ▼
Reference extraction and linking
    │
    ▼
Quality filtering:
  - Minimum citation threshold
  - Complete section structure required
  - English language only
  - ML subfields only (cs.AI, cs.CL, cs.LG, cs.CV)
    │
    ▼
Metadata enrichment:
  - Topic classification
  - Venue mapping
  - Author affiliation
    │
    ▼
Research-14k (final dataset)

Component 2: Review-5k Dataset

Purpose: Provide supervised training data for the reviewer agent.

Property	Value
Size	~5,000 review instances
Source	OpenReview (ICLR, NeurIPS, ICML reviews)
Format	Paper + multi-aspect review + numerical scores
Aspects	Soundness, presentation, contribution, overall
Score range	1–10 (conference standard)
Review length	200–2000 words per review

Key design choices in dataset construction:

Multi-aspect structure: Each review includes separate scores for soundness, presentation, contribution, and overall quality—mirroring real conference review forms
Calibration: Reviews are filtered to exclude extreme outlier scores, ensuring the training distribution is realistic
Diversity: Reviews span accepted and rejected papers, providing both positive and negative examples
Temporal split: Training reviews are from earlier years; evaluation reviews from more recent years to test generalization

Component 3: SFT Training Pipeline

Purpose: Create base researcher and reviewer models through supervised fine-tuning.

SFT Training Configuration:

  CycleResearcher SFT:
    Base model:      Qwen2-7B or Qwen2-72B
    Training data:   Research-14k (input: topic + context, output: paper)
    Loss:            Cross-entropy on paper tokens
    Learning rate:   2e-5 (with cosine schedule)
    Batch size:      Effective 128 (gradient accumulation)
    Epochs:          3
    Context length:  32,768 tokens

  CycleReviewer SFT:
    Base model:      Qwen2-7B or Qwen2-72B
    Training data:   Review-5k (input: paper, output: review + scores)
    Loss:            Cross-entropy on review tokens
    Learning rate:   2e-5 (with cosine schedule)
    Batch size:      Effective 128 (gradient accumulation)
    Epochs:          3
    Context length:  32,768 tokens

Component 4: Iterative DPO Training Engine

Purpose: Refine both models through preference optimization using the cyclic feedback loop.

Direct Preference Optimization (DPO) avoids the instability of PPO-based RLHF while achieving similar alignment effects. The DPO loss function:

L_DPO(π_θ; π_ref) = -E_{(x,y_w,y_l)~D} [
    log σ( β · log(π_θ(y_w|x) / π_ref(y_w|x)) 
         - β · log(π_θ(y_l|x) / π_ref(y_l|x)) )
]

Where:
  π_θ     = current policy (model being trained)
  π_ref   = reference policy (model from previous iteration)
  y_w     = preferred (higher-scored) paper
  y_l     = dispreferred (lower-scored) paper
  x       = input (topic + context)
  β       = temperature parameter controlling deviation from reference
  σ       = sigmoid function

Preference pair construction:

def construct_preference_pairs(
    researcher: CycleResearcher,
    reviewer: CycleReviewer,
    topics: list[str],
    n_samples_per_topic: int = 4,
) -> list[PreferencePair]:
    pairs = []
    for topic in topics:
        papers = [researcher.generate(topic) for _ in range(n_samples_per_topic)]
        scores = [reviewer.score(paper) for paper in papers]

        ranked = sorted(zip(papers, scores), key=lambda x: x[1], reverse=True)

        for i in range(len(ranked)):
            for j in range(i + 1, len(ranked)):
                if ranked[i][1] - ranked[j][1] > SCORE_MARGIN:
                    pairs.append(PreferencePair(
                        prompt=topic,
                        chosen=ranked[i][0],    # higher-scored paper
                        rejected=ranked[j][0],  # lower-scored paper
                        score_diff=ranked[i][1] - ranked[j][1],
                    ))
    return pairs

Component 5: Evaluation Framework

Purpose: Assess paper quality and reviewer accuracy across multiple dimensions.

Evaluation Target	Metric	Method
CycleReviewer accuracy	MAE vs. human consensus	Compare predicted vs. ground-truth conference scores
CycleReviewer calibration	Score distribution analysis	KL divergence against human score distribution
CycleResearcher quality	Simulated review score	CycleReviewer + human evaluation
Paper structure	Completeness check	Automated section presence validation
Paper novelty	Topic overlap analysis	Embedding similarity against training set
Paper coherence	Cross-section consistency	Reference and claim tracking

Component 6: Paper Generation Pipeline

The research agent generates papers through a multi-stage structured process:

Input: Research topic T, optional retrieved context C

Stage 1: Literature Analysis
  → Parse topic T
  → Retrieve relevant papers from Research-14k corpus
  → Generate literature review section
  → Identify gaps and positioning

Stage 2: Methodology Design
  → Formulate research hypothesis based on identified gaps
  → Design methodology addressing the hypothesis
  → Specify experimental setup: datasets, baselines, metrics

Stage 3: Paper Drafting
  → Generate structured paper:
     - Title
     - Abstract (100-200 words)
     - Introduction (problem, motivation, contributions)
     - Related Work (positioned against literature)
     - Method (formal description with notation)
     - Experiments (setup, results, analysis)
     - Discussion (limitations, implications)
     - Conclusion (summary, future work)
     - References

Stage 4: Self-Consistency Check
  → Verify claims in abstract match content
  → Check reference consistency
  → Validate experimental claims against method description

Output: Complete research paper P

11 Core Mechanisms (Detailed)

11.1 The Cyclic Co-Training Loop

The core intellectual contribution of CycleResearcher is the iterative co-training mechanism. This section provides a detailed formal analysis.

Formal definition. Let R_t denote the researcher model at iteration t and V_t denote the reviewer model at iteration t. The training loop proceeds as:

Initialize:
  R_0 = SFT(Base_Model, Research-14k)
  V_0 = SFT(Base_Model, Review-5k)

For t = 0, 1, 2, ..., T-1:

  1. GENERATE: For each topic x_i in training set:
     P_i^{(1)}, ..., P_i^{(K)} ~ R_t(· | x_i)
     Generate K candidate papers

  2. EVALUATE: For each generated paper:
     s_i^{(k)} = V_t(P_i^{(k)})
     Score each paper using current reviewer

  3. CONSTRUCT PREFERENCES: For each topic:
     D_t = {(x_i, P_i^{(w)}, P_i^{(l)}) : s_i^{(w)} > s_i^{(l)} + margin}
     Build preference pairs from score differences

  4. OPTIMIZE RESEARCHER:
     R_{t+1} = DPO(R_t, D_t, β_R)
     Update researcher via Direct Preference Optimization

  5. (OPTIONAL) OPTIMIZE REVIEWER:
     V_{t+1} = DPO(V_t, D_t^{review}, β_V)
     Update reviewer with its own preference data

Output: R_T, V_T

Convergence properties. The cyclic training loop does not have formal convergence guarantees (unlike, e.g., EM algorithms). However, empirical results show: - Monotonic improvement in paper quality scores across iterations 1–3 - Diminishing returns after iteration 3–4 - No observed mode collapse or quality degradation (unlike GAN training)

The stability is primarily attributed to DPO's implicit constraint on policy deviation from the reference model, which prevents catastrophic forgetting.

11.2 Direct Preference Optimization (DPO) Mechanics

DPO (Rafailov et al., 2023) is the optimization engine that converts reviewer feedback into model improvements. The key insight is that DPO avoids the instability of reward model training + PPO by directly optimizing the policy from preference pairs.

Mathematical formulation:

The standard RLHF objective is:

max_π  E_{x~D, y~π(·|x)} [r(x, y)] - β · KL[π(·|x) || π_ref(·|x)]

DPO reparameterizes the reward function as:

r(x, y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x)

Substituting into the Bradley-Terry preference model yields the DPO loss:

L_DPO = -E_{(x, y_w, y_l)} [ log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) 
                                       - log π_θ(y_l|x)/π_ref(y_l|x))) ]

In the CycleResearcher context:

DPO Variable	CycleResearcher Mapping
x	Research topic + context
y_w	Paper scored higher by CycleReviewer
y_l	Paper scored lower by CycleReviewer
π_θ	CycleResearcher being trained
π_ref	CycleResearcher from previous iteration
β	Temperature controlling KL penalty (typically 0.1–0.5)

Why DPO over PPO:

Property	PPO (standard RLHF)	DPO (CycleResearcher)
Requires reward model	Yes (separate training)	No (implicit in loss)
Training stability	Lower (reward hacking, mode collapse)	Higher (KL constraint built-in)
Compute overhead	4 models in memory (actor, critic, reward, ref)	2 models (policy, reference)
Hyperparameter sensitivity	High (clip ratio, GAE, value coef)	Low (mainly β)
Suitability for long text	Challenging (credit assignment over long sequences)	Natural (pairwise comparison)

11.3 Preference Pair Construction Strategy

The quality of preference pairs directly determines the effectiveness of DPO training. CycleResearcher uses a score-margined sampling strategy:

For each topic x:
  1. Generate K = 4 papers: {p_1, p_2, p_3, p_4}
  2. Score each: {s_1, s_2, s_3, s_4} via CycleReviewer
  3. Sort by score: s_{(1)} ≥ s_{(2)} ≥ s_{(3)} ≥ s_{(4)}
  4. Construct pairs where score difference exceeds margin δ:

     Valid pair: (p_i, p_j) if s_i - s_j > δ

     Example with δ = 0.5:
       Scores: [6.2, 5.8, 5.1, 4.3]
       Pairs:  (p₁,p₃), (p₁,p₄), (p₂,p₃), (p₂,p₄), (p₃,p₄)
       Skipped: (p₁,p₂) — margin too small (0.4 < 0.5)

Design rationale for the margin threshold: - Too small δ: pairs include near-identical quality papers → noisy training signal - Too large δ: few pairs survive → insufficient training data - Sweet spot (δ ≈ 0.5): meaningful quality differences captured while maintaining adequate data volume

11.4 CycleReviewer: Automated Peer Review

The reviewer model is arguably the more technically challenging component, as it must produce calibrated numerical scores that correlate with human judgments.

Multi-aspect review structure:

CycleReviewer Output Format:

  ## Summary
  [2-3 sentence summary of the paper's main contributions]

  ## Strengths
  1. [Specific strength with evidence from the paper]
  2. [Specific strength with evidence from the paper]
  3. [Specific strength with evidence from the paper]

  ## Weaknesses
  1. [Specific weakness with explanation]
  2. [Specific weakness with explanation]
  3. [Specific weakness with explanation]

  ## Questions for Authors
  1. [Clarification question]
  2. [Technical question]

  ## Scores
  - Soundness: [1-4]
  - Presentation: [1-4]
  - Contribution: [1-4]
  - Overall: [1-10]

  ## Confidence
  - [1-5]

Calibration mechanism:

The reviewer's score distribution is calibrated against the empirical distribution of real conference reviews during SFT training. This is critical because:

Score anchoring: The model learns that a "6" means "marginally above acceptance threshold" (ICLR convention)
Distribution matching: Generated score distributions should approximate the bell curve observed in real reviews
Aspect consistency: Overall scores should be consistent with individual aspect scores

Target score distribution (approximation of real conferences):

  Frequency
  ▲
  │          ╭─╮
  │        ╭─┤ ├─╮
  │      ╭─┤ │ │ ├─╮
  │    ╭─┤ │ │ │ │ ├─╮
  │  ╭─┤ │ │ │ │ │ │ ├─╮
  │──┤ │ │ │ │ │ │ │ │ ├──
  └──┴─┴─┴─┴─┴─┴─┴─┴─┴─┴──▶ Score
     1  2  3  4  5  6  7  8  9  10
              ↑           ↑
           reject      accept
           threshold   threshold

11.5 The Self-Improvement Dynamic

The interaction between CycleResearcher and CycleReviewer creates emergent self-improvement dynamics that go beyond what either agent could achieve independently.

Mechanism 1: Quality ratchet effect

Iteration 1:
  Researcher_1 generates papers of quality Q₁
  Reviewer_1 identifies papers scoring > Q₁ as "good"
  DPO pushes Researcher toward generating Q₁+ quality papers

Iteration 2:
  Researcher_2 generates papers of quality Q₁+ (improved)
  Reviewer_2 must now discriminate within a HIGHER quality range
  This improves Reviewer_2's sensitivity to subtle quality differences
  DPO pushes Researcher toward generating Q₁++ quality papers

Effect: Both models improve because the "bar" continuously rises

Mechanism 2: Reviewer as implicit curriculum

The reviewer provides a natural curriculum for the researcher: - Easy improvements (structure, formatting, reference consistency) get large score gains in early iterations - Hard improvements (novelty, experimental rigor, theoretical depth) become the differentiator in later iterations - This creates an organic easy-to-hard curriculum without explicit curriculum design

Mechanism 3: Preference diversity

By generating K papers per topic and constructing pairs across score differences, the system creates diverse preference signals: - Pairs with large score gaps teach coarse quality distinctions - Pairs with small score gaps (just above margin) teach fine-grained quality distinctions - The mixture of both creates a rich training signal

11.6 Comparison with GAN Training Dynamics

The CycleResearcher framework shares structural similarity with GANs but differs in critical ways that affect training dynamics:

GAN Framework:
  Generator G: noise z → fake sample G(z)
  Discriminator D: sample → real/fake probability

  Training: min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]

  Failure modes:
    - Mode collapse: G produces limited diversity
    - Training instability: oscillating D and G
    - Vanishing gradients: D becomes too strong

CycleResearcher Framework:
  Researcher R: topic x → paper R(x)
  Reviewer V: paper → multi-aspect scores + text review

  Training: R_{t+1} = DPO(R_t, pairs from V_t scores)

  Mitigated failure modes:
    - Mode collapse: DPO's KL constraint prevents collapse
    - Training instability: Iterative (not simultaneous) updates
    - Vanishing gradients: DPO loss has non-zero gradients by construction

The sequential (not simultaneous) update schedule is key to stability. In each iteration, the reviewer is fixed while the researcher trains, preventing the oscillation dynamics that plague GAN training.

CycleResearcher's approach to literature integration during paper generation:

Literature Processing Pipeline:

  Input: Topic T
    │
    ▼
  Topic Embedding
    │
    ▼
  Retrieve top-K related papers from Research-14k
  (K typically 5-10, using embedding similarity)
    │
    ▼
  Extract from each retrieved paper:
    - Title and authors
    - Key methodology
    - Main results
    - Limitations mentioned
    │
    ▼
  Synthesize literature context:
    - Identify common themes
    - Map methodological landscape
    - Find gaps and contradictions
    │
    ▼
  Inject into researcher prompt as context
    │
    ▼
  CycleResearcher generates paper with:
    - Related work section referencing retrieved papers
    - Methodology positioned against prior approaches
    - Experiments comparing to relevant baselines

This approach trades off recency (limited to training corpus) for consistency (no hallucinated references to non-existent papers—a common failure mode in LLM-generated research).

12 Programming Language

Component	Language	Framework/Library
Training pipeline	Python	PyTorch, Transformers (HuggingFace)
DPO implementation	Python	TRL (Transformer Reinforcement Learning)
Data preprocessing	Python	Custom scripts + HuggingFace Datasets
Model serving	Python	vLLM / HuggingFace Inference
Evaluation scripts	Python	Custom + SciPy for statistical tests
Dataset construction	Python	BeautifulSoup, arxiv API, OpenReview API
Configuration	YAML/JSON	PyYAML, Pydantic

The entire system is Python-native, leveraging the PyTorch and HuggingFace ecosystems. No multi-language complexity.

Code Organization (Inferred from Release)

CycleResearcher/
├── data/
│   ├── research_14k/           # Research paper dataset
│   │   ├── papers.jsonl        # Structured paper data
│   │   ├── metadata.json       # Topic/venue metadata
│   │   └── preprocess.py       # LaTeX → structured text
│   └── review_5k/              # Review dataset
│       ├── reviews.jsonl       # Structured review data
│       ├── scores.json         # Score distributions
│       └── preprocess.py       # OpenReview scraping + formatting
├── training/
│   ├── sft/
│   │   ├── train_researcher.py # SFT for CycleResearcher
│   │   ├── train_reviewer.py   # SFT for CycleReviewer
│   │   └── configs/            # Training hyperparameters
│   ├── dpo/
│   │   ├── generate_pairs.py   # Preference pair construction
│   │   ├── train_dpo.py        # DPO training loop
│   │   ├── iterative_loop.py   # Orchestrates multi-iteration cycle
│   │   └── configs/            # DPO hyperparameters
│   └── utils/
│       ├── data_loader.py      # Data loading utilities
│       └── metrics.py          # Training metrics
├── inference/
│   ├── generate_paper.py       # Paper generation pipeline
│   ├── generate_review.py      # Review generation pipeline
│   └── prompts/                # System prompts and templates
├── evaluation/
│   ├── reviewer_accuracy.py    # MAE, correlation against humans
│   ├── paper_quality.py        # Simulated review scoring
│   ├── human_eval.py           # Human evaluation protocol
│   └── analysis/               # Result analysis and plotting
├── configs/
│   ├── model_configs.yaml      # Model architecture configs
│   ├── training_configs.yaml   # Training hyperparameters
│   └── eval_configs.yaml       # Evaluation settings
└── scripts/
    ├── run_sft.sh              # SFT training launcher
    ├── run_dpo_cycle.sh        # Iterative DPO launcher
    └── evaluate.sh             # Full evaluation pipeline

Key Implementation Patterns

DPO Training Loop (Pseudocode):

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

def run_iterative_dpo(
    base_researcher_path: str,
    reviewer_path: str,
    topics: list[str],
    n_iterations: int = 3,
    n_samples_per_topic: int = 4,
    score_margin: float = 0.5,
    beta: float = 0.1,
) -> str:
    researcher = AutoModelForCausalLM.from_pretrained(base_researcher_path)
    reviewer = AutoModelForCausalLM.from_pretrained(reviewer_path)
    tokenizer = AutoTokenizer.from_pretrained(base_researcher_path)

    for iteration in range(n_iterations):
        # Phase 1: Generate papers
        all_papers = {}
        for topic in topics:
            papers = []
            for _ in range(n_samples_per_topic):
                paper = generate_paper(researcher, tokenizer, topic)
                score = score_paper(reviewer, tokenizer, paper)
                papers.append((paper, score))
            all_papers[topic] = papers

        # Phase 2: Construct preference pairs
        preference_data = []
        for topic, papers in all_papers.items():
            sorted_papers = sorted(papers, key=lambda x: x[1], reverse=True)
            for i in range(len(sorted_papers)):
                for j in range(i + 1, len(sorted_papers)):
                    if sorted_papers[i][1] - sorted_papers[j][1] > score_margin:
                        preference_data.append({
                            "prompt": topic,
                            "chosen": sorted_papers[i][0],
                            "rejected": sorted_papers[j][0],
                        })

        # Phase 3: DPO training
        ref_model = AutoModelForCausalLM.from_pretrained(
            base_researcher_path if iteration == 0 
            else f"checkpoint_iter_{iteration - 1}"
        )

        dpo_config = DPOConfig(
            beta=beta,
            learning_rate=5e-7,
            num_train_epochs=1,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=16,
        )

        trainer = DPOTrainer(
            model=researcher,
            ref_model=ref_model,
            args=dpo_config,
            train_dataset=preference_data,
            tokenizer=tokenizer,
        )
        trainer.train()

        checkpoint_path = f"checkpoint_iter_{iteration}"
        researcher.save_pretrained(checkpoint_path)

    return checkpoint_path

13 Memory Management

Training-Time Memory

Training CycleResearcher (especially the 72B variant) requires careful memory management:

Component	Memory Requirement	Strategy
Model weights (72B, fp16)	~144 GB	Model parallelism across 8 GPUs
Reference model (72B, fp16)	~144 GB	Offloaded to CPU or separate GPUs
Optimizer states (AdamW)	~288 GB (fp32 states)	ZeRO Stage 3 / DeepSpeed
Gradient accumulation	~18 GB per micro-batch	Gradient checkpointing
Preference data (per batch)	~2 GB (long sequences)	Dynamic batching
Total per iteration	~600 GB VRAM	8× A100 80GB with ZeRO-3

DeepSpeed ZeRO-3 configuration (typical):

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none"
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6
    },
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16
}

Inference-Time Memory

Configuration	VRAM Required	Throughput
72B model, fp16	~144 GB (2× A100 80GB)	~30 tokens/sec
72B model, 4-bit quantized (GPTQ)	~40 GB (1× A100 80GB)	~45 tokens/sec
7B model, fp16	~14 GB (1× A100/A6000)	~120 tokens/sec
7B model, 4-bit quantized	~4 GB (consumer GPU)	~80 tokens/sec

Paper Generation Memory Profile

Generating a complete research paper requires multiple inference passes:

Memory timeline during paper generation:

  Time ──────────────────────────────────────────────────▶

  Phase 1: Literature retrieval
  [context loading: ~4K tokens] [generation: ~2K tokens]
  Peak VRAM: model_size + ~50 MB KV cache

  Phase 2: Paper draft generation  
  [context: ~8K tokens] [generation: ~12K-20K tokens]
  Peak VRAM: model_size + ~500 MB KV cache (long generation)

  Phase 3: Self-review / revision (if enabled)
  [context: ~20K tokens (full paper)] [generation: ~5K tokens]
  Peak VRAM: model_size + ~800 MB KV cache

  KV Cache Growth:
  ┌───────────────────────────────────────────────┐
  │    ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱   │
  │   ╱  KV cache grows linearly with sequence  ╱  │
  │  ╱   length during autoregressive decoding  ╱   │
  │ ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱    │
  └───────────────────────────────────────────────┘

  For 72B model with 80 layers, 64 heads, 128 dim:
  KV per token = 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB
  At 32K context: ~83 GB KV cache (substantial!)

Dataset Memory

Dataset	Disk Size	In-Memory (Tokenized)
Research-14k (raw)	~2 GB	~8 GB (tokenized, padded)
Review-5k (raw)	~50 MB	~200 MB (tokenized)
Generated preference data (per iter)	~500 MB	~2 GB (tokenized pairs)

Cross-Iteration State

Unlike evolutionary systems that maintain population databases, CycleResearcher's inter-iteration state is minimal:

State persisted between iterations:
  ├── Model checkpoint (researcher): ~144 GB
  ├── Model checkpoint (reviewer): ~144 GB
  ├── Generated preference data: ~500 MB
  ├── Training metrics/logs: ~10 MB
  └── Total: ~289 GB per iteration

No persistent population, skill library, or knowledge base.
All "knowledge" is encoded in model weights.

14 Continued Learning

Within-Training Continued Improvement

CycleResearcher demonstrates monotonic improvement across iterative training cycles:

Paper quality across DPO iterations (72B model):

Score
 5.4 │                                    ●───── Iter 3 (5.36)
     │                              ●
 5.2 │                        ●
     │                  ●
 5.0 │            ●───── Iter 2 (~5.15)
     │      ●
 4.8 │●───── Iter 1 (~4.85)
     │
 4.6 │ SFT baseline (~4.65)
     │
 4.4 │
     └──────────────────────────────────────────────
       SFT    Iter1    Iter2    Iter3    Iter4

Reviewer MAE reduction:
 30% │                              ●───── Iter 3 (26.89%)
     │                        ●
 25% │                  ●
     │            ●
 20% │      ●───── Iter 1 (~18%)
     │●
 15% │ SFT baseline (~15%)
     │
     └──────────────────────────────────────────────
       SFT    Iter1    Iter2    Iter3    Iter4

Diminishing returns: The improvement per iteration decreases: - Iteration 1 → 2: ~0.30 point improvement - Iteration 2 → 3: ~0.21 point improvement - Iteration 3 → 4: ~0.10 point improvement (extrapolated)

This suggests logarithmic convergence, consistent with the DPO objective approaching its minimum as the policy approaches the preference-optimal distribution.

Post-Deployment Learning

CycleResearcher does not implement post-deployment continued learning. Once training is complete, the model weights are frozen. However, the framework supports:

Retraining with new data: New papers and reviews can be incorporated into Research-14k and Review-5k, followed by re-running the iterative training loop
Domain adaptation: The same cyclic training framework could be applied to new research domains (biology, physics, chemistry) by constructing domain-specific datasets
Reviewer fine-tuning: As new conference review data becomes available, CycleReviewer can be updated independently

Cross-Topic Transfer

The trained model exhibits cross-topic generalization:

Training Domain	Evaluation Domain	Score	Transfer Quality
ML (all subfields)	NLP (cs.CL)	5.40	Strong
ML (all subfields)	Computer Vision (cs.CV)	5.31	Strong
ML (all subfields)	AI Theory (cs.AI)	5.18	Moderate
ML (all subfields)	Robotics (cs.RO)	4.85	Weaker
ML (all subfields)	Out-of-distribution (biology)	4.20	Limited

Papers on topics close to the ML training distribution receive higher scores, while out-of-distribution topics show degradation—expected behavior for a supervised system.

Scaling Behavior and Future Learning Trajectories

The empirical results suggest several scaling axes for continued improvement:

Axis 1: Model scale

Parameter count vs. paper quality:

Quality
  5.4 │                              ● 72B + DPO
      │
  5.2 │
      │
  5.0 │                  ● 72B SFT
      │            ● 7B + DPO
  4.8 │
      │
  4.6 │      ● 7B SFT
      │
  4.4 │
      └──────────────────────────────────────
        7B         34B(est)      72B

Axis 2: Dataset scale

The Research-14k dataset is relatively small by modern standards. Scaling to 100K+ papers with richer metadata could enable: - Better literature coverage and reduced hallucination - More diverse experimental design patterns - Improved reference accuracy

Axis 3: Iteration count

While 3 iterations show clear improvement, the cost per iteration for 72B models limits exploration. With more efficient training (e.g., LoRA-based DPO), more iterations could potentially push quality further.

Axis 4: Reviewer quality ceiling

The reviewer's calibration fundamentally limits the researcher's improvement potential. If the reviewer cannot distinguish between good and excellent papers, the preference signal becomes noisy at high quality levels. Improving reviewer sensitivity—perhaps through larger models, specialized reviewer architectures, or ensemble reviewing—could raise the quality ceiling.

Relationship to Reinforcement Learning from Human Feedback (RLHF)

CycleResearcher can be viewed as a variant of RLHF where the "human" feedback is replaced by an automated reviewer:

Standard RLHF Pipeline:
  Base Model → SFT → Reward Model (from human prefs) → PPO

CycleResearcher Pipeline:
  Base Model → SFT → CycleReviewer (from review data) → DPO

  Key difference: The "reward model" (CycleReviewer) is also
  a full generative model that produces structured text feedback,
  not just scalar rewards. This enables:
    - Richer training signal (text + score vs. score alone)
    - Interpretable feedback (human-readable reviews)
    - Iterative co-improvement (reviewer also benefits)

This positions CycleResearcher as a specialization of the RLHF paradigm for structured text generation tasks where domain-specific evaluation criteria exist.

15 Applications

Direct Applications

Application	Description	Readiness
Automated paper drafting	Generate complete first drafts for given research topics	Demonstrated
Automated peer review	Produce structured multi-aspect reviews with calibrated scores	Demonstrated
Paper quality assessment	Rapid estimation of paper quality before submission	Demonstrated
Research ideation support	Generate paper outlines and methodology sketches	Supported
Review training	Use CycleReviewer output as examples for training junior reviewers	Potential
Conference triage	Automated pre-screening of submissions for desk rejection	Potential
Research education	Students interact with the system to learn paper writing	Potential

Broader Research Implications

1. Self-Improving Research Systems

CycleResearcher demonstrates that self-improvement through co-training is viable for complex text generation tasks. This opens the door to similar cyclic frameworks in other domains:

Domain	Researcher Analog	Reviewer Analog
Scientific research	Paper generator	Peer reviewer
Software engineering	Code generator	Code reviewer
Legal writing	Brief drafter	Legal editor
Creative writing	Story generator	Literary critic
Mathematics	Proof generator	Proof verifier

2. Open-Source Research Automation

By demonstrating competitive quality with open-source models, CycleResearcher reduces the barrier to entry for automated research from "access to GPT-4 API" to "access to GPU cluster." This democratizes research on research automation itself.

3. Iterative Preference Optimization as a General Pattern

The iterative DPO training loop is not specific to research automation. Any domain where: - A generator produces complex output - An evaluator can rank outputs by quality - The evaluation is cheaper than the generation

...can potentially benefit from this cyclic training pattern. The contribution is both the specific system and the general methodology.

4. Automated Reviewing as Independent Contribution

CycleReviewer's near-human performance on score prediction has standalone value: - Auxiliary review signal: Conferences could use CycleReviewer as an additional reviewer to reduce individual reviewer noise - Calibration tool: Reviewers could compare their assessments against the model's to check for biases - Fairness analysis: The model's scores can be analyzed for systematic biases that human reviewers exhibit

Limitations and Scope

Limitation	Impact	Potential Mitigation
No code execution	Cannot validate experimental claims	Integrate with sandbox execution (AI-Scientist approach)
No live retrieval	Literature limited to training corpus	Add RAG pipeline with live arXiv access
No figure generation	Papers lack visual elements	Integrate with code2fig or matplotlib generation
Training corpus bias	Biased toward ML subdomain	Expand to broader scientific domains
Reviewer circularity	Reviewer scores its own training improvements	External human evaluation needed for ground truth
Single-paper scope	No multi-paper research programs	Extend to research agenda planning
No experimental validation	Methods described but not tested	Connect to execution environments
Score ceiling	Generated papers plateau below acceptance threshold	Better reviewers, larger models, more data

Connections to OmniEvolve

CycleResearcher's architecture maps to several OmniEvolve design patterns, though with a fundamentally different optimization paradigm (gradient-based preference learning vs. evolutionary search):

CycleResearcher Component	OmniEvolve Equivalent
Iterative preference training loop	`omnievolve/search/` — iterative search with fitness evaluation
CycleReviewer scoring	`omnievolve/evaluation/` — cascade evaluator providing fitness signal
Preference pair construction	`omnievolve/search/` — selection mechanism (fitness-based ranking)
DPO weight updates	`omnievolve/mutation/` — mutation operators (but gradient-based, not prompt-based)
Research-14k corpus	`omnievolve/knowledge/` — program database / skill library
Paper generation pipeline	`omnievolve/orchestrator/` — experiment lifecycle management
Multi-aspect review	`omnievolve/evaluation/` — multi-stage evaluation cascade

Key architectural difference: OmniEvolve operates on discrete programs through evolutionary search (maintaining a population of candidates), while CycleResearcher operates on continuous weight spaces through gradient optimization (maintaining a single model that implicitly encodes the "population" in its distribution). The cyclic training loop in CycleResearcher is conceptually similar to co-evolutionary algorithms where two populations (researcher and reviewer) evolve together.

Relevance for OmniEvolve design: 1. CycleResearcher's preference pair construction strategy could inform OmniEvolve's selection mechanism design 2. The multi-aspect review format provides a template for structured fitness evaluation 3. The diminishing-returns curve across iterations informs expectations for evolutionary search convergence 4. The reviewer-as-evaluator pattern validates OmniEvolve's separation of generation and evaluation concerns

References

Weng, Y., Zhu, M., Bao, G., Zhang, H., Wang, J., Zhang, Y., & Yang, L. (2024). "CycleResearcher: Improving Automated Research via Automated Review." arXiv:2411.00816.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023.
Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., & Ha, D. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862.
Yang, A., Yang, B., Hui, B., et al. (2024). "Qwen2 Technical Report." arXiv:2407.10671.
Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2014). "Generative Adversarial Networks." NeurIPS 2014.
Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017.
Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023.
Bradley, R. A. & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika.

Classification: Autoresearch — CycleResearcher is an AI system that autonomously conducts scientific research (full paper generation) and peer review (structured multi-aspect evaluation) using a self-improving cyclic training framework with open-source LLMs. It is a foundational autoresearch system that demonstrates the viability of iterative preference optimization for research automation, with released code, data, and model weights enabling community advancement.