Introduced: 2025-05

Score: 7.81/10 (Draft)

Chapter 51

RD-Agent: Microsoft R&D Automation Platform

Part: Autonomous Research Systems

51.1 Overview & Motivation

Research and development workflows — particularly in quantitative finance, machine learning engineering, and data science — follow a recurring cycle: hypothesize, design an experiment, implement, evaluate, and iterate. Each cycle demands domain expertise, careful experimental control, and hours of manual effort for tasks such as feature engineering, model selection, and hyperparameter tuning. RD-Agent, an open-source platform developed at Microsoft Research, automates this cycle by casting R&D as a multi-agent loop driven by large language models (LLMs). The system is publicly available at microsoft/RD-Agent on GitHub (MIT license), with active development across quantitative finance, medical, and general data-science scenarios.

The core thesis of RD-Agent is that the hypothesis-experiment-feedback loop defining empirical research can be formalized as an iterative optimization process amenable to LLM-based automation. Rather than replacing the researcher entirely, the system targets the high-volume, low-creativity portions of the R&D cycle — generating candidate features, writing boilerplate model code, running evaluations, and synthesizing results — while preserving human oversight at strategic decision points.

RD-Agent occupies a distinctive niche in the landscape of LLM-powered autonomous research systems. Unlike general-purpose coding agents (such as OpenHands or SWE-Agent) that target software engineering tickets, or benchmark-specific solvers, RD-Agent is purpose-built for data-centric R&D where the primary objects of manipulation are datasets, features, models, and experimental configurations rather than arbitrary codebases. The system's design reflects Microsoft Research's investment in quantitative finance automation, where iterative factor discovery and model refinement are the core intellectual activities. Two key associated publications describe the system's design and algorithms: the RD-Agent framework paper (Chen et al., 2024) and the CoSTEER paper (Wang, Chen et al., 2024) on collaborative evolving strategy for data-centric development.

Key Contribution

RD-Agent formalizes R&D automation as a multi-agent framework with explicit stages for hypothesis generation, experiment design, evolving code implementation (via the CoSTEER method), and knowledge feedback. Its principal contributions are: (1) demonstrating that LLM-based agents can autonomously propose, implement, and evaluate data-science experiments, particularly quantitative factor mining and model building, with results on MLE-bench that the authors report as competitive; (2) introducing CoSTEER (Collaborative Strategized Task Evolution Enhancement with Refinement), a multi-round evolving code generation method that the CoSTEER paper reports achieves a higher implementation success rate than single-shot generation; and (3) providing an extensible open-source architecture with well-defined core abstractions (Hypothesis, Experiment, Trace, Scenario) that support diverse R&D scenarios.

Evidence grounding note. This chapter is grounded in: (a) the public GitHub repository microsoft/RD-Agent, (b) the associated publications (the RD-Agent framework paper and CoSTEER paper), and (c) the project's README and documentation. Throughout the chapter, claims are labeled by source: repo-verified (confirmed from repository code structure), paper-reported (stated in publications), or author interpretation (analytical framing by the chapter author). Code excerpts are adapted from the repository's verified API surface with module paths cited; where exact reproduction was not possible, this is noted.

51.2 Architecture

51.2.1 Repository Structure

RD-Agent is organized as a Python package (rdagent/) with a clear separation between core abstractions, reusable components, and domain-specific scenario implementations. The following directory layout reflects the verified structure of the microsoft/RD-Agent repository (repo-verified):

Table 51.1: Verified repository package structure (repo-verified)
Package Path	Purpose	Key Modules / Classes
rdagent/core/	Core abstractions shared across all scenarios	`proposal.py` (Hypothesis, HypothesisGen, Trace, HypothesisExperiment2Feedback), `experiment.py` (Task, Experiment, FBWorkspace), `scenario.py` (Scenario), `developer.py` (Developer), `evolving_framework.py` (EvolvingStrategy)
rdagent/components/	Reusable components: code generators, proposal generators, evaluators	`coder/CoSTEER/` (evolving code generation), `proposal/` (hypothesis generation components)
rdagent/scenarios/	Domain-specific scenario implementations	`qlib/` (quantitative finance), `kaggle/` (Kaggle competitions), `data_mining/` (general)
rdagent/oai/	LLM interface layer	`llm_utils.py` (APIBackend class: completion, caching, retry)
rdagent/app/	Application entry points for different R&D scenarios	`qlib_rd_loop/` (Qlib factor/model loops), `kaggle/` (Kaggle loop), `benchmark/`
rdagent/log/	Structured logging and experiment tracking	Log viewer, experiment trace storage

The architecture follows what the authors describe as the "R&D loop" pattern: a cycle of Research (hypothesis generation and knowledge retrieval), Development (experiment implementation via CoSTEER), and Feedback (evaluation and trace updating). This is implemented through a class hierarchy rooted in the abstractions defined in rdagent/core/.

51.2.2 Core Abstractions

The system is built around a set of core abstractions defined in rdagent/core/. These define the vocabulary shared across all scenarios (repo-verified):

Hypothesis (rdagent/core/proposal.py) — A structured conjecture about what experiment to try next. Carries attributes including hypothesis (the statement), reason (detailed reasoning), concise_reason, concise_observation, concise_justification, and concise_knowledge. The structured fields enable the LLM to produce both full reasoning and compact summaries for context management.
Trace (rdagent/core/proposal.py) — The accumulated history of all R&D iterations. Stores a list of (Hypothesis, Experiment, HypothesisFeedback) triples in its hist attribute. The trace also holds a reference to the Scenario, providing domain context. This is the system's "memory" across iterations.
Experiment (rdagent/core/experiment.py) — A concrete experimental plan comprising one or more Task objects (sub-tasks) and associated FBWorkspace objects (code execution environments). Each experiment tracks its sub_tasks, sub_workspace_list, and result.
FBWorkspace (rdagent/core/experiment.py) — A feedback workspace that manages code files (code_dict: dict[str, str]) within a temporary directory (workspace_path) and provides an execute() method for running code in Docker-based sandboxes.
Scenario (rdagent/core/scenario.py) — The domain context: what kind of R&D is being performed, what data is available, what constitutes success. Subclassed per domain (Qlib, Kaggle, etc.). Provides methods like get_scenario_all_desc() and background for LLM context injection.
HypothesisGen (rdagent/core/proposal.py) — Abstract base class for hypothesis generation strategies. Defines gen(trace: Trace) -> Hypothesis.
HypothesisExperiment2Feedback (rdagent/core/proposal.py) — Abstract base class for converting experimental results into structured feedback (HypothesisFeedback) including observations, evaluation, and an accept/reject decision.
Developer (rdagent/core/developer.py) — Abstract base class for code-generation agents. Defines develop(exp: Experiment) -> Experiment. The principal implementation is CoSTEER.

The following code excerpt shows the verified core abstractions from the repository. These interfaces define the contract that all scenarios must implement:

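# Core abstractions from rdagent/core/proposal.py and rdagent/core/experiment.py
# Adapted from the repository's verified API surface (repo-verified class names and interfaces)
# The two imports below are added so the excerpt parses as a standalone module:
# postponed annotations let Scenario and Experiment be named before (or without) a definition.
from __future__ import annotations

from pathlib import Path

# --- rdagent/core/proposal.py ---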

class Hypothesis:
    """A structured research hypothesis with reasoning fields."""
    def __init__(
        self,
        hypothesis: str,          # The hypothesis statement
        reason: str,              # Detailed reasoning
        concise_reason: str,      # Compact reasoning for LLM context
        concise_observation: str, # Key observation motivating this hypothesis
        concise_justification: str,  # Why this is worth testing
        concise_knowledge: str,   # Relevant domain knowledge
    ):
        self.hypothesis = hypothesis
        self.reason = reason
        self.concise_reason = concise_reason
        self.concise_observation = concise_observation
        self.concise_justification = concise_justification
        self.concise_knowledge = concise_knowledge


class HypothesisFeedback:
    """Structured feedback after evaluating an experiment against a hypothesis."""
    observations: str            # What was observed in the experiment
    hypothesis_evaluation: str   # LLM analysis of whether hypothesis held
    new_hypothesis: str          # Suggested next direction
    reason: str                  # Reasoning for the evaluation
    decision: bool               # Accept (True) or reject (False)


class Trace:
    """Accumulated history of R&D iterations for a given scenario."""
    def __init__(self, scen: Scenario):
        self.scen = scen
        self.hist: list[tuple[Hypothesis, Experiment, HypothesisFeedback]] = []


class HypothesisGen:
    """Abstract base class for hypothesis generation."""
    def __init__(self, scen: Scenario):
        self.scen = scen

    def gen(self, trace: Trace) -> Hypothesis:
        raise NotImplementedError


class HypothesisExperiment2Feedback:
    """Abstract base class: experiment results → structured feedback."""
    def gen(
        self, exp: Experiment, hypothesis: Hypothesis, trace: Trace
    ) -> HypothesisFeedback:
        raise NotImplementedError


# --- rdagent/core/experiment.py ---

class Task:
    """A single sub-task within an experiment (e.g., one factor to implement)."""
    name: str
    description: str


class FBWorkspace:
    """Workspace for code execution with Docker sandboxing."""
    workspace_path: Path
    code_dict: dict[str, str]  # filename → code content

    def inject_code(self, **files: str) -> None:
        """Write code files into the workspace directory."""
        ...

    def execute(self) -> str:
        """Run the code in a Docker container; return stdout/stderr."""
        ...


class Experiment:
    """A concrete experiment comprising sub-tasks and their workspaces."""
    sub_tasks: list[Task]
    sub_workspace_list: list[FBWorkspace]
    result: object  # Evaluation result (scenario-specific)

51.2.3 Multi-Agent Structure

RD-Agent employs a multi-agent architecture where different LLM-powered components handle different phases of the R&D loop. Rather than a monolithic prompt chain, the system separates concerns across specialized modules, each with its own prompting strategy and context window (paper-reported + repo-verified):

Hypothesis Generation (implemented via HypothesisGen subclasses in rdagent/components/proposal/) — Generates hypotheses based on the scenario description, prior experimental results in the Trace, and domain knowledge. Scenario-specific subclasses (e.g., in rdagent/scenarios/qlib/) provide domain-aware prompting.
Experiment Design (scenario-specific experiment generators) — Converts a hypothesis into a structured Experiment with concrete Task objects. For instance, in the Qlib factor scenario, this creates a task for each candidate factor to be implemented.
Code Generation & Refinement (implemented via CoSTEER in rdagent/components/coder/CoSTEER/) — Takes an experiment and produces executable code through multiple rounds of generation, execution, evaluation, and refinement. This is the Developer component and is described in detail in Section 51.3.3.
Evaluation & Feedback (implemented via HypothesisExperiment2Feedback subclasses) — Runs experiments in sandboxed Docker environments, collects metrics, and formats results into structured HypothesisFeedback that is appended to the Trace.

All LLM interactions go through the APIBackend class in rdagent/oai/llm_utils.py (repo-verified), which provides a unified interface to OpenAI and Azure OpenAI APIs with automatic caching (MD5-keyed file cache to reduce repeated calls), retry logic with exponential backoff, and configurable model selection via environment variables. This allows different agents to use different models based on capability and cost requirements.
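The caching and retry behavior described above can be illustrated with a small sketch. This is not the APIBackend implementation: the class name `CachedClient`, the `call_model` callable, and the one-JSON-file-per-key layout are assumptions made for the example. Only the two ideas come from the text, namely an MD5 key over the request and retry with exponential backoff.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


class CachedClient:
    """Hypothetical wrapper: MD5-keyed file cache plus exponential-backoff retry."""

    def __init__(self, call_model: Callable[[str], str], cache_dir: Path,
                 max_retries: int = 3, base_delay: float = 0.01):
        if max_retries < 1:
            raise ValueError("max_retries must be at least 1")
        self.call_model = call_model      # stand-in for a real LLM API call
        self.cache_dir = cache_dir
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, model: str, prompt: str) -> str:
        # Hash the whole request so identical requests map to one cache entry.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        path = self.cache_dir / f"{self._key(model, prompt)}.json"
        if path.exists():                 # cache hit: skip the API call
            return json.loads(path.read_text())["response"]
        for attempt in range(self.max_retries):
            try:
                response = self.call_model(prompt)
                break
            except RuntimeError:
                if attempt == self.max_retries - 1:
                    raise                 # retries exhausted
                time.sleep(self.base_delay * 2 ** attempt)  # delay doubles each time
        path.write_text(json.dumps({"response": response}))
        return response


calls = {"n": 0}


def flaky_model(prompt: str) -> str:
    """Fails on the first call, then succeeds (simulates a transient error)."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return f"echo: {prompt}"


client = CachedClient(flaky_model, Path(tempfile.mkdtemp()))
first = client.complete("demo-model", "hello")   # one failure, then one retry
second = client.complete("demo-model", "hello")  # served from the file cache
print(first, second, calls["n"])
```

In this sketch the second request never reaches `flaky_model`, which is the property that makes repeated R&D loop runs cheaper when prompts recur.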

51.3 Core Algorithms

51.3.1 The R&D Loop

The fundamental algorithm of RD-Agent is an iterative loop that chains hypothesis generation, experiment construction, code development, execution, and feedback. The main loop, as implemented in the application entry points under rdagent/app/, follows this pattern (repo-verified structure):

# Main R&D loop pattern from rdagent/app/ entry points (repo-verified structure)
# Actual entry points: rdagent/app/qlib_rd_loop/ (factor, model) and rdagent/app/kaggle/

# 1. Initialize scenario-specific components
scenario = QlibFactorScenario(...)        # Domain context (scenarios/qlib/)
hypothesis_gen = QlibFactorHypothesisGen(scenario)  # Research agent
experiment_gen = QlibFactorExperimentGen(scenario)   # Experiment designer
developer = CoSTEER(costeer_settings, evolving_strategy)  # Code generation (CoSTEER)
feedback_gen = QlibFactorExperiment2Feedback(scenario)  # Evaluation agent
trace = Trace(scen=scenario)              # Knowledge accumulation

# 2. Run the R&D loop
for loop_idx in range(max_loop_iterations):
    # Research: generate hypothesis from accumulated knowledge
    hypothesis = hypothesis_gen.gen(trace)

    # Design: convert hypothesis to concrete experiment with sub-tasks
    experiment = experiment_gen.gen(hypothesis)

    # Develop: CoSTEER generates and evolves code (multi-round)
    experiment = developer.develop(experiment)

    # Evaluate: run in sandbox, compute metrics, generate feedback
    feedback = feedback_gen.gen(experiment, hypothesis, trace)

    # Learn: append to trace for future hypothesis generation
    trace.hist.append((hypothesis, experiment, feedback))

The critical design decision is that knowledge accumulates across iterations through the Trace object, so the system can learn from both successes and failures. Because the hypothesis generator receives the trace rather than only the most recent result, it can see which ideas were already tried and rejected, which reduces (though does not rule out) cyclical re-exploration of the same ideas.
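To make the role of the trace concrete, the following toy loop replaces every LLM-backed component with a deterministic stand-in. All names here (`ToyHypothesisGen`, `TOY_SCORES`, `make_feedback`) and all scores are invented for illustration and do not appear in the repository; the sketch only shows how conditioning on the history keeps the generator from proposing the same idea twice.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    hypothesis: str


@dataclass
class Feedback:
    observations: str
    decision: bool


@dataclass
class Trace:
    # Each entry is a (hypothesis, score, feedback) triple.
    hist: list = field(default_factory=list)


class ToyHypothesisGen:
    """Stand-in for an LLM: proposes the first idea not yet in the trace."""

    CANDIDATE_WINDOWS = [5, 10, 20, 60]

    def gen(self, trace: Trace) -> Hypothesis:
        tried = {h.hypothesis for h, _, _ in trace.hist}
        for window in self.CANDIDATE_WINDOWS:
            idea = f"{window}-day momentum"
            if idea not in tried:
                return Hypothesis(idea)
        raise RuntimeError("no untried ideas left")


# Hypothetical evaluation scores; a real run would come from a backtest.
TOY_SCORES = {
    "5-day momentum": 0.01,
    "10-day momentum": 0.03,
    "20-day momentum": 0.02,
    "60-day momentum": -0.01,
}


def make_feedback(score: float, trace: Trace) -> Feedback:
    """Accept only if the score beats every previously accepted score."""
    best = max((s for _, s, fb in trace.hist if fb.decision),
               default=float("-inf"))
    return Feedback(observations=f"score={score:.2f}", decision=score > best)


trace = Trace()
generator = ToyHypothesisGen()
for _ in range(4):
    hyp = generator.gen(trace)             # research, conditioned on the trace
    score = TOY_SCORES[hyp.hypothesis]     # develop + evaluate (stubbed)
    fb = make_feedback(score, trace)       # feedback
    trace.hist.append((hyp, score, fb))    # learn

for hyp, score, fb in trace.hist:
    print(hyp.hypothesis, score, "accepted" if fb.decision else "rejected")
```

Removing the `tried` check would make the generator return "5-day momentum" on every iteration, which is the cyclical behavior the trace is meant to discourage.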

Conceptual Formalization (author interpretation)

The R&D loop can be abstracted mathematically. Let $\mathcal{H}$ denote the hypothesis space, $\mathcal{E}$ the experiment space, and $f: \mathcal{E} \rightarrow \mathbb{R}$ the evaluation function. At iteration $t$:

$$h_t = \texttt{HypothesisGen.gen}(\mathcal{T}_{t-1}) \qquad e_t = \texttt{Developer.develop}(\texttt{ExpGen.gen}(h_t))$$

$$r_t = f(e_t) \qquad \mathcal{T}_t = \mathcal{T}_{t-1} \cup \{(h_t, e_t, r_t)\}$$

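where $\mathcal{T}_{t-1} = \{(h_i, e_i, r_i)\}_{i=1}^{t-1}$ corresponds to the Trace.hist list. The notation maps onto the repository classes as follows: $\texttt{HypothesisGen.gen}$ is HypothesisGen.gen(), $\texttt{ExpGen.gen}$ stands for the scenario-specific experiment generator, $\texttt{Developer.develop}$ is Developer.develop(), and $\mathcal{T}$ is Trace.hist. One simplification should be noted: the formalization records a scalar score $r_t$, whereas Trace.hist stores a HypothesisFeedback object, of which the measured result is only one ingredient. This is not a formalization published by the RD-Agent authors, but an analytical abstraction that captures the loop's iterative accumulation structure.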

51.3.2 Hypothesis Generation

Hypothesis generation is the "research" component. The HypothesisGen base class (rdagent/core/proposal.py) defines the interface, and scenario-specific subclasses in rdagent/components/proposal/ and rdagent/scenarios/ implement the prompting logic. The generation process, as described in the publications and consistent with the repository structure (paper-reported + repo-verified):

Context assembly — The scenario description (scenario.get_scenario_all_desc()) and a summary of prior experiments from trace.hist are compiled into the LLM prompt. The structured Hypothesis fields (concise_observation, concise_knowledge, etc.) from prior iterations provide compact but informative history.
LLM generation — The APIBackend class (rdagent/oai/llm_utils.py) sends the assembled prompt to the configured LLM (typically GPT-4 or GPT-4o via Azure OpenAI). The build_messages_and_create_chat_completion() method handles message formatting, caching, and retry.
Structured parsing — The LLM response is parsed into a Hypothesis object with all required fields. The structured format ensures that each hypothesis includes not just the statement but also the reasoning, the motivating observation, the justification, and the relevant knowledge.
Novelty filtering — The trace is consulted to avoid proposing hypotheses that duplicate prior experiments.

The hypothesis generation prompts are scenario-specific. For the Qlib factor mining scenario, the prompt includes descriptions of the available market data fields (OHLCV — Open, High, Low, Close, Volume), the evaluation metric (Information Coefficient), and examples of previously discovered factors with their results. For Kaggle scenarios, the prompt includes the competition description, dataset schema, and prior submission scores.
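The context-assembly, parsing, and novelty steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's prompting code: `build_prompt`, `parse_hypothesis`, and `is_novel` are hypothetical helper names, the JSON reply format is assumed, and the LLM call is replaced by a canned string. Only the six Hypothesis field names are taken from the chapter.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    hypothesis: str
    reason: str
    concise_reason: str
    concise_observation: str
    concise_justification: str
    concise_knowledge: str


def build_prompt(scenario_desc: str, history: list) -> str:
    """Context assembly: scenario text plus one line per prior hypothesis."""
    lines = [f"Scenario: {scenario_desc}", "Prior hypotheses:"]
    for text, accepted in history:
        lines.append(f"- {text} ({'accepted' if accepted else 'rejected'})")
    wanted = ", ".join(f.name for f in fields(Hypothesis))
    lines.append(f"Reply with a JSON object containing the keys: {wanted}.")
    return "\n".join(lines)


def parse_hypothesis(raw: str) -> Hypothesis:
    """Structured parsing: reject replies that omit any required field."""
    data = json.loads(raw)
    missing = [f.name for f in fields(Hypothesis) if f.name not in data]
    if missing:
        raise ValueError(f"LLM response is missing fields: {missing}")
    return Hypothesis(**{f.name: str(data[f.name]) for f in fields(Hypothesis)})


def is_novel(candidate: Hypothesis, history: list) -> bool:
    """Novelty check by normalized exact match (a deliberately crude proxy)."""
    seen = {text.strip().lower() for text, _ in history}
    return candidate.hypothesis.strip().lower() not in seen


history = [("5-day reversal factor", False)]
prompt = build_prompt("Qlib factor mining on OHLCV data, metric: IC", history)

# Canned reply standing in for the model's output.
canned_reply = json.dumps({
    "hypothesis": "20-day volatility factor",
    "reason": "Volatility regimes may relate to subsequent returns.",
    "concise_reason": "volatility regimes",
    "concise_observation": "reversal factor showed no signal",
    "concise_justification": "untested dimension of the data",
    "concise_knowledge": "volatility is computed from close-to-close returns",
})
parsed = parse_hypothesis(canned_reply)
print(prompt)
print(parsed.hypothesis, is_novel(parsed, history))
```

Exact string matching misses paraphrased duplicates, so a production novelty check would need something stronger; how the repository handles this is not specified in the material surveyed here.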

51.3.3 CoSTEER: Evolving Code Generation

A distinctive feature of RD-Agent is CoSTEER (Collaborative Strategized Task Evolution Enhancement with Refinement), the evolving code generation framework implemented in rdagent/components/coder/CoSTEER/ (repo-verified name and location; paper-reported algorithm). Rather than generating code in a single LLM pass, CoSTEER uses an iterative evolution strategy:

Initial Generation — The LLM generates a first draft of experimental code based on the hypothesis, scenario templates, and task specification.
Execution & Evaluation — The code is written into an FBWorkspace and executed in a Docker sandbox. Evaluators check both execution success (no crashes/errors) and output validity (correct format, reasonable values).
Evolving Strategy — Based on execution feedback, the EvolvingStrategy (rdagent/core/evolving_framework.py) selects which implementations to refine, which to discard, and how to recombine successful elements. This is the "evolutionary" aspect: multiple candidate implementations may be maintained and evolved across rounds.
Iterative Refinement — Error traces, evaluation feedback, and successful patterns are fed back to the LLM for refinement. This cycle repeats for a configurable number of rounds (controlled by max_loop in CoSTEER settings).

The CoSTEER approach extends the basic "generate, test, fix" pattern by treating code evolution as a collaborative strategy across multiple sub-tasks within an experiment. When multiple factors or model components are being developed simultaneously, insights from successful implementations of one sub-task can inform the refinement of others (paper-reported).

# CoSTEER developer pattern from rdagent/components/coder/CoSTEER/
# Adapted from the verified repository structure; class names and method
# signatures reflect the actual API (repo-verified names, simplified logic)

class CoSTEER(Developer):
    """Collaborative Strategized Task Evolution Enhancement with Refinement.

    Implements multi-round evolving code generation for experiment sub-tasks.
    Located in rdagent/components/coder/CoSTEER/.
    """

    def __init__(self, settings: CoSTEERSettings, evolving_strategy: EvolvingStrategy):
        self.settings = settings           # max_loop, evaluation config
        self.evolving_strategy = evolving_strategy  # from core/evolving_framework.py

    def develop(self, exp: Experiment) -> Experiment:
        """Evolve implementations for all sub-tasks across multiple rounds."""
        feedback_str = ""  # No feedback exists before the first round
        for evolving_round in range(self.settings.max_loop):
            # 1. For each sub-task, generate or refine code
            for task, workspace in zip(exp.sub_tasks, exp.sub_workspace_list):
                if evolving_round == 0:
                    code = self._initial_generate(task, exp)
                else:
                    code = self._refine(task, workspace, feedback_str)
                workspace.inject_code(**code)

            # 2. Execute all sub-tasks in Docker sandbox
            for workspace in exp.sub_workspace_list:
                exec_result = workspace.execute()

            # 3. Evaluate execution results
            evaluations = self._evaluate(exp)

            # 4. Apply evolving strategy: select, recombine, prepare next round
            feedback_str = self.evolving_strategy.evolve(
                exp, evaluations, evolving_round
            )

        return exp

Conceptual Interpretation: Iterative Success Probability (author interpretation)

The evolving code generation process can be modeled as iterated trials. If each refinement round has an independent probability $p$ of producing correct code, the success probability after $k$ rounds follows the standard geometric-trial formula:

$$P(\text{success within } k \text{ rounds}) = 1 - (1 - p)^k$$

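This is an idealization. In practice, rounds are not independent: later rounds see the error feedback from earlier ones, so the per-round success probability is better modeled as a nondecreasing sequence $p_1 \le p_2 \le \dots \le p_k$, where $p_t$ is the probability that round $t$ succeeds given that earlier rounds failed. The success probability then becomes $1 - \prod_{t=1}^{k}(1 - p_t)$, which reduces to the formula above when every $p_t = p$. The CoSTEER approach further improves on this by sharing information across sub-tasks and applying an evolving selection strategy. This formula is a standard analytical framing, not one published by the RD-Agent authors; it is included to give intuition for why multi-round approaches tend to outperform single-shot generation.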
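A short numerical check makes the effect of additional rounds tangible. The probabilities below are illustrative values chosen for the example, not measurements from RD-Agent, and the function names are hypothetical.

```python
def success_within(k: int, p: float) -> float:
    """P(success within k rounds) for a constant per-round probability p."""
    return 1.0 - (1.0 - p) ** k


def success_with_feedback(round_probs: list) -> float:
    """Same quantity when the per-round probability changes across rounds."""
    all_fail = 1.0
    for p in round_probs:
        all_fail *= 1.0 - p   # probability that every round so far has failed
    return 1.0 - all_fail


single_shot = success_within(1, 0.3)                       # 0.3
three_rounds = success_within(3, 0.3)                      # 1 - 0.7**3 = 0.657
with_feedback = success_with_feedback([0.3, 0.4, 0.5])     # 1 - 0.7*0.6*0.5 = 0.79

print(round(single_shot, 3), round(three_rounds, 3), round(with_feedback, 3))
```

Under these assumed numbers, three independent rounds roughly double the single-shot success rate, and letting feedback raise the per-round probability adds a further gain on top of that.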

51.3.4 Factor and Model Co-Evolution

In the quantitative finance scenario — RD-Agent's most developed domain — the system supports two interleaved R&D loops: factor mining (discovering alpha signals from market data) and model building (constructing ML models that consume discovered factors). These are implemented as separate application entry points under rdagent/app/qlib_rd_loop/ (repo-verified), with the factor loop and model loop operating on different aspects of the Qlib pipeline.

The factor evaluation uses Qlib's backtesting infrastructure to compute standard quantitative finance metrics: the Information Coefficient (IC, the cross-sectional correlation between predicted and realized forward returns; its rank-based variant is usually called Rank IC), the IC Information Ratio (ICIR, the mean of the IC series divided by its standard deviation), annualized return, and Sharpe ratio. The model loop uses the best current factor set as input features, creating a natural dependency between the two workflows.
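The two IC-based metrics can be computed by hand on a toy example. The sketch below uses the rank-based form of IC and a sample standard deviation for ICIR; both choices, the helper names, and the data are assumptions for illustration, and Qlib's own evaluation code is not reproduced here.

```python
from statistics import mean, stdev


def pearson(x: list, y: list) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values: list) -> list:
    """Rank from 0 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def rank_ic(factor: list, forward_returns: list) -> float:
    """Correlation of ranks across the stocks observed on one day."""
    return pearson(ranks(factor), ranks(forward_returns))


def icir(daily_ics: list) -> float:
    """Mean of the daily IC series divided by its sample standard deviation."""
    return mean(daily_ics) / stdev(daily_ics)


# Toy data: three days, four stocks per day (factor values, next-period returns).
days = [
    ([0.1, 0.4, 0.2, 0.3], [0.00, 0.03, 0.01, 0.02]),  # identical ordering
    ([0.3, 0.1, 0.4, 0.2], [0.01, 0.02, 0.03, 0.00]),  # partial agreement
    ([0.2, 0.3, 0.1, 0.4], [0.02, 0.01, 0.03, 0.00]),  # reversed ordering
]
daily_ics = [rank_ic(f, r) for f, r in days]
print([round(ic, 3) for ic in daily_ics], round(icir(daily_ics), 3))
```

The three days give rank ICs of 1.0, 0.4, and -1.0: the mean is positive but small relative to the spread, so the ICIR is low. This is why ICIR is reported alongside IC, since it penalizes signals that are strong on some days and inverted on others.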

Conceptual Interpretation: Bilevel Optimization (author interpretation)

The co-evolution can be framed loosely as a bilevel optimization. Let $\mathcal{F}$ be the factor space, $\mathcal{M}$ the model space, and $\mathcal{C}_t \subseteq \mathcal{F}$ the set of candidate factors proposed at iteration $t$:

$$f^*_t = \arg\max_{f \in \mathcal{C}_t} \; \text{IC}(f, D_{\text{val}}) \quad \text{(factor discovery loop)}$$

$$m^*_t = \arg\max_{m \in \mathcal{M}} \; \text{Perf}(m, F_t, D_{\text{val}}) \quad \text{(model building loop)}$$

where $F_t = \{f^*_1, \ldots, f^*_t\}$ is the accumulated factor set. Restricting the first maximization to $\mathcal{C}_t$ is what makes $f^*_t$ depend on $t$; maximizing over all of $\mathcal{F}$ would return the same factor at every iteration. This is an analytical abstraction: the repository implements these as separate but interleaved R&D loops rather than an explicit bilevel solver. The integration with Qlib provides the data pipeline and backtesting infrastructure that makes this co-evolution practical.

51.3.5 Knowledge Accumulation and Feedback

A critical mechanism distinguishing RD-Agent from simpler LLM agent loops is its explicit knowledge accumulation through the Trace class (repo-verified). After each experiment, the HypothesisExperiment2Feedback component extracts structured lessons that are stored as HypothesisFeedback objects. Each feedback entry contains:

observations — What was observed in the experiment (e.g., "factor produced IC of 0.03 but high turnover")
hypothesis_evaluation — LLM-generated analysis of whether the hypothesis was supported
new_hypothesis — A suggested next direction based on the results
reason — Reasoning for the evaluation
decision — Boolean: accept (the hypothesis produced useful results) or reject

The trace grows with each iteration. When the hypothesis generator constructs its prompt, it includes summaries of recent trace entries, using the concise_* fields on Hypothesis and the structured fields on HypothesisFeedback to fit relevant history within the LLM's context window. This accumulated trace serves a function analogous to the "learning log" in FunSearch (Chapter 8) or the "skills library" in Voyager-style agents, but structured around the hypothesis-experiment ontology rather than general-purpose code snippets.
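One simple way to fit recent history into a bounded prompt is to walk the trace from newest to oldest and stop when a budget is exhausted. The sketch below assumes a character budget as a rough proxy for tokens and represents each trace entry as a plain dictionary; the function name `summarize_trace` and the line format are hypothetical, and the repository's actual summarization logic may differ.

```python
def summarize_trace(entries: list, max_chars: int) -> str:
    """Keep the most recent entries whose one-line summaries fit the budget."""
    kept = []
    used = 0
    for entry in reversed(entries):                 # newest first
        verdict = "accepted" if entry["decision"] else "rejected"
        line = f"[{verdict}] {entry['concise_reason']} -> {entry['observations']}"
        if used + len(line) + 1 > max_chars:        # +1 accounts for the newline
            break
        kept.append(line)
        used += len(line) + 1
    return "\n".join(reversed(kept))                # restore chronological order


entries = [
    {"concise_reason": "5-day reversal", "observations": "IC 0.00", "decision": False},
    {"concise_reason": "20-day momentum", "observations": "IC 0.03", "decision": True},
    {"concise_reason": "volume spike", "observations": "IC 0.01", "decision": False},
]

summary = summarize_trace(entries, max_chars=80)
print(summary)
```

With a budget of 80 characters only the two most recent entries survive, and the oldest one is dropped. A real system would likely also want to retain the best accepted result regardless of age, which this sketch does not attempt.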

51.4 Scenarios and Applications

51.4.1 Quantitative Finance

The quantitative finance scenario is the flagship application. Under rdagent/scenarios/qlib/ (repo-verified), the system integrates with Microsoft's Qlib open-source quantitative investment platform to automate two primary workflows:

Factor Mining (rdagent/app/qlib_rd_loop/) — Automated discovery of alpha factors. The system generates Python functions that compute features from OHLCV data, evaluates their Information Coefficient against forward returns using Qlib's backtesting, and iteratively refines the factor set based on performance and diversity.
Model Building — Automated construction and tuning of machine learning models (e.g., LightGBM, neural networks) that combine discovered factors into trading signals. This uses the same R&D loop but targets model architecture and hyperparameters rather than feature functions.

The Qlib integration provides standardized market data access, backtesting protocols, and performance metrics, enabling fair comparison across iterations and experiments.
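A factor of the kind described under Factor Mining is, at its core, a function from price history to one value per time step. The sketch below computes a close-to-moving-average ratio in plain Python so that it runs without dependencies. The function name and the list-based interface are assumptions for illustration; the input and output format that RD-Agent's generated factor code actually uses is not reproduced here.

```python
from typing import List, Optional


def close_to_ma_ratio(close: List[float], window: int = 20) -> List[Optional[float]]:
    """Ratio of each close to the trailing moving average of the last `window` closes.

    Positions with fewer than `window` observations yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out: List[Optional[float]] = []
    for i in range(len(close)):
        if i + 1 < window:
            out.append(None)                      # not enough history yet
        else:
            trailing = close[i + 1 - window: i + 1]
            moving_avg = sum(trailing) / window
            out.append(close[i] / moving_avg)
    return out


# Toy price series with a 3-day window for readability.
prices = [10.0, 11.0, 12.0, 13.0]
factor = close_to_ma_ratio(prices, window=3)
print(factor)
```

Values above 1 mean the latest close sits above its recent average. Whether that predicts continuation or reversal is exactly the kind of question the factor loop settles empirically through the IC evaluation.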

51.4.2 Kaggle and MLE-bench

RD-Agent includes a Kaggle scenario (rdagent/scenarios/kaggle/, entry point under rdagent/app/kaggle/) (repo-verified) that automates participation in machine-learning competitions. The system reads the competition description, explores the dataset, generates hypotheses about potentially effective approaches (feature engineering, model architecture, ensemble strategies), implements them as submission pipelines via CoSTEER, and evaluates against the competition metric.

The system was evaluated on MLE-bench, a standardized benchmark for machine learning engineering agents developed by OpenAI (Chan et al., 2024). MLE-bench comprises 75 tasks drawn from real Kaggle competitions, measuring what fraction of tasks an agent can solve at medal-level quality (defined relative to historical human competition results). RD-Agent's performance on MLE-bench is discussed in Section 51.5.

51.4.3 Medical Research and General Data Mining

The repository includes additional scenario packages (rdagent/scenarios/data_mining/) for general data-science tasks (repo-verified package name). The medical scenario extends the R&D loop to healthcare data analysis. These scenarios are less mature than the quantitative finance use case but demonstrate the framework's generality — the core Hypothesis/Experiment/Trace abstractions are domain-agnostic, with scenario-specific behavior confined to subclasses.

51.5 Key Results

51.5.1 Quantitative Finance Results

The RD-Agent authors report results on quantitative factor mining using the Qlib platform with Chinese A-share market data. From the CoSTEER paper (Wang, Chen et al., 2024) and the RD-Agent framework paper (paper-reported):

The automated factor discovery loop generates factors with positive Information Coefficients, indicating genuine predictive signal beyond noise.
The CoSTEER paper reports that multi-round evolving generation produces executable, valid factor code at a higher rate than single-pass LLM generation. The size of the gap is not restated in this chapter.
The system demonstrates the ability to discover diverse factors spanning financial concepts including momentum, volatility, liquidity, and value factors.
Iterative refinement through the R&D loop progressively improves factor quality over successive iterations, with the trace-based knowledge accumulation reducing redundant exploration.

Table 51.2: Reproducibility protocol for quantitative finance experiments (paper-reported)
Parameter	Value / Status
Dataset	Chinese A-share market data via Qlib
Primary metrics	IC (Information Coefficient), ICIR (IC Information Ratio)
LLM backend	GPT-4 / GPT-4o via Azure OpenAI (paper-reported)
Number of R&D loop iterations	Variable per experiment (typically 10-50 iterations, paper-dependent)
CoSTEER refinement rounds	Configurable via `max_loop` (typically 3-10)
Number of independent runs / seeds	Not fully specified in public materials
Compute budget (API cost)	Not standardized; varies by model and iteration count
Confidence intervals / variance	Not reported in available public materials

Evidence quality caveat: The quantitative finance results are distributed across multiple publications and project documentation. Exact per-iteration IC values, confidence intervals, and detailed ablation results with controlled compute budgets are not consolidated in a single public artifact. The results reported here reflect claims from the project's published papers. Independent replication under budget-controlled conditions has not been performed for this survey. Readers should consult the original CoSTEER paper and the RD-Agent repository's documentation for the most current numerical results.

51.5.2 MLE-bench Performance

RD-Agent was evaluated on MLE-bench (Chan et al., 2024), which measures the fraction of 75 Kaggle-derived ML tasks on which an agent achieves medal-level performance (bronze, silver, or gold relative to historical human competition results). The following results are paper-reported from the RD-Agent publications and project announcements:

Table 51.3: MLE-bench results context (paper-reported; readers should verify exact numbers against original publications)
System	LLM Backend	Approach	Medal Rate Context	Source
RD-Agent	GPT-4o	Iterative R&D loop + CoSTEER	Competitive; benefits from iterative refinement	RD-Agent paper
AIDE	Various	Tree-structured experiment search	Among top performers on MLE-bench	AIDE paper
OpenHands	Various	General agent framework	Broad SWE capabilities	MLE-bench paper
MLAB	GPT-4	Tool-augmented single-pass	Baseline comparison	MLE-bench paper

Important note on exact numbers: The specific medal rate percentages vary across benchmark versions, number of allowed iterations, model configurations, and compute budgets. Rather than reproduce potentially outdated figures, we direct readers to the original RD-Agent paper and the MLE-bench leaderboard for precise numbers. The key qualitative finding, which is robust across reporting contexts, is that RD-Agent's iterative approach — where the system progressively improves solutions through multiple R&D cycles with accumulated knowledge — proves particularly effective on tasks that reward incremental refinement over single-shot attempts.

Table 51.4: MLE-bench evaluation protocol (paper-reported + partially inferred)
Parameter	Value / Status
Benchmark	MLE-bench (Chan et al., 2024)
Number of tasks	75 Kaggle-derived ML engineering tasks
Primary metric	Medal rate: fraction of tasks achieving ≥ bronze medal threshold
LLM backend	GPT-4o (paper-reported)
R&D loop iterations per task	Multiple (specific count varies by configuration)
Budget matching with baselines	Not uniformly controlled across all comparisons
Execution environment	Docker-sandboxed execution per task

51.5.3 Effectiveness of the Iterative Loop

A key finding across scenarios is that the iterative nature of RD-Agent provides cumulative improvement (paper-reported). Unlike single-shot code generation, the system's performance improves with more iterations as the Trace grows and hypothesis generation becomes more targeted. The authors describe learning curves showing rapid early gains followed by diminishing returns.

Conceptual Interpretation: Diminishing Returns Model (author interpretation)

The diminishing returns behavior can be understood through a simple analytical model. If the probability of generating an improvement at iteration $t$ decreases as the knowledge base saturates, the expected cumulative improvement follows:

$$\mathbb{E}[\text{Improvement}_T] = \sum_{t=1}^{T} \delta_t \cdot p_t$$

where $\delta_t$ is the improvement magnitude at iteration $t$ (typically decreasing as "easy" gains are found first) and $p_t$ is the probability of finding a valid improvement (also typically decreasing as the search space is explored). When both sequences are non-negative and non-increasing, each per-iteration increment $\delta_t \cdot p_t$ is no larger than the one before it, so the cumulative sum keeps growing but at a slowing rate; this yields the concave learning curve described in the RD-Agent papers. The model is an author-constructed analytical frame for the reported behavior, not a result published by the RD-Agent team.
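To make the shape of this curve concrete, the sketch below evaluates the sum numerically. The geometric decay rates for $\delta_t$ and $p_t$ are arbitrary assumptions chosen for illustration; they are not fitted to any RD-Agent result.

```python
from typing import List, Sequence


def expected_cumulative_improvement(
    deltas: Sequence[float], probs: Sequence[float]
) -> List[float]:
    """Return the running sums of delta_t * p_t for t = 1..T."""
    if len(deltas) != len(probs):
        raise ValueError("deltas and probs must have the same length")
    totals: List[float] = []
    running = 0.0
    for delta, prob in zip(deltas, probs):
        running += delta * prob
        totals.append(running)
    return totals


T = 10
# Assumed schedules: both quantities shrink geometrically with t.
deltas = [0.02 * 0.8 ** t for t in range(T)]
probs = [0.9 * 0.85 ** t for t in range(T)]

curve = expected_cumulative_improvement(deltas, probs)
# Per-iteration gains; these shrink, which is what makes the curve concave.
increments = [curve[0]] + [b - a for a, b in zip(curve, curve[1:])]
```

Under these assumed schedules each increment is 0.68 times the previous one, so the curve approaches a finite ceiling of 0.018 / 0.32; extra iterations beyond the first handful add little.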

51.6 Implementation Details

51.6.1 Concrete Walkthrough: Qlib Factor Mining

To make the system's operation concrete, this section traces through a complete execution of the Qlib factor mining scenario from entry point to feedback logging. This walkthrough is based on the repository structure and publication descriptions (repo-verified paths + paper-reported behavior).

Step-by-step execution (Qlib factor mining):

Entry point — The user launches the factor mining loop via rdagent/app/qlib_rd_loop/. This initializes the QlibFactorScenario (loading market data descriptions and evaluation protocols), creates scenario-specific instances of HypothesisGen, Developer (CoSTEER), and HypothesisExperiment2Feedback, and initializes an empty Trace.
Hypothesis generation — HypothesisGen.gen(trace) assembles a prompt from the scenario description (available OHLCV fields, target metric IC), summaries of prior trace entries (using concise_* fields), and instructions to propose a novel factor idea. The APIBackend sends this to GPT-4o and parses the response into a Hypothesis object. Example hypothesis: "A 20-day momentum factor computed as the ratio of the current close to the 20-day moving average may capture short-term trend persistence with positive IC."
Experiment construction — The experiment generator converts the hypothesis into an Experiment with one or more Task objects (each representing a factor to implement) and corresponding FBWorkspace objects (temporary directories for code execution).
CoSTEER code evolution — CoSTEER.develop(experiment) begins the multi-round evolving code generation:
- Round 1: The LLM generates an initial Python function implementing the factor (e.g., a function that computes close / close.rolling(20).mean() from pandas DataFrame input). The code is written to the FBWorkspace via inject_code().
- Execution: FBWorkspace.execute() runs the code inside a Docker container with Qlib's data pipeline available. If the code crashes (e.g., missing import, shape mismatch), the error traceback is captured.
- Round 2+: If execution failed, the error trace and original code are sent to the LLM for refinement. If execution succeeded but output validation failed (wrong format, NaN values), that feedback is similarly provided. The EvolvingStrategy may also cross-pollinate successful patterns from other sub-tasks.
- Completion: After max_loop rounds (or upon successful validation), the evolved code is finalized in the workspace.
Qlib evaluation — The generated factor code is evaluated using Qlib's backtesting infrastructure: the factor values are computed over historical data, and the Information Coefficient (IC, the cross-sectional correlation between factor values and forward returns; its rank-based variant is Rank IC) and ICIR (the mean IC divided by its standard deviation) are calculated. Results are stored in experiment.result.
Feedback generation — HypothesisExperiment2Feedback.gen() produces a HypothesisFeedback object. The LLM analyzes whether the hypothesis was supported (e.g., "IC was 0.025, positive but modest; the factor shows momentum signal but may need normalization"), generates observations, and makes an accept/reject decision.
Trace update — The (hypothesis, experiment, feedback) triple is appended to trace.hist. On the next iteration, the hypothesis generator has access to this record, enabling it to propose refined or alternative ideas based on what was learned.
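The steps above can be condensed into a short control-flow sketch. The dataclasses below are simplified stand-ins that borrow the names used in this chapter (Hypothesis, Experiment, HypothesisFeedback, Trace); they are not the classes in rdagent/core/. The helper names `refine_until_valid` and `rd_loop`, the acceptance threshold, and the scores are all assumptions made for illustration. The LLM, the sandbox, and the Qlib evaluator are replaced by toy functions so the example runs without external services.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    text: str


@dataclass
class Experiment:
    hypothesis: Hypothesis
    code: str = ""
    result: Optional[float] = None  # evaluation metric, e.g. IC


@dataclass
class HypothesisFeedback:
    observation: str
    decision: bool  # True = accept, False = reject


@dataclass
class Trace:
    hist: List[Tuple[Hypothesis, Experiment, HypothesisFeedback]] = field(default_factory=list)


def refine_until_valid(
    generate: Callable[[Optional[str]], str],
    execute: Callable[[str], Optional[str]],
    max_loop: int,
) -> str:
    """Inner loop: regenerate code, feeding back the last error, for at most max_loop rounds."""
    error: Optional[str] = None
    code = ""
    for _ in range(max_loop):
        code = generate(error)
        error = execute(code)  # None signals a successful, validated run
        if error is None:
            break
    return code


def rd_loop(propose, develop, evaluate, n_iterations: int, accept_threshold: float) -> Trace:
    """Outer loop: one hypothesis and one experiment per iteration."""
    trace = Trace()
    for _ in range(n_iterations):
        hypothesis = propose(trace)  # conditioned on everything tried so far
        experiment = Experiment(hypothesis)
        experiment.code = develop(experiment)
        experiment.result = evaluate(experiment)
        decision = experiment.result >= accept_threshold
        feedback = HypothesisFeedback(f"metric={experiment.result:.3f}", decision)
        trace.hist.append((hypothesis, experiment, feedback))
    return trace


# --- Toy stand-ins for the LLM, the sandbox, and the evaluator ---
def propose(trace: Trace) -> Hypothesis:
    window = 10 * (len(trace.hist) + 1)
    return Hypothesis(f"{window}-day close-to-moving-average ratio")


def develop(experiment: Experiment) -> str:
    def generate(error: Optional[str]) -> str:
        body = "factor = close / close.rolling(20).mean()"
        # The first attempt omits an import; the retry adds it after seeing the error.
        return body if error is None else "import pandas as pd\n" + body

    def execute(code: str) -> Optional[str]:
        # Pretend the sandbox rejects code that lacks the import.
        return None if code.startswith("import pandas as pd") else "simulated failure: missing import"

    return refine_until_valid(generate, execute, max_loop=3)


made_up_scores = iter([0.010, 0.025, 0.018])  # fabricated for the demo only


def evaluate(experiment: Experiment) -> float:
    return next(made_up_scores)


trace = rd_loop(propose, develop, evaluate, n_iterations=3, accept_threshold=0.02)
decisions = [feedback.decision for _, _, feedback in trace.hist]
```

Passing the whole trace to `propose` is the point of the design: rejected attempts stay in the history, so the next proposal can avoid them rather than rediscover them.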

51.6.2 LLM Integration

The rdagent/oai/llm_utils.py module provides the APIBackend class — the unified LLM interface used throughout the system (repo-verified). Key characteristics:

Provider support — Azure OpenAI and OpenAI API endpoints, configurable via environment variables (e.g., OPENAI_API_KEY, Azure-specific variables for endpoint and deployment name).
Core method — build_messages_and_create_chat_completion() handles message formatting, system/user prompt construction, and API call dispatch.
Caching — LLM responses are cached using an MD5 hash of the prompt as the key, stored to the filesystem. This eliminates redundant API calls during development, debugging, and when the same context is re-queried across experiments.
Retry logic — Automatic retries with exponential backoff for API rate limits and transient failures.
Token management — Context window management to fit scenario descriptions, trace summaries, and code within model limits.

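# LLM interface pattern from rdagent/oai/llm_utils.py (repo-verified class name;
# method signature adapted from the verified API surface). The constructor
# arguments, the helpers _cache_path and _call_api, and the two exception
# classes are illustrative placeholders, not the repository's actual code.

import hashlib
import json
import time
from pathlib import Path


class RateLimitError(Exception):
    """Placeholder for the provider SDK's rate-limit exception."""


class LLMError(Exception):
    """Raised when every retry attempt has failed."""


class APIBackend:
    """Unified LLM interface with caching, retry, and provider abstraction.

    The actual implementation in rdagent/oai/llm_utils.py supports both
    Azure OpenAI and OpenAI endpoints, configured via environment variables.
    """

    def __init__(self, cache_dir: str = ".llm_cache", retry_count: int = 5):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.retry_count = retry_count

    def _cache_path(self, cache_key: str) -> Path:
        return self.cache_dir / f"{cache_key}.txt"

    def _call_api(self, messages: list, json_mode: bool = False, **kwargs) -> str:
        """Provider-specific call (OpenAI or Azure OpenAI); omitted in this sketch."""
        raise NotImplementedError

    def build_messages_and_create_chat_completion(
        self,
        user_prompt: str,
        system_prompt: str = "",
        json_mode: bool = False,
        **kwargs,
    ) -> str:
        """Main LLM completion method.

        1. Constructs messages list from system + user prompts
        2. Checks MD5-keyed file cache; returns cached response if available
        3. Calls OpenAI/Azure API with retry and exponential backoff
        4. Caches successful response to filesystem
        5. Returns completion text
        """
        # Build messages
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_prompt})

        # Key on the serialized messages plus json_mode, so that a different
        # system/user split or output mode never reuses another call's entry.
        payload = json.dumps({"messages": messages, "json_mode": json_mode}, sort_keys=True)
        cache_key = hashlib.md5(payload.encode()).hexdigest()

        # Check cache
        cache_file = self._cache_path(cache_key)
        if cache_file.exists():
            return cache_file.read_text(encoding="utf-8")

        # API call with retry
        for attempt in range(self.retry_count):
            try:
                response = self._call_api(messages, json_mode=json_mode, **kwargs)
            except RateLimitError:
                if attempt < self.retry_count - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
                continue
            cache_file.write_text(response, encoding="utf-8")
            return response

        raise LLMError("Exhausted retries")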

51.6.3 Sandboxed Execution

All generated experimental code runs inside Docker containers, managed through the FBWorkspace.execute() method (repo-verified). The sandbox provides:

Resource limits — CPU, memory, and timeout constraints prevent runaway computations.
Network isolation — Generated code cannot make arbitrary network requests.
Filesystem isolation — Each experiment gets its own workspace directory, mounted into the container.
Dependency management — Pre-built Docker images include common data science libraries (pandas, numpy, scikit-learn, PyTorch, LightGBM) and scenario-specific dependencies (Qlib for quantitative finance).

Security note: The Docker-based sandbox provides process-level and filesystem-level isolation, meaningfully stronger than restricted exec() or subprocess approaches used by some other systems. However, Docker container escapes are a known vulnerability class. The sandboxing is appropriate for the system's intended use case — running LLM-generated data science code in a research context — but should not be considered equivalent to VM-level isolation for fully adversarial code execution.
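The isolation properties listed in this section map onto standard `docker run` options. The sketch below only assembles the command line; it is not the code behind FBWorkspace.execute(), and the function names, the default limits, and the image name `rdagent-sandbox:example` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_sandbox_command(
    workspace: Path,
    image: str,
    entry_script: str = "main.py",
    cpus: float = 2.0,
    memory: str = "4g",
) -> List[str]:
    """Assemble a `docker run` argv with resource, network, and filesystem limits."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),        # CPU quota
        "--memory", memory,         # hard memory cap
        "--network", "none",        # generated code gets no network access
        "-v", f"{workspace.resolve()}:/workspace",  # mount only this experiment's directory
        "-w", "/workspace",
        image,
        "python", entry_script,
    ]


def run_sandboxed(command: List[str], timeout_s: int = 600) -> subprocess.CompletedProcess:
    """Run the command and capture output so a traceback can be fed back to the LLM.

    Caveat: on timeout, subprocess kills the docker CLI process; the container
    itself may keep running and need separate cleanup.
    """
    return subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)


# Hypothetical image name and workspace path, for illustration only.
cmd = build_sandbox_command(Path("workspace/exp_001"), image="rdagent-sandbox:example")
```

Keeping command construction separate from execution makes the limits easy to unit-test without a Docker daemon, and captured stderr from `run_sandboxed` is the natural source of the error traces used in refinement rounds.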

51.6.4 Cost Considerations

Each R&D loop iteration involves multiple LLM calls: hypothesis generation, experiment design, code generation (potentially multiple CoSTEER refinement rounds), and feedback synthesis. The cost structure depends on:

Choice of LLM model (GPT-4o vs. GPT-3.5-turbo vs. open-source alternatives)
Number of R&D loop iterations (the outer loop)
Number of CoSTEER refinement rounds per experiment (the inner loop, controlled by max_loop)
Length of trace history included in prompts (grows with iterations)
Number of sub-tasks per experiment (more tasks = more code generation calls)

The APIBackend's MD5-keyed file cache cuts costs during iterative development and debugging, because any prompt that is re-issued verbatim is served from disk rather than billed again. It does not help with new prompts, however, so a full R&D run over dozens of outer-loop iterations with GPT-4o can still accumulate substantial API costs. The authors do not publish a standardized cost-per-run figure, and costs vary considerably by scenario complexity and model choice.
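Since no standardized cost figure is published, a back-of-envelope estimator can help with budgeting. The sketch below assumes one LLM call each for hypothesis generation, experiment design, and feedback synthesis, plus one call per CoSTEER round per sub-task. It also assumes a flat average token count per call, which ignores prompt growth as the trace lengthens. Every number in the example, including the price, is a placeholder rather than a measured value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    outer_iterations: int      # R&D loop iterations
    tasks_per_experiment: int  # sub-tasks per experiment
    costeer_rounds: int        # refinement rounds actually used per task (<= max_loop)
    tokens_per_call: int       # average prompt + completion tokens (assumed flat)
    usd_per_1k_tokens: float   # blended price; placeholder, check your provider


def llm_calls(cfg: RunConfig) -> int:
    """Total LLM calls: hypothesis + design + feedback, plus code generation."""
    code_generation = cfg.tasks_per_experiment * cfg.costeer_rounds
    per_iteration = 3 + code_generation
    return cfg.outer_iterations * per_iteration


def estimated_cost_usd(cfg: RunConfig, cache_hit_rate: float = 0.0) -> float:
    """Estimated spend after discounting calls served from the response cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_calls = llm_calls(cfg) * (1.0 - cache_hit_rate)
    return billable_calls * cfg.tokens_per_call / 1000 * cfg.usd_per_1k_tokens


# Placeholder configuration: 30 iterations, 3 sub-tasks, 4 rounds each.
example_cfg = RunConfig(
    outer_iterations=30,
    tasks_per_experiment=3,
    costeer_rounds=4,
    tokens_per_call=6000,
    usd_per_1k_tokens=0.01,
)
total_calls = llm_calls(example_cfg)
cold_cost = estimated_cost_usd(example_cfg)
warm_cost = estimated_cost_usd(example_cfg, cache_hit_rate=0.5)
```

In this configuration code generation accounts for 12 of the 15 calls per iteration, which suggests that max_loop and the number of sub-tasks are the first settings to revisit when a run exceeds its budget.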

51.6.5 Reproducibility

Reproducibility in RD-Agent faces the standard challenges of LLM-dependent systems:

LLM non-determinism — Even with temperature set to 0, LLM outputs can vary across API calls due to provider-side infrastructure changes (model version updates, quantization changes).
API version drift — Model capabilities change over time as providers update their models; a GPT-4o snapshot from mid-2024 may behave differently from one served in late 2024.
Data dependencies — The quantitative finance scenario depends on market data that changes daily; Qlib data snapshots would need to be pinned.

The Trace logging system partially mitigates these issues by recording the full sequence of hypotheses (with all structured fields), generated code (stored in FBWorkspace.code_dict), and evaluation results, enabling post-hoc analysis even if exact reproduction is not possible. The rdagent/log/ module provides a structured logging infrastructure including a web-based log viewer for experiment inspection (repo-verified).
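One low-cost complement to trace logging is to record a manifest that pins the inputs a run depended on. The sketch below hashes a data directory and stores it alongside the model identifier and temperature. The function names, the manifest fields, and the model string are assumptions for illustration and do not correspond to anything in rdagent/log/.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Union


def snapshot_digest(data_dir: Path) -> str:
    """SHA-256 over relative paths and file bytes, visited in sorted order.

    Files are read whole for brevity; large datasets should be hashed in chunks.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(data_dir).as_posix().encode())
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(path.read_bytes())
    return digest.hexdigest()


def build_manifest(data_dir: Path, model: str, temperature: float) -> Dict[str, Union[str, float]]:
    """Collect the identifiers needed to tell two runs' inputs apart."""
    return {
        "model": model,
        "temperature": temperature,
        "data_sha256": snapshot_digest(data_dir),
    }


# Demonstration on a throwaway directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "prices.csv").write_text("date,close\n2024-01-02,10.0\n", encoding="utf-8")
    manifest_before = build_manifest(root, model="example-model-id", temperature=0.0)
    manifest_repeat = build_manifest(root, model="example-model-id", temperature=0.0)
    (root / "prices.csv").write_text("date,close\n2024-01-02,10.5\n", encoding="utf-8")
    manifest_after = build_manifest(root, model="example-model-id", temperature=0.0)
```

A manifest of this kind does not make LLM output deterministic; it only makes it possible to confirm afterwards whether two runs saw the same data and requested the same model.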

51.7 Comparative Analysis

51.7.1 Comparison with Related Systems

RD-Agent sits at the intersection of autonomous ML engineering, LLM-powered code generation, and scientific discovery automation. The following comparison situates it relative to related systems covered in this survey:

Table 51.5: RD-Agent in the context of related autonomous research systems (author synthesis from respective publications)
Dimension	RD-Agent	AIDE (Ch. 42)	FunSearch (Ch. 8)	OpenHands
Primary Domain	Data-centric R&D	ML engineering	Math/combinatorial	General SWE
Search Strategy	Sequential R&D loop + CoSTEER inner evolution	Tree-structured branching	Population-based evolution	Single-pass agent
Knowledge Reuse	Structured Trace with HypothesisFeedback	Experiment log	Programs database	Conversation memory
Hypothesis Explicit?	Yes — first-class `Hypothesis` object	Implicit in tree nodes	No — programs evolve directly	No
Code Evolution	CoSTEER: multi-round, multi-task, selection-based	Single-task refinement	Population sampling + best-shot prompting	Tool-augmented single-pass
Sandbox	Docker containers	Local subprocess	Restricted environment	Docker/runtime
Open Source	Yes (MIT license)	Yes	Partial	Yes

51.7.2 Relationship to Evolutionary Search

RD-Agent's iterative R&D loop shares structural similarities with evolutionary algorithms but differs in important ways. In the taxonomy introduced in Chapter 2:

Search strategy — The outer R&D loop is sequential: each iteration generates one hypothesis and one experiment. This makes it closer to hill-climbing than to a genetic algorithm. However, the inner CoSTEER loop maintains multiple candidate implementations per sub-task and applies selection pressure across rounds, introducing a population-like dynamic at the code-generation level.
Variation operator — The LLM serves as the variation operator at two levels. At the hypothesis level, it generates new ideas conditioned on the trace (analogous to informed mutation). At the code level, CoSTEER's refinement rounds resemble iterative mutation with execution-guided selection.
Selection — Selection is implicit in the feedback mechanism at the hypothesis level (the decision field in HypothesisFeedback records accept/reject) and more explicit at the CoSTEER level (the EvolvingStrategy selects which implementations to keep and refine).
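Knowledge accumulation — The Trace plays a role analogous to an archive in evolutionary computation: it records every attempted hypothesis, its experimental outcome, and the reasoning behind the accept/reject decision, rather than only the best solutions that a strict Hall of Fame would keep.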

This positioning places RD-Agent in a gray zone between pure LLM agent systems and evolutionary optimization systems. It lacks the explicit population dynamics and genetic operators of classical evolutionary computation at the outer loop level, but CoSTEER's inner loop introduces evolutionary-style selection and variation at the code generation level. The system can be viewed as implementing experience-guided search with a nested evolutionary code synthesis inner loop.

51.8 Limitations & Discussion

51.8.1 Sequential Search Limitations

The outer R&D loop's sequential design — one hypothesis per iteration — is simpler to implement and reason about, and each iteration benefits from the full trace. However, it is more susceptible to local optima than population-based search, which maintains diversity across multiple candidates. While CoSTEER introduces multi-candidate evolution at the code level, the hypothesis level remains sequential. A promising extension would be to run multiple R&D loops in parallel with different initial hypotheses, analogous to an island model in evolutionary computation.

51.8.2 Domain Specificity vs. Generality

While the core abstractions (Hypothesis, Experiment, Trace, Scenario) are domain-agnostic by design, the system's strength is concentrated in quantitative finance. The Kaggle and medical scenarios demonstrate the framework's generality but appear less extensively tested. Extending to new domains requires implementing scenario-specific components: hypothesis templates, evaluation metrics, data loaders, sandbox configurations, and domain knowledge. This is a non-trivial effort requiring both domain expertise and familiarity with the RD-Agent abstractions.

51.8.3 LLM Dependency and Cost

The system's heavy reliance on LLM calls for every phase creates both cost and reliability concerns. Each outer-loop iteration involves multiple LLM calls (hypothesis generation, code generation with potentially multiple CoSTEER rounds, feedback synthesis), and a typical R&D run may span dozens of iterations. The system's effectiveness is bounded by the LLM's capabilities: for domains where the LLM has limited training data (e.g., niche financial instruments), hypothesis quality may degrade. The system can recombine existing knowledge in novel ways, but it is unlikely to discover fundamentally new principles that are absent from the LLM's training corpus.

51.8.4 Evaluation Challenges

Evaluating autonomous R&D systems is inherently difficult. Performance on MLE-bench and Qlib benchmarks demonstrates capability on specific task distributions but does not guarantee effectiveness on novel research problems. The gap between benchmark performance and real-world research productivity is especially large in this domain, because real research involves problem formulation, data acquisition, and domain reasoning not captured by pre-structured benchmarks. Furthermore, cross-system comparisons (Table 51.3) are complicated by differing compute budgets, model versions, and evaluation protocols across publications.

51.8.5 Knowledge Base Scalability

As the Trace grows over many iterations, the system must decide what to include in the LLM's context window. The structured concise_* fields on Hypothesis help compress history, but long traces still risk exceeding context limits or diluting signal with noise from early, less-relevant experiments. More sophisticated retrieval mechanisms — such as embedding-based retrieval of semantically relevant prior experiments — could improve knowledge utilization for long-running sessions.
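The simplest context-selection policy is a recency window under a token budget. The sketch below is a baseline of that kind, not the embedding-based retrieval suggested above and not RD-Agent's actual policy. `TraceEntry` and its `concise_summary` field are stand-ins for the concise_* fields, and word count is used as a crude proxy for tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceEntry:
    concise_summary: str  # stand-in for the concise_* fields on Hypothesis
    accepted: bool


def approx_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())


def select_history(entries: List[TraceEntry], token_budget: int) -> List[TraceEntry]:
    """Keep the newest entries that fit the budget, returned in chronological order."""
    kept: List[TraceEntry] = []
    used = 0
    for entry in reversed(entries):  # walk from newest to oldest
        cost = approx_tokens(entry.concise_summary)
        if used + cost > token_budget:
            break  # everything older is dropped as well
        kept.append(entry)
        used += cost
    kept.reverse()
    return kept


# Made-up history: three prior experiments, oldest first.
history = [
    TraceEntry("momentum 10d weak IC", accepted=False),
    TraceEntry("momentum 20d normalized IC improved", accepted=True),
    TraceEntry("volatility factor rejected", accepted=False),
]
selected = select_history(history, token_budget=8)
```

A pure recency window drops old entries even when they were accepted, which is exactly the weakness that relevance-based retrieval would address.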

51.8.6 Implementation Unknowns

Several details that would affect practical deployment are not fully documented in public materials:

Exact prompt templates used for hypothesis generation and code refinement in each scenario (these are embedded in the codebase but not documented separately)
Default hyperparameters for CoSTEER max_loop, trace history window length, and temperature settings across scenarios
Detailed cost breakdown per iteration and per scenario under controlled conditions
Performance variance across independent runs with different random seeds and LLM non-determinism
Specific implementation details of the EvolvingStrategy selection and recombination logic within CoSTEER

These unknowns are typical for rapidly evolving open-source research projects and are not unique to RD-Agent. The repository is actively maintained, and documentation may improve as the project matures.

51.9 Connections to the Broader Field

RD-Agent connects to several broader research threads in the 2024-2026 period:

Scientific Discovery Agents. The system is part of a growing family of LLM-powered agents for scientific discovery, alongside systems like ChemCrow, Coscientist, and the AI Scientist (Chapter 34). What distinguishes RD-Agent is its focus on the engineering aspects of data-driven research (feature engineering, model building, pipeline construction) rather than the scientific reasoning aspects (theory formation, experimental design in physical sciences). This engineering focus makes it more immediately practical but also narrows its scope relative to systems targeting fundamental discovery.

AutoML and Meta-Learning. RD-Agent can be seen as a next-generation AutoML system where the search process is guided by LLM-generated hypotheses rather than predefined search spaces. Traditional AutoML systems like Auto-sklearn and FLAML search over fixed hyperparameter spaces; RD-Agent searches over the open-ended space of feature engineering ideas and model architectures expressible in Python code. This flexibility is both an advantage (no predefined search space limitations) and a challenge (the search space is vastly larger and less structured).

Multi-Agent Frameworks. The architecture — with specialized components for research, development, and evaluation — reflects a broader trend toward agent specialization. Systems like MetaGPT, ChatDev, and CAMEL have explored multi-agent collaboration for software engineering; RD-Agent applies similar principles to the R&D domain. The key insight is that different phases of the R&D cycle benefit from different prompting strategies, context windows, and evaluation criteria — hence the separation into HypothesisGen, Developer (CoSTEER), and HypothesisExperiment2Feedback.

Evolutionary AI Connections. While RD-Agent does not use explicit evolutionary operators at the outer loop, CoSTEER introduces genuinely evolutionary dynamics at the code generation level: multiple candidate implementations are maintained, evaluated, selected, and refined across rounds. This nested structure, in which a sequential hypothesis search wraps evolutionary code synthesis, is a hybrid strategy that is uncommon among the systems covered in this survey. The co-evolution of factors and models in the quantitative finance scenario adds another layer, managing two interacting "populations" of artifacts. A natural extension would be to introduce population-based search at the hypothesis level as well, which would move the system fully into the evolutionary paradigm described throughout this survey.

51.10 Summary

Chapter Summary

Key takeaway: RD-Agent demonstrates that the hypothesis-experiment-feedback cycle of empirical R&D can be automated using LLM-powered multi-agent systems, achieving competitive performance on MLE-bench and showing particular promise in quantitative finance factor discovery through its CoSTEER evolving code generation method.

Main contribution to the field: The system's principal contributions are: (1) the Hypothesis/Experiment/Trace/Scenario abstraction hierarchy that cleanly separates domain-agnostic R&D loop logic from scenario-specific implementations; (2) CoSTEER, a multi-round evolving code generation strategy that outperforms single-shot generation by iteratively refining implementations with execution feedback and cross-task knowledge sharing; and (3) a concrete, MIT-licensed, extensible platform for research into autonomous data science.

What researchers should know: RD-Agent is most mature in quantitative finance (via Qlib integration), with MLE-bench providing domain-independent evidence. The core abstractions in rdagent/core/ define the extension points for new domains. The outer R&D loop is sequential (one hypothesis per iteration), while the inner CoSTEER loop introduces evolutionary-style multi-candidate code evolution. Significant LLM API budget is required for extended runs. Readers evaluating the system's claims should note that quantitative finance results lack consolidated reproducibility protocols (confidence intervals, seed counts), and MLE-bench comparisons are complicated by differing compute budgets across systems. The system's strongest architectural insight is the separation of hypothesis-level search (what to try) from code-level search (how to implement it), allowing different search strategies at each level.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}