DiscoGen
Procedural generator of algorithm discovery tasks spanning ~99.3 billion unique ML problems across 14 domains, enabling meta-meta-learning for evolutionary optimization of Algorithm Discovery Agents.
Organization: University of Oxford, UC Santa Barbara, UCL, and collaborators
Published: March 18, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Procedural Generation of Algorithm Discovery Tasks in Machine Learning
System Name: DiscoGen
Paper: arXiv:2603.17863 (cs.LG, cs.AI)
Repository: github.com/AlexGoldie/discogen — MIT License, 31 stars
Project Website: disco-gen.github.io
Documentation: alexgoldie.github.io/discogen
PyPI Package: pip install discogen (v1.0.0)
Submission Date: March 18, 2026
License: MIT
Positioning Statement: DiscoGen is not a benchmark — it is a procedural generator of algorithm discovery tasks. Where existing suites provide tens of static problems, DiscoGen spans billions of parameterized tasks across 14 ML domains, enabling the first principled meta-meta-learning loops for optimizing Algorithm Discovery Agents (ADAs).
2 Authors and Team
| Role | Authors |
|---|---|
| Lead Author | Alexander D. Goldie (University of Oxford) |
| Core Contributors | Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz |
| Task Contributors | Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach |
| Equal Supervision | Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster |
Institutional Affiliations:
- University of Oxford (primary)
- UC Santa Barbara
- University College London (UCL)
- Additional collaborating institutions
Team Size: 20 authors — one of the largest collaborative efforts in algorithm discovery research, reflecting the enormous scope of implementing and validating 14 distinct task domains with associated datasets, evaluation pipelines, and modular decompositions.
Key Intellectual Lineage: Jakob Foerster's group at Oxford has been central to multi-agent RL and meta-learning research. Shimon Whiteson brings deep RL and automated algorithm design expertise. Roberta Raileanu contributes procedural generation methodology from the RL generalization literature.
3 Core Contribution
The Problem
Automated Algorithm Discovery (AAD) — using AI systems to discover novel ML algorithms — is a rapidly growing field. Systems like FunSearch, AlphaEvolve, and OpenELM have demonstrated that LLMs can discover novel optimizers, loss functions, and training procedures. However, the field faces a critical infrastructure gap:
- Tiny evaluation suites. Existing benchmarks contain tens of tasks at most, leading to overfitting and unreliable comparisons.
- No meta-train/meta-test separation. Most suites evaluate discovered algorithms on the same datasets used during discovery — confounding genuine algorithm quality with dataset-specific tuning.
- Data contamination. Static task sets risk contamination in LLM training corpora.
- Saturated problems. Many tasks are solved or nearly solved, providing no signal for improvement.
- Narrow domain coverage. Most benchmarks target a single ML subfield.
The Solution
DiscoGen addresses all five problems through procedural generation:
┌─────────────────────────────────────────────────────────────┐
│ DiscoGen Generator │
│ │
│ Configuration Parameters │
│ ┌──────────────────────────────────────────────────┐ │
│ │ task_domain: OnPolicyRL │ │
│ │ editable_modules: [loss, networks] │ │
│ │ meta_train: [Breakout, Freeway] │ │
│ │ meta_test: [Asterix, SpaceInvaders] │ │
│ │ eval_type: performance │ │
│ │ initialisation: empty | baseline │ │
│ │ backend: recurrent | feedforward │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Complete Runnable Task Directory │ │
│ │ task_src/ │ │
│ │ ├── loss.py (editable) │ │
│ │ ├── networks.py (editable) │ │
│ │ ├── optim.py (baseline, frozen) │ │
│ │ ├── train.py (baseline, frozen) │ │
│ │ ├── run_main.py (evaluation harness) │ │
│ │ └── config.yaml (task specification) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Task Space: ~99.3 billion unique tasks │
└─────────────────────────────────────────────────────────────┘
Three Levels of Contribution
| Level | Contribution | Impact |
|---|---|---|
| Generator | Procedural task generator spanning 14 domains | Unlimited unique tasks for ADA optimization |
| Benchmark | DiscoBench — fixed subset for principled evaluation | Reproducible comparisons with meta-train/meta-test split |
| Research Directions | Meta-meta-learning, curriculum learning, algorithm world models | New research paradigm for agent optimization |
Key Insight: By treating algorithm discovery tasks as procedurally generated environments (analogous to procedural level generation in RL), DiscoGen transforms ADA optimization from a few-shot benchmark problem into a genuine learning problem with distribution, generalization, and curriculum.
4 Supported Solutions
Task Domains
DiscoGen supports 14 distinct ML domains, each decomposed into modular algorithm components:
| Domain | Modules (m) | Datasets (d) | Backends (b) | Total Tasks |
|---|---|---|---|---|
| Bayesian Optimization | 6 | 11 | 1 | 65,413,656 |
| Brain Speech Detection | 3 | 7 | 1 | 81,144 |
| Computer Vision Classification | 4 | 9 | 1 | 1,679,400 |
| Continual Learning | 5 | 3 | 3 | 6,696 |
| Greenhouse Gas Prediction | 2 | 4 | 1 | 900 |
| Language Modelling | 3 | 4 | 2 | 4,200 |
| Model Unlearning | 1 | 3 | 1 | 85,176 |
| Neural Cellular Automata | 5 | 5 | 1 | 33,480 |
| Off-Policy RL | 7 | 4 | 1 | 38,100 |
| Offline RL | 5 | 10 | 1 | 10,602,372 |
| On-Policy MARL | 6 | 17 | 2 | 97,431,783,120 |
| On-Policy RL | 6 | 13 | 3 | 1,789,383,960 |
| Trajectory Prediction | 4 | 3 | 3 | 1,080 |
| Unsupervised Env Design | 3 | 4 | 1 | 2,100 |
| Total | | | | ~99.3 billion |
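As a sanity check, the per-domain counts in the table can be summed directly. The dictionary below simply transcribes the "Total Tasks" column; the key names are shorthand for the domain names:

```python
# Transcription of the "Total Tasks" column from the table above.
domain_tasks = {
    "BayesianOptimisation": 65_413_656,
    "BrainSpeechDetection": 81_144,
    "CVClassification": 1_679_400,
    "ContinualLearning": 6_696,
    "GreenhouseGasPrediction": 900,
    "LanguageModelling": 4_200,
    "ModelUnlearning": 85_176,
    "NeuralCellularAutomata": 33_480,
    "OffPolicyRL": 38_100,
    "OfflineRL": 10_602_372,
    "OnPolicyMARL": 97_431_783_120,
    "OnPolicyRL": 1_789_383_960,
    "TrajectoryPrediction": 1_080,
    "UnsupervisedEnvDesign": 2_100,
}

total = sum(domain_tasks.values())
print(f"{total:,}")  # 99,299,115,384, i.e. ~99.3 billion
```

Note that the two on-policy RL domains alone account for over 99.9% of the task space; the headline figure is dominated by their large dataset and backend counts.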
Module Types by Domain
Each domain decomposes its ML algorithm into editable modules. Representative examples:
On-Policy RL (PPO-style):
- loss.py — Objective function (surrogate loss, entropy bonus, value loss)
- networks.py — Policy and value network architectures
- optim.py — Optimizer configuration and learning rate schedules
- train.py — Training loop (rollout collection, update steps)
- activation.py — Activation functions
- targets.py — Advantage estimation and return computation
Language Modelling:
- loss.py — Language modeling objective
- networks.py — Transformer architecture
- optim.py — Optimizer and schedule
Bayesian Optimization:
- acq_fn.py — Acquisition function
- acq_optimizer.py — Acquisition function optimizer
- sampler.py — Initial point sampling
- next_queries.py — Query selection strategy
- surrogate.py — Surrogate model
- surrogate_optimizer.py — Surrogate training
Evaluation Types
Each task supports three evaluation objectives:
| Eval Type | Objective | Use Case |
|---|---|---|
| performance | Maximize algorithm quality metric | Standard algorithm discovery |
| energy | Minimize energy while exceeding performance threshold | Green AI, efficiency research |
| time | Minimize wall-clock time while exceeding performance threshold | Practical deployment constraints |
Initialization Modes
| Mode | What the ADA Receives | Difficulty |
|---|---|---|
| baseline | Complete, working reference implementation | Easier — improve upon known solution |
| empty | Only function signatures with input/output specs | Harder — design from scratch |
5 LLM Integration
Role of LLMs in DiscoGen
LLMs serve as Algorithm Discovery Agents (ADAs) that operate on DiscoGen tasks. DiscoGen itself is model-agnostic — it generates tasks that any ADA (LLM-based or otherwise) can attempt.
Evaluated Models
| Model | DiscoBench Single | DiscoBench Single (Until Success) | DiscoBench All |
|---|---|---|---|
| GPT-OSS 120B | 68.2% success | 100.0% success | 11.4% success |
| Devstral2 | 45.9% success | 100.0% success | 34.3% success |
| Deepseek-v3.2 | 80.0% success | 100.0% success | 25.7% success |
Critical finding: No model consistently outperforms baseline implementations across domains. This is a striking result — even the most capable models, when editing a single module, frequently produce algorithms that are worse than the standard implementation.
LLM as ADA Optimizer
The paper demonstrates prompt optimization in a meta-meta-learning loop:
┌──────────────────────────────────────────────────────────────┐
│ Meta-Meta-Learning Loop │
│ │
│ ┌─────────────┐ │
│ │ LLM │ │
│ │ (Prompt │◄─── Performance feedback from │
│ │ Optimizer) │ K_tasks DiscoGen tasks │
│ └──────┬──────┘ │
│ │ Proposes new ADA prompt │
│ ▼ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ADA │───►│ DiscoGen │───►│ Evaluate │ │
│ │ (with new │ │ Task │ │ Algorithm │ │
│ │ prompt) │ │ (sampled) │ │ (meta-test) │ │
│ └─────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ Loop for 30 steps ◄────────┘ │
└──────────────────────────────────────────────────────────────┘
The optimizer LLM proposes new system prompts for the ADA based on prior performance traces. This is distinct from the ADA itself — the optimizer sits one level above, performing search over the space of ADA configurations.
Model-Agnostic Design
DiscoGen enforces strict separation between:
- Task specification — domain, modules, datasets, eval type
- Agent interface — editable Python files with defined inputs/outputs
- Evaluation harness — deterministic scoring on meta-train and meta-test
This means DiscoGen can benchmark any algorithm discovery system, not just LLM-based ones. Future systems using neuroevolution, program synthesis, or hybrid approaches can use the same infrastructure.
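This contract can be sketched as a minimal Python protocol. The names `AlgorithmDiscoveryAgent` and `NoOpADA` are illustrative, not DiscoGen's actual API; the point is only that any system mapping editable sources to edited sources fits the interface:

```python
from typing import Dict, Protocol, runtime_checkable

@runtime_checkable
class AlgorithmDiscoveryAgent(Protocol):
    """Illustrative interface: anything that maps editable module
    sources to edited sources can be benchmarked, LLM-based or not."""
    def edit(self, task_description: str,
             modules: Dict[str, str]) -> Dict[str, str]:
        ...

class NoOpADA:
    """Degenerate agent that returns the baseline modules unchanged;
    by construction it reproduces the baseline's score."""
    def edit(self, task_description: str,
             modules: Dict[str, str]) -> Dict[str, str]:
        return dict(modules)
```

A neuroevolution or program-synthesis system would implement the same `edit` signature, which is what makes the evaluation harness reusable across agent families.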
6 Key Results
DiscoBench Evaluation
DiscoBench Single (edit one module, one attempt):
| Model | Success Rate | Meta-Train Score | Meta-Test Score |
|---|---|---|---|
| Baseline (All Fixed) | — | 1104 [1077, 1136] | 1177 [1144, 1211] |
| GPT-OSS 120B | 68.2% | 931 [900, 961] | 962 [933, 993] |
| Devstral2 | 45.9% | 886 [850, 922] | 808 [771, 842] |
| Deepseek-v3.2 | 80.0% | 1079 [1050, 1108] | 1053 [1020, 1082] |
DiscoBench All (edit all modules simultaneously):
| Model | Success Rate | Meta-Train Score | Meta-Test Score |
|---|---|---|---|
| Baseline (All Fixed) | — | 1409 [1297, 1682] | 1377 [1212, 1595] |
| GPT-OSS 120B | 11.4% | 533 [−183, 700] | 597 [−106, 799] |
| Devstral2 | 34.3% | 873 [751, 1138] | 1087 [971, 1322] |
| Deepseek-v3.2 | 25.7% | 1184 [1069, 1397] | 940 [831, 1176] |
Difficulty Scaling by Module Count
| Model | 1 Module | 2 Modules | 3 Modules | 4 Modules |
|---|---|---|---|---|
| Deepseek-v3.2 | 75.0% | 47.2% | 8.3% | 0.0% |
| GPT-OSS 120B | 50.0% | 11.1% | 8.3% | 0.0% |
| Devstral2 | 29.2% | 27.8% | 0.0% | 0.0% |
Key Finding 1: Success rates collapse precipitously with module count. No model succeeds with 4 editable modules. This establishes a clear difficulty gradient — a crucial property for curriculum design in ADA optimization.
Meta-Meta-Learning with Prompt Optimization
| K_tasks (unique tasks seen) | DiscoBench Success Rate | Meta-Train Score | Meta-Test Score |
|---|---|---|---|
| 1 | 70.6% | 956 [939, 978] | 957 [927, 977] |
| 5 | 75.3% | 1014 [1000, 1033] | 973 [947, 993] |
| 10 | 72.0% | 969 [949, 989] | 1000 [980, 1022] |
| 30 | 78.7% | 1061 [1040, 1079] | 1071 [1049, 1096] |
Key Finding 2: Meta-test performance improves monotonically with the number of distinct tasks experienced during optimization. Using only 1 task leads to overfitting; 30 tasks yields the best generalization. This validates DiscoGen's core hypothesis that task diversity improves ADA quality.
Generalization Gap
The paper reveals a critical generalization gap: algorithms that perform well on meta-train datasets do not necessarily generalize to meta-test. Rank-correlation analysis across DiscoBench tasks shows that the ranking of algorithms by meta-train performance breaks down at meta-test time. This vindicates DiscoGen's insistence on separate meta-train/meta-test evaluation.
7 Reproducibility
Open-Source Infrastructure
| Component | Availability | Notes |
|---|---|---|
| Generator code | GitHub (MIT) | Full procedural generator |
| PyPI package | pip install discogen | CLI + Python API |
| DiscoBench configs | Included in repo | Fixed benchmark configurations |
| Domain implementations | Included in repo | 14 domains with baselines |
| Reference implementations | Per-domain _reference.txt | Baseline code for all modules |
| Documentation | Comprehensive docs site | Usage, contributing, API |
Reproducing Results
# Install DiscoGen
pip install discogen
# List available domains
discogen get-domains
# Create a specific DiscoBench task
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml
# Run the task
cd task_src/OnPolicyRL
bash install.sh # Install domain-specific dependencies
python run_main.py
# Create meta-test evaluation
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml --test
cd task_src
python run_main.py # Evaluate on held-out datasets
Reproducibility Concerns
| Factor | Assessment |
|---|---|
| Task generation | Fully deterministic given config parameters — excellent |
| Evaluation scores | Score ranges reported with confidence intervals — good |
| LLM-based ADA | Inherent non-determinism in LLM outputs — moderate |
| Domain dependencies | Each domain has separate requirements; potential version conflicts — manageable via install.sh |
| Compute requirements | Not explicitly quantified per domain — unclear |
| DiscoBench configs | Fixed, included in repository — excellent |
| Meta-meta-learning | Prompt optimization details may vary by LLM provider — moderate |
Potential Confounds
- Domain-specific dependency conflicts. Each task domain has its own Python requirements that may conflict with others. The per-domain install.sh approach mitigates but doesn't eliminate this.
- Baseline implementation quality. The strength of results depends on the quality of reference implementations. If baselines are weak, beating them is less impressive; if strong, the negative results for LLMs are more damning.
- Score aggregation. The paper aggregates across diverse domains with different scale metrics. The normalization scheme for cross-domain comparison needs careful scrutiny.
8 Compute and API Costs
Estimated Costs Per Run
The paper does not provide explicit cost breakdowns. Based on the experimental setup:
| Component | Estimated Cost | Notes |
|---|---|---|
| Single DiscoBench task | 1-60 min compute | Varies enormously by domain (GPU needed for RL/CV) |
| DiscoBench Single evaluation | ~35 tasks × cost per task | Per model, single attempt |
| DiscoBench All evaluation | ~35 tasks × cost per task | All modules editable |
| Meta-meta-learning (30 steps) | 30 × (LLM call + task eval) | Plus prompt optimization LLM |
| Full evaluation suite | 3 models × 3 settings | Hundreds of task evaluations |
Hardware Requirements by Domain
| Domain Category | Likely Hardware | Duration |
|---|---|---|
| RL domains (On-Policy, Off-Policy, MARL) | GPU (training agents in environments) | 10-60 min |
| CV Classification | GPU (training CNNs/ViTs) | 5-30 min |
| Language Modelling | GPU (transformer pretraining) | 30-120 min |
| Bayesian Optimization | CPU sufficient | 1-10 min |
| Greenhouse Gas Prediction | CPU sufficient | 1-5 min |
| Brain Speech Detection | GPU (neural decoding) | 10-30 min |
| Trajectory Prediction | GPU | 10-60 min |
LLM API Costs (for ADA)
For an LLM-based ADA attempting a single task:
- Input context: task description + editable module templates + reference docs
- Output: modified Python files for editable modules
- Iterations: typically multiple rounds of edit-test-refine
- Estimated per-task LLM cost: $0.50-$10 depending on model and iterations
For the meta-meta-learning loop (30 optimization steps):
- Per step: 1 LLM prompt-optimization call + 1 ADA call + 1 task evaluation
- Total: ~$30-300 for the full optimization run
Note: The dominant cost is task evaluation compute (GPU time), not LLM API calls. A full DiscoBench evaluation across all models and settings likely requires hundreds of GPU-hours.
9 Architecture Solution
System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ DiscoGen System Architecture │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Configuration Layer │ │
│ │ │ │
│ │ Task Config (YAML) Domain Registry │ │
│ │ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ task_domain │ │ OnPolicyRL │ │ │
│ │ │ editable_modules │───────►│ OffPolicyRL │ │ │
│ │ │ meta_train │ │ LanguageModelling│ │ │
│ │ │ meta_test │ │ CVClassification │ │ │
│ │ │ eval_type │ │ ... (14 total) │ │ │
│ │ │ initialisation │ └──────────────────┘ │ │
│ │ │ backend │ │ │
│ │ └─────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Generation Engine │ │
│ │ │ │
│ │ create_task.py │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Module │ │ Dataset │ │ Evaluation │ │ │
│ │ │ Assembly │ │ Selection │ │ Harness │ │ │
│ │ │ │ │ │ │ Generation │ │ │
│ │ │ base/*.py │ │ meta_train[] │ │ run_main.py │ │ │
│ │ │ edit/*.py │ │ meta_test[] │ │ scoring │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Output: task_src/ │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Editable │ │ Frozen │ │ Evaluation │ │ │
│ │ │ Modules │ │ Baseline │ │ Pipeline │ │ │
│ │ │ │ │ Modules │ │ │ │ │
│ │ │ loss.py │ │ optim.py │ │ run_main │ │ │
│ │ │ networks.py│ │ train.py │ │ config.yaml│ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ADA Interface │ │
│ │ │ │
│ │ Input: Editable module files + task description │ │
│ │ Output: Modified module implementations │ │
│ │ Evaluation: python run_main.py → score │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Design Principles
1. Modularity Over Monolith
Rather than asking an ADA to build an entire ML system, DiscoGen decomposes algorithms into composable modules. This serves multiple purposes:
- Difficulty control: more editable modules = harder tasks
- Attribution: which module improvements drive performance gains?
- Composability: a discovered loss function from one task can be combined with a discovered optimizer from another
2. Separation of Concerns
Discovery Phase (meta-train):
ADA modifies editable modules
└── Trains on meta-train datasets
└── Iterates and refines
Evaluation Phase (meta-test):
Discovered algorithm runs on held-out datasets
└── No further modification allowed
└── Tests generalization, not overfitting
3. Configuration-Driven Generation
Every aspect of a task is determined by a small YAML configuration. This enables:
- Deterministic task reproduction
- Systematic difficulty sweeps
- Automated curriculum construction
- A combinatorial explosion of unique tasks
4. Domain Independence
The generator framework is domain-agnostic. Adding a new domain requires implementing:
- Module base/edit implementations
- Dataset adapters
- Evaluation metrics
- An install.sh for domain-specific dependencies
10 Component Breakdown
Core Components
1. CLI Interface (discogen/cli.py)
# Primary commands
discogen get-domains # List all 14 supported domains
discogen create-task # Generate a complete task directory
discogen sample-task-config # Sample a random task configuration
Key parameters for create-task:
- --task-domain — Which ML domain
- --config-path — YAML configuration file
- --example — Generate with editable (incomplete) modules
- --test — Generate meta-test evaluation version
2. Configuration System (discogen/create_config.py)
The configuration system defines the combinatorial space of possible tasks:
# Example task configuration
task_domain: OnPolicyRL
meta_train: [Breakout, Freeway]
meta_test: [Asterix, SpaceInvaders]
backend: recurrent
change_loss: true
change_networks: true
change_optim: false
change_train: false
change_activation: false
change_targets: false
eval_type: performance
initialisation: empty
The number of tasks per domain is:
$$N_{\text{tasks}} = (2^m - 1) \times \binom{d}{k_{\text{train}}} \times \binom{d - k_{\text{train}}}{k_{\text{test}}} \times b \times |\text{eval\_types}| \times |\text{init\_modes}|$$
where $m$ is the number of modules, $d$ is the number of datasets, $b$ is the number of backends, and $k_{\text{train}}, k_{\text{test}}$ are the sizes of train/test dataset splits.
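The formula translates directly into code. The helper below is a sketch under the definitions above (the function name and the default counts of three eval types and two initialisation modes are taken from the text; the per-domain split sizes $k_{\text{train}}, k_{\text{test}}$ are not listed in this report, so the table's totals can't be reproduced here):

```python
from math import comb

def n_tasks(m, d, k_train, k_test, b, n_eval_types=3, n_init_modes=2):
    """Unique tasks in one domain, per the formula above:
    (2^m - 1) non-empty editable-module subsets, times ordered
    train/test dataset splits, times backends, eval types, init modes."""
    module_subsets = 2**m - 1  # at least one module must be editable
    dataset_splits = comb(d, k_train) * comb(d - k_train, k_test)
    return module_subsets * dataset_splits * b * n_eval_types * n_init_modes
```

For example, a toy domain with m=2 modules, d=3 datasets, one train and one test dataset, and one backend yields (2^2 - 1) x 3 x 2 x 1 x 3 x 2 = 108 unique tasks.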
3. Domain Implementations (discogen/domains/)
Each domain directory contains:
discogen/domains/OnPolicyRL/
├── base/ # Complete baseline implementations
│ ├── loss.py
│ ├── networks.py
│ ├── optim.py
│ ├── train.py
│ ├── activation.py
│ └── targets.py
├── edit/ # Editable templates (function signatures only)
│ ├── loss.py
│ ├── networks.py
│ └── ...
├── utils/
│ ├── _reference.txt # Reference documentation for the domain
│ ├── environments.py # Environment wrappers
│ └── evaluation.py # Metric computation
├── datasets/ # Dataset configurations
├── config.yaml # Domain-level defaults
└── install.sh # Domain-specific dependency installer
4. Task Generation Engine (discogen/create_task.py)
The generation engine assembles a complete, runnable task directory:
- Module selection: For each module, choose base (frozen) or edit (editable) version
- Dataset assignment: Map meta-train and meta-test datasets
- Evaluation setup: Configure scoring metrics and evaluation scripts
- Dependency resolution: Ensure inter-module dependencies are satisfied
- Output: Self-contained `task_src/` directory
5. DiscoBench Configurations (discogen/discobench_configs/)
A fixed set of task configurations for reproducible benchmarking. These configurations are:
- Curated to cover diverse difficulty levels
- Balanced across domains
- Stable across DiscoGen versions
- Designed for principled meta-train/meta-test evaluation
Supporting Components
| Component | Purpose |
|---|---|
| Reference implementations | Gold-standard baselines per domain for comparison |
| Environment wrappers | Standardized interfaces for diverse RL environments |
| Scoring functions | Domain-specific metrics (returns, accuracy, loss, etc.) |
| Dataset loaders | Unified data loading across formats and sources |
| Result aggregation | Cross-domain score normalization and reporting |
11 Core Mechanisms (Detailed)
Mechanism 1: Procedural Task Generation
The core innovation is treating algorithm discovery tasks as procedurally generated environments — directly analogous to procedural level generation in RL (Minigrid, Procgen, etc.).
The Combinatorial Space:
For a domain with $m$ modules and $d$ datasets, the number of possible module combinations is $2^m - 1$ (at least one module must be editable). Dataset allocation multiplies this further. The resulting space is vast:
| Domain | Modules | Datasets | Unique Tasks |
|---|---|---|---|
| On-Policy MARL | 6 | 17 | 97.4 billion |
| On-Policy RL | 6 | 13 | 1.8 billion |
| Bayesian Optimization | 6 | 11 | 65.4 million |
| Offline RL | 5 | 10 | 10.6 million |
Task Sampling:
# Sample a uniformly random task configuration
discogen sample-task-config --config-dest random_task.yaml
# The config specifies ALL task parameters
# Domain, modules, datasets, eval type, initialization
This enables curriculum strategies where the distribution over tasks evolves based on the ADA's current capabilities — precisely the UED (Unsupervised Environment Design) paradigm applied to algorithm discovery.
Mechanism 2: Modular Algorithm Decomposition
Each ML algorithm is decomposed into semantically meaningful, independently editable modules:
Algorithm = Module_1 ⊕ Module_2 ⊕ ... ⊕ Module_m
Example (PPO):
PPO = Loss ⊕ Networks ⊕ Optimizer ⊕ Train_Loop ⊕ Activation ⊕ Targets
Each module has:
- Defined inputs (tensor shapes, types)
- Defined outputs (tensor shapes, types)
- Base implementation (working baseline)
- Edit template (signatures only)
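A hedged sketch of what the base/edit split for a loss module might look like, using a pure-Python stand-in (DiscoGen's actual templates, signatures, and frameworks may differ):

```python
import math

# base/loss.py -- complete, frozen baseline (mean cross-entropy).
def loss_fn_base(logits, labels):
    """logits: list of per-example score lists; labels: class indices."""
    total = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[y]
    return total / len(labels)

# edit/loss.py -- signature-only template handed to the ADA.
def loss_fn(logits, labels):
    """Return a scalar training loss (to be implemented by the ADA)."""
    raise NotImplementedError
```

Because the base and edit versions share one signature, the training loop is oblivious to which version it imports, which is what makes modules independently swappable.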
Why This Matters:
- Controlled complexity. Editing 1 module is fundamentally easier than editing 6. This provides a natural difficulty gradient.
- Attribution. If an ADA improves the loss function, we know exactly which component drove the improvement.
- Composability. A novel loss function discovered for CIFAR-10 can be tested on CIFAR-100 without modification.
- Research focus. Researchers can study "loss function discovery" or "optimizer discovery" in isolation.
Mechanism 3: Meta-Train/Meta-Test Separation
This is perhaps DiscoGen's most methodologically important contribution. Every task enforces a strict separation:
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ META-TRAIN │ │ META-TEST │
│ │ │ │
│ ADA has access to these │ │ ADA has NEVER seen these │
│ datasets during discovery │ │ datasets — held out completely │
│ │ │ │
│ Example: │ │ Example: │
│ - Breakout (Atari) │ │ - Asterix (Atari) │
│ - Freeway (Atari) │ │ - SpaceInvaders (Atari) │
│ │ │ │
│ ADA iterates on these: │ │ Discovered algorithm evaluated │
│ edit code → train → evaluate │ │ here with NO modifications │
│ → edit code → train → ... │ │ │
└─────────────────────────────────┘ └─────────────────────────────────┘
The Problem This Solves:
Prior algorithm discovery benchmarks evaluate on the same datasets used during discovery. An ADA could achieve high scores by:
- Overfitting hyperparameters to specific datasets
- Hardcoding dataset-specific tricks
- Memorizing training data statistics
DiscoGen's meta-test evaluation ensures that only genuinely novel, generalizable algorithms score well.
Empirical Validation:
The paper shows that rank correlation between algorithms' meta-train and meta-test performance is weak — algorithms that look good during discovery often fail to generalize. This vindicates the split design.
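The rank-correlation check described above can be illustrated with a small Spearman helper on synthetic scores (pure Python, assumes no tied scores; this is not the paper's analysis code):

```python
def spearman(xs, ys):
    """Spearman rank correlation for untied score lists: 1.0 means
    meta-train rankings perfectly predict meta-test rankings, values
    near 0 mean they carry little signal."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))
```

A weak correlation between per-algorithm meta-train and meta-test scores is exactly the signature of the generalization gap the paper reports.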
Mechanism 4: Meta-Meta-Learning Loop
The paper demonstrates using DiscoGen for optimizing the optimizer — a meta-meta-learning loop where the ADA's prompt is itself evolved:
Outer loop (meta-meta-learning):

for step in range(30):
    task = DiscoGen.sample()           # Fresh task each step
    score = ADA(prompt, task)          # ADA discovers algorithm
    prompt = Optimizer(prompt, score)  # LLM proposes better prompt

Inner loop (meta-learning / algorithm discovery):

for iteration in range(N):
    code = ADA.edit(modules)  # ADA modifies editable code
    train_score = evaluate(code, meta_train)
    if converged:
        break

Evaluation:

test_score = evaluate(code, meta_test)  # Generalization test
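The nested loops above can be made concrete as a runnable toy. Here `sample_task`, `run_ada`, and `propose_prompt` are stand-ins for DiscoGen task sampling and LLM calls; only the control flow mirrors the paper's setup:

```python
import random

def sample_task(rng):
    # Stand-in for DiscoGen's task sampler.
    return {"domain": rng.choice(["OnPolicyRL", "CVClassification"])}

def run_ada(prompt, task, rng):
    # Stand-in inner loop: pretend longer (more refined) prompts score higher.
    return len(prompt) + rng.random()

def propose_prompt(history):
    # Stand-in optimizer: refine the best-scoring prompt seen so far.
    best_prompt, _ = max(history, key=lambda h: h[1])
    return best_prompt + "+"

def meta_meta_optimize(steps=30, seed=0):
    rng = random.Random(seed)
    prompt, history = "base", []
    for _ in range(steps):
        task = sample_task(rng)             # fresh task each step
        score = run_ada(prompt, task, rng)  # ADA discovers an algorithm
        history.append((prompt, score))
        prompt = propose_prompt(history)    # optimizer updates the prompt
    return max(history, key=lambda h: h[1])
```

The separation of levels is the important part: `run_ada` encapsulates the entire inner discovery loop, while the outer loop only ever touches the prompt.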
Key Experimental Finding:
The number of distinct DiscoGen tasks experienced during optimization is the critical variable:
| Tasks Seen | Overfitting Risk | Meta-Test Performance |
|---|---|---|
| 1 | High — prompt specializes to single task | Lowest |
| 5 | Moderate | Moderate |
| 10 | Lower | Good |
| 30 | Lowest | Best (1071 score) |
Meta-test performance improves monotonically with task diversity. This is the clearest empirical evidence that DiscoGen's scale enables genuine learning — not just memorization.
Mechanism 5: Evaluation Harness
Each generated task includes a complete, self-contained evaluation pipeline:
# Generated run_main.py (simplified)
def evaluate_algorithm():
    # Load meta-train or meta-test datasets
    datasets = load_datasets(config.datasets)

    # Import editable modules (ADA-modified or baseline)
    loss_fn = import_module("loss")
    networks = import_module("networks")
    optimizer = import_module("optim")

    # Train using the composed algorithm
    model = train(networks, loss_fn, optimizer, datasets.train)

    # Evaluate
    score = evaluate(model, datasets.test)
    return score
The evaluation harness:
- Is deterministic (fixed seeds for reproducibility)
- Reports normalized scores with confidence intervals
- Supports three evaluation types (performance, energy, time)
- Can be run headlessly for automated optimization loops
12 Programming Language
Implementation Stack
| Component | Language | Framework |
|---|---|---|
| DiscoGen core | Python | Click (CLI), YAML (configs) |
| Domain implementations | Python | PyTorch, JAX (domain-dependent) |
| RL environments | Python | Gymnax, MinAtar, Brax, Craftax |
| Bayesian optimization | Python | GPyTorch, BoTorch |
| Language modelling | Python | PyTorch, Transformers |
| Build system | Makefile + uv | Modern Python packaging |
| Documentation | MkDocs | Deployed to GitHub Pages |
Package Management
DiscoGen uses uv for dependency management:
make install # Sets up environment + pre-commit hooks
uv run discogen ... # Run CLI commands
Each domain has isolated dependencies installed via install.sh, addressing the challenge of conflicting requirements across 14 diverse ML domains (e.g., JAX for RL vs. PyTorch for CV).
Code Structure
discogen/
├── discobench_configs/ # Fixed benchmark task configurations
├── domains/ # 14 domain implementations
│ ├── BayesianOptimisation/
│ ├── BrainSpeechDetection/
│ ├── ComputerVisionClassification/
│ ├── ContinualLearning/
│ ├── GreenhouseGasPrediction/
│ ├── LanguageModelling/
│ ├── ModelUnlearning/
│ ├── NeuralCellularAutomata/
│ ├── OfflineRL/
│ ├── OffPolicyRL/
│ ├── OnPolicyMARL/
│ ├── OnPolicyRL/
│ ├── TrajectoryPrediction/
│ └── UnsupervisedEnvironmentDesign/
├── utils/ # Shared utilities
├── create_task.py # Task generation engine
├── create_config.py # Configuration utilities
└── cli.py # Click-based CLI
Language Choice Rationale
Python is the natural choice given that:
1. All 14 target ML domains are predominantly Python-based
2. The primary consumers (LLM-based ADAs) generate Python code
3. The ML ecosystem (PyTorch, JAX, scikit-learn) is Python-native
4. The editable modules are Python — this is algorithm discovery, not code translation
13 Memory Management
Task-Level Isolation
DiscoGen generates self-contained task directories. Each task runs in its own process with its own memory space. There is no shared state between task evaluations, which is essential for:
- Parallel evaluation of multiple tasks
- Fault isolation when ADA-generated code crashes
- Reproducibility of individual task scores
Domain-Specific Memory Considerations
| Domain | Memory Profile | GPU Memory | Notes |
|---|---|---|---|
| On-Policy RL | Moderate | 2-8 GB | Environment rollouts + policy network |
| On-Policy MARL | High | 4-16 GB | Multiple agents + shared environment |
| Language Modelling | High | 8-40 GB | Transformer parameters + attention |
| CV Classification | Moderate | 4-12 GB | CNN/ViT + image batches |
| Bayesian Optimization | Low | 0-2 GB | Gaussian process + acquisition |
| Offline RL | Moderate | 2-8 GB | Replay buffer + networks |
| Neural Cellular Automata | Low-Moderate | 1-4 GB | Grid state + update rules |
Scaling Properties
DiscoGen's procedural generation is itself lightweight — task generation requires negligible compute and memory. The cost is in task evaluation, which scales with:
- Domain complexity (language modelling >> greenhouse gas prediction)
- Dataset size (CIFAR-100 >> MNIST)
- Number of meta-train datasets (more datasets = longer training)
- Evaluation type (energy/time require multiple runs for measurement)
State Management in Meta-Meta-Learning
The prompt optimization loop maintains:
- Prompt history: all attempted prompts and their scores (~KB scale)
- Score history: performance on each task attempted (~KB scale)
- Best prompt: the current best-performing ADA configuration
- DiscoGen configs: sampled task specifications (~KB per task)
This state is extremely lightweight; the dominant memory cost is always within the inner loop (the ADA executing on a task).
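The outer-loop state above can be captured in a few kilobytes of bookkeeping. The following is a minimal sketch under assumed names (`MetaMetaState`, `record`); the paper does not prescribe this data structure.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MetaMetaState:
    """Illustrative outer-loop state for prompt optimization."""
    prompt_history: list = field(default_factory=list)   # (prompt, score) pairs
    score_history: dict = field(default_factory=dict)    # task_id -> best score seen
    best_prompt: Optional[str] = None
    best_score: float = float("-inf")

    def record(self, prompt: str, task_id: str, score: float) -> None:
        # Append to histories and keep the running best ADA configuration.
        self.prompt_history.append((prompt, score))
        prev = self.score_history.get(task_id, float("-inf"))
        self.score_history[task_id] = max(prev, score)
        if score > self.best_score:
            self.best_score, self.best_prompt = score, prompt
```

All fields grow linearly in the number of outer-loop steps, which is why the inner loop (ADA execution) dominates memory and compute.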
14 Continued Learning
Built-In Curriculum Support
DiscoGen's parameterized task generation naturally enables curriculum learning for ADAs:
Difficulty Axes:
- Module count: 1 (easy) → 6 (nearly impossible with current models)
- Initialization: baseline (easier) → empty (harder)
- Dataset complexity: MNIST (simple) → TinyImageNet (complex)
- Domain familiarity: Well-studied domains → novel combinations
- Evaluation type: performance (standard) → energy/time (constrained)
Curriculum Strategies Enabled:
| Strategy | Description |
|---|---|
| Progressive module addition | Start with 1 editable module, gradually add more |
| Domain transfer | Train on simple domains, evaluate on complex ones |
| Initialization escalation | Start with baseline code, progress to empty templates |
| Dataset difficulty ramping | Begin with MNIST, advance to CIFAR-100, TinyImageNet |
| Eval type progression | Master performance → add efficiency constraints |
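A curriculum over these difficulty axes can be sketched as a sampler whose output hardens as a scalar difficulty increases. All names and axis values below are illustrative examples mirroring the table, not DiscoGen's API.

```python
import random

# Example difficulty axes (values mirror the table above).
MODULE_COUNTS = [1, 2, 3, 4, 5, 6]
DATASETS = ["MNIST", "CIFAR-10", "CIFAR-100", "TinyImageNet"]

def sample_task_config(difficulty: float, rng: random.Random) -> dict:
    """Sample a task spec whose hardness scales with difficulty in [0, 1]."""
    level = min(int(difficulty * len(MODULE_COUNTS)), len(MODULE_COUNTS) - 1)
    ds_level = min(int(difficulty * len(DATASETS)), len(DATASETS) - 1)
    return {
        # Progressive module addition: cap module count by current level.
        "n_editable_modules": MODULE_COUNTS[rng.randint(0, level)],
        # Dataset difficulty ramping: pick harder datasets at higher difficulty.
        "dataset": DATASETS[ds_level],
        # Initialization escalation: baseline code first, empty templates later.
        "initialization": "baseline" if difficulty < 0.5 else "empty",
    }
```

An ADA trainer could anneal `difficulty` from 0 to 1 over the course of meta-training, realizing several of the curriculum strategies in the table at once.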
Research Directions Proposed
The paper outlines several ambitious research directions:
1. Algorithm World Models
Train a model to predict algorithm performance from code without executing it. DiscoGen provides the training data: (code, configuration, score) triples at scale.
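One way to picture this training data is as serialized (code, configuration, score) records. The schema below is purely illustrative (the paper does not specify a format); `make_training_record` is a hypothetical helper.

```python
import hashlib

def make_training_record(code: str, config: dict, score: float) -> dict:
    """Package one (code, configuration, score) triple for world-model training."""
    return {
        # Short content hash for deduplication across generated tasks.
        "code_hash": hashlib.sha256(code.encode()).hexdigest()[:16],
        "code": code,
        "config": config,
        "score": score,
    }
```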
2. Curriculum Learning for ADAs
Automatically design training curricula that maximize ADA generalization. This is UED (Unsupervised Environment Design) applied to the algorithm discovery setting — a recursive application where DiscoGen itself becomes the environment generator.
3. Tree Search for Discovery
Apply MCTS or similar search methods to navigate the space of module modifications, using DiscoGen tasks as the evaluation oracle.
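As a simpler stand-in for MCTS, a best-first search over module edits illustrates the idea: candidate edits are proposed, scored by the task oracle, and the most promising nodes are expanded first. Everything here is a hypothetical sketch; `propose_edits` and `evaluate` stand in for an ADA's edit proposals and a DiscoGen task evaluation.

```python
import heapq

def tree_search(root_code, propose_edits, evaluate, budget=20, beam=3):
    """Best-first search over module modifications.

    evaluate(code) plays the role of the DiscoGen task oracle;
    propose_edits(code) plays the role of an ADA proposing edits.
    """
    frontier = [(-evaluate(root_code), root_code)]  # max-heap via negated scores
    best_score, best = -frontier[0][0], root_code
    for _ in range(budget):
        if not frontier:
            break
        _, code = heapq.heappop(frontier)
        for child in propose_edits(code)[:beam]:
            s = evaluate(child)
            if s > best_score:
                best_score, best = s, child
            heapq.heappush(frontier, (-s, child))
    return best, best_score
```

Replacing the best-first expansion with UCT-style selection and backup would turn this into proper MCTS; the oracle interface stays the same.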
4. Multi-Task Algorithm Discovery
Discover algorithms that work well across multiple domains simultaneously, leveraging DiscoGen's cross-domain task coverage.
5. Foundation Models for Algorithm Discovery
Train specialized models on large volumes of DiscoGen task data, analogous to how foundation models are trained on internet text.
Extensibility
DiscoGen is designed for community contribution:
Adding a new domain requires:
1. Implement base/ modules (working baseline)
2. Implement edit/ templates (function signatures)
3. Define dataset configurations
4. Implement evaluation metrics
5. Write install.sh for dependencies
6. Add domain to registry
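The final registration step might look like the sketch below. The registry structure and field names are assumptions for illustration; consult the repository's contributing guide for the actual mechanism.

```python
# Hypothetical domain registry; the real DiscoGen registry may differ.
DOMAIN_REGISTRY = {}

def register_domain(name, base_modules, edit_templates, datasets, metrics):
    """Register a new task domain once its modules, templates, datasets,
    and evaluation metrics are implemented."""
    DOMAIN_REGISTRY[name] = {
        "base": base_modules,       # working baseline modules (base/)
        "edit": edit_templates,     # editable function-signature templates (edit/)
        "datasets": datasets,       # dataset configurations
        "metrics": metrics,         # evaluation metrics
    }

register_domain(
    "MyNewDomain",
    base_modules=["loss.py", "train.py"],
    edit_templates=["loss.py"],
    datasets=["my_dataset_small"],
    metrics=["val_accuracy"],
)
```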
The documentation includes detailed contributing guides for:
- Adding new task domains
- Integrating new datasets
- Defining new evaluation types
- Contributing to DiscoBench
Version Evolution
As DiscoGen grows, DiscoBench provides stability:
- The generator evolves (new domains, datasets, backends)
- DiscoBench configurations remain fixed for comparability
- New DiscoBench versions can be released periodically
- Historical results remain valid against their DiscoBench version
15 Applications
Primary Application: ADA Optimization
DiscoGen's primary use case is training and evaluating Algorithm Discovery Agents:
For ADA Developers:
1. Use DiscoGen to generate training tasks
2. Run ADA optimization loop (prompt tuning, architecture search, etc.)
3. Evaluate on DiscoBench for principled comparison
4. Publish results with standardized metrics
Concrete Application Domains
| Application | DiscoGen Domain | Module Focus | Potential Impact |
|---|---|---|---|
| Novel RL algorithms | OnPolicyRL, OffPolicyRL | loss, train | Discovery of PPO successors |
| Efficient training | LanguageModelling | optim, loss | Reduced pretraining costs |
| Better vision classifiers | CVClassification | networks, loss | Architecture discovery |
| Multi-agent coordination | OnPolicyMARL | loss, targets, train | New MARL algorithms |
| Self-driving prediction | TrajectoryPrediction | networks, loss | Safer autonomous vehicles |
| Brain-computer interfaces | BrainSpeechDetection | networks, loss | Better neural decoders |
| Climate science | GreenhouseGasPrediction | model, data_processing | Improved forecasting |
| ML safety | ModelUnlearning | loss | Better forgetting algorithms |
| Open-ended learning | NeuralCellularAutomata | update, perceive | Artificial life research |
| RL generalization | UnsupervisedEnvDesign | sample_levels, train_step | More robust RL agents |
| Offline RL | OfflineRL | actor_loss, critic_loss | Learning from logged data |
| Continual learning | ContinualLearning | regularizer, replay | Catastrophic forgetting solutions |
Cross-Cutting Applications
1. Automated ML Research
DiscoGen enables a new paradigm: automated research assistants that discover novel algorithms without human guidance. The meta-meta-learning results show this is feasible — prompt-optimized ADAs outperform naive ones.
2. Benchmark Design
DiscoGen's methodology — procedural generation with meta-train/meta-test separation — can be applied to other evaluation domains beyond algorithm discovery.
3. ML Education
The modular decomposition provides an excellent teaching tool. Students can understand PPO by editing individual components and observing the impact.
4. Algorithm Portfolio Construction
By running ADAs across thousands of DiscoGen tasks, researchers can build portfolios of algorithms suited to different settings — analogous to algorithm selection in combinatorial optimization.
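Portfolio construction reduces to a per-setting argmax over a score matrix. A minimal sketch, assuming scores indexed as `scores[algorithm][task]` (an illustrative layout, not a DiscoGen artifact):

```python
def build_portfolio(scores):
    """Pick the best algorithm per task setting from cross-task scores.

    scores: dict mapping algorithm name -> {task name -> score}.
    Returns: dict mapping task name -> best algorithm for that task.
    """
    tasks = {t for per_task in scores.values() for t in per_task}
    return {
        task: max(scores, key=lambda a: scores[a].get(task, float("-inf")))
        for task in tasks
    }
```

A deployed system would then dispatch each incoming problem to the portfolio entry for its nearest task setting, as in classical algorithm selection.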
Relationship to Other Systems
| System | Relationship to DiscoGen |
|---|---|
| FunSearch (Google DeepMind) | FunSearch is an ADA; DiscoGen provides tasks for it |
| AlphaEvolve (Google DeepMind) | AlphaEvolve is an ADA; DiscoGen could evaluate it |
| OpenELM | OpenELM is an ADA; DiscoGen provides benchmarking |
| EvoTorch | EvoTorch provides optimization; DiscoGen provides problems |
| AutoML frameworks | AutoML searches hyperparameters; DiscoGen generates algorithms |
| Neural Architecture Search | NAS searches architectures; DiscoGen includes this via networks.py modules |
Strategic Position: DiscoGen occupies a unique niche as infrastructure for algorithm discovery research. It doesn't discover algorithms itself — it generates the problems that algorithm discovery systems solve. This makes it complementary to, rather than competitive with, every ADA in the field.
References
@misc{goldie2026proceduralgenerationalgorithmdiscovery,
title={Procedural Generation of Algorithm Discovery Tasks in Machine Learning},
author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani
and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz
and Michael Beukman and Alistair Letcher and Shashank Reddy
and Clarisse Wibault and Theo Wolf and Charles O'Neill
and Uljad Berdica and Nicholas Roberts and Saeed Rahmani
and Hannah Erlebach and Roberta Raileanu and Shimon Whiteson
and Jakob N. Foerster},
year={2026},
eprint={2603.17863},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.17863},
}