DiscoGen

Procedural generator of algorithm discovery tasks spanning ~99.3 billion unique ML problems across 14 domains, enabling meta-meta-learning for evolutionary optimization of Algorithm Discovery Agents.

Organization: University of Oxford, UC Santa Barbara, UCL, and collaborators
Published: March 18, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Procedural Generation of Algorithm Discovery Tasks in Machine Learning

System Name: DiscoGen

Paper: arXiv:2603.17863 (cs.LG, cs.AI)

Repository: github.com/AlexGoldie/discogen — MIT License, 31 stars

Project Website: disco-gen.github.io

Documentation: alexgoldie.github.io/discogen

PyPI Package: pip install discogen (v1.0.0)

Submission Date: March 18, 2026

License: MIT

Positioning Statement: DiscoGen is not a benchmark — it is a procedural generator of algorithm discovery tasks. Where existing suites provide tens of static problems, DiscoGen spans billions of parameterized tasks across 14 ML domains, enabling the first principled meta-meta-learning loops for optimizing Algorithm Discovery Agents (ADAs).


2 Authors and Team

Role Authors
Lead Author Alexander D. Goldie (University of Oxford)
Core Contributors Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz
Task Contributors Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach
Equal Supervision Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster

Institutional Affiliations:

  • University of Oxford (primary)
  • UC Santa Barbara
  • University College London (UCL)
  • Additional collaborating institutions

Team Size: 20 authors — one of the largest collaborative efforts in algorithm discovery research, reflecting the enormous scope of implementing and validating 14 distinct task domains with associated datasets, evaluation pipelines, and modular decompositions.

Key Intellectual Lineage: Jakob Foerster's group at Oxford has been central to multi-agent RL and meta-learning research. Shimon Whiteson brings deep RL and automated algorithm design expertise. Roberta Raileanu contributes procedural generation methodology from the RL generalization literature.


3 Core Contribution

The Problem

Automated Algorithm Discovery (AAD) — using AI systems to discover novel ML algorithms — is a rapidly growing field. Systems like FunSearch, AlphaEvolve, and OpenELM have demonstrated that LLMs can discover novel optimizers, loss functions, and training procedures. However, the field faces a critical infrastructure gap:

  1. Tiny evaluation suites. Existing benchmarks contain tens of tasks at most, leading to overfitting and unreliable comparisons.
  2. No meta-train/meta-test separation. Most suites evaluate discovered algorithms on the same datasets used during discovery — confounding genuine algorithm quality with dataset-specific tuning.
  3. Data contamination. Static task sets risk contamination in LLM training corpora.
  4. Saturated problems. Many tasks are solved or nearly solved, providing no signal for improvement.
  5. Narrow domain coverage. Most benchmarks target a single ML subfield.

The Solution

DiscoGen addresses all five problems through procedural generation:

┌─────────────────────────────────────────────────────────────┐
│                    DiscoGen Generator                        │
│                                                             │
│  Configuration Parameters                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │ task_domain: OnPolicyRL                          │       │
│  │ editable_modules: [loss, networks]               │       │
│  │ meta_train: [Breakout, Freeway]                  │       │
│  │ meta_test: [Asterix, SpaceInvaders]              │       │
│  │ eval_type: performance                           │       │
│  │ initialisation: empty | baseline                 │       │
│  │ backend: recurrent | feedforward                 │       │
│  └──────────────────────────────────────────────────┘       │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │         Complete Runnable Task Directory          │       │
│  │  task_src/                                       │       │
│  │  ├── loss.py          (editable)                 │       │
│  │  ├── networks.py      (editable)                 │       │
│  │  ├── optim.py         (baseline, frozen)         │       │
│  │  ├── train.py         (baseline, frozen)         │       │
│  │  ├── run_main.py      (evaluation harness)       │       │
│  │  └── config.yaml      (task specification)       │       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  Task Space: ~99.3 billion unique tasks                     │
└─────────────────────────────────────────────────────────────┘

Three Levels of Contribution

Level Contribution Impact
Generator Procedural task generator spanning 14 domains Unlimited unique tasks for ADA optimization
Benchmark DiscoBench — fixed subset for principled evaluation Reproducible comparisons with meta-train/meta-test split
Research Directions Meta-meta-learning, curriculum learning, algorithm world models New research paradigm for agent optimization

Key Insight: By treating algorithm discovery tasks as procedurally generated environments (analogous to procedural level generation in RL), DiscoGen transforms ADA optimization from a few-shot benchmark problem into a genuine learning problem with distribution, generalization, and curriculum.


4 Supported Solutions

Task Domains

DiscoGen supports 14 distinct ML domains, each decomposed into modular algorithm components:

Domain Modules (m) Datasets (d) Backends (b) Total Tasks
Bayesian Optimization 6 11 1 65,413,656
Brain Speech Detection 3 7 1 81,144
Computer Vision Classification 4 9 1 1,679,400
Continual Learning 5 3 3 6,696
Greenhouse Gas Prediction 2 4 1 900
Language Modelling 3 4 2 4,200
Model Unlearning 1 3 1 85,176
Neural Cellular Automata 5 5 1 33,480
Off-Policy RL 7 4 1 38,100
Offline RL 5 10 1 10,602,372
On-Policy MARL 6 17 2 97,431,783,120
On-Policy RL 6 13 3 1,789,383,960
Trajectory Prediction 4 3 3 1,080
Unsupervised Env Design 3 4 1 2,100
Total ~99.3 billion

Module Types by Domain

Each domain decomposes its ML algorithm into editable modules. Representative examples:

On-Policy RL (PPO-style):

  • loss.py — Objective function (surrogate loss, entropy bonus, value loss)
  • networks.py — Policy and value network architectures
  • optim.py — Optimizer configuration and learning rate schedules
  • train.py — Training loop (rollout collection, update steps)
  • activation.py — Activation functions
  • targets.py — Advantage estimation and return computation

Language Modelling:

  • loss.py — Language modeling objective
  • networks.py — Transformer architecture
  • optim.py — Optimizer and schedule

Bayesian Optimization:

  • acq_fn.py — Acquisition function
  • acq_optimizer.py — Acquisition function optimizer
  • sampler.py — Initial point sampling
  • next_queries.py — Query selection strategy
  • surrogate.py — Surrogate model
  • surrogate_optimizer.py — Surrogate training
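The base/edit split is easiest to see on a single module. Below is a minimal sketch of what a baseline-initialised loss.py might contain for On-Policy RL; the function name, signature, and scalar interface are assumptions for illustration, not the repository's actual module contract.

```python
# Hypothetical baseline body for an editable loss.py (On-Policy RL).
# In "empty" initialisation the ADA would receive only the signature
# and the input/output spec in the docstring.
import math

def ppo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss.

    Inputs:  equal-length sequences of per-step log-probabilities under
             the new and old policies, plus advantage estimates.
    Output:  scalar loss to minimise.
    """
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # PPO keeps the pessimistic (smaller) of the two surrogates
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because each module exposes only a fixed signature, an ADA can swap in an arbitrary new body without touching the frozen training loop.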

Evaluation Types

Each task supports three evaluation objectives:

Eval Type Objective Use Case
performance Maximize algorithm quality metric Standard algorithm discovery
energy Minimize energy while exceeding performance threshold Green AI, efficiency research
time Minimize wall-clock time while exceeding performance threshold Practical deployment constraints
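The two constrained eval types (energy, time) amount to a lexicographic objective: feasibility against the performance threshold first, then cost. A minimal sketch, with a hypothetical scoring convention not taken from the paper:

```python
# Hypothetical scoring rule for the constrained eval types: among
# candidates that meet the performance threshold, lower cost (energy
# or wall-clock time) wins; infeasible candidates rank below all
# feasible ones.

def constrained_score(performance, cost, threshold):
    """Return a sortable key where a larger key means a better candidate."""
    if performance < threshold:
        # Infeasible: ranked only by performance, below every feasible one
        return (0, performance)
    # Feasible: ranked by negated cost so cheaper candidates win
    return (1, -cost)

candidates = [
    {"perf": 0.92, "cost": 40.0},  # feasible, cheaper
    {"perf": 0.95, "cost": 55.0},  # feasible, more expensive
    {"perf": 0.80, "cost": 10.0},  # cheap but below threshold
]
best = max(candidates, key=lambda c: constrained_score(c["perf"], c["cost"], 0.9))
# best is the feasible candidate with the lowest cost (perf 0.92, cost 40.0)
```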

Initialization Modes

Mode What the ADA Receives Difficulty
baseline Complete, working reference implementation Easier — improve upon known solution
empty Only function signatures with input/output specs Harder — design from scratch

5 LLM Integration

Role of LLMs in DiscoGen

LLMs serve as Algorithm Discovery Agents (ADAs) that operate on DiscoGen tasks. DiscoGen itself is model-agnostic — it generates tasks that any ADA (LLM-based or otherwise) can attempt.

Evaluated Models

Model DiscoBench Single DiscoBench Single (Until Success) DiscoBench All
GPT-OSS 120B 68.2% success 100.0% success 11.4% success
Devstral2 45.9% success 100.0% success 34.3% success
Deepseek-v3.2 80.0% success 100.0% success 25.7% success

Critical finding: No model consistently outperforms baseline implementations across domains. This is a striking result — even the most capable models, when editing a single module, frequently produce algorithms that are worse than the standard implementation.

LLM as ADA Optimizer

The paper demonstrates prompt optimization in a meta-meta-learning loop:

┌──────────────────────────────────────────────────────────────┐
│                  Meta-Meta-Learning Loop                      │
│                                                              │
│  ┌─────────────┐                                             │
│  │   LLM        │                                            │
│  │  (Prompt     │◄─── Performance feedback from              │
│  │  Optimizer)  │     K_tasks DiscoGen tasks                  │
│  └──────┬──────┘                                             │
│         │ Proposes new ADA prompt                             │
│         ▼                                                    │
│  ┌─────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │  ADA        │───►│  DiscoGen    │───►│  Evaluate    │     │
│  │  (with new  │    │  Task        │    │  Algorithm   │     │
│  │   prompt)   │    │  (sampled)   │    │  (meta-test) │     │
│  └─────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                │              │
│                     Loop for 30 steps ◄────────┘              │
└──────────────────────────────────────────────────────────────┘

The optimizer LLM proposes new system prompts for the ADA based on prior performance traces. This is distinct from the ADA itself — the optimizer sits one level above, performing search over the space of ADA configurations.

Model-Agnostic Design

DiscoGen enforces strict separation between:

  • Task specification — domain, modules, datasets, eval type
  • Agent interface — editable Python files with defined inputs/outputs
  • Evaluation harness — deterministic scoring on meta-train and meta-test

This means DiscoGen can benchmark any algorithm discovery system, not just LLM-based ones. Future systems using neuroevolution, program synthesis, or hybrid approaches can use the same infrastructure.


6 Key Results

DiscoBench Evaluation

DiscoBench Single (edit one module, one attempt):

Model Success Rate Meta-Train Score Meta-Test Score
Baseline (All Fixed) — 1104 [1077, 1136] 1177 [1144, 1211]
GPT-OSS 120B 68.2% 931 [900, 961] 962 [933, 993]
Devstral2 45.9% 886 [850, 922] 808 [771, 842]
Deepseek-v3.2 80.0% 1079 [1050, 1108] 1053 [1020, 1082]

DiscoBench All (edit all modules simultaneously):

Model Success Rate Meta-Train Score Meta-Test Score
Baseline (All Fixed) — 1409 [1297, 1682] 1377 [1212, 1595]
GPT-OSS 120B 11.4% 533 [−183, 700] 597 [−106, 799]
Devstral2 34.3% 873 [751, 1138] 1087 [971, 1322]
Deepseek-v3.2 25.7% 1184 [1069, 1397] 940 [831, 1176]

Difficulty Scaling by Module Count

Model 1 Module 2 Modules 3 Modules 4 Modules
Deepseek-v3.2 75.0% 47.2% 8.3% 0.0%
GPT-OSS-120b 50.0% 11.1% 8.3% 0.0%
Devstral2 29.2% 27.8% 0.0% 0.0%

Key Finding 1: Success rates collapse precipitously with module count. No model succeeds with 4 editable modules. This establishes a clear difficulty gradient — a crucial property for curriculum design in ADA optimization.

Meta-Meta-Learning with Prompt Optimization

K_tasks (unique tasks seen) DiscoBench Success Rate Meta-Train Score Meta-Test Score
1 70.6% 956 [939, 978] 957 [927, 977]
5 75.3% 1014 [1000, 1033] 973 [947, 993]
10 72.0% 969 [949, 989] 1000 [980, 1022]
30 78.7% 1061 [1040, 1079] 1071 [1049, 1096]

Key Finding 2: Meta-test performance improves monotonically with the number of distinct tasks experienced during optimization. Using only 1 task leads to overfitting; 30 tasks yields the best generalization. This validates DiscoGen's core hypothesis that task diversity improves ADA quality.

Generalization Gap

The paper reveals a critical generalization gap: algorithms that perform well on meta-train datasets do not necessarily generalize to meta-test. Rank correlation analysis across DiscoBench tasks shows that the correlation structure between algorithms' meta-train performance breaks down at meta-test time. This vindicates DiscoGen's insistence on separate meta-train/meta-test evaluation.
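This kind of rank-correlation check is easy to reproduce on one's own results. A self-contained Spearman sketch on toy scores (the numbers below are illustrative, not the paper's):

```python
# Spearman rank correlation between meta-train and meta-test scores.
# Toy data only; assumes no tied scores.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])  # rank 1 = smallest
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

meta_train = [10, 20, 30, 40]   # four candidate algorithms, cleanly ordered
meta_test  = [25, 40, 10, 30]   # held-out scores that shuffle the ranking
rho = spearman(meta_train, meta_test)  # 0.0: meta-train rank carries no signal
```

A rho near zero, as in this toy case, is exactly the failure mode the meta-train/meta-test split is designed to expose.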


7 Reproducibility

Open-Source Infrastructure

Component Availability Notes
Generator code GitHub (MIT) Full procedural generator
PyPI package pip install discogen CLI + Python API
DiscoBench configs Included in repo Fixed benchmark configurations
Domain implementations Included in repo 14 domains with baselines
Reference implementations Per-domain _reference.txt Baseline code for all modules
Documentation Comprehensive docs site Usage, contributing, API

Reproducing Results

# Install DiscoGen
pip install discogen

# List available domains
discogen get-domains

# Create a specific DiscoBench task
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml

# Run the task
cd task_src/OnPolicyRL
bash install.sh  # Install domain-specific dependencies
python run_main.py

# Create meta-test evaluation
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml --test
cd task_src
python run_main.py  # Evaluate on held-out datasets

Reproducibility Concerns

Factor Assessment
Task generation Fully deterministic given config parameters — excellent
Evaluation scores Score ranges reported with confidence intervals — good
LLM-based ADA Inherent non-determinism in LLM outputs — moderate
Domain dependencies Each domain has separate requirements; potential version conflicts — manageable via install.sh
Compute requirements Not explicitly quantified per domain — unclear
DiscoBench configs Fixed, included in repository — excellent
Meta-meta-learning Prompt optimization details may vary by LLM provider — moderate

Potential Confounds

  1. Domain-specific dependency conflicts. Each task domain has its own Python requirements that may conflict with others. The install.sh per-domain approach mitigates but doesn't eliminate this.
  2. Baseline implementation quality. The strength of results depends on the quality of reference implementations. If baselines are weak, beating them is less impressive; if strong, the negative results for LLMs are more damning.
  3. Score aggregation. The paper aggregates across diverse domains with different scale metrics. The normalization scheme for cross-domain comparison needs careful scrutiny.

8 Compute and API Costs

Estimated Costs Per Run

The paper does not provide explicit cost breakdowns. Based on the experimental setup:

Component Estimated Cost Notes
Single DiscoBench task 1-60 min compute Varies enormously by domain (GPU needed for RL/CV)
DiscoBench Single evaluation ~35 tasks × cost per task Per model, single attempt
DiscoBench All evaluation ~35 tasks × cost per task All modules editable
Meta-meta-learning (30 steps) 30 × (LLM call + task eval) Plus prompt optimization LLM
Full evaluation suite 3 models × 3 settings Hundreds of task evaluations

Hardware Requirements by Domain

Domain Category Likely Hardware Duration
RL domains (On-Policy, Off-Policy, MARL) GPU (training agents in environments) 10-60 min
CV Classification GPU (training CNNs/ViTs) 5-30 min
Language Modelling GPU (transformer pretraining) 30-120 min
Bayesian Optimization CPU sufficient 1-10 min
Greenhouse Gas Prediction CPU sufficient 1-5 min
Brain Speech Detection GPU (neural decoding) 10-30 min
Trajectory Prediction GPU 10-60 min

LLM API Costs (for ADA)

For an LLM-based ADA attempting a single task:

  • Input context: Task description + editable module templates + reference docs
  • Output: Modified Python files for editable modules
  • Iterations: Typically multiple rounds of edit-test-refine
  • Estimated per-task LLM cost: $0.50-$10 depending on model and iterations

For the meta-meta-learning loop (30 optimization steps):

  • Per step: 1 LLM prompt optimization call + 1 ADA call + 1 task evaluation
  • Total: ~$30-300 for the full optimization run

Note: The dominant cost is task evaluation compute (GPU time), not LLM API calls. A full DiscoBench evaluation across all models and settings likely requires hundreds of GPU-hours.


9 Architecture Solution

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        DiscoGen System Architecture                  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Configuration Layer                        │   │
│  │                                                               │   │
│  │  Task Config (YAML)          Domain Registry                  │   │
│  │  ┌─────────────────┐        ┌──────────────────┐             │   │
│  │  │ task_domain      │        │ OnPolicyRL       │             │   │
│  │  │ editable_modules │───────►│ OffPolicyRL      │             │   │
│  │  │ meta_train       │        │ LanguageModelling│             │   │
│  │  │ meta_test        │        │ CVClassification │             │   │
│  │  │ eval_type        │        │ ... (14 total)   │             │   │
│  │  │ initialisation   │        └──────────────────┘             │   │
│  │  │ backend          │                                         │   │
│  │  └─────────────────┘                                         │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Generation Engine                           │   │
│  │                                                               │   │
│  │  create_task.py                                               │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │   │
│  │  │ Module        │  │ Dataset      │  │ Evaluation   │       │   │
│  │  │ Assembly      │  │ Selection    │  │ Harness      │       │   │
│  │  │              │  │              │  │ Generation   │       │   │
│  │  │ base/*.py    │  │ meta_train[] │  │ run_main.py  │       │   │
│  │  │ edit/*.py    │  │ meta_test[]  │  │ scoring      │       │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Output: task_src/                           │   │
│  │                                                               │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │   │
│  │  │ Editable   │  │ Frozen     │  │ Evaluation │             │   │
│  │  │ Modules    │  │ Baseline   │  │ Pipeline   │             │   │
│  │  │            │  │ Modules    │  │            │             │   │
│  │  │ loss.py    │  │ optim.py   │  │ run_main   │             │   │
│  │  │ networks.py│  │ train.py   │  │ config.yaml│             │   │
│  │  └────────────┘  └────────────┘  └────────────┘             │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    ADA Interface                               │   │
│  │                                                               │   │
│  │  Input: Editable module files + task description              │   │
│  │  Output: Modified module implementations                     │   │
│  │  Evaluation: python run_main.py → score                      │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Design Principles

1. Modularity Over Monolith

Rather than asking an ADA to build an entire ML system, DiscoGen decomposes algorithms into composable modules. This serves multiple purposes:

  • Difficulty control: More editable modules = harder tasks
  • Attribution: Which module improvements drive performance gains?
  • Composability: A discovered loss function from one task can be combined with a discovered optimizer from another

2. Separation of Concerns

Discovery Phase (meta-train):
  ADA modifies editable modules
  └── Trains on meta-train datasets
  └── Iterates and refines

Evaluation Phase (meta-test):
  Discovered algorithm runs on held-out datasets
  └── No further modification allowed
  └── Tests generalization, not overfitting

3. Configuration-Driven Generation

Every aspect of a task is determined by a small YAML configuration. This enables:

  • Deterministic task reproduction
  • Systematic difficulty sweeps
  • Automated curriculum construction
  • Combinatorial explosion of unique tasks

4. Domain Independence

The generator framework is domain-agnostic. Adding a new domain requires implementing:

  • Module base/edit implementations
  • Dataset adapters
  • Evaluation metrics
  • An install.sh for domain-specific dependencies


10 Component Breakdown

Core Components

1. CLI Interface (discogen/cli.py)

# Primary commands
discogen get-domains              # List all 14 supported domains
discogen create-task              # Generate a complete task directory
discogen sample-task-config       # Sample a random task configuration

Key parameters for create-task:

  • --task-domain — Which ML domain
  • --config-path — YAML configuration file
  • --example — Generate with editable (incomplete) modules
  • --test — Generate meta-test evaluation version

2. Configuration System (discogen/create_config.py)

The configuration system defines the combinatorial space of possible tasks:

# Example task configuration
task_domain: OnPolicyRL
meta_train: [Breakout, Freeway]
meta_test: [Asterix, SpaceInvaders]
backend: recurrent
change_loss: true
change_networks: true
change_optim: false
change_train: false
change_activation: false
change_targets: false
eval_type: performance
initialisation: empty

The number of tasks per domain is:

$$N_{\text{tasks}} = (2^m - 1) \times \binom{d}{k_{\text{train}}} \times \binom{d - k_{\text{train}}}{k_{\text{test}}} \times b \times |\text{eval\_types}| \times |\text{init\_modes}|$$

where $m$ is the number of modules, $d$ is the number of datasets, $b$ is the number of backends, and $k_{\text{train}}, k_{\text{test}}$ are the sizes of train/test dataset splits.
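The formula can be evaluated directly as a sanity check. The split sizes k_train and k_test below are assumptions for a toy domain, since the paper fixes them per domain:

```python
# Direct evaluation of the task-count formula above.
from math import comb

def n_tasks(m, d, b, k_train, k_test, n_eval_types=3, n_init_modes=2):
    module_choices = 2 ** m - 1  # every non-empty subset of editable modules
    dataset_splits = comb(d, k_train) * comb(d - k_train, k_test)
    return module_choices * dataset_splits * b * n_eval_types * n_init_modes

# Toy domain: 3 modules, 4 datasets, 1 backend, 2 meta-train / 1 meta-test
count = n_tasks(m=3, d=4, b=1, k_train=2, k_test=1)  # 7 * 6 * 2 * 1 * 3 * 2 = 504
```

The $2^m - 1$ term is what makes module count the dominant difficulty knob: each extra module roughly doubles the combinatorial space.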

3. Domain Implementations (discogen/domains/)

Each domain directory contains:

discogen/domains/OnPolicyRL/
├── base/           # Complete baseline implementations
│   ├── loss.py
│   ├── networks.py
│   ├── optim.py
│   ├── train.py
│   ├── activation.py
│   └── targets.py
├── edit/           # Editable templates (function signatures only)
│   ├── loss.py
│   ├── networks.py
│   └── ...
├── utils/
│   ├── _reference.txt    # Reference documentation for the domain
│   ├── environments.py   # Environment wrappers
│   └── evaluation.py     # Metric computation
├── datasets/       # Dataset configurations
├── config.yaml     # Domain-level defaults
└── install.sh      # Domain-specific dependency installer

4. Task Generation Engine (discogen/create_task.py)

The generation engine assembles a complete, runnable task directory:

  1. Module selection: For each module, choose base (frozen) or edit (editable) version
  2. Dataset assignment: Map meta-train and meta-test datasets
  3. Evaluation setup: Configure scoring metrics and evaluation scripts
  4. Dependency resolution: Ensure inter-module dependencies are satisfied
  5. Output: Self-contained task_src/ directory

5. DiscoBench Configurations (discogen/discobench_configs/)

A fixed set of task configurations for reproducible benchmarking. These configurations are:

  • Curated to cover diverse difficulty levels
  • Balanced across domains
  • Stable across DiscoGen versions
  • Designed for principled meta-train/meta-test evaluation

Supporting Components

Component Purpose
Reference implementations Gold-standard baselines per domain for comparison
Environment wrappers Standardized interfaces for diverse RL environments
Scoring functions Domain-specific metrics (returns, accuracy, loss, etc.)
Dataset loaders Unified data loading across formats and sources
Result aggregation Cross-domain score normalization and reporting

11 Core Mechanisms (Detailed)

Mechanism 1: Procedural Task Generation

The core innovation is treating algorithm discovery tasks as procedurally generated environments — directly analogous to procedural level generation in RL (Minigrid, Procgen, etc.).

The Combinatorial Space:

For a domain with $m$ modules and $d$ datasets, the number of possible module combinations is $2^m - 1$ (at least one module must be editable). Dataset allocation multiplies this further. The resulting space is vast:

Domain Modules Datasets Unique Tasks
On-Policy MARL 6 17 97.4 billion
On-Policy RL 6 13 1.8 billion
Bayesian Optimization 6 11 65.4 million
Offline RL 5 10 10.6 million

Task Sampling:

# Sample a uniformly random task configuration
discogen sample-task-config --config-dest random_task.yaml

# The config specifies ALL task parameters
# Domain, modules, datasets, eval type, initialization

This enables curriculum strategies where the distribution over tasks evolves based on the ADA's current capabilities — precisely the UED (Unsupervised Environment Design) paradigm applied to algorithm discovery.
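One minimal instantiation of such a curriculum is a failure-weighted sampler over module counts. None of the function names or the sampling rule below come from DiscoGen; they only illustrate how a task distribution could adapt to the ADA:

```python
# Hypothetical failure-weighted curriculum over editable-module counts.
import random

def curriculum_weights(success_rates):
    """Weight each difficulty level by rate * (1 - rate): levels the ADA
    sometimes solves and sometimes fails carry the most learning signal."""
    return {lvl: max(rate * (1 - rate), 1e-3) for lvl, rate in success_rates.items()}

def sample_level(success_rates, rng=random.Random(0)):
    w = curriculum_weights(success_rates)
    levels, weights = zip(*sorted(w.items()))
    return rng.choices(levels, weights=weights, k=1)[0]

# Per-level success rates (illustrative numbers shaped like the paper's
# module-count table: success collapses as more modules become editable)
rates = {1: 0.75, 2: 0.47, 3: 0.08, 4: 0.0}
level = sample_level(rates)  # most often draws the 2-module level
```

Levels near 50% success dominate the distribution, which is the usual UED intuition: tasks at the frontier of the agent's ability carry the most training signal.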

Mechanism 2: Modular Algorithm Decomposition

Each ML algorithm is decomposed into semantically meaningful, independently editable modules:

Algorithm = Module_1 ⊕ Module_2 ⊕ ... ⊕ Module_m

Example (PPO):
  PPO = Loss ⊕ Networks ⊕ Optimizer ⊕ Train_Loop ⊕ Activation ⊕ Targets

Each module has:
  - Defined inputs (tensor shapes, types)
  - Defined outputs (tensor shapes, types)
  - Base implementation (working baseline)
  - Edit template (signatures only)

Why This Matters:

  1. Controlled complexity. Editing 1 module is fundamentally easier than editing 6. This provides a natural difficulty gradient.
  2. Attribution. If an ADA improves the loss function, we know exactly which component drove the improvement.
  3. Composability. A novel loss function discovered for CIFAR-10 can be tested on CIFAR-100 without modification.
  4. Research focus. Researchers can study "loss function discovery" or "optimizer discovery" in isolation.

Mechanism 3: Meta-Train/Meta-Test Separation

This is perhaps DiscoGen's most methodologically important contribution. Every task enforces a strict separation:

┌─────────────────────────────────┐    ┌─────────────────────────────────┐
│         META-TRAIN               │    │         META-TEST                │
│                                  │    │                                  │
│  ADA has access to these         │    │  ADA has NEVER seen these        │
│  datasets during discovery       │    │  datasets — held out completely  │
│                                  │    │                                  │
│  Example:                        │    │  Example:                        │
│  - Breakout (Atari)             │    │  - Asterix (Atari)              │
│  - Freeway (Atari)              │    │  - SpaceInvaders (Atari)        │
│                                  │    │                                  │
│  ADA iterates on these:         │    │  Discovered algorithm evaluated  │
│  edit code → train → evaluate   │    │  here with NO modifications      │
│  → edit code → train → ...      │    │                                  │
└─────────────────────────────────┘    └─────────────────────────────────┘

The Problem This Solves:

Prior algorithm discovery benchmarks evaluate on the same datasets used during discovery. An ADA could achieve high scores by:

  • Overfitting hyperparameters to specific datasets
  • Hardcoding dataset-specific tricks
  • Memorizing training data statistics

DiscoGen's meta-test evaluation ensures that only genuinely novel, generalizable algorithms score well.

Empirical Validation:

The paper shows that rank correlation between algorithms' meta-train and meta-test performance is weak — algorithms that look good during discovery often fail to generalize. This vindicates the split design.

Mechanism 4: Meta-Meta-Learning Loop

The paper demonstrates using DiscoGen for optimizing the optimizer — a meta-meta-learning loop where the ADA's prompt is itself evolved:

Outer loop (meta-meta-learning):
  for step in range(30):
    task = DiscoGen.sample()           # Fresh task each step
    score = ADA(prompt, task)          # ADA discovers algorithm
    prompt = Optimizer(prompt, score)  # LLM proposes better prompt

Inner loop (meta-learning / algorithm discovery):
  for iteration in range(N):
    code = ADA.edit(modules)           # ADA modifies editable code
    train_score = evaluate(code, meta_train)
    if converged: break

Evaluation:
  test_score = evaluate(code, meta_test)  # Generalization test
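The two loops above can be made concrete with stubbed components. None of the names below are DiscoGen APIs; the "ADA" and "optimizer" are toy stand-ins so the control flow can be executed end to end:

```python
# Runnable sketch of the outer meta-meta-learning loop with toy stand-ins.
import random

rng = random.Random(0)

def sample_task():
    """A fresh task each outer step (here: just a difficulty scalar)."""
    return rng.uniform(0.0, 1.0)

def ada_discover(prompt_quality, difficulty):
    """Inner discovery loop collapsed to one call: better prompts score higher."""
    return prompt_quality - difficulty + rng.gauss(0.0, 0.05)

best_prompt, best_avg = 0.0, float("-inf")
for step in range(30):                              # outer loop: 30 steps
    candidate = best_prompt + rng.gauss(0.0, 0.1)   # optimizer proposes a prompt
    tasks = [sample_task() for _ in range(5)]       # K_tasks = 5 fresh tasks
    avg = sum(ada_discover(candidate, t) for t in tasks) / len(tasks)
    if avg > best_avg:                              # greedy accept on average score
        best_prompt, best_avg = candidate, avg
```

Averaging each candidate prompt over several fresh tasks is the mechanism the K_tasks experiment varies: with only one task per step, the accepted prompt overfits to that task's idiosyncrasies.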

Key Experimental Finding:

The number of distinct DiscoGen tasks experienced during optimization is the critical variable:

Tasks Seen Overfitting Risk Meta-Test Performance
1 High — prompt specializes to single task Lowest
5 Moderate Moderate
10 Lower Good
30 Lowest Best (1071 score)

Meta-test performance improves monotonically with task diversity. This is the clearest empirical evidence that DiscoGen's scale enables genuine learning — not just memorization.

Mechanism 5: Evaluation Harness

Each generated task includes a complete, self-contained evaluation pipeline:

# Generated run_main.py (simplified)
from importlib import import_module  # standard-library dynamic import

def evaluate_algorithm():
    # Load meta-train or meta-test datasets
    datasets = load_datasets(config.datasets)

    # Import editable modules (ADA-modified or baseline)
    loss_fn = import_module("loss")
    networks = import_module("networks")
    optimizer = import_module("optim")

    # Train using the composed algorithm
    model = train(networks, loss_fn, optimizer, datasets.train)

    # Evaluate
    score = evaluate(model, datasets.test)

    return score

The evaluation harness:

  • Is deterministic (fixed seeds for reproducibility)
  • Reports normalized scores with confidence intervals
  • Supports three evaluation types (performance, energy, time)
  • Can be run headlessly for automated optimization loops


12 Programming Language

Implementation Stack

Component Language Framework
DiscoGen core Python Click (CLI), YAML (configs)
Domain implementations Python PyTorch, JAX (domain-dependent)
RL environments Python Gymnax, MinAtar, Brax, Craftax
Bayesian optimization Python GPyTorch, BoTorch
Language modelling Python PyTorch, Transformers
Build system Makefile + uv Modern Python packaging
Documentation MkDocs Deployed to GitHub Pages

Package Management

DiscoGen uses uv for dependency management:

make install        # Sets up environment + pre-commit hooks
uv run discogen ... # Run CLI commands

Each domain has isolated dependencies installed via install.sh, addressing the challenge of conflicting requirements across 14 diverse ML domains (e.g., JAX for RL vs. PyTorch for CV).

Code Structure

```
discogen/
├── discobench_configs/    # Fixed benchmark task configurations
├── domains/               # 14 domain implementations
│   ├── BayesianOptimisation/
│   ├── BrainSpeechDetection/
│   ├── ComputerVisionClassification/
│   ├── ContinualLearning/
│   ├── GreenhouseGasPrediction/
│   ├── LanguageModelling/
│   ├── ModelUnlearning/
│   ├── NeuralCellularAutomata/
│   ├── OfflineRL/
│   ├── OffPolicyRL/
│   ├── OnPolicyMARL/
│   ├── OnPolicyRL/
│   ├── TrajectoryPrediction/
│   └── UnsupervisedEnvironmentDesign/
├── utils/                 # Shared utilities
├── create_task.py         # Task generation engine
├── create_config.py       # Configuration utilities
└── cli.py                 # Click-based CLI
```

Language Choice Rationale

Python is the natural choice given:

1. All 14 target ML domains are predominantly Python-based
2. The primary consumers (LLM-based ADAs) generate Python code
3. The ML ecosystem (PyTorch, JAX, scikit-learn) is Python-native
4. The editable modules are Python — this is algorithm discovery, not code translation


13 Memory Management

Task-Level Isolation

DiscoGen generates self-contained task directories. Each task runs in its own process with its own memory space. There is no shared state between task evaluations, which is essential for:

- Parallel evaluation of multiple tasks
- Fault isolation when ADA-generated code crashes
- Reproducibility of individual task scores
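A minimal sketch of this isolation pattern, assuming each task exposes an entry point that can be launched as a fresh Python process and prints its score; the helper name and the toy snippets are hypothetical, not DiscoGen's actual runner:

```python
import subprocess
import sys

def run_task_in_subprocess(snippet):
    """Execute one task's entry point in a fresh Python process.
    A crash in ADA-generated code is confined to that process."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode != 0:
        return None                        # fault isolated: record failure, keep going
    return float(proc.stdout.strip())      # the task prints its score

# Toy usage: one healthy task, one that crashes
scores = [
    run_task_in_subprocess("print(0.91)"),
    run_task_in_subprocess("raise RuntimeError('bad generated code')"),
]
print(scores)  # [0.91, None]
```

Because every evaluation is a separate OS process, the crash leaves the parent loop (and all sibling evaluations) untouched.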

Domain-Specific Memory Considerations

| Domain | Memory Profile | GPU Memory | Notes |
|---|---|---|---|
| On-Policy RL | Moderate | 2-8 GB | Environment rollouts + policy network |
| On-Policy MARL | High | 4-16 GB | Multiple agents + shared environment |
| Language Modelling | High | 8-40 GB | Transformer parameters + attention |
| CV Classification | Moderate | 4-12 GB | CNN/ViT + image batches |
| Bayesian Optimization | Low | 0-2 GB | Gaussian process + acquisition |
| Offline RL | Moderate | 2-8 GB | Replay buffer + networks |
| Neural Cellular Automata | Low-Moderate | 1-4 GB | Grid state + update rules |

Scaling Properties

DiscoGen's procedural generation is itself lightweight — task generation requires negligible compute and memory. The cost is in task evaluation, which scales with:

- Domain complexity (language modelling >> greenhouse gas prediction)
- Dataset size (CIFAR-100 >> MNIST)
- Number of meta-train datasets (more datasets = longer training)
- Evaluation type (energy/time require multiple runs for measurement)

State Management in Meta-Meta-Learning

The prompt optimization loop maintains:

- Prompt history: all attempted prompts and their scores (~KB scale)
- Score history: performance on each task attempted (~KB scale)
- Best prompt: current best-performing ADA configuration
- DiscoGen configs: sampled task specifications (~KB per task)

This is extremely lightweight — the dominant memory cost is always within the inner loop (ADA executing on a task).
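A sketch of this outer loop and the lightweight state it carries. All names (`propose_prompt`, `sample_task`, `run_ada`) are illustrative stand-ins, not DiscoGen's API; the real cost sits entirely inside `run_ada`:

```python
import random

def outer_loop(propose_prompt, sample_task, run_ada, n_iters=4):
    """Prompt-optimization outer loop: the state it keeps (histories,
    best prompt) is tiny compared with inner-loop training."""
    prompt_history, score_history = [], []   # ~KB scale
    best_prompt, best_score = None, float("-inf")
    for _ in range(n_iters):
        prompt = propose_prompt(prompt_history, score_history)
        task = sample_task()                 # a sampled DiscoGen config
        score = run_ada(prompt, task)        # inner loop: the real cost
        prompt_history.append(prompt)
        score_history.append(score)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

# Toy usage with stand-in components
random.seed(0)
best, score = outer_loop(
    propose_prompt=lambda ph, sh: f"prompt-{len(ph)}",
    sample_task=lambda: {"domain": "OnPolicyRL", "modules": ["loss"]},
    run_ada=lambda prompt, task: random.random(),
)
print(best, round(score, 2))
```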


14 Continued Learning

Built-In Curriculum Support

DiscoGen's parameterized task generation naturally enables curriculum learning for ADAs:

Difficulty Axes:

  1. Module count: 1 (easy) → 6 (nearly impossible with current models)
  2. Initialization: baseline (easier) → empty (harder)
  3. Dataset complexity: MNIST (simple) → TinyImageNet (complex)
  4. Domain familiarity: Well-studied domains → novel combinations
  5. Evaluation type: performance (standard) → energy/time (constrained)

Curriculum Strategies Enabled:

| Strategy | Description |
|---|---|
| Progressive module addition | Start with 1 editable module, gradually add more |
| Domain transfer | Train on simple domains, evaluate on complex ones |
| Initialization escalation | Start with baseline code, progress to empty templates |
| Dataset difficulty ramping | Begin with MNIST, advance to CIFAR-100, TinyImageNet |
| Eval type progression | Master performance → add efficiency constraints |
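The difficulty axes can be combined into a simple progressive sampler. This is a hypothetical sketch, not DiscoGen's actual curriculum interface; the module and dataset pools are illustrative:

```python
import random

# Difficulty axes from the text; names are illustrative, not the real API
MODULE_POOL = ["loss", "networks", "optim", "train", "data", "targets"]
DATASETS = ["MNIST", "CIFAR-10", "CIFAR-100", "TinyImageNet"]

def sample_task_config(stage, rng):
    """Progressive curriculum: later stages expose more editable modules,
    harder datasets, and eventually empty (from-scratch) templates."""
    n_modules = min(1 + stage, len(MODULE_POOL))
    return {
        "modules": rng.sample(MODULE_POOL, n_modules),
        "dataset": DATASETS[min(stage, len(DATASETS) - 1)],
        "init": "baseline" if stage < 2 else "empty",
    }

rng = random.Random(0)
for stage in range(4):
    cfg = sample_task_config(stage, rng)
    print(stage, len(cfg["modules"]), cfg["dataset"], cfg["init"])
```

Each stage tightens several axes at once; a real curriculum designer would likely adapt the schedule to the ADA's measured success rate rather than fix it in advance.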

Research Directions Proposed

The paper outlines several ambitious research directions:

1. Algorithm World Models

Train a model to predict algorithm performance from code without executing it. DiscoGen provides the training data: (code, configuration, score) triples at scale.
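A toy illustration of the idea, using a deliberately crude bag-of-tokens similarity in place of a learned code embedding; all names, snippets, and scores below are invented:

```python
from collections import Counter

def featurize(code):
    """Bag-of-tokens featurization — a crude stand-in for a learned
    code embedding."""
    return Counter(code.replace("(", " ").replace(")", " ").split())

def similarity(a, b):
    """Jaccard-style overlap between two token multisets."""
    shared = sum(min(a[t], b[t]) for t in a)
    return shared / max(1, sum(a.values()) + sum(b.values()) - shared)

def predict_score(code, triples):
    """World-model sketch: predict performance from code alone by
    similarity-weighted averaging over (code, config, score) triples."""
    feats = featurize(code)
    weights = [(similarity(feats, featurize(c)), s) for c, _, s in triples]
    total = sum(w for w, _ in weights) or 1.0
    return sum(w * s for w, s in weights) / total

# Invented training triples of the (code, configuration, score) shape
triples = [
    ("loss = mse(pred, target)", {"domain": "CV"}, 0.6),
    ("loss = mse(pred, target) + 0.1 * l2(params)", {"domain": "CV"}, 0.8),
]
print(round(predict_score("loss = mse(pred, target)", triples), 2))
```

The point is the data shape, not the model: DiscoGen supplies these triples at a scale where a genuinely learned predictor becomes plausible.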

2. Curriculum Learning for ADAs

Automatically design training curricula that maximize ADA generalization. This is UED (Unsupervised Environment Design) applied to the algorithm discovery setting — a recursive application where DiscoGen itself becomes the environment generator.

3. Tree Search for Discovery

Apply MCTS or similar search methods to navigate the space of module modifications, using DiscoGen tasks as the evaluation oracle.
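As a hedged simplification of such a search, the sketch below runs best-first search over candidate modifications with the evaluation harness as the scoring oracle; a toy integer "algorithm space" stands in for module edits, and full MCTS would add rollouts and visit statistics on top of this skeleton:

```python
import heapq

def best_first_search(initial, neighbors, score, budget=20):
    """Best-first search over module modifications, treating the task's
    evaluation harness as a black-box scoring oracle."""
    seen = {initial}
    frontier = [(-score(initial), initial)]
    best = frontier[0]
    while frontier and budget > 0:
        neg, state = heapq.heappop(frontier)
        best = min(best, (neg, state))
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                budget -= 1                  # each eval costs one oracle call
                heapq.heappush(frontier, (-score(nxt), nxt))
    return best[1], -best[0]

# Toy oracle: the "algorithm" is an int, and the score peaks at 7
state, value = best_first_search(
    initial=0,
    neighbors=lambda s: [s - 1, s + 1],
    score=lambda s: -abs(s - 7),
)
print(state, value)
```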

4. Multi-Task Algorithm Discovery

Discover algorithms that work well across multiple domains simultaneously, leveraging DiscoGen's cross-domain task coverage.

5. Foundation Models for Algorithm Discovery

Train specialized models on large volumes of DiscoGen task data, analogous to how foundation models are trained on internet text.

Extensibility

DiscoGen is designed for community contribution:

Adding a new domain requires:
1. Implement base/ modules (working baseline)
2. Implement edit/ templates (function signatures)
3. Define dataset configurations
4. Implement evaluation metrics
5. Write install.sh for dependencies
6. Add domain to registry
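The final registry step might look like the following sketch. The `DomainSpec` shape and the `ProteinFolding` domain are hypothetical, invented to show the checklist's inputs coming together, and do not reflect DiscoGen's real registry code:

```python
from dataclasses import dataclass

# Illustrative registry shape — the real DiscoGen registry may differ.
@dataclass
class DomainSpec:
    name: str
    base_modules: list            # working baseline implementations
    edit_modules: list            # templates exposing function signatures
    datasets: list
    metric: str
    install_script: str = "install.sh"

REGISTRY: dict = {}

def register(spec: DomainSpec):
    """Step 6 of the checklist: make the new domain discoverable."""
    REGISTRY[spec.name] = spec
    return spec

register(DomainSpec(
    name="ProteinFolding",                     # hypothetical new domain
    base_modules=["loss.py", "networks.py", "train.py"],
    edit_modules=["loss.py", "networks.py"],
    datasets=["toy_proteins"],
    metric="validation_rmsd",
))
print(sorted(REGISTRY))
```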

The documentation includes detailed contributing guides for:

- Adding new task domains
- Integrating new datasets
- Defining new evaluation types
- Contributing to DiscoBench

Version Evolution

As DiscoGen grows, DiscoBench provides stability:

- The generator evolves (new domains, datasets, backends)
- DiscoBench configurations remain fixed for comparability
- New DiscoBench versions can be released periodically
- Historical results remain valid against their DiscoBench version


15 Applications

Primary Application: ADA Optimization

DiscoGen's primary use case is training and evaluating Algorithm Discovery Agents:

For ADA Developers:
  1. Use DiscoGen to generate training tasks
  2. Run ADA optimization loop (prompt tuning, architecture search, etc.)
  3. Evaluate on DiscoBench for principled comparison
  4. Publish results with standardized metrics

Concrete Application Domains

| Application | DiscoGen Domain | Module Focus | Potential Impact |
|---|---|---|---|
| Novel RL algorithms | OnPolicyRL, OffPolicyRL | loss, train | Discovery of PPO successors |
| Efficient training | LanguageModelling | optim, loss | Reduced pretraining costs |
| Better vision classifiers | CVClassification | networks, loss | Architecture discovery |
| Multi-agent coordination | OnPolicyMARL | loss, targets, train | New MARL algorithms |
| Self-driving prediction | TrajectoryPrediction | networks, loss | Safer autonomous vehicles |
| Brain-computer interfaces | BrainSpeechDetection | networks, loss | Better neural decoders |
| Climate science | GreenhouseGasPrediction | model, data_processing | Improved forecasting |
| ML safety | ModelUnlearning | loss | Better forgetting algorithms |
| Open-ended learning | NeuralCellularAutomata | update, perceive | Artificial life research |
| RL generalization | UnsupervisedEnvDesign | sample_levels, train_step | More robust RL agents |
| Offline RL | OfflineRL | actor_loss, critic_loss | Learning from logged data |
| Continual learning | ContinualLearning | regularizer, replay | Catastrophic forgetting solutions |

Cross-Cutting Applications

1. Automated ML Research

DiscoGen enables a new paradigm: automated research assistants that discover novel algorithms without human guidance. The meta-meta-learning results show this is feasible — prompt-optimized ADAs outperform naive ones.

2. Benchmark Design

DiscoGen's methodology — procedural generation with meta-train/meta-test separation — can be applied to other evaluation domains beyond algorithm discovery.

3. ML Education

The modular decomposition provides an excellent teaching tool. Students can understand PPO by editing individual components and observing the impact.

4. Algorithm Portfolio Construction

By running ADAs across thousands of DiscoGen tasks, researchers can build portfolios of algorithms suited to different settings — analogous to algorithm selection in combinatorial optimization.
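A minimal sketch of per-setting portfolio selection from such a score matrix; the algorithm names and scores are illustrative:

```python
def build_portfolio(scores):
    """Given scores[algorithm][setting], pick the best algorithm for each
    setting — per-setting algorithm selection, as in combinatorial
    optimization portfolios."""
    settings = next(iter(scores.values())).keys()
    portfolio = {}
    for setting in settings:
        portfolio[setting] = max(scores, key=lambda a: scores[a][setting])
    return portfolio

# Toy scores over two DiscoGen settings (invented numbers)
scores = {
    "algo_A": {"OnPolicyRL": 0.9, "LanguageModelling": 0.4},
    "algo_B": {"OnPolicyRL": 0.7, "LanguageModelling": 0.8},
}
print(build_portfolio(scores))
```

At DiscoGen scale the interesting question becomes which task features predict which portfolio member wins — itself a learnable selection problem.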

Relationship to Other Systems

| System | Relationship to DiscoGen |
|---|---|
| FunSearch (Google DeepMind) | FunSearch is an ADA; DiscoGen provides tasks for it |
| AlphaEvolve (Google DeepMind) | AlphaEvolve is an ADA; DiscoGen could evaluate it |
| OpenELM | OpenELM is an ADA; DiscoGen provides benchmarking |
| EvoTorch | EvoTorch provides optimization; DiscoGen provides problems |
| AutoML frameworks | AutoML searches hyperparameters; DiscoGen generates algorithms |
| Neural Architecture Search | NAS searches architectures; DiscoGen includes this via networks.py modules |

Strategic Position: DiscoGen occupies a unique niche as infrastructure for algorithm discovery research. It doesn't discover algorithms itself — it generates the problems that algorithm discovery systems solve. This makes it complementary to, rather than competitive with, every ADA in the field.


References

@misc{goldie2026proceduralgenerationalgorithmdiscovery,
  title={Procedural Generation of Algorithm Discovery Tasks in Machine Learning},
  author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani 
          and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz 
          and Michael Beukman and Alistair Letcher and Shashank Reddy 
          and Clarisse Wibault and Theo Wolf and Charles O'Neill 
          and Uljad Berdica and Nicholas Roberts and Saeed Rahmani 
          and Hannah Erlebach and Roberta Raileanu and Shimon Whiteson 
          and Jakob N. Foerster},
  year={2026},
  eprint={2603.17863},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.17863},
}