DiscoGen

Procedural generator of algorithm discovery tasks spanning ~99.3 billion unique ML problems across 14 domains, enabling meta-meta-learning for evolutionary optimization of Algorithm Discovery Agents.

Organization: University of Oxford, UC Santa Barbara, UCL, and collaborators
Published: March 18, 2026
Type: paper + open-source repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Procedural Generation of Algorithm Discovery Tasks in Machine Learning

System Name: DiscoGen

Paper: arXiv:2603.17863 (cs.LG, cs.AI)

Repository: github.com/AlexGoldie/discogen — MIT License, 31 stars

Project Website: disco-gen.github.io

Documentation: alexgoldie.github.io/discogen

PyPI Package: pip install discogen (v1.0.0)

Submission Date: March 18, 2026

License: MIT

Positioning Statement: DiscoGen is not a benchmark — it is a procedural generator of algorithm discovery tasks. Where existing suites provide tens of static problems, DiscoGen spans billions of parameterized tasks across 14 ML domains, enabling the first principled meta-meta-learning loops for optimizing Algorithm Discovery Agents (ADAs).


2 Authors and Team

Role Authors
Lead Author Alexander D. Goldie (University of Oxford)
Core Contributors Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz
Task Contributors Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach
Equal Supervision Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster

Institutional Affiliations:

  • University of Oxford (primary)
  • UC Santa Barbara
  • University College London (UCL)
  • Additional collaborating institutions

Team Size: 20 authors — one of the largest collaborative efforts in algorithm discovery research, reflecting the enormous scope of implementing and validating 14 distinct task domains with associated datasets, evaluation pipelines, and modular decompositions.

Key Intellectual Lineage: Jakob Foerster's group at Oxford has been central to multi-agent RL and meta-learning research. Shimon Whiteson brings deep RL and automated algorithm design expertise. Roberta Raileanu contributes procedural generation methodology from the RL generalization literature.


3 Core Contribution

The Problem

Automated Algorithm Discovery (AAD) — using AI systems to discover novel ML algorithms — is a rapidly growing field. Systems like FunSearch, AlphaEvolve, and OpenELM have demonstrated that LLMs can discover novel optimizers, loss functions, and training procedures. However, the field faces a critical infrastructure gap:

  1. Tiny evaluation suites. Existing benchmarks contain tens of tasks at most, leading to overfitting and unreliable comparisons.
  2. No meta-train/meta-test separation. Most suites evaluate discovered algorithms on the same datasets used during discovery — confounding genuine algorithm quality with dataset-specific tuning.
  3. Data contamination. Static task sets risk contamination in LLM training corpora.
  4. Saturated problems. Many tasks are solved or nearly solved, providing no signal for improvement.
  5. Narrow domain coverage. Most benchmarks target a single ML subfield.

The Solution

DiscoGen addresses all five problems through procedural generation:

┌─────────────────────────────────────────────────────────────┐
│                    DiscoGen Generator                        │
│                                                             │
│  Configuration Parameters                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │ task_domain: OnPolicyRL                          │       │
│  │ editable_modules: [loss, networks]               │       │
│  │ meta_train: [Breakout, Freeway]                  │       │
│  │ meta_test: [Asterix, SpaceInvaders]              │       │
│  │ eval_type: performance                           │       │
│  │ initialisation: empty | baseline                 │       │
│  │ backend: recurrent | feedforward                 │       │
│  └──────────────────────────────────────────────────┘       │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │         Complete Runnable Task Directory          │       │
│  │  task_src/                                       │       │
│  │  ├── loss.py          (editable)                 │       │
│  │  ├── networks.py      (editable)                 │       │
│  │  ├── optim.py         (baseline, frozen)         │       │
│  │  ├── train.py         (baseline, frozen)         │       │
│  │  ├── run_main.py      (evaluation harness)       │       │
│  │  └── config.yaml      (task specification)       │       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  Task Space: ~99.3 billion unique tasks                     │
└─────────────────────────────────────────────────────────────┘

Three Levels of Contribution

Level Contribution Impact
Generator Procedural task generator spanning 14 domains Unlimited unique tasks for ADA optimization
Benchmark DiscoBench — fixed subset for principled evaluation Reproducible comparisons with meta-train/meta-test split
Research Directions Meta-meta-learning, curriculum learning, algorithm world models New research paradigm for agent optimization

Key Insight: By treating algorithm discovery tasks as procedurally generated environments (analogous to procedural level generation in RL), DiscoGen transforms ADA optimization from a few-shot benchmark problem into a genuine learning problem with distribution, generalization, and curriculum.


4 Supported Solutions

Task Domains

DiscoGen supports 14 distinct ML domains, each decomposed into modular algorithm components:

Domain Modules (m) Datasets (d) Backends (b) Total Tasks
Bayesian Optimization 6 11 1 65,413,656
Brain Speech Detection 3 7 1 81,144
Computer Vision Classification 4 9 1 1,679,400
Continual Learning 5 3 3 6,696
Greenhouse Gas Prediction 2 4 1 900
Language Modelling 3 4 2 4,200
Model Unlearning 1 3 1 85,176
Neural Cellular Automata 5 5 1 33,480
Off-Policy RL 7 4 1 38,100
Offline RL 5 10 1 10,602,372
On-Policy MARL 6 17 2 97,431,783,120
On-Policy RL 6 13 3 1,789,383,960
Trajectory Prediction 4 3 3 1,080
Unsupervised Env Design 3 4 1 2,100
Total ~99.3 billion

Module Types by Domain

Each domain decomposes its ML algorithm into editable modules. Representative examples:

On-Policy RL (PPO-style):

  • loss.py — Objective function (surrogate loss, entropy bonus, value loss)
  • networks.py — Policy and value network architectures
  • optim.py — Optimizer configuration and learning rate schedules
  • train.py — Training loop (rollout collection, update steps)
  • activation.py — Activation functions
  • targets.py — Advantage estimation and return computation

Language Modelling:

  • loss.py — Language modeling objective
  • networks.py — Transformer architecture
  • optim.py — Optimizer and schedule

Bayesian Optimization:

  • acq_fn.py — Acquisition function
  • acq_optimizer.py — Acquisition function optimizer
  • sampler.py — Initial point sampling
  • next_queries.py — Query selection strategy
  • surrogate.py — Surrogate model
  • surrogate_optimizer.py — Surrogate training
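The base/edit split is easiest to see on a single module. Below is a minimal sketch of what a baseline-initialised loss.py might contain for On-Policy RL; the function name, signature, and scalar interface are assumptions for illustration, not the repository's actual module contract.

```python
# Hypothetical baseline body for an editable loss.py (On-Policy RL).
# In "empty" initialisation the ADA would receive only the signature
# and the input/output spec in the docstring.
import math

def ppo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss.

    Inputs:  equal-length sequences of per-step log-probabilities under
             the new and old policies, plus advantage estimates.
    Output:  scalar loss to minimise.
    """
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # PPO keeps the pessimistic (smaller) of the two surrogates
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because each module exposes only a fixed signature, an ADA can swap in an arbitrary new body without touching the frozen training loop.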

Evaluation Types

Each task supports three evaluation objectives:

Eval Type Objective Use Case
performance Maximize algorithm quality metric Standard algorithm discovery
energy Minimize energy while exceeding performance threshold Green AI, efficiency research
time Minimize wall-clock time while exceeding performance threshold Practical deployment constraints
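The two constrained eval types (energy, time) amount to a lexicographic objective: feasibility against the performance threshold first, then cost. A minimal sketch, with a hypothetical scoring convention not taken from the paper:

```python
# Hypothetical scoring rule for the constrained eval types: among
# candidates that meet the performance threshold, lower cost (energy
# or wall-clock time) wins; infeasible candidates rank below all
# feasible ones.

def constrained_score(performance, cost, threshold):
    """Return a sortable key where a larger key means a better candidate."""
    if performance < threshold:
        # Infeasible: ranked only by performance, below every feasible one
        return (0, performance)
    # Feasible: ranked by negated cost so cheaper candidates win
    return (1, -cost)

candidates = [
    {"perf": 0.92, "cost": 40.0},  # feasible, cheaper
    {"perf": 0.95, "cost": 55.0},  # feasible, more expensive
    {"perf": 0.80, "cost": 10.0},  # cheap but below threshold
]
best = max(candidates, key=lambda c: constrained_score(c["perf"], c["cost"], 0.9))
# best is the feasible candidate with the lowest cost (perf 0.92, cost 40.0)
```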

Initialization Modes

Mode What the ADA Receives Difficulty
baseline Complete, working reference implementation Easier — improve upon known solution
empty Only function signatures with input/output specs Harder — design from scratch

5 LLM Integration

Role of LLMs in DiscoGen

LLMs serve as Algorithm Discovery Agents (ADAs) that operate on DiscoGen tasks. DiscoGen itself is model-agnostic — it generates tasks that any ADA (LLM-based or otherwise) can attempt.

Evaluated Models

Model DiscoBench Single DiscoBench Single (Until Success) DiscoBench All
GPT-OSS 120B 68.2% success 100.0% success 11.4% success
Devstral2 45.9% success 100.0% success 34.3% success
Deepseek-v3.2 80.0% success 100.0% success 25.7% success

Critical finding: No model consistently outperforms baseline implementations across domains. This is a striking result — even the most capable models, when editing a single module, frequently produce algorithms that are worse than the standard implementation.

LLM as ADA Optimizer

The paper demonstrates prompt optimization in a meta-meta-learning loop:

┌──────────────────────────────────────────────────────────────┐
│                  Meta-Meta-Learning Loop                      │
│                                                              │
│  ┌─────────────┐                                             │
│  │   LLM        │                                            │
│  │  (Prompt     │◄─── Performance feedback from              │
│  │  Optimizer)  │     K_tasks DiscoGen tasks                  │
│  └──────┬──────┘                                             │
│         │ Proposes new ADA prompt                             │
│         ▼                                                    │
│  ┌─────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │  ADA        │───►│  DiscoGen    │───►│  Evaluate    │     │
│  │  (with new  │    │  Task        │    │  Algorithm   │     │
│  │   prompt)   │    │  (sampled)   │    │  (meta-test) │     │
│  └─────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                │              │
│                     Loop for 30 steps ◄────────┘              │
└──────────────────────────────────────────────────────────────┘

The optimizer LLM proposes new system prompts for the ADA based on prior performance traces. This is distinct from the ADA itself — the optimizer sits one level above, performing search over the space of ADA configurations.

Model-Agnostic Design

DiscoGen enforces strict separation between:

  • Task specification — domain, modules, datasets, eval type
  • Agent interface — editable Python files with defined inputs/outputs
  • Evaluation harness — deterministic scoring on meta-train and meta-test

This means DiscoGen can benchmark any algorithm discovery system, not just LLM-based ones. Future systems using neuroevolution, program synthesis, or hybrid approaches can use the same infrastructure.


6 Key Results

DiscoBench Evaluation

DiscoBench Single (edit one module, one attempt):

Model Success Rate Meta-Train Score Meta-Test Score
Baseline (All Fixed) — 1104 [1077, 1136] 1177 [1144, 1211]
GPT-OSS 120B 68.2% 931 [900, 961] 962 [933, 993]
Devstral2 45.9% 886 [850, 922] 808 [771, 842]
Deepseek-v3.2 80.0% 1079 [1050, 1108] 1053 [1020, 1082]

DiscoBench All (edit all modules simultaneously):

Model Success Rate Meta-Train Score Meta-Test Score
Baseline (All Fixed) — 1409 [1297, 1682] 1377 [1212, 1595]
GPT-OSS 120B 11.4% 533 [−183, 700] 597 [−106, 799]
Devstral2 34.3% 873 [751, 1138] 1087 [971, 1322]
Deepseek-v3.2 25.7% 1184 [1069, 1397] 940 [831, 1176]

Difficulty Scaling by Module Count

Model 1 Module 2 Modules 3 Modules 4 Modules
Deepseek-v3.2 75.0% 47.2% 8.3% 0.0%
GPT-OSS-120b 50.0% 11.1% 8.3% 0.0%
Devstral2 29.2% 27.8% 0.0% 0.0%

Key Finding 1: Success rates collapse precipitously with module count. No model succeeds with 4 editable modules. This establishes a clear difficulty gradient — a crucial property for curriculum design in ADA optimization.

Meta-Meta-Learning with Prompt Optimization

K_tasks (unique tasks seen) DiscoBench Success Rate Meta-Train Score Meta-Test Score
1 70.6% 956 [939, 978] 957 [927, 977]
5 75.3% 1014 [1000, 1033] 973 [947, 993]
10 72.0% 969 [949, 989] 1000 [980, 1022]
30 78.7% 1061 [1040, 1079] 1071 [1049, 1096]

Key Finding 2: Meta-test performance improves monotonically with the number of distinct tasks experienced during optimization. Using only 1 task leads to overfitting; 30 tasks yields the best generalization. This validates DiscoGen's core hypothesis that task diversity improves ADA quality.

Generalization Gap

The paper reveals a critical generalization gap: algorithms that perform well on meta-train datasets do not necessarily generalize to meta-test. Rank correlation analysis across DiscoBench tasks shows that the correlation structure between algorithms' meta-train performance breaks down at meta-test time. This vindicates DiscoGen's insistence on separate meta-train/meta-test evaluation.
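This kind of rank-correlation check is easy to reproduce on one's own results. A self-contained Spearman sketch on toy scores (the numbers below are illustrative, not the paper's):

```python
# Spearman rank correlation between meta-train and meta-test scores.
# Toy data only; assumes no tied scores.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])  # rank 1 = smallest
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

meta_train = [10, 20, 30, 40]   # four candidate algorithms, cleanly ordered
meta_test  = [25, 40, 10, 30]   # held-out scores that shuffle the ranking
rho = spearman(meta_train, meta_test)  # 0.0: meta-train rank carries no signal
```

A rho near zero, as in this toy case, is exactly the failure mode the meta-train/meta-test split is designed to expose.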


7 Reproducibility

Open-Source Infrastructure

Component Availability Notes
Generator code GitHub (MIT) Full procedural generator
PyPI package pip install discogen CLI + Python API
DiscoBench configs Included in repo Fixed benchmark configurations
Domain implementations Included in repo 14 domains with baselines
Reference implementations Per-domain _reference.txt Baseline code for all modules
Documentation Comprehensive docs site Usage, contributing, API

Reproducing Results

# Install DiscoGen
pip install discogen

# List available domains
discogen get-domains

# Create a specific DiscoBench task
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml

# Run the task
cd task_src/OnPolicyRL
bash install.sh  # Install domain-specific dependencies
python run_main.py

# Create meta-test evaluation
discogen create-task --task-domain OnPolicyRL --config-path discobench_configs/task_42.yaml --test
cd task_src
python run_main.py  # Evaluate on held-out datasets

Reproducibility Concerns

Factor Assessment
Task generation Fully deterministic given config parameters — excellent
Evaluation scores Score ranges reported with confidence intervals — good
LLM-based ADA Inherent non-determinism in LLM outputs — moderate
Domain dependencies Each domain has separate requirements; potential version conflicts — manageable via install.sh
Compute requirements Not explicitly quantified per domain — unclear
DiscoBench configs Fixed, included in repository — excellent
Meta-meta-learning Prompt optimization details may vary by LLM provider — moderate

Potential Confounds

  1. Domain-specific dependency conflicts. Each task domain has its own Python requirements that may conflict with others. The install.sh per-domain approach mitigates but doesn't eliminate this.
  2. Baseline implementation quality. The strength of results depends on the quality of reference implementations. If baselines are weak, beating them is less impressive; if strong, the negative results for LLMs are more damning.
  3. Score aggregation. The paper aggregates across diverse domains with different scale metrics. The normalization scheme for cross-domain comparison needs careful scrutiny.

8 Compute and API Costs

Estimated Costs Per Run

The paper does not provide explicit cost breakdowns. Based on the experimental setup:

Component Estimated Cost Notes
Single DiscoBench task 1-60 min compute Varies enormously by domain (GPU needed for RL/CV)
DiscoBench Single evaluation ~35 tasks × cost per task Per model, single attempt
DiscoBench All evaluation ~35 tasks × cost per task All modules editable
Meta-meta-learning (30 steps) 30 × (LLM call + task eval) Plus prompt optimization LLM
Full evaluation suite 3 models × 3 settings Hundreds of task evaluations

Hardware Requirements by Domain

Domain Category Likely Hardware Duration
RL domains (On-Policy, Off-Policy, MARL) GPU (training agents in environments) 10-60 min
CV Classification GPU (training CNNs/ViTs) 5-30 min
Language Modelling GPU (transformer pretraining) 30-120 min
Bayesian Optimization CPU sufficient 1-10 min
Greenhouse Gas Prediction CPU sufficient 1-5 min
Brain Speech Detection GPU (neural decoding) 10-30 min
Trajectory Prediction GPU 10-60 min

LLM API Costs (for ADA)

For an LLM-based ADA attempting a single task:

  • Input context: Task description + editable module templates + reference docs
  • Output: Modified Python files for editable modules
  • Iterations: Typically multiple rounds of edit-test-refine
  • Estimated per-task LLM cost: $0.50-$10 depending on model and iterations

For the meta-meta-learning loop (30 optimization steps):

  • Per step: 1 LLM prompt optimization call + 1 ADA call + 1 task evaluation
  • Total: ~$30-300 for the full optimization run

Note: The dominant cost is task evaluation compute (GPU time), not LLM API calls. A full DiscoBench evaluation across all models and settings likely requires hundreds of GPU-hours.


9 Architecture Solution

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        DiscoGen System Architecture                  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Configuration Layer                        │   │
│  │                                                               │   │
│  │  Task Config (YAML)          Domain Registry                  │   │
│  │  ┌─────────────────┐        ┌──────────────────┐             │   │
│  │  │ task_domain      │        │ OnPolicyRL       │             │   │
│  │  │ editable_modules │───────►│ OffPolicyRL      │             │   │
│  │  │ meta_train       │        │ LanguageModelling│             │   │
│  │  │ meta_test        │        │ CVClassification │             │   │
│  │  │ eval_type        │        │ ... (14 total)   │             │   │
│  │  │ initialisation   │        └──────────────────┘             │   │
│  │  │ backend          │                                         │   │
│  │  └─────────────────┘                                         │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Generation Engine                           │   │
│  │                                                               │   │
│  │  create_task.py                                               │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │   │
│  │  │ Module        │  │ Dataset      │  │ Evaluation   │       │   │
│  │  │ Assembly      │  │ Selection    │  │ Harness      │       │   │
│  │  │              │  │              │  │ Generation   │       │   │
│  │  │ base/*.py    │  │ meta_train[] │  │ run_main.py  │       │   │
│  │  │ edit/*.py    │  │ meta_test[]  │  │ scoring      │       │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    Output: task_src/                           │   │
│  │                                                               │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │   │
│  │  │ Editable   │  │ Frozen     │  │ Evaluation │             │   │
│  │  │ Modules    │  │ Baseline   │  │ Pipeline   │             │   │
│  │  │            │  │ Modules    │  │            │             │   │
│  │  │ loss.py    │  │ optim.py   │  │ run_main   │             │   │
│  │  │ networks.py│  │ train.py   │  │ config.yaml│             │   │
│  │  └────────────┘  └────────────┘  └────────────┘             │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                    ADA Interface                               │   │
│  │                                                               │   │
│  │  Input: Editable module files + task description              │   │
│  │  Output: Modified module implementations                     │   │
│  │  Evaluation: python run_main.py → score                      │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Design Principles

1. Modularity Over Monolith

Rather than asking an ADA to build an entire ML system, DiscoGen decomposes algorithms into composable modules. This serves multiple purposes:

  • Difficulty control: More editable modules = harder tasks
  • Attribution: Which module improvements drive performance gains?
  • Composability: A discovered loss function from one task can be combined with a discovered optimizer from another

2. Separation of Concerns

Discovery Phase (meta-train):
  ADA modifies editable modules
  └── Trains on meta-train datasets
  └── Iterates and refines

Evaluation Phase (meta-test):
  Discovered algorithm runs on held-out datasets
  └── No further modification allowed
  └── Tests generalization, not overfitting

3. Configuration-Driven Generation

Every aspect of a task is determined by a small YAML configuration. This enables:

  • Deterministic task reproduction
  • Systematic difficulty sweeps
  • Automated curriculum construction
  • Combinatorial explosion of unique tasks

4. Domain Independence

The generator framework is domain-agnostic. Adding a new domain requires implementing:

  • Module base/edit implementations
  • Dataset adapters
  • Evaluation metrics
  • An install.sh for domain-specific dependencies


10 Component Breakdown

Core Components

1. CLI Interface (discogen/cli.py)

# Primary commands
discogen get-domains              # List all 14 supported domains
discogen create-task              # Generate a complete task directory
discogen sample-task-config       # Sample a random task configuration

Key parameters for create-task:

  • --task-domain — Which ML domain
  • --config-path — YAML configuration file
  • --example — Generate with editable (incomplete) modules
  • --test — Generate meta-test evaluation version

2. Configuration System (discogen/create_config.py)

The configuration system defines the combinatorial space of possible tasks:

# Example task configuration
task_domain: OnPolicyRL
meta_train: [Breakout, Freeway]
meta_test: [Asterix, SpaceInvaders]
backend: recurrent
change_loss: true
change_networks: true
change_optim: false
change_train: false
change_activation: false
change_targets: false
eval_type: performance
initialisation: empty

The number of tasks per domain is:

$$N_{\text{tasks}} = (2^m - 1) \times \binom{d}{k_{\text{train}}} \times \binom{d - k_{\text{train}}}{k_{\text{test}}} \times b \times |\text{eval\_types}| \times |\text{init\_modes}|$$

where $m$ is the number of modules, $d$ is the number of datasets, $b$ is the number of backends, and $k_{\text{train}}, k_{\text{test}}$ are the sizes of train/test dataset splits.
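The formula can be evaluated directly as a sanity check. The split sizes k_train and k_test below are assumptions for a toy domain, since the paper fixes them per domain:

```python
# Direct evaluation of the task-count formula above.
from math import comb

def n_tasks(m, d, b, k_train, k_test, n_eval_types=3, n_init_modes=2):
    module_choices = 2 ** m - 1  # every non-empty subset of editable modules
    dataset_splits = comb(d, k_train) * comb(d - k_train, k_test)
    return module_choices * dataset_splits * b * n_eval_types * n_init_modes

# Toy domain: 3 modules, 4 datasets, 1 backend, 2 meta-train / 1 meta-test
count = n_tasks(m=3, d=4, b=1, k_train=2, k_test=1)  # 7 * 6 * 2 * 1 * 3 * 2 = 504
```

The $2^m - 1$ term is what makes module count the dominant difficulty knob: each extra module roughly doubles the combinatorial space.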

3. Domain Implementations (discogen/domains/)

Each domain directory contains:

discogen/domains/OnPolicyRL/
├── base/           # Complete baseline implementations
│   ├── loss.py
│   ├── networks.py
│   ├── optim.py
│   ├── train.py
│   ├── activation.py
│   └── targets.py
├── edit/           # Editable templates (function signatures only)
│   ├── loss.py
│   ├── networks.py
│   └── ...
├── utils/
│   ├── _reference.txt    # Reference documentation for the domain
│   ├── environments.py   # Environment wrappers
│   └── evaluation.py     # Metric computation
├── datasets/       # Dataset configurations
├── config.yaml     # Domain-level defaults
└── install.sh      # Domain-specific dependency installer

4. Task Generation Engine (discogen/create_task.py)

The generation engine assembles a complete, runnable task directory:

  1. Module selection: For each module, choose base (frozen) or edit (editable) version
  2. Dataset assignment: Map meta-train and meta-test datasets
  3. Evaluation setup: Configure scoring metrics and evaluation scripts
  4. Dependency resolution: Ensure inter-module dependencies are satisfied
  5. Output: Self-contained task_src/ directory

5. DiscoBench Configurations (discogen/discobench_configs/)

A fixed set of task configurations for reproducible benchmarking. These configurations are:

  • Curated to cover diverse difficulty levels
  • Balanced across domains
  • Stable across DiscoGen versions
  • Designed for principled meta-train/meta-test evaluation

Supporting Components

Component Purpose
Reference implementations Gold-standard baselines per domain for comparison
Environment wrappers Standardized interfaces for diverse RL environments
Scoring functions Domain-specific metrics (returns, accuracy, loss, etc.)
Dataset loaders Unified data loading across formats and sources
Result aggregation Cross-domain score normalization and reporting

11 Core Mechanisms (Detailed)

Mechanism 1: Procedural Task Generation

The core innovation is treating algorithm discovery tasks as procedurally generated environments — directly analogous to procedural level generation in RL (Minigrid, Procgen, etc.).

The Combinatorial Space:

For a domain with $m$ modules and $d$ datasets, the number of possible module combinations is $2^m - 1$ (at least one module must be editable). Dataset allocation multiplies this further. The resulting space is vast:

Domain Modules Datasets Unique Tasks
On-Policy MARL 6 17 97.4 billion
On-Policy RL 6 13 1.8 billion
Bayesian Optimization 6 11 65.4 million
Offline RL 5 10 10.6 million

Task Sampling:

# Sample a uniformly random task configuration
discogen sample-task-config --config-dest random_task.yaml

# The config specifies ALL task parameters
# Domain, modules, datasets, eval type, initialization

This enables curriculum strategies where the distribution over tasks evolves based on the ADA's current capabilities — precisely the UED (Unsupervised Environment Design) paradigm applied to algorithm discovery.
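One minimal instantiation of such a curriculum is a failure-weighted sampler over module counts. None of the function names or the sampling rule below come from DiscoGen; they only illustrate how a task distribution could adapt to the ADA:

```python
# Hypothetical failure-weighted curriculum over editable-module counts.
import random

def curriculum_weights(success_rates):
    """Weight each difficulty level by rate * (1 - rate): levels the ADA
    sometimes solves and sometimes fails carry the most learning signal."""
    return {lvl: max(rate * (1 - rate), 1e-3) for lvl, rate in success_rates.items()}

def sample_level(success_rates, rng=random.Random(0)):
    w = curriculum_weights(success_rates)
    levels, weights = zip(*sorted(w.items()))
    return rng.choices(levels, weights=weights, k=1)[0]

# Per-level success rates (illustrative numbers shaped like the paper's
# module-count table: success collapses as more modules become editable)
rates = {1: 0.75, 2: 0.47, 3: 0.08, 4: 0.0}
level = sample_level(rates)  # most often draws the 2-module level
```

Levels near 50% success dominate the distribution, which is the usual UED intuition: tasks at the frontier of the agent's ability carry the most training signal.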

Mechanism 2: Modular Algorithm Decomposition

Each ML algorithm is decomposed into semantically meaningful, independently editable modules:

Algorithm = Module_1 ⊕ Module_2 ⊕ ... ⊕ Module_m

Example (PPO):
  PPO = Loss ⊕ Networks ⊕ Optimizer ⊕ Train_Loop ⊕ Activation ⊕ Targets

Each module has:
  - Defined inputs (tensor shapes, types)
  - Defined outputs (tensor shapes, types)
  - Base implementation (working baseline)
  - Edit template (signatures only)

Why This Matters:

  1. Controlled complexity. Editing 1 module is fundamentally easier than editing 6. This provides a natural difficulty gradient.
  2. Attribution. If an ADA improves the loss function, we know exactly which component drove the improvement.
  3. Composability. A novel loss function discovered for CIFAR-10 can be tested on CIFAR-100 without modification.
  4. Research focus. Researchers can study "loss function discovery" or "optimizer discovery" in isolation.

Mechanism 3: Meta-Train/Meta-Test Separation

This is perhaps DiscoGen's most methodologically important contribution. Every task enforces a strict separation:

┌─────────────────────────────────┐    ┌─────────────────────────────────┐
│         META-TRAIN               │    │         META-TEST                │
│                                  │    │                                  │
│  ADA has access to these         │    │  ADA has NEVER seen these        │
│  datasets during discovery       │    │  datasets — held out completely  │
│                                  │    │                                  │
│  Example:                        │    │  Example:                        │
│  - Breakout (Atari)             │    │  - Asterix (Atari)              │
│  - Freeway (Atari)              │    │  - SpaceInvaders (Atari)        │
│                                  │    │                                  │
│  ADA iterates on these:         │    │  Discovered algorithm evaluated  │
│  edit code → train → evaluate   │    │  here with NO modifications      │
│  → edit code → train → ...      │    │                                  │
└─────────────────────────────────┘    └─────────────────────────────────┘

The Problem This Solves:

Prior algorithm discovery benchmarks evaluate on the same datasets used during discovery. An ADA could achieve high scores by:

  • Overfitting hyperparameters to specific datasets
  • Hardcoding dataset-specific tricks
  • Memorizing training data statistics

DiscoGen's meta-test evaluation ensures that only genuinely novel, generalizable algorithms score well.

Empirical Validation:

The paper shows that rank correlation between algorithms' meta-train and meta-test performance is weak — algorithms that look good during discovery often fail to generalize. This vindicates the split design.

Mechanism 4: Meta-Meta-Learning Loop

The paper demonstrates using DiscoGen for optimizing the optimizer — a meta-meta-learning loop where the ADA's prompt is itself evolved:

Outer loop (meta-meta-learning):
  for step in range(30):
    task = DiscoGen.sample()           # Fresh task each step
    score = ADA(prompt, task)          # ADA discovers algorithm
    prompt = Optimizer(prompt, score)  # LLM proposes better prompt

Inner loop (meta-learning / algorithm discovery):
  for iteration in range(N):
    code = ADA.edit(modules)           # ADA modifies editable code
    train_score = evaluate(code, meta_train)
    if converged: break

Evaluation:
  test_score = evaluate(code, meta_test)  # Generalization test
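The two loops above can be made concrete with stubbed components. None of the names below are DiscoGen APIs; the "ADA" and "optimizer" are toy stand-ins so the control flow can be executed end to end:

```python
# Runnable sketch of the outer meta-meta-learning loop with toy stand-ins.
import random

rng = random.Random(0)

def sample_task():
    """A fresh task each outer step (here: just a difficulty scalar)."""
    return rng.uniform(0.0, 1.0)

def ada_discover(prompt_quality, difficulty):
    """Inner discovery loop collapsed to one call: better prompts score higher."""
    return prompt_quality - difficulty + rng.gauss(0.0, 0.05)

best_prompt, best_avg = 0.0, float("-inf")
for step in range(30):                              # outer loop: 30 steps
    candidate = best_prompt + rng.gauss(0.0, 0.1)   # optimizer proposes a prompt
    tasks = [sample_task() for _ in range(5)]       # K_tasks = 5 fresh tasks
    avg = sum(ada_discover(candidate, t) for t in tasks) / len(tasks)
    if avg > best_avg:                              # greedy accept on average score
        best_prompt, best_avg = candidate, avg
```

Averaging each candidate prompt over several fresh tasks is the mechanism the K_tasks experiment varies: with only one task per step, the accepted prompt overfits to that task's idiosyncrasies.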

Key Experimental Finding:

The number of distinct DiscoGen tasks experienced during optimization is the critical variable:

Tasks Seen Overfitting Risk Meta-Test Performance
1 High — prompt specializes to single task Lowest
5 Moderate Moderate
10 Lower Good
30 Lowest Best (1071 score)

Meta-test performance improves monotonically with task diversity. This is the clearest empirical evidence that DiscoGen's scale enables genuine learning — not just memorization.

Mechanism 5: Evaluation Harness

Each generated task includes a complete, self-contained evaluation pipeline:

# Generated run_main.py (simplified)
from importlib import import_module  # standard-library dynamic import

def evaluate_algorithm():
    # Load meta-train or meta-test datasets
    datasets = load_datasets(config.datasets)

    # Import editable modules (ADA-modified or baseline)
    loss_fn = import_module("loss")
    networks = import_module("networks")
    optimizer = import_module("optim")

    # Train using the composed algorithm
    model = train(networks, loss_fn, optimizer, datasets.train)

    # Evaluate
    score = evaluate(model, datasets.test)

    return score

The evaluation harness:

  • Is deterministic (fixed seeds for reproducibility)
  • Reports normalized scores with confidence intervals
  • Supports three evaluation types (performance, energy, time)
  • Can be run headlessly for automated optimization loops


12 Programming Language

Implementation Stack

Component Language Framework
DiscoGen core Python Click (CLI), YAML (configs)
Domain implementations Python PyTorch, JAX (domain-dependent)
RL environments Python Gymnax, MinAtar, Brax, Craftax
Bayesian optimization Python GPyTorch, BoTorch
Language modelling Python PyTorch, Transformers
Build system Makefile + uv Modern Python packaging
Documentation MkDocs Deployed to GitHub Pages

Package Management

DiscoGen uses uv for dependency management:

make install        # Sets up environment + pre-commit hooks
uv run discogen ... # Run CLI commands

Each domain has isolated dependencies installed via install.sh, addressing the challenge of conflicting requirements across 14 diverse ML domains (e.g., JAX for RL vs. PyTorch for CV).

Code Structure

```
discogen/
├── discobench_configs/    # Fixed benchmark task configurations
├── domains/               # 14 domain implementations
│   ├── BayesianOptimisation/
│   ├── BrainSpeechDetection/
│   ├── ComputerVisionClassification/
│   ├── ContinualLearning/
│   ├── GreenhouseGasPrediction/
│   ├── LanguageModelling/
│   ├── ModelUnlearning/
│   ├── NeuralCellularAutomata/
│   ├── OfflineRL/
│   ├── OffPolicyRL/
│   ├── OnPolicyMARL/
│   ├── OnPolicyRL/
│   ├── TrajectoryPrediction/
│   └── UnsupervisedEnvironmentDesign/
├── utils/                 # Shared utilities
├── create_task.py         # Task generation engine
├── create_config.py       # Configuration utilities
└── cli.py                 # Click-based CLI
```

Language Choice Rationale

Python is the natural choice given:

1. All 14 target ML domains are predominantly Python-based
2. The primary consumers (LLM-based ADAs) generate Python code
3. The ML ecosystem (PyTorch, JAX, scikit-learn) is Python-native
4. The editable modules are Python — this is algorithm discovery, not code translation


13 Memory Management

Task-Level Isolation

DiscoGen generates self-contained task directories. Each task runs in its own process with its own memory space. There is no shared state between task evaluations, which is essential for:

- Parallel evaluation of multiple tasks
- Fault isolation when ADA-generated code crashes
- Reproducibility of individual task scores
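A minimal sketch of this isolation pattern, assuming each task exposes an entry point that can be launched as a fresh Python process and prints its score; the helper name and the toy snippets are hypothetical, not DiscoGen's actual runner:

```python
import subprocess
import sys

def run_task_in_subprocess(snippet):
    """Execute one task's entry point in a fresh Python process.
    A crash in ADA-generated code is confined to that process."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode != 0:
        return None                        # fault isolated: record failure, keep going
    return float(proc.stdout.strip())      # the task prints its score

# Toy usage: one healthy task, one that crashes
scores = [
    run_task_in_subprocess("print(0.91)"),
    run_task_in_subprocess("raise RuntimeError('bad generated code')"),
]
print(scores)  # [0.91, None]
```

Because every evaluation is a separate OS process, the crash leaves the parent loop (and all sibling evaluations) untouched.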

Domain-Specific Memory Considerations

| Domain | Memory Profile | GPU Memory | Notes |
|---|---|---|---|
| On-Policy RL | Moderate | 2-8 GB | Environment rollouts + policy network |
| On-Policy MARL | High | 4-16 GB | Multiple agents + shared environment |
| Language Modelling | High | 8-40 GB | Transformer parameters + attention |
| CV Classification | Moderate | 4-12 GB | CNN/ViT + image batches |
| Bayesian Optimization | Low | 0-2 GB | Gaussian process + acquisition |
| Offline RL | Moderate | 2-8 GB | Replay buffer + networks |
| Neural Cellular Automata | Low-Moderate | 1-4 GB | Grid state + update rules |

Scaling Properties

DiscoGen's procedural generation is itself lightweight — task generation requires negligible compute and memory. The cost is in task evaluation, which scales with:

- Domain complexity (language modelling >> greenhouse gas prediction)
- Dataset size (CIFAR-100 >> MNIST)
- Number of meta-train datasets (more datasets = longer training)
- Evaluation type (energy/time require multiple runs for measurement)

State Management in Meta-Meta-Learning

The prompt optimization loop maintains:

- Prompt history: all attempted prompts and their scores (~KB scale)
- Score history: performance on each task attempted (~KB scale)
- Best prompt: current best-performing ADA configuration
- DiscoGen configs: sampled task specifications (~KB per task)

This is extremely lightweight — the dominant memory cost is always within the inner loop (ADA executing on a task).
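A sketch of this outer loop and the lightweight state it carries. All names (`propose_prompt`, `sample_task`, `run_ada`) are illustrative stand-ins, not DiscoGen's API; the real cost sits entirely inside `run_ada`:

```python
import random

def outer_loop(propose_prompt, sample_task, run_ada, n_iters=4):
    """Prompt-optimization outer loop: the state it keeps (histories,
    best prompt) is tiny compared with inner-loop training."""
    prompt_history, score_history = [], []   # ~KB scale
    best_prompt, best_score = None, float("-inf")
    for _ in range(n_iters):
        prompt = propose_prompt(prompt_history, score_history)
        task = sample_task()                 # a sampled DiscoGen config
        score = run_ada(prompt, task)        # inner loop: the real cost
        prompt_history.append(prompt)
        score_history.append(score)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

# Toy usage with stand-in components
random.seed(0)
best, score = outer_loop(
    propose_prompt=lambda ph, sh: f"prompt-{len(ph)}",
    sample_task=lambda: {"domain": "OnPolicyRL", "modules": ["loss"]},
    run_ada=lambda prompt, task: random.random(),
)
print(best, round(score, 2))
```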


14 Continued Learning

Built-In Curriculum Support

DiscoGen's parameterized task generation naturally enables curriculum learning for ADAs:

Difficulty Axes:

  1. Module count: 1 (easy) → 6 (nearly impossible with current models)
  2. Initialization: baseline (easier) → empty (harder)
  3. Dataset complexity: MNIST (simple) → TinyImageNet (complex)
  4. Domain familiarity: Well-studied domains → novel combinations
  5. Evaluation type: performance (standard) → energy/time (constrained)

Curriculum Strategies Enabled:

| Strategy | Description |
|---|---|
| Progressive module addition | Start with 1 editable module, gradually add more |
| Domain transfer | Train on simple domains, evaluate on complex ones |
| Initialization escalation | Start with baseline code, progress to empty templates |
| Dataset difficulty ramping | Begin with MNIST, advance to CIFAR-100, TinyImageNet |
| Eval type progression | Master performance → add efficiency constraints |
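The difficulty axes can be combined into a simple progressive sampler. This is a hypothetical sketch, not DiscoGen's actual curriculum interface; the module and dataset pools are illustrative:

```python
import random

# Difficulty axes from the text; names are illustrative, not the real API
MODULE_POOL = ["loss", "networks", "optim", "train", "data", "targets"]
DATASETS = ["MNIST", "CIFAR-10", "CIFAR-100", "TinyImageNet"]

def sample_task_config(stage, rng):
    """Progressive curriculum: later stages expose more editable modules,
    harder datasets, and eventually empty (from-scratch) templates."""
    n_modules = min(1 + stage, len(MODULE_POOL))
    return {
        "modules": rng.sample(MODULE_POOL, n_modules),
        "dataset": DATASETS[min(stage, len(DATASETS) - 1)],
        "init": "baseline" if stage < 2 else "empty",
    }

rng = random.Random(0)
for stage in range(4):
    cfg = sample_task_config(stage, rng)
    print(stage, len(cfg["modules"]), cfg["dataset"], cfg["init"])
```

Each stage tightens several axes at once; a real curriculum designer would likely adapt the schedule to the ADA's measured success rate rather than fix it in advance.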

Research Directions Proposed

The paper outlines several ambitious research directions:

1. Algorithm World Models

Train a model to predict algorithm performance from code without executing it. DiscoGen provides the training data: (code, configuration, score) triples at scale.
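A toy illustration of the idea, using a deliberately crude bag-of-tokens similarity in place of a learned code embedding; all names, snippets, and scores below are invented:

```python
from collections import Counter

def featurize(code):
    """Bag-of-tokens featurization — a crude stand-in for a learned
    code embedding."""
    return Counter(code.replace("(", " ").replace(")", " ").split())

def similarity(a, b):
    """Jaccard-style overlap between two token multisets."""
    shared = sum(min(a[t], b[t]) for t in a)
    return shared / max(1, sum(a.values()) + sum(b.values()) - shared)

def predict_score(code, triples):
    """World-model sketch: predict performance from code alone by
    similarity-weighted averaging over (code, config, score) triples."""
    feats = featurize(code)
    weights = [(similarity(feats, featurize(c)), s) for c, _, s in triples]
    total = sum(w for w, _ in weights) or 1.0
    return sum(w * s for w, s in weights) / total

# Invented training triples of the (code, configuration, score) shape
triples = [
    ("loss = mse(pred, target)", {"domain": "CV"}, 0.6),
    ("loss = mse(pred, target) + 0.1 * l2(params)", {"domain": "CV"}, 0.8),
]
print(round(predict_score("loss = mse(pred, target)", triples), 2))
```

The point is the data shape, not the model: DiscoGen supplies these triples at a scale where a genuinely learned predictor becomes plausible.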

2. Curriculum Learning for ADAs

Automatically design training curricula that maximize ADA generalization. This is UED (Unsupervised Environment Design) applied to the algorithm discovery setting — a recursive application where DiscoGen itself becomes the environment generator.

3. Tree Search for Discovery

Apply MCTS or similar search methods to navigate the space of module modifications, using DiscoGen tasks as the evaluation oracle.
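As a hedged simplification of such a search, the sketch below runs best-first search over candidate modifications with the evaluation harness as the scoring oracle; a toy integer "algorithm space" stands in for module edits, and full MCTS would add rollouts and visit statistics on top of this skeleton:

```python
import heapq

def best_first_search(initial, neighbors, score, budget=20):
    """Best-first search over module modifications, treating the task's
    evaluation harness as a black-box scoring oracle."""
    seen = {initial}
    frontier = [(-score(initial), initial)]
    best = frontier[0]
    while frontier and budget > 0:
        neg, state = heapq.heappop(frontier)
        best = min(best, (neg, state))
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                budget -= 1                  # each eval costs one oracle call
                heapq.heappush(frontier, (-score(nxt), nxt))
    return best[1], -best[0]

# Toy oracle: the "algorithm" is an int, and the score peaks at 7
state, value = best_first_search(
    initial=0,
    neighbors=lambda s: [s - 1, s + 1],
    score=lambda s: -abs(s - 7),
)
print(state, value)
```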

4. Multi-Task Algorithm Discovery

Discover algorithms that work well across multiple domains simultaneously, leveraging DiscoGen's cross-domain task coverage.

5. Foundation Models for Algorithm Discovery

Train specialized models on large volumes of DiscoGen task data, analogous to how foundation models are trained on internet text.

Extensibility

DiscoGen is designed for community contribution:

Adding a new domain requires:
1. Implement base/ modules (working baseline)
2. Implement edit/ templates (function signatures)
3. Define dataset configurations
4. Implement evaluation metrics
5. Write install.sh for dependencies
6. Add domain to registry
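The final registry step might look like the following sketch. The `DomainSpec` shape and the `ProteinFolding` domain are hypothetical, invented to show the checklist's inputs coming together, and do not reflect DiscoGen's real registry code:

```python
from dataclasses import dataclass

# Illustrative registry shape — the real DiscoGen registry may differ.
@dataclass
class DomainSpec:
    name: str
    base_modules: list            # working baseline implementations
    edit_modules: list            # templates exposing function signatures
    datasets: list
    metric: str
    install_script: str = "install.sh"

REGISTRY: dict = {}

def register(spec: DomainSpec):
    """Step 6 of the checklist: make the new domain discoverable."""
    REGISTRY[spec.name] = spec
    return spec

register(DomainSpec(
    name="ProteinFolding",                     # hypothetical new domain
    base_modules=["loss.py", "networks.py", "train.py"],
    edit_modules=["loss.py", "networks.py"],
    datasets=["toy_proteins"],
    metric="validation_rmsd",
))
print(sorted(REGISTRY))
```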

The documentation includes detailed contributing guides for:

- Adding new task domains
- Integrating new datasets
- Defining new evaluation types
- Contributing to DiscoBench

Version Evolution

As DiscoGen grows, DiscoBench provides stability:

- The generator evolves (new domains, datasets, backends)
- DiscoBench configurations remain fixed for comparability
- New DiscoBench versions can be released periodically
- Historical results remain valid against their DiscoBench version


15 Applications

Primary Application: ADA Optimization

DiscoGen's primary use case is training and evaluating Algorithm Discovery Agents:

For ADA Developers:
  1. Use DiscoGen to generate training tasks
  2. Run ADA optimization loop (prompt tuning, architecture search, etc.)
  3. Evaluate on DiscoBench for principled comparison
  4. Publish results with standardized metrics

Concrete Application Domains

| Application | DiscoGen Domain | Module Focus | Potential Impact |
|---|---|---|---|
| Novel RL algorithms | OnPolicyRL, OffPolicyRL | loss, train | Discovery of PPO successors |
| Efficient training | LanguageModelling | optim, loss | Reduced pretraining costs |
| Better vision classifiers | CVClassification | networks, loss | Architecture discovery |
| Multi-agent coordination | OnPolicyMARL | loss, targets, train | New MARL algorithms |
| Self-driving prediction | TrajectoryPrediction | networks, loss | Safer autonomous vehicles |
| Brain-computer interfaces | BrainSpeechDetection | networks, loss | Better neural decoders |
| Climate science | GreenhouseGasPrediction | model, data_processing | Improved forecasting |
| ML safety | ModelUnlearning | loss | Better forgetting algorithms |
| Open-ended learning | NeuralCellularAutomata | update, perceive | Artificial life research |
| RL generalization | UnsupervisedEnvDesign | sample_levels, train_step | More robust RL agents |
| Offline RL | OfflineRL | actor_loss, critic_loss | Learning from logged data |
| Continual learning | ContinualLearning | regularizer, replay | Catastrophic forgetting solutions |

Cross-Cutting Applications

1. Automated ML Research

DiscoGen enables a new paradigm: automated research assistants that discover novel algorithms without human guidance. The meta-meta-learning results show this is feasible — prompt-optimized ADAs outperform naive ones.

2. Benchmark Design

DiscoGen's methodology — procedural generation with meta-train/meta-test separation — can be applied to other evaluation domains beyond algorithm discovery.

3. ML Education

The modular decomposition provides an excellent teaching tool. Students can understand PPO by editing individual components and observing the impact.

4. Algorithm Portfolio Construction

By running ADAs across thousands of DiscoGen tasks, researchers can build portfolios of algorithms suited to different settings — analogous to algorithm selection in combinatorial optimization.
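A minimal sketch of per-setting portfolio selection from such a score matrix; the algorithm names and scores are illustrative:

```python
def build_portfolio(scores):
    """Given scores[algorithm][setting], pick the best algorithm for each
    setting — per-setting algorithm selection, as in combinatorial
    optimization portfolios."""
    settings = next(iter(scores.values())).keys()
    portfolio = {}
    for setting in settings:
        portfolio[setting] = max(scores, key=lambda a: scores[a][setting])
    return portfolio

# Toy scores over two DiscoGen settings (invented numbers)
scores = {
    "algo_A": {"OnPolicyRL": 0.9, "LanguageModelling": 0.4},
    "algo_B": {"OnPolicyRL": 0.7, "LanguageModelling": 0.8},
}
print(build_portfolio(scores))
```

At DiscoGen scale the interesting question becomes which task features predict which portfolio member wins — itself a learnable selection problem.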

Relationship to Other Systems

| System | Relationship to DiscoGen |
|---|---|
| FunSearch (Google DeepMind) | FunSearch is an ADA; DiscoGen provides tasks for it |
| AlphaEvolve (Google DeepMind) | AlphaEvolve is an ADA; DiscoGen could evaluate it |
| OpenELM | OpenELM is an ADA; DiscoGen provides benchmarking |
| EvoTorch | EvoTorch provides optimization; DiscoGen provides problems |
| AutoML frameworks | AutoML searches hyperparameters; DiscoGen generates algorithms |
| Neural Architecture Search | NAS searches architectures; DiscoGen includes this via networks.py modules |

Strategic Position: DiscoGen occupies a unique niche as infrastructure for algorithm discovery research. It doesn't discover algorithms itself — it generates the problems that algorithm discovery systems solve. This makes it complementary to, rather than competitive with, every ADA in the field.


References

@misc{goldie2026proceduralgenerationalgorithmdiscovery,
  title={Procedural Generation of Algorithm Discovery Tasks in Machine Learning},
  author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani 
          and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz 
          and Michael Beukman and Alistair Letcher and Shashank Reddy 
          and Clarisse Wibault and Theo Wolf and Charles O'Neill 
          and Uljad Berdica and Nicholas Roberts and Saeed Rahmani 
          and Hannah Erlebach and Roberta Raileanu and Shimon Whiteson 
          and Jakob N. Foerster},
  year={2026},
  eprint={2603.17863},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.17863},
}