Karpathy Autoresearch
Autonomous LLM-driven neural network training research on a single GPU
Organization: Andrej Karpathy (Independent / OpenAI Alumni)
Published: March 2026
Type: Open-Source Repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: autoresearch — AI agents running research on single-GPU nanochat training automatically
Repository URL: github.com/karpathy/autoresearch
Stars: 63,000+ (as of April 2026)
License: MIT
Lineage: Direct descendant of nanochat, which is itself a single-GPU GPT training repository. The autoresearch project takes nanochat's training infrastructure and wraps it with an LLM agent loop that autonomously modifies the training code.
Announcement: First described in a tweet thread by Karpathy, followed by a results tweet reporting the first overnight autonomous run.
Publication Date: March 2026
Paradigm: This is not a traditional research paper — it is a working system designed to be forked, extended, and run. The program.md file is the "paper" — a specification document that instructs an LLM agent how to conduct research autonomously. Karpathy describes the paradigm shift directly in the README:
"The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the Markdown files (program.md) that provide context to the AI agents and set up your autonomous research org."
2 Authors and Team
Primary Author
Andrej Karpathy — Former Director of AI at Tesla (Autopilot), co-founder and researcher at OpenAI, Stanford PhD (under Fei-Fei Li). One of the most influential figures in the deep learning community, known for pedagogical contributions (cs231n, nanoGPT, minGPT, llm.c, nanochat) and for making cutting-edge ML accessible.
Karpathy's body of work follows a distinctive pattern: take a complex system (ImageNet classifiers, GPT-2, tokenizers), strip it to its essence in a single readable file, and then open-source it as an educational tool. Autoresearch extends this pattern — now the researcher itself is automated.
Community Contributors
The repository spawned an immediate ecosystem of community forks:
| Fork | Maintainer | Platform |
|---|---|---|
| miolini/autoresearch-macos | miolini | macOS (Metal) |
| trevin-creator/autoresearch-mlx | trevin-creator | macOS (MLX) |
| jsegov/autoresearch-win-rtx | jsegov | Windows (RTX) |
| andyluo7/autoresearch | andyluo7 | AMD GPUs |
The rapid forking to alternative platforms (within days of release) demonstrates both the simplicity of the core design and the community demand for accessible autonomous research tools.
Philosophical Context
Karpathy frames autoresearch with a characteristically provocative opening:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the 'code' is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began."
This framing positions autoresearch not merely as a tool but as a proof-of-concept for a fundamentally different mode of scientific research — one where humans design the research program (in natural language) rather than conducting experiments directly.
3 Core Contribution
Key Novelty: Autoresearch demonstrates that a coding LLM agent, given nothing more than a Markdown instruction file and a single-GPU training setup, can autonomously conduct neural network architecture and hyperparameter research overnight — discovering real improvements that stack cumulatively to produce an 11% reduction in time-to-GPT-2 quality.
What Makes Autoresearch Novel
- Research-as-code paradigm. The human writes program.md — a natural-language specification of the research program. The LLM reads it and executes the research autonomously. This inverts the traditional human-researcher / computer-tool relationship.
- Radical simplicity. Three files. One metric. One GPU. Five-minute experiments. No frameworks, no orchestrators, no databases, no vector stores. The entire system fits in a single screen of instructions. This stands in stark contrast to systems like AlphaEvolve (Google DeepMind) or OpenEvolve, which require substantial infrastructure.
- Fixed-budget evaluation. Every experiment runs for exactly 5 minutes of wall-clock training time. This makes all results directly comparable regardless of what the agent changes (model size, batch size, architecture). The constraint forces the agent to think about compute-efficiency, not just model quality.
- Cumulative greedy hill-climbing. Changes that improve val_bpb are kept (the branch advances); changes that don't are discarded (git reset). This creates a monotonically improving trajectory — the branch represents the best-known configuration at all times.
- Self-contained reproducibility. The entire state of every experiment is captured as a git commit. The results log (results.tsv) provides a complete audit trail. Any experiment can be reproduced by checking out its commit.
- "Never stop" autonomy. The agent is explicitly instructed to never pause, never ask for confirmation, and run indefinitely until manually stopped. This enables overnight and multi-day autonomous research campaigns.
Relationship to Prior Work
| System | Year | Complexity | Agent Role | Human Role | Search Strategy |
|---|---|---|---|---|---|
| Grid Search | Classical | Low | None | Defines grid | Exhaustive |
| Bayesian Optimization | Classical | Medium | Statistical model | Defines space | Acquisition function |
| Neural Architecture Search | 2016+ | High | RL/evolution agent | Defines search space | RL, evolutionary |
| FunSearch (DeepMind) | 2023 | High | LLM mutation | Defines evaluator | Evolutionary |
| AlphaEvolve (DeepMind) | 2025 | Very High | LLM ensemble | Defines problem | MAP-Elites + ensemble |
| Autoresearch | 2026 | Minimal | LLM coding agent | Writes program.md | Greedy hill-climbing |
Autoresearch occupies a unique position: it is the simplest autonomous research system that produces real results. Where AlphaEvolve requires a team of engineers to operate, autoresearch requires uv sync && uv run prepare.py and a prompt.
The Simplicity Criterion
A distinctive design choice is the explicit simplicity constraint in program.md:
"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome — that's a simplification win."
This prevents the common failure mode of autonomous optimization systems: accumulating complexity without proportional gains. The agent is instructed to weigh improvement magnitude against code complexity cost. A 0.001 val_bpb improvement from deleting code is preferred over a 0.001 improvement from adding 20 lines of hacky code.
4 Supported Solutions
Primary Domain: Neural Network Training Optimization
Autoresearch targets a single domain — improving a GPT model's training efficiency within a fixed compute budget. The agent has full freedom to modify:
| Solution Category | Scope | Examples |
|---|---|---|
| Model Architecture | Transformer depth, width, attention patterns | Change DEPTH, ASPECT_RATIO, WINDOW_PATTERN, add/remove layers |
| Optimizer Configuration | Learning rates, betas, weight decay, momentum | Tune MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, Muon parameters |
| Learning Rate Schedules | Warmup, warmdown, final LR fraction | Modify WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC |
| Batch Size | Total batch size, gradient accumulation | Change TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE |
| Attention Mechanisms | Sliding window patterns, head configuration | Modify WINDOW_PATTERN, HEAD_DIM, n_kv_head |
| Activation Functions | Non-linearity choices | Replace F.relu(x).square() in MLP |
| Normalization | RMSNorm variants, placement | Modify norm() function |
| Weight Initialization | Init scales, strategies | Modify init_weights() method |
| Residual Connections | Residual lambdas, x0 mixing | Tune resid_lambdas, x0_lambdas |
| Value Embeddings | Value residual gating | Modify ve_gate, gating mechanisms |
| Rotary Embeddings | Position encoding variants | Modify apply_rotary_emb(), base frequency |
| Logit Processing | Softcap, temperature | Change softcap value in forward pass |
| Optimizer Internals | Muon orthogonalization, NorMuon variance | Modify muon_step_fused, polar express coefficients |
Constraint Surface
The agent operates within hard constraints that define the problem:
IMMUTABLE (prepare.py) MUTABLE (train.py)
┌─────────────────────────┐ ┌─────────────────────────┐
│ MAX_SEQ_LEN = 2048 │ │ ASPECT_RATIO = 64 │
│ TIME_BUDGET = 300s │ │ HEAD_DIM = 128 │
│ EVAL_TOKENS = 40×524288 │ │ WINDOW_PATTERN = "SSSL" │
│ VOCAB_SIZE = 8192 │ │ TOTAL_BATCH_SIZE = 2^19 │
│ evaluate_bpb() │ │ All LR parameters │
│ make_dataloader() │ │ WEIGHT_DECAY = 0.2 │
│ Tokenizer │ │ ADAM_BETAS = (0.8, 0.95)│
│ Data pipeline │ │ DEPTH = 8 │
│ Split pattern │ │ GPT model class │
└─────────────────────────┘ │ MuonAdamW optimizer │
│ Training loop │
│ LR schedules │
│ Everything else │
└─────────────────────────┘
Metric: Bits Per Byte (BPB)
The single optimization objective is val_bpb — validation bits per byte. This is a vocabulary-size-independent metric:
Σ cross_entropy(logits, targets) [nats]
val_bpb = ────────────────────────────────────────────────────
ln(2) × Σ utf8_byte_length(target_tokens)
BPB is preferred over perplexity because it allows fair comparison across different vocabulary sizes: even if tokenization-adjacent parameters could change (the agent cannot touch them, since prepare.py is frozen), the metric would remain valid in principle.
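As a sketch, the metric can be computed from per-token cross-entropy (in nats) and the UTF-8 byte length of the decoded targets. The function below is illustrative; `evaluate_bpb()` in prepare.py is the authoritative implementation:

```python
import math

def val_bpb(token_nats, token_texts):
    """Bits per byte: total cross-entropy (nats) converted to bits,
    divided by the UTF-8 byte length of the decoded target text.
    Illustrative sketch of the metric, not the prepare.py code."""
    total_nats = sum(token_nats)                         # sum of cross-entropy in nats
    total_bytes = sum(len(t.encode("utf-8")) for t in token_texts)
    return total_nats / (math.log(2) * total_bytes)      # nats -> bits, per byte

# Toy check: 1 nat of loss on a token whose text is 1 byte
# gives 1 / ln(2) ~= 1.4427 bits per byte
print(round(val_bpb([1.0], ["a"]), 4))
```

Note how vocabulary size drops out: a token covering more bytes contributes more to the denominator, so retokenizing the same text does not change the units of the metric.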
Soft Constraints
- VRAM: Some increase acceptable for meaningful val_bpb gains, but should not "blow up dramatically"
- Simplicity: Complexity cost is weighed against improvement magnitude
- Runtime: Must complete within the 5-minute budget (10-minute hard kill)
- Dependencies: Cannot add packages beyond those in pyproject.toml
5 LLM Integration
Agent Architecture
Autoresearch uses a single coding LLM agent — any agent capable of reading files, editing code, and executing shell commands. The system is agent-agnostic by design.
| Aspect | Detail |
|---|---|
| Agent Type | General-purpose coding agent (Claude Code, Codex, etc.) |
| Agent Count | 1 (default; SkyPilot extension enables multi-agent) |
| Instruction Format | Natural language Markdown (program.md) |
| Agent Loop | Read → Edit → Commit → Run → Evaluate → Keep/Discard → Repeat |
| Human in Loop | None during execution (fully autonomous) |
| Context Management | Redirect stdout to file, grep for metrics (avoids flooding context) |
How the LLM is Used
The LLM serves simultaneously as:
- Hypothesis generator — decides what experiment to try next based on prior results
- Code writer — implements the experiment by editing train.py
- Result interpreter — reads metrics, decides if the change was beneficial
- Research strategist — plans what direction to explore based on accumulated evidence
- Error handler — diagnoses crashes, fixes bugs, decides whether to retry or abandon
This is fundamentally different from systems like AlphaEvolve where the LLM is used only as a mutation operator within a larger evolutionary framework. In autoresearch, the LLM is the entire system — it handles strategy, implementation, evaluation, and decision-making.
Prompt Engineering via program.md
The program.md file is the entire "prompt engineering" layer. It functions as a research specification document with several key sections:
Setup Protocol:
1. Agree on a run tag (e.g. "mar5")
2. Create branch: git checkout -b autoresearch/<tag>
3. Read in-scope files: README.md, prepare.py, train.py
4. Verify data exists in ~/.cache/autoresearch/
5. Initialize results.tsv with header row
6. Confirm and begin
Experiment Loop (per cycle):
1. Look at git state (current branch/commit)
2. Edit train.py with experimental idea
3. git commit
4. Run: uv run train.py > run.log 2>&1
5. Read results: grep "^val_bpb:\|^peak_vram_mb:" run.log
6. If empty → crash → tail -n 50 run.log → diagnose
7. Log to results.tsv
8. If improved → keep commit (advance branch)
9. If not improved → git reset to previous best
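The keep/discard logic of steps 5-9 can be sketched in plain Python. The helper names (`parse_run_log`, `decide`) are hypothetical; the real agent performs these steps with grep and git rather than a driver script:

```python
import re

def parse_run_log(log_text):
    """Mirror of step 5: extract the metrics the agent greps for.
    Returns None when the log has no val_bpb line (i.e., a crash)."""
    bpb = re.search(r"^val_bpb: ([\d.]+)", log_text, re.MULTILINE)
    vram = re.search(r"^peak_vram_mb: ([\d.]+)", log_text, re.MULTILINE)
    if bpb is None:
        return None
    return float(bpb.group(1)), float(vram.group(1)) if vram else 0.0

def decide(result, best_bpb):
    """Steps 6-9: crash -> reset, improved -> keep, else discard."""
    if result is None:
        return "crash"
    val_bpb, _ = result
    return "keep" if val_bpb < best_bpb else "discard"

log = "step 1100 loss 3.2\nval_bpb: 0.9932\npeak_vram_mb: 45260\n"
print(decide(parse_run_log(log), best_bpb=0.9979))  # prints: keep
```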
Critical Behavioral Instructions:
"NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue. Do NOT ask 'should I keep going?' or 'is this a good stopping point?'. The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped."
"If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."
Context Window Management
A subtle but critical design decision is how the agent manages its context window:
uv run train.py > run.log 2>&1 # redirect ALL output to file
grep "^val_bpb:" run.log # extract only the metric
By redirecting training output to a file instead of printing to stdout, the agent avoids flooding its context window with thousands of lines of training progress. It reads only what it needs (the final metric, or a crash traceback). This is essential for running 100+ experiments without context exhaustion.
Agent-Agnostic Design
Autoresearch deliberately avoids coupling to any specific LLM provider or agent framework:
"Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like: 'Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.'"
The program.md is written in natural language that any sufficiently capable coding agent can follow. The only requirements are:
- File read/write capability
- Shell command execution
- Git operations
- Basic reasoning about experimental results
This makes autoresearch a meta-framework — it defines the research protocol, not the agent implementation.
6 Key Results
First Overnight Run
Karpathy's initial overnight run (reported via tweet) achieved:
| Metric | Value |
|---|---|
| val_bpb improvement | ~11% reduction vs. baseline |
| Improvements found | ~20 |
| Total experiments | ~100 over ~8 hours |
| Runtime | Overnight (~8 hours) |
| GPU | Single H100 |
| Agent | Claude Code |
Experiment Throughput
The fixed 5-minute training budget plus agent overhead yields predictable throughput:
Per experiment:
Agent thinks + edits train.py ~30 seconds
uv run train.py ~5 minutes (300s budget)
Agent reads results + logs ~30 seconds
─────────────────────────────────────────────
Total per experiment ~6 minutes
Throughput:
Per hour ~10 experiments
Per 8-hour overnight ~80 experiments
Per 24-hour run ~240 experiments
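The throughput figures above are simple arithmetic over the fixed budget plus an assumed ~1 minute of agent overhead per cycle:

```python
# Throughput sketch. TRAIN_BUDGET_S is fixed by prepare.py; the
# overhead figure is an assumption from the breakdown above.
TRAIN_BUDGET_S = 300    # 5-minute training budget
AGENT_OVERHEAD_S = 60   # ~30 s planning/editing + ~30 s reading results

cycle_s = TRAIN_BUDGET_S + AGENT_OVERHEAD_S  # ~6 minutes per experiment
per_hour = 3600 / cycle_s                    # 10.0 experiments/hour
for hours in (8, 24):
    print(f"{hours}h run: ~{int(per_hour * hours)} experiments")
```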
Improvement Trajectory
The results follow a characteristic diminishing-returns pattern:
val_bpb
▲
│
│ ● baseline
│ ╲
│ ╲ rapid early gains
│ ╲ (hyperparameter sweep)
│ ╲
│ ╲
│ ╲───── moderate gains
│ ╲ (architecture changes)
│ ╲
│ ╲─────── diminishing returns
│ ╲── (fine-tuning, combinations)
│ ╲──────────────
│
└──────────────────────────────────► experiments
0 20 40 60 80 100
What the Agent Typically Discovers
Based on community reports and the SkyPilot extension results, autonomous runs consistently discover:
| Discovery Category | Typical Finding | Typical Impact |
|---|---|---|
| Batch size reduction | 2^19 → 2^18 (more optimizer steps in 5 min) | High (0.01-0.02 bpb) |
| Adam betas tuning | (0.8, 0.95) → (0.9, 0.95) or (0.7, 0.95) | Medium (0.005 bpb) |
| Weight decay reduction | 0.2 → 0.08 | Medium (0.003 bpb) |
| Model width scaling | ASPECT_RATIO 64 → 96 | High (0.01+ bpb) |
| Window pattern changes | "SSSL" → "SL" | Small (0.001 bpb) |
| LR schedule tuning | Warmdown ratio, final LR fraction | Small-Medium |
| Muon optimizer params | momentum, beta2, ns_steps | Small (0.001 bpb) |
Nanochat Leaderboard Impact
The autoresearch results directly translate to improvements on the nanochat leaderboard, which tracks the best val_bpb achievable in 5 minutes of single-GPU training. The 11% improvement from the first overnight run represented a significant leap on this leaderboard.
7 Reproducibility
Reproducibility Score: Very High
Autoresearch is one of the most reproducible autonomous research systems:
| Factor | Assessment | Detail |
|---|---|---|
| Code availability | Full source, MIT license | 3 files, no hidden components |
| Data availability | Public dataset (climbmix-400b) | Auto-downloaded by prepare.py |
| Compute requirements | 1 NVIDIA GPU | Tested on H100; community forks for other hardware |
| Dependencies | Minimal (PyTorch + 8 packages) | Pinned in pyproject.toml via uv.lock |
| Determinism | Seeded (torch.manual_seed(42)) | Platform-dependent due to GPU timing |
| Experiment logging | Complete git history | Every experiment is a commit + TSV row |
| Agent instructions | Fully specified in program.md | No hidden system prompts |
Reproduction Steps
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# 2. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py
# 3. Establish baseline
uv run train.py
# 4. Start autonomous research
# Point your coding agent at the repo and prompt:
# "Read program.md and let's kick off a new experiment"
Platform Sensitivity
A critical caveat: results are platform-dependent. The 5-minute time budget means different GPUs will complete different numbers of training steps:
| GPU | Approximate Steps in 5 min | Relative Throughput |
|---|---|---|
| H200 | ~1,200+ | 1.09x |
| H100 | ~1,100 | 1.0x (reference) |
| A100 (80GB) | ~800 | ~0.73x |
| RTX 4090 | ~600 | ~0.55x |
| RTX 3090 | ~400 | ~0.36x |
This is by design — autoresearch finds the optimal model for your compute. But it means results from different hardware are not directly comparable. Karpathy explicitly acknowledges this:
"This means that autoresearch will find the most optimal model for your platform in that time budget. The downside is that your runs (and results) become not comparable to other people running on other compute platforms."
Recommendations for Smaller Hardware
For users without H100-class hardware, Karpathy provides specific guidance:
- Use TinyStories dataset (lower entropy → smaller models work)
- Reduce vocab_size (8192 → 4096/2048/1024/256)
- Lower MAX_SEQ_LEN (2048 → 512/256)
- Decrease DEPTH (8 → 4)
- Use WINDOW_PATTERN = "L" (avoid banded attention inefficiency)
- Lower TOTAL_BATCH_SIZE (2^19 → 2^14)
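Put together, a scaled-down configuration might look like the following. The constant names come from prepare.py/train.py as described in this report; the specific values are one hypothetical combination of the guidance above, edited before starting a run:

```python
# Hypothetical consumer-GPU configuration (values illustrative).
VOCAB_SIZE = 4096           # down from 8192; as low as 256 on tiny GPUs
MAX_SEQ_LEN = 512           # down from 2048
DEPTH = 4                   # down from 8
WINDOW_PATTERN = "L"        # all-long attention; banded windows are
                            # inefficient at small scale
TOTAL_BATCH_SIZE = 2 ** 14  # down from 2**19
```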
8 Compute and API Costs
GPU Compute Costs
| Configuration | GPU-Hours | Estimated Cost |
|---|---|---|
| Single overnight run (8h) | 8 GPU-hours | $16 (H100 @ $2/hr) |
| Extended run (24h) | 24 GPU-hours | $48 |
| Weekend run (48h) | 48 GPU-hours | $96 |
| Community fork (RTX 4090) | 8 hours | ~$0 (consumer GPU) |
LLM API Costs
The agent makes relatively few API calls per experiment cycle:
Per experiment cycle:
Read current state + plan ~2K tokens input
Generate code edit ~1K tokens output
Read results + decide ~500 tokens input
Log results ~200 tokens output
─────────────────────────────────────────────
Total per cycle ~4K tokens
Per overnight run (100 experiments):
Total tokens ~400K tokens
Estimated cost (Claude Sonnet) ~$1.50-3.00
Estimated cost (GPT-4o) ~$1.00-2.00
Total Cost per Run
| Run Duration | GPU Cost | API Cost | Total |
|---|---|---|---|
| 8h overnight (H100) | $16 | ~$2 | ~$18 |
| 8h overnight (RTX 4090) | $0 | ~$2 | ~$2 |
| 24h extended (H100) | $48 | ~$6 | ~$54 |
The cost structure is GPU-dominated for cloud users and API-dominated for users with local GPUs. This makes autoresearch remarkably cheap compared to traditional hyperparameter search (which requires human researcher time) or systems like AlphaEvolve (which require multi-GPU clusters and engineering teams).
Cost Efficiency Analysis
Traditional ML Research:
Senior researcher salary: ~$200K/year → ~$100/hr fully loaded
8 hours of researcher time: ~$800
GPU cost for same period: ~$16
Total: ~$816
Autoresearch:
8 hours (overnight, researcher sleeping): ~$18
Number of experiments: ~80-100
Human effort: ~10 min setup
Cost reduction: ~45x
Experiments/dollar:
Traditional: ~0.1 (researcher bottleneck)
Autoresearch: ~5.5
9 Architecture Solution
System Architecture
Autoresearch's architecture is remarkable for what it doesn't include:
┌─────────────────────────────────────────────────────────┐
│ CODING LLM AGENT │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Hypothesis │ │ Code Editor │ │ Result │ │
│ │ Generator │ │ (train.py) │ │ Interpreter │ │
│ └──────┬──────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ └────────────────┼───────────────────┘ │
│ │ │
│ ┌───────────▼──────────┐ │
│ │ Decision Engine │ │
│ │ (keep/discard/crash)│ │
│ └───────────┬──────────┘ │
└──────────────────────────┼──────────────────────────────┘
│
┌────────────▼────────────┐
│ SHELL INTERFACE │
│ git, uv, grep, tail │
└────────────┬────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────────▼────────┐ ┌─────▼──────┐ ┌───────▼───────┐
│ train.py │ │ run.log │ │ results.tsv │
│ (single file │ │ (output) │ │ (audit log) │
│ the agent │ │ │ │ │
│ modifies) │ │ │ │ │
└────────┬────────┘ └────────────┘ └───────────────┘
│
┌────────▼────────┐
│ prepare.py │
│ (read-only) │
│ - dataloader │
│ - tokenizer │
│ - evaluation │
│ - constants │
└────────┬────────┘
│
┌────────▼────────┐
│ SINGLE GPU │
│ (H100/etc) │
│ 5-min training │
└─────────────────┘
Key Architectural Decisions
1. No database, no queue, no orchestrator.
Most autonomous research systems (AlphaEvolve, OpenEvolve, The AI Scientist) include substantial infrastructure: databases for results, queues for experiments, orchestrators for parallelism, vector stores for memory. Autoresearch uses none of these. The filesystem (git) is the database. The agent's context window is the memory. Shell commands are the orchestrator.
2. Git-as-state-machine.
The git branch serves as a state machine for the research process:
baseline commit
│
┌─────────────┼─────────────┐
│ │ │
exp-01 exp-02 exp-03
(keep ✓) (discard ✗) (crash ✗)
│ │ │
│ git reset git reset
│ │ │
▼ ▼ ▼
commit-01 (back to (back to
│ baseline) baseline)
│
exp-04
(keep ✓)
│
▼
commit-02
│
...
Each kept experiment advances the branch. Each discarded experiment resets to the last good state. The branch tip always represents the best-known configuration.
3. Fixed-budget evaluation.
The 5-minute wall-clock training budget is a brilliant constraint:
- Makes all experiments directly comparable regardless of changes
- Prevents the agent from exploiting "train longer" as a strategy
- Forces compute-efficiency thinking (more steps = better, so batch size and model size matter)
- Enables predictable throughput (~10-12 experiments/hour)
4. Single-file modification scope.
The agent only modifies train.py. This constraint:
- Keeps the search space manageable
- Makes diffs reviewable by humans
- Prevents the agent from gaming the evaluation (it cannot modify prepare.py)
- Ensures fair comparison (data pipeline and metric are fixed)
Information Flow
┌──────────────────────────────────────────────────────┐
│ EXPERIMENT CYCLE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ program │───▶│ Agent │───▶│ train.py │ │
│ │ .md │ │ (LLM) │ │ (edited) │ │
│ │ │ │ │ │ │ │
│ │ Rules │ │ Context: │ └────┬─────┘ │
│ │ Goals │ │ - rules │ │ │
│ │ Format │ │ - code │ ┌────▼─────┐ │
│ └──────────┘ │ - past │ │ GPU │ │
│ │ results│ │ Training │ │
│ ┌──────────┐ │ - errors │ │ (5 min) │ │
│ │ results │◀───│ │ └────┬─────┘ │
│ │ .tsv │ │ │ │ │
│ │ │ │ Decision:│ ┌────▼─────┐ │
│ │ Audit │ │ keep or │ │ run.log │ │
│ │ trail │ │ discard │◀───│ (output) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
└──────────────────────────────────────────────────────┘
10 Component Breakdown
Component 1: program.md — The Research Specification
Purpose: Defines the research protocol in natural language for any LLM agent.
Structure:
| Section | Purpose | Key Constraints |
|---|---|---|
| Setup | Initialize experiment branch, verify data | Must read all in-scope files first |
| Experimentation | What can/cannot be modified | Only train.py is mutable |
| Output Format | How to parse experiment results | grep-based metric extraction |
| Logging | How to record results | Tab-separated results.tsv |
| Experiment Loop | Core loop specification | Keep if improved, discard if not |
| Never Stop | Autonomy directive | No human confirmation needed |
Key behavioral instructions encoded in program.md:
- Crash handling: "If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it."
- Timeout policy: "If a run exceeds 10 minutes, kill it and treat it as a failure."
- Autonomy: "The loop runs until the human interrupts you, period."
- Idea generation: "If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."
- Results file isolation: "Do not commit the results.tsv file, leave it untracked by git."
Component 2: train.py — The Mutable Training Script
Purpose: Contains the complete GPT model, optimizer, and training loop. This is the only file the agent modifies.
Size: ~450 lines of Python
Sub-components:
| Sub-component | Lines | Purpose |
|---|---|---|
| `GPTConfig` | ~10 | Model configuration dataclass |
| `CausalSelfAttention` | ~40 | Multi-head attention with RoPE, value embeddings, flash attention |
| `MLP` | ~10 | Feed-forward network with ReGLU-squared activation |
| `Block` | ~10 | Transformer block (attention + MLP + residual) |
| `GPT` | ~120 | Full model with embeddings, blocks, logit head, rotary cache |
| `MuonAdamW` | ~100 | Combined optimizer: Muon for matrices, AdamW for others |
| Hyperparameters | ~20 | Top-level constants (the primary knobs) |
| Setup | ~40 | Model construction, optimizer creation, dataloader |
| Training Loop | ~80 | The actual training loop with logging, scheduling, evaluation |
Notable design choices in train.py:
- Muon + AdamW hybrid optimizer: Weight matrices use the Muon optimizer (gradient orthogonalization via polar decomposition) while embeddings, scalars, and biases use AdamW. This is a state-of-the-art optimizer design.
- Value embeddings (ResFormer): Alternating layers have value embeddings with input-dependent gating — a recent technique for improved transformer training.
- Residual scaling: Per-layer learnable residual lambdas and x0 mixing coefficients, following recent work on deep transformer training stability.
- ReGLU-squared activation: F.relu(x).square() — a non-standard activation that has shown empirical benefits in small-model training.
- Logit soft-capping: softcap * tanh(logits / softcap) — prevents extreme logit values, improving training stability.
- Garbage-collection management: Manual GC control with gc.freeze() and gc.disable() to avoid Python GC stalls during training.
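Two of these choices are easy to show in miniature. The scalar sketches below (pure Python, with an illustrative softcap value) mirror the PyTorch expressions named above but are not the train.py code:

```python
import math

def relu_squared(x):
    """Scalar version of the MLP activation F.relu(x).square()."""
    return max(x, 0.0) ** 2

def softcap_logit(logit, softcap=15.0):
    """Logit soft-capping: softcap * tanh(logit / softcap).
    Smoothly bounds logits to (-softcap, softcap); near-identity
    for |logit| << softcap. The softcap value here is illustrative."""
    return softcap * math.tanh(logit / softcap)

print(relu_squared(-2.0), relu_squared(3.0))  # 0.0 9.0
print(round(softcap_logit(100.0), 4))         # 15.0  (extreme logit capped)
print(round(softcap_logit(1.0), 4))           # 0.9985 (small logit ~unchanged)
```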
Component 3: prepare.py — The Fixed Infrastructure
Purpose: Data download, tokenizer training, dataloader, and evaluation metric. Immutable — the agent cannot touch this.
Key constants:
MAX_SEQ_LEN = 2048 # context length
TIME_BUDGET = 300 # 5 minutes of training
EVAL_TOKENS = 40×524288 # ~20M tokens for validation
VOCAB_SIZE = 8192 # BPE vocabulary size
Sub-components:
| Sub-component | Purpose |
|---|---|
| Data download | Fetches parquet shards from HuggingFace (climbmix-400b) |
| Tokenizer training | BPE via rustbpe, saved as tiktoken pickle |
| `Tokenizer` class | Wrapper with encode/decode, BOS token support |
| `make_dataloader()` | BOS-aligned best-fit packing dataloader |
| `evaluate_bpb()` | The sacred evaluation function (bits per byte) |
The separation of prepare.py (immutable) and train.py (mutable) is architecturally critical:
┌─────────────────┐ ┌─────────────────┐
│ prepare.py │ │ train.py │
│ (FROZEN) │ │ (MUTABLE) │
│ │ │ │
│ Ground truth │ │ Experiment │
│ evaluation │◄───│ space │
│ │ │ │
│ Cannot be │ │ Everything │
│ gamed by │ │ is fair game │
│ the agent │ │ │
└─────────────────┘ └─────────────────┘
Component 4: results.tsv — The Audit Trail
Purpose: Human-readable experiment log. Tab-separated for easy parsing.
Schema:
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6a 0.000000 0.0 crash double model width (OOM)
Design choices:
- Tab-separated (not CSV) — commas break in descriptions
- Not committed to git — keeps the branch clean for code-only diffs
- Short commit hash (7 chars) for traceability
- Status enum: keep, discard, crash
- Memory tracked to detect VRAM regressions
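Because the log is plain TSV, the best-known configuration can be recovered with a few lines of Python (the sample rows follow the schema above; one would then check out the winning commit with git):

```python
import csv, io

# Hypothetical results.tsv contents following the schema above.
TSV = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
kept = [r for r in rows if r["status"] == "keep"]
best = min(kept, key=lambda r: float(r["val_bpb"]))  # lowest bpb wins
print(best["commit"], best["val_bpb"])  # prints: b2c3d4e 0.993200
```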
Component 5: Git Branch — The State Machine
Purpose: Version control serves as experiment state management.
The branch naming convention (autoresearch/<tag>) creates isolated experiment runs:
master ──────────────────────────────────────────►
│
├── autoresearch/mar5 ──────●────●────●────►
│ exp-01 exp-04 exp-07
│ (kept) (kept) (kept)
│
├── autoresearch/mar5-gpu0 ──────●────●────►
│
└── autoresearch/mar6 ──────●────●────●────►
11 Core Mechanisms (Detailed)
Mechanism 1: The Greedy Hill-Climbing Loop
The core algorithm is deceptively simple:
INITIALIZE:
branch = autoresearch/<tag>
baseline = run(train.py) # establish baseline val_bpb
best_bpb = baseline
LOOP FOREVER:
idea = agent.generate_hypothesis(history, code)
modified_code = agent.implement(idea, train.py)
git.commit(modified_code)
result = run(modified_code)   # trains for the fixed 5-min budget; 10-min hard kill
IF result.crashed:
IF easy_fix(result.error):
fix_and_retry()
ELSE:
log(status="crash")
git.reset(best_commit)
ELIF result.val_bpb < best_bpb:
best_bpb = result.val_bpb
best_commit = git.HEAD
log(status="keep")
# branch advances naturally
ELSE:
log(status="discard")
git.reset(best_commit)
Analysis:
This is a first-order greedy search with a sophisticated "sensor" (the LLM's judgment about what to try). The greedy hill-climbing has well-known limitations:
- Local optima trapping: The agent can get stuck in a local optimum where no single change improves the metric, but a combination of changes would.
- No backtracking: Once the agent advances past a commit, it cannot easily revisit that region of the search space (though program.md mentions rewinding "very sparingly").
- No exploration-exploitation balance: There is no formal mechanism for trading off exploitation (fine-tuning what works) against exploration (trying radically different ideas).
However, the LLM mitigates these limitations:
- The LLM can propose compound changes (modify 3 hyperparameters at once)
- The LLM has semantic understanding of the code and can make informed hypotheses
- The LLM has (implicit) knowledge from its training data about what works in ML
- The "think harder" instruction encourages the agent to try radical changes when stuck
Mechanism 2: The Muon + AdamW Optimizer
The training script uses a sophisticated dual-optimizer design that the agent can modify:
┌──────────────────────────────────────────────────┐
│ MuonAdamW Optimizer │
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Muon Optimizer │ │ AdamW Optimizer │ │
│ │ │ │ │ │
│ │ For: 2D weight │ │ For: embeddings, │ │
│ │ matrices │ │ scalars, biases │ │
│ │ │ │ │ │
│ │ Steps: │ │ Standard Adam │ │
│ │ 1. Nesterov │ │ with bias │ │
│ │ momentum │ │ correction and │ │
│ │ 2. Polar express │ │ weight decay │ │
│ │ orthogonalize │ │ │ │
│ │ 3. NorMuon │ │ │ │
│ │ variance │ │ │ │
│ │ reduction │ │ │ │
│ │ 4. Cautious │ │ │ │
│ │ weight decay │ │ │ │
│ └────────────────────┘ └────────────────────┘ │
└──────────────────────────────────────────────────┘
Muon gradient orthogonalization (Polar Express):
The Muon optimizer applies the "polar express" — an iterative approximation of the polar decomposition, which extracts the orthogonal factor of a matrix — to each gradient before applying it. Orthogonalizing the update equalizes the step taken along every singular direction of the weight matrix, acting as a form of natural-gradient-style preconditioning rather than a plain Euclidean step.
The coefficients are precomputed:
polar_express_coeffs = [
(8.156554524902461, -22.48329292557795, 15.878769915207462),
(4.042929935166739, -2.808917465908714, 0.5000178451051316),
(3.8916678022926607, -2.772484153217685, 0.5060648178503393),
(3.285753657755655, -2.3681294933425376, 0.46449024233003106),
(2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]
Each iteration applies X = a*X + X @ (b*A + c*A@A) where A = X^T @ X (or X @ X^T for tall matrices). Since the polynomial acts on each singular value of X independently, repeated application pushes all singular values toward 1, converging to the orthogonal polar factor (the orthogonal Procrustes solution) in ~5 iterations.
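A minimal NumPy sketch of the iteration, using the coefficients above. The Frobenius-norm normalization (to bound singular values by 1) and the tall-matrix transpose handling are assumptions about the implementation, and convergence to an exactly orthogonal matrix is only approximate.

```python
import numpy as np

# Coefficients from the document; each tuple (a, b, c) defines one quintic step.
POLAR_EXPRESS_COEFFS = [
    (8.156554524902461, -22.48329292557795, 15.878769915207462),
    (4.042929935166739, -2.808917465908714, 0.5000178451051316),
    (3.8916678022926607, -2.772484153217685, 0.5060648178503393),
    (3.285753657755655, -2.3681294933425376, 0.46449024233003106),
    (2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]

def orthogonalize(G):
    """Approximate the orthogonal polar factor of G via the quintic iteration
    X = a*X + X @ (b*A + c*A@A) with A = X.T @ X (transposing tall matrices)."""
    transpose = G.shape[0] > G.shape[1]
    X = G.T if transpose else G                 # work with a wide matrix
    X = X / (np.linalg.norm(X) + 1e-7)          # Frobenius norm bounds singular values by 1
    for a, b, c in POLAR_EXPRESS_COEFFS:
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))   # acts on each singular value independently
    return X.T if transpose else X
```

After the five steps, the singular values of the result cluster near 1 (the iteration oscillates on purpose, overshooting early to amplify small singular values).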
NorMuon variance reduction:
After orthogonalization, the optimizer applies variance normalization with a second-momentum buffer:
v_mean = (g**2).mean(dim=red_dim)                      # per-row or per-column second moment
second_momentum = lerp(second_momentum, v_mean, 1-β₂)  # exponential moving average
g_normed = g / sqrt(second_momentum)                   # inverse-std step sizing
g = g_normed * (norm(g) / norm(g_normed))              # rescale to preserve the update norm
This combines the direction quality from orthogonalization with adaptive step sizing from variance tracking — conceptually similar to combining Adam's adaptation with natural gradient's geometry.
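A NumPy sketch of this normalization step, with hypothetical names (`normuon_step`, a per-row `second_momentum` buffer); the final rescale is the scale-preserving factor from the pseudocode above.

```python
import numpy as np

def normuon_step(g, second_momentum, beta2=0.95, eps=1e-8):
    """Variance-normalize an (already orthogonalized) update g, per row,
    while preserving the overall update norm."""
    v_mean = (g ** 2).mean(axis=1, keepdims=True)                     # per-row second moment
    second_momentum = beta2 * second_momentum + (1 - beta2) * v_mean  # EMA (lerp)
    g_normed = g / np.sqrt(second_momentum + eps)                     # inverse-std step sizing
    # rescale so the total update norm matches the pre-normalization norm
    g_out = g_normed * (np.linalg.norm(g) / (np.linalg.norm(g_normed) + eps))
    return g_out, second_momentum
```

By construction, the returned update has the same Frobenius norm as the input: only the relative per-row step sizes change.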
Mechanism 3: Fixed-Budget Training and Evaluation
The training loop implements the fixed-budget constraint:
TIME_BUDGET = 300 # 5 minutes
while True:
# ... training step ...
if step > 10: # exclude warmup/compilation
total_training_time += dt
if step > 10 and total_training_time >= TIME_BUDGET:
break
The first 10 steps are excluded from the time budget to account for PyTorch compilation and CUDA warmup. This ensures the budget measures actual training time, not one-time startup costs.
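The budget logic can be sketched in isolation. `run_with_budget` and its parameters are illustrative names, not the repository's actual functions.

```python
import time

TIME_BUDGET = 300.0   # seconds of pure training time
WARMUP_STEPS = 10     # compile/CUDA warmup steps excluded from the budget

def run_with_budget(train_step, budget=TIME_BUDGET, warmup=WARMUP_STEPS):
    """Run train_step() repeatedly; only time spent after `warmup` steps
    counts against the budget, so one-time startup costs are excluded."""
    total, step = 0.0, 0
    while True:
        t0 = time.perf_counter()
        train_step(step)
        dt = time.perf_counter() - t0
        step += 1
        if step > warmup:
            total += dt
            if total >= budget:
                break
    return step, total
```

A faster machine therefore gets more steps out of the same budget, which is why results are hardware-specific by design.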
Evaluation (BPB):
def evaluate_bpb(model, tokenizer, batch_size):
    # Evaluate on ~20M tokens from the validation split
    # (steps, val_batches, and the token_bytes lookup table are set up elsewhere in train.py)
    total_nats = 0.0
    total_bytes = 0
    for _ in range(steps):
        x, y = next(val_batches)                            # input and target token batches
        loss_flat = model(x, y, reduction='none').view(-1)  # per-token loss in nats
        y_flat = y.view(-1)
        nbytes = token_bytes[y_flat]                        # UTF-8 byte count per target token
        mask = nbytes > 0                                   # exclude special tokens
        total_nats += (loss_flat * mask).sum().item()
        total_bytes += nbytes.sum().item()
    return total_nats / (math.log(2) * total_bytes)
BPB (bits per byte) converts from nats (natural log) to bits (log₂) and normalizes by UTF-8 byte count rather than token count. This makes the metric independent of tokenizer vocabulary size.
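The conversion itself is a one-liner. The worked example below uses hypothetical numbers (0.8 nats of total loss over 4 bytes of text, i.e. an average token):

```python
import math

def nats_to_bpb(total_nats, total_bytes):
    # bits = nats / ln(2); normalize by UTF-8 bytes rather than tokens,
    # making the metric independent of tokenizer vocabulary size
    return total_nats / (math.log(2) * total_bytes)

# e.g. a token costing 0.8 nats that spans 4 UTF-8 bytes:
# 0.8 / (ln 2 * 4) ≈ 0.289 bits per byte
```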
Mechanism 4: The GPT Architecture
The model follows a modern GPT architecture with several recent innovations:
Input tokens (B, T)
│
▼
┌───────────────┐
│ Token Embed │ wte: vocab_size → n_embd
│ (+ RMSNorm) │
└───────┬───────┘
│
▼ x₀ (saved for residual mixing)
┌───────────────────────────────────────────┐
│ Block i (×n_layer) │
│ │
│ x = λᵢ·x + α₀ᵢ·x₀ ← residual mixing │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ CausalSelfAttention │ │
│ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │
│ │ │ Q=Wx │ │ K=Wx │ │ V=Wx │ │ │
│ │ └───┬───┘ └───┬───┘ └───┬───┘ │ │
│ │ │ │ + VE·gate │ │
│ │ ┌───▼─────────▼───┐ │ │ │
│ │ │ RoPE + QKNorm │ │ │ │
│ │ └───────┬──────────┘ │ │ │
│ │ │ │ │ │
│ │ ┌───────▼────────────────▼───────┐ │ │
│ │ │ FlashAttention3 (causal, │ │ │
│ │ │ sliding window per layer) │ │ │
│ │ └───────────────┬───────────────┘ │ │
│ │ ▼ │ │
│ │ Linear projection │ │
│ └──────────────────┬──────────────────┘ │
│ │ │
│ x = x + attn_out ← residual │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ MLP │ │
│ │ Linear → ReLU² → Linear │ │
│ └──────────────────┬──────────────────┘ │
│ │ │
│ x = x + mlp_out ← residual │
│ │
└───────────────────────┬───────────────────┘
│
▼
┌─────────────────┐
│ RMSNorm → Head │
│ Softcap(15) │
│ Cross-entropy │
└─────────────────┘
Value Embeddings (ResFormer):
Alternating layers have value embeddings — learnable per-token vectors that are mixed into the value stream via an input-dependent gate:
ve = self.value_embeds[str(i)](idx) # (B, T, kv_dim)
gate = 2 * sigmoid(self.ve_gate(x[..., :32])) # (B, T, n_kv_head)
v = v + gate * ve
The gate is initialized to zero, so sigmoid(0) = 0.5, scaled by 2 gives 1.0 — a neutral initialization that doesn't distort the value stream at the start of training.
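The neutral-initialization claim can be checked directly (the helper name is hypothetical; in the model the logit comes from the zero-initialized `ve_gate` linear layer):

```python
import math

def ve_gate(logit):
    # gate = 2 * sigmoid(logit), one gate value per (B, T, n_kv_head) position
    return 2.0 / (1.0 + math.exp(-logit))
```

At initialization the logit is 0, so the gate is exactly 1.0; during training the model can smoothly scale the value-embedding contribution anywhere in (0, 2).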
Residual Mixing:
Each block receives both the previous hidden state and the initial embedding:
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
This is initialized with λ = 1.0 and α₀ = 0.1, allowing the model to learn how much to rely on the original embedding vs. the transformed representation at each layer. This technique helps with gradient flow in deeper transformers.
Mechanism 5: BOS-Aligned Best-Fit Packing
The dataloader uses a sophisticated packing strategy:
Traditional padding:
[DOC_1 | PAD PAD PAD PAD PAD | DOC_2 | PAD PAD] ← wasted compute
Autoresearch best-fit packing:
[BOS DOC_A | BOS DOC_C | BOS DOC_F(crop)] ← 100% utilization
[BOS DOC_B | BOS DOC_D | BOS DOC_E ] ← 100% utilization
The packer:
1. Maintains a buffer of tokenized documents
2. For each row, finds the largest document that fits entirely in the remaining space
3. If no document fits, crops the shortest document to fill exactly
4. Prefixes every document with a BOS token
5. Achieves 100% utilization (no padding tokens)
This is a best-fit-decreasing variant adapted for streaming data — the agent cannot modify this packing strategy (it lives in prepare.py), but it can modify DEVICE_BATCH_SIZE, which affects how many rows are packed per batch.
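A self-contained sketch of the best-fit strategy on toy token lists. Document contents, row length, and the BOS id are hypothetical; the real packer streams documents through prepare.py's buffer rather than taking a fixed list.

```python
def pack_rows(docs, n_rows, row_len, bos=0):
    """Best-fit packing: each row repeatedly takes the largest buffered
    document that fits entirely (with its BOS); if none fits, the shortest
    document is cropped to fill the row exactly. Assumes the buffer always
    holds enough documents."""
    buf = sorted(docs, key=len, reverse=True)   # largest-first buffer
    rows = []
    for _ in range(n_rows):
        row = []
        while len(row) < row_len:
            space = row_len - len(row)
            # largest document whose BOS + tokens fit entirely
            pick = next((d for d in buf if 1 + len(d) <= space), None)
            if pick is None:
                pick = min(buf, key=len)        # nothing fits: crop the shortest
                buf.remove(pick)
                row += [bos] + pick[:space - 1]
            else:
                buf.remove(pick)
                row += [bos] + pick
        rows.append(row)
    return rows

# Hypothetical documents (token lists); BOS token id assumed to be 0
docs = [[1] * 5, [2] * 9, [3] * 3, [4] * 12, [5] * 7]
rows = pack_rows(docs, n_rows=2, row_len=16)
```

Every row comes out exactly `row_len` tokens long with no padding, and every document segment begins with a BOS token.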
12 Programming Language
Language: Python 3.10+
Primary dependencies (from pyproject.toml):
| Package | Version | Purpose |
|---|---|---|
| torch | 2.9.1 (CUDA 12.8) | Core training framework |
| kernels | ≥0.11.7 | FlashAttention3 kernel loading |
| rustbpe | ≥0.1.0 | BPE tokenizer training (Rust-backed) |
| tiktoken | ≥0.11.0 | Tokenizer runtime |
| pyarrow | ≥21.0.0 | Parquet data reading |
| numpy | ≥2.2.6 | Numerical utilities |
| pandas | ≥2.3.3 | Data manipulation |
| matplotlib | ≥3.10.8 | Visualization |
| requests | ≥2.32.0 | Data download |
Package management: Uses uv — the fast Rust-based Python package manager. All dependencies are locked via uv.lock. The CUDA-specific PyTorch wheel is sourced from the custom index:
[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
Code Style
The code is written in Karpathy's characteristic style:
- Single-file implementation (no module imports within the project)
- Minimal abstraction (no unnecessary classes or inheritance)
- Heavy use of torch.compile for performance-critical paths
- Inline constants rather than config files
- Dense but readable — every line serves a purpose
Performance Optimizations
- torch.compile(dynamic=False, fullgraph=True) — compiled optimizer kernels avoid Python overhead
- Flash Attention 3 — hardware-specific attention kernel (Hopper FA3 vs. community FA3)
- torch.set_float32_matmul_precision("high") — enables TF32 for faster matmuls
- Pinned memory — pin_memory=True for CPU→GPU data transfer
- Manual GC — gc.freeze() + gc.disable() after warmup to avoid GC stalls
- bfloat16 autocast — mixed-precision training
- Pre-allocated buffers — data loader pre-allocates CPU and GPU tensors
13 Memory Management
Context Window Memory
The primary memory challenge in autoresearch is the LLM agent's context window, not GPU memory. Over 100+ experiments, the agent accumulates substantial context:
Context accumulation per experiment:
program.md (initial read) ~2,000 tokens (amortized: 0)
train.py (read/edit) ~2,000 tokens
Results from grep ~50 tokens
Agent reasoning ~500 tokens
Error handling (if crash) ~500 tokens
──────────────────────────────────────────────
Net context per experiment ~3,000-5,000 tokens
Over 100 experiments:
Cumulative context ~300K-500K tokens
Mitigation strategies in program.md:
- Output redirection: uv run train.py > run.log 2>&1 — prevents training logs from flooding context
- Selective reading: grep "^val_bpb:" run.log — reads only the metric, not the full log
- Crash diagnosis: tail -n 50 run.log — reads only the traceback, not the full output
- No TSV commits: results.tsv is untracked — the agent doesn't re-read git diffs of results
These strategies keep per-experiment cost low, but the cumulative 300K-500K tokens still exceeds a ~200K context window, so an agent approaches context saturation after roughly 100 experiments.
GPU Memory
The training script is designed for single-GPU operation with careful memory management:
| Component | Memory Usage | Notes |
|---|---|---|
| Model parameters (50M) | ~100 MB (bf16) | Main model weights |
| Optimizer states (Muon) | ~200 MB | Momentum + second momentum buffers |
| Optimizer states (Adam) | ~200 MB | exp_avg + exp_avg_sq |
| Activations (per batch) | ~2-10 GB | Depends on batch size, scales with DEVICE_BATCH_SIZE × MAX_SEQ_LEN |
| KV cache (attention) | ~500 MB | Flash Attention working memory |
| Gradient accumulation | ~100 MB | Gradients for one micro-step |
| Rotary embeddings | ~50 MB | Pre-computed cos/sin for 10× sequence length |
| Value embeddings | ~100 MB | Per-token value vectors |
| Total (baseline) | ~3-11 GB | Well within H100's 80GB |
The PYTORCH_ALLOC_CONF = "expandable_segments:True" environment variable enables PyTorch's expandable memory segments, reducing fragmentation when the agent changes model sizes between experiments.
Peak VRAM tracking:
peak_vram_mb = torch.cuda.max_memory_allocated() / 1024 / 1024
Peak VRAM is reported per experiment and logged in results.tsv, allowing the agent to track memory impact of changes. The simplicity criterion in program.md treats VRAM increase as a cost to be weighed against val_bpb improvement.
Gradient Accumulation Memory Pattern
TOTAL_BATCH_SIZE = 2^19 # ~524K tokens total
DEVICE_BATCH_SIZE = 128 # per micro-step
tokens_per_fwdbwd = 128 × 2048 # = 262,144
grad_accum_steps = 2^19 / 2^18 # = 2
Memory during training:
Step 1: forward(128 × 2048) → activations → backward → gradients accumulated
Step 2: forward(128 × 2048) → activations → backward → gradients accumulated
Optimizer step: apply accumulated gradients
Zero gradients
The agent can trade memory for throughput by changing DEVICE_BATCH_SIZE (affects peak activation memory) or TOTAL_BATCH_SIZE (affects gradient accumulation steps). Reducing TOTAL_BATCH_SIZE from 2^19 to 2^18 halves the batch but allows more optimizer steps in the 5-minute budget — a trade-off the agent consistently discovers.
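The arithmetic above, spelled out (constants from the document; `halved_steps` is an illustrative name for the trade-off described in the last sentence):

```python
TOTAL_BATCH_SIZE = 2 ** 19       # tokens consumed per optimizer step
DEVICE_BATCH_SIZE = 128          # rows per forward/backward micro-step
MAX_SEQ_LEN = 2048               # tokens per row

tokens_per_fwdbwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN        # 262,144 = 2^18
assert TOTAL_BATCH_SIZE % tokens_per_fwdbwd == 0
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwdbwd   # 2 micro-steps per update

# Halving the total batch removes gradient accumulation entirely,
# doubling the number of optimizer steps that fit in the 5-minute budget.
halved_steps = (2 ** 18) // tokens_per_fwdbwd              # 1
```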
14 Continued Learning
Within a Single Run
The agent learns within a single run through several mechanisms:
1. Implicit learning via context accumulation.
As the agent conducts experiments, it accumulates knowledge about what works and what doesn't in its context window. After 50 experiments, the agent has seen dozens of failed and successful modifications and can make increasingly informed hypotheses.
2. Results-driven strategy evolution.
The results.tsv provides a structured record of all experiments. The agent can (and does) refer back to this record to:
- Identify which parameter ranges have been explored
- Find near-misses worth retrying in combination
- Avoid repeating failed experiments
- Identify diminishing returns in a particular direction
3. Error-driven adaptation.
Crashes and failures teach the agent about the constraint surface. After an OOM crash from doubling model width, the agent learns the VRAM boundary and stays within it in subsequent experiments.
Across Runs
Autoresearch does not have built-in cross-run learning. Each run starts from the instruction in program.md and the current state of train.py. However, several mechanisms enable implicit cross-run knowledge transfer:
1. Branch inheritance.
A new run can start from a branch created by a previous run, inheriting all accumulated improvements:
# Run 1 ends with autoresearch/mar5 at commit abc123
# Run 2 starts from that branch
git checkout autoresearch/mar5
git checkout -b autoresearch/mar6
# Agent starts from the improved train.py
2. program.md iteration.
The human can update program.md between runs to encode lessons learned:
# Added after Run 1:
Note: batch size 2^18 consistently outperforms 2^19.
Focus on architecture changes rather than optimizer tuning.
This creates a human-in-the-loop learning cycle at the meta level:
1. Agent runs autonomously for 8 hours
2. Human reviews results
3. Human updates program.md with insights
4. Agent runs again with better instructions
3. Community knowledge propagation.
The open-source nature of autoresearch enables a distributed learning process:
- Forks share discoveries (e.g., "AR=96 works best on H100")
- The nanochat leaderboard aggregates best-known configurations
- Community discussions identify promising research directions
Meta-Learning: Programming the Research Program
Karpathy explicitly highlights the meta-learning opportunity:
"The default
program.mdin this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress, how you'd add more agents to the mix, etc."
This positions autoresearch as a platform for meta-research — researching how to research. The program.md is itself an optimizable artifact:
Level 0: train.py ← optimized by the agent
Level 1: program.md ← optimized by the human
Level 2: research method ← optimized by the community
Future extensions could apply autoresearch to itself — using an LLM agent to optimize program.md based on the quality of research outcomes. This is a form of meta-optimization that echoes the self-referential nature of Gödel machines.
Integration with External Knowledge
The program.md instructs the agent to "read papers referenced in the code" when stuck. This allows the agent to leverage its training-time knowledge about:
- Transformer architectures (GPT, LLaMA, Mamba)
- Optimizer research (Adam, LAMB, Muon, Sophia)
- Initialization schemes (Xavier, Kaiming, µP)
- Normalization techniques (LayerNorm, RMSNorm, QKNorm)
- Attention variants (MHA, MQA, GQA, sliding window)
The agent's "knowledge base" is its own training data — a vast corpus of ML papers and code. This is a fundamentally different approach from systems like AlphaEvolve (which use structured program databases) or The AI Scientist (which uses explicit literature search).
15 Applications
Direct Applications
1. Neural network architecture search.
The primary application — finding optimal model configurations for a given compute budget. This is directly useful for:
- Startups optimizing their training infrastructure
- Researchers exploring new architecture ideas
- Hardware vendors benchmarking their GPUs
- Students learning about ML optimization
2. Hardware-specific optimization.
Because the 5-minute budget means different hardware gets different numbers of steps, autoresearch naturally finds the best model for your specific hardware:
H100: Large model (AR=96), fewer steps → width matters most
A100: Medium model (AR=64), more steps → depth/width balance
RTX 4090: Smaller model, more steps → optimizer tuning matters more
MacBook M2: Small model, many steps → data efficiency matters most
3. Optimizer research.
The Muon + AdamW optimizer is itself a research artifact. The agent can discover optimizer improvements (e.g., different momentum schedules, variance reduction techniques) that may generalize beyond the specific training setup.
4. Leaderboard competition.
The nanochat leaderboard provides a competitive benchmark. Autoresearch enables anyone with a GPU to compete — the overnight autonomous search replaces weeks of manual experimentation.
Extended Applications
5. Research methodology demonstration.
Autoresearch demonstrates a new research methodology — "programming the researcher" rather than conducting research directly. This methodology is applicable to any domain where:
- Solutions can be expressed as code
- Evaluation is automated
- Experiments are fast enough for iterative improvement
6. Benchmark for LLM coding agents.
The simplicity and reproducibility of autoresearch make it an excellent benchmark for evaluating coding agents. Key metrics:
- How many experiments before the first improvement?
- What is the final val_bpb after 100 experiments?
- How diverse are the explored hypotheses?
- How well does the agent handle crashes?
7. Education.
Following Karpathy's educational philosophy, autoresearch teaches:
- How neural network training works (by observing the agent's experiments)
- What hyperparameters matter most (by reading results.tsv)
- How to design reproducible research (by studying the protocol)
- How LLM agents can automate research (by running the system)
Limitations and Future Directions
Current limitations:
| Limitation | Description | Potential Solution |
|---|---|---|
| Sequential search | One experiment at a time | SkyPilot extension (see companion document) |
| Greedy strategy | Can miss multi-step improvements | Population-based search, backtracking |
| No cross-run memory | Each run starts fresh | Persistent knowledge store |
| Single-file scope | Only modifies train.py | Multi-file modification support |
| GPU-specific results | Results not comparable across hardware | Normalize by FLOPs or steps |
| No experiment design | Agent explores ad hoc, no DOE | Structured experimental design |
| No ensemble methods | Single agent perspective | Multi-agent debate/collaboration |
| Fixed metric | Only val_bpb, no multi-objective | Pareto optimization |
Natural extensions:
- Multi-agent autoresearch. Multiple agents explore different regions of the search space simultaneously, sharing discoveries. The SkyPilot extension (analyzed in the companion document) takes a first step in this direction.
- Meta-optimization of program.md. An outer loop that evaluates different program.md variants based on research quality, discovering optimal research protocols.
- Domain transfer. Apply the autoresearch protocol to other optimization domains: reinforcement learning training scripts, compiler optimization passes, database query optimization, scientific simulation parameters.
- Integration with formal methods. Combine the LLM agent's intuition with Bayesian optimization or genetic algorithms to provide principled exploration-exploitation trade-offs.
- Distributed autoresearch networks. A network of autoresearch instances running on different hardware, sharing discoveries in real-time, and collectively exploring the configuration space.
Broader Impact
Autoresearch represents a significant milestone in the trajectory toward fully autonomous scientific research. Its contribution is not primarily technical (the greedy hill-climbing algorithm is trivial) but conceptual — it demonstrates that:
- LLMs are sufficient as autonomous research agents for well-defined optimization problems
- Natural language specifications can replace complex research infrastructure
- Overnight autonomous runs can produce results that rival weeks of human effort
- Radical simplicity (3 files, 1 GPU, 1 metric) enables massive community adoption
The 63,000+ GitHub stars within weeks of release indicate that autoresearch has resonated deeply with the ML community. It has spawned an ecosystem of forks, extensions, and derivative projects that collectively push the frontier of autonomous research systems.
Karpathy's framing of program.md as "research org code" — where the human programs the research organization rather than conducting research — may prove to be the most enduring contribution. It redefines the role of the human researcher from executor to architect of the research process, with LLM agents as the execution layer.
This analysis is based on the autoresearch repository as of March 2026, including program.md (v1), train.py (baseline), prepare.py, and README.md. The repository continues to evolve through community contributions.