Karpathy Autoresearch
Autonomous LLM-driven neural network training research on a single GPU
Organization: Andrej Karpathy (Independent / OpenAI Alumni)
Published: March 2026
Type: Open-Source Repository
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: autoresearch — AI agents running research on single-GPU nanochat training automatically
Repository URL: github.com/karpathy/autoresearch
Stars: 63,000+ (as of April 2026)
License: MIT
Lineage: Direct descendant of nanochat, which is itself a single-GPU GPT training repository. The autoresearch project takes nanochat's training infrastructure and wraps it with an LLM agent loop that autonomously modifies the training code.
Announcement: First described in a tweet thread by Karpathy, followed by a results tweet reporting the first overnight autonomous run.
Publication Date: March 2026
Paradigm: This is not a traditional research paper — it is a working system designed to be forked, extended, and run. The program.md file is the "paper" — a specification document that instructs an LLM agent how to conduct research autonomously. Karpathy describes the paradigm shift directly in the README:
"The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the Markdown files (program.md) that provide context to the AI agents and set up your autonomous research org."
2 Authors and Team
Primary Author
Andrej Karpathy — Former Director of AI at Tesla (Autopilot), co-founder and researcher at OpenAI, Stanford PhD (under Fei-Fei Li). One of the most influential figures in the deep learning community, known for pedagogical contributions (cs231n, nanoGPT, minGPT, llm.c, nanochat) and for making cutting-edge ML accessible.
Karpathy's body of work follows a distinctive pattern: take a complex system (ImageNet classifiers, GPT-2, tokenizers), strip it to its essence in a single readable file, and then open-source it as an educational tool. Autoresearch extends this pattern — now the researcher itself is automated.
Community Contributors
The repository spawned an immediate ecosystem of community forks:
| Fork | Maintainer | Platform |
|---|---|---|
| miolini/autoresearch-macos | miolini | macOS (Metal) |
| trevin-creator/autoresearch-mlx | trevin-creator | macOS (MLX) |
| jsegov/autoresearch-win-rtx | jsegov | Windows (RTX) |
| andyluo7/autoresearch | andyluo7 | AMD GPUs |
The rapid forking to alternative platforms (within days of release) demonstrates both the simplicity of the core design and the community demand for accessible autonomous research tools.
Philosophical Context
Karpathy frames autoresearch with a characteristically provocative opening:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the 'code' is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began."
This framing positions autoresearch not merely as a tool but as a proof-of-concept for a fundamentally different mode of scientific research — one where humans design the research program (in natural language) rather than conducting experiments directly.
3 Core Contribution
Key Novelty: Autoresearch demonstrates that a coding LLM agent, given nothing more than a Markdown instruction file and a single-GPU training setup, can autonomously conduct neural network architecture and hyperparameter research overnight — discovering real improvements that stack cumulatively to produce an 11% reduction in time-to-GPT-2 quality.
What Makes Autoresearch Novel
- Research-as-code paradigm. The human writes program.md — a natural-language specification of the research program. The LLM reads it and executes the research autonomously. This inverts the traditional human-researcher / computer-tool relationship.
- Radical simplicity. Three files. One metric. One GPU. Five-minute experiments. No frameworks, no orchestrators, no databases, no vector stores. The entire system fits in a single screen of instructions. This stands in stark contrast to systems like AlphaEvolve (Google DeepMind) or OpenEvolve, which require substantial infrastructure.
- Fixed-budget evaluation. Every experiment runs for exactly 5 minutes of wall-clock training time. This makes all results directly comparable regardless of what the agent changes (model size, batch size, architecture). The constraint forces the agent to think about compute-efficiency, not just model quality.
- Cumulative greedy hill-climbing. Changes that improve val_bpb are kept (the branch advances); changes that don't are discarded (git reset). This creates a monotonically improving trajectory — the branch represents the best-known configuration at all times.
- Self-contained reproducibility. The entire state of every experiment is captured as a git commit. The results log (results.tsv) provides a complete audit trail. Any experiment can be reproduced by checking out its commit.
- "Never stop" autonomy. The agent is explicitly instructed to never pause, never ask for confirmation, and run indefinitely until manually stopped. This enables overnight and multi-day autonomous research campaigns.
Relationship to Prior Work
| System | Year | Complexity | Agent Role | Human Role | Search Strategy |
|---|---|---|---|---|---|
| Grid Search | Classical | Low | None | Defines grid | Exhaustive |
| Bayesian Optimization | Classical | Medium | Statistical model | Defines space | Acquisition function |
| Neural Architecture Search | 2016+ | High | RL/evolution agent | Defines search space | RL, evolutionary |
| FunSearch (DeepMind) | 2023 | High | LLM mutation | Defines evaluator | Evolutionary |
| AlphaEvolve (DeepMind) | 2025 | Very High | LLM ensemble | Defines problem | MAP-Elites + ensemble |
| Autoresearch | 2026 | Minimal | LLM coding agent | Writes program.md | Greedy hill-climbing |
Autoresearch occupies a unique position: it is the simplest autonomous research system that produces real results. Where AlphaEvolve requires a team of engineers to operate, autoresearch requires uv sync && uv run prepare.py and a prompt.
The Simplicity Criterion
A distinctive design choice is the explicit simplicity constraint in program.md:
"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome — that's a simplification win."
This prevents the common failure mode of autonomous optimization systems: accumulating complexity without proportional gains. The agent is instructed to weigh improvement magnitude against code complexity cost. A 0.001 val_bpb improvement from deleting code is preferred over a 0.001 improvement from adding 20 lines of hacky code.
4 Supported Solutions
Primary Domain: Neural Network Training Optimization
Autoresearch targets a single domain — improving a GPT model's training efficiency within a fixed compute budget. The agent has full freedom to modify:
| Solution Category | Scope | Examples |
|---|---|---|
| Model Architecture | Transformer depth, width, attention patterns | Change DEPTH, ASPECT_RATIO, WINDOW_PATTERN, add/remove layers |
| Optimizer Configuration | Learning rates, betas, weight decay, momentum | Tune MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, Muon parameters |
| Learning Rate Schedules | Warmup, warmdown, final LR fraction | Modify WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC |
| Batch Size | Total batch size, gradient accumulation | Change TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE |
| Attention Mechanisms | Sliding window patterns, head configuration | Modify WINDOW_PATTERN, HEAD_DIM, n_kv_head |
| Activation Functions | Non-linearity choices | Replace F.relu(x).square() in MLP |
| Normalization | RMSNorm variants, placement | Modify norm() function |
| Weight Initialization | Init scales, strategies | Modify init_weights() method |
| Residual Connections | Residual lambdas, x0 mixing | Tune resid_lambdas, x0_lambdas |
| Value Embeddings | Value residual gating | Modify ve_gate, gating mechanisms |
| Rotary Embeddings | Position encoding variants | Modify apply_rotary_emb(), base frequency |
| Logit Processing | Softcap, temperature | Change softcap value in forward pass |
| Optimizer Internals | Muon orthogonalization, NorMuon variance | Modify muon_step_fused, polar express coefficients |
Constraint Surface
The agent operates within hard constraints that define the problem:
IMMUTABLE (prepare.py) MUTABLE (train.py)
┌─────────────────────────┐ ┌─────────────────────────┐
│ MAX_SEQ_LEN = 2048 │ │ ASPECT_RATIO = 64 │
│ TIME_BUDGET = 300s │ │ HEAD_DIM = 128 │
│ EVAL_TOKENS = 40×524288 │ │ WINDOW_PATTERN = "SSSL" │
│ VOCAB_SIZE = 8192 │ │ TOTAL_BATCH_SIZE = 2^19 │
│ evaluate_bpb() │ │ All LR parameters │
│ make_dataloader() │ │ WEIGHT_DECAY = 0.2 │
│ Tokenizer │ │ ADAM_BETAS = (0.8, 0.95)│
│ Data pipeline │ │ DEPTH = 8 │
│ Split pattern │ │ GPT model class │
└─────────────────────────┘ │ MuonAdamW optimizer │
│ Training loop │
│ LR schedules │
│ Everything else │
└─────────────────────────┘
Metric: Bits Per Byte (BPB)
The single optimization objective is val_bpb — validation bits per byte. This is a vocabulary-size-independent metric:
Σ cross_entropy(logits, targets) [nats]
val_bpb = ────────────────────────────────────────────────────
ln(2) × Σ utf8_byte_length(target_tokens)
BPB is preferred over perplexity because it allows fair comparison across different vocabulary sizes: even if tokenization-adjacent parameters could change (the agent cannot touch them, since prepare.py is frozen), the metric would remain valid in principle.
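As a sketch, the metric can be computed from per-token cross-entropy (in nats) and the UTF-8 byte length of the decoded targets. The function below is illustrative; `evaluate_bpb()` in prepare.py is the authoritative implementation:

```python
import math

def val_bpb(token_nats, token_texts):
    """Bits per byte: total cross-entropy (nats) converted to bits,
    divided by the UTF-8 byte length of the decoded target text.
    Illustrative sketch of the metric, not the prepare.py code."""
    total_nats = sum(token_nats)                         # sum of cross-entropy in nats
    total_bytes = sum(len(t.encode("utf-8")) for t in token_texts)
    return total_nats / (math.log(2) * total_bytes)      # nats -> bits, per byte

# Toy check: 1 nat of loss on a token whose text is 1 byte
# gives 1 / ln(2) ~= 1.4427 bits per byte
print(round(val_bpb([1.0], ["a"]), 4))
```

Note how vocabulary size drops out: a token covering more bytes contributes more to the denominator, so retokenizing the same text does not change the units of the metric.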
Soft Constraints
- VRAM: Some increase acceptable for meaningful val_bpb gains, but should not "blow up dramatically"
- Simplicity: Complexity cost is weighed against improvement magnitude
- Runtime: Must complete within the 5-minute budget (10-minute hard kill)
- Dependencies: Cannot add packages beyond those in pyproject.toml
5 LLM Integration
Agent Architecture
Autoresearch uses a single coding LLM agent — any agent capable of reading files, editing code, and executing shell commands. The system is agent-agnostic by design.
| Aspect | Detail |
|---|---|
| Agent Type | General-purpose coding agent (Claude Code, Codex, etc.) |
| Agent Count | 1 (default; SkyPilot extension enables multi-agent) |
| Instruction Format | Natural language Markdown (program.md) |
| Agent Loop | Read → Edit → Commit → Run → Evaluate → Keep/Discard → Repeat |
| Human in Loop | None during execution (fully autonomous) |
| Context Management | Redirect stdout to file, grep for metrics (avoids flooding context) |
How the LLM is Used
The LLM serves simultaneously as:
- Hypothesis generator — decides what experiment to try next based on prior results
- Code writer — implements the experiment by editing train.py
- Result interpreter — reads metrics, decides if the change was beneficial
- Research strategist — plans what direction to explore based on accumulated evidence
- Error handler — diagnoses crashes, fixes bugs, decides whether to retry or abandon
This is fundamentally different from systems like AlphaEvolve where the LLM is used only as a mutation operator within a larger evolutionary framework. In autoresearch, the LLM is the entire system — it handles strategy, implementation, evaluation, and decision-making.
Prompt Engineering via program.md
The program.md file is the entire "prompt engineering" layer. It functions as a research specification document with several key sections:
Setup Protocol:
1. Agree on a run tag (e.g. "mar5")
2. Create branch: git checkout -b autoresearch/<tag>
3. Read in-scope files: README.md, prepare.py, train.py
4. Verify data exists in ~/.cache/autoresearch/
5. Initialize results.tsv with header row
6. Confirm and begin
Experiment Loop (per cycle):
1. Look at git state (current branch/commit)
2. Edit train.py with experimental idea
3. git commit
4. Run: uv run train.py > run.log 2>&1
5. Read results: grep "^val_bpb:\|^peak_vram_mb:" run.log
6. If empty → crash → tail -n 50 run.log → diagnose
7. Log to results.tsv
8. If improved → keep commit (advance branch)
9. If not improved → git reset to previous best
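The keep/discard logic of steps 5-9 can be sketched in plain Python. The helper names (`parse_run_log`, `decide`) are hypothetical; the real agent performs these steps with grep and git rather than a driver script:

```python
import re

def parse_run_log(log_text):
    """Mirror of step 5: extract the metrics the agent greps for.
    Returns None when the log has no val_bpb line (i.e., a crash)."""
    bpb = re.search(r"^val_bpb: ([\d.]+)", log_text, re.MULTILINE)
    vram = re.search(r"^peak_vram_mb: ([\d.]+)", log_text, re.MULTILINE)
    if bpb is None:
        return None
    return float(bpb.group(1)), float(vram.group(1)) if vram else 0.0

def decide(result, best_bpb):
    """Steps 6-9: crash -> reset, improved -> keep, else discard."""
    if result is None:
        return "crash"
    val_bpb, _ = result
    return "keep" if val_bpb < best_bpb else "discard"

log = "step 1100 loss 3.2\nval_bpb: 0.9932\npeak_vram_mb: 45260\n"
print(decide(parse_run_log(log), best_bpb=0.9979))  # prints: keep
```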
Critical Behavioral Instructions:
"NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue. Do NOT ask 'should I keep going?' or 'is this a good stopping point?'. The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped."
"If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."
Context Window Management
A subtle but critical design decision is how the agent manages its context window:
uv run train.py > run.log 2>&1 # redirect ALL output to file
grep "^val_bpb:" run.log # extract only the metric
By redirecting training output to a file instead of printing to stdout, the agent avoids flooding its context window with thousands of lines of training progress. It reads only what it needs (the final metric, or a crash traceback). This is essential for running 100+ experiments without context exhaustion.
Agent-Agnostic Design
Autoresearch deliberately avoids coupling to any specific LLM provider or agent framework:
"Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like: 'Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.'"
The program.md is written in natural language that any sufficiently capable coding agent can follow. The only requirements are:
- File read/write capability
- Shell command execution
- Git operations
- Basic reasoning about experimental results
This makes autoresearch a meta-framework — it defines the research protocol, not the agent implementation.
6 Key Results
First Overnight Run
Karpathy's initial overnight run (reported via tweet) achieved:
| Metric | Value |
|---|---|
| val_bpb improvement | ~11% reduction vs. baseline |
| Improvements found | ~20 |
| Total experiments | ~100 over ~8 hours |
| Runtime | Overnight (~8 hours) |
| GPU | Single H100 |
| Agent | Claude Code |
Experiment Throughput
The fixed 5-minute training budget plus agent overhead yields predictable throughput:
Per experiment:
Agent thinks + edits train.py ~30 seconds
uv run train.py ~5 minutes (300s budget)
Agent reads results + logs ~30 seconds
─────────────────────────────────────────────
Total per experiment ~6 minutes
Throughput:
Per hour ~10 experiments
Per 8-hour overnight ~80 experiments
Per 24-hour run ~240 experiments
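The throughput figures above are simple arithmetic over the fixed budget plus an assumed ~1 minute of agent overhead per cycle:

```python
# Throughput sketch. TRAIN_BUDGET_S is fixed by prepare.py; the
# overhead figure is an assumption from the breakdown above.
TRAIN_BUDGET_S = 300    # 5-minute training budget
AGENT_OVERHEAD_S = 60   # ~30 s planning/editing + ~30 s reading results

cycle_s = TRAIN_BUDGET_S + AGENT_OVERHEAD_S  # ~6 minutes per experiment
per_hour = 3600 / cycle_s                    # 10.0 experiments/hour
for hours in (8, 24):
    print(f"{hours}h run: ~{int(per_hour * hours)} experiments")
```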
Improvement Trajectory
The results follow a characteristic diminishing-returns pattern:
val_bpb
▲
│
│ ● baseline
│ ╲
│ ╲ rapid early gains
│ ╲ (hyperparameter sweep)
│ ╲
│ ╲
│ ╲───── moderate gains
│ ╲ (architecture changes)
│ ╲
│ ╲─────── diminishing returns
│ ╲── (fine-tuning, combinations)
│ ╲──────────────
│
└──────────────────────────────────► experiments
0 20 40 60 80 100
What the Agent Typically Discovers
Based on community reports and the SkyPilot extension results, autonomous runs consistently discover:
| Discovery Category | Typical Finding | Typical Impact |
|---|---|---|
| Batch size reduction | 2^19 → 2^18 (more optimizer steps in 5 min) | High (0.01-0.02 bpb) |
| Adam betas tuning | (0.8, 0.95) → (0.9, 0.95) or (0.7, 0.95) | Medium (0.005 bpb) |
| Weight decay reduction | 0.2 → 0.08 | Medium (0.003 bpb) |
| Model width scaling | ASPECT_RATIO 64 → 96 | High (0.01+ bpb) |
| Window pattern changes | "SSSL" → "SL" | Small (0.001 bpb) |
| LR schedule tuning | Warmdown ratio, final LR fraction | Small-Medium |
| Muon optimizer params | momentum, beta2, ns_steps | Small (0.001 bpb) |
Nanochat Leaderboard Impact
The autoresearch results directly translate to improvements on the nanochat leaderboard, which tracks the best val_bpb achievable in 5 minutes of single-GPU training. The 11% improvement from the first overnight run represented a significant leap on this leaderboard.
7 Reproducibility
Reproducibility Score: Very High
Autoresearch is one of the most reproducible autonomous research systems:
| Factor | Assessment | Detail |
|---|---|---|
| Code availability | Full source, MIT license | 3 files, no hidden components |
| Data availability | Public dataset (climbmix-400b) | Auto-downloaded by prepare.py |
| Compute requirements | 1 NVIDIA GPU | Tested on H100; community forks for other hardware |
| Dependencies | Minimal (PyTorch + 8 packages) | Pinned in pyproject.toml via uv.lock |
| Determinism | Seeded (torch.manual_seed(42)) | Platform-dependent due to GPU timing |
| Experiment logging | Complete git history | Every experiment is a commit + TSV row |
| Agent instructions | Fully specified in program.md | No hidden system prompts |
Reproduction Steps
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# 2. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py
# 3. Establish baseline
uv run train.py
# 4. Start autonomous research
# Point your coding agent at the repo and prompt:
# "Read program.md and let's kick off a new experiment"
Platform Sensitivity
A critical caveat: results are platform-dependent. The 5-minute time budget means different GPUs will complete different numbers of training steps:
| GPU | Approximate Steps in 5 min | Relative Throughput |
|---|---|---|
| H200 | ~1,200+ | 1.09x |
| H100 | ~1,100 | 1.0x (reference) |
| A100 (80GB) | ~800 | ~0.73x |
| RTX 4090 | ~600 | ~0.55x |
| RTX 3090 | ~400 | ~0.36x |
This is by design — autoresearch finds the optimal model for your compute. But it means results from different hardware are not directly comparable. Karpathy explicitly acknowledges this:
"This means that autoresearch will find the most optimal model for your platform in that time budget. The downside is that your runs (and results) become not comparable to other people running on other compute platforms."
Recommendations for Smaller Hardware
For users without H100-class hardware, Karpathy provides specific guidance:
- Use TinyStories dataset (lower entropy → smaller models work)
- Reduce vocab_size (8192 → 4096/2048/1024/256)
- Lower MAX_SEQ_LEN (2048 → 512/256)
- Decrease DEPTH (8 → 4)
- Use WINDOW_PATTERN = "L" (avoid banded attention inefficiency)
- Lower TOTAL_BATCH_SIZE (2^19 → 2^14)
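Put together, a scaled-down configuration might look like the following. The constant names come from prepare.py/train.py as described in this report; the specific values are one hypothetical combination of the guidance above, edited before starting a run:

```python
# Hypothetical consumer-GPU configuration (values illustrative).
VOCAB_SIZE = 4096           # down from 8192; as low as 256 on tiny GPUs
MAX_SEQ_LEN = 512           # down from 2048
DEPTH = 4                   # down from 8
WINDOW_PATTERN = "L"        # all-long attention; banded windows are
                            # inefficient at small scale
TOTAL_BATCH_SIZE = 2 ** 14  # down from 2**19
```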
8 Compute and API Costs
GPU Compute Costs
| Configuration | GPU-Hours | Estimated Cost |
|---|---|---|
| Single overnight run (8h) | 8 GPU-hours | $16 (H100 @ $2/hr) |
| Extended run (24h) | 24 GPU-hours | $48 |
| Weekend run (48h) | 48 GPU-hours | $96 |
| Community fork (RTX 4090) | 8 hours | ~$0 (consumer GPU) |
LLM API Costs
The agent makes relatively few API calls per experiment cycle:
Per experiment cycle:
Read current state + plan ~2K tokens input
Generate code edit ~1K tokens output
Read results + decide ~500 tokens input
Log results ~200 tokens output
─────────────────────────────────────────────
Total per cycle ~4K tokens
Per overnight run (100 experiments):
Total tokens ~400K tokens
Estimated cost (Claude Sonnet) ~$1.50-3.00
Estimated cost (GPT-4o) ~$1.00-2.00
Total Cost per Run
| Run Duration | GPU Cost | API Cost | Total |
|---|---|---|---|
| 8h overnight (H100) | $16 | ~$2 | ~$18 |
| 8h overnight (RTX 4090) | $0 | ~$2 | ~$2 |
| 24h extended (H100) | $48 | ~$6 | ~$54 |
The cost structure is GPU-dominated for cloud users and API-dominated for users with local GPUs. This makes autoresearch remarkably cheap compared to traditional hyperparameter search (which requires human researcher time) or systems like AlphaEvolve (which require multi-GPU clusters and engineering teams).
Cost Efficiency Analysis
Traditional ML Research:
Senior researcher salary: ~$200K/year → ~$100/hr fully loaded
8 hours of researcher time: ~$800
GPU cost for same period: ~$16
Total: ~$816
Autoresearch:
8 hours (overnight, researcher sleeping): ~$18
Number of experiments: ~80-100
Human effort: ~10 min setup
Cost reduction: ~45x
Experiments/dollar:
Traditional: ~0.1 (researcher bottleneck)
Autoresearch: ~5.5
9 Architecture Solution
System Architecture
Autoresearch's architecture is remarkable for what it doesn't include:
┌─────────────────────────────────────────────────────────┐
│ CODING LLM AGENT │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Hypothesis │ │ Code Editor │ │ Result │ │
│ │ Generator │ │ (train.py) │ │ Interpreter │ │
│ └──────┬──────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ └────────────────┼───────────────────┘ │
│ │ │
│ ┌───────────▼──────────┐ │
│ │ Decision Engine │ │
│ │ (keep/discard/crash)│ │
│ └───────────┬──────────┘ │
└──────────────────────────┼──────────────────────────────┘
│
┌────────────▼────────────┐
│ SHELL INTERFACE │
│ git, uv, grep, tail │
└────────────┬────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────────▼────────┐ ┌─────▼──────┐ ┌───────▼───────┐
│ train.py │ │ run.log │ │ results.tsv │
│ (single file │ │ (output) │ │ (audit log) │
│ the agent │ │ │ │ │
│ modifies) │ │ │ │ │
└────────┬────────┘ └────────────┘ └───────────────┘
│
┌────────▼────────┐
│ prepare.py │
│ (read-only) │
│ - dataloader │
│ - tokenizer │
│ - evaluation │
│ - constants │
└────────┬────────┘
│
┌────────▼────────┐
│ SINGLE GPU │
│ (H100/etc) │
│ 5-min training │
└─────────────────┘
Key Architectural Decisions
1. No database, no queue, no orchestrator.
Most autonomous research systems (AlphaEvolve, OpenEvolve, The AI Scientist) include substantial infrastructure: databases for results, queues for experiments, orchestrators for parallelism, vector stores for memory. Autoresearch uses none of these. The filesystem (git) is the database. The agent's context window is the memory. Shell commands are the orchestrator.
2. Git-as-state-machine.
The git branch serves as a state machine for the research process:
baseline commit
│
┌─────────────┼─────────────┐
│ │ │
exp-01 exp-02 exp-03
(keep ✓) (discard ✗) (crash ✗)
│ │ │
│ git reset git reset
│ │ │
▼ ▼ ▼
commit-01 (back to (back to
│ baseline) baseline)
│
exp-04
(keep ✓)
│
▼
commit-02
│
...
Each kept experiment advances the branch. Each discarded experiment resets to the last good state. The branch tip always represents the best-known configuration.
3. Fixed-budget evaluation.
The 5-minute wall-clock training budget is a brilliant constraint:
- Makes all experiments directly comparable regardless of changes
- Prevents the agent from exploiting "train longer" as a strategy
- Forces compute-efficiency thinking (more steps = better, so batch size and model size matter)
- Enables predictable throughput (~10-12 experiments/hour)
4. Single-file modification scope.
The agent only modifies train.py. This constraint:
- Keeps the search space manageable
- Makes diffs reviewable by humans
- Prevents the agent from gaming the evaluation (it cannot modify prepare.py)
- Ensures fair comparison (data pipeline and metric are fixed)
Information Flow
┌──────────────────────────────────────────────────────┐
│ EXPERIMENT CYCLE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ program │───▶│ Agent │───▶│ train.py │ │
│ │ .md │ │ (LLM) │ │ (edited) │ │
│ │ │ │ │ │ │ │
│ │ Rules │ │ Context: │ └────┬─────┘ │
│ │ Goals │ │ - rules │ │ │
│ │ Format │ │ - code │ ┌────▼─────┐ │
│ └──────────┘ │ - past │ │ GPU │ │
│ │ results│ │ Training │ │
│ ┌──────────┐ │ - errors │ │ (5 min) │ │
│ │ results │◀───│ │ └────┬─────┘ │
│ │ .tsv │ │ │ │ │
│ │ │ │ Decision:│ ┌────▼─────┐ │
│ │ Audit │ │ keep or │ │ run.log │ │
│ │ trail │ │ discard │◀───│ (output) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
└──────────────────────────────────────────────────────┘
10 Component Breakdown
Component 1: program.md — The Research Specification
Purpose: Defines the research protocol in natural language for any LLM agent.
Structure:
| Section | Purpose | Key Constraints |
|---|---|---|
| Setup | Initialize experiment branch, verify data | Must read all in-scope files first |
| Experimentation | What can/cannot be modified | Only train.py is mutable |
| Output Format | How to parse experiment results | grep-based metric extraction |
| Logging | How to record results | Tab-separated results.tsv |
| Experiment Loop | Core loop specification | Keep if improved, discard if not |
| Never Stop | Autonomy directive | No human confirmation needed |
Key behavioral instructions encoded in program.md:
- Crash handling: "If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it."
- Timeout policy: "If a run exceeds 10 minutes, kill it and treat it as a failure."
- Autonomy: "The loop runs until the human interrupts you, period."
- Idea generation: "If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."
- Results file isolation: "Do not commit the results.tsv file, leave it untracked by git."
Component 2: train.py — The Mutable Training Script
Purpose: Contains the complete GPT model, optimizer, and training loop. This is the only file the agent modifies.
Size: ~450 lines of Python
Sub-components:
| Sub-component | Lines | Purpose |
|---|---|---|
| `GPTConfig` | ~10 | Model configuration dataclass |
| `CausalSelfAttention` | ~40 | Multi-head attention with RoPE, value embeddings, flash attention |
| `MLP` | ~10 | Feed-forward network with ReGLU-squared activation |
| `Block` | ~10 | Transformer block (attention + MLP + residual) |
| `GPT` | ~120 | Full model with embeddings, blocks, logit head, rotary cache |
| `MuonAdamW` | ~100 | Combined optimizer: Muon for matrices, AdamW for others |
| Hyperparameters | ~20 | Top-level constants (the primary knobs) |
| Setup | ~40 | Model construction, optimizer creation, dataloader |
| Training Loop | ~80 | The actual training loop with logging, scheduling, evaluation |
Notable design choices in train.py:
- Muon + AdamW hybrid optimizer: Weight matrices use the Muon optimizer (gradient orthogonalization via polar decomposition) while embeddings, scalars, and biases use AdamW. This is a state-of-the-art optimizer design.
- Value embeddings (ResFormer): Alternating layers have value embeddings with input-dependent gating — a recent technique for improved transformer training.
- Residual scaling: Per-layer learnable residual lambdas and x0 mixing coefficients, following recent work on deep transformer training stability.
- ReGLU-squared activation: F.relu(x).square() — a non-standard activation that has shown empirical benefits in small-model training.
- Logit soft-capping: softcap * tanh(logits / softcap) — prevents extreme logit values, improving training stability.
- Garbage-collection management: Manual GC control with gc.freeze() and gc.disable() to avoid Python GC stalls during training.
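Two of these choices are easy to show in miniature. The scalar sketches below (pure Python, with an illustrative softcap value) mirror the PyTorch expressions named above but are not the train.py code:

```python
import math

def relu_squared(x):
    """Scalar version of the MLP activation F.relu(x).square()."""
    return max(x, 0.0) ** 2

def softcap_logit(logit, softcap=15.0):
    """Logit soft-capping: softcap * tanh(logit / softcap).
    Smoothly bounds logits to (-softcap, softcap); near-identity
    for |logit| << softcap. The softcap value here is illustrative."""
    return softcap * math.tanh(logit / softcap)

print(relu_squared(-2.0), relu_squared(3.0))  # 0.0 9.0
print(round(softcap_logit(100.0), 4))         # 15.0  (extreme logit capped)
print(round(softcap_logit(1.0), 4))           # 0.9985 (small logit ~unchanged)
```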
Component 3: prepare.py — The Fixed Infrastructure
Purpose: Data download, tokenizer training, dataloader, and evaluation metric. Immutable — the agent cannot touch this.
Key constants:
MAX_SEQ_LEN = 2048 # context length
TIME_BUDGET = 300 # 5 minutes of training
EVAL_TOKENS = 40×524288 # ~20M tokens for validation
VOCAB_SIZE = 8192 # BPE vocabulary size
Sub-components:
| Sub-component | Purpose |
|---|---|
| Data download | Fetches parquet shards from HuggingFace (climbmix-400b) |
| Tokenizer training | BPE via rustbpe, saved as tiktoken pickle |
| `Tokenizer` class | Wrapper with encode/decode, BOS token support |
| `make_dataloader()` | BOS-aligned best-fit packing dataloader |
| `evaluate_bpb()` | The sacred evaluation function (bits per byte) |
The separation of prepare.py (immutable) and train.py (mutable) is architecturally critical:
┌─────────────────┐ ┌─────────────────┐
│ prepare.py │ │ train.py │
│ (FROZEN) │ │ (MUTABLE) │
│ │ │ │
│ Ground truth │ │ Experiment │
│ evaluation │◄───│ space │
│ │ │ │
│ Cannot be │ │ Everything │
│ gamed by │ │ is fair game │
│ the agent │ │ │
└─────────────────┘ └─────────────────┘
Component 4: results.tsv — The Audit Trail
Purpose: Human-readable experiment log. Tab-separated for easy parsing.
Schema:
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6a 0.000000 0.0 crash double model width (OOM)
Design choices:
- Tab-separated (not CSV) — commas break in descriptions
- Not committed to git — keeps the branch clean for code-only diffs
- Short commit hash (7 chars) for traceability
- Status enum: keep, discard, crash
- Memory tracked to detect VRAM regressions
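Because the log is plain TSV, the best-known configuration can be recovered with a few lines of Python (the sample rows follow the schema above; one would then check out the winning commit with git):

```python
import csv, io

# Hypothetical results.tsv contents following the schema above.
TSV = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
kept = [r for r in rows if r["status"] == "keep"]
best = min(kept, key=lambda r: float(r["val_bpb"]))  # lowest bpb wins
print(best["commit"], best["val_bpb"])  # prints: b2c3d4e 0.993200
```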
Component 5: Git Branch — The State Machine
Purpose: Version control serves as experiment state management.
The branch naming convention (autoresearch/<tag>) creates isolated experiment runs:
master ──────────────────────────────────────────►
│
├── autoresearch/mar5 ──────●────●────●────►
│ exp-01 exp-04 exp-07
│ (kept) (kept) (kept)
│
├── autoresearch/mar5-gpu0 ──────●────●────►
│
└── autoresearch/mar6 ──────●────●────●────►
11 Core Mechanisms (Detailed)
Mechanism 1: The Greedy Hill-Climbing Loop
The core algorithm is deceptively simple:
INITIALIZE:
branch = autoresearch/<tag>
baseline = run(train.py) # establish baseline val_bpb
best_bpb = baseline
LOOP FOREVER:
idea = agent.generate_hypothesis(history, code)
modified_code = agent.implement(idea, train.py)
git.commit(modified_code)
result = run(modified_code)   # trains for the fixed 5-min budget; 10-min hard kill
IF result.crashed:
IF easy_fix(result.error):
fix_and_retry()
ELSE:
log(status="crash")
git.reset(best_commit)
ELIF result.val_bpb < best_bpb:
best_bpb = result.val_bpb
best_commit = git.HEAD
log(status="keep")
# branch advances naturally
ELSE:
log(status="discard")
git.reset(best_commit)
Analysis:
This is a first-order greedy search with a sophisticated "sensor" (the LLM's judgment about what to try). The greedy hill-climbing has well-known limitations:
- Local optima trapping: The agent can get stuck in a local optimum where no single change improves the metric, but a combination of changes would.
- No backtracking: Once the agent advances past a commit, it cannot easily revisit that region of the search space (though program.md mentions rewinding "very sparingly").
- No exploration-exploitation balance: There is no formal mechanism for trading off exploitation (fine-tuning what works) against exploration (trying radically different ideas).
However, the LLM mitigates these limitations:
- The LLM can propose compound changes (modify 3 hyperparameters at once)
- The LLM has semantic understanding of the code and can make informed hypotheses
- The LLM has (implicit) knowledge from its training data about what works in ML
- The "think harder" instruction encourages the agent to try radical changes when stuck
Mechanism 2: The Muon + AdamW Optimizer
The training script uses a sophisticated dual-optimizer design that the agent can modify:
┌──────────────────────────────────────────────────┐
│ MuonAdamW Optimizer │
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Muon Optimizer │ │ AdamW Optimizer │ │
│ │ │ │ │ │
│ │ For: 2D weight │ │ For: embeddings, │ │
│ │ matrices │ │ scalars, biases │ │
│ │ │ │ │ │
│ │ Steps: │ │ Standard Adam │ │
│ │ 1. Nesterov │ │ with bias │ │
│ │ momentum │ │ correction and │ │
│ │ 2. Polar express │ │ weight decay │ │
│ │ orthogonalize │ │ │ │
│ │ 3. NorMuon │ │ │ │
│ │ variance │ │ │ │
│ │ reduction │ │ │ │
│ │ 4. Cautious │ │ │ │
│ │ weight decay │ │ │ │
│ └────────────────────┘ └────────────────────┘ │
└──────────────────────────────────────────────────┘
Muon gradient orthogonalization (Polar Express):
The Muon optimizer applies the "polar express" — an iterative approximation of the polar decomposition, which extracts the orthogonal factor of a matrix — to each gradient before applying it. Orthogonalizing the update equalizes the step taken along every singular direction of the weight matrix, acting as a form of natural-gradient-style preconditioning rather than a plain Euclidean step.
The coefficients are precomputed:
polar_express_coeffs = [
(8.156554524902461, -22.48329292557795, 15.878769915207462),
(4.042929935166739, -2.808917465908714, 0.5000178451051316),
(3.8916678022926607, -2.772484153217685, 0.5060648178503393),
(3.285753657755655, -2.3681294933425376, 0.46449024233003106),
(2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]
Each iteration applies X = a*X + X @ (b*A + c*A@A) where A = X^T @ X (or X @ X^T for tall matrices). Since the polynomial acts on each singular value of X independently, repeated application pushes all singular values toward 1, converging to the orthogonal polar factor (the orthogonal Procrustes solution) in ~5 iterations.
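A minimal NumPy sketch of the iteration, using the coefficients above. The Frobenius-norm normalization (to bound singular values by 1) and the tall-matrix transpose handling are assumptions about the implementation, and convergence to an exactly orthogonal matrix is only approximate.

```python
import numpy as np

# Coefficients from the document; each tuple (a, b, c) defines one quintic step.
POLAR_EXPRESS_COEFFS = [
    (8.156554524902461, -22.48329292557795, 15.878769915207462),
    (4.042929935166739, -2.808917465908714, 0.5000178451051316),
    (3.8916678022926607, -2.772484153217685, 0.5060648178503393),
    (3.285753657755655, -2.3681294933425376, 0.46449024233003106),
    (2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]

def orthogonalize(G):
    """Approximate the orthogonal polar factor of G via the quintic iteration
    X = a*X + X @ (b*A + c*A@A) with A = X.T @ X (transposing tall matrices)."""
    transpose = G.shape[0] > G.shape[1]
    X = G.T if transpose else G                 # work with a wide matrix
    X = X / (np.linalg.norm(X) + 1e-7)          # Frobenius norm bounds singular values by 1
    for a, b, c in POLAR_EXPRESS_COEFFS:
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))   # acts on each singular value independently
    return X.T if transpose else X
```

After the five steps, the singular values of the result cluster near 1 (the iteration oscillates on purpose, overshooting early to amplify small singular values).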
NorMuon variance reduction:
After orthogonalization, the optimizer applies variance normalization with a second-momentum buffer:
v_mean = (g**2).mean(dim=red_dim)                      # per-row or per-column second moment
second_momentum = lerp(second_momentum, v_mean, 1-β₂)  # exponential moving average
g_normed = g / sqrt(second_momentum)                   # inverse-std step sizing
g = g_normed * (norm(g) / norm(g_normed))              # rescale to preserve the update norm
This combines the direction quality from orthogonalization with adaptive step sizing from variance tracking — conceptually similar to combining Adam's adaptation with natural gradient's geometry.
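A NumPy sketch of this normalization step, with hypothetical names (`normuon_step`, a per-row `second_momentum` buffer); the final rescale is the scale-preserving factor from the pseudocode above.

```python
import numpy as np

def normuon_step(g, second_momentum, beta2=0.95, eps=1e-8):
    """Variance-normalize an (already orthogonalized) update g, per row,
    while preserving the overall update norm."""
    v_mean = (g ** 2).mean(axis=1, keepdims=True)                     # per-row second moment
    second_momentum = beta2 * second_momentum + (1 - beta2) * v_mean  # EMA (lerp)
    g_normed = g / np.sqrt(second_momentum + eps)                     # inverse-std step sizing
    # rescale so the total update norm matches the pre-normalization norm
    g_out = g_normed * (np.linalg.norm(g) / (np.linalg.norm(g_normed) + eps))
    return g_out, second_momentum
```

By construction, the returned update has the same Frobenius norm as the input: only the relative per-row step sizes change.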
Mechanism 3: Fixed-Budget Training and Evaluation
The training loop implements the fixed-budget constraint:
TIME_BUDGET = 300 # 5 minutes
while True:
# ... training step ...
if step > 10: # exclude warmup/compilation
total_training_time += dt
if step > 10 and total_training_time >= TIME_BUDGET:
break
The first 10 steps are excluded from the time budget to account for PyTorch compilation and CUDA warmup. This ensures the budget measures actual training time, not one-time startup costs.
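The budget logic can be sketched in isolation. `run_with_budget` and its parameters are illustrative names, not the repository's actual functions.

```python
import time

TIME_BUDGET = 300.0   # seconds of pure training time
WARMUP_STEPS = 10     # compile/CUDA warmup steps excluded from the budget

def run_with_budget(train_step, budget=TIME_BUDGET, warmup=WARMUP_STEPS):
    """Run train_step() repeatedly; only time spent after `warmup` steps
    counts against the budget, so one-time startup costs are excluded."""
    total, step = 0.0, 0
    while True:
        t0 = time.perf_counter()
        train_step(step)
        dt = time.perf_counter() - t0
        step += 1
        if step > warmup:
            total += dt
            if total >= budget:
                break
    return step, total
```

A faster machine therefore gets more steps out of the same budget, which is why results are hardware-specific by design.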
Evaluation (BPB):
def evaluate_bpb(model, tokenizer, batch_size):
    # Evaluate on ~20M tokens from the validation split
    # (steps, val_batches, and the token_bytes lookup table are set up elsewhere in train.py)
    total_nats = 0.0
    total_bytes = 0
    for _ in range(steps):
        x, y = next(val_batches)                            # input and target token batches
        loss_flat = model(x, y, reduction='none').view(-1)  # per-token loss in nats
        y_flat = y.view(-1)
        nbytes = token_bytes[y_flat]                        # UTF-8 byte count per target token
        mask = nbytes > 0                                   # exclude special tokens
        total_nats += (loss_flat * mask).sum().item()
        total_bytes += nbytes.sum().item()
    return total_nats / (math.log(2) * total_bytes)
BPB (bits per byte) converts from nats (natural log) to bits (log₂) and normalizes by UTF-8 byte count rather than token count. This makes the metric independent of tokenizer vocabulary size.
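The conversion itself is a one-liner. The worked example below uses hypothetical numbers (0.8 nats of total loss over 4 bytes of text, i.e. an average token):

```python
import math

def nats_to_bpb(total_nats, total_bytes):
    # bits = nats / ln(2); normalize by UTF-8 bytes rather than tokens,
    # making the metric independent of tokenizer vocabulary size
    return total_nats / (math.log(2) * total_bytes)

# e.g. a token costing 0.8 nats that spans 4 UTF-8 bytes:
# 0.8 / (ln 2 * 4) ≈ 0.289 bits per byte
```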
Mechanism 4: The GPT Architecture
The model follows a modern GPT architecture with several recent innovations:
Input tokens (B, T)
│
▼
┌───────────────┐
│ Token Embed │ wte: vocab_size → n_embd
│ (+ RMSNorm) │
└───────┬───────┘
│
▼ x₀ (saved for residual mixing)
┌───────────────────────────────────────────┐
│ Block i (×n_layer) │
│ │
│ x = λᵢ·x + α₀ᵢ·x₀ ← residual mixing │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ CausalSelfAttention │ │
│ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │
│ │ │ Q=Wx │ │ K=Wx │ │ V=Wx │ │ │
│ │ └───┬───┘ └───┬───┘ └───┬───┘ │ │
│ │ │ │ + VE·gate │ │
│ │ ┌───▼─────────▼───┐ │ │ │
│ │ │ RoPE + QKNorm │ │ │ │
│ │ └───────┬──────────┘ │ │ │
│ │ │ │ │ │
│ │ ┌───────▼────────────────▼───────┐ │ │
│ │ │ FlashAttention3 (causal, │ │ │
│ │ │ sliding window per layer) │ │ │
│ │ └───────────────┬───────────────┘ │ │
│ │ ▼ │ │
│ │ Linear projection │ │
│ └──────────────────┬──────────────────┘ │
│ │ │
│ x = x + attn_out ← residual │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ MLP │ │
│ │ Linear → ReLU² → Linear │ │
│ └──────────────────┬──────────────────┘ │
│ │ │
│ x = x + mlp_out ← residual │
│ │
└───────────────────────┬───────────────────┘
│
▼
┌─────────────────┐
│ RMSNorm → Head │
│ Softcap(15) │
│ Cross-entropy │
└─────────────────┘
Value Embeddings (ResFormer):
Alternating layers have value embeddings — learnable per-token vectors that are mixed into the value stream via an input-dependent gate:
ve = self.value_embeds[str(i)](idx) # (B, T, kv_dim)
gate = 2 * sigmoid(self.ve_gate(x[..., :32])) # (B, T, n_kv_head)
v = v + gate * ve
The gate is initialized to zero, so sigmoid(0) = 0.5, scaled by 2 gives 1.0 — a neutral initialization that doesn't distort the value stream at the start of training.
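The neutral-initialization claim can be checked directly (the helper name is hypothetical; in the model the logit comes from the zero-initialized `ve_gate` linear layer):

```python
import math

def ve_gate(logit):
    # gate = 2 * sigmoid(logit), one gate value per (B, T, n_kv_head) position
    return 2.0 / (1.0 + math.exp(-logit))
```

At initialization the logit is 0, so the gate is exactly 1.0; during training the model can smoothly scale the value-embedding contribution anywhere in (0, 2).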
Residual Mixing:
Each block receives both the previous hidden state and the initial embedding:
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
This is initialized with λ = 1.0 and α₀ = 0.1, allowing the model to learn how much to rely on the original embedding vs. the transformed representation at each layer. This technique helps with gradient flow in deeper transformers.
Mechanism 5: BOS-Aligned Best-Fit Packing
The dataloader uses a sophisticated packing strategy:
Traditional padding:
[DOC_1 | PAD PAD PAD PAD PAD | DOC_2 | PAD PAD] ← wasted compute
Autoresearch best-fit packing:
[BOS DOC_A | BOS DOC_C | BOS DOC_F(crop)] ← 100% utilization
[BOS DOC_B | BOS DOC_D | BOS DOC_E ] ← 100% utilization
The packer:
1. Maintains a buffer of tokenized documents
2. For each row, finds the largest document that fits entirely in the remaining space
3. If no document fits, crops the shortest document to fill exactly
4. Prefixes every document with a BOS token
5. Achieves 100% utilization (no padding tokens)
This is a best-fit-decreasing variant adapted for streaming data — the agent cannot modify this packing strategy (it lives in prepare.py), but it can modify DEVICE_BATCH_SIZE, which affects how many rows are packed per batch.
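A self-contained sketch of the best-fit strategy on toy token lists. Document contents, row length, and the BOS id are hypothetical; the real packer streams documents through prepare.py's buffer rather than taking a fixed list.

```python
def pack_rows(docs, n_rows, row_len, bos=0):
    """Best-fit packing: each row repeatedly takes the largest buffered
    document that fits entirely (with its BOS); if none fits, the shortest
    document is cropped to fill the row exactly. Assumes the buffer always
    holds enough documents."""
    buf = sorted(docs, key=len, reverse=True)   # largest-first buffer
    rows = []
    for _ in range(n_rows):
        row = []
        while len(row) < row_len:
            space = row_len - len(row)
            # largest document whose BOS + tokens fit entirely
            pick = next((d for d in buf if 1 + len(d) <= space), None)
            if pick is None:
                pick = min(buf, key=len)        # nothing fits: crop the shortest
                buf.remove(pick)
                row += [bos] + pick[:space - 1]
            else:
                buf.remove(pick)
                row += [bos] + pick
        rows.append(row)
    return rows

# Hypothetical documents (token lists); BOS token id assumed to be 0
docs = [[1] * 5, [2] * 9, [3] * 3, [4] * 12, [5] * 7]
rows = pack_rows(docs, n_rows=2, row_len=16)
```

Every row comes out exactly `row_len` tokens long with no padding, and every document segment begins with a BOS token.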
12 Programming Language
Language: Python 3.10+
Primary dependencies (from pyproject.toml):
| Package | Version | Purpose |
|---|---|---|
| torch | 2.9.1 (CUDA 12.8) | Core training framework |
| kernels | ≥0.11.7 | FlashAttention3 kernel loading |
| rustbpe | ≥0.1.0 | BPE tokenizer training (Rust-backed) |
| tiktoken | ≥0.11.0 | Tokenizer runtime |
| pyarrow | ≥21.0.0 | Parquet data reading |
| numpy | ≥2.2.6 | Numerical utilities |
| pandas | ≥2.3.3 | Data manipulation |
| matplotlib | ≥3.10.8 | Visualization |
| requests | ≥2.32.0 | Data download |
Package management: Uses uv — the fast Rust-based Python package manager. All dependencies are locked via uv.lock. The CUDA-specific PyTorch wheel is sourced from the custom index:
[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
Code Style
The code is written in Karpathy's characteristic style:
- Single-file implementation (no module imports within the project)
- Minimal abstraction (no unnecessary classes or inheritance)
- Heavy use of torch.compile for performance-critical paths
- Inline constants rather than config files
- Dense but readable — every line serves a purpose
Performance Optimizations
- torch.compile(dynamic=False, fullgraph=True) — compiled optimizer kernels avoid Python overhead
- Flash Attention 3 — hardware-specific attention kernel (Hopper FA3 vs. community FA3)
- torch.set_float32_matmul_precision("high") — enables TF32 for faster matmuls
- Pinned memory — pin_memory=True for CPU→GPU data transfer
- Manual GC — gc.freeze() + gc.disable() after warmup to avoid GC stalls
- bfloat16 autocast — mixed-precision training
- Pre-allocated buffers — data loader pre-allocates CPU and GPU tensors
13 Memory Management
Context Window Memory
The primary memory challenge in autoresearch is the LLM agent's context window, not GPU memory. Over 100+ experiments, the agent accumulates substantial context:
Context accumulation per experiment:
program.md (initial read) ~2,000 tokens (amortized: 0)
train.py (read/edit) ~2,000 tokens
Results from grep ~50 tokens
Agent reasoning ~500 tokens
Error handling (if crash) ~500 tokens
──────────────────────────────────────────────
Net context per experiment ~3,000-5,000 tokens
Over 100 experiments:
Cumulative context ~300K-500K tokens
Mitigation strategies in program.md:
- Output redirection: uv run train.py > run.log 2>&1 — prevents training logs from flooding context
- Selective reading: grep "^val_bpb:" run.log — reads only the metric, not the full log
- Crash diagnosis: tail -n 50 run.log — reads only the traceback, not the full output
- No TSV commits: results.tsv is untracked — the agent doesn't re-read git diffs of results
These strategies keep per-experiment cost low, but the cumulative 300K-500K tokens still exceeds a ~200K context window, so an agent approaches context saturation after roughly 100 experiments.
GPU Memory
The training script is designed for single-GPU operation with careful memory management:
| Component | Memory Usage | Notes |
|---|---|---|
| Model parameters (50M) | ~100 MB (bf16) | Main model weights |
| Optimizer states (Muon) | ~200 MB | Momentum + second momentum buffers |
| Optimizer states (Adam) | ~200 MB | exp_avg + exp_avg_sq |
| Activations (per batch) | ~2-10 GB | Depends on batch size, scales with DEVICE_BATCH_SIZE × MAX_SEQ_LEN |
| KV cache (attention) | ~500 MB | Flash Attention working memory |
| Gradient accumulation | ~100 MB | Gradients for one micro-step |
| Rotary embeddings | ~50 MB | Pre-computed cos/sin for 10× sequence length |
| Value embeddings | ~100 MB | Per-token value vectors |
| Total (baseline) | ~3-11 GB | Well within H100's 80GB |
The PYTORCH_ALLOC_CONF = "expandable_segments:True" environment variable enables PyTorch's expandable memory segments, reducing fragmentation when the agent changes model sizes between experiments.
Peak VRAM tracking:
peak_vram_mb = torch.cuda.max_memory_allocated() / 1024 / 1024
Peak VRAM is reported per experiment and logged in results.tsv, allowing the agent to track memory impact of changes. The simplicity criterion in program.md treats VRAM increase as a cost to be weighed against val_bpb improvement.
Gradient Accumulation Memory Pattern
TOTAL_BATCH_SIZE = 2^19 # ~524K tokens total
DEVICE_BATCH_SIZE = 128 # per micro-step
tokens_per_fwdbwd = 128 × 2048 # = 262,144
grad_accum_steps = 2^19 / 2^18 # = 2
Memory during training:
Step 1: forward(128 × 2048) → activations → backward → gradients accumulated
Step 2: forward(128 × 2048) → activations → backward → gradients accumulated
Optimizer step: apply accumulated gradients
Zero gradients
The agent can trade memory for throughput by changing DEVICE_BATCH_SIZE (affects peak activation memory) or TOTAL_BATCH_SIZE (affects gradient accumulation steps). Reducing TOTAL_BATCH_SIZE from 2^19 to 2^18 halves the batch but allows more optimizer steps in the 5-minute budget — a trade-off the agent consistently discovers.
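The arithmetic above, spelled out (constants from the document; `halved_steps` is an illustrative name for the trade-off described in the last sentence):

```python
TOTAL_BATCH_SIZE = 2 ** 19       # tokens consumed per optimizer step
DEVICE_BATCH_SIZE = 128          # rows per forward/backward micro-step
MAX_SEQ_LEN = 2048               # tokens per row

tokens_per_fwdbwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN        # 262,144 = 2^18
assert TOTAL_BATCH_SIZE % tokens_per_fwdbwd == 0
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwdbwd   # 2 micro-steps per update

# Halving the total batch removes gradient accumulation entirely,
# doubling the number of optimizer steps that fit in the 5-minute budget.
halved_steps = (2 ** 18) // tokens_per_fwdbwd              # 1
```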
14 Continued Learning
Within a Single Run
The agent learns within a single run through several mechanisms:
1. Implicit learning via context accumulation.
As the agent conducts experiments, it accumulates knowledge about what works and what doesn't in its context window. After 50 experiments, the agent has seen dozens of failed and successful modifications and can make increasingly informed hypotheses.
2. Results-driven strategy evolution.
The results.tsv provides a structured record of all experiments. The agent can (and does) refer back to this record to:
- Identify which parameter ranges have been explored
- Find near-misses worth retrying in combination
- Avoid repeating failed experiments
- Identify diminishing returns in a particular direction
3. Error-driven adaptation.
Crashes and failures teach the agent about the constraint surface. After an OOM crash from doubling model width, the agent learns the VRAM boundary and stays within it in subsequent experiments.
Across Runs
Autoresearch does not have built-in cross-run learning. Each run starts from the instruction in program.md and the current state of train.py. However, several mechanisms enable implicit cross-run knowledge transfer:
1. Branch inheritance.
A new run can start from a branch created by a previous run, inheriting all accumulated improvements:
# Run 1 ends with autoresearch/mar5 at commit abc123
# Run 2 starts from that branch
git checkout autoresearch/mar5
git checkout -b autoresearch/mar6
# Agent starts from the improved train.py
2. program.md iteration.
The human can update program.md between runs to encode lessons learned:
# Added after Run 1:
Note: batch size 2^18 consistently outperforms 2^19.
Focus on architecture changes rather than optimizer tuning.
This creates a human-in-the-loop learning cycle at the meta level:
1. Agent runs autonomously for 8 hours
2. Human reviews results
3. Human updates program.md with insights
4. Agent runs again with better instructions
3. Community knowledge propagation.
The open-source nature of autoresearch enables a distributed learning process:
- Forks share discoveries (e.g., "AR=96 works best on H100")
- The nanochat leaderboard aggregates best-known configurations
- Community discussions identify promising research directions
Meta-Learning: Programming the Research Program
Karpathy explicitly highlights the meta-learning opportunity:
"The default
program.mdin this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress, how you'd add more agents to the mix, etc."
This positions autoresearch as a platform for meta-research — researching how to research. The program.md is itself an optimizable artifact:
Level 0: train.py ← optimized by the agent
Level 1: program.md ← optimized by the human
Level 2: research method ← optimized by the community
Future extensions could apply autoresearch to itself — using an LLM agent to optimize program.md based on the quality of research outcomes. This is a form of meta-optimization that echoes the self-referential nature of Gödel machines.
Integration with External Knowledge
The program.md instructs the agent to "read papers referenced in the code" when stuck. This allows the agent to leverage its training-time knowledge about:
- Transformer architectures (GPT, LLaMA, Mamba)
- Optimizer research (Adam, LAMB, Muon, Sophia)
- Initialization schemes (Xavier, Kaiming, µP)
- Normalization techniques (LayerNorm, RMSNorm, QKNorm)
- Attention variants (MHA, MQA, GQA, sliding window)
The agent's "knowledge base" is its own training data — a vast corpus of ML papers and code. This is a fundamentally different approach from systems like AlphaEvolve (which use structured program databases) or The AI Scientist (which uses explicit literature search).
15 Applications
Direct Applications
1. Neural network architecture search.
The primary application — finding optimal model configurations for a given compute budget. This is directly useful for:
- Startups optimizing their training infrastructure
- Researchers exploring new architecture ideas
- Hardware vendors benchmarking their GPUs
- Students learning about ML optimization
2. Hardware-specific optimization.
Because the 5-minute budget means different hardware gets different numbers of steps, autoresearch naturally finds the best model for your specific hardware:
H100: Large model (AR=96), fewer steps → width matters most
A100: Medium model (AR=64), more steps → depth/width balance
RTX 4090: Smaller model, more steps → optimizer tuning matters more
MacBook M2: Small model, many steps → data efficiency matters most
3. Optimizer research.
The Muon + AdamW optimizer is itself a research artifact. The agent can discover optimizer improvements (e.g., different momentum schedules, variance reduction techniques) that may generalize beyond the specific training setup.
4. Leaderboard competition.
The nanochat leaderboard provides a competitive benchmark. Autoresearch enables anyone with a GPU to compete — the overnight autonomous search replaces weeks of manual experimentation.
Extended Applications
5. Research methodology demonstration.
Autoresearch demonstrates a new research methodology — "programming the researcher" rather than conducting research directly. This methodology is applicable to any domain where:
- Solutions can be expressed as code
- Evaluation is automated
- Experiments are fast enough for iterative improvement
6. Benchmark for LLM coding agents.
The simplicity and reproducibility of autoresearch make it an excellent benchmark for evaluating coding agents. Key metrics:
- How many experiments before the first improvement?
- What is the final val_bpb after 100 experiments?
- How diverse are the explored hypotheses?
- How well does the agent handle crashes?
7. Education.
Following Karpathy's educational philosophy, autoresearch teaches:
- How neural network training works (by observing the agent's experiments)
- What hyperparameters matter most (by reading results.tsv)
- How to design reproducible research (by studying the protocol)
- How LLM agents can automate research (by running the system)
Limitations and Future Directions
Current limitations:
| Limitation | Description | Potential Solution |
|---|---|---|
| Sequential search | One experiment at a time | SkyPilot extension (see companion document) |
| Greedy strategy | Can miss multi-step improvements | Population-based search, backtracking |
| No cross-run memory | Each run starts fresh | Persistent knowledge store |
| Single-file scope | Only modifies train.py | Multi-file modification support |
| GPU-specific results | Results not comparable across hardware | Normalize by FLOPs or steps |
| No experiment design | Agent explores ad hoc, no DOE | Structured experimental design |
| No ensemble methods | Single agent perspective | Multi-agent debate/collaboration |
| Fixed metric | Only val_bpb, no multi-objective | Pareto optimization |
Natural extensions:
- Multi-agent autoresearch. Multiple agents explore different regions of the search space simultaneously, sharing discoveries. The SkyPilot extension (analyzed in the companion document) takes a first step in this direction.
- Meta-optimization of program.md. An outer loop that evaluates different program.md variants based on research quality, discovering optimal research protocols.
- Domain transfer. Apply the autoresearch protocol to other optimization domains: reinforcement learning training scripts, compiler optimization passes, database query optimization, scientific simulation parameters.
- Integration with formal methods. Combine the LLM agent's intuition with Bayesian optimization or genetic algorithms to provide principled exploration-exploitation trade-offs.
- Distributed autoresearch networks. A network of autoresearch instances running on different hardware, sharing discoveries in real-time, and collectively exploring the configuration space.
Broader Impact
Autoresearch represents a significant milestone in the trajectory toward fully autonomous scientific research. Its contribution is not primarily technical (the greedy hill-climbing algorithm is trivial) but conceptual — it demonstrates that:
- LLMs are sufficient as autonomous research agents for well-defined optimization problems
- Natural language specifications can replace complex research infrastructure
- Overnight autonomous runs can produce results that rival weeks of human effort
- Radical simplicity (3 files, 1 GPU, 1 metric) enables massive community adoption
The 63,000+ GitHub stars within weeks of release indicate that autoresearch has resonated deeply with the ML community. It has spawned an ecosystem of forks, extensions, and derivative projects that collectively push the frontier of autonomous research systems.
Karpathy's framing of program.md as "research org code" — where the human programs the research organization rather than conducting research — may prove to be the most enduring contribution. It redefines the role of the human researcher from executor to architect of the research process, with LLM agents as the execution layer.
This analysis is based on the autoresearch repository as of March 2026, including program.md (v1), train.py (baseline), prepare.py, and README.md. The repository continues to evolve through community contributions.