Karpathy Autoresearch
Part P07: Autonomous Research Systems
42.1 Overview and Motivation
In March 2026, Andrej Karpathy released autoresearch, a repository that inverts the relationship between human researcher and computational tool. Where prior autonomous research systems—AlphaEvolve (Chapter 4), OpenEvolve (Chapter 5), GEPA (Chapter 7)—required substantial infrastructure including databases, orchestrators, vector stores, and multi-agent coordination, autoresearch demonstrates that a single coding LLM agent, a Markdown instruction file, and one GPU can autonomously conduct neural network research overnight. The repository garnered over 63,000 GitHub stars within weeks of release, signaling a deep resonance with the machine learning community.
The system's lineage traces directly to nanochat, Karpathy's single-GPU GPT training repository. Autoresearch wraps nanochat's training infrastructure with an LLM agent loop that autonomously modifies the training code, evaluates results, and accumulates improvements. The entire system consists of three files: program.md (the research specification), train.py (the mutable experiment space), and prepare.py (the immutable evaluation infrastructure).
Key Contribution
Autoresearch establishes that a general-purpose coding LLM agent, given only a natural-language research specification and a fixed-budget training setup, can autonomously discover cumulative improvements to neural network training. In the author's reported first overnight run, this yielded an approximately 11% reduction in validation bits-per-byte over roughly 100 sequential experiments on a single H100 GPU [author tweet, single run]. The system's primary contribution is not algorithmic (the search strategy is greedy hill-climbing) but paradigmatic: it demonstrates that programming the researcher via a Markdown document can replace the complex orchestration infrastructure that characterizes prior autonomous research systems.
42.1.1 The Research-as-Code Paradigm
Karpathy describes the paradigm shift directly in the repository README [repo README]:
"The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org."
This framing introduces a three-level optimization hierarchy. At Level 0, the LLM agent optimizes train.py to minimize validation loss. At Level 1, the human optimizes program.md to maximize the agent's research productivity. At Level 2, the community optimizes the research methodology itself through forks, extensions, and shared discoveries. The human researcher's role shifts from executor of experiments to architect of the research process.
42.1.2 Philosophical Context
Karpathy's body of work follows a distinctive pattern: take a complex system (ImageNet classifiers, GPT-2, tokenizers), strip it to its essence in a single readable file, and open-source it as both an educational and a practical tool. Autoresearch extends this pattern—now the researcher itself is automated. The repository opens with a characteristically provocative framing [repo README]:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."
This positions autoresearch not merely as a tool but as a proof-of-concept for a fundamentally different mode of scientific inquiry—one where humans design the research program in natural language rather than conducting experiments directly. The explicit simplicity constraint embedded in program.md further reflects this philosophy [repo program.md]:
"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome—that's a simplification win."
42.1.3 Positioning Among Autonomous Research Systems
| System | Year | Infrastructure Complexity | Agent Role | Human Role | Search Strategy |
|---|---|---|---|---|---|
| Bayesian Optimization | Classical | Medium | Statistical model | Defines search space | Acquisition function |
| Neural Architecture Search | 2016+ | High | RL/evolution agent | Defines search space | RL, evolutionary |
| FunSearch (DeepMind) | 2023 | High | LLM mutation | Defines evaluator | Evolutionary |
| AlphaEvolve (DeepMind) | 2025 | Very High | LLM ensemble | Defines problem | MAP-Elites + ensemble |
| OpenEvolve | 2025 | High | LLM mutation | Configures pipeline | Island-based evolutionary |
| Autoresearch | 2026 | Minimal | Full coding agent | Writes program.md | Greedy hill-climbing |
Autoresearch occupies a unique position: it is the simplest autonomous research system surveyed in this book that produces real, cumulative improvements. Where AlphaEvolve requires a team of engineers and a multi-GPU cluster to operate, autoresearch requires uv sync && uv run prepare.py and a prompt.
42.1.4 Provenance and Evidence Standards
This chapter draws on multiple evidence tiers. To maintain transparency, claims are tagged throughout with their provenance. The following table summarizes the evidence basis for each category of claim.
Evidence Provenance Key
| Tag | Meaning | Confidence |
|---|---|---|
| [repo code] | Verified by reading the file in the public repository (github.com/karpathy/autoresearch). Accessed April 2026. | High |
| [repo README] | Stated in the repository README.md | High |
| [repo program.md] | Stated in the repository program.md specification file | High |
| [author tweet] | Reported by Karpathy via social media posts (x.com/karpathy), not peer-reviewed | Medium |
| [community fork] | Reported by community fork maintainers; not independently verified by this survey | Low–Medium |
| [survey estimate] | Estimated or inferred by the survey authors based on available evidence | Low–Medium |
| [survey interpretation] | Analytical framing or formalization contributed by the survey authors | N/A (analysis) |
42.2 Architecture
Autoresearch's architecture is remarkable for what it deliberately excludes. There is no database, no task queue, no orchestrator, no vector store, no plugin system, and no configuration schema. The filesystem (via git) serves as the database. The agent's context window serves as memory. Shell commands serve as the orchestrator. This radical minimalism is a conscious design choice, not an oversight.
Implementation Snapshot
Repository: github.com/karpathy/autoresearch (MIT License, accessed April 2026)
| File | Role | Mutability | Key Exports / Constants |
|---|---|---|---|
| program.md | Research specification | Read-only by agent; human-edited between runs | Setup protocol, experiment loop spec, behavioral directives, logging format |
| train.py | GPT model + optimizer + training loop (~450 lines) | Mutable; sole file the agent edits | Classes: GPTConfig, CausalSelfAttention, MLP, Block, GPT, MuonAdamW. Constants: DEPTH, ASPECT_RATIO, HEAD_DIM, WINDOW_PATTERN, TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE, MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC |
| prepare.py | Data pipeline, tokenizer, evaluation metric | Immutable; agent cannot modify | Constants: MAX_SEQ_LEN = 2048, TIME_BUDGET = 300, VOCAB_SIZE = 8192, eval tokens = 40 × 524,288. Functions: evaluate_bpb(), make_dataloader(). Class: Tokenizer |
| results.tsv | Experiment audit trail | Append-only; untracked by git | Columns: commit, val_bpb, memory_gb, status, description |
| README.md | Documentation | Immutable during runs | Setup instructions, philosophy, hardware guidance |
| pyproject.toml | Dependency specification | Immutable during runs | torch 2.9.1 (CUDA 12.8), kernels, rustbpe, tiktoken, pyarrow, numpy, pandas, matplotlib, requests |
Output artifacts per experiment: run.log (training stdout/stderr, overwritten each cycle), git commit (on autoresearch/<tag> branch), row appended to results.tsv.
42.2.1 Three-File Design
The separation between prepare.py (immutable) and train.py (mutable) is architecturally critical [repo code]. The evaluation function evaluate_bpb() lives in prepare.py, which the agent cannot modify. This prevents the agent from gaming the metric—a failure mode that plagues unconstrained autonomous optimization systems. The data pipeline, tokenizer, validation split, and evaluation constants are all outside the agent's reach.
42.2.2 Git as State Machine
Rather than maintaining experiment state in a database, autoresearch uses the git branch as a state machine [repo program.md]. Each experiment begins with a commit. If the experiment improves val_bpb, the branch advances. If the experiment fails or regresses, git reset returns the branch to the last successful commit. The branch tip always represents the best-known configuration.
The following pseudocode reconstructs the experiment cycle from the natural-language specification in program.md. It is not verbatim repository code—autoresearch has no Python orchestrator; the LLM agent itself executes these steps via shell commands as instructed by program.md.
# PSEUDOCODE: reconstructed from program.md's natural-language specification.
# Autoresearch has no orchestrator script; the LLM agent executes
# these steps interactively via shell commands.
def experiment_cycle(agent, branch: str, best_bpb: float, best_commit: str):
    """One cycle of the autoresearch loop, as specified in program.md."""
    # 1. Agent generates hypothesis and edits train.py
    idea = agent.generate_hypothesis()
    agent.edit_file("train.py", idea)

    # 2. Commit the change before running [program.md: "git commit"]
    commit_hash = git_commit(f"experiment: {idea.summary}")

    # 3. Run with output redirected [program.md: "uv run train.py > run.log 2>&1"]
    exit_code = shell("uv run train.py > run.log 2>&1", timeout=600)

    # 4. Extract metric [program.md: 'grep "^val_bpb:" run.log']
    val_bpb = parse_float(shell('grep "^val_bpb:" run.log'))
    peak_vram = parse_float(shell('grep "^peak_vram_mb:" run.log'))

    # 5. Decision: keep, discard, or handle crash
    if val_bpb is None:                      # crash: no metric produced
        error = shell("tail -n 50 run.log")  # [program.md: diagnose from tail]
        if is_trivial_fix(error):
            # Re-run after the fix and return its outcome, rather than
            # falling through to the comparison below with val_bpb = None.
            return fix_and_retry()
        else:
            log_result(commit_hash, 0.0, 0.0, "crash", idea.summary)
            git_reset(best_commit)
            return best_bpb, best_commit

    if val_bpb < best_bpb:                   # improvement: keep
        log_result(commit_hash, val_bpb, peak_vram, "keep", idea.summary)
        return val_bpb, commit_hash
    else:                                    # no improvement: discard
        log_result(commit_hash, val_bpb, peak_vram, "discard", idea.summary)
        git_reset(best_commit)
        return best_bpb, best_commit
The results.tsv file serves as a human-readable audit trail, with tab-separated columns for commit hash, val_bpb, memory usage, status (keep/discard/crash), and experiment description [repo program.md]. Critically, this file is not committed to git—it is left untracked to keep the branch clean for code-only diffs.
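The audit trail can be illustrated with a short Python sketch. This is not repository code: the helper names `append_result` and `best_kept_bpb` are hypothetical, the header row and number formatting are assumptions, and only the column names come from the documented results.tsv layout.

```python
import csv
import os
import tempfile

# Column order as documented for results.tsv.
COLUMNS = ["commit", "val_bpb", "memory_gb", "status", "description"]

def append_result(path, commit, val_bpb, memory_gb, status, description):
    """Append one experiment row, writing a header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(COLUMNS)  # assumed header row
        writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description])

def best_kept_bpb(path):
    """Return the lowest val_bpb among rows with status 'keep', or None."""
    with open(path, newline="") as f:
        rows = csv.DictReader(f, delimiter="\t")
        kept = [float(r["val_bpb"]) for r in rows if r["status"] == "keep"]
    return min(kept) if kept else None

# Illustrative values only; these are not results from any real run.
demo_path = os.path.join(tempfile.mkdtemp(), "results.tsv")
append_result(demo_path, "a1b2c3d", 1.000, 40.0, "keep", "baseline")
append_result(demo_path, "e4f5a6b", 0.990, 40.0, "keep", "halve batch size")
append_result(demo_path, "c7d8e9f", 0.0, 0.0, "crash", "out of memory")
print(best_kept_bpb(demo_path))
```

Because the file is append-only and sits outside git, a reader can recover the best-known metric from it at any time without touching the branch history.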
Key behavioral directives from program.md that govern this cycle [repo program.md]:
- Crash handling: "If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it."
- Timeout policy: A 10-minute hard kill is applied if training exceeds the expected duration.
- Autonomy: "NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue."
- Idea generation when stuck: "If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."
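Two mechanical pieces of this cycle, the hard timeout and the grep-style metric extraction, can be sketched with the Python standard library. This is an illustrative sketch only: autoresearch has no such script, the function names are hypothetical, and the exact text following the `val_bpb:` prefix in run.log is an assumption.

```python
import re
import subprocess
import sys

def run_with_hard_kill(cmd, timeout_s):
    """Run cmd, capturing stdout+stderr; return (exit_code, output).

    exit_code is None if the process exceeded timeout_s and was killed.
    """
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        out = exc.output or ""
        if isinstance(out, bytes):  # partial output may arrive undecoded
            out = out.decode(errors="replace")
        return None, out

def extract_val_bpb(log_text):
    """Python analogue of `grep "^val_bpb:"`; returns None when no metric was logged."""
    match = re.search(r"^val_bpb:\s*([0-9.eE+-]+)", log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None

# Stand-in for `uv run train.py`: a tiny child process that prints a metric line.
fake_train = [sys.executable, "-c", "print('step 1'); print('val_bpb: 0.987')"]
code, log = run_with_hard_kill(fake_train, timeout_s=600)
print(code, extract_val_bpb(log))
```

A missing metric (return value `None`) is what distinguishes a crash from a completed run, which is why the crash-handling directive only needs the tail of the log.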
42.2.3 Agent-Agnostic Design
Autoresearch deliberately avoids coupling to any specific LLM provider or agent framework [repo README]. The program.md is written in natural language that any sufficiently capable coding agent can follow. The README states:
"Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like: 'Hi have a look at program.md and let's kick off a new experiment!'"
The only capabilities required of the agent are: file read/write, shell command execution, git operations, and basic reasoning about experimental results. This makes autoresearch a meta-framework—it defines the research protocol, not the agent implementation. The agent simultaneously serves as hypothesis generator, code writer, result interpreter, research strategist, and error handler.
42.3 Core Algorithms
42.3.1 Greedy Hill-Climbing with LLM-Guided Proposals
The core search algorithm is a first-order greedy hill climb [repo program.md; survey interpretation]. At each step, the agent proposes a modification to train.py, evaluates it under a fixed compute budget, and accepts the change only if it improves the validation metric.
Formal characterization. Let $\theta_t$ denote the state of train.py at step $t$ (the complete source code, treated as an element of a discrete program space $\Theta$). Let $f: \Theta \to \mathbb{R}_{>0}$ denote the validation bits-per-byte achieved by training configuration $\theta$ for exactly TIME_BUDGET = 300 seconds of wall-clock time [repo code: prepare.py]. The acceptance rule is:
$$\theta_{t+1} = \begin{cases} \theta_t' & \text{if } f(\theta_t') < f(\theta_t), \\ \theta_t & \text{otherwise,} \end{cases}$$
where $\theta_t' = g_{\text{LLM}}(\theta_t, \mathcal{H}_t)$ is the agent's proposed modification, generated by the LLM conditioned on the current code $\theta_t$ and the experiment history $\mathcal{H}_t = \{(\theta_i, f(\theta_i), s_i)\}_{i=1}^{t}$ with $s_i \in \{\text{keep}, \text{discard}, \text{crash}\}$. By construction, the sequence $\{f(\theta_t)\}$ is monotonically non-increasing. Note that $f$ is stochastic—even with a fixed seed, GPU-level floating-point nondeterminism introduces small variations—so $f(\theta_t')$ is in practice a single noisy evaluation [survey interpretation].
This strategy has well-known theoretical limitations. It can become trapped in local optima where no single modification improves the metric, even though a combination of changes would. There is no formal exploration-exploitation balance, no backtracking mechanism, and no population diversity. However, the LLM mitigates these limitations in several ways: it can propose compound changes (modifying multiple hyperparameters simultaneously), it has implicit knowledge from training data about what works in ML, and the program.md instruction "if you run out of ideas, think harder" encourages the agent to attempt radical changes when stuck [repo program.md].
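The accept-only-if-better rule can be made concrete with a toy example. The sketch below is a hedged illustration, not the autoresearch loop: the "program" is just a pair of numbers, the loss is a hypothetical smooth bowl, and a random perturbation stands in for the LLM's proposal.

```python
import random

def greedy_hill_climb(f, theta0, propose, steps, rng):
    """Keep a proposal only if it strictly lowers f; return the best-so-far trace."""
    theta, best = theta0, f(theta0)
    trace = [best]
    for _ in range(steps):
        candidate = propose(theta, rng)
        score = f(candidate)
        if score < best:          # "keep": the branch advances
            theta, best = candidate, score
        # otherwise "discard": theta is unchanged (the git reset analogue)
        trace.append(best)
    return theta, trace

# Toy stand-ins: theta is a pair of numbers, f a bowl with its minimum at (3, -1).
def toy_loss(theta):
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2 + 0.5

def random_proposal(theta, rng):
    return tuple(v + rng.gauss(0.0, 0.5) for v in theta)

rng = random.Random(0)
theta, trace = greedy_hill_climb(toy_loss, (0.0, 0.0), random_proposal, 200, rng)
print(round(trace[0], 3), round(trace[-1], 3))
```

The trace of accepted values never increases, which is the monotonicity property noted above. On this convex toy loss there are no local optima to get stuck in, so the example says nothing about the limitations discussed in the preceding paragraph.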
42.3.2 Fixed-Budget Evaluation and BPB Metric
Every experiment runs for exactly 300 seconds of wall-clock training time, with a hard kill at approximately 600 seconds for safety [repo code: prepare.py defines TIME_BUDGET = 300; repo program.md specifies the timeout]. This fixed-budget constraint is a defining design choice. The training loop in train.py excludes the initial steps (indices 0 through 10 in the excerpt below, i.e. the first 11 steps) from the time budget to account for PyTorch compilation and CUDA warmup, so that the budget measures actual training time [repo code: train.py]:
# From train.py: fixed-budget training loop (simplified excerpt)
# TIME_BUDGET is imported from prepare.py (= 300 seconds)
total_training_time = 0.0
for step in range(max_steps):
    t0 = time.perf_counter()
    # ... training step (forward, backward, optimizer) ...
    dt = time.perf_counter() - t0
    if step > 10:  # exclude warmup/compilation steps
        total_training_time += dt
    if step > 10 and total_training_time >= TIME_BUDGET:
        break
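The single optimization objective is validation bits per byte (val_bpb), a vocabulary-size-independent metric. The function evaluate_bpb() in prepare.py computes it as follows [repo code: prepare.py]:
$$\text{BPB} = \frac{\sum_{i=1}^{N} \ell_i \cdot \mathbb{1}[b_i > 0]}{\ln(2) \cdot \sum_{i=1}^{N} b_i}$$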
where $N$ is the total number of tokens in the validation set, $\ell_i = \text{CE}(\hat{y}_i, y_i)$ is the per-token cross-entropy loss in nats (natural logarithm base), $b_i = \text{utf8\_bytes}(y_i)$ is the UTF-8 byte length of target token $y_i$, the indicator $\mathbb{1}[b_i > 0]$ masks out special tokens (e.g., BOS), and the factor $\ln(2)$ converts from nats to bits. BPB is preferred over perplexity because it allows fair comparison across different vocabulary sizes. The evaluation runs over 40 batches of 524,288 tokens each (approximately 20 million validation tokens), as fixed in prepare.py [repo code: prepare.py].
# From prepare.py: evaluate_bpb() (simplified structure)
def evaluate_bpb(model, tokenizer, batch_size):
    """Evaluate model on ~20M validation tokens, return bits per byte."""
    total_nats = 0.0
    total_bytes = 0
    for _ in range(40):  # 40 evaluation batches
        # x, y: input/target token pairs of length 524288
        loss_flat = model(x, y, reduction='none').view(-1)
        nbytes = token_bytes[y_flat]  # UTF-8 byte count per token
        mask = nbytes > 0  # exclude special tokens
        total_nats += (loss_flat * mask).sum().item()
        total_bytes += nbytes.sum().item()
    return total_nats / (math.log(2) * total_bytes)
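As a sanity check on the metric, the same computation can be run on a handful of hand-picked numbers. The sketch below is a toy version, not the repository function: `bits_per_byte` is a hypothetical name, and the losses are chosen so that every non-special token costs exactly 2 bits per byte.

```python
import math

def bits_per_byte(losses_nats, token_byte_lengths):
    """Masked sum of per-token losses (nats) divided by ln(2) times total bytes."""
    total_nats = sum(l for l, b in zip(losses_nats, token_byte_lengths) if b > 0)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

# Four tokens; the first is a special token with zero byte length, so its
# loss of 2.0 nats is masked out and contributes nothing.
losses = [2.0, math.log(2) * 8, math.log(2) * 4, math.log(2) * 12]
nbytes = [0, 4, 2, 6]
print(bits_per_byte(losses, nbytes))  # approximately 2.0
```

The remaining three tokens carry 8 + 4 + 12 = 24 bits over 4 + 2 + 6 = 12 bytes, so the result is 2 bits per byte regardless of how the bytes are grouped into tokens. That grouping-independence is what makes the metric comparable across vocabularies.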
The fixed-budget design forces the agent to reason about compute efficiency, not just model quality. Reducing the batch size from $2^{19}$ to $2^{18}$ tokens, for instance, halves the tokens per optimizer step but doubles the number of optimizer steps within the 5-minute budget—a trade-off the agent consistently discovers and exploits [author tweet; community fork reports].
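The batch-size trade-off is simple arithmetic once a throughput is fixed. The sketch below assumes a constant, hypothetical throughput of 1.9 million tokens per second (not a measured figure) and ignores the possibility that smaller batches use the GPU less efficiently.

```python
def optimizer_steps(time_budget_s, tokens_per_second, total_batch_tokens):
    """Number of full optimizer steps that fit in the budget at constant throughput."""
    return int(time_budget_s * tokens_per_second // total_batch_tokens)

TIME_BUDGET = 300                      # seconds, as fixed in prepare.py
ASSUMED_TOKENS_PER_SECOND = 1_900_000  # hypothetical throughput, not a measurement

for batch in (2**19, 2**18):
    steps = optimizer_steps(TIME_BUDGET, ASSUMED_TOKENS_PER_SECOND, batch)
    print(f"batch={batch:>7} tokens -> {steps} steps, {steps * batch:,} tokens seen")
```

Under these assumptions the total number of tokens seen is unchanged, and only the number of parameter updates doubles. In practice the doubling is approximate, since per-step overhead does not shrink with the batch.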
42.3.3 The Muon + AdamW Hybrid Optimizer
The baseline train.py implements a dual-optimizer design in the MuonAdamW class [repo code: train.py]. Weight matrices (2D parameters) use the Muon optimizer (gradient orthogonalization via polar decomposition), while embeddings, scalars, and biases use standard AdamW. This subsection describes the optimizer as implemented in the baseline train.py, supplemented where noted with context from the Muon optimizer literature.
Polar decomposition via Newton-Schulz iterations. Given a gradient matrix $G \in \mathbb{R}^{m \times n}$ (with $m \leq n$; if $m > n$, $G^T$ is used instead), the polar decomposition yields $G = UP$ where $U \in \mathbb{R}^{m \times n}$ has orthonormal rows and $P \in \mathbb{R}^{n \times n}$ is symmetric positive semi-definite. The Muon optimizer approximates $U$ using Newton-Schulz iterations, implemented in train.py with 5 steps and the following hardcoded polynomial coefficients [repo code: train.py]:
# From train.py — Newton-Schulz coefficients for polar decomposition
# These are precomputed polynomial coefficients that accelerate convergence
polar_express_coeffs = [
(8.156554524902461, -22.48329292557795, 15.878769915207462),
(4.042929935166739, -2.808917465908714, 0.5000178451051316),
(3.8916678022926607, -2.772484153217685, 0.5060648178503393),
(3.285753657755655, -2.3681294933425376, 0.46449024233003106),
(2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]
Each iteration $k = 0, \ldots, 4$ applies the update:
$$X_{k+1} = a_k X_k + X_k \left( b_k A_k + c_k A_k^2 \right)$$
where $X_0 = G / \|G\|_F$ (the Frobenius-normalized gradient), $A_k = X_k^T X_k \in \mathbb{R}^{n \times n}$ for wide matrices (for tall matrices, $A_k = X_k X_k^T$ and the bracketed factor multiplies $X_k$ from the left instead), and $(a_k, b_k, c_k)$ are the coefficients listed above. After 5 iterations, $X_5$ approximates the orthogonal factor $U$ of the polar decomposition. The convergence rate depends on the singular value distribution of $G$; these specific coefficients are optimized for rapid convergence in the range of singular value ratios typical of neural network gradients [Muon optimizer literature; coefficients verified in repo code].
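The structure of the iteration can be checked numerically against an SVD-based polar factor. The NumPy sketch below is an illustration under stated assumptions, not the repository's implementation: `newton_schulz` is a hypothetical helper, and the demonstration deliberately swaps in the classical cubic Newton-Schulz coefficients $(1.5, -0.5, 0)$ for many iterations, because their convergence is easy to verify. The tuned five-step coefficients from train.py could be passed as `coeffs` instead, but they trade exactness for speed and are not expected to reach the same precision.

```python
import numpy as np

def newton_schulz(G, coeffs):
    """Apply X <- a*X + X @ (b*A + c*A@A), with A = X^T X, once per (a, b, c)."""
    X = G / np.linalg.norm(G)          # Frobenius normalization: singular values <= 1
    for a, b, c in coeffs:
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))
    return X

# Classical cubic Newton-Schulz coefficients, repeated until convergence.
# These are NOT the tuned coefficients from train.py.
classical = [(1.5, -0.5, 0.0)] * 60

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))        # wide matrix, m <= n
X = newton_schulz(G, classical)

# Reference polar factor from the SVD: G = W diag(s) Vt  =>  U = W @ Vt
W, s, Vt = np.linalg.svd(G, full_matrices=False)
U = W @ Vt
print(np.max(np.abs(X - U)))           # small if the iteration has converged
```

Each step acts on the singular values alone (mapping $\sigma$ to $a\sigma + b\sigma^3 + c\sigma^5$) and leaves the singular vectors untouched, which is why the iterate converges to the same $U$ that the SVD produces.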
Relationship to natural gradient methods. The orthogonalization step extracts the rotational component of the gradient, discarding magnitude information from the singular values. This shares a geometric motivation with natural gradient methods (Amari, 1998): both aim to produce updates that are less sensitive to the parameterization of the model. However, the Muon orthogonalization operates on each weight matrix independently and does not compute or approximate the Fisher information matrix. The analogy is structural—both seek parameterization-invariant update directions—but they are not formally equivalent, and the Muon paper does not claim natural gradient equivalence [survey interpretation; Muon literature].
NorMuon variance reduction. After orthogonalization, the optimizer applies an adaptive normalization step using a second-momentum buffer, implemented in train.py as part of the MuonAdamW class [repo code: train.py]. Let $g_t \in \mathbb{R}^{m \times n}$ denote the orthogonalized gradient at step $t$. Define:
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, \operatorname{mean}_d\!\left(g_t \odot g_t\right)$$
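where $v_t$ is the per-row (or per-column, depending on matrix shape) exponential moving average of squared gradient entries, $\beta_2$ is the second momentum coefficient, and $d$ specifies the reduction dimension. The normalization then proceeds in two steps to avoid circularity:
$$\hat{g}_t = g_t \odot \left(v_t + \epsilon\right)^{-1/2} \qquad \text{(Step 1)}$$
$$g_t' = \hat{g}_t \cdot \frac{\|g_t\|_F}{\|\hat{g}_t\|_F} \qquad \text{(Step 2)}$$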
Here, $\hat{g}_t$ is the intermediate variance-normalized gradient, $\epsilon$ is a small constant for numerical stability, $\odot$ denotes element-wise multiplication (with broadcasting along the reduced dimension), and $\|\cdot\|_F$ denotes the Frobenius norm. Step 1 redistributes gradient magnitude across dimensions (suppressing high-variance directions, amplifying low-variance ones), analogous to Adam's per-parameter adaptation. Step 2 rescales to preserve the overall gradient norm, ensuring that the variance normalization does not inadvertently change the effective learning rate. The final update $g_t'$ thus combines direction quality from the polar orthogonalization with adaptive per-dimension scaling from the variance tracker [repo code: train.py; Muon/NorMuon literature].
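The two-step normalization described above can be sketched in NumPy. This is a minimal illustration rather than the repository's code: the function name `normuon_normalize`, the value of `beta2`, and the placement of `eps` inside the square root are all assumptions.

```python
import numpy as np

def normuon_normalize(g, v_prev, beta2=0.95, eps=1e-10, axis=1):
    """Variance-normalize g along `axis`, then rescale to the original Frobenius norm."""
    # EMA of the mean squared entry per row (axis=1) or per column (axis=0)
    v = beta2 * v_prev + (1.0 - beta2) * np.mean(g * g, axis=axis, keepdims=True)
    g_hat = g / np.sqrt(v + eps)                                   # step 1: per-row rescaling
    g_out = g_hat * (np.linalg.norm(g) / np.linalg.norm(g_hat))    # step 2: restore overall norm
    return g_out, v

rng = np.random.default_rng(0)
# Rows with deliberately uneven scales (0.1 to 20) to make the effect visible.
g = rng.standard_normal((4, 6)) * np.array([[0.1], [1.0], [5.0], [20.0]])
v0 = np.zeros((4, 1))                  # empty second-momentum buffer on the first step
g_out, v1 = normuon_normalize(g, v0)
print(np.linalg.norm(g), np.linalg.norm(g_out))   # equal up to rounding
print(np.linalg.norm(g_out, axis=1))              # row norms equalized on this first step
```

On the very first step, with an empty buffer, step 1 equalizes the row norms exactly. On later steps the moving average smooths this, so rows are rescaled by their recent history rather than by the current gradient alone.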
42.3.4 The Mutable Search Space
The agent has complete freedom to modify train.py while prepare.py remains immutable. The following tables enumerate the constraint surface as observed in the repository [repo code].
| Category | Mutable Parameters (in train.py) | Baseline Values |
|---|---|---|
| Model Architecture | DEPTH, ASPECT_RATIO, HEAD_DIM, WINDOW_PATTERN | 8, 64, 128, "SSSL" |
| Optimizer Config | MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY | 0.04, (0.8, 0.95), 0.2 |
| LR Schedule | WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC | Defined in train.py |
| Batch Size | TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE | $2^{19}$, 128 |
| Activation Functions | MLP non-linearity | F.relu(x).square() (squared ReLU, ReLU²) |
| Residual Connections | resid_lambdas, x0_lambdas | 1.0, 0.1 |
| Value Embeddings | ve_gate, gating mechanisms | ResFormer-style (see §42.6) |
| Logit Processing | Softcap value in forward pass | 15 |
| Optimizer Internals | Muon momentum, ns_steps, beta2 | See MuonAdamW class |
| Immutable Constants (in prepare.py) | Value | Purpose |
|---|---|---|
| MAX_SEQ_LEN | 2048 | Context length |
| TIME_BUDGET | 300 seconds | Training time per experiment |
| Eval batches × batch tokens | 40 × 524,288 | Validation tokens (~20M) |
| VOCAB_SIZE | 8192 | BPE vocabulary size |
| evaluate_bpb() | Function | Sacred evaluation function |
| make_dataloader() | Function | BOS-aligned best-fit packing dataloader |
42.3.5 Context Window Management
A subtle but critical design decision is how the agent manages its context window over 100+ experiments [repo program.md]. By redirecting all training output to a file (uv run train.py > run.log 2>&1) and extracting only the metric via grep, the agent avoids flooding its context with thousands of lines of training progress. The following token estimates are approximate [survey estimate]:
| Context Source | Estimated Tokens per Experiment |
|---|---|
| Agent reasoning + hypothesis | ~500 |
| Code edit to train.py | ~2,000 |
| Metric extraction via grep | ~50 |
| Result logging | ~200 |
| Error handling (if crash) | ~500 |
| Total per cycle | ~3,000–5,000 |
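At roughly 3,000 to 5,000 tokens per cycle, a ~200K-token context window holds about 40 to 65 experiment cycles before saturation, so a run of 100+ experiments presumably depends on the agent harness compacting or summarizing earlier context along the way [survey estimate]. The design explicitly trades full observability for sustained autonomy: the agent sees only what it needs to make decisions.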
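A back-of-the-envelope check of the token budget, using the per-cycle estimates from the table above, is sketched below. It assumes nothing is ever evicted or summarized from the context, which is a simplification; real agent harnesses may compact history.

```python
def cycles_before_saturation(context_tokens, tokens_per_cycle):
    """Whole experiment cycles that fit in the context if nothing is ever evicted."""
    return context_tokens // tokens_per_cycle

CONTEXT_TOKENS = 200_000          # the ~200K window assumed in the text
for per_cycle in (3_000, 5_000):  # low and high per-cycle estimates from the table
    print(per_cycle, cycles_before_saturation(CONTEXT_TOKENS, per_cycle))
```

Redirecting training output to run.log matters because a single unfiltered log of several thousand lines could consume as much context as many complete experiment cycles.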
42.4 Key Results
42.4.1 First Overnight Run
Karpathy's initial overnight run was reported via a tweet thread in March 2026. The following table records the reported outcomes with explicit provenance for each claim.
Reported Results — Single-Run, Non-Peer-Reviewed
The results below are from a single overnight run reported by the author via social media. They have not been independently reproduced under controlled conditions and should be interpreted as a demonstration, not a benchmark.
| Metric | Value | Provenance |
|---|---|---|
| val_bpb improvement | ~11% reduction vs. baseline | [author tweet] — single run, no confidence interval |
| Successful improvements found | ~20 experiments kept | [author tweet] — approximate count |
| Total experiments conducted | ~100 | [survey estimate] — inferred from ~12/hour × ~8 hours |
| Runtime | ~8 hours (overnight) | [author tweet] |
| GPU | Single NVIDIA H100 | [author tweet] |
| Agent | Claude Code | [repo README] |
| Keep rate | ~20% | [survey estimate] — ~20 kept / ~100 total |
Reproducibility note. No standardized replication table or downloadable run logs have been published for this result. The 11% figure represents a single-run outcome on specific hardware (H100) with a specific agent (Claude Code) at a specific point in time. GPU-level floating-point nondeterminism, LLM stochasticity, and the path-dependent nature of greedy hill-climbing mean that independent runs will produce different trajectories. Community fork maintainers have reported qualitatively similar improvement magnitudes on other hardware, but without standardized comparison protocols [community fork reports].
42.4.2 Experiment Throughput
The fixed 5-minute training budget plus agent overhead yields predictable throughput [repo program.md; survey estimate]. Each experiment cycle takes approximately 6 minutes: ~30 seconds for the agent to think and edit code, ~300 seconds for training, and ~30 seconds for result interpretation. This yields approximately 10 experiments per hour, 80 per overnight (8-hour) run, and 240 per full-day (24-hour) run, somewhat below the ~12 per hour (~100 overnight) that the training budget alone would allow with zero agent overhead. These are rough estimates; actual throughput depends on the agent's reasoning speed, crash frequency, and fix-retry overhead.
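The throughput figures follow directly from the cycle breakdown. The sketch below reproduces the arithmetic; the phase durations are the rough estimates quoted above, not measurements.

```python
def experiments_per_hour(think_s, train_s, interpret_s):
    """Experiments per hour for a cycle made of three sequential phases."""
    return 3600 / (think_s + train_s + interpret_s)

rate = experiments_per_hour(think_s=30, train_s=300, interpret_s=30)
print(rate, rate * 8, rate * 24)                # 10.0 80.0 240.0
print(experiments_per_hour(0, 300, 0))          # 12.0, the zero-overhead ceiling
```

The gap between 10 and 12 experiments per hour is entirely agent overhead, so a faster agent raises throughput by at most about 20% under this model.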
42.4.3 Typical Discovery Patterns
Based on the author's reports and community fork observations [author tweet; community fork reports], autonomous runs exhibit a characteristic diminishing-returns trajectory. The following patterns have been described but should be understood as anecdotal observations rather than statistically validated findings:
| Discovery Category | Typical Finding | Reported Impact | Stage | Source |
|---|---|---|---|---|
| Batch size reduction | $2^{19} \rightarrow 2^{18}$ (more optimizer steps) | 0.01–0.02 BPB | Early | [author tweet; community] |
| Model width scaling | ASPECT_RATIO 64 → 96 | 0.01+ BPB | Early | [community fork reports] |
| Adam betas tuning | (0.8, 0.95) → (0.9, 0.95) | ~0.005 BPB | Mid | [community fork reports] |
| Weight decay reduction | 0.2 → 0.08 | ~0.003 BPB | Mid | [community fork reports] |
| LR schedule tuning | Warmdown ratio, final LR fraction | Small–medium | Mid | [community fork reports] |
| Window pattern changes | "SSSL" → "SL" | ~0.001 BPB | Late | [community fork reports] |
| Muon optimizer params | Momentum, beta2, ns_steps | ~0.001 BPB | Late | [community fork reports] |
The early experiments typically produce large gains through hyperparameter sweeps (batch size, learning rate, model width). Middle-phase gains come from architecture and optimizer changes. Late-phase experiments produce diminishing returns through fine-tuning and combination of near-misses. The cumulative trajectory is consistent with stochastic first-order optimization against a rugged but structured fitness landscape [survey interpretation].
42.5 Implementation Details
42.5.1 Cost Analysis
Autoresearch's cost structure is dominated by GPU compute for cloud users and by LLM API costs for users with local hardware. The following estimates are derived from the repository's throughput characteristics and published cloud pricing as of early 2026 [survey estimate].
Cost Estimates — Approximate, As of Early 2026
These figures are rough estimates based on published cloud GPU pricing and typical API token rates. Actual costs depend on provider, region, spot vs. on-demand pricing, and the specific agent model used. They have not been verified against invoices or billing statements.
| Run Duration | GPU Cost (H100 @ ~$2/hr est.) | LLM API Cost (est.) | Total (est.) |
|---|---|---|---|
| 8h overnight (H100 cloud) | ~$16 | ~$2 | ~$18 |
| 8h overnight (local RTX 4090) | $0 | ~$2 | ~$2 |
| 24h extended (H100 cloud) | ~$48 | ~$6 | ~$54 |
The LLM API cost estimate assumes approximately 4,000 tokens per experiment cycle (combining input and output), yielding roughly 400,000 tokens for a 100-experiment overnight run [survey estimate]. At typical early-2026 API pricing for frontier coding models, this amounts to $1.50–$3.00.
The author's README includes a cost comparison suggesting an approximately 45× reduction relative to the fully-loaded cost of a human researcher conducting the same number of experiments manually [repo README]. This comparison is illustrative: it does not account for experiment quality, insight generation, the setup cost of writing program.md, or the human review time needed to validate and interpret results. The comparison also assumes the human researcher conducts each experiment sequentially without parallelization, an assumption that favors the automated approach.
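The cost table can be reproduced with a few lines of arithmetic. In the sketch below the GPU rate, experiment count, and tokens per experiment come from the estimates above, while the blended price of $5 per million tokens is a hypothetical midpoint chosen to match the quoted $1.50 to $3.00 range, not a published rate.

```python
def run_cost(hours, gpu_rate_per_hour, experiments, tokens_per_experiment, usd_per_million_tokens):
    """Return (gpu_cost, api_cost, total) in USD for one autonomous run."""
    gpu = hours * gpu_rate_per_hour
    api = experiments * tokens_per_experiment * usd_per_million_tokens / 1_000_000
    return gpu, api, gpu + api

# 8-hour cloud run: ~$2/hr H100, 100 experiments, ~4,000 tokens each.
# The $5 per million tokens price is an assumption for illustration.
print(run_cost(8, 2.0, 100, 4_000, 5.0))   # (16.0, 2.0, 18.0)
```

Setting `gpu_rate_per_hour` to zero gives the local-hardware row, which counts only API spend and omits electricity and hardware depreciation.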
42.5.2 Reproducibility
Autoresearch achieves high reproducibility through several design choices [repo code; repo README]. The entire codebase is three files under MIT license. The training data (climbmix-400b) is publicly available and auto-downloaded by prepare.py. Dependencies are minimal (PyTorch plus 8 packages) and pinned via uv.lock. The random seed is fixed (torch.manual_seed(42)), providing determinism up to GPU-specific floating-point nondeterminism. Every experiment is captured as a git commit plus a row in results.tsv.
# Complete reproduction steps from README [repo README]
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# 2. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py
# 3. Establish baseline
uv run train.py
# 4. Start autonomous research — point any coding agent at the repo:
# "Read program.md and let's kick off a new experiment"
42.5.3 Platform Sensitivity
A critical caveat for reproducibility: results are platform-dependent [repo README]. Because the 5-minute budget is wall-clock time, different GPUs complete different numbers of training steps. The author explicitly acknowledges this:
"This means that autoresearch will find the most optimal model for your platform in that time budget. The downside is that your runs (and results) become not comparable to other people running on other compute platforms."
| GPU | Approximate Steps in 5 min | Relative Throughput (vs H100) | Source |
|---|---|---|---|
| H200 | ~1,200+ | ~1.09× | [survey estimate] |
| H100 | ~1,100 | 1.0× (reference) | [author tweet] |
| A100 (80GB) | ~800 | ~0.73× | [survey estimate] |
| RTX 4090 | ~600 | ~0.55× | [survey estimate] |
| RTX 3090 | ~400 | ~0.36× | [survey estimate] |
This platform dependence is by design: autoresearch finds the optimal configuration for a specific compute environment. But it means that reported BPB values from different hardware platforms are not directly comparable, and the optimal model architecture may differ across GPUs. For example, larger models may be optimal on an H100, which completes more steps within the budget, while smaller or more parameter-efficient models may be optimal on slower hardware, where fewer steps are available [survey interpretation].
42.5.4 Community Ecosystem
The rapid forking of autoresearch to alternative platforms—within days of release—demonstrates both the simplicity of the core design and the community demand for accessible autonomous research tools [community fork repos]:
| Fork | Platform | Source |
|---|---|---|
| autoresearch-macos | macOS (Metal) | github.com/miolini/autoresearch-macos |
| autoresearch-mlx | macOS (MLX) | github.com/trevin-creator/autoresearch-mlx |
| autoresearch-win-rtx | Windows (RTX) | github.com/jsegov/autoresearch-win-rtx |
| autoresearch (AMD) | AMD GPUs | github.com/andyluo7/autoresearch |
42.6 The GPT Model Architecture
The baseline train.py implements a modern GPT architecture in approximately 450 lines of Python [repo code: train.py]. The model incorporates several recent innovations that serve as the starting point for the agent's optimization. Understanding this baseline is essential for appreciating the search space available to the agent. The following description is derived from reading the actual repository code.
Value embeddings (ResFormer technique) [repo code: train.py, CausalSelfAttention class]. Alternating layers include learnable per-token value embeddings that are mixed into the value stream via a gated mechanism:
# From train.py — CausalSelfAttention (value embedding logic)
ve = self.value_embeds[str(i)](idx) # (B, T, kv_dim)
gate = 2 * sigmoid(self.ve_gate(x[..., :32])) # (B, T, n_kv_head)
v = v + gate * ve
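The gate is initialized at zero, so its pre-activation is 0 at the start of training. Since $\text{sigmoid}(0) = 0.5$, the factor of 2 yields a gate of exactly 1.0, a neutral initialization that neither amplifies nor suppresses the value embedding and so does not distort the value stream at the start of training [repo code: train.py].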
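The neutral-initialization arithmetic can be checked with a few lines of plain Python. This is a minimal sketch and not the repository code: it uses a single scalar gate and short lists in place of per-head tensors, and the names `sigmoid` and `gated_value` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_value(v, ve, gate_logit):
    """Mix a value-embedding vector into a value vector: v + 2*sigmoid(logit) * ve.

    Scalar-gate simplification of the excerpt above (one head, no batching).
    """
    gate = 2.0 * sigmoid(gate_logit)  # lies between 0 and 2
    return [vi + gate * vei for vi, vei in zip(v, ve)]


v = [0.5, -1.0, 2.0]    # hypothetical value vector
ve = [0.1, 0.2, -0.3]   # hypothetical value embedding

print(2.0 * sigmoid(0.0))        # gate at initialization
print(gated_value(v, ve, 0.0))   # value embedding added with unit weight
```

With a zero pre-activation the gate is 1.0, so the mixed value is simply `v + ve`; as training moves the pre-activation away from zero, the gate can shrink toward 0 or grow toward 2.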
Residual mixing [repo code: train.py, GPT class]. Each block receives both the previous hidden state and the initial embedding via learnable per-layer coefficients:
# From train.py — GPT forward (residual mixing)
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
These are initialized with resid_lambdas = 1.0 and x0_lambdas = 0.1, allowing the model to learn how much to rely on the original embedding versus the transformed representation at each layer. This technique helps with gradient flow in deeper transformers [repo code: train.py; literature context].
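To make the effect of the initial coefficients concrete, the following sketch applies the mixing rule to short Python lists. It is an illustration under stated assumptions, not the repository's implementation: the function name `mix_residual` is hypothetical, the transformer block is replaced by the identity so that only the mixing step is visible, and the defaults mirror the initial values quoted above.

```python
def mix_residual(x, x0, resid_lambda=1.0, x0_lambda=0.1):
    """Blend the running hidden state x with the initial embedding x0.

    Mirrors x = resid_lambdas[i] * x + x0_lambdas[i] * x0 for one layer,
    using the initial coefficient values as defaults.
    """
    return [resid_lambda * xi + x0_lambda * x0i for xi, x0i in zip(x, x0)]


x0 = [1.0, -2.0, 0.5]   # hypothetical initial embedding
x = list(x0)
for layer in range(3):
    x = mix_residual(x, x0)
    # A real model would apply the transformer block here; it is omitted
    # (identity) so the contribution of the mixing step is isolated.
    print(layer, x)
```

Setting `x0_lambda` to 0 recovers a plain residual stream, so the learnable coefficient lets each layer decide how much of the original embedding to re-inject.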
Additional architectural features observed in the baseline train.py [repo code]:
- ReLU-squared activation: F.relu(x).square() in the MLP class, a non-standard activation that has shown empirical benefits in small model training.
- Logit soft-capping: softcap * tanh(logits / softcap) with softcap = 15, preventing extreme logit values.
- QK normalization: RMSNorm applied to query and key projections before attention.
- Per-layer sliding window: The WINDOW_PATTERN string (default "SSSL") defines which layers use sliding-window vs. full attention in FlashAttention3.
- RoPE: Rotary position embeddings pre-computed for 10× the sequence length.
- Manual GC control: gc.freeze() and gc.disable() after warmup to avoid Python garbage collection stalls during training.
- torch.compile: Applied to the model and optimizer kernels with dynamic=False, fullgraph=True for maximum compilation benefit.
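Two of the features listed above are simple elementwise formulas, and a plain-Python sketch shows their behavior. It is a scalar illustration rather than the repository's tensor code: the function names are hypothetical, and only the formulas and the softcap value of 15 come from the list above.

```python
import math


def softcap_logits(logits, softcap=15.0):
    """Apply softcap * tanh(logit / softcap) to each logit.

    Small logits pass through almost unchanged; large ones saturate near +/- softcap.
    """
    return [softcap * math.tanh(z / softcap) for z in logits]


def relu_squared(xs):
    """Elementwise relu(x) ** 2, the activation given as F.relu(x).square()."""
    return [max(x, 0.0) ** 2 for x in xs]


print(softcap_logits([-100.0, -1.0, 0.0, 1.0, 100.0]))
print(relu_squared([-2.0, 0.0, 3.0]))
```

Because tanh is bounded by 1 in magnitude, no capped logit can exceed the softcap in absolute value, while logits near zero are left nearly untouched.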
42.7 Limitations and Discussion
42.7.1 Algorithmic Limitations
The greedy hill-climbing strategy imposes fundamental constraints on the quality of solutions autoresearch can find:
Local optima. The algorithm can become trapped in configurations where no single-step modification improves the metric. Multi-step improvements requiring intermediate regressions (traversing fitness valleys) are unreachable. Population-based methods like MAP-Elites (used in AlphaEvolve) maintain diversity that allows escaping such traps.
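The trapping behavior can be reproduced on a toy problem. The sketch below is a generic greedy hill-climber over a hypothetical one-dimensional loss landscape, not the agent loop itself: the loss values, the neighbor rule, and all names are made up for illustration, and lower scores are better, as with validation BPB.

```python
def greedy_hill_climb(score, start, neighbors, max_steps=100):
    """Keep one current-best candidate; accept a neighbor only if it scores strictly lower."""
    current, best = start, score(start)
    for _ in range(max_steps):
        improved = False
        for cand in neighbors(current):
            s = score(cand)
            if s < best:
                current, best, improved = cand, s, True
                break  # take the first improvement, then search from there
        if not improved:
            break  # no single-step change helps: a local optimum
    return current, best


# Hypothetical loss per configuration index; the global minimum is at index 6.
LOSS = [5.0, 4.0, 3.0, 3.5, 4.0, 2.0, 1.0, 1.5]


def score(i):
    return LOSS[i]


def neighbors(i):
    """Single-step modifications: move one index left or right."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LOSS)]


print(greedy_hill_climb(score, 0, neighbors))  # stops at index 2
print(greedy_hill_climb(score, 5, neighbors))  # reaches index 6
```

Starting from index 0, the search stops at index 2 (loss 3.0) because reaching index 6 (loss 1.0) would require passing through the worse configurations at indices 3 and 4, which a strictly greedy acceptance rule never allows.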
No exploration-exploitation balance. There is no formal mechanism for allocating the agent's effort between refining what works (exploitation) and trying radically different approaches (exploration). The "think harder" instruction in program.md is a heuristic substitute, not a principled solution.
Sequential search. By default, autoresearch runs one experiment at a time. With a 5-minute training budget and ~1-minute agent overhead, the system can evaluate only ~10 configurations per hour. Population-based or island-model approaches can evaluate dozens of candidates in parallel on multi-GPU systems.
Single noisy evaluation. Each experiment is evaluated exactly once. Because GPU floating-point nondeterminism, compilation caching, and system load introduce noise into the training outcome, small apparent improvements may be within the noise floor. The system has no mechanism for replicate runs or statistical significance testing [survey interpretation].
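One way such a check could look, if replicate runs were affordable, is a simple comparison of the mean improvement against its standard error. The sketch below is not part of autoresearch: the function name, the two-standard-error threshold, and the BPB values are all hypothetical, and the rule is a rough heuristic rather than a formal significance test.

```python
import statistics


def exceeds_noise(baseline_runs, candidate_runs, k=2.0):
    """Return True if the candidate's mean beats the baseline's by more than k standard errors.

    Lower is better (e.g. validation BPB). Requires at least two runs per arm.
    """
    diff = statistics.mean(baseline_runs) - statistics.mean(candidate_runs)
    se = (
        statistics.variance(baseline_runs) / len(baseline_runs)
        + statistics.variance(candidate_runs) / len(candidate_runs)
    ) ** 0.5
    return diff > k * se


# Hypothetical validation BPB from three replicate runs each.
baseline = [1.000, 1.004, 0.998]
marginal = [0.997, 1.001, 0.999]   # small gain, comparable to run-to-run spread
clear = [0.980, 0.982, 0.978]      # gain well outside the spread

print(exceeds_noise(baseline, marginal))
print(exceeds_noise(baseline, clear))
```

The trade-off is cost: three replicates per arm would cut the number of distinct ideas tested in a fixed time budget to roughly a third.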
42.7.2 Scope Limitations
Single-file modification. The agent can only modify train.py. It cannot introduce new Python modules, add data augmentation strategies, modify the tokenizer, or change the evaluation metric. This constrains the search space to what can be expressed as modifications to a single training script.
No cross-run memory. Each run starts from the current program.md and train.py with no persistent memory of what was tried in prior runs. The human must manually transfer knowledge between runs by updating program.md or starting from a previously improved branch.
Platform-specific results. Because the 5-minute budget is wall-clock time, results from different GPU types are not comparable. There is no normalization by FLOPs, parameter-hours, or training steps, which limits the system's utility for cross-hardware benchmarking.
42.7.3 Comparison with Infrastructure-Heavy Systems
| Dimension | Autoresearch | AlphaEvolve | OpenEvolve |
|---|---|---|---|
| Search strategy | Greedy hill-climbing | MAP-Elites + LLM ensemble | Island-based evolutionary |
| Population | 1 (current best) | Full MAP-Elites archive | Multi-island populations |
| Infrastructure | 3 files, git, shell | Multi-GPU cluster, databases, orchestrators | Database, vector store, queue |
| Setup time | ~10 minutes [survey est.] | Substantial engineering effort | Configuration + infrastructure |
| Agent architecture | Single general-purpose agent | Multi-model ensemble | Single LLM with bandit selection |
| Cross-run learning | None (manual via branch/program.md) | Structured program database | Knowledge base, prompt evolution |
| Domain | Neural network training | General program optimization | General program optimization |
| Parallelism | Sequential (1 experiment at a time) | Massively parallel | Parallel islands |
| Evaluation replication | Single noisy eval | Multiple evaluations | Configurable eval count |
This comparison highlights a fundamental design trade-off. Autoresearch sacrifices search sophistication and parallelism for radical accessibility. The claim is not that greedy hill-climbing is optimal (the limitations in Section 42.7.1 indicate it is not) but that the simplicity of the system unlocks a mode of research that was previously inaccessible. A researcher with a single GPU and 10 minutes of setup time can conduct overnight autonomous experiments, a capability previously requiring dedicated infrastructure teams.
42.7.4 The Meta-Optimization Opportunity
Karpathy explicitly highlights the meta-learning potential of the system [repo README]:
"The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress."
This positions program.md as itself an optimizable artifact. The human-in-the-loop learning cycle operates at a higher level: (1) agent runs autonomously for 8 hours; (2) human reviews results; (3) human updates program.md with insights; (4) agent runs again with better instructions. A natural extension would apply autoresearch to itself—using an LLM agent to optimize program.md based on research outcomes—creating a form of meta-optimization that echoes the self-referential character of Gödel machines [survey interpretation].
42.7.5 Broader Significance: Research Taste as a Specification
Perhaps the most conceptually novel aspect of autoresearch is the encoding of research taste in program.md [repo program.md]. The simplicity criterion ("a small improvement that adds ugly complexity is not worth it"), the crash-handling heuristic ("if it's something dumb and easy to fix, fix it; if the idea itself is fundamentally broken, just skip it"), and the exploration directive ("think harder") are all judgments about research quality that are typically tacit knowledge held by experienced researchers. By making these judgments explicit and executable, autoresearch converts research taste into a formal specification that can be shared, debated, and iterated upon.
This has implications beyond neural network training. Any domain where solutions can be expressed as code, evaluation is automated, and experiments are fast enough for iterative improvement is amenable to the autoresearch paradigm. The methodology is a contribution independent of the specific domain application.
42.8 Summary
Chapter Summary
Key takeaway: Autoresearch demonstrates that radical simplicity—three files, one GPU, one metric, and a natural-language research specification—is sufficient for a general-purpose coding LLM agent to conduct overnight autonomous research that produces real, cumulative improvements to neural network training.
Main contribution: The system's primary contribution is paradigmatic rather than algorithmic. It establishes the "research-as-code" model where the human writes a Markdown specification of the research protocol and the LLM agent executes it autonomously. The greedy hill-climbing search strategy is trivial, but the encoding of research taste (simplicity preference, crash-handling heuristics, exploration directives) into an executable natural-language specification is a genuinely novel idea. The 63,000+ GitHub stars [repo] and immediate ecosystem of platform-specific forks suggest that this paradigm addresses latent demand for accessible autonomous research tools.
Evidence status: The core system design is fully verifiable from the public repository (3 files, MIT license). Performance claims (11% BPB improvement, overnight throughput) rest on a single author-reported run via social media and anecdotal community fork results—no peer-reviewed evaluation or standardized replication exists. Cost and throughput estimates in this chapter are survey-author calculations based on published pricing and observed experiment timing.
For researchers: Autoresearch is most valuable as a lower bound on infrastructure requirements for autonomous research and as a template for the "research specification" pattern. Its limitations—local optima trapping, no cross-run memory, sequential execution, platform-specific results, single noisy evaluations—are well-understood and clearly documented. Systems like AlphaEvolve and OpenEvolve address these limitations at the cost of substantial infrastructure complexity. The open question is whether the program.md meta-optimization loop—iterating on the research specification itself—can close the gap without adding that complexity.