Introduced: 2025-06
Score: 8.38/10 (Draft)
Chapter 42

Karpathy Autoresearch

Part P07: Autonomous Research Systems

42.1 Overview and Motivation

In March 2026, Andrej Karpathy released autoresearch, a repository that inverts the relationship between human researcher and computational tool. Where prior autonomous research systems—AlphaEvolve (Chapter 4), OpenEvolve (Chapter 5), GEPA (Chapter 7)—required substantial infrastructure including databases, orchestrators, vector stores, and multi-agent coordination, autoresearch demonstrates that a single coding LLM agent, a Markdown instruction file, and one GPU can autonomously conduct neural network research overnight. The repository garnered over 63,000 GitHub stars within weeks of release, signaling a deep resonance with the machine learning community.

The system's lineage traces directly to nanochat, Karpathy's single-GPU GPT training repository. Autoresearch wraps nanochat's training infrastructure with an LLM agent loop that autonomously modifies the training code, evaluates results, and accumulates improvements. The entire system consists of three files: program.md (the research specification), train.py (the mutable experiment space), and prepare.py (the immutable evaluation infrastructure).

Key Contribution

Autoresearch establishes that a general-purpose coding LLM agent, given only a natural-language research specification and a fixed-budget training setup, can autonomously discover cumulative improvements to neural network training. In the author's reported first overnight run, this yielded an approximately 11% reduction in validation bits-per-byte over roughly 100 sequential experiments on a single H100 GPU [author tweet, single run]. The system's primary contribution is not algorithmic (the search strategy is greedy hill-climbing) but paradigmatic: it demonstrates that programming the researcher via a Markdown document can replace the complex orchestration infrastructure that characterizes prior autonomous research systems.

42.1.1 The Research-as-Code Paradigm

Karpathy describes the paradigm shift directly in the repository README [repo README]:

"The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org."

This framing introduces a three-level optimization hierarchy. At Level 0, the LLM agent optimizes train.py to minimize validation loss. At Level 1, the human optimizes program.md to maximize the agent's research productivity. At Level 2, the community optimizes the research methodology itself through forks, extensions, and shared discoveries. The human researcher's role shifts from executor of experiments to architect of the research process.

42.1.2 Philosophical Context

Karpathy's body of work follows a distinctive pattern: take a complex system (ImageNet classifiers, GPT-2, tokenizers), strip it to its essence in a single readable file, and open-source it as both an educational and a practical tool. Autoresearch extends this pattern—now the researcher itself is automated. The repository opens with a characteristically provocative framing [repo README]:

"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."

This positions autoresearch not merely as a tool but as a proof-of-concept for a fundamentally different mode of scientific inquiry—one where humans design the research program in natural language rather than conducting experiments directly. The explicit simplicity constraint embedded in program.md further reflects this philosophy [repo program.md]:

"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome—that's a simplification win."

42.1.3 Positioning Among Autonomous Research Systems

| System | Year | Infrastructure Complexity | Agent Role | Human Role | Search Strategy |
|---|---|---|---|---|---|
| Bayesian Optimization | Classical | Medium | Statistical model | Defines search space | Acquisition function |
| Neural Architecture Search | 2016+ | High | RL/evolution agent | Defines search space | RL, evolutionary |
| FunSearch (DeepMind) | 2023 | High | LLM mutation | Defines evaluator | Evolutionary |
| AlphaEvolve (DeepMind) | 2025 | Very High | LLM ensemble | Defines problem | MAP-Elites + ensemble |
| OpenEvolve | 2025 | High | LLM mutation | Configures pipeline | Island-based evolutionary |
| Autoresearch | 2026 | Minimal | Full coding agent | Writes program.md | Greedy hill-climbing |

Autoresearch occupies a unique position: it is the simplest autonomous research system surveyed in this book that produces real, cumulative improvements. Where AlphaEvolve requires a team of engineers and a multi-GPU cluster to operate, autoresearch requires uv sync && uv run prepare.py and a prompt.

42.1.4 Provenance and Evidence Standards

This chapter draws on multiple evidence tiers. To maintain transparency, claims are tagged throughout with their provenance. The following table summarizes the evidence basis for each category of claim.

Evidence Provenance Key

| Tag | Meaning | Confidence |
|---|---|---|
| [repo code] | Verified by reading the file in the public repository (github.com/karpathy/autoresearch). Accessed April 2026. | High |
| [repo README] | Stated in the repository README.md | High |
| [repo program.md] | Stated in the repository program.md specification file | High |
| [author tweet] | Reported by Karpathy via social media posts (x.com/karpathy), not peer-reviewed | Medium |
| [community fork] | Reported by community fork maintainers; not independently verified by this survey | Low–Medium |
| [survey estimate] | Estimated or inferred by the survey authors based on available evidence | Low–Medium |
| [survey interpretation] | Analytical framing or formalization contributed by the survey authors | N/A (analysis) |

42.2 Architecture

Autoresearch's architecture is remarkable for what it deliberately excludes. There is no database, no task queue, no orchestrator, no vector store, no plugin system, and no configuration schema. The filesystem (via git) serves as the database. The agent's context window serves as memory. Shell commands serve as the orchestrator. This radical minimalism is a conscious design choice, not an oversight.

Implementation Snapshot

Repository: github.com/karpathy/autoresearch (MIT License, accessed April 2026)

| File | Role | Mutability | Key Exports / Constants |
|---|---|---|---|
| program.md | Research specification | Read-only by agent; human-edited between runs | Setup protocol, experiment loop spec, behavioral directives, logging format |
| train.py | GPT model + optimizer + training loop (~450 lines) | Mutable: sole file the agent edits | Classes: GPTConfig, CausalSelfAttention, MLP, Block, GPT, MuonAdamW. Constants: DEPTH, ASPECT_RATIO, HEAD_DIM, WINDOW_PATTERN, TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE, MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC |
| prepare.py | Data pipeline, tokenizer, evaluation metric | Immutable: agent cannot modify | Constants: MAX_SEQ_LEN = 2048, TIME_BUDGET = 300, VOCAB_SIZE = 8192, eval tokens = 40 × 524,288. Functions: evaluate_bpb(), make_dataloader(). Class: Tokenizer |
| results.tsv | Experiment audit trail | Append-only; untracked by git | Columns: commit, val_bpb, memory_gb, status, description |
| README.md | Documentation | Immutable during runs | Setup instructions, philosophy, hardware guidance |
| pyproject.toml | Dependency specification | Immutable during runs | torch 2.9.1 (CUDA 12.8), kernels, rustbpe, tiktoken, pyarrow, numpy, pandas, matplotlib, requests |

Output artifacts per experiment: run.log (training stdout/stderr, overwritten each cycle), git commit (on autoresearch/<tag> branch), row appended to results.tsv.

42.2.1 Three-File Design

[Figure: Three-file design. A coding LLM agent (Claude Code, Codex, etc.) acts as hypothesis generator, code editor, interpreter, and decision maker (keep / discard / crash), working through a shell interface (git, uv, grep, tail). It reads program.md (research specification, natural-language instructions; read-only by agent), edits train.py (~450 lines; GPT model, optimizer, training loop; mutable), and is scored by prepare.py (data pipeline, tokenizer, evaluate_bpb() metric; immutable, cannot be gamed). Outputs: run.log (training output), results.tsv (audit trail), git branch (state machine). Hardware: single GPU, 5-minute budget.]

The separation between prepare.py (immutable) and train.py (mutable) is architecturally critical [repo code]. The evaluation function evaluate_bpb() lives in prepare.py, which the agent cannot modify. This prevents the agent from gaming the metric—a failure mode that plagues unconstrained autonomous optimization systems. The data pipeline, tokenizer, validation split, and evaluation constants are all outside the agent's reach.

42.2.2 Git as State Machine

Rather than maintaining experiment state in a database, autoresearch uses the git branch as a state machine [repo program.md]. Each experiment begins with a commit. If the experiment improves val_bpb, the branch advances. If the experiment fails or regresses, git reset returns the branch to the last successful commit. The branch tip always represents the best-known configuration.

The following pseudocode reconstructs the experiment cycle from the natural-language specification in program.md. It is not verbatim repository code—autoresearch has no Python orchestrator; the LLM agent itself executes these steps via shell commands as instructed by program.md.

# PSEUDOCODE — reconstructed from program.md's natural-language specification.
# Autoresearch has no orchestrator script; the LLM agent executes
# these steps interactively via shell commands.

def experiment_cycle(agent, branch: str, best_bpb: float, best_commit: str):
    """One cycle of the autoresearch loop, as specified in program.md."""
    # 1. Agent generates hypothesis and edits train.py
    idea = agent.generate_hypothesis()
    agent.edit_file("train.py", idea)

    # 2. Commit the change before running [program.md: "git commit"]
    commit_hash = git_commit(f"experiment: {idea.summary}")

    # 3. Run with output redirected [program.md: "uv run train.py > run.log 2>&1"]
    exit_code = shell("uv run train.py > run.log 2>&1", timeout=600)

    # 4. Extract metric [program.md: 'grep "^val_bpb:" run.log']
    val_bpb = parse_float(shell('grep "^val_bpb:" run.log'))
    peak_vram = parse_float(shell('grep "^peak_vram_mb:" run.log'))

    # 5. Decision: keep, discard, or handle crash
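    if val_bpb is None:  # crash: no metric produced
        error = shell("tail -n 50 run.log")  # [program.md: diagnose from tail]
        if is_trivial_fix(error):
            # Trivial bug (typo, missing import): fix, re-run, re-read the metrics
            val_bpb, peak_vram = fix_and_retry()
        if val_bpb is None:  # idea is broken, or the retry crashed again
            log_result(commit_hash, 0.0, 0.0, "crash", idea.summary)
            git_reset(best_commit)
            return best_bpb, best_commit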

    if val_bpb < best_bpb:  # improvement — keep
        log_result(commit_hash, val_bpb, peak_vram, "keep", idea.summary)
        return val_bpb, commit_hash
    else:  # no improvement — discard
        log_result(commit_hash, val_bpb, peak_vram, "discard", idea.summary)
        git_reset(best_commit)
        return best_bpb, best_commit

The results.tsv file serves as a human-readable audit trail, with tab-separated columns for commit hash, val_bpb, memory usage, status (keep/discard/crash), and experiment description [repo program.md]. Critically, this file is not committed to git—it is left untracked to keep the branch clean for code-only diffs.
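
The metric extraction and logging steps can be sketched in Python. This is a minimal illustration under stated assumptions, not repository code: the helper names (parse_metric, results_row), the numeric formatting, the MB-to-GB conversion, and the sample log contents are all hypothetical. Only the `val_bpb:` and `peak_vram_mb:` line prefixes and the five column names come from the text above.

```python
import re

def parse_metric(log_text: str, key: str):
    """Return the float on a line of the form '<key>: <number>', or None if absent."""
    pattern = rf"^{re.escape(key)}:\s*([-+0-9.eE]+)\s*$"
    match = re.search(pattern, log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None

def results_row(commit: str, log_text: str, status: str, description: str) -> str:
    """Format one tab-separated row: commit, val_bpb, memory_gb, status, description."""
    val_bpb = parse_metric(log_text, "val_bpb")
    peak_mb = parse_metric(log_text, "peak_vram_mb")
    # A crashed run prints no metric; the pseudocode above logs 0.0 for both fields.
    bpb = 0.0 if val_bpb is None else val_bpb
    mem_gb = 0.0 if peak_mb is None else peak_mb / 1024  # assumed MB -> GB conversion
    return "\t".join([commit, f"{bpb:.6f}", f"{mem_gb:.1f}", status, description])

# Hypothetical log excerpt; real run.log contents will differ.
sample_log = "step 100 loss 3.21\nval_bpb: 0.987654\npeak_vram_mb: 45056.0\n"
print(results_row("a1b2c3d", sample_log, "keep", "halve TOTAL_BATCH_SIZE"))
```

Returning None for a missing metric mirrors the crash branch of the cycle: the absence of a `val_bpb:` line is itself the crash signal.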

Key behavioral directives from program.md that govern this cycle [repo program.md]:

  • Crash handling: "If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it."
  • Timeout policy: A 10-minute hard kill is applied if training exceeds the expected duration.
  • Autonomy: "NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue."
  • Idea generation when stuck: "If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes."

42.2.3 Agent-Agnostic Design

Autoresearch deliberately avoids coupling to any specific LLM provider or agent framework [repo README]. The program.md is written in natural language that any sufficiently capable coding agent can follow. The README states:

"Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like: 'Hi have a look at program.md and let's kick off a new experiment!'"

The only capabilities required of the agent are: file read/write, shell command execution, git operations, and basic reasoning about experimental results. This makes autoresearch a meta-framework—it defines the research protocol, not the agent implementation. The agent simultaneously serves as hypothesis generator, code writer, result interpreter, research strategist, and error handler.

42.3 Core Algorithms

42.3.1 Greedy Hill-Climbing with LLM-Guided Proposals

The core search algorithm is a first-order greedy hill climb [repo program.md; survey interpretation]. At each step, the agent proposes a modification to train.py, evaluates it under a fixed compute budget, and accepts the change only if it improves the validation metric.

Formal characterization. Let $\theta_t$ denote the state of train.py at step $t$ (the complete source code, treated as an element of a discrete program space $\Theta$). Let $f: \Theta \to \mathbb{R}_{>0}$ denote the validation bits-per-byte achieved by training configuration $\theta$ for exactly TIME_BUDGET = 300 seconds of wall-clock time [repo code: prepare.py]. The acceptance rule is:

$$\theta_{t+1} = \begin{cases} \theta_t' & \text{if } f(\theta_t') < f(\theta_t) \\ \theta_t & \text{otherwise} \end{cases}$$

where $\theta_t' = g_{\text{LLM}}(\theta_t, \mathcal{H}_t)$ is the agent's proposed modification, generated by the LLM conditioned on the current code $\theta_t$ and the experiment history $\mathcal{H}_t = \{(\theta_i, f(\theta_i), s_i)\}_{i=1}^{t}$ with $s_i \in \{\text{keep}, \text{discard}, \text{crash}\}$. By construction, the sequence $\{f(\theta_t)\}$ is monotonically non-increasing. Note that $f$ is stochastic—even with a fixed seed, GPU-level floating-point nondeterminism introduces small variations—so $f(\theta_t')$ is in practice a single noisy evaluation [survey interpretation].

This strategy has well-known theoretical limitations. It can become trapped in local optima where no single modification improves the metric, even though a combination of changes would. There is no formal exploration-exploitation balance, no backtracking mechanism, and no population diversity. However, the LLM mitigates these limitations in several ways: it can propose compound changes (modifying multiple hyperparameters simultaneously), it has implicit knowledge from training data about what works in ML, and the program.md instruction "if you run out of ideas, think harder" encourages the agent to attempt radical changes when stuck [repo program.md].
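
The acceptance rule can be exercised on a toy problem. The sketch below is an illustration only: the scalar state, the noisy quadratic objective, and the random perturbation are hypothetical stand-ins for train.py, the 5-minute training run, and the LLM's proposal, respectively.

```python
import random

def greedy_hill_climb(f, propose, theta0, steps, rng):
    """Keep a proposal only if its single noisy evaluation beats the recorded best."""
    theta, best = theta0, f(theta0, rng)
    trajectory = [best]
    for _ in range(steps):
        candidate = propose(theta, rng)
        score = f(candidate, rng)
        if score < best:             # strict improvement: keep
            theta, best = candidate, score
        trajectory.append(best)      # otherwise discard; the incumbent is unchanged
    return theta, trajectory

def noisy_loss(theta, rng):
    """Toy objective (hypothetical): quadratic bowl plus small evaluation noise."""
    return (theta - 3.0) ** 2 + 1.0 + rng.gauss(0.0, 0.01)

def perturb(theta, rng):
    """Toy proposal (hypothetical): a small random move instead of an LLM edit."""
    return theta + rng.uniform(-0.5, 0.5)

rng = random.Random(0)
theta, traj = greedy_hill_climb(noisy_loss, perturb, theta0=0.0, steps=100, rng=rng)
print(f"start={traj[0]:.3f} end={traj[-1]:.3f}")
```

Because each candidate is evaluated once and compared against a recorded score, a favorable noise draw can be kept as if it were a real improvement. The recorded sequence is still non-increasing by construction.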

42.3.2 Fixed-Budget Evaluation and BPB Metric

Every experiment trains for a fixed 300 seconds of wall-clock time, with a hard kill at approximately 600 seconds for safety [repo code: prepare.py defines TIME_BUDGET = 300; repo program.md specifies the timeout]. This fixed-budget constraint is a defining design choice. The training loop in train.py excludes the initial steps (steps 0 through 10 in the excerpt below) from the time budget to account for PyTorch compilation and CUDA warmup, so the budget measures actual training time, and it exits at the first step where the accumulated time reaches the budget [repo code: train.py]:

# From train.py — fixed-budget training loop (simplified excerpt)
# TIME_BUDGET is imported from prepare.py (= 300 seconds)

total_training_time = 0.0
for step in range(max_steps):
    t0 = time.perf_counter()
    # ... training step (forward, backward, optimizer) ...
    dt = time.perf_counter() - t0

    if step > 10:  # exclude warmup/compilation steps
        total_training_time += dt

    if step > 10 and total_training_time >= TIME_BUDGET:
        break

The single optimization objective is validation bits per byte (val_bpb), a vocabulary-size-independent metric. The function evaluate_bpb() in prepare.py computes it as follows [repo code: prepare.py]:

$$\text{val\_bpb} = \frac{\sum_{i=1}^{N} \ell_i \cdot \mathbb{1}[b_i > 0]}{\ln(2) \cdot \sum_{i=1}^{N} b_i}$$

where $N$ is the total number of tokens in the validation set, $\ell_i = \text{CE}(\hat{y}_i, y_i)$ is the per-token cross-entropy loss in nats (natural logarithm base), $b_i = \text{utf8\_bytes}(y_i)$ is the UTF-8 byte length of target token $y_i$, the indicator $\mathbb{1}[b_i > 0]$ masks out special tokens (e.g., BOS), and the factor $\ln(2)$ converts from nats to bits. BPB is preferred over perplexity because it allows fair comparison across different vocabulary sizes. The evaluation runs over 40 batches of 524,288 tokens each (approximately 20 million validation tokens), as fixed in prepare.py [repo code: prepare.py].

# From prepare.py — evaluate_bpb() (simplified structure)
def evaluate_bpb(model, tokenizer, batch_size):
    """Evaluate model on ~20M validation tokens, return bits per byte."""
    total_nats = 0.0
    total_bytes = 0
    for _ in range(40):  # 40 evaluation batches
        # x, y: input/target token batch (524,288 tokens) from the validation loader; loading elided
        loss_flat = model(x, y, reduction='none').view(-1)
        nbytes = token_bytes[y.view(-1)]    # UTF-8 byte count per target token (lookup table; setup elided)
        mask = nbytes > 0                    # exclude special tokens
        total_nats += (loss_flat * mask).sum().item()
        total_bytes += nbytes.sum().item()
    return total_nats / (math.log(2) * total_bytes)
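
A small worked example makes the formula concrete. The following sketch is independent of the repository: the function name bits_per_byte and the four-token input are hypothetical, chosen so the result can be checked by hand (8 ln 2 nats over 8 bytes gives 1 bit per byte).

```python
import math

def bits_per_byte(losses_nats, token_bytes):
    """val_bpb = sum(loss_i * [b_i > 0]) / (ln 2 * sum(b_i))."""
    total_nats = sum(loss for loss, b in zip(losses_nats, token_bytes) if b > 0)
    total_bytes = sum(token_bytes)
    return total_nats / (math.log(2) * total_bytes)

# Three text tokens and one special token (0 bytes) whose loss is masked out.
losses = [2.0 * math.log(2), 4.0 * math.log(2), 2.0 * math.log(2), 5.0]
nbytes = [2, 4, 2, 0]
print(bits_per_byte(losses, nbytes))  # approximately 1.0
```

The special token contributes neither loss nor bytes, so a tokenizer change that merges or splits tokens alters the per-token losses but not the denominator, which is what makes the metric comparable across vocabularies.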

The fixed-budget design forces the agent to reason about compute efficiency, not just model quality. Reducing the batch size from $2^{19}$ to $2^{18}$ tokens, for instance, halves the tokens per optimizer step and, at roughly constant token throughput, approximately doubles the number of optimizer steps within the 5-minute budget. Agents are reported to discover and exploit this trade-off repeatedly [author tweet; community fork reports].

42.3.3 The Muon + AdamW Hybrid Optimizer

The baseline train.py implements a dual-optimizer design in the MuonAdamW class [repo code: train.py]. Weight matrices (2D parameters) use the Muon optimizer (gradient orthogonalization via polar decomposition), while embeddings, scalars, and biases use standard AdamW. This subsection describes the optimizer as implemented in the baseline train.py, supplemented where noted with context from the Muon optimizer literature.

Polar decomposition via Newton-Schulz iterations. Given a gradient matrix $G \in \mathbb{R}^{m \times n}$ (with $m \leq n$; if $m > n$, $G^T$ is used instead), the polar decomposition yields $G = UP$ where $U \in \mathbb{R}^{m \times n}$ has orthonormal rows and $P \in \mathbb{R}^{n \times n}$ is symmetric positive semi-definite. The Muon optimizer approximates $U$ using Newton-Schulz iterations, implemented in train.py with 5 steps and the following hardcoded polynomial coefficients [repo code: train.py]:

# From train.py — Newton-Schulz coefficients for polar decomposition
# These are precomputed polynomial coefficients that accelerate convergence
polar_express_coeffs = [
    (8.156554524902461, -22.48329292557795, 15.878769915207462),
    (4.042929935166739, -2.808917465908714, 0.5000178451051316),
    (3.8916678022926607, -2.772484153217685, 0.5060648178503393),
    (3.285753657755655, -2.3681294933425376, 0.46449024233003106),
    (2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]

Each iteration $k = 0, \ldots, 4$ applies the update:

$$X_{k+1} = a_k X_k + X_k (b_k A_k + c_k A_k^2)$$

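where $X_0 = G / \|G\|_F$ (the Frobenius-normalized gradient), $A_k = X_k^T X_k \in \mathbb{R}^{n \times n}$, and $(a_k, b_k, c_k)$ are the coefficients listed above. Because $X_k (X_k^T X_k) = (X_k X_k^T) X_k$, the same update can be written as $X_{k+1} = a_k X_k + (b_k B_k + c_k B_k^2) X_k$ with $B_k = X_k X_k^T \in \mathbb{R}^{m \times m}$, which uses the smaller Gram matrix when $m \leq n$. Each iteration maps every singular value $\sigma$ of $X_k$ to $a_k \sigma + b_k \sigma^3 + c_k \sigma^5$ and leaves the singular vectors unchanged, so after 5 iterations $X_5$ approximates the orthogonal factor $U$ of the polar decomposition. The accuracy depends on the singular value distribution of $G$; these specific coefficients are tuned for rapid convergence over the range of singular values typical of neural network gradients [Muon optimizer literature; coefficients verified in repo code].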
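
The update rule can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the helper names (matmul, transpose, newton_schulz) are hypothetical, and the demonstration deliberately swaps in the classical cubic Newton-Schulz coefficients (1.5, -0.5, 0), which converge slowly but predictably, rather than the tuned five-step coefficients from train.py.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, coeffs):
    """Apply X <- a*X + X(b*A + c*A^2), with A = X^T X, once per (a, b, c) in coeffs."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]            # X_0 = G / ||G||_F
    for a, b, c in coeffs:
        gram = matmul(transpose(x), x)                   # A_k, shape n x n
        gram2 = matmul(gram, gram)                       # A_k^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        xp = matmul(x, poly)                             # X_k (b A_k + c A_k^2)
        x = [[a * u + w for u, w in zip(rx, rp)] for rx, rp in zip(x, xp)]
    return x

# Classical cubic iteration (an assumption for this demo, NOT the train.py values).
classical = [(1.5, -0.5, 0.0)] * 30
G = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]                   # wide 2 x 3 example
U = newton_schulz(G, classical)
print(matmul(U, transpose(U)))                           # close to the 2 x 2 identity
```

Passing the five (a, b, c) tuples listed above as coeffs applies the tuned schedule instead, which trades exactness for speed: it reaches an approximately orthogonal result in 5 steps rather than converging to machine precision.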

Relationship to natural gradient methods. The orthogonalization step extracts the rotational component of the gradient, discarding magnitude information from the singular values. This shares a geometric motivation with natural gradient methods (Amari, 1998): both aim to produce updates that are less sensitive to the parameterization of the model. However, the Muon orthogonalization operates on each weight matrix independently and does not compute or approximate the Fisher information matrix. The analogy is structural—both seek parameterization-invariant update directions—but they are not formally equivalent, and the Muon paper does not claim natural gradient equivalence [survey interpretation; Muon literature].

NorMuon variance reduction. After orthogonalization, the optimizer applies an adaptive normalization step using a second-moment buffer, implemented in train.py as part of the MuonAdamW class [repo code: train.py]. Let $g_t \in \mathbb{R}^{m \times n}$ denote the orthogonalized gradient at step $t$. Define:

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \text{mean}(g_t^2, \text{dim}=d)$$

where $v_t$ is the per-row (or per-column, depending on matrix shape) exponential moving average of the mean squared gradient entry, $\beta_2$ is the second-moment decay coefficient, and $d$ specifies the reduction dimension. The normalization then proceeds in two steps, with the intermediate result $\hat{g}_t$ kept distinct from the final update $g_t'$:

$$\hat{g}_t = g_t \odot \frac{1}{\sqrt{v_t + \epsilon}} \qquad \text{(Step 1: variance normalization)}$$
$$g_t' = \hat{g}_t \cdot \frac{\|g_t\|_F}{\|\hat{g}_t\|_F} \qquad \text{(Step 2: norm preservation)}$$

Here, $\hat{g}_t$ is the intermediate variance-normalized gradient, $\epsilon$ is a small constant for numerical stability, $\odot$ denotes element-wise multiplication (with broadcasting along the reduced dimension), and $\|\cdot\|_F$ denotes the Frobenius norm. Step 1 redistributes gradient magnitude across dimensions (suppressing high-variance directions, amplifying low-variance ones), analogous to Adam's per-parameter adaptation. Step 2 rescales to preserve the overall gradient norm, ensuring that the variance normalization does not inadvertently change the effective learning rate. The final update $g_t'$ thus combines direction quality from the polar orthogonalization with adaptive per-dimension scaling from the variance tracker [repo code: train.py; Muon/NorMuon literature].
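
The two normalization steps can be illustrated on a small matrix. The sketch below is an assumption-laden illustration rather than the MuonAdamW code: the function names, the per-row reduction, the values beta2 = 0.95 and eps = 1e-8, and the example gradient are all hypothetical.

```python
import math

def frobenius(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))

def normuon_normalize(g, v_prev, beta2=0.95, eps=1e-8):
    """Return (g', v_t) for one step; beta2 and eps are assumed values."""
    # v_t = beta2 * v_{t-1} + (1 - beta2) * mean(g^2) along each row
    v = [beta2 * vp + (1 - beta2) * sum(x * x for x in row) / len(row)
         for vp, row in zip(v_prev, g)]
    # Step 1: variance normalization, one scale factor per row
    g_hat = [[x / math.sqrt(vi + eps) for x in row] for vi, row in zip(v, g)]
    # Step 2: rescale so the overall Frobenius norm matches the input
    scale = frobenius(g) / frobenius(g_hat)
    return [[x * scale for x in row] for row in g_hat], v

# Hypothetical gradient with one large-magnitude row and one small-magnitude row.
g = [[0.9, -0.1, 0.2], [0.01, 0.02, -0.03]]
g_out, v = normuon_normalize(g, v_prev=[0.0, 0.0])
row_norms = [math.sqrt(sum(x * x for x in row)) for row in g_out]
print(row_norms)
```

In this example the two rows start with very different magnitudes and end with nearly equal norms, while the total Frobenius norm is unchanged. That is the redistribution described in Step 1 combined with the norm preservation of Step 2.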

42.3.4 The Mutable Search Space

The agent has complete freedom to modify train.py while prepare.py remains immutable. The following tables enumerate the constraint surface as observed in the repository [repo code].

| Category | Mutable Parameters (in train.py) | Baseline Values |
|---|---|---|
| Model Architecture | DEPTH, ASPECT_RATIO, HEAD_DIM, WINDOW_PATTERN | 8, 64, 128, "SSSL" |
| Optimizer Config | MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY | 0.04, (0.8, 0.95), 0.2 |
| LR Schedule | WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC | Defined in train.py |
| Batch Size | TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE | $2^{19}$, 128 |
| Activation Functions | MLP non-linearity | F.relu(x).square() (squared ReLU, ReLU²) |
| Residual Connections | resid_lambdas, x0_lambdas | 1.0, 0.1 |
| Value Embeddings | ve_gate, gating mechanisms | ResFormer-style (see §42.6) |
| Logit Processing | Softcap value in forward pass | 15 |
| Optimizer Internals | Muon momentum, ns_steps, beta2 | See MuonAdamW class |

| Immutable Constants (in prepare.py) | Value | Purpose |
|---|---|---|
| MAX_SEQ_LEN | 2048 | Context length |
| TIME_BUDGET | 300 seconds | Training time per experiment |
| Eval batches × batch tokens | 40 × 524,288 | Validation tokens (~20M) |
| VOCAB_SIZE | 8192 | BPE vocabulary size |
| evaluate_bpb() | Function | Sacred evaluation function |
| make_dataloader() | Function | BOS-aligned best-fit packing dataloader |

42.3.5 Context Window Management

A subtle but critical design decision is how the agent manages its context window over 100+ experiments [repo program.md]. By redirecting all training output to a file (uv run train.py > run.log 2>&1) and extracting only the metric via grep, the agent avoids flooding its context with thousands of lines of training progress. The following token estimates are approximate [survey estimate]:

| Context Source | Estimated Tokens per Experiment |
|---|---|
| Agent reasoning + hypothesis | ~500 |
| Code edit to train.py | ~2,000 |
| Metric extraction via grep | ~50 |
| Result logging | ~200 |
| Error handling (if crash) | ~500 |
| Total per cycle | ~3,000–5,000 |

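At these estimates, an agent with a ~200K-token context window can hold roughly 40 to 66 experiment cycles (200,000 / 5,000 to 200,000 / 3,000) before saturating. Sustaining 100+ experiments in a single session therefore requires either a smaller per-cycle footprint or some form of context compaction by the agent harness [survey estimate]. The design explicitly trades full observability for sustained autonomy: the agent sees only what it needs to make decisions.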
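
A rough check of the context budget under the table's per-cycle estimates is sketched below. It assumes context accumulates linearly and that nothing is compacted or summarized; the function name is hypothetical.

```python
def cycles_before_saturation(context_tokens: int, tokens_per_cycle: int) -> int:
    """Whole experiment cycles that fit in the window, assuming no compaction."""
    return context_tokens // tokens_per_cycle

CONTEXT_TOKENS = 200_000
for per_cycle in (3_000, 5_000):
    print(per_cycle, cycles_before_saturation(CONTEXT_TOKENS, per_cycle))
# 3000 66
# 5000 40
```

Under these assumptions, a 100-cycle run would need between 300,000 and 500,000 tokens of accumulated context.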

42.4 Key Results

42.4.1 First Overnight Run

Karpathy's initial overnight run was reported via a tweet thread in March 2026. The following table records the reported outcomes with explicit provenance for each claim.

Reported Results — Single-Run, Non-Peer-Reviewed

The results below are from a single overnight run reported by the author via social media. They have not been independently reproduced under controlled conditions and should be interpreted as a demonstration, not a benchmark.

| Metric | Value | Provenance |
|---|---|---|
| val_bpb improvement | ~11% reduction vs. baseline | [author tweet]: single run, no confidence interval |
| Successful improvements found | ~20 experiments kept | [author tweet]: approximate count |
| Total experiments conducted | ~100 | [survey estimate]: inferred from ~12/hour × ~8 hours |
| Runtime | ~8 hours (overnight) | [author tweet] |
| GPU | Single NVIDIA H100 | [author tweet] |
| Agent | Claude Code | [repo README] |
| Keep rate | ~20% | [survey estimate]: ~20 kept / ~100 total |

Reproducibility note. No standardized replication table or downloadable run logs have been published for this result. The 11% figure represents a single-run outcome on specific hardware (H100) with a specific agent (Claude Code) at a specific point in time. GPU-level floating-point nondeterminism, LLM stochasticity, and the path-dependent nature of greedy hill-climbing mean that independent runs will produce different trajectories. Community fork maintainers have reported qualitatively similar improvement magnitudes on other hardware, but without standardized comparison protocols [community fork reports].

42.4.2 Experiment Throughput

The fixed 5-minute training budget plus agent overhead yields predictable throughput [repo program.md; survey estimate]. Each experiment cycle takes approximately 6 minutes: ~30 seconds for the agent to think and edit code, ~300 seconds for training, and ~30 seconds for result interpretation. This yields approximately 10 experiments per hour, 80 per overnight (8-hour) run, and 240 per full-day (24-hour) run. Counting only the 300-second training budget gives an upper bound of 12 per hour, which is the basis of the ~100-experiment figure in §42.4.1. These are rough estimates; actual throughput depends on the agent's reasoning speed, startup and evaluation time outside the training budget, crash frequency, and fix-retry overhead.
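
The throughput arithmetic is simple enough to write down directly. In this sketch the function name and its parameters are hypothetical; the 30/300/30-second split is the survey estimate from the paragraph above.

```python
def experiments_per_hour(think_s: float = 30, train_s: float = 300, interpret_s: float = 30) -> float:
    """Cycles per hour for a given per-cycle time split, in seconds."""
    return 3600 / (think_s + train_s + interpret_s)

rate = experiments_per_hour()
print(rate, rate * 8, rate * 24)        # 10.0 80.0 240.0
print(experiments_per_hour(0, 300, 0))  # 12.0 (training time only)
```

The second call shows the zero-overhead ceiling: no schedule with a 300-second training budget can exceed 12 experiments per hour on one GPU.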

42.4.3 Typical Discovery Patterns

Based on the author's reports and community fork observations [author tweet; community fork reports], autonomous runs exhibit a characteristic diminishing-returns trajectory. The following patterns have been described but should be understood as anecdotal observations rather than statistically validated findings:

| Discovery Category | Typical Finding | Reported Impact | Stage | Source |
|---|---|---|---|---|
| Batch size reduction | $2^{19} \rightarrow 2^{18}$ (more optimizer steps) | 0.01–0.02 BPB | Early | [author tweet; community] |
| Model width scaling | ASPECT_RATIO 64 → 96 | 0.01+ BPB | Early | [community fork reports] |
| Adam betas tuning | (0.8, 0.95) → (0.9, 0.95) | ~0.005 BPB | Mid | [community fork reports] |
| Weight decay reduction | 0.2 → 0.08 | ~0.003 BPB | Mid | [community fork reports] |
| LR schedule tuning | Warmdown ratio, final LR fraction | Small–medium | Mid | [community fork reports] |
| Window pattern changes | "SSSL" → "SL" | ~0.001 BPB | Late | [community fork reports] |
| Muon optimizer params | Momentum, beta2, ns_steps | ~0.001 BPB | Late | [community fork reports] |

The early experiments typically produce the largest gains through coarse hyperparameter changes (batch size, model width). Middle-phase gains come mostly from optimizer and learning-rate schedule tuning. Late-phase experiments (attention window pattern, Muon internals, combinations of near-misses) produce diminishing returns. The cumulative trajectory is consistent with greedy local search against a rugged but structured fitness landscape [survey interpretation].

42.5 Implementation Details

42.5.1 Cost Analysis

Autoresearch's cost structure is dominated by GPU compute for cloud users and by LLM API costs for users with local hardware. The following estimates are derived from the repository's throughput characteristics and published cloud pricing as of early 2026 [survey estimate].

Cost Estimates — Approximate, As of Early 2026

These figures are rough estimates based on published cloud GPU pricing and typical API token rates. Actual costs depend on provider, region, spot vs. on-demand pricing, and the specific agent model used. They have not been verified against invoices or billing statements.

| Run Duration | GPU Cost (est.; cloud H100 @ ~$2/hr) | LLM API Cost (est.) | Total (est.) |
|---|---|---|---|
| 8h overnight (H100 cloud) | ~$16 | ~$2 | ~$18 |
| 8h overnight (local RTX 4090) | $0 | ~$2 | ~$2 |
| 24h extended (H100 cloud) | ~$48 | ~$6 | ~$54 |

The LLM API cost estimate assumes approximately 4,000 tokens per experiment cycle (combining input and output), yielding roughly 400,000 tokens for a 100-experiment overnight run [survey estimate]. At typical early-2026 API pricing for frontier coding models, this amounts to $1.50–$3.00.
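
The table's totals can be reproduced with a few lines of arithmetic. In the sketch below, the function name is hypothetical and the blended rate of $5 per million tokens is an assumption chosen to fall inside the $1.50–$3.00 range quoted above; it is not a published price.

```python
def run_cost(hours, gpu_usd_per_hour, experiments, tokens_per_experiment, usd_per_million_tokens):
    """Return (gpu_cost, llm_cost, total) in USD for one run."""
    gpu = hours * gpu_usd_per_hour
    llm = experiments * tokens_per_experiment * usd_per_million_tokens / 1_000_000
    return gpu, llm, gpu + llm

# 8-hour cloud run: ~$2/hr H100, 100 experiments, ~4,000 tokens each, assumed $5/M tokens.
print(run_cost(8, 2.0, 100, 4_000, 5.0))  # (16.0, 2.0, 18.0)
# Same run on local hardware, counting GPU cost as zero as the table does.
print(run_cost(8, 0.0, 100, 4_000, 5.0))  # (0.0, 2.0, 2.0)
```

On cloud hardware the GPU term dominates, so the total is far more sensitive to the hourly GPU rate than to the API price.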

The author's README includes a cost comparison suggesting an approximately 45× reduction relative to the fully-loaded cost of a human researcher conducting the same number of experiments manually [repo README]. This comparison is illustrative: it does not account for experiment quality, insight generation, the setup cost of writing program.md, or the human review time needed to validate and interpret results. The comparison also assumes the human researcher conducts each experiment sequentially without parallelization, an assumption that favors the automated approach.

42.5.2 Reproducibility

Autoresearch achieves high reproducibility through several design choices [repo code; repo README]. The entire codebase is three files under MIT license. The training data (climbmix-400b) is publicly available and auto-downloaded by prepare.py. Dependencies are minimal (PyTorch plus 8 packages) and pinned via uv.lock. The random seed is fixed (torch.manual_seed(42)), providing determinism up to GPU-specific floating-point nondeterminism. Every experiment is captured as a git commit plus a row in results.tsv.

# Complete reproduction steps from README [repo README]
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# 2. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 3. Establish baseline
uv run train.py

# 4. Start autonomous research — point any coding agent at the repo:
# "Read program.md and let's kick off a new experiment"

42.5.3 Platform Sensitivity

A critical caveat for reproducibility: results are platform-dependent [repo README]. Because the 5-minute budget is wall-clock time, different GPUs complete different numbers of training steps. The author explicitly acknowledges this:

"This means that autoresearch will find the most optimal model for your platform in that time budget. The downside is that your runs (and results) become not comparable to other people running on other compute platforms."

| GPU | Approximate Steps in 5 min | Relative Throughput (vs H100) | Source |
|---|---|---|---|
| H200 | ~1,200+ | ~1.09× | [survey estimate] |
| H100 | ~1,100 | 1.0× (reference) | [author tweet] |
| A100 (80GB) | ~800 | ~0.73× | [survey estimate] |
| RTX 4090 | ~600 | ~0.55× | [survey estimate] |
| RTX 3090 | ~400 | ~0.36× | [survey estimate] |

This platform dependence is by design: autoresearch finds the optimal configuration for a specific compute environment. But it means that reported BPB values from different hardware platforms are not directly comparable, and the optimal model architecture may differ across GPUs (e.g., wider models may be optimal on an H100, which completes more steps within the budget, while deeper or more parameter-efficient models may be optimal on slower hardware, where fewer steps are available) [survey interpretation].
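
The relative-throughput column follows directly from the step counts. The sketch below recomputes it; the dictionary and function names are hypothetical, and the step counts are the approximate figures from the table (with "~1,200+" taken as 1,200).

```python
STEPS_IN_5_MIN = {
    "H200": 1200,
    "H100": 1100,
    "A100 (80GB)": 800,
    "RTX 4090": 600,
    "RTX 3090": 400,
}

def relative_throughput(steps: dict, reference: str = "H100") -> dict:
    """Steps completed in the fixed budget, as a fraction of the reference GPU."""
    ref = steps[reference]
    return {gpu: round(n / ref, 2) for gpu, n in steps.items()}

for gpu, ratio in relative_throughput(STEPS_IN_5_MIN).items():
    seconds_per_step = 300 / STEPS_IN_5_MIN[gpu]
    print(f"{gpu}: {ratio}x, {seconds_per_step:.2f} s/step")
```

Since the budget is fixed at 300 seconds, a slower GPU does not run longer; it simply sees fewer optimizer steps, which is why the best configuration can differ by platform.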

42.5.4 Community Ecosystem

The rapid forking of autoresearch to alternative platforms—within days of release—demonstrates both the simplicity of the core design and the community demand for accessible autonomous research tools [community fork repos]:

| Fork | Platform | Source |
|---|---|---|
| autoresearch-macos | macOS (Metal) | github.com/miolini/autoresearch-macos |
| autoresearch-mlx | macOS (MLX) | github.com/trevin-creator/autoresearch-mlx |
| autoresearch-win-rtx | Windows (RTX) | github.com/jsegov/autoresearch-win-rtx |
| autoresearch (AMD) | AMD GPUs | github.com/andyluo7/autoresearch |

42.6 The GPT Model Architecture

The baseline train.py implements a modern GPT architecture in approximately 450 lines of Python [repo code: train.py]. The model incorporates several recent innovations that serve as the starting point for the agent's optimization. Understanding this baseline is essential for appreciating the search space available to the agent. The following description is derived from reading the actual repository code.

[Architecture diagram: baseline GPT in train.py]

Input Tokens (B, T) → Token Embed + RMSNorm → x₀

Repeated ×DEPTH:
  • Residual mixing: x = resid_lambdas[i] · x + x0_lambdas[i] · x₀
  • CausalSelfAttention: Q, K, V projections + RoPE + QKNorm; Value Embeddings: v = v + gate · VE (ResFormer); FlashAttention3 (causal, per-layer sliding window); Linear projection out
  • x = x + attn_out (residual)
  • MLP: Linear → ReLU² (squared ReLU) → Linear
  • x = x + mlp_out (residual)

Output Head: RMSNorm → Linear → Softcap(15) → Cross-Entropy

Value embeddings (ResFormer technique) [repo code: train.py, CausalSelfAttention class]. Alternating layers include learnable per-token value embeddings that are mixed into the value stream via a gated mechanism:

# From train.py — CausalSelfAttention (value embedding logic)
ve = self.value_embeds[str(i)](idx)               # (B, T, kv_dim)
gate = 2 * sigmoid(self.ve_gate(x[..., :32]))     # (B, T, n_kv_head)
v = v + gate * ve

The gate is initialized at zero, so it evaluates to $2 \cdot \text{sigmoid}(0) = 2 \cdot 0.5 = 1.0$ at the start of training: a neutral initialization that neither amplifies nor suppresses the value embeddings and does not distort the value stream [repo code: train.py].
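The arithmetic behind the neutral initialization can be checked with a scalar sketch. It assumes that "initialized at zero" means the gate's pre-activation is zero at the start of training; the helper names `ve_gate` and `mix_value` are hypothetical and operate on plain Python lists rather than the tensors used in train.py.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def ve_gate(pre_activation: float) -> float:
    """Gate multiplier 2 * sigmoid(z), which lies in the open interval (0, 2)."""
    return 2.0 * sigmoid(pre_activation)


def mix_value(v, ve, pre_activation):
    """Apply v + gate * ve element-wise (scalar stand-in for the tensor update)."""
    g = ve_gate(pre_activation)
    return [vi + g * vei for vi, vei in zip(v, ve)]


if __name__ == "__main__":
    print(ve_gate(0.0))                                # gate at initialization
    print(mix_value([1.0, 2.0], [0.5, -0.5], 0.0))     # unit-weight mixing
```

The factor of 2 centres the gate's range on 1.0, so training can move the multiplier in either direction from the starting point.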

Residual mixing [repo code: train.py, GPT class]. Each block receives both the previous hidden state and the initial embedding via learnable per-layer coefficients:

# From train.py — GPT forward (residual mixing)
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0

These are initialized with resid_lambdas = 1.0 and x0_lambdas = 0.1, allowing the model to learn how much to rely on the original embedding versus the transformed representation at each layer. This technique helps with gradient flow in deeper transformers [repo code: train.py; literature context].
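As a small worked example of the initialization above, the sketch below applies the mixing rule to plain Python lists. The function name `mix_residual` is hypothetical; only the formula and the initial values (1.0 and 0.1) come from the text.

```python
def mix_residual(x, x0, resid_lambda=1.0, x0_lambda=0.1):
    """Compute resid_lambda * x + x0_lambda * x0 element-wise."""
    return [resid_lambda * xi + x0_lambda * x0i for xi, x0i in zip(x, x0)]


if __name__ == "__main__":
    x0 = [1.0, -2.0]   # illustrative embedding values, not taken from the model
    # At the first block the hidden state is still the embedding itself,
    # so the mix reduces to (1.0 + 0.1) * x0.
    print(mix_residual(x0, x0))
    # Deeper in the network x has drifted from x0, and the x0 term re-injects
    # a small, learnable share of the original embedding.
    print(mix_residual([0.3, 0.7], x0))
```

With these initial values each block starts as an ordinary residual stream plus a 10% shortcut to the input embedding, and both coefficients are free to change during training.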

Additional architectural features observed in the baseline train.py [repo code]:

  • Squared ReLU (ReLU²) activation: F.relu(x).square() in the MLP class, an ungated, non-standard activation that has shown empirical benefits in small model training.
  • Logit soft-capping: softcap * tanh(logits / softcap) with softcap = 15, preventing extreme logit values.
  • QK normalization: RMSNorm applied to query and key projections before attention.
  • Per-layer sliding window: The WINDOW_PATTERN string (default "SSSL") defines which layers use sliding-window vs. full attention in FlashAttention3.
  • RoPE: Rotary position embeddings pre-computed for 10× the sequence length.
  • Manual GC control: gc.freeze() and gc.disable() after warmup to avoid Python garbage collection stalls during training.
  • torch.compile: Applied to the model and optimizer kernels with dynamic=False, fullgraph=True for maximum compilation benefit.
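Two of the features listed above are simple element-wise functions, and a scalar sketch makes their behaviour concrete. The formulas follow the expressions quoted in the list; the function names are illustrative, and the real train.py applies these operations to tensors rather than Python floats.

```python
import math


def relu_squared(x: float) -> float:
    """Scalar version of F.relu(x).square(): zero for x <= 0, x**2 otherwise."""
    r = max(x, 0.0)
    return r * r


def softcap(logit: float, cap: float = 15.0) -> float:
    """Scalar version of softcap * tanh(logits / softcap)."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for x in (-3.0, 0.5, 2.0):
        print(f"relu_squared({x}) = {relu_squared(x)}")
    for z in (0.1, 10.0, 1000.0):
        print(f"softcap({z}) = {softcap(z):.4f}")
```

Soft-capping is close to the identity for small logits and saturates smoothly at ±15, so it bounds extreme values without the hard gradient cutoff of clipping.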

42.7 Limitations and Discussion

42.7.1 Algorithmic Limitations

The greedy hill-climbing strategy imposes fundamental constraints on the quality of solutions autoresearch can find:

Local optima. The algorithm can become trapped in configurations where no single-step modification improves the metric. Multi-step improvements requiring intermediate regressions (traversing fitness valleys) are unreachable. Population-based methods like MAP-Elites (used in AlphaEvolve) maintain diversity that allows escaping such traps.
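The trapping behaviour can be illustrated on a toy problem. The sketch below is a steepest-descent variant of greedy hill-climbing over a made-up one-dimensional loss landscape; it is a simplification for illustration, not the agent loop from the repository, which proposes code edits one at a time. All names and values are hypothetical.

```python
def greedy_hill_climb(loss, start, neighbors, max_steps=100):
    """Move to the best neighbor only while it strictly lowers the loss."""
    current = start
    for _ in range(max_steps):
        best = min(neighbors(current), key=loss)
        if loss(best) >= loss(current):
            break  # no single-step improvement exists: a local optimum
        current = best
    return current


# Hypothetical landscape: index 1 is a local minimum, index 5 the global one.
# Getting from index 1 to index 5 requires passing through worse states.
LANDSCAPE = [5.0, 3.0, 4.0, 6.0, 2.0, 1.0, 2.5]


def toy_loss(i):
    return LANDSCAPE[i]


def toy_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


if __name__ == "__main__":
    print(greedy_hill_climb(toy_loss, 0, toy_neighbors))  # stops at the local minimum
    print(greedy_hill_climb(toy_loss, 3, toy_neighbors))  # reaches the global minimum
```

Starting from index 0 the search halts at index 1, because both neighbors are worse, even though a much better state exists three steps away. A method that keeps several candidates alive can continue from a temporarily worse state and cross the valley.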

No exploration-exploitation balance. There is no formal mechanism for allocating the agent's effort between refining what works (exploitation) and trying radically different approaches (exploration). The "think harder" instruction in program.md is a heuristic substitute, not a principled solution.

Sequential search. By default, autoresearch runs one experiment at a time. With a 5-minute training budget and ~1-minute agent overhead, the system can evaluate only ~10 configurations per hour. Population-based or island-model approaches can evaluate dozens of candidates in parallel on multi-GPU systems.
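The throughput figure is simple arithmetic, sketched below under the stated assumptions of a 5-minute training budget and roughly 1 minute of agent overhead per experiment. The function names are illustrative, and the overhead value is this chapter's estimate rather than a measured constant.

```python
def experiments_per_hour(train_minutes=5.0, overhead_minutes=1.0):
    """Sequential experiments that fit in one hour."""
    return 60.0 / (train_minutes + overhead_minutes)


def experiments_in_run(hours, train_minutes=5.0, overhead_minutes=1.0):
    """Whole experiments completed in a run of the given length."""
    return int(hours * experiments_per_hour(train_minutes, overhead_minutes))


if __name__ == "__main__":
    print(experiments_per_hour())   # experiments per hour on one GPU
    print(experiments_in_run(8))    # experiments in an 8-hour overnight run
```

Under these assumptions an 8-hour run yields about 80 sequential experiments, and only a shorter per-experiment overhead or additional GPUs would raise that ceiling.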

Single noisy evaluation. Each experiment is evaluated exactly once. Because GPU floating-point nondeterminism, compilation caching, and system load introduce noise into the training outcome, small apparent improvements may be within the noise floor. The system has no mechanism for replicate runs or statistical significance testing [survey interpretation].
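One way such a check could look, if replicate runs were available, is sketched below. It is not part of autoresearch: the function name, the two-standard-error threshold, and the BPB values are all illustrative assumptions, and the rule is a rough heuristic rather than a formal significance test.

```python
import math
import statistics


def improvement_exceeds_noise(baseline_runs, candidate_runs, z=2.0):
    """Return True if the mean improvement exceeds z standard errors of the difference.

    Lower values are better (for example, validation bits-per-byte).
    """
    diff = statistics.mean(baseline_runs) - statistics.mean(candidate_runs)
    std_err = math.sqrt(
        statistics.variance(baseline_runs) / len(baseline_runs)
        + statistics.variance(candidate_runs) / len(candidate_runs)
    )
    return diff > z * std_err


if __name__ == "__main__":
    # Made-up replicate measurements, for illustration only.
    baseline = [1.000, 1.002, 0.998]
    marginal = [0.999, 1.001, 0.997]   # 0.001 better on average
    clear = [0.980, 0.982, 0.978]      # 0.020 better on average
    print(improvement_exceeds_noise(baseline, marginal))
    print(improvement_exceeds_noise(baseline, clear))
```

The cost is that every replicate consumes a full 5-minute budget, so three replicates per candidate would cut the number of distinct ideas tested per night to roughly a third.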

42.7.2 Scope Limitations

Single-file modification. The agent can only modify train.py. It cannot introduce new Python modules, add data augmentation strategies, modify the tokenizer, or change the evaluation metric. This constrains the search space to what can be expressed as modifications to a single training script.

No cross-run memory. Each run starts from the current program.md and train.py with no persistent memory of what was tried in prior runs. The human must manually transfer knowledge between runs by updating program.md or starting from a previously improved branch.

Platform-specific results. Because the 5-minute budget is wall-clock time, results from different GPU types are not comparable. There is no normalization by FLOPs, parameter-hours, or training steps, which limits the system's utility for cross-hardware benchmarking.

42.7.3 Comparison with Infrastructure-Heavy Systems

| Dimension | Autoresearch | AlphaEvolve | OpenEvolve |
|---|---|---|---|
| Search strategy | Greedy hill-climbing | MAP-Elites + LLM ensemble | Island-based evolutionary |
| Population | 1 (current best) | Full MAP-Elites archive | Multi-island populations |
| Infrastructure | 3 files, git, shell | Multi-GPU cluster, databases, orchestrators | Database, vector store, queue |
| Setup time | ~10 minutes [survey est.] | Substantial engineering effort | Configuration + infrastructure |
| Agent architecture | Single general-purpose agent | Multi-model ensemble | Single LLM with bandit selection |
| Cross-run learning | None (manual via branch/program.md) | Structured program database | Knowledge base, prompt evolution |
| Domain | Neural network training | General program optimization | General program optimization |
| Parallelism | Sequential (1 experiment at a time) | Massively parallel | Parallel islands |
| Evaluation replication | Single noisy eval | Multiple evaluations | Configurable eval count |

This comparison highlights a fundamental design trade-off. Autoresearch sacrifices search sophistication and parallelism for radical accessibility. The claim is not that greedy hill-climbing is optimal—it demonstrably is not—but that the simplicity of the system unlocks a mode of research that was previously inaccessible. A researcher with a single GPU and 10 minutes of setup time can conduct overnight autonomous experiments, a capability previously requiring dedicated infrastructure teams.

42.7.4 The Meta-Optimization Opportunity

Karpathy explicitly highlights the meta-learning potential of the system [repo README]:

"The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress."

This positions program.md as itself an optimizable artifact. The human-in-the-loop learning cycle operates at a higher level: (1) agent runs autonomously for 8 hours; (2) human reviews results; (3) human updates program.md with insights; (4) agent runs again with better instructions. A natural extension would apply autoresearch to itself—using an LLM agent to optimize program.md based on research outcomes—creating a form of meta-optimization that echoes the self-referential character of Gödel machines [survey interpretation].

42.7.5 Broader Significance: Research Taste as a Specification

Perhaps the most conceptually novel aspect of autoresearch is the encoding of research taste in program.md [repo program.md]. The simplicity criterion ("a small improvement that adds ugly complexity is not worth it"), the crash-handling heuristic ("if it's something dumb and easy to fix, fix it; if the idea itself is fundamentally broken, just skip it"), and the exploration directive ("think harder") are all judgments about research quality that are typically tacit knowledge held by experienced researchers. By making these judgments explicit and executable, autoresearch converts research taste into a formal specification that can be shared, debated, and iterated upon.

This has implications beyond neural network training. Any domain where solutions can be expressed as code, evaluation is automated, and experiments are fast enough for iterative improvement is amenable to the autoresearch paradigm. The methodology is a contribution independent of the specific domain application.

42.8 Summary

Chapter Summary

Key takeaway: Autoresearch demonstrates that radical simplicity—three files, one GPU, one metric, and a natural-language research specification—is sufficient for a general-purpose coding LLM agent to conduct overnight autonomous research that produces real, cumulative improvements to neural network training.

Main contribution: The system's primary contribution is paradigmatic rather than algorithmic. It establishes the "research-as-code" model where the human writes a Markdown specification of the research protocol and the LLM agent executes it autonomously. The greedy hill-climbing search strategy is trivial, but the encoding of research taste (simplicity preference, crash-handling heuristics, exploration directives) into an executable natural-language specification is a genuinely novel idea. The 63,000+ GitHub stars [repo] and the immediate ecosystem of platform-specific forks suggest that this paradigm addresses latent demand for accessible autonomous research tools.

Evidence status: The core system design is fully verifiable from the public repository (3 files, MIT license). Performance claims (11% BPB improvement, overnight throughput) rest on a single author-reported run via social media and anecdotal community fork results—no peer-reviewed evaluation or standardized replication exists. Cost and throughput estimates in this chapter are survey-author calculations based on published pricing and observed experiment timing.

For researchers: Autoresearch is most valuable as a lower bound on infrastructure requirements for autonomous research and as a template for the "research specification" pattern. Its limitations—local optima trapping, no cross-run memory, sequential execution, platform-specific results, single noisy evaluations—are well-understood and clearly documented. Systems like AlphaEvolve and OpenEvolve address these limitations at the cost of substantial infrastructure complexity. The open question is whether the program.md meta-optimization loop—iterating on the research specification itself—can close the gap without adding that complexity.