
SkyPilot Scaling Autoresearch

Scaling Karpathy's Autoresearch to Parallel GPU Clusters with Emergent Research Strategies
Organization: SkyPilot / UC Berkeley Sky Computing Lab
Published: March 2026
Type: Technical Blog Post + Open-Source Extension Report
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster

Blog Post URL: blog.skypilot.co/scaling-autoresearch/

Code Repository: skypilot/examples/autoresearch

Parent System: Built on top of Karpathy's autoresearch (see companion document: Karpathy Autoresearch)

SkyPilot Project: github.com/skypilot-org/skypilot

SkyPilot Skill Docs: docs.skypilot.co/en/latest/getting-started/skill.html

Publication Date: March 2026

Paradigm: This work extends autoresearch from sequential single-GPU hill-climbing to parallel multi-GPU factorial experimentation. The key innovation is not just "more GPUs" but a qualitative shift in search strategy: parallel resources enable the agent to run factorial grids, discover interaction effects, and develop emergent strategies for exploiting heterogeneous hardware — behaviors that are impossible in sequential mode.

Lineage:

nanochat (Karpathy)
    │
    ▼
autoresearch (Karpathy, March 2026)
    │  Single GPU, sequential, greedy hill-climbing
    │
    ▼
SkyPilot Autoresearch (SkyPilot team, March 2026)
    │  16 GPUs, parallel, factorial grids
    │  Emergent hardware-aware strategy
    │
    ▼
[Future: distributed autoresearch networks?]

2 Authors and Team

SkyPilot Team

SkyPilot is developed by the Sky Computing Lab at UC Berkeley, with contributions from the broader open-source community. The project was initiated by Professor Ion Stoica's lab and has grown into a widely-used infrastructure tool for ML workloads.

Key contributors to the SkyPilot project include:

Role Affiliation Contribution
Core developers UC Berkeley Sky Lab SkyPilot framework, Kubernetes integration
Infrastructure team Various Cloud provider backends, resource scheduling
ML team Various Agent skill system, autoresearch integration

Context: SkyPilot as Infrastructure

SkyPilot is not an ML research project — it is a cloud infrastructure abstraction layer that enables ML researchers to launch GPU workloads across any cloud or Kubernetes cluster from a single YAML file. The autoresearch extension is a showcase of SkyPilot's agent integration capabilities, demonstrating that infrastructure tools can transform the behavior of autonomous research agents.

The SkyPilot team's contribution is primarily systems engineering rather than ML research: they identified that the sequential bottleneck in autoresearch was an infrastructure limitation, not an algorithmic one, and showed that removing it unlocks qualitatively different agent behaviors.

Relationship to Karpathy

The SkyPilot autoresearch extension builds directly on Karpathy's original autoresearch, preserving all of its design constraints (fixed 5-minute budget, single-file modification, val_bpb metric) while replacing the execution substrate (local GPU → distributed GPU cluster). The original program.md is fetched and followed by the agent; the SkyPilot instructions.md adds the parallelism layer on top.

3 Core Contribution

Key Novelty: Scaling autoresearch from 1 GPU to 16 GPUs didn't just make it 16x faster — it caused a qualitative shift in the agent's research strategy, from greedy hill-climbing to factorial experimental design, and triggered an emergent capability: the agent independently discovered how to exploit heterogeneous GPU hardware for a two-tier screening/validation workflow.

Three Core Contributions

1. Parallelism changes search strategy, not just speed.

The most important finding is that parallel resources don't merely accelerate the sequential search — they enable fundamentally different search strategies:

Sequential (1 GPU) Parallel (16 GPUs)
Greedy hill-climbing Factorial grid search
1 experiment per decision 10-13 experiments per decision
Tests one hypothesis at a time Tests interaction effects between parameters
Can miss non-monotonic improvements Captures full response surface in one wave
~10 experiments/hour ~90 experiments/hour

The aspect ratio discovery (Phase 2 in the results) perfectly illustrates this: sequentially, the agent might test AR=64, see no improvement, and abandon the direction. In parallel, it tested AR=48, 64, 72, 80, 90, 96, and 112 simultaneously, immediately saw the trend toward wider models (and the reversal at AR=112), and zeroed in on AR=96. One 5-minute wave replaced seven sequential experiments.

2. Emergent hardware-aware strategy.

Without any instruction about GPU performance differences, the agent independently:

- Discovered that H200 clusters produced better val_bpb than H100 clusters for identical configurations
- Diagnosed the cause (H200 completes ~9% more training steps in 5 minutes)
- Developed a two-tier workflow: screen hypotheses cheaply on H100s, validate winners on H200s
- Recognized that parameter rankings don't always transfer across hardware types

This is an emergent behavior — it was not programmed, not prompted, and not anticipated. The agent developed a sophisticated experimental methodology from first principles by observing its own results.

3. Infrastructure abstraction enables agent autonomy.

The SkyPilot "skill" system teaches LLM agents to manage GPU infrastructure without human intervention. The agent learns to:

- Provision clusters (sky launch)
- Submit experiments (sky exec)
- Pipeline jobs on the same cluster (zero-idle-time queuing)
- Check results (sky logs)
- Tear down resources (sky down)
- Manage workdir isolation for parallel code variants

This transforms the agent from a "code editor that waits for a GPU" into a "research lab manager that orchestrates a compute cluster."

Relationship to Prior Work

System Parallelism Agent Control Search Strategy Hardware Awareness
Grid Search Embarrassingly parallel None (human defines grid) Exhaustive None
Optuna/Ray Tune Multi-trial parallel Bayesian/RL controller Principled optimization Manual
AlphaEvolve 100+ evaluators Evolutionary controller MAP-Elites + islands Managed internally
Autoresearch (1 GPU) Sequential LLM agent Greedy hill-climbing Single GPU
SkyPilot Autoresearch 16 GPU cluster LLM agent Factorial grids Emergent

The SkyPilot extension occupies a unique position: it gives the same agent used in sequential autoresearch access to parallel resources, and observes how the agent's strategy naturally evolves. Unlike AlphaEvolve (where parallelism is engineered into the system design), here parallelism is an infrastructure capability that the agent learns to exploit.

4 Supported Solutions

Primary Domain: Same as Autoresearch

The SkyPilot extension targets the same domain as Karpathy's autoresearch — GPT model training optimization within a 5-minute compute budget. Everything the agent can modify remains the same:

Category Mutable Parameters Parallel Advantage
Architecture ASPECT_RATIO, DEPTH, WINDOW_PATTERN Test 6+ aspect ratios in one wave
Optimizer MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, Muon params Full factorial over optimizer settings
LR Schedule WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC Sweep entire schedule space simultaneously
Batch Size TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE Test multiple batch sizes + other params
Activations MLP activation function A/B/C test activation variants
Attention Head configuration, sliding window Compare window patterns

Extended Solution Space: Multi-Factor Experiments

The key difference from sequential autoresearch is the ability to test combinations:

Sequential (one-at-a-time):
  Experiment 1: batch_size = 2^18           → val_bpb = 0.985
  Experiment 2: weight_decay = 0.08         → val_bpb = 0.990
  Experiment 3: batch_size=2^18 + wd=0.08   → val_bpb = 0.982  ← interaction!
  Total time: 15 minutes

Parallel (factorial):
  Wave 1 (5 minutes, 4 GPUs):
    batch_size=2^18, wd=0.2   → 0.985
    batch_size=2^18, wd=0.08  → 0.982  ← best (interaction found immediately)
    batch_size=2^19, wd=0.2   → 0.997
    batch_size=2^19, wd=0.08  → 0.993
  Total time: 5 minutes

Factorial grids reveal interaction effects that sequential one-at-a-time testing misses. In the example above, the combination of lower batch size + lower weight decay produces a synergistic improvement that neither change alone would predict.
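The factorial expansion itself is mechanical enough to sketch. The `factorial_wave` helper below is hypothetical (not part of the autoresearch or SkyPilot code); it just enumerates the grid the agent would submit as one wave:

```python
from itertools import product

def factorial_wave(param_grid):
    """Expand {param: [values]} into the list of configs for one wave."""
    names = list(param_grid)
    return [dict(zip(names, combo)) for combo in product(*param_grid.values())]

# The 2 x 2 grid from the example above: 4 experiments in one 5-minute wave,
# covering both main effects and their interaction.
wave = factorial_wave({
    "batch_size": [2**18, 2**19],
    "weight_decay": [0.2, 0.08],
})
assert len(wave) == 4
assert {"batch_size": 2**18, "weight_decay": 0.08} in wave
```

With k parameters at n values each, one wave covers n**k cells, which is why 16 concurrent GPUs change what is affordable per decision.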

Hardware-Heterogeneous Experimentation

A novel solution type enabled by the parallel setup is hardware-aware model optimization:

                    ┌──────────────┐
                    │  Hypothesis  │
                    │  Generator   │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │                         │
      ┌───────▼───────┐       ┌────────▼────────┐
      │  SCREENING     │       │  VALIDATION      │
      │  (13 × H100)   │       │  (3 × H200)      │
      │                │       │                  │
      │  10+ hypotheses│       │  Top 2-3         │
      │  tested cheap  │       │  promoted for    │
      │  in parallel   │       │  confirmation    │
      └───────┬───────┘       └────────┬────────┘
              │                         │
              └────────────┬────────────┘
                           │
                    ┌──────▼───────┐
                    │   DECISION    │
                    │  keep/discard │
                    └──────────────┘

The agent developed this two-tier screening/validation workflow autonomously, recognizing that:

- H200s are ~9% faster (more training steps in 5 minutes)
- H200s produce systematically lower val_bpb for the same configuration
- Parameter rankings sometimes differ across hardware types
- The limited number of H200s (3 out of 16) should be reserved for validation
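A minimal sketch of the promotion step, assuming lower val_bpb is better (the function name and data are illustrative; the agent expressed this strategy in natural language, not code):

```python
def promote_winners(screening_results, n_validate=3):
    """Rank H100 screening runs by val_bpb (lower is better) and return
    the experiment ids to re-run on the scarce H200 validation tier."""
    ranked = sorted(screening_results, key=screening_results.get)
    return ranked[:n_validate]

# One screening wave of H100 results, promoting the top 2 to H200:
h100_wave = {"exp-01": 0.985, "exp-02": 0.982, "exp-03": 0.990, "exp-04": 0.979}
assert promote_winners(h100_wave, n_validate=2) == ["exp-04", "exp-02"]
```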

5 LLM Integration

Agent: Claude Code (Anthropic)

The SkyPilot experiment used Claude Code as the autonomous agent. Key characteristics:

Aspect Detail
Agent Claude Code (Anthropic)
Run Duration ~8 hours
Experiments Submitted ~910 total
Experiments with Valid Results ~700
Concurrent Clusters 16
GPU Types 13 × H100, 3 × H200

Dual-Layer Instruction System

The agent operates under two layers of instructions:

Layer 1: program.md (Karpathy's original)

- What can/cannot be modified
- The simplicity criterion
- The keep/discard logic
- The "never stop" directive

Layer 2: instructions.md (SkyPilot extension)

- How to use SkyPilot for parallel execution
- Cluster naming convention (gpu-01, gpu-02, ...)
- Workdir isolation for parallel code variants
- Results tracking in results.tsv with experiment IDs
- Maximum 4 clusters limit (relaxed to 16 in practice)

The agent fetches both documents at startup:

1. Read program.md → understand research rules
2. Load SkyPilot skill → learn infrastructure commands
3. Read autoresearch codebase → understand train.py
4. Ask about infra preference → configure YAML

SkyPilot Skill System

The SkyPilot "skill" is a natural-language instruction document that teaches LLM agents to use SkyPilot. This is a form of tool-use prompting — the agent reads documentation about a tool and then uses it:

Agent reads skill document:
  "To launch a GPU cluster, use: sky launch gpu-01 experiment.yaml -d -y"
  "To check results, use: sky logs gpu-01"
  "To pipeline experiments, use: sky exec gpu-01 experiment.yaml -d"

Agent then uses these commands autonomously for 8 hours.

The skill system is notable because it requires zero code changes to the agent — the same Claude Code instance that runs sequential autoresearch can run parallel autoresearch, simply by reading different instructions.

Agent Decision-Making at Scale

With 16 GPUs, the agent's decision-making complexity increases substantially:

Sequential agent decision tree:

observe result → keep or discard → pick next experiment

Parallel agent decision tree:

observe 16 concurrent results →
  identify trends across results →
  decide which changes to keep/combine →
  design next wave of 10-16 experiments →
  allocate experiments to GPU types →
  manage cluster pipeline queues →
  handle crashes on individual clusters →
  promote winners to validation tier

The agent manages this complexity without any explicit planning framework — it reasons about experimental design in natural language, within its context window.
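The keep/discard step can be caricatured in code; the real agent reasons about it in natural language, but the per-wave decision it implements looks roughly like this illustrative sketch:

```python
def analyze_wave(results, current_best):
    """results: {exp_id: val_bpb} for one wave. Keep every experiment that
    beats the running best; the best of those becomes the new baseline."""
    winners = {e: v for e, v in results.items() if v < current_best}
    if winners:
        current_best = min(winners.values())
    return current_best, sorted(winners)

best, keep = analyze_wave({"exp-a": 0.980, "exp-b": 0.992, "exp-c": 0.978},
                          current_best=0.981)
assert best == 0.978 and keep == ["exp-a", "exp-c"]
```

The real decision is richer (combining changes, allocating to GPU types, handling crashes), but the core comparison against a running baseline is the same.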

Agent Reasoning Examples

The blog post captures several instances of the agent's reasoning in real time:

Discovering hardware heterogeneity:

"Only 3 H200 clusters: gpu-03, gpu-04, gpu-08! The rest are H100. This explains everything — H200 is significantly faster than H100. In the same 5-minute budget, H200 can do MORE training steps. More steps = better val_bpb."

Quantifying the difference:

"H200 runs 9% more steps in the same time! That directly leads to better val_bpb. All my 'best' results should be normalized by hardware."

Developing a strategy:

"Since H200 gets ~9% more steps than H100 in the same 5-minute budget, and I have only 3 H200 clusters, I should focus experiments on H200 clusters. The real optimization contest is on H200."

These quotes demonstrate that the agent performs genuine scientific reasoning — observing data, forming hypotheses, and modifying its experimental strategy accordingly.

6 Key Results

Headline Results

Metric Value
Starting val_bpb 1.003 (baseline)
Final val_bpb 0.974
Total improvement 0.029 bpb (~2.9% reduction)
Total experiments ~910 submitted
Valid experiments ~700
Runtime ~8 hours
GPUs used 16 (13 × H100 + 3 × H200)
Throughput ~90 experiments/hour
Sequential baseline equivalent ~72 hours
Speedup 9× wall-clock

Five Phases of Discovery

The agent's research naturally organized into five phases. These were not pre-planned — each phase emerged from the results of the previous one:

val_bpb
1.003 ┤●  Phase 1: Hyperparameter Sweeps
      │ ╲
0.990 ┤  ╲  (~200 experiments)
      │   ╲  - batch_size 2^18 > 2^19
      │    ╲ - adam_betas (0.9, 0.95)
0.981 ┤     ●  - weight_decay 0.08
      │      ╲
      │       ╲ Phase 2: Architecture Discovery
0.977 ┤        ● (~200 experiments)
      │         ╲ - ASPECT_RATIO 96 > 64
      │          ╲ (biggest single jump)
0.975 ┤           ● Phase 3: Fine-Tuning
      │            ╲ (~140 experiments)
      │             ╲
0.974 ┤              ● Phase 4: Optimizer Tuning
      │               ╲ (~140 experiments)
      │                ╲ - muon_beta2 = 0.98
      │                 ╲──── Phase 5: Diminishing Returns
      │                      (~200 experiments)
      └───────────────────────────────────────► experiments
        0    200   400   600   800   910

Phase-by-Phase Analysis

Phase 1: Hyperparameter Sweeps (experiments 0-200)

Strategy: Broad factorial grids over obvious hyperparameters

Experiments per wave: 10-13 simultaneous

Key discoveries:

Parameter Default Optimal Impact (Δ val_bpb)
TOTAL_BATCH_SIZE 2^19 2^18 -0.010
ADAM_BETAS (0.8, 0.95) (0.9, 0.95) -0.005
WEIGHT_DECAY 0.2 0.08 -0.003
Softcap 15 10 -0.001
Deeper models (depth 10+) All crashed or hurt N/A

Phase outcome: val_bpb 1.003 → 0.981 (Δ = 0.022)

Analysis: The batch size reduction is the most impactful single hyperparameter finding. With a fixed 5-minute wall-clock budget, reducing batch size from 2^19 (~524K tokens) to 2^18 (~262K tokens) doubles the number of optimizer steps while processing roughly the same total tokens. The extra optimization steps more than compensate for the noisier gradient of each step. This finding — that optimizer step count dominates per-step batch size for short training budgets — is well known from batch-size scaling studies (the budget sits below the critical batch size), but the agent discovered it empirically.
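The arithmetic behind this is simple: with roughly constant token throughput, the 5-minute budget fixes the total tokens processed, so step count scales inversely with batch size. The throughput constant below is an illustrative round number, not a measured figure:

```python
TOKENS_PER_SEC = 2**20      # illustrative throughput, chosen for clean division
BUDGET_SEC = 5 * 60

def steps_in_budget(batch_tokens):
    """Optimizer steps achievable in the fixed wall-clock budget."""
    return (TOKENS_PER_SEC * BUDGET_SEC) // batch_tokens

# Halving the batch size (2^19 -> 2^18 tokens/step) doubles the step count.
assert steps_in_budget(2**18) == 2 * steps_in_budget(2**19)
```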

Phase 2: Architecture Discovery (experiments 200-420)

Strategy: Factorial grid over model aspect ratios

Critical experiment: 7 aspect ratios tested in a single 5-minute wave

Wave results (single 5-minute window):
  AR=48  (model_dim=384)  → val_bpb = 0.985
  AR=64  (model_dim=512)  → val_bpb = 0.982
  AR=72  (model_dim=576)  → val_bpb = 0.980
  AR=80  (model_dim=640)  → val_bpb = 0.979
  AR=90  (model_dim=720)  → val_bpb = 0.978
  AR=96  (model_dim=768)  → val_bpb = 0.977  ← best
  AR=112 (model_dim=896)  → val_bpb = 0.984  ← too large, not enough steps

Phase outcome: val_bpb 0.981 → 0.977 (Δ = 0.004)

Analysis: This is the phase where parallelism paid off most dramatically. The response surface over aspect ratio is non-monotonic — performance improves with width up to AR=96, then degrades because the larger model completes fewer training steps in the 5-minute budget.

Sequentially, the agent would likely have tested AR=64 (no improvement), maybe AR=72 (marginal), and possibly given up. The parallel factorial revealed the complete curve, including the optimal point and the explanation for the degradation at AR=112.

Comparing the architectural gain against the cumulative optimizer tuning is instructive:

All optimizer tuning (Phase 1):         Δ = 0.022
Model width scaling (Phase 2 alone):    Δ = 0.004 (on top of Phase 1)
Width + tuning combination:             → 0.977 vs. 1.003 baseline

The optimal AR=96 translates to model_dim = 768, producing a model with:

- ~1,060 training steps on H100 (vs. ~1,450 for AR=48)
- 4× more parameters than the AR=48 model
- a memory footprint that fits within 64GB VRAM
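The sizing rule implied by the doc's own numbers (model_dim = 8 × 96 = 768, i.e. model_dim = DEPTH × ASPECT_RATIO) and the quadratic parameter scaling can be checked directly:

```python
DEPTH = 8  # fixed depth used throughout the best configurations

def model_dim(aspect_ratio):
    # Width rule implied by the configs above: model_dim = DEPTH * ASPECT_RATIO.
    return DEPTH * aspect_ratio

assert model_dim(48) == 384 and model_dim(96) == 768
# Weight-matrix parameters scale ~ model_dim**2 at fixed depth, so doubling
# the width from AR=48 to AR=96 quadruples the parameter count.
assert (model_dim(96) / model_dim(48)) ** 2 == 4.0
```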

Phase 3: Fine-Tuning the Wider Model (experiments 420-560)

Strategy: Re-optimize hyperparameters for the new AR=96 architecture

Key discoveries:

Parameter Finding
Warmdown schedule Optimal warmdown differs for wider model
Matrix learning rate Re-tuned for AR=96
Weight decay Re-optimized for larger model
Newton-Schulz steps Muon orthogonalization iterations re-tuned

Phase outcome: val_bpb 0.977 → 0.975 (Δ = 0.002)

Analysis: This phase demonstrates an important principle: hyperparameters optimized for one architecture may be suboptimal for another. After the architectural change in Phase 2, the agent correctly re-swept the optimizer settings for the new model. This is analogous to how human researchers re-tune learning rates after architectural changes.

Phase 4: Optimizer Tuning (experiments 560-700)

Strategy: Deep exploration of Muon optimizer parameters

Critical discovery: muon_beta2 = 0.98 (up from 0.95)

The Muon optimizer's beta2 parameter controls the exponential moving average for the second-momentum buffer in the NorMuon variance reduction:

second_momentum_buffer.lerp_(v_mean, 1 - beta2)

Increasing beta2 from 0.95 to 0.98 smooths the normalization, allowing the model to take larger effective steps. The agent discovered this by testing beta2 ∈ {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave.
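In scalar form, the lerp above is a standard exponential moving average. A sketch (not the Muon implementation itself) shows why higher beta2 means smoother, slower-adapting normalization:

```python
def ema_update(buffer, v_mean, beta2):
    # Scalar equivalent of: second_momentum_buffer.lerp_(v_mean, 1 - beta2)
    return buffer * beta2 + v_mean * (1 - beta2)

# Starting from buffer=1.0 and observing v_mean=0.0 for one step:
assert ema_update(1.0, 0.0, beta2=0.95) == 0.95  # adapts faster toward new value
assert ema_update(1.0, 0.0, beta2=0.98) == 0.98  # retains more history
```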

Phase outcome: val_bpb 0.975 → 0.974 (Δ = 0.001)

Analysis: This is the largest late-stage improvement and came from deep domain knowledge — the agent understood (or discovered through testing) that the optimizer's variance normalization was too aggressive for the wider model. Higher beta2 means slower adaptation of the normalization, which helps in the later stages of training where the loss landscape is smoother.

Phase 5: Diminishing Returns (experiments 700-910)

Strategy: Combinatorial sweeps over remaining parameters

Explored: Final LR fraction, warmdown ratio, scalar LR, embedding LR

Phase outcome: val_bpb 0.974 → ~0.974 (less than 0.001 additional improvement)

Analysis: The improvement curve had flattened. The agent continued running experiments but each yielded diminishing returns:

Phase   Experiments   Δ val_bpb   Return per Experiment
  1        200          0.022        1.1 × 10⁻⁴
  2        220          0.004        1.8 × 10⁻⁵
  3        140          0.002        1.4 × 10⁻⁵
  4        140          0.001        7.1 × 10⁻⁶
  5        210         <0.001        < 4.8 × 10⁻⁶

The return per experiment declined by ~20× from Phase 1 to Phase 5. In a cost-aware setting, the agent should have a stopping criterion — but the "NEVER STOP" directive from program.md keeps it exploring.
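The return-per-experiment column follows directly from the phase data; recomputing it (Phase 5's delta is only an upper bound, so it is left out):

```python
# (experiments, delta val_bpb) per phase, from the table above
phases = {1: (200, 0.022), 2: (220, 0.004), 3: (140, 0.002), 4: (140, 0.001)}

returns = {p: delta / n for p, (n, delta) in phases.items()}
assert round(returns[1], 6) == 1.1e-4     # Phase 1: 1.1e-4 bpb/experiment
assert returns[1] / returns[4] > 15       # ~15x decline by Phase 4 alone
```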

Best Configuration Found

# Architecture
ASPECT_RATIO = 96          # model_dim = 8 * 96 = 768
DEPTH = 8                  # 8 transformer layers
WINDOW_PATTERN = "SL"      # alternating Sliding + Local attention

# Training
TOTAL_BATCH_SIZE = 2**18   # ~262K tokens/step (halved from default)

# Learning rates
MATRIX_LR = 0.05           # Muon LR for weight matrices (up from 0.04)
EMBEDDING_LR = 0.6         # AdamW LR for token embeddings
SCALAR_LR = 0.5            # AdamW LR for residual mixing scalars

# Optimizer
ADAM_BETAS = (0.70, 0.95)  # lower beta1 than default
WEIGHT_DECAY = 0.08        # reduced from 0.2
WARMDOWN_RATIO = 0.6       # increased from 0.5
FINAL_LR_FRAC = 0.05       # non-zero final LR
# Muon: momentum=0.95, ns_steps=5, beta2=0.98

Comparison: Parallel vs. Sequential

Metric Sequential (1 GPU) Parallel (16 GPUs) Ratio
Experiments/hour ~10 ~90
Time to best result ~72 hours (estimated) ~8 hours 9× faster
Information per decision 1 experiment 10-13 experiments 10-13×
Interaction effects found Rarely Routinely Qualitative
Hardware-aware strategy Impossible Emergent N/A
Total experiments ~720 (equiv. runtime) ~910 1.26×
API cost ~$7 ~$9 1.3×
GPU cost ~$144 (72h × $2/h) ~$263 (13 × H100 + 3 × H200, 8h each)
Total cost ~$151 ~$272 1.8×

The parallel approach costs roughly 1.8× more in compute but reaches the same result 9× faster. In research settings where wall-clock time is the binding constraint, this is a highly favorable trade-off.

7 Reproducibility

Reproducibility Score: High (with Infrastructure Requirements)

Factor Assessment Detail
Code availability Full source, open-source SkyPilot repo + autoresearch repo
Instructions Complete instructions.md Self-contained agent instructions
Data availability Public dataset Same as autoresearch (climbmix-400b)
Compute requirements GPU cluster (Kubernetes/cloud) Higher barrier than single-GPU autoresearch
Agent determinism Non-deterministic LLM agent makes different choices each run
Infrastructure determinism Variable GPU availability, scheduling, hardware mix
One-line setup Available curl ... | bash install script

Reproduction Steps

Quick start (one-line):

curl -sL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autoresearch/setup.sh | bash
cd autoresearch
claude "Read instructions.md and start running parallel experiments."

Manual setup:

# 1. Clone both repos
git clone https://github.com/karpathy/autoresearch.git
git clone https://github.com/skypilot-org/skypilot.git
cd autoresearch

# 2. Copy SkyPilot experiment files
cp ../skypilot/examples/autoresearch/experiment.yaml .
cp ../skypilot/examples/autoresearch/instructions.md .

# 3. Prepare data locally
pip install uv && uv sync && uv run prepare.py

# 4. Install SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html

# 5. Launch agent
# "Read instructions.md and start running parallel experiments"

Infrastructure Requirements

Requirement Minimum Recommended Used in Blog Post
GPU count 2 8-16 16
GPU type Any NVIDIA H100 / H200 13 × H100 + 3 × H200
Infrastructure Any SkyPilot backend Kubernetes Kubernetes (CoreWeave)
Agent Any coding agent Claude Code Claude Code
SkyPilot Latest Latest Latest

Supported infrastructure backends:

# experiment.yaml — the infra field selects the backend
resources:
  accelerators: {H100:1, H200:1}
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
  infra: k8s        # Kubernetes
  # infra: aws      # Amazon Web Services
  # infra: gcp      # Google Cloud Platform
  # infra: azure    # Microsoft Azure
  # infra: nebius   # Nebius AI Cloud
  # infra: slurm    # SLURM HPC clusters
  # (20+ backends supported)

Variability Across Runs

Because the LLM agent makes different decisions each run, exact results will vary. However, the qualitative findings should reproduce:

  1. Batch size reduction is consistently discovered early
  2. Model width scaling provides the largest architectural gain
  3. Diminishing returns appear after ~500 experiments
  4. Hardware heterogeneity is independently discovered (if present)
  5. The agent develops wave-based experimental design

The quantitative results (exact val_bpb values, number of experiments per phase) will differ across runs, agents, and hardware configurations.

8 Compute and API Costs

Detailed Cost Breakdown

GPU compute (as reported in the blog post):

Resource Count Hours Rate Cost
H100 GPUs 13 8 ~$2.00/hr ~$208
H200 GPUs 3 8 ~$2.30/hr ~$55
GPU subtotal ~$263

LLM API costs:

Component Tokens Rate Cost
Claude Code session (~8h) ~2M input + ~500K output Anthropic API rates ~$9
API subtotal ~$9

Total cost: ~$272-300

Cost Efficiency Analysis

Cost per experiment:
  GPU: $263 / 910 experiments ≈ $0.29/experiment
  API: $9 / 910 experiments   ≈ $0.01/experiment
  Total: ~$0.30/experiment

Cost per val_bpb improvement:
  Total improvement: 0.029 (1.003 → 0.974)
  Cost per 0.001 bpb: ~$300 / 29 ≈ $10.30/0.001 bpb

Cost vs. sequential:
  Sequential (72h on 1 H100):
    GPU: 72 × $2 = $144
    API: ~$7
    Total: ~$151

  Parallel (8h on 16 GPUs):
    GPU: ~$263
    API: ~$9
    Total: ~$272

  Premium for 9× speedup: ~$121 (1.8×)

Cost Comparison with Other Systems

System Typical Run Cost Experiments Cost/Experiment Human Hours
Manual researcher $800+ (8h salary) 10-20 $40-80 8
Autoresearch (1 GPU) ~$18 ~80 $0.22 0.1 (setup)
SkyPilot Autoresearch ~$300 ~910 $0.33 0.1 (setup)
AlphaEvolve (estimated) $10,000+ 1000+ ~$10 Multiple engineers

SkyPilot autoresearch occupies a favorable cost-performance point: roughly 17× the cost of an equal-length single-GPU autoresearch session, but ~11× more experiments and 9× faster wall-clock time to an equivalent result. The higher marginal cost per experiment ($0.33 vs. $0.22) reflects GPU idle time during agent planning phases — with 16 GPUs, some inevitably wait while the agent designs the next wave.

9 Architecture Solution

System Architecture

┌──────────────────────────────────────────────────────────┐
│                    CLAUDE CODE AGENT                      │
│                                                          │
│  ┌────────────┐  ┌──────────────┐  ┌───────────────┐    │
│  │ Hypothesis │  │ Experiment   │  │ Result         │    │
│  │ Generator  │  │ Designer     │  │ Analyzer       │    │
│  │            │  │ (factorial   │  │ (cross-GPU     │    │
│  │ (reads     │  │  grids)      │  │  comparison)   │    │
│  │  results,  │  │              │  │                │    │
│  │  plans     │  │ Allocates    │  │ Identifies     │    │
│  │  next      │  │ experiments  │  │ trends,        │    │
│  │  wave)     │  │ to clusters  │  │ interaction    │    │
│  │            │  │              │  │ effects        │    │
│  └──────┬─────┘  └──────┬───────┘  └───────┬────────┘    │
│         │               │                  │             │
│         └───────────────┼──────────────────┘             │
│                         │                                │
│              ┌──────────▼──────────┐                     │
│              │  Cluster Manager    │                     │
│              │  (SkyPilot skill)   │                     │
│              │                    │                     │
│              │  sky launch        │                     │
│              │  sky exec          │                     │
│              │  sky logs          │                     │
│              │  sky status        │                     │
│              │  sky down          │                     │
│              └──────────┬─────────┘                     │
└─────────────────────────┼───────────────────────────────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
   ┌────────▼──────┐  ┌──▼───────┐  ┌──▼──────────┐
   │ instructions  │  │ program  │  │ results.tsv  │
   │    .md        │  │   .md    │  │              │
   │ (SkyPilot     │  │ (Karpathy│  │ experiment_id│
   │  parallelism) │  │  rules)  │  │ val_bpb      │
   └───────────────┘  └──────────┘  │ memory_gb    │
                                    │ status       │
                                    │ description  │
                                    └──────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ SkyPilot     │   │ SkyPilot     │   │ SkyPilot     │
│ Cluster      │   │ Cluster      │   │ Cluster      │
│ gpu-01       │   │ gpu-02       │   │ gpu-16       │
│ ┌─────────┐  │   │ ┌─────────┐  │   │ ┌─────────┐  │
│ │ H100    │  │   │ │ H100    │  │   │ │ H200    │  │
│ │         │  │   │ │         │  │   │ │         │  │
│ │train.py │  │   │ │train.py │  │   │ │train.py │  │
│ │(variant)│  │   │ │(variant)│  │   │ │(variant)│  │
│ └─────────┘  │   │ └─────────┘  │   │ └─────────┘  │
└──────────────┘   └──────────────┘   └──────────────┘
       ···              ···              ···
    13 × H100        clusters         3 × H200

Key Architectural Differences from Sequential Autoresearch

1. Workdir isolation for parallel code variants.

In sequential autoresearch, there is one train.py and one experiment at a time. In parallel, the agent needs multiple versions of train.py running simultaneously:

# Copy files to isolated per-experiment directory
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/

# Edit the copy
# (agent modifies /tmp/autoresearch/exp-03/train.py)

# Launch with isolated workdir
sky launch gpu-03 experiment.yaml \
  --workdir /tmp/autoresearch/exp-03 \
  --env EXPERIMENT_ID=exp-03 \
  --env EXPERIMENT_DESC="wider model" -d -y

SkyPilot snapshots the working directory at submission time, so each experiment gets its own frozen copy of the code. This enables true parallel modification without race conditions.
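The copy-then-freeze step can be sketched in a few lines of Python; `make_workdir` is a hypothetical helper (the blog post does this with plain `cp`):

```python
import pathlib
import shutil

def make_workdir(exp_id, src, root):
    """Copy experiment files into an isolated per-experiment directory so
    parallel edits to train.py cannot race with one another."""
    dst = pathlib.Path(root) / exp_id
    dst.mkdir(parents=True, exist_ok=True)
    for name in ("train.py", "prepare.py", "pyproject.toml", "experiment.yaml"):
        source = pathlib.Path(src) / name
        if source.exists():
            shutil.copy(source, dst / name)
    return dst
```

Each experiment then launches with `--workdir` pointing at its own frozen copy, as in the shell example above.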

2. Detached execution with job pipelining.

# Launch returns immediately (-d flag)
sky launch gpu-01 experiment.yaml -d -y \
  --env EXPERIMENT_ID=exp-01

# Queue next experiment on same cluster
# (starts automatically when exp-01 finishes)
sky exec gpu-01 experiment.yaml -d \
  --env EXPERIMENT_ID=exp-02

The -d (detached) flag and sky exec pipelining allow the agent to submit experiments and immediately move on to planning the next wave. Zero GPU idle time between experiments on the same cluster.

3. Result collection via log streaming.

sky logs gpu-01        # latest job
sky logs gpu-01 2      # specific job ID

The agent checks results asynchronously, inspecting logs from completed experiments while others are still running.

Experiment Lifecycle

┌──────────────────────────────────────────────────────────────┐
│                     WAVE-BASED EXECUTION                     │
│                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│  │ DESIGN      │    │ SUBMIT      │    │ COLLECT     │       │
│  │ Wave N      │    │ 10-16 exps  │    │ Results     │       │
│  │             │───▶│ to clusters │───▶│ from logs   │       │
│  │ Agent plans │    │ sky launch  │    │ sky logs    │       │
│  │ factorial   │    │ sky exec    │    │             │       │
│  │ grid        │    │ (detached)  │    │ ~5 min wait │       │
│  └─────────────┘    └─────────────┘    └──────┬──────┘       │
│                                               │              │
│                                        ┌──────▼──────┐       │
│                                        │ ANALYZE     │       │
│                                        │ Results     │       │
│                                        │             │       │
│                                        │ - Compare   │       │
│                                        │   across    │       │
│                                        │   GPUs      │       │
│                                        │ - Identify  │       │
│                                        │   trends    │       │
│                                        │ - Update    │       │
│                                        │   results   │       │
│                                        │   .tsv      │       │
│                                        │ - Commit    │       │
│                                        │   winners   │       │
│                                        └──────┬──────┘       │
│                                               │              │
│                                        ┌──────▼──────┐       │
│                                        │ DESIGN      │       │
│                                        │ Wave N+1    │       │
│                                        └─────────────┘       │
└──────────────────────────────────────────────────────────────┘

10 Component Breakdown

Component 1: experiment.yaml — SkyPilot Task Definition

Purpose: Defines a single GPU experiment as a declarative YAML specification.

resources:
  accelerators: {H100:1, H200:1}
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
  infra: k8s

workdir: .

envs:
  EXPERIMENT_ID: baseline
  EXPERIMENT_DESC: "baseline run"

setup: |
  pip install uv
  uv sync
  uv run prepare.py

run: |
  uv run train.py 2>&1 | tee run.log
  EXIT_CODE=${PIPESTATUS[0]}
  if [ $EXIT_CODE -ne 0 ]; then
    echo "EXPERIMENT_STATUS: crash"
  else
    VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
    PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
    MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
    echo "EXPERIMENT_STATUS: done"
    echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
  fi
  echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"

Key design choices:

Choice Rationale
{H100:1, H200:1} Either GPU type acceptable; SkyPilot picks available
docker:nvcr.io/nvidia/pytorch:24.07-py3 NVIDIA-maintained PyTorch container
infra: k8s Kubernetes backend (CoreWeave in the blog post)
setup block Runs once per cluster; subsequent experiments skip it
EXPERIMENT_ID / EXPERIMENT_DESC Passed as environment variables per experiment
Structured output EXPERIMENT_STATUS, EXPERIMENT_RESULT for easy parsing

Component 2: instructions.md — Parallel Agent Instructions

Purpose: Extends program.md with SkyPilot-specific parallelism instructions.

Structure:

Section Purpose
Setup 4-step initialization (read program.md, load SkyPilot skill, read codebase, ask about infra)
Launching Experiments How to use sky launch, sky exec, workdir isolation
Checking Results sky logs, SSH, sky status, sky queue
Tracking Results Extended results.tsv format with experiment_id
Experiment Loop Parallel version of the autoresearch loop
Cleanup sky down -a -y

Key behavioral directives:

  1. "Don't wait": After submitting experiments, move on to the next idea immediately
  2. "Periodically check": Poll results asynchronously rather than blocking
  3. "Tear down idle clusters": Resource management directive
  4. "At most 4 clusters": Original conservative limit (the actual run used 16)
  5. "NEVER STOP": Inherited from program.md

Component 3: SkyPilot Skill — Agent Infrastructure Education

Purpose: Teaches the LLM agent how to use SkyPilot CLI commands.

The SkyPilot skill is a document that the agent fetches and reads at startup. It teaches:

Command Purpose Agent Use
sky launch <name> <yaml> Provision cluster + run job Create new GPU clusters
sky exec <name> <yaml> Run job on existing cluster Pipeline experiments
sky logs <name> [job_id] View job output Check experiment results
sky status List all clusters Monitor infrastructure
sky queue <name> List jobs on a cluster Track pending experiments
sky down <name> Tear down cluster Clean up idle resources
-d flag Detached (async) mode Non-blocking submission
-y flag Skip confirmation Autonomous operation
--workdir Set working directory Code variant isolation
--env Pass environment variables Experiment identification

The skill system is a form of procedural knowledge injection — the agent acquires new capabilities by reading documentation, without any code changes to the agent itself.

Component 4: Extended results.tsv — Parallel Results Log

Purpose: Tracks all experiments across multiple clusters.

Schema (extended from autoresearch):

experiment_id   status    val_bpb    memory_gb   description
exp-01          keep      0.997900   44.0        baseline
exp-02          discard   1.005000   44.0        switch to GeLU
exp-03          crash     0.000000   0.0         double width (OOM)
exp-04          keep      0.985000   44.2        batch_size=2^18
exp-05          keep      0.982000   44.2        batch_size=2^18 + wd=0.08

Differences from sequential results.tsv:

Field Sequential Parallel
ID git commit hash experiment_id (agent-assigned)
Order Sequential Concurrent (multiple experiments per timestamp)
Hardware Implicit (one GPU) Should be noted in description
Volume ~80/overnight ~900/overnight
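Selecting the current best from this log is a small query over the TSV. A sketch using only the standard library, with the sample rows from above inlined (best_kept is a hypothetical helper):

```python
import csv, io

# Sample rows mirroring the extended results.tsv schema shown above.
TSV = """experiment_id\tstatus\tval_bpb\tmemory_gb\tdescription
exp-01\tkeep\t0.997900\t44.0\tbaseline
exp-02\tdiscard\t1.005000\t44.0\tswitch to GeLU
exp-03\tcrash\t0.000000\t0.0\tdouble width (OOM)
exp-04\tkeep\t0.985000\t44.2\tbatch_size=2^18
exp-05\tkeep\t0.982000\t44.2\tbatch_size=2^18 + wd=0.08
"""

def best_kept(tsv_text):
    # Only "keep" rows count; crashed rows carry a placeholder 0.0 val_bpb.
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: float(r["val_bpb"]))

print(best_kept(TSV)["experiment_id"])  # exp-05 has the lowest val_bpb
```

Filtering on status first matters: a naive min over val_bpb would wrongly pick a crashed row's placeholder 0.0.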

Component 5: Cluster Fleet — Managed GPU Infrastructure

Purpose: The physical compute layer managed by SkyPilot.

$ sky status
NAME     INFRA   RESOURCES        STATUS
gpu-01   k8s     1x (H100:1)      UP
gpu-02   k8s     1x (H100:1)      UP
gpu-03   k8s     1x (H200:1)      UP       ← H200 (faster)
gpu-04   k8s     1x (H200:1)      UP       ← H200 (faster)
gpu-05   k8s     1x (H100:1)      UP
gpu-06   k8s     1x (H100:1)      UP
gpu-07   k8s     1x (H100:1)      UP
gpu-08   k8s     1x (H200:1)      UP       ← H200 (faster)
gpu-09   k8s     1x (H100:1)      UP
gpu-10   k8s     1x (H100:1)      UP
gpu-11   k8s     1x (H100:1)      UP
gpu-12   k8s     1x (H100:1)      UP
gpu-13   k8s     1x (H100:1)      UP
gpu-14   k8s     1x (H100:1)      UP
gpu-15   k8s     1x (H100:1)      UP
gpu-16   k8s     1x (H100:1)      UP

The fleet is heterogeneous (13 H100s + 3 H200s) because SkyPilot allocated GPUs based on availability from the Kubernetes cluster. The agent was not told about this heterogeneity — it discovered it from experimental results.

11 Core Mechanisms (Detailed)

Mechanism 1: Wave-Based Factorial Search

The central algorithmic innovation is the shift from greedy hill-climbing to wave-based factorial search:

Sequential hill-climbing (autoresearch default):

for each experiment:
    hypothesis = agent.think(history)
    result = run(hypothesis)       # 5 minutes
    if result.improved:
        keep()
    else:
        discard()

Wave-based factorial search (SkyPilot extension):

for each wave:
    hypotheses = agent.design_factorial_grid(history)  # 10-16 experiments

    for h in hypotheses:
        submit_async(h, available_cluster)  # returns immediately

    wait_for_results(~5 minutes)

    results = collect_all_results()

    trends = analyze_trends(results)  # e.g., "wider is better"
    interactions = find_interactions(results)  # e.g., "lr × wd synergy"

    for r in results:
        if r.improved:
            commit_winning_config(r)

    next_wave = agent.design_followup(trends, interactions)

Why this is a qualitative, not quantitative, change:

In sequential mode, the agent has one data point per decision. It sees "batch_size=2^18 gives val_bpb=0.985" and must decide: is this an improvement? Should I explore further? Is there an interaction with other parameters?

In parallel mode, the agent has 10-16 data points per decision. It sees the complete response surface over the tested parameters and can identify:

  1. Monotonic trends: "val_bpb decreases monotonically with batch_size from 2^19 to 2^17"
  2. Optima: "AR=96 is the sweet spot; AR=112 is too large"
  3. Interaction effects: "weight_decay=0.08 is better, but only when batch_size=2^18"
  4. Diminishing returns: "all 6 learning rate variants give similar results — LR is not the bottleneck"

This transforms the agent from a sequential decision-maker into an experimental scientist running designed experiments.
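The wave-design step amounts to enumerating a Cartesian product over the chosen factors. A minimal sketch (design_factorial_grid is a hypothetical name; the actual agent plans grids in natural language):

```python
from itertools import product

# Enumerate a full factorial grid over the chosen factors:
# one experiment per combination of factor levels.
def design_factorial_grid(factors):
    names = list(factors)
    return [dict(zip(names, values))
            for values in product(*factors.values())]

wave = design_factorial_grid({
    "batch_size": [2**17, 2**18, 2**19],
    "weight_decay": [0.05, 0.08, 0.15, 0.2],
})
print(len(wave))  # 3 x 4 = 12 experiments in one wave
```

Each dict in wave maps directly onto one sky exec submission with a distinct EXPERIMENT_ID.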

Mechanism 2: Emergent Hardware-Aware Strategy

The most remarkable finding is the agent's autonomous discovery and exploitation of hardware heterogeneity:

Timeline of discovery:

Early waves:
  Agent notices: identical configs get different val_bpb on different clusters
  Agent wonders: "Why does gpu-03 consistently score better?"

Investigation:
  Agent checks: sky status → discovers gpu-03 is H200, others are H100
  Agent hypothesizes: "H200 is faster → more training steps → better val_bpb"
  Agent verifies: "H200 runs 9% more steps in 5 minutes!"

Strategy development:
  Agent concludes: "I have 3 H200s and 13 H100s"
  Agent decides: "Screen ideas on H100s, validate winners on H200s"
  Agent implements: submits 10+ experiments to H100s, top 2-3 to H200s

Refinement:
  Agent discovers: "Rankings sometimes differ across hardware"
  Agent learns: "FINAL_LR_FRAC=0.03 beats 0.05 on H100 but loses on H200"
  Agent adapts: "The real contest is on H200 — that's the target hardware"

Analysis of the emergent behavior:

This is a textbook example of emergent intelligence — a behavior that arises from the interaction between a capable agent and a complex environment, not from explicit programming:

  1. The agent was not told about H100 vs. H200 performance differences
  2. The agent was not instructed to develop a screening/validation workflow
  3. The agent was not programmed to normalize results by hardware type
  4. The agent independently:
     - Detected the confounding variable (hardware type)
     - Diagnosed the mechanism (step count difference)
     - Developed a strategy to exploit it (two-tier workflow)
     - Identified cases where it matters (ranking inversions)

This behavior would be considered sophisticated experimental methodology if performed by a human researcher. The fact that it emerged autonomously from an LLM coding agent suggests that large language models have internalized principles of experimental design from their training data.
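The hardware blocking the agent discovered can be sketched as a simple group-by before ranking. The scores below are illustrative, not from the run:

```python
from collections import defaultdict
from statistics import mean

# Illustrative scores for one identical config run on different clusters,
# tagged with the GPU type that `sky status` reveals.
results = [
    {"cluster": "gpu-01", "gpu": "H100", "val_bpb": 0.985},
    {"cluster": "gpu-05", "gpu": "H100", "val_bpb": 0.986},
    {"cluster": "gpu-03", "gpu": "H200", "val_bpb": 0.981},
    {"cluster": "gpu-04", "gpu": "H200", "val_bpb": 0.980},
]

# Block on hardware: compare like with like before ranking configs.
by_gpu = defaultdict(list)
for r in results:
    by_gpu[r["gpu"]].append(r["val_bpb"])

means = {gpu: mean(scores) for gpu, scores in by_gpu.items()}
print(means)  # H200 clusters score lower (better) on the same config
```

Without this blocking step, a config that happened to land on an H200 would look spuriously better than an equal config on an H100.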

Mechanism 3: SkyPilot Cluster Management

The infrastructure layer handles several complex operations transparently:

Cluster provisioning:

sky launch gpu-01 experiment.yaml -d -y
  │
  ├─ Parses experiment.yaml
  ├─ Finds available GPU (H100 or H200)
  ├─ Creates Kubernetes pod with NVIDIA PyTorch image
  ├─ Syncs workdir to pod
  ├─ Runs setup block (pip install uv, uv sync, uv run prepare.py)
  └─ Runs experiment (uv run train.py)

Job pipelining:

sky exec gpu-01 experiment.yaml -d --env EXPERIMENT_ID=exp-02
  │
  ├─ Reuses existing cluster (skips setup)
  ├─ Queues job to start when current job finishes
  ├─ Returns immediately (agent can plan next experiment)
  └─ Zero GPU idle time between experiments

Workdir snapshotting:

mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# Agent modifies /tmp/autoresearch/exp-03/train.py
sky launch gpu-03 experiment.yaml --workdir /tmp/autoresearch/exp-03 -d -y
  │
  ├─ Snapshots /tmp/autoresearch/exp-03/ at submission time
  ├─ Syncs snapshot to cluster
  └─ Subsequent modifications to /tmp/autoresearch/exp-03/ don't affect running job

This workdir isolation is critical for parallel execution — without it, the agent could not run different code variants simultaneously, as modifications to train.py would race between experiments.
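The snapshot step can be sketched in Python (the blog's agent uses cp into per-experiment directories under /tmp/autoresearch; temporary directories are used here so the example is self-contained):

```python
import shutil, tempfile
from pathlib import Path

# Copy the experiment's code into an isolated per-experiment directory so
# that later edits by the agent cannot race with a running job.
def snapshot_workdir(src, experiment_id, root):
    dst = Path(root) / experiment_id
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst

src = Path(tempfile.mkdtemp())
(src / "train.py").write_text("v1")
snap = snapshot_workdir(src, "exp-03", root=tempfile.mkdtemp())
(src / "train.py").write_text("v2")      # agent edits the source later
print((snap / "train.py").read_text())   # snapshot still holds "v1"
```

SkyPilot adds a second layer of the same guarantee: --workdir is itself snapshotted at submission time when syncing to the cluster.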

Mechanism 4: Throughput Scaling Analysis

Theoretical throughput model:

T_sequential = T_think + T_train + T_analyze
             ≈ 30s + 300s + 30s
             ≈ 360s per experiment
             ≈ 10 experiments/hour

T_parallel(N) = T_design_wave + T_train + T_collect + T_analyze
              ≈ 60s + 300s + 60s + 60s
              ≈ 480s per wave of N experiments
              ≈ 7.5 waves/hour × N experiments/wave

For N=12 (average wave size):
  ≈ 7.5 × 12 = 90 experiments/hour

Scaling efficiency:

Ideal speedup:    16× (16 GPUs)
Actual speedup:   9× (90/10)
Efficiency:       56%

Sources of inefficiency:
  - Wave design time (agent thinking)        ~12% overhead
  - Result collection and analysis            ~12% overhead
  - Cluster provisioning and setup           ~5% overhead
  - GPU idle time between waves              ~10% overhead
  - Crashed experiments                      ~5% waste
  Total overhead:                            ~44%

The 56% scaling efficiency is reasonable for a system where a single agent manages 16 clusters. The primary bottleneck is the agent's thinking time — it takes longer to design a 12-experiment factorial grid than a single experiment. Scaling beyond 16 GPUs would require either faster agent reasoning or multi-agent coordination.
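The throughput model above can be restated as executable arithmetic (all constants are the per-phase seconds from the text; N is the wave size):

```python
# Sequential: think + train + analyze per experiment.
def experiments_per_hour_sequential(t_think=30, t_train=300, t_analyze=30):
    return 3600 / (t_think + t_train + t_analyze)

# Parallel: one wave of N experiments per design/train/collect/analyze cycle.
def experiments_per_hour_parallel(n, t_design=60, t_train=300,
                                  t_collect=60, t_analyze=60):
    waves_per_hour = 3600 / (t_design + t_train + t_collect + t_analyze)
    return waves_per_hour * n

seq = experiments_per_hour_sequential()    # 10.0 experiments/hour
par = experiments_per_hour_parallel(n=12)  # 90.0 experiments/hour
print(par / seq / 16)                      # scaling efficiency: 0.5625
```

The model makes the bottleneck explicit: increasing n raises throughput linearly until t_design (agent thinking time) grows with wave size.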

Mechanism 5: Experimental Design Progression

The agent's experimental design sophistication increased over the course of the 8-hour run:

Phase 1 (simple sweeps):

Wave: Test 10 values of weight_decay, one variable at a time
  weight_decay ∈ {0.01, 0.02, 0.05, 0.08, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0}

Phase 2 (factorial grids):

Wave: Test 7 aspect ratios simultaneously
  AR ∈ {48, 64, 72, 80, 90, 96, 112}
  (with best hyperparameters from Phase 1 held fixed)

Phase 3 (interaction testing):

Wave: 3 × 4 factorial grid
  matrix_lr ∈ {0.03, 0.05, 0.07}
  warmdown_ratio ∈ {0.4, 0.5, 0.6, 0.7}
  → 12 experiments in one wave

Phase 4 (targeted sweeps with hardware awareness):

Wave: Test 5 values of muon_beta2, all on H100
  beta2 ∈ {0.95, 0.96, 0.97, 0.98, 0.99}
  → then promote best to H200 for validation

Phase 5 (combinatorial fine-tuning):

Wave: 2³ factorial over final tuning parameters
  final_lr_frac ∈ {0.03, 0.05}
  warmdown_ratio ∈ {0.55, 0.65}
  scalar_lr ∈ {0.4, 0.6}
  → 8 experiments, looking for interaction effects

This progression mirrors the behavior of an experienced human researcher: start with broad sweeps to map the landscape, then zoom in on promising regions with factorial designs, then fine-tune with targeted interaction tests.

12 Programming Language

Language: Python + YAML + Bash

The SkyPilot extension adds two language layers to autoresearch's pure-Python design:

Language Component Purpose
Python train.py, prepare.py Core training (unchanged from autoresearch)
YAML experiment.yaml Infrastructure-as-code (SkyPilot task definition)
Bash setup and run blocks in YAML Environment setup and experiment execution
Markdown instructions.md, program.md Agent instruction documents

SkyPilot CLI as an Agent API

The SkyPilot CLI becomes, in effect, an API for the LLM agent:

# Provisioning API
sky launch <name> <yaml> [--env KEY=VALUE ...] [-d] [-y] [--workdir PATH]

# Execution API
sky exec <name> <yaml> [--env KEY=VALUE ...] [-d]

# Observability API
sky status           # list clusters
sky queue <name>     # list jobs on cluster
sky logs <name> [id] # view job output

# Lifecycle API
sky down <name> [-y] # tear down cluster
sky down -a [-y]     # tear down all clusters

The agent interacts with this API entirely through natural-language understanding of the SkyPilot skill document: it uses no SDK, no Python bindings, no programmatic interface. It simply reads documentation and constructs shell commands.
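Constructing those shell commands is plain string assembly. A hedged sketch of a helper an agent harness could use (sky_exec_cmd is hypothetical; it only builds the command string and never invokes SkyPilot):

```python
import shlex

# Hypothetical helper: assemble the `sky exec` command line an agent
# would run for one experiment. Building argv and joining with shlex
# avoids quoting bugs when env values contain spaces.
def sky_exec_cmd(cluster, yaml_path, env=None, detach=True):
    parts = ["sky", "exec", cluster, yaml_path]
    if detach:
        parts.append("-d")
    for key, value in (env or {}).items():
        parts += ["--env", f"{key}={value}"]
    return shlex.join(parts)

cmd = sky_exec_cmd("gpu-01", "experiment.yaml",
                   env={"EXPERIMENT_ID": "exp-02"})
print(cmd)
```

For example, the call above reproduces the pipelining command shown earlier in this document.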

Infrastructure Code Quality

The experiment.yaml is notable for its error handling:

uv run train.py 2>&1 | tee run.log
EXIT_CODE=${PIPESTATUS[0]}

if [ $EXIT_CODE -ne 0 ]; then
  echo "EXPERIMENT_STATUS: crash"
else
  VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
  PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
  MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
  echo "EXPERIMENT_STATUS: done"
  echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
fi

Key design choices:

  - PIPESTATUS[0] captures the exit code from train.py (not from tee)
  - Structured output (EXPERIMENT_STATUS, EXPERIMENT_RESULT) enables reliable parsing
  - tee writes both to stdout and file (unlike autoresearch's pure redirect)
  - Memory is converted from MB to GB inline for readability

Dependency Management

The SkyPilot extension adds no Python dependencies beyond autoresearch's pyproject.toml. The only additional requirement is the SkyPilot CLI itself, which is installed on the agent's machine (not on GPU clusters). The Docker image (nvcr.io/nvidia/pytorch:24.07-py3) provides the CUDA runtime and PyTorch.

This zero-dependency-addition design is intentional — it preserves autoresearch's simplicity while adding parallelism purely through infrastructure.

13 Memory Management

Agent Context Window Memory

The parallel execution dramatically increases the agent's context management burden:

Sequential context per cycle (~3K tokens):

Read results → Edit code → Run → Read metric → Decide

Parallel context per wave (~15-25K tokens):

Review all 16 cluster states → Design 10-16 experiments →
Copy and modify 10-16 train.py variants →
Submit all experiments → Wait → Check 10-16 results →
Compare across hardware types → Identify trends →
Decide which configs to keep → Plan next wave

Context accumulation over 8 hours:

Sequential (100 experiments):
  ~300K tokens cumulative

Parallel (910 experiments in ~75 waves):
  ~15K tokens/wave × 75 waves ≈ 1.1M tokens cumulative

The parallel agent uses approximately 3-4× more context than the sequential agent. This is a potential scaling bottleneck — at some point, the agent's context window fills up, and it either loses early experimental context or must summarize/compress.

Workdir Memory (Disk)

Each experiment creates an isolated copy of the codebase:

/tmp/autoresearch/
  exp-01/
    train.py          ~20KB
    prepare.py         ~8KB
    pyproject.toml     ~1KB
    experiment.yaml    ~1KB
  exp-02/
    ...
  exp-910/
    ...

Total disk usage: ~910 × 30KB ≈ 27MB

This is negligible — disk is not a constraint. However, the agent must manage these directories (create, modify, reference by path), which adds cognitive overhead.

GPU Memory (Per Cluster)

Each cluster runs independently with its own GPU memory:

Per cluster (same as autoresearch):
  Model parameters:      ~100 MB
  Optimizer states:      ~400 MB
  Activations:          ~2-10 GB
  KV cache:              ~500 MB
  Total:                ~3-11 GB

Across all 16 clusters:
  Total GPU memory used: ~48-176 GB (across 16 independent GPUs)
  No memory sharing between clusters

Each cluster is completely independent — there is no distributed training, no gradient sharing, no communication between GPUs. Each GPU runs its own experiment with its own copy of the code and data. This is an embarrassingly parallel architecture.

Data Memory (Per Cluster)

Each cluster downloads and stores training data independently:

Per cluster:
  ~/.cache/autoresearch/data/       ~1-5 GB (parquet shards)
  ~/.cache/autoresearch/tokenizer/  ~10 MB (BPE tokenizer)

First experiment on a cluster: setup downloads data (~2 min)
Subsequent experiments: skip setup (data already present)

The setup block in experiment.yaml runs once per cluster. SkyPilot's sky exec reuses the existing cluster, so data is downloaded only once per GPU, not once per experiment. This amortizes the data preparation cost across all experiments on a given cluster.

14 Continued Learning

Within-Run Learning

The parallel agent's learning is richer than the sequential agent's because it processes more information per decision cycle:

Sequential agent learning:

Experiment 1: batch_size=2^18 → 0.985 (better)
→ Learns: smaller batch size helps

Experiment 2: weight_decay=0.08 → 0.990 (better than baseline, worse than exp-1)
→ Learns: weight_decay matters, but batch size matters more

Parallel agent learning:

Wave 1 (12 experiments):
  batch_size ∈ {2^17, 2^18, 2^19} × weight_decay ∈ {0.05, 0.08, 0.15, 0.2}

→ Learns:
  - batch_size=2^18 is optimal (2^17 too small, 2^19 too large)
  - weight_decay=0.08 is optimal across all batch sizes
  - there is NO interaction between batch_size and weight_decay
  - margin of improvement: 0.015 from batch_size, 0.003 from wd
  - together they give 0.018, confirming additivity

The parallel agent learns more per unit time because factorial designs are statistically more efficient than one-at-a-time designs (a classical result in design of experiments theory).
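The additivity check described above is a standard two-factor effect computation. A sketch with toy numbers chosen to match the text's additive finding (~0.015 from batch size, ~0.003 from weight decay, no interaction):

```python
from itertools import product

# Toy response surface consistent with the additive finding in the text:
# batch_size contributes ~0.015, weight_decay ~0.003, and they don't interact.
baseline = 1.000
effect = {"bs": {2**19: 0.0, 2**18: -0.015},
          "wd": {0.15: 0.0, 0.08: -0.003}}

grid = {(bs, wd): baseline + effect["bs"][bs] + effect["wd"][wd]
        for bs, wd in product(effect["bs"], effect["wd"])}

# Interaction = how much the batch-size effect changes when weight decay
# changes; zero means the two effects are purely additive.
interaction = ((grid[(2**18, 0.08)] - grid[(2**19, 0.08)])
               - (grid[(2**18, 0.15)] - grid[(2**19, 0.15)]))
print(round(interaction, 9))  # effectively 0: the effects are additive
```

With a full factorial grid this difference-of-differences is directly observable; a one-at-a-time sweep cannot measure it at all.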

Cross-Wave Strategy Evolution

The agent's strategy visibly evolves across waves:

Wave Range Strategy Sophistication
1-5 One-variable sweeps Low (identical to sequential)
6-15 Multi-variable factorials Medium (exploiting parallelism)
16-25 Hardware-aware allocation High (emergent capability)
26-40 Two-tier screening/validation High (research methodology)
40-60 Focused combinatorial tuning High (targeted exploration)
60-75 Diminishing returns awareness High (implicit stopping intuition)

This progression suggests the agent is learning how to research during the run, not just what works. It develops experimental methodology from experience.

Cross-Run Knowledge Transfer

Like sequential autoresearch, the SkyPilot extension does not have built-in cross-run memory. However, the richer results from parallel runs provide better transfer material:

1. Winning configuration as starting point.

# Best config from an 8-hour parallel run
ASPECT_RATIO = 96
TOTAL_BATCH_SIZE = 2**18
WEIGHT_DECAY = 0.08
ADAM_BETAS = (0.70, 0.95)
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# muon_beta2 = 0.98

A subsequent run starting from this configuration would explore the space around this optimum, potentially finding further improvements.

2. Response surface knowledge.

The complete factorial results from parallel runs encode the response surface:

AR:   48  64  72  80  90  96  112
bpb: .985 .982 .980 .979 .978 .977 .984
     ↑ monotonic improvement up to AR=96, then regression

This knowledge could be encoded in program.md:
"Note: optimal ASPECT_RATIO is around 96 for H100.
Do not test below 80 or above 112."
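As a sanity check, the sweep above can be re-read numerically (values copied from the mini-table):

```python
# AR sweep results from the parallel run, as reported above.
ar_to_bpb = {48: 0.985, 64: 0.982, 72: 0.980, 80: 0.979,
             90: 0.978, 96: 0.977, 112: 0.984}

# Lower val_bpb is better; the minimum over the sweep is the sweet spot.
best_ar = min(ar_to_bpb, key=ar_to_bpb.get)
print(best_ar)  # 96: improvement is monotonic up to 96, then regresses
```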

3. Hardware-specific findings.

H100: ~1,060 steps in 5 minutes. FINAL_LR_FRAC=0.03 is best.
H200: ~1,200 steps in 5 minutes. FINAL_LR_FRAC=0.05 is best (0.03 loses here).
Rankings can differ across hardware types; always validate on target hardware.

Potential Extensions for Persistent Learning

The SkyPilot autoresearch results suggest several natural extensions for cross-run learning:

1. Results database.

Instead of results.tsv, store results in a structured database that persists across runs. The agent could query: "What ASPECT_RATIO values have been tested, and what were the results?"

2. Bayesian surrogate model.

Fit a Gaussian process or neural network surrogate to the accumulated results, then use it to guide subsequent exploration (Bayesian optimization). The parallel factorial data provides excellent training signal for surrogate models.

3. Agent memory systems.

Use systems like autoresearch's meta-learning framework — updating program.md with empirical findings — but automated:

# Auto-generated findings from Run 1:
- Optimal ASPECT_RATIO range: [80, 100]
- Batch size 2^18 consistently outperforms 2^19
- Weight decay 0.08 is robust across architectures
- muon_beta2=0.98 is best for wide models
- H200 gives ~9% more steps than H100 in 5-min budget

4. Population-based training.

Instead of greedy hill-climbing, maintain a population of configurations that evolve over time. Parallel resources make this natural — each GPU runs a different population member. This would bring autoresearch closer to systems like AlphaEvolve while preserving its simplicity.
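A minimal population-based-training loop on a toy one-dimensional objective illustrates the idea (the score function stands in for a real training run; all values are illustrative):

```python
import random

# Toy PBT sketch: each population member would occupy one GPU in the real
# system; here score() stands in for a 5-minute training run.
random.seed(0)

def score(cfg):                      # lower is better, optimum at lr=0.05
    return (cfg["lr"] - 0.05) ** 2

population = [{"lr": random.uniform(0.01, 0.2)} for _ in range(8)]
for generation in range(30):
    population.sort(key=score)       # rank members by fitness
    for i in range(4):
        # Exploit: the bottom half copies the top half...
        population[4 + i] = dict(population[i])
        # ...then explores by perturbing the copied config.
        population[4 + i]["lr"] *= random.uniform(0.8, 1.2)

best = min(population, key=score)
print(round(best["lr"], 3))  # converges toward the optimum at 0.05
```

The sort-copy-perturb cycle maps naturally onto wave-based execution: one wave evaluates the whole population, and the analyze step performs selection.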

15 Applications

Direct Applications

1. Accelerated neural network training research.

The primary use case — any researcher with access to a GPU cluster can run parallel autoresearch overnight and wake up to 900+ experiments and a well-optimized model:

Evening:  claude "Read instructions.md and start running experiments"
Morning:  val_bpb improved by 3%, 900 experiments logged
Time invested: ~10 minutes of setup

2. Hardware benchmarking and characterization.

The emergent hardware-aware behavior makes parallel autoresearch a novel hardware benchmarking tool. By running on a heterogeneous cluster, the system automatically:

  - Compares GPU types under real ML workloads
  - Identifies performance differences that synthetic benchmarks miss
  - Finds hardware-specific optimal configurations
  - Discovers when parameter rankings change across hardware

3. LLM agent capability benchmarking.

The SkyPilot autoresearch is a rigorous benchmark for evaluating LLM coding agents:

Capability How It's Tested
Code editing Modifying train.py correctly
Tool use Using SkyPilot CLI commands
Experimental design Quality of factorial grids
Scientific reasoning Hardware heterogeneity discovery
Resource management Cluster utilization efficiency
Error handling Crash diagnosis and recovery
Long-horizon planning Multi-wave strategy evolution
Autonomy 8-hour unsupervised operation

4. Infrastructure validation.

SkyPilot uses autoresearch as a showcase for its agent skill system. The 8-hour autonomous run demonstrates that SkyPilot's Kubernetes integration, job pipelining, and multi-cluster management work reliably at scale.

Broader Research Applications

5. Design of Experiments (DOE) by AI agents.

The SkyPilot results demonstrate that LLM agents can independently develop sophisticated experimental design strategies:

Classical DOE Theory (Fisher, 1935):
  - Factorial designs
  - Blocking for confounding variables
  - Response surface methodology
  - Sequential experimentation

Agent-Discovered DOE (SkyPilot Autoresearch, 2026):
  - Factorial grids (Phase 2-3)
  - Hardware blocking (emergent)
  - Response surface scanning (AR sweep)
  - Two-tier screening/validation (emergent)

The agent rediscovered principles of experimental design that took human statisticians decades to formalize. This suggests that LLM agents could be useful research assistants even in domains where they have no specialized training.

6. Heterogeneous compute optimization.

Modern compute environments are increasingly heterogeneous (different GPU types, different cloud providers, different pricing). The emergent hardware-aware strategy suggests that LLM agents can:

  - Automatically discover hardware characteristics
  - Develop resource allocation strategies
  - Optimize cost-performance trade-offs
  - Adapt to changing resource availability

This has implications beyond ML research — any workload that runs on heterogeneous compute could benefit from agent-driven optimization.

7. Meta-research on research methodology.

The SkyPilot experiment is itself a meta-research study: it investigates how compute scale affects the quality and strategy of autonomous research. Key findings that generalize:

Finding Generalization
Parallelism changes strategy More resources enable qualitatively different approaches, not just faster execution
Emergent capabilities appear at scale Complex behaviors (hardware awareness) emerge without explicit programming
Diminishing returns follow a power law Early experiments are disproportionately valuable
Agent experimental design improves with experience LLMs learn research methodology on-the-fly

Limitations

Limitation Description Impact
Single agent bottleneck One agent manages all 16 GPUs ~56% scaling efficiency
No inter-experiment communication Clusters don't share information Redundant computations possible
Context window pressure 910 experiments generate large context Risk of losing early context
Cost scaling 16 GPUs × 8 hours = significant cost $300 vs. $18 for single-GPU
Non-deterministic Different runs yield different results Hard to reproduce exact numbers
Agent-specific Tested only with Claude Code May not generalize to other agents
Kubernetes dependency Requires cluster infrastructure Higher barrier than single GPU
No stopping criterion Runs until manually stopped Wastes compute in diminishing returns phase

Future Directions

1. Multi-agent parallel autoresearch.

Instead of one agent managing 16 GPUs, use 4 agents each managing 4 GPUs. Each agent explores a different region of the search space, and they periodically share discoveries:

Agent A (architecture):    Explores model dimensions
Agent B (optimizer):       Explores optimizer parameters
Agent C (schedule):        Explores LR and warmdown schedules
Agent D (attention):       Explores attention patterns

Shared discovery board:    Best configs from each agent
Cross-pollination:         Agents adopt each other's best findings

2. Adaptive compute allocation.

Instead of fixed 16 GPUs for 8 hours, dynamically scale:

  - Start with 4 GPUs (broad exploration)
  - Scale to 16 when promising directions are found (deep exploration)
  - Scale down when diminishing returns are detected
  - Cost-aware stopping: stop when expected improvement per dollar drops below a threshold

3. Continuous autoresearch.

Instead of overnight runs, run autoresearch continuously:

  - An always-on agent monitors a git repository
  - New ideas from the community trigger automated experiments
  - Results are automatically committed and published
  - A living leaderboard reflects the current best configuration

4. Cross-domain autoresearch.

Apply the parallel autoresearch methodology to other optimization domains:

  - Reinforcement learning training scripts
  - Compiler optimization passes
  - Database query optimizer configurations
  - Scientific simulation parameters
  - Drug discovery molecular parameters

The SkyPilot infrastructure layer is domain-agnostic — only the train.py and prepare.py would change.

Broader Impact

The SkyPilot scaling of autoresearch demonstrates a fundamental principle: the gap between sequential and parallel autonomous research is not just speed — it's intelligence. A sequential agent does greedy hill-climbing. A parallel agent does factorial experimental design, discovers confounding variables, and develops multi-tier validation strategies.

This has profound implications for the future of AI-driven research:

  1. Compute is a cognitive amplifier. More compute doesn't just make the agent faster — it makes it smarter by enabling strategies that are impossible in sequential mode.

  2. Infrastructure is an intelligence bottleneck. The sequential autoresearch agent is limited not by its reasoning ability but by its infrastructure. Remove the infrastructure bottleneck, and emergent capabilities appear.

  3. Agent autonomy scales with resources. A single-GPU agent needs human oversight (it can get stuck in local optima). A 16-GPU agent develops its own research methodology and can run for days without human intervention.

  4. The "researcher" is the program, not the agent. Both Karpathy's program.md and SkyPilot's instructions.md are natural-language research programs. The agent is the execution engine. The quality of research depends on the quality of the program, not the specific agent. This democratizes research — anyone who can write clear instructions can run a research lab.

The combination of Karpathy's radical simplicity (3 files, 1 metric) with SkyPilot's infrastructure abstraction (any cluster, any cloud) creates a system that is simultaneously:

  - Simple enough for anyone to understand
  - Powerful enough to produce real research results
  - Scalable enough to exploit large compute clusters
  - General enough to apply to any optimization domain

This is, perhaps, the strongest evidence yet that autonomous research is not a future aspiration but a present reality — limited primarily by the quality of the research program (the natural-language instructions) and the availability of compute.


This analysis is based on the SkyPilot blog post "Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster" (March 2026), the instructions.md and experiment.yaml from skypilot/examples/autoresearch, and the original autoresearch repository. The experiment used Claude Code on a Kubernetes cluster backed by CoreWeave with 13 H100 and 3 H200 GPUs.