SkyPilot Scaling Autoresearch
Scaling Karpathy's Autoresearch to Parallel GPU Clusters with Emergent Research Strategies
Organization: SkyPilot / UC Berkeley Sky Computing Lab
Published: March 2026
Type: Technical Blog Post + Open-Source Extension Report
Report Type: PhD-Level Technical Analysis Report
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
Blog Post URL: blog.skypilot.co/scaling-autoresearch/
Code Repository: skypilot/examples/autoresearch
Parent System: Built on top of Karpathy's autoresearch (see companion document: Karpathy Autoresearch)
SkyPilot Project: github.com/skypilot-org/skypilot
SkyPilot Skill Docs: docs.skypilot.co/en/latest/getting-started/skill.html
Publication Date: March 2026
Paradigm: This work extends autoresearch from sequential single-GPU hill-climbing to parallel multi-GPU factorial experimentation. The key innovation is not just "more GPUs" but a qualitative shift in search strategy: parallel resources enable the agent to run factorial grids, discover interaction effects, and develop emergent strategies for exploiting heterogeneous hardware — behaviors that are impossible in sequential mode.
Lineage:
nanochat (Karpathy)
│
▼
autoresearch (Karpathy, March 2026)
│ Single GPU, sequential, greedy hill-climbing
│
▼
SkyPilot Autoresearch (SkyPilot team, March 2026)
│ 16 GPUs, parallel, factorial grids
│ Emergent hardware-aware strategy
│
▼
[Future: distributed autoresearch networks?]
2 Authors and Team
SkyPilot Team
SkyPilot is developed by the Sky Computing Lab at UC Berkeley, with contributions from the broader open-source community. The project was initiated by Professor Ion Stoica's lab and has grown into a widely-used infrastructure tool for ML workloads.
Key contributors to the SkyPilot project include:
| Role | Affiliation | Contribution |
|---|---|---|
| Core developers | UC Berkeley Sky Lab | SkyPilot framework, Kubernetes integration |
| Infrastructure team | Various | Cloud provider backends, resource scheduling |
| ML team | Various | Agent skill system, autoresearch integration |
Context: SkyPilot as Infrastructure
SkyPilot is not an ML research project — it is a cloud infrastructure abstraction layer that enables ML researchers to launch GPU workloads across any cloud or Kubernetes cluster from a single YAML file. The autoresearch extension is a showcase of SkyPilot's agent integration capabilities, demonstrating that infrastructure tools can transform the behavior of autonomous research agents.
The SkyPilot team's contribution is primarily systems engineering rather than ML research: they identified that the sequential bottleneck in autoresearch was an infrastructure limitation, not an algorithmic one, and showed that removing it unlocks qualitatively different agent behaviors.
Relationship to Karpathy
The SkyPilot autoresearch extension builds directly on Karpathy's original autoresearch, preserving all of its design constraints (fixed 5-minute budget, single-file modification, val_bpb metric) while replacing the execution substrate (local GPU → distributed GPU cluster). The original program.md is fetched and followed by the agent; the SkyPilot instructions.md adds the parallelism layer on top.
3 Core Contribution
Key Novelty: Scaling autoresearch from 1 GPU to 16 GPUs didn't just make it 16x faster — it caused a qualitative shift in the agent's research strategy, from greedy hill-climbing to factorial experimental design, and triggered an emergent capability: the agent independently discovered how to exploit heterogeneous GPU hardware for a two-tier screening/validation workflow.
Three Core Contributions
1. Parallelism changes search strategy, not just speed.
The most important finding is that parallel resources don't merely accelerate the sequential search — they enable fundamentally different search strategies:
| Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|
| Greedy hill-climbing | Factorial grid search |
| 1 experiment per decision | 10-13 experiments per decision |
| Tests one hypothesis at a time | Tests interaction effects between parameters |
| Can miss non-monotonic improvements | Captures full response surface in one wave |
| ~10 experiments/hour | ~90 experiments/hour |
The aspect ratio discovery (Phase 2 in the results) perfectly illustrates this: sequentially, the agent might test AR=64, see no improvement, and abandon the direction. In parallel, it tested AR=48, 64, 72, 80, 90, 96, and 112 simultaneously, immediately saw the monotonic trend, and zeroed in on AR=96. One 5-minute wave replaced six sequential experiments.
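The wave-based factorial design can be sketched in a few lines. The `design_wave` helper below is illustrative, not code from the repository; only the parameter names and the `gpu-01`, `gpu-02`, ... cluster naming convention come from the post:

```python
from itertools import product

def design_wave(param_grid, max_clusters=16):
    """Expand a factorial grid into one-experiment-per-cluster assignments.

    Hypothetical helper mirroring the agent's wave design; cluster names
    follow the gpu-01, gpu-02, ... convention from instructions.md.
    """
    combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
    if len(combos) > max_clusters:
        raise ValueError(f"wave of {len(combos)} exceeds {max_clusters} clusters")
    return {f"gpu-{i + 1:02d}": combo for i, combo in enumerate(combos)}

# The Phase 2 aspect-ratio wave: seven values in one 5-minute window
wave = design_wave({"ASPECT_RATIO": [48, 64, 72, 80, 90, 96, 112]})
```

With two or three parameters, the same helper produces the full cross-product, which is exactly what makes interaction effects observable in a single wave.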
2. Emergent hardware-aware strategy.
Without any instruction about GPU performance differences, the agent independently:
- Discovered that H200 clusters produced better val_bpb than H100 clusters for identical configurations
- Diagnosed the cause (H200 completes ~9% more training steps in 5 minutes)
- Developed a two-tier workflow: screen hypotheses cheaply on H100s, validate winners on H200s
- Recognized that parameter rankings don't always transfer across hardware types
This is an emergent behavior — it was not programmed, not prompted, and not anticipated. The agent developed a sophisticated experimental methodology from first principles by observing its own results.
3. Infrastructure abstraction enables agent autonomy.
The SkyPilot "skill" system teaches LLM agents to manage GPU infrastructure without human intervention. The agent learns to:
- Provision clusters (sky launch)
- Submit experiments (sky exec)
- Pipeline jobs on the same cluster (zero-idle-time queuing)
- Check results (sky logs)
- Tear down resources (sky down)
- Manage workdir isolation for parallel code variants
This transforms the agent from a "code editor that waits for a GPU" into a "research lab manager that orchestrates a compute cluster."
Relationship to Prior Work
| System | Parallelism | Agent Control | Search Strategy | Hardware Awareness |
|---|---|---|---|---|
| Grid Search | Embarrassingly parallel | None (human defines grid) | Exhaustive | None |
| Optuna/Ray Tune | Multi-trial parallel | Bayesian/RL controller | Principled optimization | Manual |
| AlphaEvolve | 100+ evaluators | Evolutionary controller | MAP-Elites + islands | Managed internally |
| Autoresearch (1 GPU) | Sequential | LLM agent | Greedy hill-climbing | Single GPU |
| SkyPilot Autoresearch | 16 GPU cluster | LLM agent | Factorial grids | Emergent |
The SkyPilot extension occupies a unique position: it gives the same agent used in sequential autoresearch access to parallel resources, and observes how the agent's strategy naturally evolves. Unlike AlphaEvolve (where parallelism is engineered into the system design), here parallelism is an infrastructure capability that the agent learns to exploit.
4 Supported Solutions
Primary Domain: Same as Autoresearch
The SkyPilot extension targets the same domain as Karpathy's autoresearch — GPT model training optimization within a 5-minute compute budget. Everything the agent can modify remains the same:
| Category | Mutable Parameters | Parallel Advantage |
|---|---|---|
| Architecture | ASPECT_RATIO, DEPTH, WINDOW_PATTERN | Test 6+ aspect ratios in one wave |
| Optimizer | MATRIX_LR, ADAM_BETAS, WEIGHT_DECAY, Muon params | Full factorial over optimizer settings |
| LR Schedule | WARMUP_RATIO, WARMDOWN_RATIO, FINAL_LR_FRAC | Sweep entire schedule space simultaneously |
| Batch Size | TOTAL_BATCH_SIZE, DEVICE_BATCH_SIZE | Test multiple batch sizes + other params |
| Activations | MLP activation function | A/B/C test activation variants |
| Attention | Head configuration, sliding window | Compare window patterns |
Extended Solution Space: Multi-Factor Experiments
The key difference from sequential autoresearch is the ability to test combinations:
Sequential (one-at-a-time):
Experiment 1: batch_size = 2^18 → val_bpb = 0.985
Experiment 2: weight_decay = 0.08 → val_bpb = 0.990
Experiment 3: batch_size=2^18 + wd=0.08 → val_bpb = 0.982 ← interaction!
Total time: 15 minutes
Parallel (factorial):
Wave 1 (5 minutes, 4 GPUs):
batch_size=2^18, wd=0.2 → 0.985
batch_size=2^18, wd=0.08 → 0.982 ← best (interaction found immediately)
batch_size=2^19, wd=0.2 → 0.997
batch_size=2^19, wd=0.08 → 0.993
Total time: 5 minutes
Factorial grids reveal interaction effects that sequential one-at-a-time testing misses. In the example above, the combination of lower batch size + lower weight decay produces a synergistic improvement that neither change alone would predict.
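The interaction effect in the 2×2 example above can be computed directly from the four cells, using the numbers quoted in this section (a sketch; a one-at-a-time search never runs the fourth cell, so it cannot measure this term):

```python
# val_bpb from the four cells of the 2x2 wave above: (batch_tokens, weight_decay)
grid = {
    (2**18, 0.20): 0.985,
    (2**18, 0.08): 0.982,  # best cell
    (2**19, 0.20): 0.997,
    (2**19, 0.08): 0.993,
}

# Effect of lowering weight decay, measured at each batch size
wd_effect_small_bs = grid[(2**18, 0.08)] - grid[(2**18, 0.20)]
wd_effect_large_bs = grid[(2**19, 0.08)] - grid[(2**19, 0.20)]

# Interaction term: how much the weight-decay effect depends on batch size
interaction = round(wd_effect_small_bs - wd_effect_large_bs, 3)
```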
Hardware-Heterogeneous Experimentation
A novel solution type enabled by the parallel setup is hardware-aware model optimization:
┌──────────────┐
│ Hypothesis │
│ Generator │
└──────┬───────┘
│
┌────────────┼────────────┐
│ │
┌───────▼───────┐ ┌────────▼────────┐
│ SCREENING │ │ VALIDATION │
│ (13 × H100) │ │ (3 × H200) │
│ │ │ │
│ 10+ hypotheses│ │ Top 2-3 │
│ tested cheap │ │ promoted for │
│ in parallel │ │ confirmation │
└───────┬───────┘ └────────┬────────┘
│ │
└────────────┬────────────┘
│
┌──────▼───────┐
│ DECISION │
│ keep/discard │
└──────────────┘
The agent developed this two-tier screening/validation workflow autonomously, recognizing that:
- H200s are ~9% faster (more training steps in 5 minutes)
- H200s produce systematically lower val_bpb for the same configuration
- Parameter rankings sometimes differ across hardware types
- The limited number of H200s (3 out of 16) should be reserved for validation
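A minimal sketch of the two-tier allocation: `plan_wave` is a hypothetical helper, and only the 13/3 H100/H200 split and the lower-is-better val_bpb metric come from the post:

```python
def plan_wave(new_hypotheses, prior_results, n_h100=13, n_h200=3):
    """Sketch of the emergent screening/validation split (hypothetical).

    prior_results maps experiment_id -> val_bpb from the last screening
    wave. Lower val_bpb is better, so the top candidates are promoted to
    the scarce H200s for validation while new ideas screen on H100s.
    """
    winners = sorted(prior_results, key=prior_results.get)[:n_h200]
    screens = new_hypotheses[:n_h100]
    return {"validate_on_h200": winners, "screen_on_h100": screens}

plan = plan_wave(
    new_hypotheses=[f"hyp-{i}" for i in range(20)],
    prior_results={"exp-01": 0.985, "exp-02": 0.982, "exp-03": 0.990, "exp-04": 0.979},
)
```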
5 LLM Integration
Agent: Claude Code (Anthropic)
The SkyPilot experiment used Claude Code as the autonomous agent. Key characteristics:
| Aspect | Detail |
|---|---|
| Agent | Claude Code (Anthropic) |
| Run Duration | ~8 hours |
| Experiments Submitted | ~910 total |
| Experiments with Valid Results | ~700 |
| Concurrent Clusters | 16 |
| GPU Types | 13 × H100, 3 × H200 |
Dual-Layer Instruction System
The agent operates under two layers of instructions:
Layer 1: program.md (Karpathy's original)
- What can/cannot be modified
- The simplicity criterion
- The keep/discard logic
- The "never stop" directive
Layer 2: instructions.md (SkyPilot extension)
- How to use SkyPilot for parallel execution
- Cluster naming convention (gpu-01, gpu-02, ...)
- Workdir isolation for parallel code variants
- Results tracking in results.tsv with experiment IDs
- Maximum 4 clusters limit (relaxed to 16 in practice)
The agent fetches both documents at startup:
1. Read program.md → understand research rules
2. Load SkyPilot skill → learn infrastructure commands
3. Read autoresearch codebase → understand train.py
4. Ask about infra preference → configure YAML
SkyPilot Skill System
The SkyPilot "skill" is a natural-language instruction document that teaches LLM agents to use SkyPilot. This is a form of tool-use prompting — the agent reads documentation about a tool and then uses it:
Agent reads skill document:
"To launch a GPU cluster, use: sky launch gpu-01 experiment.yaml -d -y"
"To check results, use: sky logs gpu-01"
"To pipeline experiments, use: sky exec gpu-01 experiment.yaml -d"
Agent then uses these commands autonomously for 8 hours.
The skill system is notable because it requires zero code changes to the agent — the same Claude Code instance that runs sequential autoresearch can run parallel autoresearch, simply by reading different instructions.
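The submission pattern can be sketched as a small command builder. The flags (`-d`, `-y`, `--env EXPERIMENT_ID=...`) are those quoted from the skill document; `sky_commands_for_wave` itself is a hypothetical helper, and the commands are built as strings rather than executed:

```python
def sky_commands_for_wave(assignments, yaml_path="experiment.yaml"):
    """Build SkyPilot CLI invocations for one wave of experiments.

    Illustrative helper (not from the repository): -d detaches so the
    agent can keep planning, -y skips confirmation for autonomous use,
    and --env tags each run. assignments maps cluster name -> experiment id.
    """
    return [
        f"sky launch {cluster} {yaml_path} -d -y --env EXPERIMENT_ID={exp_id}"
        for cluster, exp_id in assignments.items()
    ]

cmds = sky_commands_for_wave({"gpu-01": "exp-01", "gpu-02": "exp-02"})
```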
Agent Decision-Making at Scale
With 16 GPUs, the agent's decision-making complexity increases substantially:
Sequential agent decision tree:
observe result → keep or discard → pick next experiment
Parallel agent decision tree:
observe 16 concurrent results →
identify trends across results →
decide which changes to keep/combine →
design next wave of 10-16 experiments →
allocate experiments to GPU types →
manage cluster pipeline queues →
handle crashes on individual clusters →
promote winners to validation tier
The agent manages this complexity without any explicit planning framework — it reasons about experimental design in natural language, within its context window.
Agent Reasoning Examples
The blog post captures several instances of the agent's reasoning in real time:
Discovering hardware heterogeneity:
"Only 3 H200 clusters: gpu-03, gpu-04, gpu-08! The rest are H100. This explains everything — H200 is significantly faster than H100. In the same 5-minute budget, H200 can do MORE training steps. More steps = better val_bpb."
Quantifying the difference:
"H200 runs 9% more steps in the same time! That directly leads to better val_bpb. All my 'best' results should be normalized by hardware."
Developing a strategy:
"Since H200 gets ~9% more steps than H100 in the same 5-minute budget, and I have only 3 H200 clusters, I should focus experiments on H200 clusters. The real optimization contest is on H200."
These quotes demonstrate that the agent performs genuine scientific reasoning — observing data, forming hypotheses, and modifying its experimental strategy accordingly.
6 Key Results
Headline Results
| Metric | Value |
|---|---|
| Starting val_bpb | 1.003 (baseline) |
| Final val_bpb | 0.974 |
| Total improvement | 0.029 bpb (~2.9% reduction) |
| Total experiments | ~910 submitted |
| Valid experiments | ~700 |
| Runtime | ~8 hours |
| GPUs used | 16 (13 × H100 + 3 × H200) |
| Throughput | ~90 experiments/hour |
| Sequential baseline equivalent | ~72 hours |
| Speedup | 9× wall-clock |
Five Phases of Discovery
The agent's research naturally organized into five phases. These were not pre-planned — each phase emerged from the results of the previous one:
val_bpb
1.003 ┤● Phase 1: Hyperparameter Sweeps
│ ╲
0.990 ┤ ╲ (~200 experiments)
│ ╲ - batch_size 2^18 > 2^19
│ ╲ - adam_betas (0.9, 0.95)
0.981 ┤ ● - weight_decay 0.08
│ ╲
│ ╲ Phase 2: Architecture Discovery
0.977 ┤ ● (~200 experiments)
│ ╲ - ASPECT_RATIO 96 > 64
│ ╲ (biggest single jump)
0.975 ┤ ● Phase 3: Fine-Tuning
│ ╲ (~140 experiments)
│ ╲
0.974 ┤ ● Phase 4: Optimizer Tuning
│ ╲ (~140 experiments)
│ ╲ - muon_beta2 = 0.98
│ ╲──── Phase 5: Diminishing Returns
│ (~200 experiments)
└───────────────────────────────────────► experiments
0 200 400 600 800 910
Phase-by-Phase Analysis
Phase 1: Hyperparameter Sweeps (experiments 0-200)
Strategy: Broad factorial grids over obvious hyperparameters
Experiments per wave: 10-13 simultaneous
Key discoveries:
| Parameter | Default | Optimal | Impact (Δ val_bpb) |
|---|---|---|---|
| `TOTAL_BATCH_SIZE` | 2^19 | 2^18 | -0.010 |
| `ADAM_BETAS` | (0.8, 0.95) | (0.9, 0.95) | -0.005 |
| `WEIGHT_DECAY` | 0.2 | 0.08 | -0.003 |
| Softcap | 15 | 10 | -0.001 |
| Deeper models (depth 10+) | — | All crashed or hurt | N/A |
Phase outcome: val_bpb 1.003 → 0.981 (Δ = 0.022)
Analysis: The batch size reduction is the most impactful single hyperparameter finding. With a fixed 5-minute wall-clock budget, reducing batch size from 2^19 (~524K tokens) to 2^18 (~262K tokens) doubles the number of optimizer steps. Despite seeing half as many tokens per step, the additional optimization steps more than compensate. This finding — that optimizer step count dominates data throughput for short training budgets — is a well-known phenomenon from the critical-batch-size literature, but the agent discovered it empirically.
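The step-count arithmetic behind this finding is simple to verify. The throughput constant below is a made-up illustrative figure, not a measured number from the post:

```python
BUDGET_SECONDS = 300                 # the fixed 5-minute training budget
TOKENS_PER_SECOND = 1_750_000        # hypothetical GPU throughput, illustration only

def optimizer_steps(batch_tokens):
    """Steps completed within the fixed wall-clock budget at a given batch size."""
    return (BUDGET_SECONDS * TOKENS_PER_SECOND) // batch_tokens

steps_large = optimizer_steps(2**19)  # ~524K tokens per step
steps_small = optimizer_steps(2**18)  # ~262K tokens per step
# Halving the batch size doubles the optimizer step count while total
# tokens processed stays roughly constant
```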
Phase 2: Architecture Discovery (experiments 200-420)
Strategy: Factorial grid over model aspect ratios
Critical experiment: 6 aspect ratios tested in a single 5-minute wave
Wave results (single 5-minute window):
AR=48 (model_dim=384) → val_bpb = 0.985
AR=64 (model_dim=512) → val_bpb = 0.982
AR=72 (model_dim=576) → val_bpb = 0.980
AR=80 (model_dim=640) → val_bpb = 0.979
AR=90 (model_dim=720) → val_bpb = 0.978
AR=96 (model_dim=768) → val_bpb = 0.977 ← best
AR=112 (model_dim=896) → val_bpb = 0.984 ← too large, not enough steps
Phase outcome: val_bpb 0.981 → 0.977 (Δ = 0.004)
Analysis: This is the phase where parallelism paid off most dramatically. The response surface over aspect ratio is non-monotonic — performance improves with width up to AR=96, then degrades because the larger model completes fewer training steps in the 5-minute budget.
Sequentially, the agent would likely have tested AR=64 (no improvement), maybe AR=72 (marginal), and possibly given up. The parallel factorial revealed the complete curve, including the optimal point and the explanation for the degradation at AR=112.
Putting the architectural gain in context of the earlier hyperparameter work:
All optimizer/hyperparameter tuning (Phase 1): Δ = 0.022
Model width scaling (Phase 2, on top of Phase 1): Δ = 0.004
Width + tuning combined: 0.977 vs. 1.003 baseline
The cumulative Phase 1 delta is larger, but the aspect-ratio change was the largest jump from any single modification once the initial hyperparameter gains were exhausted.
The optimal AR=96 translates to model_dim = 768, producing a model with:
- ~1,060 training steps on H100 (vs. ~1,450 for AR=48)
- 4× more parameters than the AR=48 model
- Fits within 64GB VRAM
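Assuming the convention that aspect ratio is model width divided by depth (consistent with the AR-to-model_dim mapping in the wave results above, e.g. AR=48 → 384 at depth 8), the 4× parameter claim checks out, since weight-matrix parameters grow roughly quadratically with width:

```python
DEPTH = 8  # transformer layers, as in the best configuration

def model_dim(aspect_ratio, depth=DEPTH):
    # aspect ratio = model width / depth, so width = depth * aspect_ratio
    return depth * aspect_ratio

def relative_params(ar, baseline_ar=48):
    """Weight-matrix parameter count scales ~quadratically with width."""
    return (model_dim(ar) / model_dim(baseline_ar)) ** 2

# AR=96 gives model_dim 768 and ~4x the matrix parameters of AR=48
```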
Phase 3: Fine-Tuning the Wider Model (experiments 420-560)
Strategy: Re-optimize hyperparameters for the new AR=96 architecture
Key discoveries:
| Parameter | Finding |
|---|---|
| Warmdown schedule | Optimal warmdown differs for wider model |
| Matrix learning rate | Re-tuned for AR=96 |
| Weight decay | Re-optimized for larger model |
| Newton-Schulz steps | Muon orthogonalization iterations re-tuned |
Phase outcome: val_bpb 0.977 → 0.975 (Δ = 0.002)
Analysis: This phase demonstrates an important principle: hyperparameters optimized for one architecture may be suboptimal for another. After the architectural change in Phase 2, the agent correctly re-swept the optimizer settings for the new model. This is analogous to how human researchers re-tune learning rates after architectural changes.
Phase 4: Optimizer Tuning (experiments 560-700)
Strategy: Deep exploration of Muon optimizer parameters
Critical discovery: muon_beta2 = 0.98 (up from 0.95)
The Muon optimizer's beta2 parameter controls the exponential moving average for the second-momentum buffer in the NorMuon variance reduction:
second_momentum_buffer.lerp_(v_mean, 1 - beta2)
Increasing beta2 from 0.95 to 0.98 smooths the normalization, allowing the model to take larger effective steps. The agent discovered this by testing beta2 ∈ {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave.
Phase outcome: val_bpb 0.975 → 0.974 (Δ = 0.001)
Analysis: This is the largest late-stage improvement and came from deep domain knowledge — the agent understood (or discovered through testing) that the optimizer's variance normalization was too aggressive for the wider model. Higher beta2 means slower adaptation of the normalization, which helps in the later stages of training where the loss landscape is smoother.
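The `lerp_` call above is an exponential moving average, and beta2 sets its effective averaging horizon. A sketch of the standard EMA identity (not NorMuon code):

```python
def ema_update(buffer, value, beta2):
    """Equivalent to second_momentum_buffer.lerp_(v_mean, 1 - beta2):
    buffer <- beta2 * buffer + (1 - beta2) * value."""
    return beta2 * buffer + (1 - beta2) * value

def effective_horizon(beta2):
    """Approximate number of recent steps the EMA averages over: 1/(1-beta2)."""
    return 1 / (1 - beta2)

# beta2=0.95 averages over ~20 steps; beta2=0.98 over ~50 steps,
# so the variance normalization adapts more slowly (is smoother)
```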
Phase 5: Diminishing Returns (experiments 700-910)
Strategy: Combinatorial sweeps over remaining parameters
Explored: Final LR fraction, warmdown ratio, scalar LR, embedding LR
Phase outcome: val_bpb held at ~0.974 (< 0.001 additional improvement)
Analysis: The improvement curve had flattened. The agent continued running experiments but each yielded diminishing returns:
| Phase | Experiments | Δ val_bpb | Return per Experiment |
|---|---|---|---|
| 1 | 200 | 0.022 | 1.1 × 10⁻⁴ |
| 2 | 220 | 0.004 | 1.8 × 10⁻⁵ |
| 3 | 140 | 0.002 | 1.4 × 10⁻⁵ |
| 4 | 140 | 0.001 | 7.1 × 10⁻⁶ |
| 5 | 210 | <0.001 | < 4.8 × 10⁻⁶ |
The return per experiment declined by ~20× from Phase 1 to Phase 5. In a cost-aware setting, the agent should have a stopping criterion — but the "NEVER STOP" directive from program.md keeps it exploring.
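The declining return per experiment suggests a simple marginal-gain stopping rule. Everything below is hypothetical (`should_stop`, the epsilon threshold), except the per-phase figures, which come from the table above; the Phase 5 delta is stated only as an upper bound and is treated as 0.0005 here:

```python
# (experiments, delta val_bpb) per phase, from the table above
phases = {
    1: (200, 0.022),
    2: (220, 0.004),
    3: (140, 0.002),
    4: (140, 0.001),
    5: (210, 0.0005),  # upper-bound guess for "<0.001"
}

def return_per_experiment(n_experiments, delta):
    return delta / n_experiments

def should_stop(n_experiments, delta, epsilon=1e-5):
    """Hypothetical stopping rule: halt when the marginal gain per
    experiment falls below epsilon. The NEVER STOP directive from
    program.md overrides any such rule."""
    return return_per_experiment(n_experiments, delta) < epsilon
```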
Best Configuration Found
# Architecture
ASPECT_RATIO = 96 # model_dim = 8 * 96 = 768
DEPTH = 8 # 8 transformer layers
WINDOW_PATTERN = "SL" # alternating Sliding + Local attention
# Training
TOTAL_BATCH_SIZE = 2**18 # ~262K tokens/step (halved from default)
# Learning rates
MATRIX_LR = 0.05 # Muon LR for weight matrices (up from 0.04)
EMBEDDING_LR = 0.6 # AdamW LR for token embeddings
SCALAR_LR = 0.5 # AdamW LR for residual mixing scalars
# Optimizer
ADAM_BETAS = (0.70, 0.95) # lower beta1 than default
WEIGHT_DECAY = 0.08 # reduced from 0.2
WARMDOWN_RATIO = 0.6 # increased from 0.5
FINAL_LR_FRAC = 0.05 # non-zero final LR
# Muon: momentum=0.95, ns_steps=5, beta2=0.98
Comparison: Parallel vs. Sequential
| Metric | Sequential (1 GPU) | Parallel (16 GPUs) | Ratio |
|---|---|---|---|
| Experiments/hour | ~10 | ~90 | 9× |
| Time to best result | ~72 hours (estimated) | ~8 hours | 9× faster |
| Information per decision | 1 experiment | 10-13 experiments | 10-13× |
| Interaction effects found | Rarely | Routinely | Qualitative |
| Hardware-aware strategy | Impossible | Emergent | N/A |
| Total experiments | ~720 (equiv. runtime) | ~910 | 1.26× |
| API cost | ~$7 | ~$9 | 1.3× |
| GPU cost | ~$144 (72h × $2.00/h) | ~$263 (13 H100 × 8h × $2.00/h + 3 H200 × 8h × $2.30/h) | 1.8× |
| Total cost | ~$151 | ~$272 | 1.8× |
The parallel approach is ~1.8× more expensive in total cost but reaches the same result 9× faster. In research settings where wall-clock time is the binding constraint, this is a highly favorable trade-off.
7 Reproducibility
Reproducibility Score: High (with Infrastructure Requirements)
| Factor | Assessment | Detail |
|---|---|---|
| Code availability | Full source, open-source | SkyPilot repo + autoresearch repo |
| Instructions | Complete instructions.md | Self-contained agent instructions |
| Data availability | Public dataset | Same as autoresearch (climbmix-400b) |
| Compute requirements | GPU cluster (Kubernetes/cloud) | Higher barrier than single-GPU autoresearch |
| Agent determinism | Non-deterministic | LLM agent makes different choices each run |
| Infrastructure determinism | Variable | GPU availability, scheduling, hardware mix |
| One-line setup | Available | `curl ... \| bash` install script |
Reproduction Steps
Quick start (one-line):
curl -sL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autoresearch/setup.sh | bash
cd autoresearch
claude "Read instructions.md and start running parallel experiments."
Manual setup:
# 1. Clone both repos
git clone https://github.com/karpathy/autoresearch.git
git clone https://github.com/skypilot-org/skypilot.git
cd autoresearch
# 2. Copy SkyPilot experiment files
cp ../skypilot/examples/autoresearch/experiment.yaml .
cp ../skypilot/examples/autoresearch/instructions.md .
# 3. Prepare data locally
pip install uv && uv sync && uv run prepare.py
# 4. Install SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html
# 5. Launch agent
# "Read instructions.md and start running parallel experiments"
Infrastructure Requirements
| Requirement | Minimum | Recommended | Used in Blog Post |
|---|---|---|---|
| GPU count | 2 | 8-16 | 16 |
| GPU type | Any NVIDIA | H100 / H200 | 13 × H100 + 3 × H200 |
| Infrastructure | Any SkyPilot backend | Kubernetes | Kubernetes (CoreWeave) |
| Agent | Any coding agent | Claude Code | Claude Code |
| SkyPilot | Latest | Latest | Latest |
Supported infrastructure backends:
# experiment.yaml — the infra field selects the backend
resources:
accelerators: {H100:1, H200:1}
image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
infra: k8s # Kubernetes
# infra: aws # Amazon Web Services
# infra: gcp # Google Cloud Platform
# infra: azure # Microsoft Azure
# infra: nebius # Nebius AI Cloud
# infra: slurm # SLURM HPC clusters
# (20+ backends supported)
Variability Across Runs
Because the LLM agent makes different decisions each run, exact results will vary. However, the qualitative findings should reproduce:
- Batch size reduction is consistently discovered early
- Model width scaling provides the largest architectural gain
- Diminishing returns appear after ~500 experiments
- Hardware heterogeneity is independently discovered (if present)
- The agent develops wave-based experimental design
The quantitative results (exact val_bpb values, number of experiments per phase) will differ across runs, agents, and hardware configurations.
8 Compute and API Costs
Detailed Cost Breakdown
GPU compute (as reported in the blog post):
| Resource | Count | Hours | Rate | Cost |
|---|---|---|---|---|
| H100 GPUs | 13 | 8 | ~$2.00/hr | ~$208 |
| H200 GPUs | 3 | 8 | ~$2.30/hr | ~$55 |
| GPU subtotal | | | | ~$263 |
LLM API costs:
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Claude Code session (~8h) | ~2M input + ~500K output | Anthropic API rates | ~$9 |
| API subtotal | | | ~$9 |
Total cost: ~$272-300
Cost Efficiency Analysis
Cost per experiment:
GPU: $263 / 910 experiments ≈ $0.29/experiment
API: $9 / 910 experiments ≈ $0.01/experiment
Total: ~$0.30/experiment
Cost per val_bpb improvement:
Total improvement: 0.029 (1.003 → 0.974)
Cost per 0.001 bpb: ~$300 / 29 ≈ $10.30
Cost vs. sequential:
Sequential (72h on 1 H100):
GPU: 72 × $2 = $144
API: ~$7
Total: ~$151
Parallel (8h on 16 GPUs):
GPU: ~$263
API: ~$9
Total: ~$272
Premium for 9× speedup: ~$121 (1.8×)
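The premium arithmetic above can be verified directly from the reported rates:

```python
# Rates and durations as reported in the post
sequential_gpu = 72 * 2.00                    # 72 h on one H100 at ~$2/h
sequential_total = sequential_gpu + 7         # plus ~$7 of API usage

parallel_gpu = 13 * 8 * 2.00 + 3 * 8 * 2.30   # 13 H100s + 3 H200s for 8 h
parallel_total = parallel_gpu + 9             # plus ~$9 of API usage

premium = parallel_total - sequential_total   # dollar cost of the speedup
speedup = 72 / 8                              # wall-clock ratio
```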
Cost Comparison with Other Systems
| System | Typical Run Cost | Experiments | Cost/Experiment | Human Hours |
|---|---|---|---|---|
| Manual researcher | $800+ (8h salary) | 10-20 | $40-80 | 8 |
| Autoresearch (1 GPU) | ~$18 | ~80 | $0.22 | 0.1 (setup) |
| SkyPilot Autoresearch | ~$300 | ~910 | $0.33 | 0.1 (setup) |
| AlphaEvolve (estimated) | $10,000+ | 1000+ | ~$10 | Multiple engineers |
SkyPilot autoresearch occupies a favorable cost-performance point: roughly 16× the run cost of an overnight single-GPU autoresearch run (~$300 vs. ~$18), but 11× more experiments and 9× faster wall-clock time than an equivalent sequential search. The higher marginal cost per experiment ($0.33 vs. $0.22) reflects GPU idle time during agent planning phases: with 16 GPUs, some inevitably wait while the agent designs the next wave.
9 Architecture Solution
System Architecture
┌──────────────────────────────────────────────────────────┐
│ CLAUDE CODE AGENT │
│ │
│ ┌────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Hypothesis │ │ Experiment │ │ Result │ │
│ │ Generator │ │ Designer │ │ Analyzer │ │
│ │ │ │ (factorial │ │ (cross-GPU │ │
│ │ (reads │ │ grids) │ │ comparison) │ │
│ │ results, │ │ │ │ │ │
│ │ plans │ │ Allocates │ │ Identifies │ │
│ │ next │ │ experiments │ │ trends, │ │
│ │ wave) │ │ to clusters │ │ interaction │ │
│ │ │ │ │ │ effects │ │
│ └──────┬─────┘ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ └───────────────┼──────────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Cluster Manager │ │
│ │ (SkyPilot skill) │ │
│ │ │ │
│ │ sky launch │ │
│ │ sky exec │ │
│ │ sky logs │ │
│ │ sky status │ │
│ │ sky down │ │
│ └──────────┬─────────┘ │
└─────────────────────────┼───────────────────────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌────────▼──────┐ ┌──▼───────┐ ┌──▼──────────┐
│ instructions │ │ program │ │ results.tsv │
│ .md │ │ .md │ │ │
│ (SkyPilot │ │ (Karpathy│ │ experiment_id│
│ parallelism) │ │ rules) │ │ val_bpb │
└───────────────┘ └──────────┘ │ memory_gb │
│ status │
│ description │
└──────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SkyPilot │ │ SkyPilot │ │ SkyPilot │
│ Cluster │ │ Cluster │ │ Cluster │
│ gpu-01 │ │ gpu-02 │ │ gpu-16 │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ H100 │ │ │ │ H100 │ │ │ │ H200 │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │train.py │ │ │ │train.py │ │ │ │train.py │ │
│ │(variant)│ │ │ │(variant)│ │ │ │(variant)│ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
··· ··· ···
13 × H100 clusters 3 × H200
Key Architectural Differences from Sequential Autoresearch
1. Workdir isolation for parallel code variants.
In sequential autoresearch, there is one train.py and one experiment at a time. In parallel, the agent needs multiple versions of train.py running simultaneously:
# Copy files to isolated per-experiment directory
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# Edit the copy
# (agent modifies /tmp/autoresearch/exp-03/train.py)
# Launch with isolated workdir
sky launch gpu-03 experiment.yaml \
--workdir /tmp/autoresearch/exp-03 \
--env EXPERIMENT_ID=exp-03 \
--env EXPERIMENT_DESC="wider model" -d -y
SkyPilot snapshots the working directory at submission time, so each experiment gets its own frozen copy of the code. This enables true parallel modification without race conditions.
2. Detached execution with job pipelining.
# Launch returns immediately (-d flag)
sky launch gpu-01 experiment.yaml -d -y \
--env EXPERIMENT_ID=exp-01
# Queue next experiment on same cluster
# (starts automatically when exp-01 finishes)
sky exec gpu-01 experiment.yaml -d \
--env EXPERIMENT_ID=exp-02
The -d (detached) flag and sky exec pipelining allow the agent to submit experiments and immediately move on to planning the next wave. Zero GPU idle time between experiments on the same cluster.
3. Result collection via log streaming.
sky logs gpu-01 # latest job
sky logs gpu-01 2 # specific job ID
The agent checks results asynchronously, inspecting logs from completed experiments while others are still running.
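Because the run block emits a structured `EXPERIMENT_RESULT:` line, results can be scraped from `sky logs` output with a simple parser. This is an illustrative sketch; the regex assumes the exact line format produced by the experiment.yaml run script:

```python
import re

# Matches the structured result line emitted by the experiment.yaml run block
RESULT_RE = re.compile(
    r"^EXPERIMENT_RESULT: (?P<id>\S+) val_bpb=(?P<bpb>[\d.]+) memory_gb=(?P<mem>[\d.]+)",
    re.MULTILINE,
)

def parse_results(log_text):
    """Extract (experiment_id, val_bpb, memory_gb) tuples from sky logs output."""
    return [
        (m["id"], float(m["bpb"]), float(m["mem"]))
        for m in RESULT_RE.finditer(log_text)
    ]

log = "step 1000 ...\nEXPERIMENT_RESULT: exp-04 val_bpb=0.985 memory_gb=44.2\n"
```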
Experiment Lifecycle
┌──────────────────────────────────────────────────────────────┐
│ WAVE-BASED EXECUTION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ DESIGN │ │ SUBMIT │ │ COLLECT │ │
│ │ Wave N │ │ 10-16 exps │ │ Results │ │
│ │ │───▶│ to clusters │───▶│ from logs │ │
│ │ Agent plans │ │ sky launch │ │ sky logs │ │
│ │ factorial │ │ sky exec │ │ │ │
│ │ grid │ │ (detached) │ │ ~5 min wait │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ ANALYZE │ │
│ │ Results │ │
│ │ │ │
│ │ - Compare │ │
│ │ across │ │
│ │ GPUs │ │
│ │ - Identify │ │
│ │ trends │ │
│ │ - Update │ │
│ │ results │ │
│ │ .tsv │ │
│ │ - Commit │ │
│ │ winners │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ DESIGN │ │
│ │ Wave N+1 │ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘
10 Component Breakdown
Component 1: experiment.yaml — SkyPilot Task Definition
Purpose: Defines a single GPU experiment as a declarative YAML specification.
resources:
accelerators: {H100:1, H200:1}
image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
infra: k8s
workdir: .
envs:
EXPERIMENT_ID: baseline
EXPERIMENT_DESC: "baseline run"
setup: |
pip install uv
uv sync
uv run prepare.py
run: |
uv run train.py 2>&1 | tee run.log
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -ne 0 ]; then
echo "EXPERIMENT_STATUS: crash"
else
VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
echo "EXPERIMENT_STATUS: done"
echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
fi
echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
Key design choices:
| Choice | Rationale |
|---|---|
| `{H100:1, H200:1}` | Either GPU type acceptable; SkyPilot picks available |
| `docker:nvcr.io/nvidia/pytorch:24.07-py3` | NVIDIA-maintained PyTorch container |
| `infra: k8s` | Kubernetes backend (CoreWeave in the blog post) |
| `setup` block | Runs once per cluster; subsequent experiments skip it |
| `EXPERIMENT_ID` / `EXPERIMENT_DESC` | Passed as environment variables per experiment |
| Structured output | `EXPERIMENT_STATUS`, `EXPERIMENT_RESULT` for easy parsing |
Component 2: instructions.md — Parallel Agent Instructions
Purpose: Extends program.md with SkyPilot-specific parallelism instructions.
Structure:
| Section | Purpose |
|---|---|
| Setup | 4-step initialization (read program.md, load SkyPilot skill, read codebase, ask about infra) |
| Launching Experiments | How to use sky launch, sky exec, workdir isolation |
| Checking Results | sky logs, SSH, sky status, sky queue |
| Tracking Results | Extended results.tsv format with experiment_id |
| Experiment Loop | Parallel version of the autoresearch loop |
| Cleanup | sky down -a -y |
Key behavioral directives:
- "Don't wait": After submitting experiments, move on to the next idea immediately
- "Periodically check": Poll results asynchronously rather than blocking
- "Tear down idle clusters": Resource management directive
- "At most 4 clusters": Original conservative limit (the actual run used 16)
- "NEVER STOP": Inherited from program.md
Component 3: SkyPilot Skill — Agent Infrastructure Education
Purpose: Teaches the LLM agent how to use SkyPilot CLI commands.
The SkyPilot skill is a document that the agent fetches and reads at startup. It teaches:
| Command | Purpose | Agent Use |
|---|---|---|
| `sky launch <name> <yaml>` | Provision cluster + run job | Create new GPU clusters |
| `sky exec <name> <yaml>` | Run job on existing cluster | Pipeline experiments |
| `sky logs <name> [job_id]` | View job output | Check experiment results |
| `sky status` | List all clusters | Monitor infrastructure |
| `sky queue <name>` | List jobs on a cluster | Track pending experiments |
| `sky down <name>` | Tear down cluster | Clean up idle resources |
| `-d` flag | Detached (async) mode | Non-blocking submission |
| `-y` flag | Skip confirmation | Autonomous operation |
| `--workdir` | Set working directory | Code variant isolation |
| `--env` | Pass environment variables | Experiment identification |
The skill system is a form of procedural knowledge injection — the agent acquires new capabilities by reading documentation, without any code changes to the agent itself.
Component 4: Extended results.tsv — Parallel Results Log
Purpose: Tracks all experiments across multiple clusters.
Schema (extended from autoresearch):
```
experiment_id  status   val_bpb   memory_gb  description
exp-01         keep     0.997900  44.0       baseline
exp-02         discard  1.005000  44.0       switch to GeLU
exp-03         crash    0.000000  0.0        double width (OOM)
exp-04         keep     0.985000  44.2       batch_size=2^18
exp-05         keep     0.982000  44.2       batch_size=2^18 + wd=0.08
```
Differences from sequential results.tsv:
| Field | Sequential | Parallel |
|---|---|---|
| ID | git commit hash | experiment_id (agent-assigned) |
| Order | Sequential | Concurrent (multiple experiments per timestamp) |
| Hardware | Implicit (one GPU) | Should be noted in description |
| Volume | ~80/overnight | ~900/overnight |
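The selection step the agent performs over this log can be sketched in a few lines of Python. This is a hypothetical helper (not code from the repository), applied to the sample rows above; lower `val_bpb` is better:

```python
import csv
import io

# Sample mirroring the extended results.tsv schema shown above.
SAMPLE = """experiment_id\tstatus\tval_bpb\tmemory_gb\tdescription
exp-01\tkeep\t0.997900\t44.0\tbaseline
exp-02\tdiscard\t1.005000\t44.0\tswitch to GeLU
exp-03\tcrash\t0.000000\t0.0\tdouble width (OOM)
exp-04\tkeep\t0.985000\t44.2\tbatch_size=2^18
exp-05\tkeep\t0.982000\t44.2\tbatch_size=2^18 + wd=0.08
"""

def best_experiment(tsv_text):
    """Return the kept experiment with the lowest val_bpb."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: float(r["val_bpb"]))

best = best_experiment(SAMPLE)
print(best["experiment_id"], best["val_bpb"])  # exp-05 0.982000
```

Crashed and discarded rows are filtered out first, so an OOM run (recorded as `val_bpb=0.0`) can never be mistaken for a winner.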
Component 5: Cluster Fleet — Managed GPU Infrastructure
Purpose: The physical compute layer managed by SkyPilot.
```
$ sky status
NAME    INFRA  RESOURCES    STATUS
gpu-01  k8s    1x (H100:1)  UP
gpu-02  k8s    1x (H100:1)  UP
gpu-03  k8s    1x (H200:1)  UP   ← H200 (faster)
gpu-04  k8s    1x (H200:1)  UP   ← H200 (faster)
gpu-05  k8s    1x (H100:1)  UP
gpu-06  k8s    1x (H100:1)  UP
gpu-07  k8s    1x (H100:1)  UP
gpu-08  k8s    1x (H200:1)  UP   ← H200 (faster)
gpu-09  k8s    1x (H100:1)  UP
gpu-10  k8s    1x (H100:1)  UP
gpu-11  k8s    1x (H100:1)  UP
gpu-12  k8s    1x (H100:1)  UP
gpu-13  k8s    1x (H100:1)  UP
gpu-14  k8s    1x (H100:1)  UP
gpu-15  k8s    1x (H100:1)  UP
gpu-16  k8s    1x (H100:1)  UP
```
The fleet is heterogeneous (13 H100s + 3 H200s) because SkyPilot allocated GPUs based on availability from the Kubernetes cluster. The agent was not told about this heterogeneity — it discovered it from experimental results.
11 Core Mechanisms (Detailed)
Mechanism 1: Wave-Based Factorial Search
The central algorithmic innovation is the shift from greedy hill-climbing to wave-based factorial search:
Sequential hill-climbing (autoresearch default):
```
for each experiment:
    hypothesis = agent.think(history)
    result = run(hypothesis)          # 5 minutes
    if result.improved:
        keep()
    else:
        discard()
```
Wave-based factorial search (SkyPilot extension):
```
for each wave:
    hypotheses = agent.design_factorial_grid(history)   # 10-16 experiments
    for h in hypotheses:
        submit_async(h, available_cluster)              # returns immediately
    wait_for_results(~5 minutes)
    results = collect_all_results()
    trends = analyze_trends(results)           # e.g., "wider is better"
    interactions = find_interactions(results)  # e.g., "lr × wd synergy"
    for r in results:
        if r.improved:
            commit_winning_config(r)
    next_wave = agent.design_followup(trends, interactions)
```
Why this is a qualitative, not quantitative, change:
In sequential mode, the agent has one data point per decision. It sees "batch_size=2^18 gives val_bpb=0.985" and must decide: is this an improvement? Should I explore further? Is there an interaction with other parameters?
In parallel mode, the agent has 10-16 data points per decision. It sees the complete response surface over the tested parameters and can identify:
- Monotonic trends: "val_bpb decreases monotonically with batch_size from 2^19 to 2^17"
- Optima: "AR=96 is the sweet spot; AR=112 is too large"
- Interaction effects: "weight_decay=0.08 is better, but only when batch_size=2^18"
- Diminishing returns: "all 6 learning rate variants give similar results — LR is not the bottleneck"
This transforms the agent from a sequential decision-maker into an experimental scientist running designed experiments.
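The wave loop can be made concrete as a toy illustration, with a synthetic `val_bpb` surface standing in for a real 5-minute training run and a thread pool standing in for the cluster fleet (the surface and its optimum are invented for the sketch):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a training run: a synthetic val_bpb surface with an
# optimum near batch_log2=18, weight_decay=0.08 (illustrative only).
def run_experiment(config):
    bs, wd = config["batch_log2"], config["weight_decay"]
    return 0.977 + 0.004 * (bs - 18) ** 2 + 0.5 * (wd - 0.08) ** 2

def design_wave(batch_sizes, weight_decays):
    """Factorial grid: every combination becomes one experiment."""
    return [{"batch_log2": b, "weight_decay": w}
            for b, w in itertools.product(batch_sizes, weight_decays)]

wave = design_wave([17, 18, 19], [0.05, 0.08, 0.15, 0.2])  # 3 x 4 = 12 experiments
with ThreadPoolExecutor(max_workers=12) as pool:  # one worker per "cluster"
    results = list(pool.map(run_experiment, wave))

# Collect the whole response surface at once, then pick the winner.
best_cfg, best_bpb = min(zip(wave, results), key=lambda p: p[1])
print(best_cfg, round(best_bpb, 4))
```

Because all twelve results arrive together, the agent can inspect trends along each axis instead of making a keep/discard call on a single point.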
Mechanism 2: Emergent Hardware-Aware Strategy
The most remarkable finding is the agent's autonomous discovery and exploitation of hardware heterogeneity:
Timeline of discovery:
Early waves:
Agent notices: identical configs get different val_bpb on different clusters
Agent wonders: "Why does gpu-03 consistently score better?"
Investigation:
Agent checks: sky status → discovers gpu-03 is H200, others are H100
Agent hypothesizes: "H200 is faster → more training steps → better val_bpb"
Agent verifies: "H200 runs 9% more steps in 5 minutes!"
Strategy development:
Agent concludes: "I have 3 H200s and 13 H100s"
Agent decides: "Screen ideas on H100s, validate winners on H200s"
Agent implements: submits 10+ experiments to H100s, top 2-3 to H200s
Refinement:
Agent discovers: "Rankings sometimes differ across hardware"
Agent learns: "FINAL_LR_FRAC=0.03 beats 0.05 on H100 but loses on H200"
Agent adapts: "The real contest is on H200 — that's the target hardware"
Analysis of the emergent behavior:
This is a textbook example of emergent intelligence — a behavior that arises from the interaction between a capable agent and a complex environment, not from explicit programming:
- The agent was not told about H100 vs. H200 performance differences
- The agent was not instructed to develop a screening/validation workflow
- The agent was not programmed to normalize results by hardware type
- The agent independently:
- Detected the confounding variable (hardware type)
- Diagnosed the mechanism (step count difference)
- Developed a strategy to exploit it (two-tier workflow)
- Identified cases where it matters (ranking inversions)
This behavior would be considered sophisticated experimental methodology if performed by a human researcher. The fact that it emerged autonomously from an LLM coding agent suggests that large language models have internalized principles of experimental design from their training data.
Mechanism 3: SkyPilot Cluster Management
The infrastructure layer handles several complex operations transparently:
Cluster provisioning:
sky launch gpu-01 experiment.yaml -d -y
│
├─ Parses experiment.yaml
├─ Finds available GPU (H100 or H200)
├─ Creates Kubernetes pod with NVIDIA PyTorch image
├─ Syncs workdir to pod
├─ Runs setup block (pip install uv, uv sync, uv run prepare.py)
└─ Runs experiment (uv run train.py)
Job pipelining:
sky exec gpu-01 experiment.yaml -d --env EXPERIMENT_ID=exp-02
│
├─ Reuses existing cluster (skips setup)
├─ Queues job to start when current job finishes
├─ Returns immediately (agent can plan next experiment)
└─ Zero GPU idle time between experiments
Workdir snapshotting:
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# Agent modifies /tmp/autoresearch/exp-03/train.py
sky launch gpu-03 experiment.yaml --workdir /tmp/autoresearch/exp-03 -d -y
│
├─ Snapshots /tmp/autoresearch/exp-03/ at submission time
├─ Syncs snapshot to cluster
└─ Subsequent modifications to /tmp/autoresearch/exp-03/ don't affect running job
This workdir isolation is critical for parallel execution — without it, the agent could not run different code variants simultaneously, as modifications to train.py would race between experiments.
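The isolation property can be demonstrated with plain Python, as a conceptual stand-in for what `sky launch --workdir` does with its submission-time snapshot (paths here are illustrative temp directories):

```python
import shutil
import tempfile
from pathlib import Path

# Snapshot isolation: once a workdir is copied at submission time,
# later edits to the source tree cannot race with the running job.
src = Path(tempfile.mkdtemp()) / "autoresearch"
src.mkdir()
(src / "train.py").write_text("LR = 0.05\n")

snapshot = Path(tempfile.mkdtemp()) / "exp-03"
shutil.copytree(src, snapshot)  # conceptually, the --workdir sync

(src / "train.py").write_text("LR = 0.03\n")  # agent edits for the NEXT experiment

print((snapshot / "train.py").read_text())  # still "LR = 0.05": job unaffected
```

Without the copy, the second write would silently change the code of the experiment already in flight.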
Mechanism 4: Throughput Scaling Analysis
Theoretical throughput model:
T_sequential = T_think + T_train + T_analyze
≈ 30s + 300s + 30s
≈ 360s per experiment
≈ 10 experiments/hour
T_parallel(N) = T_design_wave + T_train + T_collect + T_analyze
≈ 60s + 300s + 60s + 60s
≈ 480s per wave of N experiments
≈ 7.5 waves/hour × N experiments/wave
For N=12 (average wave size):
≈ 7.5 × 12 = 90 experiments/hour
Scaling efficiency:
Ideal speedup: 16× (16 GPUs)
Actual speedup: 9× (90/10)
Efficiency: 56%
Sources of inefficiency:
- Wave design time (agent thinking) ~12% overhead
- Result collection and analysis ~12% overhead
- Cluster provisioning and setup ~5% overhead
- GPU idle time between waves ~10% overhead
- Crashed experiments ~5% waste
Total overhead: ~44%
The 56% scaling efficiency is reasonable for a system where a single agent manages 16 clusters. The primary bottleneck is the agent's thinking time — it takes longer to design a 12-experiment factorial grid than a single experiment. Scaling beyond 16 GPUs would require either faster agent reasoning or multi-agent coordination.
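The throughput model above can be written out directly; the constants are the section's own estimates:

```python
# Throughput model from the analysis above; all times in seconds.
def sequential_rate(t_think=30, t_train=300, t_analyze=30):
    return 3600 / (t_think + t_train + t_analyze)  # experiments/hour

def parallel_rate(n, t_design=60, t_train=300, t_collect=60, t_analyze=60):
    waves_per_hour = 3600 / (t_design + t_train + t_collect + t_analyze)
    return waves_per_hour * n  # experiments/hour

seq = sequential_rate()      # 10.0 experiments/hour
par = parallel_rate(n=12)    # 90.0 experiments/hour at the average wave size
speedup = par / seq          # 9.0
efficiency = speedup / 16    # 0.5625 on a 16-GPU fleet
print(seq, par, speedup, efficiency)
```

Varying `n` in this model also shows why wave size matters: throughput scales linearly with experiments per wave until the fleet (or the agent's design time) saturates.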
Mechanism 5: Experimental Design Progression
The agent's experimental design sophistication increased over the course of the 8-hour run:
Phase 1 (simple sweeps):
Wave: Test 10 values of weight_decay, one variable at a time
weight_decay ∈ {0.01, 0.02, 0.05, 0.08, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0}
Phase 2 (factorial grids):
Wave: Test 7 aspect ratios simultaneously
AR ∈ {48, 64, 72, 80, 90, 96, 112}
(with best hyperparameters from Phase 1 held fixed)
Phase 3 (interaction testing):
Wave: 3 × 4 factorial grid
matrix_lr ∈ {0.03, 0.05, 0.07}
warmdown_ratio ∈ {0.4, 0.5, 0.6, 0.7}
→ 12 experiments in one wave
Phase 4 (targeted sweeps with hardware awareness):
Wave: Test 5 values of muon_beta2, all on H100
beta2 ∈ {0.95, 0.96, 0.97, 0.98, 0.99}
→ then promote best to H200 for validation
Phase 5 (combinatorial fine-tuning):
Wave: 2³ factorial over final tuning parameters
final_lr_frac ∈ {0.03, 0.05}
warmdown_ratio ∈ {0.55, 0.65}
scalar_lr ∈ {0.4, 0.6}
→ 8 experiments, looking for interaction effects
This progression mirrors the behavior of an experienced human researcher: start with broad sweeps to map the landscape, then zoom in on promising regions with factorial designs, then fine-tune with targeted interaction tests.
12 Programming Language
Language: Python + YAML + Bash
The SkyPilot extension adds two language layers to autoresearch's pure-Python design:
| Language | Component | Purpose |
|---|---|---|
| Python | `train.py`, `prepare.py` | Core training (unchanged from autoresearch) |
| YAML | `experiment.yaml` | Infrastructure-as-code (SkyPilot task definition) |
| Bash | `setup` and `run` blocks in YAML | Environment setup and experiment execution |
| Markdown | `instructions.md`, `program.md` | Agent instruction documents |
SkyPilot CLI as an Agent API
The SkyPilot CLI becomes, in effect, an API for the LLM agent:
```shell
# Provisioning API
sky launch <name> <yaml> [--env KEY=VALUE ...] [-d] [-y] [--workdir PATH]

# Execution API
sky exec <name> <yaml> [--env KEY=VALUE ...] [-d]

# Observability API
sky status            # list clusters
sky queue <name>      # list jobs on cluster
sky logs <name> [id]  # view job output

# Lifecycle API
sky down <name> [-y]  # tear down cluster
sky down -a [-y]      # tear down all clusters
```
The agent interacts with this API entirely through natural language understanding of the SkyPilot skill document — there is no SDK, no Python bindings, no programmatic interface. The agent reads documentation and constructs shell commands.
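A hypothetical agent-side wrapper makes this concrete. The helper below only builds argument lists (nothing is executed), mirroring the invocation shape the document uses, `sky launch <name> <yaml> -d -y [--workdir PATH] [--env K=V]`:

```python
# Hypothetical wrapper: assemble a SkyPilot CLI invocation as an argv list.
def sky_launch(name, yaml_path, env=None, workdir=None):
    cmd = ["sky", "launch", name, yaml_path, "-d", "-y"]
    if workdir:
        cmd += ["--workdir", workdir]
    for key, value in (env or {}).items():
        cmd += ["--env", f"{key}={value}"]
    return cmd

cmd = sky_launch("gpu-03", "experiment.yaml",
                 env={"EXPERIMENT_ID": "exp-03"},
                 workdir="/tmp/autoresearch/exp-03")
print(" ".join(cmd))
```

An agent harness could hand such a list to `subprocess.run`; the point is that the entire "API" is just well-formed shell commands, which is exactly what an LLM agent is good at producing from documentation.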
Infrastructure Code Quality
The experiment.yaml is notable for its error handling:
```bash
uv run train.py 2>&1 | tee run.log
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -ne 0 ]; then
  echo "EXPERIMENT_STATUS: crash"
else
  VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
  PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
  MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
  echo "EXPERIMENT_STATUS: done"
  echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
fi
```
Key design choices:
- PIPESTATUS[0] captures the exit code from train.py (not from tee)
- Structured output (EXPERIMENT_STATUS, EXPERIMENT_RESULT) enables reliable parsing
- tee writes both to stdout and file (unlike autoresearch's pure redirect)
- Memory is converted from MB to GB inline for readability
Dependency Management
The SkyPilot extension adds no Python dependencies beyond autoresearch's pyproject.toml. The only additional requirement is the SkyPilot CLI itself, which is installed on the agent's machine (not on GPU clusters). The Docker image (nvcr.io/nvidia/pytorch:24.07-py3) provides the CUDA runtime and PyTorch.
This zero-dependency-addition design is intentional — it preserves autoresearch's simplicity while adding parallelism purely through infrastructure.
13 Memory Management
Agent Context Window Memory
The parallel execution dramatically increases the agent's context management burden:
Sequential context per cycle (~3K tokens):
Read results → Edit code → Run → Read metric → Decide
Parallel context per wave (~15-25K tokens):
Review all 16 cluster states → Design 10-16 experiments →
Copy and modify 10-16 train.py variants →
Submit all experiments → Wait → Check 10-16 results →
Compare across hardware types → Identify trends →
Decide which configs to keep → Plan next wave
Context accumulation over 8 hours:
Sequential (100 experiments):
~300K tokens cumulative
Parallel (910 experiments in ~75 waves):
~15K tokens/wave × 75 waves ≈ 1.1M tokens cumulative
The parallel agent uses approximately 3-4× more context than the sequential agent. This is a potential scaling bottleneck — at some point, the agent's context window fills up, and it either loses early experimental context or must summarize/compress.
Workdir Memory (Disk)
Each experiment creates an isolated copy of the codebase:
/tmp/autoresearch/
exp-01/
train.py ~20KB
prepare.py ~8KB
pyproject.toml ~1KB
experiment.yaml ~1KB
exp-02/
...
exp-910/
...
Total disk usage: ~910 × 30KB ≈ 27MB
This is negligible — disk is not a constraint. However, the agent must manage these directories (create, modify, reference by path), which adds cognitive overhead.
GPU Memory (Per Cluster)
Each cluster runs independently with its own GPU memory:
Per cluster (same as autoresearch):
Model parameters: ~100 MB
Optimizer states: ~400 MB
Activations: ~2-10 GB
KV cache: ~500 MB
Total: ~3-11 GB
Across all 16 clusters:
Total GPU memory used: ~48-176 GB (across 16 independent GPUs)
No memory sharing between clusters
Each cluster is completely independent — there is no distributed training, no gradient sharing, no communication between GPUs. Each GPU runs its own experiment with its own copy of the code and data. This is an embarrassingly parallel architecture.
Data Memory (Per Cluster)
Each cluster downloads and stores training data independently:
Per cluster:
~/.cache/autoresearch/data/ ~1-5 GB (parquet shards)
~/.cache/autoresearch/tokenizer/ ~10 MB (BPE tokenizer)
First experiment on a cluster: setup downloads data (~2 min)
Subsequent experiments: skip setup (data already present)
The setup block in experiment.yaml runs once per cluster. SkyPilot's sky exec reuses the existing cluster, so data is downloaded only once per GPU, not once per experiment. This amortizes the data preparation cost across all experiments on a given cluster.
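The amortization is easy to quantify with the run's own numbers (a ~2-minute setup, roughly 910 experiments spread over 16 clusters):

```python
# Amortized setup cost per experiment, using the run's reported figures.
setup_seconds = 120   # ~2 min data download per cluster
experiments = 910
clusters = 16

per_experiment = setup_seconds * clusters / experiments  # seconds per experiment
print(round(per_experiment, 1))  # ~2.1 s, vs. 120 s if data were fetched every time
```

In other words, cluster reuse turns a 40% overhead on a 5-minute run into a rounding error.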
14 Continued Learning
Within-Run Learning
The parallel agent's learning is richer than the sequential agent's because it processes more information per decision cycle:
Sequential agent learning:
Experiment 1: batch_size=2^18 → 0.985 (better)
→ Learns: smaller batch size helps
Experiment 2: weight_decay=0.08 → 0.990 (better than baseline, worse than exp-1)
→ Learns: weight_decay matters, but batch size matters more
Parallel agent learning:
Wave 1 (12 experiments):
batch_size ∈ {2^17, 2^18, 2^19} × weight_decay ∈ {0.05, 0.08, 0.15, 0.2}
→ Learns:
- batch_size=2^18 is optimal (2^17 too small, 2^19 too large)
- weight_decay=0.08 is optimal across all batch sizes
- there is NO interaction between batch_size and weight_decay
- margin of improvement: 0.015 from batch_size, 0.003 from wd
- together they give 0.018, confirming additivity
The parallel agent learns more per unit time because factorial designs are statistically more efficient than one-at-a-time designs (a classical result in design of experiments theory).
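The additivity check itself is a single contrast on a 2×2 slice of the wave. The numbers below are illustrative, chosen to match the margins quoted above (0.015 from batch size, 0.003 from weight decay):

```python
# Illustrative 2x2 slice of a factorial wave.
baseline   = 1.000  # batch 2^19, wd 0.05
batch_only = 0.985  # batch 2^18, wd 0.05  -> -0.015 from batch size
wd_only    = 0.997  # batch 2^19, wd 0.08  -> -0.003 from weight decay
both       = 0.982  # batch 2^18, wd 0.08

# Interaction contrast: joint gain minus the sum of the individual gains.
interaction = (both - baseline) - ((batch_only - baseline) + (wd_only - baseline))
print(abs(round(interaction, 6)))  # 0.0 means the effects are purely additive
```

A sequential agent never observes all four cells at once, so it can only guess at this contrast; a factorial wave measures it directly.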
Cross-Wave Strategy Evolution
The agent's strategy visibly evolves across waves:
| Wave Range | Strategy | Sophistication |
|---|---|---|
| 1-5 | One-variable sweeps | Low (identical to sequential) |
| 6-15 | Multi-variable factorials | Medium (exploiting parallelism) |
| 16-25 | Hardware-aware allocation | High (emergent capability) |
| 26-40 | Two-tier screening/validation | High (research methodology) |
| 40-60 | Focused combinatorial tuning | High (targeted exploration) |
| 60-75 | Diminishing returns awareness | High (implicit stopping intuition) |
This progression suggests the agent is learning how to research during the run, not just what works. It develops experimental methodology from experience.
Cross-Run Knowledge Transfer
Like sequential autoresearch, the SkyPilot extension does not have built-in cross-run memory. However, the richer results from parallel runs provide better transfer material:
1. Winning configuration as starting point.
```python
# Best config from an 8-hour parallel run
ASPECT_RATIO = 96
TOTAL_BATCH_SIZE = 2**18
WEIGHT_DECAY = 0.08
ADAM_BETAS = (0.70, 0.95)
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# muon_beta2 = 0.98
```
A subsequent run starting from this configuration would explore the space around this optimum, potentially finding further improvements.
2. Response surface knowledge.
The complete factorial results from parallel runs encode the response surface:
AR: 48 64 72 80 90 96 112
bpb: .985 .982 .980 .979 .978 .977 .984
↑ monotonic improvement up to AR=96, then regression
This knowledge could be encoded in program.md:
"Note: optimal ASPECT_RATIO is around 96 for H100.
Do not test below 80 or above 112."
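Reading the optimum, and verifying the monotone-then-regression shape, off the quoted sweep takes only a few lines:

```python
# AR -> val_bpb, from the response surface quoted above.
sweep = {48: 0.985, 64: 0.982, 72: 0.980, 80: 0.979, 90: 0.978, 96: 0.977, 112: 0.984}

best_ar = min(sweep, key=sweep.get)  # 96
ars = sorted(sweep)
i = ars.index(best_ar)
# val_bpb strictly decreases at every step up to the optimum
monotone = all(sweep[a] > sweep[b] for a, b in zip(ars[:i], ars[1:i + 1]))
print(best_ar, monotone)  # 96 True
```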
3. Hardware-specific findings.
H100: ~1,060 steps in 5 minutes. FINAL_LR_FRAC=0.05 is best.
H200: ~1,200 steps in 5 minutes. FINAL_LR_FRAC=0.05 is best (but 0.03 works on H100).
Rankings can differ across hardware types — always validate on target hardware.
Potential Extensions for Persistent Learning
The SkyPilot autoresearch results suggest several natural extensions for cross-run learning:
1. Results database.
Instead of results.tsv, store results in a structured database that persists across runs. The agent could query: "What ASPECT_RATIO values have been tested, and what were the results?"
2. Bayesian surrogate model.
Fit a Gaussian process or neural network surrogate to the accumulated results, then use it to guide subsequent exploration (Bayesian optimization). The parallel factorial data provides excellent training signal for surrogate models.
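A minimal stand-in for such a surrogate: fit a parabola through the three sweep points bracketing the best observed ASPECT_RATIO and propose its vertex as the next candidate. This is far cruder than a Gaussian process, but it is the same idea of letting a model of the response surface pick the next experiment:

```python
# Three (AR, val_bpb) points bracketing the optimum, from the sweep above.
pts = [(90, 0.978), (96, 0.977), (112, 0.984)]

(x0, y0), (x1, y1), (x2, y2) = pts
# Coefficients of the interpolating parabola y = a*x^2 + b*x + c,
# then its vertex -b / (2a) as the suggested next ASPECT_RATIO.
denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
b = (x2**2 * (y0 - y1) + x1**2 * (y2 - y0) + x0**2 * (y1 - y2)) / denom
vertex = -b / (2 * a)
print(round(vertex, 1))  # close to 96, consistent with the observed optimum
```

With accumulated results from many runs, the same loop generalizes: fit surrogate, propose candidates, run them, refit.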
3. Agent memory systems.
Use systems like autoresearch's meta-learning framework — updating program.md with empirical findings — but automated:
# Auto-generated findings from Run 1:
- Optimal ASPECT_RATIO range: [80, 100]
- Batch size 2^18 consistently outperforms 2^19
- Weight decay 0.08 is robust across architectures
- muon_beta2=0.98 is best for wide models
- H200 gives ~9% more steps than H100 in 5-min budget
4. Population-based training.
Instead of greedy hill-climbing, maintain a population of configurations that evolve over time. Parallel resources make this natural — each GPU runs a different population member. This would bring autoresearch closer to systems like AlphaEvolve while preserving its simplicity.
15 Applications
Direct Applications
1. Accelerated neural network training research.
The primary use case — any researcher with access to a GPU cluster can run parallel autoresearch overnight and wake up to 900+ experiments and a well-optimized model:
Evening: claude "Read instructions.md and start running experiments"
Morning: val_bpb improved by 3%, 900 experiments logged
Time invested: ~10 minutes of setup
2. Hardware benchmarking and characterization.
The emergent hardware-aware behavior makes parallel autoresearch a novel hardware benchmarking tool. By running on a heterogeneous cluster, the system automatically:
- Compares GPU types under real ML workloads
- Identifies performance differences that synthetic benchmarks miss
- Finds hardware-specific optimal configurations
- Discovers when parameter rankings change across hardware
3. LLM agent capability benchmarking.
The SkyPilot autoresearch is a rigorous benchmark for evaluating LLM coding agents:
| Capability | How It's Tested |
|---|---|
| Code editing | Modifying train.py correctly |
| Tool use | Using SkyPilot CLI commands |
| Experimental design | Quality of factorial grids |
| Scientific reasoning | Hardware heterogeneity discovery |
| Resource management | Cluster utilization efficiency |
| Error handling | Crash diagnosis and recovery |
| Long-horizon planning | Multi-wave strategy evolution |
| Autonomy | 8-hour unsupervised operation |
4. Infrastructure validation.
SkyPilot uses autoresearch as a showcase for its agent skill system. The 8-hour autonomous run demonstrates that SkyPilot's Kubernetes integration, job pipelining, and multi-cluster management work reliably at scale.
Broader Research Applications
5. Design of Experiments (DOE) by AI agents.
The SkyPilot results demonstrate that LLM agents can independently develop sophisticated experimental design strategies:
Classical DOE Theory (Fisher, 1935):
- Factorial designs
- Blocking for confounding variables
- Response surface methodology
- Sequential experimentation
Agent-Discovered DOE (SkyPilot Autoresearch, 2026):
- Factorial grids (Phase 2-3)
- Hardware blocking (emergent)
- Response surface scanning (AR sweep)
- Two-tier screening/validation (emergent)
The agent rediscovered principles of experimental design that took human statisticians decades to formalize. This suggests that LLM agents could be useful research assistants even in domains where they have no specialized training.
6. Heterogeneous compute optimization.
Modern compute environments are increasingly heterogeneous (different GPU types, different cloud providers, different pricing). The emergent hardware-aware strategy suggests that LLM agents can:
- Automatically discover hardware characteristics
- Develop resource allocation strategies
- Optimize cost-performance trade-offs
- Adapt to changing resource availability
This has implications beyond ML research — any workload that runs on heterogeneous compute could benefit from agent-driven optimization.
7. Meta-research on research methodology.
The SkyPilot experiment is itself a meta-research study: it investigates how compute scale affects the quality and strategy of autonomous research. Key findings that generalize:
| Finding | Generalization |
|---|---|
| Parallelism changes strategy | More resources enable qualitatively different approaches, not just faster execution |
| Emergent capabilities appear at scale | Complex behaviors (hardware awareness) emerge without explicit programming |
| Diminishing returns follow a power law | Early experiments are disproportionately valuable |
| Agent experimental design improves with experience | LLMs learn research methodology on-the-fly |
Limitations
| Limitation | Description | Impact |
|---|---|---|
| Single agent bottleneck | One agent manages all 16 GPUs | ~56% scaling efficiency |
| No inter-experiment communication | Clusters don't share information | Redundant computations possible |
| Context window pressure | 910 experiments generate large context | Risk of losing early context |
| Cost scaling | 16 GPUs × 8 hours = significant cost | $300 vs. $18 for single-GPU |
| Non-deterministic | Different runs yield different results | Hard to reproduce exact numbers |
| Agent-specific | Tested only with Claude Code | May not generalize to other agents |
| Kubernetes dependency | Requires cluster infrastructure | Higher barrier than single GPU |
| No stopping criterion | Runs until manually stopped | Wastes compute in diminishing returns phase |
Future Directions
1. Multi-agent parallel autoresearch.
Instead of one agent managing 16 GPUs, use 4 agents each managing 4 GPUs. Each agent explores a different region of the search space, and they periodically share discoveries:
Agent A (architecture): Explores model dimensions
Agent B (optimizer): Explores optimizer parameters
Agent C (schedule): Explores LR and warmdown schedules
Agent D (attention): Explores attention patterns
Shared discovery board: Best configs from each agent
Cross-pollination: Agents adopt each other's best findings
2. Adaptive compute allocation.
Instead of a fixed 16 GPUs for 8 hours, dynamically scale:
- Start with 4 GPUs (broad exploration)
- Scale to 16 when promising directions are found (deep exploration)
- Scale down when diminishing returns are detected
- Cost-aware stopping: stop when expected improvement per dollar drops below a threshold
3. Continuous autoresearch.
Instead of overnight runs, run autoresearch continuously:
- An always-on agent monitors a git repository
- New ideas from the community trigger automated experiments
- Results are automatically committed and published
- A living leaderboard reflects the current best configuration
4. Cross-domain autoresearch.
Apply the parallel autoresearch methodology to other optimization domains:
- Reinforcement learning training scripts
- Compiler optimization passes
- Database query optimizer configurations
- Scientific simulation parameters
- Drug discovery molecular parameters
The SkyPilot infrastructure layer is domain-agnostic — only the train.py and prepare.py would change.
Broader Impact
The SkyPilot scaling of autoresearch demonstrates a fundamental principle: the gap between sequential and parallel autonomous research is not just speed — it's intelligence. A sequential agent does greedy hill-climbing. A parallel agent does factorial experimental design, discovers confounding variables, and develops multi-tier validation strategies.
This has profound implications for the future of AI-driven research:
- Compute is a cognitive amplifier. More compute doesn't just make the agent faster — it makes it smarter by enabling strategies that are impossible in sequential mode.
- Infrastructure is an intelligence bottleneck. The sequential autoresearch agent is limited not by its reasoning ability but by its infrastructure. Remove the infrastructure bottleneck, and emergent capabilities appear.
- Agent autonomy scales with resources. A single-GPU agent needs human oversight (it can get stuck in local optima). A 16-GPU agent develops its own research methodology and can run for days without human intervention.
- The "researcher" is the program, not the agent. Both Karpathy's program.md and SkyPilot's instructions.md are natural-language research programs. The agent is the execution engine. The quality of research depends on the quality of the program, not the specific agent. This democratizes research — anyone who can write clear instructions can run a research lab.
The combination of Karpathy's radical simplicity (3 files, 1 metric) with SkyPilot's infrastructure abstraction (any cluster, any cloud) creates a system that is simultaneously:
- Simple enough for anyone to understand
- Powerful enough to produce real research results
- Scalable enough to exploit large compute clusters
- General enough to apply to any optimization domain
This is, perhaps, the strongest evidence yet that autonomous research is not a future aspiration but a present reality — limited primarily by the quality of the research program (the natural-language instructions) and the availability of compute.
This analysis is based on the SkyPilot blog post "Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster" (March 2026), the instructions.md and experiment.yaml from skypilot/examples/autoresearch, and the original autoresearch repository. The experiment used Claude Code on a Kubernetes cluster backed by CoreWeave with 13 H100 and 3 H200 GPUs.