Pi-Autoresearch
Autonomous experiment loop extension for the pi AI coding agent
Organization: David Vilalta (davebcn87, Independent)
Published: March 2026
Type: Open-Source Extension (MIT License)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: pi-autoresearch — Autonomous experiment loop extension for pi
Repository URL: github.com/davebcn87/pi-autoresearch
Stars: 3,300+ (as of April 2026)
License: MIT
Lineage: Directly inspired by Andrej Karpathy's autoresearch, reimagined as a modular extension for the pi AI coding agent. While Karpathy's system is a standalone monolith tightly coupled to nanochat and a specific LLM agent (Claude Code), pi-autoresearch decouples the experimental loop infrastructure from the domain knowledge, creating a general-purpose optimization extension that works with any command-line-measurable metric.
Publication Date: March 2026
Paradigm: Pi-autoresearch is a meta-tool — it instruments an LLM coding agent with the tools and workflow to run autonomous optimization loops, rather than being an autonomous agent itself. The distinction is architecturally significant: the extension provides infrastructure (timing, logging, version control, dashboards), while domain intelligence comes from a separately authored "skill" document. This separation of concerns enables a single extension to serve unlimited optimization domains.
"Try an idea, measure it, keep what works, discard what doesn't, repeat forever." — pi-autoresearch README
2 Authors and Team
Primary Author
David Vilalta (GitHub: davebcn87) — Software engineer based in Barcelona, Spain. Vilalta's prior work focuses on developer tooling and agent infrastructure. The "bcn87" suffix references Barcelona, and his GitHub profile indicates involvement in AI agent extension ecosystems.
Platform Dependency: pi (by Anthropic)
Pi-autoresearch is designed as a first-class extension for pi, an AI coding agent developed by Anthropic that runs in the terminal. Pi provides the underlying LLM-powered coding agent, extension system, skill framework, and terminal UI infrastructure. Pi-autoresearch extends pi's capabilities rather than building from scratch.
| Component | Provider |
|---|---|
| LLM agent runtime | pi (Anthropic) |
| Extension system | pi extension framework |
| Skill authoring | pi skill system |
| Terminal UI (widgets, dashboards) | pi UI framework |
| Experiment infrastructure | pi-autoresearch (this project) |
Relationship to Karpathy's Autoresearch
The naming and conceptual debt are explicit — the README states "Inspired by karpathy/autoresearch." However, the architectural decisions diverge significantly:
| Dimension | Karpathy autoresearch | pi-autoresearch |
|---|---|---|
| Agent coupling | Tightly bound to Claude Code | Extension for pi (any LLM backend) |
| Domain coupling | Tightly bound to nanochat | Domain-agnostic via skills |
| Measurement | 5-min fixed wall-clock val_bpb | Any command + any metric |
| State format | `results.tsv` + git history | `autoresearch.jsonl` + `autoresearch.md` |
| UI | None (terminal output only) | Live widget + expandable dashboard + fullscreen overlay |
| Finalization | Manual branch review | Automated branch grouping via `autoresearch-finalize` |
| Configuration | `program.md` only | `autoresearch.config.json` + `autoresearch.md` + `autoresearch.sh` |
| Statistical rigor | None (raw val_bpb comparison) | Confidence scoring via MAD |
Community Context
Pi-autoresearch represents the second wave of autoresearch tooling — the "framework wave" that followed Karpathy's proof-of-concept. Where Karpathy demonstrated that the paradigm works, pi-autoresearch asks how to make it reusable, composable, and production-grade. The 3,300+ stars indicate strong community adoption, comparable to or exceeding many of the standalone autoresearch forks, while providing a fundamentally different architectural foundation.
3 Core Contribution
Key Novelty: Pi-autoresearch transforms autonomous experimentation from a single-purpose script into a reusable, domain-agnostic infrastructure layer — separating experimental loop mechanics (measurement, version control, statistical confidence, dashboards) from domain-specific knowledge (what to optimize, how to measure, which files to modify). This separation enables any developer using pi to add autonomous optimization to any project with any metric, without writing custom experimental infrastructure.
What Makes Pi-Autoresearch Novel
- Extension/skill architecture. The fundamental innovation is the clean separation between the extension (domain-agnostic infrastructure — tools, widgets, dashboards) and the skill (domain-specific knowledge — what to optimize, the benchmark command, the metric, the scope of files). One extension serves unlimited domains through skill composition. This is an architectural pattern not present in any prior autoresearch system.
- Confidence scoring with MAD. After 3+ experiments, pi-autoresearch computes a statistical confidence score comparing the best improvement to the session's noise floor using Median Absolute Deviation (MAD). This addresses a critical weakness in Karpathy's system — raw metric comparisons can't distinguish real improvements from benchmark jitter, especially in noisy domains like ML training or Lighthouse scores. The confidence metric is advisory (never auto-discards), but guides the agent to re-run marginal experiments.
- Append-only structured log (`autoresearch.jsonl`). Every experiment is recorded as a single JSON line, creating a machine-readable audit trail that supports post-hoc analysis, resumption after context resets, and cross-session continuity. This is a significant improvement over TSV-based logging, enabling richer structured data (commit hashes, descriptions, confidence scores, status codes, branch context).
- Backpressure checks. An optional `autoresearch.checks.sh` script runs correctness checks (tests, types, lint) after every passing benchmark. This creates a safety valve that prevents optimizations from silently breaking things — a critical concern for multi-hour autonomous runs.
- Branch-aware finalization. The `autoresearch-finalize` skill groups kept experiments into logical changesets, proposes the grouping for human approval, then creates independent branches from the merge-base. Groups must not share files, ensuring each branch can be reviewed and merged independently. This solves the "messy experiment branch" problem that plagues long autonomous runs.
- Rich terminal UI. A persistent status widget, expandable results dashboard, and fullscreen overlay with keyboard navigation provide real-time visibility into the optimization process. This is a significant UX improvement over terminal-output-only systems, enabling researchers to monitor multi-hour runs at a glance.
Relationship to Prior Work
| System | Year | Architecture | Domain | Statistical Rigor | UI |
|---|---|---|---|---|---|
| Karpathy autoresearch | 2026 | Monolithic script | Neural network training | None | Terminal output |
| SkyPilot autoresearch | 2026 | Cloud orchestrator | Any cloud workload | None | Web dashboard |
| AutoResearchClaw | 2026 | Multi-agent | Research paper writing | None | None |
| pi-autoresearch | 2026 | Extension + skill | Any measurable metric | MAD confidence | Widget + dashboard |
Design Philosophy
Pi-autoresearch embodies a specific design philosophy about autonomous research tooling:
The experimental loop (measure → judge → keep/revert → repeat) is infrastructure, not domain knowledge. Domain knowledge belongs in a separate, human-authored document. Infrastructure should be built once and reused everywhere.
This is fundamentally different from Karpathy's approach, where the experimental loop and the domain knowledge are intertwined in program.md. Pi-autoresearch's separation means that a team optimizing test speeds, another team optimizing LLM training, and a third team optimizing Lighthouse scores all use the exact same extension — only their skills differ.
4 Supported Solutions
Pi-autoresearch is explicitly domain-agnostic. It supports any optimization target where a command-line command produces a measurable numeric metric. The README provides canonical example domains, but the architecture imposes no domain restrictions:
Canonical Example Domains
| Domain | Metric | Direction | Benchmark Command | Typical Scope |
|---|---|---|---|---|
| Test speed | Seconds | ↓ (lower is better) | `pnpm test` | Test configuration, runner parallelism, mocking |
| Bundle size | KB | ↓ (lower is better) | `pnpm build && du -sb dist` | Tree-shaking, code splitting, dependencies |
| LLM training | val_bpb | ↓ (lower is better) | `uv run train.py` | Architecture, hyperparameters, data pipeline |
| Build speed | Seconds | ↓ (lower is better) | `pnpm build` | Build tool config, caching, parallelism |
| Lighthouse score | Performance score | ↑ (higher is better) | `lighthouse http://localhost:3000 --output=json` | SSR, code splitting, image optimization |
Solution Taxonomy
Solutions produced by pi-autoresearch fall into several categories:
| Solution Type | Description | Mechanism |
|---|---|---|
| Configuration tuning | Adjusting config files (test runners, bundlers, compilers) | Agent reads config docs, proposes changes, measures impact |
| Code refactoring | Restructuring code for performance without behavior change | Agent identifies bottlenecks, refactors, verifies via checks |
| Algorithm replacement | Swapping algorithms for more efficient alternatives | Agent proposes alternative implementations, benchmarks |
| Dependency optimization | Adding/removing/replacing dependencies | Agent evaluates dependency alternatives, measures impact |
| Architecture changes | Restructuring module boundaries, data flow patterns | Agent proposes structural changes, validates via benchmark |
| Hyperparameter tuning | Adjusting numeric parameters in training/optimization code | Agent sweeps values, keeps improvements |
What It Cannot Optimize
The system has inherent limitations:
- Multi-objective optimization. The system tracks a single primary metric per session. Multi-metric optimization requires separate sessions or custom benchmark scripts that compute composite scores.
- Non-deterministic metrics. Highly stochastic metrics (e.g., ML validation loss with high variance) challenge the confidence scoring. The MAD-based approach helps but cannot fully compensate for extreme noise.
- Long-running benchmarks. The autonomous loop assumes each experiment completes in a reasonable time. Very long benchmarks (hours) create impractically slow feedback loops.
- Metrics requiring human judgment. Subjective quality metrics (code readability, UX quality) cannot be benchmarked automatically.
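The multi-objective limitation has the workaround noted above: a benchmark script can fold several sub-metrics into one composite score and emit that as the single primary metric. A minimal Python sketch — the sub-metrics, weights, and normalization constant are illustrative assumptions, not part of pi-autoresearch:

```python
def composite_score(test_seconds: float, bundle_kb: float,
                    w_time: float = 0.7, w_size: float = 0.3) -> float:
    """Fold two sub-metrics into one score (lower is better).

    The weights and the /100 normalization are invented for this
    example; a real script would pick domain-appropriate scaling.
    """
    return w_time * test_seconds + w_size * (bundle_kb / 100.0)

# Emit the single METRIC line the extension parses from stdout.
print(f"METRIC composite_score={composite_score(42.0, 350.0):.3f}")
```

The agent then hill-climbs on the composite score exactly as it would on any single metric.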
5 LLM Integration
Architecture: LLM-as-Agent, Not LLM-as-Mutation-Operator
Pi-autoresearch uses the LLM fundamentally differently from evolutionary systems like AlphaEvolve or FunSearch. In those systems, the LLM is a mutation operator called by the system to propose code changes. In pi-autoresearch, the LLM is the agent — it has full autonomy to read files, understand context, form hypotheses, make changes, and decide what to try next. The extension merely provides tools that the agent invokes at its discretion.
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent (pi) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Read files │ │ Form │ │ Make code │ │
│ │ + context │──│ hypothesis │──│ changes │ │
│ └─────────────┘ └──────────────┘ └────────┬──────────┘ │
│ │ │
│ ┌────────────────────────────────────────────┼──────────┐ │
│ │ Extension Tools (invoked by agent) │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │init_experiment│ │run_experiment │ │log_experiment│ │ │
│ │ │(one-time │ │(executes cmd, │ │(records │ │ │
│ │ │ session setup)│ │ times it, │ │ result, │ │ │
│ │ │ │ │ captures │ │ auto-commits,│ │ │
│ │ │ │ │ output) │ │ updates UI) │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Skill (domain knowledge, loaded at start) │ │
│ │ │ │
│ │ "Optimize test runtime by modifying vitest configs. │ │
│ │ Command: pnpm test. Metric: seconds (lower better). │ │
│ │ Files in scope: vitest.config.ts, test/**/*.test.ts" │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
LLM Interaction Modes
| Mode | Trigger | LLM Behavior |
|---|---|---|
| Session setup | `/skill:autoresearch-create` | Agent asks about goal, command, metric, scope — or infers from context. Creates session files. |
| Autonomous loop | After setup | Agent edits code → commits → calls run_experiment → calls log_experiment → decides keep/revert → repeats indefinitely |
| Finalization | `/skill:autoresearch-finalize` | Agent reads experiment log, groups changes into logical branches, proposes grouping for approval |
| Interruption | User presses Escape | Agent stops loop and provides summary of results |
| Resumption | `/autoresearch <context>` | Agent reads `autoresearch.md` + `autoresearch.jsonl`, reconstructs context, resumes loop |
No Model Coupling
A critical design decision: pi-autoresearch does not specify which LLM powers the pi agent. The extension works with whatever LLM backend the user has configured in pi. This means the same experimental infrastructure works with:
- Claude (Anthropic) — pi's primary backend
- GPT-4o/o3 (OpenAI) — if configured via API key
- Gemini (Google) — if configured via API key
- Open-weight models — if configured via local inference
The LLM choice affects the quality of hypotheses the agent generates, but not the experimental infrastructure.
Prompt Architecture
The system uses a layered prompt architecture:
Layer 1: pi system prompt (agent capabilities, tool definitions)
↓
Layer 2: Extension tool descriptions (init_experiment, run_experiment, log_experiment)
↓
Layer 3: Skill document (domain-specific instructions, loaded at session start)
↓
Layer 4: Session context (autoresearch.md — what's been tried, what worked)
↓
Layer 5: Real-time state (widget data, recent experiment results)
This layered architecture means the LLM always has access to:
- What it can do (tools)
- What it should optimize (skill)
- What has already been tried (session history)
- How well it's doing (confidence scores, metric trajectory)
6 Key Results
Claimed Performance
The README does not report specific experimental results — it is infrastructure, not a benchmark paper. However, the system is designed to produce results in any domain. The README's example domains suggest typical performance improvements:
| Domain | Typical Improvement | Confidence Threshold |
|---|---|---|
| Test speed | 10–50% reduction in runtime | ≥ 2.0× MAD (green) |
| Bundle size | 5–30% reduction in KB | ≥ 2.0× MAD (green) |
| LLM training | 5–15% reduction in val_bpb | ≥ 2.0× MAD (green) |
| Build speed | 10–40% reduction in build time | ≥ 2.0× MAD (green) |
Statistical Confidence Framework
Pi-autoresearch's most scientifically rigorous feature is its confidence scoring:
| Metric | Formula | Interpretation |
|---|---|---|
| MAD (Median Absolute Deviation) | `median(\|x_i - median(x)\|)` | Robust noise floor estimate, resistant to outliers |
| Confidence score | `\|best_improvement\| / MAD` | How many times the best improvement exceeds noise |
| ≥ 2.0× (green) | — | Improvement is likely real |
| 1.0–2.0× (yellow) | — | Above noise but marginal — re-run recommended |
| < 1.0× (red) | — | Within noise — likely jitter, not real improvement |
The choice of MAD over standard deviation is deliberate — MAD is robust to outliers, which are common in benchmark measurements (GC pauses, CPU throttling, I/O contention). This makes the confidence score reliable even with heterogeneous experiment results.
Comparison with Karpathy's Results
Karpathy's autoresearch reported 11% cumulative reduction in time-to-GPT-2-quality over an overnight run. Pi-autoresearch provides the infrastructure to achieve similar results but with added statistical rigor:
| Feature | Karpathy | pi-autoresearch |
|---|---|---|
| Raw improvement detection | Yes (raw val_bpb comparison) | Yes (raw metric comparison) |
| Noise floor estimation | No | Yes (MAD) |
| Confidence scoring | No | Yes (improvement/MAD ratio) |
| False positive filtering | No | Advisory (agent guided to re-run marginal results) |
| Post-hoc analysis | TSV file | Structured JSONL with full metadata |
Community Adoption Signal
The 3,300+ GitHub stars provide a strong adoption signal. The extension's domain-agnostic nature means its user base is fundamentally broader than domain-specific autoresearch tools — it serves anyone who uses pi for development, regardless of their optimization domain.
7 Reproducibility
Reproducibility Design
Pi-autoresearch is designed with reproducibility as a first-class concern at multiple levels:
Experiment-level reproducibility:
- Every experiment that is "kept" produces a git commit with a descriptive message including the metric improvement
- The autoresearch.jsonl log records every experiment (kept and discarded) with timestamps, commit hashes, metric values, confidence scores, and descriptions
- Any individual experiment can be reproduced by checking out its commit and re-running the benchmark command
Session-level reproducibility:
- The autoresearch.md file captures the complete session context: objective, metrics, files in scope, what has been tried, dead ends, and key wins
- A fresh agent with no memory can read autoresearch.md + autoresearch.jsonl and continue exactly where the previous session left off
- Sessions are branch-aware — each branch has its own session state
Infrastructure-level reproducibility:
- The extension is installed via pi install https://github.com/davebcn87/pi-autoresearch
- The benchmark script (autoresearch.sh) is an explicit, versioned shell script — not implicit agent behavior
- Optional checks script (autoresearch.checks.sh) is similarly explicit and versioned
Session Files
| File | Format | Purpose | Persistence |
|---|---|---|---|
| `autoresearch.md` | Markdown | Living session document — objective, history, dead ends | Survives context resets, agent restarts |
| `autoresearch.jsonl` | JSON Lines | Append-only experiment log with full metadata | Survives restarts, branch-aware |
| `autoresearch.sh` | Shell script | Benchmark command with pre-checks and metric output | Versioned in git |
| `autoresearch.checks.sh` | Shell script (optional) | Correctness checks (tests, types, lint) | Versioned in git |
| `autoresearch.config.json` | JSON (optional) | Session configuration (working dir, max iterations) | Versioned in git |
Resumption Protocol
The system supports three resumption scenarios:
- Agent restart (same context window): The agent reads `autoresearch.jsonl` to reconstruct state and continues the loop.
- Context reset (new agent instance): A fresh agent reads `autoresearch.md` for high-level context and `autoresearch.jsonl` for detailed history. The README explicitly states: "A fresh agent with no memory can read these two files and continue exactly where the previous session left off."
- Branch switch: Each branch maintains its own session state. Switching branches automatically switches the session context.
Limitations on Reproducibility
- LLM non-determinism: The agent's hypotheses and code changes depend on the LLM's stochastic generation. Two runs with identical starting conditions will generally produce different experiment sequences.
- Environment sensitivity: Benchmark measurements depend on system load, hardware, and environment. The confidence scoring mitigates this but doesn't eliminate it.
- Skill authoring variance: The quality and specificity of the skill document significantly affect results. Vague skills produce scattered experiments; precise skills produce focused optimization.
8 Compute and API Costs
Cost Model
Pi-autoresearch's costs have three components:
| Component | Source | Scaling Factor |
|---|---|---|
| LLM tokens | pi agent inference (per-experiment reasoning + code changes) | Experiments × tokens/experiment |
| Compute time | Benchmark execution | Experiments × benchmark duration |
| Human attention | Monitoring, review, finalization | Low (designed for unattended operation) |
Token Cost Estimation
Each experiment cycle involves:
| Phase | Estimated Tokens | Notes |
|---|---|---|
| Read context + session files | 2,000–10,000 input | Depends on autoresearch.md size |
| Generate hypothesis + code change | 1,000–5,000 output | Depends on change complexity |
| Interpret results + decide keep/revert | 500–2,000 output | Includes reasoning about metrics |
| Per-experiment total | 3,500–17,000 | ~10,000 tokens typical |
For a 50-experiment session with Claude Sonnet-level pricing (~$3/M input, ~$15/M output tokens):
- Input: 50 × 6,000 = 300K tokens → ~$0.90
- Output: 50 × 4,000 = 200K tokens → ~$3.00
- Total: ~$4–$10 per 50-experiment session
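The arithmetic above can be packaged as a quick estimator; the per-million-token prices are the Sonnet-level figures assumed in the text:

```python
def session_cost(n_experiments: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float = 3.0,
                 out_price_per_m: float = 15.0) -> float:
    """Estimate LLM cost in USD for one autonomous session.

    in_tokens/out_tokens are per-experiment averages; prices are
    USD per million tokens.
    """
    total_in = n_experiments * in_tokens
    total_out = n_experiments * out_tokens
    return (total_in / 1e6) * in_price_per_m + (total_out / 1e6) * out_price_per_m

print(round(session_cost(50, 6_000, 4_000), 2))  # 3.9
```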
Cost Controls
Pi-autoresearch provides two mechanisms for cost control:
- `maxIterations` in `autoresearch.config.json`: Hard limit on experiments per session. The agent is instructed to stop and won't run more experiments beyond this limit.
{
"maxIterations": 30
}
- API key spending limits: The README recommends using provider-side spending caps: "most providers let you set per-key or monthly budgets."
Cost Comparison
| System | Typical Session Cost | Autonomous Duration | Cost Control |
|---|---|---|---|
| Karpathy autoresearch | $5–$50 (overnight) | 8–12 hours | Manual interruption only |
| pi-autoresearch | $4–$10 (50 experiments) | Varies by benchmark | maxIterations + API limits |
| AlphaEvolve | $1,000–$100,000+ | Days | Internal Google quotas |
9 Architecture Solution
Three-Layer Architecture
Pi-autoresearch's architecture follows a clean three-layer separation:
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: SKILL (Domain Knowledge) │
│ │
│ autoresearch-create autoresearch-finalize │
│ ┌─────────────────────┐ ┌──────────────────────────────┐ │
│ │ • Asks about goal │ │ • Reads experiment log │ │
│ │ • Infers from context│ │ • Groups kept experiments │ │
│ │ • Writes session │ │ • Proposes branch grouping │ │
│ │ files │ │ • Creates independent branches│ │
│ │ • Starts loop │ │ • Ensures no shared files │ │
│ └─────────────────────┘ └──────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 2: EXTENSION (Infrastructure) │
│ │
│ Tools UI Commands │
│ ┌───────────────────┐ ┌───────────────┐ ┌────────────┐ │
│ │ init_experiment │ │ Status widget │ │/autoresearch│ │
│ │ run_experiment │ │ Dashboard │ │ <context> │ │
│ │ log_experiment │ │ Fullscreen │ │ off │ │
│ └───────────────────┘ │ overlay │ │ clear │ │
│ └───────────────┘ └────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 1: pi RUNTIME │
│ │
│ LLM Agent │ Terminal UI │ Git Integration │ File I/O │
│ Extension │ Widget │ Skill System │ Process │
│ Framework │ Framework │ │ Execution │
└─────────────────────────────────────────────────────────────────┘
Data Flow
The data flow through a single experiment cycle:
Agent generates hypothesis
│
▼
Agent edits source code
│
▼
Agent calls git commit (via pi)
│
▼
Agent calls run_experiment(command="pnpm test")
│
├──► Extension executes command
├──► Extension times wall-clock duration
├──► Extension captures stdout/stderr
├──► Extension parses METRIC lines from output
│
▼
Agent calls log_experiment(metric_value, description)
│
├──► Extension records to autoresearch.jsonl
├──► Extension computes confidence score (if 3+ runs)
├──► Extension updates status widget
├──► Extension updates dashboard
│
▼
Agent evaluates: keep or revert?
│
├──► Keep: branch advances, agent generates next hypothesis
│
└──► Revert: git reset, agent tries different approach
Metric Protocol
The benchmark script (autoresearch.sh) communicates metrics to the extension via a simple protocol — standard output lines matching the pattern:
METRIC name=number
For example:
#!/bin/bash
set -euo pipefail
pnpm test --run 2>&1
echo "METRIC total_test_seconds=$(node -e 'process.stdout.write(String(...))')"
This protocol is deliberately minimal — any language, any build tool, any measurement approach can produce METRIC lines.
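On the consuming side, the protocol parses with a single regular expression. The extension's actual parser is not shown in the source, so the following Python sketch is illustrative:

```python
import re

# "METRIC name=number" lines on stdout, one metric per line.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)

def parse_metrics(stdout: str) -> dict:
    """Extract all METRIC lines from captured benchmark output."""
    return {name: float(value) for name, value in METRIC_RE.findall(stdout)}

out = "ran 214 tests\nMETRIC total_test_seconds=18.42\n"
print(parse_metrics(out))  # {'total_test_seconds': 18.42}
```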
State Machine
The extension manages a session state machine:
┌──────────────┐
│ INACTIVE │
│ (no session)│
└──────┬───────┘
│ /autoresearch <context>
│ or /skill:autoresearch-create
▼
┌──────────────┐
│ SETUP │◄─── init_experiment()
│ (configuring│ defines name, metric,
│ session) │ unit, direction
└──────┬───────┘
│ Session files written
│ Baseline measured
▼
┌──────────────┐
│ RUNNING │◄─── Autonomous loop
│ (experiment │ edit → commit →
│ loop) │ run → log → decide
└──────┬───────┘
│ /autoresearch off
│ or Escape
│ or maxIterations reached
▼
┌──────────────┐
│ PAUSED │
│ (loop stopped│
│ state kept) │
└──────┬───────┘
│ /autoresearch <context>
│ (resumes loop)
│
│ /autoresearch clear
▼
┌──────────────┐
│ CLEARED │
│ (state │
│ deleted) │──── Returns to INACTIVE
└──────────────┘
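The transitions above can be encoded as a small lookup table. The event names here ("start", "stop", and so on) are illustrative labels, not pi commands — the corresponding triggers from the diagram appear in the comments:

```python
from enum import Enum, auto

class State(Enum):
    INACTIVE = auto()
    SETUP = auto()
    RUNNING = auto()
    PAUSED = auto()
    CLEARED = auto()

TRANSITIONS = {
    (State.INACTIVE, "start"):    State.SETUP,    # /autoresearch or autoresearch-create
    (State.SETUP,    "baseline"): State.RUNNING,  # session files written, baseline measured
    (State.RUNNING,  "stop"):     State.PAUSED,   # /autoresearch off, Escape, maxIterations
    (State.PAUSED,   "resume"):   State.RUNNING,  # /autoresearch <context>
    (State.PAUSED,   "clear"):    State.CLEARED,  # /autoresearch clear
    (State.CLEARED,  "reset"):    State.INACTIVE, # state deleted
}

def step(state: State, event: str) -> State:
    """Apply an event; events not valid in the current state are no-ops."""
    return TRANSITIONS.get((state, event), state)
```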
10 Component Breakdown
Extension Components
| Component | Type | Lifecycle | Description |
|---|---|---|---|
| `init_experiment` | Tool | Called once per session | Configures session: experiment name, primary metric name, unit, optimization direction (minimize/maximize) |
| `run_experiment` | Tool | Called per experiment | Executes benchmark command, measures wall-clock duration, captures output, parses METRIC lines |
| `log_experiment` | Tool | Called per experiment | Records result to `autoresearch.jsonl`, auto-commits, computes confidence score, updates UI components |
| Status widget | UI | Persistent during session | Always-visible bar above editor: run count, keep count, best metric value, improvement %, confidence score |
| Expanded dashboard | UI | Toggle via Ctrl+X | Full results table with columns for commit, metric, status, description |
| Fullscreen overlay | UI | Toggle via Ctrl+Shift+X | Scrollable full-terminal dashboard with live spinner for running experiments |
Skill Components
| Component | Type | Purpose |
|---|---|---|
| `autoresearch-create` | Skill | Session initialization — gathers goal, command, metric, scope; writes session files; runs baseline; starts loop |
| `autoresearch-finalize` | Skill | Branch finalization — groups kept experiments into independent branches for review |
Session File Components
| File | Format | Schema | Update Frequency |
|---|---|---|---|
| `autoresearch.md` | Markdown | Free-form document with sections for objective, metrics, scope, history, dead ends, wins | Updated by agent after significant events |
| `autoresearch.jsonl` | JSON Lines | `{commit, metric_name, metric_value, status, description, confidence, timestamp}` | Appended after every experiment |
| `autoresearch.sh` | Bash | Pre-checks + workload + `METRIC` output lines | Written once during setup, rarely modified |
| `autoresearch.checks.sh` | Bash (optional) | Correctness checks (tests, types, lint) | Written once during setup |
| `autoresearch.config.json` | JSON | `{workingDir?, maxIterations?}` | Written once, manually updated |
Experiment Status Taxonomy
Each experiment in autoresearch.jsonl has one of these statuses:
| Status | Meaning | UI Color | Agent Action |
|---|---|---|---|
| `kept` | Metric improved, changes committed | Green | Branch advances |
| `discarded` | Metric did not improve | Gray | Git reset, try again |
| `crashed` | Benchmark command failed (non-zero exit) | Red | Git reset, try different approach |
| `checks_failed` | Benchmark passed but correctness checks failed | Orange | Git reset, fix correctness issue |
| `baseline` | Initial measurement, no changes | Blue | Reference point for improvements |
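Combining the schema and the status taxonomy, one log entry looks like the record below. All concrete values are invented for illustration; only the field names and status strings come from the source:

```python
import json

# A hypothetical autoresearch.jsonl record; values are illustrative.
record = {
    "commit": "d4e5f6a",
    "metric_name": "total_test_seconds",
    "metric_value": 12.9,
    "status": "kept",
    "description": "run vitest with 4 worker threads",
    "confidence": 2.4,
    "timestamp": "2026-03-14T02:17:55Z",
}
line = json.dumps(record)          # one experiment = one line in the log
assert json.loads(line) == record  # records round-trip losslessly
```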
11 Core Mechanisms (Detailed)
Mechanism 1: Confidence Scoring via MAD
The most technically sophisticated mechanism in pi-autoresearch is the confidence scoring system, which uses Median Absolute Deviation (MAD) as a robust noise floor estimator.
Why MAD instead of Standard Deviation?
Standard deviation is sensitive to outliers. In benchmark measurements, outliers are common:
- GC pauses can spike individual measurements by 2–10×
- CPU thermal throttling can degrade measurements unpredictably
- I/O contention from other processes creates sporadic slowdowns
- ML training loss has inherent stochasticity
MAD is robust to these outliers because it uses the median rather than the mean:
MAD = median(|x_i - median(x)|)
Confidence computation:
confidence = |best_improvement| / MAD
Where best_improvement is the largest positive change in the primary metric (accounting for optimization direction).
Interpretation thresholds:
| Confidence | Signal | Color | Agent Guidance |
|---|---|---|---|
| ≥ 2.0× | Strong signal | 🟢 Green | Improvement is likely real. Keep and continue. |
| 1.0–2.0× | Marginal signal | 🟡 Yellow | Above noise but uncertain. Re-run to confirm. |
| < 1.0× | Noise | 🔴 Red | Within noise floor. Consider reverting. |
Implementation details:
- Confidence is only computed after 3+ experiments (minimum sample size for meaningful MAD)
- All metric values in the current segment contribute to the MAD computation
- Confidence is persisted to autoresearch.jsonl for post-hoc analysis
- Confidence is displayed in the status widget, expanded dashboard, and log_experiment output
- The confidence score is advisory only — it never auto-discards experiments
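Putting the two formulas together, a minimal Python implementation of the confidence computation might look like this. The zero-MAD guard is an assumption of this sketch; the source only specifies the 3-experiment minimum:

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: median(|x_i - median(x)|)."""
    m = median(values)
    return median(abs(x - m) for x in values)

def confidence(metric_values, best_improvement):
    """confidence = |best_improvement| / MAD, per the formulas above.

    Returns None with fewer than 3 samples (the minimum the extension
    requires) or when MAD is zero (a guard assumed here).
    """
    if len(metric_values) < 3:
        return None
    noise = mad(metric_values)
    if noise == 0:
        return None
    return abs(best_improvement) / noise

# Hypothetical test-runtime measurements (seconds, lower is better):
runs = [14.2, 13.9, 14.0, 12.9]
print(round(confidence(runs, 14.2 - 12.9), 2))  # 8.67 -> strong (green) signal
```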
Mechanism 2: Greedy Hill-Climbing with Version Control
The core optimization strategy is greedy hill-climbing backed by git version control:
┌────────────────────────────────────────────────────┐
│ EXPERIMENT CYCLE │
│ │
│ 1. Agent edits code (informed by session context) │
│ 2. Agent creates git commit │
│ 3. run_experiment executes benchmark │
│ 4. log_experiment records result │
│ 5. Decision: │
│ ├── metric improved → KEEP (branch advances) │
│ └── metric worsened → REVERT (git reset) │
│ 6. Go to 1 │
│ │
│ Invariant: HEAD always points to the best-known │
│ configuration. The branch is monotonically │
│ improving. │
└────────────────────────────────────────────────────┘
This creates a monotonically improving trajectory — every commit on the branch represents an improvement over the previous state.
Trade-off analysis:
| Property | Greedy Hill-Climbing | Population-Based Search |
|---|---|---|
| Convergence speed | Fast for easy gains | Slower but avoids local optima |
| Implementation complexity | Very low (git commit/reset) | High (population management) |
| Memory requirements | O(1) — current state only | O(N) — population of candidates |
| Risk of local optima | High | Low |
| Human interpretability | Very high (linear history) | Low (complex population dynamics) |
The greedy approach is a deliberate design choice — pi-autoresearch optimizes for simplicity and interpretability over optimality. The assumption is that LLM agents generate sufficiently diverse hypotheses to partially mitigate the local-optima problem.
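Stripped of version control, the cycle above reduces to a few lines. This is a pure simulation — "commit" and "reset" are stand-ins, since real runs go through the extension's tools and git:

```python
def hill_climb(baseline, proposals, lower_is_better=True):
    """Greedy keep/revert over a sequence of (description, metric) results.

    Mirrors the invariant above: `best` always tracks the best-known
    configuration, as HEAD does on the experiment branch.
    """
    best = baseline
    history = []
    for desc, metric in proposals:
        improved = metric < best if lower_is_better else metric > best
        history.append((desc, metric, "kept" if improved else "discarded"))
        if improved:
            best = metric  # "commit": branch advances
        # else "git reset": change dropped, best unchanged
    return best, history

best, hist = hill_climb(14.2, [("parallel workers", 12.9),
                               ("mock db", 13.5),
                               ("cache fixtures", 12.1)])
print(best)  # 12.1
```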
Mechanism 3: Backpressure Checks
The optional autoresearch.checks.sh mechanism provides correctness guarantees during autonomous optimization:
Benchmark exits 0
│
├── autoresearch.checks.sh exists?
│ ├── No → proceed normally (keep/discard based on metric)
│ └── Yes → run checks
│ ├── Checks pass → proceed normally
│ └── Checks fail → log as "checks_failed", revert
│
▼
Normal keep/discard decision
Design principles:
- Checks execution time does not affect the primary metric (measured separately)
- Checks have a separate timeout (default 300s)
- The checks_failed status is distinct from crashed, allowing post-hoc analysis of correctness vs. performance failures
- If no checks file exists, the system behaves identically to systems without this feature
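The decision flow above can be condensed into a small shell sketch. The `run_checks` function stands in for an autoresearch.checks.sh script, and the status strings mirror the log vocabulary (kept / discarded / checks_failed); all names here are illustrative.

```shell
# Hypothetical sketch of the backpressure decision (not the real extension code).
run_checks() { [ "$1" = "ok" ]; }   # fake checks: succeed only for input "ok"

# decide <checks_enabled> <checks_input> <metric_improved>
decide() {
  if [ "$1" = "yes" ] && ! run_checks "$2"; then
    echo checks_failed; return      # revert, regardless of the metric result
  fi
  if [ "$3" = "yes" ]; then echo kept; else echo discarded; fi
}

s_pass=$(decide yes ok  yes)   # checks pass, metric improved -> kept
s_fail=$(decide yes bad yes)   # checks fail: overrides the metric win -> checks_failed
s_none=$(decide no  bad yes)   # no checks file: metric alone decides -> kept
```

The key property is visible in `s_fail`: a correctness failure vetoes even a winning metric, which is exactly the guarantee that makes unattended optimization safe.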
Mechanism 4: Branch-Aware Finalization
The autoresearch-finalize skill converts a messy experiment branch into clean, reviewable branches:
BEFORE (messy experiment branch):
─ baseline
├── exp1: parallel vitest workers (kept, +12%)
├── exp2: mock database calls (discarded)
├── exp3: remove slow regex (kept, +5%)
├── exp4: cache test fixtures (kept, +8%)
├── exp5: swap assertion library (discarded)
└── exp6: remove unnecessary imports (kept, +2%)
AFTER (clean independent branches):
─ merge-base
├── branch: optimize-test-parallelism
│ └── exp1: parallel vitest workers (+12%)
│
├── branch: optimize-test-regex
│ └── exp3: remove slow regex (+5%)
│
├── branch: optimize-test-caching
│ └── exp4: cache test fixtures (+8%)
│
└── branch: optimize-test-imports
└── exp6: remove unnecessary imports (+2%)
Key constraint: Groups must not share files. This ensures branches can be reviewed and merged independently, without conflict resolution.
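The finalization pattern can be demonstrated with plain git: each kept experiment commit is cherry-picked onto its own branch rooted at the merge-base. This is a hedged sketch, not the skill's actual procedure; branch names, commit subjects, and file names are illustrative, and the experiments deliberately touch disjoint files, matching the no-shared-files constraint.

```shell
# Illustrative finalization: one clean branch per kept experiment.
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.email agent@example.com && git config user.name agent

echo base > shared.txt && git add . && git commit -qm "merge-base"
base=$(git rev-parse HEAD)
echo 4 > workers.cfg && git add . && git commit -qm "exp1: parallel vitest workers"
exp1=$(git rev-parse HEAD)
echo fast > matcher.cfg && git add . && git commit -qm "exp3: remove slow regex"
exp3=$(git rev-parse HEAD)

# one independently mergeable branch per kept experiment, each rooted at
# the merge-base -- possible only because the experiments share no files
git branch optimize-test-parallelism "$base"
git branch optimize-test-regex "$base"
git checkout -q optimize-test-parallelism && git cherry-pick "$exp1" >/dev/null
git checkout -q optimize-test-regex && git cherry-pick "$exp3" >/dev/null
git checkout -q main
```

Because no files are shared, each cherry-pick applies cleanly against the merge-base, so a reviewer can merge any subset of the branches in any order.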
Mechanism 5: Context Survival Across Resets
The dual-file persistence strategy (autoresearch.md + autoresearch.jsonl) is designed to survive LLM context window resets:
| File | Audience | Content | Survival Property |
|---|---|---|---|
| autoresearch.md | Agent (narrative understanding) | High-level objective, strategies tried, dead ends, key wins | Enables a fresh agent to understand why and what without reading every experiment |
| autoresearch.jsonl | Agent (precise recall) + Tools (computation) | Every experiment with exact metrics, commits, confidence scores | Enables precise reconstruction of session state and confidence calculations |
The design insight is that LLMs benefit from both narrative context (what's the goal, what approaches have been tried, what failed) and structured data (exact metric values, commit hashes, timestamps). These two modalities serve different purposes and are stored in different formats optimized for each use case.
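A minimal sketch of the two modalities in practice: a narrative note appended to autoresearch.md and a structured record appended to autoresearch.jsonl. The field names follow the examples shown elsewhere in this report; the note text and values are illustrative.

```shell
# Illustrative dual-file write, plus precise recall without a JSON parser.
session=$(mktemp -d)

cat >> "$session/autoresearch.md" <<'EOF'
- Dead end: swapping the assertion library had no measurable effect
- Key win: parallel workers gave the biggest gain (+12%)
EOF

printf '%s\n' '{"run":2,"metric":39.8,"status":"kept","commit":"abc1234"}' \
  >> "$session/autoresearch.jsonl"

# structured recall: the most recent record with status "kept"
last_kept=$(grep '"status":"kept"' "$session/autoresearch.jsonl" | tail -n 1)
```

Both writes are pure appends, which is what lets an interrupted session be resumed without any repair step.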
12 Programming Language
Implementation Stack
Pi-autoresearch is implemented as a pi extension, which uses pi's extension and skill frameworks:
| Layer | Language/Format | Notes |
|---|---|---|
| Extension definition | JSON (extension manifest) | Declares tools, UI widgets, commands |
| Tool implementations | TypeScript (pi extension API) | init_experiment, run_experiment, log_experiment |
| UI components | pi widget framework (declarative) | Status bar, dashboard, fullscreen overlay |
| Skill documents | Markdown | autoresearch-create, autoresearch-finalize |
| Session state | JSON Lines + Markdown | autoresearch.jsonl + autoresearch.md |
| Benchmark scripts | Bash | autoresearch.sh, autoresearch.checks.sh |
| Configuration | JSON | autoresearch.config.json |
Language Choice Rationale
The choice to implement as a pi extension rather than a standalone tool has several implications:
- TypeScript for tools: Pi's extension framework uses TypeScript, providing type safety and IDE support for the tool implementations. This is appropriate for the system-level concerns (timing, file I/O, process execution) that the tools handle.
- Markdown for skills: Skills are authored in Markdown because they are consumed by the LLM agent as context. Markdown provides structure (headers, lists, emphasis) that helps the LLM parse and follow instructions, while remaining human-readable and editable.
- Bash for benchmarks: Benchmark scripts are Bash because they need to invoke arbitrary command-line tools. The `METRIC name=number` output protocol is language-agnostic: any process that writes to stdout can produce metric lines.
- JSON Lines for logs: JSONL is chosen for the experiment log because it supports efficient append-only writes, is line-oriented (enabling `tail -f` style monitoring), and is machine-parseable while remaining human-readable.
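A benchmark script under this protocol can be as small as a few lines: any language works, as long as one stdout line reads `METRIC <name>=<number>`. The sketch below is illustrative; the `sleep` is a stand-in for a real test suite or build command.

```shell
# Hypothetical minimal autoresearch.sh emitting one METRIC line.
start=$(date +%s)
sleep 1                                  # stand-in for: npx vitest run, make, ...
elapsed=$(( $(date +%s) - start ))
echo "METRIC runtime=$elapsed"
```

Because the protocol is just a stdout line, swapping the timed command for `npm run build` or a Python training script requires no changes to the extension side.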
Code Complexity
The overall codebase is notably small — consistent with the "radical simplicity" design philosophy inherited from Karpathy's autoresearch:
| Component | Estimated LOC | Complexity |
|---|---|---|
| Extension (tools + UI) | ~500–800 | Medium (timing, git integration, JSONL parsing, confidence math) |
| Skills | ~200–400 lines of Markdown | Low (structured natural language instructions) |
| Session files | Generated | N/A (generated by agent/tools) |
| Total | ~700–1,200 | Low-to-medium |
13 Memory Management
Memory Architecture
Pi-autoresearch's memory management operates at three distinct levels:
Level 1: LLM Context Window (Volatile)
The LLM agent's context window is the primary working memory. It contains:
- System prompt + tool descriptions
- Skill document (loaded at session start)
- Recent conversation history (experiments, reasoning, results)
- Fragments of autoresearch.md and autoresearch.jsonl (loaded as needed)
The context window is limited (typically 128K–200K tokens for frontier models), and its contents are lost on a context reset.
Level 2: Session Files (Persistent)
The dual-file persistence layer survives context resets:
┌──────────────────────────────────────────────────────────┐
│ Session File Memory Architecture │
│ │
│ autoresearch.md (narrative memory) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Objective: "optimize test runtime" │ │
│ │ • Approach: "started with parallelism, then..." │ │
│ │ • Dead ends: "swapping assertion lib had no effect"│ │
│ │ • Key wins: "parallel workers gave biggest gain" │ │
│ │ • Next ideas: "try test sharding, mock heavy I/O" │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ autoresearch.jsonl (structured memory) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ {"run":1, "metric":45.2, "status":"baseline",...} │ │
│ │ {"run":2, "metric":39.8, "status":"kept",...} │ │
│ │ {"run":3, "metric":41.1, "status":"discarded",...} │ │
│ │ ... │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Level 3: Git History (Permanent)
Every kept experiment is a git commit. This creates a permanent, immutable record of every code change that produced an improvement. The git history is never consumed directly by the agent but serves as a human-auditable archive and enables the finalization workflow.
Memory Scaling
| Session Size | autoresearch.md | autoresearch.jsonl | Git commits | Context Pressure |
|---|---|---|---|---|
| 10 experiments | ~1 KB | ~3 KB | ~5 | Low |
| 50 experiments | ~3–5 KB | ~15 KB | ~20–25 | Medium |
| 200 experiments | ~10–15 KB | ~60 KB | ~80–100 | High (JSONL may exceed context) |
For very long sessions, the agent must selectively read the JSONL file (recent entries, best results) rather than loading the entire history. The narrative autoresearch.md file serves as a compressed summary that remains context-friendly regardless of session length.
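Selective recall of this kind needs no special tooling; standard line-oriented utilities suffice because the log is JSONL. The sketch below is illustrative (log contents are invented, and a lower metric is assumed better): it reads only the most recent entries plus the single best result instead of the full history.

```shell
# Illustrative selective read of a long autoresearch.jsonl.
log=$(mktemp)
cat > "$log" <<'EOF'
{"run":1,"metric":45.2,"status":"baseline"}
{"run":2,"metric":39.8,"status":"kept"}
{"run":3,"metric":41.1,"status":"discarded"}
{"run":4,"metric":38.5,"status":"kept"}
EOF

# tactical detail: only the most recent experiments
recent=$(tail -n 2 "$log")

# best (lowest) metric: prefix each line with its metric, sort numerically
best=$(sed 's/.*"metric":\([0-9.]*\).*/\1 &/' "$log" | sort -n | head -n 1 | cut -d' ' -f2-)
```

This keeps context pressure roughly constant no matter how long the session runs, with autoresearch.md supplying the compressed narrative for everything skipped.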
Comparison with Other Memory Systems
| System | Memory Mechanism | Persistence | Scalability |
|---|---|---|---|
| Karpathy autoresearch | results.tsv + git | File-based | Moderate (TSV grows linearly) |
| AlphaEvolve | Database (internal) | Database-backed | High (indexed queries) |
| OpenEvolve | SQLite + program database | Database-backed | High |
| pi-autoresearch | JSONL + Markdown + git | File-based | Moderate (JSONL grows linearly) |
14 Continued Learning
Within-Session Learning
The agent learns within a session through the accumulation of context:
- Narrative learning: As the agent updates autoresearch.md with dead ends and key wins, it builds a compressed model of the optimization landscape. Future hypotheses are informed by this accumulated knowledge.
- Statistical learning: The confidence scoring provides quantitative feedback on result reliability. After enough experiments, the agent can distinguish between high-confidence improvements and noise-level changes.
- Pattern recognition: The JSONL log provides a structured record of what has been tried. The agent can read this to avoid repeating failed approaches and to identify promising directions.
Cross-Session Learning
Pi-autoresearch supports cross-session learning through the persistence layer:
- Context reset resumption: A fresh agent instance reads autoresearch.md + autoresearch.jsonl and picks up where the previous session left off. The narrative document provides strategic context, while the JSONL provides tactical detail.
- Branch-level transfer: Since sessions are branch-aware, insights from one optimization branch can inform work on another branch — though this requires the human researcher to manually connect the sessions.
What Pi-Autoresearch Does NOT Learn
- No cross-project learning. Insights from optimizing test speed in Project A are not automatically transferred to Project B. Each session starts fresh.
- No meta-optimization. The system does not learn to improve its own optimization strategy over time. The hill-climbing approach, confidence thresholds, and experiment structure are fixed.
- No skill improvement. The domain-specific skill documents are authored by humans and not modified by the system. The system cannot discover that a different benchmark command or metric would be more effective.
- No population dynamics. Unlike evolutionary systems (AlphaEvolve, OpenEvolve), pi-autoresearch maintains a single candidate at a time. It cannot explore multiple optimization paths in parallel or combine insights from diverse candidates.
Comparison with Knowledge Management in Other Systems
| System | Within-Session | Cross-Session | Cross-Project | Meta-Learning |
|---|---|---|---|---|
| Karpathy autoresearch | Via context window | Via results.tsv + git | No | No |
| AlphaEvolve | Program database | Persistent DB | Yes (shared infra) | Partially (self-improvement) |
| OpenEvolve | SQLite DB | Persistent DB | No | No |
| pi-autoresearch | JSONL + Markdown | JSONL + Markdown | No | No |
15 Applications
Primary Application: Developer Workflow Optimization
Pi-autoresearch's primary application is optimizing developer workflows — any measurable aspect of a software project that can be improved through code changes:
| Application | Metric | Typical Gains | Value Proposition |
|---|---|---|---|
| Test suite optimization | Wall-clock runtime | 10–50% faster | Faster CI, faster development loops |
| Build optimization | Build time | 10–40% faster | Faster deployments, faster iteration |
| Bundle optimization | Bundle size (KB) | 5–30% smaller | Faster page loads, better UX |
| Performance optimization | Lighthouse score | 5–20 point improvement | Better SEO, better user experience |
Secondary Application: ML Research
Following Karpathy's autoresearch paradigm, pi-autoresearch can be applied to ML training optimization:
| Application | Metric | Approach |
|---|---|---|
| Architecture search | val_bpb or val_loss | Agent modifies model architecture (layers, widths, attention patterns), measures training performance |
| Hyperparameter optimization | val_bpb or val_loss | Agent adjusts learning rate, batch size, warmup schedule, regularization |
| Data pipeline optimization | Training throughput (samples/sec) | Agent optimizes data loading, preprocessing, caching |
| Training stability | Loss variance | Agent modifies optimization procedure to reduce training instability |
Emerging Application: Multi-Agent Optimization
Pi-autoresearch opens the door to multi-agent optimization workflows:
Developer machine CI server
┌────────────────────┐ ┌────────────────────┐
│ pi + autoresearch │ │ Benchmark runner │
│ ┌────────────────┐ │ git push │ ┌────────────────┐ │
│ │ Agent generates│ │──────────────│ │ Run benchmark │ │
│ │ code changes │ │ │ │ on isolated │ │
│ │ & commits │ │ webhook │ │ hardware │ │
│ │ │ │◄─────────────│ │ │ │
│ │ Reads results │ │ │ │ Report metrics │ │
│ └────────────────┘ │ │ └────────────────┘ │
└────────────────────┘ └────────────────────┘
This pattern enables:
- Benchmarks on dedicated hardware (not affected by developer machine load)
- Longer-running benchmarks (CI can run 30-minute training jobs)
- Multi-metric evaluation (CI reports multiple metrics per run)
Limitations
- Single-metric optimization: The system is designed for single-objective optimization. Multi-objective problems require manual composite scoring or separate sessions.
- Requires automatable benchmarks: Applications where quality is subjective or requires human evaluation are not supported.
- Platform lock-in: The extension is specific to the pi agent. Users of other coding agents (Claude Code, Cursor, Copilot) cannot use pi-autoresearch without switching to pi.
- No guarantee of global optimality: Greedy hill-climbing may miss solutions that require temporary metric regressions. Population-based approaches (AlphaEvolve, OpenEvolve) are better suited to problems with rugged fitness landscapes.
Impact Assessment
Pi-autoresearch represents a significant step in the democratization of autonomous optimization:
| Dimension | Assessment |
|---|---|
| Accessibility | High — single install command, works with any pi project |
| Generality | High — domain-agnostic, any measurable metric |
| Scientific rigor | Medium — confidence scoring is a good start, but lacks formal statistical testing |
| Production readiness | Medium — suitable for development optimization, not safety-critical applications |
| Community adoption | Strong — 3,300+ stars, active development |
| Innovation | Medium — novel extension/skill separation, confidence scoring; core loop is established (Karpathy) |
Position in the Autoresearch Ecosystem
Complexity ──────────────────────────────────────────────► Scale
│
│ Karpathy pi-autoresearch
│ autoresearch (domain-agnostic,
│ (minimal, statistical rigor,
│ single-domain) extension/skill split)
│ │ │
│ ▼ ▼
│ ─────────────────────────────────────────────────
│ Proof of Concept Reusable Framework Platform
│
│ SkyPilot
│ autoresearch
│ (cloud-scale,
│ multi-cluster)
│ │
│ ▼
│ ─────────────────────────────────────────────────
│ Enterprise
Pi-autoresearch occupies the "reusable framework" niche — more sophisticated than Karpathy's proof-of-concept, more accessible than enterprise-grade cloud orchestrators. This positioning makes it the natural choice for individual developers and small teams who want autonomous optimization without significant infrastructure investment.