Pi-Autoresearch

Autonomous experiment loop extension for the pi AI coding agent
Organization: David Vilalta (davebcn87, Independent)
Published: March 2026
Type: Open-Source Extension (MIT License)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: pi-autoresearch — Autonomous experiment loop extension for pi

Repository URL: github.com/davebcn87/pi-autoresearch

Stars: 3,300+ (as of April 2026)

License: MIT

Lineage: Directly inspired by Andrej Karpathy's autoresearch, reimagined as a modular extension for the pi AI coding agent. While Karpathy's system is a standalone monolith tightly coupled to nanochat and a specific LLM agent (Claude Code), pi-autoresearch decouples the experimental loop infrastructure from the domain knowledge, creating a general-purpose optimization extension that works with any command-line-measurable metric.

Publication Date: March 2026

Paradigm: Pi-autoresearch is a meta-tool — it instruments an LLM coding agent with the tools and workflow to run autonomous optimization loops, rather than being an autonomous agent itself. The distinction is architecturally significant: the extension provides infrastructure (timing, logging, version control, dashboards), while domain intelligence comes from a separately authored "skill" document. This separation of concerns enables a single extension to serve unlimited optimization domains.

"Try an idea, measure it, keep what works, discard what doesn't, repeat forever." — pi-autoresearch README

2 Authors and Team

Primary Author

David Vilalta (GitHub: davebcn87) — Software engineer based in Barcelona, Spain. Vilalta's prior work focuses on developer tooling and agent infrastructure. The "bcn" in the handle references Barcelona, and his GitHub profile indicates involvement in AI agent extension ecosystems.

Platform Dependency: pi (by Anthropic)

Pi-autoresearch is designed as a first-class extension for pi, an AI coding agent developed by Anthropic that runs in the terminal. Pi provides the underlying LLM-powered coding agent, extension system, skill framework, and terminal UI infrastructure. Pi-autoresearch extends pi's capabilities rather than building from scratch.

Component Provider
LLM agent runtime pi (Anthropic)
Extension system pi extension framework
Skill authoring pi skill system
Terminal UI (widgets, dashboards) pi UI framework
Experiment infrastructure pi-autoresearch (this project)

Relationship to Karpathy's Autoresearch

The naming and conceptual debt are explicit — the README states "Inspired by karpathy/autoresearch." However, the architectural decisions diverge significantly:

Dimension Karpathy autoresearch pi-autoresearch
Agent coupling Tightly bound to Claude Code Extension for pi (any LLM backend)
Domain coupling Tightly bound to nanochat Domain-agnostic via skills
Measurement 5-min fixed wall-clock val_bpb Any command + any metric
State format results.tsv + git history autoresearch.jsonl + autoresearch.md
UI None (terminal output only) Live widget + expandable dashboard + fullscreen overlay
Finalization Manual branch review Automated branch grouping via autoresearch-finalize
Configuration program.md only autoresearch.config.json + autoresearch.md + autoresearch.sh
Statistical rigor None (raw val_bpb comparison) Confidence scoring via MAD

Community Context

Pi-autoresearch represents the second wave of autoresearch tooling — the "framework wave" that followed Karpathy's proof-of-concept. Where Karpathy demonstrated that the paradigm works, pi-autoresearch asks how to make it reusable, composable, and production-grade. The 3,300+ stars indicate strong community adoption, comparable to or exceeding many of the standalone autoresearch forks but providing a fundamentally different architectural foundation.

3 Core Contribution

Key Novelty: Pi-autoresearch transforms autonomous experimentation from a single-purpose script into a reusable, domain-agnostic infrastructure layer — separating experimental loop mechanics (measurement, version control, statistical confidence, dashboards) from domain-specific knowledge (what to optimize, how to measure, which files to modify). This separation enables any developer using pi to add autonomous optimization to any project with any metric, without writing custom experimental infrastructure.

What Makes Pi-Autoresearch Novel

  1. Extension/skill architecture. The fundamental innovation is the clean separation between the extension (domain-agnostic infrastructure — tools, widgets, dashboards) and the skill (domain-specific knowledge — what to optimize, the benchmark command, the metric, the scope of files). One extension serves unlimited domains through skill composition. This is an architectural pattern not present in any prior autoresearch system.

  2. Confidence scoring with MAD. After 3+ experiments, pi-autoresearch computes a statistical confidence score comparing the best improvement to the session's noise floor using Median Absolute Deviation (MAD). This addresses a critical weakness in Karpathy's system — raw metric comparisons can't distinguish real improvements from benchmark jitter, especially in noisy domains like ML training or Lighthouse scores. The confidence metric is advisory (never auto-discards), but guides the agent to re-run marginal experiments.

  3. Append-only structured log (autoresearch.jsonl). Every experiment is recorded as a single JSON line, creating a machine-readable audit trail that supports post-hoc analysis, resumption after context resets, and cross-session continuity. This is a significant improvement over TSV-based logging, enabling richer structured data (commit hashes, descriptions, confidence scores, status codes, branch context).

  4. Backpressure checks. An optional autoresearch.checks.sh script runs correctness checks (tests, types, lint) after every passing benchmark. This creates a safety valve that prevents optimizations from silently breaking things — a critical concern for multi-hour autonomous runs.

  5. Branch-aware finalization. The autoresearch-finalize skill groups kept experiments into logical changesets, proposes the grouping for human approval, then creates independent branches from the merge-base. Groups must not share files, ensuring each branch can be reviewed and merged independently. This solves the "messy experiment branch" problem that plagues long autonomous runs.

  6. Rich terminal UI. A persistent status widget, expandable results dashboard, and fullscreen overlay with keyboard navigation provide real-time visibility into the optimization process. This is a significant UX improvement over terminal-output-only systems, enabling researchers to monitor multi-hour runs at a glance.
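The disjoint-files constraint in point 5 can be sketched as a small grouping routine. This is an illustrative reading of the finalize skill's behavior, not its actual implementation; the experiment names and file sets below are hypothetical.

```python
def group_disjoint(experiments):
    """Group kept experiments so that no two groups share a file.

    `experiments` is a list of (name, files_touched) pairs; any experiment
    whose files overlap an existing group is folded into it. Illustrative
    only -- the finalize skill's real grouping logic is not shown here.
    """
    groups = []  # list of (names, files); file sets stay disjoint across groups
    for name, files in experiments:
        files = set(files)
        names = [name]
        remaining = []
        for g_names, g_files in groups:
            if g_files & files:          # shares a file: merge into this group
                names = g_names + names
                files |= g_files
            else:
                remaining.append((g_names, g_files))
        remaining.append((names, files))
        groups = remaining
    return groups

# exp1 and exp4 both touch vitest.config.ts, so they land in one branch group.
groups = group_disjoint([
    ("exp1", {"vitest.config.ts"}),
    ("exp3", {"src/slow-regex.ts"}),
    ("exp4", {"vitest.config.ts", "test/fixtures.ts"}),
])
```

Each resulting group can then become an independent branch cut from the merge-base, reviewable in isolation.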

Relationship to Prior Work

System Year Architecture Domain Statistical Rigor UI
Karpathy autoresearch 2026 Monolithic script Neural network training None Terminal output
SkyPilot autoresearch 2026 Cloud orchestrator Any cloud workload None Web dashboard
AutoResearchClaw 2026 Multi-agent Research paper writing None None
pi-autoresearch 2026 Extension + skill Any measurable metric MAD confidence Widget + dashboard

Design Philosophy

Pi-autoresearch embodies a specific design philosophy about autonomous research tooling:

The experimental loop (measure → judge → keep/revert → repeat) is infrastructure, not domain knowledge. Domain knowledge belongs in a separate, human-authored document. Infrastructure should be built once and reused everywhere.

This is fundamentally different from Karpathy's approach, where the experimental loop and the domain knowledge are intertwined in program.md. Pi-autoresearch's separation means that a team optimizing test speeds, another team optimizing LLM training, and a third team optimizing Lighthouse scores all use the exact same extension — only their skills differ.

4 Supported Solutions

Pi-autoresearch is explicitly domain-agnostic. It supports any optimization target where a command-line command produces a measurable numeric metric. The README provides canonical example domains, but the architecture imposes no domain restrictions:

Canonical Example Domains

Domain Metric Direction Benchmark Command Typical Scope
Test speed Seconds ↓ (lower is better) pnpm test Test configuration, runner parallelism, mocking
Bundle size KB ↓ (lower is better) pnpm build && du -sb dist Tree-shaking, code splitting, dependencies
LLM training val_bpb ↓ (lower is better) uv run train.py Architecture, hyperparameters, data pipeline
Build speed Seconds ↓ (lower is better) pnpm build Build tool config, caching, parallelism
Lighthouse score Performance score ↑ (higher is better) lighthouse http://localhost:3000 --output=json SSR, code splitting, image optimization

Solution Taxonomy

Solutions produced by pi-autoresearch fall into several categories:

Solution Type Description Mechanism
Configuration tuning Adjusting config files (test runners, bundlers, compilers) Agent reads config docs, proposes changes, measures impact
Code refactoring Restructuring code for performance without behavior change Agent identifies bottlenecks, refactors, verifies via checks
Algorithm replacement Swapping algorithms for more efficient alternatives Agent proposes alternative implementations, benchmarks
Dependency optimization Adding/removing/replacing dependencies Agent evaluates dependency alternatives, measures impact
Architecture changes Restructuring module boundaries, data flow patterns Agent proposes structural changes, validates via benchmark
Hyperparameter tuning Adjusting numeric parameters in training/optimization code Agent sweeps values, keeps improvements

What It Cannot Optimize

The system has inherent limitations:

  1. Multi-objective optimization. The system tracks a single primary metric per session. Multi-metric optimization requires separate sessions or custom benchmark scripts that compute composite scores.
  2. Non-deterministic metrics. Highly stochastic metrics (e.g., ML validation loss with high variance) challenge the confidence scoring. The MAD-based approach helps but cannot fully compensate for extreme noise.
  3. Long-running benchmarks. The autonomous loop assumes each experiment completes in a reasonable time. Very long benchmarks (hours) create impractically slow feedback loops.
  4. Metrics requiring human judgment. Subjective quality metrics (code readability, UX quality) cannot be benchmarked automatically.
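As limitation 1 notes, multi-metric sessions require a custom benchmark script that computes a composite score. A minimal sketch of that workaround, with two hypothetical sub-metrics and arbitrary weights, emitting the result as the session's single primary metric:

```python
def composite_score(test_seconds, bundle_kb, w_time=1.0, w_size=0.01):
    """Collapse two objectives into one minimizable number (weights arbitrary)."""
    return w_time * test_seconds + w_size * bundle_kb

# A benchmark script would print this as the single metric the session tracks:
score = composite_score(test_seconds=42.0, bundle_kb=850.0)
print(f"METRIC composite_score={score}")
```

The trade-off between the objectives is then frozen into the weights, which is exactly why the system otherwise insists on a single primary metric per session.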

5 LLM Integration

Architecture: LLM-as-Agent, Not LLM-as-Mutation-Operator

Pi-autoresearch uses the LLM fundamentally differently from evolutionary systems like AlphaEvolve or FunSearch. In those systems, the LLM is a mutation operator called by the system to propose code changes. In pi-autoresearch, the LLM is the agent — it has full autonomy to read files, understand context, form hypotheses, make changes, and decide what to try next. The extension merely provides tools that the agent invokes at its discretion.

┌─────────────────────────────────────────────────────────────┐
│                      LLM Agent (pi)                         │
│                                                             │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │  Read files  │  │ Form         │  │  Make code        │  │
│  │  + context   │──│ hypothesis   │──│  changes          │  │
│  └─────────────┘  └──────────────┘  └────────┬──────────┘  │
│                                               │              │
│  ┌────────────────────────────────────────────┼──────────┐  │
│  │         Extension Tools (invoked by agent)  │          │  │
│  │                                             ▼          │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │  │
│  │  │init_experiment│  │run_experiment │  │log_experiment│  │  │
│  │  │(one-time      │  │(executes cmd, │  │(records      │  │  │
│  │  │ session setup)│  │ times it,     │  │ result,      │  │  │
│  │  │              │  │ captures      │  │ auto-commits,│  │  │
│  │  │              │  │ output)       │  │ updates UI)  │  │  │
│  │  └──────────────┘  └──────────────┘  └────────────┘   │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │         Skill (domain knowledge, loaded at start)      │  │
│  │                                                         │  │
│  │  "Optimize test runtime by modifying vitest configs.    │  │
│  │   Command: pnpm test. Metric: seconds (lower better).  │  │
│  │   Files in scope: vitest.config.ts, test/**/*.test.ts"  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

LLM Interaction Modes

Mode Trigger LLM Behavior
Session setup /skill:autoresearch-create Agent asks about goal, command, metric, scope — or infers from context. Creates session files.
Autonomous loop After setup Agent edits code → commits → calls run_experiment → calls log_experiment → decides keep/revert → repeats indefinitely
Finalization /skill:autoresearch-finalize Agent reads experiment log, groups changes into logical branches, proposes grouping for approval
Interruption User presses Escape Agent stops loop and provides summary of results
Resumption /autoresearch <context> Agent reads autoresearch.md + autoresearch.jsonl, reconstructs context, resumes loop

No Model Coupling

A critical design decision: pi-autoresearch does not specify which LLM powers the pi agent. The extension works with whatever LLM backend the user has configured in pi. This means the same experimental infrastructure works with:

- Claude (Anthropic) — pi's primary backend
- GPT-4o/o3 (OpenAI) — if configured via API key
- Gemini (Google) — if configured via API key
- Open-weight models — if configured via local inference

The LLM choice affects the quality of hypotheses the agent generates, but not the experimental infrastructure.

Prompt Architecture

The system uses a layered prompt architecture:

Layer 1: pi system prompt (agent capabilities, tool definitions)
    ↓
Layer 2: Extension tool descriptions (init_experiment, run_experiment, log_experiment)
    ↓
Layer 3: Skill document (domain-specific instructions, loaded at session start)
    ↓
Layer 4: Session context (autoresearch.md — what's been tried, what worked)
    ↓
Layer 5: Real-time state (widget data, recent experiment results)

This layered architecture means the LLM always has access to:

- What it can do (tools)
- What it should optimize (skill)
- What has already been tried (session history)
- How well it's doing (confidence scores, metric trajectory)

6 Key Results

Claimed Performance

The README does not report specific experimental results — it is infrastructure, not a benchmark paper. However, the system is designed to produce results in any domain. The README's example domains suggest typical performance improvements:

Domain Typical Improvement Confidence Threshold
Test speed 10–50% reduction in runtime ≥ 2.0× MAD (green)
Bundle size 5–30% reduction in KB ≥ 2.0× MAD (green)
LLM training 5–15% reduction in val_bpb ≥ 2.0× MAD (green)
Build speed 10–40% reduction in build time ≥ 2.0× MAD (green)

Statistical Confidence Framework

Pi-autoresearch's most scientifically rigorous feature is its confidence scoring:

Metric Formula Interpretation
MAD (Median Absolute Deviation) median(|x_i - median(x)|) Robust noise floor estimate, resistant to outliers
Confidence score |best_improvement| / MAD How many times the best improvement exceeds noise

Threshold Interpretation
≥ 2.0× (green) Improvement is likely real
1.0–2.0× (yellow) Above noise but marginal — re-run recommended
< 1.0× (red) Within noise — likely jitter, not real improvement

The choice of MAD over standard deviation is deliberate — MAD is robust to outliers, which are common in benchmark measurements (GC pauses, CPU throttling, I/O contention). This makes the confidence score reliable even with heterogeneous experiment results.

Comparison with Karpathy's Results

Karpathy's autoresearch reported 11% cumulative reduction in time-to-GPT-2-quality over an overnight run. Pi-autoresearch provides the infrastructure to achieve similar results but with added statistical rigor:

Feature Karpathy pi-autoresearch
Raw improvement detection Yes (raw val_bpb comparison) Yes (raw metric comparison)
Noise floor estimation No Yes (MAD)
Confidence scoring No Yes (improvement/MAD ratio)
False positive filtering No Advisory (agent guided to re-run marginal results)
Post-hoc analysis TSV file Structured JSONL with full metadata

Community Adoption Signal

The 3,300+ GitHub stars provide a strong adoption signal. The extension's domain-agnostic nature means its user base is fundamentally broader than domain-specific autoresearch tools — it serves anyone who uses pi for development, regardless of their optimization domain.

7 Reproducibility

Reproducibility Design

Pi-autoresearch is designed with reproducibility as a first-class concern at multiple levels:

Experiment-level reproducibility:

- Every experiment that is "kept" produces a git commit with a descriptive message including the metric improvement
- The autoresearch.jsonl log records every experiment (kept and discarded) with timestamps, commit hashes, metric values, confidence scores, and descriptions
- Any individual experiment can be reproduced by checking out its commit and re-running the benchmark command

Session-level reproducibility:

- The autoresearch.md file captures the complete session context: objective, metrics, files in scope, what has been tried, dead ends, and key wins
- A fresh agent with no memory can read autoresearch.md + autoresearch.jsonl and continue exactly where the previous session left off
- Sessions are branch-aware — each branch has its own session state

Infrastructure-level reproducibility:

- The extension is installed via pi install https://github.com/davebcn87/pi-autoresearch
- The benchmark script (autoresearch.sh) is an explicit, versioned shell script — not implicit agent behavior
- The optional checks script (autoresearch.checks.sh) is similarly explicit and versioned

Session Files

File Format Purpose Persistence
autoresearch.md Markdown Living session document — objective, history, dead ends Survives context resets, agent restarts
autoresearch.jsonl JSON Lines Append-only experiment log with full metadata Survives restarts, branch-aware
autoresearch.sh Shell script Benchmark command with pre-checks and metric output Versioned in git
autoresearch.checks.sh Shell script (optional) Correctness checks (tests, types, lint) Versioned in git
autoresearch.config.json JSON (optional) Session configuration (working dir, max iterations) Versioned in git

Resumption Protocol

The system supports three resumption scenarios:

  1. Agent restart (same context window): The agent reads autoresearch.jsonl to reconstruct state and continues the loop.
  2. Context reset (new agent instance): A fresh agent reads autoresearch.md for high-level context and autoresearch.jsonl for detailed history. The README explicitly states: "A fresh agent with no memory can read these two files and continue exactly where the previous session left off."
  3. Branch switch: Each branch maintains its own session state. Switching branches automatically switches the session context.
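State reconstruction from the append-only log can be sketched as follows. This is a sketch, not the extension's actual resumption code; it assumes the record fields listed in section 10 and a minimized metric.

```python
import json

def reconstruct_state(log_path, minimize=True):
    """Rebuild the best-known result from autoresearch.jsonl after a reset."""
    kept, best = [], None
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["status"] not in ("kept", "baseline"):
                continue                 # discarded/crashed runs don't advance HEAD
            kept.append(rec)
            if best is None or ((rec["metric_value"] < best["metric_value"])
                                == minimize):
                best = rec
    return {"experiments": kept, "best": best}
```

A fresh agent would pair this structured view with the free-form context in autoresearch.md before resuming the loop.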

Limitations on Reproducibility

  1. LLM non-determinism: The agent's hypotheses and code changes depend on the LLM's stochastic generation. Two runs with identical starting conditions will generally produce different experiment sequences.
  2. Environment sensitivity: Benchmark measurements depend on system load, hardware, and environment. The confidence scoring mitigates this but doesn't eliminate it.
  3. Skill authoring variance: The quality and specificity of the skill document significantly affect results. Vague skills produce scattered experiments; precise skills produce focused optimization.

8 Compute and API Costs

Cost Model

Pi-autoresearch's costs have three components:

Component Source Scaling Factor
LLM tokens pi agent inference (per-experiment reasoning + code changes) Experiments × tokens/experiment
Compute time Benchmark execution Experiments × benchmark duration
Human attention Monitoring, review, finalization Low (designed for unattended operation)

Token Cost Estimation

Each experiment cycle involves:

Phase Estimated Tokens Notes
Read context + session files 2,000–10,000 input Depends on autoresearch.md size
Generate hypothesis + code change 1,000–5,000 output Depends on change complexity
Interpret results + decide keep/revert 500–2,000 output Includes reasoning about metrics
Per-experiment total 3,500–17,000 ~10,000 tokens typical

For a 50-experiment session with Claude Sonnet-level pricing (~$3/M input, ~$15/M output tokens):

- Input: 50 × 6,000 = 300K tokens → ~$0.90
- Output: 50 × 4,000 = 200K tokens → ~$3.00
- Total: ~$4–$10 per 50-experiment session
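The arithmetic above can be wrapped in a small estimator. The prices and default token counts are the figures quoted in the text, not measured values:

```python
def session_cost_usd(n_experiments, in_tokens=6_000, out_tokens=4_000,
                     usd_per_m_in=3.0, usd_per_m_out=15.0):
    """Estimated session cost; prices are per million tokens (text's figures)."""
    cost_in = n_experiments * in_tokens / 1e6 * usd_per_m_in
    cost_out = n_experiments * out_tokens / 1e6 * usd_per_m_out
    return cost_in + cost_out

print(round(session_cost_usd(50), 2))   # the ~$0.90 + ~$3.00 example above
```

Plugging in the upper per-experiment token estimates (10,000 in / 7,000 out) reproduces the upper end of the quoted $4–$10 range.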

Cost Controls

Pi-autoresearch provides two mechanisms for cost control:

  1. maxIterations in autoresearch.config.json: Hard limit on experiments per session. The agent is instructed to stop and won't run more experiments beyond this limit.
{
  "maxIterations": 30
}
  2. API key spending limits: The README recommends using provider-side spending caps: "most providers let you set per-key or monthly budgets."

Cost Comparison

System Typical Session Cost Autonomous Duration Cost Control
Karpathy autoresearch $5–$50 (overnight) 8–12 hours Manual interruption only
pi-autoresearch $4–$10 (50 experiments) Varies by benchmark maxIterations + API limits
AlphaEvolve $1,000–$100,000+ Days Internal Google quotas

9 Architecture Solution

Three-Layer Architecture

Pi-autoresearch's architecture follows a clean three-layer separation:

┌─────────────────────────────────────────────────────────────────┐
│                    LAYER 3: SKILL (Domain Knowledge)            │
│                                                                 │
│  autoresearch-create          autoresearch-finalize              │
│  ┌─────────────────────┐     ┌──────────────────────────────┐   │
│  │ • Asks about goal    │     │ • Reads experiment log        │   │
│  │ • Infers from context│     │ • Groups kept experiments     │   │
│  │ • Writes session     │     │ • Proposes branch grouping    │   │
│  │   files              │     │ • Creates independent branches│   │
│  │ • Starts loop        │     │ • Ensures no shared files     │   │
│  └─────────────────────┘     └──────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│                   LAYER 2: EXTENSION (Infrastructure)           │
│                                                                 │
│  Tools                    UI                    Commands         │
│  ┌───────────────────┐   ┌───────────────┐     ┌────────────┐  │
│  │ init_experiment    │   │ Status widget │     │/autoresearch│  │
│  │ run_experiment     │   │ Dashboard     │     │  <context>  │  │
│  │ log_experiment     │   │ Fullscreen    │     │  off        │  │
│  └───────────────────┘   │ overlay       │     │  clear      │  │
│                          └───────────────┘     └────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                     LAYER 1: pi RUNTIME                         │
│                                                                 │
│  LLM Agent  │  Terminal UI  │  Git Integration  │  File I/O    │
│  Extension  │  Widget       │  Skill System     │  Process     │
│  Framework  │  Framework    │                   │  Execution   │
└─────────────────────────────────────────────────────────────────┘

Data Flow

The data flow through a single experiment cycle:

Agent generates hypothesis
    │
    ▼
Agent edits source code
    │
    ▼
Agent calls git commit (via pi)
    │
    ▼
Agent calls run_experiment(command="pnpm test")
    │
    ├──► Extension executes command
    ├──► Extension times wall-clock duration
    ├──► Extension captures stdout/stderr
    ├──► Extension parses METRIC lines from output
    │
    ▼
Agent calls log_experiment(metric_value, description)
    │
    ├──► Extension records to autoresearch.jsonl
    ├──► Extension computes confidence score (if 3+ runs)
    ├──► Extension updates status widget
    ├──► Extension updates dashboard
    │
    ▼
Agent evaluates: keep or revert?
    │
    ├──► Keep: branch advances, agent generates next hypothesis
    │
    └──► Revert: git reset, agent tries different approach

Metric Protocol

The benchmark script (autoresearch.sh) communicates metrics to the extension via a simple protocol — standard output lines matching the pattern:

METRIC name=number

For example:

#!/bin/bash
set -euo pipefail
# One way to produce the metric: time the workload with bash's SECONDS counter.
start=$SECONDS
pnpm test --run 2>&1
echo "METRIC total_test_seconds=$((SECONDS - start))"

This protocol is deliberately minimal — any language, any build tool, any measurement approach can produce METRIC lines.
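A consumer of this protocol might extract metrics from the captured output as sketched below. The regex is an assumption consistent with the stated METRIC name=number shape; the extension's actual parser (e.g. its handling of scientific notation) may differ.

```python
import re

# Matches the documented `METRIC name=number` line shape (an assumption).
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*$")

def parse_metrics(output):
    """Extract {name: value} from captured benchmark stdout/stderr."""
    metrics = {}
    for line in output.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```

Because only whole lines matching the pattern count, ordinary test or build output passes through harmlessly.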

State Machine

The extension manages a session state machine:

                ┌──────────────┐
                │   INACTIVE   │
                │  (no session)│
                └──────┬───────┘
                       │ /autoresearch <context>
                       │ or /skill:autoresearch-create
                       ▼
                ┌──────────────┐
                │   SETUP      │◄─── init_experiment()
                │  (configuring│     defines name, metric,
                │   session)   │     unit, direction
                └──────┬───────┘
                       │ Session files written
                       │ Baseline measured
                       ▼
                ┌──────────────┐
                │   RUNNING    │◄─── Autonomous loop
                │  (experiment │     edit → commit →
                │   loop)      │     run → log → decide
                └──────┬───────┘
                       │ /autoresearch off
                       │ or Escape
                       │ or maxIterations reached
                       ▼
                ┌──────────────┐
                │   PAUSED     │
                │  (loop stopped│
                │   state kept) │
                └──────┬───────┘
                       │ /autoresearch <context>
                       │ (resumes loop)
                       │
                       │ /autoresearch clear
                       ▼
                ┌──────────────┐
                │   CLEARED    │
                │  (state      │
                │   deleted)   │──── Returns to INACTIVE
                └──────────────┘
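The diagram above can be summarized as a transition table. This is a sketch; the event names are paraphrased from the diagram, not the extension's API, and the transient CLEARED state is folded directly into INACTIVE.

```python
# Transition table for the session state machine shown above (a sketch).
TRANSITIONS = {
    ("INACTIVE", "start"):   "SETUP",    # /autoresearch or autoresearch-create
    ("SETUP", "baseline"):   "RUNNING",  # session files written, baseline measured
    ("RUNNING", "stop"):     "PAUSED",   # Escape, /autoresearch off, maxIterations
    ("PAUSED", "resume"):    "RUNNING",  # /autoresearch <context>
    ("PAUSED", "clear"):     "INACTIVE", # /autoresearch clear deletes state
}

def step(state, event):
    """Advance the session; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that RUNNING can only be left via PAUSED, which is why interrupted sessions keep their state on disk and remain resumable.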

10 Component Breakdown

Extension Components

Component Type Lifecycle Description
init_experiment Tool Called once per session Configures session: experiment name, primary metric name, unit, optimization direction (minimize/maximize)
run_experiment Tool Called per experiment Executes benchmark command, measures wall-clock duration, captures output, parses METRIC lines
log_experiment Tool Called per experiment Records result to autoresearch.jsonl, auto-commits, computes confidence score, updates UI components
Status widget UI Persistent during session Always-visible bar above editor: run count, keep count, best metric value, improvement %, confidence score
Expanded dashboard UI Toggle via Ctrl+X Full results table with columns for commit, metric, status, description
Fullscreen overlay UI Toggle via Ctrl+Shift+X Scrollable full-terminal dashboard with live spinner for running experiments

Skill Components

Component Type Purpose
autoresearch-create Skill Session initialization — gathers goal, command, metric, scope; writes session files; runs baseline; starts loop
autoresearch-finalize Skill Branch finalization — groups kept experiments into independent branches for review

Session File Components

File Format Schema Update Frequency
autoresearch.md Markdown Free-form document with sections for objective, metrics, scope, history, dead ends, wins Updated by agent after significant events
autoresearch.jsonl JSON Lines {commit, metric_name, metric_value, status, description, confidence, timestamp} Appended after every experiment
autoresearch.sh Bash Pre-checks + workload + METRIC output lines Written once during setup, rarely modified
autoresearch.checks.sh Bash (optional) Correctness checks (tests, types, lint) Written once during setup
autoresearch.config.json JSON {workingDir?, maxIterations?} Written once, manually updated

Experiment Status Taxonomy

Each experiment in autoresearch.jsonl has one of these statuses:

Status Meaning UI Color Agent Action
kept Metric improved, changes committed Green Branch advances
discarded Metric did not improve Gray Git reset, try again
crashed Benchmark command failed (non-zero exit) Red Git reset, try different approach
checks_failed Benchmark passed but correctness checks failed Orange Git reset, fix correctness issue
baseline Initial measurement, no changes Blue Reference point for improvements
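Appending one record to autoresearch.jsonl might look like the sketch below, using the field names from the schema table above. The helper and its example call are illustrative, not the extension's code.

```python
import json
import time

def append_experiment(path, commit, metric_name, metric_value,
                      status, description, confidence=None):
    """Append one experiment as a single JSON line (append-only audit trail)."""
    record = {
        "commit": commit,
        "metric_name": metric_name,
        "metric_value": metric_value,
        "status": status,            # one of the statuses in the table above
        "description": description,
        "confidence": confidence,    # None until 3+ experiments exist
        "timestamp": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

append_experiment("autoresearch.jsonl", "a1b2c3d", "total_test_seconds",
                  41.8, "kept", "parallel vitest workers")
```

Because each record is one self-contained line, the log can be tailed live, grepped by status, and safely appended to across agent restarts.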

11 Core Mechanisms (Detailed)

Mechanism 1: Confidence Scoring via MAD

The most technically sophisticated mechanism in pi-autoresearch is the confidence scoring system, which uses Median Absolute Deviation (MAD) as a robust noise floor estimator.

Why MAD instead of Standard Deviation?

Standard deviation is sensitive to outliers. In benchmark measurements, outliers are common:

- GC pauses can spike individual measurements by 2–10×
- CPU thermal throttling can degrade measurements unpredictably
- I/O contention from other processes creates sporadic slowdowns
- ML training loss has inherent stochasticity

MAD is robust to these outliers because it uses the median rather than the mean:

MAD = median(|x_i - median(x)|)

Confidence computation:

confidence = |best_improvement| / MAD

Where best_improvement is the largest positive change in the primary metric (accounting for optimization direction).

Interpretation thresholds:

Confidence Signal Color Agent Guidance
≥ 2.0× Strong signal 🟢 Green Improvement is likely real. Keep and continue.
1.0–2.0× Marginal signal 🟡 Yellow Above noise but uncertain. Re-run to confirm.
< 1.0× Noise 🔴 Red Within noise floor. Consider reverting.

Implementation details:

- Confidence is only computed after 3+ experiments (minimum sample size for meaningful MAD)
- All metric values in the current segment contribute to the MAD computation
- Confidence is persisted to autoresearch.jsonl for post-hoc analysis
- Confidence is displayed in the status widget, the expanded dashboard, and log_experiment output
- The confidence score is advisory only — it never auto-discards experiments
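The formulas above translate directly into code. In this sketch, treating best_improvement as the largest direction-adjusted change relative to the baseline measurement is an interpretation, not confirmed behavior of the extension.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median(|x_i - median(x)|)."""
    m = median(values)
    return median(abs(x - m) for x in values)

def confidence(metric_values, baseline, minimize=True):
    """|best_improvement| / MAD, following the formulas above (a sketch)."""
    if len(metric_values) < 3:
        return None                      # minimum sample size for meaningful MAD
    improvements = [(baseline - v) if minimize else (v - baseline)
                    for v in metric_values]
    noise = mad(metric_values)
    if noise == 0:
        return float("inf")
    return abs(max(improvements)) / noise
```

For example, test times of 50, 48, 49, 44 seconds against a 50-second baseline give a best improvement of 6 s over a MAD of 1.0 s, i.e. a confidence of 6.0, comfortably in the green band.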

Mechanism 2: Greedy Hill-Climbing with Version Control

The core optimization strategy is greedy hill-climbing backed by git version control:

┌────────────────────────────────────────────────────┐
│                 EXPERIMENT CYCLE                    │
│                                                    │
│  1. Agent edits code (informed by session context) │
│  2. Agent creates git commit                       │
│  3. run_experiment executes benchmark              │
│  4. log_experiment records result                  │
│  5. Decision:                                      │
│     ├── metric improved → KEEP (branch advances)   │
│     └── metric worsened → REVERT (git reset)       │
│  6. Go to 1                                        │
│                                                    │
│  Invariant: HEAD always points to the best-known   │
│  configuration. The branch is monotonically        │
│  improving.                                        │
└────────────────────────────────────────────────────┘

This creates a monotonically improving trajectory — every commit on the branch represents an improvement over the previous state.

Trade-off analysis:

| Property | Greedy Hill-Climbing | Population-Based Search |
|---|---|---|
| Convergence speed | Fast for easy gains | Slower but avoids local optima |
| Implementation complexity | Very low (git commit/reset) | High (population management) |
| Memory requirements | O(1) — current state only | O(N) — population of candidates |
| Risk of local optima | High | Low |
| Human interpretability | Very high (linear history) | Low (complex population dynamics) |

The greedy approach is a deliberate design choice — pi-autoresearch optimizes for simplicity and interpretability over optimality. The assumption is that LLM agents generate sufficiently diverse hypotheses to partially mitigate the local-optima problem.

Mechanism 3: Backpressure Checks

The optional autoresearch.checks.sh mechanism provides correctness guarantees during autonomous optimization:

Benchmark exits 0
    │
    ├── autoresearch.checks.sh exists?
    │   ├── No → proceed normally (keep/discard based on metric)
    │   └── Yes → run checks
    │       ├── Checks pass → proceed normally
    │       └── Checks fail → log as "checks_failed", revert
    │
    ▼
Normal keep/discard decision

Design principles:

- Checks execution time does not affect the primary metric (it is measured separately)
- Checks have a separate timeout (default 300s)
- The checks_failed status is distinct from crashed, allowing post-hoc analysis of correctness vs. performance failures
- If no checks file exists, the system behaves identically to systems without this feature
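The gate reduces to a small amount of control flow. A sketch under the assumptions above (`gate` is a hypothetical name, and the status strings mirror those in the text):

```typescript
import { existsSync } from "node:fs";
import { spawnSync } from "node:child_process";

type Status = "kept" | "discarded" | "checks_failed";

function gate(metricImproved: boolean, checksFile = "autoresearch.checks.sh"): Status {
  if (existsSync(checksFile)) {
    // Checks run after the benchmark with their own timeout (300s
    // default), so their runtime never pollutes the primary metric.
    const r = spawnSync("bash", [checksFile], { timeout: 300_000 });
    if (r.status !== 0) return "checks_failed"; // correctness failure → revert
  }
  return metricImproved ? "kept" : "discarded";
}
```

Note the fall-through: with no checks file present, the function degenerates to the plain keep/discard decision, exactly matching a system without the feature.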

Mechanism 4: Branch-Aware Finalization

The autoresearch-finalize skill converts a messy experiment branch into clean, reviewable branches:

BEFORE (messy experiment branch):
─ baseline
  ├── exp1: parallel vitest workers (kept, +12%)
  ├── exp2: mock database calls (discarded)
  ├── exp3: remove slow regex (kept, +5%)
  ├── exp4: cache test fixtures (kept, +8%)
  ├── exp5: swap assertion library (discarded)
  └── exp6: remove unnecessary imports (kept, +2%)

AFTER (clean independent branches):
─ merge-base
  ├── branch: optimize-test-parallelism
  │   └── exp1: parallel vitest workers (+12%)
  │
  ├── branch: optimize-test-regex
  │   └── exp3: remove slow regex (+5%)
  │
  ├── branch: optimize-test-caching
  │   └── exp4: cache test fixtures (+8%)
  │
  └── branch: optimize-test-imports
      └── exp6: remove unnecessary imports (+2%)

Key constraint: Groups must not share files. This ensures branches can be reviewed and merged independently, without conflict resolution.
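The disjointness constraint implies that experiments whose touched-file sets overlap, even transitively, must land on the same branch. A sketch of that grouping logic (function names are illustrative, not the skill's actual implementation):

```typescript
interface Exp { id: string; files: string[]; }

function filesOverlap(a: string[], b: string[]): boolean {
  const set = new Set(a);
  return b.some((f) => set.has(f));
}

function groupByFiles(exps: Exp[]): string[][] {
  let groups: { ids: string[]; files: string[] }[] = [];
  for (const exp of exps) {
    // Merge every existing group this experiment touches, so the final
    // groups stay pairwise file-disjoint (the independent-review invariant).
    const hits = groups.filter((g) => filesOverlap(g.files, exp.files));
    const merged = { ids: [exp.id], files: [...exp.files] };
    for (const g of hits) {
      merged.ids.push(...g.ids);
      merged.files.push(...g.files);
    }
    groups = groups.filter((g) => !hits.includes(g));
    groups.push(merged);
  }
  return groups.map((g) => g.ids);
}
```

Two kept experiments that both edit `vitest.config.ts` would be folded into one branch, while an experiment touching only unrelated source files gets its own.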

Mechanism 5: Context Survival Across Resets

The dual-file persistence strategy (autoresearch.md + autoresearch.jsonl) is designed to survive LLM context window resets:

| File | Audience | Content | Survival Property |
|---|---|---|---|
| autoresearch.md | Agent (narrative understanding) | High-level objective, strategies tried, dead ends, key wins | Enables a fresh agent to understand why and what without reading every experiment |
| autoresearch.jsonl | Agent (precise recall) + Tools (computation) | Every experiment with exact metrics, commits, confidence scores | Enables precise reconstruction of session state and confidence calculations |

The design insight is that LLMs benefit from both narrative context (what's the goal, what approaches have been tried, what failed) and structured data (exact metric values, commit hashes, timestamps). These two modalities serve different purposes and are stored in different formats optimized for each use case.
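The structured half of this layer can be sketched as a one-line-per-experiment append. The record fields below are illustrative; the real JSONL schema may differ.

```typescript
import { appendFileSync } from "node:fs";

interface ExperimentRecord {
  run: number;
  metric: number;
  status: "baseline" | "kept" | "discarded" | "checks_failed" | "crashed";
  commit: string;
  confidence: number | null;
  ts: string;
}

// One JSON object per line: supports append-only writes, `tail -f`
// style monitoring, and machine parsing while staying human-readable.
function serialize(rec: ExperimentRecord): string {
  return JSON.stringify(rec) + "\n";
}

function logExperiment(rec: ExperimentRecord, path = "autoresearch.jsonl"): void {
  appendFileSync(path, serialize(rec));
}
```

Because each line is independent, a crash mid-session corrupts at most the final entry, and a fresh agent can rebuild its state by replaying the file top to bottom.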

12 Programming Language

Implementation Stack

Pi-autoresearch is implemented as a pi extension, which uses pi's extension and skill frameworks:

| Layer | Language/Format | Notes |
|---|---|---|
| Extension definition | JSON (extension manifest) | Declares tools, UI widgets, commands |
| Tool implementations | TypeScript (pi extension API) | init_experiment, run_experiment, log_experiment |
| UI components | pi widget framework (declarative) | Status bar, dashboard, fullscreen overlay |
| Skill documents | Markdown | autoresearch-create, autoresearch-finalize |
| Session state | JSON Lines + Markdown | autoresearch.jsonl + autoresearch.md |
| Benchmark scripts | Bash | autoresearch.sh, autoresearch.checks.sh |
| Configuration | JSON | autoresearch.config.json |

Language Choice Rationale

The choice to implement as a pi extension rather than a standalone tool has several implications:

  1. TypeScript for tools: Pi's extension framework uses TypeScript, providing type safety and IDE support for the tool implementations. This is appropriate for the system-level concerns (timing, file I/O, process execution) that the tools handle.

  2. Markdown for skills: Skills are authored in Markdown because they are consumed by the LLM agent as context. Markdown provides structure (headers, lists, emphasis) that helps the LLM parse and follow instructions, while remaining human-readable and editable.

  3. Bash for benchmarks: Benchmark scripts are Bash because they need to invoke arbitrary command-line tools. The METRIC name=number output protocol is language-agnostic — any process that writes to stdout can produce metric lines.

  4. JSON Lines for logs: JSONL is chosen for the experiment log because it supports efficient append-only writes, is line-oriented (enabling tail -f style monitoring), and is machine-parseable while remaining human-readable.
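The METRIC protocol from point 3 is simple enough to parse with a single anchored regex. A sketch (the exact tolerated formatting is an assumption):

```typescript
// One `METRIC name=number` line per metric on stdout; everything else
// on stdout is ignored, so benchmarks can log freely.
function parseMetrics(stdout: string): Map<string, number> {
  const metrics = new Map<string, number>();
  for (const line of stdout.split("\n")) {
    // Anchored at line start so incidental mentions of METRIC are skipped.
    const m = line.match(/^METRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*$/);
    if (m) metrics.set(m[1], parseFloat(m[2]));
  }
  return metrics;
}
```

Any process in any language can participate: `echo "METRIC time=39.8"` from Bash, a `print` from Python, or a `fmt.Println` from Go all satisfy the protocol.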

Code Complexity

The overall codebase is notably small — consistent with the "radical simplicity" design philosophy inherited from Karpathy's autoresearch:

| Component | Estimated LOC | Complexity |
|---|---|---|
| Extension (tools + UI) | ~500–800 | Medium (timing, git integration, JSONL parsing, confidence math) |
| Skills | ~200–400 lines of Markdown | Low (structured natural language instructions) |
| Session files | Generated | N/A (generated by agent/tools) |
| Total | ~700–1,200 | Low-to-medium |

13 Memory Management

Memory Architecture

Pi-autoresearch's memory management operates at three distinct levels:

Level 1: LLM Context Window (Volatile)

The LLM agent's context window is the primary working memory. It contains:

- System prompt + tool descriptions
- Skill document (loaded at session start)
- Recent conversation history (experiments, reasoning, results)
- Fragments of autoresearch.md and autoresearch.jsonl (loaded as needed)

The context window is limited (typically 128K–200K tokens for frontier models), and its contents are lost on context reset.

Level 2: Session Files (Persistent)

The dual-file persistence layer survives context resets:

┌──────────────────────────────────────────────────────────┐
│              Session File Memory Architecture             │
│                                                          │
│  autoresearch.md (narrative memory)                      │
│  ┌────────────────────────────────────────────────────┐  │
│  │ • Objective: "optimize test runtime"               │  │
│  │ • Approach: "started with parallelism, then..."    │  │
│  │ • Dead ends: "swapping assertion lib had no effect"│  │
│  │ • Key wins: "parallel workers gave biggest gain"   │  │
│  │ • Next ideas: "try test sharding, mock heavy I/O"  │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  autoresearch.jsonl (structured memory)                  │
│  ┌────────────────────────────────────────────────────┐  │
│  │ {"run":1, "metric":45.2, "status":"baseline",...}  │  │
│  │ {"run":2, "metric":39.8, "status":"kept",...}      │  │
│  │ {"run":3, "metric":41.1, "status":"discarded",...} │  │
│  │ ...                                                │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Level 3: Git History (Permanent)

Every kept experiment is a git commit. This creates a permanent, immutable record of every code change that produced an improvement. The git history is never consumed directly by the agent but serves as a human-auditable archive and enables the finalization workflow.

Memory Scaling

| Session Size | autoresearch.md | autoresearch.jsonl | Git commits | Context Pressure |
|---|---|---|---|---|
| 10 experiments | ~1 KB | ~3 KB | ~5 | Low |
| 50 experiments | ~3–5 KB | ~15 KB | ~20–25 | Medium |
| 200 experiments | ~10–15 KB | ~60 KB | ~80–100 | High (JSONL may exceed context) |

For very long sessions, the agent must selectively read the JSONL file (recent entries, best results) rather than loading the entire history. The narrative autoresearch.md file serves as a compressed summary that remains context-friendly regardless of session length.
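Selective recall can be sketched as loading only the tail of the log plus the best kept result. The record shape and status values here are assumptions about the real schema, and a lower-is-better metric is assumed.

```typescript
interface Rec { run: number; metric: number; status: string; }

function recall(jsonl: string, lastN = 20): { recent: Rec[]; best: Rec | null } {
  const all = jsonl
    .split("\n")
    .filter((l) => l.trim().length > 0)
    .map((l) => JSON.parse(l) as Rec);
  // Only baseline and kept runs are candidates for "best so far".
  const candidates = all.filter((r) => r.status === "kept" || r.status === "baseline");
  const best = candidates.reduce<Rec | null>(
    (b, r) => (b === null || r.metric < b.metric ? r : b),
    null,
  );
  return { recent: all.slice(-lastN), best };
}
```

This keeps the agent's structured recall bounded at O(lastN) context tokens regardless of how long the session runs.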

Comparison with Other Memory Systems

| System | Memory Mechanism | Persistence | Scalability |
|---|---|---|---|
| Karpathy autoresearch | results.tsv + git | File-based | Moderate (TSV grows linearly) |
| AlphaEvolve | Database (internal) | Database-backed | High (indexed queries) |
| OpenEvolve | SQLite + program database | Database-backed | High |
| pi-autoresearch | JSONL + Markdown + git | File-based | Moderate (JSONL grows linearly) |

14 Continued Learning

Within-Session Learning

The agent learns within a session through the accumulation of context:

  1. Narrative learning: As the agent updates autoresearch.md with dead ends and key wins, it builds a compressed model of the optimization landscape. Future hypotheses are informed by this accumulated knowledge.

  2. Statistical learning: The confidence scoring provides quantitative feedback on result reliability. After enough experiments, the agent can distinguish between high-confidence improvements and noise-level changes.

  3. Pattern recognition: The JSONL log provides a structured record of what has been tried. The agent can read this to avoid repeating failed approaches and to identify promising directions.

Cross-Session Learning

Pi-autoresearch supports cross-session learning through the persistence layer:

  1. Context reset resumption: A fresh agent instance reads autoresearch.md + autoresearch.jsonl and picks up where the previous session left off. The narrative document provides strategic context, while the JSONL provides tactical detail.

  2. Branch-level transfer: Since sessions are branch-aware, insights from one optimization branch can inform work on another branch — though this requires the human researcher to manually connect the sessions.

What Pi-Autoresearch Does NOT Learn

  1. No cross-project learning. Insights from optimizing test speed in Project A are not automatically transferred to Project B. Each session starts fresh.

  2. No meta-optimization. The system does not learn to improve its own optimization strategy over time. The hill-climbing approach, confidence thresholds, and experiment structure are fixed.

  3. No skill improvement. The domain-specific skill documents are authored by humans and not modified by the system. The system cannot discover that a different benchmark command or metric would be more effective.

  4. No population dynamics. Unlike evolutionary systems (AlphaEvolve, OpenEvolve), pi-autoresearch maintains a single candidate at a time. It cannot explore multiple optimization paths in parallel or combine insights from diverse candidates.

Comparison with Knowledge Management in Other Systems

| System | Within-Session | Cross-Session | Cross-Project | Meta-Learning |
|---|---|---|---|---|
| Karpathy autoresearch | Via context window | Via results.tsv + git | No | No |
| AlphaEvolve | Program database | Persistent DB | Yes (shared infra) | Partially (self-improvement) |
| OpenEvolve | SQLite DB | Persistent DB | No | No |
| pi-autoresearch | JSONL + Markdown | JSONL + Markdown | No | No |

15 Applications

Primary Application: Developer Workflow Optimization

Pi-autoresearch's primary application is optimizing developer workflows — any measurable aspect of a software project that can be improved through code changes:

| Application | Metric | Typical Gains | Value Proposition |
|---|---|---|---|
| Test suite optimization | Wall-clock runtime | 10–50% faster | Faster CI, faster development loops |
| Build optimization | Build time | 10–40% faster | Faster deployments, faster iteration |
| Bundle optimization | Bundle size (KB) | 5–30% smaller | Faster page loads, better UX |
| Performance optimization | Lighthouse score | 5–20 point improvement | Better SEO, better user experience |

Secondary Application: ML Research

Following Karpathy's autoresearch paradigm, pi-autoresearch can be applied to ML training optimization:

| Application | Metric | Approach |
|---|---|---|
| Architecture search | val_bpb or val_loss | Agent modifies model architecture (layers, widths, attention patterns), measures training performance |
| Hyperparameter optimization | val_bpb or val_loss | Agent adjusts learning rate, batch size, warmup schedule, regularization |
| Data pipeline optimization | Training throughput (samples/sec) | Agent optimizes data loading, preprocessing, caching |
| Training stability | Loss variance | Agent modifies optimization procedure to reduce training instability |

Emerging Application: Multi-Agent Optimization

Pi-autoresearch opens the door to multi-agent optimization workflows:

Developer machine                    CI server
┌────────────────────┐              ┌────────────────────┐
│ pi + autoresearch  │              │ Benchmark runner   │
│ ┌────────────────┐ │   git push   │ ┌────────────────┐ │
│ │ Agent generates│ │──────────────│ │ Run benchmark  │ │
│ │ code changes   │ │              │ │ on isolated    │ │
│ │ & commits      │ │   webhook    │ │ hardware       │ │
│ │                │ │◄─────────────│ │                │ │
│ │ Reads results  │ │              │ │ Report metrics │ │
│ └────────────────┘ │              │ └────────────────┘ │
└────────────────────┘              └────────────────────┘

This pattern enables:

- Benchmarks on dedicated hardware (unaffected by developer-machine load)
- Longer-running benchmarks (CI can run 30-minute training jobs)
- Multi-metric evaluation (CI reports multiple metrics per run)

Limitations

  1. Single-metric optimization: The system is designed for single-objective optimization. Multi-objective problems require manual composite scoring or separate sessions.
  2. Requires automatable benchmarks: Applications where quality is subjective or requires human evaluation are not supported.
  3. Platform lock-in: The extension is specific to the pi agent. Users of other coding agents (Claude Code, Cursor, Copilot) cannot use pi-autoresearch without switching to pi.
  4. No guarantee of global optimality: Greedy hill-climbing may miss solutions that require temporary metric regressions. Population-based approaches (AlphaEvolve, OpenEvolve) are better suited to problems with rugged fitness landscapes.

Impact Assessment

Pi-autoresearch represents a significant step in the democratization of autonomous optimization:

| Dimension | Assessment |
|---|---|
| Accessibility | High — single install command, works with any pi project |
| Generality | High — domain-agnostic, any measurable metric |
| Scientific rigor | Medium — confidence scoring is a good start, but lacks formal statistical testing |
| Production readiness | Medium — suitable for development optimization, not safety-critical applications |
| Community adoption | Strong — 3,300+ stars, active development |
| Innovation | Medium — novel extension/skill separation, confidence scoring; core loop is established (Karpathy) |

Position in the Autoresearch Ecosystem

Complexity ──────────────────────────────────────────────► Scale
    │
    │  Karpathy                    pi-autoresearch
    │  autoresearch                (domain-agnostic,
    │  (minimal,                    statistical rigor,
    │   single-domain)              extension/skill split)
    │       │                              │
    │       ▼                              ▼
    │  ─────────────────────────────────────────────────
    │  Proof of Concept    Reusable Framework    Platform
    │
    │                                          SkyPilot
    │                                          autoresearch
    │                                          (cloud-scale,
    │                                           multi-cluster)
    │                                              │
    │                                              ▼
    │  ─────────────────────────────────────────────────
    │                                        Enterprise

Pi-autoresearch occupies the "reusable framework" niche — more sophisticated than Karpathy's proof-of-concept, more accessible than enterprise-grade cloud orchestrators. This positioning makes it the natural choice for individual developers and small teams who want autonomous optimization without significant infrastructure investment.