Pi-Autoresearch
Autonomous experiment loop extension for the pi AI coding agent
Organization: David Vilalta (davebcn87, Independent)
Published: March 2026
Type: Open-Source Extension (MIT License)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: pi-autoresearch — Autonomous experiment loop extension for pi
Repository URL: github.com/davebcn87/pi-autoresearch
Stars: 3,300+ (as of April 2026)
License: MIT
Lineage: Directly inspired by Andrej Karpathy's autoresearch, reimagined as a modular extension for the pi AI coding agent. While Karpathy's system is a standalone monolith tightly coupled to nanochat and a specific LLM agent (Claude Code), pi-autoresearch decouples the experimental loop infrastructure from the domain knowledge, creating a general-purpose optimization extension that works with any command-line-measurable metric.
Publication Date: March 2026
Paradigm: Pi-autoresearch is a meta-tool — it instruments an LLM coding agent with the tools and workflow to run autonomous optimization loops, rather than being an autonomous agent itself. The distinction is architecturally significant: the extension provides infrastructure (timing, logging, version control, dashboards), while domain intelligence comes from a separately authored "skill" document. This separation of concerns enables a single extension to serve unlimited optimization domains.
"Try an idea, measure it, keep what works, discard what doesn't, repeat forever." — pi-autoresearch README
2 Authors and Team
Primary Author
David Vilalta (GitHub: davebcn87) — Software engineer based in Barcelona, Spain. Vilalta's prior work focuses on developer tooling and agent infrastructure. The "bcn87" suffix references Barcelona, and his GitHub profile indicates involvement in AI agent extension ecosystems.
Platform Dependency: pi (by Anthropic)
Pi-autoresearch is designed as a first-class extension for pi, an AI coding agent developed by Anthropic that runs in the terminal. Pi provides the underlying LLM-powered coding agent, extension system, skill framework, and terminal UI infrastructure. Pi-autoresearch extends pi's capabilities rather than building from scratch.
| Component | Provider |
|---|---|
| LLM agent runtime | pi (Anthropic) |
| Extension system | pi extension framework |
| Skill authoring | pi skill system |
| Terminal UI (widgets, dashboards) | pi UI framework |
| Experiment infrastructure | pi-autoresearch (this project) |
Relationship to Karpathy's Autoresearch
The naming and conceptual debt are explicit — the README states "Inspired by karpathy/autoresearch." However, the architectural decisions diverge significantly:
| Dimension | Karpathy autoresearch | pi-autoresearch |
|---|---|---|
| Agent coupling | Tightly bound to Claude Code | Extension for pi (any LLM backend) |
| Domain coupling | Tightly bound to nanochat | Domain-agnostic via skills |
| Measurement | 5-min fixed wall-clock val_bpb | Any command + any metric |
| State format | `results.tsv` + git history | `autoresearch.jsonl` + `autoresearch.md` |
| UI | None (terminal output only) | Live widget + expandable dashboard + fullscreen overlay |
| Finalization | Manual branch review | Automated branch grouping via `autoresearch-finalize` |
| Configuration | `program.md` only | `autoresearch.config.json` + `autoresearch.md` + `autoresearch.sh` |
| Statistical rigor | None (raw val_bpb comparison) | Confidence scoring via MAD |
Community Context
Pi-autoresearch represents the second wave of autoresearch tooling — the "framework wave" that followed Karpathy's proof-of-concept. Where Karpathy demonstrated that the paradigm works, pi-autoresearch asks how to make it reusable, composable, and production-grade. The 3,300+ stars indicate strong community adoption, comparable to or exceeding many of the standalone autoresearch forks, while providing a fundamentally different architectural foundation.
3 Core Contribution
Key Novelty: Pi-autoresearch transforms autonomous experimentation from a single-purpose script into a reusable, domain-agnostic infrastructure layer — separating experimental loop mechanics (measurement, version control, statistical confidence, dashboards) from domain-specific knowledge (what to optimize, how to measure, which files to modify). This separation enables any developer using pi to add autonomous optimization to any project with any metric, without writing custom experimental infrastructure.
What Makes Pi-Autoresearch Novel
- Extension/skill architecture. The fundamental innovation is the clean separation between the extension (domain-agnostic infrastructure — tools, widgets, dashboards) and the skill (domain-specific knowledge — what to optimize, the benchmark command, the metric, the scope of files). One extension serves unlimited domains through skill composition. This is an architectural pattern not present in any prior autoresearch system.
- Confidence scoring with MAD. After 3+ experiments, pi-autoresearch computes a statistical confidence score comparing the best improvement to the session's noise floor using Median Absolute Deviation (MAD). This addresses a critical weakness in Karpathy's system — raw metric comparisons can't distinguish real improvements from benchmark jitter, especially in noisy domains like ML training or Lighthouse scores. The confidence metric is advisory (never auto-discards), but guides the agent to re-run marginal experiments.
- Append-only structured log (`autoresearch.jsonl`). Every experiment is recorded as a single JSON line, creating a machine-readable audit trail that supports post-hoc analysis, resumption after context resets, and cross-session continuity. This is a significant improvement over TSV-based logging, enabling richer structured data (commit hashes, descriptions, confidence scores, status codes, branch context).
- Backpressure checks. An optional `autoresearch.checks.sh` script runs correctness checks (tests, types, lint) after every passing benchmark. This creates a safety valve that prevents optimizations from silently breaking things — a critical concern for multi-hour autonomous runs.
- Branch-aware finalization. The `autoresearch-finalize` skill groups kept experiments into logical changesets, proposes the grouping for human approval, then creates independent branches from the merge-base. Groups must not share files, ensuring each branch can be reviewed and merged independently. This solves the "messy experiment branch" problem that plagues long autonomous runs.
- Rich terminal UI. A persistent status widget, expandable results dashboard, and fullscreen overlay with keyboard navigation provide real-time visibility into the optimization process. This is a significant UX improvement over terminal-output-only systems, enabling researchers to monitor multi-hour runs at a glance.
Relationship to Prior Work
| System | Year | Architecture | Domain | Statistical Rigor | UI |
|---|---|---|---|---|---|
| Karpathy autoresearch | 2026 | Monolithic script | Neural network training | None | Terminal output |
| SkyPilot autoresearch | 2026 | Cloud orchestrator | Any cloud workload | None | Web dashboard |
| AutoResearchClaw | 2026 | Multi-agent | Research paper writing | None | None |
| pi-autoresearch | 2026 | Extension + skill | Any measurable metric | MAD confidence | Widget + dashboard |
Design Philosophy
Pi-autoresearch embodies a specific design philosophy about autonomous research tooling:
The experimental loop (measure → judge → keep/revert → repeat) is infrastructure, not domain knowledge. Domain knowledge belongs in a separate, human-authored document. Infrastructure should be built once and reused everywhere.
This is fundamentally different from Karpathy's approach, where the experimental loop and the domain knowledge are intertwined in program.md. Pi-autoresearch's separation means that a team optimizing test speeds, another team optimizing LLM training, and a third team optimizing Lighthouse scores all use the exact same extension — only their skills differ.
4 Supported Solutions
Pi-autoresearch is explicitly domain-agnostic. It supports any optimization target where a command-line command produces a measurable numeric metric. The README provides canonical example domains, but the architecture imposes no domain restrictions:
Canonical Example Domains
| Domain | Metric | Direction | Benchmark Command | Typical Scope |
|---|---|---|---|---|
| Test speed | Seconds | ↓ (lower is better) | `pnpm test` | Test configuration, runner parallelism, mocking |
| Bundle size | KB | ↓ (lower is better) | `pnpm build && du -sb dist` | Tree-shaking, code splitting, dependencies |
| LLM training | val_bpb | ↓ (lower is better) | `uv run train.py` | Architecture, hyperparameters, data pipeline |
| Build speed | Seconds | ↓ (lower is better) | `pnpm build` | Build tool config, caching, parallelism |
| Lighthouse score | Performance score | ↑ (higher is better) | `lighthouse http://localhost:3000 --output=json` | SSR, code splitting, image optimization |
Solution Taxonomy
Solutions produced by pi-autoresearch fall into several categories:
| Solution Type | Description | Mechanism |
|---|---|---|
| Configuration tuning | Adjusting config files (test runners, bundlers, compilers) | Agent reads config docs, proposes changes, measures impact |
| Code refactoring | Restructuring code for performance without behavior change | Agent identifies bottlenecks, refactors, verifies via checks |
| Algorithm replacement | Swapping algorithms for more efficient alternatives | Agent proposes alternative implementations, benchmarks |
| Dependency optimization | Adding/removing/replacing dependencies | Agent evaluates dependency alternatives, measures impact |
| Architecture changes | Restructuring module boundaries, data flow patterns | Agent proposes structural changes, validates via benchmark |
| Hyperparameter tuning | Adjusting numeric parameters in training/optimization code | Agent sweeps values, keeps improvements |
What It Cannot Optimize
The system has inherent limitations:
- Multi-objective optimization. The system tracks a single primary metric per session. Multi-metric optimization requires separate sessions or custom benchmark scripts that compute composite scores.
- Non-deterministic metrics. Highly stochastic metrics (e.g., ML validation loss with high variance) challenge the confidence scoring. The MAD-based approach helps but cannot fully compensate for extreme noise.
- Long-running benchmarks. The autonomous loop assumes each experiment completes in a reasonable time. Very long benchmarks (hours) create impractically slow feedback loops.
- Metrics requiring human judgment. Subjective quality metrics (code readability, UX quality) cannot be benchmarked automatically.
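The multi-objective limitation has the workaround noted above: a benchmark script can fold several sub-metrics into one composite score and emit that as the single primary metric. A minimal Python sketch — the sub-metrics, weights, and normalization constant are illustrative assumptions, not part of pi-autoresearch:

```python
def composite_score(test_seconds: float, bundle_kb: float,
                    w_time: float = 0.7, w_size: float = 0.3) -> float:
    """Fold two sub-metrics into one score (lower is better).

    The weights and the /100 normalization are invented for this
    example; a real script would pick domain-appropriate scaling.
    """
    return w_time * test_seconds + w_size * (bundle_kb / 100.0)

# Emit the single METRIC line the extension parses from stdout.
print(f"METRIC composite_score={composite_score(42.0, 350.0):.3f}")
```

The agent then hill-climbs on the composite score exactly as it would on any single metric.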
5 LLM Integration
Architecture: LLM-as-Agent, Not LLM-as-Mutation-Operator
Pi-autoresearch uses the LLM fundamentally differently from evolutionary systems like AlphaEvolve or FunSearch. In those systems, the LLM is a mutation operator called by the system to propose code changes. In pi-autoresearch, the LLM is the agent — it has full autonomy to read files, understand context, form hypotheses, make changes, and decide what to try next. The extension merely provides tools that the agent invokes at its discretion.
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent (pi) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Read files │ │ Form │ │ Make code │ │
│ │ + context │──│ hypothesis │──│ changes │ │
│ └─────────────┘ └──────────────┘ └────────┬──────────┘ │
│ │ │
│ ┌────────────────────────────────────────────┼──────────┐ │
│ │ Extension Tools (invoked by agent) │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │init_experiment│ │run_experiment │ │log_experiment│ │ │
│ │ │(one-time │ │(executes cmd, │ │(records │ │ │
│ │ │ session setup)│ │ times it, │ │ result, │ │ │
│ │ │ │ │ captures │ │ auto-commits,│ │ │
│ │ │ │ │ output) │ │ updates UI) │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Skill (domain knowledge, loaded at start) │ │
│ │ │ │
│ │ "Optimize test runtime by modifying vitest configs. │ │
│ │ Command: pnpm test. Metric: seconds (lower better). │ │
│ │ Files in scope: vitest.config.ts, test/**/*.test.ts" │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
LLM Interaction Modes
| Mode | Trigger | LLM Behavior |
|---|---|---|
| Session setup | `/skill:autoresearch-create` | Agent asks about goal, command, metric, scope — or infers from context. Creates session files. |
| Autonomous loop | After setup | Agent edits code → commits → calls run_experiment → calls log_experiment → decides keep/revert → repeats indefinitely |
| Finalization | `/skill:autoresearch-finalize` | Agent reads experiment log, groups changes into logical branches, proposes grouping for approval |
| Interruption | User presses Escape | Agent stops loop and provides summary of results |
| Resumption | `/autoresearch <context>` | Agent reads `autoresearch.md` + `autoresearch.jsonl`, reconstructs context, resumes loop |
No Model Coupling
A critical design decision: pi-autoresearch does not specify which LLM powers the pi agent. The extension works with whatever LLM backend the user has configured in pi. This means the same experimental infrastructure works with:
- Claude (Anthropic) — pi's primary backend
- GPT-4o/o3 (OpenAI) — if configured via API key
- Gemini (Google) — if configured via API key
- Open-weight models — if configured via local inference
The LLM choice affects the quality of hypotheses the agent generates, but not the experimental infrastructure.
Prompt Architecture
The system uses a layered prompt architecture:
Layer 1: pi system prompt (agent capabilities, tool definitions)
↓
Layer 2: Extension tool descriptions (init_experiment, run_experiment, log_experiment)
↓
Layer 3: Skill document (domain-specific instructions, loaded at session start)
↓
Layer 4: Session context (autoresearch.md — what's been tried, what worked)
↓
Layer 5: Real-time state (widget data, recent experiment results)
This layered architecture means the LLM always has access to:
- What it can do (tools)
- What it should optimize (skill)
- What has already been tried (session history)
- How well it's doing (confidence scores, metric trajectory)
6 Key Results
Claimed Performance
The README does not report specific experimental results — it is infrastructure, not a benchmark paper. However, the system is designed to produce results in any domain. The README's example domains suggest typical performance improvements:
| Domain | Typical Improvement | Confidence Threshold |
|---|---|---|
| Test speed | 10–50% reduction in runtime | ≥ 2.0× MAD (green) |
| Bundle size | 5–30% reduction in KB | ≥ 2.0× MAD (green) |
| LLM training | 5–15% reduction in val_bpb | ≥ 2.0× MAD (green) |
| Build speed | 10–40% reduction in build time | ≥ 2.0× MAD (green) |
Statistical Confidence Framework
Pi-autoresearch's most scientifically rigorous feature is its confidence scoring:
| Metric | Formula | Interpretation |
|---|---|---|
| MAD (Median Absolute Deviation) | `median(\|x_i - median(x)\|)` | Robust noise floor estimate, resistant to outliers |
| Confidence score | `\|best_improvement\| / MAD` | How many times the best improvement exceeds noise |
| ≥ 2.0× (green) | — | Improvement is likely real |
| 1.0–2.0× (yellow) | — | Above noise but marginal — re-run recommended |
| < 1.0× (red) | — | Within noise — likely jitter, not real improvement |
The choice of MAD over standard deviation is deliberate — MAD is robust to outliers, which are common in benchmark measurements (GC pauses, CPU throttling, I/O contention). This makes the confidence score reliable even with heterogeneous experiment results.
Comparison with Karpathy's Results
Karpathy's autoresearch reported 11% cumulative reduction in time-to-GPT-2-quality over an overnight run. Pi-autoresearch provides the infrastructure to achieve similar results but with added statistical rigor:
| Feature | Karpathy | pi-autoresearch |
|---|---|---|
| Raw improvement detection | Yes (raw val_bpb comparison) | Yes (raw metric comparison) |
| Noise floor estimation | No | Yes (MAD) |
| Confidence scoring | No | Yes (improvement/MAD ratio) |
| False positive filtering | No | Advisory (agent guided to re-run marginal results) |
| Post-hoc analysis | TSV file | Structured JSONL with full metadata |
Community Adoption Signal
The 3,300+ GitHub stars provide a strong adoption signal. The extension's domain-agnostic nature means its user base is fundamentally broader than domain-specific autoresearch tools — it serves anyone who uses pi for development, regardless of their optimization domain.
7 Reproducibility
Reproducibility Design
Pi-autoresearch is designed with reproducibility as a first-class concern at multiple levels:
Experiment-level reproducibility:
- Every experiment that is "kept" produces a git commit with a descriptive message including the metric improvement
- The autoresearch.jsonl log records every experiment (kept and discarded) with timestamps, commit hashes, metric values, confidence scores, and descriptions
- Any individual experiment can be reproduced by checking out its commit and re-running the benchmark command
Session-level reproducibility:
- The autoresearch.md file captures the complete session context: objective, metrics, files in scope, what has been tried, dead ends, and key wins
- A fresh agent with no memory can read autoresearch.md + autoresearch.jsonl and continue exactly where the previous session left off
- Sessions are branch-aware — each branch has its own session state
Infrastructure-level reproducibility:
- The extension is installed via pi install https://github.com/davebcn87/pi-autoresearch
- The benchmark script (autoresearch.sh) is an explicit, versioned shell script — not implicit agent behavior
- Optional checks script (autoresearch.checks.sh) is similarly explicit and versioned
Session Files
| File | Format | Purpose | Persistence |
|---|---|---|---|
| `autoresearch.md` | Markdown | Living session document — objective, history, dead ends | Survives context resets, agent restarts |
| `autoresearch.jsonl` | JSON Lines | Append-only experiment log with full metadata | Survives restarts, branch-aware |
| `autoresearch.sh` | Shell script | Benchmark command with pre-checks and metric output | Versioned in git |
| `autoresearch.checks.sh` | Shell script (optional) | Correctness checks (tests, types, lint) | Versioned in git |
| `autoresearch.config.json` | JSON (optional) | Session configuration (working dir, max iterations) | Versioned in git |
Resumption Protocol
The system supports three resumption scenarios:
- Agent restart (same context window): The agent reads `autoresearch.jsonl` to reconstruct state and continues the loop.
- Context reset (new agent instance): A fresh agent reads `autoresearch.md` for high-level context and `autoresearch.jsonl` for detailed history. The README explicitly states: "A fresh agent with no memory can read these two files and continue exactly where the previous session left off."
- Branch switch: Each branch maintains its own session state. Switching branches automatically switches the session context.
Limitations on Reproducibility
- LLM non-determinism: The agent's hypotheses and code changes depend on the LLM's stochastic generation. Two runs with identical starting conditions will generally produce different experiment sequences.
- Environment sensitivity: Benchmark measurements depend on system load, hardware, and environment. The confidence scoring mitigates this but doesn't eliminate it.
- Skill authoring variance: The quality and specificity of the skill document significantly affect results. Vague skills produce scattered experiments; precise skills produce focused optimization.
8 Compute and API Costs
Cost Model
Pi-autoresearch's costs have three components:
| Component | Source | Scaling Factor |
|---|---|---|
| LLM tokens | pi agent inference (per-experiment reasoning + code changes) | Experiments × tokens/experiment |
| Compute time | Benchmark execution | Experiments × benchmark duration |
| Human attention | Monitoring, review, finalization | Low (designed for unattended operation) |
Token Cost Estimation
Each experiment cycle involves:
| Phase | Estimated Tokens | Notes |
|---|---|---|
| Read context + session files | 2,000–10,000 input | Depends on autoresearch.md size |
| Generate hypothesis + code change | 1,000–5,000 output | Depends on change complexity |
| Interpret results + decide keep/revert | 500–2,000 output | Includes reasoning about metrics |
| Per-experiment total | 3,500–17,000 | ~10,000 tokens typical |
For a 50-experiment session with Claude Sonnet-level pricing (~$3/M input, ~$15/M output tokens):
- Input: 50 × 6,000 = 300K tokens → ~$0.90
- Output: 50 × 4,000 = 200K tokens → ~$3.00
- Total: ~$4–$10 per 50-experiment session
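The arithmetic above can be packaged as a quick estimator; the per-million-token prices are the Sonnet-level figures assumed in the text:

```python
def session_cost(n_experiments: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float = 3.0,
                 out_price_per_m: float = 15.0) -> float:
    """Estimate LLM cost in USD for one autonomous session.

    in_tokens/out_tokens are per-experiment averages; prices are
    USD per million tokens.
    """
    total_in = n_experiments * in_tokens
    total_out = n_experiments * out_tokens
    return (total_in / 1e6) * in_price_per_m + (total_out / 1e6) * out_price_per_m

print(round(session_cost(50, 6_000, 4_000), 2))  # 3.9
```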
Cost Controls
Pi-autoresearch provides two mechanisms for cost control:
- `maxIterations` in `autoresearch.config.json`: Hard limit on experiments per session. The agent is instructed to stop and won't run more experiments beyond this limit.
{
"maxIterations": 30
}
- API key spending limits: The README recommends using provider-side spending caps: "most providers let you set per-key or monthly budgets."
Cost Comparison
| System | Typical Session Cost | Autonomous Duration | Cost Control |
|---|---|---|---|
| Karpathy autoresearch | $5–$50 (overnight) | 8–12 hours | Manual interruption only |
| pi-autoresearch | $4–$10 (50 experiments) | Varies by benchmark | maxIterations + API limits |
| AlphaEvolve | $1,000–$100,000+ | Days | Internal Google quotas |
9 Architecture Solution
Three-Layer Architecture
Pi-autoresearch's architecture follows a clean three-layer separation:
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: SKILL (Domain Knowledge) │
│ │
│ autoresearch-create autoresearch-finalize │
│ ┌─────────────────────┐ ┌──────────────────────────────┐ │
│ │ • Asks about goal │ │ • Reads experiment log │ │
│ │ • Infers from context│ │ • Groups kept experiments │ │
│ │ • Writes session │ │ • Proposes branch grouping │ │
│ │ files │ │ • Creates independent branches│ │
│ │ • Starts loop │ │ • Ensures no shared files │ │
│ └─────────────────────┘ └──────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 2: EXTENSION (Infrastructure) │
│ │
│ Tools UI Commands │
│ ┌───────────────────┐ ┌───────────────┐ ┌────────────┐ │
│ │ init_experiment │ │ Status widget │ │/autoresearch│ │
│ │ run_experiment │ │ Dashboard │ │ <context> │ │
│ │ log_experiment │ │ Fullscreen │ │ off │ │
│ └───────────────────┘ │ overlay │ │ clear │ │
│ └───────────────┘ └────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 1: pi RUNTIME │
│ │
│ LLM Agent │ Terminal UI │ Git Integration │ File I/O │
│ Extension │ Widget │ Skill System │ Process │
│ Framework │ Framework │ │ Execution │
└─────────────────────────────────────────────────────────────────┘
Data Flow
The data flow through a single experiment cycle:
Agent generates hypothesis
│
▼
Agent edits source code
│
▼
Agent calls git commit (via pi)
│
▼
Agent calls run_experiment(command="pnpm test")
│
├──► Extension executes command
├──► Extension times wall-clock duration
├──► Extension captures stdout/stderr
├──► Extension parses METRIC lines from output
│
▼
Agent calls log_experiment(metric_value, description)
│
├──► Extension records to autoresearch.jsonl
├──► Extension computes confidence score (if 3+ runs)
├──► Extension updates status widget
├──► Extension updates dashboard
│
▼
Agent evaluates: keep or revert?
│
├──► Keep: branch advances, agent generates next hypothesis
│
└──► Revert: git reset, agent tries different approach
Metric Protocol
The benchmark script (autoresearch.sh) communicates metrics to the extension via a simple protocol — standard output lines matching the pattern:
METRIC name=number
For example:
#!/bin/bash
set -euo pipefail
pnpm test --run 2>&1
echo "METRIC total_test_seconds=$(node -e 'process.stdout.write(String(...))')"
This protocol is deliberately minimal — any language, any build tool, any measurement approach can produce METRIC lines.
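On the consuming side, the protocol parses with a single regular expression. The extension's actual parser is not shown in the source, so the following Python sketch is illustrative:

```python
import re

# "METRIC name=number" lines on stdout, one metric per line.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)

def parse_metrics(stdout: str) -> dict:
    """Extract all METRIC lines from captured benchmark output."""
    return {name: float(value) for name, value in METRIC_RE.findall(stdout)}

out = "ran 214 tests\nMETRIC total_test_seconds=18.42\n"
print(parse_metrics(out))  # {'total_test_seconds': 18.42}
```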
State Machine
The extension manages a session state machine:
┌──────────────┐
│ INACTIVE │
│ (no session)│
└──────┬───────┘
│ /autoresearch <context>
│ or /skill:autoresearch-create
▼
┌──────────────┐
│ SETUP │◄─── init_experiment()
│ (configuring│ defines name, metric,
│ session) │ unit, direction
└──────┬───────┘
│ Session files written
│ Baseline measured
▼
┌──────────────┐
│ RUNNING │◄─── Autonomous loop
│ (experiment │ edit → commit →
│ loop) │ run → log → decide
└──────┬───────┘
│ /autoresearch off
│ or Escape
│ or maxIterations reached
▼
┌──────────────┐
│ PAUSED │
│ (loop stopped│
│ state kept) │
└──────┬───────┘
│ /autoresearch <context>
│ (resumes loop)
│
│ /autoresearch clear
▼
┌──────────────┐
│ CLEARED │
│ (state │
│ deleted) │──── Returns to INACTIVE
└──────────────┘
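The transitions above can be encoded as a small lookup table. The event names here ("start", "stop", and so on) are illustrative labels, not pi commands — the corresponding triggers from the diagram appear in the comments:

```python
from enum import Enum, auto

class State(Enum):
    INACTIVE = auto()
    SETUP = auto()
    RUNNING = auto()
    PAUSED = auto()
    CLEARED = auto()

TRANSITIONS = {
    (State.INACTIVE, "start"):    State.SETUP,    # /autoresearch or autoresearch-create
    (State.SETUP,    "baseline"): State.RUNNING,  # session files written, baseline measured
    (State.RUNNING,  "stop"):     State.PAUSED,   # /autoresearch off, Escape, maxIterations
    (State.PAUSED,   "resume"):   State.RUNNING,  # /autoresearch <context>
    (State.PAUSED,   "clear"):    State.CLEARED,  # /autoresearch clear
    (State.CLEARED,  "reset"):    State.INACTIVE, # state deleted
}

def step(state: State, event: str) -> State:
    """Apply an event; events not valid in the current state are no-ops."""
    return TRANSITIONS.get((state, event), state)
```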
10 Component Breakdown
Extension Components
| Component | Type | Lifecycle | Description |
|---|---|---|---|
| `init_experiment` | Tool | Called once per session | Configures session: experiment name, primary metric name, unit, optimization direction (minimize/maximize) |
| `run_experiment` | Tool | Called per experiment | Executes benchmark command, measures wall-clock duration, captures output, parses METRIC lines |
| `log_experiment` | Tool | Called per experiment | Records result to `autoresearch.jsonl`, auto-commits, computes confidence score, updates UI components |
| Status widget | UI | Persistent during session | Always-visible bar above editor: run count, keep count, best metric value, improvement %, confidence score |
| Expanded dashboard | UI | Toggle via Ctrl+X | Full results table with columns for commit, metric, status, description |
| Fullscreen overlay | UI | Toggle via Ctrl+Shift+X | Scrollable full-terminal dashboard with live spinner for running experiments |
Skill Components
| Component | Type | Purpose |
|---|---|---|
| `autoresearch-create` | Skill | Session initialization — gathers goal, command, metric, scope; writes session files; runs baseline; starts loop |
| `autoresearch-finalize` | Skill | Branch finalization — groups kept experiments into independent branches for review |
Session File Components
| File | Format | Schema | Update Frequency |
|---|---|---|---|
| `autoresearch.md` | Markdown | Free-form document with sections for objective, metrics, scope, history, dead ends, wins | Updated by agent after significant events |
| `autoresearch.jsonl` | JSON Lines | `{commit, metric_name, metric_value, status, description, confidence, timestamp}` | Appended after every experiment |
| `autoresearch.sh` | Bash | Pre-checks + workload + `METRIC` output lines | Written once during setup, rarely modified |
| `autoresearch.checks.sh` | Bash (optional) | Correctness checks (tests, types, lint) | Written once during setup |
| `autoresearch.config.json` | JSON | `{workingDir?, maxIterations?}` | Written once, manually updated |
Experiment Status Taxonomy
Each experiment in autoresearch.jsonl has one of these statuses:
| Status | Meaning | UI Color | Agent Action |
|---|---|---|---|
| `kept` | Metric improved, changes committed | Green | Branch advances |
| `discarded` | Metric did not improve | Gray | Git reset, try again |
| `crashed` | Benchmark command failed (non-zero exit) | Red | Git reset, try different approach |
| `checks_failed` | Benchmark passed but correctness checks failed | Orange | Git reset, fix correctness issue |
| `baseline` | Initial measurement, no changes | Blue | Reference point for improvements |
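Combining the schema and the status taxonomy, one log entry looks like the record below. All concrete values are invented for illustration; only the field names and status strings come from the source:

```python
import json

# A hypothetical autoresearch.jsonl record; values are illustrative.
record = {
    "commit": "d4e5f6a",
    "metric_name": "total_test_seconds",
    "metric_value": 12.9,
    "status": "kept",
    "description": "run vitest with 4 worker threads",
    "confidence": 2.4,
    "timestamp": "2026-03-14T02:17:55Z",
}
line = json.dumps(record)          # one experiment = one line in the log
assert json.loads(line) == record  # records round-trip losslessly
```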
11 Core Mechanisms (Detailed)
Mechanism 1: Confidence Scoring via MAD
The most technically sophisticated mechanism in pi-autoresearch is the confidence scoring system, which uses Median Absolute Deviation (MAD) as a robust noise floor estimator.
Why MAD instead of Standard Deviation?
Standard deviation is sensitive to outliers. In benchmark measurements, outliers are common:
- GC pauses can spike individual measurements by 2–10×
- CPU thermal throttling can degrade measurements unpredictably
- I/O contention from other processes creates sporadic slowdowns
- ML training loss has inherent stochasticity
MAD is robust to these outliers because it uses the median rather than the mean:
MAD = median(|x_i - median(x)|)
Confidence computation:
confidence = |best_improvement| / MAD
Where best_improvement is the largest positive change in the primary metric (accounting for optimization direction).
Interpretation thresholds:
| Confidence | Signal | Color | Agent Guidance |
|---|---|---|---|
| ≥ 2.0× | Strong signal | 🟢 Green | Improvement is likely real. Keep and continue. |
| 1.0–2.0× | Marginal signal | 🟡 Yellow | Above noise but uncertain. Re-run to confirm. |
| < 1.0× | Noise | 🔴 Red | Within noise floor. Consider reverting. |
Implementation details:
- Confidence is only computed after 3+ experiments (minimum sample size for meaningful MAD)
- All metric values in the current segment contribute to the MAD computation
- Confidence is persisted to autoresearch.jsonl for post-hoc analysis
- Confidence is displayed in the status widget, expanded dashboard, and log_experiment output
- The confidence score is advisory only — it never auto-discards experiments
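Putting the two formulas together, a minimal Python implementation of the confidence computation might look like this. The zero-MAD guard is an assumption of this sketch; the source only specifies the 3-experiment minimum:

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: median(|x_i - median(x)|)."""
    m = median(values)
    return median(abs(x - m) for x in values)

def confidence(metric_values, best_improvement):
    """confidence = |best_improvement| / MAD, per the formulas above.

    Returns None with fewer than 3 samples (the minimum the extension
    requires) or when MAD is zero (a guard assumed here).
    """
    if len(metric_values) < 3:
        return None
    noise = mad(metric_values)
    if noise == 0:
        return None
    return abs(best_improvement) / noise

# Hypothetical test-runtime measurements (seconds, lower is better):
runs = [14.2, 13.9, 14.0, 12.9]
print(round(confidence(runs, 14.2 - 12.9), 2))  # 8.67 -> strong (green) signal
```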
Mechanism 2: Greedy Hill-Climbing with Version Control
The core optimization strategy is greedy hill-climbing backed by git version control:
┌────────────────────────────────────────────────────┐
│ EXPERIMENT CYCLE │
│ │
│ 1. Agent edits code (informed by session context) │
│ 2. Agent creates git commit │
│ 3. run_experiment executes benchmark │
│ 4. log_experiment records result │
│ 5. Decision: │
│ ├── metric improved → KEEP (branch advances) │
│ └── metric worsened → REVERT (git reset) │
│ 6. Go to 1 │
│ │
│ Invariant: HEAD always points to the best-known │
│ configuration. The branch is monotonically │
│ improving. │
└────────────────────────────────────────────────────┘
This creates a monotonically improving trajectory — every commit on the branch represents an improvement over the previous state.
Trade-off analysis:
| Property | Greedy Hill-Climbing | Population-Based Search |
|---|---|---|
| Convergence speed | Fast for easy gains | Slower but avoids local optima |
| Implementation complexity | Very low (git commit/reset) | High (population management) |
| Memory requirements | O(1) — current state only | O(N) — population of candidates |
| Risk of local optima | High | Low |
| Human interpretability | Very high (linear history) | Low (complex population dynamics) |
The greedy approach is a deliberate design choice — pi-autoresearch optimizes for simplicity and interpretability over optimality. The assumption is that LLM agents generate sufficiently diverse hypotheses to partially mitigate the local-optima problem.
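Stripped of version control, the cycle above reduces to a few lines. This is a pure simulation — "commit" and "reset" are stand-ins, since real runs go through the extension's tools and git:

```python
def hill_climb(baseline, proposals, lower_is_better=True):
    """Greedy keep/revert over a sequence of (description, metric) results.

    Mirrors the invariant above: `best` always tracks the best-known
    configuration, as HEAD does on the experiment branch.
    """
    best = baseline
    history = []
    for desc, metric in proposals:
        improved = metric < best if lower_is_better else metric > best
        history.append((desc, metric, "kept" if improved else "discarded"))
        if improved:
            best = metric  # "commit": branch advances
        # else "git reset": change dropped, best unchanged
    return best, history

best, hist = hill_climb(14.2, [("parallel workers", 12.9),
                               ("mock db", 13.5),
                               ("cache fixtures", 12.1)])
print(best)  # 12.1
```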
Mechanism 3: Backpressure Checks
The optional autoresearch.checks.sh mechanism provides correctness guarantees during autonomous optimization:
Benchmark exits 0
│
├── autoresearch.checks.sh exists?
│ ├── No → proceed normally (keep/discard based on metric)
│ └── Yes → run checks
│ ├── Checks pass → proceed normally
│ └── Checks fail → log as "checks_failed", revert
│
▼
Normal keep/discard decision
Design principles:
- Checks execution time does not affect the primary metric (measured separately)
- Checks have a separate timeout (default 300s)
- The checks_failed status is distinct from crashed, allowing post-hoc analysis of correctness vs. performance failures
- If no checks file exists, the system behaves identically to systems without this feature
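The decision flow above can be condensed into a small shell sketch. The `run_checks` function stands in for an autoresearch.checks.sh script, and the status strings mirror the log vocabulary (kept / discarded / checks_failed); all names here are illustrative.

```shell
# Hypothetical sketch of the backpressure decision (not the real extension code).
run_checks() { [ "$1" = "ok" ]; }   # fake checks: succeed only for input "ok"

# decide <checks_enabled> <checks_input> <metric_improved>
decide() {
  if [ "$1" = "yes" ] && ! run_checks "$2"; then
    echo checks_failed; return      # revert, regardless of the metric result
  fi
  if [ "$3" = "yes" ]; then echo kept; else echo discarded; fi
}

s_pass=$(decide yes ok  yes)   # checks pass, metric improved -> kept
s_fail=$(decide yes bad yes)   # checks fail: overrides the metric win -> checks_failed
s_none=$(decide no  bad yes)   # no checks file: metric alone decides -> kept
```

The key property is visible in `s_fail`: a correctness failure vetoes even a winning metric, which is exactly the guarantee that makes unattended optimization safe.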
Mechanism 4: Branch-Aware Finalization
The autoresearch-finalize skill converts a messy experiment branch into clean, reviewable branches:
BEFORE (messy experiment branch):
─ baseline
├── exp1: parallel vitest workers (kept, +12%)
├── exp2: mock database calls (discarded)
├── exp3: remove slow regex (kept, +5%)
├── exp4: cache test fixtures (kept, +8%)
├── exp5: swap assertion library (discarded)
└── exp6: remove unnecessary imports (kept, +2%)
AFTER (clean independent branches):
─ merge-base
├── branch: optimize-test-parallelism
│ └── exp1: parallel vitest workers (+12%)
│
├── branch: optimize-test-regex
│ └── exp3: remove slow regex (+5%)
│
├── branch: optimize-test-caching
│ └── exp4: cache test fixtures (+8%)
│
└── branch: optimize-test-imports
└── exp6: remove unnecessary imports (+2%)
Key constraint: Groups must not share files. This ensures branches can be reviewed and merged independently, without conflict resolution.
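The finalization pattern can be demonstrated with plain git: each kept experiment commit is cherry-picked onto its own branch rooted at the merge-base. This is a hedged sketch, not the skill's actual procedure; branch names, commit subjects, and file names are illustrative, and the experiments deliberately touch disjoint files, matching the no-shared-files constraint.

```shell
# Illustrative finalization: one clean branch per kept experiment.
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.email agent@example.com && git config user.name agent

echo base > shared.txt && git add . && git commit -qm "merge-base"
base=$(git rev-parse HEAD)
echo 4 > workers.cfg && git add . && git commit -qm "exp1: parallel vitest workers"
exp1=$(git rev-parse HEAD)
echo fast > matcher.cfg && git add . && git commit -qm "exp3: remove slow regex"
exp3=$(git rev-parse HEAD)

# one independently mergeable branch per kept experiment, each rooted at
# the merge-base -- possible only because the experiments share no files
git branch optimize-test-parallelism "$base"
git branch optimize-test-regex "$base"
git checkout -q optimize-test-parallelism && git cherry-pick "$exp1" >/dev/null
git checkout -q optimize-test-regex && git cherry-pick "$exp3" >/dev/null
git checkout -q main
```

Because no files are shared, each cherry-pick applies cleanly against the merge-base, so a reviewer can merge any subset of the branches in any order.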
Mechanism 5: Context Survival Across Resets
The dual-file persistence strategy (autoresearch.md + autoresearch.jsonl) is designed to survive LLM context window resets:
| File | Audience | Content | Survival Property |
|---|---|---|---|
| autoresearch.md | Agent (narrative understanding) | High-level objective, strategies tried, dead ends, key wins | Enables a fresh agent to understand why and what without reading every experiment |
| autoresearch.jsonl | Agent (precise recall) + Tools (computation) | Every experiment with exact metrics, commits, confidence scores | Enables precise reconstruction of session state and confidence calculations |
The design insight is that LLMs benefit from both narrative context (what's the goal, what approaches have been tried, what failed) and structured data (exact metric values, commit hashes, timestamps). These two modalities serve different purposes and are stored in different formats optimized for each use case.
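A minimal sketch of the two modalities in practice: a narrative note appended to autoresearch.md and a structured record appended to autoresearch.jsonl. The field names follow the examples shown elsewhere in this report; the note text and values are illustrative.

```shell
# Illustrative dual-file write, plus precise recall without a JSON parser.
session=$(mktemp -d)

cat >> "$session/autoresearch.md" <<'EOF'
- Dead end: swapping the assertion library had no measurable effect
- Key win: parallel workers gave the biggest gain (+12%)
EOF

printf '%s\n' '{"run":2,"metric":39.8,"status":"kept","commit":"abc1234"}' \
  >> "$session/autoresearch.jsonl"

# structured recall: the most recent record with status "kept"
last_kept=$(grep '"status":"kept"' "$session/autoresearch.jsonl" | tail -n 1)
```

Both writes are pure appends, which is what lets an interrupted session be resumed without any repair step.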
12 Programming Language
Implementation Stack
Pi-autoresearch is implemented as a pi extension, which uses pi's extension and skill frameworks:
| Layer | Language/Format | Notes |
|---|---|---|
| Extension definition | JSON (extension manifest) | Declares tools, UI widgets, commands |
| Tool implementations | TypeScript (pi extension API) | init_experiment, run_experiment, log_experiment |
| UI components | pi widget framework (declarative) | Status bar, dashboard, fullscreen overlay |
| Skill documents | Markdown | autoresearch-create, autoresearch-finalize |
| Session state | JSON Lines + Markdown | autoresearch.jsonl + autoresearch.md |
| Benchmark scripts | Bash | autoresearch.sh, autoresearch.checks.sh |
| Configuration | JSON | autoresearch.config.json |
Language Choice Rationale
The choice to implement as a pi extension rather than a standalone tool has several implications:
- TypeScript for tools: Pi's extension framework uses TypeScript, providing type safety and IDE support for the tool implementations. This is appropriate for the system-level concerns (timing, file I/O, process execution) that the tools handle.
- Markdown for skills: Skills are authored in Markdown because they are consumed by the LLM agent as context. Markdown provides structure (headers, lists, emphasis) that helps the LLM parse and follow instructions, while remaining human-readable and editable.
- Bash for benchmarks: Benchmark scripts are Bash because they need to invoke arbitrary command-line tools. The `METRIC name=number` output protocol is language-agnostic: any process that writes to stdout can produce metric lines.
- JSON Lines for logs: JSONL is chosen for the experiment log because it supports efficient append-only writes, is line-oriented (enabling `tail -f` style monitoring), and is machine-parseable while remaining human-readable.
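A benchmark script under this protocol can be as small as a few lines: any language works, as long as one stdout line reads `METRIC <name>=<number>`. The sketch below is illustrative; the `sleep` is a stand-in for a real test suite or build command.

```shell
# Hypothetical minimal autoresearch.sh emitting one METRIC line.
start=$(date +%s)
sleep 1                                  # stand-in for: npx vitest run, make, ...
elapsed=$(( $(date +%s) - start ))
echo "METRIC runtime=$elapsed"
```

Because the protocol is just a stdout line, swapping the timed command for `npm run build` or a Python training script requires no changes to the extension side.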
Code Complexity
The overall codebase is notably small — consistent with the "radical simplicity" design philosophy inherited from Karpathy's autoresearch:
| Component | Estimated LOC | Complexity |
|---|---|---|
| Extension (tools + UI) | ~500–800 | Medium (timing, git integration, JSONL parsing, confidence math) |
| Skills | ~200–400 lines of Markdown | Low (structured natural language instructions) |
| Session files | Generated | N/A (generated by agent/tools) |
| Total | ~700–1,200 | Low-to-medium |
13 Memory Management
Memory Architecture
Pi-autoresearch's memory management operates at three distinct levels:
Level 1: LLM Context Window (Volatile)
The LLM agent's context window is the primary working memory. It contains:
- System prompt + tool descriptions
- Skill document (loaded at session start)
- Recent conversation history (experiments, reasoning, results)
- Fragments of autoresearch.md and autoresearch.jsonl (loaded as needed)
The context window is limited (typically 128K–200K tokens for frontier models), and its contents are lost on a context reset.
Level 2: Session Files (Persistent)
The dual-file persistence layer survives context resets:
┌──────────────────────────────────────────────────────────┐
│ Session File Memory Architecture │
│ │
│ autoresearch.md (narrative memory) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ • Objective: "optimize test runtime" │ │
│ │ • Approach: "started with parallelism, then..." │ │
│ │ • Dead ends: "swapping assertion lib had no effect"│ │
│ │ • Key wins: "parallel workers gave biggest gain" │ │
│ │ • Next ideas: "try test sharding, mock heavy I/O" │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ autoresearch.jsonl (structured memory) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ {"run":1, "metric":45.2, "status":"baseline",...} │ │
│ │ {"run":2, "metric":39.8, "status":"kept",...} │ │
│ │ {"run":3, "metric":41.1, "status":"discarded",...} │ │
│ │ ... │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Level 3: Git History (Permanent)
Every kept experiment is a git commit. This creates a permanent, immutable record of every code change that produced an improvement. The git history is never consumed directly by the agent but serves as a human-auditable archive and enables the finalization workflow.
Memory Scaling
| Session Size | autoresearch.md | autoresearch.jsonl | Git commits | Context Pressure |
|---|---|---|---|---|
| 10 experiments | ~1 KB | ~3 KB | ~5 | Low |
| 50 experiments | ~3–5 KB | ~15 KB | ~20–25 | Medium |
| 200 experiments | ~10–15 KB | ~60 KB | ~80–100 | High (JSONL may exceed context) |
For very long sessions, the agent must selectively read the JSONL file (recent entries, best results) rather than loading the entire history. The narrative autoresearch.md file serves as a compressed summary that remains context-friendly regardless of session length.
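Selective recall of this kind needs no special tooling; standard line-oriented utilities suffice because the log is JSONL. The sketch below is illustrative (log contents are invented, and a lower metric is assumed better): it reads only the most recent entries plus the single best result instead of the full history.

```shell
# Illustrative selective read of a long autoresearch.jsonl.
log=$(mktemp)
cat > "$log" <<'EOF'
{"run":1,"metric":45.2,"status":"baseline"}
{"run":2,"metric":39.8,"status":"kept"}
{"run":3,"metric":41.1,"status":"discarded"}
{"run":4,"metric":38.5,"status":"kept"}
EOF

# tactical detail: only the most recent experiments
recent=$(tail -n 2 "$log")

# best (lowest) metric: prefix each line with its metric, sort numerically
best=$(sed 's/.*"metric":\([0-9.]*\).*/\1 &/' "$log" | sort -n | head -n 1 | cut -d' ' -f2-)
```

This keeps context pressure roughly constant no matter how long the session runs, with autoresearch.md supplying the compressed narrative for everything skipped.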
Comparison with Other Memory Systems
| System | Memory Mechanism | Persistence | Scalability |
|---|---|---|---|
| Karpathy autoresearch | results.tsv + git | File-based | Moderate (TSV grows linearly) |
| AlphaEvolve | Database (internal) | Database-backed | High (indexed queries) |
| OpenEvolve | SQLite + program database | Database-backed | High |
| pi-autoresearch | JSONL + Markdown + git | File-based | Moderate (JSONL grows linearly) |
14 Continued Learning
Within-Session Learning
The agent learns within a session through the accumulation of context:
- Narrative learning: As the agent updates autoresearch.md with dead ends and key wins, it builds a compressed model of the optimization landscape. Future hypotheses are informed by this accumulated knowledge.
- Statistical learning: The confidence scoring provides quantitative feedback on result reliability. After enough experiments, the agent can distinguish between high-confidence improvements and noise-level changes.
- Pattern recognition: The JSONL log provides a structured record of what has been tried. The agent can read this to avoid repeating failed approaches and to identify promising directions.
Cross-Session Learning
Pi-autoresearch supports cross-session learning through the persistence layer:
- Context reset resumption: A fresh agent instance reads autoresearch.md + autoresearch.jsonl and picks up where the previous session left off. The narrative document provides strategic context, while the JSONL provides tactical detail.
- Branch-level transfer: Since sessions are branch-aware, insights from one optimization branch can inform work on another branch — though this requires the human researcher to manually connect the sessions.
What Pi-Autoresearch Does NOT Learn
- No cross-project learning. Insights from optimizing test speed in Project A are not automatically transferred to Project B. Each session starts fresh.
- No meta-optimization. The system does not learn to improve its own optimization strategy over time. The hill-climbing approach, confidence thresholds, and experiment structure are fixed.
- No skill improvement. The domain-specific skill documents are authored by humans and not modified by the system. The system cannot discover that a different benchmark command or metric would be more effective.
- No population dynamics. Unlike evolutionary systems (AlphaEvolve, OpenEvolve), pi-autoresearch maintains a single candidate at a time. It cannot explore multiple optimization paths in parallel or combine insights from diverse candidates.
Comparison with Knowledge Management in Other Systems
| System | Within-Session | Cross-Session | Cross-Project | Meta-Learning |
|---|---|---|---|---|
| Karpathy autoresearch | Via context window | Via results.tsv + git | No | No |
| AlphaEvolve | Program database | Persistent DB | Yes (shared infra) | Partially (self-improvement) |
| OpenEvolve | SQLite DB | Persistent DB | No | No |
| pi-autoresearch | JSONL + Markdown | JSONL + Markdown | No | No |
15 Applications
Primary Application: Developer Workflow Optimization
Pi-autoresearch's primary application is optimizing developer workflows — any measurable aspect of a software project that can be improved through code changes:
| Application | Metric | Typical Gains | Value Proposition |
|---|---|---|---|
| Test suite optimization | Wall-clock runtime | 10–50% faster | Faster CI, faster development loops |
| Build optimization | Build time | 10–40% faster | Faster deployments, faster iteration |
| Bundle optimization | Bundle size (KB) | 5–30% smaller | Faster page loads, better UX |
| Performance optimization | Lighthouse score | 5–20 point improvement | Better SEO, better user experience |
Secondary Application: ML Research
Following Karpathy's autoresearch paradigm, pi-autoresearch can be applied to ML training optimization:
| Application | Metric | Approach |
|---|---|---|
| Architecture search | val_bpb or val_loss | Agent modifies model architecture (layers, widths, attention patterns), measures training performance |
| Hyperparameter optimization | val_bpb or val_loss | Agent adjusts learning rate, batch size, warmup schedule, regularization |
| Data pipeline optimization | Training throughput (samples/sec) | Agent optimizes data loading, preprocessing, caching |
| Training stability | Loss variance | Agent modifies optimization procedure to reduce training instability |
Emerging Application: Multi-Agent Optimization
Pi-autoresearch opens the door to multi-agent optimization workflows:
Developer machine CI server
┌────────────────────┐ ┌────────────────────┐
│ pi + autoresearch │ │ Benchmark runner │
│ ┌────────────────┐ │ git push │ ┌────────────────┐ │
│ │ Agent generates│ │──────────────│ │ Run benchmark │ │
│ │ code changes │ │ │ │ on isolated │ │
│ │ & commits │ │ webhook │ │ hardware │ │
│ │ │ │◄─────────────│ │ │ │
│ │ Reads results │ │ │ │ Report metrics │ │
│ └────────────────┘ │ │ └────────────────┘ │
└────────────────────┘ └────────────────────┘
This pattern enables:
- Benchmarks on dedicated hardware (not affected by developer machine load)
- Longer-running benchmarks (CI can run 30-minute training jobs)
- Multi-metric evaluation (CI reports multiple metrics per run)
Limitations
- Single-metric optimization: The system is designed for single-objective optimization. Multi-objective problems require manual composite scoring or separate sessions.
- Requires automatable benchmarks: Applications where quality is subjective or requires human evaluation are not supported.
- Platform lock-in: The extension is specific to the pi agent. Users of other coding agents (Claude Code, Cursor, Copilot) cannot use pi-autoresearch without switching to pi.
- No guarantee of global optimality: Greedy hill-climbing may miss solutions that require temporary metric regressions. Population-based approaches (AlphaEvolve, OpenEvolve) are better suited to problems with rugged fitness landscapes.
Impact Assessment
Pi-autoresearch represents a significant step in the democratization of autonomous optimization:
| Dimension | Assessment |
|---|---|
| Accessibility | High — single install command, works with any pi project |
| Generality | High — domain-agnostic, any measurable metric |
| Scientific rigor | Medium — confidence scoring is a good start, but lacks formal statistical testing |
| Production readiness | Medium — suitable for development optimization, not safety-critical applications |
| Community adoption | Strong — 3,300+ stars, active development |
| Innovation | Medium — novel extension/skill separation, confidence scoring; core loop is established (Karpathy) |
Position in the Autoresearch Ecosystem
Complexity ──────────────────────────────────────────────► Scale
│
│ Karpathy pi-autoresearch
│ autoresearch (domain-agnostic,
│ (minimal, statistical rigor,
│ single-domain) extension/skill split)
│ │ │
│ ▼ ▼
│ ─────────────────────────────────────────────────
│ Proof of Concept Reusable Framework Platform
│
│ SkyPilot
│ autoresearch
│ (cloud-scale,
│ multi-cluster)
│ │
│ ▼
│ ─────────────────────────────────────────────────
│ Enterprise
Pi-autoresearch occupies the "reusable framework" niche — more sophisticated than Karpathy's proof-of-concept, more accessible than enterprise-grade cloud orchestrators. This positioning makes it the natural choice for individual developers and small teams who want autonomous optimization without significant infrastructure investment.