Introduced: 2026-03

Score: 7.92/10 (Draft)

Chapter 45

Pi-Autoresearch

Part P07: Autonomous Research Systems

Provenance note. This chapter surveys the pi-autoresearch extension (repository: github.com/davebcn87/pi-autoresearch). Claims are sourced as follows: README-documented indicates information stated in the repository README or documentation files; manifest/code-derived indicates structure observed in the extension manifest and TypeScript source; author analysis indicates interpretive commentary by the survey author. All repository observations were made in April 2026. All code listings in this chapter are pseudocode — they are conceptual reconstructions of the documented behavior, not verbatim excerpts from the repository. Function names, interface shapes, and parameter names are illustrative; the actual TypeScript implementation may use different identifiers, signatures, and internal structure. Where specific identifiers (tool names, file names, commands, metric protocol) are used, these are drawn from the README documentation and are labeled accordingly.

45.1 Overview and Motivation

Pi-autoresearch is an open-source extension for the pi AI coding agent that instruments it with tools, workflows, and a terminal UI for running autonomous experiment loops. Released in March 2026 by David Vilalta under the MIT license (README-documented), it is inspired by Andrej Karpathy's autoresearch but diverges in architecture: where Karpathy's system is a monolithic script coupled to a single domain (neural network training via nanochat) and a specific agent (Claude Code), pi-autoresearch decouples the experimental loop infrastructure from domain knowledge, producing a general-purpose optimization extension that works with any command-line-measurable metric.

The core claim, stated in the repository README, is that the experimental loop — measure, judge, keep or revert, repeat — is infrastructure, not domain knowledge. Domain knowledge belongs in a separate, human-authored document called a "skill." This separation enables a single extension to serve different optimization domains: test speed, bundle size, LLM training loss, Lighthouse scores, or any other numeric metric producible by a shell command.

Key Contribution

Pi-autoresearch introduces an extension/skill architecture that separates domain-agnostic experimental loop infrastructure (measurement, version control, statistical confidence, dashboards) from domain-specific knowledge (what to optimize, how to measure, which files to modify). It adds MAD-based statistical confidence scoring — a session-level signal-to-noise heuristic that estimates whether cumulative improvements exceed the measurement noise floor. These contributions are compared here against Karpathy's autoresearch and SkyPilot autoresearch, the two most prominent prior systems identified in this survey; the comparison is bounded to those systems, not to all possible tooling.

45.1.1 Position in the Autoresearch Lineage

Table 45.1: Positioning within the autoresearch ecosystem (author analysis based on public documentation of each system)
System	Year	Architecture	Domain	Statistical Rigor	UI
Karpathy autoresearch	2026	Monolithic script	Neural network training	None	Terminal output
SkyPilot autoresearch	2026	Cloud orchestrator	Any cloud workload	None	Web dashboard
Pi-autoresearch	2026	Extension + skill	Any measurable metric	MAD confidence	Widget + dashboard + overlay

45.1.2 LLM-as-Agent, Not LLM-as-Mutation-Operator

A critical architectural distinction separates pi-autoresearch from evolutionary systems like AlphaEvolve (Chapter 4) or FunSearch (Chapter 5). In those systems, the LLM serves as a mutation operator called by the system to propose code changes within a population-based search. In pi-autoresearch, the LLM is the agent — it has full autonomy to read files, understand context, form hypotheses, make changes, and decide what to try next. The extension merely provides tools that the agent invokes at its discretion (README-documented).

This design choice trades formal search-theoretic guarantees for practical flexibility: any LLM backend configured in pi can power the optimization loop, and the space of possible modifications is limited only by the LLM's coding ability rather than by a predefined mutation grammar.

45.2 Architecture

Pi-autoresearch follows a three-layer architecture that separates the pi runtime, the extension infrastructure, and domain-specific skills (README-documented; the characterization of this as the system's primary structural contribution is author analysis).

45.2.1 Implementation Map

The following table summarizes the concrete components identified from the repository's README documentation and extension structure. Each row indicates the evidence tier: whether the component is explicitly described in the README, inferred from the pi extension framework conventions, or observable in the repository's file structure. Readers wishing to verify these claims should inspect the repository directly; the survey author's observations are dated April 2026.

Table 45.2: Implementation map — concrete components and evidence tiers (April 2026)
Component	Identifier / Location	Evidence Tier	Notes
Extension name	`autoresearch`	README-documented	Registered via `pi install https://github.com/davebcn87/pi-autoresearch`
Tool: session init	`init_experiment`	README-documented	Parameters: name, metric_name, metric_unit, direction
Tool: benchmark exec	`run_experiment`	README-documented	Parameter: command (shell command string)
Tool: result logging	`log_experiment`	README-documented	Parameters: metric_value, description, status
Skill: session creation	`autoresearch-create`	README-documented	Markdown skill consumed by LLM as context
Skill: branch finalization	`autoresearch-finalize`	README-documented	Markdown skill consumed by LLM as context
Command	`/autoresearch`	README-documented	Subcommands: `<context>`, `off`, `clear`
Session narrative	`autoresearch.md`	README-documented	Generated in project working directory at session start
Structured log	`autoresearch.jsonl`	README-documented	Append-only, one JSON object per line per experiment
Benchmark script	`autoresearch.sh`	README-documented	User-authored; emits `METRIC name=number` on stdout
Correctness checks	`autoresearch.checks.sh`	README-documented	Optional; runs after benchmark passes
Session config	`autoresearch.config.json`	README-documented	Optional; fields include `maxIterations`
UI: status widget	(pi widget framework)	README-documented	Persistent bar: run count, keep count, best metric, confidence
UI: dashboard	(pi widget framework)	README-documented	Expandable results table toggled via keyboard shortcut
UI: overlay	(pi widget framework)	README-documented	Fullscreen scrollable view with live spinner
Extension manifest	JSON manifest file	Framework-inferred	Pi extensions use JSON manifests; exact path not independently verified
Tool handler source	TypeScript source files	Framework-inferred	Pi extensions are implemented in TypeScript; exact file paths not independently verified
Metric protocol	`METRIC name=number` (stdout)	README-documented	Regex pattern on stdout; language-agnostic
Experiment statuses	`kept`, `discarded`, `crashed`, `checks_failed`, `baseline`	README-documented	Recorded in JSONL log per experiment

Implementation gap. The survey author has not performed a line-by-line audit of the TypeScript source. Exact file paths within the repository (e.g., the location of the extension manifest JSON, the directory structure of tool handler modules, the UI widget registration code) are not independently verified. The identifiers listed above (tool names, skill names, file names, status values, metric protocol) are drawn from README documentation and are consistent with observed extension behavior, but the internal implementation may differ in structure from the pseudocode presented in this chapter.

45.2.2 Extension Manifest and Tool Registration

The extension declares its tools, commands, and skills through a JSON manifest. Pi extensions use manifest-driven registration so that the pi agent discovers available tools automatically when the extension is installed (README-documented: pi install https://github.com/davebcn87/pi-autoresearch). The following is a pseudocode reconstruction of the manifest structure, based on the README-documented tool names, parameters, and skills. The actual manifest file may use different field names, nesting, or additional fields not described here:

// PSEUDOCODE — conceptual manifest structure
// Reconstructed from README-documented tool names and parameters
// Actual manifest format and field names may differ
{
  "name": "autoresearch",
  "tools": [
    {
      "name": "init_experiment",
      "parameters": {
        "name": { "type": "string" },
        "metric_name": { "type": "string" },
        "metric_unit": { "type": "string" },
        "direction": { "type": "string", "enum": ["minimize", "maximize"] }
      }
    },
    {
      "name": "run_experiment",
      "parameters": {
        "command": { "type": "string" }
      }
    },
    {
      "name": "log_experiment",
      "parameters": {
        "metric_value": { "type": "number" },
        "description": { "type": "string" },
        "status": { "type": "string",
          "enum": ["kept", "discarded", "crashed", "checks_failed", "baseline"] }
      }
    }
  ],
  "skills": ["autoresearch-create", "autoresearch-finalize"],
  "commands": [
    { "name": "autoresearch" }
  ]
}

Table 45.3: Extension tool summary (README-documented)
Tool	Lifecycle	Function
`init_experiment`	Once per session	Configures session: experiment name, primary metric name, unit, optimization direction (minimize/maximize)
`run_experiment`	Per experiment	Executes benchmark command, measures wall-clock duration, captures stdout/stderr, parses `METRIC` lines
`log_experiment`	Per experiment	Records result to `autoresearch.jsonl`, auto-commits, computes confidence score (if ≥3 non-crashed runs), updates UI

The UI comprises three progressive disclosure levels (README-documented): a persistent status widget (run count, keep count, best metric, improvement percentage, confidence score), an expandable dashboard toggled via keyboard shortcut (full results table with commits, metrics, status, descriptions), and a fullscreen overlay (scrollable terminal-wide view with a live spinner during experiment execution).

45.2.3 Skill Components

Skills are Markdown documents consumed by the LLM as context. They encode domain-specific knowledge: what to optimize, how to measure it, which files are in scope, and what strategies to consider. Two skills ship with the extension (README-documented):

autoresearch-create: Session initialization — gathers goal, command, metric, scope from the user (or infers from project context); writes session files; runs a baseline measurement; starts the autonomous loop.
autoresearch-finalize: Branch finalization — reads the experiment log, groups kept experiments into logically independent branches, proposes the grouping for human approval, then creates branches from the merge-base. Groups must not share files, ensuring each resulting branch can be reviewed and merged independently.

45.2.4 Prompt Architecture

The LLM agent operates with a five-layer prompt architecture (README-documented):

Layer 1: pi system prompt (agent capabilities, tool definitions)
Layer 2: Extension tool descriptions (init_experiment, run_experiment, log_experiment)
Layer 3: Skill document (domain-specific instructions, loaded at session start)
Layer 4: Session context (autoresearch.md — accumulated narrative of what has been tried)
Layer 5: Real-time state (widget data, recent experiment results)

This layering ensures the agent always has access to what it can do (tools), what it should optimize (skill), what has already been tried (session history), and how well it is doing (confidence scores, metric trajectory). A fresh agent instance after a context reset can reconstruct all of this from the persistent session files.

45.2.5 Metric Protocol

The benchmark script (autoresearch.sh) communicates metrics to the extension via a deliberately minimal stdout protocol (README-documented): any line matching METRIC name=number is captured. The following pseudocode illustrates the conceptual parsing logic. The actual TypeScript implementation may use different variable names, error handling, or regex patterns:

// PSEUDOCODE — conceptual metric parsing logic
// Illustrates the METRIC protocol documented in the README
// Actual implementation identifiers and structure may differ

// README-documented protocol: lines matching "METRIC name=number"
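function parseMetrics(stdout: string): Array<{ name: string; value: number }> {
  // Created per call so the /g regex's lastIndex is never shared between calls
  const metricPattern = /^METRIC\s+(\w+)=([\d.eE+-]+)$/gm;
  const results: Array<{ name: string; value: number }> = [];
  let match: RegExpExecArray | null;
  while ((match = metricPattern.exec(stdout)) !== null) {
    // Number() rejects malformed captures such as "1.2.3", which parseFloat
    // would silently truncate to 1.2
    const value = Number(match[2]);
    if (Number.isFinite(value)) {
      results.push({ name: match[1], value });
    }
  }
  return results;
}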

// Example benchmark script output:
// $ bash autoresearch.sh
// Running vitest...
// Tests: 142 passed, 0 failed
// METRIC total_test_seconds=38.7
// METRIC peak_memory_mb=512
//
// Conceptual result: [
//   { name: "total_test_seconds", value: 38.7 },
//   { name: "peak_memory_mb", value: 512 }
// ]

This protocol is language-agnostic: any build tool, test runner, or training script can produce METRIC lines. The extension only looks for lines matching the pattern and extracts the numeric value. The primary metric (specified in init_experiment) is used for keep/revert decisions; additional metrics are recorded in the JSONL log for post-hoc analysis.
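The split between the primary metric and the additional metrics can be sketched as follows. This is a minimal illustration, not the extension's code: `splitMetrics`, `Metric`, and `SplitMetrics` are hypothetical names, and the rule that the last valid occurrence of a repeated primary metric wins is an assumption rather than documented behavior.

```typescript
// Hypothetical helper; not taken from the pi-autoresearch source.
interface Metric {
  name: string;
  value: number;
}

interface SplitMetrics {
  primary: number | null; // null when the benchmark never emitted the primary metric
  secondary: Metric[];    // recorded for post-hoc analysis only
}

function splitMetrics(metrics: Metric[], primaryName: string): SplitMetrics {
  let primary: number | null = null;
  const secondary: Metric[] = [];
  for (const m of metrics) {
    if (m.name !== primaryName) {
      secondary.push(m);
    } else if (isFinite(m.value)) {
      primary = m.value; // assumption: the last valid occurrence wins
    }
  }
  return { primary, secondary };
}

// Usage with the example benchmark output shown earlier in this section.
const parsed: Metric[] = [
  { name: "total_test_seconds", value: 38.7 },
  { name: "peak_memory_mb", value: 512 },
];
const split = splitMetrics(parsed, "total_test_seconds");
```

A `null` primary value would be a natural trigger for treating the run as unusable, though how the extension handles a missing primary metric is not described in the README.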

45.3 Core Algorithms

45.3.1 Greedy Hill-Climbing with Git-Backed State

The core optimization strategy is greedy hill-climbing, with git as the state-management substrate (README-documented). Every experiment that improves the primary metric advances the branch (the commit is kept); every experiment that fails to improve triggers a git reset to the previous best state. The result is a trajectory that improves monotonically in the measured metric: every commit on the branch recorded a better measurement than all prior kept states, although a single measurement cannot by itself rule out benchmark noise.

Formally, let $m_t$ denote the primary metric value at experiment $t$, and let $d \in \{-1, +1\}$ encode the optimization direction ($d = -1$ for minimization, $d = +1$ for maximization). The keep/revert decision is:

$$\text{decision}(t) = \begin{cases} \texttt{keep} & \text{if } d \cdot m_t > d \cdot m^*_{t-1} \\ \texttt{revert} & \text{otherwise} \end{cases} \tag{45.1}$$

where $m^*_{t-1}$ is the best metric value observed prior to experiment $t$, defined recursively: $m^*_0 = m_0$ (baseline) and $m^*_t = m_t$ if kept, $m^*_t = m^*_{t-1}$ if reverted. After a revert, the code state is restored via git reset --hard. The comparison $d \cdot m_t > d \cdot m^*_{t-1}$ unifies minimization and maximization: for minimization ($d = -1$), this becomes $m_t < m^*_{t-1}$.
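Equation 45.1 and the recursive definition of $m^*_t$ translate directly into a few lines of code. The sketch below is illustrative only; `decide`, `directionSign`, and `runningBest` are hypothetical names, not identifiers from the repository.

```typescript
// Illustrative sketch of Eq. 45.1; names are hypothetical.
type Direction = "minimize" | "maximize";
type Decision = "keep" | "revert";

// d = -1 for minimization, d = +1 for maximization.
function directionSign(direction: Direction): 1 | -1 {
  return direction === "maximize" ? 1 : -1;
}

// Keep only on strict improvement over the best value seen so far.
function decide(direction: Direction, current: number, bestSoFar: number): Decision {
  const d = directionSign(direction);
  return d * current > d * bestSoFar ? "keep" : "revert";
}

// Fold a sequence of measurements into the running best m*_t,
// starting from the baseline m_0.
function runningBest(direction: Direction, baseline: number, measurements: number[]): number {
  let best = baseline;
  for (const m of measurements) {
    if (decide(direction, m, best) === "keep") {
      best = m;
    }
  }
  return best;
}
```

Because the comparison is strict, a measurement that merely ties the current best is reverted, which matches the "otherwise" branch of Equation 45.1.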

The following pseudocode illustrates the conceptual experiment-cycle logic. This is not extracted from the repository; actual handler functions, parameter passing, and internal state management may differ substantially:

// PSEUDOCODE — conceptual keep/revert cycle
// Illustrates the documented greedy hill-climbing behavior
// Actual implementation structure, identifiers, and logic may differ

// Conceptual experiment record structure (fields from README-documented JSONL schema)
interface ExperimentRecord {
  run: number;
  commit: string;
  metric_name: string;
  metric_value: number;
  status: "baseline" | "kept" | "discarded" | "crashed" | "checks_failed";
  description: string;
  confidence: number | null;
  timestamp: string;
  duration_ms: number;
}

// Conceptual logging flow (README-documented behavior)
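async function handleLogExperiment(
  metricValue: number,
  description: string,
  status: ExperimentRecord["status"],
  session: SessionState
): Promise<ExperimentRecord> {
  // A crashed run yields no valid metric, so the sample pool is unchanged:
  // carry the previous confidence forward instead of recomputing with metricValue
  const previous = session.entries[session.entries.length - 1];
  const confidence = status === "crashed"
    ? (previous ? previous.confidence : null)
    : computeConfidence(session.entries, metricValue, session.config.direction);

  const record: ExperimentRecord = {
    run: session.entries.length + 1,  // Derived from the log, so it cannot drift
    commit: await getCurrentCommitHash(),
    metric_name: session.config.metricName,
    metric_value: metricValue,
    status,
    description,
    confidence,
    timestamp: new Date().toISOString(),
    duration_ms: session.lastBenchmarkDuration,
  };

  // Append to JSONL log (append-only, one JSON object per line)
  await appendToJsonl(session.jsonlPath, record);

  // Update UI widgets with latest state
  updateWidget(session, record);

  session.entries.push(record);
  return record;
}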

The greedy approach is a deliberate design choice (author analysis). Unlike population-based methods (AlphaEvolve, OpenEvolve, GEPA), pi-autoresearch maintains a single candidate at any time, producing a linear git history that humans can easily audit.

Table 45.4: Trade-off analysis — greedy hill-climbing vs. population-based search (author analysis)
Property	Greedy (pi-autoresearch)	Population-Based (AlphaEvolve, etc.)
Convergence speed	Fast for easy gains	Slower, but better at escaping local optima
Implementation complexity	Low (git commit/reset)	High (population management, archives)
Memory requirements	$O(1)$ — current state only	$O(N)$ — population of $N$ candidates
Risk of local optima	High	Lower
Human interpretability	Very high (linear history)	Low (complex population dynamics)

45.3.2 Confidence Scoring via Median Absolute Deviation

Important: scope of the confidence score

The confidence score described below is a session-level signal-to-noise heuristic. It estimates whether the session's cumulative best improvement exceeds the overall measurement noise floor (as estimated by MAD across all non-crashed experiments). It is not a per-experiment significance test — a green confidence score does not validate the statistical significance of any individual kept change. The score cannot distinguish whether the cumulative improvement is driven by one large genuine gain or by several small changes that individually fall within noise. Readers should not infer statistical validity of individual kept changes from a green score alone. The thresholds (2.0×, 1.0×) are heuristic and do not correspond to standard $p$-value thresholds or formal confidence intervals.

The confidence scoring uses Median Absolute Deviation (MAD) as a robust noise-floor estimator (README-documented). This addresses a practical weakness: raw metric comparisons cannot distinguish real improvements from benchmark jitter caused by garbage-collection pauses, CPU thermal throttling, I/O contention, or inherent stochasticity in ML training.

Definitions and Scope

The confidence computation operates over a sample pool defined as follows:

Baseline: The metric value $m_0$ obtained from the first run_experiment invocation before any code changes. This serves as the reference point for measuring improvement.
Sample pool: All metric values from experiments with status $\notin \{\texttt{crashed}\}$. Specifically, experiments with status baseline, kept, discarded, and checks_failed all contribute their metric values to the pool. Crashed experiments are excluded because they produce no valid metric value (the benchmark command returned a non-zero exit code).
Minimum sample size: Confidence is computed only when the sample pool contains $n \geq 3$ values (the minimum for a meaningful MAD estimate).

MAD is defined as:

$$\text{MAD} = \text{median}\left(\left|x_i - \tilde{x}\right| \;\middle|\; i = 1, \ldots, n\right) \tag{45.2}$$

where $\{x_1, \ldots, x_n\}$ is the sample pool and $\tilde{x} = \text{median}(x_1, \ldots, x_n)$ is its median. Unlike standard deviation $\sigma$, which is sensitive to outliers (a single GC spike can inflate $\sigma$ by an order of magnitude), MAD is robust because a median depends on the rank order of the deviations rather than their magnitudes, so a few extreme values barely move it.

The confidence score is the ratio of the best direction-aware improvement over baseline to the noise floor:

$$\text{confidence} = \frac{\Delta m_{\text{best}}}{\text{MAD}} \tag{45.3}$$

where the best improvement $\Delta m_{\text{best}}$ is computed with respect to the baseline and optimization direction:

$$\Delta m_{\text{best}} = \max_{i=1,\ldots,n} \; d \cdot (x_i - m_0) \tag{45.4}$$

Here $d \in \{-1, +1\}$ encodes direction ($d = -1$ for minimization, $d = +1$ for maximization) and $m_0 = x_1$ is the baseline value. For minimization, $d \cdot (x_i - m_0) = m_0 - x_i$, so the best improvement is the largest reduction from baseline. If no experiment improves over baseline ($\Delta m_{\text{best}} \leq 0$), the confidence is reported as $0$.

The interpretation thresholds (README-documented):

Table 45.5: Confidence score interpretation (README-documented). These are session-level heuristics, not per-experiment significance tests.
Confidence	Signal	Color	Agent Guidance
$\geq 2.0$	Strong	Green	Session's cumulative improvement likely exceeds noise. Keep and continue.
$[1.0, 2.0)$	Marginal	Yellow	Above noise but uncertain. Re-run recommended.
$< 1.0$	Noise	Red	Within noise floor. Consider reverting.
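The thresholds in Table 45.5 amount to a simple classification. The sketch below assumes the boundaries are inclusive at the lower end of each band (2.0 counts as strong, 1.0 as marginal), which follows from the table but is not spelled out further; `classifyConfidence` and `signalColor` are hypothetical names.

```typescript
// Illustrative mapping of Table 45.5; names are hypothetical.
type Signal = "strong" | "marginal" | "noise";
type Color = "green" | "yellow" | "red";

// Returns null while confidence is unavailable (fewer than 3 non-crashed runs).
function classifyConfidence(confidence: number | null): Signal | null {
  if (confidence === null) return null;
  if (confidence >= 2.0) return "strong";
  if (confidence >= 1.0) return "marginal";
  return "noise";
}

function signalColor(signal: Signal): Color {
  switch (signal) {
    case "strong":
      return "green";
    case "marginal":
      return "yellow";
    default:
      return "red";
  }
}
```

The classification is advisory: it describes the session as a whole and says nothing about whether any single kept change is real.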

Pseudocode Implementation

The following pseudocode illustrates the conceptual confidence computation, consistent with the mathematical definitions above. This is not an excerpt from the repository — the actual TypeScript implementation may use different variable names, control flow, and edge-case handling:

// PSEUDOCODE — conceptual confidence computation
// Illustrates Equations 45.2–45.4 from the text
// Actual implementation identifiers and logic may differ

function computeConfidence(
  priorEntries: ExperimentRecord[],
  currentValue: number,
  direction: "minimize" | "maximize"
): number | null {
  // Collect all non-crashed metric values (sample pool)
  const values = priorEntries
    .filter(e => e.status !== "crashed")
    .map(e => e.metric_value)
    .concat(currentValue);

  if (values.length < 3) return null;  // Minimum sample size

  // Compute median (Eq. 45.2)
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];

  // Compute MAD = median of |x_i - median| (Eq. 45.2)
  const deviations = values.map(v => Math.abs(v - median));
  const sortedDevs = [...deviations].sort((a, b) => a - b);
  const devMid = Math.floor(sortedDevs.length / 2);
  const mad = sortedDevs.length % 2 === 0
    ? (sortedDevs[devMid - 1] + sortedDevs[devMid]) / 2
    : sortedDevs[devMid];

  // Best improvement from baseline, direction-aware (Eq. 45.4)
  const baseline = values[0];  // First entry is always baseline
  const sign = direction === "maximize" ? 1 : -1;
  const bestImprovement = Math.max(
    ...values.map(v => sign * (v - baseline))
  );

  if (bestImprovement <= 0) return 0;  // No run beats the baseline

  // MAD = 0 means at least half the pool equals the median (no measurable
  // noise), so a genuine improvement is unboundedly above the noise floor
  if (mad === 0) return Infinity;

  // Session-level confidence ratio (Eq. 45.3)
  return bestImprovement / mad;
}

Worked Numeric Example

Consider a test-speed optimization session (direction: minimize, unit: seconds). The following table traces eight experiments, showing how the sample pool, MAD, and confidence evolve:

Table 45.6: Worked confidence computation example — test speed optimization
Run	Metric (s)	Status	Best So Far	Sample Pool	MAD	$\Delta m_{\text{best}}$	Confidence
1	45.2	baseline	45.2	[45.2]	—	—	—
2	39.8	kept	39.8	[45.2, 39.8]	—	—	—
3	41.1	discarded	39.8	[45.2, 39.8, 41.1]	1.3	5.4	4.15
4	37.5	kept	37.5	[45.2, 39.8, 41.1, 37.5]	1.80	7.7	4.28
5	38.2	discarded	37.5	[45.2, 39.8, 41.1, 37.5, 38.2]	1.6	7.7	4.81
6	36.8	kept	36.8	[45.2, 39.8, 41.1, 37.5, 38.2, 36.8]	1.80	8.4	4.67
7	—	crashed	36.8	(unchanged, crashed excluded)	1.80	8.4	4.67
8	35.1	kept	35.1	[45.2, 39.8, 41.1, 37.5, 38.2, 36.8, 35.1]	1.6	10.1	6.31

Detailed computation for run 3 (first confidence-eligible run, $n = 3$):

Sample pool: $\{45.2, 39.8, 41.1\}$. Sorted: $[39.8, 41.1, 45.2]$. Median $\tilde{x} = 41.1$.
Absolute deviations: $|45.2 - 41.1| = 4.1$, $|39.8 - 41.1| = 1.3$, $|41.1 - 41.1| = 0$. Sorted: $[0, 1.3, 4.1]$. MAD $= 1.3$.
Best improvement from baseline ($d = -1$): $\Delta m_{\text{best}} = \max(45.2 - 45.2, \; 45.2 - 39.8, \; 45.2 - 41.1) = 5.4$.
Confidence $= 5.4 / 1.3 = 4.15$ — strong green signal ($\geq 2.0$). This means the session's cumulative best improvement is 4.15× the noise floor. It does not mean each individual kept experiment is statistically validated.

Detailed computation for run 4 ($n = 4$):

Sample pool: $\{45.2, 39.8, 41.1, 37.5\}$. Sorted: $[37.5, 39.8, 41.1, 45.2]$. Median $= (39.8 + 41.1)/2 = 40.45$.
Deviations: $|45.2 - 40.45| = 4.75$, $|39.8 - 40.45| = 0.65$, $|41.1 - 40.45| = 0.65$, $|37.5 - 40.45| = 2.95$. Sorted: $[0.65, 0.65, 2.95, 4.75]$. MAD $= (0.65 + 2.95)/2 = 1.80$.
$\Delta m_{\text{best}} = 45.2 - 37.5 = 7.7$. Confidence $= 7.7 / 1.80 = 4.28$.

Note that the crashed experiment (run 7) is excluded from the sample pool but the session continues with the same best-so-far value. The discarded experiments (runs 3 and 5) do contribute to the MAD computation — their metric values represent valid noise measurements even though the code changes were reverted.
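The confidence column of Table 45.6 can be reproduced with a short script. This is a sketch under the chapter's definitions (Equations 45.2 to 45.4), not repository code; `traceConfidence` and `median` are hypothetical helpers, `null` stands in for the crashed run, and the zero-MAD edge case is simplified to a score of 0.

```typescript
// Sketch reproducing the confidence column of Table 45.6; names are hypothetical.
type Direction = "minimize" | "maximize";

function median(xs: number[]): number {
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 0 ? (s[mid - 1] + s[mid]) / 2 : s[mid];
}

// One entry per run; null marks a crashed run with no valid metric.
function traceConfidence(
  runs: Array<number | null>,
  direction: Direction
): Array<string | null> {
  const sign = direction === "maximize" ? 1 : -1;
  const pool: number[] = [];
  const trace: Array<string | null> = [];
  for (const value of runs) {
    if (value !== null) pool.push(value); // crashed runs leave the pool unchanged
    if (pool.length < 3) {
      trace.push(null); // below the minimum sample size
      continue;
    }
    const center = median(pool);
    const noise = median(pool.map((x) => Math.abs(x - center))); // Eq. 45.2
    const best = Math.max(...pool.map((x) => sign * (x - pool[0]))); // Eq. 45.4
    const confidence = best > 0 && noise > 0 ? best / noise : 0; // Eq. 45.3
    trace.push(confidence.toFixed(2));
  }
  return trace;
}

const tableTrace = traceConfidence(
  [45.2, 39.8, 41.1, 37.5, 38.2, 36.8, null, 35.1],
  "minimize"
);
// tableTrace: [null, null, "4.15", "4.28", "4.81", "4.67", "4.67", "6.31"]
```

The trace also shows that the score is not monotonic: it dips from 4.81 to 4.67 at run 6 because the new sample raises the MAD estimate from 1.6 to 1.8.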

Statistical Properties and Limitations

The choice of MAD over standard deviation is well-grounded in robust statistics. For a normal distribution, $\text{MAD} \approx 0.6745 \cdot \sigma$, so a confidence threshold of 2.0× MAD corresponds to roughly $1.35\sigma$ — a modest but meaningful signal. However, benchmark measurement distributions are typically heavy-tailed rather than normal, which is precisely where MAD's outlier robustness provides the most value.
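The robustness argument can be made concrete with a small comparison. The sample values below are invented for illustration (five stable timings plus one hypothetical GC-pause outlier), and `median`, `mad`, `stdDev`, and `madMultiplesToSigma` are hypothetical helper names.

```typescript
// Illustrative comparison of MAD and standard deviation; sample data is invented.
function median(xs: number[]): number {
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 0 ? (s[mid - 1] + s[mid]) / 2 : s[mid];
}

function mad(xs: number[]): number {
  const m = median(xs);
  return median(xs.map((x) => Math.abs(x - m)));
}

// Population standard deviation.
function stdDev(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) * (x - mean), 0) / xs.length;
  return Math.sqrt(variance);
}

// Under a normal distribution, MAD ≈ 0.6745 σ, so k MADs ≈ 0.6745 k σ.
function madMultiplesToSigma(k: number): number {
  return 0.6745 * k;
}

const clean = [38.1, 38.4, 38.2, 38.3, 38.0];
const spiked = clean.concat([52.0]); // one hypothetical outlier

const stdInflation = stdDev(spiked) / stdDev(clean);
const madInflation = mad(spiked) / mad(clean);
const greenThresholdInSigma = madMultiplesToSigma(2.0);
```

With these invented samples, the single outlier inflates the standard deviation by more than an order of magnitude, while MAD grows by roughly half; `greenThresholdInSigma` evaluates to about 1.35, matching the figure quoted above.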

Key design decisions (README-documented unless otherwise noted):

Confidence is computed only after $\geq 3$ non-crashed experiments (minimum sample size for a meaningful MAD estimate).
All non-crashed experiments contribute to the MAD pool, regardless of status (kept, discarded, checks_failed, baseline). This is important: discarded runs are valid noise measurements.
The confidence score is advisory only — it never auto-discards experiments. The agent is guided but not constrained.
Confidence values are persisted to autoresearch.jsonl for post-hoc analysis.
Author analysis: Because the score measures cumulative best improvement relative to the noise floor, its numerator $\Delta m_{\text{best}}$ never decreases over a session, but the score itself is not monotonic: it falls whenever new samples raise the MAD estimate faster than the best improvement grows (in Table 45.6 it drops from 4.81 to 4.67 between runs 5 and 6, even though run 6 was kept). A session that achieved a single large improvement early will tend to show a high confidence score for subsequent experiments, regardless of whether later changes individually contribute. This is by design: the score tracks session health, not per-experiment validity.

45.3.3 Backpressure Checks

The optional autoresearch.checks.sh mechanism (README-documented) provides a correctness safety valve during autonomous optimization. After each benchmark that passes (exit code 0), the system can run a separate checks script containing tests, type checks, or linting. If the checks fail, the experiment is logged with status checks_failed (distinct from crashed) and the changes are reverted. This prevents a common failure mode in long autonomous runs: optimizations that improve the target metric while silently breaking other functionality.

Importantly, the checks execution time does not affect the primary metric measurement — the benchmark and checks are executed sequentially, with the metric captured from the benchmark output only. The checks_failed status allows post-hoc analysis of how frequently the agent proposes correctness-breaking optimizations.
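The relationship between the benchmark exit code, the optional checks, and the recorded status can be sketched as a small decision function. This is an illustration of the documented statuses, not the extension's logic: in pi-autoresearch the agent supplies the status to log_experiment, and `classifyOutcome`, `CycleOutcome`, and `requiresRevert` are hypothetical names. Evaluating checks before the improvement test is an assumption.

```typescript
// Illustrative status decision; names and evaluation order are assumptions.
type NonBaselineStatus = "kept" | "discarded" | "crashed" | "checks_failed";

interface CycleOutcome {
  benchmarkExitCode: number;
  checksExitCode: number | null; // null when no autoresearch.checks.sh is present
  improved: boolean;             // outcome of the keep/revert comparison (Eq. 45.1)
}

function classifyOutcome(outcome: CycleOutcome): NonBaselineStatus {
  if (outcome.benchmarkExitCode !== 0) return "crashed";
  if (outcome.checksExitCode !== null && outcome.checksExitCode !== 0) {
    return "checks_failed";
  }
  return outcome.improved ? "kept" : "discarded";
}

// Every status except "kept" ends with the working tree being reset.
function requiresRevert(status: NonBaselineStatus): boolean {
  return status !== "kept";
}
```

Keeping `checks_failed` distinct from `crashed` preserves the benchmark's metric value for the noise estimate while still discarding the change.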

45.3.4 Branch-Aware Finalization

The autoresearch-finalize skill (README-documented) addresses the "messy experiment branch" problem — after dozens of experiments, the working branch contains an interleaved sequence of kept improvements that may span unrelated concerns. The finalization process groups kept experiments into logically coherent changesets and creates independent branches from the merge-base.

The key constraint is that groups must not share files. This ensures each resulting branch can be reviewed and merged independently without conflict resolution. The grouping is proposed by the LLM agent (reading the experiment log and diff metadata) and approved by the human researcher before branches are created.
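The no-shared-files constraint is easy to validate mechanically before any branches are created. The sketch below is a hypothetical validator (`findSharedFiles` and `ChangeGroup` are illustrative names, and the file paths are invented); the README describes the constraint but not how it is enforced.

```typescript
// Hypothetical validator for the file-disjointness constraint.
interface ChangeGroup {
  name: string;
  files: string[];
}

// Returns every file claimed by more than one group; an empty result
// means the groups can become independent branches.
function findSharedFiles(groups: ChangeGroup[]): string[] {
  const owner = new Map<string, string>();
  const shared: string[] = [];
  for (const group of groups) {
    for (const file of group.files) {
      const existing = owner.get(file);
      if (existing === undefined) {
        owner.set(file, group.name);
      } else if (existing !== group.name && shared.indexOf(file) === -1) {
        shared.push(file);
      }
    }
  }
  return shared;
}

// Usage with invented paths.
const proposal: ChangeGroup[] = [
  { name: "parallel-workers", files: ["vitest.config.ts"] },
  { name: "faster-fixtures", files: ["test/fixtures.ts", "test/setup.ts"] },
];
const conflicts = findSharedFiles(proposal);
```

A non-empty result would indicate that the proposed grouping needs to be merged or re-split before it is presented for approval.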

45.4 Session State and Memory Management

45.4.1 Dual-File Persistence

Pi-autoresearch's memory management operates at three levels with distinct volatility characteristics (README-documented):

Level 1: LLM Context Window (volatile). The agent's context window contains the system prompt, tool descriptions, skill document, recent conversation history, and fragments of session files. This is the primary working memory. It is limited by the model's context length and lost entirely on context reset.

Level 2: Session Files (persistent). Two complementary files survive context resets. autoresearch.md is a narrative Markdown document maintained by the agent with sections for the optimization objective, strategies tried, dead ends, and key wins. It provides high-level strategic context that helps a fresh agent understand why certain approaches were tried. autoresearch.jsonl is an append-only structured log recording every experiment with exact metrics, commit hashes, confidence scores, timestamps, statuses, and descriptions. It provides precise tactical recall. The JSONL schema (README-documented fields):

// JSONL record schema: fields documented in the README
// Shown pretty-printed for readability; in autoresearch.jsonl each record
// occupies a single line. Additional fields may exist in the implementation
{
  "run": 4,
  "commit": "a3f8c2e",
  "metric_name": "total_test_seconds",
  "metric_value": 37.5,
  "status": "kept",
  "description": "Parallelized vitest workers across 4 CPU cores",
  "confidence": 4.28,
  "timestamp": "2026-03-15T14:23:41.892Z",
  "duration_ms": 38200
}

Level 3: Git History (permanent). Every kept experiment is a git commit, creating an immutable audit trail of all code changes that produced improvements. This is not consumed directly by the agent but serves human reviewers and enables the finalization workflow.

The design insight (author analysis) is that LLMs benefit from both narrative context (what is the goal, what approaches worked, what failed) and structured data (exact metric values, commit hashes, statistical scores). Storing these in separate formats, each suited to its modality, is argued here to serve the agent better than a single homogeneous log; the repository reports no comparison that tests this.

45.4.2 Resumption Protocol

The system supports three resumption scenarios (README-documented; the README explicitly states: "A fresh agent with no memory can read these two files and continue exactly where the previous session left off"):

Agent restart (same context window): The agent reads autoresearch.jsonl to reconstruct numerical state and continues the loop.
Context reset (new agent instance): A fresh agent reads both autoresearch.md (high-level context) and autoresearch.jsonl (detailed history).
Branch switch: Each branch maintains its own session state. Switching branches automatically switches the session context.
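Reconstructing numerical state from the structured log can be sketched as a single pass over its lines. This is an illustrative reading of the documented JSONL fields (`run`, `metric_value`, `status`), not the extension's resumption code; `resumeFromJsonl` and `ResumedState` are hypothetical names.

```typescript
// Illustrative state reconstruction from autoresearch.jsonl contents.
interface LogEntry {
  run: number;
  metric_value: number;
  status: string;
}

interface ResumedState {
  runCount: number;
  keptCount: number;
  baseline: number | null;
  best: number | null;
}

function resumeFromJsonl(jsonl: string, direction: "minimize" | "maximize"): ResumedState {
  const sign = direction === "maximize" ? 1 : -1;
  const state: ResumedState = { runCount: 0, keptCount: 0, baseline: null, best: null };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate a trailing newline
    const entry = JSON.parse(line) as LogEntry;
    state.runCount += 1;
    if (entry.status === "baseline") {
      state.baseline = entry.metric_value;
      state.best = entry.metric_value;
    } else if (entry.status === "kept") {
      state.keptCount += 1;
      if (state.best === null || sign * entry.metric_value > sign * state.best) {
        state.best = entry.metric_value;
      }
    }
  }
  return state;
}

// Usage with three invented records.
const sampleLog = [
  '{"run":1,"metric_value":45.2,"status":"baseline"}',
  '{"run":2,"metric_value":39.8,"status":"kept"}',
  '{"run":3,"metric_value":41.1,"status":"discarded"}',
  "",
].join("\n");
const resumed = resumeFromJsonl(sampleLog, "minimize");
```

Only `baseline` and `kept` entries move the best value, mirroring the keep/revert rule; the narrative context in autoresearch.md is not recoverable from this file and would be read separately.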

45.4.3 Memory Scaling Properties

Both session files grow linearly with the number of experiments. For short sessions (10–20 experiments), this is negligible. For extended sessions (200+ experiments), the JSONL file may exceed what can fit in a single context window. In such cases, the agent must selectively read recent entries or aggregate statistics rather than loading the entire history (author analysis — this scaling behavior follows from the append-only design, not from explicit documentation of this specific scenario). The narrative autoresearch.md serves as a compressed summary that remains context-friendly regardless of session length.
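One way an agent-side helper might keep a long log within the context budget is to pass along aggregate counts plus only the most recent entries. The sketch below is a hypothetical strategy consistent with the scaling discussion above, not a documented feature; `digestForContext`, `ContextDigest`, and the `tailSize` parameter are illustrative.

```typescript
// Hypothetical digest of a long experiment log; not a documented feature.
interface DigestEntry {
  run: number;
  metric_value: number;
  status: string;
  description: string;
}

interface ContextDigest {
  totalRuns: number;
  statusCounts: { [status: string]: number };
  recent: DigestEntry[]; // the last tailSize entries, oldest first
}

function digestForContext(entries: DigestEntry[], tailSize: number): ContextDigest {
  const statusCounts: { [status: string]: number } = {};
  for (const entry of entries) {
    statusCounts[entry.status] = (statusCounts[entry.status] || 0) + 1;
  }
  // slice(-0) would return the whole array, so guard the zero case explicitly
  const recent = tailSize > 0 ? entries.slice(-tailSize) : [];
  return { totalRuns: entries.length, statusCounts, recent };
}
```

The digest grows with `tailSize` rather than with session length, at the cost of hiding older individual results that the narrative file would then need to summarize.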

45.5 Data Flow and Experiment Lifecycle

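A single experiment cycle proceeds as follows: the agent edits files in scope; run_experiment executes the benchmark command and parses METRIC lines from stdout; if the benchmark exits with code 0 and autoresearch.checks.sh exists, the checks run; log_experiment then appends the result to autoresearch.jsonl, updates the confidence score, and refreshes the UI; finally, the change is either kept as a commit or reverted with git reset.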

45.5.1 Experiment Status Taxonomy

Each experiment in the JSONL log carries one of five statuses (README-documented):

Table 45.7: Experiment status types (README-documented)
Status	Meaning	Enters MAD Pool?	Agent Action
`baseline`	Initial measurement, no code changes	Yes	Reference point for all improvements
`kept`	Metric improved, changes committed	Yes	Branch advances to new best state
`discarded`	Metric did not improve	Yes	Git reset, try a different approach
`crashed`	Benchmark command returned non-zero exit	No	Git reset, try a fundamentally different approach
`checks_failed`	Benchmark passed but correctness checks failed	Yes	Git reset, fix correctness issue in next attempt
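
The taxonomy in Table 45.7 can be expressed as a small decision function. This is a sketch under assumptions: the status strings and pool membership come from the table, but the function names, the order in which conditions are tested, and the treatment of a missing metric as a crash are hypothetical.

```python
from enum import Enum
from typing import Optional


class Status(str, Enum):
    BASELINE = "baseline"
    KEPT = "kept"
    DISCARDED = "discarded"
    CRASHED = "crashed"
    CHECKS_FAILED = "checks_failed"


def classify(exit_code: int, checks_passed: bool, metric: Optional[float],
             best_so_far: Optional[float], minimize: bool = True) -> Status:
    """Map one benchmark outcome to a status (assumed decision order)."""
    if exit_code != 0 or metric is None:
        return Status.CRASHED              # no usable measurement
    if best_so_far is None:
        return Status.BASELINE             # first measurement of the session
    if not checks_passed:
        return Status.CHECKS_FAILED        # metric exists but the change is rejected
    improved = metric < best_so_far if minimize else metric > best_so_far
    return Status.KEPT if improved else Status.DISCARDED


def enters_mad_pool(status: Status) -> bool:
    """Per Table 45.7, only crashed runs are excluded from the noise estimate."""
    return status is not Status.CRASHED
```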

45.6 Key Results and Empirical Evidence

Pi-autoresearch is infrastructure rather than a benchmark paper — the repository does not report specific experimental results with controlled baselines (README-documented). The README describes typical improvement ranges across canonical example domains. These are presented here as the repository's own characterization, not as independently verified benchmarks:

Table 45.8: Typical improvement ranges (README-described, not independently verified or reproduced by this survey)
Domain	Metric	Direction	README-Reported Typical Gains
Test speed	Wall-clock seconds	↓ minimize	10–50% reduction
Bundle size	KB	↓ minimize	5–30% reduction
LLM training	val_bpb	↓ minimize	5–15% reduction
Build speed	Wall-clock seconds	↓ minimize	10–40% reduction

These ranges should be interpreted as order-of-magnitude guidance rather than rigorous benchmarks. Actual results depend heavily on the specific project, the quality of the skill document, the LLM backend, and the starting optimization state of the codebase. No controlled baselines, ablations, number of runs, seeds, or variance information accompanies these claims in the README.

45.6.1 Illustrative Case Study: Test Speed Optimization

⚠ Hypothetical example

The following case study is a constructed illustration of typical session behavior, not a reproduction of a specific repository run or a validated benchmark result. It is designed to concretize the system's mechanics (experiment statuses, confidence evolution, JSONL schema, finalization) using plausible values consistent with the README-documented behavior. No real run trace from the repository is available for inclusion in this survey. Readers should treat the numeric values, file names, and session progression as pedagogical, not evidential.

Scenario: optimizing a Node.js project's vitest suite, starting at 45.2 seconds. This uses the same numeric data as the worked confidence example in §45.3.2, now contextualized with experiment descriptions.

Session Configuration (hypothetical)

Benchmark command: bash autoresearch.sh (runs pnpm test --run, emits METRIC total_test_seconds=...)
Primary metric: total_test_seconds (direction: minimize)
Checks: bash autoresearch.checks.sh (runs pnpm tsc --noEmit)
Config: { "maxIterations": 30 }
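
The benchmark script communicates through `METRIC name=number` lines on standard output. A minimal parser for that protocol might look like the sketch below. The accepted character set for metric names, the number grammar, and the function name `parse_metrics` are assumptions; the README documents only the line format.

```python
import re
from typing import Dict

# Assumed grammar: "METRIC", whitespace, an identifier-like name, "=", a decimal number.
_METRIC_LINE = re.compile(
    r"^METRIC\s+([A-Za-z_][\w.-]*)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)


def parse_metrics(stdout: str) -> Dict[str, float]:
    """Collect every well-formed METRIC line; later duplicates overwrite earlier ones."""
    metrics: Dict[str, float] = {}
    for line in stdout.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics
```

Lines that do not match, such as ordinary test-runner output, are ignored, so the benchmark command can print freely as long as it emits the metric line somewhere.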

Table 45.9: Hypothetical session trace — 8 experiments (constructed to illustrate documented behavior)
Run	Description	Metric (s)	Status	Confidence
1	Baseline measurement	45.2	baseline	—
2	Parallel vitest workers (4 cores)	39.8	kept	—
3	Replace regex validation with string ops	41.1	discarded	4.15
4	Cache expensive test fixtures	37.5	kept	4.28
5	Swap assertion library to faster alternative	38.2	discarded	4.81
6	Lazy-load test utilities	36.8	kept	4.67
7	Refactor test setup (syntax error)	—	crashed	4.67
8	Remove unused test imports	35.1	kept	6.31

Hypothetical session summary: 8 experiments; 4 kept, 2 discarded, 1 crashed, 1 baseline. Start: 45.2s → End: 35.1s (22.3% reduction). All confidence scores were in the green zone ($\geq 2.0$), which in this scenario is consistent with vitest runtime being relatively deterministic on an idle machine — the noise floor (MAD ≈ 1.6s) is well below the cumulative improvement (10.1s). Note that the confidence scores reflect session-level cumulative improvement, not per-experiment significance (see §45.3.2).
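
The confidence column of Table 45.9 can be recomputed from the metric column. The sketch below assumes the score is the best improvement over the baseline divided by the MAD of all non-crashed measurements, with no score reported until three measurements exist; Equation 45.4 remains the authoritative definition, and the function name and `min_samples` parameter are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple

# (metric_seconds, status) for runs 1-8 of Table 45.9; None marks the crashed run.
RUNS: List[Tuple[Optional[float], str]] = [
    (45.2, "baseline"), (39.8, "kept"), (41.1, "discarded"), (37.5, "kept"),
    (38.2, "discarded"), (36.8, "kept"), (None, "crashed"), (35.1, "kept"),
]


def confidence_after_each_run(runs: List[Tuple[Optional[float], str]],
                              min_samples: int = 3) -> List[Optional[float]]:
    """Assumed formula: (baseline - best) / MAD over non-crashed metrics (minimize)."""
    pool: List[float] = []
    scores: List[Optional[float]] = []
    for metric, status in runs:
        if status != "crashed" and metric is not None:
            pool.append(metric)                    # crashed runs never enter the pool
        if len(pool) < min_samples:
            scores.append(None)                    # too few samples for a noise estimate
            continue
        center = median(pool)
        mad = median(abs(x - center) for x in pool)
        best_gain = pool[0] - min(pool)            # pool[0] is the baseline measurement
        scores.append(best_gain / mad if mad > 0 else None)
    return scores


scores = confidence_after_each_run(RUNS)
```

Under these assumptions the unrounded scores agree with the table's confidence column to within rounding, and the crashed run 7 leaves the score unchanged because it adds nothing to the pool.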

Corresponding JSONL records (first and last, illustrating the documented schema):

{"run":1,"commit":"b1a2c3d","metric_name":"total_test_seconds","metric_value":45.2,"status":"baseline","description":"Baseline measurement","confidence":null,"timestamp":"2026-03-15T14:00:12.341Z","duration_ms":45834}
{"run":8,"commit":"e7f8a9b","metric_name":"total_test_seconds","metric_value":35.1,"status":"kept","description":"Remove unused test imports","confidence":6.31,"timestamp":"2026-03-15T14:18:47.123Z","duration_ms":35782}

45.6.2 Cost Model

The cost of a pi-autoresearch session has three components: LLM inference tokens, benchmark compute time, and human attention. The repository provides the following per-experiment token estimates (README-documented):

Table 45.10: Per-experiment token budget estimate (README-documented)
Phase	Estimated Tokens	Notes
Read context + session files	2,000–10,000 input	Depends on `autoresearch.md` size
Generate hypothesis + code change	1,000–5,000 output	Depends on change complexity
Interpret results + decide keep/revert	500–2,000 output	Includes metric reasoning
Per-experiment total	~3,500–17,000	~10,000 typical

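For a 50-experiment session using Claude Sonnet-class pricing (approximately $3/M input tokens, $15/M output tokens; prices as of early 2026, subject to change), and assuming about 6,000 input and 4,000 output tokens per experiment (the ~10,000-token typical case in Table 45.10):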

$$C_{\text{session}} = C_{\text{input}} + C_{\text{output}} = \frac{50 \times 6{,}000}{10^6} \times 3 + \frac{50 \times 4{,}000}{10^6} \times 15 \approx \$0.90 + \$3.00 = \$3.90 \tag{45.5}$$

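The repository estimates $4–$10 for a typical 50-experiment session, depending on change complexity and session file growth (README-documented). The $3.90 from Equation 45.5 sits just below that range because it holds per-experiment input at 6,000 tokens, whereas input grows as the session files lengthen. Cost control mechanisms include the maxIterations field in autoresearch.config.json (hard limit on experiments) and provider-side API spending caps.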
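
Equation 45.5 is simple enough to parameterize. The sketch below restates it as a function so that other token budgets or price points can be substituted; the function name and default prices are illustrative, with the defaults taken from the pricing assumption stated above.

```python
def session_cost_usd(experiments: int, input_tokens: int, output_tokens: int,
                     usd_per_m_input: float = 3.0,
                     usd_per_m_output: float = 15.0) -> float:
    """Token cost of a session, given per-experiment token counts (Equation 45.5)."""
    input_cost = experiments * input_tokens / 1_000_000 * usd_per_m_input
    output_cost = experiments * output_tokens / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


typical = session_cost_usd(50, input_tokens=6_000, output_tokens=4_000)
```

Output tokens dominate at these prices: in the typical case they account for $3.00 of the $3.90 even though they are the smaller share of the token count.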

45.7 Reproducibility

Pi-autoresearch is designed with reproducibility as a first-class concern at multiple levels (README-documented):

Experiment-level: Every kept experiment produces a git commit with a descriptive message including the metric improvement. The JSONL log records every experiment (kept and discarded) with timestamps, commit hashes, metric values, confidence scores, and descriptions. Any kept experiment can be re-measured by checking out its commit and re-running the benchmark command; discarded and crashed changes are reverted rather than committed, so the log entry is their only record.

Session-level: The dual-file persistence layer enables a fresh agent to reconstruct the complete session context and continue from where the previous session stopped.

Infrastructure-level: The extension is installed via pi install https://github.com/davebcn87/pi-autoresearch (README-documented). The benchmark script (autoresearch.sh) is an explicit, versioned shell script — not implicit agent behavior.

45.7.1 Limitations on Reproducibility

Three factors limit exact reproducibility:

LLM non-determinism: The agent's hypotheses and code changes depend on stochastic LLM generation. Two runs with identical starting conditions will generally produce different experiment sequences, even with the same model and temperature.
Environment sensitivity: Benchmark measurements depend on system load, hardware, and OS scheduling. The MAD-based confidence scoring mitigates this but does not eliminate it.
Skill authoring variance: The quality and specificity of the human-authored skill document significantly affect results. Vague skills produce scattered experiments; precise, well-scoped skills produce focused optimization.

Author analysis: The combination of these factors means that pi-autoresearch sessions are reproducible in infrastructure (same tools, same protocol) but not in trajectory (same sequence of experiments). This is an inherent property of LLM-agent-driven optimization, not a deficiency specific to this system.

45.7.2 Reproducibility Checklist

The following checklist summarizes what a reader would need to reproduce a pi-autoresearch session (README-documented unless noted):

Table 45.11: Reproducibility requirements
Requirement	Source	Notes
pi agent installed	README	Terminal-based AI coding agent that hosts the extension
Extension installed	README	`pi install https://github.com/davebcn87/pi-autoresearch`
LLM backend configured	README	Any LLM backend supported by pi (Claude, GPT, Gemini, local)
Benchmark script	README	`autoresearch.sh` with `METRIC name=number` output
Skill document	README	Domain-specific Markdown loaded via `autoresearch-create`
Git repository	README	Project must be in a git repo for commit/reset operations
Optional: checks script	README	`autoresearch.checks.sh` for correctness validation
Optional: config	README	`autoresearch.config.json` with `maxIterations`

45.8 Limitations and Discussion

45.8.1 Fundamental Limitations

Single-objective optimization. The system tracks a single primary metric per session (README-documented). Multi-objective optimization (e.g., simultaneously minimizing test runtime and bundle size) requires either separate sessions or a custom benchmark script that computes a composite score. This is a significant limitation compared to evolutionary systems like GEPA (Chapter 7), which natively support Pareto-based multi-objective optimization.
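
The composite-score workaround can be sketched as a weighted sum of baseline-normalized metrics, emitted through the same `METRIC name=number` protocol. The normalization scheme, the weights, and all names below are illustrative assumptions; the README does not prescribe a particular composite.

```python
from typing import Dict


def composite_score(values: Dict[str, float], baselines: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted mean of value/baseline ratios; 1.0 means 'same as baseline'."""
    total_weight = sum(weights.values())
    weighted = sum(weights[k] * values[k] / baselines[k] for k in weights)
    return weighted / total_weight


def metric_line(name: str, value: float) -> str:
    """Format a value for the benchmark script's stdout protocol."""
    return f"METRIC {name}={value:.4f}"


score = composite_score(
    values={"test_seconds": 36.0, "bundle_kb": 400.0},
    baselines={"test_seconds": 45.0, "bundle_kb": 500.0},
    weights={"test_seconds": 1.0, "bundle_kb": 1.0},
)
print(metric_line("composite", score))
```

A scalar composite hides trade-offs: a change that speeds up tests while inflating the bundle can still lower the score, which is exactly what a Pareto-based method would surface instead.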

Greedy search and local optima. The hill-climbing strategy cannot escape local optima that require temporary metric regressions. If the global optimum requires a refactoring step that temporarily increases test runtime before yielding large gains, pi-autoresearch will revert that intermediate step. Population-based approaches maintain diverse candidates that can explore such valleys. The system's mitigation — relying on the LLM to generate diverse hypotheses — is plausible but not formally guaranteed (author analysis).
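
A toy model makes the valley problem concrete. In the sketch below, one change only pays off after a prerequisite that temporarily regresses the metric; the change names, metric values, and the `requires` mechanism are invented for illustration and do not model the extension's internals.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Change(NamedTuple):
    name: str
    metric_if_applied: float          # seconds measured with this change applied
    requires: Optional[str] = None    # name of a change that must already be kept


def hill_climb(baseline: float, proposals: List[Change]) -> Tuple[float, Set[str]]:
    """Keep a change only if it beats the current best (toy model, minimize)."""
    best = baseline
    kept: Set[str] = set()
    for change in proposals:
        if change.requires is not None and change.requires not in kept:
            continue                  # prerequisite was reverted, so this step is unreachable
        if change.metric_if_applied < best:
            best = change.metric_if_applied
            kept.add(change.name)
        # otherwise the change is reverted and the search continues from `best`
    return best, kept


proposals = [
    Change("extract-shared-setup", 48.0),                           # temporary regression
    Change("reuse-setup-across-files", 30.0, "extract-shared-setup"),
    Change("parallel-workers", 39.8),
]
best, kept = hill_climb(45.2, proposals)
```

The search ends at 39.8 s and never reaches the 30.0 s state, because the first step of the two-step path is reverted as soon as it measures worse than the baseline.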

No cross-project learning. Each session starts from scratch. Insights gained from optimizing test speed in one project are not transferred to another. This contrasts with systems like AlphaEvolve (Chapter 4), which can leverage a shared program database across projects.

No meta-optimization. The system cannot learn to improve its own optimization strategy. The hill-climbing approach, confidence thresholds, and experiment structure are fixed.

Platform lock-in. The extension is specific to the pi agent. Users of other coding agents (Claude Code, Cursor, GitHub Copilot) cannot use pi-autoresearch without switching to pi.

45.8.2 Statistical Rigor Discussion

The MAD-based confidence scoring is a meaningful improvement over raw metric comparison, but it falls short of formal statistical testing (author analysis). As emphasized in §45.3.2, the confidence score is a session-level signal-to-noise heuristic, not a per-experiment significance test. Several additional concerns:

Heuristic thresholds: The confidence thresholds (2.0×, 1.0×) are heuristic. For a normal distribution, 2.0× MAD ≈ 1.35σ (since MAD ≈ 0.6745σ), corresponding roughly to a two-sided $p \approx 0.18$, much weaker than the conventional $p < 0.05$. However, benchmark noise distributions are typically heavy-tailed, where MAD's robustness may make this threshold more practically meaningful than the Gaussian approximation suggests.
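No multiple-comparison correction: After 50 experiments, the probability of finding at least one spurious result increases substantially. For illustration, if each comparison independently carried a 5% false-positive rate, the chance of at least one spurious result across 50 experiments would be $1 - 0.95^{50} \approx 0.92$. The system does not apply Bonferroni or false-discovery-rate corrections.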
Non-stationary distribution: The MAD computation treats all non-crashed experiments as samples from the same underlying distribution. In practice, kept experiments alter the codebase, potentially changing the noise characteristics of subsequent measurements. The pool thus mixes measurements from different code states.
Session-level, not per-change: As defined in Equation 45.4, the best improvement is always measured relative to the baseline $m_0$, not relative to the most recent best. This means the confidence score reflects total cumulative improvement. A high confidence score (green) does not validate any individual experiment — it indicates that the aggregate session improvement is large relative to the aggregate noise floor. A session could show green confidence even if some individual "kept" changes are within noise, provided the total improvement is large enough.
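
The Gaussian reading of the thresholds can be checked numerically. The sketch below converts a confidence score (in MAD units) to a two-sided tail probability under a normal-noise assumption; it is an interpretive aid only, since the score is a session-level heuristic rather than a test statistic, and the function name is hypothetical.

```python
from math import erf, sqrt

MAD_PER_SIGMA = 0.6745  # for a normal distribution, MAD ≈ 0.6745 σ


def two_sided_p_from_confidence(confidence: float) -> float:
    """Gaussian two-sided tail probability for a gap of `confidence` MADs."""
    z = confidence * MAD_PER_SIGMA              # convert MAD units to σ units
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF
    return 2.0 * (1.0 - cdf)


p_green = two_sided_p_from_confidence(2.0)   # upper (2.0x) threshold
p_lower = two_sided_p_from_confidence(1.0)   # lower (1.0x) threshold
```

At the 1.0× threshold the same conversion gives a tail probability of about 0.5, since one MAD is by definition the half-width that contains half of a symmetric distribution.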

Despite these limitations, the confidence scoring represents a practical advance in autoresearch tooling. A more rigorous approach might use paired permutation tests or bootstrap confidence intervals, but these would require multiple runs per experiment, significantly increasing cost and session duration.

45.8.3 Comparison with Evolutionary Systems

Table 45.12: Pi-autoresearch vs. evolutionary search systems (author analysis)
Dimension	Pi-autoresearch	AlphaEvolve / OpenEvolve / GEPA
Search strategy	Greedy hill-climbing	Population-based evolutionary search
LLM role	Autonomous agent (full control)	Mutation operator (system-directed)
Population size	1 (single candidate)	10–1,000+ candidates
Exploration/exploitation	LLM-driven (implicit)	Algorithmic (bandits, MAP-Elites, islands)
Domain specificity	Domain-agnostic via skills	Requires evaluation function + evolve blocks
Setup complexity	Low (skill document + benchmark script)	High (evaluation harness, search config, sandbox)
Optimality guarantees	None (greedy)	Weak (evolutionary convergence)

45.9 Continued Learning

45.9.1 Within-Session Learning

The agent learns within a session through three channels (author analysis of documented behavior). First, narrative learning: as the agent updates autoresearch.md with dead ends and key wins, it builds a compressed model of the optimization landscape. Second, statistical feedback: confidence scores provide quantitative guidance on session-level improvement reliability. Third, pattern recognition: the JSONL log provides a structured record enabling the agent to avoid repeating failed approaches.

45.9.2 Cross-Session Learning

Cross-session learning is supported through the persistent session files. A new agent instance inherits the full strategic context (narrative document) and tactical history (JSONL log) from prior sessions. However, this learning is confined to the project and branch level — there is no cross-project knowledge transfer and no mechanism for improving the system's optimization strategy itself based on accumulated experience.

Table 45.13: Learning capabilities comparison (author analysis)
Learning Type	Pi-autoresearch	Karpathy autoresearch	AlphaEvolve
Within-session	JSONL + Markdown	Context window only	Program database
Cross-session	Persistent files	`results.tsv` + git	Persistent database
Cross-project	No	No	Yes (shared infra)
Meta-learning	No	No	Partial

45.10 Implementation Details

45.10.1 Technology Stack

Pi-autoresearch is implemented as a pi extension using pi's extension and skill frameworks. The implementation stack (README-documented):

Table 45.14: Implementation technology stack (README-documented)
Layer	Language/Format	Purpose
Extension definition	JSON manifest	Declares tools, UI widgets, commands
Tool implementations	TypeScript (pi extension API)	`init_experiment`, `run_experiment`, `log_experiment`
UI components	pi widget framework	Status bar, dashboard, fullscreen overlay
Skill documents	Markdown	`autoresearch-create`, `autoresearch-finalize`
Benchmark scripts	Bash	`autoresearch.sh`, `autoresearch.checks.sh`
Session state	JSON Lines + Markdown	`autoresearch.jsonl` + `autoresearch.md`
Configuration	JSON	`autoresearch.config.json`

The repository documentation describes the system as following a "radical simplicity" design philosophy. The codebase appears small: the README and extension structure suggest a modest footprint of TypeScript source plus structured Markdown for skills. The survey author has not performed a line count, so no LOC figure is given here.

45.10.2 State Machine

The extension manages a session state machine with five states (README-documented): INACTIVE (no session), SETUP (configuring session via init_experiment), RUNNING (autonomous loop active), PAUSED (loop stopped, state preserved), and CLEARED (state deleted, returning to INACTIVE). Transitions are triggered by slash commands (/autoresearch, /autoresearch off, /autoresearch clear), keyboard interrupts, or reaching maxIterations.
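
The five states can be modeled as a table-driven state machine. The state names and trigger names below come from the description above, but the edge table itself (which trigger moves which state where, including the choice to pause on reaching maxIterations) is an assumption made for illustration, as are the event labels `init_experiment`, `keyboard_interrupt`, and `max_iterations`.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    INACTIVE = "inactive"
    SETUP = "setup"
    RUNNING = "running"
    PAUSED = "paused"
    CLEARED = "cleared"


# Hypothetical edge table: the README names the states and triggers,
# but which trigger causes which transition is assumed here.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.INACTIVE, "/autoresearch"): State.SETUP,
    (State.SETUP, "init_experiment"): State.RUNNING,
    (State.RUNNING, "/autoresearch off"): State.PAUSED,
    (State.RUNNING, "keyboard_interrupt"): State.PAUSED,
    (State.RUNNING, "max_iterations"): State.PAUSED,
    (State.PAUSED, "/autoresearch"): State.RUNNING,
    (State.RUNNING, "/autoresearch clear"): State.CLEARED,
    (State.PAUSED, "/autoresearch clear"): State.CLEARED,
}


def step(state: State, event: str) -> State:
    """Return the next state; unknown (state, event) pairs leave the state unchanged."""
    nxt = TRANSITIONS.get((state, event), state)
    # CLEARED is transient: once session state is deleted the machine is INACTIVE again.
    return State.INACTIVE if nxt is State.CLEARED else nxt
```

Keeping the edges in a dictionary makes the assumed transitions easy to audit against the repository and to correct where they differ.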

45.10.3 LLM Backend Independence

A deliberate design decision (README-documented): pi-autoresearch does not specify which LLM powers the pi agent. The extension works with whatever LLM backend the user has configured. This means the same experimental infrastructure works with Claude, GPT-4o/o3, Gemini, or open-weight models served locally. The LLM choice affects the quality of hypotheses the agent generates but not the experimental infrastructure itself.

45.11 Applications

45.11.1 Developer Workflow Optimization

The primary application domain is optimizing measurable aspects of software projects: test suite runtime, build duration, bundle size, and web performance scores. These are domains where (a) a clear numeric metric exists, (b) the metric can be measured by a shell command, (c) the search space of possible improvements is rich enough for an LLM agent to explore productively, and (d) the benchmark executes quickly enough for rapid iteration.

45.11.2 ML Research

Following the Karpathy paradigm, pi-autoresearch can be applied to ML training optimization: architecture search, hyperparameter tuning, data pipeline optimization, and training stability improvements. The key constraint is benchmark duration — very long training runs (hours per experiment) create impractically slow feedback loops for the greedy hill-climbing approach.

45.11.3 Boundaries of Applicability

The system cannot optimize metrics requiring human judgment (code readability, UX quality), highly non-deterministic metrics where even MAD-based confidence is insufficient, or multi-objective problems without manual composite scoring. It also requires automatable benchmarks with numeric output — qualitative evaluations are outside scope.

Summary

Key takeaway: Pi-autoresearch separates experimental loop mechanics from domain knowledge through an extension/skill architecture, and adds MAD-based statistical confidence scoring to estimate whether cumulative improvements exceed benchmark noise. The confidence score is a session-level signal-to-noise heuristic, not a per-experiment significance test.

Main contribution to the field: The extension/skill separation pattern demonstrates that the autonomous experiment loop can be treated as general-purpose infrastructure, decoupled from any specific optimization domain. Combined with structured logging, branch-aware finalization, and a practical (if heuristic) noise-floor estimator, this makes autonomous optimization accessible to any developer using the pi agent.

What a researcher should know: Pi-autoresearch uses greedy hill-climbing backed by git (not population-based evolutionary search), making it simpler and more interpretable than systems like AlphaEvolve or OpenEvolve, but inherently susceptible to local optima. Its confidence scoring via MAD is a practical heuristic — the 2.0× threshold corresponds roughly to 1.35σ under Gaussian assumptions, well below standard significance thresholds. The README reports typical improvement ranges (10–50% for test speed, 5–30% for bundle size) but these are not independently verified benchmarks. All code listings in this chapter are pseudocode reconstructions, not verbatim repository excerpts; readers should consult the repository directly for implementation details. All repository observations were made in April 2026.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}