Pi-Autoresearch
Part P07: Autonomous Research Systems
Provenance note. This chapter surveys the pi-autoresearch extension (repository: github.com/davebcn87/pi-autoresearch). Claims are sourced as follows: README-documented indicates information stated in the repository README or documentation files; manifest/code-derived indicates structure observed in the extension manifest and TypeScript source; author analysis indicates interpretive commentary by the survey author. All repository observations were made in April 2026. All code listings in this chapter are pseudocode — they are conceptual reconstructions of the documented behavior, not verbatim excerpts from the repository. Function names, interface shapes, and parameter names are illustrative; the actual TypeScript implementation may use different identifiers, signatures, and internal structure. Where specific identifiers (tool names, file names, commands, metric protocol) are used, these are drawn from the README documentation and are labeled accordingly.
45.1 Overview and Motivation
Pi-autoresearch is an open-source extension for the pi AI coding agent that instruments it with tools, workflows, and a terminal UI for running autonomous experiment loops. Released in March 2026 by David Vilalta under the MIT license (README-documented), it is inspired by Andrej Karpathy's autoresearch but diverges in architecture: where Karpathy's system is a monolithic script coupled to a single domain (neural network training via nanochat) and a specific agent (Claude Code), pi-autoresearch decouples the experimental loop infrastructure from domain knowledge, producing a general-purpose optimization extension that works with any command-line-measurable metric.
The core claim, stated in the repository README, is that the experimental loop — measure, judge, keep or revert, repeat — is infrastructure, not domain knowledge. Domain knowledge belongs in a separate, human-authored document called a "skill." This separation enables a single extension to serve different optimization domains: test speed, bundle size, LLM training loss, Lighthouse scores, or any other numeric metric producible by a shell command.
Key Contribution
Pi-autoresearch introduces an extension/skill architecture that separates domain-agnostic experimental loop infrastructure (measurement, version control, statistical confidence, dashboards) from domain-specific knowledge (what to optimize, how to measure, which files to modify). It adds MAD-based statistical confidence scoring — a session-level signal-to-noise heuristic that estimates whether cumulative improvements exceed the measurement noise floor. These contributions are compared here against Karpathy's autoresearch and SkyPilot autoresearch, the two most prominent prior systems identified in this survey; the comparison is bounded to those systems, not to all possible tooling.
45.1.1 Position in the Autoresearch Lineage
| System | Year | Architecture | Domain | Statistical Rigor | UI |
|---|---|---|---|---|---|
| Karpathy autoresearch | 2026 | Monolithic script | Neural network training | None | Terminal output |
| SkyPilot autoresearch | 2026 | Cloud orchestrator | Any cloud workload | None | Web dashboard |
| Pi-autoresearch | 2026 | Extension + skill | Any measurable metric | MAD confidence | Widget + dashboard + overlay |
45.1.2 LLM-as-Agent, Not LLM-as-Mutation-Operator
A critical architectural distinction separates pi-autoresearch from evolutionary systems like AlphaEvolve (Chapter 4) or FunSearch (Chapter 5). In those systems, the LLM serves as a mutation operator called by the system to propose code changes within a population-based search. In pi-autoresearch, the LLM is the agent — it has full autonomy to read files, understand context, form hypotheses, make changes, and decide what to try next. The extension merely provides tools that the agent invokes at its discretion (README-documented).
This design choice trades formal search-theoretic guarantees for practical flexibility: any LLM backend configured in pi can power the optimization loop, and the space of possible modifications is limited only by the LLM's coding ability rather than by a predefined mutation grammar.
45.2 Architecture
Pi-autoresearch follows a three-layer architecture that separates the pi runtime, the extension infrastructure, and domain-specific skills (README-documented; the characterization of this as the system's primary structural contribution is author analysis).
45.2.1 Implementation Map
The following table summarizes the concrete components identified from the repository's README documentation and extension structure. Each row indicates the evidence tier: whether the component is explicitly described in the README, inferred from the pi extension framework conventions, or observable in the repository's file structure. Readers wishing to verify these claims should inspect the repository directly; the survey author's observations are dated April 2026.
| Component | Identifier / Location | Evidence Tier | Notes |
|---|---|---|---|
| Extension name | autoresearch | README-documented | Registered via pi install https://github.com/davebcn87/pi-autoresearch |
| Tool: session init | init_experiment | README-documented | Parameters: name, metric_name, metric_unit, direction |
| Tool: benchmark exec | run_experiment | README-documented | Parameter: command (shell command string) |
| Tool: result logging | log_experiment | README-documented | Parameters: metric_value, description, status |
| Skill: session creation | autoresearch-create | README-documented | Markdown skill consumed by LLM as context |
| Skill: branch finalization | autoresearch-finalize | README-documented | Markdown skill consumed by LLM as context |
| Command | /autoresearch | README-documented | Subcommands: <context>, off, clear |
| Session narrative | autoresearch.md | README-documented | Generated in project working directory at session start |
| Structured log | autoresearch.jsonl | README-documented | Append-only, one JSON object per line per experiment |
| Benchmark script | autoresearch.sh | README-documented | User-authored; emits METRIC name=number on stdout |
| Correctness checks | autoresearch.checks.sh | README-documented | Optional; runs after benchmark passes |
| Session config | autoresearch.config.json | README-documented | Optional; fields include maxIterations |
| UI: status widget | (pi widget framework) | README-documented | Persistent bar: run count, keep count, best metric, confidence |
| UI: dashboard | (pi widget framework) | README-documented | Expandable results table toggled via keyboard shortcut |
| UI: overlay | (pi widget framework) | README-documented | Fullscreen scrollable view with live spinner |
| Extension manifest | JSON manifest file | Framework-inferred | Pi extensions use JSON manifests; exact path not independently verified |
| Tool handler source | TypeScript source files | Framework-inferred | Pi extensions are implemented in TypeScript; exact file paths not independently verified |
| Metric protocol | METRIC name=number (stdout) | README-documented | Regex pattern on stdout; language-agnostic |
| Experiment statuses | kept, discarded, crashed, checks_failed, baseline | README-documented | Recorded in JSONL log per experiment |
Implementation gap. The survey author has not performed a line-by-line audit of the TypeScript source. Exact file paths within the repository (e.g., the location of the extension manifest JSON, the directory structure of tool handler modules, the UI widget registration code) are not independently verified. The identifiers listed above (tool names, skill names, file names, status values, metric protocol) are drawn from README documentation and are consistent with observed extension behavior, but the internal implementation may differ in structure from the pseudocode presented in this chapter.
45.2.2 Extension Manifest and Tool Registration
The extension declares its tools, commands, and skills through a JSON manifest. Pi extensions use manifest-driven registration so that the pi agent discovers available tools automatically when the extension is installed (README-documented: pi install https://github.com/davebcn87/pi-autoresearch). The following is a pseudocode reconstruction of the manifest structure, based on the README-documented tool names, parameters, and skills. The actual manifest file may use different field names, nesting, or additional fields not described here:
// PSEUDOCODE — conceptual manifest structure
// Reconstructed from README-documented tool names and parameters
// Actual manifest format and field names may differ
{
"name": "autoresearch",
"tools": [
{
"name": "init_experiment",
"parameters": {
"name": { "type": "string" },
"metric_name": { "type": "string" },
"metric_unit": { "type": "string" },
"direction": { "type": "string", "enum": ["minimize", "maximize"] }
}
},
{
"name": "run_experiment",
"parameters": {
"command": { "type": "string" }
}
},
{
"name": "log_experiment",
"parameters": {
"metric_value": { "type": "number" },
"description": { "type": "string" },
"status": { "type": "string",
"enum": ["kept", "discarded", "crashed", "checks_failed", "baseline"] }
}
}
],
"skills": ["autoresearch-create", "autoresearch-finalize"],
"commands": [
{ "name": "autoresearch" }
]
}
| Tool | Lifecycle | Function |
|---|---|---|
| init_experiment | Once per session | Configures session: experiment name, primary metric name, unit, optimization direction (minimize/maximize) |
| run_experiment | Per experiment | Executes benchmark command, measures wall-clock duration, captures stdout/stderr, parses METRIC lines |
| log_experiment | Per experiment | Records result to autoresearch.jsonl, auto-commits, computes confidence score (if ≥3 non-crashed runs), updates UI |
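The documented parameters of init_experiment can be written down as a type with a small validator. The sketch below is illustrative only: InitExperimentParams, ParseResult, and parseInitParams are hypothetical names, and the non-empty-string rule is an assumption, since the README documents only the parameter names and the two direction values.

```typescript
// Hypothetical shapes mirroring the README-documented init_experiment parameters.
// The real extension may define and validate these differently.
type Direction = "minimize" | "maximize";

interface InitExperimentParams {
  name: string;
  metric_name: string;
  metric_unit: string;
  direction: Direction;
}

type ParseResult =
  | { ok: true; params: InitExperimentParams }
  | { ok: false; problems: string[] };

function parseInitParams(input: Record<string, unknown>): ParseResult {
  const problems: string[] = [];
  // Assumption: every text parameter must be a non-empty string.
  const requireText = (key: string): string => {
    const value = input[key];
    if (typeof value === "string" && value.length > 0) return value;
    problems.push(key + " must be a non-empty string");
    return "";
  };
  const name = requireText("name");
  const metric_name = requireText("metric_name");
  const metric_unit = requireText("metric_unit");

  const raw = input["direction"];
  let direction: Direction | null = null;
  if (raw === "minimize") direction = "minimize";
  else if (raw === "maximize") direction = "maximize";
  if (direction === null) {
    problems.push('direction must be "minimize" or "maximize"');
  }

  if (direction === null || problems.length > 0) return { ok: false, problems };
  return { ok: true, params: { name, metric_name, metric_unit, direction } };
}

const parsed = parseInitParams({
  name: "vitest-speed",
  metric_name: "total_test_seconds",
  metric_unit: "s",
  direction: "minimize",
});
console.log(parsed.ok); // true
```

Collecting all problems before returning lets a caller report every invalid parameter in one message.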
The UI comprises three progressive disclosure levels (README-documented): a persistent status widget (run count, keep count, best metric, improvement percentage, confidence score), an expandable dashboard toggled via keyboard shortcut (full results table with commits, metrics, status, descriptions), and a fullscreen overlay (scrollable terminal-wide view with a live spinner during experiment execution).
45.2.3 Skill Components
Skills are Markdown documents consumed by the LLM as context. They encode domain-specific knowledge: what to optimize, how to measure it, which files are in scope, and what strategies to consider. Two skills ship with the extension (README-documented):
- autoresearch-create: Session initialization — gathers goal, command, metric, scope from the user (or infers from project context); writes session files; runs a baseline measurement; starts the autonomous loop.
- autoresearch-finalize: Branch finalization — reads the experiment log, groups kept experiments into logically independent branches, proposes the grouping for human approval, then creates branches from the merge-base. Groups must not share files, ensuring each resulting branch can be reviewed and merged independently.
45.2.4 Prompt Architecture
The LLM agent operates with a five-layer prompt architecture (README-documented):
- Layer 1: pi system prompt (agent capabilities, tool definitions)
- Layer 2: Extension tool descriptions (init_experiment, run_experiment, log_experiment)
- Layer 3: Skill document (domain-specific instructions, loaded at session start)
- Layer 4: Session context (autoresearch.md: accumulated narrative of what has been tried)
- Layer 5: Real-time state (widget data, recent experiment results)
This layering ensures the agent always has access to what it can do (tools), what it should optimize (skill), what has already been tried (session history), and how well it is doing (confidence scores, metric trajectory). A fresh agent instance after a context reset can reconstruct all of this from the persistent session files.
45.2.5 Metric Protocol
The benchmark script (autoresearch.sh) communicates metrics to the extension via a deliberately minimal stdout protocol (README-documented): any line matching METRIC name=number is captured. The following pseudocode illustrates the conceptual parsing logic. The actual TypeScript implementation may use different variable names, error handling, or regex patterns:
// PSEUDOCODE — conceptual metric parsing logic
// Illustrates the METRIC protocol documented in the README
// Actual implementation identifiers and structure may differ
// README-documented protocol: lines matching "METRIC name=number"
const METRIC_PATTERN = /^METRIC\s+(\w+)=([\d.eE+-]+)$/gm;
function parseMetrics(stdout: string): Array<{name: string, value: number}> {
const results = [];
let match;
while ((match = METRIC_PATTERN.exec(stdout)) !== null) {
results.push({ name: match[1], value: parseFloat(match[2]) });
}
return results;
}
// Example benchmark script output:
// $ bash autoresearch.sh
// Running vitest...
// Tests: 142 passed, 0 failed
// METRIC total_test_seconds=38.7
// METRIC peak_memory_mb=512
//
// Conceptual result: [
// { name: "total_test_seconds", value: 38.7 },
// { name: "peak_memory_mb", value: 512 }
// ]
This protocol is language-agnostic: any build tool, test runner, or training script can produce METRIC lines. The extension only looks for lines matching the pattern and extracts the numeric value. The primary metric (specified in init_experiment) is used for keep/revert decisions; additional metrics are recorded in the JSONL log for post-hoc analysis.
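The split between the primary metric and additional metrics can be sketched as a small post-processing step over the parsed METRIC lines. Metric, SplitMetrics, and splitMetrics are hypothetical names, and the rule that the last occurrence of a repeated metric name wins is an assumption that the README text quoted here does not settle.

```typescript
// Hypothetical helper: separate the primary metric from the additional metrics.
interface Metric {
  name: string;
  value: number;
}

interface SplitMetrics {
  primary: number | null; // null when the benchmark never emitted the primary metric
  secondary: Record<string, number>; // recorded for post-hoc analysis only
}

function splitMetrics(metrics: Metric[], primaryName: string): SplitMetrics {
  const result: SplitMetrics = { primary: null, secondary: {} };
  for (const metric of metrics) {
    if (metric.name === primaryName) {
      result.primary = metric.value; // assumption: the last occurrence wins
    } else {
      result.secondary[metric.name] = metric.value;
    }
  }
  return result;
}

const split = splitMetrics(
  [
    { name: "total_test_seconds", value: 38.7 },
    { name: "peak_memory_mb", value: 512 },
  ],
  "total_test_seconds",
);
console.log(split.primary); // 38.7
console.log(split.secondary["peak_memory_mb"]); // 512
```

Returning null for a missing primary metric keeps "the benchmark ran but emitted nothing usable" distinct from a genuine measurement of zero.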
45.3 Core Algorithms
45.3.1 Greedy Hill-Climbing with Git-Backed State
The core optimization strategy is greedy hill-climbing, with git as the state-management substrate (README-documented). An experiment that improves the primary metric is kept and the branch advances to its commit; an experiment that fails to improve is undone with a git reset to the previous best state. Because only improvements are kept, the trajectory of kept commits is monotonically improving in the measured metric: each kept commit beats every earlier kept state.
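Formally, let $m_t$ denote the primary metric value at experiment $t$, and let $d \in \{-1, +1\}$ encode the optimization direction ($d = -1$ for minimization, $d = +1$ for maximization). The keep/revert decision is:
$$\text{decision}_t = \begin{cases} \text{keep} & \text{if } d \cdot m_t > d \cdot m^*_{t-1} \\ \text{revert} & \text{otherwise} \end{cases} \tag{45.1}$$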
where $m^*_{t-1}$ is the best metric value observed prior to experiment $t$, defined recursively: $m^*_0 = m_0$ (baseline) and $m^*_t = m_t$ if kept, $m^*_t = m^*_{t-1}$ if reverted. After a revert, the code state is restored via git reset --hard. The comparison $d \cdot m_t > d \cdot m^*_{t-1}$ unifies minimization and maximization: for minimization ($d = -1$), this becomes $m_t < m^*_{t-1}$.
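The comparison $d \cdot m_t > d \cdot m^*_{t-1}$ and the recursive definition of $m^*_t$ can be replayed over a metric sequence. This is a minimal sketch: directionSign, shouldKeep, and replay are hypothetical names, the git side effects are omitted, and crashes and checks failures are ignored.

```typescript
type Direction = "minimize" | "maximize";

// d = -1 for minimization, +1 for maximization, as in the text.
function directionSign(direction: Direction): 1 | -1 {
  return direction === "maximize" ? 1 : -1;
}

// Keep only on strict improvement: d * m_t > d * m*_{t-1}.
function shouldKeep(direction: Direction, candidate: number, bestSoFar: number): boolean {
  const d = directionSign(direction);
  return d * candidate > d * bestSoFar;
}

// Replays a metric sequence and returns the decision made for each run.
function replay(direction: Direction, metrics: number[]): string[] {
  const decisions: string[] = [];
  let best: number | null = null;
  for (const m of metrics) {
    if (best === null) {
      best = m; // m*_0 = m_0 (baseline)
      decisions.push("baseline");
    } else if (shouldKeep(direction, m, best)) {
      best = m; // m*_t = m_t when kept
      decisions.push("kept");
    } else {
      decisions.push("discarded"); // m*_t = m*_{t-1} when reverted
    }
  }
  return decisions;
}

console.log(replay("minimize", [45.2, 39.8, 41.1, 37.5, 38.2]).join(", "));
// baseline, kept, discarded, kept, discarded
```

Because the inequality is strict, a run that merely ties the current best is discarded.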
The following pseudocode illustrates the conceptual experiment-cycle logic. This is not extracted from the repository; actual handler functions, parameter passing, and internal state management may differ substantially:
// PSEUDOCODE — conceptual keep/revert cycle
// Illustrates the documented greedy hill-climbing behavior
// Actual implementation structure, identifiers, and logic may differ
// Conceptual experiment record structure (fields from README-documented JSONL schema)
interface ExperimentRecord {
run: number;
commit: string;
metric_name: string;
metric_value: number;
status: "baseline" | "kept" | "discarded" | "crashed" | "checks_failed";
description: string;
confidence: number | null;
timestamp: string;
duration_ms: number;
}
// Conceptual logging flow (README-documented behavior)
async function handleLogExperiment(
metricValue: number,
description: string,
status: ExperimentRecord["status"],
session: SessionState
): Promise<ExperimentRecord> {
const record: ExperimentRecord = {
run: session.runCount + 1,
commit: await getCurrentCommitHash(),
metric_name: session.config.metricName,
metric_value: metricValue,
status,
description,
confidence: computeConfidence(session.entries, metricValue, session.config.direction),
timestamp: new Date().toISOString(),
duration_ms: session.lastBenchmarkDuration,
};
// Append to JSONL log (append-only, one JSON object per line)
await appendToJsonl(session.jsonlPath, record);
// Update UI widgets with latest state
updateWidget(session, record);
session.entries.push(record);
return record;
}
The greedy approach is a deliberate design choice (author analysis). Unlike population-based methods (AlphaEvolve, OpenEvolve, GEPA), pi-autoresearch maintains a single candidate at any time, producing a linear git history that humans can easily audit.
| Property | Greedy (pi-autoresearch) | Population-Based (AlphaEvolve, etc.) |
|---|---|---|
| Convergence speed | Fast for easy gains | Slower but avoids local optima |
| Implementation complexity | Low (git commit/reset) | High (population management, archives) |
| Memory requirements | $O(1)$ — current state only | $O(N)$ — population of $N$ candidates |
| Risk of local optima | High | Lower |
| Human interpretability | Very high (linear history) | Low (complex population dynamics) |
45.3.2 Confidence Scoring via Median Absolute Deviation
Important: scope of the confidence score
The confidence score described below is a session-level signal-to-noise heuristic. It estimates whether the session's cumulative best improvement exceeds the overall measurement noise floor (as estimated by MAD across all non-crashed experiments). It is not a per-experiment significance test — a green confidence score does not validate the statistical significance of any individual kept change. The score cannot distinguish whether the cumulative improvement is driven by one large genuine gain or by several small changes that individually fall within noise. Readers should not infer statistical validity of individual kept changes from a green score alone. The thresholds (2.0×, 1.0×) are heuristic and do not correspond to standard $p$-value thresholds or formal confidence intervals.
The confidence scoring uses Median Absolute Deviation (MAD) as a robust noise-floor estimator (README-documented). This addresses a practical weakness: raw metric comparisons cannot distinguish real improvements from benchmark jitter caused by garbage-collection pauses, CPU thermal throttling, I/O contention, or inherent stochasticity in ML training.
Definitions and Scope
The confidence computation operates over a sample pool defined as follows:
- Baseline: The metric value $m_0$ obtained from the first run_experiment invocation before any code changes. This serves as the reference point for measuring improvement.
- Sample pool: All metric values from experiments with status $\notin \{\texttt{crashed}\}$. Specifically, experiments with status baseline, kept, discarded, and checks_failed all contribute their metric values to the pool. Crashed experiments are excluded because they produce no valid metric value (the benchmark command returned a non-zero exit code).
- Minimum sample size: Confidence is computed only when the sample pool contains $n \geq 3$ values (the minimum for a meaningful MAD estimate).
MAD is defined as:
$$\text{MAD} = \text{median}_{1 \leq i \leq n}\left(\,|x_i - \tilde{x}|\,\right) \tag{45.2}$$
where $\{x_1, \ldots, x_n\}$ is the sample pool and $\tilde{x} = \text{median}(x_1, \ldots, x_n)$ is its median. Unlike standard deviation $\sigma$, which is sensitive to outliers (a single GC spike can inflate $\sigma$ by an order of magnitude), MAD is robust because the median operation discards extreme values.
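The confidence score is the ratio of the best direction-aware improvement over baseline to the noise floor:
$$\text{confidence} = \frac{\Delta m_{\text{best}}}{\text{MAD}} \tag{45.3}$$
where the best improvement $\Delta m_{\text{best}}$ is computed with respect to the baseline and optimization direction:
$$\Delta m_{\text{best}} = \max_{1 \leq i \leq n} \; d \cdot (x_i - m_0) \tag{45.4}$$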
Here $d \in \{-1, +1\}$ encodes direction ($d = -1$ for minimization, $d = +1$ for maximization) and $m_0 = x_1$ is the baseline value. For minimization, $d \cdot (x_i - m_0) = m_0 - x_i$, so the best improvement is the largest reduction from baseline. If no experiment improves over baseline ($\Delta m_{\text{best}} \leq 0$), the confidence is reported as $0$.
The interpretation thresholds (README-documented):
| Confidence | Signal | Color | Agent Guidance |
|---|---|---|---|
| $\geq 2.0$ | Strong | Green | Session's cumulative improvement likely exceeds noise. Keep and continue. |
| $[1.0, 2.0)$ | Marginal | Yellow | Above noise but uncertain. Re-run recommended. |
| $< 1.0$ | Noise | Red | Within noise floor. Consider reverting. |
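The threshold table maps directly onto a small classifier. In this sketch, Signal and classifyConfidence are hypothetical names; the boundary handling (exactly 2.0 is strong, exactly 1.0 is marginal) follows the table rows as written, and a null score (fewer than three samples) is left to the caller.

```typescript
type Signal = "strong" | "marginal" | "noise";

// Maps a confidence ratio onto the README-documented bands.
function classifyConfidence(score: number): Signal {
  if (score >= 2.0) return "strong"; // green
  if (score >= 1.0) return "marginal"; // yellow
  return "noise"; // red
}

console.log(classifyConfidence(4.15)); // strong
console.log(classifyConfidence(1.4)); // marginal
console.log(classifyConfidence(0.3)); // noise
```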
Pseudocode Implementation
The following pseudocode illustrates the conceptual confidence computation, consistent with the mathematical definitions above. This is not an excerpt from the repository — the actual TypeScript implementation may use different variable names, control flow, and edge-case handling:
// PSEUDOCODE — conceptual confidence computation
// Illustrates Equations 45.2–45.4 from the text
// Actual implementation identifiers and logic may differ
function computeConfidence(
priorEntries: ExperimentRecord[],
currentValue: number,
direction: "minimize" | "maximize"
): number | null {
// Collect all non-crashed metric values (sample pool)
const values = priorEntries
.filter(e => e.status !== "crashed")
.map(e => e.metric_value)
.concat(currentValue);
if (values.length < 3) return null; // Minimum sample size
// Compute median (Eq. 45.2)
const sorted = [...values].sort((a, b) => a - b);
const mid = Math.floor(sorted.length / 2);
const median = sorted.length % 2 === 0
? (sorted[mid - 1] + sorted[mid]) / 2
: sorted[mid];
// Compute MAD = median of |x_i - median| (Eq. 45.2)
const deviations = values.map(v => Math.abs(v - median));
const sortedDevs = [...deviations].sort((a, b) => a - b);
const devMid = Math.floor(sortedDevs.length / 2);
const mad = sortedDevs.length % 2 === 0
? (sortedDevs[devMid - 1] + sortedDevs[devMid]) / 2
: sortedDevs[devMid];
// Best improvement from baseline, direction-aware (Eq. 45.4)
const baseline = values[0]; // First entry is always baseline
const sign = direction === "maximize" ? 1 : -1;
const bestImprovement = Math.max(
...values.map(v => sign * (v - baseline))
);
// No improvement over baseline: confidence is 0 regardless of the noise estimate
if (bestImprovement <= 0) return 0;
// MAD of 0 means at least half the pool equals the median, so no noise
// was measured; any positive improvement is unbounded relative to it
if (mad === 0) return Infinity;
// Session-level confidence ratio (Eq. 45.3)
return bestImprovement / mad;
}
Worked Numeric Example
Consider a test-speed optimization session (direction: minimize, unit: seconds). The following table traces eight experiments, showing how the sample pool, MAD, and confidence evolve:
| Run | Metric (s) | Status | Best So Far | Sample Pool | MAD | $\Delta m_{\text{best}}$ | Confidence |
|---|---|---|---|---|---|---|---|
| 1 | 45.2 | baseline | 45.2 | [45.2] | — | — | — |
| 2 | 39.8 | kept | 39.8 | [45.2, 39.8] | — | — | — |
| 3 | 41.1 | discarded | 39.8 | [45.2, 39.8, 41.1] | 1.3 | 5.4 | 4.15 |
| 4 | 37.5 | kept | 37.5 | [45.2, 39.8, 41.1, 37.5] | 1.80 | 7.7 | 4.28 |
| 5 | 38.2 | discarded | 37.5 | [45.2, 39.8, 41.1, 37.5, 38.2] | 1.6 | 7.7 | 4.81 |
| 6 | 36.8 | kept | 36.8 | [45.2, 39.8, 41.1, 37.5, 38.2, 36.8] | 1.80 | 8.4 | 4.67 |
| 7 | — | crashed | 36.8 | (unchanged, crashed excluded) | 1.80 | 8.4 | 4.67 |
| 8 | 35.1 | kept | 35.1 | [45.2, 39.8, 41.1, 37.5, 38.2, 36.8, 35.1] | 1.6 | 10.1 | 6.31 |
Detailed computation for run 3 (first confidence-eligible run, $n = 3$):
- Sample pool: $\{45.2, 39.8, 41.1\}$. Sorted: $[39.8, 41.1, 45.2]$. Median $\tilde{x} = 41.1$.
- Absolute deviations: $|45.2 - 41.1| = 4.1$, $|39.8 - 41.1| = 1.3$, $|41.1 - 41.1| = 0$. Sorted: $[0, 1.3, 4.1]$. MAD $= 1.3$.
- Best improvement from baseline ($d = -1$): $\Delta m_{\text{best}} = \max(45.2 - 45.2, \; 45.2 - 39.8, \; 45.2 - 41.1) = 5.4$.
- Confidence $= 5.4 / 1.3 = 4.15$ — strong green signal ($\geq 2.0$). This means the session's cumulative best improvement is 4.15× the noise floor. It does not mean each individual kept experiment is statistically validated.
Detailed computation for run 4 ($n = 4$):
- Sample pool: $\{45.2, 39.8, 41.1, 37.5\}$. Sorted: $[37.5, 39.8, 41.1, 45.2]$. Median $= (39.8 + 41.1)/2 = 40.45$.
- Deviations: $|45.2 - 40.45| = 4.75$, $|39.8 - 40.45| = 0.65$, $|41.1 - 40.45| = 0.65$, $|37.5 - 40.45| = 2.95$. Sorted: $[0.65, 0.65, 2.95, 4.75]$. MAD $= (0.65 + 2.95)/2 = 1.80$.
- $\Delta m_{\text{best}} = 45.2 - 37.5 = 7.7$. Confidence $= 7.7 / 1.80 = 4.28$.
Note that the crashed experiment (run 7) is excluded from the sample pool but the session continues with the same best-so-far value. The discarded experiments (runs 3 and 5) do contribute to the MAD computation — their metric values represent valid noise measurements even though the code changes were reverted.
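The table can be checked by replaying the sample pool one value at a time. The following is a self-contained sketch rather than repository code: median, mad, and sessionConfidence are hypothetical names, the pool already omits the crashed run 7, and the direction is fixed to minimization (sign = -1).

```typescript
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Median absolute deviation around the pool median.
function mad(values: number[]): number {
  const center = median(values);
  return median(values.map((v) => Math.abs(v - center)));
}

// pool[0] is the baseline; sign is -1 for minimization, +1 for maximization.
function sessionConfidence(pool: number[], sign: 1 | -1): number | null {
  if (pool.length < 3) return null; // minimum sample size
  const best = Math.max(...pool.map((v) => sign * (v - pool[0])));
  if (best <= 0) return 0; // nothing beat the baseline
  const noise = mad(pool);
  return noise === 0 ? Infinity : best / noise;
}

// Non-crashed metric values from runs 1-6 and 8 of the worked example.
const pool = [45.2, 39.8, 41.1, 37.5, 38.2, 36.8, 35.1];
for (let n = 1; n <= pool.length; n++) {
  const score = sessionConfidence(pool.slice(0, n), -1);
  console.log(n, score === null ? "n/a" : score.toFixed(4));
}
```

The printed ratios should agree with the Confidence column to within rounding. The sixth value lowers the score relative to the fifth even though it is a kept improvement, because it raises MAD from 1.6 to 1.8.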
Statistical Properties and Limitations
The choice of MAD over standard deviation is well-grounded in robust statistics. For a normal distribution, $\text{MAD} \approx 0.6745 \cdot \sigma$, so a confidence threshold of 2.0× MAD corresponds to roughly $1.35\sigma$ — a modest but meaningful signal. However, benchmark measurement distributions are typically heavy-tailed rather than normal, which is precisely where MAD's outlier robustness provides the most value.
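The constant $0.6745$ and the $1.35\sigma$ figure follow from a short derivation, sketched here under the stated normality assumption (which, as noted above, benchmark data often violates). $\Phi$ denotes the standard normal cumulative distribution function.

```latex
% Assumes X ~ N(mu, sigma^2); the population MAD is the c with P(|X - mu| <= c) = 1/2
\[
\begin{aligned}
P(|X - \mu| \le c) = 2\,\Phi\!\left(\tfrac{c}{\sigma}\right) - 1 = \tfrac{1}{2}
  &\iff \Phi\!\left(\tfrac{c}{\sigma}\right) = \tfrac{3}{4} \\
  &\iff c = \sigma\,\Phi^{-1}\!\left(\tfrac{3}{4}\right) \approx 0.6745\,\sigma, \\
2.0 \cdot \text{MAD} \approx 1.349\,\sigma, &\qquad
1.0 \cdot \text{MAD} \approx 0.6745\,\sigma .
\end{aligned}
\]
```

Since the numerator of the score is a maximum over runs rather than a difference of means, these multiples of $\sigma$ indicate scale only and should not be read as z-scores.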
Key design decisions (README-documented unless otherwise noted):
- Confidence is computed only after $\geq 3$ non-crashed experiments (minimum sample size for a meaningful MAD estimate).
- All non-crashed experiments contribute to the MAD pool, regardless of status (kept, discarded, checks_failed, baseline). This is important: discarded runs are valid noise measurements.
- The confidence score is advisory only — it never auto-discards experiments. The agent is guided but not constrained.
- Confidence values are persisted to autoresearch.jsonl for post-hoc analysis.
- Author analysis: The numerator of the score (the cumulative best improvement over baseline) never decreases, so the score falls only when later experiments widen the MAD noise estimate, as between runs 5 and 6 of the worked example (4.81 to 4.67). A session that achieved a single large improvement early will show a high confidence score for all subsequent experiments, regardless of whether later changes individually contribute. This is by design: the score tracks session health, not per-experiment validity.
45.3.3 Backpressure Checks
The optional autoresearch.checks.sh mechanism (README-documented) provides a correctness safety valve during autonomous optimization. After each benchmark that passes (exit code 0), the system can run a separate checks script containing tests, type checks, or linting. If the checks fail, the experiment is logged with status checks_failed (distinct from crashed) and the changes are reverted. This prevents a common failure mode in long autonomous runs: optimizations that improve the target metric while silently breaking other functionality.
Importantly, the checks execution time does not affect the primary metric measurement — the benchmark and checks are executed sequentially, with the metric captured from the benchmark output only. The checks_failed status allows post-hoc analysis of how frequently the agent proposes correctness-breaking optimizations.
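The interaction between the benchmark exit code, the optional checks script, and the keep/revert comparison can be summarized as a status classifier. Outcome and classifyOutcome are hypothetical names, and the precedence shown (crash first, then checks, then the metric comparison) is an assumption inferred from the description above rather than taken from the source.

```typescript
type Status = "kept" | "discarded" | "crashed" | "checks_failed";

interface Outcome {
  benchmarkExitCode: number;
  checksExitCode: number | null; // null when no autoresearch.checks.sh is configured
  improved: boolean; // result of the keep/revert comparison on the primary metric
}

function classifyOutcome(outcome: Outcome): Status {
  // A failing benchmark yields no valid metric, so nothing else is consulted.
  if (outcome.benchmarkExitCode !== 0) return "crashed";
  // Checks run only after a passing benchmark; a failure forces a revert.
  if (outcome.checksExitCode !== null && outcome.checksExitCode !== 0) {
    return "checks_failed";
  }
  return outcome.improved ? "kept" : "discarded";
}

console.log(classifyOutcome({ benchmarkExitCode: 0, checksExitCode: 1, improved: true }));
// checks_failed
```

Under this ordering a change that improves the metric but fails its checks is never kept, which is the failure mode the mechanism is meant to prevent.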
45.3.4 Branch-Aware Finalization
The autoresearch-finalize skill (README-documented) addresses the "messy experiment branch" problem — after dozens of experiments, the working branch contains an interleaved sequence of kept improvements that may span unrelated concerns. The finalization process groups kept experiments into logically coherent changesets and creates independent branches from the merge-base.
The key constraint is that groups must not share files. This ensures each resulting branch can be reviewed and merged independently without conflict resolution. The grouping is proposed by the LLM agent (reading the experiment log and diff metadata) and approved by the human researcher before branches are created.
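The no-shared-files constraint is easy to verify mechanically before any branch is created. The sketch below assumes each proposed group is represented as a name plus a list of touched file paths; ChangeGroup and findSharedFiles are hypothetical names, and the group names and paths in the usage example are invented for illustration.

```typescript
interface ChangeGroup {
  name: string;
  files: string[];
}

// Returns every file claimed by more than one group, mapped to the claiming groups.
function findSharedFiles(groups: ChangeGroup[]): Map<string, string[]> {
  const owners = new Map<string, string[]>();
  for (const group of groups) {
    for (const file of group.files) {
      const names = owners.get(file) ?? [];
      if (names.indexOf(group.name) === -1) names.push(group.name);
      owners.set(file, names);
    }
  }
  const shared = new Map<string, string[]>();
  owners.forEach((names, file) => {
    if (names.length > 1) shared.set(file, names);
  });
  return shared;
}

const conflicts = findSharedFiles([
  { name: "parallel-workers", files: ["vitest.config.ts"] },
  { name: "test-isolation", files: ["vitest.config.ts", "src/setup.ts"] },
  { name: "fixture-cache", files: ["src/fixtures.ts"] },
]);
console.log(conflicts.size); // 1
console.log(conflicts.get("vitest.config.ts")); // the two groups that overlap
```

An empty result means the grouping satisfies the constraint; otherwise the overlapping groups would have to be merged or re-partitioned before finalization.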
45.4 Session State and Memory Management
45.4.1 Dual-File Persistence
Pi-autoresearch's memory management operates at three levels with distinct volatility characteristics (README-documented):
Level 1: LLM Context Window (volatile). The agent's context window contains the system prompt, tool descriptions, skill document, recent conversation history, and fragments of session files. This is the primary working memory. It is limited by the model's context length and lost entirely on context reset.
Level 2: Session Files (persistent). Two complementary files survive context resets. autoresearch.md is a narrative Markdown document maintained by the agent with sections for the optimization objective, strategies tried, dead ends, and key wins. It provides high-level strategic context that helps a fresh agent understand why certain approaches were tried. autoresearch.jsonl is an append-only structured log recording every experiment with exact metrics, commit hashes, confidence scores, timestamps, statuses, and descriptions. It provides precise tactical recall. The JSONL schema (README-documented fields):
// JSONL record schema — fields documented in the README
// Actual field names and types are as documented; additional
// fields may exist in the implementation
{
"run": 4,
"commit": "a3f8c2e",
"metric_name": "total_test_seconds",
"metric_value": 37.5,
"status": "kept",
"description": "Parallelized vitest workers across 4 CPU cores",
"confidence": 4.28,
"timestamp": "2026-03-15T14:23:41.892Z",
"duration_ms": 38200
}
Level 3: Git History (permanent). Every kept experiment is a git commit, creating an immutable audit trail of all code changes that produced improvements. This is not consumed directly by the agent but serves human reviewers and enables the finalization workflow.
The design insight (author analysis) is that LLMs benefit from both narrative context (what is the goal, what approaches worked, what failed) and structured data (exact metric values, commit hashes, statistical scores). Storing these in separate formats optimized for each modality produces better agent behavior than a single homogeneous log.
45.4.2 Resumption Protocol
The system supports three resumption scenarios (README-documented; the README explicitly states: "A fresh agent with no memory can read these two files and continue exactly where the previous session left off"):
- Agent restart (same context window): The agent reads autoresearch.jsonl to reconstruct numerical state and continues the loop.
- Context reset (new agent instance): A fresh agent reads both autoresearch.md (high-level context) and autoresearch.jsonl (detailed history).
- Branch switch: Each branch maintains its own session state. Switching branches automatically switches the session context.
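Reconstructing numerical state from autoresearch.jsonl amounts to a single pass over the log. The sketch below uses only the documented run, metric_value, and status fields; LogEntry, ResumedState, and resumeFromJsonl are hypothetical names, and treating the last kept value as the best so far relies on the greedy rule that every kept run is a strict improvement.

```typescript
interface LogEntry {
  run: number;
  metric_value: number;
  status: "baseline" | "kept" | "discarded" | "crashed" | "checks_failed";
}

interface ResumedState {
  runCount: number;
  keptCount: number;
  baseline: number | null;
  best: number | null;
}

function resumeFromJsonl(text: string): ResumedState {
  const state: ResumedState = { runCount: 0, keptCount: 0, baseline: null, best: null };
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines and a trailing newline
    const entry = JSON.parse(line) as LogEntry;
    state.runCount += 1;
    if (entry.status === "baseline") {
      state.baseline = entry.metric_value;
      state.best = entry.metric_value;
    } else if (entry.status === "kept") {
      state.keptCount += 1;
      state.best = entry.metric_value; // each kept run beats all earlier ones
    }
  }
  return state;
}

const log =
  [
    '{"run":1,"metric_value":45.2,"status":"baseline"}',
    '{"run":2,"metric_value":39.8,"status":"kept"}',
    '{"run":3,"metric_value":41.1,"status":"discarded"}',
  ].join("\n") + "\n";
const resumed = resumeFromJsonl(log);
console.log(resumed.runCount, resumed.keptCount, resumed.baseline, resumed.best);
// 3 1 45.2 39.8
```

A truncated final line (for example after an interrupted write) would make JSON.parse throw; a more defensive reader might skip such a line instead.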
45.4.3 Memory Scaling Properties
Both session files grow linearly with the number of experiments. For short sessions (10–20 experiments), this is negligible. For extended sessions (200+ experiments), the JSONL file may exceed what can fit in a single context window. In such cases, the agent must selectively read recent entries or aggregate statistics rather than loading the entire history (author analysis — this scaling behavior follows from the append-only design, not from explicit documentation of this specific scenario). The narrative autoresearch.md serves as a compressed summary that remains context-friendly regardless of session length.
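One way to keep a long log context-friendly is to pass the agent aggregate counts plus only the most recent entries. This is a sketch of that idea, not a documented feature of the extension: Entry, Digest, digestLog, and the keepLast parameter are all hypothetical.

```typescript
interface Entry {
  run: number;
  status: string;
  metric_value: number;
  description: string;
}

interface Digest {
  totalRuns: number;
  statusCounts: Record<string, number>;
  recent: Entry[]; // the last keepLast entries, oldest first
}

function digestLog(entries: Entry[], keepLast: number): Digest {
  const statusCounts: Record<string, number> = {};
  for (const entry of entries) {
    statusCounts[entry.status] = (statusCounts[entry.status] ?? 0) + 1;
  }
  return {
    totalRuns: entries.length,
    statusCounts,
    // slice(-0) would return the whole array, so zero is handled explicitly
    recent: keepLast > 0 ? entries.slice(-keepLast) : [],
  };
}

const sample: Entry[] = [
  { run: 1, status: "baseline", metric_value: 45.2, description: "Baseline measurement" },
  { run: 2, status: "kept", metric_value: 39.8, description: "Parallel vitest workers" },
  { run: 3, status: "discarded", metric_value: 41.1, description: "Hypothetical change" },
];
const digest = digestLog(sample, 2);
console.log(digest.totalRuns, digest.recent.map((e) => e.run).join(","));
// 3 2,3
```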
45.5 Data Flow and Experiment Lifecycle
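The complete data flow through a single experiment cycle proceeds as follows: the agent modifies code and calls run_experiment, which executes the benchmark command and parses METRIC lines from stdout; if the benchmark exits with code 0 and autoresearch.checks.sh is present, the checks script runs; the agent then calls log_experiment, which appends a record to autoresearch.jsonl, computes the confidence score, and updates the UI. Kept changes are committed and the branch advances, while every other outcome is reverted with a git reset.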
45.5.1 Experiment Status Taxonomy
Each experiment in the JSONL log carries one of five statuses (README-documented):
| Status | Meaning | Enters MAD Pool? | Agent Action |
|---|---|---|---|
| baseline | Initial measurement, no code changes | Yes | Reference point for all improvements |
| kept | Metric improved, changes committed | Yes | Branch advances to new best state |
| discarded | Metric did not improve | Yes | Git reset, try a different approach |
| crashed | Benchmark command returned non-zero exit | No | Git reset, try a fundamentally different approach |
| checks_failed | Benchmark passed but correctness checks failed | Yes | Git reset, fix correctness issue in next attempt |
45.6 Key Results and Empirical Evidence
Pi-autoresearch is infrastructure rather than a benchmark paper — the repository does not report specific experimental results with controlled baselines (README-documented). The README describes typical improvement ranges across canonical example domains. These are presented here as the repository's own characterization, not as independently verified benchmarks:
| Domain | Metric | Direction | README-Reported Typical Gains |
|---|---|---|---|
| Test speed | Wall-clock seconds | ↓ minimize | 10–50% reduction |
| Bundle size | KB | ↓ minimize | 5–30% reduction |
| LLM training | val_bpb | ↓ minimize | 5–15% reduction |
| Build speed | Wall-clock seconds | ↓ minimize | 10–40% reduction |
These ranges should be interpreted as order-of-magnitude guidance rather than rigorous benchmarks. Actual results depend heavily on the specific project, the quality of the skill document, the LLM backend, and the starting optimization state of the codebase. No controlled baselines, ablations, number of runs, seeds, or variance information accompanies these claims in the README.
45.6.1 Illustrative Case Study: Test Speed Optimization
⚠ Hypothetical example
The following case study is a constructed illustration of typical session behavior, not a reproduction of a specific repository run or a validated benchmark result. It is designed to concretize the system's mechanics (experiment statuses, confidence evolution, JSONL schema, finalization) using plausible values consistent with the README-documented behavior. No real run trace from the repository is available for inclusion in this survey. Readers should treat the numeric values, file names, and session progression as pedagogical, not evidential.
Scenario: optimizing a Node.js project's vitest suite, starting at 45.2 seconds. This uses the same numeric data as the worked confidence example in §45.3.2, now contextualized with experiment descriptions.
Session Configuration (hypothetical)
Benchmark command: bash autoresearch.sh (runs pnpm test --run, emits METRIC total_test_seconds=...)
Primary metric: total_test_seconds (direction: minimize)
Checks: bash autoresearch.checks.sh (runs pnpm tsc --noEmit)
Config: { "maxIterations": 30 }
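The configuration above depends on the benchmark script reporting its result as a `METRIC name=number` line. A minimal sketch of how such output might be parsed follows. The exact grammar accepted by the extension is not reproduced here, so the regular expression (identifier-style names, plain or scientific-notation numbers) and the function name `parse_metrics` are assumptions.

```python
import re

# Assumed grammar: "METRIC <identifier>=<number>", one metric per line.
_METRIC_LINE = re.compile(
    r"^METRIC\s+(?P<name>[A-Za-z_]\w*)="
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(stdout: str) -> dict:
    """Collect every well-formed METRIC line from benchmark output."""
    metrics = {}
    for line in stdout.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


sample_output = "running tests...\nMETRIC total_test_seconds=45.2\ndone\n"
parsed = parse_metrics(sample_output)  # {'total_test_seconds': 45.2}
```

Lines that do not match (ordinary test-runner output, malformed values) are ignored rather than raising, so a noisy benchmark log does not break parsing.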
| Run | Description | Metric (s) | Status | Confidence |
|---|---|---|---|---|
| 1 | Baseline measurement | 45.2 | baseline | — |
| 2 | Parallel vitest workers (4 cores) | 39.8 | kept | — |
| 3 | Replace regex validation with string ops | 41.1 | discarded | 4.15 |
| 4 | Cache expensive test fixtures | 37.5 | kept | 4.28 |
| 5 | Swap assertion library to faster alternative | 38.2 | discarded | 4.81 |
| 6 | Lazy-load test utilities | 36.8 | kept | 4.67 |
| 7 | Refactor test setup (syntax error) | — | crashed | 4.67 |
| 8 | Remove unused test imports | 35.1 | kept | 6.31 |
Hypothetical session summary: 8 experiments; 4 kept, 2 discarded, 1 crashed, 1 baseline. Start: 45.2s → End: 35.1s (22.3% reduction). All confidence scores were in the green zone ($\geq 2.0$), which in this scenario is consistent with vitest runtime being relatively deterministic on an idle machine — the noise floor (MAD ≈ 1.6s) is well below the cumulative improvement (10.1s). Note that the confidence scores reflect session-level cumulative improvement, not per-experiment significance (see §45.3.2).
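The confidence column can be checked by hand. The sketch below recomputes it under the reading that confidence is the best improvement over the baseline divided by the raw (unscaled) MAD of all non-crashed measurements, and that no score is reported until three measurements exist. Both points are inferred from the table rather than confirmed against the source code, and the helper name `confidence` is hypothetical.

```python
from statistics import median


def confidence(pool, baseline, best):
    """Best improvement over baseline divided by the raw MAD of the pool."""
    if len(pool) < 3:
        return None  # assumption: too few samples for a noise estimate
    med = median(pool)
    mad = median(abs(v - med) for v in pool)
    if mad == 0:
        return None  # avoid dividing by a zero noise floor
    return abs(baseline - best) / mad


# Metric values from the table; None marks the crashed run 7.
metrics = [45.2, 39.8, 41.1, 37.5, 38.2, 36.8, None, 35.1]
pool, scores = [], []
for m in metrics:
    if m is not None:
        pool.append(m)  # crashed runs never enter the pool
    scores.append(confidence(pool, baseline=pool[0], best=min(pool)))
```

Under these assumptions the scores come out as none, none, then approximately 4.154, 4.278, 4.8125, 4.667, 4.667, and 6.3125, which agrees with the table at two-decimal precision. The crashed run leaves the score unchanged because the pool is unchanged.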
Corresponding JSONL records (first and last, illustrating the documented schema):
{"run":1,"commit":"b1a2c3d","metric_name":"total_test_seconds","metric_value":45.2,"status":"baseline","description":"Baseline measurement","confidence":null,"timestamp":"2026-03-15T14:00:12.341Z","duration_ms":45834}
{"run":8,"commit":"e7f8a9b","metric_name":"total_test_seconds","metric_value":35.1,"status":"kept","description":"Remove unused test imports","confidence":6.31,"timestamp":"2026-03-15T14:18:47.123Z","duration_ms":35782}
45.6.2 Cost Model
The cost of a pi-autoresearch session has three components: LLM inference tokens, benchmark compute time, and human attention. The repository provides the following per-experiment token estimates (README-documented):
| Phase | Estimated Tokens | Notes |
|---|---|---|
| Read context + session files | 2,000–10,000 input | Depends on autoresearch.md size |
| Generate hypothesis + code change | 1,000–5,000 output | Depends on change complexity |
| Interpret results + decide keep/revert | 500–2,000 output | Includes metric reasoning |
| Per-experiment total | ~3,500–17,000 | ~10,000 typical |
For a 50-experiment session at Claude Sonnet-class pricing (approximately $3/M input tokens and $15/M output tokens; prices as of early 2026, subject to change), the repository estimates $4–$10, depending on change complexity and session file growth (README-documented). Cost control mechanisms include the maxIterations field in autoresearch.config.json (a hard limit on experiments) and provider-side API spending caps.
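As a rough cross-check, the per-experiment token ranges from the table can be multiplied out at the quoted prices. This is a back-of-the-envelope sketch: it bills each phase exactly once per experiment, and the helper name `session_cost` is hypothetical.

```python
INPUT_PRICE = 3.0 / 1_000_000    # dollars per input token (quoted above)
OUTPUT_PRICE = 15.0 / 1_000_000  # dollars per output token (quoted above)


def session_cost(experiments, input_tokens, output_tokens):
    """Total dollars if every experiment uses the given token counts."""
    per_experiment = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
    return experiments * per_experiment


# Low end: 2,000 input; 1,000 + 500 output. High end: 10,000 input; 5,000 + 2,000 output.
low = session_cost(50, 2_000, 1_000 + 500)
high = session_cost(50, 10_000, 5_000 + 2_000)
print(f"${low:.2f} to ${high:.2f}")
```

This simple model yields roughly $1.43 to $6.75, which overlaps the repository's $4–$10 estimate but starts lower. The sketch ignores anything billed more than once per experiment, such as context re-sent across multiple tool calls, so it is better read as a floor than as a prediction; whether that accounts for the gap is not something the README states.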
45.7 Reproducibility
Pi-autoresearch is designed with reproducibility as a first-class concern at multiple levels (README-documented):
Experiment-level: Every kept experiment produces a git commit with a descriptive message including the metric improvement. The JSONL log records every experiment (kept and discarded) with timestamps, commit hashes, metric values, confidence scores, and descriptions. Any individual experiment can be reproduced by checking out its commit and re-running the benchmark command.
Session-level: The dual-file persistence layer enables a fresh agent to reconstruct the complete session context and continue from where the previous session stopped.
Infrastructure-level: The extension is installed via pi install https://github.com/davebcn87/pi-autoresearch (README-documented). The benchmark script (autoresearch.sh) is an explicit, versioned shell script — not implicit agent behavior.
45.7.1 Limitations on Reproducibility
Three factors limit exact reproducibility:
- LLM non-determinism: The agent's hypotheses and code changes depend on stochastic LLM generation. Two runs with identical starting conditions will generally produce different experiment sequences, even with the same model and temperature.
- Environment sensitivity: Benchmark measurements depend on system load, hardware, and OS scheduling. The MAD-based confidence scoring mitigates this but does not eliminate it.
- Skill authoring variance: The quality and specificity of the human-authored skill document significantly affects results. Vague skills produce scattered experiments; precise, well-scoped skills produce focused optimization.
Author analysis: The combination of these factors means that pi-autoresearch sessions are reproducible in infrastructure (same tools, same protocol) but not in trajectory (same sequence of experiments). This is an inherent property of LLM-agent-driven optimization, not a deficiency specific to this system.
45.7.2 Reproducibility Checklist
The following checklist summarizes what a reader would need to reproduce a pi-autoresearch session (README-documented unless noted):
| Requirement | Source | Notes |
|---|---|---|
| pi agent installed | README | Terminal-based AI coding agent that hosts the extension |
| Extension installed | README | pi install https://github.com/davebcn87/pi-autoresearch |
| LLM backend configured | README | Any LLM backend supported by pi (Claude, GPT, Gemini, local) |
| Benchmark script | README | autoresearch.sh with METRIC name=number output |
| Skill document | README | Domain-specific Markdown loaded via autoresearch-create |
| Git repository | README | Project must be in a git repo for commit/reset operations |
| Optional: checks script | README | autoresearch.checks.sh for correctness validation |
| Optional: config | README | autoresearch.config.json with maxIterations |
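The file-based rows of the checklist can be verified mechanically before starting a session. The sketch below assumes the scripts and config live at the project root, which the README-derived names suggest but which is not confirmed here; `preflight` is a hypothetical helper. It does not check that pi, the extension, or an LLM backend is installed.

```python
from pathlib import Path

REQUIRED = ("autoresearch.sh",)
OPTIONAL = ("autoresearch.checks.sh", "autoresearch.config.json")


def preflight(project_dir):
    """Report which checklist files exist under project_dir."""
    root = Path(project_dir)
    # .git is a directory in a normal clone and a file in a worktree.
    report = {"git_repo": (root / ".git").exists()}
    for name in REQUIRED + OPTIONAL:
        report[name] = (root / name).is_file()
    report["ready"] = report["git_repo"] and all(report[n] for n in REQUIRED)
    return report
```

Calling `preflight(".")` from the project directory returns a dictionary with one boolean per item and an overall `ready` flag; missing optional files do not affect `ready`.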
45.8 Limitations and Discussion
45.8.1 Fundamental Limitations
Single-objective optimization. The system tracks a single primary metric per session (README-documented). Multi-objective optimization (e.g., simultaneously minimizing test runtime and bundle size) requires either separate sessions or a custom benchmark script that computes a composite score. This is a significant limitation compared to evolutionary systems like GEPA (Chapter 7), which natively support Pareto-based multi-objective optimization.
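One common way to build such a composite is a weighted sum of baseline-normalized metrics. The sketch below illustrates that general technique, not anything the repository prescribes: the metric names, weights, and the `composite_score` helper are hypothetical, and every component is assumed to be minimize-direction.

```python
def composite_score(values, baselines, weights):
    """Weighted mean of value/baseline ratios; 1.0 equals baseline, lower is better."""
    total_weight = sum(weights.values())
    weighted = sum(weights[k] * values[k] / baselines[k] for k in weights)
    return weighted / total_weight


baselines = {"test_seconds": 45.0, "bundle_kb": 500.0}
values = {"test_seconds": 36.0, "bundle_kb": 510.0}
weights = {"test_seconds": 0.7, "bundle_kb": 0.3}

score = composite_score(values, baselines, weights)
# Emit the single number in the documented METRIC name=number form.
print(f"METRIC composite_score={score:.4f}")
```

Here a 20% test-time gain outweighs a 2% bundle-size regression, giving a score below 1.0. The weights encode a fixed trade-off chosen up front, which is exactly what a Pareto-based method avoids committing to.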
Greedy search and local optima. The hill-climbing strategy cannot escape local optima that require temporary metric regressions. If the global optimum requires a refactoring step that temporarily increases test runtime before yielding large gains, pi-autoresearch will revert that intermediate step. Population-based approaches maintain diverse candidates that can explore such valleys. The system's mitigation — relying on the LLM to generate diverse hypotheses — is plausible but not formally guaranteed (author analysis).
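A toy one-dimensional landscape makes the failure mode concrete. This is a deliberately simplified model, since real sessions move through code changes rather than positions on a line; the `hill_climb` function and the metric values are invented for illustration.

```python
def hill_climb(landscape, start):
    """Move to the better neighbor until no neighbor improves (minimize)."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(landscape)]
        best = min(neighbors, key=lambda p: landscape[p])
        if landscape[best] >= landscape[pos]:
            return pos  # every move would regress, so the search stops here
        pos = best


# Index 3 is a temporary regression that guards the much better states beyond it.
landscape = [45.0, 41.0, 39.0, 40.5, 36.0, 31.0]
stuck_at = hill_climb(landscape, start=0)
```

Starting from index 0 the search halts at index 2 (39.0) because the only way forward passes through 40.5. Started from index 3 it reaches the global minimum at index 5 (31.0).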
No cross-project learning. Each session starts from scratch. Insights gained from optimizing test speed in one project are not transferred to another. This contrasts with systems like AlphaEvolve (Chapter 4), which can leverage a shared program database across projects.
No meta-optimization. The system cannot learn to improve its own optimization strategy. The hill-climbing approach, confidence thresholds, and experiment structure are fixed.
Platform lock-in. The extension is specific to the pi agent. Users of other coding agents (Claude Code, Cursor, GitHub Copilot) cannot use pi-autoresearch without switching to pi.
45.8.2 Statistical Rigor Discussion
The MAD-based confidence scoring is a meaningful improvement over raw metric comparison, but it falls short of formal statistical testing (author analysis). As emphasized in §45.3.2, the confidence score is a session-level signal-to-noise heuristic, not a per-experiment significance test. Several additional concerns:
- Heuristic thresholds: The confidence thresholds (2.0×, 1.0×) are heuristic. For a normal distribution, 2.0× MAD ≈ 1.35σ (since MAD ≈ 0.6745σ), which corresponds to a two-sided $p \approx 0.18$, much weaker than the conventional $p < 0.05$. However, benchmark noise distributions are typically heavy-tailed, where MAD's robustness may make this threshold more practically meaningful than the Gaussian approximation suggests.
- No multiple-comparison correction: After 50 experiments, the probability of finding at least one spurious result increases substantially. The system does not apply Bonferroni or false-discovery-rate corrections.
- Non-stationary distribution: The MAD computation treats all non-crashed experiments as samples from the same underlying distribution. In practice, kept experiments alter the codebase, potentially changing the noise characteristics of subsequent measurements. The pool thus mixes measurements from different code states.
- Session-level, not per-change: As defined in Equation 45.4, the best improvement is always measured relative to the baseline $m_0$, not relative to the most recent best. This means the confidence score reflects total cumulative improvement. A high confidence score (green) does not validate any individual experiment — it indicates that the aggregate session improvement is large relative to the aggregate noise floor. A session could show green confidence even if some individual "kept" changes are within noise, provided the total improvement is large enough.
Despite these limitations, the confidence scoring represents a practical advance in autoresearch tooling. A more rigorous approach might use paired permutation tests or bootstrap confidence intervals, but these would require multiple runs per experiment, significantly increasing cost and session duration.
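Two of the figures in this discussion can be reproduced with the standard library. The sketch converts the 2.0× MAD threshold to a Gaussian z-score and two-sided p-value, then shows how the chance of at least one false positive grows over repeated independent tests. The independence assumption and the helper names are illustrative, and the second calculation describes the general multiple-comparison concern rather than the confidence score itself, which is not a per-experiment test.

```python
import math

MAD_PER_SIGMA = 0.6745  # Gaussian consistency constant quoted above


def two_sided_p(z):
    """P(|Z| >= z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))


def familywise_error(alpha, n):
    """Chance of at least one false positive in n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n


z = 2.0 * MAD_PER_SIGMA          # about 1.35 sigma
p = two_sided_p(z)               # about 0.18
fwer_50 = familywise_error(0.05, 50)
```

Even at the conventional 0.05 level, 50 independent tests give a better than 90% chance of at least one spurious result, which is why the absence of a correction matters more as sessions lengthen.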
45.8.3 Comparison with Evolutionary Systems
| Dimension | Pi-autoresearch | AlphaEvolve / OpenEvolve / GEPA |
|---|---|---|
| Search strategy | Greedy hill-climbing | Population-based evolutionary search |
| LLM role | Autonomous agent (full control) | Mutation operator (system-directed) |
| Population size | 1 (single candidate) | 10–1,000+ candidates |
| Exploration/exploitation | LLM-driven (implicit) | Algorithmic (bandits, MAP-Elites, islands) |
| Domain specificity | Domain-agnostic via skills | Requires evaluation function + evolve blocks |
| Setup complexity | Low (skill document + benchmark script) | High (evaluation harness, search config, sandbox) |
| Optimality guarantees | None (greedy) | Weak (evolutionary convergence) |
45.9 Continued Learning
45.9.1 Within-Session Learning
The agent learns within a session through three channels (author analysis of documented behavior). First, narrative learning: as the agent updates autoresearch.md with dead ends and key wins, it builds a compressed model of the optimization landscape. Second, statistical feedback: confidence scores provide quantitative guidance on session-level improvement reliability. Third, pattern recognition: the JSONL log provides a structured record enabling the agent to avoid repeating failed approaches.
45.9.2 Cross-Session Learning
Cross-session learning is supported through the persistent session files. A new agent instance inherits the full strategic context (narrative document) and tactical history (JSONL log) from prior sessions. However, this learning is confined to the project and branch level — there is no cross-project knowledge transfer and no mechanism for improving the system's optimization strategy itself based on accumulated experience.
| Learning Type | Pi-autoresearch | Karpathy autoresearch | AlphaEvolve |
|---|---|---|---|
| Within-session | JSONL + Markdown | Context window only | Program database |
| Cross-session | Persistent files | results.tsv + git | Persistent database |
| Cross-project | No | No | Yes (shared infra) |
| Meta-learning | No | No | Partial |
45.10 Implementation Details
45.10.1 Technology Stack
Pi-autoresearch is implemented as a pi extension using pi's extension and skill frameworks. The implementation stack (README-documented):
| Layer | Language/Format | Purpose |
|---|---|---|
| Extension definition | JSON manifest | Declares tools, UI widgets, commands |
| Tool implementations | TypeScript (pi extension API) | init_experiment, run_experiment, log_experiment |
| UI components | pi widget framework | Status bar, dashboard, fullscreen overlay |
| Skill documents | Markdown | autoresearch-create, autoresearch-finalize |
| Benchmark scripts | Bash | autoresearch.sh, autoresearch.checks.sh |
| Session state | JSON Lines + Markdown | autoresearch.jsonl + autoresearch.md |
| Configuration | JSON | autoresearch.config.json |
The repository documentation describes the system as following a "radical simplicity" design philosophy. The codebase appears small: the README and extension structure suggest a modest footprint of TypeScript source plus structured Markdown for skills. The survey author has not performed a line count, so no LOC figure is given here.
45.10.2 State Machine
The extension manages a session state machine with five states (README-documented): INACTIVE (no session), SETUP (configuring session via init_experiment), RUNNING (autonomous loop active), PAUSED (loop stopped, state preserved), and CLEARED (state deleted, returning to INACTIVE). Transitions are triggered by slash commands (/autoresearch, /autoresearch off, /autoresearch clear), keyboard interrupts, or reaching maxIterations.
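The five states and their triggers can be sketched as a transition table. The state names and slash commands come from the description above; which trigger leads to which state is an assumption made for illustration, and the event labels `init_done`, `interrupt`, `max_iterations`, and `cleanup_done` are hypothetical stand-ins for internal events.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    SETUP = "setup"
    RUNNING = "running"
    PAUSED = "paused"
    CLEARED = "cleared"


# Assumed transitions; the real extension may wire these differently.
TRANSITIONS = {
    (State.INACTIVE, "/autoresearch"): State.SETUP,
    (State.SETUP, "init_done"): State.RUNNING,
    (State.RUNNING, "/autoresearch off"): State.PAUSED,
    (State.RUNNING, "interrupt"): State.PAUSED,
    (State.RUNNING, "max_iterations"): State.PAUSED,
    (State.PAUSED, "/autoresearch"): State.RUNNING,
    (State.RUNNING, "/autoresearch clear"): State.CLEARED,
    (State.PAUSED, "/autoresearch clear"): State.CLEARED,
    (State.CLEARED, "cleanup_done"): State.INACTIVE,
}


def step(state, event):
    """Return the next state, or raise if the event is invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None
```

Keeping the table explicit makes invalid sequences, such as clearing a session that was never started, fail loudly instead of silently changing state.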
45.10.3 LLM Backend Independence
A deliberate design decision (README-documented): pi-autoresearch does not specify which LLM powers the pi agent. The extension works with whatever LLM backend the user has configured. This means the same experimental infrastructure works with Claude, GPT-4o/o3, Gemini, or open-weight models served locally. The LLM choice affects the quality of hypotheses the agent generates but not the experimental infrastructure itself.
45.11 Applications
45.11.1 Developer Workflow Optimization
The primary application domain is optimizing measurable aspects of software projects: test suite runtime, build duration, bundle size, and web performance scores. These are domains where (a) a clear numeric metric exists, (b) the metric can be measured by a shell command, (c) the search space of possible improvements is rich enough for an LLM agent to explore productively, and (d) the benchmark executes quickly enough for rapid iteration.
45.11.2 ML Research
Following the Karpathy paradigm, pi-autoresearch can be applied to ML training optimization: architecture search, hyperparameter tuning, data pipeline optimization, and training stability improvements. The key constraint is benchmark duration — very long training runs (hours per experiment) create impractically slow feedback loops for the greedy hill-climbing approach.
45.11.3 Boundaries of Applicability
The system cannot optimize metrics requiring human judgment (code readability, UX quality), highly non-deterministic metrics where even MAD-based confidence is insufficient, or multi-objective problems without manual composite scoring. It also requires automatable benchmarks with numeric output — qualitative evaluations are outside scope.
Summary
Key takeaway: Pi-autoresearch separates experimental loop mechanics from domain knowledge through an extension/skill architecture, and adds MAD-based statistical confidence scoring to estimate whether cumulative improvements exceed benchmark noise. The confidence score is a session-level signal-to-noise heuristic, not a per-experiment significance test.
Main contribution to the field: The extension/skill separation pattern demonstrates that the autonomous experiment loop can be treated as general-purpose infrastructure, decoupled from any specific optimization domain. Combined with structured logging, branch-aware finalization, and a practical (if heuristic) noise-floor estimator, this makes autonomous optimization accessible to any developer using the pi agent.
What a researcher should know: Pi-autoresearch uses greedy hill-climbing backed by git (not population-based evolutionary search), making it simpler and more interpretable than systems like AlphaEvolve or OpenEvolve, but inherently susceptible to local optima. Its confidence scoring via MAD is a practical heuristic — the 2.0× threshold corresponds roughly to 1.35σ under Gaussian assumptions, well below standard significance thresholds. The README reports typical improvement ranges (10–50% for test speed, 5–30% for bundle size) but these are not independently verified benchmarks. All code listings in this chapter are pseudocode reconstructions, not verbatim repository excerpts; readers should consult the repository directly for implementation details. All repository observations were made in April 2026.