Introduced: 2026-02
Score: 8.12/10 (Draft)
Chapter 40

FARS: Fully Automated Research System

Part P07: Autonomous Research Systems

FARS represents a paradigm shift in automated research: from proof-of-concept agent demonstrations to industrial-scale, continuously operating research infrastructure. Deployed live on a 160-GPU cluster in February 2026, FARS produced 100 short research papers in 228 hours of fully unattended operation, consuming 11.4 billion tokens at a total cost of approximately $104,000. Unlike prior autoresearch systems that optimized for human-like paper formatting and peer-review acceptance, FARS rejects academic publishing conventions entirely, instead optimizing for throughput, reliability, and the systematic production of minimal, composable knowledge units — including negative results.

Key Contribution

FARS is the first publicly demonstrated, continuously operating automated research system at industrial scale. Its core contribution is not algorithmic but architectural and philosophical: it proves that automated research can function as reliable production infrastructure rather than a one-off demonstration, and it challenges the assumption that academic paper formatting is a necessary property of knowledge production. The system's 228-hour unattended deployment, producing 244 hypotheses and 100 papers with a mean review score of 5.05 on the ICLR scale (above the 4.21 average for human submissions), establishes a concrete baseline for automated research throughput, cost, and quality.

40.1 Overview and Motivation

40.1.1 The Industrial Research Thesis

FARS was developed by Analemma (日行迹智能科技), a Shanghai-based startup founded in March 2025 by Dr. Sun Tianxiang (孙天祥), a Fudan University PhD and principal developer of MOSS, one of the earliest Chinese open-source conversational language models. The core team draws from Fudan's MOSS group and Shanghai AI Laboratory's InternLM team — researchers who built the very models that autoresearch systems consume. This insider perspective informs FARS's pragmatic design: the system is built by people who understand both the capabilities and the limitations of LLMs as research tools.

The speed of execution is notable: approximately 11 months from company founding to a 160-GPU live deployment producing 100 papers. The company secured an angel round of several hundred million RMB from investors including Sequoia Capital China, reflecting significant confidence in the automated research infrastructure thesis.

FARS's central philosophical claim distinguishes it from every prior system in this survey: the output of a research system should be research contributions, not papers conforming to academic formatting conventions. Where AI Scientist (Chapter 30) sought to pass the Turing test of academic publishing, and DeepScientist (Chapter 37) sought frontier-pushing scientific depth, FARS asks what an optimally efficient research system looks like when freed from the constraints of human publishing conventions. The blog post articulates this through a first-principles critique of human-centered research:

| Structural Problem | Root Cause | FARS Response |
| --- | --- | --- |
| High entry barrier | Years of training to become a researcher | Automated agents require no training period |
| Publication bias | Only "successful" experiments get published | Every completed experiment, positive or negative, produces output |
| Format overhead | Conforming to venue-specific formatting rules | No structural constraints beyond clarity |
| Length inflation | Pressure to fill pages to meet minimums | Papers are as long as they need to be, no more |
| Supply constraint | Limited number of human researchers | System runs 24/7, parallelizes across projects |

40.1.2 Five Design Principles

From the blog post and observable outputs, FARS operates on five explicit principles:

  1. Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container.
  2. Single-scoped contributions. Each paper addresses exactly one research question. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.
  3. Negative results are knowledge. A well-conducted experiment showing something does not work is as valuable as one showing something does. FARS explicitly reports negative results — for example, "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity."
  4. No unnecessary constraints. Papers are not padded to meet length minimums and do not conform to venue-specific templates.
  5. Scale reveals truth. FARS was designed to produce 100 papers precisely because quality variance at scale is a known limitation — the signal emerges from the aggregate, not from cherry-picked examples.

40.1.3 Position in the Autoresearch Landscape

FARS explicitly positions itself as a successor to six prior systems. Where each addressed a subset of the autoresearch problem — AI Scientist demonstrating feasibility, Zochi achieving acceptance-level quality, DeepScientist pushing frontier depth — FARS addresses industrial-scale throughput with continuous autonomous operation. It is the difference between building a faster horse and building an assembly line.

| System | Year | Primary Contribution | Framing |
| --- | --- | --- | --- |
| AI Scientist | 2024 | First end-to-end: idea → code → paper → review | Dialogue-based |
| CycleResearcher | 2024 | Iterative review-revision quality improvement | Feedback loop |
| Zochi | 2025 | First AI papers accepted at workshops (avg 7.67) | Acceptance-optimized |
| AI Scientist v2 | 2025 | Tree search methodology, double-blind review pass | Search-based |
| AI-Researcher | 2025 | Four-module architecture, NeurIPS Spotlight | Modular pipeline |
| DeepScientist | 2026 | Exceeded human SOTA on 3 frontier tasks | Depth-first |
| FARS | 2026 | 228h continuous operation, 100 papers, live public deployment | Pipeline production |

The framing distinction is important. FARS treats automated research as a pipeline production problem — not a search problem (contrast with evolutionary systems like AlphaEvolve), not a dialogue problem (contrast with AI Scientist's multi-turn LLM conversation), and not a depth-first discovery problem (contrast with DeepScientist). The pipeline metaphor has specific implications for how the system is designed, optimized, and evaluated.

40.2 Architecture

40.2.1 Four-Agent Sequential Pipeline

FARS employs a four-agent sequential pipeline: Ideation Agent → Planning Agent → Experiment Agent → Writing Agent. Each agent reads from and writes to a shared filesystem, which serves as the sole coordination mechanism. There is no direct agent-to-agent communication — the filesystem boundary is the API contract.

[Figure: FARS architecture. Research direction documents (input) pass through four agents in sequence. Ideation Agent: literature review, hypothesis generation, automated review gate; ~15% of tokens; draws on open-access papers and public GitLab repos; outputs validated hypotheses. Planning Agent: hypothesis to experimental plan (baselines, metrics, resources); ~5% of tokens. Experiment Agent (bottleneck): code generation and debugging, with the 160-GPU cluster and model inference endpoints exposed as tools; ~70% of tokens; outputs results and code. Writing Agent: results to a short single-contribution paper (4–8 pages); ~10% of tokens. Outputs: paper (PDF), GitLab repo, live dashboard, manual review (3+ reviewers), arXiv (if passed). A shared filesystem underlies all four agents as the sole coordination mechanism: structured project directories, no agent-to-agent communication, persistent state, human-inspectable.]

40.2.2 The Shared Filesystem as Coordination Protocol

The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This is not merely a storage layer — it is the entire inter-agent protocol. Most multi-agent systems use message queues, event buses, or direct API calls. FARS's choice has deep advantages that the blog post and observable behavior make clear:

  • Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state represents a perfect checkpoint.
  • Natural handoffs. Agent A writes files; Agent B reads them. The filesystem boundary is the API contract — no schema definitions, no serialization overhead.
  • Human inspectability. Any researcher can inspect exactly what each agent produced, enabling debugging and trust-building.
  • Trivial parallelism. Multiple projects run in parallel via separate directories. No lock contention, no resource arbitration beyond GPU scheduling.
  • Decoupled evolution. Each agent can be upgraded independently, provided file formats remain compatible.
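The handoff and checkpoint properties listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FARS's implementation: the per-stage JSON files, the stage names used as file names, and the helpers `write_stage_output`, `read_stage_output`, and `next_stage` are all hypothetical, since the real directory layout is not public.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical stage order and file naming; the real FARS layout is not public.
STAGES = ["ideation", "planning", "experiment", "writing"]


def write_stage_output(project_dir: Path, stage: str, payload: dict) -> Path:
    """Write a stage's output so that readers never observe a partial file."""
    project_dir.mkdir(parents=True, exist_ok=True)
    target = project_dir / f"{stage}.json"
    # Write to a temp file in the same directory, then rename it over the target.
    fd, tmp_name = tempfile.mkstemp(dir=project_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)  # rename is atomic on POSIX filesystems
    return target


def read_stage_output(project_dir: Path, stage: str) -> dict:
    """Read a predecessor's output; the file is the only 'message' exchanged."""
    return json.loads((project_dir / f"{stage}.json").read_text(encoding="utf-8"))


def next_stage(project_dir: Path) -> Optional[str]:
    """Return the first stage with no output file, i.e. where to resume."""
    for stage in STAGES:
        if not (project_dir / f"{stage}.json").exists():
            return stage
    return None  # every stage has written its output


# Usage: one project directory acts as both workspace and checkpoint.
with tempfile.TemporaryDirectory() as root:
    project = Path(root) / "projects" / "FA0001"
    write_stage_output(project, "ideation", {"hypothesis": "example"})
    resume_at = next_stage(project)
    handoff = read_stage_output(project, "ideation")

print(resume_at, handoff)
```

Because progress is derived from which files exist, a restarted scheduler needs no separate state store: it calls `next_stage` on each project directory and continues from there.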

This pattern is well-established in systems engineering (Unix pipes, Plan 9, microservices via shared storage) but is novel in the autoresearch space. AI Scientist uses a single LLM conversation thread, DeepScientist uses knowledge graphs and databases, and evolutionary systems like those in Parts P03–P05 use in-memory population databases. The following table contrasts these coordination patterns:

| Pattern | Used By | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Single LLM context | AI Scientist | Simple, coherent | Context window limits, no parallelism |
| Knowledge graph | DeepScientist | Rich relationships | Complex queries, schema overhead |
| In-memory population | Evolutionary systems | Fast, structured | Volatile, single-node bottleneck |
| Message queue | AI-Researcher | Decoupled, ordered | Needs broker, no natural persistence |
| Shared filesystem | FARS | Universal, durable, inspectable | Unstructured unless conventions enforced |

40.2.3 Key Architectural Decisions

Several design decisions reported in the blog post and observable from the system's behavior merit analysis:

Sequential pipeline over graph search. Research has a natural linear flow — idea, plan, experiment, write — and FARS exploits this structure. The pipeline is simpler to coordinate, debug, and reason about than a graph-based approach. The trade-off is rigidity: FARS cannot loop back from failed experiments to revised hypotheses within the same project, unlike iterative systems such as CycleResearcher.

GPU cluster as tools, not raw access. The 160-GPU cluster is encapsulated as high-level tool interfaces. The Experiment Agent schedules jobs without managing CUDA, drivers, or multi-GPU parallelism. This is analogous to how cloud computing abstracts hardware — the agent reasons about experiments, not infrastructure.

Short papers by design. By targeting 4–8 page single-contribution papers, the Writing Agent's task is substantially simpler than producing full conference papers. This reduces hallucination risk, generation time, and the need for comprehensive related work surveys.

40.3 Core Mechanisms

40.3.1 Pipeline Parallelism and Throughput

FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines and industrial assembly lines. While paper $N$ is being written, paper $N+1$ is being experimented on, paper $N+2$ is being planned, and paper $N+3$ is being ideated. All four stages operate concurrently on different projects.

Let $T_{\text{idea}}$, $T_{\text{plan}}$, $T_{\text{exp}}$, and $T_{\text{write}}$ denote the average time for each pipeline stage. The steady-state throughput of a fully occupied pipeline is:

$$\text{Throughput} = \frac{1}{\max(T_{\text{idea}},\; T_{\text{plan}},\; T_{\text{exp}},\; T_{\text{write}})}$$

where throughput is measured in papers per unit time. The bottleneck stage — the slowest — determines throughput regardless of how fast the other stages operate. Based on the reported token consumption breakdown (~70% for the Experiment Agent) and the observed average of ~2.3 hours per paper, we can estimate:

| Stage | Estimated Duration | Token Share | Bottleneck? |
| --- | --- | --- | --- |
| Ideation | ~20–30 min | ~15% | No |
| Planning | ~10–15 min | ~5% | No |
| Experiment | ~90–120 min | ~70% | Yes |
| Writing | ~20–30 min | ~10% | No |

Note: Stage durations and token-share breakdowns are author estimates based on reported aggregate figures. FARS does not publicly disclose per-stage timing.

The observed average interval of ~137 minutes per paper (228 hours / 100 papers) is longer than the estimated bottleneck duration of 90–120 minutes, so the run fell short of the idealized throughput bound. This gap likely reflects pipeline filling and draining overhead, failed hypotheses that consume resources without producing papers (59% of hypotheses did not become papers), and GPU contention during peak parallel execution.
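The comparison between the bottleneck bound and the observed rate can be worked through numerically. The sketch below uses the midpoints of the estimated stage durations, which are assumptions for illustration (FARS does not disclose per-stage timing); only the 228-hour and 100-paper figures come from the reported run.

```python
# Midpoints of the chapter's estimated stage durations, in minutes.
# These are illustrative estimates, not figures disclosed by FARS.
stage_minutes = {
    "ideation": 25.0,
    "planning": 12.5,
    "experiment": 105.0,
    "writing": 25.0,
}

# The slowest stage bounds steady-state throughput: 1 / max(T_stage).
bottleneck_stage = max(stage_minutes, key=stage_minutes.get)
bottleneck_minutes = stage_minutes[bottleneck_stage]
ideal_papers_per_hour = 60.0 / bottleneck_minutes

# Observed average interval between completed papers in the FARS-100 run.
observed_minutes_per_paper = 228 * 60 / 100
observed_papers_per_hour = 60.0 / observed_minutes_per_paper

# A ratio above 1 means the run was slower than the idealized bound.
slowdown = observed_minutes_per_paper / bottleneck_minutes

print(f"bottleneck stage: {bottleneck_stage} ({bottleneck_minutes:.0f} min)")
print(f"ideal throughput: {ideal_papers_per_hour:.2f} papers/hour")
print(f"observed throughput: {observed_papers_per_hour:.2f} papers/hour")
print(f"slowdown vs. bound: {slowdown:.2f}x")
```

With these midpoints the observed interval is roughly 1.3 times the bottleneck duration; choosing the upper estimate of 120 minutes for the Experiment stage would shrink that gap to about 1.14 times.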

The following pseudocode illustrates the pipeline scheduling logic:

# Pseudocode — no public implementation available
# Illustrates FARS pipeline parallelism as described in the blog post

import asyncio
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    IDEATION = "ideation"
    PLANNING = "planning"
    EXPERIMENT = "experiment"
    WRITING = "writing"
    COMPLETE = "complete"

@dataclass
class Project:
    project_id: str
    stage: Stage
    workspace_path: str  # path in shared filesystem

async def run_pipeline(
    research_directions: list[str],
    workspace_root: str,
    max_concurrent: int = 20,
) -> list[Project]:
    """
    Pipeline scheduler: launches projects and advances them
    through stages as agents complete their work.
    Each agent reads/writes from the project's workspace directory.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    completed = []

    async def process_project(direction: str, idx: int) -> Project:
        async with semaphore:
            project = Project(
                project_id=f"FA{idx:04d}",
                stage=Stage.IDEATION,
                workspace_path=f"{workspace_root}/projects/FA{idx:04d}",
            )
            # Each stage reads predecessor's output from filesystem,
            # writes its own output, then advances the stage marker.
            for stage_fn in [ideation, planning, experiment, writing]:
                success = await stage_fn(project)
                if not success:
                    break  # Project abandoned (failed review, etc.)
            else:
                project.stage = Stage.COMPLETE
                completed.append(project)
            return project

    # Launch all projects concurrently — pipeline fills naturally
    tasks = [
        process_project(d, i)
        for i, d in enumerate(research_directions)
    ]
    await asyncio.gather(*tasks)
    return completed

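async def ideation(project: Project) -> bool:
    """Reads research direction, writes hypothesis to filesystem."""
    # 1. Conduct literature review (open-access papers, public repos)
    # 2. Generate hypothesis
    # 3. Automated review gate: reject if infeasible/redundant
    # 4. Write hypothesis.md to project workspace
    project.stage = Stage.PLANNING
    return True  # False if hypothesis fails review

async def planning(project: Project) -> bool:
    """Reads hypothesis, writes experimental plan to filesystem."""
    # 1. Read hypothesis.md from project workspace
    # 2. Choose baselines, evaluation metrics, and success criteria
    # 3. Estimate computational resource requirements
    # 4. Write the experimental plan to project workspace
    project.stage = Stage.EXPERIMENT
    return True  # False if no feasible plan can be produced

async def experiment(project: Project) -> bool:
    """Reads plan, executes via GPU tools, writes results."""
    # 1. Read experimental plan from filesystem
    # 2. Generate code for the experiment
    # 3. Schedule GPU jobs via tool interface (not raw hardware)
    # 4. Monitor, debug, iterate as needed
    # 5. Write results + code to filesystem
    project.stage = Stage.WRITING
    return True  # False if experiment fails irrecoverably

async def writing(project: Project) -> bool:
    """Reads results, writes the short paper to filesystem."""
    # 1. Read results + code from filesystem
    # 2. Draft a single-contribution short paper (positive or negative result)
    # 3. Write the paper to project workspace
    return True  # run_pipeline then marks the project COMPLETE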

40.3.2 The Four Agents in Detail

Ideation Agent. Converts broad research directions into validated, actionable hypotheses. It has access to open-access papers and public GitLab repositories — not just paper abstracts but actual code implementations, a significant advantage over literature-review-only approaches. The agent includes an automated review gate that filters hypotheses for quality and feasibility before forwarding. In the FARS-100 run, it generated 244 hypotheses from which 100 became papers, a 41% conversion rate. The overproduction is by design: the Ideation Agent generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.

Planning Agent. The thinnest component, it bridges abstract hypotheses and concrete experimental protocols. It determines baselines, specifies evaluation metrics and success criteria, and estimates computational resource requirements. Its output is an experimental plan written to the shared filesystem.

Experiment Agent. The dominant component, consuming approximately 70% of total tokens. It generates code, schedules GPU training and inference jobs via tool interfaces, collects and analyzes results, and iterates through debug cycles. The GPU cluster and model inference endpoints are encapsulated as tools — the agent cannot accidentally consume all resources or interfere with other projects. When experiments fail, the agent must determine whether the failure is a system error (retry/debug) or a genuine negative result (proceed to Writing Agent with negative finding).

Writing Agent. Produces the final short paper from experimental results. Unlike other autoresearch writing modules that attempt full conference papers, FARS's writer produces focused 4–8 page papers. Crucially, the Writing Agent has two distinct modes inferred from the published outputs: a positive result mode ("we propose X, which achieves Y improvement") and a negative result mode ("we investigate whether X improves Y; we find it does not, and we analyze why"). The negative result mode is philosophically significant — it requires framing failure as contribution.

40.3.3 GPU Cluster Tool Encapsulation

The decision to expose the 160-GPU cluster as abstract tools rather than raw hardware is architecturally important. The Experiment Agent reasons about experiments, not CUDA versions or job schedulers. This creates a clean separation of concerns:

# Pseudocode — no public implementation available
# Illustrates the tool-encapsulation pattern described in reporting

# The Experiment Agent interacts with GPU resources only through
# high-level tool interfaces. Internal scheduling, multi-GPU
# parallelism, and fault tolerance are handled by the tool layer.

async def run_experiment(
    plan: dict, tools: "ToolInterface", max_debug_rounds: int = 5
) -> dict:
    """
    Experiment Agent's interaction with the GPU cluster is mediated
    entirely through tool calls. The agent never manages hardware.
    generate_code_from_plan and debug_and_fix stand in for LLM calls;
    max_debug_rounds is an illustrative bound, not a reported setting.
    """
    # Step 1: Generate training code based on the plan
    code = generate_code_from_plan(plan)

    for _ in range(max_debug_rounds):
        # Step 2: Schedule a training job via tool API
        job_id = await tools.schedule_training_job(
            code=code,
            config=plan["training_config"],
            # Resource requirements are declared, not managed
            gpu_count=plan.get("gpu_count", 1),
            max_hours=plan.get("max_hours", 4),
        )

        # Step 3: Monitor execution (tool handles retries, timeouts)
        status = await tools.wait_for_completion(job_id)
        if not status.failed:
            break

        # Agent decides: debug the code or accept as negative result
        if not status.is_code_error:
            # Genuine experimental failure: proceed as negative result
            return {"outcome": "negative", "analysis": status.error_log}
        # Iterate: fix the code and resubmit the fixed version
        code = debug_and_fix(code, status.error_log)
    else:
        # Debug budget exhausted: a system failure, not a research result
        return {"outcome": "system_failure", "analysis": status.error_log}

    # Step 4: Collect results via tool API
    results = await tools.get_job_results(job_id)

    # Step 5: Run evaluation (may use LLM-as-a-Judge via inference tools)
    eval_scores = await tools.run_evaluation(
        model=plan["eval_model"],
        predictions=results["outputs"],
        references=plan["ground_truth"],
    )

    outcome = "positive" if eval_scores["improvement"] > 0 else "negative"
    return {"outcome": outcome, "results": results, "evaluation": eval_scores}

40.3.4 Dual Role of LLMs: Infrastructure and Subject

A distinctive aspect of FARS is the dual role of large language models. At the infrastructure layer, LLMs serve as the reasoning backbone for all four agents — conducting literature review, generating code, composing papers. At the subject layer, LLMs are the experimental subjects being researched — their training procedures, behaviors, and evaluation methods are the focus of many FARS papers.

This creates a recursive structure: LLMs researching LLMs. For this to work, the infrastructure models must be more capable than the subject models — one cannot reliably study a model's behavior using a less capable model as the reasoning engine. FARS has access to both open-source models (run on the 160-GPU cluster) and closed-source models (via API), though the specific models serving as agent backbones are not publicly disclosed.

40.3.5 Negative Result Detection and Publication

FARS's systematic production of negative results requires distinguishing between four experiment outcome types:

  1. Positive result: Proposed method improves on baselines → report improvement.
  2. Negative result: Method does not work → report why it fails (a legitimate research contribution).
  3. Null result: Inconclusive, may need more experiments → may be abandoned or extended.
  4. System failure: Bug or crash → not a research result, requires debugging.

Traditional autoresearch systems treat outcomes 2–4 as failures and discard them. FARS treats outcome 2 as a publishable contribution. Observed examples include:

  • "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement. The contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
  • "Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate).

In human academia, negative results are systematically suppressed due to publication bias. FARS's willingness to report them represents a structural fix. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value lies not in any individual negative paper but in the aggregate: a comprehensive map of what works and what doesn't.
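The four outcome types in this section can be expressed as a small classifier. This is only a sketch of the distinction: how FARS actually decides between outcomes is not public, and the `RunSummary` fields, the noise band, and the minimum-runs threshold below are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NULL = "null"
    SYSTEM_FAILURE = "system_failure"


@dataclass
class RunSummary:
    """Hypothetical summary of one experiment; field names are illustrative."""
    crashed: bool          # job died from a bug or infrastructure error
    delta: float           # method score minus baseline score
    noise_band: float      # difference treated as within run-to-run noise
    runs_completed: int    # number of completed repetitions
    min_runs: int = 3      # repetitions required before concluding anything


def classify(summary: RunSummary) -> Outcome:
    """Assumed decision rule: rule out failures first, then inconclusive runs."""
    if summary.crashed:
        return Outcome.SYSTEM_FAILURE
    if summary.runs_completed < summary.min_runs:
        return Outcome.NULL
    if summary.delta > summary.noise_band:
        return Outcome.POSITIVE
    return Outcome.NEGATIVE


def is_publishable(outcome: Outcome) -> bool:
    """Positive and negative results both count as contributions."""
    return outcome in {Outcome.POSITIVE, Outcome.NEGATIVE}


# Example modeled on the import-autofix paper: method and baseline both
# resolved 10.0% of tasks, so delta is 0.0 (noise band chosen arbitrarily).
autofix_outcome = classify(
    RunSummary(crashed=False, delta=0.0, noise_band=0.5, runs_completed=3)
)
print(autofix_outcome.value, is_publishable(autofix_outcome))
```

The ordering of the checks matters: a crash must be ruled out before a flat result is reported as a finding, otherwise a bug would be written up as a negative result.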

40.4 Key Results

40.4.1 FARS-100 Headline Metrics

The FARS-100 run began at 10:00 PM EST on February 12, 2026 and completed after 228 hours, 28 minutes, and 33 seconds of continuous, unattended operation. The following metrics are reported by Analemma through the blog post and live dashboard (source: blog post, February 11, 2026; live dashboard, analemma.ai/fars):

| Metric | Value | Source |
| --- | --- | --- |
| Duration | 228 hours 28 min 33 sec | Blog post / dashboard |
| Hypotheses generated | 244 | Blog post |
| Papers completed | 100 | Blog post / dashboard |
| Hypothesis → paper conversion | 41.0% (100/244) | Computed from blog figures |
| Average time per paper | ~2 hours 17 min | Computed (228h / 100 papers) |
| Total tokens consumed | 11.4 billion | Blog post / media reporting |
| Total cost | ~$104,000 (~¥750,000) | Blog post / media reporting |
| Cost per paper | ~$1,040 | Computed ($104K / 100) |
| Hardware | 160 NVIDIA GPUs | Blog post |
| Human intervention during run | Zero | Blog post |
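The computed rows in the table follow directly from the reported figures. The short script below reproduces them; it uses only the values stated above, and the variable names are chosen for this illustration.

```python
# Reported figures from the FARS-100 run.
run_hours = 228 + 28 / 60 + 33 / 3600   # 228 h 28 min 33 s
hypotheses = 244
papers = 100
total_tokens = 11.4e9
total_cost_usd = 104_000

# Derived metrics.
conversion_rate = papers / hypotheses
minutes_per_paper = run_hours * 60 / papers
tokens_per_paper = total_tokens / papers
cost_per_paper = total_cost_usd / papers

hours_part, minutes_part = divmod(int(minutes_per_paper), 60)

print(f"conversion: {conversion_rate:.1%}")
print(f"time per paper: {hours_part} h {minutes_part} min")
print(f"tokens per paper: {tokens_per_paper:,.0f}")
print(f"cost per paper: ${cost_per_paper:,.0f}")
```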

40.4.2 Quality Assessment

Paper quality was assessed using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards. The Agentic Reviewer achieves Spearman correlation of 0.42 with human reviewers — on par with human inter-reviewer agreement of 0.41 (source: paperreview.ai calibration study):

| Population | Mean Score (ICLR scale) | Source |
| --- | --- | --- |
| FARS-100 papers | 5.05 (range: 3.0–6.3) | Agentic Reviewer evaluation |
| ICLR 2026, all human submissions | 4.21 | Stanford calibration data |
| ICLR 2026, accepted papers | 5.39 | Stanford calibration data |

FARS papers score 0.84 points above the average human submission and 0.34 points below the average accepted paper. The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation. A small number of papers exceeded 6.0.

Caveats on quality comparison. Several important caveats apply. First, FARS papers are short, single-contribution works (4–8 pages), while ICLR submissions are typically full papers (8–10+ pages) with broader scope. Comparing mean scores between these populations conflates paper scope with paper quality. Second, a correlation of 0.42 between the Agentic Reviewer and human reviewers corresponds to only about 18% shared variance ($0.42^2 \approx 0.18$), so most of the variation in human scores is not captured by the automated reviewer, and individual paper scores are correspondingly noisy. Third, the FARS papers are evaluated by the same type of system (an LLM) that produced them, which could introduce systematic biases that would not apply to human-written papers. These scores should be interpreted as indicating that FARS output is in the plausible range for publishable research, not that FARS papers are directly comparable to human ICLR submissions.

40.4.3 Conversion Funnel Analysis

The 41% hypothesis-to-paper conversion rate (100/244) implies that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is actually a high conversion rate compared to human research, where the hypothesis-to-publication rate is typically 5–20%. The higher FARS rate likely reflects three factors: conservative hypothesis generation (testable, incremental hypotheses), inclusion of negative results (failures that humans would discard become papers), and the lower bar of the single-contribution format.

The conversion funnel includes a post-production quality gate: at least 3 senior researchers (5+ years experience each) manually review papers before arXiv submission, and all submissions are explicitly labeled as AI-generated. The number of papers that passed this manual review is not publicly reported.

40.4.4 Token Consumption Analysis

The FARS-100 run consumed 11.4 billion tokens. This yields an average of approximately 114 million tokens per paper — orders of magnitude higher than typical LLM generation tasks and substantially higher than other autoresearch systems. The per-paper consumption can be modeled as the sum of agent contributions:

$$T_{\text{paper}} = T_{\text{ideation}} + T_{\text{planning}} + T_{\text{experiment}} + T_{\text{writing}}$$

where $T_{\text{paper}} \approx 114 \times 10^6$ tokens. Using the estimated token shares from the blog post's reporting:

$$T_{\text{experiment}} \approx 0.70 \times T_{\text{paper}} \approx 80 \times 10^6 \text{ tokens/paper}$$

This enormous per-paper consumption reflects the Experiment Agent's iterative loop of code generation, debugging, GPU job execution, result analysis, and refinement. Each debugging cycle may consume millions of tokens, and a single paper may require 5–15 such iterations. Additionally, some experiments use LLMs as subjects (running inference on the studied model) and as judges (LLM-as-a-Judge evaluation), multiplying token consumption.

| System / Task | Tokens per Task or Paper (est.) | Scale Factor |
| --- | --- | --- |
| Typical chatbot response | ~500 | 1× (baseline) |
| Complex agentic task | ~500,000 | 1,000× |
| AI Scientist paper (estimated) | ~5,000,000 | 10,000× |
| FARS paper | ~114,000,000 | 228,000× |

Note: The AI Scientist token estimate is inferred from its $15 cost and typical API pricing, not officially reported.
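The per-agent token budget and the scale factors in the table can be reproduced from the aggregate figures. The token shares below are the chapter's estimates rather than disclosed values, and the dictionary names are illustrative.

```python
tokens_per_paper = 11.4e9 / 100  # 114 million tokens per paper

# Estimated token share per agent (estimates, not disclosed by FARS).
token_share = {
    "ideation": 0.15,
    "planning": 0.05,
    "experiment": 0.70,
    "writing": 0.10,
}
total_share = sum(token_share.values())

tokens_by_agent = {
    agent: share * tokens_per_paper for agent, share in token_share.items()
}

# Scale factors relative to a ~500-token chatbot response.
baseline_tokens = 500
tokens_per_task = {
    "chatbot response": 500,
    "complex agentic task": 500_000,
    "AI Scientist paper (est.)": 5_000_000,
    "FARS paper": tokens_per_paper,
}
scale_factor = {
    task: tokens / baseline_tokens for task, tokens in tokens_per_task.items()
}

for agent, tokens in tokens_by_agent.items():
    print(f"{agent:<12}{tokens / 1e6:6.1f}M tokens/paper")
for task, factor in scale_factor.items():
    print(f"{task:<28}{factor:>10,.0f}x")
```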

40.5 Cost and Compute Analysis

40.5.1 FARS-100 Cost Decomposition

The total reported cost of ~$104,000 for 100 papers comprises GPU compute and LLM API tokens. FARS does not publish a detailed breakdown, but bounds can be estimated:

$$C_{\text{total}} = C_{\text{GPU}} + C_{\text{tokens}} \approx \$104{,}000$$

where $C_{\text{GPU}}$ is the cost of 160 GPUs for 228 hours, and $C_{\text{tokens}}$ is the inference cost for 11.4 billion tokens across open- and closed-source models. Using typical cloud pricing of $2–3 per GPU-hour for A100/H100-class hardware:

$$C_{\text{GPU}} \approx 160 \times 228 \times \$2.5 \approx \$91{,}200$$

This would leave approximately $12,800 for token costs, which at 11.4B tokens implies an average token price of about $1.12 per million tokens — consistent with a mix of local open-source inference (nearly free on owned hardware) and some closed-source API calls. However, if the GPUs are owned rather than rented, the marginal compute cost is substantially lower (depreciation + electricity), and a larger share of the $104K may be attributable to API tokens. The exact breakdown is not publicly available (source: total cost figure from blog post and 机器之心 reporting; GPU count from blog post; decomposition is author estimation).
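Because the split depends heavily on the assumed GPU-hour price, it is worth running the decomposition across the quoted $2–3 range. The sketch below does so; the rental prices are assumptions, and `decompose` is a helper defined only for this example.

```python
gpu_count = 160
run_hours = 228
total_cost_usd = 104_000
total_tokens_millions = 11_400  # 11.4 billion tokens

gpu_hours = gpu_count * run_hours  # 36,480 GPU-hours


def decompose(price_per_gpu_hour: float) -> dict:
    """Split the reported total into GPU rental and an implied token budget."""
    gpu_cost = gpu_hours * price_per_gpu_hour
    token_budget = total_cost_usd - gpu_cost
    return {
        "gpu_cost": gpu_cost,
        "token_budget": token_budget,
        "usd_per_million_tokens": token_budget / total_tokens_millions,
    }


# Assumed cloud rental prices spanning the quoted range.
scenarios = {price: decompose(price) for price in (2.0, 2.5, 3.0)}

# GPU-hour price at which rental alone would consume the whole budget.
break_even_price = total_cost_usd / gpu_hours

for price, row in scenarios.items():
    print(
        f"${price:.2f}/GPU-h: GPU ${row['gpu_cost']:,.0f}, "
        f"tokens ${row['token_budget']:,.0f}, "
        f"${row['usd_per_million_tokens']:.2f}/M tokens"
    )
print(f"break-even GPU price: ${break_even_price:.2f}/GPU-h")
```

At $3 per GPU-hour the implied token budget turns negative, so the rental-pricing reading of the $104K total only holds for effective GPU prices below roughly $2.85 per hour. This supports the reading above that owned hardware or discounted compute is involved.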

40.5.2 Cost Comparison Across Autoresearch Systems

| System | Cost/Paper | Hardware | Time/Paper | Experimental Depth |
| --- | --- | --- | --- | --- |
| AI Scientist (2024) | ~$15 | API-only | Hours | Minimal (no GPU experiments) |
| AI Scientist v2 (2025) | ~$15–20 | API-only | Hours | Low–moderate |
| Karpathy autoresearch (2026) | ~$18 | 1 GPU | ~8 hours | Moderate (single GPU) |
| FARS (2026) | ~$1,040 | 160 GPUs | ~2.3 hours | High (GPU training + inference) |
| DeepScientist (2026) | Not reported | 20,000+ GPU-hours total | Days–weeks | Very high (frontier-pushing) |

FARS's $1,040 per paper is ~70× more expensive than AI Scientist's $15 per paper, but this comparison is misleading. AI Scientist produces papers with minimal experimental depth — no GPU training, no real model fine-tuning, no large-scale inference. FARS executes actual GPU-intensive experiments. A more appropriate comparison is cost per unit of experimental work: a single FARS paper at $1,040 may contain the experimental equivalent of what a human researcher would spend $10,000–$30,000 on, yielding a 10–29× cost advantage.

40.5.3 Scaling Economics

If FARS operated continuously for one year, projections based on observed throughput (acknowledging these assume linear scaling and constant quality, which may not hold):

$$N_{\text{annual}} = \frac{365 \times 24}{2.28} \approx 3{,}842 \text{ papers/year}$$

At approximately 4 papers per researcher per year, producing 3,842 papers would require ~960 human researchers. At a fully loaded cost of $200K per researcher:

$$\frac{C_{\text{human}}}{C_{\text{FARS}}} = \frac{960 \times \$200K}{\$5M} \approx 38\times$$

where $C_{\text{FARS}} \approx \$5M$ is the estimated annual operating cost (GPU + API). Extrapolating the observed ~$1,040 per paper to 3,842 papers gives about $4.0M, so the $5M figure is on the conservative side. This 38× cost advantage drives the economic case for automated research infrastructure. However, the comparison again carries the caveat that FARS short papers and typical human papers are not directly commensurable in scope.
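The projection can be checked step by step. The inputs below are the assumptions stated in this section (4 papers per researcher per year, $200K per researcher, $5M annual operating cost); the alternative cost obtained by extrapolating the observed $1,040 per paper is included for comparison and is not a reported figure.

```python
hours_per_paper = 228 / 100                 # observed average in FARS-100
papers_per_year = 365 * 24 / hours_per_paper

# Human-equivalent staffing under the section's assumptions.
papers_per_researcher = 4
cost_per_researcher_usd = 200_000
researchers_needed = papers_per_year / papers_per_researcher
human_cost_usd = researchers_needed * cost_per_researcher_usd

# Two views of FARS's annual cost.
fars_cost_assumed_usd = 5_000_000                   # estimate used in the text
fars_cost_extrapolated_usd = papers_per_year * 1_040  # observed cost per paper

ratio_assumed = human_cost_usd / fars_cost_assumed_usd
ratio_extrapolated = human_cost_usd / fars_cost_extrapolated_usd

print(f"papers/year: {papers_per_year:,.0f}")
print(f"researchers needed: {researchers_needed:,.0f}")
print(f"human cost: ${human_cost_usd / 1e6:,.0f}M")
print(f"advantage at $5M: {ratio_assumed:.0f}x")
print(f"advantage at extrapolated cost: {ratio_extrapolated:.0f}x")
```

Both ratios inherit the linear-scaling and constant-quality assumptions noted above, so they are better read as orders of magnitude than as forecasts.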

40.6 Reproducibility and Transparency

40.6.1 Transparency Protocol

FARS's transparency is unprecedented in the autoresearch space — and so is its opacity, in different dimensions. The system provides:

  • Live dashboard (analemma.ai/fars): Real-time observation of the running system
  • Public experiment repositories (gitlab.com/fars-a): Code, data, and results for each paper
  • Published papers (analemma.ai/papers/): All completed papers available online
  • Independent quality evaluation: Scores from Stanford's Agentic Reviewer, which anyone can re-run

However, FARS itself — the system code, agent prompts, coordination mechanisms, and infrastructure — is proprietary. This creates an inverted reproducibility profile: the process is unusually transparent (live public operation), but the system is unusually opaque (closed source). Researchers can see that FARS works and inspect its outputs, but cannot build their own version.

| Criterion | Rating | Notes |
| --- | --- | --- |
| System reproducibility | Low | Proprietary code, closed architecture |
| Experiment reproducibility | Medium–High | Individual experiment repos are public on GitLab |
| Result verification | Medium | Papers and scores independently evaluable |
| Process transparency | High | Live dashboard, public GitLab activity |
| Hardware accessibility | Low | 160 GPUs required for full replication |

40.6.2 Safety and Review Pipeline

FARS employs a deliberately conservative publication pipeline. All 100 papers produced by the automated system undergo evaluation by the Agentic Reviewer, followed by manual review by at least 3 senior researchers with 5+ years experience each. Only papers passing manual review are submitted to arXiv, and all submissions are explicitly labeled as AI-generated. This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work.

40.7 Memory, Learning, and Limitations

40.7.1 Memory Architecture

FARS's memory is the shared filesystem itself — a pragmatic choice that merges workspace and persistent memory. Each project maintains its own directory with structured subdirectories for ideation outputs, plans, experiment code and results, and the final paper. Agents access the filesystem as an external, persistent memory that supplements their finite context windows.
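The filesystem-as-memory pattern can be illustrated with a small sketch. The directory names in `STAGE_DIRS`, the JSON artifact format, and all function names are assumptions made for illustration; the source only states that each project has structured subdirectories for ideation outputs, plans, experiment code and results, and the final paper.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# ASSUMPTION: hypothetical stage names; FARS's real layout is not public.
STAGE_DIRS = ("ideation", "plan", "experiments", "paper")


def init_project(root: Path, project_id: str) -> Path:
    """Create one directory per project with a subdirectory per stage."""
    project = root / project_id
    for stage in STAGE_DIRS:
        (project / stage).mkdir(parents=True, exist_ok=True)
    return project


def write_artifact(project: Path, stage: str, name: str, payload: dict) -> Path:
    """Persist an agent's output so later agents can read it from disk."""
    path = project / stage / name
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def read_artifact(project: Path, stage: str, name: str) -> dict:
    """Load a previously written artifact back into an agent's context."""
    return json.loads((project / stage / name).read_text(encoding="utf-8"))


def completed_stages(project: Path) -> List[str]:
    """Recover pipeline state from disk: a stage counts once it holds any file."""
    return [s for s in STAGE_DIRS if any((project / s).iterdir())]


with tempfile.TemporaryDirectory() as tmp:
    proj = init_project(Path(tmp), "project_0001")
    write_artifact(proj, "ideation", "hypothesis.json", {"hypothesis": "H1"})
    print(completed_stages(proj))
```

Because state lives on disk rather than in any agent's context window, a restarted agent can determine where a project stands by listing directories.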

However, FARS does not appear to maintain cross-project memory — no explicit knowledge base, skill library, or failed-hypothesis registry. Each project is treated independently. If the system generates hypothesis $H_1$ for project $P_1$ and discovers it fails, there is no mechanism to prevent generating a similar hypothesis $H_1'$ for project $P_2$. Over 244 hypotheses, some redundancy is likely.
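To make the missing mechanism concrete, the sketch below shows what a minimal failed-hypothesis registry might look like. FARS does not appear to contain anything of the kind; the class, the token-overlap (Jaccard) similarity measure, and the 0.6 threshold are all hypothetical choices, and a real system would more plausibly use semantic embeddings.

```python
import re
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a hypothesis statement."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a crude stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class FailedHypothesisRegistry:
    """Hypothetical cross-project store of hypotheses that did not pan out."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold  # ASSUMPTION: arbitrary illustrative cutoff
        self._failed: List[str] = []

    def record_failure(self, hypothesis: str) -> None:
        self._failed.append(hypothesis)

    def similar_failure(self, hypothesis: str) -> Optional[str]:
        """Return the first recorded failure that resembles the candidate, if any."""
        for old in self._failed:
            if jaccard(old, hypothesis) >= self.threshold:
                return old
        return None


registry = FailedHypothesisRegistry()
registry.record_failure("Dropout improves robustness of small transformers on noisy labels")
h1_prime = "Dropout improves robustness of small transformers on label noise"
print(registry.similar_failure(h1_prime) is not None)
```

An ideation agent could consult such a registry before committing a project slot to $H_1'$, at the cost of introducing shared state between otherwise independent projects.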

40.7.2 No Cross-Project Learning

This is a fundamental architectural distinction from evolutionary systems covered in Parts P03–P06 of this survey. Evolutionary systems explicitly learn: populations improve over generations because selection pressure retains good solutions. FARS's pipeline does not learn — each project starts fresh from a research direction.

# Pseudocode — no public implementation available
# Contrasts the FARS pipeline pattern with evolutionary learning loops

# FARS: Linear pipeline — each project independent
def fars_pipeline(directions: list[str]) -> list[Paper]:
    papers = []
    for direction in directions:
        hypothesis = ideation_agent(direction)    # No memory of prior projects
        plan = planning_agent(hypothesis)
        results = experiment_agent(plan)           # Cannot reuse prior code
        paper = writing_agent(results)
        papers.append(paper)
    return papers  # No feedback to ideation from outcomes

# Evolutionary system: Learning loop — population improves
def evolutionary_loop(seed: Program, task: Task, max_generations: int) -> Program:
    population = initialize(seed)
    for generation in range(max_generations):
        parents = select(population)               # Selection pressure
        children = mutate(parents)                  # Variation
        scores = evaluate(children, task)
        population = update(population, children, scores)  # Learning!
    return best(population)

This is both a strength and a limitation. Without cross-project learning, there is no risk of premature convergence and parallelization is trivial, but FARS also cannot build on its own discoveries or avoid repeating mistakes.

40.7.3 Limitations

Several significant limitations emerge from the FARS design and deployment:

AI-only research domain. FARS currently operates exclusively in AI/ML research — the "AI-for-AI" paradigm. This is a pragmatic choice (computational experiments provide readily available evaluation signals), but it means FARS cannot address biology, physics, chemistry, or any domain requiring physical experiments, human evaluation, or user studies.

No iterative refinement within projects. The linear pipeline does not support looping back from failed experiments to revised hypotheses within the same project. Systems like CycleResearcher and AI Scientist v2 employ iterative review-revision cycles that can improve a paper through multiple rounds. FARS trades this flexibility for throughput.

Proprietary system. The closed-source nature limits the research community's ability to verify architectural claims, reproduce the system, or build upon it. Individual experiment repos are public, but the orchestration layer is not.

Scale requirements. The 160-GPU cluster and the cost of $104K per 100 papers place FARS beyond the reach of most academic labs. This is research infrastructure for well-funded organizations, not a tool for individual researchers.

Quality ceiling. The mean score of 5.05 is below the ICLR acceptance threshold of 5.39. While FARS produces work that is above average for human submissions, it has not yet demonstrated the ability to consistently produce acceptance-quality work by conference standards. Whether this matters depends on whether one accepts FARS's philosophical premise that conference acceptance is not the right success metric.

Scope and depth. Short, single-contribution papers cannot achieve the depth, synthesis, or multi-faceted analysis of a full human research paper. FARS papers are best compared to individual experiments within a human paper, not to complete papers.

40.8 Comparative Analysis

40.8.1 System Framing Comparison

FARS's pipeline framing positions it uniquely among autoresearch paradigms. The following diagram illustrates how different systems frame the automated research problem:

[Figure: Autoresearch System Framings]

  • Search-Based. Goal: find optimal solution. Metaphor: landscape exploration. Bottleneck: search efficiency. Examples: AlphaEvolve, FunSearch, AI Scientist v2 (tree search). Properties: learns across iterations; low throughput, high depth; population improves.
  • Dialogue-Based. Goal: simulate a researcher. Metaphor: conversation. Bottleneck: LLM capability. Examples: AI Scientist, Zochi, CycleResearcher. Properties: iterative refinement; human-like output; quality-focused.
  • Pipeline-Based. Goal: throughput + reliability. Metaphor: assembly line. Bottleneck: coordination. Examples: FARS, AI-Researcher (partial). Properties: no cross-project learning; high throughput, breadth; projects independent.

40.8.2 Quantitative Cross-System Comparison

The following comparison must be interpreted with care: different systems target different objectives, operate in different domains, and define "paper" differently. Direct numerical comparison across these dimensions risks false precision.

| Metric | AI Scientist | AI Scientist v2 | DeepScientist | FARS |
|---|---|---|---|---|
| Papers produced | ~10 (demo) | ~15 (demo) | ~1,100 validated | 100 |
| Cost per paper | ~$15 | ~$15–20 | Not reported | ~$1,040 |
| GPU experiments | No | Limited | Yes (20K+ GPU-hrs) | Yes (160 GPUs) |
| Unattended operation | No | No | Partially | Yes (228 hrs) |
| Negative results | Discarded | Discarded | Discarded | Published |
| Open source | Yes | Yes | Partially | No (outputs public) |
| Quality (ICLR scale) | ~3.5–4.5 | 4.5–6.3 | Not rated | 5.05 mean |
| Review methodology | Self-review | Automated + peer | Expert judge | Agentic Reviewer + manual |

Note: AI Scientist quality scores are approximate estimates from published examples. DeepScientist measured success by exceeding human SOTA on frontier tasks rather than by review scores. Cross-system quality comparisons should be treated as indicative, not definitive, due to differences in evaluation protocol, paper scope, and domain.

40.8.3 The Depth–Breadth Trade-off

FARS occupies the breadth end of a depth–breadth spectrum in autoresearch. DeepScientist consumed 20,000+ GPU-hours to produce approximately 1,100 experimentally validated ideas, some of which exceeded human state-of-the-art on frontier tasks. FARS consumed at most about 36,500 GPU-hours (160 GPUs × 228 hours = 36,480) to produce 100 complete papers at above-average quality. The strategies are complementary:

  • DeepScientist: Few directions explored to maximum depth. Goal: genuine scientific breakthroughs.
  • FARS: Many directions explored to moderate depth. Goal: comprehensive coverage of a research space.

Neither approach dominates. A research organization might use a FARS-like system to rapidly map a problem space, then deploy a DeepScientist-like system to push the most promising directions to their limits.
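The FARS throughput figures used in this comparison follow from simple arithmetic on the reported deployment numbers. The short script below reproduces them; the variable names are illustrative, and the GPU-hour total assumes every GPU was busy for the whole run, so it should be read as an upper bound.

```python
# Reported deployment figures for the FARS run
GPUS = 160
WALL_CLOCK_HOURS = 228
PAPERS = 100
TOTAL_COST_USD = 104_000
TOTAL_TOKENS = 11.4e9

# ASSUMPTION: full utilization of all GPUs, so this is an upper bound
gpu_hours_upper_bound = GPUS * WALL_CLOCK_HOURS

gpu_hours_per_paper = gpu_hours_upper_bound / PAPERS
cost_per_paper_usd = TOTAL_COST_USD / PAPERS
wall_clock_hours_per_paper = WALL_CLOCK_HOURS / PAPERS
tokens_per_paper = TOTAL_TOKENS / PAPERS

print(f"GPU-hours (upper bound): {gpu_hours_upper_bound:,}")
print(f"GPU-hours per paper:     {gpu_hours_per_paper:.1f}")
print(f"Cost per paper:          ${cost_per_paper_usd:,.0f}")
print(f"Wall-clock hrs/paper:    {wall_clock_hours_per_paper:.2f}")
print(f"Tokens per paper:        {tokens_per_paper:,.0f}")
```

The per-paper averages hide variation between projects, which the source does not report.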

40.9 Broader Implications

40.9.1 Research as Infrastructure

FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled researchers. In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously. This shift parallels transitions in other domains: manual QA to automated CI/CD in software, artisan production to assembly lines in manufacturing, manual analysis to automated pipelines in data science.

40.9.2 The Minimal Composable Knowledge Unit

FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a blog post. This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. Each FARS paper is easy to review (one thing to evaluate), easy to cite precisely (one clear finding), and incentivizes decomposition over bundling.

Whether the broader scientific community would accept this format is an open question. The academic incentive structure rewards comprehensive, multi-contribution papers at top venues, and FARS is not designed to operate within that structure.

40.9.3 Publication Bias and the Scientific Record

If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. In human research, an estimated 80% of experimental knowledge is lost to publication bias. FARS's 41% hypothesis-to-paper conversion rate — including papers whose entire contribution is documenting failure — represents a structural improvement in knowledge preservation.
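The 41% figure is the ratio of completed papers to generated hypotheses, as the short check below shows. The variable names are illustrative, and the source does not say what became of the hypotheses that did not reach the paper stage.

```python
HYPOTHESES_GENERATED = 244
PAPERS_COMPLETED = 100

conversion_rate = PAPERS_COMPLETED / HYPOTHESES_GENERATED
not_converted = HYPOTHESES_GENERATED - PAPERS_COMPLETED  # fate not reported in source

print(f"Conversion rate: {conversion_rate:.1%}")  # prints "Conversion rate: 41.0%"
print(f"Hypotheses without a paper: {not_converted}")
```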

40.9.4 The AI-for-AI Feedback Loop

FARS operates in the AI-for-AI domain: AI systems researching AI systems. Some of its papers improve LLM training, agent design, or evaluation methods. If these improvements feed back into FARS's own components (directly or through the broader research ecosystem), the system creates a positive feedback loop in AI capability. This recursive potential is both the most exciting and most concerning implication of industrial-scale autoresearch.

40.10 Connections to Evolutionary Systems

Although FARS is a pipeline system rather than an evolutionary one, its design has important connections to the evolutionary algorithm discovery systems covered in earlier parts of this survey. The shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems — a lesson applicable to evolutionary systems that often use more complex event buses and databases. FARS's systematic production of negative results has no direct analogue in evolutionary systems, which discard failed candidates; an evolutionary system that preserved and analyzed failures could improve search efficiency. Conversely, FARS's lack of cross-project learning is precisely the gap that evolutionary approaches fill: selection pressure, population management, and progressive improvement are absent from FARS and could substantially improve its operation in future versions.

A hybrid architecture — FARS-like pipeline stages with evolutionary learning across projects — could combine throughput with progressive improvement. The Ideation Agent could maintain a population of hypothesis templates refined by selection pressure from experimental outcomes. The Experiment Agent could accumulate a skill library of reusable experimental techniques. Such extensions would transform FARS from a pipeline into a learning pipeline, maintaining throughput advantages while adding the improvement dynamics that make evolutionary systems powerful.
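One way to picture the Ideation Agent extension is a small population of hypothesis templates whose sampling probability tracks their experimental success rate. The sketch below is speculative and not part of FARS: `Template`, `select`, `run_projects`, and the Laplace-smoothed fitness are hypothetical, and only selection is shown (no mutation or recombination of templates).

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Template:
    """Hypothetical hypothesis template with a running outcome tally."""
    text: str
    successes: int = 0
    trials: int = 0

    @property
    def fitness(self) -> float:
        # Laplace-smoothed success rate, so untried templates are still sampled
        return (self.successes + 1) / (self.trials + 2)


def select(population: List[Template], rng: random.Random) -> Template:
    """Fitness-proportional selection: better templates are drawn more often."""
    weights = [t.fitness for t in population]
    return rng.choices(population, weights=weights, k=1)[0]


def record_outcome(template: Template, succeeded: bool) -> None:
    """Feed an experimental outcome back into the template's statistics."""
    template.trials += 1
    template.successes += int(succeeded)


def run_projects(population: List[Template],
                 run_project: Callable[[Template], bool],
                 n_projects: int, seed: int = 0) -> List[Template]:
    """Run projects one by one, updating fitness after each; return ranked templates."""
    rng = random.Random(seed)
    for _ in range(n_projects):
        template = select(population, rng)
        record_outcome(template, run_project(template))
    return sorted(population, key=lambda t: t.fitness, reverse=True)


pool = [Template("Does technique A transfer to setting B?"),
        Template("Does scaling component C improve metric D?")]
# Stand-in for the plan/experiment/write stages: pretend only the first template works
ranked = run_projects(pool, lambda t: t.text.startswith("Does technique"), n_projects=50)
print([(t.text, round(t.fitness, 2)) for t in ranked])
```

The feedback step makes each project depend on earlier outcomes, which is exactly the coupling the current pipeline avoids; a hybrid would need to batch these updates to keep projects parallelizable.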

Summary

FARS is the first continuously operating automated research system demonstrated at industrial scale: 228 hours of fully unattended operation on 160 GPUs, producing 100 short research papers from 244 generated hypotheses at a cost of ~$1,040 per paper and a mean quality score of 5.05 on the ICLR review scale — above the 4.21 average for human submissions.

Main contribution to the field: FARS proves that automated research can function as reliable production infrastructure rather than a proof-of-concept demonstration, and introduces two significant innovations: (1) the shared filesystem as sole inter-agent coordination mechanism, enabling crash-safe, human-inspectable, trivially parallelizable multi-agent operation; and (2) the systematic publication of negative results as first-class research contributions, addressing one of science's most persistent structural problems.

What a researcher should know: FARS trades depth for breadth and learning for throughput. It does not improve over time (no cross-project learning), operates only in AI/ML research domains, and is proprietary. Its philosophical challenge to academic publishing conventions — that the paper format is a historical artifact, not a necessary property of knowledge production — may prove more influential than its technical architecture. The system's outputs (papers, code repositories, quality scores) are publicly available for independent evaluation at analemma.ai and gitlab.com/fars-a, even though the system itself cannot be reproduced.