Introduced: 2026-02

Score: 8.12/10 (Draft)

Chapter 40

FARS: Fully Automated Research System

Part P07: Autonomous Research Systems

FARS represents a paradigm shift in automated research: from proof-of-concept agent demonstrations to industrial-scale, continuously operating research infrastructure. Deployed live on a 160-GPU cluster in February 2026, FARS produced 100 short research papers in 228 hours of fully unattended operation, consuming 11.4 billion tokens at a total cost of approximately $104,000. Unlike prior autoresearch systems that optimized for human-like paper formatting and peer-review acceptance, FARS rejects academic publishing conventions entirely, instead optimizing for throughput, reliability, and the systematic production of minimal, composable knowledge units — including negative results.

Key Contribution

FARS is the first publicly demonstrated, continuously operating automated research system at industrial scale. Its core contribution is not algorithmic but architectural and philosophical: it shows that automated research can function as reliable production infrastructure rather than a one-off demonstration, and it challenges the assumption that academic paper formatting is a necessary property of knowledge production. The system's 228-hour unattended deployment, producing 244 hypotheses and 100 papers with a mean automated-review score of 5.05 on the ICLR scale (above the 4.21 average that the same reviewer assigns to human submissions), establishes a concrete baseline for automated research throughput, cost, and quality.

40.1 Overview and Motivation

40.1.1 The Industrial Research Thesis

FARS was developed by Analemma (日行迹智能科技), a Shanghai-based startup founded in March 2025 by Dr. Sun Tianxiang (孙天祥), a Fudan University PhD and principal developer of MOSS, one of the earliest Chinese open-source conversational language models. The core team draws from Fudan's MOSS group and Shanghai AI Laboratory's InternLM team — researchers who built the very models that autoresearch systems consume. This insider perspective informs FARS's pragmatic design: the system is built by people who understand both the capabilities and the limitations of LLMs as research tools.

The speed of execution is notable: approximately 11 months from company founding to a 160-GPU live deployment producing 100 papers. The company secured an angel round of several hundred million RMB from investors including Sequoia Capital China, reflecting significant confidence in the automated research infrastructure thesis.

FARS's central philosophical claim distinguishes it from every prior system in this survey: the output of a research system should be research contributions, not papers conforming to academic formatting conventions. Where AI Scientist (Chapter 30) sought to pass the Turing test of academic publishing, and DeepScientist (Chapter 37) sought frontier-pushing scientific depth, FARS asks what an optimally efficient research system looks like when freed from the constraints of human publishing conventions. The blog post articulates this through a first-principles critique of human-centered research:

Structural Problem	Root Cause	FARS Response
High entry barrier	Years of training to become a researcher	Automated agents require no training period
Publication bias	Only "successful" experiments get published	Every completed experiment — positive or negative — produces output
Format overhead	Conforming to venue-specific formatting rules	No structural constraints beyond clarity
Length inflation	Pressure to fill pages to meet minimums	Papers are as long as they need to be, no more
Supply constraint	Limited number of human researchers	System runs 24/7, parallelizes across projects

40.1.2 Five Design Principles

From the blog post and observable outputs, FARS operates on five explicit principles:

Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container.
Single-scoped contributions. Each paper addresses exactly one research question. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.
Negative results are knowledge. A well-conducted experiment showing something does not work is as valuable as one showing something does. FARS explicitly reports negative results — for example, "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity."
No unnecessary constraints. Papers are not padded to meet length minimums and do not conform to venue-specific templates.
Scale reveals truth. FARS was designed to produce 100 papers because individual outputs vary in quality and a handful of examples can be cherry-picked. The run is meant to be judged on the aggregate, where the signal emerges, rather than on its best cases.

40.1.3 Position in the Autoresearch Landscape

FARS explicitly positions itself as a successor to six prior systems. Where each addressed a subset of the autoresearch problem — AI Scientist demonstrating feasibility, Zochi achieving acceptance-level quality, DeepScientist pushing frontier depth — FARS addresses industrial-scale throughput with continuous autonomous operation. It is the difference between building a faster horse and building an assembly line.

System	Year	Primary Contribution	Framing
AI Scientist	2024	First end-to-end: idea → code → paper → review	Dialogue-based
CycleResearcher	2024	Iterative review-revision quality improvement	Feedback loop
Zochi	2025	First AI papers accepted at workshops (avg 7.67)	Acceptance-optimized
AI Scientist v2	2025	Tree search methodology, double-blind review pass	Search-based
AI-Researcher	2025	Four-module architecture, NeurIPS Spotlight	Modular pipeline
DeepScientist	2026	Exceeded human SOTA on 3 frontier tasks	Depth-first
FARS	2026	228h continuous operation, 100 papers, live public deployment	Pipeline production

The framing distinction is important. FARS treats automated research as a pipeline production problem — not a search problem (contrast with evolutionary systems like AlphaEvolve), not a dialogue problem (contrast with AI Scientist's multi-turn LLM conversation), and not a depth-first discovery problem (contrast with DeepScientist). The pipeline metaphor has specific implications for how the system is designed, optimized, and evaluated.

40.2 Architecture

40.2.1 Four-Agent Sequential Pipeline

FARS employs a four-agent sequential pipeline: Ideation Agent → Planning Agent → Experiment Agent → Writing Agent. Each agent reads from and writes to a shared filesystem, which serves as the sole coordination mechanism. There is no direct agent-to-agent communication — the filesystem boundary is the API contract.

40.2.2 The Shared Filesystem as Coordination Protocol

The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This is not merely a storage layer — it is the entire inter-agent protocol. Most multi-agent systems use message queues, event buses, or direct API calls. FARS's choice has deep advantages that the blog post and observable behavior make clear:

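Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state serves as a checkpoint from which unfinished projects can resume, provided each artifact is written completely before the next agent reads it.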
Natural handoffs. Agent A writes files; Agent B reads them. The filesystem boundary is the API contract — no schema definitions, no serialization overhead.
Human inspectability. Any researcher can inspect exactly what each agent produced, enabling debugging and trust-building.
Trivial parallelism. Multiple projects run in parallel via separate directories. No lock contention, no resource arbitration beyond GPU scheduling.
Decoupled evolution. Each agent can be upgraded independently, provided file formats remain compatible.

This pattern is well-established in systems engineering (Unix pipes, Plan 9, microservices via shared storage) but is novel in the autoresearch space. AI Scientist uses a single LLM conversation thread, DeepScientist uses knowledge graphs and databases, and evolutionary systems like those in Parts P03–P05 use in-memory population databases. The following table contrasts these coordination patterns:

Pattern	Used By	Advantages	Disadvantages
Single LLM context	AI Scientist	Simple, coherent	Context window limits, no parallelism
Knowledge graph	DeepScientist	Rich relationships	Complex queries, schema overhead
In-memory population	Evolutionary systems	Fast, structured	Volatile, single-node bottleneck
Message queue	AI-Researcher	Decoupled, ordered	Needs broker, no natural persistence
Shared filesystem	FARS	Universal, durable, inspectable	Unstructured unless conventions enforced
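
The file-based handoff described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FARS's implementation: the directory layout, the file name `hypothesis.json`, and the helpers `write_artifact` and `read_artifact` are all hypothetical. The sketch writes each artifact to a temporary file and renames it into place, so a downstream agent sees either no file or a complete one.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_artifact(workspace: Path, name: str, payload: dict) -> Path:
    """Atomically publish one artifact into a project workspace (hypothetical helper)."""
    workspace.mkdir(parents=True, exist_ok=True)
    target = workspace / name
    # Write to a temporary file in the same directory, then rename it into
    # place. On POSIX systems os.replace is atomic, so a reader never
    # observes a half-written artifact.
    fd, tmp_path = tempfile.mkstemp(dir=workspace, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, target)
    return target


def read_artifact(workspace: Path, name: str) -> Optional[dict]:
    """Return the predecessor's artifact, or None if it is not published yet."""
    target = workspace / name
    if not target.exists():
        return None
    return json.loads(target.read_text(encoding="utf-8"))


# Usage: the "ideation" side writes, the "planning" side reads.
with tempfile.TemporaryDirectory() as root:
    project_dir = Path(root) / "projects" / "FA0001"
    write_artifact(project_dir, "hypothesis.json", {"question": "Does X improve Y?"})
    handoff = read_artifact(project_dir, "hypothesis.json")
```

Because each project lives in its own directory, running many projects in parallel needs no locking in this sketch. The cost is that the "schema" exists only as a naming convention that both sides must honor.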

40.2.3 Key Architectural Decisions

Several design decisions reported in the blog post and observable from the system's behavior merit analysis:

Sequential pipeline over graph search. Research has a natural linear flow — idea, plan, experiment, write — and FARS exploits this structure. The pipeline is simpler to coordinate, debug, and reason about than a graph-based approach. The trade-off is rigidity: FARS cannot loop back from failed experiments to revised hypotheses within the same project, unlike iterative systems such as CycleResearcher.

GPU cluster as tools, not raw access. The 160-GPU cluster is encapsulated as high-level tool interfaces. The Experiment Agent schedules jobs without managing CUDA, drivers, or multi-GPU parallelism. This is analogous to how cloud computing abstracts hardware — the agent reasons about experiments, not infrastructure.

Short papers by design. By targeting 4–8 page single-contribution papers, the Writing Agent's task is substantially simpler than producing full conference papers. This reduces hallucination risk, generation time, and the need for comprehensive related work surveys.

40.3 Core Mechanisms

40.3.1 Pipeline Parallelism and Throughput

FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines and industrial assembly lines. While paper $N$ is being written, paper $N+1$ is being experimented on, paper $N+2$ is being planned, and paper $N+3$ is being ideated. All four stages operate concurrently on different projects.

Let $T_{\text{idea}}$, $T_{\text{plan}}$, $T_{\text{exp}}$, and $T_{\text{write}}$ denote the average time for each pipeline stage. The steady-state throughput of a fully occupied pipeline is:

$$\text{Throughput} = \frac{1}{\max(T_{\text{idea}},\; T_{\text{plan}},\; T_{\text{exp}},\; T_{\text{write}})}$$

where throughput is measured in papers per unit time. The bottleneck stage — the slowest — determines throughput regardless of how fast the other stages operate. Based on the reported token consumption breakdown (~70% for the Experiment Agent) and the observed average of ~2.3 hours per paper, we can estimate:

Stage	Estimated Duration	Token Share	Bottleneck?
Ideation	~20–30 min	~15%	No
Planning	~10–15 min	~5%	No
Experiment	~90–120 min	~70%	Yes
Writing	~20–30 min	~10%	No

Note: Stage durations and token-share breakdowns are author estimates based on reported aggregate figures. FARS does not publicly disclose per-stage timing.

The observed completion interval of ~137 minutes per paper (228 hours / 100 papers) is longer than the estimated bottleneck duration of 90–120 minutes. This gap likely reflects pipeline filling and draining overhead, failed hypotheses that consume resources without producing papers (59% of hypotheses did not become papers), and GPU contention during peak parallel execution.
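
The bottleneck arithmetic can be checked with a short script. It is a sketch under stated assumptions: the stage durations are the estimates from the table above, and taking the midpoint of each range is an extra simplification made only for this example.

```python
# Estimated stage durations in minutes (author estimates, not reported figures).
STAGE_MINUTES = {
    "ideation": (20, 30),
    "planning": (10, 15),
    "experiment": (90, 120),
    "writing": (20, 30),
}


def bottleneck(stages):
    """Return (stage name, midpoint duration in minutes) of the slowest stage."""
    midpoints = {name: (low + high) / 2 for name, (low, high) in stages.items()}
    slowest = max(midpoints, key=midpoints.get)
    return slowest, midpoints[slowest]


def papers_per_hour(stages):
    """Steady-state throughput of one fully occupied pipeline: 1 / max(T_stage)."""
    _, minutes = bottleneck(stages)
    return 60.0 / minutes


observed_minutes_per_paper = 228 * 60 / 100   # interval between completed papers
slowest_stage, slowest_minutes = bottleneck(STAGE_MINUTES)
gap_minutes = observed_minutes_per_paper - slowest_minutes

print(f"bottleneck: {slowest_stage} at {slowest_minutes:.0f} min")
print(f"ideal throughput: {papers_per_hour(STAGE_MINUTES):.2f} papers/hour")
print(f"observed interval: {observed_minutes_per_paper:.1f} min, gap: {gap_minutes:.1f} min")
```

With midpoint estimates the gap is about half an hour per paper, which is the portion the overhead, failed hypotheses, and GPU contention would have to account for.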

The following pseudocode illustrates the pipeline scheduling logic:

# Pseudocode — no public implementation available
# Illustrates FARS pipeline parallelism as described in the blog post

import asyncio
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    IDEATION = "ideation"
    PLANNING = "planning"
    EXPERIMENT = "experiment"
    WRITING = "writing"
    COMPLETE = "complete"

@dataclass
class Project:
    project_id: str
    stage: Stage
    workspace_path: str  # path in shared filesystem

async def run_pipeline(
    research_directions: list[str],
    workspace_root: str,
    max_concurrent: int = 20,
) -> list[Project]:
    """
    Pipeline scheduler: launches projects and advances them
    through stages as agents complete their work.
    Each agent reads/writes from the project's workspace directory.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    completed = []

    async def process_project(direction: str, idx: int) -> Project:
        async with semaphore:
            project = Project(
                project_id=f"FA{idx:04d}",
                stage=Stage.IDEATION,
                workspace_path=f"{workspace_root}/projects/FA{idx:04d}",
            )
            # Each stage reads predecessor's output from filesystem,
            # writes its own output, then advances the stage marker.
            for stage_fn in [ideation, planning, experiment, writing]:
                success = await stage_fn(project)
                if not success:
                    break  # Project abandoned (failed review, etc.)
            else:
                project.stage = Stage.COMPLETE
                completed.append(project)
            return project

    # Launch all projects concurrently — pipeline fills naturally
    tasks = [
        process_project(d, i)
        for i, d in enumerate(research_directions)
    ]
    await asyncio.gather(*tasks)
    return completed

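async def ideation(project: Project) -> bool:
    """Reads research direction, writes hypothesis to filesystem."""
    # 1. Conduct literature review (open-access papers, public repos)
    # 2. Generate hypothesis
    # 3. Automated review gate: reject if infeasible/redundant
    # 4. Write hypothesis.md to project workspace
    project.stage = Stage.PLANNING
    return True  # False if hypothesis fails review

async def planning(project: Project) -> bool:
    """Reads hypothesis, writes an experimental plan to filesystem."""
    # 1. Read hypothesis.md from project workspace
    # 2. Choose baselines, metrics, success criteria, resource estimate
    # 3. Write the plan to project workspace
    project.stage = Stage.EXPERIMENT
    return True  # False if no feasible plan can be produced

async def experiment(project: Project) -> bool:
    """Reads plan, executes via GPU tools, writes results."""
    # 1. Read experimental plan from filesystem
    # 2. Generate code for the experiment
    # 3. Schedule GPU jobs via tool interface (not raw hardware)
    # 4. Monitor, debug, iterate as needed
    # 5. Write results + code to filesystem
    project.stage = Stage.WRITING
    return True  # False if experiment fails irrecoverably

async def writing(project: Project) -> bool:
    """Reads results, writes the short paper to filesystem."""
    # 1. Read results and code from project workspace
    # 2. Draft a positive-result or negative-result paper
    # 3. Write the paper to project workspace
    return True  # run_pipeline then marks the project COMPLETE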

40.3.2 The Four Agents in Detail

Ideation Agent. Converts broad research directions into validated, actionable hypotheses. It has access to open-access papers and public GitLab repositories — not just paper abstracts but actual code implementations, a significant advantage over literature-review-only approaches. The agent includes an automated review gate that filters hypotheses for quality and feasibility before forwarding. In the FARS-100 run, it generated 244 hypotheses from which 100 became papers, a 41% conversion rate. The overproduction is by design: the Ideation Agent generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.

Planning Agent. The thinnest component, it bridges abstract hypotheses and concrete experimental protocols. It determines baselines, specifies evaluation metrics and success criteria, and estimates computational resource requirements. Its output is an experimental plan written to the shared filesystem.

Experiment Agent. The dominant component, consuming approximately 70% of total tokens. It generates code, schedules GPU training and inference jobs via tool interfaces, collects and analyzes results, and iterates through debug cycles. The GPU cluster and model inference endpoints are encapsulated as tools — the agent cannot accidentally consume all resources or interfere with other projects. When experiments fail, the agent must determine whether the failure is a system error (retry/debug) or a genuine negative result (proceed to Writing Agent with negative finding).

Writing Agent. Produces the final short paper from experimental results. Unlike other autoresearch writing modules that attempt full conference papers, FARS's writer produces focused 4–8 page papers. Crucially, the Writing Agent has two distinct modes inferred from the published outputs: a positive result mode ("we propose X, which achieves Y improvement") and a negative result mode ("we investigate whether X improves Y; we find it does not, and we analyze why"). The negative result mode is philosophically significant — it requires framing failure as contribution.

40.3.3 GPU Cluster Tool Encapsulation

The decision to expose the 160-GPU cluster as abstract tools rather than raw hardware is architecturally important. The Experiment Agent reasons about experiments, not CUDA versions or job schedulers. This creates a clean separation of concerns:

# Pseudocode — no public implementation available
# Illustrates the tool-encapsulation pattern described in reporting

# The Experiment Agent interacts with GPU resources only through
# high-level tool interfaces. Internal scheduling, multi-GPU
# parallelism, and fault tolerance are handled by the tool layer.

async def run_experiment(
    plan: dict, tools: "ToolInterface", max_attempts: int = 3
) -> dict:
    """
    Experiment Agent's interaction with GPU cluster is mediated
    entirely through tool calls. The agent never manages hardware.

    generate_code_from_plan and debug_and_fix stand in for LLM calls;
    max_attempts is an illustrative bound, not a reported FARS setting.
    """
    # Step 1: Generate training code based on the plan
    code = generate_code_from_plan(plan)

    for _attempt in range(max_attempts):
        # Step 2: Schedule a training job via tool API
        job_id = await tools.schedule_training_job(
            code=code,
            config=plan["training_config"],
            # Resource requirements are declared, not managed
            gpu_count=plan.get("gpu_count", 1),
            max_hours=plan.get("max_hours", 4),
        )

        # Step 3: Monitor execution (tool handles retries, timeouts)
        status = await tools.wait_for_completion(job_id)
        if not status.failed:
            break

        # Agent decides: debug the code or accept as negative result
        if not status.is_code_error:
            # Genuine experimental failure: proceed as negative result
            return {"outcome": "negative", "analysis": status.error_log}

        # Code error: fix the code so the next iteration resubmits the fix
        code = debug_and_fix(code, status.error_log)
    else:
        # Still failing after max_attempts: a system failure, not a finding
        return {"outcome": "system_failure", "analysis": status.error_log}

    # Step 4: Collect results via tool API
    results = await tools.get_job_results(job_id)

    # Step 5: Run evaluation (may use LLM-as-a-Judge via inference tools)
    eval_scores = await tools.run_evaluation(
        model=plan["eval_model"],
        predictions=results["outputs"],
        references=plan["ground_truth"],
    )

    outcome = "positive" if eval_scores["improvement"] > 0 else "negative"
    return {"outcome": outcome, "results": results, "evaluation": eval_scores}

40.3.4 Dual Role of LLMs: Infrastructure and Subject

A distinctive aspect of FARS is the dual role of large language models. At the infrastructure layer, LLMs serve as the reasoning backbone for all four agents — conducting literature review, generating code, composing papers. At the subject layer, LLMs are the experimental subjects being researched — their training procedures, behaviors, and evaluation methods are the focus of many FARS papers.

This creates a recursive structure: LLMs researching LLMs. For this to work, the infrastructure models must be more capable than the subject models — one cannot reliably study a model's behavior using a less capable model as the reasoning engine. FARS has access to both open-source models (run on the 160-GPU cluster) and closed-source models (via API), though the specific models serving as agent backbones are not publicly disclosed.

40.3.5 Negative Result Detection and Publication

FARS's systematic production of negative results requires distinguishing between four experiment outcome types:

Positive result: Proposed method improves on baselines → report improvement.
Negative result: Method does not work → report why it fails (a legitimate research contribution).
Null result: Inconclusive, may need more experiments → may be abandoned or extended.
System failure: Bug or crash → not a research result, requires debugging.

Traditional autoresearch systems treat negative results, null results, and system failures alike as failures and discard them. FARS instead treats a negative result as a publishable contribution. Observed examples include:

"OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement. The contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
"Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate).

In human academia, negative results are systematically suppressed due to publication bias. FARS's willingness to report them represents a structural fix. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value lies not in any individual negative paper but in the aggregate: a comprehensive map of what works and what doesn't.
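
The four-way distinction can be made concrete with a small classifier. This is a hypothetical sketch: FARS's actual decision criteria are not public, and the inputs used here (`noise_band`, `runs_completed`, `runs_planned`) as well as the rule that incomplete evidence maps to a null result are assumptions introduced for illustration.

```python
from enum import Enum


class Outcome(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NULL = "null"
    SYSTEM_FAILURE = "system_failure"


def classify_outcome(job_succeeded, improvement, noise_band, runs_completed, runs_planned):
    """Map an experiment summary to one of the four outcome types.

    improvement: metric gain over the baseline (same units as noise_band).
    noise_band: smallest gain treated as real rather than run-to-run noise.
    """
    if not job_succeeded:
        return Outcome.SYSTEM_FAILURE   # bug or crash: debug, do not report
    if runs_completed < runs_planned:
        return Outcome.NULL             # incomplete evidence: extend or abandon
    if improvement > noise_band:
        return Outcome.POSITIVE         # method beats the baseline beyond noise
    return Outcome.NEGATIVE             # full evidence, no real gain: report why


# A case shaped like the import-autofix example: identical resolved rates.
autofix_like = classify_outcome(
    job_succeeded=True, improvement=0.0, noise_band=0.3,
    runs_completed=3, runs_planned=3,
)
```

The ordering of the checks matters in this sketch: a crash is ruled out before any metric is interpreted, so an infrastructure error can never be written up as a negative finding.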

40.4 Key Results

40.4.1 FARS-100 Headline Metrics

The FARS-100 run began at 10:00 PM EST on February 12, 2026 and completed after 228 hours, 28 minutes, and 33 seconds of continuous, unattended operation. The following metrics are reported by Analemma through the blog post and live dashboard (source: blog post, February 11, 2026; live dashboard, analemma.ai/fars):

Metric	Value	Source
Duration	228 hours 28 min 33 sec	Blog post / dashboard
Hypotheses generated	244	Blog post
Papers completed	100	Blog post / dashboard
Hypothesis → paper conversion	41.0% (100/244)	Computed from blog figures
Average time per paper	~2 hours 17 min	Computed (228h / 100 papers)
Total tokens consumed	11.4 billion	Blog post / media reporting
Total cost	~$104,000 (~¥750,000)	Blog post / media reporting
Cost per paper	~$1,040	Computed ($104K / 100)
Hardware	160 NVIDIA GPUs	Blog post
Human intervention during run	Zero	Blog post
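
The computed rows in the table follow directly from the reported figures. The snippet below reproduces them; variable names are illustrative, and the inputs are the values quoted above.

```python
hypotheses = 244
papers = 100
duration_hours = 228 + 28 / 60 + 33 / 3600   # 228 h 28 min 33 s
total_cost_usd = 104_000

conversion_rate = papers / hypotheses            # fraction of hypotheses that became papers
minutes_per_paper = duration_hours * 60 / papers
cost_per_paper = total_cost_usd / papers

print(f"conversion: {conversion_rate:.1%}")
print(f"time per paper: {int(minutes_per_paper // 60)} h {round(minutes_per_paper % 60)} min")
print(f"cost per paper: ${cost_per_paper:,.0f}")
```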

40.4.2 Quality Assessment

Paper quality was assessed using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards. The Agentic Reviewer achieves Spearman correlation of 0.42 with human reviewers — on par with human inter-reviewer agreement of 0.41 (source: paperreview.ai calibration study):

Population	Mean Score (ICLR scale)	Source
FARS-100 papers	5.05 (range: 3.0–6.3)	Agentic Reviewer evaluation
ICLR 2026 — all human submissions	4.21	Stanford calibration data
ICLR 2026 — accepted papers	5.39	Stanford calibration data

FARS papers score 0.84 points above the average human submission and 0.34 points below the average accepted paper. The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation. A small number of papers exceeded 6.0.

Caveats on quality comparison. Several important caveats apply. First, FARS papers are short, single-contribution works (4–8 pages), while ICLR submissions are typically full papers (8–10+ pages) with broader scope. Comparing mean scores between these populations conflates paper scope with paper quality. Second, a Spearman correlation of 0.42 between the Agentic Reviewer and human reviewers corresponds to only about 18% shared variance in rankings (0.42² ≈ 0.18), so most of the variation in any individual score is not explained by human judgment. Third, the FARS papers are evaluated by the same type of system (an LLM) that produced them, which could introduce systematic biases that would not apply to human-written papers. These scores should be interpreted as indicating that FARS output is in the plausible range for publishable research, not that FARS papers are directly comparable to human ICLR submissions.
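
The score gaps and the strength of the reviewer correlation are easy to verify. Treating the squared rank correlation as a rough measure of shared variance in rankings is an interpretive assumption of this sketch, not something the calibration study states.

```python
fars_mean = 5.05
iclr_all_mean = 4.21
iclr_accepted_mean = 5.39
spearman_rho = 0.42          # Agentic Reviewer vs. human reviewers

gap_vs_all = round(fars_mean - iclr_all_mean, 2)
gap_vs_accepted = round(iclr_accepted_mean - fars_mean, 2)
shared_variance = spearman_rho ** 2      # rough proxy: squared rank correlation
unexplained = 1 - shared_variance

print(gap_vs_all, gap_vs_accepted, round(shared_variance, 2), round(unexplained, 2))
```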

40.4.3 Conversion Funnel Analysis

The 41% hypothesis-to-paper conversion rate (100/244) implies that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is a high conversion rate relative to human research, for which this chapter assumes a hypothesis-to-publication rate of roughly 5–20% (an informal estimate rather than a measured statistic). The higher FARS rate likely reflects three factors: conservative hypothesis generation (testable, incremental hypotheses), inclusion of negative results (failures that humans would discard become papers), and the lower bar of the single-contribution format.

The conversion funnel includes a post-production quality gate: at least 3 senior researchers (5+ years experience each) manually review papers before arXiv submission, and all submissions are explicitly labeled as AI-generated. The number of papers that passed this manual review is not publicly reported.

40.4.4 Token Consumption Analysis

The FARS-100 run consumed 11.4 billion tokens. This yields an average of approximately 114 million tokens per paper — orders of magnitude higher than typical LLM generation tasks and substantially higher than other autoresearch systems. The per-paper consumption can be modeled as the sum of agent contributions:

$$T_{\text{paper}} = T_{\text{ideation}} + T_{\text{planning}} + T_{\text{experiment}} + T_{\text{writing}}$$

where $T_{\text{paper}} \approx 114 \times 10^6$ tokens. Using the estimated token shares from the blog post's reporting:

$$T_{\text{experiment}} \approx 0.70 \times T_{\text{paper}} \approx 80 \times 10^6 \text{ tokens/paper}$$

This enormous per-paper consumption reflects the Experiment Agent's iterative loop of code generation, debugging, GPU job execution, result analysis, and refinement. Each debugging cycle may consume millions of tokens, and a single paper may require 5–15 such iterations. Additionally, some experiments use LLMs as subjects (running inference on the studied model) and as judges (LLM-as-a-Judge evaluation), multiplying token consumption.

System / Task	Tokens per Output (est.)	Scale Factor
Typical chatbot response	~500	1×
Complex agentic task	~500,000	1,000×
AI Scientist paper (estimated)	~5,000,000	10,000×
FARS paper	~114,000,000	228,000×

Note: The AI Scientist token estimate is inferred from its $15 cost and typical API pricing, not officially reported.
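
The per-paper token figures and the scale factors in the table can be reproduced as follows. The 70% experiment share is the estimate used in this chapter, and the reference token counts are the table's own rough values.

```python
TOTAL_TOKENS = 11.4e9
PAPERS = 100
EXPERIMENT_SHARE = 0.70   # estimated share of tokens used by the Experiment Agent

tokens_per_paper = TOTAL_TOKENS / PAPERS
experiment_tokens = EXPERIMENT_SHARE * tokens_per_paper

REFERENCE_TOKENS = {
    "chatbot response": 500,
    "complex agentic task": 500_000,
    "AI Scientist paper (estimated)": 5_000_000,
    "FARS paper": tokens_per_paper,
}
baseline = REFERENCE_TOKENS["chatbot response"]
scale_factors = {task: tokens / baseline for task, tokens in REFERENCE_TOKENS.items()}

for task, factor in scale_factors.items():
    print(f"{task}: {factor:,.0f}x")
```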

40.5 Cost and Compute Analysis

40.5.1 FARS-100 Cost Decomposition

The total reported cost of ~$104,000 for 100 papers comprises GPU compute and LLM API tokens. FARS does not publish a detailed breakdown, but bounds can be estimated:

$$C_{\text{total}} = C_{\text{GPU}} + C_{\text{tokens}} \approx \$104{,}000$$

where $C_{\text{GPU}}$ is the cost of 160 GPUs for 228 hours, and $C_{\text{tokens}}$ is the inference cost for 11.4 billion tokens across open- and closed-source models. Using typical cloud pricing of $2–3 per GPU-hour for A100/H100-class hardware:

$$C_{\text{GPU}} \approx 160 \times 228 \times \$2.5 \approx \$91{,}200$$

This would leave approximately $12,800 for token costs, which at 11.4B tokens implies an average token price of about $1.12 per million tokens — consistent with a mix of local open-source inference (nearly free on owned hardware) and some closed-source API calls. However, if the GPUs are owned rather than rented, the marginal compute cost is substantially lower (depreciation + electricity), and a larger share of the $104K may be attributable to API tokens. The exact breakdown is not publicly available (source: total cost figure from blog post and 机器之心 reporting; GPU count from blog post; decomposition is author estimation).
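
The decomposition is sensitive to the assumed GPU rental rate, which a short calculation makes visible. This sketch assumes all 160 GPUs are billed for the full 228 hours at a single flat rate; the function name `decompose` is illustrative.

```python
GPUS = 160
HOURS = 228
TOTAL_COST = 104_000
TOTAL_TOKENS_MILLIONS = 11_400   # 11.4 billion tokens


def decompose(rate_per_gpu_hour):
    """Split the reported total into GPU rental and residual token spend."""
    gpu_cost = GPUS * HOURS * rate_per_gpu_hour
    token_cost = TOTAL_COST - gpu_cost
    price_per_million = token_cost / TOTAL_TOKENS_MILLIONS
    return gpu_cost, token_cost, price_per_million


# Rate at which GPU rental alone would equal the reported total.
break_even_rate = TOTAL_COST / (GPUS * HOURS)

for rate in (2.0, 2.5, 3.0):
    gpu_cost, token_cost, per_million = decompose(rate)
    print(f"${rate:.2f}/GPU-h: GPU ${gpu_cost:,.0f}, tokens ${token_cost:,.0f}, "
          f"${per_million:.2f} per million tokens")
print(f"break-even rate: ${break_even_rate:.2f}/GPU-h")
```

At the top of the $2–3 range the GPU term alone exceeds the reported total, so under this flat-rate assumption the rental price would have to sit below roughly $2.85 per GPU-hour for the decomposition to be self-consistent.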

40.5.2 Cost Comparison Across Autoresearch Systems

System	Cost/Paper	Hardware	Time/Paper	Experimental Depth
AI Scientist (2024)	~$15	API-only	Hours	Minimal (no GPU experiments)
AI Scientist v2 (2025)	~$15–20	API-only	Hours	Low–moderate
Karpathy autoresearch (2026)	~$18	1 GPU	~8 hours	Moderate (single GPU)
FARS (2026)	~$1,040	160 GPUs	~2.3 hours	High (GPU training + inference)
DeepScientist (2026)	Not reported	20,000+ GPU-hours total	Days–weeks	Very high (frontier-pushing)

FARS's $1,040 per paper is ~70× more expensive than AI Scientist's $15 per paper, but this comparison is misleading. AI Scientist produces papers with minimal experimental depth — no GPU training, no real model fine-tuning, no large-scale inference. FARS executes actual GPU-intensive experiments. A more appropriate comparison is cost per unit of experimental work: a single FARS paper at $1,040 may contain the experimental equivalent of what a human researcher would spend $10,000–$30,000 on, yielding a 10–29× cost advantage.

40.5.3 Scaling Economics

If FARS operated continuously for one year at the observed throughput, the projected output would be as follows (this assumes linear scaling and constant quality, which may not hold):

$$N_{\text{annual}} = \frac{365 \times 24}{2.28} \approx 3{,}842 \text{ papers/year}$$

Producing 3,842 papers at approximately 4 papers per researcher per year would require ~960 human researchers. At a fully loaded cost of $200K per researcher:

$$\frac{C_{\text{human}}}{C_{\text{FARS}}} = \frac{960 \times \$200\text{K}}{\$5\text{M}} \approx 38\times$$

where $C_{\text{FARS}} \approx \$5\text{M}$ is the estimated annual operating cost (GPU + API), a conservative round-up of the roughly $4.0M implied by 3,842 papers at the observed $1,040 each. This 38× cost advantage drives the economic case for automated research infrastructure. However, the comparison again carries the caveat that FARS short papers and typical human papers are not directly commensurable in scope.
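
The projection can be recomputed, including its sensitivity to the assumed annual FARS cost. All inputs are the chapter's own figures; the `advantage` helper is illustrative.

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_PAPER = 2.28
COST_PER_PAPER = 1_040
PAPERS_PER_RESEARCHER = 4
COST_PER_RESEARCHER = 200_000

annual_papers = HOURS_PER_YEAR / HOURS_PER_PAPER
researchers = annual_papers / PAPERS_PER_RESEARCHER
human_cost = researchers * COST_PER_RESEARCHER


def advantage(fars_annual_cost):
    """Ratio of human-equivalent cost to FARS annual operating cost."""
    return human_cost / fars_annual_cost


# Annual cost implied by the observed per-paper cost, for comparison with $5M.
implied_fars_cost = annual_papers * COST_PER_PAPER

print(f"papers/year: {annual_papers:,.0f}, researchers: {researchers:,.0f}")
print(f"advantage at $5M: {advantage(5_000_000):.0f}x")
print(f"advantage at implied ${implied_fars_cost / 1e6:.1f}M: {advantage(implied_fars_cost):.0f}x")
```

Using the cost implied by the observed $1,040 per paper instead of the rounded $5M raises the ratio from about 38× to about 48×, so the headline figure is the more conservative of the two.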

40.6 Reproducibility and Transparency

40.6.1 Transparency Protocol

FARS's transparency is unprecedented in the autoresearch space — and so is its opacity, in different dimensions. The system provides:

Live dashboard (analemma.ai/fars): Real-time observation of the running system
Public experiment repositories (gitlab.com/fars-a): Code, data, and results for each paper
Published papers (analemma.ai/papers/): All completed papers available online
Independent quality evaluation: Scores from Stanford's Agentic Reviewer, which anyone can re-run

However, FARS itself — the system code, agent prompts, coordination mechanisms, and infrastructure — is proprietary. This creates an inverted reproducibility profile: the process is unusually transparent (live public operation), but the system is unusually opaque (closed source). Researchers can see that FARS works and inspect its outputs, but cannot build their own version.

Criterion	Rating	Notes
System reproducibility	Low	Proprietary code, closed architecture
Experiment reproducibility	Medium–High	Individual experiment repos are public on GitLab
Result verification	Medium	Papers and scores independently evaluable
Process transparency	High	Live dashboard, public GitLab activity
Hardware accessibility	Low	160 GPUs required for full replication

40.6.2 Safety and Review Pipeline

FARS employs a deliberately conservative publication pipeline. All 100 papers produced by the automated system undergo evaluation by the Agentic Reviewer, followed by manual review by at least 3 senior researchers with 5+ years experience each. Only papers passing manual review are submitted to arXiv, and all submissions are explicitly labeled as AI-generated. This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work.

40.7 Memory, Learning, and Limitations

40.7.1 Memory Architecture

FARS's memory is the shared filesystem itself — a pragmatic choice that merges workspace and persistent memory. Each project maintains its own directory with structured subdirectories for ideation outputs, plans, experiment code and results, and the final paper. Agents access the filesystem as an external, persistent memory that supplements their finite context windows.

However, FARS does not appear to maintain cross-project memory — no explicit knowledge base, skill library, or failed-hypothesis registry. Each project is treated independently. If the system generates hypothesis $H_1$ for project $P_1$ and discovers it fails, there is no mechanism to prevent generating a similar hypothesis $H_1'$ for project $P_2$. Over 244 hypotheses, some redundancy is likely.

40.7.2 No Cross-Project Learning

This is a fundamental architectural distinction from evolutionary systems covered in Parts P03–P06 of this survey. Evolutionary systems explicitly learn: populations improve over generations because selection pressure retains good solutions. FARS's pipeline does not learn — each project starts fresh from a research direction.

# Pseudocode — no public implementation available
# Contrasts the FARS pipeline pattern with evolutionary learning loops

# FARS: Linear pipeline — each project independent
def fars_pipeline(directions: list[str]) -> list[Paper]:
    papers = []
    for direction in directions:
        hypothesis = ideation_agent(direction)    # No memory of prior projects
        plan = planning_agent(hypothesis)
        results = experiment_agent(plan)           # Cannot reuse prior code
        paper = writing_agent(results)
        papers.append(paper)
    return papers  # No feedback to ideation from outcomes

# Evolutionary system: Learning loop — population improves
def evolutionary_loop(seed: Program, task: Task, max_generations: int) -> Program:
    population = initialize(seed)
    for generation in range(max_generations):
        parents = select(population)               # Selection pressure
        children = mutate(parents)                  # Variation
        scores = evaluate(children, task)
        population = update(population, children, scores)  # Learning!
    return best(population)

This is both a strength and a limitation. Without cross-project learning there is no risk of premature convergence, and independent projects parallelize trivially. The cost is that FARS cannot build on its own discoveries or avoid repeating mistakes.

40.7.3 Limitations

Several significant limitations emerge from the FARS design and deployment:

AI-only research domain. FARS currently operates exclusively in AI/ML research — the "AI-for-AI" paradigm. This is a pragmatic choice (computational experiments provide readily available evaluation signals), but it means FARS cannot address biology, physics, chemistry, or any domain requiring physical experiments, human evaluation, or user studies.

No iterative refinement within projects. The linear pipeline does not support looping back from failed experiments to revised hypotheses within the same project. Systems like CycleResearcher and AI Scientist v2 employ iterative review-revision cycles that can improve a paper through multiple rounds. FARS trades this flexibility for throughput.

Proprietary system. The closed-source nature limits the research community's ability to verify architectural claims, reproduce the system, or build upon it. Individual experiment repos are public, but the orchestration layer is not.

Scale requirements. The 160-GPU cluster and the cost of $104K per 100 papers place FARS beyond the reach of most academic labs. This is research infrastructure for well-funded organizations, not a tool for individual researchers.

Quality ceiling. The mean score of 5.05 is below the ICLR acceptance threshold of 5.39. FARS's output scores above the 4.21 average for human submissions, but the system has not yet demonstrated that it can consistently produce acceptance-quality work by conference standards. Whether this matters depends on whether one accepts FARS's philosophical premise that conference acceptance is not the right success metric.

Scope and depth. Short, single-contribution papers cannot achieve the depth, synthesis, or multi-faceted analysis of a full human research paper. FARS papers are best compared to individual experiments within a human paper, not to complete papers.

40.8 Comparative Analysis

40.8.1 System Framing Comparison

FARS's pipeline framing positions it uniquely among autoresearch paradigms.

[Diagram: how different systems frame the automated research problem.]

40.8.2 Quantitative Cross-System Comparison

The following comparison must be interpreted with care: different systems target different objectives, operate in different domains, and define "paper" differently. Direct numerical comparison across these dimensions risks false precision.

Metric	AI Scientist	AI Scientist v2	DeepScientist	FARS
Papers produced	~10 (demo)	~15 (demo)	~1,100 validated	100
Cost per paper	~$15	~$15–20	Not reported	~$1,040
GPU experiments	No	Limited	Yes (20K+ GPU-hrs)	Yes (160 GPUs)
Unattended operation	No	No	Partially	Yes (228 hrs)
Negative results	Discarded	Discarded	Discarded	Published
Open source	Yes	Yes	Partially	No (outputs public)
Quality (ICLR scale)	~3.5–4.5	4.5–6.3	Not rated	5.05 mean
Review methodology	Self-review	Automated + peer	Expert judge	Agentic Reviewer + manual

Note: AI Scientist quality scores are approximate estimates from published examples. DeepScientist measured success by exceeding human SOTA on frontier tasks rather than by review scores. Cross-system quality comparisons should be treated as indicative, not definitive, due to differences in evaluation protocol, paper scope, and domain.

40.8.3 The Depth–Breadth Trade-off

FARS occupies the breadth end of a depth–breadth spectrum in autoresearch. DeepScientist consumed 20,000+ GPU-hours to produce approximately 1,100 experimentally validated ideas, some of which exceeded human state-of-the-art on frontier tasks. FARS consumed up to roughly 36,500 GPU-hours (160 GPUs × 228 hours = 36,480, assuming every GPU was in use throughout) to produce 100 complete papers at above-average quality. The strategies are complementary:

DeepScientist: Few directions explored to maximum depth. Goal: genuine scientific breakthroughs.
FARS: Many directions explored to moderate depth. Goal: comprehensive coverage of a research space.

Neither approach dominates. A research organization might use a FARS-like system to rapidly map a problem space, then deploy a DeepScientist-like system to push the most promising directions to their limits.
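The headline figures reported for FARS can be combined into a few per-paper quantities. The short calculation below uses only numbers stated in this chapter. The one assumption is that all 160 GPUs were busy for the full 228 hours, so the GPU-hour figures are upper bounds rather than measured utilization.

```python
# Figures reported for the FARS deployment.
GPUS = 160
HOURS = 228
PAPERS = 100
HYPOTHESES = 244
TOTAL_COST_USD = 104_000

# Upper bound: assumes every GPU was in use for the entire run.
gpu_hours_upper_bound = GPUS * HOURS
gpu_hours_per_paper = gpu_hours_upper_bound / PAPERS

cost_per_paper_usd = TOTAL_COST_USD / PAPERS
conversion_rate = PAPERS / HYPOTHESES  # share of hypotheses that became papers

print(f"GPU-hours (upper bound): {gpu_hours_upper_bound:,}")
print(f"GPU-hours per paper:     {gpu_hours_per_paper:.1f}")
print(f"Cost per paper:          ${cost_per_paper_usd:,.0f}")
print(f"Hypothesis conversion:   {conversion_rate:.0%}")
```

On these inputs the upper bound is 36,480 GPU-hours, or about 365 GPU-hours per paper, which gives a rough sense of how much compute a FARS-like mapping pass spends on each direction.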

40.9 Broader Implications

40.9.1 Research as Infrastructure

FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled researchers. In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously. This shift parallels transitions in other domains: manual QA to automated CI/CD in software, artisan production to assembly lines in manufacturing, manual analysis to automated pipelines in data science.

40.9.2 The Minimal Composable Knowledge Unit

FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a blog post. This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. Each FARS paper is easy to review (one thing to evaluate), easy to cite precisely (one clear finding), and incentivizes decomposition over bundling.

Whether the broader scientific community would accept this format is an open question. The academic incentive structure rewards comprehensive, multi-contribution papers at top venues, and FARS is not designed to operate within that structure.

40.9.3 Publication Bias and the Scientific Record

If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. In human research, an estimated 80% of experimental knowledge is lost to publication bias. FARS's 41% hypothesis-to-paper conversion rate (100 papers from 244 hypotheses), which includes papers whose entire contribution is documenting failure, represents a structural improvement in knowledge preservation.

40.9.4 The AI-for-AI Feedback Loop

FARS operates in the AI-for-AI domain: AI systems researching AI systems. Some of its papers improve LLM training, agent design, or evaluation methods. If these improvements feed back into FARS's own components (directly or through the broader research ecosystem), the system creates a positive feedback loop in AI capability. This recursive potential is both the most exciting and most concerning implication of industrial-scale autoresearch.

40.10 Connections to Evolutionary Systems

Although FARS is a pipeline system rather than an evolutionary one, its design has important connections to the evolutionary algorithm discovery systems covered in earlier parts of this survey. The shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems — a lesson applicable to evolutionary systems that often use more complex event buses and databases. FARS's systematic production of negative results has no direct analogue in evolutionary systems, which discard failed candidates; an evolutionary system that preserved and analyzed failures could improve search efficiency. Conversely, FARS's lack of cross-project learning is precisely the gap that evolutionary approaches fill: selection pressure, population management, and progressive improvement are absent from FARS and could substantially improve its operation in future versions.

A hybrid architecture — FARS-like pipeline stages with evolutionary learning across projects — could combine throughput with progressive improvement. The Ideation Agent could maintain a population of hypothesis templates refined by selection pressure from experimental outcomes. The Experiment Agent could accumulate a skill library of reusable experimental techniques. Such extensions would transform FARS from a pipeline into a learning pipeline, maintaining throughput advantages while adding the improvement dynamics that make evolutionary systems powerful.
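One way the hypothesis-template population could work is sketched below. Everything here is hypothetical and describes a possible extension, not existing FARS behavior: the `Template` class, the smoothed success-rate score, the weighted sampling, and the Boolean `supported` outcome are illustrative choices. A system that values negative results would probably need a richer outcome signal than supported or not supported.

```python
import random
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical hypothesis template with a running outcome record."""
    text: str
    trials: int = 0
    successes: int = 0

    @property
    def score(self) -> float:
        # Laplace-smoothed success rate: untested templates start at 0.5,
        # so they still get sampled instead of being starved.
        return (self.successes + 1) / (self.trials + 2)


def select(population: list[Template], k: int, rng: random.Random) -> list[Template]:
    """Sample k templates with probability proportional to score."""
    weights = [t.score for t in population]
    return rng.choices(population, weights=weights, k=k)


def record_outcome(template: Template, supported: bool) -> None:
    """Feed one experimental outcome back into the template's record."""
    template.trials += 1
    template.successes += int(supported)


def cull(population: list[Template], keep: int) -> list[Template]:
    """Retain the highest-scoring templates for the next round."""
    return sorted(population, key=lambda t: t.score, reverse=True)[:keep]
```

Each pipeline run would call `select` during ideation and `record_outcome` after the experiment stage, which closes the feedback path that the linear pipeline lacks while leaving the per-project stages unchanged.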

Summary

FARS is the first continuously operating automated research system demonstrated at industrial scale: 228 hours of fully unattended operation on 160 GPUs, producing 100 short research papers from 244 generated hypotheses at a cost of ~$1,040 per paper and a mean quality score of 5.05 on the ICLR review scale — above the 4.21 average for human submissions.

Main contribution to the field: FARS proves that automated research can function as reliable production infrastructure rather than a proof-of-concept demonstration, and introduces two significant innovations: (1) the shared filesystem as sole inter-agent coordination mechanism, enabling crash-safe, human-inspectable, trivially parallelizable multi-agent operation; and (2) the systematic publication of negative results as first-class research contributions, addressing one of science's most persistent structural problems.

What a researcher should know: FARS trades depth for breadth and learning for throughput. It does not improve over time (no cross-project learning), operates only in AI/ML research domains, and is proprietary. Its philosophical challenge to academic publishing conventions — that the paper format is a historical artifact, not a necessary property of knowledge production — may prove more influential than its technical architecture. The system's outputs (papers, code repositories, quality scores) are publicly available for independent evaluation at analemma.ai and gitlab.com/fars-a, even though the system itself cannot be reproduced.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}