FARS: Fully Automated Research System
Part P07: Autonomous Research Systems
FARS represents a paradigm shift in automated research: from proof-of-concept agent demonstrations to industrial-scale, continuously operating research infrastructure. Deployed live on a 160-GPU cluster in February 2026, FARS produced 100 short research papers in 228 hours of fully unattended operation, consuming 11.4 billion tokens at a total cost of approximately $104,000. Unlike prior autoresearch systems that optimized for human-like paper formatting and peer-review acceptance, FARS rejects academic publishing conventions entirely, instead optimizing for throughput, reliability, and the systematic production of minimal, composable knowledge units — including negative results.
Key Contribution
FARS is the first publicly demonstrated, continuously operating automated research system at industrial scale. Its core contribution is not algorithmic but architectural and philosophical: it proves that automated research can function as reliable production infrastructure rather than a one-off demonstration, and it challenges the assumption that academic paper formatting is a necessary property of knowledge production. The system's 228-hour unattended deployment, producing 244 hypotheses and 100 papers with a mean review score of 5.05 on the ICLR scale (above the 4.21 average for human submissions), establishes a concrete baseline for automated research throughput, cost, and quality.
40.1 Overview and Motivation
40.1.1 The Industrial Research Thesis
FARS was developed by Analemma (日行迹智能科技), a Shanghai-based startup founded in March 2025 by Dr. Sun Tianxiang (孙天祥), a Fudan University PhD and principal developer of MOSS, one of the earliest Chinese open-source conversational language models. The core team draws from Fudan's MOSS group and Shanghai AI Laboratory's InternLM team — researchers who built the very models that autoresearch systems consume. This insider perspective informs FARS's pragmatic design: the system is built by people who understand both the capabilities and the limitations of LLMs as research tools.
The speed of execution is notable: approximately 11 months from company founding to a 160-GPU live deployment producing 100 papers. The company secured an angel round of several hundred million RMB from investors including Sequoia Capital China, reflecting significant confidence in the automated research infrastructure thesis.
FARS's central philosophical claim distinguishes it from every prior system in this survey: the output of a research system should be research contributions, not papers conforming to academic formatting conventions. Where AI Scientist (Chapter 30) sought to pass the Turing test of academic publishing, and DeepScientist (Chapter 37) sought frontier-pushing scientific depth, FARS asks what an optimally efficient research system looks like when freed from the constraints of human publishing conventions. The blog post articulates this through a first-principles critique of human-centered research:
| Structural Problem | Root Cause | FARS Response |
|---|---|---|
| High entry barrier | Years of training to become a researcher | Automated agents require no training period |
| Publication bias | Only "successful" experiments get published | Every completed experiment — positive or negative — produces output |
| Format overhead | Conforming to venue-specific formatting rules | No structural constraints beyond clarity |
| Length inflation | Pressure to fill pages to meet minimums | Papers are as long as they need to be, no more |
| Supply constraint | Limited number of human researchers | System runs 24/7, parallelizes across projects |
40.1.2 Five Design Principles
From the blog post and observable outputs, FARS operates on five explicit principles:
- Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container.
- Single-scoped contributions. Each paper addresses exactly one research question. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.
- Negative results are knowledge. A well-conducted experiment showing something does not work is as valuable as one showing something does. FARS explicitly reports negative results — for example, "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity."
- No unnecessary constraints. Papers are not padded to meet length minimums and do not conform to venue-specific templates.
- Scale reveals truth. The 100-paper target is deliberate: quality varies from paper to paper, so the system is meant to be judged on the aggregate rather than on cherry-picked examples.
40.1.3 Position in the Autoresearch Landscape
FARS explicitly positions itself as a successor to six prior systems. Where each addressed a subset of the autoresearch problem — AI Scientist demonstrating feasibility, Zochi achieving acceptance-level quality, DeepScientist pushing frontier depth — FARS addresses industrial-scale throughput with continuous autonomous operation. It is the difference between building a faster horse and building an assembly line.
| System | Year | Primary Contribution | Framing |
|---|---|---|---|
| AI Scientist | 2024 | First end-to-end: idea → code → paper → review | Dialogue-based |
| CycleResearcher | 2024 | Iterative review-revision quality improvement | Feedback loop |
| Zochi | 2025 | First AI papers accepted at workshops (avg 7.67) | Acceptance-optimized |
| AI Scientist v2 | 2025 | Tree search methodology, double-blind review pass | Search-based |
| AI-Researcher | 2025 | Four-module architecture, NeurIPS Spotlight | Modular pipeline |
| DeepScientist | 2026 | Exceeded human SOTA on 3 frontier tasks | Depth-first |
| FARS | 2026 | 228h continuous operation, 100 papers, live public deployment | Pipeline production |
The framing distinction is important. FARS treats automated research as a pipeline production problem — not a search problem (contrast with evolutionary systems like AlphaEvolve), not a dialogue problem (contrast with AI Scientist's multi-turn LLM conversation), and not a depth-first discovery problem (contrast with DeepScientist). The pipeline metaphor has specific implications for how the system is designed, optimized, and evaluated.
40.2 Architecture
40.2.1 Four-Agent Sequential Pipeline
FARS employs a four-agent sequential pipeline: Ideation Agent → Planning Agent → Experiment Agent → Writing Agent. Each agent reads from and writes to a shared filesystem, which serves as the sole coordination mechanism. There is no direct agent-to-agent communication — the filesystem boundary is the API contract.
40.2.2 The Shared Filesystem as Coordination Protocol
The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This is not merely a storage layer — it is the entire inter-agent protocol. Most multi-agent systems use message queues, event buses, or direct API calls. FARS's choice has deep advantages that the blog post and observable behavior make clear:
- Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state represents a perfect checkpoint.
- Natural handoffs. Agent A writes files; Agent B reads them. The filesystem boundary is the API contract — no schema definitions, no serialization overhead.
- Human inspectability. Any researcher can inspect exactly what each agent produced, enabling debugging and trust-building.
- Trivial parallelism. Multiple projects run in parallel via separate directories. No lock contention, no resource arbitration beyond GPU scheduling.
- Decoupled evolution. Each agent can be upgraded independently, provided file formats remain compatible.
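The handoff pattern in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FARS's implementation: the file names (`hypothesis.md`, `STAGE`) and the helpers `write_artifact` and `read_stage` are hypothetical. Each artifact is written to a temporary file and renamed into place, so a reader in another process sees either the old file or the complete new one.

```python
import os
import tempfile
from pathlib import Path


def write_artifact(workspace: Path, name: str, text: str) -> Path:
    """Write an artifact atomically: temp file first, then rename into place."""
    workspace.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=workspace, prefix=f".{name}.")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    target = workspace / name
    # The rename is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp_name, target)
    return target


def read_stage(workspace: Path) -> str:
    """Return the project's stage marker, or 'ideation' if none exists yet."""
    marker = workspace / "STAGE"
    return marker.read_text(encoding="utf-8").strip() if marker.exists() else "ideation"


def demo() -> tuple[str, str]:
    """Producer writes a hypothesis, then the marker; consumer reads the marker."""
    with tempfile.TemporaryDirectory() as root:
        ws = Path(root) / "projects" / "FA0001"  # hypothetical layout
        before = read_stage(ws)
        write_artifact(ws, "hypothesis.md", "# Hypothesis\nExample text.\n")
        write_artifact(ws, "STAGE", "planning\n")  # marker is written last
        return before, read_stage(ws)
```

Writing the stage marker last means a downstream agent that polls the marker never picks up a project whose artifacts are still being written.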
This pattern is well-established in systems engineering (Unix pipes, Plan 9, microservices via shared storage) but is novel in the autoresearch space. AI Scientist uses a single LLM conversation thread, DeepScientist uses knowledge graphs and databases, and evolutionary systems like those in Parts P03–P05 use in-memory population databases. The following table contrasts these coordination patterns:
| Pattern | Used By | Advantages | Disadvantages |
|---|---|---|---|
| Single LLM context | AI Scientist | Simple, coherent | Context window limits, no parallelism |
| Knowledge graph | DeepScientist | Rich relationships | Complex queries, schema overhead |
| In-memory population | Evolutionary systems | Fast, structured | Volatile, single-node bottleneck |
| Message queue | AI-Researcher | Decoupled, ordered | Needs broker, no natural persistence |
| Shared filesystem | FARS | Universal, durable, inspectable | Unstructured unless conventions enforced |
40.2.3 Key Architectural Decisions
Several design decisions reported in the blog post and observable from the system's behavior merit analysis:
Sequential pipeline over graph search. Research has a natural linear flow — idea, plan, experiment, write — and FARS exploits this structure. The pipeline is simpler to coordinate, debug, and reason about than a graph-based approach. The trade-off is rigidity: FARS cannot loop back from failed experiments to revised hypotheses within the same project, unlike iterative systems such as CycleResearcher.
GPU cluster as tools, not raw access. The 160-GPU cluster is encapsulated as high-level tool interfaces. The Experiment Agent schedules jobs without managing CUDA, drivers, or multi-GPU parallelism. This is analogous to how cloud computing abstracts hardware — the agent reasons about experiments, not infrastructure.
Short papers by design. By targeting 4–8 page single-contribution papers, the Writing Agent's task is substantially simpler than producing full conference papers. This reduces hallucination risk, generation time, and the need for comprehensive related work surveys.
40.3 Core Mechanisms
40.3.1 Pipeline Parallelism and Throughput
FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines and industrial assembly lines. While paper $N$ is being written, paper $N+1$ is being experimented on, paper $N+2$ is being planned, and paper $N+3$ is being ideated. All four stages operate concurrently on different projects.
Let $T_{\text{idea}}$, $T_{\text{plan}}$, $T_{\text{exp}}$, and $T_{\text{write}}$ denote the average time for each pipeline stage. The steady-state throughput of a fully occupied pipeline is:

$$\text{Throughput} = \frac{1}{\max\left(T_{\text{idea}},\, T_{\text{plan}},\, T_{\text{exp}},\, T_{\text{write}}\right)}$$

where throughput is measured in papers per unit time. The bottleneck stage (the slowest one) determines throughput regardless of how fast the other stages operate. Based on the reported token consumption breakdown (~70% for the Experiment Agent) and the observed average of ~2.3 hours per paper, we can estimate:
| Stage | Estimated Duration | Token Share | Bottleneck? |
|---|---|---|---|
| Ideation | ~20–30 min | ~15% | No |
| Planning | ~10–15 min | ~5% | No |
| Experiment | ~90–120 min | ~70% | Yes |
| Writing | ~20–30 min | ~10% | No |
Note: Stage durations and token-share breakdowns are author estimates based on reported aggregate figures. FARS does not publicly disclose per-stage timing.
The observed average interval of ~137 minutes per paper (228 hours / 100 papers) is longer than the estimated bottleneck duration of 90–120 minutes. This gap likely reflects pipeline filling and draining overhead, failed hypotheses that consume resources without producing papers (59% of hypotheses did not become papers), and GPU contention during peak parallel execution.
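The comparison can be checked numerically. The sketch below assumes the simple single-lane model in which the interval between completed papers equals the duration of the slowest stage, and it uses the midpoints of the estimated stage durations from the table; those midpoints are author estimates, not FARS measurements, and the function name is hypothetical.

```python
def steady_state_interval(stage_minutes: dict[str, float]) -> tuple[str, float]:
    """Return the bottleneck stage and the minutes between completed papers."""
    bottleneck = max(stage_minutes, key=stage_minutes.get)
    return bottleneck, stage_minutes[bottleneck]


# Midpoints of the estimated stage durations (estimates, not FARS data).
STAGE_MINUTES = {
    "ideation": 25.0,
    "planning": 12.5,
    "experiment": 105.0,
    "writing": 25.0,
}

bottleneck, model_interval = steady_state_interval(STAGE_MINUTES)
observed_interval = 228 * 60 / 100        # minutes per paper over the whole run
gap = observed_interval - model_interval  # minutes the simple model leaves unexplained

print(f"bottleneck={bottleneck}, model={model_interval:.1f} min, "
      f"observed={observed_interval:.1f} min, gap={gap:.1f} min")
```

With these midpoints, roughly half an hour per paper is left unexplained by the bottleneck stage alone.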
The following pseudocode illustrates the pipeline scheduling logic:
```python
# Pseudocode: no public implementation available.
# Illustrates FARS pipeline parallelism as described in the blog post.
import asyncio
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    IDEATION = "ideation"
    PLANNING = "planning"
    EXPERIMENT = "experiment"
    WRITING = "writing"
    COMPLETE = "complete"


@dataclass
class Project:
    project_id: str
    stage: Stage
    workspace_path: str  # path in shared filesystem


async def run_pipeline(
    research_directions: list[str],
    workspace_root: str,
    max_concurrent: int = 20,
) -> list[Project]:
    """
    Pipeline scheduler: launches projects and advances them
    through stages as agents complete their work.
    Each agent reads/writes from the project's workspace directory.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    completed: list[Project] = []

    async def process_project(direction: str, idx: int) -> Project:
        # `direction` would be written into the workspace for the Ideation Agent.
        async with semaphore:
            project = Project(
                project_id=f"FA{idx:04d}",
                stage=Stage.IDEATION,
                workspace_path=f"{workspace_root}/projects/FA{idx:04d}",
            )
            # Each stage reads predecessor's output from filesystem,
            # writes its own output, then advances the stage marker.
            for stage_fn in (ideation, planning, experiment, writing):
                success = await stage_fn(project)
                if not success:
                    break  # Project abandoned (failed review, etc.)
            else:
                # for/else: runs only if no stage broke out of the loop.
                project.stage = Stage.COMPLETE
                completed.append(project)
            return project

    # Launch all projects concurrently; the semaphore caps how many are active.
    tasks = [
        process_project(d, i)
        for i, d in enumerate(research_directions)
    ]
    await asyncio.gather(*tasks)
    return completed


async def ideation(project: Project) -> bool:
    """Reads research direction, writes hypothesis to filesystem."""
    # 1. Conduct literature review (open-access papers, public repos)
    # 2. Generate hypothesis
    # 3. Automated review gate: reject if infeasible/redundant
    # 4. Write hypothesis.md to project workspace
    project.stage = Stage.PLANNING
    return True  # False if hypothesis fails review


async def planning(project: Project) -> bool:
    """Reads hypothesis, writes experimental plan (stub so the sketch runs)."""
    project.stage = Stage.EXPERIMENT
    return True


async def experiment(project: Project) -> bool:
    """Reads plan, executes via GPU tools, writes results."""
    # 1. Read experimental plan from filesystem
    # 2. Generate code for the experiment
    # 3. Schedule GPU jobs via tool interface (not raw hardware)
    # 4. Monitor, debug, iterate as needed
    # 5. Write results + code to filesystem
    project.stage = Stage.WRITING
    return True  # False if experiment fails irrecoverably


async def writing(project: Project) -> bool:
    """Reads results, writes the short paper (stub so the sketch runs)."""
    return True
```
40.3.2 The Four Agents in Detail
Ideation Agent. Converts broad research directions into validated, actionable hypotheses. It has access to open-access papers and public GitLab repositories — not just paper abstracts but actual code implementations, a significant advantage over literature-review-only approaches. The agent includes an automated review gate that filters hypotheses for quality and feasibility before forwarding. In the FARS-100 run, it generated 244 hypotheses from which 100 became papers, a 41% conversion rate. The overproduction is by design: the Ideation Agent generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.
Planning Agent. The thinnest component, it bridges abstract hypotheses and concrete experimental protocols. It determines baselines, specifies evaluation metrics and success criteria, and estimates computational resource requirements. Its output is an experimental plan written to the shared filesystem.
Experiment Agent. The dominant component, consuming approximately 70% of total tokens. It generates code, schedules GPU training and inference jobs via tool interfaces, collects and analyzes results, and iterates through debug cycles. The GPU cluster and model inference endpoints are encapsulated as tools — the agent cannot accidentally consume all resources or interfere with other projects. When experiments fail, the agent must determine whether the failure is a system error (retry/debug) or a genuine negative result (proceed to Writing Agent with negative finding).
Writing Agent. Produces the final short paper from experimental results. Unlike other autoresearch writing modules that attempt full conference papers, FARS's writer produces focused 4–8 page papers. Crucially, the Writing Agent has two distinct modes inferred from the published outputs: a positive result mode ("we propose X, which achieves Y improvement") and a negative result mode ("we investigate whether X improves Y; we find it does not, and we analyze why"). The negative result mode is philosophically significant — it requires framing failure as contribution.
40.3.3 GPU Cluster Tool Encapsulation
The decision to expose the 160-GPU cluster as abstract tools rather than raw hardware is architecturally important. The Experiment Agent reasons about experiments, not CUDA versions or job schedulers. This creates a clean separation of concerns:
```python
# Pseudocode: no public implementation available.
# Illustrates the tool-encapsulation pattern described in reporting.
# The Experiment Agent interacts with GPU resources only through
# high-level tool interfaces. Internal scheduling, multi-GPU
# parallelism, and fault tolerance are handled by the tool layer.
# `ToolInterface` and the two helpers below are hypothetical placeholders.


def generate_code_from_plan(plan: dict) -> str:
    """Placeholder for the LLM call that turns a plan into training code."""
    raise NotImplementedError


def debug_and_fix(code: str, error_log: str) -> str:
    """Placeholder for the LLM call that repairs code from an error log."""
    raise NotImplementedError


async def run_experiment(
    plan: dict,
    tools: "ToolInterface",
    max_debug_rounds: int = 5,  # assumed bound; the original recursion had none
) -> dict:
    """
    Experiment Agent's interaction with GPU cluster is mediated
    entirely through tool calls. The agent never manages hardware.
    """
    # Step 1: Generate training code based on the plan
    code = generate_code_from_plan(plan)

    for _ in range(max_debug_rounds + 1):
        # Step 2: Schedule a training job via tool API
        job_id = await tools.schedule_training_job(
            code=code,
            config=plan["training_config"],
            # Resource requirements are declared, not managed
            gpu_count=plan.get("gpu_count", 1),
            max_hours=plan.get("max_hours", 4),
        )
        # Step 3: Monitor execution (tool handles retries, timeouts)
        status = await tools.wait_for_completion(job_id)
        if not status.failed:
            break
        # Agent decides: debug the code or accept as negative result
        if not status.is_code_error:
            # Genuine experimental failure: proceed as negative result
            return {"outcome": "negative", "analysis": status.error_log}
        # Iterate: fix code, then re-submit on the next pass
        code = debug_and_fix(code, status.error_log)
    else:
        # Debug budget exhausted: a system failure, not a research result
        return {"outcome": "system_failure", "analysis": status.error_log}

    # Step 4: Collect results via tool API
    results = await tools.get_job_results(job_id)
    # Step 5: Run evaluation (may use LLM-as-a-Judge via inference tools)
    eval_scores = await tools.run_evaluation(
        model=plan["eval_model"],
        predictions=results["outputs"],
        references=plan["ground_truth"],
    )
    outcome = "positive" if eval_scores["improvement"] > 0 else "negative"
    return {"outcome": outcome, "results": results, "evaluation": eval_scores}
```
40.3.4 Dual Role of LLMs: Infrastructure and Subject
A distinctive aspect of FARS is the dual role of large language models. At the infrastructure layer, LLMs serve as the reasoning backbone for all four agents — conducting literature review, generating code, composing papers. At the subject layer, LLMs are the experimental subjects being researched — their training procedures, behaviors, and evaluation methods are the focus of many FARS papers.
This creates a recursive structure: LLMs researching LLMs. For this to work, the infrastructure models must be more capable than the subject models — one cannot reliably study a model's behavior using a less capable model as the reasoning engine. FARS has access to both open-source models (run on the 160-GPU cluster) and closed-source models (via API), though the specific models serving as agent backbones are not publicly disclosed.
40.3.5 Negative Result Detection and Publication
FARS's systematic production of negative results requires distinguishing between four experiment outcome types:
- Positive result: Proposed method improves on baselines → report improvement.
- Negative result: Method does not work → report why it fails (a legitimate research contribution).
- Null result: Inconclusive, may need more experiments → may be abandoned or extended.
- System failure: Bug or crash → not a research result, requires debugging.
Traditional autoresearch systems treat negative results, null results, and system failures alike as failures and discard them. FARS treats negative results as a publishable contribution. Observed examples include:
- "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement. The contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
- "Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate).
In human academia, negative results are systematically suppressed due to publication bias. FARS's willingness to report them represents a structural fix. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value lies not in any individual negative paper but in the aggregate: a comprehensive map of what works and what doesn't.
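The four outcome types can be made concrete with a small decision rule. The sketch below is an illustration under stated assumptions, not FARS's logic: the `RunSummary` fields, the noise band, and the minimum number of repetitions are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NULL = "null"
    SYSTEM_FAILURE = "system_failure"


@dataclass
class RunSummary:
    crashed: bool          # job died from a bug, timeout, or infrastructure error
    baseline_score: float  # metric for the baseline (higher is better)
    method_score: float    # metric for the proposed method
    noise_band: float      # assumed run-to-run variation of the metric
    repetitions: int       # number of independent repetitions


def classify(run: RunSummary, min_repetitions: int = 3) -> Outcome:
    """Map a finished run to one of the four outcome types."""
    if run.crashed:
        return Outcome.SYSTEM_FAILURE
    if run.repetitions < min_repetitions:
        return Outcome.NULL  # too little evidence either way
    delta = run.method_score - run.baseline_score
    if delta > run.noise_band:
        return Outcome.POSITIVE
    return Outcome.NEGATIVE  # well-measured, but no improvement


# The import-autofix example: method and baseline both at 10.0% resolved.
autofix = classify(RunSummary(False, 10.0, 10.0, noise_band=0.5, repetitions=3))
```

The ordering matters: a crash is ruled out before any scores are compared, so a bug can never be reported as a negative finding.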
40.4 Key Results
40.4.1 FARS-100 Headline Metrics
The FARS-100 run began at 10:00 PM EST on February 12, 2026 and completed after 228 hours, 28 minutes, and 33 seconds of continuous, unattended operation. The following metrics are reported by Analemma through the blog post and live dashboard (source: blog post, February 11, 2026; live dashboard, analemma.ai/fars):
| Metric | Value | Source |
|---|---|---|
| Duration | 228 hours 28 min 33 sec | Blog post / dashboard |
| Hypotheses generated | 244 | Blog post |
| Papers completed | 100 | Blog post / dashboard |
| Hypothesis → paper conversion | 41.0% (100/244) | Computed from blog figures |
| Average time per paper | ~2 hours 17 min | Computed (228h / 100 papers) |
| Total tokens consumed | 11.4 billion | Blog post / media reporting |
| Total cost | ~$104,000 (~¥750,000) | Blog post / media reporting |
| Cost per paper | ~$1,040 | Computed ($104K / 100) |
| Hardware | 160 NVIDIA GPUs | Blog post |
| Human intervention during run | Zero | Blog post |
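The rows marked "Computed" follow directly from the reported figures. A short check, using only the numbers in the table (variable names are illustrative):

```python
from datetime import timedelta

duration = timedelta(hours=228, minutes=28, seconds=33)
hypotheses, papers = 244, 100
tokens, cost_usd = 11.4e9, 104_000

conversion = papers / hypotheses     # share of hypotheses that became papers
per_paper = duration / papers        # wall-clock time per paper
tokens_per_paper = tokens / papers
cost_per_paper = cost_usd / papers

print(f"conversion: {conversion:.1%}")
print(f"time per paper: {per_paper}")
print(f"tokens per paper: {tokens_per_paper:,.0f}")
print(f"cost per paper: ${cost_per_paper:,.0f}")
```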
40.4.2 Quality Assessment
Paper quality was assessed using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards. The Agentic Reviewer achieves Spearman correlation of 0.42 with human reviewers — on par with human inter-reviewer agreement of 0.41 (source: paperreview.ai calibration study):
| Population | Mean Score (ICLR scale) | Source |
|---|---|---|
| FARS-100 papers | 5.05 (range: 3.0–6.3) | Agentic Reviewer evaluation |
| ICLR 2026 — all human submissions | 4.21 | Stanford calibration data |
| ICLR 2026 — accepted papers | 5.39 | Stanford calibration data |
FARS papers score 0.84 points above the average human submission and 0.34 points below the average accepted paper. The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation. A small number of papers exceeded 6.0.
Caveats on quality comparison. Several important caveats apply. First, FARS papers are short, single-contribution works (4–8 pages), while ICLR submissions are typically full papers (8–10+ pages) with broader scope. Comparing mean scores between these populations conflates paper scope with paper quality. Second, a Spearman correlation of 0.42 between the Agentic Reviewer and human reviewers is modest: squaring it gives roughly 0.18, so by that rough measure less than a fifth of the variation in human scores is tracked by the automated reviewer, and individual scores carry substantial noise. Third, the FARS papers are evaluated by the same type of system (an LLM) that produced them, which could introduce systematic biases that would not apply to human-written papers. These scores should be interpreted as indicating that FARS output is in the plausible range for publishable research, not that FARS papers are directly comparable to human ICLR submissions.
40.4.3 Conversion Funnel Analysis
The 41% hypothesis-to-paper conversion rate (100/244) implies that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is actually a high conversion rate compared to human research, where the hypothesis-to-publication rate is typically 5–20%. The higher FARS rate likely reflects three factors: conservative hypothesis generation (testable, incremental hypotheses), inclusion of negative results (failures that humans would discard become papers), and the lower bar of the single-contribution format.
The conversion funnel includes a post-production quality gate: at least 3 senior researchers (5+ years experience each) manually review papers before arXiv submission, and all submissions are explicitly labeled as AI-generated. The number of papers that passed this manual review is not publicly reported.
40.4.4 Token Consumption Analysis
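The FARS-100 run consumed 11.4 billion tokens. This yields an average of approximately 114 million tokens per paper, orders of magnitude higher than typical LLM generation tasks and substantially higher than other autoresearch systems. The per-paper consumption can be modeled as the sum of agent contributions:

$$T_{\text{paper}} = T_{\text{ideation}} + T_{\text{planning}} + T_{\text{experiment}} + T_{\text{writing}}$$

where each term is a token count (not a stage duration as in Section 40.3.1) and $T_{\text{paper}} \approx 114 \times 10^6$ tokens. Using the estimated token shares from the table in Section 40.3.1 (only the ~70% Experiment Agent share is reported; the others are author estimates):

$$T_{\text{ideation}} \approx 0.15 \times 114\text{M} = 17.1\text{M}, \quad T_{\text{planning}} \approx 0.05 \times 114\text{M} = 5.7\text{M}, \quad T_{\text{experiment}} \approx 0.70 \times 114\text{M} = 79.8\text{M}, \quad T_{\text{writing}} \approx 0.10 \times 114\text{M} = 11.4\text{M}$$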
This enormous per-paper consumption reflects the Experiment Agent's iterative loop of code generation, debugging, GPU job execution, result analysis, and refinement. Each debugging cycle may consume millions of tokens, and a single paper may require 5–15 such iterations. Additionally, some experiments use LLMs as subjects (running inference on the studied model) and as judges (LLM-as-a-Judge evaluation), multiplying token consumption.
| System / Task | Tokens per Output (est.) | Scale Factor (vs. chatbot response) |
|---|---|---|
| Typical chatbot response | ~500 | 1× |
| Complex agentic task | ~500,000 | 1,000× |
| AI Scientist paper (estimated) | ~5,000,000 | 10,000× |
| FARS paper | ~114,000,000 | 228,000× |
Note: The AI Scientist token estimate is inferred from its $15 cost and typical API pricing, not officially reported.
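The scale factors and the AI Scientist inference can be reproduced from the table values. In the sketch below, the blended price is not a reported figure; it is simply what the ~5M-token estimate implies when paired with the $15 cost.

```python
BASELINE_TOKENS = 500  # typical chatbot response, per the table

token_estimates = {
    "Complex agentic task": 500_000,
    "AI Scientist paper": 5_000_000,
    "FARS paper": 114_000_000,
}

# Scale factor relative to a single chatbot response.
scale = {name: tokens // BASELINE_TOKENS for name, tokens in token_estimates.items()}

# Blended price implied by spreading $15 over ~5M tokens (USD per million tokens).
implied_usd_per_million = 15 / (token_estimates["AI Scientist paper"] / 1e6)

for name, factor in scale.items():
    print(f"{name}: {factor:,}x")
print(f"implied AI Scientist price: ${implied_usd_per_million:.2f} per million tokens")
```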
40.5 Cost and Compute Analysis
40.5.1 FARS-100 Cost Decomposition
The total reported cost of ~$104,000 for 100 papers comprises GPU compute and LLM API tokens. FARS does not publish a detailed breakdown, but bounds can be estimated:

$$C_{\text{total}} = C_{\text{GPU}} + C_{\text{tokens}} \approx \$104{,}000$$

where $C_{\text{GPU}}$ is the cost of 160 GPUs for 228 hours, and $C_{\text{tokens}}$ is the inference cost for 11.4 billion tokens across open- and closed-source models. Using typical cloud pricing of $2–3 per GPU-hour for A100/H100-class hardware, and taking the $2.50 midpoint:

$$C_{\text{GPU}} \approx 160 \times 228\ \text{h} \times \$2.50/\text{GPU-h} = \$91{,}200$$

This would leave approximately $12,800 for token costs, which at 11.4B tokens implies an average token price of about $1.12 per million tokens. That figure is consistent with a mix of local open-source inference (nearly free on owned hardware) and some closed-source API calls. However, if the GPUs are owned rather than rented, the marginal compute cost is substantially lower (depreciation + electricity), and a larger share of the $104K may be attributable to API tokens. The exact breakdown is not publicly available (source: total cost figure from blog post and 机器之心 reporting; GPU count from blog post; decomposition is author estimation).
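Because the split depends heavily on the assumed GPU rate, it is worth sweeping the $2–3 range rather than relying on a single point. The sketch below assumes pure rental pricing for all 160 GPUs over 228 hours; the function name is illustrative.

```python
GPUS, HOURS = 160, 228
TOTAL_COST_USD = 104_000
TOKENS_MILLIONS = 11_400  # 11.4 billion tokens


def split_cost(usd_per_gpu_hour: float) -> tuple[float, float, float]:
    """Return (GPU cost, residual for tokens, implied USD per million tokens)."""
    gpu_cost = GPUS * HOURS * usd_per_gpu_hour
    residual = TOTAL_COST_USD - gpu_cost
    return gpu_cost, residual, residual / TOKENS_MILLIONS


for rate in (2.0, 2.5, 3.0):  # assumed rental prices, USD per GPU-hour
    gpu_cost, residual, per_million = split_cost(rate)
    print(f"${rate:.2f}/GPU-h -> GPU ${gpu_cost:,.0f}, tokens ${residual:,.0f}, "
          f"${per_million:.2f} per million tokens")
```

At $3 per GPU-hour the GPU term alone ($109,440) exceeds the reported total, so under pure rental pricing the effective rate would have to sit below roughly $2.85 per GPU-hour. This is one more reason to treat the decomposition as a bound rather than a breakdown.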
40.5.2 Cost Comparison Across Autoresearch Systems
| System | Cost/Paper | Hardware | Time/Paper | Experimental Depth |
|---|---|---|---|---|
| AI Scientist (2024) | ~$15 | API-only | Hours | Minimal (no GPU experiments) |
| AI Scientist v2 (2025) | ~$15–20 | API-only | Hours | Low–moderate |
| Karpathy autoresearch (2026) | ~$18 | 1 GPU | ~8 hours | Moderate (single GPU) |
| FARS (2026) | ~$1,040 | 160 GPUs | ~2.3 hours | High (GPU training + inference) |
| DeepScientist (2026) | Not reported | 20,000+ GPU-hours total | Days–weeks | Very high (frontier-pushing) |
FARS's $1,040 per paper is ~70× more expensive than AI Scientist's $15 per paper, but this comparison is misleading. AI Scientist produces papers with minimal experimental depth — no GPU training, no real model fine-tuning, no large-scale inference. FARS executes actual GPU-intensive experiments. A more appropriate comparison is cost per unit of experimental work: a single FARS paper at $1,040 may contain the experimental equivalent of what a human researcher would spend $10,000–$30,000 on, yielding a 10–29× cost advantage.
40.5.3 Scaling Economics
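If FARS operated continuously for one year, projections based on observed throughput (acknowledging that these assume linear scaling and constant quality, which may not hold) give:

$$\text{Papers per year} \approx \frac{8{,}760\ \text{h}}{228\ \text{h}} \times 100 \approx 3{,}842$$

Producing 3,842 papers with human researchers, at approximately 4 papers per researcher per year, would require ~960 researchers. At a fully loaded cost of $200K per researcher:

$$\frac{C_{\text{human}}}{C_{\text{FARS}}} \approx \frac{960 \times \$200\text{K}}{\$5\text{M}} = \frac{\$192\text{M}}{\$5\text{M}} \approx 38\times$$

where $C_{\text{FARS}} \approx \$5M$ is the estimated annual operating cost (GPU + API), slightly above the ~$4.0M implied by 3,842 papers at $1,040 each. This 38× cost advantage drives the economic case for automated research infrastructure. However, the comparison again carries the caveat that FARS short papers and typical human papers are not directly commensurable in scope.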
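The projection is sensitive to its inputs, so it helps to express it as a function. The defaults below are the chapter's own assumptions (4 papers per researcher-year, $200K per researcher, $5M annual FARS cost); the function name and structure are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def project_year(
    papers_per_researcher: float = 4.0,
    researcher_cost_usd: float = 200_000.0,
    fars_annual_cost_usd: float = 5_000_000.0,
) -> dict[str, float]:
    """Extrapolate the FARS-100 rate to one year and compare with human cost."""
    papers = HOURS_PER_YEAR / 228 * 100  # observed rate: 100 papers per 228 h
    researchers = papers / papers_per_researcher
    human_cost = researchers * researcher_cost_usd
    return {
        "papers": papers,
        "researchers": researchers,
        "human_cost_usd": human_cost,
        "advantage": human_cost / fars_annual_cost_usd,
    }


baseline = project_year()
# Same projection, pricing FARS at the run's observed $1,040 per paper.
at_run_cost = project_year(fars_annual_cost_usd=baseline["papers"] * 1_040)

print(f"papers/year: {baseline['papers']:.0f}, advantage: {baseline['advantage']:.1f}x")
print(f"advantage at observed per-paper cost: {at_run_cost['advantage']:.1f}x")
```

Changing `papers_per_researcher` moves the ratio proportionally, which is why the scope caveat above matters as much as the cost inputs.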
40.6 Reproducibility and Transparency
40.6.1 Transparency Protocol
FARS's transparency is unprecedented in the autoresearch space — and so is its opacity, in different dimensions. The system provides:
- Live dashboard (analemma.ai/fars): Real-time observation of the running system
- Public experiment repositories (gitlab.com/fars-a): Code, data, and results for each paper
- Published papers (analemma.ai/papers/): All completed papers available online
- Independent quality evaluation: Scores from Stanford's Agentic Reviewer, which anyone can re-run
However, FARS itself — the system code, agent prompts, coordination mechanisms, and infrastructure — is proprietary. This creates an inverted reproducibility profile: the process is unusually transparent (live public operation), but the system is unusually opaque (closed source). Researchers can see that FARS works and inspect its outputs, but cannot build their own version.
| Criterion | Rating | Notes |
|---|---|---|
| System reproducibility | Low | Proprietary code, closed architecture |
| Experiment reproducibility | Medium–High | Individual experiment repos are public on GitLab |
| Result verification | Medium | Papers and scores independently evaluable |
| Process transparency | High | Live dashboard, public GitLab activity |
| Hardware accessibility | Low | 160 GPUs required for full replication |
40.6.2 Safety and Review Pipeline
FARS employs a deliberately conservative publication pipeline. All 100 papers produced by the automated system undergo evaluation by the Agentic Reviewer, followed by manual review by at least 3 senior researchers with 5+ years experience each. Only papers passing manual review are submitted to arXiv, and all submissions are explicitly labeled as AI-generated. This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work.
40.7 Memory, Learning, and Limitations
40.7.1 Memory Architecture
FARS's memory is the shared filesystem itself — a pragmatic choice that merges workspace and persistent memory. Each project maintains its own directory with structured subdirectories for ideation outputs, plans, experiment code and results, and the final paper. Agents access the filesystem as an external, persistent memory that supplements their finite context windows.
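A per-project workspace of this kind might be laid out as follows. The subdirectory names are hypothetical, chosen to mirror the description above; FARS's actual layout is not public.

```python
import tempfile
from pathlib import Path

# Hypothetical subdirectory names; FARS's real layout is not public.
SUBDIRS = ("ideation", "plan", "experiment/code", "experiment/results", "paper")


def init_workspace(root: Path, project_id: str) -> Path:
    """Create one project's directory tree and return its path."""
    project = root / "projects" / project_id
    for sub in SUBDIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


def list_artifacts(project: Path) -> list[str]:
    """Relative paths of every file written to the project so far."""
    return sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_file()
    )


def demo() -> list[str]:
    """Create a workspace, write one plan file, and list what is stored."""
    with tempfile.TemporaryDirectory() as root:
        project = init_workspace(Path(root), "FA0001")
        (project / "plan" / "plan.md").write_text("baseline: ...\n", encoding="utf-8")
        return list_artifacts(project)
```

An agent with a limited context window can call something like `list_artifacts` first and then load only the files it needs, which is the sense in which the filesystem acts as external memory.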
However, FARS does not appear to maintain cross-project memory — no explicit knowledge base, skill library, or failed-hypothesis registry. Each project is treated independently. If the system generates hypothesis $H_1$ for project $P_1$ and discovers it fails, there is no mechanism to prevent generating a similar hypothesis $H_1'$ for project $P_2$. Over 244 hypotheses, some redundancy is likely.
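To make the gap concrete, the sketch below shows what a minimal failed-hypothesis registry could look like. It is purely illustrative and not a feature of FARS: the class, the word-overlap similarity measure, and the 0.6 threshold are all assumptions made for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two hypothesis statements (0 to 1)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


class FailedHypothesisRegistry:
    """Hypothetical cross-project store of hypotheses that did not work."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self.failed: list[str] = []

    def record_failure(self, hypothesis: str) -> None:
        self.failed.append(hypothesis)

    def is_near_duplicate(self, candidate: str) -> bool:
        """True if the candidate closely resembles any recorded failure."""
        return any(jaccard(candidate, old) >= self.threshold for old in self.failed)


registry = FailedHypothesisRegistry()
registry.record_failure("best-of-n reranking improves ocr accuracy at low temperature")

similar = registry.is_near_duplicate(
    "best-of-n reranking improves ocr accuracy at high temperature"
)
unrelated = registry.is_near_duplicate("import autofix helps coding agents")
```

Word overlap is a crude proxy; a production system would more plausibly compare embeddings, but the control flow (record failures, check candidates before spending compute) would be the same.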
40.7.2 No Cross-Project Learning
This is a fundamental architectural distinction from evolutionary systems covered in Parts P03–P06 of this survey. Evolutionary systems explicitly learn: populations improve over generations because selection pressure retains good solutions. FARS's pipeline does not learn — each project starts fresh from a research direction.
```python
# Pseudocode: no public implementation available.
# Contrasts the FARS pipeline pattern with evolutionary learning loops.
# Type names (Paper, Program, Task) and helper functions are placeholders.
from __future__ import annotations  # lets the placeholder type names stay undefined


# FARS: linear pipeline, each project independent
def fars_pipeline(directions: list[str]) -> list[Paper]:
    papers = []
    for direction in directions:
        hypothesis = ideation_agent(direction)  # No memory of prior projects
        plan = planning_agent(hypothesis)
        results = experiment_agent(plan)        # Cannot reuse prior code
        paper = writing_agent(results)
        papers.append(paper)
    return papers  # No feedback to ideation from outcomes


# Evolutionary system: learning loop, population improves
def evolutionary_loop(seed: Program, task: Task, max_generations: int) -> Program:
    population = initialize(seed)
    for generation in range(max_generations):
        parents = select(population)            # Selection pressure
        children = mutate(parents)              # Variation
        scores = evaluate(children, task)
        population = update(population, children, scores)  # Learning!
    return best(population)
```
This is both a strength and a limitation. Without cross-project learning there is no risk of premature convergence, and parallelization is trivial because projects share no state. The cost is that FARS cannot build on its own discoveries or avoid repeating mistakes.
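The parallelization point can be illustrated with a short sketch. The stage stubs below are placeholders for the real agents, which are not public, and the use of a thread pool is an assumption chosen for brevity rather than a description of how FARS schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_project(direction: str) -> str:
    """Stub pipeline: each stage depends only on the previous stage's output."""
    hypothesis = f"hypothesis about {direction}"
    plan = f"plan for {hypothesis}"
    results = f"results of {plan}"
    return f"paper on {results}"


def run_all(directions: list[str], workers: int = 4) -> list[str]:
    """Run independent projects concurrently.

    Projects share no state, so no locks or ordering constraints are needed.
    Executor.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_project, directions))


if __name__ == "__main__":
    for paper in run_all(["sparse attention", "data curation"]):
        print(paper)
```

Adding cross-project learning would break this property: projects would then have to read and write a shared store, which introduces ordering and consistency questions that the independent design avoids.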
40.7.3 Limitations
Several significant limitations emerge from the FARS design and deployment:
AI-only research domain. FARS currently operates exclusively in AI/ML research — the "AI-for-AI" paradigm. This is a pragmatic choice (computational experiments provide readily available evaluation signals), but it means FARS cannot address biology, physics, chemistry, or any domain requiring physical experiments, human evaluation, or user studies.
No iterative refinement within projects. The linear pipeline does not support looping back from failed experiments to revised hypotheses within the same project. Systems like CycleResearcher and AI Scientist v2 employ iterative review-revision cycles that can improve a paper through multiple rounds. FARS trades this flexibility for throughput.
Proprietary system. The closed-source nature limits the research community's ability to verify architectural claims, reproduce the system, or build upon it. Individual experiment repos are public, but the orchestration layer is not.
Scale requirements. The 160-GPU cluster and the cost of roughly $104K per 100 papers place FARS beyond the reach of most academic labs. This is research infrastructure for well-funded organizations, not a tool for individual researchers.
Quality ceiling. The mean score of 5.05 is below the ICLR acceptance threshold of 5.39. While FARS produces work that is above average for human submissions, it has not yet demonstrated the ability to consistently produce acceptance-quality work by conference standards. Whether this matters depends on whether one accepts FARS's philosophical premise that conference acceptance is not the right success metric.
Scope and depth. Short, single-contribution papers cannot achieve the depth, synthesis, or multi-faceted analysis of a full human research paper. FARS papers are best compared to individual experiments within a human paper, not to complete papers.
40.8 Comparative Analysis
40.8.1 System Framing Comparison
FARS's pipeline framing positions it uniquely among autoresearch paradigms. The following diagram illustrates how different systems frame the automated research problem:
40.8.2 Quantitative Cross-System Comparison
The following comparison must be interpreted with care: different systems target different objectives, operate in different domains, and define "paper" differently. Direct numerical comparison across these dimensions risks false precision.
| Metric | AI Scientist | AI Scientist v2 | DeepScientist | FARS |
|---|---|---|---|---|
| Papers produced | ~10 (demo) | ~15 (demo) | ~1,100 validated | 100 |
| Cost per paper | ~$15 | ~$15–20 | Not reported | ~$1,040 |
| GPU experiments | No | Limited | Yes (20K+ GPU-hrs) | Yes (160 GPUs) |
| Unattended operation | No | No | Partially | Yes (228 hrs) |
| Negative results | Discarded | Discarded | Discarded | Published |
| Open source | Yes | Yes | Partially | No (outputs public) |
| Quality (ICLR scale) | ~3.5–4.5 | 4.5–6.3 | Not rated | 5.05 mean |
| Review methodology | Self-review | Automated + peer | Expert judge | Agentic Reviewer + manual |
Note: AI Scientist quality scores are approximate estimates from published examples. DeepScientist measured success by exceeding human SOTA on frontier tasks rather than by review scores. Cross-system quality comparisons should be treated as indicative, not definitive, due to differences in evaluation protocol, paper scope, and domain.
40.8.3 The Depth–Breadth Trade-off
FARS occupies the breadth end of a depth–breadth spectrum in autoresearch. DeepScientist consumed 20,000+ GPU-hours to produce approximately 1,100 experimentally validated ideas, some of which exceeded human state-of-the-art on frontier tasks. FARS consumed at most about 36,500 GPU-hours (160 GPUs × 228 hours = 36,480 if every GPU ran for the entire deployment) to produce 100 complete papers at above-average quality. The strategies are complementary:
- DeepScientist: Few directions explored to maximum depth. Goal: genuine scientific breakthroughs.
- FARS: Many directions explored to moderate depth. Goal: comprehensive coverage of a research space.
Neither approach dominates. A research organization might use a FARS-like system to rapidly map a problem space, then deploy a DeepScientist-like system to push the most promising directions to their limits.
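The headline figures can be rederived from the deployment numbers reported for FARS. The snippet below is plain arithmetic on those reported values; the GPU-hour total is an upper bound that assumes every GPU was busy for the full deployment, which the source does not state.

```python
# Reported deployment figures for FARS.
GPUS = 160
HOURS = 228
PAPERS = 100
HYPOTHESES = 244
TOTAL_COST_USD = 104_000
TOTAL_TOKENS = 11_400_000_000

# Upper bound: assumes 100% utilization of all GPUs (an assumption).
gpu_hours = GPUS * HOURS
gpu_hours_per_paper = gpu_hours / PAPERS
cost_per_paper = TOTAL_COST_USD / PAPERS
tokens_per_paper = TOTAL_TOKENS // PAPERS
conversion_rate = PAPERS / HYPOTHESES

print(f"GPU-hours (upper bound): {gpu_hours:,}")            # 36,480
print(f"GPU-hours per paper:     {gpu_hours_per_paper:.1f}")  # 364.8
print(f"Cost per paper:          ${cost_per_paper:,.0f}")     # $1,040
print(f"Tokens per paper:        {tokens_per_paper:,}")       # 114,000,000
print(f"Hypothesis-to-paper:     {conversion_rate:.1%}")      # 41.0%
```

These are averages over the whole run. They say nothing about the variance between projects, which the source does not report.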
40.9 Broader Implications
40.9.1 Research as Infrastructure
FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled researchers. In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously. This shift parallels transitions in other domains: manual QA to automated CI/CD in software, artisan production to assembly lines in manufacturing, manual analysis to automated pipelines in data science.
40.9.2 The Minimal Composable Knowledge Unit
FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a blog post. This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. Each FARS paper is easy to review (one thing to evaluate), easy to cite precisely (one clear finding), and incentivizes decomposition over bundling.
Whether the broader scientific community would accept this format is an open question. The academic incentive structure rewards comprehensive, multi-contribution papers at top venues, and FARS is not designed to operate within that structure.
40.9.3 Publication Bias and the Scientific Record
If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. In human research, an estimated 80% of experimental knowledge is lost to publication bias. FARS's 41% hypothesis-to-paper conversion rate — including papers whose entire contribution is documenting failure — represents a structural improvement in knowledge preservation.
40.9.4 The AI-for-AI Feedback Loop
FARS operates in the AI-for-AI domain: AI systems researching AI systems. Some of its papers improve LLM training, agent design, or evaluation methods. If these improvements feed back into FARS's own components (directly or through the broader research ecosystem), the system creates a positive feedback loop in AI capability. This recursive potential is both the most exciting and most concerning implication of industrial-scale autoresearch.
40.10 Connections to Evolutionary Systems
Although FARS is a pipeline system rather than an evolutionary one, its design has important connections to the evolutionary algorithm discovery systems covered in earlier parts of this survey. The shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems — a lesson applicable to evolutionary systems that often use more complex event buses and databases. FARS's systematic production of negative results has no direct analogue in evolutionary systems, which discard failed candidates; an evolutionary system that preserved and analyzed failures could improve search efficiency. Conversely, FARS's lack of cross-project learning is precisely the gap that evolutionary approaches fill: selection pressure, population management, and progressive improvement are absent from FARS and could substantially improve its operation in future versions.
A hybrid architecture — FARS-like pipeline stages with evolutionary learning across projects — could combine throughput with progressive improvement. The Ideation Agent could maintain a population of hypothesis templates refined by selection pressure from experimental outcomes. The Experiment Agent could accumulate a skill library of reusable experimental techniques. Such extensions would transform FARS from a pipeline into a learning pipeline, maintaining throughput advantages while adding the improvement dynamics that make evolutionary systems powerful.
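One way such a hybrid could look is sketched below. None of this exists in FARS as described: the `Template` class, the smoothed success-rate score, and truncation selection are hypothetical choices made to show how experimental outcomes might feed back into ideation.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """A hypothetical hypothesis template tracked across projects."""
    text: str
    successes: int = 0
    trials: int = 0

    @property
    def score(self) -> float:
        # Laplace-smoothed success rate, so untested templates score 0.5
        # instead of being ranked below every tested one.
        return (self.successes + 1) / (self.trials + 2)


def record_outcome(template: Template, succeeded: bool) -> None:
    """Feed one experimental outcome back to the template that produced it."""
    template.trials += 1
    template.successes += int(succeeded)


def select(population: list[Template], keep: int) -> list[Template]:
    """Truncation selection: keep the highest-scoring templates."""
    return sorted(population, key=lambda t: t.score, reverse=True)[:keep]


if __name__ == "__main__":
    a = Template("scale the data")
    b = Template("swap the optimizer")
    c = Template("untested idea")
    for outcome in (True, True):
        record_outcome(a, outcome)
    for outcome in (False, False):
        record_outcome(b, outcome)
    print([t.text for t in select([a, b, c], keep=2)])
```

In this sketch selection only decides which templates seed future ideation. A failed experiment would still be written up as a negative result, so the pipeline's publication behavior would be unchanged.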
Summary
FARS is the first continuously operating automated research system demonstrated at industrial scale: 228 hours of fully unattended operation on 160 GPUs, producing 100 short research papers from 244 generated hypotheses at a cost of ~$1,040 per paper and a mean quality score of 5.05 on the ICLR review scale — above the 4.21 average for human submissions.
Main contribution to the field: FARS proves that automated research can function as reliable production infrastructure rather than a proof-of-concept demonstration, and introduces two significant innovations: (1) the shared filesystem as sole inter-agent coordination mechanism, enabling crash-safe, human-inspectable, trivially parallelizable multi-agent operation; and (2) the systematic publication of negative results as first-class research contributions, addressing one of science's most persistent structural problems.
What a researcher should know: FARS trades depth for breadth and learning for throughput. It does not improve over time (no cross-project learning), operates only in AI/ML research domains, and is proprietary. Its philosophical challenge to academic publishing conventions — that the paper format is a historical artifact, not a necessary property of knowledge production — may prove more influential than its technical architecture. The system's outputs (papers, code repositories, quality scores) are publicly available for independent evaluation at analemma.ai and gitlab.com/fars-a, even though the system itself cannot be reproduced.