
FARS

First-principles fully automated research system that rejects academic publishing conventions in favor of minimal, composable knowledge units — deployed live with 160 GPUs to produce 100 papers in 228 hours.

Organization: Analemma (日行迹智能科技)
Published: February 11, 2026 (blog post); live deployment February 12, 2026
Type: System + Blog Post + Live Deployment
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: FARS: Fully Automated Research System

  • Blog post: Introducing FARS — published February 11, 2026
  • Live dashboard: https://analemma.ai/fars (real-time observation of running experiments)
  • GitLab: https://gitlab.com/fars-a — public repositories for each research project
  • Live deployment start: 10:00 PM EST (UTC−5), February 12, 2026
  • Completion: 228 hours, 28 minutes, 33 seconds of continuous unattended operation
  • Output: 244 hypotheses generated → 100 short research papers produced
  • Predecessor systems cited: AI Scientist (Sakana AI), CycleResearcher, Zochi (IntologyAI), AI Scientist v2, AI-Researcher (HKU), DeepScientist (Westlake University)
  • Research paper URL: Individual papers published at analemma.ai/papers/<uuid>/

FARS is not a traditional research paper. It is a deployed system that was announced via blog post, demonstrated through a live 228-hour public experiment, and validated through its actual outputs. This positions it uniquely in the autoresearch landscape: where AI Scientist sought peer review acceptance and DeepScientist sought frontier-pushing scientific contributions, FARS sought to prove that an unattended research assembly line can operate continuously, stably, and at industrial throughput.

The naming itself is telling: "Fully Automated Research System" emphasizes the system rather than the agent, the researcher, or the scientist. FARS is infrastructure for knowledge production, not an artificial persona.

Relationship to the Autoresearch Landscape

FARS explicitly positions itself as a successor to and synthesis of six prior systems:

Genealogy of End-to-End Autoresearch Systems (2024–2026):
───────────────────────────────────────────────────────────
AI Scientist (Sakana AI, 2024)
  └─ First end-to-end: idea → code → paper → review
  └─ Single-agent, $15/paper, weak experimental scope

CycleResearcher (2024)
  └─ Iterative review-revision cycles
  └─ Improved paper quality through feedback loops

Zochi (IntologyAI, 2025)
  └─ First AI-authored papers accepted at ACL 2025 / ICLR 2025 workshops
  └─ Average reviewer score 7.67 (above human acceptance threshold)

AI Scientist v2 (Sakana AI, 2025)
  └─ Tree search methodology, $15-20/run
  └─ First AI paper to pass double-blind peer review (ICLR workshop, 6.33)

AI-Researcher + Novix (HKU, 2025)
  └─ NeurIPS 2025 Spotlight
  └─ Four-module architecture for resource collection → filtering → ideas → writing

DeepScientist (Westlake University, 2026)
  └─ ~5,000 ideas, ~1,100 experimentally validated
  └─ Exceeded human SOTA on 3 frontier tasks (183.7%, 1.9%, 7.9%)
  └─ 20,000+ GPU-hours consumed

FARS (Analemma, 2026)          ← this system
  └─ 160-GPU cluster, 228h continuous operation
  └─ 244 hypotheses → 100 papers
  └─ First live, public, unattended research deployment at scale
  └─ Rejects academic formatting conventions entirely
───────────────────────────────────────────────────────────

Where each prior system addressed a subset of the autoresearch problem (AI Scientist: feasibility; Zochi: acceptance quality; DeepScientist: frontier-pushing depth), FARS addresses industrial-scale throughput with continuous autonomous operation — the question of whether automated research can function as a reliable, always-on production system rather than a one-off demonstration.


2 Authors and Team

Founder and CEO

Dr. Sun Tianxiang (孙天祥) — Founder and CEO of Analemma. PhD in Computer Science from Fudan University (2024), advised by Xipeng Qiu and Xuanjing Huang. Sun was a principal developer of MOSS, one of the earliest Chinese open-source conversational language models, and has extensive research experience in reinforcement learning and large language model post-training.

Sun's background is significant: MOSS (2023) was one of the first open-source LLMs to demonstrate multi-turn dialogue capability in Chinese, and the experience of building and training large models directly informs FARS's architecture — the system is built by people who understand the full stack from pretraining through RLHF to deployment.

Organizational Context

Analemma (日行迹智能科技):

  • Founded: March 2025
  • Location: Shanghai, China
  • Core team: Members from Fudan University's MOSS team and Shanghai AI Laboratory's InternLM team
  • Funding: Angel round of several hundred million RMB from investors including Sequoia Capital China
  • Mission: Building infrastructure for automated scientific research

Aspect       Detail
──────────────────────────────────────────────────────────────────────────
Team origin  Fudan University NLP Lab (MOSS) + Shanghai AI Lab (InternLM)
Founded      March 2025
FARS launch  February 2026 (~11 months from founding to deployment)
Funding      Angel round, Sequoia Capital China lead
Hardware     160 NVIDIA GPUs (proprietary cluster)

The speed of execution is notable: from company founding to a 160-GPU live deployment producing 100 research papers in under one year. This suggests either (a) substantial pre-founding research, (b) aggressive parallel engineering, or (c) both. The team's prior experience building MOSS and InternLM models would have provided deep familiarity with the infrastructure needed.

Team Composition

The team is not publicly enumerated in the blog post. Based on organizational context:

  • NLP/LLM researchers: Core competency from MOSS and InternLM lineage
  • Infrastructure engineers: Required for 160-GPU cluster management
  • Systems architects: Multi-agent coordination and shared filesystem design
  • Manual reviewers: At least 3 senior researchers (5+ years experience each) for pre-arXiv quality gates

Philosophical Alignment

The team's background in building large language models (MOSS, InternLM) positions them uniquely: they understand both the capabilities and limitations of LLMs as research tools, having been on the producing side of the models that autoresearch systems consume. This insider perspective likely informs FARS's pragmatic design philosophy — the system is built by researchers who know what research actually requires, not by engineers imagining what research might look like.


3 Core Contribution

The Radical Thesis

FARS makes a philosophical claim that distinguishes it from every prior autoresearch system:

The output of a research system should be research contributions, not papers conforming to academic formatting conventions.

This is not a minor stylistic preference. It is a fundamental rejection of the form factor that every other autoresearch system has optimized for. AI Scientist, Zochi, AI Scientist v2, and DeepScientist all measure success by how closely their outputs resemble human-written conference papers, using peer review scores as the gold standard. FARS inverts this: the paper format is a historical artifact of human-centered research, not a necessary property of knowledge production.

First-Principles Critique of Human Research

FARS's blog post articulates a first-principles analysis of why human-centered research is structurally inefficient:

Structural Problem          Root Cause                                     FARS Response
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
High entry barrier          Years of training to become a researcher       Automated agents require no training period
High failure rate           Most research ideas don't work out             Negative results are explicitly valued and published
Publication bias            Only "successful" experiments get published    Every completed experiment produces output
Maximal contribution units  Papers are large, comprehensive artifacts      Each paper is a single, well-scoped contribution
Format overhead             Conforming to venue-specific formatting rules  No structural constraints beyond clarity
Length inflation            Pressure to fill pages to meet minimums        Short papers: as long as they need to be, no more
Supply constraint           Limited number of human researchers            System runs 24/7, parallelizes across projects

This critique identifies a fundamental tension: the academic paper format evolved to serve human readers and human review committees, not to maximize the rate of knowledge frontier expansion. FARS argues that if the goal is to "efficiently and reliably expand the frontier of knowledge," then the format should be optimized for that goal rather than for backward compatibility with human conventions.

Five Design Principles

From the blog post and observable outputs, FARS operates on five principles:

  1. Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container for the contribution.

  2. Single-scoped contributions. Each paper addresses exactly one research question or presents exactly one finding. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.

  3. Negative results are knowledge. A well-conducted experiment that shows something doesn't work is as valuable as one that shows something does. FARS explicitly reports negative results. Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — a paper whose entire contribution is demonstrating that a promising technique fails.

  4. No unnecessary constraints. Papers are not padded to meet length minimums, do not conform to venue-specific formatting templates, and do not include sections (e.g., lengthy related work surveys) that don't serve the contribution.

  5. Scale reveals truth. A handful of examples is insufficient to evaluate a research system. FARS was designed to produce 100 papers precisely because quality variance at scale is a known limitation — the signal emerges from the aggregate, not from cherry-picked examples.

What Makes This Novel

Prior autoresearch systems tried to pass the Turing test of academic publishing — can the AI produce a paper indistinguishable from a human-written one? FARS asks a different question: if we freed research from the constraints of human publishing conventions, what would an optimally efficient research system look like?

This is the difference between building a faster horse and building a car.

The Live Deployment as Contribution

FARS's public deployment is itself a scientific contribution. By running continuously for 228 hours with a live dashboard, public GitLab, and real-time observability, Analemma subjected their system to the most rigorous possible evaluation: public scrutiny at scale. Any researcher could watch the system operate, inspect intermediate artifacts, read every generated paper, and form their own assessment.

This transparency protocol is unprecedented in the autoresearch space. AI Scientist released code but not live runs. DeepScientist reported results but not the process. FARS showed everything, live, as it happened.


4 Supported Solutions

Problem Framing

FARS frames automated research as a pipeline production problem rather than a search problem (contrast with AIRA₂ which frames it as a graph search) or a dialogue problem (contrast with AI Scientist which frames it as a multi-turn LLM conversation).

The pipeline framing has specific implications:

Search-based framing (AIRA₂, AlphaEvolve):
  Goal: Find optimal solution in a space
  Metaphor: Exploration of a landscape
  Bottleneck: Search efficiency, evaluation signal

Dialogue-based framing (AI Scientist, Zochi):
  Goal: Produce human-like research discourse
  Metaphor: Simulating a researcher
  Bottleneck: LLM capability, prompt engineering

Pipeline-based framing (FARS):
  Goal: Convert research directions into completed papers
  Metaphor: Assembly line / factory
  Bottleneck: Throughput, reliability, coordination
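Read operationally, the pipeline framing is a linear chain of typed stages, each consuming the previous stage's artifact. The following is a minimal illustrative sketch only: all stage logic is stubbed, and none of these class or function names come from Analemma's (proprietary) implementation.

```python
from dataclasses import dataclass, field

# Hypothetical artifact types for the four pipeline stages.
@dataclass
class Hypothesis:
    question: str

@dataclass
class Plan:
    hypothesis: Hypothesis
    baselines: list = field(default_factory=list)

@dataclass
class Results:
    plan: Plan
    outcome: str  # "positive", "negative", or "inconclusive"

@dataclass
class Paper:
    title: str
    outcome: str

def ideate(direction: str) -> Hypothesis:
    # Stand-in for literature review + hypothesis generation.
    return Hypothesis(question=f"Does X improve Y in {direction}?")

def plan(h: Hypothesis) -> Plan:
    # Stand-in for experimental design (baselines, metrics, resources).
    return Plan(hypothesis=h, baselines=["baseline-A"])

def experiment(p: Plan) -> Results:
    # Stand-in for code generation + GPU job scheduling.
    return Results(plan=p, outcome="negative")

def write(r: Results) -> Paper:
    # Negative results flow through unchanged: every completed
    # experiment becomes a paper, regardless of outcome sign.
    return Paper(title=r.plan.hypothesis.question, outcome=r.outcome)

# Assembly line: research direction in, completed paper out.
paper = write(experiment(plan(ideate("RLVR"))))
print(paper.outcome)
```

The design point the sketch captures is that the writing stage does not filter on outcome: a negative experimental result is a valid input to paper production, which is where the factory metaphor departs from the search metaphor.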

Research Domains

Primary domain: AI/LLM research — the "AI-for-AI" paradigm where automated systems research the technology that powers them. This is an explicitly acknowledged limitation and a pragmatic choice: AI research provides the most readily available evaluation signals (code runs, benchmarks, automated metrics).

Initial research directions specified:

  • Reinforcement Learning from Verifiable Rewards (RLVR)
  • Other AI-related topics the system discovers autonomously

Observed output topics (from published papers):

  • Self-reflection in small language models
  • World-model verification for agent planning
  • Vision-language model selection strategies (OCR)
  • Fine-tuning data selection (hard vs. easy examples)
  • Metamorphic testing for LLM world models
  • Coding agent testing and import autofix
  • Budget allocation for verification systems

The diversity of topics is notable: starting from a few specified directions, the Ideation Agent discovered and explored adjacent research questions autonomously. This suggests the literature review component is effective at identifying related open problems.

Paper Format and Structure

FARS papers are short papers — typically 4–8 pages — focused on a single contribution. Each paper includes:

Component                  Present  Notes
───────────────────────────────────────────────────────────────────────────────────────
Abstract                   Yes      Concise, focused on the single contribution
Introduction / motivation  Yes      Brief — sufficient to frame the question
Method                     Yes      Technical description of the approach
Experiments                Yes      With framework diagrams, result tables, analysis
Conclusion                 Yes      What was learned, including when the result is negative
Related work               Minimal  Only directly relevant prior work, not comprehensive surveys
Code repository            Yes      Public GitLab repo for each paper
Extensive appendices       No       No padding

Solution Types

Based on observed outputs, FARS produces three types of research contributions:

Type 1: Positive methods. Standard research contributions where a proposed method improves on baselines.

  • Example: "Equation-Consistency Gated Reflection for Small Language Models: A Training-Free Approach to Preventing Self-Correction Regressions"

Type 2: Negative results. Experiments demonstrating that a plausible approach fails, with analysis of why.

  • Example: "Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate)
  • Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement

Type 3: Empirical insights. Data-driven findings about phenomena in AI systems.

  • Example: "Hard Examples Beat Easy Examples in Repetition-Heavy Long-CoT Fine-Tuning"
  • Example: "Stutter-Invariance Metamorphic Audits for Text World-Model Rollouts" — domain-specific probes statistically tied with a simpler baseline

The explicit production of Type 2 (negative results) papers is philosophically significant. In human academia, negative results are systematically suppressed due to publication bias — journals and conferences preferentially accept positive results. FARS's willingness to report negative results as complete papers represents a structural fix to this bias.


5 LLM Integration

Model Access

FARS has access to both open-source and closed-source large language models, with the GPU cluster enabling local inference for open-source models and API access for proprietary ones.

Access Type                           Purpose                                                 Examples
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Open-source models (local inference)  Experimental subjects, data synthesis, cheap inference  Models run on the 160-GPU cluster
Closed-source models (API)            Agent backbone, complex reasoning, LLM-as-a-Judge       Likely GPT-4-class or Claude-class models
Model inference endpoints             Data synthesis, agent design, evaluation                Encapsulated as tools for the Experiment Agent

Role of LLMs in Each Agent

Ideation Agent:

  • Uses LLMs for literature comprehension and synthesis
  • Generates research hypotheses from literature analysis
  • Conducts automated literature review across open-access papers
  • May use embedding models for semantic search over literature

Planning Agent:

  • Uses LLMs for experimental design reasoning
  • Translates hypotheses into concrete experimental protocols
  • Determines required resources, baselines, and evaluation metrics

Experiment Agent:

  • Uses LLMs for code generation and debugging
  • Uses LLMs as experimental subjects (e.g., testing LLM behaviors)
  • Uses LLMs as evaluation judges (LLM-as-a-Judge paradigm)
  • Uses LLMs for data synthesis (generating training/test data)
  • Uses LLMs for agent design (creating sub-agents for experiments)

Writing Agent:

  • Uses LLMs for paper composition
  • Synthesizes experimental results into structured narratives
  • Generates figures, tables, and analyses

LLM-as-Infrastructure vs. LLM-as-Subject

A distinctive aspect of FARS is the dual role of LLMs:

LLM Usage in FARS:
─────────────────────────────────────────────
Infrastructure layer (running the system):
  ├── Ideation Agent backbone
  ├── Planning Agent backbone
  ├── Experiment Agent backbone
  └── Writing Agent backbone

Subject layer (being researched):
  ├── LLMs as experimental subjects
  ├── LLM behaviors being studied
  ├── LLM training/fine-tuning being tested
  └── LLM evaluation methods being developed
─────────────────────────────────────────────

This creates a recursive structure: LLMs researching LLMs. The system uses GPT/Claude-class models to reason about experiments on smaller or different LLMs. The infrastructure models must be more capable than the subject models for this to work — you cannot study the behavior of a model using a less capable model as your reasoning engine.

Token Consumption

The FARS-100 run consumed 11.4 billion tokens across 100 papers:

Metric                    Value
──────────────────────────────────────────────────────────
Total tokens              11.4 billion
Per paper (average)       ~114 million tokens
Per hypothesis (average)  ~46.7 million tokens
Token cost component      Major fraction of $104,000 total

The per-paper token count of ~114 million is orders of magnitude higher than typical LLM generation tasks:

Token consumption comparison:
─────────────────────────────────────
Typical chatbot response:        ~500 tokens
Typical long-form generation:    ~5,000 tokens
Typical agentic task:            ~50,000 tokens
Complex multi-step agent:        ~500,000 tokens
AI Scientist paper:              ~5,000,000 tokens (estimated)
FARS paper:                      ~114,000,000 tokens  ← ~23x more
─────────────────────────────────────

This enormous token consumption reflects the "trading computing power for intelligence" characteristic described in reporting. The Experiment Agent likely dominates: running code, debugging failures, iterating on approaches, calling LLMs for data synthesis and evaluation — all within a single paper's lifecycle.


6 Key Results

FARS-100 Headline Numbers

Metric                             Value
──────────────────────────────────────────────────────────
Duration                           228 hours 28 minutes 33 seconds
Hypotheses generated               244
Papers completed                   100
Hypothesis → paper conversion      41.0% (100/244)
Average time per paper             ~2 hours 17 minutes
Total tokens consumed              11.4 billion
Total cost                         ~$104,000 (~¥750,000 RMB)
Cost per paper                     ~$1,040
Hardware                           160 NVIDIA GPUs
Human intervention                 Zero during the 228-hour run
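The headline numbers are internally consistent, as a quick cross-check of the reported figures shows:

```python
# Cross-check the FARS-100 headline numbers against each other.
hypotheses = 244
papers = 100
total_seconds = 228 * 3600 + 28 * 60 + 33   # 228h 28m 33s
total_cost = 104_000                         # USD, reported
total_tokens = 11.4e9

conversion = papers / hypotheses
hours_per_paper = total_seconds / 3600 / papers
cost_per_paper = total_cost / papers
tokens_per_paper = total_tokens / papers

print(f"{conversion:.1%}")                    # conversion rate
print(f"{hours_per_paper:.2f} h/paper")       # ~2.28 h = ~2h 17m
print(f"${cost_per_paper:,.0f}/paper")
print(f"{tokens_per_paper / 1e6:.0f}M tokens/paper")
```

All four derived figures (41.0%, ~2h 17m, ~$1,040, ~114M tokens) match the reported per-paper averages, so the table is self-consistent rather than independently sourced.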

Quality Assessment

Using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards:

Population                       Mean Score  Range
─────────────────────────────────────────────────────────
FARS-100 papers                  5.05        3.0 – 6.3
ICLR 2026 all human submissions  4.21        not reported
ICLR 2026 accepted papers        5.39        not reported

Interpretation:

  • FARS papers score 0.84 points above the average human submission
  • FARS papers score 0.34 points below the average accepted paper
  • The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation
  • A small number of papers exceeded 6.0, indicating occasional "breakthrough" quality

Quality positioning:

          ← Worse                            Better →
   3.0         4.0         5.0         6.0         7.0
    │           │           │           │           │
    ├───FARS────┤           │           │           │
    │   range   │           │           │           │
    │           │     ┌─────┼─────┐     │           │
    │           │     │ FARS mean │     │           │
    │           │     │   5.05    │     │           │
    │           │     └───────────┘     │           │
    │           │                       │           │
    │     ┌─────┤                       │           │
    │     │Human│                       │           │
    │     │avg  │                       │           │
    │     │4.21 │                       │           │
    │     └─────┘                       │           │
    │                             ┌─────┤           │
    │                             │Accpt│           │
    │                             │avg  │           │
    │                             │5.39 │           │
    │                             └─────┘           │

Agentic Reviewer Calibration

The Agentic Reviewer system was validated against ICLR 2025 review data:

Comparison                         Spearman Correlation
────────────────────────────────────────────────────────
Human reviewer vs. human reviewer  0.41
AI reviewer vs. human reviewer     0.42

The AI reviewer achieves parity with human inter-reviewer agreement, suggesting its scores are about as reliable as human review. Both remain noisy, however: a correlation of 0.41–0.42 corresponds to only ~17% shared variance, leaving ~83% of the variation as reviewer-specific noise.
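The shared-variance figure is simply the square of the correlation (treating the Spearman coefficient as a rough stand-in for Pearson's r for this back-of-envelope purpose):

```python
# Shared variance between two raters scales as the square of the correlation.
for r in (0.41, 0.42):
    shared = r ** 2
    print(f"r = {r}: shared ~ {shared:.0%}, unexplained ~ {1 - shared:.0%}")
```

At r ≈ 0.41 this gives roughly 17% shared and 83% unexplained variance, which is why single review scores (human or AI) should be read as coarse signals, not precise measurements.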

Throughput Analysis

FARS-100 throughput timeline:
────────────────────────────────────────────────
Day 1 (0-24h):     ~10 papers
Day 2 (24-48h):    ~10 papers  
Day 3 (48-72h):    ~10 papers
...
Day 9.5 (228h):    100th paper completed
────────────────────────────────────────────────
Average: ~10.5 papers/day
Peak: Not reported (likely higher due to parallel execution)
Minimum: Not reported

Comparison to Human Research

Metric                     Human Researcher                FARS
────────────────────────────────────────────────────────────────────────────────────────
Time per paper             3–6 months                      ~2.3 hours
Papers per year            2–5 (typical)                   ~3,800 (projected continuous)
Cost per paper             $50,000–200,000 (fully loaded)  ~$1,040
24/7 operation             No                              Yes
Negative result reporting  Rare (publication bias)         Systematic
Quality (ICLR scale)       4.21 avg submission             5.05 avg

The throughput advantage is approximately 1,000x in papers-per-unit-time. The cost advantage is approximately 50–200x per paper (depending on human researcher cost assumptions). However, these comparisons have important caveats — FARS papers are short, single-contribution papers while human papers are typically fuller works. The appropriate comparison may be FARS papers vs. individual experiments within a human paper rather than vs. complete human papers.

Conversion Funnel

Research direction documents (input)
         │
         ▼
   244 hypotheses generated (Ideation Agent)
         │
         ▼
   ??? hypotheses passed automated review
         │
         ▼
   ??? experimental plans created (Planning Agent)
         │
         ▼
   ??? experiments executed (Experiment Agent)
         │
         ▼
   100 papers completed (Writing Agent)
         │
         ▼
   ??? papers pass manual review (3+ senior researchers)
         │
         ▼
   ??? papers submitted to arXiv (explicit AI-generated label)

The 41% hypothesis-to-paper conversion rate (100/244) suggests that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is actually a healthy ratio — in human research, the hypothesis-to-publication rate is typically much lower (perhaps 5–20%). FARS's higher conversion rate may reflect either (a) better hypothesis quality, (b) lower publication threshold (short papers, negative results accepted), or (c) both.
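Putting the funnel numbers next to the rough human baseline cited above (the 5–20% human hypothesis-to-publication range is an estimate, not a measured figure):

```python
# Hypothesis-to-paper conversion: FARS vs. the rough human baseline.
fars_rate = 100 / 244
human_low, human_high = 0.05, 0.20   # rough human hypothesis-to-publication range

print(f"FARS conversion: {fars_rate:.1%}")
print(f"vs. human: {fars_rate / human_high:.1f}x-{fars_rate / human_low:.1f}x higher")
```

The 2x–8x gap is consistent with either interpretation in the text: better hypothesis filtering before execution, or a lower bar for what counts as publishable output.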


7 Reproducibility

Source Code

Component                  Public?  Location
───────────────────────────────────────────────────────
System code (FARS itself)  No       Proprietary to Analemma
Research outputs (papers)  Yes      analemma.ai/papers/<uuid>/
Experiment code            Yes      gitlab.com/fars-a/<project>/
Live dashboard             Yes      analemma.ai/fars

FARS itself is not open-source. This is a critical distinction from systems like AI Scientist (open-source), Karpathy's autoresearch (open-source), and Zochi (partially open). The system's architecture, agent prompts, coordination mechanisms, and infrastructure code are proprietary.

What Is Reproducible

  1. Individual experiment results: Each paper has an associated GitLab repository with code, data, and instructions. Other researchers can re-run specific experiments.

  2. Paper quality evaluation: The Agentic Reviewer system is independently available (paperreview.ai). Anyone can re-score the FARS papers.

  3. Observable operation: The live dashboard provides real-time visibility into the system's operation, making the process (though not the implementation) transparent.

What Is Not Reproducible

  1. The system itself: Without access to FARS's code, prompts, and infrastructure, the full system cannot be reproduced.

  2. The 160-GPU cluster: The hardware requirements are a substantial barrier. Few academic labs have access to 160 GPUs for a continuous 10-day experiment.

  3. The specific LLM configurations: Which models serve as agent backbones, their specific prompts, and their interaction patterns are not disclosed.

Reproducibility Assessment

Criterion                   Rating       Notes
──────────────────────────────────────────────────────────────────────────────
System reproducibility      Low          Proprietary code, closed architecture
Experiment reproducibility  Medium-High  Individual experiment repos are public
Result verification         Medium       Papers and scores can be independently evaluated
Process transparency        High         Live dashboard, public GitLab activity
Hardware accessibility      Low          160 GPUs required

The reproducibility profile is inverted compared to most research: the process is unusually transparent (live public operation), but the system is unusually opaque (proprietary code). This creates a trust-but-can't-verify dynamic: observers can see that FARS works, but cannot build their own version.


8 Compute and API Costs

Hardware Configuration

Resource              Specification
──────────────────────────────────────────────────────────────────────────────────────────
GPU cluster           160 NVIDIA GPUs
GPU model             Not publicly specified (likely A100 or H100 class)
Purpose               Training, inference, experiment execution
Access model          Encapsulated as tools for the Experiment Agent
Additional endpoints  Model inference endpoints for data synthesis, LLM-as-a-Judge, agent design

FARS-100 Cost Breakdown

Category                       Estimated Cost   Notes
──────────────────────────────────────────────────────────────────────────────────────────
GPU compute (160 GPUs × 228h)  ~$54,000–74,000  Implied ~$1.5–2/GPU-hour over ~36,500 GPU-hours
LLM API tokens (11.4B tokens)  ~$30,000–50,000  Mix of open/closed model inference
Total                          ~$104,000        Official reported figure
Per paper                      ~$1,040          $104,000 / 100
Per hypothesis                 ~$426            $104,000 / 244
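Only the $104,000 total is official; the compute/token split is estimated. Backing out the implied GPU rate from that split (the token-spend range is this report's assumption, so the rate is only indicative):

```python
# Back out the implied GPU rate from the reported FARS-100 totals.
gpus = 160
hours = 228 + 28 / 60                 # ~228.47 h (ignoring the 33 s)
gpu_hours = gpus * hours              # ~36,555 GPU-hours

total_cost = 104_000                  # USD, official reported figure
for token_cost in (30_000, 50_000):   # assumed API spend range
    compute_cost = total_cost - token_cost
    print(f"if tokens = ${token_cost:,}: "
          f"implied ~${compute_cost / gpu_hours:.2f}/GPU-hour")
```

The implied ~$1.5–2/GPU-hour is below typical on-demand cloud rates, consistent with the cluster being proprietary rather than rented per-hour.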

Token Economics

Token consumption by category (estimated breakdown):

Experiment Agent:     ~70% of total tokens
  ├── Code generation and debugging
  ├── Running LLM experiments (subject models)
  ├── Data synthesis
  ├── LLM-as-a-Judge evaluations
  └── Iterative refinement loops

Ideation Agent:       ~15% of total tokens
  ├── Literature review (reading papers)
  ├── Hypothesis generation
  └── Cross-referencing and validation

Writing Agent:        ~10% of total tokens
  ├── Paper composition
  ├── Figure and table generation
  └── Revision and formatting

Planning Agent:       ~5% of total tokens
  ├── Experimental design
  └── Protocol specification

The Experiment Agent likely dominates token consumption because it performs the most computationally and cognitively intensive work: writing code, debugging failures, running experiments, interpreting results, and iterating. This is analogous to human research where the experiment phase consumes the majority of time and resources.
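Apportioning the 11.4B total by the estimated shares above (the percentages are this report's estimates, not disclosed figures) gives the absolute scale of each agent's consumption:

```python
# Apportion the 11.4B total tokens by the estimated per-agent shares.
total_tokens = 11.4e9
shares = {
    "Experiment Agent": 0.70,
    "Ideation Agent":   0.15,
    "Writing Agent":    0.10,
    "Planning Agent":   0.05,
}
assert abs(sum(shares.values()) - 1.0) < 1e-9  # shares must cover the total

for agent, share in shares.items():
    print(f"{agent}: ~{total_tokens * share / 1e9:.2f}B tokens")
```

Even the smallest share (Planning, ~0.57B tokens) exceeds the entire token budget of most single-agent systems, which underlines how much of FARS's cost sits in iteration rather than in any one generation step.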

Cost Comparison Across Autoresearch Systems

System                        Cost per Paper  Hardware                 Duration
─────────────────────────────────────────────────────────────────────────────────────
AI Scientist (2024)           ~$15            Minimal (API-only)       Hours
AI Scientist v2 (2025)        ~$15–20         Minimal (API-only)       Hours
Zochi (2025)                  Not reported    Not reported             Not reported
DeepScientist (2026)          Not reported    20,000+ GPU-hours total  Months
FARS (2026)                   ~$1,040         160 GPUs × 228h          ~2.3 hours/paper
Karpathy autoresearch (2026)  ~$18/run        1 GPU                    8 hours

FARS's per-paper cost of ~$1,040 is substantially higher than AI Scientist (~$15) but reflects a fundamentally different scope: AI Scientist produces minimally-viable papers with limited experimental depth, while FARS executes real GPU-intensive experiments. The appropriate comparison is cost-per-unit-of-experimental-work, not cost-per-paper.

Cost Efficiency Analysis

Cost per paper:              $1,040
Human equivalent:            $50,000 - $200,000 (3-6 months of researcher time)
Cost reduction vs. human:    48x - 192x

But: FARS papers are short papers (~single contribution)
Adjusted comparison (FARS paper ≈ 1 experiment in human paper):
  Human cost per experiment:    ~$10,000 - $30,000
  FARS cost per experiment:     ~$1,040
  Adjusted cost reduction:      10x - 29x

Scaling Economics

If FARS operated continuously for one year:

Annual projection (continuous operation):
  Papers per year:        ~3,800 (at 2.3h/paper)
  Annual GPU cost:        ~$2.8M - $4.0M (160 GPUs full-time)
  Annual token cost:      ~$1.5M - $2.5M
  Annual total:           ~$4.3M - $6.5M
  Cost per paper:         ~$1,130 - $1,710

  Equivalent human team:
    3,800 papers/year ÷ 4 papers/researcher/year = 950 researchers
    950 researchers × $200K/year = $190M/year

  FARS vs. human team:   ~$5M vs. ~$190M → 38x cost advantage

These projections assume linear scaling and constant quality, which may not hold. But they illustrate the economic logic driving industrial-scale autoresearch.
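The projection arithmetic, under the stated linear-scaling and constant-quality assumptions (the cost ranges are the estimates from the projection above):

```python
# Annual projection assuming continuous operation and linear scaling.
hours_per_paper = 2.3
papers_per_year = 24 * 365 / hours_per_paper      # ~3,800

fars_cost_low, fars_cost_high = 4.3e6, 6.5e6      # estimated annual total, USD
human_team = papers_per_year / 4                  # at 4 papers/researcher/year
human_cost = human_team * 200_000                 # fully loaded $200K/researcher

print(f"~{papers_per_year:,.0f} papers/year")
print(f"equivalent human team: ~{human_team:,.0f} researchers, "
      f"~${human_cost / 1e6:.0f}M/year")
print(f"cost advantage: ~{human_cost / fars_cost_high:.0f}x-"
      f"{human_cost / fars_cost_low:.0f}x")
```

This yields a ~29x–44x advantage across the estimated cost range; the ~38x headline figure comes from comparing the rounded ~$5M midpoint against ~$190M.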


9 Architecture Solution

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        FARS SYSTEM ARCHITECTURE                       │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     RESEARCH DIRECTION INPUT                     │  │
│  │  (Document specifying multiple research directions)              │  │
│  └─────────────────────────┬───────────────────────────────────────┘  │
│                            │                                          │
│                            ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     IDEATION AGENT                                │  │
│  │                                                                   │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │  │
│  │  │  Literature   │  │  Hypothesis  │  │  Automated Review    │   │  │
│  │  │  Review       │  │  Generation  │  │  (pass/fail gate)    │   │  │
│  │  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘   │  │
│  │         │                 │                      │               │  │
│  │  [Open-access papers]  [Public GitLab repos]   [Quality filter] │  │
│  └─────────────────────────┬───────────────────────────────────────┘  │
│                            │  (validated hypotheses)                   │
│                            ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     PLANNING AGENT                                │  │
│  │                                                                   │  │
│  │  Hypothesis → Experimental Plan                                   │  │
│  │  (baselines, metrics, resources, evaluation criteria)             │  │
│  └─────────────────────────┬───────────────────────────────────────┘  │
│                            │  (experimental plan)                     │
│                            ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     EXPERIMENT AGENT                               │  │
│  │                                                                   │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │  │
│  │  │  Code         │  │  GPU Cluster │  │  Model Inference     │   │  │
│  │  │  Generation   │  │  (160 GPUs)  │  │  Endpoints           │   │  │
│  │  │  & Debugging  │  │  as tools    │  │  as tools            │   │  │
│  │  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘   │  │
│  │         │                 │                      │               │  │
│  │         └─────────────────┼──────────────────────┘               │  │
│  │                           │                                       │  │
│  │         Code → Schedule GPU jobs → Collect results → Iterate     │  │
│  └─────────────────────────┬───────────────────────────────────────┘  │
│                            │  (experimental results + code)           │
│                            ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     WRITING AGENT                                 │  │
│  │                                                                   │  │
│  │  Results → Paper (short paper, single contribution)               │  │
│  └─────────────────────────┬───────────────────────────────────────┘  │
│                            │  (completed paper)                       │
│                            ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     OUTPUT                                        │  │
│  │                                                                   │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐│  │
│  │  │  Paper   │  │  GitLab  │  │  Live     │  │  Manual Review   ││  │
│  │  │  (PDF)   │  │  Repo    │  │  Dashboard│  │  (3+ reviewers)  ││  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘│  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
│  ═══════════════════════════════════════════════════════════════════  │
│  ║                  SHARED FILE SYSTEM                              ║  │
│  ║  (Workspace + persistent memory for all agents)                  ║  │
│  ║  • Structured project directories                                ║  │
│  ║  • Agents read from / write to shared workspace                  ║  │
│  ║  • No direct agent-to-agent communication                        ║  │
│  ║  • Scales to many concurrent research projects                   ║  │
│  ═══════════════════════════════════════════════════════════════════  │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     GPU CLUSTER (160 GPUs)                        │  │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐        ┌─────┐               │  │
│  │  │GPU 1│ │GPU 2│ │GPU 3│ │GPU 4│  ···   │GPU  │               │  │
│  │  │     │ │     │ │     │ │     │        │ 160 │               │  │
│  │  └─────┘ └─────┘ └─────┘ └─────┘        └─────┘               │  │
│  │  Encapsulated as TOOLS for the Experiment Agent                  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Key Architectural Decisions

Decision                                  | Rationale                                                            | Implications
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sequential pipeline (not graph search)    | Research has a natural linear flow: idea → plan → experiment → write | Simpler coordination, clearer handoffs, easier debugging
Shared file system (not message passing)  | Filesystem is universally understood, durable, inspectable           | No message broker needed, natural audit trail, human-inspectable
No direct agent-to-agent communication    | Eliminates protocol complexity, race conditions, deadlocks           | Agents are independently testable, replaceable, scalable
GPU cluster as tools (not raw access)     | Experiment Agent schedules jobs without managing infrastructure      | Clean separation of concerns; agent thinks about experiments, not CUDA
Parallel project queue                    | Multiple research projects advance simultaneously                    | Higher throughput, better GPU utilization, tolerant of blocked projects
Short papers (not full conference papers) | Minimal unit of contribution reduces writing burden                  | Writing Agent can focus on clarity, not page-filling

The Shared Filesystem Pattern

The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This deserves detailed analysis:

Shared Filesystem Structure (inferred):
─────────────────────────────────────────
/fars-workspace/
├── projects/
│   ├── FA0001/                          # Project directory
│   │   ├── ideation/
│   │   │   ├── literature_review.md     # Ideation Agent writes
│   │   │   ├── hypothesis.md            # Ideation Agent writes
│   │   │   └── review_result.json       # Automated review output
│   │   ├── planning/
│   │   │   ├── experimental_plan.md     # Planning Agent writes
│   │   │   └── resources.json           # Required resources
│   │   ├── experiment/
│   │   │   ├── code/                    # Experiment Agent writes
│   │   │   ├── results/                 # Experiment outputs
│   │   │   └── logs/                    # Execution logs
│   │   ├── writing/
│   │   │   ├── paper.tex                # Writing Agent writes
│   │   │   ├── paper.pdf                # Compiled paper
│   │   │   └── figures/                 # Generated figures
│   │   └── status.json                  # Project stage tracking
│   ├── FA0002/
│   │   └── ...
│   └── ...
├── gitlab/                              # Public repository staging
└── shared/
    ├── literature_cache/                # Shared literature database
    └── model_endpoints.json             # Available inference endpoints

Why filesystem over message passing?

Most multi-agent systems use message queues, event buses, or direct API calls for inter-agent communication. FARS's choice of a shared filesystem has several deep advantages:

  1. Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state represents a perfect checkpoint. No message replay, no state reconstruction needed.

  2. Natural handoffs. Agent A completes its work by writing files. Agent B begins its work by reading files. The filesystem boundary is the API contract — no schema definitions, no versioning headaches, no serialization/deserialization overhead.

  3. Human inspectability. Any researcher can SSH into the filesystem and inspect exactly what the Ideation Agent produced, what the Planning Agent interpreted, what the Experiment Agent executed. This is invaluable for debugging and trust-building.

  4. Trivial parallelism. Multiple projects run in parallel simply by having separate directories. No lock contention, no resource arbitration (beyond GPU scheduling), no coordination overhead.

  5. Scalability. Adding more concurrent projects means adding more directories. Adding more agents (e.g., a Review Agent) means adding a reader of existing files. The filesystem pattern scales linearly without architectural changes.

  6. Decoupled evolution. Each agent can be upgraded independently. As long as the file formats are compatible, agents don't need to know about each other's implementations.

This pattern is not novel in systems engineering (Unix pipes, Plan 9, microservices coordinating via shared storage) but is novel in the autoresearch space. Most prior systems use tighter coupling: AI Scientist uses a single LLM conversation, AIRA₂ uses an in-memory population database, and DeepScientist uses knowledge graphs and databases.
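The write-read handoff can be sketched in a few lines, assuming the inferred directory layout above and a hypothetical `status.json` schema (`{"stage": ...}`); neither the field names nor the helper functions are documented by FARS.

```python
import json
from pathlib import Path

def projects_ready_for(root: Path, stage: str):
    """Yield project directories whose status.json marks them as awaiting `stage`."""
    for proj in sorted(root.iterdir()):
        status_file = proj / "status.json"
        if status_file.exists() and json.loads(status_file.read_text()).get("stage") == stage:
            yield proj

def planning_agent_step(proj: Path) -> None:
    """Planning Agent: read the upstream artifact, write the downstream one."""
    hypothesis = (proj / "ideation" / "hypothesis.md").read_text()
    plan = "# Experimental plan\n\nDerived from hypothesis:\n" + hypothesis
    (proj / "planning").mkdir(exist_ok=True)
    (proj / "planning" / "experimental_plan.md").write_text(plan)
    # Advance the stage marker so the Experiment Agent picks this project up next
    (proj / "status.json").write_text(json.dumps({"stage": "experiment"}))
```

Note how adding a new agent (say, a Review Agent) is just another reader of these files; no broker or protocol change is needed.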

The Assembly Line Metaphor

FARS's pipeline architecture maps directly to an industrial assembly line:

Industrial Assembly Line:
  Raw material → Stamping → Welding → Painting → Assembly → QC → Ship

FARS Research Assembly Line:
  Directions → Ideation → Planning → Experiment → Writing → Review → Publish

Key properties of assembly lines that FARS inherits:

  1. Pipelining. While paper N is being written, paper N+1 is being experimented on, paper N+2 is being planned, and paper N+3 is being ideated. All four stages operate concurrently on different projects.

  2. Modular replacement. Each station (agent) can be upgraded independently. A better Writing Agent doesn't require changes to the Experiment Agent.

  3. Bottleneck identification. The slowest stage determines throughput. If experiments take 10x longer than writing, optimizing the Writing Agent yields zero throughput improvement.

  4. Quality control. Papers that fail at any stage can be rejected without wasting downstream effort (except for the explicit negative-results philosophy, which means experimental failures may still produce papers).


10 Component Breakdown

Component 1: Ideation Agent

Purpose: Convert research directions into validated, actionable research hypotheses.

Aspect            | Detail
─────────────────────────────────────────────────────────────────────
Input             | Document specifying multiple research directions
Output            | Validated research hypotheses (forwarded to Planning Agent)
Tools             | Open-access paper search, public GitLab repository access
Gate              | Automated review — hypothesis must pass before forwarding
Token consumption | ~15% of total (literature comprehension dominates)

Process:

  1. Receives research direction document (human-authored, specifying broad areas like "RLVR")
  2. Conducts literature review across open-access papers
  3. Identifies gaps, open questions, and unexplored combinations
  4. Generates specific, testable hypotheses
  5. Each hypothesis undergoes automated review (quality/feasibility filter)
  6. Passing hypotheses are written to the shared filesystem → picked up by Planning Agent

Key design choice: The Ideation Agent has access to public GitLab repositories in addition to papers. This means it can read actual code implementations, not just paper descriptions. This is a significant advantage over literature-review-only approaches, as the gap between what papers describe and what code implements is often substantial.

Hypothesis generation rate: 244 hypotheses in 228 hours ≈ 1.07 hypotheses/hour. Given that 100 became papers (41% conversion), the Ideation Agent overproduces by design — it generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.
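The stated rates follow directly from the run's public figures:

```python
# Rates implied by the FARS-100 run: 244 hypotheses, 100 papers, 228h 28m 33s.
hypotheses, papers = 244, 100
run_hours = 228 + 28 / 60 + 33 / 3600

hypotheses_per_hour = hypotheses / run_hours   # ≈ 1.07
conversion_rate = papers / hypotheses          # ≈ 0.41
```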

Component 2: Planning Agent

Purpose: Transform validated hypotheses into concrete experimental plans.

Aspect            | Detail
─────────────────────────────────────────────────────────────────────
Input             | Validated hypothesis (from shared filesystem)
Output            | Experimental plan (written to shared filesystem)
Scope             | Baselines, metrics, evaluation criteria, resource requirements
Token consumption | ~5% of total (relatively lean — planning is reasoning-intensive but not token-heavy)

Process:

  1. Reads hypothesis from project directory
  2. Determines what baselines are needed
  3. Specifies evaluation metrics and success criteria
  4. Estimates required computational resources
  5. Writes experimental plan to project directory

The Planning Agent is the thinnest component — its role is to bridge the gap between an abstract hypothesis and a concrete experimental protocol. In human research, this is the "methods section" thought process.

Component 3: Experiment Agent

Purpose: Execute experimental plans using GPU cluster and model inference endpoints.

Aspect            | Detail
─────────────────────────────────────────────────────────────────────
Input             | Experimental plan (from shared filesystem)
Output            | Code, results, logs (written to shared filesystem)
Tools             | 160-GPU cluster (as tools), model inference endpoints (as tools)
Capabilities      | Code generation, debugging, GPU scheduling, data synthesis, LLM-as-a-Judge
Token consumption | ~70% of total (dominant consumer)

Tool encapsulation:

The GPU cluster is not exposed as raw hardware. Instead, it is encapsulated as tools — high-level interfaces that the Experiment Agent can call:

Available tools for Experiment Agent:
─────────────────────────────────────
GPU Tools:
  schedule_training_job(config) → job_id
  check_job_status(job_id) → status
  get_job_results(job_id) → results
  cancel_job(job_id) → ok

Inference Tools:
  run_inference(model, inputs) → outputs
  synthesize_data(spec) → dataset
  evaluate_with_judge(model, prompt, response) → score

Utility Tools:
  read_dataset(path) → data
  write_results(path, data) → ok
  create_figure(data, spec) → figure

This encapsulation serves multiple purposes:

  • Isolation: The agent doesn't need to know about CUDA, distributed training, or job schedulers
  • Safety: The agent cannot accidentally consume all GPU resources or interfere with other projects
  • Abstraction: The same experimental code could theoretically run on different hardware backends

Experiment execution flow:

  1. Read experimental plan
  2. Write code for the experiment
  3. Schedule training/inference jobs on GPU cluster
  4. Monitor job execution
  5. Collect and analyze results
  6. If results are unexpected → debug, modify, re-run
  7. Write final results and code to project directory
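This execute-monitor-iterate loop can be sketched against the tool interface listed earlier; the tool names mirror the inferred signatures, and the `tools` object here is a stand-in, not FARS's actual API.

```python
# Sketch of the Experiment Agent's retry loop under an assumed tool interface.
def run_experiment(plan: dict, tools, max_attempts: int = 3) -> dict:
    """Generate a job config, schedule it, inspect results, retry on failure."""
    for attempt in range(1, max_attempts + 1):
        config = {"plan": plan, "attempt": attempt}  # stands in for generated code
        job_id = tools.schedule_training_job(config)
        while tools.check_job_status(job_id) == "running":
            pass  # in practice: sleep, poll, and inspect logs
        results = tools.get_job_results(job_id)
        if results.get("ok"):
            return results  # final results get written to the project directory
        # unexpected results: debug, modify the config, and re-run
    return {"ok": False, "reason": "max attempts exhausted"}
```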

Component 4: Writing Agent

Purpose: Produce the final research paper from experimental results.

Aspect            | Detail
─────────────────────────────────────────────────────────────────────
Input             | Experimental results, code, hypothesis (from shared filesystem)
Output            | Short research paper (PDF + source)
Format            | Single-contribution short paper
Includes          | Abstract, method, experiments, results, conclusion, figures, tables
Token consumption | ~10% of total

Key distinction from other autoresearch writing: Most systems (AI Scientist, DeepScientist) attempt to produce full conference papers with comprehensive related work sections, lengthy introductions, and detailed appendices. FARS's Writing Agent produces short papers focused exclusively on the single contribution. This is both a philosophical choice (minimal composable knowledge) and a practical one (shorter papers are faster to write and review, and less prone to hallucination).

Component 5: Safety and Review Pipeline

Purpose: Quality gate between automated production and public dissemination.

Stage             | Type        | Details
─────────────────────────────────────────────────────────────────────
Automated review  | Algorithmic | Hypothesis-level gate in Ideation Agent
AI review         | Automated   | Stanford Agentic Reviewer (ICLR-calibrated)
Manual review     | Human       | At least 3 researchers with 5+ years of experience each
Labeling          | Manual      | All submissions explicitly labeled as AI-generated
arXiv submission  | Manual      | Only papers passing manual review are submitted

The safety pipeline is deliberately conservative:

Papers produced by FARS (100)
    │
    ▼
AI Review (Agentic Reviewer) ──── score distribution published
    │
    ▼
Manual Review (3+ senior researchers, 5+ years each)
    │                │
    ▼                ▼
  PASS             FAIL
    │                │
    ▼                │
Label as             │
AI-generated         │
    │                │
    ▼                │
Submit to         Archived
arXiv             (not published)

This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work. By requiring manual review by multiple senior researchers before external publication, FARS ensures that its public outputs meet human standards even though the production process is fully automated.


11 Core Mechanisms (Detailed)

11.1 The Shared Filesystem as Coordination Protocol

The shared filesystem is not merely a storage layer — it is the entire coordination protocol of the system. Understanding its design is essential to understanding FARS.

Properties of filesystem-based coordination:

Property    | Advantage                                                          | Trade-off
─────────────────────────────────────────────────────────────────────────────────────────────
Durability  | Every write persists; crash-safe                                   | Higher I/O latency than in-memory
Visibility  | Any agent (or human) can inspect any state                         | Potential information leakage if not structured
Atomicity   | Single-file publishes can be made atomic (write temp, then rename) | No multi-file transactions
Ordering    | Filesystem timestamps provide natural ordering                     | Clock skew in distributed systems (mitigated if single-node)
Scalability | Directories partition naturally                                    | Filesystem metadata overhead at extreme scale
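The atomic-publish idiom behind the Atomicity row is standard systems practice rather than FARS-specific documentation: write to a temporary file in the same directory, then rename over the target.

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload: dict) -> None:
    """Publish a file so readers see either the old version or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # ensure data hits disk before the rename
    os.replace(tmp, path)     # atomic replacement on POSIX and Windows
```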

Contrast with alternative coordination patterns:

Pattern               | Used by              | Advantages           | Disadvantages
────────────────────────────────────────────────────────────────────────────
Message queue         | AI-Researcher        | Decoupled, ordered   | Needs broker, no natural persistence
In-memory database    | AIRA₂                | Fast, structured     | Volatile, single-node bottleneck
Knowledge graph       | DeepScientist        | Rich relationships   | Complex queries, schema overhead
Single LLM context   | AI Scientist         | Simple, coherent     | Context window limits, no parallelism
Shared filesystem     | FARS                 | Universal, durable   | Unstructured unless conventions enforced

FARS's choice of shared filesystem is the most Unix-philosophy-aligned design in the autoresearch space. It follows the principle: "Write programs to handle text streams, because that is a universal interface." In FARS's case: "Write agents to handle files in directories, because that is a universal interface."

11.2 Pipeline Parallelism and Throughput Optimization

FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines, GPU rendering pipelines, and industrial assembly lines.

Time →
─────────────────────────────────────────────────────────
     t=0    t=1    t=2    t=3    t=4    t=5    t=6
     │      │      │      │      │      │      │
P1:  [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P2:         [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P3:                [IDEA] [PLAN] [EXPT] [EXPT] [WRIT]→
P4:                       [IDEA] [PLAN] [EXPT] [EXPT]→
P5:                              [IDEA] [PLAN] [EXPT]→
─────────────────────────────────────────────────────────
          Pipeline stages execute concurrently

Throughput analysis:

If the bottleneck stage (Experiment) takes time T_exp, the throughput is:

Throughput = 1 / max(T_idea, T_plan, T_exp, T_write)

Given FARS's average of ~2.3 hours/paper and the Experiment Agent consuming ~70% of resources, the Experiment stage is almost certainly the bottleneck:

Estimated stage durations (per paper):
  Ideation:    ~20-30 minutes
  Planning:    ~10-15 minutes
  Experiment:  ~90-120 minutes    ← bottleneck
  Writing:     ~20-30 minutes

Pipeline throughput ≈ 1 paper / 90-120 minutes
Observed:          ≈ 1 paper / 137 minutes (2h17m)

The discrepancy between estimated bottleneck (90-120 min) and observed (137 min) likely reflects: - Pipeline filling/draining overhead - Failed experiments that consume time but don't produce papers - Resource contention on the GPU cluster during peak parallel execution - Overhead from the 59% of hypotheses that don't become papers
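The arithmetic above can be made explicit, using midpoints of the estimated stage durations (the estimates, not FARS-published figures):

```python
# Pipeline throughput is set by the slowest stage (durations in minutes).
stage_minutes = {"ideation": 25, "planning": 12.5, "experiment": 105, "writing": 25}
bottleneck = max(stage_minutes, key=stage_minutes.get)
minutes_per_paper_ideal = stage_minutes[bottleneck]

# Observed: 228h 28m 33s for 100 papers
minutes_per_paper_observed = (228 * 60 + 28 + 33 / 60) / 100
```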

11.3 GPU Cluster Encapsulation

The design decision to encapsulate the 160-GPU cluster as tools rather than exposing raw hardware is architecturally significant:

Traditional research agent:
  Agent → SSH → GPU machine → CUDA → Training → Results
  (Agent manages infrastructure directly)

FARS design:
  Agent → Tool API → Cluster Scheduler → GPU → Training → Results
  (Agent is isolated from infrastructure)

Benefits of tool encapsulation:

  1. Cognitive offloading. The Experiment Agent reasons about experiments, not infrastructure. It doesn't need to know about CUDA versions, driver compatibility, multi-GPU parallelism, or job scheduling.

  2. Resource management. The tool layer can implement fair scheduling across concurrent projects, prevent resource starvation, and enforce compute budgets.

  3. Error isolation. A crashed training job doesn't crash the agent. The tool layer handles retries, timeouts, and failure reporting.

  4. Portability. The same agent prompts could theoretically work with different hardware backends (different GPU types, cloud providers, or even TPUs) by swapping the tool implementation.

11.4 Negative Result Detection and Reporting

FARS's systematic production of negative result papers requires a mechanism for detecting and properly framing negative results:

Experiment outcome classification:
─────────────────────────────────────
1. Positive result    → Method works, report improvement
2. Negative result    → Method doesn't work, report why
3. Null result        → Inconclusive, may need more experiments
4. System failure     → Bug/crash, not a research result

Traditional systems (AI Scientist, AIRA₂) treat outcomes 2-4 as failures and discard them. FARS treats outcome 2 as a legitimate research contribution and produces a paper explaining why the hypothesis failed.

This requires the Writing Agent to have a distinct mode:

Positive result writing:
  • "We propose X. X achieves Y improvement over baseline Z."
  • Emphasis on the method, the improvement, the contribution.

Negative result writing:
  • "We investigate whether X can improve Y. We find that X provides no significant improvement, and analyze why."
  • Emphasis on the hypothesis, the evidence against it, and the mechanistic explanation for the failure.
  • Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — the contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
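One way to make the four-way outcome taxonomy operational is a simple routing function; the thresholds, field names, and helpers here are illustrative, not FARS's implementation.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    POSITIVE = 1        # method works: report the improvement
    NEGATIVE = 2        # method doesn't work: report why
    NULL = 3            # inconclusive: may need more experiments
    SYSTEM_FAILURE = 4  # bug/crash: not a research result

def classify(results: dict) -> Outcome:
    """Map raw experiment results onto the four-way taxonomy."""
    if results.get("crashed"):
        return Outcome.SYSTEM_FAILURE
    delta = results["metric"] - results["baseline"]
    if abs(delta) <= results.get("noise_floor", 0.0):
        return Outcome.NULL
    return Outcome.POSITIVE if delta > 0 else Outcome.NEGATIVE

def paper_mode(outcome: Outcome) -> Optional[str]:
    """Outcomes 1-2 become papers; 3 may loop back; 4 is discarded."""
    return {Outcome.POSITIVE: "positive-result paper",
            Outcome.NEGATIVE: "negative-result paper"}.get(outcome)
```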

11.5 Automated Hypothesis Review

The Ideation Agent includes an automated review step that filters hypotheses before they enter the pipeline. While the specific review criteria are not publicly documented, the 41% conversion rate (100 papers from 244 hypotheses) provides bounds:

Hypothesis filtering funnel (estimated):
  244 hypotheses generated
   │
   ├── ~X% filtered by automated review (too vague, infeasible, redundant)
   │
   ├── ~Y% fail during experiment execution (bugs, inconclusive results)
   │
   └── 100 produce completed papers (41%)

If automated review filters 30% → 171 pass review
  Then 100/171 = 58% execution success rate

If automated review filters 10% → 220 pass review
  Then 100/220 = 45% execution success rate

Either way, the conversion rates are remarkably high compared to human research, where the hypothesis-to-publication rate in AI/ML is typically 10-30%. This may reflect:

  • Conservative hypothesis generation (the Ideation Agent proposes incremental, testable hypotheses)
  • The inclusion of negative results (failures that would be discarded in human research become papers)
  • The well-scoped single-contribution format (a lower bar for a "complete" paper)
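The funnel arithmetic above generalizes to any assumed review-filter rate:

```python
# Execution success rate implied by a given automated-review filter rate.
hypotheses, papers = 244, 100

def execution_success_rate(review_filter_rate: float) -> float:
    """Papers per hypothesis that survived the automated review gate."""
    surviving = round(hypotheses * (1 - review_filter_rate))
    return papers / surviving
```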


12 Programming Language

System Implementation

The FARS system implementation language is not publicly disclosed, but strong inferences can be made:

Component                    | Likely Language       | Evidence
──────────────────────────────────────────────────────────────────────────────────────
Agent orchestration          | Python                | Industry standard for ML infrastructure; team's MOSS/InternLM background is Python
Shared filesystem management | Python/Shell          | Standard infrastructure tooling
GPU cluster tools            | Python (PyTorch/JAX)  | Standard ML training frameworks
Model inference endpoints    | Python (FastAPI/gRPC) | Standard serving patterns
Paper compilation            | LaTeX                 | Academic paper production standard
Dashboard                    | JavaScript/TypeScript | Web frontend (analemma.ai/fars)

Agent-Generated Code

The Experiment Agent generates Python code for experiments. Based on the published GitLab repositories, experiments use standard ML tooling:

  • Training: PyTorch, Hugging Face Transformers, TRL
  • Evaluation: Standard metrics libraries, custom evaluation scripts
  • Data processing: pandas, numpy, datasets (Hugging Face)
  • Visualization: matplotlib, plotly

Paper Generation

Papers are generated in LaTeX and compiled to PDF. The Writing Agent must produce:

  • LaTeX source with proper formatting
  • Figure generation code (Python → matplotlib/plotly → image)
  • Table generation (experimental results → LaTeX tables)
  • Bibliography management (BibTeX references)

GitLab Repositories

Each completed project is published to GitLab (gitlab.com/fars-a/<project>/). Repository structure typically includes:

<project>/
├── paper.pdf           # Compiled paper
├── paper.tex           # LaTeX source (inferred)
├── code/
│   ├── experiment.py   # Main experimental code
│   ├── evaluate.py     # Evaluation script
│   └── requirements.txt
├── data/               # Processed data (if small)
├── results/            # Raw experimental results
└── README.md           # Project description

13 Memory Management

System-Level Memory: The Shared Filesystem

FARS's primary memory system is the shared filesystem itself. This is a deliberate architectural choice that merges workspace and persistent memory into a single mechanism:

Memory Architecture:
─────────────────────────────────────────────────────────
Layer 1: Shared Filesystem (persistent, cross-agent)
  ├── Project directories (per-project state)
  ├── Literature cache (shared across projects)
  ├── Model endpoint registry
  └── Status tracking (project stages, progress)

Layer 2: Agent Context Windows (transient, per-agent)
  ├── Current project context
  ├── Instructions / prompts
  └── Recent interaction history

Layer 3: GPU Cluster State (transient, per-job)
  ├── Training checkpoints
  ├── Intermediate results
  └── Job queue state
─────────────────────────────────────────────────────────

Inter-Agent Memory Transfer

Because agents communicate exclusively through files, memory transfer follows a write-read pattern:

Agent A memory → write to filesystem → Agent B reads from filesystem → Agent B memory

Example (Ideation → Planning):
  Ideation Agent context:
    "After reviewing 47 papers on RLVR, I identified a gap in
     how verifiable rewards handle ambiguous correctness criteria.
     Hypothesis: A graduated verification scoring system improves
     RLVR training stability on math reasoning tasks."
       │
       ▼  (writes hypothesis.md)
  Filesystem:
    /projects/FA0042/ideation/hypothesis.md
       │
       ▼  (Planning Agent reads)
  Planning Agent context:
    "Hypothesis: graduated verification scoring for RLVR.
     I need to design an experiment comparing binary vs.
     graduated reward signals on GSM8K and MATH benchmarks..."

The filesystem acts as an externalized, persistent memory that survives agent restarts, context window limits, and even system crashes. This is architecturally similar to how human researchers use lab notebooks — the notebook persists even when the researcher is sleeping.

No Explicit Cross-Project Memory

FARS does not appear to maintain an explicit knowledge base, skill library, or cross-project memory. Each project is treated independently:

Memory Type                   | Present in FARS         | Present in Alternatives
──────────────────────────────────────────────────────────────────────────────
Per-project state             | Yes (filesystem)        | Yes (all systems)
Cross-project knowledge base  | No (inferred)           | Yes (DeepScientist, some others)
Skill/technique library       | No (inferred)           | Yes (FunSearch-style systems)
Literature embedding database | Possibly (shared cache) | Yes (some systems)
Failed hypothesis memory      | No explicit mechanism   | Varies

The absence of cross-project memory is a notable limitation. If FARS generates hypothesis H1 for project P1 and discovers it fails, there is no mechanism to prevent generating a similar hypothesis H1' for a different project P2. Over 244 hypotheses, some redundancy is likely.

However, the shared literature cache may provide implicit cross-project memory: if the Ideation Agent caches literature reviews and uses them across projects, insights from one project's literature review could inform another's hypothesis generation.

Context Window Management

Each agent operates within an LLM context window. For long-running experiments, the Experiment Agent may face context window pressure:

Experiment Agent context growth:
  Initial plan:              ~2K tokens
  First code attempt:        ~3K tokens
  GPU job results:           ~1K tokens
  Error analysis + fix:      ~2K tokens
  Second code attempt:       ~3K tokens
  Results analysis:          ~2K tokens
  ...
  After 10 iterations:       ~30K tokens (approaching limits)

The shared filesystem provides a natural overflow mechanism: the agent can write intermediate results to files and read them back selectively, effectively using the filesystem as an unbounded external memory that supplements the finite context window.
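The spill-and-recall pattern can be sketched as follows; the helper names and stub format are ours, not FARS's.

```python
from pathlib import Path

def spill(workdir: Path, name: str, content: str, keep_chars: int = 200) -> str:
    """Write the full artifact to disk; return a short stub to keep in context."""
    path = workdir / f"{name}.txt"
    path.write_text(content)
    return f"[{name}: {len(content)} chars at {path.name}] {content[:keep_chars]}"

def recall(workdir: Path, name: str) -> str:
    """Selectively reload the full artifact when the agent needs the detail."""
    return (workdir / f"{name}.txt").read_text()
```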


14 Continued Learning

Within-Run Learning

During the FARS-100 run, the system demonstrates within-run improvement at the population level:

FARS-100 production trajectory (approximate):
─────────────────────────────────────────────────────────
Phase 1 (hours 0-50):    Pipeline filling + initial projects
  └── Slower output as pipeline stages warm up
  └── Early papers may have lower quality (system "learning")

Phase 2 (hours 50-150):  Steady-state production
  └── ~10 papers/day consistent output
  └── Quality stabilizes around mean 5.05

Phase 3 (hours 150-228): Pipeline draining + final projects
  └── Ideation Agent may slow (research directions exhausting)
  └── Later papers may explore more peripheral topics
─────────────────────────────────────────────────────────

No Explicit Learning Mechanism

FARS does not implement explicit learning across papers:

Learning Type                       | Implemented   | Notes
──────────────────────────────────────────────────────────────────────────────
Within-paper iteration              | Yes           | Experiment Agent debugs and refines
Across-paper knowledge transfer     | No            | Each project is independent
Technique library accumulation      | No            | No skill extraction mechanism
Hypothesis refinement from failures | No            | Failed hypotheses are not fed back to Ideation
Prompt/strategy adaptation          | Not disclosed | May exist internally but is not documented

The Pipeline vs. Loop Distinction

This is a fundamental architectural difference from evolutionary systems:

Evolutionary systems (AIRA₂, AlphaEvolve, OpenEvolve):
  Population → Select → Mutate → Evaluate → Update Population → Repeat
  ───── LOOP: each generation learns from previous ─────

FARS pipeline:
  Direction → Ideate → Plan → Experiment → Write → Output
  ───── PIPELINE: each project is independent ─────

Evolutionary systems explicitly learn: the population improves over generations because selection pressure retains good solutions and discards bad ones. FARS's pipeline does not learn: each project starts fresh from a research direction, without incorporating lessons from previous projects.

This is both a strength and a limitation:

  • Strength: No risk of "premature convergence" — each project explores independently
  • Strength: Easily parallelizable — no cross-project dependencies
  • Limitation: Cannot build on its own discoveries
  • Limitation: May repeat mistakes across projects

Potential for Cross-Run Learning

If FARS were extended to operate over multiple runs (FARS-200, FARS-500, etc.), several learning mechanisms could be added:

  1. Hypothesis deduplication: Embedding previous hypotheses and filtering new ones that are too similar
  2. Technique library: Extracting reusable experimental techniques from successful papers
  3. Failure memory: Cataloguing why hypotheses failed to prevent re-exploration of dead ends
  4. Research direction refinement: Using paper quality scores to adjust which directions the Ideation Agent explores
  5. Writing template learning: Improving paper structure based on review scores

These extensions would transform FARS from a pipeline into a learning pipeline — maintaining the throughput advantages of pipeline architecture while adding the improvement dynamics of evolutionary systems.
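Mechanism 1 (hypothesis deduplication) can be made concrete with a similarity filter; a real system would use learned embeddings, but a bag-of-words cosine shows the filtering logic. The function names and threshold are illustrative.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_novel(hypothesis: str, history: list, threshold: float = 0.8) -> bool:
    """Reject a new hypothesis that is too similar to any previously explored one."""
    v = vectorize(hypothesis)
    return all(cosine(v, vectorize(h)) < threshold for h in history)
```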

FARS as a Benchmark for Research Productivity

The FARS-100 run itself serves as a baseline measurement for automated research productivity:

FARS-100 Baseline Metrics:
  Throughput:          ~10.5 papers/day
  Quality:             5.05 ± ~0.8 (ICLR scale)
  Cost:                $1,040/paper
  Conversion rate:     41% (hypothesis → paper)
  Token efficiency:    ~114M tokens/paper
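These headline numbers are mutually consistent; a quick check of the run-level totals they imply:

```python
# Totals implied by the per-paper baseline metrics (228h 28m 33s, 100 papers).
papers = 100
run_days = (228 + 28 / 60 + 33 / 3600) / 24

papers_per_day = papers / run_days       # ≈ 10.5
total_cost = 1040 * papers               # $104,000 for the run
total_tokens = 114_000_000 * papers      # ≈ 11.4B tokens across the run
```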

Future FARS versions can be measured against these baselines:
  FARS v2:  Higher quality? Lower cost? Better conversion?
  FARS v3:  Cross-project learning? Adaptive hypothesis generation?

15 Applications

Direct Applications

Application                     | Description                                          | Status
─────────────────────────────────────────────────────────────────────────────────────────────────
Automated AI research           | Continuous, unattended production of research papers | Demonstrated (FARS-100)
Negative result publishing      | Systematic documentation of what doesn't work        | Demonstrated (multiple papers)
Research hypothesis exploration | Rapid exploration of a research space                | Demonstrated (244 hypotheses)
Public research transparency    | Live, observable research process                    | Demonstrated (dashboard + GitLab)
Research cost reduction         | 50-200x cheaper per paper than human researchers     | Demonstrated (at $1,040/paper)

Broader Implications

1. The End of Publication Bias

FARS's systematic reporting of negative results addresses one of the most persistent structural problems in science: publication bias. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value is not in any individual negative result paper, but in the aggregate: a comprehensive map of what works and what doesn't in a research area.

Human research publication funnel:
  100 experiments → 20 positive results → 15 submitted → 5 published
  (80% of experimental knowledge is LOST)

FARS publication funnel:
  244 hypotheses → 100 papers (positive + negative) → manual review → arXiv
  (Knowledge preservation rate: ~41% vs. ~5% for humans)

2. Research as Infrastructure

FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled artisans (researchers). In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously.

This shift has parallels in other domains:

  • Software testing: manual QA → automated CI/CD
  • Manufacturing: artisan production → assembly line
  • Content creation: human writing → automated generation + human curation

3. The Minimal Composable Knowledge Unit

FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a tweet or blog post. This format could influence human research conventions:

Traditional paper:    ~10-20 pages, multiple contributions
                      ├── Hard to review (many things to evaluate)
                      ├── Hard to cite precisely (which contribution?)
                      └── Incentivizes bundling (minimum publishable unit)

FARS paper:           ~4-8 pages, single contribution
                      ├── Easy to review (one thing to evaluate)
                      ├── Easy to cite precisely (one clear finding)
                      └── Incentivizes decomposition (one paper = one insight)

This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. The idea is not new (workshop papers, extended abstracts, and short papers exist in human venues), but FARS operationalizes it at scale.

4. The AI-for-AI Research Loop

FARS currently operates in the AI-for-AI domain: AI systems researching AI systems. This creates a recursive improvement dynamic:

FARS produces papers on AI topics
  ├── Some papers improve LLM training/fine-tuning
  │     └── Better LLMs → Better FARS agents → Better papers
  ├── Some papers improve agent design
  │     └── Better agents → Better FARS architecture → Better papers
  └── Some papers improve evaluation methods
        └── Better evaluation → Better quality signal → Better papers

If FARS's outputs eventually inform its own improvement (directly or indirectly through the broader research ecosystem), the system creates a positive feedback loop in AI capability advancement.

Limitations and Scope

| Limitation | Impact | Potential Mitigation |
|---|---|---|
| AI-only domain | Cannot research biology, physics, chemistry, etc. | Integrate with lab automation, simulation tools |
| No physical experiments | Limited to computational research | Robotic lab integration (future) |
| Compute-bounded | Cannot run large-scale pretraining experiments | Larger clusters, cloud bursting |
| No human involvement | Cannot do human evaluation, annotation, or user studies | Mechanical Turk integration, controlled human access |
| Quality variance | Some papers are incremental or flawed | Better quality gates, adaptive filtering |
| No cross-project learning | Each project starts fresh | Implement knowledge base, skill library |
| Proprietary system | Cannot be reproduced or independently verified | Open-source release (unlikely for competitive reasons) |
| Single production format | Only produces short papers | Extend to surveys, tutorials, replication studies |

Comparison to Other Autoresearch Paradigms

| Paradigm | Representative | Strength | Weakness |
|---|---|---|---|
| Single-agent minimal | Karpathy autoresearch | Simplicity, accessibility | Limited scope, no writing |
| Single-agent maximal | AI Scientist v2 | End-to-end paper production | Quality ceiling, no real experiments |
| Evolutionary search | AIRA₂, AlphaEvolve | Progressive improvement, scaling | No paper writing, competition-focused |
| Multi-agent pipeline | FARS | Throughput, continuous operation, negative results | No learning, proprietary, AI-only domain |
| Frontier-pushing | DeepScientist | Genuine scientific advances | Very expensive (20K+ GPU-hours), slow |
| Knowledge-focused | CycleResearcher | Iterative quality improvement | Limited scale |

FARS occupies a unique niche: industrial-scale continuous production of research contributions. It trades depth for breadth, learning for throughput, and maximal paper quality for comprehensive coverage of a research space.

Connections to OmniEvolve

FARS's architecture maps to several OmniEvolve design patterns, with important differences:

| FARS Component | OmniEvolve Equivalent | Key Difference |
|---|---|---|
| Shared filesystem | omnievolve/storage/ artifact storage | FARS uses the filesystem as sole coordination; OmniEvolve uses DB + filesystem |
| Pipeline stages | omnievolve/orchestrator/ experiment lifecycle | FARS is a linear pipeline; OmniEvolve supports evolutionary loops |
| GPU cluster tools | omnievolve/evaluation/ sandbox execution | FARS encapsulates 160 GPUs; OmniEvolve abstracts arbitrary compute |
| Ideation Agent | omnievolve/knowledge/ + omnievolve/search/ | FARS separates ideation from search; OmniEvolve integrates them |
| Quality review | omnievolve/evaluation/ cascade evaluator | FARS uses human review; OmniEvolve uses an automated cascade |
| Negative results | No direct equivalent | OmniEvolve discards failures; FARS publishes them |

Architectural lesson for OmniEvolve: FARS's shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems. OmniEvolve's more complex event bus and database coordination may be over-engineered for some use cases. The FARS pattern could be offered as a lightweight alternative configuration.

Philosophical lesson for OmniEvolve: FARS's first-principles critique of academic conventions applies equally to how evolutionary algorithm research is evaluated. If the goal is to expand the frontier of optimization knowledge, conforming to academic paper formats may be a needless constraint. OmniEvolve's reporting module could support FARS-style minimal contribution reports alongside traditional paper formats.


References

  • Analemma. (2026). "Introducing FARS: Fully Automated Research System." Blog post, February 11, 2026. https://analemma.ai/blog/introducing-fars/
  • FARS Live Dashboard. https://analemma.ai/fars
  • FARS GitLab. https://gitlab.com/fars-a
  • 机器之心 (Machine Intelligence). (2026). "228 hours of non-stop work to produce 100 papers, burning through 11.4 billion Tokens: FARS has gone crazy." February 25, 2026.
  • Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI.
  • Weng, Y., et al. (2024). "CycleResearcher: Improving Automated Research via Automated Review."
  • IntologyAI. (2025). "Zochi: Artificial Scientist." https://github.com/IntologyAI/Zochi
  • Lu, C., et al. (2025). "The AI Scientist v2." Sakana AI.
  • Li, Y., et al. (2025). "AI-Researcher." NeurIPS 2025 Spotlight.
  • Chen, Y., et al. (2026). "DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively." Westlake University.
  • Stanford University. (2026). "Agentic Reviewer." https://paperreview.ai
  • Sun, T. Homepage. http://txsun1997.github.io/
  • Karpathy, A. (2026). "autoresearch." https://github.com/karpathy/autoresearch

This analysis was compiled from the FARS blog post, live dashboard observations, published paper examples, independent media reporting (Machine Intelligence / 机器之心, 36Kr), and cross-referencing with prior autoresearch systems. FARS's proprietary nature limits architectural analysis to publicly observable behavior and reported metrics. The system represents a significant milestone in the industrialization of scientific research, demonstrating that continuous, unattended, large-scale research production is technically feasible — though the question of whether this constitutes genuine knowledge expansion or sophisticated pattern-matching remains open.