FARS
First-principles fully automated research system that rejects academic publishing conventions in favor of minimal, composable knowledge units, deployed live with 160 GPUs to produce 100 papers in 228 hours.
Organization: Analemma (日行迹智能科技)
Published: February 11, 2026 (blog post); live deployment February 12, 2026
Type: System + Blog Post + Live Deployment
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: FARS: Fully Automated Research System
- Blog post: Introducing FARS — published February 11, 2026
- Live dashboard: https://analemma.ai/fars (real-time observation of running experiments)
- GitLab: https://gitlab.com/fars-a — public repositories for each research project
- Live deployment start: 10:00 PM EST (UTC−5), February 12, 2026
- Completion: 228 hours, 28 minutes, 33 seconds of continuous unattended operation
- Output: 244 hypotheses generated → 100 short research papers produced
- Predecessor systems cited: AI Scientist (Sakana AI), CycleResearcher, Zochi (IntologyAI), AI Scientist v2, AI-Researcher (HKU), DeepScientist (Westlake University)
- Research paper URL: Individual papers published at analemma.ai/papers/<uuid>/
FARS is not a traditional research paper. It is a deployed system that was announced via blog post, demonstrated through a live 228-hour public experiment, and validated through its actual outputs. This positions it uniquely in the autoresearch landscape: where AI Scientist sought peer review acceptance and DeepScientist sought frontier-pushing scientific contributions, FARS sought to prove that an unattended research assembly line can operate continuously, stably, and at industrial throughput.
The naming itself is telling: "Fully Automated Research System" emphasizes the system rather than the agent, the researcher, or the scientist. FARS is infrastructure for knowledge production, not an artificial persona.
Relationship to the Autoresearch Landscape
FARS explicitly positions itself as a successor to and synthesis of six prior systems:
Genealogy of End-to-End Autoresearch Systems (2024–2026):
───────────────────────────────────────────────────────────
AI Scientist (Sakana AI, 2024)
└─ First end-to-end: idea → code → paper → review
└─ Single-agent, $15/paper, weak experimental scope
CycleResearcher (2024)
└─ Iterative review-revision cycles
└─ Improved paper quality through feedback loops
Zochi (IntologyAI, 2025)
└─ First AI-authored papers accepted at ACL 2025 / ICLR 2025 workshops
└─ Average reviewer score 7.67 (above human acceptance threshold)
AI Scientist v2 (Sakana AI, 2025)
└─ Tree search methodology, $15-20/run
└─ First AI paper to pass double-blind peer review (ICLR workshop, 6.33)
AI-Researcher + Novix (HKU, 2025)
└─ NeurIPS 2025 Spotlight
└─ Four-module architecture for resource collection → filtering → ideas → writing
DeepScientist (Westlake University, 2026)
└─ ~5,000 ideas, ~1,100 experimentally validated
└─ Exceeded human SOTA on 3 frontier tasks (183.7%, 1.9%, 7.9%)
└─ 20,000+ GPU-hours consumed
FARS (Analemma, 2026) ← this system
└─ 160-GPU cluster, 228h continuous operation
└─ 244 hypotheses → 100 papers
└─ First live, public, unattended research deployment at scale
└─ Rejects academic formatting conventions entirely
───────────────────────────────────────────────────────────
Where each prior system addressed a subset of the autoresearch problem (AI Scientist: feasibility; Zochi: acceptance quality; DeepScientist: frontier-pushing depth), FARS addresses industrial-scale throughput with continuous autonomous operation — the question of whether automated research can function as a reliable, always-on production system rather than a one-off demonstration.
2 Authors and Team
Founder and CEO
Dr. Sun Tianxiang (孙天祥) — Founder and CEO of Analemma. PhD in Computer Science from Fudan University (2024), advised by Xipeng Qiu and Xuanjing Huang. Sun was a principal developer of MOSS, one of the earliest Chinese open-source conversational language models, and has extensive research experience in reinforcement learning and large language model post-training.
Sun's background is significant: MOSS (2023) was one of the first open-source LLMs to demonstrate multi-turn dialogue capability in Chinese, and the experience of building and training large models directly informs FARS's architecture — the system is built by people who understand the full stack from pretraining through RLHF to deployment.
Organizational Context
Analemma (日行迹智能科技):
- Founded: March 2025
- Location: Shanghai, China
- Core team: Members from Fudan University's MOSS team and Shanghai AI Laboratory's InternLM team
- Funding: Angel round of several hundred million RMB from investors including Sequoia Capital China
- Mission: Building infrastructure for automated scientific research
| Aspect | Detail |
|---|---|
| Team origin | Fudan University NLP Lab (MOSS) + Shanghai AI Lab (InternLM) |
| Founded | March 2025 |
| FARS launch | February 2026 (~11 months from founding to deployment) |
| Funding | Angel round, Sequoia Capital China lead |
| Hardware | 160 NVIDIA GPUs (proprietary cluster) |
The speed of execution is notable: from company founding to a 160-GPU live deployment producing 100 research papers in under one year. This suggests either (a) substantial pre-founding research, (b) aggressive parallel engineering, or (c) both. The team's prior experience building MOSS and InternLM models would have provided deep familiarity with the infrastructure needed.
Team Composition
The team is not publicly enumerated in the blog post. Based on organizational context:
- NLP/LLM researchers: Core competency from MOSS and InternLM lineage
- Infrastructure engineers: Required for 160-GPU cluster management
- Systems architects: Multi-agent coordination and shared filesystem design
- Manual reviewers: At least 3 senior researchers (5+ years experience each) for pre-arXiv quality gates
Philosophical Alignment
The team's background in building large language models (MOSS, InternLM) positions them uniquely: they understand both the capabilities and limitations of LLMs as research tools, having been on the producing side of the models that autoresearch systems consume. This insider perspective likely informs FARS's pragmatic design philosophy — the system is built by researchers who know what research actually requires, not by engineers imagining what research might look like.
3 Core Contribution
The Radical Thesis
FARS makes a philosophical claim that distinguishes it from every prior autoresearch system:
The output of a research system should be research contributions, not papers conforming to academic formatting conventions.
This is not a minor stylistic preference. It is a fundamental rejection of the form factor that every other autoresearch system has optimized for. AI Scientist, Zochi, AI Scientist v2, and DeepScientist all measure success by how closely their outputs resemble human-written conference papers, using peer review scores as the gold standard. FARS inverts this: the paper format is a historical artifact of human-centered research, not a necessary property of knowledge production.
First-Principles Critique of Human Research
FARS's blog post articulates a first-principles analysis of why human-centered research is structurally inefficient:
| Structural Problem | Root Cause | FARS Response |
|---|---|---|
| High entry barrier | Years of training to become a researcher | Automated agents require no training period |
| High failure rate | Most research ideas don't work out | Negative results are explicitly valued and published |
| Publication bias | Only "successful" experiments get published | Every completed experiment produces output |
| Maximal contribution units | Papers are large, comprehensive artifacts | Each paper is a single, well-scoped contribution |
| Format overhead | Conforming to venue-specific formatting rules | No structural constraints beyond clarity |
| Length inflation | Pressure to fill pages to meet minimum requirements | Short papers — as long as they need to be, no more |
| Supply constraint | Limited number of human researchers | System runs 24/7, parallelizes across projects |
This critique identifies a fundamental tension: the academic paper format evolved to serve human readers and human review committees, not to maximize the rate of knowledge frontier expansion. FARS argues that if the goal is to "efficiently and reliably expand the frontier of knowledge," then the format should be optimized for that goal rather than for backward compatibility with human conventions.
Five Design Principles
From the blog post and observable outputs, FARS operates on five principles:
- Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container for the contribution.
- Single-scoped contributions. Each paper addresses exactly one research question or presents exactly one finding. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.
- Negative results are knowledge. A well-conducted experiment that shows something doesn't work is as valuable as one that shows something does. FARS explicitly reports negative results. Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — a paper whose entire contribution is demonstrating that a promising technique fails.
- No unnecessary constraints. Papers are not padded to meet length minimums, do not conform to venue-specific formatting templates, and do not include sections (e.g., lengthy related work surveys) that don't serve the contribution.
- Scale reveals truth. A handful of examples is insufficient to evaluate a research system. FARS was designed to produce 100 papers precisely because quality variance at scale is a known limitation — the signal emerges from the aggregate, not from cherry-picked examples.
What Makes This Novel
Prior autoresearch systems tried to pass the Turing test of academic publishing — can the AI produce a paper indistinguishable from a human-written one? FARS asks a different question: if we freed research from the constraints of human publishing conventions, what would an optimally efficient research system look like?
This is the difference between building a faster horse and building a car.
The Live Deployment as Contribution
FARS's public deployment is itself a scientific contribution. By running continuously for 228 hours with a live dashboard, public GitLab, and real-time observability, Analemma subjected their system to the most rigorous possible evaluation: public scrutiny at scale. Any researcher could watch the system operate, inspect intermediate artifacts, read every generated paper, and form their own assessment.
This transparency protocol is unprecedented in the autoresearch space. AI Scientist released code but not live runs. DeepScientist reported results but not the process. FARS showed everything, live, as it happened.
4 Supported Solutions
Problem Framing
FARS frames automated research as a pipeline production problem rather than a search problem (contrast with AIRA₂, which frames it as graph search) or a dialogue problem (contrast with AI Scientist, which frames it as a multi-turn LLM conversation).
The pipeline framing has specific implications:
Search-based framing (AIRA₂, AlphaEvolve):
Goal: Find optimal solution in a space
Metaphor: Exploration of a landscape
Bottleneck: Search efficiency, evaluation signal
Dialogue-based framing (AI Scientist, Zochi):
Goal: Produce human-like research discourse
Metaphor: Simulating a researcher
Bottleneck: LLM capability, prompt engineering
Pipeline-based framing (FARS):
Goal: Convert research directions into completed papers
Metaphor: Assembly line / factory
Bottleneck: Throughput, reliability, coordination
Research Domains
Primary domain: AI/LLM research — the "AI-for-AI" paradigm where automated systems research the technology that powers them. This is an explicitly acknowledged limitation and a pragmatic choice: AI research provides the most readily available evaluation signals (code runs, benchmarks, automated metrics).
Initial research directions specified:
- Reinforcement Learning from Verifiable Rewards (RLVR)
- Other AI-related topics the system discovers autonomously
Observed output topics (from published papers):
- Self-reflection in small language models
- World-model verification for agent planning
- Vision-language model selection strategies (OCR)
- Fine-tuning data selection (hard vs. easy examples)
- Metamorphic testing for LLM world models
- Coding agent testing and import autofix
- Budget allocation for verification systems
The diversity of topics is notable: starting from a few specified directions, the Ideation Agent discovered and explored adjacent research questions autonomously. This suggests the literature review component is effective at identifying related open problems.
Paper Format and Structure
FARS papers are short papers — typically 4–8 pages — focused on a single contribution. Each paper includes:
| Component | Present | Notes |
|---|---|---|
| Abstract | Yes | Concise, focused on the single contribution |
| Introduction / motivation | Yes | Brief — sufficient to frame the question |
| Method | Yes | Technical description of the approach |
| Experiments | Yes | With framework diagrams, result tables, analysis |
| Conclusion | Yes | What was learned, including if the result is negative |
| Related work | Minimal | Only directly relevant prior work, not comprehensive surveys |
| Code repository | Yes | Public GitLab repo for each paper |
| Extensive appendices | No | No padding |
Solution Types
Based on observed outputs, FARS produces three types of research contributions:
Type 1: Positive methods. Standard research contributions where a proposed method improves on baselines.
- Example: "Equation-Consistency Gated Reflection for Small Language Models: A Training-Free Approach to Preventing Self-Correction Regressions"
Type 2: Negative results. Experiments demonstrating that a plausible approach fails, with analysis of why.
- Example: "Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate)
- Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement
Type 3: Empirical insights. Data-driven findings about phenomena in AI systems.
- Example: "Hard Examples Beat Easy Examples in Repetition-Heavy Long-CoT Fine-Tuning"
- Example: "Stutter-Invariance Metamorphic Audits for Text World-Model Rollouts" — domain-specific probes statistically tied with a simpler baseline
The explicit production of Type 2 (negative results) papers is philosophically significant. In human academia, negative results are systematically suppressed due to publication bias — journals and conferences preferentially accept positive results. FARS's willingness to report negative results as complete papers represents a structural fix to this bias.
5 LLM Integration
Model Access
FARS has access to both open-source and closed-source large language models, with the GPU cluster enabling local inference for open-source models and API access for proprietary ones.
| Access Type | Purpose | Examples |
|---|---|---|
| Open-source models (local inference) | Experimental subjects, data synthesis, cheap inference | Models run on the 160-GPU cluster |
| Closed-source models (API) | Agent backbone, complex reasoning, LLM-as-a-Judge | Likely GPT-4-class or Claude-class models |
| Model inference endpoints | Data synthesis, agent design, evaluation | Encapsulated as tools for the Experiment Agent |
Role of LLMs in Each Agent
Ideation Agent:
- Uses LLMs for literature comprehension and synthesis
- Generates research hypotheses from literature analysis
- Conducts automated literature review across open-access papers
- May use embedding models for semantic search over literature
Planning Agent:
- Uses LLMs for experimental design reasoning
- Translates hypotheses into concrete experimental protocols
- Determines required resources, baselines, and evaluation metrics
Experiment Agent:
- Uses LLMs for code generation and debugging
- Uses LLMs as experimental subjects (e.g., testing LLM behaviors)
- Uses LLMs as evaluation judges (LLM-as-a-Judge paradigm)
- Uses LLMs for data synthesis (generating training/test data)
- Uses LLMs for agent design (creating sub-agents for experiments)
Writing Agent:
- Uses LLMs for paper composition
- Synthesizes experimental results into structured narratives
- Generates figures, tables, and analyses
LLM-as-Infrastructure vs. LLM-as-Subject
A distinctive aspect of FARS is the dual role of LLMs:
LLM Usage in FARS:
─────────────────────────────────────────────
Infrastructure layer (running the system):
├── Ideation Agent backbone
├── Planning Agent backbone
├── Experiment Agent backbone
└── Writing Agent backbone
Subject layer (being researched):
├── LLMs as experimental subjects
├── LLM behaviors being studied
├── LLM training/fine-tuning being tested
└── LLM evaluation methods being developed
─────────────────────────────────────────────
This creates a recursive structure: LLMs researching LLMs. The system uses GPT/Claude-class models to reason about experiments on smaller or different LLMs. The infrastructure models must be more capable than the subject models for this to work — you cannot study the behavior of a model using a less capable model as your reasoning engine.
Token Consumption
The FARS-100 run consumed 11.4 billion tokens across 100 papers:
| Metric | Value |
|---|---|
| Total tokens | 11.4 billion |
| Per paper (average) | ~114 million tokens |
| Per hypothesis (average) | ~46.7 million tokens |
| Token cost component | Major fraction of $104,000 total |
The per-paper token count of ~114 million is orders of magnitude higher than typical LLM generation tasks:
Token consumption comparison:
─────────────────────────────────────
Typical chatbot response: ~500 tokens
Typical long-form generation: ~5,000 tokens
Typical agentic task: ~50,000 tokens
Complex multi-step agent: ~500,000 tokens
AI Scientist paper: ~5,000,000 tokens (estimated)
FARS paper: ~114,000,000 tokens ← 20x more
─────────────────────────────────────
This enormous token consumption reflects the "trading computing power for intelligence" characteristic described in reporting. The Experiment Agent likely dominates: running code, debugging failures, iterating on approaches, calling LLMs for data synthesis and evaluation — all within a single paper's lifecycle.
6 Key Results
FARS-100 Headline Numbers
| Metric | Value |
|---|---|
| Duration | 228 hours 28 minutes 33 seconds |
| Hypotheses generated | 244 |
| Papers completed | 100 |
| Hypothesis → paper conversion rate | 41.0% (100/244) |
| Average time per paper | ~2 hours 17 minutes |
| Total tokens consumed | 11.4 billion |
| Total cost | ~$104,000 (~¥750,000 RMB) |
| Cost per paper | ~$1,040 |
| Hardware | 160 NVIDIA GPUs |
| Human intervention | Zero during the 228-hour run |
Quality Assessment
Using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards:
| Population | Mean Score | Range |
|---|---|---|
| FARS-100 papers | 5.05 | 3.0 – 6.3 |
| ICLR 2026 all human submissions | 4.21 | — |
| ICLR 2026 accepted papers | 5.39 | — |
Interpretation:
- FARS papers score 0.84 points above the average human submission
- FARS papers score 0.34 points below the average accepted paper
- The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation
- A small number of papers exceeded 6.0, indicating occasional "breakthrough" quality
Quality positioning (ICLR review scale, higher is better):
─────────────────────────────────────────────
FARS-100 score range:            3.0 – 6.3
ICLR 2026 submission mean:       4.21
FARS-100 mean:                   5.05
ICLR 2026 accepted-paper mean:   5.39
─────────────────────────────────────────────
Agentic Reviewer Calibration
The Agentic Reviewer system was validated against ICLR 2025 review data:
| Comparison | Spearman Correlation |
|---|---|
| Human reviewer vs. human reviewer | 0.41 |
| AI reviewer vs. human reviewer | 0.42 |
The AI reviewer achieves parity with human inter-reviewer agreement, suggesting its scores are roughly as reliable as human review. Both human and AI review remain noisy, however: a correlation of 0.41–0.42 corresponds to only about 17% shared variance, so most of any individual score reflects reviewer-specific variation.
Throughput Analysis
FARS-100 throughput timeline:
────────────────────────────────────────────────
Day 1 (0-24h): ~10 papers
Day 2 (24-48h): ~10 papers
Day 3 (48-72h): ~10 papers
...
Day 9.5 (228h): 100th paper completed
────────────────────────────────────────────────
Average: ~10.5 papers/day
Peak: Not reported (likely higher due to parallel execution)
Minimum: Not reported
Comparison to Human Research
| Metric | Human Researcher | FARS |
|---|---|---|
| Time per paper | 3–6 months | ~2.3 hours |
| Papers per year | 2–5 (typical) | ~3,800 (projected continuous) |
| Cost per paper | $50,000–200,000 (fully loaded) | ~$1,040 |
| 24/7 operation | No | Yes |
| Negative result reporting | Rare (publication bias) | Systematic |
| Quality (ICLR scale) | 4.21 avg submission | 5.05 avg |
The throughput advantage is approximately 1,000x in papers-per-unit-time. The cost advantage is approximately 50–200x per paper (depending on human researcher cost assumptions). However, these comparisons have important caveats — FARS papers are short, single-contribution papers while human papers are typically fuller works. The appropriate comparison may be FARS papers vs. individual experiments within a human paper rather than vs. complete human papers.
Conversion Funnel
Research direction documents (input)
│
▼
244 hypotheses generated (Ideation Agent)
│
▼
??? hypotheses passed automated review
│
▼
??? experimental plans created (Planning Agent)
│
▼
??? experiments executed (Experiment Agent)
│
▼
100 papers completed (Writing Agent)
│
▼
??? papers pass manual review (3+ senior researchers)
│
▼
??? papers submitted to arXiv (explicit AI-generated label)
The 41% hypothesis-to-paper conversion rate (100/244) suggests that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is actually a healthy ratio — in human research, the hypothesis-to-publication rate is typically much lower (perhaps 5–20%). FARS's higher conversion rate may reflect either (a) better hypothesis quality, (b) lower publication threshold (short papers, negative results accepted), or (c) both.
7 Reproducibility
Source Code
| Component | Public | URL |
|---|---|---|
| System code (FARS itself) | No | Proprietary to Analemma |
| Research outputs (papers) | Yes | analemma.ai/papers/<uuid>/ |
| Experiment code | Yes | gitlab.com/fars-a/<project>/ |
| Live dashboard | Yes | analemma.ai/fars |
FARS itself is not open-source. This is a critical distinction from systems like AI Scientist (open-source), Karpathy's autoresearch (open-source), and Zochi (partially open). The system's architecture, agent prompts, coordination mechanisms, and infrastructure code are proprietary.
What Is Reproducible
- Individual experiment results: Each paper has an associated GitLab repository with code, data, and instructions. Other researchers can re-run specific experiments.
- Paper quality evaluation: The Agentic Reviewer system is independently available (paperreview.ai). Anyone can re-score the FARS papers.
- Observable operation: The live dashboard provides real-time visibility into the system's operation, making the process (though not the implementation) transparent.
What Is Not Reproducible
- The system itself: Without access to FARS's code, prompts, and infrastructure, the full system cannot be reproduced.
- The 160-GPU cluster: The hardware requirements are a substantial barrier. Few academic labs have access to 160 GPUs for a continuous 10-day experiment.
- The specific LLM configurations: Which models serve as agent backbones, their specific prompts, and their interaction patterns are not disclosed.
Reproducibility Assessment
| Criterion | Rating | Notes |
|---|---|---|
| System reproducibility | Low | Proprietary code, closed architecture |
| Experiment reproducibility | Medium-High | Individual experiment repos are public |
| Result verification | Medium | Papers and scores can be independently evaluated |
| Process transparency | High | Live dashboard, public GitLab activity |
| Hardware accessibility | Low | 160 GPUs required |
The reproducibility profile is inverted compared to most research: the process is unusually transparent (live public operation), but the system is unusually opaque (proprietary code). This creates a trust-but-can't-verify dynamic: observers can see that FARS works, but cannot build their own version.
8 Compute and API Costs
Hardware Configuration
| Resource | Specification |
|---|---|
| GPU cluster | 160 NVIDIA GPUs |
| GPU model | Not publicly specified (likely A100 or H100 class) |
| Purpose | Training, inference, experiment execution |
| Access model | Encapsulated as tools for the Experiment Agent |
| Additional endpoints | Model inference endpoints for data synthesis, LLM-as-a-Judge, agent design |
FARS-100 Cost Breakdown
| Category | Estimated Cost | Notes |
|---|---|---|
| GPU compute (160 GPUs × 228h ≈ 36,500 GPU-hours) | ~$50,000–70,000 | Implies an effective rate of ~$1.4–1.9/GPU-hour |
| LLM API tokens (11.4B tokens) | ~$30,000–50,000 | Mix of open/closed model inference |
| Total | ~$104,000 | Official reported figure |
| Per paper | ~$1,040 | 104,000 / 100 |
| Per hypothesis | ~$426 | 104,000 / 244 |
Token Economics
Token consumption by category (estimated breakdown):
Experiment Agent: ~70% of total tokens
├── Code generation and debugging
├── Running LLM experiments (subject models)
├── Data synthesis
├── LLM-as-a-Judge evaluations
└── Iterative refinement loops
Ideation Agent: ~15% of total tokens
├── Literature review (reading papers)
├── Hypothesis generation
└── Cross-referencing and validation
Writing Agent: ~10% of total tokens
├── Paper composition
├── Figure and table generation
└── Revision and formatting
Planning Agent: ~5% of total tokens
├── Experimental design
└── Protocol specification
The Experiment Agent likely dominates token consumption because it performs the most computationally and cognitively intensive work: writing code, debugging failures, running experiments, interpreting results, and iterating. This is analogous to human research where the experiment phase consumes the majority of time and resources.
Cost Comparison Across Autoresearch Systems
| System | Cost per Paper | Hardware | Duration |
|---|---|---|---|
| AI Scientist (2024) | ~$15 | Minimal (API-only) | Hours |
| AI Scientist v2 (2025) | ~$15–20 | Minimal (API-only) | Hours |
| Zochi (2025) | Not reported | Not reported | Not reported |
| DeepScientist (2026) | Not reported | 20,000+ GPU-hours total | Months |
| FARS (2026) | ~$1,040 | 160 GPUs × 228h | ~2.3 hours |
| Karpathy autoresearch (2026) | ~$18/run | 1 GPU | 8 hours |
FARS's per-paper cost of ~$1,040 is substantially higher than AI Scientist (~$15) but reflects a fundamentally different scope: AI Scientist produces minimally-viable papers with limited experimental depth, while FARS executes real GPU-intensive experiments. The appropriate comparison is cost-per-unit-of-experimental-work, not cost-per-paper.
Cost Efficiency Analysis
Cost per paper: $1,040
Human equivalent: $50,000 - $200,000 (3-6 months of researcher time)
Cost reduction vs. human: 48x - 192x
But: FARS papers are short papers (~single contribution)
Adjusted comparison (FARS paper ≈ 1 experiment in human paper):
Human cost per experiment: ~$10,000 - $30,000
FARS cost per experiment: ~$1,040
Adjusted cost reduction: 10x - 29x
Scaling Economics
If FARS operated continuously for one year:
Annual projection (continuous operation):
Papers per year: ~3,800 (at 2.3h/paper)
Annual GPU cost: ~$2.8M - $4.0M (160 GPUs full-time)
Annual token cost: ~$1.5M - $2.5M
Annual total: ~$4.3M - $6.5M
Cost per paper: ~$1,130 - $1,710
Equivalent human team:
3,800 papers/year ÷ 4 papers/researcher/year = 950 researchers
950 researchers × $200K/year = $190M/year
FARS vs. human team: ~$5M vs. ~$190M → 38x cost advantage
These projections assume linear scaling and constant quality, which may not hold. But they illustrate the economic logic driving industrial-scale autoresearch.
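For readers who want to vary the assumptions, the projection reduces to a few lines of arithmetic; the sketch below uses only the assumed figures from this section (GPU-hour price, annual token spend, papers per researcher), not measured values.

```python
# Sketch of the annual-scale projection above. Every input is one of the stated
# assumptions from this section, not a measured value.
HOURS_PER_YEAR = 24 * 365
HOURS_PER_PAPER = 2.3                      # observed FARS-100 average
GPU_COUNT = 160
GPU_PRICE_BAND = (2.0, 3.0)                # assumed $/GPU-hour
ANNUAL_TOKEN_COST_BAND = (1.5e6, 2.5e6)    # assumed annual LLM token spend ($)

papers_per_year = HOURS_PER_YEAR / HOURS_PER_PAPER
gpu_cost_band = [GPU_COUNT * HOURS_PER_YEAR * price for price in GPU_PRICE_BAND]
total_band = [gpu + tok for gpu, tok in zip(gpu_cost_band, ANNUAL_TOKEN_COST_BAND)]

print(f"papers/year : {papers_per_year:,.0f}")
print(f"GPU cost    : ${gpu_cost_band[0] / 1e6:.1f}M - ${gpu_cost_band[1] / 1e6:.1f}M")
print(f"total cost  : ${total_band[0] / 1e6:.1f}M - ${total_band[1] / 1e6:.1f}M")
print(f"per paper   : ${total_band[0] / papers_per_year:,.0f} - ${total_band[1] / papers_per_year:,.0f}")

# Equivalent human team under the same assumptions (4 papers/researcher/year,
# $200K fully loaded cost per researcher).
researchers = papers_per_year / 4
print(f"human team  : {researchers:,.0f} researchers ≈ ${researchers * 200_000 / 1e6:.0f}M/year")
```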
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ FARS SYSTEM ARCHITECTURE │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RESEARCH DIRECTION INPUT │ │
│ │ (Document specifying multiple research directions) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ IDEATION AGENT │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Literature │ │ Hypothesis │ │ Automated Review │ │ │
│ │ │ Review │ │ Generation │ │ (pass/fail gate) │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │ │
│ │ │ │ │ │ │
│ │ [Open-access papers] [Public GitLab repos] [Quality filter] │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (validated hypotheses) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PLANNING AGENT │ │
│ │ │ │
│ │ Hypothesis → Experimental Plan │ │
│ │ (baselines, metrics, resources, evaluation criteria) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (experimental plan) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ EXPERIMENT AGENT │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Code │ │ GPU Cluster │ │ Model Inference │ │ │
│ │ │ Generation │ │ (160 GPUs) │ │ Endpoints │ │ │
│ │ │ & Debugging │ │ as tools │ │ as tools │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────────┼──────────────────────┘ │ │
│ │ │ │ │
│ │ Code → Schedule GPU jobs → Collect results → Iterate │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (experimental results + code) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ WRITING AGENT │ │
│ │ │ │
│ │ Results → Paper (short paper, single contribution) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (completed paper) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐│ │
│ │ │ Paper │ │ GitLab │ │ Live │ │ Manual Review ││ │
│ │ │ (PDF) │ │ Repo │ │ Dashboard│ │ (3+ reviewers) ││ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘│ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ ║ SHARED FILE SYSTEM ║ │
│ ║ (Workspace + persistent memory for all agents) ║ │
│ ║ • Structured project directories ║ │
│ ║ • Agents read from / write to shared workspace ║ │
│ ║ • No direct agent-to-agent communication ║ │
│ ║ • Scales to many concurrent research projects ║ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ GPU CLUSTER (160 GPUs) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │GPU 1│ │GPU 2│ │GPU 3│ │GPU 4│ ··· │GPU │ │ │
│ │ │ │ │ │ │ │ │ │ │ 160 │ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ │ Encapsulated as TOOLS for the Experiment Agent │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Architectural Decisions
| Decision | Rationale | Implications |
|---|---|---|
| Sequential pipeline (not graph search) | Research has a natural linear flow: idea → plan → experiment → write | Simpler coordination, clearer handoffs, easier debugging |
| Shared file system (not message passing) | Filesystem is universally understood, durable, inspectable | No message broker needed, natural audit trail, human-inspectable |
| No direct agent-to-agent communication | Eliminates protocol complexity, race conditions, deadlocks | Agents are independently testable, replaceable, scalable |
| GPU cluster as tools (not raw access) | Experiment Agent schedules jobs without managing infrastructure | Clean separation of concerns; agent thinks about experiments, not CUDA |
| Parallel project queue | Multiple research projects advance simultaneously | Higher throughput, better GPU utilization, tolerant of blocked projects |
| Short papers (not full conference papers) | Minimal unit of contribution reduces writing burden | Writing Agent can focus on clarity, not page-filling |
The Shared Filesystem Pattern
The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This deserves detailed analysis:
Shared Filesystem Structure (inferred):
─────────────────────────────────────────
/fars-workspace/
├── projects/
│ ├── FA0001/ # Project directory
│ │ ├── ideation/
│ │ │ ├── literature_review.md # Ideation Agent writes
│ │ │ ├── hypothesis.md # Ideation Agent writes
│ │ │ └── review_result.json # Automated review output
│ │ ├── planning/
│ │ │ ├── experimental_plan.md # Planning Agent writes
│ │ │ └── resources.json # Required resources
│ │ ├── experiment/
│ │ │ ├── code/ # Experiment Agent writes
│ │ │ ├── results/ # Experiment outputs
│ │ │ └── logs/ # Execution logs
│ │ ├── writing/
│ │ │ ├── paper.tex # Writing Agent writes
│ │ │ ├── paper.pdf # Compiled paper
│ │ │ └── figures/ # Generated figures
│ │ └── status.json # Project stage tracking
│ ├── FA0002/
│ │ └── ...
│ └── ...
├── gitlab/ # Public repository staging
└── shared/
├── literature_cache/ # Shared literature database
└── model_endpoints.json # Available inference endpoints
Why filesystem over message passing?
Most multi-agent systems use message queues, event buses, or direct API calls for inter-agent communication. FARS's choice of a shared filesystem has several deep advantages:
- Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state represents a perfect checkpoint. No message replay, no state reconstruction needed.
- Natural handoffs. Agent A completes its work by writing files. Agent B begins its work by reading files. The filesystem boundary is the API contract — no schema definitions, no versioning headaches, no serialization/deserialization overhead.
- Human inspectability. Any researcher can SSH into the filesystem and inspect exactly what the Ideation Agent produced, what the Planning Agent interpreted, what the Experiment Agent executed. This is invaluable for debugging and trust-building.
- Trivial parallelism. Multiple projects run in parallel simply by having separate directories. No lock contention, no resource arbitration (beyond GPU scheduling), no coordination overhead.
- Scalability. Adding more concurrent projects means adding more directories. Adding more agents (e.g., a Review Agent) means adding a reader of existing files. The filesystem pattern scales linearly without architectural changes.
- Decoupled evolution. Each agent can be upgraded independently. As long as the file formats are compatible, agents don't need to know about each other's implementations.
This pattern is not novel in systems engineering (Unix pipes, Plan 9, microservices coordinating via shared storage), but it is novel in the autoresearch space. Most prior systems use tighter coupling: AI Scientist uses a single LLM conversation, AIRA₂ uses an in-memory population database, DeepScientist uses knowledge graphs and databases.
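A minimal sketch of how filesystem-only coordination could be orchestrated, using the inferred directory layout above. The stage list, artifact names, and polling loop are illustrative assumptions rather than Analemma's implementation.

```python
import json
import time
from pathlib import Path

WORKSPACE = Path("/fars-workspace/projects")   # inferred layout from above

# Illustrative stage order and the artifact each stage must produce before the
# next one may start; names mirror the inferred directory structure.
STAGES = [
    ("ideation",   "ideation/hypothesis.md"),
    ("planning",   "planning/experimental_plan.md"),
    ("experiment", "experiment/results"),
    ("writing",    "writing/paper.pdf"),
]

def current_stage(project: Path) -> str:
    """A project's stage is simply the first stage whose output is missing."""
    for stage, artifact in STAGES:
        if not (project / artifact).exists():
            return stage
    return "done"

def poll_once() -> dict:
    """One scheduling pass: record each project's stage in its status.json.
    Stage-specific agents would independently pick up projects in 'their' stage."""
    if not WORKSPACE.exists():
        return {}
    snapshot = {}
    for project in sorted(WORKSPACE.glob("FA*")):
        stage = current_stage(project)
        (project / "status.json").write_text(json.dumps({"stage": stage}))
        snapshot[project.name] = stage
    return snapshot

if __name__ == "__main__":
    while True:
        print(poll_once())   # e.g. {'FA0001': 'experiment', 'FA0002': 'ideation'}
        time.sleep(60)
```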
The Assembly Line Metaphor
FARS's pipeline architecture maps directly to an industrial assembly line:
Industrial Assembly Line:
Raw material → Stamping → Welding → Painting → Assembly → QC → Ship
FARS Research Assembly Line:
Directions → Ideation → Planning → Experiment → Writing → Review → Publish
Key properties of assembly lines that FARS inherits:
- Pipelining. While paper N is being written, paper N+1 is being experimented on, paper N+2 is being planned, and paper N+3 is being ideated. All four stages operate concurrently on different projects.
- Modular replacement. Each station (agent) can be upgraded independently. A better Writing Agent doesn't require changes to the Experiment Agent.
- Bottleneck identification. The slowest stage determines throughput. If experiments take 10x longer than writing, optimizing the Writing Agent yields zero throughput improvement.
- Quality control. Papers that fail at any stage can be rejected without wasting downstream effort (though under the negative-results philosophy, a well-executed experiment that disproves its hypothesis still produces a paper).
10 Component Breakdown
Component 1: Ideation Agent
Purpose: Convert research directions into validated, actionable research hypotheses.
| Aspect | Detail |
|---|---|
| Input | Document specifying multiple research directions |
| Output | Validated research hypotheses (forwarded to Planning Agent) |
| Tools | Open-access paper search, public GitLab repository access |
| Gate | Automated review — hypothesis must pass before forwarding |
| Token consumption | ~15% of total (literature comprehension dominates) |
Process:
1. Receives research direction document (human-authored, specifying broad areas like "RLVR")
2. Conducts literature review across open-access papers
3. Identifies gaps, open questions, and unexplored combinations
4. Generates specific, testable hypotheses
5. Each hypothesis undergoes automated review (quality/feasibility filter)
6. Passing hypotheses written to shared filesystem → picked up by Planning Agent
Key design choice: The Ideation Agent has access to public GitLab repositories in addition to papers. This means it can read actual code implementations, not just paper descriptions. This is a significant advantage over literature-review-only approaches, as the gap between what papers describe and what code implements is often substantial.
Hypothesis generation rate: 244 hypotheses in 228 hours ≈ 1.07 hypotheses/hour. Given that 100 became papers (41% conversion), the Ideation Agent overproduces by design — it generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.
Component 2: Planning Agent
Purpose: Transform validated hypotheses into concrete experimental plans.
| Aspect | Detail |
|---|---|
| Input | Validated hypothesis (from shared filesystem) |
| Output | Experimental plan (written to shared filesystem) |
| Scope | Baselines, metrics, evaluation criteria, resource requirements |
| Token consumption | ~5% of total (relatively lean — planning is reasoning-intensive but not token-heavy) |
Process:
1. Reads hypothesis from project directory
2. Determines what baselines are needed
3. Specifies evaluation metrics and success criteria
4. Estimates required computational resources
5. Writes experimental plan to project directory
The Planning Agent is the thinnest component — its role is to bridge the gap between an abstract hypothesis and a concrete experimental protocol. In human research, this is the "methods section" thought process.
Component 3: Experiment Agent
Purpose: Execute experimental plans using GPU cluster and model inference endpoints.
| Aspect | Detail |
|---|---|
| Input | Experimental plan (from shared filesystem) |
| Output | Code, results, logs (written to shared filesystem) |
| Tools | 160-GPU cluster (as tools), model inference endpoints (as tools) |
| Capabilities | Code generation, debugging, GPU scheduling, data synthesis, LLM-as-a-Judge |
| Token consumption | ~70% of total (dominant consumer) |
Tool encapsulation:
The GPU cluster is not exposed as raw hardware. Instead, it is encapsulated as tools — high-level interfaces that the Experiment Agent can call:
Available tools for Experiment Agent:
─────────────────────────────────────
GPU Tools:
schedule_training_job(config) → job_id
check_job_status(job_id) → status
get_job_results(job_id) → results
cancel_job(job_id) → ok
Inference Tools:
run_inference(model, inputs) → outputs
synthesize_data(spec) → dataset
evaluate_with_judge(model, prompt, response) → score
Utility Tools:
read_dataset(path) → data
write_results(path, data) → ok
create_figure(data, spec) → figure
This encapsulation serves multiple purposes:
- Isolation: The agent doesn't need to know about CUDA, distributed training, or job schedulers
- Safety: The agent cannot accidentally consume all GPU resources or interfere with other projects
- Abstraction: The same experimental code could theoretically run on different hardware backends
Experiment execution flow:
1. Read experimental plan
2. Write code for the experiment
3. Schedule training/inference jobs on GPU cluster
4. Monitor job execution
5. Collect and analyze results
6. If results are unexpected → debug, modify, re-run
7. Write final results and code to project directory
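The pseudo-signatures above suggest how the cluster could be wrapped. The sketch below shows one plausible shape for such tool wrappers, assuming a filesystem-backed job queue; the paths, function bodies, and job-record format are illustrative assumptions, not Analemma's actual API.

```python
import json
import uuid
from pathlib import Path

JOBS_DIR = Path("/fars-workspace/shared/jobs")   # assumed queue location

def schedule_training_job(config: dict) -> str:
    """Submit a training run and return a job id. The agent never sees CUDA,
    node allocation, or the scheduler; it only calls this function."""
    job_id = f"job-{uuid.uuid4().hex[:8]}"
    job_dir = JOBS_DIR / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "config.json").write_text(json.dumps(config, indent=2))
    (job_dir / "status").write_text("queued")   # a cluster daemon would pick this up
    return job_id

def check_job_status(job_id: str) -> str:
    """One of: queued, running, failed, done (illustrative states)."""
    return (JOBS_DIR / job_id / "status").read_text().strip()

def get_job_results(job_id: str) -> dict:
    """Return the metrics the job wrote, or an empty dict if none exist yet."""
    results = JOBS_DIR / job_id / "results.json"
    return json.loads(results.read_text()) if results.exists() else {}
```

An Experiment Agent following the flow above would generate code, call schedule_training_job, poll check_job_status, read get_job_results, and loop back to debugging when the numbers look wrong.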
Component 4: Writing Agent
Purpose: Produce the final research paper from experimental results.
| Aspect | Detail |
|---|---|
| Input | Experimental results, code, hypothesis (from shared filesystem) |
| Output | Short research paper (PDF + source) |
| Format | Single-contribution short paper |
| Includes | Abstract, method, experiments, results, conclusion, figures, tables |
| Token consumption | ~10% of total |
Key distinction from other autoresearch writing: Most systems (AI Scientist, DeepScientist) attempt to produce full conference papers with comprehensive related work sections, lengthy introductions, and detailed appendices. FARS's Writing Agent produces short papers focused exclusively on the single contribution. This is both a philosophical choice (minimal composable knowledge) and a practical one (shorter papers are faster to write and review, and less prone to hallucination).
Component 5: Safety and Review Pipeline
Purpose: Quality gate between automated production and public dissemination.
| Stage | Type | Details |
|---|---|---|
| Automated review | Algorithmic | Hypothesis-level gate in Ideation Agent |
| AI review | Automated | Stanford Agentic Reviewer (ICLR-calibrated) |
| Manual review | Human | At least 3 researchers with 5+ years experience each |
| Labeling | Manual | All submissions explicitly labeled as AI-generated |
| arXiv submission | Manual | Only papers passing manual review are submitted |
The safety pipeline is deliberately conservative:
Papers produced by FARS (100)
│
▼
AI Review (Agentic Reviewer) ──── score distribution published
│
▼
Manual Review (3+ senior researchers, 5+ years each)
│ │
▼ ▼
PASS FAIL
│ │
▼ │
Label as │
AI-generated │
│ │
▼ │
Submit to Archived
arXiv (not published)
This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work. By requiring manual review by multiple senior researchers before external publication, FARS ensures that its public outputs meet human standards even though the production process is fully automated.
11 Core Mechanisms (Detailed)
11.1 The Shared Filesystem as Coordination Protocol
The shared filesystem is not merely a storage layer — it is the entire coordination protocol of the system. Understanding its design is essential to understanding FARS.
Properties of filesystem-based coordination:
| Property | Advantage | Trade-off |
|---|---|---|
| Durability | Every write persists; crash-safe | Higher I/O latency than in-memory |
| Visibility | Any agent (or human) can inspect any state | Potential information leakage if not structured |
| Atomicity | File writes are atomic at OS level | No multi-file transactions |
| Ordering | Filesystem timestamps provide natural ordering | Clock skew in distributed systems (mitigated if single-node) |
| Scalability | Directories partition naturally | Filesystem metadata overhead at extreme scale |
Contrast with alternative coordination patterns:
Pattern | Used by | Advantages | Disadvantages
────────────────────────────────────────────────────────────────────────────
Message queue | AI-Researcher | Decoupled, ordered | Needs broker, no natural persistence
In-memory database | AIRA₂ | Fast, structured | Volatile, single-node bottleneck
Knowledge graph | DeepScientist | Rich relationships | Complex queries, schema overhead
Single LLM context | AI Scientist | Simple, coherent | Context window limits, no parallelism
Shared filesystem | FARS | Universal, durable | Unstructured unless conventions enforced
FARS's choice of shared filesystem is the most Unix-philosophy-aligned design in the autoresearch space. It follows the principle: "Write programs to handle text streams, because that is a universal interface." In FARS's case: "Write agents to handle files in directories, because that is a universal interface."
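One concrete convention that makes this "universal interface" safe in practice is write-then-rename publication. The helper below is a sketch of that convention under the assumption of POSIX rename semantics; it is not a documented FARS mechanism, and the path and hypothesis text are illustrative.

```python
import os
from pathlib import Path

def publish_artifact(path: Path, content: str) -> None:
    """Make a handoff file appear atomically: downstream agents either see the
    previous version or the complete new one, never a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(content)
    os.replace(tmp, path)   # atomic rename on POSIX filesystems

# Illustrative handoff: the Ideation Agent publishes a hypothesis for the Planning Agent.
publish_artifact(
    Path("/fars-workspace/projects/FA0042/ideation/hypothesis.md"),
    "Hypothesis: graduated verification scoring improves RLVR training stability.",
)
```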
11.2 Pipeline Parallelism and Throughput Optimization
FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines, GPU rendering pipelines, and industrial assembly lines.
Time →
─────────────────────────────────────────────────────────
t=0 t=1 t=2 t=3 t=4 t=5 t=6
│ │ │ │ │ │ │
P1: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P2: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P3: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT]→
P4: [IDEA] [PLAN] [EXPT] [EXPT]→
P5: [IDEA] [PLAN] [EXPT]→
─────────────────────────────────────────────────────────
Pipeline stages execute concurrently
Throughput analysis:
If the bottleneck stage (Experiment) takes time T_exp, the throughput is:
Throughput = 1 / max(T_idea, T_plan, T_exp, T_write)
Given FARS's average of ~2.3 hours/paper and the Experiment Agent consuming ~70% of resources, the Experiment stage is almost certainly the bottleneck:
Estimated stage durations (per paper):
Ideation: ~20-30 minutes
Planning: ~10-15 minutes
Experiment: ~90-120 minutes ← bottleneck
Writing: ~20-30 minutes
Pipeline throughput ≈ 1 paper / 90-120 minutes
Observed: ≈ 1 paper / 137 minutes (2h17m)
The discrepancy between the estimated bottleneck (90–120 min) and the observed rate (137 min) likely reflects:
- Pipeline filling/draining overhead
- Failed experiments that consume time but don't produce papers
- Resource contention on the GPU cluster during peak parallel execution
- Overhead from the 59% of hypotheses that don't become papers
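A toy calculation makes the pipelining argument concrete; the stage durations below are this report's estimates from the breakdown above, not measured values.

```python
# Toy check of the pipelining argument, using the estimated stage durations
# above (minutes per paper). These are this report's estimates, not measurements.
stages = {"ideation": 25, "planning": 12, "experiment": 105, "writing": 25}

bottleneck = max(stages.values())          # slowest stage limits steady-state rate
fill_latency = sum(stages.values())        # the first paper must traverse every stage

n_papers = 100
pipelined_total = fill_latency + (n_papers - 1) * bottleneck
sequential_total = n_papers * fill_latency

print(f"bottleneck stage       : {bottleneck} min/paper")
print(f"pipelined, 100 papers  : {pipelined_total / 60:.0f} h")   # ≈ 176 h
print(f"sequential, 100 papers : {sequential_total / 60:.0f} h")  # ≈ 278 h
```

The observed 228 hours falls between the fully pipelined and fully sequential bounds, consistent with partial pipelining plus the overheads listed above.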
11.3 GPU Cluster Encapsulation
The design decision to encapsulate the 160-GPU cluster as tools rather than exposing raw hardware is architecturally significant:
Traditional research agent:
Agent → SSH → GPU machine → CUDA → Training → Results
(Agent manages infrastructure directly)
FARS design:
Agent → Tool API → Cluster Scheduler → GPU → Training → Results
(Agent is isolated from infrastructure)
Benefits of tool encapsulation:
- Cognitive offloading. The Experiment Agent reasons about experiments, not infrastructure. It doesn't need to know about CUDA versions, driver compatibility, multi-GPU parallelism, or job scheduling.
- Resource management. The tool layer can implement fair scheduling across concurrent projects, prevent resource starvation, and enforce compute budgets.
- Error isolation. A crashed training job doesn't crash the agent. The tool layer handles retries, timeouts, and failure reporting.
- Portability. The same agent prompts could theoretically work with different hardware backends (different GPU types, cloud providers, or even TPUs) by swapping the tool implementation.
11.4 Negative Result Detection and Reporting
FARS's systematic production of negative result papers requires a mechanism for detecting and properly framing negative results:
Experiment outcome classification:
─────────────────────────────────────
1. Positive result → Method works, report improvement
2. Negative result → Method doesn't work, report why
3. Null result → Inconclusive, may need more experiments
4. System failure → Bug/crash, not a research result
Traditional systems (AI Scientist, AIRA₂) treat outcomes 2-4 as failures and discard them. FARS treats outcome 2 as a legitimate research contribution and produces a paper explaining why the hypothesis failed.
This requires the Writing Agent to have a distinct mode:
Positive result writing:
- "We propose X. X achieves Y improvement over baseline Z."
- Emphasis on the method, the improvement, the contribution.
Negative result writing:
- "We investigate whether X can improve Y. We find that X provides no significant improvement, and analyze why."
- Emphasis on the hypothesis, the evidence against it, and the mechanistic explanation for failure.
- Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — the contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
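A minimal sketch of the outcome classification this requires; the significance threshold and class names below are assumptions that mirror the four-way split above, not FARS's disclosed logic.

```python
from typing import Optional

def classify_outcome(method_score: float, baseline_score: float,
                     p_value: Optional[float], crashed: bool) -> str:
    """Map an experiment's outcome onto the four classes above. The significance
    threshold and class names are illustrative assumptions."""
    if crashed:
        return "system_failure"   # bug or crash: not a research result
    if p_value is None or p_value > 0.05:
        return "null_result"      # inconclusive: may need more experiments
    if method_score > baseline_score:
        return "positive_result"  # report the improvement
    return "negative_result"      # report why the method does not help

# A significant result with no improvement is still routed to the Writing Agent,
# in its negative-result mode, rather than being discarded.
print(classify_outcome(method_score=9.6, baseline_score=10.0,
                       p_value=0.03, crashed=False))   # -> negative_result
```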
11.5 Automated Hypothesis Review
The Ideation Agent includes an automated review step that filters hypotheses before they enter the pipeline. While the specific review criteria are not publicly documented, the 41% conversion rate (100 papers from 244 hypotheses) provides bounds:
Hypothesis filtering funnel (estimated):
244 hypotheses generated
│
├── ~X% filtered by automated review (too vague, infeasible, redundant)
│
├── ~Y% fail during experiment execution (bugs, inconclusive results)
│
└── 100 produce completed papers (41%)
If automated review filters 30% → 171 pass review
Then 100/171 = 58% execution success rate
If automated review filters 10% → 220 pass review
Then 100/220 = 45% execution success rate
Either way, the conversion rates are remarkably high compared to human research, where the hypothesis-to-publication rate in AI/ML is typically 10–30%. This may reflect:
- Conservative hypothesis generation (the Ideation Agent proposes incremental, testable hypotheses)
- The inclusion of negative results (failures that would be discarded in human research become papers)
- The well-scoped single-contribution format (a lower bar for a "complete" paper)
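The two scenarios above come from a single relationship between the (undisclosed) automated-review filter rate and the implied execution success rate; the small helper below makes that dependence explicit.

```python
def execution_success_rate(hypotheses: int, papers: int, review_filter_rate: float) -> float:
    """Given an assumed automated-review filter rate (undisclosed for FARS),
    return the implied success rate of the hypotheses that entered the pipeline."""
    passed_review = hypotheses * (1 - review_filter_rate)
    return papers / passed_review

for filter_rate in (0.10, 0.30, 0.50):
    rate = execution_success_rate(244, 100, filter_rate)
    print(f"review filters {filter_rate:.0%} -> execution success {rate:.1%}")
# review filters 10% -> execution success 45.5%
# review filters 30% -> execution success 58.5%
# review filters 50% -> execution success 82.0%
```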
12 Programming Language
System Implementation
The FARS system implementation language is not publicly disclosed, but strong inferences can be made:
| Component | Likely Language | Evidence |
|---|---|---|
| Agent orchestration | Python | Industry standard for ML infrastructure; team's MOSS/InternLM background is Python |
| Shared filesystem management | Python/Shell | Standard infrastructure tooling |
| GPU cluster tools | Python (PyTorch/JAX) | Standard ML training frameworks |
| Model inference endpoints | Python (FastAPI/gRPC) | Standard serving patterns |
| Paper compilation | LaTeX | Academic paper production standard |
| Dashboard | JavaScript/TypeScript | Web frontend (analemma.ai/fars) |
Agent-Generated Code
The Experiment Agent generates Python code for experiments. Based on the published GitLab repositories, experiments use standard ML tooling:
- Training: PyTorch, Hugging Face Transformers, TRL
- Evaluation: Standard metrics libraries, custom evaluation scripts
- Data processing: pandas, numpy, datasets (Hugging Face)
- Visualization: matplotlib, plotly
Paper Generation
Papers are generated in LaTeX format and compiled to PDF. The Writing Agent must produce:
- LaTeX source with proper formatting
- Figure generation code (Python → matplotlib/plotly → image)
- Table generation (experimental results → LaTeX tables)
- Bibliography management (BibTeX references)
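Of these outputs, table generation is the most mechanical; the sketch below shows one way a Writing Agent could turn a results record into a LaTeX table. The function and its format choices are illustrative, not taken from FARS's code, and the example rows reuse the import-autofix numbers cited earlier.

```python
def results_to_latex(rows: list[dict], caption: str) -> str:
    """Render result records as a booktabs-style LaTeX table.
    Purely illustrative; FARS's actual table-generation code is not public."""
    columns = list(rows[0].keys())
    header = " & ".join(columns) + r" \\"
    body = "\n".join(" & ".join(str(row[c]) for c in columns) + r" \\" for row in rows)
    return "\n".join([
        r"\begin{table}[t]",
        r"\centering",
        r"\begin{tabular}{" + "l" * len(columns) + "}",
        r"\toprule",
        header,
        r"\midrule",
        body,
        r"\bottomrule",
        r"\end{tabular}",
        rf"\caption{{{caption}}}",
        r"\end{table}",
    ])

print(results_to_latex(
    [{"method": "baseline", "resolved": r"10.0\%"},
     {"method": "import autofix", "resolved": r"10.0\%"}],
    "Resolved rate with and without deterministic import autofix.",
))
```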
GitLab Repositories
Each completed project is published to GitLab (gitlab.com/fars-a/<project>/). Repository structure typically includes:
<project>/
├── paper.pdf # Compiled paper
├── paper.tex # LaTeX source (inferred)
├── code/
│ ├── experiment.py # Main experimental code
│ ├── evaluate.py # Evaluation script
│ └── requirements.txt
├── data/ # Processed data (if small)
├── results/ # Raw experimental results
└── README.md # Project description
13 Memory Management
System-Level Memory: The Shared Filesystem
FARS's primary memory system is the shared filesystem itself. This is a deliberate architectural choice that merges workspace and persistent memory into a single mechanism:
Memory Architecture:
─────────────────────────────────────────────────────────
Layer 1: Shared Filesystem (persistent, cross-agent)
├── Project directories (per-project state)
├── Literature cache (shared across projects)
├── Model endpoint registry
└── Status tracking (project stages, progress)
Layer 2: Agent Context Windows (transient, per-agent)
├── Current project context
├── Instructions / prompts
└── Recent interaction history
Layer 3: GPU Cluster State (transient, per-job)
├── Training checkpoints
├── Intermediate results
└── Job queue state
─────────────────────────────────────────────────────────
Inter-Agent Memory Transfer
Because agents communicate exclusively through files, memory transfer follows a write-read pattern:
Agent A memory → write to filesystem → Agent B reads from filesystem → Agent B memory
Example (Ideation → Planning):
Ideation Agent context:
"After reviewing 47 papers on RLVR, I identified a gap in
how verifiable rewards handle ambiguous correctness criteria.
Hypothesis: A graduated verification scoring system improves
RLVR training stability on math reasoning tasks."
│
▼ (writes hypothesis.md)
Filesystem:
/projects/FA0042/ideation/hypothesis.md
│
▼ (Planning Agent reads)
Planning Agent context:
"Hypothesis: graduated verification scoring for RLVR.
I need to design an experiment comparing binary vs.
graduated reward signals on GSM8K and MATH benchmarks..."
The filesystem acts as an externalized, persistent memory that survives agent restarts, context window limits, and even system crashes. This is architecturally similar to how human researchers use lab notebooks — the notebook persists even when the researcher is sleeping.
No Explicit Cross-Project Memory
FARS does not appear to maintain an explicit knowledge base, skill library, or cross-project memory. Each project is treated independently:
| Memory Type | Present in FARS | Present in Alternatives |
|---|---|---|
| Per-project state | Yes (filesystem) | Yes (all systems) |
| Cross-project knowledge base | No (inferred) | Yes (DeepScientist, some others) |
| Skill/technique library | No (inferred) | Yes (FunSearch-style systems) |
| Literature embedding database | Possibly (shared cache) | Yes (some systems) |
| Failed hypothesis memory | No explicit mechanism | Varies |
The absence of cross-project memory is a notable limitation. If FARS generates hypothesis H1 for project P1 and discovers it fails, there is no mechanism to prevent generating a similar hypothesis H1' for a different project P2. Over 244 hypotheses, some redundancy is likely.
However, the shared literature cache may provide implicit cross-project memory: if the Ideation Agent caches literature reviews and uses them across projects, insights from one project's literature review could inform another's hypothesis generation.
Context Window Management
Each agent operates within an LLM context window. For long-running experiments, the Experiment Agent may face context window pressure:
Experiment Agent context growth:
Initial plan: ~2K tokens
First code attempt: ~3K tokens
GPU job results: ~1K tokens
Error analysis + fix: ~2K tokens
Second code attempt: ~3K tokens
Results analysis: ~2K tokens
...
After 10 iterations: ~30K tokens (approaching limits)
The shared filesystem provides a natural overflow mechanism: the agent can write intermediate results to files and read them back selectively, effectively using the filesystem as an unbounded external memory that supplements the finite context window.
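A sketch of that overflow pattern, under the assumption that bulky tool outputs can be summarized before they enter the context; the paths and helper names are illustrative, not FARS's implementation.

```python
from pathlib import Path

SCRATCH = Path("/fars-workspace/projects/FA0042/experiment/scratch")  # illustrative path

def offload(name: str, text: str, keep_chars: int = 400) -> str:
    """Persist a bulky artifact (logs, raw results) to the shared filesystem and
    return a short stub that fits in the agent's context window."""
    SCRATCH.mkdir(parents=True, exist_ok=True)
    path = SCRATCH / f"{name}.txt"
    path.write_text(text)
    return f"[{name}: {len(text):,} chars at {path}]\n{text[:keep_chars]}..."

def recall(name: str) -> str:
    """Read an offloaded artifact back in full only when it is needed again."""
    return (SCRATCH / f"{name}.txt").read_text()
```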
14 Continued Learning
Within-Run Learning
During the FARS-100 run, the system's output profile shifted at the population level over time, though this reflects pipeline dynamics rather than an explicit learning mechanism:
FARS-100 production trajectory (approximate):
─────────────────────────────────────────────────────────
Phase 1 (hours 0-50): Pipeline filling + initial projects
└── Slower output as pipeline stages warm up
└── Early papers may have lower quality (system "learning")
Phase 2 (hours 50-150): Steady-state production
└── ~10 papers/day consistent output
└── Quality stabilizes around mean 5.05
Phase 3 (hours 150-228): Pipeline draining + final projects
└── Ideation Agent may slow (research directions becoming exhausted)
└── Later papers may explore more peripheral topics
─────────────────────────────────────────────────────────
No Explicit Learning Mechanism
FARS does not implement explicit learning across papers:
| Learning Type | Implemented | Notes |
|---|---|---|
| Within-paper iteration | Yes | Experiment Agent debugs and refines |
| Across-paper knowledge transfer | No | Each project is independent |
| Technique library accumulation | No | No skill extraction mechanism |
| Hypothesis refinement from failures | No | Failed hypotheses not fed back to Ideation |
| Prompt/strategy adaptation | Not disclosed | May exist internally but not documented |
The Pipeline vs. Loop Distinction
This is a fundamental architectural difference from evolutionary systems:
Evolutionary systems (AIRA₂, AlphaEvolve, OpenEvolve):
Population → Select → Mutate → Evaluate → Update Population → Repeat
───── LOOP: each generation learns from previous ─────
FARS pipeline:
Direction → Ideate → Plan → Experiment → Write → Output
───── PIPELINE: each project is independent ─────
Evolutionary systems explicitly learn: the population improves over generations because selection pressure retains good solutions and discards bad ones. FARS's pipeline does not learn: each project starts fresh from a research direction, without incorporating lessons from previous projects.
This is both a strength and a limitation:
- Strength: no risk of "premature convergence", since each project explores independently
- Strength: easily parallelizable, with no cross-project dependencies
- Limitation: cannot build on its own discoveries
- Limitation: may repeat mistakes across projects
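The distinction can be reduced to a few lines of control flow; every function below is a placeholder standing in for the respective system's internals, not actual code from either.

```python
# Control-flow contrast: evolutionary loop vs. FARS-style pipeline.
# All functions are placeholders for the respective systems' internals.

def evolutionary_loop(population, generations, select, mutate, evaluate):
    """State persists across iterations: each generation builds on the last."""
    for _ in range(generations):
        parents = select(population)
        children = [mutate(p) for p in parents]
        population = evaluate(population + children)  # learning carried forward
    return population

def fars_pipeline(directions, ideate, plan, experiment, write):
    """No shared state: projects run front-to-back and could run in parallel,
    but nothing learned in one project reaches the next."""
    return [write(experiment(plan(ideate(d)))) for d in directions]
```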
Potential for Cross-Run Learning
If FARS were extended to operate over multiple runs (FARS-200, FARS-500, etc.), several learning mechanisms could be added:
- Hypothesis deduplication: Embedding previous hypotheses and filtering new ones that are too similar (see the sketch below)
- Technique library: Extracting reusable experimental techniques from successful papers
- Failure memory: Cataloguing why hypotheses failed to prevent re-exploration of dead ends
- Research direction refinement: Using paper quality scores to adjust which directions the Ideation Agent explores
- Writing template learning: Improving paper structure based on review scores
These extensions would transform FARS from a pipeline into a learning pipeline — maintaining the throughput advantages of pipeline architecture while adding the improvement dynamics of evolutionary systems.
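As an illustration of the first mechanism, a deduplication gate could embed every hypothesis ever generated and reject new ones that land too close to an earlier one; the embedding function, cosine threshold, and in-memory storage are assumptions.

```python
# Sketch of a hypothesis-deduplication gate for a hypothetical cross-run memory.
# The embedding function, similarity threshold, and storage are assumptions.
import numpy as np

class HypothesisMemory:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: text -> 1-D numpy vector
        self.threshold = threshold  # cosine similarity above which a hypothesis counts as seen
        self.vectors: list[np.ndarray] = []

    def is_novel(self, hypothesis: str) -> bool:
        """Accept a hypothesis only if it is not too similar to any earlier one."""
        v = self.embed(hypothesis)
        v = v / np.linalg.norm(v)
        if any(float(np.dot(v, u)) >= self.threshold for u in self.vectors):
            return False            # near-duplicate of a previously explored hypothesis
        self.vectors.append(v)
        return True
```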
FARS as a Benchmark for Research Productivity
The FARS-100 run itself serves as a baseline measurement for automated research productivity:
FARS-100 Baseline Metrics:
Throughput: ~10.5 papers/day
Quality: 5.05 ± ~0.8 (ICLR scale)
Cost: $1,040/paper
Conversion rate: 41% (hypothesis → paper)
Token efficiency: ~114M tokens/paper
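These figures follow directly from the run totals reported elsewhere in this analysis (228h 28m 33s of operation, 244 hypotheses, 100 papers, and the 11.4 billion tokens cited in the media coverage); a quick recomputation:

```python
# Recomputing the baseline metrics from the reported run totals.
hours = 228 + 28 / 60 + 33 / 3600   # ~228.48 hours of continuous operation
papers, hypotheses = 100, 244
total_tokens = 11.4e9               # total token consumption cited in media coverage

print(round(papers / (hours / 24), 1))      # ~10.5 papers/day
print(round(papers / hypotheses, 2))        # ~0.41 hypothesis -> paper conversion
print(round(total_tokens / papers / 1e6))   # ~114M tokens/paper
```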
Future FARS versions can be measured against these baselines:
FARS v2: Higher quality? Lower cost? Better conversion?
FARS v3: Cross-project learning? Adaptive hypothesis generation?
15 Applications
Direct Applications
| Application | Description | Status |
|---|---|---|
| Automated AI research | Continuous, unattended production of research papers | Demonstrated (FARS-100) |
| Negative result publishing | Systematic documentation of what doesn't work | Demonstrated (multiple papers) |
| Research hypothesis exploration | Rapid exploration of a research space | Demonstrated (244 hypotheses) |
| Public research transparency | Live, observable research process | Demonstrated (dashboard + GitLab) |
| Research cost reduction | 50-200x cheaper than human researchers per paper | Demonstrated (at $1,040/paper) |
Broader Implications
1. The End of Publication Bias
FARS's systematic reporting of negative results addresses one of the most persistent structural problems in science: publication bias. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value is not in any individual negative result paper, but in the aggregate: a comprehensive map of what works and what doesn't in a research area.
Human research publication funnel:
100 experiments → 20 positive results → 15 submitted → 5 published
(~95% of experimental knowledge is LOST)
FARS publication funnel:
244 hypotheses → 100 papers (positive + negative) → manual review → arXiv
(Knowledge preservation rate: ~41% vs. ~5% for humans)
2. Research as Infrastructure
FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled artisans (researchers). In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously.
This shift has parallels in other domains:
- Software testing: manual QA → automated CI/CD
- Manufacturing: artisan production → assembly line
- Content creation: human writing → automated generation + human curation
3. The Minimal Composable Knowledge Unit
FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a tweet or blog post. This format could influence human research conventions:
Traditional paper: ~10-20 pages, multiple contributions
├── Hard to review (many things to evaluate)
├── Hard to cite precisely (which contribution?)
└── Incentivizes bundling (minimum publishable unit)
FARS paper: ~4-8 pages, single contribution
├── Easy to review (one thing to evaluate)
├── Easy to cite precisely (one clear finding)
└── Incentivizes decomposition (one paper = one insight)
This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. The idea is not new (workshop papers, extended abstracts, and short papers exist in human venues), but FARS operationalizes it at scale.
4. The AI-for-AI Research Loop
FARS currently operates in the AI-for-AI domain: AI systems researching AI systems. This creates a recursive improvement dynamic:
FARS produces papers on AI topics
├── Some papers improve LLM training/fine-tuning
│ └── Better LLMs → Better FARS agents → Better papers
├── Some papers improve agent design
│ └── Better agents → Better FARS architecture → Better papers
└── Some papers improve evaluation methods
└── Better evaluation → Better quality signal → Better papers
If FARS's outputs eventually inform its own improvement (directly or indirectly through the broader research ecosystem), the system creates a positive feedback loop in AI capability advancement.
Limitations and Scope
| Limitation | Impact | Potential Mitigation |
|---|---|---|
| AI-only domain | Cannot research biology, physics, chemistry, etc. | Integrate with lab automation, simulation tools |
| No physical experiments | Limited to computational research | Robotic lab integration (future) |
| Compute-bounded | Cannot run large-scale pretraining experiments | Larger clusters, cloud bursting |
| No human involvement | Cannot do human evaluation, annotation, user studies | Mechanical Turk integration, controlled human access |
| Quality variance | Some papers are incremental or flawed | Better quality gates, adaptive filtering |
| No cross-project learning | Each project starts fresh | Implement knowledge base, skill library |
| Proprietary system | Cannot be reproduced or independently verified | Open-source release (unlikely for competitive reasons) |
| Single production format | Only produces short papers | Extend to surveys, tutorials, replication studies |
Comparison to Other Autoresearch Paradigms
| Paradigm | Representative | Strength | Weakness |
|---|---|---|---|
| Single-agent minimal | Karpathy autoresearch | Simplicity, accessibility | Limited scope, no writing |
| Single-agent maximal | AI Scientist v2 | End-to-end paper production | Quality ceiling, no real experiments |
| Evolutionary search | AIRA₂, AlphaEvolve | Progressive improvement, scaling | No paper writing, competition-focused |
| Multi-agent pipeline | FARS | Throughput, continuous operation, negative results | No learning, proprietary, AI-only domain |
| Frontier-pushing | DeepScientist | Genuine scientific advances | Very expensive (20K+ GPU-hours), slow |
| Knowledge-focused | CycleResearcher | Iterative quality improvement | Limited scale |
FARS occupies a unique niche: industrial-scale continuous production of research contributions. It trades depth for breadth, learning for throughput, and maximal paper quality for comprehensive coverage of a research space.
Connections to OmniEvolve
FARS's architecture maps to several OmniEvolve design patterns, with important differences:
| FARS Component | OmniEvolve Equivalent | Key Difference |
|---|---|---|
| Shared filesystem | omnievolve/storage/ artifact storage | FARS uses the filesystem as the sole coordination layer; OmniEvolve uses DB + filesystem |
| Pipeline stages | omnievolve/orchestrator/ experiment lifecycle | FARS is a linear pipeline; OmniEvolve supports evolutionary loops |
| GPU cluster tools | omnievolve/evaluation/ sandbox execution | FARS encapsulates 160 GPUs; OmniEvolve abstracts arbitrary compute |
| Ideation Agent | omnievolve/knowledge/ + omnievolve/search/ | FARS separates ideation from search; OmniEvolve integrates them |
| Quality review | omnievolve/evaluation/ cascade evaluator | FARS uses human review; OmniEvolve uses an automated cascade |
| Negative results | No direct equivalent | OmniEvolve discards failures; FARS publishes them |
Architectural lesson for OmniEvolve: FARS's shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems. OmniEvolve's more complex event bus and database coordination may be over-engineered for some use cases. The FARS pattern could be offered as a lightweight alternative configuration.
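As a generic illustration of how far file-only coordination can go, a work queue needs nothing beyond atomic renames; this sketch is neither FARS's nor OmniEvolve's implementation, and the directory names are assumptions.

```python
# Generic sketch of file-based coordination: a claim-by-atomic-rename work queue.
# Neither FARS's nor OmniEvolve's implementation; directory names are assumptions.
import os
from pathlib import Path

QUEUE = Path("/shared/queue")       # one file per pending task
CLAIMED = Path("/shared/claimed")   # tasks currently being worked on

def claim_next(worker_id: str) -> Path | None:
    """Atomically claim a pending task: os.rename is atomic on a POSIX filesystem,
    so two workers can never end up holding the same task."""
    CLAIMED.mkdir(parents=True, exist_ok=True)
    for task in sorted(QUEUE.glob("*.task")):
        target = CLAIMED / f"{worker_id}__{task.name}"
        try:
            os.rename(task, target)  # fails if another worker already moved it
            return target
        except OSError:              # includes FileNotFoundError: lost the race
            continue
    return None
```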
Philosophical lesson for OmniEvolve: FARS's first-principles critique of academic conventions applies equally to how evolutionary algorithm research is evaluated. If the goal is to expand the frontier of optimization knowledge, conforming to academic paper formats may be a needless constraint. OmniEvolve's reporting module could support FARS-style minimal contribution reports alongside traditional paper formats.
References
- Analemma. (2026). "Introducing FARS: Fully Automated Research System." Blog post, February 11, 2026. https://analemma.ai/blog/introducing-fars/
- FARS Live Dashboard. https://analemma.ai/fars
- FARS GitLab. https://gitlab.com/fars-a
- 机器之心 (Machine Intelligence). (2026). "228 hours of non-stop work to produce 100 papers, burning through 11.4 billion tokens: FARS has gone crazy." February 25, 2026.
- Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI.
- Weng, Y., et al. (2024). "CycleResearcher: Improving Automated Research via Automated Review."
- IntologyAI. (2025). "Zochi: Artificial Scientist." https://github.com/IntologyAI/Zochi
- Lu, C., et al. (2025). "The AI Scientist v2." Sakana AI.
- Li, Y., et al. (2025). "AI-Researcher." NeurIPS 2025 Spotlight.
- Chen, Y., et al. (2026). "DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively." Westlake University.
- Stanford University. (2026). "Agentic Reviewer." https://paperreview.ai
- Sun, T. Homepage. http://txsun1997.github.io/
- Karpathy, A. (2026). "autoresearch." https://github.com/karpathy/autoresearch
This analysis was compiled from the FARS blog post, live dashboard observations, published paper examples, independent media reporting (Machine Intelligence / 机器之心, 36Kr), and cross-referencing with prior autoresearch systems. FARS's proprietary nature limits architectural analysis to publicly observable behavior and reported metrics. The system represents a significant milestone in the industrialization of scientific research, demonstrating that continuous, unattended, large-scale research production is technically feasible — though the question of whether this constitutes genuine knowledge expansion or sophisticated pattern-matching remains open.