FARS
First-principles fully automated research system that rejects academic publishing conventions in favor of minimal, composable knowledge units, deployed live with 160 GPUs to produce 100 papers in 228 hours.
Organization: Analemma (日行迹智能科技)
Published: February 11, 2026 (blog post); live deployment February 12, 2026
Type: System + Blog Post + Live Deployment
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: FARS: Fully Automated Research System
- Blog post: Introducing FARS — published February 11, 2026
- Live dashboard: https://analemma.ai/fars (real-time observation of running experiments)
- GitLab: https://gitlab.com/fars-a — public repositories for each research project
- Live deployment start: 10:00 PM EST (UTC−5), February 12, 2026
- Completion: 228 hours, 28 minutes, 33 seconds of continuous unattended operation
- Output: 244 hypotheses generated → 100 short research papers produced
- Predecessor systems cited: AI Scientist (Sakana AI), CycleResearcher, Zochi (IntologyAI), AI Scientist v2, AI-Researcher (HKU), DeepScientist (Westlake University)
- Research paper URL: Individual papers published at analemma.ai/papers/<uuid>/
FARS is not a traditional research paper. It is a deployed system that was announced via blog post, demonstrated through a live 228-hour public experiment, and validated through its actual outputs. This positions it uniquely in the autoresearch landscape: where AI Scientist sought peer review acceptance and DeepScientist sought frontier-pushing scientific contributions, FARS sought to prove that an unattended research assembly line can operate continuously, stably, and at industrial throughput.
The naming itself is telling: "Fully Automated Research System" emphasizes the system rather than the agent, the researcher, or the scientist. FARS is infrastructure for knowledge production, not an artificial persona.
Relationship to the Autoresearch Landscape
FARS explicitly positions itself as a successor to and synthesis of six prior systems:
Genealogy of End-to-End Autoresearch Systems (2024–2026):
───────────────────────────────────────────────────────────
AI Scientist (Sakana AI, 2024)
└─ First end-to-end: idea → code → paper → review
└─ Single-agent, $15/paper, weak experimental scope
CycleResearcher (2024)
└─ Iterative review-revision cycles
└─ Improved paper quality through feedback loops
Zochi (IntologyAI, 2025)
└─ First AI-authored papers accepted at ACL 2025 / ICLR 2025 workshops
└─ Average reviewer score 7.67 (above human acceptance threshold)
AI Scientist v2 (Sakana AI, 2025)
└─ Tree search methodology, $15-20/run
└─ First AI paper to pass double-blind peer review (ICLR workshop, 6.33)
AI-Researcher + Novix (HKU, 2025)
└─ NeurIPS 2025 Spotlight
└─ Four-module architecture for resource collection → filtering → ideas → writing
DeepScientist (Westlake University, 2026)
└─ ~5,000 ideas, ~1,100 experimentally validated
└─ Exceeded human SOTA on 3 frontier tasks (183.7%, 1.9%, 7.9%)
└─ 20,000+ GPU-hours consumed
FARS (Analemma, 2026) ← this system
└─ 160-GPU cluster, 228h continuous operation
└─ 244 hypotheses → 100 papers
└─ First live, public, unattended research deployment at scale
└─ Rejects academic formatting conventions entirely
───────────────────────────────────────────────────────────
Where each prior system addressed a subset of the autoresearch problem (AI Scientist: feasibility; Zochi: acceptance quality; DeepScientist: frontier-pushing depth), FARS addresses industrial-scale throughput with continuous autonomous operation — the question of whether automated research can function as a reliable, always-on production system rather than a one-off demonstration.
2 Authors and Team
Founder and CEO
Dr. Sun Tianxiang (孙天祥) — Founder and CEO of Analemma. PhD in Computer Science from Fudan University (2024), advised by Xipeng Qiu and Xuanjing Huang. Sun was a principal developer of MOSS, one of the earliest Chinese open-source conversational language models, and has extensive research experience in reinforcement learning and large language model post-training.
Sun's background is significant: MOSS (2023) was one of the first open-source LLMs to demonstrate multi-turn dialogue capability in Chinese, and the experience of building and training large models directly informs FARS's architecture — the system is built by people who understand the full stack from pretraining through RLHF to deployment.
Organizational Context
Analemma (日行迹智能科技):
- Founded: March 2025
- Location: Shanghai, China
- Core team: Members from Fudan University's MOSS team and Shanghai AI Laboratory's InternLM team
- Funding: Angel round of several hundred million RMB from investors including Sequoia Capital China
- Mission: Building infrastructure for automated scientific research
| Aspect | Detail |
|---|---|
| Team origin | Fudan University NLP Lab (MOSS) + Shanghai AI Lab (InternLM) |
| Founded | March 2025 |
| FARS launch | February 2026 (~11 months from founding to deployment) |
| Funding | Angel round, Sequoia Capital China lead |
| Hardware | 160 NVIDIA GPUs (proprietary cluster) |
The speed of execution is notable: from company founding to a 160-GPU live deployment producing 100 research papers in under one year. This suggests either (a) substantial pre-founding research, (b) aggressive parallel engineering, or (c) both. The team's prior experience building MOSS and InternLM models would have provided deep familiarity with the infrastructure needed.
Team Composition
The team is not publicly enumerated in the blog post. Based on organizational context:
- NLP/LLM researchers: Core competency from MOSS and InternLM lineage
- Infrastructure engineers: Required for 160-GPU cluster management
- Systems architects: Multi-agent coordination and shared filesystem design
- Manual reviewers: At least 3 senior researchers (5+ years experience each) for pre-arXiv quality gates
Philosophical Alignment
The team's background in building large language models (MOSS, InternLM) positions them uniquely: they understand both the capabilities and limitations of LLMs as research tools, having been on the producing side of the models that autoresearch systems consume. This insider perspective likely informs FARS's pragmatic design philosophy — the system is built by researchers who know what research actually requires, not by engineers imagining what research might look like.
3 Core Contribution
The Radical Thesis
FARS makes a philosophical claim that distinguishes it from every prior autoresearch system:
The output of a research system should be research contributions, not papers conforming to academic formatting conventions.
This is not a minor stylistic preference. It is a fundamental rejection of the form factor that every other autoresearch system has optimized for. AI Scientist, Zochi, AI Scientist v2, and DeepScientist all measure success by how closely their outputs resemble human-written conference papers, using peer review scores as the gold standard. FARS inverts this: the paper format is a historical artifact of human-centered research, not a necessary property of knowledge production.
First-Principles Critique of Human Research
FARS's blog post articulates a first-principles analysis of why human-centered research is structurally inefficient:
| Structural Problem | Root Cause | FARS Response |
|---|---|---|
| High entry barrier | Years of training to become a researcher | Automated agents require no training period |
| High failure rate | Most research ideas don't work out | Negative results are explicitly valued and published |
| Publication bias | Only "successful" experiments get published | Every completed experiment produces output |
| Maximal contribution units | Papers are large, comprehensive artifacts | Each paper is a single, well-scoped contribution |
| Format overhead | Conforming to venue-specific formatting rules | No structural constraints beyond clarity |
| Length inflation | Pressure to fill pages to meet minimum requirements | Short papers — as long as they need to be, no more |
| Supply constraint | Limited number of human researchers | System runs 24/7, parallelizes across projects |
This critique identifies a fundamental tension: the academic paper format evolved to serve human readers and human review committees, not to maximize the rate of knowledge frontier expansion. FARS argues that if the goal is to "efficiently and reliably expand the frontier of knowledge," then the format should be optimized for that goal rather than for backward compatibility with human conventions.
Five Design Principles
From the blog post and observable outputs, FARS operates on five principles:
- Contributions, not papers. The unit of output is a research contribution — a piece of new knowledge — not a formatted document. The paper is merely a container for the contribution.
- Single-scoped contributions. Each paper addresses exactly one research question or presents exactly one finding. This is the minimal composable unit of knowledge, analogous to a function in programming: do one thing, do it well, make it reusable.
- Negative results are knowledge. A well-conducted experiment that shows something doesn't work is as valuable as one that shows something does. FARS explicitly reports negative results. Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — a paper whose entire contribution is demonstrating that a promising technique fails.
- No unnecessary constraints. Papers are not padded to meet length minimums, do not conform to venue-specific formatting templates, and do not include sections (e.g., lengthy related work surveys) that don't serve the contribution.
- Scale reveals truth. A handful of examples is insufficient to evaluate a research system. FARS was designed to produce 100 papers precisely because quality variance at scale is a known limitation — the signal emerges from the aggregate, not from cherry-picked examples.
What Makes This Novel
Prior autoresearch systems tried to pass the Turing test of academic publishing — can the AI produce a paper indistinguishable from a human-written one? FARS asks a different question: if we freed research from the constraints of human publishing conventions, what would an optimally efficient research system look like?
This is the difference between building a faster horse and building a car.
The Live Deployment as Contribution
FARS's public deployment is itself a scientific contribution. By running continuously for 228 hours with a live dashboard, public GitLab, and real-time observability, Analemma subjected their system to the most rigorous possible evaluation: public scrutiny at scale. Any researcher could watch the system operate, inspect intermediate artifacts, read every generated paper, and form their own assessment.
This transparency protocol is unprecedented in the autoresearch space. AI Scientist released code but not live runs. DeepScientist reported results but not the process. FARS showed everything, live, as it happened.
4 Supported Solutions
Problem Framing
FARS frames automated research as a pipeline production problem rather than a search problem (contrast with AIRA₂, which frames it as graph search) or a dialogue problem (contrast with AI Scientist, which frames it as a multi-turn LLM conversation).
The pipeline framing has specific implications:
Search-based framing (AIRA₂, AlphaEvolve):
Goal: Find optimal solution in a space
Metaphor: Exploration of a landscape
Bottleneck: Search efficiency, evaluation signal
Dialogue-based framing (AI Scientist, Zochi):
Goal: Produce human-like research discourse
Metaphor: Simulating a researcher
Bottleneck: LLM capability, prompt engineering
Pipeline-based framing (FARS):
Goal: Convert research directions into completed papers
Metaphor: Assembly line / factory
Bottleneck: Throughput, reliability, coordination
Research Domains
Primary domain: AI/LLM research — the "AI-for-AI" paradigm where automated systems research the technology that powers them. This is an explicitly acknowledged limitation and a pragmatic choice: AI research provides the most readily available evaluation signals (code runs, benchmarks, automated metrics).
Initial research directions specified:
- Reinforcement Learning from Verifiable Rewards (RLVR)
- Other AI-related topics the system discovers autonomously
Observed output topics (from published papers):
- Self-reflection in small language models
- World-model verification for agent planning
- Vision-language model selection strategies (OCR)
- Fine-tuning data selection (hard vs. easy examples)
- Metamorphic testing for LLM world models
- Coding agent testing and import autofix
- Budget allocation for verification systems
The diversity of topics is notable: starting from a few specified directions, the Ideation Agent discovered and explored adjacent research questions autonomously. This suggests the literature review component is effective at identifying related open problems.
Paper Format and Structure
FARS papers are short papers — typically 4–8 pages — focused on a single contribution. Each paper includes:
| Component | Present | Notes |
|---|---|---|
| Abstract | Yes | Concise, focused on the single contribution |
| Introduction / motivation | Yes | Brief — sufficient to frame the question |
| Method | Yes | Technical description of the approach |
| Experiments | Yes | With framework diagrams, result tables, analysis |
| Conclusion | Yes | What was learned, including if the result is negative |
| Related work | Minimal | Only directly relevant prior work, not comprehensive surveys |
| Code repository | Yes | Public GitLab repo for each paper |
| Extensive appendices | No | No padding |
Solution Types
Based on observed outputs, FARS produces three types of research contributions:
Type 1: Positive methods. Standard research contributions where a proposed method improves on baselines.
- Example: "Equation-Consistency Gated Reflection for Small Language Models: A Training-Free Approach to Preventing Self-Correction Regressions"
Type 2: Negative results. Experiments demonstrating that a plausible approach fails, with analysis of why.
- Example: "Interface-Aware Smoke Tests and Deterministic Import Autofix for Feature-Level Coding Agents: A Negative Result" — automated import autofix provided no benefit over baseline (both 10.0% resolved rate)
- Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — all selection strategies within a 0.3-point band, no improvement
Type 3: Empirical insights. Data-driven findings about phenomena in AI systems.
- Example: "Hard Examples Beat Easy Examples in Repetition-Heavy Long-CoT Fine-Tuning"
- Example: "Stutter-Invariance Metamorphic Audits for Text World-Model Rollouts" — domain-specific probes statistically tied with a simpler baseline
The explicit production of Type 2 (negative results) papers is philosophically significant. In human academia, negative results are systematically suppressed due to publication bias — journals and conferences preferentially accept positive results. FARS's willingness to report negative results as complete papers represents a structural fix to this bias.
5 LLM Integration
Model Access
FARS has access to both open-source and closed-source large language models, with the GPU cluster enabling local inference for open-source models and API access for proprietary ones.
| Access Type | Purpose | Examples |
|---|---|---|
| Open-source models (local inference) | Experimental subjects, data synthesis, cheap inference | Models run on the 160-GPU cluster |
| Closed-source models (API) | Agent backbone, complex reasoning, LLM-as-a-Judge | Likely GPT-4-class or Claude-class models |
| Model inference endpoints | Data synthesis, agent design, evaluation | Encapsulated as tools for the Experiment Agent |
Role of LLMs in Each Agent
Ideation Agent:
- Uses LLMs for literature comprehension and synthesis
- Generates research hypotheses from literature analysis
- Conducts automated literature review across open-access papers
- May use embedding models for semantic search over literature
Planning Agent:
- Uses LLMs for experimental design reasoning
- Translates hypotheses into concrete experimental protocols
- Determines required resources, baselines, and evaluation metrics
Experiment Agent:
- Uses LLMs for code generation and debugging
- Uses LLMs as experimental subjects (e.g., testing LLM behaviors)
- Uses LLMs as evaluation judges (LLM-as-a-Judge paradigm)
- Uses LLMs for data synthesis (generating training/test data)
- Uses LLMs for agent design (creating sub-agents for experiments)
Writing Agent:
- Uses LLMs for paper composition
- Synthesizes experimental results into structured narratives
- Generates figures, tables, and analyses
LLM-as-Infrastructure vs. LLM-as-Subject
A distinctive aspect of FARS is the dual role of LLMs:
LLM Usage in FARS:
─────────────────────────────────────────────
Infrastructure layer (running the system):
├── Ideation Agent backbone
├── Planning Agent backbone
├── Experiment Agent backbone
└── Writing Agent backbone
Subject layer (being researched):
├── LLMs as experimental subjects
├── LLM behaviors being studied
├── LLM training/fine-tuning being tested
└── LLM evaluation methods being developed
─────────────────────────────────────────────
This creates a recursive structure: LLMs researching LLMs. The system uses GPT/Claude-class models to reason about experiments on smaller or different LLMs. The infrastructure models must be more capable than the subject models for this to work — you cannot study the behavior of a model using a less capable model as your reasoning engine.
Token Consumption
The FARS-100 run consumed 11.4 billion tokens across 100 papers:
| Metric | Value |
|---|---|
| Total tokens | 11.4 billion |
| Per paper (average) | ~114 million tokens |
| Per hypothesis (average) | ~46.7 million tokens |
| Token cost component | Major fraction of $104,000 total |
The per-paper token count of ~114 million is orders of magnitude higher than typical LLM generation tasks:
Token consumption comparison:
─────────────────────────────────────
Typical chatbot response: ~500 tokens
Typical long-form generation: ~5,000 tokens
Typical agentic task: ~50,000 tokens
Complex multi-step agent: ~500,000 tokens
AI Scientist paper: ~5,000,000 tokens (estimated)
FARS paper: ~114,000,000 tokens ← 20x more
─────────────────────────────────────
This enormous token consumption reflects the "trading computing power for intelligence" characteristic described in reporting. The Experiment Agent likely dominates: running code, debugging failures, iterating on approaches, calling LLMs for data synthesis and evaluation — all within a single paper's lifecycle.
6 Key Results
FARS-100 Headline Numbers
| Metric | Value |
|---|---|
| Duration | 228 hours 28 minutes 33 seconds |
| Hypotheses generated | 244 |
| Papers completed | 100 |
| Hypothesis → paper conversion rate | 41.0% (100/244) |
| Average time per paper | ~2 hours 17 minutes |
| Total tokens consumed | 11.4 billion |
| Total cost | ~$104,000 (~¥750,000 RMB) |
| Cost per paper | ~$1,040 |
| Hardware | 160 NVIDIA GPUs |
| Human intervention | Zero during the 228-hour run |
Quality Assessment
Using Stanford's Agentic Reviewer system (paperreview.ai), calibrated against ICLR review standards:
| Population | Mean Score | Range |
|---|---|---|
| FARS-100 papers | 5.05 | 3.0 – 6.3 |
| ICLR 2026 all human submissions | 4.21 | — |
| ICLR 2026 accepted papers | 5.39 | — |
Interpretation:
- FARS papers score 0.84 points above the average human submission
- FARS papers score 0.34 points below the average accepted paper
- The score distribution is concentrated around 5.0, indicating a stable quality band rather than random fluctuation
- A small number of papers exceeded 6.0, indicating occasional "breakthrough" quality
Quality positioning (ICLR review scale, higher is better):
─────────────────────────────────────────────
FARS-100 score range:            3.0 – 6.3
ICLR 2026 submission mean:       4.21
FARS-100 mean:                   5.05
ICLR 2026 accepted-paper mean:   5.39
─────────────────────────────────────────────
Agentic Reviewer Calibration
The Agentic Reviewer system was validated against ICLR 2025 review data:
| Comparison | Spearman Correlation |
|---|---|
| Human reviewer vs. human reviewer | 0.41 |
| AI reviewer vs. human reviewer | 0.42 |
The AI reviewer achieves parity with human inter-reviewer agreement, suggesting its scores are roughly as reliable as human review. Both human and AI review remain noisy, however: a correlation of 0.41–0.42 corresponds to only about 17% shared variance, so most of any individual score reflects reviewer-specific variation.
Throughput Analysis
FARS-100 throughput timeline:
────────────────────────────────────────────────
Day 1 (0-24h): ~10 papers
Day 2 (24-48h): ~10 papers
Day 3 (48-72h): ~10 papers
...
Day 9.5 (228h): 100th paper completed
────────────────────────────────────────────────
Average: ~10.5 papers/day
Peak: Not reported (likely higher due to parallel execution)
Minimum: Not reported
Comparison to Human Research
| Metric | Human Researcher | FARS |
|---|---|---|
| Time per paper | 3–6 months | ~2.3 hours |
| Papers per year | 2–5 (typical) | ~3,800 (projected continuous) |
| Cost per paper | $50,000–200,000 (fully loaded) | ~$1,040 |
| 24/7 operation | No | Yes |
| Negative result reporting | Rare (publication bias) | Systematic |
| Quality (ICLR scale) | 4.21 avg submission | 5.05 avg |
The throughput advantage is approximately 1,000x in papers-per-unit-time. The cost advantage is approximately 50–200x per paper (depending on human researcher cost assumptions). However, these comparisons have important caveats — FARS papers are short, single-contribution papers while human papers are typically fuller works. The appropriate comparison may be FARS papers vs. individual experiments within a human paper rather than vs. complete human papers.
Conversion Funnel
Research direction documents (input)
│
▼
244 hypotheses generated (Ideation Agent)
│
▼
??? hypotheses passed automated review
│
▼
??? experimental plans created (Planning Agent)
│
▼
??? experiments executed (Experiment Agent)
│
▼
100 papers completed (Writing Agent)
│
▼
??? papers pass manual review (3+ senior researchers)
│
▼
??? papers submitted to arXiv (explicit AI-generated label)
The 41% hypothesis-to-paper conversion rate (100/244) suggests that ~59% of hypotheses either failed automated review, failed experimentally, or were abandoned during execution. This is actually a healthy ratio — in human research, the hypothesis-to-publication rate is typically much lower (perhaps 5–20%). FARS's higher conversion rate may reflect either (a) better hypothesis quality, (b) lower publication threshold (short papers, negative results accepted), or (c) both.
7 Reproducibility
Source Code
| Component | Public | URL |
|---|---|---|
| System code (FARS itself) | No | Proprietary to Analemma |
| Research outputs (papers) | Yes | analemma.ai/papers/<uuid>/ |
| Experiment code | Yes | gitlab.com/fars-a/<project>/ |
| Live dashboard | Yes | analemma.ai/fars |
FARS itself is not open-source. This is a critical distinction from systems like AI Scientist (open-source), Karpathy's autoresearch (open-source), and Zochi (partially open). The system's architecture, agent prompts, coordination mechanisms, and infrastructure code are proprietary.
What Is Reproducible
- Individual experiment results: Each paper has an associated GitLab repository with code, data, and instructions. Other researchers can re-run specific experiments.
- Paper quality evaluation: The Agentic Reviewer system is independently available (paperreview.ai). Anyone can re-score the FARS papers.
- Observable operation: The live dashboard provides real-time visibility into the system's operation, making the process (though not the implementation) transparent.
What Is Not Reproducible
- The system itself: Without access to FARS's code, prompts, and infrastructure, the full system cannot be reproduced.
- The 160-GPU cluster: The hardware requirements are a substantial barrier. Few academic labs have access to 160 GPUs for a continuous 10-day experiment.
- The specific LLM configurations: Which models serve as agent backbones, their specific prompts, and their interaction patterns are not disclosed.
Reproducibility Assessment
| Criterion | Rating | Notes |
|---|---|---|
| System reproducibility | Low | Proprietary code, closed architecture |
| Experiment reproducibility | Medium-High | Individual experiment repos are public |
| Result verification | Medium | Papers and scores can be independently evaluated |
| Process transparency | High | Live dashboard, public GitLab activity |
| Hardware accessibility | Low | 160 GPUs required |
The reproducibility profile is inverted compared to most research: the process is unusually transparent (live public operation), but the system is unusually opaque (proprietary code). This creates a trust-but-can't-verify dynamic: observers can see that FARS works, but cannot build their own version.
8 Compute and API Costs
Hardware Configuration
| Resource | Specification |
|---|---|
| GPU cluster | 160 NVIDIA GPUs |
| GPU model | Not publicly specified (likely A100 or H100 class) |
| Purpose | Training, inference, experiment execution |
| Access model | Encapsulated as tools for the Experiment Agent |
| Additional endpoints | Model inference endpoints for data synthesis, LLM-as-a-Judge, agent design |
FARS-100 Cost Breakdown
| Category | Estimated Cost | Notes |
|---|---|---|
| GPU compute (160 GPUs × 228h ≈ 36,500 GPU-hours) | ~$50,000–70,000 | Implies an effective rate of ~$1.4–1.9/GPU-hour |
| LLM API tokens (11.4B tokens) | ~$30,000–50,000 | Mix of open/closed model inference |
| Total | ~$104,000 | Official reported figure |
| Per paper | ~$1,040 | 104,000 / 100 |
| Per hypothesis | ~$426 | 104,000 / 244 |
Token Economics
Token consumption by category (estimated breakdown):
Experiment Agent: ~70% of total tokens
├── Code generation and debugging
├── Running LLM experiments (subject models)
├── Data synthesis
├── LLM-as-a-Judge evaluations
└── Iterative refinement loops
Ideation Agent: ~15% of total tokens
├── Literature review (reading papers)
├── Hypothesis generation
└── Cross-referencing and validation
Writing Agent: ~10% of total tokens
├── Paper composition
├── Figure and table generation
└── Revision and formatting
Planning Agent: ~5% of total tokens
├── Experimental design
└── Protocol specification
The Experiment Agent likely dominates token consumption because it performs the most computationally and cognitively intensive work: writing code, debugging failures, running experiments, interpreting results, and iterating. This is analogous to human research where the experiment phase consumes the majority of time and resources.
Cost Comparison Across Autoresearch Systems
| System | Cost per Paper | Hardware | Duration |
|---|---|---|---|
| AI Scientist (2024) | ~$15 | Minimal (API-only) | Hours |
| AI Scientist v2 (2025) | ~$15–20 | Minimal (API-only) | Hours |
| Zochi (2025) | Not reported | Not reported | Not reported |
| DeepScientist (2026) | Not reported | 20,000+ GPU-hours total | Months |
| FARS (2026) | ~$1,040 | 160 GPUs × 228h | ~2.3 hours |
| Karpathy autoresearch (2026) | ~$18/run | 1 GPU | 8 hours |
FARS's per-paper cost of ~$1,040 is substantially higher than AI Scientist (~$15) but reflects a fundamentally different scope: AI Scientist produces minimally-viable papers with limited experimental depth, while FARS executes real GPU-intensive experiments. The appropriate comparison is cost-per-unit-of-experimental-work, not cost-per-paper.
Cost Efficiency Analysis
Cost per paper: $1,040
Human equivalent: $50,000 - $200,000 (3-6 months of researcher time)
Cost reduction vs. human: 48x - 192x
But: FARS papers are short papers (~single contribution)
Adjusted comparison (FARS paper ≈ 1 experiment in human paper):
Human cost per experiment: ~$10,000 - $30,000
FARS cost per experiment: ~$1,040
Adjusted cost reduction: 10x - 29x
Scaling Economics
If FARS operated continuously for one year:
Annual projection (continuous operation):
Papers per year: ~3,800 (at 2.3h/paper)
Annual GPU cost: ~$2.8M - $4.0M (160 GPUs full-time)
Annual token cost: ~$1.5M - $2.5M
Annual total: ~$4.3M - $6.5M
Cost per paper: ~$1,130 - $1,710
Equivalent human team:
3,800 papers/year ÷ 4 papers/researcher/year = 950 researchers
950 researchers × $200K/year = $190M/year
FARS vs. human team: ~$5M vs. ~$190M → 38x cost advantage
These projections assume linear scaling and constant quality, which may not hold. But they illustrate the economic logic driving industrial-scale autoresearch.
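For readers who want to vary the assumptions, the projection reduces to a few lines of arithmetic; the sketch below uses only the assumed figures from this section (GPU-hour price, annual token spend, papers per researcher), not measured values.

```python
# Sketch of the annual-scale projection above. Every input is one of the stated
# assumptions from this section, not a measured value.
HOURS_PER_YEAR = 24 * 365
HOURS_PER_PAPER = 2.3                      # observed FARS-100 average
GPU_COUNT = 160
GPU_PRICE_BAND = (2.0, 3.0)                # assumed $/GPU-hour
ANNUAL_TOKEN_COST_BAND = (1.5e6, 2.5e6)    # assumed annual LLM token spend ($)

papers_per_year = HOURS_PER_YEAR / HOURS_PER_PAPER
gpu_cost_band = [GPU_COUNT * HOURS_PER_YEAR * price for price in GPU_PRICE_BAND]
total_band = [gpu + tok for gpu, tok in zip(gpu_cost_band, ANNUAL_TOKEN_COST_BAND)]

print(f"papers/year : {papers_per_year:,.0f}")
print(f"GPU cost    : ${gpu_cost_band[0] / 1e6:.1f}M - ${gpu_cost_band[1] / 1e6:.1f}M")
print(f"total cost  : ${total_band[0] / 1e6:.1f}M - ${total_band[1] / 1e6:.1f}M")
print(f"per paper   : ${total_band[0] / papers_per_year:,.0f} - ${total_band[1] / papers_per_year:,.0f}")

# Equivalent human team under the same assumptions (4 papers/researcher/year,
# $200K fully loaded cost per researcher).
researchers = papers_per_year / 4
print(f"human team  : {researchers:,.0f} researchers ≈ ${researchers * 200_000 / 1e6:.0f}M/year")
```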
9 Architecture Solution
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ FARS SYSTEM ARCHITECTURE │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RESEARCH DIRECTION INPUT │ │
│ │ (Document specifying multiple research directions) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ IDEATION AGENT │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Literature │ │ Hypothesis │ │ Automated Review │ │ │
│ │ │ Review │ │ Generation │ │ (pass/fail gate) │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │ │
│ │ │ │ │ │ │
│ │ [Open-access papers] [Public GitLab repos] [Quality filter] │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (validated hypotheses) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PLANNING AGENT │ │
│ │ │ │
│ │ Hypothesis → Experimental Plan │ │
│ │ (baselines, metrics, resources, evaluation criteria) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (experimental plan) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ EXPERIMENT AGENT │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Code │ │ GPU Cluster │ │ Model Inference │ │ │
│ │ │ Generation │ │ (160 GPUs) │ │ Endpoints │ │ │
│ │ │ & Debugging │ │ as tools │ │ as tools │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────────┼──────────────────────┘ │ │
│ │ │ │ │
│ │ Code → Schedule GPU jobs → Collect results → Iterate │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (experimental results + code) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ WRITING AGENT │ │
│ │ │ │
│ │ Results → Paper (short paper, single contribution) │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ (completed paper) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐│ │
│ │ │ Paper │ │ GitLab │ │ Live │ │ Manual Review ││ │
│ │ │ (PDF) │ │ Repo │ │ Dashboard│ │ (3+ reviewers) ││ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘│ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ ║ SHARED FILE SYSTEM ║ │
│ ║ (Workspace + persistent memory for all agents) ║ │
│ ║ • Structured project directories ║ │
│ ║ • Agents read from / write to shared workspace ║ │
│ ║ • No direct agent-to-agent communication ║ │
│ ║ • Scales to many concurrent research projects ║ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ GPU CLUSTER (160 GPUs) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │GPU 1│ │GPU 2│ │GPU 3│ │GPU 4│ ··· │GPU │ │ │
│ │ │ │ │ │ │ │ │ │ │ 160 │ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ │ Encapsulated as TOOLS for the Experiment Agent │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Architectural Decisions
| Decision | Rationale | Implications |
|---|---|---|
| Sequential pipeline (not graph search) | Research has a natural linear flow: idea → plan → experiment → write | Simpler coordination, clearer handoffs, easier debugging |
| Shared file system (not message passing) | Filesystem is universally understood, durable, inspectable | No message broker needed, natural audit trail, human-inspectable |
| No direct agent-to-agent communication | Eliminates protocol complexity, race conditions, deadlocks | Agents are independently testable, replaceable, scalable |
| GPU cluster as tools (not raw access) | Experiment Agent schedules jobs without managing infrastructure | Clean separation of concerns; agent thinks about experiments, not CUDA |
| Parallel project queue | Multiple research projects advance simultaneously | Higher throughput, better GPU utilization, tolerant of blocked projects |
| Short papers (not full conference papers) | Minimal unit of contribution reduces writing burden | Writing Agent can focus on clarity, not page-filling |
The Shared Filesystem Pattern
The most architecturally distinctive feature of FARS is the shared filesystem as the sole coordination mechanism between agents. This deserves detailed analysis:
Shared Filesystem Structure (inferred):
─────────────────────────────────────────
/fars-workspace/
├── projects/
│ ├── FA0001/ # Project directory
│ │ ├── ideation/
│ │ │ ├── literature_review.md # Ideation Agent writes
│ │ │ ├── hypothesis.md # Ideation Agent writes
│ │ │ └── review_result.json # Automated review output
│ │ ├── planning/
│ │ │ ├── experimental_plan.md # Planning Agent writes
│ │ │ └── resources.json # Required resources
│ │ ├── experiment/
│ │ │ ├── code/ # Experiment Agent writes
│ │ │ ├── results/ # Experiment outputs
│ │ │ └── logs/ # Execution logs
│ │ ├── writing/
│ │ │ ├── paper.tex # Writing Agent writes
│ │ │ ├── paper.pdf # Compiled paper
│ │ │ └── figures/ # Generated figures
│ │ └── status.json # Project stage tracking
│ ├── FA0002/
│ │ └── ...
│ └── ...
├── gitlab/ # Public repository staging
└── shared/
├── literature_cache/ # Shared literature database
└── model_endpoints.json # Available inference endpoints
Why filesystem over message passing?
Most multi-agent systems use message queues, event buses, or direct API calls for inter-agent communication. FARS's choice of a shared filesystem has several deep advantages:
- Persistence by default. Every intermediate artifact is automatically persisted. If the system crashes, the filesystem state represents a perfect checkpoint. No message replay, no state reconstruction needed.
- Natural handoffs. Agent A completes its work by writing files. Agent B begins its work by reading files. The filesystem boundary is the API contract — no schema definitions, no versioning headaches, no serialization/deserialization overhead.
- Human inspectability. Any researcher can SSH into the filesystem and inspect exactly what the Ideation Agent produced, what the Planning Agent interpreted, what the Experiment Agent executed. This is invaluable for debugging and trust-building.
- Trivial parallelism. Multiple projects run in parallel simply by having separate directories. No lock contention, no resource arbitration (beyond GPU scheduling), no coordination overhead.
- Scalability. Adding more concurrent projects means adding more directories. Adding more agents (e.g., a Review Agent) means adding a reader of existing files. The filesystem pattern scales linearly without architectural changes.
- Decoupled evolution. Each agent can be upgraded independently. As long as the file formats are compatible, agents don't need to know about each other's implementations.
This pattern is not novel in systems engineering (Unix pipes, Plan 9, microservices coordinating via shared storage), but it is novel in the autoresearch space. Most prior systems use tighter coupling: AI Scientist uses a single LLM conversation, AIRA₂ uses an in-memory population database, DeepScientist uses knowledge graphs and databases.
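A minimal sketch of how filesystem-only coordination could be orchestrated, using the inferred directory layout above. The stage list, artifact names, and polling loop are illustrative assumptions rather than Analemma's implementation.

```python
import json
import time
from pathlib import Path

WORKSPACE = Path("/fars-workspace/projects")   # inferred layout from above

# Illustrative stage order and the artifact each stage must produce before the
# next one may start; names mirror the inferred directory structure.
STAGES = [
    ("ideation",   "ideation/hypothesis.md"),
    ("planning",   "planning/experimental_plan.md"),
    ("experiment", "experiment/results"),
    ("writing",    "writing/paper.pdf"),
]

def current_stage(project: Path) -> str:
    """A project's stage is simply the first stage whose output is missing."""
    for stage, artifact in STAGES:
        if not (project / artifact).exists():
            return stage
    return "done"

def poll_once() -> dict:
    """One scheduling pass: record each project's stage in its status.json.
    Stage-specific agents would independently pick up projects in 'their' stage."""
    if not WORKSPACE.exists():
        return {}
    snapshot = {}
    for project in sorted(WORKSPACE.glob("FA*")):
        stage = current_stage(project)
        (project / "status.json").write_text(json.dumps({"stage": stage}))
        snapshot[project.name] = stage
    return snapshot

if __name__ == "__main__":
    while True:
        print(poll_once())   # e.g. {'FA0001': 'experiment', 'FA0002': 'ideation'}
        time.sleep(60)
```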
The Assembly Line Metaphor
FARS's pipeline architecture maps directly to an industrial assembly line:
Industrial Assembly Line:
Raw material → Stamping → Welding → Painting → Assembly → QC → Ship
FARS Research Assembly Line:
Directions → Ideation → Planning → Experiment → Writing → Review → Publish
Key properties of assembly lines that FARS inherits:
- Pipelining. While paper N is being written, paper N+1 is being experimented on, paper N+2 is being planned, and paper N+3 is being ideated. All four stages operate concurrently on different projects.
- Modular replacement. Each station (agent) can be upgraded independently. A better Writing Agent doesn't require changes to the Experiment Agent.
- Bottleneck identification. The slowest stage determines throughput. If experiments take 10x longer than writing, optimizing the Writing Agent yields zero throughput improvement.
- Quality control. Papers that fail at any stage can be rejected without wasting downstream effort (though under the negative-results philosophy, a well-executed experiment that disproves its hypothesis still produces a paper).
10 Component Breakdown
Component 1: Ideation Agent
Purpose: Convert research directions into validated, actionable research hypotheses.
| Aspect | Detail |
|---|---|
| Input | Document specifying multiple research directions |
| Output | Validated research hypotheses (forwarded to Planning Agent) |
| Tools | Open-access paper search, public GitLab repository access |
| Gate | Automated review — hypothesis must pass before forwarding |
| Token consumption | ~15% of total (literature comprehension dominates) |
Process:
1. Receives research direction document (human-authored, specifying broad areas like "RLVR")
2. Conducts literature review across open-access papers
3. Identifies gaps, open questions, and unexplored combinations
4. Generates specific, testable hypotheses
5. Each hypothesis undergoes automated review (quality/feasibility filter)
6. Passing hypotheses written to shared filesystem → picked up by Planning Agent
Key design choice: The Ideation Agent has access to public GitLab repositories in addition to papers. This means it can read actual code implementations, not just paper descriptions. This is a significant advantage over literature-review-only approaches, as the gap between what papers describe and what code implements is often substantial.
Hypothesis generation rate: 244 hypotheses in 228 hours ≈ 1.07 hypotheses/hour. Given that 100 became papers (41% conversion), the Ideation Agent overproduces by design — it generates more hypotheses than the downstream pipeline can absorb, ensuring the pipeline is never starved.
Component 2: Planning Agent
Purpose: Transform validated hypotheses into concrete experimental plans.
| Aspect | Detail |
|---|---|
| Input | Validated hypothesis (from shared filesystem) |
| Output | Experimental plan (written to shared filesystem) |
| Scope | Baselines, metrics, evaluation criteria, resource requirements |
| Token consumption | ~5% of total (relatively lean — planning is reasoning-intensive but not token-heavy) |
Process:
1. Reads hypothesis from project directory
2. Determines what baselines are needed
3. Specifies evaluation metrics and success criteria
4. Estimates required computational resources
5. Writes experimental plan to project directory
The Planning Agent is the thinnest component — its role is to bridge the gap between an abstract hypothesis and a concrete experimental protocol. In human research, this is the "methods section" thought process.
Component 3: Experiment Agent
Purpose: Execute experimental plans using GPU cluster and model inference endpoints.
| Aspect | Detail |
|---|---|
| Input | Experimental plan (from shared filesystem) |
| Output | Code, results, logs (written to shared filesystem) |
| Tools | 160-GPU cluster (as tools), model inference endpoints (as tools) |
| Capabilities | Code generation, debugging, GPU scheduling, data synthesis, LLM-as-a-Judge |
| Token consumption | ~70% of total (dominant consumer) |
Tool encapsulation:
The GPU cluster is not exposed as raw hardware. Instead, it is encapsulated as tools — high-level interfaces that the Experiment Agent can call:
Available tools for Experiment Agent:
─────────────────────────────────────
GPU Tools:
schedule_training_job(config) → job_id
check_job_status(job_id) → status
get_job_results(job_id) → results
cancel_job(job_id) → ok
Inference Tools:
run_inference(model, inputs) → outputs
synthesize_data(spec) → dataset
evaluate_with_judge(model, prompt, response) → score
Utility Tools:
read_dataset(path) → data
write_results(path, data) → ok
create_figure(data, spec) → figure
This encapsulation serves multiple purposes:
- Isolation: The agent doesn't need to know about CUDA, distributed training, or job schedulers
- Safety: The agent cannot accidentally consume all GPU resources or interfere with other projects
- Abstraction: The same experimental code could theoretically run on different hardware backends
Experiment execution flow:
1. Read experimental plan
2. Write code for the experiment
3. Schedule training/inference jobs on GPU cluster
4. Monitor job execution
5. Collect and analyze results
6. If results are unexpected → debug, modify, re-run
7. Write final results and code to project directory
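The pseudo-signatures above suggest how the cluster could be wrapped. The sketch below shows one plausible shape for such tool wrappers, assuming a filesystem-backed job queue; the paths, function bodies, and job-record format are illustrative assumptions, not Analemma's actual API.

```python
import json
import uuid
from pathlib import Path

JOBS_DIR = Path("/fars-workspace/shared/jobs")   # assumed queue location

def schedule_training_job(config: dict) -> str:
    """Submit a training run and return a job id. The agent never sees CUDA,
    node allocation, or the scheduler; it only calls this function."""
    job_id = f"job-{uuid.uuid4().hex[:8]}"
    job_dir = JOBS_DIR / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "config.json").write_text(json.dumps(config, indent=2))
    (job_dir / "status").write_text("queued")   # a cluster daemon would pick this up
    return job_id

def check_job_status(job_id: str) -> str:
    """One of: queued, running, failed, done (illustrative states)."""
    return (JOBS_DIR / job_id / "status").read_text().strip()

def get_job_results(job_id: str) -> dict:
    """Return the metrics the job wrote, or an empty dict if none exist yet."""
    results = JOBS_DIR / job_id / "results.json"
    return json.loads(results.read_text()) if results.exists() else {}
```

An Experiment Agent following the flow above would generate code, call schedule_training_job, poll check_job_status, read get_job_results, and loop back to debugging when the numbers look wrong.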
Component 4: Writing Agent
Purpose: Produce the final research paper from experimental results.
| Aspect | Detail |
|---|---|
| Input | Experimental results, code, hypothesis (from shared filesystem) |
| Output | Short research paper (PDF + source) |
| Format | Single-contribution short paper |
| Includes | Abstract, method, experiments, results, conclusion, figures, tables |
| Token consumption | ~10% of total |
Key distinction from other autoresearch writing: Most systems (AI Scientist, DeepScientist) attempt to produce full conference papers with comprehensive related work sections, lengthy introductions, and detailed appendices. FARS's Writing Agent produces short papers focused exclusively on the single contribution. This is both a philosophical choice (minimal composable knowledge) and a practical one (shorter papers are faster to write and review, and less prone to hallucination).
Component 5: Safety and Review Pipeline
Purpose: Quality gate between automated production and public dissemination.
| Stage | Type | Details |
|---|---|---|
| Automated review | Algorithmic | Hypothesis-level gate in Ideation Agent |
| AI review | Automated | Stanford Agentic Reviewer (ICLR-calibrated) |
| Manual review | Human | At least 3 researchers with 5+ years experience each |
| Labeling | Manual | All submissions explicitly labeled as AI-generated |
| arXiv submission | Manual | Only papers passing manual review are submitted |
The safety pipeline is deliberately conservative:
Papers produced by FARS (100)
│
▼
AI Review (Agentic Reviewer) ──── score distribution published
│
▼
Manual Review (3+ senior researchers, 5+ years each)
│ │
▼ ▼
PASS FAIL
│ │
▼ │
Label as │
AI-generated │
│ │
▼ │
Submit to Archived
arXiv (not published)
This conservative approach addresses the primary concern about automated research: that it could flood the literature with low-quality or misleading work. By requiring manual review by multiple senior researchers before external publication, FARS ensures that its public outputs meet human standards even though the production process is fully automated.
11 Core Mechanisms (Detailed)
11.1 The Shared Filesystem as Coordination Protocol
The shared filesystem is not merely a storage layer — it is the entire coordination protocol of the system. Understanding its design is essential to understanding FARS.
Properties of filesystem-based coordination:
| Property | Advantage | Trade-off |
|---|---|---|
| Durability | Every write persists; crash-safe | Higher I/O latency than in-memory |
| Visibility | Any agent (or human) can inspect any state | Potential information leakage if not structured |
| Atomicity | File writes are atomic at OS level | No multi-file transactions |
| Ordering | Filesystem timestamps provide natural ordering | Clock skew in distributed systems (mitigated if single-node) |
| Scalability | Directories partition naturally | Filesystem metadata overhead at extreme scale |
Contrast with alternative coordination patterns:
Pattern | Used by | Advantages | Disadvantages
────────────────────────────────────────────────────────────────────────────
Message queue | AI-Researcher | Decoupled, ordered | Needs broker, no natural persistence
In-memory database | AIRA₂ | Fast, structured | Volatile, single-node bottleneck
Knowledge graph | DeepScientist | Rich relationships | Complex queries, schema overhead
Single LLM context | AI Scientist | Simple, coherent | Context window limits, no parallelism
Shared filesystem | FARS | Universal, durable | Unstructured unless conventions enforced
FARS's choice of shared filesystem is the most Unix-philosophy-aligned design in the autoresearch space. It follows the principle: "Write programs to handle text streams, because that is a universal interface." In FARS's case: "Write agents to handle files in directories, because that is a universal interface."
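One concrete convention that makes this "universal interface" safe in practice is write-then-rename publication. The helper below is a sketch of that convention under the assumption of POSIX rename semantics; it is not a documented FARS mechanism, and the path and hypothesis text are illustrative.

```python
import os
from pathlib import Path

def publish_artifact(path: Path, content: str) -> None:
    """Make a handoff file appear atomically: downstream agents either see the
    previous version or the complete new one, never a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(content)
    os.replace(tmp, path)   # atomic rename on POSIX filesystems

# Illustrative handoff: the Ideation Agent publishes a hypothesis for the Planning Agent.
publish_artifact(
    Path("/fars-workspace/projects/FA0042/ideation/hypothesis.md"),
    "Hypothesis: graduated verification scoring improves RLVR training stability.",
)
```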
11.2 Pipeline Parallelism and Throughput Optimization
FARS achieves its throughput through pipeline parallelism — the same technique used in CPU instruction pipelines, GPU rendering pipelines, and industrial assembly lines.
Time →
─────────────────────────────────────────────────────────
t=0 t=1 t=2 t=3 t=4 t=5 t=6
│ │ │ │ │ │ │
P1: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P2: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT] [DONE]
P3: [IDEA] [PLAN] [EXPT] [EXPT] [WRIT]→
P4: [IDEA] [PLAN] [EXPT] [EXPT]→
P5: [IDEA] [PLAN] [EXPT]→
─────────────────────────────────────────────────────────
Pipeline stages execute concurrently
Throughput analysis:
If the bottleneck stage (Experiment) takes time T_exp, the throughput is:
Throughput = 1 / max(T_idea, T_plan, T_exp, T_write)
Given FARS's average of ~2.3 hours/paper and the Experiment Agent consuming ~70% of resources, the Experiment stage is almost certainly the bottleneck:
Estimated stage durations (per paper):
Ideation: ~20-30 minutes
Planning: ~10-15 minutes
Experiment: ~90-120 minutes ← bottleneck
Writing: ~20-30 minutes
Pipeline throughput ≈ 1 paper / 90-120 minutes
Observed: ≈ 1 paper / 137 minutes (2h17m)
The discrepancy between the estimated bottleneck (90–120 min) and the observed rate (137 min) likely reflects:
- Pipeline filling/draining overhead
- Failed experiments that consume time but don't produce papers
- Resource contention on the GPU cluster during peak parallel execution
- Overhead from the 59% of hypotheses that don't become papers
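A toy calculation makes the pipelining argument concrete; the stage durations below are this report's estimates from the breakdown above, not measured values.

```python
# Toy check of the pipelining argument, using the estimated stage durations
# above (minutes per paper). These are this report's estimates, not measurements.
stages = {"ideation": 25, "planning": 12, "experiment": 105, "writing": 25}

bottleneck = max(stages.values())          # slowest stage limits steady-state rate
fill_latency = sum(stages.values())        # the first paper must traverse every stage

n_papers = 100
pipelined_total = fill_latency + (n_papers - 1) * bottleneck
sequential_total = n_papers * fill_latency

print(f"bottleneck stage       : {bottleneck} min/paper")
print(f"pipelined, 100 papers  : {pipelined_total / 60:.0f} h")   # ≈ 176 h
print(f"sequential, 100 papers : {sequential_total / 60:.0f} h")  # ≈ 278 h
```

The observed 228 hours falls between the fully pipelined and fully sequential bounds, consistent with partial pipelining plus the overheads listed above.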
11.3 GPU Cluster Encapsulation
The design decision to encapsulate the 160-GPU cluster as tools rather than exposing raw hardware is architecturally significant:
Traditional research agent:
Agent → SSH → GPU machine → CUDA → Training → Results
(Agent manages infrastructure directly)
FARS design:
Agent → Tool API → Cluster Scheduler → GPU → Training → Results
(Agent is isolated from infrastructure)
Benefits of tool encapsulation:
- Cognitive offloading. The Experiment Agent reasons about experiments, not infrastructure. It doesn't need to know about CUDA versions, driver compatibility, multi-GPU parallelism, or job scheduling.
- Resource management. The tool layer can implement fair scheduling across concurrent projects, prevent resource starvation, and enforce compute budgets.
- Error isolation. A crashed training job doesn't crash the agent. The tool layer handles retries, timeouts, and failure reporting.
- Portability. The same agent prompts could theoretically work with different hardware backends (different GPU types, cloud providers, or even TPUs) by swapping the tool implementation.
11.4 Negative Result Detection and Reporting
FARS's systematic production of negative result papers requires a mechanism for detecting and properly framing negative results:
Experiment outcome classification:
─────────────────────────────────────
1. Positive result → Method works, report improvement
2. Negative result → Method doesn't work, report why
3. Null result → Inconclusive, may need more experiments
4. System failure → Bug/crash, not a research result
Traditional systems (AI Scientist, AIRA₂) treat outcomes 2-4 as failures and discard them. FARS treats outcome 2 as a legitimate research contribution and produces a paper explaining why the hypothesis failed.
This requires the Writing Agent to have a distinct mode:
Positive result writing:
- "We propose X. X achieves Y improvement over baseline Z."
- Emphasis on the method, the improvement, the contribution.
Negative result writing:
- "We investigate whether X can improve Y. We find that X provides no significant improvement, and analyze why."
- Emphasis on the hypothesis, the evidence against it, and the mechanistic explanation for failure.
- Example: "OCR-Anchor Reranking: When Best-of-N Selection Fails Due to Candidate Homogeneity" — the contribution is identifying candidate homogeneity at low temperature as the failure mechanism.
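A minimal sketch of the outcome classification this requires; the significance threshold and class names below are assumptions that mirror the four-way split above, not FARS's disclosed logic.

```python
from typing import Optional

def classify_outcome(method_score: float, baseline_score: float,
                     p_value: Optional[float], crashed: bool) -> str:
    """Map an experiment's outcome onto the four classes above. The significance
    threshold and class names are illustrative assumptions."""
    if crashed:
        return "system_failure"   # bug or crash: not a research result
    if p_value is None or p_value > 0.05:
        return "null_result"      # inconclusive: may need more experiments
    if method_score > baseline_score:
        return "positive_result"  # report the improvement
    return "negative_result"      # report why the method does not help

# A significant result with no improvement is still routed to the Writing Agent,
# in its negative-result mode, rather than being discarded.
print(classify_outcome(method_score=9.6, baseline_score=10.0,
                       p_value=0.03, crashed=False))   # -> negative_result
```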
11.5 Automated Hypothesis Review
The Ideation Agent includes an automated review step that filters hypotheses before they enter the pipeline. While the specific review criteria are not publicly documented, the 41% conversion rate (100 papers from 244 hypotheses) provides bounds:
Hypothesis filtering funnel (estimated):
244 hypotheses generated
│
├── ~X% filtered by automated review (too vague, infeasible, redundant)
│
├── ~Y% fail during experiment execution (bugs, inconclusive results)
│
└── 100 produce completed papers (41%)
If automated review filters 30% → 171 pass review
Then 100/171 = 58% execution success rate
If automated review filters 10% → 220 pass review
Then 100/220 = 45% execution success rate
Either way, the conversion rates are remarkably high compared to human research, where the hypothesis-to-publication rate in AI/ML is typically 10–30%. This may reflect:
- Conservative hypothesis generation (the Ideation Agent proposes incremental, testable hypotheses)
- The inclusion of negative results (failures that would be discarded in human research become papers)
- The well-scoped single-contribution format (a lower bar for a "complete" paper)
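The two scenarios above come from a single relationship between the (undisclosed) automated-review filter rate and the implied execution success rate; the small helper below makes that dependence explicit.

```python
def execution_success_rate(hypotheses: int, papers: int, review_filter_rate: float) -> float:
    """Given an assumed automated-review filter rate (undisclosed for FARS),
    return the implied success rate of the hypotheses that entered the pipeline."""
    passed_review = hypotheses * (1 - review_filter_rate)
    return papers / passed_review

for filter_rate in (0.10, 0.30, 0.50):
    rate = execution_success_rate(244, 100, filter_rate)
    print(f"review filters {filter_rate:.0%} -> execution success {rate:.1%}")
# review filters 10% -> execution success 45.5%
# review filters 30% -> execution success 58.5%
# review filters 50% -> execution success 82.0%
```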
12 Programming Language
System Implementation
The FARS system implementation language is not publicly disclosed, but strong inferences can be made:
| Component | Likely Language | Evidence |
|---|---|---|
| Agent orchestration | Python | Industry standard for ML infrastructure; team's MOSS/InternLM background is Python |
| Shared filesystem management | Python/Shell | Standard infrastructure tooling |
| GPU cluster tools | Python (PyTorch/JAX) | Standard ML training frameworks |
| Model inference endpoints | Python (FastAPI/gRPC) | Standard serving patterns |
| Paper compilation | LaTeX | Academic paper production standard |
| Dashboard | JavaScript/TypeScript | Web frontend (analemma.ai/fars) |
Agent-Generated Code
The Experiment Agent generates Python code for experiments. Based on the published GitLab repositories, experiments use standard ML tooling:
- Training: PyTorch, Hugging Face Transformers, TRL
- Evaluation: Standard metrics libraries, custom evaluation scripts
- Data processing: pandas, numpy, datasets (Hugging Face)
- Visualization: matplotlib, plotly
Paper Generation
Papers are generated in LaTeX format and compiled to PDF. The Writing Agent must produce:
- LaTeX source with proper formatting
- Figure generation code (Python → matplotlib/plotly → image)
- Table generation (experimental results → LaTeX tables)
- Bibliography management (BibTeX references)
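Of these outputs, table generation is the most mechanical; the sketch below shows one way a Writing Agent could turn a results record into a LaTeX table. The function and its format choices are illustrative, not taken from FARS's code, and the example rows reuse the import-autofix numbers cited earlier.

```python
def results_to_latex(rows: list[dict], caption: str) -> str:
    """Render result records as a booktabs-style LaTeX table.
    Purely illustrative; FARS's actual table-generation code is not public."""
    columns = list(rows[0].keys())
    header = " & ".join(columns) + r" \\"
    body = "\n".join(" & ".join(str(row[c]) for c in columns) + r" \\" for row in rows)
    return "\n".join([
        r"\begin{table}[t]",
        r"\centering",
        r"\begin{tabular}{" + "l" * len(columns) + "}",
        r"\toprule",
        header,
        r"\midrule",
        body,
        r"\bottomrule",
        r"\end{tabular}",
        rf"\caption{{{caption}}}",
        r"\end{table}",
    ])

print(results_to_latex(
    [{"method": "baseline", "resolved": r"10.0\%"},
     {"method": "import autofix", "resolved": r"10.0\%"}],
    "Resolved rate with and without deterministic import autofix.",
))
```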
GitLab Repositories
Each completed project is published to GitLab (gitlab.com/fars-a/<project>/). Repository structure typically includes:
<project>/
├── paper.pdf # Compiled paper
├── paper.tex # LaTeX source (inferred)
├── code/
│ ├── experiment.py # Main experimental code
│ ├── evaluate.py # Evaluation script
│ └── requirements.txt
├── data/ # Processed data (if small)
├── results/ # Raw experimental results
└── README.md # Project description
13 Memory Management
System-Level Memory: The Shared Filesystem
FARS's primary memory system is the shared filesystem itself. This is a deliberate architectural choice that merges workspace and persistent memory into a single mechanism:
Memory Architecture:
─────────────────────────────────────────────────────────
Layer 1: Shared Filesystem (persistent, cross-agent)
├── Project directories (per-project state)
├── Literature cache (shared across projects)
├── Model endpoint registry
└── Status tracking (project stages, progress)
Layer 2: Agent Context Windows (transient, per-agent)
├── Current project context
├── Instructions / prompts
└── Recent interaction history
Layer 3: GPU Cluster State (transient, per-job)
├── Training checkpoints
├── Intermediate results
└── Job queue state
─────────────────────────────────────────────────────────
Inter-Agent Memory Transfer
Because agents communicate exclusively through files, memory transfer follows a write-read pattern:
Agent A memory → write to filesystem → Agent B reads from filesystem → Agent B memory
Example (Ideation → Planning):
Ideation Agent context:
"After reviewing 47 papers on RLVR, I identified a gap in
how verifiable rewards handle ambiguous correctness criteria.
Hypothesis: A graduated verification scoring system improves
RLVR training stability on math reasoning tasks."
│
▼ (writes hypothesis.md)
Filesystem:
/projects/FA0042/ideation/hypothesis.md
│
▼ (Planning Agent reads)
Planning Agent context:
"Hypothesis: graduated verification scoring for RLVR.
I need to design an experiment comparing binary vs.
graduated reward signals on GSM8K and MATH benchmarks..."
The filesystem acts as an externalized, persistent memory that survives agent restarts, context window limits, and even system crashes. This is architecturally similar to how human researchers use lab notebooks — the notebook persists even when the researcher is sleeping.
No Explicit Cross-Project Memory
FARS does not appear to maintain an explicit knowledge base, skill library, or cross-project memory. Each project is treated independently:
| Memory Type | Present in FARS | Present in Alternatives |
|---|---|---|
| Per-project state | Yes (filesystem) | Yes (all systems) |
| Cross-project knowledge base | No (inferred) | Yes (DeepScientist, some others) |
| Skill/technique library | No (inferred) | Yes (FunSearch-style systems) |
| Literature embedding database | Possibly (shared cache) | Yes (some systems) |
| Failed hypothesis memory | No explicit mechanism | Varies |
The absence of cross-project memory is a notable limitation. If FARS generates hypothesis H1 for project P1 and discovers it fails, there is no mechanism to prevent generating a similar hypothesis H1' for a different project P2. Over 244 hypotheses, some redundancy is likely.
However, the shared literature cache may provide implicit cross-project memory: if the Ideation Agent caches literature reviews and uses them across projects, insights from one project's literature review could inform another's hypothesis generation.
Context Window Management
Each agent operates within an LLM context window. For long-running experiments, the Experiment Agent may face context window pressure:
Experiment Agent context growth:
Initial plan: ~2K tokens
First code attempt: ~3K tokens
GPU job results: ~1K tokens
Error analysis + fix: ~2K tokens
Second code attempt: ~3K tokens
Results analysis: ~2K tokens
...
After 10 iterations: ~30K tokens (approaching limits)
The shared filesystem provides a natural overflow mechanism: the agent can write intermediate results to files and read them back selectively, effectively using the filesystem as an unbounded external memory that supplements the finite context window.
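A sketch of that overflow pattern, under the assumption that bulky tool outputs can be summarized before they enter the context; the paths and helper names are illustrative, not FARS's implementation.

```python
from pathlib import Path

SCRATCH = Path("/fars-workspace/projects/FA0042/experiment/scratch")  # illustrative path

def offload(name: str, text: str, keep_chars: int = 400) -> str:
    """Persist a bulky artifact (logs, raw results) to the shared filesystem and
    return a short stub that fits in the agent's context window."""
    SCRATCH.mkdir(parents=True, exist_ok=True)
    path = SCRATCH / f"{name}.txt"
    path.write_text(text)
    return f"[{name}: {len(text):,} chars at {path}]\n{text[:keep_chars]}..."

def recall(name: str) -> str:
    """Read an offloaded artifact back in full only when it is needed again."""
    return (SCRATCH / f"{name}.txt").read_text()
```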
14 Continued Learning
Within-Run Learning
During the FARS-100 run, the system's output profile shifted at the population level over time, though this reflects pipeline dynamics rather than an explicit learning mechanism:
FARS-100 production trajectory (approximate):
─────────────────────────────────────────────────────────
Phase 1 (hours 0-50): Pipeline filling + initial projects
└── Slower output as pipeline stages warm up
└── Early papers may have lower quality (system "learning")
Phase 2 (hours 50-150): Steady-state production
└── ~10 papers/day consistent output
└── Quality stabilizes around mean 5.05
Phase 3 (hours 150-228): Pipeline draining + final projects
└── Ideation Agent may slow (research directions becoming exhausted)
└── Later papers may explore more peripheral topics
─────────────────────────────────────────────────────────
No Explicit Learning Mechanism
FARS does not implement explicit learning across papers:
| Learning Type | Implemented | Notes |
|---|---|---|
| Within-paper iteration | Yes | Experiment Agent debugs and refines |
| Across-paper knowledge transfer | No | Each project is independent |
| Technique library accumulation | No | No skill extraction mechanism |
| Hypothesis refinement from failures | No | Failed hypotheses not fed back to Ideation |
| Prompt/strategy adaptation | Not disclosed | May exist internally but not documented |
The Pipeline vs. Loop Distinction
This is a fundamental architectural difference from evolutionary systems:
Evolutionary systems (AIRA₂, AlphaEvolve, OpenEvolve):
Population → Select → Mutate → Evaluate → Update Population → Repeat
───── LOOP: each generation learns from previous ─────
FARS pipeline:
Direction → Ideate → Plan → Experiment → Write → Output
───── PIPELINE: each project is independent ─────
Evolutionary systems explicitly learn: the population improves over generations because selection pressure retains good solutions and discards bad ones. FARS's pipeline does not learn: each project starts fresh from a research direction, without incorporating lessons from previous projects.
This is both a strength and a limitation:
- Strength: no risk of "premature convergence", since each project explores independently
- Strength: easily parallelizable, with no cross-project dependencies
- Limitation: cannot build on its own discoveries
- Limitation: may repeat mistakes across projects
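The distinction can be reduced to a few lines of control flow; every function below is a placeholder standing in for the respective system's internals, not actual code from either.

```python
# Control-flow contrast: evolutionary loop vs. FARS-style pipeline.
# All functions are placeholders for the respective systems' internals.

def evolutionary_loop(population, generations, select, mutate, evaluate):
    """State persists across iterations: each generation builds on the last."""
    for _ in range(generations):
        parents = select(population)
        children = [mutate(p) for p in parents]
        population = evaluate(population + children)  # learning carried forward
    return population

def fars_pipeline(directions, ideate, plan, experiment, write):
    """No shared state: projects run front-to-back and could run in parallel,
    but nothing learned in one project reaches the next."""
    return [write(experiment(plan(ideate(d)))) for d in directions]
```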
Potential for Cross-Run Learning
If FARS were extended to operate over multiple runs (FARS-200, FARS-500, etc.), several learning mechanisms could be added:
- Hypothesis deduplication: Embedding previous hypotheses and filtering new ones that are too similar (see the sketch below)
- Technique library: Extracting reusable experimental techniques from successful papers
- Failure memory: Cataloguing why hypotheses failed to prevent re-exploration of dead ends
- Research direction refinement: Using paper quality scores to adjust which directions the Ideation Agent explores
- Writing template learning: Improving paper structure based on review scores
These extensions would transform FARS from a pipeline into a learning pipeline — maintaining the throughput advantages of pipeline architecture while adding the improvement dynamics of evolutionary systems.
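As an illustration of the first mechanism, a deduplication gate could embed every hypothesis ever generated and reject new ones that land too close to an earlier one; the embedding function, cosine threshold, and in-memory storage are assumptions.

```python
# Sketch of a hypothesis-deduplication gate for a hypothetical cross-run memory.
# The embedding function, similarity threshold, and storage are assumptions.
import numpy as np

class HypothesisMemory:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: text -> 1-D numpy vector
        self.threshold = threshold  # cosine similarity above which a hypothesis counts as seen
        self.vectors: list[np.ndarray] = []

    def is_novel(self, hypothesis: str) -> bool:
        """Accept a hypothesis only if it is not too similar to any earlier one."""
        v = self.embed(hypothesis)
        v = v / np.linalg.norm(v)
        if any(float(np.dot(v, u)) >= self.threshold for u in self.vectors):
            return False            # near-duplicate of a previously explored hypothesis
        self.vectors.append(v)
        return True
```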
FARS as a Benchmark for Research Productivity
The FARS-100 run itself serves as a baseline measurement for automated research productivity:
FARS-100 Baseline Metrics:
Throughput: ~10.5 papers/day
Quality: 5.05 ± ~0.8 (ICLR scale)
Cost: $1,040/paper
Conversion rate: 41% (hypothesis → paper)
Token efficiency: ~114M tokens/paper
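These figures follow directly from the run totals reported elsewhere in this analysis (228h 28m 33s of operation, 244 hypotheses, 100 papers, and the 11.4 billion tokens cited in the media coverage); a quick recomputation:

```python
# Recomputing the baseline metrics from the reported run totals.
hours = 228 + 28 / 60 + 33 / 3600   # ~228.48 hours of continuous operation
papers, hypotheses = 100, 244
total_tokens = 11.4e9               # total token consumption cited in media coverage

print(round(papers / (hours / 24), 1))      # ~10.5 papers/day
print(round(papers / hypotheses, 2))        # ~0.41 hypothesis -> paper conversion
print(round(total_tokens / papers / 1e6))   # ~114M tokens/paper
```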
Future FARS versions can be measured against these baselines:
FARS v2: Higher quality? Lower cost? Better conversion?
FARS v3: Cross-project learning? Adaptive hypothesis generation?
15 Applications
Direct Applications
| Application | Description | Status |
|---|---|---|
| Automated AI research | Continuous, unattended production of research papers | Demonstrated (FARS-100) |
| Negative result publishing | Systematic documentation of what doesn't work | Demonstrated (multiple papers) |
| Research hypothesis exploration | Rapid exploration of a research space | Demonstrated (244 hypotheses) |
| Public research transparency | Live, observable research process | Demonstrated (dashboard + GitLab) |
| Research cost reduction | 50-200x cheaper than human researchers per paper | Demonstrated (at $1,040/paper) |
Broader Implications
1. The End of Publication Bias
FARS's systematic reporting of negative results addresses one of the most persistent structural problems in science: publication bias. If automated systems can produce and publish negative results at near-zero marginal cost, the scientific record becomes more complete. The value is not in any individual negative result paper, but in the aggregate: a comprehensive map of what works and what doesn't in a research area.
Human research publication funnel:
100 experiments → 20 positive results → 15 submitted → 5 published
(~95% of experimental knowledge is LOST)
FARS publication funnel:
244 hypotheses → 100 papers (positive + negative) → manual review → arXiv
(Knowledge preservation rate: ~41% vs. ~5% for humans)
2. Research as Infrastructure
FARS represents a paradigm shift from research-as-craft to research-as-infrastructure. In the craft model, each paper is a unique artifact produced by skilled artisans (researchers). In the infrastructure model, papers are outputs of a production system that can be scaled, optimized, and operated continuously.
This shift has parallels in other domains:
- Software testing: manual QA → automated CI/CD
- Manufacturing: artisan production → assembly line
- Content creation: human writing → automated generation + human curation
3. The Minimal Composable Knowledge Unit
FARS's short, single-contribution papers introduce a new unit of scientific knowledge — smaller than a traditional paper but larger than a tweet or blog post. This format could influence human research conventions:
Traditional paper: ~10-20 pages, multiple contributions
├── Hard to review (many things to evaluate)
├── Hard to cite precisely (which contribution?)
└── Incentivizes bundling (minimum publishable unit)
FARS paper: ~4-8 pages, single contribution
├── Easy to review (one thing to evaluate)
├── Easy to cite precisely (one clear finding)
└── Incentivizes decomposition (one paper = one insight)
This is analogous to the microservices revolution in software: smaller, focused, composable units replace monolithic artifacts. The idea is not new (workshop papers, extended abstracts, and short papers exist in human venues), but FARS operationalizes it at scale.
4. The AI-for-AI Research Loop
FARS currently operates in the AI-for-AI domain: AI systems researching AI systems. This creates a recursive improvement dynamic:
FARS produces papers on AI topics
├── Some papers improve LLM training/fine-tuning
│ └── Better LLMs → Better FARS agents → Better papers
├── Some papers improve agent design
│ └── Better agents → Better FARS architecture → Better papers
└── Some papers improve evaluation methods
└── Better evaluation → Better quality signal → Better papers
If FARS's outputs eventually inform its own improvement (directly or indirectly through the broader research ecosystem), the system creates a positive feedback loop in AI capability advancement.
Limitations and Scope
| Limitation | Impact | Potential Mitigation |
|---|---|---|
| AI-only domain | Cannot research biology, physics, chemistry, etc. | Integrate with lab automation, simulation tools |
| No physical experiments | Limited to computational research | Robotic lab integration (future) |
| Compute-bounded | Cannot run large-scale pretraining experiments | Larger clusters, cloud bursting |
| No human involvement | Cannot do human evaluation, annotation, user studies | Mechanical Turk integration, controlled human access |
| Quality variance | Some papers are incremental or flawed | Better quality gates, adaptive filtering |
| No cross-project learning | Each project starts fresh | Implement knowledge base, skill library |
| Proprietary system | Cannot be reproduced or independently verified | Open-source release (unlikely for competitive reasons) |
| Single production format | Only produces short papers | Extend to surveys, tutorials, replication studies |
Comparison to Other Autoresearch Paradigms
| Paradigm | Representative | Strength | Weakness |
|---|---|---|---|
| Single-agent minimal | Karpathy autoresearch | Simplicity, accessibility | Limited scope, no writing |
| Single-agent maximal | AI Scientist v2 | End-to-end paper production | Quality ceiling, no real experiments |
| Evolutionary search | AIRA₂, AlphaEvolve | Progressive improvement, scaling | No paper writing, competition-focused |
| Multi-agent pipeline | FARS | Throughput, continuous operation, negative results | No learning, proprietary, AI-only domain |
| Frontier-pushing | DeepScientist | Genuine scientific advances | Very expensive (20K+ GPU-hours), slow |
| Knowledge-focused | CycleResearcher | Iterative quality improvement | Limited scale |
FARS occupies a unique niche: industrial-scale continuous production of research contributions. It trades depth for breadth, learning for throughput, and maximal paper quality for comprehensive coverage of a research space.
Connections to OmniEvolve
FARS's architecture maps to several OmniEvolve design patterns, with important differences:
| FARS Component | OmniEvolve Equivalent | Key Difference |
|---|---|---|
| Shared filesystem | omnievolve/storage/ artifact storage | FARS uses the filesystem as the sole coordination layer; OmniEvolve uses DB + filesystem |
| Pipeline stages | omnievolve/orchestrator/ experiment lifecycle | FARS is a linear pipeline; OmniEvolve supports evolutionary loops |
| GPU cluster tools | omnievolve/evaluation/ sandbox execution | FARS encapsulates 160 GPUs; OmniEvolve abstracts arbitrary compute |
| Ideation Agent | omnievolve/knowledge/ + omnievolve/search/ | FARS separates ideation from search; OmniEvolve integrates them |
| Quality review | omnievolve/evaluation/ cascade evaluator | FARS uses human review; OmniEvolve uses an automated cascade |
| Negative results | No direct equivalent | OmniEvolve discards failures; FARS publishes them |
Architectural lesson for OmniEvolve: FARS's shared filesystem pattern demonstrates that simple coordination mechanisms can support industrial-scale multi-agent systems. OmniEvolve's more complex event bus and database coordination may be over-engineered for some use cases. The FARS pattern could be offered as a lightweight alternative configuration.
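As a generic illustration of how far file-only coordination can go, a work queue needs nothing beyond atomic renames; this sketch is neither FARS's nor OmniEvolve's implementation, and the directory names are assumptions.

```python
# Generic sketch of file-based coordination: a claim-by-atomic-rename work queue.
# Neither FARS's nor OmniEvolve's implementation; directory names are assumptions.
import os
from pathlib import Path

QUEUE = Path("/shared/queue")       # one file per pending task
CLAIMED = Path("/shared/claimed")   # tasks currently being worked on

def claim_next(worker_id: str) -> Path | None:
    """Atomically claim a pending task: os.rename is atomic on a POSIX filesystem,
    so two workers can never end up holding the same task."""
    CLAIMED.mkdir(parents=True, exist_ok=True)
    for task in sorted(QUEUE.glob("*.task")):
        target = CLAIMED / f"{worker_id}__{task.name}"
        try:
            os.rename(task, target)  # fails if another worker already moved it
            return target
        except OSError:              # includes FileNotFoundError: lost the race
            continue
    return None
```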
Philosophical lesson for OmniEvolve: FARS's first-principles critique of academic conventions applies equally to how evolutionary algorithm research is evaluated. If the goal is to expand the frontier of optimization knowledge, conforming to academic paper formats may be a needless constraint. OmniEvolve's reporting module could support FARS-style minimal contribution reports alongside traditional paper formats.
References
- Analemma. (2026). "Introducing FARS: Fully Automated Research System." Blog post, February 11, 2026. https://analemma.ai/blog/introducing-fars/
- FARS Live Dashboard. https://analemma.ai/fars
- FARS GitLab. https://gitlab.com/fars-a
- 机器之心 (Machine Intelligence). (2026). "228 hours of non-stop work to produce 100 papers, burning through 11.4 billion tokens: FARS has gone crazy." February 25, 2026.
- Lu, C., et al. (2024). "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." Sakana AI.
- Weng, Y., et al. (2024). "CycleResearcher: Improving Automated Research via Automated Review."
- IntologyAI. (2025). "Zochi: Artificial Scientist." https://github.com/IntologyAI/Zochi
- Lu, C., et al. (2025). "The AI Scientist v2." Sakana AI.
- Li, Y., et al. (2025). "AI-Researcher." NeurIPS 2025 Spotlight.
- Chen, Y., et al. (2026). "DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively." Westlake University.
- Stanford University. (2026). "Agentic Reviewer." https://paperreview.ai
- Sun, T. Homepage. http://txsun1997.github.io/
- Karpathy, A. (2026). "autoresearch." https://github.com/karpathy/autoresearch
This analysis was compiled from the FARS blog post, live dashboard observations, published paper examples, independent media reporting (Machine Intelligence / 机器之心, 36Kr), and cross-referencing with prior autoresearch systems. FARS's proprietary nature limits architectural analysis to publicly observable behavior and reported metrics. The system represents a significant milestone in the industrialization of scientific research, demonstrating that continuous, unattended, large-scale research production is technically feasible — though the question of whether this constitutes genuine knowledge expansion or sophisticated pattern-matching remains open.