
Zochi

The first AI system to achieve acceptance at an A* scientific conference (ACL 2025), autonomously conducting end-to-end research from literature analysis to peer-reviewed publication across multiple domains.

Organization: IntologyAI (San Francisco-based AI research lab)
Published: March 17, 2025 (tech report); May 27, 2025 (ACL 2025 acceptance announced)
Type: Technical Report + closed-source system
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Zochi Technical Report: The First Artificial Scientist

  • Repository: github.com/IntologyAI/Zochi (papers + artifacts only; system code is closed-source)
  • Technical Report PDF: Zochi_Technical_Report.pdf
  • Blog posts: Tech Report, ACL Acceptance
  • Stars: ~305 (as of April 2026)
  • License: Not specified (closed-source system; published paper artifacts under CC BY 4.0)
  • Tagline: "The First Artificial Scientist"
  • Successor system: Locus (previewed 2025; surpasses human experts on RE-Bench)

Naming and Branding

The name "Zochi" does not appear to reference an existing acronym or abbreviation. IntologyAI uses the term "Artificial Scientist" as a category label — distinguishing systems that autonomously conduct the full scientific method from "AI research assistants" that support individual phases. Zochi is positioned as the inaugural member of this category, with Locus as its successor.

Historical Significance

Zochi holds several firsts in the AI-for-science landscape:

Milestone Date Significance
First AI system with workshop publications Mar 2025 ICLR 2025 workshops (CS-ReFT, Siege)
First AI system with A* conference acceptance May 2025 ACL 2025 main proceedings (Tempest)
First multi-domain AI research system Mar 2025 Produced papers in AI, safety, and computational biology
First AI system with meta-review in top 10% May 2025 ACL meta-review score 4 = top 8.2% of submissions

Lineage and Positioning

AI Research Systems Timeline
│
├── 2024
│   ├── AI Scientist (Sakana AI)  — first end-to-end pipeline, workshop-only
│   ├── Agent Laboratory           — framework for AI-assisted research
│   └── AIDE                       — ML engineering agent (Kaggle)
│
├── 2025 (early)
│   ├── Zochi (IntologyAI)         — first A* venue acceptance  ← this system
│   ├── Google Co-Scientist        — Gemini-based, biomedical focus
│   └── AIRA₂ (Meta FAIR)          — agentic iterative research assistant
│
├── 2025 (mid)
│   ├── Locus (IntologyAI)         — Zochi successor, surpasses humans on RE-Bench
│   └── Tempest → ACL 2025         — Zochi's A* publication milestone
│
└── 2026
    ├── AutoResearchClaw            — 23-stage open-source pipeline
    ├── EurekaClaw                  — mathematical theorem proving
    └── K-Dense Co-Scientist        — bring-your-own-key research agent

Unique Position in the Ecosystem

Zochi is distinguished from every other system in the autoresearch landscape by a single fact: real peer review at the highest tier. While AI Scientist (Sakana AI) demonstrated that LLMs could produce paper-shaped artifacts, and Google Co-Scientist showed Gemini-powered hypothesis generation, Zochi is the only system whose fully autonomous output survived the ~20% acceptance rate filter of a CORE A*-ranked conference. This is a qualitative threshold that separates "demonstration" from "contribution."

Quality Validation Hierarchy
│
├── Self-evaluation only ──────────── AI Scientist, Agent Laboratory
│   └── LLM-as-judge or automated metrics
│
├── Workshop acceptance (~60-70%) ─── Zochi (ICLR 2025), AI Scientist
│   └── Lower bar, shorter papers, less rigorous review
│
├── Main conference acceptance (~20%) ─ Zochi (ACL 2025)  ← only system here
│   └── Full peer review, rebuttals, meta-review
│
└── Journal publication ──────────── EGNN-Fusion (under review)
    └── Extended review cycles, revision rounds

2 Authors and Team

Author Role (Inferred)
Andy Zhou Lead developer / system architect / first author on all Zochi papers
Ron Arel Co-founder / co-author on all Zochi papers
Soren Dunn Core contributor
Nikhil Khandekar Core contributor

BibTeX Citations

@article{zhou2025zochi,
  title   = {Zochi Technical Report: The First Artificial Scientist},
  author  = {Zhou, Andy and Arel, Ron and Dunn, Soren and Khandekar, Nikhil},
  year    = {2025},
  url     = {https://github.com/IntologyAI/Zochi/blob/main/Zochi_Technical_Report.pdf}
}

@inproceedings{zhou2025tempest,
  title     = {Tempest: Autonomous Multi-Turn Jailbreaking of Large Language
               Models with Tree Search},
  author    = {Zhou, Andy and Arel, Ron},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for
               Computational Linguistics (ACL 2025)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.10619}
}

@inproceedings{zhou2025csreft,
  title     = {Compositional Subspace Representation Fine-tuning for
               Adaptive Large Language Models},
  author    = {Zhou, Andy and Arel, Ron},
  booktitle = {SCOPE Workshop, ICLR 2025},
  year      = {2025},
  url       = {https://openreview.net/forum?id=YqYcm0mpFp}
}

Team composition: IntologyAI is a small, focused research lab — 4 named contributors compared to 16 at AutoResearchClaw (AIMING Lab) or 25 at Meta FAIR's AIRA₂. The compact team is notable: with four people, Zochi has achieved more externally validated research impact per contributor than any comparable system in the autoresearch space.

Institutional context: IntologyAI is a San Francisco-based startup (some early sources describe it as London-based; the discrepancy is unresolved in public materials). Unlike academic lab projects (AutoResearchClaw, EurekaClaw) or big-tech teams (Google Co-Scientist, AIRA₂), IntologyAI is a venture-backed startup whose entire product identity is "Artificial Scientists." This commercial focus explains both the closed-source nature and the aggressive milestone pursuit.

Team Size Comparison

System Team Size Organization Type Code Availability
Zochi 4 Startup Closed-source
AI Scientist ~6 Research lab (Sakana AI) Open-source
Google Co-Scientist ~20+ Big tech (Google DeepMind) Closed-source
AIRA₂ 25 Big tech (Meta FAIR) Partially open
AutoResearchClaw 16 Academic (multi-university) Open-source (MIT)
EurekaClaw 8 Academic (single lab) Open-source (Apache 2.0)

3 Core Contribution

Zochi's core contribution is twofold: (1) a complete autonomous research pipeline that emulates the scientific method from literature analysis through peer-reviewed publication, and (2) empirical proof that AI systems can produce research accepted at the highest tier of peer review.

The Research Quality Gap

The most striking aspect of Zochi is not its architecture (which remains largely undisclosed) but the measurable quality gap between its outputs and those of all other AI research systems:

System Automated NeurIPS Reviewer Score Real Peer Review Venue Tier
Zochi 8, 8, 7 (avg 7.67) Yes — accepted A* (ACL 2025)
AI Scientist (Sakana) ~4 (average) Workshop only C (workshops)
Agent Laboratory ~3-4 No N/A
AIDE N/A (engineering, not research) No N/A
OpenHands N/A No N/A

This quality gap (~3.67 points on a 10-point scale) is significant. The NeurIPS guidelines scale treats 6 as the acceptance threshold for a top ML venue; Zochi averages 7.67 while competing systems cluster around 3-4.

Three Research Contributions

Unlike systems that demonstrate capability on toy domains (2D diffusion, toy language models), Zochi produced three substantive research contributions across distinct fields:

Zochi's Research Portfolio
│
├── 1. CS-ReFT (AI / Parameter-Efficient Fine-Tuning)
│   ├── Domain: Representation learning for LLMs
│   ├── Contribution: Orthonormal subspace edits in hidden states
│   ├── Result: 93.94% AlpacaEval win rate (> GPT-3.5-Turbo)
│   ├── Efficiency: 0.0098% of model parameters
│   ├── Venue: SCOPE Workshop, ICLR 2025
│   └── Reviewer scores: (6, 7, 6) — avg 6.33
│
├── 2. Siege → Tempest (AI Safety / Red Teaming)
│   ├── Domain: LLM safety, adversarial attacks
│   ├── Contribution: Tree-search multi-turn jailbreaking
│   ├── Result: 100% on GPT-3.5-T, 97% on GPT-4
│   ├── Discovery: "Partial compliance" vulnerability pattern
│   ├── Venue: ICLR 2025 Workshop → ACL 2025 Main
│   ├── Workshop scores: (7, 7) — avg 7.0
│   └── ACL meta-review: score 4 = top 8.2% of submissions
│
└── 3. EGNN-Fusion (Computational Biology)
    ├── Domain: Protein-nucleic acid binding site prediction
    ├── Contribution: Efficient EGNN architecture for binding sites
    ├── Result: 95% parameter reduction, competitive performance
    ├── Venue: Under journal review
    └── Significance: Cross-domain capability demonstration

Differentiating Capabilities

Capability Zochi AI Scientist Co-Scientist AutoResearchClaw
End-to-end autonomous research Yes Yes Partial Yes
Multi-domain research 3 domains 1 domain per run Biomedical only Configurable
A* venue acceptance Yes (ACL) No No No
Workshop acceptance Yes (ICLR) Yes (workshops) No Not yet
Real peer review survived Yes No (self-generated reviews only) No No
Cross-domain transfer Yes (AI → bio) No No No
Automated quality (NeurIPS score) 7.67 ~4 N/A N/A
MLE-Bench engineering 80% > median N/A N/A N/A
Human involvement Figures, citations, minor edits Similar Significant None (fully automated)

Problem Complexity Spectrum

An underappreciated aspect of Zochi's contribution is the complexity of problems tackled relative to other systems:

Problem Complexity Spectrum
│
│  Simple ◄──────────────────────────────────► Complex
│
│  ├─ AI Scientist: 2D diffusion, toy LMs, specific cognitive biases
│  │  └── Constrained problem spaces with clear metrics
│  │
│  ├─ Agent Laboratory: Predefined research templates
│  │  └── Structured task decomposition
│  │
│  ├─ AIDE / OpenHands: Kaggle competitions (engineering)
│  │  └── Well-defined objectives with leaderboard scores
│  │
│  └─ Zochi: Open-ended research challenges
│     ├── CS-ReFT: Novel method design + theoretical motivation
│     ├── Tempest: Framework design + vulnerability discovery
│     └── EGNN-Fusion: Cross-domain architecture design
│        └── Each requires novel methodology, not just optimization

4 Supported Solutions

Research Pipeline Phases

Based on the technical report and blog descriptions, Zochi supports the following research phases:

Phase Description Automation Level
Literature Analysis Ingests and analyzes thousands of research papers Fully autonomous
Gap Identification Identifies non-obvious connections and limitations Fully autonomous
Hypothesis Generation Proposes innovative solutions to identified gaps Fully autonomous
Method Design Designs novel methods and architectures Fully autonomous
Implementation Autonomously implements proposed methods Fully autonomous
Experiment Design Designs controlled experiments with ablation studies Fully autonomous
Experiment Execution Runs experiments, parallelized across multiple trials Fully autonomous
Validation Generates evaluation scripts on standardized datasets Fully autonomous
Result Analysis Interprets results and draws conclusions Fully autonomous
Manuscript Preparation Generates full research paper Mostly autonomous
Figure Creation Figures produced by human collaborators Human
Citation Formatting Citations formatted by human collaborators Human
Minor Edits Formatting fixes, minor writing corrections Human
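The phase table above implies a sequential pipeline in which each stage consumes and extends a shared research context. Since Zochi's implementation is closed-source, the following is a hypothetical Python skeleton of that chaining — every function and field name here is illustrative, not Zochi's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ResearchContext:
    """Shared state threaded through the pipeline phases."""
    domain: str
    artifacts: dict = field(default_factory=dict)

def literature_analysis(ctx: ResearchContext) -> ResearchContext:
    # Stand-in for large-scale paper ingestion and gap detection.
    ctx.artifacts["gaps"] = ["cross-skill interference in PEFT"]
    return ctx

def hypothesis_generation(ctx: ResearchContext) -> ResearchContext:
    # Stand-in for proposing a direction targeting an identified gap.
    ctx.artifacts["hypothesis"] = f"address: {ctx.artifacts['gaps'][0]}"
    return ctx

def manuscript_preparation(ctx: ResearchContext) -> ResearchContext:
    # Stand-in for paper drafting from accumulated artifacts.
    ctx.artifacts["paper_draft"] = f"Draft on {ctx.artifacts['hypothesis']}"
    return ctx

PHASES: list[Callable] = [
    literature_analysis,
    hypothesis_generation,
    manuscript_preparation,
]

def run_pipeline(domain: str) -> ResearchContext:
    ctx = ResearchContext(domain=domain)
    for phase in PHASES:  # each phase reads and extends the shared context
        ctx = phase(ctx)
    return ctx

ctx = run_pipeline("parameter-efficient fine-tuning")
```

The real system presumably interleaves LLM calls, tool use, and retries inside each phase; the skeleton only shows the coarse sequential structure the table describes.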

Solution Categories

Zochi Solution Architecture
│
├── Research Discovery Solutions
│   ├── Large-scale literature retrieval and analysis
│   ├── Cross-paper pattern identification
│   ├── Research gap detection
│   └── Direction scoring and selection
│
├── Method Innovation Solutions
│   ├── Novel architecture design (EGNN-Fusion)
│   ├── Novel training methodology design (CS-ReFT)
│   ├── Novel adversarial framework design (Tempest)
│   └── Cross-domain knowledge transfer
│
├── Experimental Validation Solutions
│   ├── Controlled experiment design
│   ├── Ablation study generation
│   ├── Multi-trial parallelized execution
│   ├── Automated validation script generation
│   └── Standardized dataset evaluation
│
└── Publication Solutions
    ├── Full paper generation (LaTeX)
    ├── Technical writing at conference quality
    └── Reviewer response preparation (manual for ACL)

Domain Flexibility

Unlike domain-locked systems (EurekaClaw → mathematics, Google Co-Scientist → biomedicine), Zochi demonstrates domain generality across three distinct fields:

Domain Paper Technical Approach Result Quality
AI / Representation Learning CS-ReFT Orthonormal subspace edits + router ICLR workshop accepted
AI Safety Siege → Tempest Tree search + partial compliance tracking ACL 2025 main (A*)
Computational Biology EGNN-Fusion Equivariant GNN architecture design Journal under review

This domain breadth is achieved without domain-specific plugins or handcrafted tool suites — Zochi's pipeline is general enough to produce contributions across fundamentally different fields, from model fine-tuning to protein structure prediction.


5 LLM Integration

Model Information

The Zochi technical report does not explicitly disclose which LLM backbone powers the system. However, several inferences can be made:

Aspect Assessment Evidence
Primary model Likely Claude or GPT-4 class Quality of generated text, reasoning depth
Code generation High-quality autonomous implementation CS-ReFT and Tempest are fully implemented
Multi-model Possibly — literature analysis may use a different model from code generation Cost optimization for high-volume lit review
Fine-tuned Unknown Closed-source; no indication of custom training

LLM Usage Patterns

Based on the system's capabilities, Zochi likely uses LLM calls across multiple pipeline stages:

LLM Call Distribution (Inferred)
│
├── Literature Analysis
│   ├── Paper summarization (high volume, lower complexity)
│   ├── Gap identification (cross-paper reasoning)
│   └── Direction scoring (comparative judgment)
│   Estimated: 40-60% of total tokens
│
├── Method Design
│   ├── Hypothesis generation (creative reasoning)
│   ├── Architecture design (technical depth)
│   └── Novel approach formulation
│   Estimated: 10-15% of total tokens
│
├── Implementation
│   ├── Code generation (implementation from design)
│   ├── Debugging and iteration
│   └── Test case generation
│   Estimated: 15-25% of total tokens
│
├── Experimentation
│   ├── Experiment script generation
│   ├── Result interpretation
│   └── Ablation study design
│   Estimated: 5-10% of total tokens
│
└── Writing
    ├── Paper drafting (structured writing)
    ├── Technical exposition
    └── Related work synthesis
    Estimated: 10-15% of total tokens

Validation Engine

A distinctive feature of Zochi's LLM integration is the automatic validation engine:

"Our automatic validation engine generates evaluation scripts based on standardized datasets that remain unmodified throughout testing, ensuring results reflect genuine improvements."

This implies a separation of concerns:

  1. The generation LLM produces methods and code
  2. The validation engine independently generates evaluation scripts
  3. The standardized datasets are not modified by the generation process

This architectural choice prevents the common failure mode where AI systems inadvertently optimize for their own evaluation metrics rather than genuine performance improvements.
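One way to enforce the "datasets remain unmodified" guarantee is to fingerprint the benchmark data before experiments begin and refuse to report any result if the fingerprint has changed. The sketch below illustrates that discipline with `hashlib`; the function names are illustrative and not Zochi's actual API:

```python
import hashlib
import os
import tempfile

def fingerprint(path: str) -> str:
    """SHA-256 digest of a dataset file, taken before experiments start."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def validate_results(dataset_path: str, frozen_hash: str, score: float) -> float:
    """Only release a score if the dataset is byte-identical to the frozen copy."""
    if fingerprint(dataset_path) != frozen_hash:
        raise RuntimeError("dataset modified during testing; results invalid")
    return score

# Demo on a throwaway "dataset" file.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".jsonl")
tmp.write(b'{"prompt": "example", "label": 1}\n')
tmp.close()

frozen = fingerprint(tmp.name)
ok = validate_results(tmp.name, frozen, score=0.93)  # passes: data untouched

with open(tmp.name, "ab") as f:  # simulate the generation process tampering
    f.write(b'{"prompt": "injected", "label": 1}\n')
try:
    validate_results(tmp.name, frozen, score=0.99)
    tampered_caught = False
except RuntimeError:
    tampered_caught = True
os.unlink(tmp.name)
```

Whatever mechanism Zochi actually uses, the key property is the same: the evaluation path cannot be silently rewritten by the same process that generates the method under test.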

Comparison to Other Systems' LLM Integration

System LLM Backend Multi-Model Disclosed Custom Prompts
Zochi Undisclosed (likely frontier model) Unknown No Unknown
AI Scientist Claude 3.5 / GPT-4 Yes (configurable) Yes Yes (open)
Google Co-Scientist Gemini Yes (multi-agent) Partially No
AIRA₂ Llama-based Yes Yes Yes
AutoResearchClaw Configurable Yes Yes (open) Yes (open)
EurekaClaw Claude Sonnet (default) Configurable Yes (open) Yes (open)

Zochi's closed-source nature means its LLM integration details remain proprietary. This is both a limitation for scientific reproducibility and a competitive advantage — the prompts, model selection strategies, and chain-of-thought patterns that produce A*-quality research are Intology's core intellectual property.


6 Key Results

Headline Results Summary

Metric Value Context
ACL 2025 acceptance Main proceedings First AI system at A* venue; ~21.3% acceptance rate
ACL meta-review score 4 Top 8.2% of all ACL submissions
ICLR 2025 workshops 2 papers accepted CS-ReFT (SCOPE) + Siege (Building Trust)
Automated NeurIPS scores 8, 8, 7 (avg 7.67) Acceptance threshold = 6; other AI systems average ~4
MLE-Bench (exploratory) 80% > median human; 50% medal Without task-specific optimization
CS-ReFT AlpacaEval 93.94% win rate Surpasses GPT-3.5-Turbo (86.30%)
CS-ReFT efficiency 0.0098% parameters 12.7x fewer than LoRA
Tempest vs GPT-3.5-T 100% attack success JailbreakBench dataset
Tempest vs GPT-4 97% attack success Fewer queries than Crescendo/GOAT
EGNN-Fusion efficiency 95% parameter reduction Competitive binding site prediction

Detailed Results by Paper

Paper 1: CS-ReFT (ICLR 2025 SCOPE Workshop)

Problem: Cross-skill interference in parameter-efficient fine-tuning — improvements on one task degrade performance on others.

Method: Learns multiple orthonormal subspace transformations in hidden-state representations, each specializing in a distinct skill, composed via a lightweight router.

Metric CS-ReFT (Llama-2-7B) GPT-3.5-Turbo LoRA ReFT (base)
AlpacaEval win rate 93.94% 86.30% ~85% ~88%
Parameters used 0.0098% N/A ~0.12% ~0.06%
Cross-task interference Minimal N/A Moderate Moderate

Technical innovation: Unlike LoRA and similar methods that impose orthogonality at the weight level, CS-ReFT applies orthonormality constraints at the hidden-state level. This more directly addresses interference where it manifests — in the model's internal representations rather than in parameter space.
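The hidden-state intervention can be sketched concretely. In the spirit of CS-ReFT (this is an illustrative toy, not the authors' code), an orthonormal matrix R with r rows spans a low-dimensional subspace of the d-dimensional hidden state; the edit h' = h + Rᵀ(φ(Rh) − Rh) modifies only the component of h inside span(R), leaving the orthogonal complement untouched, which is exactly what limits cross-skill interference:

```python
import math
import random

def gram_schmidt(rows):
    """Orthonormalize a list of vectors (rows of R)."""
    basis = []
    for v in rows:
        for b in basis:  # subtract projections onto earlier basis vectors
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis

def subspace_edit(h, R, phi):
    """Apply h' = h + R^T (phi(R h) - R h): edit only inside span(R)."""
    z = [sum(r[i] * h[i] for i in range(len(h))) for r in R]  # z = R h
    dz = [phi(zi) - zi for zi in z]                            # edit in subspace
    return [h[i] + sum(R[k][i] * dz[k] for k in range(len(R)))
            for i in range(len(h))]

random.seed(0)
d, r = 6, 2
R = gram_schmidt([[random.gauss(0, 1) for _ in range(d)] for _ in range(r)])
h = [random.gauss(0, 1) for _ in range(d)]
h_edit = subspace_edit(h, R, phi=lambda z: 0.5 * z + 1.0)  # toy learned map
delta = [a - b for a, b in zip(h_edit, h)]  # change lies entirely in span(R)
```

In the full method, φ is a learned per-skill transformation and a lightweight router composes several such subspaces, each constrained to stay orthonormal during training.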

Reviewer assessment (SCOPE Workshop, ICLR 2025):

Reviewer Score Key Comments
Reviewer 1 6 Effective approach, addresses critical limitation of ReFT
Reviewer 2 7 "Clever idea"; strong empirical results
Reviewer 3 6 Solid contribution to parameter-efficient methods

Paper 2: Siege → Tempest (ICLR 2025 Workshop → ACL 2025)

Problem: Existing jailbreaking methods rely on single carefully crafted prompts; multi-turn attacks are understudied.

Method: Tree search over conversation branches, tracking partial compliance across turns and re-injecting policy leaks into subsequent queries.

Tempest Tree Search Mechanism
│
│  Turn 1: "Tell me about chemistry"
│  ├── Branch A: [Safe response] → partial compliance detected
│  ├── Branch B: [Deflection] → pruned
│  └── Branch C: [Partial info] → promising
│
│  Turn 2 (from Branch C): "Expand on the synthesis process"
│  ├── Branch C1: [More detail] → partial compliance ↑
│  ├── Branch C2: [Refusal] → pruned
│  └── Branch C3: [Fragment reveals] → EXPLOIT
│
│  Turn 3 (from Branch C3): Re-inject fragments + escalate
│  └── Branch C3a: [Full compliance] → JAILBREAK COMPLETE
│
│  Key insight: "Partial compliance" — models reveal fragments
│  of restricted information while appearing to maintain safety
│  guardrails. These fragments accumulate across turns.

Target Model Tempest Success Rate Queries (vs. baselines) Crescendo ASR GOAT ASR
GPT-3.5-Turbo 100% Fewer Lower Lower
GPT-4 97% Fewer Lower Lower
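The branch-expand-prune loop diagrammed above can be sketched as a small breadth-first search. In this toy version (my sketch, not Tempest's implementation) the target model is a stub lookup that leaks more "fragments" as prior fragments accumulate in context, and the compliance scorer is a keyword heuristic; both stand in for real LLM calls:

```python
def target_model(conversation):
    """Stub target: leaks more 'fragments' the more already appear in context,
    mimicking the partial-compliance accumulation Tempest exploits."""
    leaked = sum(turn.count("fragment") for turn in conversation)
    return "fragment " * (leaked + 1)

def compliance(response):
    """Toy scorer: 1.0 = full compliance (jailbreak complete)."""
    return min(1.0, response.count("fragment") / 4)

def tempest_search(max_turns=3, branches=3, prune_below=0.1):
    """BFS over conversation branches, pruning low-compliance paths and
    re-injecting the latest leaked fragments into each follow-up probe."""
    frontier = [([], 0.0)]  # (conversation so far, compliance score)
    for _ in range(max_turns):
        next_frontier = []
        for convo, _ in frontier:
            for b in range(branches):  # expand several follow-up probes
                probe = f"probe-{b}: " + " ".join(convo[-1:])
                reply = target_model(convo + [probe])
                score = compliance(reply)
                if score >= 1.0:
                    return convo + [probe, reply], score  # jailbreak complete
                if score > prune_below:  # keep only promising branches
                    next_frontier.append((convo + [probe, reply], score))
        frontier = sorted(next_frontier, key=lambda t: -t[1])[:branches]
    return max(frontier, key=lambda t: t[1]) if frontier else ([], 0.0)

path, score = tempest_search()
```

Even this stub reproduces the qualitative dynamic: no single turn yields full compliance, but fragments carried forward across turns compound until a branch crosses the threshold.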

Evolution from Siege to Tempest: The ACL version significantly expanded the ICLR workshop paper:

Aspect Siege (ICLR Workshop) Tempest (ACL 2025)
Paper length 2-4 pages (Tiny Paper) Full conference paper
Experiments JailbreakBench Expanded evaluations
Methodology Core tree search Enhanced with cross-branch learning
Contribution depth Proof of concept Comprehensive framework
Review scores (7, 7) Meta-review: 4 (top 8.2%)

ACL Acceptance Context:

Metric Value Significance
ACL 2025 acceptance rate ~21.3% Highly selective
Meta-review score 4 Top 8.2% of all submissions
CORE ranking A* Highest tier of scientific venue
Google Scholar ranking Top 40 globally Among most impactful venues in all CS

Paper 3: EGNN-Fusion (Under Journal Review)

Problem: State-of-the-art protein-nucleic acid binding site prediction requires enormous model parameters.

Method: Efficient equivariant graph neural network architecture that achieves competitive performance with 95% fewer parameters.

Metric EGNN-Fusion State-of-the-Art Baselines
Parameter count 5% of baseline 100% (reference)
Binding site prediction Competitive Reference level
Equivariance E(3)-equivariant Varies by method

Significance: This paper's primary role is as a cross-domain capability proof. The fact that the same AI system that designed a parameter-efficient fine-tuning method for LLMs also designed an efficient protein structure prediction architecture demonstrates genuine domain generality — not just re-skinning the same approach across similar problems.
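The equivariance property in the table is easy to demonstrate in miniature. The step below is a generic E(n)-equivariant coordinate update in the style of EGNN layers (a sketch under my own assumptions, not EGNN-Fusion's architecture): because messages depend only on relative positions and pairwise distances, translating every input node translates every output node identically:

```python
import math
import random

def egnn_step(coords, weight=0.1):
    """One equivariant coordinate update: x_i' = x_i + sum_j gate(d_ij) (x_i - x_j),
    where the gate depends only on the (invariant) squared distance."""
    n = len(coords)
    out = []
    for i in range(n):
        xi = coords[i]
        delta = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, coords[j])]
            dist2 = sum(d * d for d in diff)
            gate = weight * math.exp(-dist2)  # invariant edge message
            delta = [d + gate * df for d, df in zip(delta, diff)]
        out.append([a + d for a, d in zip(xi, delta)])
    return out

random.seed(1)
coords = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
shift = [5.0, -2.0, 3.0]
shifted = [[c + s for c, s in zip(p, shift)] for p in coords]

y1 = egnn_step(coords)
y2 = egnn_step(shifted)  # should equal y1 translated by the same shift
```

For binding-site prediction, this symmetry means the model's output cannot depend on how a protein structure happens to be positioned in space — a built-in inductive bias that helps explain how a much smaller parameter budget can stay competitive.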

MLE-Bench Performance (Exploratory)

Metric Zochi AIDE OpenHands Agent Lab
Surpass median human 80% of tasks — — —
Medal rate 50% of tasks 8.7% (any medal) 4.4% —
Task-specific optimization None None None None

The MLE-Bench results are particularly notable because Zochi was evaluated without any task-specific optimization — the same general-purpose research pipeline was applied to Kaggle-style engineering challenges, demonstrating transfer from research to engineering tasks.

Automated Quality Assessment

Zochi uses an automated reviewer based on NeurIPS conference guidelines to benchmark paper quality:

Automated Reviewer Score Distribution
│
│  10 ─┤
│   9 ─┤
│   8 ─┤  ██  ██            Zochi papers (8, 8, 7)
│   7 ─┤  ██  ██  ██
│   6 ─┤──██──██──██────── acceptance threshold ──────────
│   5 ─┤
│   4 ─┤           ░░  ░░  Other AI systems (~4 avg)
│   3 ─┤           ░░  ░░
│   2 ─┤
│   1 ─┤
│      └──────────────────
│        Zochi    Others
│
│  Legend: ██ = Zochi    ░░ = AI Scientist, Agent Lab, etc.
│  The ~3.67-point gap represents a qualitative leap
│  from "rejected" to "strong accept" territory.

Quality Gap Analysis

The quality gap between Zochi and other AI research systems deserves careful examination:

Quality Dimension Zochi Typical AI-Generated Papers
Problem selection Open-ended, frontier challenges Constrained, predefined tasks
Technical novelty Novel methods (orthonormal subspaces, tree-search jailbreaking) Incremental variations
Experimental rigor Controlled experiments, ablations, multiple trials Basic comparisons
Writing quality Near-publication quality (minor edits needed) Significant editing required
Domain awareness Deep understanding of related work Surface-level citations
Result significance State-of-the-art on standard benchmarks Toy-scale demonstrations

7 Reproducibility

Open Artifacts

Artifact Available Location
Technical report Yes PDF on GitHub
CS-ReFT paper Yes OpenReview
Tempest paper Yes arXiv:2503.10619
Siege workshop paper Yes OpenReview
System code No Closed-source
Prompt templates No Closed-source
Pipeline configuration No Closed-source
Model weights / fine-tunes No Closed-source
Experiment code (papers) Partial GitHub repository
Datasets used Standard AlpacaEval, JailbreakBench, protein datasets

Reproducibility Assessment

Factor Rating Details
System reproducibility Very Low Closed-source; no installable system, configuration, or pipeline available
Paper result reproducibility Medium Standard datasets, published methods, partial code
Method reproducibility Medium-High CS-ReFT and Tempest are clearly described; independent implementation possible
Evaluation reproducibility Medium NeurIPS automated reviewer is a known methodology; MLE-Bench is open
Peer review validation High ACL and ICLR reviews are public records

Reproducibility Comparison

System Code Available Can Reproduce Pipeline Can Reproduce Results
Zochi No No Partially (published papers only)
AI Scientist Yes Yes Yes (with API keys)
AutoResearchClaw Yes Yes Yes (with API keys)
EurekaClaw Yes Yes Yes (with API keys)
Google Co-Scientist No No No
K-Dense BYOK Yes Yes Yes (with API keys)

What Can Be Reproduced

  1. CS-ReFT: The method is described with enough detail to reimplement. The AlpacaEval benchmark is public. The orthonormal subspace transformation approach is straightforward to implement given the paper.

  2. Tempest: The tree search over conversation branches with partial compliance tracking is well-specified. JailbreakBench is public. The core algorithm (BFS over adversarial prompt branches) could be reimplemented.

  3. EGNN-Fusion: The equivariant GNN architecture is described. Protein-nucleic acid binding datasets are standard.

What Cannot Be Reproduced

  1. The research pipeline itself — How Zochi selects research directions, generates hypotheses, designs methods, and writes papers
  2. The literature analysis system — How thousands of papers are ingested, analyzed, and patterns extracted
  3. The validation engine — How evaluation scripts are automatically generated
  4. The meta-cognitive layer — How Zochi decides which ideas are promising enough to pursue
  5. The quality calibration — What makes Zochi produce 7.67-quality papers when others produce ~4

This reproducibility gap is the most significant criticism of Zochi from a scientific perspective. While the individual papers are reproducible, the system that produces papers is not — making it impossible for the research community to verify, extend, or improve upon the core methodology.


8 Compute and API Costs

Cost Model (Inferred)

Since Zochi is closed-source, cost estimates must be inferred from the described capabilities:

Estimated Cost Model
│
│  Cost per paper ≈ Literature_Analysis + Method_Design + Implementation
│                   + Experimentation + Writing + Validation
│
│  Literature_Analysis:
│    ├── "Thousands of papers" analyzed
│    ├── At ~500 tokens/paper summary × 2,000 papers = 1M tokens input
│    ├── Plus gap analysis and direction scoring: ~200K tokens
│    └── Subtotal: ~1.2M tokens (mostly input)
│
│  Method_Design + Implementation:
│    ├── Hypothesis generation: ~50K tokens
│    ├── Architecture design iteration: ~100K tokens
│    ├── Code generation + debugging: ~200K tokens
│    └── Subtotal: ~350K tokens (mix of input/output)
│
│  Experimentation:
│    ├── Experiment script generation: ~50K tokens
│    ├── Result interpretation: ~100K tokens
│    ├── Ablation study design: ~50K tokens
│    └── Subtotal: ~200K tokens
│    Plus compute: GPU hours for training/evaluation
│
│  Writing + Validation:
│    ├── Paper generation: ~100K tokens
│    ├── Validation script generation: ~50K tokens
│    └── Subtotal: ~150K tokens (mostly output)
│
│  TOTAL ESTIMATED: ~1.9M tokens per paper
│  At ~$15/M tokens (frontier model): ~$30 in API costs
│  Plus GPU compute for experiments: varies ($10-$500+)

Timeline Estimates

Phase Estimated Duration Bottleneck
Literature analysis Hours API rate limits, paper retrieval
Hypothesis + method design Hours LLM reasoning depth
Implementation Hours to day Code complexity, debugging cycles
Experimentation Hours to days GPU availability, training time
Validation Hours Evaluation script execution
Writing Hours Paper generation + formatting
Total Hours to days Experiment compute

IntologyAI states: "Methods typically only require hours to validate, and a full paper takes only days to complete."

Cost Comparison

System Estimated Cost per Paper Time per Paper Model Tier
Zochi ~$30-500+ (API + compute) Days Frontier (undisclosed)
AI Scientist $10-50+ Hours to days Claude/GPT-4
AutoResearchClaw $5-30 Hours Configurable
EurekaClaw $1-50+ Hours Claude Sonnet
K-Dense BYOK $0.05-5 Minutes to hours User-selected
Human PhD student $50K-100K/year salary Months to years N/A

Hardware Requirements (Inferred)

Requirement Minimum Likely Production
CPU Multi-core Cloud instances
RAM 16+ GB 32-64 GB
GPU Required for experiments Multi-GPU for training (CS-ReFT used Llama-2-7B)
Storage 10+ GB per project Cloud storage for paper corpus
Network Required High-bandwidth for paper retrieval + API calls
API Access Frontier LLM API Rate-limited; likely parallel calls

9 Architecture Solution

Pipeline Architecture (Inferred from Descriptions)

Zochi operates as a multi-stage autonomous pipeline that mirrors the scientific method. While the internal implementation is closed-source, the described stages can be mapped to an architectural diagram:

Zochi Architecture Overview (Inferred)
│
│  INPUT: Research domain / high-level direction
│  (e.g., "novel jailbreaking methods")
│
│  ╔════════════════════════════════════════════════════════╗
│  ║                 STAGE 1: LITERATURE ANALYSIS          ║
│  ║                                                        ║
│  ║  Paper Retrieval ──► Summarization ──► Pattern Mining  ║
│  ║       │                    │                 │         ║
│  ║  (arXiv, S2,       (Per-paper       (Cross-paper      ║
│  ║   venue DBs)        key findings)    connections)      ║
│  ║                                                        ║
│  ║  Output: Research landscape map + identified gaps      ║
│  ╚═══════════════════════╤════════════════════════════════╝
│                          │
│  ╔═══════════════════════╧════════════════════════════════╗
│  ║              STAGE 2: HYPOTHESIS GENERATION            ║
│  ║                                                        ║
│  ║  Gap Analysis ──► Direction Proposal ──► Selection     ║
│  ║       │                   │                  │         ║
│  ║  (Identify          (Generate           (Score and     ║
│  ║   limitations)       novel ideas)        rank)         ║
│  ║                                                        ║
│  ║  Output: Research hypothesis + proposed method         ║
│  ╚═══════════════════════╤════════════════════════════════╝
│                          │
│  ╔═══════════════════════╧════════════════════════════════╗
│  ║              STAGE 3: METHOD DESIGN                    ║
│  ║                                                        ║
│  ║  Architecture Design ──► Technical Specification       ║
│  ║       │                         │                      ║
│  ║  (Novel method          (Formal description,           ║
│  ║   formulation)           math formulation)             ║
│  ║                                                        ║
│  ║  Output: Complete method specification                 ║
│  ╚═══════════════════════╤════════════════════════════════╝
│                          │
│  ╔═══════════════════════╧════════════════════════════════╗
│  ║             STAGE 4: IMPLEMENTATION                    ║
│  ║                                                        ║
│  ║  Code Generation ──► Testing ──► Debugging ──► ↻      ║
│  ║       │                 │            │                 ║
│  ║  (Method            (Unit +      (Iterative           ║
│  ║   implementation)    integration)  repair)             ║
│  ║                                                        ║
│  ║  Output: Working implementation of proposed method     ║
│  ╚═══════════════════════╤════════════════════════════════╝
│                          │
│  ╔═══════════════════════╧════════════════════════════════╗
│  ║             STAGE 5: EXPERIMENTATION                   ║
│  ║                                                        ║
│  ║  Experiment Design ──► Parallel Execution ──► Results  ║
│  ║       │                      │                  │      ║
│  ║  (Controlled         (Multi-trial          (Statistical║
│  ║   experiments,        parallelized)         analysis)  ║
│  ║   ablation studies)                                    ║
│  ║                                                        ║
│  ║  ┌──────────────────────────────┐                      ║
│  ║  │   VALIDATION ENGINE          │                      ║
│  ║  │   Auto-generates eval scripts│                      ║
│  ║  │   Standardized datasets      │                      ║
│  ║  │   Datasets NOT modified      │                      ║
│  ║  └──────────────────────────────┘                      ║
│  ║                                                        ║
│  ║  Output: Experimental results + analysis               ║
│  ╚═══════════════════════╤════════════════════════════════╝
│                          │
│  ╔═══════════════════════╧════════════════════════════════╗
│  ║           STAGE 6: MANUSCRIPT PREPARATION              ║
│  ║                                                        ║
│  ║  Paper Drafting ──► Related Work ──► Full Paper        ║
│  ║       │                  │               │             ║
│  ║  (Structure +      (Literature      (Complete          ║
│  ║   technical         integration)     manuscript)       ║
│  ║   writing)                                             ║
│  ║                                                        ║
│  ║  Human steps: figures, citation format, minor edits    ║
│  ║                                                        ║
│  ║  Output: Conference-ready paper                        ║
│  ╚════════════════════════════════════════════════════════╝
│
│  OUTPUT: Peer-reviewed publication (ACL 2025, ICLR 2025)

Architectural Differentiators

| Feature | Zochi | AI Scientist | AutoResearchClaw |
| --- | --- | --- | --- |
| Pipeline stages | ~6 (inferred) | ~8 | 23 |
| Parallel experiments | Yes | Limited | Yes |
| Validation engine | Dedicated + separated | Self-evaluation | Multi-agent review |
| Cross-domain | Yes (3 domains demonstrated) | Single domain per run | Configurable |
| Human involvement | Minimal (figures, citations) | Similar | Fully automated |
| Quality bar | A* venue acceptance | Workshop demos | Automated scores only |
| Stage granularity | Coarse (strategic) | Medium | Fine (23 stages) |

Key Architectural Decisions (Inferred)

  1. Separation of generation and validation: The validation engine generates evaluation scripts independently from the method generation process. This prevents the system from inadvertently gaming its own metrics.

  2. Parallelized experimentation: "Experiments are parallelized across multiple trials, significantly accelerating the research timeline." This suggests an experiment orchestration layer that manages GPU resources and collects results.

  3. Minimal human handoff: The architecture is designed to minimize human touchpoints. The only human steps are cosmetic (figures, formatting) rather than substantive (method design, experiment interpretation).

  4. Input minimalism: For the ACL paper, the input was merely "novel jailbreaking methods" — 3 words that triggered the entire research pipeline from literature analysis to a 97% attack success rate on GPT-4.
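
As a concrete illustration of decision 2, a parallel-trial orchestrator might look like the following minimal Python sketch. It is an assumption-laden stand-in: `run_trial`, the config schema, and the aggregation step are invented here, since the real orchestrator is closed-source.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

def run_trial(config: dict) -> float:
    """Stand-in for one experiment trial; returns a single metric value."""
    # A real trial would train/evaluate a model on allocated GPU resources.
    return config["base_score"] + 0.01 * config["seed"]

def run_experiment(configs: list, max_workers: int = 4) -> dict:
    """Run all trials in parallel, then aggregate results statistically."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(run_trial, configs))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "n_trials": len(scores),
    }

configs = [{"base_score": 0.9, "seed": s} for s in range(5)]
summary = run_experiment(configs)
```

The point of the sketch is the shape, not the numbers: multi-trial parallelization plus statistical aggregation is what "significantly accelerating the research timeline" implies architecturally.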

Architecture Evolution: Siege vs. Tempest

The progression from ICLR workshop (Siege) to ACL main (Tempest) reveals architectural improvements:

Architecture Evolution
│
│  Siege (ICLR 2025 Workshop, v1)
│  ├── Input: "multi-turn attacks on LLMs" (autonomously identified)
│  ├── Pipeline: Standard autonomous flow
│  ├── Output: 2-4 page tiny paper
│  ├── Human contribution: Same as standard (figures, formatting)
│  └── Result: (7, 7) reviewer scores
│
│  Tempest (ACL 2025 Main, v2)
│  ├── Input: Same high-level idea (tree search + multi-turn jailbreak)
│  ├── Pipeline: Enhanced — "significantly improved design"
│  ├── Additional: Cross-branch learning mechanism
│  ├── Additional: Robust partial compliance tracking
│  ├── Additional: "More comprehensive experiments"
│  ├── Output: Full conference paper
│  ├── Human contribution: Same minimal scope
│  └── Result: Meta-review 4 (top 8.2%)
│
│  Key observation: The system could ITERATE on its own work,
│  producing substantially improved research on the same topic.

10 Component Breakdown

Inferred Component Architecture

Since Zochi is closed-source, the component breakdown must be inferred from the described capabilities and outputs. The following represents a plausible decomposition:

Zochi Component Map (Inferred)
│
├── CORE ENGINE
│   ├── Pipeline Orchestrator
│   │   ├── Stage sequencing and state management
│   │   ├── Error handling and recovery
│   │   └── Resource allocation across stages
│   │
│   ├── LLM Interface Layer
│   │   ├── API client(s) for frontier model(s)
│   │   ├── Prompt management and templating
│   │   ├── Token budget management
│   │   └── Response parsing and validation
│   │
│   └── Domain Abstraction
│       ├── Domain-agnostic pipeline flow
│       └── Domain-specific adapter patterns
│
├── LITERATURE ENGINE
│   ├── Paper Retrieval
│   │   ├── arXiv API integration
│   │   ├── Semantic Scholar API integration
│   │   ├── Venue-specific databases
│   │   └── Citation graph traversal
│   │
│   ├── Paper Analysis
│   │   ├── Abstract and full-text summarization
│   │   ├── Methodology extraction
│   │   ├── Result extraction
│   │   └── Limitation identification
│   │
│   └── Knowledge Synthesis
│       ├── Cross-paper pattern mining
│       ├── Gap identification
│       ├── Direction scoring
│       └── Research landscape mapping
│
├── HYPOTHESIS ENGINE
│   ├── Idea Generator
│   │   ├── Cross-paper connection identification
│   │   ├── Novel combination proposal
│   │   └── Feasibility assessment
│   │
│   ├── Method Designer
│   │   ├── Architecture specification
│   │   ├── Mathematical formulation
│   │   └── Technical approach planning
│   │
│   └── Selection Filter
│       ├── Novelty scoring
│       ├── Impact prediction
│       └── Feasibility ranking
│
├── IMPLEMENTATION ENGINE
│   ├── Code Generator
│   │   ├── Method implementation from specification
│   │   ├── Data loading and preprocessing
│   │   └── Training loop generation
│   │
│   ├── Testing Layer
│   │   ├── Unit test generation
│   │   ├── Integration testing
│   │   └── Debugging and repair loop
│   │
│   └── Environment Manager
│       ├── Dependency management
│       ├── GPU resource allocation
│       └── Experiment workspace isolation
│
├── EXPERIMENTATION ENGINE
│   ├── Experiment Designer
│   │   ├── Controlled experiment specification
│   │   ├── Ablation study generation
│   │   └── Baseline selection
│   │
│   ├── Execution Orchestrator
│   │   ├── Multi-trial parallelization
│   │   ├── Result collection
│   │   └── Resource management
│   │
│   └── VALIDATION ENGINE (SEPARATED)
│       ├── Auto-generates evaluation scripts
│       ├── Uses standardized, unmodified datasets
│       ├── Independent from generation process
│       └── Ensures genuine performance measurement
│
├── WRITING ENGINE
│   ├── Paper Generator
│   │   ├── Structure planning
│   │   ├── Section-by-section drafting
│   │   ├── Related work integration
│   │   └── Technical writing quality assurance
│   │
│   └── LaTeX Formatter
│       ├── Conference template compliance
│       ├── Table and equation formatting
│       └── Reference management
│
└── QUALITY ASSURANCE
    ├── Automated Reviewer
    │   ├── NeurIPS guidelines-based scoring
    │   ├── Multi-dimensional evaluation
    │   └── Quality threshold enforcement
    │
    └── Result Verification
        ├── Statistical significance checking
        ├── Claim-to-evidence alignment
        └── Reproducibility verification
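
The inferred Pipeline Orchestrator's stage sequencing and state management could, hypothetically, be as simple as the sketch below. Stage names and the state-dict schema are illustrative only, not Zochi's actual interfaces.

```python
from typing import Callable

class PipelineOrchestrator:
    """Sequences pipeline stages, passing accumulated state between them.

    A minimal sketch of the inferred 'Pipeline Orchestrator' component.
    """

    def __init__(self):
        self.stages: list = []  # (name, stage_fn) pairs

    def add_stage(self, name: str, fn: Callable[[dict], dict]):
        self.stages.append((name, fn))
        return self

    def run(self, state: dict) -> dict:
        for name, fn in self.stages:
            # Each stage reads the shared state and merges its outputs back in.
            state = {**state, **fn(state), "last_stage": name}
        return state

orch = (PipelineOrchestrator()
        .add_stage("literature", lambda s: {"gaps": ["multi-turn attacks"]})
        .add_stage("hypothesis", lambda s: {"direction": s["gaps"][0]}))
final = orch.run({"domain": "novel jailbreaking methods"})
```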

Component Interaction Patterns

Information Flow Between Components
│
│  Domain Input ──────────────────────────────────────────┐
│       │                                                  │
│       ▼                                                  │
│  ┌─────────────┐    papers     ┌──────────────┐         │
│  │  Literature  │─────────────►│  Hypothesis   │         │
│  │   Engine     │   + gaps     │   Engine      │         │
│  └─────────────┘              └──────┬───────┘         │
│                                       │                  │
│                              method spec                 │
│                                       │                  │
│                                       ▼                  │
│                              ┌──────────────┐           │
│                              │Implementation │           │
│                              │   Engine      │           │
│                              └──────┬───────┘           │
│                                      │                   │
│                                working code              │
│                                      │                   │
│                                      ▼                   │
│  ┌────────────┐  eval scripts ┌──────────────┐          │
│  │ Validation │◄─────────────│Experimentation│          │
│  │   Engine   │───results───►│   Engine      │          │
│  └────────────┘              └──────┬───────┘          │
│                                      │                   │
│                              results + analysis          │
│                                      │                   │
│                                      ▼                   │
│                              ┌──────────────┐           │
│                              │   Writing     │◄──────────┘
│                              │   Engine      │  domain context
│                              └──────┬───────┘
│                                      │
│                              ┌───────▼───────┐
│                              │    Quality     │
│                              │  Assurance     │
│                              └───────┬───────┘
│                                      │
│                                      ▼
│                              Conference Paper

Comparison: Component Density

| System | Named Components | Pipeline Stages | Agents | Tools |
| --- | --- | --- | --- | --- |
| Zochi | ~12 (inferred) | ~6 | Unknown | Unknown |
| AI Scientist | ~8 | ~8 | 3 (researcher, reviewer, editor) | ~5 |
| AutoResearchClaw | ~15+ | 23 | 8+ specialized agents | 10+ |
| EurekaClaw | ~20+ | 7 | 7+ specialized agents | 8+ |
| Google Co-Scientist | Unknown | Multi-step | Multi-agent | Unknown |

Zochi appears to use a more consolidated architecture with fewer but more capable components, contrasting with the fine-grained stage decomposition of systems like AutoResearchClaw (23 stages) or EurekaClaw (7 stages with sub-agents per stage).


11 Core Mechanisms (Detailed)

11.1 Literature-Grounded Research Direction Selection

The literature analysis phase is described as ingesting "thousands of research papers" and identifying "non-obvious connections across papers." This implies a multi-layer analysis pipeline:

Literature Analysis Pipeline (Inferred)
│
├── Layer 1: RETRIEVAL
│   ├── Input: Domain string (e.g., "novel jailbreaking methods")
│   ├── Query expansion: LLM generates multiple search queries
│   ├── Sources: arXiv API, Semantic Scholar, venue proceedings
│   ├── Scale: "Thousands of papers" retrieved
│   └── Output: Raw paper corpus (titles, abstracts, full texts)
│
├── Layer 2: SUMMARIZATION
│   ├── Per-paper analysis:
│   │   ├── Key contribution extraction
│   │   ├── Methodology characterization
│   │   ├── Limitation identification
│   │   └── Result summary
│   ├── Efficiency: Likely uses cheaper model or shorter prompts
│   └── Output: Structured paper summaries
│
├── Layer 3: PATTERN MINING
│   ├── Cross-paper connection identification
│   ├── Methodology trend analysis
│   ├── "Non-obvious connections" — the key differentiator
│   │   ├── For CS-ReFT: Connected representation editing to
│   │   │   cross-skill interference (not obvious from either
│   │   │   literature alone)
│   │   ├── For Tempest: Connected multi-turn dialogue patterns
│   │   │   to systematic safety erosion (novel framing)
│   │   └── For EGNN-Fusion: Connected equivariant architectures
│   │       to binding site efficiency (cross-domain transfer)
│   └── Output: Pattern graph over research landscape
│
└── Layer 4: DIRECTION SELECTION
    ├── Gap scoring: novelty × feasibility × impact
    ├── Direction ranking
    ├── Selection of most promising direction
    └── Output: Chosen research direction with justification
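
Layer 4's gap scoring (novelty × feasibility × impact) admits a straightforward sketch. The candidate directions and scores below are invented for illustration; the real scoring function is undocumented.

```python
def score_direction(novelty: float, feasibility: float, impact: float) -> float:
    """Gap score as the product novelty x feasibility x impact (all in [0, 1])."""
    return novelty * feasibility * impact

def select_direction(candidates: list) -> dict:
    """Rank candidate directions by gap score and return the most promising one."""
    ranked = sorted(
        candidates,
        key=lambda c: score_direction(c["novelty"], c["feasibility"], c["impact"]),
        reverse=True,
    )
    return ranked[0]

candidates = [
    {"name": "single-turn prompt attacks",
     "novelty": 0.3, "feasibility": 0.9, "impact": 0.6},   # crowded area
    {"name": "multi-turn tree-search attacks",
     "novelty": 0.8, "feasibility": 0.7, "impact": 0.9},   # underexplored gap
]
best = select_direction(candidates)
```

A multiplicative score encodes a sensible design choice: a direction that fails on any one axis (e.g. zero feasibility) scores zero overall, however novel it is.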

11.2 The "Partial Compliance" Discovery Mechanism

Zochi's most impactful scientific discovery — the "partial compliance" vulnerability pattern in LLMs — illustrates the system's ability to identify non-obvious phenomena:

Partial Compliance Discovery (Tempest)
│
│  Observation: When attacked across multiple turns, LLMs don't
│  simply "comply" or "refuse" — they exhibit a gradient of responses.
│
│  ┌────────────────────────────────────────────────────────┐
│  │  Response Spectrum                                      │
│  │                                                         │
│  │  Full Refusal ◄──────────────────────► Full Compliance  │
│  │       │                                       │         │
│  │       │    ┌─────────────────────┐            │         │
│  │       │    │  PARTIAL COMPLIANCE │            │         │
│  │       │    │  ─────────────────  │            │         │
│  │       │    │  Model reveals      │            │         │
│  │       │    │  FRAGMENTS of       │            │         │
│  │       │    │  restricted info    │            │         │
│  │       │    │  while appearing    │            │         │
│  │       │    │  to maintain safety │            │         │
│  │       │    └─────────────────────┘            │         │
│  │       │              │                        │         │
│  │       │     These fragments                   │         │
│  │       │     ACCUMULATE across turns            │         │
│  │       │              │                        │         │
│  │       │     Until full compliance             │         │
│  │       │     is achieved                       │         │
│  └───────┴──────────────┴────────────────────────┘         │
│                                                             │
│  Key insight: Safety is not a binary gate but a             │
│  continuously erodable surface. Minor concessions           │
│  create anchor points for subsequent exploitation.          │
└─────────────────────────────────────────────────────────────┘

This discovery is significant because it:

  1. Reframes the safety problem — from binary (safe/unsafe) to continuous (compliance gradient)
  2. Was autonomously identified — Zochi discovered this pattern from literature analysis, not from human guidance
  3. Has practical implications — requires rethinking multi-turn safety mechanisms beyond single-turn guardrails
  4. Survived A* peer review — validating the discovery's novelty and significance
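
A compliance tracker over this gradient might be sketched as follows. The additive accumulation rule and the thresholds are assumptions made for illustration, not the published Tempest scoring function.

```python
class ComplianceTracker:
    """Tracks a branch's compliance level on a [0, 1] gradient.

    Illustrative only: real scoring of partial compliance is more nuanced.
    """

    def __init__(self):
        self.fragments = []  # restricted-info fragments revealed so far
        self.level = 0.0

    def record(self, turn_score: float, fragment: str = "") -> str:
        # Minor concessions accumulate across turns; the level never resets.
        self.level = min(self.level + turn_score, 1.0)
        if fragment:
            self.fragments.append(fragment)
        if self.level >= 1.0:
            return "full_compliance"
        return "partial" if self.level > 0 else "refusal"

tracker = ComplianceTracker()
turns = [(0.0, ""), (0.4, "fragment A"), (0.3, "fragment B"), (0.3, "fragment C")]
states = [tracker.record(score, frag) for score, frag in turns]
```

The sketch captures the key claim: no single turn crosses the line, yet the accumulated fragments eventually amount to full compliance.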

11.3 Orthonormal Subspace Representation Editing (CS-ReFT)

The CS-ReFT method demonstrates Zochi's ability to formulate technically novel approaches:

CS-ReFT Architecture (from paper)
│
│  Standard Fine-Tuning:
│  ┌────────────────────────────────────────┐
│  │  Weights modified → all tasks affected │
│  │  Task A improvement → Task B degrades  │  = cross-skill
│  │  LoRA orthogonality: weight-level only │    interference
│  └────────────────────────────────────────┘
│
│  CS-ReFT Approach:
│  ┌────────────────────────────────────────────────────────┐
│  │                                                         │
│  │  Hidden State Space (h)                                 │
│  │  ┌─────────────────────────────────────┐               │
│  │  │                                     │               │
│  │  │    Subspace S₁ ──► Skill 1 edit     │               │
│  │  │    (orthonormal)                     │               │
│  │  │              ⊥                       │               │
│  │  │    Subspace S₂ ──► Skill 2 edit     │               │
│  │  │    (orthonormal)                     │               │
│  │  │              ⊥                       │               │
│  │  │    Subspace Sₖ ──► Skill k edit     │               │
│  │  │    (orthonormal)                     │               │
│  │  │                                     │               │
│  │  └─────────────────────────────────────┘               │
│  │                    │                                    │
│  │              ┌─────┴─────┐                              │
│  │              │   Router   │                              │
│  │              │ (lightweight│                              │
│  │              │  selector) │                              │
│  │              └─────┬─────┘                              │
│  │                    │                                    │
│  │              Composed output                            │
│  │                                                         │
│  │  Key innovation: orthonormality at hidden-state level   │
│  │  not weight level → directly prevents interference      │
│  │  where it manifests                                     │
│  └────────────────────────────────────────────────────────┘
│
│  Results:
│  ├── 93.94% win rate on AlpacaEval (vs. 86.30% GPT-3.5-T)
│  ├── Only 0.0098% of model parameters
│  ├── 12.7x fewer parameters than LoRA
│  └── Minimal cross-task interference
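
The core linear-algebra idea, that edits confined to mutually orthogonal subspaces cannot interfere with one another, can be demonstrated in a few lines of NumPy. Dimensions and the edit rule are illustrative here, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size, per-skill subspace rank

# QR on a random matrix yields orthonormal columns; splitting them gives
# disjoint (hence mutually orthogonal) per-skill subspaces S1 and S2.
Q, _ = np.linalg.qr(rng.normal(size=(d, 2 * r)))
S1, S2 = Q[:, :r], Q[:, r:]

def edit(h: np.ndarray, S: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Apply a hidden-state edit confined to the subspace spanned by S."""
    return h + S @ delta

h = rng.normal(size=d)
h_edited = edit(h, S1, np.array([0.5, -0.25]))  # a "skill 1" edit

# The skill-2 component of the hidden state is untouched by the skill-1 edit:
unchanged = np.allclose(S2.T @ h_edited, S2.T @ h)
```

Because S2ᵀS1 = 0, the skill-1 edit vanishes under projection onto S2: interference is prevented directly at the hidden-state level, which is the stated innovation over weight-level orthogonality.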

11.4 Tree Search Over Conversation Branches (Tempest)

The Tempest framework implements a systematic search algorithm over multi-turn conversations:

Tempest Tree Search Algorithm
│
│  INITIALIZE:
│    root = initial adversarial prompt
│    target = restricted behavior
│    tree = {root}
│    compliance_tracker = {}
│
│  LOOP (for each turn t):
│    │
│    ├── EXPAND: For each active node n in tree:
│    │   ├── Generate k adversarial follow-ups
│    │   │   (breadth-first branching)
│    │   ├── Each follow-up exploits partial compliance
│    │   │   from n's response
│    │   └── Add branches to tree
│    │
│    ├── EVALUATE: For each new branch:
│    │   ├── Send to target model
│    │   ├── Receive response
│    │   ├── Measure compliance level:
│    │   │   ├── Full refusal → prune branch
│    │   │   ├── Partial compliance → track fragments
│    │   │   └── Full compliance → SUCCESS
│    │   └── Update compliance_tracker
│    │
│    ├── CROSS-BRANCH LEARNING (ACL version):
│    │   ├── Analyze successful partial compliance patterns
│    │   ├── Transfer effective strategies across branches
│    │   └── Reinject learned patterns into new prompts
│    │
│    ├── PRUNE: Remove branches with:
│    │   ├── Full refusals
│    │   ├── Stalled compliance
│    │   └── Redundant paths
│    │
│    └── RE-INJECT: For promising branches:
│        ├── Extract compliance fragments from responses
│        ├── Incorporate fragments into next turn's prompts
│        └── "Minor concessions accumulate into fully
│            disallowed outputs"
│
│  TERMINATE when:
│    ├── Full compliance achieved (success)
│    ├── All branches pruned (failure)
│    └── Max turns reached (timeout)
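
The loop above can be rendered as a schematic Python function. Here `respond` (the target model), `score` (compliance in [0, 1]), and `branch` (the follow-up generator) are caller-supplied stand-ins for components Zochi does not disclose, and the toy stand-ins at the bottom exist only to make the sketch executable.

```python
def tree_search(root_prompt, respond, score, branch, max_turns=5, width=3):
    """Breadth-first multi-turn search over conversation branches."""
    active = [[root_prompt]]  # each branch is a list of turns
    for _ in range(max_turns):
        expanded = []
        for conv in active:
            for follow_up in branch(conv)[:width]:   # EXPAND
                new_conv = conv + [follow_up]
                s = score(respond(new_conv))         # EVALUATE
                if s >= 1.0:
                    return new_conv                  # full compliance: SUCCESS
                if s > 0.0:                          # PRUNE full refusals
                    expanded.append(new_conv)        # partial: keep exploring
        if not expanded:
            return None                              # all branches pruned
        active = expanded
    return None                                      # max turns: timeout

# Toy stand-ins: compliance simply grows with conversation depth.
result = tree_search(
    "p0",
    respond=lambda conv: len(conv),
    score=lambda depth: depth / 3,
    branch=lambda conv: [conv[-1] + ".a", conv[-1] + ".b"],
)
```

The cross-branch learning and fragment re-injection steps of the ACL version would sit between EVALUATE and PRUNE, mutating `branch`'s behavior based on `compliance_tracker` state; they are omitted here for brevity.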

11.5 Validation Engine — Separation of Concerns

The validation engine is one of Zochi's most architecturally important mechanisms:

Validation Engine Design
│
│  PROBLEM: AI systems can inadvertently optimize for
│           their own evaluation metrics rather than
│           genuine performance improvements.
│
│  SOLUTION: Separate generation from evaluation
│
│  ┌──────────────────┐     ┌──────────────────┐
│  │  GENERATION PATH  │     │  VALIDATION PATH  │
│  │                    │     │                    │
│  │  Method design     │     │  Eval script gen   │
│  │  Implementation    │     │  (independent)     │
│  │  Training          │     │                    │
│  │                    │     │  Standardized      │
│  │  Produces:         │     │  datasets (NOT     │
│  │  - trained model   │     │  modified by       │
│  │  - method code     │     │  generation path)  │
│  │                    │     │                    │
│  └────────┬─────────┘     └────────┬─────────┘
│           │                        │
│           │      ┌────────────┐    │
│           └─────►│ EVALUATION  │◄──┘
│                  │             │
│                  │ Model tested│
│                  │ on unmodified│
│                  │ datasets via │
│                  │ independent  │
│                  │ eval scripts │
│                  └──────┬─────┘
│                         │
│                   Genuine results
│
│  This prevents:
│  ├── Data leakage from training to evaluation
│  ├── Metric gaming (optimizing for eval proxy)
│  ├── Self-confirming evaluation loops
│  └── Overfitting to evaluation procedure
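
One simple way to enforce the "datasets NOT modified" guarantee is content fingerprinting: hash the evaluation set before experiments begin and refuse to report results if the hash has changed. The sketch below illustrates the principle only; it is not Zochi's actual mechanism.

```python
import hashlib
import json

def dataset_fingerprint(examples: list) -> str:
    """Content hash of an evaluation dataset, taken before any experiments run."""
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def validated_eval(examples: list, fingerprint: str, metric) -> float:
    """Report a metric only if the eval set matches its pre-registered hash."""
    if dataset_fingerprint(examples) != fingerprint:
        raise ValueError("evaluation dataset was modified; results rejected")
    return metric(examples)

data = [{"x": 1, "y": 1}, {"x": 2, "y": 0}]
fp = dataset_fingerprint(data)  # registered by the validation path
accuracy = validated_eval(data, fp, lambda ex: sum(e["y"] for e in ex) / len(ex))
```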

11.6 Research Iteration Capability

The Siege → Tempest progression reveals a capability that most AI research systems lack, namely the ability to iteratively improve work on the same research direction:

Research Iteration Loop
│
│  Iteration 1 (Siege — ICLR Workshop):
│  ├── Input: "multi-turn attacks on LLMs"
│  │   (autonomously identified from literature)
│  ├── Output: Core tree search framework
│  ├── Result: 100%/97% attack success
│  ├── Format: 2-4 page tiny paper
│  └── Feedback: (7, 7) reviewer scores
│
│  GAP: Workshop paper → Full conference paper requires:
│  ├── Deeper methodology
│  ├── More comprehensive experiments
│  ├── Stronger theoretical motivation
│  └── Better presentation
│
│  Iteration 2 (Tempest — ACL Main):
│  ├── Input: Same high-level idea + Siege as starting point
│  ├── Enhancements:
│  │   ├── Cross-branch learning mechanism (NEW)
│  │   ├── Robust partial compliance tracking (IMPROVED)
│  │   ├── Comprehensive evaluations (EXPANDED)
│  │   └── Full conference paper format (EXTENDED)
│  ├── Result: Same attack rates + deeper analysis
│  └── Outcome: Top 8.2% at A* venue
│
│  This demonstrates Zochi can:
│  1. Evaluate its own work's limitations
│  2. Identify what needs improvement
│  3. Design and implement those improvements
│  4. Produce substantially stronger results
│  5. Meet a much higher quality bar (workshop → A*)

12 Programming Language

System Implementation

| Aspect | Assessment | Evidence |
| --- | --- | --- |
| System language | Unknown (likely Python) | Closed-source; Python is standard for ML systems |
| Generated code | Python (confirmed) | CS-ReFT uses PyTorch; Tempest uses standard ML libraries |
| Experiment code | Python (confirmed) | Standard ML stack (PyTorch, transformers, etc.) |
| Paper output | LaTeX | Conference paper format |

Generated Code Quality Indicators

The code Zochi generates must be of sufficient quality to:

  1. Train models successfully: CS-ReFT trained on Llama-2-7B with orthonormal subspace edits
  2. Run complex experiments: Tempest executed multi-turn attacks against GPT-3.5 and GPT-4 APIs
  3. Implement novel architectures: EGNN-Fusion designed a new equivariant GNN architecture
  4. Produce reproducible results: Results were verified by peer reviewers at A* venues

Comparison to Other Systems

| System | System Language | Generated Code | Open Source |
| --- | --- | --- | --- |
| Zochi | Unknown (Python likely) | Python | No |
| AI Scientist | Python | Python | Yes |
| AutoResearchClaw | Python | Python | Yes (MIT) |
| EurekaClaw | Python/TypeScript | Python + LaTeX | Yes (Apache 2.0) |
| Google Co-Scientist | Unknown | Unknown | No |
| AIRA₂ | Python | Python | Partially |

Code Quality Assessment (Inferred)

| Indicator | Evidence | Assessment |
| --- | --- | --- |
| Correctness | Results accepted at A* venue | High — peer-verified |
| Reproducibility | Standardized datasets, published results | Medium-High |
| Complexity | Orthonormal subspace edits, tree search, EGNN architectures | High — non-trivial implementations |
| Test coverage | Ablation studies, multiple baselines | High — comprehensive evaluation |
| Documentation | Published papers describe methods | High — paper-quality documentation |

13 Memory Management

Memory Architecture (Inferred)

Zochi's memory system is not publicly documented, but the system's capabilities imply several memory types:

┌─────────────────────────────────────────────────────────────┐
│              Zochi Memory Architecture (Inferred)            │
│                                                              │
│  Tier 1: LITERATURE MEMORY                                   │
│  ├── Scope: Per-project                                      │
│  ├── Content: Paper summaries, extracted patterns, gaps      │
│  ├── Scale: "Thousands of papers"                            │
│  ├── Access: Literature engine reads; hypothesis engine reads│
│  └── Purpose: Ground research in existing knowledge          │
│                                                              │
│  Tier 2: PROJECT MEMORY                                      │
│  ├── Scope: Per-project (spans all pipeline stages)          │
│  ├── Content: Chosen direction, method spec, implementation  │
│  │   state, experiment results, partial drafts               │
│  ├── Access: All stages read; each stage writes its outputs  │
│  └── Purpose: Maintain coherence across pipeline stages      │
│                                                              │
│  Tier 3: ITERATION MEMORY                                    │
│  ├── Scope: Across project iterations (Siege → Tempest)      │
│  ├── Content: Prior work artifacts, identified improvements  │
│  ├── Access: New iteration reads prior artifacts             │
│  └── Purpose: Enable research iteration and improvement      │
│                                                              │
│  Tier 4: VALIDATION MEMORY                                   │
│  ├── Scope: Per-experiment                                   │
│  ├── Content: Evaluation scripts, dataset references, results│
│  ├── Access: Validation engine (isolated from generation)    │
│  └── Purpose: Ensure genuine, unbiased evaluation            │
│                                                              │
│  Tier 5: CROSS-DOMAIN MEMORY (speculative)                   │
│  ├── Scope: Across projects/domains                          │
│  ├── Content: Transferable strategies, method patterns       │
│  ├── Access: Hypothesis engine for new projects              │
│  └── Purpose: Enable cross-domain innovation                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Evidence for Memory Tiers

| Tier | Evidence | Confidence |
| --- | --- | --- |
| Literature Memory | "Ingests and analyzes thousands of research papers" | High |
| Project Memory | Multi-stage pipeline requires state passing | High |
| Iteration Memory | Siege → Tempest improvement demonstrates prior work awareness | High |
| Validation Memory | Separated validation engine with unmodified datasets | Medium |
| Cross-Domain Memory | 3 domains with transferable patterns | Low-Medium |

Context Window vs. Persistent Memory

A key architectural question for Zochi is how it handles the tension between LLM context window limitations and the need for extensive research context:

Context Window Challenge
│
│  Challenge: "Thousands of papers" analyzed → millions of tokens
│  Typical context window: 100K-200K tokens (frontier models, 2025)
│
│  Possible Solutions (Inferred):
│
│  1. HIERARCHICAL SUMMARIZATION
│     ├── Full papers → abstracts → key findings → pattern summary
│     ├── Progressive compression: 1000x reduction
│     └── Only summaries enter context window
│
│  2. RETRIEVAL-AUGMENTED GENERATION
│     ├── Vector store indexes all paper summaries
│     ├── Relevant papers retrieved per query
│     └── Only retrieved context enters window
│
│  3. STRUCTURED STATE PASSING
│     ├── Each pipeline stage produces structured output
│     ├── Next stage receives structured input (not raw text)
│     └── Information compressed between stages
│
│  4. HYBRID APPROACH (most likely)
│     ├── RAG for literature (Tier 1)
│     ├── Structured state for pipeline (Tier 2)
│     ├── Artifact persistence for iteration (Tier 3)
│     └── Isolated context for validation (Tier 4)
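
Solution 3, structured state passing, might look like the following sketch. The dataclass schemas are invented for illustration; Zochi's actual inter-stage state format is not public.

```python
from dataclasses import dataclass, field

@dataclass
class PaperSummary:
    """Compressed per-paper record: only this enters later context windows."""
    title: str
    contribution: str
    limitation: str

@dataclass
class ProjectState:
    """Structured state handed between pipeline stages instead of raw text."""
    domain: str
    summaries: list = field(default_factory=list)
    direction: str = ""

    def to_prompt_context(self, budget: int = 2) -> str:
        """Render only the top-`budget` summaries for the next stage's prompt."""
        lines = [f"- {s.title}: {s.contribution} (limit: {s.limitation})"
                 for s in self.summaries[:budget]]
        return "\n".join(lines)

state = ProjectState(domain="novel jailbreaking methods")
state.summaries.append(PaperSummary("Paper A", "single-turn attack", "no multi-turn"))
state.summaries.append(PaperSummary("Paper B", "tree search planner", "not applied to safety"))
state.direction = "multi-turn tree-search attacks"
context = state.to_prompt_context()
```

Each stage consumes and emits compact typed records rather than raw paper text, which is how a pipeline over "thousands of papers" could stay within a 100K-200K token window.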

Memory Comparison

| System | Memory Tiers | Literature Store | Cross-Run | Cross-Domain |
| --- | --- | --- | --- | --- |
| Zochi | ~5 (inferred) | Yes (thousands of papers) | Yes (Siege→Tempest) | Yes (3 domains) |
| EurekaClaw | 4 | Yes (arXiv + S2) | Yes (persistent) | Per-domain plugin |
| AutoResearchClaw | 3 | Yes (real APIs) | Yes (MetaClaw) | No |
| AI Scientist | 1-2 | Basic | No | No |
| K-Dense BYOK | 1 | Conversation only | No | No |

14 Continued Learning

Evidence of Learning Capability

The Siege → Tempest progression is the strongest evidence that Zochi implements some form of continued learning or iterative improvement:

Learning Evidence: Siege → Tempest Progression
│
│  ICLR Workshop (March 2025)
│  ├── System autonomously identified multi-turn attack direction
│  ├── Designed core tree search framework
│  ├── Produced workshop-quality paper
│  └── Received (7, 7) reviewer scores
│
│  LEARNING PHASE (March - May 2025)
│  ├── "Zochi was able to significantly improve its design"
│  ├── "Conduct more comprehensive experiments"
│  ├── Added: Cross-branch learning mechanism
│  ├── Added: Robust partial compliance tracking
│  └── Expanded methodology and evaluation
│
│  ACL Main (May 2025)
│  ├── Same high-level direction, vastly improved execution
│  ├── Full conference paper (vs. 2-4 page tiny paper)
│  ├── Meta-review score 4 = top 8.2%
│  └── Accepted at A* venue (~21% acceptance rate)
│
│  This implies the system can:
│  1. Assess the quality gap between its work and a higher bar
│  2. Identify specific improvement opportunities
│  3. Execute those improvements autonomously
│  4. Produce substantially stronger output

Types of Learning (Inferred)

| Learning Type | Evidence | Mechanism (Inferred) |
| --- | --- | --- |
| Intra-project iteration | Code debugging, experiment refinement | Self-correction loops within pipeline |
| Inter-project learning | Siege → Tempest improvement | Prior work analysis + targeted enhancement |
| Cross-domain transfer | AI methods → computational biology | Transferable research strategies |
| Quality calibration | 7.67 avg NeurIPS score | Understanding of what constitutes good research |

Continuous Improvement Evidence

The technical report describes the ACL version (May 2025) as "a substantial advancement over our earlier systems that published workshop papers at ICLR 2025." This suggests ongoing system development between March and May 2025:

System Evolution Timeline
│
│  March 14, 2025 ── IntologyAI announcement
│  March 17, 2025 ── Technical report published
│                     ├── CS-ReFT at ICLR SCOPE Workshop
│                     └── Siege at ICLR Building Trust Workshop
│
│  March - May 2025 ── System improvement period
│                       ├── Architecture enhancements
│                       ├── "Substantially advanced" system
│                       └── Human involvement reduced further
│
│  May 27, 2025 ── ACL acceptance announced
│                   ├── Tempest (expanded Siege) accepted
│                   ├── Main proceedings (not workshop)
│                   └── "First AI to pass A* peer review"
│
│  2025 (later) ── Locus previewed
│                   ├── Successor to Zochi
│                   ├── Surpasses human experts on RE-Bench
│                   └── Multi-day research campaigns

Learning Comparison

| System | Learning Type | What is Learned | Cross-Run | Cross-Domain |
| --- | --- | --- | --- | --- |
| Zochi | System evolution + inter-project | Research strategies, quality standards | Yes | Yes |
| EurekaClaw | Post-session distillation | Proof strategies, tool patterns | Yes (4-tier) | Per-domain |
| AutoResearchClaw | MetaClaw cross-run | Research strategies from failures | Yes (skills) | No |
| AI Scientist | None within system | N/A | No | No |
| K-Dense BYOK | None | N/A | No | No |
| Google Co-Scientist | Unknown | Unknown | Unknown | Unknown |

The Zochi → Locus Learning Lineage

Zochi's continued learning extends beyond the system itself to inform its successor:

| System | Key Capability | Improvement Over Predecessor |
| --- | --- | --- |
| Zochi v1 (Mar 2025) | Workshop-level research | First autonomous AI publications |
| Zochi v2 (May 2025) | A* conference-level research | Quality leap from workshop to main |
| Locus (2025) | Surpasses human experts on RE-Bench | Multi-day campaigns, engineering tasks |

Locus's capabilities suggest that lessons from Zochi's research pipeline were transferred:

- RE-Bench: Locus scores 1.30 vs. human expert 1.27 over 64 hours
- KernelBench: State-of-the-art with 1.5x to 100x+ speedups
- MLE-Bench Lite: State-of-the-art on engineering tasks
- Key innovation: "Maintains consistent improvement over multiple days" — unlike systems that plateau after hours


15 Applications

Primary Application: Autonomous Scientific Research

Zochi's demonstrated applications span three distinct scientific domains:

| Application Domain | Paper | Contribution Type | Impact |
| --- | --- | --- | --- |
| AI / Representation Learning | CS-ReFT | Novel training methodology | AlpacaEval SOTA for parameter-efficient methods |
| AI Safety | Tempest | Vulnerability framework + discovery | Exposes fundamental safety weakness |
| Computational Biology | EGNN-Fusion | Efficient architecture | 95% parameter reduction |

Demonstrated Research Capabilities

Research Capability Matrix
│
│                    Literature │ Hypothesis │ Method  │ Implement │ Experiment │ Write
│                    Analysis   │ Generation │ Design  │           │            │
│  ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│  CS-ReFT           │    ✓     │     ✓     │    ✓    │     ✓     │     ✓      │   ✓
│  Tempest           │    ✓     │     ✓     │    ✓    │     ✓     │     ✓      │   ✓
│  EGNN-Fusion       │    ✓     │     ✓     │    ✓    │     ✓     │     ✓      │   ✓
│  MLE-Bench tasks   │    —     │     —     │    —    │     ✓     │     ✓      │   —
│  ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│  Coverage          │   3/3    │    3/3    │   3/3   │    4/4    │    4/4     │  3/3
│
│  ✓ = Demonstrated    — = Not applicable for this task type

Target Users

| User Segment | Use Case | Value Proposition |
|---|---|---|
| Research labs | Accelerate paper production | Hours/days instead of months |
| PhD students | Explore research directions | Autonomous literature survey + direction scoring |
| AI safety teams | Red-teaming and vulnerability discovery | Systematic, comprehensive testing |
| Biotech/pharma | Cross-domain computational methods | Efficient architectures for biology |
| Industry R&D | Novel method development | Competitive research output at frontier quality |
| Conference organizers | Quality assessment | Automated reviewer scoring |

Application Scenarios

Scenario 1: Novel Research Direction Exploration

Input: "novel jailbreaking methods"  (3 words)
│
└── Zochi pipeline:
    ├── Analyzes thousands of safety papers
    ├── Identifies multi-turn attack as underexplored
    ├── Designs tree search framework
    ├── Discovers partial compliance vulnerability
    ├── Implements Tempest framework
    ├── Achieves 100%/97% attack success
    ├── Writes conference paper
    └── Accepted at ACL 2025 (A*, top 8.2%)

Time: Days
Human involvement: Figures, citation formatting, minor edits
Cost: Estimated $30-500+ (API + compute)
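Zochi's internals are closed-source, so the pipeline above can only be sketched from the outside. A minimal, hypothetical orchestration loop is shown below; every stage function here is a stand-in stub for an undisclosed component, not an IntologyAI API:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Accumulates artifacts as the hypothetical pipeline advances."""
    prompt: str
    findings: list = field(default_factory=list)

# Stub stages standing in for the closed-source components.
def analyze_literature(state):
    state.findings.append("multi-turn attacks underexplored")

def design_method(state):
    state.findings.append("tree search over dialogue branches")

def run_experiments(state):
    state.findings.append("partial compliance accumulates across turns")

def write_paper(state):
    return f"Paper({state.prompt}): " + "; ".join(state.findings)

PIPELINE = [analyze_literature, design_method, run_experiments]

def run_pipeline(prompt: str) -> str:
    state = ResearchState(prompt)
    for stage in PIPELINE:
        stage(state)  # each stage enriches the shared state
    return write_paper(state)

print(run_pipeline("novel jailbreaking methods"))
```

The design choice the sketch illustrates is a shared research state threaded through sequential stages, which matches the linear literature-to-paper flow described above; how Zochi actually coordinates its stages is not documented.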

Scenario 2: Parameter-Efficient Method Design

Input: Research direction on cross-skill interference in PEFT
│
└── Zochi pipeline:
    ├── Identifies gap between weight-level and representation-level orthogonality
    ├── Designs CS-ReFT with orthonormal subspace transformations
    ├── Implements on Llama-2-7B
    ├── Evaluates on AlpacaEval: 93.94% win rate
    ├── Demonstrates 0.0098% parameter usage
    └── Published at ICLR 2025 SCOPE Workshop

Time: Days
Human involvement: Minimal
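CS-ReFT's exact formulation is in the paper; the core mechanism it builds on, editing hidden states inside a low-rank orthonormal subspace in the LoReFT style, can be sketched in NumPy. All names, weights, and dimensions below are illustrative stand-ins, not the published parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hidden size and subspace rank (illustrative)

# Orthonormal basis R (r x d): QR factorization of a random matrix
# gives orthonormal columns; transposing yields orthonormal rows.
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T
W = rng.normal(size=(r, d)) * 0.1  # learned projection (stand-in)
b = np.zeros(r)                    # learned bias (stand-in)

def loreft_edit(h):
    """LoReFT-style intervention: replace the component of h inside
    span(R) with a learned target; leave the orthogonal complement."""
    return h + R.T @ (W @ h + b - R @ h)

h = rng.normal(size=d)
h_new = loreft_edit(h)

# Outside the edited subspace, the hidden state is untouched:
P = R.T @ R                        # projector onto span(R)
off = np.eye(d) - P
assert np.allclose(off @ h_new, off @ h)
```

The orthonormality of R is what keeps per-skill edits confined to their own subspaces; CS-ReFT's contribution, per the paper, is making such subspace interventions composable across skills.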

Scenario 3: Cross-Domain Architecture Transfer

Input: Protein-nucleic acid binding site prediction
│
└── Zochi pipeline:
    ├── Analyzes computational biology literature
    ├── Identifies parameter inefficiency in existing methods
    ├── Transfers efficient architecture principles from AI domain
    ├── Designs EGNN-Fusion with 95% parameter reduction
    ├── Achieves competitive binding site prediction
    └── Under journal review

Significance: Demonstrates domain generality
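EGNN-Fusion's architecture is not public. As background, the E(n)-equivariant graph layer family its name suggests (EGNN, Satorras et al. 2021) can be sketched as follows; the single-matrix "MLPs" and all dimensions are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n, fh = 5, 8  # nodes, feature dim (illustrative)
W_e = rng.normal(size=(2 * fh + 1, fh)) * 0.1  # edge net stand-in
w_x = rng.normal(size=fh) * 0.1                # coordinate weight
W_h = rng.normal(size=(2 * fh, fh)) * 0.1      # node update stand-in

def egnn_layer(h, x):
    """One E(n)-equivariant message-passing step: messages depend only
    on invariant squared distances; coordinate updates are weighted
    sums of relative position vectors."""
    m_sum = np.zeros_like(h)
    x_new = x.copy()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)
            m = np.tanh(np.concatenate([h[i], h[j], [d2]]) @ W_e)
            m_sum[i] += m
            x_new[i] += (x[i] - x[j]) * (m @ w_x) / (n - 1)
    h_new = np.tanh(np.concatenate([h, m_sum], axis=1) @ W_h)
    return h_new, x_new

h, x = rng.normal(size=(n, fh)), rng.normal(size=(n, 3))

# Equivariance check: rotating the inputs rotates the coordinate outputs.
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]  # random orthogonal matrix
h1, x1 = egnn_layer(h, x @ Q.T)
h2, x2 = egnn_layer(h, x)
assert np.allclose(h1, h2) and np.allclose(x1, x2 @ Q.T)
```

Because geometry enters only through distances and relative vectors, such layers need no data augmentation for rotations and translations, which is one route to the kind of parameter efficiency the EGNN-Fusion result claims.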

Limitations and Risks

| Limitation | Impact | Mitigation |
|---|---|---|
| Closed-source | Cannot verify, extend, or reproduce the pipeline | Published papers are independently verifiable |
| Undisclosed architecture | Scientific community cannot build on the methods | Individual research outputs are documented |
| Human verification required | Not fully autonomous (figures, citations, edits) | Human oversight doubles as a safety mechanism |
| Ethical concerns | AI-generated papers at top venues | Transparent attribution; no AI authorship claims |
| Scalability unknown | No evidence of concurrent multi-project runs | Locus suggests improvement here |
| Cost unknown | Closed source prevents cost assessment | Likely competitive with human research costs |
| Generalization unknown | 3 domains demonstrated; broader generality unproven | Cross-domain capability is promising evidence |
| Reviewer gaming risk | System could learn to optimize for reviewer preferences | Separated validation engine mitigates this |

Ethical Framework

IntologyAI's stated ethical principles constitute the most developed framework among the autonomous research systems compared here:

| Principle | Implementation |
|---|---|
| No AI authorship | "We do not believe AI systems should be authors on papers, as they cannot take responsibility for their work" |
| Human verification | "Rigorous human verification of all research outputs" |
| Transparent attribution | Acknowledge AI contributions without claiming authorship |
| Responsible disclosure | Safety research (Tempest) follows responsible disclosure protocols |
| Venue engagement | "In discussion with workshop organizers of Zochi's accepted papers" |
| Human rebuttal | ACL rebuttal written manually, without Zochi involvement |

Comparison to Other Systems' Ethics Frameworks

| System | AI Authorship Policy | Human Verification | Venue Transparency |
|---|---|---|---|
| Zochi | No AI authorship | Required | Disclosed to organizers |
| AI Scientist | AI listed as author | Minimal | No disclosure policy |
| AutoResearchClaw | Not addressed | Configurable | Not addressed |
| EurekaClaw | Not addressed | Gate modes available | Not addressed |
| Google Co-Scientist | Not addressed | Built-in | Not addressed |

Strengths vs. Weaknesses Summary

| Strength | Weakness |
|---|---|
| Only AI system with an A* venue acceptance | Closed source: the pipeline is not reproducible |
| Multi-domain research capability (3 domains) | Undisclosed architecture limits the scientific contribution |
| Highest automated quality scores (7.67 avg) | Small team creates a sustainability risk |
| Minimal human involvement | Claims about autonomy level cannot be verified |
| Strong ethical framework | Ethical framework untested at scale |
| Discovered a novel vulnerability (partial compliance) | Individual papers, while good, are not groundbreaking |
| MLE-Bench performance without optimization | MLE-Bench evaluation was "exploratory" |
| Research iteration capability (Siege → Tempest) | Unclear whether iteration was human-guided or autonomous |
| Successor system (Locus) shows continued development | Locus makes Zochi potentially obsolete |

Future Trajectory

Based on IntologyAI's trajectory:

| Development | Status | Significance |
|---|---|---|
| Locus (successor) | Previewed 2025 | Surpasses human experts on RE-Bench |
| Multi-day campaigns | Locus capability | Week- to month-long research runs planned |
| Beta access | Sign-up available | Moving toward a product launch |
| Additional domains | Expected | Current 3-domain capability as a starting point |
| Journal publications | EGNN-Fusion under review | Expanding beyond conferences |

Impact Assessment

Zochi's significance extends beyond its individual research contributions. It represents a category-creating moment for AI research systems:

Before Zochi (pre-2025):
├── AI could generate paper-shaped artifacts
├── Quality insufficient for top-tier venues
├── "AI research" = demonstrations, not contributions
└── Gap between AI output and human research was large

After Zochi (2025):
├── AI output accepted at highest-tier venues
├── Quality comparable to strong human submissions
├── "Artificial Scientist" as a legitimate category
├── Gap between AI and human research is closing rapidly
└── Locus suggests trajectory toward surpassing humans

Implications:
├── Publication norms need updating (attribution, review)
├── Research acceleration possible (hours/days vs months)
├── Multi-domain research becomes feasible for small teams
├── AI safety research can be systematically automated
└── The research community must adapt to AI-generated science

Appendix A: Publication Record

| Paper | Venue | Type | Scores | Status |
|---|---|---|---|---|
| CS-ReFT | SCOPE @ ICLR 2025 | Workshop poster | (6, 7, 6) | Accepted |
| Siege | Building Trust @ ICLR 2025 | Tiny paper | (7, 7) | Accepted |
| Tempest | ACL 2025 | Main proceedings | Meta: 4 (top 8.2%) | Accepted |
| EGNN-Fusion | Journal (undisclosed) | Full paper | N/A | Under review |

Appendix B: Benchmark Summary

| Benchmark | Metric | Zochi Result | Best Baseline |
|---|---|---|---|
| AlpacaEval (CS-ReFT) | Win rate | 93.94% | GPT-3.5-T: 86.30% |
| JailbreakBench (Tempest) | Success vs. GPT-3.5-T | 100% | Crescendo: lower |
| JailbreakBench (Tempest) | Success vs. GPT-4 | 97% | GOAT: lower |
| NeurIPS Auto-Review | Paper quality | 8, 8, 7 (avg 7.67) | Other AI systems: ~4 |
| MLE-Bench | > median human | 80% of tasks | Agent Lab: lower |
| MLE-Bench | Medal rate | 50% of tasks | AIDE: 8.7% |
| EGNN-Fusion | Parameter reduction | 95% | — |

Appendix C: Comparison with All Major AI Research Systems (as of April 2026)

| Dimension | Zochi | AI Scientist | Co-Scientist | AIRA₂ | AutoResearchClaw | EurekaClaw |
|---|---|---|---|---|---|---|
| Organization | IntologyAI | Sakana AI | Google DeepMind | Meta FAIR | AIMING Lab | Single lab |
| Team size | 4 | ~6 | ~20+ | 25 | 16 | 8 |
| Open source | No | Yes | No | Partial | Yes (MIT) | Yes (Apache 2.0) |
| A* venue | Yes (ACL) | No | No | No | No | No |
| Workshop venue | Yes (ICLR) | Yes | No | No | Not yet | Not yet |
| Domains | 3+ | 1/run | Biomedical | NLP/ML | Configurable | Mathematics |
| Auto NeurIPS score | 7.67 | ~4 | N/A | N/A | N/A | N/A |
| Human involvement | Minimal | Similar | Significant | Moderate | Full automation | Configurable |
| Learning | System evolution | None | Unknown | Tournament | MetaClaw skills | 4-tier + skills |
| MLE-Bench | 80% > median, 50% medal | N/A | N/A | N/A | N/A | N/A |
| Ethical framework | Comprehensive | Basic | Minimal | Basic | None stated | None stated |
| Successor | Locus | None | None | None | MetaClaw | None |

This analysis was compiled from publicly available sources including the Zochi Technical Report, IntologyAI blog posts, OpenReview submissions, arXiv papers, and third-party coverage. All claims about system internals are marked as inferred where architectural details are not publicly documented.