Zochi
The first AI system to achieve acceptance at an A* scientific conference (ACL 2025), autonomously conducting end-to-end research from literature analysis to peer-reviewed publication across multiple domains.
Organization: IntologyAI (San Francisco-based AI research lab)
Published: March 17, 2025 (tech report); May 27, 2025 (ACL 2025 acceptance announced)
Type: Technical Report + closed-source system
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Zochi Technical Report: The First Artificial Scientist
- Repository: github.com/IntologyAI/Zochi (papers + artifacts only; system code is closed-source)
- Technical Report PDF: Zochi_Technical_Report.pdf
- Blog posts: Tech Report, ACL Acceptance
- Stars: ~305 (as of April 2026)
- License: Not specified (closed-source system; published paper artifacts under CC BY 4.0)
- Tagline: "The First Artificial Scientist"
- Successor system: Locus (previewed 2025; surpasses human experts on RE-Bench)
Naming and Branding
The name "Zochi" does not appear to reference an existing acronym or abbreviation. IntologyAI uses the term "Artificial Scientist" as a category label — distinguishing systems that autonomously conduct the full scientific method from "AI research assistants" that support individual phases. Zochi is positioned as the inaugural member of this category, with Locus as its successor.
Historical Significance
Zochi holds several firsts in the AI-for-science landscape:
| Milestone | Date | Significance |
|---|---|---|
| First AI system with workshop publications | Mar 2025 | ICLR 2025 workshops (CS-ReFT, Siege) |
| First AI system with A* conference acceptance | May 2025 | ACL 2025 main proceedings (Tempest) |
| First multi-domain AI research system | Mar 2025 | Produced papers in AI, safety, and computational biology |
| First AI system with meta-review in top 10% | May 2025 | ACL meta-review score 4 = top 8.2% of submissions |
Lineage and Positioning
AI Research Systems Timeline
│
├── 2024
│ ├── AI Scientist (Sakana AI) — first end-to-end pipeline, workshop-only
│ ├── Agent Laboratory — framework for AI-assisted research
│ └── AIDE — ML engineering agent (Kaggle)
│
├── 2025 (early)
│ ├── Zochi (IntologyAI) — first A* venue acceptance ← this system
│ ├── Google Co-Scientist — Gemini-based, biomedical focus
│ └── AIRA₂ (Meta FAIR) — agentic iterative research assistant
│
├── 2025 (mid)
│ ├── Locus (IntologyAI) — Zochi successor, surpasses humans on RE-Bench
│ └── Tempest → ACL 2025 — Zochi's A* publication milestone
│
└── 2026
├── AutoResearchClaw — 23-stage open-source pipeline
├── EurekaClaw — mathematical theorem proving
└── K-Dense Co-Scientist — bring-your-own-key research agent
Unique Position in the Ecosystem
Zochi is distinguished from every other system in the autoresearch landscape by a single fact: real peer review at the highest tier. While AI Scientist (Sakana AI) demonstrated that LLMs could produce paper-shaped artifacts, and Google Co-Scientist showed Gemini-powered hypothesis generation, Zochi is the only system whose fully autonomous output survived the ~20% acceptance rate filter of a CORE A*-ranked conference. This is a qualitative threshold that separates "demonstration" from "contribution."
Quality Validation Hierarchy
│
├── Self-evaluation only ──────────── AI Scientist, Agent Laboratory
│ └── LLM-as-judge or automated metrics
│
├── Workshop acceptance (~60-70%) ─── Zochi (ICLR 2025), AI Scientist
│ └── Lower bar, shorter papers, less rigorous review
│
├── Main conference acceptance (~20%) ─ Zochi (ACL 2025) ← only system here
│ └── Full peer review, rebuttals, meta-review
│
└── Journal publication ──────────── EGNN-Fusion (under review)
└── Extended review cycles, revision rounds
2 Authors and Team
| Author | Role (Inferred) |
|---|---|
| Andy Zhou | Lead developer / system architect / first author on all Zochi papers |
| Ron Arel | Co-founder / co-author on all Zochi papers |
| Soren Dunn | Core contributor |
| Nikhil Khandekar | Core contributor |
BibTeX Citations
@article{zhou2025zochi,
title = {Zochi Technical Report: The First Artificial Scientist},
author = {Zhou, Andy and Arel, Ron and Dunn, Soren and Khandekar, Nikhil},
year = {2025},
url = {https://github.com/IntologyAI/Zochi/blob/main/Zochi_Technical_Report.pdf}
}
@inproceedings{zhou2025tempest,
title = {Tempest: Autonomous Multi-Turn Jailbreaking of Large Language
Models with Tree Search},
author = {Zhou, Andy and Arel, Ron},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)},
year = {2025},
url = {https://arxiv.org/abs/2503.10619}
}
@inproceedings{zhou2025csreft,
title = {Compositional Subspace Representation Fine-tuning for
Adaptive Large Language Models},
author = {Zhou, Andy and Arel, Ron},
booktitle = {SCOPE Workshop, ICLR 2025},
year = {2025},
url = {https://openreview.net/forum?id=YqYcm0mpFp}
}
Team composition: IntologyAI is a small, focused research lab — 4 named contributors compared to 16 at AutoResearchClaw (AIMING Lab) or 25 at Meta FAIR's AIRA₂. The compact team size is notable: Zochi's quality-per-headcount ratio is the highest in the autoresearch space, achieving more validated research impact with fewer people than any comparable system.
Institutional context: IntologyAI is a San Francisco-based startup (some early sources described it as London-based; current materials list San Francisco). Unlike academic lab projects (AutoResearchClaw, EurekaClaw) or big-tech teams (Google Co-Scientist, AIRA₂), IntologyAI is a venture-backed startup whose entire product identity is "Artificial Scientists." This commercial focus explains both the closed-source nature and the aggressive milestone pursuit.
Team Size Comparison
| System | Team Size | Organization Type | Code Availability |
|---|---|---|---|
| Zochi | 4 | Startup | Closed-source |
| AI Scientist | ~6 | Research lab (Sakana AI) | Open-source |
| Google Co-Scientist | ~20+ | Big tech (Google DeepMind) | Closed-source |
| AIRA₂ | 25 | Big tech (Meta FAIR) | Partially open |
| AutoResearchClaw | 16 | Academic (multi-university) | Open-source (MIT) |
| EurekaClaw | 8 | Academic (single lab) | Open-source (Apache 2.0) |
3 Core Contribution
Zochi's core contribution is twofold: (1) a complete autonomous research pipeline that emulates the scientific method from literature analysis through peer-reviewed publication, and (2) empirical proof that AI systems can produce research accepted at the highest tier of peer review.
The Research Quality Gap
The most striking aspect of Zochi is not its architecture (which remains largely undisclosed) but the measurable quality gap between its outputs and those of all other AI research systems:
| System | Automated NeurIPS Reviewer Score | Real Peer Review | Venue Tier |
|---|---|---|---|
| Zochi | 8, 8, 7 (avg 7.67) | Yes — accepted | A* (ACL 2025) |
| AI Scientist (Sakana) | ~4 (average) | Workshop only | C (workshops) |
| Agent Laboratory | ~3-4 | No | N/A |
| AIDE | N/A (engineering, not research) | No | N/A |
| OpenHands | N/A | No | N/A |
This quality gap (~3.67 points on a 10-point scale) is significant. The NeurIPS guidelines scale treats 6 as the acceptance threshold for a top ML venue; Zochi averages 7.67 while competing systems cluster around 3-4.
Three Research Contributions
Unlike systems that demonstrate capability on toy domains (2D diffusion, toy language models), Zochi produced three substantive research contributions across distinct fields:
Zochi's Research Portfolio
│
├── 1. CS-ReFT (AI / Parameter-Efficient Fine-Tuning)
│ ├── Domain: Representation learning for LLMs
│ ├── Contribution: Orthonormal subspace edits in hidden states
│ ├── Result: 93.94% AlpacaEval win rate (> GPT-3.5-Turbo)
│ ├── Efficiency: 0.0098% of model parameters
│ ├── Venue: SCOPE Workshop, ICLR 2025
│ └── Reviewer scores: (6, 7, 6) — avg 6.33
│
├── 2. Siege → Tempest (AI Safety / Red Teaming)
│ ├── Domain: LLM safety, adversarial attacks
│ ├── Contribution: Tree-search multi-turn jailbreaking
│ ├── Result: 100% on GPT-3.5-T, 97% on GPT-4
│ ├── Discovery: "Partial compliance" vulnerability pattern
│ ├── Venue: ICLR 2025 Workshop → ACL 2025 Main
│ ├── Workshop scores: (7, 7) — avg 7.0
│ └── ACL meta-review: score 4 = top 8.2% of submissions
│
└── 3. EGNN-Fusion (Computational Biology)
├── Domain: Protein-nucleic acid binding site prediction
├── Contribution: Efficient EGNN architecture for binding sites
├── Result: 95% parameter reduction, competitive performance
├── Venue: Under journal review
└── Significance: Cross-domain capability demonstration
Differentiating Capabilities
| Capability | Zochi | AI Scientist | Co-Scientist | AutoResearchClaw |
|---|---|---|---|---|
| End-to-end autonomous research | Yes | Yes | Partial | Yes |
| Multi-domain research | 3 domains | 1 domain per run | Biomedical only | Configurable |
| A* venue acceptance | Yes (ACL) | No | No | No |
| Workshop acceptance | Yes (ICLR) | Yes (workshops) | No | Not yet |
| Real peer review survived | Yes | No (self-generated reviews) | No | No |
| Cross-domain transfer | Yes (AI → bio) | No | No | No |
| Automated quality (NeurIPS score) | 7.67 | ~4 | N/A | N/A |
| MLE-Bench engineering | 80% > median | N/A | N/A | N/A |
| Human involvement | Figures, citations, minor edits | Similar | Significant | None (fully automated) |
Problem Complexity Spectrum
An underappreciated aspect of Zochi's contribution is the complexity of problems tackled relative to other systems:
Problem Complexity Spectrum
│
│ Simple ◄──────────────────────────────────► Complex
│
│ ├─ AI Scientist: 2D diffusion, toy LMs, specific cognitive biases
│ │ └── Constrained problem spaces with clear metrics
│ │
│ ├─ Agent Laboratory: Predefined research templates
│ │ └── Structured task decomposition
│ │
│ ├─ AIDE / OpenHands: Kaggle competitions (engineering)
│ │ └── Well-defined objectives with leaderboard scores
│ │
│ └─ Zochi: Open-ended research challenges
│ ├── CS-ReFT: Novel method design + theoretical motivation
│ ├── Tempest: Framework design + vulnerability discovery
│ └── EGNN-Fusion: Cross-domain architecture design
│ └── Each requires novel methodology, not just optimization
4 Supported Solutions
Research Pipeline Phases
Based on the technical report and blog descriptions, Zochi supports the following research phases:
| Phase | Description | Automation Level |
|---|---|---|
| Literature Analysis | Ingests and analyzes thousands of research papers | Fully autonomous |
| Gap Identification | Identifies non-obvious connections and limitations | Fully autonomous |
| Hypothesis Generation | Proposes innovative solutions to identified gaps | Fully autonomous |
| Method Design | Designs novel methods and architectures | Fully autonomous |
| Implementation | Autonomously implements proposed methods | Fully autonomous |
| Experiment Design | Designs controlled experiments with ablation studies | Fully autonomous |
| Experiment Execution | Runs experiments, parallelized across multiple trials | Fully autonomous |
| Validation | Generates evaluation scripts on standardized datasets | Fully autonomous |
| Result Analysis | Interprets results and draws conclusions | Fully autonomous |
| Manuscript Preparation | Generates full research paper | Mostly autonomous |
| Figure Creation | Creating publication-quality figures | Human |
| Citation Formatting | Formatting citations and references | Human |
| Minor Edits | Formatting fixes, minor writing corrections | Human |
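The automation split in the table above can be captured as data. The following is a minimal sketch; the `Phase` type, phase names, and automation labels are this report's own encoding of the table, not IntologyAI's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    automation: str  # "autonomous" | "mostly" | "human"

# The thirteen phases from the table, in pipeline order.
PIPELINE = [
    Phase("literature_analysis", "autonomous"),
    Phase("gap_identification", "autonomous"),
    Phase("hypothesis_generation", "autonomous"),
    Phase("method_design", "autonomous"),
    Phase("implementation", "autonomous"),
    Phase("experiment_design", "autonomous"),
    Phase("experiment_execution", "autonomous"),
    Phase("validation", "autonomous"),
    Phase("result_analysis", "autonomous"),
    Phase("manuscript_preparation", "mostly"),
    Phase("figure_creation", "human"),
    Phase("citation_formatting", "human"),
    Phase("minor_edits", "human"),
]

def autonomous_fraction(pipeline):
    """Share of phases that run without any human touchpoint."""
    return sum(p.automation == "autonomous" for p in pipeline) / len(pipeline)
```

Nine of the thirteen listed phases are fully autonomous; only cosmetic, late-pipeline steps require a human.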
Solution Categories
Zochi Solution Architecture
│
├── Research Discovery Solutions
│ ├── Large-scale literature retrieval and analysis
│ ├── Cross-paper pattern identification
│ ├── Research gap detection
│ └── Direction scoring and selection
│
├── Method Innovation Solutions
│ ├── Novel architecture design (EGNN-Fusion)
│ ├── Novel training methodology design (CS-ReFT)
│ ├── Novel adversarial framework design (Tempest)
│ └── Cross-domain knowledge transfer
│
├── Experimental Validation Solutions
│ ├── Controlled experiment design
│ ├── Ablation study generation
│ ├── Multi-trial parallelized execution
│ ├── Automated validation script generation
│ └── Standardized dataset evaluation
│
└── Publication Solutions
├── Full paper generation (LaTeX)
├── Technical writing at conference quality
└── Reviewer response preparation (manual for ACL)
Domain Flexibility
Unlike domain-locked systems (EurekaClaw → mathematics, Google Co-Scientist → biomedicine), Zochi demonstrates domain generality across three distinct fields:
| Domain | Paper | Technical Approach | Result Quality |
|---|---|---|---|
| AI / Representation Learning | CS-ReFT | Orthonormal subspace edits + router | ICLR workshop accepted |
| AI Safety | Siege → Tempest | Tree search + partial compliance tracking | ACL 2025 main (A*) |
| Computational Biology | EGNN-Fusion | Equivariant GNN architecture design | Journal under review |
This domain breadth is achieved without domain-specific plugins or handcrafted tool suites — Zochi's pipeline is general enough to produce contributions across fundamentally different fields, from model fine-tuning to protein structure prediction.
5 LLM Integration
Model Information
The Zochi technical report does not explicitly disclose which LLM backbone powers the system. However, several inferences can be made:
| Aspect | Assessment | Evidence |
|---|---|---|
| Primary model | Likely Claude or GPT-4 class | Quality of generated text, reasoning depth |
| Code generation | High-quality autonomous implementation | CS-ReFT and Tempest are fully implemented |
| Multi-model | Possibly — literature analysis may use different model from code generation | Cost optimization for high-volume lit review |
| Fine-tuned | Unknown | Closed-source; no indication of custom training |
LLM Usage Patterns
Based on the system's capabilities, Zochi likely uses LLM calls across multiple pipeline stages:
LLM Call Distribution (Inferred)
│
├── Literature Analysis
│ ├── Paper summarization (high volume, lower complexity)
│ ├── Gap identification (cross-paper reasoning)
│ └── Direction scoring (comparative judgment)
│ Estimated: 40-60% of total tokens
│
├── Method Design
│ ├── Hypothesis generation (creative reasoning)
│ ├── Architecture design (technical depth)
│ └── Novel approach formulation
│ Estimated: 10-15% of total tokens
│
├── Implementation
│ ├── Code generation (implementation from design)
│ ├── Debugging and iteration
│ └── Test case generation
│ Estimated: 15-25% of total tokens
│
├── Experimentation
│ ├── Experiment script generation
│ ├── Result interpretation
│ └── Ablation study design
│ Estimated: 5-10% of total tokens
│
└── Writing
├── Paper drafting (structured writing)
├── Technical exposition
└── Related work synthesis
Estimated: 10-15% of total tokens
Validation Engine
A distinctive feature of Zochi's LLM integration is the automatic validation engine:
"Our automatic validation engine generates evaluation scripts based on standardized datasets that remain unmodified throughout testing, ensuring results reflect genuine improvements."
This implies a separation of concerns:
1. The generation LLM produces methods and code.
2. The validation engine independently generates evaluation scripts.
3. The standardized datasets are not modified by the generation process.
This architectural choice prevents the common failure mode where AI systems inadvertently optimize for their own evaluation metrics rather than genuine performance improvements.
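A minimal sketch of this separation, assuming a checksum guard over a frozen benchmark. The function names and the toy dataset are invented for illustration; Zochi's actual validation engine is undisclosed:

```python
import hashlib
import json

# Illustrative sketch (not IntologyAI's code): evaluation data is frozen
# and fingerprinted before generation starts, so the generation side
# cannot silently optimize against a mutated benchmark.
def fingerprint(dataset):
    """Stable hash of an evaluation dataset."""
    blob = json.dumps(dataset, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def run_validated(method, dataset, expected_fp):
    """Refuse to score a method if the benchmark has been touched."""
    if fingerprint(dataset) != expected_fp:
        raise RuntimeError("evaluation dataset was modified")
    return sum(method(x) == y for x, y in dataset) / len(dataset)

# Usage: freeze the benchmark once, before any generation happens.
bench = [(1, 2), (2, 4), (3, 6)]   # toy (input, label) pairs
frozen = fingerprint(bench)
accuracy = run_validated(lambda x: 2 * x, bench, frozen)
```

Any attempt to evaluate against a tampered dataset fails loudly instead of producing an inflated score.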
Comparison to Other Systems' LLM Integration
| System | LLM Backend | Multi-Model | Disclosed | Custom Prompts |
|---|---|---|---|---|
| Zochi | Undisclosed (likely frontier model) | Unknown | No | Unknown |
| AI Scientist | Claude 3.5 / GPT-4 | Yes (configurable) | Yes | Yes (open) |
| Google Co-Scientist | Gemini | Yes (multi-agent) | Partially | No |
| AIRA₂ | Llama-based | Yes | Yes | Yes |
| AutoResearchClaw | Configurable | Yes | Yes (open) | Yes (open) |
| EurekaClaw | Claude Sonnet (default) | Configurable | Yes (open) | Yes (open) |
Zochi's closed-source nature means its LLM integration details remain proprietary. This is both a limitation for scientific reproducibility and a competitive advantage — the prompts, model selection strategies, and chain-of-thought patterns that produce A*-quality research are IntologyAI's core intellectual property.
6 Key Results
Headline Results Summary
| Metric | Value | Context |
|---|---|---|
| ACL 2025 acceptance | Main proceedings | First AI system at A* venue; ~21.3% acceptance rate |
| ACL meta-review score | 4 | Top 8.2% of all ACL submissions |
| ICLR 2025 workshops | 2 papers accepted | CS-ReFT (SCOPE) + Siege (Building Trust) |
| Automated NeurIPS scores | 8, 8, 7 (avg 7.67) | Acceptance threshold = 6; other AI systems average ~4 |
| MLE-Bench (exploratory) | 80% > median human; 50% medal | Without task-specific optimization |
| CS-ReFT AlpacaEval | 93.94% win rate | Surpasses GPT-3.5-Turbo (86.30%) |
| CS-ReFT efficiency | 0.0098% parameters | 12.7x fewer than LoRA |
| Tempest vs GPT-3.5-T | 100% attack success | JailbreakBench dataset |
| Tempest vs GPT-4 | 97% attack success | Fewer queries than Crescendo/GOAT |
| EGNN-Fusion efficiency | 95% parameter reduction | Competitive binding site prediction |
Detailed Results by Paper
Paper 1: CS-ReFT (ICLR 2025 SCOPE Workshop)
Problem: Cross-skill interference in parameter-efficient fine-tuning — improvements on one task degrade performance on others.
Method: Learns multiple orthonormal subspace transformations in hidden-state representations, each specializing in a distinct skill, composed via a lightweight router.
| Metric | CS-ReFT (Llama-2-7B) | GPT-3.5-Turbo | LoRA | ReFT (base) |
|---|---|---|---|---|
| AlpacaEval win rate | 93.94% | 86.30% | ~85% | ~88% |
| Parameters used | 0.0098% | N/A | ~0.12% | ~0.06% |
| Cross-task interference | Minimal | N/A | Moderate | Moderate |
Technical innovation: Unlike LoRA and similar methods that impose orthogonality at the weight level, CS-ReFT applies orthonormality constraints at the hidden-state level. This more directly addresses interference where it manifests — in the model's internal representations rather than in parameter space.
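To make the orthonormality idea concrete, here is a toy Gram-Schmidt sketch in pure Python. It illustrates the property CS-ReFT relies on — mutually orthogonal, unit-norm subspace directions so that one skill's edit has zero projection onto another's — and is not the paper's implementation:

```python
import math

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors (plain lists of floats)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # Subtract the projection onto each existing basis direction.
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two "skill" directions in a toy 3-d hidden space, made orthonormal.
R = gram_schmidt([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
```

After orthonormalization, an edit along `R[0]` leaves the component along `R[1]` untouched — the geometric reason cross-skill interference is minimized.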
Reviewer assessment (SCOPE Workshop, ICLR 2025):
| Reviewer | Score | Key Comments |
|---|---|---|
| Reviewer 1 | 6 | Effective approach, addresses critical limitation of ReFT |
| Reviewer 2 | 7 | "Clever idea"; strong empirical results |
| Reviewer 3 | 6 | Solid contribution to parameter-efficient methods |
Paper 2: Siege → Tempest (ICLR 2025 Workshop → ACL 2025)
Problem: Existing jailbreaking methods rely on single carefully crafted prompts; multi-turn attacks are understudied.
Method: Tree search over conversation branches, tracking partial compliance across turns and re-injecting policy leaks into subsequent queries.
Tempest Tree Search Mechanism
│
│ Turn 1: "Tell me about chemistry"
│ ├── Branch A: [Safe response] → partial compliance detected
│ ├── Branch B: [Deflection] → pruned
│ └── Branch C: [Partial info] → promising
│
│ Turn 2 (from Branch C): "Expand on the synthesis process"
│ ├── Branch C1: [More detail] → partial compliance ↑
│ ├── Branch C2: [Refusal] → pruned
│ └── Branch C3: [Fragment reveals] → EXPLOIT
│
│ Turn 3 (from Branch C3): Re-inject fragments + escalate
│ └── Branch C3a: [Full compliance] → JAILBREAK COMPLETE
│
│ Key insight: "Partial compliance" — models reveal fragments
│ of restricted information while appearing to maintain safety
│ guardrails. These fragments accumulate across turns.
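The branch-expand-prune loop in the diagram can be written as a short search routine. The compliance scorer, branching width, and thresholds below are invented stand-ins for illustration; Tempest's actual components are described only at the level shown above:

```python
def tree_search(respond, score, root_prompt, max_turns=3, width=3, prune_below=0.2):
    """Expand conversation branches, prune low-compliance ones, and stop
    when any branch's accumulated partial compliance reaches 1.0."""
    frontier = [([root_prompt], 0.0)]  # (conversation so far, compliance)
    for _ in range(max_turns):
        next_frontier = []
        for convo, compliance in frontier:
            for branch in range(width):
                reply = respond(convo, branch)
                new_score = compliance + score(reply)  # fragments accumulate
                if new_score >= 1.0:
                    return convo + [reply], new_score  # jailbreak complete
                if new_score >= prune_below:           # keep promising branches
                    next_frontier.append((convo + [reply], new_score))
        frontier = next_frontier or frontier           # never empty the frontier
    return max(frontier, key=lambda t: t[1])

# Toy target model: one branch per turn leaks a fragment worth 0.4.
leaky = lambda convo, b: "fragment" if b == 2 else "refusal"
scorer = lambda reply: 0.4 if reply == "fragment" else 0.0
convo, total = tree_search(leaky, scorer, "Tell me about chemistry")
```

Against this toy target, three accumulated fragments cross the compliance threshold — the same escalation pattern the diagram labels "JAILBREAK COMPLETE".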
| Target Model | Tempest Attack Success | Queries Used (vs. baselines) | Crescendo Success Rate | GOAT Success Rate |
|---|---|---|---|---|
| GPT-3.5-Turbo | 100% | Fewer | Lower | Lower |
| GPT-4 | 97% | Fewer | Lower | Lower |
Evolution from Siege to Tempest: The ACL version significantly expanded the ICLR workshop paper:
| Aspect | Siege (ICLR Workshop) | Tempest (ACL 2025) |
|---|---|---|
| Paper length | 2-4 pages (Tiny Paper) | Full conference paper |
| Experiments | JailbreakBench | Expanded evaluations |
| Methodology | Core tree search | Enhanced with cross-branch learning |
| Contribution depth | Proof of concept | Comprehensive framework |
| Review scores | (7, 7) | Meta-review: 4 (top 8.2%) |
ACL Acceptance Context:
| Metric | Value | Significance |
|---|---|---|
| ACL 2025 acceptance rate | ~21.3% | Highly selective |
| Meta-review score | 4 | Top 8.2% of all submissions |
| CORE ranking | A* | Highest tier of scientific venue |
| Google Scholar ranking | Top 40 globally | Among most impactful venues in all CS |
Paper 3: EGNN-Fusion (Under Journal Review)
Problem: State-of-the-art protein-nucleic acid binding site prediction requires enormous model parameters.
Method: Efficient equivariant graph neural network architecture that achieves competitive performance with 95% fewer parameters.
| Metric | EGNN-Fusion | State-of-the-Art Baselines |
|---|---|---|
| Parameter count | 5% of baseline | 100% (reference) |
| Binding site prediction | Competitive | Reference level |
| Equivariance | E(3)-equivariant | Varies by method |
Significance: This paper's primary role is as a cross-domain capability proof. The fact that the same AI system that designed a parameter-efficient fine-tuning method for LLMs also designed an efficient protein structure prediction architecture demonstrates genuine domain generality — not just re-skinning the same approach across similar problems.
MLE-Bench Performance (Exploratory)
| Metric | Zochi | AIDE | OpenHands | Agent Lab |
|---|---|---|---|---|
| Surpass median human | 80% of tasks | — | — | — |
| Medal rate | 50% of tasks | 8.7% (any medal) | 4.4% | — |
| Task-specific optimization | None | None | None | None |
The MLE-Bench results are particularly notable because Zochi was evaluated without any task-specific optimization — the same general-purpose research pipeline was applied to Kaggle-style engineering challenges, demonstrating transfer from research to engineering tasks.
Automated Quality Assessment
Zochi uses an automated reviewer based on NeurIPS conference guidelines to benchmark paper quality:
Automated Reviewer Score Distribution
│
│ 10 ─┤
│ 9 ─┤
│ 8 ─┤ ██ ██ Zochi papers (8, 8, 7)
│ 7 ─┤ ██ ██ ██
│ 6 ─┤──██──██──██────── acceptance threshold ──────────
│ 5 ─┤
│ 4 ─┤ ░░ ░░ Other AI systems (~4 avg)
│ 3 ─┤ ░░ ░░
│ 2 ─┤
│ 1 ─┤
│ └──────────────────
│ Zochi Others
│
│ Legend: ██ = Zochi ░░ = AI Scientist, Agent Lab, etc.
│ The ~3.67-point gap represents a qualitative leap
│ from "rejected" to "strong accept" territory.
Quality Gap Analysis
The quality gap between Zochi and other AI research systems deserves careful examination:
| Quality Dimension | Zochi | Typical AI-Generated Papers |
|---|---|---|
| Problem selection | Open-ended, frontier challenges | Constrained, predefined tasks |
| Technical novelty | Novel methods (orthonormal subspaces, tree-search jailbreaking) | Incremental variations |
| Experimental rigor | Controlled experiments, ablations, multiple trials | Basic comparisons |
| Writing quality | Near-publication quality (minor edits needed) | Significant editing required |
| Domain awareness | Deep understanding of related work | Surface-level citations |
| Result significance | State-of-the-art on standard benchmarks | Toy-scale demonstrations |
7 Reproducibility
Open Artifacts
| Artifact | Available | Location |
|---|---|---|
| Technical report | Yes | PDF on GitHub |
| CS-ReFT paper | Yes | OpenReview |
| Tempest paper | Yes | arXiv:2503.10619 |
| Siege workshop paper | Yes | OpenReview |
| System code | No | Closed-source |
| Prompt templates | No | Closed-source |
| Pipeline configuration | No | Closed-source |
| Model weights / fine-tunes | No | Closed-source |
| Experiment code (papers) | Partial | GitHub repository |
| Datasets used | Standard | AlpacaEval, JailbreakBench, protein datasets |
Reproducibility Assessment
| Factor | Rating | Details |
|---|---|---|
| System reproducibility | Very Low | Closed-source; no installation, no configuration, no pipeline |
| Paper result reproducibility | Medium | Standard datasets, published methods, partial code |
| Method reproducibility | Medium-High | CS-ReFT and Tempest are clearly described; independent implementation possible |
| Evaluation reproducibility | Medium | NeurIPS automated reviewer is a known methodology; MLE-Bench is open |
| Peer review validation | High | ACL and ICLR reviews are public records |
Reproducibility Comparison
| System | Code Available | Can Reproduce Pipeline | Can Reproduce Results |
|---|---|---|---|
| Zochi | No | No | Partially (published papers only) |
| AI Scientist | Yes | Yes | Yes (with API keys) |
| AutoResearchClaw | Yes | Yes | Yes (with API keys) |
| EurekaClaw | Yes | Yes | Yes (with API keys) |
| Google Co-Scientist | No | No | No |
| K-Dense BYOK | Yes | Yes | Yes (with API keys) |
What Can Be Reproduced
- CS-ReFT: The method is described with enough detail to reimplement. The AlpacaEval benchmark is public. The orthonormal subspace transformation approach is straightforward to implement given the paper.
- Tempest: The tree search over conversation branches with partial compliance tracking is well-specified. JailbreakBench is public. The core algorithm (BFS over adversarial prompt branches) could be reimplemented.
- EGNN-Fusion: The equivariant GNN architecture is described. Protein-nucleic acid binding datasets are standard.
What Cannot Be Reproduced
- The research pipeline itself — How Zochi selects research directions, generates hypotheses, designs methods, and writes papers
- The literature analysis system — How thousands of papers are ingested, analyzed, and patterns extracted
- The validation engine — How evaluation scripts are automatically generated
- The meta-cognitive layer — How Zochi decides which ideas are promising enough to pursue
- The quality calibration — What makes Zochi produce 7.67-quality papers when others produce ~4
This reproducibility gap is the most significant criticism of Zochi from a scientific perspective. While the individual papers are reproducible, the system that produces papers is not — making it impossible for the research community to verify, extend, or improve upon the core methodology.
8 Compute and API Costs
Cost Model (Inferred)
Since Zochi is closed-source, cost estimates must be inferred from the described capabilities:
Estimated Cost Model
│
│ Cost per paper ≈ Literature_Analysis + Method_Design + Implementation
│ + Experimentation + Writing + Validation
│
│ Literature_Analysis:
│ ├── "Thousands of papers" analyzed
│ ├── At ~500 tokens/paper summary × 2,000 papers = 1M tokens input
│ ├── Plus gap analysis and direction scoring: ~200K tokens
│ └── Subtotal: ~1.2M tokens (mostly input)
│
│ Method_Design + Implementation:
│ ├── Hypothesis generation: ~50K tokens
│ ├── Architecture design iteration: ~100K tokens
│ ├── Code generation + debugging: ~200K tokens
│ └── Subtotal: ~350K tokens (mix of input/output)
│
│ Experimentation:
│ ├── Experiment script generation: ~50K tokens
│ ├── Result interpretation: ~100K tokens
│ ├── Ablation study design: ~50K tokens
│ └── Subtotal: ~200K tokens
│ Plus compute: GPU hours for training/evaluation
│
│ Writing + Validation:
│ ├── Paper generation: ~100K tokens
│ ├── Validation script generation: ~50K tokens
│ └── Subtotal: ~150K tokens (mostly output)
│
│ TOTAL ESTIMATED: ~1.9M tokens per paper
│ At ~$15/M tokens (frontier model): ~$30 in API costs
│ Plus GPU compute for experiments: varies ($10-$500+)
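The arithmetic in the estimate above, reproduced as a check. All per-stage token counts and the $15/M-token price are this section's rough assumptions, not disclosed figures:

```python
# Back-of-envelope cost model from the breakdown above (assumed values).
stage_tokens = {
    "literature_analysis": 1_200_000,
    "method_design_and_implementation": 350_000,
    "experimentation": 200_000,
    "writing_and_validation": 150_000,
}
price_per_million = 15.0  # USD per million tokens, frontier-model blended rate

total_tokens = sum(stage_tokens.values())                # 1.9M tokens
api_cost = total_tokens / 1_000_000 * price_per_million  # about $30
```

GPU compute for experiments is additive on top of this and varies by orders of magnitude with the training workload.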
Timeline Estimates
| Phase | Estimated Duration | Bottleneck |
|---|---|---|
| Literature analysis | Hours | API rate limits, paper retrieval |
| Hypothesis + method design | Hours | LLM reasoning depth |
| Implementation | Hours to day | Code complexity, debugging cycles |
| Experimentation | Hours to days | GPU availability, training time |
| Validation | Hours | Evaluation script execution |
| Writing | Hours | Paper generation + formatting |
| Total | Hours to days | Experiment compute |
IntologyAI states: "Methods typically only require hours to validate, and a full paper takes only days to complete."
Cost Comparison
| System | Estimated Cost per Paper | Time per Paper | Model Tier |
|---|---|---|---|
| Zochi | ~$30-500+ (API + compute) | Days | Frontier (undisclosed) |
| AI Scientist | $10-50+ | Hours to days | Claude/GPT-4 |
| AutoResearchClaw | $5-30 | Hours | Configurable |
| EurekaClaw | $1-50+ | Hours | Claude Sonnet |
| K-Dense BYOK | $0.05-5 | Minutes to hours | User-selected |
| Human PhD student | $50K-100K/year salary | Months to years | N/A |
Hardware Requirements (Inferred)
| Requirement | Minimum | Likely Production |
|---|---|---|
| CPU | Multi-core | Cloud instances |
| RAM | 16+ GB | 32-64 GB |
| GPU | Required for experiments | Multi-GPU for training (CS-ReFT used Llama-2-7B) |
| Storage | 10+ GB per project | Cloud storage for paper corpus |
| Network | Required | High-bandwidth for paper retrieval + API calls |
| API Access | Frontier LLM API | Rate-limited; likely parallel calls |
9 Architecture Solution
Pipeline Architecture (Inferred from Descriptions)
Zochi operates as a multi-stage autonomous pipeline that mirrors the scientific method. While the internal implementation is closed-source, the described stages can be mapped to an architectural diagram:
Zochi Architecture Overview (Inferred)
│
│ INPUT: Research domain / high-level direction
│ (e.g., "novel jailbreaking methods")
│
│ ╔════════════════════════════════════════════════════════╗
│ ║ STAGE 1: LITERATURE ANALYSIS ║
│ ║ ║
│ ║ Paper Retrieval ──► Summarization ──► Pattern Mining ║
│ ║ │ │ │ ║
│ ║ (arXiv, S2, (Per-paper (Cross-paper ║
│ ║ venue DBs) key findings) connections) ║
│ ║ ║
│ ║ Output: Research landscape map + identified gaps ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 2: HYPOTHESIS GENERATION ║
│ ║ ║
│ ║ Gap Analysis ──► Direction Proposal ──► Selection ║
│ ║ │ │ │ ║
│ ║ (Identify (Generate (Score and ║
│ ║ limitations) novel ideas) rank) ║
│ ║ ║
│ ║ Output: Research hypothesis + proposed method ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 3: METHOD DESIGN ║
│ ║ ║
│ ║ Architecture Design ──► Technical Specification ║
│ ║ │ │ ║
│ ║ (Novel method (Formal description, ║
│ ║ formulation) math formulation) ║
│ ║ ║
│ ║ Output: Complete method specification ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 4: IMPLEMENTATION ║
│ ║ ║
│ ║ Code Generation ──► Testing ──► Debugging ──► ↻ ║
│ ║ │ │ │ ║
│ ║ (Method (Unit + (Iterative ║
│ ║ implementation) integration) repair) ║
│ ║ ║
│ ║ Output: Working implementation of proposed method ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 5: EXPERIMENTATION ║
│ ║ ║
│ ║ Experiment Design ──► Parallel Execution ──► Results ║
│ ║ │ │ │ ║
│ ║ (Controlled (Multi-trial (Statistical║
│ ║ experiments, parallelized) analysis) ║
│ ║ ablation studies) ║
│ ║ ║
│ ║ ┌──────────────────────────────┐ ║
│ ║ │ VALIDATION ENGINE │ ║
│ ║ │ Auto-generates eval scripts│ ║
│ ║ │ Standardized datasets │ ║
│ ║ │ Datasets NOT modified │ ║
│ ║ └──────────────────────────────┘ ║
│ ║ ║
│ ║ Output: Experimental results + analysis ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 6: MANUSCRIPT PREPARATION ║
│ ║ ║
│ ║ Paper Drafting ──► Related Work ──► Full Paper ║
│ ║ │ │ │ ║
│ ║ (Structure + (Literature (Complete ║
│ ║ technical integration) manuscript) ║
│ ║ writing) ║
│ ║ ║
│ ║ Human steps: figures, citation format, minor edits ║
│ ║ ║
│ ║ Output: Conference-ready paper ║
│ ╚════════════════════════════════════════════════════════╝
│
│ OUTPUT: Peer-reviewed publication (ACL 2025, ICLR 2025)
Architectural Differentiators
| Feature | Zochi | AI Scientist | AutoResearchClaw |
|---|---|---|---|
| Pipeline stages | ~6 (inferred) | ~8 | 23 |
| Parallel experiments | Yes | Limited | Yes |
| Validation engine | Dedicated + separated | Self-evaluation | Multi-agent review |
| Cross-domain | Yes (3 domains demonstrated) | Single domain per run | Configurable |
| Human involvement | Minimal (figures, citations) | Similar | Fully automated |
| Quality bar | A* venue acceptance | Workshop demos | Automated scores only |
| Stage granularity | Coarse (strategic) | Medium | Fine (23 stages) |
Key Architectural Decisions (Inferred)
- Separation of generation and validation: The validation engine generates evaluation scripts independently from the method generation process. This prevents the system from inadvertently gaming its own metrics.
- Parallelized experimentation: "Experiments are parallelized across multiple trials, significantly accelerating the research timeline." This suggests an experiment orchestration layer that manages GPU resources and collects results.
- Minimal human handoff: The architecture is designed to minimize human touchpoints. The only human steps are cosmetic (figures, formatting) rather than substantive (method design, experiment interpretation).
- Input minimalism: For the ACL paper, the input was merely "novel jailbreaking methods": three words that triggered the entire research pipeline, from literature analysis through to a 97% attack success rate on GPT-4.
Architecture Evolution: Siege vs. Tempest
The progression from ICLR workshop (Siege) to ACL main (Tempest) reveals architectural improvements:
Architecture Evolution
│
│ Siege (ICLR 2025 Workshop, v1)
│ ├── Input: "multi-turn attacks on LLMs" (autonomously identified)
│ ├── Pipeline: Standard autonomous flow
│ ├── Output: 2-4 page tiny paper
│ ├── Human contribution: Same as standard (figures, formatting)
│ └── Result: (7, 7) reviewer scores
│
│ Tempest (ACL 2025 Main, v2)
│ ├── Input: Same high-level idea (tree search + multi-turn jailbreak)
│ ├── Pipeline: Enhanced — "significantly improved design"
│ ├── Additional: Cross-branch learning mechanism
│ ├── Additional: Robust partial compliance tracking
│ ├── Additional: "More comprehensive experiments"
│ ├── Output: Full conference paper
│ ├── Human contribution: Same minimal scope
│ └── Result: Meta-review 4 (top 8.2%)
│
│ Key observation: The system could ITERATE on its own work,
│ producing substantially improved research on the same topic.
10 Component Breakdown
Inferred Component Architecture
Since Zochi is closed-source, the component breakdown must be inferred from the described capabilities and outputs. The following represents a plausible decomposition:
Zochi Component Map (Inferred)
│
├── CORE ENGINE
│ ├── Pipeline Orchestrator
│ │ ├── Stage sequencing and state management
│ │ ├── Error handling and recovery
│ │ └── Resource allocation across stages
│ │
│ ├── LLM Interface Layer
│ │ ├── API client(s) for frontier model(s)
│ │ ├── Prompt management and templating
│ │ ├── Token budget management
│ │ └── Response parsing and validation
│ │
│ └── Domain Abstraction
│ ├── Domain-agnostic pipeline flow
│ └── Domain-specific adapter patterns
│
├── LITERATURE ENGINE
│ ├── Paper Retrieval
│ │ ├── arXiv API integration
│ │ ├── Semantic Scholar API integration
│ │ ├── Venue-specific databases
│ │ └── Citation graph traversal
│ │
│ ├── Paper Analysis
│ │ ├── Abstract and full-text summarization
│ │ ├── Methodology extraction
│ │ ├── Result extraction
│ │ └── Limitation identification
│ │
│ └── Knowledge Synthesis
│ ├── Cross-paper pattern mining
│ ├── Gap identification
│ ├── Direction scoring
│ └── Research landscape mapping
│
├── HYPOTHESIS ENGINE
│ ├── Idea Generator
│ │ ├── Cross-paper connection identification
│ │ ├── Novel combination proposal
│ │ └── Feasibility assessment
│ │
│ ├── Method Designer
│ │ ├── Architecture specification
│ │ ├── Mathematical formulation
│ │ └── Technical approach planning
│ │
│ └── Selection Filter
│ ├── Novelty scoring
│ ├── Impact prediction
│ └── Feasibility ranking
│
├── IMPLEMENTATION ENGINE
│ ├── Code Generator
│ │ ├── Method implementation from specification
│ │ ├── Data loading and preprocessing
│ │ └── Training loop generation
│ │
│ ├── Testing Layer
│ │ ├── Unit test generation
│ │ ├── Integration testing
│ │ └── Debugging and repair loop
│ │
│ └── Environment Manager
│ ├── Dependency management
│ ├── GPU resource allocation
│ └── Experiment workspace isolation
│
├── EXPERIMENTATION ENGINE
│ ├── Experiment Designer
│ │ ├── Controlled experiment specification
│ │ ├── Ablation study generation
│ │ └── Baseline selection
│ │
│ ├── Execution Orchestrator
│ │ ├── Multi-trial parallelization
│ │ ├── Result collection
│ │ └── Resource management
│ │
│ └── VALIDATION ENGINE (SEPARATED)
│ ├── Auto-generates evaluation scripts
│ ├── Uses standardized, unmodified datasets
│ ├── Independent from generation process
│ └── Ensures genuine performance measurement
│
├── WRITING ENGINE
│ ├── Paper Generator
│ │ ├── Structure planning
│ │ ├── Section-by-section drafting
│ │ ├── Related work integration
│ │ └── Technical writing quality assurance
│ │
│ └── LaTeX Formatter
│ ├── Conference template compliance
│ ├── Table and equation formatting
│ └── Reference management
│
└── QUALITY ASSURANCE
├── Automated Reviewer
│ ├── NeurIPS guidelines-based scoring
│ ├── Multi-dimensional evaluation
│ └── Quality threshold enforcement
│
└── Result Verification
├── Statistical significance checking
├── Claim-to-evidence alignment
└── Reproducibility verification
Component Interaction Patterns
Information Flow Between Components
│
│ Domain Input ──────────────────────────────────────────┐
│ │ │
│ ▼ │
│ ┌─────────────┐ papers ┌──────────────┐ │
│ │ Literature │─────────────►│ Hypothesis │ │
│ │ Engine │ + gaps │ Engine │ │
│ └─────────────┘ └──────┬───────┘ │
│ │ │
│ method spec │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Implementation │ │
│ │ Engine │ │
│ └──────┬───────┘ │
│ │ │
│ working code │
│ │ │
│ ▼ │
│ ┌────────────┐ eval scripts ┌──────────────┐ │
│ │ Validation │◄─────────────│Experimentation│ │
│ │ Engine │───results───►│ Engine │ │
│ └────────────┘ └──────┬───────┘ │
│ │ │
│ results + analysis │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Writing │◄──────────┘
│ │ Engine │ domain context
│ └──────┬───────┘
│ │
│ ┌───────▼───────┐
│ │ Quality │
│ │ Assurance │
│ └───────┬───────┘
│ │
│ ▼
│ Conference Paper
Comparison: Component Density
| System | Named Components | Pipeline Stages | Agents | Tools |
|---|---|---|---|---|
| Zochi | ~12 (inferred) | ~6 | Unknown | Unknown |
| AI Scientist | ~8 | ~8 | 3 (researcher, reviewer, editor) | ~5 |
| AutoResearchClaw | ~15+ | 23 | 8+ specialized agents | 10+ |
| EurekaClaw | ~20+ | 7 | 7+ specialized agents | 8+ |
| Google Co-Scientist | Unknown | Multi-step | Multi-agent | Unknown |
Zochi appears to use a more consolidated architecture with fewer but more capable components, contrasting with the fine-grained stage decomposition of systems like AutoResearchClaw (23 stages) or EurekaClaw (7 stages with sub-agents per stage).
11 Core Mechanisms (Detailed)
11.1 Literature-Grounded Research Direction Selection
The literature analysis phase is described as ingesting "thousands of research papers" and identifying "non-obvious connections across papers." This implies a multi-layer analysis pipeline:
Literature Analysis Pipeline (Inferred)
│
├── Layer 1: RETRIEVAL
│ ├── Input: Domain string (e.g., "novel jailbreaking methods")
│ ├── Query expansion: LLM generates multiple search queries
│ ├── Sources: arXiv API, Semantic Scholar, venue proceedings
│ ├── Scale: "Thousands of papers" retrieved
│ └── Output: Raw paper corpus (titles, abstracts, full texts)
│
├── Layer 2: SUMMARIZATION
│ ├── Per-paper analysis:
│ │ ├── Key contribution extraction
│ │ ├── Methodology characterization
│ │ ├── Limitation identification
│ │ └── Result summary
│ ├── Efficiency: Likely uses cheaper model or shorter prompts
│ └── Output: Structured paper summaries
│
├── Layer 3: PATTERN MINING
│ ├── Cross-paper connection identification
│ ├── Methodology trend analysis
│ ├── "Non-obvious connections" — the key differentiator
│ │ ├── For CS-ReFT: Connected representation editing to
│ │ │ cross-skill interference (not obvious from either
│ │ │ literature alone)
│ │ ├── For Tempest: Connected multi-turn dialogue patterns
│ │ │ to systematic safety erosion (novel framing)
│ │ └── For EGNN-Fusion: Connected equivariant architectures
│ │ to binding site efficiency (cross-domain transfer)
│ └── Output: Pattern graph over research landscape
│
└── Layer 4: DIRECTION SELECTION
├── Gap scoring: novelty × feasibility × impact
├── Direction ranking
├── Selection of most promising direction
└── Output: Chosen research direction with justification
11.2 The "Partial Compliance" Discovery Mechanism
Zochi's most impactful scientific discovery — the "partial compliance" vulnerability pattern in LLMs — illustrates the system's ability to identify non-obvious phenomena:
Partial Compliance Discovery (Tempest)
│
│ Observation: When attacked across multiple turns, LLMs don't
│ simply "comply" or "refuse" — they exhibit a gradient of responses.
│
│ ┌────────────────────────────────────────────────────────┐
│ │ Response Spectrum │
│ │ │
│ │ Full Refusal ◄──────────────────────► Full Compliance │
│ │ │ │ │
│ │ │ ┌─────────────────────┐ │ │
│ │ │ │ PARTIAL COMPLIANCE │ │ │
│ │ │ │ ───────────────── │ │ │
│ │ │ │ Model reveals │ │ │
│ │ │ │ FRAGMENTS of │ │ │
│ │ │ │ restricted info │ │ │
│ │ │ │ while appearing │ │ │
│ │ │ │ to maintain safety │ │ │
│ │ │ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ │ These fragments │ │
│ │ │ ACCUMULATE across turns │ │
│ │ │ │ │ │
│ │ │ Until full compliance │ │
│ │ │ is achieved │ │
│ └───────┴──────────────┴────────────────────────┘ │
│ │
│ Key insight: Safety is not a binary gate but a │
│ continuously erodable surface. Minor concessions │
│ create anchor points for subsequent exploitation. │
└─────────────────────────────────────────────────────────────┘
This discovery is significant because it:
1. Reframes the safety problem, from binary (safe/unsafe) to a continuous compliance gradient
2. Was autonomously identified: Zochi discovered this pattern from literature analysis, not from human guidance
3. Has practical implications: multi-turn safety mechanisms must go beyond single-turn guardrails
4. Survived A* peer review, validating the discovery's novelty and significance
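The fragment-accumulation dynamic in the diagram can be made concrete with a toy tracker; the threshold and per-fragment weight below are invented parameters, not values from the Tempest paper.

```python
class ComplianceTracker:
    """Toy model of the partial-compliance pattern: fragments of restricted
    information leak turn by turn and accumulate toward full compliance."""

    def __init__(self, success_threshold: float = 1.0):
        self.fragments: list[str] = []
        self.score = 0.0
        self.threshold = success_threshold

    def record_turn(self, response_fragments: list[str],
                    fragment_weight: float = 0.25):
        # Each leaked fragment nudges the cumulative compliance score upward
        self.fragments.extend(response_fragments)
        self.score += fragment_weight * len(response_fragments)

    @property
    def fully_complied(self) -> bool:
        return self.score >= self.threshold

tracker = ComplianceTracker()
for turn_fragments in [["f1"], [], ["f2", "f3"], ["f4"]]:
    tracker.record_turn(turn_fragments)
```

Note how a turn with zero fragments (a refusal) does not reset the score: concessions already banked remain available as anchor points, which is the erosion dynamic the diagram describes.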
11.3 Orthonormal Subspace Representation Editing (CS-ReFT)
The CS-ReFT method demonstrates Zochi's ability to formulate technically novel approaches:
CS-ReFT Architecture (from paper)
│
│ Standard Fine-Tuning:
│ ┌────────────────────────────────────────┐
│ │ Weights modified → all tasks affected │
│ │ Task A improvement → Task B degrades │ = cross-skill
│ │ LoRA orthogonality: weight-level only │ interference
│ └────────────────────────────────────────┘
│
│ CS-ReFT Approach:
│ ┌────────────────────────────────────────────────────────┐
│ │ │
│ │ Hidden State Space (h) │
│ │ ┌─────────────────────────────────────┐ │
│ │ │ │ │
│ │ │ Subspace S₁ ──► Skill 1 edit │ │
│ │ │ (orthonormal) │ │
│ │ │ ⊥ │ │
│ │ │ Subspace S₂ ──► Skill 2 edit │ │
│ │ │ (orthonormal) │ │
│ │ │ ⊥ │ │
│ │ │ Subspace Sₖ ──► Skill k edit │ │
│ │ │ (orthonormal) │ │
│ │ │ │ │
│ │ └─────────────────────────────────────┘ │
│ │ │ │
│ │ ┌─────┴─────┐ │
│ │ │ Router │ │
│ │ │ (lightweight│ │
│ │ │ selector) │ │
│ │ └─────┬─────┘ │
│ │ │ │
│ │ Composed output │
│ │ │
│ │ Key innovation: orthonormality at hidden-state level │
│ │ not weight level → directly prevents interference │
│ │ where it manifests │
│ └────────────────────────────────────────────────────────┘
│
│ Results:
│ ├── 93.94% win rate on AlpacaEval (vs. 86.30% GPT-3.5-T)
│ ├── Only 0.0098% of model parameters
│ ├── 12.7x fewer parameters than LoRA
│ └── Minimal cross-task interference
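Why hidden-state-level orthonormality prevents interference can be shown in a few lines. This is not the paper's CS-ReFT implementation (which learns the subspaces and a router); it is a toy with fixed one-dimensional subspaces chosen to be orthonormal by construction.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hidden dim 4; two orthonormal 1-D skill subspaces (toy, hand-picked bases)
s1 = [1.0, 0.0, 0.0, 0.0]  # skill 1 subspace
s2 = [0.0, 1.0, 0.0, 0.0]  # skill 2 subspace, orthogonal to s1

def apply_skill_edit(h, subspace, delta):
    """Edit the hidden state only along the skill's unit-norm subspace."""
    return [hi + delta * bi for hi, bi in zip(h, subspace)]

h = [0.3, -0.7, 0.5, 0.1]
h_edited = apply_skill_edit(h, s1, 2.0)  # an edit targeting skill 1
```

Because `s1 ⊥ s2`, the skill-1 edit leaves the component of `h` along `s2` exactly unchanged: interference is prevented at the level where it manifests (the hidden state), rather than indirectly at the weight level as with LoRA-style orthogonality.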
11.4 Tree Search Over Conversation Branches (Tempest)
The Tempest framework implements a systematic search algorithm over multi-turn conversations:
Tempest Tree Search Algorithm
│
│ INITIALIZE:
│ root = initial adversarial prompt
│ target = restricted behavior
│ tree = {root}
│ compliance_tracker = {}
│
│ LOOP (for each turn t):
│ │
│ ├── EXPAND: For each active node n in tree:
│ │ ├── Generate k adversarial follow-ups
│ │ │ (breadth-first branching)
│ │ ├── Each follow-up exploits partial compliance
│ │ │ from n's response
│ │ └── Add branches to tree
│ │
│ ├── EVALUATE: For each new branch:
│ │ ├── Send to target model
│ │ ├── Receive response
│ │ ├── Measure compliance level:
│ │ │ ├── Full refusal → prune branch
│ │ │ ├── Partial compliance → track fragments
│ │ │ └── Full compliance → SUCCESS
│ │ └── Update compliance_tracker
│ │
│ ├── CROSS-BRANCH LEARNING (ACL version):
│ │ ├── Analyze successful partial compliance patterns
│ │ ├── Transfer effective strategies across branches
│ │ └── Reinject learned patterns into new prompts
│ │
│ ├── PRUNE: Remove branches with:
│ │ ├── Full refusals
│ │ ├── Stalled compliance
│ │ └── Redundant paths
│ │
│ └── RE-INJECT: For promising branches:
│ ├── Extract compliance fragments from responses
│ ├── Incorporate fragments into next turn's prompts
│ └── "Minor concessions accumulate into fully
│ disallowed outputs"
│
│ TERMINATE when:
│ ├── Full compliance achieved (success)
│ ├── All branches pruned (failure)
│ └── Max turns reached (timeout)
11.5 Validation Engine — Separation of Concerns
The validation engine is one of Zochi's most architecturally important mechanisms:
Validation Engine Design
│
│ PROBLEM: AI systems can inadvertently optimize for
│ their own evaluation metrics rather than
│ genuine performance improvements.
│
│ SOLUTION: Separate generation from evaluation
│
│ ┌──────────────────┐ ┌──────────────────┐
│ │ GENERATION PATH │ │ VALIDATION PATH │
│ │ │ │ │
│ │ Method design │ │ Eval script gen │
│ │ Implementation │ │ (independent) │
│ │ Training │ │ │
│ │ │ │ Standardized │
│ │ Produces: │ │ datasets (NOT │
│ │ - trained model │ │ modified by │
│ │ - method code │ │ generation path) │
│ │ │ │ │
│ └────────┬─────────┘ └────────┬─────────┘
│ │ │
│ │ ┌────────────┐ │
│ └─────►│ EVALUATION │◄──┘
│ │ │
│ │ Model tested│
│ │ on unmodified│
│ │ datasets via │
│ │ independent │
│ │ eval scripts │
│ └──────┬─────┘
│ │
│ Genuine results
│
│ This prevents:
│ ├── Data leakage from training to evaluation
│ ├── Metric gaming (optimizing for eval proxy)
│ ├── Self-confirming evaluation loops
│ └── Overfitting to evaluation procedure
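One way to enforce the "datasets NOT modified" guarantee is to fingerprint the evaluation data before the generation path runs. Nothing confirms Zochi does this; the sketch below is one plausible mechanism, with the model and dataset invented for illustration.

```python
import hashlib
import json

def dataset_fingerprint(examples) -> str:
    """Hash a standardized eval set so the validation path can detect any
    modification made by the generation path."""
    blob = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def validate(model_fn, examples, expected_fingerprint):
    """Refuse to report results if the eval data no longer matches the
    fingerprint taken before generation ran."""
    if dataset_fingerprint(examples) != expected_fingerprint:
        raise ValueError("eval dataset was modified; results are not genuine")
    correct = sum(model_fn(x["input"]) == x["label"] for x in examples)
    return correct / len(examples)

eval_set = [{"input": "2+2", "label": "4"}, {"input": "3+3", "label": "6"}]
fingerprint = dataset_fingerprint(eval_set)       # taken before generation
accuracy = validate(lambda s: {"2+2": "4", "3+3": "6"}[s], eval_set, fingerprint)
```

The key property is that the validation path holds the fingerprint, so the generation path cannot silently swap in an easier dataset without the evaluation refusing to run.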
11.6 Research Iteration Capability
The Siege → Tempest progression reveals a capability that most AI research systems lack: iterative improvement on the same research direction:
Research Iteration Loop
│
│ Iteration 1 (Siege — ICLR Workshop):
│ ├── Input: "multi-turn attacks on LLMs"
│ │ (autonomously identified from literature)
│ ├── Output: Core tree search framework
│ ├── Result: 100%/97% attack success
│ ├── Format: 2-4 page tiny paper
│ └── Feedback: (7, 7) reviewer scores
│
│ GAP: Workshop paper → Full conference paper requires:
│ ├── Deeper methodology
│ ├── More comprehensive experiments
│ ├── Stronger theoretical motivation
│ └── Better presentation
│
│ Iteration 2 (Tempest — ACL Main):
│ ├── Input: Same high-level idea + Siege as starting point
│ ├── Enhancements:
│ │ ├── Cross-branch learning mechanism (NEW)
│ │ ├── Robust partial compliance tracking (IMPROVED)
│ │ ├── Comprehensive evaluations (EXPANDED)
│ │ └── Full conference paper format (EXTENDED)
│ ├── Result: Same attack rates + deeper analysis
│ └── Outcome: Top 8.2% at A* venue
│
│ This demonstrates Zochi can:
│ 1. Evaluate its own work's limitations
│ 2. Identify what needs improvement
│ 3. Design and implement those improvements
│ 4. Produce substantially stronger results
│ 5. Meet a much higher quality bar (workshop → A*)
12 Programming Language
System Implementation
| Aspect | Assessment | Evidence |
|---|---|---|
| System language | Unknown (likely Python) | Closed-source; Python is standard for ML systems |
| Generated code | Python (confirmed) | CS-ReFT uses PyTorch; Tempest uses standard ML libraries |
| Experiment code | Python (confirmed) | Standard ML stack (PyTorch, transformers, etc.) |
| Paper output | LaTeX | Conference paper format |
Generated Code Quality Indicators
The code Zochi generates must be of sufficient quality to:
- Train models successfully: CS-ReFT trained on Llama-2-7B with orthonormal subspace edits
- Run complex experiments: Tempest executed multi-turn attacks against GPT-3.5 and GPT-4 APIs
- Implement novel architectures: EGNN-Fusion designed a new equivariant GNN architecture
- Produce reproducible results: Results were verified by peer reviewers at A* venues
Comparison to Other Systems
| System | System Language | Generated Code | Open Source |
|---|---|---|---|
| Zochi | Unknown (Python likely) | Python | No |
| AI Scientist | Python | Python | Yes |
| AutoResearchClaw | Python | Python | Yes (MIT) |
| EurekaClaw | Python/TypeScript | Python + LaTeX | Yes (Apache 2.0) |
| Google Co-Scientist | Unknown | Unknown | No |
| AIRA₂ | Python | Python | Partially |
Code Quality Assessment (Inferred)
| Indicator | Evidence | Assessment |
|---|---|---|
| Correctness | Results accepted at A* venue | High — peer-verified |
| Reproducibility | Standardized datasets, published results | Medium-High |
| Complexity | Orthonormal subspace edits, tree search, EGNN architectures | High — non-trivial implementations |
| Test coverage | Ablation studies, multiple baselines | High — comprehensive evaluation |
| Documentation | Published papers describe methods | High — paper-quality documentation |
13 Memory Management
Memory Architecture (Inferred)
Zochi's memory system is not publicly documented, but the system's capabilities imply several memory types:
┌─────────────────────────────────────────────────────────────┐
│ Zochi Memory Architecture (Inferred) │
│ │
│ Tier 1: LITERATURE MEMORY │
│ ├── Scope: Per-project │
│ ├── Content: Paper summaries, extracted patterns, gaps │
│ ├── Scale: "Thousands of papers" │
│ ├── Access: Literature engine reads; hypothesis engine reads│
│ └── Purpose: Ground research in existing knowledge │
│ │
│ Tier 2: PROJECT MEMORY │
│ ├── Scope: Per-project (spans all pipeline stages) │
│ ├── Content: Chosen direction, method spec, implementation │
│ │ state, experiment results, partial drafts │
│ ├── Access: All stages read; each stage writes its outputs │
│ └── Purpose: Maintain coherence across pipeline stages │
│ │
│ Tier 3: ITERATION MEMORY │
│ ├── Scope: Across project iterations (Siege → Tempest) │
│ ├── Content: Prior work artifacts, identified improvements │
│ ├── Access: New iteration reads prior artifacts │
│ └── Purpose: Enable research iteration and improvement │
│ │
│ Tier 4: VALIDATION MEMORY │
│ ├── Scope: Per-experiment │
│ ├── Content: Evaluation scripts, dataset references, results│
│ ├── Access: Validation engine (isolated from generation) │
│ └── Purpose: Ensure genuine, unbiased evaluation │
│ │
│ Tier 5: CROSS-DOMAIN MEMORY (speculative) │
│ ├── Scope: Across projects/domains │
│ ├── Content: Transferable strategies, method patterns │
│ ├── Access: Hypothesis engine for new projects │
│ └── Purpose: Enable cross-domain innovation │
│ │
└─────────────────────────────────────────────────────────────┘
Evidence for Memory Tiers
| Tier | Evidence | Confidence |
|---|---|---|
| Literature Memory | "Ingests and analyzes thousands of research papers" | High |
| Project Memory | Multi-stage pipeline requires state passing | High |
| Iteration Memory | Siege → Tempest improvement demonstrates prior work awareness | High |
| Validation Memory | Separated validation engine with unmodified datasets | Medium |
| Cross-Domain Memory | 3 domains with transferable patterns | Low-Medium |
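The inferred tiers can be expressed as a simple container; this is a hypothetical data structure mirroring the diagram, not a documented Zochi component, and the field contents are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTiers:
    """Hypothetical container mirroring the inferred memory tiers."""
    literature: dict = field(default_factory=dict)  # Tier 1: summaries, gaps
    project: dict = field(default_factory=dict)     # Tier 2: specs, results
    iteration: list = field(default_factory=list)   # Tier 3: prior artifacts
    validation: dict = field(default_factory=dict)  # Tier 4: isolated eval assets

    def start_iteration(self, prior_artifacts: dict):
        """New iteration (e.g. Siege -> Tempest): archive prior work in
        iteration memory, then start with a fresh project scope."""
        self.iteration.append(prior_artifacts)
        self.project = {}

mem = MemoryTiers()
mem.project = {"paper": "Siege", "scores": (7, 7)}
mem.start_iteration(mem.project)
```

The isolation property matters: validation memory is a separate field that the generation-side tiers never write to, matching the separated validation engine above.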
Context Window vs. Persistent Memory
A key architectural question for Zochi is how it handles the tension between LLM context window limitations and the need for extensive research context:
Context Window Challenge
│
│ Challenge: "Thousands of papers" analyzed → millions of tokens
│ Typical context window: 100K-200K tokens (frontier models, 2025)
│
│ Possible Solutions (Inferred):
│
│ 1. HIERARCHICAL SUMMARIZATION
│ ├── Full papers → abstracts → key findings → pattern summary
│ ├── Progressive compression: 1000x reduction
│ └── Only summaries enter context window
│
│ 2. RETRIEVAL-AUGMENTED GENERATION
│ ├── Vector store indexes all paper summaries
│ ├── Relevant papers retrieved per query
│ └── Only retrieved context enters window
│
│ 3. STRUCTURED STATE PASSING
│ ├── Each pipeline stage produces structured output
│ ├── Next stage receives structured input (not raw text)
│ └── Information compressed between stages
│
│ 4. HYBRID APPROACH (most likely)
│ ├── RAG for literature (Tier 1)
│ ├── Structured state for pipeline (Tier 2)
│ ├── Artifact persistence for iteration (Tier 3)
│ └── Isolated context for validation (Tier 4)
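Hierarchical summarization (solution 1) is easy to sketch. Here `summarize` is a stand-in for an LLM summarization call; the concatenate-and-truncate default and the chunk size are toy assumptions used only to show the compression shape.

```python
def hierarchical_compress(papers, chunk=10, summarize=None):
    """Toy hierarchical summarization: repeatedly summarize groups of items
    until the whole corpus fits in a single context window."""
    summarize = summarize or (lambda group: " | ".join(group)[:80])
    level = papers
    while len(level) > 1:
        # Each pass shrinks the corpus by roughly a factor of `chunk`
        level = [summarize(level[i:i + chunk])
                 for i in range(0, len(level), chunk)]
    return level[0]

corpus = [f"paper-{i}: key finding" for i in range(1000)]
landscape = hierarchical_compress(corpus)
```

With 1,000 papers and a chunk size of 10, three passes (1000 → 100 → 10 → 1) yield a single landscape summary, which is how "thousands of papers" can be reduced to something that fits a 100K-200K token window.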
Memory Comparison
| System | Memory Tiers | Literature Store | Cross-Run | Cross-Domain |
|---|---|---|---|---|
| Zochi | ~5 (inferred) | Yes (thousands of papers) | Yes (Siege→Tempest) | Yes (3 domains) |
| EurekaClaw | 4 | Yes (arXiv + S2) | Yes (persistent) | Per-domain plugin |
| AutoResearchClaw | 3 | Yes (real APIs) | Yes (MetaClaw) | No |
| AI Scientist | 1-2 | Basic | No | No |
| K-Dense BYOK | 1 | Conversation only | No | No |
14 Continued Learning
Evidence of Learning Capability
The Siege → Tempest progression is the strongest evidence that Zochi implements some form of continued learning or iterative improvement:
Learning Evidence: Siege → Tempest Progression
│
│ ICLR Workshop (March 2025)
│ ├── System autonomously identified multi-turn attack direction
│ ├── Designed core tree search framework
│ ├── Produced workshop-quality paper
│ └── Received (7, 7) reviewer scores
│
│ LEARNING PHASE (March - May 2025)
│ ├── "Zochi was able to significantly improve its design"
│ ├── "Conduct more comprehensive experiments"
│ ├── Added: Cross-branch learning mechanism
│ ├── Added: Robust partial compliance tracking
│ └── Expanded methodology and evaluation
│
│ ACL Main (May 2025)
│ ├── Same high-level direction, vastly improved execution
│ ├── Full conference paper (vs. 2-4 page tiny paper)
│ ├── Meta-review score 4 = top 8.2%
│ └── Accepted at A* venue (~21% acceptance rate)
│
│ This implies the system can:
│ 1. Assess the quality gap between its work and a higher bar
│ 2. Identify specific improvement opportunities
│ 3. Execute those improvements autonomously
│ 4. Produce substantially stronger output
Types of Learning (Inferred)
| Learning Type | Evidence | Mechanism (Inferred) |
|---|---|---|
| Intra-project iteration | Code debugging, experiment refinement | Self-correction loops within pipeline |
| Inter-project learning | Siege → Tempest improvement | Prior work analysis + targeted enhancement |
| Cross-domain transfer | AI methods → computational biology | Transferable research strategies |
| Quality calibration | 7.67 avg NeurIPS score | Understanding of what constitutes good research |
Continuous Improvement Evidence
The technical report describes the ACL version (May 2025) as "a substantial advancement over our earlier systems that published workshop papers at ICLR 2025." This suggests ongoing system development between March and May 2025:
System Evolution Timeline
│
│ March 14, 2025 ── IntologyAI announcement
│ March 17, 2025 ── Technical report published
│ ├── CS-ReFT at ICLR SCOPE Workshop
│ └── Siege at ICLR Building Trust Workshop
│
│ March - May 2025 ── System improvement period
│ ├── Architecture enhancements
│ ├── "Substantially advanced" system
│ └── Human involvement reduced further
│
│ May 27, 2025 ── ACL acceptance announced
│ ├── Tempest (expanded Siege) accepted
│ ├── Main proceedings (not workshop)
│ └── "First AI to pass A* peer review"
│
│ 2025 (later) ── Locus previewed
│ ├── Successor to Zochi
│ ├── Surpasses human experts on RE-Bench
│ └── Multi-day research campaigns
Learning Comparison
| System | Learning Type | What is Learned | Cross-Run | Cross-Domain |
|---|---|---|---|---|
| Zochi | System evolution + inter-project | Research strategies, quality standards | Yes | Yes |
| EurekaClaw | Post-session distillation | Proof strategies, tool patterns | Yes (4-tier) | Per-domain |
| AutoResearchClaw | MetaClaw cross-run | Research strategies from failures | Yes (skills) | No |
| AI Scientist | None within system | N/A | No | No |
| K-Dense BYOK | None | N/A | No | No |
| Google Co-Scientist | Unknown | Unknown | Unknown | Unknown |
The Zochi → Locus Learning Lineage
Zochi's continued learning extends beyond the system itself to inform its successor:
| System | Key Capability | Improvement Over Predecessor |
|---|---|---|
| Zochi v1 (Mar 2025) | Workshop-level research | First autonomous AI publications |
| Zochi v2 (May 2025) | A* conference-level research | Quality leap from workshop to main |
| Locus (2025) | Surpasses human experts on RE-Bench | Multi-day campaigns, engineering tasks |
Locus's capabilities suggest that lessons from Zochi's research pipeline were transferred:
- RE-Bench: Locus scores 1.30 vs. human expert 1.27 over 64 hours
- KernelBench: State-of-the-art with 1.5x to 100x+ speedups
- MLE-Bench Lite: State-of-the-art on engineering tasks
- Key innovation: "Maintains consistent improvement over multiple days", unlike systems that plateau after hours
15 Applications
Primary Application: Autonomous Scientific Research
Zochi's demonstrated applications span three distinct scientific domains:
| Application Domain | Paper | Contribution Type | Impact |
|---|---|---|---|
| AI / Representation Learning | CS-ReFT | Novel training methodology | AlpacaEval SOTA for parameter-efficient methods |
| AI Safety | Tempest | Vulnerability framework + discovery | Exposes fundamental safety weakness |
| Computational Biology | EGNN-Fusion | Efficient architecture | 95% parameter reduction |
Demonstrated Research Capabilities
Research Capability Matrix
│
│ Literature │ Hypothesis │ Method │ Implement │ Experiment │ Write
│ Analysis │ Generation │ Design │ │ │
│ ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│ CS-ReFT │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ Tempest │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ EGNN-Fusion │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ MLE-Bench tasks │ — │ — │ — │ ✓ │ ✓ │ —
│ ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│ Coverage │ 3/3 │ 3/3 │ 3/3 │ 4/4 │ 4/4 │ 3/3
│
│ ✓ = Demonstrated — = Not applicable for this task type
Target Users
| User Segment | Use Case | Value Proposition |
|---|---|---|
| Research labs | Accelerate paper production | Hours/days instead of months |
| PhD students | Explore research directions | Autonomous literature survey + direction scoring |
| AI safety teams | Red-teaming and vulnerability discovery | Systematic, comprehensive testing |
| Biotech/pharma | Cross-domain computational methods | Efficient architectures for biology |
| Industry R&D | Novel method development | Competitive research output at frontier quality |
| Conference organizers | Quality assessment | Automated reviewer scoring |
Application Scenarios
Scenario 1: Novel Research Direction Exploration
Input: "novel jailbreaking methods" (3 words)
│
└── Zochi pipeline:
├── Analyzes thousands of safety papers
├── Identifies multi-turn attack as underexplored
├── Designs tree search framework
├── Discovers partial compliance vulnerability
├── Implements Tempest framework
├── Achieves 100%/97% attack success
├── Writes conference paper
└── Accepted at ACL 2025 (A*, top 8.2%)
Time: Days
Human involvement: Figures, citation formatting, minor edits
Cost: Estimated $30-500+ (API + compute)
Scenario 2: Parameter-Efficient Method Design
Input: Research direction on cross-skill interference in PEFT
│
└── Zochi pipeline:
├── Identifies gap between weight-level and representation-level orthogonality
├── Designs CS-ReFT with orthonormal subspace transformations
├── Implements on Llama-2-7B
├── Evaluates on AlpacaEval: 93.94% win rate
├── Demonstrates 0.0098% parameter usage
└── Published at ICLR 2025 SCOPE Workshop
Time: Days
Human involvement: Minimal
Scenario 3: Cross-Domain Architecture Transfer
Input: Protein-nucleic acid binding site prediction
│
└── Zochi pipeline:
├── Analyzes computational biology literature
├── Identifies parameter inefficiency in existing methods
├── Transfers efficient architecture principles from AI domain
├── Designs EGNN-Fusion with 95% parameter reduction
├── Achieves competitive binding site prediction
└── Under journal review
Significance: Demonstrates domain generality
Limitations and Risks
| Limitation | Impact | Mitigation |
|---|---|---|
| Closed-source | Cannot verify, extend, or reproduce pipeline | Published papers are independently verifiable |
| Undisclosed architecture | Scientific community cannot build on methods | Individual research outputs are documented |
| Human verification required | Not fully autonomous — figures, citations, edits | Human oversight as safety mechanism |
| Ethical concerns | AI-generated papers at top venues | Transparent attribution, no AI authorship claims |
| Scalability unknown | No evidence of concurrent multi-project runs | Locus suggests improvement here |
| Cost unknown | Closed-source prevents cost assessment | Likely competitive with human research |
| Generalization unknown | 3 domains demonstrated; broader generality unproven | Cross-domain capability is promising evidence |
| Reviewer gaming risk | System could learn to optimize for reviewer preferences | Separated validation engine mitigates |
Ethical Framework
IntologyAI's stated ethical principles represent the most developed framework among autoresearch systems:
| Principle | Implementation |
|---|---|
| No AI authorship | "We do not believe AI systems should be authors on papers, as they cannot take responsibility for their work" |
| Human verification | "Rigorous human verification of all research outputs" |
| Transparent attribution | Acknowledge AI contributions without claiming authorship |
| Responsible disclosure | Safety research (Tempest) follows responsible disclosure protocols |
| Venue engagement | "In discussion with workshop organizers of Zochi's accepted papers" |
| Human rebuttal | ACL rebuttal written manually without Zochi involvement |
Comparison to Other Systems' Ethics Frameworks
| System | AI Authorship Policy | Human Verification | Venue Transparency |
|---|---|---|---|
| Zochi | No AI authorship | Required | Disclosed to organizers |
| AI Scientist | AI listed as author | Minimal | No disclosure policy |
| AutoResearchClaw | Not addressed | Configurable | Not addressed |
| EurekaClaw | Not addressed | Gate modes available | Not addressed |
| Google Co-Scientist | Not addressed | Built-in | Not addressed |
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Only AI system with A* venue acceptance | Closed-source — no reproducibility of pipeline |
| Multi-domain research capability (3 domains) | Undisclosed architecture limits scientific contribution |
| Highest automated quality scores (7.67 avg) | Small team → sustainability risk |
| Minimal human involvement | Cannot verify claims about autonomy level |
| Strong ethical framework | Ethical framework untested at scale |
| Discovered novel vulnerability (partial compliance) | Individual papers, while good, are not groundbreaking |
| MLE-Bench performance without optimization | MLE-Bench evaluation was "exploratory" |
| Research iteration capability (Siege → Tempest) | Unclear if iteration was human-guided or autonomous |
| Successor system (Locus) shows continued development | Locus makes Zochi potentially obsolete |
Future Trajectory
Based on IntologyAI's public statements and announced plans:
| Development | Status | Significance |
|---|---|---|
| Locus (successor) | Previewed 2025 | Surpasses human experts on RE-Bench |
| Multi-day campaigns | Locus capability | Week/month-long research runs planned |
| Beta access | Sign-up available | Moving toward product launch |
| Additional domains | Expected | Current 3-domain capability as starting point |
| Journal publications | EGNN-Fusion under review | Expanding beyond conferences |
Impact Assessment
Zochi's significance extends beyond its individual research contributions: it marks a category-defining moment for AI research systems.
Before Zochi (pre-2025):
- AI could generate paper-shaped artifacts
- Quality was insufficient for top-tier venues
- "AI research" meant demonstrations, not contributions
- The gap between AI output and human research was large

After Zochi (2025):
- AI output accepted at highest-tier venues
- Quality comparable to strong human submissions
- "Artificial Scientist" established as a legitimate category
- The gap between AI and human research is closing rapidly
- Locus suggests a trajectory toward surpassing humans

Implications:
- Publication norms need updating (attribution, review)
- Research acceleration becomes possible (hours/days vs. months)
- Multi-domain research becomes feasible for small teams
- AI safety research can be systematically automated
- The research community must adapt to AI-generated science
Appendix A: Publication Record
| Paper | Venue | Type | Scores | Status |
|---|---|---|---|---|
| CS-ReFT | SCOPE @ ICLR 2025 | Workshop poster | (6, 7, 6) | Accepted |
| Siege | Building Trust @ ICLR 2025 | Tiny paper | (7, 7) | Accepted |
| Tempest | ACL 2025 | Main proceedings | Meta: 4 (top 8.2%) | Accepted |
| EGNN-Fusion | Journal (undisclosed) | Full paper | N/A | Under review |
Appendix B: Benchmark Summary
| Benchmark | Metric | Zochi Result | Best Baseline |
|---|---|---|---|
| AlpacaEval (CS-ReFT) | Win rate | 93.94% | GPT-3.5-T: 86.30% |
| JailbreakBench (Tempest) | Success vs GPT-3.5-T | 100% | Crescendo: lower |
| JailbreakBench (Tempest) | Success vs GPT-4 | 97% | GOAT: lower |
| NeurIPS Auto-Review | Paper quality | 8, 8, 7 (avg 7.67) | Other AI: ~4 |
| MLE-Bench | > Median human | 80% of tasks | Agent Lab: lower |
| MLE-Bench | Medal rate | 50% of tasks | AIDE: 8.7% |
| EGNN-Fusion | Parameter reduction | 95% | — |
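The averaged review scores quoted in the tables above can be checked directly from the raw per-reviewer tuples. A minimal sketch (the dictionary keys are labels taken from Appendices A and B, not identifiers from any Zochi artifact):

```python
# Recompute the averaged review scores from the per-reviewer tuples
# listed in Appendix A (human reviews) and Appendix B (auto-review).
scores = {
    "CS-ReFT (SCOPE @ ICLR 2025)": (6, 7, 6),
    "Siege (Building Trust @ ICLR 2025)": (7, 7),
    "NeurIPS auto-review (avg 7.67 in Appendix B)": (8, 8, 7),
}

for paper, reviews in scores.items():
    avg = sum(reviews) / len(reviews)
    print(f"{paper}: avg {avg:.2f}")
```

The (8, 8, 7) tuple averages to 7.67, matching the "avg 7.67" figure reported in the benchmark table and the headline comparison.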
Appendix C: Comparison with All Major AI Research Systems (as of April 2026)
| Dimension | Zochi | AI Scientist | Co-Scientist | AIRA₂ | AutoResearchClaw | EurekaClaw |
|---|---|---|---|---|---|---|
| Organization | IntologyAI | Sakana AI | Google DeepMind | Meta FAIR | AIMING Lab | Single lab |
| Team size | 4 | ~6 | ~20+ | 25 | 16 | 8 |
| Open source | No | Yes | No | Partial | Yes (MIT) | Yes (Apache 2.0) |
| A* venue | Yes (ACL) | No | No | No | No | No |
| Workshop venue | Yes (ICLR) | Yes | No | No | Not yet | Not yet |
| Domains | 3+ | 1/run | Biomedical | NLP/ML | Configurable | Mathematics |
| Auto NeurIPS score | 7.67 | ~4 | N/A | N/A | N/A | N/A |
| Human involvement | Minimal | Similar | Significant | Moderate | Full automation | Configurable |
| Learning | System evolution | None | Unknown | Tournament | MetaClaw skills | 4-tier + skills |
| MLE-Bench | 80%>median, 50% medal | N/A | N/A | N/A | N/A | N/A |
| Ethical framework | Comprehensive | Basic | Minimal | Basic | None stated | None stated |
| Successor | Locus | None | None | None | MetaClaw | None |
This analysis was compiled from publicly available sources including the Zochi Technical Report, IntologyAI blog posts, OpenReview submissions, arXiv papers, and third-party coverage. All claims about system internals are marked as inferred where architectural details are not publicly documented.