Zochi
The first AI system to achieve acceptance at an A* scientific conference (ACL 2025), autonomously conducting end-to-end research from literature analysis to peer-reviewed publication across multiple domains.
Organization: IntologyAI (San Francisco-based AI research lab)
Published: March 17, 2025 (tech report); May 27, 2025 (ACL 2025 acceptance announced)
Type: Technical Report + closed-source system
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Zochi Technical Report: The First Artificial Scientist
- Repository: github.com/IntologyAI/Zochi (papers + artifacts only; system code is closed-source)
- Technical Report PDF: Zochi_Technical_Report.pdf
- Blog posts: Tech Report, ACL Acceptance
- Stars: ~305 (as of April 2026)
- License: Not specified (closed-source system; published paper artifacts under CC BY 4.0)
- Tagline: "The First Artificial Scientist"
- Successor system: Locus (previewed 2025; surpasses human experts on RE-Bench)
Naming and Branding
The name "Zochi" does not appear to reference an existing acronym or abbreviation. IntologyAI uses the term "Artificial Scientist" as a category label — distinguishing systems that autonomously conduct the full scientific method from "AI research assistants" that support individual phases. Zochi is positioned as the inaugural member of this category, with Locus as its successor.
Historical Significance
Zochi holds several firsts in the AI-for-science landscape:
| Milestone | Date | Significance |
|---|---|---|
| First AI system with workshop publications | Mar 2025 | ICLR 2025 workshops (CS-ReFT, Siege) |
| First AI system with A* conference acceptance | May 2025 | ACL 2025 main proceedings (Tempest) |
| First multi-domain AI research system | Mar 2025 | Produced papers in AI, safety, and computational biology |
| First AI system with meta-review in top 10% | May 2025 | ACL meta-review score 4 = top 8.2% of submissions |
Lineage and Positioning
AI Research Systems Timeline
│
├── 2024
│ ├── AI Scientist (Sakana AI) — first end-to-end pipeline, workshop-only
│ ├── Agent Laboratory — framework for AI-assisted research
│ └── AIDE — ML engineering agent (Kaggle)
│
├── 2025 (early)
│ ├── Zochi (IntologyAI) — first A* venue acceptance ← this system
│ ├── Google Co-Scientist — Gemini-based, biomedical focus
│ └── AIRA₂ (Meta FAIR) — agentic iterative research assistant
│
├── 2025 (mid)
│ ├── Locus (IntologyAI) — Zochi successor, surpasses humans on RE-Bench
│ └── Tempest → ACL 2025 — Zochi's A* publication milestone
│
└── 2026
├── AutoResearchClaw — 23-stage open-source pipeline
├── EurekaClaw — mathematical theorem proving
└── K-Dense Co-Scientist — bring-your-own-key research agent
Unique Position in the Ecosystem
Zochi is distinguished from every other system in the autoresearch landscape by a single fact: real peer review at the highest tier. While AI Scientist (Sakana AI) demonstrated that LLMs could produce paper-shaped artifacts, and Google Co-Scientist showed Gemini-powered hypothesis generation, Zochi is the only system whose fully autonomous output survived the ~20% acceptance rate filter of a CORE A*-ranked conference. This is a qualitative threshold that separates "demonstration" from "contribution."
Quality Validation Hierarchy
│
├── Self-evaluation only ──────────── AI Scientist, Agent Laboratory
│ └── LLM-as-judge or automated metrics
│
├── Workshop acceptance (~60-70%) ─── Zochi (ICLR 2025), AI Scientist
│ └── Lower bar, shorter papers, less rigorous review
│
├── Main conference acceptance (~20%) ─ Zochi (ACL 2025) ← only system here
│ └── Full peer review, rebuttals, meta-review
│
└── Journal publication ──────────── EGNN-Fusion (under review)
└── Extended review cycles, revision rounds
2 Authors and Team
| Author | Role (Inferred) |
|---|---|
| Andy Zhou | Lead developer / system architect / first author on all Zochi papers |
| Ron Arel | Co-founder / co-author on all Zochi papers |
| Soren Dunn | Core contributor |
| Nikhil Khandekar | Core contributor |
BibTeX Citations
@article{zhou2025zochi,
title = {Zochi Technical Report: The First Artificial Scientist},
author = {Zhou, Andy and Arel, Ron and Dunn, Soren and Khandekar, Nikhil},
year = {2025},
url = {https://github.com/IntologyAI/Zochi/blob/main/Zochi_Technical_Report.pdf}
}
@inproceedings{zhou2025tempest,
title = {Tempest: Autonomous Multi-Turn Jailbreaking of Large Language
Models with Tree Search},
author = {Zhou, Andy and Arel, Ron},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)},
year = {2025},
url = {https://arxiv.org/abs/2503.10619}
}
@inproceedings{zhou2025csreft,
title = {Compositional Subspace Representation Fine-tuning for
Adaptive Large Language Models},
author = {Zhou, Andy and Arel, Ron},
booktitle = {SCOPE Workshop, ICLR 2025},
year = {2025},
url = {https://openreview.net/forum?id=YqYcm0mpFp}
}
Team composition: IntologyAI is a small, focused research lab — 4 named contributors compared to 16 at AutoResearchClaw (AIMING Lab) or 25 at Meta FAIR's AIRA₂. The compact team size is notable: Zochi's quality-per-headcount ratio is the highest in the autoresearch space, achieving more validated research impact with fewer people than any comparable system.
Institutional context: IntologyAI is a San Francisco-based startup (some early sources described it as London-based; current materials list San Francisco). Unlike academic lab projects (AutoResearchClaw, EurekaClaw) or big-tech teams (Google Co-Scientist, AIRA₂), IntologyAI is a venture-backed startup whose entire product identity is "Artificial Scientists." This commercial focus explains both the closed-source nature and the aggressive milestone pursuit.
Team Size Comparison
| System | Team Size | Organization Type | Code Availability |
|---|---|---|---|
| Zochi | 4 | Startup | Closed-source |
| AI Scientist | ~6 | Research lab (Sakana AI) | Open-source |
| Google Co-Scientist | ~20+ | Big tech (Google DeepMind) | Closed-source |
| AIRA₂ | 25 | Big tech (Meta FAIR) | Partially open |
| AutoResearchClaw | 16 | Academic (multi-university) | Open-source (MIT) |
| EurekaClaw | 8 | Academic (single lab) | Open-source (Apache 2.0) |
3 Core Contribution
Zochi's core contribution is twofold: (1) a complete autonomous research pipeline that emulates the scientific method from literature analysis through peer-reviewed publication, and (2) empirical proof that AI systems can produce research accepted at the highest tier of peer review.
The Research Quality Gap
The most striking aspect of Zochi is not its architecture (which remains largely undisclosed) but the measurable quality gap between its outputs and those of all other AI research systems:
| System | Automated NeurIPS Reviewer Score | Real Peer Review | Venue Tier |
|---|---|---|---|
| Zochi | 8, 8, 7 (avg 7.67) | Yes — accepted | A* (ACL 2025) |
| AI Scientist (Sakana) | ~4 (average) | Workshop only | C (workshops) |
| Agent Laboratory | ~3-4 | No | N/A |
| AIDE | N/A (engineering, not research) | No | N/A |
| OpenHands | N/A | No | N/A |
This quality gap (~3.67 points on a 10-point scale) is significant. The NeurIPS guidelines scale treats 6 as the acceptance threshold for a top ML venue; Zochi averages 7.67 while competing systems cluster around 3-4.
Three Research Contributions
Unlike systems that demonstrate capability on toy domains (2D diffusion, toy language models), Zochi produced three substantive research contributions across distinct fields:
Zochi's Research Portfolio
│
├── 1. CS-ReFT (AI / Parameter-Efficient Fine-Tuning)
│ ├── Domain: Representation learning for LLMs
│ ├── Contribution: Orthonormal subspace edits in hidden states
│ ├── Result: 93.94% AlpacaEval win rate (> GPT-3.5-Turbo)
│ ├── Efficiency: 0.0098% of model parameters
│ ├── Venue: SCOPE Workshop, ICLR 2025
│ └── Reviewer scores: (6, 7, 6) — avg 6.33
│
├── 2. Siege → Tempest (AI Safety / Red Teaming)
│ ├── Domain: LLM safety, adversarial attacks
│ ├── Contribution: Tree-search multi-turn jailbreaking
│ ├── Result: 100% on GPT-3.5-T, 97% on GPT-4
│ ├── Discovery: "Partial compliance" vulnerability pattern
│ ├── Venue: ICLR 2025 Workshop → ACL 2025 Main
│ ├── Workshop scores: (7, 7) — avg 7.0
│ └── ACL meta-review: score 4 = top 8.2% of submissions
│
└── 3. EGNN-Fusion (Computational Biology)
├── Domain: Protein-nucleic acid binding site prediction
├── Contribution: Efficient EGNN architecture for binding sites
├── Result: 95% parameter reduction, competitive performance
├── Venue: Under journal review
└── Significance: Cross-domain capability demonstration
Differentiating Capabilities
| Capability | Zochi | AI Scientist | Co-Scientist | AutoResearchClaw |
|---|---|---|---|---|
| End-to-end autonomous research | Yes | Yes | Partial | Yes |
| Multi-domain research | 3 domains | 1 domain per run | Biomedical only | Configurable |
| A* venue acceptance | Yes (ACL) | No | No | No |
| Workshop acceptance | Yes (ICLR) | Yes (workshops) | No | Not yet |
| Real peer review survived | Yes | No (self-generated reviews) | No | No |
| Cross-domain transfer | Yes (AI → bio) | No | No | No |
| Automated quality (NeurIPS score) | 7.67 | ~4 | N/A | N/A |
| MLE-Bench engineering | 80% > median | N/A | N/A | N/A |
| Human involvement | Figures, citations, minor edits | Similar | Significant | None (fully automated) |
Problem Complexity Spectrum
An underappreciated aspect of Zochi's contribution is the complexity of problems tackled relative to other systems:
Problem Complexity Spectrum
│
│ Simple ◄──────────────────────────────────► Complex
│
│ ├─ AI Scientist: 2D diffusion, toy LMs, specific cognitive biases
│ │ └── Constrained problem spaces with clear metrics
│ │
│ ├─ Agent Laboratory: Predefined research templates
│ │ └── Structured task decomposition
│ │
│ ├─ AIDE / OpenHands: Kaggle competitions (engineering)
│ │ └── Well-defined objectives with leaderboard scores
│ │
│ └─ Zochi: Open-ended research challenges
│ ├── CS-ReFT: Novel method design + theoretical motivation
│ ├── Tempest: Framework design + vulnerability discovery
│ └── EGNN-Fusion: Cross-domain architecture design
│ └── Each requires novel methodology, not just optimization
4 Supported Solutions
Research Pipeline Phases
Based on the technical report and blog descriptions, Zochi supports the following research phases:
| Phase | Description | Automation Level |
|---|---|---|
| Literature Analysis | Ingests and analyzes thousands of research papers | Fully autonomous |
| Gap Identification | Identifies non-obvious connections and limitations | Fully autonomous |
| Hypothesis Generation | Proposes innovative solutions to identified gaps | Fully autonomous |
| Method Design | Designs novel methods and architectures | Fully autonomous |
| Implementation | Autonomously implements proposed methods | Fully autonomous |
| Experiment Design | Designs controlled experiments with ablation studies | Fully autonomous |
| Experiment Execution | Runs experiments, parallelized across multiple trials | Fully autonomous |
| Validation | Generates evaluation scripts on standardized datasets | Fully autonomous |
| Result Analysis | Interprets results and draws conclusions | Fully autonomous |
| Manuscript Preparation | Generates full research paper | Mostly autonomous |
| Figure Creation | Creating publication-quality figures | Human |
| Citation Formatting | Formatting citations and references | Human |
| Minor Edits | Formatting fixes, minor writing corrections | Human |
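The automation split in the table above can be captured as data. The following is a minimal sketch; the `Phase` type, phase names, and automation labels are this report's own encoding of the table, not IntologyAI's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    automation: str  # "autonomous" | "mostly" | "human"

# The thirteen phases from the table, in pipeline order.
PIPELINE = [
    Phase("literature_analysis", "autonomous"),
    Phase("gap_identification", "autonomous"),
    Phase("hypothesis_generation", "autonomous"),
    Phase("method_design", "autonomous"),
    Phase("implementation", "autonomous"),
    Phase("experiment_design", "autonomous"),
    Phase("experiment_execution", "autonomous"),
    Phase("validation", "autonomous"),
    Phase("result_analysis", "autonomous"),
    Phase("manuscript_preparation", "mostly"),
    Phase("figure_creation", "human"),
    Phase("citation_formatting", "human"),
    Phase("minor_edits", "human"),
]

def autonomous_fraction(pipeline):
    """Share of phases that run without any human touchpoint."""
    return sum(p.automation == "autonomous" for p in pipeline) / len(pipeline)
```

Nine of the thirteen listed phases are fully autonomous; only cosmetic, late-pipeline steps require a human.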
Solution Categories
Zochi Solution Architecture
│
├── Research Discovery Solutions
│ ├── Large-scale literature retrieval and analysis
│ ├── Cross-paper pattern identification
│ ├── Research gap detection
│ └── Direction scoring and selection
│
├── Method Innovation Solutions
│ ├── Novel architecture design (EGNN-Fusion)
│ ├── Novel training methodology design (CS-ReFT)
│ ├── Novel adversarial framework design (Tempest)
│ └── Cross-domain knowledge transfer
│
├── Experimental Validation Solutions
│ ├── Controlled experiment design
│ ├── Ablation study generation
│ ├── Multi-trial parallelized execution
│ ├── Automated validation script generation
│ └── Standardized dataset evaluation
│
└── Publication Solutions
├── Full paper generation (LaTeX)
├── Technical writing at conference quality
└── Reviewer response preparation (manual for ACL)
Domain Flexibility
Unlike domain-locked systems (EurekaClaw → mathematics, Google Co-Scientist → biomedicine), Zochi demonstrates domain generality across three distinct fields:
| Domain | Paper | Technical Approach | Result Quality |
|---|---|---|---|
| AI / Representation Learning | CS-ReFT | Orthonormal subspace edits + router | ICLR workshop accepted |
| AI Safety | Siege → Tempest | Tree search + partial compliance tracking | ACL 2025 main (A*) |
| Computational Biology | EGNN-Fusion | Equivariant GNN architecture design | Journal under review |
This domain breadth is achieved without domain-specific plugins or handcrafted tool suites — Zochi's pipeline is general enough to produce contributions across fundamentally different fields, from model fine-tuning to protein structure prediction.
5 LLM Integration
Model Information
The Zochi technical report does not explicitly disclose which LLM backbone powers the system. However, several inferences can be made:
| Aspect | Assessment | Evidence |
|---|---|---|
| Primary model | Likely Claude or GPT-4 class | Quality of generated text, reasoning depth |
| Code generation | High-quality autonomous implementation | CS-ReFT and Tempest are fully implemented |
| Multi-model | Possibly — literature analysis may use different model from code generation | Cost optimization for high-volume lit review |
| Fine-tuned | Unknown | Closed-source; no indication of custom training |
LLM Usage Patterns
Based on the system's capabilities, Zochi likely uses LLM calls across multiple pipeline stages:
LLM Call Distribution (Inferred)
│
├── Literature Analysis
│ ├── Paper summarization (high volume, lower complexity)
│ ├── Gap identification (cross-paper reasoning)
│ └── Direction scoring (comparative judgment)
│ Estimated: 40-60% of total tokens
│
├── Method Design
│ ├── Hypothesis generation (creative reasoning)
│ ├── Architecture design (technical depth)
│ └── Novel approach formulation
│ Estimated: 10-15% of total tokens
│
├── Implementation
│ ├── Code generation (implementation from design)
│ ├── Debugging and iteration
│ └── Test case generation
│ Estimated: 15-25% of total tokens
│
├── Experimentation
│ ├── Experiment script generation
│ ├── Result interpretation
│ └── Ablation study design
│ Estimated: 5-10% of total tokens
│
└── Writing
├── Paper drafting (structured writing)
├── Technical exposition
└── Related work synthesis
Estimated: 10-15% of total tokens
Validation Engine
A distinctive feature of Zochi's LLM integration is the automatic validation engine:
"Our automatic validation engine generates evaluation scripts based on standardized datasets that remain unmodified throughout testing, ensuring results reflect genuine improvements."
This implies a separation of concerns:
1. The generation LLM produces methods and code.
2. The validation engine independently generates evaluation scripts.
3. The standardized datasets are not modified by the generation process.
This architectural choice prevents the common failure mode where AI systems inadvertently optimize for their own evaluation metrics rather than genuine performance improvements.
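A minimal sketch of this separation, assuming a checksum guard over a frozen benchmark. The function names and the toy dataset are invented for illustration; Zochi's actual validation engine is undisclosed:

```python
import hashlib
import json

# Illustrative sketch (not IntologyAI's code): evaluation data is frozen
# and fingerprinted before generation starts, so the generation side
# cannot silently optimize against a mutated benchmark.
def fingerprint(dataset):
    """Stable hash of an evaluation dataset."""
    blob = json.dumps(dataset, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def run_validated(method, dataset, expected_fp):
    """Refuse to score a method if the benchmark has been touched."""
    if fingerprint(dataset) != expected_fp:
        raise RuntimeError("evaluation dataset was modified")
    return sum(method(x) == y for x, y in dataset) / len(dataset)

# Usage: freeze the benchmark once, before any generation happens.
bench = [(1, 2), (2, 4), (3, 6)]   # toy (input, label) pairs
frozen = fingerprint(bench)
accuracy = run_validated(lambda x: 2 * x, bench, frozen)
```

Any attempt to evaluate against a tampered dataset fails loudly instead of producing an inflated score.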
Comparison to Other Systems' LLM Integration
| System | LLM Backend | Multi-Model | Disclosed | Custom Prompts |
|---|---|---|---|---|
| Zochi | Undisclosed (likely frontier model) | Unknown | No | Unknown |
| AI Scientist | Claude 3.5 / GPT-4 | Yes (configurable) | Yes | Yes (open) |
| Google Co-Scientist | Gemini | Yes (multi-agent) | Partially | No |
| AIRA₂ | Llama-based | Yes | Yes | Yes |
| AutoResearchClaw | Configurable | Yes | Yes (open) | Yes (open) |
| EurekaClaw | Claude Sonnet (default) | Configurable | Yes (open) | Yes (open) |
Zochi's closed-source nature means its LLM integration details remain proprietary. This is both a limitation for scientific reproducibility and a competitive advantage — the prompts, model selection strategies, and chain-of-thought patterns that produce A*-quality research are IntologyAI's core intellectual property.
6 Key Results
Headline Results Summary
| Metric | Value | Context |
|---|---|---|
| ACL 2025 acceptance | Main proceedings | First AI system at A* venue; ~21.3% acceptance rate |
| ACL meta-review score | 4 | Top 8.2% of all ACL submissions |
| ICLR 2025 workshops | 2 papers accepted | CS-ReFT (SCOPE) + Siege (Building Trust) |
| Automated NeurIPS scores | 8, 8, 7 (avg 7.67) | Acceptance threshold = 6; other AI systems average ~4 |
| MLE-Bench (exploratory) | 80% > median human; 50% medal | Without task-specific optimization |
| CS-ReFT AlpacaEval | 93.94% win rate | Surpasses GPT-3.5-Turbo (86.30%) |
| CS-ReFT efficiency | 0.0098% parameters | 12.7x fewer than LoRA |
| Tempest vs GPT-3.5-T | 100% attack success | JailbreakBench dataset |
| Tempest vs GPT-4 | 97% attack success | Fewer queries than Crescendo/GOAT |
| EGNN-Fusion efficiency | 95% parameter reduction | Competitive binding site prediction |
Detailed Results by Paper
Paper 1: CS-ReFT (ICLR 2025 SCOPE Workshop)
Problem: Cross-skill interference in parameter-efficient fine-tuning — improvements on one task degrade performance on others.
Method: Learns multiple orthonormal subspace transformations in hidden-state representations, each specializing in a distinct skill, composed via a lightweight router.
| Metric | CS-ReFT (Llama-2-7B) | GPT-3.5-Turbo | LoRA | ReFT (base) |
|---|---|---|---|---|
| AlpacaEval win rate | 93.94% | 86.30% | ~85% | ~88% |
| Parameters used | 0.0098% | N/A | ~0.12% | ~0.06% |
| Cross-task interference | Minimal | N/A | Moderate | Moderate |
Technical innovation: Unlike LoRA and similar methods that impose orthogonality at the weight level, CS-ReFT applies orthonormality constraints at the hidden-state level. This more directly addresses interference where it manifests — in the model's internal representations rather than in parameter space.
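To make the orthonormality idea concrete, here is a toy Gram-Schmidt sketch in pure Python. It illustrates the property CS-ReFT relies on — mutually orthogonal, unit-norm subspace directions so that one skill's edit has zero projection onto another's — and is not the paper's implementation:

```python
import math

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors (plain lists of floats)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # Subtract the projection onto each existing basis direction.
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two "skill" directions in a toy 3-d hidden space, made orthonormal.
R = gram_schmidt([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
```

After orthonormalization, an edit along `R[0]` leaves the component along `R[1]` untouched — the geometric reason cross-skill interference is minimized.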
Reviewer assessment (SCOPE Workshop, ICLR 2025):
| Reviewer | Score | Key Comments |
|---|---|---|
| Reviewer 1 | 6 | Effective approach, addresses critical limitation of ReFT |
| Reviewer 2 | 7 | "Clever idea"; strong empirical results |
| Reviewer 3 | 6 | Solid contribution to parameter-efficient methods |
Paper 2: Siege → Tempest (ICLR 2025 Workshop → ACL 2025)
Problem: Existing jailbreaking methods rely on single carefully crafted prompts; multi-turn attacks are understudied.
Method: Tree search over conversation branches, tracking partial compliance across turns and re-injecting policy leaks into subsequent queries.
Tempest Tree Search Mechanism
│
│ Turn 1: "Tell me about chemistry"
│ ├── Branch A: [Safe response] → partial compliance detected
│ ├── Branch B: [Deflection] → pruned
│ └── Branch C: [Partial info] → promising
│
│ Turn 2 (from Branch C): "Expand on the synthesis process"
│ ├── Branch C1: [More detail] → partial compliance ↑
│ ├── Branch C2: [Refusal] → pruned
│ └── Branch C3: [Fragment reveals] → EXPLOIT
│
│ Turn 3 (from Branch C3): Re-inject fragments + escalate
│ └── Branch C3a: [Full compliance] → JAILBREAK COMPLETE
│
│ Key insight: "Partial compliance" — models reveal fragments
│ of restricted information while appearing to maintain safety
│ guardrails. These fragments accumulate across turns.
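The branch-expand-prune loop in the diagram can be written as a short search routine. The compliance scorer, branching width, and thresholds below are invented stand-ins for illustration; Tempest's actual components are described only at the level shown above:

```python
def tree_search(respond, score, root_prompt, max_turns=3, width=3, prune_below=0.2):
    """Expand conversation branches, prune low-compliance ones, and stop
    when any branch's accumulated partial compliance reaches 1.0."""
    frontier = [([root_prompt], 0.0)]  # (conversation so far, compliance)
    for _ in range(max_turns):
        next_frontier = []
        for convo, compliance in frontier:
            for branch in range(width):
                reply = respond(convo, branch)
                new_score = compliance + score(reply)  # fragments accumulate
                if new_score >= 1.0:
                    return convo + [reply], new_score  # jailbreak complete
                if new_score >= prune_below:           # keep promising branches
                    next_frontier.append((convo + [reply], new_score))
        frontier = next_frontier or frontier           # never empty the frontier
    return max(frontier, key=lambda t: t[1])

# Toy target model: one branch per turn leaks a fragment worth 0.4.
leaky = lambda convo, b: "fragment" if b == 2 else "refusal"
scorer = lambda reply: 0.4 if reply == "fragment" else 0.0
convo, total = tree_search(leaky, scorer, "Tell me about chemistry")
```

Against this toy target, three accumulated fragments cross the compliance threshold — the same escalation pattern the diagram labels "JAILBREAK COMPLETE".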
| Target Model | Tempest Attack Success | Queries Used (vs. baselines) | Crescendo Success Rate | GOAT Success Rate |
|---|---|---|---|---|
| GPT-3.5-Turbo | 100% | Fewer | Lower | Lower |
| GPT-4 | 97% | Fewer | Lower | Lower |
Evolution from Siege to Tempest: The ACL version significantly expanded the ICLR workshop paper:
| Aspect | Siege (ICLR Workshop) | Tempest (ACL 2025) |
|---|---|---|
| Paper length | 2-4 pages (Tiny Paper) | Full conference paper |
| Experiments | JailbreakBench | Expanded evaluations |
| Methodology | Core tree search | Enhanced with cross-branch learning |
| Contribution depth | Proof of concept | Comprehensive framework |
| Review scores | (7, 7) | Meta-review: 4 (top 8.2%) |
ACL Acceptance Context:
| Metric | Value | Significance |
|---|---|---|
| ACL 2025 acceptance rate | ~21.3% | Highly selective |
| Meta-review score | 4 | Top 8.2% of all submissions |
| CORE ranking | A* | Highest tier of scientific venue |
| Google Scholar ranking | Top 40 globally | Among most impactful venues in all CS |
Paper 3: EGNN-Fusion (Under Journal Review)
Problem: State-of-the-art protein-nucleic acid binding site prediction requires enormous model parameters.
Method: Efficient equivariant graph neural network architecture that achieves competitive performance with 95% fewer parameters.
| Metric | EGNN-Fusion | State-of-the-Art Baselines |
|---|---|---|
| Parameter count | 5% of baseline | 100% (reference) |
| Binding site prediction | Competitive | Reference level |
| Equivariance | E(3)-equivariant | Varies by method |
Significance: This paper's primary role is as a cross-domain capability proof. The fact that the same AI system that designed a parameter-efficient fine-tuning method for LLMs also designed an efficient protein structure prediction architecture demonstrates genuine domain generality — not just re-skinning the same approach across similar problems.
MLE-Bench Performance (Exploratory)
| Metric | Zochi | AIDE | OpenHands | Agent Lab |
|---|---|---|---|---|
| Surpass median human | 80% of tasks | — | — | — |
| Medal rate | 50% of tasks | 8.7% (any medal) | 4.4% | — |
| Task-specific optimization | None | None | None | None |
The MLE-Bench results are particularly notable because Zochi was evaluated without any task-specific optimization — the same general-purpose research pipeline was applied to Kaggle-style engineering challenges, demonstrating transfer from research to engineering tasks.
Automated Quality Assessment
Zochi uses an automated reviewer based on NeurIPS conference guidelines to benchmark paper quality:
Automated Reviewer Score Distribution
│
│ 10 ─┤
│ 9 ─┤
│ 8 ─┤ ██ ██ Zochi papers (8, 8, 7)
│ 7 ─┤ ██ ██ ██
│ 6 ─┤──██──██──██────── acceptance threshold ──────────
│ 5 ─┤
│ 4 ─┤ ░░ ░░ Other AI systems (~4 avg)
│ 3 ─┤ ░░ ░░
│ 2 ─┤
│ 1 ─┤
│ └──────────────────
│ Zochi Others
│
│ Legend: ██ = Zochi ░░ = AI Scientist, Agent Lab, etc.
│ The ~3.67-point gap represents a qualitative leap
│ from "rejected" to "strong accept" territory.
Quality Gap Analysis
The quality gap between Zochi and other AI research systems deserves careful examination:
| Quality Dimension | Zochi | Typical AI-Generated Papers |
|---|---|---|
| Problem selection | Open-ended, frontier challenges | Constrained, predefined tasks |
| Technical novelty | Novel methods (orthonormal subspaces, tree-search jailbreaking) | Incremental variations |
| Experimental rigor | Controlled experiments, ablations, multiple trials | Basic comparisons |
| Writing quality | Near-publication quality (minor edits needed) | Significant editing required |
| Domain awareness | Deep understanding of related work | Surface-level citations |
| Result significance | State-of-the-art on standard benchmarks | Toy-scale demonstrations |
7 Reproducibility
Open Artifacts
| Artifact | Available | Location |
|---|---|---|
| Technical report | Yes | PDF on GitHub |
| CS-ReFT paper | Yes | OpenReview |
| Tempest paper | Yes | arXiv:2503.10619 |
| Siege workshop paper | Yes | OpenReview |
| System code | No | Closed-source |
| Prompt templates | No | Closed-source |
| Pipeline configuration | No | Closed-source |
| Model weights / fine-tunes | No | Closed-source |
| Experiment code (papers) | Partial | GitHub repository |
| Datasets used | Standard | AlpacaEval, JailbreakBench, protein datasets |
Reproducibility Assessment
| Factor | Rating | Details |
|---|---|---|
| System reproducibility | Very Low | Closed-source; no installation, no configuration, no pipeline |
| Paper result reproducibility | Medium | Standard datasets, published methods, partial code |
| Method reproducibility | Medium-High | CS-ReFT and Tempest are clearly described; independent implementation possible |
| Evaluation reproducibility | Medium | NeurIPS automated reviewer is a known methodology; MLE-Bench is open |
| Peer review validation | High | ACL and ICLR reviews are public records |
Reproducibility Comparison
| System | Code Available | Can Reproduce Pipeline | Can Reproduce Results |
|---|---|---|---|
| Zochi | No | No | Partially (published papers only) |
| AI Scientist | Yes | Yes | Yes (with API keys) |
| AutoResearchClaw | Yes | Yes | Yes (with API keys) |
| EurekaClaw | Yes | Yes | Yes (with API keys) |
| Google Co-Scientist | No | No | No |
| K-Dense BYOK | Yes | Yes | Yes (with API keys) |
What Can Be Reproduced
- CS-ReFT: The method is described with enough detail to reimplement. The AlpacaEval benchmark is public. The orthonormal subspace transformation approach is straightforward to implement given the paper.
- Tempest: The tree search over conversation branches with partial compliance tracking is well-specified. JailbreakBench is public. The core algorithm (BFS over adversarial prompt branches) could be reimplemented.
- EGNN-Fusion: The equivariant GNN architecture is described. Protein-nucleic acid binding datasets are standard.
What Cannot Be Reproduced
- The research pipeline itself — How Zochi selects research directions, generates hypotheses, designs methods, and writes papers
- The literature analysis system — How thousands of papers are ingested, analyzed, and patterns extracted
- The validation engine — How evaluation scripts are automatically generated
- The meta-cognitive layer — How Zochi decides which ideas are promising enough to pursue
- The quality calibration — What makes Zochi produce 7.67-quality papers when others produce ~4
This reproducibility gap is the most significant criticism of Zochi from a scientific perspective. While the individual papers are reproducible, the system that produces papers is not — making it impossible for the research community to verify, extend, or improve upon the core methodology.
8 Compute and API Costs
Cost Model (Inferred)
Since Zochi is closed-source, cost estimates must be inferred from the described capabilities:
Estimated Cost Model
│
│ Cost per paper ≈ Literature_Analysis + Method_Design + Implementation
│ + Experimentation + Writing + Validation
│
│ Literature_Analysis:
│ ├── "Thousands of papers" analyzed
│ ├── At ~500 tokens/paper summary × 2,000 papers = 1M tokens input
│ ├── Plus gap analysis and direction scoring: ~200K tokens
│ └── Subtotal: ~1.2M tokens (mostly input)
│
│ Method_Design + Implementation:
│ ├── Hypothesis generation: ~50K tokens
│ ├── Architecture design iteration: ~100K tokens
│ ├── Code generation + debugging: ~200K tokens
│ └── Subtotal: ~350K tokens (mix of input/output)
│
│ Experimentation:
│ ├── Experiment script generation: ~50K tokens
│ ├── Result interpretation: ~100K tokens
│ ├── Ablation study design: ~50K tokens
│ └── Subtotal: ~200K tokens
│ Plus compute: GPU hours for training/evaluation
│
│ Writing + Validation:
│ ├── Paper generation: ~100K tokens
│ ├── Validation script generation: ~50K tokens
│ └── Subtotal: ~150K tokens (mostly output)
│
│ TOTAL ESTIMATED: ~1.9M tokens per paper
│ At ~$15/M tokens (frontier model): ~$30 in API costs
│ Plus GPU compute for experiments: varies ($10-$500+)
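The arithmetic in the estimate above, reproduced as a check. All per-stage token counts and the $15/M-token price are this section's rough assumptions, not disclosed figures:

```python
# Back-of-envelope cost model from the breakdown above (assumed values).
stage_tokens = {
    "literature_analysis": 1_200_000,
    "method_design_and_implementation": 350_000,
    "experimentation": 200_000,
    "writing_and_validation": 150_000,
}
price_per_million = 15.0  # USD per million tokens, frontier-model blended rate

total_tokens = sum(stage_tokens.values())                # 1.9M tokens
api_cost = total_tokens / 1_000_000 * price_per_million  # about $30
```

GPU compute for experiments is additive on top of this and varies by orders of magnitude with the training workload.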
Timeline Estimates
| Phase | Estimated Duration | Bottleneck |
|---|---|---|
| Literature analysis | Hours | API rate limits, paper retrieval |
| Hypothesis + method design | Hours | LLM reasoning depth |
| Implementation | Hours to day | Code complexity, debugging cycles |
| Experimentation | Hours to days | GPU availability, training time |
| Validation | Hours | Evaluation script execution |
| Writing | Hours | Paper generation + formatting |
| Total | Hours to days | Experiment compute |
IntologyAI states: "Methods typically only require hours to validate, and a full paper takes only days to complete."
Cost Comparison
| System | Estimated Cost per Paper | Time per Paper | Model Tier |
|---|---|---|---|
| Zochi | ~$30-500+ (API + compute) | Days | Frontier (undisclosed) |
| AI Scientist | $10-50+ | Hours to days | Claude/GPT-4 |
| AutoResearchClaw | $5-30 | Hours | Configurable |
| EurekaClaw | $1-50+ | Hours | Claude Sonnet |
| K-Dense BYOK | $0.05-5 | Minutes to hours | User-selected |
| Human PhD student | $50K-100K/year salary | Months to years | N/A |
Hardware Requirements (Inferred)
| Requirement | Minimum | Likely Production |
|---|---|---|
| CPU | Multi-core | Cloud instances |
| RAM | 16+ GB | 32-64 GB |
| GPU | Required for experiments | Multi-GPU for training (CS-ReFT used Llama-2-7B) |
| Storage | 10+ GB per project | Cloud storage for paper corpus |
| Network | Required | High-bandwidth for paper retrieval + API calls |
| API Access | Frontier LLM API | Rate-limited; likely parallel calls |
9 Architecture Solution
Pipeline Architecture (Inferred from Descriptions)
Zochi operates as a multi-stage autonomous pipeline that mirrors the scientific method. While the internal implementation is closed-source, the described stages can be mapped to an architectural diagram:
Zochi Architecture Overview (Inferred)
│
│ INPUT: Research domain / high-level direction
│ (e.g., "novel jailbreaking methods")
│
│ ╔════════════════════════════════════════════════════════╗
│ ║ STAGE 1: LITERATURE ANALYSIS ║
│ ║ ║
│ ║ Paper Retrieval ──► Summarization ──► Pattern Mining ║
│ ║ │ │ │ ║
│ ║ (arXiv, S2, (Per-paper (Cross-paper ║
│ ║ venue DBs) key findings) connections) ║
│ ║ ║
│ ║ Output: Research landscape map + identified gaps ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 2: HYPOTHESIS GENERATION ║
│ ║ ║
│ ║ Gap Analysis ──► Direction Proposal ──► Selection ║
│ ║ │ │ │ ║
│ ║ (Identify (Generate (Score and ║
│ ║ limitations) novel ideas) rank) ║
│ ║ ║
│ ║ Output: Research hypothesis + proposed method ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 3: METHOD DESIGN ║
│ ║ ║
│ ║ Architecture Design ──► Technical Specification ║
│ ║ │ │ ║
│ ║ (Novel method (Formal description, ║
│ ║ formulation) math formulation) ║
│ ║ ║
│ ║ Output: Complete method specification ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 4: IMPLEMENTATION ║
│ ║ ║
│ ║ Code Generation ──► Testing ──► Debugging ──► ↻ ║
│ ║ │ │ │ ║
│ ║ (Method (Unit + (Iterative ║
│ ║ implementation) integration) repair) ║
│ ║ ║
│ ║ Output: Working implementation of proposed method ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 5: EXPERIMENTATION ║
│ ║ ║
│ ║ Experiment Design ──► Parallel Execution ──► Results ║
│ ║ │ │ │ ║
│ ║ (Controlled (Multi-trial (Statistical║
│ ║ experiments, parallelized) analysis) ║
│ ║ ablation studies) ║
│ ║ ║
│ ║ ┌──────────────────────────────┐ ║
│ ║ │ VALIDATION ENGINE │ ║
│ ║ │ Auto-generates eval scripts│ ║
│ ║ │ Standardized datasets │ ║
│ ║ │ Datasets NOT modified │ ║
│ ║ └──────────────────────────────┘ ║
│ ║ ║
│ ║ Output: Experimental results + analysis ║
│ ╚═══════════════════════╤════════════════════════════════╝
│ │
│ ╔═══════════════════════╧════════════════════════════════╗
│ ║ STAGE 6: MANUSCRIPT PREPARATION ║
│ ║ ║
│ ║ Paper Drafting ──► Related Work ──► Full Paper ║
│ ║ │ │ │ ║
│ ║ (Structure + (Literature (Complete ║
│ ║ technical integration) manuscript) ║
│ ║ writing) ║
│ ║ ║
│ ║ Human steps: figures, citation format, minor edits ║
│ ║ ║
│ ║ Output: Conference-ready paper ║
│ ╚════════════════════════════════════════════════════════╝
│
│ OUTPUT: Peer-reviewed publication (ACL 2025, ICLR 2025)
Architectural Differentiators
| Feature | Zochi | AI Scientist | AutoResearchClaw |
|---|---|---|---|
| Pipeline stages | ~6 (inferred) | ~8 | 23 |
| Parallel experiments | Yes | Limited | Yes |
| Validation engine | Dedicated + separated | Self-evaluation | Multi-agent review |
| Cross-domain | Yes (3 domains demonstrated) | Single domain per run | Configurable |
| Human involvement | Minimal (figures, citations) | Similar | Fully automated |
| Quality bar | A* venue acceptance | Workshop demos | Automated scores only |
| Stage granularity | Coarse (strategic) | Medium | Fine (23 stages) |
Key Architectural Decisions (Inferred)
- Separation of generation and validation: The validation engine generates evaluation scripts independently from the method generation process. This prevents the system from inadvertently gaming its own metrics.
- Parallelized experimentation: "Experiments are parallelized across multiple trials, significantly accelerating the research timeline." This suggests an experiment orchestration layer that manages GPU resources and collects results.
- Minimal human handoff: The architecture is designed to minimize human touchpoints. The only human steps are cosmetic (figures, formatting) rather than substantive (method design, experiment interpretation).
- Input minimalism: For the ACL paper, the input was merely "novel jailbreaking methods": three words that triggered the entire research pipeline, from literature analysis through to a 97% attack success rate on GPT-4.
Architecture Evolution: Siege vs. Tempest
The progression from ICLR workshop (Siege) to ACL main (Tempest) reveals architectural improvements:
Architecture Evolution
│
│ Siege (ICLR 2025 Workshop, v1)
│ ├── Input: "multi-turn attacks on LLMs" (autonomously identified)
│ ├── Pipeline: Standard autonomous flow
│ ├── Output: 2-4 page tiny paper
│ ├── Human contribution: Same as standard (figures, formatting)
│ └── Result: (7, 7) reviewer scores
│
│ Tempest (ACL 2025 Main, v2)
│ ├── Input: Same high-level idea (tree search + multi-turn jailbreak)
│ ├── Pipeline: Enhanced — "significantly improved design"
│ ├── Additional: Cross-branch learning mechanism
│ ├── Additional: Robust partial compliance tracking
│ ├── Additional: "More comprehensive experiments"
│ ├── Output: Full conference paper
│ ├── Human contribution: Same minimal scope
│ └── Result: Meta-review 4 (top 8.2%)
│
│ Key observation: The system could ITERATE on its own work,
│ producing substantially improved research on the same topic.
10 Component Breakdown
Inferred Component Architecture
Since Zochi is closed-source, the component breakdown must be inferred from the described capabilities and outputs. The following represents a plausible decomposition:
Zochi Component Map (Inferred)
│
├── CORE ENGINE
│ ├── Pipeline Orchestrator
│ │ ├── Stage sequencing and state management
│ │ ├── Error handling and recovery
│ │ └── Resource allocation across stages
│ │
│ ├── LLM Interface Layer
│ │ ├── API client(s) for frontier model(s)
│ │ ├── Prompt management and templating
│ │ ├── Token budget management
│ │ └── Response parsing and validation
│ │
│ └── Domain Abstraction
│ ├── Domain-agnostic pipeline flow
│ └── Domain-specific adapter patterns
│
├── LITERATURE ENGINE
│ ├── Paper Retrieval
│ │ ├── arXiv API integration
│ │ ├── Semantic Scholar API integration
│ │ ├── Venue-specific databases
│ │ └── Citation graph traversal
│ │
│ ├── Paper Analysis
│ │ ├── Abstract and full-text summarization
│ │ ├── Methodology extraction
│ │ ├── Result extraction
│ │ └── Limitation identification
│ │
│ └── Knowledge Synthesis
│ ├── Cross-paper pattern mining
│ ├── Gap identification
│ ├── Direction scoring
│ └── Research landscape mapping
│
├── HYPOTHESIS ENGINE
│ ├── Idea Generator
│ │ ├── Cross-paper connection identification
│ │ ├── Novel combination proposal
│ │ └── Feasibility assessment
│ │
│ ├── Method Designer
│ │ ├── Architecture specification
│ │ ├── Mathematical formulation
│ │ └── Technical approach planning
│ │
│ └── Selection Filter
│ ├── Novelty scoring
│ ├── Impact prediction
│ └── Feasibility ranking
│
├── IMPLEMENTATION ENGINE
│ ├── Code Generator
│ │ ├── Method implementation from specification
│ │ ├── Data loading and preprocessing
│ │ └── Training loop generation
│ │
│ ├── Testing Layer
│ │ ├── Unit test generation
│ │ ├── Integration testing
│ │ └── Debugging and repair loop
│ │
│ └── Environment Manager
│ ├── Dependency management
│ ├── GPU resource allocation
│ └── Experiment workspace isolation
│
├── EXPERIMENTATION ENGINE
│ ├── Experiment Designer
│ │ ├── Controlled experiment specification
│ │ ├── Ablation study generation
│ │ └── Baseline selection
│ │
│ ├── Execution Orchestrator
│ │ ├── Multi-trial parallelization
│ │ ├── Result collection
│ │ └── Resource management
│ │
│ └── VALIDATION ENGINE (SEPARATED)
│ ├── Auto-generates evaluation scripts
│ ├── Uses standardized, unmodified datasets
│ ├── Independent from generation process
│ └── Ensures genuine performance measurement
│
├── WRITING ENGINE
│ ├── Paper Generator
│ │ ├── Structure planning
│ │ ├── Section-by-section drafting
│ │ ├── Related work integration
│ │ └── Technical writing quality assurance
│ │
│ └── LaTeX Formatter
│ ├── Conference template compliance
│ ├── Table and equation formatting
│ └── Reference management
│
└── QUALITY ASSURANCE
├── Automated Reviewer
│ ├── NeurIPS guidelines-based scoring
│ ├── Multi-dimensional evaluation
│ └── Quality threshold enforcement
│
└── Result Verification
├── Statistical significance checking
├── Claim-to-evidence alignment
└── Reproducibility verification
Component Interaction Patterns
Information Flow Between Components
│
│ Domain Input ──────────────────────────────────────────┐
│ │ │
│ ▼ │
│ ┌─────────────┐ papers ┌──────────────┐ │
│ │ Literature │─────────────►│ Hypothesis │ │
│ │ Engine │ + gaps │ Engine │ │
│ └─────────────┘ └──────┬───────┘ │
│ │ │
│ method spec │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Implementation │ │
│ │ Engine │ │
│ └──────┬───────┘ │
│ │ │
│ working code │
│ │ │
│ ▼ │
│ ┌────────────┐ eval scripts ┌──────────────┐ │
│ │ Validation │◄─────────────│Experimentation│ │
│ │ Engine │───results───►│ Engine │ │
│ └────────────┘ └──────┬───────┘ │
│ │ │
│ results + analysis │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Writing │◄──────────┘
│ │ Engine │ domain context
│ └──────┬───────┘
│ │
│ ┌───────▼───────┐
│ │ Quality │
│ │ Assurance │
│ └───────┬───────┘
│ │
│ ▼
│ Conference Paper
Comparison: Component Density
| System | Named Components | Pipeline Stages | Agents | Tools |
|---|---|---|---|---|
| Zochi | ~12 (inferred) | ~6 | Unknown | Unknown |
| AI Scientist | ~8 | ~8 | 3 (researcher, reviewer, editor) | ~5 |
| AutoResearchClaw | ~15+ | 23 | 8+ specialized agents | 10+ |
| EurekaClaw | ~20+ | 7 | 7+ specialized agents | 8+ |
| Google Co-Scientist | Unknown | Multi-step | Multi-agent | Unknown |
Zochi appears to use a more consolidated architecture with fewer but more capable components, contrasting with the fine-grained stage decomposition of systems like AutoResearchClaw (23 stages) or EurekaClaw (7 stages with sub-agents per stage).
11 Core Mechanisms (Detailed)
11.1 Literature-Grounded Research Direction Selection
The literature analysis phase is described as ingesting "thousands of research papers" and identifying "non-obvious connections across papers." This implies a multi-layer analysis pipeline:
Literature Analysis Pipeline (Inferred)
│
├── Layer 1: RETRIEVAL
│ ├── Input: Domain string (e.g., "novel jailbreaking methods")
│ ├── Query expansion: LLM generates multiple search queries
│ ├── Sources: arXiv API, Semantic Scholar, venue proceedings
│ ├── Scale: "Thousands of papers" retrieved
│ └── Output: Raw paper corpus (titles, abstracts, full texts)
│
├── Layer 2: SUMMARIZATION
│ ├── Per-paper analysis:
│ │ ├── Key contribution extraction
│ │ ├── Methodology characterization
│ │ ├── Limitation identification
│ │ └── Result summary
│ ├── Efficiency: Likely uses cheaper model or shorter prompts
│ └── Output: Structured paper summaries
│
├── Layer 3: PATTERN MINING
│ ├── Cross-paper connection identification
│ ├── Methodology trend analysis
│ ├── "Non-obvious connections" — the key differentiator
│ │ ├── For CS-ReFT: Connected representation editing to
│ │ │ cross-skill interference (not obvious from either
│ │ │ literature alone)
│ │ ├── For Tempest: Connected multi-turn dialogue patterns
│ │ │ to systematic safety erosion (novel framing)
│ │ └── For EGNN-Fusion: Connected equivariant architectures
│ │ to binding site efficiency (cross-domain transfer)
│ └── Output: Pattern graph over research landscape
│
└── Layer 4: DIRECTION SELECTION
├── Gap scoring: novelty × feasibility × impact
├── Direction ranking
├── Selection of most promising direction
└── Output: Chosen research direction with justification
11.2 The "Partial Compliance" Discovery Mechanism
Zochi's most impactful scientific discovery — the "partial compliance" vulnerability pattern in LLMs — illustrates the system's ability to identify non-obvious phenomena:
Partial Compliance Discovery (Tempest)
│
│ Observation: When attacked across multiple turns, LLMs don't
│ simply "comply" or "refuse" — they exhibit a gradient of responses.
│
│ ┌────────────────────────────────────────────────────────┐
│ │ Response Spectrum │
│ │ │
│ │ Full Refusal ◄──────────────────────► Full Compliance │
│ │ │ │ │
│ │ │ ┌─────────────────────┐ │ │
│ │ │ │ PARTIAL COMPLIANCE │ │ │
│ │ │ │ ───────────────── │ │ │
│ │ │ │ Model reveals │ │ │
│ │ │ │ FRAGMENTS of │ │ │
│ │ │ │ restricted info │ │ │
│ │ │ │ while appearing │ │ │
│ │ │ │ to maintain safety │ │ │
│ │ │ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ │ These fragments │ │
│ │ │ ACCUMULATE across turns │ │
│ │ │ │ │ │
│ │ │ Until full compliance │ │
│ │ │ is achieved │ │
│ └───────┴──────────────┴────────────────────────┘ │
│ │
│ Key insight: Safety is not a binary gate but a │
│ continuously erodable surface. Minor concessions │
│ create anchor points for subsequent exploitation. │
└─────────────────────────────────────────────────────────────┘
This discovery is significant because it:
1. Reframes the safety problem, from binary (safe/unsafe) to a continuous compliance gradient
2. Was autonomously identified: Zochi discovered this pattern from literature analysis, not from human guidance
3. Has practical implications: multi-turn safety mechanisms must go beyond single-turn guardrails
4. Survived A* peer review, validating the discovery's novelty and significance
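The fragment-accumulation dynamic in the diagram can be made concrete with a toy tracker; the threshold and per-fragment weight below are invented parameters, not values from the Tempest paper.

```python
class ComplianceTracker:
    """Toy model of the partial-compliance pattern: fragments of restricted
    information leak turn by turn and accumulate toward full compliance."""

    def __init__(self, success_threshold: float = 1.0):
        self.fragments: list[str] = []
        self.score = 0.0
        self.threshold = success_threshold

    def record_turn(self, response_fragments: list[str],
                    fragment_weight: float = 0.25):
        # Each leaked fragment nudges the cumulative compliance score upward
        self.fragments.extend(response_fragments)
        self.score += fragment_weight * len(response_fragments)

    @property
    def fully_complied(self) -> bool:
        return self.score >= self.threshold

tracker = ComplianceTracker()
for turn_fragments in [["f1"], [], ["f2", "f3"], ["f4"]]:
    tracker.record_turn(turn_fragments)
```

Note how a turn with zero fragments (a refusal) does not reset the score: concessions already banked remain available as anchor points, which is the erosion dynamic the diagram describes.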
11.3 Orthonormal Subspace Representation Editing (CS-ReFT)
The CS-ReFT method demonstrates Zochi's ability to formulate technically novel approaches:
CS-ReFT Architecture (from paper)
│
│ Standard Fine-Tuning:
│ ┌────────────────────────────────────────┐
│ │ Weights modified → all tasks affected │
│ │ Task A improvement → Task B degrades │ = cross-skill
│ │ LoRA orthogonality: weight-level only │ interference
│ └────────────────────────────────────────┘
│
│ CS-ReFT Approach:
│ ┌────────────────────────────────────────────────────────┐
│ │ │
│ │ Hidden State Space (h) │
│ │ ┌─────────────────────────────────────┐ │
│ │ │ │ │
│ │ │ Subspace S₁ ──► Skill 1 edit │ │
│ │ │ (orthonormal) │ │
│ │ │ ⊥ │ │
│ │ │ Subspace S₂ ──► Skill 2 edit │ │
│ │ │ (orthonormal) │ │
│ │ │ ⊥ │ │
│ │ │ Subspace Sₖ ──► Skill k edit │ │
│ │ │ (orthonormal) │ │
│ │ │ │ │
│ │ └─────────────────────────────────────┘ │
│ │ │ │
│ │ ┌─────┴─────┐ │
│ │ │ Router │ │
│ │ │ (lightweight│ │
│ │ │ selector) │ │
│ │ └─────┬─────┘ │
│ │ │ │
│ │ Composed output │
│ │ │
│ │ Key innovation: orthonormality at hidden-state level │
│ │ not weight level → directly prevents interference │
│ │ where it manifests │
│ └────────────────────────────────────────────────────────┘
│
│ Results:
│ ├── 93.94% win rate on AlpacaEval (vs. 86.30% GPT-3.5-T)
│ ├── Only 0.0098% of model parameters
│ ├── 12.7x fewer parameters than LoRA
│ └── Minimal cross-task interference
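Why hidden-state-level orthonormality prevents interference can be shown in a few lines. This is not the paper's CS-ReFT implementation (which learns the subspaces and a router); it is a toy with fixed one-dimensional subspaces chosen to be orthonormal by construction.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hidden dim 4; two orthonormal 1-D skill subspaces (toy, hand-picked bases)
s1 = [1.0, 0.0, 0.0, 0.0]  # skill 1 subspace
s2 = [0.0, 1.0, 0.0, 0.0]  # skill 2 subspace, orthogonal to s1

def apply_skill_edit(h, subspace, delta):
    """Edit the hidden state only along the skill's unit-norm subspace."""
    return [hi + delta * bi for hi, bi in zip(h, subspace)]

h = [0.3, -0.7, 0.5, 0.1]
h_edited = apply_skill_edit(h, s1, 2.0)  # an edit targeting skill 1
```

Because `s1 ⊥ s2`, the skill-1 edit leaves the component of `h` along `s2` exactly unchanged: interference is prevented at the level where it manifests (the hidden state), rather than indirectly at the weight level as with LoRA-style orthogonality.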
11.4 Tree Search Over Conversation Branches (Tempest)
The Tempest framework implements a systematic search algorithm over multi-turn conversations:
Tempest Tree Search Algorithm
│
│ INITIALIZE:
│ root = initial adversarial prompt
│ target = restricted behavior
│ tree = {root}
│ compliance_tracker = {}
│
│ LOOP (for each turn t):
│ │
│ ├── EXPAND: For each active node n in tree:
│ │ ├── Generate k adversarial follow-ups
│ │ │ (breadth-first branching)
│ │ ├── Each follow-up exploits partial compliance
│ │ │ from n's response
│ │ └── Add branches to tree
│ │
│ ├── EVALUATE: For each new branch:
│ │ ├── Send to target model
│ │ ├── Receive response
│ │ ├── Measure compliance level:
│ │ │ ├── Full refusal → prune branch
│ │ │ ├── Partial compliance → track fragments
│ │ │ └── Full compliance → SUCCESS
│ │ └── Update compliance_tracker
│ │
│ ├── CROSS-BRANCH LEARNING (ACL version):
│ │ ├── Analyze successful partial compliance patterns
│ │ ├── Transfer effective strategies across branches
│ │ └── Reinject learned patterns into new prompts
│ │
│ ├── PRUNE: Remove branches with:
│ │ ├── Full refusals
│ │ ├── Stalled compliance
│ │ └── Redundant paths
│ │
│ └── RE-INJECT: For promising branches:
│ ├── Extract compliance fragments from responses
│ ├── Incorporate fragments into next turn's prompts
│ └── "Minor concessions accumulate into fully
│ disallowed outputs"
│
│ TERMINATE when:
│ ├── Full compliance achieved (success)
│ ├── All branches pruned (failure)
│ └── Max turns reached (timeout)
11.5 Validation Engine — Separation of Concerns
The validation engine is one of Zochi's most architecturally important mechanisms:
Validation Engine Design
│
│ PROBLEM: AI systems can inadvertently optimize for
│ their own evaluation metrics rather than
│ genuine performance improvements.
│
│ SOLUTION: Separate generation from evaluation
│
│ ┌──────────────────┐ ┌──────────────────┐
│ │ GENERATION PATH │ │ VALIDATION PATH │
│ │ │ │ │
│ │ Method design │ │ Eval script gen │
│ │ Implementation │ │ (independent) │
│ │ Training │ │ │
│ │ │ │ Standardized │
│ │ Produces: │ │ datasets (NOT │
│ │ - trained model │ │ modified by │
│ │ - method code │ │ generation path) │
│ │ │ │ │
│ └────────┬─────────┘ └────────┬─────────┘
│ │ │
│ │ ┌────────────┐ │
│ └─────►│ EVALUATION │◄──┘
│ │ │
│ │ Model tested│
│ │ on unmodified│
│ │ datasets via │
│ │ independent │
│ │ eval scripts │
│ └──────┬─────┘
│ │
│ Genuine results
│
│ This prevents:
│ ├── Data leakage from training to evaluation
│ ├── Metric gaming (optimizing for eval proxy)
│ ├── Self-confirming evaluation loops
│ └── Overfitting to evaluation procedure
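One way to enforce the "datasets NOT modified" guarantee is to fingerprint the evaluation data before the generation path runs. Nothing confirms Zochi does this; the sketch below is one plausible mechanism, with the model and dataset invented for illustration.

```python
import hashlib
import json

def dataset_fingerprint(examples) -> str:
    """Hash a standardized eval set so the validation path can detect any
    modification made by the generation path."""
    blob = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def validate(model_fn, examples, expected_fingerprint):
    """Refuse to report results if the eval data no longer matches the
    fingerprint taken before generation ran."""
    if dataset_fingerprint(examples) != expected_fingerprint:
        raise ValueError("eval dataset was modified; results are not genuine")
    correct = sum(model_fn(x["input"]) == x["label"] for x in examples)
    return correct / len(examples)

eval_set = [{"input": "2+2", "label": "4"}, {"input": "3+3", "label": "6"}]
fingerprint = dataset_fingerprint(eval_set)       # taken before generation
accuracy = validate(lambda s: {"2+2": "4", "3+3": "6"}[s], eval_set, fingerprint)
```

The key property is that the validation path holds the fingerprint, so the generation path cannot silently swap in an easier dataset without the evaluation refusing to run.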
11.6 Research Iteration Capability
The Siege → Tempest progression reveals a capability that most AI research systems lack: iterative improvement on the same research direction:
Research Iteration Loop
│
│ Iteration 1 (Siege — ICLR Workshop):
│ ├── Input: "multi-turn attacks on LLMs"
│ │ (autonomously identified from literature)
│ ├── Output: Core tree search framework
│ ├── Result: 100%/97% attack success
│ ├── Format: 2-4 page tiny paper
│ └── Feedback: (7, 7) reviewer scores
│
│ GAP: Workshop paper → Full conference paper requires:
│ ├── Deeper methodology
│ ├── More comprehensive experiments
│ ├── Stronger theoretical motivation
│ └── Better presentation
│
│ Iteration 2 (Tempest — ACL Main):
│ ├── Input: Same high-level idea + Siege as starting point
│ ├── Enhancements:
│ │ ├── Cross-branch learning mechanism (NEW)
│ │ ├── Robust partial compliance tracking (IMPROVED)
│ │ ├── Comprehensive evaluations (EXPANDED)
│ │ └── Full conference paper format (EXTENDED)
│ ├── Result: Same attack rates + deeper analysis
│ └── Outcome: Top 8.2% at A* venue
│
│ This demonstrates Zochi can:
│ 1. Evaluate its own work's limitations
│ 2. Identify what needs improvement
│ 3. Design and implement those improvements
│ 4. Produce substantially stronger results
│ 5. Meet a much higher quality bar (workshop → A*)
12 Programming Language
System Implementation
| Aspect | Assessment | Evidence |
|---|---|---|
| System language | Unknown (likely Python) | Closed-source; Python is standard for ML systems |
| Generated code | Python (confirmed) | CS-ReFT uses PyTorch; Tempest uses standard ML libraries |
| Experiment code | Python (confirmed) | Standard ML stack (PyTorch, transformers, etc.) |
| Paper output | LaTeX | Conference paper format |
Generated Code Quality Indicators
The code Zochi generates must be of sufficient quality to:
- Train models successfully: CS-ReFT trained on Llama-2-7B with orthonormal subspace edits
- Run complex experiments: Tempest executed multi-turn attacks against GPT-3.5 and GPT-4 APIs
- Implement novel architectures: EGNN-Fusion designed a new equivariant GNN architecture
- Produce reproducible results: Results were verified by peer reviewers at A* venues
Comparison to Other Systems
| System | System Language | Generated Code | Open Source |
|---|---|---|---|
| Zochi | Unknown (Python likely) | Python | No |
| AI Scientist | Python | Python | Yes |
| AutoResearchClaw | Python | Python | Yes (MIT) |
| EurekaClaw | Python/TypeScript | Python + LaTeX | Yes (Apache 2.0) |
| Google Co-Scientist | Unknown | Unknown | No |
| AIRA₂ | Python | Python | Partially |
Code Quality Assessment (Inferred)
| Indicator | Evidence | Assessment |
|---|---|---|
| Correctness | Results accepted at A* venue | High — peer-verified |
| Reproducibility | Standardized datasets, published results | Medium-High |
| Complexity | Orthonormal subspace edits, tree search, EGNN architectures | High — non-trivial implementations |
| Test coverage | Ablation studies, multiple baselines | High — comprehensive evaluation |
| Documentation | Published papers describe methods | High — paper-quality documentation |
13 Memory Management
Memory Architecture (Inferred)
Zochi's memory system is not publicly documented, but the system's capabilities imply several memory types:
┌─────────────────────────────────────────────────────────────┐
│ Zochi Memory Architecture (Inferred) │
│ │
│ Tier 1: LITERATURE MEMORY │
│ ├── Scope: Per-project │
│ ├── Content: Paper summaries, extracted patterns, gaps │
│ ├── Scale: "Thousands of papers" │
│ ├── Access: Literature engine reads; hypothesis engine reads│
│ └── Purpose: Ground research in existing knowledge │
│ │
│ Tier 2: PROJECT MEMORY │
│ ├── Scope: Per-project (spans all pipeline stages) │
│ ├── Content: Chosen direction, method spec, implementation │
│ │ state, experiment results, partial drafts │
│ ├── Access: All stages read; each stage writes its outputs │
│ └── Purpose: Maintain coherence across pipeline stages │
│ │
│ Tier 3: ITERATION MEMORY │
│ ├── Scope: Across project iterations (Siege → Tempest) │
│ ├── Content: Prior work artifacts, identified improvements │
│ ├── Access: New iteration reads prior artifacts │
│ └── Purpose: Enable research iteration and improvement │
│ │
│ Tier 4: VALIDATION MEMORY │
│ ├── Scope: Per-experiment │
│ ├── Content: Evaluation scripts, dataset references, results│
│ ├── Access: Validation engine (isolated from generation) │
│ └── Purpose: Ensure genuine, unbiased evaluation │
│ │
│ Tier 5: CROSS-DOMAIN MEMORY (speculative) │
│ ├── Scope: Across projects/domains │
│ ├── Content: Transferable strategies, method patterns │
│ ├── Access: Hypothesis engine for new projects │
│ └── Purpose: Enable cross-domain innovation │
│ │
└─────────────────────────────────────────────────────────────┘
Evidence for Memory Tiers
| Tier | Evidence | Confidence |
|---|---|---|
| Literature Memory | "Ingests and analyzes thousands of research papers" | High |
| Project Memory | Multi-stage pipeline requires state passing | High |
| Iteration Memory | Siege → Tempest improvement demonstrates prior work awareness | High |
| Validation Memory | Separated validation engine with unmodified datasets | Medium |
| Cross-Domain Memory | 3 domains with transferable patterns | Low-Medium |
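The inferred tiers can be expressed as a simple container; this is a hypothetical data structure mirroring the diagram, not a documented Zochi component, and the field contents are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTiers:
    """Hypothetical container mirroring the inferred memory tiers."""
    literature: dict = field(default_factory=dict)  # Tier 1: summaries, gaps
    project: dict = field(default_factory=dict)     # Tier 2: specs, results
    iteration: list = field(default_factory=list)   # Tier 3: prior artifacts
    validation: dict = field(default_factory=dict)  # Tier 4: isolated eval assets

    def start_iteration(self, prior_artifacts: dict):
        """New iteration (e.g. Siege -> Tempest): archive prior work in
        iteration memory, then start with a fresh project scope."""
        self.iteration.append(prior_artifacts)
        self.project = {}

mem = MemoryTiers()
mem.project = {"paper": "Siege", "scores": (7, 7)}
mem.start_iteration(mem.project)
```

The isolation property matters: validation memory is a separate field that the generation-side tiers never write to, matching the separated validation engine above.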
Context Window vs. Persistent Memory
A key architectural question for Zochi is how it handles the tension between LLM context window limitations and the need for extensive research context:
Context Window Challenge
│
│ Challenge: "Thousands of papers" analyzed → millions of tokens
│ Typical context window: 100K-200K tokens (frontier models, 2025)
│
│ Possible Solutions (Inferred):
│
│ 1. HIERARCHICAL SUMMARIZATION
│ ├── Full papers → abstracts → key findings → pattern summary
│ ├── Progressive compression: 1000x reduction
│ └── Only summaries enter context window
│
│ 2. RETRIEVAL-AUGMENTED GENERATION
│ ├── Vector store indexes all paper summaries
│ ├── Relevant papers retrieved per query
│ └── Only retrieved context enters window
│
│ 3. STRUCTURED STATE PASSING
│ ├── Each pipeline stage produces structured output
│ ├── Next stage receives structured input (not raw text)
│ └── Information compressed between stages
│
│ 4. HYBRID APPROACH (most likely)
│ ├── RAG for literature (Tier 1)
│ ├── Structured state for pipeline (Tier 2)
│ ├── Artifact persistence for iteration (Tier 3)
│ └── Isolated context for validation (Tier 4)
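Hierarchical summarization (solution 1) is easy to sketch. Here `summarize` is a stand-in for an LLM summarization call; the concatenate-and-truncate default and the chunk size are toy assumptions used only to show the compression shape.

```python
def hierarchical_compress(papers, chunk=10, summarize=None):
    """Toy hierarchical summarization: repeatedly summarize groups of items
    until the whole corpus fits in a single context window."""
    summarize = summarize or (lambda group: " | ".join(group)[:80])
    level = papers
    while len(level) > 1:
        # Each pass shrinks the corpus by roughly a factor of `chunk`
        level = [summarize(level[i:i + chunk])
                 for i in range(0, len(level), chunk)]
    return level[0]

corpus = [f"paper-{i}: key finding" for i in range(1000)]
landscape = hierarchical_compress(corpus)
```

With 1,000 papers and a chunk size of 10, three passes (1000 → 100 → 10 → 1) yield a single landscape summary, which is how "thousands of papers" can be reduced to something that fits a 100K-200K token window.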
Memory Comparison
| System | Memory Tiers | Literature Store | Cross-Run | Cross-Domain |
|---|---|---|---|---|
| Zochi | ~5 (inferred) | Yes (thousands of papers) | Yes (Siege→Tempest) | Yes (3 domains) |
| EurekaClaw | 4 | Yes (arXiv + S2) | Yes (persistent) | Per-domain plugin |
| AutoResearchClaw | 3 | Yes (real APIs) | Yes (MetaClaw) | No |
| AI Scientist | 1-2 | Basic | No | No |
| K-Dense BYOK | 1 | Conversation only | No | No |
14 Continued Learning
Evidence of Learning Capability
The Siege → Tempest progression is the strongest evidence that Zochi implements some form of continued learning or iterative improvement:
Learning Evidence: Siege → Tempest Progression
│
│ ICLR Workshop (March 2025)
│ ├── System autonomously identified multi-turn attack direction
│ ├── Designed core tree search framework
│ ├── Produced workshop-quality paper
│ └── Received (7, 7) reviewer scores
│
│ LEARNING PHASE (March - May 2025)
│ ├── "Zochi was able to significantly improve its design"
│ ├── "Conduct more comprehensive experiments"
│ ├── Added: Cross-branch learning mechanism
│ ├── Added: Robust partial compliance tracking
│ └── Expanded methodology and evaluation
│
│ ACL Main (May 2025)
│ ├── Same high-level direction, vastly improved execution
│ ├── Full conference paper (vs. 2-4 page tiny paper)
│ ├── Meta-review score 4 = top 8.2%
│ └── Accepted at A* venue (~21% acceptance rate)
│
│ This implies the system can:
│ 1. Assess the quality gap between its work and a higher bar
│ 2. Identify specific improvement opportunities
│ 3. Execute those improvements autonomously
│ 4. Produce substantially stronger output
Types of Learning (Inferred)
| Learning Type | Evidence | Mechanism (Inferred) |
|---|---|---|
| Intra-project iteration | Code debugging, experiment refinement | Self-correction loops within pipeline |
| Inter-project learning | Siege → Tempest improvement | Prior work analysis + targeted enhancement |
| Cross-domain transfer | AI methods → computational biology | Transferable research strategies |
| Quality calibration | 7.67 avg NeurIPS score | Understanding of what constitutes good research |
Continuous Improvement Evidence
The technical report describes the ACL version (May 2025) as "a substantial advancement over our earlier systems that published workshop papers at ICLR 2025." This suggests ongoing system development between March and May 2025:
System Evolution Timeline
│
│ March 14, 2025 ── IntologyAI announcement
│ March 17, 2025 ── Technical report published
│ ├── CS-ReFT at ICLR SCOPE Workshop
│ └── Siege at ICLR Building Trust Workshop
│
│ March - May 2025 ── System improvement period
│ ├── Architecture enhancements
│ ├── "Substantially advanced" system
│ └── Human involvement reduced further
│
│ May 27, 2025 ── ACL acceptance announced
│ ├── Tempest (expanded Siege) accepted
│ ├── Main proceedings (not workshop)
│ └── "First AI to pass A* peer review"
│
│ 2025 (later) ── Locus previewed
│ ├── Successor to Zochi
│ ├── Surpasses human experts on RE-Bench
│ └── Multi-day research campaigns
Learning Comparison
| System | Learning Type | What is Learned | Cross-Run | Cross-Domain |
|---|---|---|---|---|
| Zochi | System evolution + inter-project | Research strategies, quality standards | Yes | Yes |
| EurekaClaw | Post-session distillation | Proof strategies, tool patterns | Yes (4-tier) | Per-domain |
| AutoResearchClaw | MetaClaw cross-run | Research strategies from failures | Yes (skills) | No |
| AI Scientist | None within system | N/A | No | No |
| K-Dense BYOK | None | N/A | No | No |
| Google Co-Scientist | Unknown | Unknown | Unknown | Unknown |
The Zochi → Locus Learning Lineage
Zochi's continued learning extends beyond the system itself to inform its successor:
| System | Key Capability | Improvement Over Predecessor |
|---|---|---|
| Zochi v1 (Mar 2025) | Workshop-level research | First autonomous AI publications |
| Zochi v2 (May 2025) | A* conference-level research | Quality leap from workshop to main |
| Locus (2025) | Surpasses human experts on RE-Bench | Multi-day campaigns, engineering tasks |
Locus's capabilities suggest that lessons from Zochi's research pipeline were transferred:
- RE-Bench: Locus scores 1.30 vs. human expert 1.27 over 64 hours
- KernelBench: State-of-the-art with 1.5x to 100x+ speedups
- MLE-Bench Lite: State-of-the-art on engineering tasks
- Key innovation: "Maintains consistent improvement over multiple days", unlike systems that plateau after hours
15 Applications
Primary Application: Autonomous Scientific Research
Zochi's demonstrated applications span three distinct scientific domains:
| Application Domain | Paper | Contribution Type | Impact |
|---|---|---|---|
| AI / Representation Learning | CS-ReFT | Novel training methodology | AlpacaEval SOTA for parameter-efficient methods |
| AI Safety | Tempest | Vulnerability framework + discovery | Exposes fundamental safety weakness |
| Computational Biology | EGNN-Fusion | Efficient architecture | 95% parameter reduction |
Demonstrated Research Capabilities
Research Capability Matrix
│
│ Literature │ Hypothesis │ Method │ Implement │ Experiment │ Write
│ Analysis │ Generation │ Design │ │ │
│ ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│ CS-ReFT │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ Tempest │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ EGNN-Fusion │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓
│ MLE-Bench tasks │ — │ — │ — │ ✓ │ ✓ │ —
│ ──────────────────┼──────────┼───────────┼─────────┼───────────┼────────────┼──────
│ Coverage │ 3/3 │ 3/3 │ 3/3 │ 4/4 │ 4/4 │ 3/3
│
│ ✓ = Demonstrated — = Not applicable for this task type
Target Users
| User Segment | Use Case | Value Proposition |
|---|---|---|
| Research labs | Accelerate paper production | Hours/days instead of months |
| PhD students | Explore research directions | Autonomous literature survey + direction scoring |
| AI safety teams | Red-teaming and vulnerability discovery | Systematic, comprehensive testing |
| Biotech/pharma | Cross-domain computational methods | Efficient architectures for biology |
| Industry R&D | Novel method development | Competitive research output at frontier quality |
| Conference organizers | Quality assessment | Automated reviewer scoring |
Application Scenarios
Scenario 1: Novel Research Direction Exploration
Input: "novel jailbreaking methods" (3 words)
│
└── Zochi pipeline:
├── Analyzes thousands of safety papers
├── Identifies multi-turn attack as underexplored
├── Designs tree search framework
├── Discovers partial compliance vulnerability
├── Implements Tempest framework
├── Achieves 100%/97% attack success
├── Writes conference paper
└── Accepted at ACL 2025 (A*, top 8.2%)
Time: Days
Human involvement: Figures, citation formatting, minor edits
Cost: Estimated $30-500+ (API + compute)
Scenario 2: Parameter-Efficient Method Design
Input: Research direction on cross-skill interference in PEFT
│
└── Zochi pipeline:
├── Identifies gap between weight-level and representation-level orthogonality
├── Designs CS-ReFT with orthonormal subspace transformations
├── Implements on Llama-2-7B
├── Evaluates on AlpacaEval: 93.94% win rate
├── Demonstrates 0.0098% parameter usage
└── Published at ICLR 2025 SCOPE Workshop
Time: Days
Human involvement: Minimal
Scenario 3: Cross-Domain Architecture Transfer
Input: Protein-nucleic acid binding site prediction
│
└── Zochi pipeline:
├── Analyzes computational biology literature
├── Identifies parameter inefficiency in existing methods
├── Transfers efficient architecture principles from AI domain
├── Designs EGNN-Fusion with 95% parameter reduction
├── Achieves competitive binding site prediction
└── Under journal review
Significance: Demonstrates domain generality
Limitations and Risks
| Limitation | Impact | Mitigation |
|---|---|---|
| Closed-source | Cannot verify, extend, or reproduce pipeline | Published papers are independently verifiable |
| Undisclosed architecture | Scientific community cannot build on methods | Individual research outputs are documented |
| Human verification required | Not fully autonomous — figures, citations, edits | Human oversight as safety mechanism |
| Ethical concerns | AI-generated papers at top venues | Transparent attribution, no AI authorship claims |
| Scalability unknown | No evidence of concurrent multi-project runs | Locus suggests improvement here |
| Cost unknown | Closed-source prevents cost assessment | Likely competitive with human research |
| Generalization unknown | 3 domains demonstrated; broader generality unproven | Cross-domain capability is promising evidence |
| Reviewer gaming risk | System could learn to optimize for reviewer preferences | Separated validation engine mitigates |
Ethical Framework
IntologyAI's stated ethical principles represent the most developed framework among autoresearch systems:
| Principle | Implementation |
|---|---|
| No AI authorship | "We do not believe AI systems should be authors on papers, as they cannot take responsibility for their work" |
| Human verification | "Rigorous human verification of all research outputs" |
| Transparent attribution | Acknowledge AI contributions without claiming authorship |
| Responsible disclosure | Safety research (Tempest) follows responsible disclosure protocols |
| Venue engagement | "In discussion with workshop organizers of Zochi's accepted papers" |
| Human rebuttal | ACL rebuttal written manually without Zochi involvement |
Comparison to Other Systems' Ethics Frameworks
| System | AI Authorship Policy | Human Verification | Venue Transparency |
|---|---|---|---|
| Zochi | No AI authorship | Required | Disclosed to organizers |
| AI Scientist | AI listed as author | Minimal | No disclosure policy |
| AutoResearchClaw | Not addressed | Configurable | Not addressed |
| EurekaClaw | Not addressed | Gate modes available | Not addressed |
| Google Co-Scientist | Not addressed | Built-in | Not addressed |
Strengths vs. Weaknesses Summary
| Strength | Weakness |
|---|---|
| Only AI system with A* venue acceptance | Closed-source — no reproducibility of pipeline |
| Multi-domain research capability (3 domains) | Undisclosed architecture limits scientific contribution |
| Highest automated quality scores (7.67 avg) | Small team → sustainability risk |
| Minimal human involvement | Cannot verify claims about autonomy level |
| Strong ethical framework | Ethical framework untested at scale |
| Discovered novel vulnerability (partial compliance) | Individual papers, while good, are not groundbreaking |
| MLE-Bench performance without optimization | MLE-Bench evaluation was "exploratory" |
| Research iteration capability (Siege → Tempest) | Unclear if iteration was human-guided or autonomous |
| Successor system (Locus) shows continued development | Locus makes Zochi potentially obsolete |
Future Trajectory
Based on IntologyAI's public statements and announced plans:
| Development | Status | Significance |
|---|---|---|
| Locus (successor) | Previewed 2025 | Surpasses human experts on RE-Bench |
| Multi-day campaigns | Locus capability | Week/month-long research runs planned |
| Beta access | Sign-up available | Moving toward product launch |
| Additional domains | Expected | Current 3-domain capability as starting point |
| Journal publications | EGNN-Fusion under review | Expanding beyond conferences |
Impact Assessment
Zochi's significance extends beyond its individual research contributions: it marks a category-defining moment for AI research systems.
Before Zochi (pre-2025):
- AI could generate paper-shaped artifacts
- Quality was insufficient for top-tier venues
- "AI research" meant demonstrations, not contributions
- The gap between AI output and human research was large

After Zochi (2025):
- AI output accepted at highest-tier venues
- Quality comparable to strong human submissions
- "Artificial Scientist" established as a legitimate category
- The gap between AI and human research is closing rapidly
- Locus suggests a trajectory toward surpassing humans

Implications:
- Publication norms need updating (attribution, review)
- Research acceleration becomes possible (hours/days vs. months)
- Multi-domain research becomes feasible for small teams
- AI safety research can be systematically automated
- The research community must adapt to AI-generated science
Appendix A: Publication Record
| Paper | Venue | Type | Scores | Status |
|---|---|---|---|---|
| CS-ReFT | SCOPE @ ICLR 2025 | Workshop poster | (6, 7, 6) | Accepted |
| Siege | Building Trust @ ICLR 2025 | Tiny paper | (7, 7) | Accepted |
| Tempest | ACL 2025 | Main proceedings | Meta: 4 (top 8.2%) | Accepted |
| EGNN-Fusion | Journal (undisclosed) | Full paper | N/A | Under review |
Appendix B: Benchmark Summary
| Benchmark | Metric | Zochi Result | Best Baseline |
|---|---|---|---|
| AlpacaEval (CS-ReFT) | Win rate | 93.94% | GPT-3.5-T: 86.30% |
| JailbreakBench (Tempest) | Success vs GPT-3.5-T | 100% | Crescendo: lower |
| JailbreakBench (Tempest) | Success vs GPT-4 | 97% | GOAT: lower |
| NeurIPS Auto-Review | Paper quality | 8, 8, 7 (avg 7.67) | Other AI: ~4 |
| MLE-Bench | > Median human | 80% of tasks | Agent Lab: lower |
| MLE-Bench | Medal rate | 50% of tasks | AIDE: 8.7% |
| EGNN-Fusion | Parameter reduction | 95% | — |
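The averaged review scores quoted in the tables above can be checked directly from the raw per-reviewer tuples. A minimal sketch (the dictionary keys are labels taken from Appendices A and B, not identifiers from any Zochi artifact):

```python
# Recompute the averaged review scores from the per-reviewer tuples
# listed in Appendix A (human reviews) and Appendix B (auto-review).
scores = {
    "CS-ReFT (SCOPE @ ICLR 2025)": (6, 7, 6),
    "Siege (Building Trust @ ICLR 2025)": (7, 7),
    "NeurIPS auto-review (avg 7.67 in Appendix B)": (8, 8, 7),
}

for paper, reviews in scores.items():
    avg = sum(reviews) / len(reviews)
    print(f"{paper}: avg {avg:.2f}")
```

The (8, 8, 7) tuple averages to 7.67, matching the "avg 7.67" figure reported in the benchmark table and the headline comparison.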
Appendix C: Comparison with All Major AI Research Systems (as of April 2026)
| Dimension | Zochi | AI Scientist | Co-Scientist | AIRA₂ | AutoResearchClaw | EurekaClaw |
|---|---|---|---|---|---|---|
| Organization | IntologyAI | Sakana AI | Google DeepMind | Meta FAIR | AIMING Lab | Single lab |
| Team size | 4 | ~6 | ~20+ | 25 | 16 | 8 |
| Open source | No | Yes | No | Partial | Yes (MIT) | Yes (Apache 2.0) |
| A* venue | Yes (ACL) | No | No | No | No | No |
| Workshop venue | Yes (ICLR) | Yes | No | No | Not yet | Not yet |
| Domains | 3+ | 1/run | Biomedical | NLP/ML | Configurable | Mathematics |
| Auto NeurIPS score | 7.67 | ~4 | N/A | N/A | N/A | N/A |
| Human involvement | Minimal | Similar | Significant | Moderate | Full automation | Configurable |
| Learning | System evolution | None | Unknown | Tournament | MetaClaw skills | 4-tier + skills |
| MLE-Bench | 80%>median, 50% medal | N/A | N/A | N/A | N/A | N/A |
| Ethical framework | Comprehensive | Basic | Minimal | Basic | None stated | None stated |
| Successor | Locus | None | None | None | MetaClaw | None |
This analysis was compiled from publicly available sources including the Zochi Technical Report, IntologyAI blog posts, OpenReview submissions, arXiv papers, and third-party coverage. All claims about system internals are marked as inferred where architectural details are not publicly documented.