Introduced: 2025-05
Score: 8.15/10 (Draft)
Chapter 46

Zochi

Part P07: Autonomous Research Systems

Key Contribution

Zochi is, to the best of public knowledge as of early 2026, the first AI system whose fully autonomous research output was accepted at a CORE A*-ranked scientific conference (ACL 2025, main proceedings), as reported by IntologyAI. It produced papers across three distinct domains (parameter-efficient fine-tuning, LLM safety, and computational biology), with the computational biology paper still reported as under journal review. Its papers received automated quality scores averaging 7.67 on a 10-point NeurIPS-calibrated scale, reported as approximately 3.67 points above competing AI research systems. The system is closed-source and its internal architecture remains undisclosed; the GitHub repository contains published paper artifacts (PDFs, technical report) but no executable system code. Surviving A*-level peer review, which can be checked against public ACL records (the workshop acceptances are visible on OpenReview), is a qualitative threshold that separates Zochi from the other autonomous research systems documented in this survey.

Evidence Framework for This Chapter

Zochi is a closed-source system. This chapter is a paper-based audit, not a repository-based implementation analysis. Throughout, claims are labeled with their evidence basis:

  • [Technical report] — stated in the Zochi Technical Report PDF
  • [Published paper] — verifiable in a peer-reviewed or arXiv publication
  • [Public record] — visible on OpenReview, ACL proceedings, or similar
  • [IntologyAI blog] — stated in company blog posts
  • [Author inference] — reconstructed by this chapter's author from observable evidence; not confirmed by IntologyAI

Readers should weight claims accordingly. Detailed diagrams of inferred architecture are provided for pedagogical value but do not represent confirmed system internals.

46.1 Overview and Motivation

The aspiration to build systems that conduct scientific research autonomously has a lineage stretching back to early AI visions, but the practical landscape of "AI scientists" emerged only in 2024–2025 with the convergence of frontier LLMs, code-generation capability, and agentic frameworks. Prior to Zochi, systems such as Sakana AI's AI Scientist demonstrated that LLMs could produce paper-shaped artifacts—manuscripts with structure, citations, and experimental results—but these outputs consistently failed to meet the acceptance thresholds of top-tier venues. The quality gap between AI-generated research and human-authored research accepted at selective conferences remained wide.

Zochi, developed by IntologyAI (a San Francisco-based startup with four named contributors), narrows this gap with what IntologyAI reports as a decisive empirical result: its autonomously generated paper Tempest was accepted into the main proceedings of ACL 2025, a CORE A*-ranked venue with an acceptance rate of approximately 21.3% [IntologyAI blog; ACL acceptance verifiable through conference records]. IntologyAI reports that the ACL meta-review score of 4 placed the paper in the top 8.2% of all submissions [IntologyAI blog]. If these claims are accurate, this represents a categorical transition from "demonstration" to "contribution" in the taxonomy of AI research quality.

The system's motivation sits at the intersection of two research threads relevant to this survey. First, as an autonomous pipeline that iterates over literature analysis, hypothesis generation, method design, implementation, experimentation, and manuscript preparation [Technical report], Zochi embodies the end-to-end scientific workflow orchestration that characterizes the most ambitious systems in the LLM-powered evolution space. Second, its multi-domain capability—producing work across AI, AI safety, and computational biology without reported domain-specific plugins [Technical report]—demonstrates a form of knowledge transfer and generalization that most evolutionary and search-based systems have not yet achieved.

46.1.1 Positioning in the Autoresearch Landscape

To appreciate Zochi's reported contribution, it is essential to understand the quality validation hierarchy that governs AI research systems. This hierarchy reflects the rigor of external evaluation applied to system outputs:

| Validation Level | Typical Acceptance Rate | Systems at This Level (to best of public knowledge) | Source |
|---|---|---|---|
| Self-evaluation only (LLM-as-judge) | N/A | AI Scientist, Agent Laboratory, most systems | Respective papers |
| Workshop acceptance | ~60–70% | AI Scientist, Zochi (ICLR 2025 workshops) | OpenReview records |
| Main conference acceptance (A*) | ~20% | Zochi (ACL 2025), reported by IntologyAI | IntologyAI blog |
| Journal publication | Variable | Zochi (EGNN-Fusion, reported under review) | Technical report |

Every other AI research system in the 2024–2026 survey period either relies on automated self-assessment or, at best, achieves workshop-level acceptance. Zochi's reported ACL result is therefore a singular data point—but a consequential one if independently confirmed, because it would establish that LLM-powered pipelines can produce genuine scientific contributions, not merely plausible imitations.

46.1.2 Historical Context and Team

Zochi was announced in March 2025, with the technical report published on March 17, 2025 [Technical report]. The team consists of Andy Zhou (lead developer and first author on all Zochi papers), Ron Arel (co-founder), Soren Dunn, and Nikhil Khandekar [Technical report]. With four contributors, IntologyAI is small relative to its validated research output: comparable efforts list approximately 16 contributors at AutoResearchClaw, 25 at Meta FAIR's AIRA₂, and 20+ on Google's Co-Scientist team [respective publications; team sizes approximate]. IntologyAI is a venture-backed startup whose product identity centers on "Artificial Scientists," which explains both the closed-source nature and the emphasis on measurable publication milestones.

46.2 Architecture

Repository Contents and Limitations

The Zochi GitHub repository (github.com/IntologyAI/Zochi) contains published paper artifacts only: the technical report PDF, links to the arXiv paper (Tempest), and links to OpenReview submissions (CS-ReFT, Siege). It does not contain executable system code, configuration files, prompt templates, pipeline orchestration logic, or any runnable implementation of the Zochi research pipeline. The system is entirely closed-source. All architectural descriptions in this section are therefore reconstructions from the technical report's high-level descriptions, not from code inspection. [Author verification of repository contents as of March 2026]

Zochi's internal architecture is not publicly documented. The following architectural description is a reconstruction based on the technical report's descriptions, published blog posts, and the observable capabilities demonstrated by Zochi's outputs. Every element that goes beyond what the technical report explicitly states is labeled as author inference.

46.2.1 Pipeline Overview

The technical report explicitly describes the following phases: literature analysis, gap identification, hypothesis generation, method design, implementation, experiment design, experiment execution, validation, result analysis, and manuscript preparation [Technical report]. Human involvement is described as limited to figure creation, citation formatting, and minor edits [Technical report]. The degree to which these phases are fully automated versus guided by undisclosed human intervention is not independently verifiable.

AUTHOR RECONSTRUCTION (based on technical report stage descriptions, not code inspection)

STAGE 1: LITERATURE ANALYSIS [Technical report]
  Paper retrieval → Summarization → Cross-paper pattern mining → Gap identification

STAGE 2: HYPOTHESIS GENERATION [Technical report]
  Direction proposal → Novelty & feasibility scoring → Research direction selection

STAGE 3: METHOD DESIGN & IMPLEMENTATION [Technical report]
  Architecture specification → Code generation → Iterative debugging → Working system

STAGE 4: EXPERIMENTATION
  • Controlled experiments [Tech report]
  • Ablation studies, parallel execution [Tech report]
  • VALIDATION ENGINE [Tech report]: auto-generates eval scripts independently; standardized datasets remain unmodified

STAGE 5: RESULT ANALYSIS & MANUSCRIPT PREPARATION [Technical report]
  Result interpretation → Paper drafting → Related work synthesis → LaTeX formatting

QUALITY ASSURANCE & HUMAN REVIEW
  • Automated NeurIPS-calibrated reviewer scoring [Tech report]
  • Human: figures, citations, minor edits [Tech report]
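To make the staged structure concrete, the following minimal sketch runs a fixed sequence of stages in which each stage can read the artifacts of earlier ones. Only the stage names come from the technical report's phase list; `PipelineRun`, `run_pipeline`, `make_stub`, and the string artifacts are hypothetical names introduced for illustration and say nothing about how Zochi is actually orchestrated.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stage names mirror the reconstructed diagram; everything else is hypothetical.
STAGE_NAMES = [
    "literature_analysis",
    "hypothesis_generation",
    "method_design_and_implementation",
    "experimentation",
    "result_analysis_and_manuscript",
]


@dataclass
class PipelineRun:
    """State carried through the pipeline: the topic plus per-stage artifacts."""
    topic: str
    artifacts: dict[str, str] = field(default_factory=dict)
    completed: list[str] = field(default_factory=list)


StageFn = Callable[[PipelineRun], str]


def run_pipeline(topic: str, stages: dict[str, StageFn]) -> PipelineRun:
    """Run every stage in order, storing each stage's output as an artifact."""
    run = PipelineRun(topic=topic)
    for name in STAGE_NAMES:
        run.artifacts[name] = stages[name](run)
        run.completed.append(name)
    return run


def make_stub(name: str) -> StageFn:
    """Build a placeholder stage that only records what it was built on."""
    def stub(run: PipelineRun) -> str:
        prior = run.completed[-1] if run.completed else "topic"
        return f"{name} output (built on {prior})"
    return stub


stubs = {name: make_stub(name) for name in STAGE_NAMES}
result = run_pipeline("novel jailbreaking methods", stubs)
```

In a real system each stub would be replaced by an LLM-driven component; the sketch only shows the sequential hand-off implied by the stage list.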

46.2.2 Key Architectural Decisions

Several architectural properties can be identified from the technical report with varying confidence:

Separation of generation and validation. The technical report explicitly states: "Our automatic validation engine generates evaluation scripts based on standardized datasets that remain unmodified throughout testing, ensuring results reflect genuine improvements" [Technical report, direct quote]. This separation is designed to prevent a common failure mode in which AI systems inadvertently optimize for their own evaluation metrics. As described, it distinguishes Zochi from systems where the same LLM generates both the method and its evaluation.

Parallelized experimentation. The report describes experiments as "parallelized across multiple trials, significantly accelerating the research timeline" [Technical report, direct quote]. This implies an experiment orchestration layer, though its implementation is unknown.
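As a rough illustration of what parallelizing trials can look like, the sketch below fans a set of seeded trials out to a worker pool and collects one metric per seed. It is an assumption-laden stand-in: `run_trial` returns a synthetic score, and `run_trials_in_parallel` is a hypothetical helper. Real experiments would more likely dispatch to separate processes or GPU jobs, and nothing here reflects Zochi's undisclosed orchestration layer.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_trial(seed: int) -> float:
    """Stand-in for one experiment trial; returns a synthetic metric."""
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)


def run_trials_in_parallel(seeds: list[int], max_workers: int = 4) -> dict[int, float]:
    """Run one trial per seed concurrently and map each seed to its metric."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each seed with its score.
        scores = list(pool.map(run_trial, seeds))
    return dict(zip(seeds, scores))


seeds = [0, 1, 2, 3, 4, 5, 6, 7]
results = run_trials_in_parallel(seeds)
mean_score = statistics.mean(results.values())
```

Seeding each trial independently keeps the results reproducible regardless of the order in which workers finish.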

Input minimalism. For the ACL paper Tempest, the input was reportedly "novel jailbreaking methods" [Technical report]—three words that reportedly triggered the entire pipeline from literature analysis through a full conference paper. The extent to which additional human guidance occurred beyond this initial prompt is not disclosed.

Domain generality without reported domain-specific plugins. The same pipeline reportedly produced contributions in AI (CS-ReFT), AI safety (Tempest), and computational biology (EGNN-Fusion). The technical report does not describe domain-specific tool suites or plugins [Technical report; absence of mention is not proof of absence].

46.2.3 LLM Integration

The technical report does not disclose which LLM backbone powers Zochi. Based on the output quality and the March 2025 timeline, the system likely uses a frontier model in the Claude or GPT-4 class, though this is entirely inference [Author inference].

Author Hypotheses: LLM Token Distribution

The following estimates are the author's speculative reconstruction based on task complexity and the pipeline stages described in the technical report. No token-level usage data is publicly available for Zochi. These estimates are provided for pedagogical context—to illustrate the likely computational profile of a pipeline with these described stages—not as factual claims about Zochi's actual resource usage.

| Pipeline Stage | LLM Usage Pattern (Inferred) | Hypothesized Token Share |
|---|---|---|
| Literature analysis | Paper summarization, gap identification, cross-paper reasoning | ~40–60% |
| Method design | Hypothesis generation, architecture specification | ~10–15% |
| Implementation | Code generation, debugging iteration | ~15–25% |
| Experimentation | Experiment script generation, result interpretation | ~5–10% |
| Writing | Paper drafting, technical exposition, related work | ~10–15% |

These ranges are illustrative only. Actual distribution depends on undisclosed model choices, prompt strategies, and caching mechanisms.
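A quick consistency check on the hypothesized ranges: the lower bounds should sum to at most 100% and the upper bounds to at least 100%, otherwise no allocation could satisfy all of them at once. The sketch below performs that arithmetic on the table's values; the dictionary keys and variable names are illustrative only.

```python
# Hypothesized token-share ranges from the table above, in percent.
ranges = {
    "literature_analysis": (40, 60),
    "method_design": (10, 15),
    "implementation": (15, 25),
    "experimentation": (5, 10),
    "writing": (10, 15),
}

low_total = sum(low for low, _ in ranges.values())
high_total = sum(high for _, high in ranges.values())

# A valid split of 100% exists only if 100 lies between the two totals.
is_feasible = low_total <= 100 <= high_total

# Slack that must be given up if every stage started at its upper bound.
excess_at_upper_bounds = high_total - 100
```

The lower bounds sum to 80 and the upper bounds to 125, so the ranges are mutually compatible but cannot all sit at their upper ends simultaneously.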

46.3 Core Algorithms

While Zochi's pipeline orchestration logic is closed-source, the individual research methods it produced are fully documented in peer-reviewed publications. These methods serve double duty: they are both the system's outputs and the best available evidence for the sophistication of its internal reasoning processes. All code blocks in this section are paper-faithful pseudocode—written by this chapter's author to accurately represent algorithms as described in the cited publications. They are not excerpts from Zochi's codebase, which is not publicly available.

46.3.1 Tree Search over Conversation Branches (Tempest)

Tempest, Zochi's ACL 2025 paper, introduces a tree-search framework for multi-turn jailbreaking of LLMs [Published paper: arXiv:2503.10619]. The core insight, reported as autonomously discovered by Zochi during literature analysis [Technical report], is that safety in LLMs is not a binary gate but a continuously erodible surface. Models exhibit partial compliance: they reveal fragments of restricted information while appearing to maintain safety guardrails, and these fragments accumulate across turns [Published paper: arXiv:2503.10619].

The algorithm constructs a search tree over conversation branches, where each node represents a conversation state and edges represent adversarial follow-up queries. At each turn, the algorithm expands promising branches, evaluates target model responses for compliance level, prunes dead-end branches, and re-injects compliance fragments into subsequent queries [Published paper: arXiv:2503.10619].

Let $\mathcal{T} = (V, E)$ denote the conversation tree, where each node $v \in V$ carries a conversation history $h_v = (q_1, r_1, \ldots, q_t, r_t)$ of query-response pairs. Define a compliance function $\phi: V \rightarrow [0, 1]$ that measures the degree of target model compliance at node $v$, where $\phi(v) = 0$ indicates full refusal and $\phi(v) = 1$ indicates full compliance with the restricted query. The algorithm terminates successfully when $\phi(v) \geq \tau$ for some threshold $\tau$ close to 1. The following scoring function governs branch prioritization:

$$\text{score}(v) = \phi(v) + \lambda \sum_{v' \in \text{ancestors}(v)} \Delta\phi(v')$$

where $\Delta\phi(v') = \phi(v') - \phi(\text{parent}(v'))$ is the incremental compliance gain at each ancestor, and $\lambda \in [0,1]$ weights the accumulated partial compliance trajectory. This formalization captures the paper's described mechanism: branches with steadily increasing $\phi$ values are prioritized for expansion, while branches where compliance stalls are pruned. [Author formalization of the mechanism described in arXiv:2503.10619; the paper does not present this exact equation but describes the scoring logic it captures.]
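Because each $\Delta\phi$ term is a difference of consecutive compliance values, the sum in this scoring function telescopes. As a short worked step, assume that $\text{ancestors}(v)$ excludes $v$ itself, that the path from the root is $v_0, v_1, \ldots, v_n = v$, and that $\Delta\phi(v_0) = 0$ because the root has no parent (these conventions are assumptions of this derivation, not statements from the paper):

$$\sum_{i=1}^{n-1} \Delta\phi(v_i) = \sum_{i=1}^{n-1} \bigl(\phi(v_i) - \phi(v_{i-1})\bigr) = \phi(v_{n-1}) - \phi(v_0)$$

$$\text{score}(v) = \phi(v) + \lambda \bigl(\phi(\text{parent}(v)) - \phi(v_0)\bigr)$$

With an empty root conversation, $\phi(v_0) = 0$, so the score reduces to $\phi(v) + \lambda\,\phi(\text{parent}(v))$. Under this formalization the score depends only on the compliance of a node and of its parent; a variant meant to reward steady growth specifically would need a non-telescoping term, for example a sum of $\max(0, \Delta\phi)$ over ancestors.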

The ACL version adds cross-branch learning: successful partial-compliance patterns discovered in one branch are transferred to inform query generation in sibling branches [Published paper: arXiv:2503.10619]. This mechanism enables the search to share exploitation strategies across the tree rather than discovering them independently in each branch.

# Paper-faithful pseudocode for the Tempest algorithm.
# Source: arXiv:2503.10619 (Zhou & Arel, 2025).
# NOT an excerpt from the Zochi repository — the system code is closed-source.
# Written by this chapter's author to reflect the algorithm as described in the paper.

from dataclasses import dataclass, field

@dataclass
class ConversationNode:
    """A node in the conversation search tree."""
    history: list[tuple[str, str]]  # (query, response) pairs
    compliance: float = 0.0         # phi(v) in [0, 1]
    children: list["ConversationNode"] = field(default_factory=list)

def tempest_tree_search(
    target_model,           # The LLM being evaluated for safety
    initial_topic: str,     # High-level restricted topic
    max_turns: int = 10,
    branch_factor: int = 3, # k branches per expansion
    compliance_threshold: float = 0.95,
    lambda_weight: float = 0.5,  # Weight for trajectory scoring
) -> ConversationNode | None:
    """
    Multi-turn tree search with partial compliance tracking.

    The algorithm expands a tree of conversation branches, tracking
    partial compliance at each node and transferring successful
    patterns across branches (cross-branch learning).

    Reference: Tempest, arXiv:2503.10619, Sections 3-4.
    """
    root = ConversationNode(history=[])
    active_nodes = [root]
    successful_patterns: list = []  # Cross-branch learning store

    for turn in range(max_turns):
        next_active = []
        for node in active_nodes:
            # Generate k adversarial follow-ups, informed by:
            # 1. Node's partial compliance fragments
            # 2. Successful patterns from other branches
            queries = generate_adversarial_followups(
                node.history,
                successful_patterns,
                k=branch_factor,
            )

            for query in queries:
                response = target_model.generate(
                    node.history + [(query, "")]
                )
                child = ConversationNode(
                    history=node.history + [(query, response)],
                    compliance=measure_compliance(response, initial_topic),
                )
                node.children.append(child)

                if child.compliance >= compliance_threshold:
                    return child  # Jailbreak achieved

                # Incremental compliance gain for this child (delta-phi in the
                # scoring equation); the full score is applied when ranking below
                delta = child.compliance - node.compliance
                if delta > 0:
                    # Partial compliance gained — track for cross-branch use
                    successful_patterns.append(extract_pattern(child))
                    next_active.append(child)
                # Else: branch pruned (stalled or refused)

        active_nodes = sorted(
            next_active,
            key=lambda n: score_node(n, lambda_weight),
            reverse=True,
        )[:branch_factor * 2]  # Keep top candidates

        if not active_nodes:
            return None  # All branches exhausted

    return None  # Max turns reached
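The pseudocode above calls a `score_node` helper without defining it. One way to realize the scoring equation is to give each node a parent pointer and walk up the tree, as in the minimal sketch below. `ScoredNode` and `incremental_gain` are hypothetical names introduced here, and treating the root's gain as zero is an assumption of the sketch rather than something the paper specifies.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredNode:
    """Hypothetical node type: compliance phi(v) plus a parent pointer."""
    compliance: float
    parent: Optional[ScoredNode] = None


def incremental_gain(node: ScoredNode) -> float:
    """Delta-phi(v) = phi(v) - phi(parent(v)); taken to be 0 for the root."""
    if node.parent is None:
        return 0.0
    return node.compliance - node.parent.compliance


def score_node(node: ScoredNode, lambda_weight: float) -> float:
    """score(v) = phi(v) + lambda * sum of Delta-phi over strict ancestors of v."""
    total_gain = 0.0
    ancestor = node.parent
    while ancestor is not None:
        total_gain += incremental_gain(ancestor)
        ancestor = ancestor.parent
    return node.compliance + lambda_weight * total_gain


# Example path with steadily rising compliance: 0.0 -> 0.2 -> 0.5 -> 0.7
root = ScoredNode(0.0)
a = ScoredNode(0.2, parent=root)
b = ScoredNode(0.5, parent=a)
c = ScoredNode(0.7, parent=b)

score_c = score_node(c, lambda_weight=0.5)
```

For the example path the ancestor gains are 0.3 and 0.2, so the score of the last node is 0.7 + 0.5 × 0.5 = 0.95.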

46.3.2 Compositional Subspace Representation Fine-Tuning (CS-ReFT)

CS-ReFT, accepted at the SCOPE Workshop at ICLR 2025 [Public record: OpenReview YqYcm0mpFp], addresses cross-skill interference in parameter-efficient fine-tuning (PEFT). The standard problem: improving a model on one skill (e.g., instruction following) can degrade performance on another (e.g., reasoning). LoRA-based approaches that add orthogonality constraints do so at the weight level, but interference manifests in the model's hidden-state representations, not directly in weight space [Published paper: OpenReview YqYcm0mpFp].

CS-ReFT applies orthonormality constraints at the hidden-state level. For a model with hidden dimension $d$, CS-ReFT learns $K$ subspace transformations $\{S_k \in \mathbb{R}^{d \times r}\}_{k=1}^{K}$, each of rank $r \ll d$, specializing in a distinct skill. The orthonormality constraint ensures non-interference between skill-specific edits:

$$S_i^{\top} S_j = \begin{cases} I_r & \text{if } i = j \\ 0_{r \times r} & \text{if } i \neq j \end{cases}$$

where $I_r$ is the $r \times r$ identity matrix and $0_{r \times r}$ the zero matrix. The constraint can be satisfied at initialization because the $K$ subspace bases can be taken as column blocks of a single matrix $Q \in \mathbb{R}^{d \times Kr}$ with orthonormal columns ($Q^{\top} Q = I_{Kr}$), obtained via QR decomposition; $Kr \leq d$ ensures there is room for $K$ non-overlapping rank-$r$ subspaces. Each subspace $S_k$ defines a projection that edits the hidden state $h \in \mathbb{R}^d$ toward a skill-specific representation:

$$h'_k = h + S_k S_k^{\top} (\tilde{h}_k - h)$$

where $\tilde{h}_k$ is the target representation for skill $k$, learned during fine-tuning. The term $S_k S_k^{\top} \in \mathbb{R}^{d \times d}$ is the orthogonal projector onto the column space of $S_k$. By the orthonormality constraint, $S_i S_i^{\top}$ and $S_j S_j^{\top}$ project onto orthogonal subspaces for $i \neq j$, guaranteeing that edits for different skills cannot interfere in the hidden-state space [Published paper: OpenReview YqYcm0mpFp].
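The non-interference property can be checked numerically on a toy example. The sketch below builds $K$ orthonormal blocks with a plain Gram-Schmidt procedure (standing in for a QR decomposition), applies the edit $h' = h + S_0 S_0^{\top}(\tilde{h}_0 - h)$ in subspace 0, and confirms that the coordinates of $h$ in subspace 1 are unchanged. The dimensions, the random vectors, and all helper names are hypothetical choices made for illustration.

```python
import random

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(columns: list[Vector]) -> list[Vector]:
    """Orthonormalize column vectors with modified Gram-Schmidt."""
    basis: list[Vector] = []
    for col in columns:
        v = list(col)
        for q in basis:
            coeff = dot(q, v)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        basis.append([vi / norm for vi in v])
    return basis


def project(block: list[Vector], x: Vector) -> Vector:
    """Apply S S^T to x, where the columns of S are the vectors in block."""
    out = [0.0] * len(x)
    for q in block:
        coeff = dot(q, x)
        out = [oi + coeff * qi for oi, qi in zip(out, q)]
    return out


def coords(block: list[Vector], x: Vector) -> Vector:
    """Coordinates S^T x of x in the subspace spanned by block."""
    return [dot(q, x) for q in block]


rng = random.Random(0)
d, r, K = 8, 2, 3  # hypothetical sizes satisfying K * r <= d
columns = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K * r)]
basis = gram_schmidt(columns)
blocks = [basis[k * r:(k + 1) * r] for k in range(K)]

h = [rng.gauss(0.0, 1.0) for _ in range(d)]
target = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Edit h inside subspace 0 only: h' = h + S_0 S_0^T (target - h)
diff = [t - hi for t, hi in zip(target, h)]
h_edited = [hi + e for hi, e in zip(h, project(blocks[0], diff))]

# Projecting into subspace 0 and then subspace 1 should give (numerically) zero.
cross_leak = max(abs(c) for c in project(blocks[1], project(blocks[0], h)))
# The edit in subspace 0 should leave the subspace-1 coordinates untouched.
coord_shift = max(
    abs(a - b) for a, b in zip(coords(blocks[1], h), coords(blocks[1], h_edited))
)
```

Both `cross_leak` and `coord_shift` come out at floating-point noise level, while the subspace-0 coordinates of the edited state match those of the target.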

A lightweight router $R: \mathbb{R}^d \rightarrow \Delta^{K-1}$ (a function mapping the hidden state to the probability simplex over $K$ skills) composes the skill-specific edits:

$$h' = h + \sum_{k=1}^{K} R(h)_k \cdot S_k S_k^{\top}(\tilde{h}_k - h)$$

where $R(h)_k \in [0,1]$ is the router's weight for skill $k$ at hidden state $h$, with $\sum_k R(h)_k = 1$. The total number of trainable parameters is $K \cdot d \cdot r + K \cdot d \cdot r + |\theta_R| = 2Kdr + |\theta_R|$, accounting for the $S_k$ bases, the parameters that produce each $\tilde{h}_k$ (counted here as one $d \times r$ matrix per skill), and the router. For CS-ReFT on Llama-2-7B, the paper reports that this amounts to 0.0098% of total model parameters, 12.7× fewer than LoRA [Published paper: OpenReview YqYcm0mpFp].
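The parameter formula is easy to evaluate for a concrete configuration. In the sketch below, the hidden size 4096 is that of Llama-2-7B, but the values of $K$ and $r$ are hypothetical (the configuration used in the paper is not restated in this chapter), the router is assumed to be a single linear layer with bias, and the count covers one intervention site only. The last line derives the LoRA share implied by the two reported figures.

```python
def csreft_trainable_params(
    hidden_dim: int, rank: int, num_skills: int, router_bias: bool = True
) -> int:
    """Evaluate 2*K*d*r + |theta_R| for one intervention site.

    The router is assumed to be one linear layer from R^d to K logits.
    """
    subspace_params = num_skills * hidden_dim * rank
    target_params = num_skills * hidden_dim * rank
    router_params = hidden_dim * num_skills + (num_skills if router_bias else 0)
    return subspace_params + target_params + router_params


# Hypothetical configuration: d = 4096 (Llama-2-7B hidden size), K = 4, r = 4.
example_count = csreft_trainable_params(hidden_dim=4096, rank=4, num_skills=4)

# Reported figures: CS-ReFT uses 0.0098% of parameters, 12.7x fewer than LoRA.
csreft_percent = 0.0098
implied_lora_percent = csreft_percent * 12.7
```

For the hypothetical configuration the count is 147,460 parameters per site, and the reported ratio implies a LoRA budget of roughly 0.124% of model parameters.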

# Paper-faithful pseudocode for the CS-ReFT mechanism.
# Source: OpenReview submission YqYcm0mpFp (Zhou & Arel, 2025).
# NOT a repository excerpt — written to illustrate the published method.

import torch
import torch.nn as nn

class CSReFTLayer(nn.Module):
    """Compositional Subspace Representation Fine-Tuning layer.

    Implements orthonormal subspace edits in hidden-state space
    with a lightweight router for skill composition.

    Reference: CS-ReFT, OpenReview YqYcm0mpFp, Section 3.
    """

    def __init__(self, hidden_dim: int, rank: int, num_skills: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.rank = rank
        self.num_skills = num_skills

        # Orthonormal subspace bases: K matrices of shape (d, r)
        self.subspaces = nn.ParameterList([
            nn.Parameter(torch.empty(hidden_dim, rank))
            for _ in range(num_skills)
        ])
        # Target representations per skill: K matrices of shape (d, r)
        self.targets = nn.ParameterList([
            nn.Parameter(torch.empty(hidden_dim, rank))
            for _ in range(num_skills)
        ])
        # Lightweight router: maps hidden state to skill weights
        self.router = nn.Linear(hidden_dim, num_skills)

        self._init_orthonormal()

    def _init_orthonormal(self) -> None:
        """Initialize subspaces as orthonormal via QR decomposition.

        Generates a single orthogonal matrix Q of shape (d, K*r)
        and partitions it into K blocks, guaranteeing S_i^T S_j = 0
        for i != j at initialization.
        """
        combined = torch.randn(self.hidden_dim, self.rank * self.num_skills)
        q, _ = torch.linalg.qr(combined)
        for k in range(self.num_skills):
            self.subspaces[k].data = q[:, k * self.rank:(k + 1) * self.rank]
            nn.init.zeros_(self.targets[k])

    def orthonormality_loss(self) -> torch.Tensor:
        """Regularization term to maintain orthonormality during training.

        Returns sum of ||S_i^T S_j||_F^2 for i != j (cross-subspace)
        plus sum of ||S_k^T S_k - I_r||_F^2 (within-subspace normality).
        """
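        # Allocate on the parameters' device/dtype so the loss also works on GPU
        ref = self.subspaces[0]
        loss = torch.zeros((), device=ref.device, dtype=ref.dtype)
        eye = torch.eye(self.rank, device=ref.device, dtype=ref.dtype)
        for i in range(self.num_skills):
            # Within-subspace: should be identity
            gram_ii = self.subspaces[i].T @ self.subspaces[i]
            loss = loss + torch.norm(gram_ii - eye, p="fro") ** 2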
            # Cross-subspace: should be zero
            for j in range(i + 1, self.num_skills):
                gram_ij = self.subspaces[i].T @ self.subspaces[j]
                loss = loss + torch.norm(gram_ij, p="fro") ** 2
        return loss

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """Apply compositional subspace edit to hidden states.

        Args:
            h: Hidden states of shape (batch, seq_len, hidden_dim).
        Returns:
            Edited hidden states h' of the same shape.
        """
        weights = torch.softmax(self.router(h), dim=-1)  # (batch, seq, K)

        edit = torch.zeros_like(h)
        for k in range(self.num_skills):
            S_k = self.subspaces[k]          # (d, r)
            T_k = self.targets[k]            # (d, r)
            # Coordinates of h in subspace k: S_k^T h
            proj_h = h @ S_k                  # (batch, seq, r)
            # Target coordinates in subspace k. Assumption of this sketch:
            # T_k acts as a learned linear map, so S_k^T tilde_h_k = T_k^T h.
            target_coords = h @ T_k           # (batch, seq, r)
            # Skill-specific edit S_k (S_k^T tilde_h_k - S_k^T h), mapped back to R^d
            skill_edit = (target_coords - proj_h) @ S_k.T  # (batch, seq, d)
            edit = edit + weights[..., k:k+1] * skill_edit

        return h + edit

46.3.3 Literature-Grounded Research Direction Selection

The literature analysis mechanism is described at a high level in the technical report: "Zochi begins by ingesting and analyzing thousands of research papers, identifying non-obvious connections across papers and gaps in existing knowledge that represent opportunities for novel contributions" [Technical report, direct quote]. While the internal implementation is unknown, the observable behavior implies a multi-layer analysis pipeline. The following pseudocode captures the described capability as an author reconstruction—it is not derived from any implementation artifact:

# Author reconstruction of the literature analysis pipeline.
# Source: High-level descriptions in the Zochi Technical Report.
# The actual implementation is proprietary and may differ substantially.
# This pseudocode is provided to illustrate what the described stages imply,
# not to claim knowledge of the actual system.

from dataclasses import dataclass

@dataclass
class ResearchDirection:
    gap_description: str
    novelty_score: float
    feasibility_score: float
    impact_score: float
    supporting_papers: list[str]

def literature_grounded_direction_selection(
    domain: str,              # e.g., "novel jailbreaking methods"
    paper_corpus_size: int,   # Tech report: "thousands of papers"
) -> ResearchDirection:
    """
    Multi-layer literature analysis pipeline.

    The technical report describes stages of paper ingestion,
    cross-paper pattern mining, and gap identification.
    The internal logic, retrieval mechanisms, and scoring
    functions are entirely proprietary.

    Source: Zochi Technical Report, Section on Literature Analysis.
    """
    # Layer 1: Retrieval — described as accessing research corpus
    # [Technical report: "ingesting and analyzing thousands of papers"]
    queries = expand_domain_to_search_queries(domain)
    papers = retrieve_papers(
        queries,
        sources=["arxiv", "semantic_scholar"],
        limit=paper_corpus_size,  # hypothetical cap on corpus size
    )

    # Layer 2: Per-paper analysis
    # [Technical report implies per-paper extraction of findings]
    summaries = []
    for paper in papers:
        summaries.append(analyze_paper(
            paper,
            extract=["contribution", "methodology", "limitations", "results"]
        ))

    # Layer 3: Cross-paper pattern mining
    # [Technical report: "identifying non-obvious connections across papers"]
    patterns = mine_cross_paper_patterns(summaries)
    gaps = identify_research_gaps(patterns)

    # Layer 4: Direction scoring and selection
    # [Author inference: scoring mechanism is not described]
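    directions: list[ResearchDirection] = []
    for gap in gaps:
        # assess_direction is assumed to return a scored ResearchDirection
        direction = assess_direction(
            gap,
            criteria=["novelty", "feasibility", "impact"]
        )
        directions.append(direction)

    # Rank on all three scored criteria (equal weighting is an assumption)
    return max(
        directions,
        key=lambda d: d.novelty_score + d.feasibility_score + d.impact_score,
    )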

46.3.4 Validation Engine — Separation of Concerns

The validation engine is arguably Zochi's most important described architectural mechanism from a scientific integrity perspective. The technical report states that evaluation scripts are generated independently from the method generation process, and that standardized datasets remain unmodified throughout testing [Technical report, paraphrased; the direct quote appears in Section 46.2.2]. This separation is intended to guard against a class of failure modes common in AI research systems:

| Failure Mode | Description | How Separation Prevents It | Evidence Basis |
|---|---|---|---|
| Metric gaming | System optimizes for evaluation proxy rather than genuine quality | Eval scripts generated independently | [Technical report: direct claim] |
| Data leakage | Training data contaminates evaluation data | Datasets not modified by generation path | [Technical report: direct claim] |
| Self-confirming loops | System evaluates its own output with criteria it generated | Evaluation criteria independent of method | [Author inference from separation design] |
| Overfitting to evaluation | Method iteratively adjusts to pass specific eval checks | Eval scripts generated once, not co-adapted | [Author inference from separation design] |

The degree of separation (e.g., whether separate LLM calls, separate model instances, or separate agent identities are used) is not described. The claim of separation is taken at face value from the technical report but cannot be independently verified [Author note].
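One simple way to enforce the "datasets remain unmodified" property is to fingerprint the benchmark before any method runs and to refuse to score if the fingerprint later changes. The sketch below illustrates that idea under stated assumptions: `FrozenBenchmark`, `fingerprint`, and the toy parity task are hypothetical and are not drawn from Zochi, whose validation engine is undisclosed.

```python
import hashlib
import json
from typing import Callable


def fingerprint(examples: list[dict]) -> str:
    """SHA-256 digest of a canonical JSON encoding of the dataset."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class FrozenBenchmark:
    """Hypothetical harness that refuses to score against altered data."""

    def __init__(self, examples: list[dict]) -> None:
        self._examples = examples
        self._digest = fingerprint(examples)  # recorded once, before any method runs

    def evaluate(self, predict: Callable[[int], str]) -> float:
        """Return accuracy, after verifying the data is byte-for-byte unchanged."""
        if fingerprint(self._examples) != self._digest:
            raise RuntimeError("benchmark data changed since it was frozen")
        correct = sum(predict(ex["input"]) == ex["label"] for ex in self._examples)
        return correct / len(self._examples)


examples = [
    {"input": 1, "label": "odd"},
    {"input": 2, "label": "even"},
    {"input": 3, "label": "odd"},
]
benchmark = FrozenBenchmark(examples)


def parity_method(x: int) -> str:
    """Toy 'method under test'."""
    return "odd" if x % 2 else "even"


accuracy = benchmark.evaluate(parity_method)
```

If the generation path were to edit a label in `examples` to flatter its own method, the next call to `evaluate` would raise instead of returning an inflated score.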

46.4 Key Results

Zochi's results span four categories: publication milestones, automated quality metrics, per-paper technical results, and engineering benchmarks. Each result below is tagged with its source and verifiability status.

46.4.1 Publication Milestones

| Paper | Venue | Type | Reviewer Scores | Status | Source & Verifiability |
|---|---|---|---|---|---|
| CS-ReFT | SCOPE @ ICLR 2025 | Workshop | (6, 7, 6), avg 6.33 | Accepted | [Public record] OpenReview YqYcm0mpFp; independently verifiable |
| Siege | Building Trust @ ICLR 2025 | Tiny paper | (7, 7), avg 7.0 | Accepted | [Public record] OpenReview rDC2UVdB0t; independently verifiable |
| Tempest | ACL 2025 Main | Full paper | Meta-review: 4 (reported as top 8.2%) | Accepted | [IntologyAI blog] ACL proceedings confirmable upon publication |
| EGNN-Fusion | Journal (undisclosed) | Full paper | N/A | Under review | [Technical report] Not independently verifiable |

The ACL 2025 acceptance is the headline result, reported by IntologyAI. ACL is a CORE A*-ranked conference, among the top venues globally in Google Scholar rankings across all of computer science, with an acceptance rate of approximately 21.3%. IntologyAI reports that Tempest's meta-review score of 4 placed it in the top 8.2% of submissions [IntologyAI blog]. The ICLR workshop acceptances are independently verifiable through OpenReview and represent the strongest publicly confirmable milestones.

46.4.2 Automated Quality Assessment

The technical report states that Zochi uses an automated reviewer calibrated to NeurIPS conference guidelines, and that Zochi's papers received scores of 8, 8, and 7 (average 7.67), compared to reported scores of approximately 4 for papers produced by other AI research systems such as AI Scientist and Agent Laboratory [Technical report]. On the NeurIPS 10-point scale, a score of 6 is typically considered the acceptance threshold, which places Zochi's outputs between "accept" and "strong accept" by this self-reported metric.

$$\Delta_{\text{quality}} = \bar{s}_{\text{Zochi}} - \bar{s}_{\text{other}} = 7.67 - 4.0 = 3.67 \text{ points}$$

where $\bar{s}$ denotes the mean automated NeurIPS reviewer score as reported by each system's respective authors. Important caveat: Automated review scores are imperfect proxies for actual peer review outcomes, and the comparison assumes the same automated reviewer or equivalent methodology was applied across systems, which is not confirmed. Zochi's real peer review results at ICLR workshops corroborate the automated scores to some extent, and the reported ACL acceptance would further strengthen this correlation if confirmed through proceedings publication [Author analysis].
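The averages quoted in this section can be reproduced directly from the individual scores. The short sketch below recomputes the automated-score mean and gap, along with the workshop reviewer averages from the publication table; the variable names are illustrative.

```python
from statistics import mean

# Automated NeurIPS-calibrated scores reported for Zochi's papers.
zochi_automated = [8, 8, 7]
baseline_automated = 4.0  # approximate score reported for other systems

zochi_mean = round(mean(zochi_automated), 2)
quality_gap = round(mean(zochi_automated) - baseline_automated, 2)

# Human reviewer scores from the ICLR 2025 workshop acceptances.
cs_reft_mean = round(mean([6, 7, 6]), 2)
siege_mean = round(mean([7, 7]), 2)
```

The recomputed values (7.67, 3.67, 6.33, and 7.0) agree with the figures quoted in the text and the milestone table.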

46.4.3 Per-Paper Technical Results

CS-ReFT (AI / parameter-efficient fine-tuning): Reports 93.94% win rate on AlpacaEval using Llama-2-7B, surpassing GPT-3.5-Turbo (86.30%) and standard LoRA fine-tuning (~85%). Uses only 0.0098% of model parameters—12.7× fewer than LoRA [Published paper: OpenReview YqYcm0mpFp; results independently verifiable via AlpacaEval benchmark]. The key technical innovation is applying orthonormality at the hidden-state level rather than the weight level, directly preventing cross-skill interference where it manifests.

Tempest (AI safety / red-teaming): Reports 100% attack success rate against GPT-3.5-Turbo and 97% against GPT-4 on JailbreakBench, using fewer queries than baseline methods Crescendo and GOAT [Published paper: arXiv:2503.10619; JailbreakBench is a public benchmark]. Introduces the "partial compliance" vulnerability pattern—a novel empirical finding about how LLM safety degrades across multi-turn conversations.

EGNN-Fusion (computational biology): Reported to achieve competitive protein-nucleic acid binding site prediction performance with 95% fewer parameters than state-of-the-art baselines, using an E(3)-equivariant graph neural network architecture [Technical report; paper under journal review and not yet publicly available for independent verification].

46.4.4 MLE-Bench Engineering Performance

The technical report describes an exploratory evaluation on MLE-Bench (a benchmark of Kaggle-style machine learning engineering tasks) in which Zochi reportedly surpassed the median human participant on 80% of tasks and achieved medal-level performance on 50% of tasks—without task-specific optimization. For comparison, AIDE reports 8.7% any-medal rate and OpenHands reports 4.4% [Technical report for Zochi; AIDE and OpenHands figures from their respective publications]. Caveat: The MLE-Bench evaluation is described by IntologyAI as "exploratory," detailed methodology is not provided, and results have not been independently reproduced. Budget, configuration, and evaluation protocol details are not disclosed [Author note].

46.4.5 Cross-System Comparison

| Metric | Zochi | AI Scientist | Agent Laboratory | Co-Scientist | Source Notes |
| --- | --- | --- | --- | --- | --- |
| Automated NeurIPS score | 7.67 (reported) | ~4 (reported) | ~3–4 (reported) | N/A | Self-reported by each system's authors |
| Highest venue tier | A* (ACL, reported) | Workshops | None | None | Zochi: IntologyAI blog; others: respective papers |
| Domains demonstrated | 3 | 1 per run | 1 per run | Biomedical | Respective technical reports |
| MLE-Bench > median | 80% (reported) | N/A | N/A | N/A | Zochi: technical report (exploratory) |
| Human involvement | Minimal (reported) | Similar (reported) | Moderate | Significant | Self-reported by each system |
| Open source | No | Yes | Yes | No | Directly verifiable |

Cross-system comparison should be interpreted cautiously: automated review methodologies may differ, human involvement levels are self-reported, and evaluation budgets and conditions are not controlled across systems. The "reported" label marks figures taken from each system's own authors (IntologyAI in Zochi's case) that have not been independently verified.

46.5 Implementation Details

46.5.1 Cost and Timeline

Since Zochi is closed-source, no cost data is publicly available. The technical report states: "Methods typically only require hours to validate, and a full paper takes only days to complete" [Technical report, direct quote].

Author Hypotheses: Cost Reconstruction

The following cost estimates are entirely the author's speculative reconstruction and should not be cited as factual claims about Zochi's costs. IntologyAI has not published any cost data. These estimates are provided solely to give readers an order-of-magnitude sense of what a pipeline with Zochi's described capabilities might cost using publicly available frontier model pricing as of early 2025.

Assuming a frontier model at ~$15/M input tokens and ~$60/M output tokens:

  • Literature analysis: ~1.2M tokens → ~$18–$72 depending on input/output ratio
  • Method design + implementation: ~550K tokens → ~$8–$33
  • Experimentation: ~200K tokens → ~$3–$12, plus GPU compute ($10–$500+)
  • Manuscript preparation: ~150K tokens → ~$2–$9

Hypothesized total API cost: roughly $30–$130 per paper, plus variable GPU compute. These numbers assume single-pass generation; iterative refinement could multiply costs significantly. The actual costs could be substantially higher or lower depending on undisclosed model choices, caching strategies, and negotiated API pricing.
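The ranges in the list above follow from pricing every token at the input rate (lower bound) or the output rate (upper bound). The sketch below reproduces that arithmetic. The prices and token counts are the author's speculative assumptions from this section, not IntologyAI data, and the function and variable names are illustrative.

```python
# Illustrative only: reproduces this section's speculative cost arithmetic.
INPUT_PRICE_PER_M = 15.0   # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 60.0  # USD per million output tokens (assumed)

# Hypothesized token budget per pipeline stage (author's estimates).
STAGE_TOKENS = {
    "literature_analysis": 1_200_000,
    "method_design_and_implementation": 550_000,
    "experimentation": 200_000,
    "manuscript_preparation": 150_000,
}


def cost_bounds(tokens: int) -> tuple:
    """Return (low, high) USD: all tokens billed as input vs. all as output."""
    millions = tokens / 1_000_000
    return millions * INPUT_PRICE_PER_M, millions * OUTPUT_PRICE_PER_M


def total_bounds(stage_tokens: dict) -> tuple:
    """Sum the per-stage lower and upper bounds."""
    lows, highs = zip(*(cost_bounds(t) for t in stage_tokens.values()))
    return sum(lows), sum(highs)


if __name__ == "__main__":
    for stage, tokens in STAGE_TOKENS.items():
        low, high = cost_bounds(tokens)
        print(f"{stage}: ${low:.2f} to ${high:.2f}")
    low, high = total_bounds(STAGE_TOKENS)
    print(f"total API cost (single pass, GPU excluded): ${low:.2f} to ${high:.2f}")
```

The totals come to about $31.50 and $126, which the text rounds to $30–$130. GPU compute and any iterative refinement would be added on top of these figures.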

| System | Estimated Cost per Paper | Time per Paper | Source |
| --- | --- | --- | --- |
| Zochi | Unknown (see author hypotheses above) | "Days" (reported) | Technical report (timeline only) |
| AI Scientist | $10–50+ | Hours to days | Published paper (Sakana AI, 2024) |
| AutoResearchClaw | $5–30 | Hours | Published paper |
| Human PhD student | Months of salary | Months to years | N/A |

46.5.2 Reproducibility Assessment

Repository Contents (Reiterated)

The Zochi GitHub repository (github.com/IntologyAI/Zochi) contains only the technical report PDF and links or references to the published papers. It does not contain system code, pipeline logic, configuration files, prompt templates, experiment scripts, model weights, or any other executable artifacts. The repository's ~305 stars (as of April 2026) reflect interest in the technical report, not in a usable codebase. [Author verification]

Reproducibility is Zochi's most significant limitation from a scientific perspective. The system exists at two levels: the pipeline (how research is conducted) and the products (the papers produced). These have very different reproducibility profiles:

| Artifact | Available | Reproducible | Evidence Basis |
| --- | --- | --- | --- |
| System code (pipeline) | No (closed source) | No | [Verified: not in repository] |
| Prompt templates | No (closed source) | No | [Verified: not in repository] |
| LLM backend configuration | No (undisclosed) | No | [Technical report: not mentioned] |
| Technical report | Yes (GitHub PDF) | N/A | [Verified: publicly accessible] |
| CS-ReFT method description | Yes (OpenReview) | Yes (reimplementable) | [Public record: sufficient detail for reimplementation] |
| Tempest algorithm | Yes (arXiv) | Yes (reimplementable) | [Published paper: algorithm fully specified] |
| EGNN-Fusion architecture | Partial (described in report) | Partially (awaiting full paper) | [Technical report: high-level description only] |
| Evaluation datasets | Yes (standard benchmarks) | Yes | [AlpacaEval, JailbreakBench are public] |
| Peer review records | Yes (OpenReview, ACL) | Public records | [Independently verifiable for ICLR workshops] |

The critical implication: while individual papers can be independently verified and their methods reimplemented, the system that produces papers cannot be reproduced, extended, or improved upon by the research community. The quality calibration mechanism—whatever makes Zochi reportedly produce 7.67-quality papers when other systems produce ~4—remains a black box.

46.6 Research Iteration and Learning

46.6.1 The Siege-to-Tempest Progression

The most compelling evidence for Zochi's capacity for improvement is the progression from Siege (ICLR 2025 workshop tiny paper) to Tempest (ACL 2025 main proceedings full paper). Both address multi-turn jailbreaking of LLMs, but Tempest represents a substantial expansion [Siege: OpenReview rDC2UVdB0t; Tempest: arXiv:2503.10619; both publicly accessible]:

| Dimension | Siege (ICLR Workshop) | Tempest (ACL Main) | Evidence |
| --- | --- | --- | --- |
| Format | 2–4 page tiny paper | Full conference paper | [Published papers] |
| Core algorithm | Tree search + multi-turn attacks | Enhanced with cross-branch learning | [Published papers] |
| Compliance tracking | Basic | Robust partial compliance tracking | [Published papers] |
| Experiments | JailbreakBench only | Expanded evaluations | [Published papers] |
| Reviewer assessment | (7, 7), avg 7.0 | Meta-review: 4 (reported top 8.2%) | [OpenReview for Siege; IntologyAI blog for Tempest] |
| Venue tier | Workshop (~60–70% acceptance) | A* main (~21% acceptance, reported) | [ICLR verified; ACL via IntologyAI blog] |

This progression is consistent with the system being able to: (1) assess the quality gap between its output and a higher bar, (2) identify specific improvement opportunities, (3) design and implement those improvements, and (4) produce substantially stronger output meeting a much higher quality threshold. However, whether this iteration was fully autonomous or involved human-guided system updates between versions is not clarified by the available sources [Author note]. The technical report describes the Tempest version as "a substantial advancement over our earlier systems" [Technical report], which is ambiguous about whether the advancement came from the system iterating on its own work or from IntologyAI engineers improving the pipeline between runs.

46.6.2 The Zochi-to-Locus Lineage

IntologyAI has previewed Locus as Zochi's successor system. According to IntologyAI's announcements, Locus surpasses human experts on RE-Bench (scoring 1.30 versus human expert 1.27 over 64-hour sessions), achieves state-of-the-art on KernelBench (1.5× to 100×+ speedups), and can maintain consistent improvement over multi-day research campaigns [IntologyAI blog/announcements; none of these claims have been independently verified or peer-reviewed as of the writing date].

46.7 Memory and Knowledge Management

Zochi's memory architecture is not publicly documented. The system's demonstrated capabilities require some form of state management across pipeline stages and across research iterations, but the specific mechanisms are unknown. Rather than presenting a detailed inferred architecture as if it were established fact, this section identifies what can be concluded at different confidence levels [All content in this section is author inference unless otherwise noted].

Author Hypotheses: Memory Requirements

The following analysis reasons backward from Zochi's observed capabilities to infer what memory mechanisms must exist (high confidence) versus what might exist (low confidence). None of this is documented by IntologyAI.

High confidence — these capabilities require persistent state:

  • Literature memory: The technical report describes analyzing "thousands of papers" [Technical report]. This volume exceeds any single context window (100–200K tokens for frontier models in early 2025), requiring some form of summarization, retrieval, or external storage. The specific mechanism (RAG, hierarchical summarization, vector store, or other) is unknown.
  • Project memory: A multi-stage pipeline that produces coherent papers must pass state between stages—research direction, method specification, implementation artifacts, experimental results. Whether this uses structured files, database entries, or context-window management is undisclosed.
  • Iteration memory: The Siege→Tempest progression implies awareness of prior work artifacts, since Tempest builds directly on Siege's framework. Whether this was achieved through explicit memory or through the research direction input is unclear.

Medium confidence — plausible but unconfirmed:

  • Validation isolation: The described separation between generation and validation [Technical report] implies that validation state is stored or managed independently from generation state.

Low confidence — speculative:

  • Cross-domain memory: The three-domain capability could result from cross-project transfer of research strategies, but it could equally result from the underlying LLM's broad training data. No evidence distinguishes between these explanations.
[Figure: Author-inferred memory requirements, grouped by confidence level; not documented by IntologyAI. High confidence (required by described capabilities): literature memory, project state passing, iteration artifacts. Medium confidence (consistent with claims): validation isolation, with evaluation scripts kept separate from the generation path. Low confidence (speculative): cross-domain transfer of strategies across AI, safety, and biology. Context window challenge [Author inference]: thousands of papers amount to millions of tokens, which likely requires hierarchical summarization, RAG, or both. The actual memory architecture may differ substantially from this analysis.]
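The context-window argument for literature memory is simple arithmetic and can be made concrete. In the sketch below, the average paper length is an illustrative assumption (the technical report gives no such figure), the window size is the upper end of the range quoted above, and the function names are hypothetical. It shows only why some form of summarization or retrieval is needed, not which mechanism Zochi uses.

```python
import math

# Illustrative assumptions, not figures from IntologyAI.
TOKENS_PER_PAPER = 8_000   # assumed average length of one paper
CONTEXT_WINDOW = 200_000   # upper end of the early-2025 range quoted above


def windows_needed(num_papers: int,
                   tokens_per_paper: int = TOKENS_PER_PAPER,
                   context_window: int = CONTEXT_WINDOW) -> int:
    """Number of full context windows required to read every paper once."""
    return math.ceil(num_papers * tokens_per_paper / context_window)


def summary_budget(num_papers: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Max tokens per per-paper summary if all summaries must share one window."""
    return context_window // num_papers


if __name__ == "__main__":
    for n in (1_000, 2_000, 5_000):
        print(f"{n} papers: {n * TOKENS_PER_PAPER:,} tokens, "
              f"{windows_needed(n)} windows, "
              f"{summary_budget(n)} tokens per summary to fit one window")
```

Under these assumptions, 2,000 papers come to 16 million tokens, or 80 full windows, and fitting all of them into one window would leave about 100 tokens per paper. That is why the text treats summarization, retrieval, or external storage as a requirement rather than an option.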

46.8 Ethical Framework

IntologyAI articulates what appears to be the most developed ethical framework among the autoresearch systems surveyed in this book. The stated principles, drawn from the technical report and blog posts, deserve attention because they address questions that the broader AI research community is only beginning to confront [Technical report; IntologyAI blog].

No AI authorship. IntologyAI states: "We do not believe AI systems should be authors on papers, as they cannot take responsibility for their work" [Technical report, direct quote]. This position is notable because it rejects the approach taken by Sakana AI, which listed AI Scientist as an author on generated papers. The distinction matters for scientific accountability: authorship implies responsibility for claims, and current AI systems cannot answer reviewers' questions, correct errors post-publication, or be held professionally accountable.

Human verification required. "Rigorous human verification of all research outputs" [Technical report, direct quote]. Zochi's pipeline is described as requiring human involvement for figures, citation formatting, and minor edits. The ACL rebuttal for Tempest was reportedly written manually without Zochi involvement, maintaining human accountability for the adversarial review process [IntologyAI blog].

Venue transparency. IntologyAI reports being "in discussion with workshop organizers of Zochi's accepted papers" [IntologyAI blog], suggesting proactive disclosure of AI involvement to the venues. This contrasts with most other systems, which have no stated venue transparency policy.

The ethical framework, while well-articulated, has not been tested at scale. Questions remain about how these principles would function if dozens of Zochi-generated papers entered the review process simultaneously, or if competing systems adopted less transparent practices. Additionally, the framework is self-reported by IntologyAI, and independent verification of its implementation is not possible [Author note].

46.9 Limitations and Discussion

46.9.1 The Closed-Source Problem

Zochi's most significant limitation is its closed-source nature. The system that produces research cannot itself be studied, reproduced, or improved upon by the research community. This creates an asymmetry: Zochi's research outputs contribute to open science, but the methodology that produces them does not. Five specific aspects of the system remain unverifiable:

  1. Research pipeline orchestration — how stages are sequenced, how state passes between stages, and what decision logic governs the flow. [No code available; only high-level descriptions in technical report]
  2. Literature analysis methodology — how papers are retrieved, summarized, and how "non-obvious connections" are identified. [Described only at headline level in technical report]
  3. Quality calibration — what mechanism produces papers that reportedly score 7.67 when other systems score ~4. This is the central mystery. [Entirely unknown; no disclosed mechanism]
  4. Autonomy level — the degree to which the published results required undisclosed human guidance beyond the stated figure/citation/formatting contributions. [Self-reported by IntologyAI; cannot be independently verified]
  5. LLM backbone and prompt engineering — the specific models, prompts, and chain-of-thought strategies that drive each pipeline stage. [Entirely undisclosed]

46.9.2 Scope and Generalization Concerns

While three domains is impressive breadth for an AI research system, all three are within computational fields that share common methodological infrastructure (Python, PyTorch, standard benchmarks, LaTeX). It remains unknown whether Zochi could produce contributions in experimental sciences requiring physical apparatus, social sciences requiring human subjects, or theoretical mathematics requiring formal proof. The cross-domain capability is promising but its boundaries are untested [Author analysis].

46.9.3 The Quality Gap Question

The reported ~3.67-point automated review score gap between Zochi and other AI research systems raises a fundamental question: where does this quality come from? Several hypotheses are plausible [Author analysis; these are not confirmed by any source]:

  • Better literature grounding — Zochi's reported large-scale literature analysis ("thousands of papers") may produce better-informed research directions than systems that work from smaller context.
  • Problem selection — Zochi may have superior mechanisms for identifying tractable problems with high novelty, avoiding the toy-domain trap of other systems.
  • Separated validation — the described independent validation engine may prevent the self-confirmation loops that degrade quality in other systems.
  • Iteration capability — the Siege→Tempest progression suggests Zochi can refine its own work, a capability most other systems lack.
  • Superior prompting — IntologyAI's prompt engineering may simply be more effective than open-source alternatives.
  • Unreported human involvement — it is possible that human guidance plays a larger role than disclosed. This hypothesis cannot be ruled out given the closed-source nature.

Without access to the system, these hypotheses cannot be tested. The quality gap remains the most important open question about Zochi for the research community.

46.9.4 Novelty Assessment

It is important to calibrate the novelty of Zochi's individual research outputs against the broader fields they contribute to. CS-ReFT is a solid contribution to parameter-efficient fine-tuning but builds incrementally on existing ReFT methods [Published paper]. Tempest's tree-search jailbreaking framework and the partial compliance discovery are genuinely novel contributions to LLM safety [Published paper; reviewers affirmed novelty via acceptance]. EGNN-Fusion demonstrates efficient architecture design, but its novelty in computational biology awaits journal review [Technical report only]. None of these papers individually represents a breakthrough contribution; their significance lies in reportedly having been produced autonomously at a quality level previously achievable only by trained human researchers.

46.9.5 Sustainability and Successor Dynamics

With only four named contributors and the announcement of Locus as a successor system, Zochi's long-term availability and support are uncertain. If IntologyAI's focus shifts entirely to Locus, Zochi becomes a historical milestone rather than an ongoing system. This is a common pattern in startup-driven research: rapid progress followed by pivots that leave earlier systems unsupported [Author analysis].

46.9.6 Verification Gaps

A summary of what can and cannot be independently verified about Zochi's claims, organized by verification status:

| Claim | Verification Status | How to Verify |
| --- | --- | --- |
| CS-ReFT accepted at ICLR workshop | Verified | OpenReview public record |
| Siege accepted at ICLR workshop | Verified | OpenReview public record |
| CS-ReFT reviewer scores (6, 7, 6) | Verified | OpenReview public record |
| Siege reviewer scores (7, 7) | Verified | OpenReview public record |
| Tempest algorithm (partial compliance, tree search) | Verified | arXiv:2503.10619 publicly available |
| CS-ReFT results on AlpacaEval | Verifiable | Public benchmark; method reimplementable |
| Tempest results on JailbreakBench | Verifiable | Public benchmark; algorithm reimplementable |
| ACL 2025 acceptance | Reported, pending confirmation | Confirmable when ACL proceedings publish |
| ACL meta-review score 4 (top 8.2%) | Reported only | IntologyAI blog; not independently confirmed |
| Automated NeurIPS score 7.67 | Self-reported | Cannot verify without system access |
| MLE-Bench 80% > median | Self-reported, exploratory | Cannot verify; methodology not disclosed |
| Pipeline is "fully autonomous" | Unverifiable | Closed-source; autonomy level self-reported |
| EGNN-Fusion results | Unverifiable currently | Paper under review; not yet public |
| Locus capabilities (RE-Bench, etc.) | Unverifiable | Announcement only; no paper or records |

46.10 Comparative Analysis

Zochi occupies a unique position in the design space of AI research systems. The following table synthesizes the comparison across key dimensions, with explicit source annotations:

| Dimension | Zochi | AI Scientist | Co-Scientist | AutoResearchClaw | EurekaClaw |
| --- | --- | --- | --- | --- | --- |
| Organization | Startup (4 people) | Research lab (~6) | Google (~20+) | Academic (16) | Academic (8) |
| Highest venue | A* (ACL, reported) | Workshops (verified) | None (verified) | None (verified) | None (verified) |
| Domains | 3 (reported) | 1 per run (verified) | Biomedical (verified) | Configurable (verified) | Mathematics (verified) |
| Open source | No | Yes | No | Yes (MIT) | Yes (Apache 2.0) |
| Auto review score | 7.67 (self-reported) | ~4 (self-reported) | N/A | N/A | N/A |
| Pipeline stages | ~6 (inferred) | ~8 (documented) | Multi-step | 23 (documented) | 7 (documented) |
| Separated validation | Yes (claimed in report) | Self-evaluation (documented) | Unknown | Multi-agent review (documented) | Gate modes (documented) |
| Research iteration | Siege→Tempest (observed) | No (documented) | Unknown | MetaClaw skills (documented) | 4-tier memory (documented) |
| Ethical framework | Stated (self-reported) | Basic (published) | Minimal (published) | Not stated | Not stated |

The table reveals a tension at the heart of the autoresearch landscape: the system with the strongest reported empirical results (Zochi) is the least transparent, while the most transparent systems (AI Scientist, AutoResearchClaw, EurekaClaw) have not yet achieved comparable quality as measured by venue tier. This tension between performance and reproducibility is not unique to AI research systems—it mirrors broader debates about open versus closed models—but it is particularly consequential in a domain where scientific methodology itself is the subject.

[Figure: Transparency vs. Reported Quality in Autoresearch Systems. Horizontal axis: transparency (code availability, documentation depth). Vertical axis: reported output quality. Plotted systems: Zochi (A* reported), Co-Sci (no venue), AI Sci (workshop), ARC (no venue), EC. The plot highlights the performance–transparency tension. ARC = AutoResearchClaw; EC = EurekaClaw; Co-Sci = Google Co-Scientist; AI Sci = AI Scientist.]

46.11 Impact and Implications

If IntologyAI's claims are accurate, Zochi's reported ACL 2025 acceptance represents a significant milestone for the field of autonomous AI research. Before Zochi, the question was whether AI systems could produce research meeting the standards of top-tier peer review. Zochi's reported results suggest an affirmative answer, though the closed-source nature means the community cannot yet fully verify or build upon the methodology [Author analysis].

For the evolutionary AI research community specifically, Zochi's outputs demonstrate that LLM-powered pipelines can reportedly produce genuine scientific contributions—not just optimized heuristics or improved loss curves, but novel methods, empirical discoveries, and published knowledge artifacts. The partial compliance discovery in Tempest is particularly instructive: it shows that an AI system can identify non-obvious empirical phenomena in the data, not merely recombine known techniques [Published paper: arXiv:2503.10619; novelty affirmed by peer reviewers].

The institutional implications are significant regardless of verification status. Publication norms will need to adapt to AI-generated or AI-assisted submissions: questions of attribution, review policy, and simultaneous submission take on new urgency when systems can reportedly produce conference papers in days rather than months. Conference organizers, journal editors, and funding agencies face a landscape where the line between human and AI research authorship is blurring—and where the transparency of AI involvement varies widely across systems [Author analysis].

It is worth noting what Zochi's results do not establish. A single A*-level acceptance (if confirmed) does not demonstrate that autonomous AI research is reliable, consistent, or safe at scale. Because the system is closed-source, the community cannot assess failure rates, rejection rates, or the distribution of output quality. The headline results represent the best case, not the expected case, and the share of Zochi's internal attempts that never led to a publication is entirely unknown.

Summary

Key takeaway: Zochi is, to the best of public knowledge as of early 2026, the only AI system whose research output has reportedly been accepted at a CORE A*-ranked scientific conference (ACL 2025), according to IntologyAI. The ICLR 2025 workshop acceptances are independently verifiable through OpenReview; the ACL acceptance is reported by IntologyAI and will be confirmable upon proceedings publication.

Main contribution: Empirical evidence—strongest for the ICLR results, pending full confirmation for ACL—that autonomous AI research can meet the quality bar of top-tier venues, demonstrated across three distinct domains (AI, AI safety, computational biology) with self-reported automated quality scores approximately 3.67 points above competing systems on a 10-point NeurIPS-calibrated scale.

What a researcher should know: Zochi's verifiable results (ICLR workshop acceptances, published Tempest and CS-ReFT papers) are real and independently checkable. Its headline claim (A*-level acceptance) is reported by IntologyAI and consistent with the observed quality trajectory but not yet independently confirmed. The system is entirely closed-source—the GitHub repository contains paper PDFs, not code—meaning the pipeline that produces research cannot be studied, reproduced, or extended. The quality calibration mechanism remains the central open question. Answering it will require either IntologyAI to open-source the pipeline or independent systems to achieve comparable results through transparent methods.

Evidence reliability: Claims in this chapter range from independently verified (ICLR acceptances, published papers) through vendor-reported (ACL acceptance, automated scores, MLE-Bench) to author inference (architecture, memory, costs). Readers should weight conclusions accordingly.