K-Dense Co-Scientist
Part P07: Autonomous Research Systems
Provenance Convention
This chapter analyzes the open-source repository github.com/K-Dense-AI/k-dense-byok. Throughout, claims are tagged by evidence tier:
- [repo-structure] — verified from the repository's file layout and README documentation.
- [repo-docs] — stated in the project's README, setup guides, or inline documentation.
- [author reconstruction] — inferred by the survey author from documented behavior; not verified in source code. These blocks use conceptual pseudocode and should not be treated as implementation fact.
All repository references correspond to the state of the repository as of early April 2026. No specific commit hash is pinned because the project was in active beta with frequent updates; readers should verify against the latest main branch.
42.1 Overview and Motivation
The vision of an AI co-scientist — a system that collaborates with human researchers across the full arc of scientific inquiry — has attracted considerable attention since Google DeepMind's announcement of its Gemini-based AI Co-Scientist in early 2025. That system demonstrated multi-agent debate and hypothesis ranking for biomedical research, but remained proprietary and Gemini-exclusive. K-Dense BYOK enters this space with a fundamentally different philosophy: an open-source, provider-agnostic desktop application that brings AI-assisted scientific research to any researcher with an API key and a laptop.
Developed by K-Dense Inc. and released in beta in March 2026, K-Dense BYOK (Bring Your Own Keys) orchestrates a conversational agent named Kady with delegated expert sub-agents equipped with over 170 scientific skills spanning 22 disciplines [repo-docs]. The system's repository resides at github.com/K-Dense-AI/k-dense-byok and had approximately 467 GitHub stars as of early April 2026. Rather than targeting a single research domain or pipeline stage, K-Dense BYOK positions itself as a broad multi-disciplinary AI research assistant — trading pipeline depth for disciplinary breadth.
Three design commitments distinguish K-Dense BYOK from prior systems in this survey. First, the BYOK model decouples the system from any single LLM provider, offering access to 40+ models via OpenRouter while routing expert execution through the Gemini CLI [repo-docs]. Second, the skill-loaded expert delegation pattern moves beyond generic conversational research assistance by injecting curated domain knowledge into sub-agents at task time. Third, 326 workflow templates encode structured research procedures across 22 scientific disciplines, lowering the barrier for researchers unfamiliar with prompt engineering [repo-docs].
Key Contribution
K-Dense BYOK contributes a three-tier agent architecture (orchestrator → expert sub-agents → scientific skills) combined with a provider-agnostic BYOK design that makes multi-disciplinary AI-assisted research accessible on commodity hardware without institutional infrastructure. Its 170+ curated scientific skills and 326 workflow templates represent, to the best of our survey's knowledge, the largest open-source collection of structured research procedures for LLM-based scientific agents identified across the 17 systems surveyed. The system's primary innovation is breadth of disciplinary coverage within a single unified platform, rather than depth in any single research pipeline.
42.1.1 Relationship to AI Co-Scientist Paradigm
The concept of an AI co-scientist emerged from several converging threads: LLM-powered research assistants (Elicit, Consensus), autonomous research agents (AutoResearchClaw, AIRA), and hypothesis-generation systems (Google's AI Co-Scientist). K-Dense BYOK occupies a distinctive niche in this landscape by prioritizing human-AI collaboration over full autonomy, and breadth over depth.
| System | Scope | Autonomy | Model Choice | Open Source |
|---|---|---|---|---|
| K-Dense BYOK | 22 disciplines | Semi-autonomous (human-in-loop) | 40+ models | Yes |
| Google AI Co-Scientist | Biomedical focus | Autonomous (debate/ranking) | Gemini only | No |
| AutoResearchClaw | General research | Fully autonomous (23-stage) | Configurable | Yes |
| EurekaClaw | Mathematical theory | Fully autonomous | Claude-primary | Yes |
| AIRA₂ | STEM research | 15+ specialized agents | Meta-internal | No |
| Elicit | Literature review | Semi-autonomous | Proprietary | No |
Where Google's AI Co-Scientist employs multi-agent debate and tournament-style hypothesis ranking within a closed Gemini environment, K-Dense BYOK adopts an orchestrator-delegation pattern that is structurally simpler but far more flexible in model selection and domain coverage. The trade-off is clear: K-Dense BYOK cannot autonomously generate and rank competing scientific hypotheses through structured debate, but it can assist a human researcher across a wide range of scientific disciplines within minutes of setup.
42.2 Architecture
K-Dense BYOK runs as three local services — a React/TypeScript frontend, a Python/FastAPI backend, and a LiteLLM proxy — that compose into a desktop research assistant [repo-structure]. All services run on the user's machine, with heavy computation offloaded to external API providers.
42.2.1 Service Architecture and Repository Layout
The system's three services are launched by a single start.sh script that handles dependency installation and process orchestration [repo-structure]. The frontend (web/) is a Next.js/React application served on port 3000, providing a three-panel layout: chat interface, file browser sidebar, and file preview panel. The backend (server.py) is a Python FastAPI server on port 8000 that hosts the Kady agent and dispatches expert sub-agents. The LiteLLM proxy on port 4000 translates between the application's API calls and OpenRouter endpoints, providing a unified interface with request logging, rate limiting, and retry logic [repo-docs].
The following repository layout is documented in the project README and confirmed by the directory structure [repo-structure]:
# Repository layout [repo-structure]
# Source: github.com/K-Dense-AI/k-dense-byok directory listing
k-dense-byok/
├── start.sh # One-command startup; installs deps, launches all services
├── server.py # Python FastAPI backend server (port 8000)
├── kady_agent/
│ ├── agent.py # Kady orchestrator: task classification, routing, delegation
│ ├── env.example # Template for 50+ configurable API keys
│ └── tools/ # Tool definitions
│ ├── web_search.py # Parallel API integration for web search
│ ├── delegate.py # Expert sub-agent delegation logic
│ └── file_ops.py # Sandbox file operations
├── web/ # Frontend application (Next.js/React)
│ └── src/
│ ├── components/ # React components: chat, file-browser, preview, workflows
│ └── data/
│ └── workflows.json # 326 structured workflow definitions
├── sandbox/ # Isolated file workspace (created at runtime)
└── user_config/
└── custom_mcps.json # User-defined MCP server configurations
Implementation note. The internal logic of kady_agent/agent.py, tools/delegate.py, and the Gemini CLI invocation mechanism were not audited at the source-code level for this survey. The behavioral descriptions in subsequent sections are based on the project's README documentation, video demonstrations, and workflow template structure. Where specific implementation details are inferred rather than observed, they are explicitly labeled as author reconstructions.
42.2.2 Split-Model Routing
A critical architectural decision in K-Dense BYOK is its split-model architecture [repo-docs]. The orchestrator agent (Kady) runs on whichever model the user selects from a dropdown of 40+ options via OpenRouter — ranging from Claude 3.5 Sonnet and GPT-4o to Llama 3.x and Qwen 2.5 series models. However, expert sub-agents always run on Gemini via the Gemini CLI, regardless of the user's model selection.
The repository documentation explicitly states this design [repo-docs]: "The model you select in the dropdown only applies to Kady (the main agent). Expert execution and coding tasks use the Gemini CLI, which always runs through a Gemini model on OpenRouter regardless of your dropdown selection." This creates a dual-cost structure and introduces a known limitation: the scientific skills were originally developed as "Claude Scientific Skills" but are executed through the Gemini CLI, and Gemini's skill activation is acknowledged as "not always reliable" [repo-docs].
42.2.3 K-Dense Ecosystem Context
K-Dense BYOK is the open-source desktop counterpart of the commercial K-Dense Web platform [repo-docs]. Both share the same underlying skill library — K-Dense Claude Scientific Skills — which provides 170+ skills and 250+ scientific database connectors. The open-source repository provides the core agent orchestration and skill execution engine, while K-Dense Web adds managed infrastructure, team collaboration, and zero-setup deployment. This dual-track model (open-source community + commercial SaaS) is increasingly common in the AI research tool space.
42.3 Core Mechanisms
42.3.1 Intelligent Task Routing
Kady's central mechanism is task classification and routing. When a user message arrives, Kady classifies it into one of four categories and routes accordingly [repo-docs]: simple Q&A (answered directly using the selected model), web-required queries (routed through Parallel API search), file operations (executed in the sandbox), or expert-needed tasks (delegated to a Gemini-based sub-agent). The README describes this classification as being performed by the LLM itself — there is no explicit rule-based classifier — which means routing quality depends on the capability of the user-selected orchestrator model.
The following conceptual pseudocode illustrates the routing pattern described in the documentation. This is an author reconstruction, not a verbatim excerpt from kady_agent/agent.py. The actual implementation's class names, method signatures, and control flow have not been verified at the source-code level.
# CONCEPTUAL PSEUDOCODE — author reconstruction [author reconstruction]
# Illustrates the routing pattern described in K-Dense BYOK documentation.
# NOT extracted from kady_agent/agent.py. Actual implementation details
# (class hierarchy, method names, error handling) are unverified.
# Routing mechanism: Kady uses LLM tool-calling to implicitly classify tasks.
# No explicit rule-based classifier exists — routing emerges from the LLM's
# decision about which tool (if any) to invoke.
#
# Documented tool categories [repo-docs]:
# 1. No tool call → Simple Q&A, answered by Kady directly
# 2. web_search tool → Query routed through Parallel API
# 3. file_ops tool → Sandbox file read/write/list operations
# 4. delegate tool → Expert sub-agent spawned via Gemini CLI
#
# The delegate tool triggers the expert pipeline described in §42.3.2.
# Skill selection may use the workflow's suggestedSkills field when a
# workflow template was used, or domain matching against the task
# description for freeform requests.
async def process_message(user_message: str, model: str, tools: list) -> str:
"""Conceptual flow for Kady's message processing."""
response = await llm_call(
model=model, # User-selected model via LiteLLM proxy
messages=conversation,
tools=tools, # Tool definitions enable implicit routing
)
if response.has_tool_calls:
results = await execute_tools(response.tool_calls)
return await synthesize(results)
return response.content # Direct Q&A path
42.3.2 Expert Delegation Pipeline
The expert delegation mechanism is the most architecturally significant component [repo-docs]. When Kady determines a task requires domain expertise, a four-stage pipeline executes:
Stage 1 — Task Packaging. The task description is extracted from the conversation context. If a workflow template was used, its suggestedSkills field is attached as metadata. Relevant API keys from the .env file are injected, and sandbox file access paths are configured.
Stage 2 — Gemini CLI Invocation. A Gemini CLI process is spawned as a subprocess. The task and selected skills are passed as system context. Available tools are registered: built-in code execution, file I/O, scientific database APIs, and any custom MCP server tools.
Stage 3 — Execution Loop. The Gemini agent reasons about the task in a multi-turn loop, calling tools as needed, generating intermediate files in the sandbox, and streaming progress back to the backend.
Stage 4 — Result Integration. Kady receives the expert's output, synthesizes a user-facing response, updates the file browser with newly generated files, and streams the result to the frontend.
A key characteristic of this pipeline is expert isolation [repo-docs]: each delegation creates a fresh agent with no memory of prior delegations, even within the same conversation session. This simplifies the architecture but prevents experts from building on each other's work within a single research session without Kady manually threading information through its own conversation history.
42.3.3 Scientific Skill System
The skill system is inherited from K-Dense's Claude Scientific Skills project [repo-docs]. Each skill is a Markdown document that encodes domain-specific instructions, best practices, expected input/output formats, and code snippets for a particular scientific task. Skills are organized across 22 disciplines:
| Category | Disciplines | Key Tools/Libraries |
|---|---|---|
| Life Sciences | Genomics, Proteomics, Cell Biology, Ecology | scanpy, biopython, Ensembl API |
| Chemical Sciences | Chemistry, Drug Discovery, Materials Science | RDKit, PubChem API, Materials Project |
| Physical Sciences | Physics, Astrophysics, Engineering | astropy, scipy, FEA tools |
| Formal Sciences | Mathematics, ML/AI | sympy, scikit-learn, pytorch |
| Health Sciences | Clinical, Neuroscience | clinical trial APIs, EEG tools |
| Social Sciences | Finance, Social Science | quantitative finance, survey tools |
| Research Operations | Paper, Literature, Data, Grants, SciComm, Visual | LaTeX, matplotlib, plotly |
The skill injection mechanism operates at expert spawn time [repo-docs]. When a sub-agent is created, skills are selected either through the workflow's suggestedSkills field or by domain matching against the task description. Selected skills are concatenated into the agent's system prompt context, providing domain-specific procedural guidance that the Gemini CLI agent follows during task execution. The exact selection algorithm — whether keyword matching, embedding-based retrieval, or LLM-driven classification — is not documented in the README and was not verified from source code [author reconstruction].
Formal Model of Skill-Augmented Expert Generation
The skill injection process can be modeled as a context-augmented generation problem with concrete operational implications. Let $S = \{s_1, s_2, \ldots, s_N\}$ denote the full skill library ($N \approx 170$ as of April 2026), where each skill $s_i$ is a Markdown document of variable length $|s_i|$ measured in tokens. Let $t$ denote the task description (either freeform or assembled from a workflow template), and let $\sigma$ denote the skill selection function.
Skill selection. The function $\sigma: \mathcal{T} \to 2^S$ maps a task description to a subset of relevant skills. When a workflow template $w$ is used, the selection is partially deterministic via the suggestedSkills field:
The automatic selection function $\sigma_{\text{auto}}$ is undocumented. Based on the repository's use of Gemini CLI for expert execution, it likely operates either as keyword matching against skill metadata tags or as an LLM-driven classification step within Kady's routing [author reconstruction]. The number of skills typically injected per task, $|\sigma(t)|$, is not documented. Examining the workflow templates' suggestedSkills arrays in web/src/data/workflows.json, most workflows suggest 1–3 skills [repo-structure].
Expert prompt assembly. The expert agent's effective prompt is:
where $P_{\text{sys}}$ is the base system prompt for the Gemini CLI agent, $\oplus$ denotes concatenation, and $P_{\text{tools}}$ is the tool definition context (MCP servers, code execution, file I/O). The total prompt size in tokens is:
Context-window pressure. This formulation exposes a concrete operational tension. If each skill averages $\bar{s} \approx 2{,}000$–$5{,}000$ tokens (a reasonable estimate for detailed Markdown documents with code snippets, though the actual distribution is unverified), then injecting $k$ skills consumes $k \cdot \bar{s}$ tokens of the Gemini model's context window. For Gemini 1.5 Pro with a 1M-token context window, this is negligible. For Gemini 2.0 Flash with a 1M-token window, it is similarly manageable. However, the documented "long-context degradation" limitation [repo-docs] suggests that instruction-following fidelity degrades well before the hard context limit is reached, consistent with the known lost-in-the-middle phenomenon in long-context LLMs. The practical ceiling is therefore not the context window size but the model's attention fidelity over the injected skill text.
Specifically, the K-Dense team's self-reported limitation that "Gemini sometimes skips relevant skills" [repo-docs] implies that for large $|\sigma(t)|$, the effective skill utilization rate $\rho$ satisfies $\rho < 1$, where $\rho$ is the fraction of injected skills whose instructions are faithfully followed by the expert agent. This degradation is likely a function of both $|\sigma(t)|$ and the total token count $\sum |s_i|$, though no quantitative characterization is available.
42.3.4 Workflow Template System
The 326 workflow templates represent K-Dense BYOK's most substantial content contribution [repo-docs]. Each template is a structured JSON object encoding a research procedure with variable placeholders. The following is a representative structure based on the documented schema in web/src/data/workflows.json [repo-structure]:
// Representative workflow template structure [repo-structure]
// Source: web/src/data/workflows.json schema (field names and types verified
// from repository documentation; specific field values are illustrative)
{
"id": "gene-expression-analysis",
"name": "Differential Gene Expression Analysis",
"description": "Analyze RNA-seq data for differentially expressed genes",
"category": "genomics",
"icon": "Dna",
"prompt": "Analyze the uploaded RNA-seq count matrix for differential gene expression between {condition_a} and {condition_b}. Use scanpy for preprocessing, perform statistical testing, generate a volcano plot, and identify enriched pathways.",
"suggestedSkills": ["scanpy", "scientific-visualization"],
"placeholders": [
{ "key": "condition_a", "label": "Control condition", "required": true },
{ "key": "condition_b", "label": "Treatment condition", "required": true }
],
"requiresFiles": true
}
The workflow JSON schema includes the following fields [repo-structure]: id (unique kebab-case identifier), name (human-readable title), description (card summary), category (one of 22 discipline tags), icon (Lucide icon name), prompt (template with {placeholder} syntax), suggestedSkills (array of skill identifiers guiding expert delegation), placeholders (array of user-fillable variables), and requiresFiles (boolean flag for file upload requirement).
When a user selects a workflow, fills in the placeholders, and optionally attaches files, the assembled prompt is sent to Kady. The workflow's suggestedSkills guide skill selection during expert delegation, and the requiresFiles flag ensures the UI prompts for file uploads when needed. Workflows are community-extensible — new templates are added by editing workflows.json and submitting a pull request [repo-docs].
# Workflow prompt assembly [author reconstruction]
# The actual implementation is in the frontend (TypeScript/React);
# the logic below illustrates the documented placeholder-filling behavior.
def assemble_workflow_prompt(workflow: dict, user_inputs: dict) -> str:
"""Fill placeholders in a workflow template to produce a ready prompt."""
prompt = workflow["prompt"]
for placeholder in workflow["placeholders"]:
key = placeholder["key"]
if key in user_inputs:
prompt = prompt.replace(f"{{{key}}}", user_inputs[key])
return prompt
# Example: assemble_workflow_prompt(gene_expr_workflow,
# {"condition_a": "wild-type", "condition_b": "BRCA1-knockdown"})
# → "Analyze the uploaded RNA-seq count matrix for differential gene
# expression between wild-type and BRCA1-knockdown. Use scanpy ..."
42.3.5 MCP Server Extensibility
K-Dense BYOK supports the Model Context Protocol (MCP) for extending expert agent capabilities [repo-docs]. Custom MCP server configurations are stored in user_config/custom_mcps.json and merged with built-in defaults (including a Docling server for document processing and Parallel for web search). This enables researchers to connect domain-specific tool servers — lab equipment interfaces, institutional APIs, custom computation backends — without modifying the core application.
// MCP configuration structure [repo-docs]
// Source: documented in README; stored in user_config/custom_mcps.json
{
"custom-servers": {
"my-lab-server": {
"command": "npx",
"args": ["-y", "my-mcp-server"]
},
"remote-api": {
"httpUrl": "https://mcp.example.com/api",
"headers": { "Authorization": "Bearer token" }
}
}
}
42.4 Cost Model and Resource Analysis
The BYOK design means all API costs are borne directly by the user. The total cost per research session can be expressed as:
where $n_k$ is the number of Kady inference calls with per-token cost $c_k(m)$ for user-selected model $m$ and token count $T_i^{(k)}$; $n_e$ is the number of expert delegations at Gemini's per-token cost $c_g$ with token count $T_j^{(e)}$; $n_s$ is the number of web search calls via Parallel API at per-call cost $c_s^{(l)}$; and $n_r$ is the number of remote compute invocations via Modal at cost $c_r^{(r)}$.
Cost uncertainty from expert delegation. The dominant source of cost uncertainty is the expert delegation term. Unlike Kady's orchestration calls, which have relatively predictable token counts (a few hundred to a few thousand tokens per turn), expert delegations involve multi-turn execution loops where the Gemini agent may make multiple tool calls, generate and execute code, and iterate on results. The repository documentation estimates expert token usage at 10K–200K per task [repo-docs] — a 20× range that reflects this structural unpredictability. For a session with $n_e$ expert delegations, the expected cost range for the expert component alone is:
At current Gemini pricing through OpenRouter (approximately $0.075–$0.30 per million input tokens and $0.30–$1.20 per million output tokens for Gemini 2.0 Flash through Gemini 1.5 Pro, as of April 2026), a single complex expert task could cost anywhere from $0.005 to $2.00, with multi-step research workflows involving 3–5 delegations potentially reaching $5–$10 total [author reconstruction]. The lack of per-session or per-task budget caps in the documented interface means cost control relies entirely on the user's judgment about when to delegate.
| Component | Token Range | Estimated Cost | Cost Driver |
|---|---|---|---|
| Kady orchestrator | 5K–50K per session | $0.01–$0.50 | Model selection, conversation length |
| Expert delegation | 10K–200K per task | $0.05–$2.00 per task | Task complexity, tool-call depth |
| Web search (Parallel API) | N/A | $0.01–$0.10 per search | Number of queries |
| Remote compute (Modal) | N/A | Variable (GPU time) | Compute duration, GPU type |
Users can optimize costs through model selection — using cheaper models (e.g., Haiku or Gemini Flash) for Kady's orchestration while expert tasks remain on competitively-priced Gemini models via OpenRouter [repo-docs]. Disabling web search and keeping computation local further reduces costs. The system's hardware requirements are modest: 4 GB RAM minimum (8+ GB recommended), 2 GB storage, and a stable network connection [repo-docs]. No local GPU is required since all inference is API-based.
42.5 Agent Delegation Flow
42.6 Memory and Learning
42.6.1 Memory Architecture
K-Dense BYOK operates with a minimal memory model: single-session conversation history, with no cross-session persistence and no knowledge accumulation [repo-docs]. This stands in stark contrast to systems like EurekaClaw (four-tier persistent memory including a knowledge graph) or AutoResearchClaw (MetaClaw cross-run skill transfer).
| System | Memory Tiers | Cross-Session | Knowledge Graph | Expert Memory |
|---|---|---|---|---|
| K-Dense BYOK | 1 (conversation) | No | No | No (isolated per delegation) |
| AutoResearchClaw | 3 | Yes (MetaClaw) | No | Cross-run skills |
| EurekaClaw | 4 | Yes | Yes | Persistent |
| AIRA₂ | 2 | Limited | No | Tournament-scoped |
Each expert delegation creates a fresh sub-agent with no memory of prior delegations, even within the same conversation [repo-docs]. While this simplifies the architecture and ensures data privacy (no state leaks between tasks), it means that a multi-step research workflow cannot build context across expert invocations without Kady manually threading information through its own conversation history.
42.6.2 Absence of Continued Learning
K-Dense BYOK does not implement any form of continued learning [repo-docs]. The system's behavior is identical between sessions — no knowledge extraction, no skill refinement, no workflow optimization based on usage patterns. The repository's roadmap mentions "better utilization of Skills" and "AutoResearch integration" as planned features [repo-docs], which could eventually introduce adaptive skill selection or cross-session learning, but as of April 2026, the system's intelligence is entirely static between the curated skill library and the capabilities of the underlying LLM models.
Learning does occur at the community level through the workflow contribution model: researchers create effective templates, submit pull requests, and all users benefit from the accumulated research procedures. The upstream skill library (Claude Scientific Skills) is also actively maintained, representing a form of externalized expert knowledge that grows over time independent of any individual user's sessions.
42.7 Key Results and Capabilities
K-Dense BYOK is a beta-stage open-source tool rather than a research system with formal benchmarks. Evaluation is therefore based on capability scope, adoption metrics, and reconstructed execution traces rather than quantitative performance results.
42.7.1 Adoption and Scale Metrics
| Metric | Value | Verification |
|---|---|---|
| GitHub stars | ~467 | GitHub API (April 2026) |
| Workflow templates | 326 | Count of entries in web/src/data/workflows.json |
| Scientific skills | 170+ | README claim; upstream Claude Scientific Skills repo |
| Database connectors | 250+ | README claim; skill documentation |
| Python packages accessible | 500,000+ | Via Gemini CLI execution environment (PyPI access) |
| Supported LLM models | 40+ | OpenRouter model list in UI dropdown |
42.7.2 End-to-End Execution Traces
No formal benchmarks or published evaluation results exist for K-Dense BYOK. To make the system's capability claims more concrete and falsifiable, we reconstruct two end-to-end case studies based on the documented workflow templates, skill library, and architectural behavior. These traces are author reconstructions based on documented system behavior, not transcripts of observed runs. They represent the expected execution flow for well-supported workflow categories.
Case Study 1: Differential Gene Expression Analysis
User input. User selects the "Differential Gene Expression Analysis" workflow from the genomics category, fills in placeholders (condition_a = "wild-type", condition_b = "CRISPR-KO"), and uploads a CSV count matrix (10,000 genes × 6 samples).
Assembled prompt [repo-structure]: "Analyze the uploaded RNA-seq count matrix for differential gene expression between wild-type and CRISPR-KO. Use scanpy for preprocessing, perform statistical testing, generate a volcano plot, and identify enriched pathways."
Expected routing decision. Kady classifies this as requiring expert delegation due to the computational nature of the task and file processing requirement.
Expected skill selection. suggestedSkills: ["scanpy", "scientific-visualization"] from the workflow template. These two skill documents are injected into the expert agent's system prompt.
Expected tool invocations by expert.
- File read: load CSV from
sandbox/ - Code execution: Python —
import scanpy as sc; adata = sc.read_csv(...) - Code execution: preprocessing (filtering, normalization, log-transform)
- Code execution: differential expression via
sc.tl.rank_genes_groups() - Code execution: volcano plot via matplotlib
- Code execution: pathway enrichment (e.g., using
gseapyor Enrichr API) - File write: save volcano plot PNG and results CSV to
sandbox/
Expected output files in sandbox/.
volcano_plot.png— differential expression visualizationdeg_results.csv— ranked gene list with fold changes and p-valuespathway_enrichment.csv— enriched GO terms or KEGG pathways
Estimated resource usage [author reconstruction]:
| Component | Estimated Tokens | Estimated Cost |
|---|---|---|
| Kady routing (1 call) | ~2,000 | ~$0.01 |
| Expert delegation (multi-turn) | ~50,000–80,000 | ~$0.25–$0.80 |
| Total session | ~52,000–82,000 | ~$0.26–$0.81 |
Estimated wall-clock time: 2–5 minutes (dominated by Gemini CLI execution turns and code execution).
Known failure modes. The self-reported limitation of "skill activation unreliable" means the expert may skip the scanpy skill instructions and use an alternative analysis approach, or may fail to generate the pathway enrichment if the skill instructions are not followed. File format mismatches in the uploaded CSV could also cause code execution failures requiring retry.
Case Study 2: Literature Review with Web Search
User input. Freeform chat message (no workflow template): "Find the 10 most cited papers on transformer architectures for protein structure prediction published since 2023, summarize each one, and create a comparison table."
Expected routing decision. Kady classifies this as requiring both web search (for literature retrieval) and expert delegation (for synthesis and comparison table generation). This may involve sequential tool calls: first web search, then expert delegation with search results as context.
Expected skill selection. No workflow suggestedSkills available (freeform request). Automatic skill selection ($\sigma_{\text{auto}}$) should identify skills related to literature review and possibly proteomics based on task keywords [author reconstruction].
Expected tool invocations.
- Web search via Parallel API: query for "most cited transformer protein structure prediction 2023 2024 2025"
- Possible additional web search: Semantic Scholar API queries for citation counts
- Expert delegation for synthesis: summarize retrieved papers, extract key contributions, generate comparison table
- File write: comparison table as Markdown or CSV in
sandbox/
Expected output files in sandbox/.
literature_review.md— structured summaries of retrieved paperscomparison_table.csv— paper-by-paper feature comparison
Estimated resource usage [author reconstruction]:
| Component | Estimated Tokens | Estimated Cost |
|---|---|---|
| Kady routing + synthesis (2–3 calls) | ~5,000–15,000 | ~$0.02–$0.10 |
| Web search (2–3 queries) | N/A | ~$0.02–$0.30 |
| Expert delegation (synthesis) | ~30,000–60,000 | ~$0.15–$0.60 |
| Total session | ~35,000–75,000 | ~$0.19–$1.00 |
Estimated wall-clock time: 3–8 minutes.
Known failure modes. Web search may return irrelevant results if the Parallel API's query interpretation diverges from the user's intent. Citation count accuracy depends on the search API's data freshness. The expert agent may hallucinate paper details if web search results are insufficiently detailed, or may fail to follow the literature review skill instructions if automatic skill selection does not match relevant skills.
Limitations of these traces. Both case studies are author reconstructions, not observed runs. The actual token counts, latencies, and costs may differ significantly due to: (a) nondeterministic LLM behavior, (b) variability in Gemini CLI execution depth, (c) the quality of automatic skill selection for freeform requests, and (d) external API response characteristics. Independent replication of these traces using the documented setup procedure (§42.8) would be required to validate the estimates.
42.7.3 Qualitative Capability Assessment
Based on the workflow library and skill documentation [repo-docs], K-Dense BYOK can execute research tasks spanning: literature review and synthesis across multiple scientific databases (PubMed, Semantic Scholar, arXiv), gene expression analysis using scanpy workflows, molecular docking and drug screening via RDKit skills, financial data analysis with quantitative modeling, manuscript drafting with LaTeX, and data visualization with matplotlib and plotly. The depth of any individual capability depends entirely on the quality of the relevant skill document and the Gemini CLI's ability to execute it reliably.
42.7.4 Self-Reported Limitations
The project transparently documents several limitations related to its Gemini CLI-based expert execution [repo-docs], which is noteworthy and commendable for a commercial product:
| Limitation | Impact | Documented Mitigation |
|---|---|---|
| Skill activation unreliable | Gemini sometimes skips relevant skills | Re-run tasks; switch Kady model |
| Tool-calling inconsistency | CLI drops tool calls or uses wrong arguments | Retry; await upstream Gemini improvements |
| Long-context degradation | Large skill contexts cause instruction drift | Skill curation; context windowing |
| Structured output drift | Gemini deviates from requested output formats | Post-processing; output validation |
These limitations highlight a fundamental tension in the architecture: the skill library was originally developed as "Claude Scientific Skills" but is executed through the Gemini CLI, creating a potential mismatch between the skill authoring context and the execution environment.
42.8 Replication Protocol
This section provides a detailed replication protocol for setting up and running K-Dense BYOK, sufficient for independent verification of the architectural and capability claims in this chapter.
42.8.1 Environment Prerequisites
| Requirement | Specification | Source |
|---|---|---|
| Operating system | macOS, Linux, or Windows with WSL | [repo-docs] |
| Python | 3.10+ (required by FastAPI and LiteLLM)* | [author reconstruction] |
| Node.js | 18+ (required by Next.js frontend)* | [author reconstruction] |
| Gemini CLI | Latest (auto-installed by start.sh; no version pinning documented) | [repo-docs] |
| LiteLLM | Latest (auto-installed; no version pinning documented) | [repo-docs] |
| RAM | 4 GB minimum, 8+ GB recommended | [repo-docs] |
| Storage | 2 GB for application + skills | [repo-docs] |
| Network | Required (all LLM inference is API-based) | [repo-docs] |
| GPU | Not required | [repo-docs] |
42.8.2 Setup and Launch Procedure
# Setup procedure [repo-docs]
# Source: README installation instructions
# 1. Clone repository
git clone https://github.com/K-Dense-AI/k-dense-byok.git
cd k-dense-byok
# 2. Configure API keys (minimum: OpenRouter key)
cp kady_agent/env.example kady_agent/.env
# Edit kady_agent/.env and set:
# OPENROUTER_API_KEY=sk-or-v1-... (required)
# PARALLEL_API_KEY=... (optional; enables web search)
# MODAL_TOKEN_ID=... (optional; enables remote compute)
# [50+ additional API keys for scientific database access]
# 3. Launch all three services
chmod +x start.sh
./start.sh
# start.sh auto-installs Python deps, Node deps, and Gemini CLI on first run
# → Frontend: http://localhost:3000
# → Backend: http://localhost:8000
# → LiteLLM: http://localhost:4000
42.8.3 Model Configuration
| Component | Model Selection | Configuration Method |
|---|---|---|
| Kady orchestrator | User-selected from UI dropdown (40+ options via OpenRouter) | Frontend model selector at runtime |
| Expert sub-agents | Gemini (fixed; specific model version not configurable) | Gemini CLI default, routed through OpenRouter |
| LiteLLM proxy | Passes through to OpenRouter | Auto-configured by start.sh |
Recommended replication models. For reproducibility comparisons, we suggest documenting: (1) the exact Kady model selected (e.g., anthropic/claude-3.5-sonnet or google/gemini-2.0-flash), (2) the date of the run (since OpenRouter model versions may change), and (3) whether web search was enabled or disabled.
42.8.4 Nondeterminism and Reproducibility Controls
Computational reproducibility (same inputs → same outputs) is not achievable with K-Dense BYOK for several structural reasons [repo-docs, author reconstruction]:
| Source | Severity | Available Controls |
|---|---|---|
| LLM sampling nondeterminism | High | No temperature/seed controls exposed in UI or documented API |
| Model version drift | Medium | No version pinning for OpenRouter models or Gemini CLI |
| Gemini CLI auto-updates | Medium | No documented version locking mechanism |
| Skill library versioning | Low | Skills downloaded at install time; repository commit provides implicit version |
| Workflow template stability | Low | Deterministic JSON structure; version controlled in git |
| External API responses | Medium | Scientific database results may change over time; web search results vary |
Best-effort reproducibility protocol. To maximize comparability across runs: (1) record the repository commit hash at clone time, (2) document the exact Kady model string selected in the UI, (3) note the date and time of execution, (4) save the complete sandbox/ directory contents after each session, (5) disable web search for tasks that do not require it, and (6) use workflow templates rather than freeform prompts to ensure consistent task specification.
All file operations are confined to a local sandbox/ directory [repo-docs], which provides data isolation between sessions. The workflow templates themselves are reproducible as structured prompt templates (deterministic JSON). But the end-to-end research output depends on three nondeterministic layers: the Kady model's routing decisions, the Gemini CLI expert's execution, and external API responses.
42.9 Limitations and Discussion
42.9.1 Breadth vs. Depth Trade-off
K-Dense BYOK's defining trade-off is breadth over depth. With 326 workflows across 22 disciplines [repo-docs], it covers more scientific domains than any other system surveyed in this work. However, each domain is supported only to the depth of its skill documents and the Gemini CLI's ability to execute them. Compare this to EurekaClaw, which covers only mathematical theorem proving but does so with a seven-stage pipeline, persistent knowledge graph, and formal verification. For a researcher who needs deep, autonomous support in a specific domain, K-Dense BYOK's generalist approach may be insufficient. For a researcher who works across multiple domains and needs a capable AI assistant for diverse tasks, its breadth is a genuine advantage.
42.9.2 Split-Model Architecture Implications
The decision to fix expert execution on Gemini via CLI while allowing user model selection only for the orchestrator creates several tensions [repo-docs]. Users who prefer Claude or GPT-4 for their reasoning quality can use those models for Kady's orchestration, but cannot control the model that actually executes their scientific tasks. The documented unreliability of Gemini's skill activation means that the system's most important component — domain-specific expert execution — operates on a model that the team acknowledges sometimes fails to follow its instructions.
This also creates a cost asymmetry. A researcher selecting a cheap model for Kady (e.g., Haiku at ~$0.25/M tokens) may still incur significant costs from expert delegations running on Gemini. As analyzed in §42.4, the total cost per session is not easily predictable because expert token usage (10K–200K per task) depends on the complexity and multi-turn depth of the task, and no per-task or per-session budget caps are documented.
42.9.3 Memory Gap
The absence of cross-session memory is perhaps the most significant limitation for research use. Scientific research is inherently iterative — hypotheses are refined over days, analyses build on prior results, and domain context accumulates across sessions. Without persistent memory, a researcher must re-establish context every session, re-upload files, and re-explain their research goals. This places K-Dense BYOK firmly in the "tool" category rather than the "collaborator" category — it assists with discrete tasks but cannot grow as a research partner over time.
42.9.4 Engineering Maturity, Security, and Privacy
As a beta-stage product, K-Dense BYOK's engineering maturity shows expected gaps. The following assessment is based on the repository structure and documentation as of April 2026 [repo-structure].
| Dimension | Status | Evidence |
|---|---|---|
| Test suite | Not documented | No tests/ directory or test configuration files observed in repository structure |
| CI/CD pipeline | Not documented | No GitHub Actions workflows or CI configuration documented |
| Linting / formatting | Not documented | No .eslintrc, ruff.toml, or equivalent observed |
| Type checking | Partial (frontend) | TypeScript on frontend provides compile-time type safety; Python backend typing practices undocumented |
| Error handling | Undocumented | No documented retry policies, circuit breakers, or graceful degradation for API failures |
| Dependency pinning | Undocumented | No requirements.txt with pinned versions or package-lock.json documented |
Security and privacy boundaries. The BYOK design creates a distinctive security profile:
- API key management. All API keys are stored in
kady_agent/.envon the user's local filesystem [repo-structure]. The.envfile is gitignored, but the security of these keys depends entirely on the user's local machine security. Keys are transmitted to external API providers (OpenRouter, Parallel, Modal) over HTTPS. There is no documented key rotation mechanism, access auditing, or encryption at rest for the.envfile. - Sandboxed file access. The
sandbox/directory provides logical separation for file operations [repo-docs]. However, this is a directory-scoped convention, not a security sandbox. The Gemini CLI expert agents execute code as a subprocess on the user's machine. Without documented containerization, chroot, or seccomp restrictions, the expert agent's code execution has, in principle, the same filesystem and network access as the parent process. Users running untrusted workflows or connecting to untrusted MCP servers should be aware of this boundary. - MCP server trust. Custom MCP servers defined in
user_config/custom_mcps.jsonare invoked by expert agents during task execution [repo-docs]. These servers may execute arbitrary code and access external services. The trust boundary for MCP servers is not documented — there is no capability restriction, sandboxing, or audit logging described for MCP tool invocations. - Data privacy. All conversation data remains on the user's machine [repo-docs]. No telemetry or analytics collection is documented. However, all user prompts, file contents used in analysis, and conversation history are transmitted to external LLM providers (OpenRouter → model providers) as part of normal inference calls. Researchers working with sensitive data (patient records, unpublished results, proprietary compounds) should evaluate provider data retention policies.
These gaps are reasonable for a beta-stage open-source tool but limit confidence in the system's reliability and security for production research workflows. The repository's roadmap does not explicitly mention security hardening or test infrastructure as planned improvements [repo-docs].
42.9.5 Comparison to Google's AI Co-Scientist
Google's AI Co-Scientist, announced in early 2025, represents a conceptually related but architecturally distinct approach. Google's system employs multi-agent debate with hypothesis ranking — multiple agents generate competing scientific hypotheses, which are then evaluated through structured argumentation and tournament-style selection. This produces a ranked list of hypotheses with supporting evidence and counterarguments.
K-Dense BYOK does not implement hypothesis ranking or multi-agent debate [repo-docs]. Its expert agents execute tasks independently rather than debating alternatives. Where Google's system targets biomedical research with deep integration into specific databases and validation pipelines, K-Dense BYOK targets breadth across scientific disciplines with a unified skill injection mechanism. The two systems represent different points in the autonomy-breadth design space.
| Dimension | K-Dense BYOK | Google AI Co-Scientist |
|---|---|---|
| Agent pattern | Orchestrator → delegated experts | Multi-agent debate + ranking |
| Domain scope | 22 disciplines | Biomedical focus |
| Model selection | 40+ models (user choice) | Gemini only |
| Hypothesis generation | Not supported | Core capability |
| Open source | Yes | No |
| Hardware requirement | Commodity laptop | Cloud infrastructure |
| Knowledge retrieval | Skill injection + 250+ DB connectors | Dense retrieval (proprietary) |
| Human-AI interaction | Conversational co-research | Hypothesis review and steering |
42.10 Research Significance
42.10.1 Novelty Assessment
Genuinely novel (within the scope of this survey): The combination of provider-agnostic model routing with curated scientific skill injection across 22 disciplines in a single desktop application has no direct precedent identified among the 17 open-source systems surveyed. The 326 workflow templates represent the largest open-source collection of structured research procedures for LLM-based agents found in this survey. The split-model routing pattern — where orchestration and execution are decoupled to potentially different model families — is an architecturally interesting design point.
Adapted from prior work: The orchestrator-delegate agent pattern is well-established in multi-agent systems literature. Skill injection via system prompt augmentation is a standard retrieval-augmented generation (RAG) technique. The MCP integration follows the emerging Model Context Protocol standard. The LiteLLM proxy layer is a standard approach for multi-provider LLM access.
Not addressed: Hypothesis generation and ranking (the core mechanism of Google's AI Co-Scientist), cross-session learning and memory (present in EurekaClaw and AutoResearchClaw), formal verification of scientific outputs, reproducibility guarantees, and automated quality assessment of expert outputs.
42.10.2 Contribution to the Field
K-Dense BYOK's primary contribution is democratization rather than algorithmic innovation. By packaging multi-disciplinary AI research assistance into a single-command-install desktop application, it lowers the barrier to entry for AI-assisted research from institutional API access and cloud deployment to a single API key and a laptop. The skill library and workflow templates encode substantial domain expertise in a form that is accessible to researchers across disciplines.
From an architectural perspective, the split-model routing pattern — where orchestration and execution are decoupled to different models — is an interesting design point that may inform future multi-agent research systems. It acknowledges that the optimal model for conversational task routing may differ from the optimal model for domain-specific execution, even though the current implementation's constraint (Gemini-only for experts) is more pragmatic than principled.
42.10.3 Impact Assessment
With approximately 467 GitHub stars in its first month of beta [repo-docs], K-Dense BYOK has demonstrated early community interest but has not yet achieved the adoption scale of mature research tools like Elicit or the citation impact of published systems like Google's AI Co-Scientist. Its potential impact lies in three directions: (1) serving as a reference architecture for desktop-based multi-disciplinary AI research assistants, (2) growing the open-source scientific skill ecosystem through community contributions, and (3) demonstrating that meaningful AI-assisted research does not require institutional infrastructure.
Chapter Summary
Key takeaway: K-Dense BYOK is an open-source desktop AI co-scientist that trades depth for breadth, providing skill-loaded expert delegation across 22 scientific disciplines with a provider-agnostic design that runs on commodity hardware.
Main contribution: A three-tier agent architecture (conversational orchestrator → Gemini-based expert sub-agents → 170+ curated scientific skills) combined with 326 workflow templates — the largest open-source collection of structured research procedures for LLM agents identified in this survey — and a BYOK model that eliminates vendor lock-in.
What researchers should know: K-Dense BYOK occupies the broadest but shallowest point in the AI co-scientist design space among the systems surveyed. It is best suited for researchers who work across multiple disciplines and need a flexible AI assistant for diverse tasks, rather than those requiring deep autonomous support in a single domain. The system has no cross-session memory, no continued learning, and acknowledged reliability limitations in its Gemini-based expert execution. Its primary innovation is accessibility and disciplinary breadth, not algorithmic novelty. Engineering maturity (testing, CI, security boundaries) is appropriate for beta but limits production-readiness. Computational reproducibility is not achievable due to LLM nondeterminism and the absence of version-pinning controls.