Score: 7.96/10 (Draft)

Chapter 56

7/24 Office

Part P08: Harness & Agent Frameworks

56.1 Overview and Motivation

The dominant paradigm for building AI agent systems in the 2024–2026 period involves multi-layered framework stacks — LangChain, LlamaIndex, CrewAI, and their successors — that provide abstraction layers for tool calling, memory, and orchestration. These frameworks, while powerful, introduce substantial dependency trees (often exceeding 100 packages), opaque internal state, and debugging complexity that can exceed the complexity of the task itself. 7/24 Office, published in March 2026 by independent developer wangziqi06, presents a direct counter-thesis: a production-grade, self-evolving AI agent running 24/7 in approximately 3,500 lines of pure Python with only three external dependencies.

The system's name — "7/24 Office" (七二四办公室) — references continuous availability. It functions as a personal AI agent that autonomously handles scheduling, file management, web search, video processing, memory recall, and self-diagnostics. What distinguishes 7/24 Office from a simple chatbot wrapper is its capacity for runtime tool creation: the agent can write, persist, and load new Python tools during operation, creating a genuine self-evolution loop where the agent's capability space grows monotonically over time based on encountered tasks.

Key Contribution

7/24 Office demonstrates that a complete, production-running AI agent system with runtime self-evolution, three-layer memory, MCP protocol integration, self-repair diagnostics, and 24/7 autonomous scheduling can be built in ~3,500 lines of pure Python with zero framework dependencies — three packages total (croniter, lancedb, websocket-client). The system's runtime tool creation mechanism provides permanent capability expansion (not just information retention), occupying a unique niche between heavyweight agent frameworks and minimal prompt-chaining scripts.

The project was built solo with AI co-development tools in under three months and has been running in production continuously since its release. It had accumulated 1,136 GitHub stars by April 2026 (repository-reported). The design targets edge deployment on a Jetson Orin Nano (8 GB RAM, ARM64 + GPU) with a runtime memory budget under 2 GB, making it one of the few agent systems explicitly designed for resource-constrained hardware while still requiring cloud LLM API access.

56.1.1 Design Philosophy

The system adheres to five stated design principles, each with a concrete implementation consequence:

Principle	Implementation	Tradeoff
Zero framework dependency	No LangChain/LlamaIndex/CrewAI; stdlib + 3 packages	Must reimplement common patterns (HTTP client, MCP protocol)
Single-file tools	Adding a capability = one function with `@tool` decorator	No cross-tool composition or dependency management
Edge-deployable	Targets Jetson Orin Nano; RAM budget <2 GB	No GPU-heavy local inference by default
Self-evolving	Runtime tool creation, self-diagnostics, auto-notification	No sandboxing on created tools
Offline-capable	Core works without cloud APIs (except LLM itself)	LLM API remains a hard external dependency

A notable implementation choice reflecting this philosophy: all HTTP communication uses Python's urllib.request directly, bypassing requests, httpx, and any other HTTP client library. This is consistent across LLM API calls, embedding requests, search queries, and video generation — every external interaction flows through raw urllib.request.Request with manual JSON serialization.
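To make the "raw urllib.request with manual JSON serialization" pattern concrete, the following is a minimal sketch, not the repository's code. The function names, the placeholder endpoint `https://api.example.com/v1`, the key, and the model name are all assumptions for illustration; only the standard-library calls are real.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST request for an OpenAI-compatible API."""
    payload = {"model": model, "messages": messages}
    body = json.dumps(payload).encode("utf-8")  # manual JSON serialization
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical endpoint, key, and model name; nothing is sent here.
req = build_chat_request(
    "https://api.example.com/v1",
    "sk-placeholder",
    "example-model",
    [{"role": "user", "content": "hello"}],
)
```

Separating request construction from sending keeps the network call in one place, which is roughly what a dependency-free client has to do by hand in place of a library such as requests.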

56.2 Architecture

7/24 Office follows a pipeline architecture with clear data flow from messaging platform through LLM to tools. The entire system consists of eight Python files totaling approximately 135 KB of source code.

56.2.1 File-Level Structure

File	Size	Purpose
`tools.py`	48 KB	Tool registry, 26 tool implementations, plugin system, MCP bridge
`xiaowang.py`	22 KB	Entry point, HTTP server, callbacks, debounce, ASR pipeline
`router.py`	17 KB	Multi-tenant Docker routing, container lifecycle
`llm.py`	14 KB	LLM API calls, tool use loop, session management
`memory.py`	13 KB	Three-layer memory: compress, deduplicate, retrieve
`mcp_client.py`	12 KB	MCP protocol client (JSON-RPC, stdio/HTTP transport)
`scheduler.py`	7 KB	Cron + one-shot scheduling, persistent jobs
`self_check_tool.py`	2 KB	Self-check diagnostic report generation

56.2.2 System Architecture Diagram

56.2.3 Threading Model

A distinctive architectural choice is the use of Python's threading module rather than asyncio. The author explicitly avoids the "function coloring problem" — the well-known constraint where async propagates through the entire call stack. The threading model consists of several concurrent execution paths:

Thread	Purpose	Lifecycle
Main thread	HTTP server (`ThreadingMixIn` spawns per-request)	Process lifetime
Per-request threads	Handle incoming webhooks, dispatch to callback handler	Per HTTP request
Debounce timers	`threading.Timer` per sender; fires after 3s of silence	Created/cancelled per message
Chat lock threads	Serialize concurrent messages to same session	Per `chat()` call
Memory compression	Background LLM-based compression of evicted messages	Daemon, per overflow event
Scheduler loop	Check jobs every 10 seconds	Daemon, process lifetime
MCP stdio readers	Timeout-wrapped readers for subprocess communication	Per MCP request
ASR streaming	Audio streaming + WebSocket client	Per voice message

Thread safety is managed through per-session threading.Lock instances — concurrent messages to the same session are serialized, while messages to different sessions proceed in parallel. All persistent state uses atomic writes (write to .tmp then os.replace()) to prevent corruption on crash.
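The two safety mechanisms described above (one lock per session, and write-to-.tmp followed by os.replace()) can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the helper names `session_lock`, `atomic_write_json`, and `append_message` are hypothetical, and the fsync call is an extra durability step that the source does not mention.

```python
import json
import os
import tempfile
import threading

_locks = {}
_locks_guard = threading.Lock()  # protects the lock registry itself


def session_lock(session_key):
    """Return the lock for a session, creating it on first use."""
    with _locks_guard:
        lock = _locks.get(session_key)
        if lock is None:
            lock = _locks[session_key] = threading.Lock()
        return lock


def atomic_write_json(path, data):
    """Write to a sibling .tmp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
        f.flush()
        os.fsync(f.fileno())  # assumption: not confirmed from the repository
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


def append_message(sessions_dir, session_key, message):
    """Same-session writers are serialized; other sessions are not blocked."""
    path = os.path.join(sessions_dir, session_key + ".json")
    with session_lock(session_key):
        session = {"messages": []}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                session = json.load(f)
        session["messages"].append(message)
        atomic_write_json(path, session)
        return len(session["messages"])
```

Because the read-modify-write cycle happens entirely under the session's lock, concurrent threads appending to the same session cannot lose each other's messages, and a crash mid-write leaves the previous file intact.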

56.2.4 Data Persistence Layout

The system uses a file-based persistence model with no external database server required (LanceDB is an embedded database):

# From repo: project root directory structure
# All paths are relative to the project root

# config.json            — Master configuration (API keys, providers, MCP servers)
# jobs.json              — Persistent scheduler state (atomic writes)
# sessions/
#   dm_USER_ID.json      — Per-user DM session history
#   scheduler.json       — Scheduler session (cross-session bridge source)
# memory_db/
#   memories/            — LanceDB vector table files
# workspace/
#   SOUL.md              — Agent personality definition
#   AGENT.md             — Operational procedures
#   USER.md              — User preferences and context
#   memory/MEMORY.md     — Long-term keyword-searchable memory
#   files/               — Received/generated media (monthly organized)
#     index.json         — File metadata index
#     2026-03/           — Monthly directory
# plugins/
#   *.py                 — Runtime-created tool files

56.3 Core Mechanisms

56.3.1 Tool Use Loop

The central interaction pattern is a synchronous tool use loop with a hard upper bound of 20 iterations per conversation turn. At each iteration, the LLM receives the full message history (including prior tool results) and either returns a text response (terminating the loop) or requests one or more tool calls. This is implemented using the OpenAI-compatible function calling API format.

# Simplified from repo: llm.py — core tool use loop
# The actual implementation uses urllib.request for HTTP calls

def chat(user_message, session_key, images=None):
    """Core tool use loop — up to 20 iterations per conversation."""
    session = load_session(session_key)
    
    # Build system prompt from markdown personality files
    system_prompt = _build_system_prompt()  # SOUL.md + AGENT.md + USER.md + time
    
    # Inject retrieved memories into system prompt
    memories = memory_mod.retrieve(user_message, top_k=5)
    if memories:
        system_prompt += "\n\n[Relevant Memories]\n" + memories
    
    # Inject cross-session scheduler context (2-hour freshness window)
    scheduler_ctx = _get_recent_scheduler_context()
    if scheduler_ctx and session_key != "scheduler":
        system_prompt += "\n\n" + scheduler_ctx
    
    # Add user message (with optional images as base64)
    session["messages"].append(_build_user_message(user_message, images))
    
    tool_defs = tools_mod.get_definitions()  # 26 built-in + plugins + MCP tools
    
    for iteration in range(20):  # Hard limit: 20 iterations
        response = _call_llm(system_prompt, session["messages"], tool_defs)
        
        if not response.get("tool_calls"):
            # No tool calls — return text response, save session
            session["messages"].append({"role": "assistant", "content": response["content"]})
            break
        
        # Execute each requested tool
        session["messages"].append(response)  # assistant message with tool_calls
        for tool_call in response["tool_calls"]:
            try:
                result = tools_mod.execute(tool_call["function"]["name"],
                                           json.loads(tool_call["function"]["arguments"]))
            except Exception as e:
                result = f"[error] {e}"
            session["messages"].append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": str(result)
            })
    
    # Handle session overflow — triggers memory compression
    if len(session["messages"]) > 40:
        evicted = session["messages"][:-40]
        session["messages"] = session["messages"][-40:]
        memory_mod.compress_async(evicted, session_key)  # background thread
    
    _strip_images(session["messages"])  # Replace base64 with [image] markers
    save_session(session_key, session)

Several implementation details are noteworthy. Every chat() call logs performance metrics: prep_time, llm_total_time, tool_count, and total_time in milliseconds. Before saving, base64 image URLs are replaced with [image] text markers to prevent storage bloat and API errors on history replay. The system also preserves reasoning_content from models that support chain-of-thought (such as DeepSeek's reasoning models), inserting a placeholder "ok" when reasoning content is absent to maintain API compatibility.
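The image-stripping step can be illustrated with a short sketch. It assumes the OpenAI-style multi-part message format, in which `content` is a list of `text` and `image_url` parts; the function name `strip_images` and the choice to flatten the parts into one string are assumptions, not details confirmed from the repository.

```python
def strip_images(messages):
    """Replace inline image parts with a short text marker, in place."""
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        texts = []
        for part in content:
            if part.get("type") == "image_url":
                texts.append("[image]")  # drops the base64 payload
            elif part.get("type") == "text":
                texts.append(part.get("text", ""))
        msg["content"] = "\n".join(texts)
    return messages


history = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "what is this?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
        ],
    },
    {"role": "assistant", "content": "A cat."},
]
strip_images(history)
```

After the call, the first message's content is the plain string "what is this?" followed by a newline and "[image]", so the saved session no longer contains the encoded image.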

56.3.2 Three-Layer Memory Pipeline

The memory system is the most architecturally sophisticated component, implementing a three-stage pipeline that maps to a simplified model of human memory:

Layer	Human Analogy	Implementation	Capacity
Session Memory	Working memory	JSON files in `sessions/`	Last 40 messages
Compressed Memory	Episodic memory	LLM-extracted facts in LanceDB	Unbounded (deduplicated)
Retrieved Memory	Semantic memory	Vector search, injected into prompt	Top-$K$ ($K=5$)

The deduplication mechanism uses a pure-Python cosine similarity calculation, consistent with the minimal-dependency philosophy. For two embedding vectors $\mathbf{a}$ and $\mathbf{b}$ of dimension $d = 1024$:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|} = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^2} \cdot \sqrt{\sum_{i=1}^{d} b_i^2}}$$

where $a_i$ and $b_i$ are the $i$-th components of the respective embedding vectors. A new memory with embedding $\mathbf{a}$ is stored only if $\text{sim}(\mathbf{a}, \mathbf{b}) \leq 0.92$ for every existing memory embedding $\mathbf{b}$. The threshold of 0.92 is a repository default, chosen to allow semantically distinct but topically related facts to coexist while preventing near-verbatim duplicates.

# From repo: memory.py — cosine similarity (pure Python, no NumPy)
def _cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0
    return dot / (norm_a * norm_b)
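
How the similarity function feeds the storage decision can be sketched as below. The helper `is_novel` and the linear scan over all stored embeddings are assumptions made for illustration; the repository may instead query LanceDB for nearest neighbors before comparing.

```python
def cosine_similarity(a, b):
    """Pure-Python cosine similarity; returns 0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0
    return dot / (norm_a * norm_b)


def is_novel(candidate, existing, threshold=0.92):
    """True if no stored embedding is more similar than the threshold."""
    return all(cosine_similarity(candidate, e) <= threshold for e in existing)


stored = [[1.0, 0.0]]
near_duplicate = [0.99, 0.1]  # similarity to [1, 0] is about 0.995
unrelated = [0.0, 1.0]        # similarity to [1, 0] is 0
```

With these toy two-dimensional vectors, `is_novel(near_duplicate, stored)` is false and `is_novel(unrelated, stored)` is true; real embeddings would have 1,024 components.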

The compression prompt is engineered to extract only long-term-valuable information, explicitly instructing the LLM to skip chitchat, greetings, and repeated confirmations, replace pronouns with specific names, and convert relative dates to absolute dates. If no facts are worth preserving, the LLM returns an empty array. The JSON parser includes robust fallback handling: it strips markdown code fences if present, and attempts bracket-delimited extraction if json.loads() fails on the raw output.
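The fallback parsing described above might look like the following sketch. The function name `parse_fact_list` and the exact order of fallbacks are assumptions; the fence marker is built programmatically only so that this example does not contain a literal code fence.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker


def parse_fact_list(raw):
    """Parse an LLM reply that should contain a JSON array of facts."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take everything between the first "[" and the last "]".
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return []
    return parsed if isinstance(parsed, list) else []
```

Returning an empty list on every failure path means a malformed reply is treated the same as "no facts worth preserving", which keeps the background compression thread from raising.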

56.3.3 Runtime Tool Creation — Self-Evolution

The most architecturally significant mechanism is runtime tool creation, which allows the agent to permanently extend its own capabilities. When the agent encounters a request it cannot fulfill with existing tools, it can write a new Python function, save it to the plugins/ directory, and immediately register it for use in subsequent conversations.

# From repo: tools.py — the @tool decorator and plugin loading mechanism

def tool(name, description, properties, required=None):
    """Decorator that registers a function as both executable and LLM-callable."""
    def decorator(fn):
        _registry[name] = {
            "fn": fn,
            "definition": {
                "type": "function",
                "function": {
                    "name": name,
                    "description": description,
                    "parameters": {
                        "type": "object",
                        "properties": properties,
                        **({"required": required} if required else {}),
                    },
                },
            },
        }
        return fn
    return decorator


def _exec_plugin(code, source="<plugin>"):
    """Load a plugin by executing its code with @tool decorator available."""
    exec(compile(code, source, "exec"), {
        "__builtins__": __builtins__,
        "tool": tool,    # The @tool decorator — allows registration
        "log": log,      # Application logger
    })

The evolution loop operates as follows. When a user requests a capability that does not exist (e.g., "check Bitcoin price"), the LLM recognizes the gap and invokes the built-in create_tool function, which writes a new Python file to plugins/ containing a function decorated with @tool. The file is immediately loaded via exec(), registering the new tool in _registry. On subsequent restarts, all files in plugins/ are scanned and loaded, ensuring persistence.

This creates a monotonically growing capability space:

$$\mathcal{T}_{t+1} = \mathcal{T}_t \cup \{t_{\text{new}}\}$$

where $\mathcal{T}_t$ is the tool set at time $t$, and $t_{\text{new}}$ is the newly created tool. The set never shrinks unless the user explicitly invokes remove_tool. Unlike fine-tuning (which requires retraining) or prompt engineering (which is ephemeral), this mechanism provides permanent, immediate capability expansion.

The security implications are significant. The exec() call grants created tools full Python permissions — access to the filesystem, network, subprocesses, and all importable modules. The only access control is the OWNER_IDS whitelist, which restricts who can interact with the agent. There is no sandboxing, code review, static analysis, or capability restriction on created tools. This is a deliberate tradeoff: the system is designed for single-user or trusted-user operation, not adversarial multi-tenant deployment.

56.3.4 MCP Protocol Bridge

7/24 Office includes a self-implemented Model Context Protocol (MCP) client in mcp_client.py — notably without using any MCP SDK. The implementation covers only the three essential JSON-RPC methods: initialize (handshake), tools/list (discovery), and tools/call (execution), supporting both stdio (subprocess) and HTTP transport modes.

MCP tools are namespaced with a double underscore separator to prevent collisions: a tool named search_notes on server notes_server becomes notes_server__search_notes. The MCP tool schema (inputSchema) maps directly to the OpenAI function-calling format (parameters) — both use JSON Schema, only the field name differs.
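The namespacing and schema mapping amount to a small translation step, sketched here. The function names are hypothetical, and the fallback schema for tools that omit `inputSchema` is an assumption.

```python
def mcp_tool_to_openai(server_name, mcp_tool):
    """Convert one entry from an MCP tools/list result into a function definition."""
    return {
        "type": "function",
        "function": {
            "name": server_name + "__" + mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # Both sides use JSON Schema; only the field name changes.
            "parameters": mcp_tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }


def split_namespaced(name):
    """Recover (server, tool) from a namespaced tool name."""
    server, _, tool_name = name.partition("__")
    return server, tool_name


mcp_tool = {
    "name": "search_notes",
    "description": "Search notes.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
definition = mcp_tool_to_openai("notes_server", mcp_tool)
```

Splitting on the first double underscore assumes server names never contain that sequence themselves; tool names may.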

The client implements auto-reconnect logic: on ConnectionError or TimeoutError during tools/call, it shuts down the current subprocess, starts a new one, re-runs initialize and tools/list, and retries the original call. This resilience is critical for 24/7 operation where MCP server processes may crash or become unresponsive.
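The reconnect-and-retry sequence can be expressed as a small wrapper. In this sketch the client interface (`call_tool`, `shutdown`, `start`, `initialize`, `list_tools`) and the `FlakyClient` stand-in are hypothetical names chosen for illustration; they are not the API of mcp_client.py.

```python
def call_with_reconnect(client, tool_name, arguments, max_retries=1):
    """Call a tool; on a connection failure, restart the server and retry."""
    attempt = 0
    while True:
        try:
            return client.call_tool(tool_name, arguments)
        except (ConnectionError, TimeoutError):
            if attempt >= max_retries:
                raise  # give up and surface the original error
            attempt += 1
            client.shutdown()    # stop the broken subprocess
            client.start()       # spawn a fresh one
            client.initialize()  # redo the handshake
            client.list_tools()  # refresh tool discovery


class FlakyClient:
    """Stand-in for an MCP client: fails once, then works."""

    def __init__(self):
        self.events = []
        self._failures_left = 1

    def call_tool(self, name, arguments):
        if self._failures_left:
            self._failures_left -= 1
            self.events.append("call_failed")
            raise ConnectionError("subprocess pipe closed")
        self.events.append("call_ok")
        return {"tool": name, "arguments": arguments}

    def shutdown(self):
        self.events.append("shutdown")

    def start(self):
        self.events.append("start")

    def initialize(self):
        self.events.append("initialize")

    def list_tools(self):
        self.events.append("list_tools")


client = FlakyClient()
result = call_with_reconnect(client, "search_notes", {"query": "todo"})
```

Bounding the retries matters for unattended operation: a server that fails on every start should produce one error in the tool result rather than an endless restart loop.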

56.3.5 Cron Scheduling and Cross-Session Context Bridge

The scheduler (scheduler.py) supports three task types: one-shot (delay-based), recurring (cron expression), and one-shot cron (trigger once at next cron match). Jobs are persisted in jobs.json with atomic writes and survive restarts. A background daemon thread checks jobs every 10 seconds, with a heartbeat log emitted every 30 minutes.
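The periodic job check can be sketched as a pure function that a daemon thread would call every 10 seconds. The job schema here (`kind`, `run_at`) and the `next_run` callback are assumptions for illustration; in the real system the next trigger time for cron-based jobs would presumably come from croniter, which this sketch deliberately does not import.

```python
def run_due_jobs(jobs, now, trigger, next_run):
    """Fire every job whose run_at has passed; return the jobs that remain."""
    remaining = []
    for job in jobs:
        if job["run_at"] > now:
            remaining.append(job)  # not due yet
            continue
        trigger(job)
        if job["kind"] == "recurring":
            # Reschedule; next_run stands in for a cron-expression evaluation.
            remaining.append({**job, "run_at": next_run(job, now)})
        # "once" and "once_cron" jobs are dropped after firing
    return remaining


fired = []
jobs = [
    {"id": "a", "kind": "once", "run_at": 100, "message": "remind me"},
    {"id": "b", "kind": "recurring", "run_at": 100, "message": "daily check"},
    {"id": "c", "kind": "once_cron", "run_at": 500, "message": "later"},
]
left = run_due_jobs(
    jobs,
    now=120,
    trigger=lambda job: fired.append(job["id"]),
    next_run=lambda job, now: now + 60,  # fixed interval for the demo
)
```

In this run, jobs `a` and `b` fire, `a` is discarded, `b` is rescheduled to 180, and `c` stays pending. Persisting the returned list after each pass is what lets jobs survive a restart.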

The integration between scheduling and the LLM creates a powerful automation pattern. When a scheduled task triggers, it calls chat_fn(message, "scheduler") — sending the task's message to the LLM as if it were a user message in the dedicated scheduler session. The LLM can then use any tool, including message to notify the owner:

# From repo: scheduler.py — task trigger mechanism (simplified)
# When a cron job fires, it sends the task message through the LLM

def _trigger(job):
    """Execute a scheduled task by sending its message through the LLM."""
    try:
        # chat_fn is llm.chat, injected during initialization
        chat_fn(job["message"], "scheduler")  # runs in scheduler session
    except Exception as e:
        log.error(f"Scheduled task failed: {e}")
        # On failure, notify owner via LLM
        chat_fn(f"Scheduled task '{job['message'][:50]}' failed: {e}", "scheduler")

A subtle but important design pattern, the cross-session context bridge, addresses the problem that the scheduler and the user operate in different sessions. When the scheduler sends a self-check report, the owner receives it in their DM and will typically reply there, but the DM session holds no record of what the scheduler sent. The system resolves this by reading the scheduler session file, checking freshness (2-hour window), extracting the last message content (truncated to 800 characters), and injecting it into the DM session's system prompt. This injection occurs only when the scheduler session was modified within the last 2 hours and the current session is not the scheduler session itself.
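A minimal sketch of the bridge, assuming the session file layout shown earlier in this chapter (`sessions/scheduler.json` holding a `messages` list). The function name, the header text, the use of file modification time for freshness, and the choice to skip messages without string content are assumptions.

```python
import json
import os
import tempfile
import time

FRESHNESS_SECONDS = 2 * 60 * 60  # 2-hour window
MAX_CHARS = 800


def recent_scheduler_context(sessions_dir, session_key, now=None):
    """Return a prompt snippet with the scheduler's latest message, or ""."""
    if session_key == "scheduler":
        return ""  # never inject the scheduler session into itself
    path = os.path.join(sessions_dir, "scheduler.json")
    if not os.path.exists(path):
        return ""
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) > FRESHNESS_SECONDS:
        return ""  # stale: the owner is unlikely to be replying to it
    with open(path, encoding="utf-8") as f:
        messages = json.load(f).get("messages", [])
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, str) and content.strip():
            return "[Recent scheduler message]\n" + content[:MAX_CHARS]
    return ""
```

The returned string is appended to the system prompt of the DM session, so a reply such as "fix the second issue" can be interpreted against the report the owner actually saw.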

56.3.6 Self-Repair Diagnostics

The self_check tool generates comprehensive system health reports, typically scheduled as a daily cron task. The diagnostic collects:

Diagnostic Area	Metrics Collected
Session activity	Active sessions today, total user/assistant/tool_call counts
System health	Disk usage, memory usage, process status
Error logs	Last 24 hours of errors from application log
Scheduled tasks	Active job count, next trigger times
Memory system	Total memories stored, storage size
Session health	Empty sessions, high tool_call ratios, potential stuck loops

The self-repair loop works through the LLM: the diagnostic report is sent as a message to the LLM in the scheduler session, which analyzes it and decides whether to notify the owner. This creates a feedback loop where the system monitors its own health and escalates issues autonomously — though the LLM's analysis quality depends on the model used and its ability to interpret system metrics.

56.3.7 Message Debouncing

On messaging platforms where users commonly split thoughts across multiple rapid-fire messages, the debounce system prevents wasteful multiple LLM calls. Each sender has a buffer and a 3-second timer. New messages reset the timer and append to the buffer. When the timer fires, all buffered messages are merged into a single LLM call, images are collected, and the response is split into chunks of ≤1,800 bytes sent with 0.5-second spacing.
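The buffer-and-timer behavior and the byte-limited chunking can be sketched as follows. The `Debouncer` class, its generation counter (used so a superseded timer cannot flush early), and `split_by_bytes` are illustrative assumptions rather than the code in xiaowang.py; image collection and the 0.5-second send spacing are omitted.

```python
import threading


class Debouncer:
    """Merge rapid messages per sender; flush after a quiet window."""

    def __init__(self, on_flush, window=3.0):
        self._on_flush = on_flush
        self._window = window
        self._lock = threading.Lock()
        self._buffers = {}
        self._timers = {}
        self._generation = {}

    def add(self, sender, text):
        with self._lock:
            self._buffers.setdefault(sender, []).append(text)
            gen = self._generation.get(sender, 0) + 1
            self._generation[sender] = gen
            old = self._timers.get(sender)
            if old is not None:
                old.cancel()  # each new message resets the timer
            timer = threading.Timer(self._window, self._flush, args=(sender, gen))
            timer.daemon = True
            self._timers[sender] = timer
            timer.start()

    def _flush(self, sender, gen):
        with self._lock:
            if self._generation.get(sender) != gen:
                return  # a newer message arrived; its timer will flush
            parts = self._buffers.pop(sender, [])
            self._timers.pop(sender, None)
        if parts:
            self._on_flush(sender, "\n".join(parts))


def split_by_bytes(text, limit=1800):
    """Split text into chunks of at most `limit` UTF-8 bytes, never mid-character."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > limit and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Measuring the limit in bytes rather than characters matters for Chinese text, where each character typically occupies three bytes in UTF-8.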

The cost savings are non-trivial. Without debouncing, a user sending 5 rapid messages would trigger 5 independent LLM calls, each including the full system prompt (~3,000 tokens) and tool definitions (~4,000 tokens). With debouncing, the same interaction requires a single call, reducing token consumption by approximately $4 \times (3{,}000 + 4{,}000) = 28{,}000$ tokens of overhead.

56.4 Tool Ecosystem

The system ships with 26 built-in tools organized into functional categories, supplemented at runtime by plugin tools and MCP-bridged tools. The table below groups them by category; note that its counts sum to 25, so one built-in tool is not itemized here:

Category	Count	Tools
Core	2	`exec` (shell, 60s default / 300s max timeout), `message`
Files	4	`read_file` (10K char cap), `write_file`, `edit_file`, `list_files`
Scheduling	3	`schedule`, `list_schedules`, `remove_schedule`
Media Send	4	`send_image`, `send_file`, `send_video`, `send_link`
Video	3	`trim_video`, `add_bgm`, `generate_video`
Search	1	`web_search` (multi-engine: Tavily, web, GitHub, HuggingFace)
Memory	2	`search_memory` (keyword grep), `recall` (vector semantic search)
Diagnostics	2	`self_check`, `diagnose`
Plugins	3	`create_tool`, `list_custom_tools`, `remove_tool`
MCP	1	`reload_mcp` (hot-reload MCP server configuration)

The web_search tool implements intelligent source routing based on query content. Queries containing "huggingface" or "hf model" are routed to the HuggingFace API (sorted by downloads); queries containing "github.com" are routed to the GitHub API (sorted by stars with code search fallback); queries containing "verify", "exist", "plugin", or "mcp" are sent to all engines simultaneously; all other queries use dual-engine search (Tavily + general web).
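The routing rules can be written as a short decision function. This sketch assumes case-insensitive substring matching and that the rules are checked in the order listed above; the function name and the engine labels are illustrative.

```python
ALL_ENGINES = ["tavily", "web", "github", "huggingface"]


def route_query(query):
    """Pick search engines for a query using keyword rules."""
    q = query.lower()
    if "huggingface" in q or "hf model" in q:
        return ["huggingface"]
    if "github.com" in q:
        return ["github"]
    if any(keyword in q for keyword in ("verify", "exist", "plugin", "mcp")):
        return list(ALL_ENGINES)  # existence checks fan out to every engine
    return ["tavily", "web"]      # default: dual-engine search
```

For example, "best hf model for OCR" goes to HuggingFace only, while "does an MCP server for Notion exist" fans out to all four engines.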

56.5 Key Results and Production Metrics

As a solo-developer production system rather than a research benchmark, 7/24 Office reports operational metrics rather than comparative algorithm performance. All figures below are from the repository README and code documentation as of April 2026.

Metric	Value	Source
Codebase size	~3,500 lines across 8 files	Repository
Built-in tools	26	Repository (`tools.py`)
Framework dependencies	0	Repository
Package dependencies	3	Repository
Development time	<3 months (solo + AI co-development)	README
GitHub stars	1,136 (April 2026)	GitHub
Max tool loop iterations	20 per conversation	Code (`llm.py`)
Session message limit	40 (overflow triggers compression)	Code (`llm.py`)
Memory deduplication threshold	0.92 cosine similarity	Code (`memory.py`)
Embedding dimensions	1,024	Code (`memory.py`)
Debounce window	3 seconds	Code (`xiaowang.py`)
Scheduler check interval	10 seconds	Code (`scheduler.py`)
Response chunk limit	1,800 bytes	Code (`xiaowang.py`)

No automated test suite exists. The author validates the system through production use and 24/7 monitoring — a significant departure from standard software engineering practice, but consistent with the rapid-prototyping philosophy. The daily self-check diagnostic serves as a partial substitute for integration tests.

56.6 Cost Analysis

56.6.1 Per-Conversation Token Budget

The token consumption per conversation can be modeled as:

$$C_{\text{tokens}} = C_{\text{sys}} + C_{\text{mem}} + C_{\text{tools}} + C_{\text{hist}} + \sum_{i=1}^{N} (C_{\text{resp},i} + C_{\text{toolres},i})$$

where $C_{\text{sys}}$ is the system prompt token count (SOUL + AGENT + USER + time, approximately 1,000–3,000 tokens), $C_{\text{mem}}$ is the injected memory context (200–1,000 tokens), $C_{\text{tools}}$ is the tool definitions sent to the LLM (approximately 3,000–4,000 tokens for 26 tools), $C_{\text{hist}}$ is the conversation history (500–5,000 tokens), and $N$ is the number of tool loop iterations (typically 1–5). Each iteration adds a response cost $C_{\text{resp},i}$ (200–2,000 tokens) and tool result tokens $C_{\text{toolres},i}$.
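The budget formula, and the debounce overhead figure quoted elsewhere in this chapter, can be checked with a few lines of arithmetic. The function names and the specific example values are assumptions picked from within the ranges stated above; they are not measurements.

```python
def conversation_tokens(sys_tokens, mem_tokens, tool_def_tokens, hist_tokens, iterations):
    """Evaluate C_tokens; `iterations` is a list of (response, tool_result) token pairs."""
    fixed = sys_tokens + mem_tokens + tool_def_tokens + hist_tokens
    variable = sum(resp + tool_res for resp, tool_res in iterations)
    return fixed + variable


def debounce_overhead_saved(n_messages, sys_tokens, tool_def_tokens):
    """Prompt overhead avoided by merging n messages into one LLM call."""
    return (n_messages - 1) * (sys_tokens + tool_def_tokens)


# Mid-range example: three loop iterations, the last without a tool result.
example_total = conversation_tokens(
    sys_tokens=2000,
    mem_tokens=500,
    tool_def_tokens=3500,
    hist_tokens=2000,
    iterations=[(500, 300), (500, 300), (400, 0)],
)

# Five rapid messages, 3,000-token system prompt, 4,000 tokens of tool definitions.
example_saved = debounce_overhead_saved(5, 3000, 4000)
```

Here `example_total` evaluates to 10,000 tokens and `example_saved` to 28,000 tokens. The formula counts the fixed context once; a fuller model would charge it again on every loop iteration, since each LLM call resends the prompt and history.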

Typical total per conversation: 7,000–25,000 tokens, corresponding to an estimated cost per conversation of roughly $0.02 on DeepSeek Chat or $0.13 on GPT-4o. The system implements several cost-reduction strategies:

Strategy	Mechanism	Estimated Savings
Cheaper compression model	Routes memory compression to `deepseek-chat`	Avoids per-compression costs on expensive models
Session message limit (40)	Prevents unbounded context growth	Caps history tokens at ~5,000
Image stripping	Replaces base64 with `[image]` markers	Saves thousands of tokens per image in history
Deduplication (0.92)	Prevents near-identical memory storage	Reduces retrieval noise
File read truncation	`read_file` caps at 10,000 characters	Prevents single tool result from dominating context
Debounce (3s window)	Merges rapid-fire messages	~28K tokens saved per 5-message burst

56.6.2 24/7 Running Cost Estimates

The following estimates are derived from the per-conversation token budget and are repository-reported (README):

Usage Pattern	Daily Conversations	Est. Daily Cost (DeepSeek)	Est. Daily Cost (GPT-4o)
Light (personal)	10–20	$0.20–$0.50	$1.30–$2.60
Moderate	50–100	$1.00–$2.50	$6.50–$13.00
Heavy (production)	200+	$4.00+	$26.00+

Background processing adds periodic costs: memory compression ($0.01–$0.05 per event), embedding generation ($0.001 per call), daily self-check reports ($0.05–$0.10), and scheduled task executions ($0.02–$0.10 each). For the edge-deployment target (Jetson Orin Nano), hardware cost is a one-time investment of approximately $200–$500, with ongoing costs dominated entirely by LLM API fees.

56.7 Multi-Tenant Deployment

The router.py component (17 KB) enables multi-user deployment through Docker-based isolation. Each user receives an automatically provisioned container on first message, with independent sessions, memory, workspace, plugins, and scheduled tasks. The router handles health checks, request forwarding, and container lifecycle management.

This architectural layer transforms 7/24 Office from a personal tool into a multi-tenant service, though the security model remains the same within each container. The OWNER_IDS whitelist controls which users can interact with the system, but there is no inter-container authentication or cross-tenant authorization framework. For trusted-team deployments this is adequate; for public-facing services it would require substantial hardening.

56.8 Personality and Behavioral Configuration

The system uses three optional Markdown files — SOUL.md, AGENT.md, and USER.md — that are read at every conversation turn and injected into the system prompt. This provides a code-free mechanism for behavioral customization:

File	Purpose	Content Type
`SOUL.md`	Agent personality and behavior rules	Character definition, communication style, ethical guidelines
`AGENT.md`	Operational procedures and troubleshooting	How-to guides, error handling procedures, workflow templates
`USER.md`	User preferences and context	Personal information, scheduling preferences, project context

Changes to these files take effect immediately on the next conversation turn — no restart required. This design allows the agent's behavior to evolve through a manual feedback loop: as the user discovers preferences or the agent encounters recurring issues, the relevant Markdown file is updated. Combined with the automatic memory system, this creates two parallel adaptation channels: one explicit (file edits), one implicit (memory compression and retrieval).
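Re-reading the files on every turn is what makes edits take effect without a restart. A minimal sketch follows, assuming the files sit directly in the workspace directory and are joined with blank lines; the function name, the `now` parameter, and the time format are illustrative choices.

```python
import os
import tempfile
from datetime import datetime

PROMPT_FILES = ("SOUL.md", "AGENT.md", "USER.md")


def build_system_prompt(workspace, now=None):
    """Assemble the system prompt from whichever personality files exist."""
    sections = []
    for name in PROMPT_FILES:
        path = os.path.join(workspace, name)
        if os.path.isfile(path):  # all three files are optional
            with open(path, encoding="utf-8") as f:
                text = f.read().strip()
            if text:
                sections.append(text)
    now = now or datetime.now()
    sections.append("Current time: " + now.strftime("%Y-%m-%d %H:%M"))
    return "\n\n".join(sections)
```

Because nothing is cached between calls, the cost is a few small file reads per turn, which is negligible next to the LLM request that follows.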

56.9 Comparative Analysis

7/24 Office occupies a distinctive position in the landscape of AI agent systems. The following comparison contextualizes its design choices against other systems surveyed in this book. All system descriptions are based on documentation available as of April 2026.

Dimension	7/24 Office	LangChain Agents	AutoGPT	CrewAI	Ouro Loop
Architecture	8 files, ~3.5K LOC	Framework (100K+ LOC)	Agent framework	Multi-agent framework	3 files (methodology)
Dependencies	3 packages	100+ packages	Many	LangChain +	0
Memory	3-layer (session/compressed/retrieval)	Pluggable adapters	File-based	Shared memory	Reflective log (30 entries)
Self-evolution	Runtime tool creation	No built-in	Task-based learning	No built-in	BOUND evolution
Deployment	Edge/cloud, Docker multi-tenant	Cloud	Cloud	Cloud	Any agent
Continuous operation	24/7 with self-repair	Per-request	Per-task	Per-task	Per-session
Test suite	None (production monitoring)	Extensive	Present	Present	None

The key differentiator is the combination of continuous autonomous operation and permanent capability expansion. Most agent frameworks operate in request-response mode: they activate when called and shut down afterward. 7/24 Office runs continuously, with scheduled tasks, self-diagnostics, and proactive notifications. Its runtime tool creation provides open-ended self-evolution, which distinguishes it from systems that retain information (memory) but not capabilities (tools).

56.10 Limitations and Discussion

56.10.1 Security Surface

The most significant limitation is the security model. The exec tool executes arbitrary shell commands with the agent process's permissions. The create_tool feature uses exec() to load arbitrary Python code at runtime. There is no sandboxing, capability restriction, static analysis, or code review on created tools. The only access control is the OWNER_IDS whitelist. For single-user or trusted-team deployments this is manageable; for any broader deployment scenario, it represents a critical security surface that requires hardening.

56.10.2 Platform Coupling

The messaging integration is tightly coupled to WeChat Work (Enterprise WeChat). The callback handler in xiaowang.py is specific to WeChat Work's message format (cmd codes, msgTypes, fileId/fileAeskey fields). The ASR pipeline uses the iFlytek-compatible WebSocket protocol, which may not be available outside the Chinese developer ecosystem. Adapting to Slack, Discord, or Telegram would require rewriting the callback handling and media download logic.

56.10.3 No Automated Testing

Unlike systems such as Ouro Loop (507 tests reported), 7/24 Office has no automated test suite. The author relies on production monitoring and the daily self-check diagnostic as a substitute. This is a significant risk factor for a continuously running system — regression detection depends entirely on observable failures in production rather than pre-deployment validation.

56.10.4 Tool Evolution Limitations

While runtime tool creation is the system's most innovative feature, it has several constraints:

No tool improvement — existing tools are not automatically refined, optimized, or updated based on usage patterns
No tool composition — new tools cannot automatically declare dependencies on or compose with existing tools
Quality depends on LLM — the correctness and robustness of created tools is bounded by the code generation capability of the LLM in use
No versioning — there is no history of tool modifications; tools can be overwritten destructively
No validation — created tools are loaded via exec() without any testing, type checking, or correctness verification

56.10.5 Memory System Constraints

The three-layer memory pipeline has a fixed architecture with no pluggable adapters or alternative backends. The cosine similarity threshold (0.92) is a single global constant with no per-topic or per-importance calibration. The compression quality depends on the LLM used — cheaper models may extract fewer or less accurate facts. There is no mechanism for memory decay, importance weighting, or active forgetting, which means the memory store grows unboundedly (though deduplication limits the rate).

56.10.6 Concurrency Model

The threading-based concurrency model, while simpler to reason about than asyncio, limits throughput under high concurrent load. Each LLM API call blocks a thread for the duration of the request (typically several seconds). For the target use case of a personal assistant with moderate traffic, this is adequate. For production multi-tenant deployment with hundreds of concurrent users, it would require either thread pool tuning or an architectural migration.

56.11 Reproducibility Assessment

Factor	Assessment
Code availability	Fully open-source, MIT license
Dependencies	3 packages, all pip-installable; no complex build system
Platform dependency	WeChat Work integration is China-specific; requires adaptation for Slack/Discord/Telegram
LLM dependency	Requires any OpenAI-compatible API; behavior varies by model
Hardware target	Designed for Jetson Orin Nano (8 GB RAM); runs on any Python 3 environment
Configuration	Multiple API keys required; `config.example.json` provides template
Documentation	README covers architecture and setup; no extensive documentation site
Test suite	None — validation is through production operation

The minimal dependency footprint (3 packages) makes installation straightforward, but the WeChat Work coupling means that reproducing the full system experience requires either WeChat Work credentials or a messaging adapter rewrite. The core agent logic (LLM loop, memory, tools, scheduling) is independent of the messaging platform and can be evaluated in isolation through the test session interface.

56.12 Research Significance

7/24 Office makes several contributions to the field of autonomous agent systems, though its significance is primarily architectural and philosophical rather than algorithmic:

Minimal viable agent architecture. The system suggests that much of the complexity typically associated with agent frameworks is not required for production operation. By implementing the entire agent in ~3,500 lines with 3 dependencies, it shows that this much infrastructure is sufficient for a fully featured AI agent with memory, tool use, scheduling, self-repair, and self-evolution. This is an existence proof, not a demonstrated minimum.

Runtime tool creation as open-ended evolution. While not the first system to allow agents to create tools, 7/24 Office's implementation — persistent plugin files loaded via exec() with the @tool decorator — is notable for its simplicity and its production validation. The mechanism provides genuine permanent capability expansion, distinct from memory-only adaptation.

Continuous autonomous operation. Most agent systems operate in request-response or task-completion mode. 7/24 Office's combination of cron scheduling, self-diagnostics, cross-session context bridging, and auto-notification demonstrates a qualitatively different operational model — the agent is not invoked but running, proactively monitoring and acting.

Edge deployment as a design constraint. Targeting the Jetson Orin Nano with a <2 GB RAM budget forces architectural discipline that benefits all deployment scenarios. The resulting system is not just small but comprehensible — a property with research value for studying agent behaviors, failure modes, and evolution dynamics.

Summary

Key takeaway: 7/24 Office demonstrates that a production-grade, self-evolving AI agent with three-layer memory, 26+ tools, MCP integration, cron scheduling, self-repair diagnostics, and 24/7 autonomous operation can be built in ~3,500 lines of pure Python with three dependencies and no agent framework.

Main contribution: The system establishes a minimal viable architecture for continuously operating AI agents, with runtime tool creation providing genuine open-ended capability expansion that persists across restarts. Its cross-session context bridge and scheduler-LLM integration demonstrate patterns for autonomous agent operation that go beyond the standard request-response paradigm.

What researchers should know: 7/24 Office is architecturally significant not for algorithmic novelty but for what it excludes. By building a complete agent system without LangChain, LlamaIndex, CrewAI, asyncio, type annotations, or automated tests — and running it in production 24/7 — it provides a concrete existence proof that questions the necessity of framework complexity in the agent ecosystem. Its runtime tool creation mechanism, while lacking sandboxing and validation, is one of the cleanest implementations of open-ended agent self-evolution in the surveyed literature.

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}