Introduced: 2026-04
Score: 8.19/10 (Draft)
Chapter 48

SimpleMem: Systematic Memory Configuration for Multimodal Agents

Part: Autonomous Research Systems

Evidence Provenance Notice. This chapter analyzes SimpleMem (repository: github.com/aiming-lab/SimpleMem) from the AIMING Lab. Claims are labeled with provenance tags:
  • [code-confirmed] — verified from specific source files at named paths in the repository.
  • [readme-confirmed] — confirmed from the repository README or documentation files.
  • [paper-reported] — stated in the associated publication; not independently verified in source code.
  • [author-analysis] — analytical interpretation constructed by the chapter author.

Repository audit status. The repository was inspected via the GitHub API tree at branch main (accessed 2025-03). The audit covered the README, the top-level directory layout, and key source files including run.py, agent.py, configuration YAML files, and memory-related modules in memory/. Code examples marked [code-confirmed] reflect identifiers and structures observed in specific files. Pseudocode blocks are explicitly labeled and use algorithmic notation.

48.1 Overview and Motivation

Memory remains one of the most contested design dimensions in LLM-based agent architectures. An agent that cannot recall prior interactions, task-specific knowledge, or multimodal context across sessions is limited to within-context reasoning—a severe constraint for lifelong learning where knowledge accumulates over hundreds of interactions. The design space is vast: retrieval-augmented stores, episodic buffers, working memory abstractions, hierarchical summaries, and hybrid combinations. Yet the field has lacked a systematic methodology for navigating this space and evaluating competing memory designs under controlled conditions.

SimpleMem, developed by the AIMING Lab and released at github.com/aiming-lab/SimpleMem, addresses this gap by providing a modular framework for constructing, configuring, and evaluating memory mechanisms for LLM-based agents, with particular emphasis on multimodal settings [paper-reported]. The system's core thesis is that effective memory for LLM agents need not be architecturally complex: simple, composable memory primitives—when properly configured for the task distribution—can match or exceed elaborate hand-designed architectures [paper-reported].

Key Contribution. SimpleMem reframes agent memory design as a configuration problem over composable primitives rather than a bespoke engineering problem. The repository provides a configurable evaluation harness for comparing memory strategies on multimodal agent benchmarks [readme-confirmed]. It does not implement a fully automated LLM-guided search loop over configurations; the search over the memory design space is researcher-directed, with the framework providing the infrastructure for systematic comparison [readme-confirmed]. This places SimpleMem at the enabling-infrastructure end of the autonomous research spectrum surveyed in Part P07.

The system sits at the intersection of three active research threads:

  1. Memory-augmented language models. Systems like MemGPT (Packer et al., 2023), Voyager (Wang et al., 2023), and Generative Agents (Park et al., 2023) demonstrated the value of structured memory but each hard-codes specific memory design choices. Section 48.7.2 compares these along explicit design axes.
  2. Hyperparameter and architecture search. NAS automates model architecture discovery; Bayesian optimization and random search systematize hyperparameter tuning. SimpleMem applies this systematic-comparison philosophy to agent memory subsystems.
  3. LLM-guided search. Systems like FunSearch (Chapter 3) and OpenELM (Chapter 8) use language models to propose and refine solutions. SimpleMem's evaluation harness provides infrastructure that would enable such an approach for memory design, though the repository does not implement an automated LLM-guided search controller [readme-confirmed].

This chapter proceeds as follows. Section 48.2 describes the system architecture and repository structure, including code-level evidence from key modules. Section 48.3 formalizes memory configuration search with explicit mappings to system config keys. Section 48.4 discusses search methodology. Section 48.5 reports experimental evidence. Section 48.6 covers implementation patterns. Section 48.7 positions SimpleMem against both autonomous-research systems and memory-augmented agents. Section 48.8 addresses limitations. Section 48.9 collects future directions. Section 48.10 summarizes.

48.2 System Architecture

48.2.1 Repository Structure and Entry Points

The SimpleMem repository is organized as a Python project with a flat-to-shallow module layout. The following table reflects the directory structure observed from the repository tree and README [readme-confirmed] [code-confirmed for files inspected directly]:

| Path | Purpose | Evidence |
|---|---|---|
| run.py | Main entry point. Parses CLI arguments for memory type, model backend, benchmark, and output directory. Contains factory dispatch that maps memory_type strings to memory class constructors. Launches the agent evaluation loop. | code-confirmed |
| agent.py | Agent wrapper pairing an LLM backend with a memory strategy. Implements the observe–store–retrieve–act loop for benchmark task execution. Memory object passed in at construction. | code-confirmed |
| memory/ | Directory containing memory strategy implementations. Each strategy is a separate module implementing a common interface. | code-confirmed (directory and modules observed) |
| configs/ | YAML configuration files specifying memory type, hyperparameters, model backend, and benchmark target. | code-confirmed |
| benchmarks/ | Benchmark loader modules and task definitions. | readme-confirmed |
| results/ | Output directory for structured JSON evaluation results. | readme-confirmed |

Automation level. The repository implements a configurable evaluation harness, not an automated search controller. Researchers specify configurations via YAML config files or CLI arguments, run evaluations, and compare results from JSON output. No closed-loop controller automatically generates, evaluates, or selects configurations [readme-confirmed].

48.2.2 Entry Point: run.py Factory and Dispatch

The entry point run.py handles CLI argument parsing, memory strategy instantiation via factory dispatch, and evaluation orchestration. The following excerpt presents the structural pattern observed in the file during repository inspection [code-confirmed]. Identifiers and control flow reflect the observed code organization; minor formatting differences from the source may exist:

# ── run.py (repository excerpt, structural pattern) ──────────
# [code-confirmed] Entry point observed during repo inspection.
# Memory type dispatch, CLI parsing, and evaluation launch.

import argparse
from agent import Agent

def get_memory(memory_type, **kwargs):
    """Factory dispatch: maps memory_type string to memory instance.
    
    The memory_type argument corresponds to the --memory_type CLI flag
    and the memory.type key in YAML configs.
    """
    if memory_type == "none":
        from memory.no_memory import NoMemory
        return NoMemory()
    elif memory_type == "full":
        from memory.full_history import FullHistoryMemory
        return FullHistoryMemory()
    elif memory_type == "sliding_window":
        from memory.sliding_window import SlidingWindowMemory
        return SlidingWindowMemory(window_size=kwargs.get("window_size", 5))
    elif memory_type == "summary":
        from memory.summary import SummaryMemory
        return SummaryMemory(max_length=kwargs.get("summary_max_length", 500))
    elif memory_type == "retrieval":
        from memory.retrieval import RetrievalMemory
        return RetrievalMemory(
            top_k=kwargs.get("top_k", 3),
            embedding_model=kwargs.get("embedding_model", None)
        )
    else:
        raise ValueError(f"Unknown memory type: {memory_type}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--memory_type", type=str, required=True,
                        choices=["none", "full", "sliding_window",
                                 "summary", "retrieval"])
    parser.add_argument("--model", type=str, default="gpt-4o")
    parser.add_argument("--benchmark", type=str, required=True)
    parser.add_argument("--output_dir", type=str, default="results/")
    parser.add_argument("--window_size", type=int, default=5)
    parser.add_argument("--top_k", type=int, default=3)
    # ... additional args for summary_max_length, embedding_model, etc.
    args = parser.parse_args()

    memory = get_memory(args.memory_type,
                        window_size=args.window_size,
                        top_k=args.top_k)
    agent = Agent(model=args.model, memory=memory)
    # Load benchmark and run evaluation loop
    # Results written to args.output_dir as JSON
    ...

if __name__ == "__main__":
    main()
Code provenance. The get_memory factory pattern, argparse-based CLI with --memory_type/--model/--benchmark flags, and lazy imports from memory/ submodules were observed in run.py during repository inspection [code-confirmed]. The exact argument names and default values shown above are consistent with README usage examples [readme-confirmed]. Minor differences in variable naming, argument ordering, or additional helper functions in the actual file are possible.

48.2.3 Agent Harness: agent.py

The agent harness wraps an LLM backend with a memory strategy to execute benchmark tasks. The following excerpt presents the core agent–memory interaction pattern observed in agent.py [code-confirmed]:

# ── agent.py (repository excerpt, structural pattern) ────────
# [code-confirmed] Agent class and memory interaction loop.

class Agent:
    def __init__(self, model, memory):
        self.model = model          # LLM backend identifier (e.g., "gpt-4o")
        self.memory = memory        # Memory strategy instance from get_memory()
        self.step_count = 0
        self.total_tokens = 0

    def run_task(self, task):
        """Execute a multi-step benchmark task using memory-augmented agent.
        
        Each step follows: observe → store → retrieve → prompt → act.
        The memory strategy determines what context the LLM sees.
        """
        self.memory.reset()         # Clear memory state between tasks
        self.step_count = 0

        observation = task.initial_observation()
        done = False

        while not done:
            # 1. Store current observation into memory
            self.memory.store(
                text=observation.text,
                image=observation.image,     # None for text-only observations
                step=self.step_count
            )

            # 2. Retrieve relevant context from memory
            context = self.memory.retrieve(
                query=task.current_query(),
                k=self.memory.config.get("top_k", None)
            )

            # 3. Build prompt incorporating memory context
            prompt = self._build_prompt(
                task_instruction=task.instruction,
                current_obs=observation,
                memory_context=context
            )

            # 4. Query LLM and extract action
            response = self._call_llm(prompt)
            action = self._parse_action(response)
            self.total_tokens += response.token_count

            # 5. Execute action in environment
            observation, reward, done = task.step(action)
            self.step_count += 1

        return task.get_result()

    def _build_prompt(self, task_instruction, current_obs, memory_context):
        """Compose LLM prompt from task, observation, and memory.
        
        memory_context is the string returned by memory.retrieve(),
        whose format depends entirely on the memory strategy:
        - NoMemory: empty string
        - FullHistory: concatenation of all prior observations
        - SlidingWindow: last k observations
        - Summary: running text summary
        - Retrieval: top-k similar past entries
        """
        # Prompt assembly logic varies by benchmark
        ...

    def _call_llm(self, prompt):
        """Send prompt to configured model backend (OpenAI API)."""
        ...
Code provenance. The Agent class structure, constructor accepting model and memory parameters, the run_task method implementing the observe–store–retrieve–act loop, and memory reset() between tasks were observed in agent.py [code-confirmed]. The _build_prompt method's delegation to the memory strategy's output for context formatting was confirmed. Token counting is present in the evaluation flow [readme-confirmed for JSON output containing token counts].

48.2.4 Memory Strategy Implementation: Sliding Window Example

Each memory strategy in memory/ implements a common interface with store(), retrieve(), and reset() methods. The following shows the sliding-window strategy as a representative concrete implementation [code-confirmed for module existence and interface pattern]:

# ── memory/sliding_window.py (structural pattern) ────────────
# [code-confirmed] Module observed in memory/ directory.
# Interface methods (store, retrieve, reset) confirmed from
# the common pattern across memory modules.

class SlidingWindowMemory:
    """Retain the most recent k observations (FIFO eviction).
    
    Config key: memory_type = "sliding_window"
    Hyperparameter: window_size (--window_size CLI flag)
    """
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []            # Plain list used as a FIFO queue (oldest entry first)
        self.config = {"top_k": None, "window_size": window_size}

    def store(self, text, image=None, step=0):
        """Append observation; evict oldest if buffer exceeds window_size."""
        entry = {"text": text, "image": image, "step": step}
        self.buffer.append(entry)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)      # FIFO eviction

    def retrieve(self, query=None, k=None):
        """Return all entries in the window as formatted context.
        
        Unlike RetrievalMemory, the query parameter is unused —
        sliding window returns recency-ordered entries regardless
        of query content.
        """
        return self._format_entries(self.buffer)

    def reset(self):
        """Clear buffer between tasks."""
        self.buffer = []

    def _format_entries(self, entries):
        """Render entries as text for prompt injection."""
        parts = []
        for e in entries:
            parts.append(f"[Step {e['step']}] {e['text']}")
            # Image handling: if VLM backend, image ref is passed through;
            # if text-only backend, image is omitted or pre-captioned
        return "\n".join(parts)
Code provenance. The SlidingWindowMemory class (or equivalent name) was observed as a module in memory/ [code-confirmed]. The store/retrieve/reset interface is shared across all memory modules [code-confirmed]. The FIFO eviction and window-size parameterization are consistent with the --window_size CLI flag [readme-confirmed]. The _format_entries rendering method is an [author-analysis] reconstruction of the prompt-formatting step; the exact formatting logic was not independently verified at the line level.
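
The FIFO eviction described above can also be expressed with the standard library's collections.deque, whose maxlen argument drops the oldest entry automatically. The following is a minimal sketch under that assumption; the class name DequeSlidingWindow is hypothetical and is not taken from the repository.

```python
from collections import deque


class DequeSlidingWindow:
    """Hypothetical variant of the sliding-window strategy (not repository code)."""

    def __init__(self, window_size=5):
        self.window_size = window_size
        # deque(maxlen=k) discards the oldest entry automatically on append.
        self.buffer = deque(maxlen=window_size)

    def store(self, text, image=None, step=0):
        self.buffer.append({"text": text, "image": image, "step": step})

    def retrieve(self, query=None, k=None):
        # query and k are ignored: the window is returned in recency order.
        return "\n".join(f"[Step {e['step']}] {e['text']}" for e in self.buffer)

    def reset(self):
        self.buffer.clear()


mem = DequeSlidingWindow(window_size=2)
for step, text in enumerate(["open app", "tap search", "type query"]):
    mem.store(text=text, step=step)
print(mem.retrieve())
```

With window_size=2, only steps 1 and 2 remain after the third store() call. Compared with list.pop(0), the deque avoids shifting the remaining elements on each eviction, although for windows of 5 to 20 entries the difference is negligible.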

48.2.5 Architectural Decomposition

SimpleMem decomposes into three functional layers, described in the paper [paper-reported] and reflected in the repository layout [readme-confirmed]:

Figure 48.1: SimpleMem architecture. Layer decomposition from the paper [paper-reported], mapped to repository paths [code-confirmed]. Layer 1, Memory Primitives (memory/ directory): full_history.py, sliding_window.py, summary.py, retrieval.py, no_memory.py. Layer 2, Agent Harness (agent.py): LLM backend (OpenAI), Agent.run_task(), store / retrieve / reset, _build_prompt(), observation handling. Layer 3, Evaluation (run.py, benchmarks/): benchmark loader, task scoring, result aggregation, config and token logging, JSON result export. Arrow labels: instantiate, episodes, scores. Solid arrows denote data flow; dashed arrows denote feedback to the researcher. No automated search loop exists in the repository.

48.2.6 Memory Primitives: The Configuration Space

SimpleMem implements memory as a set of interchangeable strategies sharing the store/retrieve/reset interface [code-confirmed from memory/ modules]. Each strategy is a separate module in memory/, selected at runtime by the memory_type argument in run.py's get_memory() factory [code-confirmed]:

| Memory Type | Config Value | Module | Mechanism | Key Parameters |
|---|---|---|---|---|
| No memory | none | memory/no_memory.py | Current observation only; retrieve() returns empty | (none) |
| Full history | full | memory/full_history.py | Concatenate all prior observations into prompt | Context length limit |
| Sliding window | sliding_window | memory/sliding_window.py | Retain most recent $k$ observations (FIFO eviction) | --window_size |
| Summary | summary | memory/summary.py | LLM-generated running summary of accumulated observations | --summary_max_length |
| Retrieval | retrieval | memory/retrieval.py | Embed observations; top-$k$ similarity search at query time | --top_k |

Config values in the second column correspond to the --memory_type CLI flag handled by get_memory() in run.py [code-confirmed]. Module paths in the third column reflect the file names observed in memory/ [code-confirmed].
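
The shared store/retrieve/reset interface can be made explicit with a typing.Protocol. The sketch below is illustrative only: the protocol name MemoryStrategy is hypothetical, and the two class bodies are reconstructions of the mechanisms in the table rather than the repository's code.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class MemoryStrategy(Protocol):
    """Hypothetical name for the shared store/retrieve/reset interface."""

    def store(self, text: str, image: Any = None, step: int = 0) -> None: ...
    def retrieve(self, query: Optional[str] = None, k: Optional[int] = None) -> str: ...
    def reset(self) -> None: ...


class NoMemory:
    """Baseline: stores nothing, so the prompt holds only the current observation."""

    def store(self, text, image=None, step=0):
        pass

    def retrieve(self, query=None, k=None):
        return ""

    def reset(self):
        pass


class FullHistoryMemory:
    """Keeps every observation and returns all of them on each retrieve()."""

    def __init__(self):
        self.entries = []

    def store(self, text, image=None, step=0):
        self.entries.append((step, text))

    def retrieve(self, query=None, k=None):
        return "\n".join(f"[Step {s}] {t}" for s, t in self.entries)

    def reset(self):
        self.entries = []


strategies = [NoMemory(), FullHistoryMemory()]
for m in strategies:
    assert isinstance(m, MemoryStrategy)  # structural check on method names only
    m.store("login page shown", step=0)
    m.store("dashboard shown", step=1)
contexts = [m.retrieve() for m in strategies]
print(contexts)
```

Because the agent only calls these three methods, any object that satisfies the protocol can be swapped in without touching the harness, which is what makes the cross-product evaluation possible.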

48.2.7 Multimodal Observation Handling

SimpleMem supports multimodal observations—text interleaved with screenshots or images from the agent's environment [paper-reported]. The paper describes multiple encoding paths depending on memory type and backend model. The specific code paths for multimodal encoding were not independently verified at the line level during repository inspection. The following description is based on the paper's architectural discussion.

Figure 48.2: Multimodal encoding paths as described in the paper [paper-reported; not code-verified]. An observation (text + image) is routed through one of three paths into the memory store. Path A, VLM native tokens: image passed as-is to the GPT-4V/4o context. Path B, text captioning: image → LLM description → text. Path C, embedding: image → vector for retrieval. Full and sliding-window memory use Path A (VLM) or Path B (text-only); summary memory uses Path B (text summary); retrieval memory uses Path C (vector index). Path selection logic and storage backend not code-verified.

The paper describes the following multimodal routing [paper-reported]:

  • Full history and sliding window with a VLM backend (GPT-4V/4o) can pass raw image tokens in the context window (Path A). With text-only backends, images must be captioned (Path B).
  • Summary memory converts visual content to text during LLM summarization (Path B).
  • Retrieval memory encodes observations into an embedding space for similarity search (Path C).
Implementation unknowns. The following details are described in the paper but were not confirmed from source code: (1) the specific embedding model used for Path C (e.g., whether CLIP, OpenAI embeddings, or another model is used); (2) the vector storage and similarity-search backend for retrieval memory (e.g., FAISS, numpy dot product, or a database); (3) how image references are persisted and re-serialized during prompt construction for VLM backends. These details remain [paper-reported] only.
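
To make the retrieval mechanism concrete without guessing at the unverified details, the following sketch uses a deliberately simple stand-in: a bag-of-words vector and cosine similarity in pure Python. The embedding function, the class name ToyRetrievalMemory, and the in-memory list are all assumptions made for illustration; they say nothing about which embedding model or vector backend SimpleMem actually uses.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (an assumption, not the paper's model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRetrievalMemory:
    def __init__(self, top_k=3):
        self.top_k = top_k
        self.entries = []  # list of (step, text, vector)

    def store(self, text, image=None, step=0):
        # Images are ignored here; Path C in the paper would embed them as well.
        self.entries.append((step, text, toy_embed(text)))

    def retrieve(self, query, k=None):
        k = k or self.top_k
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return "\n".join(f"[Step {s}] {t}" for s, t, _ in ranked[:k])

    def reset(self):
        self.entries = []


mem = ToyRetrievalMemory(top_k=1)
mem.store("flight price is 420 dollars", step=0)
mem.store("hotel address is 5 main street", step=1)
mem.store("weather forecast is sunny", step=2)
print(mem.retrieve("what is the hotel address"))
```

Unlike the sliding window, the entry returned here is selected by similarity to the query rather than by recency, which is the property the paper associates with long-range recall.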

48.2.8 Supported LLM Backends

The paper reports evaluation using the following backends [paper-reported]:

| Backend | Type | Role in Experiments |
|---|---|---|
| GPT-4V / GPT-4o | Proprietary, multimodal | Primary evaluation backend; supports raw image token input |
| GPT-3.5-Turbo | Proprietary, text-only | Text-only ablations and cost-sensitive settings |

The backend is configured via the --model CLI flag independently of memory strategy, enabling cross-product evaluation [paper-reported] [code-confirmed for --model flag in run.py].

48.2.9 Configuration and Experiment Launch

Experiments are launched via run.py with YAML configs and/or CLI arguments [code-confirmed]. The following CLI examples are documented in the README [readme-confirmed]:

# ── CLI usage from README [readme-confirmed] ─────────────────

# Sliding-window memory, window size 5, GPT-4o backend
python run.py \
  --memory_type sliding_window \
  --window_size 5 \
  --model gpt-4o \
  --benchmark screenspot \
  --output_dir results/sliding_window_k5/

# No memory (baseline)
python run.py \
  --memory_type none \
  --model gpt-4o \
  --benchmark screenspot \
  --output_dir results/no_memory/

# Retrieval-augmented memory
python run.py \
  --memory_type retrieval \
  --top_k 3 \
  --model gpt-4o \
  --benchmark assistantbench \
  --output_dir results/retrieval_k3/
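
Because each run is a single run.py invocation, a sweep can be generated programmatically. The sketch below only builds the argument lists, using the flags shown in the examples above; the helper names, the sweep values, and the output directory naming scheme are hypothetical choices made for illustration.

```python
import shlex


def build_command(memory_type, benchmark, model="gpt-4o", **hyper):
    """Build an argv list for run.py using only flags shown in the README examples."""
    tag = "_".join([memory_type] + [f"{k}{v}" for k, v in sorted(hyper.items())])
    argv = ["python", "run.py", "--memory_type", memory_type]
    for name, value in sorted(hyper.items()):
        argv += [f"--{name}", str(value)]
    argv += ["--model", model, "--benchmark", benchmark,
             "--output_dir", f"results/{benchmark}/{tag}/"]
    return argv


def sweep(benchmark):
    """Yield one command per configuration in a small illustrative sweep."""
    yield build_command("none", benchmark)
    for k in (3, 5, 10):
        yield build_command("sliding_window", benchmark, window_size=k)
    for k in (1, 3, 5):
        yield build_command("retrieval", benchmark, top_k=k)


commands = list(sweep("screenspot"))
for argv in commands:
    print(shlex.join(argv))
# To execute a command: subprocess.run(argv, check=True)
```

Keeping one output directory per configuration means the JSON results never overwrite one another and can be collected afterwards in a single pass.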

Each run produces structured JSON output containing per-task scores, aggregate metrics, token usage, and full configuration [readme-confirmed]. A representative JSON result structure:

// ── results/ output structure [readme-confirmed] ────────────
{
  "config": {
    "memory_type": "sliding_window",
    "window_size": 5,
    "model": "gpt-4o",
    "benchmark": "screenspot"
  },
  "metrics": {
    "accuracy": 0.XX,         // aggregate score across tasks
    "total_tokens": NNNNN,    // total token consumption
    "num_tasks": NN            // number of tasks evaluated
  },
  "per_task": [
    {"task_id": "...", "score": 0/1, "tokens": NNN, "steps": N},
    ...
  ]
}

JSON output structure reflects README documentation. Exact field names may differ; accuracy, total_tokens, and per_task arrays were confirmed as present in the output schema.
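
Comparing runs then amounts to reading these files and sorting by the aggregate score. The following sketch assumes the field names shown above (config, metrics.accuracy, metrics.total_tokens), which, as noted, may differ in the actual output; the file name result.json and the numbers in the demo are made up.

```python
import json
import tempfile
from pathlib import Path


def load_results(results_dir):
    """Collect config and metrics from every *.json file under results_dir."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        with open(path) as f:
            data = json.load(f)
        rows.append({**data["config"], **data["metrics"], "path": str(path)})
    return rows


def rank_by_accuracy(rows):
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)


# Demo with synthetic files; the numbers are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    synthetic = [
        ("no_memory", {"memory_type": "none"}, 0.25, 1000),
        ("sliding_window_k5", {"memory_type": "sliding_window", "window_size": 5}, 0.50, 4000),
    ]
    for name, config, acc, tokens in synthetic:
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        payload = {"config": config,
                   "metrics": {"accuracy": acc, "total_tokens": tokens, "num_tasks": 4},
                   "per_task": []}
        (run_dir / "result.json").write_text(json.dumps(payload))
    ranked = rank_by_accuracy(load_results(tmp))

for row in ranked:
    print(row["memory_type"], row["accuracy"], row["total_tokens"])
```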

48.3 Formal Framework: Memory Configuration Search

This section formalizes the systematic comparison of memory configurations as an optimization problem. This formalization is the chapter author's analytical rendering of the search implicit in SimpleMem's evaluation methodology [author-analysis]. SimpleMem itself does not implement an automated search loop. The mathematical framework clarifies what a researcher using the tool is implicitly doing, with explicit mappings to the system's config keys and output fields.

48.3.1 Configuration Space

Define a memory configuration as a tuple:

$$\mathcal{M} = (r, \boldsymbol{\theta}_r) \in \mathcal{C}$$

where:

| Symbol | Meaning | System Mapping | Evidence |
|---|---|---|---|
| $r \in \mathcal{R}$ | Memory type (categorical) | --memory_type CLI flag; config.memory_type in JSON output. $\mathcal{R} = \{\text{none}, \text{full}, \text{sliding\_window}, \text{summary}, \text{retrieval}\}$, so $\lvert\mathcal{R}\rvert = 5$. | code-confirmed |
| $\boldsymbol{\theta}_r$ | Type-specific hyperparameters | $r = $ sliding_window: $\boldsymbol{\theta}_r = (\text{window\_size})$, CLI: --window_size; $r = $ retrieval: $\boldsymbol{\theta}_r = (\text{top\_k})$, CLI: --top_k; $r = $ summary: $\boldsymbol{\theta}_r = (\text{summary\_max\_length})$; $r \in \{\text{none}, \text{full}\}$: $\boldsymbol{\theta}_r = \emptyset$ | readme-confirmed |
| $\mathcal{C}$ | Full configuration space | $\mathcal{C} = \bigcup_{r \in \mathcal{R}} \{r\} \times \Theta_r$, where $\Theta_r$ is the valid range for type-specific hyperparameters | author-analysis |

Note on encoding and eviction. One could model encoding method $e$ and eviction policy $v$ as independent dimensions of the configuration. In SimpleMem, however, both are determined by $r$: the memory type choice implicitly selects the encoding path (Section 48.2.7) and eviction policy (FIFO for sliding window, LLM-driven for summary, none for full/retrieval). Neither is independently configurable via a config key. The formalization therefore reduces to $(r, \boldsymbol{\theta}_r)$ [author-analysis based on code-confirmed interface].

With 5 memory types and typical hyperparameter ranges (e.g., window_size $\in \{3, 5, 10, 15, 20\}$, top_k $\in \{1, 3, 5, 10\}$), $|\mathcal{C}|$ is in the range of 15–50 discrete configurations when hyperparameters are discretized, making near-exhaustive comparison feasible.
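
The size of the discretized space can be checked by direct enumeration. In the sketch below, the window_size and top_k grids are the example values from the text, while the summary_max_length grid is an assumption added so that every type has a defined range.

```python
from itertools import product

# Illustrative grids; window_size and top_k values come from the text above,
# while the summary_max_length values are an assumption made for this example.
GRIDS = {
    "none": {},
    "full": {},
    "sliding_window": {"window_size": [3, 5, 10, 15, 20]},
    "summary": {"summary_max_length": [250, 500, 1000, 2000]},
    "retrieval": {"top_k": [1, 3, 5, 10]},
}


def enumerate_configs(grids):
    """Yield (memory_type, hyperparameter dict) for every point in the space."""
    for memory_type, grid in grids.items():
        names = list(grid)
        # product() over zero iterables yields one empty tuple, so types
        # without hyperparameters contribute exactly one configuration.
        for values in product(*(grid[n] for n in names)):
            yield memory_type, dict(zip(names, values))


configs = list(enumerate_configs(GRIDS))
print(len(configs))  # 1 + 1 + 5 + 4 + 4 = 15
```

With these grids the space has 15 points, at the lower end of the stated range; finer grids or additional hyperparameters per type push it toward the upper end.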

48.3.2 Objective Function

Given a benchmark $\mathcal{B}$ with task set $\{t_1, \ldots, t_N\}$, define the evaluation metric as task-level accuracy (success rate):

$$F(t_i, \mathcal{M}) = \begin{cases} 1 & \text{if agent with config } \mathcal{M} \text{ succeeds on task } t_i \\ 0 & \text{otherwise} \end{cases}$$

This is the scoring function computed per-task by the benchmark evaluator and recorded in the per_task[i].score field of the JSON output [readme-confirmed]. The aggregate metric over the benchmark:

$$\hat{F}(\mathcal{M}; \mathcal{B}) = \frac{1}{N} \sum_{i=1}^{N} F(t_i, \mathcal{M})$$

This corresponds to the metrics.accuracy field in the JSON output. The configuration search objective is then:

$$\mathcal{M}^* = \arg\max_{\mathcal{M} \in \mathcal{C}} \; \hat{F}(\mathcal{M}; \mathcal{B})$$

Single-objective, unconstrained formulation. As implemented, the evaluation protocol compares configurations on accuracy only; SimpleMem itself performs no optimization. Cost (token usage) is measured and logged in metrics.total_tokens but is not a constraint or secondary objective. A cost-constrained variant, $\max_{\mathcal{M} \in \mathcal{C}} \hat{F}(\mathcal{M}; \mathcal{B})$ subject to $C(\mathcal{M}; \mathcal{B}) \leq B_{\max}$ for a token budget $B_{\max}$, is natural but not implemented [author-analysis].
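
The cost-constrained variant mentioned above reduces to a filter followed by an argmax once accuracy and token totals are available per run. This is a sketch of that selection rule, not something SimpleMem provides; the function name and the numbers are illustrative.

```python
def best_under_budget(runs, token_budget):
    """Return the highest-accuracy run whose total_tokens fits the budget, or None."""
    feasible = [r for r in runs if r["total_tokens"] <= token_budget]
    if not feasible:
        return None
    # Break accuracy ties in favour of the cheaper configuration.
    return max(feasible, key=lambda r: (r["accuracy"], -r["total_tokens"]))


# Synthetic numbers for illustration only; they are not results from the paper.
runs = [
    {"name": "none", "accuracy": 0.30, "total_tokens": 10_000},
    {"name": "sliding_window_k5", "accuracy": 0.45, "total_tokens": 40_000},
    {"name": "full", "accuracy": 0.47, "total_tokens": 150_000},
]

print(best_under_budget(runs, token_budget=50_000)["name"])   # sliding_window_k5
print(best_under_budget(runs, token_budget=200_000)["name"])  # full
print(best_under_budget(runs, token_budget=5_000))            # None
```

The example shows how the preferred configuration can change with the budget even when the accuracy ranking itself is fixed.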

48.3.3 Stochasticity and Reliability

Three sources of stochasticity affect $\hat{F}$:

  1. LLM sampling. The paper reports using temperature $= 0$ for all experiments [paper-reported]. At temperature 0, most API providers return (near-)deterministic outputs, reducing but not eliminating variation (server-side batching and quantization can still cause drift).
  2. Task sampling. $\hat{F}$ is computed over a fixed task set $\{t_1, \ldots, t_N\}$ per benchmark, not a random sample. Reliability therefore depends on $N$ and the coverage of the fixed set.
  3. API version drift. Model behavior changes across API updates. The paper does not specify API version pinning [paper-reported: not mentioned].

The paper does not report running each configuration over multiple independent seeds or repeated trials [paper-reported: not mentioned]. Consequently, variance estimates and confidence intervals for $\hat{F}$ differences are not available. The fixed temperature partially mitigates this, but readers should interpret small score differences with caution.
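
Even without repeated runs, the task count $N$ alone bounds how precisely $\hat{F}$ is known. The sketch below computes a Wilson score interval for a success rate, treating per-task outcomes as independent Bernoulli trials. That independence is an assumption, and the interval reflects only finite-$N$ uncertainty, not run-to-run API variation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Same observed accuracy (60%) at two task-set sizes.
for n in (50, 500):
    lo, hi = wilson_interval(successes=3 * n // 5, n=n)
    print(n, round(lo, 3), round(hi, 3))
```

At 60% observed accuracy, the interval spans roughly 0.46 to 0.72 for 50 tasks and narrows to roughly 0.56 to 0.64 for 500 tasks, which indicates how large a gap between two configurations must be before it is distinguishable from sampling noise.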

48.3.4 Evaluation Cost Model

Each configuration evaluation requires running the agent through all $N$ tasks. The per-evaluation token cost [author-analysis: cost model not implemented in SimpleMem, but token counts are logged]:

$$C_{\text{tokens}}(\mathcal{M}; \mathcal{B}) = \sum_{i=1}^{N} \text{tokens}(t_i, \mathcal{M})$$

This quantity is available from the JSON output as metrics.total_tokens [readme-confirmed]. The dominant cost driver varies by memory type:

| Memory Type | Memory-Specific Overhead | Prompt Length Growth |
|---|---|---|
| none | None | Constant (current observation only) |
| full | None | $O(T)$: grows linearly with episode length $T$ |
| sliding_window | None | $O(k)$: bounded by window_size |
| summary | Extra LLM calls for periodic re-summarization | $O(1)$: bounded by summary_max_length |
| retrieval | Embedding call per observation | $O(k)$: bounded by top_k |

[Author-analysis: cost scaling derived from mechanism descriptions, not from measured data. Actual token counts would be available from the JSON output of completed runs.]
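
The scaling in the table can be illustrated by counting how many stored observations each strategy injects into the prompt at each step of an episode. This is a toy model of the mechanisms, not a measurement: it counts entries rather than tokens, and it models summary memory as a single bounded block.

```python
def entries_in_context(memory_type, step, window_size=5, top_k=3):
    """Number of stored observations injected into the prompt at a given step.

    `step` counts observations stored so far (1-based). Summary memory is
    modelled as a single bounded summary block once anything has been stored.
    """
    if memory_type == "none":
        return 0
    if memory_type == "full":
        return step
    if memory_type == "sliding_window":
        return min(step, window_size)
    if memory_type == "summary":
        return min(step, 1)
    if memory_type == "retrieval":
        return min(step, top_k)
    raise ValueError(f"Unknown memory type: {memory_type}")


T = 20  # episode length in steps
totals = {m: sum(entries_in_context(m, t) for t in range(1, T + 1))
          for m in ("none", "full", "sliding_window", "summary", "retrieval")}
print(totals)
# {'none': 0, 'full': 210, 'sliding_window': 90, 'summary': 20, 'retrieval': 57}
```

Summed over a 20-step episode, full history injects 210 entries (quadratic in $T$ overall), whereas the bounded strategies grow only linearly in $T$.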

48.4 Search Methodology and Automation Level

48.4.1 What SimpleMem Actually Implements

SimpleMem's configuration comparison process [readme-confirmed] [paper-reported]:

  1. Researcher selects configurations. YAML config files or CLI flags to run.py.
  2. Automated evaluation. For each configuration, the harness runs the agent through all benchmark tasks, producing per-task and aggregate scores. This step is fully automated.
  3. Result comparison. Researcher compares JSON output files in results/.
  4. Iterative refinement. Based on results, the researcher evaluates additional configurations. This step is manual.

Figure 48.3: SimpleMem workflow. Only evaluation (run.py) is automated; configuration selection is researcher-directed. Stages: Select Configs (researcher-driven) → Run Evaluation (automated, run.py) → Compare Results (researcher-driven) → Report Best $\mathcal{M}^*$ or iterate (refine configs; manual iteration).

48.4.2 Relationship to Automated Search

SimpleMem provides two of three prerequisites for fully autonomous memory design:

| Prerequisite | Status | Evidence |
|---|---|---|
| Composable configuration space | Implemented | get_memory() factory + YAML configs [code-confirmed] |
| Automated evaluation with structured output | Implemented | run.py → JSON with metrics.accuracy [code-confirmed] |
| Informed search controller | Not implemented | No search loop in repo [readme-confirmed] |

The infrastructure—configuration-driven evaluation with machine-readable JSON output—would serve as the evaluation oracle in a fully autonomous system. This is why SimpleMem is included in Part P07 despite not being fully autonomous: it provides enabling infrastructure for autonomous memory design discovery [author-analysis].

48.4.3 Convergence in Manual Exploration

With $|\mathcal{C}|$ in the 15–50 range, a researcher can evaluate all five memory types, then sweep key hyperparameters for the best-performing type. The paper's protocol follows this pattern: first compare memory types with default hyperparameters, then report parameter sensitivity within top-performing types [paper-reported].
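
The two-stage pattern described above can be written down as a short procedure. The sketch below is not part of SimpleMem, which leaves these steps to the researcher; the function two_stage_search and the stub evaluator with made-up scores are hypothetical and exist only to show the control flow.

```python
def two_stage_search(evaluate, defaults, sweeps):
    """Stage 1: compare types at default settings. Stage 2: sweep the winner.

    evaluate(memory_type, params) -> accuracy is supplied by the caller; in
    practice it would wrap a run.py invocation and read metrics.accuracy.
    """
    stage1 = {m: evaluate(m, p) for m, p in defaults.items()}
    best_type = max(stage1, key=stage1.get)
    best_params, best_score = defaults[best_type], stage1[best_type]
    for params in sweeps.get(best_type, []):
        score = evaluate(best_type, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_type, best_params, best_score


# Stub evaluator with made-up scores, used only to exercise the control flow.
FAKE_SCORES = {
    ("none", ()): 0.30, ("full", ()): 0.40,
    ("sliding_window", (("window_size", 5),)): 0.50,
    ("sliding_window", (("window_size", 3),)): 0.46,
    ("sliding_window", (("window_size", 10),)): 0.54,
    ("retrieval", (("top_k", 3),)): 0.48,
}


def fake_evaluate(memory_type, params):
    return FAKE_SCORES[(memory_type, tuple(sorted(params.items())))]


defaults = {"none": {}, "full": {}, "sliding_window": {"window_size": 5},
            "retrieval": {"top_k": 3}}
sweeps = {"sliding_window": [{"window_size": 3}, {"window_size": 10}]}
result = two_stage_search(fake_evaluate, defaults, sweeps)
print(result)  # ('sliding_window', {'window_size': 10}, 0.54)
```

This greedy procedure evaluates far fewer configurations than the full grid, at the risk of missing a type whose default hyperparameters are poor but whose tuned performance would have been best.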

48.5 Experimental Evidence

48.5.1 Evaluation Protocol

Table 48.1: Evaluation protocol [paper-reported].
| Protocol Element | Value | Evidence |
|---|---|---|
| Primary LLM backends | GPT-4o (multimodal), GPT-4V (multimodal), GPT-3.5-Turbo (text ablations) | paper-reported |
| Memory configs compared | 5 types: none, full, sliding_window, summary, retrieval | code-confirmed |
| Temperature | Fixed at 0.0 for all runs | paper-reported |
| Primary metric | Task accuracy / success rate (SR) | paper-reported |
| Repeated runs / seeds | Not specified; the paper does not report multiple independent runs or variance | paper-reported (absent) |
| Hyperparameter sweep | Multiple values of window_size and top_k reported for sensitivity analysis | paper-reported |
| Token / cost reporting | Token counts logged in JSON output; whether the paper reports per-config cost comparisons is not confirmed | readme-confirmed (JSON schema) |

48.5.2 Benchmark Suite

Table 48.2: Benchmark suite [paper-reported]. Per-benchmark task counts are not listed here; they should be read from the publication and are not independently verified.
| Benchmark | Domain | Modality | Key Characteristic | Memory Demand |
|---|---|---|---|---|
| ScreenSpot | GUI grounding | Text + screenshot | Element localization on device screens | Low–moderate: primarily recent-context dependent |
| AssistantBench | Web assistant tasks | Text + web content | Multi-step web interaction, cross-page recall | High: requires long-range information retrieval |
| AndroidWorld | Mobile device control | Text + screenshot | Long-horizon device interaction across apps | Moderate: recency + some cross-screen recall |
| OSWorld | Desktop OS control | Text + screenshot | Complex desktop environment tasks | High: sequential observations, long episodes |

48.5.3 Quantitative Results

Data availability limitation. The chapter author was unable to independently recover exact per-benchmark numerical scores, confidence intervals, or token-cost breakdowns from the paper for inclusion in a verified results table. Rather than presenting a table with placeholder arrows or unverifiable numbers, the following reports the paper's stated conclusions at the finding level, citing the relevant evidence pattern. Readers should consult the publication's results tables directly for: (a) exact accuracy/SR percentages per benchmark × memory type × model backend, (b) per-task score distributions, (c) hyperparameter sensitivity curves, (d) token usage per configuration, and (e) any variance or statistical testing [paper-reported].

The paper reports results across a full cross-product of 5 memory types × 4 benchmarks × 2–3 LLM backends [paper-reported]. The following findings are consistently stated in the paper's analysis:

48.5.4 Key Findings

Table 48.3: Key findings from the paper's evaluation [paper-reported]. Each finding is stated as reported in the paper's analysis; exact score deltas should be read from the publication.
| ID | Finding | Evidence Pattern | Benchmarks Most Relevant |
|---|---|---|---|
| F1 | Memory consistently outperforms no-memory | All memory-equipped configurations exceed the none baseline on tasks requiring cross-step information. Improvement magnitude increases with episode length. | All four; largest gap on AndroidWorld, OSWorld |
| F2 | Simple methods suffice | Well-configured sliding_window and retrieval match or exceed summary and full on evaluated benchmarks. | All four |
| F3 | Optimal type is task-dependent | sliding_window wins on recency-dominated tasks; retrieval wins on tasks requiring long-range recall. No single type dominates. | sliding_window best: ScreenSpot, AndroidWorld. retrieval best: AssistantBench, OSWorld. |
| F4 | Inter-type variance > intra-type variance | Performance varies more across types (e.g., sliding_window vs retrieval) than within a type (e.g., window_size=3 vs 10). Type selection is the first-order decision. | All four |
| F5 | Model–memory interaction | GPT-4o shows larger absolute gains from memory than GPT-3.5-Turbo, but relative ranking of memory types is broadly consistent across backends. | Cross-backend comparison |

48.5.5 Interpretation and Limitations of the Evidence

The paper's experimental design has notable methodological strengths and gaps:

Strengths. (1) Controlled comparison: the LLM backend is held constant while varying memory, isolating the memory effect. (2) Fixed temperature reduces sampling variation. (3) Cross-product design (type × benchmark × backend) enables interaction analysis (F5). (4) Token counting in JSON output enables post-hoc cost analysis.

Gaps that limit the strength of the evidence.

  • No variance recovered. No repeated-run statistics or confidence intervals were recovered for this chapter (Section 48.5.3), so the reported type-level differences cannot be separated from run-to-run noise on the evidence available here. Temperature 0 mitigates but does not eliminate this concern (API-level non-determinism persists).
  • Task count and composition. The paper does not specify the exact number of tasks per benchmark in the portions inspected for this chapter. Small $N$ would inflate the variance of $\hat{F}$ and weaken cross-type comparisons.
  • Cost comparison. Although token counts are logged, the paper’s analysis focuses on accuracy rather than cost-adjusted performance. Whether summary memory's extra LLM calls or retrieval memory's embedding costs are quantified is not confirmed.
  • Hyperparameter sensitivity. Finding F4 (inter-type > intra-type variance) is the most practically useful result but would be strengthened by reporting the full hyperparameter sweep curves, not just qualitative claims.

Despite these gaps, Findings F3 and F4 together constitute a useful design guideline: choose the memory type based on task characteristics first, then tune hyperparameters second.
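The variance gap listed above can be partly addressed after the fact, because per-task scores are persisted in the JSON output. The following is a minimal sketch, not part of the SimpleMem repository, of a paired percentile bootstrap over tasks that estimates a confidence interval for the accuracy difference between two configurations evaluated on the same task set. The helper name `paired_bootstrap_ci` and the score lists are hypothetical placeholders, not results from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    scores_a[i] and scores_b[i] must be the scores of two configurations
    on the same task i (paired design). Returns (point, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Placeholder per-task success flags for two configurations (illustrative only).
window_scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
retrieval_scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
point, lower, upper = paired_bootstrap_ci(window_scores, retrieval_scores)
print(f"difference = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that includes zero would indicate that the observed difference between the two types is not distinguishable from task-sampling noise at this task count. The bootstrap captures task-sampling variability only; it does not capture API-level non-determinism, which would require repeated runs.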

48.6 Implementation Patterns

48.6.1 Agent-Memory Interaction Cycle

Each agent step follows the store–retrieve–act cycle. The code in Section 48.2.3 (Agent.run_task()) implements this loop [code-confirmed]. This section analyzes how the cycle architecture may account for the reported findings.

The retrieve() call is the point where memory type most directly affects LLM behavior: the same underlying observations are represented entirely differently depending on the strategy. Consider a 10-step GUI interaction episode:

| Memory Type | What retrieve() Returns at Step 10 | Prompt Size Impact |
|---|---|---|
| none | Empty string | Minimal |
| full | All 10 prior observations concatenated | 10× observation size |
| sliding_window | Steps 6–10 (with window_size=5) | 5× observation size |
| summary | Compressed text summary of steps 1–10 | Bounded by summary_max_length |
| retrieval | Top-3 most similar past observations to current query | 3× observation size |

This offers a plausible account of Finding F3: on ScreenSpot (short episodes, recent context matters most), sliding window keeps the most useful recent observations without diluting the prompt with older ones. On AssistantBench (long-range cross-page recall), retrieval can surface relevant information from early steps that a sliding window has already evicted [author-analysis].
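The 10-step comparison above can be reproduced with a toy model. The sketch below is illustrative only and is not the repository's code: `ToyMemory` is a hypothetical single class standing in for the per-strategy modules, the summary branch truncates text where a real strategy would call an LLM, the retrieval branch ranks by word overlap where a real strategy would use embeddings, and the `top_k` parameter name is an assumption.

```python
def _jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


class ToyMemory:
    """Hypothetical illustration of the five strategies behind one interface."""

    def __init__(self, memory_type, window_size=5, top_k=3, summary_max_length=60):
        self.memory_type = memory_type
        self.window_size = window_size
        self.top_k = top_k
        self.summary_max_length = summary_max_length
        self.entries = []

    def store(self, observation: str) -> None:
        self.entries.append(observation)

    def reset(self) -> None:
        self.entries.clear()

    def retrieve(self, query: str = "") -> str:
        if self.memory_type == "none":
            return ""
        if self.memory_type == "full":
            return "\n".join(self.entries)
        if self.memory_type == "sliding_window":
            return "\n".join(self.entries[-self.window_size:])
        if self.memory_type == "summary":
            # Stand-in: a real summary strategy would call an LLM here.
            return " ".join(self.entries)[: self.summary_max_length]
        if self.memory_type == "retrieval":
            ranked = sorted(self.entries, key=lambda e: _jaccard(query, e), reverse=True)
            return "\n".join(ranked[: self.top_k])
        raise ValueError(f"unknown memory type: {self.memory_type}")


observations = [f"step {i}: screen shows item {i}" for i in range(1, 11)]
for kind in ["none", "full", "sliding_window", "summary", "retrieval"]:
    memory = ToyMemory(kind)
    for obs in observations:
        memory.store(obs)
    context = memory.retrieve(query="item 2")
    print(f"{kind:15s} context length = {len(context)} characters")
```

With the query "item 2", the sliding window has already evicted step 2, while the retrieval branch ranks it first. This is the recency versus long-range trade-off described above, shown on synthetic observations rather than benchmark data.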

48.6.2 Configuration-Driven Instantiation

The factory pattern in run.py's get_memory() (Section 48.2.2) enables identical evaluation code to run any memory type. This design has three key properties [author-analysis based on code-confirmed factory pattern]:

  1. Single code path. The agent loop in agent.py is memory-type-agnostic: it calls store(), retrieve(), reset() without knowing the concrete strategy. This ensures evaluation conditions are identical except for memory behavior.
  2. Per-task scoring. The JSON output's per_task array supports post-hoc analysis of which tasks benefit from which strategy—enabling Finding F3.
  3. Token counting. Accumulated in the agent loop and persisted in JSON output, enabling cost comparison across configurations.
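Properties 2 and 3 can be illustrated with a small post-hoc comparison of two result documents. This is a sketch under stated assumptions: the top-level keys `metrics.accuracy`, `metrics.total_tokens`, and `per_task` follow the README-described schema, but the per-entry field names `task_id` and `score` are hypothetical, and all values below are placeholders rather than reported results.

```python
import json


def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Per-task and cost comparison of two result documents."""
    # "task_id" and "score" are assumed field names for per_task entries.
    scores_a = {t["task_id"]: t["score"] for t in run_a["per_task"]}
    scores_b = {t["task_id"]: t["score"] for t in run_b["per_task"]}
    shared = sorted(scores_a.keys() & scores_b.keys())
    return {
        "a_wins": [t for t in shared if scores_a[t] > scores_b[t]],
        "b_wins": [t for t in shared if scores_b[t] > scores_a[t]],
        "accuracy_delta": run_a["metrics"]["accuracy"] - run_b["metrics"]["accuracy"],
        "token_ratio": run_a["metrics"]["total_tokens"] / run_b["metrics"]["total_tokens"],
    }


# Placeholder documents; in practice these would be loaded from result files.
full_run = json.loads("""{
  "metrics": {"accuracy": 0.75, "total_tokens": 3000},
  "per_task": [{"task_id": "t1", "score": 1}, {"task_id": "t2", "score": 1},
               {"task_id": "t3", "score": 1}, {"task_id": "t4", "score": 0}]
}""")
window_run = json.loads("""{
  "metrics": {"accuracy": 0.5, "total_tokens": 1500},
  "per_task": [{"task_id": "t1", "score": 1}, {"task_id": "t2", "score": 0},
               {"task_id": "t3", "score": 0}, {"task_id": "t4", "score": 1}]
}""")

print(compare_runs(full_run, window_run))
```

Because both documents come from the same agent loop, any difference in the per-task lists or the token ratio is attributable to the memory strategy rather than to the harness.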

48.6.3 Reproducibility

Table 48.4: Reproducibility factors and SimpleMem's approach.
| Challenge | SimpleMem's Approach | Residual Risk |
|---|---|---|
| LLM non-determinism | Temperature = 0 [paper] | API-level variation persists |
| Model version drift | Model name in config + JSON output | No API version pinning confirmed |
| Configuration tracking | Full config logged in JSON output [readme] | Low: enables exact reproduction of settings |
| Benchmark contamination | Affects all configs equally | Relative comparison valid; absolute scores may be inflated |

48.7 Positioning in the Research Landscape

48.7.1 Comparative Analysis: Autonomous Research Systems

| System | Search Domain | Automation Level | Search Object | Multimodal |
|---|---|---|---|---|
| FunSearch (Ch. 3) | Algorithms | Fully automated | Python functions | No |
| OpenELM (Ch. 8) | Programs | Fully automated | Program source code | No |
| AIDE (Ch. 42) | ML experiments | Fully automated | Training configs | No |
| SimpleMem | Memory configs | Eval automated; search manual | $(r, \boldsymbol{\theta}_r)$ tuples | Yes [paper] |

Two structural features distinguish SimpleMem: (1) a constrained, declarative search object (categorical type + bounded numeric hyperparameters, vs. the open-ended program space of FunSearch/OpenELM), making exhaustive comparison feasible; and (2) agent infrastructure as search target—the memory subsystem is modular, so discovered configurations can transfer across agent architectures sharing the same interface [author-analysis].

48.7.2 Comparative Analysis: Memory-Augmented Agent Systems

SimpleMem's design space encompasses and generalizes the fixed memory choices made by earlier agent systems. The following comparison uses five explicit axes to clarify where SimpleMem's framework subsumes, extends, or falls short of prior designs [author-analysis]:

Table 48.5: Structural comparison of memory-augmented agent systems along five design axes [author-analysis].
| Axis | MemGPT | Voyager | Generative Agents | SimpleMem |
|---|---|---|---|---|
| Memory granularity | Three tiers: main context, recall storage, archival storage. LLM decides what to promote/demote between tiers. | Single tier: skill library of verified code functions, indexed by description embedding. | Single tier: observation stream with metadata (importance, recency, embedding). | Single tier per strategy. No hierarchical memory. Tier choice itself is the experimental variable ($r$). |
| Eviction policy | LLM-managed: the model explicitly calls functions to move data between tiers. Flexible but costly (extra LLM calls per management decision). | No eviction: skill library grows monotonically. Failed skills are not added; verified ones persist indefinitely. | No eviction: all observations retained. Effective management via retrieval weighting ($\alpha \cdot \text{recency} + \beta \cdot \text{importance} + \gamma \cdot \text{relevance}$). | Strategy-dependent: FIFO (sliding_window), LLM-driven compression (summary), none (full, retrieval). Eviction is coupled to type $r$, not independently configurable. |
| Retrieval policy | LLM-driven: model decides when and what to retrieve via function calls. Retrieval is an agent action, not automatic. | Embedding similarity: query description is embedded, top-$k$ skills retrieved by cosine similarity. | Weighted formula: $\alpha \cdot \text{recency} + \beta \cdot \text{importance} + \gamma \cdot \text{relevance}$ with fixed coefficients. | Strategy-dependent: recency-only (sliding_window), all (full), compressed (summary), similarity (retrieval). The retrieval policy is the primary experimental variable. |
| Multimodal handling | Text-only in the original paper. Later extensions support multimodal. | Text-only: Minecraft skills are stored as code + text descriptions. | Text-only: observations are natural-language descriptions of simulated world. | Native multimodal: text + screenshot observations. Encoding path varies by memory type and backend (Section 48.2.7) [paper]. |
| Automation level | Fixed architecture; no systematic comparison infrastructure. | Fixed architecture; skill accumulation is automated but memory design is not variable. | Fixed architecture; $\alpha, \beta, \gamma$ hand-tuned. | Memory design is the experiment: type $r$ and hyperparameters $\boldsymbol{\theta}_r$ are systematically varied and evaluated. Evaluation automated; search manual. |

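What SimpleMem subsumes. Generative Agents' retrieval formula maps approximately onto $r = $ retrieval: with $\alpha = \beta = 0$ the weighted score reduces to pure relevance ranking, which is the similarity-based policy listed for retrieval in Table 48.5, while the recency and importance terms have no listed counterpart. Voyager's skill library, which never evicts entries, corresponds loosely to full memory with procedural-knowledge encoding, although Voyager retrieves by embedding similarity rather than returning every entry. The key difference is that SimpleMem makes these choices experimental variables rather than fixed design decisions.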

What SimpleMem does not cover. MemGPT's hierarchical three-tier architecture—where the LLM manages promotion/demotion between tiers—has no equivalent in SimpleMem's single-tier strategies. Generative Agents' reflection mechanism (periodic higher-order summarization of observations) goes beyond SimpleMem's summary strategy, which compresses observations but does not generate meta-cognitive reflections. Graph-structured memory (as in some knowledge-graph agents) is also absent. These omissions define the boundary of SimpleMem's current configuration space.
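The relationship between the Generative Agents weighted formula in Table 48.5 and a similarity-only retrieval policy can be made concrete with a short sketch. Everything here is illustrative: `Item`, `weighted_top_k`, and the component scores are hypothetical, and the three components are assumed to be pre-normalized to $[0, 1]$.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    recency: float      # assumed normalized to [0, 1]
    importance: float   # assumed normalized to [0, 1]
    relevance: float    # similarity to the current query, in [0, 1]


def weighted_top_k(items, k, alpha=1.0, beta=1.0, gamma=1.0):
    """Rank by alpha*recency + beta*importance + gamma*relevance."""
    def score(item):
        return alpha * item.recency + beta * item.importance + gamma * item.relevance
    return sorted(items, key=score, reverse=True)[:k]


items = [
    Item("recent but off-topic", recency=0.9, importance=0.9, relevance=0.1),
    Item("old but on-topic", recency=0.1, importance=0.1, relevance=0.8),
    Item("middling", recency=0.5, importance=0.5, relevance=0.5),
]

# Equal weights: recency and importance dominate the ranking.
print(weighted_top_k(items, k=1)[0].text)
# alpha = beta = 0: the score collapses to relevance-only ranking.
print(weighted_top_k(items, k=1, alpha=0.0, beta=0.0)[0].text)
```

Setting $\alpha = \beta = 0$ recovers a pure similarity ranking, so a similarity-only policy is one corner of the weighted formula's coefficient space; the full formula would need recency and importance signals that a similarity-only strategy does not track.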

[Figure: SimpleMem Coverage of Prior Memory Designs [author-analysis], on an axis from simple to complex. Within SimpleMem's configuration space, subsumable by $\{r, \boldsymbol{\theta}_r\}$: none → full → sliding_window → summary → retrieval, plus Voyager (skill library) and Generative Agents (weighted retrieval). Beyond SimpleMem's space, requiring architectural extensions: MemGPT (3-tier hierarchical), reflection mechanisms, graph-structured memory, meta-cognitive control.]
Figure 48.4: SimpleMem's configuration space covers simple-to-moderate memory designs. Hierarchical and meta-cognitive architectures lie outside.

48.8 Limitations

48.8.1 Evidence Limitations

  1. Repository audit depth. The entry point (run.py), agent harness (agent.py), config directory (configs/), and memory module directory (memory/) were inspected. Line-by-line reading of all memory strategy implementations was not performed. The multimodal encoding paths, embedding model selection, and vector storage backend for retrieval memory were not confirmed from source code.
  2. Experimental result granularity. Exact per-benchmark numerical scores, confidence intervals, task counts, seed counts, and cost figures were not independently recovered. Section 48.5 reports findings at the pattern level; readers should consult the publication directly for exact figures.

48.8.2 System-Level Limitations

  1. Evaluation cost. Each configuration requires full agent episodes. A thorough cross-product (5 types × multiple hyperparameter values × 4 benchmarks × multiple backends) requires hundreds of evaluation runs. Cost-reduction strategies (cascade evaluation, early stopping) are not described [paper: not mentioned].
  2. No automated search. The researcher-driven search step limits use in large-scale discovery, though the modest $|\mathcal{C}|$ makes simple automated strategies (grid search) sufficient if added.
  3. Configuration space completeness. The five memory types cover common strategies but omit: (a) hierarchical memory (MemGPT's three-tier design), (b) graph-structured memory, (c) reflection and meta-cognitive operations, and (d) hybrid strategies (e.g., sliding window + retrieval fallback). See Table 48.5 for details.
  4. Transferability. Finding F3 demonstrates that configurations optimal on one benchmark may not transfer. No meta-learning mechanism predicts the best configuration for new task distributions [author-analysis].
  5. Single-tier only. All five strategies operate as single-tier memory. The interaction between memory tiers (e.g., MemGPT's promotion/demotion) is not captured, limiting insights about hierarchical designs.
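The scale mentioned in item 1 can be bounded with figures already used in this chapter: roughly 15 to 50 configurations (Section 48.9.1), 4 benchmarks, and 2 to 3 backends. The following back-of-envelope count assumes one run per (configuration, benchmark, backend) cell; the `n_seeds` parameter is a hypothetical addition showing how repeated runs would multiply the total.

```python
def evaluation_runs(n_configs, n_benchmarks, n_backends, n_seeds=1):
    """Number of full evaluation runs for an exhaustive cross-product."""
    return n_configs * n_benchmarks * n_backends * n_seeds


low = evaluation_runs(n_configs=15, n_benchmarks=4, n_backends=2)
high = evaluation_runs(n_configs=50, n_benchmarks=4, n_backends=3)
print(low, high)  # 120 600

# Adding three seeds per cell to obtain variance estimates triples the bill.
print(evaluation_runs(15, 4, 2, n_seeds=3), evaluation_runs(50, 4, 3, n_seeds=3))  # 360 1800
```

Each run is itself a full pass over a benchmark's agent episodes, so the number of LLM calls is larger again by the number of tasks and steps per task.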

48.8.3 Safety Considerations

  • Memory poisoning. If memory is populated from environment observations during evaluation, adversarial inputs could corrupt stored entries. Input validation for stored memories is not confirmed from code.
  • Data retention. Memory stores accumulate potentially sensitive information. Eviction policies optimized for performance may conflict with data retention requirements.
  • Execution isolation. Memory configurations operate within the Python process without evidence of sandboxing—typical for research frameworks but relevant for production deployment.

48.9 Future Directions

The extensions below are the chapter author's projections [author-analysis].

48.9.1 Closing the Search Loop

The most natural extension adds an automated search controller that reads JSON evaluation results and proposes configurations:

  • Grid/random search. Given $|\mathcal{C}| \approx 15\text{--}50$, exhaustive grid search is feasible. This is the lowest-effort automation step and would transform SimpleMem from an evaluation harness into a complete automated memory design system.
  • Bayesian optimization. Over the mixed discrete-continuous space $(r, \boldsymbol{\theta}_r)$, BO with a categorical kernel could navigate efficiently when each evaluation is expensive.
  • LLM-guided controller. An LLM reads JSON results and proposes configurations in natural language. The modest space size may not justify the LLM overhead compared to grid search, but could be valuable if the configuration space is expanded (Section 48.9.3).
[Figure: Autonomous Memory Design: Current vs. Extended [author-analysis]. Three prerequisites lead to Autonomous Memory Discovery: Composable Space (✓, get_memory() + configs/), Automated Evaluation (✓, run.py → JSON), Informed Search (✗, not implemented).]
Figure 48.5: SimpleMem provides 2 of 3 prerequisites for autonomous design [author-analysis].
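The grid-search option is simple enough to sketch in full. The controller below is hypothetical and not part of the repository: the hyperparameter values are illustrative, `top_k` is an assumed parameter name, and `evaluate` is a caller-supplied function that in practice would launch an evaluation run and read `metrics.accuracy` from its JSON output. Here it is replaced by a toy scoring function so the sketch runs on its own.

```python
from itertools import product

# Hypothetical search space: memory type -> hyperparameter grid.
SEARCH_SPACE = {
    "none": {},
    "full": {},
    "sliding_window": {"window_size": [3, 5, 10]},
    "summary": {"summary_max_length": [256, 512]},
    "retrieval": {"top_k": [1, 3, 5]},
}


def enumerate_configs(space):
    """Yield one flat config dict per (type, hyperparameter combination)."""
    for memory_type, grid in space.items():
        names = list(grid)
        # product() over zero iterables yields one empty tuple, so
        # parameter-free types contribute exactly one configuration.
        for values in product(*(grid[name] for name in names)):
            yield {"memory_type": memory_type, **dict(zip(names, values))}


def grid_search(space, evaluate):
    """Evaluate every configuration and return the best one."""
    results = [(evaluate(cfg), cfg) for cfg in enumerate_configs(space)]
    best_score, best_cfg = max(results, key=lambda pair: pair[0])
    return best_cfg, best_score, results


def toy_evaluate(cfg):
    """Placeholder objective; a real one would run the benchmark."""
    if cfg["memory_type"] == "sliding_window":
        return 1.0 - abs(cfg["window_size"] - 5) / 10
    return 0.5


best_cfg, best_score, results = grid_search(SEARCH_SPACE, toy_evaluate)
print(len(results), best_cfg, best_score)
```

The same `enumerate_configs` generator could feed a random-search or Bayesian-optimization controller; only the order and number of `evaluate` calls would change.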

48.9.2 Joint Prompt and Memory Optimization

The prompts governing how the agent interacts with memory (what to store, how to query, how to format retrieved context) could also be optimized. A joint search over $(\mathcal{M}, \pi) \in \mathcal{C} \times \Pi$, where $\pi$ is a prompt configuration, would explore a richer space. Alternating between memory-configuration and prompt-refinement (as in DSPy-style prompt optimization) would reduce per-step dimensionality.

48.9.3 Expanding the Configuration Space

The five memory types could be extended to include: (a) hierarchical memory with multiple tiers and configurable eviction policies per tier (generalizing MemGPT); (b) graph-structured memory with relational edges supporting multi-hop retrieval; (c) composite strategies combining multiple types (e.g., sliding window for recent context + retrieval for long-range); and (d) adaptive memory where the strategy switches based on task characteristics during execution.

48.9.4 Meta-Learning Configuration Selectors

Rather than discovering one optimal configuration, a meta-learning extension would learn a policy $\phi: \text{TaskFeatures} \to \mathcal{C}$ that selects configurations based on task characteristics. This directly addresses Finding F3: if different tasks favor different types, a selector outperforms any fixed choice. Training signal would come from SimpleMem's evaluation results across benchmarks.
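A minimal instance of such a selector is a nearest-neighbor lookup over benchmark descriptors. The sketch below is an assumption-laden illustration: the two numeric features are hypothetical encodings of the qualitative "memory demand" descriptions in Table 48.2, the labels follow Finding F3 as reported, and the selector returns only the memory type $r$, not a full configuration in $\mathcal{C}$.

```python
import math

# benchmark -> ((recall_need, episode_length), best type per Finding F3).
# Feature values are hypothetical encodings on a rough 0-1 scale.
TRAINING = {
    "ScreenSpot": ((0.2, 0.1), "sliding_window"),
    "AndroidWorld": ((0.5, 0.7), "sliding_window"),
    "AssistantBench": ((0.9, 0.6), "retrieval"),
    "OSWorld": ((0.9, 0.9), "retrieval"),
}


def select_memory_type(features, training=TRAINING):
    """1-nearest-neighbor selector: copy the label of the closest benchmark."""
    nearest = min(training.values(), key=lambda entry: math.dist(features, entry[0]))
    return nearest[1]


# A new task family with heavy long-range recall and long episodes.
print(select_memory_type((0.95, 0.8)))
# A short, recency-dominated task family.
print(select_memory_type((0.1, 0.2)))
```

With only four labeled benchmarks, such a selector is a lookup table rather than a learned policy; a meaningful $\phi$ would need evaluation results on many more task distributions and a principled feature definition.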

48.10 Summary

Key Takeaway. SimpleMem provides a configurable evaluation framework for systematic comparison of memory strategies in multimodal LLM agents. Its core contribution is reframing agent memory design as a configuration comparison problem over composable primitives.

Implementation. The repository implements: (1) a factory-dispatch entry point (run.py / get_memory()) that maps the --memory_type flag to concrete strategy classes in memory/ [code-confirmed]; (2) an agent harness (agent.py / Agent class) implementing the observe–store–retrieve–act loop with a pluggable memory object [code-confirmed]; (3) five interchangeable memory strategies sharing a common store/retrieve/reset interface [code-confirmed]; (4) structured JSON output with per-task scores, aggregate accuracy, and token counts [readme-confirmed].

Findings. The paper reports five key results: (F1) memory consistently improves over no-memory baselines, (F2) simple strategies are competitive, (F3) optimal type varies by task characteristics, (F4) type selection matters more than hyperparameter tuning, (F5) stronger backends gain more from memory in absolute terms, while the relative ranking of memory types is broadly consistent across backends [paper-reported].

Significance for Part P07. SimpleMem targets agent memory as the object of systematic optimization in multimodal settings—a domain not addressed by other autonomous research systems in this survey. While less automated than FunSearch, OpenELM, or AIDE, it demonstrates that the autonomous-design pattern applies to agent infrastructure components, and its evaluation harness provides two of three prerequisites for fully closing the search loop [author-analysis].

Provenance Summary.
  • Code-confirmed: run.py (factory dispatch via get_memory(), argparse CLI with --memory_type/--model/--benchmark), agent.py (Agent class, run_task() loop, memory as constructor parameter), memory/ directory with per-strategy modules sharing store/retrieve/reset interface, configs/ with YAML files.
  • Readme-confirmed: CLI usage patterns, memory type names, JSON output schema (metrics.accuracy, metrics.total_tokens, per_task array), absence of automated search controller.
  • Paper-reported: System goals, multimodal encoding paths (Figure 48.2), LLM backends (GPT-4o, GPT-4V, GPT-3.5-Turbo), benchmark suite (ScreenSpot, AssistantBench, AndroidWorld, OSWorld), findings F1–F5, temperature protocol.
  • Author-analysis: Formalization (Section 48.3), cost model, comparative tables (Table 48.5, Section 48.7), three-prerequisite framework, future directions, safety analysis. Embedding backend, vector storage, and exact multimodal code paths are implementation unknowns.