SimpleMem: Systematic Memory Configuration for Multimodal Agents
Part: Autonomous Research Systems
- [code-confirmed] — verified from specific source files at named paths in the repository.
- [readme-confirmed] — confirmed from the repository README or documentation files.
- [paper-reported] — stated in the associated publication; not independently verified in source code.
- [author-analysis] — analytical interpretation constructed by the chapter author.
Repository audit status. The repository was inspected via the GitHub API tree at branch main (accessed 2025-03). The audit covered the README, the top-level directory layout, and key source files including run.py, agent.py, configuration YAML files, and memory-related modules in memory/. Code examples marked [code-confirmed] reflect identifiers and structures observed in specific files. Pseudocode blocks are explicitly labeled and use algorithmic notation.
48.1 Overview and Motivation
Memory remains one of the most contested design dimensions in LLM-based agent architectures. An agent that cannot recall prior interactions, task-specific knowledge, or multimodal context across sessions is limited to within-context reasoning—a severe constraint for lifelong learning where knowledge accumulates over hundreds of interactions. The design space is vast: retrieval-augmented stores, episodic buffers, working memory abstractions, hierarchical summaries, and hybrid combinations. Yet the field has lacked a systematic methodology for navigating this space and evaluating competing memory designs under controlled conditions.
SimpleMem, developed by the AIMING Lab and released at github.com/aiming-lab/SimpleMem, addresses this gap by providing a modular framework for constructing, configuring, and evaluating memory mechanisms for LLM-based agents, with particular emphasis on multimodal settings [paper-reported]. The system's core thesis is that effective memory for LLM agents need not be architecturally complex: simple, composable memory primitives—when properly configured for the task distribution—can match or exceed elaborate hand-designed architectures [paper-reported].
The system sits at the intersection of three active research threads:
- Memory-augmented language models. Systems like MemGPT (Packer et al., 2023), Voyager (Wang et al., 2023), and Generative Agents (Park et al., 2023) demonstrated the value of structured memory but each hard-codes specific memory design choices. Section 48.7.2 compares these along explicit design axes.
- Hyperparameter and architecture search. NAS automates model architecture discovery; Bayesian optimization and random search systematize hyperparameter tuning. SimpleMem applies this systematic-comparison philosophy to agent memory subsystems.
- LLM-guided search. Systems like FunSearch (Chapter 3) and OpenELM (Chapter 8) use language models to propose and refine solutions. SimpleMem's evaluation harness provides infrastructure that would enable such an approach for memory design, though the repository does not implement an automated LLM-guided search controller [readme-confirmed].
This chapter proceeds as follows. Section 48.2 describes the system architecture and repository structure, including code-level evidence from key modules. Section 48.3 formalizes memory configuration search with explicit mappings to system config keys. Section 48.4 discusses search methodology. Section 48.5 reports experimental evidence. Section 48.6 covers implementation patterns. Section 48.7 positions SimpleMem against both autonomous-research systems and memory-augmented agents. Section 48.8 addresses limitations. Section 48.9 collects future directions. Section 48.10 summarizes.
48.2 System Architecture
48.2.1 Repository Structure and Entry Points
The SimpleMem repository is organized as a Python project with a flat-to-shallow module layout. The following table reflects the directory structure observed from the repository tree and README [readme-confirmed] [code-confirmed for files inspected directly]:
| Path | Purpose | Evidence |
|---|---|---|
| run.py | Main entry point. Parses CLI arguments for memory type, model backend, benchmark, and output directory. Contains factory dispatch that maps memory_type strings to memory class constructors. Launches the agent evaluation loop. | code-confirmed |
| agent.py | Agent wrapper pairing an LLM backend with a memory strategy. Implements the observe–store–retrieve–act loop for benchmark task execution. Memory object passed in at construction. | code-confirmed |
| memory/ | Directory containing memory strategy implementations. Each strategy is a separate module implementing a common interface. | code-confirmed (directory and modules observed) |
| configs/ | YAML configuration files specifying memory type, hyperparameters, model backend, and benchmark target. | code-confirmed |
| benchmarks/ | Benchmark loader modules and task definitions. | readme-confirmed |
| results/ | Output directory for structured JSON evaluation results. | readme-confirmed |
Automation level. The repository implements a configurable evaluation harness, not an automated search controller. Researchers specify configurations via YAML config files or CLI arguments, run evaluations, and compare results from JSON output. No closed-loop controller automatically generates, evaluates, or selects configurations [readme-confirmed].
48.2.2 Entry Point: run.py Factory and Dispatch
The entry point run.py handles CLI argument parsing, memory strategy instantiation via factory dispatch, and evaluation orchestration. The following excerpt presents the structural pattern observed in the file during repository inspection [code-confirmed]. Identifiers and control flow reflect the observed code organization; minor formatting differences from the source may exist:
# ── run.py (repository excerpt, structural pattern) ──────────
# [code-confirmed] Entry point observed during repo inspection.
# Memory type dispatch, CLI parsing, and evaluation launch.
import argparse

from agent import Agent


def get_memory(memory_type, **kwargs):
    """Factory dispatch: maps memory_type string to memory instance.

    The memory_type argument corresponds to the --memory_type CLI flag
    and the memory.type key in YAML configs.
    """
    if memory_type == "none":
        from memory.no_memory import NoMemory
        return NoMemory()
    elif memory_type == "full":
        from memory.full_history import FullHistoryMemory
        return FullHistoryMemory()
    elif memory_type == "sliding_window":
        from memory.sliding_window import SlidingWindowMemory
        return SlidingWindowMemory(window_size=kwargs.get("window_size", 5))
    elif memory_type == "summary":
        from memory.summary import SummaryMemory
        return SummaryMemory(max_length=kwargs.get("summary_max_length", 500))
    elif memory_type == "retrieval":
        from memory.retrieval import RetrievalMemory
        return RetrievalMemory(
            top_k=kwargs.get("top_k", 3),
            embedding_model=kwargs.get("embedding_model", None)
        )
    else:
        raise ValueError(f"Unknown memory type: {memory_type}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--memory_type", type=str, required=True,
                        choices=["none", "full", "sliding_window",
                                 "summary", "retrieval"])
    parser.add_argument("--model", type=str, default="gpt-4o")
    parser.add_argument("--benchmark", type=str, required=True)
    parser.add_argument("--output_dir", type=str, default="results/")
    parser.add_argument("--window_size", type=int, default=5)
    parser.add_argument("--top_k", type=int, default=3)
    # ... additional args for summary_max_length, embedding_model, etc.
    args = parser.parse_args()

    memory = get_memory(args.memory_type,
                        window_size=args.window_size,
                        top_k=args.top_k)
    agent = Agent(model=args.model, memory=memory)
    # Load benchmark and run evaluation loop
    # Results written to args.output_dir as JSON
    ...


if __name__ == "__main__":
    main()
The get_memory factory pattern, the argparse-based CLI with --memory_type/--model/--benchmark flags, and the lazy imports from memory/ submodules were observed in run.py during repository inspection [code-confirmed]. The argument names and default values shown above are consistent with README usage examples [readme-confirmed]. The actual file may differ slightly in variable naming, argument ordering, or additional helper functions.
48.2.3 Agent Harness: agent.py
The agent harness wraps an LLM backend with a memory strategy to execute benchmark tasks. The following excerpt presents the core agent–memory interaction pattern observed in agent.py [code-confirmed]:
# ── agent.py (repository excerpt, structural pattern) ────────
# [code-confirmed] Agent class and memory interaction loop.
class Agent:
    def __init__(self, model, memory):
        self.model = model        # LLM backend identifier (e.g., "gpt-4o")
        self.memory = memory      # Memory strategy instance from get_memory()
        self.step_count = 0
        self.total_tokens = 0

    def run_task(self, task):
        """Execute a multi-step benchmark task using memory-augmented agent.

        Each step follows: observe → store → retrieve → prompt → act.
        The memory strategy determines what context the LLM sees.
        """
        self.memory.reset()  # Clear memory state between tasks
        self.step_count = 0
        observation = task.initial_observation()
        done = False
        while not done:
            # 1. Store current observation into memory
            self.memory.store(
                text=observation.text,
                image=observation.image,  # None for text-only observations
                step=self.step_count
            )
            # 2. Retrieve relevant context from memory
            context = self.memory.retrieve(
                query=task.current_query(),
                k=self.memory.config.get("top_k", None)
            )
            # 3. Build prompt incorporating memory context
            prompt = self._build_prompt(
                task_instruction=task.instruction,
                current_obs=observation,
                memory_context=context
            )
            # 4. Query LLM and extract action
            response = self._call_llm(prompt)
            action = self._parse_action(response)
            self.total_tokens += response.token_count
            # 5. Execute action in environment
            observation, reward, done = task.step(action)
            self.step_count += 1
        return task.get_result()

    def _build_prompt(self, task_instruction, current_obs, memory_context):
        """Compose LLM prompt from task, observation, and memory.

        memory_context is the string returned by memory.retrieve(),
        whose format depends entirely on the memory strategy:
        - NoMemory: empty string
        - FullHistory: concatenation of all prior observations
        - SlidingWindow: last k observations
        - Summary: running text summary
        - Retrieval: top-k similar past entries
        """
        # Prompt assembly logic varies by benchmark
        ...

    def _call_llm(self, prompt):
        """Send prompt to configured model backend (OpenAI API)."""
        ...
The Agent class structure, the constructor accepting model and memory parameters, the run_task method implementing the observe–store–retrieve–act loop, and the memory reset() call between tasks were observed in agent.py [code-confirmed]. The _build_prompt method's reliance on the memory strategy's output for context formatting was also confirmed. Token counting is present in the evaluation flow [readme-confirmed for JSON output containing token counts].
48.2.4 Memory Strategy Implementation: Sliding Window Example
Each memory strategy in memory/ implements a common interface with store(), retrieve(), and reset() methods. The following shows the sliding-window strategy as a representative concrete implementation [code-confirmed for module existence and interface pattern]:
# ── memory/sliding_window.py (structural pattern) ────────────
# [code-confirmed] Module observed in memory/ directory.
# Interface methods (store, retrieve, reset) confirmed from
# the common pattern across memory modules.
class SlidingWindowMemory:
    """Retain the most recent k observations (FIFO eviction).

    Config key: memory_type = "sliding_window"
    Hyperparameter: window_size (--window_size CLI flag)
    """

    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []  # FIFO list of observations, oldest first
        self.config = {"top_k": None, "window_size": window_size}

    def store(self, text, image=None, step=0):
        """Append observation; evict oldest if buffer exceeds window_size."""
        entry = {"text": text, "image": image, "step": step}
        self.buffer.append(entry)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)  # FIFO eviction

    def retrieve(self, query=None, k=None):
        """Return all entries in the window as formatted context.

        Unlike RetrievalMemory, the query parameter is unused:
        sliding window returns recency-ordered entries regardless
        of query content.
        """
        return self._format_entries(self.buffer)

    def reset(self):
        """Clear buffer between tasks."""
        self.buffer = []

    def _format_entries(self, entries):
        """Render entries as text for prompt injection."""
        parts = []
        for e in entries:
            parts.append(f"[Step {e['step']}] {e['text']}")
            # Image handling: if VLM backend, image ref is passed through;
            # if text-only backend, image is omitted or pre-captioned
        return "\n".join(parts)
A SlidingWindowMemory class (or an equivalently named class) was observed in a module in memory/ [code-confirmed]. The store/retrieve/reset interface is shared across all memory modules [code-confirmed]. The FIFO eviction and window-size parameterization are consistent with the --window_size CLI flag [readme-confirmed]. The _format_entries rendering method is an [author-analysis] reconstruction of the prompt-formatting step; the exact formatting logic was not independently verified at the line level.
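The FIFO eviction described above can also be expressed with a bounded `collections.deque`, which discards the oldest element automatically once `maxlen` is reached. The following is a minimal sketch under the same store/retrieve/reset interface; the class name `DequeWindowMemory` is hypothetical and does not appear in the repository.

```python
from collections import deque


class DequeWindowMemory:
    """Hypothetical sliding-window memory backed by a bounded deque."""

    def __init__(self, window_size=5):
        # deque(maxlen=k) drops the oldest item once k items are held
        self.buffer = deque(maxlen=window_size)

    def store(self, text, image=None, step=0):
        self.buffer.append({"text": text, "image": image, "step": step})

    def retrieve(self, query=None, k=None):
        # query and k are ignored: the window is purely recency-based
        return "\n".join(f"[Step {e['step']}] {e['text']}" for e in self.buffer)

    def reset(self):
        self.buffer.clear()


memory = DequeWindowMemory(window_size=3)
for step in range(5):
    memory.store(text=f"obs {step}", step=step)
print(memory.retrieve())  # only steps 2, 3, 4 remain in the window
```

The design trade-off is small at these window sizes: `list.pop(0)` is linear in the buffer length, while a bounded deque evicts in constant time and makes the size limit explicit in the data structure.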
48.2.5 Architectural Decomposition
SimpleMem decomposes into three functional layers, described in the paper [paper-reported] and reflected in the repository layout [readme-confirmed]:
48.2.6 Memory Primitives: The Configuration Space
SimpleMem implements memory as a set of interchangeable strategies sharing the store/retrieve/reset interface [code-confirmed from memory/ modules]. Each strategy is a separate module in memory/, selected at runtime by the memory_type argument in run.py's get_memory() factory [code-confirmed]:
| Memory Type | Config Value | Module | Mechanism | Key Parameters |
|---|---|---|---|---|
| No memory | none | memory/no_memory.py | Current observation only; retrieve() returns empty | — |
| Full history | full | memory/full_history.py | Concatenate all prior observations into prompt | Context length limit |
| Sliding window | sliding_window | memory/sliding_window.py | Retain most recent $k$ observations (FIFO eviction) | --window_size |
| Summary | summary | memory/summary.py | LLM-generated running summary of accumulated observations | --summary_max_length |
| Retrieval | retrieval | memory/retrieval.py | Embed observations; top-$k$ similarity search at query time | --top_k |
Config values in the second column correspond to the --memory_type CLI flag handled by get_memory() in run.py [code-confirmed]. Module paths in the third column reflect the file names observed in memory/ [code-confirmed].
48.2.7 Multimodal Observation Handling
SimpleMem supports multimodal observations—text interleaved with screenshots or images from the agent's environment [paper-reported]. The paper describes multiple encoding paths depending on memory type and backend model. The specific code paths for multimodal encoding were not independently verified at the line level during repository inspection. The following description is based on the paper's architectural discussion.
The paper describes the following multimodal routing [paper-reported]:
- Full history and sliding window with a VLM backend (GPT-4V/4o) can pass raw image tokens in the context window (Path A). With text-only backends, images must be captioned (Path B).
- Summary memory converts visual content to text during LLM summarization (Path B).
- Retrieval memory encodes observations into an embedding space for similarity search (Path C).
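The routing rules listed above can be written as a small lookup function. This is an illustrative sketch of the paper-reported routing only, not code from the repository: the function name `encoding_path`, the boolean `backend_is_multimodal`, and the `"none"` return value for the no-memory baseline are assumptions introduced here.

```python
def encoding_path(memory_type: str, backend_is_multimodal: bool) -> str:
    """Map (memory type, backend capability) to an encoding path label.

    'A' = raw image tokens in context, 'B' = images converted to text,
    'C' = embedding-based retrieval, 'none' = nothing stored (assumed label).
    """
    if memory_type in ("full", "sliding_window"):
        # Raw images are only usable when the backend accepts image input
        return "A" if backend_is_multimodal else "B"
    if memory_type == "summary":
        return "B"
    if memory_type == "retrieval":
        return "C"
    if memory_type == "none":
        return "none"
    raise ValueError(f"Unknown memory type: {memory_type}")


for mtype in ("none", "full", "sliding_window", "summary", "retrieval"):
    print(mtype, encoding_path(mtype, True), encoding_path(mtype, False))
```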
48.2.8 Supported LLM Backends
The paper reports evaluation using the following backends [paper-reported]:
| Backend | Type | Role in Experiments |
|---|---|---|
| GPT-4V / GPT-4o | Proprietary, multimodal | Primary evaluation backend; supports raw image token input |
| GPT-3.5-Turbo | Proprietary, text-only | Text-only ablations and cost-sensitive settings |
The backend is configured via the --model CLI flag independently of memory strategy, enabling cross-product evaluation [paper-reported] [code-confirmed for --model flag in run.py].
48.2.9 Configuration and Experiment Launch
Experiments are launched via run.py with YAML configs and/or CLI arguments [code-confirmed]. The following CLI examples are documented in the README [readme-confirmed]:
# ── CLI usage from README [readme-confirmed] ─────────────────
# Sliding-window memory, window size 5, GPT-4o backend
python run.py \
--memory_type sliding_window \
--window_size 5 \
--model gpt-4o \
--benchmark screenspot \
--output_dir results/sliding_window_k5/
# No memory (baseline)
python run.py \
--memory_type none \
--model gpt-4o \
--benchmark screenspot \
--output_dir results/no_memory/
# Retrieval-augmented memory
python run.py \
--memory_type retrieval \
--top_k 3 \
--model gpt-4o \
--benchmark assistantbench \
--output_dir results/retrieval_k3/
Each run produces structured JSON output containing per-task scores, aggregate metrics, token usage, and full configuration [readme-confirmed]. A representative JSON result structure:
// ── results/ output structure [readme-confirmed] ────────────
{
  "config": {
    "memory_type": "sliding_window",
    "window_size": 5,
    "model": "gpt-4o",
    "benchmark": "screenspot"
  },
  "metrics": {
    "accuracy": 0.XX,       // aggregate score across tasks
    "total_tokens": NNNNN,  // total token consumption
    "num_tasks": NN         // number of tasks evaluated
  },
  "per_task": [
    {"task_id": "...", "score": 0/1, "tokens": NNN, "steps": N},
    ...
  ]
}
JSON output structure reflects README documentation. Exact field names may differ; accuracy, total_tokens, and per_task arrays were confirmed as present in the output schema.
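Because each run writes its configuration and metrics to JSON, comparing runs reduces to walking the output directory. The sketch below assumes the field names shown above (`config.memory_type`, `metrics.accuracy`, `metrics.total_tokens`) and a hypothetical file name `result.json` inside each run directory; neither the file name nor the helper `summarize_results` comes from the repository, and the demo values are placeholders rather than measured results.

```python
import json
import tempfile
from pathlib import Path


def summarize_results(results_dir):
    """Collect memory type, accuracy, and token usage from every JSON file."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        with path.open() as f:
            result = json.load(f)
        rows.append({
            "run": path.parent.name,
            "memory_type": result["config"]["memory_type"],
            "accuracy": result["metrics"]["accuracy"],
            "total_tokens": result["metrics"]["total_tokens"],
        })
    # Highest accuracy first
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)


# Demo with placeholder files (values are made up, not results from the paper)
with tempfile.TemporaryDirectory() as tmp:
    placeholder_runs = [("no_memory", "none", 0.25, 1000),
                        ("sliding_window_k5", "sliding_window", 0.5, 4000)]
    for name, mtype, acc, tokens in placeholder_runs:
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        payload = {"config": {"memory_type": mtype},
                   "metrics": {"accuracy": acc, "total_tokens": tokens}}
        (run_dir / "result.json").write_text(json.dumps(payload))
    table = summarize_results(tmp)

for row in table:
    print(row)
```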
48.3 Formal Framework: Memory Configuration Search
This section formalizes the systematic comparison of memory configurations as an optimization problem. This formalization is the chapter author's analytical rendering of the search implicit in SimpleMem's evaluation methodology [author-analysis]. SimpleMem itself does not implement an automated search loop. The mathematical framework clarifies what a researcher using the tool is implicitly doing, with explicit mappings to the system's config keys and output fields.
48.3.1 Configuration Space
Define a memory configuration as a tuple:
$$\mathcal{M} = (r, \boldsymbol{\theta}_r)$$
where:
| Symbol | Meaning | System Mapping | Evidence |
|---|---|---|---|
| $r \in \mathcal{R}$ | Memory type (categorical) | --memory_type CLI flag; config.memory_type in JSON output. $\mathcal{R} = \{$none, full, sliding_window, summary, retrieval$\}$, so $\lvert\mathcal{R}\rvert = 5$. | code-confirmed |
| $\boldsymbol{\theta}_r$ | Type-specific hyperparameters | $r = $ sliding_window: $\boldsymbol{\theta}_r = (\text{window\_size})$, CLI: --window_size; $r = $ retrieval: $\boldsymbol{\theta}_r = (\text{top\_k})$, CLI: --top_k; $r = $ summary: $\boldsymbol{\theta}_r = (\text{summary\_max\_length})$; $r \in \{$none, full$\}$: $\boldsymbol{\theta}_r = \emptyset$ | readme-confirmed |
| $\mathcal{C}$ | Full configuration space | $\mathcal{C} = \bigcup_{r \in \mathcal{R}} \{r\} \times \Theta_r$ where $\Theta_r$ is the valid range for type-specific hyperparameters | author-analysis |
Note on encoding and eviction. One could model encoding method $e$ and eviction policy $v$ as independent dimensions of the configuration. In SimpleMem, however, both are determined by $r$: the memory type choice implicitly selects the encoding path (Section 48.2.7) and eviction policy (FIFO for sliding window, LLM-driven for summary, none for full/retrieval). Neither is independently configurable via a config key. The formalization therefore reduces to $(r, \boldsymbol{\theta}_r)$ [author-analysis based on code-confirmed interface].
With 5 memory types and typical hyperparameter ranges (e.g., window_size $\in \{3, 5, 10, 15, 20\}$, top_k $\in \{1, 3, 5, 10\}$), $|\mathcal{C}|$ is in the range of 15–50 discrete configurations when hyperparameters are discretized, making near-exhaustive comparison feasible.
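The size estimate can be checked by enumerating the discretized space directly. In the sketch below, the window_size and top_k grids are the example values from the text, while the summary_max_length grid is an assumption chosen purely for illustration; `GRIDS` and `enumerate_configs` are hypothetical names, not repository identifiers.

```python
# Hyperparameter grids per memory type. The summary_max_length values are
# an illustrative assumption; the other grids follow the example in the text.
GRIDS = {
    "none": [{}],
    "full": [{}],
    "sliding_window": [{"window_size": k} for k in (3, 5, 10, 15, 20)],
    "retrieval": [{"top_k": k} for k in (1, 3, 5, 10)],
    "summary": [{"summary_max_length": n} for n in (250, 500, 1000, 2000)],
}


def enumerate_configs(grids):
    """Return every (memory_type, hyperparameters) pair in the space."""
    return [(r, theta) for r, thetas in grids.items() for theta in thetas]


configs = enumerate_configs(GRIDS)
print(len(configs))  # 1 + 1 + 5 + 4 + 4 = 15
```

With these grids the space sits at the lower end of the 15–50 range; finer hyperparameter grids move it toward the upper end.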
48.3.2 Objective Function
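Given a benchmark $\mathcal{B}$ with task set $\{t_1, \ldots, t_N\}$, define the evaluation metric as task-level accuracy (success rate):
$$s(\mathcal{M}, t_i) = \begin{cases} 1 & \text{if the agent configured with } \mathcal{M} \text{ solves } t_i \\ 0 & \text{otherwise} \end{cases}$$
This is the scoring function computed per-task by the benchmark evaluator and recorded in the per_task[i].score field of the JSON output [readme-confirmed]. The aggregate metric over the benchmark:
$$\hat{F}(\mathcal{M}) = \frac{1}{N} \sum_{i=1}^{N} s(\mathcal{M}, t_i)$$
This corresponds to the metrics.accuracy field in the JSON output. The configuration search objective is then:
$$\mathcal{M}^* = \arg\max_{\mathcal{M} \in \mathcal{C}} \hat{F}(\mathcal{M})$$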
Single-objective, unconstrained formulation. As implemented, SimpleMem optimizes accuracy only. Cost (token usage) is measured and logged in metrics.total_tokens but is not a constraint or secondary objective in the evaluation protocol. A cost-constrained variant $\max_{\mathcal{M}} \hat{F}(\mathcal{M})$ s.t. $C(\mathcal{M}) \leq B$ is natural but not implemented [author-analysis].
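Since both accuracy and token usage are logged, the cost-constrained variant can be applied post hoc to completed runs. The following sketch is not part of SimpleMem: `best_under_budget` is a hypothetical helper, the configuration labels are abstract, and the numbers are placeholders used only to exercise the code.

```python
def best_under_budget(results, budget=None):
    """Pick the highest-accuracy result whose token cost fits the budget.

    results: list of dicts with 'config', 'accuracy', and 'total_tokens'.
    budget:  maximum allowed total_tokens, or None for the unconstrained case.
    """
    feasible = [r for r in results
                if budget is None or r["total_tokens"] <= budget]
    if not feasible:
        return None
    # Ties on accuracy are broken in favor of lower token usage
    return max(feasible, key=lambda r: (r["accuracy"], -r["total_tokens"]))


# Placeholder values, not measurements
results = [
    {"config": "cfg_a", "accuracy": 0.6, "total_tokens": 9000},
    {"config": "cfg_b", "accuracy": 0.5, "total_tokens": 4000},
    {"config": "cfg_c", "accuracy": 0.25, "total_tokens": 1000},
]
print(best_under_budget(results)["config"])               # unconstrained
print(best_under_budget(results, budget=5000)["config"])  # budget-constrained
```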
48.3.3 Stochasticity and Reliability
Three sources of stochasticity affect $\hat{F}$:
- LLM sampling. The paper reports using temperature $= 0$ for all experiments [paper-reported]. At temperature 0, most API providers return (near-)deterministic outputs, reducing but not eliminating variation (server-side batching and quantization can still cause drift).
- Task sampling. $\hat{F}$ is computed over a fixed task set $\{t_1, \ldots, t_N\}$ per benchmark, not a random sample. Reliability therefore depends on $N$ and the coverage of the fixed set.
- API version drift. Model behavior changes across API updates. The paper does not specify API version pinning [paper-reported: not mentioned].
The paper does not report running each configuration over multiple independent seeds or repeated trials [paper-reported: not mentioned]. Consequently, variance estimates and confidence intervals for $\hat{F}$ differences are not available. The fixed temperature partially mitigates this, but readers should interpret small score differences with caution.
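Even without repeated runs, the per-task scores in the JSON output allow a rough uncertainty estimate for the difference between two configurations evaluated on the same fixed task set. The sketch below uses a paired percentile bootstrap over tasks; this is a standard technique swapped in for illustration, not something the paper or repository provides, and the score vectors are synthetic. It captures task-sampling variability only, not API-level non-determinism, which would still require repeated runs.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Both score lists must cover the same tasks in the same order.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must cover the same tasks in the same order")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic 0/1 per-task scores for two configurations on 30 shared tasks
scores_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] * 3
scores_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1] * 3
observed, lower, upper = paired_bootstrap_ci(scores_a, scores_b)
print(f"diff={observed:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

If the interval for a pair of configurations comfortably includes zero, the observed ranking between them should not be treated as established.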
48.3.4 Evaluation Cost Model
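Each configuration evaluation requires running the agent through all $N$ tasks. The per-evaluation token cost [author-analysis: cost model not implemented in SimpleMem, but token counts are logged]:
$$C(\mathcal{M}) = \sum_{i=1}^{N} \mathrm{tokens}(\mathcal{M}, t_i)$$
where $\mathrm{tokens}(\mathcal{M}, t_i)$ is the token count recorded for task $t_i$ (the per_task[i].tokens field). This quantity is available from the JSON output as metrics.total_tokens [readme-confirmed]. The dominant cost driver varies by memory type: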
| Memory Type | Memory-Specific Overhead | Prompt Length Growth |
|---|---|---|
| none | None | Constant (current observation only) |
| full | None | $O(T)$ — grows linearly with episode length $T$ |
| sliding_window | None | $O(k)$ — bounded by window_size |
| summary | Extra LLM calls for periodic re-summarization | $O(1)$ — bounded by summary_max_length |
| retrieval | Embedding call per observation | $O(k)$ — bounded by top_k |
[Author-analysis: cost scaling derived from mechanism descriptions, not from measured data. Actual token counts would be available from the JSON output of completed runs.]
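The growth patterns in the table can be made concrete by counting how many stored entries each strategy injects into the prompt at a given step. This is a sketch of the scaling argument only: `context_entries` is a hypothetical helper, it assumes the current observation is stored before retrieval (as in the run_task loop of Section 48.2.3), and it counts a summary as one bounded block rather than measuring tokens.

```python
def context_entries(memory_type, step, window_size=5, top_k=3):
    """Number of memory entries injected into the prompt at a 1-indexed step."""
    if step < 1:
        raise ValueError("step is 1-indexed")
    if memory_type == "none":
        return 0                      # retrieve() returns nothing
    if memory_type == "full":
        return step                   # O(T): everything seen so far
    if memory_type == "sliding_window":
        return min(step, window_size) # O(k): bounded by window_size
    if memory_type == "summary":
        return 1                      # O(1): one bounded summary block
    if memory_type == "retrieval":
        return min(step, top_k)       # O(k): bounded by top_k
    raise ValueError(f"Unknown memory type: {memory_type}")


for mtype in ("none", "full", "sliding_window", "summary", "retrieval"):
    print(mtype, [context_entries(mtype, t) for t in (1, 5, 10, 50)])
```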
48.4 Search Methodology and Automation Level
48.4.1 What SimpleMem Actually Implements
SimpleMem's configuration comparison process [readme-confirmed] [paper-reported]:
- Researcher selects configurations. YAML config files or CLI flags to run.py.
- Automated evaluation. For each configuration, the harness runs the agent through all benchmark tasks, producing per-task and aggregate scores. This step is fully automated.
- Result comparison. Researcher compares JSON output files in results/.
- Iterative refinement. Based on results, the researcher evaluates additional configurations. This step is manual.
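The manual selection step can be partly scripted by generating one run.py invocation per configuration. The sketch below only builds the command lines, using the flags that appear in the README examples; the helper `build_command` and the output directory naming scheme are assumptions, and actually launching the commands (for example with `subprocess.run`) is left to the reader.

```python
import shlex


def build_command(memory_type, benchmark, model="gpt-4o",
                  output_root="results", **hparams):
    """Build the argument list for one run.py evaluation."""
    # Output directory tag, e.g. "sliding_window_window_size5" (assumed scheme)
    tag = "_".join([memory_type] + [f"{k}{v}" for k, v in sorted(hparams.items())])
    cmd = ["python", "run.py",
           "--memory_type", memory_type,
           "--model", model,
           "--benchmark", benchmark,
           "--output_dir", f"{output_root}/{tag}/"]
    for key, value in sorted(hparams.items()):
        cmd += [f"--{key}", str(value)]
    return cmd


commands = [build_command("none", "screenspot")]
commands += [build_command("sliding_window", "screenspot", window_size=k)
             for k in (3, 5, 10)]
for cmd in commands:
    print(shlex.join(cmd))
```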
48.4.2 Relationship to Automated Search
SimpleMem provides two of three prerequisites for fully autonomous memory design:
| Prerequisite | Status | Evidence |
|---|---|---|
| Composable configuration space | Implemented | get_memory() factory + YAML configs [code-confirmed] |
| Automated evaluation with structured output | Implemented | run.py → JSON with metrics.accuracy [code-confirmed] |
| Informed search controller | Not implemented | No search loop in repo [readme-confirmed] |
The infrastructure—configuration-driven evaluation with machine-readable JSON output—would serve as the evaluation oracle in a fully autonomous system. This is why SimpleMem is included in Part P07 despite not being fully autonomous: it provides enabling infrastructure for autonomous memory design discovery [author-analysis].
48.4.3 Convergence in Manual Exploration
With $|\mathcal{C}|$ in the 15–50 range, a researcher can evaluate all five memory types, then sweep key hyperparameters for the best-performing type. The paper's protocol follows this pattern: first compare memory types with default hyperparameters, then report parameter sensitivity within top-performing types [paper-reported].
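The two-stage protocol (types first, then hyperparameters of the winner) is simple enough to express as a function over an evaluation oracle. This is a sketch of what a minimal controller could look like, not something the repository implements; `two_stage_search` and `fake_evaluate` are hypothetical, and the stub scores are arbitrary numbers with no relation to the paper's results.

```python
def two_stage_search(evaluate, defaults, sweeps):
    """Stage 1: compare memory types at default hyperparameters.
    Stage 2: sweep hyperparameters of the best type only.

    evaluate(memory_type, hparams) must return an accuracy in [0, 1].
    Returns (memory_type, hparams, accuracy) of the best configuration found.
    """
    stage1 = {r: evaluate(r, hp) for r, hp in defaults.items()}
    best_type = max(stage1, key=stage1.get)
    best = (best_type, defaults[best_type], stage1[best_type])
    for hp in sweeps.get(best_type, []):
        score = evaluate(best_type, hp)
        if score > best[2]:
            best = (best_type, hp, score)
    return best


def fake_evaluate(memory_type, hparams):
    # Synthetic stand-in for launching run.py and reading metrics.accuracy
    base = {"none": 0.1, "sliding_window": 0.4, "retrieval": 0.3}[memory_type]
    return base + 0.01 * hparams.get("window_size", 0)


defaults = {"none": {}, "sliding_window": {"window_size": 5},
            "retrieval": {"top_k": 3}}
sweeps = {"sliding_window": [{"window_size": 3}, {"window_size": 10}]}
best = two_stage_search(fake_evaluate, defaults, sweeps)
print(best[0], best[1])
```

The greedy structure is also the protocol's weakness: a type that loses at its default hyperparameters is never swept, so a poorly chosen default can hide the true optimum.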
48.5 Experimental Evidence
48.5.1 Evaluation Protocol
| Protocol Element | Value | Evidence |
|---|---|---|
| Primary LLM backends | GPT-4o (multimodal), GPT-4V (multimodal), GPT-3.5-Turbo (text ablations) | paper-reported |
| Memory configs compared | 5 types: none, full, sliding_window, summary, retrieval | code-confirmed |
| Temperature | Fixed at 0.0 for all runs | paper-reported |
| Primary metric | Task accuracy / success rate (SR) | paper-reported |
| Repeated runs / seeds | Not specified — paper does not report multiple independent runs or variance | paper-reported (absent) |
| Hyperparameter sweep | Multiple values of window_size and top_k reported for sensitivity analysis | paper-reported |
| Token / cost reporting | Token counts logged in JSON output; whether paper reports per-config cost comparisons: not confirmed | readme-confirmed (JSON schema) |
48.5.2 Benchmark Suite
| Benchmark | Domain | Modality | Key Characteristic | Memory Demand |
|---|---|---|---|---|
| ScreenSpot | GUI grounding | Text + screenshot | Element localization on device screens | Low–moderate: primarily recent-context dependent |
| AssistantBench | Web assistant tasks | Text + web content | Multi-step web interaction, cross-page recall | High: requires long-range information retrieval |
| AndroidWorld | Mobile device control | Text + screenshot | Long-horizon device interaction across apps | Moderate: recency + some cross-screen recall |
| OSWorld | Desktop OS control | Text + screenshot | Complex desktop environment tasks | High: sequential observations, long episodes |
48.5.3 Quantitative Results
The paper reports results across a full cross-product of 5 memory types × 4 benchmarks × 2–3 LLM backends [paper-reported]. The following findings are consistently stated in the paper's analysis:
48.5.4 Key Findings
| ID | Finding | Evidence Pattern | Benchmarks Most Relevant |
|---|---|---|---|
| F1 | Memory consistently outperforms no-memory | All memory-equipped configurations exceed the none baseline on tasks requiring cross-step information. Improvement magnitude increases with episode length. | All four; largest gap on AndroidWorld, OSWorld |
| F2 | Simple methods suffice | Well-configured sliding_window and retrieval match or exceed summary and full on evaluated benchmarks. | All four |
| F3 | Optimal type is task-dependent | sliding_window wins on recency-dominated tasks; retrieval wins on tasks requiring long-range recall. No single type dominates. | sliding_window best: ScreenSpot, AndroidWorld. retrieval best: AssistantBench, OSWorld. |
| F4 | Inter-type variance > intra-type variance | Performance varies more across types (e.g., sliding_window vs retrieval) than within a type (e.g., window_size=3 vs 10). Type selection is the first-order decision. | All four |
| F5 | Model–memory interaction | GPT-4o shows larger absolute gains from memory than GPT-3.5-Turbo, but relative ranking of memory types is broadly consistent across backends. | Cross-backend comparison |
48.5.5 Interpretation and Limitations of the Evidence
The paper's experimental design has notable methodological strengths and gaps:
Strengths. (1) Controlled comparison: the LLM backend is held constant while varying memory, isolating the memory effect. (2) Fixed temperature reduces sampling variation. (3) Cross-product design (type × benchmark × backend) enables interaction analysis (F5). (4) Token counting in JSON output enables post-hoc cost analysis.
Gaps that limit the strength of the evidence.
- No reported variance. Without repeated runs or confidence intervals, it is impossible to distinguish genuine type-level effects from noise. Temperature 0 mitigates but does not eliminate this concern (API-level non-determinism persists).
- Task count and composition. The paper does not specify the exact number of tasks per benchmark in the portions inspected for this chapter. Small $N$ would inflate the variance of $\hat{F}$ and weaken cross-type comparisons.
- Cost comparison. Although token counts are logged, the paper’s analysis focuses on accuracy rather than cost-adjusted performance. Whether summary memory's extra LLM calls or retrieval memory's embedding costs are quantified is not confirmed.
- Hyperparameter sensitivity. Finding F4 (inter-type > intra-type variance) is the most practically useful result but would be strengthened by reporting the full hyperparameter sweep curves, not just qualitative claims.
Despite these gaps, Findings F3 and F4 together constitute a useful design guideline: choose the memory type based on task characteristics first, then tune hyperparameters second.
48.6 Implementation Patterns
48.6.1 Agent-Memory Interaction Cycle
Each agent step follows the store–retrieve–act cycle. The code in Section 48.2.3 (Agent.run_task()) implements this loop [code-confirmed]. This section analyzes why the cycle architecture produces the observed findings.
The retrieve() call is the point where memory type most directly affects LLM behavior: the same underlying observations are represented entirely differently depending on the strategy. Consider a 10-step GUI interaction episode:
| Memory Type | What retrieve() Returns at Step 10 | Prompt Size Impact |
|---|---|---|
| none | Empty string | Minimal |
| full | All 10 prior observations concatenated | 10× observation size |
| sliding_window | Steps 6–10 (with window_size=5) | 5× observation size |
| summary | Compressed text summary of steps 1–10 | Bounded by summary_max_length |
| retrieval | Top-3 most similar past observations to current query | 3× observation size |
This explains Finding F3: on ScreenSpot (short episodes, recent context matters most), sliding window preserves the most useful recent observations without dilution. On AssistantBench (long-range cross-page recall), retrieval can surface relevant information from early steps that sliding window has evicted [author-analysis].
48.6.2 Configuration-Driven Instantiation
The factory pattern in run.py's get_memory() (Section 48.2.2) enables identical evaluation code to run any memory type. This design has three key properties [author-analysis based on code-confirmed factory pattern]:
- Single code path. The agent loop in agent.py is memory-type-agnostic: it calls store(), retrieve(), and reset() without knowing the concrete strategy. This ensures evaluation conditions are identical except for memory behavior.
- Per-task scoring. The JSON output's per_task array supports post-hoc analysis of which tasks benefit from which strategy, which enables Finding F3.
- Token counting. Accumulated in the agent loop and persisted in JSON output, enabling cost comparison across configurations.
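The single-code-path property can be illustrated with a structural interface and two toy strategies driven by the same loop. This is a minimal sketch under the store/retrieve/reset interface described in Section 48.2.4; the `Memory` protocol, the `ToyNoMemory` and `ToyFullHistory` classes, and `run_episode` are hypothetical stand-ins, not the repository's classes.

```python
from typing import Optional, Protocol


class Memory(Protocol):
    """Structural interface shared by all strategies (assumed signature)."""
    def store(self, text: str, image: Optional[bytes] = None, step: int = 0) -> None: ...
    def retrieve(self, query: Optional[str] = None, k: Optional[int] = None) -> str: ...
    def reset(self) -> None: ...


class ToyNoMemory:
    def store(self, text, image=None, step=0):
        pass  # nothing is kept

    def retrieve(self, query=None, k=None):
        return ""

    def reset(self):
        pass


class ToyFullHistory:
    def __init__(self):
        self.entries = []

    def store(self, text, image=None, step=0):
        self.entries.append(f"[Step {step}] {text}")

    def retrieve(self, query=None, k=None):
        return "\n".join(self.entries)

    def reset(self):
        self.entries = []


def run_episode(memory: Memory, observations):
    """Same loop for every strategy: reset, then store and retrieve per step."""
    memory.reset()
    contexts = []
    for step, obs in enumerate(observations):
        memory.store(text=obs, step=step)
        contexts.append(memory.retrieve(query=obs))
    return contexts


for strategy in (ToyNoMemory(), ToyFullHistory()):
    print(type(strategy).__name__, run_episode(strategy, ["open app", "tap menu"]))
```

Because `run_episode` never inspects the concrete class, any difference in the returned contexts is attributable to the memory strategy alone, which is the property the evaluation design relies on.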
48.6.3 Reproducibility
| Challenge | SimpleMem's Approach | Residual Risk |
|---|---|---|
| LLM non-determinism | Temperature = 0 [paper] | API-level variation persists |
| Model version drift | Model name in config + JSON output | No API version pinning confirmed |
| Configuration tracking | Full config logged in JSON output [readme] | Low: enables exact reproduction of settings |
| Benchmark contamination | Affects all configs equally | Relative comparison valid; absolute scores may be inflated |
48.7 Positioning in the Research Landscape
48.7.1 Comparative Analysis: Autonomous Research Systems
| System | Search Domain | Automation Level | Search Object | Multimodal |
|---|---|---|---|---|
| FunSearch (Ch. 3) | Algorithms | Fully automated | Python functions | No |
| OpenELM (Ch. 8) | Programs | Fully automated | Program source code | No |
| AIDE (Ch. 42) | ML experiments | Fully automated | Training configs | No |
| SimpleMem | Memory configs | Eval automated; search manual | $(r, \boldsymbol{\theta}_r)$ tuples | Yes [paper] |
Two structural features distinguish SimpleMem: (1) a constrained, declarative search object (categorical type + bounded numeric hyperparameters, vs. the open-ended program space of FunSearch/OpenELM), making exhaustive comparison feasible; and (2) agent infrastructure as search target—the memory subsystem is modular, so discovered configurations can transfer across agent architectures sharing the same interface [author-analysis].
48.7.2 Comparative Analysis: Memory-Augmented Agent Systems
SimpleMem's design space encompasses and generalizes the fixed memory choices made by earlier agent systems. The following comparison uses five explicit axes to clarify where SimpleMem's framework subsumes, extends, or falls short of prior designs [author-analysis]:
| Axis | MemGPT | Voyager | Generative Agents | SimpleMem |
|---|---|---|---|---|
| Memory granularity | Three tiers: main context, recall storage, archival storage. LLM decides what to promote/demote between tiers. | Single tier: skill library of verified code functions, indexed by description embedding. | Single tier: observation stream with metadata (importance, recency, embedding). | Single tier per strategy. No hierarchical memory. Tier choice itself is the experimental variable ($r$). |
| Eviction policy | LLM-managed: the model explicitly calls functions to move data between tiers. Flexible but costly (extra LLM calls per management decision). | No eviction: skill library grows monotonically. Failed skills are not added; verified ones persist indefinitely. | No eviction: all observations retained. Effective management via retrieval weighting ($\alpha \cdot \text{recency} + \beta \cdot \text{importance} + \gamma \cdot \text{relevance}$). | Strategy-dependent: FIFO (sliding_window), LLM-driven compression (summary), none (full, retrieval). Eviction is coupled to type $r$, not independently configurable. |
| Retrieval policy | LLM-driven: model decides when and what to retrieve via function calls. Retrieval is an agent action, not automatic. | Embedding similarity: query description is embedded, top-$k$ skills retrieved by cosine similarity. | Weighted formula: $\alpha \cdot \text{recency} + \beta \cdot \text{importance} + \gamma \cdot \text{relevance}$ with fixed coefficients. | Strategy-dependent: recency-only (sliding_window), all (full), compressed (summary), similarity (retrieval). The retrieval policy is the primary experimental variable. |
| Multimodal handling | Text-only in the original paper. Later extensions support multimodal. | Text-only: Minecraft skills are stored as code + text descriptions. | Text-only: observations are natural-language descriptions of simulated world. | Native multimodal: text + screenshot observations. Encoding path varies by memory type and backend (Section 48.2.7) [paper]. |
| Automation level | Fixed architecture; no systematic comparison infrastructure. | Fixed architecture; skill accumulation is automated but memory design is not variable. | Fixed architecture; $\alpha, \beta, \gamma$ hand-tuned. | Memory design is the experiment: type $r$ and hyperparameters $\boldsymbol{\theta}_r$ are systematically varied and evaluated. Evaluation automated; search manual. |
What SimpleMem subsumes. Generative Agents' fixed retrieval formula becomes a specific configuration in SimpleMem's space: $r = $ retrieval with similarity-weighted scoring. Voyager's skill-library approach maps to full memory with procedural-knowledge encoding. The key difference is that SimpleMem makes these choices experimental variables rather than fixed design decisions.
What SimpleMem does not cover. MemGPT's hierarchical three-tier architecture—where the LLM manages promotion/demotion between tiers—has no equivalent in SimpleMem's single-tier strategies. Generative Agents' reflection mechanism (periodic higher-order summarization of observations) goes beyond SimpleMem's summary strategy, which compresses observations but does not generate meta-cognitive reflections. Graph-structured memory (as in some knowledge-graph agents) is also absent. These omissions define the boundary of SimpleMem's current configuration space.
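The weighted retrieval rule listed for Generative Agents in the table above can be sketched as follows. This is a minimal illustration: the exponential recency decay, its rate, the importance scale in [0, 1], and the two sample entries are assumptions chosen for the example, not values from either system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(entry, query_vec, now, alpha, beta, gamma, decay=0.99):
    """alpha * recency + beta * importance + gamma * relevance."""
    recency = decay ** (now - entry["t"])           # assumed exponential decay
    relevance = cosine(entry["vec"], query_vec)
    return alpha * recency + beta * entry["importance"] + gamma * relevance


def rank(entries, query_vec, now, alpha, beta, gamma):
    """Sort entries from highest to lowest weighted score."""
    return sorted(
        entries,
        key=lambda e: score(e, query_vec, now, alpha, beta, gamma),
        reverse=True,
    )


# Two toy memory entries (invented for illustration).
entries = [
    {"id": "old_but_relevant", "t": 1, "importance": 0.2, "vec": (1.0, 0.0)},
    {"id": "recent_and_important", "t": 9, "importance": 0.9, "vec": (0.6, 0.8)},
]
query_vec = (1.0, 0.0)

weighted = [e["id"] for e in rank(entries, query_vec, now=10,
                                  alpha=1.0, beta=1.0, gamma=1.0)]
similarity_only = [e["id"] for e in rank(entries, query_vec, now=10,
                                         alpha=0.0, beta=0.0, gamma=1.0)]
print("weighted:       ", weighted)
print("similarity only:", similarity_only)
```

Setting alpha and beta to zero leaves only the relevance term, which corresponds to the similarity-only ranking listed for SimpleMem's retrieval strategy. With all three terms active, the two toy entries swap places.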
48.8 Limitations
48.8.1 Evidence Limitations
- Repository audit depth. The entry point (run.py), agent harness (agent.py), config directory (configs/), and memory module directory (memory/) were inspected. Line-by-line reading of all memory strategy implementations was not performed. The multimodal encoding paths, embedding model selection, and vector storage backend for retrieval memory were not confirmed from source code.
- Experimental result granularity. Exact per-benchmark numerical scores, confidence intervals, task counts, seed counts, and cost figures were not independently recovered. Section 48.5 reports findings at the pattern level; readers should consult the publication directly for exact figures.
48.8.2 System-Level Limitations
- Evaluation cost. Each configuration requires full agent episodes. A thorough cross-product (5 types × multiple hyperparameter values × 4 benchmarks × multiple backends) requires hundreds of evaluation runs. Cost-reduction strategies (cascade evaluation, early stopping) are not described [paper: not mentioned].
- No automated search. The researcher-driven search step limits use in large-scale discovery, though the modest $|\mathcal{C}|$ makes simple automated strategies (grid search) sufficient if added.
- Configuration space completeness. The five memory types cover common strategies but omit: (a) hierarchical memory (MemGPT's three-tier design), (b) graph-structured memory, (c) reflection and meta-cognitive operations, and (d) hybrid strategies (e.g., sliding window + retrieval fallback). See Table 48.5 for details.
- Transferability. Finding F3 demonstrates that configurations optimal on one benchmark may not transfer. No meta-learning mechanism predicts the best configuration for new task distributions [author-analysis].
- Single-tier only. All five strategies operate as single-tier memory. The interaction between memory tiers (e.g., MemGPT's promotion/demotion) is not captured, limiting insights about hierarchical designs.
48.8.3 Safety Considerations
- Memory poisoning. If memory is populated from environment observations during evaluation, adversarial inputs could corrupt stored entries. Input validation for stored memories is not confirmed from code.
- Data retention. Memory stores accumulate potentially sensitive information. Eviction policies optimized for performance may conflict with data retention requirements.
- Execution isolation. Memory configurations operate within the Python process without evidence of sandboxing—typical for research frameworks but relevant for production deployment.
48.9 Future Directions
The extensions below are the chapter author's projections [author-analysis].
48.9.1 Closing the Search Loop
The most natural extension adds an automated search controller that reads JSON evaluation results and proposes configurations:
- Grid/random search. Given $|\mathcal{C}| \approx 15\text{--}50$, exhaustive grid search is feasible. This is the lowest-effort automation step and would transform SimpleMem from an evaluation harness into a complete automated memory design system.
- Bayesian optimization. Over the mixed discrete-continuous space $(r, \boldsymbol{\theta}_r)$, BO with a categorical kernel could navigate efficiently when each evaluation is expensive.
- LLM-guided controller. An LLM reads JSON results and proposes configurations in natural language. The modest space size may not justify the LLM overhead compared to grid search, but could be valuable if the configuration space is expanded (Section 48.9.3).
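The grid-search option is simple enough to sketch end to end. Everything below is hypothetical scaffolding: the search space values, the top_k parameter name, and the toy objective are assumptions. In a real controller, evaluate() would launch a run and read metrics.accuracy from its JSON output.

```python
import itertools

# Hypothetical search space. Only four strategy names appear here because
# those are the ones named in this section; hyperparameter grids are invented.
SEARCH_SPACE = {
    "full": {},
    "sliding_window": {"window_size": [3, 5, 10]},
    "summary": {"summary_max_length": [256, 512]},
    "retrieval": {"top_k": [1, 3, 5]},  # `top_k` is an assumed parameter name
}


def enumerate_configs(space):
    """Yield one dict per (memory type, hyperparameter combination)."""
    for memory_type, grid in space.items():
        names = sorted(grid)
        # product() over zero iterables yields one empty tuple, so types
        # without hyperparameters still produce exactly one configuration.
        for values in itertools.product(*(grid[name] for name in names)):
            yield {"memory_type": memory_type, **dict(zip(names, values))}


def grid_search(space, evaluate):
    """Evaluate every configuration and return the best one with its score."""
    results = [(evaluate(config), config) for config in enumerate_configs(space)]
    best_score, best_config = max(results, key=lambda pair: pair[0])
    return best_config, best_score, results


def toy_evaluate(config):
    """Placeholder objective, NOT a SimpleMem result."""
    if config["memory_type"] == "sliding_window":
        return 1.0 - abs(config["window_size"] - 5) / 10
    return 0.5


configs = list(enumerate_configs(SEARCH_SPACE))
best_config, best_score, results = grid_search(SEARCH_SPACE, toy_evaluate)
print(len(configs), "configurations; best:", best_config, best_score)
```

The number of evaluations equals the size of the enumerated grid per benchmark and backend, so the cost of exhaustive search scales linearly with each added hyperparameter value.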
48.9.2 Joint Prompt and Memory Optimization
The prompts governing how the agent interacts with memory (what to store, how to query, how to format retrieved context) could also be optimized. A joint search over $(\mathcal{M}, \pi) \in \mathcal{C} \times \Pi$, where $\pi$ is a prompt configuration, would explore a richer space. Alternating between memory-configuration and prompt-refinement (as in DSPy-style prompt optimization) would reduce per-step dimensionality.
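The alternating scheme can be sketched as coordinate-wise search over the two factors. The candidate names and the score table below are invented for illustration, and evaluate() is a hypothetical stand-in for a full benchmark run.

```python
def alternating_search(memory_configs, prompt_configs, evaluate, rounds=3):
    """Alternate between sweeping memory (prompt fixed) and prompt (memory fixed)."""
    memory, prompt = memory_configs[0], prompt_configs[0]
    best = evaluate(memory, prompt)
    for _ in range(rounds):
        improved = False
        # Step 1: hold the prompt fixed and sweep memory configurations.
        for candidate in memory_configs:
            candidate_score = evaluate(candidate, prompt)
            if candidate_score > best:
                memory, best, improved = candidate, candidate_score, True
        # Step 2: hold the memory fixed and sweep prompt configurations.
        for candidate in prompt_configs:
            candidate_score = evaluate(memory, candidate)
            if candidate_score > best:
                prompt, best, improved = candidate, candidate_score, True
        if not improved:
            break  # neither sweep found a better pair
    return memory, prompt, best


# Toy score table (invented values, not experimental results).
TOY_SCORES = {
    ("full", "terse"): 0.40,
    ("full", "verbose"): 0.50,
    ("sliding_window", "terse"): 0.60,
    ("sliding_window", "verbose"): 0.55,
    ("retrieval", "terse"): 0.50,
    ("retrieval", "verbose"): 0.70,
}


def toy_evaluate(memory, prompt):
    return TOY_SCORES[(memory, prompt)]


result = alternating_search(
    ["full", "sliding_window", "retrieval"], ["terse", "verbose"], toy_evaluate
)
joint_optimum = max(TOY_SCORES, key=TOY_SCORES.get)
print("alternating result:", result)
print("joint optimum:     ", joint_optimum)
```

Each round costs on the order of the sum of the two candidate counts rather than their product, but the toy table is built to show the trade-off: alternating search can settle on a pair that no single-factor change improves even though a better joint pair exists.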
48.9.3 Expanding the Configuration Space
The five memory types could be extended to include: (a) hierarchical memory with multiple tiers and configurable eviction policies per tier (generalizing MemGPT); (b) graph-structured memory with relational edges supporting multi-hop retrieval; (c) composite strategies combining multiple types (e.g., sliding window for recent context + retrieval for long-range); and (d) adaptive memory where the strategy switches based on task characteristics during execution.
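Option (c) can be illustrated with a composite strategy that keeps the same store/retrieve/reset interface. The class below is a hypothetical sketch rather than anything found in the repository, and token overlap again stands in for embedding similarity.

```python
from collections import deque


class CompositeMemory:
    """Sliding window for recent context plus retrieval over evicted entries."""

    def __init__(self, window_size=3, top_k=2):
        self.recent = deque(maxlen=window_size)  # most recent observations
        self.archive = []                        # every observation, in order
        self.top_k = top_k

    def store(self, observation):
        self.recent.append(observation)
        self.archive.append(observation)

    def retrieve(self, query):
        recent = list(self.recent)
        # Entries that have already fallen out of the window.
        older = self.archive[: len(self.archive) - len(recent)]
        query_tokens = set(query.lower().split())
        scored = [
            (len(query_tokens & set(text.lower().split())), index, text)
            for index, text in enumerate(older)
        ]
        # Highest overlap first; ties broken by original order; drop zero-overlap.
        scored.sort(key=lambda item: (-item[0], item[1]))
        hits = [text for overlap, _, text in scored if overlap > 0][: self.top_k]
        return hits + recent

    def reset(self):
        self.recent.clear()
        self.archive.clear()


memory = CompositeMemory(window_size=3, top_k=2)
for text in [
    "flight price is 420",   # toy observations, invented for illustration
    "open hotel tab",
    "hotel list loaded",
    "click hotel filter",
    "filter by rating",
    "open booking form",
]:
    memory.store(text)

context = memory.retrieve("what was the flight price")
print(context)
```

Because the composite exposes the same three methods, it could in principle be registered as one more memory type without changing the agent loop. Its hyperparameters would then be the union of the window and retrieval settings.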
48.9.4 Meta-Learning Configuration Selectors
Rather than discovering one optimal configuration, a meta-learning extension would learn a policy $\phi: \text{TaskFeatures} \to \mathcal{C}$ that selects configurations based on task characteristics. This directly addresses Finding F3: if different tasks favor different types, a selector outperforms any fixed choice. Training signal would come from SimpleMem's evaluation results across benchmarks.
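The potential gain from a selector can be bounded with a simple calculation over a score table: compare the best single configuration against an oracle that picks the best configuration per task family. The numbers below are toy values invented for illustration and are not results from the paper. A learned selector would approximate the oracle using task features.

```python
# Toy score table: task family -> memory type -> score (invented values).
scores = {
    "task_family_a": {"sliding_window": 0.70, "retrieval": 0.55, "full": 0.60},
    "task_family_b": {"sliding_window": 0.40, "retrieval": 0.65, "full": 0.50},
}


def best_fixed(score_table):
    """Memory type with the highest mean score across all task families."""
    memory_types = next(iter(score_table.values())).keys()
    means = {
        memory_type: sum(row[memory_type] for row in score_table.values())
        / len(score_table)
        for memory_type in memory_types
    }
    winner = max(means, key=means.get)
    return winner, means[winner]


def oracle_policy(score_table):
    """Per-family best memory type: an upper bound on any learned selector."""
    return {family: max(row, key=row.get) for family, row in score_table.items()}


def policy_mean(score_table, policy):
    """Mean score obtained when each family uses its assigned memory type."""
    return sum(score_table[f][policy[f]] for f in score_table) / len(score_table)


fixed_type, fixed_mean = best_fixed(scores)
policy = oracle_policy(scores)
selector_mean = policy_mean(scores, policy)

print("best fixed type:", fixed_type, round(fixed_mean, 3))
print("oracle policy:  ", policy, round(selector_mean, 3))
```

The gap between the two means is the headroom available to a selector. If the gap is small on real evaluation data, a fixed configuration is likely sufficient and the added complexity of meta-learning is hard to justify.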
48.10 Summary
Implementation. The repository implements: (1) a factory-dispatch entry point (run.py / get_memory()) that maps the --memory_type flag to concrete strategy classes in memory/ [code-confirmed]; (2) an agent harness (agent.py / Agent class) implementing the observe–store–retrieve–act loop with a pluggable memory object [code-confirmed]; (3) five interchangeable memory strategies sharing a common store/retrieve/reset interface [code-confirmed]; (4) structured JSON output with per-task scores, aggregate accuracy, and token counts [readme-confirmed].
Findings. The paper demonstrates five key results: (F1) memory consistently improves over no-memory baselines, (F2) simple strategies are competitive, (F3) optimal type varies by task characteristics, (F4) type selection matters more than hyperparameter tuning, (F5) relative rankings are consistent across backends [paper-reported].
Significance for Part P07. SimpleMem targets agent memory as the object of systematic optimization in multimodal settings—a domain not addressed by other autonomous research systems in this survey. While less automated than FunSearch, OpenELM, or AIDE, it demonstrates that the autonomous-design pattern applies to agent infrastructure components, and its evaluation harness provides two of three prerequisites for fully closing the search loop [author-analysis].
| Evidence tier | Scope |
|---|---|
| Code-confirmed | run.py (factory dispatch via get_memory(), argparse CLI with --memory_type/--model/--benchmark), agent.py (Agent class, run_task() loop, memory as constructor parameter), memory/ directory with per-strategy modules sharing store/retrieve/reset interface, configs/ with YAML files. |
| Readme-confirmed | CLI usage patterns, memory type names, JSON output schema (metrics.accuracy, metrics.total_tokens, per_task array), absence of automated search controller. |
| Paper-reported | System goals, multimodal encoding paths (Figure 48.2), LLM backends (GPT-4o, GPT-4V, GPT-3.5-Turbo), benchmark suite (ScreenSpot, AssistantBench, AndroidWorld, OSWorld), findings F1–F5, temperature protocol. |
| Author-analysis | Formalization (Section 48.3), cost model, comparative tables (Table 48.5, Section 48.7), three-prerequisite framework, future directions, safety analysis. Embedding backend, vector storage, and exact multimodal code paths are implementation unknowns. |