MemEye Logo

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered from captions or textual traces alone, so answers can be inferred without preserving fine-grained visual evidence. We introduce MemEye, a framework that evaluates memory capabilities along two dimensions: one measures the granularity of the decisive visual evidence (from scene-level to pixel-level), and the other measures how retrieved evidence must be used (from single-evidence lookup to evolutionary synthesis). Under this framework, we construct a new benchmark across eight life-scenario tasks, with ablation-driven validation gates that assess answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods across four vision-language model backbones, we show that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time.

Dataset Overview

MemEye dataset overview and representative cases

Figure 1: The MemEye dataset overview (left) with inner rings grouping tasks and outer rings showing statistics, and representative example cases (right).

Two-Axis Taxonomy

MemEye XY-axis taxonomy

Figure 2: The MemEye two-axis taxonomy. The X-axis captures the granularity of decisive visual evidence, while the Y-axis captures the required reasoning operation over memory.
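As an illustrative sketch, a question under this taxonomy can be tagged with an (X, Y) cell pairing evidence granularity with the required memory operation. The enum names and intermediate levels below are assumptions for illustration (the source only states the endpoints of each axis: scene-level to pixel-level, and single evidence to evolutionary synthesis), not the benchmark's actual schema:

```python
from dataclasses import dataclass
from enum import IntEnum

class EvidenceGranularity(IntEnum):
    """X-axis: granularity of the decisive visual evidence.
    Intermediate levels (REGION, INSTANCE) are hypothetical."""
    SCENE = 1     # whole-scene context suffices
    REGION = 2    # a sub-region of the frame
    INSTANCE = 3  # a specific object instance
    PIXEL = 4     # fine, pixel-level detail

class MemoryOperation(IntEnum):
    """Y-axis: how retrieved evidence must be used.
    The middle level (MULTI) is hypothetical."""
    SINGLE = 1        # a single piece of evidence answers the question
    MULTI = 2         # several pieces must be combined
    EVOLUTIONARY = 3  # state changes must be synthesized over time

@dataclass(frozen=True)
class TaxonomyCell:
    x: EvidenceGranularity
    y: MemoryOperation

    @property
    def label(self) -> str:
        # e.g. "X4-Y3": pixel-level evidence + evolutionary synthesis
        return f"X{int(self.x)}-Y{int(self.y)}"

cell = TaxonomyCell(EvidenceGranularity.PIXEL, MemoryOperation.EVOLUTIONARY)
print(cell.label)  # -> X4-Y3
```

This labeling convention matches the "Y3" shorthand used in the findings below, under the assumption that Y3 denotes the evolutionary-synthesis row.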

Key Findings

Caption Gap at Fine Granularity

Captions remain competitive for scene- and region-level evidence but leave residual gaps at the instance and pixel levels, even under task-aware captioning.

Retrieval vs. Temporal Authority

Semantic retrieval can confuse relevance with temporal authority, ranking stale evidence above valid updates in over 76% of Y3 cases.
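The failure mode above can be illustrated with a minimal sketch: pure similarity ranking keeps a stale observation on top, while a recency-aware re-rank lets the later update win. All entries, scores, and the blending weight are hypothetical, and this is not the framework's actual retrieval code:

```python
# Each memory entry: (text, semantic_similarity, timestamp)
memories = [
    ("keys on the kitchen counter", 0.92, 1),   # stale observation
    ("keys moved to the blue drawer", 0.85, 5), # later, authoritative update
]

def semantic_rank(mems):
    # Pure semantic similarity: relevance only, no temporal authority,
    # so the stale entry outranks the valid update.
    return sorted(mems, key=lambda m: m[1], reverse=True)

def recency_aware_rank(mems, alpha=0.6):
    # Blend similarity with normalized recency so a newer, still-relevant
    # update can outrank a stale but slightly more similar entry.
    t_max = max(m[2] for m in mems)
    return sorted(
        mems,
        key=lambda m: alpha * m[1] + (1 - alpha) * (m[2] / t_max),
        reverse=True,
    )

print(semantic_rank(memories)[0][0])       # -> keys on the kitchen counter
print(recency_aware_rank(memories)[0][0])  # -> keys moved to the blue drawer
```

A fixed blending weight is only a sketch of the problem, not a fix: the finding is that temporal authority must be modeled explicitly rather than conflated with relevance.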

Vision Alone Is Not Enough

Native visual evidence helps high-X questions but does not by itself solve evolutionary synthesis, suggesting a dissociation between evidence preservation and temporal state selection.

Visual irreplaceability comparison

Figure 3: MemEye exhibits stronger visual irreplaceability than prior long-term memory benchmarks.

Results

Cell-level performance heatmap

Figure 4: Representative method performance across the MemEye matrix using gpt-5.4-mini. Left: Open-ended LLM-as-a-Judge; Right: Multiple-choice exact match (EM).

Supported Methods

| Category | Method | Config | Modality |
|---|---|---|---|
| Full Context | FC-Text | full_context_text_only | Text |
| | FC-Multimodal | full_context_multimodal | Visual |
| Retrieval | SRAG-Text | semantic_rag_text_only | Text |
| | SRAG-Multimodal | semantic_rag_multimodal | Visual |
| Summarization | SimpleMem | simplemem | Text |
| | SimpleMem-MM | simplemem_multimodal | Visual |
| Agentic Memory | A-MEM | a_mem | Text |
| | Reflexion | reflexion | Text |
| | Gen. Agents | gen_agents | Text |
| | MemoryOS | memoryos | Text |
| | M2A | m2a | Visual |
| | MMA | mma | Visual |
| | MIRIX | mirix | Visual |
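The config-to-modality mapping above can be captured as a simple registry, e.g. to select which methods consume native visual evidence. This is an illustrative sketch derived from the table, not the framework's actual API:

```python
# Registry of the 13 supported method configs and their input modality,
# transcribed from the Supported Methods table.
METHOD_CONFIGS = {
    "full_context_text_only": "text",
    "full_context_multimodal": "visual",
    "semantic_rag_text_only": "text",
    "semantic_rag_multimodal": "visual",
    "simplemem": "text",
    "simplemem_multimodal": "visual",
    "a_mem": "text",
    "reflexion": "text",
    "gen_agents": "text",
    "memoryos": "text",
    "m2a": "visual",
    "mma": "visual",
    "mirix": "visual",
}

# Methods that operate on native visual evidence rather than text traces.
visual_methods = [cfg for cfg, mod in METHOD_CONFIGS.items() if mod == "visual"]
print(len(METHOD_CONFIGS), len(visual_methods))  # -> 13 6
```

Six of the thirteen configs are visual, which is what makes the caption-gap and vision-alone findings above directly comparable across modalities.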

Citation

Coming soon.