MEME (May 12, 2026) introduces six memory evaluation tasks spanning multi-entity and evolving dimensions, including three not scored by prior work: Cascade (downstream effects of a state change), Absence (reasoning over removed entities), and Deletion (post-removal state). The authors evaluate six memory systems across three paradigms on 100 controlled episodes and find that every system fails on dependency reasoning despite adequate static retrieval: average Cascade accuracy is 3% and average Absence accuracy is 1%. They conclude that current memory paradigms do not support the relational updating required for robust long-horizon agent operation.

MEME (arXiv:2605.12477): All Six Evaluated Memory Systems Collapse on Dependency Reasoning; Cascade 3%, Absence 1% Average Accuracy
