The Retrieval-First Memory Bet, Re-examined: What Mem-π and MINTEval Get Right (and What They Miss)
Two papers this week argued retrieval is the wrong default for agent memory. I build a retrieval-first memory database. Here is the honest reckoning.
Two papers landed this week that point at the thing I built and say: you are defaulting to the wrong primitive.
I maintain mnemo, an MCP-native embedded memory database for agents. Its core is retrieval: hybrid search over semantic vectors, BM25, a graph, and recency, fused with reciprocal rank fusion. So when **Mem-π** (ServiceNow Research + Mila) and **MINTEval** (UNC, Mohit Bansal’s lab) both argue, in the same week, that retrieval-from-a-bank is the wrong default for long-horizon agents, I do not get to wave it away. I have to read them as the person whose product is on the line.
Here is the reckoning.
## What the papers actually say
**Mem-π** replaces similarity-based retrieval over an episodic memory bank with a separate model that *generates* guidance on demand. Conditioned on the agent’s current context, that model jointly decides when to emit guidance and what to emit, trained with a decision-content decoupled RL objective so it can abstain when memory would not help. The headline number: **over 30% relative improvement on web-navigation tasks**, against retrieval-based and prior RL-optimized memory baselines. The framing is the interesting part. Retrieval returns static entries that “often misalign with the current context.” Generation produces something fitted to the moment.
**MINTEval** comes at the same problem from the evaluation side. It is a benchmark for memory *under interference*: facts get updated, revised, and contradicted across long contexts (averaging 138.8k tokens, up to 1.8M per instance). It runs 7 representative systems (vanilla long-context LLMs, RAG, and memory-augmented agent frameworks) and finds **27.9% average accuracy**, worst on multi-target aggregation. The diagnosis is blunt: “performance is primarily limited by retrieval and memory construction,” and accuracy “degrades as the number of intervening updates increases.”
Read together, they are not two unrelated results. They are the storage half and the read half of the same argument: storing static entries and pulling them back by similarity breaks once the world stops being static.
## The part they get right
They are right about the failure mode, and I have hit it.
Static recall was always the easy half of memory. Find the fact, return the fact. The hard half is what happens when a fact gets revised five times over a long session, and the agent has to reason over the *current* version, not whichever copy scored highest on cosine similarity. A vector index does not have a notion of “current.” It has a notion of “similar.” Those are not the same thing, and the gap between them is exactly where MINTEval’s 27.9% lives.
mnemo already carries the machinery that should help here: branch/merge/replay timelines, point-in-time `as_of` queries, and a 5-strategy forget system (including consolidation and decay). But carrying the machinery is not the same as defaulting to it. The papers are a useful slap: the default read path, top-k similarity, is the part that fails under interference, and most teams (mine included) ship the default.
## The part I am not selling my index for
Now, the other side, because symmetry matters, and I would rather argue against my own product than flatter it.
“Generate guidance on demand” is not free. It puts another model call on the hot path of every recall. It costs tokens. And it introduces a failure mode that a retrieval log structurally cannot have: the generator inventing a memory that was never stored. A retrieval system can return the wrong entry; it cannot return an entry that does not exist. A generative memory can.
For the workloads I actually ship into (regulated, India-DPDP, HIPAA-adjacent), that distinction is load-bearing. When a compliance reviewer asks, “Where did this fact come from?” an auditable retrieval log with a SHA-256 hash chain answers the question. “A model generated it conditioned on context” does not. Mem-π’s results are real, and on web navigation, where there is no audit requirement and latency tolerance is high, generation may well be the right call. For a financial agent under audit, I will take the boring log every time.
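To make the audit argument concrete, here is a minimal sketch of a hash-chained retrieval log. It is an illustration, not mnemo's implementation: the record fields, the all-zero genesis value, and the function names `append_entry` and `verify_chain` are assumptions I made up for this post. The only real API it leans on is Python's standard `hashlib` and `json`.

```python
import hashlib
import json

# Assumed placeholder "previous hash" for the first entry in the chain.
GENESIS = "0" * 64


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always serializes to the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a retrieval record, binding it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash from the start; any edited record breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each entry commits to everything before it, so rewriting an old record after the fact makes `verify_chain` return `False` unless every later hash is recomputed too. A generated memory has no equivalent record to point at, which is the whole distinction.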
There is also a quieter point in the comparison-content noise around this topic. The 2026 practitioner consensus, across a dozen “Mem0 vs Letta vs Zep vs Cognee” write-ups, is not “switch to generation.” It is “you need both, and the harder question is whether the corpus underneath either one is trustworthy.” A memory layer on top of an ungoverned corpus persists the problem; it does not solve it. That is the part the benchmark race keeps skipping, and it is the part I am paid to care about.
## What I am actually changing
So I am not deleting the vector index. I am doing two narrower things, both of which the papers point at directly.
First, an **interference-eval harness**. MINTEval’s setup is reproducible at a small scale: take a fact, revise it K times across a context, then query the latest value. Most off-the-shelf memory layers answer with an earlier version. If that is the dominant production failure (and I suspect it is), then the right metric to optimize is not recall@k on a static set; it is current-fact accuracy under K revisions. I would rather measure the failure the papers found than assume my hybrid search dodges it.
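A toy version of that metric, to show the shape of the harness rather than the real thing. Everything here is an assumption for illustration: the `MemoryStore` interface is hypothetical (it is not mnemo's API), the facts are synthetic key/value strings instead of natural language, and the two stores are deliberately trivial stand-ins for "answers with an earlier version" and "answers with the latest write".

```python
from typing import Callable, Optional, Protocol


class MemoryStore(Protocol):
    """Hypothetical minimal interface for the system under test."""

    def write(self, key: str, value: str) -> None: ...
    def query(self, key: str) -> Optional[str]: ...


class StaleStore:
    """Toy failure case: always answers with the first version it saw."""

    def __init__(self) -> None:
        self._first: dict = {}

    def write(self, key: str, value: str) -> None:
        self._first.setdefault(key, value)

    def query(self, key: str) -> Optional[str]:
        return self._first.get(key)


class LatestStore:
    """Toy baseline: always answers with the most recent write."""

    def __init__(self) -> None:
        self._latest: dict = {}

    def write(self, key: str, value: str) -> None:
        self._latest[key] = value

    def query(self, key: str) -> Optional[str]:
        return self._latest.get(key)


def current_fact_accuracy(
    make_store: Callable[[], MemoryStore], k_revisions: int, n_facts: int = 20
) -> float:
    """Write each fact, revise it k times, then score answers against the latest value."""
    store = make_store()
    # Interleave versions across facts so every revision has intervening writes.
    for version in range(k_revisions + 1):
        for i in range(n_facts):
            store.write(f"fact-{i}", f"value-{i}-v{version}")
    correct = sum(
        store.query(f"fact-{i}") == f"value-{i}-v{k_revisions}"
        for i in range(n_facts)
    )
    return correct / n_facts
```

With zero revisions both toy stores score 1.0; at `k_revisions=3`, `StaleStore` drops to 0.0 while `LatestStore` stays at 1.0. A static recall@k test would not separate them, which is why the sweep over K is the number worth tracking.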
Second, a **which-fact-is-current resolver** in front of the LLM. Before candidates reach the model, resolve version conflicts using the timeline mnemo already stored: prefer the most recent uncontradicted write, surface the supersession chain as evidence, and let the model see “this was true, then revised” instead of five undated copies. This is governed retrieval, not generation. It keeps the audit log and attacks the exact interference axis MINTEval isolates.
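Roughly, the resolver could look like the sketch below. The data model is an assumption, not mnemo's schema: `Version`, `Resolved`, the integer `written_at` timestamp, and the `contradicted` flag (set when a later write retracts an entry) are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    written_at: int             # assumed logical timestamp, larger means later
    contradicted: bool = False  # assumed flag: a later write retracted this one


@dataclass(frozen=True)
class Resolved:
    key: str
    current: Optional[Version]  # None if every version was contradicted
    chain: tuple                # every version for this key, oldest to newest


def resolve_current(candidates: Iterable[Version]) -> dict:
    """Group candidates by key and pick the most recent uncontradicted write."""
    by_key = defaultdict(list)
    for v in candidates:
        by_key[v.key].append(v)
    resolved = {}
    for key, versions in by_key.items():
        chain = tuple(sorted(versions, key=lambda v: v.written_at))
        current = next((v for v in reversed(chain) if not v.contradicted), None)
        resolved[key] = Resolved(key, current, chain)
    return resolved


def render_evidence(r: Resolved) -> str:
    """Format the supersession chain so the model sees dated, labeled versions."""
    lines = []
    for v in r.chain:
        if v is r.current:
            status = "current"
        elif v.contradicted:
            status = "contradicted"
        else:
            status = "superseded"
        lines.append(f"[t={v.written_at}] {r.key} = {v.value} ({status})")
    return "\n".join(lines)
```

The design choice worth noting: the resolver never drops older versions. It labels them and hands the full chain to the model, so the prompt carries the revision history and the audit trail stays intact.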
## The comparison-shopping trap
There is a quieter trap worth naming, because it is the one most teams will actually fall into. If you go looking for “the best agent memory in 2026,” you will find a dozen comparison posts ranking Mem0 against Letta against Zep against Cognee on recall@k and LongMemEval scores. Those numbers are useful and also beside the point for production. The question that decides whether your agent is safe to ship is not “which store recalls best on a static benchmark.” It is “when this store returns a fact, can I tell where it came from, whether it is the current version, and whether it was tampered with.”
None of the leaderboards measure that. They measure recall on frozen sets. MINTEval is the first widely cited benchmark to even test the interference case, and the comparison posts have not caught up. So a team picks the store that tops the leaderboard, ships it on top of an ungoverned corpus, and inherits every stale fact and poisoned-memory failure the leaderboard never tested for. Swapping in a higher-ranked store does not fix that. It just persists the same problem at a higher recall@k.
The lesson I keep relearning: the memory layer is not where the risk lives. The risk lies in whether the layer can prove what it returned. That is a governance property, not a retrieval-quality property, and it is the one I optimize for because it is the one that survives an audit.
## The honest conclusion
The headline “generated memory beats retrieved memory” is true on the benchmark it was measured on, and misleading as a general directive. Retrieval is not dead. Naive retrieval (static entries, similarity-only, no version model) is dying, and it should. The interesting product is the governed middle: retrieval that knows which fact is current, logs where every answer came from, and can prove it to an auditor.
That gap, between naive retrieval and governed retrieval, is not a threat to the thing I build. It is the roadmap for it.
If you run agent memory in production, I want the data point: are you seeing more “couldn’t find it” failures, or more “found the wrong version” failures? The answer decides which half of this is worth building first.

