
GAM: A Dual-Agent Memory Architecture Tackling Context Rot


ThinkTools Team

AI Research Lead

Introduction

For the past few years, the rapid expansion of large language models has been nothing short of a technological renaissance. The ability to generate coherent prose, solve equations, and even compose music has turned AI from a niche research curiosity into a commercial powerhouse. Yet beneath the surface of these seemingly superhuman systems lies a stubborn, human‑like flaw: forgetting. When an AI assistant is asked to juggle a sprawling conversation, a multi‑step reasoning task, or a project that stretches over days, it eventually loses the thread. Researchers have dubbed this phenomenon context rot, and it has quietly become one of the most significant obstacles to building AI agents that can function reliably in the real world.

A research team from China and Hong Kong has recently published a paper that offers a promising solution to this problem. Their new architecture, called general agentic memory (GAM), is built to preserve long‑horizon information without overwhelming the model. The core premise is simple yet elegant: split memory into two specialized roles—one that captures everything, and another that retrieves exactly the right things at the right moment. Early results are encouraging, and the timing is apt: as the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM arrives at a natural inflection point.

In this post we unpack the challenges that have driven the need for GAM, explore how the dual‑agent system works, and examine the empirical evidence that shows it outperforms both traditional retrieval‑augmented generation (RAG) pipelines and models with enlarged context windows. We also look ahead to what GAM means for the future of AI memory and the broader field of context engineering.

Main Content

The Limits of Expanding Context Windows

At the heart of every large language model lies a rigid limitation: a fixed “working memory,” more commonly referred to as the context window. Once conversations grow long, older information gets truncated, summarized, or silently dropped. The industry has responded by pushing the size of these windows to the extreme—Mistral’s Mixtral 8x7B introduced a 32K‑token window, MosaicML’s MPT‑7B‑StoryWriter‑65k+ doubled that capacity, and Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 offered 128K and 200K windows, respectively, both extendable to an unprecedented one million tokens. Microsoft’s Phi‑3 also joined the push, vaulting from a 2K‑token limit to a 128K window.

However, simply adding more tokens does not solve the forgetting problem. Even models with sprawling 100K‑token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. As prompts grow longer, the attention mechanism that underpins the model’s ability to locate and interpret information weakens, and accuracy gradually erodes. Longer inputs also dilute the signal‑to‑noise ratio; including every possible detail can actually make responses worse than using a focused prompt. Moreover, the computational cost of processing more tokens is non‑trivial: higher input token counts translate directly into higher latency and higher API usage costs.

Why Memory Matters Economically

For most organizations, supersized context windows come with a clear downside—they’re costly. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context. The tension is stark: memory is essential to making AI more powerful, yet the very mechanism that provides memory—large context windows—drives up cost.
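As a rough back-of-the-envelope illustration of how input tokens drive spend (the per-token price and request volume below are hypothetical placeholders, not any vendor's actual rates), compare shipping the full raw history on every call with shipping a small, focused briefing:

```python
# Hypothetical pricing used only to illustrate how input-token volume drives cost.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # placeholder $/1K input tokens
REQUESTS_PER_DAY = 10_000

def daily_input_cost(tokens_per_request: int) -> float:
    """Daily spend on input tokens for a given prompt size."""
    return tokens_per_request / 1000 * PRICE_PER_1K_INPUT_TOKENS * REQUESTS_PER_DAY

full_history = daily_input_cost(100_000)   # ~100K tokens of raw history per call
focused_brief = daily_input_cost(4_000)    # ~4K-token retrieved briefing per call

print(f"full history:     ${full_history:,.0f}/day")   # $5,000/day
print(f"focused briefing: ${focused_brief:,.0f}/day")  # $200/day
```

The absolute numbers are invented, but the ratio is the point: prompt size scales the bill linearly, so trimming what the model sees matters as much as what the model can see.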

As context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Relying on ever‑larger windows quickly becomes an unsustainable strategy for long‑term memory. Fixes like summarization and retrieval‑augmented generation (RAG) aren’t silver bullets either. Summaries inevitably strip away subtle but important details, and traditional RAG, while strong on static documents, tends to break down when information stretches across multiple sessions or evolves over time. Even newer variants, such as agentic RAG and RAG 2.0, still inherit the same foundational flaw of treating retrieval as the solution, rather than treating memory itself as the core problem.

GAM’s Dual‑Agent Architecture

If memory is the real bottleneck, and retrieval can’t fix it, then the gap needs a different kind of solution. That’s the bet behind GAM. Instead of pretending retrieval is memory, GAM keeps a full, lossless record and layers smart, on‑demand recall on top of it, resurfacing the exact details an agent needs even as conversations twist and evolve. A useful way to understand GAM is through a familiar idea from software engineering: just‑in‑time (JIT) compilation. Rather than precomputing a rigid, heavily compressed memory, GAM keeps things light and tight by storing a minimal set of cues, along with a full, untouched archive of raw history. Then, when a request arrives, it “compiles” a tailored context on the fly.

This JIT approach is built into GAM’s dual architecture, allowing AI to carry context across long conversations without overcompressing or guessing too early about what matters. The result is the right information, delivered at exactly the right moment.

How the Memorizer and Researcher Work Together

GAM revolves around a simple idea: separate the act of remembering from the act of recalling. It does this with two components: the memorizer and the researcher.

The memorizer captures every exchange in full, quietly turning each interaction into a concise memo while preserving the complete session, untouched, in a searchable page store. It doesn’t compress aggressively or guess what is important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved, and nothing is thrown away.
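To make that division of labor concrete, here is a minimal sketch of what a memorizer could look like. The class names and fields (MemoryPage, Memorizer, a page store kept as a plain dictionary) are illustrative assumptions rather than the paper's actual API; the key property is that raw content is stored verbatim and only lightweight cues are layered on top.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class MemoryPage:
    """One interaction, kept verbatim plus lightweight retrieval aids."""
    page_id: str
    raw_content: str                # the full exchange, never truncated or rewritten
    summary: str                    # optional short memo for quick scanning
    metadata: dict = field(default_factory=dict)

class Memorizer:
    """Illustrative sketch: capture everything, compress nothing."""

    def __init__(self, summarize):
        self.pages: dict[str, MemoryPage] = {}   # the lossless page store
        self.summarize = summarize               # e.g. a cheap summarization call

    def record(self, raw_content: str, **metadata) -> str:
        """Archive one exchange in full and return its page ID."""
        page_id = str(uuid.uuid4())
        self.pages[page_id] = MemoryPage(
            page_id=page_id,
            raw_content=raw_content,                 # full fidelity, nothing discarded
            summary=self.summarize(raw_content),     # lightweight cue only
            metadata={"timestamp": time.time(), **metadata},
        )
        return page_id
```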

When the agent needs to act, the researcher takes over. It plans a search strategy and conducts layered searches across the page store, blending vector retrieval, keyword methods such as BM25 and direct lookups by page ID. It then evaluates the findings, identifies gaps and keeps searching until it has enough evidence to produce a confident answer, much like a human analyst working through old notes and primary documents. The result of those repeated cycles of searching, integrating and reflecting is a clean, task‑specific briefing.
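The recall side can be sketched in the same spirit. The loop below blends a crude lexical score with a cosine similarity over pluggable embeddings and keeps searching until a judge function is satisfied. The helper callables (embed, judge_sufficient) and the scoring details are assumptions for illustration, not the paper's actual implementation; a production version would use a real BM25 index and vector store.

```python
import math

class Researcher:
    """Illustrative sketch of the recall loop: search, evaluate, refine, repeat."""

    def __init__(self, memorizer, embed, judge_sufficient, max_rounds: int = 3):
        self.memorizer = memorizer                # page store built by the Memorizer
        self.embed = embed                        # text -> vector (stub or real model)
        self.judge_sufficient = judge_sufficient  # (task, evidence) -> (done, refined_query)
        self.max_rounds = max_rounds

    def _keyword_score(self, query: str, text: str) -> float:
        # Crude stand-in for BM25-style lexical matching.
        q, t = set(query.lower().split()), set(text.lower().split())
        return len(q & t) / (len(q) or 1)

    def _vector_score(self, query: str, text: str) -> float:
        # Cosine similarity over whatever embedding function is plugged in.
        qv, tv = self.embed(query), self.embed(text)
        dot = sum(a * b for a, b in zip(qv, tv))
        norm = math.sqrt(sum(a * a for a in qv)) * math.sqrt(sum(b * b for b in tv))
        return dot / norm if norm else 0.0

    def compile_context(self, task: str) -> list[str]:
        """Assemble a task-specific briefing from the lossless page store."""
        evidence: list[str] = []
        query = task
        for _ in range(self.max_rounds):
            # Blend lexical and vector signals, then pull the top pages.
            ranked = sorted(
                self.memorizer.pages.values(),
                key=lambda p: self._keyword_score(query, p.raw_content)
                + self._vector_score(query, p.raw_content),
                reverse=True,
            )
            for page in ranked[:3]:
                if page.raw_content not in evidence:
                    evidence.append(page.raw_content)
            # Reflect: stop once the judge deems the evidence sufficient,
            # otherwise refine the query and search again.
            done, query = self.judge_sufficient(task, evidence)
            if done:
                break
        return evidence
```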

GAM’s power comes from this JIT memory pipeline, which assembles rich, task‑specific context on demand instead of leaning on brittle, precomputed summaries. Its core innovation is simple yet powerful: preserve all information intact and keep every detail recoverable.
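Wiring the two sketches together gives a toy version of that pipeline: the write path archives every exchange in full as it happens, and the read path compiles a briefing only when a task arrives. The stubs passed in here (a truncating summarizer, a vowel-frequency embedding, a size-based judge) are placeholders purely so the example runs end to end.

```python
# Illustrative wiring of the two agents into a JIT memory pipeline (stubs only).
memorizer = Memorizer(summarize=lambda text: text[:200])
researcher = Researcher(
    memorizer,
    embed=lambda text: [text.lower().count(c) / (len(text) or 1) for c in "aeiou"],
    judge_sufficient=lambda task, evidence: (len(evidence) >= 2, task),
)

# Write path: archive exchanges losslessly as the session unfolds.
memorizer.record("User: our launch moved from May 3 to June 10.", session="planning")
memorizer.record("User: keep the June date out of the public FAQ for now.", session="planning")

# Read path: compile a task-specific briefing only when the agent needs it.
briefing = researcher.compile_context("What launch date should the press kit use?")
print(briefing)
```

Because nothing in the archive was summarized away, a later and more specific question can still recover the original wording of either exchange.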

Benchmarking GAM Against RAG and Large Models

To test GAM, the researchers pitted it against standard RAG pipelines and models with enlarged context windows such as GPT‑4o‑mini and Qwen2.5‑14B. They evaluated GAM using four major long‑context and memory‑intensive benchmarks, each chosen to test a different aspect of the system’s capabilities:

  • LoCoMo measures an agent’s ability to maintain and recall information across long, multi‑session conversations, encompassing single‑hop, multi‑hop, temporal reasoning and open‑domain tasks.
  • HotpotQA, a widely used multi‑hop QA benchmark built from Wikipedia, was adapted using MemAgent’s memory‑stress‑test version, which mixes relevant documents with distractors to create contexts of 56K, 224K and 448K tokens—ideal for testing how well GAM handles noisy, sprawling input.
  • RULER evaluates retrieval accuracy, multi‑hop state tracking, aggregation over long sequences and QA performance under a 128K‑token context to further probe long‑horizon reasoning.
  • NarrativeQA is a benchmark where each question must be answered using the full text of a book or movie script; the researchers sampled 300 examples with an average context size of 87K tokens.

Together, these datasets and benchmarks allowed the team to assess both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.

GAM came out ahead across all benchmarks. Its biggest win was on RULER, which benchmarks long‑range state tracking. GAM exceeded 90% accuracy, while RAG collapsed because key details were lost in summaries, and long‑context models faltered as older information effectively “faded” even when technically present.

Clearly, bigger context windows aren’t the answer. GAM works because it retrieves with precision rather than piling up tokens.

Context Engineering and the Future of AI Memory

Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the right information can always be retrieved, even far downstream. The technique’s emergence coincides with the current, broader shift in AI towards context engineering—the practice of shaping everything an AI model sees: its instructions, history, retrieved documents, tools, preferences and output formats.

Context engineering has rapidly eclipsed prompt engineering in importance, and other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed “semantic operating systems” built around lifelong adaptive memory.

However, GAM’s philosophy is distinct: avoid loss and retrieve with intelligence. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime. For agents handling multi‑day projects, ongoing workflows or long‑term relationships, that reliability may prove essential.

Conclusion

The quest for reliable long‑term memory in AI agents has long been hampered by a simple, yet stubborn, limitation: the fixed size of the context window. Expanding that window has proven to be a blunt instrument, offering diminishing returns while inflating cost and latency. GAM’s dual‑agent architecture shows that a more nuanced approach—one that preserves every detail in a lossless archive and retrieves the precise information on demand—can outperform both traditional RAG pipelines and models with gigantic context windows. By treating memory as an engineering challenge rather than a brute‑force problem, GAM paves the way for AI systems that can maintain continuity, track evolving tasks and recall past interactions with precision.

As AI agents transition from clever demos to mission‑critical tools, the ability to remember long histories becomes not just a nice‑to‑have but a foundational requirement. GAM offers a practical path toward that future, signaling what may be the next major frontier in AI: smarter memory systems and the context architectures that make them possible.

Call to Action

If you’re building or deploying AI agents that need to remember complex, multi‑step interactions over time, consider exploring GAM’s dual‑agent memory architecture. Experiment with a JIT‑style memory pipeline that keeps a full, lossless record while retrieving the exact details an agent needs when it needs them. Reach out to the research team, try their open‑source implementation, and evaluate how GAM performs on your own workloads. By investing in smarter memory now, you’ll future‑proof your applications against the inevitable challenges of context rot and unlock the full potential of generative AI.
