Mixture-of-Memories (MoM): The “Linear Attention” Breakthrough That Doesn’t Forget Long Contexts

Mixture-of-Memories (MoM) is a new sequence modeling architecture (released in February 2025) that promises “linear attention” without the usual amnesia. In other words, MoM retains long-term information far better than previous efficient Transformer alternatives, yet keeps linear time complexity. Developed by researchers from Shanghai AI Lab and collaborators, MoM introduces multiple independent memory states guided by a token router to eliminate “memory interference” – the tendency of new inputs to overwrite old memories ar5iv.labs.arxiv.org. Thanks to this design, MoM achieves strong recall on long sequences, outperforming prior linear models and even rivaling full Transformer accuracy on recall-intensive tasks ar5iv.labs.arxiv.org. Early commentators have hailed it as a breakthrough that bridges the gap between efficient models and standard Transformers, without forgetting the past.
The Memory Interference Problem in Linear Models
Traditional Transformers are powerful at modeling long-range dependencies but suffer from quadratic complexity, making long sequences computationally expensive ar5iv.labs.arxiv.org. Recent linear sequence models – such as linear attention Transformers, state-space models (SSMs), and linear RNNs – tackle this by compressing the entire input into a single fixed-size hidden state (a single “memory”) for efficiency ar5iv.labs.arxiv.org. Unfortunately, this extreme compression comes at a cost: limited memory capacity and severe memory interference. As new information streams in, it overwrites the lone memory vector, causing previous information to degrade ar5iv.labs.arxiv.org. One commentator quipped that compressing an entire sequence into one state is “like trying to cram your entire life’s memories into a single, slightly dented USB drive. Something’s gotta give.” scitechaccess.com In other words, these efficient models often “forget” earlier context, leading to subpar performance on tasks requiring long-term recall.
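For contrast with what follows, here is a bare-bones sketch of the single-state recurrence behind vanilla linear attention (illustrative PyTorch with made-up shapes, not any specific library’s implementation): every token writes into the same fixed-size matrix, which is exactly where the interference comes from.

```python
# Illustrative single-memory linear-attention recurrence (toy shapes, hypothetical weights).
import torch

def linear_attention_scan(tokens, w_q, w_k, w_v):
    """Every token reads from and writes to ONE shared memory matrix S."""
    d_k, d_v = w_k.shape[1], w_v.shape[1]
    S = torch.zeros(d_k, d_v)                    # the single fixed-size memory
    outputs = []
    for x in tokens:                             # one pass, linear in sequence length
        S = S + torch.outer(x @ w_k, x @ w_v)    # new info is piled onto the same state
        outputs.append((x @ w_q) @ S)            # every read mixes everything stored so far
    return torch.stack(outputs)
```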
By contrast, a standard Transformer avoids this issue by storing separate key-value vectors for each token, essentially keeping an independent memory for every position ar5iv.labs.arxiv.org. This grants Transformers virtually unlimited memory capacity and no interference between tokens’ representations – at the expense of much higher computation and memory usage. Prior research attempted to mitigate forgetting in linear models by adding forget gates or by enlarging the single memory state ar5iv.labs.arxiv.org. While such tweaks (inspired by RNN gating mechanisms) can slow down information decay, they only partially alleviate the problem ar5iv.labs.arxiv.org. Ultimately, packing all knowledge into one state remained a fundamental bottleneck. As the MoM authors summarize, “When new information overwrites the single memory, previously stored representations may degrade, negatively impacting long-term recall.” ar5iv.labs.arxiv.org The field clearly needed a new approach to retain rich long-term information without sacrificing efficiency.
MoM’s Solution: Multiple Memories and Token Routing (No More “Amnesia”)
Mixture-of-Memories (MoM) tackles the above problem by breaking the single-memory paradigm. Inspired by neuroscience – notably how the human hippocampus uses distinct oscillatory cycles to store multiple items without interference ar5iv.labs.arxiv.org – MoM maintains multiple independent memory states in parallel, rather than one monolithic state ar5iv.labs.arxiv.org. Each memory can specialize in storing different parts or aspects of the sequence. A router network learns to dynamically assign each incoming token to one (or a few) of these memory slots, deciding which memory should store that token’s information themoonlight.io. In practice, the router computes an importance score for each token-memory pair (via a learned projection) and activates the top-$k$ most relevant memory modules for that token themoonlight.io. This means at each time step, a token updates only a subset of memory states, leaving the others untouched – preventing new inputs from overwriting all existing info.
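To make the routing step concrete, here is a minimal sketch of that gating logic (PyTorch, with hypothetical names like `w_router` and toy shapes; the authors’ released code may differ in detail): a learned projection scores the token against every memory slot, and only the top-$k$ slots are activated.

```python
# Minimal sketch of MoM-style token routing (illustrative, not the official implementation).
import torch
import torch.nn.functional as F

def route_token(x: torch.Tensor, w_router: torch.Tensor, top_k: int = 2):
    """Score one token against every memory slot and keep the top-k.

    x        : (d_model,) token representation
    w_router : (d_model, n_mem) learned routing projection
    returns  : indices of the activated slots and their normalized gate weights
    """
    scores = x @ w_router                        # (n_mem,) importance scores
    top_scores, top_idx = scores.topk(top_k)     # activate only the k best-scoring slots
    gates = F.softmax(top_scores, dim=-1)        # normalized mixing weights
    return top_idx, gates
```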
Once assigned, the token’s content is written into the selected memory state(s) via a simple update rule. The authors implement this as generating a key vector and value vector from the token, then adding their outer-product to the memory matrix (akin to a rank-1 update) themoonlight.io. In essence, the memory module accumulates information from tokens routed to it, much like an independent “expert” focusing on a subset of the sequence. Crucially, memory states that are not activated by a token remain unchanged, preserving their previous content intact ar5iv.labs.arxiv.org. This sparse update strategy is the key to avoiding interference – different streams of information do not collide in the same storage. As the team explains, “each input token selectively activates and updates memory states, leaving non-activated states unchanged to avoid interference from the current input.” ar5iv.labs.arxiv.org
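Continuing the same toy example (again a hedged sketch rather than the paper’s exact rule for every memory variant), the write step adds a rank-1 outer product of the token’s key and value to each activated slot and leaves the rest alone:

```python
import torch

def write_memories(memories, token, w_k, w_v, top_idx):
    """Rank-1 write into only the activated memory slots.

    memories : (n_mem, d_k, d_v) stack of independent memory matrices
    token    : (d_model,) one token's representation
    w_k, w_v : (d_model, d_k) and (d_model, d_v) key/value projections
    """
    k = token @ w_k                              # key vector  (d_k,)
    v = token @ w_v                              # value vector (d_v,)
    for slot in top_idx.tolist():
        memories[slot] += torch.outer(k, v)      # update only the chosen slots
    # non-activated slots are never touched, so their contents are preserved
    return memories
```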
After updating the relevant memories, MoM produces its output for that time step by aggregating across the memory states. Essentially, it computes a weighted mixture of the activated memories (hence “mixture-of-memories”), querying each with the current token’s query vector and summing the results themoonlight.io. In this way, the model can draw on multiple distinct memory traces to inform its predictions, rather than a single blended trace. Notably, MoM also includes a “shared” global memory that every token updates (alongside the specialized ones) ar5iv.labs.arxiv.org. This shared memory continuously accumulates general context from the whole sequence, ensuring that no information is completely lost even if it is not captured in a specific slot themoonlight.io. It acts as an always-on background memory for very long-term dependencies.
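The read side, in the same illustrative terms: the token’s query probes each activated memory, the router’s gate weights mix those reads, and the shared memory contributes unconditionally. (How exactly the readout is formed depends on which linear-memory variant MoM wraps; this is just one plausible rendering under the assumptions above.)

```python
def read_memories(memories, shared_memory, token, w_q, top_idx, gates):
    """Query the activated memories plus the shared memory and mix the results.

    shared_memory : (d_k, d_v) global memory updated by every token
    w_q           : (d_model, d_k) query projection
    gates         : (top_k,) router weights for the activated slots
    """
    q = token @ w_q                              # query vector (d_k,)
    out = q @ shared_memory                      # shared memory always contributes
    for slot, g in zip(top_idx.tolist(), gates.tolist()):
        out = out + g * (q @ memories[slot])     # gated read from each activated slot
    return out                                   # (d_v,) mixture-of-memories output
```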
The overall design is reminiscent of a mixture-of-experts architecture applied to sequence memory: each memory slot is like an expert specializing in part of the sequence, and the router directs tokens to the appropriate slots. Indeed, commentators noted that “Mixture-of-Experts have been doing something similar for ages” in splitting workloads across modules, though MoM applies the idea to the model’s internal memory state rather than to expert output layers. The MoM authors themselves highlight that this “sparsely and [virtually] unbounded expansion of memory” breaks free from the usual routine of just tweaking gates or RNN-style updates ar5iv.labs.arxiv.org. By separating memory into multiple compartments, MoM dramatically expands memory capacity and eliminates interference, while still operating with linear time complexity ar5iv.labs.arxiv.org. Because each memory update costs a fixed amount of work and only a fixed number of memories are activated per token, the overall computation scales linearly with sequence length, with a small constant factor for the activated memories. Importantly, MoM’s inference cost remains constant per step – like other linear models, it doesn’t grow with context length, since the model never needs to attend across all prior tokens at each step arxiv.org ar5iv.labs.arxiv.org. In summary, MoM finds a sweet spot between Transformers’ explicit memory for every token and linear models’ efficient compression, yielding an architecture that “significantly enhances memory capacity while retaining linear efficiency.” ar5iv.labs.arxiv.org
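Tying the sketches above together, a toy sequence loop makes the complexity claim visible: the state (a fixed stack of memory matrices plus one shared matrix) never grows with context length, so each step costs the same and a full pass is linear in sequence length. It reuses `route_token`, `write_memories`, and `read_memories` from the earlier snippets, and every dimension and weight here is invented for illustration.

```python
import torch

def mom_layer(tokens, w_router, w_k, w_v, w_q, n_mem=4, top_k=2):
    """Toy single-layer MoM pass: fixed-size state, one routed update and read per token."""
    d_k, d_v = w_k.shape[1], w_v.shape[1]
    memories = torch.zeros(n_mem, d_k, d_v)      # specialized memory slots
    shared = torch.zeros(d_k, d_v)               # always-on shared memory
    outputs = []
    for x in tokens:                             # single pass -> O(sequence length)
        top_idx, gates = route_token(x, w_router, top_k)
        shared += torch.outer(x @ w_k, x @ w_v)  # every token updates the shared memory
        memories = write_memories(memories, x, w_k, w_v, top_idx)
        outputs.append(read_memories(memories, shared, x, w_q, top_idx, gates))
    return torch.stack(outputs)                  # (seq_len, d_v)

# Toy usage: 16 tokens of width 64, routed across 4 memories with top-2 activation.
tokens = torch.randn(16, 64)
w_router, w_q = torch.randn(64, 4), torch.randn(64, 32)
w_k, w_v = torch.randn(64, 32), torch.randn(64, 32)
y = mom_layer(tokens, w_router, w_k, w_v, w_q)   # y.shape == (16, 32)
```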
Recall Performance: Outshining Linear Peers and Rivalling Transformers
The true test of MoM is whether this new architecture translates into better performance on tasks that demand long-term reasoning and recall. Experimental results so far are very promising. Across a variety of language modeling and understanding benchmarks, MoM consistently outperforms state-of-the-art linear models (prior linear Transformers, SSMs, and gated RNN variants) and in many cases closes the gap to a full Transformer ar5iv.labs.arxiv.org. For instance, on language modeling and commonsense reasoning benchmarks (WikiText, LAMBADA, ARC, HellaSwag, PiQA, WinoGrande, etc.), a MoM-based model not only beat other efficient models but even surpassed a standard Transformer baseline of similar scale ar5iv.labs.arxiv.org. This is striking – it suggests MoM’s recall abilities can match or exceed a vanilla Transformer on certain tasks, despite using linear complexity. The authors attribute this to MoM’s ability to avoid forgetting: Transformers had long been superior on “recall-intensive tasks” simply by virtue of storing more information scitechaccess.com ar5iv.labs.arxiv.org, but MoM narrows that advantage by remembering past tokens almost as robustly, without the quadratic cost.
To specifically test long-range capability, the researchers evaluated MoM on LongBench, a benchmark suite for long-context understanding (covering summarization, few-shot learning, code completion, synthetic long-range tasks, etc.) ar5iv.labs.arxiv.org. MoM achieved the highest overall score among comparable models, outperforming recent linear-recurrent baselines like RetNet and Gated DeltaNet in categories such as summarization and code tasks ar5iv.labs.arxiv.org. For example, MoM’s average score across LongBench tasks was about 15.6, versus 13–14 for the next-best linear models ar5iv.labs.arxiv.org – a significant leap in this regime. This demonstrates that MoM’s multi-memory mechanism isn’t just a theoretical fix; it measurably improves comprehension of very long inputs in practice. The team also ran an ablation study to confirm the benefit of multiple memories over a single large memory. They gave a baseline model the same total memory size as MoM (but as one unified state) and found that it performed worse than MoM’s separated memories ar5iv.labs.arxiv.org. In other words, simply having more capacity isn’t enough – it’s the separation of that capacity into distinct memory slots that reduces interference and boosts recall ar5iv.labs.arxiv.org. As the paper reports, using multiple mixed memories yielded greater improvement than an expanded single memory, “confirming that mixed memory can effectively reduce interference from different inputs” ar5iv.labs.arxiv.org. Notably, MoM with targeted memory modules plus a forget gate outperformed a model that relied only on a forget gate, underscoring that isolating memories is more effective than gating alone ar5iv.labs.arxiv.org.
Despite introducing multiple memory modules, MoM retains impressive efficiency. The authors show that inference time and memory usage scale linearly with sequence length, in contrast to a quadratic-scaling Transformer ar5iv.labs.arxiv.org. For example, when generating text from very long inputs, MoM required far less GPU memory and time than a Transformer accelerated with FlashAttention, especially as context length grew ar5iv.labs.arxiv.org. This means MoM can handle longer sequences on fixed hardware, making it more scalable for practical use. Training convergence is also improved – MoM models learn faster and reach lower loss than baseline models of similar size ar5iv.labs.arxiv.org. Throughout training on a large corpus, MoM maintained the lowest loss curve, indicating that it learns more efficiently, likely because it suffers less from forgetting earlier context ar5iv.labs.arxiv.org. All of these results point to a single conclusion: MoM effectively addresses the memory interference issue that plagued linear models, delivering recall and accuracy much closer to full Transformers while preserving the speed and scalability advantages of linear complexity ar5iv.labs.arxiv.org.
Expert Perspectives and Future Directions
MoM’s debut has generated excitement in the AI research community, along with a healthy dose of scrutiny. Many see it as a major step toward making long-context models more practical. One review praised MoM as “a significant contribution to the field of machine learning and NLP,” noting that it leverages insights from neuroscience to achieve robust long-term memory without heavy computation themoonlight.io. By tackling the fundamental memory bottleneck, MoM could “pave the way for future research into memory management and effective sequence processing techniques,” potentially influencing how next-generation large language models handle context themoonlight.io. Because the code has been open-sourced arxiv.org and released via Hugging Face and a public GitHub repository huggingface.co github.com, researchers are already experimenting with MoM and building on it. Indeed, MoM is compatible with various linear modeling backbones (linear attention, SSMs, linear RNNs), so it can be dropped into many architectures as a memory upgrade github.com. This flexibility suggests we may soon see hybrid models that use MoM-style memory modules in other efficient Transformers, or even combine them with Mixture-of-Experts (MoE) for greater scalability (the authors’ group has explored a “Linear-MoE” approach alongside MoM) getaiverse.com alphaxiv.org.
At the same time, some experts have pointed out that MoM’s concept shares DNA with existing ideas. The use of a routing network to dispatch tokens to different states is reminiscent of MoE layers that route tokens to different expert networks – a technique proven to scale models effectively. What MoM adds is applying this idea internally to the sequence representation itself, grounded in a neuroscience analogy. There is a hint of skepticism in some quarters that MoM’s gains come essentially from adding more parameters or memory slots – as one industry observer joked, “the academic equivalent of saying ‘my algorithm uses more memory, therefore it’s better.’” scitechaccess.com However, the MoM paper directly addresses this: by testing a single large memory vs multiple smaller ones, they showed it’s not just more memory but how it’s organized that matters ar5iv.labs.arxiv.org. In practice, MoM achieved its best results with only a handful of memory slots active per token (e.g. top-$k=2$ and around 4 total memory modules) ar5iv.labs.arxiv.org. Adding too many separate memories can even hurt performance slightly ar5iv.labs.arxiv.org, implying there is an optimal sparse structure rather than “the more the merrier.” This insight aligns with the brain inspiration – a few distinct neural assemblies can encode items without interference, but you don’t need an extreme number of them for effective memory ar5iv.labs.arxiv.org.
Looking ahead, researchers are exploring refinements to MoM and its broader applications. One avenue is to improve the router network – making the token-to-memory assignment smarter or more adaptive could further boost performance getaiverse.com. The current router uses a learned linear projection and top-$k$ selection; future versions might use more sophisticated gating or even reinforcement learning to decide how to partition information. Another direction is scaling up MoM models. The published experiments were on models up to a few hundred million parameters ar5iv.labs.arxiv.org. It will be interesting to see MoM applied to billion-parameter language models or beyond, potentially yielding large models that maintain high recall over book-length inputs. Because MoM is architecture-agnostic on the backend (it can work with any linear recurrent layer), it could also be tried in domains beyond text – e.g. long time-series forecasting or video understanding – where retaining long-term history is crucial. As one AI blog noted, “the combination of linear complexity and high performance makes MoM an attractive approach for applications where both efficiency and accuracy are critical.” getaiverse.com In short, MoM opens up new possibilities to build models that “never forget” important context, without running into the scalability wall that vanilla Transformers face.
Conclusion: Mixture-of-Memories represents a compelling innovation in sequence modeling, effectively granting “long-term memory” to efficient Transformer alternatives. By eliminating the amnesia effect of linear attention through multiple memory states and intelligent token routing, MoM closes the performance gap to Transformers on long-range tasks ar5iv.labs.arxiv.org. It stands as a proof-of-concept that we can have the best of both worlds – near-Transformer performance with linear efficiency – by rethinking how and where models store information. As the research evolves and larger MoM-based models emerge, we may witness a new generation of AI systems that handle lengthy inputs and complex reasoning with ease, all while operating at practical speeds. In the quest to build ever more capable and context-aware AI, MoM’s “no forgetting” approach could be a game-changer themoonlight.io getaiverse.com.
Sources:
- Du, J. et al. (2025). “MoM: Linear Sequence Modeling with Mixture-of-Memories.” arXiv preprint. ar5iv.labs.arxiv.org
- Du, J. et al. (2025). “MoM: Linear Sequence Modeling with Mixture-of-Memories,” Introduction & Contributions. ar5iv.labs.arxiv.org
- Moonlight AI Review: “Mixture-of-Memories aimed at improving linear sequence modeling, especially on recall-intensive tasks.” (Feb 2025). themoonlight.io
- AIverse Blog: “More Efficient Sequence Modeling with Mixture-of-Memories (MoM).” (2025). getaiverse.com
- SciTech Access Commentary: “Alright, let’s dissect this MoM paper” (Feb 2025), a humorous analysis. scitechaccess.com
- Du, J. et al. (2025). MoM Experiments: Commonsense and Long-Context Results. ar5iv.labs.arxiv.org
- Du, J. et al. (2025). MoM Figure 1 & Ablation details. ar5iv.labs.arxiv.org